RAUNet: Residual Attention U-Net for Semantic Segmentation of Cataract Surgical Instruments
Semantic segmentation of surgical instruments plays a crucial role in robot-assisted surgery. However, accurate segmentation of cataract surgical instruments remains a challenge due to specular reflection and class imbalance. In this paper, an attention-guided network is proposed to segment cataract surgical instruments. A new attention module is designed to learn discriminative features and address the specular reflection issue. It captures global context and encodes semantic dependencies to emphasize key semantic features, boosting the feature representation. This attention module has very few parameters, which saves memory and allows it to be flexibly plugged into other networks. In addition, a hybrid loss that merges cross entropy and the logarithm of the Dice loss is introduced to train our network and address the class imbalance issue. A new dataset named Cata7 is constructed to evaluate our network. To the best of our knowledge, this is the first cataract surgical instrument dataset for semantic segmentation. On this dataset, RAUNet achieves state-of-the-art performance, with 97.71 mean Dice and 95.62 mean IOU.
Keywords: Attention, Semantic Segmentation, Cataract, Surgical Instrument
In recent years, semantic segmentation of surgical instruments has gained increasing popularity due to its promising applications in robot-assisted surgery. One of the crucial applications is the localization and pose estimation of surgical instruments, which contributes to surgical robot control. Potential applications of segmenting surgical instruments include objective surgical skills assessment, surgical workflow optimization, report generation, etc.  These applications can reduce the workload of doctors and improve the safety of surgery.
Cataract surgery is the most common ophthalmic surgery in the world, performed approximately 19 million times a year. It is highly demanding for doctors, and computer-assisted surgery can significantly reduce the probability of accidental operation. However, most research on surgical instrument segmentation focuses on endoscopic surgery, and there are few studies on cataract surgery. To the best of our knowledge, this is the first study to segment and classify cataract surgical instruments.
Recently, a series of methods have been proposed to segment surgical instruments. Luis et al. presented a network based on Fully Convolutional Networks (FCN) and optical flow to handle problems such as occlusion and deformation of surgical instruments. RASNet adopted an attention module to emphasize the target region and improve the feature representation. Iro et al. proposed a novel U-shaped network that provides segmentation and pose estimation of instruments simultaneously. Mohamed et al. combined a recurrent network with a convolutional network to improve segmentation accuracy. These works show that convolutional neural networks achieve excellent performance in the segmentation of surgical instruments. However, all of these methods target endoscopic surgery, and semantic segmentation of surgical instruments in cataract surgery differs considerably from that in endoscopic surgery.
Semantic segmentation of cataract surgical instruments faces many challenges. Unlike endoscopic surgery, cataract surgery requires strong lighting, which leads to serious specular reflection. Specular reflection changes the visual characteristics of surgical instruments. In addition, cataract surgical instruments are small because they are designed for micromanipulation, so they commonly occupy only a small region of the image. The number of background pixels is much larger than the number of foreground pixels, which causes a serious class imbalance. As a result, the surgical instrument is more likely to be misidentified as background. Occlusion caused by eye tissue and the limited view of the camera are also important issues, leaving part of the surgical instrument invisible. Together, these issues make it difficult to identify and segment the surgical instrument.
To address these issues, a novel network, Residual Attention U-Net (RAUNet), is proposed. It introduces an attention module to improve feature representation. The contributions of this work are as follows.
An innovative attention module called the augmented attention module (AAM) is designed to efficiently fuse multi-level features and improve feature representation, which helps address the specular reflection issue. It also has very few parameters, which helps to save memory.
A hybrid loss is introduced to address the class imbalance issue. It merges cross entropy and the logarithm of the Dice loss to take advantage of the merits of both.
To evaluate the proposed network, we construct a cataract surgery instrument dataset named Cata7. As far as we know, this is the first cataract surgery instrument dataset that can be used for semantic segmentation.
2 Residual Attention U-Net
High-resolution images provide more detailed location information, helping doctors perform accurate operations. Thus, Residual Attention U-Net (RAUNet) adopts an encoder-decoder architecture to obtain high-resolution masks. The architecture of RAUNet is illustrated in Fig. 1. ResNet34 pre-trained on ImageNet is used as the encoder to extract semantic features, which helps reduce the model size and improve inference speed. In the decoder, a new attention module, the augmented attention module (AAM), is designed to fuse multi-level features and capture global context. Furthermore, transposed convolution is used for upsampling to obtain refined edges.
2.2 Augmented Attention Module
The decoder recovers the position details by upsampling. However, upsampling blurs edges and loses location details. Some existing work adopts skip connections to concatenate the low-level features with the high-level features, which helps replenish the position details. But this is a naive method. Because low-level features lack semantic information, they contain a lot of useless background information, which may interfere with the segmentation of the target object. To address this problem, the augmented attention module is designed to capture high-level semantic information and emphasize target features.
Each channel corresponds to a specific semantic response, and surgical instruments and human tissues are often associated with different channels. Thus, the augmented attention module models the semantic dependencies to emphasize target channels. It captures the semantic information in the high-level feature maps and the global context in the low-level feature maps to encode semantic dependencies. High-level feature maps contain rich semantic information that can guide low-level feature maps in selecting important location details. Furthermore, the global context of low-level feature maps encodes the semantic relationship between different channels, helping to filter out interfering information. By using this information efficiently, the augmented attention module can emphasize the target region and improve the feature representation. The augmented attention module is illustrated in Fig. 2.
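Global average pooling is performed to extract global context and semantic information, as described in Eq. (3). It squeezes global information into an attentive vector that encodes the semantic dependencies, which helps emphasize key features and filter out background information. The attentive vector is generated as follows:

$$F_a(x, y) = \delta_1\left[W_\alpha\, g(x) + b_\alpha\right] + \delta_1\left[W_\beta\, g(y) + b_\beta\right] \qquad (1)$$

$$A = \delta_2\left[W_\varphi\, F_a(x, y) + b_\varphi\right] \qquad (2)$$

where $x$ and $y$ refer to the high-level and low-level feature maps, respectively. $g$ denotes the global average pooling. $\delta_1$ denotes the ReLU function and $\delta_2$ denotes the Softmax function. $W_\alpha$, $W_\beta$ and $W_\varphi$ refer to the parameters of the 1×1 convolutions, and $b_\alpha$, $b_\beta$ and $b_\varphi$ refer to the biases.

$$g(x_k) = \frac{1}{M \times N} \sum_{i=1}^{M} \sum_{j=1}^{N} x_k(i, j) \qquad (3)$$

where $k = 1, 2, \ldots, c$ and $x = [x_1, x_2, \ldots, x_c]$, with $M \times N$ the spatial size of each channel.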
Then a 1×1 convolution with batch normalization is performed on the vector to further capture semantic dependencies. The softmax function is adopted as the activation function to normalize the vector. The low-level feature maps are multiplied by the attentive vector to generate an attentive feature map. Finally, the attentive feature map is calibrated by adding it to the high-level feature map. Compared with concatenation, addition reduces the number of convolution parameters in the following layers, which lowers the computational cost. Also, since it only uses global average pooling and 1×1 convolutions, this module adds few parameters. Because the global average pooling squeezes global information into a vector, the computational cost is reduced further.
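The data flow of the module can be sketched in NumPy as below. This is a minimal illustration under stated assumptions, not the authors' implementation: both feature maps are assumed to share the same shape, each pooled vector is assumed to pass through its own 1×1 convolution and ReLU before the two are summed, batch normalization is omitted, and the parameter names (`w_high`, `w_low`, `w_out` and the matching biases) are hypothetical.

```python
import numpy as np

def global_average_pool(feature_map):
    """Squeeze a (channels, height, width) map into a (channels,) vector."""
    return feature_map.mean(axis=(1, 2))

def softmax(v):
    """Numerically stable softmax over a 1-D vector."""
    e = np.exp(v - v.max())
    return e / e.sum()

def augmented_attention(high, low, params):
    """Fuse a high-level and a low-level feature map of identical shape (C, H, W).

    params holds hypothetical 1x1-convolution weights (C x C matrices) and
    biases (length-C vectors). Batch normalization is left out for brevity.
    """
    relu = lambda v: np.maximum(v, 0.0)
    # A 1x1 convolution applied to a pooled vector is a matrix-vector product.
    high_ctx = relu(params["w_high"] @ global_average_pool(high) + params["b_high"])
    low_ctx = relu(params["w_low"] @ global_average_pool(low) + params["b_low"])
    # Attentive vector: one weight per channel, normalized by softmax.
    attention = softmax(params["w_out"] @ (high_ctx + low_ctx) + params["b_out"])
    # Reweight the low-level channels, then calibrate by adding the high-level map.
    attentive_low = low * attention[:, None, None]
    return attentive_low + high, attention

rng = np.random.default_rng(0)
C, H, W = 4, 8, 8
high = rng.standard_normal((C, H, W))
low = rng.standard_normal((C, H, W))
params = {
    "w_high": rng.standard_normal((C, C)), "b_high": np.zeros(C),
    "w_low": rng.standard_normal((C, C)), "b_low": np.zeros(C),
    "w_out": rng.standard_normal((C, C)), "b_out": np.zeros(C),
}
fused, attention = augmented_attention(high, low, params)
print(fused.shape, float(attention.sum()))
```

Because the two inputs are combined by addition rather than concatenation, the fused map keeps the channel count of its inputs, which is why the layers that follow need no extra convolution parameters.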
2.3 Loss Function
Semantic segmentation of surgical instruments can be regarded as classifying each pixel. Therefore, the cross entropy loss, the most commonly used loss function for classification, can be applied to the pixels. It is denoted as $H$ in Eq. (4).

$$H = -\frac{1}{w \times h} \sum_{i=1}^{w} \sum_{j=1}^{h} \sum_{k=1}^{n} t_{ijk} \log p_{ijk} \qquad (4)$$

where $w$ and $h$ represent the width and the height of the predictions, $n$ is the number of classes, $t_{ijk}$ is the ground truth of a pixel and $p_{ijk}$ is the prediction of a pixel.
The surgical instrument commonly occupies only a small area of the image, which leads to a serious class imbalance. The performance of cross entropy is greatly affected by this imbalance: the prediction is inclined to recognize pixels as background, so the surgical instrument may be only partially detected or ignored altogether. The Dice loss defined in Eq. (5) can be used to solve this problem. It evaluates the similarity between the prediction and the ground truth, and is not affected by the ratio of foreground pixels to background pixels.

$$D = \frac{2 \sum_{i=1}^{w} \sum_{j=1}^{h} p_{ij}\, t_{ij}}{\sum_{i=1}^{w} \sum_{j=1}^{h} p_{ij} + \sum_{i=1}^{w} \sum_{j=1}^{h} t_{ij}} \qquad (5)$$

where $w$ and $h$ represent the width and the height of the predictions, $p_{ij}$ represents the prediction and $t_{ij}$ represents the ground truth.
To make effective use of the strengths of these two losses, we merge the Dice loss with the cross entropy as follows:

$$L = (1 - \alpha) H - \alpha \log(D) \qquad (6)$$

where $\alpha$ is a weight used to balance the cross entropy loss and the Dice loss. $D$ lies between 0 and 1, and $\log(D)$ extends this range to values between 0 and negative infinity. When the prediction differs greatly from the ground truth, $D$ is small and $\log(D)$ approaches negative infinity, so the loss increases sharply to penalize the poor prediction. This approach not only retains the characteristics of the Dice loss but also improves the sensitivity of the loss.
This loss is named Cross Entropy Log Dice (CEL-Dice). It combines the stability of cross entropy with the insensitivity of the Dice loss to class imbalance. Therefore, it handles class imbalance better than cross entropy alone and is more stable than the Dice loss alone.
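The effect of the logarithmic Dice term can be illustrated with a small NumPy sketch. It assumes a binary foreground/background setting, the weighting $(1-\alpha)H - \alpha\log(D)$ and a small smoothing constant `EPS`; the function names and the toy frame are illustrative choices, not part of the original method description.

```python
import numpy as np

EPS = 1e-7  # small constant guarding against log(0) and division by zero

def cross_entropy(pred, truth):
    """Mean binary cross entropy between foreground probabilities and a 0/1 mask."""
    pred = np.clip(pred, EPS, 1.0 - EPS)
    return float(-(truth * np.log(pred) + (1.0 - truth) * np.log(1.0 - pred)).mean())

def dice(pred, truth):
    """Dice coefficient D in (0, 1]: twice the overlap divided by the total mass."""
    return float((2.0 * (pred * truth).sum() + EPS) / (pred.sum() + truth.sum() + EPS))

def cel_dice(pred, truth, alpha=0.2):
    """Hybrid loss: (1 - alpha) * cross entropy - alpha * log(Dice)."""
    return (1.0 - alpha) * cross_entropy(pred, truth) - alpha * float(np.log(dice(pred, truth)))

# A 64x64 frame where the instrument covers only 16 pixels (about 0.4%).
mask = np.zeros((64, 64))
mask[30:34, 30:34] = 1.0
# A prediction that misses the instrument and calls everything background.
missed = np.full((64, 64), 0.01)

print("cross entropy:", cross_entropy(missed, mask))
print("CEL-Dice:     ", cel_dice(missed, mask))
print("perfect match:", cel_dice(mask, mask))
```

In this toy case the cross entropy of the all-background prediction stays small because background pixels dominate the average, while the log-Dice term makes the hybrid loss much larger, which is the behavior the text attributes to CEL-Dice.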
| No. | Instrument | Total | Training | Test |
|---|---|---|---|---|
| 1 | Primary Incision Knife | 62 | 39 | 23 |
| 2 | Secondary Incision Knife | 226 | 197 | 29 |
| - | Number of Frames | 2500 | 1800 | 700 |
A new dataset, Cata7, is constructed to evaluate our network. It is the first cataract surgical instrument dataset for semantic segmentation. The dataset consists of seven videos, each of which records a complete cataract surgery. All videos are from Beijing Tongren Hospital. Each video is split into a sequence of images with a resolution of 1920×1080 pixels. To reduce redundancy, the videos are downsampled from 30 fps to 1 fps. Also, images without surgical instruments are manually removed. Each image is labeled with the precise edges and types of the surgical instruments.
This dataset contains 2,500 images, which are divided into training and test sets. The training set consists of five video sequences and the test set consists of two video sequences. The number of surgical instruments in each category is listed in Table 1. Ten surgical instruments are used in the surgery, and they are shown in Fig. 3.
ResNet34 pre-trained on ImageNet is utilized as the encoder. Pre-training can accelerate network convergence and improve network performance. Due to limited computing resources, each training image is resized to 960×544 pixels. The network is trained using Adam with a batch size of 8. The learning rate is dynamically adjusted during training to prevent overfitting. The initial learning rate is . For every 30 iterations, the learning rate is multiplied by 0.8. The balancing weight $\alpha$ in CEL-Dice is set to 0.2 after several experiments. The Dice coefficient and Intersection-Over-Union (IOU) are selected as the evaluation metrics.
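The step decay described above can be written as a one-line schedule. The sketch below is only an illustration: the function name is hypothetical, and `initial_lr=1.0` is a placeholder because the initial value is not recoverable from the text.

```python
def learning_rate(iteration, initial_lr, decay=0.8, step=30):
    """Step decay: multiply the rate by `decay` once every `step` iterations."""
    return initial_lr * decay ** (iteration // step)

# initial_lr is a placeholder value, not the one used in the experiments.
for it in (0, 29, 30, 60, 90):
    print(it, learning_rate(it, initial_lr=1.0))
```

With this schedule the rate stays constant within each block of 30 iterations and shrinks geometrically across blocks, so after 90 iterations it has fallen to roughly half of its initial value.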
Data augmentation is performed to prevent overfitting. The augmented samples are generated by random rotation, shifting and flipping. In total, 800 images are obtained by data augmentation, which increases feature diversity and helps prevent overfitting. Batch normalization is used for regularization. In the decoder, batch normalization is performed after each convolution.
3.3.1 Ablation for augmented attention module
The augmented attention module (AAM) is designed to aggregate multi-level features. It captures global context and semantic dependencies to emphasize key features and suppress background features. To verify its performance, we set up a series of experiments. The results are shown in Table 2.
RAUNet without AAM is used as the base network, which achieves 95.12 mean Dice and 91.31 mean IOU. The base network with AAM achieves 97.71 mean Dice and 95.62 mean IOU. By applying AAM, mean Dice increases by 2.59 and mean IOU increases by 4.31. Furthermore, AAM is compared with GAU. The base network with GAU achieves 96.61 mean Dice and 93.76 mean IOU. Compared to the base network with AAM, its mean Dice and mean IOU are lower by 1.10% and 1.86%, respectively. Besides, by applying AAM, parameters only increase by 0.73M, which is 0.87 of the base network. By applying GAU, parameters increase by 3.02M, which is 4.14 times the increase caused by AAM. These results show that AAM significantly increases the segmentation accuracy while adding few parameters.
To give an intuitive comparison, the segmentation results of the base network and RAUNet are visualized in Fig. 4(a). The red line marks the contrasted region. It can be found that there are classification errors in the results of the base network. Besides, surgical instruments are not entirely segmented in the third image. Meanwhile, RAUNet can accurately segment surgical instruments by applying AAM. The masks achieved by RAUNet are the same as the ground truth. This shows that AAM contributes to capturing high-level semantic features and improving feature representation.
3.3.2 Comparison with state-of-the-art
To further verify the performance of RAUNet, it is compared with U-Net, TernausNet and LinkNet. As shown in Table 3, RAUNet achieves state-of-the-art performance, with 97.71 mean Dice and 95.62 mean IOU, outperforming the other methods. U-Net achieves 94.99 mean Dice and 91.11 mean IOU. TernausNet and LinkNet achieve 92.98 and 92.21 mean IOU, respectively. The performance of these methods is much poorer than that of RAUNet.
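For reference, the two metrics used in this comparison can be computed per class from label maps as sketched below. This is an illustrative NumPy version; how the reported means are averaged (per image, per class, or both) is not specified here, so the function covers a single class on a single image, and its name is hypothetical.

```python
import numpy as np

def dice_and_iou(pred, truth, class_id):
    """Return (Dice, IOU) for one class, or None if the class is absent from both maps."""
    p = pred == class_id
    t = truth == class_id
    intersection = int(np.logical_and(p, t).sum())
    total = int(p.sum() + t.sum())
    if total == 0:
        return None  # the class appears in neither map, so the metrics are undefined
    union = total - intersection
    return 2.0 * intersection / total, intersection / union

# Toy 2x4 label maps: class 1 covers four pixels in each, two of them shared.
truth = np.array([[0, 1, 1, 0],
                  [0, 1, 1, 0]])
pred = np.array([[0, 0, 1, 1],
                 [0, 0, 1, 1]])
d, iou = dice_and_iou(pred, truth, class_id=1)
print(d, iou)
```

For a single class the two metrics are linked by Dice = 2·IOU / (1 + IOU), so Dice is never lower than IOU, which matches the ordering of the values reported for every method.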
Pixel accuracy achieved by the various methods is visualized in Fig. 5. The primary incision knife is often misclassified by U-Net, TernausNet, and LinkNet. Since the primary incision knife is used only briefly in surgery, it has few samples, which leads to underfitting of the network. The lens hook is also often misclassified by U-Net. This is because the lens hook is very thin, which causes severe class imbalance, and because it resembles other surgical instruments. U-Net cannot capture high-level semantic information, which causes the misclassification. Despite these difficulties, our method still achieves high pixel accuracy. The pixel accuracies of the lens hook and the primary incision knife are 90.23 and 100, respectively. These results show that RAUNet can capture discriminative semantic features and address the class imbalance issue.
3.3.3 Verify the performance of CEL-Dice
CEL-Dice is utilized to address the class imbalance issue. It combines the stability of cross entropy with the insensitivity of the Dice loss to class imbalance. To verify its performance, it is compared with cross entropy and the Dice loss. The mean Dice and mean IOU achieved by the network on the test set are shown in Fig. 7 as a function of the training epoch. It can be seen that CEL-Dice significantly improves segmentation accuracy and outperforms both the Dice loss and cross entropy.
A novel network called RAUNet is proposed for semantic segmentation of surgical instruments. The augmented attention module is designed to emphasize key regions. Experimental results show that the augmented attention module significantly improves segmentation accuracy while adding very few parameters. Also, a hybrid loss called Cross Entropy Log Dice is introduced, which helps address the class imbalance issue. Experiments show that RAUNet achieves state-of-the-art performance on the Cata7 dataset.
This research is supported by the National Natural Science Foundation of China (Grants 61533016, U1713220, U1613210), the National Key Research and Development Program of China (Grant 2017YFB1302704) and the Strategic Priority Research Program of CAS (Grant XDBS01040100).
-  Sarikaya, D., Corso, J.J., Guru, K.A.: Detection and localization of robotic tools in robot-assisted surgery videos using deep neural networks for region proposal and detection. IEEE Transactions on Medical Imaging 36(7), 1542–1549 (July 2017)
-  Trikha, S., Turnbull, A., Morris, R., Anderson, D., Hossain, P.: The journey to femtosecond laser-assisted cataract surgery: new beginnings or a false dawn? Eye 27(4), 461 (2013)
-  García-Peraza-Herrera, L.C., Li, W., Gruijthuijsen, C., Devreker, A., Attilakos, G., Deprest, J., Vander Poorten, E., Stoyanov, D., Vercauteren, T., Ourselin, S.: Real-time segmentation of non-rigid surgical tools based on deep learning and tracking. In: International Workshop on Computer-Assisted and Robotic Endoscopy. pp. 84–95. Springer (2016)
-  Ni, Z.L., Bian, G.B., Xie, X.L., Hou, Z.G., Zhou, X.H., Zhou, Y.J.: RASNet: Segmentation for tracking surgical instruments in surgical videos using refined attention segmentation network. arXiv preprint arXiv:1905.08663 (2019)
-  Laina, I., Rieke, N., Rupprecht, C., Vizcaíno, J.P., Eslami, A., Tombari, F., Navab, N.: Concurrent segmentation and localization for tracking of surgical instruments. In: International Conference on Medical Image Computing and Computer-assisted Intervention. pp. 664–672. Springer (2017)
-  Attia, M., Hossny, M., Nahavandi, S., Asadi, H.: Surgical tool segmentation using a hybrid deep CNN-RNN auto encoder-decoder. In: 2017 IEEE International Conference on Systems, Man, and Cybernetics (SMC). pp. 3373–3378. IEEE (2017)
-  He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 770–778 (June 2016)
-  Ronneberger, O., Fischer, P., Brox, T.: U-Net: Convolutional networks for biomedical image segmentation. In: International Conference on Medical image Computing and Computer-assisted Intervention. pp. 234–241. Springer (2015)
-  Milletari, F., Navab, N., Ahmadi, S.: V-net: Fully convolutional neural networks for volumetric medical image segmentation. In: 2016 Fourth International Conference on 3D Vision (3DV). pp. 565–571 (Oct 2016)
-  Iglovikov, V., Shvets, A.: Ternausnet: U-Net with vgg11 encoder pre-trained on imagenet for image segmentation. arXiv preprint arXiv:1801.05746 (2018)
-  Li, H., Xiong, P., An, J., Wang, L.: Pyramid attention network for semantic segmentation. arXiv preprint arXiv:1805.10180 (2018)
-  Chaurasia, A., Culurciello, E.: Linknet: Exploiting encoder representations for efficient semantic segmentation. In: 2017 IEEE Visual Communications and Image Processing (VCIP). pp. 1–4. IEEE (2017)