Multitask Learning of Temporal Connectionism in Convolutional Networks using a Joint Distribution Loss Function to Simultaneously Identify Tools and Phase in Surgical Videos
Surgical workflow analysis is important for understanding the onset and persistence of surgical phases and individual tool usage across a surgery and within each phase. It benefits clinical quality control and helps hospital administrators understand surgery planning. Video acquired during surgery can typically be leveraged for this task. Currently, a combination of a convolutional neural network (CNN) and a recurrent neural network (RNN) is popularly used for video analysis in general, not only for surgical videos. In this paper, we propose a multi-task learning framework using a CNN followed by a bi-directional long short term memory (Bi-LSTM) network to encapsulate both forward and backward temporal dependencies. Further, the joint distribution indicating the set of tools associated with a phase is used as an additional loss during learning to correct for their co-occurrence in any prediction. Experimental evaluation is performed using the Cholec80 dataset. We report mean average precision (mAP) scores for tool and phase identification that are higher than prior art in the field.
1 Introduction
Surgical workflow analysis using videos acquired from an endoscope assists surgeons and hospital administrators in assessing the quality and progress of surgery, and supports medico-legal litigation. Useful information includes tool usage during surgery along with its phase, report generation, the duration of surgery, and the time to completion of surgery. Summarizing this information also makes it easy to find aberrations in the usage pattern of a particular tool during a surgery by comparing it with the reports of past procedures. This paper presents a multi-task deep learning framework which simultaneously infers both tool and phase information in video frames. (Accepted paper at the MedImage workshop of the Indian Conference on Computer Vision, Graphics and Image Processing, 2018.) The summary of the method is presented in Fig. 1.
Challenges: In surgical videos the tools often appear occluded behind anatomical structures, which makes tool detection difficult. The endoscope used to acquire the video also suffers from motion jitter, leading to variation in the scene background and in the degree of illumination. Specular reflection is another related artifact. This makes the analysis challenging on account of the large-scale visual appearance modeling that has to be performed. On a related note, it also requires the training data to consist of a large number of frames annotated with both tool and phase information and covering such wide-scale visual variations, which is tedious and therefore challenging to collect.
Approach: In the proposed framework a convolutional neural network (CNN) is trained simultaneously for both tool and phase detection, where it learns to extract high level visual features from the frames. The CNN is trained with an additional weighted joint distribution loss function which captures the joint probability of co-occurrence of a particular set of tools generally associated with a phase of the surgery. The visual features extracted from the trained CNN are used to train a bi-directional long short term memory network (Bi-LSTM) to capture both the forward and backward temporal information across video frames. This temporal connectionism is important since the phases of a surgery are executed sequentially and there is an order in which a set of tools is used per phase.
Impact: On account of the weighted joint probabilistic loss function used in the multi-task training of the CNN, and the bidirectional temporal learning with the LSTM, the mean average precision for tools is higher than in all previous works in this domain, whereas for phase detection the mean average precision is comparable to the state of the art. The significant advancement over known prior art in the field is the ability to use a single network to solve both phase and tool detection simultaneously with the highest performance, achieved through its auto-correcting learning ability using joint distribution modeling.
Organization of the paper: The earlier works on surgical tool and phase detection are briefly described in Sec. 2. The problem statement is presented in Sec. 3. The methodology is explained in Sec. 4. The experiments are detailed with the results in Sec. 5. Sec. 6 presents the discussion. The conclusion is presented in Sec. 7.
2 Prior Work
Various types of video analysis solutions have been proposed through the years. The use of 3D CNNs, the combination of optical flow information with 2D images, and the use of an RNN/LSTM along with a CNN to model long term dependencies are some of the widely known techniques. Many variants of these approaches have been used in both surgical phase and tool detection.
A CNN has been trained to sort surgical video frames in order to learn the temporal context between frames, and combined with gated recurrent units (GRU) for surgical phase detection. Later, a multi-task CNN framework was proposed for both tool and phase detection, with the features extracted from it passed to a hierarchical hidden Markov model (HHMM) for final phase detection. Another work used a CNN for phase classification in cataract surgery and improved accuracy by dataset purification and balancing. A further approach constructed a surgical process model, extracted various descriptors from images, and classified them using an AdaBoost classifier; the temporal aspect was then exploited using a hidden semi-Markov model. An evolutionary search in the space of global image features, using a genetic programming based approach, has been proposed for phase detection in cholecystectomy videos. Another approach processes information about tool usage from non-visual electromagnetic tracking sensors and an endoscopic camera for phase detection in laparoscopic surgeries using a left-right hidden Markov model (HMM). A framework has also been proposed to automatically detect surgical phases from microscope videos. It first manually defines visual cues that help discriminate the high-level tasks. The visual cues are automatically detected by image based classifiers, and the obtained time series are then aligned with a reference surgery using the dynamic time warping (DTW) algorithm for phase detection. Subsequently, a spatio-temporal CNN that also encodes tool and temporal information was used to extract visual features from surgical frames, followed by a classifier built using DTW. In a prior work, tool presence in surgical video frames was detected by extracting visual features from a CNN and feeding them to an LSTM to learn the temporal connectionism.
In similarly styled work, a tool detection system for minimally invasive surgery was proposed based on a multiclass ensemble classifier built using gradient boosted regression trees. Subsequently, a CNN-only approach was used for tool detection in each frame without considering the temporal information across video frames. An automatic method has been proposed for detecting instruments in endoscopic images by segmenting the tip of the instrument and then recognizing it based on three-dimensional instrument models. Earlier works used image processing techniques such as k-means clustering and Kalman filtering for localization and tracking of tools in surgical videos. Another work combined features extracted from pretrained and fine-tuned ImageNet models to create contextual features for tool detection, and later proposed label set sampling to reduce the bias. Optical flow information between surgical images has also been used to exploit spatial redundancies between consecutive images. Subsequently, a CNN along with an RNN was proposed, followed by boosting of both networks and finally smoothing of the predictions for surgical tool detection.
All of the methods described above for tool and phase detection use a CNN, a CNN + RNN framework, or statistical methods, but none of them captures the joint probability distribution between the tools associated with a given phase while building a multitask learning framework. Temporal information is captured in most of the works, but they only consider the effect of past frames in determining the present tool or phase. For more accurate prediction at the current frame it is equally important to look into the future as into the past, and to consider a multitask framework in the temporal domain.
3 Problem Statement
Each video frame $I$ carries information about a particular phase of surgery and the tools in use, which vary from a minimum of no tool to a maximum of three tools. In a surgical video dataset, the ground truth for phase annotation in a frame is represented as a one-hot tensor $\mathbf{p}$ of size $N_p$, where $N_p$ is the number of surgical phases. The surgical tools ground truth is represented as a multi-hot tensor $\mathbf{t}$ of size $N_t$, where $N_t$ is the number of surgical tools. The prediction problem is modelled as $\{\hat{\mathbf{p}}, \hat{\mathbf{t}}\} = f(I)$, where $\hat{\mathbf{p}}$ and $\hat{\mathbf{t}}$ are the phase and tool prediction tensors obtained from the trained multitask network $f(\cdot)$ which processes $I$. For tools, none, one, or more than one tool index can be one in a given frame, so the detection of tools from a given video frame is a multilabel multiclass classification problem in which we have to predict a subset of tools out of the total set of tools.
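The label encoding described above can be illustrated with a minimal sketch. The helper names `one_hot` and `multi_hot` are hypothetical, and the tensor lengths (7 phases, 7 tools, with the no-tool case encoded as an all-zero vector) are assumptions made for illustration rather than the authors' exact layout.

```python
# Hypothetical helpers for illustration; not the authors' code.
NUM_PHASES = 7  # Cholec80 has seven surgical phases
NUM_TOOLS = 7   # assumption: one slot per tool, no-tool is the all-zero vector


def one_hot(index, size):
    """Return a list of zeros with a single one at `index` (phase label)."""
    vec = [0] * size
    vec[index] = 1
    return vec


def multi_hot(indices, size):
    """Return a list with ones at every position in `indices` (tool labels)."""
    vec = [0] * size
    for i in indices:
        vec[i] = 1
    return vec


# A frame in the second phase (index 1) with the first and third tools visible.
phase_target = one_hot(1, NUM_PHASES)
tool_target = multi_hot([0, 2], NUM_TOOLS)
# A frame in which no tool is visible.
no_tool_target = multi_hot([], NUM_TOOLS)
```

Exactly one entry of the phase target is set, whereas the tool target may have zero to several entries set, which is what makes tool detection a multi-label problem.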
4 Exposition to the Solution
We propose a multitask learning framework using CNN+LSTM to jointly solve both tool and phase detection, learning with a weighted joint probability based loss function that models the dependence of tool and phase occurrence in a given frame. First, we train only a CNN in the multi-task setting. Second, we use the features from the penultimate fully-connected layer of the trained CNN to construct a bidirectional LSTM (Bi-LSTM), which is also trained in a multi-task framework. The full training pipeline is shown in Fig. 2. These stages are detailed below.
4.1 Multitask learning of a CNN for phase and tool detection
Since the amount of tool annotated data is small, training a deep CNN architecture from scratch has been observed to converge slowly and with difficulty. To speed up the training process we use a CNN pretrained on ImageNet for the Large Scale Visual Recognition Challenge (ILSVRC). ResNet-50 is used as a feature extractor and is fine-tuned on the task specific dataset after replacing the output layer. The input to the ResNet-50 is a video frame, and the features are obtained from the last but one fully connected layer. The output layer of ResNet-50 is replaced to accommodate both tool and phase classification, with output tensors matching the properties of the phase and tool ground truth tensors.
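Three different loss functions are used during training. For phase detection, the weighted cross entropy loss is used:

$$\mathcal{L}_{p} = -\sum_{k=1}^{N_p} w^{(p)}_k \, p_k \log\left(\frac{\exp(\hat{p}_k)}{\sum_{j=1}^{N_p}\exp(\hat{p}_j)}\right) \qquad (1)$$

where $p_k$ and $\hat{p}_k$ are the ground truth and the network output for the $k$-th phase with $k \in \{1, \dots, N_p\}$, and $w^{(p)}_k$ is the weight associated with the $k$-th phase out of the $N_p$ classes. The weight is obtained by median frequency balancing to compensate for the high class imbalance in the training data.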
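As a rough illustration of the phase loss, the sketch below computes class weights and a weighted cross entropy for a single frame. It assumes a common formulation of median frequency balancing (median class frequency divided by each class frequency); the source does not spell out the exact variant, and the function names are hypothetical.

```python
import math
from statistics import median


def median_frequency_weights(labels, num_classes):
    """Assumed formulation: weight_c = median(freq) / freq_c over observed classes."""
    counts = [0] * num_classes
    for y in labels:
        counts[y] += 1
    total = len(labels)
    freqs = [c / total for c in counts]
    med = median(f for f in freqs if f > 0)
    # Classes never observed get weight 0 so they do not divide by zero.
    return [med / f if f > 0 else 0.0 for f in freqs]


def weighted_cross_entropy(logits, target_index, weights):
    """Weighted cross entropy for one frame, using a stable log-softmax."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    log_prob = logits[target_index] - log_sum_exp
    return -weights[target_index] * log_prob


# Toy, imbalanced label set: class 0 dominates, class 2 is rare.
toy_labels = [0] * 6 + [1] * 3 + [2]
toy_weights = median_frequency_weights(toy_labels, num_classes=3)
toy_loss = weighted_cross_entropy([0.0, 0.0, 0.0], target_index=2, weights=toy_weights)
```

With this weighting, a mistake on the rare class contributes more to the loss than the same mistake on the dominant class.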
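For tool detection, a weighted multi-label soft margin loss is used:

$$\mathcal{L}_{t} = -\frac{1}{N_t}\sum_{k=1}^{N_t} w^{(t)}_k \left[ t_k \log \sigma(\hat{t}_k) + (1 - t_k) \log\left(1 - \sigma(\hat{t}_k)\right) \right] \qquad (2)$$

where $\hat{t}_k$ is the prediction of the $k$-th tool in the frame, $t_k$ is the ground truth annotation for the tool presence with $k \in \{1, \dots, N_t\}$, $\sigma(\cdot)$ is the sigmoid function, and $w^{(t)}_k$ is the tool class weight obtained by median frequency balancing to compensate for the high class imbalance in the training data.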
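The tool loss can be sketched in plain Python as follows. This assumes the standard definition of the multi-label soft margin loss (a per-class binary log-loss on sigmoid outputs, averaged over classes) with a per-class weight; the function names are illustrative.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def weighted_multilabel_soft_margin(logits, targets, weights):
    """Average over tool classes of the weighted binary log-loss on sigmoid outputs."""
    total = 0.0
    for x, y, w in zip(logits, targets, weights):
        # log(1 - sigmoid(x)) equals log_sigmoid(-x)
        total += w * (y * log_sigmoid(x) + (1 - y) * log_sigmoid(-x))
    return -total / len(logits)


# Two tool classes, first present and second absent, uniform weights.
confident = weighted_multilabel_soft_margin([5.0, -5.0], [1, 0], [1.0, 1.0])
mistaken = weighted_multilabel_soft_margin([-5.0, 5.0], [1, 0], [1.0, 1.0])
```

Because each tool is scored independently, any subset of tools (including none) can be predicted for a frame.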
The third component of the loss, (3), takes into consideration the joint distribution of tool and phase occurrence. It is computed from the phase and tool predictions, where $\sigma(\cdot)$ represents the sigmoid non-linearity, together with $\mathrm{IF}(i,j)$, which denotes the inverse of the frequency of occurrence of tool $j$ with phase $i$. Using the information present in the annotated training data, we create a phase-tool co-occurrence matrix $C$, where $C_{ij}$ is the number of frames over all videos in which tool $j$ was being used in phase $i$ of surgery. Subsequently we form a normalized matrix $\bar{C}$. This is used to create the IF function, defined as $\mathrm{IF}(i,j) = 1 / (\bar{C}_{ij} + \epsilon)$, where $\epsilon$ is the smallest value represented in the number system being used. This function is characterized such that if the frequency of phase-tool co-occurrence turns out to be zero, IF takes a very large value and induces a very high loss in that case.
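The construction of the co-occurrence counts and the inverse-frequency table can be sketched as below. Two points are assumptions, since the source does not preserve them: the counts are normalized by the total number of phase-tool co-occurrences, and the smallest representable value is taken to be `sys.float_info.min`. Function names are hypothetical, and the sketch does not attempt to reproduce the joint loss itself.

```python
import sys


def cooccurrence_counts(frames, num_phases, num_tools):
    """Count frames in which each tool is used during each phase.

    `frames` is an iterable of (phase_index, [tool_indices]) pairs.
    """
    counts = [[0] * num_tools for _ in range(num_phases)]
    for phase, tools in frames:
        for tool in tools:
            counts[phase][tool] += 1
    return counts


def inverse_frequency(counts, eps=sys.float_info.min):
    """Assumed form: IF = 1 / (normalized count + eps), normalized by the grand total."""
    total = sum(sum(row) for row in counts)
    if total == 0:
        raise ValueError("no phase-tool co-occurrences to normalize")
    return [[1.0 / (c / total + eps) for c in row] for row in counts]


# Toy annotations: two phases, three tools.
toy_frames = [(0, [0, 1]), (0, [0]), (1, [2])]
toy_counts = cooccurrence_counts(toy_frames, num_phases=2, num_tools=3)
toy_if = inverse_frequency(toy_counts)
```

Pairs that never co-occur in the training annotations receive a very large but finite value, so a prediction that combines them is penalized heavily.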
4.2 Multitask learning of a Bi-LSTM
The features extracted from the penultimate fully connected layer of the previously trained ResNet-50 are used to train a multitask Bi-LSTM in a similar learning framework, using the same cost functions as in (1), (2) and (3). A whitening transform is applied to all features across the training data before they are fed to the Bi-LSTM. Due to its bidirectional nature, the Bi-LSTM maintains two hidden layers, where one propagates from left to right in the time-unrolled sequence and the other from right to left. The final classification result is generated by combining the scores produced by both LSTM hidden layers. The input to the bidirectional LSTM is the sequence of visual features from the frames extracted from the entire video. A single-layered Bi-LSTM was used. Finally, median filtering is applied to the phase predictions to remove any abrupt changes.
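The final smoothing step can be illustrated with a small median filter over per-frame phase indices. The window length and the edge handling (repeating the first and last prediction) are assumptions; the source does not state them.

```python
def median_filter(labels, window=5):
    """Sliding-window median over a sequence of integer phase predictions.

    Edges are handled by repeating the first and last value (an assumption).
    """
    if window % 2 == 0:
        raise ValueError("window must be odd")
    half = window // 2
    n = len(labels)
    smoothed = []
    for i in range(n):
        neighbourhood = [
            labels[min(max(j, 0), n - 1)] for j in range(i - half, i + half + 1)
        ]
        smoothed.append(sorted(neighbourhood)[half])
    return smoothed


# A single spurious jump to phase 3 inside a run of phase 0 is removed.
raw_phases = [0, 0, 0, 3, 0, 0, 1, 1, 1, 1]
smooth_phases = median_filter(raw_phases, window=3)
```

An odd window guarantees that the output is always one of the predicted phase indices, so no invalid label is introduced.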
5 Experiments and Results
5.1 Dataset Description
Table 1: Surgical phases in the Cholec80 dataset with the mean ± std of their duration.
|Phase Id|Phase Name|Duration (secs)|
|P2|Calot triangle dissection| |
|P3|Clipping and cutting| |
|P6|Cleaning and coagulation| |
The proposed method is evaluated on the Cholec80 dataset (http://camma.u-strasbg.fr/datasets), which contains 80 videos of cholecystectomy surgeries performed by 13 surgeons at the University Hospital of Strasbourg. The phase annotation is provided for all frames at 25 frames per second (fps), whereas tools are annotated on one out of every 25 frames, leading to a 1 fps annotation rate on a 25 fps video. These annotations are rate matched to 1 fps. The dataset is split into two equal parts: the first 40 videos are used for training the multitask CNN and Bi-LSTM, and the last 40 videos are used for validation or testing. The visual appearance and the list of the 7 surgical tools in the Cholec80 dataset are given in Fig. 3. The details of the seven surgical phases and the mean ± std of their duration are given in Table 1. The dataset is imbalanced with respect to both surgical phases and tools, as evident in Fig. 4(a) and Fig. 4(b) respectively. Accordingly, the phase tensor corresponds to the 7 phases of surgery and the tool tensor corresponds to the 7 tools and the no-tool case. The phase-tool co-occurrence matrix is visualized in Fig. 5.
The multitask CNN (Sec. 4.1) is optimized using the stochastic gradient descent (SGD) algorithm with momentum and weight decay, on batches of frames. A learning rate scheduler reduces the learning rate when the validation loss does not decrease for a set number of consecutive epochs of training.
The multitask Bi-LSTM (Sec. 4.2) is trained with its own learning rate and the same kind of learning rate scheduler, which reduces the learning rate when the validation loss does not decrease for a set number of consecutive epochs. Batches are formed at the video level, and the remaining parameters are the same as for the CNN.
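The scheduling rule described above (reduce the learning rate once the validation loss has stopped improving for more than a given number of consecutive epochs) can be sketched as follows. The class name and all numeric values are hypothetical placeholders, not the settings used in the experiments.

```python
class PlateauScheduler:
    """Reduce the learning rate when the validation loss stops decreasing.

    All hyperparameter values passed in are placeholders for illustration.
    """

    def __init__(self, lr, factor, patience):
        self.lr = lr
        self.factor = factor      # multiplicative reduction applied to lr
        self.patience = patience  # epochs without improvement that are tolerated
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Update the state with this epoch's validation loss and return the lr."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        if self.bad_epochs > self.patience:
            self.lr *= self.factor
            self.bad_epochs = 0
        return self.lr


scheduler = PlateauScheduler(lr=0.1, factor=0.5, patience=2)
lr_history = [scheduler.step(loss) for loss in [1.0, 0.9, 0.95, 0.95, 0.95, 0.8]]
```

In a PyTorch training loop the same behaviour is available through `torch.optim.lr_scheduler.ReduceLROnPlateau`, which is stepped with the validation loss after each epoch.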
To compare the performance of the proposed method we consider seven baselines. BL1 is the modified multi-label multi-class ResNet-50, which predicts the tools present in an individual frame without using any temporal information in the video. BL2 is BL1 along with a Bi-LSTM. BL3 is the modified multi-class ResNet-50 used only for phase prediction on individual frames. BL4 is BL3 + Bi-LSTM. BL5 is the modified ResNet-50 which jointly predicts both tool and phase on individual frames only and is trained using the three loss functions. BL6 is EndoNet, which predicts both tool and phase. BL7 is the boosted CNN + RNN, which predicts only tools. The proposed method is essentially BL5 + Bi-LSTM.
The experiments were implemented using PyTorch 0.4 (https://pytorch.org) and accelerated with Nvidia CUDA 9.0 (https://developer.nvidia.com/cuda-90-download-archive) and cuDNN 7.3 (https://developer.nvidia.com/cudnn) on Ubuntu 16.04 LTS Server OS. The server consisted of 2x Intel Xeon E5-2699 v3 CPU, 2x32 GB DDR4 ECC Regd. RAM, 4TB HDD, and 1x Nvidia Quadro P6000 GPU with 24 GB DDR5 RAM. The CNN models (BL1, BL3, BL5) were trained first, followed by the Bi-LSTM for the adjunct models (BL2, BL4, proposed method).
The comparison between the baselines and the proposed method on three metrics, namely average precision, average recall, and average accuracy, is shown in Table 2. The tool-wise precision of the baselines (BL1, BL2, BL5) and the proposed method is shown in Fig. 6. The phase-wise precision and accuracy of the baselines (BL3, BL4, BL5) and the proposed method are shown in Fig. 7 and Fig. 8 respectively. All results are reported on the validation set (the last 40 videos of the Cholec80 dataset).
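For readers unfamiliar with the mAP figure quoted in this paper, the sketch below computes a ranking-based, non-interpolated average precision per class and averages it over classes. This is one common definition and is an assumption here; the evaluation protocol behind the reported numbers may differ. Function names are hypothetical.

```python
def average_precision(scores, labels):
    """Non-interpolated AP: mean of precision@rank taken at each positive frame."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(scores_per_class, labels_per_class):
    """Average the per-class AP over all classes (tools or phases)."""
    aps = [
        average_precision(s, l) for s, l in zip(scores_per_class, labels_per_class)
    ]
    return sum(aps) / len(aps)


# Toy example: four frames, one class; positives are ranked first and third.
toy_ap = average_precision([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0])
```

Averaging per class rather than per frame keeps rare tools and short phases from being dominated by frequent ones in the summary score.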
Table 2: Comparison of the baselines and the proposed method for tool detection and phase detection.
6 Discussion
In this paper we have proposed a new loss function for multitask learning, a weighted joint probabilistic loss function that models the dependency of a set of tools on a phase in laparoscopic surgeries. We then use a CNN and Bi-LSTM framework which jointly predicts tool and phase. We show through experiments that the mean average precision (mAP) obtained for tool detection outperforms all previous architectures. For phase detection the method yields better results with respect to mAP and also a higher accuracy. This indicates that the visual features learned by the CNN provide valuable information to the Bi-LSTM. The interdependence between tool and phase, provided to the network through the weighted joint probabilistic loss function, ultimately affects the gradients and the parameter updates and helps in better convergence. Another important aspect of our framework is the use of a Bi-LSTM, which has an inherent capability to capture long term dependencies along both the past and the future, which is expected to be required for better prediction in the temporal domain. For the Bi-LSTM, full video batch stacking and the whitening transform of CNN features prior to learning yield significantly better performance and faster convergence. The median filtering applied to the phase predictions obtained from the Bi-LSTM also resulted in a slight improvement in mAP and accuracy due to the removal of abrupt changes.
The results are provided for the Cholec80 dataset, which contains 80 videos of cholecystectomy surgeries. Some previous works used fewer than 20 surgery videos for surgical workflow analysis, which limited their performance because the models could not learn the richness of visual appearances associated with tools and phases. Without using any data augmentation techniques to compensate for the tool and phase imbalance seen in Fig. 4(a) and Fig. 4(b), the model gave significantly better results, which suggests that it is robust to data imbalance. The dataset also contains a lot of variability with respect to phase duration, as seen in Table 1, which does not affect the phase detection results to any significant extent, thereby demonstrating the network's capability to tackle such challenges.
Although the model can overcome the challenges described above, there are some limitations. Firstly, the Cholec80 dataset is limited to surgeons from one institution, which can easily lead to over-fitting; a dataset containing surgeries by multiple surgeons from different institutions should be used for training to yield more generalized results. Secondly, no image processing techniques were applied to the raw frames extracted from the videos to remove redundant information, which could help the CNN learn better features. Thirdly, the framework requires training the CNN first, followed by the Bi-LSTM, whereas an end-to-end system would require training only once, which would be computationally less expensive and is desirable.
7 Conclusion
A multitask deep learning framework comprising ResNet-50 and a Bi-LSTM with a weighted joint distribution loss function has been proposed. It gives better mAP for tool detection and comparable results for phase detection. The applicability of the proposed method is not necessarily limited to tool and phase detection; other areas such as tool localization, estimation of the completion time of surgery, and recognition of anatomy should be explored. Also, the tools in many images can have various orientations with respect to the camera depending on the surgery, so the use of vector convolutions, which can make the system rotation invariant, can be seen as future work to improve tool and phase prediction with the ability to learn from a limited annotated data corpus.
References
-  Al Hajj, H., Lamard, M., Charrière, K., Cochener, B., Quellec, G.: Surgical tool detection in cataract surgery videos through multi-image fusion inside a convolutional neural network. In: IEEE Ann. Int. Conf. Engg. Medicine Bio. Soc. pp. 2002–2005 (2017)
-  Al Hajj, H., Lamard, M., Conze, P.H., Cochener, B., Quellec, G.: Monitoring tool usage in surgery videos using boosted convolutional and recurrent neural networks. Med. Image Anal. 47, 203–218 (2018)
-  Bodenstedt, S., Wagner, M., Katić, D., Mietkowski, P., Mayer, B., Kenngott, H., Müller-Stich, B., Dillmann, R., Speidel, S.: Unsupervised temporal context learning using convolutional neural networks for laparoscopic workflow analysis. arXiv preprint arXiv:1702.03684 (2017)
-  Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: Proc. IEEE Conf. Comp. Vis. Patt. Recog. pp. 248–255 (2009)
-  Dergachyova, O., Bouget, D., Huaulmé, A., Morandi, X., Jannin, P.: Automatic data-driven real-time segmentation and recognition of surgical workflow. Int. J. Comp. Assist. Radio. Surgery 11(6), 1081–1089 (2016)
-  Donahue, J., Anne Hendricks, L., Guadarrama, S., Rohrbach, M., Venugopalan, S., Saenko, K., Darrell, T.: Long-term recurrent convolutional networks for visual recognition and description. In: Proc. IEEE Conf. Comp. Vis. Patt. Recog. pp. 2625–2634 (2015)
-  He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proc. IEEE Conf. Comp. Vis. Patt. Recog. pp. 770–778 (2016)
-  Ji, S., Xu, W., Yang, M., Yu, K.: 3d convolutional neural networks for human action recognition. IEEE Trans. Patt. Anal. Machine Intell. 35(1), 221–231 (2013)
-  Klank, U., Padoy, N., Feussner, H., Navab, N.: Automatic feature generation in endoscopic images. Int. J. Comp. Assist. Radio. Surgery 3(3-4), 331–339 (2008)
-  Lalys, F., Jannin, P.: Surgical process modelling: a review. Int. J. Comp. Assist. Radio. Surgery 9(3), 495–511 (2014)
-  Lea, C., Choi, J.H., Reiter, A., Hager, G.D.: Surgical phase recognition: from instrumented ors to hospitals around the world. In: Int. Conf. Med. Image Comput. Comp. Assist. Interv. - M2CAI workshop. pp. 45–54 (2016)
-  LeCun, Y., Bengio, Y., et al.: Convolutional networks for images, speech, and time series
-  Ma, X., Hovy, E.: End-to-end sequence labeling via bi-directional lstm-cnns-crf. In: Proc. Ann. Meeting, Assoc. Comput. Linguistics. vol. 1, pp. 1064–1074 (2016)
-  Marcos, D., Volpi, M., Komodakis, N., Tuia, D.: Rotation equivariant vector field networks. In: Proc. IEEE Inte. Conf. Comp. Vis. pp. 5048–5057 (2017)
-  Mishra, K., Sathish, R., Sheet, D.: Learning latent temporal connectionism of deep residual visual abstractions for identifying surgical tools in laparoscopy procedures. In: Proc. IEEE Conf. Comp. Vis. Patt. Recog. Workshops. pp. 58–65 (2017)
-  Padoy, N., Blum, T., Feussner, H., Berger, M.O., Navab, N.: On-line recognition of surgical activity for monitoring in the operating room. In: AAAI Conf. Artif. Intell. pp. 1718–1724 (2008)
-  Primus, M.J., Putzgruber-Adamitsch, D., Taschwer, M., Münzer, B., El-Shabrawi, Y., Böszörmenyi, L., Schoeffmann, K.: Frame-based classification of operation phases in cataract surgery videos. In: Int. Conf. Multimedia Model. pp. 241–253 (2018)
-  Ryu, J., Choi, J., Kim, H.C.: Endoscopic vision based tracking of multiple surgical instruments in robot-assisted surgery. In: Int. Conf. Control, Autom. Sys. pp. 2195–2198 (2012)
-  Sahu, M., Mukhopadhyay, A., Szengel, A., Zachow, S.: Tool and phase recognition using contextual cnn features. arXiv preprint arXiv:1610.08854 (2016)
-  Schuster, M., Paliwal, K.K.: Bidirectional recurrent neural networks. IEEE Trans. Sig. Proces. 45(11), 2673–2681 (1997)
-  Shental, N., Hertz, T., Weinshall, D., Pavel, M.: Adjustment learning and relevant component analysis. In: Proc. Eur. Conf. Comp. Vis. pp. 776–790 (2002)
-  Simonyan, K., Zisserman, A.: Two-stream convolutional networks for action recognition in videos. In: Adv. Neural Info. Proces. Sys. pp. 568–576 (2014)
-  Speidel, S., Benzko, J., Krappe, S., Sudra, G., Azad, P., Müller-Stich, B.P., Gutt, C., Dillmann, R.: Automatic classification of minimally invasive instruments based on endoscopic image sequences. In: Med. Imag.- Vis., Image-Guided Proced. Modeling. vol. 7261, p. 72610A (2009)
-  Sznitman, R., Becker, C., Fua, P.: Fast part-based classification for instrument detection in minimally invasive surgery. In: Int. Conf. Med. Image Comput. Comp. Assist. Interv. pp. 692–699 (2014)
-  Twinanda, A.P., Shehata, S., Mutter, D., Marescaux, J., De Mathelin, M., Padoy, N.: Endonet: A deep architecture for recognition tasks on laparoscopic videos. IEEE Trans. Medical Imag. 36(1), 86–97 (2017)