Unsupervised Trajectory Segmentation and Promoting
of Multi-Modal Surgical Demonstrations
Abstract
To improve the efficiency of surgical trajectory segmentation for robot learning in robot-assisted minimally invasive surgery, this paper presents a fast unsupervised method using video and kinematic data, followed by a promoting procedure to address the over-segmentation issue. An unsupervised deep learning network, the stacked convolutional autoencoder, is employed to extract discriminative features from videos in an efficient way. To further improve the accuracy of segmentation, on one hand, a wavelet transform is used to filter out the noise in the features from the video and kinematic data. On the other hand, the segmentation result is promoted by identifying adjacent segments with no state transition based on predefined similarity measurements. Extensive experiments on the public dataset JIGSAWS show that our method achieves much higher segmentation accuracy than state-of-the-art methods in a shorter time.
I. Introduction
Surgical trajectory segmentation is a fundamental problem in the field of robot-assisted minimally invasive surgery (RMIS). It can be applied in several scenarios, such as learning from demonstration [1], skill assessment [2], complex task automation [3], and so forth. Each surgical procedure is usually represented by synchronized video and kinematic recordings and can be decomposed into several meaningful sub-trajectories. Since the segments are atomic, with less complexity, lower variance, and fewer outliers, they improve the capability of subsequent robot learning and assessment. However, segmenting the surgical trajectory accurately and rapidly is a challenging task. Even an identical surgical procedure can vary remarkably in the spatial and temporal domains due to skill differences among surgeons. Moreover, the trajectory is susceptible to random noise.
Traditional solutions usually transform surgical trajectory segmentation into a clustering problem, and are mainly divided into two categories: supervised and unsupervised methods. Among supervised methods, Linear Discriminant Analysis (LDA) [4], Hidden Markov Models (HMMs) [2], Descriptive Curve Coding (DCC) [5], and Conditional Random Fields (CRF) [6] have been proposed. However, supervised methods are time-consuming because experts must manually annotate the training dataset. Thus, unsupervised methods have drawn more attention in recent years, including approaches based on Gaussian Mixture Models (GMM) and Dirichlet Processes (DP) [7, 8]. Although GMM- and DP-based methods dispense with manual annotations, there remains room to improve segmentation accuracy, since only the kinematic data are taken into account. Recently, video data have been incorporated via deep learning, since traditional pattern-recognition-based feature extraction cannot model the variations among surgeons' videos well. A. Murali et al. [9] employ VGGNet to extract features from video, followed by Transition State Clustering (TSC) for task-level segmentation using both kinematic and video data. Although the video source enables higher segmentation accuracy, the feature extraction from videos is time-consuming and easily leads to over-segmentation.
This paper focuses on unsupervised surgical trajectory segmentation by means of both video and kinematic data. It is difficult to find consistent segments in the varying, noisy recordings of surgeons with different skill levels performing a specific task. First, although video is capable of improving segmentation performance, extracting distinguishing features from it efficiently is challenging. In addition, random noise must be considered due to the differences in surgeons' skill. Second, state-of-the-art methods generally suffer from over-segmentation. We need an effective way to identify adjacent segments with no state transition.
As shown in Fig. 1, a fast unsupervised method for surgical trajectory segmentation is proposed using video and kinematic data. In particular, a promoting procedure is presented to alleviate the over-segmentation issue. First, a compact but effective unsupervised learning network, the stacked convolutional autoencoder (SCAE), is employed to speed up video feature extraction. A wavelet transform is then used to filter the features from the video and kinematic data before clustering based on TSC. We refer to the proposed segmentation method as TSC-SCAE for short. Finally, the segmentation result is promoted by merging clusters according to four similarity measurements, collectively called PMDD, based on principal component analysis, mutual information, data average, and dynamic time warping, respectively.
II. Unsupervised Trajectory Segmentation Based on TSC-SCAE
II-A Visual Feature Extraction Using SCAE
The Stacked Convolutional AutoEncoder (SCAE) [10] is an unsupervised feature extractor well suited to high-dimensional input. It is much faster than alternatives such as TSC-VGG and TSC-SIFT because of its simple neural network and unsupervised training. SCAE is also advantageous for image processing, as it preserves the spatial relationships between pixels. The SCAE network for visual feature extraction is shown in Fig. 2, and the corresponding configuration is summarized in TABLE I.
Fig. 2 illustrates that the basic structure of the encoder consists of convolutional and pooling layers. The input feature maps (for the first layer, the original image) are convolved with a convolutional layer, which passes information to subsequent layers while preserving the spatial relationships between pixels. These feature maps then pass through a max-pooling layer to reduce their size. After several such conv-pooling stages, a low-dimensional feature map is obtained from the encoder.
As shown in Fig. 2, the task of the decoder, whose topology mirrors that of the encoder, is to reconstruct the encoding result and thereby recover the implied image information. Therefore, we need to upsample the encoding result to restore the feature maps. To prevent the checkerboard effect caused by traditional transposed convolution, we use bilinear interpolation for upsampling before each convolutional layer. For further reduction of the feature dimension, we employ two convolutional layers with a kernel size of 1×1, placed after the last layer of the encoder and before the first layer of the decoder, respectively.
The Adam optimization algorithm [11] is employed to minimize an MSE (mean-square error) loss function, which estimates the similarity between the reconstructed image at the decoder output and the original image fed to the encoder. After training, a model (i.e., the weights of each layer) for image encoding and reconstruction is obtained. In the feature-extraction phase, we load only the encoder part of the model to extract features from each frame of the surgical video.
TABLE I: Configuration of the SCAE network

Stage    Type                 Patch Size  Stride  Output Size
Encoder  convolution          3×3         1       640×480×16
         max pooling          4×4         4       160×120×16
         convolution          3×3         1       160×120×8
         max pooling          4×4         4       40×30×8
         convolution          3×3         1       40×30×4
         max pooling          4×4         4       10×7×4
         convolution          1×1         1       10×7×1
Decoder  convolution          1×1         1       10×7×4
         bilinear upsampling  -           -       40×30×4
         convolution          3×3         1       40×30×8
         bilinear upsampling  -           -       160×120×8
         convolution          3×3         1       160×120×16
         bilinear upsampling  -           -       640×480×16
         convolution          3×3         1       640×480×3
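As a hand check on the configuration above, the encoder's output sizes can be reproduced by propagating shapes layer by layer. This is an illustrative sketch, not the network itself; it assumes "same"-padded 3×3 convolutions (spatial size unchanged) and non-padded 4×4 max pooling with stride 4.

```python
def conv_same(h, w, c_out):
    # 3x3 convolution, stride 1, zero padding: spatial size unchanged
    return h, w, c_out

def max_pool(h, w, c, k=4, s=4):
    # k x k max pooling with stride s, no padding
    return (h - k) // s + 1, (w - k) // s + 1, c

def encoder_shapes(h=480, w=640, c=3):
    """Return the (width, height, channels) of every encoder layer output,
    following the channel counts in TABLE I."""
    shapes = []
    h, w, c = conv_same(h, w, 16); shapes.append((w, h, c))  # 640x480x16
    h, w, c = max_pool(h, w, c);   shapes.append((w, h, c))  # 160x120x16
    h, w, c = conv_same(h, w, 8);  shapes.append((w, h, c))  # 160x120x8
    h, w, c = max_pool(h, w, c);   shapes.append((w, h, c))  # 40x30x8
    h, w, c = conv_same(h, w, 4);  shapes.append((w, h, c))  # 40x30x4
    h, w, c = max_pool(h, w, c);   shapes.append((w, h, c))  # 10x7x4
    h, w, c = conv_same(h, w, 1);  shapes.append((w, h, c))  # 10x7x1
    return shapes
```

Running this reproduces the encoder column of TABLE I, ending in a 10×7×1 feature map per frame.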
II-B Denoising Based on Wavelet Transform
After feature extraction from the demonstration video, the visual and kinematic features are fed to a nonparametric mixture model for clustering. However, we find that these features usually suffer from random noise. To remove it, a wavelet-transform-based low-pass filter is designed, owing to the wavelet transform's capacity for multi-scale filtering.
In this paper, we process the kinematic data and visual features with the db10 wavelet and perform a 5-level wavelet decomposition for denoising. Fig. 3 and Fig. 4 compare the kinematic and visual features before and after wavelet-transform-based filtering.
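The decompose/zero-details/reconstruct scheme can be sketched as follows. This illustrative stand-in substitutes the simpler Haar wavelet for the paper's db10 so it runs without a wavelet library; the structure (multi-level decomposition, discarding small-scale detail coefficients, reconstruction from the approximation) is the same.

```python
import numpy as np

def haar_lowpass_denoise(signal, levels=5):
    """Multi-level wavelet low-pass filtering. The paper uses a db10
    wavelet with 5 decomposition levels; this sketch uses the Haar
    wavelet instead. Detail coefficients (small-scale noise) are
    zeroed, then the signal is rebuilt from the approximation."""
    x = np.asarray(signal, dtype=float)
    n = len(x)
    approx = x.copy()
    lengths = []
    for _ in range(levels):
        if len(approx) < 2:
            break
        if len(approx) % 2:              # pad odd lengths to even
            approx = np.append(approx, approx[-1])
        lengths.append(len(approx))
        # Haar analysis: keep scaled pairwise sums, drop differences
        approx = (approx[0::2] + approx[1::2]) / np.sqrt(2)
    for m in reversed(lengths):
        # Haar synthesis with zero detail coefficients
        approx = np.repeat(approx / np.sqrt(2), 2)[:m]
    return approx[:n]
```

One decomposition level reduces to pairwise averaging, so a constant signal passes through unchanged while high-frequency noise is attenuated.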
After filtering, the visual and kinematic features are fed to a nonparametric mixture model to segment the surgical trajectory. Considering its clustering performance, Transition State Clustering (TSC) [8] is adopted in this paper.
III. Segmentation Promoting Based on PMDD
Most unsupervised trajectory segmentation methods suffer from over-segmentation. To merge wrongly split sub-trajectories that belong to the same cluster, a criterion is required to evaluate the similarity between segments. Looking closely at segments of the same sub-trajectory, they share several implicit and explicit associations: besides similarity in the spatial and temporal domains, inner structure, variation nodes, and moving trend are also important factors. Taking these factors into consideration, we propose a promoting algorithm based on PMDD, which consists of four similarity measurements based on Principal Component Analysis (PCA), Mutual Information (MI), Data Average (DA), and Dynamic Time Warping (DTW).
Similarity measurement based on PCA: W. Krzanowski [12] shows that PCA can be used to measure the similarity between segments; it mainly captures the internal links and structure of the segments. Considering two segments X and Y, PCA finds several principal components of each, which span subspaces L_X and L_Y representing the main information of X and Y. Smaller angles between L_X and L_Y mean greater internal consistency between the segments. Thus, the PCA-based similarity measurement is defined by the angles between the subspaces spanned by the principal components:
S_PCA(X, Y) = Σ_{i=1}^{k} cos²θ_i,   (1)

where k is the number of principal components and θ_i is the i-th principal angle between L_X and L_Y.
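A minimal sketch of this measurement, assuming the Krzanowski-style sum of squared principal-angle cosines described above (the number of components k is a free parameter):

```python
import numpy as np

def pca_subspace_similarity(X, Y, k=2):
    """PCA similarity between segments X and Y (frames x features):
    the sum of squared cosines of the principal angles between their
    top-k principal subspaces. Ranges from 0 (orthogonal subspaces)
    to k (identical subspaces)."""
    def top_components(A, k):
        A = A - A.mean(axis=0)
        # rows of Vt are the principal directions
        _, _, Vt = np.linalg.svd(A, full_matrices=False)
        return Vt[:k].T                      # features x k, orthonormal
    L, M = top_components(X, k), top_components(Y, k)
    # singular values of L^T M are the cosines of the principal angles
    s = np.linalg.svd(L.T @ M, compute_uv=False)
    return float(np.sum(s ** 2))
```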
Similarity measurement based on MI: Surgery is a continuous process, so the data variation of segments within the same surgical sub-process is similar. Entropy can be interpreted as a measure of the uncertainty of a variable. Therefore, MI is a good similarity measurement for the degree of variation between two segments; it is obtained by subtracting the joint entropy H(X, Y) from the entropies H(X) and H(Y) of the two segments:
S_MI(X, Y) = H(X) + H(Y) − H(X, Y).   (2)
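A histogram-based sketch of this measurement; the binning scheme and the use of equal-length 1-D feature sequences are assumptions, since the paper does not specify its entropy estimator.

```python
import numpy as np

def mutual_information(x, y, bins=16):
    """Histogram estimate of I(X;Y) = H(X) + H(Y) - H(X,Y) between
    two 1-D feature sequences of equal length (a sketch of Eq. (2))."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)
    def entropy(p):
        p = p[p > 0]                 # 0 * log 0 := 0
        return -np.sum(p * np.log2(p))
    return entropy(px) + entropy(py) - entropy(pxy.ravel())
```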
Similarity measurement based on DA: DA mainly reflects the spatial characteristic. During a surgical sub-process, the trajectory within a short time interval is spatially similar. Therefore, the distance between the centers of the segments in the spatial domain is taken into account:
S_DA(X, Y) = ||x̄ − ȳ||,   (3)

where x̄ and ȳ are the mean vectors of segments X and Y.
Similarity measurement based on DTW: Due to differences in surgeons' skill, the same action may produce different sub-trajectories; a typical case is the same behavior performed at different speeds in the temporal domain. The key issue of DTW is the warping path. Here, we use the cumulative distance to find the best warping path when measuring the DTW similarity [13]:
S_DTW(X, Y) = (1/λ) Σ_{l=1}^{L} d(w_l),   (4)

where w_l is the l-th element of the warping path and λ is a compensation parameter that can be identified from the cumulative distance:
D(i, j) = d(x_i, y_j) + min{D(i−1, j), D(i, j−1), D(i−1, j−1)},   (5)

where d(x_i, y_j) is the Euclidean distance between points x_i and y_j.
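The cumulative-distance recurrence can be implemented directly; this sketch omits the compensation parameter and simply returns the minimal cumulative distance.

```python
import numpy as np

def dtw_distance(X, Y):
    """Dynamic time warping between segments X (n x d) and Y (m x d),
    using the standard cumulative-distance recurrence of Eq. (5):
    D(i,j) = d(x_i, y_j) + min(D(i-1,j), D(i,j-1), D(i-1,j-1))."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    if X.ndim == 1:
        X = X[:, None]
    if Y.ndim == 1:
        Y = Y[:, None]
    n, m = len(X), len(Y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Euclidean point distance d(x_i, y_j)
            cost = np.linalg.norm(X[i - 1] - Y[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]
```

Because the warping path can absorb repeated samples, two segments tracing the same curve at different speeds have distance zero.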
The four similarity measurements above are on different scales, so normalization is required to obtain the final measure. For the distance-like measurements, the smaller the value, the more similar the two segments are; these are normalized using Eq. (6), while the remaining measurements are normalized using Eq. (7). The final similarity is then calculated by Eq. (8).
Ŝ = 1 − (S − S_min) / (S_max − S_min),   (6)

Ŝ = (S − S_min) / (S_max − S_min),   (7)

S_final = (Ŝ_PCA + Ŝ_MI + Ŝ_DA + Ŝ_DTW) / 4.   (8)
Then, according to the final similarity measurements, segments with high similarity are merged iteratively. Given the segmentation results, the final similarity of each pair of adjacent segments is calculated by Eq. (8) in each iteration. The pair with the highest final similarity is merged, the similarities are updated, and the most similar pair is merged in the next iteration, until the highest final similarity falls below a threshold. The segmentation promoting algorithm is summarized in Algorithm 1.
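The promoting loop can be sketched as follows. The `similarity` argument stands in for Eq. (8), and merging by concatenation assumes list-like segments; both are illustrative assumptions rather than the paper's exact implementation.

```python
def promote_segmentation(segments, similarity, tau):
    """Sketch of the PMDD promoting loop: repeatedly merge the adjacent
    pair of segments with the highest final similarity until no pair's
    similarity exceeds the threshold tau."""
    segments = list(segments)
    while len(segments) > 1:
        # score every adjacent pair with the final similarity
        sims = [similarity(segments[i], segments[i + 1])
                for i in range(len(segments) - 1)]
        best = max(range(len(sims)), key=sims.__getitem__)
        if sims[best] < tau:
            break                      # no pair is similar enough
        # merge the most similar adjacent pair, then re-score
        segments[best:best + 2] = [segments[best] + segments[best + 1]]
    return segments
```

With a toy similarity based on segment means, two pairs of near-identical fragments collapse into two segments:

```python
segs = [[1, 1], [1, 2], [9, 9], [9, 8]]
sim = lambda a, b: 1.0 / (1.0 + abs(sum(a) / len(a) - sum(b) / len(b)))
promote_segmentation(segs, sim, tau=0.5)
```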
IV. Experimental Results
In this section, two sets of experiments are conducted to verify the performance of the proposed unsupervised segmentation algorithm for surgical trajectories. In the first experiment, TSC-SCAE is evaluated with respect to accuracy and overall running time, compared against classic clustering methods including GMM and TSC. The effects of different data sources and wavelet-transform-based filtering are analyzed quantitatively. Second, the segmentation promoting method is verified by applying it after different methods, using the kinematic data alone and the combination of video and kinematic data, respectively.
The dataset JIGSAWS [14] from Johns Hopkins University is used in the experiments, including data recordings and manual annotations. The data recordings consist of surgical video and kinematic data collected from the Da Vinci Surgical System, both sampled at 30Hz. The dataset contains three surgical tasks: Suturing (SU), Needle-Passing (NP), and Knot-Tying (KT), performed and annotated by 8 surgeons with different skill levels. The suturing and needle-passing tasks are commonly used in the literature. In this paper, we adopt 11 demonstrations of these two tasks, including the videos and kinematic data from 5 experts (E), 3 intermediates (I), and 3 novices (N). The kinematic data have 38 dimensions, including position, angular velocity, gripper angle, etc. All 11 videos of each task are used for SCAE model training and feature extraction. The computational configuration used in the experiments is summarized in TABLE II.
TABLE II: Computational configuration

Category                 Specification
Operating System         Ubuntu
CPU                      32 × Intel Xeon E5-2620 v4 @ 2.10GHz
GPU                      NVIDIA Tesla K40
CUDA Compute Capability  3.5
CUDA Cores               2880
RAM                      128GB
Programming Language     Python
IV-A Quantitative Analysis of TSC-SCAE
IV-A.1 Accuracy Comparison
In this section, the accuracy of TSC-SCAE is compared using Normalized Mutual Information (NMI), which measures the similarity of transition states between a predicted clustering result and the ground truth (manual annotations). It can be calculated by
NMI(X, Y) = I(X; Y) / √(H(X) H(Y)),   (9)

where H(X) and H(Y) are the information entropies of X and Y, respectively, and I(X; Y) is their mutual information. The range of NMI is [0, 1], where 0 means there is no correlation between the two clustering results, while 1 means the results are completely related.
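NMI can be computed directly from a pair of frame-wise labelings; the square-root normalization below is one common choice and may differ from the paper's exact form.

```python
import numpy as np

def nmi(labels_a, labels_b):
    """Normalized mutual information between two labelings:
    NMI = I(A;B) / sqrt(H(A) * H(B)), which lies in [0, 1]."""
    a = np.asarray(labels_a)
    b = np.asarray(labels_b)
    def entropy(labels):
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        return float(-np.sum(p * np.log2(p)))
    # joint entropy from the distribution of (a, b) label pairs
    _, counts = np.unique(np.stack([a, b], axis=1), axis=0,
                          return_counts=True)
    p = counts / counts.sum()
    h_ab = float(-np.sum(p * np.log2(p)))
    h_a, h_b = entropy(a), entropy(b)
    mi = h_a + h_b - h_ab
    return mi / np.sqrt(h_a * h_b) if h_a > 0 and h_b > 0 else 1.0
```

Note that NMI is invariant to relabeling: a clustering that matches the ground truth under any permutation of cluster indices still scores 1.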
We compare the proposed TSC-SCAE with state-of-the-art methods, including TSC [8], GMM [7], TSC-VGG, and TSC-SIFT [9], on the selected surgical demonstrations. According to the data sources used, the experiments fall into two categories: one uses kinematic data alone, and the other uses both video and kinematic data. TABLE III shows the NMI of the segmentation results. Our TSC-SCAE achieves the best NMI on all trajectory segmentation tasks, thanks to the use of video data and the wavelet transform. In particular, using both video and kinematic data, the accuracy is improved by more than 2.6 times at most compared with TSC-SIFT.
TABLE III: NMI (%) of trajectory segmentation

Method             Needle Passing          Suturing
                   E     E+I   E+I+N       E     E+I   E+I+N
TSC (K)            21.6  27.2  17.0        43.2  38.0  25.7
GMM (K)            53.3  51.2  45.8        45.2  43.4  41.0
TSC-VGG (V&K)      62.9  64.7  69.3        58.6  64.0  66.5
TSC-SIFT (V&K)     31.0  32.6  28.2        48.0  42.5  37.7
GMM-SCAE (V&K)     59.3  57.4  58.7        57.5  52.5  51.4
TSC-SCAE (V&K)     72.6  73.8  71.2        65.5  66.3  67.2
TSC-SCAE (V&K*)    79.1  77.7  74.7        67.9  67.5  68.5

(* with wavelet-transform-based filtering)
Overall, methods using both video and kinematic data are generally better than those using kinematic data alone, consistent with the results reported in the literature. The NMI of methods using kinematic data alone tends to decrease as the proportion of non-expert (I, N) demonstrations grows. This phenomenon is most significant in the suturing task, mainly because of its complexity and irregularity. Moreover, demonstrations from experts are usually smoother and faster than those from non-experts. However, when both kinematic and video data are considered, the phenomenon is clearly weakened. This shows that video data can help eliminate the influence of irregular trajectories from intermediates and novices, and are an effective complement for better surgical trajectory segmentation.
As mentioned above, random noise may interfere with the segmentation result. To address this, we apply multi-scale smoothing to the dataset using the db10 wavelet to filter out small-scale noise, which indirectly improves segmentation accuracy. Compared with the experiments without filtering, the NMI increases by 3.5-6.5 in the needle-passing task and by 1.2-3.4 in the suturing task.
IV-A.2 Overall Running Time Comparison
Another key indicator is overall running time; although surgical segmentation has no hard real-time requirement, the task still needs to be as fast as possible. For methods based on kinematic data alone, the running time is the cost of clustering and segmentation, while for methods using visual and kinematic data (TSC-VGG, TSC-SIFT, etc.), the time cost of video feature extraction must be added. For our TSC-SCAE, the time cost comprises three parts: visual feature extraction, wavelet-transform-based filtering, and clustering segmentation.
TABLE IV: Overall running time (s)

Method       Needle Passing                          Suturing                                Elements
             E           E+I         E+I+N           E           E+I         E+I+N
TSC-K        79          103         353             59          83          331             CS
GMM-K        1.76        1.95        3.34            1.59        2.00        5.38            CS
TSC-VGG      8120+394    9744+380    14616+1226      4935+322    5922+364    8884+1404       FE+CS
TSC-SIFT     2127+440    3284+723    5019+2020       1941+404    3036+533    4633+2259       FE+CS
GMM-SCAE     128+2.94    154+2.95    231+5.57        139+2.80    167+3.30    251+5.38        FE+CS
TSC-SCAE     128+197     154+199    231+933          139+158     167+201     251+1012        FE+CS
TSC-SCAE*    128+202+27  154+201+31  231+930+48      139+160+25  167+198+29  251+1008+47     FE+CS+WT

(FE: feature extraction; CS: clustering and segmentation; WT: wavelet-transform-based filtering)
The running times of the different steps are summarized in TABLE IV. The segmentation methods based on both visual and kinematic features are about 10 times slower than those using kinematic data alone, mainly because of the time-consuming visual feature extraction. However, among the methods using both data sources, our TSC-SCAE is almost 10 times faster than TSC-VGG and TSC-SIFT. This improvement in time efficiency is due to the highly efficient unsupervised model we employ for video feature extraction.
IV-B Evaluation of Segmentation Promoting
Over-segmentation is a common problem of clustering-based segmentation algorithms. To prove the validity of the proposed promoting approach as a post-processing step, we apply it to mainstream clustering segmentation algorithms, including GMM- and TSC-based methods. NMI measures the similarity of transition states in clustering-based segmentation, but the promoting stage does not merge based on transition states. Therefore, we choose segmentation accuracy (segacc) as the evaluation metric, which measures the similarity between the segmentation result and the ground truth intuitively and accurately.
The calculation of segacc can be divided into two steps. First, we match the resultant segments to the ground truth by maximizing the number of overlapping frames between predicted segments and ground truth [15]. Second, a match counts as a true positive if the IoU (Intersection over Union) between the ground-truth segment and its corresponding resultant segment exceeds a default threshold of 40%. We calculate the accuracy of each segment separately and then sum them up. Fig. 5 illustrates the calculation process, and segacc can be obtained using
segacc = (1/N) Σ_{i=1}^{N} 1( (min(e_i, e'_i) − max(s_i, s'_i)) / (max(e_i, e'_i) − min(s_i, s'_i)) > θ ),   (10)

where s_i and e_i (s'_i and e'_i) represent the start and end frames of the i-th ground-truth segment (its matched resultant segment), N is the number of ground-truth segments, and θ is the IoU threshold.
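A sketch of the two-step segacc computation; the greedy frame-overlap matching below is a simplification of the matching procedure in [15], and segments are represented as hypothetical (start, end) frame intervals.

```python
def seg_accuracy(pred, gt, iou_threshold=0.4):
    """Segmentation accuracy: each ground-truth (start, end) segment is
    matched to the predicted segment with the largest frame overlap;
    a match counts as correct when the IoU of the two frame intervals
    exceeds the threshold (40% in the paper)."""
    def overlap(a, b):
        return max(0, min(a[1], b[1]) - max(a[0], b[0]))
    def iou(a, b):
        inter = overlap(a, b)
        union = (a[1] - a[0]) + (b[1] - b[0]) - inter
        return inter / union if union > 0 else 0.0
    correct = 0
    for g in gt:
        # step 1: match by maximum frame overlap
        best = max(pred, key=lambda p: overlap(p, g))
        # step 2: count as true positive if IoU clears the threshold
        if iou(best, g) > iou_threshold:
            correct += 1
    return correct / len(gt)
```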
TABLE V: segacc before and after promoting

Method      Before Promoting                                After Promoting
            Needle Passing         Suturing                 Needle Passing         Suturing
            E      E+I    E+I+N    E      E+I    E+I+N      E      E+I    E+I+N    E      E+I    E+I+N
TSC-K       0.498  0.563  0.529    0.484  0.535  0.542      0.614  0.578  0.615    0.547  0.565  0.630
GMM         0.480  0.528  0.541    0.466  0.489  0.503      0.392  0.475  0.551    0.494  0.541  0.575
TSC-VGG     0.505  0.562  0.436    0.487  0.460  0.498      0.522  0.548  0.445    0.540  0.465  0.507
TSC-SIFT    0.546  0.561  0.510    0.442  0.513  0.493      0.592  0.582  0.590    0.521  0.589  0.593
TSC-SCAE    0.637  0.612  0.547    0.513  0.537  0.545      0.632  0.666  0.618    0.565  0.605  0.636
As shown in TABLE V, the segacc of each method improves noticeably in most cases. TSC-K is the biggest beneficiary, with segacc improved by 15.2% on average, while the improvement is smaller for TSC-SIFT and TSC-VGG. In the experiments, we notice that it is difficult to refine the segmentation if the clustering result is far from the ground truth. As shown in Fig. 11, each color represents a surgical activity segment, while white indicates an incorrect or over-segmented segment. Among all methods, the segacc of the GMM-based method even declines after promoting: because GMM requires the number of clusters to be specified, its failure mode is wrong segmentation rather than over-segmentation. For our TSC-SCAE, segmentation promoting yields up to a 16.7% improvement in segacc, and in most cases the resultant segmentation is significantly improved. From TABLE V, we also notice that the improvement for non-expert demonstrations is more pronounced than for expert ones, because non-expert demonstrations produce more over-segmented fragments.
In all experiments, TSC-SCAE obtains the best segmentation result, which shows that the proposed promoting method is effective for surgical trajectory segmentation. In general, it can be extended to most clustering-based segmentation algorithms.
V. Conclusion
This paper proposed a fast unsupervised method for surgical trajectory segmentation based on a compact stacked convolutional autoencoder model and wavelet-transform-based filtering of multimodal surgical demonstrations. The improvement in segmentation efficiency is threefold. First, the newly introduced model generates more discriminative visual features faster. Second, the short-range noise in the visual and kinematic features is filtered out by the wavelet transform. Last but not least, a promoting approach is proposed to handle the over-segmentation problem. Experimental results demonstrate that the proposed algorithm improves segmentation accuracy more efficiently than state-of-the-art methods.
Acknowledgment
This work was supported by the Project of Beijing Municipal Commission of Education (KM201710028017), National Natural Science Foundation of China (61702348, 61772351, 61602324), National Key R&D Program of China (2017YFB1303000, 2017YFB1302800), the Project of the Beijing Municipal Science & Technology Commission (LJ201607), Capacity Building for Sci-Tech Innovation - Fundamental Scientific Research Funds (025185305000), and the Youth Innovative Research Team of Capital Normal University.
References
 [1] A. Guha, Y. Yang, C. Fermüller, and Y. Aloimonos, “Minimalist plans for interpreting manipulation actions,” in Intelligent Robots and Systems (IROS), 2013 IEEE/RSJ International Conference on. IEEE, 2013, pp. 5908–5914.
 [2] C. E. Reiley, H. C. Lin, B. Varadarajan, B. Vagvolgyi, S. Khudanpur, D. Yuh, and G. Hager, “Automatic recognition of surgical motions using statistical modeling for capturing variability,” Studies in health technology and informatics, vol. 132, p. 396, 2008.
 [3] K. Shamaei, Y. Che, A. Murali, S. Sen, S. Patil, K. Goldberg, and A. M. Okamura, “A paced sharedcontrol teleoperated architecture for supervised automation of multilateral surgical tasks,” in Intelligent Robots and Systems (IROS), 2015 IEEE/RSJ International Conference on. IEEE, 2015, pp. 1434–1439.
 [4] H. C. Lin, I. Shafran, T. E. Murphy, A. M. Okamura, D. D. Yuh, and G. D. Hager, “Automatic detection and segmentation of robotassisted surgical motions,” in International Conference on Medical Image Computing and ComputerAssisted Intervention. Springer, 2005, pp. 802–810.
 [5] N. Ahmidi, Y. Gao, B. Béjar, S. S. Vedula, S. Khudanpur, R. Vidal, and G. D. Hager, “String motifbased description of tool motion for detecting skill and gestures in robotic surgery,” in International Conference on Medical Image Computing and ComputerAssisted Intervention. Springer, 2013, pp. 26–33.
 [6] L. Tao, L. Zappella, G. D. Hager, and R. Vidal, “Surgical gesture segmentation and recognition,” in International Conference on Medical Image Computing and ComputerAssisted Intervention. Springer, 2013, pp. 339–346.
 [7] S. H. Lee, I. H. Suh, S. Calinon, and R. Johansson, “Autonomous framework for segmenting robot trajectories of manipulation task,” Autonomous robots, vol. 38, no. 2, pp. 107–141, 2015.
 [8] S. Krishnan, A. Garg, S. Patil, C. Lea, G. Hager, P. Abbeel, and K. Goldberg, “Transition state clustering: Unsupervised surgical trajectory segmentation for robot learning,” The International Journal of Robotics Research, vol. 36, no. 13-14, pp. 1595–1618, 2017.
 [9] A. Murali, A. Garg, S. Krishnan, F. T. Pokorny, P. Abbeel, T. Darrell, and K. Goldberg, “TSC-DL: Unsupervised trajectory segmentation of multi-modal surgical demonstrations with deep learning,” in Robotics and Automation (ICRA), 2016 IEEE International Conference on, 2016, pp. 4150–4157.
 [10] J. Masci, U. Meier, D. Cireşan, and J. Schmidhuber, “Stacked convolutional autoencoders for hierarchical feature extraction,” in International Conference on Artificial Neural Networks. Springer, 2011, pp. 52–59.
 [11] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
 [12] W. Krzanowski, “Betweengroups comparison of principal components,” Journal of the American Statistical Association, vol. 74, no. 367, pp. 703–707, 1979.
 [13] D. J. Berndt, “Finding patterns in time series: a dynamic programming approach,” Advances in knowledge discovery and data mining, pp. 229–248, 1996.
 [14] Y. Gao, S. S. Vedula, C. E. Reiley, N. Ahmidi, B. Varadarajan, H. C. Lin, L. Tao, L. Zappella, B. Béjar, D. D. Yuh, et al., “JHU-ISI gesture and skill assessment working set (JIGSAWS): A surgical activity dataset for human motion modeling,” in MICCAI Workshop: M2CAI, vol. 3, 2014, p. 3.
 [15] C. Wu, J. Zhang, S. Savarese, and A. Saxena, “Watchnpatch: Unsupervised understanding of actions and relations,” in Computer Vision and Pattern Recognition (CVPR), 2015 IEEE Conference on. IEEE, 2015, pp. 4362–4370.