Transferrable Operative Difficulty Assessment in Robotassisted Teleoperation: A Domain Adaptation Approach
Abstract
Providing an accurate and efficient assessment of operative difficulty is important for designing robotassisted teleoperation interfaces that are easy and natural for human operators to use. In this paper, we aim to develop a datadriven approach to numerically characterize the operative difficulty demand of complex teleoperation. In effort to provide an entirely taskindependent assessment, we consider using only data collected from the human user including: (1) physiological response, and (2) movement kinematics. By leveraging an unsupervised domain adaptation technique, our approach learns the user information that defines task difficulty in a wellknown source, namely, a Fitt’s target reaching task, and generalizes that knowledge to a more complex human motor control scenario, namely, the teleoperation of a robotic system. Our approach consists of two main parts: (1) The first part accounts for the inherent variances of user physiological and kinematic response between these crossdomain motor control scenarios that are vastly different. (2) A stacked twolayer learner is designed to improve the overall modeling performance, yielding a 96.6% accuracy in predicting the known difficulty of a Fitts’ reaching task when using movement kinematic features. We then validate the effectiveness of our model by investigating teleoperated robotic needle steering as a case study. Compared with a standard NASA TLX user survey, our results indicate significant differences in the difficulty demand for various choices of needle steering control algorithms, , as well as the difficulty of steering the needle to different targets, . The results highlight the potential of our approach to be used as a design tool to create more intuitive and natural teleoperation interfaces in robotassisted systems.
Human Performance, Machine Learning, Operator Interfaces, Surgical Robotics
1 Introduction
1.1 Motivations
In current robotassisted teleoperation systems, human users typically control remote, robotic slave tools to carry out complex operations by physically manipulating a master controller [1]. The decision for how to map user inputs to robotic slave outputs is largely the choice of the designer, with careful considerations needed in cases where the degrees of freedom of the master manipulator and slave endeffectors, or robots, do not match. In general, the use of teleoperation has enabled improvements in the capabilities and performance of the human operator in many scenarios, such as robotassisted surgery [2], search and rescue robots [3], and space exploration [4], to name a few. However, due to limitations such as that kinematic dissimilarity between master and slave devices, time delay, and the lack of sensory feedback from the slave environment [5, 6], it has been shown that teleoperation also can be mentally and physically challenging for operators to use. Furthermore, user perception of difficulty can affect one’s ability to perform specific tasks [7, 8]. As such, a poor choice of the teleoperation interface can negatively affect task performance, degrade user experience, and reduce the overall effectiveness of the system [9, 10]. Interest in this problem has also increased in the collaborative robotics community. Studies suggest that an efficient assessment of difficulty is imperative for determining how to appropriately share levels of autonomy between the robot and human operators to deliver adaptive robotic assistance and improve user satisfaction [11]. Thus, our goal is to provide an objective approach to quantify the intuitiveness, or alternatively, the difficulty demand, of robotassisted teleoperated tasks.
1.2 Background & Prior Work
Accurately inferring the difficulty of a teleoperated task is practically challenging. Current work on teleoperation difficulty assessment has primarily focused on the administration of customized questionnaires or standard user ratings, such as the NASA Task Load Index (TLX) [12]. This approach, while helpful and easy to obtain, has the risk of being biased due to human interpretation [13, 14, 15]. Furthermore, the postexperiment surveys do not describe the realtime cognitive effort or behavioral and physical demand of the task. These drawbacks not only limit the usefulness of user surveys, but could also explain differences that have been found in the literature regarding a mismatch between user performance and perceived task workload demand [16, 17, 18, 19, 20].
An objective measure of difficulty is possible in simple human motor control tasks. Commonly used in humancomputer interaction and ergonomics research, Fitts’ Law describes underlying difficulty demand of a simple target reaching task [21]. The difficulty of the targetreaching motion, i.e., the index of difficulty (ID), is mathematically quantified as a function of the target width (W) and distance to the target (D): ID . As a widely accepted empirical rule, Fitts’ difficulty model has been applied in a variety of studies for ergonomic design and general humancomputer interaction analysis [22, 23].
However, directly applying the traditional Fitts’ paradigm to realworld teleoperation could be an oversimplification as the tasks are typically more involved than simple target reaching and may require more complex motor manipulation from the human operator. Additionally, in teleoperated tasks, the robotic system often has no knowledge of particular objectives of the human operator making taskspecific metrics such as time and errors less effective in assessing task difficulty.
In this work, we aim to provide an objective task difficulty assessment explicitly devoted to characterizing the underlying difficulty demand of robotassisted teleoperation tasks. We propose a datadriven modeling approach to automatically predict the difficulty of complex operation based on the realtime measures of human sensorimotor response (i.e., physiological response and movement kinematics). Fig. 1 shows the overview. Our approach leverages the known difficulty information from a wellestablished source, namely, a simple human motor control task (Fitts’ reaching), and generalizes to a related but more complex motor control scenario (i.e., robotassisted teleoperation). Meanwhile, we explicitly consider the variability of the human sensorimotor response [24, 25, 26], which has historically made objective assessment of operative difficulty difficult. Based on unsupervised domain adaptation, our approach compensates for the inherent user response variances from two different motor control tasks. Thus, the resulting model could be applicable to the target domain of interests without a priori knowledge of particular tasks. Furthermore, a stacked machine learning architecture consisting of multiple singlealgorithm learners is used to improve the overall modeling accuracy. Finally, we validated the effectiveness of our approach by analyzing the teleoperation of robotic steerable needles and compared our difficulty predictions with the results of standard NASA TLX user surveys. To our knowledge, this is the first study to provide a transferable operative difficulty assessment for better understanding the underlying difficulty demand of complex robotassisted teleoperation.
2 Methodology
2.1 Problem Definition & Formulation
The primary goal of our study is to develop a learner that can numerically infer the difficulty demand of a complex robotassisted teleoperation task, while leveraging the established knowledge of difficulty in a simple motor control task. Fig. 2 illustrates the overall framework of the operative difficulty assessment. We formulated the process as an unsupervised domain adaptation problem, which does not require the availability of any a priori labels of particular realworld teleoperation tasks [27]. To align with domain adaptation literature, we refer to the data of complex teleoperation as the “target domain” and the simple motor control task as the “source domain”, respectively.
Formally, we denote the source domain as . Each th sample pair is composed by the input data that is drawn from a probability distribution , with a corresponding label of difficulty index that follows a conditional probability distribution . Here, is the dimension of input data. Similarly, we denote the target domain as , where is the unlabeled target data drawn independently from a probability distribution . The and are the sample sizes in source and target domains, respectively.
According to the above formulation, predicting the difficulty of complex teleoperation task in target domain can be solved by minimizing the empirical risk , namely, the expected loss, , under the target data distribution, , given the model parameters, :
(1) 
where denotes the prediction function that maps input data to the output label . The prediction function can be equivalently expressed in a conditional probability form with respect to the target distribution as .
2.2 Covariate Shift Adaptation
Covariate shift refers to a phenomenon that the samples in source and target domains follow different data distributions, namely, , while the conditional distributions of the output values assumably remain unchanged (the learned prediction function is applicable to both), namely, [28]. Under the covariate shift, the learned model from source domain can not be directly applied to target domain. Instead, the distribution discrepancy should be explicitly taken into account for enabling the crossdomain applicability. One popular technique to compensate the covariate shift is the samplebased importance reweighting. Given a set of source domain and target domain data, the expectation over each source sample is adaptively weighted in accordance with their similarity to the target data distribution [29]. Intuitively, the target difficulty labels can be more accurately inferred from labeled source domain samples that shared more similarity of user response. To this end, the importance weight of source domain samples, , is defined as the ratio of target domain distribution to the source domain distribution:
(2) 
In order to estimate the importance weight for each sample, several protocols have been proposed, including direct density estimation [29], kernel mean matching [30], unconstrained leastsquares importance fitting [31], and KullbackLeibler importance estimation procedure (KLIEP) [32]. In this work, we implemented the KLIEP procedure for domain adaptation due to its computational efficiency and stability [33]. Specifically, we compute the weight estimate by following steps:
First, we parametrize the weight as a mixture of Gaussians:
(3) 
where are the mixing coefficients, is a fixed coefficient size, denotes the Gaussian kernel function with a kernel width , and is the Gaussian centre that is randomly sampled from target domain distribution . The Gaussian kernel width is chosen based on a builtin crossvalidation procedure. Then, the optimal coefficients are determined such that the KullbackLeibler (KL) divergence between the distributions and , namely, , is minimized. Since the KL divergence function is convex, the obtained weight estimate can thus be guaranteed as globally unique [32].
Finally, based on the KLIEP domain adaptation, the expected loss function over the distribution of target domain can thus be computed with respect to the distribution of source domain as:
(4) 
Note that, estimating the importance weights of samples is performed in a fully unsupervised way and thus does not require any a priori labels in either source or target domain.
2.3 Stacked Twolayer Architecture
While the aforementioned domain adaptation adjusts the domain of applicability so that source domain data can adaptively align with that of the target domain for building a model, it does not necessarily guarantee an optimal prediction accuracy as a plain covariateshift adapted learner can still produce highvariance estimations, thus causing instability. To alleviate this problem, we seek an optimal function that minimizes the expected loss (Eq. 4). To this end, we proposed a stacked twolayer architecture that combines multiple machine learning algorithms as base learners. The idea behind stacked learning is that averaging predictions from diverse single learners might better enhance accuracy and simultaneously reduce variances. Followed by the aforementioned adaptation process, two stages of learning were performed: First, five single machine learning algorithms were chosen as components in the first layer, including ridge regression, support vector machine, random forests, stochastic gradient boosting, and adaptive boosting (AdaBoost). Each single learner was fitted on source domain data given the importance weights as estimated from the previous adaptation step. Then, the second layer algorithm takes the predictions generated by the singlealgorithm learners as input and is optimally trained to output a final prediction. We choose a linear regression with regularization to ensemble the output of firstlayer base learners while avoiding overfitting.
Taken together, transferrable difficulty assessment for complex teleoperation tasks from target domain can be ultimately solved by minimizing the weighted loss of stacked learner along with the importance weights with respect to the distribution of source domain :
(5) 
In the following, we briefly introduced the five singlealgorithm base learners in our proposed architecture.
2.3.1 Ridge Regression
Ridge regression is a linear regression algorithm with regularization to alleviate multicollinearity among feature variables in regression [34]. The algorithm of ridge regression can be described as , where the coefficients, , are solved by the ordinary leastsquare optimization such that the penalized residual sum of squares is minimized.
2.3.2 Support Vector Machine
Support vector machine performs a robust regression by exploiting a maximummargin separating hyperplane in a transformed subspace [35]. The regression version of SVM algorithm can be described as , where denotes a kernel function that maps input features into a linearly separable subspace via nonlinear transformations. Within the new feature space, it searches for an optimal separating hyperplane by solving an inequalityconstrained quadratic optimization problem.
2.3.3 Random Forests
Random forests regression is an ensemble algorithm that fits a collection of decision trees [36]. The algorithm generates a set of unpruned decision trees based on bootstrapped data samples and then randomly selects the subsets as candidates to split tree nodes. This process is repeated multiple times to blend input data for the purpose of alleviating overfitting and noisy outliers. Given the input, final regression prediction of random forests model is made by averaging ensemble outputs of the generated trees.
2.3.4 Stochastic Gradient Boosting
Stochastic gradient boosting is another ensemble algorithm that improves the predictive performance. This approach combines a set of weak learners in a greedy fashion via sequentially minimizing residual errors [37]. First, the predictions of a base learner (commonly a decision tree) are made and its residual errors are then fitted to the initial learner. Then, it updates the model by adding an additional learner to the previous learner. The final model is obtained via iteratively repeating above steps such that the prediction loss is minimized. In contrast to basic gradient boosting [38], stochastic gradient boosting randomly selects a subset of samples for learning, which alleviates the potential overfitting.
2.3.5 Adaptive Boosting (AdaBoost)
Different from the stochastic gradient boosting, AdaBoost model fits a set of weak learners while sequentially reweighting samples to obtain an incremental prediction improvement [39]. Given initial sample weights, a base learner is first learned by fitting the input samples. Then, additional learners are sequentially added and trained on the same data but the initial weights are individually adjusted at each iteration to minimize prediction loss. Last, the final model is obtained by applying the weighted average to the output of multiple individual weak learners [40]. Note that the sample weights for AdaBoost here are first initialized based on the importance weights obtained from the previous domain adaptation process and then sequentially updated for modeling optimization.
2.4 Implementation Details
We employed a nested cross validation for training and testing models using labeled source domain data. The nested cross validation consists of an inner loop and an outer loop. Within each loop, a random subset of samples was used as a holdout for validation, while the remaining data was used for th () fold training. In our study, we used fold crossvalidation for the inner () and fold crossvalidation for the outer loop (). We obtained the optimal sample importance weights and model hyperparameters through the process of inner loop cross validation. Then, the outer loop cross validation was applied to evaluate prediction performance on the testing set.
We implemented our methodology using Scikitlearn and Python 3.6. All codes for model training and testing were run on the UT Southwestern BioHPC highperformance computing cloud platform with an NVIDIA Tesla P100 GPU.
3 Experimental Methods
To validate the effectiveness of our approach, we performed the experiments to objectively inquire the inherent difficulty demand of a complex teleoperation task. Two independentlycollected datasets were used in this study: one is the source domain dataset from a simple motor control task, and the other one is the target domain dataset from a complex teleoperation task. Fig. 3 shows the common experimental setup for realtime acquisition of user response. In both experiments, subjects manipulate the position and orientation of a simulated slave tool by controlling a master controller. A haptic device (Geomagic Touch) is used as the master controller while providing haptic feedback (depending on the control task). Simultaneously, all user data were noninvasively recorded in realtime using Robot Operating System (ROS). The collected user data include the signals of human physiological response (electromyography, galvanic skin response, electroencephalography, and heart rate) and movement kinematic signals (limb angular velocities and linear accelerations). More details regarding the sensor setup, data acquisition, and signal preprocessing can be found in our prior work [41, 42].
3.1 Datasets
3.1.1 Source Domain Dataset
Source domain data was collected from subjects when performing a simple motor control task: Fitts’ target reaching. Fig. 4 illustrates the Fitts’ target reaching task. In the experiment, subjects are required to reach a series of predefined targets (see Fig. 6 (a)) by controlling the stylus of the haptic device. According to Fitts’ law, the difficulty levels of Fitts’ target reaching were varied by changing the distances between the target and a fixed starting point. We chose six different difficulty levels of target reaching (ID) ranging from 2.0 to 7.0 (bits). To obtain a comparable difficulty scale, we further normalized the original difficulty indices into the range from 0 to 1.
Dataset  Task  Master Controller  Slave Endeffector  Label  Subject  Recorded Sensory Data  Sample Size 

Source domain  Fitts’ Target Reaching  Geomagic Tough  Virtual Tooltip  Labeled (Fitts’ IDs)  14  Physiological Response Movement Kinematics  420 
Target domain  Robotic Needle Steering  Geomagic Tough  Robotic Steerable Needle  Unlabeled  6  Physiological Response Movement Kinematics  960 
3.1.2 Target Domain Dataset
Target domain data was collected from subjects when performing a higherlevel motor control task: the teleoperation of robotic steerable needles [42]. The needle is a complex nonholonomic robot that is controlled through the axial insertion and rotation forces [43]. The inherent kinematics of the steerable needle make manual control for these needles very challenging, thus creating opportunities for performance improvements for these systems through robotassisted teleoperation [44, 45, 20]. While various potential algorithms exist for these needles, in this paper, we focus on four different control algorithms, including (1) joint space control (JC), (2) steering control (SC), (3) Cartesian space control (CC), and (4) Cartesian space control with haptic force feedback (CFB). Fig. 5 illustrates the teleoperation of steerable needle, as well as the different types of control algorithms employed. Fig. 6 (b) shows the target layout with the four target locations for needle steering. We briefly introduce the four control algorithms below and detailed definitions can be found in our prior work [20] and [42].

Joint space control: In joint space control, the user defines the insertion velocity and rotation angle for the steerable needle, similar to steering a car. As the needle inherently curves during insertion, the user must actively control curvature by varying the rotation angle.

Steering control: Steering control was implemented by directionpointing (left or rightpointing), similar to hubcentered steering (i.e., driving a motorcycle). Users manipulate the stylus to define steering angle which translates to axial needle rotation.

Cartesian space control: In Cartesian space control, the user defines a desired trajectory for the needle tip with the haptic stylus in Cartesian space and a nonholonomic kinematic model for needle curvature is used to determine corresponding insertion and rotation inputs for the needle [20]. Virtual haptic boundary is employed to alert the user when the needle trajectory is beyond the boundary of reachable space.

Cartesian control with haptic force feedback: In addition, a secondary form of Cartesian space control was implemented to provide an additional haptic cue. A positionerrorbased force, , proportional to the distance between needle tip position and current user position was added to prevent the user from moving too far ahead of the needle when defining a needle trajectory.
These four algorithms were chosen for this experiment because not only are they all complex control tasks, but also there is no objective way to assess the difficulty of these algorithms relative to each other. Table 1 summarizes collected datasets from both experiments in the source domain and target domain. Additional details related to these two experiments can be found in our prior work [41, 20].
3.2 Participants
For the source domain data, fourteen subjects participated in the experiment. Each subject completed tasks over five separate sessions, where in each session subjects are required to reach each of six targets that were presented in a random order. The experiment results in the total sample size of 420 individual Fitts’ reaching trials.
In the target domain dataset, six subjects in total participated in the experiment. All subjects completed four sessions wherein each session subjects are required to perform needle steering using a single control algorithm that was randomly selected. Each session included 40 randomized trials (10 repetitions for each of four different target locations) for each of the control algorithms. The experiment resulted in a the total sample size of 960 individual needle steering trials.
3.3 Feature Extraction
Instead of focusing on performance metrics that are dependent on the task objectives, such as target locations, ideal trajectories, or ideal task completion times, we characterize the user response using features that are task independent. Our features consist of two different modalities, i.e., physiological response features (Physiological) and kinematic movement features (Kinematic) of human subjects, extracted from signals measured in realtime during task completion.
3.3.1 Physiological Features
To quantify user physiological response, all physiological features were normalized to the baseline on a persubject basis in order to eliminate individual variances across subjects. Details of obtaining the baseline of physiological response for each subject are available in [41].
We extracted the normalized root mean square (RMS) and normalized mean absolute value (MAV) to quantify muscle activations from surface EMG signals. Two normalized probabilistic metrics, Engagement and Workload, were extracted using BIOPAC Cognitive State Analysis (BIOPAC Systems, Inc.) to assess subjects’ cognitive states from the EEG data [46]. For galvanic skin response, we extracted the normalized skin conductance () and normalized conductance variances (). Finally, we calculated the normalized heart rate (HR) based on peak detection of photoplethysmography signals collected from an optical pulse sensor.
3.3.2 Kinematic Features
To characterize user movement, we extracted kinematic features from user dominant forearm and upper arm’s motion profiles that were collected from IMU sensors. We calculated the normalized angular velocity (AngVel), normalized linear acceleration (LinAcc) and average jerk (Jerk). For any motion trajectory, the normalized motion features, i.e., normalized angular velocity and linear acceleration, were calculated as the mean divided by the peak velocity or acceleration. The average jerk was used to assess the movement smoothness given the motion profiles of user arms [47]. In addition, we extracted features measured from slave endeffectors to characterize movement kinematics in the simulated workspace, including the path straightness (PathStrDev) and path efficiency (PathEff) [42]. The path straightness and path efficiency quantify the efficiency of tool movements and user motor ability to continuously manipulate tools. It is important to note that these kinematic features are derived using only the position and velocity data, and thus do not depend on a priori knowledge of actual desired paths.
3.4 Modeling Evaluation
Evaluation of our proposed method was carried out in two steps. First, we evaluate the prediction accuracy on the labeled source domain data . According to the aforementioned nested crossvalidation scheme, prediction accuracy on testing data are averaged over the 10fold outerloop cross validation. We used the root mean squared error (RMSE) and to assess overall prediction accuracy. A lower RMSE value and higher indicate a better prediction accuracy.
Next, we aim to evaluate the prediction accuracy for the unlabeled target domain data . However, this is not directly possible as there are no groundtruth difficulty labels for the target domain. Therefore, we depend on other methods of assessing difficulty as a comparison. We used the NASA Task Load Index (TLX) rating survey, a structured grading tool that was used to subjectively assess difficulty based on subject experience in teleoperation. The six measurable scales of NASA TLX rating include: Mental Demand, Physical Demand, Temporal Demand, Performance, Effort and Frustration. In the experiment, NASA TLX rating was conducted for each control algorithm when subjects completed a teleoperation session; therefore, four NASA TLX surveys were acquired from each subject. The values of each scale, ranging from 0 to 20, were normalized by dividing by the maximum value selected for each subject across all surveys into the range of 0 to 1. An average combined score was then calculated from the six NASA TLX scales scores to access the overall userperceived difficulty level for a given control algorithm. The larger value indicates a higher userperceived difficulty demand.
To quantify the strength and statistical significance of our prediction, twoway ANOVA was conducted to determine significant differences between groups (i.e., control algorithm and targets) on the predicted difficulty. Oneway ANOVA was conducted to determine the group of control algorithms on the TLX difficulty ratings. Posthoc pairwise Tukey test was used for the significant results to determine the degree to which the group levels of interests differed. For all statistics, significance was determined by value of 0.05.
Modeling Methods  Physiological  Kinematic  

RMSE  RMSE  
Single Learner 
Ridge Regression  0.233  0.499  0.097  0.916 
Support Vector Machine  0.219  0.560  0.083  0.938  
Random Forest  0.224  0.544  0.074  0.949  
Stochastic Gradient Boosting  0.248  0.444  0.142  0.819  
AdaBoost  0.246  0.453  0.074  0.950  
Stacked Learning Model  0.215  0.576  0.061  0.966 
Feature  Control Algorithm  

JC  SC  CC  CFB  
Physiological  0.555  0.453  0.451  0.450 
Kinematic  0.748  0.425  0.291  0.497 
Feature  Target Configuration  
T1  T2  T3  T4  
Physiological  0.432  0.498  0.488  0.482 
Kinematic  0.255  0.679  0.623  0.521 
Feature  Control Algorithm  Target Configuration  Interaction  

posthoc comparisons  posthoc comparisons  
Physiological  CFBCC, CFBJC  T1(T4T3T2)  
Kinematic  CC(SCCFB)JC  T1T4(T3T2)  
represents significance 
4 Results & Discussion
It is important to note that our numerical difficulty prediction in this study is similar to, but different from that originally proposed by Fitts. Both models are aimed to quantify the underlying difficulty demand of certain operations. However, the difficulty model in Fitts’ definition is quantified as a logarithmic function of target distance and target width only, which depend on particular task settings. In contrast, difficulty assessment in this work is modeled as a prediction function ID , based on extracted human sensorimotor response features as input. It is mathematically solved (see Eq. 5) while taking into the account the covariate shift of user response for an optimal modeling. Furthermore, our approach is datadriven and taskindependent in the sense that the assessment is informed by using user physiological or kinematic features only. It does not depend on any knowledge of certain task constraints or settings, such as target locations, ideal trajectories.
4.1 Distribution Variances & Shift Adaptation
Understanding the distribution variances of extracted features among the simple motor control and complex teleoperation from the source and target domains is an important step when employing our learning scheme. For a direct qualitative analysis, we used the tDistributed Stochastic Neighbor Embedding (tSNE) method to visualize the distribution variances cross domains. As a popular approach of visualizing complex data, tSNE projects the highdimensional data into lowerdimensional space while the original distributions are preserved in the projection [48].
Fig. 7(a) shows the tSNE projection in the twodimensional space based on the input data of different feature modalities. For both physiological and kinematic features, the structure of data distribution in target domain is characteristically different from the source domain data. Specifically, the distribution of source domain data (depicted in blue color) falls into a wellgrouped structure, which is overlapped with the target domain data that is more scattered (depicted in red color). It reveals that an underlying distribution discrepancy exists among the extracted features (, ) in the target and source domains. This observation can be explained by the stochastic nature of user physiological response and motion kinematics, and thus justifies the need for further adaptation in our learning model. It is important to note that the tSNE projection is a nonlinear, stochastic process that depends on a choice of parameters [49]. Therefore, we only interpret tSNE outputs as the tool for a preliminary qualitative analysis and visualization of the distribution variances.
Since two domains have different distributions, the key factor is to bridge the gap between them to enable knowledge transfer. Specially, the abovementioned covariate shift was quantitatively adapted by the KLIEP procedure (see Section II.B.) to align the user data distributions between source and target domains. Fig. 7(b) presents the histogram of estimated importance weights of all source domain samples. Higher value of weights indicates a larger similarity of data distributions cross domains. The majority of sample importance weights was within a range from 0 to 1. This result further confirms the existing data distribution discrepancies of the selected features across domains.
4.2 Prediction Accuracy at Source Domain
As the model evaluation process contains two steps, we first show the prediction accuracy on the labeled source domain. Table 2 summarizes the average testing accuracy from the 10fold outerloop crossvalidation using different modeling methods. The results show that the presented stacked learning model outperformed the other singlealgorithm learners with the highest prediction accuracy, given the input data of either physiological features or kinematic features. In particular, from the stacked learning model, the movementbased difficulty prediction showed a highest accuracy with the RMSE , , whilst the physiological responsebased model provided a lower accuracy with the average RMSE , . This result indicates that the movement features carry the most representative information for difficulty prediction. This result makes sense as the Fitts’ reaching task directly affects the way and extent to which the human user moves his or her arm. In contrast, extracted physiological features may not be sensitive enough to detect difficulty changes accurately. One explanation is that the user physiological signals, such as EMG and EEG, have relatively higher inherent signal noises. Further improvement of physiological signal acquisition might help to achieve better difficulty identification.
4.3 Operative Difficulty of Robotic Needle Steering
Operative difficulty demand of complex motor control in the target domain, i.e., teleoperation of robotic needle, was characterized by the difficulty predictions output by our developed model, as shown in Table 3. Fig. 8 and Fig. 9 are the box plots summarizing the mean values of predicted difficulty based on the candidate physiological features (left) and kinematic features (right) for characterizing control algorithm (JC, SC, CC, and CFB) and target layout (T1, T2, T3, T4), respectively. Each box represents the 25th, 50th, and 75th percentiles, while the whiskers represent the confidence intervals. The difficulty prediction was further analyzed using a twoway ANOVA to determine the significant differences between the groups of control algorithms and target layouts. Posthoc pairwise Tukey comparisons highlight the significant effects among each group. Table 4 summarizes the statistical results from twoway ANOVA and posthoc Tukey comparisons.
4.3.1 Effects of Control Algorithms
As shown in Table 4, the analysis consistently shows significant differences of the control algorithms in needle steering for all physiological features (), and kinematic features (), respectively. In particular, from both modalities of selected features, joint space control was associated with the highest mean difficulty value. It indicates that joint space control was significantly more difficult to manipulate, physiologically and kinematically. Cartesian space control shows smaller difficulty values from both physiologybased and movementbased predictions. These results match our expectations because by abstracting the nonholonomic task constraints into haptic constraints, Cartesian space control no longer requires subjects to maintain an accurate internal model of the needle kinematics, and thus potentially decreasing the difficulty demand for the control task.
Interestingly, from the physiologybased prediction, Cartesian space control with haptic feedback (resistant force feedback in this study) had the lowest average difficulty value at 0.450, which was also significantly less than Cartesian space control without haptic feedback (CC), ; however, from the movementbased prediction, Cartesian space control with haptic feedback (CFB) has significantly higher difficulty of 0.497, compared to the Cartesian space control without haptic feedback (CC) at 0.291, . It is evident that the addition of haptic feedback diminished the difficulty demand physiologically, however, significantly increased the kinematic cost. This result might be linked to previous studies that haptic feedback contributes to a higher overall physical workload [50]. Although haptic force feedback was generally designed to improve overall control performance [51, 52, 53], the presence of such a feedback might not always guarantee intuitive user experience kinematically. As indicated from the movementbased prediction, haptics could potentially interfere with user movements due to the additional force resistance, and thus resulted in the increase of underlying difficulty demand for the task. Though seemingly contradictory, the physiologybased and movementbased predictions are able to provide a meaningful difficulty assessment from different aspects of the individual user response. Our analysis indicates that a better understanding of the impacts of force feedback on the user physiological response and movement behaviors would be helpful to finetune haptics settings and deliver a more intuitive robotic interface for the human user.
NASATLX  Control Algorithm  ANOVA  

JC  SC  CC  CFB  posthoc comparisons  
Mental Demand  0.833  0.722  0.361  0.278  0.157   
Physical Demand  0.735  0.412  0.441  0.500  0.397   
Temporal Demand  0.676  0.882  0.412  0.471  0.147   
Performance  0.382  0.294  0.147  0.088  0.206   
Effort  0.917  0.806  0.194  0.167  0.271   
Frustration  0.563  0.438  0.063  0.094  0.232   
Average  0.775  0.665  0.258  0.231  0.141   
represents significance 
4.3.2 Effects of Target Configurations
Similar to the effects of control algorithms, the statistical analysis showed that target layouts in needle steering produced significant differences on the difficulty demand for all features. As shown in Table 4, for both physiological and movement kinematic features, the closest target, T1, had the lowest difficulty demand. For the difficulty model using kinematic features, target T4 was also significantly less difficult to reach than T2 and T3. This can be explained by the fact that in all control modes, this target was either reached by a simple arm reach (in Cartesian space) or a continuous insertion with a constantly changing steering angle (in Joint and Steering control).
4.3.3 Interaction Between Control Algorithms and Targets
Next, we investigated whether any potential interaction effects exists between the different targets and control algorithms on task difficulty. Fig. 10 shows the mean difficulty values of different control algorithms for each target configuration. Using the joint space control (JC), it was significantly less difficult for subjects to reach target T1 while no statistical significance was found for reaching the other targets. For the targets T1, T3, and T4, needle steering using Cartesian control was found to have the lowest difficulty demand. For the target T2, the steering control (SC) is associated with the lowest average difficulty demand compared to the other control algorithms. These comparisons suggest that it is possible to improve difficulty demand for operators by combining the consideration of control algorithms and specific setups. The result represents an opportunity for future studies to adaptively adjust control algorithms according to certain task objectives for a more intuitive robotic teleoperation.
4.3.4 Comparison to Subjective Difficulty Assessments
To validate our proposed approach, we compared the predicted difficulty demand of needle steering operations with the user perceived difficulty demand measured from standard NASA TLX user ratings. Table. 5 shows the results from oneway ANOVA statistics of normalized user ratings within different control algorithms. Fig. 11 shows the mean values of normalized scores from respective NASA TLX scales.
We acknowledge that the results of userperceived difficulty demand from NASA TLX ratings were aligned with our difficulty prediction in general, however, no statistical significance were found in user ratings among all scales. In particular, joint space control was associated with the highest average TLX score (), followed by steering control (), Cartesian control (), and Cartesian control with haptic feedback (). This result shows that the joint space control was perceived as the most difficult one for operators. Though not significantly, Cartesian control with haptic feedback had the lowest score in the scale of Mental Demand while it was associated with higher score of Physical Demand, compared to steering control and Cartesian control without haptics. The companions of NASA TLX user ratings confirmed the validity of our proposed approach, while the difficulty prediction was able to provide a statistically significant assessment.
4.4 Limitations and Future Work
The first limitation can be found in the challenge of physiologybased difficulty modeling due to high variances of user physiological response. Despite our results indicating some potential for difficulty prediction, larger sample sizes would be needed to further improve accuracy and increase the statistical power. Secondly, we tested only the teleoperation of robotic needle as a case study of complex motor control. Further investigations of broader robotassisted teleoperation are warranted to confirm the generalizability of the proposed difficulty modeling approach. Moreover, manually defining and selecting meaningful features from both the source and target domain is laborious. We might not have found the most distinguishing features of user data to capture the underlying difficulty information in complex control tasks. For future work, it could be desirable to automatically learn features from raw timeseries data of user sensorimotor signals [54]. Finally, we are also interested in expanding our method towards online difficulty prediction that can be ultimately integrated with a robot system to provide better robotic assistance for complex teleoperation.
5 Conclusion
Accurately evaluating complex human motor control tasks and obtaining a clear understanding of inherent difficulty demand is a crucial step for designing an efficient, userfriendly robotassisted teleoperation system. In this study, we present a novel method to infer the underlying difficulty demand of complex teleoperation tasks based on the realtime measures of human operator sensorimotor response. The proposed approach transfers the known difficulty information in a simple motor control task to analyze a complex teleoperation. Based on an unsupervised domain adaptation, our approach explicitly takes into account the distribution discrepancies of user response in different motor control scenarios. A stacked twolayer architecture, combining multiple singlealgorithm learners, was implemented to refine predictions. We validated our approach by analyzing teleoperated robotic needle steering as a case study and compared our results with a standard NASATLX user survey. We showed that the operative difficulty demand of complex motor control can be readily inferred from user sensorimotor response, either physiologically or kinematically. In particular, our results confirmed that Cartesian space control provides the lowest operative difficulty at 0.291, with a statistical significance when compared to other control algorithms. The target configurations could also significantly affect the difficulty demand, .
Overall, our proposed approach is datadriven and taskindependent in the sense that inferring underlying operative difficulty demand from the user physiological response and movement kinematics would not require any a priori knowledge of specific tasks or operational environment. As such, our method could potentially be generalized for analyzing a broader range of complex motor control scenarios in robotassisted systems.
References
 [1] M. Shahbazi, S. F. Atashzar, and R. Patel, “A systematic review of multilateral teleoperation systems,” IEEE Transactions on Haptics, 2018.
 [2] J. W. Hill, et al., “Telepresence surgery demonstration system,” in Proceedings of the 1994 IEEE International Conference on Robotics and Automation. IEEE, 1994, pp. 2302–2307.
 [3] J. Casper and R. R. Murphy, “Humanrobot interactions during the robotassisted urban search and rescue response at the world trade center,” IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), vol. 33, no. 3, pp. 367–385, 2003.
 [4] T. B. Sheridan, “Space teleoperation through time delay: Review and prognosis,” IEEE Transactions on robotics and Automation, vol. 9, no. 5, pp. 592–606, 1993.
 [5] J. Y. Chen, E. C. Haas, and M. J. Barnes, “Human performance issues and user interface design for teleoperated robots,” IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), vol. 37, no. 6, pp. 1231–1245, 2007.
 [6] E. Yang and M. C. Dorneich, “The emotional, cognitive, physiological, and performance effects of variable time delay in robotic teleoperation,” International Journal of Social Robotics, vol. 9, no. 4, pp. 491–508, 2017.
 [7] P. Liu and Z. Li, “Task complexity: A review and conceptualization framework,” International Journal of Industrial Ergonomics, vol. 42, no. 6, pp. 553–568, 2012.
 [8] K. Stowers, et al., “A framework to guide the assessment of human–machine systems,” Human factors, vol. 59, no. 2, pp. 172–188, 2017.
 [9] A. Zunjic, “Performance and workload of operators in a human  telerebot system,” pp. 01–03, 08 2015.
 [10] D. Pan, Y. Zhang, and Z. Li, “The effects of task complexity and spatial ability on teleoperation performance,” in International Conference on Engineering Psychology and Cognitive Ergonomics. Springer, 2017, pp. 42–50.
 [11] D.J. Kim, et al., “How autonomy impacts performance and satisfaction: Results from a study with spinal cord injured subjects using an assistive robot,” IEEE Transactions on Systems, Man, and CyberneticsPart A: Systems and Humans, vol. 42, no. 1, pp. 2–14, 2012.
 [12] S. G. Hart and L. E. Staveland, “Development of nasatlx (task load index): Results of empirical and theoretical research,” Adv. in Psycho., vol. 52, pp. 139–183, 1988.
 [13] J. C. de Winter, “Controversy in human factors constructs and the explosive use of the nasatlx: a measurement perspective,” Cognition, technology & work, vol. 16, no. 3, pp. 289–297, 2014.
 [14] S. G. Hart, “Nasatask load index (nasatlx); 20 years later,” in Proceedings of the human factors and ergonomics society annual meeting, vol. 50, no. 9. Sage Publications Sage CA: Los Angeles, CA, 2006, pp. 904–908.
 [15] L. A. Cavuoto, et al., “Improving teamwork: evaluating workload of surgical team during robotassisted surgery,” Urology, vol. 107, pp. 120–125, 2017.
 [16] D. Whitney, et al., “Comparing robot grasping teleoperation across desktop and virtual reality with ros reality,” in Proceedings of the International Symposium on Robotics Research, 2017.
 [17] D. Cannon and M. Siegel, “Perceived mental workload and operator performance of dexterous manipulators under time delay with masterslave interfaces,” in Computational Intelligence and Virtual Environments for Measurement Systems and Applications (CIVEMSA), 2015 IEEE International Conference on. IEEE, 2015, pp. 1–6.
 [18] D. B. Kaber, et al., “Effects of visual interface design, and control mode and latency on performance, telepresence and workload in a teleoperation task,” in Proceedings of the human factors and ergonomics society annual meeting, vol. 44, no. 5. SAGE Publications Sage CA: Los Angeles, CA, 2000, pp. 503–506.
 [19] C. D. Pham, H. N. T. Phan, and P. J. From, “Evaluation of Subjective and Objective Performance Metrics for Haptically Controlled Robotic Systems,” Modeling, Identification and Control, vol. 35, no. 3, pp. 147–157, 2014.
 [20] A. Majewicz and A. M. Okamura, “Cartesian and joint space teleoperation for nonholonomic steerable needles,” in 2013 World Haptics Conf. (WHC). IEEE, Apr. 2013, pp. 395–400.
 [21] P. M. Fitts, “The information capacity of the human motor system in controlling the amplitude of movement.” J. Exp. Psychol., vol. 47, no. 6, p. 381, 1954.
 [22] R. Meyer, et al., “Development and evaluation of an input method using virtual hand models for pointing to spatial objects in a stereoscopic desktop environment with a fitts’ pointing task,” in Adv. Ergonomic Design Syst., Products Process. Springer, 2016, pp. 327–342.
 [23] R. W. Soukoreff and I. S. MacKenzie, “Towards a standard for pointing device evaluation, perspectives on 27 years of fitts’ law research in hci,” Intl. J. Humancomputer Studies, vol. 61, no. 6, pp. 751–789, 2004.
 [24] W. Klonowski, “Everything you wanted to ask about eeg but were afraid to get the right answer,” Nonlinear Biomedical Physics, vol. 3, no. 1, p. 2, 2009.
 [25] W. Samek, F. C. Meinecke, and K.R. Müller, “Transferring subspaces between subjects in brain–computer interfacing,” IEEE Transactions on Biomedical Engineering, vol. 60, no. 8, pp. 2289–2298, 2013.
 [26] N. Stergiou and L. M. Decker, “Human movement variability, nonlinear dynamics, and pathology: is there a connection?” Human movement science, vol. 30, no. 5, pp. 869–888, 2011.
 [27] S. J. Pan, Q. Yang, et al., “A survey on transfer learning,” IEEE Transactions on knowledge and data engineering, vol. 22, no. 10, pp. 1345–1359, 2010.
 [28] M. Sugiyama, et al., Dataset shift in machine learning. The MIT Press, 2017.
 [29] H. Shimodaira, “Improving predictive inference under covariate shift by weighting the loglikelihood function,” Journal of statistical planning and inference, vol. 90, no. 2, pp. 227–244, 2000.
 [30] J. Huang, et al., “Correcting sample selection bias by unlabeled data,” in Advances in neural information processing systems, 2007, pp. 601–608.
 [31] T. Kanamori, S. Hido, and M. Sugiyama, “A leastsquares approach to direct importance estimation,” Journal of Machine Learning Research, vol. 10, no. Jul, pp. 1391–1445, 2009.
 [32] M. Sugiyama, et al., “Direct importance estimation with model selection and its application to covariate shift adaptation,” in Advances in neural information processing systems, 2008, pp. 1433–1440.
 [33] M. Sugiyama, T. Suzuki, and T. Kanamori, “Density ratio estimation: A comprehensive review,” RIMS Kokyuroku, pp. 10–31, 01 2010.
 [34] A. E. Hoerl and R. W. Kennard, “Ridge regression: Biased estimation for nonorthogonal problems,” Technometrics, vol. 12, no. 1, pp. 55–67, 1970.
 [35] H. Drucker, et al., “Support vector regression machines,” Advances in neural information processing systems, vol. 9, pp. 155–161, 1997.
 [36] K. Fawagreh, M. M. Gaber, and E. Elyan, “Random forests: from early developments to recent advancements,” Systems Science & Control Engineering: An Open Access Journal, vol. 2, no. 1, pp. 602–609, 2014.
 [37] J. H. Friedman, “Stochastic gradient boosting,” Computational Statistics & Data Analysis, vol. 38, no. 4, pp. 367–378, 2002.
 [38] J. H. Friedman, “Greedy function approximation: a gradient boosting machine,” Annals of statistics, pp. 1189–1232, 2001.
 [39] Y. Freund and R. E. Schapire, “A decisiontheoretic generalization of online learning and an application to boosting,” Journal of computer and system sciences, vol. 55, no. 1, pp. 119–139, 1997.
 [40] H. Drucker, “Improving regressors using boosting techniques,” in ICML, vol. 97, 1997, pp. 107–115.
 [41] Z. Wang and A. M. Fey, “Humancentric predictive model of task difficulty for humanintheloop control tasks,” PloS one, vol. 13, no. 4, p. e0195053, 2018.
 [42] Z. Wang, R. Isabella, and A. M. Fey, “Toward intuitive teleoperation in surgery: Humancentric evaluation of teleoperation algorithms for robotic needle steering,” in Robotics and Automation (ICRA), 2018 IEEE International Conference on. IEEE, 2018, pp. 5799–5806.
 [43] R. J. Webster III, et al., “Nonholonomic modeling of needle steering,” The International Journal of Robotics Research, vol. 25, no. 56, pp. 509–525, 2006.
 [44] J. M. Romano, R. J. Webster, and A. M. Okamura, “Teleoperation of steerable needles,” in Proceedings 2007 IEEE International Conference on Robotics and Automation. IEEE, 2007, pp. 934–939.
 [45] C. Pacchierotti, et al., “Teleoperation of steerable flexible needles by combining kinesthetic and vibratory feedback,” IEEE transactions on haptics, vol. 7, no. 4, pp. 551–556, 2014.
 [46] R. R. Johnson, et al., “Drowsiness/alertness algorithm development and validation using synchronized eeg and cognitive performance to individualize a generalized model,” Biological psychology, vol. 87, no. 2, pp. 241–250, 2011.
 [47] S. Estrada, et al., “Smoothness of surgical tool tip motion correlates to skill in endovascular tasks,” IEEE Transactions on HumanMachine Systems, vol. 46, no. 5, pp. 647–659, 2016.
 [48] L. v. d. Maaten and G. Hinton, “Visualizing data using tsne,” Journal of machine learning research, vol. 9, no. Nov, pp. 2579–2605, 2008.
 [49] M. Wattenberg, F. ViÃ©gas, and I. Johnson, “How to use tsne effectively,” Distill, 2016. [Online]. Available: http://distill.pub/2016/misreadtsne
 [50] J. C. Huegel and M. K. Oâmalley, “Workload and performance analyses with haptic and visually guided training in a dynamic motor skill task,” in Computational Surgery and Dual Training. Springer, 2014, pp. 377–387.
 [51] M. K. OâMalley, et al., “Shared control in haptic systems for performance enhancement and training,” Journal of Dynamic Systems, Measurement, and Control, vol. 128, no. 1, pp. 75–85, 2006.
 [52] H. U. Yoon, et al., “Customizing haptic and visual feedback for assistive human–robot interface and the effects on performance improvement,” Robotics and Autonomous Systems, vol. 91, pp. 258–269, 2017.
 [53] R. J. Kuiper, et al., “Evaluation of haptic and visual cues for repulsive or attractive guidance in nonholonomic steering tasks,” IEEE Transactions on HumanMachine Systems, vol. 46, no. 5, pp. 672–683, 2016.
 [54] Z. Wang and A. M. Fey, “Deep learning with convolutional neural network for objective skill evaluation in robotassisted surgery,” International Journal of Computer Assisted Radiology and Surgery, vol. 13, pp. 1959 – 1970, 2018.
[]Ziheng Wang received his B.S. degree in Mechanical Engineering and Automation from Nanjing University of Aeronautics and Astronautics, Nanjing, China, and his M.S. degree in Mechanical Engineering from the University of Texas at Dallas, USA, in 2015 and 2017. He is currently a Ph.D. candidate in the Department of Mechanical Engineering, the University of Texas at Dallas, USA. His research interests include surgical data science, machine learning, deep learning, teleoperation, surgical robots, and haptics.
[]Cong Feng received the B.S. degree in Power Engineering from Wuhan University, Wuhan, China, in 2014, and the M.S. degree in Mechanical Engineering from the University of Texas at Dallas, Richardson, TX, USA, in 2017. He is currently a Ph.D. candidate in the Department of Mechanical Engineering at the University of Texas at Dallas. His research interests include machine / deep learning, power system big data analytics, power system time series forecasting.
[]Jie Zhang received the B.S. and M.S. degrees in mechanical engineering from the Huazhong University of Science and Technology, Wuhan, China, in 2006 and 2008, respectively, and the Ph.D. degree in Mechanical Engineering from Rensselaer Polytechnic Institute, Troy, NY, USA, in 2012. He is currently an Assistant Professor of the Department of Mechanical Engineering at The University of Texas at Dallas. His research interests include multidisciplinary design optimization, complex engineered systems, big data analytics, wind and solar forecasting, renewable integration, and energy systems modeling and simulation.
[]Ann Majewicz Fey received the B.S. degrees in Mechanical Engineering and Electrical Engineering from the University of St. Thomas, MN, USA, her M.S. degree in Mechanical Engineering from Johns Hopkins University, MD, USA, and her Ph.D. degree in Mechanical Engineering from Stanford University, CA, USA, in 2008, 2010, and 2014, respectively. She is currently an Assistant Professor in the Department of Mechanical Engineering, the University of Texas at Dallas, USA with a joint appointment in the Department of Surgery at UT Southwester Medical Center, USA. She directs the Human Enabled Robotic Technology (HeRo) Laboratory where she is responsible for research projects in the areas of surgical robotics, teleoperation, and haptics. Dr. Majewicz Fey is a recipient of the 2015 National Science Foundation CISE Research Initiation Initiative (CRII) award in CyberPhysical Systems.