Opportunities of a Machine Learning-based Decision Support System for Stroke Rehabilitation Assessment
Rehabilitation assessment is critical to determining an adequate intervention for a patient. However, current assessment practices rely mainly on a therapist's experience, and assessment is performed infrequently due to the limited availability of therapists. In this paper, we identified the needs of therapists in assessing a patient's functional abilities (e.g. an alternative perspective on assessment with quantitative information on the patient's exercise motions). We then developed an intelligent decision support system that identifies salient features of assessment using reinforcement learning, assesses the quality of motion, and generates patient-specific analysis. We evaluated this system with seven therapists using a dataset of 15 patients performing three exercises. The evaluation demonstrates that our system is preferred over a traditional system without analysis, presents more useful information, and significantly increases the agreement with therapists' evaluations from 0.6600 to 0.7108 F1-score. We discuss the importance of presenting contextually relevant and salient information and of adaptation in developing a human-machine collaborative decision-making system.
Assessment of physical rehabilitation exercises is an essential process to determine an appropriate clinical intervention for a patient with musculoskeletal and neurological disorders (e.g. stroke) (Lang et al., 2013). However, this process relies on therapist’s experience (Smith et al., 2008) and is infrequently performed due to the limited availability of a therapist. Researchers have explored the possibility of computer-assisted decision support tools (Berner, 2007) that can monitor and assess chronic diseases using sensor and machine learning technologies (Webster and Celik, 2014).
For instance, a Support Vector Machine (SVM) classifier was applied to distinguish mild and severe symptoms of four Parkinson's patients (Das et al., 2011), and Neural Networks have been utilized to quantify the quality of stroke rehabilitation exercises (Lee et al., 2019). These approaches process complex sensor data to automatically extract a meaningful function (i.e. a machine learning model) that classifies the quality of motion. However, it is challenging to derive a model that perfectly replicates a therapist's assessment given patients' diverse physical characteristics. For example, two patients could have different ways of incorrectly performing an exercise (Figure 1). Thus, a model may incorrectly predict a new patient's exercise motion with compensated joints that are not present in the system's dataset. If a model with complex algorithms cannot explain why it provides a different assessment (Gunning, 2017), therapists could lose trust in the model and abandon it even if it provides valuable predictions in other cases (Khairat et al., 2018).
In this paper, we implement and evaluate an intelligent decision support system (Figure 2). This system trains on data from all patients except one held out for testing, learning models that identify salient features of assessment using reinforcement learning and predict the quality of motion with Neural Networks. Using the identified salient features and the held-out patient's unaffected motions, the system provides user-specific analysis with a visualization interface. It empowers therapists to understand a patient's performance with 1) feature analysis with kinematic measurements, 2) images of salient frames, and 3) trajectory trends (Figure 3). While considerable prior work demonstrates the feasibility of assessing the quality of motion (Webster and Celik, 2014) and focuses on improving the accuracy of a model (Lee et al., 2019), systematic evaluations of such technologies are lacking.
We conducted field studies with therapists to identify the capabilities they need and implemented an intelligent decision support system for assessing stroke rehabilitation exercises with an exercise dataset (15 post-stroke patients performing three upper-limb exercises). The system presents the predicted scores of three performance components (i.e. ‘Range of Motion’, ‘Smoothness’, ‘Compensation’) with user-specific analysis as explanations of the predictions: feature analysis, images of salient frames, and joint trajectories (Figure 3). We performed a user study with seven therapists from four rehabilitation hospitals to investigate how therapists use the system and how it affects their decision making when assessing a patient's exercise performance. Results show that our system enables therapists to validate their assessment with quantitative, user-specific analysis, which increases user trust and the utility of the system. In addition, our system helps therapists achieve significantly higher agreement on their assessment (0.71 average F1-score) than a traditional system without analysis (0.66 average F1-score).
This paper makes the following contributions:
enumerate the needs of therapists when assessing rehabilitation exercises
present the design and implementation of an intelligent decision support system for stroke rehabilitation assessment that can identify salient features using reinforcement learning to predict the quality of motion and generate user-specific analysis
describe the quantitative and qualitative evaluation of our system with seven therapists from four rehabilitation hospitals and pose this system as an approach to support consistent assessment
2. Related Work
2.1. Current Practices of Physical Rehabilitation
Patients with musculoskeletal and neurological disorders (e.g. stroke) require a rehabilitation program over several months to prevent disability and improve their functional abilities. Performing task-oriented exercises is one of the effective ways for post-stroke survivors to improve functional ability and lower the chance of a recurrent stroke (Rensink et al., 2009). During a rehabilitation program, therapists first diagnose the condition of a patient with various methods (e.g. analyzing the patient's history, conducting tests, or analyzing measurements) and determine in-home interventions. In follow-up visits, therapists discuss the patient's progress and periodically evaluate treatment outcomes to modify interventions as appropriate (O’Sullivan et al., 2019). Although assessing a patient's performance on rehabilitation exercises is important for therapists to adjust interventions, this assessment relies on the therapist's experience (Smith et al., 2008) and is infrequently performed due to the limited availability of a therapist. In addition, therapists primarily rely on patients' self-reports and do not have any quantitative performance data to understand how well patients follow the prescribed regimens (Hendricks et al., 2002). Thus, therapists encounter challenges in understanding a patient's performance and adjusting the intervention. As a first step, this paper primarily focuses on understanding the effect of a machine learning-based decision support system for assessing stroke rehabilitation exercises.
2.2. Technological Support for Physical Rehabilitation
To address the limitation of current practices in physical rehabilitation, researchers have explored the feasibility of clinical decision support systems that assist a clinician to obtain insights on patients by monitoring and assessing chronic diseases with computational models (Webster and Celik, 2014).
One approach is a rule-based model, in which domain experts (i.e. clinicians) elicit a set of monitoring rules (Siewiorek et al., 2012). For example, Huang explored a tool with therapists to specify repetitions and joint angles for monitoring knee rehabilitation exercises (Huang, 2015). This rule-based approach provides modularization and flexibility for developing a monitoring system. However, it is time-consuming to determine the right threshold values of rules for an individual's status. Moreover, experts might not be able to articulate their decision making on a complex monitoring task. An alternative approach is a statistical model, which utilizes machine learning with labeled sensor data (Siewiorek et al., 2012). This statistical approach utilizes machine learning algorithms to process complex sensor data and automatically extract a meaningful function (e.g. a Neural Network model) that can classify the quality of motion (Das et al., 2011; Lee et al., 2019). However, no algorithm can completely replicate a therapist's assessment given patients' diverse physical characteristics and functional abilities. Moreover, a statistical approach with complex algorithms cannot explain its predictions to support an expert's decision making (Gunning, 2017), which undermines therapists' trust in and experience with a decision support system (Khairat et al., 2018).
In this paper, we aim to increase the interpretability of a model by feature selection (Kim et al., 2015; Biran and Cotton, 2017). Specifically, we apply reinforcement learning (Van Hasselt et al., 2016; Lee, 2019) to identify kinematic salient features for assessment. Utilizing an identified subset of features, we predict the quality of motion and generate user-specific analysis to summarize patient’s exercise performance (Lee, 2019). Our work demonstrates how a tool with predicted assessment and user-specific analysis can affect therapist’s decision making on rehabilitation assessment.
A substantial body of prior work focuses on demonstrating the feasibility of collecting objective kinematic variables to quantify the performance of rehabilitation exercises (Murphy et al., 2011) and assessing the quality of motion (Lee et al., 2019). Yet, there is a lack of knowledge and evaluation about therapists' experience with a decision support system for physical rehabilitation monitoring and assessment. Although clinical decision support systems can improve the practices of healthcare (Cai et al., 2019), such systems might not be adopted in clinical practice due to a lack of user trust and acceptance (Devaraj et al., 2014; Khairat et al., 2018). Specifically, clinical experts might not use a system if it is not properly integrated into their workflow and does not provide relevant information (Devaraj et al., 2014). This paper contributes to increasing knowledge about therapists' needs and experience with an intelligent decision support system for stroke rehabilitation assessment. We conducted a user study with therapists to investigate what types of capabilities therapists want, how therapists use such a system, and how the user-specific analysis of a system affects therapists' attitudes toward the system and assessment.
3. Stroke Rehabilitation as a Test Domain
Stroke is the second leading cause of death and the third most common contributor to disability (Feigin et al., 2017). As the incidence of stroke has increased worldwide, we selected stroke rehabilitation as a probe domain. We recruited nine therapists of stroke rehabilitation from five rehabilitation centers (Table 1) to understand their needs during stroke rehabilitation assessment. Three of the nine therapists specified the design of our study (i.e. exercises and performance components for assessment). One therapist annotated the dataset used to implement the system for evaluation. Two therapists reviewed our implementation before the user study, and the other seven therapists participated in the evaluation of our implementation.
3.1. Needs during Rehabilitation Assessment
We interviewed and performed focus group discussions with nine therapists (2 males and 7 females, 29.6 ± 5.4 years old) with 1 - 20 years of experience in stroke rehabilitation from five rehabilitation centers to gain knowledge about the current practices and therapists' needs in assessing patients' rehabilitation exercises. A group of therapists or an individual at each center participated in a need-finding study for an hour on the same topics: the process of assessment, strategies to cope with an uncertain situation, the current usage of technology, and opportunities for technological support. In addition, we observed a one-hour-long rehabilitation session at one rehabilitation center. Our thematic analysis of the need-finding sessions with therapists and the observation of a rehabilitation session is described as follows:
Therapist’s experience-based and Infrequent Assessment
Therapists mainly rely on their observation and experience to approximately assess a patient's performance on rehabilitation exercises (Sanford et al., 1993; Taub et al., 2011) and determine interventions (O’Sullivan et al., 2019). When assessing rehabilitation exercises, therapists commented that “there is no exact single normality for assessment” (TP 1). Instead, therapists mentioned that they first “check the functionality of unaffected side and define adequate normality for each patient” (TP 9). Therapists then internally generate a hypothetical correct movement from the patient's unaffected side and “analyze various aspects of performance: whether a patient can complete an expected movement and any compensated, not coordinated movement exists” (TP 2). During our observation of a rehabilitation session, a therapist first asked a patient to perform a motion multiple times or hold a certain position for a while for assessment. The therapist then had to keep moving to the front, back, and side to collect evidence for assessment and expressed a “difficulty with collecting information on patient’s rehabilitation exercise performance” (TP 3).
When therapists are unsure, they mentioned that they record the patient's movements to review and “re-evaluate more confidently after a session by watching a video multiple times” (TP 7), or “discuss with other colleagues” (TP 8) about their experience-based assessment. As the process of assessment is time-consuming, therapists perform it only infrequently (e.g. every two or three months).
Desire for Alternative Perspectives on Assessment with Quantitative Measurements
None of the rehabilitation centers that we visited or consulted use any technology for managing stroke rehabilitation. When discussing opportunities for technological support to assess rehabilitation exercises, therapists referred to the need to gain insights on a patient's performance through alternative perspectives on assessment and quantitative kinematic measurements. As mentioned before, therapists have “difficulty to detect minor changes over time or discuss with other colleagues” (TP 2) without quantitative kinematic measurements. In an uncertain situation, a therapist desired to “validate his/her assessment with alternative assessment from a colleague instead of relying on only my own experience” (TP 8). Overall, therapists desired a system that can provide “another perspective of assessment with quantitative measurements” (TP 6).
Specifically, therapists want to know “how closely a patient can reach a target motion” and “to which extent a patient performs a compensated motion” (e.g. “how much a shoulder joint is elevated”) (TP 3) with quantitative measurements and images of a patient's motion. In addition, therapists desire to understand whether a motion is smooth or not. However, as smoothness has an abstract definition, therapists have “difficulty with assessing smoothness of a motion” (TP 1). “Trajectory trends (e.g. showing a graph about how a wrist joint moves during a motion) would be useful to understand smoothness of motion” (TP 1).
TP 9 commented that her rehabilitation center had previously attempted to use a system to monitor rehabilitation, but ended up discarding it due to its complex and time-consuming usage. Regarding the presentation and usage of a system, TP 9 emphasized that “a system should be easy to use and present insights quickly with graphics given the limited session time for each patient.”
Based on our need findings with therapists, we have identified the requirements of an intelligent decision support system for stroke rehabilitation in Table 2.
| Needs | Requirements |
|---|---|
| N1. Define normality with unaffected motions of a patient | R1. Comparison between unaffected and affected motions |
| N2. Validate assessment with another perspective of assessment | R2. Prediction on assessment from a model calibrated with another therapist's assessment |
| N4. Simple and intuitive presentation | R4. Avoid overwhelming therapists and utilize graphics to present insights quickly |
After iterative discussions with three therapists, we specified the exercises and performance components of assessment to probe how therapists utilize an intelligent decision support system to assess patients' rehabilitation exercises.
Three Task-Oriented Upper Limb Exercises
This paper utilizes three upper-limb stroke rehabilitation exercises (Figure 4), recommended by therapists (Lee, 2018). In Figure 4, the ‘Initial’ indicates the initial position of an exercise and the ‘Target’ describes the desired end position of an exercise.
For Exercise 1, a subject has to raise his/her wrist to the mouth as if drinking water. For Exercise 2, a subject has to pretend to touch a light switch on the wall. Exercise 3 is to practice the use of a cane while extending the elbow in a seated position. These exercises were selected due to their correspondence with major motion patterns: elbow flexion for Exercise 1, shoulder flexion for Exercise 2, and elbow extension for Exercise 3.
After reviewing popular stroke assessment tools (i.e. the Fugl Meyer Assessment (Sanford et al., 1993) and the Wolf Motor Function Test (Taub et al., 2011)) and iterative discussions with therapists, we identified three common performance components and their scoring guidelines: ‘Range of Motion (ROM)’, ‘Smoothness’, and ‘Compensation’ (Table 3). The ‘ROM’ component describes the amount of joint movement to achieve a task-oriented exercise. The ‘Smoothness’ component indicates the degree of trembling and irregular movement of joints while performing an exercise. The ‘Compensation’ component checks whether compensated movements are used to achieve a target movement. For instance, a patient might elevate his/her shoulder to raise the affected hand as shown in Figures 1(b) and 1(d).
| Component | Score | Description |
|---|---|---|
| ROM | 0 | Does not or barely involves any movement |
| | 1 | Less than halfway aligned with the ‘Target’ position |
| | 2 | Movement achieves the ‘Target’ position |
| Smoothness | 0 | Excessive tremor or not smooth coordination |
| | 1 | Movement influenced by tremor |
| | 2 | Smoothly coordinated movement |
| Compensation | 0 | Noticeable compensation in more than two joints |
| | 1 | Noticeable compensation in a joint |
| | 2 | Does not involve any compensations |
We utilize an exercise dataset that is composed of sequential joint coordinates of motions and extract various kinematic features. To represent the ‘ROM’ component, we extract joint angles (e.g. elbow flexion, shoulder flexion, elbow extension), normalized relative trajectory (i.e. Euclidean distance between two joints - head and wrist, head and elbow), and normalized trajectory distance (i.e. absolute distance between two joints - head and wrist, shoulder and wrist) in x, y, z coordinates.
For the ‘Smoothness’ component, we compute the speed, acceleration, and jerk of the wrist and elbow joints. Moreover, normalized speed and acceleration and the Mean Arrest Period Ratio (the portion of frames in which the speed exceeds a given fraction of the maximum speed) are also included based on prior work (Rohrer et al., 2002).
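The smoothness features above can be derived from per-frame joint positions by finite differences. Below is a minimal sketch; the function name, the 30 Hz frame rate, and in particular the MAPR threshold fraction are illustrative assumptions rather than the paper's exact values:

```python
import numpy as np

def smoothness_features(positions, fps=30, mapr_fraction=0.1):
    """Speed, acceleration, and jerk magnitudes from per-frame joint
    positions (frames x 3), plus the Mean Arrest Period Ratio (MAPR):
    the portion of frames whose speed exceeds `mapr_fraction` of the
    maximum speed. The fraction 0.1 is an illustrative default."""
    velocity = np.diff(positions, axis=0) * fps   # per-frame velocity vectors
    speed = np.linalg.norm(velocity, axis=1)      # scalar speed per frame
    accel = np.diff(speed) * fps                  # scalar acceleration
    jerk = np.diff(accel) * fps                   # scalar jerk
    mapr = np.mean(speed > mapr_fraction * speed.max())
    return speed, accel, jerk, mapr
```

The same function could be applied to the wrist and elbow joints separately, with the resulting sequences summarized by the statistics described below.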
For the ‘Compensation’ component, we compute joint angles (i.e. the elevated angle of a shoulder, the tilted angle of spine, and shoulder abduction) and normalized trajectories (the distance between joint positions of head, spine, shoulder joints in x, y, z axis from the initial to the current frames) to distinguish a compensated movement.
Before extracting features, we apply a moving average filter with a window size of five frames to reduce the noise of joint positions acquired from a Kinect sensor, similar to (Stone and Skubic, 2011). For each exercise motion, we compute a feature matrix over frames and features, and summarize the motion with statistics (i.e. max, min, range, average, and standard deviation) of each feature over all frames of the exercise.
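The smoothing and summarization steps can be sketched as follows; a minimal sketch in which the function names are illustrative:

```python
import numpy as np

def moving_average(joints, window=5):
    """Smooth per-frame joint positions (frames x dims) with a moving
    average filter over a five-frame window, as described above."""
    kernel = np.ones(window) / window
    # Apply the filter independently to each coordinate dimension.
    return np.apply_along_axis(
        lambda col: np.convolve(col, kernel, mode="same"), 0, joints)

def summarize_feature(values):
    """Summary statistics of one feature over all frames of a motion."""
    values = np.asarray(values, dtype=float)
    return {
        "max": values.max(), "min": values.min(),
        "range": values.max() - values.min(),
        "average": values.mean(), "std": values.std(),
    }
```

Each per-frame feature (e.g. a joint angle) would be summarized this way, and the concatenated statistics form the feature vector of a motion.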
4. Intelligent Decision Support System for Stroke Rehabilitation Assessment
Based on identified therapists’ needs, we designed and implemented an intelligent decision support system (Figure 2) that can identify salient features for assessment using reinforcement learning to predict the quality of motion and generate user-specific analysis that includes feature analysis, images of salient frames, and trajectory trends (Figure 3). This system enables therapists to review alternative perspectives of patient’s performance with quantitative user-specific analysis for assessment.
4.1. Prediction Model
The Prediction Model (PM) applies a supervised learning algorithm to predict the quality of motion on each performance component. We explore various traditional supervised learning algorithms: Decision Trees (DTs), Linear Regression (LR), Support Vector Machine (SVM) using the ‘Scikit-learn’ (Pedregosa et al., 2011) library and Neural Networks (NNs) using ‘PyTorch’ (Paszke et al., 2017) library.
For DTs, we implement Classification and Regression Trees (CART) to build pruned trees. For LR models, we apply regularization (an L1 or L2 penalty, or their linear combination as in ElasticNet) to avoid overfitting. For SVMs, we apply either linear or Radial Basis Function (RBF) kernels with a penalty parameter C. For NNs, we grid-search various architectures (i.e. one to three layers of hidden units) and an adaptive learning rate with different initial learning rates. We apply ‘ReLU’ activation functions and the ‘AdamOptimizer’ and train a model until the optimization tolerance or the maximum number of iterations is reached.
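The model comparison above can be organized as a small grid search with ‘Scikit-learn’; a sketch covering a subset of the model families, with illustrative parameter grids rather than the paper's exact values:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier

# Illustrative grids; the exact values used in the paper are not all listed.
candidates = {
    "DT": (DecisionTreeClassifier(), {"max_depth": [3, 5, None]}),
    "SVM": (SVC(), {"kernel": ["linear", "rbf"], "C": [0.1, 1, 10]}),
    "NN": (MLPClassifier(max_iter=500),
           {"hidden_layer_sizes": [(32,), (32, 32), (32, 32, 32)],
            "learning_rate_init": [1e-2, 1e-3]}),
}

def best_model(X, y, name):
    """Fit the named model family and return its best estimator by F1."""
    estimator, grid = candidates[name]
    search = GridSearchCV(estimator, grid, scoring="f1_macro", cv=3)
    return search.fit(X, y).best_estimator_
```

In practice, the search would be run once per exercise and performance component, with the held-out patient excluded as described in Section 7.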
4.2. Feature Selection using Reinforcement Learning
Analysis of kinematic variables is an important way for therapists to quantitatively understand a patient's performance (Wu et al., 2000). Yet, simply presenting all variables can overwhelm therapists and limit their ability to gain insights on a patient's performance. Given their limited availability while managing multiple patients, therapists want to minimize the amount of time spent analyzing kinematic variables while accurately diagnosing a patient's status. Thus, we aim to automatically identify salient features of assessment with machine learning.
The classical approaches of feature selection (e.g. filter, wrapper, embedded methods) (Tang et al., 2014) find a fixed feature set to the entire dataset, which applies globally to all patients. Instead, this paper utilizes a Markov Decision Process (MDP) to select a feature set for each patient’s motions. As each patient has different physical and functional status (Figure 1), we hypothesize that feature selection with MDP can be beneficial over classical feature selection approaches for personalized rehabilitation assessment.
We formulate this problem of feature selection as a Markov Decision Process (MDP), where each episode is to classify an instance and the environment is the power set of the feature space. An agent sequentially determines whether to query an additional feature or classify the sample, receiving a negative reward for acquiring a feature or misclassifying the sample. To solve this problem, we apply a Deep Q-network with Double Q-learning (Mnih et al., 2015; Van Hasselt et al., 2016) with the same Neural Network architectures as the Prediction Model (PM) (Table 6) using the ‘PyTorch’ library (Paszke et al., 2017).
Let (x, y) be a sample from a dataset, where x is a feature vector and y is the class label. Let F̄ be the set of identified features and c(f) be the cost of adding a feature f to F̄.
State Space: Let a state be s = (x, y, F̄), and the observed state without the label be s̄ = (x, F̄).
Action Space: Let A = A_c ∪ A_f denote the action set. The agent takes either an action a ∈ A_c or a ∈ A_f, where a ∈ A_c classifies the instance and a ∈ A_f queries feature f.
Reward: Let the reward function be defined as
r(s, a) = −c(f) if a ∈ A_f; 0 if a ∈ A_c and a = y; −1 if a ∈ A_c and a ≠ y.
We apply a uniform cost over features: c(f) = λ for all f, where λ = 0.01.
Transition: Let the transition function be t((x, y, F̄), a) = (x, y, F̄ ∪ {f}) if a ∈ A_f, and T if a ∈ A_c,
where T is the terminal state after outputting the classification and revealing the true label.
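The MDP above can be sketched as a small episodic environment; a simplified sketch assuming the uniform cost λ = 0.01 and integer-coded actions (the class and method names are illustrative, and a Deep Q-network agent would interact with such an environment):

```python
import numpy as np

LAMBDA = 0.01  # uniform per-feature cost, as defined above

class FeatureSelectionEnv:
    """One episode: classify one sample, paying -LAMBDA per queried
    feature, 0 for a correct classification and -1 for a wrong one."""

    def __init__(self, x, y, n_classes):
        self.x, self.y = np.asarray(x, float), y
        self.n_classes = n_classes
        self.mask = np.zeros(len(self.x))  # indicator of queried features

    def observe(self):
        # Observed state: revealed feature values plus the mask (label hidden).
        return np.concatenate([self.x * self.mask, self.mask])

    def step(self, action):
        """Actions 0..n_classes-1 classify; n_classes+f queries feature f.
        Returns (observation, reward, done)."""
        if action < self.n_classes:              # classification action
            reward = 0.0 if action == self.y else -1.0
            return self.observe(), reward, True  # terminal state T
        f = action - self.n_classes              # feature-query action
        self.mask[f] = 1.0
        return self.observe(), -LAMBDA, False
```

The features whose mask entries are set when the agent classifies a motion are the salient features presented in the interface.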
4.3. Visualization Interface
As therapists want another perspective on assessment to validate their own assessment (Table 2), this interface presents the predicted assessment, i.e. scores on the performance components. When presenting the predicted performance scores, the performance of the predictions is also included to “make clear how well the system can do” (Amershi et al., 2019). In addition, this interface presents user-specific analysis, which therapists considered “contextually relevant information” (Amershi et al., 2019) on a patient's exercise performance (Table 2). Specifically, the user-specific analysis of our interface includes the presentation of feature analysis, images of salient frames, and trajectory trends that were identified during the need-finding study. User-specific analysis with identified kinematic features is referred to as explanations of the predicted assessment throughout this paper, as described in Section 2.2.
For a simple and intuitive presentation (Table 2) of quantitative measurements of identified salient features, this interface utilizes a radar chart to effectively present multivariate data. To “avoid overwhelming” (Kulesza et al., 2015) therapists, the interface includes only the three salient features with the highest information gain. Utilizing the selected salient features (e.g. the maximum target position, maximum elbow flexion), we identify the frames in which these salient features occur to present images. In addition, as observing sequential patterns of kinematic variables provides another useful perspective in some cases (e.g. the assessment of the ‘Smoothness’ performance component), this interface shows the trajectories of three major joints (i.e. shoulder, elbow, and wrist) for upper-limb exercises. As therapists utilize a patient's unaffected motion as normality to assess the patient's performance (Table 2), this interface follows this current practice, “social norms” (Amershi et al., 2019), and includes a comparison between the affected and unaffected sides when presenting salient features and trajectory trends.
5. Experiment for System Implementation
5.1. Data Collection
We recruited 15 stroke patients and 11 healthy subjects to collect the dataset of three upper limb exercises using a Kinect v2 sensor (Microsoft, Redmond, USA). The data collection program is implemented in C# using Kinect SDK and operated on a PC with 8GB RAM and i5-4590 3.3GHz 4 Cores CPU. This program records the 3D trajectory of joints and video frames at 30 Hz. The sensor was located at a height of 0.72m above the floor and 2.5m away from a subject. The starting and ending frames of exercise movements were manually annotated during the data collection.
Before participating in the data collection, all subjects signed the consent form. Fifteen post-stroke patients (13 males and 2 females) participated in two sessions of data collection. During the first session, a therapist evaluated each post-stroke patient's functional ability using a clinically validated tool, the Fugl Meyer Assessment (FMA) (with a maximum score of 66 points) (Sanford et al., 1993). The fifteen stroke survivors have diverse functional abilities from mild to severe impairment (37 ± 21 Fugl Meyer Scores). During the second session, each stroke survivor performed 10 repetitions of each exercise with both the affected and unaffected sides. Eleven healthy subjects (10 males and 1 female) performed 15 repetitions of each exercise with their dominant arms.
‘Training Data’ (Figure 2) is composed of 165 unaffected motions from 11 healthy subjects and 150 affected motions from 15 stroke survivors to train the Prediction Model (PM).
5.2. Annotations and Design Review on Interface
For the implementation, we utilize the annotations of therapist 1 (TP 1), who had the most interaction with the recruited stroke patients, having supported their recruitment and the evaluation of their functional ability with the Fugl Meyer Assessment. TP 1 watched the recorded videos of patients' movements (Figure 2(a)) and annotated the exercise motion dataset using the scoring guideline (Table 3) without reviewing the analysis of our system (Figure 2(b), 2(c), 2(d)).
After implementing the system and interface, therapists 1 and 2 (TP 1 and 2) reviewed the web interface to detect any problems and improve its usability. During the review, TP 1 and 2 had difficulty understanding the names of the features in the feature analysis. Thus, we reviewed the feature names with TP 1 and 2 and converted them into clinically relevant terminology: we presented all feature names, described what each feature measures, and asked them to speak aloud how they would describe each feature. For instance, ‘Normalized trajectory distance of spine x’ was converted to ‘Leaning trunk to the side’.
6. Real-World User Study
We performed a user study to investigate how the information of an intelligent decision support system (e.g. predicted performance scores with feature analysis, images of salient frames, and trajectory trends) affects therapists' rehabilitation assessment. For the user study, we compared the experiences of therapists using our proposed interface (Figure 3) to two baseline interfaces: a ‘Traditional’ interface that presents only videos for assessment and a ‘Predicted Scores’ interface that presents videos with predicted scores but without any user-specific analysis. Specifically, we aim to address the following questions:
RQ 1: How do predicted assessment and user-specific analysis of our tool affect the utility of information, workload, and trust, compared to two baseline interfaces (one with only videos and the other with videos and only predicted assessment)? Do predicted assessment and user-specific analysis of our tool support more consistent assessment?
RQ 2: How do therapists utilize each user-specific analysis for assessment?
We evaluated three interfaces with respect to the following metrics: 1) subjective feedback on questionnaires, 2) logs of the web-based visualization interface, 3) agreement level of therapists’ evaluation (F1-scores).
Subjective Feedback on Questionnaires
We utilize the following questionnaires (Cai et al., 2019) to collect therapist’s subjective feedback on interfaces. All questionnaires were rated on a 7-point scale.
Usefulness: “[Tool - Condition X] is useful to understand and assess patient’s performance”
Richness: “[Tool - Condition X] generates new insights on patient’s performance”
Trust: “I can trust information from [Tool - Condition X]”
Workload: participants answered the “efforts” and “workload” dimensions of the NASA-TLX (Hart and Staveland, 1988)
Usage Intention: “I would use [Tool - Condition X] to understand and assess patient’s performance”
Preference between two interfaces: participants rated on a 7-point scale ranging from 1 (totally Condition X), 2 (much more Condition X than Y), 3 (slightly more Condition X than Y), 4 (neutral), …, 7 (totally Condition Y).
The preference is asked pairwise on three conditions/interfaces: Condition 1 (‘Traditional’ interface), Condition 2 (‘Predicted Scores’ interface), and Condition 3 (‘Proposed’ interface).
Logs of the Web Interface
Our web interfaces record a log file that counts the number of video events (e.g. ‘Play’, ‘Pause’) and measures the amount of time that a participant spends on each page/resource during assessment.
Agreement Level of Therapists’ Evaluation
Participants generate assessments of patients' exercise performance using the interfaces. To understand whether our proposed tool with user-specific analysis supports more consistent evaluation, we compute the level of agreement of therapists' evaluations (F1-score) for each interface.
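The agreement level can be computed by treating one therapist's scores as reference labels and another's as predictions, averaged over all therapist pairs; a minimal sketch (the paper's exact aggregation scheme may differ):

```python
from itertools import combinations

def f1_macro(a, b, labels=(0, 1, 2)):
    """Macro-averaged F1, treating scores `a` as reference and `b` as
    prediction, over the 0-2 performance-component scores."""
    f1s = []
    for c in labels:
        tp = sum(1 for x, y in zip(a, b) if x == c and y == c)
        fp = sum(1 for x, y in zip(a, b) if x != c and y == c)
        fn = sum(1 for x, y in zip(a, b) if x == c and y != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

def agreement(score_lists):
    """Average pairwise F1 over all pairs of therapists' score lists."""
    pairs = list(combinations(score_lists, 2))
    return sum(f1_macro(a, b) for a, b in pairs) / len(pairs)
```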
Seven therapists from four rehabilitation centers participated in the user study for the evaluation (Table 1). Note that we excluded the two therapists (TP 1 and 2) who generated annotations to implement our system and reviewed the design of the interface. After signing an informed consent form (Institutional Review Board approved), each participant was instructed on the procedure of the study and the three interfaces using dummy data (30 minutes). Each participant was then assigned the task of assessing 45 videos (around one minute per video, in which a patient performs a rehabilitation exercise) using the three interfaces (1.5 hours total), followed by post-study questionnaires and an interview (30 minutes).
Each interface was assigned a sub-task of assessing 15 videos (five patients performing three exercises). Therapist 1, who evaluated the functional ability of the 15 patients, divided them into three sub-groups of similar functional ability, so the sub-task of each interface was counterbalanced. The order of the three conditions/interfaces and the assignment of sub-tasks were randomized. After completing the sub-task on each interface, therapists responded to the questionnaires. After finishing all sub-tasks, therapists answered the preference questionnaires, and a post-study interview was conducted to understand therapists’ perspectives on the effectiveness of the proposed intelligent decision support system and the opportunities to utilize it in practice.
7. System Implementation Results
To evaluate our implementation of the Prediction Model (PM), we applied Leave-One-Subject-Out (LOSO) cross-validation on post-stroke patients: a model is trained on data from all subjects except one post-stroke survivor and tested on data from the left-out survivor. Table 4 summarizes the average F1-scores of the models for the three exercises.
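The LOSO split logic can be sketched as follows, assuming per-sample subject identifiers (the actual pipeline fits the Neural Network models within each fold; the sample data here is illustrative):

```python
def loso_splits(subject_ids):
    """Leave-One-Subject-Out: yield (subject, train_idx, test_idx)
    with one fold per held-out subject."""
    for subject in sorted(set(subject_ids)):
        test = [i for i, s in enumerate(subject_ids) if s == subject]
        train = [i for i, s in enumerate(subject_ids) if s != subject]
        yield subject, train, test

# Toy example: 6 samples from 3 patients (two exercise repetitions each).
ids = ["P1", "P1", "P2", "P2", "P3", "P3"]
folds = list(loso_splits(ids))
```

Each fold would train a model on `train` indices and report the F1-score on the held-out subject’s `test` indices; averaging over folds yields the scores in Table 4.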
| Model | Exercise 1 (E1) | Exercise 2 (E2) | Exercise 3 (E3) | Overall |
| --- | --- | --- | --- | --- |
| PM - DT | 0.6901 ± 0.0405 | 0.7645 ± 0.0867 | 0.6488 ± 0.0412 | 0.7011 ± 0.0769 |
| PM - LR | 0.7246 ± 0.0593 | 0.6430 ± 0.0982 | 0.7267 ± 0.0391 | 0.6981 ± 0.0801 |
| PM - SVM | 0.7232 ± 0.0364 | 0.6971 ± 0.0891 | 0.7410 ± 0.0052 | 0.7204 ± 0.0585 |
| PM - NN | 0.8806 ± 0.0502 | 0.8090 ± 0.0671 | 0.8115 ± 0.0436 | 0.8337 ± 0.0638 |
The Prediction Model (PM) using Neural Networks (NNs) achieves a decent agreement level with Therapist 1’s evaluation: an average F1-score of 0.8337 over the three exercises. In addition, the PM with NNs outperforms the PM with the other algorithms: Decision Trees (0.7011 average F1-score), Linear Regression (0.6981), and Support Vector Machines (0.7204). The parameters of the NNs (i.e. hidden layers/units and learning rate) that achieve the best F1-score on the classification are summarized in Table 6 in the Appendix.
We found that our system can identify salient features of assessment for an individual patient’s motions. Utilizing the Neural Network architectures of the PM (Table 6), we trained an agent that sequentially decides whether another feature is necessary to assess the quality of motion. To validate the feasibility of our implementation, we plotted the average rewards and the average number of selected features while training the agent. Figure 5 demonstrates that the agent can identify a salient subset of features, reducing the number of selected features while improving average rewards (i.e. the correct assessment of exercise motions). In addition, compared to a model with the Recursive Feature Elimination (RFE) method (Guyon et al., 2002), a classical feature selection method, our approach achieves a 0.11 higher average F1-score (using a paired t-test over three exercises and three components) and is expected to be more beneficial for generating patient-specific analysis for therapists.
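The sequential feature-selection idea can be illustrated with a heavily simplified tabular Q-learning sketch (the actual system uses deep Q-networks over kinematic features; the toy data, feature cost, and threshold classifier below are illustrative assumptions, not the paper’s implementation). At each step, the agent either pays a small cost to acquire another feature or stops and classifies:

```python
import random

random.seed(0)

# Toy data: three features per sample; only feature 0 determines the label.
DATA = [([1, 7, 4], 1), ([0, 7, 4], 0), ([1, 2, 9], 1), ([0, 2, 9], 0)]
N_FEATURES, FEATURE_COST = 3, 0.05

def classify(x, acquired):
    """Trivial classifier: threshold feature 0 if acquired, otherwise guess 0."""
    return int(x[0] >= 1) if 0 in acquired else 0

Q = {}  # (state, action) -> value; a state is the frozenset of acquired features

def actions(state):
    # One "acquire" action per unseen feature, plus a terminal "classify" action.
    return [("acquire", i) for i in range(N_FEATURES) if i not in state] \
        + [("classify", None)]

def run_episode(eps=0.1, alpha=0.2):
    """One Q-learning episode; returns (total reward, #features acquired)."""
    x, y = random.choice(DATA)
    state, total = frozenset(), 0.0
    while True:
        acts = actions(state)
        act = (random.choice(acts) if random.random() < eps
               else max(acts, key=lambda a: Q.get((state, a), 0.0)))
        if act[0] == "classify":
            reward, done = (1.0 if classify(x, state) == y else -1.0), True
            target = reward
        else:
            reward, done = -FEATURE_COST, False
            next_state = state | {act[1]}
            target = reward + max(Q.get((next_state, a), 0.0)
                                  for a in actions(next_state))
        old = Q.get((state, act), 0.0)
        Q[(state, act)] = old + alpha * (target - old)
        total += reward
        if done:
            return total, len(state)
        state = next_state

for _ in range(5000):  # training: rewards rise, selected features shrink
    run_episode()
```

After training, a greedy rollout should acquire only the informative feature before classifying, mirroring how the agent in Figure 5 improves rewards while shrinking the selected feature set.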
8. User Study Results
8.1. Subjective Feedback on Questionnaires
Figure 6 summarizes the responses of questionnaires from therapists on three interfaces: Condition 1 (‘Traditional’ interface), Condition 2 (‘Predicted Scores’ interface), and Condition 3 (‘Proposed’ interface).
Our proposed interface, Condition 3, achieves higher usefulness and higher richness than the others (Condition 1 and Condition 2), as the additional explanations of Condition 3 were considered “useful to understand patient’s condition” by therapists. In addition, participating therapists expressed higher trust in Condition 3 than in the others. Although participating therapists noted that some “predicted scores are not trustful”, the analysis of the proposed interface complements them to “understand why such predicted scores is generated”.
Participating therapists reported lower effort and lower frustration using the proposed Condition 3 than the others (Condition 1 and Condition 2). They described that the user-specific analysis of the proposed interface (feature analysis, images of salient frames, and trajectory trends) reduced the effort and frustration to “search evidences in videos”.
In addition, the proposed Condition 3 interface has higher usage intent than the others, and therapists mostly preferred Condition 3 to the two baseline interfaces (Condition 1 and 2). On the preference questionnaire between Condition 1 and 3, two out of seven therapists ‘totally’ preferred Condition 3, four ‘much more’ preferred Condition 3 than Condition 1, and one ‘much more’ preferred Condition 1 than Condition 3. On the preference questionnaire between Condition 2 and 3, four out of seven therapists ‘totally’ preferred Condition 3, two ‘much more’ preferred Condition 3 than Condition 2, and one ‘slightly more’ preferred Condition 2 than Condition 3. Although one therapist still acknowledged the usefulness of the proposed interface, this therapist preferred the interface without predicted assessment and user-specific analysis. Participating therapists commented that Condition 3 with predicted assessment and additional user-specific analysis ‘is very interesting’ and ‘gives me insights to assess patient’s performance’.
In summary, Condition 3 with predicted assessment and patient-specific analysis achieved positive responses on all aspects: our proposed interface provides more useful and richer information to understand patient’s performance, elicits higher trust in the system, reduces therapists’ effort and frustration in finding evidence for assessment, and is more likely to be used in clinical practice. However, the score differences with the baseline interfaces are not statistically significant except for the usefulness and usage-intent aspects.
8.2. Logs of the Web Interface
Table 5 describes the measurements from the logs (i.e. average number of video events and time on video/analysis per assessment) for the three interfaces. Our proposed interface, Condition 3, has a significantly lower average number of video events than the others (Condition 1 and Condition 2). This lower number of video events indicates that Condition 3 leads to fewer video playbacks than the others.
Condition 3 with additional user-specific analysis has a longer average time on assessment than Condition 1 and a shorter average time on assessment than Condition 2. However, when we analyze the average time on each type of information, specifically videos, Condition 3 shows a significantly lower average time on videos than the others (Condition 1 and Condition 2).
8.3. Agreement Level of Therapists’ Evaluation
Figure 7 shows the agreement level of therapists’ evaluations on the three interfaces. Our proposed interface, Condition 3, with predicted assessment and user-specific analysis (i.e. feature analysis, salient frames, and trajectory), achieves higher agreement among participating therapists’ evaluations (0.7108 F1-score) than Condition 1 (0.6600 F1-score) and Condition 2. Although both Condition 2 and 3 achieve a higher agreement level than Condition 1, the difference between Conditions 1 and 2 is not statistically significant, while the difference between Conditions 1 and 3 is. This indicates the positive effect of including user-specific analysis in Condition 3 on improving the agreement level of therapists’ evaluations.
8.4. Post-study Interviews
After completing the study, we collected general feedback on our proposed interface (Condition 3) from therapists. Specifically, we asked about therapists’ opinions on, and usage of, the interface, and the possibility of accepting it in current practice.
Overall, therapists considered the proposed interface “a good platform” (TP 5) for rehabilitation assessment. The user-specific analysis of the interface (e.g. feature analysis, images of salient frames, and trajectory trends) “brings more interesting, new aspects of a patient” (TP 3) and enables therapists to “understand why the predicted assessment is suggested” (TP 9). Specifically, therapists found the feature analysis (Figure 2(b)) “easy and intuitive” (TP 9) to “quickly observe the quantitative difference between unaffected and affected sides” (TP 6) for assessment. Images of salient frames (Figure 2(c)) “was helpful to validate feature analysis” (TP 9). Trajectory trends (Figure 2(d)) “was useful to review the overall trends” (TP 8) and “the duration of a motion” (TP 9), which were “helpful to assess the smoothness of a motion” (TP 4).
Therapists described two different patterns of using the interface for assessment. One pattern is to first “review the feature analysis to get the overview of quantitative difference between unaffected and affected side” (TP 7) and then “validate quantitative feature analysis with images of salient frames” and “trajectory trends” (TP 9). Some therapists preferred to get an initial insight for assessment from the feature analysis, because “graphics on quantitative difference between unaffected and affected sides are useful and fast to get insights and validate my assessment” (TP 7). The other strategy is to first “observe the trajectory trends to understand the overview of a motion” (TP 8) and then “review the detailed, quantitative feature analysis” and “images of salient frames” (TP 8). These therapists reviewed the trajectory analysis first because it “provides various insights (e.g. duration, amplitude, and tremor) together” to improve and validate the therapist’s hypothetical assessment.
After reviewing predicted assessment and user-specific analysis, therapists were able to determine whether the system made a mistake and to understand its capabilities. Even if the predicted scores of the interface sometimes mismatched the therapist’s assessment, therapists considered “the proposed interface is trustful” (TP 9) in the sense that “I can review patient-specific analysis to understand whether a system fails to predict correctly or I make a mistake” (TP 4). For example, TP 9 commented that “the prediction on range of motion (ROM) seemed to be aligned most of time with my hypothetical assessment and insights from user-specific analysis”. In contrast, TP 9 mentioned that predictions of compensation sometimes do not perform well, because the system “does not provide a prediction that is aligned with mine and include leaning trunk to the side” feature to predict compensation of a patient who “compensates trunk to the side”. Overall, therapists developed a mental model in which they would trust and rely more on the prediction of ROM and less on the predictions of Smoothness and Compensation. Thus, the user-specific analysis of our interface assisted therapists in understanding the system’s capability to predict each performance component and in developing different levels of trust in the system’s predictions of each performance component.
In addition, therapists considered that the user-specific analysis of the proposed interface, Condition 3, “reduces their efforts and frustration on the assessment” (TP 6). “When only video is presented” (Condition 1 - ‘Traditional’), therapists have “difficulty to consider different perspectives on assessment at the same time” (TP 9), so they “had to replay a video multiple times” (TP 3). In contrast, the proposed interface provides insights on the patient’s performance (i.e. predicted scores and user-specific analysis), which reduce therapists’ effort and frustration in repeatedly watching videos and searching for evidence for assessment. Therapists considered the user-specific analysis of each performance component useful to reduce the complexity of assessment: “It is complex and challenging process to simultaneously reviewing multiple aspects of assessment while watching a video. In contrast, user-specific analysis on each component from the interface simplified my assessment process” (TP 9). In addition, TP 9 commented that after getting used to the user-specific analysis of our interface, “I started reducing replay a video to search clues for assessment” and “relied more on analysis of the interface”, because the interface “quickly presents various quantitative measurements for assessment, which I have to speculate while watching a video”.
As the proposed interface is “easy to use” and “quickly summarizes quantitative data with graphics to provide insights of patient’s performance” (TP 9), therapists were positive about adopting the interface in their practice. Therapists commented that they currently “do not have much quantitative data to analyze and discuss with patients” (TP 7). They described that predicted assessment and user-specific analysis with quantitative data from our interface could facilitate “understanding on patient’s performance and communication it with patients” (TP 4). In addition, some therapists considered that the interface might be “helpful to motivate patient’s participation in rehabilitation program” (TP 9) by tracking and presenting the patient’s progress with quantitative data.
9. Discussion
In this section, we synthesize our findings and discuss design recommendations for creating a decision support system for rehabilitation assessment: the importance of 1) presenting appropriate and salient data and 2) deriving an adaptive system for a personalized human and machine collaborative decision support system. We also describe the limitations of our study.
9.1. Design Recommendations
Presenting Appropriate and Salient Data to Understand Capabilities of a System
Clinical decision support systems with machine learning algorithms have the potential to improve current healthcare practices. However, as mentioned earlier, a machine learning model cannot perfectly replicate an expert’s knowledge and decision making, and a system without supplementary explanations/information might not be adopted in practice. Our findings demonstrate that an intelligent decision support system can automatically identify salient features of decision making (e.g. rehabilitation assessment) to predict an expert’s decision and generate explanations of its prediction (e.g. user-specific analysis with kinematic features). User-specific analysis from the system enables experts to gain new insights into the patient’s performance and reduces the effort and workload of assessment. In addition, while reviewing user-specific analysis, therapists can validate their hypothetical assessment and the correctness of the system to understand its competence on a task. Although we demonstrate the feasibility of a decision support system for stroke rehabilitation assessment, the applied techniques of feature selection and prediction models can be utilized in other sub-domains of rehabilitation (e.g. knee rehabilitation (Huang, 2015)), where an expert with limited availability needs to make a decision.
Adaptive Systems for Personalization
During the study, we observed that therapists developed different usage patterns of the decision support system and utilized different functionalities or information of the system based on the task and their own knowledge. Some therapists preferred to get an initial insight for assessment from the feature analysis, while others reviewed the trajectory analysis first (Section 8.4). Thus, we recommend making a system adaptable to each therapist’s preference so that therapists can quickly collect the information/evidence necessary for their decision making in practice.
In addition, TP 6 suggested that it would be useful if a therapist could tune the system by including/excluding identified features, both to utilize different features based on individual therapists’ experience and to correct mismatched prediction scores. For instance, as the compensation prediction of the interface is incorrect from time to time when a patient performs leaning compensation to the side, TP 9 suggested including a “leaning trunk to the side” feature to predict compensation. Thus, designers should consider applying interactive techniques (Kulesza et al., 2015; Lee, 2019) to make a system adaptive and personalized for better integration into each therapist’s clinical practice.
Towards Human and Machine Collaborative Systems
In summary, instead of requiring therapists to manually review abundant features, machine intelligence can automatically identify salient features that provide useful insights into the patient’s performance, support validating the system’s predictions, and assist therapists’ assessment. After reviewing the predicted assessment and the automatically generated patient-specific analysis of the system, therapists understood its strengths and limitations and considered how it could support them. Although the system sometimes disagreed with the therapist’s decision, therapists considered reviewing explanations helpful to “understand patient’s performance and validate my/therapist’s assessment” (TP 7). After improving their understanding of the patient’s performance, therapists generated more consistent evaluations (Figure 7). In addition, a therapist was able to enumerate how to improve an imperfect system. A promising direction for future research is to explore how human and machine intelligence can complement each other to improve complex decision making.
9.2. Limitations
This study aims to investigate how an intelligent decision support system with predicted assessment and user-specific analysis can support therapists’ assessment of rehabilitation exercises. One limitation of this study is the small sample size of participating therapists: seven therapists from four rehabilitation centers do not represent all therapists. However, such a small sample size is not unusual among similar studies (Hofmann et al., 2019). In addition, although therapists expressed positive opinions about an intelligent decision support system, we evaluated only one possible type of therapist decision making, rehabilitation assessment. Other types of decision making are worth exploring and require further validation.
10. Conclusion
In this paper, we presented the needs of therapists during rehabilitation assessment and designed a decision support system that identifies salient features to predict the quality of motion and generates user-specific analysis as explanations of its predictions. We then evaluated this system with seven therapists from four rehabilitation centers to investigate how the predicted assessment and user-specific analysis of the system affect therapists’ decision making in stroke rehabilitation assessment. Presenting predicted assessment and user-specific analysis increases trust in the system and brings new insights for assessment. In addition, the proposed system enables therapists to reduce their workload (e.g. repeatedly watching videos to identify evidence) and to generate more consistent assessments. Our work highlights the importance of creating user-centered and trustworthy machine learning-based systems to augment experts’ decision making processes and be deployed in practice. We believe this study can be a valuable reference for developing decision support systems for rehabilitation assessment and other critical decision support.
|E1||(32, 32, 32) / 0.1||(16) / 0.0001||(256, 256) / 0.1|
|E2||(256) / 0.1||(64, 64) / 0.001||(128, 128) / 0.1|
|E3||(256) / 0.1||(64, 64) / 0.001||(128, 128) / 0.1|
- Guidelines for human-AI interaction. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, pp. 3.
- Clinical decision support systems. Vol. 233, Springer.
- Explanation and justification in machine learning: a survey. In IJCAI-17 Workshop on Explainable AI (XAI), Vol. 8, pp. 1.
- Human-centered tools for coping with imperfect algorithms during medical decision-making. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, pp. 4.
- Quantitative measurement of motor symptoms in Parkinson’s disease: a study with full-body motion capture data. In 2011 Annual International Conference of the IEEE Engineering in Medicine and Biology Society, pp. 6789–6792.
- Barriers and facilitators to clinical decision support systems adoption: a systematic review. Journal of Business Administration Research 3 (2), pp. 36.
- Datum-wise classification: a sequential approach to sparsity. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 375–390.
- Global burden of stroke. Circulation Research 120 (3), pp. 439–448.
- Explainable artificial intelligence (XAI). Defense Advanced Research Projects Agency (DARPA) 2.
- Gene selection for cancer classification using support vector machines. Machine Learning 46 (1–3), pp. 389–422.
- Development of NASA-TLX (Task Load Index): results of empirical and theoretical research. In Advances in Psychology, Vol. 52, pp. 139–183.
- Motor recovery after stroke: a systematic review of the literature. Archives of Physical Medicine and Rehabilitation 83 (11), pp. 1629–1637.
- “Occupational therapy is making”: clinical rapid prototyping and digital fabrication. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, pp. 314.
- Exploring in-home monitoring of rehabilitation and creating an authoring tool for physical therapists. Ph.D. Thesis, Carnegie Mellon University.
- Classification with costly features using deep reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 3959–3966.
- Reasons for physicians not adopting clinical decision support systems: critical analysis. JMIR Medical Informatics 6 (2), pp. e24.
- Mind the gap: a generative approach to interpretable feature selection and extraction. In Advances in Neural Information Processing Systems, pp. 2260–2268.
- Principles of explanatory debugging to personalize interactive machine learning. In Proceedings of the 20th International Conference on Intelligent User Interfaces, pp. 126–137.
- Assessment of upper extremity impairment, function, and activity after stroke: foundations for clinical decision making. Journal of Hand Therapy 26 (2), pp. 104–115.
- Learning to assess the quality of stroke rehabilitation exercises. In Proceedings of the 24th International Conference on Intelligent User Interfaces, pp. 218–228.
- A technology for computer-assisted stroke rehabilitation. In 23rd International Conference on Intelligent User Interfaces, IUI ’18, New York, NY, USA, pp. 665–666.
- Intelligent agent for assessing and guiding rehabilitation exercises. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-19, pp. 6444–6445.
- An intelligent decision support system for stroke rehabilitation assessment. In The 21st International ACM SIGACCESS Conference on Computers and Accessibility, pp. 694–696.
- Human-level control through deep reinforcement learning. Nature 518 (7540), pp. 529.
- Kinematic variables quantifying upper-extremity performance after stroke during reaching and drinking from a glass. Neurorehabilitation and Neural Repair 25 (1), pp. 71–80.
- Physical rehabilitation. FA Davis.
- Automatic differentiation in PyTorch.
- Scikit-learn: machine learning in Python. Journal of Machine Learning Research 12 (Oct), pp. 2825–2830.
- Task-oriented training in rehabilitation after stroke: systematic review. Journal of Advanced Nursing 65 (4), pp. 737–754.
- Movement smoothness changes during stroke recovery. Journal of Neuroscience 22 (18), pp. 8297–8304.
- Reliability of the Fugl-Meyer assessment for testing motor performance in patients following stroke. Physical Therapy 73 (7), pp. 447–454.
- Architecture and applications of virtual coaches. Proceedings of the IEEE 100 (8), pp. 2472–2488.
- Mispredictions and misrecollections: challenges for subjective outcome measurement. Disability and Rehabilitation 30 (6), pp. 418–424.
- Evaluation of an inexpensive depth camera for in-home gait assessment. Journal of Ambient Intelligence and Smart Environments 3 (4), pp. 349–361.
- Feature selection for classification: a review. Data Classification: Algorithms and Applications, pp. 37.
- Wolf Motor Function Test (WMFT) manual. Birmingham: University of Alabama, CI Therapy Research Group.
- Deep reinforcement learning with double Q-learning. In Thirtieth AAAI Conference on Artificial Intelligence.
- Systematic review of Kinect applications in elderly care and stroke rehabilitation. Journal of Neuroengineering and Rehabilitation 11 (1), pp. 108.
- A kinematic study of contextual effects on reaching performance in persons with and without stroke: influences of object availability. Archives of Physical Medicine and Rehabilitation 81 (1), pp. 95–101.