System Design for a Data-driven and Explainable Customer Sentiment Monitor



The most important goal of customer service is to keep the customer satisfied. However, service resources are always limited and must be prioritized. Therefore, it is important to identify customers who may become unsatisfied and whose cases might escalate. Today, this prioritization of customers is often done manually. Data science on IoT data (especially log data) for machine health monitoring and analytics on enterprise data for customer relationship management (CRM) have mainly been researched and applied independently. In this paper, we present a framework for a data-driven decision support system which combines IoT and enterprise data to model customer sentiment. Such decision support systems can help to prioritize customers and service resources to effectively troubleshoot problems or even avoid them. The framework is applied in a real-world case study with a major medical device manufacturer. This includes a fully automated and interpretable machine learning pipeline designed to meet the requirements defined with domain experts and end users. The overall framework is currently deployed; it learns and evaluates predictive models from terabytes of IoT and enterprise data to actively monitor the customer sentiment for a fleet of thousands of high-end medical devices. Furthermore, we provide an anonymized industrial benchmark dataset for the research community.

customer service, decision support system, IoT data, explainable AI, machine learning, big data.


1 Introduction

Companies are interested in monitoring the performance of their installed systems. The success of a system depends on the health status of the machine (e.g. derived from IoT data like event logs) and on customer perception (e.g. derived from ticket data). However, these two perspectives are mostly treated separately in the literature. The machine health perspective is often considered in disciplines like predictive maintenance or, more generally, prognostic health management Lei_Li_Guo_Li_Yan_Lin_2018. Sipos et al. Sipos_Fradkin_Moerchen_Wang_2014, for example, used a data-driven approach based on multiple-instance learning from event log data for predictive maintenance of high-end medical devices. Additionally, event log data are analyzed for intrusion detection Tuor_Kaplan_Hutchinson_Nichols_Robinson_2017; Kim_Kim_Thu_Kim_2016 or failure detection in data and computing centers Du_Li_Zheng_Srikumar_2017; Zhang_Xu_Min_Jiang_Pelechrinis_Zhang_2016. On the other hand, the customer perspective is emphasized in the framework of customer relationship management (CRM), which is a broad discipline including strategies and processes for organizations to handle customer interactions and to keep track of all customer-related information Soltani_Navimipour_2016. Customer escalations are mostly predicted based on ticket data only Montgomery_Damian_2017; Werner_Li_Damian_2019. In manufacturing companies (e.g. for medical devices), available data typically falls into two distinct groups. The first group is the IoT data, i.e. the machine logs generated on the system. The second group comes from complementary enterprise systems. This includes ticketing systems for service activities, spare part consumption, and reported system malfunctions. Keeping customers satisfied with the operation of their systems is crucial for the success of a medical device manufacturer. Therefore, it is important to identify unsatisfied customers whose cases might escalate.
Hence, a framework that uses both data sources to combine these two perspectives would be desirable. Such a system should combine existing IoT (log) data and enterprise data. It could serve as a decision support system for the end user to encourage data-driven and therefore more objective decision-making. In this paper, we present a fully automated end-to-end machine learning framework which combines both data sources to model customer sentiment. We show that customer sentiment can be better estimated when looking at the system performance based on both the machine log data (e.g. to detect system malfunctions affecting the customer) and enterprise data (e.g. ticket data from customer interactions). We use historical data of escalations as labels for our predictive models to continuously learn a probability of escalation as an estimate of the customer sentiment. The resulting decision support system helps to better prioritize customers and troubleshoot problems. Our main contributions are the concrete problem formulation and the proposed solution, which combines log and enterprise data to increase predictive power and interpretability in the real-world case study. The remainder of the paper is structured as follows: Section 2 describes the problem to be solved as well as the data sources. Section 3 describes the overall methodology. Section 4 presents the experimental results. Then, Section 5 discusses the results and presents the proposed workflow. Section 6 concludes our paper and discusses future research directions.

2 Problem Description

2.1 Business Problem

Customer satisfaction, and hence service resource prioritization, is a key priority in many organizations. Here, we analyze data from a large fleet of high-end medical devices installed worldwide. Customers, as well as local service entities, naturally differ in the way they communicate and document problems. This inevitably leads to situations where customers facing similar problems address the service provider in vastly different ways. Hence, objectively prioritizing customers and service resources is a hard problem. Combining relevant information from machine log and enterprise data could potentially help to better understand problems in the field and how they affect the customer sentiment. Therefore, we design a data-driven decision support system to help prioritize customers based on an estimated sentiment. This can help to minimize unexpected escalations as a result of more proactive customer support. The case study at hand was conducted with a major medical device company for a fleet of thousands of high-end systems used by customers worldwide. Major challenges are the amount, heterogeneity, and complexity of the different data sources.

2.2 Data Description

In order to solve the business problem at hand we make use of two major data sources which we describe here in more detail.

Log Data

Log data is a time-based protocol of events recorded by different components of a medical system. An event consists of a timestamp (indicating when the event occurred), an event source (specifying which system component generated the event), an event id (representing a category of similar event types by the given event source), an event severity (typically: information, warning, and error), and a message text (describing the event and giving more details like sensor values). Events are defined and implemented by the developers of each particular system component. The severity and the amount of sensor data logged are decided by each individual developer.
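The event structure described above can be illustrated with a small record type. This is only a minimal sketch: the pipe-separated line format, the class names `LogEvent` and `Severity`, and the example values are assumptions made for illustration, not the actual log format of the medical systems.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum


class Severity(Enum):
    INFORMATION = "information"
    WARNING = "warning"
    ERROR = "error"


@dataclass(frozen=True)
class LogEvent:
    timestamp: datetime  # when the event occurred
    source: str          # system component that generated the event
    event_id: int        # category of similar events for this source
    severity: Severity
    message: str         # free text, may contain sensor values


def parse_line(line: str, sep: str = "|") -> LogEvent:
    """Parse one line of the hypothetical format
    'ISO timestamp|source|event id|severity|message'."""
    # Split at most four times so the message may itself contain `sep`.
    ts, source, event_id, severity, message = line.split(sep, 4)
    return LogEvent(
        timestamp=datetime.fromisoformat(ts.strip()),
        source=source.strip(),
        event_id=int(event_id),
        severity=Severity(severity.strip().lower()),
        message=message.strip(),
    )


event = parse_line(
    "2020-03-02T08:15:30|TableControl|4711|Warning|Motor temperature 71.5 C"
)
print(event.source, event.event_id, event.severity.name)
```

Which combination of `source`, `event_id`, and `message` is treated as a unique event type is a modeling choice, as noted in the text that follows.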

Depending on which combination of event-source, event-id and message-text we define as unique, we get approximately different events. There can theoretically be an unlimited number of distinct message texts depending on the usage. One system typically generates from to of these events per day. A typical system family having several thousand installed systems worldwide would then generate up to GB of log data per day.

These log files are typically used by customer support centers to diagnose problems as well as by the original system developers to track whether their developed systems work as intended.

Major challenges for analyzing log data are the volume and complexity. We describe later how we automatically extract relevant information from incoming log files.

Enterprise Data Sources

Enterprise data sources are mostly collected by and stored in enterprise resource planning (ERP) systems. Types of enterprise data are:

  • service activity data / ticket data - documenting all customer interactions and problems which occur

  • spare part data - typically related to service activities, includes which spare parts have been used for maintenance / repair of a system

  • customer base / contract data - listing all customers and the corresponding relationships, especially what kind of service level has been signed

Major challenges regarding the enterprise data are:

  • getting a consistent picture for all customers and service activities worldwide, which is made more difficult by different local ERP systems

  • manual data inputs contain errors due to typos and incorrect usage

  • worldwide standards differ considerably, especially since there is no precise definition of what a "well running" medical system is; therefore, the interpretation of service data can differ from country to country

  • regional ticket data is often written in the local language

Globally operating companies can have several levels of customer service centers ranging from regional to global and all of them are generating ticket data. In our case, we consider three different ticket levels from regional to global. Furthermore, we analyze tickets generated by an information system tracking escalations from customer service to the R&D department.

2.3 Requirements

There are special requirements to be met for a successful deployment of a decision support system in a real-world scenario such as the presented case study. We describe these requirements in this section and adapt all design decisions accordingly. During the whole development life cycle, from proof-of-concept up to deployment, we worked closely with domain experts and stakeholders from all relevant departments, including potential end users of the implemented decision support system. Thus, we can ensure that we meet all requirements and build a framework which has a real impact for the end users.

  • Dynamics and Efficiency: Currently, decisions about escalations are made on a weekly basis. Hence, our framework must also process data and provide predictions on a weekly basis. The overall framework should be capable of efficiently loading new data, extracting features, training a model, and performing predictions every week. This should be done within a few hours, e.g. on Monday mornings.

  • Model Performance and Output: The escalation flags (highest escalation level) used as a label in this case study are very sparse and noisy. This causes special challenges for the prediction task. A binary output is not desired, but rather a probability of escalation which models the customer sentiment. Customers with the largest probabilities will then be analyzed in more detail by the end users. Therefore, it is not the main goal to design a machine learning model, which perfectly predicts escalations, but rather a system that helps to identify, based on the designed features, which customers might need special attention.

  • Interpretability: This is especially important for real-world applications such as the present case study, since the end user wants to understand the reasoning behind the output of the prediction model, not only to take the appropriate actions but also to build trust. We extract specific features from the log and enterprise data. These features were designed together with domain experts and end users to incorporate prior knowledge and interpretable features into the decision support system.

  • Usability: An interactive application was developed based on a commercial Business Intelligence tool which is currently in use by the medical device provider. Usability also includes considerations of what data sources need to be provided based on the end user’s request. The ability to interact with the provided decision support system enables continuous feedback for validation. Based on the explanations of the model and features, the end user can decide if the predictions are valid and with that provide more and cleaner labels for future prediction cycles. We will later describe an envisioned workflow for the designed decision support system.

3 Methodology

In this section, we describe the designed framework to solve the business problem at hand. A high-level overview of the implemented framework is depicted in Fig. 1. Hundreds of gigabytes of incoming log and enterprise data from all over the world are automatically analyzed via a log evaluation framework to calculate relevant features designed together with domain experts. We design an automated and interpretable machine learning pipeline to calculate a probability of escalation, which models the customer sentiment. The provided decision support system includes an explanation for the calculated probability, as well as historical feature data based on extracted log and enterprise data for the end user. There are major benefits when combining both data sources from the user perspective. Depending on which features explain a prediction, it is possible to identify problems as either being more related to R&D (log features) or customer service (enterprise features).

Figure 1: High-level schema of the overall processing pipeline. Systems from all over the world are sending log data. Additionally, enterprise data (sales and ticket data) are collected. Features are extracted based on domain knowledge in order to train a predictive model. The resulting data and predictions are integrated into a decision support system for the end user.

3.1 Build Dataset

Algorithm 1 describes how we built the dataset for the experimental setup. In the following, we will describe the process in more detail.


Let be the set of all customers. Fig. 2 depicts the labeling approach for one example customer using a sliding window approach. The step size is set to one week. We set the window size to steps, which was proposed by domain experts. Different values were also evaluated but did not yield an improvement. From this window a feature vector is extracted, where is defined as the last week in a window. Let be the set of all escalation flags for customer and a specific time point of an escalation flag for customer . The predictive interval is set to steps. If there has been an escalation in the predictive interval, the label () for the sample will be set to and otherwise (lines 9-12). After an escalation , the following steps are defined as an infected interval. All samples where the sliding window contains weeks from the infected interval are excluded (lines 14-15). This was defined together with domain experts. We assume that there is already a special focus on customers for which a recent escalation occurred. As described in lines 2-3, we repeat this procedure for all customers for a fixed time frame of steps, which in our case is equivalent to 2 years. The dataset contains samples for all customers with complete data (line 4). This results in the distributions depicted in Fig. 3. Note that the number of customers is increasing while the number of escalations remains almost constant over time. This is due to limited service resources, which leads to an approximately constant number of customers put into focus each week. We provide an anonymized version of for the research community as an industrial benchmark nguyen_an_2020_4383145.
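The labeling procedure for a single customer can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `build_samples` and the default values for `window`, `horizon` (predictive interval), and `infected` are placeholders rather than the values used in the case study, weeks are plain integers, and the escalation week itself is treated as part of the infected interval.

```python
def build_samples(escalations, first_week, last_week,
                  window=4, horizon=2, infected=3):
    """Return [(t, label)] for one customer, where t is the last week of
    the sliding window. All defaults are placeholder values."""
    escalations = sorted(escalations)

    # Weeks considered infected: the escalation week itself (assumption)
    # plus the `infected` weeks that follow it.
    infected_weeks = set()
    for e in escalations:
        infected_weeks.update(range(e, e + infected + 1))

    samples = []
    for t in range(first_week + window - 1, last_week + 1):
        window_weeks = range(t - window + 1, t + 1)
        if any(w in infected_weeks for w in window_weeks):
            continue  # window overlaps an infected interval: drop sample
        # Positive if an escalation falls into weeks t+1 .. t+horizon.
        label = int(any(t < e <= t + horizon for e in escalations))
        samples.append((t, label))
    return samples


samples = build_samples(escalations=[10], first_week=0, last_week=20)
print(samples)
```

With a single escalation in week 10, the two windows ending in weeks 8 and 9 are labeled positive, the windows ending in weeks 10 to 16 are discarded, and the first valid sample afterwards ends in week 17.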

Figure 2: Description of the labeling approach for one example customer . We assume that (last week in window) starts at some point . (a) Sample with negative label ( ) since there is no escalation within predictive interval of weeks. (b), (c) Samples are labeled positive () since there is an escalation within the predictive interval. (d) We exclude infected samples from the dataset. (e) First valid sample after the infected interval.
Figure 3: Sample distribution for the whole dataset . The graph depicts all samples () and positive samples () for .

Feature Extraction

Weekends and public holidays can introduce noise into calculated features. Therefore, we decided together with domain experts to aggregate features on a weekly basis. For all customers and prediction weeks , we calculate features for a window of weeks (line 2-8).
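As a minimal sketch of such a weekly aggregation, the following snippet counts events per ISO calendar week. The helper names `week_key` and `weekly_counts` are hypothetical, and the choice of ISO weeks (Monday to Sunday) is an assumption; the deployed pipeline may define week boundaries differently.

```python
from collections import Counter
from datetime import date


def week_key(day: date) -> tuple:
    """Return the ISO (year, week) pair of a date."""
    iso = day.isocalendar()
    return (iso[0], iso[1])


def weekly_counts(event_days):
    """Count events per ISO week from an iterable of dates."""
    return Counter(week_key(d) for d in event_days)


# Monday, Saturday and Sunday of one week, plus the following Monday.
days = [date(2020, 3, 2), date(2020, 3, 7), date(2020, 3, 8), date(2020, 3, 9)]
counts = weekly_counts(days)
print(sorted(counts.items()))
```

Because a whole week is summed up, it does not matter whether an event falls on a working day, a weekend, or a public holiday within that week.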

Log Data
Due to the high volume and complexity of the existing data sources, feature extraction is required. The machine log data described in 2.2.1 cannot feasibly be analyzed in its raw format. We use a log evaluation framework to detect the occurrences of specific sequences of events, determined by domain experts. The extracted features have clear meanings and are related to specific system malfunctions, which can affect customers in their daily work routines. Such features include, for example: abort of operation, system delay, user interface (UI) freeze, and UI pop-ups. We also extract whether a software (SW) update was performed for a system.

Enterprise Data
The enterprise data available can be split in two connected groups - sales data and customer service tickets - as described in 2.2.2. Features for sales data are the number and total cost of replaced parts. Features derived from ticket data include the number of open tickets, age of the oldest open ticket, the rated severity, as well as the number of site visits for each customer depending on availability in the different ticketing systems. These features can be extracted on a global level.

3.2 Modeling

In order to meet the requirements described in Section 2.3, we selected a specific approach to model the customer sentiment. Major challenges are the large class imbalance (Fig. 3) and a significant amount of label noise resulting from manual escalation decisions. The output of a machine learning model can be helpful in several ways. First, we can identify which specific problems depicted in the designed features lead to escalation. Additionally, we can identify customers who have similar problems and might need special attention. We compare different machine learning methods: ensembles of decision trees and deep neural networks (DNNs). For both methods, we calculate post-hoc explanations for each prediction using either a tree explainer Lundberg_Erion_Chen_DeGrave_Prutkin_Nair_Katz_Himmelfarb_Bansal_Lee_2020 or a modification of DeepLift Lundberg_Lee_2017. Both algorithms are implemented in the SHAP library Lundberg_Lee_2017 and we refer to the explanatory outputs as SHAP values.

Ensemble of Decision Trees

Ensembles of decision trees have the following benefits:

  • The computed feature importance Genuer_Poggi_Tuleau-Malot_2010; Hastie_Tibshirani_Friedman helps end users understand which of the designed features are "correlating" with escalations/customer sentiment.

  • Ensemble methods provide a probability as a model output which can be interpreted as the customer sentiment (probability for escalation).

  • Since each combination of time point (week) in a window and designed feature is modeled as a single input variable, we can provide the relevance of each input variable for all predictions to the end user for better troubleshooting.
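The last point, modeling every (week, feature) combination as its own input variable, can be sketched as a simple flattening step. The function name `flatten_window` and the naming scheme `<feature>@t-<lag>` are hypothetical and only serve to illustrate the idea.

```python
def flatten_window(window):
    """Turn a list of weekly feature dicts (oldest week first) into one
    flat mapping of input variables named '<feature>@t-<lag>'."""
    flat = {}
    last = len(window) - 1
    for pos, week in enumerate(window):
        lag = last - pos  # 0 for the most recent week in the window
        for name, value in week.items():
            flat[f"{name}@t-{lag}"] = value
    return flat


window = [
    {"open_tickets": 1, "ui_freeze": 0},  # previous week
    {"open_tickets": 3, "ui_freeze": 2},  # most recent week
]
flat = flatten_window(window)
print(flat)
```

Since each flattened variable keeps both the feature name and the week, a relevance score attached to it can be reported to the end user per feature and per week.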

The decision tree ensemble methods we select are Random Forest (RF) breiman2001 and XGBoost (XGB) Chen_Guestrin_2016. Random Forest and XGBoost are ensemble learning techniques which can be used for both classification and regression. In our case, we are interested in classification. In general, a is a collection of weak classifiers where each classifier gets the same input and outputs the most probable class with being the set of all possible classes. The output of the Random Forest is then defined as

the class which is most probable for the majority of the . In the original paper breiman2001 and also in our case, we used decision trees as weak classifiers. The decision trees for RF are generated independently and in parallel via a bagging (bootstrap aggregation) approach. This means that each decision tree is generated in two steps:

  1. Bootstrapping: Independently sampling the input data set with for each on data points and features.
    This means the data points from are sampled iid (independent and identically distributed) into a subset where .
    Also, the feature space is sampled iid, such that if contains the features , then contains the features from .

  2. Aggregating: Averaging or in our case deciding by a majority vote which class should be chosen.
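The two bagging steps can be sketched in a few lines of plain Python. This is an illustrative sketch, not the Random Forest implementation used in the experiments: the helper names are hypothetical, the weak classifiers themselves are omitted, and the feature subset is drawn without replacement, which is one common variant.

```python
import random
from collections import Counter


def bootstrap(rows, n_features, k, rng):
    """Step 1: draw len(rows) data points with replacement and k feature
    indices (here without replacement) for one weak classifier."""
    sample = [rng.choice(rows) for _ in rows]
    features = sorted(rng.sample(range(n_features), k))
    return sample, features


def majority_vote(predictions):
    """Step 2: return the class predicted by most weak classifiers."""
    return Counter(predictions).most_common(1)[0][0]


rng = random.Random(0)
rows = [([0.1, 2.0, 5.0], 0), ([0.4, 1.0, 3.0], 1), ([0.9, 0.0, 1.0], 1)]
sample, features = bootstrap(rows, n_features=3, k=2, rng=rng)

# Hypothetical class predictions of five weak classifiers for one input.
vote = majority_vote([1, 0, 1, 1, 0])
print(len(sample), features, vote)
```

Instead of the hard majority vote, the fraction of weak classifiers voting for the positive class can be used as a probability estimate, which is the kind of output needed for the customer sentiment.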

Gradient Boosting Friedman2001 also combines many weak classifiers into a strong classifier, but the way the weak learners are combined differs. In contrast to bagging, the decision trees are not built in parallel but sequentially, and their results are combined along the way. In our case, we chose XGBoost Chen_Guestrin_2016, an improved variant of the Gradient Boosting algorithm that uses a more regularized formalization of the model to reduce overfitting. In both cases the output is the predicted class (here: or ) as well as the probability the model assigns to each prediction. We use the probability the model assigns to class as the predicted customer sentiment for each input .

We address the class imbalance by applying either random oversampling of the minority class, SMOTE Chawla_Bowyer_Hall_Kegelmeyer_2002, or random undersampling of the majority class He_Garcia_2009 using the imblearn library imblearn2017. We treat the sampling strategy as a hyperparameter in our model selection approach, which we describe later.

We tested 8 different model configurations as summarized in Table 1. We applied two different data fusion approaches. For "early fusion" (M1, M2) we simply stack enterprise ( ) and log ( ) features to train a single classifier (RF or XGB). In "late fusion" (M3, M4) we train one base classifier based on and one based on . The output of each base classifier is then fed into a subsequent logistic regression layer for the final prediction. Both base classifiers are either RF or XGB. We additionally trained classifiers based only on (M5, M6) or only on (M7, M8).
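The random over- and undersampling strategies used against class imbalance can be illustrated without any library. The sketch below is a simplified stand-in for the imblearn samplers, with hypothetical function names; it assumes binary labels 0 and 1, samples given as (features, label) pairs, and at least one sample of each class.

```python
import random


def _split(samples):
    """Return (minority, majority) lists of (features, label) pairs."""
    pos = [s for s in samples if s[1] == 1]
    neg = [s for s in samples if s[1] == 0]
    return (pos, neg) if len(pos) < len(neg) else (neg, pos)


def random_oversample(samples, rng):
    """Duplicate random minority samples until both classes are equal."""
    minority, majority = _split(samples)
    extra = [rng.choice(minority)
             for _ in range(len(majority) - len(minority))]
    return majority + minority + extra


def random_undersample(samples, rng):
    """Drop random majority samples until both classes are equal."""
    minority, majority = _split(samples)
    return minority + rng.sample(majority, len(minority))


rng = random.Random(42)
data = [((i,), 0) for i in range(8)] + [((100,), 1), ((101,), 1)]
over = random_oversample(data, rng)
under = random_undersample(data, rng)
print(len(over), len(under))
```

Resampling should only be applied to the training portion of the data, so that validation and test weeks keep the original class distribution.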

Deep Neural Networks

We implemented a deep neural network (DNN) based on long short-term memory (LSTM) neural networks hochreiter.1997 in order to model as a time series. Given a sequence of inputs , an LSTM computes sequences of outputs via the following recurrent equations:

\[
\begin{aligned}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
\]

where $W_\ast$, $U_\ast$, and $b_\ast$ are trainable parameters, $\sigma$ is the sigmoid activation function, $\odot$ denotes the Hadamard product (element-wise product), and $h_t$ and $c_t$ are the hidden state and cell memory of an LSTM cell. Additionally, an LSTM cell uses four gates to manage its states over time to avoid the problem of exploding/vanishing gradients in the case of longer sequences bengio.1994. $f_t$ (forget gate) determines how much of the previous memory is kept, $i_t$ (input gate) controls the amount of new information ($\tilde{c}_t$) stored into memory, and $o_t$ (output gate) determines how much information is read out of the memory. The hidden state $h_t$ is commonly forwarded to a successive layer. In our experiments, we set the number of LSTM layers to as a hyperparameter. Additionally, we treat whether the LSTM is bidirectional Schuster_Paliwal_1997 as a hyperparameter. The final output vector from the last LSTM cell is then forwarded to a fully connected layer using dropout srivastava2014dropout for regularization and softmax activation for prediction. Fig. 4 depicts the implemented DNN architecture. We address the class imbalance by applying either random oversampling of the minority class or random undersampling of the majority class He_Garcia_2009. We used the PyTorch framework to implement our DNN architecture.

Figure 4: Implemented DNN architecture.

3.3 Training and Validation

Our experiments are designed to simulate the real-world performance of our decision support system. Algorithm 2 line 1-9 and Fig. 5 depict the experimental setup.

Evaluation Metrics

For a given set of estimated customer sentiment values and ground truth escalation labels , we calculate the Recall@N for a whole year as an evaluation metric. is the number of positive samples and the total number of samples at , respectively. We define

where denotes the largest elements and denotes the number of samples which have a positive ground truth value. Furthermore, we define

as the average Recall@N, denoted avg(Recall@N), in order to compare different models for a relevant range of values .
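A possible reading of this metric is sketched below. It assumes that, for each test week, the N customers with the highest estimated sentiment are inspected and that hits and positives are pooled over all weeks of the year; this pooling and the function names `recall_at_n` and `avg_recall` are assumptions, and the exact normalization in the original definition may differ.

```python
def recall_at_n(weeks, n):
    """weeks: list of (scores, labels) pairs, one pair per test week.
    Returns the fraction of all positive samples that are among the n
    highest-scored samples of their week."""
    hits = 0
    positives = 0
    for scores, labels in weeks:
        ranked = sorted(range(len(scores)),
                        key=lambda i: scores[i], reverse=True)
        hits += sum(labels[i] for i in ranked[:n])
        positives += sum(labels)
    return hits / positives if positives else 0.0


def avg_recall(weeks, ns):
    """Average Recall@N over a range of N values."""
    return sum(recall_at_n(weeks, n) for n in ns) / len(ns)


# Two toy weeks with four customers each (scores, ground-truth labels).
weeks = [
    ([0.9, 0.2, 0.7, 0.1], [1, 0, 0, 1]),
    ([0.3, 0.8, 0.5, 0.4], [0, 1, 1, 0]),
]
print(recall_at_n(weeks, 1), recall_at_n(weeks, 2), avg_recall(weeks, [1, 2]))
```

In this toy example, looking only at the top customer per week recovers two of the four escalations, and looking at the top two recovers three of them.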

Model Selection and Evaluation

We perform a weekly analysis for one year () to evaluate our approach. For each week, we use the past year as training data (). The gap of weeks is needed since in deployment we would have complete data until . Therefore, we only know the label for samples until , given our predictive interval is weeks. Fig. 6 shows the resulting distributions for and . We additionally split the training data for model selection into the first 41 weeks (training) and the last 10 weeks (validation).

We apply the tree-structured Parzen estimator (TPE) Bergstra_2011 approach for hyperparameter tuning. TPE is a Bayesian optimization approach for hyperparameter tuning and can yield better results compared to grid and random search Bergstra_2011. We use the TPE implementation in the Optuna Akiba_2019 library for our pipeline. We train models with different hyperparameter combinations suggested by TPE on and calculate the () for . The LSTM model (M9) is trained for epochs on and stops training if the validation loss did not decrease for epochs. A model checkpoint with the minimum validation loss is chosen to calculate the () on . Finally, the LSTM model with minimum () on over all sampled hyperparameter combinations is chosen for final evaluation. For RF and XGB we use the best set of hyperparameters to train a model on the complete training data . The resulting model is used to calculate predictions on the current test data () in order to obtain the estimated customer sentiment . The hyperparameters for the different classifiers are listed in Appendix A.

For evaluation, we calculate based on and over the whole year. This measures the percentage of escalations we would have predicted in one year if we looked at the largest estimated customer sentiments each week. In deployment (Algorithm 3), we provide information regarding the customer sentiment in the current week () based on since we only have the full data available up to and until . The source code for our experiments is available on GitHub.
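The rolling split can be sketched as a small helper that returns the week indices used for one test week. The 52, 41, and 10 week lengths follow Algorithm 2; the function name `weekly_split`, the integer week indices, and the exact placement of the `gap` weeks are assumptions, and the gap length is left to the caller.

```python
def weekly_split(t, gap, train_weeks=52, select_weeks=41, valid_weeks=10):
    """Week indices used when predicting for test week t.

    `gap` is the number of most recent weeks before t whose labels are
    not yet known because the predictive interval has not elapsed.
    """
    train_end = t - gap  # exclusive upper bound of the training weeks
    train = list(range(train_end - train_weeks, train_end))
    select = train[:select_weeks]  # first weeks: fit candidate models
    valid = train[-valid_weeks:]   # last weeks: score candidate models
    return train, select, valid


train, select, valid = weekly_split(t=100, gap=2)
print(train[0], train[-1], select[-1], valid[0])
```

Keeping every training week strictly before the test week, and every model-selection week before the validation weeks, mirrors the information that would actually be available in deployment.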

Figure 5: Illustration of the proposed training and evaluation setup (Algorithm 2). We evaluate the decision support system on a weekly basis for one year (). For each week we use the data from the previous year ( weeks) to train a model and to get a probability output for each customer (). Finally, we calculate to evaluate the performance over the whole year.
Figure 6: Resulting distribution of samples in and . The graphs depict all samples () and positive samples () for .
Algorithm 1 Build Dataset
1: Dataset
2: for all customers  do
3:     for  do    ▷ for every week for 2 years create samples starting from an arbitrary time point
4:         if customer exists since  then
5:             for  do    ▷ for every week in look-behind window of size
6:                 extract log features
7:                 extract enterprise features
8:             feature vector for customer and
9:             if  then    ▷ escalation flag within the next weeks?
10:                 label sample as positive
11:             else
12:                 label sample as negative
13:             add sample to dataset
14:     for all escalation flags  do
15:         discard infected intervals after escalation flags from dataset

Algorithm 2 Training and Validation
1: for  do    ▷ simulate one year of application
2:     52 weeks training data
3:     41 weeks for training for model selection
4:     10 weeks for validation data
5:     current test samples
6:     model selection using and based on    ▷ model selection using TPE sampler
7:     train model with best hyperparameter on
8:     test model on estimated customer sentiments
9: Calculate based on

Algorithm 3 In Deployment
1: current week is
2: output and corresponding SHAP values

4 Results

Figure 7 shows the values over N for the different model configurations (Table 1). Additionally, Table 2 depicts specific values for N and the overall .
Early and late fusion: Comparing M1 vs. M3 and M2 vs. M4 shows that there is a slight overall benefit of late fusion compared to early fusion in terms of ( vs. and vs. ).
Feature configuration: Using log features only (M7 and M8) yielded the worst results with and in terms of respectively. Using enterprise features only (M5 and M6) resulted in and in terms of respectively. Fusing both feature sets yielded consistently better results for all configurations M1-M4 in terms of (). Fig. 8 shows the feature importance of all resulting models for the model configuration M2. One can see that enterprise features are generally more important than log features. Note the small scale and that some log features do have a relatively large feature importance for some weeks. Furthermore, the applied machine learning models can potentially exploit non-linear relationships between the different features.
RF and XGB: RF almost consistently outperformed XGB (M2 vs. M1, M4 vs. M3 and M8 vs. M7) in terms of ( vs. , vs. and vs. ). The only exception where XGB performs better than RF is the configuration with enterprise features only (M6 vs. M5) in terms of ( vs. ).
Deep Neural Networks: Our LSTM model (M9) is consistently outperformed by the other models using both feature sets (M1-M4) in terms of ( vs. ).

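Model  Features          Classifier  Fusion
M1     enterprise + log  XGB         early
M2     enterprise + log  RF          early
M3     enterprise + log  XGB         late
M4     enterprise + log  RF          late
M5     enterprise        XGB         n.a.
M6     enterprise        RF          n.a.
M7     log               XGB         n.a.
M8     log               RF          n.a.
M9     enterprise + log  LSTM        early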
Table 1: Different model configurations used for the experiments.
Figure 7: Recall@N curves for N for all model configurations listed in Table 1.

Model Recall@10 Recall@20 Recall@30 Recall@40 Recall@50 Recall@100 avg(Recall@N) M1 M2 M3 M4 M5 M6 M7 M8 M9

Table 2: Numerical results in terms of Recall@N and avg(Recall@N) for all model configurations listed in Table 1.
Figure 8: Feature importance for all trained models for configuration M2.

5 Discussion

Model configurations M2 and M4 yielded the best results (RF with early and late fusion). Of these, late fusion (M4) performed slightly better in terms of ( vs. ). In practice, it is harder to compute meaningful SHAP values for the late fusion case since the base classifiers are based on different feature sets. Therefore, our practical recommendation is to use M2 with early fusion. RF generally performed better than XGB. We assume that this might be because the gradient-based construction of decision tree ensembles is more prone to overfitting on the heavily imbalanced dataset with noisy labels. We assume a similar problem with overfitting when using deep learning models for this kind of data. This might explain the inferior performance of M9 despite the potential to better model the temporal structure in the data. Furthermore, XGB and LSTM based models are more than 10 times slower to train compared to RF and are harder to tune. We also noticed that the computation of SHAP values Lundberg_Lee_2017 for LSTM based models is significantly slower compared to XGB and RF. In conclusion, we recommend using model configuration M2. In practice, we estimate that one customer support employee can scan around customers in depth with our tool per hour. Hence, if for example a team of 5 customer support employees invested one hour each week using the proposed decision support system with model M2, they could potentially prevent around of the escalations in a year. Furthermore, the decision support system can help to learn which specific problems, depicted in the designed features, lead to escalation, as well as to identify customers who have similar problems and might need special attention. In the following, we explain in detail how the designed decision support system could be used in practice.

Figure 9: Schematic overview of the envisioned workflow.

5.1 Proposed Workflow

The following section briefly outlines how the data-driven decision support system is integrated into a production environment and how it could be used by customer support employees.
The envisioned workflow, which reflects how the system is mainly used, consists of five distinct steps, which are outlined in Fig. 9.

  • Step 1: Producing new predictions This step is fully automated and all relevant processes are triggered at the beginning of each week. The first process within this step is to load the most recent raw data for all monitored customers from the respective data sources and to conduct the necessary preprocessing steps. Afterwards, all available samples up to this point for which labels can be defined are used to train a new model. Once a new model has been trained, it is used to predict the customer sentiment of all monitored customers. Additionally, the SHAP values for each prediction and each feature are calculated. Predictions and SHAP values are then copied to a database and automatically loaded into an interactive dashboard which serves as a user interface.

  • Step 2: Identifying high risk customers One element of the user interface is an interactive table showing the most recent predictions for all monitored customers, along with some additional information about each customer, e.g. location and operated system type. Within this step, the user identifies a system within his or her area of responsibility with a particularly high probability for an escalation within the following two weeks. Once a customer has been identified, it can be selected, which reduces the information shown on the user interface to just the relevant information about the customer in question.

  • Step 3: Single out the most relevant features with SHAP values Knowing which customers are at high risk of causing escalations without knowing why is of limited use. In order to explain why a specific customer has a high probability for escalation (customer sentiment), SHAP values for each prediction and each feature are displayed in the user interface. With such a visualization, the user can easily single out one or a few features which, according to their respective SHAP values, have a large effect on the customer's sentiment. By selecting these specific features, the information shown on the interface is further reduced, now only displaying information connected to the selected customer and the selected feature or features.

  • Step 4: Analyzing specific features Once a small set of features has been selected, the user is provided with the actual values of these features and how these values have changed over the past weeks. With this information, the user can easily identify open, still unresolved tickets and see immediately for how long a specific ticket has been unresolved. Another example could be the accumulation of specific malfunctions reflected in log features.

  • Step 5: In-depth analysis of a specific problem At this point, the experienced user probably has a good idea of where a potential problem with the customer in question might be found (e.g. unresolved tickets, spare parts, software issues). For an in-depth analysis of ticket data or consumed spare parts, other tools are already available which are tailored for such tasks. Therefore, once the user has identified the potential root cause for a bad customer sentiment, he or she is provided with a direct link to these external tools in order to continue the analysis as efficiently as possible with the goal to act before an escalation occurs.
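
Step 1 trains only on samples for which labels can already be defined. A minimal sketch of this selection is given below. The two-week horizon is taken from Step 2; the exact labeling rule (positive if an escalation falls inside the window after the sample date, negative once the window has fully elapsed) and the function name `label_sample` are assumptions made for illustration.

```python
from datetime import date, timedelta
from typing import Optional, Sequence

HORIZON = timedelta(weeks=2)  # prediction horizon mentioned in Step 2


def label_sample(sample_date: date, escalations: Sequence[date], today: date) -> Optional[int]:
    """Return 1 or 0 if the label is known as of `today`, otherwise None."""
    window_end = sample_date + HORIZON
    if any(sample_date < e <= window_end and e <= today for e in escalations):
        return 1  # an escalation inside the window has already been observed
    if window_end <= today:
        return 0  # the window has fully elapsed without an escalation
    return None  # window still open: the label cannot be defined yet


# Hypothetical weekly samples for one customer with one recorded escalation.
today = date(2021, 3, 15)
escalations = [date(2021, 3, 12)]
for sample_date in (date(2021, 2, 22), date(2021, 3, 1), date(2021, 3, 8)):
    print(sample_date, label_sample(sample_date, escalations, today))
```

Samples that return `None` are excluded from training in the current week and become usable once their window has closed.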

The main idea of a productive use of a data-driven decision support system is to help customer support employees decide which customers to focus on and where to look.
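
Steps 2 and 3 of the workflow amount to two ranking operations, which can be sketched as follows. The customer identifiers, feature names, and numbers are hypothetical, and ranking features by the absolute SHAP value is one plausible reading of "a large effect"; the deployed dashboard may differ.

```python
from typing import Dict, List, Tuple

# Hypothetical weekly output of Step 1: one prediction per customer and
# one SHAP value per customer and feature.
predictions: Dict[str, float] = {"customer_a": 0.82, "customer_b": 0.12, "customer_c": 0.57}
shap_values: Dict[str, Dict[str, float]] = {
    "customer_a": {"open_tickets": 0.31, "error_events": 0.18, "spare_parts": -0.04},
    "customer_b": {"open_tickets": -0.10, "error_events": 0.02, "spare_parts": 0.01},
    "customer_c": {"open_tickets": 0.05, "error_events": 0.22, "spare_parts": 0.09},
}


def high_risk_customers(preds: Dict[str, float], k: int) -> List[Tuple[str, float]]:
    """Step 2: the k customers with the highest predicted escalation probability."""
    return sorted(preds.items(), key=lambda kv: kv[1], reverse=True)[:k]


def top_features(contribs: Dict[str, float], k: int) -> List[Tuple[str, float]]:
    """Step 3: the k features with the largest absolute SHAP value."""
    return sorted(contribs.items(), key=lambda kv: abs(kv[1]), reverse=True)[:k]


for customer, prob in high_risk_customers(predictions, k=2):
    print(customer, prob, top_features(shap_values[customer], k=2))
```

The selected features then determine which raw values and time series are shown in Step 4.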

6 Conclusion and Future Work

In this paper, we propose a general framework and an interactive workflow with a decision support system. Additionally, we provide a publicly available industrial benchmark dataset, including all code necessary to reproduce or to improve our results. Our designed and implemented decision support system is currently deployed to monitor the customer sentiment of thousands of customers of high-end medical devices worldwide. The explainability of the system helps a variety of end users to identify problems in the field. We demonstrate that using both log and enterprise data-based features enables more effective troubleshooting compared to using either of these data sources alone. Furthermore, the gained insights can help to achieve better and more proactive customer relations, as well as improve product management by focusing on the problems which affect the customer the most. There are some open challenges which could be addressed in future research using the provided benchmark dataset and evaluation framework. For example, more efficient methods for merging log and enterprise data information which preserve explainability can be investigated. Another challenge is to design models that, within the implemented framework, increase the predictive power without sacrificing interpretability. Finally, alternative learning problem formulations, like anomaly detection, could be explored for the task. This could help with the heavy class imbalance present in the benchmark dataset.



Bjoern Eskofier gratefully acknowledges the support of the German Research Foundation (DFG) within the framework of the Heisenberg professorship programme (grant number ES 434/8-1). We would like to thank Gilles Le Texier, Martin Rothgaengel, Birgi Tamersoy, Mirko Appel and Marie Mecking from Siemens Healthineers for their valuable inputs and discussions. We gratefully acknowledge the support of Siemens Healthineers for this study and for providing the data. We would also like to thank Erick Axxe from the Ohio State University for proofreading our manuscript as a native speaker.

Appendix A Hyperparameters

Table 3 summarizes the hyperparameter search space for the classifiers used in our experimental study. The sampling strategies follow the Optuna library Akiba_2019. The parameter names correspond to the respective implementations of the classifiers.

classifier            parameter           range                   sampling strategy
RF (M2, M4, M6, M8)   max_samples         {0.1, 1}                uniform
                      max_depth                                   int
                      n_estimator                                 int
                      criterion           {gini, entropy}         categorical
                      ratio                                       uniform
                      sampling_strategy   {over, SMOTE, under}    categorical
XGB (M1, M3, M5, M7)  learning rate                               loguniform
                      max_depth                                   int
                      n_estimator                                 int
                      colsample_bytree                            uniform
                      subsample                                   uniform
                      reg_lambda                                  loguniform
                      ratio                                       uniform
                      sampling strategy   {over, SMOTE, under}    categorical
LSTM (M9)             batch_size                                  categorical
                      num_epochs                                  fixed
                      early_stopping                              fixed
                      hidden_dim                                  int
                      learning_rate                               loguniform
                      weight_decay                                uniform
                      dropout_prob                                uniform
                      lstm_layer                                  int
                      lstm_bidirectional  {True, False}           categorical
                      ratio                                       uniform
                      sampling strategy   {over, under}           categorical
Table 3: Settings for hyperparameter tuning.
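
The sampling strategies in Table 3 map onto the suggest methods of an Optuna trial, as the following minimal sketch indicates for the RF rows and for one loguniform XGB parameter. Only the `max_samples` interval and the categorical choices are taken from Table 3; all other bounds are hypothetical placeholders. `LowerBoundTrial` is a small stand-in that mimics the trial interface so that the search space can be inspected without running a study.

```python
def suggest_rf_params(trial):
    """Draw one RF configuration; `trial` is expected to behave like optuna.trial.Trial."""
    return {
        "max_samples": trial.suggest_float("max_samples", 0.1, 1.0),      # uniform, from Table 3
        "max_depth": trial.suggest_int("max_depth", 2, 20),               # int, placeholder bounds
        "n_estimator": trial.suggest_int("n_estimator", 50, 500),         # int, placeholder bounds
        "criterion": trial.suggest_categorical("criterion", ["gini", "entropy"]),
        "ratio": trial.suggest_float("ratio", 0.1, 1.0),                  # uniform, placeholder bounds
        "sampling_strategy": trial.suggest_categorical(
            "sampling_strategy", ["over", "SMOTE", "under"]
        ),
    }


def suggest_xgb_learning_rate(trial):
    """Loguniform sampling corresponds to suggest_float(..., log=True); placeholder bounds."""
    return trial.suggest_float("learning_rate", 1e-3, 1.0, log=True)


class LowerBoundTrial:
    """Stand-in for an Optuna trial: always returns the lower bound or the first choice."""

    def suggest_float(self, name, low, high, log=False):
        return low

    def suggest_int(self, name, low, high):
        return low

    def suggest_categorical(self, name, choices):
        return choices[0]


params = suggest_rf_params(LowerBoundTrial())
print(params)
print(suggest_xgb_learning_rate(LowerBoundTrial()))
```

With Optuna installed, the same functions can be called inside an objective function that is passed to `study.optimize` on a study created with `optuna.create_study`.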