Exploiting Event Log Data-Attributes in RNN Based Prediction

Markku Hinkka Aalto University, School of Science, Department of Computer Science, Finland QPR Software Plc, Finland
   Teemu Lehto Aalto University, School of Science, Department of Computer Science, Finland QPR Software Plc, Finland
   Keijo Heljanko University of Helsinki, Department of Computer Science, Finland, HIIT Helsinki Institute for Information Technology
markku.hinkka@aalto.fi, teemu.lehto@qpr.com, keijo.heljanko@aalto.fi

In predictive process analytics, current and historical process data in event logs are used to predict the future, e.g., to predict the next activity or how long a process will still require to complete. Recurrent neural networks (RNNs) and their subclasses have been demonstrated to be well suited for creating prediction models. Thus far, event attributes have not been fully utilized in these models. The biggest challenge in exploiting them in prediction models is the potentially large number of event attributes and attribute values. We present a novel clustering technique which allows for trade-offs between prediction accuracy and the time needed for model training and prediction. As an additional finding, we found that this clustering method combined with raw event attribute values provides even better prediction accuracy at the cost of additional time required for training and prediction. We also built a highly configurable test framework that can be used to efficiently evaluate different prediction approaches and parameterizations.

process mining, predictive process analytics, prediction, recurrent neural networks, gated recurrent unit

1 Introduction

Event logs generated by systems in business processes are used in Process Mining to automatically build real-life process definitions and as-is models behind those event logs. There is a growing number of applications for predicting the properties of newly added event log cases, or process instances, based on case data imported earlier into the system [3][4][12][17]. The more users understand their own processes, the more they want to optimize them. This optimization can be facilitated by performing predictions. In order to predict properties of new and ongoing cases, as much information as possible should be collected that is related to the event log traces and relevant to the properties to be predicted. Based on this information, a model of the system creating the event logs can be created. In our approach, the model creation is performed using supervised machine learning techniques.

In our previous work [7], we explored the possibility of using machine learning techniques for classification and root cause analysis in a process mining related classification task. In that paper, we evaluated the efficiency of several feature selection techniques and sets of structural features (a.k.a. activity patterns) based on process paths in process mining models in the context of a classification task. One of the biggest problems with that approach is finding the structural features that have the most impact on the classification result, e.g., whether to use only activity occurrences, transitions between two activities, activity orders, or other even more complicated types of structural features such as detected subprocesses or repeats. For this purpose, we proposed another approach in [8], where we examined the use of recurrent neural network techniques for classification and prediction. These techniques are capable of automatically learning more complicated causal relationships between activity occurrences in activity sequences. We evaluated several different approaches and parameters for the recurrent neural network techniques and compared the results with the results we collected in our earlier work. In both of the previous publications [7][8], we focused on boolean-type classification tasks based on the activity sequences only.

The primary motivation for this paper is to further improve our techniques by exploring approaches for exploiting other properties commonly available in event logs: case and event attribute values and other information derived from events, such as time stamp related information. Our goal is to develop a mechanism that would allow the creation of a tool that is able, based on a relatively simple set of parameters and training data, to produce a prediction model for any case-level prediction task, such as predicting the next activity or the final duration of a running case.

The novelty in our method is based on the concatenation of multiple one-hot encoded feature vectors given as input to the RNN. This vector is constructed out of the same set of features for every event. Features of the vector can vary from one-hot encoded event or case attribute cluster identifiers, or discrete attribute values, to features otherwise derived from the event data, such as temporal features based on event time stamps. The set of used features can be configured separately. We experimented with several different types of features against each other using multiple different prediction scenarios and datasets.

Another contribution of our method, which we have not found in any earlier publications, is the way we use clustering to make it easy to manage the input vector size no matter how many event and case attributes there are in the dataset. E.g., users can configure the absolute maximum length of the one-hot vector used for the event or case attribute data, which will not be exceeded no matter how many actual attributes the dataset has.

The source code of our prediction engine is available on GitHub (https://github.com/mhinkka/articles).

The rest of this paper is structured as follows: Section 2 is a short summary of the latest developments around the subject. In Section 3, we present the problem statement and the related concepts. Section 4 presents our solution for the problem. In Section 5, we present the test framework used to test our solution. Section 6 describes the used datasets as well as the performed prediction scenarios. Section 7 presents the experiments and their results validating our solution. Finally, Section 8 draws the final conclusions.

2 Related Work

Lately there has been a lot of academic interest in predictive process monitoring, which can clearly be seen, e.g., in [5], where the authors present a survey of 55 accepted academic papers on the subject. In [16], the authors compared several approaches spanning three different research fields: machine learning, process mining and grammar inference. As a result, they found that, overall, the techniques from the machine learning field generate more accurate predictions than those from the grammar inference and process mining fields.

In [17] the authors used Long Short-Term Memory (LSTM) recurrent neural networks to predict the next activity and its timestamp. They used one-hot encoded activity labels and three numerical time-based features: duration between the current activity and the previous activity, time within the day and time within the week. Event attributes were not considered at all.

In [3] the authors trained LSTM networks to predict the next activity. In this case, however, network inputs were created by concatenating categorical, character string valued event attributes and then encoding these attributes via an embedding space. They also note that this approach was feasible only because of the small number of unique values each attribute had in their test datasets. In [15], the authors took a very similar approach based on an LSTM network, but also incorporated both discrete and continuous event attribute values. Discrete values were one-hot encoded, whereas continuous values were normalized using min-max normalization and added to the input vectors as single values.

In [13] the authors used Gated Recurrent Unit (GRU) recurrent neural networks to detect anomalies in event logs. One one-hot encoded vector was created for activity labels and one for each of the included string valued event attributes. These vectors were then concatenated, in a similar fashion to our solution, into one vector representing one event, which was then given as input to the network. We used this approach for benchmarking our own clustering based approach (labeled as the Raw feature in the text below). The system proposed in their paper was able to predict both the next activity and the next values of event attributes. However, it does not take case attributes and temporal attributes into account.

In [18] the authors trained an RNN to predict the most likely future activity sequence of a running process based only on the sequence of activity labels. Similarly, our earlier publication [7] used sequences of activity labels to train an LSTM network to perform a boolean classification of cases.

None of the mentioned earlier works present a solution that is scalable for datasets having lots of event- or case attributes.

3 Problem

Using RNNs to perform case-level predictions on event logs has lately been studied a lot. However, there has not been any scalable approach for handling event and case attributes in an RNN setting. Instead, e.g., in [13] the authors used a separate one-hot encoded vector for each attribute. With this kind of an approach, having, e.g., 10 different attributes, each having 10 unique values, would already require 100 elements to be added to the input of every event. The longer the input vectors become, the more time and memory it takes to train accurate models from them. This also increases the time and memory required to use the model for predictions.

In some prediction scenarios, prediction accuracy can also be improved relatively easily by including features derived from the timestamps of the events, for example, when predicting the duration until a running case will move to its next activity, or when predicting the total duration of a case. One could also encode weekdays, days of month, hours of day and similar information in order to make the model learn, e.g., the effect of holidays on the throughput of hospital visits.
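As a minimal sketch of such timestamp-derived features, the following Python snippet encodes the time since the previous event together with one-hot weekday and hour-of-day vectors. The function names, the one-day cap used for scaling and the chosen feature set are assumptions made for illustration; they are not taken from the prediction engine.

```python
from datetime import datetime


def one_hot(index, size):
    """Return a list of `size` zeros with a single 1.0 at `index`."""
    vector = [0.0] * size
    vector[index] = 1.0
    return vector


def temporal_features(timestamp, previous_timestamp, max_gap_seconds=86400.0):
    """Encode one event's timestamp as numeric input elements.

    max_gap_seconds is a hypothetical cap used to scale the gap into [0, 1].
    """
    gap = (timestamp - previous_timestamp).total_seconds()
    scaled_gap = min(max(gap / max_gap_seconds, 0.0), 1.0)
    weekday = one_hot(timestamp.weekday(), 7)   # Monday has index 0
    hour = one_hot(timestamp.hour, 24)
    return [scaled_gap] + weekday + hour


previous_event = datetime(2019, 1, 7, 9, 0)    # a Monday morning
current_event = datetime(2019, 1, 7, 15, 0)    # six hours later
features = temporal_features(current_event, previous_event)
print(len(features), features[0])
```

Every event produces a vector of the same length (here 1 + 7 + 24 elements), so it can be appended to the other per-event features.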

One more issue to take into account when selecting input data to be fed to the RNN is that, in order to improve the convergence rate of the RNN, input elements should be normalized [10].
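One common way to normalize a continuous feature is min-max normalization, which [15] applies to continuous event attributes. The sketch below is an illustrative assumption: the text does not state which normalization the prediction engine applies, and the duration values are made up.

```python
def fit_min_max(training_values):
    """Learn the value range from the training data only."""
    return min(training_values), max(training_values)


def min_max_scale(value, lowest, highest):
    """Scale a value into [0, 1]; values outside the training range are clamped."""
    if highest == lowest:
        return 0.0
    return min(max((value - lowest) / (highest - lowest), 0.0), 1.0)


# Hypothetical activity transition durations in seconds.
durations = [30.0, 60.0, 120.0, 300.0]
lowest, highest = fit_min_max(durations)
scaled = [min_max_scale(d, lowest, highest) for d in durations]
print(scaled)
```

Storing the range learned from the training data lets the same scaling be reused for test data, including values outside the training range.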

4 Solution

We decided to include several feature types in the input vectors of the RNN. Input vectors are formatted as shown in Table 1, where each column represents one feature vector element $f_{i,j}$, where $i$ is the index of the feature and $j$ is the index of the element of that feature. In the table, $n$ represents the number of feature types used in the feature vector and $m_i$ represents the number of elements required in the input vector for feature type $i$. Thus, each feature type produces one or more numeric elements into the input vector, which are then concatenated together into one actual input vector passed to the RNN in both the training and prediction phases. Table 2 shows an example input vector having three different feature types: activity label, raw event attribute values (only a single event attribute named food having four unique values) and the event attribute cluster, where clustering has been performed separately for each unique activity.

The following subsections describe in more detail the types of features used in our research. However, this mechanism can also easily incorporate other types of features not described here, as long as they can be converted into numeric vectors of equal length for every event of a case.

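$f_{1,1}$ ... $f_{1,m_1}$ $f_{2,1}$ ... $f_{2,m_2}$ ... $f_{n,1}$ ... $f_{n,m_n}$
Table 1: Feature input vector structure

Row activity_1 activity_2 food_1 food_2 food_3 food_4 cluster_0 cluster_1
1 1 0 1 0 0 0 1 0
2 0 1 0 0 1 0 1 0
3 1 0 0 1 0 0 0 1
4 1 0 0 1 0 0 0 1
5 0 1 0 0 0 1 0 1
Table 2: Feature input vector example content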
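The concatenation of feature types into one input vector can be sketched as follows. This is a minimal illustration that reproduces the first two rows of Table 2. The activity name drink and all four food values are hypothetical placeholders (only the activity eat, the attribute food and the value water appear in the text), and the helper names are not taken from the prediction engine.

```python
def one_hot(value, vocabulary):
    """One-hot encode `value` against an ordered list of known values."""
    return [1 if value == item else 0 for item in vocabulary]


# Hypothetical vocabularies; "drink" and the food values other than "water"
# are placeholders chosen for illustration.
ACTIVITIES = ["eat", "drink"]
FOODS = ["bread", "soup", "water", "juice"]
CLUSTERS = [0, 1]


def build_input_vector(activity, food, cluster):
    """Concatenate the per-feature vectors into one input vector for an event."""
    return (
        one_hot(activity, ACTIVITIES)
        + one_hot(food, FOODS)
        + one_hot(cluster, CLUSTERS)
    )


row1 = build_input_vector("eat", "bread", 0)
row2 = build_input_vector("drink", "water", 0)
print(row1)
print(row2)
```

Because every event is encoded against the same vocabularies, all input vectors of a case have the same length, as required by the RNN.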

4.1 Event Attributes

Our primary solution for incorporating the information in event attributes is to cluster all the event attribute values in the training set and then use a one-hot encoded cluster identifier to represent all the attribute values of the event. The clustering algorithm must be one that automatically tries to find the optimal number of clusters for the given dataset within the range of 0 to N clusters, where N can be configured by the user. By changing N, the user can easily configure the maximum length of the one-hot vector as well as the level of detail at which attribute information is tracked. For this paper, we experimented with a slightly modified version of the XMeans algorithm [14].

It is very common that different activities are processed by different resources, yielding a completely different set of possible attribute values. E.g., different departments in a hospital have different people, materials and processes. Also in the example feature vector shown in Table 2, the food event attribute has a completely different set of possible values depending on the activity, since, e.g., the external system does not allow an activity of type eat to have the food event attribute value water. If we clustered all the event attributes using a single clustering, we would easily lose this activity type specific information.

In order to retain this activity-specific information, we used a separate clustering for each unique activity type. All the event attribute clusters are encoded into one one-hot encoded vector representing only the resulting cluster label for that event, no matter what its activity is. This is shown in the example table as the cluster columns, where the column for cluster $k$ is 1 on the rows having $k$ as their clustering label. E.g., in the example case, the column for cluster 0 is 1 in both rows 1 and 2. However, row 1 is in that cluster because it is in the 0th cluster of its own activity, whereas row 2 is in that cluster because it is in the 0th cluster of a different activity. Thus, in order to identify the actual cluster, one needs both the activity label and the cluster label. For the RNN to be able to properly learn about the actual event attribute values, it needs to be given both the activity label and the cluster label in the input vector. Below, this approach is labeled as ClustN, where N is the maximum cluster count.
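The per-activity clustering with a shared cluster label space can be sketched as below. Note that this sketch does not implement XMeans: as a simple stand-in, it gives the most frequent attribute value combinations of each activity their own label and lets the rest share the last label. The function names, the example events and the handling of unseen combinations are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def fit_activity_clusters(events, max_clusters):
    """Fit one label mapping per activity.

    events: list of (activity, attribute_tuple) pairs from the training set.
    Stand-in for XMeans: labels are assigned by frequency rank.
    """
    counts_per_activity = defaultdict(Counter)
    for activity, attributes in events:
        counts_per_activity[activity][attributes] += 1
    models = {}
    for activity, counts in counts_per_activity.items():
        ranked = [attributes for attributes, _ in counts.most_common()]
        models[activity] = {
            attributes: min(rank, max_clusters - 1)
            for rank, attributes in enumerate(ranked)
        }
    return models


def encode_cluster(models, activity, attributes, max_clusters):
    """One-hot encode the cluster label; unseen combinations get the last label."""
    label = models.get(activity, {}).get(attributes, max_clusters - 1)
    vector = [0] * max_clusters
    vector[label] = 1
    return vector


# Hypothetical training events: (activity, (food,)).
events = [
    ("eat", ("bread",)), ("eat", ("bread",)), ("eat", ("soup",)),
    ("drink", ("water",)), ("drink", ("water",)), ("drink", ("juice",)),
]
models = fit_activity_clusters(events, max_clusters=2)
eat_vector = encode_cluster(models, "eat", ("bread",), 2)
drink_vector = encode_cluster(models, "drink", ("water",), 2)
print(eat_vector, drink_vector)
```

Both events receive the same cluster vector even though they belong to clusters of different activities, which is why the activity label must accompany the cluster label in the input vector.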

For benchmarking, we also experimented with an implementation where every event attribute is encoded into its own one-hot encoded vector, and these vectors are then concatenated into the actual input vectors. Below, this approach is referred to as Raw. Finally, we also experimented with using both raw and clustered event attribute values. Below, this approach is referred to as BothN, where N is the maximum cluster count.

4.2 Case Attributes

For case attributes, an approach similar to the one used for event attributes could be applied. When training the model with event-based input vectors, every input vector given to the RNN could be appended with the vector built for the case of the event. We chose not to include this feature in this paper due to the limited space available. However, we plan to experiment with this in future work.

5 Test Framework

In order to test our solution, we built a test framework that roughly consists of two main components: a process mining tool capable of generating JavaScript Object Notation (JSON) formatted data structures based on the event data, and a prediction engine which takes these generated JSON data structures and performs either prediction or training.

The training work flow of this framework roughly follows the flowchart shown in Figure 1.

Figure 1: Training work flow

5.1 Process mining tool

As the process mining tool, we used a custom application built on top of APIs provided by QPR ProcessAnalyzer (https://www.qpr.com/products/qpr-processanalyzer) and its expression language (https://devnet.onqpr.com/pawiki). For every used dataset and prediction task, we created a separate expression language query that generated the desired structured information for both training and testing purposes.

The generation consisted of first creating a query that generated a set of cases to export as well as selecting attribute values to export. For this paper, we included in our tests only attributes whose most used value was used in at least 5% of all the cases in the dataset. Thus, e.g., unique event identifiers were not considered at all in any of the performed tests. Finally, the resulting objects were serialized into a simple JSON format that contained all the required information about the activities, cases, events and the expected outcomes.
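The attribute selection rule can be sketched as follows. This is an illustration under one possible reading of the rule, namely that an attribute is kept if its most common value occurs in at least 5% of the cases. The data layout and the function name are hypothetical and do not reflect the expression language queries actually used.

```python
from collections import Counter


def select_attributes(cases, min_case_share=0.05):
    """Keep attributes whose most common value occurs in enough cases.

    cases: {case_id: [event_attribute_dict, ...]} (hypothetical layout).
    """
    cases_per_value = {}
    for events in cases.values():
        # Count each (attribute, value) pair at most once per case.
        seen = {(name, value) for event in events for name, value in event.items()}
        for name, value in seen:
            cases_per_value.setdefault(name, Counter())[value] += 1
    threshold = min_case_share * len(cases)
    return sorted(
        name
        for name, counts in cases_per_value.items()
        if max(counts.values()) >= threshold
    )


# 40 hypothetical cases with one event each: "resource" has two common values,
# whereas "event_id" is unique for every event.
cases = {
    i: [{"resource": "r1" if i % 2 else "r2", "event_id": "e%d" % i}]
    for i in range(40)
}
kept = select_attributes(cases)
print(kept)
```

Unique identifiers are dropped by this rule because none of their values reaches the threshold.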

For the experiments, we generated two temporally isolated sets of cases. We first selected the newest 10% of the cases, which were used as the test data. For the training data, we selected all the cases whose last event had occurred before any of the events picked into the test dataset. Thus, the time just before the first test dataset event represents the time when the model was trained using all the complete cases gathered up to that point. The test data was used only after the model had been fully trained using the training data.
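A minimal sketch of this temporal split is shown below. It assumes that the newest cases are the ones with the latest first event, which the text does not specify; the data layout and names are likewise hypothetical.

```python
def temporal_split(cases, test_share=0.10):
    """Split cases into temporally isolated training and test sets.

    cases: {case_id: sorted list of event timestamps} (hypothetical layout).
    """
    by_start = sorted(cases, key=lambda case_id: cases[case_id][0])
    n_test = max(1, round(len(by_start) * test_share))
    test_ids = by_start[-n_test:]
    first_test_event = min(cases[case_id][0] for case_id in test_ids)
    # Keep only cases that were complete before the first test event.
    train_ids = [
        case_id
        for case_id in by_start[:-n_test]
        if cases[case_id][-1] < first_test_event
    ]
    return train_ids, test_ids


# Ten hypothetical cases; timestamps are plain numbers for brevity.
cases = {i: [i * 10, i * 10 + 5] for i in range(10)}
cases[3] = [30, 95]  # still running when the first test case starts
train_ids, test_ids = temporal_split(cases)
print(train_ids, test_ids)
```

Case 3 is left out of both sets because it was not yet complete when the first test event occurred.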

5.2 Prediction engine

We extended the Python-based prediction engine that was used in our earlier work [7] by adding several new functionalities. The engine still supports most of the hyperparameters that we experimented with in our earlier work, such as the RNN unit type, the number of RNN layers and the batch size.

The prediction engine we built for this work takes a single configuration file as input and outputs test result rows into a CSV file. A configuration file is a JSON file describing exactly the set of parameters, given as key-value pairs, that are passed to the model trainer and model tester functions used to perform a single test. At the top level of the JSON file, there is a single JSON object that describes the test runs hierarchically, so that definitions made at an upper level also affect all the definitions at the lower levels of the hierarchy. The hierarchy is built using a special "runs" property that has an array-type value that can contain any number of JSON objects. There is also a special "foreach" key that can be used to iterate over sets of additional parameters on top of the parameters of the object in which the key is located. As a result, a very flexible set of tests can be run sequentially without any user intervention. The configuration file supports about 60 configurable hyperparameters, the most notable of which, from the perspective of our experiments, are listed in Table 4.

For the purposes of the paper, we decided to separate the prediction engine completely from the process mining tool. Thus, the JSON data generated by the process mining tool is written to disk and read from disk files by the prediction engine. A simplified example test run configuration can be seen in Figure 2. This configuration runs six next activity prediction tests using two different datasets and three different cluster sizes.

{
  "test_name": "example",
  "predict_next_activity": true,
  "for_each": [
    {
      "input_filename": "BPIC17-ne-train",
      "test_filename": "BPIC17-ne-test",
      "dataset_name": "bpic17"
    },
    {
      "input_filename": "BPIC18-ne-train",
      "test_filename": "BPIC18-ne-test",
      "dataset_name": "bpic18"
    }
  ],
  "runs": [
    { "max_num_event_clusters": 20 },
    { "max_num_event_clusters": 40 },
    { "max_num_event_clusters": 80 }
  ]
}
Figure 2: Example test run JSON configuration
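How such a hierarchical configuration might expand into individual test runs can be sketched as below. The expansion semantics (every "for_each" entry combined with every "runs" entry, with upper level parameters inherited by lower levels) are an assumption inferred from the description above, and the function name expand_runs is hypothetical rather than part of the prediction engine.

```python
import json

CONFIG_TEXT = """
{
  "test_name": "example",
  "predict_next_activity": true,
  "for_each": [
    {"input_filename": "BPIC17-ne-train", "test_filename": "BPIC17-ne-test",
     "dataset_name": "bpic17"},
    {"input_filename": "BPIC18-ne-train", "test_filename": "BPIC18-ne-test",
     "dataset_name": "bpic18"}
  ],
  "runs": [
    {"max_num_event_clusters": 20},
    {"max_num_event_clusters": 40},
    {"max_num_event_clusters": 80}
  ]
}
"""


def expand_runs(config, inherited=None):
    """Flatten a hierarchical configuration into a list of parameter sets."""
    params = dict(inherited or {})
    params.update(
        {key: value for key, value in config.items() if key not in ("runs", "for_each")}
    )
    # Each "for_each" entry yields one variant of the current parameters.
    variants = [dict(params, **extra) for extra in config.get("for_each", [{}])]
    runs = config.get("runs")
    if not runs:
        return variants
    expanded = []
    for variant in variants:
        for child in runs:
            expanded.extend(expand_runs(child, variant))
    return expanded


expanded = expand_runs(json.loads(CONFIG_TEXT))
for run in expanded:
    print(run["dataset_name"], run["max_num_event_clusters"])
```

Under these assumed semantics, the example expands into two datasets times three cluster sizes, that is, six parameter sets.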

5.2.1 Training

Every test run begins by loading the JSON event log data from the file generated by the process mining tool into memory. After this, the event log is preprocessed. The very first step of preprocessing optionally filters out excess cases by keeping only a configured number of randomly selected cases and discarding the rest. This is used in some tests to speed up the training, which would otherwise have lasted much longer. The next step is splitting the training data into actual training data and validation data, which is used to find the best performing model out of all the model states seen during the test iterations. For this, we picked 75% of the cases for the training dataset and the rest for the validation dataset.
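These two preprocessing steps can be sketched as follows. The function name, the fixed seed and the example sizes are assumptions made for illustration; only the 75% training proportion comes from Table 4.

```python
import random


def sample_and_split(case_ids, max_cases, train_share=0.75, seed=0):
    """Randomly cap the number of cases, then split into training and validation."""
    rng = random.Random(seed)
    ids = list(case_ids)
    if len(ids) > max_cases:
        ids = rng.sample(ids, max_cases)   # discard excess cases at random
    rng.shuffle(ids)
    n_train = int(len(ids) * train_share)
    return ids[:n_train], ids[n_train:]


# Hypothetical example: 1000 cases capped to 400 before splitting.
train_cases, validation_cases = sample_and_split(range(1000), max_cases=400)
print(len(train_cases), len(validation_cases))
```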

After the event log is read and prepared, we initialize the actual prediction model and the data used to generate the actual input vectors. Depending on the prediction scenario, this data initialization involves splitting cases into prefixes and taking a random sample of the available data if the amount of data exceeds the configured maximum amount. In this phase, if needed, we also cluster event attributes. The initialization of all the other derived information needed for the input vectors, such as the encoding of activity transition durations, is also performed at this point.

Finally, after the model is initialized, we start the actual training, in which we concatenate all the requested feature vectors and feed them, together with the expected outcome, into the RNN model repeatedly for the whole training set until 100 test iterations have passed. The number of actual epochs trained in each iteration is configurable. In our experiments, the total number of epochs was set to be 10. After every test iteration, the model is validated against the validation set. In order to improve validation performance, if the size of the validation set is larger than a separately specified limit (Max validation test size), a random sample of the whole validation set is used. These test results, including additional status and timing related information, are written into the resulting test result CSV file. If the prediction accuracy of the model against the validation set is found to be better than the accuracy of any of the models found thus far, the network state is stored for that model. Finally, after all the training, the model having the best validation accuracy is picked as the trained model and saved to a file. This model file contains, in addition to the details of the trained RNN model, all the metadata required to perform the clustering and filtering operations for the test data.
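The iteration loop with best-model tracking can be sketched generically as below. The callbacks train_step and validate stand in for the actual Lasagne and Theano training and validation code, and the toy model and accuracy values are made up so that the sketch runs on its own.

```python
import copy


def train_with_checkpointing(model, train_step, validate, iterations=100):
    """Train for a fixed number of test iterations and keep the best model state."""
    best_accuracy = float("-inf")
    best_state = None
    history = []
    for iteration in range(iterations):
        train_step(model)                    # train the configured number of epochs
        accuracy = validate(model)           # accuracy against the validation set
        history.append((iteration, accuracy))
        if accuracy > best_accuracy:
            best_accuracy = accuracy
            best_state = copy.deepcopy(model)   # snapshot of the network state
    return best_state, best_accuracy, history


# Toy stand-ins: the "model" only counts training steps and the
# validation accuracies are fixed, hypothetical numbers.
accuracies = [0.5, 0.7, 0.6]
model = {"steps": 0}


def train_step(current_model):
    current_model["steps"] += 1


def validate(current_model):
    return accuracies[current_model["steps"] - 1]


best_state, best_accuracy, history = train_with_checkpointing(
    model, train_step, validate, iterations=3
)
print(best_state, best_accuracy)
```

The returned state is the one from the second iteration, even though training continued for one more iteration with a worse validation result.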

5.2.2 Testing

In the testing phase, an independent set of cases read from a separate JSON file is tested against the model built in the previous step. After initializing the event log following similar steps as in the training phase, the trained model is loaded from the file. After the model has been loaded, it is asked for a prediction for each input vector built from the test data. The prediction accuracy, as well as some status and timing related information, is written to the result CSV file.

6 Test Setup

We performed our tests using several different datasets. Some details of the used datasets can be found in Table 3.

Event Log # Event Attributes Case filter percentage Source
BPIC12 1 100% https://doi.org/10.4121/uuid:3926db30-f712-4394-aebc-75976070e91f
BPIC13, incidents 8 100% https://doi.org/10.4121/uuid:500573e6-accc-4b0c-9576-aa5468b10cee
BPIC14 1 100% https://doi.org/10.4121/uuid:c3e5d162-0cfd-4bb0-bd82-af5268819c35
BPIC17 4 50% https://doi.org/10.4121/uuid:5f3067df-f10b-45da-b98b-86ae4c7a310b
BPIC18 5 20% https://doi.org/10.4121/uuid:3301445f-95e8-4ff0-98a4-901f1f204972
Table 3: Used Event logs and their relevant statistics

For each dataset, we performed next activity prediction, where we wanted to predict the next activity of any ongoing case. For this, we split every input case into possibly multiple virtual cases depending on the number of events the case had. If the length of the case was shorter than 4, the whole case was ignored. Otherwise, a separate virtual case was created for every prefix of length 4 or more. Thus, for a case of length 6, 3 cases were created: one with length 4, one with 5 and one with 6. For all these prefixes, the next activity label was used as the expected outcome. For the full length case, the expected outcome was a special finished-token.
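The prefix generation described above can be sketched as follows. The token value and the function name are hypothetical; the minimum prefix length of 4 is taken from the text.

```python
FINISHED = "<finished>"  # hypothetical name for the special finished-token


def make_prefix_cases(activities, min_length=4):
    """Create (prefix, expected next activity) pairs for one case."""
    virtual_cases = []
    for length in range(min_length, len(activities) + 1):
        prefix = activities[:length]
        if length < len(activities):
            outcome = activities[length]   # the activity following the prefix
        else:
            outcome = FINISHED             # the full length case
        virtual_cases.append((prefix, outcome))
    return virtual_cases


trace = ["a", "b", "c", "d", "e", "f"]     # a case of length 6
virtual_cases = make_prefix_cases(trace)
for prefix, outcome in virtual_cases:
    print(len(prefix), outcome)
```

A case shorter than the minimum length yields an empty list and is thus ignored.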

Finally, the order of cases was randomized and, in some test cases, to improve performance, filtering was added to randomly filter out a percentage of the resulting cases that was specified for each dataset separately. This percentage is shown in the "Case filter percentage" column of Table 3.

7 Experiments

For the experiments, we used the same system that we used in our previous work [7]. The system ran the Windows 10 operating system, and its hardware consisted of a 3.5 GHz Intel Core i5-6600K CPU with 32 GB of main memory and an NVIDIA GeForce GTX 960 GPU having 4 GB of memory. Out of those 4 GB, we reserved 3 GB for the tests. The testing framework was built on the test system using the Python programming language. The actual recurrent neural networks were built using the Lasagne library (https://lasagne.readthedocs.io/), which works on top of Theano (http://deeplearning.net/software/theano/). Theano was configured to use the GPU via CUDA for expression evaluation.

We started our experiments by running next activity predictions using all the tested datasets and maximum cluster counts. The most notable hyperparameters used for these tests are listed in Table 4.

Hyperparameter Value
# test iterations 100
# epochs per iteration 1.0 (for BPIC13), 0.1 (for others)
Clustering method XMeans [14]
Include activity occurrence false
Use activity -level event clustering true
# case clusters 40
# event clusters 40
Max # traces in training 75000
Max # traces in testing 25000
Max validation test size 10000
RNN type GRU [1]
# layers 1
Gradient descent optimizer Adam [9]
Learning rate 0.01
Hidden dimension size 256
Training data proportion 75%
Table 4: Initial hyperparameter values

The results of these runs are shown in Table 5. In the table, the Features column shows the used set of features. S.rate shows the achieved prediction success rate. In.v.s. shows the size of the input vector, which gives an indication of the memory usage of that configuration. Finally, the Tra.t. and Pred.t. columns show the time required for performing the training and the prediction for all the cases in the test dataset. In both cases, this time includes the time for setting up the neural network, performing the clusterings and preparing the dataset from the JSON format. Each row in the table represents one test run with a unique combination of dataset and feature set. None represents the case in which there was no event attribute information at all in the input vector, ClustN represents a test with one-hot encoded cluster labels of event attributes clustered into at most N clusters, Raw represents having all one-hot encoded attribute values individually in the input vector, and BothN represents having both one-hot encoded attribute values and one-hot encoded cluster labels in the input vector. It should also be noted that, due to the way test runs were performed in the test framework, the None and Raw features were also run using three different cluster sizes, even though this did not cause any differences in the input data. For these features, the test run achieving the best success rate was used. The varying success rates, if any, in these cases are caused mostly by the random sampling applied to the original dataset.

Dataset Features S.rate In.v.s. Tra.t. Pred.t.
BPIC12 None 85.0% 26 690.5s 6.9s
Clust20 84.8% 30 701.1s 7.4s
Clust40 84.9% 30 686.1s 6.4s
Clust80 85.1% 30 701.4s 7.5s
Raw 84.7% 29 691.2s 7.5s
Both20 85.1% 33 704.3s 8.2s
Both40 84.7% 33 706.1s 8.4s
Both80 84.8% 33 713.1s 10.3s
BPIC13 None 62.9% 13 27.1s 0.4s
Clust20 66.2% 34 31.6s 1.5s
Clust40 65.7% 54 31.9s 0.9s
Clust80 64.7% 87 37.5s 1.0s
Raw 67.7% 1074 198.9s 2.6s
Both20 66.6% 1095 203.3s 3.4s
Both40 66.4% 1115 206.0s 3.2s
Both80 67.9% 1148 211.6s 3.2s
BPIC14 None 42.5% 41 1012.7s 6.1s
Clust20 44.5% 62 1056.6s 8.2s
Clust40 45.4% 82 1095.9s 8.8s
Clust80 45.3% 103 1134.6s 9.2s
Raw 41.1% 278 1502.5s 14.0s
Both20 45.8% 299 1586.6s 15.7s
Both40 45.6% 319 1636.2s 16.2s
Both80 45.6% 341 1720.1s 17.2s
BPIC17 None 86.2% 27 1532.7s 23.4s
Clust20 88.1% 48 1568.6s 27.2s
Clust40 88.2% 68 1619.2s 29.7s
Clust80 90.7% 108 1702.2s 32.8s
Raw 88.1% 253 1914.8s 43.2s
Both20 87.1% 200 2005.0s 54.9s
Both40 87.7% 220 2070.1s 49.7s
Both80 88.1% 260 2265.6s 60.2s
BPIC18 None 58.9% 41 966.4s 23.3s
Clust20 60.2% 62 1012.9s 27.7s
Clust40 57.9% 82 1055.9s 25.0s
Clust80 58.6% 117 1113.1s 31.3s
Raw 62.2% 253 1508.0s 53.2s
Both20 66.6% 274 1582.9s 55.7s
Both40 50.8% 294 1632.5s 55.4s
Both80 54.8% 334 1765.0s 60.3s
Table 5: Statistics of next activity prediction using different sets of input features

We also aggregated these results over all the datasets. Figure 3 shows the average success rates of the different event attribute encoding techniques over all the tested datasets using a maximum of 40 clusters for event attributes. Figure 4 shows the average input vector lengths. Figure 5 and Figure 6 show the averaged training and prediction times.

Figure 3: Average prediction success rate over all the datasets
Figure 4: Average length of the input vector over all the datasets
Figure 5: Average training time over all the datasets
Figure 6: Average prediction time over all the datasets

Based on all of these results, we can see that including event attribute values improved the prediction accuracy over not including them at all in all datasets. The effect was smallest (0.1 percentage points) in the BPIC12 model, where there was only 1 event attribute, whose value did not seem to correlate much with the next activity. The greatest effect was achieved in the BPIC18 model, where Both20 outperformed prediction without any event attributes by 7.7 percentage points and prediction with raw attribute values by over 4.0 percentage points, indicating that clustering can be a powerful technique in some datasets. Even when using the maximum cluster count of 20, prediction results were either not affected or improved, with a relatively small impact on the training and prediction time. The only exception to this rule was again BPIC12, which produced a 0.2 percentage point worse result with Clust20 than without any event attribute data. However, in this case, the result was 0.3 percentage points worse also when the raw event attribute data was included.
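The differences quoted above can be recomputed from the success rates in Table 5. The snippet below copies only the relevant rows of the table; the helper name gain is introduced here for illustration.

```python
# Success rates (%) copied from Table 5 for the rows discussed in the text.
success_rate = {
    "BPIC12": {"None": 85.0, "Clust20": 84.8, "Raw": 84.7},
    "BPIC18": {"None": 58.9, "Raw": 62.2, "Both20": 66.6},
}


def gain(dataset, feature, baseline):
    """Difference in success rate, in percentage points, rounded to one decimal."""
    rates = success_rate[dataset]
    return round(rates[feature] - rates[baseline], 1)


bpic18_vs_none = gain("BPIC18", "Both20", "None")
bpic18_vs_raw = gain("BPIC18", "Both20", "Raw")
bpic12_clust20 = gain("BPIC12", "Clust20", "None")
bpic12_raw = gain("BPIC12", "Raw", "None")
print(bpic18_vs_none, bpic18_vs_raw, bpic12_clust20, bpic12_raw)
```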

In all the datasets, the best prediction accuracy is always achieved either by using only clustering with cluster count of 80, or by using both clustering and raw attributes at the same time. Also, as indicated especially in BPIC13 tests, the more event attribute values there are, the longer the training and testing will take, thus making the clustering approach even more tempting.

7.1 Threats to validity

As threats to the validity of the results in this paper, it is clear that there are a lot of variables involved. As the initial set of parameter values, we used parameters that were found good enough in our earlier work and made some improvement attempts based on the results we got. It is most probable that the parameters we used were not optimal in each test run. We also did not test all the parameter combinations, and the ones we did test were often tested only once, even though there was some randomness involved, e.g., in selecting the initial cluster centers in the XMeans algorithm. However, we think that since we tested the results in several different datasets and the results were averaged over all the tested datasets, our results can be used at least as a baseline for further studies. All the results generated by the test runs, as well as all the source data and the test framework itself, are available in the support materials (https://github.com/mhinkka/articles).

Also, we did not test with datasets having a large number of event attributes, the maximum number tested being 8. However, since the size of the input vectors is completely user configurable when performing event attribute clustering, users can easily set limits on the input vector length, which should move the burden from the RNN to the clustering algorithms, which are usually more efficient in handling large numbers of features and feature values.

When evaluating the results of the performed tests and comparing them with other similar works, it should be taken into account that data sampling was used in several phases of the testing process. Thus, some of the data in the datasets was not used at all.

8 Conclusions

Clustering can be applied to attribute values to improve the accuracy of predictions performed on running cases. In three of the five tested datasets, having event attribute clusters encoded into the input vectors outperformed having the actual attribute values in the input vector. In addition, because raw attribute values directly affect the input vector length, the training and prediction time is also directly affected by the number of unique event attribute values. Clustering does not have this problem: the number of elements reserved in the input vector for clustered event attribute values can be adjusted freely. The memory usage is directly affected by the length of the input vector. In the tested cases, where the tested cluster sizes were 20, 40 and 80, the number of clusters that gave the best prediction accuracy seemed to depend very much on the dataset. In some cases, having more clusters improved the accuracy, whereas in others it did not have any significant impact, or even made the accuracy worse. We also found that in some cases, having attribute cluster indicators in the input vectors improved the prediction even if the input vectors also included all the actual attribute values.
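The difference in input vector length between the two encodings can be illustrated with a minimal sketch. The toy events, the attribute names and the helper functions below are hypothetical and are not taken from the test framework; the cluster label is assumed to come from some clustering step that is not shown here.

```python
def one_hot(index, size):
    """Return a one-hot list of the given size with a 1 at position index."""
    vec = [0] * size
    vec[index] = 1
    return vec


def raw_attribute_vector(event, vocab):
    """Concatenate one one-hot block per attribute.

    vocab maps each attribute name to the sorted list of its unique values,
    so the resulting length grows with the number of unique values.
    """
    vec = []
    for attr, values in vocab.items():
        vec.extend(one_hot(values.index(event[attr]), len(values)))
    return vec


def cluster_vector(cluster_label, num_clusters):
    """One-hot encode a cluster label; the length is fixed by num_clusters."""
    return one_hot(cluster_label, num_clusters)


# Hypothetical toy events with two event attributes.
events = [
    {"resource": "r1", "status": "open"},
    {"resource": "r2", "status": "open"},
    {"resource": "r3", "status": "closed"},
    {"resource": "r4", "status": "closed"},
]
vocab = {attr: sorted({e[attr] for e in events}) for attr in events[0]}

raw_vec = raw_attribute_vector(events[0], vocab)   # 4 + 2 elements
clust_vec = cluster_vector(0, num_clusters=2)      # always 2 elements
```

In this sketch, adding a new unique resource value lengthens every raw vector by one element, whereas the cluster vector keeps the length chosen by the user.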

As future work, it could be interesting to first filter out some of the most rarely occurring attribute values before clustering the values. This could potentially reduce the amount of noise in the data to be clustered and thus make the clustering algorithm less sensitive to noisy data. Another idea that we leave for future study is whether it would be beneficial to first run a feature selection algorithm, such as influence analysis [11], recursive feature elimination [6] or mRMR [2], to find the attribute values that correlate the most with the prediction results, and then add those attribute values into the input vectors as raw one-hot encoded attribute values in addition to the one-hot encoded cluster labels. More work is also required to understand exactly which properties of the event log affect the optimal number of clusters to use. Finally, more study is required to understand whether a clustering approach similar to the one performed for event attributes in this work could also be applied to encoding case attributes.
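One possible reading of the rare-value filtering idea is sketched below. This is only an assumption about how such a preprocessing step could look: the function name, the `min_count` threshold and the `__RARE__` placeholder are hypothetical, and replacing rare values with a shared placeholder (rather than dropping them) is a choice made only for this illustration.

```python
from collections import Counter


def filter_rare_values(events, attr, min_count, placeholder="__RARE__"):
    """Replace values of attr that occur fewer than min_count times.

    Rare values are mapped to a shared placeholder so that every event
    still has a value for the attribute when it is passed to clustering.
    """
    counts = Counter(e[attr] for e in events)
    return [
        {**e, attr: e[attr] if counts[e[attr]] >= min_count else placeholder}
        for e in events
    ]


# Hypothetical toy events: "r9" occurs only once and is treated as rare.
events = [
    {"resource": "r1"},
    {"resource": "r1"},
    {"resource": "r9"},
]
filtered = filter_rare_values(events, "resource", min_count=2)
```

The original events are left unmodified, so the filtered and unfiltered variants could be compared in the same test run.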

9 Acknowledgements

We want to thank QPR Software Plc for funding our research. The financial support of the Academy of Finland projects 277522 and 313469 is acknowledged.


  • [1] K. Cho, B. van Merrienboer, D. Bahdanau, and Y. Bengio. On the properties of neural machine translation: Encoder-decoder approaches. In D. Wu, M. Carpuat, X. Carreras, and E. M. Vecchi, editors, Proceedings of SSST@EMNLP 2014, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, Doha, Qatar, 25 October 2014, pages 103–111. Association for Computational Linguistics, 2014.
  • [2] C. H. Q. Ding and H. Peng. Minimum redundancy feature selection from microarray gene expression data. J. Bioinformatics and Computational Biology, 3(2):185–206, 2005.
  • [3] J. Evermann, J. Rehse, and P. Fettke. Predicting process behaviour using deep learning. Decision Support Systems, 100:129–140, 2017.
  • [4] C. D. Francescomarino, M. Dumas, F. M. Maggi, and I. Teinemaa. Clustering-based predictive process monitoring. CoRR, abs/1506.01428, 2015.
  • [5] C. D. Francescomarino, C. Ghidini, F. M. Maggi, and F. Milani. Predictive process monitoring methods: Which one suits me best? In Weske et al. [19], pages 462–479.
  • [6] P. M. Granitto, C. Furlanello, F. Biasioli, and F. Gasperi. Recursive feature elimination with random forest for PTR-MS analysis of agroindustrial products. Chemometrics and Intelligent Laboratory Systems, 83(2):83–90, 2006.
  • [7] M. Hinkka, T. Lehto, K. Heljanko, and A. Jung. Structural feature selection for event logs. In E. Teniente and M. Weidlich, editors, Business Process Management Workshops - BPM 2017 International Workshops, Barcelona, Spain, September 10-11, 2017, Revised Papers, volume 308 of Lecture Notes in Business Information Processing, pages 20–35. Springer, 2017.
  • [8] M. Hinkka, T. Lehto, K. Heljanko, and A. Jung. Classifying process instances using recurrent neural networks. In F. Daniel, Q. Z. Sheng, and H. Motahari, editors, Business Process Management Workshops - BPM 2018 International Workshops, Sydney, NSW, Australia, September 9-14, 2018, Revised Papers, volume 342 of Lecture Notes in Business Information Processing, pages 313–324. Springer, 2018.
  • [9] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.
  • [10] C. Laurent, G. Pereyra, P. Brakel, Y. Zhang, and Y. Bengio. Batch normalized recurrent neural networks. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2016, Shanghai, China, March 20-25, 2016, pages 2657–2661. IEEE, 2016.
  • [11] T. Lehto, M. Hinkka, and J. Hollmén. Focusing business improvements using process mining based influence analysis. In M. L. Rosa, P. Loos, and O. Pastor, editors, Business Process Management Forum - BPM Forum 2016, Rio de Janeiro, Brazil, September 18-22, 2016, Proceedings, volume 260 of Lecture Notes in Business Information Processing, pages 177–192. Springer, 2016.
  • [12] N. Navarin, B. Vincenzi, M. Polato, and A. Sperduti. LSTM networks for data-aware remaining time prediction of business process instances. In 2017 IEEE Symposium Series on Computational Intelligence, SSCI 2017, Honolulu, HI, USA, November 27 - Dec. 1, 2017, pages 1–7. IEEE, 2017.
  • [13] T. Nolle, A. Seeliger, and M. Mühlhäuser. Binet: Multivariate business process anomaly detection using deep learning. In Weske et al. [19], pages 271–287.
  • [14] D. Pelleg and A. W. Moore. X-means: Extending k-means with efficient estimation of the number of clusters. In P. Langley, editor, Proceedings of the Seventeenth International Conference on Machine Learning (ICML 2000), Stanford University, Stanford, CA, USA, June 29 - July 2, 2000, pages 727–734. Morgan Kaufmann, 2000.
  • [15] S. Schönig, R. Jasinski, L. Ackermann, and S. Jablonski. Deep learning process prediction with discrete and continuous data features. In Proceedings of the 13th International Conference on Evaluation of Novel Approaches to Software Engineering, ENASE 2018, Funchal, Madeira, Portugal, March 23-24, 2018., pages 314–319, 2018.
  • [16] N. Tax, I. Teinemaa, and S. J. van Zelst. An interdisciplinary comparison of sequence modeling methods for next-element prediction. CoRR, abs/1811.00062, 2018.
  • [17] N. Tax, I. Verenich, M. L. Rosa, and M. Dumas. Predictive business process monitoring with LSTM neural networks. In E. Dubois and K. Pohl, editors, Advanced Information Systems Engineering - 29th International Conference, CAiSE 2017, Essen, Germany, June 12-16, 2017, Proceedings, volume 10253 of Lecture Notes in Computer Science, pages 477–492. Springer, 2017.
  • [18] I. Verenich, M. Dumas, M. L. Rosa, F. M. Maggi, D. Chasovskyi, and A. Rozumnyi. Tell me what’s ahead? predicting remaining activity sequences of business process instances. June 2016.
  • [19] M. Weske, M. Montali, I. Weber, and J. vom Brocke, editors. Business Process Management - 16th International Conference, BPM 2018, Sydney, NSW, Australia, September 9-14, 2018, Proceedings, volume 11080 of Lecture Notes in Computer Science. Springer, 2018.