Exploiting Event Log Data-Attributes in RNN Based Prediction
Abstract
In predictive process analytics, current and historical process data in event logs are used to predict the future, e.g., the next activity or how long a process will still take to complete. Recurrent neural networks (RNNs) and their subclasses have been demonstrated to be well suited for creating prediction models. Thus far, event attributes have not been fully utilized in these models. The biggest challenge in exploiting them in prediction models is the potentially large number of event attributes and attribute values. We present a novel clustering technique which allows for trade-offs between prediction accuracy and the time needed for model training and prediction. As an additional finding, combining this clustering method with raw event attribute values provides even better prediction accuracy at the cost of additional time required for training and prediction. We also built a highly configurable test framework that can be used to efficiently evaluate different prediction approaches and parameterizations.
Keywords:
process mining, predictive process analytics, prediction, recurrent neural networks, gated recurrent unit
Markku Hinkka, Teemu Lehto, Keijo Heljanko
1 Introduction
Event logs generated by systems in business processes are used in Process Mining to automatically build real-life process definitions and as-is models behind those event logs. There is a growing number of applications for predicting the properties of newly added event log cases, or process instances, based on case data imported earlier into the system[3][4][12][17]. The more the users start to understand their own processes, the more they want to optimize them. This optimization can be facilitated by performing predictions. In order to be able to predict properties of new and ongoing cases, as much information as possible should be collected that is related to the event log traces and relevant to the properties to be predicted. Based on this information, a model of the system creating the event logs can be created. In our approach, the model creation is performed using supervised machine learning techniques.
In our previous work [7], we explored the possibility of using machine learning techniques for classification and root cause analysis in a process mining related classification task. In that paper, we evaluated the efficiency of several feature selection techniques and sets of structural features (a.k.a. activity patterns) based on process paths in process mining models in the context of a classification task. One of the biggest problems with that approach is finding the structural features that have the most impact on the classification result, e.g., whether to use only activity occurrences, transitions between two activities, activity orders, or other even more complicated types of structural features such as detected subprocesses or repeats. For this purpose, we proposed another approach in [8], where we examined using recurrent neural network techniques for classification and prediction. These techniques are capable of automatically learning more complicated causal relationships between activity occurrences in activity sequences. We evaluated several different approaches and parameters for the recurrent neural network techniques and compared the results with the results we collected in our earlier work. In both of the previous publications [7][8], we focused on boolean-type classification tasks based on the activity sequences only.
The primary motivation for this paper is to further improve our techniques by exploring approaches to exploiting other properties commonly available in event logs: case and event attribute values and other information derived from events, such as time stamp related information. Our goal is to develop a mechanism that would allow the creation of a tool that is, based on a relatively simple set of parameters and training data, able to produce a prediction model for any case-level prediction task, such as predicting the next activity or the final duration of a running case.
The novelty in our method is based on the concatenation of multiple one-hot encoded feature vectors given as input to the RNN. This vector is constructed out of the same set of features for every event. Features of the vector can vary from one-hot encoded event or case attribute cluster identifiers, or discrete attribute values, to features otherwise derived from the event data, such as temporal features based on event time stamps. The set of used features can be configured separately. We compared several different types of features against each other using multiple different prediction scenarios and datasets.
Another contribution of our method, which we have not found in any earlier publications, is the way we use clustering to make it easy to manage the input vector size no matter how many event and case attributes there are in the dataset. E.g., users can configure the absolute maximum length of the one-hot vector used for the event or case attribute data, which will not be exceeded no matter how many actual attributes the dataset has.
The source code of our prediction engine is available on GitHub (https://github.com/mhinkka/articles).
The rest of this paper is structured as follows: Section 2 is a short summary of the latest developments around the subject. In Section 3, we present the problem statement and the related concepts. Section 4 presents our solution for the problem. In Section 5 we present our test framework used to test our solution. Section 6 describes the used datasets as well as performed prediction scenarios. Section 7 presents the experiments and their results validating our solution. Finally Section 8 draws the final conclusions.
2 Related Work
Lately there has been a lot of academic interest in predictive process monitoring, which can clearly be seen, e.g., in [5], where the authors have collected a survey of 55 accepted academic papers on the subject. In [16], the authors compared several approaches spanning three different research fields: machine learning, process mining and grammar inference. As a result, they found that overall, the techniques from the machine learning field generate more accurate predictions than those from the grammar inference and process mining fields.
In [17] the authors used Long Short-Term Memory (LSTM) recurrent neural networks to predict the next activity and its timestamp. They used one-hot encoded activity labels and three numerical time-based features: duration between the current activity and the previous activity, time within the day and time within the week. Event attributes were not considered at all.
In [3] the authors trained LSTM networks to predict the next activity. In this case, however, network inputs were created by concatenating categorical, character string valued event attributes and then encoding these attributes via an embedding space. They also note that this approach was feasible only because of the small number of unique values each attribute had in their test datasets. In [15], the authors took a very similar approach based on an LSTM network, but this time also incorporated both discrete and continuous event attribute values. Discrete values were one-hot encoded, whereas continuous values were normalized using min-max normalization and added to the input vectors as single values.
In [13] the authors used Gated Recurrent Unit (GRU) recurrent neural networks to detect anomalies in event logs. One one-hot encoded vector was created for activity labels and one for each of the included string valued event attributes. These vectors were then concatenated, in similar fashion to our solution, into one vector representing one event, which was then given as input to the network. We used this approach for benchmarking our own clustering based approach (labeled as the Raw feature in the text below). The system proposed in their paper was able to predict both the next activity and the next values of event attributes. However, it does not take case attributes and temporal attributes into account.
In [18] the authors trained an RNN to predict the most likely future activity sequence of a running process based only on the sequence of activity labels. Similarly, our earlier publication [7] used sequences of activity labels to train an LSTM network to perform a boolean classification of cases.
None of the mentioned earlier works present a solution that is scalable for datasets having lots of event- or case attributes.
3 Problem
Using RNNs to perform case-level predictions on event logs has lately been studied a lot. However, there has not been any scalable approach for handling event and case attributes in an RNN setting. Instead, e.g., in [13], the authors used a separate one-hot encoded vector for each attribute. With this kind of approach, a dataset with, e.g., 10 different attributes, each having 10 unique values, would already require a vector of 100 elements to be added as input for every event. The longer the input vectors become, the more time and memory it takes to train accurate models from them. This also increases the time and memory required to use the model for predictions.
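The growth of the input vector can be illustrated with a minimal sketch. The attribute names (`attr_0` ... `attr_9`) and values (`v0` ... `v9`) are hypothetical and only serve to reproduce the 10 x 10 example; the helper functions are not part of the prediction engine.

```python
def one_hot(value, vocabulary):
    """Return a 0/1 list with a single 1 at the position of `value`."""
    return [1 if value == v else 0 for v in vocabulary]


def encode_event_raw(event, vocabularies):
    """Concatenate one one-hot vector per attribute (the per-attribute encoding)."""
    vector = []
    for name, vocabulary in vocabularies.items():
        vector.extend(one_hot(event.get(name), vocabulary))
    return vector


# Hypothetical log: 10 attributes, each with 10 unique values.
vocabularies = {f"attr_{i}": [f"v{j}" for j in range(10)] for i in range(10)}
event = {f"attr_{i}": "v3" for i in range(10)}

vector = encode_event_raw(event, vocabularies)
print(len(vector))  # number of input elements added for every event
print(sum(vector))  # one active element per attribute
```

Every additional attribute or unique value lengthens the vector for every event of every case, which is the scaling problem addressed in the next section.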
In some prediction scenarios, prediction accuracy can also be improved relatively easily by including features derived from the timestamps of the events, for example, when predicting the duration until a running case will move to its next activity, or when predicting the total duration of a case. One could also encode weekdays, days of month, hours of day and similar information in order to make the model learn, e.g., the effect of holidays on the throughput of hospital visits.
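As a rough illustration of such timestamp-derived features, the following sketch encodes the time since the previous event, the weekday and the hour of day. The function name, the chosen features and the `max_gap_seconds` scaling constant are assumptions made for this example, not the exact encoding used in our engine. The duration is scaled into the range [0, 1] so that all input elements stay on a comparable scale.

```python
from datetime import datetime


def one_hot(index, size):
    """Return a 0/1 list of length `size` with a single 1 at `index`."""
    return [1 if i == index else 0 for i in range(size)]


def temporal_features(current, previous, max_gap_seconds):
    """Encode one event's timestamp as [scaled gap] + weekday (7) + hour (24)."""
    gap = (current - previous).total_seconds()
    # Clamp and scale the gap to [0, 1]; max_gap_seconds is a hypothetical bound.
    gap_scaled = min(max(gap / max_gap_seconds, 0.0), 1.0)
    return [gap_scaled] + one_hot(current.weekday(), 7) + one_hot(current.hour, 24)


previous_event = datetime(2018, 12, 24, 9, 0)
current_event = datetime(2018, 12, 24, 15, 0)
features = temporal_features(current_event, previous_event, max_gap_seconds=24 * 3600)
print(len(features))
```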
One more issue to take into account when selecting input data to be fed to the RNN is that in order to improve the convergence rate of RNN, input elements should be normalized [10].
4 Solution
We decided to include several feature types in the input vectors of the RNN. Input vectors are formatted as shown in Table 1, where each column represents one feature vector element, identified by the index of its feature type and the index of the element within that feature. The number of columns is thus determined by the number of feature types used in the feature vector and, for each feature type, the number of elements it requires in the input vector. Each feature type produces one or more numeric elements, which are then concatenated into one actual input vector passed to the RNN in both the training and prediction phases. Table 2 shows an example input vector combining the following feature types: the activity label, raw event attribute values (only a single event attribute, named food, having four unique values) and the event attribute cluster, where clustering has been performed separately for each unique activity.
The following subsections describe in more detail the types of features used in our research. However, this mechanism can easily also incorporate other types of features not described here, as long as they can be converted into numeric vectors of equal length for every event of a case.
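Table 1: Input vector format, one column per feature vector element.
… | … | … | … |
Table 2: Example input vectors with activity label, raw food attribute value and event attribute cluster columns.
Row | activity[0] | activity[1] | food[0] | food[1] | food[2] | food[3] | cluster[0] | cluster[1] |
---|---|---|---|---|---|---|---|---|
1 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
2 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 |
3 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
4 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
5 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 |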
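The concatenation of feature vectors can be sketched as follows. The example reproduces rows 1 and 5 of the example table under the assumption that its columns are grouped as two activity label columns, four raw attribute value columns and two cluster label columns; the helper names are hypothetical.

```python
def one_hot(index, size):
    """Return a 0/1 list of length `size` with a single 1 at `index`."""
    return [1 if i == index else 0 for i in range(size)]


def build_input_vector(feature_vectors):
    """Concatenate per-feature vectors, in a fixed order, into one input vector."""
    vector = []
    for part in feature_vectors:
        vector.extend(part)
    return vector


# Assumed grouping: activity label (2), raw attribute value (4), cluster label (2).
row1 = build_input_vector([one_hot(0, 2), one_hot(0, 4), one_hot(0, 2)])
row5 = build_input_vector([one_hot(1, 2), one_hot(3, 4), one_hot(1, 2)])
print(row1)
print(row5)
```

Because every feature type always contributes the same number of elements, all events of all cases yield vectors of equal length, which is what the RNN input requires.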
4.1 Event Attributes
Our primary solution for incorporating the information in event attributes is to cluster all the event attribute values in the training set and then use a one-hot encoded cluster identifier to represent all the attribute values of the event. The clustering algorithm must be one that tries to automatically find the optimal number of clusters for the given dataset within the range of 0 to N clusters, where N can be configured by the user. By changing N, the user can easily configure the maximum length of the one-hot vector as well as the level of detail at which attribute information will be tracked. For this paper, we experimented with a slightly modified version of the X-means algorithm [14].
It is very common that different activities get processed by different resources, yielding a completely different set of possible attribute values. E.g., different departments in a hospital have different people, materials and processes. Also in the example feature vector shown in Table 2, the food event attribute has a completely different set of possible values depending on the activity, since, e.g., an external system may forbid an activity of type eat from having the food event attribute value water. If we clustered all the event attributes using a single clustering, we would easily lose this activity type specific information.
In order to retain this activity specific information, we used a separate clustering for each unique activity type. All the event attribute clusters are encoded into one one-hot encoded vector representing only the resulting cluster label for that event, no matter what its activity is. In the example table, this is shown in the cluster label columns, each of which marks the rows having the corresponding clustering label. E.g., in the example case, the first cluster label column is 1 in both rows 1 and 2. However, row 1 is in that cluster because it is in the 0th cluster of its own activity, whereas row 2 is in that cluster because it is in the 0th cluster of a different activity. Thus, in order to identify the actual cluster, one would require both the activity label and the cluster label. For the RNN to be able to properly learn about the actual event attribute values, it needs to be given both the activity label and the cluster label in the input vector. Below, this approach is labeled as ClustN, where N is the maximum cluster count.
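The per-activity clustering can be sketched as below. This is a simplified illustration, not our implementation: it uses a tiny deterministic k-means with a fixed cluster count as a stand-in for the modified X-means algorithm (which also selects the number of clusters), and the activity names `act_a` and `act_b` and the attribute vectors are hypothetical.

```python
from collections import defaultdict


def one_hot(index, size):
    return [1 if i == index else 0 for i in range(size)]


def nearest(point, centers):
    """Index of the center with the smallest squared Euclidean distance."""
    dists = [sum((a - b) ** 2 for a, b in zip(point, c)) for c in centers]
    return dists.index(min(dists))


def kmeans(points, k, iterations=10):
    """Tiny k-means; the first k distinct points are the initial centers."""
    centers = []
    for p in points:
        if p not in centers:
            centers.append(p)
        if len(centers) == k:
            break
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for p in points:
            groups[nearest(p, centers)].append(p)
        # Keep the old center if a group ends up empty.
        centers = [[sum(col) / len(g) for col in zip(*g)] if g else c
                   for g, c in zip(groups, centers)]
    return centers


def fit_per_activity(events, k):
    """Cluster attribute vectors separately for each activity."""
    by_activity = defaultdict(list)
    for activity, attrs in events:
        by_activity[activity].append(attrs)
    return {a: kmeans(pts, k) for a, pts in by_activity.items()}


def encode(event, activities, models, k):
    """Activity one-hot followed by a shared, activity-local cluster one-hot."""
    activity, attrs = event
    label = nearest(attrs, models[activity])
    return one_hot(activities.index(activity), len(activities)) + one_hot(label, k)


events = [("act_a", [1, 0, 0, 0]), ("act_b", [0, 0, 1, 0]),
          ("act_a", [0, 1, 0, 0]), ("act_a", [0, 1, 0, 0]),
          ("act_b", [0, 0, 0, 1])]
activities = ["act_a", "act_b"]
models = fit_per_activity(events, k=2)
vectors = [encode(e, activities, models, 2) for e in events]
print(vectors[0], vectors[1])
```

The first two events receive the same cluster label even though their attribute values differ, so the label is only meaningful together with the activity label.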
For benchmarking, we also experimented with an implementation where event attributes were used so that every event attribute is encoded into its own one-hot encoded vector and then concatenated into the actual input vectors. Below, this approach is referred to as Raw. Finally, we experimented also using both Raw and Clustered event attribute values. Below, this approach is referred to as BothN, where N is the maximum cluster count.
4.2 Case Attributes
For case attributes, an approach similar to the one used for event attributes could be applied. When training the model with event-based input vectors, every input vector given to the RNN could be appended with the vector built for the case of the event. We chose not to include this feature in this paper due to the limited space available. However, we plan to experiment with it as possible future work.
5 Test Framework
In order to test our solution, we built a test framework that roughly consists of two main components: a process mining tool capable of generating JavaScript Object Notation (JSON) formatted data structures based on the event data, and a prediction engine which takes these generated JSON data structures and performs either prediction or training.
The training workflow of this framework roughly follows the flowcharts shown in Figure 1.

5.1 Process mining tool
As the process mining tool, we used a custom application built on top of APIs provided by QPR ProcessAnalyzer (https://www.qpr.com/products/qpr-processanalyzer) and its expression language (https://devnet.onqpr.com/pawiki). For every used dataset and prediction task, we created a separate expression language query that generated the desired structured information for both training and testing purposes.
The generation consisted of first creating a query that generated a set of cases to export as well as selecting attribute values to export. For this paper, we included in our tests only attributes whose most used value was used in at least 5% of all the cases in the dataset. Thus, e.g., unique event identifiers were not considered at all in any of the performed tests. Finally, the resulting objects were serialized into a simple JSON format that contained all the required information about the activities, cases, events and the expected outcomes.
For the experiments, we generated two temporally isolated sets of cases. We first selected the newest 10% of the cases and used them as the test data. For the training data, we selected all the cases whose last event had occurred before any of the events picked into the test dataset. Thus, the time just before the first test dataset event represents the time when the model was trained using all the complete cases gathered at that point. The test data was used only after the model had been fully trained using the training data.
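The temporal split can be sketched as follows. The sketch assumes that "newest" means the latest first event and represents each case simply as a sorted list of numeric event timestamps; both are assumptions made for illustration, as is the function name.

```python
def temporal_split(cases, test_fraction=0.1):
    """Split cases into temporally isolated training and test sets.

    cases: dict mapping case id -> sorted list of event timestamps.
    """
    # Assumption: cases are ordered by the time of their first event.
    ordered = sorted(cases, key=lambda case_id: cases[case_id][0])
    n_test = max(1, int(len(ordered) * test_fraction))
    test_ids = ordered[-n_test:]
    # Earliest event of any test case.
    cutoff = min(cases[case_id][0] for case_id in test_ids)
    # Keep only cases that were already complete before the cutoff.
    train_ids = [case_id for case_id in ordered[:-n_test]
                 if cases[case_id][-1] < cutoff]
    return train_ids, test_ids


# Ten hypothetical cases; case c7 is still running when the test period starts.
cases = {f"c{i}": [10 * i, 10 * i + 5] for i in range(10)}
cases["c7"] = [70, 95]
train_ids, test_ids = temporal_split(cases)
print(train_ids, test_ids)
```

Cases that overlap the test period, such as `c7` here, end up in neither set, which keeps the two sets temporally isolated.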
5.2 Prediction engine
We extended the Python-based prediction engine that was used in our earlier work [7] by adding several new functionalities. The engine still supports most of the hyperparameters that we experimented with in our earlier work, such as the RNN unit type, the number of RNN layers and the batch size.
The prediction engine we built for this work takes a single configuration file as input and outputs test result rows into a CSV file. A configuration file is a JSON file describing exactly the set of parameters, given as key-value pairs to the actual model trainer and model tester functions, used to perform a single test. At the top level of the JSON file, there is a single JSON object that describes the test runs in a hierarchical fashion, so that definitions made at an upper level also affect all the definitions at the lower levels of the hierarchy. The hierarchy is built using a special "runs" property that has an array-type value that can contain any number of JSON objects. There is also a special "foreach" key that can be used to iterate over sets of additional parameters on top of the parameters of the object in which the key is located. As a result, a very flexible set of tests can be run sequentially without any user intervention. The configuration file supports about 60 configurable hyperparameters, the most notable of which, from the perspective of our experiments, are listed in Table 4. For the purposes of the paper, we decided to separate the prediction engine completely from the process mining tool. Thus, the JSON data generated by the process mining tool is written to disk and read from disk files by the prediction engine. A simplified example test run configuration can be seen in Figure 2. This configuration runs six next activity prediction tests using two different datasets and three different cluster sizes.
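One possible reading of this hierarchical expansion is sketched below. The exact semantics of "runs" and "foreach" and all parameter names other than those two keys (`predict`, `max_num_clusters`, `dataset`) are assumptions made for illustration; the actual engine may resolve the hierarchy differently.

```python
def expand(config, inherited=None):
    """Flatten a hierarchical run configuration into a list of flat parameter dicts."""
    params = dict(inherited or {})
    # Parameters defined at this level override the inherited ones.
    params.update({k: v for k, v in config.items() if k not in ("runs", "foreach")})
    # Each "foreach" entry adds its parameters on top of this object's parameters.
    variants = [dict(params, **extra) for extra in config.get("foreach", [{}])]
    children = config.get("runs")
    if not children:
        return variants
    result = []
    for variant in variants:
        for child in children:
            result.extend(expand(child, variant))
    return result


# Hypothetical configuration, as it might look after json.load().
config = {
    "predict": "next_activity",
    "foreach": [{"max_num_clusters": 20},
                {"max_num_clusters": 40},
                {"max_num_clusters": 80}],
    "runs": [{"dataset": "bpic12"}, {"dataset": "bpic13"}],
}
tests = expand(config)
for test in tests:
    print(test)
```

Under these assumptions, two datasets combined with three cluster sizes expand into six flat test definitions.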
5.2.1 Training
Every test run begins by loading the JSON event log data from the file generated by the process mining tool into memory. After this, the event log is pre-processed. The very first step of pre-processing optionally filters out excess cases by keeping only the configured number of randomly selected cases and discarding the rest. This is used in some tests to speed up training, which would otherwise have taken much longer. The next step is splitting the training data into the actual training data and the validation data used to find the best performing model out of all the model states during all the test iterations. For this, we picked 75% of the cases for the training dataset and the rest for the validation dataset.
After the event log is read and prepared, we initialize the actual prediction model and the data used to generate the actual input vectors. This data initialization involves, depending on the prediction scenario, splitting cases into prefixes and also taking a random sample of the actual available data if the amount of data exceeds the configured maximum amount. In this phase, if needed, we also cluster event attributes. Also the initialization of all the other derived information needed for the input vectors, such as the encoding of activity transition durations, is performed at this point.
Finally, after the model is initialized, we start the actual training, in which we feed all the requested concatenated feature vectors as well as the expected outcomes into the RNN model repeatedly for the whole training set until 100 test iterations have passed. The number of actual epochs trained in each iteration is configurable. In our experiments, the total number of epochs was set to 10, except for BPIC13 (see Table 4). After every test iteration, the model is validated against the validation set. In order to improve validation performance, if the size of the validation set is larger than a separately specified limit (Max validation test size), a random sample of the whole validation set is used. These test results, including additional status and timing related information, are written into the resulting test result CSV file. If the prediction accuracy of the model against the validation set is found to be better than the accuracy of any of the models found thus far, the network state is stored for that model. Finally, after all the training, the model having the best validation accuracy is picked as the trained model and saved to a file. This model file contains, in addition to the details of the trained RNN model, all the metadata required to perform the clustering and filtering operations for the test data.
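The iteration and model selection logic described above can be sketched independently of the RNN library. In this minimal sketch, `train_step` and `validate` are hypothetical callables standing in for the actual training and validation code, and the model is a plain dictionary so that the example stays runnable.

```python
import copy
import random


def train_with_validation(model, train_step, validate, validation_set,
                          iterations=100, max_validation_size=10000, seed=0):
    """Train for a fixed number of iterations and keep the best validated state."""
    rng = random.Random(seed)
    best_accuracy, best_state = -1.0, None
    for _ in range(iterations):
        train_step(model)
        sample = validation_set
        # Validate on a random sample if the validation set exceeds the limit.
        if len(sample) > max_validation_size:
            sample = rng.sample(validation_set, max_validation_size)
        accuracy = validate(model, sample)
        if accuracy > best_accuracy:
            best_accuracy, best_state = accuracy, copy.deepcopy(model)
    return best_state, best_accuracy


# Toy stand-ins: the "model" only counts training steps, and the validation
# accuracy per step is scripted.
scripted_accuracies = [0.5, 0.7, 0.6]


def toy_train_step(model):
    model["step"] += 1


def toy_validate(model, sample):
    return scripted_accuracies[model["step"] - 1]


best_state, best_accuracy = train_with_validation(
    {"step": 0}, toy_train_step, toy_validate, list(range(5)), iterations=3)
print(best_state, best_accuracy)
```

The returned state is the one from the second iteration, not the last one, which mirrors picking the model with the best validation accuracy rather than the final network state.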
5.2.2 Testing
In the testing phase, an independent set of cases read from a separate JSON file is tested against the model built in the previous step. After initializing the event log following similar steps as in the training phase, the trained model is loaded from the file. After the model has been loaded, the model is asked for a prediction for each input vector built from the test data. The prediction result accuracy, as well as some status and timing related information is written to the result CSV file.
6 Test Setup
We performed our tests using several different datasets. Some details of the used datasets can be found in Table 3.
Event Log | # Event Attributes | Case filter percentage |
---|---|---|
BPIC12 (https://doi.org/10.4121/uuid:3926db30-f712-4394-aebc-75976070e91f) | 1 | 100% |
BPIC13, incidents (https://doi.org/10.4121/uuid:500573e6-accc-4b0c-9576-aa5468b10cee) | 8 | 100% |
BPIC14 (https://doi.org/10.4121/uuid:c3e5d162-0cfd-4bb0-bd82-af5268819c35) | 1 | 100% |
BPIC17 (https://doi.org/10.4121/uuid:5f3067df-f10b-45da-b98b-86ae4c7a310b) | 4 | 50% |
BPIC18 (https://doi.org/10.4121/uuid:3301445f-95e8-4ff0-98a4-901f1f204972) | 5 | 20% |
For each dataset, we performed next activity prediction where we wanted to predict the next activity of any ongoing case. In this case, we split every input case into possibly multiple virtual cases depending on the number of events the case had. If the length of the case was shorter than 4, the whole case was ignored. If the length was equal or higher, then a separate virtual case was created for all prefixes at least of length 4. Thus, for a case of length 6, 3 cases were created: One with length 4, one with 5 and one with 6. For all these prefixes, the next activity label was used as the expected outcome. For the full length case, the expected outcome was a special finished-token.
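The prefix generation can be sketched as follows. The token name `<finished>` and the function name are hypothetical; only the minimum prefix length of 4 and the expected outcomes follow the description above.

```python
FINISHED = "<finished>"  # hypothetical name for the special finished-token


def make_prefix_cases(activities, min_length=4):
    """Split one case into (prefix, expected next activity) training examples."""
    if len(activities) < min_length:
        return []  # too short, the whole case is ignored
    examples = []
    for length in range(min_length, len(activities) + 1):
        prefix = activities[:length]
        # The full-length case is expected to be followed by the finished-token.
        outcome = activities[length] if length < len(activities) else FINISHED
        examples.append((prefix, outcome))
    return examples


case = ["a", "b", "c", "d", "e", "f"]
examples = make_prefix_cases(case)
for prefix, outcome in examples:
    print(prefix, "->", outcome)
```

For the case of length 6, this yields three virtual cases with prefix lengths 4, 5 and 6.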
Finally, the order of cases was randomized and, in some test cases, to improve performance, the resulting cases were randomly filtered so that only a percentage of them, specified separately for each dataset, was kept. This percentage is shown in the "Case filter percentage" column of Table 3.
7 Experiments
For the experiments, we used the same system that we used in our previous work [7]. The system ran the Windows 10 operating system and its hardware consisted of a 3.5 GHz Intel Core i5-6600K CPU with 32 GB of main memory and an NVIDIA GeForce GTX 960 GPU with 4 GB of memory. Out of those 4 GB, we reserved 3 GB for the tests. The testing framework was built on the test system using the Python programming language. The actual recurrent neural networks were built using the Lasagne library (https://lasagne.readthedocs.io/), which works on top of Theano (http://deeplearning.net/software/theano/). Theano was configured to use the GPU via CUDA for expression evaluation.
We started our experiments by running next activity predictions using all the tested datasets and maximum cluster counts. The most notable hyperparameters used for these tests are listed in Table 4.
Hyperparameter | Value |
---|---|
# test iterations | 100 |
# epochs per iteration | 1.0 (for BPIC13), 0.1 (for others) |
Clustering method | XMeans [14] |
Include activity occurrence | false |
Use activity-level event clustering | true |
# case clusters | 40 |
# event clusters | 40 |
Max # traces in training | 75000 |
Max # traces in testing | 25000 |
Max validation test size | 10000 |
RNN type | GRU [1] |
# layers | 1 |
Gradient descent optimizer | Adam [9] |
Learning rate | 0.01 |
Hidden dimension size | 256 |
Training data proportion | 75% |
The results of these runs are shown in Table 5. In the table, the Features column shows the used set of features. S.rate shows the achieved prediction success rate. In.v.s. shows the size of the input vector, which gives an indication of the memory usage of that configuration. Finally, the Tra.t. and Pred.t. columns show the time required for performing the training and the prediction for all the cases in the test dataset. In both cases, this time includes the time for setting up the neural network, performing the clusterings and preparing the dataset from the JSON format. Each row in the table represents one test run with a unique combination of dataset and feature set. The None feature represents the case in which there was no event attribute information at all in the input vector, ClustN represents a test with one-hot encoded cluster labels of event attributes clustered into at most N clusters, Raw represents having all one-hot encoded attribute values individually in the input vector, and finally BothN represents having both one-hot encoded attribute values and one-hot encoded cluster labels in the input vector. It should also be noted that, due to the way test runs were performed in the test framework, the None and Raw features were also run using three different cluster sizes, even though this did not cause any differences in the input data. For these features, the test run achieving the best success rate was used. The varying success rates, if any, in these cases are caused mostly by the random sampling applied to the original dataset.
Dataset | Features | S.rate | In.v.s. | Tra.t. | Pred.t. |
---|---|---|---|---|---|
BPIC12 | None | 85.0% | 26 | 690.5s | 6.9s |
BPIC12 | Clust20 | 84.8% | 30 | 701.1s | 7.4s |
BPIC12 | Clust40 | 84.9% | 30 | 686.1s | 6.4s |
BPIC12 | Clust80 | 85.1% | 30 | 701.4s | 7.5s |
BPIC12 | Raw | 84.7% | 29 | 691.2s | 7.5s |
BPIC12 | Both20 | 85.1% | 33 | 704.3s | 8.2s |
BPIC12 | Both40 | 84.7% | 33 | 706.1s | 8.4s |
BPIC12 | Both80 | 84.8% | 33 | 713.1s | 10.3s |
BPIC13 | None | 62.9% | 13 | 27.1s | 0.4s |
BPIC13 | Clust20 | 66.2% | 34 | 31.6s | 1.5s |
BPIC13 | Clust40 | 65.7% | 54 | 31.9s | 0.9s |
BPIC13 | Clust80 | 64.7% | 87 | 37.5s | 1.0s |
BPIC13 | Raw | 67.7% | 1074 | 198.9s | 2.6s |
BPIC13 | Both20 | 66.6% | 1095 | 203.3s | 3.4s |
BPIC13 | Both40 | 66.4% | 1115 | 206.0s | 3.2s |
BPIC13 | Both80 | 67.9% | 1148 | 211.6s | 3.2s |
BPIC14 | None | 42.5% | 41 | 1012.7s | 6.1s |
BPIC14 | Clust20 | 44.5% | 62 | 1056.6s | 8.2s |
BPIC14 | Clust40 | 45.4% | 82 | 1095.9s | 8.8s |
BPIC14 | Clust80 | 45.3% | 103 | 1134.6s | 9.2s |
BPIC14 | Raw | 41.1% | 278 | 1502.5s | 14.0s |
BPIC14 | Both20 | 45.8% | 299 | 1586.6s | 15.7s |
BPIC14 | Both40 | 45.6% | 319 | 1636.2s | 16.2s |
BPIC14 | Both80 | 45.6% | 341 | 1720.1s | 17.2s |
BPIC17 | None | 86.2% | 27 | 1532.7s | 23.4s |
BPIC17 | Clust20 | 88.1% | 48 | 1568.6s | 27.2s |
BPIC17 | Clust40 | 88.2% | 68 | 1619.2s | 29.7s |
BPIC17 | Clust80 | 90.7% | 108 | 1702.2s | 32.8s |
BPIC17 | Raw | 88.1% | 253 | 1914.8s | 43.2s |
BPIC17 | Both20 | 87.1% | 200 | 2005.0s | 54.9s |
BPIC17 | Both40 | 87.7% | 220 | 2070.1s | 49.7s |
BPIC17 | Both80 | 88.1% | 260 | 2265.6s | 60.2s |
BPIC18 | None | 58.9% | 41 | 966.4s | 23.3s |
BPIC18 | Clust20 | 60.2% | 62 | 1012.9s | 27.7s |
BPIC18 | Clust40 | 57.9% | 82 | 1055.9s | 25.0s |
BPIC18 | Clust80 | 58.6% | 117 | 1113.1s | 31.3s |
BPIC18 | Raw | 62.2% | 253 | 1508.0s | 53.2s |
BPIC18 | Both20 | 66.6% | 274 | 1582.9s | 55.7s |
BPIC18 | Both40 | 50.8% | 294 | 1632.5s | 55.4s |
BPIC18 | Both80 | 54.8% | 334 | 1765.0s | 60.3s |
We also aggregated these results over all the datasets. Figure 3 shows the average success rates of the different event attribute encoding techniques over all the tested datasets using a maximum of 40 clusters for event attributes. Figure 4 shows the average input vector lengths. Figure 5 and Figure 6 show the averaged training and prediction times.




Based on all of these results, we can see that including event attribute values improved the prediction accuracy over not including them at all in all datasets. The effect was smallest (0.1 percentage points) in the BPIC12 model, where there was only one event attribute, whose value did not seem to correlate much with the next activity. The greatest effect was achieved in the BPIC18 model, where Both20 outperformed prediction without any event attributes by 7.7 percentage points and prediction with raw attribute values by over 4.0 percentage points, clearly indicating that clustering can be a really powerful technique in some datasets. Even when using the maximum cluster count of 20, prediction results were either not affected or improved, with a relatively small impact on the training and prediction time. The only exception to this rule was again BPIC12, which produced a 0.2 percentage point worse result with Clust20 than without any event attribute data. However, in this case, the result was also 0.3 percentage points worse when the raw event attribute data was included.
In all the datasets, the best prediction accuracy is always achieved either by using only clustering with cluster count of 80, or by using both clustering and raw attributes at the same time. Also, as indicated especially in BPIC13 tests, the more event attribute values there are, the longer the training and testing will take, thus making the clustering approach even more tempting.
7.1 Threats to validity
As threats to the validity of the results in this paper, it is clear that there are a lot of variables involved. As the initial set of parameter values, we used parameters that were found good enough in our earlier work and made some improvement attempts based on the results we got. It is most probable that the set of parameters we used was not optimal in each test run. We also did not test all the parameter combinations, and the ones we did test, we often tested only once, even though there was some randomness involved, e.g., in selecting the initial cluster centers in the XMeans algorithm. However, we think that since we tested the results on several different datasets and the results were averaged over all the tested datasets, our results can be used at least as a baseline for further studies. All the results generated by the test runs, as well as all the source data and the test framework itself, are available in the support materials (https://github.com/mhinkka/articles).
Also, we did not really test with datasets having lots of event attributes, the maximum number tested being 8. However, since the size of the input vectors is completely user configurable when performing event attribute clustering, the user can easily set limits on the input vector length, which should take the burden off the RNN and move it to the clustering algorithms, which are usually more efficient in handling lots of features and feature values.
When evaluating the results of the performed tests and comparing them with other similar works, it should be taken into account that data sampling was used in several phases of the testing process. Thus, some of the data in the data sets were not used at all.
8 Conclusions
Clustering can be applied on attribute values to improve accuracy of predictions performed on running cases. In three of the five experimented data sets, having event attribute clusters encoded into the input vectors outperformed having the actual attribute values in the input vector. In addition, due to raw attribute values having direct effect to input vector lengths, the training and prediction time will also be directly affected by the number of unique event attribute values. Clustering does not have this problem: The number of elements reserved in the input vector for clustered event attribute values can be adjusted freely. The memory usage is directly affected by the length of the input vector. In the tested cases, the number of clusters to use to get the best prediction accuracy seemed to depend very much on the used datasets, when the tested cluster sizes were 20, 40 and 80. In some cases, having more clusters improved the performance, whereas in others, it did not have any significant impact, or even made the accuracy worse. We also found out that in some cases, having attribute cluster indicators in the input vectors improved the prediction even if the input vectors also included all the actual attribute values.
As future work, it could be interesting to filter out the most rarely occurring attribute values before clustering. This could reduce the amount of noise in the data to be clustered and make the clustering algorithm less susceptible to noisy data. Another idea that we leave for future study is to first run a feature selection algorithm, such as influence analysis [11], recursive feature elimination [6] or mRMR [2], to find the attribute values that correlate the most with the prediction results, and to add those attribute values into the input vectors as raw one-hot encoded attribute values in addition to the one-hot encoded cluster labels. More work is also required to understand exactly which properties of the event log affect the optimal number of clusters. Finally, more study is required to understand whether a clustering approach similar to the one applied to event attributes in this work could also be applicable to encoding case attributes.
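The rare-value filtering idea mentioned above could be sketched as follows. This is only an illustration of one possible approach: the threshold `min_count`, the placeholder token `RARE` and the event representation as dictionaries are all assumptions, not part of the evaluated framework.

```python
from collections import Counter
from typing import Dict, List

RARE = "__RARE__"  # hypothetical placeholder for infrequent values


def filter_rare_values(
    events: List[Dict[str, str]], min_count: int
) -> List[Dict[str, str]]:
    """Replace attribute values seen fewer than min_count times with RARE."""
    # Count each (attribute, value) pair over all events.
    counts = Counter(
        (attr, value) for event in events for attr, value in event.items()
    )
    return [
        {
            attr: (value if counts[(attr, value)] >= min_count else RARE)
            for attr, value in event.items()
        }
        for event in events
    ]


# Toy events with a single hypothetical attribute.
events = [{"resource": "alice"}, {"resource": "alice"}, {"resource": "bob"}]
filtered = filter_rare_values(events, min_count=2)
print(filtered)
```

The filtered events would then be one-hot encoded and passed to the clustering step, so that values occurring only a few times share one input element instead of each adding a dimension of their own.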
9 Acknowledgements
We want to thank QPR Software Plc for funding our research. Financial support from Academy of Finland projects 277522 and 313469 is acknowledged.
References
- [1] K. Cho, B. van Merrienboer, D. Bahdanau, and Y. Bengio. On the properties of neural machine translation: Encoder-decoder approaches. In D. Wu, M. Carpuat, X. Carreras, and E. M. Vecchi, editors, Proceedings of SSST@EMNLP 2014, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, Doha, Qatar, 25 October 2014, pages 103–111. Association for Computational Linguistics, 2014.
- [2] C. H. Q. Ding and H. Peng. Minimum redundancy feature selection from microarray gene expression data. J. Bioinformatics and Computational Biology, 3(2):185–206, 2005.
- [3] J. Evermann, J. Rehse, and P. Fettke. Predicting process behaviour using deep learning. Decision Support Systems, 100:129–140, 2017.
- [4] C. D. Francescomarino, M. Dumas, F. M. Maggi, and I. Teinemaa. Clustering-based predictive process monitoring. CoRR, abs/1506.01428, 2015.
- [5] C. D. Francescomarino, C. Ghidini, F. M. Maggi, and F. Milani. Predictive process monitoring methods: Which one suits me best? In Weske et al. [19], pages 462–479.
- [6] P. M. Granitto, C. Furlanello, F. Biasioli, and F. Gasperi. Recursive feature elimination with random forest for PTR-MS analysis of agroindustrial products. Chemometrics and Intelligent Laboratory Systems, 83(2):83–90, 2006.
- [7] M. Hinkka, T. Lehto, K. Heljanko, and A. Jung. Structural feature selection for event logs. In E. Teniente and M. Weidlich, editors, Business Process Management Workshops - BPM 2017 International Workshops, Barcelona, Spain, September 10-11, 2017, Revised Papers, volume 308 of Lecture Notes in Business Information Processing, pages 20–35. Springer, 2017.
- [8] M. Hinkka, T. Lehto, K. Heljanko, and A. Jung. Classifying process instances using recurrent neural networks. In F. Daniel, Q. Z. Sheng, and H. Motahari, editors, Business Process Management Workshops - BPM 2018 International Workshops, Sydney, NSW, Australia, September 9-14, 2018, Revised Papers, volume 342 of Lecture Notes in Business Information Processing, pages 313–324. Springer, 2018.
- [9] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.
- [10] C. Laurent, G. Pereyra, P. Brakel, Y. Zhang, and Y. Bengio. Batch normalized recurrent neural networks. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2016, Shanghai, China, March 20-25, 2016, pages 2657–2661. IEEE, 2016.
- [11] T. Lehto, M. Hinkka, and J. Hollmén. Focusing business improvements using process mining based influence analysis. In M. L. Rosa, P. Loos, and O. Pastor, editors, Business Process Management Forum - BPM Forum 2016, Rio de Janeiro, Brazil, September 18-22, 2016, Proceedings, volume 260 of Lecture Notes in Business Information Processing, pages 177–192. Springer, 2016.
- [12] N. Navarin, B. Vincenzi, M. Polato, and A. Sperduti. LSTM networks for data-aware remaining time prediction of business process instances. In 2017 IEEE Symposium Series on Computational Intelligence, SSCI 2017, Honolulu, HI, USA, November 27 - Dec. 1, 2017, pages 1–7. IEEE, 2017.
- [13] T. Nolle, A. Seeliger, and M. Mühlhäuser. BINet: Multivariate business process anomaly detection using deep learning. In Weske et al. [19], pages 271–287.
- [14] D. Pelleg and A. W. Moore. X-means: Extending k-means with efficient estimation of the number of clusters. In P. Langley, editor, Proceedings of the Seventeenth International Conference on Machine Learning (ICML 2000), Stanford University, Stanford, CA, USA, June 29 - July 2, 2000, pages 727–734. Morgan Kaufmann, 2000.
- [15] S. Schönig, R. Jasinski, L. Ackermann, and S. Jablonski. Deep learning process prediction with discrete and continuous data features. In Proceedings of the 13th International Conference on Evaluation of Novel Approaches to Software Engineering, ENASE 2018, Funchal, Madeira, Portugal, March 23-24, 2018, pages 314–319, 2018.
- [16] N. Tax, I. Teinemaa, and S. J. van Zelst. An interdisciplinary comparison of sequence modeling methods for next-element prediction. CoRR, abs/1811.00062, 2018.
- [17] N. Tax, I. Verenich, M. L. Rosa, and M. Dumas. Predictive business process monitoring with LSTM neural networks. In E. Dubois and K. Pohl, editors, Advanced Information Systems Engineering - 29th International Conference, CAiSE 2017, Essen, Germany, June 12-16, 2017, Proceedings, volume 10253 of Lecture Notes in Computer Science, pages 477–492. Springer, 2017.
- [18] I. Verenich, M. Dumas, M. L. Rosa, F. M. Maggi, D. Chasovskyi, and A. Rozumnyi. Tell me what’s ahead? Predicting remaining activity sequences of business process instances. June 2016.
- [19] M. Weske, M. Montali, I. Weber, and J. vom Brocke, editors. Business Process Management - 16th International Conference, BPM 2018, Sydney, NSW, Australia, September 9-14, 2018, Proceedings, volume 11080 of Lecture Notes in Computer Science. Springer, 2018.