DeepAlign: Alignment-based Process Anomaly Correction using Recurrent Neural Networks
In this paper, we propose DeepAlign, a novel approach to multi-perspective process anomaly correction, based on recurrent neural networks and bidirectional beam search. At the core of the DeepAlign algorithm are two recurrent neural networks trained to predict the next event. One is reading sequences of process executions from left to right, while the other is reading the sequences from right to left. By combining the predictive capabilities of both neural networks, we show that it is possible to calculate sequence alignments, which are used to detect and correct anomalies. DeepAlign utilizes the case-level and event-level attributes to closely model the decisions within a process. We evaluate the performance of our approach on an elaborate data corpus of 30 realistic synthetic event logs and compare it to three state-of-the-art conformance checking methods. DeepAlign produces better corrections than the rest of the field reaching an overall accuracy of across all datasets, whereas the best comparable state-of-the-art method reaches .
Keywords: Business Process Management, Anomaly Detection, Deep Learning, Sequence Alignments
Process anomaly detection can be used to automatically detect deviations in process execution data. This technique infers the process solely based on distributions of the execution data, without relying on an abstract definition of the process itself. While these approaches can accurately pinpoint an anomaly in a process, they do not provide information about what should have been done instead. Although, the knowledge about the occurrence of an anomaly is valuable, much more value lies in the knowledge of what was supposed to happen and how to avoid this behavior in the future.
Process mining techniques are centered around the notion of a process model that describes the correct behavior of a process. Conformance checking techniques can be utilized to analyze process executions for their conformance with a process model. This method has the benefit of not only detecting deviations from the defined process but also of providing the closest conforming path through the process, thereby correcting it.
The correctness of the conformance checking result depends on the quality of the process model. Furthermore, a correct execution of a process is not necessarily defined by a correct order of process steps but can depend on a variety of other parameters. For example, it might not be allowed that the same person executes two consecutive process steps or a process might differ depending on the country it is being executed in. All these possibilities have to be accounted for both in the process model and the conformance checking algorithm to ensure a correct result. If no process model is available, conformance checking cannot be used and the creation of a good reference model is a time-consuming task.
An automatic process anomaly correction is therefore desirable, combining the autonomy of an anomaly detection algorithm with the descriptive results from conformance checking. Against this background, we propose the DeepAlign111Available on GitHub https://github.com/tnolle/deepalign algorithm, which combines these two benefits. It boroughs from the field of anomaly detection and employs two recurrent neural networks (RNN), trained on the task of next event prediction, as an approximate process model. Inspired by the alignment concept from conformance checking, we show that a bidirectional beam search  can be used to align a process execution with the process model as approximated by the two RNNs.
DeepAlign can not only detect that process steps have been skipped, but it can also predict which process steps should have been executed instead. Furthermore, it does not rely on a reference model of the process, nor any prior knowledge about it. It can be used to automatically detect anomalies and to automatically correct them.
Before we describe the DeepAlign algorithm, we must first introduce some concepts from the field of process mining and deep learning.
2.1 Process Mining
Process mining is centered around the idea of human-readable representations of processes called process models. Process models are widely used in business process management as a tool for defining, documenting, and controlling business processes inside companies.
During the execution of a digital business process, each process step is stored in a database. This includes information on when the process step was executed (timestamp), what process step was executed (activity), and to which business case it belongs (case identifier). These three fundamental bits of event information are the basis for every process mining algorithm and are usually combined into a single data structure called event log.
A log consists of cases, each of which consists of events executed within a process, and some attributes connected to the case (case attributes). Each event is defined by an activity name and its attributes (e.g., a user who executed the event).
Case, Event, and Log. Let be the set of all events. A case is a sequence of events , where is the set of all sequences over Let be the set of all cases. An event log is a set of cases .
Event logs can be used to automatically discover a process model. Discovery algorithms analyze the event logs for process patterns and aim to produce a human-readable process model that likely produced the event log. Multiple discovery algorithms exist, such as the Heuristics Miner  and the Inductive Visual Miner .
In process analytics, it is desirable to relate the behavior observed in an event log to the behavior defined in a process model. This discipline is called conformance checking. The goal of conformance checking is to find an alignment between an event log and a reference process model. The reference model can be manually designed or be discovered by a process discovery algorithm.
Alignment. An alignment  is a bidirectional mapping of an event sequence from the event log to a possible execution sequence of the process model. It is represented by a sequence of tuples , where is an empty move and . We say that a tuple represents a synchronous move if and , a model move if and , and a log move if and . An alignment is optimal if the number of empty moves is minimal.
For and , the two optimal alignments are
where the top row corresponds to and the bottom row corresponds to , mapping moves in the log to moves in the model and vice versa.
2.3 Recurrent Neural Network (RNN)
Recurrent neural networks (RNN) have been designed to handle sequential data such as sentences. An RNN is a special kind of neural network that makes use of an internal state (memory) to retain information about already seen words in a sentence. It is processing a sentence word for word, and with each new word, it will approximate the probability distribution over all possible next words. Neural networks can be efficiently trained using a gradient descent learning procedure, minimizing the error in the prediction by tuning its internal parameters (weights). The error can be computed as the difference between the output of the neural network and the desired output.
After the training procedure, the neural network can approximate the probability distribution over all possible next words, given an arbitrary length input sequence. With slight alterations, RNNs can be applied to event logs, which we will explain further in Sec. 3.
2.4 Beam Search
In natural language processing, it is common to search for the best continuation of a sentence under a given RNN model. To find the most probable continuation every possible combination of words has to be considered which, for a length of and a vocabulary size of , amounts to possible combinations. Due to the exponential search space, this problem is NP-hard.
Instead, a greedy approach can be taken, producing always the most likely next word given a start of a sentence, based on probability under the RNN. However, this approach does not yield good results because it approximates the total probability of the sentence continuation based only on the probability of the next word. A more probable sentence might be found, when expanding the search to the second most probable next word, or the third, and so on.
Beam search (BS) is a greedy algorithm that finds a trade-off between traversing all possible combinations and only the most probable next word. For every prediction, the BS algorithm expands only the most probable sentence continuations (beams). In the next step, the best probable continuations over all beams from the previous step are chosen, and so on. For , BS is equivalent to the greedy 1-best approach explained above. BS has the advantage of pruning the search space to feasible sizes, while still traversing a sufficient part of the search space to produce a good approximation of the most likely sentence continuation.
The BS algorithm is iteratively applied, inserting new words with each step, until convergence, i.e., the end of a sentence is reached, indicated by the end of sentence symbol.
2.5 Bidirectional Beam Search
The BS algorithm continues a sentence until a special end of sentence symbol is predicted. However, if the sentence has a defined beginning and end, this approach cannot be used because a unidirectional RNN only knows about the beginning of the sentence and not the end. This has been demonstrated and been addressed in  with a novel bidirectional beam search (BiBS) approach. Instead of using a single unidirectional RNN, the authors propose to use two separate unidirectional RNNs, one reading the input sentences forwards, and one reading them backwards.
The problem that arises with a gap in the middle of a sentence is that the probability of the resulting sentence, after the insertion of a new word, cannot be computed by a single RNN without re-computation of the remainder of the sentence. In BiBS, this probability is approximated by the product of the probability of the beginning of the sentence (by the forward RNN), the end of the sentence (by the backward RNN), and the joint probability of inserting the new word (according to both RNNs). The original BS algorithm is extended to expand the search space based on this joint probability, ensuring a proper fit both for the beginning and the end of the sentence.
The BiBS algorithm is iteratively applied to the original sentence, updating it with each step, until convergence, i.e., no insertions would yield a higher probability in any of the beams.
In this section we describe the DeepAlign algorithm and all its components. An overview of the algorithm is shown in Fig. 1. Two neural networks are trained to predict the next event, one reading cases from left to right (forwards), the other reading them from right to left (backwards). An extended BiBS is then used to transform the input case to the most probable case under the two RNN models. Lastly, an alignment is calculated based on the search history of the algorithm.
3.1 Next Event Prediction
Next event prediction aims to accurately model the decisions being made in a process. These decisions are based on multiple parameters, such as the history of a case, the attributes connected to past events, and the case level attributes. To succeed, a machine learning model must take into account all of these parameters.
In this paper, we propose a new neural architecture for next event prediction. It has been designed to model the sequence of activities (control-flow), the attributes connected to these activities (event attributes), and the global attributes connected to the case (case attributes). Figure 2 shows the architecture in detail.
At the heart of the network is a Gated Recurrent Unit (GRU) , a type of RNN. This GRU is iteratively fed an event, consisting of its activity and its event attributes, and must predict the corresponding next event. Each categorical attribute is fed through an embedding layer to map the values into a lower-dimensional embedding space. To include the case attributes, we make use of the internal state of the GRU. Instead of initializing the state with zeros (the default), we initialize it based on a representation of the case attributes. All case attributes are transformed by a case attribute network, consisting of two fully-connected layers (FC), to produce a real-valued representation of the case attributes. In other words, we initialize the next event prediction with a representation of the case attributes, thereby conditioning it to predict events according to these case attributes. Finally, the GRU output is fed into separate FC layers with Softmax activations to produce a probability distribution over all possible attributes of the next event (i.e., the prediction of the next event).
We train the networks with a GRU size equal to two times the maximum case length on mini-batches of size for epochs using the Adam optimizer with standard parameters . The first layer of the case attribute network has an output size of the GRU size divided by 8 and the second layer output is equal to the hidden state size of the GRU. These parameters were chosen following an exhaustive grid search, however, we found that any reasonable setting generally worked.
3.2 The DeepAlign Algorithm
In the context of processes, the sentences of words from above will become the cases of events from the event log. By replacing the next word prediction RNNs with next event prediction RNNs in the BiBS algorithm we can apply it to event logs. Instead of only predicting the next word, the RNNs will predict the next event, including the most likely event attributes.
Our goal is to utilize the two RNNs as the reference model for conformance checking and produce an alignment between log and the RNNs. Alignments can be interpreted as a sequence of skip (synchronous move), insertion (model move), or deletion (log move) operations. The BiBS algorithm already covers the first two operations, but not the last. To allow for deletions, we have to extend the BiBS algorithm.
Let be the forward event prediction RNN and be the backward RNN. Let further be the probability of case under RNN, initialized with the hidden state .
The probability of a case under the two RNNs can be computed by
where is the output of the case attribute network. If no case attributes are available, the initial state is set to zeros. An example is shown in Fig. 3.
For an insertion of an event at time in a case , the probability under the two RNNs can be approximated by
where is the total case length, and is the index notation to retrieve all events from until time t, and is the hidden state of after reading . Similarly, is the hidden state of after reading . An example is shown in Fig. 4.
The probability of deleting events at time in a case can be approximated by
An example is shown in Fig. 5.
Algorithm 1 shows the full DeepAlign process of aligning a case with the two RNNs. The algorithm is initialized with an initial set of beams , i.e., the original case. For each possible operation, the probabilities are computed using the aforementioned equations, and the top-K beams are returned. For simplicity, we assume that top-K always returns the updated cases according to the operations with the highest probability. The number of events that can be deleted in one step can be controlled with the parameter . This is necessary because successively deleting single events does not necessarily generate higher probabilities than removing multiple events at once.
Algorithm 1 does not yet return alignments, but the top-K updated cases. By keeping a history of the top-K operations (skip, deletion, and insertion) in every iteration, we can obtain the alignment directly from the history. A deletion corresponds to an empty move on the model, whereas an insertion corresponds to an empty move in the log.
The top-K selection in Alg.1 will select the top K beams based on the probability under the RNN models. In case of ties, we break the tie by choosing the beam with less empty moves (insertions and deletions).
We evaluate the DeepAlign algorithm for the task of anomaly correction. Given an event log consisting of normal and anomalous cases, an anomaly correction algorithm is expected to align each case in the event log with a correct activity sequence according to the process (without anomalies) that produced the event log.
We use a simple paper submission process as a running example throughout the remainder of this paper. The process model in Fig. 6 describes the creation of a scientific paper. It includes the peer review process, which is executed by a reviewer, whereas the paper is written by an author.
To evaluate the accuracy of the corrections, we generated six random process models using PLG2 . The models vary in complexity with respect to the number of activities, breadth, and width. Additionally, we use a handmade procurement process model called P2P as in .
To generate event attributes we create a likelihood graph  from the process models which includes probabilities for event attributes connected to each step in the process. This method has been proposed in . A likelihood graph for the paper process from Fig. 6 is shown in Fig. 7.
For each process step, the probability of the resource executing it is shown in yellow. Depending on the resource, the probabilities of the next process steps are shown in blue. Note that there is a long-term dependency between the steps Develop Hypothesis and Conduct Study, and, similarly, between Develop Method and Evaluate. That is, Conduct Study never eventually follows Develop Method, and, likewise, Evaluate never eventually follows Develop Hypothesis.
We can generate event logs by using a random-walk through the likelihood graph, complying with the transition probabilities, and generating activities and attributes along the way. In addition to the event attributes, we also generate case attributes, as well as, dependencies between the case attributes and the output probabilities in the likelihood graph. For the paper process, we generate two case attributes, Decision and Topic.
If the topic is Theory, this implies that Develop Hypothesis will occur in a case, whereas if the topic is Engineering, it implies Develop Method will occur. The decision can be Accept, Weak Accept, Borderline, Weak Reject, or Reject. For simplicity, we define that there will only be a Minor Revision if the Decision is either Accept or Weak Accept. There will be no Minor Revision otherwise. We have generated an event log that follows these rules that we use as an example throughout the remainder of the paper. The paper process was not used in the evaluation.
For each of the 7 process models, we generate 4 random event logs with varying numbers of event and case attributes. Additionally, we introduce noise to the event logs by randomly applying one of 7 anomalies to 30% of the cases in the event log. We gather a ground truth dataset for the anomaly correction problem by retaining the original cases before alteration. The 7 anomalies are defined as follows.
Skip: A sequence of up to 2 necessary events has been skipped
Insert: Up to 2 random activities have been inserted
Rework: A sequence of up to 3 events has been executed a second time
Early: A sequence of up to 2 events has been executed too early, and hence is skipped later in the case
Late: A sequence of up to 2 events has been executed too late, and hence is skipped earlier in the case
Attribute: An incorrect attribute value has been set in up to 3 events
To analyze the impact of the case and event attributes we evaluate four different implementations of DeepAlign: one that does not use any attributes (DeepAlign), one that only uses case attributes (DeepAlignC), one that only uses event attributes (DeepAlignE), and one that uses both (DeepAlignCE).
Additionally, we evaluate baseline approaches that first discover a process model using a discovery algorithm and then calculate the alignments . We chose the Heuristics Miner  and the Inductive Miner  using the implementations of the excellent PM4Py library . For completeness, we also evaluate the conformance checking algorithm using a perfect reference model, i.e., the one used to generate the event logs.
We run the DeepAlign algorithm for a maximum number of 10 iterations with a beam size of and a maximum deletion size of , and consider the top-1 beam for the evaluation. The Inductive Miner and the Heuristics Miner are used as implemented in PM4Py For the Heuristics Miner, we use a dependency threshold of .
The overall results are shown in Tab. 1. For each dataset we run the algorithms and evaluate the correction accuracy, that is, an alignment is regarded as correct if the model sequence is exactly the ground truth sequence. For correct alignments, we calculate the optimality of the alignment (i.e., if the number of empty moves is minimal). For incorrect alignments, we calculate the distance from the ground truth sequence by the Levenshtein distance. We average over all cases across all datasets.
|Control-flow||Case attr.||Event attr.||Accuracy||Optimal||Error|
Interestingly, DeepAlignC, DeepAlignE, and DeepAlignCE all outperform the perfect Reference Model approach. This is because the Reference Model does not contain any information about the case and event attributes. The Heuristics Miner yields much better results in the anomaly correction task than the Inductive Miner, however, DeepAlign outperforms both, without relying on case or event attributes.
Reference Model, Heuristics Miner, and Inductive Miner all produce optimal alignments because the alignment algorithm guarantees it. The DeepAlign algorithm shows a significant drop in alignment optimality when including the event attributes. The drop in optimality can be attributed to the fact that we always predict the top-1 attribute value for inserted events in the DeepAlign algorithm. Furthermore, it might be connected to the attribute level anomalies that we introduced as part of the generation. Including the case and event attributes in the reference model has a significant impact on the accuracy of the DeepAlign algorithm. The best results are achieved when including both the case and the event attributes. Table 2 shows the accuracy for each algorithm per dataset.
We want to finish the evaluation with examples from the paper dataset to illustrate the results of the DeepAlign algorithm. This is the resulting alignment for a case with a Skip anomaly,
|Identify Problem||Research Related Work||Develop Method||Experiment||Evaluate||Conclude||Submit||Review||…|
this is the result for a case with a Late anomaly,
|Identify Problem||Experiment||Research Related Work||Develop Method||Evaluate||Conclude||Submit||…|
|Identify Problem||Research Related Work||Develop Method||Experiment||Evaluate||Conclude||Submit||…|
and this is the result for a case with an Insert anomaly.
|Identify Problem||Research Related Work||Random activity 10||Develop Method||Experiment||Evaluate||Conclude||Random activity 12||Submit||…|
|Identify Problem||Research Related Work||Develop Method||Experiment||Evaluate||Conclude||Submit||…|
The DeepAlign method can also be utilized to generate sequences from nothing, that is, to align the empty case with the most likely case according to the model. Depending on the case attributes that are used to initialize the RNNs, the results will be different.
For and the resulting sequence is Identify Problem, Research Related Work, Develop Method, Experiment, Evaluate, Conclude, Submit, Review, Final Decision , whereas if we set the resulting sequence is Identify Problem, Research Related Work, Develop Hypothesis, Experiment, Conduct Study, Conclude, Submit, Review, Final Decision . The DeepAlign algorithm correctly generates a sequence including the Develop Method and Develop Hypothesis activities according to the setting of the Topic case attribute. It also does not generate the Minor Revision activity because the Decision is Reject. When setting , DeepAlign will generate the sequence including the Minor Revision branch. A similar effect can be observed when altering the event attributes.
This demonstrates that the RNNs are indeed capable of learning the rules behind the decisions in the paper process. Although the paper dataset contains unambiguous dependencies between the case attributes and the resulting correct sequences, the overall results on the randomly generated datasets indicate that case and event attributes ought not to be neglected.
6 Related Work
Anomaly detection in business processes is frequently researched. Many approaches exist that aim to detect anomalies in a noisy event log (i.e., an event log that contains anomalous cases).
Bezerra et al. have proposed multiple approaches utilizing discovery algorithms to mine a process model and then use conformance checking to infer the anomalies . Böhmer et al. proposed a technique based on an extended likelihood graph that is utilizing event-level attributes to further enhance the detection . The approach from  requires a clean event log (i.e., no anomalies in the log), but it has been shown that the same technique can be applied to noisy logs as well . Recently, Pauwels et al. presented an approach based on Bayesian Networks . Deep learning based approaches are presented in  and . However, none of these approaches can be utilized to correct an anomalous case or to produce an alignment.
Since Bezerra et al. presented their approach based on discovery algorithms in 2013, Mannhardt et al. have proposed both a data-aware discovery algorithm  and a data-aware conformance checking algorithm . The conformance checking algorithm relies on a configurable cost function for alignments that must be manually defined to include the case and event attributes. Our approach does not rely on a manual definition of the cost function, it traverses the search space based on learned probabilities instead.
Although alignments represent the current state-of-the-art in conformance checking , they often pose a significant challenge because they are computationally expensive. Van Dongen et al. address this issue in , compromising between computational complexity and quality of the alignments. Very recently, Leemans et al. have presented a stochastic approach to conformance checking , which can speed up the computation.
All of these approaches either rely on a non-data-aware discovery technique, require a manual effort to create a proper cost function, or they cannot generate alignments. To the best of our knowledge, DeepAlign is the first fully autonomous anomaly correction method.
We have demonstrated a novel approach to calculate alignments based on the DeepAlign algorithm. When no reference model is available, two recurrent neural networks can be used to approximate the underlying process based on execution data, including case and event attributes. The empirical results obtained in the experiments indicate that RNNs are indeed capable of modeling the behavior of a process solely based on an event log event if it contains anomalous behavior.
To the best of our knowledge, this is the first time that deep learning has been employed to calculate alignments in the field of process mining. Although we evaluate DeepAlign in the context of anomaly correction, many other applications are conceivable. For example, instead of training on a log that contains anomalies, a clean log could be used. Furthermore, a clean log can be obtained from an existing reference model, and DeepAlign could be used to find alignments. In other words, it might be possible to convert a manually created process model into a DeepAlign model. A discovery algorithm based on DeepAlign is also imaginable since DeepAlign can also be utilized to generate sequences from scratch. Depending on the case attributes the resulting predicted sequences will be different. We think that this idea lends itself to further research.
We further believe that the DeepAlign algorithm could be employed to reduce the memory consumption of an alignment algorithm since the search space is efficiently pruned during the bidirectional beam search. However, on the downside, DeepAlign does not guarantee optimal alignments. This weakness can be addressed by employing an optimal alignment algorithm between the input sequence and the corrected sequence, albeit at the expense of efficiency.
In summary, DeepAlign is a novel and flexible approach with great application potential in many research areas within the field of process mining.
This work is funded by the German Federal Ministry of Education and Research (BMBF) Software Campus project “R2PA” [01IS17050], Software Campus project “KADet” [01IS17050], and the research project “KI.RPA” [01IS18022D].
-  (2013) Memory-efficient alignment of observed and modeled behavior. BPM Center Report 3. Cited by: §4, §6.
-  (2019) Process Mining for Python (PM4Py): Bridging the Gap Between Process-and Data Science. pp. 13–16. Cited by: §4.
-  (2013) Algorithms for Anomaly Detection of Traces in Logs of Process Aware Information Systems. Information Systems 38 (1), pp. 33–44. Cited by: §6.
-  (2016) Multi-perspective Anomaly Detection in Business Process Execution Events. In Proceedings of On the Move to Meaningful Internet Systems, OTM’16, pp. 80–98. Cited by: §4, §6.
-  (2010) Trace alignment in process mining: opportunities for process diagnostics. In Proceedings of the 8th International Conference on Business Process Management – BPIC’10, pp. 227–242. Cited by: Definition 2.
-  (2016) PLG2: Multiperspective Process Randomization with Online and Offline Simulations. In BPM’16 (Demos), pp. 1–6. Cited by: §4.
-  (2014) Learning Phrase Representations Using RNN Encoder-decoder for Statistical Machine Translation. arXiv preprint arXiv:1406.1078. Cited by: §3.1.
-  (2014) Adam: A Method for Stochastic Optimization. arXiv preprint arXiv:1412.6980. Cited by: §3.1.
-  (2014) Process and deviation exploration with inductive visual miner.. BPM’14 (Demos) 1295 (46), pp. 8. Cited by: §2.1, §4.
-  (2019) Earth movers’ stochastic conformance checking. In Proceedings of the 17th International Conference on Business Process Management – BPM’19, pp. 127–143. Cited by: §6.
-  (2016) Balanced multi-perspective checking of process conformance. Computing 98 (4), pp. 407–437. Cited by: §6.
-  (2015) The multi-perspective process explorer. BPM’15 (Demos) 1418, pp. 130–134. Cited by: §6.
-  (2018-11-01) Analyzing Business Process Anomalies Using Autoencoders. Machine Learning 107 (11), pp. 1875–1893. Cited by: §6.
-  (2019) BINet: Multi-perspective Business Process Anomaly Classification. Information Systems, pp. 101458. External Links: Cited by: §4, §6.
-  (2018) BINet: Multivariate Business Process Anomaly Detection Using Deep Learning. In Proceedings of the 16th International Conference on Business Process Management – BPM’18, pp. 271–287. Cited by: §4.
-  (2019) An anomaly detection technique for business processes based on extended dynamic bayesian networks. In Proceedings of the 34th ACM/SIGAPP Symposium on Applied Computing, pp. 494–501. Cited by: §6.
-  (2017) Bidirectional beam search: Forward-backward inference in neural sequence models for fill-in-the-blank image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition – CVPR’17, pp. 6961–6969. Cited by: §1, §2.5.
-  (2017) Aligning modeled and observed behavior: a compromise between computation complexity and quality. In Proceedings of the 29th International Conference on Advanced Information Systems Engineering – CAiSE’17, pp. 94–109. Cited by: §6.
-  (2011) Flexible heuristics miner (fhm). In Proceedings of the 2011 IEEE Symposium on Computational Intelligence and Data Mining – CIDM’11, pp. 310–317. Cited by: §2.1, §4.