
Conformance Checking over Uncertain Event Data

Abstract

Nowadays, more and more process data are automatically recorded by information systems, and made available in the form of event logs. Process mining techniques enable process-centric analysis of data, including automatically discovering process models and checking if event data conform to a certain model. In this paper, we analyze the previously unexplored setting of uncertain event logs: logs where quantified uncertainty is recorded together with the corresponding data. We define a taxonomy of uncertain event logs and models, and we examine the challenges that uncertainty poses on process discovery and conformance checking. Finally, we show how upper and lower bounds for conformance can be obtained by aligning an uncertain trace onto a regular process model.

Keywords:
Process Mining · Uncertain Data · Partial Order.

1 Introduction

Over the last decades, the concept of process has become more and more central in formally describing the activities of businesses, companies and other similar entities, structured in specific steps and phases. A process is thus defined as a well-structured set of activities, potentially performed by multiple actors (resources), which contribute to the completion of a specific task or to the achievement of a specific goal. In this context, a very important notion is the concept of case, that is, a single instance of a process. For example, in a healthcare process a case may be a single hospitalization of a patient, or the patient themself; if the process belongs to a credit institution, a case may be a loan application from a customer, and so on. The case notion allows us to define a process as a procedure that precisely defines the steps needed to handle a case from inception to completion. This procedure is referred to as process model, and can be expressed in a number of different formalisms (transition systems, Petri nets, BPMN and UML diagrams, and many more). Consequently, the study and adoption of analysis techniques specifically customized to deal with process data and process models has enabled the bridging of business administration and data science, and the development of dedicated disciplines like business intelligence and business process management (BPM).

The processes that govern the innards of business companies are increasingly supported by software tools. Performing specific activities is both aided and recorded by process-aware information systems (PAISs), which support the definition and management of processes. The information regarding the execution of processes can then be extracted from PAISs in the form of an event log, a database or file containing the digital footprint of the operations carried out in the context of the execution of a process and recorded as events. Event logs can vary in form, and contain differently structured information depending on the information system that enacted data collection in the organization. There are, however, some basic pieces of information regarding events that are very often recorded: the time at which the event occurred, the activity that has been performed, and the case identifier to which the event belongs. This last attribute allows us to group events in clusters belonging to the same case, and these resulting clusters (usually organized in sequences sorted by timestamp) are called process traces. The discipline of process mining concerns the automatic analysis of event logs, with the goal of extracting knowledge regarding, e.g., the structure of the process, the conformity of events to a specific normative process model, the performance of process executions, and the relationships between groups of actors in the process.

In this paper, we will consider the analysis of a specific class of event logs: the logs that contain uncertain event data. Uncertain events are recordings of executions of specific activities in a process that carry an explicit indication of uncertainty in the event attributes. Specifically, we consider the case where the attributes of an event are not recorded as a precise value but as a range or a set of alternatives.

The recording of uncertain event data is a common occurrence in process management. The Process Mining Manifesto [28] describes a fundamental property of event data as trustworthiness, the assumption that the recorded data can be considered correct and accurate. In a general sense, uncertainty as defined here is an explicit absence of trustworthiness, with an indication of uncertainty recorded together with the event data. In the taxonomy of event data proposed in the Manifesto the logs at the two lower levels of quality frequently lack trustworthiness, and thus can be uncertain. This encompasses a wide range of processes, such as event logs of document and product management systems, error logs of embedded systems, worksheets of service engineers, and any process recorded totally or partially on paper. There are many possible causes behind the recording of uncertain event data, such as:

  • Incorrectness. In some instances, the uncertainty is simply given by errors that occurred while recording the data itself. Faults of the information system, or human mistakes in a data entry phase, can lead to missing or altered event data that can be subsequently modeled as uncertain event data.

  • Coarseness. Some information systems have limitations in their way of recording data - often tied to factors like the precision of the data format - such that the event data can be considered uncertain. A typical example is an information system that only records the date, but not the time, of the occurrence of an event: if two events are recorded in the same day, the order of occurrence is lost. This is an especially common circumstance in the processes that are, partially or completely, recorded on paper and then digitized. Another factor that can lead to uncertainty in the time of recording is the information system being overloaded and, thus, delaying the recording of data. This type of uncertainty can also be generated by the limited sensitivity of a sensor.

  • Ambiguity. In some cases, the data recorded is not an identifier of a certain event attribute; in these instances, the data needs to be interpreted, either automatically or manually, in order to obtain a value for the event attribute. Uncertainty can arise if the meaning of the data is ambiguous and cannot be interpreted with precision. Examples are data in the form of images, text, or video.

Aside from the causes, we can identify other types of uncertain event logs based on the frequency of uncertain data. Uncertainty can be infrequent, when a specific attribute is only seldom recorded together with explicit uncertainty; the uncertainty is rare enough that uncertain events can be considered outliers. Conversely, frequent uncertain behavior of the attribute is systematic, pervasive in a high number of traces, and thus not to be considered an outlier. The uncertainty can be considered part of the process itself. These concepts are not meant to be formal, and are laid out to distinguish between logs that are still processable regardless of the uncertainty, and logs where the uncertainty is too pervasive to analyze them with existing process mining techniques.

In this paper, we propose a taxonomy of the different types of explicit uncertainty in process mining, together with a formal, mathematical formulation. As an example of practical application, we will consider the case of conformance checking [8], and we will apply it to uncertain data by assessing the upper and lower bounds of the conformance score over the possible values of the attributes in an uncertain trace.

The main driving reason behind this work is to provide the means to treat uncertainty as a relevant part of a process; thus, we aim not to filter it out but to model it. In summary, there are two novel aspects regarding uncertain data that we intend to address in this work. The first is the explicitness of uncertainty: we work with the underlying assumption that the actual value of the uncertain attribute, while not directly provided, is described formally. This is the case when meta-information about the uncertainty in the attribute is available, either deduced from the features of the information system(s) that record the logs or included in the event log itself. Note that, as opposed to all previous work on the topic, the fact that uncertainty is explicit in the data means that the concept of uncertain behavior is completely separated from the concept of infrequent behavior. The second is the goal of modeling uncertainty: we consider uncertainty part of the process. Instead of filtering or cleaning the log, we introduce the uncertainty perspective in process mining by extending the currently available techniques to incorporate it.

The rest of this paper is organized as follows. Section 2 proposes a taxonomy of the different possible types of uncertain process data. Section 3 contains preliminary definitions, while Section 4 introduces the formal framework needed to manage uncertainty. Section 5 describes a practical application of process mining over uncertain event data, the case of conformance checking through alignments. Section 6 shows experimental results on computing conformance checking scores for synthetic uncertain data, as well as a case of application on real-life data. Section 7 discusses previous and related work on the management of uncertain data and on the topic of conformance checking. Finally, Section 8 concludes the paper and discusses future work.

2 A Taxonomy of Uncertain Event Data

The goal of this section of the paper is to propose a categorization of the different types of uncertainty that can appear in process mining. In process management, a central concept is the distinction between the data perspective (the event log) and the behavioral perspective (the process model). The first one is a static representation of process instances, the second summarizes the behavior of a process. Both can be extended with a concept of explicit uncertainty; doing so also requires extending the process mining techniques currently available.

In this paper we will focus on uncertainty in event data, while the concept of uncertainty applied to models will be examined in a future work. Specifically, as an example application we will consider computing the conformance score of uncertain process data on classical models.

We can identify two different notions of uncertainty:

  • Strong uncertainty: the possible values for the attributes are known, but the probability that the attribute will assume a certain instantiation is unknown or unobservable.

  • Weak uncertainty: both the possible values of an attribute and their respective probabilities are known.

In the case of a discrete attribute, the strong notion of uncertainty consists of a set of possible values assumed by the attribute. In this case, the probability for each possible value is unknown. Vice versa, in the weak uncertainty scenario we also have a discrete probability distribution defined on that set of values. In the case of a continuous attribute, the strong notion of uncertainty can be represented with an interval for the variable. Notice that an interval does not indicate a uniform distribution; there is no information on the likelihood of values in it. Vice versa, in the weak uncertainty scenario we also have a probability density function defined on a certain interval. Figure 1 summarizes these concepts. This leads to very simple representations of explicit uncertainty.

Figure 1: The four different types of uncertainty.
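To make these four cases concrete, the following is a minimal Python sketch of how strongly and weakly uncertain attribute values could be represented; the class and field names are illustrative, not part of any standard or of the paper.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Set

@dataclass
class StrongDiscrete:
    """Strong uncertainty on a discrete attribute: only the candidate values are known."""
    candidates: Set[str]                   # e.g. {"B", "C", "D"}

@dataclass
class WeakDiscrete:
    """Weak uncertainty on a discrete attribute: candidates with a probability each."""
    distribution: Dict[str, float]         # e.g. {"B": 0.7, "C": 0.3}, probabilities sum to 1

@dataclass
class StrongContinuous:
    """Strong uncertainty on a continuous attribute (e.g. a timestamp): only an interval."""
    lower: float
    upper: float                           # no assumption on how values are distributed inside

@dataclass
class WeakContinuous:
    """Weak uncertainty on a continuous attribute: an interval plus a density function."""
    lower: float
    upper: float
    pdf: Callable[[float], float]          # density over [lower, upper], e.g. a truncated normal
```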

In this paper we consider only the control flow and time perspective of a process – namely, the attributes of the events that allow the discovery of a process model. These are the unique identifier of a process instance (case ID), the timestamp (often represented by the distance from a fixed origin point, e.g. the Unix Epoch), and the activity identifier of an event. Case IDs and activities are values chosen from a finite set of possible values; they are discrete variables. Timestamps, instead, are represented by numbers and thus are continuous variables.

We will also describe an additional type of uncertainty, which lies at the event level rather than the attribute level:

  • Indeterminate event: the event may not have taken place even though it was recorded in the event log. Indeterminate events are indicated with a ? symbol, while determinate (regular) events are marked with a ! symbol.

Case ID                  Timestamp                              Activity    Indet. event
{ID327, ID412}           2011-12-05T00:00                       A           !
ID327                    2011-12-07T00:00                       {B, C, D}   !
ID327                    [2011-12-06T00:00, 2011-12-10T00:00]   D           ?
ID327                    2011-12-09T00:00                       {A, C}      !
{ID327, ID412, ID573}    2011-12-11T00:00                       E           ?

Table 1: An example of strongly uncertain trace.

Case ID                  Timestamp                              Activity          Indet. event
{ID327:0.9, ID412:0.1}   2011-12-05T00:00                       A                 !
ID327                    2011-12-07T00:00                       {B:0.7, C:0.3}    !
ID327                    (2011-12-08T00:00, 2)                  D                 ?:0.5
ID327                    2011-12-09T00:00                       {A:0.2, C:0.8}    !
{ID327:0.4, ID412:0.6}   2011-12-11T00:00                       E                 ?:0.7

Table 2: An example of weakly uncertain trace.

Examples of strongly and weakly uncertain traces are shown in Tables 1 and 2 respectively.
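As an illustration, the strongly uncertain trace of Table 1 could be encoded as plain Python data as follows; the field names are chosen for readability and timestamps are kept as ISO strings.

```python
# One dictionary per event. Sets encode strong uncertainty on discrete attributes,
# a pair encodes a timestamp interval, and the flag marks indeterminate events ("?").
uncertain_trace = [
    {"case": {"ID327", "ID412"},          "time": "2011-12-05T00:00",
     "activity": {"A"},           "indeterminate": False},
    {"case": {"ID327"},                   "time": "2011-12-07T00:00",
     "activity": {"B", "C", "D"}, "indeterminate": False},
    {"case": {"ID327"},                   "time": ("2011-12-06T00:00", "2011-12-10T00:00"),
     "activity": {"D"},           "indeterminate": True},
    {"case": {"ID327"},                   "time": "2011-12-09T00:00",
     "activity": {"A", "C"},      "indeterminate": False},
    {"case": {"ID327", "ID412", "ID573"}, "time": "2011-12-11T00:00",
     "activity": {"E"},           "indeterminate": True},
]
```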

3 Preliminaries

Definition 1 (Power Set)

The power set of a set $S$ is the set of all possible subsets of $S$, and is denoted with $\mathcal{P}(S)$. $\mathcal{P}_{NE}(S)$ denotes the set of all the non-empty subsets of $S$: $\mathcal{P}_{NE}(S) = \mathcal{P}(S) \setminus \{\emptyset\}$.

Definition 2 (Multiset)

A multiset is an extension of the concept of set that keeps track of the cardinality of each element. is the set of all multisets over some set . Multisets are denoted with square brackets, e.g. .

Definition 3 (Sequence, Subsequence and Permutation)

Given a set , a finite sequence over of length is a function , and it is written as . We denote with the empty sequence, the sequence with no elements and of length 0. Over the sequence we define , and . The concatenation between two sequences is denoted with . Given two sequences and , is a subsequence of if and only if there exists a sequence of strictly increasing natural numbers such that . We indicate this with . A permutation of the set is a sequence that contains all elements of without duplicates: , , and for all and for all , . We denote with all such permutations of set .

Definition 4 (Sequence Projection)

Let be a set and one of its subsets.  is the sequence projection function and is defined recursively: and for and :

For example, .
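A minimal Python sketch of the sequence projection operation, assuming sequences are represented as lists; it is illustrative and not taken from the paper.

```python
def project(sequence, subset):
    """Keep only the elements of `sequence` that belong to `subset`, preserving their order."""
    return [x for x in sequence if x in subset]

# Projecting onto {"a", "c"} drops every other element:
assert project(["a", "b", "c", "a", "b"], {"a", "c"}) == ["a", "c", "a"]
```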

Definition 5 (Applying Functions to Sequences)

Let be a partial function. can be applied to sequences of using the following recursive definition: and for and :

Definition 6 (Transitive Relation and Correct Evaluation Order)

Let be a set of objects and be a binary relation . is transitive if and only if for all we have that . A correct evaluation order is a permutation of the elements of the set such that for all we have that .

Definition 7 (Strict Partial Order)

Let be a set of objects. Let . A strict partial order is a binary relation that has the following properties:

  • Irreflexivity: is false.

  • Transitivity: see Definition 6.

  • Antisymmetry: implies that is false. Implied by irreflexivity and transitivity [11].

Definition 8 (Directed Graph)

A directed graph is a tuple where is the set of vertices and is the set of directed edges. The set is the graph universe. A path in a directed graph is a sequence of vertices such that for all we have that . We denote with the set of all such possible paths over the graph G. Given two vertices , we denote with the set of all paths beginning in and ending in : . and are connected (and is reachable from ), denoted by , if and only if there exists a path between them in : . Conversely, . We omit the superscript if it is clear from the context. A directed graph is acyclic if there exists no path satisfying .

Definition 9 (Topological Sorting)

Let be an acyclic directed graph. A topological sorting [14] is a permutation of the vertices of such that for all we have that . We denote with all such possible topological sortings over .
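Topological sortings can be enumerated with a standard backtracking procedure over vertices that have no unvisited predecessors; the sketch below is an illustrative implementation, not the paper's own code.

```python
def all_topological_sortings(vertices, edges):
    """Enumerate every permutation of `vertices` respecting the directed `edges` of a DAG."""
    indegree = {v: 0 for v in vertices}
    for (u, v) in edges:
        indegree[v] += 1

    def backtrack(prefix, indeg):
        if len(prefix) == len(vertices):
            yield list(prefix)
            return
        for v in vertices:
            if v not in prefix and indeg[v] == 0:
                # Choose v next and release its successors.
                new_indeg = dict(indeg)
                new_indeg[v] = -1  # mark v as used
                for (a, b) in edges:
                    if a == v:
                        new_indeg[b] -= 1
                yield from backtrack(prefix + [v], new_indeg)

    return list(backtrack([], indegree))

# Example: a -> b, a -> c, b -> d, c -> d has exactly two topological sortings.
print(all_topological_sortings(["a", "b", "c", "d"],
                               [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")]))
# [['a', 'b', 'c', 'd'], ['a', 'c', 'b', 'd']]
```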

Definition 10 (Transitive Reduction)

A transitive reduction [3] of a graph is a graph with where every pair of vertices connected in is not connected by any other path: for all , . is the graph with the minimal number of edges that maintains reachability between the vertices of . The transitive reduction of a directed acyclic graph always exists and is unique [3].
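A minimal sketch of transitive reduction for a DAG given as an edge set: an edge (u, v) is dropped whenever v remains reachable from u through the remaining edges. The function names are illustrative.

```python
def transitive_reduction(vertices, edges):
    """Return the minimal edge set of a DAG that preserves reachability (cf. [3])."""
    def reachable(src, dst, edge_set):
        # Depth-first search over the given edges.
        stack, seen = [src], set()
        while stack:
            node = stack.pop()
            if node == dst:
                return True
            if node in seen:
                continue
            seen.add(node)
            stack.extend(v for (u, v) in edge_set if u == node)
        return False

    reduced = set(edges)
    for (u, v) in sorted(edges):
        candidate = reduced - {(u, v)}
        if reachable(u, v, candidate):   # (u, v) is redundant: another path exists
            reduced = candidate
    return reduced

# Example: the edge (a, c) is implied by (a, b) and (b, c), so it is removed.
print(transitive_reduction({"a", "b", "c"}, {("a", "b"), ("b", "c"), ("a", "c")}))
# the set {('a', 'b'), ('b', 'c')}
```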

Definition 11 (Dependency Graph)

Let be a set of objects and be a transitive relation . A dependency graph [15] is the directed graph . Since is transitive, for all we have that , thus all the topological sortings are also all possible correct evaluation orders of the objects in for the relation .

In general, and on a more abstract level, a dependency graph is a structure that explicitly expresses the property of adjunction between directed graphs and transitive relations, meaning that directed graphs define transitive relations and vice versa [24].

Let us now define the basic artifacts needed to perform process mining.

Definition 12 (Universes)

Let be the set of all the event identifiers. Let be the set of all the case id identifiers. Let be the set of all the activity identifiers. Let be the totally ordered set of all the timestamp identifiers.

Definition 13 (Events and event logs)

Let us denote with the universe of certain events. A certain event log is a set of events such that every event identifier in is unique.

Definition 14 (Simple certain traces and logs)

Let be a set of certain events and let and . A simple certain trace is the sequence of activities induced by such a set of events. denotes the universe of certain traces. is a simple certain log. We will drop the qualifier “simple” if it is clear from the context.

As a preliminary application of process mining over uncertain event data we will consider conformance checking. Starting from an event log and a process model, conformance checking verifies if the event data in the log conforms to the model, providing a diagnostic of the deviations. Conformance checking serves many purposes, such as checking if process instances follow a specific normative model, assessing if a certain execution log has been generated from a specific model, or verifying the quality of a process discovery technique.

The conformance checking algorithm that we are applying in this paper is based on alignments. Introduced by Adriansyah [2], conformance checking through alignments finds deviations between a trace and a Petri net model of a process by creating a correspondence between the sequence of activities executed in the trace and the firing of the transitions in the Petri net. The following definitions are partially from [29].

Definition 15 (Petri Net)

A Petri net is a tuple with the set of places, the set of transitions, , and the flow relation. A Petri net defines a directed graph with vertices and edges . A marking is a multiset of places.

A marking defines the state of a Petri net, and indicates how many tokens each place contains. For any , denotes the set of input nodes and denotes the set of output nodes. We omit the superscript if it is clear from the context.

A transition is enabled in marking of net , denoted as , if each of its input places contains at least one token. An enabled transition may fire, i.e., one token is removed from each of the input places and one token is produced for each of the output places . Formally: is the marking resulting from firing enabled transition in marking of Petri net . denotes that is enabled in and firing results in marking .

Let be a sequence of transitions. denotes that there is a set of markings such that , , and for . A marking is reachable from if there exists a such that .
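The enabling and firing rules can be sketched in a few lines of Python, representing markings as multisets of places; this is a simplified illustration rather than the paper's implementation, and the example net is invented for the purpose.

```python
from collections import Counter

# A tiny Petri net: each transition maps to (input places, output places).
net = {
    "t1": (["p1"], ["p2", "p3"]),
    "t2": (["p2", "p3"], ["p4"]),
}

def enabled(marking, transition):
    """A transition is enabled if every input place holds at least as many tokens as required."""
    inputs, _ = net[transition]
    return all(marking[p] >= Counter(inputs)[p] for p in inputs)

def fire(marking, transition):
    """Remove one token per input place and add one token per output place."""
    inputs, outputs = net[transition]
    new_marking = Counter(marking)
    new_marking.subtract(Counter(inputs))
    new_marking.update(Counter(outputs))
    return +new_marking   # drop zero/negative entries

m = Counter({"p1": 1})
assert enabled(m, "t1") and not enabled(m, "t2")
m = fire(m, "t1")        # now p2 and p3 each hold one token, enabling t2
assert enabled(m, "t2")
```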

Definition 16 (Labeled Petri Net)

A labeled Petri net is a Petri net with labeling function where is some universe of activity labels. Let be a sequence of activities. if and only if there is a sequence such that and .

If , it is called invisible. To indicate invisible transitions we use the placeholder symbol ; by definition . An occurrence of visible transition corresponds to observable activity .

Definition 17 (System Net)

A system net is a triplet where is a labeled Petri net, is the initial marking, and is the final marking. is the universe of system nets. Over a system net we define the following:

  • is the set of visible transitions in ,

  • is the set of corresponding observable activities in ,

  • is the set of unique visible transitions in (i.e., there are no other transitions having the same visible label),

  • is the set of corresponding unique observable activities in ,

  • is the set of visible traces starting in and ending in , and

  • is the corresponding set of complete firing sequences.

Figure 2 shows a system net with initial and final markings and . Given a system net, is the set of all possible visible activity sequences, i.e. the labels of complete firing sequences starting in and ending in projected onto the set of observable activities. Given the set of activity sequences obtainable via complete firing sequences on a certain system net, we can define a perfectly fitting event log as a set of traces which activity projection is contained in .

Definition 18 (Perfectly Fitting Log)

Let be a certain event log and let be a system net. is perfectly fitting if and only if .

These definitions allow us to build alignments in order to compute the fitness of a trace on a certain model. An alignment is a correspondence between a sequence of activities (extracted from the trace) and a sequence of transitions with their labels (fired in the model while replaying the trace). The first sequence indicates the “moves in the log” and the second indicates the “moves in the model”. If a move in the model cannot be mimicked by a move in the log, then a “$\gg$” (“no move”) appears in the top row; conversely, if a move in the log cannot be mimicked by a move in the model, then a “$\gg$” (“no move”) appears in the bottom row. “No moves” not corresponding to invisible transitions point to deviations between the model and the log. A move is a pair where the first element refers to the log and the second element to the model. A “$\gg$” in the first element of the pair indicates a move on the model, in the second element it indicates a move on the log.

Definition 19 (Legal Moves)

Let be a certain event log, let be the set of activity labels appearing in the event log, and let be a system net with . is the set of legal moves.

An alignment is a sequence of legal moves such that after removing all “” symbols, the top row corresponds to a trace in the log and the bottom row corresponds to a firing sequence starting in and ending . Notice that if is an invisible transition, the activation of is indicated by a “” on the log in correspondence of and the placeholder label . Hence, the middle row corresponds to a visible path when ignoring the steps. Figure 2 shows a system net with two examples of alignments, of a fitting trace and of a non-fitting trace.

Definition 20 (Alignment)

Let be a certain trace and a complete firing sequence of system net . An alignment of and is a sequence such that the projection on the first element (ignoring “”) yields and the projection on the last element (ignoring “” and transition labels) yields .

A trace and a model can have several possible alignments. In order to select the most appropriate one, we introduce a function that associates a cost with undesired moves - the ones associated with deviations.

Definition 21 (Cost of Alignment)

Cost function assigns costs to legal moves. The cost of an alignment is the sum of all costs: .

Moves where log and model agree have no costs, i.e., for all . Moves on model only have no costs if the transition is invisible, i.e., if . is the cost when the model makes an “ move” without a corresponding move of the log (assuming ). is the cost for an “ move” only on the log. In this paper we often use a standard cost function that assigns unit costs: , , and for all .
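Under this standard cost function, the cost of an alignment can be computed as in the sketch below, where the string ">>" stands for "no move"; the example alignment, trace and model are purely illustrative.

```python
SKIP = ">>"   # "no move" symbol

def alignment_cost(alignment, invisible_labels=frozenset()):
    """Sum the standard unit costs over a sequence of (log move, model move) pairs."""
    cost = 0
    for log_move, model_move in alignment:
        if log_move == SKIP and model_move in invisible_labels:
            cost += 0          # move on an invisible model transition: free
        elif log_move == SKIP or model_move == SKIP:
            cost += 1          # move only on the model, or only on the log: deviation
        # synchronous moves (log_move == model_move) cost 0
    return cost

# A trace <A, D, E> aligned against a model run <A, B, D, E> deviates in one point:
alignment = [("A", "A"), (SKIP, "B"), ("D", "D"), ("E", "E")]
assert alignment_cost(alignment) == 1
```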

Definition 22 (Optimal Alignment)

Let be a certain event log and let be a system net with .

  • For , we define: .

  • An alignment is optimal for trace and system net if for any : .

  • is a deterministic mapping that assigns any trace to an optimal alignment, i.e., and is optimal.

  • are the misalignment costs of the whole event log.

is a (perfectly) fitting trace for the system net if and only if . is a (perfectly) fitting event log for the system net if and only if .

Figure 2: Example of alignments on a system net. The alignment shows that the trace is perfectly fitting the net. The alignment shows that the trace is misaligned with the net in one point, indicated by “”.

The technique to compute the optimal alignment [2] is as follows. Firstly, it creates an event net, a sequence-structured system net able to replay only the trace to align. The transitions in the event net have labels corresponding to the activities in the trace. Then, a product net is computed; it is the union of the event net and the model, together with added synchronous transitions. These additional transitions are paired with transitions in the event net and in the process model that have the same label; they are then connected with arcs from the input places and to the output places of those transitions. The product net is able to represent moves on log, moves on model and synchronous moves by means of firing transitions: the transitions of the event net correspond to moves on log, the transitions of the process model correspond to moves on model, and the added synchronous transitions correspond to synchronous moves. The union of the initial and final markings of the event net and the process model constitute respectively the initial and final markings of the product net: every complete firing sequence on the product net corresponds to a possible alignment. Lastly, the product net is translated to a state space, and a state space exploration via the A* algorithm is performed in order to find the complete firing sequence that yields the lowest cost.
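For a sequence-structured model with no invisible transitions, the product-net construction and the A* search reduce to an edit-distance style dynamic program; the sketch below computes an optimal alignment under unit costs in that simplified setting and is only meant to illustrate the idea, not the general algorithm.

```python
def optimal_alignment(trace, model_sequence):
    """Optimal alignment (standard unit costs) of a trace against a purely sequential model."""
    n, m = len(trace), len(model_sequence)
    SKIP = ">>"
    # cost[i][j]: cheapest alignment of trace[:i] with model_sequence[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i                       # i moves on log only
    for j in range(1, m + 1):
        cost[0][j] = j                       # j moves on model only
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sync = cost[i - 1][j - 1] if trace[i - 1] == model_sequence[j - 1] else float("inf")
            cost[i][j] = min(sync,                    # synchronous move
                             cost[i - 1][j] + 1,      # move on log only
                             cost[i][j - 1] + 1)      # move on model only
    # Backtrack to recover the sequence of moves.
    moves, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and trace[i - 1] == model_sequence[j - 1] and cost[i][j] == cost[i - 1][j - 1]:
            moves.append((trace[i - 1], model_sequence[j - 1])); i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            moves.append((trace[i - 1], SKIP)); i -= 1
        else:
            moves.append((SKIP, model_sequence[j - 1])); j -= 1
    return cost[n][m], list(reversed(moves))

print(optimal_alignment(["A", "D", "E"], ["A", "B", "D", "E"]))
# (1, [('A', 'A'), ('>>', 'B'), ('D', 'D'), ('E', 'E')])
```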

Let us define formally the construction of the event net and the product net:

Definition 23 (Event Net)

Let be a certain trace. The event net of is a system net such that:

  • ,

  • ,

  • such that for all , ,

  • ,

  • .

Definition 24 (Product of two Petri Nets [33])

Let and be two system nets. The product net of and is the system net such that:

  • ,

  • such that ,

  • such that

  • such that for all , if , if , and otherwise,

  • ,

  • .

4 Uncertainty in Process Mining

Definition 25 (Determinate and indeterminate event qualifiers)

Let , where the “!” symbol denotes determinate events, and the “?” symbol denotes indeterminate events.

Definition 26 (Uncertain events)

Let us denote the universe of strongly uncertain events. is the universe of weakly uncertain events. Over a strongly uncertain event we define the following projection functions: , , and . We omit the superscript from the projection functions if it is clear from the context.

Now that the definitions of strongly and weakly uncertain events are structured, let us aggregate them in uncertain event logs.

Definition 27 (Event logs)

A strongly uncertain event log is a set of events such that every event identifier in is unique. A weakly uncertain event log is a set of events such that every event identifier in is unique.

A weakly uncertain event log has a corresponding strongly uncertain event log such that .

Definition 28 (Realization of an event log)

is a realization of if and only if:

  • For all there is a distinct such that , , and ;

  • For all with there is a distinct such that , , and .

is the set of all such realizations of the log .

Note that these definitions allow us to transform a weakly uncertain log into a strongly uncertain one, and a strongly uncertain one into a set of certain logs.
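As an illustration of how uncertainty multiplies into realizations, the sketch below enumerates the realizations of a simple uncertain trace along the activity and indeterminacy dimensions; uncertain timestamps, which affect the ordering of events, are handled separately through the order-realizations introduced later in this section. The data layout is the illustrative one used for Table 1 above.

```python
from itertools import product

def activity_and_indeterminacy_realizations(uncertain_trace):
    """Enumerate realizations of a simple uncertain trace along two dimensions:
    the choice of one activity label per event, and whether each indeterminate
    event ("?") actually happened. Timestamp ordering is handled separately."""
    per_event_options = []
    for event in uncertain_trace:
        options = [(label,) for label in sorted(event["activity"])]  # keep the event with one label
        if event["indeterminate"]:
            options.append(())                                       # or drop it entirely
        per_event_options.append(options)
    for combination in product(*per_event_options):
        # Flatten: each element is either a 1-tuple (kept event) or () (omitted event).
        yield [label for chosen in combination for label in chosen]

trace = [
    {"activity": {"A"},           "indeterminate": False},
    {"activity": {"B", "C", "D"}, "indeterminate": False},
    {"activity": {"D"},           "indeterminate": True},
]
print(list(activity_and_indeterminacy_realizations(trace)))
# 3 activity choices x 2 options for the indeterminate event = 6 realizations
```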

The types of uncertainty in the specific scenario we consider in this paper include:

  • Strong uncertainty on the activity;

  • Strong uncertainty on the timestamp;

  • Strong uncertainty on indeterminate events.

All three can happen concurrently. Table 3 shows such a trace, which we will use as a running example. It is worth noticing that the specific case of uncertainty on the case ID causes a problem: since an event can have many possible case IDs, it can belong to different traces. In data formats where the events are already aggregated into traces, such as the very common XES standard, this means that the information related to a trace can be non-local to the trace itself, and can be scattered across other parts of the log. We will focus on the problem of uncertainty on the case ID attribute in a future work.

Firstly, we will lay down some simplified notation in order to model the problem at hand in a more compact way.

Definition 29 (Simple uncertain events, traces and logs)

is a simple uncertain event. Let us denote with the universe of all simple uncertain events. is a simple uncertain trace if for all , and all the event identifiers are unique. denotes the universe of simple uncertain traces. is a simple uncertain log if all the event identifiers in the log are unique. For we define the following projection functions: , , and . We omit the superscript from the projection functions if it is clear from the context.

These simplified traces and logs can be related to the more general framework described in the previous section through the following transformation: let be a strongly uncertain log and let be a function mapping events onto cases such that and for all , . Thus, for , . The simple uncertain event log defined by on is .

In order to more easily work with timestamps in simple uncertain events, let us frame their time relationship as a strict partial order.

Definition 30 (Strict partial order over simple uncertain events)

Let be two simple uncertain events. is a strict partial order defined on the universe of strongly uncertain events as:

Theorem 4.1 ( is a strict partial order)
Proof

All properties characterizing strict partial orders are fulfilled by . For all we have:

  • Irreflexivity: this property is always verified, since is false (see Definition 26).

  • Transitivity: since and is totally ordered, we have that and this property is always verified.

Lemma 1 (Uncomparable events share possible timestamp values)

Let be two strongly uncertain events. and are uncomparable with respect to the strict partial order (i.e., neither nor are true) if and only if and share some possible values of their timestamp.

Proof

From Definition 30, it follows that two events are comparable if and only if either or . If both are false, then and . If we assume that then , while if then . In both cases, there are values common to both uncertain timestamps.


If the two events share timestamp values, it follows that at least one of the extremes of one event is encompassed by the extremes of the other. Assume that encompasses at least one of the extremes of (the other case is symmetric): then either or . In the first case, considering that is totally ordered and that , we have that both and are true, and and are uncomparable. The second case is proved analogously. ∎
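Lemma 1 can be checked mechanically: two uncertain events are comparable exactly when their timestamp intervals do not overlap. The sketch below uses the illustrative event layout introduced earlier, treating a certain timestamp as a degenerate interval; it is not part of the paper's formalization.

```python
def timestamp_interval(event):
    """Return the event's timestamp as an interval; a certain timestamp is a degenerate one."""
    t = event["time"]
    return t if isinstance(t, tuple) else (t, t)

def precedes(e1, e2):
    """Strict partial order: e1 precedes e2 iff e1's latest possible timestamp
    is strictly before e2's earliest possible timestamp."""
    return timestamp_interval(e1)[1] < timestamp_interval(e2)[0]

def uncomparable(e1, e2):
    """Per Lemma 1: neither event precedes the other iff the two intervals share values."""
    return not precedes(e1, e2) and not precedes(e2, e1)

e_certain = {"time": "2011-12-09T00:00"}
e_interval = {"time": ("2011-12-06T00:00", "2011-12-10T00:00")}
assert uncomparable(e_certain, e_interval)   # 2011-12-09 falls inside the interval
```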

Definition 31 (Realizations of simple uncertain traces)

Let be a simple uncertain trace. An order-realization is a permutation of the events in such that for all we have that , i.e. is a correct evaluation order for over , and the (total) order in which events are sorted in is a linear extension of the strict partial order . We denote with the set of all such order-realizations of the trace .

Given an order-realization , is an activity-realization of if and only if and for all we have that