Mining Closed Episodes with Simultaneous Events

Nikolaj Tatti, Boris Cule
ADReM, University of Antwerp, Antwerpen, Belgium
firstname.lastname@ua.ac.be
Abstract

Sequential pattern discovery is a well-studied field in data mining. Episodes are sequential patterns describing events that often occur in the vicinity of each other. Episodes can impose restrictions on the order of the events, which makes them a versatile technique for describing complex patterns in the sequence. Most of the research on episodes deals with special cases such as serial, parallel, and injective episodes, while discovering general episodes is understudied.

In this paper we extend the definition of an episode in order to be able to represent cases where events often occur simultaneously. We present an efficient and novel miner for discovering frequent and closed general episodes. Such a task presents unique challenges. Firstly, we cannot define closure based on frequency. We solve this by computing a more conservative closure that we use to reduce the search space and discover the closed episodes as a postprocessing step. Secondly, episodes are traditionally presented as directed acyclic graphs. We argue that this representation has drawbacks leading to redundancy in the output. We solve these drawbacks by defining a subset relationship in such a way that allows us to remove the redundant episodes. We demonstrate the efficiency of our algorithm and the need for using closed episodes empirically on synthetic and real-world datasets.

Categories and Subject Descriptors H.2.8 [Database management]: Database applications—Data mining; G.2.2 [Discrete mathematics]: Graph theory

General Terms: Algorithms, Theory

Keywords: Frequent episodes, Closed episodes, Depth-first search

      Discovering interesting patterns in data sequences is a popular aspect of data mining. An episode is a sequential pattern representing a set of events that reoccur in a sequence [?]. In its most general form, an episode also imposes a partial order on the events. This allows great flexibility in describing complex interactions between the events in the sequence.

      Existing research in episode mining is dominated by two special cases: parallel episodes, patterns where the order of the events does not matter, and serial episodes, requiring that the events must occur in one given order. Proposals have been made (see the discussion of related work below) for discovering episodes with partial orders, but these approaches impose various limitations on the events. In fact, to our knowledge, there is no published work giving an explicit description of a miner for general episodes.

      We believe that there are two main reasons why general episodes have attracted less interest. Firstly, implementing a miner is surprisingly difficult, since testing whether an episode occurs in the sequence is an NP-complete problem. Secondly, the fact that episodes are such a rich pattern type leads to a severe pattern explosion.

      Another limitation of episodes is that they do not properly address simultaneous events. However, sequences containing such events are frequently encountered, for example, in sequential data generated by multiple sensors and then collected into one stream. In such a setting, if two events often occur simultaneously, existing approaches will depict this pattern as a parallel episode, which will only tell the user that these two events often occur near each other, in no particular order. This is a major limitation, since the actual pattern contains much more information.

      In this paper we propose a novel and practical algorithm for mining frequent closed episodes that properly handles simultaneous events. Such a task poses several challenges.

      Firstly, we can impose four different relationships between two events $a$ and $b$: (1) the order of $a$ and $b$ does not matter, (2) events $a$ and $b$ should occur at the same time, (3) $b$ should occur after $a$, and (4) $b$ should occur after or at the same time as $a$. We extend the definition of an episode to handle all these cases. In further text, we consider events simultaneous only if they occur exactly at the same time. However, we can easily adjust our framework to consider events simultaneous if they occur within a chosen time interval.

      Secondly, a standard approach for representing a partial order of the events is by using a directed acyclic graph (DAG). The mining algorithm would then discover episodes by adding nodes and edges. However, we point out that such a representation has drawbacks. One episode may be represented by several graphs and the subset relationship based on the graphs is not optimal. This ultimately leads to outputting redundant patterns. We will address this problem.

      Thirdly, we attack the problem of pattern explosion by using closed patterns. There are two particular challenges with closed episodes. Firstly, we point out that we cannot define a unique closure for an episode, that is, an episode may have several maximal episodes with the same frequency. Secondly, the definition of a closure requires a subset relationship, and computing the subset relationship between episodes is NP-hard.

      We mine patterns using a depth-first search. An episode is represented by a DAG and we explore the patterns by adding nodes and edges. To reduce the search space we use the instance-closure of episodes. While it is not guaranteed that an instance-closed episode is actually closed, using such episodes will greatly trim the pattern space. Finally, the actual filtering for closed episodes is done in a post-processing step. We introduce techniques for computing the subset relationship, distinguishing the cases where we can do a simple test from the cases where we have to resort to recursive enumeration. This filtering will remove all redundancies resulting from using DAGs for representing episodes.

      The rest of the paper is organised as follows. We first discuss the most relevant related work and then present the main notations and concepts, followed by the notion of closure in the context of episodes. We then present our algorithm in detail: the handling of episode instances, the depth-first search, and the test for the subset relationship. Finally, we present the results of our experiments and our conclusions. The proofs of the theorems can be found in the Appendix and the code of the algorithm is available online at http://adrem.ua.ac.be/implementations.

      The first attempt at discovering frequent subsequences, or serial episodes, was made by Wang et al. [?]. The dataset consisted of a number of sequences, and a pattern was considered interesting if it was long enough and could be found in a sufficient number of sequences. A complete solution to a more general problem was later provided by Agrawal and Srikant [?] using an Apriori-style algorithm [?].

      Looking for frequent general episodes in a single event sequence was first proposed by Mannila et al. [?]. The Winepi algorithm finds all episodes that occur in a sufficient number of windows of fixed length. Specific algorithms were given for the case of parallel and serial episodes. However, no algorithm for detecting general episodes was provided.

      Some research has gone into outputting only closed subsequences, where a sequence is considered closed if it is not properly contained in any other sequence which has the same frequency. Yan et al. [?], Tzvetkov et al. [?], and Wang and Han [?] proposed methods for mining such closed patterns, while Garriga [?] further reduced the output by post-processing it and representing the patterns using partial orders (despite their name, the partial orders discovered by Garriga are different from general episodes). Harms et al. [?], meanwhile, experiment with closed serial episodes. In another attempt to trim the output, Garofalakis et al. [?] proposed a family of algorithms called Spirit which allow the user to define regular expressions that specify the language that the discovered patterns must belong to.

      Pei et al. [?], and Tatti and Cule [?] considered restricted versions of our problem setup. The former approach assumes a dataset of sequences where the same label can occur only once. Hence, an episode can contain only unique labels. The latter pointed out the problem of defining a proper subset relationship between general episodes and tackled it by considering only episodes where two nodes having the same label had to be connected. In our work, we impose no restrictions on the labels of events making up the episodes.

      In this paper we use frequency based on a sliding window as it is defined for Winepi. However, we can easily adapt our approach to other monotonically decreasing measures, as well as to a setup where the data consists of many (short) sequences instead of a single long one. Mannila et al. propose Minepi [?], an alternative interestingness measure for an episode, where the support is defined as the number of minimal windows. Unfortunately, this measure is not monotonically decreasing. However, the issue can be fixed by defining support as the maximal number of non-overlapping minimal windows [?, ?]. Zhou et al. [?] proposed mining closed serial episodes based on the Minepi method. However, the paper did not address the non-monotonicity issue of Minepi.

      Alternative interestingness measures, either statistically motivated or aimed at removing bias towards smaller episodes, were proposed by Garriga [?], Méger and Rigotti [?], Gwadera et al. [?, ?], Calders et al. [?], Cule et al. [?], and Tatti [?].

      Using episodes to discover simultaneous events has, to our knowledge, not been done yet. However, this work is somewhat related to efforts made in discovering sequential patterns in multiple streams [?, ?, ?]. Here, it is possible to discover a pattern wherein two events occur simultaneously, as long as they occur in separate streams.

      We begin this section by introducing the basic concepts that we will use throughout the paper. First we will describe our dataset.

      Definition

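      A sequence event $e = (\mathit{id}(e), \mathit{lab}(e), \mathit{ts}(e))$ is a tuple consisting of three entries, a unique id number $\mathit{id}(e)$, a label $\mathit{lab}(e)$, and a time stamp integer $\mathit{ts}(e)$. We will assume that if $\mathit{id}(e) > \mathit{id}(f)$, then $\mathit{ts}(e) \geq \mathit{ts}(f)$. A sequence is a collection of sequence events ordered by their ids.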

      Note that we are allowing multiple events to have the same time stamp even when their labels are equivalent. For the sake of simplicity, we will use the notation to mean a sequence . We will also write to mean the sequence

      This means that and have equal time stamps.

      Our next step is to define patterns we are interested in.

      Definition

      An episode event is a tuple consisting of two entries, a unique id number and a label . An episode graph is a directed acyclic graph (DAG). The graph may have two types of edges: weak edges and proper edges .

      An episode consists of a collection of episode events, an episode graph, and a surjective mapping from episode events to the nodes of the graph which we will denote by . A proper edge from node to node in the episode graph implies that the events of must occur before the events of , while a weak edge from to implies that the events of may occur either at the same time as those of or later.

      We will assume that the nodes of are indexed and we will use the notation to refer to the th node in .

      When there is no danger of confusion, we will use the same letter to denote an episode and its graph. Note that we are allowing multiple episode events to share the same node even if these events have the same labels.

      Definition

      Given an episode and a node , we define to be the multiset of labels associated with the node. Given two multisets of labels and we write if is lexicographically smaller than or equal to . We also define to be the multiset of all labels in .

      Definition

      A node in an episode graph is a descendant of a node if there is a path from to . If there is a path containing a proper edge we will call a proper descendant of . We similarly define a (proper) ancestor. A node is a source if it has no ancestors. A node is a proper source if it has no proper ancestors. We denote all sources of an episode by .

      We are now ready to give a precise definition of an occurrence of a pattern in a sequence.

      Definition

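      Given a sequence $s$ and an episode $G$, we say that $s$ covers $G$ if there exists an injective mapping $m$ from the episode events to the sequence events such that

      1. labels are respected, $\mathit{lab}(m(e)) = \mathit{lab}(e)$,

      2. events sharing the same node map to events with the same time stamp, in other words, $n(e) = n(f)$ implies $\mathit{ts}(m(e)) = \mathit{ts}(m(f))$,

      3. weak edges are respected, if $n(e)$ is a descendant of $n(f)$, then $\mathit{ts}(m(e)) \geq \mathit{ts}(m(f))$,

      4. proper edges are respected, if $n(e)$ is a proper descendant of $n(f)$, then $\mathit{ts}(m(e)) > \mathit{ts}(m(f))$.

      Here $e$ and $f$ range over the episode events, $n(e)$ denotes the node of $e$, and $\mathit{lab}(\cdot)$ and $\mathit{ts}(\cdot)$ denote the label and the time stamp of an event.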

      Note that this definition allows us to abuse notation and map graph nodes directly to time stamps, that is, given a graph node we define , where .
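The four conditions can be checked directly by enumerating injective mappings. The following Python sketch is an illustration only, not the miner's implementation: the function name `covers` and the data layout (a sequence as a list of `(label, time_stamp)` pairs, an episode as a list of event labels, a node index per event, and two sets of node pairs for weak and proper edges) are assumptions made for this example. Checking every edge suffices, since the descendant conditions follow by chaining the edge inequalities. The running time is exponential, which is unsurprising given that the coverage test is NP-complete.

```python
from itertools import permutations


def covers(sequence, labels, node_of, weak_edges, proper_edges):
    """Return True if `sequence` covers the episode (brute force).

    sequence     -- list of (label, time_stamp); the list index plays the role of the id
    labels       -- labels[i] is the label of episode event i
    node_of      -- node_of[i] is the graph node that episode event i belongs to
    weak_edges   -- set of (u, v): v occurs at the same time as u or later
    proper_edges -- set of (u, v): v occurs strictly later than u
    """
    k = len(labels)
    # permutations() yields tuples of distinct indices, so each candidate is injective.
    for image in permutations(range(len(sequence)), k):
        # Condition 1: labels are respected.
        if any(sequence[image[i]][0] != labels[i] for i in range(k)):
            continue
        # Condition 2: events sharing a node map to a single time stamp.
        time_of = {}
        consistent = True
        for i in range(k):
            stamp = sequence[image[i]][1]
            if time_of.setdefault(node_of[i], stamp) != stamp:
                consistent = False
                break
        if not consistent:
            continue
        # Conditions 3 and 4: weak and proper edges are respected.
        if (all(time_of[u] <= time_of[v] for u, v in weak_edges)
                and all(time_of[u] < time_of[v] for u, v in proper_edges)):
            return True
    return False


# Episode: events a and b share node 0, event c sits on node 1, proper edge 0 -> 1.
episode = (["a", "b", "c"], [0, 0, 1], set(), {(0, 1)})
print(covers([("a", 1), ("b", 1), ("c", 2)], *episode))  # a and b simultaneous, c later
print(covers([("a", 1), ("b", 2), ("c", 3)], *episode))  # a and b are not simultaneous
```

The first call succeeds because a and b share a time stamp and c follows; the second fails on condition 2.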

      Consider the first three episodes in Figure Mining Closed Episodes with Simultaneous Events. A sequence covers and but not (proper edge is violated). A sequence covers and but not .

      Figure: Toy episodes. Proper edges are drawn solid. Weak edges are drawn dashed.

      Finally, we are ready to define support of an episode based on fixed windows. This definition corresponds to the definition used in Winepi [?]. The support is monotonically decreasing which allows us to do effective pruning while discovering frequent episodes.

      Definition

      Given a sequence and two integers and we define a subsequence

      containing all events occurring between and .

      Definition

      Given a window size and an episode , we define the support of an episode in , denoted , to be the number of windows of size in covering the episode,

      We will use whenever is clear from the context. An episode is -frequent (or simply frequent) if its support is greater than or equal to some given threshold .

      Consider a sequence and set the window size . There are windows covering episode (given in Figure Mining Closed Episodes with Simultaneous Events), namely and . Hence .

      Theorem

      Testing whether a sequence covers an episode is an NP-complete problem, even if does not contain simultaneous events.

      In practice, episodes are represented by DAGs and are mined by adding nodes and edges. However, such a representation has drawbacks [?]. To see this, consider episodes and in Figure Mining Closed Episodes with Simultaneous Events. Even though these episodes have different graphs, they are essentially the same — both episodes are covered by exactly the same sequences, namely all sequences containing , , , , , or . In other words, essentially the same episode may be represented by several graphs. Moreover, using the graph subset relationship to determine subset relationships between episodes will ultimately lead to less efficient algorithms and redundancy in the final output. To counter these problems we introduce a subset relationship based on coverage.

      Definition

      Given two episodes and , we say that is a subepisode of , denoted , if any sequence covering also covers . If and , we say that and are similar in which case we will write .

      This definition gives us the optimal definition for a subset relationship in the following sense: if , then there exists a sequence such that .

      Consider the episodes given in Figure Mining Closed Episodes with Simultaneous Events. It follows from the definition that , , and . Episodes and are not comparable.

      Theorem

      Testing is an NP-hard problem.

      Proof

      The hardness follows immediately from the NP-completeness of the coverage test, as we can represent the sequence $s$ as a serial episode $H$. Then $s$ covers an episode $G$ if and only if $G$ is a subepisode of $H$.

      As mentioned in the introduction, pattern explosion is a major problem when discovering general episodes. We tackle this by mining only closed episodes.

      Definition

      An episode is closed if there is no with .

      We should point out that, unlike with itemsets, an episode may have several maximal superepisodes having the same frequency. Consider , , and in Figure Mining Closed Episodes with Simultaneous Events, sequence and window size . The support of episodes , and is . Moreover, there is no superepisode of or that has the same support. Hence, and are both maximal superepisodes having the same support as . This implies that we cannot define a closure operator based on frequency. However, we will see in the next section that we can define a closure based on instances. This closure, while not removing all redundant episodes, will prune the search space dramatically. The final pruning will then be done in a post-processing step.

      Figure: Toy episodes demonstrating closure.

      Our final step is to define transitively closed episodes that we will use along with the instance-closure (defined in the next section) in order to reduce the pattern space.

      Definition

      Let be an episode. A transitive closure, , is obtained by adding edges from a node to each of its descendants making the edge proper if the descendant is proper, and weak otherwise. If we say that is transitively closed.

      It is trivial to see that given an episode , we have . Thus we can safely ignore all episodes that are not transitively closed. From now on, unless we state otherwise, all episodes are assumed to be transitively closed.
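As an illustration of the definition, the following Python sketch computes a transitive closure with weak and proper edges. It assumes the episode graph is given as a list of nodes and two sets of directed node pairs; the function names and this representation are hypothetical and chosen only for the example. A Floyd-Warshall style sweep joins paths, and a joined path is proper as soon as one of its parts is proper.

```python
def transitive_closure(nodes, weak_edges, proper_edges):
    """Return the (weak, proper) edge sets of the transitive closure of a DAG."""
    # kind[(u, v)] records the strongest known connection from u to v.
    kind = {edge: "weak" for edge in weak_edges}
    kind.update({edge: "proper" for edge in proper_edges})
    # Floyd-Warshall style sweep: join a path u -> k with a path k -> v.
    for k in nodes:
        for u in nodes:
            for v in nodes:
                if (u, k) in kind and (k, v) in kind and kind.get((u, v)) != "proper":
                    through_proper = "proper" in (kind[(u, k)], kind[(k, v)])
                    if through_proper or (u, v) not in kind:
                        kind[(u, v)] = "proper" if through_proper else "weak"
    weak = {edge for edge, label in kind.items() if label == "weak"}
    proper = {edge for edge, label in kind.items() if label == "proper"}
    return weak, proper


def is_transitively_closed(nodes, weak_edges, proper_edges):
    """An episode is transitively closed if the closure adds or changes nothing."""
    closure = transitive_closure(nodes, weak_edges, proper_edges)
    return closure == (set(weak_edges), set(proper_edges))


# Weak edge 0 -> 1 followed by proper edge 1 -> 2 implies a proper edge 0 -> 2.
print(transitive_closure([0, 1, 2], {(0, 1)}, {(1, 2)}))
```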

      The reason why depth-first search is efficient for itemsets is that at each step we only need to handle the current projected dataset. In our setup we have only one sequence so we need to transform the sequence into a more efficient structure.

      Definition

      Given an input sequence and an episode , an instance is a valid mapping from to such that for each there is no such that , and . We define to be the smallest time stamp and to be the largest time stamp in . We require that , where is the size of the sliding window. An instance set of an episode , defined as is a set of all instances ordered by .

      The condition in the definition allows us to ignore some redundant mappings whenever we have two sequence events, say and , with and . If an instance uses only , then we can obtain from by replacing with . However, and are essentially the same for our purposes, so we can ignore either or . We require the instance set to be ordered so that we can compute the support efficiently. This order is not necessarily unique.

      Consider sequence and in Figure Mining Closed Episodes with Simultaneous Events. Then (for simplicity, we write mappings as tuples).

      Using instances gives us several advantages. Adding new events and edges to episodes becomes easy. For example, adding a proper edge is equivalent to keeping only those instances in which the source node of the edge is mapped to a strictly earlier time stamp than the target node. We can also compute support and closure efficiently. We should point out that an instance set may contain an exponential number of instances, otherwise the NP-completeness of the coverage test would imply that P = NP. However, this is not a problem in practice.
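Adding an edge as a filter on the instance set can be sketched as follows. The representation of an instance as a dictionary from graph nodes to time stamps is an assumption made for this example, as are the function names. The weak-edge variant is our natural analogue of the proper-edge case and is not spelled out in the text above. Both filters preserve the order of the instance set.

```python
def add_proper_edge(instances, u, v):
    """Keep the instances in which node u occurs strictly before node v."""
    return [inst for inst in instances if inst[u] < inst[v]]


def add_weak_edge(instances, u, v):
    """Keep the instances in which node v occurs at the same time as u or later."""
    return [inst for inst in instances if inst[u] <= inst[v]]


instances = [{"x": 1, "y": 2}, {"x": 2, "y": 2}, {"x": 3, "y": 1}]
print(add_proper_edge(instances, "x", "y"))  # only the first instance survives
print(add_weak_edge(instances, "x", "y"))    # the first two instances survive
```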

      The depth-first search of our miner, described later in the paper, adds events to the episodes. Our next goal is to define algorithms for computing the resulting instance set whenever we add an episode event, say , to an episode . Given an instance of and a sequence event we will write to mean an expanded instance by setting .

      Let be an episode and let be the instance set. Assume a node and a label . Let be the episode obtained from by adding an episode event with label to node . We can compute from using the AugmentEqual algorithm given in Alg. 1.

      input : , a node , a label
      output : 
      1 return ;
      Algorithm 1 AugmentEqual, augments

      The second augmentation algorithm deals with the case where we are adding a new node with a single event labelled as to a parallel episode. Algorithm Augment, given in Alg. 2, computes the new instance set. The algorithm can be further optimised by doing augmentation with a type of merge sort so that post-sorting is not needed.

      input : , label of the new event
      output : 
      1 ;
      2 ;
      3 Sort by ;
      4 return ;
      Algorithm 2 Augment, augments instances

      Our next step is to compute the support of an episode from . We do this with the Support algorithm, given in Alg. 3. The algorithm is based on the observation that there are windows that contain the instance . However, some windows may contain more than one instance and we need to compensate for this.

      input : 
      output : 
      1 ; ;
      2 foreach  in reverse order do
      3       if  then
      4             Add to ;
      5             ;
      6            
      7      
      8; ;
      9 foreach  do
      10       ;
      11       ;
      12       ;
      13       ;
      14       ;
      15      
      16return ;
      Algorithm 3 Support, computes support
      Theorem

      Support computes the support of the episode from its instance set.

      Proof

      Since is ordered by , the first for loop of the algorithm removes any instance for which there is an instance such that . In other words, any window that contains will also contain . We will show that the next for loop counts the number of windows containing at least one instance from , this will imply the theorem.

      To that end, let be the th instance in and define to be the set of windows of size containing . It follows that and that the first window of starts at . Let . Note that because of the pruning we have , this implies that . We know that . This implies that on Line 3 we have and since , this proves the theorem.
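The counting idea can be illustrated with a short Python sketch. This is not a transcription of Alg. 3: instead of pruning dominated instances and accumulating differences, it computes the size of the union of the admissible window start points, which yields the same count under the stated assumptions. We assume that a window of size `window` starting at `s` spans the integer time stamps from `s` to `s + window - 1`, that windows may start before the first event, and that each instance is summarised by its smallest and largest time stamp. The function name and the tuple layout are hypothetical.

```python
def support_from_instances(instances, window):
    """Count the windows of length `window` that contain at least one instance.

    instances -- iterable of (first, last) time stamps with last - first < window
    window    -- window size (number of consecutive integer time stamps)
    """
    # A window starting at s spans [s, s + window - 1]; it contains an instance
    # exactly when last - window + 1 <= s <= first.
    starts = sorted((last - window + 1, first) for first, last in instances)
    total = 0
    covered_up_to = None  # largest window start counted so far
    for lo, hi in starts:
        if covered_up_to is not None and lo <= covered_up_to:
            lo = covered_up_to + 1  # skip window starts that were already counted
        if lo <= hi:
            total += hi - lo + 1
            covered_up_to = hi
    return total


# Two overlapping instances: window starts 0..1 and 1..2, so three distinct windows.
print(support_from_instances([(1, 3), (2, 4)], window=4))
```

A single instance with first and last time stamps f and l contributes `window - (l - f)` starts, and overlapping contributions are counted only once.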

      Finally, we define a closure episode of an instance set.

      Definition

      Let be a set of instances. We define an instance-closure, to be the episode having the same nodes and events as . We define the edges

      If we say that is -closed.

      Consider sequence and episodes given in Figure Mining Closed Episodes with Simultaneous Events. Since events and always occur between and in , the instance closure is . Note that is not closed because and are both superepisodes of with the same support. However, instance-closure reduces the search space dramatically because we do not have to iterate the edges implied by the closure. Note that the closure may produce cycles with weak edges. However, we will later show that we can ignore such episodes.
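A minimal sketch of the instance-closure follows, under our reading of the definition: a proper edge is added from one node to another whenever the first occurs strictly earlier in every instance, and a weak edge whenever it never occurs later. Instances are again assumed to be dictionaries from nodes to time stamps, and the function name is hypothetical. Two nodes that always share a time stamp receive weak edges in both directions, which is the kind of weak cycle mentioned above.

```python
def instance_closure_edges(nodes, instances):
    """Return the (weak, proper) edge sets that hold in every instance.

    nodes     -- list of node identifiers
    instances -- non-empty list of dicts mapping each node to its time stamp
    """
    if not instances:
        raise ValueError("the instance set must not be empty")
    weak, proper = set(), set()
    for u in nodes:
        for v in nodes:
            if u == v:
                continue
            if all(inst[u] < inst[v] for inst in instances):
                proper.add((u, v))  # u strictly precedes v in every instance
            elif all(inst[u] <= inst[v] for inst in instances):
                weak.add((u, v))    # v never precedes u
    return weak, proper


instances = [{"a": 1, "b": 2, "c": 2}, {"a": 1, "b": 3, "c": 3}]
weak, proper = instance_closure_edges(["a", "b", "c"], instances)
print(sorted(weak), sorted(proper))
```

Here b and c are simultaneous in both instances, so the closure contains the weak cycle between them, while a receives proper edges to both.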

      We are now ready to describe the mining algorithm. Our approach is a straightforward depth-first search. The algorithm works on three different levels.

      The first level, consisting of Mine (Alg. 4), and MineParallel (Alg. 5), adds episode events. Mine is only used for creating singleton episodes while MineParallel provides the actual search. The algorithm adds events so that the labels of the nodes are decreasing, . The search space is traversed by either adding an event to the last node or by creating a node with a single event. Episodes in Figure Mining Closed Episodes with Simultaneous Events are created by the first level. The edge in is augmented by the closure.

      Figure: Search space of the miner for an example sequence, window size, and support threshold. Each state shows the corresponding episode and its support.
      1for  do
      2       ;
      3       ;
      4       if  then  ; ;
      5      
      Algorithm 4 Mine, discovers frequent closed episodes
      input : episode ,
      1 ;
      2 ;
      3 ;
      4 for  do
      5       if  or  then
      6             ;
      7             ;
      8             if  then
      9                   ;
      10                  
      11            
      12      
      13for  do
      14       ;
      15       ;
      16       if  then
      17             ;
      18            
      19      
      Algorithm 5 MineParallel, recursive routine adding episode events

      The second level, MineWeak, given in Alg. 6, adds weak edges to the episode, while the third level, MineProper, given in Alg. 7, turns weak edges into proper edges. Both algorithms add only those edges which keep the episode transitively closed. The algorithms keep a list of forbidden weak edges and forbidden proper edges . These lists guarantee that each episode is visited only once.

      and in Figure Mining Closed Episodes with Simultaneous Events are discovered by MineWeak, however, the weak edges are converted into proper edges by the instance-closure.

      input : ep. , , forbidden weak edges
      1 ;
      2 ;
      3 for  do
      4       if  is transitively closed  then
      5             ;
      6             ;
      7             if  then
      8                   ;
      9                  
      10            Add to ;
      11            
      12      
      Algorithm 6 MineWeak, recursive routine adding weak edges
      input : episode , , forbidden weak edges , forbidden proper edges
      1 for  do
      2       if  proper edge is transitively closed  then
      3             ;
      4             ;
      5             if  then
      6                   ;
      7                  
      8            Add to ;
      9            
      10      
      Algorithm 7 MineProper, recursive routine adding proper edges

      At each step, we call TestEpisode. This routine, given a set of instances , will compute the instance-closure and test it. If the episode passes all the tests, the algorithm will return the episode and the search is continued, otherwise the branch is terminated. There are four different tests. The first test checks whether the episode is frequent. If we pass this test, we compute the instance-closure . The second test checks whether contains cycles. Let be an episode such that . In order for to have cycles we must have two nodes, say and , such that for all . Let be an episode obtained from by merging nodes in and together. We have and . This holds for any subsequent episode discovered in the branch allowing us to ignore the whole branch.

      If the closure introduces into any edge that has been explored in the previous branches, then that implies that has already been discovered. Hence, we can reject if any such edge is introduced. The final condition is that during MineProper no weak edges should be added into by the closure. If a weak edge is added, we can reject because it can be reached via an alternative route, by letting MineWeak add the additional edges and then calling MineProper.

      The algorithm keeps a list of all discovered episodes that are closed. If all four tests are passed, the algorithm tests whether there are subepisodes of in having the same frequency, and deletes them. On the other hand, if there is a superepisode of in , then is not added into .

      input : , forbidden weak edges , forbidden proper edges
      output : , if passes the tests, null otherwise
      1 ;
      2 if  then  return null; ;
      3 ;
      4 if there are cycles in  then
      5       return null ;
      6      
      7if  or  then
      8       return null ;
      9      
      10;
      11 foreach  do
      12       if  and  then
      13             return ;
      14            
      15      if  and  then
      16             Delete from ;
      17            
      18      
      19Add to ;
      20 return ;
      Algorithm 8 TestEpisode, tests the episode and updates the list of discovered episodes

      In this section we will describe the technique for testing the subepisode relationship between two episodes. In general, this problem is difficult to solve, as shown by the NP-hardness result above. Fortunately, in practice, a major part of the comparisons are easy to do. The following theorem says that if each label of the first episode occurs only once in the second, then we can easily compare the two episodes.

      Theorem

      Assume two episodes and . Assume that and for each event in only one event occurs in with the same label. Let be the unique mapping from episode events in to episode events in honouring the labels. Then if and only if

      1. implies that ,

      2. is a proper child of implies that is a proper child of ,

      3. is a child of implies that is a child of or ,

      for any two events and in .
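The three conditions translate into a simple pairwise check. The sketch below is our reading of the theorem, not the authors' code: each episode is assumed to be a triple of a dictionary from (unique) labels to nodes, a set of weak edges, and a set of proper edges, with both edge sets transitively closed so that "child" can be tested by edge membership. The function name and the representation are hypothetical.

```python
def is_subepisode_unique_labels(g, h):
    """Test whether g is a subepisode of h, assuming unique labels.

    Each episode is a triple (node_of, weak, proper):
      node_of -- dict mapping each (unique) label to its node
      weak    -- set of (u, v) weak edges, transitively closed
      proper  -- set of (u, v) proper edges, transitively closed
    """
    g_node, g_weak, g_proper = g
    h_node, h_weak, h_proper = h
    if not set(g_node) <= set(h_node):
        return False  # h lacks a label of g, so some sequence covers h but not g
    for a in g_node:
        for b in g_node:
            if a == b:
                continue
            ga, gb = g_node[a], g_node[b]
            ha, hb = h_node[a], h_node[b]
            # 1. Events sharing a node in g must share a node in h.
            if ga == gb and ha != hb:
                return False
            # 2. A proper edge in g must be a proper edge in h.
            if (ga, gb) in g_proper and (ha, hb) not in h_proper:
                return False
            # 3. A weak edge in g needs an edge in h, or a shared node in h.
            if ((ga, gb) in g_weak and ha != hb
                    and (ha, hb) not in h_weak and (ha, hb) not in h_proper):
                return False
    return True


# g: weak edge a -> b.  h: a and b simultaneous, followed properly by c.
g = ({"a": 0, "b": 1}, {(0, 1)}, set())
h = ({"a": 0, "b": 0, "c": 1}, set(), {(0, 1)})
print(is_subepisode_unique_labels(g, h), is_subepisode_unique_labels(h, g))
```

In the example, every sequence covering h has a and b at the same time stamp, which satisfies the weak edge of g, whereas g does not mention c at all.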

      If the condition of the preceding theorem does not hold we will have to resort to enumerating the sequences covering . In order to do that, we need to extend the definition of coverage and subset relationship to sets of episodes.

      Definition

      A sequence covers an episode set if there is an episode such that covers . Given two episode sets and we define if every sequence that covers also covers .

      We also need a definition of a prefix subgraph.

      Definition

      Given a graph , a prefix subgraph is a non-empty induced subgraph of with no proper edges such that if a node is included then the parents of are also included. Given a multiset of labels and an episode we define to be the set of all maximal prefix subgraphs such that for each . We define to be the episodes with the remaining nodes. Finally, given an episode set we define .

      Example

      Consider episodes given in Figure Mining Closed Episodes with Simultaneous Events. We have and .

      Figure: Toy episodes demonstrating the prefix-subgraph operations.

      The main motivation for our recursion is given in the following theorem.

      Theorem

      Given an episode set and an episode , if and only if for each prefix subgraph of , we have .

      We focus for the rest of this section on implementing the recursion in the preceding theorem. We begin with an algorithm, Generate, given in Alg. 9, that, given a graph without proper edges, discovers all prefix subgraphs.

      input : graph , nodes discovered so far
      output : list of nodes of all prefix subgraphs
      1 ;
      2 foreach  do
      3       ;
      4       Remove and its descendants from ;
      5      
      6return ;
      Algorithm 9 Generate, recursive routine for iterating the nodes of all prefix subgraphs of the input graph
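To make the notion of a prefix subgraph concrete, the following Python sketch enumerates all of them by brute force. It is deliberately simpler than the recursive Generate routine: it tries every non-empty node subset and keeps those that contain the parents of each of their nodes. The input is assumed to be a graph with weak edges only, given as a list of nodes and a set of directed node pairs; the function name is hypothetical.

```python
from itertools import combinations


def prefix_subgraphs(nodes, edges):
    """Return the node sets of all prefix subgraphs of a graph without proper edges.

    nodes -- list of node identifiers
    edges -- set of (parent, child) pairs
    """
    parents = {v: set() for v in nodes}
    for u, v in edges:
        parents[v].add(u)
    result = []
    for size in range(1, len(nodes) + 1):
        for subset in combinations(nodes, size):
            chosen = set(subset)
            # A prefix subgraph contains the parents of every included node.
            if all(parents[v] <= chosen for v in chosen):
                result.append(frozenset(chosen))
    return result


# With the single edge 0 -> 1, node 1 may only appear together with node 0.
for node_set in prefix_subgraphs([0, 1, 2], {(0, 1)}):
    print(sorted(node_set))
```

The subset enumeration is exponential in the number of nodes, so it serves only as a reference against which a recursive implementation can be checked on small graphs.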

      Given a prefix subgraph of , our next step is to discover all maximal prefix subgraphs of whose label sets are subsets of . The algorithm, Consume, creating this list, is given in Alg. 10. Consume enumerates over all sources. For each source such that , the algorithm tests if there is another node sharing a label with . If so, the algorithm creates an episode without and its descendants and calls itself. This call produces a list of prefix graphs not containing node . The algorithm removes all graphs from that can be augmented with (since in that case they are not maximal). Finally, Consume adds to the current prefix subgraph, removes from , and removes from .

      input : graph , nodes discovered so far, label set
      output : list of nodes of all maximal prefix subgraphs
      1 ;
      2 while  do
      3       a node from ;
      4       if  then
      5             Remove and its descendants from ;
      6             continue ;
      7            
      8      if there is s.t.  then
      9             with and its descendants removed;
      10             ;
      11             ;
      12            
      13      ; ;
      14       Remove from ;
      15      
      16return ;
      Algorithm 10 Consume, recursive routine for iterating the nodes of all maximal prefix subgraphs of the input graph whose labels are contained in the given label set

      We are now ready to describe the recursion step for testing the subset relationship. The algorithm, Step, is given in Alg. 11. Given an episode set and a graph , the algorithm computes . First, it tests whether we can apply the simple test for episodes with unique labels. If this is not possible, then the algorithm first removes all nodes from not carrying a label from for . This is allowed because of the following lemma.

      Lemma

      Let and be two episodes. Let be a node such that . Let be the episode obtained from by removing . Then if and only if .

      Step continues by creating a subgraph of containing only proper sources. The algorithm generates all prefix subgraphs of and tests each one. For each subgraph , Step calls Consume and builds an episode set . The algorithm then calls itself recursively with . If at least one of these calls fails, then we know that , otherwise .

      input : episode set and an episode
      output : 
      1 foreach  do
      2       if Theorem Theorem guarantees that  then
      3             return true ;
      4            
      5      if Theorem Theorem states that and  then
      6             return false ;
      7            
      8      
      9;
      10 subgraph of with only proper sources;
      11 ;
      12 foreach  do
      13       ; ;
      14       foreach  do
      15             subgraph of with only proper sources;
      16             ;
      17             ;
      18            
      19      if  then
      20             return false ;
      21            
      22      
      23return true ;
      Algorithm 11 , recursion solving

      Finally, to compute , we simply call .

      We begin our experiments with a synthetic dataset. Our goal is to demonstrate the need for using the closure. In order to do that we created sequences with a planted pattern . We added this pattern times 50 time units apart from each other. We added noise events spread uniformly over the whole sequence. We sampled the labels for the noise events uniformly from different labels. The labels for the noise and the labels of the pattern were mutually exclusive. We varied from to . We ran our miner using a window size of and varied the support threshold . The results are given in the table of results from synthetic sequences.

      nodes  threshold  output  instance-closed  frequent  scans
      1      100        1       3                3         46
      2      100        3       15               27        200
      3      100        6       63               729       744
      4      100        10      255
      5      100        15
      6      100        21
      7      100        28
      7      50         29
      7      40         32
      7      30         39
      7      20         127
      7      10         684
      Table: Results from synthetic sequences with a planted episode. The columns show the number of nodes in the planted episode, the support threshold, the size of the final output, the number of instance-closed episodes, the number of subepisodes of the planted episode, and the number of sequence scans, respectively.
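The construction of the synthetic sequences described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the planted pattern is taken to be a serial episode whose nodes each carry two simultaneous events, consecutive nodes are placed one time unit apart, the gap of 50 is measured between the starts of consecutive copies, and the function name, parameter names, and label scheme are all hypothetical.

```python
import random


def planted_sequence(num_nodes, copies, gap, num_noise, num_noise_labels, seed=0):
    """Build a sorted list of (timestamp, label) events with a planted serial pattern.

    All parameter values and the label scheme are illustrative assumptions.
    """
    rng = random.Random(seed)
    events = []
    for c in range(copies):
        start = c * gap
        for node in range(num_nodes):
            # Assumed shape: node i fires two simultaneous events,
            # one time unit after node i - 1.
            events.append((start + node, f"p{node}a"))
            events.append((start + node, f"p{node}b"))
    horizon = copies * gap
    for _ in range(num_noise):
        # Noise labels "n0", "n1", ... never collide with pattern labels "p...".
        label = f"n{rng.randrange(num_noise_labels)}"
        events.append((rng.randrange(horizon), label))
    events.sort()
    return events
```

Keeping the noise alphabet disjoint from the pattern alphabet mirrors the statement that the two label sets were mutually exclusive, so every occurrence of a pattern label stems from a planted copy.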

      When we are using a support threshold of , the only closed frequent patterns are the planted pattern and its subpatterns of form , since the frequency of these subpatterns is slightly higher than the frequency of the whole pattern. The number of instance-closed episodes (given in the instance-closed column of the table of results from synthetic sequences) grows more rapidly. The reason for this is that the instance-closure focuses on the edges and does not add any new nodes. However, the ratio of instance-closed to closed episodes becomes a bottleneck only when we are dealing with extremely large serial episodes, and for our real-world datasets this ratio is relatively small.

      The need for instance-closure becomes apparent when the number of instance-closed episodes is compared to the number of all possible general subepisodes (including those that are not transitively closed) of a planted pattern, given in the frequent column of the table of results from synthetic sequences. We see that, had we not used instance-closure, a single pattern having 6 nodes and 12 events would render pattern discovery infeasible.

      As we lower the threshold, the numbers of instance-closed episodes and closed episodes increase; however, the ratio between instance-closed and closed episodes improves. The reason for this is that the output contains episodes other than our planted episode, and those mostly contain a small number of nodes.

      We also measured the number of sequence scans, namely the number of calls made to Support, to demonstrate how fast the negative border is growing. Because our miner is a depth-first search and the computation of frequency is based on instances, a single scan is fast: we do not have to scan the whole sequence.

      Our second set of experiments was conducted on real-world data. The dataset consists of alarms generated in a factory, and contains events of different types, stretching over 18 months. An entry in the dataset consists of a time stamp and an event type. Once again, we tested our algorithm at various thresholds, varying the window size from 3 to 15 minutes. Here, too, as shown in the table of results from the alarms dataset, we can see that the number of instance-closed episodes does not explode to the level of all frequent episodes, demonstrating the need for using the instance-closure as an intermediate step. Furthermore, we see that in a realistic setting, the number of instance-closed episodes stays closer to the number of closed episodes than in the above-mentioned synthetic dataset. This is no surprise, since real datasets tend to have a lot of patterns containing a small number of events.

      As the discovery of all frequent episodes is infeasible, we estimated their number as follows. An episode has subepisodes (including those that are not transitively closed) with the same events and nodes. From the discovered instance-closed episodes we selected a subset such that each has a unique set of events and a maximal . Then the lower bound for the total number of frequent episodes is . Using such a lower bound is more than enough to confirm that the number of frequent episodes explodes much faster than the number of closed and instance-closed episodes, as can be seen in the table of results from the alarms dataset.

      Furthermore, our output contained a considerable number of episodes with simultaneous events — patterns that no existing method would have discovered. The runtimes ranged from just under 2 seconds to 90 seconds for the lowest reported thresholds.

      win (s) instance-closed freq.(est) scans
      180 6 6 6 194
      180 8 8 8 220
      180 12 12 12 282
      180 23 26 26 792
      600 4 4 4 128
      600 24 27 374
      600 90 137 493
      600 422 698
      900 24 26 350
      900 52 58 745
      900 280 426
      900
      Table: Results from the alarms dataset.

      Our third dataset consisted of trains delayed at a single railway station in Belgium. The dataset consists of actual departure times of delayed trains, coupled with train numbers, and contains events involving different train IDs, stretching over a period of one month. A window of 30 minutes was chosen by a domain expert. The time stamps were expressed in seconds, so a single train being delayed on a particular day would be found in 1800 windows. Therefore, the frequency threshold for interesting patterns had to be set relatively high. The results, shown in the table of results from the trains dataset, were similar to those of the alarm dataset. The runtimes ranged from a few milliseconds to 55 minutes for the lowest reported threshold. The largest discovered pattern was of size 10, and the total number of frequent episodes at the lowest threshold was at least million, once again demonstrating the need for both outputting only closed episodes and using instance-closure.

      instance-closed freq.(est) scans
      141 141 141
      Table: Results from the trains dataset.
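The figure of 1800 windows for a single delayed train follows from the window length of 30 minutes expressed in seconds. The short Python check below makes the counting explicit; it assumes one sliding window per integer start time and half-open windows of the form [s, s + w), and the function name and arguments are hypothetical.

```python
def windows_containing(event_time, window_size, first_start, last_start):
    """Count window start times s in [first_start, last_start] whose
    half-open window [s, s + window_size) contains event_time."""
    return sum(
        1
        for s in range(first_start, last_start + 1)
        if s <= event_time < s + window_size
    )


# A 30-minute window is 30 * 60 = 1800 seconds long.
count = windows_containing(event_time=10_000, window_size=30 * 60,
                           first_start=0, last_start=20_000)
```

Because every isolated event already contributes this many windows, window-based frequencies start from a high baseline, which is consistent with the need for a relatively high threshold noted above.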

      In this paper we introduce a new type of sequential pattern, a general episode that takes into account simultaneous events, and provide an efficient depth-first search algorithm for mining such patterns.

      This problem setup has two major challenges. The first challenge is the pattern explosion, which we tackle by discovering only closed episodes. Interestingly enough, we cannot define closure based on frequency, hence we define closure based on instances. While it holds that frequency-closed episodes are instance-closed, the opposite is not true. However, in practice, instance-closure reduces the search space dramatically, so we can mine all instance-closed episodes and discover frequency-closed episodes as a post-processing step.

      The second challenge is to correctly compute the subset relationship between two episodes. We argue that using a subset relationship based on graphs is not optimal and will lead to redundant output. We define a subset relationship based on coverage and argue that this is the correct definition. Testing this relationship turns out to be NP-hard. However, this is not a problem since in practice most of the comparisons can be done efficiently.

      Nikolaj Tatti is supported by a Post-doctoral Fellowship of the Research Foundation – Flanders (FWO).

      The authors wish to thank Toon Calders for providing the proof that checking whether a sequence covers an episode is NP-hard on a small piece of paper.

      • [1] R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In Proceedings of the 20th International Conference on Very Large Data Bases (VLDB 1994), pages 487–499, 1994.
      • [2] R. Agrawal and R. Srikant. Mining sequential patterns. In Proceedings of the 11th International Conference on Data Engineering (ICDE 1995), pages 3–14, 1995.
      • [3] T. Calders, N. Dexters, and B. Goethals. Mining frequent itemsets in a stream. In Proceedings of the 7th IEEE International Conference on Data Mining (ICDM 2007), pages 83–92, 2007.
      • [4] G. Casas-Garriga. Discovering unbounded episodes in sequential data. In Knowledge Discovery in Databases: PKDD 2003, 7th European Conference on Principles and Practice of Knowledge Discovery in Databases, pages 83–94, 2003.
      • [5] G. Casas-Garriga. Summarizing sequential data with closed partial orders. In Proceedings of the SIAM International Conference on Data Mining (SDM 2005), pages 380–391, 2005.
      • [6] G. Chen, X. Wu, and X. Zhu. Sequential pattern mining in multiple streams. In Proceedings of the 5th IEEE International Conference on Data Mining (ICDM 2005), pages 585–588, 2005.
      • [7] B. Cule, B. Goethals, and C. Robardet. A new constraint for mining sets in sequences. In Proceedings of the SIAM International Conference on Data Mining (SDM 2009), pages 317–328, 2009.
      • [8] M. Garofalakis, R. Rastogi, and K. Shim. Mining sequential patterns with regular expression constraints. IEEE Transactions on Knowledge and Data Engineering, 14(3):530–552, 2002.
      • [9] R. Gwadera, M. J. Atallah, and W. Szpankowski. Markov models for identification of significant episodes. In Proceedings of the SIAM International Conference on Data Mining (SDM 2005), pages 404–414, 2005.
      • [10] R. Gwadera, M. J. Atallah, and W. Szpankowski. Reliable detection of episodes in event sequences. Knowledge and Information Systems, 7(4):415–437, 2005.
      • [11] R. Gwadera and F. Crestani. Discovering significant patterns in multi-stream sequences. In Proceedings of the 8th IEEE International Conference on Data Mining (ICDM 2008), pages 827–832, 2008.
      • [12] S. K. Harms, J. S. Deogun, J. Saquer, and T. Tadesse. Discovering representative episodal association rules from event sequences using frequent closed episode sets and event constraints. In Proceedings of the IEEE International Conference on Data Mining (ICDM 2001), pages 603–606, 2001.
      • [13] S. Laxman, P. S. Sastry, and K. P. Unnikrishnan. A fast algorithm for finding frequent episodes in event streams. In Proceedings of the 13th ACM SIGKDD International Conference on Knowledge discovery and data mining (KDD 2007), pages 410–419, 2007.
      • [14] H. Mannila, H. Toivonen, and A. I. Verkamo. Discovery of frequent episodes in event sequences. Data Mining and Knowledge Discovery, 1(3):259–289, 1997.
      • [15] N. Méger and C. Rigotti. Constraint-based mining of episode rules and optimal window sizes. In Knowledge Discovery in Databases: PKDD 2004, 8th European Conference on Principles and Practice of Knowledge Discovery in Databases, pages 313–324, 2004.
      • [16] T. Oates and P. R. Cohen. Searching for structure in multiple streams data. In Proceedings of the 13th International Conference on Machine Learning (ICML 1996), pages 346–354, 1996.
      • [17] J. Pei, H. Wang, J. Liu, K. Wang, J. Wang, and P. S. Yu. Discovering frequent closed partial orders from strings.