ONCE and ONCE+: Counting the Frequency of Time-constrained Serial Episodes in a Streaming Sequence
As a representative sequential pattern mining problem, counting the frequency of serial episodes in a streaming sequence has drawn continuous attention in academia due to its wide application in practice, e.g., telecommunication alarms, stock markets, transaction logs, and bioinformatics. Although a number of serial episode mining algorithms have been developed recently, most of them are neither stream-oriented, as they require multiple passes over the dataset, nor time-aware, as they fail to take into account the time constraint of serial episodes. In this paper, we propose two novel one-pass algorithms, ONCE and ONCE+, which respectively compute two popular frequencies of given episodes satisfying a predefined time constraint as signals in a stream arrive one after another. ONCE is used only for the non-overlapped frequency, where the occurrences of a serial episode in the sequence do not intersect. ONCE+ is designed for the distinct frequency, where the occurrences of a serial episode do not share any event. Theoretical study proves that our algorithms correctly count the frequency of target time-constrained serial episodes in a given stream. Experimental study over both real-world and synthetic datasets demonstrates that the proposed algorithms can work, with little time and space, in signal-intensive streams where millions of signals arrive within a single second. Moreover, the algorithms have been applied in a real stream processing system, where the efficacy and efficiency of this work are tested in practical applications.
With the development of cloud computing, the internet of things, biocomputing and so on, numerous ordered sequences are accessible from various daily applications. Among these applications, mining serial episodes from long sequences has many potential uses and has therefore drawn much research attention, especially in the fields of telecommunication, finance, neuroscience and information security. A serial episode is an ordered collection of specific signals, e.g., a sequential alarm pattern in a telecommunication alarm sequence, and the number of times it appears in the sequence is referred to as its frequency. Generally, the frequency of serial episode patterns can be used to analyze or summarize the whole sequence and to predict future signals in the sequence.
For instance, a long telecommunication alarm sequence can be summarized using a limited number of representative serial episodes; on the other hand, we may be interested in the frequency of some specific alarm episode patterns within the long sequence so that responses towards these alarm episodes can be optimized; besides, it can be directly used to mine frequent serial episodes; we may also be interested in predicting the future alarms within the sequence. All of these examples require counting the frequencies of a given set of serial episodes.
The task of counting the frequency of a finite set of given serial episodes arises in many real applications in different fields. For instance, in the securities market, the detection of securities fraud is a challenging task considering the massive amount of trading data produced every day. Insider trading, one category of deceptive practices, can be generalized as a serial pattern using a group of actions including offers and sales of securities. With a set of patterns/trends that are known to be fraudulent, automatic detection of fraudulent activities can be achieved as long as we focus on the deceptive patterns in the streaming trading sequence. Besides, in the field of bioinformatics, in order to analyze a gene set of interest, analyzing its frequency and distribution among the whole genome datasets can help find out in which tissues or cells the genes are co-expressed.
In a message-intensive system, millions of data items are generated within several minutes. Such data are referred to as streams. Formally, a streaming sequence is composed of several types of events, and its length grows dynamically as new events occur, often at a high rate. Conventional methods for counting the frequency of serial episodes are generally based on the idea of storing the entire dataset and then processing it through multiple passes. Hence, traditional algorithms are not applicable to streams, as it is impossible to store the entire unbounded data before processing. Any method for data streams must thus operate under the constraints of limited memory and time, which means that data streams must be processed faster than they are generated. To this end, in this paper, we propose an efficient one-pass solution to count the frequencies of a given set of serial episodes in a data stream without the need to load the whole sequence beforehand.
In addition, although there exist some efforts that mine the frequency of serial episodes from a long sequence, among which [6, 7] even work on streams in a one-pass manner, they suffer from a key limitation: the serial episodes mined are not associated with any time constraint. That is, they do not care whether the serial episodes fall into a limited time span (e.g., an hour, a day, etc.). For instance, a common scenario in telecommunication alarm sequence study is to learn the typical serial alarm episodes in order to discover the sequential association rules between alarms, so that we do not need to respond to each of them separately, because responding to the earliest alarm can automatically address the following ones caused by it. Obviously, alarms that form a sequential association rule should not exhibit too large a time span (e.g., an hour, etc.). As another example, we may be interested in a particular person’s daily mobility pattern (e.g., Office → Gym → Bar) to help quantify their daily movement. In both of these scenarios, we have to limit the time span of the serial episodes.
As the state-of-the-art single-pass serial episode mining algorithms, [6, 7, 8] employ automata to count the occurrences of each target episode. Unfortunately, the automata they employ cannot easily incorporate a time constraint. This is discussed in detail in Section 3. To address the problem, we propose a new model that avoids the problem of [6, 7, 8] in counting serial episodes satisfying a given time span within a streaming sequence. In summary, our contributions in this work are as follows.
We formally define the non-overlapped frequency counting problem of time-constrained episodes. To address the problem, we present a carefully designed data structure, namely OccMap, as well as a group of operations over it. An OccMap corresponds to a particular serial episode and stores the timestamps of valid signals
Based on OccMap, we propose two efficient algorithms, ONCE and ONCE+ (OccurreNce Count of serial Episode), to compute two popular frequencies of given time-constrained serial episodes in a dynamic event stream, over which only one pass is required. In particular, ONCE computes the non-overlapped frequency while ONCE+ works on the distinct frequency.
ONCE (ONCE+) does not require any other user-specified parameter except the time constraint . Our algorithm does not put any restriction over the streaming sequence, which can either arrive in batches (a group of sequentially ordered signals)  or single signals.
We theoretically prove that ONCE and ONCE+ algorithms can correctly count the target frequencies, respectively. Besides, processing an event in the stream only requires time, where is the length of the episode.
Empirical studies conducted over both real-world and synthetic datasets show that ONCE and ONCE+ can efficiently and correctly find the frequencies of the serial episodes and outperform the baseline method in both space and time cost.
The rest of this paper is organized as follows. In the next section, we briefly discuss related work in serial episode mining. In Section 3 we introduce the preliminary definitions and problem statement. Afterwards, we present the details of our solution to the problem in Section 4 and Section 5, with a theoretical study of the complexity and correctness. In Section 6, we conduct an empirical study over real-world and synthetic datasets. We show a practical application where the proposed algorithm is applied and discuss the corresponding observations in Section 7. Lastly, we conclude our work in Section 8.
2 Related Work
Several types of sequential patterns have been extensively studied so far, including frequent (closed) sequential pattern mining [10, 11, 12, 13, 14], serial episodes discovery [15, 6], periodic (ordered) pattern mining [16, 17]. Within these works, various frequency definitions of episodes have been proposed, which have given rise to different types of frequent episodes. Recently, Achar et al.  reviewed 7 different frequency definitions in the literature. Three of them, window-based frequency , head frequency , and total frequency , consider the number of windows containing at least one occurrence of an episode, where each window has the same specified width. The remaining definitions, minimal occurrence-based frequency , non-overlapped frequency , non-interleaved frequency  and distinct frequency , directly take into account the different occurrences of an episode in the sequence.
However, these efforts cannot be deployed to some real-world applications such as fraud trading detection or telecommunication alarm responses as they ignored the practical significance of time constraint in episodes. Notably, the author in  suggested that, by attaching to each automaton a time constraint, their method can address time-constrained serial episode mining problem. However, they actually failed to empirically test this suggested method in time-constrained problem. Unfortunately, as we will illustrate in detail in Section 3.2, this suggested method is unable to generate correct answer.
In the field of serial episode mining over sequential streams, related algorithm studies have become increasingly prevalent in recent years [9, 23, 24, 25]. Patnaik et al.  considered serial episode mining over dynamic data streams. The main contribution of their work is to define batches of events and apply their algorithms over each batch. However, the performance of their method depends heavily on the size of the batches over which the frequency is computed. A large batch leads to high response time, while a small one fails to count the frequency of long episodes. In particular, once each batch of data contains only one event, i.e., events arrive one after another, their algorithm no longer works. In addition, when a serial episode stretches over two consecutive batches, this occurrence of the episode is missed. Xiang et al.  presented the MESELO algorithm, which requires a complete view of the whole sequence. This strictly limits its application in streams where the number of events is potentially unlimited. For instance, if we want to learn the frequency of an episode in the past 48 hours, the window size in their method should be set to a large time span to store all records in the past hours, which consumes an enormous amount of memory. They also presented another work  that aims to mine serial episodes over precise-positioning sequences, where the elapsed time between any two consecutive events is a constant. SASE  has been proposed to record the appearance of a target serial episode within a stream. The proposed structure has to spend  time to process a single signal in the stream, while our algorithm takes only . Besides, none of these works takes into account the time span of the target serial episode. In contrast, we present in this paper a novel one-pass algorithm that works on streaming sequences without any requirement to store the whole sequence beforehand or any limitation on the batch size, while taking into account the time span of the episodes.
3 Problem Formulation
In this section, we first present a series of preliminary definitions. For ease of understanding, Table I summarizes the key notations used in this paper.
|temporal event happens at|
|occurrence of in|
|minimal occurrence of in|
|time-constrained serial episode|
|time constraint of|
|length of serial episode|
|timestamp list of -th layer in|
We first define streaming sequences, serial episodes  and non-overlapped frequency.
Definition 1 (Streaming sequence)
Streaming sequence is a long (potentially infinite) sequence of events (to avoid duplicate word usage, we shall use the words event and signal interchangeably in the rest of this paper). Let be finite alphabet set, be a sequential list of events, denoted by where and the pair () means event happens at timestamp . We denote by as the first event subsequence of . Let be the -th element of (i.e. ), and be the -th event and corresponding timestamp of , respectively.
For instance, a daily trajectory of a person can be denoted as where , , , , . In particular, if () is a set of events that happen simultaneously (i.e., ), the sequence is referred to as complex streaming sequence. Otherwise, if () is an individual event, it is a simple streaming sequence.
Definition 2 (Serial episode)
A serial episode is a set of totally ordered events, denoted by where appears before , if and only if . In particular, we denote by as the length of .
For instance, in the above sequence example where , and are both serial episodes; the length of (i.e., ) is 2 and that of (i.e., ) is 3, respectively.
Definition 3 (Occurrence)
Given a serial episode , the timestamp is defined as the occurrence of if happens at timestamp . We denote by as an occurrence of serial episode in .
Definition 4 (Minimal occurrence)
Given a serial episode , and its occurrence , namely . If there is no other occurrence of , say , such that and , then is called a minimal occurrence of in , denoted as .
Definition 5 (Time-constrained serial episode)
A serial episode with time constraint is denoted as where the occurrence of falls in a specified time period (e.g., daily/weekly/monthly), that is, (i.e., ). Usually, is used to represent a certain serial episode without time constraint.
Given the following sequences,
serial episode and time-constrained serial episode , we illustrate the occurrences of both and in all and .
According to Definition 3, we can easily obtain , , where is a minimal occurrence.
Similarly, according to Definition 3, it is easy to find that , , . Moreover, the occurrences of in all the above sequences are as follows, , , .
Note that the events constituting an occurrence of a serial episode are not required to be contiguous in the stream.
As reviewed in Section 2, a number of different frequency definitions have been proposed to capture how often an episode occurs in an event sequence. We observe that existing frequency definitions can be grouped into two categories: definitions incurring dependent occurrences (e.g., two occurrences of an episode may share common events) and definitions incurring independent occurrences. Due to space constraints, we focus this paper only on the type of frequency definitions incurring independent occurrences, which contains two frequency definitions: the non-overlapped frequency [6, 20] and the distinct frequency . We review the definitions of the two frequency measures as follows.
Definition 6 (Non-overlapped frequency)
In an event stream , two occurrences of (resp., ), i.e., and , are non-overlapped if either or . The non-overlapped frequency of (resp., ) in is denoted as (resp., ).
Definition 7 (Distinct frequency)
In an event stream , two occurrences of (resp., ), i.e., and , are distinct if they do not share any event, that is if , . The distinct frequency of (resp., ) in is denoted as (resp., ).
Given the following sequences,
time-constrained serial episode , we can easily obtain , , are occurrences of in S, is a minimal occurrence of in , obviously, is another minimal occurrence. However, they overlap with each other. Thus, is . On the other hand, and are distinct occurrences because they don’t have the same timestamp and . Thus, is 2.
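The two independence notions in Definitions 6 and 7 can be illustrated with a minimal Python sketch. It assumes a simple streaming sequence, so that a timestamp identifies an event, and it represents an occurrence as a tuple of increasing timestamps. The function names and the strict inequalities are our reading of the definitions, not notation taken from the paper.

```python
def non_overlapped(o1, o2):
    """True if one occurrence ends strictly before the other begins."""
    return o1[-1] < o2[0] or o2[-1] < o1[0]


def distinct(o1, o2):
    """True if the two occurrences share no timestamp, i.e., no event."""
    return not (set(o1) & set(o2))


# Hypothetical occurrences of a length-2 episode.
o1, o2, o3 = (1, 4), (3, 6), (5, 8)

print(non_overlapped(o1, o2), distinct(o1, o2))  # interleaved but disjoint
print(non_overlapped(o1, o3), distinct(o1, o3))  # fully separated
```

In this sketch, `o1` and `o2` interleave, so they are distinct but not non-overlapped, which mirrors the situation described in the example above.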
3.2 Problem definition
Given an event stream and the serial episode, whose frequency is to be extracted, we aim to identify the frequency of serial episodes with time constraint from the long stream.
Definition 8 (Time-constrained frequency counting problem)
Given events in a stream (which can be either a simple or a complex streaming sequence; our model works on both) that arrive one after another, and a time-constrained serial episode , the time-constrained frequency counting problem aims to evaluate whenever a new event arrives.
The work most related to the aforementioned problem is serial episode frequency mining in long sequences, among which the most representative is , an effective solution for mining frequent serial episodes from an arbitrarily long event sequence. This approach utilizes a group of automata, each of which corresponds to a particular candidate serial episode. The approach works by sequentially scanning every event within the target sequence . Each time an event is observed, every automaton that is waiting for this event (i.e., whose next state matches the event) is updated. Whenever an automaton reaches its end state, its corresponding count increases by 1 and the automaton is reset to the start state.
For instance, Figure 1 shows an example where the target sequence . Suppose we are interested in the frequency of the following serial episodes, . Given the four candidates, four finite state automata as are built (Figure 1(1)) where , the states of which sequentially correspond to the events in the episodes. Then it sequentially scans . Event is observed first, both and , whose next states are , change to next state (Figure 1(2)); the other two automata keep unchanged. Afterwards, is observed, this time , and all change to next state (Figure 1(3)); as reaches to the final state, hence its count increases and it is reset to the initial state again.
The aforementioned scheme  has been shown to be effective and efficient in mining the non-overlapped frequency of given serial episodes from a long sequence. However, it does not take into account the time constraint for each episode, hence it cannot be applied in our scenario described in Section 3. Can we adjust it with limited variation to address our problem? The answer is no. Although the authors in [6, 20] suggested that simply attaching the time constraint to each automaton can solve the problem of mining episodes with a given time constraint, they did not put it into practice, even when they mentioned the same method again several years later . In fact, the suggested method may lead to inaccurate results in the time-constrained case. The following example shows that simply adding a time constraint to each automaton, as suggested by [6, 20, 18], may inaccurately count the frequency of target serial episodes.
Suppose we are now adding a time constraint to each automaton in [6, 7] by introducing another column, namely start_time, to store the timestamp of first valid state change. The count of an automaton increases when not only the state changes to the final one but also the time span between final state and start_time is within .
In this way, only those instances satisfying the time constraint are counted. It seems to be a valid solution towards our problem. However, this solution may miss many valid occurrence counts, hence cannot satisfy Definition 8. For instance, if we follow the above adjusted solution, the count of will be (as will be activated by which finally fails to satisfy the time-constraint check). However, in fact there exists an instance of , namely . Similarly, the above solution can find minimum occurrence of (i.e., ). However, there exist two minimum occurrences of , namely and . Such problems will be much more complex and difficult to address by automata especially when contains many repeated events, e.g., .
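The failure mode described above can be reproduced with a small sketch of the adjusted automaton. This is a minimal illustration under our own assumptions: the function name, the reset-on-failure behaviour, the `<=` comparison against the time constraint `tau`, and the three-event stream are all hypothetical and are not taken from the cited works.

```python
def naive_automaton_count(stream, episode, tau):
    """Single automaton with a start_time column (assumed adjustment)."""
    state = 0          # index of the next episode signal we wait for
    start_time = None  # timestamp of the first valid state change
    count = 0
    for event, t in stream:
        if event != episode[state]:
            continue               # the automaton is not waiting for this event
        if state == 0:
            start_time = t         # remember when the automaton was activated
        state += 1
        if state == len(episode):  # final state reached
            if t - start_time <= tau:
                count += 1
            state, start_time = 0, None  # reset whether or not the check passed
    return count


# Hypothetical stream: the automaton is activated by A@1 and ignores A@5,
# so the valid occurrence (5, 6) is never considered.
stream = [("A", 1), ("A", 5), ("B", 6)]
print(naive_automaton_count(stream, ("A", "B"), tau=3))
```

Here the automaton commits to the earliest activating event, so the span it tests is 5, which exceeds `tau`, even though the occurrence formed by the later `A` has a span of 1.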
Therefore, it is not a trivial task to design a model that counts the non-overlapped frequency of serial episodes within a long streaming sequence while taking into account the time span of the serial episode. To address this challenge, we develop a novel approach in the next section.
4 ONCE Algorithm
In this section, we present in detail the algorithm for counting serial episodes in a streaming sequence under an arbitrary time constraint. Given a streaming event sequence and a target serial episode with an arbitrary time constraint, the ONCE algorithm generally works as follows. As each event in the stream passes by, we first need to find the latest minimum occurrence of the target serial episode, regardless of whether it satisfies the given time constraint. To achieve that, we present a carefully constructed data structure, namely OccMap, which stores the timestamps of events that constitute the target serial episode. Whenever all events have been found in OccMap, we validate the candidate minimum occurrence by testing whether it satisfies the time constraint. If the test succeeds, we increase the count by 1. Afterwards, the tested occurrence and the unused timestamps of events are removed from OccMap. As a result, ONCE can output the frequency of the target time-constrained serial episode whenever requested. Notably, although the following discussion focuses on counting the frequency of a single given time-constrained serial episode, ONCE can in fact simultaneously count the frequencies of a group of target time-constrained serial episodes, as the proposed structure and corresponding operations are bound to each target time-constrained serial episode independently. Moreover, ONCE is a one-pass algorithm and is therefore applicable to streaming sequences.
4.1 OccMap structure
First of all, we present the data structure, namely OccMap, to store the timestamps of events in target serial episode. It is further used to extract candidate minimum occurrence of the serial episode.
An OccMap for time-constrained serial episode , is defined as a group of hierarchical lists. In particular, given , the OccMap for contains lists which are organized hierarchically into layers. The layers correspond to all the signals of . Each layer is an individual list, which is used to record the timestamps of the corresponding signals in the stream. For instance, in Figure 2(15+) there is an OccMap that corresponds to . It consists of three lists that are hierarchically organized. Each list is associated with a single signal in the serial episode. Notably, if the same signal appears many times in a serial episode, i.e., in , we assign a list to each appearance independently.
4.2 Counting the frequency using OccMap
To facilitate the following discussion, we denote by as the OccMap for time-constrained serial episode .
In particular, we denote by , where and . is the timestamp of in stream ; refers to the signal corresponding to .
When processing the stream, OccMap has to perform the following operations: list update, occurrence validation and invalid entries elimination. In the following, we describe in detail how each of the operations is performed in OccMap. To illustrate each operation clearly, we shall use the following sequence as the running example.
Consider the following event stream :
List update. Given an OccMap that corresponds to , we perform list update by scanning every signal in the streaming sequence as it passes by.
At the very beginning, is initialized as layered empty lists. Besides the layered lists, we denote by as the most recent active layer number, which is initialized as . For each signal passing by, checks whether it matches any signal whose corresponding list is active. Suppose and , we append the timestamp of to the end of list . During the update, only the layers (i.e., ) where can be updated. In other words, a new layer can be updated (i.e., the corresponding list can be appended with a timestamp) only when the layers before it are not empty. This guarantees the following straightforward property of OccMap.
Property 1 (Minimum monotonicity) If an OccMap is updated according to list update strategy, it satisfies:
For example, given the target episode and sequence shown in Example 3, an OccMap is built, which contains three empty lists (), (), and (). Therefore, whenever a new event of arrives, only two types of events: and are accepted and stored in the lists. Notably, as we have mentioned above, if a signal appears many times within a target serial episode, we assign an independent list for each appearance, i.e., and . Suppose we start from the very beginning of , and each time a new event arrives and an old one leaves. At the very beginning, all the lists are initialized as empty. Besides, is set to . Now the first event of (i.e., ) comes, and OccMap finds that is a valid event that should be taken into account. Then it tests whether it matches any signal in the activated layers, which now contain only . Obviously, it perfectly matches the first layer, i.e., , and is of course activated, i.e., . As the test succeeds, we update by appending to the end of , which results in Figure 2(1). Moreover, as the first layer is no longer empty, the second layer now becomes activated, i.e., .
When the second event in arrives, we perform the same test as above. As and , both and (they are both activated) should be updated. Therefore, should be appended to the end of both and , which results in the state shown in Figure 2(2). Similarly, as the second layer is not empty then, the third layer now becomes activated (i.e., ).
Afterward, the third event, arrives, we append it to and update to 4. Now the last layer in OccMap is not empty, we have to perform another action, occurrence validation, which is described in the following part.
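The list update strategy can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the class layout, the method name `update`, and the episode `("A", "A", "B")` are hypothetical, chosen only so that one signal occupies two layers as in the walk-through above. The sketch assumes strictly increasing timestamps and keeps `active` as the number of layers that may receive timestamps (capped at the episode length, whereas the text lets the counter run one past the bottom layer).

```python
class OccMap:
    """Layered timestamp lists for one serial episode (illustrative sketch)."""

    def __init__(self, episode):
        self.episode = list(episode)
        self.layers = [[] for _ in self.episode]  # one list per episode position
        self.active = 1                           # only the first layer is active

    def update(self, event, t):
        """Append t to every active layer whose signal equals event.

        Returns True when the bottom layer received a timestamp, which is
        the trigger for occurrence validation.
        """
        k = len(self.episode)
        hit_bottom = False
        # Only layers that were active before this event may be updated.
        for i in range(self.active):
            if self.episode[i] == event:
                self.layers[i].append(t)
                hit_bottom = hit_bottom or i == k - 1
        # The next layer becomes active once the current last active layer
        # holds at least one timestamp.
        if self.active < k and self.layers[self.active - 1]:
            self.active += 1
        return hit_bottom


occ_map = OccMap(("A", "A", "B"))
for event, t in [("A", 1), ("A", 2), ("B", 3)]:
    print(occ_map.update(event, t), occ_map.layers, occ_map.active)
```

In this hypothetical run the second `A` is appended to both the first and the second layer, since both are active when it arrives, while the first `A` only reaches the first layer.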
Occurrence validation. Whenever the bottom layer (i.e., ) is updated (i.e., appended by an arbitrary timestamp), it indicates that there exist some groups of timestamps in each layered list that construct a candidate minimum occurrence for the target serial episode . Here, candidate occurrence means that it is the minimum occurrence of the general serial episode without taking into account the time constraint. In fact, according to the list update strategy, there is at least one candidate occurrence for the target serial episode once the last layer is updated.
Given that an OccMap , where is updated according to list update strategy, once (i.e., the last layered list is not empty), there are entries in , where no two entries belong to the same layered lists, that constitute an occurrence of .
As the last list is not empty, . According to Property 1, . If we select the entries corresponding to , , …, from , respectively, the entries obviously constitute an occurrence of as they satisfy total order.
In fact, there may exist many groups of entries that can constitute an occurrence of . Recall that we are performing occurrence validation once the last layered list is not empty (i.e., appended by a timestamp). In other words, only contains one entry, namely , when occurrence validation is performed. That is, all the occurrences share the same end timestamp, , at most one can affect the frequency according to Definition 6. Therefore, we have to find the optimal occurrence, which is most probable to satisfy the time constraint, to validate. Intuitively, the optimal occurrence of should have the minimum time span (i.e., ), which is in fact the minimum occurrence shown in Definition 4. As all the occurrences of share the same , the minimum occurrence that is most probable to satisfy should have the latest . Therefore, to find the , we have to find the latest in .
In order to find the (i.e., ), we traverse the OccMap in a bottom-up way. In particular, given the end stamp , we greedily find from a latest entry that appears before , that is subject to . Let be the selected entry in , then we further greedily select from a latest entry that appears before , say . Afterwards, we iteratively perform the same selections in the upper layers, until . In the end, we can obtain , which are greedily selected from , respectively. constitute an occurrence of . Obviously, is the latest (any other in cannot be , as it is definitely greater than according to our greedy selection strategy).
So far, we have the minimum occurrence that is most likely to satisfy the time constraint . Hence, we check whether its time span satisfies the time constraint by testing the inequality . If the test succeeds, we increase the frequency of the target time-constrained serial episode by . Notably, if the test fails, any other occurrence will also fail, as it has a smaller (i.e., a larger ).
For instance, given and sequence in Example 3, we have updated as and passed by. When arrives, we have appended to the end of . Once the bottom layer is not empty, we perform occurrence validation as described before. In particular, we greedily find the as shown in Figure 2. Afterwards, we test whether its time span (i.e., ) satisfies (i.e., ). As the test succeeds (the corresponding entries in are circled and marked in green in Figure 2), we increase the frequency of by .
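The bottom-up greedy selection can be sketched with a binary search per layer. This is an illustrative sketch under assumptions: each layer is a sorted Python list of timestamps, "appears before" is read as strictly smaller, and the time constraint test is read as end minus start being at most `tau`. The function names are hypothetical.

```python
from bisect import bisect_left


def latest_minimal_occurrence(layers):
    """Greedily pick, from the bottom layer upwards, the latest timestamp
    that is strictly earlier than the one chosen in the layer below."""
    occurrence = [layers[-1][-1]]  # end timestamp in the bottom layer
    for layer in reversed(layers[:-1]):
        idx = bisect_left(layer, occurrence[-1]) - 1  # last entry < previous pick
        if idx < 0:
            return None  # no earlier entry; should not happen after list update
        occurrence.append(layer[idx])
    occurrence.reverse()
    return occurrence


def satisfies(occurrence, tau):
    """Assumed form of the time constraint test: span no larger than tau."""
    return occurrence[-1] - occurrence[0] <= tau


layers = [[1, 7], [2, 8], [9]]  # hypothetical OccMap contents
occ = latest_minimal_occurrence(layers)
print(occ, satisfies(occ, tau=3))
```

Because every list is appended in timestamp order, a binary search is enough to locate the latest earlier entry, which keeps the cost per layer logarithmic in the list size for this sketch.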
Invalid entries elimination. Once occurrence validation is performed, we need to immediately eliminate invalid entries from OccMap. The elimination process varies depending on whether the time constraint test in occurrence validation succeeds or not.
If the minimum occurrence, say , found from occurrence validation is validated as satisfying the time constraint, i.e., , all the other entries left in are then useless. The reason is that any other occurrence that consists of any of these entries, which are smaller than , definitely overlaps with , which violates Definition 6. Therefore, once the occurrence validation succeeds, all the other entries in are no longer valid and are immediately removed from . Besides, the active layer is now reset to .
As shown in Figure 2(3) and (3+), except for the occurrence of the episode tested (i.e., circled green entries in Figure 2(3)), all the other entries (i.e., underlined blue entry in Figure 2(3)) should be eliminated, which results in Figure 2(3+). The same operations can be found from Figure 2(8) and Figure 2(8+), as well as Figure 2(15) and Figure 2(15+). All the lists now become empty, thus .
Otherwise, when the minimum occurrence fails to satisfy time constraint , we also need to find the invalid entries to eliminate from . In this case, however, the invalid entries are no longer all the remaining ones in . Although the entries that constitute the minimum occurrence fail to pass the time constraint test, the other entries may still be used to constitute another minimum occurrence. In order to show which remaining entries are useless for constituting other minimum occurrences, we present the following theorem.
Given that () is updated according to list update strategy and is found and validated by occurrence validation process, if which consists of () fails to satisfy the given time constraint (i.e., ), no other minimum occurrences with , where and such that , can satisfy the time constraint either.
Suppose that consists of , where and such that , satisfy the time constraint , then
As we are iteratively selecting the largest entry in subject to that according to the bottom-up minimum occurrence finding strategy presented above, we can find and . If , it is straightforward to know that , thus . Similarly, we can prove that , .
Therefore, , which means as . As , it is easy to know , which contradicts with Equation 1. Hence, cannot satisfy the time constraint .
Suppose , which consists of (), fails to satisfy time constraint . According to the above theorem, any entry in list cannot be used to generate an occurrence that passes the time constraint test. Therefore, if the minimum occurrence fails to satisfy time constraint , we need to eliminate the entries in each layered list for all . Moreover, we need to eliminate some other entries in to guarantee that , as the entries not satisfying this property are also useless for minimum occurrence extraction. Besides, the active layer is updated accordingly after the elimination.
As shown in Figure 2(13) and Figure 2(13+), the occurrence of the episode (i.e., circled red entries in Figure 2(13)), namely , fails to pass the time constraint test as . Then, in each , we eliminate all the entries . In , we eliminate and any other entries before that. Similarly, in and , we eliminate and , respectively. The other entries, i.e., in , is left in , the rest lists are all empty again. Therefore, we reset .
With the help of all these operations above, an OccMap has the following interesting features.
An OccMap corresponds to exactly one time-constrained serial episode.
The last/bottom layered list contains no more than one entry; there is at most one occurrence for the target time-constrained serial episode.
The entries in all the layered lists satisfy: . On the other hand, as timestamps appended into the same list strictly follow time sequence, . Therefore, the above property can be rewritten as .
Notably, to save memory, after each list update operation, we additionally check the inserted timestamp, say , against the first entry in . If , then cannot be used to generate any minimum occurrence that satisfies time constraint ; in this case, is also eliminated during list update. In this way, the size of each list is in fact upper bounded by if the events in the stream arrive at a constant speed .
4.3 The complete ONCE algorithm
Figure 2 shows the complete one-pass process of mining the non-overlapped frequency for time-constrained serial episode within shown in Example 3. Notably, all the other events that do not appear in (i.e., ) are ignored in the process. Each successful time constraint test over the minimum occurrences is marked in green, and each unsuccessful one is marked in red. It is easy to see from the figure that the non-overlapped frequency for in is , i.e., the number of successful time constraint tests. The detailed algorithms are shown in Algorithms 1, 2 and 3.
In Algorithm 1, given the input of streaming sequence and target time-constrained serial episode , the ONCE algorithm first initializes an OccMap for (Line 1). As each event in the stream arrives, ONCE first performs list update (i.e., Algorithm 2) based on the event (Lines 2-3). When the last layer in OccMap is not empty, we perform occurrence validation to find the minimum occurrence and test whether it satisfies . Invalid entry elimination is performed immediately after that (Lines 4-5 and Algorithm 3). If the time constraint test succeeds, the frequency of is increased by (Lines 6-7). Finally, the frequency is returned (Line 11).
Algorithm 2 works as follows. Given the OccMap to be updated and event , we check every activated list that awaits an update. If matches the event corresponding to , we append to the end of (Lines 2-3). Afterwards, we perform a local check in in order to eliminate out-of-date entries (i.e., old entries that cannot constitute a minimum occurrence that satisfies ) from (Lines 4-6). Finally, we update the active layer to the next empty list (Line 9). The time complexity of Algorithm 2 is where is the length of .
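The list update step can be sketched in Python as follows. This is a minimal illustration rather than the paper's Algorithm 2: the names `list_update`, `occ_map`, `episode` and `tau` are hypothetical, the OccMap is modelled as a plain list of sorted timestamp lists, the active layer is derived as the first empty list instead of being stored, and the local check is assumed to prune the list that was just updated.

```python
from bisect import bisect_left
from typing import List, Sequence


def list_update(occ_map: List[List[int]], episode: Sequence[str],
                event: str, t: int, tau: int) -> None:
    """Append timestamp t to each activated layer whose episode event matches."""
    # The active layer is the first empty list; deeper layers stay locked.
    # It is computed before appending so one event cannot fill two new layers.
    active = next((i for i, layer in enumerate(occ_map) if not layer),
                  len(occ_map) - 1)
    for i in range(active + 1):
        if episode[i] != event:
            continue
        layer = occ_map[i]
        layer.append(t)
        # Local check (assumed form): drop entries more than tau older than t.
        del layer[:bisect_left(layer, t - tau)]


occ_map = [[], [], []]
for event, t in [("A", 1), ("C", 2), ("B", 3), ("A", 4), ("C", 5)]:
    list_update(occ_map, ("A", "B", "C"), event, t, tau=10)
print(occ_map)  # [[1, 4], [3], [5]]
```

In this run the event C at time 2 is ignored because the layer for C is not yet activated, whereas the C at time 5 is stored once the layers above it are non-empty.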
In Algorithm 3, we first extract the minimum occurrence from (Lines 1-5) and then eliminate invalid entries (Lines 6-19). To extract the minimum occurrence, we first set the right bound of the occurrence interval as , which is the only entry in (Line 1). Afterwards, we iteratively find from each upper layer as the latest timestamp that appears before . This process continues until (Lines 2-5). Therefore, constitute a minimum occurrence for . The time complexity of this process is . (Footnote 4: In the implementation, as each is always located at the end of , we instead use linear search from the end of in Line 3, whose average time is better than that of binary search in practice. A similar strategy also applies to Line 15.) The length of each list in is in fact upper bounded by if the events in the stream arrive at a constant speed . According to Lines 4 and 5 of Algorithm 2, we know that . Considering the fact that is a positive number, as a result, . Actually, considering the ListUpdate process, we have . With the two inequalities above, we can conclude that , which proves that the upper bound of in is . The time complexity is in fact if is fixed. Afterwards, we test whether the extracted minimum occurrence satisfies the time constraint. If the test succeeds, we eliminate all entries from the OccMap and return (Lines 6-9). Otherwise, we eliminate all entries in each list subject to (Lines 11-13). Besides, we also need to make sure by eliminating some other entries, as those entries are also useless in constituting further minimum occurrences (Lines 14-16). Finally, we update the active layer as the first empty layered list (Lines 17-18). It is easy to see that the complexity of elimination is also , as searching a list for an entry (Lines 3 and 15) takes using binary search. Therefore, the complexity of Algorithm 3 is .
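The validation and elimination steps described above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the paper's Algorithm 3: `validate_and_eliminate`, `occ_map` and `tau` are hypothetical names, the time constraint is assumed to mean that the span between the first and last timestamps is at most `tau`, and the head-ordering property of the OccMap is assumed to hold on entry.

```python
from bisect import bisect_left, bisect_right
from typing import List


def validate_and_eliminate(occ_map: List[List[int]], tau: int) -> bool:
    """Validate the minimum occurrence ending at the last layer, then prune."""
    k = len(occ_map)
    picks = [0] * k
    picks[-1] = occ_map[-1][-1]  # right bound: the entry in the last layer
    # Bottom-up search: latest entry of layer i strictly before picks[i + 1].
    for i in range(k - 2, -1, -1):
        pos = bisect_left(occ_map[i], picks[i + 1]) - 1
        if pos < 0:
            raise ValueError("head ordering of the OccMap is violated")
        picks[i] = occ_map[i][pos]
    if picks[-1] - picks[0] <= tau:  # assumed form of the time constraint
        for layer in occ_map:
            layer.clear()
        return True
    # Failed test: drop the chosen entry and everything before it in each layer.
    for layer, t in zip(occ_map, picks):
        del layer[:bisect_right(layer, t)]
    # Restore head ordering: each head must be later than the head above it.
    for i in range(1, k):
        if occ_map[i - 1]:
            del occ_map[i][:bisect_right(occ_map[i], occ_map[i - 1][0])]
        else:
            occ_map[i].clear()
    return False


failed = [[1, 6], [2, 5], [3], [9]]
print(validate_and_eliminate(failed, tau=5), failed)  # False [[6], [], [], []]
passed = [[1, 2], [3], [4]]
print(validate_and_eliminate(passed, tau=2), passed)  # True [[], [], []]
```

In the failing case the minimum occurrence is (1, 2, 3, 9), whose span of 8 exceeds 5. Entry 5 in the second layer survives the first elimination pass but is removed by the head-ordering pass, since no remaining entry of the first layer precedes it. The sketch uses binary search throughout; the linear scan from the tail mentioned in the footnote is an implementation alternative.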
In all, the time complexity for ONCE algorithm in Algorithm 1 to process a single event is . Taking into account the number of events in the stream , the time complexity of processing events is all together .
4.4 Correctness of ONCE algorithm
In this part, we discuss the correctness of ONCE algorithm. In particular, we need to show that ONCE algorithm can correctly answer the problem in Definition 8, namely counting the non-overlapped frequency of given time-constrained serial episode as the event in streaming sequence passes by. To this end, we present a pair of lemmas below, based on which we can finally prove the correctness of ONCE.
Suppose the frequency of in returned by ONCE is denoted by and the ground truth is denoted by , then .
Given in Appendix A.
Suppose the frequency of in returned by ONCE is denoted by and the ground truth is denoted by , then .
Given in Appendix B.
5 ONCE+ Algorithm
In the previous section we presented the ONCE algorithm, which computes the non-overlapped frequency of a target time-constrained serial episode. In fact, it can also be adapted to compute the distinct frequency with a series of modifications. In this part, we propose the modified version, namely ONCE+, to compute the distinct frequency of time-constrained serial episodes.
ONCE+ also utilizes OccMap to store the timestamps of events in the target serial episode; the only difference lies in the Validate&Eliminate algorithm. In order to distinguish ONCE and ONCE+, we name the modified algorithm Validate&Eliminate+. The detailed algorithm is shown in Algorithm 4.
In Algorithm 4, we need to find the occurrence which meets the time constraint from . Firstly, we compare all the entries of with . If we fail to find the minimum entry which meets , we eliminate all entries from the OccMap and return (Lines 26-29). If we find it, we set the first layer as . Afterwards, we iteratively find from each upper layer as the latest timestamp that appears before . This process continues until (Lines 1-9). If we fail to find () in any layer, we also have to eliminate all entries from the OccMap and return (Lines 21-25). Otherwise, constitute a distinct occurrence for . Then, we eliminate all the entries from OccMap which are no later than (Lines 10-20). Finally, we update the active layer as the first empty layered list.
Notably, in ONCE we find from bottom to top of the OccMap, and then test whether the extracted minimum occurrence satisfies the time constraint. In ONCE+, however, we first need to find the occurrence from which meets the time constraint, and then find from top to bottom of the OccMap. The only difference between ONCE and ONCE+ is thus that the OccMap is searched in the opposite direction, so the complexity of Algorithm 4 is the same as that of Algorithm 3.
To illustrate the characteristics of the ONCE+ algorithm, we use another example to show how ONCE+ works. In Figure 3, there is an OccMap that corresponds to .
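The top-down search of ONCE+ can be sketched in Python as follows. This is a hedged illustration, not the paper's Algorithm 4: `validate_and_eliminate_plus`, `occ_map` and `tau` are hypothetical names, the time constraint is assumed to bound the span by `tau`, each lower layer is assumed to contribute its earliest entry strictly after the timestamp chosen in the layer above, and elimination is assumed to remove, per layer, every entry no later than the chosen one.

```python
from bisect import bisect_left, bisect_right
from typing import List


def validate_and_eliminate_plus(occ_map: List[List[int]], tau: int) -> int:
    """Return 1 if a distinct occurrence within tau is found, otherwise 0."""
    k = len(occ_map)
    t_end = occ_map[-1][-1]  # timestamp that triggered the validation
    picks: List[int] = []
    # First layer: earliest entry still within tau of t_end.
    pos = bisect_left(occ_map[0], t_end - tau)
    if pos < len(occ_map[0]):
        picks.append(occ_map[0][pos])
        # Top-down: earliest entry strictly after the previous pick (assumed).
        for i in range(1, k):
            nxt = bisect_right(occ_map[i], picks[-1])
            if nxt == len(occ_map[i]):
                break
            picks.append(occ_map[i][nxt])
    if len(picks) < k:
        # No occurrence can be completed: clear the whole OccMap.
        for layer in occ_map:
            layer.clear()
        return 0
    # Success: remove the used entries and everything before them.
    for layer, t in zip(occ_map, picks):
        del layer[:bisect_right(layer, t)]
    return 1


found = [[1, 2], [3, 4], [5]]
print(validate_and_eliminate_plus(found, tau=4), found)  # 1 [[2], [4], []]
missed = [[1], [2], [9]]
print(validate_and_eliminate_plus(missed, tau=3), missed)  # 0 [[], [], []]
```

In the first case the entries 2 and 4 remain available for a later occurrence, which is what allows distinct occurrences to interleave; under the non-overlapped counting of ONCE a successful test would clear them.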
However, if we use ONCE to count the non-overlapped frequency of or , the only non-overlapped minimal occurrence is , that is, and .
The correctness of ONCE+ is easy to justify following the same way as Section 4.4.
6 Experimental Study
In this part, we conduct an experimental study over both synthetic and real-world data. Through the experimental results, we justify that ONCE (resp., ONCE+) can efficiently answer the non-overlapped (resp., distinct) frequency counting problem for time-constrained serial episodes in one pass. The real-world data is the streaming sequence of telecommunication alarms within 4 cities in Guizhou Province of China during 2014. Besides, we also generate a synthetic dataset by uniformly randomly sampling a particular event from at each timestamp, one step after another. The statistics of both datasets are shown in Table II. All the experiments are run on a workstation with a Xeon E5-2603v3 1.6GHz CPU and 16GB RAM running Ubuntu 12.04 LTS. We compare ONCE and ONCE+ with other baselines, namely SASE+ and SASE++. All the parameters in SASE+ and SASE++ are optimized according to their suggested settings.
Notably, in the following experiments, we report the average throughput, which is defined as the number of signals processed by an algorithm per second. In line with , we report how the throughput can be affected by different factors. Through all these experiments, we are excited to find that ONCE takes less than for each , especially for the real-world dataset. That is, our model can work in event-intensive streams even if millions of events arrive in a single second. (Footnote 5: The code and dataset will be released once this work is published.)
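The throughput metric defined above (signals processed per second) can be measured with a small harness such as the following Python sketch. The names `average_throughput`, `stream` and `handle` are hypothetical, and the per-event callback stands in for whichever algorithm is being timed; this is not the measurement code used in the experiments.

```python
import time
from typing import Callable, Iterable, List, Tuple


def average_throughput(stream: Iterable[Tuple[str, int]],
                       handle: Callable[[str, int], None]) -> float:
    """Return the number of events processed per second of wall-clock time."""
    count = 0
    start = time.perf_counter()
    for event, t in stream:
        handle(event, t)  # per-event work of the algorithm under test
        count += 1
    elapsed = time.perf_counter() - start
    # Guard against a zero reading on very short runs.
    return count / elapsed if elapsed > 0 else float("inf")


seen: List[int] = []
events = ((("A", "B")[i % 2], i) for i in range(10000))
rate = average_throughput(events, lambda event, t: seen.append(t))
print(f"{rate:.0f} events per second")
```

Averaging the returned rate over several target episodes matches the way the results in this section are reported.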
|Telecom. alarms||8,821,220||252||2014-05-01 0h to 2014-05-31 24h|
Selectivity . It is defined as , which is controlled by changing the target episodes in stream . (Footnote 6: Once a timestamp is inserted into OccMap, we identify it as a Match.) Similar to , it is varied from , up to 1.6, which is a very heavy workload to test our algorithms. We simulate the stream by sequentially inputting a new signal after some time interval. In particular, for the real-world data, each signal in the experiment arrives with exactly its original time interval; for the synthetic dataset, each signal arrives at a constant speed every 1. Firstly, we test the throughput of processing all signals in , and report the average over all 10 episodes. Figure 5 shows the throughput on the real-world data and the synthetic dataset while varying . We see that the throughput of SASE+ drops very fast as increases, and that of SASE++ is worse than that of ONCE and ONCE+. The throughput of ONCE and ONCE+ is similar. SASE++, ONCE and ONCE+ are not sensitive to the selectivity. The throughput of ONCE and ONCE+ is nearly an order of magnitude better than that of SASE++.
Effect of . Secondly, we test the throughput of ONCE and ONCE+ by varying the time constraint for the given serial episodes. In particular, at each we randomly select 10 different episodes with length . We test the throughput of Algorithm 1 processing all signals in , and report the average over all 10 episodes. Notably, as in the synthetic data each signal is associated with a discrete step, is defined as the maximal number of steps an episode should cover. In the real-world dataset, we vary the time constraint from to hours. The results are shown in Figure 6. Notably, the response time for all the cases remains almost constant. This phenomenon seems different from our time complexity study. The reason is as follows. As is increased, the probability of performing Lines 7-9 in Algorithm 3, whose complexity is , becomes much larger than that of Lines 11-18, whose complexity is . Therefore, as increases, the curve tends to be more constant (as Lines 7-9 contribute more to the response time) than sublinear.
Implicit factors. During the experimental study, we find that besides and , the response time also varies for different even if they share the same length and . The reason is that, according to Algorithm 1, each time a new event arrives, Lines 4-9 in Algorithm 1, which are the most time-consuming, may not always be performed. Intuitively, this part is performed each time is updated. Therefore, the frequency of implicitly affects the eventual response time. Hence, we conduct another experiment to test the effect of frequency by fixing both and at particular levels. The results are shown in Figure 7. We randomly select 10 episodes at each frequency level (i.e., ) and report the average time for processing . We repeat the same setting for episodes with lengths , respectively. As the maximum frequency for episodes with length is less than 1500, it does not appear when the frequency is 1500 or 2000. Notably, the frequencies of episodes selected at each level (e.g., ) may vary a bit (e.g., and etc.). Figure 7 only reports the results on the synthetic data, as we cannot find enough episodes at each frequency level in the real-world data. The response time increases along with the frequency level, which agrees with our analysis above.
Scalability and memory consumption. We now test the scalability of the model by varying the length of the input sequence from to . The response time is shown in Figure 7(b). It increases almost linearly, which is consistent with our analysis at the end of Section 4.3. Notably, the average response time for processing a single signal is less than . That is, ONCE and ONCE+ can work on signal-intensive streams where millions of events happen in a second. We also demonstrate how the memory usage of the core structure in the ONCE algorithm scales with the size of target episodes. As described in operation, each is initialized as layered empty lists. As massive data flow in, the corresponding signals are stored in this structure. To evaluate the memory consumption of the proposed algorithm during this dynamic process, the maximum cost (the structure reaches its largest size at the occurrence validation step) is measured while is steadily increased from 3 to 11. Notably, at each particular length level (3, 5, 7, 9, 11), we randomly select 10 target episodes, count their occurrences and evaluate the corresponding memory consumption. As is evident in Table III, the memory consumption grows with because the number of lists in