# Data Smashing 2.0: Sequence Likelihood (SL) Divergence For Fast Time Series Comparison

###### Abstract

Recognizing subtle historical patterns is central to modeling and forecasting problems in time series analysis. Here we introduce and develop a new approach to quantify deviations in the underlying hidden generators of observed data streams, resulting in a new efficiently computable universal metric for time series. The proposed metric is universal in the sense that we can compare and contrast data streams regardless of where and how they are generated, and without any feature engineering step. The approach proposed in this paper is conceptually distinct from our previous work on data smashing [chattopadhyay2014data], and vastly improves discrimination performance and computing speed. The core idea here is the generalization of the notion of KL divergence often used to compare probability distributions to a notion of divergence in time series. We call this generalization the sequence likelihood (SL) divergence and show that it can be used to measure deviations within a well-defined class of discrete-valued stochastic processes. We devise efficient estimators of SL divergence from finite sample paths, and subsequently formulate a universal metric useful for computing distance between time series produced by hidden stochastic generators. We illustrate the superior performance of the new smash2.0 metric with synthetic data against the original data smashing algorithm and dynamic time warping (DTW) [petitjean2011global]. Pattern disambiguation in two distinct applications involving electroencephalogram data and gait recognition is also illustrated. We are hopeful that the smash2.0 metric introduced here will become an important component of the standard toolbox used in classification, clustering and inference problems in time series analysis.

## 1 Introduction

Efficiently learning stochastic processes is a key challenge in analyzing time-dependency in domains where randomness cannot be ignored. For such learning to occur, we need to define either a distance metric or, more generally, a measure of similarity to compare and contrast time series. Examples of such similarity measures from the literature include the classical distances and distances with dimensionality reduction [lin2003symbolic], the short time series distance (STS) [moller2003fuzzy], which takes into account irregularity in sampling rates, the edit-based distances [navarro2001guided] with generalizations to continuous sequences [chen2005robust], and dynamic time warping (DTW) [petitjean2011global], which is used extensively in the speech recognition community. However, these similarity measures all have one or both of the following limitations. First, dimensionality reduction and feature selection rely heavily on domain knowledge and inevitably incur a trade-off between precision and computability. Most importantly, they necessitate the attention of human experts and data scientists. Second, when dealing with data from non-trivial stochastic process dynamics, state-of-the-art techniques might fail to correctly estimate the similarity, or lack thereof, between exemplars. For example, consider two sequences recording tosses of a fair coin, with one symbol representing a head and another a tail. The two sequences are extremely unlikely to share any similarity at face value, and they have a large pointwise distance, but they are generated by the same process. A good measure of similarity should strive to disambiguate the underlying processes. The Smash2.0 metric introduced here addresses both these limitations.

When presented with finite sample paths, the Smash2.0 algorithm is specifically designed to estimate a distance between the generating models of the time series samples. The intuition for the Smash2.0 metric follows from a basic result in information theory: if we know the true distribution $p$ of the random variable, we could construct a code with average description length $H(p)$, where $H(p)$ is the entropy of the distribution. If, instead, we used the code for a distribution $q$, we would need $H(p) + D_{\mathrm{KL}}(p \,\|\, q)$ bits on average to describe the random variable. Thus, deviation in the distributions shows up as KL divergence. If we can generalize the notion of KL divergence to processes, then it might be possible to quantify deviations in process dynamics via an increase in the entropy rate by the corresponding divergence.
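The coding argument above can be checked numerically. The following is a minimal sketch, assuming base-2 logarithms and two hypothetical distributions `p` and `q` chosen only for illustration; the function names are not from the paper.

```python
import numpy as np

def entropy(p):
    """Shannon entropy H(p) in bits."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return float(-np.sum(p[nz] * np.log2(p[nz])))

def kl_divergence(p, q):
    """KL divergence D(p || q) in bits; assumes q > 0 wherever p > 0."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    nz = p > 0
    return float(np.sum(p[nz] * np.log2(p[nz] / q[nz])))

def expected_code_length(p, q):
    """Average length, under p, of an ideal code designed for q (cross-entropy)."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    nz = p > 0
    return float(-np.sum(p[nz] * np.log2(q[nz])))

# Hypothetical distributions, chosen only for illustration.
p = [0.5, 0.25, 0.25]   # true distribution
q = [0.25, 0.25, 0.5]   # mismatched distribution used to build the code

h = entropy(p)                      # bits needed with the correct code
d = kl_divergence(p, q)             # penalty for using the wrong code
cost = expected_code_length(p, q)   # equals h + d
```

Here the mismatched code costs `h + d` bits per symbol on average, which is the identity the process-level generalization aims to mirror.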

Our ultimate goal is to design an algorithm that operates on a pair of data streams taking values in a finite alphabet. Nevertheless, to establish the correctness of our algorithm, we need to decide on a specific scheme for representing stochastic processes taking values in the alphabet. We further assume that our processes are ergodic and stationary. The specific paradigm for modeling stochastic processes that we use in this paper is the Probabilistic Finite-State Automaton, or PFSA for short, which has been studied in [crutchfield1994calculi, dupont2005links, chattopadhyay2014data, chattopadhyay2014causality]. PFSA can model discrete-valued stochastic processes that are not Markov of any finite order [ching2006markov]. It is also shown in [dupont2005links] that PFSA can approximate any hidden Markov model (HMM) with arbitrary accuracy. Moreover, many key statistical quantities of the processes generated by PFSA, such as entropy rate [cover2012elements] and KL divergence [matthews2016sparse], have closed-form formulae. Here we want to point out the resemblance of the PFSA model to the variational autoencoder (VAE) [rezende2014stochastic, kingma2013auto] framework. The inference of a PFSA from the input can be thought of as the training of the encoder in a VAE, and the performance of both the VAE and the PFSA model is evaluated by the log-likelihood of the input being generated by the inferred models.

The work that inspired the development of Smash2.0 is the data smashing algorithm (Smash) proposed in [chattopadhyay2014data]. Smash is also based on PFSA modeling and is designed to represent the similarity between the generating models rather than between sample paths. However, while both Smash and Smash2.0 have the advantage of not requiring dimensionality reduction or domain knowledge for feature extraction, Smash2.0 is much more computationally efficient than Smash.

The remainder of the paper is organized as follows. In Sec. 2, we introduce basic concepts of stochastic processes and establish the correspondence between processes and labeled directed graphs via the core concept of causal state. The definition and basic properties of PFSA are introduced at the end of Sec. 2.2. In Sec. 3, we answer the question of when a stochastic process has a PFSA generator. An inference algorithm for PFSA, GenESeSS, is given in Sec. 4. In Sec. 5, we introduce the notion of irreducibility of PFSA and the closed-form formulae for entropy rate and KL divergence of the processes generated by irreducible PFSA. We conclude that section with log-likelihood convergence. In Sec. 6, we introduce the definition of Smash2.0 together with quantization of continuous sequences. The comparison of Smash2.0 to Smash and fastDTW is given in Sec. 6.3. In Sec. 7, we apply Smash2.0 to two real-world problems.

## 2 Foundation

### 2.1 Stochastic Processes and Causal States

In this paper we study the generative model for stationary ergodic stochastic processes [peebles2001probability, crutchfield1994calculi] over a finite alphabet. Specifically, we consider a set of random variables taking values in the alphabet, indexed by positive integers representing time steps. By stationary, we mean strictly stationary, i.e., the finite-dimensional distributions [doob1990stochastic] are invariant in time. By ergodic, we mean that all finite-dimensional distributions can be approximated with arbitrary accuracy from a long enough realization. We are especially interested in processes in which these random variables are *not* independent.

We denote the alphabet by and use lower-case Greek letters (e.g. or ) for symbols in . We use lower-case Latin letters (e.g. or ) to denote sequences of symbols, for example , with the empty sequence denoted by . The length of a sequence is denoted by . The set of sequences of length is denoted by , and the collection of sequences of finite length is denoted by , i.e., . We use to denote the set of infinitely long sequences, and to denote the collection of infinite sequences with as prefix. We note that, since all sequences can be viewed as prefixed by , we have .

We note that is a semiring over . Let denote the probability of the process producing a realization with for , it is straightforward to verify that defined by

(1) |

is a premeasure on . By the Carathéodory extension theorem, the $\sigma$-finite premeasure can be extended uniquely to a measure over , where is the $\sigma$-field generated by . Denoting the measure also by , we see that every stochastic process induces a probability space over . In light of Eq. (1) and also for notational brevity, we denote by when no confusion arises. We note that . We refer the reader to Chap. 1 of [klenke2013probability] for a more formal introduction to the measure theory used here.

Taking one step further, and denoting the collection of all measures over by , we see that we can get a family of measures in from a process in addition to .

[Observation Induced Measures] For an observed sequence with , the measure is the extension to of the premeasure defined on the semiring given by

for all .

Now we introduce the concept of Probabilistic Nerode Equivalence, which was first introduced in [chattopadhyay2008structural].
[Probabilistic Nerode Equivalence]
For any pair of sequences , is equivalent to , written as , if and only if either , or .
One can verify that the relation defined above is indeed an equivalence relation which is also *right-invariant* in the sense that , for all . We denote the equivalence class of sequence by . We note that is well-defined because for . An equivalence class is also called a causal state [chattopadhyay2014data] since the distribution of future events preceded by possibly distinct are both determined by . We denote by when no confusion arises. We note that .

Since the equivalence class plays no role in our future discussion, we ignore it as a causal state from this point on.

[Derivatives] For any , the -th order derivative of an equivalence class , written as , is defined to be the marginal distribution of on , with the entry indexed by denoted by . The first-order derivative is also called the symbolic derivative in [chattopadhyay2014data] since , and is denoted by for short. The derivative of a sequence is that of its equivalence class, i.e., . We note that is the marginal distribution of on , and is denoted by for short.

### 2.2 From Causal States to Probabilistic Automaton

From now on, we denote the set of causal states of a process by when no confusion arises. We start this section by showing that there is a labeled directed graph [bondy2008graph] associated with any stochastic process.

For any and such that , by right-invariance of probabilistic Nerode equivalence, there exists a , such that for all . Whenever the scenario described happens, we can put a directed edge from to and label it by and , and by doing this for all and , we get a (possibly infinite) labeled directed graph with vertex set .

[An Order-One Markov Process] We now carry out the construction described above on an order-one Markov process [gagniuc2017markov] over alphabet , in which follows a Bernoulli distribution conditioned on the value of . Specifically, we have

Together with the specification , we can check that the process is stationary and ergodic. We choose this process as our first example because it has a small set of causal states of size . We list the causal states of sequences up to length in Tab. I.

causal state | |||
---|---|---|---|

Since is defined on an infinite dimensional space, we only show the symbolic derivative in Tab. I, but we can verify that if and only if for this process.

Now, we conceptualize the labeled directed graph obtained from analyzing the causal states as an automaton structure [yan1998introductionMachineComputation], which we call a probabilistic finite-state automaton [chattopadhyay2014data], and show how we can get a stochastic process from it.

[Probabilistic Finite-State Automaton (PFSA)] A probabilistic finite-state automaton is specified by a quadruple , where is a finite alphabet, is a finite set of states, is a partial map from to called the transition map, and , called the observation probability, is a map from to , where is the space of probability distributions over . The entry indexed by of is written as .

We call the directed graph (not necessarily simple with possible loops and multiedges) with vertices in and edges specified by the graph of the PFSA and, unless stated otherwise, we assume it to be strongly connected [bondy2008graph], which means for any pair , there is a sequence , such that for with and .

To generate a sequence of symbols, assuming ’s current state is , it then outputs symbol with probability , and moves to state . We see that is partial because is undefined when .
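The generation procedure just described can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: the array encoding (`delta[q, s]` for the transition map and `pitilde[q, s]` for the observation probabilities), the function name `generate`, and the two-state example PFSA are all hypothetical choices made for this sketch.

```python
import numpy as np

def generate(delta, pitilde, n, state=0, seed=0):
    """Sample n symbols from a PFSA given in a hypothetical array encoding.

    delta[q, s]   -> next state after emitting symbol s in state q
    pitilde[q, s] -> probability of emitting symbol s in state q
    """
    rng = np.random.default_rng(seed)
    n_symbols = pitilde.shape[1]
    symbols = []
    for _ in range(n):
        # Emit a symbol according to the current state's observation probabilities.
        s = int(rng.choice(n_symbols, p=pitilde[state]))
        symbols.append(s)
        # The current state and the emitted symbol together determine the next state.
        state = int(delta[state, s])
    return symbols

# Hypothetical two-state PFSA over the alphabet {0, 1}.
delta = np.array([[0, 1],
                  [1, 0]])
pitilde = np.array([[0.7, 0.3],
                    [0.4, 0.6]])

sample = generate(delta, pitilde, n=2000)
```

For simplicity the sketch starts from a fixed state; drawing the initial state from the stationary distribution would match the stationarity assumption used in the text.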

[Observation and Transition Matrices] Given a PFSA , the observation matrix is the matrix with the -entry given by , and the transition matrix is the matrix with the -entry, written as , given by

It is straightforward to verify that both and are stochastic, i.e., nonnegative with each row summing to one.
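As an illustration, the sketch below builds both matrices for a small PFSA. It assumes that the transition-matrix entry for a pair of states sums the observation probabilities of all symbols leading from the first state to the second, which is the usual construction for this model; the array encoding and the example values are hypothetical.

```python
import numpy as np

def transition_matrix(delta, pitilde):
    """State-to-state transition matrix of a PFSA.

    Entry (q, q') accumulates pitilde[q, s] over all symbols s
    with delta[q, s] == q' (assumed construction).
    """
    n_states, n_symbols = pitilde.shape
    Pi = np.zeros((n_states, n_states))
    for q in range(n_states):
        for s in range(n_symbols):
            # delta only needs to be defined where a symbol can be emitted.
            if pitilde[q, s] > 0:
                Pi[q, delta[q, s]] += pitilde[q, s]
    return Pi

# Hypothetical two-state PFSA over the alphabet {0, 1}.
delta = np.array([[0, 1],
                  [1, 0]])
pitilde = np.array([[0.7, 0.3],   # observation matrix: one row per state
                    [0.4, 0.6]])

Pi = transition_matrix(delta, pitilde)
```

Both `pitilde` and `Pi` have nonnegative entries with rows summing to one, in line with the remark above.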

We borrow the terms *observation matrix* and *transition matrix* from the study of HMM [stamp2004revealing]. However, we need to point out here that our model differs from the HMM in that, in HMM, the transition from the current state to the next one is independent of the symbol generated by the current state, while in PFSA, the current state and symbol generated together determine the next state the PFSA will be in.

Unless specified otherwise, we assume the initial distribution to be the stationary distribution [kai1967markov_StDis] of . We denote the stationary distribution of by , or by if is understood.

The stochastic process generated by a PFSA whose distribution on states is initialized with the stationary distribution is stationary and ergodic. Proof omitted.

Example 2.2 shows that we may derive a PFSA from a stationary ergodic process, and Thm. 2.2 shows that the process generated by the PFSA thus obtained is also stationary and ergodic. This motivates us to seek a characterization of the stochastic processes that give rise to a PFSA. Since the process in Example 2.2 is an order-one Markov process, which is the simplest non-i.i.d. process, it is legitimate to ask whether a process has to be Markov to have a PFSA generator. The desired characterization is obtained from studying the properties of causal states, which we do in the next section.

Table II compares the three generative models of stochastic processes mentioned in this paper: Markov chain (MC), PFSA, and hidden Markov model (HMM). We note that a Markov chain produces a sequence of states, while sequences produced by PFSA and hidden Markov models take values in their respective output alphabets. We can also see that an HMM can be considered an extension of an MC obtained by adding an output alphabet and observation probabilities, while PFSA are not directly comparable to either MC or HMM.

Model | Defining variables | Example |
---|---|---|
MC | Set of states; Transition probabilities. | |
PFSA | Set of states; Output alphabet; Transition function; Observation probabilities. | |
HMM | Set of states; Output alphabet; Transition probabilities; Observation probabilities. | |

## 3 Stochastic Processes with PFSA Generator

### 3.1 Persistent Causal States

[Persistent and Transient Causal States] Let be the set of causal states of a stationary ergodic process. For every and , let , i.e., the probability of length- sequences that are equivalent to . A causal state is persistent if , and transient otherwise. We denote the set of persistent causal states by .

Here we borrow the term *transient state* from the Markov chain literature, for example [gagniuc2017markov], but we should note that the two concepts are not identical. A Markov chain never revisits transient states once it hits a recurrent state. In contrast, although a transient causal state may never be revisited, as the in Example 2.2, it may also be revisited infinitely many times. The transient states in MC and PFSA are similar in that the probability of a Markov chain being in a transient state diminishes as time increases, and a transient causal state also has . Since transient causal states can recur, we call the counterpart of a transient causal state in PFSA a *persistent causal state*, rather than a recurrent state as in MC.

For any pair , let , where the expression is a shorthand for for all . The following proposition shows that captures the flow of probability over causal states as sequence length increases.

We have for each and . Furthermore, there is no flow from a persistent state to a transient one, i.e., for and . Proof omitted.

Let be the set of persistent causal states of a stationary ergodic process . Then, exists for every . Furthermore, if is finite and , the process generated by the PFSA with and is exactly . In fact, we have for . proof omitted.

[An Order-Two Markov Process] Now, let us consider an order-two Markov process over alphabet , in which follows a Bernoulli distribution conditioned on the value of . More specifically, denoting by for , we have , , , . Together with the specification , , and , we can check that the process is stationary and ergodic. We list the causal states of sequences up to length in Tab. III.

causal state | |||
---|---|---|---|

Since is defined on an infinite dimensional space, we only show in Tab. III, but we can check that if and only if for this process. Since , , only show up once, while , , , appear repeatedly, we have . With more detailed calculation, we can show that , , , and , which sum up to . According to Thm. 3.1, we can construct a PFSA with state set that generates exactly the same process. We demonstrate the labeled directed graph constructed on in Fig. 2, and the PFSA is exactly the induced subgraph [ray2012graph] on , which is also the unique strongly connected component of the graph. We can show that the stationary distribution of the PFSA is exactly .

[A PFSA on Three States] In this example, we analyze the stochastic process generated by the PFSA on the right of Fig. 3. We nickname the PFSA . We show that of this process is infinite, while is of size . We first notice that, no matter what state the PFSA resides in, the sequence , and hence any sequence ending in , will take it to state , which generates symbol with probability , and with probability . We also note that, whenever there are two consecutive s in a given sequence in , we know for sure which state the PFSA resides in. For example, sequence will take the PFSA to , and , to . On the left of Fig. 4, we show the probabilities of causal states , , , and the sum of probabilities of all other causal states for sequence length . We see from the bar plots that the sum of concentrations of , , and approaches as increases. We also point out that, with all numbers rounded to three decimal places, , , and , while the stationary probabilities of the states , , and are , respectively.

However, we also note that of the process is actually infinite by observing the fact that , where means repeated times, are all distinct.

We note that the process generated by this PFSA is *not* Markov, as implied by the infinity of . However, the fact that there are only three persistent causal states whose sum of probabilities approaches allows it to have a PFSA generator.

[A Stochastic Process with Empty ] In this example, we analyze the stochastic process generated by the PFSA on the left of Fig. 3. We nickname the PFSA . We show that of this process is infinite while is empty. Without going into the details of the computation, we point out that the causal states of this process are also uniquely characterized by their symbolic derivatives, and the set is in one-to-one correspondence with . More specifically, we have

(2) |

where means being proportional to, and

with , for all . We demonstrate on the left of Fig. 4 the contour of against for sequence length . It takes some more work to show rigorously, but we can speculate that, for any fixed , approaches as approaches infinity, as the curves flatten out with increasing .

### 3.2 Accumulation Causal States

We see from Example 3 that we can have a PFSA that generates a stochastic process with empty . In such a case, can we still get the PFSA structure back by studying the set of causal states of the process? The answer is yes.

[Epsilon-Ball of Measure] Denote the collection of all measures on by , and let . The $\varepsilon$-ball of order centered at is defined by

In other words, is the collection of all measures that are no more than $\varepsilon$ away from with respect to the total variation distance over .

[Accumulation Causal States] Let be the set of causal states of a stochastic process , a measure is an accumulation causal state of if

satisfies for all and . That is, a measure is an accumulation causal state if, no matter how large is and how small is, the sum of probabilities of length- sequences falling in does not vanish as approaches infinity.

The collection of accumulation causal states is denoted by . Since is monotonically decreasing as and , is well-defined. A measure with is called an atomic accumulation causal state, and the collection of all atomic accumulation causal states is denoted by .

[Translation Measure] Let , the translation of by for , denoted by , is the extension to of the premeasure on the semiring given by

is closed under translation. proof omitted.

Let be a stationary ergodic stochastic process with finite and . Then the process generated by the PFSA with and is exactly . In fact, we have . proof omitted.

[Example 3 Revisited] We demonstrate that of the process in Example 3 has two elements, again by observation. We plot the cumulative probability density functions of for each sequence length in Fig. 5. More specifically, for each fixed , the -coordinates of the dots are in , while the -coordinate of a dot with -coordinate equals . We can see clearly that the cumulative function converges to a step function with steps at and as increases. This fact implies that satisfies that is either or . We see from (2) that the two measures in are exactly and . Fig. 5 also implies that and , which is exactly the stationary distribution on the state set of .

## 4 Inference algorithm of PFSA

From the discussion in Sec. 3, we see that a stochastic process has a PFSA generator if either finitely many causal states get all the probability in the limit, as described in Sec. 3.1, or there exist finitely many measures in whose arbitrarily small neighborhoods are populated by almost all the causal states in the limit, as described in Sec. 3.2. The implication of these observations goes beyond the theory of PFSA and guides us in designing inference algorithms for the model. In fact, a valid heuristic for inferring a PFSA would be to apply any clustering algorithm to the set of causal states corresponding to sequences up to a certain length, and use the centers of the clusters as estimates of the states. However, this primitive heuristic has a drawback, since the cluster structure of may not be clear enough to facilitate a clustering algorithm. In order to get better estimates of the states, we need to fine-tune our view into the set of causal states using the notion of an $\varepsilon$-synchronizing sequence [chattopadhyay2014causality].

### 4.1 Epsilon-synchronizing Sequences

Before introducing $\varepsilon$-synchronizing sequences, we first introduce the concept of observation induced distributions over the state set. Let be a PFSA. We know that the initial distribution over states is exactly the stationary distribution . Assume that the first symbol generated by is , and denote by the distribution over states after producing . We have

where

is the normalizer.

[Observation induced distributions] Let be a sequence observed, the distribution over states induced by is defined inductively by

where

for , with the base case .
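The inductive update can be sketched as a simple filtering step. The sketch assumes the usual form of the update: each state's probability is weighted by the probability of emitting the observed symbol, moved along the transition map, and renormalized. The function names, the array encoding, and the example PFSA are hypothetical.

```python
import numpy as np

def update(p, sym, delta, pitilde):
    """One step of the observation induced distribution (assumed form)."""
    new = np.zeros_like(p, dtype=float)
    for q in range(len(p)):
        weight = p[q] * pitilde[q, sym]
        if weight > 0:
            # Probability mass at q that emits sym moves to delta[q, sym].
            new[delta[q, sym]] += weight
    total = new.sum()  # the normalizer
    if total == 0:
        raise ValueError("symbol has zero probability under the current distribution")
    return new / total

def induced_distribution(seq, p0, delta, pitilde):
    """Apply the update symbol by symbol, starting from the base case p0."""
    p = np.asarray(p0, dtype=float)
    for sym in seq:
        p = update(p, sym, delta, pitilde)
    return p

# Hypothetical two-state PFSA over {0, 1}; its stationary distribution is (2/3, 1/3).
delta = np.array([[0, 1],
                  [1, 0]])
pitilde = np.array([[0.7, 0.3],
                    [0.4, 0.6]])
p0 = np.array([2 / 3, 1 / 3])

p_after_1 = induced_distribution([1], p0, delta, pitilde)
p_after_10 = induced_distribution([1, 0], p0, delta, pitilde)
```

A sequence whose induced distribution ends up concentrated near a single state is the kind of sequence the synchronization idea below is looking for.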

[$\varepsilon$-synchronizing sequence] Let be a strongly connected PFSA on state set over alphabet . A sequence is called an $\varepsilon$-synchronizing sequence for some if there exists a such that , where is the base probability vector with the entry indexed by equalling .

$\varepsilon$-synchronizing sequences are important to inference because tends to have a much clearer cluster structure than for an $\varepsilon$-synchronizing sequence .

### 4.2 GenESeSS Algorithm

In this section, we give a brief review of the GenESeSS algorithm proposed in [chattopadhyay2013abductive]. By a sub-sequence, we mean a *consecutive* sub-sequence.

[Empirical Symbolic Derivatives] Let , the empirical symbolic derivative of a sub-sequence of is given by

for all .

Our inference algorithm is called GenESeSS, for Generator Extraction Using Self-Similar Semantics. Given a long enough observed sequence , GenESeSS takes the following three steps to infer a PFSA:

Step One: Approximate $\varepsilon$-synchronizing sequence: Calculate

Then, select a sequence with being a vertex of the convex hull of .

Step Two: Identify transition structure: For each state , we associate a sequence identifier , and a probability distribution on . We extend the structure recursively: Initialize the state set as , find and set ; Calculate the empirical symbolic derivative of for each state and . If for some , then define . However, if no such exists in , add a new state to , and define , and . The process terminates when no more states can be added to . The inferred PFSA is the strongly connected component of the directed graph thus obtained.

Step Three: Identify observation probabilities: Initialize counter for each state and symbol ; choose an arbitrary initial state in the graph obtained in step two and run sequence through it, i.e., if the current state is , and the next symbol from is , then move to , and add to counter ; finally, calculate the observation probability map by .
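The empirical symbolic derivative used in steps one and two amounts to counting which symbols follow a given sub-sequence in the observed data. A minimal sketch is given below, assuming sequences are encoded as Python strings and that the estimate is normalized by the total count of one-symbol extensions (a convention chosen here so that the entries sum to one). The function names and the example sequence are hypothetical.

```python
def count_occurrences(x, w):
    """Count (possibly overlapping) occurrences of w as a consecutive sub-sequence of x."""
    return sum(1 for i in range(len(x) - len(w) + 1) if x[i:i + len(w)] == w)

def empirical_symbolic_derivative(x, y, alphabet):
    """Estimate the distribution of the symbol following sub-sequence y in x."""
    counts = [count_occurrences(x, y + s) for s in alphabet]
    total = sum(counts)
    if total == 0:
        raise ValueError("sub-sequence is never followed by a symbol in x")
    return [c / total for c in counts]

# Hypothetical observed sequence over the alphabet {"0", "1"}.
x = "0110100110"

phi_1 = empirical_symbolic_derivative(x, "1", alphabet="01")    # symbol following "1"
phi_empty = empirical_symbolic_derivative(x, "", alphabet="01")  # plain symbol frequencies
```

In this toy sequence, "1" is followed by "0" three times and by "1" twice, so `phi_1` is `[0.6, 0.4]`. On real data the sequence must be long enough for such counts to be reliable.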

## 5 Entropy Rate and KL Divergence

### 5.1 Irreducibility of PFSA

We first discuss the concept of irreducibility for PFSA.

A PFSA is irreducible if there is no other PFSA with strictly fewer states that generates the same stochastic process as does.

The definition of PFSA itself does not ensure irreducibility, as shown by Example 5.1.

[Reducible PFSA] In Fig. 6, we show two reducible PFSA. The PFSA on the left generates the same process as the PFSA on the right of Fig. 1 does, while the PFSA on the right generates the same process as the PFSA on the right of Fig. 3 does, but both have one more state than their respective irreducible versions.

[Measure of State and Equivalent States] Let a PFSA be specified by the quadruple and the measure be defined by . Two states are equivalent if and only if .

We note that and in both PFSA in Fig. 6 are equivalent. We also see that we can get the corresponding irreducible PFSA back by collapsing equivalent states to a single state.

[Characterization of Irreducibility] A PFSA is irreducible if and only if it has no equivalent states. Furthermore, an irreducible PFSA is unique in the sense that, if two irreducible PFSA and generate the same stochastic process, there must be a one-to-one correspondence such that and .

The PFSA constructed on the set of persistent states and the set of atomic accumulation states are irreducible.

### 5.2 Entropy Rate and KL Divergence

[Entropy rate and KL divergence] The entropy rate of a PFSA is the entropy rate of the stochastic process generates [cover2012elements]. Similarly, the KL divergence of a PFSA from the PFSA is the KL divergence of the process generated by from that of . More precisely, we have the entropy rate

and the KL divergence

whenever the limits exist.
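For reference, the standard per-symbol definitions of these two quantities for processes over a finite alphabet can be written as below. The notation is an assumption made for this sketch: $G$ and $H$ denote the two PFSA, $\Sigma^d$ the set of length-$d$ sequences, and $Pr_G(x)$ the probability that $G$ generates $x$.

$$
\mathcal{H}(G) = -\lim_{d \to \infty} \frac{1}{d} \sum_{x \in \Sigma^d} Pr_G(x) \log Pr_G(x),
\qquad
\mathcal{D}_{\mathrm{KL}}(G \,\|\, H) = \lim_{d \to \infty} \frac{1}{d} \sum_{x \in \Sigma^d} Pr_G(x) \log \frac{Pr_G(x)}{Pr_H(x)}.
$$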

[Closed-form Formula for Entropy Rate] The entropy rate of a PFSA is given by

where is the entropy of a probability distribution. proof omitted.
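A numerical sketch of such a closed form is given below. It assumes the entropy rate is the average of the entropies of the per-state observation distributions, weighted by the stationary distribution over states; this is consistent with the description above but should be read as an assumption of the sketch. The array encoding, function names, and example PFSA are hypothetical.

```python
import numpy as np

def stationary_distribution(delta, pitilde):
    """Stationary distribution of the state-to-state transition matrix."""
    n_states, n_symbols = pitilde.shape
    Pi = np.zeros((n_states, n_states))
    for q in range(n_states):
        for s in range(n_symbols):
            if pitilde[q, s] > 0:
                Pi[q, delta[q, s]] += pitilde[q, s]
    # Left eigenvector of Pi for the eigenvalue closest to 1.
    vals, vecs = np.linalg.eig(Pi.T)
    v = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
    return v / v.sum()

def entropy(p):
    """Shannon entropy of a probability vector, in bits."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def entropy_rate(delta, pitilde):
    """Stationary-weighted average of per-state observation entropies (assumed form)."""
    p = stationary_distribution(delta, pitilde)
    return float(sum(p[q] * entropy(pitilde[q]) for q in range(len(p))))

# Hypothetical two-state PFSA over {0, 1}.
delta = np.array([[0, 1],
                  [1, 0]])
pitilde = np.array([[0.7, 0.3],
                    [0.4, 0.6]])

p_stat = stationary_distribution(delta, pitilde)
rate = entropy_rate(delta, pitilde)
```

For this example the stationary distribution is (2/3, 1/3), so the rate is the corresponding mixture of the binary entropies of 0.3 and 0.4.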

[Closed-form Formula for KL Divergence] Let and be two PFSA, and let be the joint -probability of the joint state.¹ Then the KL divergence of from is given by

where is the KL divergence between two probability distributions. Proof omitted.

¹ The formal definition of the joint -probability requires a long and technical derivation, which is outside the main focus of this paper. We can interpret as follows. Suppose we have a sample path generated by , and we run the sample path on both and (from arbitrary initial states) and calculate the frequency of the event “ is in state and is in state ” as a function of sequence length . The frequency can be shown to converge as approaches infinity, and the limit is .

### 5.3 Log-likelihood

[Log-likelihood] Let , the log-likelihood [cover2012elements] of a PFSA generating is given by

[Convergence of Log-likelihood] Let and be two irreducible PFSA, and let be a sequence generated by . Then we have

in probability as .

###### Proof:

We first notice that

By induction, we have , and hence by Cesàro summation theorem, whenever the limit exists.

Let be a sequence generated by . Let be the truncation of at the -th symbols, we have

Since the stochastic process generates is ergodic, we have

and . ∎
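The convergence result can be illustrated numerically. The sketch below assumes that the log-likelihood is the negative base-2 log-probability of the sequence divided by its length, computed by propagating the distribution over states from the stationary distribution. All names, the array encoding, and the two example PFSA are hypothetical.

```python
import numpy as np

def stationary_distribution(delta, pitilde):
    n_states, n_symbols = pitilde.shape
    Pi = np.zeros((n_states, n_states))
    for q in range(n_states):
        for s in range(n_symbols):
            Pi[q, delta[q, s]] += pitilde[q, s]
    vals, vecs = np.linalg.eig(Pi.T)
    v = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
    return v / v.sum()

def generate(delta, pitilde, n, seed=0):
    rng = np.random.default_rng(seed)
    p = stationary_distribution(delta, pitilde)
    state = int(rng.choice(len(p), p=p))  # start from the stationary distribution
    out = []
    for _ in range(n):
        s = int(rng.choice(pitilde.shape[1], p=pitilde[state]))
        out.append(s)
        state = int(delta[state, s])
    return out

def log_likelihood(seq, delta, pitilde):
    """Negative log2-probability per symbol (assumed normalization).

    Assumes delta is defined for every (state, symbol) pair and that the
    model assigns positive probability to every observed symbol.
    """
    p = stationary_distribution(delta, pitilde)
    total = 0.0
    for s in seq:
        joint = p * pitilde[:, s]   # Pr(current state, next symbol = s)
        prob = joint.sum()          # Pr(next symbol = s | symbols so far)
        total -= np.log2(prob)
        p = np.zeros(len(joint))
        np.add.at(p, delta[:, s], joint / prob)  # move mass along transitions
    return total / len(seq)

# Two hypothetical PFSA sharing a transition map but not observation probabilities.
delta = np.array([[0, 1],
                  [1, 0]])
pit_G = np.array([[0.7, 0.3],
                  [0.4, 0.6]])
pit_H = np.array([[0.2, 0.8],
                  [0.9, 0.1]])

x = generate(delta, pit_G, n=20000)
ll_G = log_likelihood(x, delta, pit_G)  # should approach the entropy rate of G
ll_H = log_likelihood(x, delta, pit_H)  # exceeds ll_G by roughly the divergence
```

With a long enough sample, `ll_G` settles near the entropy rate of the generating model, while `ll_H` stays strictly larger, which is the gap the theorem attributes to the KL divergence.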

In this example we show the convergence of log-likelihood using the PFSA on the left of Fig. 1 and the PFSA that is the induced subgraph on , and in Fig. 2. We have

and

Let us use as shorthand for “ is generated by ”. We show in Fig. 7 the log-likelihood of producing a sequence generated by (top left), the log-likelihood of producing a sequence generated by (top right), the log-likelihood of producing a sequence generated by (bottom left), and the log-likelihood of producing a sequence generated by (bottom right). We can clearly see the convergence of the log-likelihood from the plots.

## 6 Smash2.0

With the assumption of discrete-valued input, we first show in Sec. 6.1 how to use log-likelihood convergence to define a pairwise distance between sequences. Because PFSA is a model for sequences on finite alphabet, continuous-valued input should first be quantized to discrete ones before being modeled by PFSA. So we discuss in Sec. 6.2 ways of doing quantization and how their fitness can be evaluated. In Sec. 6.3, we compare Smash2.0 to Smash and fastDTW in both performance and efficiency.

### 6.1 Smash2.0: Distance between Time Series

The way we calculate distance between two sequences is as follows. We first choose a set of PFSA as base, and the coordinate for a sequence is defined to be

where is the log-likelihood of generating , as defined in Sec. 5.3. The distance between a pair of sequences can then be any valid distance between their coordinates.
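The construction above can be sketched as follows. This is an illustration under stated assumptions, not the reference implementation: the log-likelihood is taken as the negative base-2 log-probability per symbol, the state distribution starts uniform for brevity, the Euclidean distance is one arbitrary choice of valid distance, and the two-element base and all names are hypothetical.

```python
import numpy as np

def log_likelihood(seq, delta, pitilde):
    """Negative log2-probability per symbol; uniform initial state distribution."""
    n_states = pitilde.shape[0]
    p = np.full(n_states, 1.0 / n_states)
    total = 0.0
    for s in seq:
        joint = p * pitilde[:, s]
        prob = joint.sum()
        total -= np.log2(prob)
        p = np.zeros(n_states)
        np.add.at(p, delta[:, s], joint / prob)
    return total / len(seq)

def coordinates(seq, base):
    """Coordinate of a sequence: its log-likelihood under each base PFSA."""
    return np.array([log_likelihood(seq, d, pt) for d, pt in base])

def smash2_distance(x, y, base):
    """Euclidean distance between coordinates (one possible valid distance)."""
    return float(np.linalg.norm(coordinates(x, base) - coordinates(y, base)))

def generate(delta, pitilde, n, seed):
    rng = np.random.default_rng(seed)
    state, out = 0, []
    for _ in range(n):
        s = int(rng.choice(pitilde.shape[1], p=pitilde[state]))
        out.append(s)
        state = int(delta[state, s])
    return out

# Hypothetical base of two PFSA over {0, 1}.
delta = np.array([[0, 1],
                  [1, 0]])
pit_A = np.array([[0.7, 0.3],
                  [0.4, 0.6]])
pit_B = np.array([[0.2, 0.8],
                  [0.9, 0.1]])
base = [(delta, pit_A), (delta, pit_B)]

x1 = generate(delta, pit_A, n=10000, seed=1)  # two samples from the same process
x2 = generate(delta, pit_A, n=10000, seed=2)
y = generate(delta, pit_B, n=10000, seed=3)   # a sample from a different process

d_same = smash2_distance(x1, x2, base)
d_diff = smash2_distance(x1, y, base)
```

Sequences from the same generator land close together in coordinate space even though they differ pointwise, while sequences from different generators are separated.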

For example, for the two numerical experiments in Sec.7, we use that contains four PFSA