# On the Existence of a Sample Mean in Dynamic Time Warping Spaces

Brijnesh Jain and David Schultz
Technische Universität Berlin
Germany
e-mail: brijnesh.jain@gmail.com

#### Abstract.

The concept of sample mean in dynamic time warping (DTW) spaces has been successfully applied to improve pattern recognition systems and generalize centroid-based clustering algorithms. Its existence has neither been proved nor challenged. This article presents sufficient conditions for existence of a sample mean in DTW spaces. The proposed result justifies prior work on approximate mean algorithms, sets the stage for constructing exact mean algorithms, and is a first step towards a statistical theory of DTW spaces.

## 1 Introduction

Time series are time-dependent observations that vary in length and temporal dynamics (speed). Examples of time series data include acoustic signals, electroencephalogram recordings, electrocardiograms, financial indices, and internet traffic data.

Time series averaging aims at finding a typical time series that “best” represents a given sample of time series. The first works on time series averaging date back to the 1970s, with speech recognition as the prime application [19]. Since then, research has predominantly focused on devising averaging algorithms for improving pattern recognition systems and generalizing centroid-based clustering algorithms [1, 8, 9, 13, 15, 16, 17, 18, 21, 23]. In contrast to averaging points in a Euclidean space, averaging time series is a non-trivial task, because the sample time series can vary in length and speed. To filter out these variations, the above-cited averaging algorithms apply dynamic time warping (DTW).

The most promising direction poses time series averaging as an optimization problem [3, 9, 16, 22, 23]: Suppose that $\mathcal{T}$ is the set of all time series of finite length and $\mathcal{X} = (x^{(1)}, \dots, x^{(N)})$ is a sample of $N$ time series $x^{(k)} \in \mathcal{T}$. Then time series averaging amounts to minimizing the Fréchet function [6]

$$F: \mathcal{T}_m \to \mathbb{R}, \quad x \mapsto \frac{1}{N} \sum_{k=1}^{N} \delta^2\big(x, x^{(k)}\big), \tag{1}$$

where the solution space $\mathcal{T}_m \subseteq \mathcal{T}$ is the subset of all time series of length $m$ and $\delta(x, y)$ denotes the DTW distance between time series $x$ and $y$ [20]. The global minimizers of the Fréchet function are the restricted sample means of sample $\mathcal{X}$. The restriction refers to confining the length of the candidate solutions.

Using Fréchet functions, the notion of a “typical time series that best represents a sample” has a precise meaning. A typical time series is any global minimizer of the Fréchet function. If a global minimum exists, it best represents a sample in the sense that it deviates least from all sample time series. The Fréchet function is motivated by the property that an arithmetic mean of real numbers minimizes the mean squared error from the sample numbers. Following Fréchet [6], we can use this property to generalize the concept of sample mean to arbitrary distance spaces for which a well-defined addition is unknown and therefore an arithmetic mean can not be computed in closed form by a weighted sum.

Research on the Fréchet function as defined in Eq. (1) has the following shortcomings:

1. Existence of a global minimum of the Fréchet function has neither been proved nor challenged. Existence of an optimal solution depends on the particular choice of DTW distance and loss function. A restricted sample mean trivially exists if the DTW distance between two time series is always zero. Conversely, there are DTW spaces for which a sample mean does not always exist (cf. Example 2.8). Thus it is unclear whether existing heuristics indeed approximate a typical time series or unknowingly search for a phantom.

2. In experiments, the solution space $\mathcal{T}_m$ is heuristically chosen. For example, if all sample time series are of length $n$, then a common choice is $m = n$. The intuition behind this choice is that the length of a restricted sample mean should “best” represent the lengths of the sample time series. This intuition may lead to solutions that fail to capture the characteristic properties of the sample time series as illustrated by Figure 1.

3. The Fréchet function only generalizes the sample mean. Neither weighted means nor other important measures of central location such as the sample median are captured by Eq. (1).

To address all three issues, we consider a more general formulation of the Fréchet function as given by

$$F: \mathcal{U} \to \mathbb{R}, \quad x \mapsto \sum_{k=1}^{N} h_k\big(\delta(x, x^{(k)})\big),$$

where $\mathcal{U} \subseteq \mathcal{T}$ is the solution space and the $h_k$ are loss functions. Setting $h_k(u) = u^2$ for all $k$ recovers the standard Fréchet function given in Eq. (1) up to the constant factor $1/N$. To average the sum of squared DTW distances, we define $h_k(u) = w u^2$, where $w = 1/N$ is the uniform weight. To obtain a weighted version of Eq. (1), we demand that $h_k(u) = w_k u^2$, where $w_k$ is a positive weight. The sample median is obtained by setting $h_k(u) = u$ for all $k$. Regardless of the choice of loss function, we use sample mean as an umbrella term for the global minimizers of the general Fréchet function.

We focus on two forms of solution sets $\mathcal{U}$: (i) $\mathcal{U} = \mathcal{T}$ is the set of all time series of finite length and (ii) $\mathcal{U} = \mathcal{T}_m$ is the subset of time series of length $m$. We call $\mathcal{T}$ the unrestricted and $\mathcal{T}_m$ the restricted form. Note that restrictions refer to the solution space only. Sample time series are always from $\mathcal{T}$ and therefore may have arbitrary length. We assume no restrictions on the elements of the time series. The elements can be real values, feature vectors, symbols, trees, graphs, and mixtures thereof.

This contribution presents sufficient conditions for existence of a sample mean in restricted and unrestricted form. We show that common DTW distances mentioned in the literature satisfy the proposed sufficient conditions. A key result is the Reduction Theorem stating that there is a sample-dependent bound on the length beyond which the Fréchet function can not be further decreased. For the two sample time series in Figure 1 the bound is . Hence, the restricted sample mean is also an unrestricted sample mean.

This contribution has the following implications: Existence of a sample mean together with the necessary conditions of optimality proposed in [22] enables the formulation of exact mean algorithms [2]. Existence of restricted sample means theoretically justifies prior work [3, 9, 16, 19, 22, 23] in the sense that the concept of a restricted sample mean is not a phantom but does in fact exist. Existence of the weighted mean justifies the soft-DTW approach proposed by [3]. Finally, this contribution is a first step towards a statistical theory of DTW spaces in the spirit of a statistical theory of shape, tree, and graph spaces [4, 5, 7, 10, 11, 12, 14].

The rest of this paper is structured as follows: Section 2 states the main results of this contribution and Section 3 concludes with a summary of the main findings and an outlook to further research. All proofs are delegated to the appendix.

## 2 Existence of a Sample Mean via the Reduction Theorem

This section first introduces the DTW-distance and Fréchet functions. Then the Reduction Theorem is stated and its implications are presented. Finally, sufficient conditions of existence of a sample mean are proposed.

#### Notations.

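We write $\mathbb{R}_{\geq 0}$ for the set of non-negative reals. By $\mathbb{N}$ we denote the set of positive integers. We write $[n]$ to denote the set $\{1, \dots, n\}$ for a given $n \in \mathbb{N}$. Finally, $\mathcal{S}^N = \mathcal{S} \times \cdots \times \mathcal{S}$ is the $N$-fold Cartesian product of the set $\mathcal{S}$, where $N \in \mathbb{N}$.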

### 2.1 The Dynamic Time Warping Distance

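Suppose that $\mathcal{A}$ is an attribute set. A time series $x$ of length $\ell(x) = n$ is a sequence $x = (x_1, \dots, x_n)$ consisting of elements $x_i \in \mathcal{A}$ for every time point $i \in [n]$. By $\mathcal{T}_n$ we denote the set of all time series of length $n$ with elements from $\mathcal{A}$. Then

$$\mathcal{T} = \bigcup_{n \in \mathbb{N}} \mathcal{T}_n$$

is the set of all time series of finite length with elements from $\mathcal{A}$.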

Without further mention, we assume that the attribute set $\mathcal{A}$ is given. Since we do not impose restrictions on the attribute set $\mathcal{A}$, the above definition of time series covers a broad range of sequential data structures. For example, to represent real-valued univariate and multivariate time series, we use $\mathcal{A} = \mathbb{R}$ and $\mathcal{A} = \mathbb{R}^q$, resp., as attribute set. For text strings and biological sequences, the set $\mathcal{A}$ is an alphabet consisting of a finite set of symbols. Further examples are time series of satellite images and time series of graphs as studied in anomaly detection.

Time series vary in length and speed. To filter out these variations, we introduce the technique of dynamic time warping.

###### Definition 2.1.

Let $m, n \in \mathbb{N}$. A warping path of order $m \times n$ is a sequence $p = (p_1, \dots, p_L)$ of $L$ points $p_l = (i_l, j_l) \in [m] \times [n]$ such that

1. $p_1 = (1, 1)$ and $p_L = (m, n)$ (boundary conditions)

2. $p_{l+1} - p_l \in \{(1, 0), (0, 1), (1, 1)\}$ for all $l \in [L-1]$ (step condition)

The set of all warping paths of order $m \times n$ is denoted by $\mathcal{P}_{m,n}$. A warping path of order $m \times n$ can be thought of as a path in an $m \times n$ grid, where rows are ordered top-down and columns are ordered left-right. The boundary condition demands that the path starts at the upper left corner and ends in the lower right corner of the grid. The step condition demands that a transition from one point $p_l$ to the next point $p_{l+1}$ moves a unit in exactly one of the following directions: down, diagonal, and right.
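The boundary and step conditions can be made concrete with a short sketch that enumerates all warping paths of a small grid. The names `warping_paths` and `is_warping_path` are illustrative choices, not notation from this article, and the enumeration is only feasible for small $m$ and $n$ because the number of paths grows quickly.

```python
from typing import Iterator, List, Tuple

Point = Tuple[int, int]
# Feasible steps of Definition 2.1: down, right, diagonal.
STEPS = ((1, 0), (0, 1), (1, 1))


def warping_paths(m: int, n: int) -> Iterator[List[Point]]:
    """Yield every warping path of order m x n as a list of 1-based grid points."""

    def extend(path: List[Point]) -> Iterator[List[Point]]:
        i, j = path[-1]
        if (i, j) == (m, n):  # boundary condition: the path ends at (m, n)
            yield list(path)
            return
        for di, dj in STEPS:  # step condition
            if i + di <= m and j + dj <= n:
                path.append((i + di, j + dj))
                yield from extend(path)
                path.pop()

    yield from extend([(1, 1)])  # boundary condition: the path starts at (1, 1)


def is_warping_path(path: List[Point], m: int, n: int) -> bool:
    """Check the boundary and step conditions for a candidate path."""
    if not path or path[0] != (1, 1) or path[-1] != (m, n):
        return False
    return all(
        (b[0] - a[0], b[1] - a[1]) in STEPS for a, b in zip(path, path[1:])
    )
```

For instance, `list(warping_paths(2, 2))` contains three paths: the diagonal one and the two paths that pass through a corner of the grid.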

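A warping path $p \in \mathcal{P}_{m,n}$ defines an alignment (or warping) between time series $x = (x_1, \dots, x_m)$ and $y = (y_1, \dots, y_n)$. Every point $p_l = (i_l, j_l)$ of warping path $p$ aligns element $x_{i_l}$ to element $y_{j_l}$. The cost of aligning time series $x$ and $y$ along warping path $p$ is defined by

$$c_p(x, y) = \sum_{l=1}^{L} d\big(x_{i_l}, y_{j_l}\big),$$

where $d: \mathcal{A} \times \mathcal{A} \to \mathbb{R}_{\geq 0}$ is a local distance function on $\mathcal{A}$. We demand that the local distance satisfies the following properties: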

for all . As with the attribute set we tacitly assume that the local distance is given without further mention.

Now we are in the position to define the DTW-distance. We obtain the DTW-distance between two time series and by minimizing the cost over all possible warping paths.

###### Definition 2.2.

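Let $f: \mathbb{R}_{\geq 0} \to \mathbb{R}_{\geq 0}$ be a monotone function. Let $x$ and $y$ be two time series of length $m$ and $n$, respectively. The DTW-distance between $x$ and $y$ is defined by

$$\delta(x, y) = \min\big\{f\big(c_p(x, y)\big) : p \in \mathcal{P}_{m,n}\big\}.$$

An optimal warping path is any warping path $p \in \mathcal{P}_{m,n}$ satisfying $\delta(x, y) = f\big(c_p(x, y)\big)$.

The next example presents a common and widely applied DTW-distance to illustrate all components of Definition 2.2.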

###### Example 2.3.

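The Euclidean DTW-distance is specified by the attribute set $\mathcal{A} = \mathbb{R}^q$, the squared Euclidean distance $d(a, a') = \lVert a - a' \rVert^2$ for all $a, a' \in \mathcal{A}$, and the square root function $f(u) = \sqrt{u}$ for all $u \in \mathbb{R}_{\geq 0}$.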

Even if the underlying local distance function is a metric, the induced DTW-distance is generally only a pseudo-semi-metric satisfying

for all . Computing the DTW-distance and deriving an optimal warping path is usually solved by applying techniques from dynamic programming [20].
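The dynamic programming scheme mentioned above can be sketched as follows for univariate real-valued time series. This is a minimal illustration of the standard recursion, assuming the Euclidean DTW-distance of Example 2.3; the function names `dtw_cost` and `euclidean_dtw` are hypothetical and chosen for this sketch only.

```python
import math
from typing import Callable, Sequence


def dtw_cost(
    x: Sequence[float],
    y: Sequence[float],
    d: Callable[[float, float], float],
) -> float:
    """Minimal cost c_p(x, y) over all warping paths p, by dynamic programming."""
    m, n = len(x), len(y)
    # D[i][j] holds the minimal cost of aligning x[:i] with y[:j].
    D = [[math.inf] * (n + 1) for _ in range(m + 1)]
    D[0][0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            # The three predecessors mirror the steps down, right, and diagonal.
            best_prev = min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
            D[i][j] = d(x[i - 1], y[j - 1]) + best_prev
    return D[m][n]


def euclidean_dtw(x: Sequence[float], y: Sequence[float]) -> float:
    """Euclidean DTW-distance: squared local cost, f = square root."""
    return math.sqrt(dtw_cost(x, y, lambda a, b: (a - b) ** 2))
```

The call `euclidean_dtw([0, 1, 2], [0, 1, 1, 2])` returns zero although the two time series differ, which illustrates why the DTW-distance is in general not a metric.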

A DTW-space is a pair $(\mathcal{T}, \delta)$ consisting of a set $\mathcal{T}$ of time series of finite length and a DTW-distance $\delta$ defined on $\mathcal{T}$. For the sake of convenience, we occasionally write $\mathcal{T}$ to denote a DTW-space and tacitly assume that $\delta$ is the underlying DTW-distance.

### 2.2 Fréchet Functions

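Let $(\mathcal{T}, \delta)$ be a DTW-space. A loss function is a monotonically increasing function of the form $h: \mathbb{R}_{\geq 0} \to \mathbb{R}$. A typical example of a loss function is the squared loss $h(u) = u^2$ for all $u \in \mathbb{R}_{\geq 0}$.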

###### Definition 2.4.

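Let $\mathcal{X} = (x^{(1)}, \dots, x^{(N)}) \in \mathcal{T}^N$ be a sample of $N$ time series $x^{(k)}$ with corresponding loss function $h_k$ for all $k \in [N]$. Then the function

$$F: \mathcal{T} \to \mathbb{R}, \quad x \mapsto \frac{1}{N} \sum_{k=1}^{N} h_k\big(\delta(x, x^{(k)})\big)$$

is the Fréchet function of sample $\mathcal{X}$ corresponding to the loss functions $h_1, \dots, h_N$.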

We omit explicitly mentioning the corresponding loss functions of a Fréchet function if no confusion can arise. Note that Definition 2.4 refers to the unrestricted form as described in Section 1. We present some examples and assume that $\mathcal{X} = (x^{(1)}, \dots, x^{(N)})$ is a sample of $N$ time series.

###### Example 2.5.

Let . The Fréchet function of corresponding to the loss functions takes the form

$$F^p(x) = \sum_{k=1}^{N} \delta^p\big(x, x^{(k)}\big).$$

For $p = 1$ ($p = 2$) the Fréchet function $F^p$ generalizes the concept of sample median (sample mean) in Euclidean spaces.

###### Example 2.6.

Let . The Fréchet function of corresponding to the loss functions is of the form

$$F^p(x) = \sum_{k=1}^{N} w_k\, \delta^p\big(x, x^{(k)}\big).$$

The function $F^p$ is a weighted sum of the $p$-distances $\delta^p(x, x^{(k)})$. In the special case of $w_1 = \cdots = w_N = 1/N$, the function $F^p$ averages the sum of $p$-distances.
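The two examples can be evaluated numerically with a small sketch. It assumes univariate time series and the Euclidean DTW-distance of Example 2.3; the function `frechet_p` and its parameters `p` and `weights` are hypothetical names for the exponent and the weights appearing in the displayed formulas.

```python
import math
from typing import Optional, Sequence


def euclidean_dtw(x: Sequence[float], y: Sequence[float]) -> float:
    """Euclidean DTW-distance via dynamic programming (squared local cost)."""
    m, n = len(x), len(y)
    D = [[math.inf] * (n + 1) for _ in range(m + 1)]
    D[0][0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = (x[i - 1] - y[j - 1]) ** 2
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return math.sqrt(D[m][n])


def frechet_p(
    z: Sequence[float],
    sample: Sequence[Sequence[float]],
    p: float = 2.0,
    weights: Optional[Sequence[float]] = None,
) -> float:
    """Weighted sum of p-th powers of DTW distances between z and the sample."""
    if weights is None:
        # Unit weights give the unweighted form of Example 2.5.
        weights = [1.0] * len(sample)
    return sum(w * euclidean_dtw(z, x) ** p for w, x in zip(weights, sample))
```

Passing `weights=[1 / N] * N` for a sample of size `N` yields the averaged form, while `p=1` with unit weights corresponds to the median-type Fréchet function.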

Next, we consider the global minimizers of a Fréchet function, provided they exist.

###### Definition 2.7.

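The sample mean set of $\mathcal{X}$ is the (possibly empty) set defined by

$$\mathcal{F} = \{z \in \mathcal{T} : F(z) \leq F(x) \text{ for all } x \in \mathcal{T}\}.$$

The elements of $\mathcal{F}$ are the sample means of $\mathcal{X}$.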

A sample mean is a time series that minimizes the Fréchet function $F$. It can happen that the corresponding set $\mathcal{F}$ is empty. In this case, a sample mean does not exist. Existence of a sample mean depends on the choice of DTW-distance and loss function. Moreover, if a sample mean exists, it may not be uniquely determined. In contrast, the sample variance

$$F^* = \inf_{x \in \mathcal{T}} F(x)$$

exists and is uniquely determined, because the DTW-distance is bounded from below and the loss is monotonically increasing. Thus, existence of a sample mean means that the Fréchet function attains its infimum. (The sample variance corresponding to the squared loss generalizes the sample variance in Euclidean spaces.)

The next example presents a DTW-space for which a sample mean may not always exist. This example is inspired by the edit distance for sequences and drastically simplified to directly convey the main idea.

###### Example 2.8.

Let be the attribute set with local cost function of the form

for all . We assume that the function corresponding to the DTW distance is the identity such that

$$\delta(x, y) = \min\big\{c_p(x, y) : p \in \mathcal{P}_{m,n}\big\},$$

for all time series and of length and , respectively. Consider the sample consisting of two time series and . As indicated by Figure 2, the Fréchet function

$$F(z) = \frac{1}{2}\,\delta(x, z) + \frac{1}{2}\,\delta(y, z)$$

never attains its infimum . Thus has no global minimum and therefore has no sample mean.

### 2.3 Restricted and Unrestricted Fréchet Functions

The Fréchet function $F$ of Definition 2.4 is in unrestricted form, because it is defined on the entire set $\mathcal{T}$ and imposes no restrictions on the length of the sample mean. In restricted form, the function

$$F_m: \mathcal{T}_m \to \mathbb{R}, \quad x \mapsto F(x)$$

is the Fréchet function of $\mathcal{X}$ restricted to the subset $\mathcal{T}_m$ of all time series of length $m$. It is important to note that the lengths of the sample time series in $\mathcal{X}$ may vary, but the length of the independent variable $x$ of $F_m$ is fixed beforehand to value $m$. In line with the unrestricted form, the set

$$\mathcal{F}_m = \{z \in \mathcal{T}_m : F_m(z) \leq F_m(x) \text{ for all } x \in \mathcal{T}_m\}$$

is the restricted sample mean set of $\mathcal{X}$ restricted to the subset $\mathcal{T}_m$. Occasionally, we call the elements of $\mathcal{F}_m$ restricted sample means and the elements of $\mathcal{F}$ the (unrestricted) sample means. As for the unrestricted form, existence of a restricted sample mean depends on the choice of DTW-distance and loss function.

### 2.4 The Reduction Theorem

This section presents sufficient conditions for existence of a sample mean in restricted and unrestricted form. The approach is as follows: First, we present sufficient conditions for existence of restricted sample means. Second, under these assumptions we infer that an unrestricted sample mean also exists. The main tool for the second step is the Reduction Theorem. Proofs are delegated to the appendix.

Suppose that $\mathcal{X} = (x^{(1)}, \dots, x^{(N)})$ is a sample of $N$ time series. The Reduction Theorem is based on the notion of reduction bound $\rho(\mathcal{X})$ of sample $\mathcal{X}$. The exact definition of $\rho(\mathcal{X})$ requires some technicalities and is fully spelled out in Section A.3. Here, we present a simpler definition that conveys the main idea and covers the relevant use cases in pattern recognition. For this, we assume that every sample time series $x^{(k)}$ of sample $\mathcal{X}$ has length $\ell(x^{(k)}) \geq 2$. Then the reduction bound of $\mathcal{X}$ is defined by

$$\rho(\mathcal{X}) = \sum_{k=1}^{N} \ell\big(x^{(k)}\big) - 2(N - 1). \tag{2}$$

In contrast to Eq. (2), the exact definition of $\rho(\mathcal{X})$ admits samples that contain trivial time series of length one. Equation (2) shows that the reduction bound of a sample increases linearly with the sum of the lengths of the sample time series. The following results hold for arbitrary samples and assume the exact definition of a reduction bound as provided in Section A.3.
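The simplified reduction bound is easy to compute. The sketch below assumes the simplified form only, that is, the sum of the sample lengths minus $2(N-1)$ for samples whose time series all have length at least two; the function name `reduction_bound` is hypothetical, and the exact definition of Section A.3 is not covered.

```python
from typing import Sequence


def reduction_bound(sample: Sequence[Sequence[float]]) -> int:
    """Simplified reduction bound: sum of lengths minus 2 * (N - 1)."""
    if not sample:
        raise ValueError("the sample must contain at least one time series")
    if any(len(x) < 2 for x in sample):
        # Time series of length one require the exact definition instead.
        raise ValueError("the simplified bound assumes lengths of at least 2")
    n_series = len(sample)
    return sum(len(x) for x in sample) - 2 * (n_series - 1)
```

For two sample time series of length four, the function returns `4 + 4 - 2 = 6`.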

###### Theorem 2.9 (Reduction Theorem).

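Let $F$ be the Fréchet function of a sample $\mathcal{X}$. Then for every time series $x \in \mathcal{T}$ of length $\ell(x) > \rho(\mathcal{X})$ there is a time series $x' \in \mathcal{T}$ of length $\ell(x') = \ell(x) - 1$ such that $F(x') \leq F(x)$.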

The Reduction Theorem deserves some explanation. To illustrate the following comments we refer to Figures 3–5 with the following specifications: In these figures, we assume univariate time series with real values. The underlying distance is the Euclidean DTW-distance of Example 2.3. The Fréchet functions of the different samples are given by

$$F(z) = \frac{1}{2}\,\delta\big(x^{(1)}, z\big) + \frac{1}{2}\,\delta\big(x^{(2)}, z\big).$$

The figures show warping paths by black lines connecting aligned elements of the time series to be compared. We make the following observations:

#### 1.

It follows from the proof of the Reduction Theorem that every candidate solution $z$ whose length exceeds the reduction bound has an element that can be removed without increasing the value $F(z)$. Such elements are said to be redundant. Figure 3 schematically characterizes redundant elements of a time series.

#### 2.

In general, removing a redundant element does not increase the Fréchet function. Figure 3 shows that removing a redundant element can even decrease the value of the Fréchet function.

#### 3.

The reduction bound of the sample in Figure 3 is given by

$$\rho(\mathcal{X}) = \ell\big(x^{(1)}\big) + \ell\big(x^{(2)}\big) - 2(N - 1) = 4 + 4 - 2 = 6.$$

The length of time series is only . This shows that short candidate solutions whose lengths are bounded by the reduction bound may also have redundant elements that can be removed without increasing the value of the Fréchet function. Existence of a redundant element depends on the choice of warping path between and the sample time series. For short time series , we can always find warping paths such that has no redundant elements. In contrast, long time series whose lengths exceed the reduction bound always have a redundant element, regardless which warping paths we consider.

#### 4.

Removing a non-redundant element of a candidate solution can increase the value of the Fréchet function. Figure 4 presents an example.

#### 5.

The Reduction Theorem does not exclude existence of sample means whose lengths exceed the reduction bound of a sample. Figure 5 presents an example for which a sample mean can have almost any length.

The Reduction Theorem and observations 1–4 form the basis for existence proofs in unrestricted form and point to a technique to improve algorithms for approximating a sample mean (if it exists). Statements on the existence of a sample mean are presented in the next section. It follows from observations 1–3 that a candidate solution $z$ of any length could be improved or at least shortened by detecting and removing redundant elements of $z$. This observation is not further explored in this article and left for further research.
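The idea of shortening a candidate solution can be illustrated by brute force. The sketch below is not the construction used in the proof: it simply tries every single-element deletion and keeps the best one. It assumes univariate time series and a Fréchet function given by the sum of squared Euclidean DTW distances; `dtw_sq`, `frechet`, and `best_single_deletion` are hypothetical helper names.

```python
import math
from typing import List, Sequence


def dtw_sq(x: Sequence[float], y: Sequence[float]) -> float:
    """Squared Euclidean DTW-distance (minimal sum of squared differences)."""
    m, n = len(x), len(y)
    D = [[math.inf] * (n + 1) for _ in range(m + 1)]
    D[0][0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = (x[i - 1] - y[j - 1]) ** 2
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[m][n]


def frechet(z: Sequence[float], sample: Sequence[Sequence[float]]) -> float:
    """Sum of squared DTW distances between candidate z and the sample."""
    return sum(dtw_sq(z, x) for x in sample)


def best_single_deletion(
    z: List[float], sample: Sequence[Sequence[float]]
) -> List[float]:
    """Return the best time series obtained by deleting one element of z."""
    if len(z) < 2:
        raise ValueError("z must have at least two elements")
    candidates = [z[:l] + z[l + 1:] for l in range(len(z))]
    return min(candidates, key=lambda c: frechet(c, sample))
```

As a usage example, take the sample `[[0, 2], [1, 3]]`, whose simplified reduction bound is 2, and the candidate `z = [0, 1, 3]` of length 3. Here `frechet(z, sample)` evaluates to 3, and the best single deletion attains the value 2, so shortening the candidate decreases the Fréchet function in this instance.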

### 2.5 Sufficient Conditions of Existence

In this section, we derive sufficient conditions of existence of a sample mean in restricted and unrestricted form.

The Reduction Theorem guarantees existence of a sample mean in unrestricted form if sample means exist in restricted forms. Thus, existence proofs in the general unrestricted form reduce to existence proofs in the simpler restricted form. This statement is proved in Corollary 2.10.

###### Corollary 2.10.

Let $\mathcal{X}$ be a sample and let $\rho$ be the reduction bound of $\mathcal{X}$. Suppose that $\mathcal{F}_m \neq \emptyset$ for every $m \in [\rho]$. Then $\mathcal{X}$ has a sample mean.

It is not self-evident that existence of a class of restricted sample means implies existence of a sample mean. To see this, we define the restricted (sample) variance by

$$F_m^* = \inf_{x \in \mathcal{T}_m} F_m(x).$$

If $F_m$ attains its infimum, then $\mathcal{X}$ has a restricted sample mean. Suppose that $\mathcal{X}$ has a restricted sample mean for every $m \in \mathbb{N}$. Then

$$v_m = \min_{l \leq m} F_l^*$$

is the smallest restricted variance over all lengths $l \in [m]$. The sequence $(v_m)_{m \in \mathbb{N}}$ is bounded from below and monotonically decreasing. Therefore, the sequence converges to the unrestricted sample variance $F^*$. Then $\mathcal{X}$ has a sample mean only if the sequence $(v_m)_{m \in \mathbb{N}}$ attains its infimum $F^*$. Corollary 2.10 guarantees that the sequence indeed attains its infimum at the latest at $m = \rho$.

Next, we present sufficient conditions of existence. The first result proposes sufficient conditions of existence of a restricted and unrestricted sample mean for time series (sequences) with discrete attribute values.

###### Proposition 2.11.

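Let $\mathcal{X}$ be a sample. Suppose that $\mathcal{A}$ is a finite attribute set. Then the following statements hold:

1. $\mathcal{F}_m \neq \emptyset$ for every $m \in \mathbb{N}$.

2. $\mathcal{F} \neq \emptyset$.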
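For a finite attribute set, restricted sample means can be found by exhaustive search, because there are only finitely many time series of a given length. The sketch below illustrates this on a toy instance. It assumes an integer alphabet, a squared-difference local cost, and the sum of the resulting DTW costs as Fréchet function; these choices and the name `restricted_variances` are assumptions made for illustration only.

```python
import math
from itertools import product
from typing import List, Sequence


def dtw_sq(x: Sequence[int], y: Sequence[int]) -> float:
    """Minimal warping cost with squared-difference local cost."""
    m, n = len(x), len(y)
    D = [[math.inf] * (n + 1) for _ in range(m + 1)]
    D[0][0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = (x[i - 1] - y[j - 1]) ** 2
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[m][n]


def frechet(z: Sequence[int], sample: Sequence[Sequence[int]]) -> float:
    """Sum of warping costs between candidate z and the sample time series."""
    return sum(dtw_sq(z, x) for x in sample)


def restricted_variances(
    sample: Sequence[Sequence[int]], alphabet: Sequence[int], max_len: int
) -> List[float]:
    """Minimum of the Fréchet function over all candidates of length 1..max_len."""
    return [
        min(frechet(z, sample) for z in product(alphabet, repeat=m))
        for m in range(1, max_len + 1)
    ]
```

For the sample `[(0, 2), (1, 2)]` over the alphabet `(0, 1, 2)`, the simplified reduction bound is 2, and `restricted_variances(sample, (0, 1, 2), 4)` yields the values 3, 1, 1, 1 for lengths 1 to 4. In this instance, candidates longer than the bound do not improve on the minimum reached at length 2.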

The second result proposes sufficient conditions of existence of a restricted and unrestricted sample mean of uni- and multivariate time series with elements from $\mathbb{R}^q$.

###### Proposition 2.12.

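Let $\mathcal{X}$ be a sample. Suppose that the following assumptions hold:

1. $(\mathcal{A}, d)$ is a metric space of the form $(\mathbb{R}^q, \lVert \cdot \rVert)$, where $\lVert \cdot \rVert$ is a norm on $\mathbb{R}^q$.

2. The loss functions $h_1, \dots, h_N$ are continuous and strictly monotonically increasing.

Then the following statements hold:

1. $\mathcal{F}_m \neq \emptyset$ for every $m \in \mathbb{N}$.

2. $\mathcal{F} \neq \emptyset$.

The attribute set $\mathcal{A} = \mathbb{R}^q$ covers the case of univariate ($q = 1$) and the case of multivariate ($q > 1$) time series. The local cost function $d$ on $\mathcal{A}$ is a metric induced by a norm on $\mathbb{R}^q$. Loss functions of the form $h(u) = w u^p$ are continuous and strictly monotonically increasing on the non-negative reals for $w > 0$ and $p \geq 1$. Thus, the sufficient conditions of Prop. 2.12 cover customary DTW-spaces.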

## 3 Conclusion

This article presents sufficient conditions for the existence of a sample mean in DTW spaces in restricted and unrestricted form. The sufficient conditions hold for common DTW distances reported in the literature. The key result is the Reduction Theorem stating that time series whose lengths exceed the reduction bound can be reduced to shorter time series without increasing the value of the Fréchet function. This result guarantees existence of a sample mean in unrestricted form if sample means exist in restricted form. The proof of the Reduction Theorem is framed in the theory of warping graphs. The existence proofs theoretically justify existing mean algorithms and related pattern recognition applications in retrospect. The Reduction Theorem sets the stage for studying the unrestricted sample mean problem. Finally, existence of the sample mean sets the stage for constructing exact algorithms and a statistical theory of DTW spaces. The next step towards such a theory consists in studying under which conditions a sample mean is a consistent estimator of a population mean.

#### Acknowledgements.

B. Jain was funded by the DFG Sachbeihilfe JA 2109/4-1.

## References

• [1] W.H. Abdulla, D. Chow, and G. Sin. Cross-words reference template for DTW-based speech recognition systems. Conference on Convergent Technologies for Asia-Pacific Region, 2003.
• [2] M. Brill, T. Fluschnik, V. Froese, B. Jain, R. Niedermeier, D. Schultz. Exact Mean Computation in Dynamic Time Warping Spaces. arXiv, arXiv:1710.08937, 2017.
• [3] M. Cuturi and M. Blondel. Soft-DTW: a differentiable loss function for time-series. International Conference on Machine Learning, 2017.
• [4] I.L. Dryden and K.V. Mardia. Statistical shape analysis, Wiley, 1998.
• [5] A. Feragen, P. Lo, M. De Bruijne, M. Nielsen, and F. Lauze. Toward a theory of statistical tree-shape analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35:2008–2021, 2013.
• [6] M. Fréchet. Les éléments aléatoires de nature quelconque dans un espace distancié. Annales de l’institut Henri Poincaré, 215–310, 1948.
• [7] C.E. Ginestet. Strong Consistency of Fréchet Sample Mean Sets for Graph-Valued Random Variables. arXiv: 1204.3183, 2012.
• [8] L. Gupta, D. Molfese, R. Tammana, and P.G. Simos. Nonlinear alignment and averaging for estimating the evoked potential. IEEE Transactions on Biomedical Engineering, 43(4):348–356, 1996.
• [9] V. Hautamaki, P. Nykanen, P. Franti. Time-series clustering by approximate prototypes. International Conference on Pattern Recognition, 2008.
• [10] S. Huckemann, T. Hotz, and A. Munk. Intrinsic shape analysis: Geodesic PCA for Riemannian manifolds modulo isometric Lie group actions. Statistica Sinica, 20:1–100, 2010.
• [11] B.J. Jain. Statistical Analysis of Graphs. Pattern Recognition, 60:802–812, 2016.
• [12] D.G. Kendall. Shape manifolds, procrustean metrics, and complex projective spaces. Bulletin of the London Mathematical Society, 16:81–121, 1984.
• [13] J.B. Kruskal and M. Liberman. The symmetric time-warping problem: From continuous to discrete. Time warps, string edits and macromolecules: The theory and practice of sequence comparison, pp. 125–161, 1983.
• [14] J.S. Marron and A.M. Alonso. Overview of object oriented data analysis. Biometrical Journal, 56(5):732–753, 2014.
• [15] V. Niennattrakul and C.A. Ratanamahatana. Shape averaging under time warping. IEEE International Conference on Electrical Engineering/Electronics, Computer, Telecommunications and Information Technology, 2009.
• [16] F. Petitjean, A. Ketterlin, and P. Gancarski. A global averaging method for dynamic time warping, with applications to clustering. Pattern Recognition 44(3):678–693, 2011.
• [17] F. Petitjean, G. Forestier, G.I. Webb, A.E. Nicholson, Y. Chen, and E. Keogh. Dynamic Time Warping Averaging of Time Series Allows Faster and More Accurate Classification. IEEE International Conference on Data Mining, 2014.
• [18] F. Petitjean, G. Forestier, G.I. Webb, A.E. Nicholson, Y. Chen, and E. Keogh. Faster and more accurate classification of time series by exploiting a novel dynamic time warping averaging algorithm. Knowledge and Information Systems, 47(1):1–26, 2016.
• [19] L.R. Rabiner and J.G. Wilpon. Considerations in applying clustering techniques to speaker-independent word recognition. The Journal of the Acoustical Society of America, 66(3): 663–673, 1979.
• [20] H. Sakoe and S. Chiba. Dynamic programming algorithm optimization for spoken word recognition. IEEE Transactions on Acoustics, Speech, and Signal Processing, 26(1):43–49, 1978.
• [21] P. Sathianwiriyakhun, T. Janyalikit, and C.A. Ratanamahatana. Fast and accurate template averaging for time series classification. IEEE International Conference on Knowledge and Smart Technology, 2016.
• [22] D. Schultz and B. Jain. Nonsmooth analysis and subgradient methods for averaging in dynamic time warping spaces. Pattern Recognition, 74:340–358, 2017.
• [23] S. Soheily-Khah, A. Douzal-Chouakria, and E. Gaussier. Generalized k-means-based clustering for temporal data under weighted and kernel time warp. Pattern Recognition Letters, 75:63–69, 2016.

## Appendix A Theory of Warping Graphs

This appendix develops a theory of warping graphs to prove the Reduction Theorem and all other results in this article. The line of argument follows a bottom-up approach. First, we introduce warping chains to model abstract warping paths as given in Definition 2.1 and derive their relevant local properties. Then we proceed to warping graphs that model the alignment of two time series by a warping path and derive global properties from local properties. We enhance warping graphs with node labels and define the notion of weight of a warping graph to model the DTW-distance. Finally, we glue warping graphs to model Fréchet functions, derive the Reduction Theorem, and prove the statements presented in Section 2.

### A.1 Warping Chains

The basic constituents of a Fréchet function are time series and optimal warping paths. This section represents the linear order of time series by chains. Then we introduce the notion of warping chain to model abstract warping paths and study its local properties.

Let be a partially ordered set with partial order . Suppose that are two elements with . We write to mean and . A linear order on is a partial order such that any two elements in are comparable: For all , we have either or .

###### Definition A.1.

A chain is a linearly ordered set.

A chain models the order of a time series , where element refers to the positions of element in . For the sake of convenience, we assume that the explicit notation of a chain always lists its elements in linear order . We call the first and the last element in . The first and last element of a chain are the boundary elements. Any element of chain that is not a boundary element is called an inner element of .

A subset is a subchain of . Note that any subset of a chain is again a chain by transitivity of the linear order. Suppose that such that . Then the subchain is said to be contiguous.

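Let $\mathcal{V} = \{i_1, \dots, i_m\}$ be a chain and let $\mathcal{V}^* = \mathcal{V} \cup \{*\}$, where $*$ is a distinguished symbol denoting the void element. The successor and predecessor of element $i_l \in \mathcal{V}$ are defined by

$$i_l^+ = \begin{cases} i_{l+1} & : 1 \leq l < m \\ * & : l = m \end{cases} \qquad \text{and} \qquad i_l^- = \begin{cases} i_{l-1} & : 1 < l \leq m \\ * & : l = 1 \end{cases}$$

We assume that $\mathcal{W}$ is another chain. The chains $\mathcal{V}$ and $\mathcal{W}$ induce a partial order on the product $\mathcal{U} = \mathcal{V} \times \mathcal{W}$ by

$$(i, j) \leq_{\mathcal{U}} (r, s) \quad \Leftrightarrow \quad i \leq_{\mathcal{V}} r \text{ and } j \leq_{\mathcal{W}} s$$

for all $(i, j), (r, s) \in \mathcal{U}$.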

###### Definition A.2.

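Let $\mathcal{U} = \mathcal{V} \times \mathcal{W}$ be the product of chains $\mathcal{V}$ and $\mathcal{W}$. The successor map on $\mathcal{U}$ is a point-to-set map

$$S_{\mathcal{U}}: \mathcal{U} \to 2^{\mathcal{U}}, \quad (i, j) \mapsto \big\{(i^+, j), (i, j^+), (i^+, j^+)\big\} \cap (\mathcal{V} \times \mathcal{W}),$$

where $2^{\mathcal{U}}$ denotes the set of all subsets of $\mathcal{U}$.

The successor map models the set of feasible warping steps for a given element $(i, j) \in \mathcal{U}$. Intersection of the successor map with $\mathcal{V} \times \mathcal{W}$ ensures that elements whose successor $i^+$ or $j^+$ is the void element are excluded. The successor map sends $(i, j)$ to the empty set if $i$ and $j$ are the last elements of the respective chains $\mathcal{V}$ and $\mathcal{W}$. The next result shows that the successor map preserves the partial product order $\leq_{\mathcal{U}}$ as well as the linear orders $\leq_{\mathcal{V}}$ and $\leq_{\mathcal{W}}$.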
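The successor map can be sketched in a few lines when both chains are represented by the integers $1, \dots, m$ and $1, \dots, n$. In this sketch, `None` plays the role of the void element; the function name `successors` and the integer encoding of the chains are assumptions made for illustration.

```python
from typing import Optional, Set, Tuple

Point = Tuple[int, int]


def successors(point: Point, m: int, n: int) -> Set[Point]:
    """Feasible warping steps from a point of the product of chains 1..m and 1..n."""
    i, j = point
    # Successor within each chain; None stands for the void element.
    i_next: Optional[int] = i + 1 if i < m else None
    j_next: Optional[int] = j + 1 if j < n else None
    candidates = {(i_next, j), (i, j_next), (i_next, j_next)}
    # Intersecting with the product of the chains drops every candidate
    # that contains the void element.
    return {(a, b) for a, b in candidates if a is not None and b is not None}
```

For example, `successors((1, 1), 2, 2)` contains the three points `(2, 1)`, `(1, 2)`, and `(2, 2)`, whereas the last point `(2, 2)` has no successor.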

###### Lemma A.3.

Let be the product of chains and . Suppose that is an element with . Then the following order preserving properties hold:

1. for all .

2. for all .

###### Proof.

Directly follows from the definitions of and . ∎

###### Lemma A.4.

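Let $\mathcal{U} = \mathcal{V} \times \mathcal{W}$ be the product of chains $\mathcal{V}$ and $\mathcal{W}$. Let $\mathcal{E} = \{e_1, \dots, e_L\} \subseteq \mathcal{U}$ be a subset consisting of $L$ elements such that $e_{l+1} \in S_{\mathcal{U}}(e_l)$ for all $l \in [L-1]$. Then $\mathcal{E}$ is a chain.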

###### Proof.

The successor map is order preserving with respect to the product order according to Lemma A.3. The assertion follows, because any order is transitive. ∎

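The chain $\mathcal{E}$ in Lemma A.4 is compatible with the successor map $S_{\mathcal{U}}$. We call such a chain a warping chain.

###### Definition A.5.

Let $\mathcal{U} = \mathcal{V} \times \mathcal{W}$ be the product of chains $\mathcal{V}$ and $\mathcal{W}$. A warping chain in $\mathcal{U}$ is a chain $\mathcal{E} = \{e_1, \dots, e_L\} \subseteq \mathcal{U}$ such that $e_{l+1} \in S_{\mathcal{U}}(e_l)$ for all $l \in [L-1]$.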

The next result shows that warping chains preserve the order of the factor chains.

###### Proposition A.6.

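Let $\mathcal{V}$ and $\mathcal{W}$ be chains. Let $\mathcal{E}$ be a warping chain in $\mathcal{V} \times \mathcal{W}$. Then any pair of elements $(i, j), (r, s) \in \mathcal{E}$ satisfies

$$(i \leq_{\mathcal{V}} r \,\wedge\, j \leq_{\mathcal{W}} s) \;\vee\; (r \leq_{\mathcal{V}} i \,\wedge\, s \leq_{\mathcal{W}} j). \tag{3}$$

###### Proof.

Suppose that $\mathcal{E} = \{e_1, \dots, e_L\}$. Then there are indices $p, q \in [L]$ such that $e_p = (i, j)$ and $e_q = (r, s)$. Without loss of generality, we assume that $p \leq q$. Let $u = q - p$. Repeatedly applying Lemma A.3 yields

$$e_p \leq_{\mathcal{U}} \cdots \leq_{\mathcal{U}} e_{p+u} = e_q,$$

where $\mathcal{U} = \mathcal{V} \times \mathcal{W}$. Since any order is transitive, we have $e_p \leq_{\mathcal{U}} e_q$. Then the assertion directly follows from the definition of the product order $\leq_{\mathcal{U}}$. ∎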

Equation (3) is the order-preserving property (or non-crossing property) of a warping chain. Note that the order-preserving property does not hold for all subsets of $\mathcal{V} \times \mathcal{W}$.
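The non-crossing property can be checked mechanically on small examples. The sketch below encodes chain elements as integers and points as integer pairs, which is an assumption made for illustration; `is_warping_chain` and `is_non_crossing` are hypothetical names.

```python
from itertools import combinations
from typing import Sequence, Tuple

Point = Tuple[int, int]
STEPS = {(1, 0), (0, 1), (1, 1)}


def is_warping_chain(points: Sequence[Point]) -> bool:
    """True if each point is a feasible warping step away from its predecessor."""
    return all(
        (b[0] - a[0], b[1] - a[1]) in STEPS for a, b in zip(points, points[1:])
    )


def is_non_crossing(points: Sequence[Point]) -> bool:
    """True if every pair of points satisfies the order-preserving property (3)."""
    return all(
        (i <= r and j <= s) or (r <= i and s <= j)
        for (i, j), (r, s) in combinations(points, 2)
    )
```

The warping chain `[(1, 1), (1, 2), (2, 3), (3, 3)]` passes both checks. The subset `[(1, 2), (2, 1)]` of the same grid fails `is_non_crossing`, which shows that the property is not shared by arbitrary subsets.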

### A.2 Warping Graphs

This section introduces the notion of warping graph that models the alignment of two time series by a warping path and studies its local and global structure.

A graph is a pair consisting of a finite set of nodes and a set of edges. A node is incident with an edge , if there is a node such that or . Similarly, an edge is said to be incident to node and to node . The neighborhood of node is the subset of nodes defined by . The elements of are the neighbors of . The degree of node in is the number of neighbors of .

A subgraph of graph is a graph such that and . We write to denote that is a subgraph of . A graph is connected, if for any two nodes there is a sequence of nodes in such that for all . A component of graph is a connected subgraph such that implies for every connected subgraph .

A graph is bipartite, if can be partitioned into two disjoint and non-empty subsets and such that . We write to denote a bipartite graph with node partitions and . Note that the order of the node partitions and in a bipartite graph matters. A bipartite chain graph is a bipartite graph whose node partitions are chains.

###### Definition A.7.

A bipartite chain graph with node partitions and is a warping graph of size if

1. (boundary condition)

2. is a warping chain in (step condition),

The set of all warping graphs of size is denoted by . If is a warping graph, we briefly write to denote the successor map and to denote the induced product order . The following result is a direct consequence of the boundary and step conditions:

###### Proposition A.8.

Every node in a warping graph has a neighbor.

We show that the neighborhood of a node of one partition of a warping graph is a contiguous chain of the other partition.

###### Proposition A.9.

Let be a warping graph with node partitions and . Suppose that is a node. Then the neighborhood of a node in is a contiguous subchain of .

###### Proof.

Suppose that the warping graph is of the form . Without loss of generality, we assume that and . Then we have . The assertion trivially holds for . Suppose that . We assume that is not contiguous. Then there are elements and such that . From Prop. A.8 follows that there is a node such that .

Two cases can occur: (1) and (2) . It is sufficient to consider the first case . The proof of the second case is analogous. By construction, there are edges and such that and . These relationships violate the order-preserving property of a warping chain given in Eq. (3) of Prop. A.6. Hence, is contiguous. ∎
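Proposition A.9 can be illustrated on a small example. The sketch below assumes that nodes are indexed by integers and that an edge is a pair `(i, j)` joining node `i` of the first partition with node `j` of the second. The helper names are hypothetical.

```python
from collections import defaultdict


def neighborhoods(edges):
    """Collect the neighbors of every node, separately for each partition."""
    nbrs_v, nbrs_w = defaultdict(list), defaultdict(list)
    for i, j in edges:
        nbrs_v[i].append(j)
        nbrs_w[j].append(i)
    return nbrs_v, nbrs_w


def is_contiguous(indices):
    """True if the sorted indices form a run without gaps."""
    ordered = sorted(indices)
    return all(b - a == 1 for a, b in zip(ordered, ordered[1:]))


def all_neighborhoods_contiguous(edges):
    nbrs_v, nbrs_w = neighborhoods(edges)
    every = list(nbrs_v.values()) + list(nbrs_w.values())
    return all(is_contiguous(n) for n in every)


# A warping chain of size 3 x 4: every neighborhood is a contiguous run.
path = [(1, 1), (1, 2), (1, 3), (2, 3), (3, 4)]
print(all_neighborhoods_contiguous(path))

# Not a warping chain: node 1 is joined to 1 and 3 but skips 2.
print(all_neighborhoods_contiguous([(1, 1), (1, 3)]))
```

The second edge set violates the step condition, which is why its neighborhood may contain a gap.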

We introduce compact warping graphs that represent warping paths of minimal length.

###### Definition A.10.

A warping graph is compact if there is no warping graph such that is a proper subgraph of .

A warping graph is compact if no edge can be deleted without violating the boundary or step conditions. Figure 6 shows an example of a non-compact warping graph and its compactification.
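The edge-deletion characterization of compactness lends itself to a brute-force check. The sketch below is an illustration rather than the authors' construction; it assumes integer-indexed nodes, edges listed in chain order, and the standard DTW steps (1, 0), (0, 1), (1, 1). All function names are hypothetical. The greedy routine `compactify` returns one compact subgraph; other deletion orders may yield a different one.

```python
# Assumption: the successor map allows the standard DTW steps.
STEPS = {(1, 0), (0, 1), (1, 1)}


def is_warping_chain(edges, m, n):
    """Boundary and step conditions for an edge list of size m x n."""
    if not edges or edges[0] != (1, 1) or edges[-1] != (m, n):
        return False
    return all(
        (b[0] - a[0], b[1] - a[1]) in STEPS
        for a, b in zip(edges, edges[1:])
    )


def removable_edges(edges, m, n):
    """Edges whose deletion leaves a valid warping chain."""
    return [
        e for k, e in enumerate(edges)
        if is_warping_chain(edges[:k] + edges[k + 1:], m, n)
    ]


def is_compact(edges, m, n):
    return is_warping_chain(edges, m, n) and not removable_edges(edges, m, n)


def compactify(edges, m, n):
    """Greedily delete removable edges until none is left."""
    edges = list(edges)
    while True:
        candidates = removable_edges(edges, m, n)
        if not candidates:
            return edges
        edges.remove(candidates[0])


path = [(1, 1), (2, 1), (2, 2), (3, 3)]
print(is_compact(path, 3, 3))
print(compactify(path, 3, 3))
```

In this example the edge (2, 1) can be deleted, because the step from (1, 1) to (2, 2) is a diagonal step.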

###### Proposition A.11.

Let be a warping graph with edge set . Then the following statements are equivalent:

1. is compact.

2. Let . Then for all .

###### Proof.

We first prove the following lemma:

###### Lemma.

Let and be two chains, let be a warping chain, and let . Then for every .

###### Proof.

Let be a chain and let be two elements of . Then we define the distance

$$\Delta_C(i_k, i_l) = |l - k| + 1.$$

Suppose that for some . Let be an arbitrary successor of . Then by definition of the successor map, we have . Let . Suppose that and . Then by induction, we have

$$\Delta_V(i, r) + \Delta_W(j, s) \geq k.$$

From follows or . Hence, or . This shows the assertion . ∎

It follows from the above lemma that the case is impossible for a warping chain. Therefore, it is sufficient to consider the case .

Let be compact. We assume that there is an such that . By construction, the edge is an inner element of the chain . From follows that removing violates neither the boundary condition nor the step condition. This contradicts compactness of and shows that compactness implies the second statement.

Next, we show the opposite direction. Suppose that for all . We assume that is not compact. Then there is an edge that can be removed without violating the boundary and step conditions. Not violating the boundary condition implies that . Hence, and are edges in . We set . Then we obtain the contradiction that . Hence, is compact. ∎

Suppose that is a warping graph. By we denote the disjoint union of the node partitions. If is a node of one partition, then its neighborhood is a subset of the other node partition. Hence, is a chain and has boundary nodes and possibly inner nodes. Let denote the possibly empty subset of inner nodes of chain . We show that inner nodes of always have degree one.

###### Lemma A.12.

Let be a warping graph. Suppose that is a node with neighborhood . Then for all .

###### Proof.

Without loss of generality, we assume that . Then is a chain. The assertion holds for , because in this case has no inner node. Suppose that . Let be an inner node. We assume that . Then there is a node such that . Since is a chain, we find that either or .

We only consider the first case . The proof of the second case is analogous. Observe that is an inner node of and is contiguous by Prop. A.9. Then contains the predecessor of node . By construction, and are edges of such that and . Thus, the edges and violate the order-preserving property of a warping chain given in Eq. (3) of Prop. A.6. This contradicts our assumption that is a warping chain. Hence, we have . Since the inner node was chosen arbitrarily, the assertion follows. ∎
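The statement of Lemma A.12 can be checked on concrete edge sets. The sketch below assumes integer-indexed nodes and tags each node with a partition label (`"V"` or `"W"`, chosen here for illustration) so that the two partitions stay disjoint. The function names are hypothetical.

```python
from collections import defaultdict


def adjacency(edges):
    """Adjacency sets of the bipartite graph; nodes are (label, index) pairs."""
    adj = defaultdict(set)
    for i, j in edges:
        v, w = ("V", i), ("W", j)
        adj[v].add(w)
        adj[w].add(v)
    return dict(adj)


def inner_nodes_have_degree_one(edges):
    """Check that every inner node of every neighborhood has degree one."""
    adj = adjacency(edges)
    for nbrs in adj.values():
        chain = sorted(nbrs)          # neighbors ordered by index
        for inner in chain[1:-1]:     # drop the two boundary nodes
            if len(adj[inner]) != 1:
                return False
    return True


# Warping chain: the inner neighbor W2 of V1 has degree one.
print(inner_nodes_have_degree_one([(1, 1), (1, 2), (1, 3), (2, 3)]))

# Not a warping chain: W2 is an inner neighbor of V1 but also joined to V2.
print(inner_nodes_have_degree_one([(1, 1), (1, 2), (1, 3), (2, 2)]))
```

The second edge set breaks the order-preserving property, which is exactly the situation the proof rules out.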

###### Lemma A.13.

Let be a warping graph and let be a node with neighborhood . Suppose that and is a boundary node with . Then the following properties hold:

1. If is the first node in , then exists and .

2. If is the last node in , then exists and .

###### Proof.

We show the second assertion. The proof of the first assertion is analogous. Since and is the last node of , we find that and therefore exists. From follows that there is a node such that . We assume that satisfies . This implies that and are two edges of such that and . Thus, the edges and violate the order-preserving property of a warping chain given in Eq. (3) of Prop. A.6. This contradicts our assumption that . Therefore, we have . This in turn shows that exists. From and follows , because is a contiguous subchain of by Prop. A.9. This shows and completes the proof. ∎

A bipartite graph is complete if . Let . A complete bipartite graph is a star graph of the form , if and . Similarly, is a star graph of the form , if and . By definition, a star graph has at least two nodes. A star forest is a graph whose components are star graphs.
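The definitions of star graph and star forest translate directly into a component-wise test. The sketch below assumes integer-indexed nodes tagged with a partition label (`"V"` or `"W"`, chosen here for illustration). It treats a component as a star if it has at least two nodes and one center adjacent to all other nodes, each of which has degree one. The function names are hypothetical.

```python
from collections import defaultdict


def adjacency(edges):
    """Adjacency sets of the bipartite graph; nodes are (label, index) pairs."""
    adj = defaultdict(set)
    for i, j in edges:
        v, w = ("V", i), ("W", j)
        adj[v].add(w)
        adj[w].add(v)
    return dict(adj)


def components(adj):
    """Connected components, found by depth-first search."""
    seen, comps = set(), []
    for start in adj:
        if start in seen:
            continue
        comp, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node not in comp:
                comp.add(node)
                stack.extend(adj[node] - comp)
        seen |= comp
        comps.append(comp)
    return comps


def is_star(comp, adj):
    """One center joined to all other nodes, which all have degree one."""
    return len(comp) >= 2 and any(
        adj[c] == comp - {c}
        and all(len(adj[x]) == 1 for x in comp - {c})
        for c in comp
    )


def is_star_forest(edges):
    adj = adjacency(edges)
    return all(is_star(comp, adj) for comp in components(adj))


# Two stars: V1 with {W1, W2}, and W3 with {V2, V3}.
print(is_star_forest([(1, 1), (1, 2), (2, 3), (3, 3)]))

# A path on four nodes (W2, V2, W1, V1) has two nodes of degree two.
print(is_star_forest([(1, 1), (2, 1), (2, 2)]))
```

Assuming the standard DTW steps, the first edge set is a warping chain from which no edge can be deleted, whereas in the second the edge (2, 1) can be deleted.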

###### Proposition A.14.

A compact warping graph is a star forest.

###### Proof.

Let be a compact warping graph. Suppose that is a component of . It follows from Prop. A.8 that has at least two nodes connected by an edge.

We assume that is not a star. Then has two nodes with degree larger than one. Without loss of generality, we assume that . Then has at least two elements. Suppose that all nodes from have degree one. Since component is bipartite, we find that is isomorphic to the star , where . This contradicts our assumption that is not a star. Hence, there is a node with .

It follows from Lemma A.12 that node is a boundary node of . We show the assertion for the case that is the last node in . The proof for the case that is the first node in is analogous. Since is the last node in and , we have and therefore . Applying Lemma A.13 yields that exists and .

By construction, we have . This shows that is not a boundary edge in . Since , we can remove without violating the step condition. Then the subgraph