Times series averaging and denoising from a probabilistic perspective on time-elastic kernels

Times series averaging and denoising from a probabilistic perspective on time-elastic kernels

Pierre-Francois Marteau, 
E-mail: see http://people.irisa.fr/Pierre-Francois.Marteau/
P.-F. Marteau is with UMR CNRS IRISA, Université Bretagne Sud, F-56000 Vannes, France.
Abstract

In the light of regularized dynamic time warping kernels, this paper re-considers the concept of time elastic centroid for a set of time series. We derive a new algorithm based on a probabilistic interpretation of kernel alignment matrices. This algorithm expresses the averaging process in terms of a stochastic alignment automata. It uses an iterative agglomerative heuristic method for averaging the aligned samples, while also averaging the times of occurrence of the aligned samples. By comparing classification accuracies for 45 heterogeneous time series datasets obtained by first nearest centroid/medoid classifiers we show that: i) centroid-based approaches significantly outperform medoid-based approaches, ii) for the considered datasets, our algorithm that combines averaging in the sample space and along the time axes, emerges as the most significantly robust model for time-elastic averaging with a promising noise reduction capability. We also demonstrate its benefit in an isolated gesture recognition experiment and its ability to significantly reduce the size of training instance sets. Finally we highlight its denoising capability using demonstrative synthetic data: we show that it is possible to retrieve, from few noisy instances, a signal whose components are scattered in a wide spectral band.

Time series averaging and Time elastic kernel and Dynamic Time Warping and Hidden Markov Model and Classification and Denoising.

1 Introduction

Since Maurice Fréchet’s pioneering work [frechet1906] in the early 1900s, time-elastic matching of time series or symbolic sequences has attracted much attention from the scientific community in numerous fields such as information indexing and retrieval, pattern analysis, extraction and recognition, data mining, etc. This approach has impacted a very wide spectrum of applications addressing socio-economic issues such as the environment, industry, health, energy, defense and so on.

Among other time elastic measures, Dynamic Time Warping (DTW) was widely popularized during the 1970s with the advent of speech recognition systems [VelichkoZagoruyko1970], [SakoeChiba1971], and numerous variants that have since been proposed to match time series with a certain degree of time distortion tolerance.

The main issue addressed in this paper is time series or shape averaging in the context of a time elastic distance. Time series averaging or signal averaging is a long-standing issue that is currently becoming increasingly prevalent in the big data context; it is relevant for de-noising [Kaiser1979], [Hassan2010], summarizing subsets of time series [Petitjean2011], defining significant prototypes, identifying outliers [Gupta2014], performing data mining tasks (mainly exploratory data analysis such as clustering) and speeding up classification [Petitjean2014], as well as regression or data analysis processes in a big data context.

In this paper, we specifically tackle the question of averaging subsets of time series, not from considering the DTW measure itself as has already been largely exploited, but from the perspective of the so-called regularized DTW kernel (KDTW). From this new perspective, the estimation of a time series average or centroid can be readily addressed with a probabilistic interpretation of kernel alignment matrices allowing a precise definition of the average of a pair of time series from the expected value of local alignments of samples. The tests carried out so far demonstrate the robustness and the efficiency of this approach compared to the state of the art approach.

The structure of this paper is as follows: the introductory section, the second section summarizes the most relevant related studies on time series averaging as well as DTW kernelization. In the third section, we derive a probabilistic interpretation of kernel alignment matrices evaluated on a pair of time series by establishing a parallel with a forward-backward procedure on a stochastic alignment automata. In the fourth section, we define the average of a pair of time series based on the alignment expectation of pairs of samples, and we propose an algorithm designed for the averaging of any subset of time series using a pairwise aggregating procedure. We present in the fifth section three complementary experiments to assess our approach against the state of the art, and conclude.

2 Related works

Time series averaging in the context of (multiple) time elastic distance alignments has been mainly addressed in the scope of the Dynamic Time Warping (DTW) measure [VelichkoZagoruyko1970], [SakoeChiba1971]. Although other time elastic distance measures such as the Edit Distance With Real Penalty (ERP) [Chen04ERP] or the Time Warp Edit Distance (TWED) [Marteau09TWED] could be considered instead, without loss of generality, we remain focused throughout this paper on DTW and its kernelization.

2.1 DTW and time elastic average of a pair of time series

A classical formulation of DTW can be given as follows. If is a fixed positive integer, we define a time series of length as a multidimensional sequence , such that, , .

Definition 2.1

If and are two time series with respective lengths and , an alignment path of length between and is represented by a sequence

such that , , and (using the notation , for all , .

We define and , as the index access functions at step of the mapped elements in the pair of aligned time series.

In other words, a warping path defines a way to travel along both time series simultaneously from beginning to end; it cannot skip a point, but it can advance one time step along one series without advancing along the other, thereby justifying the term time-warping.

If is a distance on , the global cost of a warping path is the sum of distances (or squared distances or local costs) between pairwise elements of the two time series along , i.e.:

A common choice of distance on is the one generated by the norm.

Definition 2.2

For a pair of finite time series and , any warping path has a finite length, and thus the number of existing warping paths is finite. Hence, there exists at least one path whose cost is minimal, so we can define as the minimal cost taken over all existing warping paths. Hence

(1)
Definition 2.3

From the DTW measure, [Gupta1996]  have defined the time elastic average of a pair of time series and as the time series whose elements are , , where mean corresponds to the definition of the mean in Euclidean space.


(a) Pairwise average (top) and Progressive agglomeration (bottom)

(b) Iterative agglomeration with refinement
Fig. 1: Pairwise averaging (top left), progressive hierarchical with similar first agglomeration (bottom left) v.s. iterative agglomeration (right) strategies. Final centroid approximations are presented in red bold color. Temporary estimations are presented using a bold dotted black line

2.2 Time elastic centroid of a set of time series

A single alignment path is required to calculate the time elastic centroid of a pair of time series (Def. 2.1). However, multiple path alignments need to be considered to evaluate the centroid of a larger set of time series. Multiple alignments have been widely studied in bioinformatics [Fasman1998], and it has been shown that determining the optimal alignment of a set of sequences under the sum of all pairs (SP) score scheme is a NP-complete problem [WangJ1994] [Just99]. The time and space complexity of this problem is , where is the number of sequences in the set and is the length of the sequences when using dynamic programming to search for an optimal solution [Carrillo1988]. This latter result applies to the estimation of the time elastic centroid of a set of time series with respect to the DTW measure. Since the search for an optimal solution becomes rapidly intractable with increasing , sub-optimal heuristic solutions have been subsequently proposed, most of them falling into one of the following three categories.

2.2.1 Progressive heuristics

Progressive heuristic methods estimate the time elastic centroid of a set of time series by combining pairwise centroids (Def. 2.3). This kind of approach constructs a binary tree whose leaves correspond to the time series of the data set, and whose nodes correspond to the calculation of a local pairwise centroid, such that, when the tree is complete, the root is associated with the estimated data set centroid. The proposed strategies differ in the way the tree is constructed. One popular approach consists of providing a random order for the leaves, and then constructing the binary tree up to the root using this ordering [Gupta1996]. Another approach involves constructing a dendrogram (a hierarchical ascendant clustering) from the data set and then using this dendrogram to calculate pairwise centroids starting with the closest pairs of time series and progressively aggregating series that are farther away[Niennattrakul2009] as illustrated on the left of Figure 1. Note that these heuristic methods are entirely based on the calculation of a pairwise centroid, so they do not explicitly require the evaluation of a DTW centroid for more than two time series. Their degree of complexity varies linearly with the number of time series in the data set.

2.2.2 Iterative heuristics

Iterative heuristics are based on an iterated three-step process. For a given temporary centroid candidate, the first step consists of calculating the inertia, i.e. the sum of the DTW distances between the temporary centroid and each time series in the data set. The second step (Figure (a)a top) evaluates the best pairwise alignment with the temporary centroid , of length , for each time series in the data set (), where is the timestamp. A new time series of length , is thus constructed that contains the contributions of all the samples of time series , but with time being possibly stretched (duplicate samples) or compressed (average of successive samples) according to the best alignment path as exemplified in Figure (a)a, top left side. The third step consists in producing a new temporary centroid candidate from the set by successively averaging (in the sense of the Euclidean centroid), the samples at every timestamp of the time series. Basically, we have .

Then, the new centroid candidate replaces the previous one and the process is iterated until the inertia is no longer reduced or the maximum number of iterations is reached. Generally, the first temporary centroid candidate is taken as the DTW medoid of the considered data set. This process is illustrated on Figure (b)b. The three steps of this heuristic method were first proposed in [Abdulla2003]. The iterative aspect of this heuristic approach was initially introduced by [Hautamaki2008] and refined by [Petitjean2011] who introduced the DTW Barycenter Averaging (DBA) algorithm. Note that, in contrast to the progressive method, this kind of approach needs to evaluate, at each iteration, all the alignments with the current centroid candidate. The complexity of the iterative approach is higher than the progressive approach, the extra computational cost being linear with the number of iterations. More sophisticated approaches have been proposed to escape some local minima. For instance [Petitjean2012] have evaluated a genetic algorithm for managing a population of centroid candidates, thus improving with some success the straightforward iterative heuristic methods.

2.2.3 Optimization approaches

Given the entire set of time series and a subset of time series , optimization approaches attempt to estimate the centroid of from the definition of an optimization problem, which is generally expressed by equation (2) given below:

(2)

Among other works, some attempt to use this kind of direct approach for the estimation of time elastic centroid was recently addressed in [ZhouDLTore2009], [ZhouDLTore2016] and [SoheilyKhah2016].

In [ZhouDLTore2009] the authors detail a Canonical Time Warp (CTW) and a Generalized version of it (GCTW) [ZhouDLTore2016] that combines DTW and CCA (Canonical Correlation Analysis) for temporally aligning multi-modal motion sequences. From a least square formulation for DTW, a non-convex optimization problem is handled by means of a coordinate-descent approach that alternates between multiple temporal alignments using DTW (or a variant exploiting a set of basis functions to parameterized the warping paths) and spatial projections using CCA (or a multi-set extension of CCA). Whilst these approaches have not been designed to explicitly propose a centroid estimation, they do provide multi-alignment paths that can straightforwardly be used to compute a centroid estimate. As an extension to CTW, GCTW requires the set-up of generally ”smooth” function basis that constrain the shape of the admissible alignment paths. This ensures the computational efficiency of GCTW, but in return it may induce some drawback, especially when considering the averaging of ”unsmoothed” time series that may involve very ”jerky” alignment paths. The choice of this function basis may require some expertise on the data.

In [SoheilyKhah2016], a non-convex constrained optimization problem is derived, by integrating a temporal weighting of local sample alignments to highlight the temporal region of interest in a time series data set, thus penalizing the other temporal regions. Although the number of parameters to optimize is linear with the size and the dimensionality of the time series, the two steps gradient-based optimization process they derived is very computationally efficient and shown to outperform the state of the art approaches on some challenging scalar and multivariate data sets. However, as numerous local optima exist in practice, the method is not guaranteed to converge towards the best possible centroid, which is anyway the case in all other approaches. Furthermore, their approach, due to combinatorial explosion, cannot be adapted for time elastic kernels like the one addressed in this paper and described in section 2.4.

2.3 Discussion and motivation

According to the state of the art in time elastic centroid estimation, an exact centroid, if it exists, can be calculated by solving a NP-complete problem whose complexity is exponential with the number of time series to be averaged. Heuristic methods with increasing time complexity have been proposed since the early 2000s. Simple pairwise progressive aggregation is a less complex approach, but which suffers from its dependence on initial conditions. Iterative aggregation is reputed to be more efficient, but entails a higher computational cost. It could be combined with ensemble methods or soft optimization such as genetic algorithms. The non-convex optimization approach has the merit of directly addressing the mathematical formulation of the centroid problem in a time elastic distance context. This approach nevertheless involves a higher complexity and must deal with a relatively large set of parameters to be optimized (the weights and the sample of the centroid). Its scalability could be questioned, specifically for high dimensional multivariate time series.

It should also be mentioned that some criticism of these heuristic methods has been made in [Niennattrakul2007]. Among other drawbacks, the fact that DTW is not a metric could explain the occurrence of unwanted behavior such as centroid drift outside the time series cluster to be averaged. We should also bear in mind that keeping a single best alignment can increase the dependence of the solution on the initial conditions. It may also increase the aggregating order of the time series proposed by the chosen method, or potentially enhance the convergence rate.

In this study, we do not directly address the issue of time elastic centroid estimation from the DTW perspective, but rather from the point of view of the regularized dynamic time warping kernel (KDTW) [MarteauGibet2014]. Although this perspective allows us to consider centroid estimation as a preimage problem, which is in itself another optimization perspective, we rather show that the KDTW alignment matrices computation can be described as the result of applying a forward-backward algorithm on a stochastic alignment automata. This probabilistic interpretation of the pairwise alignment of time series leads us to propose a robust averaging scheme for any set of time series that interpolate jointly along the time axis and in the sample space. Furthermore, this scheme significantly outperforms the current state of the art method, as shown by our experiments.

2.4 Time elastic kernels and their regularization

The Dynamic Time Warping (DTW) distance between two time series and of lengths and respectively, [VelichkoZagoruyko1970], [SakoeChiba1971] as defined in equation (2.2) can be recursively evaluated as

(3)

where is the Euclidean distance defined on between the two positions in sequences and taken at times and , respectively.

Apart from the fact that the triangular inequality does not hold for the DTW distance measure, it is not possible to define a positive definite kernel directly from this distance. Hence, the optimization problem, which is inherent to the learning of a kernel machine, is no longer convex and could be a source of limitation due to the emergence of local minima.

Regularized DTW: seminal work by [CuturiVert2007], prolonged recently by [MarteauGibet2014] leads us to propose new guidelines to ensure that kernels constructed from elastic measures such as DTW are positive definite. A simple instance of such a regularized kernel, derived from [MarteauGibet2014], can be expressed as a convolution kernel, which makes use of two recursive terms:

(4)

where is the Kronecker symbol, is a stiffness parameter which weights the local contributions, i.e. the distances between locally aligned positions, is a distance defined on , and is a symmetric binary non negative function, usually in , used to define a symmetric corridor around the main diagonal to limit the ”time elasticity” of the kernel. For the remaining of the paper we will not consider any corridor, hence everywhere.

The initialization is simply .

The main idea behind this regularization is to replace the operators and (which prevent symmetrization of the kernel) by a summation operator. This allows us to consider the best possible alignment, as well as all the best (or nearly the best) paths by summing their overall cost. The parameter is used to check what is termed as nearly-the-best alignment, thus penalizing alignments that are too far away from the optimal ones. This parameter can be easily optimized through a cross-validation.

For each alignment path, KDTW evaluates the product of local alignment costs occurring along the path. This product can be very small depending on the size of the time series and the selected value for . This is the source for a diagonal dominance problem in the Gram matrix. But, above all, this requires to balance the choice of the value according to the lengths of the matched time series. This is the main (and probably the only) limitation of the KDTW kernel: the selectivity or bandwidth of the local alignment kernels needs to be adjusted according to the lengths of the matched time series.

3 Stochastic alignment process

To introduce a probabilistic paradigm to the time elastic averaging of time series, we first consider the pairwise alignment process as the output of a stochastic automata. The stochastic alignment process that we propose finds its roots in the forward-backward algorithm defined for the learning of Hidden Markov Models (HMM) [Rabiner89] and in the parallel between HMM and DTW that is proposed in [Juang85], [NakagawaNakanishi1989] and in a more distant way in [Chudova2002]. However we differ from these founding works (and others) in the following

  1. we do not construct a parallel with DTW, but with its kernelized variant KDTW

  2. [NakagawaNakanishi1989] only consider an optimal alignment path (exploiting the Viterbi algorithm) while we consider the whole set of possible alignments (as in [Juang85])

  3. [Juang85] construct an asymmetric classical left-right HMM (one time series plays the role of the observation sequence, while the other plays the role of the state sequence). With a similar idea [Chudova2002] proposes a generative mixture model along a discrete time grid axis with local and global time warp capability. We construct instead an alignment process, that sticks on the DTW recursive definition without any other hypothesis on the structure of the automata, and for which the two aligned time series play the role of the observation sequence, and the set of states corresponds to the set of all possible sample pairs alignments.

3.1 pairwise alignment of time series as a Markov model

Let and be two discrete time series (observations) of length and respectively. To align this two time series, we define a stochastic alignment automata as follows. First we consider the set of state variables . Each characterizes the alignment between observed samples and . The posterior probability for all state variables, , given the sequences of observations and is .

The transitions probabilities between states are driven by a tensor , where , and . can be defined accordingly to the standard DTW definition, namely

(5)

The factor ensures that the transition matrix equivalent to is stochastic, basically

(6)

Notice that any tensor satisfying equation (6) could be considered at this level instead of the previous DTW surrogate tensor.

Furthermore, each state is observable through the so-called emission probabilities which are defined by a set of functions , where , and . The functions are normalized such that .

Here we differ from the classical HMM: the first difference lies in the nature of the observation sequence itself. Unlike HMM, our observation consists of a pair of subsequences that are not traveled necessarily synchronously, but according to the structure of the transition tensor . For instance, given the DTW tensor described by equation (5), from a current state associated to the alignment , three possible alignments can be reached at the next transition: , or .

The second difference with classical HMM is that the emission probabilities are independent from the state, such that . We use a local (density) kernel to estimate these probabilities as follows

(7)

where is the normalization coefficient.

Consequently, given the two observation sequences and , we define the emission probability matrix , for and

Finally let be the initial state probability vector defined by , if , otherwise.

Thereby, the stochastic alignment automata is fully specified by the triplet , where only depends on the lengths and of the observations, and depends on the complete pair of observations and .

3.2 Forward-backward alignment algorithm

We derive the forward-backward alignment algorithm for our stochastic alignment automata from its classical derivation that was defined for Hidden Markov Models [Rabiner89].

For all , the posterior probability is decomposed into forward/backward recursions as follows:

(8)

The last equality results from the application of the Bayes rule and the conditional independence of and given , .

Let be the probability of the alignment of the pair of partial observation sequences produced by all possible state sequences that end at state . can be recursively evaluated as the forward procedure

(9)

where is the subset of states allowing to reach the state in a single transition. For the DTW tensor (Eq. 5), we have .
Notice that in this case .

Similarly let be the probability of the alignment of the pair of partial sequences given starting state . can be recursively evaluated as the backward procedure

(10)

where is the subset of states that can be reached from the state in a single transition. For the DTW tensor (Eq. 5), we have .

Hence from Eq. 8, we get

(11)

Any tensor satisfying equation (6) is not eligible: for the and recursions to be calculable, one has to impose linearity. Basically cannot depend on any that is not previously evaluated. The constraint we need to impose is that the time stamps are locally increasing, i.e. if depends on any , then necessarily and or and . The same applies for the recursion.

Fig. 2: Forward Backward matrix (logarithmic values) for the alignment of a positive halfwave with a sinus wave. The dark red color represents high probability states, while dark blue color represents low probability states.

As an example, Figure 2 presents the Forward Backward () matrix () corresponding to the alignment of a positive half-wave with a sinus wave. The three areas of likely alignment paths are clearly identified in dark red colors.

3.3 Parallel with KDTW

A direct parallel exists between KDTW and the previous Markov process. It follows from the forward equation (Eq. 9) that

(12)

where is defined in equation (5), and , defined in equation (7), is such that . Hence, the recursion coincides exactly with the forward recursion (Eq. 9). Similarly, we can assimilate the backward recursion (eq. 10) to the evaluation of the pair of time series obtained by inverting and along the time axis. Hence, the forward-backward matrix elements (eq. 11) can be directly expressed in terms recursions.

Furthermore, the corridor function that occurs in the recursion (Eq. 4) modifies directly the structure of the transition tensor by setting whenever or . Neighbor states may be affected also by the normalization that is required to maintain stochastic.

3.4 Time elastic centroid estimate of a set of time series

Let us introduce the marginal probability of subset given the observations and , namely that sample is aligned with the samples of

(13)

and let us consider, for all and , the conditional probability of state given the two observation sequences, parameter and , namely the probability that and are aligned given the knowledge that is aligned with one of the samples of .

(14)

The previous equality is easily established because .

Note that for estimating we only need to evaluate the forward () and backward () recursions, since , the numerator term in Eq.11, is eliminated.

We can then define the expectation of the samples of that are aligned with sample (given that is aligned) as well as the expectation of time of occurrence of the samples of that are aligned with as follows:

(15)
Fig. 3: Centroids obtained for the CBF data set. For the three shapes, the expected start (24) and end (88) time stamps (hence the expected shape duration of 64 frames) are correctly extracted

The Expectation equations (Eq. 15) are at the basis of our procedure for averaging a set of time series.

Let be a set of time series and a reference time series ( can be initially setup as the medoid of set ). The centroid estimate of is defined as the pair where is a time series of length and is the sequence of time stamps associated to the samples of

(16)

Obviously, is a non uniformly sampled time series for which is the time stamp associated to observation . could be understood as the expected time of occurrence of the expected observation . A uniform re-sampling can straightforwardly be used to get back to a uniformly sampled time series.

The proposed iterative agglomerative algorithm (cf. Fig. 1-b), called TEKA (Time Elastic Kernel Averaging), that provides a refinement of the centroid estimation at each iteration until reaching a (local) optimum is presented in algorithm (1).

As an example, figure (3) presents the time elastic centroid estimates obtained, using algorithm (1) with , for the Cylinder c(t), Bell, b(t) Funnel, f(t), synthetic functions [Saito1994] defined as follows


    
    
where if , if , and are obtained from a standard normal distribution , is an integer obtained from a uniform distribution in and is another integer obtained from another uniform distribution in . Hence such shapes are characterized with start and end time stamps of and respectively, and a shape duration of samples. Figure (3) clearly shows that, from a subset of 300 time series (100 for each category), the algorithm has correctly recovered the start and end shape events (hence the expected shape duration) for all three shapes.

1:Let be a similarity time elastic kernel for time series satisfying eq. (3.3)
2:Let be a set of time series of dimensional samples
3:Let be an initial centroid estimate (e.g. the medoid of ) of length
4:Let and be two sequences of time stamps of length initialized with zero values
5:Let and be two double values;
6:repeat
7:     , , ;
8:     Evaluate and according to Eq. (16)
9:     //Average similarity between and elements
10:     =
11:until 
12:(, ) is the centroid estimation
13:Uniformly re-sample using the time stamps
Algorithm 1 Iterative Time Elastic Kernel Averaging (TEKA) of a set of time series


DBA
CTW
TEKA

TABLE I: Centroid estimation for the three categories of the CBF dataset and for the three tested algorithms: DBA (top), CTW (center) TEKA (bottom). The centroid estimations are indicated as a bold black line superimposed on top of the time series (in light red) that are averaged.

The figures presented in Table I compare the centroid estimates provided by the iterated DBA [Petitjean2012], CTW [ZhouDLTore2009] and TEKA algorithms. For the experiment, the DBA and TEKA algorithms were iterated at most 10 times. The centroid estimates provided by the TEKA algorithm are much smoother than the ones provided by DBA or CTW. This denoising property, expected from any averaging algorithm, will be addressed in a dedicated experiment (c.f. subsection 4.3).

3.5 Role of parameter

In practice, the selectivity or bandwidth of the local alignment kernels (that is controlled by parameter ) has to be adapted according to the the lengths of the time series. If the time series are long, then should be reduced to maintain the calculability of the forward-backward matrices, and the local selectivity decreases. Hence, more alignment paths are likely and more sample pairs participate to the calculation of the average such that local details are filtered out by the averaging. Conversely if the time series are short, can be increased, hence fewer sample pairs participate to the calculation of the average, and details can be preserved.

3.6 Computational complexity

TEKA has intrinsically the same algorithmic complexity than the DBA algorithm, basically for each pairwise averaging, where is the average length of the time series. Nevertheless, computationally speaking, TEKA algorithm is slightly more costly mainly because of two reasons:

  • the FB matrix induces a factor three in complexity because of the reverse alignment and the multiplication term by term of the forward and backward matrices.

  • the exponential terms that enter into the computation of KDTW (Eq. (4)) are costly, basically , where is the cost of the floating point multiplication, and is the number of digits. This induces another factor 2 or 3 depending on the chosen floating point precision.

The overall algorithmic cost for averaging a set of time series of average length with an average number of iterations is, for the two algorithms, .

Some optimization are indeed possible, in particular replacing the exponential function by another local kernel easier to compute is an important source of algorithmic simplification. We do not address further this issue in this paper and let it stand as a perspective.

4 Experiments

The two first proposed experiments aim at demonstrating the benefits of using time elastic centroids in a data reduction paradigm: 1-NC/NM (first near centroid or medoid) classification for the first one, and isolated gesture recognition for the second one using 1-NC/NM and SVM classifiers in conjunction with the KDTW kernel. The third experiment explores the noise reduction angle brought by time elastic centroids.

4.1 1-Nearest Centroid/Medoid classification

The purpose of this experiment is to evaluate the effectiveness of the proposed time elastic averaging method (TEKA) against a triple baseline. The first baseline allow us to compare centroid-based with medoid-based approaches. The second and third baselines are provided by the DBA [Petitjean2012] and CTW [ZhouDLTore2009] algorithms (thanks to the implementation proposed by the authors), currently considered as state of the art methods to average a set of sequences consistently with DTW. We have tested the CTW averaging with a 1-NC-DTW (CTW1) and a 1-NC-KDTW (CTW2) classifier to highlight the impact of the selected similarity measure.

For this purpose, we empirically evaluate the effectiveness of the methods using a first nearest centroid/medoid (1-NC/NM) classification task on a set of time series derived from widely diverse fields of application. The task consists of representing each category contained in a training data set by estimating its medoid or centroid and then evaluating the error rate of a 1-NC classifier on an independent testing data set. Hence, the classification rule consists of assigning to the tested time series the category which corresponds to the closest (or most similar) medoid or centroid according to the DTW measure for DTW medoid (DTW-M), DBA and CTW centroids (CTW1) or to KDTW measure for KDTW medoid (KDTW-M), CTW (CTW2) and TEKA centroids.

In [Petitjean2014] a generalized k-NC task is described. The authors demonstrate that by selecting the appropriate number of centroids (using DBA and k-means), they achieve, without loss, a 70% speed-up in average, compared to the original k-Nearest Neighbor task. Although, in general, the classification accuracy is improved when several centroids are used to represent the training datasets, our main purpose is to highlight and amplify the discrimination between time series averaging methods: this is why we stick here with the 1-NC task.

DATASET # Cat L DTW-M DBA CTW1 CTW2 KDTW-M TEKA
Synthetic_Control 660 3.00 2.00 19.00 3.33 3.33 2.33
Gun_Point 2150 44.00 32.00 54.67 25.33 52.00 27.33
CBF 3128 7.89 5.33 34.22 3.55 8.11 3.33
Face_(all) 14131 25.21 18.05 34.38 27.93 20.53 13.61
OSU_Leaf 6427 64.05 56.20 64.05 57.02 53.31 50.82
Swedish_Leaf 15128 38.56 30.08 32 25.76 31.36 22.08
50Words 50270 48.13 41.32 48.57 36.48 23.40 19.78
Trace 4275 5.00 7.00 6.00 18 23.00 16.00
Two_Patterns 4128 1.83 1.18 26.75 37.75 1.17 1.10
Wafer 2152 64.23 33.89 37.83 33.27 43.92 8.38
Face_(four) 4350 12.50 13.64 19.32 15.91 17.05 10.23
Lightning-2 2637 34.43 37.70 37.70 29.51 29.51 29.51
Lightning-7 7319 27.40 27.40 41.10 38.35 19.18 16.44
ECG200 296 32.00 28.00 27.00 25 29.00 26.00
Adiac 37176 57.54 52.69 54.73 34.78 40.67 32.22
Yoga 2426 47.67 47.87 53.56 48.97 47.53 44.90
Fish 7463 38.86 30.29 39.42 22.28 20.57 14.28
Beef 5470 60.00 53.33 53.33 50 53.33 50
Coffee 2286 57.14 32.14 32.14 28.57 32.14 32.14
OliveOil 4570 26.67 16.67 13.33 23.33 30 16.67
CinC_ECG_torso 41639 74.71 53.55 73.33 42.90 66.67 33.04
ChlorineConcentration 3166 65.96 68.15 67.40 67.97 65.65 64.97
DiatomSizeReduction 4345 22.88 5.88 5.23 2.61 11.11 2.94
ECGFiveDays 2136 47.50 30.20 34.49 13.47 11.38 16.37
FacesUCR 14131 27.95 18.44 32.20 21.66 20.73 12.19
Haptics 51092 68.18 64.61 58.77 57.47 63.64 53.57
InlineSkate 71882 78.55 76.55 81.64 82.18 78.36 75.09
ItalyPowerDemand 224 31.68 20.99 15.84 9.33 5.05 6.61
MALLAT 81024 6.95 6.10 5.24 3.33 6.87 3.66
MedicalImages 1099 67.76 58.42 58.29 59.34 57.24 59.60
MoteStrain 284 15.10 13.18 19.01 15.33 12.70 9.35
SonyAIBORobot_SurfaceII 265 26.34 21.09 20.57 17.52 26.230 19.30
SonyAIBORobot_Surface 270 38.10 19.47 14.48 9.31 39.77 17.95
Symbols 6398 7.64 4.42 22.31 20.70 3.92 4.02
TwoLeadECG 282 24.14 13.17 20.37 19.23 27.04 18.96
WordsSynonyms 25270 70.85 64.26 78.84 63.32 64.26 56.11
Cricket_X 12300 67.69 52.82 78.46 73.85 61.79 52.82
Cricket_Y 12300 68.97 52.82 69.74 65.64 46.92 50.25
Cricket_Z 12300 73.59 48.97 78.21 64.36 56.67 51.79
uWaveGestureLibrary_X 8315 38.97 33.08 37.33 34.61 34.34 32.18
uWaveGestureLibrary_Y 8315 49.30 44.44 45.42 41.99 42.18 39.64
uWaveGestureLibrary_Z 8315 47.40 39.25 47.65 39.36 41.96 39.97
PWM2 3128 43.00 35.00 63.66 6.33 21.00 4.33
uWaveGestureLibrary_3D 8315 10.11 5.61 9.35 7.68 13.74 7.73
CharTrajTT_3D 20178 11.026 9.58 13.45 15.05 6.93 4.99
# Best Scores - 1 7 0 9 6 27
# Uniquely Best Scores - 1 5 0 7 5 23
Average rank - 4.56 2.87 4.62 2.97 3.22 1.6
TABLE II: Comparative study using the UCR and UCI data sets: classification error rates evaluated on the TEST data set (in %) obtained using the first nearest neighbour classification rule for DTW-M, KDTW-M, (medoids), DBA, CTW1, CTW2 and TEKA (centroids). A single medoid/centroid extracted from the training data set represents each category.

A collection of 45 heterogeneous data sets is used to assess the proposed algorithms. The collection includes synthetic and real data sets, as well as univariate and multivariate time series. These data sets are distributed as follows:

  • 42 of these data sets are available at the UCR repository [KeoghUCRdataset]. Basically, we used all the data sets except for StarLightCurves, Non-Invasive Fetal ECG Thorax1 and Non-Invasive Fetal ECG Thorax2. Although these last three data sets are still tractable, their computational cost is high because of their size and the length of the time series they contain. All these data sets are composed of scalar time series.

  • One data set, uWaveGestureLibrary_3D was constructed from the uWaveGestureLibrary_X—Y—Z scalar data sets to compose a new set of multivariate (3D) time series.

  • One data set, CharTrajTT, is available at the UCI Repository [Lichman:2013] under the name Character Trajectories Data Set. This data set contains multivariate (3D) time series and is divided into two equal sized data sets (TRAIN and TEST) for the experiment.

  • The last data set, PWM2, which stands for Pulse Width Modulation [PWM], was specifically defined to demonstrate a weakness in dynamic time warping (DTW) pseudo distance. This data set is composed of synthetic scalar time series.

For each dataset, a training subset (TRAIN) is defined as well as an independent testing subset (TEST). We use the training sets to extract single medoids or centroid estimates for each of the categories defined in the data sets.

Furthermore, for KDTW-M, CTW2 and TEKA, the parameter is optimized using a leave-one-out (LOO) procedure carried out on the TRAIN data sets. The value is selected within the discrete set . The value that minimizes the LOO classification error rate on the TRAIN data is then used to provide the error rates that are estimated on the TEST data.

The classification results are given in Table II. It can be seen from this experiment, that

  1. Centroid-based methods outperform medoid-based methods: DBA and CTW (CTW2) yield lower error rates compared to DTW-M, as do TEKA compared to KDTW-M and DTW-M.

  2. CTW pairs much better with KDTW (CTW2 outperforms CTW1)

  3. TEKA outperforms DBA (under the same experimental conditions (maximum of 10 iterations)), and CTW.

The average ranking for all six tested methods, which supports our preliminary conclusion, is given at the bottom of Table II.

Method KDTW-M DBA CTW1 CTW2 TEKA
DTW-M p<.0001 p<.0001 0.638 0.0002 p<.0001
KDTW-M - 0.395 0.0004 0.5261 p<.0001
DBA - - p<.0001 0.8214 p<.0001
CTW1 - - - p<.0001 p<.0001
CTW2 - - - - p<.0001
TABLE III: Wilcoxon signed-rank test of pairwise accuracy differences for 1-NC/NM classifiers carried out on the 45 datasets.

In Table III we report the P-values for each pair of tested algorithms using a Wilcoxon signed-rank test. The null hypothesis is that for a tested pair of classifiers, the difference between classification error rates obtained on the 45 datasets follows a symmetric distribution around zero. With a significance level, the P-values that lead to reject the null hypothesis are shown in bolded fonts in the table. This analysis confirms our previous analysis of the classification results. We observe that centroid-based approaches perform significantly better than medoid-based approaches. Furthermore, KDTW-M appears to be significantly better than DTW-M.

Furthermore, TEKA is evaluated as significantly better than DBA and CTW2 in this experiment. Note also that DBA does not seem to perform significantly better than KDTW-M or CTW2, and that CTW1 performed similarly to DTW-M and poorly compared to the other centroid methods. Hence, it confirms out that CTW method seems to pair well with KDTW measure but poorly with the DTW measure.

4.2 Instance set reduction

In this second experiment, we address an application that consists in summarizing subsets of training time series to speed-up an isolated gesture recognition process.

The dataset that we consider enables to explore the hand-shape and the upper body movement using 3D positions of skeletal joints captured using a Microsoft Kinect 2 sensor. 20 subjects have been selected (15 males and 5 females) to perform in front of the sensor (at a three meters distance) the six selected NATOPS gestures. Each subject repeated each gesture three times. Hence the isolated gesture dataset is composed of 360 gesture utterances that have been manually segmented to a fixed length of 51 frames 111These datasets will be made available for the community at the earliest feasible opportunity.

To evaluate this task, we have performed a subject cross validation experiment consisting of 100 tests: for each test, 10 subjects have been randomly drawn among 20 for training and the remaining 10 subjects have been retained for testing. 1-NN/NC (our baselines) and SVM classifiers are evaluated, with or without summarizing the subsets composed with the three repetitions performed by each subjects using a single centroid (DBA, CTW, TEKA) or Medoid (KDTW-M). The parameter of the KDTW kernel as well as the SVM meta parameter (RBF bandwidth and ) are optimized using a leave one subject procedure on the training dataset. The kernels and are used respectively in the SVM DTW and SVM KDTW classifiers.

Method
ERR
mean std
PRE REC
1-NN DTW .134 .012 .869 .866 0.867 180
1-NN KDTW .128 .016 .876 .972 .874 180
1-NC DTW-DBA .136 .014 .868 .864 .866 60
1-NC KDTW-CTW .135 .016 .871 .865 .868 60
1-NC KDTW-TEKA .133 .014 .871 .867 .869 60
SVM DTW .146 .015 .871 .854 .862 164.97
SVM KDTW .051 .015 .952 .949 .951 103.10
SVM KDTW-M .087 .02 .92.9 .92.6 .92.7 47.62
SVM KDTW-DBA .080 .017 .935 .931 .931 46.74
SVM KDTW-CTW .085 .021 .933 .927 .930 50.12
SVM KDTW-TEKA .079 .019 .937 .933 .935 47.45
TABLE IV: Assessment measures (ERR:Error rate, PRE: Precision, REC:Recall and score) for the isolated gestures recognition. is the number of training gestures for the 1-NN/NC classifiers and the mean number of support vectors for the SVM classifiers.
Method 1-NN 1-NC 1-NC 1-NC
KDTW DBA CTW TEKA
1-NN DTW p<.0001 0.140 0.886 0.371
1-NN KDTW - p<.0001 0.026 0.087
1-NC DBA - - 0.281 0.006
1-NC CTW - - - 0.199
TABLE V: Wilcoxon signed-rank test of pairwise accuracy differences for 1-NN/NC classifiers. DTW and KDTW methods exploit the entire training sets while the other methods only use one centroid for each subject and each gesture label.
Method SVM SVM SVM SVM SVM
KDTW KDTW-M DBA CTW TEKA
SVM DTW p<.0001 p<.0001 p<.0001 p<.0001 p<.0001
SVM KDTW - p<.0001 p<.0001 p<.0001 p<.0001
SVM KDTW-M - - 0.002 0.57 0.0002
SVM DBA - - - 0.107 0.339
SVM CTW - - - - 0.013
TABLE VI: Wilcoxon signed-rank test oof pairwise accuracy differences for SVM classifiers. DTW and KDTW methods exploit the entire training sets while the other methods only use one centroid for each subject and each gesture label.

Table IV gives the assessment measures (ERR: average error rate, PRE: macro average precision, REC: macro average recall and ) for the isolated gestures classification task. In addition, the number of reference instances used by the 1-NN/NC classifiers or the number of support vectors exploited by the SVM ( column in the table) are reported to demonstrate the data reduction that is induced by the methods in the training sets.

The results show that the DTW measure does not fit well with SVM comparatively to KDTW: the error rate or the F1 score are about higher or lower for the isolated gesture task. Hence, to compare the DBA, CTW and TEKA centroids using a SVM classification, the KDTW kernel has been used. When using the centroids (SVM KDTW-DBA, SVM KDTW-CTW, SVM KDTW-TEKA), or Medoids (SVM KDTW-M) the error rate or score increases or decreases only by around and comparatively to the SVM-KDTW that achieves the best scores. Meanwhile the number of support vectors exploited by the SVM drops by a two factor, leading to an expected speed-up of . Compared to 1-NN classification without centroids, the SVM KDTW with centroids achieves a much better performance, with an expected speed-up of ( support vectors comparatively to gesture instances). This demonstrates the capacity of centroid methods to reduce significantly the size of the training sets while maintaining a very similar level of accuracy.

In more details, the TEKA is the centroid-based method that achieves the lowest error rates for the two classification tasks, while DBA is the centroid-based method that exploits the fewest support vectors (46.5).

Table V and VI give the P-values for the Wilcoxon signed-rank tests. With the same null hypothesis as above (difference between the error rates follows a symmetric distribution around zero), and with a significance level, the P-values that lead to reject the null hypothesis are presented in bolded fonts in the tables. From Table V we note that 1NN-KDTW (which exploits the full training set) performs significantly better than 1NN DTW, 1-NC DTW-DBA and 1-NC KDTW-CTW but not significantly than 1-NC KDTW-TEKA. Conversely, 1-NC KDTW-TEKA performs significantly better that 1-NC DTW-DBA but not significantly better that 1-NC KDTW-CTW. Similarly, from Table VI we observe that SVM KDTW, which exploits the full training set, performs significantly better than all centroid or medoid based methods. Also, SVM KDTW-TEKA performs significantly better than SVM KDTW-CTW but not significantly better than SVM KDTW-DBA. Finally SVM KDTW-TEKA and SVM KDTW-DBA outperform the medoid based method (SVM KDTW-M) but not SVM KDTW-CTW.

If the three centroid methods show rather close accuracies on this experiment, TEKA is significantly better than DBA on the 1NC task and significantly better than CTW on the SVM task.

4.3 Denoising experiment

To demonstrate the utility of centroid based methods for denoising data, we construct a demonstrative synthetic experiment that provides some insights. The test is based on the following 2D periodic signal:

(17)

where , and , and are constant and , , , are small perturbation in amplitude, frequency and phase respectively and randomly drawn from , , , .

Fig. 4: waveforms (top) and corresponding 2D shape (bottom, plain black curve) of the synthetic signal.
Fig. 5: Log power spectra of a component.
Fig. 6: Noisy waveforms (top) and corresponding 2D shape (bottom) of the synthetic signal.
Euclidean DBA CTW TEKA
Fig. 7: Centroids obtained from a set of height noisy instances for Euclidean, DBA, CTW and TEKA averaging methods. The log power spectra in dB (top), the 2D shape (center) and x,y waveforms (bottom) are shown.

In practice we have adopted the following setting: