The Wasserstein-Fourier Distance for Stationary Time Series1footnote 11footnote 1This work was supported by Fondecyt-Postdoctorado 3190926, Fondecyt-Iniciación 11171165, and Conicyt-PIA AFB 170001 CMM.

# The Wasserstein-Fourier Distance for Stationary Time Series1

## Abstract

We introduce a novel framework for analysing stationary time series based on optimal transport distances and spectral embeddings. First, we represent time series by their power spectral density (PSD), which summarises the signal energy spread across the Fourier spectrum. Second, we endow the space of PSDs with the Wasserstein distance, which capitalises its unique ability to preserve the geometric information of a set of distributions. These two steps enable us to define the Wasserstein-Fourier (WF) distance, which allows us to compare stationary time series even when they differ in sampling rate, length, magnitude and phase. We analyse the features of WF by blending the properties of the Wasserstein distance and those of the Fourier transform. The proposed WF distance is then used in three sets of key time series applications considering real-world datasets: (i) interpolation of time series leading to data augmentation, (ii) dimensionality reduction via non-linear PCA, and (iii) parametric and non-parametric classification tasks. Our conceptual and experimental findings validate the general concept of using divergences of distributions, especially the Wasserstein distance, to analyse time series through comparing their spectral representations.

## 1 Introduction

Time series (TS) analysis is ubiquitous in a number of applications: from seismic applications to audio enhancement, from astronomy to fault detection, and from underwater navigation to source separation. Time series analysis occurs naturally in the spectral domain, where signals are represented in terms of how their energy is spread across frequencies. The spectral content of a signal is given by its—e.g., Fourier—power spectral density (PSD), which can be computed via the Periodogram [1], parametric approaches such as the autoregressive and Yule-Walker methods [2, 3], and also methods handling noisy and missing data [4, 5, 6, 7, 8]. Comparing time series through their respective (appropriately computed) PSDs provides a promising alternative to the time domain, since the spectral representation is robust to differences on sampling rate, missing values, noise, delays and general acquisition artefacts. This concept has been exploited by, e.g., [9], who computed a distance between two times series using the Bregman divergence (associated to the logarithm function) between their PSDs; however, this divergence is of no use when the spectral supports2 differ. Therefore, developing sound metrics for comparing general PSDs remains essential.

Beyond the realm of spectral analysis, metrics for comparing general densities are governed by those based on the Euclidean distance and the Kullback-Leibler (KL) divergence, the latter requiring that one distribution dominates3 the other. We claim that these distances are not natural when comparing distributions of power, and might impose stringent requirements on the densities under study. In particular, both these distances are meaningful when the densities share a common support. In a nutshell, our point of view follows from the fact that Euclidean and KL are vertical divergences, meaning that they compare the (point-wise) values of two distributions for a given location in the -axis (frequency). In this paper, however, we focus on how spectral energy is distributed across different values of the -axis, thus requiring a horizontal distance. For this type of distance, optimal transport stands as a natural alternative which has already proven successful in a number of machine learning challenges such as clustering of distributions [10] and unsupervised learning [11].

We rely upon the celebrated Wasserstein distance [12], that is the cost of the optimal transportation (OT) between two probability distributions, to address the choice of a suitable metric for comparing PSDs. Related works include those of Flamary et al. in [13], which built a dictionary of fundamental and harmonic frequencies thus emphasising the importance of moving mass along the frequency dimension, and also [14], that followed the same rationale for supervised speech blind source separation. However, a study of the advantages of applying the Wasserstein on PSDs, as well as how to use such a distance on general purpose time-series analysis is still missing in the literature.

Section 2 presents the background related to PSDs and OT that supports our work, whereas Section 3 defines the proposed Wasserstein-Fourier distance and analyses its properties from OT and Signal Processing perspectives. Sections 4, 5 and 6 are devoted to the validation of the proposed distance through the three distance-based sets of experiments using synthetic and real-world datasets: interpolation paths between time series (and Gaussian processes), spectral dimensionality reduction and parametric/non-parametric regression. The paper then concludes with an assessment of our findings and a proposal for future research activities.

## 2 Background, assumptions and desiderata

Briefly put, recall that our aim is to compare time series by computing the distance between their respective PSDs using the Wasserstein distance; however, as this distance is formulated for probability distributions (rather than PSDs), we revise the conditions to appropriately equip PSDs with the Wasserstein metric.

### 2.1 An equivalence class for time series

We consider Fourier PSDs given by the square absolute value of the Fourier transform of a time series. A more precise definition depends on the properties of the time series at hand, for instance, for a continuous-time stationary time series , the PSD is given by (we denote by the imaginary unit)

 S(ξ)=limT→∞E[12T∣∣∣∫T−Tx(t)e−j2πtξdt∣∣∣2], (1)

where the normalising constant is required due to the infinite energy of a stationary time series. On the contrary, if the signal were a discrete-time one, the integral would be replaced by a summation, and if it were deterministic, the expectation operator would not be needed. For ease of presentation and notations, we will focus on continuous-time signals and the Fourier spectrum [15], however, we emphasise that the proposed concept can be readily extended to multivariate input, discrete time and abitrary spectral domains.

A unit-norm representation of the spectral content of time series is obtained by normalising the power spectral density (NPSD), thus ignoring the signal’s magnitude:

 s(ξ)=S(ξ)∫S(ξ′)dξ′. (2)

The justification for this choice is twofold. First, representing time series by a NPSD allows us to exploit the vast literature on divergences for probability distributions. Second, to compare the dynamic content of time series, i.e., how they evolve in time, the magnitude of the time series is irrelevant and so is the magnitude of its PSD (due to the linearity of the Fourier transform).

Choosing NPSDs and neglecting both the signal’s power and phase makes our representation blind to time-shifts (phase) and scaling. For stochastic processes, this representation has an interesting interpretation: as the PSD of a stationary stochastic process is uniquely linked to its covariance kernel (via the Bochner’s theorem [16]), all processes with proportional kernels will have the same proposed NPSD representation. This is particularly useful for Gaussian process (GPs, [17]), as it allows us to define a proper distance between GPs—see Sec. 4.4.

The NPSD representation thus induces an equivalence class for time series, where series having the same NPSDs belong to the same equivalence class and thus are understood to have the same dynamic content. We denote by the equivalence class of time series . Therefore, the NPSD representation can be understood as a non-injective embedding (or projection) from the space of time series onto the space of probability distributions due to the unit-norm choice. As a consequence, we can define a distance between two (equivalence) classes of time series as the distance between their corresponding NPSDs, where the latter can be chosen as a divergence for probability distributions such as the Euclidean distance, the Kullback-Leibler divergence or, as proposed in this work, the Wasserstein distance.

### 2.2 Optimal transport

Optimal transport (OT, [12]) compares two probability distributions paying special attention to the geometry of their domain. This comparison is achieved by finding the lowest cost to transfer the mass from one probability distribution onto the other. Specifically, the optimal transport between two measures and defined on is given by

 minπ∈Π(μ,ν)∫Rd×Rdc(x,y)dπ(x,y), (3)

where is a joint distribution for and , referred to as the transport plan, belonging to the space of the product measures on with marginals and ; is a cost function. In this manuscript, we focus on the Wasserstein distance, or Earth Mover’s distance [18], which is obtained by setting , , in eq. (3). This distance, denoted by

 Wp(μ,ν)=(minπ∈Π(μ,ν)∫Rd×Rd∥x−y∥pdπ(x,y))1/p, (4)

metrises the space of measures admitting moments of order . Remarkably, OT immediately addresses the key challenge highlighted in Sec. 1, as it allows us to compare meaningfully continuous and discrete densities of different supports unlike the Euclidean distance and KL divergence.

Additionally, notice that for one-dimensional probability measures—such as the NPSDs, the optimisation (4) is closed-form and boils down to

 Wpp(μ,ν)=∫10|F−μ(t)−F−ν(t)|pdt, (5)

where represents the inverse cumulative function of . We refer the reader to [12] for a theoretical presentation of OT and to [19] for recent computational aspects.

## 3 A Wasserstein-based distance between Fourier representations

Let us consider the stationary signals and belonging to two classes of time series denoted by and respectively. We define the proposed Wasserstein-Fourier distance (WF) as follows:

###### Definition 1.

The Wasserstein-Fourier distance between two equivalence classes of time series and is given by

 WF([x],[y])=W2(sx,sy), (6)

where and denote the NPSDs associated to and respectively, as defined in eq. (2).

Note that by a slight abuse of notation, denotes both the probability density (or mass) function and the measure itself. Since the function inherits the properties of the Wasserstein distance, we directly obtain the following result.

###### Theorem 1.

is a distance over the space of equivalence classes of time series.

###### Proof.

The proof relies on the properties of the distance. For three arbitrary time series , and , WF verifies:
(i) non-negativity: is direct,
(ii) identity of indiscernible: is equivalent to , and by definition of the equivalence class, to ,
(iii) symmetry: ,
(iv) triangle inequality: , which concludes the proof. ∎

### 3.1 Properties of the WF distance

We next analyse key features of the proposed WF distance, which follow naturally from the properties of the Wasserstein distance and the Fourier transform and, therefore, are of general interest for the interface between signal processing and OT. In what follows, we consider two arbitrary (possibly complex-valued) time series denoted and , and we denote the Fourier transform of as , its PSD as and its NPSD as .

Time shifting. If is a time-shifted version of , their Fourier transforms are related by and thus their PSDs are equal . Therefore,

 WF([x],[y])=0, (7)

which is in line with our aim (Sec.2.1) of constructing a distance that is invariant under time shifts.

Time scaling. If , we obtain , leading to a NPSD . In this case, the WF distance translates the scaling effect of magnitude , since the inverse cumulative functions obey , and thus

 WF([x],[y])=|a−1|(E|Y|2)1/2, (8)

where is a random variable with probability distribution .

Frequency shifting. If is a frequency-modulated version of the signal , their NPSDs then verify , which corresponds to a translation of the densities, leading to

 WF([x],[y])=∥ξ0∥. (9)

Non-convexity. Observe that the convexity (in each variable) of does not imply the convexity of the WF distance as shown in the following counter-example. Let us consider the signals:

 x(t) =cos(ωt) (10) y±(t) =±x(t)+1rcos(αt), (11)

with . Then for large enough we have

 WF([y++y−2],[x]) =|ω−α| (12) >|ω−α|/(π√r2+4) =WF([y+],[x])+WF([y−],[x])2.

In particular, the convexity of is not preserved by WF due to the normalisation of the PSDs.

### 3.2 Convergence properties in connection to the time domain

We next analyse the convergence of the empirical NPSD to the true NPSD in Wasserstein distance in terms of the pointwise convergence of the empirical autocovariance function of a stationary time series. This aims to validate the proposed Wasserstein-Fourier distance as a sound metric for statistical time-series by linking convergence between time and frequency representations.

Recall that the covariance function of a stationary time series describes how the process co-varies with itself at different time lags. For a zero-mean series, the covariance function is defined by

 c(h)=E[y(⋅)y∗(⋅+h)], (13)

and its normalised version referred to as autocorrelation function (ACF) is given by

 r(h)=c(h)/c(0). (14)

Furthermore, by Bochner theorem [16] we have that

 S(ξ)=∫Rc(h)e−j2πhξdh, (15)

where denotes the PSD of . Applying the inverse Fourier transform to the above equation leads to , meaning that the normalising constant of the ACF in eq. (14) is the one required to compute the NPSD proposed in eq. (2). As a consequence, we can interpret a normalised version of Bochner theorem that associates the NPSD with the ACF by

 s(ξ)=∫Rr(h)e−j2πhξdh, (16)

rather than the PSD with the covariance as the original theorem.

Leaning on this relationship, the following convergence result ties the sample ACF to the sample time series. For the sake of simplicity, we consider a zero-mean discrete-time series (i.e., a vector of infinitely-countable entries).

###### Proposition 1.

Let be a discrete-time zero-mean series and be a sample of , with and their true and empirical4 ACFs respectively. Then, if is band limited and stationary, we have

 limn→∞rn(h)=r(h)⇔limn→∞WF([yn],[y])=0. (17)
###### Proof.

By the normalised version of the Bochner theorem in eq. (16), and since the Fourier transform and its inverse are continuous, when tends to infinity one has that is equivalent to , the NPSDs associated to and . We next prove the right and left implications respectively:

[] Let us assume that . By Scheffé’s lemma, weakly converges to . Moreover, since is supported on a compact (as is band-limited), the dominated convergence theorem guarantees that . This implies that (see e.g. Theorem 7.12 in [21]) and thus .

[] Conversely, if when , then weakly converges to , and by the Lévy’s continuity theorem, we have that , which directly implies that for all . ∎

Remark that the band-limited assumption is actually not needed for the left implication .

After defining the proposed WF distance and studying its properties, we can exploit the geometric and statistical features inherited from the Wasserstein distance for time series analysis.

## 4 A geodesic path between two times series

Based on the geodesic structure of the Wasserstein space [22, Chap. 7], we next show that the introduced WF distance allows us to construct a geodesic trajectory between two time series. Recall that for the usual Euclidean () distance on the space of time series (functions of time with compact support), the trajectory between signals and consists in a path that starts at and ends at by point-wise interpolation, that is, a superposition of signals of the form

 xγ(t)=γx1(t)+(1−γ)x2(t),γ∈[0,1]. (18)

This interpolation corresponds to a spatial superposition of two signals from different sources (e.g., seismic waves), and is meaningful in settings such as source separation in Signal Processing, also referred to as the cocktail party [23]. More generally, in general physical models that consider white noise, there is an implicit assumption of additivity, where the spectral densities are linear combinations in the (vertical) sense.

In other applications, however, such as audio (classification), body motions sensors, and other intrinsically-oscillatory processes, the superposition of time series fails to provide insight and understanding of the data. For instance, the average of hand movements from different experiments would probably convey little information about the true gesture and it is likely to quickly vanish due to the random phases (or temporal offsets). These applications, where spectral content is key, would benefit from a spectral interpolation rather than a temporal superposition. The need for a spectral-based interpolation method can be illustrated considering two sinusoidal time series of frequencies , where the intuitive interpolation would be a sinusoidal series of a single frequency such that . We will see that, while the interpolation is unable to obtain such a results, the proposed WF-based interpolation successfully does so.

### 4.1 Definition of the interpolant

We then define a geodesic path from series to using the proposed WF distance and an extension of the McCann’s interpolant, namely a constant-speed Wasserstein geodesic given by [22, Theorem 7.2.2.], between their respective NPSDs. To construct this WF path between time series, we proceed as follows:

1. embed signals and into the frequency domain by computing their NPSDs and according to eq. (2),

2. compute a constant-speed geodesic between and as

 gγ=pγ#π∗, (19)

where for ; is an optimal transport plan in eq. (4) between and ; and is the pushforward operator. In other words, for any continuous function ,

 ∫Rh(u)dgγ(u)=∬R2h((1−γ)u+γv)dπ∗(u,v),

and for each is a valid NPSD,

3. map back into the time domain to obtain the time-series interpolants between and .

Recall that the WF distance is a non-invertible embedding of time-series due to both the normalisation and the neglected phase. However, as we are interested in the harmonic content of the interpolant, we can equip the above procedure with a unit magnitude and either a fixed, random or zero phase depending on the application. In the specific application of audio signals, [24] has proposed an interpolation path using the same rationale as described above and implemented it to create a portamento-like transition between the audio signals, where phase accumulation techniques are used to handle temporal incoherence.

In view of the proposed interpolation path, one can rightfully think of Dynamic Time Warping (DTW), a widespread distance for time series, which consists in warping the trajectories in a non-linear form to match a target time series. As for the relationship between WF and DTW, observe that (i) the WF distance is rooted in frequency, whereas DTW is rooted in time, and (ii) though DTW does not lean on the Wasserstein distance, the authors of [25] proposed a Soft-DTW, using OT as a cost function, but again, purely in the time domain.

### 4.2 Trajectories between parametric signals

For illustration, we consider parametric (and deterministic) time series for which the proposed geodesic WF interpolants can be calculated analytically; this is due to their NPSDs and optimal transport plans being either known or straightforward to compute. For completeness, we focus on two complex-valued examples, yet they can be turned into real-valued ones with the appropriate choice of parameters.

Sinusoids. Let us consider signals (of time) of the form

 xa,b(t)=ejat+ejbt, (20)

with NPSD . For two signals and such that , we have . Therefore, the interpolant between and for is given by

 xγ(t) =ej(γa1+(1−γ)a2)t+ej(γb1+(1−γ)b2)t =xγa1+(1−γ)a2,γb1+(1−γ)b2.

This example reveals a key feature of the WF distance: the WF interpolator moves the energy across frequencies, whereas vertical distances perform a weighted average of the energy distribution. As a consequence, the WF interpolator is sparse in frequency: the WF-interpolator between two line spectrum signals is also line spectrum.

Exponentials. Consider the complex exponential , with parameters and PSD

 S(ξ) =παexp(−2(πξ)2α)∗δ(ξ+μ) =παexp(−2π2(ξ−μ)2α). (21)

Normalising leads to an NPSD given by a Gaussian distribution . Furthermore, the Wasserstein distance and the geodesic between two Gaussians are known and, in fact, each interpolant is also a Gaussian [26]. Therefore, the WF distance and interpolant between two complex exponentials and with parameters and are respectively given by:

 WF([x1],[x2]) =√(μ1−μ2)2+(√α1−√α2)2/4π2, xγ(t) =e−[γ√α1+(1−γ)√α2 ]2t2ej2πt(γμ1+(1−γ)μ2).

Again, we see that the proposed interpolant preserves the nature of the signals: the WF-interpolation between complex exponentials is also a complex exponential.

Fig. 1 shows the interpolation between two complex exponentials via 6 evenly-spaced (according to WF) time series along the geodesic path; only the real part is shown for clarity.

### 4.3 Example: data augmentation for the C. elegans dataset

The proposed interpolation method can be used for data augmentation by computing synthetic time series along the WF-interpolation path between two (real-world) signals. We next illustrate this property with the worm dataset [27], also referred to as C. elegans behavioural database [28]. Since the movements of C. elegans can be represented by combinations of four base shapes (or eigenworms) [29], a worm movement of the present dataset can be decomposed into these bases. We will only consider the class of wildtype worms on the first base shape, for which the time series are 900-sample long. Our aim is to construct a WF-interpolation path between the movements of two worms belonging to the same class to synthetically generate more signals having the same spectral properties.

Recall that to compute the time series from the interpolated NPSDs we need the phase, for this experiment we will interpolate the phases too in an Euclidean way: for two phases and we choose the interpolator . Furthermore, we will consider an evenly-spaced frequency grid of points between and . Fig. 2 shows a 10-step interpolation path between two C. elegans series, using the WF (top) and the Euclidean (bottom) distances.

Observe that the WF interpolation is not constrained to remain in the element-wise convex hull spanned by the two series. On the contrary, under the Euclidean distance if the two series coincide in time and value, then all the interpolations are constrained to pass through that value, e.g., in Fig. 2 (bottom) for and many other time stamps.

Additionally, to assess how representative our interpolation is in the class of wildtype worms, let us refer to the distance between a NPSD and a class as

 d(s,C)=mins′∈Cd(s,s′), (22)

and verify that the Wasserstein interpolant is closer to the wildtype class NPSDs that its Euclidean counterpart. Indeed, Fig. 3 shows our () WF-interpolant and the signal of the minimiser of the above distance for comparison. Besides the visual similarity, notice from Table 1 that the WF interpolant is much closer to the wildtype class (using the WF distance) that the Euclidean interpolant. This notion of class proximity for time series opens new avenues for time series classification as we will see in Sec. 6.

### 4.4 Special case of Gaussian processes

The Gaussian process (GP) prior is a non-parametric generative model for continuous-time signals [17], which is defined in terms of its covariance or, via Bochner theorem (15), by its PSD. Following from the arguments in Sec. 2.1, we can identify an equivalence class of all the GPs that share the same covariance function (and hence PSD) up to a scaling constant. Identifying this equivalence class is key to extend the WF distance to GPs, which is in line with the state of the art on GPs regarding spectral covariances [30, 31, 32, 33, 34, 35].

The proposed interpolation is in fact a geodesic in the GP space (wrt to the WF distance), where every point in the trajectory is a valid GP. The interpolation can be implemented in the space of covariance kernels simply by applying the inverse Fourier transform to the proposed WF interpolator on NPSDs, where the phase is no longer needed since covariances are even functions. In this way, as the properties of a GP are determined by its kernel (e.g., periodicity, differentiability, long-term correlations, and band-limitedness), the WF interpolant is a transition in the space of GPs with these properties. Fig. 4 illustrates this interpolation for four different covariance kernels, notice how the interpolation results on smooth incorporation of frequency components.

## 5 Principal geodesic analysis in WF

In order to detect spectral patterns across multiple time series, we can apply dimensionality reduction techniques to their respective NPSDs. Though the de facto method for this is Functional Principal Component Analysis (FPCA) (see e.g. [37]), applying it in this case provides limited interpretation. This is because using the Euclidean distance directly on the NPSDs might result in principal components (PC) with negative values, which are therefore not distributions. To address this challenge, we can use the proposed WF distance to extend the classical FPCA, which takes place in the Hilbert space , to a Principal Geodesic Analysis (PGA) using WF (6), or, equivalently, to perform PGA on the NPSDs in the Wasserstein space.

### 5.1 Description of the general procedure

For this purpose, we rely on the method proposed by [38, 39], which is next described in a simplified manner, after defining the Wasserstein mean of distributions. Let be a set of distributions in . A Wasserstein barycenter —or mean—of this family, is defined as the Fréchet mean with respect to the Wasserstein distance5, as introduced by [40]. In other words, the Wasserstein barycenter is a solution of the following minimisation problem:

 ¯s∈argmins∈P2(Rd) 1nn∑i=1W22(si,s). (23)

The PGA procedure is then as follows:

1. For each distribution , compute the projection onto the tangent space at the barycenter using the logarithmic-map according to

 ln=log¯s(sn):=F−sn∘F¯s−id, (24)

where recall that denotes the cumulative function of and the inverse cumulative function of .

2. Perform FPCA on the log-map projections in the Hilbert space weighted by the barycenter (namely ), using components. Therefore, we can denote the projection of onto the component as

 pk,n:=^l+tk,nvk, (25)

where is the eigenvector of the log-map’s covariance matrix, is the location of the projection of onto , and is the Euclidean mean of .

3. Map each back to the original distribution space using the exponential map

 gk,n=exp¯s(pk,n):=(id+pk,n)#¯s. (26)

As a result, observe that for every pair , is a valid distribution that corresponds to the projection of the original distribution , onto the geodesic component defined by , which is a distribution for each . This is illustrated in Fig. 5.

Although computationally fast, the above procedure might not provide exact PGA, since a geodesic in the Wasserstein space is not exactly the image under the exponential map of straight lines in (see [41] for more details). Alternative methods are those proposed by [42] based on [41], and [43] which performs PCA on Gaussian processes by applying standard PCA on the space of covariance kernel matrices (linked to the PSDs through Bochner’s theorem).

### 5.2 Implementation on NPSDs: synthetic and real-world data

The above procedure can be relied upon to perform PGA in the space of NPSD, this allows us to yield a meaningful decomposition of the signals’ trends and to summarise the information of the series into geodesic components. We highlight this concept via a toy example using a dataset composed of:

• Class 1: cosine signals with random frequency and white Gaussian noise.

• Class 2: sinc signals with random parameter and white Gaussian noise.

Fig. 6 shows the NPSDs of the two classes, where the deviations from the theoretical PSDs arise from the use of a finite-size grid to compute the Fourier transform. Then, Fig. 7 shows the projection of all NPSDs along the first geodesic component , where the locations ; recall that deviation from the expected shapes are due to (i) approximate computation of NPSDs and (ii) inability to compute the exact PGA. Interestingly, the location of the projections for classes 1 and 2 are respectively in the ranges , and , meaning that the members from the two classes are well separated when projected on the first component. Moreover, Fig. 7 reveals that the red-coded (large ) curves encode the NPSD of a cosine given by a Dirac distribution, whereas the yellow-coded (small ) curves aim to recover a rectangle, which corresponds to the sinc’s PSD.

To validate the usefulness of the proposed procedure, we also applied FPCA (i) directly on the signals, and (ii) on the NPSDs. In neither of these cases it was possible to discriminate between the sinc or cosine classes.

Lastly, we applied this decomposition procedure on the real world fungi dataset composed of melt curves of the rDNA internal transcribed spacer (ITS) region of strains of fungal species ([44, 27]. Each fungi species contains between and signals, for a total of time series of length . The dataset is displayed in Fig. 8, where the species are colour coded.

Fig. 9 shows the locations of projections for the first and second components plotted against each other; the same per specie colour-code was used in this plots. Notice that the locations overlap between classes (at least for the first two components), meaning that the proposed PGA allows us to identify species that present similar dynamic content but also some that group in different clusters.

## 6 Classification of time series using divergences: WF and KL

This last section is dedicated to validating the general concept of using divergences between probability distributions on PSDs (such as Wasserstein and KL) in the classification of time series. We start with a toy example motivating the use of the WF distance in the classification task, to then propose classifiers using three spectral distances (WF, KL and Euclidean) combined with the logistic function (parametric) and the -nearest neighbours concept (non-parametric). Then, we implement the proposed time-series classifiers on binary and multiclass classification using four different real-world datasets where we see that the WF and KL distances outperform the Euclidean in every scenario.

### 6.1 Motivation

Let us consider two classes of synthetic NPSDs given by

• Left-Asymetric Gaussian Mixture (L-AGM): given by a sum of two Gaussians with random means and variances, where the left Gaussian has a variance four times that of the right one.

• Right-Asymmetric Gaussian Mixture (R-AGM): constructed in the same manner but the role of the variances is reversed.

In both classes the distance between the means remains constant. Fig. 10 shows samples of each class and their barycenters (Wasserstein and Euclidean); notice how was able to discriminate between two NPSDs of different classes, unlike the Euclidean counterpart. This promising feature is next exploited on real-world data using parametric and nonparametric classifiers.

### 6.2 Definition of classifiers

We can construct classifiers for time series using the proposed WF distance within distance-based classifiers, as initially addressed by [45]. In particular, we will consider a probabilistic classifier using the logistic function and a -nearest neighbours (KNN) classifier, both operating on the WF distance.

Logistic & WF. In binary classification (i.e., classes and ), we can parametrise the conditional probability that the sample is class using the logistic function:

 p(C0|s)=11+e−α+βd(s,¯s0)−γd(s,¯s1), (27)

where the function is a divergence and (resp. ) is known as the prototype of the class (resp. ). For a set of sample-label observation pairs , the set of parameters can then be obtained via, e.g., maximum likelihood. Notice that, in connection with the standard (linear) logistic regression, the argument in the exponential in eq. (27) is a linear combination of the distances to class prototypes; therefore, relative proximity (on distance) to the prototypes determines the class probabilities.

As the sample in our setting is an NPSD, we consider three alternatives for and in eq. (27), thus yielding three different classifiers:

1. Model : the Euclidean distance, and , for , is given as the Euclidean mean of class ,

2. Model : the Kullback-Leibler distance, and is also given as the Euclidean mean of class ,

3. Model : the Wasserstein distance, and is given as the Wasserstein mean of class .

Also, as the KL divergence is not symmetric, we clarify its use within model via the following remark.

###### Remark 1.

The support of the Euclidean barycenter of a family of distributions contains the support of each member . Therefore, the KL divergence computed in the following direction , where is any NPSD, is always finite and thus to be considered within model .

KNN & WF. We also consider a variant of the KNN classifier, using the aforementioned distances on NPSDs. Recall that the KNN algorithm is a non-parametric classifier [46, 47], where sample is assigned to the class most common across its nearest neighbours. This strategy heavily relies on the distance (or divergence) used, where the predicted class follows directly from the distances among samples. Therefore, we implement the KNN concept to NPSDs using , and and neighbours.

### 6.3 Binary classification of real-world dataset

We now perform binary classification using the logistic and KNN models with the three distances (6 models in total) applied to two real-world datasets: Urbansound8k and GunPoint.

Urban audio. Urbansound8k [48] comprises 4-second recordings of the classes: air conditioner (AC), car horn, children playing, dog bark, drilling, engine idling (EI), gunshot, jackhammer (J), siren, and street music. We considered foreground-target recordings at 44100 Hz, that is, a dataset of 3487 labelled NPSDs6. We focused on binary classification related to discriminating class gun shot from the remaining classes, thus having 9 different tasks.

We trained , and via maximum likelihood7 and cross-validated using the 10-fold recommended by the dataset. Table 2 shows the mean accuracy of all models and tasks with their 10-fold standard error, where we can notice the superiority of using a proper distance for distributions as opposed to simply de Euclidean distance. The proposed method always gave better results than , and depending on the classes, can provide competitive results wrt .

Table 3 shows the same tasks using the KNN models, where the performance of each distance (model) and task is perfectly aligned to the logistic classifiers.

Body-motion signals. The Gunpoint dataset [49] contains hand trajectories of length from young and old actors drawing a gun or just pointing at a target with his finger. Therefore, the dataset naturally comprises two pairs (young VS old, and drawing VS pointing) of non-overlapping classes.

For each pair of classes, there are labelled series for which we performed binary classification experiment using random splits of the dataset, and using as training data and as test data. In this case, the LR model endowed with the distance always gives better results than with or KL. However, the KNN method provides the overall best accuracy.

### 6.4 Extension to multi-class classification

To cater for multiple classes, the probabilistic classifier in eq. (27) can be extended using the softmax likelihood, where the conditional probability that a sample belongs to the class for is given by:

 p(Ci|s)=e−(αi+∑Kk=1βkid(s,¯sk))∑Kj=1e−(αj+∑Kk=1βkjd(s,¯sk)). (28)

We also consider the three models , and , where refers to the multiclass classification model (using softmax) using the divergence and barycenters denoted by its subscript. Recall that the KL model used Euclidean barycenters. The KNN method can inherently deal with multiple classes, therefore, no extension is needed.

Human activity signals. The Heterogeneity Human Activity Recognition (HAR) dataset [50] contains body-motion signals recorded from subjects performing one of the following activities: biking, standing, stairsup, stairsdown, sitting, walking; our aim is to implement a WF-based multiclass classifier on this dataset. Since these motion signals are 3-dimensional, we applied (standard) PCA and only retained the first (maximum-variance) component. Then, by only considering the recordings collected by users of Nexus phones, our dataset comprised 1242 labelled signals, which were then split into 80%-20% train-test splits. Table 5 shows the mean accuracy of the models trained, with their standard error over 10 splits.

Notice how, in this more challenging example, the proposed model , which uses the WF distance, performed marginally better than the competitors, and the -NN models performed better than its parametric counterpart.

Species of fungies. Lastly, we applied the proposed classifiers to the fungies dataset, where the aim was to classify different species or fungies, based on their rDNA ITS region, see Section 5. From Fig. 6, we can confirm that classification based on spectral densities (i.e., in the frequency domain) has been quite satisfactory. Finally, since the KNN performed better than the softmax, we can infer that the discrimination between the classes is nonlinear in all the metrics considered.

## 7 Discussion

We have proposed the Wasserstein-Fourier based distance (WF) to study stationary time series. To the best of our knowledge, the WF distance has only been used in time series analysis for dictionary learning of audio datasets and not as a general distance for time series challenges. The proposed WF has been presented in (i) theoretical terms, (ii) connection with consistency of the empirical autocorrelation function, (iii) data interpolation and augmentation, (iv) dimensionality reduction, and (v) time series classification. These results establish WF as a sound and intuitive metric for modern time series analysis, and thus working towards bridging the gap between the computational optimal transport and signal processing communities. Future research includes exploiting the property of geodesic path between GPs, briefly illustrated in this paper, to build new training procedures for Gaussian processes, as well as using optimal transport for unbalanced distributions [51], and therefore directly on the space of PSDs without the need for normalisation.

### Footnotes

1. This work was supported by Fondecyt-Postdoctorado , Fondecyt-Iniciación , and Conicyt-PIA AFB 170001 CMM.
2. The support of a function is the subset of the input space where the function is strictly greater than zero, therefore, the spectral support is the set of frequency components that have energy and thus are present in the time series.
3. Distribution dominates distribution if the support of is included in the support of .
4. We denote the empirical ACF and covariance by (see e.g. [20])
5. In one dimension, the barycenter is closed form:
6. PSDs were computed using the periodogram.
7. The negative-log-likelihood was minimised using the L-BFGS-B algorithm.

### References

1. A. Schuster, “On the investigation of hidden periodicities with application to a supposed 26 day period of meteorological phenomena,” Terrestrial Magnetism, vol. 3, no. 1, pp. 13–41, 1898.
2. G. Udny Yule, “On a method of investigating periodicities in disturbed series, with special reference to Wolfer’s sunspot numbers,” Philosophical Transactions of the Royal Society of London A: Mathematical, Physical and Engineering Sciences, vol. 226, no. 636-646, pp. 267–298, 1927.
3. G. Walker, “On periodicity in series of related terms,” Proc. of the Royal Society of London A: Mathematical, Physical and Engineering Sciences, vol. 131, no. 818, pp. 518–532, 1931.
4. R. Turner and M. Sahani, “Time-frequency analysis as probabilistic inference,” IEEE Transactions on Signal Processing, vol. 62, no. 23, pp. 6171–6183, 2014.
5. N. Choudhuri, S. Ghosal, and A. Roy, “Bayesian estimation of the spectral density of a time series,” Journal of the American Statistical Association, vol. 99, no. 468, pp. 1050–1059, 2004.
6. Y. Wang, R. Khardon, and P. Protopapas, “Nonparametric Bayesian estimation of periodic light curves,” The Astrophysical Journal, vol. 756, no. 1, pp. 67, 2012.
7. P. Babu and P. Stoica, “Spectral analysis of nonuniformly sampled data – a review,” Digital Signal Processing, vol. 20, no. 2, pp. 359 – 378, 2010.
8. F. Tobar, “Bayesian nonparametric spectral estimation,” in Advances in Neural Information Processing Systems 31, 2018, pp. 10148–10158.
9. F. Itakura, “Analysis synthesis telephony based on the maximum likelihood method,” in The 6th International Congress on Acoustics, 1968, pp. 280–292.
10. J. Ye, P. Wu, J. Z. Wang, and J. Li, “Fast discrete distribution clustering using Wasserstein barycenter with sparse support,” IEEE Transactions on Signal Processing, vol. 65, no. 9, pp. 2317–2332, 2017.
11. M. Arjovsky, S. Chintala, and L. Bottou, “Wasserstein generative adversarial networks,” in International Conference on Machine Learning, 2017, pp. 214–223.
12. C. Villani, Optimal Transport: Old and New, Grundlehren der mathematischen Wissenschaften. Springer Berlin Heidelberg, 2008.
13. R. Flamary, C. Févotte, N. Courty, and V. Emiya, “Optimal spectral transportation with application to music transcription,” in Advances in Neural Information Processing Systems, 2016, pp. 703–711.
14. A. Rolet, V. Seguy, M. Blondel, and H. Sawada, “Blind source separation with optimal transport non-negative matrix factorization,” EURASIP Journal on Advances in Signal Processing, vol. 2018, no. 1, pp. 53, 2018.
15. S. Kay, Modern Spectral Estimation: Theory and Application, Prentice Hall, 1988.
16. S. Bochner, M. Tenenbaum, and H. Pollard, Lectures on Fourier Integrals, Princeton University Press, 1959.
17. C. Rasmussen and C. Williams, Gaussian Processes for Machine Learning, The MIT Press, 2006.
18. Y. Rubner, C. Tomasi, and L.J. Guibas, “The earth mover’s distance as a metric for image retrieval,” International Journal of Computer Vision, vol. 40, no. 2, pp. 99–121, 2000.
19. G. Peyré and M. Cuturi, Computational Optimal Transport, 2018.
20. C. Chatfield, The Analysis of Time Series: An Introduction, Chapman and Hall/CRC, 2016.
21. C. Villani, Topics in Optimal Transportation, Number 58. American Mathematical Soc., 2003.
22. L. Ambrosio, N. Gigli, and G. Savaré, Gradient flows: in metric spaces and in the space of probability measures, Springer Science & Business Media, 2008.
23. C. Cherry, “Some experiments on the recognition of speech, with one and with two ears,” The Journal of the Acoustical Society of America, vol. 25, no. 5, pp. 975–979, 1953.
24. T. Henderson and J. Solomon, “Audio transport: A generalized portamento via optimal transport,” International Conference on Digital Audio Effects 2019 (DAFx-19), 2019.
25. M. Cuturi and M. Blondel, “Soft-DTW: a differentiable loss function for time-series,” in Proceedings of the 34th International Conference on Machine Learning-Volume 70. JMLR. org, 2017, pp. 894–903.
26. A. Takatsu, “Wasserstein geometry of Gaussian measures,” Osaka Journal of Mathematics, vol. 48, no. 4, pp. 1005–1026, 2011.
27. H.A. Dau et al., “The UCR time series classification archive,” .
28. Nikhil Bhatla et al., “Wormweb,” C. elegans, 2009.
29. A.E. Brown, E.I. Yemini, L.J. Grundy, T. Jucikas, and W.R. Schafer, “A dictionary of behavioral motifs reveals clusters of genes affecting caenorhabditis elegans locomotion,” Proceedings of the National Academy of Sciences, vol. 110, no. 2, pp. 791–796, 2013.
30. A. G. Wilson and R. P. Adams, “Gaussian process kernels for pattern discovery and extrapolation,” in Proc. of ICML, 2013, pp. 1067–1075.
31. G. Parra and F. Tobar, “Spectral mixture kernels for multi-output Gaussian processes,” in Advances in Neural Information Processing Systems 30, pp. 6681–6690. Curran Associates, Inc., 2017.
32. K. R. Ulrich, D. E. Carlson, K. Dzirasa, and L. Carin, “GP kernels for cross-spectrum analysis,” in Advances in Neural Information Processing Systems 28, pp. 1999–2007. Curran Associates, Inc., 2015.
33. J. Hensman, N. Durrande, and A. Solin, “Variational fourier features for Gaussian processes,” Journal of Machine Learning Research, vol. 18, no. 151, pp. 1–52, 2018.
34. M. Lázaro-Gredilla, J. Quiñonero Candela, C. E. Rasmussen, and A. R. Figueiras-Vidal, “Sparse spectrum Gaussian process regression,” Journal of Machine Learning Research, vol. 11, no. Jun, pp. 1865–1881, 2010.
35. F. Tobar, “Band-limited Gaussian processes: The sinc kernel,” in Advances in Neural Information Processing Systems 32, 2019.
36. R. Flamary and N. Courty, “POT Python Optimal Transport library,” 2017.
37. J. Dauxois, A. Pousse, and Y. Romain, “Asymptotic theory for the principal component analysis of a vector random function: some applications to statistical inference,” Journal of multivariate analysis, vol. 12, no. 1, pp. 136–154, 1982.
38. P.T. Fletcher, C. Lu, S.M Pizer, and S. Joshi, “Principal geodesic analysis for the study of nonlinear statistics of shape,” IEEE transactions on medical imaging, vol. 23, no. 8, pp. 995–1005, 2004.
39. A. Petersen, H.G. Müller, et al., “Functional data analysis for density functions by transformation to a Hilbert space,” The Annals of Statistics, vol. 44, no. 1, pp. 183–218, 2016.
40. M. Agueh and G. Carlier, “Barycenters in the Wasserstein space,” SIAM Journal on Mathematical Analysis, vol. 43, no. 2, pp. 904–924, 2011.
41. J. Bigot, R. Gouet, T. Klein, A. López, et al., “Geodesic PCA in the Wasserstein space by convex PCA,” in Annales de l’Institut Henri Poincaré, Probabilités et Statistiques. Institut Henri Poincaré, 2017, vol. 53, pp. 1–26.
42. E. Cazelles, V. Seguy, J. Bigot, M. Cuturi, and N. Papadakis, “Geodesic PCA versus log-PCA of histograms in the Wasserstein space,” SIAM Journal on Scientific Computing, vol. 40, no. 2, pp. B429–B456, 2018.
43. V. Masarotto, V.M. Panaretos, and Y. Zemel, “Procrustes metrics on covariance operators and optimal transportation of Gaussian processes,” Sankhya A, vol. 81, no. 1, pp. 172–213, 2019.
44. S. Lu, G. Mirchevska, S.S. Phatak, D. Li, J. Luka, R.A. Calderone, and W.A. Fonzi, “Dynamic time warping assessment of high-resolution melt curves provides a robust metric for fungal identification,” PloS one, vol. 12, no. 3, pp. e0173320, 2017.
45. A. Rakotomamonjy, A. Traore, M. Berar, R. Flamary, and N. Courty, “Distance measure machines,” arXiv preprint arXiv:1803.00250, 2018.
46. Kilian Q Weinberger and Lawrence K Saul, “Distance metric learning for large margin nearest neighbor classification,” Journal of Machine Learning Research, vol. 10, no. Feb, pp. 207–244, 2009.
47. A. Andoni, P. Indyk, and I. Razenshteyn, “Approximate nearest neighbor search in high dimensions,” arXiv preprint arXiv:1806.09823, 2018.
48. J. Salamon, C. Jacoby, and J. P. Bello, “A dataset and taxonomy for urban sound research,” in Proceedings of the 22nd ACM international conference on Multimedia. ACM, 2014, pp. 1041–1044.
49. C.A. Ratanamahatana and E. Keogh, “Everything you know about dynamic time warping is wrong,” in Third workshop on mining temporal and sequential data. Citeseer, 2004, vol. 32.
50. A. Stisen, H. Blunck, S. Bhattacharya, T. S. Prentow, M. B. Kjærgaard, A. Dey, T. Sonne, and M. M. Jensen, “Smart devices are different: Assessing and mitigatingmobile sensing heterogeneities for activity recognition,” in Proceedings of the 13th ACM Conference on Embedded Networked Sensor Systems. ACM, 2015, pp. 127–140.
51. L. Chizat, G. Peyré, B. Schmitzer, and F.X. Vialard, “An interpolating distance between optimal transport and Fisher–Rao metrics,” Foundations of Computational Mathematics, vol. 18, no. 1, pp. 1–44, 2018.
You are adding the first comment!
How to quickly get a good reply:
• Give credit where it’s due by listing out the positive aspects of a paper before getting into which changes should be made.
• Be specific in your critique, and provide supporting evidence with appropriate references to substantiate general statements.
• Your comment should inspire ideas to flow and help the author improves the paper.

The better we are at sharing our knowledge with each other, the faster we move forward.
The feedback must be of minimum 40 characters and the title a minimum of 5 characters