Abstract
We present a model that can automatically learn alignments between high-dimensional data in an unsupervised manner. Learning alignments is an ill-constrained problem as there are many different ways of defining a good alignment. Our proposed method casts alignment learning in a framework where both alignment and data are modelled simultaneously. We derive a probabilistic model built on nonparametric priors that allows for flexible warps while at the same time providing means to specify interpretable constraints. We show results on several datasets, including different motion capture sequences, and show that the suggested model outperforms the classical algorithmic approaches to the alignment task.
Gaussian Process Latent Variable Alignment Learning
Ieva Kazlauskaite, Carl Henrik Ek, Neill D. F. Campbell
Machine learning is often tasked with learning models of data that comes from several different observations of the same underlying phenomenon. This is challenging as data might be sampled at different and uneven rates, sequences might be collected out of phase, etc. Consider the following scenarios: humans performing a task may take more or less time to complete parts of it; climate patterns are often cyclic, though particular events take place at slightly different times in the year; the mental ability of children varies depending on their age; neuronal spike waveforms contain temporal jitter; replicated scientific experiments often vary in timing. However, most sample statistics, e.g. mean and variance, are designed to capture variation in amplitude rather than phase/timing. This leads to increased sample variance, blurred fundamental data structures and an inflated number of principal components needed to describe the data. Therefore, the data needs to be aligned in order for dependencies such as these to be recovered. Temporal or phase alignment is a necessary but nontrivial task that is often performed as a preprocessing stage before modelling. Traditionally, the notion of sequence similarity comes from a measure of pairwise similarity integrated across the sequences. This local measure of similarity often leads to highly nonconvex optimisation problems, making alignments challenging to learn. In this paper we take a different approach where we encapsulate alignment and modelling within a single framework. By simultaneously modelling the sequences and the alignment we can capture global structure, thereby circumventing the difficulties associated with an objective function based on pairwise similarity.
Methods for learning alignments can broadly be classified into two categories. The first learns a function to warp the input dimension while the second directly learns the transformed sequences. There are several benefits to learning a warping function as it allows us to resample the data and, by constraining the class of functions, we can also incorporate global constraints on the alignment. However, specifying a parametric function is challenging and often results in highly nonconvex optimisation tasks. Directly learning transformed sequences avoids having to specify a parametrisation at the cost of removing all but the most rudimentary global constraints on the warping function, because the optimal alignment is then completely specified by the pairwise similarity. We propose a novel approach that learns the warping function using a probabilistic model. Underpinning our methodology is the use of Gaussian process priors that allow us to approach this learning in a Bayesian framework, achieving principled regularisation without reducing the solution space.
Our proposed model overcomes a number of problems with the existing literature and makes three main contributions:

We model the observed data directly with a generative process, rather than interpolating between observations, which allows us to reject noise in a principled manner.

The generative model of the aligned data allows a fully unsupervised approach that performs simultaneous clustering and alignment.

We use continuous, nonparametric processes to model explicitly the warping functions throughout; this allows the specification of sensible priors rather than unintuitive or heuristic choices of parametrisations.
We demonstrate the efficacy of our approach through quantitative comparisons to the current literature as well as variants of our proposed model.
There has been a significant amount of work in learning alignments from data. Most approaches are based on the assumption of the existence of a pairwise similarity measure between the instances of each sequence. The classical approach to minimising the distance between two sequences is called Dynamic Time Warping (DTW), and is based on computing an affinity matrix of size $N_1 \times N_2$, where $N_1$ and $N_2$ are the lengths of the two sequences to be aligned (Berndt & Clifford, 1994). The solution corresponds to the path through this matrix that leads to the minimal combined pairwise cost. The optimal solution is found by backtracking through the affinity matrix and can be estimated using Dynamic Programming (Müller, 2007).
DTW will find the optimal alignment based on a pairwise distance between each element in two sequences. Such a formulation imposes a number of limitations. DTW returns an alignment but not a parametrised warping. Furthermore, it is not trivial to encode a preference towards different warps as this would be a global characteristic while DTW is a local algorithm.
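The dynamic programming recursion and backtracking step described above can be sketched as follows. This is a minimal illustration for one-dimensional sequences, assuming a squared-difference pairwise cost and the usual three step directions; the function name `dtw` and these choices are assumptions made for the example, not the reference implementation used in the experiments.

```python
import numpy as np


def dtw(a, b):
    """Return (total_cost, path) for classical DTW between 1-D sequences a and b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n, m = len(a), len(b)
    # Pairwise squared differences between all elements (the affinity matrix).
    cost = (a[:, None] - b[None, :]) ** 2
    # acc[i, j] is the minimal accumulated cost of aligning a[:i] with b[:j].
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best_previous = min(acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1])
            acc[i, j] = cost[i - 1, j - 1] + best_previous
    # Backtrack from the final cell to the first one to recover the path.
    i, j = n, m
    path = [(i - 1, j - 1)]
    while (i, j) != (1, 1):
        candidates = [(i - 1, j - 1), (i - 1, j), (i, j - 1)]
        i, j = min(candidates, key=lambda s: acc[s])
        path.append((i - 1, j - 1))
    return acc[n, m], path[::-1]


total_cost, path = dtw([0, 1, 2, 3], [0, 1, 1, 2, 3])
```

Note that the output is a discrete path of index pairs rather than a parametrised warping function, which is the limitation discussed above.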
In its original form DTW only aligns two sequences, but there have been several proposed extensions that allow it to process multiple sequences at once, most notably Procrustes dynamic time warping (PDTW), Procrustes derivative dynamic time warping (PDDTW), and Iterative Motion Warping (IMW) (Keogh & Pazzani, 2001), (Dryden & Mardia, 2016), (Hsu et al., 2005). All of these methods are applied directly in the observation space, which is a limitation when the data contains a significant amount of noise.
The main algorithms that address this limitation are Canonical Time Warping (CTW) and Generalized Time Warping (GTW) (Zhou & de la Torre, 2009), (Zhou, 2012). Both of these approaches perform feature extraction and find a subspace that maximises the linear correlation of data samples. Similarly to our approach, GTW is parametrised using monotonic warping functions. However, in all these methods the spatial alignment and time warping are coupled. Another extension, called Generalized Canonical Time Warping (GCTW), combines CCA with DTW to simultaneously align multiple sequences of multimodal data (Zhou & de la Torre, 2016). GCTW relies on additional heuristic energy terms and on coarse-to-fine optimisation to get the energy method to converge to a good local minimum.
More recently, a deep neural network architecture was employed to perform temporal alignments (Trigeorgis et al., 2016), (Trigeorgis et al., 2017). The proposed method, called Deep Canonical Time Warping (DCTW), performs nonlinear feature extraction and is competitive on larger audiovisual datasets. A different method proposed by Listgarten et al. uses continuous hidden Markov models, where the latent trace is an underlying representation of the set of observable sequences (Listgarten et al., 2005). Haxby et al. introduced hyperalignment, which finds isometric transformations of trajectories in voxel space that result in an accurate match of the time-series data (James et al., 2011). An extension to this model was proposed by Lorbert and Ramadge, who address the issues of scalability and feature extension through the use of the kernel trick (Lorbert & Ramadge, 2012). The authors note that classification accuracy relies on intelligent feature selection.
Similar to our approach, Cui et al. propose an unsupervised manifold alignment method (Cui et al., 2014). It is based on finding alignment by enforcing several constraints such as geometry structure matching, feature matching, geometry preservation and integer constraints. The approach shows promising results but is very computationally expensive. Another nonlinear feature extraction method was proposed by Vu et al.; their method, named Manifold Time Warping, relies on constructing a k-nearest neighbour graph and then performing DTW to align a pair of sequences (Vu et al., 2012).
Another approach to sequence alignment is to use an implicit transformation of the sequences. In (Cuturi et al., 2007; Cuturi, 2011) the authors propose a kernel function that is capable of mapping sequences of different length to an implicit feature space. A similar approach is (Baisero et al., 2015), which describes a range of different kernels on sequences; this method is flexible and allows for learning implicit feature space mappings for sequences of not only different length but also different dimensionality. These methods have been shown experimentally to work very well; however, as the alignment is now implicit we cannot realign sequences, nor are the methods generative, so we cannot construct novel sequences.
A different line of work, often referred to as elastic registration or shape analysis, is considered in the functional data analysis literature. In (Garreau et al., 2014) the authors propose an extension to DTW by replacing the Euclidean distance with a Mahalanobis distance. By having a parametrisable distance function the authors are able to learn mappings of the distance function from a set of paired observations. Kurtek et al. study the group-theoretic approach to warps by using the group of warping functions to describe the equivalence relation between signals (Kurtek et al., 2011). In particular, the authors use the Fisher-Rao Riemannian metric and the resulting geodesic distance to align signals with random warps, scalings and translations. The square root velocity function (SRVF) facilitates the use of the Fisher-Rao distance between functions by estimating the norm between their SRVFs (Srivastava et al., 2011), (Kurtek et al., 2012). Tucker et al. proposed a generative model that combines elastic shape analysis of curves and functional principal component analysis (fPCA) (Tucker et al., 2013). Another recent extension called Elastic functional coding relies on trajectory embeddings on Riemannian manifolds and results in a manifold functional variant of PCA (Anirudh et al., 2015).
Our model makes use of Gaussian processes as priors over functions and this section provides a brief introduction. A Gaussian process (GP) (Rasmussen & Williams, 2005) is a random process that can be considered as the infinite-dimensional generalisation of the Gaussian distribution. GPs have been used extensively in machine learning to specify nonparametric priors over the space of functions, thereby allowing for principled Bayesian inference. A GP is fully specified by a mean function $\mu(\cdot)$ and a covariance function $k(\cdot, \cdot)$, i.e. $f \sim \mathcal{GP}(\mu, k)$. Thus, given a finite set of inputs $\mathbf{x} = \{x_1, \dots, x_N\}$, we may draw samples from the GP prior: $f(\mathbf{x}) \sim \mathcal{N}(\mu(\mathbf{x}), K)$, where $K_{ij} = k(x_i, x_j)$.
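Drawing from a GP prior at a finite set of inputs reduces to sampling a multivariate Gaussian, which the following minimal NumPy sketch illustrates. It assumes a zero mean function and a squared exponential covariance; the function names, hyperparameter values and the small diagonal jitter added for numerical stability are all choices made for this example.

```python
import numpy as np


def se_kernel(x1, x2, variance=1.0, lengthscale=0.3):
    """Squared exponential covariance between two sets of 1-D inputs."""
    diff = x1[:, None] - x2[None, :]
    return variance * np.exp(-0.5 * (diff / lengthscale) ** 2)


def sample_gp_prior(x, n_samples=3, jitter=1e-6, seed=0):
    """Draw n_samples functions from a zero-mean GP prior evaluated at x."""
    rng = np.random.default_rng(seed)
    # Jitter keeps the covariance numerically positive definite.
    K = se_kernel(x, x) + jitter * np.eye(len(x))
    L = np.linalg.cholesky(K)
    # If e ~ N(0, I) then L @ e ~ N(0, K).
    return (L @ rng.standard_normal((len(x), n_samples))).T


x = np.linspace(-1.0, 1.0, 50)
samples = sample_gp_prior(x)  # shape (3, 50), one sampled function per row
```

A shorter lengthscale yields more rapidly varying samples, which is how the prior encodes a preference for smoothness.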
The Gaussian process latent variable model (GPLVM) (Lawrence, 2005) is a model that uses GP priors to learn latent variables. The model assumes that the observed data have been generated from a latent variable through some latent function. By placing a GP prior over this function and marginalising it out, the latent representation of the data can be recovered. The model is very flexible and has been used across a wide range of applications, for example (Grochow et al., 2004), (Urtasun et al., 2005), (Campbell & Kautz, 2014).
In (Snelson et al., 2004) and (Lázaro-Gredilla, 2012) the authors construct a GP with a warped input space to account for differences in observations (e.g. inputs may vary over many orders of magnitude), and show that a warped GP finds the standard preprocessing transforms, such as the logarithm, automatically. In comparison, our approach leads to a warped output space of the GPLVM, and uses the additional knowledge of possible misalignments in the high-dimensional space to regularise the problem of building a low-dimensional latent space. We will now proceed to show how GPs and the GPLVM naturally lend themselves to learning alignments.
Alignment learning is the task of recovering a set of monotonic warping functions that has been used to create samples of a latent unobserved sequence. Assuming that we have noisy observations we consider each sequence to be generated from a latent function and a latent input variate . We will further assume that the samples have been corrupted by additive Gaussian noise,
(1) 
where . We define a warping function that remaps a fixed uniform sampling rate , which is the same for all observations, such that,
(2) 
Additionally, without loss of generality, we define to be the sampling rate of the unobserved aligned sequence,
(3) 
the latent sequence from which the observations have been drawn.
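The generative assumptions of (1) to (3) can be illustrated with a small simulation: a shared latent function is evaluated at warped versions of a fixed uniform sampling and corrupted by additive Gaussian noise. The particular latent function, the power-law warps and all names below are hypothetical choices made only for this sketch; in the model itself both the generating and the warping functions are unknown and inferred.

```python
import numpy as np


def latent_function(x):
    """Hypothetical smooth latent function, chosen only for illustration."""
    return np.sin(2.0 * np.pi * x) + 0.5 * np.cos(4.0 * np.pi * x)


def power_warp(x, exponent):
    """Hypothetical monotonic warp of [0, 1] onto itself (exponent > 0)."""
    return x ** exponent


def simulate(n_points=100, exponents=(0.5, 1.0, 2.0), noise_precision=100.0, seed=0):
    rng = np.random.default_rng(seed)
    x = np.linspace(0.0, 1.0, n_points)       # fixed uniform sampling
    aligned = latent_function(x)              # unobserved aligned sequence
    noise_sd = 1.0 / np.sqrt(noise_precision)
    warps, observations = [], []
    for exponent in exponents:
        warped_x = power_warp(x, exponent)    # warped input locations
        noise = noise_sd * rng.standard_normal(n_points)
        observations.append(latent_function(warped_x) + noise)
        warps.append(warped_x)
    return x, aligned, np.array(warps), np.array(observations)


x, aligned, warps, observations = simulate()
```

Alignment learning is the inverse problem: given only `observations`, recover `warps` and `aligned`.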
We note that this implies different representations of the aligned signal. This is beneficial as we do not need to assume the noise to be equal across each sample. This means that we parametrise our warping function as the transformation from the known fixed sampling of the unknown aligned sequence to the unknown sampling of the known observations for each sequence, as shown in Fig. 1.
In addition, by using a latent variable model to represent the aligned data, we are able to perform clustering and alignment simultaneously. This means, unlike many previous methods that require the user to specify the target sequence to which all others are aligned, our model automatically detects the clusters in the dataset and aligns the sequences within each cluster.
From the observed data, we wish to recover both the warpings and the aligned signal. We will now proceed to derive a probabilistic formulation of the above model. To help with the derivation, we introduce two new random variables: one for the realisation of the generating function at its input location, and one for the corresponding output of the warping function.
Now we wish to specify a prior over the generating and warping functions. Specifying a parametric mapping is challenging and it severely limits the possible functions we can recover. In this paper, we make use of flexible nonparametric Gaussian process priors, which allow us to provide significant structure to the learning problem without reducing the possible solution space.
Thus far, our model does not describe the aligned sequence in a shared frame and so it currently decomposes to independent models; there is nothing to encourage to match for . To bring these models together we introduce an additional latent variable that generates the set of estimated aligned sequences through a latent mapping as,
(4) 
where and . Taking the same approach as the treatment of and , we introduce the output of the function as a random variable . This leads to the graphical model shown in Fig. 2 that corresponds to the following joint distribution,
(5) 
where the blue terms denote Gaussian likelihoods and the red terms denote Gaussian process priors.
We encode our preference for smooth warping functions by making an autoregressive Gaussian process prior (Wang et al., 2008). We can ensure monotonicity by an appropriate parametrisation of estimated . Without loss of generality, these are constrained to be monotonic in the range using a set of auxiliary variables such that
(6) 
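One common way to obtain a monotonic set of values from unconstrained auxiliary variables is to pass them through a softmax and take a cumulative sum. The sketch below shows this construction; it is an assumption made for illustration and is not claimed to reproduce the exact form of (6). The default range of [-1, 1] and the names `monotonic_warp`, `lower` and `upper` are likewise hypothetical.

```python
import numpy as np


def monotonic_warp(u, lower=-1.0, upper=1.0):
    """Map unconstrained values u to increasing points spanning [lower, upper]."""
    u = np.asarray(u, dtype=float)
    # Softmax gives positive increments that sum to one.
    e = np.exp(u - u.max())
    increments = e / e.sum()
    # The running sum of positive increments increases from 0 to 1.
    cumulative = np.concatenate(([0.0], np.cumsum(increments)))
    return lower + (upper - lower) * cumulative


g = monotonic_warp(np.zeros(4))  # equal increments give a uniform grid
```

Because the constraint is built into the parametrisation, the auxiliary variables can be optimised with an unconstrained gradient method.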
Thus, our unified approach simultaneously describes a model of the data through the generative mappings , a model of the warpings through , and a model of the underlying aligned sequences through . Importantly, all warping functions are continuous and generative which means we can easily resample the data. In Table 1 we contrast our proposed method with previous work.
We will now discuss how the proposed model can be fit to data. Due to the noise assumption and the priors all being Gaussian we can integrate out the latent functions , and in closed form.
(7)  
The properties of (2) and (3) are encoded as
(8)  
where is the covariance function, the precision of additive Gaussian noise and is the identity matrix of size . The unknown parameters, which will be inferred, are highlighted in blue.
Looking at the marginal distributions, we can see that the top row of (8) denotes a standard GPLVM where the input locations are inferred from the observed data. The bottom row is a regression model with unknown outputs from observed inputs and is therefore ill-defined when considered in isolation.
As discussed above, these are brought together using the alignment GPLVM. The set of aligned estimates is generated from a latent variable via a Gaussian process,
(9) 
with covariance function , noise precision , and we place a standard prior of .
We take the MAP estimate over the unknown latent variables and hyperparameters in our overall objective function of
(10) 
where the priors over the hyperparameters are all uninformative log-Normal distributions with zero mean and unit variance. We also place an additional prior on the raw sample points to encourage smooth warps and improve training as,
(11) 
We implemented our model using the TensorFlow (Abadi et al., 2015) framework and minimized the negative MAP objective of (10) using the Adam optimizer (Kingma & Ba, 2014). We used standard squared exponential covariance functions for all the Gaussian process priors, for example,
(12) 
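For reference, the squared exponential covariance between two scalar inputs is commonly written in the following form. The symbols $\sigma^2$ (signal variance) and $\ell$ (lengthscale) are assumed names and may differ from the notation used in (12).

```latex
k(x, x') = \sigma^2 \exp\!\left( -\frac{(x - x')^2}{2\,\ell^2} \right)
```

Both hyperparameters are positive, which is consistent with placing log-Normal priors on them.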
The complexity of our method is limited by the inversion of the covariance matrices and therefore scales with $\mathcal{O}(N^3)$, where $N$ is the size of the covariance matrix. However, there are standard sparse approaches available to scale to longer sequences. We also implemented the sparse variational method of Titsias (Titsias, 2009), which reduces the complexity to $\mathcal{O}(NM^2)$, where $M$ is a specified number of inducing points for the sparse approximation. This method performed well, with very little drop in performance for $M$ an order of magnitude smaller than the full $N$.
Our proposed model is fully nonparametric and models both the warping and the generating functions. A central argument in this paper is the benefit of modelling both the warping and the data at the same time. Methods that rely on the standard metric in the input space are ill-posed and thus require a regularisation term. This leads to an optimisation problem that suffers from poor local minima and relies on the use of a coarse-to-fine approach. Alternatively, the problem may be rewritten in some feature space using a kernel. Then the representer theorem implies that a solution of the resulting regularised problem in the reproducing kernel Hilbert space can be represented as a finite linear combination of kernel products evaluated at the input points. This solution corresponds to the predictive mean of the GP with the kernel as the covariance function of the GP prior (Rasmussen & Williams, 2005; Evgeniou et al., 2000).
In order to highlight the limitations of using the standard metric in the input space, we describe a model that performs a parametric resampling of the data which corresponds to removing our model of the warping but retaining a model of the data. In effect we take a traditional pairwise minimisation approach but include a probabilistic model of the data which will have the effect of regularising the optimisation problem.
We use a parametric resampling function similar to (Zhou & de la Torre, 2016) consisting of monotonically increasing basis functions. For each input sequence, we learn a set of weights. By enforcing that the weights lie on the surface of the probability simplex, the resulting function is guaranteed to be monotonic. The task is now to find the set of weights such that,
(13) 
As we do not have access to , we use the same latent variable model as previously. The model can be learned using gradient descent. We refer to this model as GPLVM+basis in the experiments.
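The monotonicity argument for the parametric warp can be checked with a short sketch: a convex combination of monotonically increasing basis functions is itself monotonically increasing. The power-function bases and the softmax reparametrisation used to keep the weights on the probability simplex are assumptions made for this example and are not necessarily the choices used for GPLVM+basis.

```python
import numpy as np


def basis_functions(x):
    """Hypothetical increasing bases on [0, 1], each mapping 0 to 0 and 1 to 1."""
    exponents = np.array([0.25, 0.5, 1.0, 2.0, 4.0])
    return x[:, None] ** exponents[None, :]   # shape (len(x), n_bases)


def simplex_weights(v):
    """Softmax maps unconstrained parameters onto the probability simplex."""
    e = np.exp(v - np.max(v))
    return e / e.sum()


def basis_warp(x, v):
    """Warp x by a convex combination of the increasing basis functions."""
    return basis_functions(x) @ simplex_weights(v)


x = np.linspace(0.0, 1.0, 50)
warped = basis_warp(x, np.array([0.0, 1.0, -1.0, 2.0, 0.5]))
```

The expressiveness of such a warp is bounded by the chosen set of bases, which is the limitation that the nonparametric warps are intended to remove.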
We demonstrate the efficacy of using the alignment GPLVM to perform simultaneous clustering and alignment by replacing it with an energy minimization objective that is similar to the previous literature, e.g. (Kurtek et al., 2011). The part of the objective (10) is replaced with
(14) 
where is the mean of and is a scaling constant. In the experiments we show the results of this method with the GP warping functions (energy+GPLVM). For completeness, we also consider the method that relies on energy minimisation and on basis function warpings; we refer to this method as energy+basis.
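An energy of this kind typically penalises the deviation of each aligned sequence from the mean sequence. The sketch below assumes a scaled sum of squared deviations; this is one plausible reading of (14) rather than a confirmed reconstruction, and the names `alignment_energy` and `scale` are hypothetical.

```python
import numpy as np


def alignment_energy(aligned, scale=1.0):
    """Scaled sum of squared deviations of each aligned sequence from their mean."""
    aligned = np.asarray(aligned, dtype=float)   # shape (n_sequences, n_points)
    mean_sequence = aligned.mean(axis=0)
    return scale * np.sum((aligned - mean_sequence) ** 2)


energy = alignment_energy([[0.0, 0.0], [2.0, 2.0]])
```

Such an objective pulls every sequence towards a single mean, so it cannot by itself distinguish several clusters of sequences.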
The parametric model described above, as well as some previous approaches, rely on hand-picked basis functions to define the warps. This results in poor accuracy when the set of basis functions is small and in high computational complexity when the set is large. We also note that both our model and GPLVM+basis satisfy the same boundary conditions and monotonicity constraint as DTW (Berndt & Clifford, 1994). Due to the smoothness of the warps in both the parametrised and the nonparametric cases, we are able to avoid the degenerate cases where many consecutive elements of one sequence are aligned to a single element in the other sequences. Moreover, the choice of basis functions in the parametric models and the structure of the GPLVM in the nonparametric case correspond to imposing a global constraint region in DTW.
We will now discuss the experimental evaluation of our proposed model. We show comparisons to current state-of-the-art approaches from the data mining and functional data analysis communities using publicly available reference implementations (see (Zhou & de la Torre, 2018) for the implementation of DTW, DDTW, IMW, CTW, GTW, and (Srivastava et al., 2018) for the implementation of SRVF). The accuracy is primarily measured in terms of the warping error, i.e. the MSE between the known true warps and the estimated warps, as the alignment accuracy is easily misinterpreted since it is a local measurement. In particular, it does not capture the degenerate cases where peaks and valleys (i.e. local maxima and minima) in the input sequences are shifted to non-corresponding extrema; this is particularly true in datasets with periodic components. Other examples of degenerate behaviour are multiple dimensions collapsing to a single point and warps that rely on translating and rescaling every input in each dimension, which leads to overfitting (an example of this is IMW alignment (Zhou, 2012)). All of these result in high alignment accuracy but produce poor quality results.
For this experiment, we use the dataset proposed by Zhou and De la Torre (Zhou, 2012). It consists of sequences that are generated by temporally transforming latent 2D shapes under known warping transformations that allow quantitative evaluation of the estimated warps. To better assess the quality of the alignments, we run tests with randomly selected size of the dataset, dimensionality and temporal transformations. Our approach outperforms other methods on these datasets, see Table 2 and Fig. 4, and produces accurate alignments irrespective of the size of the dataset, dimensionality and structure of the sequences. SRVF is the best performing previous method. It constrains the set of acceptable warps by specifying a population mean and Karcher means (Kurtek et al., 2011). By contrast, our method places a prior on the warps favouring solutions with desirable properties (such as smoothness) over any other possible solutions.
The variant of our method that uses parametric warps (gplvm+basis) performs competitively on these datasets, motivating the use of a Gaussian process objective for alignment. Furthermore, we see that our nonparametric approach to modelling the time warps improves the flexibility of the model. In particular, out of the two models that rely on energy minimisation as the alignment objective, energy+basis and energy+gplvm, the latter demonstrates lower warping error and significantly lower standard deviation on this dataset. This result supports the premise that even though the nonparametric representation allows for any smooth monotonic warp, the probabilistic framework places sufficient structure to make the problem well posed and avoid overfitting. Examples of warps and alignments produced in this experiment are available in the Appendix.
Dataset no  1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  Mean 

13  10  10  7  13  6  6  12  3  10  13  8  6  10  7  5  5  14  8  3  8  11  7  6  9  
258  157  107  246  169  131  92  144  138  298  146  240  204  213  157  230  247  196  248  277  141  153  83  285  178  
PDTW  9.32  14.38  5.88  12.42  10.57  10.03  2.30  6.60  2.41  16.99  9.84  13.47  4.95  16.72  4.00  6.97  14.03  6.57  15.01  1.98  4.44  5.16  5.05  13.19  10.85  8.93 
PDDTW  10.55  14.67  6.80  13.86  12.18  10.30  2.58  7.35  3.95  24.59  10.68  15.45  6.57  18.60  5.54  8.35  14.91  7.70  16.38  5.37  6.11  7.43  5.44  14.02  11.17  10.42 
PIMW  26.61  19.27  13.36  24.59  21.16  14.75  6.48  22.14  5.36  38.54  18.06  26.72  16.77  28.39  9.54  22.31  21.23  22.35  27.54  11.13  12.05  18.45  9.06  30.40  16.54  19.31 
PCTW  12.12  15.30  9.50  18.89  15.45  11.32  2.77  10.58  2.66  17.26  9.98  24.72  8.04  17.19  6.23  10.21  16.03  10.73  15.17  7.12  5.88  5.59  6.00  16.92  10.98  11.47 
GTW  6.52  9.54  6.54  6.96  7.50  3.23  4.27  10.63  0.86  33.31  2.55  21.28  5.35  4.72  3.38  23.28  8.50  10.70  2.83  3.93  2.74  2.56  5.45  15.70  2.63  8.20 
SRVF  6.74  3.41  4.73  8.06  6.94  4.14  2.14  3.34  3.10  8.65  5.13  5.33  3.57  7.37  5.78  7.02  4.87  4.71  5.54  5.12  3.82  4.55  2.22  4.53  7.04  5.11 
energy+basis  9.91  5.08  4.69  6.12  6.89  3.06  2.10  5.33  0.98  14.45  6.64  6.57  2.10  10.38  2.74  3.83  5.13  7.94  5.67  1.37  4.42  5.79  2.12  5.00  4.88  5.33 
gplvm+basis  5.85  3.09  2.98  5.29  3.65  2.24  1.31  3.53  0.87  1.85  3.67  3.68  1.49  5.58  1.58  3.55  3.69  2.13  3.12  1.22  3.20  3.59  1.35  1.72  1.90  2.88 
energy+gplvm  6.80  2.45  3.29  6.35  5.31  2.58  1.39  3.23  0.97  4.54  2.97  4.34  2.79  3.37  3.16  3.22  4.12  3.48  2.64  2.07  3.94  2.69  2.20  2.18  2.65  3.31 
ours  4.39  4.55  1.79  1.93  2.34  1.91  1.23  4.68  0.84  3.49  2.11  4.94  3.47  3.14  1.40  3.03  2.01  2.57  2.10  1.48  2.20  1.93  1.31  2.87  2.11  2.55 
In our second experiment, we consider a dataset that contains multiple clusters of sequences. This task requires the sequences to be aligned within each cluster. Neither PDTW, PCTW and GTW nor the energy minimisation methods are able to perform this task, as they have no knowledge of the underlying structure of the dataset. The SRVF algorithm performs clustering by first aligning the data in terms of amplitude and phase, then performing fPCA based on the estimated summary statistics, and finally modelling the original data using joint Gaussian or nonparametric models on the fPCA representations. We compare the performance of the SRVF algorithm with our approach as well as the variant of our approach with fixed basis functions.
We consider a dataset that contains three distinct groups of functions that were generated by temporally transforming three random 2D curves as described in the previous experiment. All three approaches rely on the structure of the data alone to recognise the existence of the clusters, and Fig. 3 shows that all three methods are able to align the data within clusters.
The performance of the methods is contrasted by calculating the mean squared error (MSE) among all pairs of sequences within each group (alignment error) and the MSE between the true warping functions and the warping functions calculated using each of the methods (warping error). For this comparison we repeat the test multiple times with randomly selected initial curves, number of dimensions and number of sequences per group. The quantitative comparison in Table 3 shows that our method achieves the lowest alignment error, with the lowest standard deviation (SD) across the datasets. Our method, as well as the parametric variant of it, also achieves low warping errors in comparison to SRVF, which implies that they are able to reconstruct the original temporal transformations more accurately than SRVF. This behaviour is apparent in Fig. 3, where the warping functions produced by our method, and the parametric version of it, resemble the true warps while SRVF estimates noticeably different warping functions; this results in unpredictable distortions in the aligned dataset. These results reflect the differences between the SRVF method and our approach: while SRVF is cast as an optimisation problem over a constrained domain, the domain of our probabilistic formulation is much larger but, importantly, structured by the assumptions encoded in the prior. This provides a better regularisation, ultimately leading to the improvement in the recovered warpings.
MSE (SD)  srvf  gplvm+basis  Ours 

Alignment  6.4 (1.7)  8.4 (2.7)  5.9 (1.1) 
Warping  30.0 (10.4)  9.7 (4.9)  9.7 (5.7) 
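The two error measures reported in the table can be sketched as follows. The code assumes that all sequences within a group share the same length and that the true and estimated warps are sampled at the same locations; any additional normalisation or scaling used to produce the reported numbers is not known here, so the function names and exact averaging are assumptions.

```python
from itertools import combinations

import numpy as np


def alignment_error(aligned):
    """Mean of the pairwise MSEs between all aligned sequences in one group."""
    aligned = np.asarray(aligned, dtype=float)   # needs at least two sequences
    pair_mses = [
        np.mean((aligned[i] - aligned[j]) ** 2)
        for i, j in combinations(range(len(aligned)), 2)
    ]
    return float(np.mean(pair_mses))


def warping_error(true_warps, estimated_warps):
    """MSE between the true and the estimated warping functions."""
    true_warps = np.asarray(true_warps, dtype=float)
    estimated_warps = np.asarray(estimated_warps, dtype=float)
    return float(np.mean((true_warps - estimated_warps) ** 2))


a_err = alignment_error([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
w_err = warping_error([[0.0, 1.0]], [[0.0, 0.5]])
```

A low alignment error alone does not guarantee a good solution, since degenerate warps can collapse the sequences onto each other; this is why the warping error is reported alongside it.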
We evaluate the performance of our model on a set of motion capture data from the CMU database (Lab, 2016), where each input sequence corresponds to a short clip of motion and the data is represented as quaternion locations of the joints of the subject performing the motion. In all of our experiments, we use the motions of subject no. from the CMU dataset that correspond to golf-related motions such as a swing, a putt, and placing and picking up of a ball.
Our first experiment contains five instances of three different motions that need to be temporally aligned within the three groups. Fig. 5 illustrates how our model favours the simplified, i.e. aligned, inputs. The corresponding manifolds produced using a traditional GPLVM (i.e. without alignment) and a manifold produced using our approach are shown in Fig. 6. Our model produces a fine alignment of the input sequences within each of the groups, and consequently the resulting two-dimensional manifold offers a good separation of the three groups. We note that the manifold produced using GPLVM without alignment contains more isolated areas, which means the model is less capable of generalising between the warps. Therefore, our implicitly aligned model is able to generate smoother transitions in the manifold, producing high quality predicted outputs of novel alignments.
In the second experiment we use the full set of joint motions to align a set of sports actions. In Fig. 7 we provide an illustration of the power of using a generative model for alignment. New locations in the manifold encode novel motion sequences that are supported by the data. Allowing the model to align the data greatly improves its generative power, as the model is capable of producing a wider range of plausible motions.
We have presented an extension to the traditional GPLVM that is able to implicitly align inputs that contain temporal variations. Our approach models the observed data directly, producing a generative model of the functions rather than interpolating between observations. In addition, using a GPLVM for alignment builds an unsupervised generative model that has the benefit of simultaneously clustering and aligning the input sequences. Furthermore, we proposed a continuous, nonparametric explicit model of the time warping functions that removes issues such as quantisation artefacts and the need for ad hoc preprocessing. We demonstrated that the proposed approaches perform competitively on alignment tasks, and outperform the existing methods on the task of simultaneous alignment and clustering. In the future we will consider the use of Bayesian GPLVM for automatic model selection and will test the framework on additional datasets, including multimodal data.