A Multiple Continuous Signal Alignment Algorithm with Gaussian Process Profiles and an Application to Paleoceanography

Abstract

Aligning signals is essential for integrating the fragmented knowledge contained in each signal and for resolving signal classification problems. Motif finding, or profile analysis, is a preferred method for multiple signal alignment and can be classified into two categories, depending on whether the profile is constructive or latent. Existing methods in these categories have some limitations: constructive profiles are defined over finite sets, and inferred latent profiles are often too abstract to represent the integrated information. Here we present a novel alignment method, the multiple continuous Signal Alignment algorithm with Gaussian Process Regression profiles (SA-GPR), which addresses the limitations of currently available methods. We present a novel stack construction algorithm as an example of our SA-GPR in the field of paleoceanography. Specifically, we create a dual-proxy stack of six high-resolution sediment cores from the Northeast Atlantic using alignments based on both radiocarbon age estimates and the oxygen isotope ratio of benthic foraminifera, which is a proxy for changes in global ice volume and deep-water temperature.

Keywords: paleoceanography, benthic stack, radiocarbon, alignment algorithm, Gaussian process regression, particle smoothing, Markov-chain Monte Carlo, expectation propagation, variational Bayesian method

1 Introduction

Problems in statistical learning theory often involve pairs of independent and dependent variables where the independent variables are defined on an ordered field (i.e., they can be arranged). These problems occur in a diverse range of subjects: biological sequences (DNA, RNA, proteins), sequential economic events (annual price levels or asset prices), digitized music (consecutive electrical signals), human languages (word sequences), and climate forecasting (a sequence of events leading to the future).

If the dependent variables can be “generated” or “observed” through time, regularly indexed in chronological order, then the independent and dependent variables form a “time series”. If the independent variables are linearly structured in discrete positions, we say that the consecutive pairs of those variables form a “sequence”. If both independent and dependent variables are defined across continuous spaces, the independent variable is irregularly spaced, and the dependent variable is expected to be noisy yet convey information about a phenomenon, the term “continuous signal” can be used. These terms are conceptually identical; thus, in this paper we use the term “signal”. Furthermore, we primarily use the terms “inputs” and “outputs” to represent independent and dependent variables, respectively.

Figure 1: An example of multiple continuous signal alignment. The red signal is used as the reference to which the others are aligned.

Alignment problems involve measuring signal similarity and acquiring comprehensive information from the fragmented knowledge stored in each signal. A set of noisy signals can be aligned to average out their noise, which is useful in image processing or speech recognition problems. Classification of signals is inseparable from the alignment problem because alignment is how their similarities are measured. For example, to identify a single-nucleotide polymorphism (SNP) in a DNA sequence that could be the source of a genetic variant, the sequence is aligned to a reference genome [45, 3]. Similarly, in the field of paleoclimatology, isotopic signals measured in ocean sediment cores are often aligned to form a more complete record of past climate change [47].

Multiple signal alignment is not a straightforward extension of the pairwise case: it quickly becomes computationally expensive. Suppose that N signals are given. Then there are N(N-1)/2 possible signal pairs, implying a quadratic relationship between the number of possible pairwise alignments and the number of signals. In addition, the direct pairwise alignment between two signals may differ from the indirect alignment of the same two signals via a third signal. There are various methods to deal with this multiple signal alignment problem. Dynamic programming [11, 44] attempts to find the best alignment among any number of signals. Though it is designed to return the optimal alignment regardless of the number of signals, the algorithm can be intractable in time and memory in practice. Progressive methods [24, 78, 60, 77] construct trees by first aligning pairs of similar signals and then successively connecting the groups of alignments until all signals are incorporated.

Existing methods for pairwise alignments are limited in some ways. Alignment algorithms often assume that the signals are observed regularly and are not robust if the inputs are spaced unevenly. Unlike biological sequences, where the inputs are discrete positions, alignments between signals may in general be defined over a continuous space. If so, algorithms like the Needleman-Wunsch algorithm, which align by inserting or deleting bases, are not appropriate because no tuple of outputs from each signal can be matched perfectly.

Motif finding [4], or profile analysis [20], circumvents these limitations with a global model, called a motif or profile, that is assumed to generate or cause each signal. Under this approach, alignment software attempts to decipher how each signal was generated or caused by the profile. Motif finding algorithms can be classified into two categories depending on whether the motifs (profiles) are constructive or latent. On the one hand, studies that employ constructive profiles [20, 49, 81, 2] try to construct a profile from the signals using the EM algorithm; each element of the profile space is assumed to generate the aligned pairs of the signals. On the other hand, studies using latent profiles [33, 19] assume a latent shared structure based on a Gaussian process prior which implicitly generates each signal. Feature extraction methods [88, 84, 82] extract features from the signals, which are used as latent shared structures, to maximize pairwise linear correlations with linear or nonlinear transformations of data representations in the framework of dynamic time warping (DTW).

Both of these previous approaches to motif finding have some limitations. Existing constructive algorithms assume that the profile is defined over a discrete space, which, as noted above, is problematic: if both inputs and outputs are defined on continuous spaces, the user of such discrete algorithms must choose a resolution. If one chooses a higher resolution while the data are of lower resolution or are not evenly spaced, then the algorithm has to infer missing values in places where no data are available.

Combining information from multiple records is important in many applications. In studies of past climate change, for example, global climate records from ocean sediment cores (specifically, the oxygen isotope ratio of calcite from benthic foraminifera) are aligned primarily for three reasons. First, the signal-to-noise ratio of multiple records is higher than that of an individual record and provides information about changes in global ice volume and deep-water temperature. Second, age vs. depth functions, or “age models,” are often constructed by aligning the signal in an input core without age constraints to a synchronous signal in a target core that has an age model. Third, signal differences in cores from different ocean regions can provide information about past changes in ocean circulation.

Existing probabilistic synchronization algorithms iteratively construct a profile of global climate proxies by aligning the sediment core depths in each record to the profile until convergence is achieved [2]. These algorithms are discrete in time, which is problematic because some input records have high-resolution data while others are of much lower resolution; consequently, interpolation is required. In addition, the resultant profile does not capture the autoregressive characteristics of these events.

In this paper, we introduce a novel method that deals with all the above limitations. We describe a continuous-time alignment algorithm and an algorithm for continuous-time profile construction. This profile is represented by a Gaussian process regression model that captures the autoregressive nature of the underlying signals and requires only a modest number of input records. We also illustrate how this profile can be employed for alignment and age inference of additional signals. Moreover, the dynamics of the inputs are reflected without any loss of information in the learning procedure. In section 2, some existing methods of constructive motif finding are briefly discussed. Section 3 covers the statistical learning tools that we have adopted. Our method, which we call the multiple continuous Signal Alignment algorithm with Gaussian Process Regression profiles (SA-GPR), is described in section 4. Section 5 gives paleoclimate applications, including construction of a Deep Northeast Atlantic (DNEA) stack using six ocean sediment cores. The discussion and conclusion are in sections 6 and 7, respectively.

2 Existing Methods

Below is an explanation of the notation we use here. Suppose there are signals to align and each signal has consecutive pairs (i.e., it consists of steps) of inputs and outputs.

  • : inputs defined on . For example, could be a set of positions for DNA sequences, depths for ocean sediment cores, or years for historical economic data.

  • : outputs defined on , each . For example, could be a set of bases/amino acids for DNA/protein sequences, tuples of some proxies for ocean sediment cores, or multiple (unitless) indices for historical economic data.

  • : latent series defined on for the synchronization.

  • : the emission model of the outputs given its latent series.

  • : the transition model of the latent series given the inputs.

Let us assume further that the outputs are conditionally independent given the latent series and that the latent series is Markovian, as follows:

(1)
(2)

The goal is to infer so that all signals are arranged in a single ordered field . In addition, this algorithm allows outputs to be scaled and shifted, thus additional signal-specific parameters to regularize each must be considered.

2.1 Profile Hidden Markov Models

Profile hidden Markov models (HMMs) have been widely used in computational biology [20, 21, 23] and in the acoustics of human speech recognition [5, 31, 25]. A profile HMM attempts to construct a profile, or motif in some of the literature, from a set of signals and uses it as an alignment reference. It assumes that the latent series is defined on a finite set, so each emission model is stored in the form of a matrix over that set. The algorithm consists of alignment and profile construction steps. Once it converges, the obtained profile can be used as a reference for alignment or regarded as a kind of summary of the signals. If the profile has already been given, only the alignment step is required to infer the latent series.

Though this method is theoretically justifiable and clearly reflects all the dynamics of the latent series in the form of transition models, in practice there are critical drawbacks. First, the performance depends on how dense the finite set is relative to the latent series, unless the latter is also defined on a finite set. Second, learning the emission model may require a large number of signals because it is nonstationary. Third, the time complexity grows with the size of the finite set, meaning that one cannot simply make it denser to improve the performance without a substantial computational cost. Finally, learning f may not be robust if the underlying dynamics of the latent series are defined on a continuous space and must also be learned during the alignment step.

2.2 Dynamic Time Warping

Dynamic time warping (DTW) algorithms [55] measure similarities between two signals by finding their optimal match, or warping path, under a given cost function and some intuitive restrictions (e.g., prohibiting time reversals). The core idea of DTW is conceptually identical to the Needleman-Wunsch algorithm [59] in bioinformatics. It searches for the optimal warping path for the pair of query signals by solving the associated dynamic program and returns the associated score.
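To make the dynamic-programming recursion concrete, the following is a minimal sketch of DTW between two one-dimensional signals, assuming a squared-difference cost; the function name and the toy signals are ours, chosen only for illustration, and are not taken from the cited works.

import numpy as np

def dtw(a, b):
    """Minimal DTW: returns the optimal warping cost and path between 1-D signals a and b."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (a[i - 1] - b[j - 1]) ** 2
            # extend the cheapest of the three allowed predecessors (no time reversal)
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    # backtrack to recover the warping path
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return D[n, m], path[::-1]

cost, path = dtw(np.sin(np.linspace(0, 6, 50)), np.sin(np.linspace(0.5, 6.5, 70)))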

Because DTW was originally designed for pairwise alignments, the constructed profile tends to be defined as a model signal, for example one whose outputs are averages derived from the query signals. With some variations, DTW barycenter averaging (DBA) methods [64, 63] try to obtain such a centroid signal, which is then used as the profile for alignment.

However, DTW also has its limitations. First, DTW has trouble properly reflecting the dynamics of the latent series if the profile is not of high resolution. Second, results can depend strongly on the user-specified input signal outputs and are often forced to match outputs which are apparently asynchronous. Third, the performance is sensitive to the penalties and loss functions in the transition and emission models. Fourth, the method is deterministic and does not reflect alignment uncertainty (which is beneficial to capture if there are sub-optimal paths). Finally, DTW has difficulty addressing signals with heteroscedastic noise because of the limitations of a loss function that is defined directly on the outputs.

2.3 Stochastic Alignment Process

A hybrid approach combining the HMM and DTW, called the stochastic alignment process (SAP), has been discussed in several studies [32, 57, 58, 15, 51]. Unlike DTW, it is a probabilistic alignment algorithm since the transition and emission models are probabilistic. SAP aims either to obtain the most likely warping path (Viterbi path) or to sample warping paths given the data using a forward-backward algorithm. SAP attempts to obtain pairwise alignments of signals depending on the outputs chosen in each signal to be aligned. This stochastic alignment process would be equivalent to the profile HMM discussed in section 2.1 if it searched for the alignment between each signal and the reference as the profile. In any case, SAP is designed for discrete time and states, as discussed above, so it is not well suited for continuous alignments and shares the limitations of HMMs and DTW.

3 Preliminaries

Before proceeding to the modeling, some tools of statistical learning theory must be covered briefly.

3.1 Particle Smoothing

The forward-backward algorithm in the HMM calculates exact posterior inferences given the data but is only applicable if the latent series is defined on a finite set. For continuous states, Kalman smoothing [34] provides an exact method in discrete time, but only under the assumption that both the process and observation models are linear with homoscedastic Gaussian noise. However, except for some limited extensions [12, 86], Kalman smoothing cannot be used in general cases (e.g., when the observation noise follows a more robust distribution, such as a Student’s t-distribution, or when the process dynamics are not Gaussian).

Particle smoothing [18, 36] is a variational algorithm based on sequential importance sampling. Roughly speaking, it is a generalization of the HMM that deals with continuous latent series in discrete time. The forward algorithm samples a set of candidates, or “particles”, from a proposal distribution at each step and computes weights on those particles to approximate the forward posterior with an empirical distribution. Formally, it is expressed as follows:

(3)

where the former denotes the sampled particles at the step and the latter the associated weights, which sum to 1.

Then, the forward posterior of the next step is updated iteratively as follows:

(4)

, where and:

(5)

The backward sampling algorithm samples each latent entry given the next one as well as all inputs and outputs, iterating until a complete latent series is sampled. Formally, it is expressed as follows:

(6)

Note that particle smoothing reduces to an HMM if the proposal distribution is set to have the same finite support and all elements in the support are sampled once as particles.
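As an illustration of the forward weighting (3)-(5) and the backward sampling (6), here is a minimal sketch for a toy Gaussian random-walk state-space model with a bootstrap proposal (i.e., sampling candidates from the transition model itself); the toy model, parameter values, and variable names are ours and merely stand in for the emission model g and transition model f of section 2.

import numpy as np

rng = np.random.default_rng(0)

# toy model: z_t = z_{t-1} + Normal(0, q^2),  y_t = z_t + Normal(0, r^2)
q, r, T, P = 0.3, 0.5, 100, 500              # process noise, observation noise, length, particles
z_true = np.cumsum(rng.normal(0.0, q, T))    # simulate a latent series and noisy outputs
y = z_true + rng.normal(0.0, r, T)

def log_g(y_t, z):                           # emission log-density (up to a constant)
    return -0.5 * ((y_t - z) / r) ** 2

# forward pass: resample, propagate through the transition, reweight by the emission
particles = np.zeros((T, P))
weights = np.zeros((T, P))
particles[0] = rng.normal(0.0, 1.0, P)
w = np.exp(log_g(y[0], particles[0]))
weights[0] = w / w.sum()
for t in range(1, T):
    idx = rng.choice(P, size=P, p=weights[t - 1])          # resample ancestors
    particles[t] = particles[t - 1, idx] + rng.normal(0.0, q, P)
    w = np.exp(log_g(y[t], particles[t]))
    weights[t] = w / w.sum()

# backward pass: draw one complete latent series from the approximate smoothing distribution
z_sample = np.zeros(T)
j = rng.choice(P, p=weights[-1])
z_sample[-1] = particles[-1, j]
for t in range(T - 2, -1, -1):
    # reweight step-t particles by the transition density to the already-sampled z_{t+1}
    bw = weights[t] * np.exp(-0.5 * ((z_sample[t + 1] - particles[t]) / q) ** 2)
    bw /= bw.sum()
    j = rng.choice(P, p=bw)
    z_sample[t] = particles[t, j]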

Also, because particle smoothing does not compute the exact forward posteriors, this method has limitations that HMMs do not. First, performance depends on the user-specified proposal distributions. Second, a small number of output outliers might ruin the inference (especially if the transition model is too rigid). Lastly, the weights assigned to the particles are often too small to affect the backward sampling algorithm, which can cause trouble when learning the emission and transition models with the EM algorithm.

3.2 Metropolis-Hastings Algorithm

Although particle smoothing can be used to infer latent series in the form of samples on a continuous space, its variational aspects do not guarantee that the sampled latent series are continuous. In practice, only a few distinct particles may appear at certain steps of the sampled latent series. To overcome this drawback, we need to consider algorithms which are designed to return continuous samples.

The Metropolis-Hastings algorithm [54, 28, 52] is a Markov chain Monte Carlo (MCMC) [62] method for gathering random samples from a probability distribution (or a distribution which has not yet been normalized) when direct sampling is not easy. Given the previously sampled latent series, it generates the next sample by first drawing a candidate from a proposal distribution and then calculating the acceptance ratio as follows:

(7)

If the acceptance ratio is greater than or equal to a uniform random number drawn from the unit interval, the candidate is accepted as the next sample; otherwise, the previous sample is retained. Under mild regularity conditions [68], it is guaranteed that the chain eventually converges to the exact posterior of the latent series given the inputs and outputs. Because this convergence is often slow in practice, especially if the initial latent series is in a region of low density, good initialization is usually necessary. In addition, instead of proposing a whole new latent series at once, performance can be enhanced by replacing only some of its entries at a time.
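For concreteness, here is a minimal random-walk Metropolis-Hastings sketch for a generic one-dimensional target; with a symmetric proposal, the proposal densities in (7) cancel, leaving only the ratio of (unnormalized) target densities. The toy bimodal target and all names are ours, not the latent-series sampler used later in section 4.

import numpy as np

rng = np.random.default_rng(1)

def log_target(z):
    # unnormalized log-density of a toy bimodal target (mixture of two unit-variance Gaussians)
    return np.logaddexp(-0.5 * (z - 2.0) ** 2, -0.5 * (z + 2.0) ** 2)

def metropolis_hastings(n_samples, z0=0.0, step=1.0):
    samples = np.empty(n_samples)
    z = z0
    for i in range(n_samples):
        z_new = z + rng.normal(0.0, step)                 # symmetric random-walk proposal
        log_alpha = log_target(z_new) - log_target(z)     # log acceptance ratio
        if np.log(rng.uniform()) <= log_alpha:            # accept with probability min(1, alpha)
            z = z_new
        samples[i] = z
    return samples

chain = metropolis_hastings(5000)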

To sum up, particle smoothing and the Metropolis-Hastings algorithm are complementary. The latter needs the good initialization which the former can generate, but the former is a variational method that does not guarantee complete sampling, so some “perturbations” are required to make its sampled latent series complete. Also, if we have multiple initializations from particle smoothing, “independent” sampled latent series can be gathered by running the Metropolis-Hastings algorithm individually for each initialization.

3.3 Gaussian Process Regression

Figure 2: An example of the heteroscedastic GP regression. Green dots are randomly generated samples, black and red lines are the true and inferred means, respectively. The gray region represents the true 95% confidence band and red dashed lines stand for the inferred one. The heteroscedasticity is inferred by following the method in [41].

Now suppose that a pair is given, and the signal is defined over the real line, and is unidimensional . What can we say about the output at an arbitrary query latent entry ? In other words, how can we define ? This is a regression problem to establish the relationship between independent and dependent variables. Suppose that we have a regression vector and a scalar on and , respectively, with Gaussian noises, i.e.,

(8)

, where is a positive function returning diagonal matrices.

A Gaussian process (GP) [66] is a stochastic process in which every finite collection of random variables has a multivariate normal distribution. Because it defines a distribution over functions on a continuous domain, various models [83, 79, 29, 42, 22, 50, 41] adopt GPs. In our regression model, we place a GP prior on the regression vector, i.e.,

(9)

, where is a kernel covariance function and is a prior continuous function defined on . Note that (9) could be degenerate if either is not injective or .

With (8) and (9) as the likelihood and prior, respectively, one can derive the posterior predictive distribution of given by marginalizing out :

(10)

Note that (10) is always non-degenerate because . The regression model, , is derived from (10) as follows:

(11)

, where:

(12)

We call the resultant model in (11) the GP regression. If the noise function is constant, then the GP regression is homoscedastic; otherwise, it is heteroscedastic. Note that even if the model is homoscedastic, the inferred variances of the GP regression vary (a little) over the input domain as defined in (12), which comes from the uncertainty of the regression vector. Even if the kernel is stationary (so that its diagonal terms are identical), the remaining term varies over the domain. In the heteroscedastic GP regression model, the noise function also contributes to the variation of the variances, as the variance of the outputs given their regression.
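As a minimal sketch of the homoscedastic case of (11)-(12), the following computes the GP posterior predictive mean and variance with a squared-exponential kernel and a zero prior mean; the kernel choice, hyperparameter values, and toy data are ours, chosen only for illustration.

import numpy as np

def sqexp_kernel(a, b, amp=1.0, ls=1.0):
    # squared-exponential covariance between input vectors a and b
    return amp ** 2 * np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ls ** 2)

def gp_predict(x_train, y_train, x_query, noise=0.1):
    """Homoscedastic GP regression: posterior predictive mean and variance at x_query."""
    K = sqexp_kernel(x_train, x_train) + noise ** 2 * np.eye(len(x_train))
    K_s = sqexp_kernel(x_query, x_train)
    K_ss = sqexp_kernel(x_query, x_query)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    mean = K_s @ alpha
    v = np.linalg.solve(L, K_s.T)
    # predictive variance: prior variance minus explained variance, plus observation noise
    var = np.diag(K_ss) - np.sum(v ** 2, axis=0) + noise ** 2
    return mean, var

rng = np.random.default_rng(2)
x = np.sort(rng.uniform(0.0, 10.0, 30))
y = np.sin(x) + rng.normal(0.0, 0.1, 30)
mu, var = gp_predict(x, y, np.linspace(0.0, 10.0, 200))

Even in this homoscedastic setting, the predictive variance varies with the query input through the "explained variance" term, which is precisely the effect noted above.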

Because this model depends directly on the data and does not require a parametric form for the regression line, it is considered a nonparametric regression. The performance is affected by the choice of these functions. It is relatively easy to tune them if the GP regression is homoscedastic (see [66]), but tuning heteroscedastic GP regression remains difficult and inexact unless the noise function is given a priori. For more details, see [40, 35, 38, 41]. To encompass all the advantages of the profile HMM, heteroscedastic GP regression models must be considered because the heteroscedasticity arising from the uncertainty of the regression vector is often not enough to explain the heteroscedasticity of the data.

4 Method

Figure 3: A simple diagram of the SA-GPR. Each box is iterated until convergence.

Now we are prepared. Our multiple signal alignment method consists of two steps, signal alignment and profile construction, which are iterated until convergence. We call our method ‘SA-GPR’, which stands for “multiple continuous Signal Alignment algorithm with Gaussian Process Regression profiles”. Our signal alignment algorithm addresses the limitation of the discrete profiles in earlier publications by using a continuously defined profile without loss of information. Our profile construction algorithm is a novel constructive method of continuous profile inference based on GP regression.

4.1 Assumptions

Before describing the method, we define frequently used terms rigorously. First, alignments apply to targets that are assumed to be generated from the global model. Here, the term “alignment” means adjusting one or both of two signals so that their inputs are synchronized with their points of origin. For example, the depths from two sediment core records are “aligned” to the common time at which both were simultaneously deposited.

In an ideal world, there would be no outliers under the assumption that the model is correctly specified. In the real world, however, it is not possible to completely exclude observational errors. Moreover, “all models are wrong, but some are useful” (George E. P. Box), so it is quite natural to see some outliers which appear not to follow the assumed model. Although some apparent outliers can be discarded with the naked eye, this is not only subjective but also impossible if the inputs or outputs are high-dimensional. Therefore, an outlier classification procedure is included in the alignment algorithm, with the following notation:

  • : outlier indicators defined on , equal to 1 if the associated input-output pair is considered an outlier and 0 otherwise.

Let us define outliers as data that do not follow the emission model, g, but instead follow an alternative model. Also assume that outliers are independent of the inputs, i.e., we have the following prior (13) and likelihoods (14):

(13)
(14)

Then, one can easily compute the marginal distribution of given as follows:

(15)

Let the expression in (15) be the new marginalized emission model. The rest of the notation in section 2 is kept.
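For concreteness, if the outlier indicator has prior probability \pi of being 1 and the outlier density is denoted h, the marginalization in (15) takes the standard mixture form (notation ours; g is the inlier emission model):

\tilde{g}(y_t \mid z_t) \;=\; \sum_{o_t \in \{0,1\}} \Pr(o_t)\, p(y_t \mid z_t, o_t) \;=\; (1 - \pi)\, g(y_t \mid z_t) \;+\; \pi\, h(y_t).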

4.2 Signal Alignment Algorithm

Suppose that signal-specific parameters are initialized and the profile to be aligned is given. This step is for learning the alignments to the given profile as a set of sampled latent series for each signal given data. The goal is to learn the transition model (unless it is assumed to be given a priori) and draw samples as follows: for each ,

(16)

The signal alignment algorithm can be considered an EM algorithm [17], which consists of the following three steps. First, the sampled latent series for each signal are initialized by the particle smoothing described in section 3.1. Second, complete sampled latent series for each signal are obtained by the Metropolis-Hastings algorithm in section 3.2. Third, the transition model for each signal is learned with posteriors approximated from the sampled latent series. These three steps are iterated until convergence. Note that this algorithm can be run in parallel over signals and that the emission model is specified by the profile.

What we did not mention in section 3.1 is how to choose the proposal distribution at each step . Suppose that we obtained samples of the latent series in the last round. Then, each can be designed as follows:

(17)

In other words, candidates of this round are assumed not to deviate much from the samples drawn in the last round.

The steps above allow for the alignment of signals to a given probabilistic model with continuous outputs over continuous latent series, but they do not specify such a model. We next describe the profile construction algorithm.

4.3 Profile Construction Algorithm

Suppose that sampled latent series are given. This step is for learning the marginalized emission model from those samples and associated outputs. Suppose that is straightforwardly derived from , so what we need to do is to learn from inliers. For each step , one can get the posterior of from (13) and (14) as follows:

(18)

Thus, one can sample or infer the outlier indicator at each step. For notational simplicity, in this subsection we treat each pair as the one obtained after discarding the entries classified as outliers. Also, we focus on the case of unidimensional outputs; the multidimensional case will be discussed at the end of this subsection.

From our assumption, must be defined on the given latent series continuously. Thus, for an arbitrary , we need to define where is the associated output. One might consider as the following posterior predictive:

(19)

Unfortunately, computing (19) can be intractable in practice: if all combinations of the sampled latent series are considered, the number of terms that must be summed grows too quickly as the size of the inputs increases, and if only a few such combinations are considered, the full regression model cannot be obtained. As a variational approach, let us consider the following signal-specific models, which are justifiable only if each signal is expected to roughly cover the domain of the profile:

(20)

Suppose that we have drawn independent and identically distributed observations. Our variational method assumes that their joint distribution is approximated by the product of the signal-specific models, i.e., they are also considered as a set of samples each drawn from one of those models. The objective function, in which KL denotes the Kullback-Leibler divergence, is then given as follows:

(21)

Unfortunately, the KL divergence cannot be expressed in closed form in general. However, if both distributions are Gaussian, then we can compute (21) explicitly. Let us assume further that each signal-specific model is Gaussian. Then, the target can be approximated as a Gaussian by model reduction based on moment matching [56].

What remains is to define each as a Gaussian. This can be done by GP regression, described in section 3.3.

Let . Then, the model reduction returns as follows:

(22)

, where:

(23)

Finally, the Gaussian minimizer of is given as follows:

(24)
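As a reference point for the moment-matching reduction used in (22)-(24), recall the textbook identity for collapsing an equally weighted mixture of K Gaussians N(\mu_k, \sigma_k^2) into a single Gaussian N(\mu, \sigma^2) by matching the first two moments (the weighting actually used in the algorithm may differ):

\mu = \frac{1}{K} \sum_{k=1}^{K} \mu_k, \qquad \sigma^2 = \frac{1}{K} \sum_{k=1}^{K} \left( \sigma_k^2 + \mu_k^2 \right) - \mu^2 .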

Let and update , and accordingly. Recall that the outlier distribution is assumed to be straightforwardly derived from . Here is a suggestion: for a positive hyperparameter , let:

(25)

That is, we assume that the outliers follow the inlier model but shifted above or below by the hyperparameter times the inferred standard deviations.

Note that the above GP-based regression is just one possible way of constructing the profile. Any regression for which (21) has a closed form can be an alternative.

For the multidimensional case, if the output coordinates are conditionally independent given the latent series, the problem is straightforwardly resolved by considering a separate regression model for each coordinate. If not, there are two options: one is to consider multivariate regression models and the other is to apply dimensionality reduction methods. Multivariate GP regression [13, 16] is one of the former options, but it is beyond the scope of this paper. For the latter, especially if the outputs can be embedded in a 1-dimensional submanifold (i.e., can be considered as a curve in a space of higher dimension), consider one of the following methods: principal component analysis (PCA) [61, 71], Laplacian eigenmaps [8], autoencoders [70, 6], Gaussian process latent variable models (GPLVM) [37, 42], or maximum variance unfolding [87].

5 Applications: Dual Proxy Stack Construction for Paleoceanography

Any method which seems good in theory must be validated on examples. Some toy examples can be found in section 1 of the supplementary notes. In this section, we show the application of our SA-GPR to the alignment of ocean sediment records in the field of paleoceanography.

In the field of paleoceanography, a well-established method for indirect age inference of ocean sediment cores is based on the ratio of oxygen isotopes in foraminiferal shells, known as δ18O. Specifically, δ18O is the ratio of the stable isotopes 18O and 16O relative to a laboratory standard. It is a common proxy for water temperature, salinity, and global ice volume. The δ18O of benthic foraminifera, which live on the seafloor, is often used as a global climate parameter because global ice volume accounts for approximately half of its variance through time [75], while deep-water temperature and salinity have relatively little spatial variability.

An age model for an input core which lacks direct age constraints can be constructed by aligning the benthic δ18O record of the input core to a target record that has an age model. In this process, the input record indirectly adopts the age model of the target. The target record is often an average of multiple records, such as the LR04 stack [47]. The target can also be a probabilistic model developed from multiple records (Prob-stack [2]). Here, the term “stack” is used in paleoceanography in the same sense as “profile” above.

However, local variability in benthic δ18O signals between core locations can cause significant uncertainty in aligned age models when studying millennial-scale events. Benthic-aligned age models can have errors of up to 4 kyr [74, 76] due to local effects corrupting the inference. For example, after the Last Glacial Maximum, 19-23 kiloyears ago (ka), melting ice sheets changed temperature and salinity gradients in the deep ocean and altered ocean circulation [1, 53]. Asynchronous temperature changes and ocean circulation rates caused by ice melt are thought to influence benthic δ18O records from 18-11 ka and could cause bias in aligned age models (e.g., [85, 26]).

Sediment samples from cores can be directly dated with radiocarbon (14C) ages. However, this dating method is limited to the last 50 ka BP due to loss by radioactive decay, whereas benthic δ18O records from ocean sediment cores extend as far back as 65 Ma. Furthermore, disturbances to sediments deposited on the seafloor can sometimes result in age reversals within the sequence of sediment recovered by coring. Additionally, the resolution of radiocarbon measurements is frequently lower than that of δ18O. As a result, age inferences in records with low-resolution data are strongly dependent on the assumptions regarding the rate of sediment accumulation in the intervals between data points.

In the remainder of this section, as an application of our SA-GPR, we introduce an age inference method which integrates benthic δ18O and 14C proxies.

5.1 Existing Approaches

Radiometric dating is commonly employed for direct paleo-age inferences. For example, 14C determinations measured at specific depths within a core are assumed to follow models based on radiocarbon ages, and radiocarbon ages are translated into calendar ages by calibration curves [67, 30]; thus 14C proxies give access to the calendar ages associated with the sampled core depths. [14] developed an algorithm to construct sediment core age models from radiocarbon data based on a generalized Student’s t-distribution that is robust to outliers. This algorithm includes the uncertainties from measurements and tuning hyperparameters that reflect reservoir effects.

Multiple studies have constructed dynamic models that simulate changes in sediment accumulation rates (defined as the ratio of depth to age increments). [9] adapt a piecewise linear approach with automatic section selection and impose constraints on accumulation rates, while [27] use a bivariate monotone Markov process with gamma increments. [10] construct an age-depth model, called Bacon (Bayesian Accumulation), by adapting an autoregressive gamma process for accumulation rates and a Student’s t-distribution for radiocarbon data. An adaptive Markov-chain Monte Carlo algorithm is implemented to sample ages.

Indirect age assignments depend on record alignments and allow a core to utilize ages from a different core or stack that has direct age proxies. A deterministic alignment algorithm, Match, was developed using dynamic programming [46]. [47] used Match to align benthic δ18O data from 57 globally distributed deep-sea sediment cores. These data were averaged to calculate a stack, called LR04, which is commonly used as a standard reference for benthic δ18O change over the past 5.3 Myr.

Ages can be inferred by alignment to the LR04 stack, similar to the way profile hidden Markov models are employed in biological sequence alignments (details can be found in [20]). [43] tackled the age assignment problem using a probabilistic alignment model called HMM-Match. The HMM-Match emission model for δ18O data is based on Gaussian distributions with time-varying mean values from LR04 and a constant core-dependent standard deviation learned by the Baum-Welch expectation maximization (EM) algorithm. The transition model accounts for the probability distribution of accumulation rates using a log-normal mixture based on radiocarbon observations from 37 cores. [2] constructed a stack (named Prob-stack) from 180 globally distributed benthic δ18O records with an algorithm based on HMM-Match, called HMM-Stack. Each point in Prob-stack is described by a Gaussian distribution of δ18O that varies along the record.

HMMs define inferred ages on discrete spaces. This is problematic when the input record has a higher resolution than the target stack or when a proportional accumulation model is employed, like the ones used in Match, HMM-Match, and the algorithm we present here. Increasing the resolution of the target stack is not an ultimate solution because the time complexity of an HMM is quadratic in the size of its hidden space, so it soon becomes infeasible as the resolution of a record increases. In the next subsections, we explain how the SA-GPR method addresses these limitations.

5.2 Data

Our method of constructing dual proxy stacks can be used to integrate and summarize the information from a set of cores considered to be homogeneous, and to infer ages of records not used in the stack construction but believed to be homogeneous. First, we construct a Deep Northeast Atlantic (DNEA) stack from six cores (GeoB7920-2, GeoB9508-5, GeoB9526-5, MD95-2042, MD99-2334 and ODP658C) using the algorithm described above. We then use this stack to infer dual proxy ages of two records which are homogeneous but not used in constructing the local stack, GIK13289-2 and SU90-08. Details of the data are given in figure S10 in the supplementary notes.

5.3 Model Specifications

In this problem, the inputs are record depths, the three-dimensional outputs are δ18O data, 14C data, and age guesses, and the latent series are calendar ages. Transition models are defined in reverse order, from the last step to the first, to describe the sedimentation order of layers. Core depths that lack data of a given type are assigned a missing value for the corresponding output variable.

Transition Models for Sediment Accumulation

For each depth in record , let indicate expansion, contraction or average change in the sedimentation rate from depth to in record . Then, our transition model is defined as follows:

(26)
(27)

, where the last term in equation (27) is , , .

The initial models are given by and . Specifically, confines the transition from to in one of three regions, called contraction (), average (), and expansion (), and . is a depth-scale parameter rescaling to adjust for differences in average accumulation rates.

The transition model can be considered autoregressive because the previous accumulation rate stored at each step affects the choice of the current regime, which in turn influences the current accumulation rate. The parameters are fixed in this procedure to avoid overfitting: their values are pre-learned from the same dataset used to learn the log-normal mixture model in [43].

Emission Models for 14C, δ18O, and Age Guesses

We adopt the model of [14] for the radiocarbon proxy. For each depth, the emission model for 14C is given as follows:

(28)

where the first two quantities are input variables (assumed to be fixed) given a priori with the 14C determinations. Calendar ages are translated by the mean and standard deviation functions of the radiocarbon calibration curve. An indicator is employed to ignore depths where there are no 14C data. In equation (28), the density is equivalent to the generalized robust Student’s t-distribution suggested in [14], which is adopted for controlling outliers, and its fixed hyperparameters are set in advance.
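To make the role of the calibration curve concrete, here is a hedged sketch of a robust Student's-t radiocarbon likelihood of the kind used in [14]; the exact parameterization of (28), the hyperparameter values, and the toy calibration curve below are assumptions of ours, not reproductions of the settings used in this paper.

import numpy as np
from scipy.stats import t as student_t

def log_lik_c14(age, c14_det, c14_err, calib_mean, calib_sd, a=3.0, b=4.0):
    """Sketch of a robust t-likelihood for one 14C determination given a calendar age.

    calib_mean and calib_sd are callables returning the calibration curve's mean and
    standard deviation at a calendar age; a and b play the role of the fixed
    hyperparameters of the generalized t-distribution (placeholder values here).
    """
    scale = np.sqrt(b / a) * np.sqrt(c14_err ** 2 + calib_sd(age) ** 2)
    return student_t.logpdf(c14_det, df=2 * a, loc=calib_mean(age), scale=scale)

# toy linear "calibration curve", purely for illustration
ll = log_lik_c14(12000.0, 10300.0, 60.0,
                 calib_mean=lambda theta: 0.85 * theta,
                 calib_sd=lambda theta: 25.0 + 0.001 * theta)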

The emission model for δ18O, on the other hand, is given by the following:

(29)
(30)

The indicators are independent Bernoulli variables, which indicate whether the corresponding δ18O values are drawn as inliers from the Gaussian distributions of the stack or as outliers from the bimodal outlier distributions. Outliers are removed because GP models are overly sensitive to them.

Priors for ages can be set if one has information on the ages before seeing any data; otherwise, uniform priors are assumed. If age-guess data are available, they are modeled with a Gaussian distribution; otherwise the corresponding likelihood is set to 1, thus ignoring depths at which there are no such data:

(31)

We assume that all observations, , and , are conditionally independent given calendar ages, , i.e.,

(32)

This assumption could be inappropriate if proxies are correlated. Also notice that this formulation can be applied to any tuples of proxies as long as they are believed to be conditionally independent given ages.

Profile Construction

For profile (stack) construction, we apply the GP regression from section 4.3 with the Ornstein-Uhlenbeck (OU) kernel [66]. It is the special case of the Matérn class of covariance functions with degree of differentiability 0.5 and gives rise to a particular form of an AR model [69]. Following Occam’s razor, we first apply homoscedastic GP regression to the δ18O data. Note that the resulting models are heteroscedastic whether the applied GP regression is homoscedastic or heteroscedastic: if the GP regression is homoscedastic, the heteroscedasticity comes from the uncertainty of the regression; if the GP regression is heteroscedastic, the noise model also contributes to the heteroscedasticity. Figure S15 in the supplementary notes strongly supports that this simple assumption is enough to explain the data.
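For reference, the OU (Matérn-1/2) covariance decays exponentially with the age difference; as a sketch, it can be dropped into the GP example of section 3.3 in place of the squared-exponential kernel there (the amplitude and length-scale defaults below are placeholders, not the values tuned for the stack).

import numpy as np

def ou_kernel(a, b, amp=1.0, ls=1.0):
    # Ornstein-Uhlenbeck / Matern-1/2 covariance: amp^2 * exp(-|t - t'| / ls)
    return amp ** 2 * np.exp(-np.abs(a[:, None] - b[None, :]) / ls)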

5.4 Results

We construct a local stack for the Deep Northeast Atlantic (2273-3223 m) by assuming that two cores from the Iberian margin (MD95-2042 and MD99-2334) and four cores from the northwest African continental slope (GeoB7920-2, GeoB9508-5, GeoB9526-5 and ODP658C) are sufficiently synchronous to be considered homogeneous. The availability of 14C data allows direct access to calendar ages, which enhances the age inferences: this is especially useful in the Holocene (after 11.7 ka), where the δ18O signal-to-noise ratio is low. We used the regional Deep North Atlantic (DNA) stack from [48] to initialize the iterative algorithm. The open-source software for the stack construction and core alignment algorithms is available at https://github.com/eilion/DPGP-Stack and runs on MATLAB.

Figure 4: Core alignments for stacking. The upper panel (a) shows alignments to the DNA stack, while the middle panel (b) shows alignments to our dual proxy local DNEA stack. Stars indicate medians of data classified as inliers and dots represent outliers, after translations. The darker and lighter gray regions are the 1-sigma and 2-sigma bands of the stacks, respectively, and the dot-dash line indicates their boundary. The lower panel (c) is a portion of the dual proxy stack from 10-50 kiloyears.
Figure 5: Comparisons between inferred dual proxy ages and 14C ages of the records. Shaded regions indicate 95% confidence regions and the black dashed line is the diagonal.

Figure 4(a) shows the alignment of the six records to the DNA stack, and figure 4(b) shows the alignment of these data to the local, dual proxy stack we constructed for the Deep Northeast Atlantic (DNEA). Variability in benthic δ18O in the DNA stack is considerably larger than in the local DNEA stack (note that most of the DNEA medians are inside the 1-sigma band of the DNA stack, which is only supposed to contain 68% of them). Higher variances in the DNA stack compared to the DNEA stack may stem from benthic δ18O differences within the broader North Atlantic region, record-specific mean shifts applied to the DNEA stack but not the DNA stack, and/or the discrete nature of the algorithm used to construct the DNA stack. Another contributing factor to the tighter DNEA stack variance is the automatic detection and removal of outlying observations. The tighter variance will contribute to less uncertainty in ages inferred from this stack.

The smoothness of the local stack stems from the fact that the GP model captures correlations between all the data points with heavier weight placed on near neighbors, thus limiting sudden large changes. Although our dual proxy stack is smoother than the DNA stack, it still captures well-known millennial-scale climatic events. For example, Figure 4(c) shows four peaks at 24, 29, 38 and 46 kiloyears, which correspond to the Heinrich events H2 to H5 [73, 48]. The ability to resolve such short-lived features will improve the accuracy of age estimates for cores with high-resolution records. We also tried the squared-exponential (SE) kernel [66], which gives rise to a particular form of a continuous-time AR model, but the results failed to reflect H2 and H3 at 24 and 29 kyr, respectively, as shown in figure S17 in the supplementary notes.

Figure S15 in the supplementary notes shows the histograms of normalized δ18O from the six records in the dual proxy stack with respect to the DNA stack and the new DNEA stack. The fact that there is only a small departure from the standard normal distribution with respect to the DNEA stack supports the validity of a Gaussian model. Figure 5 compares ages inferred using both δ18O and 14C (dual proxy) to those using only 14C (analogous to Bacon). The overall agreement between cores (to within uncertainty) supports our assumption that benthic δ18O is synchronous and homogeneous among sites included in the local stack. Some departures from the diagonal are expected, considering influences from the δ18O data.

If the δ18O record for a particular core site is believed to be homogeneous with the δ18O in the local stack, its ages can be inferred indirectly by dual proxy alignment to the local stack or by δ18O-only alignment (if 14C data are unavailable). However, sometimes it may be difficult to ascertain whether the δ18O signal is homogeneous across two or more sites through time. Here we present example alignments of two sediment cores from the same region (GIK13289-2 and SU90-08). We infer ages for these cores from their proxies and alignment to the DNEA stack constructed above, based on the assumption that the aligned cores share the same local δ18O signal as the stack. When assessing whether core sites share the same local signal as the stack, one should consider not only whether they are located within the same water mass today but also how water mass distributions have changed through time. We evaluate whether the cores used here are homogeneous with the DNEA stack in section 2 of the supplementary notes.

Figures S11 and S12 in the supplementary notes show the results for GIK13289-2 and SU90-08, respectively. The upper panels show that the translated and aligned δ18O data mainly fall within the stack’s confidence intervals. The lower left panels show that the inferred ages mainly pass through the confidence intervals from the individual proxies (SU90-08 has 14C data only between 10-40 kiloyears, and beyond that range its ages are inferred based only on alignment to the stack). Because stacks with dual proxy ages have narrower confidence intervals than single proxy stacks, they are more informative for stack alignment and produce smaller age uncertainties. For example, compare figures S11 and S12 with figures S13 and S14 (in the supplementary notes), which show alignments to the DNA stack.

6 Discussion

SA-GPR shares the advantages of other profile-dependent alignment algorithms. Aligning signals not only finds the best matches between signals but also integrates the fragmented information stored in each signal into the profile. Once the profile is constructed from a set of signals, it can be used as an alignment target for new signals.

Our method is constructive, so the obtained profiles are intuitive and easy to interpret. Once SA-GPR constructs a profile, it can be understood as a model signal with uncertainty induced from the query signals. This means that the profile can be utilized as a representation of the query signals for other meaningful purposes. For example, the mean values of the DNEA stack over ages in section 5 themselves form a regional parameter, just as benthic δ18O in the LR04 stack forms a global parameter directly used for inferring past climate changes.

The combination of the particle smoothing and Metropolis-Hastings algorithms is guaranteed to sample latent series given data continuously. Note that particle smoothing is applied only for the initialization. In SA-GPR, there are several reasons why an HMM is not considered for the initialization. Firstly, if the dynamics of the latent series do not allow “stays” (i.e., latent series must be strictly increasing), then the discretized transition model in the HMM cannot perfectly reflect this, whereas particle smoothing theoretically can. Secondly, an HMM considers all candidates when assigning latent series, while particle smoothing can focus only on latent series in possible ranges of the profile, which makes the algorithm more efficient. Thirdly, HMM initialization is not stable under varying profile resolution.

Even if continuous sampling is available, it soon becomes insignificant without a continuously defined profile. Our profile construction algorithm, which depends on Gaussian process regression, is superior to algorithms based on simple interpolation because it exploits the data in a theoretically clearer way to fill in unobserved outputs. The closer a query input is to the data, the more strongly the associated output is correlated with them, as determined by the kernel covariance functions. Moreover, our method is nonparametric and more robust on general data.

There are several limitations to our method. Firstly, the time complexity of SA-GPR might hinder its application to large amounts of data: the costs of the signal alignment and profile construction algorithms grow with the average signal length, the number of signals, the number of particles, and the average number of sampled latent series per signal. Though the former can be circumvented by considering only a small portion of each signal in the initialization, the latter cannot be reduced in principle because of the cost of matrix inversion in GP regressions [39]. Alternatives include variational methods or approximations to make GP regressions tractable on big data (see [80] and [7] for details).

However, the time complexity of the profile construction and signal alignment algorithms is proportional to , which is a clear advantage over the other profile-free alignment methods that are based on pairwise alignments and whose time complexities are structurally proportional to . Also, both aligni