Recursive Multikernel Filters Exploiting Nonlinear Temporal Structure

# Recursive Multikernel Filters Exploiting Nonlinear Temporal Structure

Steven Van Vaerenbergh Dept. Communications Engineering
University of Cantabria, Spain
steven.vanvaerenbergh@unican.es
Simone Scardapane DIET Dept.
Sapienza University, Italy
simone.scardapane@uniroma1.it
Ignacio Santamaria Dept. Communications Engineering
University of Cantabria, Spain
i.santamaria@unican.es
###### Abstract

In kernel methods, temporal information on the data is commonly included by using time-delayed embeddings as inputs. Recently, an alternative formulation was proposed by defining a -filter explicitly in a reproducing kernel Hilbert space, giving rise to a complex model where multiple kernels operate on different temporal combinations of the input signal. In the original formulation, the kernels are then simply combined to obtain a single kernel matrix (for instance by averaging), which provides computational benefits but discards important information on the temporal structure of the signal. Inspired by works on multiple kernel learning, we overcome this drawback by considering the different kernels separately. We propose an efficient strategy to adaptively combine and select these kernels during the training phase. The resulting batch and online algorithms automatically learn to process highly nonlinear temporal information extracted from the input signal, which is implicitly encoded in the kernel values. We evaluate our proposal on several artificial and real tasks, showing that it can outperform classical approaches both in batch and online settings.

## I Introduction

In recent years, kernel adaptive filters (KAF) have become a popular approach for online machine learning and time-series prediction, thanks to numerous theoretical and practical advances [vanvaerenbergh2012kernel, zhao2013fixed, yukawa2015adaptive]. Differently from deep recurrent neural networks [goodfellow2016deep], whose training is generally formulated in batch fashion, KAFs can be trained efficiently with a single pass over the training data. Popular examples of KAFs include kernel least-mean-square [liu2008kernel, zhao2013fixed] and kernel recursive least-squares [engel2004kernel, vanvaerenbergh2012kernel]. When operating on temporal data, most KAFs apply common kernel functions, e.g. Gaussian or polynomial, to time-delay embeddings of the input data. Choosing a specific embedding is not trivial in general, and it might be suboptimal in the case of non-stationary signals. These are known problems in other kernel methods as well, such as support vector machines (SVMs) [mukherjee1997nonlinear] and kernel ridge regression (KRR).

In order to overcome these limitations, several authors have proposed kernelized extensions of classical recursive models, including the recurrent least-squares SVM [suykens2000recurrent], the autoregressive and moving average (ARMA) SVM [martinez2006support], the kernel machine and space projection (KMSP) method [li2011identification], the kernel -filter [camps2004robust], and, more recently, the kernel adaptive ARMA algorithm [li2016kernel]. All of these works share a common methodology, which is composed of three major steps: (i) define a proper state-space model (SSM) in the input space; (ii) map the input and/or the state of the model to a high-dimensional feature map corresponding to a properly defined reproducing kernel Hilbert space (RKHS); and (iii) solve the resulting model by substituting all dot products with evaluations of the associated kernel function according to the so-called “kernel trick”. Although this is a powerful methodology, it is hard to obtain insights from the model and, more importantly, selecting a proper embedding remains a crucial problem.

In this paper, we focus on the alternative methodology recently proposed in [tuia2014explicit]. The basic idea consists in defining the SSM explicitly in a proper RKHS, where samples might not correspond one-to-one with the original inputs. In particular, it is possible to define a proper reproducing theorem such that the model can be expressed as a summation of kernel values, which can be computed recursively at each time instant. For the specific case of the -filter, the resulting model is particularly appealing because each “tap” in the RKHS corresponds to a filtering operation on a different time-scale of the original input [tuia2014explicit]. Unlike previously proposed recursive kernels, e.g. [hermans2012recurrent], this class of kernels was found to work robustly in many problems, including time-series prediction and array processing. Some theoretical aspects of a related class of kernels were investigated independently in [mouattamid2009recursive].

Here, we are interested in extending this methodology by focussing on a shortcoming of the original model. In particular, standard kernel methods consider a single kernel value for each input datum, while in this case, we obtain several kernel values per input. In [tuia2014explicit], this problem was side-stepped by either averaging the values, or by computing inner products. This approach lacks in flexibility, however, because it implicitly assumes that all time-scales are equally important. In this paper, we put forth the idea of considering each kernel value as coming from a different kernel function, and to apply proper adaptive strategies to learn the dependency with respect to each of them. In the literature, the idea of adaptively combining kernel functions goes under the name of multiple kernel (MK) learning. MK algorithms originated in the SVM literature [gonen2011multiple], and their benefits have also been proven by a number of authors for KAFs, including the MK normalized LMS [yukawa2012multikernel], the mixture KMLS [pokharel2013mixture], the doubly regularized MKLMS [yukawa2013online], and Cartesian HYPASS [yukawa2015adaptive].

In order to test the feasibility of our idea, we focus on a simple strategy in which multiple kernel estimators are linearly combined following a stacking-like [breiman1996stacked] algorithm. This allows us to test the same formulation in both batch and online settings. In the experimental section, we show that by combining our procedure with the recursive kernel of [tuia2014explicit], we obtain a performance that is comparable to or better than competing approaches. To this end, we evaluate several benchmarks using both artificial and real-world datasets.

The rest of the paper is organized as follows. Section II describes the RKHS -filter from [tuia2014explicit], which is extended through the proposed MK strategy in Section III. We briefly consider computational considerations of the recursive kernel evaluation in Section IV. Experiments are presented in Section V, before giving some conclusive remarks in Section VI.

## Ii Recursive γ-filtering in RKHS

Let us denote by a generic input-output pair observed at time instant . We assume these data to be generated by the following nonlinear model

 yn =f(⟨wi,xin⟩)+ey, (1a) xin =g(xin−1,xin−2,…,yn,yn−1,…)+ex, (1b)

where is the input signal at the -th filter tap, denotes the inner product, are the filter weights, are smooth nonlinear functions, and represent state and output noise respectively. The main idea introduced in [tuia2014explicit] is to model a similar process, defined instead in a proper Hilbert space

 yn =^f(⟨wi,ϕin⟩H)+ey, (2a) ϕin =^g(ϕin−1,ϕin−2,…,yn,yn−1,…)+eϕ, (2b)

where are now samples in the (possibly infinite-dimensional) , and , represent the state and output noise. Differently from previous works, it is not required for to have as its preimage, in order to provide more flexibility to the model. Proving a representer theorem in general is not trivial, except for specific instantiations of Eq. (2). A particularly interesting case arises by assuming that the filtering operation in Eq. (2) is a -filter [principe1993gamma, camps2004robust] given by (omitting noise for simplicity)

 yn =P∑i=1⟨wi,ϕin⟩H, (3a) ϕin ={ψ(xn) if i=1,(1−μ)ϕin−1+μϕi−1n−1 if 2≤i≤P (3b)

where is some nonlinear transformation in , is the filter length controlling the memory depth, and is a free parameter controlling stability. It is interesting to observe that each “tap” in the Hilbert space is defined by a recursive equation, so as to work over different temporal combinations of the (nonlinearly transformed) original input sequence. In [tuia2014explicit], it is proved that, for a given sequence of length , the filter weights can be expressed as

 wi=N∑m=1βimϕim, (4)

for some coefficients . After substituting (4) in (3), and making use of the kernel definition , we obtain

 ^yn=P∑i=1N∑m=1βim⟨ϕim,ϕin⟩=P∑i=1N∑m=1βimκi(m,n). (5)

Again, it is worth underlining that, in general, . In particular, due to the recursive model definition, the kernel itself can be defined recursively, and the closed-form expression is given by111Note that the original formula in [tuia2014explicit, Eq. (13)] was missing a summation from to . The correct formula is given by Eq. (6).

 κi(m,n)=⎧⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪⎨⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪⎩κ(xm,xn),i=1¯μ2κi(m−1,n−1)  +μ2κi−1(m−1,n−1)  +μ2m−1∑j=2¯μj−1κi−1(m−j,n−1)  +μ2n−1∑j=2¯μj−1κi−1(m−1,n−j),i>1 (6)

where , and is the classical (scalar) kernel function corresponding to . The calculation of the associated kernel matrix is illustrated in Fig. 1. More generally, we can replace and with the corresponding time-delayed embeddings and to obtain a more expressive model with similar computational complexity.

The recursive kernel -filter generalizes several known models, such as the classical -filter, recursive auto-regressive filters, and several others (see [tuia2014explicit, Section III-C]). Differently from standard KRR and KAF algorithms, which require one kernel value for each time instant, the model structure in (5) requires kernel values, each of which can be seen as operating at a different time-scale. The filtering approach proposed in [tuia2014explicit] did not adapt all coefficients corresponding to all kernel values, but employed a simple combination of the kernels instead. One such combination is the composite kernel . However, this approach has a drawback, namely, it assumes that all kernels are equally significant, and it may lose important information when combining them. In the next section, we will describe a more principled approach to combine the different kernels.

## Iii Proposed adaptive multi-kernel approach

We now describe an adaptive formulation that automatically weights the different kernels. The proposed algorithm is relatively inexpensive, and it can be implemented easily in both batch and online settings. Note, however, that nothing prevents the use of more advanced multi-kernel or ensemble strategies to further exploit the multi-kernel structure.

Let us consider the batch case first. Given training samples , denote by the vector of all outputs, and by the kernel matrix corresponding to the -th tap in (6). A set of KRR models is trained as

 fi(x)=yT(Ki+cI)−1κi(x), (7)

where and is a regularization constant. We combine the basic models as , where the coefficients are found by minimizing

 minα1,…,αP⎧⎨⎩12N∑n=1(yn−P∑i=1αifi(xn))2⎫⎬⎭, (8)

which has an immediate closed form solution

 α=F−1y, (9)

where . This formulation is a basic form of what is known as “stacking” in the machine learning literature [breiman1996stacked, dvzeroski2004combining]. When , Eq. (8) does not require regularization. Otherwise, Eq. (8) can be replaced by a more general leave-one-out strategy as in the original stacking problem [breiman1996stacked]. More generally, we can include several constraints and/or regularization terms to Eq. (8) in order to force a specific structure on , such as the requirement to lie in a -simplex (similar to an adaptive combination of filters [arenas2016combinations]), or impose an -norm regularization to remove unnecessary lags.

A similar stacking strategy can be applied in the online case. In particular, given the new datum , we first update KAFs in parallel, for instance employing KLMS filters [liu2008kernel]

 fin=fin−1+η[yn−fin−1(xn)]κi(⋅,xn), (10)

where is the step-size. Then, we update the current estimate of the weighting coefficients following an instantaneous descent on (8)

 αn=αn−1+ν(yn−P∑i=1αifin(xn))fn(xn), (11)

where is a step-size parameter and .

Note that KLMS algorithms require calculating arrays of kernel evaluations in each iteration, which is feasible by applying the recursive formula discussed in the next section. KRLS algorithms, on the other hand, require calculating entire kernel matrices, which is less obvious in recursive settings. This, and other concepts, such as sparsity, require more investigation.

## Iv Efficient computation of the recursive kernel

We now briefly outline an implementation to compute the recursive kernel using fast operations on column vectors.

The slowest operations in Eq. (6) are the summations of the third and fourth term. In order to speed up the calculation of the last term, we define the column vector with elements

 rin(m)=μ2n−1∑j=2(1−μ)j−1κi(m−1,n−j). (12)

This vector can be obtained recursively by observing that

 rin(m)=(1−μ)(rin−1(m)+μ2κi(m−1,n−j−1)). (13)

A similar recursive calculation is not possible for the third term in Eq. (6), though this term can be obtained efficiently from the elements of the convolution , where and .

Fig. 2 compares the computation times of the recursive kernel matrix with , for an explicit implementation of Eq. (6) and the discussed efficient implementation, both in Matlab R2015a on an Intel Core i7 PC with GHz processor.

## V Experimental results

The proposed batch and online algorithms are evaluated on six benchmark problems, three of which are defined on artificially generated datasets and the rest on real-world data.

The first artificial dataset is the Mackey-Glass time-series (with delay ) taken from [tuia2014explicit], denoted as MG30, on which we perform one-step-ahead prediction. The second task is a noisy version of a nonlinear prediction problem introduced in [narendra1990identification], denoted as “Narendra”, which is defined by

 yn=0.3yn−1+0.6yn−2+f(en), (14)

where the unknown function has the form

 f(e)=0.6sin(πe)+0.3sin(3πe)+0.1sin(5πe) (15)

and , , is uniformly distributed in the interval , and we set . We additionally add Gaussian noise with variance to the desired output during training.

The third task on artificial data is the identification of the nonlinear Wiener model described in [scardapane2016diffusion], where the input is generated according to

 xn=bxn−1+√1−b2ex, (16)

with randomly generated according to a uniform distribution, is Gaussian noise with variance , and we set . The output is given by first applying a linear filter to an embedding of the last inputs, and then applying a soft nonlinearity on the resulting scalar value.

The first problem on real-world data considers the -step ahead prediction of the EEG dataset from [tuia2014explicit], extracted from the MIT-BIH Polysomnographic Database. We then consider the prediction of a respiratory motion trace recorded at the Georgetown University Hospital, originally described in [ernst2012compensating]. Finally, the EUR-USD dataset contains the EUR vs. USD exchange rates in minute intervals, taken on the days January 2nd and 5th of 2009. The task is -step ahead prediction.

### V-a Batch experiments

In all batch experiments, we use elements for training and separate elements for testing, except for EUR-USD, where we use samples for training and for testing, respectively. Parameters are fine-tuned following the same grid-search procedure described in [tuia2014explicit], optimizing over a third, independent validation set.

We evaluate several KRR models, trained using (i) a standard RBF kernel with time-delayed embeddings on input, (ii) the recursive kernel with the two composition strategies described in [tuia2014explicit] (averaging and symmetrization), and (iii) the stacking procedure described in Section III, both with and regularization. Additionally, we consider an extension of our idea by training a support vector regression model with the SimpleMKL algorithm [rakotomamonjy2008simplemkl], which finds a new kernel via an adaptive combination of the base kernel matrices. Denoting by the mean-squared error (MSE) over the test set, we evaluate the models using a normalized MSE on a logarithmic scale,

 nMSE=10log10(E2/^σ2y), (17)

where is the empirical variance of the output computed over the test set.

The results for the different benchmarks are given in Table I, where the best result for each dataset is highlighted with a bold font. We observe that the proposed stacking procedures are outperforming the standard KRR models in out of benchmarks, while SimpleMKL achieves slightly better performance in the Mackey-Glass task. To confirm the results, we employ the corrected Friedman test described in [demvsar2006statistical], according to which the performance of the algorithms are statistically different with a confidence value of . A set of Nemenyi post-hoc tests shows that the performance of the base stacking procedure is statistically better than the standard kernel and the two composite recursive kernels, while the sparse stacking procedure is statistically better than the standard kernel and the symmetric recursive kernel, only.

### V-B Online experiments

In a second set of experiments we evaluate the online performance of the described KLMS algorithm with recursive multikernel, and classical KLMS [liu2008kernel], on the six benchmark problems. Both algorithms use the same kernel parameter and learning rate. Table II lists the nMSE results obtained after convergence, indicating clear benefits of the RMK strategy.

Finally, in Fig. 3 we reproduce the online learning experiment from [liu2008kernel, Sec. 2.11.2]. In this experiment, a binary signal is fed into a nonlinear communications channel, and the goal is to estimate the correct input signal given its output, i.e. to construct a channel equalizer. KLMS with RMK employs lags and , and the same kernel parameter and learning rate as standard KLMS. While both nonlinear algorithms converge to a similar MSE, the RMK algorithm enjoys a much faster convergence rate.

## Vi Conclusions

In this paper, we proposed a novel approach for exploiting multiple time-scale information in kernel regression and filtering problems by combining a previously introduced -filter (defined in a proper Hilbert space), with an adaptive strategy for combining kernel functions. In this way, the algorithm automatically adapts to the most significant time-scales of the original input signal. Additionally, the kernel functions are particularly suitable for online processing because they can be recursively computed from previous values, and we briefly discussed on their efficient implementation.

Experimental simulations show that the algorithm has similar or better performance on a wide range of tasks, when compared to several alternative strategies such as time-delayed embeddings of the input. In future works, we plan to further extend our idea by leveraging upon the recent literature on MK filters, and by exploring nonlinear combinations of the different kernels.

## Acknowledgment

S. Van Vaerenbergh is supported by the Spanish Ministry of Economy and Competitiveness (under project TEC2014-57402-JIN). S. Scardapane is supported in part by Italian MIUR, “Progetti di Ricerca di Rilevante Interesse Nazionale”, GAUChO project, under Grant 2015YPXH4W_004.

## References

You are adding the first comment!
How to quickly get a good reply:
• Give credit where it’s due by listing out the positive aspects of a paper before getting into which changes should be made.
• Be specific in your critique, and provide supporting evidence with appropriate references to substantiate general statements.
• Your comment should inspire ideas to flow and help the author improves the paper.

The better we are at sharing our knowledge with each other, the faster we move forward.
The feedback must be of minimum 40 characters and the title a minimum of 5 characters