# A Diffusion Kernel LMS algorithm for nonlinear adaptive networks

## Abstract

This work presents a distributed algorithm for nonlinear adaptive learning. In particular, a set of nodes obtain measurements, sequentially one per time step, which are related via a nonlinear function; their goal is to collectively minimize a cost function by employing a diffusion-based Kernel Least Mean Squares (KLMS) algorithm. The algorithm follows the Adapt Then Combine mode of cooperation. Moreover, the theoretical properties of the algorithm are studied and it is proved that, under certain assumptions, the algorithm attains a no-regret bound. Finally, comparative experiments verify that the proposed scheme outperforms other variants of the LMS.

## 1 Introduction

In recent years, interest in *distributed learning* and inference has grown rapidly. This is mainly due to the constantly increasing memory and computational requirements of modern applications, which have to cope with huge amounts of available data. These data "spring" from several sources and applications, such as communication, imaging and medical platforms, as well as social-networking sites, e.g., [1]. A natural way to deal with the large volume of data to be processed is to split the problem into subproblems and resort to distributed operations [2]. Thus, the development of algorithms for such scenarios, where the data are not available in a single location but are instead spread out over multiple locations, becomes essential.

An important application within the distributed learning context is *distributed adaptive learning* [4]. In a nutshell, this problem considers a decentralized network consisting of nodes interested in performing a specific task, which can be, for instance, parameter estimation or classification. The nodes constantly obtain new measurements and continuously adapt and learn; this gives them the capability to track changes in the environment. On top of that, it is assumed that there is no central node which could perform all the necessary operations, so the nodes act as independent learners and perform the computations by themselves. Finally, the task of interest is considered to be common or similar across the nodes and, to that end, they cooperate with each other. Cooperation has been shown to be beneficial, since it improves the learning performance [4].

This paper is concerned with the problem of distributed adaptive learning in Reproducing Kernel Hilbert Spaces (RKHS). To be more specific, we consider an ad hoc network, the nodes of which obtain input/output measurements, sequentially, one per time step, related via an unknown *nonlinear* function. To cope with this nonlinearity we resort to the family of kernel-based algorithms for nonlinear adaptive filtering. In particular, the proposed algorithm belongs to the Kernel LMS (KLMS) algorithmic family and follows the diffusion rationale for cooperation among the nodes.

Related Work:

Several studies on distributed adaptive estimation of linear systems have been proposed in the literature. These include diffusion-based algorithms, e.g., [5], consensus ones, e.g., [8], as well as algorithms for multitask learning [10]. The problem of nonlinear adaptive estimation in RKHS has also been studied, e.g., [12]. A recent study, which considers the problem of nonlinear adaptive filtering in distributed networks, can be found in [16]. The major differences between that work and ours are summarized next. First, the authors of [16] consider a predefined dictionary, which essentially makes the dimension of the problem finite and equal to the number of elements of the dictionary. In contrast, here we consider the general case, where the dictionary is allowed to grow as time increases, and we present a more general form of the algorithm. Furthermore, we study for the first time the theoretical properties of the Diffusion Kernel LMS (DKLMS) and we derive regret bounds for the proposed scheme.

Contributions:

In this paper, we propose a novel nonlinear distributed algorithm for adaptive learning. In particular, we propose a KLMS algorithm which follows the diffusion rationale and adopts the Adapt Then Combine mode of cooperation among the nodes. To be more specific, we assume that the nodes obtain measurements which arrive sequentially and are related via a nonlinear system. The goal is the minimization of the expected value of the networkwise discrepancy between the desired output and the estimated one. To that end, at each step, the nodes: a) perform a local update step exploiting their most recent measurements, b) cooperate with each other, in order to enhance their estimates. The theoretical properties of the proposed scheme are discussed, and comparative experiments illustrate that it outperforms other LMS variants.

Notation:

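Lowercase and uppercase boldfaced letters stand for vectors and matrices respectively. The symbol $\mathbb{R}$ stands for the set of real numbers and $\mathbb{Z}_{\geq 0}$ for the set of nonnegative integers. $\mathcal{H}$ denotes an infinite dimensional Hilbert space equipped with an inner product denoted by $\langle \cdot, \cdot \rangle_{\mathcal{H}}$; the induced norm is given by $\|\cdot\|_{\mathcal{H}} = \sqrt{\langle \cdot, \cdot \rangle_{\mathcal{H}}}$. Given a set $\mathcal{S}$, with the term $|\mathcal{S}|$ we denote its cardinality.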

## 2 Problem Statement

We consider an ad hoc network, illustrated in Figure 1, consisting of $K$ nodes. Each node, $k \in \{1, \ldots, K\}$, at each discrete time instance $n$, has access to a scalar $d_{k,n} \in \mathbb{R}$ and a vector $\boldsymbol{u}_{k,n} \in \mathbb{R}^m$, which are related via

$$d_{k,n} = f(\boldsymbol{u}_{k,n}) + v_{k,n},$$

where $f$ is an unknown function, common to all the nodes, belonging to the Hilbert space $\mathcal{H}$, and the term $v_{k,n}$ stands for the additive noise process. The overall goal is the estimation of a function, $f$, which minimizes the cost:

$$J(f) = \sum_{k=1}^{K} \mathbb{E}\left[\left(d_{k,n} - f(\boldsymbol{u}_{k,n})\right)^2\right],$$

in a *distributed* and *collaborative* fashion; that is, the nodes want to minimize the cost by relying solely on local processing as well as interactions with their neighbors.

### 2.1 Linear Diffusion LMS

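In order to help the reader grasp the concept of the diffusion LMS, in this section we describe the linear scenario, i.e., the one where the function to be estimated is a vector, say $\boldsymbol{w} \in \mathbb{R}^m$, and the measurement model essentially becomes:

$$d_{k,n} = \boldsymbol{u}_{k,n}^T \boldsymbol{w} + v_{k,n}.$$

The cost function to be minimized in that case can be written as follows:

$$J(\boldsymbol{w}) = \sum_{k=1}^{K} \mathbb{E}\left[\left(d_{k,n} - \boldsymbol{u}_{k,n}^T \boldsymbol{w}\right)^2\right].$$

The cost includes information coming from the whole network and, in order to minimize it, global knowledge is required. Nevertheless, in distributed and decentralized learning each node can only interact and exchange information with its neighborhood, which will be denoted by $\mathcal{N}_k$ (node $k$ itself included). A fully distributed algorithm which can be employed for the estimation of $\boldsymbol{w}$ is the diffusion LMS (see for example [4]). The starting point of this scheme is a modification of the steepest-descent method, which is properly reformulated so as to enable distributed operations and to avoid any global computation (the interested reader is referred to [4]). In addition, the instantaneous approximation is adopted, according to which the statistical values are substituted by their instantaneous ones, e.g., [17]. Each node updates the estimate at each time step according to the following iterative scheme:

$$\boldsymbol{\psi}_{k,n} = \boldsymbol{w}_{k,n-1} + \mu\, e_{k,n}\, \boldsymbol{u}_{k,n},$$

$$\boldsymbol{w}_{k,n} = \sum_{l \in \mathcal{N}_k} a_{l,k}\, \boldsymbol{\psi}_{l,n},$$

where $e_{k,n} := d_{k,n} - \boldsymbol{u}_{k,n}^T \boldsymbol{w}_{k,n-1}$ and $\mu > 0$ is the step size. Furthermore, $a_{l,k}$ stand for combination coefficients, which have the following properties: $a_{l,k} \geq 0$, $a_{l,k} = 0$ if $l \notin \mathcal{N}_k$, and $\sum_{l \in \mathcal{N}_k} a_{l,k} = 1$. A common choice, among others, for these coefficients is the Metropolis rule, in which the weights are equal to:

$$a_{l,k} = \begin{cases} \dfrac{1}{\max\{|\mathcal{N}_k|, |\mathcal{N}_l|\}}, & l \in \mathcal{N}_k,\ l \neq k, \\[2mm] 1 - \sum_{i \in \mathcal{N}_k \setminus \{k\}} a_{i,k}, & l = k, \\[1mm] 0, & \text{otherwise.} \end{cases}$$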
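As an illustration of the Metropolis rule, the following minimal sketch builds the combination matrix for a small undirected graph. It assumes the standard form of the rule, in which the weight between two distinct neighbors is the reciprocal of the larger of their two neighborhood sizes (each neighborhood counting the node itself) and the self-weight absorbs the remainder. The function name `metropolis_weights` and the 4-node line graph are hypothetical choices made for the example.

```python
def metropolis_weights(neighbors):
    """Return A with A[l][k] = a_{l,k} for an undirected graph.

    neighbors[k] is the set of nodes adjacent to node k, including k itself.
    """
    K = len(neighbors)
    A = [[0.0] * K for _ in range(K)]
    for k in range(K):
        for l in neighbors[k]:
            if l != k:
                A[l][k] = 1.0 / max(len(neighbors[k]), len(neighbors[l]))
        # The self-weight absorbs the remainder so that column k sums to one.
        A[k][k] = 1.0 - sum(A[l][k] for l in neighbors[k] if l != k)
    return A


# Hypothetical 4-node line graph 0 - 1 - 2 - 3 (every node is its own neighbor).
line_graph = [{0, 1}, {0, 1, 2}, {1, 2, 3}, {2, 3}]
A = metropolis_weights(line_graph)
column_sums = [sum(A[l][k] for l in range(4)) for k in range(4)]
```

On an undirected graph the resulting matrix is symmetric, so both its rows and its columns sum to one.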

The intuition behind the two-step scheme presented above can be summarized as follows. In the first step, node $k$ updates its estimate using an LMS-based update (adaptation step) exploiting local information. Subsequently, node $k$ cooperates with its neighborhood by combining their intermediate estimates $\boldsymbol{\psi}_{l,n}$ to obtain its updated estimate $\boldsymbol{w}_{k,n}$. The weights $a_{l,k}$ assign a non-negative weight to the estimates received from the neighborhood, whereas they are equal to zero for the rest of the nodes. Hence, each node *aggregates* the information received from its neighborhood. This scheme is also known as the Adapt Then Combine (ATC) diffusion strategy.
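The adapt and combine steps can be sketched in a few lines of Python. This is only an illustrative sketch on synthetic data, not the experimental setup of the paper: the function name `atc_diffusion_lms`, the ring topology, the uniform averaging weights (used here in place of the Metropolis rule to keep the example short), the true vector and all parameter values are assumptions.

```python
import random


def atc_diffusion_lms(neighbors, w_true, mu=0.05, steps=1000, noise_std=0.1, seed=0):
    """Run ATC diffusion LMS on synthetic linear data; return the node estimates."""
    rng = random.Random(seed)
    K, m = len(neighbors), len(w_true)
    w = [[0.0] * m for _ in range(K)]  # w[k] is the current estimate of node k
    for _ in range(steps):
        # Adaptation step: each node runs one LMS update on its own new sample.
        psi = []
        for k in range(K):
            u = [rng.gauss(0.0, 1.0) for _ in range(m)]
            d = sum(ui * wi for ui, wi in zip(u, w_true)) + rng.gauss(0.0, noise_std)
            e = d - sum(ui * wi for ui, wi in zip(u, w[k]))
            psi.append([wi + mu * e * ui for wi, ui in zip(w[k], u)])
        # Combination step: average the intermediate estimates of the neighbors
        # (assumed uniform weights 1/|N_k|, which are non-negative and sum to one).
        w = [
            [sum(psi[l][i] for l in neighbors[k]) / len(neighbors[k]) for i in range(m)]
            for k in range(K)
        ]
    return w


# Hypothetical 4-node ring; every neighborhood includes the node itself.
ring = [{0, 1, 3}, {0, 1, 2}, {1, 2, 3}, {0, 2, 3}]
w_true = [1.0, -0.5]
estimates = atc_diffusion_lms(ring, w_true)
```

Each node only ever touches its own sample and the intermediate estimates of its single-hop neighbors, which is what makes the scheme fully distributed.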

### 2.2 Centralized Kernel LMS

#### Preliminaries

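Now, let us provide a few elementary properties of the RKHS, which will be used in the sequel. Throughout this section the node subscript will be suppressed, since we describe properties of centralized learning. We consider a real Hilbert space $\mathcal{H}$ comprising functions defined on $\mathbb{R}^m$; that is, $f: \mathbb{R}^m \to \mathbb{R}$. The function $\kappa(\cdot, \cdot): \mathbb{R}^m \times \mathbb{R}^m \to \mathbb{R}$ will be called a reproducing kernel of $\mathcal{H}$ if the following properties hold:

$\forall \boldsymbol{u} \in \mathbb{R}^m$, the function $\kappa(\cdot, \boldsymbol{u})$ belongs to $\mathcal{H}$.

$\forall \boldsymbol{u} \in \mathbb{R}^m$, $\forall f \in \mathcal{H}$, it holds that $f(\boldsymbol{u}) = \langle f, \kappa(\cdot, \boldsymbol{u}) \rangle_{\mathcal{H}}$.

If these properties hold then $\mathcal{H}$ is called a Reproducing Kernel Hilbert Space [18]. A typical example is the Gaussian kernel, with definition:

$$\kappa(\boldsymbol{u}_1, \boldsymbol{u}_2) = \exp\left(-\frac{\|\boldsymbol{u}_1 - \boldsymbol{u}_2\|^2}{2\sigma^2}\right), \quad \sigma > 0.$$

A very important property, which will be exploited in the sequel, states that points in the RKHS can be written as follows:

$$f = \sum_{i} \alpha_i\, \kappa(\cdot, \boldsymbol{u}_i),$$

where $\alpha_i \in \mathbb{R}$ and $\boldsymbol{u}_i \in \mathbb{R}^m$. Finally, the reproducing kernel is continuous, symmetric and positive-definite.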
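To make the kernel expansion concrete, the short sketch below evaluates a function given as a finite kernel expansion at a point, which by the reproducing property is all that is needed to compute the inner product of that function with a transformed input. The Gaussian kernel is written here with the parameterization exp(-||u1 - u2||^2 / (2 sigma^2)); this normalization, the function names and the numerical values are assumptions made for the example.

```python
import math


def gaussian_kernel(u, v, sigma=1.0):
    """Gaussian kernel exp(-||u - v||^2 / (2 sigma^2)) for two equal-length tuples."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def evaluate_expansion(coeffs, centers, u, sigma=1.0):
    """Evaluate f(u) for f = sum_i coeffs[i] * kappa(., centers[i])."""
    return sum(a * gaussian_kernel(c, u, sigma) for a, c in zip(coeffs, centers))


# Hypothetical two-term expansion in the plane.
centers = [(0.0, 0.0), (1.0, 0.0)]
coeffs = [2.0, -1.0]
value_at_first_center = evaluate_expansion(coeffs, centers, centers[0])
```

Only kernel evaluations between input vectors are required; the (possibly infinite dimensional) transformed inputs never have to be formed explicitly.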

#### Kernel LMS

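Kernel LMS, which was originally proposed in [20], is a generalization of the original LMS algorithm, which utilizes the transformed input, i.e., $\kappa(\cdot, \boldsymbol{u}_n)$, at each iteration step. Put in mathematical terms, the recursion of the KLMS is given by:

$$f_n = f_{n-1} + \mu\, e_n\, \kappa(\cdot, \boldsymbol{u}_n), \qquad e_n := d_n - \langle f_{n-1}, \kappa(\cdot, \boldsymbol{u}_n) \rangle_{\mathcal{H}}.$$

Since the space $\mathcal{H}$ may be infinite dimensional, it may be difficult to have direct access to the transformed input data and the function $f_n$. However, if we go back to the measurement model of Section 2 and forget the distributed aspect for now, we can see that the quantity of interest is $f(\boldsymbol{u}_n)$, which can be computed by exploiting the reproducing property. In particular, following similar steps as in [20] and assuming the initialization $f_0 = 0$, it can be shown that the KLMS recursion can be equivalently written:

$$f_{n-1}(\boldsymbol{u}_n) = \mu \sum_{i=1}^{n-1} e_i\, \kappa(\boldsymbol{u}_i, \boldsymbol{u}_n), \qquad e_n = d_n - f_{n-1}(\boldsymbol{u}_n).$$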

Note that this reformulation is very convenient as it computes the response of the estimated function to the input, without any need to estimate the function itself.
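The reformulated recursion translates almost directly into code: the estimate is stored as a growing list of centers and coefficients, and each new sample appends one term with coefficient equal to the step size times the a priori error. The following is a minimal sketch assuming a zero initial estimate and a Gaussian kernel; the toy target `sin(3u)`, the step size and the kernel width are hypothetical values chosen only for illustration.

```python
import math
import random


def gaussian_kernel(u, v, sigma):
    """Gaussian kernel exp(-||u - v||^2 / (2 sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def klms(inputs, targets, mu=0.5, sigma=0.3):
    """Kernel LMS with a growing expansion; returns centers, coefficients, errors."""
    centers, coeffs, errors = [], [], []
    for u, d in zip(inputs, targets):
        # Response of the current estimate to the new input (zero when empty).
        prediction = sum(a * gaussian_kernel(c, u, sigma) for a, c in zip(coeffs, centers))
        e = d - prediction
        # The update f_n = f_{n-1} + mu * e_n * kappa(., u_n) adds one new term.
        centers.append(u)
        coeffs.append(mu * e)
        errors.append(e)
    return centers, coeffs, errors


rng = random.Random(0)
inputs = [(rng.uniform(-1.0, 1.0),) for _ in range(400)]
targets = [math.sin(3.0 * u[0]) for u in inputs]  # noise-free toy system
centers, coeffs, errors = klms(inputs, targets)
early_mse = sum(e * e for e in errors[:50]) / 50
late_mse = sum(e * e for e in errors[-50:]) / 50
```

The expansion grows by one term per sample, which is why practical variants limit its size, for example with a buffer or a sparsification rule.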

## 3 The Diffusion Kernel LMS

In this section we describe the proposed algorithm together with its theoretical properties. Recall the problem under consideration, discussed in Section 2. Our goal here is to bring together the tools described in Sections 2.1 and 2.2 and derive a kernel-based LMS algorithm suitable for distributed operation. Our starting point is the ATC diffusion LMS described in Section 2.1, on which we employ the nonlinear transformation of the input, $\boldsymbol{u}_{k,n} \mapsto \kappa(\cdot, \boldsymbol{u}_{k,n})$ (similarly to the KLMS recursion of Section 2.2). The resulting recursion is:

$$\psi_{k,n} = f_{k,n-1} + \mu\, e_{k,n}\, \kappa(\cdot, \boldsymbol{u}_{k,n}),$$

$$f_{k,n} = \sum_{l \in \mathcal{N}_k} a_{l,k}\, \psi_{l,n},$$

where $e_{k,n} := d_{k,n} - f_{k,n-1}(\boldsymbol{u}_{k,n})$ is defined similarly to the error of the linear case. Although this seems a trivial generalization of the linear ATC recursion, as we have already discussed, one cannot resort directly to this form of iterations, since access to the transformed data may not be possible.

Exploiting the lemma, which will be presented shortly, we can bypass the aforementioned problem, by deriving the inner product between the obtained function and the transformed input vector in a closed form. Before we proceed, let us introduce some notation. The networkwise function at time is denoted by , where the Cartesian product . Similarly, we define the networkwise input: and . Finally, we gather the combination coefficients to the matrix , the –th entry of which contains . It can be readily shown that - can be written for the whole network in the following compact form:

The proof, which follows mathematical induction, is omitted due to lack of space and will be presented elsewhere.
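One possible way to realize the diffusion kernel recursion in practice is sketched below; it is an illustrative assumption, not the closed form of the lemma, which is not reproduced here. Each node stores its estimate as a kernel expansion (a dictionary mapping centers to coefficients), the adaptation step adds one term at the node's newest input, and the combination step merges the neighbors' expansions with the combination weights. The topology, the uniform weights, the toy target and all parameter values are hypothetical.

```python
import math
import random


def kernel(u, v, sigma=0.3):
    """Gaussian kernel exp(-||u - v||^2 / (2 sigma^2))."""
    return math.exp(-sum((a - b) ** 2 for a, b in zip(u, v)) / (2.0 * sigma ** 2))


def evaluate(f, u):
    """Evaluate an expansion stored as {center: coefficient} at the point u."""
    return sum(c * kernel(center, u) for center, c in f.items())


def dklms_step(fs, data, neighbors, mu):
    """One Adapt Then Combine step; data[k] = (u, d) is the new sample of node k."""
    psis = []
    for f, (u, d) in zip(fs, data):
        e = d - evaluate(f, u)  # a priori error at node k
        psi = dict(f)
        psi[u] = psi.get(u, 0.0) + mu * e  # adapt: add the term mu * e * kappa(., u)
        psis.append(psi)
    new_fs = []
    for k in range(len(fs)):
        weight = 1.0 / len(neighbors[k])  # assumed uniform combination weights
        fk = {}
        for l in neighbors[k]:
            for center, c in psis[l].items():
                fk[center] = fk.get(center, 0.0) + weight * c  # combine
        new_fs.append(fk)
    return new_fs


rng = random.Random(1)
neighbors = [{0, 1}, {0, 1, 2}, {1, 2}]  # hypothetical 3-node line graph
fs = [{} for _ in neighbors]
for _ in range(200):
    samples = [(rng.uniform(-1.0, 1.0),) for _ in neighbors]
    data = [(u, math.sin(3.0 * u[0])) for u in samples]  # noise-free toy system
    fs = dklms_step(fs, data, neighbors, mu=0.5)

grid = [(-1.0 + 0.05 * i,) for i in range(41)]
initial_mse = sum(math.sin(3.0 * u[0]) ** 2 for u in grid) / len(grid)
final_mse = sum((math.sin(3.0 * u[0]) - evaluate(fs[0], u)) ** 2 for u in grid) / len(grid)
```

In this sketch the dictionary of every node grows with time and with the information received from its neighbors, in line with the growing-dictionary setting considered here; a bounded buffer would truncate it.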

### 3.1 Theoretical Properties

In the sequel, we present the regret bound of the proposed scheme and, in particular, we show that the regret grows sublinearly with time.

Proof:

The proof is omitted due to lack of space and will be presented elsewhere.
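For readers unfamiliar with the terminology, the notion of regret can be written down generically as follows. This is the usual online-learning definition, stated here under assumed notation (the instantaneous squared loss $\mathcal{L}_{k,n}$, the estimate $f_{k,n-1}$ available at node $k$ before the $n$-th sample, and an arbitrary fixed comparator $g$); it is not the exact statement or bound of the omitted theorem.

```latex
% Assumed notation: L_{k,n} is the instantaneous squared loss at node k and time n,
% f_{k,n-1} is the estimate of node k before seeing sample n, g is a fixed comparator.
\mathcal{L}_{k,n}(f) := \big(d_{k,n} - f(\boldsymbol{u}_{k,n})\big)^2,
\qquad
R_N(g) := \sum_{n=1}^{N}\sum_{k=1}^{K} \mathcal{L}_{k,n}(f_{k,n-1})
        - \sum_{n=1}^{N}\sum_{k=1}^{K} \mathcal{L}_{k,n}(g),
\qquad
\text{no regret:}\quad \limsup_{N\to\infty} \frac{R_N(g)}{N} \le 0 .
```

A regret that grows sublinearly in $N$ means that, on average over time, the network performs at least as well as any fixed comparator $g$ from the admissible class.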

## 4 Simulations

In this section, the performance of the proposed algorithm is validated within the distributed nonlinear adaptive filtering framework. We consider a network comprising nodes and a distributed version of the problem studied in [22], for which the input and the output are related via:

where is Gaussian with variance and the input is also Gaussian with variance , where with respect to the Uniform distribution. We compare the proposed algorithm with: a) the linear diffusion LMS, b) the non-cooperative KLMS, i.e., the KLMS where the nodes do not cooperate with each other. For the kernel-based algorithms we employ the Gaussian kernel with and we choose a step size equal to for all the algorithms. Furthermore, the combination weights are chosen according to the Metropolis rule, the buffer size at each node equals and we only take into consideration information coming from the single-hop neighbors. Finally, the adopted performance metric is the average MSE, with definition . As can be seen from Figure 2, the DKLMS outperforms the other LMS variants, since it converges faster and to a lower error floor. In the second experiment the setup is similar to the previous one, except that the variance of the noise is increased and now equals . Figure 3 illustrates that the enhanced performance of the DKLMS, compared to the other algorithms, is retained in this scenario as well.

## 5 Conclusions and Future Research

In this paper, a novel kernel-based diffusion LMS, suitable for nonlinear distributed adaptive filtering, was proposed. The theoretical properties of the algorithm were discussed and the performance of the scheme was tested against other adaptive strategies. Future research will focus on accelerating convergence by utilizing more data per iteration, as well as on investigating strategies that reduce the number of coefficients by storing only the most “informative” ones.

### References

[1] Konstantinos Slavakis, Georgios Giannakis, and Gonzalo Mateos, **“Modeling and optimization for big data analytics: (statistical) learning tools for our era of data deluge,”** *IEEE Signal Processing Magazine*, vol. 31, no. 5, pp. 18–31, 2014.

[2] Cheng Chu, Sang Kyun Kim, Yi-An Lin, YuanYuan Yu, Gary Bradski, Andrew Y Ng, and Kunle Olukotun, **“Map-reduce for machine learning on multicore,”** *Advances in Neural Information Processing Systems*, vol. 19, pp. 281, 2007.

[3] Paul Zikopoulos, Chris Eaton, et al., *Understanding big data: Analytics for enterprise class Hadoop and streaming data*, McGraw-Hill Osborne Media, 2011.

[4] Ali H Sayed, **“Diffusion adaptation over networks,”** *Academic Press Library in Signal Processing*, vol. 3, pp. 323–454, 2013.

[5] Symeon Chouvardas, Konstantinos Slavakis, and Sergios Theodoridis, **“Adaptive robust distributed learning in diffusion sensor networks,”** *IEEE Transactions on Signal Processing*, vol. 59, no. 10, pp. 4692–4707, 2011.

[6] Cassio G Lopes and Ali H Sayed, **“Diffusion least-mean squares over adaptive networks: Formulation and performance analysis,”** *IEEE Transactions on Signal Processing*, vol. 56, no. 7, pp. 3122–3136, 2008.

[7] Federico S Cattivelli and Ali H Sayed, **“Diffusion LMS strategies for distributed estimation,”** *IEEE Transactions on Signal Processing*, vol. 58, no. 3, pp. 1035–1048, 2010.

[8] Ioannis D Schizas, Gonzalo Mateos, and Georgios B Giannakis, **“Distributed LMS for consensus-based in-network adaptive processing,”** *IEEE Transactions on Signal Processing*, vol. 57, no. 6, pp. 2365–2382, 2009.

[9] Gonzalo Mateos, Ioannis D Schizas, and Georgios B Giannakis, **“Distributed recursive least-squares for consensus-based in-network adaptive estimation,”** *IEEE Transactions on Signal Processing*, vol. 57, no. 11, pp. 4583–4588, 2009.

[10] Jie Chen, Cédric Richard, and Ali H Sayed, **“Multitask diffusion adaptation over networks,”** *IEEE Transactions on Signal Processing*, vol. 62, no. 16, pp. 4129–4144, 2014.

[11] Jorge Plata-Chaves, Nikola Bogdanovic, and Kostas Berberidis, **“Distributed diffusion-based LMS for node-specific adaptive parameter estimation,”** *IEEE Transactions on Signal Processing*, vol. 63, no. 13, pp. 3448–3460, 2015.

[12] Pantelis Bouboulis and Sergios Theodoridis, **“Extension of Wirtinger’s calculus to reproducing kernel Hilbert spaces and the complex kernel LMS,”** *IEEE Transactions on Signal Processing*, vol. 59, no. 3, pp. 964–978, 2011.

[13] Pantelis Bouboulis, Sergios Theodoridis, and Michael Mavroforakis, **“The augmented complex kernel LMS,”** *IEEE Transactions on Signal Processing*, vol. 60, no. 9, pp. 4962–4967, 2012.

[14] Konstantinos Slavakis, Sergios Theodoridis, and Isao Yamada, **“Online kernel-based classification using adaptive projection algorithms,”** *IEEE Transactions on Signal Processing*, vol. 56, no. 7, pp. 2781–2796, 2008.

[15] Konstantinos Slavakis, Sergios Theodoridis, and Isao Yamada, **“Adaptive constrained learning in reproducing kernel Hilbert spaces: the robust beamforming case,”** *IEEE Transactions on Signal Processing*, vol. 57, no. 12, pp. 4744–4764, 2009.

[16] Wei Gao, Jie Chen, Cédric Richard, and Jianguo Huang, **“Diffusion adaptation over networks with kernel least-mean-square,”** *Computational Advances in Multi-Sensor Adaptive Processing (CAMSAP), 2015 IEEE International Workshop on*, 2015 (submitted).

[17] Simon S Haykin, *Adaptive filter theory*, Pearson Education India, 2008.

[18] Alex J Smola and Bernhard Schölkopf, *Learning with kernels*, Citeseer, 1998.

[19] Sergios Theodoridis, *Machine Learning: A Bayesian and Optimization Perspective*, Academic Press, 2015.

[20] Weifeng Liu, Puskal P Pokharel, and Jose C Principe, **“The kernel least-mean-square algorithm,”** *IEEE Transactions on Signal Processing*, vol. 56, no. 2, pp. 543–554, 2008.

[21] Wei Gao, Jie Chen, Cédric Richard, and Jianguo Huang, **“Online dictionary learning for kernel LMS,”** *IEEE Transactions on Signal Processing*, vol. 62, no. 11, pp. 2765–2777, 2014.

[22] Kumpati S Narendra and Kannan Parthasarathy, **“Identification and control of dynamical systems using neural networks,”** *IEEE Transactions on Neural Networks*, vol. 1, no. 1, pp. 4–27, 1990.

[23] Danilo P Mandic, **“A generalized normalized gradient descent algorithm,”** *IEEE Signal Processing Letters*, vol. 11, no. 2, pp. 115–118, 2004.