Decentralized Sparse Multitask RLS over Networks
Abstract
Distributed adaptive signal processing has attracted much attention in the recent decade owing to its effectiveness in many decentralized real-time applications in networked systems. Because many natural signals are highly sparse with most entries equal to zero, several decentralized sparse adaptive algorithms have been proposed recently. Most of them are focused on single-task estimation problems, in which all nodes receive data associated with the same unknown vector and collaborate to estimate it. However, many applications are inherently multitask-oriented and each node has its own unknown vector different from others'. The related multitask estimation problem benefits from collaboration among the nodes, as neighbor nodes usually share analogous properties and thus similar unknown vectors. In this work, we study the distributed sparse multitask recursive least squares (RLS) problem over networks. We first propose a decentralized online alternating direction method of multipliers (ADMM) algorithm for the formulated RLS problem. The algorithm is simplified for easy implementation with closed-form computations in each iteration and low storage requirements. Moreover, to further reduce the complexity, we present a decentralized online subgradient method with low computational overhead. We theoretically analyze the convergence behavior of the proposed subgradient method and derive an error bound related to the network topology and algorithm parameters. The effectiveness of the proposed algorithms is corroborated by numerical simulations, and an accuracy-complexity tradeoff between the two proposed algorithms is highlighted.
I Introduction
In the last decade, distributed adaptive signal processing has emerged as a vital topic because of the vast number of applications in need of decentralized real-time data processing over networked systems. For multiagent networks, distributed adaptive algorithms rely only on local information exchange, i.e., information exchange among neighbor nodes, to estimate the unknowns. This trait endows distributed adaptive algorithms with low communication overhead, robustness to node/link failures and scalability to large networks. In the literature, the centralized least mean squares (LMS) and recursive least squares (RLS) algorithms [1] have been extended to their decentralized counterparts [2, 3] to deal with estimation problems over networks. Furthermore, many natural signals are inherently sparse with most entries equal to zero, such as the image and audio signals in [4, 5, 6, 7]. Sparsity of signals is particularly conspicuous in the era of big data: for many applications, redundant input features (e.g., a person's salary, education, height, gender, etc.) are collected to be fed into a learning system to predict a desired output (e.g., whether a person will resign his/her job). Most input features are unrelated to the output, so that the weight vector between the input vector and the output is highly sparse. As such, several sparse adaptive algorithms have been proposed, such as the sparse LMS in [8, 9], the sparse RLS in [10] and the distributed sparse RLS in [11].
Most of the decentralized sparse adaptive algorithms are focused on the single-task estimation problem, in which all nodes receive data associated with the same unknown vector and collaborate to estimate it. On the contrary, many applications are inherently multitask-oriented, i.e., each node has its own unknown vector different from others'. For instance, in a sensor network, each node may want to estimate an unknown vector related to its specific location and thus different nodes have different unknown vectors to be estimated. In fact, several decentralized multitask adaptive algorithms have been proposed in the literature, including the multitask diffusion LMS in [12], its asynchronous version in [13] and its application in the study of tremor in Parkinson's disease [14]. In particular, a sparse multitask LMS algorithm is proposed in [15] to promote sparsity of the estimated multitask vectors.
To the best of our knowledge, all the existing distributed adaptive algorithms for multitask estimation problems are based on various decentralized versions of the LMS. The RLS-based sparse multitask estimation problem has not received much attention. It is well known that the RLS converges much faster than the LMS. Hence, the RLS is more suitable than the LMS for applications in need of fast and accurate tracking of the unknowns, especially when the devices are capable of handling computations of moderately high complexity (which is increasingly the case as the computational capability of devices grows drastically). This motivates us to study the decentralized sparse multitask RLS problem over networks. The main contributions of this paper are summarized as follows.

A global networked RLS minimization problem is formulated. In accordance with the multitask nature of the estimation problem, each node has its own weight vector. Since neighbor nodes often share analogous properties and thus similar weight vectors, we add a regularization term to penalize deviations between neighbors' weight vectors. To enforce sparsity of the weight vectors, we further introduce an ℓ1-norm regularization term.

A decentralized online alternating direction method of multipliers (ADMM) algorithm is proposed for the formulated sparse multitask RLS problem. The proposed ADMM algorithm is simplified so that each iteration consists of simple closed-form computations and each node only needs to store and update one matrix and six vectors, whose sizes are determined by the dimension of the weight vectors. We show that the gaps between the outputs of the proposed ADMM algorithm and the optimal points of the formulated RLS problem converge to zero.

To overcome the relatively high computational cost of the proposed ADMM algorithm, we further present a decentralized online subgradient method, which enjoys lower computational complexity. We theoretically analyze its convergence behavior and show that the tracking error of the weight vectors is upper bounded by a constant related to the network topology and algorithm parameters.

Numerical simulations are conducted to corroborate the effectiveness of the proposed algorithms. Their advantages over the single-task sparse RLS algorithm in [11] are highlighted. We also observe an accuracy-complexity tradeoff between the two proposed algorithms.
The roadmap of the remaining part of this paper is as follows. In Section II, the sparse multitask RLS problem is formally formulated. In Section III, we propose and simplify a decentralized online ADMM algorithm for the formulated RLS problem. In Section IV, we propose a decentralized online subgradient method for the formulated problem in order to reduce computational complexity. In Section V, numerical simulations are conducted. In Section VI, we conclude this work.
II The Statement of the Problem
We consider a network of nodes connected by edges. We assume that the network is a simple graph, i.e., it is undirected with no self-loop and there is at most one edge between any pair of nodes. Each node has a set of neighbors, namely the nodes linked to it by an edge. The network can be either connected or disconnected (there does not necessarily exist a path connecting every pair of nodes). Time is divided into discrete slots. Each node has an unknown, (slowly) time-variant weight vector to be estimated. The formulated network is therefore a multitask learning network since different nodes have different weight vectors, as opposed to the traditional single-task learning network [2], which is usually transformed into a consensus optimization framework [16, 17, 18]. Each node has access to a sequence of private measurements, each consisting of an input regressor and an output observation. The measurement data are private in the sense that each node has access only to its own measurement sequence but not others'. The data at each node are assumed to conform to a linear regression model with a (slowly) time-variant weight vector:
(1) 
where the last term is the output measurement noise. In multitask learning networks, the benefit of cooperation between nodes comes from the fact that neighboring nodes have similar weight vectors [12], where similarity is embodied by some specific distance measure. By incorporating terms promoting similarity between neighbors and enforcing cooperation in the network, an estimator may achieve potentially higher performance than its noncooperative counterpart.
Moreover, many signals in practice are highly sparse, i.e., most entries in the signal are equal to zero, with examples encompassing image signals, audio signals, etc. [4, 5, 6, 7]. The sparsity of signals is especially conspicuous in today's big data era because redundant data are collected as input features, most of which are unrelated to the targeted output, leading to sparsity of the corresponding weight vectors. Furthermore, as per convention in adaptive algorithms [1], we assume that the weight vectors vary with time very slowly. This suggests that past data are of great merit for estimating the current weight vector, which justifies the advantage of the RLS (studied in this paper) over the LMS (studied in all existing works on multitask estimation [12, 13, 15, 14]) in terms of convergence speed.
In summary, we propose an RLS-based estimator to track the unknown weight vectors while enforcing similarity between neighbors' weight vectors and sparsity of all weight vectors. The estimate at each time is the optimal solution of the following optimization problem:
(2) 
where the three parameters are, respectively, the forgetting factor of the RLS algorithm, the regularization coefficient for similarity between neighbors' weight vectors, and the regularization coefficient for sparsity. In a limiting case of the similarity regularization, problem (2) enforces consensus of the weight vectors across nodes, and thus degenerates to the sparse RLS problem in [11]. Note that the measurement data arrive in a sequential manner, which necessitates an online (real-time) algorithm for solving (2) due to the prohibitive computation and storage cost of offline methods. Further note that the private measurement data are distributed among the network nodes. Thus, a distributed algorithm for (2) is imperative, as centralized algorithms are vulnerable to link failures and can incur large communication costs, not to mention the privacy concerns of the private data. Therefore, we aim to find distributed online algorithms for solving (2). In the following two sections, we propose two different distributed online algorithms with complementary merits in accuracy and computational complexity.
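For concreteness, a single node's contribution to a cost of this kind can be sketched in numpy. The names below (lam for the forgetting factor, eta for the similarity coefficient, gamma for the sparsity coefficient, and an ℓ2 fusion penalty paired with an ℓ1 sparsity penalty) are our own illustrative assumptions, not the paper's notation:

```python
import numpy as np

def node_cost(w_k, neighbors_w, data, lam, eta, gamma):
    """Sketch of one node's contribution to a problem like (2):
    exponentially weighted LS residuals, plus a fusion penalty toward
    the neighbors' weight vectors, plus an l1 sparsity penalty.
    data is a list of (regressor, observation) pairs, oldest first."""
    t = len(data)
    # the forgetting factor lam in (0, 1] discounts old residuals
    ls = sum(lam ** (t - 1 - i) * (d - u @ w_k) ** 2
             for i, (u, d) in enumerate(data))
    fusion = eta * sum(np.sum((w_k - w_l) ** 2) for w_l in neighbors_w)
    sparsity = gamma * np.sum(np.abs(w_k))
    return ls + fusion + sparsity
```

In the actual problem, all nodes' costs are summed and minimized jointly; the sketch only fixes ideas for a single node.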
III The Decentralized Online ADMM
In this section, we propose an alternating direction method of multipliers (ADMM) based decentralized online algorithm for solving (2). It is further simplified so that each iteration consists of simple closed-form computations and each node only needs to store and update one matrix and six vectors. We show that the gaps between the outputs of the proposed ADMM algorithm and the optimal points of (2) converge to zero. Before developing the algorithm, we first present some rudimentary knowledge of ADMM in the following subsection.
III-A Preliminaries of ADMM
ADMM is an optimization framework widely applied to various signal processing applications, including wireless communications [19], power systems [20] and multiagent coordination [21]. It enjoys fast convergence speed under mild technical conditions [22] and is especially suitable for the development of distributed algorithms [23, 24]. ADMM solves problems of the following form:
(3) 
where the matrices and vector defining the linear constraint are constants, the two variable blocks are the optimization variables, and the two objective summands are convex functions. The augmented Lagrangian can be formed as:
(4) 
where the additional variable is the Lagrange multiplier and the penalty parameter is a positive constant. The ADMM then iterates over the following three steps (indexed by the iteration counter):
(5)  
(6)  
(7) 
The ADMM is guaranteed to converge to the optimal point of (3) as long as the two objective functions are convex [23, 24]. It has recently been shown that global linear convergence can be ensured provided additional assumptions on problem (3) hold [22].
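As an illustration of steps (5)-(7), the sketch below applies them to a small lasso instance of (3) (two variable blocks coupled by the constraint x - z = 0). This is a generic textbook-style example under our own naming, not the algorithm derived for (2) below:

```python
import numpy as np

def soft(v, kappa):
    # proximal operator of kappa * ||.||_1
    return np.sign(v) * np.maximum(np.abs(v) - kappa, 0.0)

def admm_lasso(A, b, gamma, rho=1.0, iters=200):
    """min 0.5*||Ax - b||^2 + gamma*||z||_1  s.t.  x - z = 0,
    solved by the three ADMM steps (x-min, z-min, dual ascent)."""
    n = A.shape[1]
    x, z, y = np.zeros(n), np.zeros(n), np.zeros(n)
    M = np.linalg.inv(A.T @ A + rho * np.eye(n))  # cached x-update factor
    for _ in range(iters):
        x = M @ (A.T @ b + rho * z - y)       # step (5): minimize over x
        z = soft(x + y / rho, gamma / rho)    # step (6): minimize over z
        y = y + rho * (x - z)                 # step (7): multiplier update
    return z
```

Note that the x-update factor can be cached because the same quadratic subproblem recurs at every iteration; an analogous caching idea underlies the simplifications later in this section.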
III-B Development of the Decentralized Online ADMM for (2)
To apply ADMM to (2), we first transform it into the form of (3). We introduce auxiliary variables, one associated with each node and one with each node-neighbor pair (so that the number of auxiliary variables at a node is given by the cardinality of its neighbor set), with the neighbors of each node indexed in a fixed order. Problem (2) can then be equivalently transformed into the following problem:
(8) 
where the optimization variables are the original weight vectors together with the auxiliary variables. Note that optimization problem (8) is in the form of (3) (regarding the weight vectors and one group of auxiliary variables as the first variable block in (3), and the remaining auxiliary variables as the second variable block). Thus, we can apply ADMM to problem (8). Introducing Lagrange multipliers, we can form the augmented Lagrangian of (8) as follows:
(9) 
In the following, for ease of notation, we use a single collective symbol to represent all the per-node copies of each variable. We apply the ADMM updates (5), (6) and (7) to problem (8) as follows:
(10)  
(11)  
(12)  
(13) 
In the following, we detail how to implement the updates of the primal variables, i.e., (10) and (11), in a distributed and online fashion.
III-B1 Updating the First Block of Primal Variables
The update in (10) can be decomposed across nodes. For each node, the subproblem is:
(14) 
Define the data-dependent input correlation matrix and input-output cross-correlation vector of each node at each time to be:
(15)  
(16) 
Note that the objective function in (14) is a convex quadratic function. Hence, the necessary and sufficient condition for optimality of problem (14) is that the gradient of the objective function vanishes. The gradient of the objective function with respect to each variable block can be computed as follows:
(17)  
(18) 
Setting these gradients to zero, we rewrite the update in (14) as:
(19) 
To invert the matrix in (19), we need the following matrix inversion lemma.
Lemma 1.
For matrices of compatible dimensions such that all the matrix inversions on the R.H.S. of (20) exist, we have:
(20) 
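Assuming the lemma is the standard Woodbury identity, (A + UCV)^{-1} = A^{-1} - A^{-1} U (C^{-1} + V A^{-1} U)^{-1} V A^{-1}, it can be checked numerically:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 5, 2
A = 2.0 * np.eye(n)                 # easy-to-invert base matrix
U = rng.standard_normal((n, k))
C = np.eye(k)
V = rng.standard_normal((k, n))

Ainv = np.linalg.inv(A)
direct = np.linalg.inv(A + U @ C @ V)
woodbury = Ainv - Ainv @ U @ np.linalg.inv(
    np.linalg.inv(C) + V @ Ainv @ U) @ V @ Ainv
assert np.allclose(direct, woodbury)
```

The practical point is that when the base inverse is already known and the correction has low rank, only the small inner matrix needs to be inverted.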
III-B2 Updating the Second Block of Primal Variables
We note that the update in (11) can be decomposed not only across nodes but also across the entries of the vector. For each node, each entry can be updated as follows:
(24)  
(25)  
(26) 
where the soft-threshold function is defined for scalar arguments as follows:
(27) 
In (26), we have made use of the following fact.
Lemma 2.
For any scalar argument, we have:
(28) 
Once we extend the definition of the soft-threshold function to vectors in an entrywise way, we can write the update compactly as:
(29) 
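A minimal implementation of the soft-threshold operator (standard in ℓ1-regularized estimation; the parameter name kappa is ours):

```python
import numpy as np

def soft_threshold(a, kappa):
    """Entrywise soft-threshold: shrink a toward zero by kappa and set
    entries with |a| <= kappa exactly to zero. This is the closed-form
    minimizer of kappa*|x| + 0.5*(x - a)^2, which is why it appears in
    closed-form updates of l1-regularized subproblems."""
    return np.sign(a) * np.maximum(np.abs(a) - kappa, 0.0)
```

The exact thresholding to zero is what makes the resulting iterates sparse, rather than merely small.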
III-B3 Online Algorithm with Varying Time
So far, the derived ADMM algorithm is only suitable for one particular time slot. Since it takes iterations for the ADMM to converge to the optimal point, for each time slot we would need to run a sufficiently large number of ADMM iterations; only after the ADMM has converged for that particular time slot would we update the data-related quantities (the correlation matrices and cross-correlation vectors) and move to the next time slot. However, since the underlying weight vectors vary across time (i.e., the underlying linear system is non-stationary), it is meaningless to estimate the weight vectors very accurately for every time slot. Thus, in the following, we execute only one iteration of the ADMM update in each time slot. This is inspired by many existing adaptive algorithms such as the LMS algorithm, where only one step of gradient descent is performed at each time slot [1]. As such, we replace the iteration index with the time index in the previously derived updates (22), (23), (29) and obtain updates that are suitable for varying time:
(30)  
(31)  
(32) 
Moreover, the updates (12) and (13) for dual variables can be rewritten as:
(33)  
(34) 
The correlation matrices and cross-correlation vectors can be updated as follows:
(35)  
(36) 
The remaining quantity in the updates is computed according to (21).
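Assuming the standard exponentially weighted recursions consistent with the definitions of the correlation quantities, the per-slot data updates can be sketched as follows (names ours):

```python
import numpy as np

def update_correlations(R, r, u, d, lam):
    """One time slot of the exponentially weighted recursions:
    R <- lam*R + u u^T   (input correlation matrix)
    r <- lam*r + d*u     (input-output cross-correlation vector)
    where lam is the forgetting factor, u the new regressor and
    d the new observation."""
    return lam * R + np.outer(u, u), lam * r + d * u
```

Each node maintains its own pair of these quantities and updates them once per time slot from its private measurement.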
Remark 1.
The computation in (21) necessitates the inversion of a matrix whose size equals the weight-vector dimension, which incurs cubic computational complexity unless special structure is present. For the special case of a unit forgetting factor (which is suitable for time-invariant weight vectors), this burden can be alleviated as follows. According to (21), (35) and the unit-forgetting-factor condition, we have:
(37)  
(38)  
(39) 
However, in the general case of a non-unit forgetting factor, the matrix inversion incurred by the computation in (21) is inevitable; it is the most computationally intensive part of the proposed ADMM algorithm.
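For the special case discussed in the remark (no forgetting, so each time slot contributes a pure rank-one data update), the stored inverse can be propagated by the Sherman-Morrison formula instead of being recomputed; a sketch under that assumption:

```python
import numpy as np

def rank_one_inverse_update(P, u):
    """Given P = R^{-1}, return (R + u u^T)^{-1} via the
    Sherman-Morrison formula. Costs O(M^2) per update instead of the
    O(M^3) of a full matrix inversion, M being the vector dimension."""
    Pu = P @ u
    return P - np.outer(Pu, Pu) / (1.0 + u @ Pu)
```

This is the same device used by the classical RLS to avoid repeated inversions; with a non-unit forgetting factor and the extra regularization terms here, the update to the matrix is no longer rank-one, which is why the general case remains expensive.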
III-B4 Simplification of the ADMM Updates
So far, the ADMM updates involve primal and dual variables whose number at each node grows with the node's degree (its number of neighbors), which is costly to sustain in terms of communication and storage overhead, especially when the degrees are large. This motivates us to simplify the ADMM updates (30)-(36) so that the number of vectors stored at each node is independent of its degree. To this end, we first define the following auxiliary variables:
(40)  
(41)  
(42)  
(43)  
(44)  
(45)  
(46)  
(47)  
(48) 
Thus, the update in (30) can be rewritten as:
(49) 
Using (31) yields the following updates:
(50) 
(51) 
The dual update (33) can be rewritten as:
(52) 
Similarly, from (34), we can spell out the remaining updates:
(53)  
(54) 
Now, we are ready to formally present the proposed decentralized online ADMM algorithm for solving (2), which is summarized in Algorithm 1. Notice that the algorithm is completely distributed: each node only needs to communicate with its neighbors. It is also online (real-time): each node only needs to store and update one matrix and six vectors of the weight-vector dimension. All other quantities involved in Algorithm 1 are intermediate and can be derived from these stored matrices and vectors.
(55)  
(56) 
III-C Convergence of Algorithm 1
In this subsection, we briefly discuss the convergence of Algorithm 1 and show that the gap between its output and the optimal point of problem (2) converges to zero. We make the following assumptions.
Assumption 1.
The true weight vectors are time-invariant, i.e., the weight vectors in the linear regression model (1) do not change with time.
Assumption 2.
For each node, the input process is independent across time with a time-invariant correlation matrix.
Assumption 3.
For each node, the noise process has zero mean, is independent across time and is independent from the input process.
Note that all of these assumptions are standard when analyzing the performance of adaptive algorithms in the literature [1]. From their definitions, the correlation matrices and cross-correlation vectors are weighted sums of i.i.d. terms. According to the strong law of large numbers for weighted sums [25, 10], they converge to their ensemble counterparts as time goes to infinity. In this limit, the optimization problem at each time, i.e., problem (2), is to minimize (w.r.t. the weight vectors):
(57) 
Note that the R.H.S. of (57) does not depend on time. ADMM is guaranteed to converge to the optimal point of a static convex optimization problem of the form (3), and the R.H.S. of (57) can be transformed into that form as in Subsection III-B. Hence, the output of Algorithm 1 converges to the minimum point of the R.H.S. of (57). Due to (57), the optimal point of (2) also converges to the same minimum point. Therefore, the difference between the output of Algorithm 1 and the optimal point of (2) converges to zero.
IV The Decentralized Online Subgradient Method
The implementation of the proposed Algorithm 1 necessitates a matrix inversion at each time slot and each node, which may not be suitable for nodes with low computational capability. In fact, relatively high computational overhead is a general drawback of dual-domain methods (e.g., ADMM) in optimization theory [17]. On the contrary, primal-domain methods such as the gradient descent method, though converging relatively slowly, enjoy low computational complexity [26]. As such, in this section, we present a distributed online subgradient method for problem (2) that trades off convergence speed and accuracy for low computational complexity.
IV-A Development of the Decentralized Online Subgradient Method
Recall the optimization problem at each time, i.e., problem (2). The subdifferential (the set of subgradients [27]) of its objective function can be derived to be:
(58) 
where the sign (set) function is defined as:
(59) 
The extension of the sign function to vectors is entrywise. The subgradient method simply iterates by taking a step along the negative of any subgradient of the objective at the current point, scaled by a step size [27]. This naturally leads to the following decentralized online update:
(60) 
where the value of the sign function at zero is taken to be an arbitrary number within the corresponding interval.¹

¹There is a standard abuse of notation for the sign function: in (58) and (59), its value at zero is defined to be an interval, while in (60) it denotes an arbitrary number within that interval. In the following, the latter definition will be used.

By introducing an auxiliary variable