Decentralized RLS with Data-Adaptive Censoring for Regressions over Large-Scale Networks

Abstract

The deluge of networked data motivates the development of algorithms for computation- and communication-efficient information processing. In this context, three data-adaptive censoring strategies are introduced to considerably reduce the computation and communication overhead of decentralized recursive least-squares (D-RLS) solvers. The first relies on alternating minimization and the stochastic Newton iteration to minimize a network-wide cost, which discards observations with small innovations. In the resultant algorithm, each node performs local data-adaptive censoring to reduce computations, while exchanging its local estimate with neighbors so as to consent on a network-wide solution. The communication cost is further reduced by the second strategy, which prevents a node from transmitting its local estimate to neighbors when the innovation it induces to incoming data is minimal. In the third strategy, not only transmitting, but also receiving estimates from neighbors is prohibited when data-adaptive censoring is in effect. For all strategies, a simple criterion is provided for selecting the threshold of innovation to reach a prescribed average data reduction. The novel censoring-based (C)D-RLS algorithms are proved convergent to the optimal argument in the mean-root deviation sense. Numerical experiments validate the effectiveness of the proposed algorithms in reducing computation and communication overhead.

Index Terms

Decentralized estimation, networks, recursive least-squares (RLS), data-adaptive censoring

1 Introduction

In our big data era, various networks generate massive amounts of streaming data. Examples include wireless sensor networks, where a large number of inexpensive sensors cooperate to monitor, e.g., the environment [21, 22], or data centers, where a group of servers collaboratively handles dynamic user requests [24]. Since a single node has limited computational resources, decentralized information processing is preferable as the network size scales up [7, 9]. In this paper, we focus on a decentralized linear regression setup, and develop computation- and communication-efficient decentralized recursive least-squares (D-RLS) algorithms.

The main tool we adopt to reduce computation and communication costs is data-adaptive censoring, which leverages the redundancy present especially in big data. Upon receiving an observation, nodes determine whether it is informative or not. Less informative observations are discarded, while messages among neighboring nodes are exchanged only when necessary. We propose three censoring-based (C)D-RLS algorithms that can achieve estimation accuracy comparable to D-RLS without censoring, while significantly reducing the computation and communication overhead.

1.1 Related works

The merits of RLS algorithms in solving centralized linear regression problems are well recognized [12, 25]. When streaming observations that depend linearly on a set of unknown parameters become available, RLS yields the least-squares parameter estimates online. RLS reduces the computational burden of finding a batch estimate per iteration, and can even allow for tracking time-varying parameters. The computational cost can be further reduced by data-adaptive censoring [4], where less informative data are discarded. On the other hand, decentralized versions of RLS without censoring have been advocated to solve linear regression tasks over networks [16]. In D-RLS, a node updates its estimate that is common to the entire network by fusing its local observations with the local estimates of its neighbors. As time evolves, all local estimates consent on the centralized RLS solution. This paper builds on both [4] and [16] by developing censoring-based decentralized RLS algorithms, thus catering to efficient online linear regression over large-scale networks.

Different from our in-network setting, where operation is fully decentralized and nodes are only able to communicate with their neighbors, most of the existing distributed censoring algorithms apply to star topology networks that rely on a fusion center [2, 10, 11, 19, 23]. Their basic idea is that each node transmits data to the fusion center for further processing only when its local likelihood ratio exceeds a threshold [23]; see also [10], where communication constraints are taken into account. Information fusion over fading channels is considered in [11]. Practical issues such as joint dependence of sensor decision rules, randomization of decision strategies, and partially known distributions are reported in [2], while [19] also explores quantization jointly with censoring.

Other than the star topology studied in the aforementioned works, [20] investigates censoring for a tree structure. If a node’s local likelihood ratio exceeds a threshold, its local data is sent to its parent node for fusion. A fully decentralized setting is considered in [3], where each node determines whether to transmit its local estimate to its neighbors by comparing the local estimate with the weighted average of its neighbors. Nevertheless, [3] aims at mitigating only the communication cost, while the present work also considers reduction of the computational cost across the network. Furthermore, the censoring-based decentralized linear regression algorithm in [14] deals with optimal full-complexity estimation when observations are partially known or corrupted. This is different from our context, where censoring is deliberately introduced to reduce computational and communication costs for decentralized linear regression.

1.2 Our contributions and organization

The present paper introduces three data-adaptive online censoring strategies for decentralized linear regression. The resultant CD-RLS algorithms incur low computational and communication costs, and are thus attractive for large-scale network applications requiring decentralized solvers of linear regressions. Unlike most related works that specifically target wireless sensor networks (WSNs), the proposed algorithms may be used in a broader context of decentralized linear regression using multiple computing platforms. Of particular interest are cases where a regression dataset is not available at a single machine, but it is distributed over a network of computing agents that are interested in accurately estimating the regression coefficients in an efficient manner.

In Section 2, we formulate the decentralized online linear regression problem (Section 2.1), and recast the D-RLS in [16] into a new form (Section 2.2) that prompts the development of three censoring strategies (Section 2.3). Section 3 develops the first censoring strategy (Section 3.1), analyzes all three censoring strategies (Section 3.2), and discusses how to set the censoring thresholds (Section 3.3). Numerical experiments in Section 4 demonstrate the effectiveness of the novel CD-RLS algorithms.

Notation. Lower (upper) case boldface letters denote column vectors (matrices). , , and stand for transpose, 2-norm, induced matrix 2-norm and expectation, respectively. Symbols , and are used for the trace, minimum eigenvalue and maximal eigenvalue of matrix , respectively. Kronecker product is denoted by and the uniform distribution over by , and the Gaussian probability distribution function (pdf) with mean and variance by . The standardized Gaussian pdf is , and its the associated complementary cumulative distribution function is represented by .

2 Context and Algorithms

This section outlines the online linear regression setup over networks, and takes a fresh look at the D-RLS algorithm. Three strategies are then developed using data-adaptive censoring to reduce the computational and communication costs of D-RLS.

2.1 Problem statement

Consider a bidirectionally connected network with nodes, described by a graph , where is the set of nodes with cardinality , and denotes the set of edges. Each node only communicates with its one-hop neighbors, collected in the set . The decentralized network is deployed to estimate a real vector . Per time slot , node receives a real scalar observation involving the wanted with a regression row , so that , with .

Our goal is to devise efficient decentralized online algorithms to solve the following exponentially-weighted least-squares (EWLS) problem

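$$\hat{\mathbf{s}}_{\text{ewls}}(t) := \arg\min_{\mathbf{s}} \frac{1}{2}\sum_{r=1}^{t}\sum_{j=1}^{J} \lambda^{t-r}\left[x_j(r) - \mathbf{h}_j^T(r)\mathbf{s}\right]^2 \tag{1}$$

where $\hat{\mathbf{s}}_{\text{ewls}}(t)$ is the EWLS estimate at slot $t$, and $\lambda \in (0, 1]$ is a forgetting factor that de-emphasizes the importance of past measurements, and thus enables tracking of a non-stationary process. When $\lambda = 1$, (1) boils down to a standard decentralized online least-squares estimate.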
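As a point of reference for the recursive solvers that follow, the batch solution of (1) can be sketched through the weighted normal equations. This is only an illustrative sketch: the function name `ewls_batch`, the array layout, and the noiseless test data are assumptions made here, not part of the paper.

```python
import numpy as np

def ewls_batch(H, x, lam):
    """Solve (1) directly via the weighted normal equations.

    H   : array (t, J, p), H[r, j] holds the regressor h_j(r + 1)
    x   : array (t, J),    x[r, j] holds the observation x_j(r + 1)
    lam : forgetting factor, assumed to lie in (0, 1]
    """
    t = H.shape[0]
    # Weight lam**(t - r) for slots r = 1..t (array index r - 1).
    w = lam ** np.arange(t - 1, -1, -1)
    # (sum_r sum_j w_r h h^T) s = sum_r sum_j w_r h x
    Phi = np.einsum("r,rjp,rjq->pq", w, H, H)
    psi = np.einsum("r,rjp,rj->p", w, H, x)
    return np.linalg.solve(Phi, psi)

# Hypothetical noiseless data: the batch estimate should recover s0.
rng = np.random.default_rng(0)
t, J, p = 50, 4, 3
s0 = np.array([1.0, -2.0, 0.5])
H = rng.standard_normal((t, J, p))
x = H @ s0                      # shape (t, J)
s_hat = ewls_batch(H, x, lam=0.95)
```

Solving this system from scratch at every slot requires all data at one place, which is what the decentralized recursions avoid.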

2.2 D-RLS revisited

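The D-RLS algorithm of [16] solves (1) as follows. Per time slot $t$, node $j$ receives $x_j(t)$ and $\mathbf{h}_j(t)$, and uses them to update the per-node inverse covariance matrix as

$$\boldsymbol{\Phi}_j^{-1}(t) = \lambda^{-1}\boldsymbol{\Phi}_j^{-1}(t-1) - \frac{\lambda^{-1}\boldsymbol{\Phi}_j^{-1}(t-1)\mathbf{h}_j(t)\mathbf{h}_j^T(t)\boldsymbol{\Phi}_j^{-1}(t-1)}{\lambda + \mathbf{h}_j^T(t)\boldsymbol{\Phi}_j^{-1}(t-1)\mathbf{h}_j(t)} \tag{2}$$

along with the per-node cross-covariance vector as

$$\boldsymbol{\psi}_j(t) = \lambda\boldsymbol{\psi}_j(t-1) + \mathbf{h}_j(t)x_j(t). \tag{3}$$

Using $\boldsymbol{\Phi}_j^{-1}(t)$ and $\boldsymbol{\psi}_j(t)$, node $j$ then updates its local parameter estimate using

$$\mathbf{s}_j(t) = \boldsymbol{\Phi}_j^{-1}(t)\Big[\boldsymbol{\psi}_j(t) - \frac{1}{2}\sum_{j' \in \mathcal{N}_j}\big(\mathbf{v}_j^{j'}(t-1) - \mathbf{v}_{j'}^{j}(t-1)\big)\Big] \tag{4}$$

where $\mathbf{v}_j^{j'}(t-1)$ denotes the Lagrange multiplier of node $j$ corresponding to its neighbor $j'$ at slot $t-1$, which captures the accumulated differences of neighboring estimates and is recursively obtained as ($\rho$ is a step size)

$$\mathbf{v}_j^{j'}(t-1) = \mathbf{v}_j^{j'}(t-2) + \rho\big[\mathbf{s}_j(t-1) - \mathbf{s}_{j'}(t-1)\big]. \tag{5}$$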

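Next, we develop an equivalent novel form of D-RLS recursions (2)–(5) that is convenient for our incorporation of data-adaptive censoring. A detailed derivation of the equivalence can be found in Appendix 6. The inverse covariance matrix $\boldsymbol{\Phi}_j^{-1}(t)$ is updated as in (2). However, the update of $\mathbf{s}_j(t)$ in (4) is replaced by

$$\mathbf{s}_j(t) = \mathbf{s}_j(t-1) + \boldsymbol{\Phi}_j^{-1}(t)\mathbf{h}_j(t)\big[x_j(t) - \mathbf{h}_j^T(t)\mathbf{s}_j(t-1)\big] - \rho\boldsymbol{\Phi}_j^{-1}(t)\boldsymbol{\delta}_j(t-1) \tag{6}$$

where $\boldsymbol{\delta}_j(t)$ stands for a Lagrange multiplier conveying network-wide information that is updated as

$$\boldsymbol{\delta}_j(t) = \boldsymbol{\delta}_j(t-1) + \sum_{j' \in \mathcal{N}_j}\big[\mathbf{s}_j(t) - \mathbf{s}_{j'}(t)\big] - \lambda\sum_{j' \in \mathcal{N}_j}\big[\mathbf{s}_j(t-1) - \mathbf{s}_{j'}(t-1)\big]. \tag{7}$$

Observe that $\boldsymbol{\delta}_j(t)$ stores the weighted sum of differences between the local estimate of node $j$ and all estimates of its neighbors. Interestingly, if the network is disconnected and the nodes are isolated, then $\boldsymbol{\delta}_j(t) = \mathbf{0}$ so long as $\boldsymbol{\delta}_j(0) = \mathbf{0}$, and the update of $\mathbf{s}_j(t)$ in (6) basically boils down to the centralized RLS one [12, 25]. That is, the current estimate is modified from its previous value using the prediction error $x_j(t) - \mathbf{h}_j^T(t)\mathbf{s}_j(t-1)$, which is known as the incoming data innovation. If on the other hand the network is connected, nodes can leverage estimates of their neighbors (captured by $\boldsymbol{\delta}_j(t-1)$), which provide new information from the network other than their own observations $\{x_j(t), \mathbf{h}_j(t)\}$. The term $\rho\boldsymbol{\Phi}_j^{-1}(t)\boldsymbol{\delta}_j(t-1)$ can be viewed as a Laplacian smoothing regularizer, which encourages all nodes of the graph to reach consensus on their estimates.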

Remark 1. In D-RLS, (2) incurs computational complexity , since calculating the products and requires multiplications. Similarly, (6) incurs computational complexity , that is dominated by the matrix-vector multiplications and . The cost of carrying out (7) is relatively minor. Regarding communication cost per slot , node needs to transmit its local estimate to its neighbors and receive estimates from all neighbors . The computational burden of D-RLS recursions (2)–(5) is comparable to that of (2), (6) and (7), with the cost of (4) being the same as what (6) requires. Meanwhile, the original form requires neighboring nodes and to exchange and in addition to and , which doubles the communication cost relative to (6) and (7).
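The recursions (2), (6) and (7) can be sketched per node as follows. This is a minimal sketch rather than the authors' implementation: the function names `drls_step` and `delta_step` are hypothetical, the inverse covariance is assumed symmetric (so that the numerator in (2) is an outer product), and the demonstration uses a single isolated node with noiseless data, in which case the multiplier stays at zero and (6) behaves like ordinary RLS.

```python
import numpy as np

def drls_step(Phi_inv, s_prev, delta_prev, h, x, lam, rho):
    """One slot of (2) and (6) at a single node; returns (Phi_inv, s)."""
    Ph = Phi_inv @ h                      # Phi^{-1}(t-1) h, Phi_inv symmetric
    Phi_inv = (Phi_inv - np.outer(Ph, Ph) / (lam + h @ Ph)) / lam   # (2)
    innovation = x - h @ s_prev
    s = s_prev + Phi_inv @ h * innovation - rho * Phi_inv @ delta_prev  # (6)
    return Phi_inv, s

def delta_step(delta_prev, s_new, s_prev, nbrs_new, nbrs_prev, lam):
    """Multiplier update (7); nbrs_* are lists of neighbor estimates."""
    diff_new = sum(s_new - sn for sn in nbrs_new)
    diff_prev = sum(s_prev - sp for sp in nbrs_prev)
    return delta_prev + diff_new - lam * diff_prev

# Isolated node, noiseless data, hypothetical true parameter s0.
rng = np.random.default_rng(1)
p = 3
s0 = np.array([1.0, -2.0, 0.5])
Phi_inv, s, delta = 1e4 * np.eye(p), np.zeros(p), np.zeros(p)
for _ in range(200):
    h = rng.standard_normal(p)
    x = h @ s0
    s_old = s
    Phi_inv, s = drls_step(Phi_inv, s_old, delta, h, x, lam=1.0, rho=0.1)
    delta = delta_step(delta, s, s_old, [], [], lam=1.0)  # no neighbors
```

In a connected network, each node would call `delta_step` with the estimates just received from its neighbors; only the local estimates need to be exchanged, in line with the communication cost discussed in Remark 1.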

2.3 Censoring-based D-RLS strategies

The D-RLS algorithm has well documented merits for decentralized online linear regression [16]. However, its computational and communication costs per iteration are fixed, regardless of whether observations and/or the estimates from neighboring nodes are informative or not. This fact motivates our idea of permeating benefits of data-adaptive censoring to decentralized RLS, through three novel censoring-based (C)D-RLS strategies. They are different from the RLS algorithms in [4], where the focus is on centralized online linear regression.

Our first censoring strategy (CD-RLS-1) can be intuitively motivated as follows. If a given datum $(x_j(t), \mathbf{h}_j(t))$ is not informative enough, we do not have to use it, since its contribution to the local estimate of node $j$, as well as to those of all network nodes, is limited. By specifying proper thresholds, to be discussed later, this intuition can be realized using a censoring indicator variable

$$c_j(t) := \begin{cases} 0, & \text{if } |x_j(t) - \mathbf{h}_j^T(t)\mathbf{s}_j(t-1)| \le \tau\sigma_j(t) \\ 1, & \text{if } |x_j(t) - \mathbf{h}_j^T(t)\mathbf{s}_j(t-1)| > \tau\sigma_j(t). \end{cases} \tag{8}$$

If the absolute value of the innovation does not exceed $\tau\sigma_j(t)$, then $(x_j(t), \mathbf{h}_j(t))$ is censored; otherwise it is used. Section 3.3 will provide rules for selecting the threshold $\tau$ along with the local noise variance $\sigma_j^2(t)$, whose computations are lightweight. If data censoring is in effect, we simply throw away the current datum by letting $\mathbf{h}_j(t) = \mathbf{0}$ in (2), to obtain

$$\boldsymbol{\Phi}_j^{-1}(t) = \lambda^{-1}\boldsymbol{\Phi}_j^{-1}(t-1). \tag{9}$$

Likewise, letting $\mathbf{h}_j(t) = \mathbf{0}$ and $x_j(t) = 0$ in (6) yields

$$\mathbf{s}_j(t) = \mathbf{s}_j(t-1) - \rho\boldsymbol{\Phi}_j^{-1}(t)\boldsymbol{\delta}_j(t-1). \tag{10}$$

CD-RLS-1 is summarized in Algorithm 1. If censoring is in effect, computation cost per node and per slot is a fraction of the D-RLS in (4) and (7) without censoring. To recognize why, observe that the scalar-matrix multiplication in (9) is not necessary as the update of can be merged to wherever it is needed, e.g., in (10) and the next slot. In addition, carrying out the multiplications to obtain is no longer necessary, while the multiplications required to obtain remain the same.
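The per-node CD-RLS-1 update described above can be sketched as a single function that branches on the indicator (8). This is an illustrative sketch only, not Algorithm 1 itself: the name `cdrls1_step` and the toy numbers are assumptions, the noise scale `sigma` is taken as given, and the multiplier update (7) and the exchange of estimates with neighbors are omitted.

```python
import numpy as np

def cdrls1_step(Phi_inv, s_prev, delta_prev, h, x, sigma, tau, lam, rho):
    """One CD-RLS-1 slot at a node; returns (Phi_inv, s, c)."""
    innovation = x - h @ s_prev
    c = int(abs(innovation) > tau * sigma)            # indicator (8)
    if c:
        # Informative datum: full updates (2) and (6).
        Ph = Phi_inv @ h
        Phi_inv = (Phi_inv - np.outer(Ph, Ph) / (lam + h @ Ph)) / lam
        s = s_prev + Phi_inv @ h * innovation - rho * Phi_inv @ delta_prev
    else:
        # Censored datum: cheap updates (9) and (10).
        Phi_inv = Phi_inv / lam
        s = s_prev - rho * Phi_inv @ delta_prev
    return Phi_inv, s, c

# Toy inputs (hypothetical values chosen for easy hand-checking).
Phi_inv = np.eye(2)
s_prev = np.array([1.0, 0.0])
delta_prev = np.array([0.5, -0.5])
h = np.array([1.0, 1.0])
args = dict(sigma=1.0, tau=2.0, lam=1.0, rho=0.1)
# Innovation 0.5 <= 2: censored, only the consensus term moves the estimate.
_, s_cens, c_cens = cdrls1_step(Phi_inv, s_prev, delta_prev, h, 1.5, **args)
# Innovation 3 > 2: the datum is used.
_, s_used, c_used = cdrls1_step(Phi_inv, s_prev, delta_prev, h, 4.0, **args)
```

For clarity the sketch applies the scaling in (9) explicitly, whereas the text notes that this scalar factor can instead be merged into later computations.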

The first censoring strategy still requires nodes to communicate with neighbors per time slot; hence, the communication cost remains the same. Reducing this communication cost motivates our second censoring strategy (CD-RLS-2), where each node performs no extra computations relative to CD-RLS-1 but, if its current datum is censored, only receives neighboring estimates and does not transmit its own. The intuition behind this strategy is that if a datum is censored, then very likely the current local estimate is sufficiently accurate, and the node does not need to account for estimates from its neighbors. Estimates received from neighbors are only stored for future use. Likewise, neighbors in $\mathcal{N}_j$ do not need node $j$'s current estimate either, because they have already received a very similar estimate. CD-RLS-2 is summarized in Algorithm 2.

The third censoring strategy (CD-RLS-3) given by Algorithm 3 is more aggressive than the second one. If a node has its datum censored at a certain slot, then it neither transmits to nor receives from its neighbors, and in that sense it remains “isolated” from the rest of the network in this slot. Apparently, we should not allow any node to be forever isolated. To this end, we can force each node to receive the local estimate from any of its neighbors at least once every slots, which upper bounds the delay of information exchange to . Interestingly, the ensuing section will prove convergence of all three strategies to the optimal argument in the mean-square deviation sense under mild conditions.

3 Development and performance analysis

This section starts with a criterion-based development of CD-RLS-1. Convergence analysis of all three censoring strategies will follow, before developing practical means of setting the censoring threshold $\tau$.

3.1 Derivation of censoring-based D-RLS-1

Consider the following truncated quadratic cost that is similar to the one used in the censoring-based but centralized RLS [4]

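$$f_{j,t}(\mathbf{s}) := \begin{cases} 0, & |x_j(t) - \mathbf{h}_j^T(t)\mathbf{s}| \le \tau\sigma_j(t) \\ \frac{1}{2}\big[x_j(t) - \mathbf{h}_j^T(t)\mathbf{s}\big]^2 - \frac{1}{2}\tau^2\sigma_j^2(t), & |x_j(t) - \mathbf{h}_j^T(t)\mathbf{s}| > \tau\sigma_j(t) \end{cases} \tag{11}$$

which is convex, but non-differentiable at the points where $|x_j(t) - \mathbf{h}_j^T(t)\mathbf{s}| = \tau\sigma_j(t)$. Using (11) to replace the quadratic loss in (1), our CD-RLS-1 criterion is

$$\min_{\mathbf{s}} \sum_{r=1}^{t}\sum_{j=1}^{J}\lambda^{t-r} f_{j,r}(\mathbf{s}). \tag{12}$$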
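To make the truncated quadratic cost (11) concrete, the sketch below evaluates it and its gradient away from the kink for a single datum. The function names and the sample values are hypothetical and serve only as an illustration.

```python
import numpy as np

def truncated_cost(x, h, s, sigma, tau):
    """Truncated quadratic cost (11) for one datum (x, h) at argument s."""
    r = x - h @ s                         # residual
    if abs(r) <= tau * sigma:
        return 0.0                        # small residuals cost nothing
    return 0.5 * r**2 - 0.5 * (tau * sigma) ** 2

def truncated_grad(x, h, s, sigma, tau):
    """Gradient of (11) with respect to s, away from |r| = tau * sigma."""
    r = x - h @ s
    c = float(abs(r) > tau * sigma)       # plays the role of c_j(t) in (8)
    return -c * r * h

h = np.array([1.0, 0.0])
s = np.zeros(2)
inside = truncated_cost(1.0, h, s, sigma=1.0, tau=2.0)    # |r| = 1 < 2
boundary = truncated_cost(2.0, h, s, sigma=1.0, tau=2.0)  # |r| = 2, kink
outside = truncated_cost(3.0, h, s, sigma=1.0, tau=2.0)   # 0.5*9 - 0.5*4
```

The cost is continuous at the kink (both branches give zero there), and its gradient vanishes for residuals inside the censoring region, which is what allows a censored datum to be skipped.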

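To solve (12) in a decentralized manner, we introduce a local estimate $\mathbf{s}_j$ per node $j$, along with auxiliary vectors $\bar{\mathbf{z}}_j^{j'}$ and $\tilde{\mathbf{z}}_j^{j'}$ per edge $(j, j')$. By constraining all local estimates of neighbors to consent, we arrive at the following equivalent separable convex program per slot $t$

$$\min_{\{\mathbf{s}_j\}_{j \in \mathcal{V}}} \sum_{r=1}^{t}\sum_{j=1}^{J}\lambda^{t-r} f_{j,r}(\mathbf{s}_j) \quad \text{s.t.} \quad \mathbf{s}_j = \bar{\mathbf{z}}_j^{j'},\ \mathbf{s}_{j'} = \tilde{\mathbf{z}}_j^{j'},\ \bar{\mathbf{z}}_j^{j'} = \tilde{\mathbf{z}}_j^{j'},\ j \in \mathcal{V},\ j' \in \mathcal{N}_j. \tag{13}$$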

Next, we employ alternating minimization and the stochastic Newton iteration to derive our first censoring-based solver of (13). To this end, consider the Lagrangian of (13) that is given by

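$$\mathcal{L}(\mathbf{s}, \mathbf{z}, \mathbf{v}, \mathbf{u}) = \sum_{j \in \mathcal{V}}\sum_{r=1}^{t}\lambda^{t-r} f_{j,r}(\mathbf{s}_j) + \sum_{j=1}^{J}\sum_{j' \in \mathcal{N}_j}\Big[(\mathbf{v}_j^{j'})^T(\mathbf{s}_j - \bar{\mathbf{z}}_j^{j'}) + (\mathbf{u}_j^{j'})^T(\mathbf{s}_{j'} - \tilde{\mathbf{z}}_j^{j'})\Big] \tag{14}$$

where $\mathbf{s} := \{\mathbf{s}_j\}$ and $\mathbf{z} := \{\bar{\mathbf{z}}_j^{j'}, \tilde{\mathbf{z}}_j^{j'}\}$ are primal variables, while $\mathbf{v} := \{\mathbf{v}_j^{j'}\}$ and $\mathbf{u} := \{\mathbf{u}_j^{j'}\}$ are dual variables. Consider also the augmented Lagrangian of (13), namely

$$\mathcal{L}_\rho(\mathbf{s}, \mathbf{z}, \mathbf{v}, \mathbf{u}) = \mathcal{L}(\mathbf{s}, \mathbf{z}, \mathbf{v}, \mathbf{u}) + \frac{\rho}{2}\sum_{j=1}^{J}\sum_{j' \in \mathcal{N}_j}\Big[\|\mathbf{s}_j - \bar{\mathbf{z}}_j^{j'}\|^2 + \|\mathbf{s}_{j'} - \tilde{\mathbf{z}}_j^{j'}\|^2\Big] \tag{15}$$

where $\rho$ is a positive regularization scale. Note that the constraints on $\mathbf{z}$ are not dualized, but they are collected in the set $\mathcal{C}_z := \{\mathbf{z} : \bar{\mathbf{z}}_j^{j'} = \tilde{\mathbf{z}}_j^{j'},\ j \in \mathcal{V},\ j' \in \mathcal{N}_j\}$.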

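To minimize (13) per slot $t$, we rely on alternating minimization [27] in an online manner, which entails an iterative procedure consisting of three steps.

$$\text{[S1]} \quad \mathbf{s}(t) = \arg\min_{\mathbf{s}} \mathcal{L}(\mathbf{s}, \mathbf{z}(t-1), \mathbf{v}(t-1), \mathbf{u}(t-1))$$

$$\text{[S2]} \quad \mathbf{z}(t) = \arg\min_{\mathbf{z} \in \mathcal{C}_z} \mathcal{L}_\rho(\mathbf{s}(t), \mathbf{z}, \mathbf{v}(t-1), \mathbf{u}(t-1))$$

$$\text{[S3]} \quad \begin{aligned} \mathbf{v}_j^{j'}(t) &= \mathbf{v}_j^{j'}(t-1) + \rho\big[\mathbf{s}_j(t) - \bar{\mathbf{z}}_j^{j'}(t)\big] \\ \mathbf{u}_j^{j'}(t) &= \mathbf{u}_j^{j'}(t-1) + \rho\big[\mathbf{s}_{j'}(t) - \tilde{\mathbf{z}}_j^{j'}(t)\big]. \end{aligned}$$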

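Observe that [S2] is a linearly constrained quadratic program, for which if $\mathbf{v}_j^{j'}(t-1) + \mathbf{u}_j^{j'}(t-1) = \mathbf{0}$, we always have

$$\mathbf{s}_{j'}(t) + \mathbf{s}_j(t) = \tilde{\mathbf{z}}_j^{j'}(t) + \bar{\mathbf{z}}_j^{j'}(t) \quad \text{and} \quad \tilde{\mathbf{z}}_j^{j'}(t) = \bar{\mathbf{z}}_j^{j'}(t).$$

Therefore, the initial values of $\mathbf{v}_j^{j'}$ and $\mathbf{u}_j^{j'}$ in [S3] are selected to satisfy $\mathbf{v}_j^{j'}(0) + \mathbf{u}_j^{j'}(0) = \mathbf{0}$ (the simplest choice is $\mathbf{v}_j^{j'}(0) = \mathbf{u}_j^{j'}(0) = \mathbf{0}$). It then holds for $t \ge 0$ that

$$\mathbf{v}_j^{j'}(t) + \mathbf{u}_j^{j'}(t) = \mathbf{0}.$$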

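Using the latter to eliminate $\mathbf{u}_j^{j'}(t)$ in [S3], we obtain

$$\mathbf{v}_j^{j'}(t) = \mathbf{v}_j^{j'}(t-1) + \frac{\rho}{2}\big[\mathbf{s}_j(t) - \mathbf{s}_{j'}(t) - \bar{\mathbf{z}}_j^{j'}(t) + \tilde{\mathbf{z}}_j^{j'}(t)\big] = \mathbf{v}_j^{j'}(t-1) + \frac{\rho}{2}\big[\mathbf{s}_j(t) - \mathbf{s}_{j'}(t)\big] \tag{16}$$

where the first equality comes from subtracting the two lines in [S3], and the second equality is due to $\bar{\mathbf{z}}_j^{j'}(t) = \tilde{\mathbf{z}}_j^{j'}(t)$. The auxiliary variables $\bar{\mathbf{z}}_j^{j'}$ and $\tilde{\mathbf{z}}_j^{j'}$ can thus also be eliminated. When $\mathbf{v}_j^{j'}(t)$ is initialized by $\mathbf{v}_j^{j'}(0) = \mathbf{0}$, summing up both sides of (16) from $r = 1$ to $t$, we arrive, after telescopic cancellation, at

$$\mathbf{v}_j^{j'}(t) = \frac{\rho}{2}\sum_{r=1}^{t}\big[\mathbf{s}_j(r) - \mathbf{s}_{j'}(r)\big]. \tag{17}$$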

Moving on to [S1], observe that it can be split into per-node subproblems

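$$\mathbf{s}_j(t) = \arg\min_{\mathbf{s}_j} \sum_{r=1}^{t}\lambda^{t-r} f_{j,r}(\mathbf{s}_j) + \sum_{j' \in \mathcal{N}_j}\big[\mathbf{v}_j^{j'}(t-1) - \mathbf{v}_{j'}^{j}(t-1)\big]^T\mathbf{s}_j.$$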

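Before solving this per-node subproblem with the stochastic Newton iteration [1], eliminate $\mathbf{v}_j^{j'}(t-1)$ and $\mathbf{v}_{j'}^{j}(t-1)$ using (17) to obtain

$$\mathbf{s}_j(t) = \arg\min_{\mathbf{s}_j} \sum_{r=1}^{t}\lambda^{t-r} f_{j,r}(\mathbf{s}_j) + \rho\sum_{j' \in \mathcal{N}_j}\sum_{r=1}^{t-1}\big[\mathbf{s}_j(r) - \mathbf{s}_{j'}(r)\big]^T\mathbf{s}_j$$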

which after manipulating the double sum yields

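$$\mathbf{s}_j(t) = \arg\min_{\mathbf{s}_j} \sum_{r=1}^{t}\lambda^{t-r} f_{j,r}(\mathbf{s}_j) + \sum_{r=1}^{t}\lambda^{t-r}\rho\sum_{j' \in \mathcal{N}_j}\Big[\mathbf{s}_j(r-1) - \mathbf{s}_{j'}(r-1) + (1-\lambda)\sum_{\xi=1}^{r-1}\big(\mathbf{s}_j(\xi-1) - \mathbf{s}_{j'}(\xi-1)\big)\Big]^T\mathbf{s}_j.$$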

If the update in (7) is initialized with , summing up both sides from to , we find after telescopic cancellation

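$$\boldsymbol{\delta}_j(r-1) = \sum_{j' \in \mathcal{N}_j}\Big[\mathbf{s}_j(r-1) - \mathbf{s}_{j'}(r-1) + (1-\lambda)\sum_{\xi=1}^{r-1}\big(\mathbf{s}_j(\xi-1) - \mathbf{s}_{j'}(\xi-1)\big)\Big]. \tag{18}$$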

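Thus, the optimization of $\mathbf{s}_j(t)$ reduces to

$$\mathbf{s}_j(t) = \arg\min_{\mathbf{s}_j} \sum_{r=1}^{t}\lambda^{t-r} g_{j,r}(\mathbf{s}_j) \tag{19}$$

where the instantaneous cost per slot $t$ is

$$g_{j,t}(\mathbf{s}_j) := f_{j,t}(\mathbf{s}_j) + \rho\boldsymbol{\delta}_j^T(t-1)\mathbf{s}_j. \tag{20}$$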

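The stochastic gradient of the latter is given by

$$\nabla g_{j,t}(\mathbf{s}_j(t-1)) = -c_j(t)\big[x_j(t) - \mathbf{h}_j^T(t)\mathbf{s}_j(t-1)\big]\mathbf{h}_j(t) + \rho\boldsymbol{\delta}_j(t-1).$$

In the stochastic Newton method, the Hessian matrix is given by

$$\mathbf{M}_j(t) = E\big[\nabla^2 g_{j,t}(\mathbf{s}_j(t-1))\big] = E\big[c_j(t)\mathbf{h}_j(t)\mathbf{h}_j^T(t)\big]$$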

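where the second equality comes from (11) and (8). A reasonable approximation of the expectation is provided by sample averaging. However, the presence of the forgetting factor $\lambda$ attenuates past regressors, which leads to

$$\mathbf{M}_j(t) = \frac{1}{t}\sum_{r=1}^{t}\lambda^{t-r}c_j(r)\mathbf{h}_j(r)\mathbf{h}_j^T(r) = \lambda\frac{t-1}{t}\mathbf{M}_j(t-1) + \frac{1}{t}c_j(t)\mathbf{h}_j(t)\mathbf{h}_j^T(t).$$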

Applying the matrix inversion lemma, we obtain

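$$\mathbf{M}_j^{-1}(t) = \frac{t}{t-1}\left[\lambda^{-1}\mathbf{M}_j^{-1}(t-1) - \frac{c_j(t)\lambda^{-1}\mathbf{M}_j^{-1}(t-1)\mathbf{h}_j(t)\mathbf{h}_j^T(t)\mathbf{M}_j^{-1}(t-1)}{(t-1)\lambda + \mathbf{h}_j^T(t)\mathbf{M}_j^{-1}(t-1)\mathbf{h}_j(t)}\right] \tag{21}$$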

and after adopting a diminishing step size $1/t$, the stochastic Newton update becomes

$$\mathbf{s}_j(t) = \mathbf{s}_j(t-1) - \frac{1}{t}\mathbf{M}_j^{-1}(t)\nabla g_{j,t}(\mathbf{s}_j(t-1)).$$

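For notational convenience, let $\boldsymbol{\Phi}_j^{-1}(t) := \mathbf{M}_j^{-1}(t)/t$, and rewrite (21) as (cf. (2))

$$\boldsymbol{\Phi}_j^{-1}(t) = \lambda^{-1}\boldsymbol{\Phi}_j^{-1}(t-1) - \frac{c_j(t)\lambda^{-1}\boldsymbol{\Phi}_j^{-1}(t-1)\mathbf{h}_j(t)\mathbf{h}_j^T(t)\boldsymbol{\Phi}_j^{-1}(t-1)}{\lambda + \mathbf{h}_j^T(t)\boldsymbol{\Phi}_j^{-1}(t-1)\mathbf{h}_j(t)}. \tag{22}$$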

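Substituting $\nabla g_{j,t}(\mathbf{s}_j(t-1))$ and $\mathbf{M}_j^{-1}(t) = t\,\boldsymbol{\Phi}_j^{-1}(t)$ into the stochastic Newton iteration yields (cf. (6))

$$\mathbf{s}_j(t) = \mathbf{s}_j(t-1) + c_j(t)\boldsymbol{\Phi}_j^{-1}(t)\mathbf{h}_j(t)\big[x_j(t) - \mathbf{h}_j^T(t)\mathbf{s}_j(t-1)\big] - \rho\boldsymbol{\Phi}_j^{-1}(t)\boldsymbol{\delta}_j(t-1)$$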

which completes the development of CD-RLS-1.
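The rank-one recursion (22) can be sanity-checked numerically against a direct matrix inverse. The sketch below assumes the underlying covariance recursion is $\boldsymbol{\Phi}_j(t) = \lambda\boldsymbol{\Phi}_j(t-1) + c_j(t)\mathbf{h}_j(t)\mathbf{h}_j^T(t)$, which is what the matrix inversion lemma inverts; the helper name `phi_inv_update` and the random test matrices are hypothetical.

```python
import numpy as np

def phi_inv_update(Phi_prev_inv, h, c, lam):
    """Recursion (22); c is the censoring indicator from (8)."""
    Ph = Phi_prev_inv @ h                 # assumes Phi_prev_inv is symmetric
    return (Phi_prev_inv - c * np.outer(Ph, Ph) / (lam + h @ Ph)) / lam

rng = np.random.default_rng(2)
p, lam = 4, 0.9
A = rng.standard_normal((p, p))
Phi_prev = A @ A.T + np.eye(p)            # a positive definite Phi_j(t-1)
Phi_prev_inv = np.linalg.inv(Phi_prev)
h = rng.standard_normal(p)

# Compare (22) with the direct inverse for a censored (c = 0)
# and an uncensored (c = 1) datum.
checks = []
for c in (0, 1):
    direct = np.linalg.inv(lam * Phi_prev + c * np.outer(h, h))
    checks.append(np.allclose(phi_inv_update(Phi_prev_inv, h, c, lam), direct))
```

With $c_j(t) = 0$ the update collapses to the scaling in (9), so a censored slot needs no matrix-vector products for the covariance update.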

3.2 Convergence analysis

Here we establish convergence of all three novel strategies for $\lambda = 1$. With $\lambda < 1$, the EWLS estimator can even adapt to time-varying parameter vectors, but analyzing its tracking performance goes beyond the scope of this paper. For the time-invariant case ($\lambda = 1$), we will rely on the following assumption.

(as1) Observations obey the linear model , where is correlated across and . Rows are uniformly bounded and independent of . Covariance matrices are time-invariant and positive definite. Process is mean ergodic, while and are uncorrelated. Eigenvalues of , which approximate the true positive definite Hessian matrices , are bounded below by a positive constant when is large enough.

We will assess convergence of our iterative algorithms using the squared mean-root deviation (SMRD) metric, defined as

$$\text{SMRD}(t) := \left\{E\left[\Big(\sum_{j=1}^{J}\|\mathbf{s}_j(t) - \mathbf{s}_0\|^2\Big)^{\frac{1}{2}}\right]\right\}^2. \tag{23}$$

Letting denote the estimation error of node and the estimation error across all nodes, one can see that . Observe that is a lower-bound approximation of the mean-square deviation (MSD) metric [15, 26], since by Jensen’s inequality .
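The relation between SMRD and MSD can be illustrated with empirical averages over Monte Carlo runs. The sketch below uses synthetic Gaussian estimation errors, which are purely hypothetical and do not come from any of the algorithms; it only shows how the two metrics would be computed and that the Jensen-type ordering holds for sample averages as well.

```python
import numpy as np

rng = np.random.default_rng(3)
runs, J, p = 1000, 5, 3
# Hypothetical errors s_j(t) - s_0 at a fixed slot t, one set per run.
err = 0.1 * rng.standard_normal((runs, J, p))

# Network-wide error norm per run: (sum_j ||s_j(t) - s_0||^2)^(1/2).
network_norm = np.sqrt((err ** 2).sum(axis=(1, 2)))

smrd = network_norm.mean() ** 2        # empirical counterpart of (23)
msd = (network_norm ** 2).mean()       # empirical mean-square deviation
```

The gap `msd - smrd` equals the sample variance of the network-wide error norm, so the two metrics coincide only when that norm does not fluctuate across runs.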

Under (as1), convergence of CD-RLS-1 and CD-RLS-2 is asserted as follows; see Appendix 7 for the proof.

Theorem 1.

For CD-RLS-1 and CD-RLS-2 Algorithms 1 and 2, set and per node . Let , and suppose for CD-RLS-1 and correspondingly for CD-RLS-2, while is the network Laplacian and the constant depends on , and the upper bound of . Under (as1), there exists for which it holds for that

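$$\left\{E\left[\Big(\sum_{j=1}^{J}\|\mathbf{s}_j(t) - \mathbf{s}_0\|^2\Big)^{\frac{1}{2}}\right]\right\}^2 \le \sum_{j=1}^{J}\left[\frac{\gamma^{-1}\|\mathbf{s}_j(0) - \mathbf{s}_0\|^2 + \gamma t_0\sigma_j^2\,\mathrm{tr}(\mathbf{R}_{h_j})}{2Q(\tau)\mu t} + \frac{\gamma\sigma_j^2\lambda_{\max}(\mathbf{R}_{h_j}^{-1})\,\mathrm{tr}(\mathbf{R}_{h_j})\ln(t)}{4Q^2(\tau)\mu t}\right]. \tag{24}$$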

Theorem 1 establishes that the SMRD in (23) converges to zero at a rate . The constant of the convergence rate is related to through , and ; the noise covariance , and the threshold through . Theorem 1 also indicates the impact of the initial states (determined by and ), which disappears at a faster rate of . To guarantee convergence, the step size must be small enough.

The proof for CD-RLS-3 is more challenging. Because a node does not receive any information from its neighbors when censoring is in effect, it has to rely on outdated neighboring estimates when the incoming datum is not censored. This delay in percolating information may cause computational instability. For this reason, we will impose an additional constraint to guarantee that all local estimates do not grow unbounded. In practice, this can be realized by truncating local estimates when they exceed a certain threshold.

(as2) Local estimates are uniformly bounded .

Convergence of CD-RLS-3 is then asserted as follows. Similar to CD-RLS-1 and CD-RLS-2, the SMRD of CD-RLS-3 converges to zero with rate , as stated in the following theorem.

Theorem 2.

For CD-RLS-3 given by Algorithms 3, set and per node . Under (as1) and (as2) with as in Theorem 1, there exists for which it holds , that

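$$\left\{E\left[\Big(\sum_{j=1}^{J}\|\mathbf{s}_j(t) - \mathbf{s}_0\|^2\Big)^{\frac{1}{2}}\right]\right\}^2 \le \frac{a + b\ln(t)}{t} \tag{25}$$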

where and are positive constants that depend on the upper bounds of and , parameters and , the covariance , the Laplacian matrix , and .

Although the bounds asserted by Theorems 1 and 2 could be loose, they demonstrate that $\lim_{t \to \infty}\text{SMRD}(t) = 0$, which establishes that the decentralized estimates converge to the ground truth asymptotically.

3.3 Threshold setting and variance estimation

The threshold influences considerably the performance of all CD-RLS algorithms. Its value trades off estimation accuracy for computation and communication overhead. We provide a simple criterion for setting using the average censoring ratio , which is defined as the number of censored data over the total number of data [19]. The goal is to choose so that the actual censoring ratio approaches as goes to infinity – since we are dealing with streaming big data, such an asymptotic property is certainly relevant. When is large enough, is very close to ; thus, the innovation . As a consequence, , where the last equality holds because