
# Decentralized Prediction-Correction Methods for Networked Time-Varying Convex Optimization

Andrea Simonetto, Alec Koppel, Aryan Mokhtari, Geert Leus, and Alejandro Ribeiro

*The work in this paper is supported by NSF CAREER CCF-0952867, ONR N00014-12-1-0997, ASEE SMART, and ARL MAST CTA. This paper expands the results and presents proofs that are referenced in [1, 2]. Andrea Simonetto is with the ICTEAM institute, Université catholique de Louvain, Belgium. Email: andrea.simonetto@uclouvain.be. Geert Leus is with the Department of EEMCS, Delft University of Technology, The Netherlands. Email: g.j.t.leus@tudelft.nl. Alec Koppel, Aryan Mokhtari, and Alejandro Ribeiro are with the Department of ESE, University of Pennsylvania, Philadelphia, PA, USA. Emails: {akoppel, aryanm, aribeiro}@seas.upenn.edu.*
###### Abstract

We develop algorithms that find and track the optimal solution trajectory of time-varying convex optimization problems that consist of local and network-related objectives. The algorithms are derived from the prediction-correction methodology, which corresponds to a strategy where the time-varying problem is sampled at discrete time instances and a sequence is then generated by alternately executing predictions on how the optimizers at the next time sample are changing and corrections on how they actually have changed. Prediction is based on how the optimality conditions evolve in time, while correction is based on a gradient or Newton method, leading to the Decentralized Prediction-Correction Gradient (DPC-G) and Decentralized Prediction-Correction Newton (DPC-N) methods. We extend these methods to cases where knowledge of how the optimization programs change in time is only approximate, and propose the Decentralized Approximate Prediction-Correction Gradient (DAPC-G) and Decentralized Approximate Prediction-Correction Newton (DAPC-N) methods. Convergence properties of all the proposed methods are studied, and empirical performance is shown on an application of a resource allocation problem in a wireless network. We observe that the proposed methods outperform existing running algorithms by orders of magnitude. The numerical results showcase a trade-off between convergence accuracy, sampling period, and network communications.

## I Introduction

Decentralized tracking methods are used to solve problems in which distinct agents of a network aim at minimizing a global objective that varies continuously in time. We focus on a special case of this problem, where the objective may be decomposed into two parts: the first part is a sum of functions which are locally available at each node; the second is defined along the edges of the network and often represents the cost of communication among the agents. Problems of this kind arise, e.g., in estimation, control, and robotics [3, 4, 5, 6, 7, 8, 9].

One approach to continuous-time optimization problems of this kind is to sample the objective function at discrete time instances $t_k$, $k = 0, 1, 2, \dots$, and then solve each resulting time-invariant instance of the problem via classical methods such as gradient or Newton descent. If the sampling period $h = t_{k+1} - t_k$ is chosen arbitrarily small, then doing so would yield the solution trajectory with arbitrary accuracy. However, solving such problems for each time sample is not a viable option in most application domains, since the computation time to obtain each optimizer exceeds the rate at which the solution trajectory changes, unless the problem is approximately stationary.

Prediction-correction algorithms [10], by making use of tools of non-stationary optimization [11, 12, 13], have been developed to iteratively solve convex programs which continuously vary in time. These methods operate by predicting at time $t_k$ the optimal solution at the discrete time instance $t_{k+1}$ via an approximation of the variation of the objective function over this time slot. Then, this prediction is revised by executing gradient or Newton descent. However, these methods are designed only for centralized settings. We focus on time-varying convex programs in decentralized settings, where nodes can only communicate with their neighbors. As a consequence, the prediction-correction methods suggested in [10] are not directly applicable.

One approach to solving problems of this type is offered by decentralized running algorithms, which run at the same time-scale as the optimization problem and dynamically react to changes in the objective function. Performance guarantees for such methods yield convergence to a neighborhood of the true optimizer whose radius is on the order of the sampling period $h$, despite the fact that only one round of communication is allowed per discrete time step [14, 4, 15, 16, 17, 18, 19, 20]. The aforementioned works mostly consider strongly convex objectives with no constraints. Notably, [18] and [19] describe a running dual decomposition and a running alternating direction method of multipliers (ADMM) algorithm, respectively. Notice that these methods implement only correction steps and thus cannot effectively mitigate the error from the non-stationarity of the optimizer.

In this paper, we generalize the prediction-correction methodology of [10] to decentralized settings such that each node of a network, after communicating with its neighbors, estimates its local component of the optimal trajectory at the discrete time instance $t_{k+1}$ from information regarding the objective at time $t_k$, and then corrects this local prediction at time $t_{k+1}$ via additional communications within the network. To develop this generalization, in the prediction step we truncate the Taylor series of the objective function's Hessian inverse. This approximation is necessary since the computation of the objective function's Hessian inverse, which is required for the prediction step, requires global communication. In the correction step, we use decentralized approximations of gradient descent and of Newton's method to correct the predicted solution by descending towards the optimal solution of the observed objective function. In addition, we consider cases in which the prediction of how the cost function changes in time is unavailable and must be estimated. This time-derivative approximation is particularly useful in target tracking [21] or in designing learning-based control strategies [22, 23].

The main contributions of the paper are the following.

1. We develop prediction-correction algorithms for a class of time-varying networked optimization problems, which can be implemented in a distributed fashion over a network of computing and communicating nodes. The correction term is either derived from a gradient method or from a (damped) Newton step.

2. In order to compute the prediction (and correction for Newton) direction, we employ a novel matrix splitting technique, for which the one developed in [24, 25] is a special case (only valid for adjacency matrices). The novel methodology relies on the concept of block diagonal dominance.

3. We prove convergence of all the algorithms and characterize their convergence rate. For the case of the (damped) Newton correction step, we compute the (local) convergence region and argue global convergence in case of a damped step.

The paper is organized as follows. In Section II, we begin by introducing the optimization problem of interest and by providing some examples for the proposed formulation. We then derive a family of algorithms which contains four distinct methods (Section III). We analyze their convergence properties in Section IV, establishing that the sequence of iterates generated by all these algorithms converges linearly to a bounded tracking error. We observe a trade-off in the implementation between approximation accuracy and communication cost. In Section V, we numerically analyze the methods on a resource allocation problem in wireless sensor networks. Lastly, in Section VI we conclude.¹

¹ *Notation.* Vectors are written as $\mathbf{x}$ and matrices as $\mathbf{X}$. The norm $\|\cdot\|$ denotes the Euclidean norm, applied to vectors, matrices, and tensors alike. The gradient of the function $f$ with respect to $\mathbf{x}$ at the point $(\mathbf{x};t)$ is indicated as $\nabla_{\mathbf{x}} f(\mathbf{x};t)$, while the partial derivative of the same function w.r.t. $t$ at $(\mathbf{x};t)$ is $\nabla_t f(\mathbf{x};t)$. Similarly, the notation $\nabla_{\mathbf{x}\mathbf{x}} f(\mathbf{x};t)$ denotes the Hessian of $f$ w.r.t. $\mathbf{x}$ at $(\mathbf{x};t)$, whereas $\nabla_{t\mathbf{x}} f(\mathbf{x};t)$ denotes the partial derivative of the gradient of $f$ w.r.t. time $t$ at $(\mathbf{x};t)$, i.e., the mixed first-order partial derivative vector of the objective. Consistent notation is used for higher-order derivatives.

## II Problem Formulation

We consider a connected undirected graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, with vertex set $\mathcal{V}$ containing $n$ nodes and edge set $\mathcal{E}$ containing $m$ edges. Consider $\mathbf{y}_i \in \mathbb{R}^p$ as the decision variable of node $i \in \mathcal{V}$ and $t$ as a non-negative scalar that represents time. Associated with each node $i$ are the time-varying strongly convex functions $f_i(\mathbf{y}_i;t)$ and $g_{i,i}(\mathbf{y}_i;t)$. The local functions may be interpreted as, e.g., the merit of a particular choice of control policy [5] or statistical model [3]. Moreover, associated with each edge $(i,j) \in \mathcal{E}$ is a continuously time-varying convex function $g_{i,j}(\mathbf{y}_i,\mathbf{y}_j;t)$. These edge-wise functions represent, e.g., the cost of communicating across the network [26].

We focus on problems where nodes aim at cooperatively minimizing the global smooth strongly convex cost function $F(\mathbf{y};t)$, which can be written as the sum of locally available functions $f(\mathbf{y};t)$ and a function $g(\mathbf{y};t)$ induced by the network structure. In particular, the function $f$ is the sum of the locally available functions $f_i$,

$$ f(\mathbf{y};t) := \sum_{i\in\mathcal{V}} f_i(\mathbf{y}_i;t), \tag{1} $$

where we have defined $\mathbf{y}$ in (1) as the stacking of the nodes' decision variables $\mathbf{y}_i$, i.e., $\mathbf{y} = [\mathbf{y}_1; \dots; \mathbf{y}_n] \in \mathbb{R}^{np}$. The function $g$ induced by the structure of the network is the sum of the locally available functions $g_{i,i}$ and the functions $g_{i,j}$ associated with the edges of the network,

$$ g(\mathbf{y};t) := \sum_{i\in\mathcal{V}} g_{i,i}(\mathbf{y}_i;t) + \sum_{(i,j)\in\mathcal{E}} g_{i,j}(\mathbf{y}_i,\mathbf{y}_j;t). \tag{2} $$

Our goal is to solve the time-varying convex program

$$ \mathbf{y}^*(t) := \operatorname*{argmin}_{\mathbf{y}\in\mathbb{R}^{np}} F(\mathbf{y};t) := f(\mathbf{y};t) + g(\mathbf{y};t), \quad \text{for } t \ge 0, \tag{3} $$

which is the foundation of many problems in cooperative control and network utility maximization. Our goal is to enable the nodes to determine their own component $\mathbf{y}_i^*(t)$ of the solution of (3) for each time $t$ in a decentralized fashion, i.e., via a protocol such that each node only requires communication with neighboring nodes. Notice that nodes can minimize the objective function $f$ independently, while minimization of the function $g$ requires coordination and information exchange across the network. Before developing distributed protocols to solve (3), we present a couple of examples to clarify the problem setting.

###### Example 1 (Estimation of distributed processes)

We consider a network of interconnected sensors monitoring a time-varying distributed process. We represent this process by a vector-valued function $\mathbf{u}(\mathbf{x};t)$, with $\mathbf{x}$ being the spatial coordinate and $t$ denoting time. We assume that the process is spatially smooth, so that the values of $\mathbf{u}$ at close-by spatial coordinates are similar. We focus on the case where a network of $n$ sensors is deployed over a spatial region. The $i$-th node acquires measurements $z_i(\mathbf{x}_i,t)$ which are noisy linear transformations of the true process, where $\mathbf{x}_i$ is the location of sensor $i$, $\mathbf{h}_i$ is its regressor, and the noise is Gaussian distributed independently across time with covariance $\sigma_i$. This problem setting comes up in earth sciences [27, 28] and acoustics [29], but it is also relevant in robotics [30, 31, 9]. By considering the task of learning a spatially regularized least-squares estimate of the process at different locations, we obtain the time-varying networked convex program

$$ \min_{\hat{\mathbf{u}}_1\in\mathbb{R}^p,\dots,\hat{\mathbf{u}}_n\in\mathbb{R}^p} \; \frac{1}{2}\sum_{i=1}^{n} \frac{1}{\sigma_i}\big\|\mathbf{h}_i^\mathsf{T}\hat{\mathbf{u}}_i - z_i(\mathbf{x}_i,t)\big\|^2 + \frac{\beta}{2}\sum_{i=1}^{n}\sum_{j\in\mathcal{N}_i} w_{ij}\,\|\hat{\mathbf{u}}_i - \hat{\mathbf{u}}_j\|_2^2, \tag{4} $$

where $\mathcal{N}_i$ denotes the neighborhood of node $i$, $\hat{\mathbf{u}}_i$ is the estimated value of the process at time $t$ and location $\mathbf{x}_i$, the constant $\beta > 0$ is a regularizer that incentivizes closely located sensors to obtain similar estimates, and the nonnegative weights $w_{ij}$ may be defined according to a function of the distance between sensors. The first term in (4) measures the estimation accuracy in terms of the squared error and is identified as a sum of functions which only depend on local information, which is a special case of (1). The second term in (4) couples the decisions of node $i$ with those of its neighbors $j \in \mathcal{N}_i$, and it is of the form (2). Thus (4) is an instance of (3).

###### Example 2 (Resource allocation problems)

Consider a resource allocation problem in a wireless sensor network [32, 26, 33]. Associate with sensor $i$ a time-varying utility function $f_i(\mathbf{y}_i;t)$ and a decision variable $\mathbf{y}_i \in \mathbb{R}^p$ representing the resources allocated to node $i$ in a network of $n$ sensors. To allocate resources in this network, one must respect channel capacity and interference constraints. These constraints may be formulated in aggregate as network-flow constraints, obtaining the time-varying resource allocation problem

 min\mathboldy1∈Rp,…,\mathboldyn∈Rp ∑i∈Vfi(\mathboldyi;t)  subject% to \mathboldA\mathboldy=\mathboldb(t). (5)

In (5), $\mathbf{A}$ denotes the augmented graph edge incidence matrix. The matrix $\mathbf{A}$ is formed by $m \times n$ square blocks of dimension $p$. If the edge $e$ links node $j$ to node $l$, the block $(e,j)$ is $[\mathbf{A}]_{e,j} = \mathbf{I}_p$ and the block $[\mathbf{A}]_{e,l} = -\mathbf{I}_p$, where $\mathbf{I}_p$ denotes the identity matrix of dimension $p$. All other blocks are identically null. Moreover, the time-varying vectors $\mathbf{b}(t)$ are induced by channel capacity and rate transmission constraints.

In many situations, especially in commercial settings where the nodes are consumer devices, one seeks to solve decentralized approximations of (5). One way to do so is to consider the approximate augmented Lagrangian relaxation of (5), and solve instead

 min\mathboldy1∈Rp,…,\mathboldyn∈Rp ∑i∈Vfi(\mathboldyi;t)+1β2∥\mathboldA\mathboldy−\mathboldb(t)∥2, (6)

which is now unconstrained [34]. Notice that the parameter $\beta$, which behaves similarly to a Lagrange multiplier, tunes the approximation level and penalizes the violation of the approximated constraint $\mathbf{A}\mathbf{y} = \mathbf{b}(t)$. Observe that the first term in (6) is precisely the same as (1). Moreover, block-wise decomposition of the second term yields edge-wise expressions which may be identified as the functions $g_{i,j}$ in (2).

## III Algorithm Development

To solve the time-varying optimization problem in (3), the first step is sampling the continuously time-varying objective function $F(\mathbf{y};t)$ at time instants $t_k$, $k = 0, 1, 2, \dots$, with sampling period $h = t_{k+1} - t_k$, leading to a sequence of time-invariant convex problems

$$ \mathbf{y}^*(t_k) := \operatorname*{argmin}_{\mathbf{y}\in\mathbb{R}^{np}} F(\mathbf{y};t_k), \quad k \ge 0. \tag{7} $$

The sequence of optimal decision variables $\mathbf{y}^*(t_k)$ defined in (7) consists of samples of the optimal trajectory $\mathbf{y}^*(t)$ defined in (3). Since solving (7) for each time instance is impractical even for moderately sized networks, we instead devise a method to generate a sequence of approximate optimizers $\{\mathbf{y}_k\}$ for (7) which eventually remains close to the true optimizer in (7) up to a constant error. More formally, we seek to generate a sequence $\{\mathbf{y}_k\}$ for which

$$ \limsup_{k\to\infty}\,\|\mathbf{y}_k - \mathbf{y}^*(t_k)\| = \text{const.}, \tag{8} $$

whose convergence rate and asymptotic error constant depend on the sampling period $h$ and on the number of exchanged messages per node per time instance.

To do so, we build upon prediction-correction methods, which at the current time sample $t_k$ predict the optimal decision variable at the next time sample $t_{k+1}$. That is, from an arbitrary initial variable $\mathbf{y}_0$, for each time $t_k$ we predict a new approximate optimizer $\mathbf{y}_{k+1|k}$ as

$$ \mathbf{y}_{k+1|k} = \mathbf{y}_k + h\,\mathbf{p}_k, \tag{9} $$

where index $k$ is associated with the time sample $t_k$ (and similarly $k+1$ with $t_{k+1}$), $\mathbf{p}_k$ is the prediction direction, $\mathbf{y}_{k+1|k}$ is the predicted variable for step $k+1$, and $h$ is the sampling period. Then, after observing the sampled objective function at $t_{k+1}$, we correct the predicted vector by

$$ \mathbf{y}_{k+1} = \mathbf{y}_{k+1|k} + \gamma\,\mathbf{c}_{k+1}, \tag{10} $$

for a certain correction direction $\mathbf{c}_{k+1}$ which defines a descent direction, with nonnegative constant step-size $\gamma$.
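As a concrete illustration of the generic scheme (9)-(10), the following self-contained Python sketch tracks the minimizer of a toy centralized objective $F(\mathbf{y};t) = \frac{1}{2}\|\mathbf{y} - \mathbf{r}(t)\|^2$ with a circular target $\mathbf{r}(t)$; the objective, step-sizes, and horizon are illustrative choices for this sketch, not taken from the paper:

```python
import numpy as np

# Toy time-varying objective F(y; t) = 0.5 * ||y - r(t)||^2, whose
# optimizer is the moving target y*(t) = r(t)  (illustrative choice).
def r(t):
    return np.array([np.cos(t), np.sin(t)])

def grad(y, t):       # gradient ∇_y F(y; t)
    return y - r(t)

def hess(y, t):       # Hessian ∇_yy F(y; t)
    return np.eye(2)

def mixed(y, t):      # mixed derivative ∇_ty F(y; t) = -r'(t)
    return -np.array([-np.sin(t), np.cos(t)])

h, gamma = 0.05, 0.5               # sampling period, correction step-size
y = np.zeros(2)
for k in range(200):
    tk = k * h
    # Prediction (9): p_k solves  ∇_yy F p = -∇_ty F.
    p = -np.linalg.solve(hess(y, tk), mixed(y, tk))
    y_pred = y + h * p
    # Correction (10): one gradient step on the newly sampled objective.
    y = y_pred - gamma * grad(y_pred, tk + h)

track_err = np.linalg.norm(y - r(200 * h))   # distance to the optimizer
```

Because the prediction advances the iterate along the target's motion before the gradient correction is applied, the tracking error settles at a much smaller level than that of a correction-only scheme with the same step-sizes.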

### III-A Decentralized prediction step

Solving the strongly convex time-invariant problem (7) amounts to finding the unique decision variable $\mathbf{y}^*(t_k)$ for which

$$ \nabla_{\mathbf{y}} F(\mathbf{y}^*(t_k);t_k) = 0. \tag{11} $$

For any other variable $\mathbf{y} \neq \mathbf{y}^*(t_k)$, the gradient is not null, and we can use it to quantify the suboptimality of $\mathbf{y}$ w.r.t. $\mathbf{y}^*(t_k)$.

We design the prediction direction as the one that maintains the suboptimality level when moving from $t_k$ to $t_{k+1}$ (the rationale being that once we arrive at optimality, the prediction will keep us there). Formally, we wish to determine $\mathbf{y}_{k+1|k}$ as the vector for which

$$ \nabla_{\mathbf{y}} F(\mathbf{y}_{k+1|k};t_{k+1}) = \nabla_{\mathbf{y}} F(\mathbf{y}_k;t_k). \tag{12} $$

Of course, implementing (12) requires information about the future time $t_{k+1}$ at the present time $t_k$, an impossibility without clairvoyance. Instead, we approximate the left-hand side by a Taylor expansion, obtaining

$$ \nabla_{\mathbf{y}} F(\mathbf{y}_k;t_k) + \nabla_{\mathbf{y}\mathbf{y}} F(\mathbf{y}_k;t_k)\,(\mathbf{y}_{k+1|k} - \mathbf{y}_k) + h\,\nabla_{t\mathbf{y}} F(\mathbf{y}_k;t_k) = \nabla_{\mathbf{y}} F(\mathbf{y}_k;t_k), \tag{13} $$

which may be reordered so that $\mathbf{y}_{k+1|k}$ is on the left-hand side, yielding

$$ \mathbf{y}_{k+1|k} = \mathbf{y}_k - h\,\big[\nabla_{\mathbf{y}\mathbf{y}} F(\mathbf{y}_k;t_k)\big]^{-1}\,\nabla_{t\mathbf{y}} F(\mathbf{y}_k;t_k). \tag{14} $$

The update (14) describes discrete-time iso-suboptimality dynamics. The prediction step (14) would in principle allow us to maintain a consistent level of suboptimality, but our focus on decentralized methods precludes its use. This is because execution of (14) requires computing the Hessian inverse $[\nabla_{\mathbf{y}\mathbf{y}} F(\mathbf{y}_k;t_k)]^{-1}$, which is not implementable by a network because inverting the Hessian is a global computation. The Hessian consists of two terms: the first term, $\nabla_{\mathbf{y}\mathbf{y}} f(\mathbf{y}_k;t_k)$, is a block diagonal matrix, and the second term, $\nabla_{\mathbf{y}\mathbf{y}} g(\mathbf{y}_k;t_k)$, is a block neighbor sparse matrix that inherits the structure of the graph. Therefore, the global objective function's Hessian has the sparsity pattern of the graph and can be computed by exchanging information with neighboring nodes. Nonetheless, the Hessian inverse, required in (14), is not neighbor sparse, and its computation requires global information.

To develop a decentralized protocol that approximately executes (14), we generalize a recently proposed technique to approximate the Hessian inverse, which operates by truncating its Taylor expansion [24, 25]. To do so, define $\mathbf{D}_k$ as the block diagonal matrix that contains the diagonal blocks of the Hessian $\nabla_{\mathbf{y}\mathbf{y}} F(\mathbf{y}_k;t_k)$, and write the Hessian as

$$ \nabla_{\mathbf{y}\mathbf{y}} F(\mathbf{y}_k;t_k) = \mathbf{D}_k - \mathbf{B}_k, \tag{15} $$

where the matrices $\mathbf{D}_k$ and $\mathbf{B}_k$ are defined as

$$ \mathbf{D}_k := \nabla_{\mathbf{y}\mathbf{y}} f(\mathbf{y}_k;t_k) + \operatorname{diag}\big[\nabla_{\mathbf{y}\mathbf{y}} g(\mathbf{y}_k;t_k)\big], \tag{16a} $$
$$ \mathbf{B}_k := \operatorname{diag}\big[\nabla_{\mathbf{y}\mathbf{y}} g(\mathbf{y}_k;t_k)\big] - \nabla_{\mathbf{y}\mathbf{y}} g(\mathbf{y}_k;t_k). \tag{16b} $$

Since $F$ is strongly convex, and by Assumption 2 [cf. Section IV], the matrix $\mathbf{D}_k$ is a positive definite block diagonal matrix that encodes second-order local objective information. The structure of the matrix $\mathbf{B}_k$ is induced by that of the graph: the diagonal blocks of $\mathbf{B}_k$ are null, and the non-diagonal block $\mathbf{B}_k^{ij}$ is nonzero and given by $-\nabla_{\mathbf{y}_i\mathbf{y}_j} g_{i,j}(\mathbf{y}_k^i,\mathbf{y}_k^j;t_k)$ iff $i$ and $j$ are neighbors.

Given that $\mathbf{D}_k$ is positive definite, we can write

$$ \nabla_{\mathbf{y}\mathbf{y}} F(\mathbf{y}_k;t_k) = \mathbf{D}_k^{1/2}\big(\mathbf{I} - \mathbf{D}_k^{-1/2}\mathbf{B}_k\mathbf{D}_k^{-1/2}\big)\mathbf{D}_k^{1/2}. \tag{17} $$

Consider now the Taylor series $(\mathbf{I}-\mathbf{X})^{-1} = \sum_{\tau=0}^{\infty} \mathbf{X}^{\tau}$, valid for $\|\mathbf{X}\| < 1$, with $\mathbf{X} = \mathbf{D}_k^{-1/2}\mathbf{B}_k\mathbf{D}_k^{-1/2}$, to write the inverse of (17) as

$$ \big[\nabla_{\mathbf{y}\mathbf{y}} F(\mathbf{y}_k;t_k)\big]^{-1} = \mathbf{D}_k^{-1/2} \sum_{\tau=0}^{\infty} \big(\mathbf{D}_k^{-1/2}\mathbf{B}_k\mathbf{D}_k^{-1/2}\big)^{\tau}\,\mathbf{D}_k^{-1/2}, \tag{18} $$

whose convergence (as well as the fact that the eigenvalues of $\mathbf{D}_k^{-1/2}\mathbf{B}_k\mathbf{D}_k^{-1/2}$ are strictly smaller than one in magnitude, so that the Taylor series holds) will be formally proved in Appendix A. We approximate the Hessian inverse in (18) by its $K$-th order approximate $\mathbf{H}_{k,(K)}^{-1}$, which is formed by truncating the series in (18) to its first $K+1$ terms as

$$ \mathbf{H}_{k,(K)}^{-1} = \mathbf{D}_k^{-1/2} \sum_{\tau=0}^{K} \big(\mathbf{D}_k^{-1/2}\mathbf{B}_k\mathbf{D}_k^{-1/2}\big)^{\tau}\,\mathbf{D}_k^{-1/2}. \tag{19} $$
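The truncation (19) can be checked numerically. The Python sketch below uses an illustrative 3-node path-graph Hessian (scalar blocks, synthetic values) standing in for $\nabla_{\mathbf{y}\mathbf{y}}F$, builds the splitting (15)-(16), and shows the approximation error of $\mathbf{H}_{k,(K)}^{-1}$ decaying as $K$ grows:

```python
import numpy as np

# Illustrative Hessian of a 3-node path graph (scalar blocks), split as
# ∇_yy F = D - B, with D the (block) diagonal part, as in (15)-(16).
H = np.array([[ 3.0, -1.0,  0.0],
              [-1.0,  3.0, -1.0],
              [ 0.0, -1.0,  3.0]])
D = np.diag(np.diag(H))
B = D - H                                   # neighbor-sparse part

def approx_inverse(K):
    """K-th order truncated-series approximation (19) of H^{-1}."""
    Dms = np.diag(1.0 / np.sqrt(np.diag(D)))       # D^{-1/2}
    X = Dms @ B @ Dms                              # D^{-1/2} B D^{-1/2}
    S = sum(np.linalg.matrix_power(X, tau) for tau in range(K + 1))
    return Dms @ S @ Dms

# Here the eigenvalues of D^{-1/2} B D^{-1/2} lie strictly inside (-1, 1),
# so the truncation error decays geometrically in K.
errors = [np.linalg.norm(approx_inverse(K) - np.linalg.inv(H))
          for K in range(6)]
```

Each extra term of the series corresponds to one more hop of neighborhood information, which is the communication/accuracy trade-off exploited below.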

Since the matrix $\mathbf{D}_k$ is block diagonal and $\mathbf{B}_k$ is block neighbor sparse, it follows that the $K$-th order approximate inverse $\mathbf{H}_{k,(K)}^{-1}$ is $K$-hop block neighbor sparse, i.e., its $(i,j)$-th block is nonzero only if there is a path between nodes $i$ and $j$ of length $K$ or smaller. Substituting the approximation in (19) into (14), the prediction step may be written as

$$ \mathbf{y}_{k+1|k} = \mathbf{y}_k + h\,\mathbf{p}_{k,(K)}, \tag{20} $$

where the approximate prediction direction $\mathbf{p}_{k,(K)}$ is given by

$$ \mathbf{p}_{k,(K)} := -\mathbf{H}_{k,(K)}^{-1}\,\nabla_{t\mathbf{y}} F(\mathbf{y}_k;t_k). \tag{21} $$

Although the computation of the approximate prediction direction $\mathbf{p}_{k,(K)}$ requires information from $K$-hop neighbors, we establish that it can be computed in a decentralized manner via $K$ rounds of communication among neighboring nodes.

###### Proposition 1

Consider the prediction step (20) and the approximate prediction direction $\mathbf{p}_{k,(K)}$ in (21). Define $\mathbf{p}_{k,(\tau)}^i$ and $\nabla_{t\mathbf{y}} F^i(\mathbf{y}_k;t_k)$ as the $i$-th sub-vectors of the vectors $\mathbf{p}_{k,(\tau)}$ and $\nabla_{t\mathbf{y}} F(\mathbf{y}_k;t_k)$ associated with node $i$. Consider $\mathbf{D}_k^{ii}$ and $\mathbf{B}_k^{ij}$ as the $(i,i)$-th and $(i,j)$-th blocks of the matrices $\mathbf{D}_k$ and $\mathbf{B}_k$ in (16a)-(16b). If node $i$ computes for $\tau = 0, \dots, K-1$ the recursion

$$ \mathbf{p}_{k,(\tau+1)}^i = \big(\mathbf{D}_k^{ii}\big)^{-1}\Big(\sum_{j\in\mathcal{N}_i} \mathbf{B}_k^{ij}\,\mathbf{p}_{k,(\tau)}^j - \nabla_{t\mathbf{y}} F^i(\mathbf{y}_k;t_k)\Big), \tag{22} $$

with initial condition $\mathbf{p}_{k,(0)}^i = -(\mathbf{D}_k^{ii})^{-1}\nabla_{t\mathbf{y}} F^i(\mathbf{y}_k;t_k)$, then the result after $K$ steps is the approximate prediction direction $\mathbf{p}_{k,(K)}^i$.

**Proof:** By direct computation; see [24], Section III for a comparable derivation. $\blacksquare$

The recursion in (22) allows for the computation of the $K$-th order approximate prediction direction $\mathbf{p}_{k,(K)}$ by $K$ rounds of exchanging information with neighboring nodes. The $i$-th sub-vector of the mixed partial gradient $\nabla_{t\mathbf{y}} F(\mathbf{y}_k;t_k)$ associated with node $i$ is given by

$$ \nabla_{t\mathbf{y}} F^i(\mathbf{y}_k;t_k) = \nabla_{t\mathbf{y}_i} f_i(\mathbf{y}_k^i;t_k) + \nabla_{t\mathbf{y}_i} g_{i,i}(\mathbf{y}_k^i;t_k) + \sum_{j\in\mathcal{N}_i} \nabla_{t\mathbf{y}_i} g_{i,j}(\mathbf{y}_k^i,\mathbf{y}_k^j;t_k). \tag{23} $$

Node $i$ can compute $\nabla_{t\mathbf{y}} F^i(\mathbf{y}_k;t_k)$ by having access to the decision variables $\mathbf{y}_k^j$ of its neighbors $j \in \mathcal{N}_i$. In addition, according to the definition of the block diagonal matrix $\mathbf{D}_k$ in (16a), its $i$-th diagonal block can be written as

$$ \mathbf{D}_k^{ii} := \nabla_{\mathbf{y}_i\mathbf{y}_i} f_i(\mathbf{y}_k^i;t_k) + \nabla_{\mathbf{y}_i\mathbf{y}_i} g_{i,i}(\mathbf{y}_k^i;t_k) + \sum_{j\in\mathcal{N}_i} \nabla_{\mathbf{y}_i\mathbf{y}_i} g_{i,j}(\mathbf{y}_k^i,\mathbf{y}_k^j;t_k), \tag{24} $$

which is available at node $i$ after receiving the neighboring variables $\mathbf{y}_k^j$. These observations imply that the initial prediction direction $\mathbf{p}_{k,(0)}^i$ can be computed locally at node $i$. Further, the blocks of the neighbor sparse matrix $\mathbf{B}_k$ are given by

$$ \mathbf{B}_k^{ij} := -\nabla_{\mathbf{y}_i\mathbf{y}_j} g_{i,j}(\mathbf{y}_k^i,\mathbf{y}_k^j;t_k) \quad \text{for } j\in\mathcal{N}_i, \tag{25} $$

which are available at node $i$. Therefore, node $i$ can compute the recursion in (22) by having access to the $\tau$-th level approximate prediction directions $\mathbf{p}_{k,(\tau)}^j$ of its neighbors $j \in \mathcal{N}_i$. After the $K$ rounds of communication with neighboring nodes, to predict the local variable at step $k+1$, node $i$ executes the local update

$$ \mathbf{y}_{k+1|k}^i = \mathbf{y}_k^i + h\,\mathbf{p}_{k,(K)}^i. \tag{26} $$

Thus, the prediction step in (20) yields a decentralized protocol summarized in Algorithm 1.
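To make the local computation concrete, the following Python sketch runs $K$ rounds of a neighbor-only fixed-point recursion of the form of (22) on an illustrative 3-node path graph (scalar blocks, synthetic data) and checks that the result coincides with the centralized direction $-\mathbf{H}_{k,(K)}^{-1}\nabla_{t\mathbf{y}}F$ built from the truncated series (19):

```python
import numpy as np

# Illustrative 3-node path graph with scalar blocks: node i holds p_i and
# communicates only with its neighbors (the sparsity pattern of B).
D = np.diag([3.0, 3.0, 3.0])               # local blocks D_k^{ii}
B = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])            # neighbor blocks B_k^{ij}
b = np.array([0.3, -0.2, 0.1])             # stands in for ∇_ty F(y_k; t_k)

K = 4
p = -np.linalg.solve(D, b)                 # local initialization p_(0)
for _ in range(K):
    # One communication round: node i only needs p_j from neighbors j,
    # since the product B @ p touches nothing outside its neighborhood.
    p = np.linalg.solve(D, B @ p - b)

# Centralized reference: -H^{-1}_{(K)} b with the truncated series (19).
Dms = np.diag(1.0 / np.sqrt(np.diag(D)))
X = Dms @ B @ Dms
H_inv_K = Dms @ sum(np.linalg.matrix_power(X, t) for t in range(K + 1)) @ Dms
gap = np.linalg.norm(p + H_inv_K @ b)      # should be at machine precision
```

The design choice here is the classic trade: each additional round of the recursion buys one more term of the series (one more hop of curvature information) at the price of one more message exchange per node.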

### III-B Time derivative approximation

In practical settings, knowledge of how the function $F$ changes in time is unavailable. This issue may be mitigated by estimating the term $\nabla_{t\mathbf{y}} F(\mathbf{y}_k;t_k)$ via a first-order backward derivative: let $\tilde{\nabla}_{t\mathbf{y}} F_k$ be an approximate version of $\nabla_{t\mathbf{y}} F(\mathbf{y}_k;t_k)$, computed as the first-order backward derivative

$$ \tilde{\nabla}_{t\mathbf{y}} F_k = \big(\nabla_{\mathbf{y}} F(\mathbf{y}_k;t_k) - \nabla_{\mathbf{y}} F(\mathbf{y}_k;t_{k-1})\big)/h. \tag{27} $$

The approximation (27) requires only information from the previous discrete time slot. Using (27), we can approximate the prediction direction as

$$ \tilde{\mathbf{p}}_{k,(K)} := -\mathbf{H}_{k,(K)}^{-1}\,\tilde{\nabla}_{t\mathbf{y}} F_k. \tag{28} $$

This direction may be obtained in a decentralized way via $K$ rounds of communication among neighboring nodes, which may be established as a trivial extension of Proposition 1. Algorithm 1 may be modified to instead make use of the decentralized approximate prediction step in (28), as done in Algorithm 2. Once we obtain this local prediction of the optimizer at the next time $t_{k+1}$, using information at the current time $t_k$, the problem (3) is sampled at time $t_{k+1}$. We make use of this new information in the correction step, as discussed next.
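The quality of the backward-difference estimate (27) is easy to probe numerically. In the Python sketch below, the gradient of the illustrative objective $F(\mathbf{y};t) = \frac{1}{2}\|\mathbf{y}-\mathbf{r}(t)\|^2$ has the mixed derivative $-\mathbf{r}'(t)$ in closed form, so the first-order $O(h)$ error of the estimate is visible directly:

```python
import numpy as np

# Illustrative objective F(y; t) = 0.5 * ||y - r(t)||^2, for which
# ∇_y F(y; t) = y - r(t) and the exact mixed derivative is -r'(t).
def r(t):
    return np.array([np.cos(t), np.sin(t)])

def grad(y, t):
    return y - r(t)

y, tk = np.array([0.3, -0.1]), 1.0
exact = -np.array([-np.sin(tk), np.cos(tk)])      # exact -r'(t_k)

errs = []
for h in (0.1, 0.01):
    approx = (grad(y, tk) - grad(y, tk - h)) / h  # backward difference (27)
    errs.append(np.linalg.norm(approx - exact))
# Shrinking h by 10x shrinks the error by roughly 10x: the estimate is O(h).
```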

### III-C Decentralized correction step

The predicted variable $\mathbf{y}_{k+1|k}$ [cf. (20)] is then corrected via (10) by making use of the objective sampled at time $t_{k+1}$. Different correction strategies give rise to different correction updates, whose relative merits depend on the application domain at hand. We present two distinct correction steps next.

Gradient correction step: After the objective at time $t_{k+1}$ is observed, we may execute the correction step (10) with $\mathbf{c}_{k+1} = -\nabla_{\mathbf{y}} F(\mathbf{y}_{k+1|k};t_{k+1})$, resulting in

$$ \mathbf{y}_{k+1} = \mathbf{y}_{k+1|k} - \gamma\,\nabla_{\mathbf{y}} F(\mathbf{y}_{k+1|k};t_{k+1}), \tag{29} $$

which is a gradient correction step. This step is computable in a decentralized fashion since the local component of the gradient at node $i$ is given by

$$ \nabla_{\mathbf{y}} F^i(\mathbf{y}_{k+1|k};t_{k+1}) = \nabla_{\mathbf{y}_i} f_i(\mathbf{y}_{k+1|k}^i;t_{k+1}) + \nabla_{\mathbf{y}_i} g_{i,i}(\mathbf{y}_{k+1|k}^i;t_{k+1}) + \sum_{j\in\mathcal{N}_i} \nabla_{\mathbf{y}_i} g_{i,j}(\mathbf{y}_{k+1|k}^i,\mathbf{y}_{k+1|k}^j;t_{k+1}). \tag{30} $$
To implement the expression in (30), node $i$ only requires access to the predicted variables $\mathbf{y}_{k+1|k}^j$ of its neighbors $j \in \mathcal{N}_i$. Thus, if nodes exchange their predicted variables with their neighbors, they can compute the local correction direction $\mathbf{c}_{k+1}^i = -\nabla_{\mathbf{y}} F^i(\mathbf{y}_{k+1|k};t_{k+1})$ as in (30) and update their predicted variable as

$$ \mathbf{y}_{k+1}^i = \mathbf{y}_{k+1|k}^i + \gamma\,\mathbf{c}_{k+1}^i. \tag{31} $$

We call DPC-G the Decentralized Prediction-Correction method that uses gradient descent in the correction step (Algorithm 3) and the exact prediction step (Algorithm 1). We call DAPC-G the Decentralized Approximate Prediction-Correction method that uses gradient descent in the correction step (Algorithm 3) and the approximate prediction step (Algorithm 2). Both DPC-G and DAPC-G require a number of communication rounds proportional to $K$ among neighboring nodes per time step.
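Putting the pieces together, here is a compact DPC-G-style loop on an illustrative 3-node problem (quadratic local costs plus Laplacian coupling; all constants are arbitrary choices for this sketch, not taken from the paper). The prediction direction uses $K$ rounds of the neighbor recursion and the correction is a single gradient step:

```python
import numpy as np

# Illustrative DPC-G-style loop: 3 scalar nodes on a path graph minimize
# F(y;t) = sum_i 0.5*(y_i - a_i(t))^2 + (beta/2) * sum_{(i,j)} (y_i - y_j)^2.
n, beta, h, gamma, K = 3, 0.5, 0.1, 0.5, 2
edges = [(0, 1), (1, 2)]

L = np.zeros((n, n))                       # graph Laplacian of the edges
for i, j in edges:
    L[i, i] += 1; L[j, j] += 1; L[i, j] -= 1; L[j, i] -= 1

Hmat = np.eye(n) + beta * L                # constant Hessian ∇_yy F
D = np.diag(np.diag(Hmat))                 # splitting as in (15)-(16)
B = D - Hmat

def a(t):                                  # drifting local targets
    return np.sin(np.array([t, t + 1.0, t + 2.0]))

def grad(y, t):                            # ∇_y F(y; t)
    return (y - a(t)) + beta * (L @ y)

def mixed(t):                              # ∇_ty F(y; t) = -a'(t)
    return -np.cos(np.array([t, t + 1.0, t + 2.0]))

y = np.zeros(n)
for k in range(300):
    tk = k * h
    # Prediction: K neighbor-only rounds approximating -H^{-1} ∇_ty F.
    p = -np.linalg.solve(D, mixed(tk))
    for _ in range(K):
        p = np.linalg.solve(D, B @ p - mixed(tk))
    y_pred = y + h * p
    # Correction: one decentralized gradient step at the new sample time.
    y = y_pred - gamma * grad(y_pred, tk + h)

y_star = np.linalg.solve(Hmat, a(300 * h))  # optimizer of sampled problem
track_err = np.linalg.norm(y - y_star)      # small bounded tracking error
```

The iterate does not converge to a single point, since the optimizer keeps moving; instead, as in (8), it hovers within a small bounded distance of the sampled optimal trajectory.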

Newton correction step: The correction step in (10) could also be a Newton step, obtained by setting $\mathbf{c}_{k+1} = -[\nabla_{\mathbf{y}\mathbf{y}} F(\mathbf{y}_{k+1|k};t_{k+1})]^{-1}\nabla_{\mathbf{y}} F(\mathbf{y}_{k+1|k};t_{k+1})$. However, as in the discussion regarding the prediction step, computation of the Hessian inverse requires global communication. Consequently, we approximate the Hessian inverse by truncating its Taylor series as in (19). To be more precise, we define $\mathbf{H}_{k+1|k,(K')}^{-1}$ as the $K'$-th level approximation of the Hessian inverse,

$$ \mathbf{H}_{k+1|k,(K')}^{-1} = \mathbf{D}_{k+1|k}^{-1/2} \sum_{\tau=0}^{K'} \big(\mathbf{D}_{k+1|k}^{-1/2}\,\mathbf{B}_{k+1|k}\,\mathbf{D}_{k+1|k}^{-1/2}\big)^{\tau}\,\mathbf{D}_{k+1|k}^{-1/2}, \tag{32} $$

where the matrices $\mathbf{D}_{k+1|k}$ and $\mathbf{B}_{k+1|k}$ are defined as

$$ \mathbf{D}_{k+1|k} := \nabla_{\mathbf{y}\mathbf{y}} f(\mathbf{y}_{k+1|k};t_{k+1}) + \operatorname{diag}\big[\nabla_{\mathbf{y}\mathbf{y}} g(\mathbf{y}_{k+1|k};t_{k+1})\big], \tag{33a} $$
$$ \mathbf{B}_{k+1|k} := \operatorname{diag}\big[\nabla_{\mathbf{y}\mathbf{y}} g(\mathbf{y}_{k+1|k};t_{k+1})\big] - \nabla_{\mathbf{y}\mathbf{y}} g(\mathbf{y}_{k+1|k};t_{k+1}). \tag{33b} $$

Notice that the only difference between the decomposition matrices $\mathbf{D}_{k+1|k}$ and $\mathbf{B}_{k+1|k}$ for the correction step and the matrices $\mathbf{D}_k$ and $\mathbf{B}_k$ for the prediction step lies in their arguments. The prediction matrices $\mathbf{D}_k$ and $\mathbf{B}_k$ are evaluated at the function $F(\cdot;t_k)$ and the variable $\mathbf{y}_k$, while the correction matrices are evaluated at the function $F(\cdot;t_{k+1})$ and the variable $\mathbf{y}_{k+1|k}$.

Thus, we can approximate the exact Hessian inverse with $\mathbf{H}_{k+1|k,(K')}^{-1}$ as in (32) and apply the correction step as

$$ \mathbf{y}_{k+1} = \mathbf{y}_{k+1|k} - \gamma\,\mathbf{H}_{k+1|k,(K')}^{-1}\,\nabla_{\mathbf{y}} F(\mathbf{y}_{k+1|k};t_{k+1}), \tag{34} $$

which requires $K'$ rounds of information exchange among neighboring nodes. In practice, one can use the same algorithm as for the prediction direction to compute the correction direction, where now the gradient $\nabla_{\mathbf{y}} F(\mathbf{y}_{k+1|k};t_{k+1})$ takes the place of the time derivative.
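The remark above can be made concrete. The sketch below (illustrative synthetic data) computes the approximate Newton correction direction by the same $\mathbf{D}$/$\mathbf{B}$ recursion used for prediction, with the gradient in place of the mixed derivative, and then applies the damped step of the form (34):

```python
import numpy as np

# Illustrative data at the predicted point (y_{k+1|k}; t_{k+1}).
Hmat = np.array([[ 3.0, -1.0,  0.0],
                 [-1.0,  3.0, -1.0],
                 [ 0.0, -1.0,  3.0]])      # ∇_yy F, split as D - B
D = np.diag(np.diag(Hmat))
B = D - Hmat
g = np.array([0.5, -0.4, 0.2])             # ∇_y F(y_{k+1|k}; t_{k+1})

Kp = 3                                      # K' communication rounds
c = -np.linalg.solve(D, g)                  # initialization c_(0) = -D^{-1} g
for _ in range(Kp):
    # Same neighbor-only recursion as in the prediction step, with the
    # gradient g replacing the mixed time derivative.
    c = np.linalg.solve(D, B @ c - g)

gamma = 0.8                                 # damped Newton step-size
y_pred = np.zeros(3)                        # stands in for y_{k+1|k}
y_next = y_pred + gamma * c                 # correction step, cf. (34)
```

Reusing the prediction machinery this way means a single message-passing primitive serves both phases of the algorithm, with only the right-hand-side vector changing between them.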