Networked Parallel Algorithms for Robust Convex Optimization via the Scenario Approach

# Networked Parallel Algorithms for Robust Convex Optimization via the Scenario Approach

Keyou You and Roberto Tempo This work was supported by the National Natural Science Foundation of China (61304038), Tsinghua University Initiative Scientific Research Program, and CNR International Joint Lab COOPS.Keyou You is with the Department of Automation and TNList, Tsinghua University, 100084, China (email: youky@tsinghua.edu.cn).Roberto Tempo is with CNR-IEIIT, Politecnico di Torino, Torino, 10129, Italy (email: roberto.tempo@polito.it).
###### Abstract

This paper proposes a parallel computing framework to distributedly solve robust convex optimization (RCO) when the constraints are affected by nonlinear uncertainty. To this end, we adopt a scenario approach by randomly sampling the uncertainty set. To facilitate the computational task, instead of using a single centralized processor to obtain a “global solution” of the scenario problem (SP), we resort to multiple parallel processors that are distributed among different nodes of a network. Then, we propose a primal-dual sub-gradient algorithm and a random projection algorithm to distributedly solve the SP over undirected and directed graphs, respectively. Both algorithms are given in an explicit recursive form with simple iterations, which are especially suited for processors with limited computational capability. We show that, if the underlying graph is strongly connected, each node asymptotically computes a common optimal solution to the SP with a convergence rate where is a sequence of appropriately decreasing stepsizes. That is, the RCO is effectively solved in a distributed parallel way. The relations with the existing literature on robust convex programs are thoroughly discussed and an example of robust system identification is included to validate the effectiveness of our networked parallel algorithms.

Robust convex optimization, uncertainty, scenario approach, primal-dual algorithm, random projection algorithm.

## I Introduction

A robust convex optimization (RCO) is a convex optimization problem where an infinite number of constraints are parameterized by uncertainties. This problem has found wide applications in control analysis and synthesis of complex systems, as well as in other areas of engineering. As the dependence of the constraints on the uncertainties may be nonlinear, RCO is generally not easily solvable. In fact, the study of RCO bears a vast body of literature, see e.g. [1, 2, 3] and references therein.

In this paper, we adopt a scenario approach, which was first introduced in [4, 5] to solve RCO. In particular, we randomly sample the uncertainty set and obtain a standard convex optimization called the scenario problem (SP). The guarantees of optimality are then given in a probabilistic sense and an explicit bound on the probability that the original constraints are violated is provided. The striking feature of this approach is that the sample complexity, which guarantees that a solution to the SP is optimal with a given level of confidence, can be computed a priori. We also refer to [6, 7] for general properties and specific randomized algorithms to cope with uncertainty in systems and control.

To facilitate the computational task, instead of using a single processor to solve the SP, in this paper we propose a novel parallel computing framework with many interconnected processors. The challenging problem is to distribute the computational task among the nodes of a network, each representing a single processor. The idea is to break a (possibly) large number of constraints of the SP into many small sets of local constraints that can be easily handled in each node. That is, each node is able to compute some common optimal solution of the SP with a low computational cost. Under local interactions between nodes, the SP is then collaboratively solved in every node via the following three key steps.

First, every node randomly samples the uncertainty set of RCO, with the sample size inversely proportional to the total number of nodes or being a priori determined by its computational capability. Although this idea has been adopted in [8] to solve the SP, our approach is substantially different. In particular, after sampling, each node in [8] requires to completely solve a local SP at each iteration and exchange the set of active constraints with its neighbors. The process continues until a consensus on the set of active constraints is reached. Finally, every node solves its local SP under all active constraints of the SP. Clearly, the number of constraints in every local SP increases with the number of iterations. In some extreme cases, each constraint in the SP can be active, and every node eventually solves a local SP that has the same number of constraints as the SP. Thus, the computational cost in each node is not necessarily reduced. Moreover, each node cannot guarantee to obtain the same optimal solution to the SP. Since an active constraint may become inactive in any future iteration, identifying the active constraints cannot be recursively computed, and this computation is very sensitive to numerical errors. On the contrary, each node in this paper only needs to handle a fixed number of local constraints and recursively run an explicit algorithm with very simple structure.

Second, the SP is reformulated as a distributed optimization problem with many decoupled small sets of local constraints and a coupled constraint, which is specially designed based on the network structure. If the number of nodes is large, each node only needs to deal with a very small number of local constraints. The information is then distributed across the network via the coupled constraint in a suitably designed way, so that it can be locally handled. We recall that a similar technique has been already adopted to solve distributed optimization problems, see e.g. [9, 10]. It is interesting to note that these papers are only focused on convex optimization problems and no robustness issues are addressed. On the other hand, robust optimization has also attracted significant attention in many research areas [11, 12], but the proposed approaches are fully centralized. In this paper, we address both distributed and robust optimization problems simultaneously.

Third, each node of the network keeps updating a local copy of an optimal solution by individually handling its local constraints and interacting with its neighbors to address the coupled constraint. If the graph is strongly connected, every pair of nodes can indirectly access information from each other. To this purpose, we develop two recursive parallel algorithms for each node to interact with the neighbors to solve the SP by fully utilizing the constraint functions under undirected and directed graphs, respectively. For both algorithms, the computational cost per iteration only involves a few additions and multiplications of vectors, in addition to the computation of the sub-gradients of parameterized constraint functions. That is, the overall computational cost is very small in each node, and therefore the approach is particularly useful for solving a large-size optimization problem with many solvers of reduced power.

For undirected graphs, where the information flow between the nodes is bidirectional, we solve the distributed optimization problem by using an augmented Lagrangian function with a quadratic penalty [13]. Following this approach, a networked primal-dual sub-gradient algorithm is designed to find a saddle point. In this case, both the decoupled and coupled constraints are handled by introducing Lagrange multipliers, which provide a natural approach from the optimization viewpoint. For the coupled constraint, each node also needs to broadcast its estimate of an optimal solution to the SP, and the modified Lagrange multipliers to the neighbors, after which it recursively updates them by jointly using sub-gradients of local constraint functions. We show that each node finally converges to some common optimal solution to the SP. We remark that most of the existing work on distributed optimization [14, 15, 10] uses the Euclidean projection to handle local constraints. The projection is easy to obtain only if the projection set has a special structure, which is generally not the case of the SP. Therefore, our algorithm is more attractive to solve the SP problem in the context of networked parallel algorithms.

For directed graphs, the information flow between nodes is unidirectional and the primal-dual algorithm for undirected graphs cannot be used. To overcome this issue, we address the coupled constraint by adopting a consensus algorithm and design a novel two-stage recursive algorithm. At the first stage, we solve an unconstrained optimization problem which removes the decoupled local constraints in the reformulated distributed optimization and obtain an intermediate state vector in each node. We notice that, in the classical literature [16, 17, 18, 15], the assumption on balanced graphs is often made. In our paper, this restrictive assumption is removed and this step is non-trivial, see e.g. [19, 20]. At the second stage, each node individually addresses its decoupled local constraints by adopting a generalization of Polyak random algorithm [21], which moves its intermediate state vector toward a randomly selected local constraint set. Combining these two stages, and under some very mild conditions, both consensus and feasibility of the iteration in each node are achieved almost surely. Although this parallel algorithm is completely different from the primal-dual sub-gradient algorithm previously described, both algorithms essentially converge at a speed where is a sequence of appropriately decreasing stepsizes.

The rest of this paper is organized as follows. In Section II, we formulate RCO and include three motivating examples, after which the probabilistic approach to RCO is introduced. In Section III, we describe a parallel computing framework for distributedly solving the SP. In Section IV, a networked parallel algorithm is proposed via the primal-dual sub-gradient method for undirected graphs and show its convergence. In Section V, we design a distributed random projected algorithm over directed graphs to distributedly solve RCO. In Section VI, we show how to extend these results to stochastically time-varying graphs. An example focused on robust system identification is included in Section VII. Some brief concluding remarks are drawn in Section VIII.

A preliminary version of this work appeared in [22], which only addresses undirected graphs with a substantially different approach. In this paper, we provide significant extensions to directed graphs using randomized algorithms, establish their convergence properties, include the complete proofs and provide new simulation results for robust system identification.

Notation: The sub-gradient of a vector function whose components are convex functions with respect to an input vector is denoted by . For two non-negative sequences and , if there exists a positive constant such that , we write . For two vectors and , the notation means that is greater than for any . A similar notation is used for , and . The symbol denotes the vector with all entries equal to one. Given a pair of real matrices of suitable dimensions, indicates their Kronecker product. Finally, is the positive part of , is the trace of a matrix and denotes Euclidean norm.

## Ii Robust Convex Optimization and Scenario Approach

### Ii-a Robust Convex Optimization

Consider a robust convex optimization (RCO) of the form

 minθ∈Θ c′θ  subject to f(θ,q)≤0,∀q∈Q, (1)

where is a convex and closed set with non-empty interior, and the scalar-valued function is convex in the decision vector for any . The uncertainty enters into the constraint function without assuming any structure, except for the Borel measurability [23] of for any fixed . In particular, may be affected by parametric (possibly nonlinear) and nonparametric uncertainty.

For simplicity, we assume that the objective function is linear in . However, the linearity is not essential and the results of the paper still hold for any convex function by a simple relaxation. Specifically, consider a convex objective function and introduce an auxiliary variable . Then, the optimization in (1) is equivalent to

 minθ∈Θ,t∈Rt  subject to f0(θ)−t≤0 and f(θ,q)≤0,∀q∈Q.

Obviously, the above objective function becomes linear in the augmented decision variable and is of the same form as (1). That is, there is no loss of generality to focus on a linear objective function.

### Ii-B Motivating Examples

The robust convex optimization in (1) is crucial in many areas of research, see e.g. [11, 3] and references therein for more comprehensive examples. Here we present three applications for illustration.

###### Example 1 (Distributed robust optimization).

Consider the following distributed robust optimization problem

 minθ∈Θ m∑j=1fj(θ,qj), (2)

where is only known to node and represents the uncertainty in node and its bounding set. Moreover, is convex in for any and is Borel measurable in for any fixed .

From the worst-case point of view, we are interested in solving the following optimization problem

 minθ∈Θ m∑j=1(maxqj∈Qjfj(θ,qj)). (3)

However, the uncertainty generically enters the objective function in (2) without any specific structure, so that the objective function cannot be explicitly found. To solve (3), we note that it is equivalent to the following optimization problem

 minθ∈Θ m∑j=1tj subject to maxqj∈Qjfj(θ,qj)−tj≤0,∀j∈V. (4)

Let where and and . Then, the optimization in (4) is equivalent to

 minθ∈Θ m∑j=1tj subject to f(t,θ,q)⪯0,∀q∈Q. (5)

Clearly, (5) is RCO of the form in (1), except that is only known to node . However, this is not a critical issue as discussed in Example 4 in Section III-B.

###### Example 2 (Lasso).

Consider the least squares (LS) problem

 minv∥b−Xv∥,

where is the regression matrix and is the measurement vector. It is well-known that the LS solution has poor numerical properties when the regression matrix is ill-conditioned. A standard approach for addressing this issue is to introduce regularization technique, which results in the following LASSO problem

 minv{∥b−Xv∥+n∑i=1ci|vi|},

where quantifies the robustness of the solution with respect to the -th column of . By [24], the LASSO is in fact equivalent to a robust LS problem

 minvmaxq∈Q∥b−(X+q)v∥ (6)

with the following uncertainty set

 Q={[q1,…,qn]|∥qj∥≤cj,j=1,…,n}.

From (6), the LASSO is inherently robust to the uncertainty in the regression matrix , and the weight factor quantifies its robustness property. Note that the optimization in (6) can be reformulated as RCO in (1).

###### Example 3 (Distribution-free robust optimization).

Consider a distribution-free robust optimization under moment constraints

 minθ∈Θmaxq∈PE[f(θ,q)] (7)

where is a utility convex function in the decision variable for any given realization of the random vector , and the expectation is taken with respect to . Moreover, is a collection of random vectors with the same support, first- and second-moments

In light of [25] and the duality theory [26], the optimization problem (7) is equivalent to RCO

 minθ,α,β,Ω{α+μ′β+Tr(Ω′Σ)} subject to θ∈Θ,α+q′β+q′Ωq≥f(θ,q),∀q∈Q,.

Clearly, the optimization (7) is reformulated as RCO of the same form as (1).

Although the stochastic programming (7) is a convex optimization problem, one must often resort to Monte Carlo sampling to solve it, which is computationally challenging, as it may also need to find an appropriate sampling distribution. Unless has a special structure, it is very difficult to obtain such a distribution [27]. In the next section, we show how RCO can be effectively solved via a scenario approach.

### Ii-C Scenario Approach for RCO

The design constraint for all possible is crucial in the study of robustness of complex systems, e.g. performance of a system affected by the parametric uncertainty and the design of uncertain model predictive control [28]. However, obtaining worst-case solutions has been proved to be computationally difficult, even NP-hard as the uncertainty may enter into in a nonlinear manner. In fact, it is generally very difficult to explicitly characterize the constraint set with uncertainty, i.e.,

 {θ|f(θ,q)≤0,∀q∈Q}, (8)

which renders it impossible to directly solve RCO. There are only few cases when the uncertainty set is tractable [11]. Furthermore, this approach also introduces undesirable conservatism. For these reasons, we adopt the scenario approach to solve RCO.

Instead of satisfying the hard constraint in (8), the idea of this approach is to derive a probabilistic approximation by means of a finite number of random constraints, i.e,

 {θ|f(θ,q(i))≤0,i=1,…,Nbin} (9)

where is a positive integer representing the constraint size, and are independent identically distributed (i.i.d.) samples extracted according to an arbitrary absolutely continuous (with respect to the Lebesgue measure) distribution over .

Regarding the constraint in (9), we only guarantee that most, albeit not all, possible uncertainty constraints in RCO are not violated. Due to the randomness of , the set of constraint in (9) may be very close to its counterpart (8) in the sense of obtaining a small violation probability, which is now formally defined.

###### Definition 1 (Violation probability).

Given a decision vector , the violation probability is defined as

 V(θ):=Pq{q∈Q|f(θ,q)>0}.

The multi-sample is called a scenario and the resulting optimization problem under the constraint (9) is referred to as a scenario problem (SP)

 minθ∈Θc′θ  subject to f(θ,q(i))≤0,i=1,…,Nbin. (10)

In the sequel, let be the set of optimal solutions to the SP and be the set of feasible solutions, i.e.,

 Θ0={θ∈Θ|f(θ,q(i))≤0,i=1,…,Nbin}. (11)

For the SP, we need the following probabilistic assumption to study its relationship with RCO in (1).

###### Assumption 1 (Optimal solutions and interior point).

The SP in (10) is feasible for any multisample extraction and has a non-empty set of optimal solutions. In addition, there exists a vector such that

 f(θ0,q(i))<0,∀i=1,…,Nbin. (12)

The interiority condition (often called Slater’s constraint qualification) in (12) implies that there is no duality gap between the primal and dual problems of (10) and the dual problem contains at least an optimal solution [13]. We remark that in robust control it is common to study strict inequalities [28], e.g., when dealing with robust asymptotic stability of a system and therefore this is not a serious restriction. In fact, the set of feasible solutions to (1) is a subset of that of the SP in (10). The above feasibility assumption can also be relaxed in the analysis of the SP by using the approach introduced in [29]. The main result of the scenario approach for RCO is stated below.

###### Lemma 1 ([30]).

Assume that there exists a unique solution to (10). Let , , and satisfy the following inequality

 n−1∑i=0(Nbini)ϵi(1−ϵ)Nbin−i≤δ. (13)

Then, with probability at least , the solution of the scenario optimization problem (10) satisfies , i.e.,

 Pq{V(θsc)≤ϵ}≥1−δ.

Note that the uniqueness condition can be also relaxed in most cases by introducing a tie-breaking rule, see Section 4.1 of [4]. If the sample complexity satisfies (13), a solution to (10) approximately solves RCO in (1) with certain probabilistic guarantee. A subsequent problem is to compute the sample complexity, which dictates the smallest number of constraints required in the SP to solve (10). This problem has been addressed in [31] obtaining the following improved bound

 Nbin≥eϵ(e−1)(ln1δ+n−1) (14)

where is the Euler’s number. Thus, RCO in (1) can be approximately solved via the SP in (10) with a sufficiently large .

The remaining problem is to effectively solve the SP in (10), which is the main objective of this paper, in particular when is large.

## Iii Networked Parallel Scheme for Scenario Problems

In this section, we introduce a novel computational framework where many processors (nodes) with limited computational capability are interconnected via an algebraic graph. Then, we reformulate the SP in (10) as a distributed optimization problem, which assigns some local constraints to each node and adapts the coupled constraint to the graph structure.

### Iii-a Networked Computing Nodes

Although RCO in (1) can be attacked via the scenario approach, clearly may be large to achieve a high confidence level with small violation probability. For example, in a problem with variables, setting probability levels and , it follows from (14) that the number of constraints in the SP is . For such a large sample complexity , the computational cost for solving the SP (10) becomes very high, which may be far from the computational and memory capacity of a single processor.

In this section, we exploit the idea of solving large problems with many cheap solvers under an isotropic and universal design in the sense that the local computation of each solver is independent of its index and problem data. Specifically, we propose to use computing units (nodes) which cooperatively solve the SP in (10) in a parallel fashion. Then, the number of design constraints for node is reduced to . To maintain the desired confidence level and violation probability, it follows from (14) that

 m∑j=1nj≥Nbin.

A simple heuristic approach is to assign the number of constraints in (10) among nodes proportional to their computing and memory power. In practice, each node can declare the total number of constraints that can be handled. If the number of nodes is comparable to the scenario size , the number of constraints for every node is significantly reduced, e.g. , and can be even as small as one.

The problem is then how to parallelize the computational task across multiple nodes to cooperatively solve the SP. To this end, we introduce a directed graph to model interactions between the computing nodes where denotes the set of nodes, and the set of links between nodes is represented by . A directed edge exists in the graph if node directly receives information from node . Then, the in-neighbors and out-neighbors of node are respectively defined by and . Clearly, every node can directly receive information from its in-neighbors and broadcast information to its out-neighbors. A sequence of directed edges with for all is called a directed path from node to node . A graph is said to contain a spanning tree if it has a root node that is connected to any other node in the graph via a directed path, and is strongly connected if each node is connected to every other node in the graph via a directed path.

We say that is a row-stochastic weighting matrix adapted to the underlying graph , e.g., if and , otherwise, and for all . Moreover, we denote the associated Laplacian matrix of by . If is undirected, is a symmetric matrix and , which is simply denoted as .

Overall, the objective of this paper is to solve the following networked optimization problem.

###### Problem 1 (Networked parallel scheme).

Assume that is strongly connected. Then, each node computes a solution to the SP in (10) under the following setup:

1. Every node is able to independently generate i.i.d. samples with an absolutely continuous distribution , and is not allowed to share these samples with other nodes.

2. Every node is able to transmit finite dimensional data per packet via a directed/undirected edge.

3. The vector in the objective function, the constraint function and the set are accessible to every node.

In contrast with [8], in this networked parallel approach, we transmit a state vector with a fixed dimension among nodes. In addition, every node only needs to deal with a fixed number of constraints. In [8], each node requires to completely solve a local SP under an increasing number of constraints. We shall provide a more detailed comparison between our approach and [8] in Section IV-B.

### Iii-B Reformulation of the Scenario Problem

In this work, we propose recursive algorithms with small computation per iteration to distributedly solve the SP. This is particularly suited when several cheap processors cooperate. The main idea is to introduce “local copies” of in each node, and to optimize and update these variables by incrementally learning the constraints until a consensus is reached among all the neighboring nodes. The interactions between nodes are made to (indirectly) obtain the constraint set information from other nodes.

Let be the samples that are independently generated in node according to the distribution . For simplicity, the local constraint functions are collectively rewritten in a vector form

 fj(θ):=⎡⎢ ⎢ ⎢⎣f(θ,q(j1))⋮f(θ,q(jnj))⎤⎥ ⎥ ⎥⎦∈Rnj.

Then, we are interested in solving the constrained minimization problem

 minθ∈Θc′θ subject to fj(θ)⪯0,∀j∈V, (15)

where is only known to node .

###### Example 4 (Continuation of Example 1).

In (5), the -th component function of is only known to node . Then, node can independently extract random samples from and obtain the local inequality

 ~fj(θ,t):=⎡⎢ ⎢ ⎢ ⎢⎣fj(θ,q(1)j)−tj⋮fj(θ,q(nj)j)−tj⎤⎥ ⎥ ⎥ ⎥⎦⪯0, (16)

which is only known to node . Thus, the SP associated with the distributed robust optimization in (5) has the same form of (15) and can be solved as well.

Since each node may have very limited computational and memory capability, the algorithm for each node should be easy to implement with a low computational cost. To achieve this goal, we adopt two different approaches in the sequel for undirected and directed graphs, respectively. The first approach (for undirected graphs) exploits the simple structure of a primal-dual sub-gradient algorithm [13] which has an explicit recursive form. Moreover, the interpretation of this approach is natural from the viewpoint of optimization theory. It requires a bidirectional information flow between nodes and therefore it is not applicable to directed graphs. To overcome this limitation, the second approach (for directed graphs) revisits the idea of Polyak random algorithm for convex feasibility problem [32] in our networked computing framework. We remark that in [32] the algorithms are centralized and do not address distributed computation, which is resolved in this paper by exploiting the network structure.

Next, we show that the SP can be partially separated by adapting it to the computing network .

###### Lemma 2 (Optimization equivalence).

Assume that contains a spanning tree. Then, the optimal solution to the SP in (10) can be found via the following optimization problem

 minθ1,…,θm∈Θ m∑j=1c′θj subject to (18) m∑i=1aji(θj−θi)=0, fj(θj)⪯0,∀j∈V.
###### Proof.

By a slight abuse of notation, let be the augmented state of , i.e., , and , which is the associated Laplician matrix of the graph . Then, the constraint in (18) is compactly written as This is equivalent to as contains a spanning tree [33]. Thus, the above optimization problem is reduced to

 min{θ∈Θ|fj(θ)⪯0,∀j∈V}(N⋅c′θ)

whose set of optimal solutions is equivalent to that of (10).

A nice feature of Lemma 2 is that both the objective function and the constraint in (18) of each node are completely decoupled. The only coupled constraint lies in the consensus constraint in (18), which is required to align the state of each node, and can be handled by exploring the graph structure under local interactions. Since each node uses it to learn information from every other node, we need the following assumption.

###### Assumption 2 (Strong connectivity).

The graph is strongly connected.

As the constraint in (18) is only known to node , the strong connectivity assumption is clearly necessary. Otherwise, we may encouter a situation where a node can never be accessed by some other node . In this case, the information from node cannot be received by node . Then, it is impossible for node to find a solution to the SP (10) since the information on is missing to node .

## Iv Networked Primal-dual Sub-gradient Algorithms for Undirected Graphs

Recently, several papers concentrated on the distributed optimization problem of the form in Lemma 2, see e.g. [14, 15, 10, 34, 35, 36] and references therein. However, they mostly consider a generic local constraint set, i.e., the local constraint (18) is replaced by for some convex set , rather than having an explicit inequality form. Thus, the proposed algorithms require a projection onto the set at each iteration to ensure feasibility. This is easy to perform only if has a relatively simple structure, e.g., a half-space or a polyhedron. Unfortunately, the computation of the projection onto the set

 Θj={θ∈Rn|fj(θ)⪯0} (19)

is typically difficult and computational demanding. In this paper, we do not use the projection to handle the inequality constraints. Rather, we exploit the inequality functions by designing networked primal-dual algorithms for undirected graphs with the aid of an augmented Lagrangian function. The final result is to prove that the recursive algorithm in each node asymptotically converges to some common optimal solution of (10).

Since is closed and convex, the optimization problem in Lemma 2 is reformulated with equality constraints

 min m∑j=1c′θj subject to (Lj⊗In)θ=0, gj(θj)=0,∀j∈V

where is the -th row of the Laplacian matrix , and is a function only related to the local constraint of node , i.e.,

 gj(θj)=[d(θj,Θ)fj(θj)+].

The distance function measures the distance from the point to the set and is obviously convex in . Since is closed and convex, then if and only if .

Next, we design a primal-dual sub-gradient algorithm to find an optimal solution to (IV). In this section, for notational simplicity, with a slight abuse of notation, we use to denote the augmented state of .

### Iv-a Networked Primal-dual Sub-gradient Algorithm

To solve the optimization problem (IV), we add a quadratic penalty function

 hρ(θ)=ρ2m∑j=1∥(Lj⊗In)θ∥2+(gj(θj))2

to the objective without changing the optimal value or the set of optimal points, and is a given weighting parameter. Then, we focus on the so-called augmented Lagrangian

 L(θ,λ,γ)=m∑j=1Lj(θ,λj,γj) (21)

with the local augmented Lagrangian defined as

 Lj = c′θj+λ′j(Lj⊗In)θ+γ′jgj(θj) +ρ2(∥(Lj⊗In)θ∥2+(gj(θj))2)

where and are the Lagrange multipliers corresponding to (18) and (18), respectively. Then, our objective reduces to find a saddle point of the augmented Lagrangian in (21), i.e., for any , it holds that

 L(θ∗,λ,γ)≤L(θ∗,λ∗,γ∗)≤L(θ,λ∗,γ∗). (22)

The existence of a saddle point is ensured under Assumptions 1 and 2, as stated in the next result.

Under Assumptions 1 and 2, there exists a saddle point of the augmented Lagrangian in (21).

###### Proof.

Under Assumption 1, it follows from Propositions 5.1.6 and 5.3.1 in [13] that there exists a saddle point for the optimization (10). By the equivalence of the SP in (10) and the optimization problem in Lemma 2, the rest of proof follows immediately.

By the Saddle Point Theorem (see e.g. Proposition 5.1.6 in [13]), it is sufficient to find a saddle point of the form (22). In the section, we design a networked primal-dual sub-gradient method to achieve this goal.

If , then is convex in each argument, e.g. is convex for any fixed satisfying . Thus, the following set-valued mappings

 Tj(θ,λ,γ) = ∂θjL(θ,λ,γ), Pj(θ,λ,γ) = −∂(λj,γj)L(θ,λ,γ)

are well-defined where is the subdifferential of with respect to [13]. The optimality of a saddle point becomes

 0∈Tj(θ∗,λ∗,γ∗) and 0∈Pj(θ∗,λ∗,γ∗),

which can be solved via the following iteration

 θk+1j=θkj−αkj⋅Tkj and νk+1j=νkj−βkj⋅Pkj. (23)

Here it is sufficient to arbitrarily select and . The purpose of is to compute the Lagrange multipliers of . Moreover, the stepsizes are given by

 αkj=ζkmax{1,∥Tkj∥} and βkj=ζkmax{1,∥Pkj∥} (24)

where satisfies the following condition

 ζk>0,  ∞∑k=0ζk=∞, and ∞∑k=0(ζk)2<∞. (25)

Next, we show that the sub-gradient iteration in (23) can be distributedly computed via Algorithm 1. For notational simplicity, the dependence of the superscript , which denotes the number of iterations, is removed. In Algorithm 1, every node keeps updating a triple of state vector and Lagrange multipliers by receiving information only from its neighboring nodes .

###### Theorem 1 (Distributed implementation).

Suppose that is undirected. Then, the primal-dual sub-gradient algorithm in (23) can be distributedly implemented via Algorithm 1.

###### Proof.

We notice from (21) that is a pair of Lagrange multipliers that only appears in the local Lagrangian . This implies that

 Pkj = −∂(λj,γj)Lj(θk,γkj,λkj) = −⎡⎢ ⎢⎣m∑i=1aji(θkj−θki)gj(θkj)⎤⎥ ⎥⎦.

Clearly, in is computable in node by receiving information only from in-neighbors of node . As is a function of local variables, is accessible to node via only local interactions with its in-neighbors. By the additive property of the subdifferential [13], we further obtain from (21) that

 Tj(θk,λk,γk) = ∂θjL(θ,λ,γ) = c+m∑i=1lij(λki+ρ⋅(Li⊗In)θk) +(∂gj(θkj))′(γkj+ρ⋅gj(θkj)),

where is the -th element of the Laplacian matrix . As previously illustrated, the second term in the sum

 (Li⊗In)θk=m∑j=1aij(θki−θkj)

is locally computable in node . Together with the fact is undirected, both in-neighbors and out-neighbors of node are of the same. Thus, the second term in is obtained by aggregating the modified Lagrange multiplier from its out-neighbors. This further implies that node is able to compute via local interactions as well. We also note that

 θ−ΠΘ(θ)∥θ−ΠΘ(θ)∥

in Algorithm 1 is a sub-gradient of the distance function where is defined as the Euclidean projection onto .

### Iv-B Comparisons with the State-of-the-art

To solve the SP in (10), a distributed setup is proposed in [8] by exchanging the active constraints with neighbors. Specifically, each node solves a local SP of the form

 minθ∈Θc′θ  subject to f(θ,q(i))≤0,i∈Skj⊆{1,…,Nbin} (26)

at each iteration where is the set of indices associated with the random samples generated in node , and obtains local active constraints, indexed as . Here is an optimal solution to the local SP in (26), after which it broadcasts its active constraints indexed by to its out-neighbors. Subsequently, node updates its local constraint indices as

 Sk+1j=ActSkj∪(∪i∈NinjActSki)∪S0j (27)

and returns a local SP of the form (26) replacing by . In comparison, one can easily identify several key differences from Algorithm 1.

1. Using (26), we cannot guarantee to reduce the computation cost in each node. In particular, it follows from (27) that the number of constraints in each local SP in (26) increases with respect to the number of iterations, and eventually is greater than the total number of active constraints in the SP in (10). In an extreme case, the number of active constraints of (10) is equal to . From this point of view, the computation per iteration in each node is still very demanding. It should be noted that selecting the active constraints of an optimization problem is almost as difficult as solving the entire optimization problem.

In Algorithm 1, it is clear that the computation only requires a few additions and multiplications of vectors, in addition to finding a sub-gradient of a parameterized function with respect to . It should be noted that the computation of the sub-gradient of is unavoidable in almost any optimization algorithm. Clearly, the dimension of is equivalent to and . This implies that the computation cost in each node is greatly reduced as the number of nodes increases.

2. Deciding the active constraints in (26) is very sensitive to the optimal solution and numerically unstable. If is not an exact optimal solution or the evaluation of is not exact, we cannot correctly obtain the index set of active constraints. In Algorithm 1, there is no such a problem and the local update has certain robustness properties with respect to the round-off errors in computing e.g. and .

3. The size of data exchange between nodes in (26) may grow monotonically. Although a quantized index version of (26) is proposed for the channel with bounded communication bandwidth, it also needs to compute the vertices of a convex hull per iteration. More importantly, the dimension of the exchanged data per iteration is still larger than that in Algorithm 1.

4. Finally, the local SP of the form (26) in each node contains several overlapping constraints. Specifically, each constraint of the set could be handled more than once by every node. This certainly induces redundancy in computations. In Algorithm 1, each inequality is handled exclusively in only one node. From this perspective, Algorithm 1 saves a lot of computation resources, which is of great importance for a node with very limited computational and memory capability.

The primal-dual sub-gradient methods for distributed constrained optimization have been previously used, see e.g., [37]. However, the proposed algorithm originated from the normal Lagrangian (i.e., in (21)). As discussed in [37] after Theorem 1, this usually requires the strict convexity of the Lagrangian to ensure convergence of the primal-dual sequence, which clearly is not satisfied in our case. To remove this strong convexity condition, the authors propose a specially perturbed sub-gradient and assume boundedness on and . This increases the complexity of the distributed algorithm. In particular, it requires to run up to three consensus algorithms and projects the dual variable onto a bounded ball whose radius must be initially decided, and it is a global parameter. Obviously, Algorithm 1 has a much simpler structure by adopting an augmented Lagrangian in (21), which, to some extent, can be interpreted as the strict convexification of the Lagrangian function. Moreover, the convergence proof of Algorithm 1, which is given in the next subsection, is simpler and easier to understand.

Compared with the distributed alternating direction method of multipliers (ADMM) [38, 39, 40], the computation of Algorithm 1 is also simpler. For example, the ADMM essentially updates the primal sequence as follows

 θk+1∈argminθ∈ΘmLc(θ,λk,γk) (28)

where has a similar form to the augmented Lagrangian in (21). That is, it requires to solve an optimization (28) per iteration. In Algorithm 1, we only need to compute one inner iteration to update by moving along the sub-gradient direction.

### Iv-C Convergence of Algorithm 1

The update of Lagrange multipliers in Algorithm 1 has interesting interpretations. If does not satisfy the local constraint, i.e., , some element of the multiplier vector is strictly increased and a larger penalty is imposed on the augmented Lagrangian . This forces the update of to move toward the local feasible set , where is given in (19). If is bounded and the sequence is convergent, it follows that

 ∞∑k=1βkjgj(θkj)<∞

and . In light of (25), this implies that . Then, the sequence will eventually enter the local constraint set . Similarly, the multiplier will finally drive the state vector to reach a consensus in each node. Based on these two observations, it follows that finally becomes feasible.

The convergence of Algorithm 1 is now formally stated and then proved.

###### Theorem 2 (Convergence).

Suppose that Assumptions 1-2 hold. Then, the sequence of Algorithm 1 with stepsizes given in (24) converges to some common point in the set of the optimal solutions to (10).

###### Proof.

Let be an arbitrary saddle point in Lemma 3. Then, it follows from (23) that

 ∥θk+1j−θ∗j∥2=∥θkj−θ∗j∥2+(αkj)2∥Tkj∥2 −2αkj(θkj−θ∗j)′Tkj ≤∥θkj−θ∗j∥2+(ζk)2−2ζk(θkj−θ∗j)′Tkjmax{∥Tkj∥,1}.

Similarly, one can easily obtain

 ∥νk+1j−ν∗j∥2≤∥νkj−ν∗j∥2+(ζk)2−2ζk(νkj−ν∗j)′Pkjmax{1,∥Pkj∥}.

For notational simplicity, let

 zk=[θkνk],z∗=[θ∗ν∗], and wk=[TkPk].

Then, summing all leads to that

 ∥zk+1−z∗∥2 ≤ ∥zk−z∗∥2+O(ζk)2 (29) −2ζk(zk−z∗)′wk/∥wk∥.

The rest of the proof is completed by establishing the following two claims.

Claim 1: for all .

To show the non-negativeness, we write

 (zk−z∗)′wk=m∑j=1((c+m∑i=1lij~λki+s′j~γkj)′(θkj−θ∗j)) −(bkj)′(λkj−λ∗j)+(gj(θkj))′(γkj−γ∗j)), (30)

where is a modified Lagrange multiplier.

Noting that and , the sum in (30) is split into four sums. The first sum is the difference between two non-penalized Lagrangians, i.e.,

 m∑i=1(c′θkj+(λ∗j)′bkj+(γ∗j)′gj(θkj)−c′θ∗j).

The second sum involves the Lagrange multiplier , i.e.,

 m∑j=1(m∑i=1lij(λki)′(θkj−θ∗j)−(λkj)′bkj) =m∑i=1(λki)′m∑j=1lij(θkj−θ∗j)−m∑j=1(λkj)′bkj =m∑i=1(λki)′(bki−b∗i)−m∑j=1(λkj)′bkj =0

where we have used the fact that for all . The third sum involves the Lagrange multiplier , i.e.,

 m∑j=1(γkj)′(skj(θkj−θ∗j)−gj(θkj)) ≥m∑j=1(γkj