Distributed Pareto Optimization via Diffusion Strategies

Jianshu Chen and Ali H. Sayed. The authors are with the Department of Electrical Engineering, University of California, Los Angeles, CA 90095. Email: {jshchen, sayed}@ee.ucla.edu. This work was supported in part by NSF grants CCF-1011918 and CCF-0942936. A preliminary short version of this work is reported in the conference publication [1].
Abstract

We consider solving multi-objective optimization problems in a distributed manner by a network of cooperating and learning agents. The problem is equivalent to optimizing a global cost that is the sum of individual components. The optimizers of the individual components do not necessarily coincide and the network therefore needs to seek Pareto optimal solutions. We develop a distributed solution that relies on a general class of adaptive diffusion strategies. We show how the diffusion process can be represented as the cascade composition of three operators: two combination operators and a gradient descent operator. Using the Banach fixed-point theorem, we establish the existence of a unique fixed point for the composite cascade. We then study how close each agent converges towards this fixed point, and also examine how close the Pareto solution is to the fixed point. We perform a detailed mean-square error analysis and establish that all agents are able to converge to the same Pareto optimal solution within a sufficiently small mean-square-error (MSE) bound even for constant step-sizes. We illustrate one application of the theory to collaborative decision making in finance by a network of agents.

Keywords

Distributed optimization, network optimization, diffusion adaptation, Pareto optimality, mean-square performance, convergence, stability, fixed point, collaborative decision making.

I Introduction

We consider solving a multi-objective optimization problem in a distributed manner over a network of $N$ cooperative learners (see Fig. 1). Each agent $k$ is associated with an individual cost function $J_k(w)$, and these costs need not all be minimized at the same vector $w$. As such, we need to seek a solution that is “optimal” in some sense for the entire network. In such cases, a general concept of optimality known as Pareto optimality is useful to characterize how good a solution is. A solution is said to be Pareto optimal if there does not exist another vector $w$ that is able to improve (i.e., reduce) any particular cost, say, $J_k(w)$, without degrading (increasing) some of the other costs $\{J_l(w)\}_{l \neq k}$. To illustrate the idea of Pareto optimality, let

$$\mathcal{S} \;\triangleq\; \big\{ (J_1(w), J_2(w), \ldots, J_N(w)) \;:\; w \in \mathcal{W} \big\} \qquad (1)$$

denote the set of achievable cost values, where $\mathcal{W}$ denotes the feasible set. Each point in $\mathcal{S}$ represents the values attained by the cost functions at a certain $w \in \mathcal{W}$. Let us consider the two-node case ($N = 2$) shown in Fig. 2, where the shaded areas represent the set $\mathcal{S}$ for two situations of interest.

Fig. 1: A network of $N$ cooperating agents; a cost function $J_k(w)$ is associated with each node $k$. The set of neighbors of node $k$ is denoted by $\mathcal{N}_k$ (including node $k$ itself); this set consists of all nodes with which node $k$ can share information.

In the left plot of Fig. 2, both $J_1(w)$ and $J_2(w)$ achieve their minima at the same point, corresponding to a common minimizer $w^o$. In comparison, in the right plot of Fig. 2, $J_1(w)$ and $J_2(w)$ attain their minima at two different points, so that they do not have a common minimizer. Instead, all the points on the heavy red curve between these two points are Pareto optimal solutions. For example, starting at a point on the curve, if we want to reduce the value of one cost, say $J_1(w)$, without increasing the value of the other cost $J_2(w)$, then we will need to move out of the achievable set $\mathcal{S}$. The alternative choice that keeps us on the curve is to move to another Pareto optimal point, which would however increase the value of $J_2(w)$. In other words, we need to trade the value of one cost for the other. For this reason, the curve is called the optimal tradeoff curve (or the optimal tradeoff surface when $N > 2$) [2, p. 183].

Fig. 2: Optimal and Pareto optimal points for the case $N = 2$: (Left) the optimal point where both cost functions are minimized simultaneously and (Right) Pareto optimal points lie on the red boundary curve.

To solve for Pareto optimal solutions, a useful scalarization technique is often used to form an aggregate cost function that is the weighted sum of the component costs as follows:

$$J^{\text{glob}}(w) \;\triangleq\; \sum_{k=1}^{N} \pi_k\, J_k(w) \qquad (2)$$

where $\pi_k$ is a positive weight attached to the $k$th cost. It was shown in [2, pp. 178–180] that the minimizer of (2) is Pareto optimal for the multi-objective optimization problem. Moreover, by varying the values of $\{\pi_k\}$, we are able to obtain different Pareto optimal points on the tradeoff curve. Observing that we can always define a new cost that incorporates the weighting scalar $\pi_k$,

$$J_k(w) \;\leftarrow\; \pi_k\, J_k(w) \qquad (3)$$

it is sufficient for our future discussions to focus on aggregate costs of the following form:

$$J^{\text{glob}}(w) \;\triangleq\; \sum_{k=1}^{N} J_k(w) \qquad (4)$$

If desired, we can also add constraints to problem (4). For example, suppose there is additionally some inequality constraint at node $k$ (specified, say, by a vector of coefficients and a scalar bound). Then, we can consider using barrier functions to convert the constrained optimization problem into an unconstrained problem [2, 3]. For example, we can redefine each cost as $J_k(w) + B_k(w)$, where $B_k(w)$ is a barrier function that penalizes values of $w$ that violate the constraint. Therefore, without loss of generality, we shall assume $\mathcal{W} = \mathbb{R}^M$ (where $M$ is the dimension of $w$) and only consider unconstrained optimization problems. Moreover, we shall assume the $\{J_k(w)\}$ are differentiable and, for each given set of positive weights $\{\pi_k\}$, the cost in (2) or (4) is strongly convex so that the minimizer $w^o$ is unique [4]. Note that the new cost in (3) depends on $\pi_k$, so that the minimizer $w^o$ of (4) also depends on the choice of $\{\pi_k\}$.
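To make the scalarization idea in (2)–(4) concrete, the following minimal Python sketch (an illustration added here, not part of the original development) sweeps the weights over two simple scalar quadratic costs with different minimizers; the particular costs and the grid search are hypothetical choices.

```python
import numpy as np

# Two hypothetical scalar costs with different minimizers:
# J1(w) = (w - 1)^2 is minimized at w = 1, J2(w) = (w + 1)^2 at w = -1.
J1 = lambda w: (w - 1.0) ** 2
J2 = lambda w: (w + 1.0) ** 2

def pareto_point(pi1, pi2, grid=np.linspace(-2.0, 2.0, 4001)):
    """Minimize the weighted sum pi1*J1 + pi2*J2 over a fine grid, cf. (2)."""
    values = pi1 * J1(grid) + pi2 * J2(grid)
    return grid[np.argmin(values)]

# Sweeping the weights traces different Pareto optimal points on the
# tradeoff curve between the two individual minimizers.
for pi1 in (0.1, 0.5, 0.9):
    w_star = pareto_point(pi1, 1.0 - pi1)
    print(f"pi1={pi1:.1f}: w={w_star:+.2f}, J1={J1(w_star):.3f}, J2={J2(w_star):.3f}")
```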

One of the most studied approaches to the distributed solution of such optimization problems is the incremental approach — see, e.g., [5, 6, 7, 8, 9, 10, 11, 12]. In this approach, a cyclic path is defined over the nodes and data are processed in a cyclic manner through the network until optimization is achieved. However, determining a cyclic path that covers all nodes is generally an NP-hard problem [13] and, in addition, cyclic trajectories are vulnerable to link and node failures. Another useful distributed optimization approach relies on the use of consensus strategies [5, 14, 15, 16, 17, 18, 19, 20]. In this approach, vanishing step-size sequences are used to ensure that agents reach consensus and agree about the optimizer in steady-state. However, in time-varying environments, diminishing step-sizes prevent the network from learning continuously; when the step-sizes die out, the network stops learning.

In [21], we generalized our earlier work on adaptation and learning over networks [22, 23] and developed diffusion strategies that enable the decentralized optimization of global cost functions of the form (4). In the diffusion approach, information is processed locally at the nodes and then diffused through a real-time sharing mechanism. In this manner, the approach is scalable, robust to node and link failures, and avoids the need for cyclic trajectories. In addition, compared to the aforementioned consensus solutions (such as those in [16, 19, 24]), the diffusion strategies we consider here employ constant (rather than vanishing) step-sizes in order to endow the resulting networks with continuous learning and tracking abilities. By keeping the step-sizes constant, the agents are able to track drifts in the underlying costs and in the location of the Pareto optimal solutions. One of the main challenges in the ensuing analysis becomes that of showing that the agents are still able to approach the Pareto optimal solution even with constant step-sizes; in this way, the resulting diffusion strategies are able to combine the two useful properties of optimality and adaptation.

In [21], we focused on the important case where all costs share the same optimal solution (as was the case with Fig. 2); this situation arises when the agents in the network have a common objective and they cooperate to solve the problem of mutual interest in a distributed manner. Examples abound in biological networks where agents work together, for example, to locate food sources or evade predators [25], and in collaborative spectrum sensing [26], system identification [27], and learning applications [28]. In this paper, we develop the necessary theory to show that the same diffusion approach (described by (9)–(10) below) can be used to solve the more challenging multi-objective optimization problem, where the agents need to converge instead to a Pareto optimal solution. Such situations are common in the context of multi-agent decision making (see, e.g., [3] and also Sec. IV where we discuss one application in the context of collaborative decision-making in finance). To study this more demanding scenario, we first show that the proposed diffusion process can be represented as the cascade composition of three operators: two combination (aggregation) operators and one gradient-descent operator. Using the Banach fixed-point theorem [29, pp. 299–303], we establish the existence of a unique fixed point for the composite cascade. We then study how close each agent in the network converges towards this fixed point, and also examine how close the Pareto solution is to the fixed point. We perform a detailed mean-square error analysis and establish that all agents are able to converge to the same Pareto optimal solution within a sufficiently small mean-square-error (MSE) bound. We illustrate the results by considering an example involving collaborative decision-making in financial applications.

Notation. Throughout the paper, all vectors are column vectors. We use boldface letters to denote random quantities (such as $\boldsymbol{w}_{k,i}$) and regular font to denote their realizations or deterministic variables (such as $w_{k,i}$). We use $\mathrm{diag}\{x_1, \ldots, x_N\}$ to denote a (block) diagonal matrix consisting of diagonal entries (blocks) $x_1, \ldots, x_N$, and use $\mathrm{col}\{x_1, \ldots, x_N\}$ to denote a column vector formed by stacking $x_1, \ldots, x_N$ on top of each other. The notation $x \preceq y$ means each entry of the vector $x$ is less than or equal to the corresponding entry of the vector $y$.

II Diffusion Adaptation Strategies

In [21], we motivated and derived diffusion strategies for distributed optimization, which are captured by the following general description:

$$\phi_{k,i-1} = \sum_{l \in \mathcal{N}_k} a_{1,lk}\, w_{l,i-1} \qquad (5)$$
$$\psi_{k,i} = \phi_{k,i-1} - \mu_k \sum_{l \in \mathcal{N}_k} c_{lk}\, \nabla_w J_l(\phi_{k,i-1}) \qquad (6)$$
$$w_{k,i} = \sum_{l \in \mathcal{N}_k} a_{2,lk}\, \psi_{l,i} \qquad (7)$$

where $w_{k,i}$ is the local estimate for $w^o$ at node $k$ and time $i$, $\mu_k$ is the step-size parameter used by node $k$, and $\{\phi_{k,i-1}, \psi_{k,i}\}$ are intermediate estimates for $w^o$. Moreover, $\nabla_w J_l(\cdot)$ is the (column) gradient vector of $J_l(w)$ relative to $w$. The non-negative coefficients $a_{1,lk}$, $c_{lk}$, and $a_{2,lk}$ are the $(l,k)$-th entries of matrices $A_1$, $C$, and $A_2$, respectively, and they are required to satisfy:

$$A_1^T \mathbb{1}_N = \mathbb{1}_N, \quad A_2^T \mathbb{1}_N = \mathbb{1}_N, \quad C\, \mathbb{1}_N = \mathbb{1}_N, \quad a_{1,lk} = a_{2,lk} = c_{lk} = 0 \ \text{ if } l \notin \mathcal{N}_k \qquad (8)$$

where $\mathbb{1}_N$ denotes a vector with all entries equal to one. Note from (8) that the combination coefficients are nonzero only for those $l \in \mathcal{N}_k$. Therefore, the sums in (5)–(7) are confined within the neighborhood of node $k$. Condition (8) requires the combination matrices $\{A_1, A_2\}$ to be left-stochastic, while $C$ is right-stochastic. We therefore note that each node $k$ first aggregates the existing estimates from its neighbors through (5) and generates the intermediate estimate $\phi_{k,i-1}$. Then, node $k$ aggregates gradient information from its neighborhood and updates $\phi_{k,i-1}$ to $\psi_{k,i}$ through (6). All other nodes in the network perform these same steps simultaneously. Finally, node $k$ aggregates the estimates $\{\psi_{l,i}\}$ through step (7) to update its weight estimate to $w_{k,i}$.
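As a computational illustration of (5)–(7), the following minimal Python sketch performs one full diffusion iteration over all nodes; the function name and the array layout (one row per node) are our own conventions, not the paper's.

```python
import numpy as np

def diffusion_step(W, grads, A1, A2, C, mu):
    """One iteration of the general diffusion strategy (5)-(7).

    W     : N x M array whose k-th row is node k's estimate w_{k,i-1}
    grads : list of N callables; grads[l](w) returns the gradient of J_l at w
    A1, A2: N x N left-stochastic matrices (columns sum to one)
    C     : N x N right-stochastic matrix (rows sum to one)
    mu    : length-N array of step-sizes
    """
    N, _ = W.shape
    # Step (5): aggregate neighbors' estimates, phi_k = sum_l a1_{lk} w_l
    Phi = A1.T @ W
    # Step (6): aggregate neighborhood gradients and take a descent step
    Psi = np.empty_like(W)
    for k in range(N):
        g = sum(C[l, k] * grads[l](Phi[k]) for l in range(N))
        Psi[k] = Phi[k] - mu[k] * g
    # Step (7): aggregate intermediate estimates, w_k = sum_l a2_{lk} psi_l
    return A2.T @ Psi
```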

Algorithm (5)–(7) can be specialized to several cases for different choices of the matrices $\{A_1, A_2, C\}$. For example, the choice $A_1 = I_N$, $C = I_N$, and $A_2 = A$ reduces to the adapt-then-combine (ATC) strategy, which involves no exchange of gradient information [22, 21, 23, 30]:

$$\psi_{k,i} = w_{k,i-1} - \mu_k\, \nabla_w J_k(w_{k,i-1}), \qquad w_{k,i} = \sum_{l \in \mathcal{N}_k} a_{lk}\, \psi_{l,i} \qquad (9)$$

while the choice $A_1 = A$, $A_2 = I_N$, and $C = I_N$ reduces to the combine-then-adapt (CTA) strategy, where the order of the combination and adaptation steps is reversed relative to (9) [22, 23, 30]:

$$\psi_{k,i-1} = \sum_{l \in \mathcal{N}_k} a_{lk}\, w_{l,i-1}, \qquad w_{k,i} = \psi_{k,i-1} - \mu_k\, \nabla_w J_k(\psi_{k,i-1}) \qquad (10)$$

Furthermore, if in the CTA implementation (10) we enforce $A$ to be doubly stochastic, replace the gradient by a subgradient, and use a time-decaying step-size parameter $\mu_k(i) \to 0$, then we obtain the unconstrained version of the algorithm used in [24]. In the sequel, we continue with the general recursions (5)–(7), which allow us to examine the convergence properties of several algorithms in a unified manner. The challenge we encounter now is to show that this same class of algorithms can still optimize the cost (4) in a distributed manner when the individual costs do not necessarily have the same minimizer. This is actually a demanding task, as the analysis in the coming sections reveals, and we need to introduce novel analysis techniques to be able to handle this general case.
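The ATC and CTA forms in (9)–(10) then follow simply by specializing the matrices passed to the diffusion_step sketch above; the small network and quadratic costs below are hypothetical stand-ins used only for illustration.

```python
import numpy as np

N, M = 3, 2
A = np.array([[0.6, 0.2, 0.2],
              [0.2, 0.6, 0.2],
              [0.2, 0.2, 0.6]])              # doubly stochastic combination matrix
I = np.eye(N)
mu = 0.05 * np.ones(N)

# Quadratic costs J_k(w) = 0.5*||w - t_k||^2 with distinct minimizers t_k.
targets = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
grads = [lambda w, t=t: (w - t) for t in targets]

W = np.zeros((N, M))
W_atc = diffusion_step(W, grads, A1=I, A2=A, C=I, mu=mu)   # ATC, cf. (9)
W_cta = diffusion_step(W, grads, A1=A, A2=I, C=I, mu=mu)   # CTA, cf. (10)
```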

III Performance Analysis

III-A Modeling Assumptions

In most situations in practice, the true gradient vectors $\nabla_w J_l(\cdot)$ needed in (6) are not available. Instead, perturbed versions are available, which we model as

$$\widehat{\nabla_w J_l}(\phi) \;=\; \nabla_w J_l(\phi) + \boldsymbol{v}_l(\phi) \qquad (11)$$

where the random noise term $\boldsymbol{v}_l(\phi)$ may depend on $\phi$ and will be required to satisfy certain conditions given by (16)–(17). We refer to the perturbation in (11) as gradient noise. Using (11), the diffusion algorithm (5)–(7) becomes the following, where we are using boldface letters for various quantities to highlight the fact that they are now stochastic in nature due to the randomness in the noise component:

$$\boldsymbol{\phi}_{k,i-1} = \sum_{l \in \mathcal{N}_k} a_{1,lk}\, \boldsymbol{w}_{l,i-1} \qquad (12)$$
$$\boldsymbol{\psi}_{k,i} = \boldsymbol{\phi}_{k,i-1} - \mu_k \sum_{l \in \mathcal{N}_k} c_{lk}\, \widehat{\nabla_w J_l}(\boldsymbol{\phi}_{k,i-1}) \qquad (13)$$
$$\boldsymbol{w}_{k,i} = \sum_{l \in \mathcal{N}_k} a_{2,lk}\, \boldsymbol{\psi}_{l,i} \qquad (14)$$

Using (12)–(14), we now proceed to examine the mean-square performance of the diffusion strategies. Specifically, in the sequel, we study: (i) how fast and (ii) how close the estimator $\boldsymbol{w}_{k,i}$ at each node $k$ approaches the Pareto-optimal solution $w^o$ in the mean-square-error sense. We establish the convergence of all nodes towards the same Pareto-optimal solution within a small MSE bound. Since we are dealing with individual costs that may not have a common minimizer, the approach we employ to examine the convergence properties of the diffusion strategy is fundamentally different from [21]; we follow a system-theoretic approach and call upon the fixed-point theorem for contractive mappings [29, pp. 299–303].
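For a rough feel of how gradient noise affects the recursions, the following sketch (our illustration; the network, costs, and the zero-mean additive noise model are assumed rather than taken from the analysis) runs the stochastic ATC recursion and reports the steady-state mean-square deviation of the node estimates from the minimizer of the aggregate cost.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, mu, sigma_v = 3, 2, 0.05, 0.1
A = np.array([[0.6, 0.2, 0.2],
              [0.2, 0.6, 0.2],
              [0.2, 0.2, 0.6]])
targets = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
w_glob = targets.mean(axis=0)     # minimizer of sum_k 0.5*||w - t_k||^2

W = np.zeros((N, M))
for i in range(2000):
    # Adaptation, cf. (13): perturbed gradients following the noise model (11)
    noisy_grad = (W - targets) + sigma_v * rng.standard_normal((N, M))
    Psi = W - mu * noisy_grad
    # Combination, cf. (14): w_k = sum_l a_{lk} psi_l
    W = A.T @ Psi

msd = np.mean(np.sum((W - w_glob) ** 2, axis=1))
print(f"steady-state mean-square deviation ~ {msd:.2e}")
```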

To proceed with the analysis, we introduce the following assumptions on the cost functions and gradient noise. As explained in [21], these conditions are weaker than similar conditions in the literature of distributed optimization; in this way, our convergence and performance results hold under more relaxed conditions than usually considered in the literature.

Assumption 1 (Bounded Hessian).

Each component cost function $J_l(w)$ has a bounded Hessian matrix, i.e., there exist nonnegative real numbers $\lambda_{l,\min}$ and $\lambda_{l,\max}$ such that, for each $l = 1, \ldots, N$:

$$\lambda_{l,\min}\, I_M \;\le\; \nabla_w^2 J_l(w) \;\le\; \lambda_{l,\max}\, I_M \qquad (15)$$

with $\lambda_{l,\max} \ge \lambda_{l,\min} \ge 0$ and $\sum_{l \in \mathcal{N}_k} c_{lk}\, \lambda_{l,\min} > 0$ for $k = 1, \ldots, N$. ∎

Assumption 2 (Gradient noise).

There exist $\alpha \ge 0$ and $\sigma_v^2 \ge 0$ such that, for all $l$ and all $i$:

$$\mathbb{E}\big[ \boldsymbol{v}_l(\boldsymbol{w}) \mid \mathcal{F}_{i-1} \big] = 0 \qquad (16)$$
$$\mathbb{E}\big[ \|\boldsymbol{v}_l(\boldsymbol{w})\|^2 \mid \mathcal{F}_{i-1} \big] \;\le\; \alpha\, \|\nabla_w J_l(\boldsymbol{w})\|^2 + \sigma_v^2 \qquad (17)$$

for all $\boldsymbol{w} \in \mathcal{F}_{i-1}$, where $\mathcal{F}_{i-1}$ denotes the past history of estimators $\{\boldsymbol{w}_{k,j}\}$ for $j \le i-1$ and all $k$. ∎

If we choose $\lambda_{l,\min} > 0$, then Assumption 1 implies that the cost functions $\{J_l(w)\}$ are strongly convex.¹ This condition can be guaranteed by adding small regularization terms. For example, we can convert a non-strongly convex function $J_l(w)$ into a strongly convex one by redefining it as $J_l(w) + \epsilon \|w\|^2$, where $\epsilon$ is a small positive regularization factor. We further note that assumption (17) is a mix of the “relative random noise” and “absolute random noise” models usually assumed in stochastic approximation [4]. Condition (17) implies that the gradient noise grows when the estimate is away from the optimum (large gradient). Condition (17) also states that, even when the gradient vector is zero, there is still some residual noise of variance $\sigma_v^2$.

¹ A differentiable function $f(w)$ on $\mathbb{R}^M$ is said to be strongly convex if there exists a $\lambda > 0$ such that $f(y) \ge f(x) + \nabla f(x)^T(y - x) + \frac{\lambda}{2}\|y - x\|^2$ for any $x, y$. If $f(w)$ is twice-differentiable, this is also equivalent to $\nabla_w^2 f(w) \ge \lambda I_M$ [4, pp. 9–10]. Strong convexity implies that the function can be lower bounded by some quadratic function.
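A minimal sketch of the regularization device just mentioned, under the assumption of a convex but not strongly convex quadratic cost: adding $\epsilon \|w\|^2$ shifts every eigenvalue of the Hessian up by $2\epsilon$, which enforces a strictly positive lower bound of the type required in (15).

```python
import numpy as np

def regularize(J, grad_J, eps=1e-3):
    """Return J(w) + eps*||w||^2 and its gradient (eps is an assumed small factor)."""
    J_reg = lambda w: J(w) + eps * np.dot(w, w)
    grad_reg = lambda w: grad_J(w) + 2.0 * eps * w
    return J_reg, grad_reg

# Example: J(w) = ||H w - d||^2 with a rank-deficient H is convex but not
# strongly convex, since the Hessian 2*H^T H has a zero eigenvalue.
H = np.array([[1.0, 2.0]])
d = np.array([0.5])
J = lambda w: float(np.sum((H @ w - d) ** 2))
grad_J = lambda w: 2.0 * H.T @ (H @ w - d)
J_reg, grad_reg = regularize(J, grad_J)
```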

III-B Diffusion Adaptation Operators

To analyze the performance of the diffusion adaptation strategies, we first represent the mappings performed by (12)–(14) in terms of useful operators.

Definition 1 (Combination Operator).

Suppose $x = \mathrm{col}\{x_1, \ldots, x_N\}$ is an arbitrary $N \times 1$ block column vector that is formed by stacking $M \times 1$ vectors $x_1, \ldots, x_N$ on top of each other. The combination operator $\mathcal{T}_A : \mathbb{R}^{MN} \to \mathbb{R}^{MN}$ is defined as the linear mapping:

$$\mathcal{T}_A(x) \;\triangleq\; (A^T \otimes I_M)\, x \qquad (18)$$

where $A$ is an $N \times N$ left-stochastic matrix, and $\otimes$ denotes the Kronecker product operation. ∎

Definition 2 (Gradient-Descent Operator).

Consider the same $N \times 1$ block column vector $x$. Then, the gradient-descent operator $\mathcal{T}_G : \mathbb{R}^{MN} \to \mathbb{R}^{MN}$ is the nonlinear mapping defined by:

$$\mathcal{T}_G(x) \;\triangleq\; \mathrm{col}\Big\{ x_k - \mu_k \sum_{l \in \mathcal{N}_k} c_{lk}\, \nabla_w J_l(x_k) \Big\}_{k=1}^{N} \qquad (19)$$

Definition 3 (Power Operator).

Consider the same block vector $x$. The power operator $P : \mathbb{R}^{MN} \to \mathbb{R}^{N}$ is defined as the mapping:

$$P[x] \;\triangleq\; \mathrm{col}\big\{ \|x_1\|^2, \|x_2\|^2, \ldots, \|x_N\|^2 \big\} \qquad (20)$$

where $\|\cdot\|$ denotes the Euclidean norm of a vector. ∎
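To fix ideas, here is a small Python sketch of the three operators as we read Definitions 1–3; the block vector is stored as an N x M array, and the function names and this layout are our own conventions rather than the paper's.

```python
import numpy as np

def T_A(x_blocks, A):
    """Combination operator, cf. (18): block k of the output is sum_l A[l,k] * x_l."""
    return A.T @ x_blocks

def T_G(x_blocks, grads, C, mu):
    """Gradient-descent operator, cf. (19): x_k - mu_k * sum_l C[l,k] * grad J_l(x_k)."""
    N, _ = x_blocks.shape
    out = np.empty_like(x_blocks)
    for k in range(N):
        g = sum(C[l, k] * grads[l](x_blocks[k]) for l in range(N))
        out[k] = x_blocks[k] - mu[k] * g
    return out

def P(x_blocks):
    """Power operator, cf. (20): vector of squared Euclidean norms of the blocks."""
    return np.sum(x_blocks ** 2, axis=1)
```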

We will use the power operator to study how error variances propagate after a specific operator $\mathcal{T}_A$ or $\mathcal{T}_G$ is applied to a random vector. We remark that we are using the notation “$P[\cdot]$” rather than “$P(\cdot)$” to highlight the fact that $P$ is a mapping from $\mathbb{R}^{MN}$ to the lower dimensional space $\mathbb{R}^{N}$. In addition to the above three operators, we define the following aggregate vector of gradient noise, which depends on the state $x$:

$$\boldsymbol{g}(x) \;\triangleq\; \mathrm{col}\Big\{ \mu_k \sum_{l \in \mathcal{N}_k} c_{lk}\, \boldsymbol{v}_l(x_k) \Big\}_{k=1}^{N} \qquad (21)$$

With these definitions, we can now represent the two combination steps (12) and (14) by the two combination operators $\mathcal{T}_{A_1}$ and $\mathcal{T}_{A_2}$. We can also represent the adaptation step (13) by the gradient-descent operator perturbed by the noise vector (21):

$$\boldsymbol{\mathcal{T}}_G(x) \;\triangleq\; \mathcal{T}_G(x) - \boldsymbol{g}(x) \qquad (22)$$

We can view $\boldsymbol{\mathcal{T}}_G$ as a random operator that maps each input $x$ into an $MN \times 1$ random vector, and we use a boldface letter to highlight this random nature. Let

$$\boldsymbol{w}_i \;\triangleq\; \mathrm{col}\{ \boldsymbol{w}_{1,i}, \boldsymbol{w}_{2,i}, \ldots, \boldsymbol{w}_{N,i} \} \qquad (23)$$

denote the vector that collects the estimators across all nodes. Then, the overall diffusion adaptation steps (12)–(14), which update $\boldsymbol{w}_{i-1}$ to $\boldsymbol{w}_i$, can be represented as a cascade composition of three operators:

$$\boldsymbol{\mathcal{T}}_d \;\triangleq\; \mathcal{T}_{A_2} \circ \boldsymbol{\mathcal{T}}_G \circ \mathcal{T}_{A_1} \qquad (24)$$

where we use “$\circ$” to denote the composition of any two operators, i.e., $(\mathcal{T}_1 \circ \mathcal{T}_2)(x) \triangleq \mathcal{T}_1(\mathcal{T}_2(x))$. If there is no gradient noise, then the diffusion adaptation operator (24) reduces to

$$\mathcal{T}_d \;\triangleq\; \mathcal{T}_{A_2} \circ \mathcal{T}_G \circ \mathcal{T}_{A_1} \qquad (25)$$

In other words, the diffusion adaptation over the entire network with and without gradient noise can be described in the following compact forms:

$$\boldsymbol{w}_i = \boldsymbol{\mathcal{T}}_d(\boldsymbol{w}_{i-1}) \qquad (26)$$
$$w_i = \mathcal{T}_d(w_{i-1}) \qquad (27)$$

Fig. 3(a) illustrates the role of the combination operator (combination steps) and the gradient-descent operator (adaptation step). The combination operator aggregates the estimates from the neighborhood (social learning), while the gradient-descent operator incorporates information from the local gradient vector (self-learning). In Fig. 3(b), we show that each diffusion adaptation step can be represented as the cascade composition of three operators, with perturbation from the gradient noise operator.

(a) The combination operators and the gradient-descent operator. (b) Cascade representation of diffusion adaptation.
Fig. 3: Representation of the diffusion adaptation strategy (12)–(14) in terms of operators. Each diffusion adaptation step can be viewed as a cascade composition of three operators: $\mathcal{T}_{A_1}$, $\boldsymbol{\mathcal{T}}_G$, and $\mathcal{T}_{A_2}$, with perturbation by the gradient noise $\boldsymbol{g}(\cdot)$. If $\boldsymbol{g}(\cdot) = 0$, then $\boldsymbol{\mathcal{T}}_d$ becomes $\mathcal{T}_d$.
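Continuing the operator sketch above (same assumed array layout), one noise-free diffusion iteration (27) is literally the composition of the three operators and can be checked against the direct implementation of (5)–(7).

```python
def T_d(x_blocks, grads, A1, A2, C, mu):
    """Noise-free diffusion operator, cf. (25): apply T_{A1}, then T_G, then T_{A2}."""
    return T_A(T_G(T_A(x_blocks, A1), grads, C, mu), A2)

# Sanity check against the direct step (5)-(7) from the earlier sketch:
# np.allclose(T_d(W, grads, A1, A2, C, mu),
#             diffusion_step(W, grads, A1, A2, C, mu))   # -> True
```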

Next, in Lemma 1, we examine some of the properties of the operators $\mathcal{T}_A$, $\mathcal{T}_G$, and $P[\cdot]$, which are proved in Appendix A.

Lemma 1 (Useful Properties).

Consider $N \times 1$ block vectors $x = \mathrm{col}\{x_1, \ldots, x_N\}$ and $y = \mathrm{col}\{y_1, \ldots, y_N\}$ with $M \times 1$ entries $x_k$ and $y_k$. Then, the operators $\mathcal{T}_A$, $\mathcal{T}_G$ and $P[\cdot]$ satisfy the following properties:

  1. (Linearity): $\mathcal{T}_A$ is a linear operator.

  2. (Nonnegativity): $P[x] \succeq 0$.

  3. (Scaling): For any scalar $a$, we have

    (28)
  4. (Convexity): Suppose $x^{(1)}, \ldots, x^{(K)}$ are block vectors formed in the same manner as $x$, and let $\theta_1, \ldots, \theta_K$ be non-negative real scalars that add up to one. Then,

    (29)
  5. (Additivity): Suppose $\boldsymbol{x}$ and $\boldsymbol{y}$ are block random vectors that satisfy $\mathbb{E}[\boldsymbol{x}_k^T \boldsymbol{y}_k] = 0$ for $k = 1, \ldots, N$. Then,

    (30)
  6. (Variance relations):

    (31)
    (32)

    where

    (33)
    (34)
    (35)
  7. (Block Maximum Norm): The $\infty$-norm of $P[x]$ is the squared block maximum norm of $x$:

    $\|P[x]\|_{\infty} = \|x\|_{b,\infty}^2 \triangleq \big( \max_{1 \le k \le N} \|x_k\| \big)^2$    (36)
  8. (Preservation of Inequality): Suppose the vectors $u$, $v$ and the matrix $F$ have nonnegative entries; then $u \preceq v$ implies $F u \preceq F v$. ∎
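As a quick numerical illustration of some of these properties (using the power-operator sketch from the previous subsection, and reading the block maximum norm of a block vector as the largest Euclidean norm among its blocks), the following hedged check verifies nonnegativity, a scaling identity, and Property 7 on a random block vector.

```python
import numpy as np

rng = np.random.default_rng(1)
N, M = 4, 3
x = rng.standard_normal((N, M))                  # N blocks of size M

Px = np.sum(x ** 2, axis=1)                      # power operator P[x], cf. (20)
block_max_norm = np.max(np.linalg.norm(x, axis=1))

assert np.all(Px >= 0)                                             # nonnegativity
assert np.allclose(np.sum((2.5 * x) ** 2, axis=1), 2.5 ** 2 * Px)  # P[a*x] = a^2 * P[x]
assert np.isclose(np.max(Px), block_max_norm ** 2)                 # Property 7, cf. (36)
```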

III-C Transient Analysis

Using the operator representation developed above, we now analyze the transient behavior of the diffusion algorithm (12)–(14). From Fig. 3(b) and the previous discussion, we know that the stochastic recursion (26) is a perturbed version of the noise-free recursion (27). Therefore, we first study the convergence of the noise-free recursion, and then analyze the effect of the gradient perturbation on the stochastic recursion.

Intuitively, if recursion (27) converges, then it should converge to a vector $w^\infty$ that satisfies

$$w^\infty = \mathcal{T}_d(w^\infty) \qquad (37)$$

In other words, the vector $w^\infty$ should be a fixed point of the operator $\mathcal{T}_d$ [29, p. 299]. We need to answer four questions pertaining to the fixed point. First, does the fixed point exist? Second, is it unique? Third, under which conditions does the recursion converge to the fixed point? Fourth, how far is the fixed point from the minimizer $w^o$ of (4)? We answer the first two questions using the Banach Fixed Point Theorem (Contraction Theorem) [29, pp. 2–9, pp. 299–300]. Afterwards, we study convergence under gradient perturbation. The last question will be considered in the next subsection.

Definition 4 (Metric Space).

A set $X$, whose elements we shall call points, is said to be a metric space if we can associate a real number $d(x, y)$ with any two points $x$ and $y$ of $X$, such that

  1. $d(x, y) > 0$ if $x \neq y$; $d(x, x) = 0$;

  2. $d(x, y) = d(y, x)$;

  3. $d(x, y) \le d(x, z) + d(z, y)$, for any $z \in X$.

Any function $d(\cdot, \cdot)$ with these three properties is called a distance function, or a metric, and we denote a metric space $X$ with distance $d$ as $(X, d)$. ∎

Definition 5 (Contraction).

Let $(X, d)$ be a metric space. A mapping $\mathcal{T} : X \to X$ is called a contraction on $X$ if there is a positive real number $\delta < 1$ such that $d(\mathcal{T}(x), \mathcal{T}(y)) \le \delta\, d(x, y)$ for all $x, y \in X$. ∎

Lemma 2 (Banach Fixed Point Theorem[29]).

Consider a metric space $(X, d)$, where $X \neq \emptyset$. Suppose that $X$ is complete² and let $\mathcal{T} : X \to X$ be a contraction. Then, $\mathcal{T}$ has precisely one fixed point. ∎

² A metric space is complete if any of its Cauchy sequences converges to a point in the space; a sequence $\{x_n\}$ is Cauchy in $(X, d)$ if, for every $\epsilon > 0$, there exists an $N_0$ such that $d(x_m, x_n) < \epsilon$ for all $m, n > N_0$.
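As a minimal stand-alone illustration of Lemma 2 (unrelated to the diffusion operator itself), the map $x \mapsto \cos x$ is a contraction on the complete interval $[0.5, 1]$ since $|\sin x| \le \sin(1) < 1$ there, so simple fixed-point iteration converges to its unique fixed point.

```python
import math

x = 0.7
for _ in range(50):
    x = math.cos(x)          # iterate the contraction T(x) = cos(x)
print(x)                     # ~ 0.739085..., the unique solution of x = cos(x)
```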

As long as we can prove that the diffusion operator $\mathcal{T}_d$ is a contraction, i.e., for any two points $x$ and $y$, after we apply the operator $\mathcal{T}_d$, the distance between $\mathcal{T}_d(x)$ and $\mathcal{T}_d(y)$ scales down by a factor that is uniformly bounded away from one, then the fixed point $w^\infty$ defined in (37) exists and is unique. We now proceed to show that $\mathcal{T}_d$ is a contraction operator when the step-size parameters $\{\mu_k\}$ satisfy certain conditions.

Theorem 1 (Fixed Point).

Suppose the step-size parameters $\{\mu_k\}$ satisfy the following conditions

(38)

Then, there exists a unique fixed point $w^\infty$ for the unperturbed diffusion operator $\mathcal{T}_d$ in (25).

Proof.

Let $x = \mathrm{col}\{x_1, \ldots, x_N\}$ be formed by stacking $N$ vectors of size $M \times 1$ on top of each other. Similarly, let $y = \mathrm{col}\{y_1, \ldots, y_N\}$. The distance function that we will use is induced from the block maximum norm (36): $d(x, y) \triangleq \|x - y\|_{b,\infty}^2$. From the definition of the diffusion operator $\mathcal{T}_d$ in (25), we have

(39)

where steps (a) and (d) are because of the linearity of $\mathcal{T}_{A_1}$ and $\mathcal{T}_{A_2}$, steps (b) and (e) are because of the variance relation property (31), and step (c) is due to the variance relation property (32). Taking the $\infty$-norm of both sides of (39), we have

(40)

where, in the second inequality, we used the fact that $\|A_1^T\|_\infty = \|A_2^T\|_\infty = 1$ since $A_1^T$ and $A_2^T$ are right-stochastic matrices. Using property (36), we can conclude from (40) that $d(\mathcal{T}_d(x), \mathcal{T}_d(y)) \le \delta\, d(x, y)$ for a factor $\delta$ determined by the step-sizes. Therefore, the operator $\mathcal{T}_d$ is a contraction if $\delta < 1$, which, by substituting (33)–(34), becomes

and we arrive at the condition (38) on the step-sizes. In other words, if condition (38) holds for each $k$, then $\mathcal{T}_d$ is a contraction operator. By Lemma 2, the operator $\mathcal{T}_d$ then has a unique fixed point $w^\infty$ that satisfies equation (37). ∎
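To connect Theorem 1 with the earlier sketches (reusing T_d from Sec. III-B; the network, quadratic costs, and small constant step-sizes are assumptions for illustration), the following experiment iterates the noise-free operator from two different starting points and checks that the trajectories collapse onto the same fixed point, which is the numerical signature of a contraction.

```python
import numpy as np

rng = np.random.default_rng(2)
N, M = 3, 2
A = np.array([[0.6, 0.2, 0.2],
              [0.2, 0.6, 0.2],
              [0.2, 0.2, 0.6]])
I, mu = np.eye(N), 0.05 * np.ones(N)
targets = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
grads = [lambda w, t=t: (w - t) for t in targets]

def block_max_norm(x):
    return np.max(np.linalg.norm(x, axis=1))

X = rng.standard_normal((N, M))
Y = rng.standard_normal((N, M))
for _ in range(200):
    X = T_d(X, grads, A1=I, A2=A, C=I, mu=mu)    # noise-free ATC diffusion
    Y = T_d(Y, grads, A1=I, A2=A, C=I, mu=mu)
print(block_max_norm(X - Y))    # ~ 0: both trajectories reach the same fixed point
```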

Given the existence and uniqueness of the fixed point, the third question to answer is whether recursion (27) converges to this fixed point. The answer is affirmative under (38). However, we are not going to study this question separately. Instead, we will analyze the convergence of the more demanding noisy recursion (26). Therefore, we now study how fast and how close the successive estimators $\{\boldsymbol{w}_i\}$ generated by recursion (26) approach $w^\infty$. Once this issue is addressed, we will then examine how close $w^\infty$ is to the desired $w^o$. Introduce the following mean-square-perturbation (MSP) vector at time $i$:

$$\mathrm{MSP}_i \;\triangleq\; \mathbb{E}\, P[\boldsymbol{w}_i - w^\infty] \qquad (41)$$

The $k$-th entry of $\mathrm{MSP}_i$ characterizes how far away the estimate $\boldsymbol{w}_{k,i}$ at node $k$ and time $i$ is from the corresponding block of $w^\infty$ in the mean-square sense. To study the closeness of $\boldsymbol{w}_i$ to $w^\infty$, we shall study how the quantity $\mathrm{MSP}_i$ evolves over time. By (26), (37) and the definitions of $\boldsymbol{\mathcal{T}}_d$ and $\mathcal{T}_d$ in (24) and (25), we obtain

(42)

where step (a) is by the linearity of $\mathcal{T}_{A_2}$, steps (b) and (f) are by property (31), step (c) is by the substitution of (22), step (d) is by Property 5 in Lemma 1 and assumption (16), and step (e) is by (32). To proceed with the analysis, we establish the following lemma to bound the second term in (42).

Lemma 3 (Bound on Gradient Perturbation).

It holds that

(43)

where

(44)
(45)
(46)
Proof.

By the definition of $\boldsymbol{g}(\cdot)$ in (21), with the argument being a random vector, we get

(47)

For each block in (47), using Jensen’s inequality, we have

(48)

where $\|\cdot\|_1$ denotes the maximum absolute column sum, and in the last step, we used (17). Using (123),

(49)

From (124) and the norm inequality , we obtain

(50)

Substituting (50) into (48), we obtain

(51)

where . Substituting (51) and into (47) leads to

(52)

where step (a) is due to the fact that $C$ is right-stochastic so that $C\,\mathbb{1}_N = \mathbb{1}_N$, step (b) is because of linearity, step (c) is due to property (31), step (d) is a consequence of Property 3 of Lemma 1, and step (e) is due to the convexity property (29). ∎

Substituting (43) into (42), we obtain