Solving Non-smooth Constrained Programs with Lower Complexity than \mathcal{O}(1/\varepsilon): A Primal-Dual Homotopy Smoothing Approach

Solving Non-smooth Constrained Programs with Lower Complexity than $\mathcal{O}(1/\varepsilon)$: A Primal-Dual Homotopy Smoothing Approach

Abstract

We propose a new primal-dual homotopy smoothing algorithm for a linearly constrained convex program, where neither the primal nor the dual function has to be smooth or strongly convex. The best known iteration complexity solving such a non-smooth problem is . In this paper, we show that by leveraging a local error bound condition on the dual function, the proposed algorithm can achieve a better primal convergence time of , where is a local error bound parameter. As an example application of the general algorithm, we show that the distributed geometric median problem, which can be formulated as a constrained convex program, has its dual function non-smooth but satisfying the aforementioned local error bound condition with , therefore enjoying a convergence time of . This result improves upon the convergence time bound achieved by existing distributed optimization algorithms. Simulation experiments also demonstrate the performance of our proposed algorithm.

1 Introduction

We consider the following linearly constrained convex optimization problem:

(1)
s.t. (2)

where is a compact convex set, is a convex function, . Such an optimization problem has been studied in numerous works under various application scenarios such as machine learning (Yurtsever et al. (2015)), signal processing (Ling and Tian (2010)) and communication networks (Yu and Neely (2017a)). The goal of this work is to design new algorithms for (1-2) achieving an approximation with better convergence time than .

1.1 Optimization algorithms related to constrained convex program

Since enforcing the constraint generally requires a significant amount of computation in large scale systems, the majority of the scalable algorithms solving problem (1-2) are of primal-dual type. Generally, the efficiency of the these algorithms depends on two key properties of the dual function of (1-2), namely, the Lipschitz gradient and strong convexity. When the dual function of (1-2) is smooth, primal-dual type algorithms with Nesterov’s acceleration on the dual of (1)-(2) can achieve a convergence time of (e.g. Yurtsever et al. (2015); Tran-Dinh et al. (2018)). When the dual function has both the Lipschitz continuous gradient and the strongly convex property, algorithms such as dual subgradient and ADMM enjoy a linear convergence (e.g. Yu and Neely (2018); Deng and Yin (2016)). However, when neither of the properties is assumed, the basic dual-subgradient type algorithm gives a relatively worse convergence time (e.g. Wei et al. (2015)), while its improved variants yield a convergence time of (e.g. Lan and Monteiro (2013); Deng et al. (2017); Yu and Neely (2017b); Yurtsever et al. (2018); Gidel et al. (2018)).

More recently, several works seek to achieve a better convergence time than under weaker assumptions than Lipschitz gradient and strong convexity of the dual function. Specifically, building upon the recent progress on the gradient type methods for optimization with Hlder continuous gradient (e.g. Nesterov (2015a, b)), the work Yurtsever et al. (2015) develops a primal-dual gradient method solving (1-2), which achieves a convergence time of , where is the modulus of Hlder continuity on the gradient of the dual function of the formulation (1-2).1 On the other hand, the work Yu and Neely (2018) shows that when the dual function has Lipschitz continuous gradient and satisfies a locally quadratic property (i.e. a local error bound with , see Definition 2.1 for details), which is weaker than strong convexity, one can still obtain a linear convergence with a dual subgradient algorithm. A similar result has also been proved for ADMM in Han et al. (2015).

In the current work, we aim to address the following question: Can one design a scalable algorithm with lower complexity than solving (1-2), when both the primal and the dual functions are possibly non-smooth? More specifically, we look at a class of problems with dual functions satisfying only a local error bound, and show that indeed one is able to obtain a faster primal convergence via a primal-dual homotopy smoothing method under a local error bound condition on the dual function.

Homotopy methods were first developed in the statistics literature in relation to the model selection problem for LASSO, where, instead of computing a single solution for LASSO, one computes a complete solution path by varying the regularization parameter from large to small (e.g. Osborne et al. (2000); Xiao and Zhang (2013)).2 On the other hand, the smoothing technique for minimizing a non-smooth convex function of the following form was first considered in Nesterov (2005):

(3)

where is a closed convex set, is a convex smooth function, and can be explicitly written as

(4)

where for any two vectors , , is a closed convex set, and is a convex function. By adding a strongly concave proximal function of with a smoothing parameter into the definition of , one can obtain a smoothed approximation of with smooth modulus . Then, Nesterov (2005) employs the accelerated gradient method on the smoothed approximation (which delivers a convergence time for the approximation), and sets the parameter to be , which gives an overall convergence time of . An important follow-up question is that whether or not such a smoothing technique can also be applied to solve (1-2) with the same primal convergence time. This question is answered in subsequent works Necoara and Suykens (2008); Li et al. (2016); Tran-Dinh et al. (2018), where they show that indeed one can also obtain an primal convergence time for the problem (1-2) via smoothing.

Combining the homotopy method with a smoothing technique to solve problems of the form (3) has been considered by a series of works including Yang and Lin (2015), Xu et al. (2016) and Xu et al. (2017). Specifically, the works Yang and Lin (2015) and Xu et al. (2016) consider a multi-stage algorithm which starts from a large smoothing parameter and then decreases this parameter over time. They show that when the function satisfies a local error bound with parameter , such a combination gives an improved convergence time of minimizing the unconstrained problem (3). The work Xu et al. (2017) shows that the homotopy method can also be combined with ADMM to achieve a faster convergence solving problems of the form

where is a closed convex set, are both convex functions with satisfying the local error bound, and the proximal operator of can be easily computed. However, due to the restrictions on the function in the paper, it cannot be extended to handle problems of the form (1-2).3

Contributions: In the current work, we show a multi-stage homotopy smoothing method enjoys a primal convergence time solving (1-2) when the dual function satisfies a local error bound condition with . Our convergence time to achieve within of optimality is in terms of number of (unconstrained) maximization steps , where constants are known, which is a standard measure of convergence time for Lagrangian-type algorithms that turn a constrained problem into a sequence of unconstrained problems. The algorithm essentially restarts a weighted primal averaging process at each stage using the last Lagrange multiplier computed. This result improves upon the earlier result by (Necoara and Suykens (2008); Li et al. (2016)) and at the the time extends the scope of homotopy smoothing method to solve a new class of problems involving constraints (1-2). It is worth mentioning that a similar restarted smoothing strategy is proposed in a recent work Tran-Dinh et al. (2018) to solve problems including (1-2), where they show that, empirically, restarting the algorithm from the Lagrange multiplier computed from the last stage improves the convergence time. Here, we give one theoretical justification of such an improvement.

1.2 The distributed geometric median problem

The geometric median problem, also known as the Fermat-Weber problem, has a long history (e.g. see Weiszfeld and Plastria (2009) for more details). Given a set of points , we aim to find one point so as to minimize the sum of the Euclidean distance, i.e.

(5)

which is a non-smooth convex optimization problem. It can be shown that the solution to this problem is unique as long as are not co-linear. Linear convergence time algorithms solving (5) have also been developed in several works (e.g. Xue and Ye (1997), Parrilo and Sturmfels (2003), Cohen et al. (2016)). Our motivation of studying this problem is driven by its recent application in distributed statistical estimation, in which data are assumed to be randomly spreaded to multiple connected computational agents that produce intermediate estimators, and then, these intermediate estimators are aggregated in order to compute some statistics of the whole data set. Arguably one of the most widely used aggregation procedures is computing the geometric median of the local estimators (see, for example, Duchi et al. (2014), Minsker et al. (2014), Minsker and Strawn (2017), Yin et al. (2018)). It can be shown that the geometric median is robust against arbitrary corruptions of local estimators in the sense that the final estimator is stable as long as at least half of the nodes in the system perform as expected.

Contributions: As an example application of our general algorithm, we look at the problem of computing the solution to (5) in a distributed scenario over a network of agents without any central controller, where each agent holds a local vector . Remarkably, we show theoretically that such a problem, when formulated as (1-2), has its dual function non-smooth but locally quadratic. Therefore, applying our proposed primal-dual homotopy smoothing method gives a convergence time of . This result improves upon the performance bounds of the previously known decentralized optimization algorithms (e.g. PG-EXTRA Shi et al. (2015) and decentralized ADMM Shi et al. (2014)), which do not take into account the special structure of the problem and only obtain a convergence time of . Simulation experiments also demonstrate the superior ergodic convergence time of our algorithm compared to other algorithms.

2 Primal-dual Homotopy Smoothing

2.1 Preliminaries

The Lagrange dual function of (1-2) is defined as follows:4

(6)

where is the dual variable, is a compact convex set and the minimum of the dual function is For any closed set and , define the distance function of to the set as

where . For a convex function , the -sublevel set is defined as

(7)

Furthermore, for any matrix , we use to denote the largest eigenvalue of . Let

(8)

be the set of optimal Lagrange multipliers. Note that if the constraint is feasible, then implies for any that satisfies . The following definition introduces the notion of local error bound.

Definition 2.1.

Let be a convex function over . Suppose is non-empty. The function is said to satisfy the local error bound with parameter if such that for any ,

(9)

where is a positive constant possibly depending on . In particular, when , is said to be locally quadratic and when , it is said to be locally linear.

Remark 2.1.

Indeed, a wide range of popular optimization problems satisfy the local error bound condition. The work Tseng (2010) shows that if is a polyhedron, has Lipschitz continuous gradient and is strongly convex, then the dual function of (1-2) is locally linear. The work Burke and Tseng (1996) shows that when the objective is linear and is a convex cone, the dual function is also locally linear. The values of have also been computed for several other problems (e.g. Pang (1997); Yang and Lin (2015)).

Definition 2.2.

Given an accuracy level , a vector is said to achieve an approximate solution regarding problem (1-2) if

where is the optimal primal objective of (1-2).

Throughout the paper, we adopt the following assumptions:

Assumption 2.1.

(a) The feasible set is nonempty and non-singleton.
(b) The set is bounded, i.e. for some positive constant . Furthermore, the function is also bounded, i.e. for some positive constant .
(c) The dual function defined in (6) satisfies the local error bound for some parameter and some level .
(d) Let be the projection operator onto the column space of . There exists a unique vector such that for any , , i.e. .

Note that assumption (a) and (b) are very mild and quite standard. For most applications, it is enough to check (c) and (d). We will show, for example, in Section 4 that the distributed geometric median problem satisfies all the assumptions. Finally, we say a function is smooth with modulus if

2.2 Primal-dual homotopy smoothing algorithm

This section introduces our proposed algorithm for optimization problem (1-2) satisfying Assumption 2.1. The idea of smoothing is to introduce a smoothed Lagrange dual function that approximates the original possibly non-smooth dual function defined in (6).

For any constant , define

(10)

where is an arbitrary fixed point in . For simplicity of notations, we drop the dependency on in the definition of . Then, by the boundedness assumption of , we have For any , define

(11)

as the smoothed dual function. The fact that is indeed smooth with modulus follows from Lemma A.1 in the Supplement. Thus, one is able to apply an accelerated gradient descent algorithm on this modified Lagrange dual function, which is detailed in Algorithm 1 below, starting from an initial primal-dual pair .

Let and .
For to do

  • Compute a tentative dual multiplier:

  • Compute the primal update:

  • Compute the dual update:

  • Update the stepsize: .

end for
Output:
and , where .

Algorithm 1 Primal-Dual Smoothing: PDS

Our proposed algorithm runs Algorithm 1 in multiple stages, which is detailed in Algorithm 2 below.

Let be a fixed constant and be the desired accuracy. Set , , , the number of stages , and the time horizon during each stage .
For to do

  • Let .

  • Run the primal-dual smoothing algorithm (, ) = PDS.

end for
Output:
.

Algorithm 2 Homotopy Method:

3 Convergence Time Results

We start by defining the the set of optimal Lagrange multipliers for the smoothed problem:5

(12)

Our convergence time analysis involves two steps. The first step is to derive a primal convergence time bound for Algorithm 1, which involves the location information of the initial Lagrange multiplier at the the beginning of this stage. The details are given in Supplement A.2.

Theorem 3.1.

Suppose Assumption 2.1(a)(b) holds. For any and any initial primal-dual pair , we have the following performance bound regarding Algorithm 1,

(13)
(14)

where , , and is any point in defined in (12).

An inductive argument shows that . Thus, Theorem 3.1 already gives an convergence time by setting and . Note that this is the best trade-off we can get from Theorem 3.1 when simply bounding the terms and by constants. To see how this bound leads to an improved convergence time when running in multiple rounds, suppose the computation from the last round gives a that is close enough to the optimal set , then, would be small. When the local error bound condition holds, one can show that . As a consequence, one is able to choose smaller than and get a better trade-off. Formally, we have the following overall performance bound. The proof is given in Supplement A.3.

Theorem 3.2.

Suppose Suppose Assumption 2.1 holds, , , . The proposed homotopy method achieves the following objective and constraint violation bound:

with running time , i.e. the algorithm achieves an approximation with convergence time .

4 Distributed Geometric Median

Consider the problem of computing the geometric median over a connected network , where is a set of nodes, is a collection of undirected edges, if there exists an undirected edge between node and node , and otherwise. Furthermore, .Furthermore, since the graph is undirected, we always have . Two nodes and are said to be neighbors of each other if . Each node holds a local vector , and the goal is to compute the solution to (5) without having a central controller, i.e. each node can only communicate with its neighbors.

Computing geometric median over a network has been considered in several works previously and various distributed algorithms have been developed such as decentralized subgradient methd (DSM, Nedic and Ozdaglar (2009); Yuan et al. (2016)), PG-EXTRA (Shi et al. (2015)) and ADMM (Shi et al. (2014); Deng et al. (2017)). The best known convergence time for this problem is . In this section, we will show that it can be written in the form of problem (1-2), has its Lagrange dual function locally quadratic and optimal Lagrange multiplier unique up to the null space of , thereby satisfying Assumption 2.1.

Throughout this section, we assume that , are not co-linear and they are distinct (i.e. if ). We start by defining a mixing matrix with respect to this network. The mixing matrix will have the following properties:

  1. Decentralization: The -th entry if .

  2. Symmetry: .

  3. The null space of satisfies , where is an all 1 vector in .

These conditions are rather mild and satisfied by most doubly stochastic mixing matrices used in practice. Some specific examples are Markov transition matrices of max-degree chain and Metropolis-Hastings chain (see Boyd et al. (2004) for detailed discussions). Let be the local variable on the node . Define

where

and is -th entry of the mixing matrix . By the aforementioned null space property of the mixing matrix , it is easy to see that the null space of the matrix is

(15)

Then, because of the null space property (15), one can equivalently write problem (5) in a “distributed fashion” as follows:

(16)
(17)

where we set the constant to be large enough so that the solution belongs to the set . This is in the same form as (1-2) with .

4.1 Distributed implementation

In this section, we show how to implement the proposed algorithm to solve (16-17) in a distributed way. Let , be the vectors of Lagrange multipliers defined in Algorithm 1, where each . Then, each agent in the network is responsible for updating the corresponding Lagrange multipliers and according to Algorithm 1, which has the initial values . Note that the first, third and fourth steps in Algorithm 1 are naturally separable regarding each agent. It remains to check if the second step can be implemented in a distributed way.

Note that in the second step, we obtain the primal update by solving the following problem:

where is a fixed point in the feasible set. We separate the maximization according to different agent :

Note that according to the definition of , it is equal to 0 if agent is not the neighbor of agent . More specifically, Let be the set of neighbors of agent (including the agent itself), then, the above maximization problem can be equivalently written as

where we used the fact that . Solving this problem only requires the local information from each agent. Completing the squares gives

(18)

The solution to such a subproblem has a closed form, as is shown in the following lemma (the proof is given in Supplement A.4):

Lemma 4.1.

Let , then, the solution to (18) has the following closed form:

4.2 Local error bound condition

The proof of the this theorem is given in Supplement A.5.

Theorem 4.1.

The Lagrange dual function of (16-17) is non-smooth and given by the following

where is the -th column block of the matrix , is the indicator function which takes 1 if and 0 otherwise. let be the set of optimal Lagrange multipliers defined according to (8). Suppose , then, for any , there exists a such that

Furthermore, there exists a unique vector s.t. , i.e. Assumption 2.1(d) holds. Thus, applying the proposed method gives the convergence time .

5 Simulation Experiments

In this section, we conduct simulation experiments on the distributed geometric median problem. Each vector is sampled from the uniform distribution in , i.e. each entry of is independently sampled from uniform distribution on . We compare our algorithm with DSM (Nedic and Ozdaglar (2009)), P-EXTRA (Shi et al. (2015)), Jacobian parallel ADMM (Deng et al. (2017)) and Smoothing (Necoara and Suykens (2008)) under different network sizes (). Each network is randomly generated with a particular connectivity ratio6, and the mixing matrix is chosen to be the Metropolis-Hastings Chain (Boyd et al. (2004)), which can be computed in a distributed manner. We use the relative error as the performance metric, which is defined as for each iteration . The vector is the initial primal variable. The vector is the optimal solution computed by CVX Grant et al. (2008). For our proposed algorithm, is the restarted primal average up to the current iteration. For all other algorithms, is the primal average up to the current iteration. The results are shown below. We see in all cases, our proposed algorithm is much better than, if not comparable to, other algorithms. For detailed simulation setups and additional simulation results, see Supplement A.6.

(a)
(b)
(c)
Figure 1: Comparison of different algorithms on networks of different sizes. (a) , connectivity ratio=0.15. (b) , connectivity ratio=0.13. (c) , connectivity ratio=0.1.

Acknowledgments

The authors thank Stanislav Minsker and Jason D. Lee for helpful discussions related to the geometric median problem. Qing Ling’s research is supported in part by the National Science Foundation China under Grant 61573331 and the National Science Foundation Anhui under Grant 1608085QF130. Michael J. Neely’s research is supported in part by the National Science Foundation under Grant CCF-1718477.

Appendix A Supplement

a.1 Smoothing lemma

In this section, we show that adding the strongly convex term on the primal indeed gives a smoothed dual.

Lemma A.1.

Let be defined as above and let be a sequence of -Lipschitz continuous convex functions, i.e. , where . Then, the Lagrange dual function

is smooth with modulus . In particular, if , then, the smooth modulus is equal to , where denotes the maximum eigenvalue of .

This proof of this lemma is rather standard (see also proof of Lemma 6 of Yu and Neely (2018)) and the special case of can also be derived from Fenchel duality (Beck et al. (2014)). The proof is included here just for completeness.

Proof of Lemma a.1.

First of all, note that the function is strongly concave, it follows that there exists a unique minimizer . By Danskin’s theorem (see Bertsekas (1999) for details), we have for any ,

Now, consider any , we have

(19)

where the equality follows from Danskin’s Theorem and the inequality follows from Lipschitz continuity of . Again, by the fact that is strongly concave with modulus ,

which implies

Adding the two inequalities gives

where the last inequality follows from Lipschitz continuity of again. This implies

Combining this inequality with (19) gives

finishing the first part of the proof. The second part of the claim follows easily from the fact that . ∎

a.2 Proof of Theorem 3.1

In this section, we give a convergence time proof of each stage. As a preliminary, We have the following basic lemma which bounds the perturbation of the Lagrange dual due to the primal smoothing.

Lemma A.2.

Let and be functions defined in (6) and (11), respectively. Then, we have for any ,

and

for any and .

Proof of Lemma a.2.

First of all, for any , let

then, we have for any ,

where the first inequality follows from the fact that maximizes . Similarly, we have

where the first inequality follows from the fact that maximizes . Furthermore, we have