# Solving Non-smooth Constrained Programs with Lower Complexity than O(1/ε): A Primal-Dual Homotopy Smoothing Approach

Xiaohan Wei
Department of Electrical Engineering
University of Southern California
Los Angeles, CA, USA, 90089
xiaohanw@usc.edu
&Hao Yu
Alibaba Group (U.S.) Inc.
Bellevue, WA, USA, 98004
hao.yu@alibaba-inc.com
&Qing Ling
School of Data and Computer Science
Sun Yat-Sen University
Guangzhou, China, 510006
lingqing556@mail.sysu.edu.cn
&Michael J. Neely
Department of Electrical Engineering
University of Southern California
Los Angeles, CA, USA, 90089
mikejneely@gmail.com
###### Abstract

We propose a new primal-dual homotopy smoothing algorithm for a linearly constrained convex program, where neither the primal nor the dual function has to be smooth or strongly convex. The best known iteration complexity for solving such a non-smooth problem is O(1/ε). In this paper, we show that by leveraging a local error bound condition on the dual function, the proposed algorithm can achieve a better primal convergence time of O(ε^{−2/(2+β)} log₂(ε^{−1})), where β ∈ (0,1] is a local error bound parameter. As an example application of the general algorithm, we show that the distributed geometric median problem, which can be formulated as a constrained convex program, has a dual function that is non-smooth but satisfies the aforementioned local error bound condition with β = 1/2, and therefore enjoys a convergence time of O(ε^{−4/5} log₂(ε^{−1})). This result improves upon the O(1/ε) convergence time bound achieved by existing distributed optimization algorithms. Simulation experiments also demonstrate the performance of our proposed algorithm.

Preprint. Work in progress.

## 1 Introduction

We consider the following linearly constrained convex optimization problem:

 min  f(x)                              (1)
 s.t. Ax − b = 0,  x ∈ X,               (2)

where X ⊆ R^d is a compact convex set, f : X → R is a convex function, A ∈ R^{N×d}, and b ∈ R^N. Such an optimization problem has been studied in numerous works under various application scenarios such as machine learning (Yurtsever et al. (2015)), signal processing (Ling and Tian (2010)) and communication networks (Yu and Neely (2017a)). The goal of this work is to design new algorithms for (1-2) that achieve an ε-approximation with better convergence time than O(1/ε).

### 1.1 Optimization algorithms related to constrained convex program

Since enforcing the constraint generally requires a significant amount of computation in large scale systems, the majority of scalable algorithms solving problem (1-2) are of primal-dual type. Generally, the efficiency of these algorithms depends on two key properties of the dual function of (1-2), namely, Lipschitz continuity of the gradient and strong convexity. When the dual function of (1-2) is smooth, primal-dual type algorithms with Nesterov’s acceleration on the dual of (1-2) can achieve a convergence time of O(1/√ε) (e.g. Yurtsever et al. (2015); Tran-Dinh et al. (2018)). When the dual function has both a Lipschitz continuous gradient and the strong convexity property, algorithms such as dual subgradient and ADMM enjoy linear convergence (e.g. Yu and Neely (2018); Deng and Yin (2016)). However, when neither of the properties is assumed, the basic dual-subgradient type algorithm gives a relatively worse O(1/ε²) convergence time (e.g. Wei et al. (2015)), while its improved variants yield a convergence time of O(1/ε) (e.g. Lan and Monteiro (2013); Deng et al. (2017); Yu and Neely (2017b); Yurtsever et al. (2018); Gidel et al. (2018)).

More recently, several works seek to achieve a better convergence time than O(1/ε) under weaker assumptions than Lipschitz gradient and strong convexity of the dual function. Specifically, building upon recent progress on gradient type methods for optimization with Hölder continuous gradient (e.g. Nesterov (2015a, b)), the work Yurtsever et al. (2015) develops a primal-dual gradient method solving (1-2), which achieves a convergence time of O(ε^{−2/(1+3ν)}), where ν ∈ (0,1] is the modulus of Hölder continuity of the gradient of the dual function of the formulation (1-2). (The gradient of a function F is Hölder continuous with modulus ν ∈ (0,1] on a set X if ∥∇F(x) − ∇F(y)∥ ≤ L_ν ∥x − y∥^ν, ∀x, y ∈ X, where ∥·∥ is the vector 2-norm and L_ν is a constant depending on ν.) On the other hand, the work Yu and Neely (2018) shows that when the dual function has a Lipschitz continuous gradient and satisfies a locally quadratic property (i.e. a local error bound with β = 1/2, see Definition 2.1 for details), which is weaker than strong convexity, one can still obtain linear convergence with a dual subgradient algorithm. A similar result has also been proved for ADMM in Han et al. (2015).

In the current work, we aim to address the following question: Can one design a scalable algorithm with lower complexity than O(1/ε) solving (1-2), when both the primal and the dual functions are possibly non-smooth? More specifically, we look at a class of problems with dual functions satisfying only a local error bound, and show that indeed one is able to obtain a faster primal convergence via a primal-dual homotopy smoothing method under a local error bound condition on the dual function.

Homotopy methods were first developed in the statistics literature in relation to the model selection problem for LASSO, where, instead of computing a single solution for LASSO, one computes a complete solution path by varying the regularization parameter from large to small (e.g. Osborne et al. (2000); Xiao and Zhang (2013)). (The word “homotopy”, which was adopted in Osborne et al. (2000), refers to the fact that the mapping from regularization parameters to the set of solutions of the LASSO problem is a continuous piecewise-linear function.) On the other hand, the smoothing technique for minimizing a non-smooth convex function of the following form was first considered in Nesterov (2005):

 Ψ(x) = g(x) + h(x),   x ∈ Ω₁   (3)

where Ω₁ is a closed convex set, h is a convex smooth function, and g can be explicitly written as

 g(x) = max_{u∈Ω₂} ⟨Ax, u⟩ − φ(u),   (4)

where ⟨·,·⟩ denotes the inner product of two vectors, Ω₂ is a closed convex set, and φ is a convex function. By adding a strongly concave proximal function of u with a smoothing parameter μ > 0 into the definition of g, one can obtain a smoothed approximation of Ψ with smooth modulus proportional to 1/μ. Then, Nesterov (2005) employs the accelerated gradient method on the smoothed approximation (which delivers an O(1/√(με)) convergence time for the approximation), and sets the parameter μ to be of order ε, which gives an overall convergence time of O(1/ε). An important follow-up question is whether such a smoothing technique can also be applied to solve (1-2) with the same primal convergence time. This question is answered in the subsequent works Necoara and Suykens (2008); Li et al. (2016); Tran-Dinh et al. (2018), which show that indeed one can also obtain an O(1/ε) primal convergence time for problem (1-2) via smoothing.
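To make the smoothing step concrete, here is a minimal numerical sketch (our own illustration, not taken from the works cited above) for the scalar case g(x) = |x| = max_{|u|≤1} ux, i.e. form (4) with A = 1 and φ = 0. Adding the proximal term −(μ/2)u² yields the familiar Huber function: the approximation error is uniformly at most μ/2, while the smoothed function gains a (1/μ)-Lipschitz gradient.

```python
import numpy as np

def g(x):
    # non-smooth: g(x) = |x| = max_{|u| <= 1} u*x  (form (4) with A = 1, phi = 0)
    return abs(x)

def g_mu(x, mu):
    # smoothed via the strongly concave proximal term -(mu/2)*u^2:
    # g_mu(x) = max_{|u| <= 1} u*x - (mu/2)*u^2, which is the Huber function
    if abs(x) <= mu:
        return x * x / (2.0 * mu)
    return abs(x) - mu / 2.0

mu = 0.1
xs = np.linspace(-1.0, 1.0, 2001)
vals = np.array([g_mu(x, mu) for x in xs])

# uniform approximation error equals mu/2 at the edges, never more
err = np.max(np.abs(np.array([g(x) for x in xs]) - vals))
assert abs(err - mu / 2.0) < 1e-9

# the smoothed function has a (1/mu)-Lipschitz gradient
grads = np.gradient(vals, xs)
lip = np.max(np.abs(np.diff(grads)) / np.diff(xs))
assert lip <= 1.0 / mu + 1e-6
```

Shrinking μ trades a smaller approximation error (μ/2) against a larger smoothness modulus (1/μ), which is exactly the trade-off the homotopy strategy exploits by decreasing μ over stages.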

Combining the homotopy method with a smoothing technique to solve problems of the form (3) has been considered by a series of works including Yang and Lin (2015), Xu et al. (2016) and Xu et al. (2017). Specifically, the works Yang and Lin (2015) and Xu et al. (2016) consider a multi-stage algorithm which starts from a large smoothing parameter and then decreases this parameter over time. They show that when the function Ψ satisfies a local error bound with parameter β ∈ (0,1], such a combination gives an improved convergence time of O(ε^{β−1} log(ε^{−1})) for minimizing the unconstrained problem (3). The work Xu et al. (2017) shows that the homotopy method can also be combined with ADMM to achieve a faster convergence solving problems of the form

 min_{x∈Ω₁}  f(x) + ψ(Ax − b),

where Ω₁ is a closed convex set, f and ψ are both convex functions with ψ satisfying the local error bound, and the proximal operator of f can be easily computed. However, due to the restrictions on the function ψ in that paper, the method cannot be extended to handle problems of the form (1-2). (The result in Xu et al. (2017) heavily depends on the assumption that the subgradient of ψ is defined everywhere and uniformly bounded by some constant ρ, which excludes the indicator functions necessary to deal with constraints in the ADMM framework.)

Contributions: In the current work, we show that a multi-stage homotopy smoothing method enjoys an O(ε^{−2/(2+β)} log₂(ε^{−1})) primal convergence time solving (1-2) when the dual function satisfies a local error bound condition with β ∈ (0,1]. Our convergence time to achieve within ε of optimality is measured by the number of (unconstrained) maximization steps of the form (11), in which all constants are known; this is a standard measure of convergence time for Lagrangian-type algorithms that turn a constrained problem into a sequence of unconstrained problems. The algorithm essentially restarts a weighted primal averaging process at each stage using the last Lagrange multiplier computed. This result improves upon the earlier O(1/ε) results of Necoara and Suykens (2008) and Li et al. (2016), and at the same time extends the scope of the homotopy smoothing method to a new class of problems involving the constraints in (1-2). It is worth mentioning that a similar restarted smoothing strategy was proposed in the recent work Tran-Dinh et al. (2018) to solve problems including (1-2), where they show empirically that restarting the algorithm from the Lagrange multiplier computed in the last stage improves the convergence time. Here, we give one theoretical justification of such an improvement.

### 1.2 The distributed geometric median problem

The geometric median problem, also known as the Fermat-Weber problem, has a long history (see, e.g., Weiszfeld and Plastria (2009) for more details). Given a set of n points b₁, b₂, ⋯, b_n ∈ R^d, we aim to find one point x* ∈ R^d that minimizes the sum of Euclidean distances, i.e.

 x* ∈ argmin_{x∈R^d} ∑_{i=1}^n ∥x − b_i∥,   (5)

which is a non-smooth convex optimization problem. It can be shown that the solution to this problem is unique as long as b₁, …, b_n are not co-linear. Linear convergence time algorithms solving (5) have also been developed in several works (e.g. Xue and Ye (1997), Parrilo and Sturmfels (2003), Cohen et al. (2016)). Our motivation for studying this problem is driven by its recent application in distributed statistical estimation, in which data are assumed to be randomly spread across multiple connected computational agents that produce intermediate estimators; these intermediate estimators are then aggregated in order to compute some statistic of the whole data set. Arguably one of the most widely used aggregation procedures is computing the geometric median of the local estimators (see, for example, Duchi et al. (2014), Minsker et al. (2014), Minsker and Strawn (2017), Yin et al. (2018)). It can be shown that the geometric median is robust against arbitrary corruptions of local estimators in the sense that the final estimator is stable as long as at least half of the nodes in the system perform as expected.
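As a quick illustration of this robustness (our own sketch, computed here with Weiszfeld's classical fixed-point iteration rather than the algorithm proposed in this paper), the snippet below corrupts fewer than half of the local estimators: the mean is dragged arbitrarily far away, while the geometric median stays near the honest points.

```python
import numpy as np

def geometric_median(points, iters=200, eps=1e-9):
    # Weiszfeld's fixed-point iteration: x <- weighted average of the points
    # with weights 1/||x - b_i||; a classical solver for (5), shown only to
    # illustrate robustness (it is not the method proposed in the paper)
    x = points.mean(axis=0)
    for _ in range(iters):
        dist = np.maximum(np.linalg.norm(points - x, axis=1), eps)
        w = 1.0 / dist
        x = (w[:, None] * points).sum(axis=0) / w.sum()
    return x

rng = np.random.default_rng(0)
estimators = rng.normal(0.0, 1.0, size=(20, 2))   # honest local estimators near 0
estimators[:8] = 1e6                              # corrupt fewer than half of them

med = geometric_median(estimators)
assert np.linalg.norm(med) < 100.0                # median stays near honest points
assert np.linalg.norm(estimators.mean(axis=0)) > 1e5   # the mean is destroyed
```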

Contributions: As an example application of our general algorithm, we look at the problem of computing the solution to (5) in a distributed scenario over a network of n agents without any central controller, where each agent holds a local vector b_i. Remarkably, we show theoretically that such a problem, when formulated as (1-2), has a dual function that is non-smooth but locally quadratic. Therefore, applying our proposed primal-dual homotopy smoothing method gives a convergence time of O(ε^{−4/5} log₂(ε^{−1})). This result improves upon the performance bounds of previously known decentralized optimization algorithms (e.g. PG-EXTRA Shi et al. (2015) and decentralized ADMM Shi et al. (2014)), which do not take into account the special structure of the problem and only obtain a convergence time of O(1/ε). Simulation experiments also demonstrate the superior ergodic convergence time of our algorithm compared to other algorithms.

## 2 Primal-dual Homotopy Smoothing

### 2.1 Preliminaries

The Lagrange dual function of (1-2) is defined as follows (usually, the Lagrange dual is defined as min_{x∈X} f(x) + ⟨λ, Ax − b⟩; here, we flip the sign and take a maximum for no reason other than consistency with the form (4)):

 F(λ) := max_{x∈X} −⟨λ, Ax − b⟩ − f(x),   (6)

where λ ∈ R^N is the dual variable, X is the compact convex set in (2), and the minimum of the dual function is F* := min_{λ∈R^N} F(λ). For any closed set K and any point x, define the distance of x to the set K as

 dist(x, K) := min_{y∈K} ∥x − y∥,

where ∥·∥ denotes the vector 2-norm. For a convex function F(λ), the δ-sublevel set S_δ is defined as

 S_δ := {λ ∈ R^N : F(λ) − F* ≤ δ}.   (7)

Furthermore, for any matrix A, we use σ_max(AᵀA) to denote the largest eigenvalue of AᵀA. Let

 Λ* := {λ* ∈ R^N : F(λ*) ≤ F(λ), ∀λ ∈ R^N}   (8)

be the set of optimal Lagrange multipliers. Note that if the constraint Ax − b = 0 is feasible, then λ* ∈ Λ* implies λ* + ν ∈ Λ* for any ν that satisfies Aᵀν = 0. The following definition introduces the notion of a local error bound.

###### Definition 2.1.

Let F(λ) be a convex function over λ ∈ R^N. Suppose Λ* is non-empty. The function F is said to satisfy the local error bound with parameter β ∈ (0,1] if there exists a δ > 0 such that for any λ ∈ S_δ,

 dist(λ, Λ*) ≤ C_δ (F(λ) − F*)^β,   (9)

where C_δ is a positive constant possibly depending on δ. In particular, when β = 1/2, F is said to be locally quadratic, and when β = 1, it is said to be locally linear.
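Two scalar examples make the definition concrete: F(λ) = λ² satisfies (9) with β = 1/2 (locally quadratic) and F(λ) = |λ| satisfies it with β = 1 (locally linear), in both cases with C_δ = 1 and Λ* = {0}. The snippet below is a simple numerical sanity check of these two cases (our own illustration).

```python
import numpy as np

def max_violation(F, lam_star, beta, C, delta, n=1001):
    # Worst violation of dist(lam, {lam_star}) <= C*(F(lam) - F*)^beta over
    # the delta-sublevel set, for a scalar F with unique minimizer lam_star.
    F_star = F(lam_star)
    worst = 0.0
    for lam in np.linspace(lam_star - 2.0, lam_star + 2.0, n):
        gap = F(lam) - F_star
        if gap <= delta:                          # lam lies in S_delta, cf. (7)
            worst = max(worst, abs(lam - lam_star) - C * gap ** beta)
    return worst

# F(lam) = lam^2: dist = |lam| = (F - F*)^(1/2), so beta = 1/2 (locally quadratic)
quad_viol = max_violation(lambda l: l * l, 0.0, beta=0.5, C=1.0, delta=1.0)
# F(lam) = |lam|: dist = F - F*, so beta = 1 (locally linear)
lin_viol = max_violation(lambda l: abs(l), 0.0, beta=1.0, C=1.0, delta=1.0)
assert quad_viol <= 1e-9 and lin_viol <= 1e-9
```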

###### Remark 2.1.

Indeed, a wide range of popular optimization problems satisfy the local error bound condition. The work Tseng (2010) shows that if X is a polyhedron and f has a Lipschitz continuous gradient and is strongly convex, then the dual function of (1-2) is locally linear. The work Burke and Tseng (1996) shows that when the objective is linear and X is a convex cone, the dual function is also locally linear. The values of β have also been computed for several other problems (e.g. Pang (1997); Yang and Lin (2015)).

###### Definition 2.2.

Given an accuracy level ε > 0, a vector x₀ ∈ X is said to achieve an ε-approximate solution to problem (1-2) if

 f(x₀) − f* ≤ O(ε),   ∥Ax₀ − b∥ ≤ O(ε),

where f* is the optimal primal objective of (1-2).

Throughout the paper, we adopt the following assumptions:

###### Assumption 2.1.

(a) The feasible set {x ∈ X : Ax − b = 0} is nonempty and non-singleton.
(b) The set X is bounded, i.e. sup_{x,y∈X} ∥x − y∥ ≤ D for some positive constant D. Furthermore, the function f is also bounded, i.e. sup_{x∈X} |f(x)| ≤ M for some positive constant M.
(c) The dual function F(λ) defined in (6) satisfies the local error bound with some parameter β ∈ (0,1] and some level δ > 0.
(d) Let P_A be the projection operator onto the column space of A. There exists a unique vector ν* such that P_A λ* = ν* for any λ* ∈ Λ*, i.e. all optimal multipliers share the same projection onto the column space of A.

Note that assumptions (a) and (b) are very mild and quite standard. For most applications, it is enough to check (c) and (d). We will show in Section 4, for example, that the distributed geometric median problem satisfies all of the assumptions. Finally, we say a function g is smooth with modulus L > 0 if

 ∥∇g(x) − ∇g(y)∥ ≤ L ∥x − y∥,   ∀x, y ∈ X.

### 2.2 Primal-dual homotopy smoothing algorithm

This section introduces our proposed algorithm for optimization problem (1-2) satisfying Assumption 2.1. The idea of smoothing is to introduce a smoothed Lagrange dual function that approximates the original possibly non-smooth dual function defined in (6).

For any constant μ > 0, define

 f_μ(x) := f(x) + (μ/2) ∥x − x̃∥²,   (10)

where x̃ is an arbitrary fixed point in X. For simplicity of notation, we drop the dependency on x̃ in the definition of f_μ(x). Then, by the boundedness assumption on X, we have f(x) ≤ f_μ(x) ≤ f(x) + μD²/2 for all x ∈ X. For any μ > 0, define

 F_μ(λ) = max_{x∈X} −⟨λ, Ax − b⟩ − f_μ(x)   (11)

as the smoothed dual function. The fact that F_μ(λ) is indeed smooth with modulus σ_max(AᵀA)/μ follows from Lemma A.1 in the Supplement. Thus, one is able to apply an accelerated gradient algorithm on this modified Lagrange dual function, which is detailed in Algorithm 1 below, starting from an initial primal-dual pair.

Our proposed algorithm runs Algorithm 1 in multiple stages, which is detailed in Algorithm 2 below.
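Since the pseudocode is not reproduced here, the following toy sketch conveys the multi-stage structure on a hypothetical scalar instance: minimize f(x) = |x| over X = [−2, 2] subject to x − 1 = 0 (so A = 1, b = 1, optimum x* = 1). It is our own simplification, with plain dual gradient steps in place of the accelerated method, unweighted primal averages, and illustrative parameter choices; the inner maximizer has a closed form analogous to Lemma 4.1 below.

```python
import numpy as np

D = 2.0          # radius of X
x_tilde = 0.0    # fixed smoothing center in X

def x_of_lambda(lam, mu):
    # Maximizer of -lam*(x - 1) - |x| - (mu/2)*(x - x_tilde)^2 over |x| <= D:
    # a soft-threshold clipped to the ball (cf. Lemma 4.1 with b_i = 0).
    a = x_tilde - lam / mu
    if abs(a) <= 1.0 / mu:
        return 0.0
    if abs(a) <= 1.0 / mu + D:
        return np.sign(a) * (abs(a) - 1.0 / mu)
    return np.sign(a) * D

lam, x_bar, mu, T = 0.0, 0.0, 1.0, 50
for stage in range(12):                  # homotopy: halve mu, restart from last lam
    xs = []
    for t in range(T):                   # inner stage: plain dual gradient descent
        x = x_of_lambda(lam, mu)         # (the paper uses an accelerated method)
        lam += mu * (x - 1.0)            # step size mu = 1/(smoothness modulus)
        xs.append(x)
    x_bar = sum(xs) / len(xs)            # restarted (unweighted) primal average
    mu /= 2.0

assert abs(x_bar - 1.0) < 0.05           # close to the optimal primal solution
```

Each stage restarts from the multiplier computed in the previous one, so the dual iterate is already close to optimal when μ shrinks; this is the mechanism the convergence analysis of Section 3 exploits.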

## 3 Convergence Time Results

We start by defining the set of optimal Lagrange multipliers for the smoothed problem (by Assumption 2.1(a) and Farkas’ Lemma, this set is non-empty):

 Λ*_μ := {λ*_μ ∈ R^N : F_μ(λ*_μ) ≤ F_μ(λ), ∀λ ∈ R^N}.   (12)

Our convergence time analysis involves two steps. The first step is to derive a primal convergence time bound for Algorithm 1, which involves the location of the initial Lagrange multiplier at the beginning of the stage. The details are given in Supplement A.2.

###### Theorem 3.1.

Suppose Assumption 2.1(a)-(b) holds. For any smoothing parameter μ > 0, any number of iterations T, and any initial primal-dual pair, we have the following performance bound for Algorithm 1:

 (13)

 ∥A x̄_T − b∥ ≤ (2σ_max(AᵀA)/(μ S_T)) ( ∥λ̃* − λ̃∥ + dist(λ*_μ, Λ*) ),   (14)

where x̄_T is the weighted average of the primal iterates produced by Algorithm 1, S_T is the corresponding sum of averaging weights, λ̃ and λ̃* denote the projections of the initial and optimal Lagrange multipliers onto the column space of A, and λ*_μ is any point in Λ*_μ defined in (12).

An inductive argument shows that S_T grows quadratically in T. Thus, Theorem 3.1 already gives an O(1/ε) convergence time by setting μ = Θ(ε) and T = Θ(1/ε). Note that this is the best trade-off we can get from Theorem 3.1 when simply bounding the terms ∥λ̃* − λ̃∥ and dist(λ*_μ, Λ*) by constants. To see how this bound leads to an improved convergence time when running in multiple rounds, suppose the computation from the last round gives a multiplier that is close enough to the optimal set Λ*; then ∥λ̃* − λ̃∥ would be small. When the local error bound condition holds, one can show (for μ small enough that λ*_μ ∈ S_δ) that dist(λ*_μ, Λ*) ≤ C_δ(μD²/2)^β = O(μ^β). As a consequence, one is able to choose μ smaller than ε and get a better trade-off. Formally, we have the following overall performance bound. The proof is given in Supplement A.3.

###### Theorem 3.2.

Suppose Assumption 2.1 holds, and the smoothing parameter, per-stage iteration count, and number of stages K are chosen as in Algorithm 2. The proposed homotopy method achieves the following objective and constraint violation bounds:

 ∥A x̄^(K) − b∥ ≤ 24 (1 + C_δ) C_δ² (2M)^β ε,

with running time O(ε^{−2/(2+β)} log₂(ε^{−1})), i.e. the algorithm achieves an ε-approximation with convergence time O(ε^{−2/(2+β)} log₂(ε^{−1})).

## 4 Distributed Geometric Median

Consider the problem of computing the geometric median over a connected network (V, E), where V = {1, 2, ⋯, n} is a set of n nodes and E = {e_ij} is a collection of undirected edges; e_ij = 1 if there exists an undirected edge between node i and node j, and e_ij = 0 otherwise. We take e_ii = 1, ∀i ∈ V, and since the graph is undirected we always have e_ij = e_ji. Two nodes i and j are said to be neighbors of each other if e_ij = 1. Each node i holds a local vector b_i ∈ R^d, and the goal is to compute the solution to (5) without a central controller, i.e. each node can only communicate with its neighbors.

Computing the geometric median over a network has been considered in several previous works, and various distributed algorithms have been developed, such as the decentralized subgradient method (DSM, Nedic and Ozdaglar (2009); Yuan et al. (2016)), PG-EXTRA (Shi et al. (2015)) and ADMM (Shi et al. (2014); Deng et al. (2017)). The best known convergence time for this problem is O(1/ε). In this section, we will show that it can be written in the form of problem (1-2), with a Lagrange dual function that is locally quadratic and an optimal Lagrange multiplier that is unique up to the null space of A, thereby satisfying Assumption 2.1.

Throughout this section, we assume that b₁, b₂, …, b_n are not co-linear and that they are distinct (i.e. b_i ≠ b_j if i ≠ j). We start by defining a mixing matrix W̃ ∈ R^{n×n} with respect to this network. The mixing matrix has the following properties:

1. Decentralization: the (i,j)-th entry w̃_ij = 0 if e_ij = 0.

2. Symmetry: W̃ = W̃ᵀ.

3. Null space property: the null space of I_{n×n} − W̃ satisfies N(I_{n×n} − W̃) = span{1}, where 1 is the all-ones vector in Rⁿ.

These conditions are rather mild and are satisfied by most doubly stochastic mixing matrices used in practice. Some specific examples are the Markov transition matrices of the max-degree chain and the Metropolis-Hastings chain (see Boyd et al. (2004) for detailed discussions). Let x_i ∈ R^d be the local variable on node i. Define

 x := [x₁ᵀ, x₂ᵀ, ⋯, x_nᵀ]ᵀ ∈ R^{nd},   A := (W_ij)_{i,j=1}^n ∈ R^{nd×nd},

where

 W_ij = (1 − w̃_ij) I_{d×d}  if i = j,   W_ij = −w̃_ij I_{d×d}  if i ≠ j,

and w̃_ij is the (i,j)-th entry of the mixing matrix W̃. By the aforementioned null space property of the mixing matrix W̃, it is easy to see that the null space of the matrix A is

 N(A) = {u ∈ R^{nd} : u = [u₁ᵀ, ⋯, u_nᵀ]ᵀ, u₁ = u₂ = ⋯ = u_n}.   (15)

Then, because of the null space property (15), one can equivalently write problem (5) in a “distributed fashion” as follows:

 min  ∑_{i=1}^n ∥x_i − b_i∥                       (16)
 s.t. Ax = 0,  ∥x_i − b_i∥ ≤ D,  i = 1, 2, ⋯, n,   (17)

where we set the constant D large enough so that the solution to (5) belongs to the set X := {x : ∥x_i − b_i∥ ≤ D, i = 1, …, n}. This is in the same form as (1-2), with f(x) = ∑_{i=1}^n ∥x_i − b_i∥ and b = 0.
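The construction can be checked numerically. The sketch below (a hypothetical 4-node ring with an illustrative symmetric, doubly stochastic mixing matrix) builds A = (I_n − W̃) ⊗ I_d blockwise as above and verifies that its null space is exactly the consensus subspace (15).

```python
import numpy as np

n, d = 4, 2
# Hypothetical 4-node ring; an illustrative symmetric, doubly stochastic
# mixing matrix whose zero pattern respects the (ring) edge set.
W_tilde = np.array([[0.50, 0.25, 0.00, 0.25],
                    [0.25, 0.50, 0.25, 0.00],
                    [0.00, 0.25, 0.50, 0.25],
                    [0.25, 0.00, 0.25, 0.50]])
assert np.allclose(W_tilde, W_tilde.T)           # symmetry
assert np.allclose(W_tilde.sum(axis=1), 1.0)     # rows sum to one

# A = (I_n - W_tilde) kron I_d, matching the block definition of W_ij above
A = np.kron(np.eye(n) - W_tilde, np.eye(d))

# consensus vectors (x_1 = ... = x_n) lie in the null space of A ...
u = np.tile([3.0, -1.0], n)
assert np.allclose(A @ u, 0.0)

# ... and the null space is exactly d-dimensional, i.e. N(A) is (15)
assert np.linalg.matrix_rank(A) == n * d - d
```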

### 4.1 Distributed implementation

In this section, we show how to implement the proposed algorithm to solve (16-17) in a distributed way. Let λ_t and λ̂_t be the vectors of Lagrange multipliers defined in Algorithm 1, where λ_t = [λ_{t,1}ᵀ, ⋯, λ_{t,n}ᵀ]ᵀ and each λ_{t,i} ∈ R^d. Then, each agent i in the network is responsible for updating the corresponding Lagrange multipliers λ_{t,i} and λ̂_{t,i} according to Algorithm 1, starting from locally held initial values. Note that the first, third and fourth steps in Algorithm 1 are naturally separable across agents. It remains to check whether the second step can be implemented in a distributed way.

Note that in the second step, we obtain the primal update by solving the following problem:

 x(λ̂_t) = argmax_{x: ∥x_i−b_i∥≤D, i=1,2,⋯,n}  −⟨λ̂_t, Ax⟩ − ∑_{i=1}^n ( ∥x_i − b_i∥ + (μ/2)∥x_i − x̃_i∥² ),

where x̃ = [x̃₁ᵀ, ⋯, x̃_nᵀ]ᵀ is a fixed point in the feasible set. We separate the maximization according to each agent i:

 x_i(λ̂_t) = argmax_{x_i: ∥x_i−b_i∥≤D}  −∑_{j=1}^n ⟨λ̂_{t,j}, W_ji x_i⟩ − ∥x_i − b_i∥ − (μ/2)∥x_i − x̃_i∥².

Note that, according to the definition of W_ji, it equals 0 if agent j is not a neighbor of agent i. More specifically, let N_i be the set of neighbors of agent i (including agent i itself); then, the above maximization problem can be equivalently written as

 argmax_{x_i: ∥x_i−b_i∥≤D}  −∑_{j∈N_i} ⟨λ̂_{t,j}, W_ji x_i⟩ − ∥x_i − b_i∥ − (μ/2)∥x_i − x̃_i∥²
 = argmax_{x_i: ∥x_i−b_i∥≤D}  −⟨∑_{j∈N_i} W_ji λ̂_{t,j}, x_i⟩ − ∥x_i − b_i∥ − (μ/2)∥x_i − x̃_i∥²,   i ∈ {1, 2, ⋯, n},

where we used the fact that W_jiᵀ = W_ji. Solving this problem only requires local information at each agent. Completing the square gives

 x_i(λ̂_t) = argmax_{∥x_i−b_i∥≤D}  −(μ/2) ∥ x_i − ( x̃_i − (1/μ) ∑_{j∈N_i} W_ji λ̂_{t,j} ) ∥² − ∥x_i − b_i∥.   (18)

The solution to such a subproblem has a closed form, as is shown in the following lemma (the proof is given in Supplement A.4):

###### Lemma 4.1.

Let a_i := x̃_i − (1/μ) ∑_{j∈N_i} W_ji λ̂_{t,j}; then, the solution to (18) has the following closed form:

 x_i(λ̂_t) =
   b_i,                                                   if ∥b_i − a_i∥ ≤ 1/μ,
   b_i − ((b_i − a_i)/∥b_i − a_i∥) (∥b_i − a_i∥ − 1/μ),   if 1/μ < ∥b_i − a_i∥ ≤ 1/μ + D,
   b_i − ((b_i − a_i)/∥b_i − a_i∥) D,                     otherwise.
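The closed form can be sanity-checked numerically: the snippet below implements the formula of Lemma 4.1 and confirms on random instances (with illustrative values of μ and D) that the returned point is feasible and that no randomly drawn feasible point achieves a better objective in (18).

```python
import numpy as np

def prox_step(b_i, a_i, mu, D):
    # Closed-form solution of (18), as stated in Lemma 4.1.
    r = np.linalg.norm(b_i - a_i)
    if r <= 1.0 / mu:
        return b_i.copy()
    if r <= 1.0 / mu + D:
        return b_i - (b_i - a_i) / r * (r - 1.0 / mu)
    return b_i - (b_i - a_i) / r * D

def obj(x, b_i, a_i, mu):
    # (18) maximizes the negative of this quantity, i.e. minimizes
    # (mu/2)*||x - a_i||^2 + ||x - b_i|| over the ball ||x - b_i|| <= D
    return 0.5 * mu * np.linalg.norm(x - a_i) ** 2 + np.linalg.norm(x - b_i)

rng = np.random.default_rng(1)
mu, D = 0.5, 3.0                          # illustrative values
for _ in range(100):
    b_i = rng.normal(size=2)
    a_i = rng.normal(scale=4.0, size=2)
    x_star = prox_step(b_i, a_i, mu, D)
    assert np.linalg.norm(x_star - b_i) <= D + 1e-9        # feasibility
    for _ in range(200):                  # no random feasible point does better
        v = rng.normal(size=2)
        y = b_i + D * rng.uniform() * v / np.linalg.norm(v)
        assert obj(x_star, b_i, a_i, mu) <= obj(y, b_i, a_i, mu) + 1e-9
```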

### 4.2 Local error bound condition

The proof of the following theorem is given in Supplement A.5.

###### Theorem 4.1.

The Lagrange dual function of (16-17) is non-smooth and is given by

 F(λ) = −⟨Aᵀλ, b⟩ + D ∑_{i=1}^n ( ∥A_{[i]}ᵀ λ∥ − 1 ) · I( ∥A_{[i]}ᵀ λ∥ > 1 ),

where A_{[i]} is the i-th column block of the matrix A, and I(·) is the indicator function, which takes the value 1 if the statement inside holds and 0 otherwise. Let Λ* be the set of optimal Lagrange multipliers defined according to (8). Then, for any δ > 0, there exists a constant C_δ > 0 such that

 dist(λ, Λ*) ≤ C_δ (F(λ) − F*)^{1/2},   ∀λ ∈ S_δ.

Furthermore, there exists a unique vector ν* such that P_A λ* = ν* for all λ* ∈ Λ*, i.e. Assumption 2.1(d) holds. Thus, applying the proposed method gives a convergence time of O(ε^{−4/5} log₂(ε^{−1})).
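The closed form of F(λ) can be verified numerically. In the sketch below (our own check, with an illustrative mixing matrix and random data), the expression above upper-bounds the inner dual objective at every feasible x, with equality attained at the analytic maximizer x_i = b_i − D·A_{[i]}ᵀλ/∥A_{[i]}ᵀλ∥ whenever ∥A_{[i]}ᵀλ∥ > 1 (and x_i = b_i otherwise).

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, D = 4, 2, 3.0
W_tilde = np.array([[0.50, 0.25, 0.00, 0.25],     # illustrative mixing matrix
                    [0.25, 0.50, 0.25, 0.00],     # (hypothetical 4-node ring)
                    [0.00, 0.25, 0.50, 0.25],
                    [0.25, 0.00, 0.25, 0.50]])
A = np.kron(np.eye(n) - W_tilde, np.eye(d))
b = rng.normal(size=n * d)                        # stacked local vectors b_i

def F_closed(lam):
    # the closed form above: -<A^T lam, b> + D * sum_i (||A_[i]^T lam|| - 1)_+
    g = (A.T @ lam).reshape(n, d)                 # i-th row is A_[i]^T lam
    hinge = np.maximum(np.linalg.norm(g, axis=1) - 1.0, 0.0)
    return -(A.T @ lam) @ b + D * hinge.sum()

def inner_obj(x, lam):
    # the objective maximized in the dual: -<lam, Ax> - sum_i ||x_i - b_i||
    return -lam @ (A @ x) - np.linalg.norm((x - b).reshape(n, d), axis=1).sum()

lam = rng.normal(size=n * d)
Fc = F_closed(lam)
best = -np.inf
for _ in range(2000):                             # random feasible x never exceed F
    y = rng.normal(size=(n, d))
    y *= (D * rng.uniform(size=(n, 1))) / np.linalg.norm(y, axis=1, keepdims=True)
    best = max(best, inner_obj(b + y.reshape(-1), lam))
assert best <= Fc + 1e-9

g = (A.T @ lam).reshape(n, d)                     # analytic maximizer attains F
norms = np.linalg.norm(g, axis=1, keepdims=True)
scale = np.where(norms > 1.0, D / norms, 0.0)
x_max = b.reshape(n, d) - scale * g
assert abs(inner_obj(x_max.reshape(-1), lam) - Fc) < 1e-9
```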

## 5 Simulation Experiments

In this section, we conduct simulation experiments on the distributed geometric median problem. Each vector b_i ∈ R^d is sampled from a uniform distribution, i.e. each entry of b_i is independently sampled from a uniform distribution over a fixed interval. We compare our algorithm with DSM (Nedic and Ozdaglar (2009)), PG-EXTRA (Shi et al. (2015)), Jacobian parallel ADMM (Deng et al. (2017)) and Smoothing (Necoara and Suykens (2008)) under different network sizes. Each network is randomly generated with a particular connectivity ratio (defined as the number of edges divided by the total number of possible edges), and the mixing matrix is chosen to be the Metropolis-Hastings chain (Boyd et al. (2004)), which can be computed in a distributed manner. We use the relative error ∥x(t) − x*∥/∥x(0) − x*∥ as the performance metric for each iteration t, where x(0) is the initial primal variable and x* is the optimal solution computed by CVX (Grant et al. (2008)). For our proposed algorithm, x(t) is the restarted primal average up to the current iteration; for all other algorithms, x(t) is the primal average up to the current iteration. The results are shown below. We see that in all cases our proposed algorithm performs at least comparably to, and usually much better than, the other algorithms. For detailed simulation setups and additional simulation results, see Supplement A.6.

#### Acknowledgments

The authors thank Stanislav Minsker and Jason D. Lee for helpful discussions related to the geometric median problem. Qing Ling’s research is supported in part by the National Science Foundation China under Grant 61573331 and the National Science Foundation Anhui under Grant 1608085QF130. Michael J. Neely’s research is supported in part by the National Science Foundation under Grant CCF-1718477.

## References

• Beck et al. (2014) Beck, A., A. Nedic, A. Ozdaglar, and M. Teboulle (2014). An O(1/k) gradient method for network resource allocation problems. IEEE Transactions on Control of Network Systems 1(1), 64–73.
• Bertsekas (1999) Bertsekas, D. P. (1999). Nonlinear programming. Athena scientific Belmont.
• Boyd et al. (2004) Boyd, S., P. Diaconis, and L. Xiao (2004). Fastest mixing Markov chain on a graph. SIAM Review 46(4), 667–689.
• Burke and Tseng (1996) Burke, J. V. and P. Tseng (1996). A unified analysis of Hoffman’s bound via Fenchel duality. SIAM Journal on Optimization 6(2), 265–282.
• Cohen et al. (2016) Cohen, M. B., Y. T. Lee, G. Miller, J. Pachocki, and A. Sidford (2016). Geometric median in nearly linear time. In Proceedings of the forty-eighth annual ACM symposium on Theory of Computing, pp. 9–21. ACM.
• Deng et al. (2017) Deng, W., M.-J. Lai, Z. Peng, and W. Yin (2017). Parallel multi-block ADMM with O(1/k) convergence. Journal of Scientific Computing 71(2), 712–736.
• Deng and Yin (2016) Deng, W. and W. Yin (2016). On the global and linear convergence of the generalized alternating direction method of multipliers. Journal of Scientific Computing 66(3), 889–916.
• Duchi et al. (2014) Duchi, J. C., M. I. Jordan, M. J. Wainwright, and Y. Zhang (2014). Optimality guarantees for distributed statistical estimation. arXiv preprint arXiv:1405.0782.
• Gidel et al. (2018) Gidel, G., F. Pedregosa, and S. Lacoste-Julien (2018). Frank-wolfe splitting via augmented lagrangian method. arXiv preprint arXiv:1804.03176.
• Grant et al. (2008) Grant, M., S. Boyd, and Y. Ye (2008). Cvx: Matlab software for disciplined convex programming.
• Han et al. (2015) Han, D., D. Sun, and L. Zhang (2015). Linear rate convergence of the alternating direction method of multipliers for convex composite quadratic and semi-definite programming. arXiv preprint arXiv:1508.02134.
• Lan and Monteiro (2013) Lan, G. and R. D. Monteiro (2013). Iteration-complexity of first-order penalty methods for convex programming. Mathematical Programming 138(1-2), 115–139.
• Li et al. (2016) Li, J., G. Chen, Z. Dong, and Z. Wu (2016). A fast dual proximal-gradient method for separable convex optimization with linear coupled constraints. Computational Optimization and Applications 64(3), 671–697.
• Ling and Tian (2010) Ling, Q. and Z. Tian (2010). Decentralized sparse signal recovery for compressive sleeping wireless sensor networks. IEEE Transactions on Signal Processing 58(7), 3816–3827.
• Luo and Luo (1994) Luo, X.-D. and Z.-Q. Luo (1994). Extension of hoffman’s error bound to polynomial systems. SIAM Journal on Optimization 4(2), 383–392.
• Minsker et al. (2014) Minsker, S., S. Srivastava, L. Lin, and D. B. Dunson (2014). Robust and scalable bayes via a median of subset posterior measures. arXiv preprint arXiv:1403.2660.
• Minsker and Strawn (2017) Minsker, S. and N. Strawn (2017). Distributed statistical estimation and rates of convergence in normal approximation. arXiv preprint arXiv:1704.02658.
• Motzkin (1952) Motzkin, T. (1952). Contributions to the theory of linear inequalities. D.R. Fulkerson (Transl.) (Santa Monica: RAND Corporation). RAND Corporation Translation 22.
• Necoara and Suykens (2008) Necoara, I. and J. A. Suykens (2008). Application of a smoothing technique to decomposition in convex optimization. IEEE Transactions on Automatic control 53(11), 2674–2679.
• Nedic and Ozdaglar (2009) Nedic, A. and A. Ozdaglar (2009). Distributed subgradient methods for multi-agent optimization. IEEE Transactions on Automatic Control 54(1), 48–61.
• Nesterov (2005) Nesterov, Y. (2005). Smooth minimization of non-smooth functions. Mathematical programming 103(1), 127–152.
• Nesterov (2015a) Nesterov, Y. (2015a). Complexity bounds for primal-dual methods minimizing the model of objective function. Mathematical Programming, 1–20.
• Nesterov (2015b) Nesterov, Y. (2015b). Universal gradient methods for convex optimization problems. Mathematical Programming 152(1-2), 381–404.
• Osborne et al. (2000) Osborne, M. R., B. Presnell, and B. A. Turlach (2000). A new approach to variable selection in least squares problems. IMA journal of numerical analysis 20(3), 389–403.
• Pang (1997) Pang, J.-S. (1997, Oct). Error bounds in mathematical programming. Mathematical Programming 79(1), 299–332.
• Parrilo and Sturmfels (2003) Parrilo, P. A. and B. Sturmfels (2003). Minimizing polynomial functions. Algorithmic and quantitative real algebraic geometry, DIMACS Series in Discrete Mathematics and Theoretical Computer Science 60, 83–99.
• Shi et al. (2015) Shi, W., Q. Ling, G. Wu, and W. Yin (2015). A proximal gradient algorithm for decentralized composite optimization. IEEE Transactions on Signal Processing 63(22), 6013–6023.
• Shi et al. (2014) Shi, W., Q. Ling, K. Yuan, G. Wu, and W. Yin (2014). On the linear convergence of the admm in decentralized consensus optimization. IEEE Trans. Signal Processing 62(7), 1750–1761.
• Tran-Dinh et al. (2018) Tran-Dinh, Q., O. Fercoq, and V. Cevher (2018). A smooth primal-dual optimization framework for nonsmooth composite convex minimization. SIAM Journal on Optimization 28(1), 96–134.
• Tseng (2010) Tseng, P. (2010). Approximation accuracy, gradient methods, and error bound for structured convex optimization. Mathematical Programming 125(2), 263–295.
• Wang and Pang (1994) Wang, T. and J.-S. Pang (1994). Global error bounds for convex quadratic inequality systems. Optimization 31(1), 1–12.
• Wei et al. (2015) Wei, X., H. Yu, and M. J. Neely (2015). A probabilistic sample path convergence time analysis of drift-plus-penalty algorithm for stochastic optimization. arXiv preprint arXiv:1510.02973.
• Weiszfeld and Plastria (2009) Weiszfeld, E. and F. Plastria (2009). On the point for which the sum of the distances to n given points is minimum. Annals of Operations Research 167(1), 7–41.
• Xiao and Zhang (2013) Xiao, L. and T. Zhang (2013). A proximal-gradient homotopy method for the sparse least-squares problem. SIAM Journal on Optimization 23(2), 1062–1091.
• Xu et al. (2017) Xu, Y., M. Liu, Q. Lin, and T. Yang (2017). Admm without a fixed penalty parameter: Faster convergence with new adaptive penalization. In Advances in Neural Information Processing Systems, pp. 1267–1277.
• Xu et al. (2016) Xu, Y., Y. Yan, Q. Lin, and T. Yang (2016). Homotopy smoothing for non-smooth problems with lower complexity than O(1/ε). In Advances in Neural Information Processing Systems, pp. 1208–1216.
• Xue and Ye (1997) Xue, G. and Y. Ye (1997). An efficient algorithm for minimizing a sum of euclidean norms with applications. SIAM Journal on Optimization 7(4), 1017–1036.
• Yang and Lin (2015) Yang, T. and Q. Lin (2015). RSG: Beating subgradient method without smoothness and strong convexity. arXiv preprint arXiv:1512.03107.
• Yin et al. (2018) Yin, D., Y. Chen, K. Ramchandran, and P. Bartlett (2018). Byzantine-robust distributed learning: Towards optimal statistical rates. arXiv preprint arXiv:1803.01498.
• Yu and Neely (2017a) Yu, H. and M. J. Neely (2017a). A new backpressure algorithm for joint rate control and routing with vanishing utility optimality gaps and finite queue lengths. In INFOCOM 2017-IEEE Conference on Computer Communications, IEEE, pp. 1–9. IEEE.
• Yu and Neely (2017b) Yu, H. and M. J. Neely (2017b). A simple parallel algorithm with an O(1/t) convergence rate for general convex programs. SIAM Journal on Optimization 27(2), 759–783.
• Yu and Neely (2018) Yu, H. and M. J. Neely (2018). On the convergence time of dual subgradient methods for strongly convex programs. IEEE Transactions on Automatic Control.
• Yuan et al. (2016) Yuan, K., Q. Ling, and W. Yin (2016). On the convergence of decentralized gradient descent. SIAM Journal on Optimization 26(3), 1835–1854.
• Yurtsever et al. (2015) Yurtsever, A., Q. T. Dinh, and V. Cevher (2015). A universal primal-dual convex optimization framework. In Advances in Neural Information Processing Systems, pp. 3150–3158.
• Yurtsever et al. (2018) Yurtsever, A., O. Fercoq, F. Locatello, and V. Cevher (2018). A conditional gradient framework for composite convex minimization with applications to semidefinite programming. arXiv preprint arXiv:1804.08544.

## Appendix A Supplement

### a.1 Smoothing lemma

In this section, we show that adding the strongly convex term on the primal side indeed gives a smoothed dual.

###### Lemma A.1.

Let f_μ be defined as above and let g(x) = [g₁(x), ⋯, g_N(x)]ᵀ be a vector of Lipschitz continuous convex functions, i.e. ∥g(x) − g(y)∥ ≤ G∥x − y∥, ∀x, y ∈ X, where G > 0. Then, the Lagrange dual function

 d_μ(λ) := max_{x∈X} −⟨λ, g(x)⟩ − f_μ(x),   λ ∈ R^N

is smooth with modulus G²/μ. In particular, if g(x) = Ax − b, then the smooth modulus is equal to σ_max(AᵀA)/μ, where σ_max(AᵀA) denotes the maximum eigenvalue of AᵀA.

The proof of this lemma is rather standard (see also the proof of Lemma 6 of Yu and Neely (2018)), and the special case g(x) = Ax − b can also be derived from Fenchel duality (Beck et al. (2014)). The proof is included here for completeness.

###### Proof of Lemma a.1.

First of all, note that the function h_λ(x) := −⟨λ, g(x)⟩ − f_μ(x) is strongly concave; it follows that there exists a unique maximizer x(λ). By Danskin’s theorem (see Bertsekas (1999) for details), we have for any λ ∈ R^N,

 ∇d_μ(λ) = −g(x(λ)).

Now, consider any λ₁, λ₂ ∈ R^N; we have

 ∥∇d_μ(λ₁) − ∇d_μ(λ₂)∥ = ∥g(x(λ₁)) − g(x(λ₂))∥ ≤ G ∥x(λ₁) − x(λ₂)∥,   (19)

where the equality follows from Danskin’s theorem and the inequality from the Lipschitz continuity of g. Again, by the fact that h_λ(x) is strongly concave with modulus μ,

 h_{λ₁}(x(λ₂)) ≤ h_{λ₁}(x(λ₁)) − (μ/2)∥x(λ₁) − x(λ₂)∥²,   h_{λ₂}(x(λ₁)) ≤ h_{λ₂}(x(λ₂)) − (μ/2)∥x(λ₁) − x(λ₂)∥²,

which implies

 −⟨λ₁, g(x(λ₂))⟩ − f_μ(x(λ₂)) ≤ −⟨λ₁, g(x(λ₁))⟩ − f_μ(x(λ₁)) − (μ/2)∥x(λ₁) − x(λ₂)∥²,
 −⟨λ₂, g(x(λ₁))⟩ − f_μ(x(λ₁)) ≤ −⟨λ₂, g(x(λ₂))⟩ − f_μ(x(λ₂)) − (μ/2)∥x(λ₁) − x(λ₂)∥².

Adding the two inequalities gives

 μ∥x(λ₁) − x(λ₂)∥² ≤ ⟨λ₁ − λ₂, g(x(λ₂)) − g(x(λ₁))⟩
  ≤ ∥λ₁ − λ₂∥ · ∥g(x(λ₁)) − g(x(λ₂))∥
  ≤ G ∥λ₁ − λ₂∥ · ∥x(λ₁) − x(λ₂)∥,

where the last inequality follows from the Lipschitz continuity of g again. This implies

 ∥x(λ₁) − x(λ₂)∥ ≤ (G/μ) ∥λ₁ − λ₂∥.

Combining this inequality with (19) gives

 ∥∇d_μ(λ₁) − ∇d_μ(λ₂)∥ ≤ (G²/μ) ∥λ₁ − λ₂∥,

finishing the first part of the proof. The second part of the claim follows from the fact that G = √(σ_max(AᵀA)) when g(x) = Ax − b. ∎
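The bound of Lemma A.1 can be checked numerically in the linear case g(x) = Ax − b, taking f = 0 and X a Euclidean ball, in which case x(λ) is simply a projection (a small self-contained sketch with illustrative dimensions):

```python
import numpy as np

rng = np.random.default_rng(3)
N, d, mu, R = 3, 4, 0.2, 1.0                 # illustrative sizes
A = rng.normal(size=(N, d))
b = rng.normal(size=N)
x_tilde = np.zeros(d)                        # smoothing center in X = {||x|| <= R}

def proj_ball(x, R):
    nrm = np.linalg.norm(x)
    return x if nrm <= R else x * (R / nrm)

def grad_d_mu(lam):
    # With f = 0, the unique maximizer of -<lam, Ax - b> - (mu/2)||x - x_tilde||^2
    # over the ball is a Euclidean projection; by Danskin's theorem the gradient
    # of the smoothed dual is -(A x(lam) - b).
    x_lam = proj_ball(x_tilde - A.T @ lam / mu, R)
    return -(A @ x_lam - b)

L = np.linalg.eigvalsh(A.T @ A).max() / mu   # claimed modulus sigma_max(A^T A)/mu
for _ in range(500):
    l1, l2 = rng.normal(size=N), rng.normal(size=N)
    lhs = np.linalg.norm(grad_d_mu(l1) - grad_d_mu(l2))
    assert lhs <= L * np.linalg.norm(l1 - l2) + 1e-9
```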

### a.2 Proof of Theorem 3.1

In this section, we give a convergence time proof for each stage. As a preliminary, we have the following basic lemma, which bounds the perturbation of the Lagrange dual due to the primal smoothing.

###### Lemma A.2.

Let F(λ) and F_μ(λ) be the functions defined in (6) and (11), respectively. Then, for any λ ∈ R^N,

 0 ≤ F(λ) − F_μ(λ) ≤ μD²/2

and

 0 ≤ F(λ*) − F_μ(λ*_μ) ≤ μD²/2,

for any λ* ∈ Λ* and λ*_μ ∈ Λ*_μ.

###### Proof of Lemma a.2.

First of all, for any λ ∈ R^N, let

 x(λ) := argmax_{x∈X} h(x),  where h(x) := −⟨λ, Ax − b⟩ − f(x),
 x_μ(λ) := argmax_{x∈X} h_μ(x),  where h_μ(x) := −⟨λ, Ax − b⟩ − f_μ(x);

then, we have for any λ ∈ R^N,

 F(λ) − F_μ(λ) = h(x(λ)) − h_μ(x_μ(λ))
  = h(x(λ)) − h_μ(x(λ)) + h_μ(x(λ)) − h_μ(x_μ(λ))
  ≤ h(x(λ)) − h_μ(x(λ))
  = f_μ(x(λ)) − f(x(λ)) ≤ μD²/2,

where the first inequality follows from the fact that x_μ(λ) maximizes h_μ, and the last step uses f_μ(x) − f(x) = (μ/2)∥x − x̃∥² ≤ μD²/2. Similarly, we have

 F_μ(λ) − F(λ) = h_μ(x_μ(λ)) − h(x(λ))