Solving Nonsmooth Constrained Programs with Lower Complexity than $\mathcal{O}(1/\varepsilon)$: A Primal-Dual Homotopy Smoothing Approach
Abstract
We propose a new primal-dual homotopy smoothing algorithm for a linearly constrained convex program, where neither the primal nor the dual function has to be smooth or strongly convex. The best known iteration complexity for solving such a nonsmooth problem is $\mathcal{O}(1/\varepsilon)$. In this paper, we show that by leveraging a local error bound condition on the dual function, the proposed algorithm can achieve a better primal convergence time of $\mathcal{O}\big(\varepsilon^{-2/(2+\beta)}\log_2(\varepsilon^{-1})\big)$, where $\beta\in(0,1]$ is a local error bound parameter. As an example application of the general algorithm, we show that the distributed geometric median problem, which can be formulated as a constrained convex program, has its dual function nonsmooth but satisfying the aforementioned local error bound condition with $\beta=1/2$, therefore enjoying a convergence time of $\mathcal{O}\big(\varepsilon^{-4/5}\log_2(\varepsilon^{-1})\big)$. This result improves upon the $\mathcal{O}(1/\varepsilon)$ convergence time bound achieved by existing distributed optimization algorithms. Simulation experiments also demonstrate the performance of our proposed algorithm.
1 Introduction
We consider the following linearly constrained convex optimization problem:
$$\min_{x\in\mathcal{X}}\ f(x) \tag{1}$$
$$\text{s.t.}\quad Ax - b = 0, \tag{2}$$
where $\mathcal{X}\subseteq\mathbb{R}^d$ is a compact convex set, $f$ is a convex function, $A\in\mathbb{R}^{N\times d}$ and $b\in\mathbb{R}^N$. Such an optimization problem has been studied in numerous works under various application scenarios such as machine learning (Yurtsever et al. (2015)), signal processing (Ling and Tian (2010)) and communication networks (Yu and Neely (2017a)). The goal of this work is to design new algorithms for (1)-(2) achieving an $\varepsilon$-approximation with better convergence time than $\mathcal{O}(1/\varepsilon)$.
1.1 Optimization algorithms related to constrained convex programs
Since enforcing the constraint (2) generally requires a significant amount of computation in large-scale systems, the majority of the scalable algorithms solving problem (1)-(2) are of primal-dual type. Generally, the efficiency of these algorithms depends on two key properties of the dual function of (1)-(2), namely, the Lipschitz gradient and strong convexity. When the dual function of (1)-(2) is smooth, primal-dual type algorithms with Nesterov's acceleration on the dual of (1)-(2) can achieve a convergence time of $\mathcal{O}(1/\sqrt{\varepsilon})$ (e.g. Yurtsever et al. (2015); Tran-Dinh et al. (2018)). When the dual function has both the Lipschitz continuous gradient and the strong convexity property, algorithms such as dual subgradient and ADMM enjoy linear convergence (e.g. Yu and Neely (2018); Deng and Yin (2016)). However, when neither property is assumed, the basic dual-subgradient type algorithm gives a relatively worse convergence time of $\mathcal{O}(1/\varepsilon^2)$ (e.g. Wei et al. (2015)), while its improved variants yield a convergence time of $\mathcal{O}(1/\varepsilon)$ (e.g. Lan and Monteiro (2013); Deng et al. (2017); Yu and Neely (2017b); Yurtsever et al. (2018); Gidel et al. (2018)).
More recently, several works seek to achieve a better convergence time than $\mathcal{O}(1/\varepsilon)$ under weaker assumptions than Lipschitz gradient and strong convexity of the dual function. Specifically, building upon the recent progress on gradient-type methods for optimization with Hölder continuous gradient (e.g. Nesterov (2015a, b)), the work Yurtsever et al. (2015) develops a primal-dual gradient method solving (1)-(2), which achieves a convergence time of $\mathcal{O}\big(\varepsilon^{-2/(1+3\nu)}\big)$, where $\nu\in[0,1]$ is the modulus of Hölder continuity of the gradient of the dual function of the formulation (1)-(2).
In the current work, we aim to address the following question: can one design a scalable algorithm with lower complexity than $\mathcal{O}(1/\varepsilon)$ solving (1)-(2), when both the primal and the dual functions are possibly nonsmooth? More specifically, we look at a class of problems whose dual functions satisfy only a local error bound, and show that one can indeed obtain a faster primal convergence via a primal-dual homotopy smoothing method under a local error bound condition on the dual function.
Homotopy methods were first developed in the statistics literature in relation to the model selection problem for LASSO, where, instead of computing a single solution for LASSO, one computes a complete solution path by varying the regularization parameter from large to small (e.g. Osborne et al. (2000); Xiao and Zhang (2013)). The smoothing technique, on the other hand, was developed in Nesterov (2005) for structured nonsmooth problems of the form
$$\min_{x\in\mathcal{X}}\ f(x), \tag{3}$$
where $\mathcal{X}$ is a closed convex set and $f$ is a convex function which, though possibly nonsmooth, can be explicitly written as
$$f(x) = \max_{u\in\mathcal{U}}\ \langle Ax, u\rangle - \phi(u), \tag{4}$$
where $\langle\cdot,\cdot\rangle$ denotes the inner product of two vectors, $\mathcal{U}$ is a closed convex set, and $\phi$ is a convex function. By adding a strongly concave proximal function of $u$ with a smoothing parameter $\mu$ into the definition of $f$, one can obtain a smoothed approximation of $f$ with smooth modulus $\mathcal{O}(1/\mu)$. Then, Nesterov (2005) employs the accelerated gradient method on the smoothed approximation (which delivers an $\mathcal{O}(1/\sqrt{\varepsilon})$ convergence time for the approximation), and sets the parameter $\mu$ to be $\mathcal{O}(\varepsilon)$, which gives an overall convergence time of $\mathcal{O}(1/\varepsilon)$. An important follow-up question is whether such a smoothing technique can also be applied to solve (1)-(2) with the same primal convergence time. This question is answered in the subsequent works Necoara and Suykens (2008); Li et al. (2016); Tran-Dinh et al. (2018), which show that one can indeed obtain an $\mathcal{O}(1/\varepsilon)$ primal convergence time for problem (1)-(2) via smoothing.
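To make the smoothing step concrete, here is a small self-contained illustration (our own toy example, not taken from the paper): taking $f(x)=|x|=\max_{u\in[-1,1]} ux$ and subtracting the proximal term $(\mu/2)u^2$ yields the Huber function, which is smooth with modulus $1/\mu$ and uniformly within $\mu/2$ of $f$:

```python
import numpy as np

def f(x):
    # Nonsmooth target: f(x) = |x| = max_{u in [-1,1]} u * x
    return abs(x)

def f_smoothed(x, mu):
    # Nesterov smoothing: f_mu(x) = max_{u in [-1,1]} (u*x - (mu/2)*u^2).
    # The inner maximizer is u* = clip(x/mu, -1, 1), giving the Huber
    # function, which is smooth with modulus 1/mu.
    u = float(np.clip(x / mu, -1.0, 1.0))
    return u * x - 0.5 * mu * u ** 2

mu = 0.1
for x in np.linspace(-2.0, 2.0, 41):
    gap = f(x) - f_smoothed(x, mu)
    assert -1e-12 <= gap <= mu / 2 + 1e-12   # uniform error at most mu/2
```

Setting $\mu=\mathcal{O}(\varepsilon)$ trades a smoother surrogate (larger modulus $1/\mu$) against a smaller approximation error, which is exactly the balance behind the $\mathcal{O}(1/\varepsilon)$ rate above.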
Combining the homotopy method with a smoothing technique to solve problems of the form (3) has been considered by a series of works including Yang and Lin (2015), Xu et al. (2016) and Xu et al. (2017). Specifically, the works Yang and Lin (2015) and Xu et al. (2016) consider a multi-stage algorithm which starts from a large smoothing parameter and then decreases this parameter over time. They show that when the function $f$ satisfies a local error bound, such a combination gives an improved convergence time for minimizing the unconstrained problem (3). The work Xu et al. (2017) shows that the homotopy method can also be combined with ADMM to achieve a faster convergence for a class of composite problems in which the feasible set is closed and convex, both component functions are convex with one of them satisfying the local error bound, and the proximal operator of the other can be easily computed. However, due to the restrictions on the latter function in that paper, the method cannot be extended to handle problems of the form (1)-(2).
Contributions: In the current work, we show that a multi-stage homotopy smoothing method enjoys a primal convergence time of $\mathcal{O}\big(\varepsilon^{-2/(2+\beta)}\log_2(\varepsilon^{-1})\big)$ solving (1)-(2) when the dual function satisfies a local error bound condition with parameter $\beta\in(0,1]$. Our convergence time to achieve within $\varepsilon$ of optimality is measured in terms of the number of (unconstrained) maximization steps, which is a standard measure of convergence time for Lagrangian-type algorithms that turn a constrained problem into a sequence of unconstrained problems. The algorithm essentially restarts a weighted primal averaging process at each stage using the last Lagrange multiplier computed. This result improves upon the earlier $\mathcal{O}(1/\varepsilon)$ result of Necoara and Suykens (2008) and Li et al. (2016), and at the same time extends the scope of the homotopy smoothing method to a new class of problems involving linear constraints, namely (1)-(2). It is worth mentioning that a similar restarted smoothing strategy is proposed in the recent work Tran-Dinh et al. (2018) to solve problems including (1)-(2), where it is shown that, empirically, restarting the algorithm from the Lagrange multiplier computed in the last stage improves the convergence time. Here, we give one theoretical justification of such an improvement.
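The multi-stage restart idea can be sketched in a few lines. The toy sketch below (hypothetical function and parameter names; the paper's Algorithm 2 uses its own schedule and a primal-dual inner solver) minimizes the nonsmooth function $h(x)=|x-1|+|x+1|$ by running gradient descent on a Huber-smoothed surrogate, halving the smoothing parameter at each stage and warm-starting from the previous iterate:

```python
import numpy as np

def huber_grad(z, mu):
    # Gradient of the mu-smoothed absolute value (Huber function).
    return float(np.clip(z / mu, -1.0, 1.0))

def stage(x, mu, n_iters, step):
    # One stage: plain gradient descent on the smoothed objective
    # h_mu(x) = huber(x - 1) + huber(x + 1); stands in for Algorithm 1.
    for _ in range(n_iters):
        x -= step * (huber_grad(x - 1.0, mu) + huber_grad(x + 1.0, mu))
    return x

def homotopy(x0=5.0, mu0=1.0, n_stages=8):
    # Multi-stage homotopy: halve the smoothing parameter at every stage
    # and warm-start from the previous stage's iterate (the restart idea).
    x, mu = x0, mu0
    for _ in range(n_stages):
        x = stage(x, mu, n_iters=200, step=mu / 2.0)  # step ~ 1/(smooth modulus)
        mu /= 2.0
    return x

x_final = homotopy()   # any point in [-1, 1] is optimal, with value 2
```

Because each stage starts close to the previous stage's (more accurate) solution, later stages need far fewer iterations than a cold start would; the analysis in Section 3 makes this intuition precise for the constrained setting via the local error bound.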
1.2 The distributed geometric median problem
The geometric median problem, also known as the Fermat-Weber problem, has a long history (e.g. see Weiszfeld and Plastria (2009) for more details). Given a set of $n$ points $b_1,\dots,b_n\in\mathbb{R}^d$, we aim to find one point $x^*\in\mathbb{R}^d$ so as to minimize the sum of the Euclidean distances, i.e.
$$x^* \in \operatorname*{argmin}_{x\in\mathbb{R}^d}\ \sum_{i=1}^{n}\|x - b_i\|, \tag{5}$$
which is a nonsmooth convex optimization problem. It can be shown that the solution to this problem is unique as long as the points $b_1,\dots,b_n$ are not collinear. Linear convergence time algorithms solving (5) have also been developed in several works (e.g. Xue and Ye (1997), Parrilo and Sturmfels (2003), Cohen et al. (2016)). Our motivation for studying this problem is driven by its recent application in distributed statistical estimation, in which data are assumed to be randomly spread over multiple connected computational agents that produce intermediate estimators; these intermediate estimators are then aggregated in order to compute some statistics of the whole data set. Arguably one of the most widely used aggregation procedures is computing the geometric median of the local estimators (see, for example, Duchi et al. (2014), Minsker et al. (2014), Minsker and Strawn (2017), Yin et al. (2018)). It can be shown that the geometric median is robust against arbitrary corruptions of local estimators in the sense that the final estimator is stable as long as at least half of the nodes in the system perform as expected.
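For reference, the classical centralized baseline for (5) is Weiszfeld's fixed-point iteration, sketched below. It is a standard method shown only for context; it is not the distributed algorithm developed in this paper:

```python
import numpy as np

def geometric_median(points, n_iters=200, tol=1e-9):
    """Weiszfeld's fixed-point iteration for problem (5): repeatedly move
    to the distance-weighted average of the data points."""
    x = points.mean(axis=0)  # start from the centroid
    for _ in range(n_iters):
        d = np.linalg.norm(points - x, axis=1)
        d = np.maximum(d, tol)          # guard against division by zero
        w = 1.0 / d
        x_new = (w[:, None] * points).sum(axis=0) / w.sum()
        if np.linalg.norm(x_new - x) < tol:
            break
        x = x_new
    return x

pts = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
med = geometric_median(pts)   # by symmetry the median is the origin
```

Weiszfeld's method needs all points at one location, which is exactly what the distributed setting below rules out.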
Contributions: As an example application of our general algorithm, we look at the problem of computing the solution to (5) in a distributed scenario over a network of $n$ agents without any central controller, where each agent $i$ holds a local vector $b_i$. Remarkably, we show theoretically that such a problem, when formulated as (1)-(2), has its dual function nonsmooth but locally quadratic. Therefore, applying our proposed primal-dual homotopy smoothing method gives a convergence time of $\mathcal{O}\big(\varepsilon^{-4/5}\log_2(\varepsilon^{-1})\big)$. This result improves upon the performance bounds of the previously known decentralized optimization algorithms (e.g. PG-EXTRA, Shi et al. (2015), and decentralized ADMM, Shi et al. (2014)), which do not take into account the special structure of the problem and only obtain a convergence time of $\mathcal{O}(1/\varepsilon)$. Simulation experiments also demonstrate the superior ergodic convergence time of our algorithm compared to other algorithms.
2 Primal-Dual Homotopy Smoothing
2.1 Preliminaries
The Lagrange dual function of (1)-(2) is defined as follows:
$$F(\lambda) := \max_{x\in\mathcal{X}}\ \langle\lambda,\, b - Ax\rangle - f(x), \tag{6}$$
where $\lambda\in\mathbb{R}^N$ is the dual variable, $\mathcal{X}$ is a compact convex set, and the minimum of the dual function is $F^* := \min_{\lambda\in\mathbb{R}^N} F(\lambda)$. For any closed set $K\subseteq\mathbb{R}^N$ and point $y\in\mathbb{R}^N$, define the distance function of $y$ to the set $K$ as
$$\mathrm{dist}(y, K) := \min_{z\in K}\ \|y - z\|,$$
where $\|\cdot\|$ denotes the Euclidean norm.
For a convex function $F(\lambda)$, the $\delta$-sublevel set $\mathcal{S}_\delta$ is defined as
$$\mathcal{S}_\delta := \big\{\lambda:\ F(\lambda) \le F^* + \delta\big\}. \tag{7}$$
Furthermore, for any matrix $A$, we use $\sigma_{\max}(A^{T}A)$ to denote the largest eigenvalue of $A^{T}A$. Let
$$\Lambda^* := \operatorname*{argmin}_{\lambda\in\mathbb{R}^N} F(\lambda) \tag{8}$$
be the set of optimal Lagrange multipliers. Note that if the constraint (2) is feasible, then $\lambda^*\in\Lambda^*$ implies $\lambda^* + v\in\Lambda^*$ for any $v$ that satisfies $A^{T}v = 0$. The following definition introduces the notion of a local error bound.
Definition 2.1.
Let $F(\lambda)$ be a convex function over $\lambda\in\mathbb{R}^N$. Suppose $\Lambda^*$ is nonempty. The function $F(\lambda)$ is said to satisfy the local error bound with parameter $\beta\in(0,1]$ if there exists a $\delta>0$ such that for any $\lambda\in\mathcal{S}_\delta$,
$$\mathrm{dist}(\lambda,\Lambda^*) \le C_\delta\big(F(\lambda) - F^*\big)^{\beta}, \tag{9}$$
where $C_\delta$ is a positive constant possibly depending on $\delta$. In particular, when $\beta = 1/2$, $F(\lambda)$ is said to be locally quadratic, and when $\beta = 1$, it is said to be locally linear.
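For intuition, two one-dimensional examples (ours, not from the paper) illustrate the two regimes:

```latex
% Locally quadratic (\beta = 1/2): take F(\lambda)=\lambda^2, so
% \Lambda^*=\{0\} and F^*=0:
\mathrm{dist}(\lambda,\Lambda^*) = |\lambda| = \big(F(\lambda)-F^*\big)^{1/2}.
% Locally linear (\beta = 1): take F(\lambda)=|\lambda|:
\mathrm{dist}(\lambda,\Lambda^*) = |\lambda| = \big(F(\lambda)-F^*\big)^{1}.
```

A larger $\beta$ means the dual objective grows faster away from $\Lambda^*$, which is what the multi-stage analysis exploits.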
Remark 2.1.
Indeed, a wide range of popular optimization problems satisfy the local error bound condition. The work Tseng (2010) shows that if $\mathcal{X}$ is a polyhedron and $f$ has a Lipschitz continuous gradient and is strongly convex, then the dual function of (1)-(2) is locally linear. The work Burke and Tseng (1996) shows that when the objective is linear and $\mathcal{X}$ is a convex cone, the dual function is also locally linear. The value of $\beta$ has also been computed for several other problems (e.g. Pang (1997); Yang and Lin (2015)).
Definition 2.2 ($\varepsilon$-approximate solution). A point $x\in\mathcal{X}$ is said to be an $\varepsilon$-approximate solution of (1)-(2) if $f(x) - f^* \le \varepsilon$ and $\|Ax - b\| \le \varepsilon$, where $f^*$ is the optimal objective value of (1)-(2).
Throughout the paper, we adopt the following assumptions:
Assumption 2.1.
(a) The feasible set $\{x\in\mathcal{X}:\ Ax = b\}$ is nonempty and non-singleton.
(b) The set $\mathcal{X}$ is bounded, i.e.
$$\sup_{x,y\in\mathcal{X}}\ \|x - y\| \le D$$
for some positive constant $D$. Furthermore, the function $f(x)$ is also bounded, i.e.
$$\sup_{x\in\mathcal{X}}\ |f(x)| \le M$$
for some positive constant $M$.
(c) The dual function $F(\lambda)$ defined in (6) satisfies the local error bound for some parameter $\beta\in(0,1]$ and some level $\delta>0$.
(d) Let $\mathcal{P}_A$ be the projection operator onto the column space of $A$. There exists a unique vector $\nu^*$ such that $\mathcal{P}_A\lambda^* = \nu^*$ for any $\lambda^*\in\Lambda^*$, i.e. $\Lambda^* = \{\lambda:\ \mathcal{P}_A\lambda = \nu^*\}$.
Note that assumptions (a) and (b) are very mild and quite standard. For most applications, it is enough to check (c) and (d). We will show, for example, in Section 4 that the distributed geometric median problem satisfies all of the assumptions. Finally, we say a function $g$ is smooth with modulus $L > 0$ if
$$\|\nabla g(x) - \nabla g(y)\| \le L\|x - y\|,\quad \forall x, y.$$
2.2 Primal-dual homotopy smoothing algorithm
This section introduces our proposed algorithm for the optimization problem (1)-(2) satisfying Assumption 2.1. The idea of smoothing is to introduce a smoothed Lagrange dual function that approximates the original, possibly nonsmooth, dual function defined in (6).
For any constant $\mu > 0$, define
$$f_\mu(x) := f(x) + \mu\|x - \tilde{x}\|^2, \tag{10}$$
where $\tilde{x}$ is an arbitrary fixed point in $\mathcal{X}$. For simplicity of notation, we drop the dependency on $\tilde{x}$ in the definition of $f_\mu(x)$. Then, by the boundedness assumption on $\mathcal{X}$, we have $f(x) \le f_\mu(x) \le f(x) + \mu D^2$ for all $x\in\mathcal{X}$. For any $\lambda\in\mathbb{R}^N$, define
$$F_\mu(\lambda) := \max_{x\in\mathcal{X}}\ \langle\lambda,\, b - Ax\rangle - f_\mu(x) \tag{11}$$
as the smoothed dual function. The fact that $F_\mu(\lambda)$ is indeed smooth with modulus $\sigma_{\max}(A^{T}A)/(2\mu)$ follows from Lemma A.1 in the Supplement. Thus, one is able to apply an accelerated gradient descent algorithm on this modified Lagrange dual function, which is detailed in Algorithm 1 below, starting from an initial primal-dual pair.
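The gradient of the smoothed dual can be obtained by solving the inner maximization and applying Danskin's theorem. The snippet below uses a toy instance of our own, with $f(x)=\tfrac12\|x\|^2$ and $\mathcal{X}=\mathbb{R}^d$ chosen so the inner maximizer has a closed form (the paper assumes a compact $\mathcal{X}$), and verifies numerically that $\nabla F_\mu(\lambda)$ equals the constraint residual $b - Ax^*(\lambda)$:

```python
import numpy as np

rng = np.random.default_rng(0)
d, N = 4, 3
A = rng.standard_normal((N, d))
b = rng.standard_normal(N)
x_tilde = rng.standard_normal(d)
mu = 0.5

def x_star(lam):
    # Inner maximizer of <lam, b - Ax> - 0.5||x||^2 - mu||x - x_tilde||^2.
    # With this toy f and X = R^d, the first-order condition gives a
    # closed form: x (1 + 2 mu) = 2 mu x_tilde - A^T lam.
    return (2 * mu * x_tilde - A.T @ lam) / (1.0 + 2 * mu)

def F_mu(lam):
    x = x_star(lam)
    return lam @ (b - A @ x) - 0.5 * x @ x - mu * np.sum((x - x_tilde) ** 2)

def grad_F_mu(lam):
    # Danskin's theorem: the gradient is the residual at the inner maximizer.
    return b - A @ x_star(lam)

lam = rng.standard_normal(N)
g = grad_F_mu(lam)
eps = 1e-6
for i in range(N):   # finite-difference check of the Danskin gradient
    e = np.zeros(N); e[i] = eps
    fd = (F_mu(lam + e) - F_mu(lam - e)) / (2 * eps)
    assert abs(fd - g[i]) < 1e-4
```

In Algorithm 1 this gradient evaluation is exactly the (unconstrained) primal maximization step counted in the convergence time.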
Our proposed algorithm runs Algorithm 1 in multiple stages, as detailed in Algorithm 2 below.
3 Convergence Time Results
We start by defining the set of optimal Lagrange multipliers for the smoothed problem:
$$\Lambda^*_\mu := \operatorname*{argmin}_{\lambda\in\mathbb{R}^N} F_\mu(\lambda). \tag{12}$$
Our convergence time analysis involves two steps. The first step is to derive a primal convergence time bound for Algorithm 1, which involves the location of the initial Lagrange multiplier at the beginning of the stage. The details are given in Supplement A.2.
Theorem 3.1.
An inductive argument shows that the above bound holds at every stage. Thus, Theorem 3.1 already gives an $\mathcal{O}(1/\varepsilon)$ convergence time by setting the smoothing parameter proportional to $\varepsilon$ and running a single stage. Note that this is the best trade-off we can get from Theorem 3.1 when simply bounding the distance terms by constants. To see how this bound leads to an improved convergence time when running in multiple rounds, suppose the computation from the last round gives a multiplier $\lambda$ that is close enough to the optimal set $\Lambda^*$; then $\mathrm{dist}(\lambda,\Lambda^*)$ would be small. When the local error bound condition holds, one can show that this distance shrinks geometrically across stages. As a consequence, one is able to choose a smaller smoothing parameter and get a better trade-off. Formally, we have the following overall performance bound. The proof is given in Supplement A.3.
Theorem 3.2.
Suppose Assumption 2.1 holds and the smoothing parameter, per-stage accuracy, and number of stages are chosen as in Supplement A.3. The proposed homotopy method achieves the following objective and constraint violation bound: the output is an $\varepsilon$-approximate solution in the sense of Definition 2.2, with convergence time $\mathcal{O}\big(\varepsilon^{-2/(2+\beta)}\log_2(\varepsilon^{-1})\big)$.
4 Distributed Geometric Median
Consider the problem of computing the geometric median over a connected network $(\mathcal{V},\mathcal{E})$, where $\mathcal{V}$ is a set of $n$ nodes and $\mathcal{E}$ is a collection of undirected edges: $(i,j)\in\mathcal{E}$ if there exists an undirected edge between node $i$ and node $j$, and $(i,j)\notin\mathcal{E}$ otherwise. Furthermore, since the graph is undirected, we always have $(i,j)\in\mathcal{E}$ if and only if $(j,i)\in\mathcal{E}$. Two nodes $i$ and $j$ are said to be neighbors of each other if $(i,j)\in\mathcal{E}$. Each node $i$ holds a local vector $b_i\in\mathbb{R}^d$, and the goal is to compute the solution to (5) without having a central controller, i.e. each node can only communicate with its neighbors.
Computing the geometric median over a network has been considered in several previous works, and various distributed algorithms have been developed, such as the decentralized subgradient method (DSM, Nedic and Ozdaglar (2009); Yuan et al. (2016)), PG-EXTRA (Shi et al. (2015)) and ADMM (Shi et al. (2014); Deng et al. (2017)). The best known convergence time for this problem is $\mathcal{O}(1/\varepsilon)$. In this section, we will show that the problem can be written in the form of (1)-(2), that its Lagrange dual function is locally quadratic, and that its optimal Lagrange multiplier is unique up to the null space of the constraint matrix, thereby satisfying Assumption 2.1.
Throughout this section, we assume that $b_1,\dots,b_n$ are not collinear and that they are distinct (i.e. $b_i \ne b_j$ if $i \ne j$). We start by defining a mixing matrix $W\in\mathbb{R}^{n\times n}$ with respect to this network. The mixing matrix will have the following properties:

Decentralization: the $(i,j)$-th entry $w_{ij} = 0$ if $i \ne j$ and $(i,j)\notin\mathcal{E}$.

Symmetry: $W = W^{T}$.

The null space of $I_n - W$ satisfies $\mathrm{null}(I_n - W) = \mathrm{span}\{\mathbf{1}\}$, where $\mathbf{1}$ is an all-ones vector in $\mathbb{R}^n$.
These conditions are rather mild and are satisfied by most doubly stochastic mixing matrices used in practice. Some specific examples are the Markov transition matrices of the max-degree chain and the Metropolis-Hastings chain (see Boyd et al. (2004) for detailed discussions). Let $x_i\in\mathbb{R}^d$ be the local variable on node $i$. Define the stacked variable $x := (x_1^{T},\dots,x_n^{T})^{T}\in\mathbb{R}^{nd}$ and a constraint matrix $A$ whose blocks are determined by the entries $w_{ij}$ of the mixing matrix $W$. By the aforementioned null-space property of the mixing matrix $W$, it is easy to see that the null space of the matrix $A$ is
$$\mathrm{null}(A) = \big\{x\in\mathbb{R}^{nd}:\ x_1 = x_2 = \cdots = x_n\big\}. \tag{15}$$
Then, because of the null space property (15), one can equivalently write problem (5) in a "distributed fashion" as follows:
$$\min\ \sum_{i=1}^{n}\|x_i - b_i\| \tag{16}$$
$$\text{s.t.}\quad Ax = 0,\quad x\in\mathcal{X}, \tag{17}$$
where $\mathcal{X}$ is a ball in $\mathbb{R}^{nd}$ whose radius is a constant chosen large enough that the solution belongs to the set $\mathcal{X}$. This is in the same form as (1)-(2) with $b = 0$.
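The mixing-matrix conditions above are easy to realize and check numerically. Below is a sketch using Metropolis-Hastings weights on a small path graph (our own toy construction; any matrix satisfying the three listed properties would do):

```python
import numpy as np

# A small 4-node path graph 1-2-3-4 (a toy example of ours).
n = 4
edges = [(0, 1), (1, 2), (2, 3)]
adj = np.zeros((n, n), dtype=bool)
for i, j in edges:
    adj[i, j] = adj[j, i] = True
deg = adj.sum(axis=1)

# Metropolis-Hastings weights: w_ij = 1/(1 + max(deg_i, deg_j)) on edges,
# w_ii = 1 - sum_{j != i} w_ij, and w_ij = 0 between non-neighbors.
W = np.zeros((n, n))
for i, j in edges:
    W[i, j] = W[j, i] = 1.0 / (1.0 + max(deg[i], deg[j]))
W += np.diag(1.0 - W.sum(axis=1))

assert np.allclose(W, W.T)               # symmetry
assert np.allclose(W.sum(axis=1), 1.0)   # rows sum to one
# Null-space property: for a connected graph, eigenvalue 1 of W is simple,
# so null(I - W) = span{1}.
eigvals = np.linalg.eigvalsh(np.eye(n) - W)
assert np.isclose(eigvals[0], 0.0) and eigvals[1] > 1e-8
```

The last check is exactly the property that forces consensus: $Ax = 0$ holds only when all local copies $x_i$ agree.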
4.1 Distributed implementation
In this section, we show how to implement the proposed algorithm to solve (16)-(17) in a distributed way. Let the vectors of Lagrange multipliers be as defined in Algorithm 1, where each agent $i$ holds a $d$-dimensional block $\lambda_i$. Then, each agent in the network is responsible for updating its corresponding Lagrange multipliers according to Algorithm 1, starting from the given initial values. Note that the first, third and fourth steps in Algorithm 1 are naturally separable across agents. It remains to check whether the second step can be implemented in a distributed way.
Note that in the second step, we obtain the primal update by solving a maximization problem of the form (11), where $\tilde{x}$ is a fixed point in the feasible set. We separate the maximization according to the different agents. Note that, according to the definition of $A$, the coupling term between agents $i$ and $j$ vanishes if agent $j$ is not a neighbor of agent $i$. More specifically, let $\mathcal{N}_i$ be the set of neighbors of agent $i$ (including agent $i$ itself); then, the maximization subproblem of agent $i$ involves only the multipliers $\{\lambda_j,\ j\in\mathcal{N}_i\}$, so solving it only requires local information from each agent. Completing the square gives
(18) 
The solution to such a subproblem has a closed form, as is shown in the following lemma (the proof is given in Supplement A.4):
Lemma 4.1.
The solution to (18) has the following closed form:
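The closed form is a block soft-thresholding step, i.e. the proximal operator of the Euclidean norm $x\mapsto\|x - b_i\|$. The snippet below shows a generic prototype of this operator (the exact constants in Lemma 4.1 additionally involve the smoothing and mixing-matrix terms, which we omit here):

```python
import numpy as np

def prox_norm(v, b, t):
    """Proximal operator of x -> ||x - b||_2 (block soft-thresholding):
        argmin_x ||x - b|| + (1/(2t)) ||x - v||^2
               = b + max(1 - t/||v - b||, 0) * (v - b).
    A generic prototype of the closed form in Lemma 4.1, not the paper's
    exact formula."""
    r = v - b
    nr = np.linalg.norm(r)
    if nr <= t:
        return b.copy()
    return b + (1.0 - t / nr) * r

rng = np.random.default_rng(1)
b, v, t = rng.standard_normal(3), rng.standard_normal(3), 0.7
x = prox_norm(v, b, t)
obj = lambda z: np.linalg.norm(z - b) + np.sum((z - v) ** 2) / (2 * t)
# The closed form should beat random perturbations of itself
for _ in range(100):
    z = x + 0.1 * rng.standard_normal(3)
    assert obj(x) <= obj(z) + 1e-9
```

Because this step is a cheap closed form, the per-iteration cost of each agent is dominated by one round of communication with its neighbors.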
4.2 Local error bound condition
The proof of the following theorem is given in Supplement A.5.
Theorem 4.1.
The Lagrange dual function of (16)-(17) is nonsmooth, and admits an explicit expression in which $A_i$ denotes the $i$-th column block of the matrix $A$ and the indicator function takes the value 1 if its argument holds and 0 otherwise. Let $\Lambda^*$ be the set of optimal Lagrange multipliers defined according to (8). Then, for any $\lambda\in\mathcal{S}_\delta$, there exists a constant $C_\delta > 0$ such that
$$\mathrm{dist}(\lambda,\Lambda^*) \le C_\delta\big(F(\lambda) - F^*\big)^{1/2}.$$
Furthermore, there exists a unique vector $\nu^*$ such that $\mathcal{P}_A\lambda^* = \nu^*$ for all $\lambda^*\in\Lambda^*$, i.e. Assumption 2.1(d) holds. Thus, applying the proposed method gives a convergence time of $\mathcal{O}\big(\varepsilon^{-4/5}\log_2(\varepsilon^{-1})\big)$.
5 Simulation Experiments
In this section, we conduct simulation experiments on the distributed geometric median problem. Each vector $b_i$ is sampled from a uniform distribution, i.e. each entry of $b_i$ is independently sampled from a uniform distribution on a fixed interval. We compare our algorithm with DSM (Nedic and Ozdaglar (2009)), PG-EXTRA (Shi et al. (2015)), Jacobian parallel ADMM (Deng et al. (2017)) and Smoothing (Necoara and Suykens (2008)) under different network sizes. Each network is randomly generated with a particular connectivity ratio.
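For reproducibility, the network generation step can be sketched as follows (one plausible reading of "connectivity ratio" as an edge probability; the paper does not spell out its exact generator):

```python
import numpy as np

def random_connected_graph(n, ratio, seed=0):
    """Sample an undirected graph in which each pair of nodes is linked
    with probability `ratio`, retrying until the graph is connected."""
    rng = np.random.default_rng(seed)
    while True:
        upper = rng.random((n, n)) < ratio
        adj = np.triu(upper, k=1)
        adj = adj | adj.T                     # symmetric, no self-loops
        # Connectivity check via repeated neighbor expansion (BFS-like).
        reach = np.zeros(n, dtype=bool)
        reach[0] = True
        for _ in range(n):
            reach = reach | ((adj.astype(int) @ reach.astype(int)) > 0)
        if reach.all():
            return adj

adj = random_connected_graph(n=20, ratio=0.3)
```

A connected graph is required for consensus: otherwise the null-space property in Section 4 fails and (16)-(17) is no longer equivalent to (5).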
Acknowledgments
The authors thank Stanislav Minsker and Jason D. Lee for helpful discussions related to the geometric median problem. Qing Ling’s research is supported in part by the National Science Foundation China under Grant 61573331 and the National Science Foundation Anhui under Grant 1608085QF130. Michael J. Neely’s research is supported in part by the National Science Foundation under Grant CCF1718477.
Appendix A Supplement
A.1 Smoothing lemma
In this section, we show that adding a strongly convex term on the primal indeed gives a smoothed dual.
Lemma A.1.
Let $\mathcal{X}$ be defined as above and let the constraint functions be Lipschitz continuous convex functions. Then, the smoothed Lagrange dual function is smooth, with a modulus proportional to the squared Lipschitz constant divided by the smoothing parameter. In particular, when the constraint function is $Ax - b$, the smooth modulus is equal to $\sigma_{\max}(A^{T}A)/(2\mu)$, where $\sigma_{\max}(A^{T}A)$ denotes the maximum eigenvalue of $A^{T}A$.
The proof of this lemma is rather standard (see also the proof of Lemma 6 in Yu and Neely (2018)), and the special case of linear constraints can also be derived from Fenchel duality (Beck et al. (2014)). The proof is included here for completeness.
Proof of Lemma A.1.
First of all, note that for each fixed dual variable the inner function is strongly concave in the primal variable; it follows that there exists a unique maximizer. By Danskin's theorem (see Bertsekas (1999) for details), we have for any dual variable,
Now, consider any , we have
(19) 
where the equality follows from Danskin’s Theorem and the inequality follows from Lipschitz continuity of . Again, by the fact that is strongly concave with modulus ,
which implies
Adding the two inequalities gives
where the last inequality follows from Lipschitz continuity of again. This implies
Combining this inequality with (19) gives
finishing the first part of the proof. The second part of the claim follows easily from the fact that . ∎
A.2 Proof of Theorem 3.1
In this section, we give a convergence time proof for each stage. As a preliminary, we have the following basic lemma, which bounds the perturbation of the Lagrange dual due to the primal smoothing.
Lemma A.2.
Proof of Lemma A.2.
First of all, for any , let
then, we have for any ,
where the first inequality follows from the fact that maximizes . Similarly, we have
where the first inequality follows from the fact that maximizes . Furthermore, we have