Abstract
This paper considers a generic convex minimization template with affine constraints over a compact domain, which covers key semidefinite programming applications. The existing conditional gradient methods either do not apply to our template or are too slow in practice. To this end, we propose a new conditional gradient method, based on a unified treatment of smoothing and augmented Lagrangian frameworks. The proposed method maintains favorable properties of the classical conditional gradient method, such as cheap linear minimization oracle calls and sparse representation of the decision variable. We prove an $\mathcal{O}(1/\sqrt{k})$ convergence rate for our method in the objective residual and the feasibility gap. This rate is essentially the same as that of the state-of-the-art CG-type methods for our problem template, but the proposed method is significantly superior in practice to existing methods in various semidefinite programming applications.
A conditional-gradient-based augmented Lagrangian framework
Alp Yurtsever Olivier Fercoq Volkan Cevher
LIONS, Ecole Polytechnique Fédérale de Lausanne, Switzerland; LTCI, Télécom ParisTech, Université Paris-Saclay, France
1 Introduction
In this paper, we focus on the following constrained convex minimization template with affine constraints:
$$\min_{x \in \mathcal{X}} \ f(x) + g(Bx) \quad \text{subject to} \quad Ax \in \mathcal{K}, \tag{P}$$
where $x$ is the decision variable that lives on the convex and compact optimization domain $\mathcal{X}$; $f$ is a convex differentiable function with Lipschitz continuous gradient; $A$ and $B$ are known linear maps; $g$ is a convex function which can be nonsmooth but we assume that it is Lipschitz continuous; and $\mathcal{K}$ is a convex set.
The conditional gradient method (CGM, a.k.a. Frank-Wolfe algorithm) has established itself as a scalable method for solving convex optimization problems over structured domains, thanks to its cheaper oracle compared to the projected and proximal gradient methods. This classical method was introduced by Frank and Wolfe (1956), and its resurgence in machine learning follows Hazan and Kale (2012) and Jaggi (2013). Despite its favorable properties, classical CGM has restrictive assumptions on the problem template, such as smoothness of the objective function, and the extension of CG-type methods to broader templates is an active research area (cf. Section 4 for some recent advancements).
Problem (P) is significantly broader in applications than the classical CGM template. The nonsmooth term not only lets us use regularization to promote additional structures, but can also serve as a nonsmooth loss function, which generally provides more robustness than smooth functions. Moreover, (P) has an affine inclusion constraint and covers standard semidefinite programming (SDP) with a trace constraint in particular. Hence, a large number of problems in machine learning, signal processing, and computer science can be cast within our template, from unsupervised clustering (Peng and Wei, 2007) to generalized eigenvector problems (Boumal et al., 2018), and from maximum cut (Goemans and Williamson, 1995) to phase retrieval (Candès et al., 2013). We refer to Section 5 of (Yurtsever et al., 2018) for a detailed discussion on special instances and applications of (P).
Affine constraints pose substantial difficulty for first-order methods, hence primal-dual methods are typically preferred for solving (P) at large scale. Among the primal-dual approaches, augmented Lagrangian provides a powerful framework for deriving fast methods. However, these methods rely on proximal oracles constrained to the domain $\mathcal{X}$. Unfortunately, such a proximal oracle does not scale well with the problem dimensions and becomes a computational burden for an important part of the applications, in particular for SDPs.
To this end, we develop the conditional gradient augmented Lagrangian framework (CGAL) to exploit more scalable linear minimization oracles (lmo):
$$\mathrm{lmo}_{\mathcal{X}}(v) = \arg\min_{x \in \mathcal{X}} \ \langle x, v \rangle.$$
For instance, lmo possesses a rank-1 solution that can be efficiently approximated via the power method or the Lanczos algorithm for matrix factorization problems and SDPs, while other first-order oracles require full dimensional eigenvalue or singular value decompositions.
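To illustrate why the linear minimization oracle is cheap for SDPs, the following Python sketch evaluates it over the set of positive semidefinite matrices with a fixed trace. The function name `lmo_spectrahedron`, the trace value `alpha` and the random test matrix are illustrative assumptions, not part of the paper. For clarity the sketch calls a full eigendecomposition (`numpy.linalg.eigh`); at scale one would instead approximate the extreme eigenvector with the power method or a Lanczos routine.

```python
import numpy as np


def lmo_spectrahedron(v, alpha=1.0):
    """Minimize <X, v> over {X PSD, trace(X) = alpha} (v symmetric).

    A minimizer is alpha * u u^T, where u is a unit eigenvector of v
    for its smallest eigenvalue, so the output has rank one.
    """
    v = 0.5 * (v + v.T)                    # guard against round-off asymmetry
    eigvals, eigvecs = np.linalg.eigh(v)   # eigenvalues in ascending order
    u = eigvecs[:, 0]
    return alpha * np.outer(u, u), alpha * eigvals[0]


rng = np.random.default_rng(0)
g = rng.standard_normal((6, 6))
v = g + g.T                                # random symmetric "gradient"
s, value = lmo_spectrahedron(v, alpha=2.0)
print(np.trace(s), value)
```

The oracle output is fully described by a single vector `u`, which is the source of the sparse (low-rank) representation of the iterates mentioned above.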
CGAL can be viewed as a natural extension of the recent method in (Yurtsever et al., 2018) from quadratic penalty to an augmented Lagrangian formulation, and it especially focuses on improving the empirical performance. We prove that CGAL converges with $\mathcal{O}(1/\sqrt{k})$ rate both in the objective residual and the feasibility gap. The simplicity of our analysis also enables us to identify adaptive bounds and propose explicit and implementable dual stepsize rules that retain the theoretical convergence rates while significantly enhancing the practical performance. Our numerical evidence in Section 5 demonstrates superior performance over existing CG-type methods on max-cut, clustering and generalized eigenvector problems.
The rest of this paper is organized as follows. We review the notions of smoothing, quadratic penalty and augmented Lagrangian methods in Section 2. Then, in Section 3 we introduce CGAL and the main convergence theorem. We provide detailed discussion and comparison against the existing related work in Section 4. Finally, Section 5 compiles the empirical evidence supporting the advantages of our framework, and Section 6 draws the conclusions. Technical details are deferred to the supplementary material.
Notation.
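We use lowercase letters for vectors (or matrices when considering the vector space of matrices), uppercase letters for linear maps, and calligraphic letters for sets. We denote the Euclidean inner product by $\langle \cdot, \cdot \rangle$, and the Euclidean norm by $\|\cdot\|$. We denote the adjoint of a linear map $A$ by $A^\top$. For a set $\mathcal{K}$, its indicator function $\delta_{\mathcal{K}}$ is defined as
$$\delta_{\mathcal{K}}(z) = \begin{cases} 0 & \text{if } z \in \mathcal{K}, \\ +\infty & \text{otherwise.} \end{cases}$$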
2 Preliminaries
Our algorithmic design is based on the unified treatment of smoothing, quadratic penalty and augmented Lagrangian frameworks. We review these notions and explain their similarities in this section.
2.1 Nesterov Smoothing
In his seminal work, Nesterov (2005a) introduces a technique for solving some structured nonsmooth optimization problems with efficiency estimate $\mathcal{O}(1/\epsilon)$, which is much better than the theoretical lower bound $\mathcal{O}(1/\epsilon^2)$. This technique is known as Nesterov smoothing, and it is widely used in efficient primal-dual methods (e.g., (Nesterov, 2005b), (Tran-Dinh et al., 2018)).
Nesterov exploits an important class of nonsmooth functions that can be written in the maxform:
for some convex and compact set , and some continuous convex functions and .
Let us consider a proxfunction of , i.e., a strongly convex continuous function on . Define the center point of this proxfunction as
Without loss of generality, we assume the strong convexity parameter of is 1, and . Smooth approximation with smoothness parameter is defined as
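Then, the smooth approximation $g_\beta$ is well defined, differentiable, convex and smooth. Moreover, it uniformly approximates $g$, as it satisfies the following envelope property:
$$g_\beta(z) \leq g(z) \leq g_\beta(z) + \beta D_{\mathcal{U}},$$
where we denote the maximum of the prox-function over the dual domain by $D_{\mathcal{U}}$. See Theorem 1 in (Nesterov, 2005a) for the proof and details.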
For notational convenience, we restrict ourselves to $g(Bx)$, a Lipschitz continuous function coupled with a linear map, but our findings in this paper directly apply to the general max form. Note that we can write $g(Bx)$ in the max form by means of $g^*$, the Fenchel conjugate of $g$:
$$g(Bx) = \max_{u} \ \{ \langle Bx, u \rangle - g^*(u) \}.$$
Since $g$ is convex and lower semicontinuous, Fenchel duality holds, and we have $g = g^{**}$. Moreover, the Lipschitz continuity assumption on $g$ imposes the boundedness of the dual domain. We refer to Lemma 5 by Dünner et al. (2016) for the well-known technical details.
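In this work, we specifically focus on the Euclidean prox-functions, $d(u) = \frac{1}{2}\|u - \dot{u}\|^2$, where $\dot{u}$ is the center point. Then, we define the smooth approximation $g_\beta$ following this choice as
$$g_\beta(Bx) = \max_{u} \ \{ \langle Bx, u \rangle - g^*(u) - \tfrac{\beta}{2}\|u - \dot{u}\|^2 \}.$$
The argument of this maximization subproblem can be written as $\mathrm{prox}_{\beta^{-1} g^*}(\dot{u} + \beta^{-1} Bx)$, where $\mathrm{prox}$ denotes the proximal operator
$$\mathrm{prox}_{g}(z) = \arg\min_{u} \ \{ g(u) + \tfrac{1}{2}\|u - z\|^2 \}.$$
Hence, following the well-known Moreau decomposition, we can compute the gradient of $g_\beta$ by using
$$\nabla_x g_\beta(Bx) = B^\top \mathrm{prox}_{\beta^{-1} g^*}(\dot{u} + \beta^{-1} Bx) = B^\top \big( \dot{u} + \beta^{-1} ( Bx - \mathrm{prox}_{\beta g}(\beta \dot{u} + Bx) ) \big).$$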
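As a concrete instance of this smoothing procedure, the following Python sketch smooths the $\ell_1$ norm with the Euclidean prox-function centered at the origin, which yields the Huber function. The choice of the $\ell_1$ norm, the identity linear map, and all names and values in the code are illustrative assumptions, not taken from the paper. The sketch computes the gradient both from the dual maximizer and from the Moreau decomposition, so that the two can be compared.

```python
import numpy as np


def soft_threshold(z, tau):
    """Proximal operator of tau * ||.||_1."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)


def smoothed_l1(z, beta):
    """g_beta(z) = max over |u_i| <= 1 of <z, u> - (beta / 2) * ||u||^2."""
    u = np.clip(z / beta, -1.0, 1.0)         # maximizer of the concave problem
    return z @ u - 0.5 * beta * (u @ u), u   # u is also the gradient of g_beta


rng = np.random.default_rng(1)
z = rng.standard_normal(5)
beta = 0.3
value, grad = smoothed_l1(z, beta)

# Same gradient through the Moreau decomposition: (z - prox_{beta g}(z)) / beta.
grad_moreau = (z - soft_threshold(z, beta)) / beta
print(value, np.abs(z).sum(), np.max(np.abs(grad - grad_moreau)))
```

Here the dual domain is the unit $\ell_\infty$ ball, so the prox-function is bounded by half the dimension on it, and the smooth value stays within $\beta$ times that bound of the $\ell_1$ norm.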
2.2 Quadratic Penalty
Penalty methods often work with unconstrained problems by augmenting the original objective with a penalty function parameterized by a penalty parameter, favoring the constraint. We update this parameter as we progress in the optimization procedure to converge to a solution of the original constrained problem.
A common and effective proxy is the quadratic penalty, which replaces the affine constraint $Ax = b$ by the squared Euclidean distance function, $\frac{\lambda}{2}\|Ax - b\|^2$, where $\lambda > 0$ is called the penalty parameter. Surprisingly, the quadratic penalty approach is structurally equivalent to a de facto instance of Nesterov smoothing.
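Let us start by writing the Fenchel conjugate of the indicator function of the affine constraint $Ax = b$,
$$\delta_{\{b\}}^*(y) = \sup_{z} \ \{ \langle z, y \rangle - \delta_{\{b\}}(z) \} = \langle b, y \rangle.$$
Then, we can write the affine constraint in the max form by taking the whole space as the dual domain and this conjugate as the function inside the maximum, and using the following relation:
$$\delta_{\{b\}}(Ax) = \sup_{y} \ \langle Ax - b, y \rangle.$$
Now, by choosing the standard Euclidean prox-function with origin center point, our smooth approximation is
$$\max_{y} \ \{ \langle Ax - b, y \rangle - \tfrac{\beta}{2}\|y\|^2 \} = \tfrac{1}{2\beta}\|Ax - b\|^2.$$
In summary, we can obtain the quadratic penalty with parameter $\lambda = \beta^{-1}$ by applying the Nesterov smoothing procedure to the indicator of an affine constraint.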
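As a quick numerical sanity check of this equivalence, the following Python sketch maximizes the smoothed max form of the indicator in closed form and compares the result with the quadratic penalty. The matrix `a`, the vectors `b` and `x`, and the value of `beta` are arbitrary illustrative choices, not data from the paper.

```python
import numpy as np

rng = np.random.default_rng(2)
a = rng.standard_normal((3, 5))    # hypothetical linear map A
b = rng.standard_normal(3)
x = rng.standard_normal(5)
beta = 0.25                        # smoothness parameter
lam = 1.0 / beta                   # corresponding penalty parameter


def smoothed_indicator(y):
    """Objective of the smoothed max form: <Ax - b, y> - (beta / 2) * ||y||^2."""
    r = a @ x - b
    return r @ y - 0.5 * beta * (y @ y)


y_star = (a @ x - b) / beta        # stationary point of the concave quadratic
smooth_value = smoothed_indicator(y_star)
penalty_value = 0.5 * lam * np.sum((a @ x - b) ** 2)
print(smooth_value, penalty_value)
```

The maximizer `y_star` grows without bound as `beta` decreases, which reflects the unbounded dual domain of the indicator function.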
Note that the quadratic penalty does not serve as a uniform approximation, because the dual domain is unbounded and the envelope property does not hold. Consequently, the common analysis techniques for smoothing do not apply to quadratic penalty methods. Nevertheless, one can exploit this structural similarity to design algorithms that universally work on both ends, for composite problems with smoothing-friendly nonsmooth regularizers, and for problems with affine constraints. In fact, the algorithmic design of Yurtsever et al. (2018) implicitly follows this idea.
Quadratic penalty provides simple and interpretable methods, but with limited practical applicability due to poor empirical performance. To this end, the next subsection reviews augmented Lagrangian methods as an alternative approach.
2.3 Augmented Lagrangian
Augmented Lagrangian (AL) methods replace the constraints with a continuous function that promotes feasibility. This function is parametrized by the penalty parameter ($\lambda > 0$, i.e., the augmented Lagrangian parameter), and a dual vector ($y$, i.e., the Lagrange multiplier):
$$\langle y, Ax - b \rangle + \tfrac{\lambda}{2}\|Ax - b\|^2.$$
One can motivate the augmented Lagrangian penalty from many different points of view. For instance, we can view it as the shifted quadratic penalty, since
$$\langle y, Ax - b \rangle + \tfrac{\lambda}{2}\|Ax - b\|^2 = \tfrac{\lambda}{2}\|Ax - b + \tfrac{1}{\lambda} y\|^2 - \tfrac{1}{2\lambda}\|y\|^2.$$
Therefore, we can relate the augmented Lagrangian function with Nesterov smoothing in a similar way. To draw this relation, we simply follow the same arguments as in the quadratic penalty case, but this time we use a shifted prox-function $\frac{1}{2}\|u - y\|^2$:
$$\max_{u} \ \{ \langle Ax - b, u \rangle - \tfrac{\beta}{2}\|u - y\|^2 \} = \langle y, Ax - b \rangle + \tfrac{1}{2\beta}\|Ax - b\|^2.$$
To conclude, the augmented Lagrangian formulation is structurally equivalent to an instance of Nesterov smoothing, applied to the indicator of the constraint, with a shifted Euclidean prox-function. The center point of this prox-function corresponds to the dual variable, and the penalty parameter corresponds to the inverse smoothness parameter ($\lambda = \beta^{-1}$).
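To make the role of the dual variable as a center point tangible, the following Python sketch compares the two approaches on a toy problem that is not from the paper: minimize $\frac{1}{2}(x-1)^2$ subject to $x = 0$. The quadratic penalty minimizer is infeasible for every finite penalty parameter, whereas the augmented Lagrangian minimizer becomes feasible once the multiplier is correct. The dual update below is the classical method of multipliers with exact primal minimization, used here only as an illustration; it is not the CGAL update.

```python
def penalty_minimizer(lam):
    """Minimizer of 0.5 * (x - 1)**2 + (lam / 2) * x**2."""
    return 1.0 / (1.0 + lam)


def al_minimizer(lam, y):
    """Minimizer of 0.5 * (x - 1)**2 + y * x + (lam / 2) * x**2."""
    return (1.0 - y) / (1.0 + lam)


def method_of_multipliers(lam, iters=20):
    """Alternate exact primal minimization and dual ascent y += lam * x."""
    y = 0.0
    for _ in range(iters):
        x = al_minimizer(lam, y)
        y += lam * x                # residual of the constraint x = 0
    return x, y


lam = 1.0
x_pen = penalty_minimizer(lam)      # stays away from 0 for any finite lam
x_al, y_al = method_of_multipliers(lam)
print(x_pen, x_al, y_al)
```

With a fixed penalty parameter, the penalty solution only approaches feasibility as the parameter grows, while the multiplier iterates approach the value 1 for which the augmented Lagrangian minimizer is exactly feasible.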
Once again, this approach does not serve as a uniform approximation, and the common analysis for Nesterov smoothing does not apply for augmented Lagrangian.
In the next section, we use this basic understanding to design a novel conditional gradient method based on the augmented Lagrangian formulation.
3 Algorithm
In this section, we design CGAL for the special case of $g(Bx) = 0$ for ease of presentation. One can extend CGAL in a straightforward way to the general case, based on the discussion in Section 2, and the analysis techniques in this work and (Yurtsever et al., 2018).
3.1 Design of CGAL
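Let us introduce the slack variable $r = Ax \in \mathcal{K}$ and define the augmented Lagrangian function as
$$\mathcal{L}_\lambda(x, y) := f(x) + \min_{r \in \mathcal{K}} \ \{ \langle y, Ax - r \rangle + \tfrac{\lambda}{2}\|Ax - r\|^2 \},$$
where $y$ is the Lagrange multiplier and $\lambda > 0$ is the penalty parameter. Clearly $\mathcal{L}_\lambda(x, y)$ is a convex smooth function with respect to $x$.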
One CGAL iteration is composed of three basic steps:
• Primal step (conditional gradient step on $x$),
• Penalty parameter update (increment $\lambda$),
• Dual step (proximal gradient step on $y$).
Primal step.
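CGAL is characterized by the conditional gradient step with respect to $\mathcal{L}_{\lambda_k}(\cdot, y_k)$ on the primal variable. At iteration $k$, denoting by
$$r_k = \arg\min_{r \in \mathcal{K}} \ \{ \langle y_k, Ax_k - r \rangle + \tfrac{\lambda_k}{2}\|Ax_k - r\|^2 \} = \mathrm{proj}_{\mathcal{K}}\big(Ax_k + \lambda_k^{-1} y_k\big),$$
we can evaluate the directional gradient using
$$\nabla_x \mathcal{L}_{\lambda_k}(x_k, y_k) = \nabla f(x_k) + A^\top y_k + \lambda_k A^\top (Ax_k - r_k).$$
Then, we compute the linear minimization oracle
$$s_k = \arg\min_{x \in \mathcal{X}} \ \langle \nabla_x \mathcal{L}_{\lambda_k}(x_k, y_k), x \rangle,$$
and we form the next iterate ($x_{k+1}$) by combining the current iterate $x_k$ and $s_k$ with the CG stepsize $\eta_k$. We use the classical step size $\eta_k = 2/(k+1)$ of CG-type methods, but the same guarantees hold for line-search and fully corrective versions.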
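The primal step can be sketched in a few lines of Python. The sketch below assumes the unit simplex as the domain (so that the oracle returns a vertex), the step size $2/(k+1)$, and a penalty parameter growing like $\lambda_0\sqrt{k+1}$; the function names, the toy objective and the constraint data are hypothetical and chosen only for illustration.

```python
import numpy as np


def lmo_simplex(v):
    """Minimize <v, x> over the unit simplex: the vertex e_i with i = argmin v_i."""
    s = np.zeros_like(v)
    s[np.argmin(v)] = 1.0
    return s


def primal_step(x, y, k, grad_f, a, proj_k, lam0=1.0):
    """One conditional gradient step on the augmented Lagrangian (sketch)."""
    eta = 2.0 / (k + 1)                      # classical CG step size (assumed)
    lam = lam0 * np.sqrt(k + 1)              # assumed penalty schedule
    r = proj_k(a @ x + y / lam)              # slack variable: projection onto K
    v = grad_f(x) + a.T @ y + lam * a.T @ (a @ x - r)
    s = lmo_simplex(v)                       # linear minimization oracle
    return x + eta * (s - x)                 # convex combination stays in the domain


c = np.array([1.0, 0.0, 0.0])                # f(x) = 0.5 * ||x - c||^2
a = np.array([[1.0, 0.0, -1.0]])             # constraint x_1 - x_3 = 0
b = np.array([0.0])                          # K = {b}
x1 = np.array([1.0, 0.0, 0.0])
y1 = np.zeros(1)
x2 = primal_step(x1, y1, k=1, grad_f=lambda x: x - c, a=a, proj_k=lambda z: b)
print(x2)
```

Since each iterate is a convex combination of oracle outputs, feasibility with respect to the domain is maintained without any projection onto it.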
Penalty parameter update.
Penalty methods typically require the penalty parameter to be increased at a certain rate for provable convergence. In contrast, augmented Lagrangian methods can be designed with a fixed penalty parameter, because the saddle point formulation already favors the constraints. Unlike other augmented Lagrangian CG-type methods, we adopt an increasing penalty sequence in CGAL by choosing $\lambda_k = \lambda_0 \sqrt{k+1}$ for some $\lambda_0 > 0$.
Dual step.
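Once $x_{k+1}$ is formed, we update the dual variable $y_k$ by a gradient ascent step with respect to $\mathcal{L}_{\lambda_{k+1}}(x_{k+1}, \cdot)$. At iteration $k$, we evaluate the dual update by
$$y_{k+1} = y_k + \sigma_{k+1} \nabla_y \mathcal{L}_{\lambda_{k+1}}(x_{k+1}, y_k),$$
where $\sigma_{k+1}$ is the dual stepsize. To compute $\nabla_y \mathcal{L}_{\lambda_{k+1}}(x_{k+1}, y_k)$, we first define
$$\bar{r}_{k+1} = \mathrm{proj}_{\mathcal{K}}\big(Ax_{k+1} + \lambda_{k+1}^{-1} y_k\big).$$
Then, we can use the following formulation:
$$\nabla_y \mathcal{L}_{\lambda_{k+1}}(x_{k+1}, y_k) = Ax_{k+1} - \bar{r}_{k+1}.$$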
Choice of dual stepsize is crucial for convergence guarantees. We propose two alternative schemes, with a decreasing or constant bound on the stepsize.
Decreasing bound on stepsize. This variant cancels the positive quadratic terms in the majorization bounds due to dual updates with the negative quadratic terms that come from the penalty parameter updates. Consequently, we choose the largest dual stepsize $\sigma_{k+1} \geq 0$ which satisfies
(decr.) 
Here, the bound on the dual variable is a sequence of positive numbers to be chosen, which acts like a dual domain diameter and appears in the final bounds. We will specify a reasonable positive constant in the sequel from the final convergence bounds, by matching the factors of the dominating terms.
Constant bound on stepsize. We observed significant performance improvements by slightly relaxing the decreasing upper bound on the stepsize. To this end, we design this second variant. We do not cancel out the additional quadratic terms, but restrict them to be smaller than the other dominating terms in the majorization bound. To this end, we choose the largest dual stepsize $\sigma_{k+1} \geq 0$ which satisfies (the bounding sequence is the same as in the (decr.) case)
(const.)  
We underline that the computation of the dual stepsize does not require an iterative line-search procedure; instead, it can be computed by simple vector operations in both the (decr.) and (const.) variants. As a result, the computational cost of finding the dual stepsize is negligible.
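The three steps can be combined into a short loop. The following Python sketch is a simplified illustration, not the exact algorithm of the paper: it assumes the unit simplex as the domain, the constraint set $\mathcal{K} = \{b\}$, the step size $2/(k+1)$ and a penalty parameter $\lambda_0\sqrt{k+1}$. The dual stepsize conditions in `dual_step_size` (an upper bound `sigma_max`, a cap on the stepsize times the squared residual, and a ball constraint of radius `d_y` on the dual variable) are hypothetical stand-ins with the same shape as the rules described above; their specific constants are assumptions. The point of the sketch is that the largest admissible stepsize follows in closed form from inner products and norms.

```python
import numpy as np


def lmo_simplex(v):
    """Vertex of the unit simplex minimizing <v, x>."""
    s = np.zeros_like(v)
    s[np.argmin(v)] = 1.0
    return s


def dual_step_size(y, res, sigma_max, cap, d_y):
    """Largest sigma >= 0 with sigma <= sigma_max, sigma * ||res||^2 <= cap,
    and ||y + sigma * res|| <= d_y (assumes ||y|| <= d_y). No line search."""
    res_sq = res @ res
    if res_sq == 0.0:
        return sigma_max                     # any step leaves y unchanged
    inner = y @ res
    disc = inner ** 2 + res_sq * (d_y ** 2 - y @ y)
    sigma_ball = (-inner + np.sqrt(max(disc, 0.0))) / res_sq
    return max(0.0, min(sigma_max, cap / res_sq, sigma_ball))


def cgal_sketch(grad_f, a, b, x, lam0=1.0, d_y=10.0, iters=500):
    """Simplified CGAL-style loop for min f(x) over the simplex s.t. Ax = b."""
    y = np.zeros(a.shape[0])
    for k in range(1, iters + 1):
        eta = 2.0 / (k + 1)
        lam = lam0 * np.sqrt(k + 1)
        # Primal step; K = {b}, so the projection onto K is always b.
        v = grad_f(x) + a.T @ y + lam * a.T @ (a @ x - b)
        x = x + eta * (lmo_simplex(v) - x)
        # Dual step with a closed-form stepsize.
        res = a @ x - b
        lam_next = lam0 * np.sqrt(k + 2)
        sigma = dual_step_size(y, res, lam0, lam_next * eta ** 2, d_y)
        y = y + sigma * res
    return x, y


c = np.array([1.0, 0.0, 0.0])                # f(x) = 0.5 * ||x - c||^2
a = np.array([[1.0, 0.0, -1.0]])             # constraint x_1 - x_3 = 0
b = np.array([0.0])
x0 = np.array([1.0, 0.0, 0.0])               # infeasible starting vertex
x, y = cgal_sketch(lambda x: x - c, a, b, x0)
print(x, y, abs((a @ x - b)[0]))
```

On this toy instance the iterates remain in the simplex by construction, and the feasibility gap shrinks from its initial value of 1 as the iterations proceed.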
3.2 Theoretical Guarantees of CGAL
We present convergence guarantees of CGAL in this section, but first we define some basic notions to be used in the sequel and state our main assumptions.
Solution set.
solution.
Strong duality.
We assume Slater’s condition, which is a sufficient condition for strong duality:
where $\mathrm{relint}$ means relative interior. Strong duality is a common assumption for primal-dual methods.
Theorem 3.1.
Sequence generated by CGAL with dual stepsize conditions (const.) satisfies:
where . We can also bound using triangle inequality. Considering the bounds, it is reasonable to choose proportional to .
We omit the design variants of CGAL with line-search and fully corrective updates, which are also covered by our theory.
3.3 Extension for Composite Problems
One can extend CGAL in a straightforward way for composite problems based on the discussions in Section 2. For this, we simply need to define the sum of two nonsmooth terms: Then, we consider smooth approximation of this term with smoothness parameter and proxfunction :
Gradient of can be written as the sum of individual gradient terms. Then, CGAL applies simply by adding one more dual variable, and changing as
One can keep fixed as in (Yurtsever et al., 2018), or update it similar to using . Both cases guarantee $\mathcal{O}(1/\sqrt{k})$ rates in the feasibility gap and in the objective residual.
4 Related Work
The majority of convex methods for solving (P) are based on computationally challenging oracles, which can be some second-order oracle as in interior point methods, a projection step (onto $\mathcal{X}$) as in operator splitting methods, or a constrained proximal oracle as in classical primal-dual methods. We refer to (Wright, 1997), (Komodakis and Pesquet, 2015), (Ryu and Boyd, 2016) and the references therein for these classical approaches. In the rest of this section, we focus on the optimization methods that apply to (P) or some of its subclasses by leveraging the linear minimization oracle.
Lan (2014) introduces a conditional gradient method for nonsmooth minimization over a convex compact domain, based on Nesterov smoothing. To the best of our knowledge, this is the first attempt to combine Nesterov smoothing and conditional gradient approach. This method does not apply for our general problem template, in particular to problems with affine constraints, since it relies on the boundedness of the dual domain and the uniform approximation property.
Yurtsever et al. (2015) present the universal primal-dual method (UPD), a primal-dual subgradient approach for solving convex minimization problems under affine constraints. The main template of UPD is fairly different from (P): it does not have the nonsmooth term; instead, it assumes Hölder smoothness in the dual space. The method does not directly work with lmo's, but it leverages the so-called sharp operators, with computational complexity comparable to that of lmo's under some specific problem settings. In particular, for standard SDPs with a linear cost function, the sharp operator becomes the same as lmo.
UPD adopts the inexact linesearch strategy introduced by Nesterov (2015). This strategy requires the input of target accuracy , and UPD is guaranteed to converge only up to accuracy, i.e., it guarantees . Practical performance of this method heavily depends on this parameter: Choosing small causes stepsizes to be too small. The best value of is typically around th and th of the optimal value , and this method is difficult to tune unless optimal value is roughly known.
Lan and Zhou (2016) propose the conditional gradient sliding method (CGS). This method is based on an inexact version of the accelerated gradient method by Nesterov (1987), where the projection oracle is approximated by CGM. CGS was originally proposed for smooth minimization over a convex and compact domain, but the results are generalized to smoothing-friendly nonsmooth functions in Section 4 of their paper, following the same approach as Lan (2014). Note that this generalization directly follows the standard approach of Nesterov smoothing, and it does not apply to affine constraints.
Yen et al. (2016b) propose the greedy direction method of multipliers (GDMM), a CGM variant for minimizing a linear objective function over an intersection of polytopes. GDMM relies on a consensus reformulation over the Cartesian product of these polytopes, and the consistency constraint is incorporated by the augmented Lagrangian. This method is further explored in structural support vector machine (Yen et al., 2016a) and maximum-a-posteriori inference (Huang et al., 2017) problems. Nevertheless, as raised later on by Gidel et al. (2018), there are technical issues in the analysis which do not admit a trivial fix. We refer to Section B.1 in (Gidel et al., 2018) for more details.
Gidel et al. (2018) propose an augmented Lagrangian framework for convex splitting problem (FWAL). Similar to CGAL, this method is characterized by one CGM step on followed by one dual gradient ascent step on , but their penalty parameter is fixed. Their method is specific for (i.e., splitting), but it can be applied to case using a product space technique. The analysis of FWAL relies on the error bounds (see Theorem 1 in (Gidel et al., 2018) for the conditions, and (Bolte et al., 2017) for more details about error bounds). Their stepsize depends on the error bound constant as with . Hence, is a tuning parameter, and the method has guarantees only if it is chosen small enough. Note that is not only unknown, it can be also arbitrarily small.
Liu et al. (2018) introduce an inexact augmented Lagrangian method (IAL), where the Lagrangian subproblems are approximated by CGM up to a prescribed accuracy level, say for some to be tuned. This results in a double-loop algorithm, where each iteration consists of multiple CGM iterations until the following condition is satisfied:
Then, the algorithm takes a dual gradient ascent step.
IAL provably generates an solution after outer iterations, by choosing the penalty parameter appropriately (proportional to ). This method, however, requires multiple calls at each iteration. Since the number of calls is bounded by (see Theorem 2.2 in (Liu et al., 2018)), this results in calls of . Note that this is much worse than calls required by our method.
Yurtsever et al. (2018) present a conditional gradient type method (HCGM) for (P). This method relies on the quadratic penalty approach to handle affine constraints. HCGM guarantees an $\mathcal{O}(1/\sqrt{k})$ convergence rate both in the objective and the feasibility gap, similar to CGAL. Note, however, that penalty methods typically perform at their worst-case rates, as explained in Section 2. We can indeed observe this in the numerical experiments of Yurtsever et al. (2018), where the empirical convergence rate is also $\mathcal{O}(1/\sqrt{k})$. We also demonstrate this in our experiments in Section 5.
5 Numerical Experiments
This section presents numerical evidence to demonstrate the empirical superiority of CGAL, based on max-cut, clustering and generalized eigenvector problems.
We compared CGAL against UPD and HCGM from Section 4. This choice is based on the practicality of the algorithms: FWAL and IAL have multiple tuning parameters each, and it is very difficult to use these methods in medium or large scale experiments. CGAL and HCGM have an initial penalty parameter $\lambda_0$, and UPD has an accuracy parameter $\epsilon$ to be tuned. We tuned all these parameters by bisection over a multiplicative grid, until the method with the chosen parameter outperforms itself with both the scaled-down and the scaled-up neighboring values of the parameter. Although CGAL with (decr.) performed better than HCGM in all instances we tried, CGAL with (const.) uniformly outperformed (decr.) and HCGM. Hence in this section we focus on CGAL with (const.).
Note that the computational cost of all algorithms is dominated by lmo, hence we provide plots with the number of lmo calls on the x-axis, which is roughly proportional to computation time.
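5.1 Max-cut
Maximum cut is an NP-hard combinatorial problem from computer science. Denoting the symmetric graph Laplacian matrix of a graph by $L$, this problem can be relaxed as (Goemans and Williamson, 1995):
$$\max_{X} \ \tfrac{1}{4}\langle L, X \rangle \quad \text{subject to} \quad X_{ii} = 1 \ \text{for } i = 1, \ldots, n, \quad \mathrm{tr}(X) = n, \quad X \succeq 0.$$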
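To show how a max-cut relaxation of this type fits the problem template, the following Python sketch builds the ingredients a CG-type method needs: the gradient of the linear objective, the linear map that extracts the diagonal, and its adjoint. The sign convention (minimizing the negated cut objective), the helper names and the 4-cycle example graph are assumptions made for illustration.

```python
import numpy as np


def laplacian(adj):
    """Graph Laplacian: degree matrix minus adjacency matrix."""
    return np.diag(adj.sum(axis=1)) - adj


def a_map(x_mat):
    """Linear map X -> diag(X), so that the constraint reads A(X) = 1."""
    return np.diag(x_mat).copy()


def a_adjoint(y):
    """Adjoint of a_map: y -> Diag(y)."""
    return np.diag(y)


def grad_f(lap):
    """Gradient of f(X) = -(1/4) <L, X>, which is constant in X."""
    return -0.25 * lap


# 4-cycle: the alternating partition cuts all four edges.
adj = np.array([[0, 1, 0, 1],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [1, 0, 1, 0]], dtype=float)
lap = laplacian(adj)
cut = np.array([1.0, -1.0, 1.0, -1.0])
x_cut = np.outer(cut, cut)                  # rank-one feasible point
cut_value = 0.25 * np.sum(lap * x_cut)
print(cut_value, a_map(x_cut), np.trace(x_cut))
```

Any sign vector gives a rank-one matrix with unit diagonal and trace equal to the number of nodes, so the relaxation contains every cut as a feasible point.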
Tuning all methods from Section 4 requires substantial computational effort, especially since some of these methods have multiple tuning parameters. To this end, we first consider a small scale max-cut instance where we compare against all of these methods. In this setup we use GD97_b dataset (Batagelj and Mrvar, ), which corresponds to a graph. In Figure 2, we present the performance of each method with the best parameter choice obtained after extensive search. We also provide the performance of all trials of each algorithm in the supplements, along with some other variants of the methods.
Next, we move from the toy case to medium scale examples, and we compare CGAL with UPD and HCGM for maxcut with G1 () and G40 () datasets (Ye, ). We compile the results of these tests in Figure 1, from which we observe that HCGM converges with (which is the worst case bound) while CGAL achieves faster than rate. Note that the sudden drop of UPD on the objective residual plots towards the end is not an increase of rate, it is simply the sign flip of which typically happens just before the saturation of UPD.
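5.2 k-means Clustering
We consider a test setup with the SDP formulation of model-free k-means clustering by Peng and Wei (2007):
$$\min_{X} \ \langle D, X \rangle \quad \text{subject to} \quad X\mathbf{1} = \mathbf{1}, \quad X \geq 0, \quad X \succeq 0, \quad \mathrm{tr}(X) = k,$$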
where is the number of clusters, and is Euclidean distance matrix. We denote by the vector of ones, hence and together enforce each row to be on the unit simplex. Same applies for columns due to symmetry. We cast this problem into () by choosing , , maps , and finally .
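One possible way to encode the row-sum and nonnegativity constraints as an affine inclusion is sketched below in Python. The stacked linear map, the product constraint set (the all-ones vector for the row sums, the nonnegative orthant for the entries) and the two-cluster example are assumptions chosen for illustration; they are not necessarily the exact casting used in the experiments.

```python
import numpy as np


def a_map(x_mat):
    """Stacked linear map X -> (X 1, X)."""
    return x_mat.sum(axis=1), x_mat


def a_adjoint(y_rows, y_mat):
    """Adjoint of a_map: <X 1, y_rows> + <X, y_mat> = <X, y_rows 1^T + y_mat>."""
    return np.outer(y_rows, np.ones(y_mat.shape[1])) + y_mat


def proj_k(z_rows, z_mat):
    """Projection onto K = {1} x (nonnegative matrices)."""
    return np.ones_like(z_rows), np.maximum(z_mat, 0.0)


# Ground-truth clustering of 4 points into k = 2 clusters of size 2:
# X is the sum over clusters of (1 / size) * indicator * indicator^T.
block = 0.5 * np.ones((2, 2))
x_mat = np.block([[block, np.zeros((2, 2))],
                  [np.zeros((2, 2)), block]])
rows, mat = a_map(x_mat)
p_rows, p_mat = proj_k(rows, mat)
print(rows, np.trace(x_mat))
```

For a feasible point such as this cluster matrix, the projection leaves the image under the linear map unchanged, so the constraint residual used in the primal and dual steps vanishes.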
We use the same setup as in (Yurtsever et al., 2018), which is designed and published online by Mixon et al. (2017). This setup contains a dimensional dataset generated by sampling and preprocessing the MNIST dataset (LeCun and Cortes, ) using a one-layer neural network. Further details on this setup and the dataset can be found in (Mixon et al., 2017).
In Figure 3, we observe once again that CGAL outperforms HCGM, achieving empirical convergence rate. In this problem instance, we failed to tune UPD, even with the knowledge of . After extensive analysis and tests, we concluded that UPD has an implicit tuning parameter. It is possible to choose different accuracy terms for objective and feasibility in UPD, as also noted by the authors, simply by scaling the objective function with a constant. Performance of UPD heavily depends on this scaling in addition to tuning accuracy parameter, hence we omit UPD.
5.3 Generalized Eigenvector Problem
We consider SDP relaxation of the generalized eigenvector problem from Boumal et al. (2018):
subject to 
where and are symmetric matrices of size , and is a model parameter. In this problem, we consider synthetic setups, where we generate iid Gaussian, and consider different cases for :

Gaussian  generated by taking symmetric part of iid Gaussian matrix

PolyDecay  generated by randomly rotating ()

ExpDecay  generated by randomly rotating ()

MaxCut SDP  is a solution of a maxcut SDP with G40 dataset ()
This problem highlights an important observation under various data models, which empirically explains why CGAL outperforms the base method HCGM. Note that this SDP problem provably has a rank-1 solution, and if the model parameter is tuned to its exact value, the solution is an extreme point of the domain. In this scenario, we might expect lmo to pick the solution itself, or other close extreme points, if the problem formulation is well-conditioned. Recall that CGAL updates the dual variable, which corresponds to the center point of a quadratic penalty, with the expectation of better adaptation to the problem geometry. In Figure 4, we provide empirical evidence of this adaptation, where dotted lines correspond to the extreme points chosen by lmo. Unsurprisingly, the lmo output converges quickly (with linear rates) under different scenarios for CGAL, while we do not observe the same behavior in HCGM or UPD (we omit the lmo outputs of UPD in the figure, as they do not converge).
6 Conclusions
CGAL can be viewed as a natural extension of HCGM of (Yurtsever et al., 2018), going from quadratic penalty to an augmented Lagrangian formulation. CGAL retains the strong theoretical guarantees of HCGM as well as (nearly) the same per-iteration complexity, while exhibiting significantly superior empirical performance ($\mathcal{O}(1/k)$ in practice vs $\mathcal{O}(1/\sqrt{k})$ in theory). In stark contrast to the existing methods that apply to the SDP templates, CGAL does not require strong assumptions on the problem geometry for the optimal convergence rates, and it comes from a simple analysis with interpretable bounds from which we can derive analytical dual stepsize rules. Considering recent developments on the storage optimal convex optimization methods based on conditional gradients (Yurtsever et al., 2017), CGAL might be the key step for designing fast convex optimization methods for solving huge scale SDPs. Further exploration of CGAL for this specific problem setup is left for future research, in addition to proving the faster convergence rate, which requires stronger analytical assumptions that should also hold for SDPs.
Acknowledgements
This work was supported by the Swiss National Science Foundation (SNSF) under grant number . This project has received funding from the European Research Council (ERC) under the European Union’s Horizon research and innovation programme (grant agreement no  timedata). This work was supported by a public grant as part of the Investissement d’avenir project, reference ANRLABXLMH, LabEx LMH, in a joint call with PGMO.
References
 (1) V. Batagelj and A. Mrvar. Pajek datasets, http://vlado.fmf.unilj.si/pub/networks/data/.
 Bolte et al. (2017) J. Bolte, T. P. Nguyen, J. Peypouquet, and B. W. Suter. From error bounds to the complexity of firstorder descent methods for convex functions. Math. Program., 165(2):471–507, 2017.
 Boumal et al. (2018) N. Boumal, V. Voroninski, and A. Bandeira. Deterministic guarantees for Burer–Monteiro factorizations of smooth semidefinite programs. arXiv:1804.02008v1, 2018.
 Candès et al. (2013) E. J. Candès, T. Strohmer, and V. Voroninski. PhaseLift: Exact and stable signal recovery from magnitude measurements via convex programming. Communications on Pure and Applied Math., 66(8):1241–1274, 2013.
 Dünner et al. (2016) C. Dünner, S. Forte, M. Takáč, and M. Jaggi. Primal–dual rates and certificates. In Proc. 33rd Int. Conf. Machine Learning, 2016.
 Frank and Wolfe (1956) M. Frank and P. Wolfe. An algorithm for quadratic programming. Naval Research Logistics Quarterly, 3:95–110, 1956.
 Gidel et al. (2018) G. Gidel, F. Pedregosa, and S. Lacoste-Julien. Frank-Wolfe splitting via augmented Lagrangian method. In Proc. 21st Int. Conf. Artificial Intelligence and Statistics (AISTATS), 2018.
 Goemans and Williamson (1995) M. X. Goemans and D. P. Williamson. Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming. Journal of the ACM, 42(6):1115–1145, 1995.
 Hazan and Kale (2012) E. Hazan and S. Kale. Projection–free online learning. In Proceedings of the 29th International Conference on Machine Learning, 2012.
 Huang et al. (2017) X. Huang, I. E.H. Yen, R. Zhang, Q. Huang, P. Ravikumar, and I. S. Dhillon. Greedy direction method of multiplier for MAP inference of large output domain. In Proc. 20th Int. Conf. Artificial Intelligence and Statistics (AISTATS), 2017.
 Jaggi (2013) M. Jaggi. Revisiting Frank–Wolfe: Projection–free sparse convex optimization. In Proc. 30th Int. Conf. Machine Learning, 2013.
 Komodakis and Pesquet (2015) N. Komodakis and J.C. Pesquet. Playing with duality: An overview of recent primal-dual approaches for solving large-scale optimization problems. IEEE Signal Process. Mag., 32(6):31–54, 2015.
 Lan (2014) G. Lan. The complexity of large–scale convex programming under a linear optimization oracle. arXiv:1309.5550v2, 2014.
 Lan and Zhou (2016) G. Lan and Y. Zhou. Conditional gradient sliding for convex optimization. SIAM J. Optim., 26(2):1379–1409, 2016.
 (15) Y. LeCun and C. Cortes. MNIST handwritten digit database, Accessed: Jan. 2016 . URL http://yann.lecun.com/exdb/mnist/.
 Liu et al. (2018) Y.F. Liu, X. Liu, and S. Ma. On the nonergodic convergence rate of an inexact augmented Lagrangian framework for composite convex programming. to appear in Mathematics of Operations Research, 2018.
 Mixon et al. (2017) D. G. Mixon, S. Villar, and R. Ward. Clustering subgaussian mixtures by semidefinite programming. Information and Inference: A Journal of the IMA, 6(4):389–415, 2017.
 Nesterov (1987) Y. Nesterov. A method of solving a convex programming problem with convergence rate $O(1/k^2)$. Soviet Mathematics Doklady, 27(2):372–376, 1987.
 Nesterov (2005a) Y. Nesterov. Smooth minimization of nonsmooth functions. Math. Program., 103:127–152, 2005a.
 Nesterov (2005b) Y. Nesterov. Excessive gap technique in nonsmooth convex minimization. SIAM J. Optim., 16(1):235–249, 2005b.
 Nesterov (2015) Y. Nesterov. Universal gradient methods for convex optimization problems. Math. Program., 152(1–2):381–404, 2015.
 Peng and Wei (2007) J. Peng and Y. Wei. Approximating K–means–type clustering via semidefinite programming. SIAM J. Optim., 18(1):186–205, 2007.
 Ryu and Boyd (2016) E. K. Ryu and S. Boyd. Primer on monotone operator methods. Appl. Comput. Math., 15(1):3–43, 2016.
 Tran-Dinh et al. (2018) Q. Tran-Dinh, O. Fercoq, and V. Cevher. A smooth primal-dual optimization framework for nonsmooth composite convex minimization. SIAM J. Optim., 28(1):96–134, 2018.
 Wright (1997) S. J. Wright. Primal–Dual Interior–Point Methods. SIAM, Philadelphia, USA, 1997.
 (26) Y. Ye. Gset random graphs, https://www.cise.ufl.edu/research/sparse/matrices/gset/.
 Yen et al. (2016a) I. E.H. Yen, K. Huang, R. Zhong, P. Ravikumar, and I. S. Dhillon. Dual decomposed learning with factorwise oracle for structural svm with large output domain. In Advances in Neural Information Processing Systems 29, 2016a.
 Yen et al. (2016b) I. E.H. Yen, X. Lin, J. Zhang, P. Ravikumar, and I. S. Dhillon. A convex atomic–norm approach to multiple sequence alignment and motif discovery. In Proceedings of the 33rd International Conference on Machine Learning, 2016b.
 Yurtsever et al. (2015) A. Yurtsever, Q. Tran-Dinh, and V. Cevher. A universal primal-dual convex optimization framework. In Advances in Neural Information Processing Systems 28, 2015.
 Yurtsever et al. (2017) A. Yurtsever, M. Udell, J. Tropp, and V. Cevher. Sketchy decisions: Convex low-rank matrix optimization with optimal storage. In Proc. 20th Int. Conf. Artificial Intelligence and Statistics (AISTATS), 2017.
 Yurtsever et al. (2018) A. Yurtsever, O. Fercoq, F. Locatello, and V. Cevher. A conditional gradient framework for composite convex minimization with applications to semidefinite programming. In Proc. 35th Int. Conf. Machine Learning, 2018.
Appendix A Proof of convergence
For notational simplicity in the proof, we redefine augmented Lagrangian function with three variables, including the slack variable as
where is the Lagrange multiplier and is the augmented Lagrangian parameter.
Directional derivatives of augmented Lagrangian function can be written as
Denote by . Then, using the Taylor expansion, we get the following estimate:
We can bound the inner product term on the right-hand side as follows:
where the first inequality holds since is the solution of , the second inequality simply follows the convexity of , and the last equality holds due to strong duality.
Also note by definition, , hence
Combining these bounds, we arrive at
where the last inequality follows from the optimality condition. By definition, , hence the following estimate holds
and in particular for .
In order to obtain a recurrence, we need to shift the dual variable on the left hand side of our bound. For this, we use the following relations:
Combining all these bounds and subtracting from both sides, we end up with
(1)  
From this point, we consider two cases: constant step size with growth condition, and decreasing step size.
A.1 Constant bound on stepsize
We choose by linesearch, to ensure the following conditions:
where is a sequence of positive and nondecreasing numbers, to input. Note that is well defined, in the sense there exists which satisfy both conditions, simply because trivially satisfies them.
In addition, since we choose and , we have
As a consequence, we can simplify (1) as
Applying this recursion we get
where the last equality follows since . By using the following inequality,
we get the following bound on the augmented Lagrangian:
In the next step, we translate the bound on augmented Lagrangian to convergence guarantees on objective residual and feasibility gap.
Convergence of objective.
We start by using the definition of augmented Lagrangian and the strong duality: