A conditional gradient-based augmented Lagrangian framework

Alp Yurtsever     Olivier Fercoq     Volkan Cevher

LIONS, Ecole Polytechnique Fédérale de Lausanne, Switzerland
LTCI, Télécom ParisTech, Université Paris-Saclay, France

Abstract

This paper considers a generic convex minimization template with affine constraints over a compact domain, which covers key semidefinite programming applications. The existing conditional gradient methods either do not apply to our template or are too slow in practice. To this end, we propose a new conditional gradient method, based on a unified treatment of the smoothing and augmented Lagrangian frameworks. The proposed method maintains favorable properties of the classical conditional gradient method, such as cheap linear minimization oracle calls and a sparse representation of the decision variable. We prove an O(1/√k) convergence rate for our method in the objective residual and the feasibility gap. This rate is essentially the same as that of state-of-the-art CG-type methods for our problem template, but the proposed method is significantly superior to existing methods in various semidefinite programming applications.
1 Introduction

In this paper we focus on the following constrained convex minimization template with affine constraints:

(P)    minimize_{x ∈ 𝒳}  f(x) + g(Ax)    subject to    Bx ∈ 𝒦,

where x is the decision variable that lives on the convex and compact optimization domain 𝒳; f is a convex differentiable function with L_f-Lipschitz continuous gradient; A and B are known linear maps; g is a convex function which can be non-smooth but which we assume to be L_g-Lipschitz continuous; and 𝒦 is a convex set.

The conditional gradient method (CGM, a.k.a. the Frank-Wolfe algorithm) has established itself as a scalable method for solving convex optimization problems over structured domains, thanks to its cheaper oracle compared to the projected and proximal gradient methods. This classical method originates with Frank and Wolfe (1956), and its resurgence in machine learning follows Hazan and Kale (2012) and Jaggi (2013). Despite its favorable properties, the classical CGM makes restrictive assumptions on the problem template, such as smoothness of the objective function, and the extension of CG-type methods to broader templates is an active research area (cf. Section 4 for some recent advances).
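For concreteness, the following minimal Python sketch (ours, for illustration only) shows the classical CGM iteration; the interface names and the toy simplex example are our own assumptions, not part of the original method statement.

```python
import numpy as np

def conditional_gradient(grad_f, lmo, x0, num_iters=1000):
    """Classical CGM (Frank-Wolfe): x_{k+1} = x_k + eta_k * (s_k - x_k)."""
    x = x0.astype(float).copy()
    for k in range(num_iters):
        s = lmo(grad_f(x))       # cheap linear minimization oracle call
        eta = 2.0 / (k + 2)      # classical CG step-size rule
        x = x + eta * (s - x)    # convex combination stays in the domain
    return x

# Toy usage: minimize ||x - c||^2 over the unit simplex in R^3.
c = np.array([0.3, 0.9, -0.2])
lmo = lambda v: np.eye(3)[np.argmin(v)]   # simplex vertex minimizing <v, s>
x_hat = conditional_gradient(lambda x: 2.0 * (x - c), lmo, np.ones(3) / 3)
```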

(P) covers a significantly broader range of applications than the classical CGM template. The non-smooth term g not only lets us use regularization to promote additional structure; it can also serve as a non-smooth loss function, which generally provides more robustness than smooth loss functions. Moreover, (P) has an affine inclusion constraint and, in particular, covers standard semidefinite programming (SDP) with a trace constraint. Hence, a large number of problems in machine learning, signal processing, and computer science can be cast within our template, from unsupervised clustering (Peng and Wei, 2007) to generalized eigenvector problems (Boumal et al., 2018), and from maximum cut (Goemans and Williamson, 1995) to phase retrieval (Candès et al., 2013). We refer to Section 5 of (Yurtsever et al., 2018) for a detailed discussion of special instances and applications of (P).

Affine constraints pose a substantial difficulty for first-order methods; hence, primal-dual methods are typically preferred for solving (P) at large scale. Among the primal-dual approaches, the augmented Lagrangian provides a powerful framework for deriving fast methods. However, these methods rely on proximal oracles constrained to the domain 𝒳. Unfortunately, this proximal oracle does not scale well with the problem dimensions and becomes a computational burden for an important part of the applications, in particular for SDPs.

To this end, we develop the conditional gradient augmented Lagrangian framework (CGAL) to exploit the more scalable linear minimization oracle (lmo):

lmo_𝒳(v) = arg min_{x ∈ 𝒳} ⟨v, x⟩.

For instance, lmo_𝒳 possesses a rank-1 solution that can be efficiently approximated via the power method or the Lanczos algorithm for matrix factorization problems and SDPs, while other first-order oracles require full-dimensional eigen- or singular-value decompositions.
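For illustration, the sketch below (ours) implements such a rank-1 lmo for the trace-constrained PSD domain {X ⪰ 0 : tr(X) = α} using SciPy's Lanczos eigensolver; the function name and interface are our own assumptions.

```python
import numpy as np
from scipy.sparse.linalg import eigsh

def lmo_trace_ball(V, alpha):
    """Over {X psd : tr(X) = alpha}, argmin_X <V, X> is the rank-1 matrix
    alpha * u u^T, where u is a unit eigenvector for the smallest
    eigenvalue of V. Lanczos (eigsh) needs only matrix-vector products
    with V, so V may be sparse or implicit."""
    _, u = eigsh(V, k=1, which='SA')   # smallest algebraic eigenvalue
    u = u[:, 0]
    return alpha * np.outer(u, u)

# Usage on a random symmetric matrix.
C = np.random.randn(50, 50); C = (C + C.T) / 2
X_atom = lmo_trace_ball(C, alpha=1.0)  # rank-1, psd, trace 1
```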

CGAL can be viewed as a natural extension of the recent method in (Yurtsever et al., 2018) from a quadratic penalty to an augmented Lagrangian formulation, and it especially focuses on improving the empirical performance. We prove that CGAL converges with rate O(1/√k) both in the objective residual and the feasibility gap. The simplicity of our analysis also enables us to identify adaptive bounds and to propose explicit and implementable dual step-size rules that retain the theoretical convergence rates while significantly enhancing the practical performance. Our numerical evidence demonstrates superior performance.

The rest of this paper is organized as follows. We review the notions of smoothing, quadratic penalty and augmented Lagrangian methods in Section 2. Then, in Section 3, we introduce CGAL and the main convergence theorem. We provide a detailed discussion and comparison against the existing related work in Section 4. Finally, Section 5 compiles the empirical evidence supporting the advantages of our framework, and Section 6 draws conclusions. Technical details are deferred to the supplementary material.

Notation.

We use lowercase letters for vectors (or matrices when considering a vector space of matrices), uppercase letters for linear maps, and calligraphic letters for sets. We denote the Euclidean inner product by ⟨·, ·⟩ and the Euclidean norm by ‖·‖. We denote the adjoint of a linear map A by A^⊤. For a set 𝒦, its indicator function is defined as

ι_𝒦(z) = 0 if z ∈ 𝒦, and ι_𝒦(z) = +∞ otherwise.

2 Preliminaries

Our algorithmic design is based on the unified treatment of smoothing, quadratic penalty and augmented Lagrangian frameworks. We review these notions and explain their similarities in this section.

2.1 Nesterov Smoothing

In a seminal work, Nesterov (2005a) introduces a technique for solving some structured non-smooth optimization problems with efficiency estimate O(1/ε), which is much better than the theoretical lower bound O(1/ε²) for black-box non-smooth optimization. This technique is known as Nesterov smoothing, and it is widely used in efficient primal-dual methods (e.g., Nesterov (2005b); Tran-Dinh et al. (2018)).

Nesterov exploits an important class of non-smooth functions that can be written in the max-form:

g(z) = max_{y ∈ 𝒴} ⟨z, y⟩ − b(y),

for some convex and compact set 𝒴 and some continuous convex function b.

Let us consider a prox-function d of 𝒴, i.e., a strongly convex continuous function on 𝒴. Define the center point of this prox-function as

ẏ = arg min_{y ∈ 𝒴} d(y).

Without loss of generality, we assume that the strong convexity parameter of d is 1 and that d(ẏ) = 0. The smooth approximation of g with smoothness parameter β > 0 is defined as

g_β(z) = max_{y ∈ 𝒴} ⟨z, y⟩ − b(y) − β d(y).

Then, g_β is well defined, differentiable, convex and (1/β)-smooth. Moreover, it uniformly approximates g, as it satisfies the following envelope property:

g_β(z) ≤ g(z) ≤ g_β(z) + β D_𝒴,

where we denote D_𝒴 = max_{y ∈ 𝒴} d(y). See Theorem 1 in (Nesterov, 2005a) for the proof and details.
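As a concrete instance (our illustration), smoothing g(z) = |z| with the Euclidean prox-function over 𝒴 = [−1, 1] yields the Huber function, and the envelope property can be verified numerically with D_𝒴 = 1/2:

```python
import numpy as np

def huber(z, beta):
    # g_beta(z) = max_{|y| <= 1} z*y - (beta/2)*y^2: smoothed absolute value
    return np.where(np.abs(z) <= beta, z**2 / (2 * beta), np.abs(z) - beta / 2)

beta, z = 0.1, np.linspace(-2.0, 2.0, 1001)
g, g_beta = np.abs(z), huber(z, beta)
D = 0.5   # D_Y = max of d(y) = y^2 / 2 over Y = [-1, 1]
assert np.all(g_beta <= g + 1e-12) and np.all(g <= g_beta + beta * D + 1e-12)
```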

For notational convenience, we restrict ourselves to g(Ax), a Lipschitz continuous function coupled with a linear map, but our findings in this paper directly apply to the general max-form. Note that we can write g in the max-form by choosing 𝒴 = dom g* and b = g*, the Fenchel conjugate of g:

g(z) = max_{y ∈ dom g*} ⟨z, y⟩ − g*(y).

Since g is convex and lower semicontinuous, Fenchel duality holds, and we have g** = g. Moreover, the Lipschitz continuity assumption on g imposes boundedness of the dual domain. We refer to Lemma 5 by Dünner et al. (2016) for the well-known technical details.

In this work, we specifically focus on Euclidean prox-functions, d(y) = ½‖y − ẏ‖². Then, taking ẏ = 0, we define g_β following the definition above as

g_β(Ax) = max_{y ∈ dom g*} ⟨Ax, y⟩ − g*(y) − (β/2)‖y‖².

The argument of this maximization subproblem can be written as y*_β(Ax) = prox_{(1/β) g*}(Ax/β), where prox denotes the proximal operator

prox_h(v) = arg min_u h(u) + ½‖u − v‖².

Hence, following the well-known Moreau decomposition, we can compute the gradient of g_β(A·) by using

∇[g_β(A·)](x) = A^⊤ y*_β(Ax),  with  y*_β(Ax) = (1/β)(Ax − prox_{β g}(Ax)).
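The sketch below (ours) illustrates this computation for g equal to the ℓ1-norm with ẏ = 0, whose proximal operator is the soft-thresholding map; the names are our own.

```python
import numpy as np

def soft_threshold(z, t):
    # Proximal operator of t * ||.||_1 at z.
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def grad_smoothed_l1(A, x, beta):
    """Gradient of the smoothed g(Ax) for g = ||.||_1 via Moreau:
    y*_beta(z) = (z - prox_{beta g}(z)) / beta, then the chain rule."""
    z = A @ x
    y_star = (z - soft_threshold(z, beta)) / beta   # lies in dom g* = [-1,1]^m
    return A.T @ y_star

grad = grad_smoothed_l1(np.array([[1.0, 2.0], [0.5, -1.0]]),
                        np.array([0.3, -0.7]), beta=0.1)
```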

2.2 Quadratic Penalty

Penalty methods work with unconstrained problems, obtained by augmenting the original objective with a penalty function, parameterized by a penalty parameter, that favors the constraint. We update this parameter as we progress through the optimization procedure in order to converge to a solution of the original constrained problem.

A common and effective proxy is the quadratic penalty, which replaces the affine constraint Bx ∈ 𝒦 by the squared Euclidean distance function (λ/2) dist²(Bx, 𝒦), where λ > 0 is called the penalty parameter. Surprisingly, the quadratic penalty approach is structurally equivalent to a de facto instance of Nesterov smoothing.

Let us start by writing the Fenchel conjugate of the indicator of the constraint set, ι_𝒦, which is the support function of 𝒦:

ι*_𝒦(y) = σ_𝒦(y) = max_{z ∈ 𝒦} ⟨y, z⟩.

Then, we can write the affine constraint in the max-form by choosing 𝒴 as the whole dual space and b = σ_𝒦, and using the following relation:

ι_𝒦(Bx) = max_y ⟨Bx, y⟩ − σ_𝒦(y).

Now, by choosing the standard Euclidean prox-function with origin center point, our smooth approximation is

max_y ⟨Bx, y⟩ − σ_𝒦(y) − (β/2)‖y‖² = (1/(2β)) dist²(Bx, 𝒦).

In summary, we can obtain the quadratic penalty with parameter λ = 1/β by applying the Nesterov smoothing procedure to the indicator of the affine constraint.
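A quick numerical sanity check (our illustration) of this equivalence, in the scalar special case 𝒦 = {b}, approximating the maximization over y on a fine grid:

```python
import numpy as np

beta, z, b = 0.5, 1.3, 0.2
# Smoothed indicator of {b}: max_y (z - b)*y - (beta/2)*y^2.
ys = np.linspace(-100.0, 100.0, 2_000_001)
smoothed = np.max((z - b) * ys - (beta / 2) * ys**2)
quad_penalty = (z - b) ** 2 / (2 * beta)   # quadratic penalty, lambda = 1/beta
assert abs(smoothed - quad_penalty) < 1e-6
```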

Note that the quadratic penalty does not serve as a uniform approximation, because the dual domain is unbounded and the envelope property does not hold. Consequently, the common analysis techniques for smoothing do not apply to quadratic penalty methods. Nevertheless, one can exploit this structural similarity to design algorithms that work universally on both ends: for composite problems with smoothing-friendly non-smooth regularizers, and for problems with affine constraints. In fact, the algorithmic design of Yurtsever et al. (2018) implicitly follows this idea.

The quadratic penalty provides simple and interpretable methods, but with limited practical applicability due to poor empirical performance. To this end, the next subsection reviews augmented Lagrangian methods as an alternative approach.

2.3 Augmented Lagrangian

Augmented Lagrangian (AL) methods replace the constraints with a continuous function that promotes feasibility. This function is parameterized by a penalty parameter λ (i.e., the augmented Lagrangian parameter) and a dual vector y (i.e., the Lagrange multiplier); for the special case 𝒦 = {b}, it reads

⟨y, Bx − b⟩ + (λ/2)‖Bx − b‖².

One can motivate the augmented Lagrangian penalty from many different points of view. For instance, we can view it as a shifted quadratic penalty, since

⟨y, Bx − b⟩ + (λ/2)‖Bx − b‖² = (λ/2)‖Bx − b + y/λ‖² − (1/(2λ))‖y‖².

Therefore, we can relate the augmented Lagrangian function to Nesterov smoothing in a similar way. To draw this relation, we simply follow the same arguments as in the quadratic penalty case, but this time we use a shifted prox-function d(v) = ½‖v − y‖²:

max_v ⟨Bx − b, v⟩ − (β/2)‖v − y‖² = ⟨y, Bx − b⟩ + (1/(2β))‖Bx − b‖².

To conclude, the augmented Lagrangian formulation is structurally equivalent to an instance of Nesterov smoothing, applied to the indicator of the constraint, with a shifted Euclidean prox-function. The center point of this prox-function corresponds to the dual variable, and the penalty parameter corresponds to the inverse smoothness parameter (λ = 1/β).
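The analogous numerical check (ours) with the shifted prox-function, again in the scalar case 𝒦 = {b}, recovers the augmented Lagrangian penalty:

```python
import numpy as np

beta, z, b, y = 0.5, 1.3, 0.2, -0.7
vs = np.linspace(-100.0, 100.0, 2_000_001)
shifted = np.max((z - b) * vs - (beta / 2) * (vs - y) ** 2)   # shifted center y
aug_penalty = y * (z - b) + (z - b) ** 2 / (2 * beta)         # AL, lambda = 1/beta
assert abs(shifted - aug_penalty) < 1e-6
```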

Once again, this approach does not serve as a uniform approximation, and the common analysis for Nesterov smoothing does not apply to the augmented Lagrangian.

In the next section, we use this basic understanding to design a novel conditional gradient method based on the augmented Lagrangian formulation.

3 Algorithm

In this section, we design CGAL for the special case g = 0 for ease of presentation. One can extend CGAL in a straightforward way to the general case, based on the discussion in Section 2 and on the analysis techniques in this work and in (Yurtsever et al., 2018).

Input: x_1 ∈ 𝒳, y_1 = 0, λ_0 > 0
for k = 1, 2, … do
     η_k = 2/(k+1)
     λ_k = λ_0 √(k+1)
     r_k = proj_𝒦(Bx_k + y_k/λ_k)
     v_k = ∇f(x_k) + B^⊤(y_k + λ_k (Bx_k − r_k))
     s_k = lmo_𝒳(v_k)
     x_{k+1} = x_k + η_k (s_k − x_k)
     r_{k+1} = proj_𝒦(Bx_{k+1} + y_k/λ_k)
     choose σ_k using (decr.) or (const.)
     y_{k+1} = y_k + σ_k (Bx_{k+1} − r_{k+1})
end for
Algorithm 1 CGAL

3.1 Design of CGAL

Let us introduce the slack variable r ∈ 𝒦 and define the augmented Lagrangian function as

L_λ(x, r; y) = f(x) + ⟨y, Bx − r⟩ + (λ/2)‖Bx − r‖²,

where y is the Lagrange multiplier and λ > 0 is the penalty parameter. Clearly, L_λ is a convex and (L_f + λ‖B‖²)-smooth function with respect to x.

One CGAL iteration is composed of three basic steps:

  • Primal step (conditional gradient step on x),

  • Penalty parameter update (increment λ_k),

  • Dual step (proximal gradient step on y).

Primal step.

CGAL is characterized by the conditional gradient step with respect to x on the primal variable. At iteration k, denoting

r_k = proj_𝒦(Bx_k + y_k/λ_k),

we can evaluate the directional gradient using

∇_x L_{λ_k}(x_k, r_k; y_k) = ∇f(x_k) + B^⊤(y_k + λ_k (Bx_k − r_k)).

Then, we compute the linear minimization oracle

s_k = lmo_𝒳(∇_x L_{λ_k}(x_k, r_k; y_k)),

and we form the next iterate x_{k+1} = x_k + η_k (s_k − x_k) by combining the current iterate and s_k with the CG step-size η_k = 2/(k+1). We use the classical step-size rule of CG-type methods, but the same guarantees hold for line-search and fully corrective versions.

Penalty parameter update.

Penalty methods typically require the penalty parameter to be increased at a certain rate for provable convergence. In contrast, augmented Lagrangian methods can be designed with a fixed penalty parameter, because the saddle point formulation already favors the constraints. Unlike other augmented Lagrangian CG-type methods, we adopt an increasing penalty sequence in CGAL by choosing λ_k = λ_0 √(k+1) for some λ_0 > 0.

Dual step.

Once x_{k+1} is formed, we update the dual variable by a gradient ascent step with respect to y. At iteration k, we evaluate the dual update as

y_{k+1} = y_k + σ_k (Bx_{k+1} − r_{k+1}).

To compute r_{k+1}, we use the same projection formulation as in the primal step:

r_{k+1} = proj_𝒦(Bx_{k+1} + y_k/λ_k).

The choice of the dual step-size σ_k is crucial for the convergence guarantees. We propose two alternative schemes, with a decreasing or a constant bound on the step-size.

Decreasing bound on step-size. This variant cancels the positive quadratic terms that the dual updates introduce into the majorization bounds with the negative quadratic terms that come from the penalty parameter updates. Consequently, we choose the largest σ_k which satisfies

(decr.)

Here, D_k is a sequence of positive numbers to be chosen, which acts like a dual domain diameter and appears in the final bounds. We will specify a reasonable positive constant in the sequel from the final convergence bounds, by matching the factors of the dominating terms.

Constant bound on step-size. We observed significant performance improvements by slightly relaxing the decreasing upper bound on the step-size, which motivates this second variant. Here, we do not cancel out the additional quadratic terms, but restrict them to be smaller than the other dominating terms in the majorization bound. To this end, we choose the largest σ_k which satisfies (D_k is similar as in the (decr.) case)

(const.)

We underline that the computation of σ_k does not require an iterative line-search procedure; instead, it can be computed by simple vector operations in both the (decr.) and (const.) variants. As a result, the computational cost of finding σ_k is negligible.
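To summarize this section, the following compact Python sketch (ours) outlines one possible CGAL implementation for the special case 𝒦 = {b}, where the slack is simply r_k = b. The decreasing dual step-size default below is only a stand-in for the (decr.)/(const.) rules, and all interface names are assumptions of this illustration.

```python
import numpy as np

def cgal(grad_f, lmo, B, b, x0, lam0=1.0, num_iters=1000, dual_step=None):
    """CGAL sketch for min f(x) s.t. Bx = b, x in X (i.e., K = {b}).

    dual_step(k, residual) should implement the paper's (decr.) or (const.)
    rule; the decreasing default below is only a simple placeholder.
    """
    x, y = x0.astype(float).copy(), np.zeros_like(b, dtype=float)
    for k in range(1, num_iters + 1):
        lam = lam0 * np.sqrt(k + 1)        # increasing penalty parameter
        eta = 2.0 / (k + 1)                # classical CG step size
        # Primal step: conditional gradient on the augmented Lagrangian.
        v = grad_f(x) + B.T @ (y + lam * (B @ x - b))
        x = x + eta * (lmo(v) - x)
        # Dual step: gradient ascent on y with a safeguarded step size.
        residual = B @ x - b
        sigma = dual_step(k, residual) if dual_step else lam0 / np.sqrt(k + 1)
        y = y + sigma * residual
    return x, y
```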

3.2 Theoretical Guarantees of CGAL

We present convergence guarantees of CGAL in this section, but first we define some basic notions to be used in the sequel and state our main assumptions.

Solution set.

We denote a solution of (P) by x* and the set of all solutions by 𝒳*. Similarly, we denote a solution of the dual problem by y* and the set of all dual solutions by 𝒴*. Throughout, we assume that the solution set is nonempty and that there exists a finite dual solution, i.e., ‖y*‖ < +∞.

-solution.

Given an accuracy level ε > 0, we call a point x ∈ 𝒳 an ε-solution of (P) if

f(x) − f* ≤ ε   and   dist(Bx, 𝒦) ≤ ε.

We call f(x) − f* the objective residual and dist(Bx, 𝒦) the feasibility gap. Note that the convergence of the objective residual alone is not enough to approximate the solution, since the iterates are non-feasible and the objective residual can take negative values.

Strong duality.

We assume Slater's condition, which is a sufficient condition for strong duality: there exists a point x in the relative interior of 𝒳 with Bx ∈ relint(𝒦), where relint denotes the relative interior. Strong duality is a common assumption for primal-dual methods.

Theorem 3.1.

The sequence x_k generated by CGAL with the dual step-size condition (const.) satisfies

f(x_k) − f* = O(1/√k)   and   dist(Bx_k, 𝒦) = O(1/√k),

where the constants depend on the diameter of 𝒳, ‖B‖, λ_0, ‖y*‖ and D. We can also bound ‖y_k‖ using the triangle inequality. Considering the bounds, it is reasonable to choose D proportional to ‖y*‖.

The sequence generated by CGAL with the dual step-size condition (decr.) satisfies similar guarantees as (const.), with a different constant factor for all terms involving D.

We omit the design variants of CGAL with line-search and fully corrective updates, which are also covered by our theory.

3.3 Extension for Composite Problems

One can extend CGAL in a straightforward way to composite problems, based on the discussions in Section 2. For this, we simply define the sum of the two non-smooth terms, the Lipschitz continuous regularizer and the indicator of the affine constraint. Then, we consider the smooth approximation of this sum with smoothness parameter β and the Euclidean prox-function.

The gradient of the smoothed sum can be written as the sum of the individual gradient terms. Then, CGAL applies simply by adding one more dual variable and changing the primal direction v_k accordingly.

One can keep β fixed as in (Yurtsever et al., 2018), or update it similarly to λ_k. Both cases guarantee O(1/√k) rates in the feasibility gap and in the objective residual.

4 Related Work

The majority of convex methods for solving (P) are based on computationally challenging oracles, which can be a second-order oracle as in interior point methods, a projection step (onto the constraint set) as in operator splitting methods, or a constrained proximal oracle as in classical primal-dual methods. We refer to (Wright, 1997), (Komodakis and Pesquet, 2015), (Ryu and Boyd, 2016) and the references therein for these classical approaches. In the rest of this section, we focus on optimization methods that apply to (P) or to some of its subclasses by leveraging the linear minimization oracle.

Lan (2014) introduces a conditional gradient method for non-smooth minimization over a convex compact domain, based on Nesterov smoothing. To the best of our knowledge, this is the first attempt to combine Nesterov smoothing with the conditional gradient approach. This method does not apply to our general problem template, in particular to problems with affine constraints, since it relies on the boundedness of the dual domain and the uniform approximation property.

Yurtsever et al. (2015) present the universal primal-dual method (UPD), a primal-dual subgradient approach for solving convex minimization problems under affine constraints. The main template of UPD is fairly different from (P): it does not have the non-smooth term g; instead, it assumes Hölder smoothness in the dual space. The method does not directly work with lmo's, but it leverages so-called sharp operators, whose computational complexity is comparable to that of lmo's under some specific problem settings. In particular, for standard SDPs with a linear cost function, the sharp operator becomes the same as the lmo.

UPD adopts the inexact line-search strategy introduced by Nesterov (2015). This strategy requires the target accuracy ε as an input, and UPD is guaranteed to converge only up to ε accuracy. The practical performance of this method heavily depends on this parameter: choosing ε small causes the step-sizes to be too small. The best value of ε typically depends on the optimal value f*, and this method is difficult to tune unless the optimal value is roughly known.

Lan and Zhou (2016) propose the conditional gradient sliding method (CGS). This method is based on an inexact version of the accelerated gradient method of Nesterov (1987), where the projection oracle is approximated by CGM. CGS is originally proposed for smooth minimization over a convex and compact domain, but the results are generalized to smoothing-friendly non-smooth functions in their Section 4, following the same approach as Lan (2014). Note that this generalization directly follows the standard approach of Nesterov smoothing, and it does not apply to affine constraints.

Yen et al. (2016b) propose the greedy direction method of multipliers (GDMM), a CGM variant for minimizing a linear objective function over an intersection of polytopes. GDMM relies on a consensus reformulation over the Cartesian product of these polytopes, and the consistency constraint is incorporated via the augmented Lagrangian. This method is further explored for structural support vector machine (Yen et al., 2016a) and maximum-a-posteriori inference (Huang et al., 2017) problems. Nevertheless, as raised later by Gidel et al. (2018), there are technical issues in the analysis that do not admit a trivial fix. We refer to Section B.1 in (Gidel et al., 2018) for more details.

Gidel et al. (2018) propose an augmented Lagrangian framework for the convex splitting problem (FW-AL). Similar to CGAL, this method is characterized by one CGM step on the primal variable followed by one dual gradient ascent step, but their penalty parameter is fixed. Their method is specific to the splitting case Bx = 0, but it can be applied to the Bx = b case using a product-space technique. The analysis of FW-AL relies on error bounds (see Theorem 1 in (Gidel et al., 2018) for the conditions, and (Bolte et al., 2017) for more details about error bounds). Their step-size depends on the error bound constant; hence it is a tuning parameter, and the method has guarantees only if it is chosen small enough. Note that this constant is not only unknown, it can also be arbitrarily small.

Liu et al. (2018) introduce an inexact augmented Lagrangian method (IAL), where the Lagrangian subproblems are approximated by CGM up to a prescribed accuracy level to be tuned. This results in a double-loop algorithm, where each outer iteration consists of multiple CGM iterations until the following condition is satisfied:

Then, the algorithm takes a dual gradient ascent step.

IAL provably generates an ε-solution after O(1/ε²) outer iterations, by choosing the penalty parameter appropriately (proportional to 1/ε). This method, however, requires multiple lmo calls at each outer iteration. Since the number of inner iterations is bounded by O(1/ε²) (see Theorem 2.2 in (Liu et al., 2018)), this results in O(1/ε⁴) calls of the lmo. Note that this is much worse than the O(1/ε²) calls required by our method.

Yurtsever et al. (2018) present a conditional gradient-type method (HCGM) for (P). This method relies on the quadratic penalty approach to handle affine constraints. HCGM guarantees an O(1/√k) convergence rate both in the objective and the feasibility gap, similar to CGAL. Note, however, that as explained in Section 2, penalty methods typically perform at the worst-case rates. We can indeed observe this in the numerical experiments of Yurtsever et al. (2018), where the empirical convergence rate is also O(1/√k). We demonstrate this also in our experiments in Section 5.

5 Numerical Experiments

Figure 1: Empirical comparison of CGAL, HCGM and UPD on the max-cut problem setup.

This section presents numerical evidence demonstrating the empirical superiority of CGAL, based on max-cut, clustering and generalized eigenvector problems.

We compare CGAL against UPD and HCGM from Section 4. This choice is based on the practicality of the algorithms: FW-AL and IAL have multiple tuning parameters each, and it is very difficult to use these methods in medium- or large-scale experiments. CGAL and HCGM have the initial penalty parameter λ_0, and UPD has the accuracy parameter ε to be tuned. We tuned all these parameters by bisection, until the method with the chosen parameter outperformed itself with both the next larger and the next smaller values of the parameter. Although CGAL with (decr.) performed better than HCGM in all instances we tried, CGAL with (const.) uniformly outperformed both (decr.) and HCGM. Hence, in this section we focus on CGAL with (const.).

Note that the computational cost of all algorithms is dominated by the lmo; hence, we provide plots with the number of lmo calls on the x-axis, which is roughly proportional to computation time.

5.1 Max-cut

Maximum cut is an NP-hard combinatorial problem from computer science. Denoting the symmetric graph Laplacian matrix of a graph by L, this problem can be relaxed as (Goemans and Williamson, 1995):

maximize ¼ ⟨L, X⟩    subject to    diag(X) = 1,  X ⪰ 0.
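A hedged sketch (ours) of CGAL specialized to this max-cut SDP, casting the domain as {X ⪰ 0 : tr(X) = n} and the constraint as diag(X) = 1; the dual step-size rule is again a simple placeholder rather than the paper's (decr.)/(const.) conditions.

```python
import numpy as np
from scipy.sparse.linalg import eigsh

def maxcut_cgal(L, num_iters=500, lam0=1.0):
    """CGAL sketch for min -<L, X>/4 s.t. diag(X) = 1, X in {psd, tr = n}."""
    n = L.shape[0]
    X, y, ones = np.zeros((n, n)), np.zeros(n), np.ones(n)
    for k in range(1, num_iters + 1):
        lam, eta = lam0 * np.sqrt(k + 1), 2.0 / (k + 1)
        # Gradient of the AL in X: -L/4 + diag^*(y + lam * (diag(X) - 1)).
        V = -L / 4 + np.diag(y + lam * (np.diag(X) - ones))
        _, u = eigsh(V, k=1, which='SA')            # rank-1 lmo atom
        X = X + eta * (n * np.outer(u[:, 0], u[:, 0]) - X)
        y = y + (lam0 / np.sqrt(k + 1)) * (np.diag(X) - ones)  # placeholder rule
    return X
```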

Tuning all the methods from Section 4 requires substantial computational effort, especially since some of these methods have multiple tuning parameters. To this end, we first consider a small-scale max-cut instance where we compare against all of these methods. In this setup we use the GD97_b dataset (Batagelj and Mrvar), which corresponds to a small graph. In Figure 2, we present the performance of each method with the best parameter choice obtained after an extensive search. We also provide the performance of all trials of each algorithm in the supplements, along with some other variants of the methods.


Figure 2: Overview of the empirical performance of the various methods in Section 4 for the max-cut problem.

Next, we move from the toy case to medium-scale examples, and we compare CGAL with UPD and HCGM for max-cut with the G1 (n = 800) and G40 (n = 2000) datasets (Ye). We compile the results of these tests in Figure 1, from which we observe that HCGM converges at the O(1/√k) rate (which is the worst-case bound) while CGAL achieves a faster rate. Note that the sudden drop of UPD on the objective residual plots towards the end is not an increase in rate; it is simply the sign flip of the objective residual, which typically happens just before the saturation of UPD.

5.2 k-means Clustering

We consider a test setup with the SDP formulation of model-free k-means clustering by Peng and Wei (2007):

minimize ⟨D, X⟩    subject to    X1 = 1,  X ≥ 0,  X ∈ {X ⪰ 0 : tr(X) = k},

where k is the number of clusters and D is the Euclidean distance matrix. We denote by 1 the vector of ones; hence, X1 = 1 and X ≥ 0 together enforce each row of X to be on the unit simplex. The same applies to the columns due to symmetry. We cast this problem into (P) by choosing f(X) = ⟨D, X⟩, the domain 𝒳 = {X ⪰ 0 : tr(X) = k}, the map B(X) = (X1, X), and finally 𝒦 = {1} × ℝ^{n×n}_+.
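For illustration (our sketch, with assumed names), the linear map B(X) = (X1, X), its adjoint, and the projection onto 𝒦 used for the slack variable can be implemented as follows:

```python
import numpy as np

def B_op(X):
    """B(X) = (X @ 1, X): row sums stacked with all entries of X."""
    return X.sum(axis=1), X

def B_adj(u, M):
    """Adjoint of B: <B(X), (u, M)> = u^T X 1 + <M, X> = <u 1^T + M, X>."""
    return np.outer(u, np.ones_like(u)) + M

def proj_K(u, M):
    """Projection onto K = {1} x (nonnegative orthant) for the slack r_k."""
    return np.ones_like(u), np.maximum(M, 0.0)
```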


Figure 3: Objective residual and feasibility gap for k-means clustering with preprocessed MNIST dataset.
Figure 4: Empirical comparison of CGAL and HCGM for solving the generalized eigenvector problem with 4 different synthetic setups. Dotted lines present the objective residual and feasibility gap of the atoms chosen by the linear minimization oracle (lmo).

We use the same setup as in (Yurtsever et al., 2018), which was designed and published online by Mixon et al. (2017). This setup contains a dataset generated by sampling and preprocessing the MNIST dataset (LeCun and Cortes) using a one-layer neural network. Further details on this setup and the dataset can be found in (Mixon et al., 2017).

In Figure 3, we observe once again that CGAL outperforms HCGM, achieving a faster empirical convergence rate. In this problem instance, we failed to tune UPD, even with the knowledge of f*. After extensive analysis and tests, we concluded that UPD has an implicit tuning parameter: it is possible to choose different accuracy levels for the objective and the feasibility in UPD, as also noted by the authors, simply by scaling the objective function with a constant. The performance of UPD heavily depends on this scaling, in addition to the accuracy parameter; hence, we omit UPD.

5.3 Generalized Eigenvector Problem

We consider the SDP relaxation of the generalized eigenvector problem from Boumal et al. (2018):

maximize ⟨A, X⟩    subject to    ⟨B, X⟩ = 1,  X ∈ {X ⪰ 0 : tr(X) ≤ α},

where A and B are symmetric matrices of size n × n, and α > 0 is a model parameter. In this problem, we consider synthetic setups where we generate A with iid Gaussian entries, and we consider different cases for B:

  • Gaussian - B is generated by taking the symmetric part of an iid Gaussian matrix

  • PolyDecay - B is generated by randomly rotating a diagonal matrix with polynomially decaying entries

  • ExpDecay - B is generated by randomly rotating a diagonal matrix with exponentially decaying entries

  • MaxCut SDP - B is a solution of a max-cut SDP with the G40 dataset (n = 2000)

This problem highlights an important observation under various data models, which empirically explains why CGAL outperforms the base method HCGM. Note that this SDP provably has a rank-1 solution, and if α is tuned to its exact value, the solution is an extreme point of the domain. In this scenario, we might expect the lmo to pick the solution itself, or other nearby extreme points, if the problem formulation is well conditioned. Recall that CGAL updates the dual variable, which corresponds to the center point of a quadratic penalty, with the expectation of better adaptation to the problem geometry. In Figure 4, we provide empirical evidence of this adaptation, where the dotted lines correspond to the extreme points chosen by the lmo. Unsurprisingly, the lmo output converges quickly (at linear rates) under the different scenarios for CGAL, while we do not observe the same behavior for HCGM or UPD (we omit the lmo outputs of UPD in the figure, as they do not converge).

6 Conclusions

CGAL can be viewed as a natural extension of the HCGM of (Yurtsever et al., 2018), going from a quadratic penalty to an augmented Lagrangian formulation. CGAL retains the strong theoretical guarantees of HCGM as well as (nearly) the same per-iteration complexity, while exhibiting significantly superior empirical performance (O(1/k) in practice vs O(1/√k) in theory). In stark contrast to the existing methods that apply to SDP templates, CGAL does not require strong assumptions on the problem geometry for the optimal convergence rates, and it comes with a simple analysis with interpretable bounds from which we can derive analytical dual step-size rules. Considering recent developments on storage-optimal convex optimization methods based on conditional gradients (Yurtsever et al., 2017), CGAL might be the key step for designing fast convex optimization methods for huge-scale SDPs. Further exploration of CGAL for this specific problem setup is left for future research, in addition to proving the faster convergence rate, which requires stronger analytical assumptions that should also hold for SDPs.

Acknowledgements

This work was supported by the Swiss National Science Foundation (SNSF) under grant number . This project has received funding from the European Research Council (ERC) under the European Union’s Horizon research and innovation programme (grant agreement no - time-data). This work was supported by a public grant as part of the Investissement d’avenir project, reference ANR--LABX--LMH, LabEx LMH, in a joint call with PGMO.

References

  • (1) V. Batagelj and A. Mrvar. Pajek datasets, http://vlado.fmf.uni-lj.si/pub/networks/data/.
  • Bolte et al. (2017) J. Bolte, T. P. Nguyen, J. Peypouquet, and B. W. Suter. From error bounds to the complexity of first-order descent methods for convex functions. Math. Program., 165(2):471–507, 2017.
  • Boumal et al. (2018) N. Boumal, V. Voroninski, and A. Bandeira. Deterministic guarantees for Burer–Monteiro factorizations of smooth semidefinite programs. arXiv:1804.02008v1, 2018.
  • Candès et al. (2013) E. J. Candès, T. Strohmer, and V. Voroninski. PhaseLift: Exact and stable signal recovery from magnitude measurements via convex programming. Communications on Pure and Applied Math., 66(8):1241–1274, 2013.
  • Dünner et al. (2016) C. Dünner, S. Forte, M. Takác, and M. Jaggi. Primal–dual rates and certificates. In Proc. 33rd Int. Conf. Machine Learning, 2016.
  • Frank and Wolfe (1956) M. Frank and P. Wolfe. An algorithm for quadratic programming. Naval Research Logistics Quarterly, 3:95–110, 1956.
  • Gidel et al. (2018) G. Gidel, F. Pedregosa, and S. Lacoste-Julien. Frank-Wolfe splitting via augmented Lagrangian method. In Proc. 21st Int. Conf. Artificial Intelligence and Statistics (AISTATS), 2018.
  • Goemans and Williamson (1995) M. X. Goemans and D. P. Williamson. Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming. Journal of the ACM, 43(6):1115–1145, 1995.
  • Hazan and Kale (2012) E. Hazan and S. Kale. Projection–free online learning. In Proceedings of the 29th International Conference on Machine Learning, 2012.
  • Huang et al. (2017) X. Huang, I. E.-H. Yen, R. Zhang, Q. Huang, P. Ravikumar, and I. S. Dhillon. Greedy direction method of multiplier for MAP inference of large output domain. In Proc. 20th Int. Conf. Artificial Intelligence and Statistics (AISTATS), 2017.
  • Jaggi (2013) M. Jaggi. Revisiting Frank–Wolfe: Projection–free sparse convex optimization. In Proc. 30th Int. Conf. Machine Learning, 2013.
  • Komodakis and Pesquet (2015) N. Komodakis and J.-C. Pesquet. Playing with duality: An overview of recent primal-dual approaches for solving large-scale optimization problems. IEEE Signal Process. Mag., 32(6):31–54, 2015.
  • Lan (2014) G. Lan. The complexity of large–scale convex programming under a linear optimization oracle. arXiv:1309.5550v2, 2014.
  • Lan and Zhou (2016) G. Lan and Y. Zhou. Conditional gradient sliding for convex optimization. SIAM J. Optim., 26(2):1379–1409, 2016.
  • (15) Y. LeCun and C. Cortes. MNIST handwritten digit database. Accessed: Jan. 2016. URL http://yann.lecun.com/exdb/mnist/.
  • Liu et al. (2018) Y.-F. Liu, X. Liu, and S. Ma. On the non-ergodic convergence rate of an inexact augmented Lagrangian framework for composite convex programming. to appear in Mathematics of Operations Research, 2018.
  • Mixon et al. (2017) D. G. Mixon, S. Villar, and R. Ward. Clustering subgaussian mixtures by semidefinite programming. Information and Inference: A Journal of the IMA, 6(4):389–415, 2017.
  • Nesterov (1987) Y. Nesterov. A method of solving a convex programming problem with convergence rate O(1/k²). Soviet Mathematics Doklady, 27(2):372–376, 1987.
  • Nesterov (2005a) Y. Nesterov. Smooth minimization of non-smooth functions. Math. Program., 103:127–152, 2005a.
  • Nesterov (2005b) Y. Nesterov. Excessive gap technique in nonsmooth convex minimization. SIAM J. Optim., 16(1):235–249, 2005b.
  • Nesterov (2015) Y. Nesterov. Universal gradient methods for convex optimization problems. Math. Program., 152(1-2):381–404, 2015.
  • Peng and Wei (2007) J. Peng and Y. Wei. Approximating K–means–type clustering via semidefinite programming. SIAM J. Optim., 18(1):186–205, 2007.
  • Ryu and Boyd (2016) E. K. Ryu and S. Boyd. Primer on monotone operator methods. Appl. Comput. Math., 15(1):3–43, 2016.
  • Tran-Dinh et al. (2018) Q. Tran-Dinh, O. Fercoq, and V. Cevher. A smooth primal-dual optimization framework for nonsmooth composite convex minimization. SIAM J. Optim., 28(1):96–134, 2018.
  • Wright (1997) S. J. Wright. Primal–Dual Interior–Point Methods. SIAM, Philadelphia, USA, 1997.
  • (26) Y. Ye. Gset random graphs, https://www.cise.ufl.edu/research/sparse/matrices/gset/.
  • Yen et al. (2016a) I. E.-H. Yen, K. Huang, R. Zhong, P. Ravikumar, and I. S. Dhillon. Dual decomposed learning with factorwise oracle for structural svm with large output domain. In Advances in Neural Information Processing Systems 29, 2016a.
  • Yen et al. (2016b) I. E.-H. Yen, X. Lin, J. Zhang, P. Ravikumar, and I. S. Dhillon. A convex atomic–norm approach to multiple sequence alignment and motif discovery. In Proceedings of the 33rd International Conference on Machine Learning, 2016b.
  • Yurtsever et al. (2015) A. Yurtsever, Q. Tran-Dinh, and V. Cevher. A universal primal-dual convex optimization framework. In Advances in Neural Information Processing Systems 28, 2015.
  • Yurtsever et al. (2017) A. Yurtsever, M. Udell, J. Tropp, and V. Cevher. Sketchy decisions: Convex low-rank matrix optimization with optimal storage. In Proc. 20th Int. Conf. Artificial Intelligence and Statistics (AISTATS), 2017.
  • Yurtsever et al. (2018) A. Yurtsever, O. Fercoq, F. Locatello, and V. Cevher. A conditional gradient framework for composite convex minimization with applications to semidefinite programming. In Proc. 35th Int. Conf. Machine Learning, 2018.

Appendix A Proof of convergence

For notational simplicity in the proof, we redefine the augmented Lagrangian function with three variables, including the slack variable r ∈ 𝒦, as

L_λ(x, r; y) = f(x) + ⟨y, Bx − r⟩ + (λ/2)‖Bx − r‖²,

where y is the Lagrange multiplier and λ is the augmented Lagrangian parameter.

The directional derivatives of the augmented Lagrangian function can be written as

∇_x L_λ(x, r; y) = ∇f(x) + B^⊤(y + λ(Bx − r)),   ∇_y L_λ(x, r; y) = Bx − r.

Then, using a Taylor expansion and the smoothness of L_λ in x, we get the following estimate:

We can bound the inner product term on the right-hand side as follows:

where the first inequality holds since s_k is the solution of the linear minimization oracle, the second inequality simply follows from the convexity of f, and the last equality holds due to strong duality.

Also note by definition, , hence

Combining these bounds, we arrive at

where the last inequality follows from the optimality condition. By definition, , hence the following estimate holds

and in particular for .

In order to obtain a recurrence, we need to shift the dual variable on the left hand side of our bound. For this, we use the following relations:

Combining all these bounds and subtracting from both sides, we end up with

(1)

From this point, we consider two cases: constant step size with growth condition, and decreasing step size.

A.1 Constant bound on step-size

We choose σ_k by line-search, to ensure the following conditions:

where D_k is a sequence of positive and non-decreasing numbers, given as input. Note that σ_k is well defined, in the sense that there exists a σ_k which satisfies both conditions, simply because σ_k = 0 trivially satisfies them.

In addition, since we choose λ_k = λ_0 √(k+1) and η_k = 2/(k+1), we have

As a consequence, we can simplify (1) as

Applying this recursion, we get

where the last equality follows since η_1 = 1.

we get the following bound on the augmented Lagrangian:

In the next step, we translate the bound on the augmented Lagrangian into convergence guarantees on the objective residual and the feasibility gap.

Convergence of objective.

We start by using the definition of the augmented Lagrangian and strong duality: