Bundle-Level Type Methods Uniformly Optimal
for Smooth and Nonsmooth Convex Optimization
The paper is a combined version of the two manuscripts
previously submitted to Mathematical Programming, namely: “Bundle-type methods uniformly optimal for smooth and nonsmooth convex optimization”
and “Level methods uniformly optimal for composite and structured nonsmooth convex optimization”.
The author of this paper was partially supported by
NSF grant CMMI-1000347,
ONR grant N00014-13-1-0036 and
NSF CAREER Award CMMI-1254446.
The main goal of this paper is to develop uniformly optimal first-order methods for convex programming (CP). By uniform optimality we mean that the first-order methods themselves do not require the input of any problem parameters, but can still achieve the best possible iteration complexity bounds. By incorporating a multi-step acceleration scheme into the well-known bundle-level method, we develop an accelerated bundle-level (ABL) method, and show that it can achieve the optimal complexity for solving a general class of black-box CP problems without requiring the input of any smoothness information, such as whether the problem is smooth, nonsmooth or weakly smooth, or the specific values of the Lipschitz constant and smoothness level. We then develop a more practical, restricted memory version of this method, namely the accelerated prox-level (APL) method. We investigate the generalization of the APL method for solving certain composite CP problems and an important class of saddle-point problems recently studied by Nesterov [Mathematical Programming, 103 (2005), pp. 127-152]. We present promising numerical results for these new bundle-level methods applied to certain classes of semidefinite programming (SDP) and stochastic programming (SP) problems.
Keywords: Convex Programming, Complexity, Bundle-level, Optimal methods
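Consider the convex programming (CP) problem
$$f^* := \min_{x \in X} f(x), \tag{1.1}$$
where $X \subseteq \mathbb{R}^n$ is a convex compact set and $f : X \to \mathbb{R}$ is a closed convex function. In the classic black-box setting, $f$ is represented by a first-order oracle which, given an input point $x \in X$, returns $f(x)$ and $f'(x) \in \partial f(x)$, where $\partial f(x)$ denotes the subdifferential of $f$ at $x$.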
If $f$ is a general nonsmooth Lipschitz continuous convex function, then, by the classic complexity theory for CP nemyud:83, the number of calls to the first-order oracle for finding an $\epsilon$-solution of (1.1) (i.e., a point $\bar x \in X$ s.t. $f(\bar x) - f^* \le \epsilon$) cannot be smaller than $O(1/\epsilon^2)$ when $n$ is sufficiently large. This lower complexity bound can be achieved, for example, by the simple subgradient descent or mirror descent method nemyud:83. If $f$ is a smooth function with Lipschitz continuous gradient, Nesterov in a seminal work Nest83-1 presented an algorithm with the iteration complexity bounded by $O(1/\epsilon^{1/2})$, which, by nemyud:83, is also optimal for smooth convex optimization if $n$ is sufficiently large. Moreover, if $f$ is a weakly smooth function with Hölder continuous gradient, i.e., there exist constants $\rho \in (0,1)$ and $M > 0$ such that $\|f'(x) - f'(y)\| \le M \|x - y\|^{\rho}$ for all $x, y \in X$, then the optimal iteration complexity bound is given by $O(1/\epsilon^{2/(1+3\rho)})$ (see NemNes85-1; Nest88-1; DeGlNe10-1).
To accelerate the solutions of large-scale CP problems, much effort has recently been directed to exploiting the problem’s structure, in order to identify possibly some new classes of CP problems with stronger convergence performance guarantee. One such example is given by the composite CP problems with the objective function given by . Here is a relatively simple nonsmooth convex function such as or (see Subsection 4.1 for more examples) and is a -dimensional vector function, see Nest89; Nest04; Nest07-1; Nem94; LewWri09-1; Lan10-3; GhaLan12-2a; GhaLan10-1b. In most of these studies, the components of are assumed to be smooth convex functions. In this case, the iteration complexity can be improved to by properly modifying Nesterov’s optimal smooth method, see for example, Nest04; Nest07-1; Nem94. It should be noted that these optimal first-order methods for general composite CP problems are in a sense “conceptual” since they require the minimization of the summation of a prox-function together with the composition of with an affine transformation Nest04. More recently, Nesterov Nest05-1 studied a class of nonsmooth convex-concave saddle point problems, where the objective function , in its basic form, is given by
Here is a convex compact set and denotes a linear operator from to . Nesterov shows that can be closely approximated by a certain smooth convex function and that the iteration complexity for solving this class of problems can be improved to . It is noted in jnt08 that this bound is unimprovable, for example, if is given by a Euclidean ball and the algorithm can only have access to and (the adjoint operator of ). These problems were later studied in Nem05-1; Nest05-2; AuTe06-1; Nest06-1; pena08-1; LaLuMo11-1 and found many interesting applications, for example, in dbg08-1; Lu09-1; BeBoCa09-1.
The advantages of the aforementioned optimal first-order methods (e.g., subgradient method or Nesterov’s method) are mainly their optimality, simplicity and cheap iteration cost. However, these methods have the shortcoming that each is designed for solving a particular subclass of CP problems (e.g., smooth or nonsmooth). In particular, nonsmooth CP algorithms usually cannot make use of local smoothness properties that a nonsmooth instance might have, even though it is well known that Lipschitz continuous functions are differentiable almost everywhere within their domains. On the other hand, although it has been shown recently in Lan10-3 that Nesterov’s method, which was originally designed for solving smooth CP problems, is also optimal for nonsmooth optimization when employed with a properly specified stepsize policy (see also DeGlNe10-1 for a more recent generalization to weakly smooth CP problems), one still needs to determine some smoothness properties of $f$ (e.g., whether $f$ is smooth or not, i.e., $\rho = 1$ or $\rho = 0$, and the specific value of $M$), as well as some other global information (e.g., and in some cases, the number of iterations ), before actually applying these generalized algorithms. Since these parameters describe the structure of CP problems over a global scope, these types of algorithms are still inherently worst-case oriented.
To address these issues, we propose to study the so-called uniformly optimal first-order methods. The key difference between uniformly optimal methods and existing ones is that they can achieve the best possible complexity for solving different subclasses of CP problems, but require little (preferably no) structural information for their implementation. To this end, we focus on a different type of first-order methods, namely: the bundle-level (BL) methods. Evolving from the well-known bundle methods Kiw83-1; Kiw90-1; Lem75, the BL method was first proposed by Lemaréchal et al. LNN in 1995. In contrast to subgradient or mirror descent methods for nonsmooth CP, the BL method can achieve the optimal iteration complexity for general nonsmooth CP without requiring the input of any problem parameters. Moreover, the BL method and certain of its “restricted-memory” variants BenNem00; BenNem05-1; Rich07-1 often exhibit significantly better practical performance than subgradient or mirror descent methods. However, to the best of our knowledge, the study of BL methods has so far focused on general nonsmooth CP problems only.
Our contribution in this paper mainly consists of the following aspects. Firstly, we consider a general class of black-box CP problems in the form of (1.1), where $f$ satisfies
$$f(y) - f(x) - \langle f'(x), y - x \rangle \le \frac{M}{1+\rho} \|y - x\|^{1+\rho}, \quad \forall x, y \in X, \tag{1.2}$$
for some $M > 0$, $\rho \in [0,1]$ and $f'(x) \in \partial f(x)$. Clearly, this class of problems covers nonsmooth ($\rho = 0$), smooth ($\rho = 1$) and weakly smooth ($0 < \rho < 1$) CP problems (see for example, p.22 of Nest04 for the standard arguments used in smooth and weakly smooth case, and Lemma 2 of Lan10-3 for a related result in the nonsmooth case). By incorporating into the BL method a multi-step acceleration scheme that was first used by Nesterov Nest83-1 and later in AuTe06-1; Lan10-3; LaLuMo11-1; Nest04; Nest05-1 to accelerate gradient type methods for solving smooth CP problems, we present a new BL-type algorithm, namely: the accelerated bundle-level (ABL) method. We show that the iteration complexity of the ABL method can be bounded by
$$O\left( \left( \frac{M D_X^{1+\rho}}{\epsilon} \right)^{\frac{2}{1+3\rho}} \right).$$
Hence, the ABL method is optimal for solving not only nonsmooth, but also smooth and weakly smooth CP problems. More importantly, this method does not require the input of any smoothness information, such as whether a problem is smooth, nonsmooth or weakly smooth, and the specific values of problem parameters $M$, $\rho$ and $D_X$. To the best of our knowledge, this is the first time that uniformly optimal algorithms of this type have been proposed in the literature.
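To make the dependence on the smoothness level concrete, the following minimal sketch evaluates the leading term of the complexity bound. It assumes the bound has the form $(M D_X^{1+\rho}/\epsilon)^{2/(1+3\rho)}$, which reduces to the classical $1/\epsilon^2$ and $1/\epsilon^{1/2}$ rates at $\rho = 0$ and $\rho = 1$; the function name `complexity_bound` and the sample values are hypothetical, and constants hidden in the $O(\cdot)$ notation are ignored.

```python
def complexity_bound(M, D, rho, eps):
    """Leading term (M * D**(1 + rho) / eps) ** (2 / (1 + 3 * rho)).

    Assumed form of the bound; absolute constants are dropped.
    """
    if not 0.0 <= rho <= 1.0:
        raise ValueError("rho must lie in [0, 1]")
    if min(M, D, eps) <= 0.0:
        raise ValueError("M, D and eps must be positive")
    return (M * D ** (1.0 + rho) / eps) ** (2.0 / (1.0 + 3.0 * rho))


# Same target accuracy, three smoothness levels (illustrative values only).
nonsmooth = complexity_bound(M=1.0, D=1.0, rho=0.0, eps=1e-4)  # scales like 1/eps^2
weakly = complexity_bound(M=1.0, D=1.0, rho=0.5, eps=1e-4)     # in between
smooth = complexity_bound(M=1.0, D=1.0, rho=1.0, eps=1e-4)     # scales like 1/sqrt(eps)
```

The point of a uniformly optimal method is that none of `M`, `D` or `rho` would have to be supplied to the algorithm itself; they appear only in the analysis.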
Secondly, one drawback of the ABL method is that, as the algorithm proceeds, its subproblems become more difficult to solve. As a result, each iteration of the ABL method becomes computationally more and more expensive. To remedy this issue, we present a restricted memory version of this method, namely: the accelerated prox-level (APL) method, and demonstrate that it can also uniformly achieve the optimal complexity for solving any black-box CP problem. In particular, each iteration of the APL method requires the projection onto the feasible set coupled with a few extra linear constraints, and the number of such linear constraints can be fully controlled (as small as or ). The basic idea of this improvement is to incorporate a novel rule due to Kiwiel Kiw95-1 (later studied by Ben-Tal and Nemirovski BenNem00; BenNem05-1) for updating the lower bounds and prox-centers. In addition, non-Euclidean prox-functions can be employed to make use of the geometry of the feasible set in order to obtain (nearly) dimension-independent iteration complexity.
Thirdly, we investigate the generalization of the APL method for solving certain classes of composite and structured nonsmooth CP problems. In particular, we show that with little modification, the APL method is optimal for solving a class of generalized composite CP problems with the objective given by . Here , , can be a mixture of smooth, nonsmooth, weakly smooth or affine components. Such a formulation covers a wide range of CP problems, including the nonsmooth, weakly smooth, smooth, minimax, and regularized CP problems (see Subsection 4.1 for more discussions). The APL method can achieve the optimal iteration complexity for solving this class of composite problems without requiring any global information on the inner functions, such as the smoothness level and the size of Lipschitz constant. In addition, based on the APL method, we develop a completely problem-parameter free smoothing scheme, namely: the uniform smoothing level (USL) method, for solving the aforementioned class of structured CP problems with a bilinear saddle point structure Nest05-1. We show that this method can find an -solution of these CP problems in at most iterations.
Finally, we demonstrate through our preliminary numerical experiments that these new BL type methods can be competitive with, and even significantly outperform, existing first-order methods for solving certain classes of CP problems. Observe that each iteration of BL type methods involves the projection onto the feasible set $X$ coupled with a few linear constraints, while gradient type methods only require the projection onto $X$. As a result, the iteration cost of BL type methods can be higher than that of gradient type methods, especially when the projection onto $X$ has an explicit solution. Here we would like to highlight a few interesting cases in which the application of BL type methods would be preferred: (i) the major iteration cost lies not in the projection onto $X$, but in the computation of first-order information (e.g., involving eigenvalue decomposition or the solution of another optimization problem); and (ii) the projection onto $X$ is as expensive as the projection onto $X$ coupled with a few linear constraints, e.g., $X$ is a general polyhedron. In particular, we show that the APL and USL methods, when applied to solving certain important classes of semidefinite programming (SDP) and stochastic programming (SP) problems, can significantly outperform gradient type algorithms, as well as some existing BL type methods. The problems we tested consist of instances with up to decision variables.
The paper is organized as follows. In Section 2, we provide a brief review of the BL method and present the ABL method for black-box CP problems. We then study a restricted memory version of the ABL method, namely the APL method in Section 3. In Section 4, we investigate how to generalize the APL method for solving certain composite and structured nonsmooth CP problems. Section 5 is dedicated to the numerical experiments conducted on certain classes of SDP and SP problems. Finally, some concluding remarks are made in Section 6.
2 The accelerated bundle-level method
We present a new BL type method, namely: the accelerated bundle-level (ABL) method, which can uniformly achieve the optimal rate of convergence for smooth, weakly smooth and nonsmooth CP problems. More specifically, we provide a brief review of the BL method for nonsmooth minimization in Section 2.1, and then present the ABL method and discuss its main convergence properties in Section 2.2. Section 2.3 is devoted to the proof of a major convergence result used in Section 2.2. Throughout this section, we assume that the Euclidean space is equipped with the standard Euclidean norm associated with the inner product .
2.1 Review of the bundle-level method
Given a sequence of search points , an important construct, namely, the cutting plane model, of the objective function of problem (1.1) is given by
In the simplest cutting plane method CheGol59; Kelley60, we approximate by and update the search points according to
However, this scheme converges slowly, both theoretically and practically nemyud:83; Nest04. Significant progress Kiw83-1; Kiw90-1; Lem75 was made under the name of bundle methods (see, e.g., HeRe00; OliSagSch11-1 for some important applications of these methods). In these methods, a prox-term is introduced into the objective function of (2.3) and the search points are updated by
Here, the current prox-center is a certain point from and denotes the current penalty parameter. Moreover, the prox-center for the next iterate, i.e., , will be set to if is sufficiently smaller than . Otherwise, will be the same as . The penalty reduces the influence of the model ’s inaccuracy and hence the instability of the algorithm. Note, however, that the determination of usually requires certain on-line adjustments or line-search. In the closely related trust-region technique Rusz06; linWri03-1, the prox-term is put into the constraints of the subproblem instead of its objective function and the search points are then updated according to
This approach also encounters similar difficulties for determining the size of .
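The cutting plane model shared by the schemes reviewed above can be sketched as follows. This is a minimal one-dimensional illustration, assuming the model is the pointwise maximum of the linearizations $f(x_i) + \langle f'(x_i), x - x_i\rangle$ collected so far; the helper names `make_cut` and `model` and the test function are hypothetical.

```python
def make_cut(f, df, xi):
    """Return (slope, intercept) of the linearization f(xi) + df(xi) * (x - xi)."""
    g = df(xi)
    return g, f(xi) - g * xi


def model(cuts, x):
    """Cutting plane model: pointwise maximum of all stored linearizations."""
    return max(g * x + c for g, c in cuts)


# A nonsmooth convex test function and one valid subgradient selection.
f = lambda x: abs(x) + 0.5 * x * x
df = lambda x: (1.0 if x >= 0 else -1.0) + x

points = [-1.0, 0.5, 2.0]                      # search points visited so far
cuts = [make_cut(f, df, xi) for xi in points]

# By convexity the model never exceeds f and is exact at the search points.
gap_at_zero = f(0.0) - model(cuts, 0.0)
```

Because the model underestimates the objective everywhere, its minimum over the feasible set is a lower bound on the optimal value, which is what the level-based schemes exploit.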
In an important work LNN, Lemaréchal et al. introduced the idea of incorporating level sets into the bundle method. The basic scheme of their bundle-level (BL) methods consists of:
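a) Update the upper bound to be the best objective value found so far and compute a lower bound on the optimal value of (1.1) by minimizing the cutting plane model over the feasible set;
b) Set the level to a convex combination of the lower and upper bounds, with a weight chosen in $(0,1)$;
c) Set the next search point to the projection of the current search point onto the level set of the cutting plane model at that level, intersected with the feasible set.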
Observe that step c) ensures that the new search point falls within the level set , while being as close as possible to . We refer to as the prox-center, since it controls the proximity between and the aforementioned level set. It is shown in LNN that, if is a general nonsmooth convex function (i.e., in (1.2)), then the above scheme can find an -solution of (1.1) in at most
iterations, where is a constant depending on and
In view of nemyud:83, the above complexity bound in (2.4) is unimprovable for nonsmooth convex optimization. Moreover, it turns out that the level sets give a stable description of the objective function and, as a consequence, very good practical performance has been observed for the BL methods, e.g., LNN; BenNem00; lns11.
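The basic BL scheme reviewed above can be sketched on a one-dimensional interval, where both subproblems have closed-form solutions. This is an illustrative sketch under stated assumptions, not the implementation of LNN: the feasible set is taken to be $[a, b]$, the level is the convex combination `beta * lb + (1 - beta) * ub`, and the names `min_model`, `level_interval` and `bundle_level_1d` are hypothetical.

```python
def min_model(cuts, a, b):
    """Minimize the piecewise-linear model max_i (g_i * x + c_i) over [a, b]."""
    candidates = [a, b]
    for i, (g1, c1) in enumerate(cuts):
        for g2, c2 in cuts[i + 1:]:
            if g1 != g2:                       # non-parallel cuts cross once
                x = (c2 - c1) / (g1 - g2)
                if a <= x <= b:
                    candidates.append(x)
    return min(max(g * x + c for g, c in cuts) for x in candidates)


def level_interval(cuts, a, b, level):
    """Return [lo, hi] = {x in [a, b] : g_i * x + c_i <= level for all i}."""
    lo, hi = a, b
    for g, c in cuts:
        if g > 0:
            hi = min(hi, (level - c) / g)
        elif g < 0:
            lo = max(lo, (level - c) / g)
    return lo, hi                              # flat cuts (g == 0) never bind here


def bundle_level_1d(f, df, a, b, x0, eps=1e-6, beta=0.5, max_iter=1000):
    cuts, x, best_x, ub = [], x0, x0, float("inf")
    for k in range(1, max_iter + 1):
        fx, g = f(x), df(x)
        cuts.append((g, fx - g * x))           # add the linearization at x
        if fx < ub:
            ub, best_x = fx, x                 # step a): best value so far
        lb = min_model(cuts, a, b)             # step a): lower bound from the model
        if ub - lb <= eps:
            return best_x, lb, ub, k
        level = beta * lb + (1.0 - beta) * ub  # step b): level between the bounds
        lo, hi = level_interval(cuts, a, b, level)
        x = min(max(x, lo), hi)                # step c): project x onto the level set
    return best_x, lb, ub, max_iter


f = lambda x: abs(x - 0.3)
df = lambda x: 1.0 if x >= 0.3 else -1.0
x_best, lb, ub, iters = bundle_level_1d(f, df, a=-1.0, b=2.0, x0=2.0)
```

Note that the routine takes no Lipschitz constant or stepsize as input; the only parameters are the tolerance and the level weight.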
2.2 The ABL algorithm and its main convergence properties
Based on the bundle-level method, our goal in this subsection is to present a new bundle type method, namely the ABL method, which can achieve the optimal complexity for solving any CP problems satisfying (1.2).
We introduce the following two key improvements into the classical BL methods. Firstly, rather than using a single sequence , we employ three related sequences, i.e., , and , to build the cutting-plane models (and hence the lower bound ), compute the upper bounds , and control the proximity, respectively. Moreover, the relations among these sequences are defined carefully. In particular, we define and for a certain . This type of multi-step scheme originated from the well-known Nesterov’s accelerated gradient method for solving smooth CP problems Nest83-1. Secondly, we group the iterations performed by the ABL method into different phases, and in each phase, the gap between the lower and upper bounds on will be reduced by a certain constant factor. It is worth noting that, although the convergence analysis of the BL method also relies on the concept of phases (see, e.g., BenNem00; BenNem05-1), the description of this method usually does not involve phases. However, we need to use phases explicitly in the ABL method in order to define in an optimal way to achieve the best possible complexity bounds for solving problem (1.1).
We start by describing the ABL gap reduction procedure, which, for a given search point and lower bound on , computes a new search point and updated lower bound satisfying for some .
The ABL gap reduction procedure:
0) Set , , and . Also let and the cutting plane be arbitrarily chosen, say and . Let .
1) Update lower bound: set , , ,
2) Update prox-center: set and
3) Update upper bound: set , and choose such that ;
4) If , terminate the procedure with and ;
5) Set and go to step 1.
We now add a few remarks about the above gap reduction procedure . Firstly, we say that an iteration of procedure occurs whenever increases by . Observe that, if for all , then an iteration of procedure will be exactly the same as that of the BL method. In fact, in this case procedure will reduce to one phase of the BL method as described in BenNem05-1; BenNem00. Secondly, with more general selections of , the iteration cost of procedure is still about the same as that of the BL method. More specifically, each iteration of procedure involves the solution of two subproblems, i.e., (2.6) and (2.7), and the computation of and , while the BL method requires the solution of two similar subproblems and the computation of and . Thirdly, it can be easily seen that and , , respectively, computed by procedure are lower and upper bounds on . Indeed, by the definition of , (2.2) and the convexity of , we have
which, in view of (2.6), then implies that Moreover, it follows from the definition of that Hence, denoting
By showing how in (2.9) decreases with respect to , we establish in Theorem 2.1 some important convergence properties of procedure . The proof of this result is more involved and hence provided separately in Section 2.3.
Theorem 2.1. Let and , , be given. Also let denote the optimality gap obtained at the -th iteration of procedure before it terminates. Then for any , we have
where is defined in (2.5), is the norm,
In particular, if and , , are chosen such that for some ,
then the number of iterations performed by procedure can be bounded by
Observe that, if for all , then as mentioned before, procedure reduces to a single phase (or segment) of the BL method and hence its termination follows by slightly modifying the standard analysis of the BL algorithm. However, such a selection of does not satisfy the conditions stated in (2.14) and thus cannot guarantee the termination of procedure in at most iterations. Below we discuss a few possible selections of that satisfy (2.14), in order to obtain the bound in (2.15). It should be pointed out that none of these selections rely on any problem parameters, such as , and .
Denoting and , we first show part a). Note that by (2.12), the selection of and the fact that , we have
Using these relations and the simple observation we conclude that
where the last inequality follows from the facts and for any .
We now show that part b) holds. Note that by (2.16), we have
which clearly implies that , . We now show that and by induction. Indeed, if , then, by (2.17), we have
which, in view of the fact that , then implies that Using the previous inequality and (2.16), we conclude that
According to the termination criterion in step 4 of procedure , each call to this procedure will reduce the gap between a given upper and lower bound on by a constant factor. In the ABL method described below, we will iteratively call procedure until a certain accurate solution of problem (1.1) is found.
The ABL method:
Input: initial point , tolerance and algorithmic parameter .
0) Set , and . Let .
1) If , terminate;
2) Set and ;
3) Set and go to step 1.
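The gap reduction procedure and the ABL method above can be sketched together in one dimension. This sketch rests on several assumptions that are not spelled out in the surviving text: the feasible set is an interval $[a, b]$, the stepsize is `alpha = 2 / (k + 1)`, the three sequences are combined as `xl = (1 - alpha) * xu + alpha * x` and `cand = alpha * x + (1 - alpha) * xu`, and a phase stops once the gap has shrunk by the factor `lam`. All function names are hypothetical.

```python
def min_model(cuts, a, b):
    """Minimize max_i (g_i * x + c_i) over [a, b] via endpoints and breakpoints."""
    cands = [a, b]
    for i, (g1, c1) in enumerate(cuts):
        for g2, c2 in cuts[i + 1:]:
            if g1 != g2:
                x = (c2 - c1) / (g1 - g2)
                if a <= x <= b:
                    cands.append(x)
    return min(max(g * x + c for g, c in cuts) for x in cands)


def level_interval(cuts, a, b, level):
    """Return [lo, hi] = {x in [a, b] : every cut is at most `level`}."""
    lo, hi = a, b
    for g, c in cuts:
        if g > 0:
            hi = min(hi, (level - c) / g)
        elif g < 0:
            lo = max(lo, (level - c) / g)
    return lo, hi


def gap_reduction(f, df, a, b, x_hat, lb, lam, max_iter=200):
    """One phase: returns (x_plus, lb_plus) with a gap reduced by the factor lam."""
    ub0 = ub = f(x_hat)
    lb0, xu, x, cuts = lb, x_hat, x_hat, []
    for k in range(1, max_iter + 1):
        alpha = 2.0 / (k + 1)                   # assumed stepsize, alpha = 1 at k = 1
        xl = (1.0 - alpha) * xu + alpha * x     # step 1: point used for the new cut
        g = df(xl)
        cuts.append((g, f(xl) - g * xl))
        lb = max(lb, min_model(cuts, a, b))     # step 1: update the lower bound
        level = lam * lb + (1.0 - lam) * ub     # step 2: level between the bounds
        lo, hi = level_interval(cuts, a, b, level)
        x = min(max(x, lo), hi)                 # step 2: project the prox-center
        cand = alpha * x + (1.0 - alpha) * xu   # step 3: candidate for the upper bound
        if f(cand) < ub:
            ub, xu = f(cand), cand
        if ub - lb <= lam * (ub0 - lb0):        # step 4: gap reduced enough
            break
    return xu, lb


def abl_1d(f, df, a, b, p0, eps=1e-6, lam=0.5, max_phases=100):
    g0 = df(p0)
    p = a if g0 > 0 else b                      # minimizer of the linearization at p0
    lb = f(p0) + g0 * (p - p0)                  # initial lower bound
    phases = 0
    while f(p) - lb > eps and phases < max_phases:
        p, lb = gap_reduction(f, df, a, b, p, lb, lam)
        phases += 1
    return p, lb, phases


f = lambda x: (x - 0.3) ** 2                    # smooth test problem
df = lambda x: 2.0 * (x - 0.3)
p_star, lower, phases = abl_1d(f, df, a=-1.0, b=2.0, p0=2.0)
```

The same routine can be called unchanged on a nonsmooth objective such as `abs(x - 0.3)`, which is the practical meaning of uniform optimality: no smoothness parameter is passed to either function.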
Whenever increments by , we say that a phase of the ABL method occurs. Unless explicitly mentioned otherwise, an iteration of procedure is also referred to as an iteration of the ABL method. The main convergence properties of the above ABL method are summarized as follows.
Theorem 2.2. The number of phases performed by the ABL method does not exceed
The total number of iterations performed by the ABL method can be bounded by
Proof. Denote , . Without loss of generality, we assume that , since otherwise the statements are obviously true. Note that by the definition of and , we have
The previous two observations then clearly imply that the number of phases performed by the ABL method is bounded by (2.18). We now bound the total number of iterations performed by the ABL method. Suppose that procedure has been called times for some . It then follows from (2.20) that , , since due to the definition of . Using this observation, we obtain
Moreover, by Theorem 2.1, the total number of iterations performed by the ABL method is bounded by
Our result then immediately follows by combining the above two inequalities.
We now add a few remarks about Theorem 2.2. Firstly, by setting , and in (2.19), respectively, we obtain the optimal iteration complexity for nonsmooth, smooth and weakly smooth convex optimization (see Lan10-3; NemNes85-1; Nest88-1; DeGlNe10-1 for a discussion about the lower complexity bounds for solving these CP problems). Secondly, the ABL method achieves these aforementioned optimal complexity bounds without requiring the input of any smoothness information, such as whether the problem is smooth or not, and the specific values for and in (1.2). To the best of our knowledge, the ABL method seems to be the first uniformly optimal method for solving smooth, nonsmooth and weakly smooth CP problems in the literature. Thirdly, observe that one potential problem for the ABL method is that, as the algorithm proceeds, the model accumulates cutting planes, and the subproblems in procedure become more difficult to solve. We will address this issue in Section 3 by developing a variant of the ABL method.
2.3 Convergence analysis of the ABL gap reduction procedure
Our goal in this subsection is to prove Theorem 2.1, which describes some important convergence properties of procedure . We will first establish three technical results from which Theorem 2.1 immediately follows.
Lemma 1 below shows that the prox-centers for procedure are “close” to each other, in terms of . It follows the standard analysis of the BL method (see, e.g., LNN; BenNem00).
Lemma 1. Let and , , respectively, be computed in step 1 and step 2 of procedure before it terminates. Then the level sets given by have a point in common. As a consequence, we have
where is defined in (2.5).
We have thus shown that for any . Now by (2.7), we have
Summing up the above inequalities and using (2.5), we obtain
which clearly implies (2.22).
The following two technical results will be used in the convergence analysis for a few accelerated bundle-level type methods, including ABL, APL and USL, developed in this paper.
Let be given at the -th iteration, , of an iterative scheme and denote . Also let be defined in (2.2) and suppose that the pair of new search points satisfy that, for some and ,