Lazifying Conditional Gradient Algorithms
Conditional gradient algorithms (also often called Frank-Wolfe algorithms) are popular due to their simplicity, requiring only a linear optimization oracle, and more recently they have also gained significant traction for online learning. While simple in principle, in many cases the actual implementation of the linear optimization oracle is costly. We show a general method to lazify various conditional gradient algorithms, which in actual computations leads to several orders of magnitude of speedup in wall-clock time. This is achieved by using a faster separation oracle instead of a linear optimization oracle, relying on only a few linear optimization oracle calls.
Convex optimization is an important technique both from a theoretical and an applications perspective. Gradient descent based methods are widely used due to their simplicity and easy applicability to many real-world problems. We are interested in solving constrained convex optimization problems of the form
$$\min_{x \in P} f(x), \tag{1}$$
where $f$ is a smooth convex function and $P$ is a polytope, with access to $f$ being limited to first-order information, i.e., we can obtain $\nabla f(v)$ and $f(v)$ for a given $v \in P$, and access to $P$ via a linear minimization oracle, which returns $\operatorname{argmin}_{x \in P} c x$ for a given linear objective $c$.
When solving Problem (1) using gradient descent approaches in order to maintain feasibility, typically a projection step is required. This projection back into the feasible region is potentially computationally expensive, especially for complex feasible regions in very large dimensions. As such, projection-free methods gained a lot of attention recently, in particular the Frank-Wolfe algorithm (frank1956algorithm) (also known as conditional gradient descent (levitin1966constrained); see also (jaggi2013revisiting) for an overview) and its online version (hazan2012projection) due to their simplicity. We recall the basic Frank-Wolfe algorithm in Algorithm 1. These methods eschew the projection step and rather use a linear optimization oracle to stay within the feasible region. While convergence rates and regret bounds are often suboptimal, in many cases the gain due to only having to solve a single linear optimization problem over the feasible region in every iteration still leads to significant computational advantages (see e.g., (hazan2012projection, Section 5)). This led to conditional gradient algorithms being used for e.g., online optimization and more generally machine learning. Also the property that these algorithms naturally generate sparse distributions over the extreme points of the feasible region is often helpful. Further increasing the relevance of these methods, it was shown recently that conditional gradient methods can also achieve linear convergence (see e.g., garber2013linearly; FW-converge2015; LDLCC2016) as well as that the number of total gradient evaluations can be reduced while maintaining the optimal number of oracle calls as shown in lan2014conditional.
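The basic Frank-Wolfe iteration recalled above can be illustrated with a minimal sketch. Everything concrete here is an assumption made for the example and not taken from the paper: the feasible region is the probability simplex, the objective is a simple quadratic, the step size is the common choice 2/(t+2), and the names `lp_oracle_simplex` and `frank_wolfe` are hypothetical.

```python
import numpy as np

def lp_oracle_simplex(c):
    """Linear minimization oracle for the probability simplex (assumed region):
    the minimizer of c @ v over the simplex is the unit vector at argmin(c)."""
    v = np.zeros(len(c))
    v[np.argmin(c)] = 1.0
    return v

def frank_wolfe(grad, x0, lp_oracle, num_iters=200):
    """Vanilla Frank-Wolfe sketch: one LP oracle call per iteration, no projection."""
    x = np.asarray(x0, dtype=float)
    for t in range(num_iters):
        v = lp_oracle(grad(x))             # vertex minimizing the linearization
        gamma = 2.0 / (t + 2.0)            # assumed standard step-size rule
        x = (1.0 - gamma) * x + gamma * v  # convex combination stays feasible
    return x

# Usage: minimize f(x) = 0.5 * ||x - b||^2 over the simplex (b itself is feasible).
b = np.array([0.2, 0.5, 0.3])
x_fw = frank_wolfe(lambda x: x - b, np.array([1.0, 0.0, 0.0]), lp_oracle_simplex)
print(x_fw, 0.5 * np.sum((x_fw - b) ** 2))
```

Each iterate is a convex combination of vertices returned by the oracle, which is why feasibility is maintained without a projection step.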
Unfortunately, for complex feasible regions even solving the linear optimization problem might be time-consuming and as such the cost of solving the LP might be non-negligible. This could be the case, e.g., when linear optimization over the feasible region is a hard problem or when solving large-scale optimization or learning problems. As such it is natural to ask the following questions:
Does the linear optimization oracle have to be called in every iteration?
Does one need approximately optimal solutions for convergence?
Can one reuse information across iterations?
We will answer these questions in this work, showing that (1) the LP oracle is not required to be called in every iteration, (2) much weaker guarantees are sufficient, and (3) we can reuse information. To significantly reduce the cost of oracle calls while maintaining identical convergence rates up to small constant factors, we replace the linear optimization oracle by a (weak) separation oracle (Oracle 1), which approximately solves a separation problem within a multiplicative factor and returns improving vertices. We stress that the weak separation oracle is significantly weaker than approximate minimization, which has already been considered in jaggi2013revisiting. In fact, there is no guarantee that the improving vertices returned by the oracle are near the optimal solution to the linear minimization problem. It is this relaxation of dual bounds and approximate optimality that will provide a significant speedup, as we will see later. However, if the oracle does not return an improving vertex (returns false), then this fact can be used to derive a reasonably small dual bound of the form $f(x_t) - f(x^*) \le \nabla f(x_t)(x_t - x^*) \le \Phi_t$ for some $\Phi_t > 0$. While the accuracy $K$ is presented here as a formal argument of the oracle, an oracle implementation might restrict to a fixed value $K > 1$, which often makes implementation easier. We point out that the cases (1) and (2) of the oracle potentially overlap if $K > 1$. This is intentional, and in this case it is unspecified which of the cases the oracle should choose (and it does not matter for the algorithms).
This new oracle encapsulates the smart use of the original linear optimization oracle, even though for some problems it could potentially be implemented directly without relying on a linear programming oracle. Concretely, a weak separation oracle can be realized by a single call to a linear optimization oracle and as such is no more complex than the original oracle. However, it has two important advantages: it allows for caching and early termination. Caching refers to storing previous solutions and first searching among them for one that satisfies the oracle's separation condition. The underlying linear optimization oracle is called only when none of the cached solutions satisfies the condition. Algorithm 2 formalizes this process. Early termination is the technique of stopping the linear optimization algorithm before it finishes, at a stage where a suitable oracle answer can be easily recovered from its internal data; this is clearly an implementation-dependent technique. The two techniques can be combined, e.g., Algorithm 2 could use an early terminating linear oracle or another implementation of the weak separation oracle in line 4.
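The caching idea can be sketched as follows. This is a hedged illustration and not the paper's Algorithm 2: the oracle contract is assumed to be "given an objective `c`, a point `x`, a threshold `phi > 0` and an accuracy `K >= 1`, return a vertex `y` with `c @ (x - y) > phi / K`, or `None` (standing in for false) when `c @ (x - z) <= phi` holds for all feasible `z`". The class name, the `lp_calls` counter and the simplex LP oracle are hypothetical choices for the example.

```python
import numpy as np

def lp_oracle_simplex(c):
    """Assumed LP oracle: minimizes c @ v over the probability simplex."""
    v = np.zeros(len(c))
    v[np.argmin(c)] = 1.0
    return v

class CachedWeakSeparationOracle:
    """Weak separation via a cache of earlier vertices plus an LP fallback."""

    def __init__(self, lp_oracle):
        self.lp_oracle = lp_oracle
        self.cache = []      # vertices returned by earlier LP calls
        self.lp_calls = 0    # how often the expensive oracle was needed

    def __call__(self, c, x, phi, K=1.0):
        # First try the cache: any sufficiently improving vertex will do.
        for y in self.cache:
            if c @ (x - y) > phi / K:
                return y
        # Fall back to a single call of the true LP oracle.
        self.lp_calls += 1
        y = self.lp_oracle(c)
        if c @ (x - y) > phi / K:
            self.cache.append(y)
            return y
        # y minimizes c over the region, so c @ (x - z) <= phi / K <= phi for all z.
        return None

# Usage: the second query is answered from the cache without an LP call.
oracle = CachedWeakSeparationOracle(lp_oracle_simplex)
x = np.array([1.0, 0.0, 0.0])
y1 = oracle(np.array([3.0, 1.0, 2.0]), x, phi=1.0)   # LP call, vertex cached
y2 = oracle(np.array([3.0, 1.5, 2.0]), x, phi=1.0)   # served from the cache
y3 = oracle(np.array([3.0, 1.5, 2.0]), x, phi=5.0)   # no improving vertex exists
print(oracle.lp_calls, y1, y3)
```

The negative answer is valid because the fallback returns a true minimizer: if even that vertex does not improve by more than `phi / K`, no feasible point improves by more than `phi`.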
We call lazification the technique of replacing a linear programming oracle with a much weaker one, and we will demonstrate significant speedups in wall-clock performance (see e.g., Figure LABEL:fig:cacheEffect), while maintaining identical theoretical convergence rates.
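To make the idea of lazification concrete, the following simplified sketch runs a Frank-Wolfe style loop against a weak separation oracle. It is an illustration under stated assumptions, not the paper's algorithm: the halving of the threshold `phi` after a negative answer, the short-step rule based on a known smoothness constant, the simplex as feasible region, and all function names are choices made for this example. For brevity the oracle here makes one LP call and does no caching; any implementation satisfying the same contract could be passed in instead.

```python
import numpy as np

def lp_oracle_simplex(c):
    """Assumed LP oracle: minimizes c @ v over the probability simplex."""
    v = np.zeros(len(c))
    v[np.argmin(c)] = 1.0
    return v

def weak_sep(c, x, phi, K):
    """Assumed contract: a vertex y with c @ (x - y) > phi / K, or None if
    c @ (x - z) <= phi for every feasible z."""
    y = lp_oracle_simplex(c)
    return y if c @ (x - y) > phi / K else None

def lazy_frank_wolfe(grad, x0, weak_sep, phi0, smoothness, K=2.0, num_iters=1000):
    """Lazified Frank-Wolfe sketch: accept any sufficiently improving vertex."""
    x = np.asarray(x0, dtype=float)
    phi = float(phi0)
    for _ in range(num_iters):
        g = grad(x)
        v = weak_sep(g, x, phi, K)
        if v is None:
            phi /= 2.0   # negative answer bounds the dual gap by phi; tighten it
            continue
        d = v - x
        # Short step: minimizes the quadratic upper bound along the segment [x, v].
        gamma = min(1.0, -(g @ d) / (smoothness * (d @ d)))
        x = x + gamma * d
    return x, phi

# Usage: f(x) = 0.5 * ||x - b||^2 has smoothness constant 1.
b = np.array([0.2, 0.5, 0.3])
x_lazy, phi_final = lazy_frank_wolfe(
    lambda x: x - b, np.array([1.0, 0.0, 0.0]), weak_sep, phi0=1.0, smoothness=1.0
)
print(x_lazy, phi_final, 0.5 * np.sum((x_lazy - b) ** 2))
```

The loop never asks for an optimal vertex: a positive answer guarantees progress proportional to `phi / K`, while a negative answer certifies that the current dual gap is at most `phi`.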
To exemplify our approach we provide conditional gradient algorithms employing the weak separation oracle for the standard Frank-Wolfe algorithm as well as the variants in (hazan2012projection; LDLCC2016; garber2013linearly), which have been chosen due to requiring modified convergence arguments that go beyond those required for the vanilla Frank-Wolfe algorithm. Complementing the theoretical analysis we report computational results demonstrating effectiveness of our approach via a significant reduction in wall-clock time compared to their linear optimization counterparts.
There has been extensive work on Frank-Wolfe algorithms and conditional gradient algorithms, so we will restrict ourselves to reviewing the work most closely related to ours. The Frank-Wolfe algorithm was originally introduced in (frank1956algorithm) (also known as conditional gradient descent (levitin1966constrained)) and has been intensely studied, in particular in terms of achieving stronger convergence guarantees as well as affine-invariant versions. We demonstrate our approach for the vanilla Frank-Wolfe algorithm (frank1956algorithm) (see also (jaggi2013revisiting)) as an introductory example. We then consider more complicated variants that require non-trivial changes to the respective convergence proofs to demonstrate the versatility of our approach. This includes the linearly convergent variant via local linear optimization (garber2013linearly) as well as the pairwise conditional gradient variant of LDLCC2016, which is especially efficient in terms of implementation. However, our technique also applies to the Away-Step Frank-Wolfe algorithm, the Fully-Corrective Frank-Wolfe algorithm, the Pairwise Conditional Gradient algorithm, as well as the Block-Coordinate Frank-Wolfe algorithm. Recently, guarantees for arbitrary step-size rules were provided in Freund2016, and an analogous analysis can also be performed for our approach. On the other hand, the analysis of the inexact variants, e.g., with approximate linear minimization, does not apply to our case, as our oracle is significantly weaker than approximate minimization, as pointed out earlier. For more information, we refer the interested reader to the excellent overview in (jaggi2013revisiting) for Frank-Wolfe methods in general as well as FW-converge2015 for an overview with respect to global linear convergence.
It was also recently shown in hazan2012projection that the Frank-Wolfe algorithm can be adjusted to the online learning setting, and in this work we provide a lazy version of this algorithm. Combinatorial convex online optimization has been investigated in a long line of work (see e.g., (kalai2005efficient; audibert2013regret; neu2013efficient)). It is important to note that our regret bounds hold in the structured online learning setting, i.e., our bounds depend on the $\ell_1$-diameter or sparsity of the polytope, rather than its ambient dimension for arbitrary convex functions (see e.g., (cohen2015following; gupta2016solving)). We refer the interested reader to (ocoBook) for an extensive overview.
A key component of the new oracle is the ability to cache and reuse old solutions, which accounts for the majority of the observed speedup. The idea of caching oracle calls was already explored in various other contexts such as cutting plane methods (see e.g., joachims2009cutting) as well as the Block-Coordinate Frank-Wolfe algorithm in shah2015multi; osokin2016minding. Our lazification approach (which uses caching) is, however, much lazier, requiring no multiplicative approximation guarantee; see (osokin2016minding, Proof of Theorem 3, Appendix F) and lacoste2013block for comparison to our setup.
The main technical contribution of this paper is a new approach, whereby instead of finding the optimal solution, the oracle is used only to find a good enough solution or a certificate that such a solution does not exist, both ensuring the desired convergence rate of the conditional gradient algorithms.
Our contribution can be summarized as follows:
Lazifying approach. We provide a general method to lazify conditional gradient algorithms. For this we replace the linear optimization oracle with a weak separation oracle, which allows us to reuse feasible solutions from previous oracle calls, so that in many cases the oracle call can be skipped. In fact, once a simple representation of the underlying feasible region is learned, no further oracle calls are needed. We also demonstrate how parameter-free variants can be obtained.
Lazified conditional gradient algorithms. We exemplify our approach by providing lazy versions of the vanilla Frank-Wolfe algorithm as well as of the conditional gradient methods in (hazan2012projection; garber2013linearly; LDLCC2016).
Weak separation through augmentation. We show in the case of 0/1 polytopes how to implement a weak separation oracle with at most $k$ calls to an augmentation oracle that on input $c$ and $x \in P$ provides either an improving solution $\bar{x} \in P$ with $c\bar{x} < cx$ or ensures optimality, where $k$ denotes the $\ell_1$-diameter of $P$. This is useful when the solution space is sparse.
Computational experiments. We demonstrate computational superiority by extensive comparisons of the weak separation based versions with their original versions. In all cases we report significant speedups in wall-clock time, often of several orders of magnitude.
It is important to note that in all cases, we inherit the same requirements, assumptions, and properties of the baseline algorithm that we lazify. This includes applicable function classes, norm requirements, as well as smoothness and (strong) convexity requirements. We also maintain identical convergence rates up to (small) constant factors.
We briefly recall notation and notions in Section 2 and consider conditional gradient algorithms in Section LABEL:sec:offline-frank-wolfe. In Section LABEL:sec:frank-wolfe-with we consider parameter-free variants of the proposed algorithms, and in Section LABEL:sec:online-fw-error we examine online versions. Finally, in Section LABEL:sec:weak-separate-augment we show a realization of a weak separation oracle with an even weaker oracle in the case of combinatorial problems, and we provide extensive computational results in Section LABEL:sec:experiments.