Tight Bounds for Approximate Carathéodory and Beyond
We give a deterministic nearly-linear time algorithm for approximating any point inside a convex polytope with a sparse convex combination of the polytope’s vertices. Our result provides a constructive proof for the Approximate Carathéodory Problem [Bar15], which states that any point inside a polytope contained in the $\ell_p$ ball of radius $D$ can be approximated to within $\epsilon$ in $\ell_p$ norm by a convex combination of only $O(D^2 p/\epsilon^2)$ vertices of the polytope for $p \geq 2$. We also show that this bound is tight, using an argument based on anti-concentration for the binomial distribution.
Along the way of establishing the upper bound, we develop a technique for minimizing norms over convex sets with complicated geometry; this is achieved by running Mirror Descent on a dual convex function obtained via Sion’s Theorem.
As simple extensions of our method, we then provide new algorithms for submodular function minimization and SVM training. For submodular function minimization we obtain a simplification and (provable) speed-up over Wolfe’s algorithm, the method commonly found to be the fastest in practice. For SVM training, we obtain $O(1/\epsilon^2)$ convergence for arbitrary kernels; each iteration only requires matrix-vector operations involving the kernel matrix, so we overcome the obstacle of having to explicitly store the kernel or compute its Cholesky factorization.
1 Introduction
The (exact) Carathéodory Theorem is a fundamental result in convex geometry which states that any point $u$ in a polytope $P \subseteq \mathbb{R}^d$ can be expressed as a convex combination of $d+1$ vertices of $P$. Recently, Barman [Bar15] proposed an approximate version and showed that it can be used to improve algorithms for computing Nash equilibria in game theory and algorithms for the $k$-densest subgraph in combinatorial optimization. Versions of the Approximate Carathéodory Theorem have been proposed and applied in different settings. Perhaps its most famous incarnation is as Maurey’s Lemma [pisier1980remarques] in functional analysis. It states that if one is willing to tolerate an error of $\epsilon$ in $\ell_p$ norm, $O(D^2 p/\epsilon^2)$ vertices suffice to approximate $u$, where $D$ is the radius of the smallest $\ell_p$ ball enclosing $P$. The key significance of the approximate Carathéodory Theorem is that the bound it provides is dimension-free, and consequently allows us to approximate any point inside the polytope with a sparse convex combination of vertices.
Both Barman’s proof and Maurey’s original proof start from a solution $u = \sum_i x_i v_i$ of the exact Carathéodory problem, interpret the coefficients $x$ of the convex combination as a probability distribution and generate a sparse solution by sampling from the distribution induced by $x$. Concentration inequalities are then used to argue that the average sampled solution is close to $u$ in the $\ell_p$-norm. The proof is clean and elegant, but it leaves two questions: Is randomization really necessary for this proof? And can we bypass a solution to the exact Carathéodory problem and directly compute a solution to the approximate version?
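The sampling argument described above can be illustrated with a short sketch. It is not taken from the paper: the function name `maurey_sample`, the toy instance (points on the unit circle) and the fixed seed are assumptions made purely for illustration, and the error is measured in the $\ell_2$ norm only.

```python
import math
import random

def maurey_sample(vertices, weights, k, seed=0):
    """Sample k vertices i.i.d. from the distribution `weights` and average them.

    Returns the sampled indices and the averaged point. The average is a
    convex combination supported on at most k vertices.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    indices = rng.choices(range(len(vertices)), weights=weights, k=k)
    d = len(vertices[0])
    average = [sum(vertices[i][c] for i in indices) / k for c in range(d)]
    return indices, average

# Toy instance (hypothetical): 20 points on the unit circle, and a target u
# given by an exact convex combination with uniform coefficients.
n = 20
vertices = [(math.cos(2 * math.pi * i / n), math.sin(2 * math.pi * i / n))
            for i in range(n)]
weights = [1.0 / n] * n
u = [sum(w * v[c] for w, v in zip(weights, vertices)) for c in range(2)]

indices, approx = maurey_sample(vertices, weights, k=400)
error = math.hypot(u[0] - approx[0], u[1] - approx[1])
```

Note that the sketch needs the exact coefficients `weights` as input and relies on randomness, which are precisely the two requirements questioned above.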
The second question is motivated by the fact that computing the solution to the exact Carathéodory problem can be costly. In fact, this takes time even if the points are known in advance. The situation becomes even worse for polytopes for which it is not desirable to maintain an explicit representation of all its vertices (e.g. the matching polytope or the matroid base polytope) since there may be exponentially many of them. In this case, even finding the vertices whose convex hull contains becomes significantly more difficult.
Our first contribution addresses those two questions by giving a constructive proof of the approximate Carathéodory Theorem. As a corollary, this gives the first nearly linear time deterministic algorithm for the approximate Carathéodory problem that does not require knowing in advance. Our algorithm runs in iterations, each of which takes linear time.
Our second contribution is to provide a lower bound showing that the factor is tight. This improves upon a lower bound of proved by Barman. Barman’s lower bound is tight up to constant factors for but leaves a significant gap for any . We prove our lower bound by exhibiting a polytope in the radius- ball and a point inside for which all convex combinations of vertices are -far from the in the -norm.
These are in principle the best results one can hope for. We also show that even though the dependence on can’t be improved in general, it can be greatly improved in a special case. If is far away from the boundary of , i.e., if the ball of radius around is contained in , then there exists a solution to the approximate Carathéodory problem with .
In order to achieve the positive results for approximate Carathéodory, we develop a technique for minimizing norms over convex sets with complicated geometry; this is achieved by running Mirror Descent on a dual convex function obtained via Sion’s Theorem. This technique may be of independent interest. To show its potential, we note that simple extensions of our method result in new algorithms for submodular function minimization and SVM training. For submodular function minimization, we obtain a simplification and (provable) speed-up over Wolfe’s algorithm, the method commonly found to be the fastest in practice. For SVM training, we obtain $O(1/\epsilon^2)$ convergence for arbitrary kernels; each iteration only requires matrix-vector operations involving the kernel matrix, so we overcome the obstacle of having to explicitly store the kernel or compute its Cholesky factorization. Next, we elaborate on our technique and then discuss these applications in more detail.
1.1 Our techniques: mirror descent and lower bounds
Our new constructive proof of the approximate Carathéodory Theorem employs a technique from Convex Optimization called Mirror Descent, which is a generalization of Subgradient Descent. Both subgradient and mirror descent are first-order methods that minimize arbitrary convex functions to an additive precision of $\epsilon$ using only information about the subgradient of the function.
In particular, we formulate the approximate Carathéodory problem as:
$$\min_{x \in \Delta} \|Vx - u\|_p$$
where the columns of $V$ are the vertices of $P$ and $\Delta$ is the unit simplex.
Our first instinct is to apply gradient or mirror descent directly to this primal objective. This fails to achieve any sort of sparseness guarantee, since its gradient is generally not sparse and the new iterate would not be either.
Inspired by algorithms for solving positive linear programs such as [PST91, young01], we reformulate our problem as a saddle point problem $\min_{x \in \Delta} \max_{y \in B_q(1)} y^\top (Vx - u)$, where $B_q(1)$ is the unit $\ell_q$-ball. This can be viewed as a zero-sum game. Applying a generalization of the minimax theorem we can obtain a dual convex function.
We apply the mirror descent framework to this dual function. Mirror descent is a framework that needs to be instantiated by the choice of a mirror map, which plays a role similar to that of link functions in Online Learning. We will provide an overview of Mirror Descent in the next section so that the paper is self-contained. But for the reader familiar with Mirror Descent terminology, our mirror map is a truncated version of the squared $\ell_q$-norm, where $q$ is chosen such that $\frac{1}{p} + \frac{1}{q} = 1$.
The analysis is enabled by choosing the right function to optimize and the appropriate mirror map. At a very high level, our algorithm incrementally improves our current choice of $x$ by expanding its support by at most one vertex in each iteration. The desired sparsity of $x$ then follows as we can show that the number of iterations is $O(D^2 p/\epsilon^2)$.
Our lower bound is inspired by a method proposed by Klein and Young [KleinY15] for proving conditional lower bounds on the running time for solving positive linear programs. It again follows from interpreting the Carathéodory problem as a zero-sum game between a maximization and a minimization player. We construct a random instance such that with high probability the minimization player has a dense strategy with value close to zero, but for every sparse support, the maximization player can force the strategy to be -far from zero with high probability. The lower bound follows from taking the union bound over the probabilities and applying the probabilistic method.
1.2 Applications
In the following, we discuss a number of applications of our results and techniques. While the first result is a straightforward use of our improved approximate Carathéodory theorem, the second result is a simple application of the mirror-descent technique, and the third one is a simple application of an extension of the technique to SVMs.
Warm-up: fast rounding in polytopes with linear optimization oracles.
The most direct application of our approach is to efficiently round a point in a polytope whenever it admits a good linear optimization oracle. An obvious such instance is given by the matroid polytope. Given an -element matroid by of rank and a fractional point inside its base polytope, our algorithm produces a sparse distribution over matroid bases such that marginals are approximately preserved in expectation. More specifically, for any , has a support of size , and ; furthermore, computing requires only calls to ’s independence oracle.
Submodular function minimization.
Fujishige’s minimum-norm point algorithm is the method typically employed by practitioners to minimize submodular functions [fujishige2011submodular, bach2010convex]. Traditionally this has been implemented using variants of Wolfe’s algorithm [Wolfe76], which lacked a rigorous convergence analysis (it was only known to converge in an exponential number of steps). Only recently did Chakrabarty, Jain, and Kothari [ChakrabartyJK14] prove the first polynomial time bound for this method, obtaining an algorithm that runs in time , where is the time required to answer a single query to , and is the maximum marginal difference in absolute value.
As our second application, we show that our technique can replace Wolfe’s algorithm in the analysis of [ChakrabartyJK14], obtaining an time algorithm for exact submodular function minimization, and a time algorithm for a -additive approximation. We emphasize that these are not the best theoretical algorithms, but a simplification and a speed-up of the algorithm that is commonly found to be the fastest in practice.
Support vector machines.
Training support vector machines (SVMs) can also be formulated as minimizing a convex function. We show that our technique of converting a problem to a saddle point formulation and solving the dual via Mirror Descent can be applied to the problem of training $\nu$-SVMs. This is based on a formulation introduced by Schölkopf, Smola, Williamson, and Bartlett [ScholkopfSWB00]. Kitamura, Takeda and Iwata [KitamuraTI14] show how SVMs can be trained using Wolfe’s algorithm. Replacing Wolfe’s algorithm by Mirror Descent we obtain an $\epsilon$-approximate solution in time , where is the kernel matrix. Whenever the empirical data belongs to the unit ball, this yields a constant number of iterations for polynomial and RBF kernels. Our method does not need to explicitly store the kernel matrix, since every iteration only requires a matrix-vector multiplication, and the entries of the matrix can be computed on-the-fly as they are needed. In the special case of a linear kernel, each iteration can be implemented in time linear in the input size, yielding a nearly-linear time algorithm for linear SVM training.
1.3 Related work
As previously mentioned, the Approximate Carathéodory Theorem has been independently discovered many times in the past. The earliest record is perhaps due to Novikoff [novikoff1962convergence] in 1962, who showed that the $\ell_2$ version of Approximate Carathéodory can be obtained as a byproduct of the analysis of the Perceptron Algorithm (as pointed out by [blum2015sparse]). Maurey [pisier1980remarques] proves it in the context of functional analysis. We refer to the appendix of [BourgainN13] for the precise statement of Maurey’s lemma as well as a self-contained proof. Farias et al. [farias2012sparse] study it for the special case of the bipartite matching polytope. Barman [Bar15] studies the $\ell_p$ case and provides several applications to game theory and combinatorial optimization.
Related to the Approximate Carathéodory problem is the question studied by Shalev-Shwartz, Srebro and Zhang of minimizing the loss of a linear predictor while bounding the number of features used by the predictor. Their main result implies a gradient-descent based algorithm for the -version of the Approximate Carathéodory Theorem but is only able to produce for . A different optimization approach to Approximate Carathéodory is taken by Garber and Hazan [GarberH13], who solve the optimization problem using Frank-Wolfe methods, also obtaining the version of the result.
Finally, the literature on Mirror Descent is too large to survey, but we refer to the book by Ben-Tal and Nemirovski [Nemirovski] for a comprehensive overview, including a discussion of the squared $\ell_p$-norm mirror map. In Online Learning a variant of this mirror map has been used in Gentile’s $p$-norm algorithms [Gentile03a].
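Given a point $x \in \mathbb{R}^d$, we define its $\ell_p$-norm as $\|x\|_p = \left(\sum_i |x_i|^p\right)^{1/p}$ for $1 \leq p < \infty$ and its $\ell_\infty$ norm by $\|x\|_\infty = \max_i |x_i|$. Given a norm $\|\cdot\|$, we denote by $B_{\|\cdot\|}(r) = \{x \in \mathbb{R}^d : \|x\| \leq r\}$ the ball of radius $r$. For $\ell_p$ and $\ell_\infty$ norms, we denote the balls simply by $B_p(r)$ and $B_\infty(r)$.
Given a norm $\|\cdot\|$, we define its dual norm as $\|y\|_* = \max_{x : \|x\| \leq 1} y^\top x$, in such a way that Hölder’s inequality $y^\top x \leq \|y\|_* \cdot \|x\|$ holds and is tight. The dual norm of the $\ell_p$ norm is the $\ell_q$ norm for $\frac{1}{p} + \frac{1}{q} = 1$.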
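As a small numerical illustration of this duality, the sketch below checks Hölder’s inequality for the $\ell_p$/$\ell_q$ pair and exhibits a vector attaining it with equality. The helper names (`lp_norm`, `dual_exponent`, `holder_tight_witness`) and the test vectors are hypothetical and chosen only for this example.

```python
import math

def lp_norm(x, p):
    """ell_p norm of a list of floats, for finite p >= 1."""
    return sum(abs(c) ** p for c in x) ** (1.0 / p)

def dual_exponent(p):
    """Exponent q with 1/p + 1/q = 1, for p > 1."""
    return p / (p - 1.0)

def holder_tight_witness(x, p):
    """Return y with ||y||_q = 1 and y . x = ||x||_p (x must be nonzero).

    Uses y_i = sign(x_i) |x_i|^(p-1) / ||x||_p^(p-1).
    """
    norm = lp_norm(x, p)
    return [math.copysign(abs(c) ** (p - 1), c) / norm ** (p - 1) for c in x]

x = [3.0, -1.0, 2.0]
p = 3.0
q = dual_exponent(p)            # q = 1.5
y = holder_tight_witness(x, p)  # unit vector in the dual (ell_q) ball
dot = sum(a * b for a, b in zip(x, y))
```

This is the same identity that lets a norm be written as a maximum of linear functions over the dual unit ball.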
Given a vector , let its support represent the number of nonzero coordinates of .
2.2 Approximate Carathéodory problem
The (exact) Carathéodory Theorem is a fundamental result in linear algebra which bounds the maximum number of points necessary to describe a point in the convex hull of a set. More precisely, given a finite set of points $X \subseteq \mathbb{R}^d$ and a point $u \in \mathrm{conv}(X)$, there exist $d+1$ points $v_1, \ldots, v_{d+1}$ in $X$ such that $u \in \mathrm{conv}\{v_1, \ldots, v_{d+1}\}$. On the plane, in particular, every point in the interior of a convex polygon can be written as a convex combination of three of its vertices.
The approximate version of the Carathéodory theorem bounds the number of points necessary to describe a point approximately. Formally, given a norm $\|\cdot\|$, an additive error parameter $\epsilon > 0$ and a set of points $X \subseteq B_{\|\cdot\|}(1) \subseteq \mathbb{R}^d$, for every $u \in \mathrm{conv}(X)$ we want $k$ points $v_1, \ldots, v_k \in X$ such that there exists $u' \in \mathrm{conv}\{v_1, \ldots, v_k\}$ with $\|u - u'\| \leq \epsilon$.
A general result of this type is given by Maurey’s Lemma [pisier1980remarques]. For the case of $\ell_p$ norms, $p \geq 2$, Barman [Bar15] showed that $O(p/\epsilon^2)$ points suffice. A notable aspect of this theorem is that the bound is independent of the dimension of the ambient space.
2.3 Convex functions
We give a brief overview on the theory of convex functions. For a detailed exposition we refer readers to [rockafellar].
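A function $f : Q \to \mathbb{R}$ defined on a convex domain $Q \subseteq \mathbb{R}^d$ is said to be convex if every point $x \in Q$ has a non-empty subgradient $\partial f(x) = \{g \in \mathbb{R}^d : f(y) \geq f(x) + g^\top (y - x) \ \forall y \in Q\}$. Geometrically, this means that a function is convex iff it is the maximum of all its supporting hyperplanes, i.e. $f(x) = \max_{x_0 \in Q,\, g \in \partial f(x_0)} f(x_0) + g^\top (x - x_0)$. When there is a unique element in $\partial f(x)$ we call it the gradient and denote it by $\nabla f(x)$. We will sometimes abuse notation and refer to $\nabla f(x)$ as an arbitrary element of $\partial f(x)$ even when it is not unique.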
Strong convexity and smoothness.
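We say that a function $f$ is $\mu$-strongly convex with respect to norm $\|\cdot\|$ if for all $x, y$ and all subgradients $g \in \partial f(x)$:
$$f(y) - f(x) - g^\top (y - x) \geq \frac{\mu}{2} \|y - x\|^2$$
A function $f$ is said to be $\sigma$-smooth with respect to the norm $\|\cdot\|$ if for all $x, y$ and $g \in \partial f(x)$:
$$f(y) - f(x) - g^\top (y - x) \leq \frac{\sigma}{2} \|y - x\|^2$$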
Bregman divergence and the Hessian.
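Every continuously differentiable convex function $f$ induces a concept of ‘distance’ known as the Bregman divergence: given $x, y$, we define $D_f(y \| x) = f(y) - f(x) - \nabla f(x)^\top (y - x)$ as the second order error when computing $f(y)$ using the linear approximation of $f$ around $x$. The fact that $f$ is convex guarantees $D_f(y \| x) \geq 0$.
If the subgradient of $f$ is unique everywhere, we can define $\mu$-strong convexity and $\sigma$-smoothness with respect to the Bregman divergence, as $D_f(y \| x) \geq \frac{\mu}{2} \|y - x\|^2$ and $D_f(y \| x) \leq \frac{\sigma}{2} \|y - x\|^2$. If $f$ is also twice-differentiable, a simple way to compute its strong convexity and smoothness parameters is by bounding the $\|\cdot\|$-eigenvalues of the Hessian. If $\mu \|z\|^2 \leq z^\top \nabla^2 f(x) z \leq \sigma \|z\|^2$ for all $x$ and $z$, then $f$ is $\mu$-strongly convex and $\sigma$-smooth. This is because:
$$D_f(y \| x) = \int_0^1 (1 - t)\, (y - x)^\top \nabla^2 f(x + t(y - x))\, (y - x)\, dt$$
and $\int_0^1 (1 - t)\, dt = \frac{1}{2}$.
We say that a convex function $f$ is $\rho$-Lipschitz with respect to norm $\|\cdot\|$ if $|f(x) - f(y)| \leq \rho \|x - y\|$ for all $x, y$. Note that $\rho$-Lipschitz continuity amounts to a bound $\|g\|_* \leq \rho$ on the dual norm of the subgradients, since for $g \in \partial f(x)$:
$$f(x) - f(y) \leq g^\top (x - y) \leq \|g\|_* \cdot \|x - y\|$$
It is useful to write a convex function as the maximum of its supporting hyperplanes. One way to do that is using the Fenchel transform. When defining Fenchel transforms, it is convenient to identify a function $f : Q \to \mathbb{R}$ with its extension $\tilde{f} : \mathbb{R}^d \to \mathbb{R} \cup \{\infty\}$ such that $\tilde{f}(x) = f(x)$ for $x \in Q$ and $\tilde{f}(x) = \infty$ otherwise. Given that identification, we can define the Fenchel transform of a function $f$ as the function $f^*$ given by $f^*(z) = \max_{x} z^\top x - f(x)$. If $f$ is convex, the Fenchel transformation is self-invertible, i.e., $(f^*)^* = f$ or equivalently: $f(x) = \max_{z} z^\top x - f^*(z)$. Notice that the previous expression is a way to write any convex function as a maximum over linear functions in $x$ parametrized by $z$. The Fenchel inequality $z^\top x \leq f(x) + f^*(z)$ follows directly from the definition of the Fenchel transform.
When writing a convex function as a maximum of other convex functions (typically linear functions), the Envelope Theorem gives a way to compute derivatives. Its statement is quite intuitive: since gradients are local objects, the gradient of $f$ at a certain point is the gradient of the function being maximized at that point. Formally, if $f(x) = \max_{y} g(x, y)$ where $g(x, y)$ is convex in $x$ for every fixed $y$, and $y^* \in \arg\max_{y} g(x, y)$, then $\nabla_x g(x, y^*) \in \partial f(x)$. A direct application of this theorem is in computing the gradients of the Fenchel dual: $\nabla f^*(z) = \arg\max_{x} z^\top x - f(x)$ and $\nabla f(x) = \arg\max_{z} z^\top x - f^*(z)$.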
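A one-dimensional numerical sketch may help make the Fenchel transform and the envelope theorem concrete. It is an illustration only: the grid-based helper `fenchel_conjugate_1d` is hypothetical, and the example function $f(x) = x^2/2$ (whose Fenchel transform is $f^*(z) = z^2/2$) is chosen because its conjugate is known in closed form.

```python
def fenchel_conjugate_1d(f, z, grid):
    """Approximate f*(z) = max_x (z*x - f(x)) by maximizing over a finite grid.

    Returns the maximum value and the maximizer. By the envelope theorem the
    maximizer is (an approximation of) the gradient of f* at z.
    """
    best_x = max(grid, key=lambda x: z * x - f(x))
    return z * best_x - f(best_x), best_x

def f(x):
    return 0.5 * x * x

grid = [i / 1000.0 for i in range(-5000, 5001)]  # points in [-5, 5]
value, maximizer = fenchel_conjugate_1d(f, 1.5, grid)
# For this f, the closed form gives f*(1.5) = 1.125 with maximizer x = 1.5.
```

The maximizer returned alongside the value plays the role of the gradient of the Fenchel dual in the discussion above.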
Smoothness and strong convexity duality.
Finally, we will use the following duality theorem:
Theorem 2.1. The function $f$ is a $\frac{1}{\sigma}$-strongly convex function with respect to $\|\cdot\|$ if and only if its Fenchel dual $f^*$ is $\sigma$-smooth with respect to $\|\cdot\|_*$.
Here we prove that $\frac{1}{\sigma}$-strong convexity of a function implies $\sigma$-smoothness of its dual, since this is the direction we will use. We refer to [Kakade12, ShalevThesis] for a proof of the converse.
Fix and let . Since is strongly convex, there is a unique maximum, so we can write . Also, . Since the Fenchel transform is self-dual, . In particular, this means that .
Using the strong-convexity of , we can write:
Summing the expressions above and applying Hölder’s inequality, we get:
which implies the smoothness bound:
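The chain of inequalities used in this argument can be sketched as follows. The notation is an assumption made for illustration: $f$ is taken to be $\frac{1}{\sigma}$-strongly convex with respect to $\|\cdot\|$, $z_1, z_2$ are two dual points, and $x_i = \nabla f^*(z_i)$, so that $z_i \in \partial f(x_i)$.

```latex
\begin{align*}
f(x_2) - f(x_1) - z_1^\top (x_2 - x_1) &\ge \tfrac{1}{2\sigma} \|x_2 - x_1\|^2, \\
f(x_1) - f(x_2) - z_2^\top (x_1 - x_2) &\ge \tfrac{1}{2\sigma} \|x_1 - x_2\|^2, \\
\tfrac{1}{\sigma} \|x_1 - x_2\|^2 \le (z_1 - z_2)^\top (x_1 - x_2)
  &\le \|z_1 - z_2\|_* \, \|x_1 - x_2\|, \\
\|\nabla f^*(z_1) - \nabla f^*(z_2)\| &\le \sigma \, \|z_1 - z_2\|_* , \\
f^*(z_2) - f^*(z_1) - \nabla f^*(z_1)^\top (z_2 - z_1)
  &= \int_0^1 \bigl(\nabla f^*(z_1 + t(z_2 - z_1)) - \nabla f^*(z_1)\bigr)^\top (z_2 - z_1)\, dt
  \le \tfrac{\sigma}{2} \|z_2 - z_1\|_*^2 .
\end{align*}
```

The third line is the sum of the first two followed by Hölder’s inequality, and the last line integrates the gradient bound along the segment from $z_1$ to $z_2$.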
2.4 A primer on Mirror Descent
For the sake of completeness, we will present here an elementary exposition of the Mirror Descent Framework, which is used in our proof. For a complete exposition we refer to Nemirovskii [Nemirovski] or Bubeck [Bubeck14].
The goal of Mirror Descent is to minimize a convex function with Lipschitz constant with respect to norm . To motivate Mirror Descent, it is useful to think of dot products as a product of vectors in two different vector spaces, which can be thought as vectors vs linear forms or column vectors vs row vectors. In the spirit of Hölder’s inequality, we can think of as living in the space equipped with norm while lives in equipped with the dual norm . When we approximate , the second term is a dot-product of a vector in the domain , which we call the primal space and measure using norm and a gradient vector, which we call the dual space and measure with dual norm .
Keeping the discussion in the previous paragraph in mind, we can revisit the most intuitive method to minimize convex functions: gradient descent. The gradient descent method consists in following the direction of steepest descent, which is the direction opposite to the gradient. This leads to an iteration of the type: $y_{t+1} = y_t - \eta \nabla f(y_t)$. In the view of primal space and dual space, this iteration suddenly looks strange, because one is summing a primal vector with a dual vector, and these live in different spaces. In some sense, gradient descent for Lipschitz convex functions only makes sense in the $\ell_2$ norm, in which $\|\cdot\| = \|\cdot\|_*$ (see the subgradient descent method in [Nesterov2004]).
This motivated the idea of a map connecting the primal and the dual space. The idea in the mirror descent algorithm is to keep two vectors one in the primal space and one in the dual space. In each iteration we compute , obtaining a dual vector and update:
It is convenient in the analysis to think of this map as the gradient of a convex function . In the usual setup, we define the mirror map, which is a convex function , -strongly convex with respect to . Let be the Fenchel-dual which is a -smooth convex function with respect to by Theorem 2.1.
Notice that is defined as a maximum over linear functions of indexed by . The result known as the envelope theorem states that is the gradient of the linear function maximized at . Therefore: . This in particular implies that since for .
Using the definition of and we can define the Mirror Descent iteration as:
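For concreteness, one common way to write such an iteration (the lazy, or dual-averaging, form of Mirror Descent) is sketched below. The symbols are assumptions made for illustration and may differ in detail from the variant analyzed here: $Q$ is the feasible set, $\omega$ the mirror map, $\omega^*(z) = \max_{y \in Q} z^\top y - \omega(y)$ its Fenchel dual, $\eta > 0$ a step size, $z_t$ the dual iterate and $y_t$ the primal iterate.

```latex
\begin{align*}
z_1 &= 0, \\
y_t &= \nabla \omega^*(z_t) = \arg\max_{y \in Q} \; z_t^\top y - \omega(y), \\
z_{t+1} &= z_t - \eta \, \nabla f(y_t).
\end{align*}
```

Because the maximum defining $\omega^*$ ranges over $Q$, every primal iterate $y_t$ is feasible, and the gradient step itself is taken entirely in the dual space.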
In the setup described above with , then in iterations, it holds that .
The idea of the proof is to bound the growth of using smoothness property of :
By the Fenchel inequality for all . Combining with the previous inequality and re-arranging the terms, we get:
The gradient of corresponds by the envelope theorem to maximizing . Therefore, since , . Substituting in the above expression and using the definition of Bregman divergence, we get:
Rearranging the terms and using that , we obtain:
So for , . ∎
In the conditions of the previous theorem, for , , where
Let . Applying the previous theorem with we get:
where both inequalities follow from convexity of . ∎
3 Nearly linear time deterministic algorithm
In this section, we present a nearly linear time deterministic algorithm for the approximate Carathéodory Problem. Barman’s original proof [Bar15] involves solving the exact Carathéodory problem, i.e. writing $u = \sum_i x_i v_i$, interpreting $x$ as a probability distribution over $X$, sampling $k$ points from $X$ according to $x$ and arguing using concentration bounds (Khintchine inequality to be precise) that the expected $\ell_p$ distance between $u$ and the average of the sampled points is at most $\epsilon$. From an algorithmic point of view, this requires: (i) solving a linear program to compute $x$; (ii) using randomization to sample the points. Our main theorem shows that neither is necessary. There is a nearly linear time deterministic algorithm that does not require a solution to the exact Carathéodory problem.
Our algorithm is based on Mirror Descent. The idea is to formulate the Carathéodory problem as an optimization problem. Inspired by early positive Linear Programming solvers such as the one of Plotkin, Shmoys and Tardos [PST91], we convert this problem to a saddle point problem and then solve the dual using Mirror Descent. Using Mirror Descent to solve the dual guarantees a sparse primal certificate that would act as the desired convex combination.
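Recall that we are given a finite set of points $X = \{v_1, \ldots, v_n\} \subseteq B_p(1)$ and $u \in \mathrm{conv}(X)$. Our goal is to produce a sparse convex combination of the points in $X$ that is $\epsilon$-close to $u$ in the $\ell_p$-norm. Dropping the sparsity constraint for now, we can formulate this problem as:
$$\min_{x \in \Delta} \|Vx - u\|_p \qquad \text{(P-Cara)}$$
where $V$ is a $d \times n$ matrix whose columns are the vectors $v_i$ and $\Delta = \{x \in \mathbb{R}^n_+ : \sum_i x_i = 1\}$ is the unit simplex in $n$ dimensions. We refer to P-Cara as the primal Carathéodory problem. This problem can be converted to a saddle point problem by noting that we can write the $\ell_p$ norm as $\|z\|_p = \max_{y \in B_q(1)} y^\top z$ for $\frac{1}{p} + \frac{1}{q} = 1$. So we can reformulate the problem as:
$$\min_{x \in \Delta} \max_{y \in B_q(1)} y^\top (Vx - u)$$
Sion’s Theorem [Sion58] is a generalization of Von Neumann’s minimax theorem that allows us to swap the order of minimization and maximization for any pair of compact convex sets. This leads to the dual version of the Carathéodory problem:
$$\max_{y \in B_q(1)} \min_{x \in \Delta} y^\top (Vx - u)$$
The function $y \mapsto \min_{x \in \Delta} y^\top (Vx - u)$ is concave, since it is expressed as a minimum over linear functions in $y$ parametrized by $x$. Maximizing a concave function is equivalent to minimizing a convex function. To keep the minimization terminology, which is more standard in optimization, we write:
$$\min_{y \in B_q(1)} f(y) \quad \text{where} \quad f(y) = \max_{x \in \Delta} y^\top (u - Vx)$$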
Sparse solution by solving the dual.
Since $u \in \mathrm{conv}(X)$, there is a vector $x \in \Delta$ such that $u = Vx$. Hence, the optimal value of P-Cara is zero, and so is the optimal value of all the equivalent formulations. Even though we know the optimal value, it makes sense to optimize $f$, since in the process we can obtain an $\epsilon$-approximation in a small number of iterations. If each iteration updates only one coordinate of $x$, then we are guaranteed to obtain an approximation with sparsity equal to the number of iterations. As will become clear in a moment, while the updates of the variable $y$ are not sparse, the dual certificate produced by Mirror Descent will be sparse.
To make this statement precise, consider the gradient of $f$, which can be obtained by an application of the envelope theorem: $\nabla f(y) = u - Vx$ for $x \in \arg\max_{x \in \Delta} y^\top (u - Vx)$. This problem corresponds to maximizing a linear function over the simplex, so the optimal solution is a corner of the simplex. In other words, $\nabla f(y) = u - v_i$ where $i \in \arg\max_i y^\top (u - v_i)$. Finally, we can use the Mirror Descent guarantee in Theorem 2.2 to bound the norm of the average gradient. We make this precise in the proof of the following theorem.
In fact $V$ does not even have to be explicitly given. All we need is to solve $\max_i y^\top (u - v_i)$, i.e., to find a vertex minimizing $y^\top v_i$. When $V$ is explicitly given, this can be done in $O(nd)$ time by picking the best vertex. Sometimes, especially in combinatorial optimization, we have a polytope (whose vertices are $v_1, \ldots, v_n$) represented by its constraints. Our result states that for these alternate formulations, we can still obtain a sparse representation efficiently if we can solve linear optimization problems over it fast. This observation will be important for our application to submodular minimization.
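The overall scheme can be sketched in code for the special case $p = q = 2$. This is a minimal illustration under stated assumptions, not the paper’s general algorithm: the mirror map is taken to be $\frac{1}{2}\|y\|_2^2$ restricted to the unit ball (so the mirror step is a rescaling onto that ball), the dual-averaging form of Mirror Descent is used, the step size $\eta = 1/(2\sqrt{T})$ and iteration count $T = \lceil 4/\epsilon^2 \rceil$ come from the standard analysis of that form, all vertices are assumed to lie in the unit $\ell_2$ ball, and the name `approx_caratheodory_l2` is hypothetical.

```python
import math

def approx_caratheodory_l2(vertices, u, eps):
    """Deterministic sketch for the ell_2 case.

    Assumes every vertex has ell_2 norm at most 1 and u lies in their convex
    hull. Returns a dict {vertex index: weight} describing a convex
    combination supported on at most T = ceil(4 / eps^2) vertices.
    """
    d = len(u)
    T = math.ceil(4.0 / eps ** 2)
    eta = 1.0 / (2.0 * math.sqrt(T))
    z = [0.0] * d                      # dual iterate
    counts = {}
    for _ in range(T):
        # Mirror step: map z back into the unit ell_2 ball.
        norm_z = math.sqrt(sum(c * c for c in z))
        scale = 1.0 / max(1.0, norm_z)
        y = [c * scale for c in z]
        # Linear optimization oracle: vertex minimizing <y, v_i>.
        i = min(range(len(vertices)),
                key=lambda j: sum(a * b for a, b in zip(y, vertices[j])))
        counts[i] = counts.get(i, 0) + 1
        # Subgradient of f at y is u - v_i; take a step in the dual space.
        g = [a - b for a, b in zip(u, vertices[i])]
        z = [a - eta * b for a, b in zip(z, g)]
    return {i: c / T for i, c in counts.items()}

# Toy instance (hypothetical): 50 points on the unit circle, u the average
# of the first 10 of them.
n = 50
vertices = [(math.cos(2 * math.pi * i / n), math.sin(2 * math.pi * i / n))
            for i in range(n)]
u = [sum(vertices[i][c] for i in range(10)) / 10.0 for c in range(2)]
eps = 0.25
weights = approx_caratheodory_l2(vertices, u, eps)
approx = [sum(w * vertices[i][c] for i, w in weights.items()) for c in range(2)]
error = math.hypot(u[0] - approx[0], u[1] - approx[1])
```

Each iteration touches the vertex set only through the linear optimization oracle and adds at most one vertex to the support, so the support size is bounded by the number of iterations. No exact convex decomposition of `u` and no randomness are needed.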
We consider the space equipped with the norm. To apply the Mirror Descent framework, we need first to show that the dual norm (the -norm, in this case) of the gradient is bounded. This is easy, since in the approximate Carathéodory problem, , so . So we can take in Theorem 2.2.
Since and for , then . Also, since can be written as , clearly for all . Plugging those two facts in the guarantee of Theorem 2.2, we get:
Taking the maximum over all we get:
To complete the picture, we need to provide a function a -strongly convex function with a small value of .
For , the function , is -strongly convex with respect to the norm and .
We want to bound for all . For all in the interior of the ball there is a unique subgradient which we represent by . In the border of , however, there are multiple subgradients. First we claim that we need only to bound where denotes the gradient of the function . In order to see that, notice that if is a subgradient in a point and then:
by the definition of subgradient. Dividing the expression by and taking the limit when , we get: , so in particular: .
This observation allows us to bound the strong convexity parameter of by looking at the -eigenvalues of the Hessian of . In particular, we will show that for all , .
To make the notation simpler, we define as . This allows us to represent in a succinct form: since
so we can write . Therefore:
Now, to compute the Hessian, we have:
where is the diagonal matrix with in the diagonal. Using the fact that , we can write:
The last equality is a convoluted re-writing of the previous expression, but allows us to apply Hölder’s inequality. Recall that Hölder’s inequality states that whenever . Applying this inequality with and , we get:
Finally, we need to show how to compute the Fenchel dual and the mirror map efficiently:
The Fenchel dual of the function defined in Proposition 3.6 can be computed explicitly:
Also, where is a vector with -norm such that . This function can be explicitly computed as: .
By the definition of Fenchel duality: