Learning convex polytopes with margin

Lee-Ad Gottlieb, Ariel University, leead@ariel.ac.il
Eran Kaufman, Ariel University, erankfmn@gmail.com
Aryeh Kontorovich, Ben-Gurion University, karyeh@bgu.ac.il
Gabriel Nivasch, Ariel University, gabrieln@ariel.ac.il
Abstract

We present a near-optimal algorithm for properly learning convex polytopes in the realizable PAC setting from data with a margin. Our first contribution is to identify distinct generalizations of the notion of margin from hyperplanes to polytopes and to understand how they relate geometrically; this result may be of interest beyond the learning setting. Our novel learning algorithm constructs a consistent polytope as an intersection of about $t \log t$ halfspaces in time polynomial in $t$ (where $t$ is the number of halfspaces forming an optimal polytope). This is an exponential improvement over the state of the art (Arriaga and Vempala, 2006). We also improve over the super-polynomial-in-$t$ algorithm of Klivans and Servedio (2008), while achieving a better sample complexity. Finally, we provide the first nearly matching hardness-of-approximation lower bound, whence our claim of near optimality.

1 Introduction

In the theoretical PAC learning setting (Valiant, 1984), one considers an abstract instance space, which, most commonly, is either the Boolean cube $\{0,1\}^d$ or the Euclidean space $\mathbb{R}^d$. For the former setting, an extensive literature has explored the statistical and computational aspects of learning Boolean functions (Angluin, 1992; Hellerstein and Servedio, 2007). Yet for the Euclidean setting, a corresponding theory of learning geometric concepts is still being actively developed (Kwek and Pitt, 1998; Jain and Kinber, 2003; Anderson et al., 2013; Kane et al., 2013). The focus of this paper is the latter setting.

The simplest nontrivial geometric concept is perhaps the halfspace. These concepts are well-known to be hard to agnostically learn (Höffgen et al., 1995) or even approximate (Amaldi and Kann, 1995, 1998; Ben-David et al., 2003). Even the realizable case, while commonly described as “solved” via the Perceptron algorithm or linear programming (LP), is not straightforward: The Perceptron’s runtime is quadratic in the inverse-margin, while solving the consistent hyperplane problem in strongly polynomial time would provide a solution to the general LP problem ((Matoušek and Gärtner, 2006, p. 84) and personal communication from [Anonymous]), a question which has been open for decades (Bárász and Vempala, 2010). Thus, an unconditional (i.e., infinite-precision and independent of data configuration in space) polynomial-time solution for the consistent hyperplane problem hinges on the strongly polynomial LP conjecture.

If we consider not a single halfspace, but polytopes defined by the intersection of multiple halfspaces, the algorithmic and computational bounds rapidly become more pessimistic. Megiddo (1988) showed that the problem of deciding whether two sets of points in $\mathbb{R}^d$ can be separated by the intersection of two hyperplanes is $\mathsf{NP}$-complete, and Khot and Saket (2011) showed that “unless $\mathsf{NP} = \mathsf{RP}$, it is hard to (even) weakly PAC-learn intersection of two halfspaces”, even when allowed the richer hypothesis class of intersections of constantly many halfspaces. Under cryptographic assumptions, Klivans and Sherstov (2009) showed that learning an intersection of $n^{\varepsilon}$ halfspaces is intractable regardless of hypothesis representation.

Our results.

Since the margin assumption is what allows one to find a consistent hyperplane in provably strongly polynomial time, it is natural to seek to generalize this scheme to intersections of $t$ halfspaces. To this end, we identify two distinct notions of polytope margin: These are the $\gamma$-envelope of a convex polytope, defined as the set of all points within distance $\gamma$ of the polytope boundary, and the $\gamma$-margin of the polytope, defined via the $\gamma$-margins of the hyperplanes forming the polytope. (See Figure 2 for an illustration, and Section 2 for precise definitions.) Note that these two objects may exhibit vastly different behaviors, particularly at a sharp intersection of two or more hyperplanes.

It seems to us that the envelope of a polytope is a more natural structure than its margin, yet we find the margin more amenable to the derivation of both VC-bounds (Lemma 1) and algorithms (Theorem 8). Our first contribution is in demonstrating that results derived for margins can be adapted to apply to envelopes as well. This result may be of independent geometric interest. We prove that when confined to the unit ball, the $\gamma$-envelope fully contains within it the $\gamma^2$-margin (Theorem 5), and this implies that statistical and algorithmic results for the latter hold for the former as well.

We then present the central contribution of the paper, improving algorithmic runtimes for computing separating polytopes. The current state of the art is the algorithm of Arriaga and Vempala (2006), whose runtime is exponential in $t$ (where $t$ is the number of halfspaces forming the polytope, and $\gamma$ is their margin). In contrast, we give an algorithm whose runtime has only polynomial dependence on $t$, an exponential improvement in speed (Theorem 8). Complementing our algorithm, we provide the first nearly matching hardness-of-approximation bounds, which demonstrate that an exponential dependence on $\gamma^{-2}$ (but not on $t$!) is unavoidable under standard complexity-theoretic assumptions (Theorem 7).

Related work.

When general convex bodies are considered under the uniform distribution (some distributional restriction is necessary: since the concept class of convex sets has infinite VC-dimension, without distribution assumptions an adversarial distribution can require an arbitrarily large sample size, even in $2$ dimensions (Kearns and Vazirani, 1997)), exponential (in dimension and accuracy) sample-complexity bounds were obtained by Rademacher and Goyal (2009). This may motivate the consideration of convex polytopes, and indeed a number of works have studied the problem of learning convex polytopes, including Hegedüs (1994); Kwek and Pitt (1998); Anderson et al. (2013); Kane et al. (2013); Kantchelian et al. (2014). Hegedüs (1994) examines query-based exact identification of convex polytopes with integer vertices, with runtime polynomial in the number of vertices (which could be exponential in the number of faces (Matoušek, 2002)). Kwek and Pitt (1998) also rely on membership queries (see also references therein regarding prior results, as well as strong positive results in dimension $2$). Anderson et al. (2013) efficiently and approximately recover an unknown simplex from uniform samples inside it. Kane et al. (2013) learn halfspaces under the log-concave distributional assumption.

The recent work of Kantchelian et al. (2014) bears a superficial resemblance to ours, but the two are actually not directly comparable. What they term worst-case margin will indeed correspond to our margin. However, their optimization problem is non-convex, and the solution relies on heuristics without rigorous runtime guarantees. Their generalization bounds exhibit a better dependence on the number of halfspaces than our Lemma 3 ($t$ vs. our $t \log t$). However, the hinge loss appearing in their Rademacher-based bound could be significantly worse than the 0-1 error appearing in our VC-based bound. We stress, however, that the main contribution of our paper is algorithmic rather than statistical.

The works of Arriaga and Vempala (2006) and Klivans and Servedio (2008) are most comparable to ours. They also consider learning polytopes defined by hyperplanes with margins, and give learning algorithms for this problem. (The former paper actually constructs a candidate polytope [proper learning], while the latter constructs a function that approximates the polytope’s behavior, without constructing the polytope itself [improper learning].) The runtime of Arriaga and Vempala (2006) is exponential in $t$, and the runtime of Klivans and Servedio (2008) (in dimension $d$) is super-polynomial in $t$, with correspondingly large sample complexities. In contrast, our algorithm has runtime only polynomial in $t$, with near-optimal dependence on the margin.

2 Preliminaries

Notation.

For $x \in \mathbb{R}^d$, we denote its Euclidean norm by $\|x\|$, and for $k \in \mathbb{N}$, we write $[k] = \{1, \ldots, k\}$. Our instance space is the unit ball in $\mathbb{R}^d$: $B = \{x \in \mathbb{R}^d : \|x\| \le 1\}$. We assume familiarity with the notion of VC-dimension as well as with basic PAC definitions such as generalization error (see, e.g., Kearns and Vazirani (1997)).

Polytopes.

A (convex) polytope $P$ is the convex hull of finitely many points: $P = \operatorname{conv}(\{x_1, \ldots, x_s\})$. Alternatively, it can be defined as an intersection of halfspaces induced by hyperplanes $h_1, \ldots, h_t$, where $h_i = \{x : w_i \cdot x + b_i = 0\}$ with $\|w_i\| = 1$ for each $i \in [t]$:

$$P = \{x \in \mathbb{R}^d : w_i \cdot x + b_i \le 0 \text{ for all } i \in [t]\}. \qquad (1)$$

A hyperplane $h = \{x : w \cdot x + b = 0\}$ is said to classify a point $x$ as positive (resp., negative) with margin $\gamma$ if $w \cdot x + b \ge \gamma$ (resp., $w \cdot x + b \le -\gamma$). Since $\|w\| = 1$, this means that $x$ is $\gamma$-far from the hyperplane $h$, in $\ell_2$ distance.

Margins and envelopes.

We consider two natural ways of extending this notion to polytopes: the $\gamma$-margin and the $\gamma$-envelope. For a polytope $P$ defined by hyperplanes $h_1, \ldots, h_t$ as in (1), we say that $x$ is in the inner $\gamma$-margin of $P$ if $x \in P$ and

$$w_i \cdot x + b_i > -\gamma \quad \text{for some } i \in [t],$$

and that $x$ is in the outer $\gamma$-margin of $P$ if $x \notin P$ and

$$w_i \cdot x + b_i < \gamma \quad \text{for all } i \in [t].$$

Similarly, we say that $x$ is in the outer $\gamma$-envelope of $P$ if $x \notin P$ and $\operatorname{dist}(x, P) < \gamma$, and that $x$ is in the inner $\gamma$-envelope of $P$ if $x \in P$ and $\operatorname{dist}(x, \partial P) < \gamma$.

We call the union of the inner and the outer $\gamma$-margins the $\gamma$-margin of $P$, and we denote it by $M_\gamma(P)$. Similarly, we call the union of the inner and the outer $\gamma$-envelopes the $\gamma$-envelope of $P$, and we denote it by $E_\gamma(P)$.

The two notions are illustrated in Figure 2. As we show in Section 3 below, the inner envelope coincides with the inner margin, but this is not the case for the outer objects: The outer margin always contains the outer envelope, and could be of arbitrarily larger volume.
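To make these definitions concrete, the following minimal Python sketch (illustrative only; the text fixes no particular representation) tests membership in the inner and outer $\gamma$-margins of a toy polytope stored as unit-norm rows $W$ and offsets $b$, as in (1). Envelope membership would additionally require the distance from $x$ to $P$ (a small quadratic program) and is omitted.

```python
import numpy as np

# Toy polytope P = {x : W @ x + b <= 0}, stored as unit-norm rows of W
# and offsets b, as in (1). The specific numbers are arbitrary.
W = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [-1.0, -1.0]])
W = W / np.linalg.norm(W, axis=1, keepdims=True)
b = np.array([-0.5, -0.5, 0.0])

def margin_region(x, gamma):
    """Return 'inner margin', 'outer margin', or 'clear' for the point x."""
    s = W @ x + b                       # signed distances to the hyperplanes
    if np.all(s <= 0):                  # x is inside P
        return "inner margin" if np.any(s > -gamma) else "clear"
    # x is outside P: outer margin means x still lies in P^{+gamma}
    return "outer margin" if np.all(s < gamma) else "clear"

print(margin_region(np.array([0.45, 0.0]), gamma=0.1))   # inner margin
```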

Fat hyperplanes and polytopes.

Binary classification requires a collection of concepts mapping the instance space (in our case, the unit ball $B$ in $\mathbb{R}^d$) to $\{-1, +1\}$. However, given a hyperplane $h = \{x : w \cdot x + b = 0\}$ and a margin $\gamma$, the function given by $x \mapsto \operatorname{sign}(w \cdot x + b)$ partitions $B$ into three regions: positive ($w \cdot x + b \ge \gamma$), negative ($w \cdot x + b \le -\gamma$), and ambiguous ($|w \cdot x + b| < \gamma$). We use a standard device (see, e.g., Hanneke and Kontorovich (2017, Section 4)) of defining an auxiliary instance space $\bar{B} = B \times \{-1, +1\}$ together with the concept class $\bar{H}_\gamma = \{\bar{h}_{w,b}\}$, where, for all $(x, s) \in \bar{B}$,

$$\bar{h}_{w,b}(x, s) = \begin{cases} \operatorname{sign}(w \cdot x + b), & |w \cdot x + b| \ge \gamma, \\ s, & \text{otherwise.} \end{cases}$$

It is shown in (Hanneke and Kontorovich, 2017, Lemma 6) that the following holds. (Such estimates may be found in the literature for homogeneous (i.e., $b = 0$) hyperplanes; see, e.g., Bartlett and Shawe-Taylor (1999, Theorem 4.6). But dealing with polytopes, it is important for us to allow offsets: as discussed in Hanneke and Kontorovich (2017), the standard non-homogeneous to homogeneous conversion can degrade the margin by an arbitrarily large amount, and hence the non-homogeneous case warrants an independent analysis.)

Lemma 1.

The VC-dimension of $\bar{H}_\gamma$ is at most $O(1/\gamma^2)$.

Analogously, we define the concept class $\bar{H}_{\gamma,t}$ of $\gamma$-fat $t$-polytopes as follows. Each $\bar{h}_P \in \bar{H}_{\gamma,t}$ is induced by some polytope $P$ given as an intersection of $t$ halfspaces as in (1). The label of a pair $(x, s) \in \bar{B}$ is determined as follows: If $x$ is in the $\gamma$-margin of $P$, then the pair is labeled $s$, irrespective of $P$. Otherwise, if $x \in P$, the pair is labeled $+1$, or else, if $x \notin P$, the pair is labeled $-1$.
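A short sketch makes the role of the auxiliary bit $s$ explicit (same toy $(W, b)$ representation as in the earlier snippet; the function name is ours):

```python
import numpy as np

def fat_polytope_label(x, s, W, b, gamma):
    """Label of the pair (x, s) under the gamma-fat polytope concept.

    Pairs falling in the gamma-margin are labeled s irrespective of the
    polytope, so any sample whose points are gamma-far from every face
    remains realizable by the fat concept class.
    """
    sd = W @ x + b
    inside = bool(np.all(sd <= 0))
    in_margin = (inside and np.any(sd > -gamma)) or \
                (not inside and np.all(sd < gamma))
    if in_margin:
        return s
    return +1 if inside else -1
```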

Lemma 2.

The VC-dimension of $\bar{H}_{\gamma,t}$ in $d$ dimensions is at most

$$2Dt\log(3t),$$

where $D = \min\{d + 1,\, O(1/\gamma^2)\}$.

Proof.

The VC-dimension of the family of intersections of $t$ concepts, drawn from a concept class of VC-dimension at most $D$, is bounded by $2Dt\log(3t)$ (Blumer et al., 1989, Lemma 3.2.3). Since the class of $d$-dimensional hyperplanes has VC-dimension $d + 1$ (Long and Warmuth, 1994), the family of $t$-polytopes has VC-dimension at most $2(d+1)t\log(3t)$. The second part of the bound is obtained by applying Blumer et al. (1989, Lemma 3.2.3) to the VC bound in Lemma 1. ∎

Generalization bounds.

The following VC-based generalization bounds are well-known; the first one may be found in, e.g., Cristianini and Shawe-Taylor (2000), while the second one may be found in Anthony and Bartlett (1999).

Lemma 3.

Let $\mathcal{H}$ be a concept class with VC-dimension $D$. If a learner $h \in \mathcal{H}$ is consistent on a random sample of size $m$, then with probability at least $1 - \delta$ its generalization error is at most

$$\varepsilon = O\!\left(\frac{D \log(m/D) + \log(1/\delta)}{m}\right).$$

If the learner has empirical error $\hat{\varepsilon}$ on the sample, then with probability at least $1 - \delta$ its generalization error is at most

$$\varepsilon = \hat{\varepsilon} + c\,\sqrt{\frac{D \log(m/D) + \log(1/\delta)}{m}},$$

for some universal constant $c$.

Dimension reduction.

The Johnson-Lindenstrauss (JL) transform takes a set $S$ of $n$ vectors in $\mathbb{R}^d$ and projects them into $k = O(\varepsilon^{-2} \log n)$ dimensions, while preserving all inter-point distances and vector norms up to a factor of $1 \pm \varepsilon$. That is, if $f : \mathbb{R}^d \to \mathbb{R}^k$ is a linear embedding realizing the guarantees of the JL transform on $S$, then for every $x, y \in S$ we have

$$(1 - \varepsilon)\|x - y\| \le \|f(x) - f(y)\| \le (1 + \varepsilon)\|x - y\|,$$

and for every $x \in S$ we have

$$(1 - \varepsilon)\|x\| \le \|f(x)\| \le (1 + \varepsilon)\|x\|.$$

The JL transform can be realized with probability $1 - n^{-c}$ for any constant $c$ by a randomized linear embedding, for example a projection matrix with entries drawn from a normal distribution (Achlioptas, 2003). This embedding is oblivious, in the sense that the matrix can be chosen without knowledge of the set $S$.
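As an illustration, the sketch below draws a Gaussian projection matrix and empirically checks the norm-preservation guarantee; the constant $8$ in the target dimension is one common (not tight) choice, and all other numbers are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, eps = 200, 10_000, 0.5
k = int(np.ceil(8 * np.log(n) / eps**2))    # k = O(eps^-2 log n)

S = rng.normal(size=(n, d))                 # n arbitrary vectors in R^d
A = rng.normal(size=(k, d)) / np.sqrt(k)    # Gaussian JL matrix, oblivious to S

f = S @ A.T                                 # the linear embedding f(x) = Ax
ratios = np.linalg.norm(f, axis=1) / np.linalg.norm(S, axis=1)
print(ratios.min(), ratios.max())           # typically within [1-eps, 1+eps]
```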

It is an easy matter to show that the JL transform can also be used to approximately preserve distances to hyperplanes, as in the following lemma.

Lemma 4.

Let $S$ be a set of $n$ $d$-dimensional vectors in the unit ball, $W$ be a set of normalized vectors ($\|w\| = 1$ for all $w \in W$), and $f$ a linear embedding realizing the guarantees of the JL transform, with parameter $\varepsilon/3$, on the vectors $\{w + x,\ w - x : w \in W,\ x \in S\}$. Then with probability $1 - n^{-c}$ (for any constant $c$), we have for all $x \in S$ and $w \in W$ that

$$|f(w) \cdot f(x) - w \cdot x| \le \varepsilon.$$

Proof.

Let the constant in $k = O(\varepsilon^{-2} \log n)$ be chosen so that the JL transform preserves distances and norms within a factor $1 \pm \varepsilon/3$ for the vectors $\{w \pm x\}$. Since $f$ is linear, the polarization identity together with the guarantees of the JL transform for the chosen value of $k$ gives

$$f(w) \cdot f(x) = \tfrac{1}{4}\left(\|f(w + x)\|^2 - \|f(w - x)\|^2\right) \le \tfrac{1}{4}\left((1 + \varepsilon/3)^2 \|w + x\|^2 - (1 - \varepsilon/3)^2 \|w - x\|^2\right) \le w \cdot x + \varepsilon.$$

A similar argument gives that $f(w) \cdot f(x) \ge w \cdot x - \varepsilon$. ∎
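The lemma is also easy to check empirically; continuing the Gaussian construction above, the following compares an inner product before and after projection (a sanity check, not a proof):

```python
import numpy as np

rng = np.random.default_rng(1)
d, k = 10_000, 2_000
A = rng.normal(size=(k, d)) / np.sqrt(k)            # Gaussian JL matrix

w = rng.normal(size=d); w /= np.linalg.norm(w)      # a normalized vector, as in W
x = rng.normal(size=d); x /= 2 * np.linalg.norm(x)  # a point in the unit ball

print(abs((A @ w) @ (A @ x) - w @ x))  # small: the polarization argument above
                                       # bounds it by eps once k = O(eps^-2 log n)
```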

3 Polytope margin and envelope

In this section, we show that the notions of margin and envelope defined in Section 2 are, in general, quite distinct. Fortunately, when confined to the unit ball , one can be used to approximate the other.

Given two sets $A, C \subseteq \mathbb{R}^d$, their Minkowski sum is given by $A \oplus C = \{a + c : a \in A,\ c \in C\}$, and their Minkowski difference is given by $A \ominus C = \{x : \{x\} \oplus C \subseteq A\}$. Let $B_\gamma$ be a ball of radius $\gamma$ centered at the origin.

Given a polytope $P$ and a real number $\gamma > 0$, let

$$P_{\oplus\gamma} = P \oplus B_\gamma, \qquad P_{\ominus\gamma} = P \ominus B_\gamma.$$

Hence, $P_{\oplus\gamma}$ and $P_{\ominus\gamma}$ are the results of expanding or contracting, in a certain sense, the polytope $P$.

Figure 1: Expansion and contraction of a polytope $P$ by $\gamma$.

Also, let $P^{+\gamma}$ be the result of moving each halfspace defining a facet of $P$ outwards by distance $\gamma$, and similarly, let $P^{-\gamma}$ be the result of moving each such halfspace inwards by distance $\gamma$. Put differently, we can think of the halfspaces defining the facets of $P$ as moving outwards at unit speed, so $P$ expands with time. Then $P^{+\gamma}$ is $P$ at time $\gamma$. See Figure 1.

Observation 1.

We have $P^{-\gamma} = P_{\ominus\gamma}$.

Proof.

Each point in $P^{-\gamma}$ is at distance at least $\gamma$ from each hyperplane containing a facet of $P$; hence, it is at distance at least $\gamma$ from the boundary of $P$, so it is in $P_{\ominus\gamma}$. Now, suppose for a contradiction that there exists a point $x \in P_{\ominus\gamma} \setminus P^{-\gamma}$. Then $x$ is at distance less than $\gamma$ from a point $y \in h$, where $F$ is some facet of $P$ and $h$ is the hyperplane containing $F$. If $y \in F$, this already contradicts $x \in P_{\ominus\gamma}$; otherwise, the segment $\overline{xy}$ must intersect another facet of $P$, again placing $x$ at distance less than $\gamma$ from the boundary of $P$. ∎

However, in the other direction we have only $P_{\oplus\gamma} \subseteq P^{+\gamma}$, and the containment may be strict. Furthermore, the Hausdorff distance between the two sets could be arbitrarily large (see again Figure 1).

Figure 2: The $\gamma$-envelope (left) and $\gamma$-margin (right) of a polytope $P$.

Then the $\gamma$-envelope of $P$ is given by $E_\gamma(P) = P_{\oplus\gamma} \setminus P_{\ominus\gamma}$, and the $\gamma$-margin of $P$ is given by $M_\gamma(P) = P^{+\gamma} \setminus P^{-\gamma}$. See Figure 2.

Since the $\gamma$-margin of $P$ is not contained in the $\gamma$-envelope of $P$, we would like to find some sufficient condition under which, for some $\gamma' < \gamma$, the $\gamma'$-margin of $P$ is contained in the $\gamma$-envelope of $P$. Our solution to this problem is given in the following theorem. Recall that $B$ is the unit ball in $\mathbb{R}^d$.

Theorem 5.

Let $P$ be a polytope, and let $0 < \gamma \le 1$. Suppose that $B_\gamma \subseteq P$. Then, within $B$, the $\gamma^2$-margin of $P$ is contained in the $\gamma$-envelope of $P$; meaning, $M_{\gamma^2}(P) \cap B \subseteq E_\gamma(P)$.

The proof uses the following general observation:

Observation 2.

Let $P(\tau)$ be an expanding polytope whose defining halfspaces move outwards with time $\tau$, each one at its own constant speed. Let $p(\tau)$ be a point that moves in a straight line at constant speed. Suppose $\tau_1 < \tau_2$ are such that $p(\tau_1) \in P(\tau_1)$ and $p(\tau_2) \in P(\tau_2)$. Then $p(\tau) \in P(\tau)$ for all $\tau_1 \le \tau \le \tau_2$ as well.

Proof.

Otherwise, $p$ exits one of the halfspaces and enters it again, which is impossible, since the signed distance from $p$ to each moving hyperplane is a linear function of time. ∎

Proof of Theorem 5.

By Observation 1, the inner $\gamma^2$-margin $P \setminus P^{-\gamma^2} = P \setminus P_{\ominus\gamma^2}$ is contained in the inner $\gamma$-envelope $P \setminus P_{\ominus\gamma}$, so it suffices to show that $P^{+\gamma^2} \cap B \subseteq P_{\oplus\gamma}$. Hence, let $x \in B$ with $x \in P^{+\gamma^2} \setminus P$. Let $s$ be the segment $\overline{ox}$, where $o$ is the origin, and let $z$ be the point where $s$ crosses the boundary of $P$. Let $y$ be the point in $s$ that is at distance $\gamma$ from $x$. Suppose for a contradiction that $y \notin P$. Then $z$ lies strictly between $o$ and $y$, so $|zx| > \gamma$. Consider $P$ as a polytope that expands with time, as above; since $B_\gamma \subseteq P$, each defining hyperplane is at distance at least $\gamma$ from the origin, so the point at which it crosses $s$ moves, within $B$, at speed at most $1/\gamma$. Let $p$ be a point that moves along $s$ at constant speed, such that $p(0) = z$ and $p(\gamma^2) = x$; the speed of $p$ is $|zx|/\gamma^2 > 1/\gamma$. Hence, just after time $0$, $p$ exits the halfspace whose bounding hyperplane passes through $z$, since $p$ advances along $s$ faster than that hyperplane's crossing point; yet $p(\gamma^2) = x \in P^{+\gamma^2}$ is again inside that halfspace. In other words, $p$ exits a defining halfspace and reenters it, contradicting Observation 2. ∎

It follows immediately from Theorem 5 and Lemma 2 that the VC-dimension of the class of $t$-polytopes with envelope $\gamma$ is at most

$$2Dt\log(3t),$$

where $D = \min\{d + 1,\, O(1/\gamma^4)\}$. Likewise, we can approximate the optimal $t$-polytope with envelope $\gamma$ by the algorithms of Theorem 8 (with margin parameter $\gamma^2$).

4 Computing and learning separating polytopes

In this section, we present algorithms to compute and learn $\gamma$-fat $t$-polytopes. We begin with hardness results for this problem, and show that these hardness results justify algorithms with runtime exponential in the dimension or in the square of the reciprocal of the margin, but not exponential in $t$. We then present our algorithms.

4.1 Hardness

We show that computing separating polytopes is $\mathsf{NP}$-hard, and even hard to approximate. We begin with the case of a single hyperplane. The following preliminary lemma builds upon (Amaldi and Kann, 1995, Theorem 10).

Lemma 6.

Given a labelled point set $S \subset \mathbb{R}^n$, let $h$ be a hyperplane that places all positive points of $S$ on its positive side, and maximizes the number of negative points on its negative side; call this quantity $k$. Then for any constant $\epsilon > 0$, it is $\mathsf{NP}$-hard to find a hyperplane $h'$ consistent with all positive points, and which places at least $k/n^{1-\epsilon}$ negative points on the negative side of $h'$. This holds even when the optimal hyperplane has margin exactly $\frac{1}{4\sqrt{k}}$.

Proof.

We reduce from maximum independent set, which is hard to approximate to within a factor of $n^{1-\epsilon}$ (Zuckerman, 2007). Given a graph $G = (V, E)$ on $n$ vertices, for each vertex $v \in V$ place a negative point on the basis vector $e_v$. Now place a positive point at the origin, and for each edge $(u, v) \in E$, place a positive point at $\frac{1}{2}(e_u + e_v)$.

Consider a hyperplane consistent with the positive points and placing some set of negative points on the negative side: These negative points must represent an independent set in $G$, for if $(u, v) \in E$, then by construction the midpoint of $e_u$ and $e_v$ is positive, and so both cannot lie on the negative side of the plane.

Likewise, if $G$ contained an independent set $I$ of size $k$, then we consider the hyperplane defined by the equation $w \cdot x = \frac{3}{4}$, where coordinate $w_v = 1$ if $v \in I$ and $w_v = 0$ otherwise. It is easily verified that the distance from the hyperplane to a negative point $e_v$ with $v \in I$ is $\frac{1}{4\sqrt{k}}$, to the origin is $\frac{3}{4\sqrt{k}}$, and to the other positive points is at least $\frac{1}{4\sqrt{k}}$. ∎
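The reduction is simple to instantiate; the following sketch (helper name ours) builds the point set from a graph:

```python
import numpy as np

def reduction_points(n, edges):
    """Point set from the proof above: a negative point on each basis vector
    e_v, a positive point at the origin, and a positive point at the midpoint
    (e_u + e_v)/2 for each edge. Returns (points, labels), labels in {+1, -1}.
    """
    eye = np.eye(n)
    pos = [np.zeros(n)] + [(eye[u] + eye[v]) / 2.0 for (u, v) in edges]
    neg = [eye[v] for v in range(n)]
    return np.vstack(pos + neg), np.array([+1] * len(pos) + [-1] * len(neg))

# Path on 3 vertices: {0, 2} is independent, and the hyperplane w.x = 3/4
# with w = e_0 + e_2 separates the negatives e_0, e_2 from all positives.
pts, labels = reduction_points(3, [(0, 1), (1, 2)])
```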

Theorem 7.

Given a labelled point set $S \subset \mathbb{R}^n$, let $H$ be a minimum-cardinality collection of $t$ hyperplanes whose intersection partitions $S$ into positive and negative sets. Then for any constant $\epsilon > 0$, it is $\mathsf{NP}$-hard to find a collection of size less than $t n^{1-\epsilon}$ whose intersection partitions $S$ into positive and negative sets. This holds even when all hyperplanes have margin $\frac{1}{4}\sqrt{t/n}$ or greater.

Proof.

The reduction is from minimum coloring, which is hard to approximate within a factor of $n^{1-\epsilon}$ (Zuckerman, 2007). The construction is identical to that of the proof of Lemma 6; each color class is an independent set, and hence corresponds to a hyperplane separating its negative points from all positive points. The only difference is that if one color covers more than $n/t$ vertices, we break it up into a set of smaller colors, each covering at most $n/t$ vertices, which ensures that each resulting hyperplane has margin at least $\frac{1}{4}\sqrt{t/n}$. This increases the total number of colors to at most $2t$. The claim follows. ∎

4.2 Algorithms

Here we present algorithms for computing polytopes, and use them to give an efficient algorithm for learning polytopes. As a consequence of Lemma 6 and Theorem 7, we cannot hope to find in polynomial time even a single hyperplane consistent with the positive points which correctly classifies many negative points, let alone a $t$-polytope consistent with all the data. In fact, by the Exponential Time Hypothesis, we cannot achieve a runtime better than exponential in the dimension or in $\gamma^{-2}$; exponential dependence on $t$, however, is not implied.

In what follows, we give two algorithms inspired by the polytope algorithm presented by Arriaga and Vempala (2006). Both have runtime faster than the algorithm of Arriaga and Vempala (2006), and the second is only polynomial in $t$.

Theorem 8.

Given a labelled point set $S \subset B$ ($|S| = m$) for which some $\gamma$-fat $t$-polytope $P$ correctly separates the positive and negative points (i.e., the polytope is consistent), we can compute the following with high probability:

  1. A consistent $\frac{\gamma}{2}$-fat $t$-polytope in time $m^{O(t \gamma^{-2} \log(1/\gamma))}$.

  2. A consistent $\frac{\gamma}{2}$-fat $O(t \log m)$-polytope in time $m^{O(\gamma^{-2} \log(1/\gamma))}$.

Before proving Theorem 8, we will need a preliminary lemma:

Lemma 9.

Given any $0 < \gamma < 1$, there exists a set $N$ of unit vectors of size $|N| \le (1 + 4/\gamma)^d$ with the following property: For any unit vector $w$, there exists a $w' \in N$ that satisfies $|w \cdot x - w' \cdot x| \le \gamma/2$ for all vectors $x$ with $\|x\| \le 1$. The set can be constructed in time $2^{O(d \log(1/\gamma))}$ with high probability.

This implies that if a set admits a consistent hyperplane with margin $\gamma$, then it admits a consistent hyperplane with normal vector in $N$ and margin at least $\gamma/2$.

Proof.

We take $N$ to be a $\frac{\gamma}{2}$-net of the unit sphere, that is, a set satisfying that every point on the sphere is within distance $\frac{\gamma}{2}$ of some point in $N$. Then $|N| \le (1 + 4/\gamma)^d$ (Vershynin, 2010, Lemma 5.2). For any unit vector $w$ we have for some $w' \in N$ that $\|w - w'\| \le \frac{\gamma}{2}$. Now let $w, w'$ be vectors normal to their respective hyperplanes (with common offset $b$), so that for any vector $x$ satisfying $\|x\| \le 1$, its distances from the respective hyperplanes are $|w \cdot x + b|$ and $|w' \cdot x + b|$. It follows that

$$\big|\, |w \cdot x + b| - |w' \cdot x + b| \,\big| \le |(w - w') \cdot x| \le \|w - w'\| \, \|x\| \le \frac{\gamma}{2}.$$

The net can be constructed by a randomized greedy algorithm. By a coupon-collector analysis, it suffices to construct $O(|N| \log |N|)$ random unit vectors. For example, each can be chosen by sampling each coordinate from a standard normal distribution, and then normalizing the vector. Then the resulting set contains within it a $\frac{\gamma}{2}$-net, which can be extracted in the stated time. ∎
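A sketch of this randomized construction follows (helper name ours; the sample budget follows the coupon-collector heuristic and the $(1 + 2/\gamma)^d$ covering bound, so the construction is exponential in $d$ and is only meant to be run in the reduced dimension):

```python
import numpy as np

def random_net(d, gamma, rng):
    """Greedily extract a gamma-separated set (hence, w.h.p., a gamma-net
    of the unit sphere) from sufficiently many random unit vectors."""
    M = int((1 + 2.0 / gamma) ** d)            # covering-number upper bound
    budget = int(10 * M * np.log(M + 1)) + 1   # coupon-collector oversampling
    net = []
    for _ in range(budget):
        v = rng.normal(size=d)                 # spherically symmetric direction
        v /= np.linalg.norm(v)
        if all(np.linalg.norm(v - u) > gamma for u in net):
            net.append(v)                      # v starts a new net cell
    return np.array(net)

net = random_net(d=3, gamma=0.5, rng=np.random.default_rng(2))
```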

Proof of Theorem 8.

We first apply the Johnson-Lindenstrauss transform to reduce the dimension of the points in $S$ to $k = O(\gamma^{-2} \log m)$, achieving the guarantees of Lemma 4 with parameter $\frac{\gamma}{8}$. We then restrict our attention to the $k$-dimensional hyperplanes induced by the net of Lemma 9 with parameter $\frac{\gamma}{4}$; the size of this net is $2^{O(k \log(1/\gamma))} = m^{O(\gamma^{-2} \log(1/\gamma))}$. Then for each $d$-dimensional hyperplane $h$ forming the original $\gamma$-fat $t$-polytope, there is a $k$-dimensional hyperplane $h'$ of the net satisfying the following for every $x \in S$: if $h$ classifies $x$ with margin $\gamma$, then $h'$ classifies the image of $x$ the same way, with margin at least $\frac{\gamma}{2}$. Given this $h'$, we can recover an approximation to $h$ thus: Let $S' \subseteq S$ include only those points whose image lies at distance $\frac{\gamma}{2}$ or more from $h'$, from which it follows that $h$ classifies $S'$ correctly. We then run the Perceptron algorithm on $S'$ (labeled by the sides of $h'$) in time $O(m d \gamma^{-2})$, and find a hyperplane $\hat{h}$ consistent with $h'$ on all points whose image is at distance $\frac{\gamma}{2}$ or more from $h'$. We will refer to $\hat{h}$ as the $d$-dimensional mirror of $h'$.

Having enumerated all vectors in the net $N$ and computed their $d$-dimensional mirrors, we enumerate all possible $t$-polytopes by taking all combinations of $t$ mirror hyperplanes, in total time

$$|N|^t \cdot \operatorname{poly}(m, t) = m^{O(t \gamma^{-2} \log(1/\gamma))},$$

and choose the best one consistent with $S$. The first part of the theorem follows.

Alternatively, we may give a greedy algorithm with better runtime: First note that as the intersection of the $t$ optimal hyperplanes correctly classifies all points, the best hyperplane among them correctly classifies at least a $\frac{1}{t}$-fraction of the negative points with margin $\gamma$. Hence it suffices to compute the mirror hyperplane which is consistent with all positive points and maximizes the number of correctly placed negative points, all with margin $\frac{\gamma}{2}$. We choose this hyperplane, remove from $S$ the correctly placed negative points, and iteratively search for the next best hyperplane. After $c t \ln m$ iterations (for an appropriate constant $c$), the number of remaining negative points is at most

$$m \left(1 - \frac{1}{t}\right)^{c t \ln m} < 1,$$

and the algorithm terminates. ∎
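The greedy step is the heart of the second algorithm. The sketch below (interface ours) assumes the candidate mirror hyperplanes have already been computed, and repeatedly selects the halfspace that is consistent with the positive points and cuts off the most surviving negatives with margin:

```python
import numpy as np

def greedy_polytope(candidates, X_pos, X_neg, gamma):
    """Pick halfspaces (w, b) from `candidates` until every negative point
    is placed gamma-outside some chosen face, keeping all positive points
    gamma-inside every chosen face."""
    chosen, remaining = [], X_neg
    while len(remaining) > 0:
        best, covered = None, 0
        for (w, b) in candidates:
            if np.any(X_pos @ w + b > -gamma):   # inconsistent with positives
                continue
            cut = int(np.sum(remaining @ w + b >= gamma))
            if cut > covered:
                best, covered = (w, b), cut
        if covered == 0:
            raise ValueError("no consistent candidate cuts a negative point")
        chosen.append(best)
        remaining = remaining[remaining @ best[0] + best[1] < gamma]
    return chosen
```

By the argument above, when the candidates approximate the faces of a consistent $\gamma$-fat $t$-polytope, each round cuts off at least a $\frac{1}{t}$-fraction of the surviving negatives, so the loop ends after $O(t \log m)$ rounds.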

Having given an algorithm to compute $\gamma$-fat $t$-polytopes, we can now give an efficient algorithm to learn them. We sample $m$ points, and use the second item of Theorem 8 to find a $\frac{\gamma}{2}$-fat $O(t \log m)$-polytope consistent with the sample. By Lemma 2, this class of polytopes has VC-dimension $\tilde{O}(t/\gamma^2)$. The size of $m$ is chosen according to Lemma 3, and we conclude:

Theorem 10.

There exists an algorithm that learns $\gamma$-fat $t$-polytopes with sample complexity

$$m = \tilde{O}\!\left(\frac{1}{\varepsilon}\left(\frac{t}{\gamma^2} + \log\frac{1}{\delta}\right)\right)$$

in time $m^{O(\gamma^{-2} \log(1/\gamma))}$, where $\varepsilon$ is the desired accuracy and $\delta$ is the desired confidence level.

The stated runtime is only polynomial in $t$, improving over that of Arriaga and Vempala (2006), whose algorithm was exponential in $t$, and over that of Klivans and Servedio (2008), whose algorithm was super-polynomial in $t$ (and also had worse sample complexity).

References
