Learning convex polytopes with margin
Abstract
We present a near-optimal algorithm for properly learning convex polytopes in the realizable PAC setting from data with a margin. Our first contribution is to identify distinct generalizations of the notion of margin from hyperplanes to polytopes and to understand how they relate geometrically; this result may be of interest beyond the learning setting. Our novel learning algorithm constructs a consistent polytope as an intersection of about $t \log m$ halfspaces in time polynomial in $t$ (where $t$ is the number of halfspaces forming an optimal polytope and $m$ is the sample size). This is an exponential improvement over the state of the art (Arriaga and Vempala, 2006). We also improve over the super-polynomial-in-$t$ algorithm of Klivans and Servedio (2008), while achieving a better sample complexity. Finally, we provide the first nearly matching hardness-of-approximation lower bound, whence our claim of near optimality.
1 Introduction
In the theoretical PAC learning setting (Valiant, 1984), one considers an abstract instance space, which, most commonly, is either the Boolean cube $\{0,1\}^d$ or the Euclidean space $\mathbb{R}^d$. For the former setting, an extensive literature has explored the statistical and computational aspects of learning Boolean functions (Angluin, 1992; Hellerstein and Servedio, 2007). Yet for the Euclidean setting, a corresponding theory of learning geometric concepts is still being actively developed (Kwek and Pitt, 1998; Jain and Kinber, 2003; Anderson et al., 2013; Kane et al., 2013). The focus of this paper is the latter setting.
The simplest nontrivial geometric concept is perhaps the halfspace. These concepts are well-known to be hard to learn agnostically (Höffgen et al., 1995) or even to approximate (Amaldi and Kann, 1995, 1998; Ben-David et al., 2003). Even the realizable case, while commonly described as “solved” via the Perceptron algorithm or linear programming (LP), is not straightforward: The Perceptron’s runtime is quadratic in the inverse margin, while solving the consistent hyperplane problem in strongly polynomial time would provide a solution to the general LP problem ((Matoušek and Gärtner, 2006, p. 84) and personal communication from [Anonymous]), a question which has been open for decades (Bárász and Vempala, 2010). Thus, an unconditional (i.e., infinite-precision and independent of the data configuration in space) polynomial-time solution for the consistent hyperplane problem hinges on the strongly polynomial LP conjecture.
If we consider not a single halfspace, but polytopes defined by the intersection of multiple halfspaces, the algorithmic and computational bounds rapidly become more pessimistic. Megiddo (1988) showed that the problem of deciding whether two sets of points in general dimension can be separated by the intersection of two hyperplanes is NP-complete, and Khot and Saket (2011) showed that “unless $\mathrm{NP} = \mathrm{RP}$, it is hard to (even) weakly PAC-learn intersection of two halfspaces”, even when allowed the richer hypothesis class of intersections of any constant number of halfspaces. Under cryptographic assumptions, Klivans and Sherstov (2009) showed that learning an intersection of $n^{\varepsilon}$ halfspaces is intractable regardless of hypothesis representation.
Our results.
Since the margin assumption is what allows one to find a consistent hyperplane in provably strongly polynomial time, it is natural to seek to generalize this scheme to intersections of halfspaces. To this end, we identify two distinct notions of polytope margin: These are the envelope of a convex polytope, defined as the set of all points within distance $\gamma$ of the polytope's boundary, and the margin of the polytope, defined as the intersection of the margins of the hyperplanes forming the polytope. (See Figure 2 for an illustration, and Section 2 for precise definitions.) Note that these two objects may exhibit vastly different behaviors, particularly at a sharp intersection of two or more hyperplanes.
It seems to us that the envelope of a polytope is a more natural structure than its margin, yet we find the margin more amenable to the derivation of both VC bounds (Lemma 1) and algorithms (Theorem 8). Our first contribution is in demonstrating that results derived for margins can be adapted to apply to envelopes as well. This result may be of independent geometric interest. We prove that, when confined to the unit ball, the envelope fully contains within it the margin (Theorem 5), and this implies that statistical and algorithmic results for the latter hold for the former as well.
We then present the central contribution of the paper, improving algorithmic runtimes for computing separating polytopes. The current state of the art is the algorithm of Arriaga and Vempala (2006), whose runtime is exponential in $t$ (where $t$ is the number of halfspaces forming the polytope, and $\gamma$ is their margin). In contrast, we give an algorithm whose runtime has only polynomial dependence on $t$, an exponential improvement in speed (Theorem 8). Complementing our algorithm, we provide the first nearly matching hardness-of-approximation bounds, which demonstrate that an exponential dependence on $1/\gamma^2$ (but not on $t$!) is unavoidable under standard complexity-theoretic assumptions (Theorem 7).
Related work.
When general convex bodies are considered under the uniform distribution, exponential (in dimension and accuracy) sample-complexity bounds were obtained by Rademacher and Goyal (2009). (Since the concept class of convex sets has infinite VC-dimension, without distribution assumptions an adversarial distribution can require an arbitrarily large sample size, even in two dimensions (Kearns and Vazirani, 1997).) This may motivate the consideration of convex polytopes, and indeed a number of works have studied the problem of learning convex polytopes, including Hegedüs (1994); Kwek and Pitt (1998); Anderson et al. (2013); Kane et al. (2013); Kantchelian et al. (2014). Hegedüs (1994) examines query-based exact identification of convex polytopes with integer vertices, with runtime polynomial in the number of vertices (which could be exponential in the number of faces (Matoušek, 2002)). Kwek and Pitt (1998) also rely on membership queries (see also references therein regarding prior results, as well as strong positive results in dimension two). Anderson et al. (2013) efficiently and approximately recover an unknown simplex from uniform samples inside it. Kane et al. (2013) learn halfspaces under a log-concave distributional assumption.
The recent work of Kantchelian et al. (2014) bears a superficial resemblance to ours, but the two are not directly comparable. What they term the worst-case margin corresponds to our notion of margin. However, their optimization problem is non-convex, and their solution relies on heuristics without rigorous runtime guarantees. Their generalization bounds exhibit a better dependence on the number of halfspaces $t$ than our Lemma 3, but the hinge loss appearing in their Rademacher-based bound could be significantly worse than the 0-1 error appearing in our VC-based bound. We stress, however, that the main contribution of our paper is algorithmic rather than statistical.
The works of Arriaga and Vempala (2006) and Klivans and Servedio (2008) are most comparable to ours. They also consider learning polytopes defined by hyperplanes with margins, and give learning algorithms for this problem. (The former paper actually constructs a candidate polytope [proper learning], while the latter constructs a function that approximates the polytope’s behavior, without constructing the polytope itself [improper learning].) The runtime of Arriaga and Vempala (2006) is exponential in $t$, and that of Klivans and Servedio (2008) is super-polynomial in $t$, while our algorithm has runtime only polynomial in $t$, with near-optimal dependence on the margin, and a better sample complexity.
2 Preliminaries
Notation.
For $x \in \mathbb{R}^d$, we denote its Euclidean norm by $\|x\|$, and for $k \in \mathbb{N}$, we write $[k] = \{1, \ldots, k\}$. Our instance space is the unit ball in $\mathbb{R}^d$: $B = \{x \in \mathbb{R}^d : \|x\| \le 1\}$. We assume familiarity with the notion of VC-dimension as well as with basic PAC definitions such as generalization error (see, e.g., Kearns and Vazirani (1997)).
Polytopes.
A (convex) polytope $P$ is the convex hull of finitely many points: $P = \mathrm{conv}(x_1, \ldots, x_k)$. Alternatively, it can be defined by $t$ hyperplanes $(w_i, b_i)$, where $\|w_i\| = 1$ for each $i \in [t]$:

$P = \{x \in \mathbb{R}^d : w_i \cdot x \le b_i \ \text{ for all } i \in [t]\}.$ (1)
A hyperplane $(w, b)$ with $\|w\| = 1$ is said to classify a point $x$ as positive (resp., negative) with margin $\gamma$ if $w \cdot x \le b - \gamma$ (resp., $w \cdot x \ge b + \gamma$). Since $\|w\| = 1$, this means that $x$ is $\gamma$-far from the hyperplane $\{z : w \cdot z = b\}$, in Euclidean distance.
Margins and envelopes.
We consider two natural ways of extending this notion to polytopes: the margin and the envelope. For a polytope $P$ defined by hyperplanes $(w_1, b_1), \ldots, (w_t, b_t)$ as in (1), we say that $x$ is in the inner margin of $P$ if $x \in P$ and
$w_i \cdot x > b_i - \gamma \quad \text{for some } i \in [t],$
and that $x$ is in the outer margin of $P$ if $x \notin P$ and
$w_i \cdot x \le b_i + \gamma \quad \text{for all } i \in [t].$
Similarly, we say that $x$ is in the outer envelope of $P$ if $x \notin P$ and $x$ is within distance $\gamma$ of $P$, and that $x$ is in the inner envelope of $P$ if $x \in P$ and $x$ is within distance $\gamma$ of the boundary of $P$.
We call the union of the inner and the outer margins the margin of $P$, and we denote it by $M_\gamma(P)$. Similarly, we call the union of the inner and the outer envelopes the envelope of $P$, and we denote it by $E_\gamma(P)$.
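To make the distinction concrete, here is a minimal numerical sketch (ours, not from the paper) for the square $[-1, 1]^2$ with $\gamma = 0.2$; the face normals `W`, offsets `b`, and function names are illustrative. Near a corner, a point can lie in the margin yet outside the envelope:

```python
import numpy as np

# Margin vs. envelope membership for the axis-aligned square [-1, 1]^2
# (our sketch; the face normals W, offsets b, and gamma = 0.2 are ours).
W = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
b = np.ones(4)
gamma = 0.2

def in_margin(x):
    # Between the inward- and outward-shifted faces.
    return bool(np.all(W @ x <= b + gamma) and not np.all(W @ x <= b - gamma))

def dist_to_box(x):
    # Euclidean distance to the square (clamping gives the nearest point).
    return float(np.linalg.norm(x - np.clip(x, -1.0, 1.0)))

def in_envelope(x):
    inner = dist_to_box(x) == 0.0 and bool(np.any(W @ x > b - gamma))
    outer = 0.0 < dist_to_box(x) <= gamma
    return inner or outer

corner = np.array([1.18, 1.18])   # diagonally just past the corner
print(in_margin(corner))          # True: each face constraint is within gamma
print(in_envelope(corner))        # False: distance to the square is ~0.25 > gamma
```

The disagreement is exactly the corner effect discussed above: each face constraint is violated by less than $\gamma$, yet the Euclidean distance to the polytope exceeds $\gamma$.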
Fat hyperplanes and polytopes.
Binary classification requires a collection of concepts mapping the instance space (in our case, the unit ball $B$ in $\mathbb{R}^d$) to $\{-1, 1\}$. However, given a hyperplane $(w, b)$ and a margin $\gamma$, the induced classifier partitions $B$ into three regions: positive ($w \cdot x \le b - \gamma$), negative ($w \cdot x \ge b + \gamma$), and ambiguous (all remaining $x$). We use a standard device (see, e.g., Hanneke and Kontorovich (2017, Section 4)) of defining an auxiliary instance space $B \times \{-1, 1\}$ together with the concept class $\tilde{\mathcal{H}}_\gamma$ of fat hyperplanes, where, for all $(x, y) \in B \times \{-1, 1\}$,
$\tilde{h}_{w,b}(x, y) = \begin{cases} 1 & \text{if } y(b - w \cdot x) \ge \gamma, \\ -1 & \text{otherwise.} \end{cases}$
It is shown in (Hanneke and Kontorovich, 2017, Lemma 6) that:^2 (^2 Such estimates may be found in the literature for homogeneous (i.e., $b = 0$) hyperplanes (see, e.g., Bartlett and Shawe-Taylor (1999, Theorem 4.6)), but when dealing with polytopes, it is important for us to allow offsets. As discussed in Hanneke and Kontorovich (2017), the standard non-homogeneous to homogeneous conversion can degrade the margin by an arbitrarily large amount, and hence the non-homogeneous case warrants an independent analysis.)
Lemma 1.
The VC-dimension of $\tilde{\mathcal{H}}_\gamma$ is at most $O(1/\gamma^2)$.
Analogously, we define the concept class $\tilde{\mathcal{P}}_{t,\gamma}$ of fat polytopes as follows. Each concept in $\tilde{\mathcal{P}}_{t,\gamma}$ is induced by some polytope $P$ given as an intersection of $t$ hyperplanes as in (1). The label of a pair $(x, y)$ is determined as follows: If $x$ is in the margin of $P$, then the pair is labeled $-1$ irrespective of $y$. Otherwise, if $x \in P$ and $y = 1$, or else $x \notin P$ and $y = -1$, then the pair is labeled $1$. Otherwise, the pair is labeled $-1$.
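The labeling rule can be written as a short sketch under the halfspace convention $P = \{x : w_i \cdot x \le b_i,\ \|w_i\| = 1\}$ used here; the function name and array encoding are ours:

```python
import numpy as np

# Labeling rule of a fat polytope concept on auxiliary pairs (x, y),
# y in {-1, +1}, under the convention P = {x : w_i . x <= b_i};
# the function name and encoding are ours.
def fat_polytope_label(W, b, gamma, x, y):
    slack = b - W @ x               # slack >= 0 on every face  <=>  x in P
    if np.all(slack >= -gamma) and np.min(slack) < gamma:
        return -1                   # x falls in the margin of P
    inside = bool(np.all(slack >= 0))
    return 1 if (inside and y == 1) or (not inside and y == -1) else -1

# Example: the unit square with gamma = 0.2.
W = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
b = np.ones(4)
print(fat_polytope_label(W, b, 0.2, np.array([0.9, 0.0]), 1))   # -1 (in the margin)
```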
Lemma 2.
The VC-dimension of $\tilde{\mathcal{P}}_{t,\gamma}$ in $d$ dimensions is at most
$O\bigl(t \log t \cdot \min\{d, 1/\gamma^2\}\bigr).$
Proof.
The VC-dimension of the family of intersections of $t$ concept classes, each of VC-dimension at most $v$, is $O(t v \log t)$ (Blumer et al., 1989, Lemma 3.2.3). Since the class of $d$-dimensional hyperplanes has VC-dimension $d + 1$ (Long and Warmuth, 1994), the family of polytopes with $t$ faces has VC-dimension $O(t d \log t)$. The second part of the bound is obtained by applying Blumer et al. (1989, Lemma 3.2.3) to the $O(1/\gamma^2)$ bound of Lemma 1.
∎
Generalization bounds.
The following VC-based generalization bounds are well-known; the first one may be found in, e.g., Cristianini and Shawe-Taylor (2000), and the second in Anthony and Bartlett (1999).
Lemma 3.
Let $\mathcal{H}$ be a concept class with VC-dimension $v$. If a learner is consistent on a random sample of size $m$, then with probability at least $1 - \delta$ its generalization error is at most
$\frac{c}{m}\Bigl(v \log m + \log\frac{1}{\delta}\Bigr).$
If the learner has empirical error $\hat{\varepsilon}$ on the sample, then with probability at least $1 - \delta$ its generalization error is at most
$\hat{\varepsilon} + c\sqrt{\frac{v \log m + \log(1/\delta)}{m}}$
for some universal constant $c$.
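For concreteness, inverting the first (consistent-case) bound gives the familiar sample-size form; this is a standard manipulation, stated up to unspecified universal constants, writing $v$ for the VC-dimension, $m$ for the sample size, and $\varepsilon, \delta$ for the target error and confidence:

```latex
\frac{c}{m}\Bigl(v \log m + \log\tfrac{1}{\delta}\Bigr) \;\le\; \varepsilon
\qquad\text{once}\qquad
m \;=\; \Theta\!\Bigl(\frac{1}{\varepsilon}\Bigl(v \log\frac{1}{\varepsilon} + \log\frac{1}{\delta}\Bigr)\Bigr).
```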
Dimension reduction.
The Johnson-Lindenstrauss (JL) transform takes a set $S$ of $m$ vectors in $\mathbb{R}^d$ and projects them into $k = O(\varepsilon^{-2} \log m)$ dimensions, while preserving all interpoint distances and vector norms up to a multiplicative factor of $1 \pm \varepsilon$. That is, if $\Pi$ is a linear embedding realizing the guarantees of the JL transform on $S$, then for every $x, y \in S$ we have
$(1 - \varepsilon)\|x - y\| \le \|\Pi x - \Pi y\| \le (1 + \varepsilon)\|x - y\|,$
and for every $x \in S$ we have
$(1 - \varepsilon)\|x\| \le \|\Pi x\| \le (1 + \varepsilon)\|x\|.$
The JL transform can be realized with probability $1 - m^{-c}$ for any constant $c$ by a randomized linear embedding, for example a projection matrix with entries drawn from a normal distribution (Achlioptas, 2003). This embedding is oblivious, in the sense that the matrix can be chosen without knowledge of the set $S$.
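A minimal numerical sketch (ours, not from the paper) of this guarantee; the constant 16 in the choice of $k$ and the parameter values are arbitrary conservative choices:

```python
import numpy as np

# Project m unit vectors from d to k = O(log(m)/eps^2) dimensions with a
# Gaussian matrix and measure the worst pairwise-distance distortion.
rng = np.random.default_rng(0)
m, d, eps = 50, 1000, 0.5
k = int(16 * np.log(m) / eps**2)   # constant 16 is a conservative choice

X = rng.normal(size=(m, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # points on the unit sphere
Pi = rng.normal(size=(k, d)) / np.sqrt(k)       # oblivious linear embedding
Y = X @ Pi.T

worst = 0.0
for i in range(m):
    for j in range(i + 1, m):
        orig = np.linalg.norm(X[i] - X[j])
        proj = np.linalg.norm(Y[i] - Y[j])
        worst = max(worst, abs(proj / orig - 1.0))
print(worst)   # typically well below eps
```

Note that `Pi` is drawn without looking at `X`, which is exactly the obliviousness property used later in the learning algorithm.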
It is an easy matter to show that the JL transform can also be used to approximately preserve distances to hyperplanes, as in the following lemma.
Lemma 4.
Let $S$ be a set of $m$ $d$-dimensional vectors in the unit ball, let $W$ be a set of $\mathrm{poly}(m)$ normalized vectors with $\|w\| = 1$ for all $w \in W$, and let $\Pi$ be a linear embedding into $k = O(\varepsilon^{-2} \log m)$ dimensions realizing the guarantees of the JL transform on $S \cup W$. Then for any $\varepsilon \in (0, 1)$, and with probability $1 - m^{-c}$ (for any constant $c$), we have for all $x \in S$ and $w \in W$ that
$|\Pi w \cdot \Pi x - w \cdot x| \le \varepsilon.$
Proof.
Let the constant in $k = O(\varepsilon^{-2} \log m)$ be chosen so that the JL transform preserves the norms $\|w \pm x\|$ within a factor of $1 \pm \varepsilon/4$. By the guarantees of the JL transform for the chosen value of $k$, we have that
$\Pi w \cdot \Pi x = \tfrac{1}{4}\bigl(\|\Pi w + \Pi x\|^2 - \|\Pi w - \Pi x\|^2\bigr) \le \tfrac{1}{4}\bigl((1 + \varepsilon/4)^2 \|w + x\|^2 - (1 - \varepsilon/4)^2 \|w - x\|^2\bigr) \le w \cdot x + \varepsilon,$
where the last inequality uses $\|w + x\|^2 + \|w - x\|^2 = 2(\|w\|^2 + \|x\|^2) \le 4$. A similar argument gives that $\Pi w \cdot \Pi x \ge w \cdot x - \varepsilon$. ∎
3 Polytope margin and envelope
In this section, we show that the notions of margin and envelope defined in Section 2 are, in general, quite distinct. Fortunately, when confined to the unit ball , one can be used to approximate the other.
Given two sets $A, C \subseteq \mathbb{R}^d$, their Minkowski sum is given by $A \oplus C = \{a + c : a \in A,\ c \in C\}$, and their Minkowski difference is given by $A \ominus C = \{x : \{x\} \oplus C \subseteq A\}$. Let $B_\gamma$ be the ball of radius $\gamma$ centered at the origin.
Given a polytope $P$ and a real number $\gamma > 0$, let
$P^{+\gamma} = P \oplus B_\gamma \quad \text{and} \quad P^{-\gamma} = P \ominus B_\gamma.$
Hence, $P^{+\gamma}$ and $P^{-\gamma}$ are the results of expanding or contracting, in a certain sense, the polytope $P$.
Also, let $P_{+\gamma}$ be the result of moving each halfspace defining a facet of $P$ outwards by distance $\gamma$, and similarly, let $P_{-\gamma}$ be the result of moving each such halfspace inwards by distance $\gamma$. Put differently, we can think of the halfspaces defining the facets of $P$ as moving outwards at unit speed, so $P$ expands with time. Then $P_{+\gamma}$ is $P$ at time $\gamma$. See Figure 1.
Observation 1.
We have $P^{-\gamma} = P_{-\gamma}$.
Proof.
Each point in $P_{-\gamma}$ is at distance at least $\gamma$ from each hyperplane containing a facet of $P$; hence, it is at distance at least $\gamma$ from the boundary of $P$, so it is in $P^{-\gamma}$. Now, suppose for a contradiction that there exists a point $p \in P^{-\gamma} \setminus P_{-\gamma}$. Then $p$ is at distance less than $\gamma$ from a point $q \in H$, where $F$ is some facet of $P$ and $H$ is the hyperplane containing $F$. If $q \in F$, then $p$ is at distance less than $\gamma$ from the boundary of $P$, which is impossible. But if $q \notin F$, then the segment $pq$ must intersect another facet of $P$, which is equally impossible. ∎
However, in the other direction we only have $P^{+\gamma} \subseteq P_{+\gamma}$. Furthermore, the Hausdorff distance between them could be arbitrarily large (see again Figure 1).
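A concrete instance of this phenomenon (our sketch; the half-angle $\theta$ and the value of $\gamma$ are arbitrary):

```python
import numpy as np

# A sharp 2-D wedge P = {x : w1.x <= 0, w2.x <= 0} illustrating that moving
# the defining hyperplanes outward by gamma (P_{+gamma}) reaches points far
# outside the Minkowski expansion P + B_gamma (which is P^{+gamma}).
theta = 0.05                                 # half-angle of the wedge (sharp)
w1 = np.array([np.sin(theta),  np.cos(theta)])
w2 = np.array([np.sin(theta), -np.cos(theta)])
gamma = 0.1

p = np.array([0.99 * gamma / np.sin(theta), 0.0])   # just inside P_{+gamma}

# p satisfies both relaxed constraints ...
assert w1 @ p <= gamma and w2 @ p <= gamma

# ... yet its Euclidean distance to P is about gamma/sin(theta) >> gamma.
# The boundary of P is two rays from the apex (the origin), and the nearest
# point of P to p is the apex itself, so dist(p, P) = |p|.
print(np.linalg.norm(p))   # ~1.98, roughly twenty times gamma
```

Shrinking $\theta$ pushes the apex of $P_{+\gamma}$ arbitrarily far from $P$, which is exactly why the Hausdorff distance between $P_{+\gamma}$ and $P^{+\gamma}$ is unbounded.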
Then the envelope of $P$ is given by $E_\gamma(P) = P^{+\gamma} \setminus P^{-\gamma}$, and the margin of $P$ is given by $M_\gamma(P) = P_{+\gamma} \setminus P_{-\gamma}$. See Figure 2.
Since the margin of $P$ is not, in general, contained in the envelope of $P$, we would like to find some sufficient condition under which the margin, with a suitably reduced parameter, is contained in the envelope. Our solution to this problem is given in the following theorem. Recall that $B$ is the unit ball in $\mathbb{R}^d$.
Theorem 5.
Let $P$ be a polytope, and let $0 < \gamma \le 1$. Suppose that $P^{-\gamma} \cap B \neq \emptyset$. Then, within $B$, the $\gamma^2$-margin of $P$ is contained in the $2\gamma$-envelope of $P$; meaning, $M_{\gamma^2}(P) \cap B \subseteq E_{2\gamma}(P)$.
The proof uses the following general observation:
Observation 2.
Let $P_t$ be an expanding polytope whose defining halfspaces move outwards with time $t$, each one at its own constant speed. Let $p(t)$ be a point that moves in a straight line at constant speed. Suppose $t_1 < t_2$ are such that $p(t_1) \in P_{t_1}$ and $p(t_2) \in P_{t_2}$. Then $p(t) \in P_t$ for all $t_1 \le t \le t_2$ as well.
Proof.
Otherwise, $p$ exits one of the moving halfspaces and enters it again; since the signed distance from $p(t)$ to each moving hyperplane is a linear function of $t$, this is impossible. ∎
Proof of Theorem 5.
By Observation 1, it suffices to show that $P_{+\gamma^2} \cap B \subseteq P^{+2\gamma}$. Hence, let $x \in P_{+\gamma^2} \cap B$ and $q \in P^{-\gamma} \cap B$. Let $s$ be the segment $qx$; note that $|s| \le 2$. Let $y$ be the point in $s$ that is at distance $2\gamma$ from $x$. Suppose for a contradiction that $x \notin P^{+2\gamma}$. Then $y \notin P$. Consider $P$ as a polytope that expands with time, as above. Let $p$ be a point that moves along $s$ at constant speed, such that $p(-\gamma) = q$ and $p(\gamma^2) = x$; note that $q \in P_{-\gamma}$ by Observation 1. Since $|s| \le 2$ and the travel time is at least $\gamma$, the speed of $p$ is at most $2/\gamma$. Hence, between times $0$ and $\gamma^2$, $p$ moves distance at most $2\gamma$, so $p(0)$ is already between $y$ and $x$. Thus $p$ passes through $y$ at some time $t_y \le 0$; since $P_{t_y} \subseteq P$ and $y \notin P$, at that moment $p$ lies outside the expanding polytope. In other words, $p$ exits the expanding polytope and re-enters it, contradicting Observation 2. ∎
4 Computing and learning separating polytopes
In this section, we present algorithms to compute and learn fat polytopes. We begin with hardness results for this problem, and show that these hardness results justify algorithms with runtime exponential in the dimension or in the square of the reciprocal of the margin, $1/\gamma^2$, but not exponential in $t$. We then present our algorithms.
4.1 Hardness
We show that computing separating polytopes is hard, and even hard to approximate. We begin with the case of a single hyperplane. The following preliminary lemma builds upon (Amaldi and Kann, 1995, Theorem 10).
Lemma 6.
Given a labelled point set $S \subset \mathbb{R}^n$, let $h^*$ be a hyperplane that places all positive points of $S$ on its positive side, and maximizes the number of negative points on its negative side; call this quantity $k^*$. Then, for any constant $\varepsilon > 0$, it is NP-hard to find a hyperplane $h$ consistent with all positive points which places at least $k^*/n^{1-\varepsilon}$ negative points on the negative side of $h$. This holds even when the optimal hyperplane $h^*$ has margin $\Omega(1/\sqrt{n})$.
Proof.
We reduce from maximum independent set, which is NP-hard to approximate to within a factor of $n^{1-\varepsilon}$ (Zuckerman, 2007). Given a graph $G = (V, E)$ with $|V| = n$, for each vertex $v \in V$ place a negative point on the basis vector $e_v \in \mathbb{R}^n$. Now place a positive point at the origin, and for each edge $(u, v) \in E$, place a positive point at the midpoint $(e_u + e_v)/2$.
Consider a hyperplane consistent with the positive points and placing $k$ negative points on the negative side: these negative points must represent an independent set in $G$, for if $(u, v) \in E$, then by construction the midpoint of $e_u$ and $e_v$ is positive, and so both cannot lie on the negative side of the plane.
Likewise, if $G$ contains an independent set $I$ of size $k$, then we consider the hyperplane defined by the equation $w \cdot x = 3/4$, where coordinate $w_v = 1$ if $v \in I$ and $w_v = 0$ otherwise. It is easily verified that the distance from this hyperplane to a negative point of $I$ is $\frac{1}{4\sqrt{k}}$, to the origin is $\frac{3}{4\sqrt{k}}$, and to the other positive points is at least $\frac{1}{4\sqrt{k}}$. ∎
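The construction and the claimed distances can be checked numerically; the helper below is ours, and the coordinates and the $3/4$ threshold follow the reconstruction above:

```python
import math

# Numerical check of the reduction: negative points sit at basis vectors e_v,
# positives at the origin and at edge midpoints (e_u + e_v)/2; the hyperplane
# is w . x = 3/4 with w_v = 1 exactly on an independent set I.
def margins_from_independent_set(n, edges, I):
    assert all(u not in I or v not in I for (u, v) in edges)
    norm_w = math.sqrt(len(I))
    w = [1.0 if v in I else 0.0 for v in range(n)]
    d_neg = min(abs(w[v] - 0.75) for v in range(n)) / norm_w
    d_origin = 0.75 / norm_w
    d_mid = min(abs((w[u] + w[v]) / 2 - 0.75) for (u, v) in edges) / norm_w
    return d_neg, d_origin, d_mid

# 4-cycle: {0, 2} is an independent set of size 2.
d_neg, d_origin, d_mid = margins_from_independent_set(
    4, [(0, 1), (1, 2), (2, 3), (3, 0)], {0, 2})
print(d_neg, d_origin, d_mid)   # 1/(4*sqrt(2)), 3/(4*sqrt(2)), 1/(4*sqrt(2))
```

Since each edge has at most one endpoint in the independent set, every edge midpoint takes value at most $1/2$ under $w$, which is where the $1/4$ gap to the threshold comes from.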
Theorem 7.
Given a labelled point set $S$, let $H^*$ be a minimum-size collection of $t$ hyperplanes whose intersection partitions $S$ into its positive and negative sets. Then, for any constant $\varepsilon > 0$, it is NP-hard to find a collection of size less than $t \cdot n^{1-\varepsilon}$ whose intersection partitions $S$ into positive and negative sets. This holds even when all hyperplanes have margin $1/(4n^{\varepsilon/2})$ or greater.
Proof.
The reduction is from minimum coloring, which is NP-hard to approximate within a factor of $n^{1-\varepsilon}$ (Zuckerman, 2007). The construction is identical to that of the proof of Lemma 6; each color class is an independent set, and hence corresponds to a hyperplane. The only difference is that if one color covers more than $n^{\varepsilon}$ vertices, we break it up into a set of colors, each covering at most $n^{\varepsilon}$ vertices. This increases the total number of colors to at most $t + n^{1-\varepsilon}$, while guaranteeing that each hyperplane has margin at least $1/(4n^{\varepsilon/2})$. The claim follows. ∎
4.2 Algorithms
Here we present algorithms for computing polytopes, and use them to give an efficient algorithm for learning polytopes. As a consequence of Lemma 6 and Theorem 7, we cannot hope to find in polynomial time even a single hyperplane which is consistent with the positive points and correctly classifies many negative points, let alone a polytope consistent with all the data. In fact, under the Exponential Time Hypothesis, we cannot achieve a runtime better than exponential in either the dimension or $1/\gamma^2$ (though not necessarily exponential in $t$).
In what follows, we give two algorithms inspired by the polytope algorithm of Arriaga and Vempala (2006). Both have runtime faster than that algorithm, and the second has runtime only polynomial in $t$.
Theorem 8.
Given a labelled point set $S \subset B$ of size $m$, for which some fat polytope with $t$ faces and margin $\gamma$ correctly separates the positive and negative points (i.e., the polytope is consistent), we can compute the following with high probability:

1. A consistent fat polytope with $t$ faces and margin $\gamma/2$, in time $m^{O(t \gamma^{-2} \log(1/\gamma))}$.

2. A consistent fat polytope with $O(t \log m)$ faces and margin $\gamma/2$, in time $\mathrm{poly}(t, m) \cdot m^{O(\gamma^{-2} \log(1/\gamma))}$.
Before proving Theorem 8, we will need a preliminary lemma:
Lemma 9.
Given any $\gamma \in (0, 1)$, there exists a set $W$ of unit vectors of size $|W| \le (6/\gamma)^d$ with the following property: For any unit vector $u$, there exists a $w \in W$ that satisfies $|u \cdot x - w \cdot x| \le \gamma/2$ for all vectors $x$ with $\|x\| \le 1$. The set $W$ can be constructed in time $(6/\gamma)^{O(d)}$ with high probability.
This implies that if a labelled set admits a separating hyperplane with margin $\gamma$, then it admits a separating hyperplane with normal in $W$ and margin at least $\gamma/2$.
Proof.
We take $W$ to be a $\gamma/2$-net of the unit sphere, that is, a set satisfying that every point on the sphere is within distance $\gamma/2$ of some point in $W$. Then $|W| \le (1 + 4/\gamma)^d \le (6/\gamma)^d$ (Vershynin, 2010, Lemma 5.2). For any unit vector $u$ we have $\|u - w\| \le \gamma/2$ for some $w \in W$. Now let $u$ and $w$ be normal to their respective hyperplanes (with a common offset $b$), so that for any vector $x$ with $\|x\| \le 1$, its distances from the respective hyperplanes are $|u \cdot x - b|$ and $|w \cdot x - b|$. It follows that
$|(u \cdot x - b) - (w \cdot x - b)| = |(u - w) \cdot x| \le \|u - w\| \, \|x\| \le \gamma/2.$
The net can be constructed by a randomized greedy algorithm. By a coupon-collector analysis, it suffices to draw $O(|W| \log |W|)$ random unit vectors. For example, each can be chosen by sampling each coordinate from a standard normal distribution and then normalizing the vector. The resulting set then contains within it a $\gamma/2$-net, which can be extracted in the stated time. ∎
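As a sanity check of the randomized construction (our sketch, in dimension $d = 2$, where a net of the unit sphere is just points on a circle; the sample size 2000 and $\alpha = 0.3$ are arbitrary):

```python
import numpy as np

# Randomized net construction: draw Gaussian vectors, normalize them, then
# greedily extract a net (keep a vector only if it is not within alpha of an
# already-kept one).
rng = np.random.default_rng(1)
alpha = 0.3
samples = rng.normal(size=(2000, 2))
samples /= np.linalg.norm(samples, axis=1, keepdims=True)

net = []
for u in samples:
    if all(np.linalg.norm(u - w) > alpha for w in net):
        net.append(u)

# Every fresh random direction should now be within (about) alpha of the net.
fresh = rng.normal(size=(500, 2))
fresh /= np.linalg.norm(fresh, axis=1, keepdims=True)
cover = max(min(np.linalg.norm(u - w) for w in net) for u in fresh)
print(len(net), cover)
```

The kept vectors are pairwise more than $\alpha$ apart, so the net size stays bounded by a packing bound, while every sampled direction lies within $\alpha$ of some kept vector.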
Proof of Theorem 8.
We first apply the Johnson-Lindenstrauss transform to reduce the dimension of the points in $S$ to $k = O(\gamma^{-2} \log m)$, achieving the guarantees of Lemma 4 with parameter $\gamma/8$. We then restrict our attention to the $k$-dimensional hyperplanes induced by the net of Lemma 9 with parameter $\gamma/4$. Then for each $d$-dimensional hyperplane $h$ forming the original fat polytope, there is a $k$-dimensional hyperplane $h'$ of the net whose signed distance to every projected point of $S$ differs from that of $h$ by at most $\gamma/2$. Given this $h'$, we can recover an approximation to $h$ thus: Let $S' \subseteq S$ include only those points whose projections are at distance $\gamma/2$ or more from $h'$, from which it follows that $S'$ contains, correctly labelled, every point at distance $\gamma$ or more from $h$. We then run the Perceptron algorithm on $S'$, in time $O(mk/\gamma^2)$, and find a hyperplane consistent with $h$ on all points at distance $\gamma$ or more from $h$. We will refer to this hyperplane as the $k$-dimensional mirror of $h$.
Having enumerated all vectors in the net and computed their $k$-dimensional mirrors, we enumerate all possible polytopes by taking all combinations of $t$ mirror hyperplanes, in total time
$m^{O(t \gamma^{-2} \log(1/\gamma))},$
and choose one consistent with $S$. The first part of the theorem follows.
Alternatively, we may give a greedy algorithm with better runtime: First note that, as the intersection of the $t$ hyperplanes correctly classifies all points, the best single hyperplane among them correctly classifies at least a $1/t$ fraction of the negative points with margin $\gamma$. Hence it suffices to compute the mirror hyperplane which is consistent with all positive points and maximizes the number of correctly classified negative points. We choose this hyperplane, remove from $S$ the correctly classified negative points, and iteratively search for the next best hyperplane. After $c t \log m$ iterations (for an appropriate constant $c$), the number of remaining negative points is at most
$m (1 - 1/t)^{c t \log m} < 1,$
and the algorithm terminates. ∎
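The greedy step above can be sketched as set-cover-style code; the candidate pool stands in for the net of mirror hyperplanes, margins and the JL projection are omitted, and all names below are illustrative:

```python
import numpy as np

# Set-cover-style greedy: from a pool of candidate hyperplanes (w, b),
# repeatedly pick one that keeps every positive point inside and excludes
# the most remaining negative points.
def greedy_polytope(candidates, pos, neg):
    chosen, remaining = [], list(range(len(neg)))
    while remaining:
        best, best_hit = None, []
        for (w, b) in candidates:
            if any(np.dot(w, p) > b for p in pos):
                continue                     # must keep all positives inside
            hit = [i for i in remaining if np.dot(w, neg[i]) > b]
            if len(hit) > len(best_hit):
                best, best_hit = (w, b), hit
        if best is None:
            raise ValueError("no consistent hyperplane excludes a negative point")
        chosen.append(best)
        hit_set = set(best_hit)
        remaining = [i for i in remaining if i not in hit_set]
    return chosen

# Toy instance: positives inside the unit square, negatives outside it.
pos = [np.array(p) for p in [(0.5, 0.5), (0.2, 0.8)]]
neg = [np.array(p) for p in [(1.5, 0.5), (-1.5, 0.5), (0.5, 1.5), (0.5, -1.5)]]
cands = [(np.array([1.0, 0.0]), 1.0), (np.array([-1.0, 0.0]), 1.0),
         (np.array([0.0, 1.0]), 1.0), (np.array([0.0, -1.0]), 1.0)]
faces = greedy_polytope(cands, pos, neg)
print(len(faces))   # 4: each axis-aligned face excludes one negative point
```

Each round removes at least a $1/t$ fraction of the remaining negatives (by the pigeonhole argument above), which is what yields the $O(t \log m)$ bound on the number of chosen faces.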
Having given an algorithm to compute fat polytopes, we can now give an efficient algorithm to learn them. We sample $m$ points and use the second item of Theorem 8 to find a fat polytope consistent with the sample. By Lemma 2, the class of polytopes with $O(t \log m)$ faces and margin $\gamma/2$ has VC-dimension $\tilde{O}(t/\gamma^2)$. The size of $m$ is chosen according to Lemma 3, and we conclude:
Theorem 10.
There exists an algorithm that learns fat polytopes with sample complexity
$m = \tilde{O}\Bigl(\frac{1}{\varepsilon}\Bigl(\frac{t}{\gamma^2} + \log\frac{1}{\delta}\Bigr)\Bigr)$
in time $\mathrm{poly}(t, m) \cdot m^{O(\gamma^{-2} \log(1/\gamma))}$, where $\delta$ is the desired confidence level.
References
 Achlioptas [2003] Dimitris Achlioptas. Database-friendly random projections: Johnson-Lindenstrauss with binary coins. J. Comput. Syst. Sci., 66(4):671–687, 2003. doi: 10.1016/S0022-0000(03)00025-4. URL https://doi.org/10.1016/S0022-0000(03)00025-4.
 Amaldi and Kann [1995] Edoardo Amaldi and Viggo Kann. The complexity and approximability of finding maximum feasible subsystems of linear relations. Theoretical Computer Science, 147(1):181 – 210, 1995. ISSN 03043975. doi: https://doi.org/10.1016/03043975(94)00254G. URL http://www.sciencedirect.com/science/article/pii/030439759400254G.
 Amaldi and Kann [1998] Edoardo Amaldi and Viggo Kann. On the approximability of minimizing nonzero variables or unsatisfied relations in linear systems. Theoretical Computer Science, 209(1):237 – 260, 1998. ISSN 03043975. doi: https://doi.org/10.1016/S03043975(97)001151. URL http://www.sciencedirect.com/science/article/pii/S0304397597001151.
 Anderson et al. [2013] Joseph Anderson, Navin Goyal, and Luis Rademacher. Efficient learning of simplices. In COLT 2013  The 26th Annual Conference on Learning Theory, June 1214, 2013, Princeton University, NJ, USA, pages 1020–1045, 2013. URL http://jmlr.org/proceedings/papers/v30/Anderson13.html.
 Angluin [1992] Dana Angluin. Computational learning theory: Survey and selected bibliography. In Proceedings of the 24th Annual ACM Symposium on Theory of Computing, May 46, 1992, Victoria, British Columbia, Canada, pages 351–369, 1992. doi: 10.1145/129712.129746. URL http://doi.acm.org/10.1145/129712.129746.
 Anthony and Bartlett [1999] Martin Anthony and Peter L. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, Cambridge, 1999. ISBN 052157353X. doi: 10.1017/CBO9780511624216. URL http://dx.doi.org/10.1017/CBO9780511624216.
 Arriaga and Vempala [2006] Rosa I. Arriaga and Santosh Vempala. An algorithmic theory of learning: Robust concepts and random projection. Machine Learning, 63(2):161–182, 2006. doi: 10.1007/s1099400662657. URL https://doi.org/10.1007/s1099400662657.
 Bárász and Vempala [2010] Mihály Bárász and Santosh Vempala. A new approach to strongly polynomial linear programming. In Innovations in Computer Science  ICS 2010, Tsinghua University, Beijing, China, January 57, 2010. Proceedings, pages 42–48, 2010. URL http://conference.itcs.tsinghua.edu.cn/ICS2010/content/papers/4.html.
 Bartlett and Shawe-Taylor [1999] Peter Bartlett and John Shawe-Taylor. Generalization performance of support vector machines and other pattern classifiers, pages 43–54. MIT Press, Cambridge, MA, USA, 1999. ISBN 0262194163.
 Ben-David et al. [2003] Shai Ben-David, Nadav Eiron, and Philip M. Long. On the difficulty of approximately maximizing agreements. J. Comput. Syst. Sci., 66(3):496–514, 2003. doi: 10.1016/S0022-0000(03)00038-2. URL https://doi.org/10.1016/S0022-0000(03)00038-2.
 Blumer et al. [1989] Anselm Blumer, Andrzej Ehrenfeucht, David Haussler, and Manfred K. Warmuth. Learnability and the VapnikChervonenkis dimension. J. Assoc. Comput. Mach., 36(4):929–965, 1989. ISSN 00045411.
 Cristianini and Shawe-Taylor [2000] Nello Cristianini and John Shawe-Taylor. An Introduction to Support Vector Machines and Other Kernel-based Learning Methods. Cambridge University Press, 2000. ISBN 0521780195.
 Hanneke and Kontorovich [2017] Steve Hanneke and Aryeh Kontorovich. Optimality of SVM: Novel proofs and tighter bounds. 2017. URL https://www.cs.bgu.ac.il/~karyeh/optsvm.pdf.
 Hegedüs [1994] Tibor Hegedüs. Geometrical concept learning and convex polytopes. In Proceedings of the Seventh Annual ACM Conference on Computational Learning Theory, COLT 1994, New Brunswick, NJ, USA, July 1215, 1994., pages 228–236, 1994. doi: 10.1145/180139.181124. URL http://doi.acm.org/10.1145/180139.181124.
 Hellerstein and Servedio [2007] Lisa Hellerstein and Rocco A. Servedio. On PAC learning algorithms for rich boolean function classes. Theor. Comput. Sci., 384(1):66–76, 2007. doi: 10.1016/j.tcs.2007.05.018. URL https://doi.org/10.1016/j.tcs.2007.05.018.
 Höffgen et al. [1995] Klaus-Uwe Höffgen, Hans Ulrich Simon, and Kevin S. Van Horn. Robust trainability of single neurons. J. Comput. Syst. Sci., 50(1):114–125, 1995. doi: 10.1006/jcss.1995.1011. URL https://doi.org/10.1006/jcss.1995.1011.
 Jain and Kinber [2003] Sanjay Jain and Efim B. Kinber. Intrinsic complexity of learning geometrical concepts from positive data. J. Comput. Syst. Sci., 67(3):546–607, 2003. doi: 10.1016/S00220000(03)000679. URL https://doi.org/10.1016/S00220000(03)000679.
 Kane et al. [2013] Daniel M. Kane, Adam R. Klivans, and Raghu Meka. Learning halfspaces under logconcave densities: Polynomial approximations and moment matching. In COLT 2013  The 26th Annual Conference on Learning Theory, June 1214, 2013, Princeton University, NJ, USA, pages 522–545, 2013. URL http://jmlr.org/proceedings/papers/v30/Kane13.html.
 Kantchelian et al. [2014] Alex Kantchelian, Michael Carl Tschantz, Ling Huang, Peter L. Bartlett, Anthony D. Joseph, and J. Doug Tygar. Largemargin convex polytope machine. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 813 2014, Montreal, Quebec, Canada, pages 3248–3256, 2014. URL http://papers.nips.cc/paper/5511largemarginconvexpolytopemachine.
 Kearns and Vazirani [1997] Michael Kearns and Umesh Vazirani. An Introduction to Computational Learning Theory. The MIT Press, 1997.
 Khot and Saket [2011] Subhash Khot and Rishi Saket. On the hardness of learning intersections of two halfspaces. J. Comput. Syst. Sci., 77(1):129–141, 2011. doi: 10.1016/j.jcss.2010.06.010. URL https://doi.org/10.1016/j.jcss.2010.06.010.
 Klivans and Servedio [2008] Adam R. Klivans and Rocco A. Servedio. Learning intersections of halfspaces with a margin. J. Comput. Syst. Sci., 74(1):35–48, 2008. doi: 10.1016/j.jcss.2007.04.012. URL https://doi.org/10.1016/j.jcss.2007.04.012.
 Klivans and Sherstov [2009] Adam R. Klivans and Alexander A. Sherstov. Cryptographic hardness for learning intersections of halfspaces. J. Comput. Syst. Sci., 75(1):2–12, 2009. doi: 10.1016/j.jcss.2008.07.008. URL https://doi.org/10.1016/j.jcss.2008.07.008.
 Kwek and Pitt [1998] Stephen Kwek and Leonard Pitt. PAC learning intersections of halfspaces with membership queries. Algorithmica, 22(1/2):53–75, 1998. doi: 10.1007/PL00013834. URL https://doi.org/10.1007/PL00013834.
 Long and Warmuth [1994] Philip M. Long and Manfred K. Warmuth. Composite geometric concepts and polynomial predictability. Inf. Comput., 113(2):230–252, 1994. doi: 10.1006/inco.1994.1071. URL https://doi.org/10.1006/inco.1994.1071.
 Matoušek [2002] Jiří Matoušek. Lectures on discrete geometry, volume 212 of Graduate Texts in Mathematics. SpringerVerlag, New York, 2002. ISBN 0387953736. doi: 10.1007/9781461300397. URL https://doi.org/10.1007/9781461300397.
 Matoušek and Gärtner [2006] Jiří Matoušek and Bernd Gärtner. Understanding and Using Linear Programming (Universitext). Springer, 2006. ISBN 3540306978.
 Megiddo [1988] Nimrod Megiddo. On the complexity of polyhedral separability. Discrete & Computational Geometry, 3(4):325–337, Dec 1988. ISSN 14320444. doi: 10.1007/BF02187916. URL https://doi.org/10.1007/BF02187916.
 Rademacher and Goyal [2009] Luis Rademacher and Navin Goyal. Learning convex bodies is hard. In COLT 2009  The 22nd Conference on Learning Theory, Montreal, Quebec, Canada, June 1821, 2009, 2009. URL http://www.cs.mcgill.ca/~colt2009/papers/030.pdf#page=1.
 Valiant [1984] Leslie G. Valiant. A theory of the learnable. Commun. ACM, 27(11):1134–1142, 1984.
 Vershynin [2010] Roman Vershynin. Introduction to the nonasymptotic analysis of random matrices. CoRR, abs/1011.3027, 2010. URL http://arxiv.org/abs/1011.3027.
 Zuckerman [2007] David Zuckerman. Linear degree extractors and the inapproximability of max clique and chromatic number. Theory of Computing, 3(6):103–128, 2007. doi: 10.4086/toc.2007.v003a006. URL http://www.theoryofcomputing.org/articles/v003a006.