[
Abstract
We examine the task of locating a target region among those induced by intersections of $n$ halfspaces in $\mathbb{R}^d$. This generic task connects to fundamental machine learning problems, such as training a perceptron and learning a separable dichotomy. We investigate the average teaching complexity of the task, i.e., the minimal number of samples (halfspace queries) required by a teacher to help a version-space learner in locating a randomly selected target. As our main result, we show that the average-case teaching complexity is $\Theta(d)$, which is in sharp contrast to the worst-case teaching complexity of $\Theta(n)$. If instead we consider the average-case learning complexity, the bounds have a dependency on $n$: $\Theta(n)$ for i.i.d. queries and $\Theta(d \log n)$ for actively chosen queries by the learner. Our proof techniques are based on novel insights from computational geometry, which allow us to count the number of convex polytopes and faces in a Euclidean space depending on the arrangement of halfspaces. Our insights allow us to establish a tight bound on the average-case complexity for separable dichotomies, which generalizes the known bound on the average number of “extreme patterns” in the classical computational geometry literature (Cover, 1965).
Average-case Complexity of Teaching Convex Polytopes via Halfspace Queries]Average-case Complexity of Teaching Convex Polytopes
via Halfspace Queries
\SetCommentStymycommfont
\newtogglelongversion
\settogglelongversiontrue
\altauthor\NameAkash Kumar\Emailakumar@mpisws.org
\addrMPI-SWS
\AND\NameAdish Singla \Emailadishs@mpisws.org
\addrMPI-SWS \AND\NameYisong Yue \Emailyyue@caltech.edu
\addrCaltech
\AND\NameYuxin Chen \Emailchenyuxin@uchicago.edu
\addrUniversity of Chicago
Teaching dimension, homogeneous halfspaces, average-case complexity
1 Introduction
We consider the problem of locating a target region among those induced by intersections of halfspaces in $\mathbb{R}^d$ (Fig. 1). In the basic setting, the learner receives a sequence of instructions, which we refer to as halfspace queries \citep[same as membership queries in][]ANGLUIN198787,ANGLUINb, each specifying a halfspace the target region lies in. Based on the evidence it receives, the learner then determines the location of the target region. This generic task connects to several fundamental problems in machine learning. Consider learning a linear prediction function in $\mathbb{R}^d$ (aka perceptron, see Fig. 1) over linearly separable data points. Here, every data point specifies a halfspace, and the target hypothesis corresponds to a region in the hypothesis space. The learning task reduces to identifying the convex polytope induced by the halfspace constraints in the hypothesis space \citepbishop2006pattern. Similarly, when the set of data points is not linearly separable, but is separable by a nonlinear surface (aka separable dichotomy, see Fig. 1), the problem of finding the separable dichotomy can be viewed as training a perceptron in the induced space \citepcover1965geometrical.
While these fundamental problems have been extensively studied in the passive learning setting \citepvapnik1971uniform,natarajan1987learning,blumer1989learnability,goldman1993learning, the underlying i.i.d. sampling strategy often requires more data than necessary to learn the target concept (when one is able to control the sampling strategy).
Moreover, the majority of existing work focuses on worst-case complexity measures, which are often too pessimistic and do not reflect the learning complexity in real-world scenarios \citephaussler1994bounds,wan2010learning,nachum2019average. As shown in Table 1, the label complexity of passive learning for the above generic task is $\Theta(n)$.
Recently, there has been increasing interest in understanding the complexity of interactive learning, which aims to learn under more optimistic, realistic scenarios, in which “representative” examples are selected, and the number of examples needed for successful learning may shrink significantly. For example, under the active learning setting, the learner only queries data points that are helpful for the learning task, which can lead to exponential savings in sample complexity compared with the passive learning setting \citepguillory2009average,jamieson2011active,hanneke2015minimax,kane2017active.
\begin{tabular}{llll}
\hline
Type & Average-case & Worst-case & Condition on hyperplane arrangement \\
\hline
Passive learning & $\Theta(n)$ & $\Theta(n)$ & --- \\
Active learning & $\Theta(d \log n)$ & $\Theta(n)$ & relaxed general position \\
Teaching & $\Theta(d)$ & $\Theta(n)$ & relaxed general position \\
\hline
\end{tabular}
An alternative interactive learning scenario is the setting where learning happens in the presence of a helpful teacher, who identifies useful examples for the learning task. This setting is known as machine teaching \citepDBLP:journals/corr/ZhuSingla18. Importantly, the label complexity of teaching provides a lower bound on the number of samples needed by active learning \citepzilles2011models, and therefore can provide useful insights for designing interactive learning algorithms \citepbrown2019machine. Machine teaching has been extensively studied in terms of worst-case label complexity \citepgoldman1995complexity,article:anthony95,zilles2008teaching,doliwa2014recursive,chen2018understanding,mansouri2019preference. However, to the best of our knowledge, the average-case complexity of machine teaching, even for the fundamental tasks described above, remains significantly underexplored.
In this paper, we investigate the average teaching complexity, i.e., the minimal number of examples required by a teacher to help a learner in locating a randomly selected target. We highlight our key results below.

We show that under the common assumption that the hyperplanes are in general position in $\mathbb{R}^d$, the average-case complexity for teaching such a target is $\Theta(d)$. This is in sharp contrast to the worst-case teaching complexity of $\Theta(n)$ (cf. §4).

We provide a natural extension of the general-position hyperplane arrangement condition, and show that if the hyperplanes in $\mathbb{R}^d$ are in “relaxed general position arrangement” with parameter $d'$, where $d' \le d$, then one can further obtain an improved complexity result of $\Theta(d')$ for average-case teaching. Our proof techniques are based on novel insights from computational geometry, which allow us to count the number of convex polytopes and faces in a Euclidean space depending on the hyperplane arrangement. Our result improves upon the existing result for arbitrary hyperplane arrangements \citepFukuda1991BoundingTN (cf. §4).
2 Related Work
Average-case complexity of learning
While the majority of complexity measures for concept classes and data selection algorithms focus on worst-case scenarios, there have been a few works concerning the average-case complexity of various types of learning algorithms. Here we provide a survey of related work concerning average-case complexity in the learning setting. [3] studied how the sample complexity depends on properties of a prior distribution on the concept class and over the sequence of examples the algorithm receives. Specifically, they studied the probability of an incorrect prediction for an optimal learning algorithm using the Shannon information gain. [9] considered the problem of learning DNF formulas sampled from the uniform distribution. [6] considered the average information complexity of learning (defined as the average mutual information between the input and the output of the learning algorithm). They show that for a concept class of VC dimension $d$, there exists a proper learning algorithm that reveals $O(d)$ bits of information for most concepts. Intuitively, this result aligns with our observation that the average complexities of various data selection algorithms are significantly lower than those in the worst-case scenario. [7, 8] introduce the paradigm of smoothed analysis, which differs from our average-case analysis as we do not allow perturbations to input spaces. Perhaps most similar to our approach, in terms of technical insights, is the work of [4], who studied the problem of active ranking via pairwise comparisons, and used the geometric properties of hyperplanes in $\mathbb{R}^d$ to achieve an average complexity of $\Theta(d \log n)$ for active ranking over $n$ points. In our work, we extend their results to the general problem of active learning of halfspaces, and also consider the teaching variant of the ranking via pairwise comparison problem.
Connection with the PAC learning framework
Intersections of halfspaces have been studied in the PAC learning framework \citeppacpitt,BLUM1997371,KLIVANS2004808,klivanscrypto,vempala,KHOT2011129,learnconvexpolytope. Although we focus on exact teaching of intersections of halfspaces induced by hyperplanes, our results could be readily extended to analyze the average sample complexity for teaching a PAC learner in the realizable case. It is well known that a single halfspace can be PAC-learned efficiently by sampling a polynomial number of data points and finding a separating hyperplane via linear programming \citepblumer1989learnability. Relating this to the worst-case sample complexity results in \tablereftab:samplecomplexity, we know that the worst-case sample complexity for teaching a halfspace to a PAC learner is also polynomial in the VC dimension of halfspaces. One can then extend the average-case complexity results in \tablereftab:samplecomplexity, based on an argument similar to pool-based active learning \citepmccallumzy1998employing. The idea is for the teacher to draw unlabeled examples i.i.d. from the underlying data distribution in $\mathbb{R}^d$. Instead of providing all labels, the teacher provides labels for an optimal teaching set such that the labels of all remaining unlabeled examples are implied by the given labels. Thus the learner has obtained labeled examples drawn i.i.d., and classical PAC bounds still apply.
Relevant work in algorithmic machine teaching
As discussed above, the teaching problem for various concept classes has been explored before. The classic notion of average teaching dimension \citepgoldman1995complexity, which coincides with our definition in the uniform setting, has been studied in various settings: \citetarticle:anthony95 bounded it for the class of linearly separable Boolean functions; \citetKushilevitz1996WitnessSF showed an improved upper bound for arbitrary concept classes; \citetkuhlman proved that all classes of VC dimension 1 have an average teaching dimension of less than 2; \citetDNFteach have shown a bound for the class of DNFs with a bounded number of terms. In contrast, our work bypasses any dependence on the size of the concept class, and achieves an average teaching complexity of $\Theta(d')$ (where $d' \le d$). Some more powerful notions of teaching dimension in the sequential setting, namely recursive and preference-based teaching, have been studied in \citetdoliwa2014recursive,recursiveteach; these differ from our batched setting. There is increasing interest in connecting the VC dimension to the teaching problem of concept classes \citep[stated in][]Simon2015OpenPR,Hu2017QuadraticUB; we note that the VC dimension of the regions induced by hyperplanes in general position \citepedelsbrunner is closely related to our average-case result, but far from the worst-case result.
3 Teaching Convex Polytopes via Halfspace Queries: A General Model
Convex polytopes induced by hyperplanes
Let $h := \{x \in \mathbb{R}^d : \langle w, x \rangle + b = 0\}$ be a hyperplane in $\mathbb{R}^d$, where $w \in \mathbb{R}^d$ and $b \in \mathbb{R}$. We say a point $x$ satisfies or lies in $h$ if $\langle w, x \rangle + b = 0$. We define a halfspace induced by a hyperplane $h$ to be one of the two connected components of $\mathbb{R}^d \setminus h$, i.e. the sets corresponding to $\langle w, x \rangle + b > 0$ and $\langle w, x \rangle + b < 0$. We define $\mathcal{H} := \{h_1, \ldots, h_n\}$ as a set of $n$ hyperplanes in $\mathbb{R}^d$. The arrangement of the hyperplanes in $\mathcal{H}$ induces intersections of halfspaces which create connected components. Any connected component of $\mathbb{R}^d \setminus \bigcup_i h_i$ is defined as a region or convex polytope of the arrangement. Equivalently, any region can be exactly specified by the intersections of halfspaces induced by hyperplanes in $\mathcal{H}$. We call the smallest subset of $\mathcal{H}$ that exactly specifies a region the bounding set of hyperplanes for that region. We define connected components induced on hyperplanes (e.g. on any $h \in \mathcal{H}$) by the arrangement as faces. Thus, the bounding set forms the faces of the polytope. {example}[Convex polytopes induced by hyperplanes] Fig. 1 provides an example of the arrangement of 5 hyperplanes in $\mathbb{R}^2$, where arrows on the hyperplanes specify halfspaces. The bounding set for the highlighted region forms 3 faces of it. We denote the set of regions induced by the arrangement by $\mathcal{R}(\mathcal{H})$ and the number of regions by $r := |\mathcal{R}(\mathcal{H})|$. We define a labeling function $\sigma_R : \mathcal{H} \to \{-1, +1\}$ for an arbitrary region $R$, where $\sigma_R(h)$ indicates the halfspace of $h$ containing $R$. Note that a region uniquely identifies its labeling function.
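To make the counting concrete, the following Python sketch (our illustration; the function name is ours) evaluates the classical formula $\sum_{i=0}^{d} \binom{n}{i}$ for the number of regions induced by $n$ hyperplanes in general position in $\mathbb{R}^d$:

```python
from math import comb

def num_regions_general_position(n: int, d: int) -> int:
    """Regions induced by n hyperplanes in general position in R^d:
    sum_{i=0}^{d} C(n, i) (classical arrangement-counting formula)."""
    return sum(comb(n, i) for i in range(d + 1))

# Sanity checks in the plane (d = 2): one line gives 2 regions,
# two crossing lines give 4, three lines in general position give 7.
assert [num_regions_general_position(n, 2) for n in (1, 2, 3)] == [2, 4, 7]
```

Note that for $n \le d$ the formula evaluates to $2^n$, since every subset of hyperplanes then bounds a distinct region.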
The teaching framework
We study the problem of teaching target regions (convex polytopes) induced by a hyperplane arrangement in $\mathbb{R}^d$. Our teaching model is formally stated below. Consider the set of instances $\mathcal{H}$, with label set $\{-1, +1\}$ corresponding to the two halfspaces induced by a hyperplane. Our hypothesis class is the set of regions induced by the arrangement. Consider a target region $R^*$. Let the ground set of examples (i.e. labeled instances) consist of all hyperplanes labeled consistently with $R^*$. We define a labeled subset of the ground set as halfspace queries. We assume that for any set of halfspace queries, the labels are consistent, i.e., each hyperplane receives the label of the halfspace containing $R^*$. The version space induced by a set of halfspace queries is the subset of regions that are consistent with the labels of all the halfspace queries,
or equivalently, the set of convex polytopes which satisfy the halfspace queries. We define our version space learner as one which, upon seeing a set of halfspace queries, maintains a version space containing all the regions that are consistent with all the observed queries. Corresponding to a version space learner and a target region, we define a teaching set as a minimal set of halfspace queries such that the resulting version space contains exactly the target region.
Consequently, we want to teach a target hypothesis (region), say $R^*$, by specifying the halfspace queries in the teaching set to a learner. Given a target region, the teaching complexity \citepgoldman1995complexity is defined as the sample size of the teaching set, i.e. the size of a minimal teaching set for that region.
In §4, we analyze the teaching complexity of convex polytopes in both the average-case and the worst-case frameworks.
We define the average teaching complexity of convex polytopes via halfspace queries as the expected size of the teaching set when the target region is sampled uniformly at random. We define the worst-case teaching complexity as the maximum size of a teaching set over all target regions in the set of hypotheses.
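Both notions can be computed by brute force on a toy arrangement. The sketch below (our illustration, not the paper's algorithm) enumerates the regions of three lines in general position in $\mathbb{R}^2$ as sign vectors and searches for minimal teaching sets:

```python
from itertools import combinations

# Toy arrangement: three lines in general position in R^2,
# x = 0, y = 0, and x + y = 1, each given as (a, b, c) with
# halfspace label sign(a*x + b*y + c).
HYPERPLANES = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (1.0, 1.0, -1.0)]

def enumerate_regions():
    """Collect the distinct sign vectors over a dense grid, skipping
    grid points that lie on a hyperplane; each sign vector is a region."""
    seen = set()
    coords = [i * 0.25 for i in range(-12, 13)]
    for x in coords:
        for y in coords:
            vals = [a * x + b * y + c for a, b, c in HYPERPLANES]
            if any(abs(v) < 1e-9 for v in vals):
                continue
            seen.add(tuple(1 if v > 0 else -1 for v in vals))
    return sorted(seen)

def teaching_set_size(target, regions):
    """Smallest number of labeled hyperplanes whose halfspace labels
    leave only the target region in the version space."""
    n = len(HYPERPLANES)
    for size in range(n + 1):
        for subset in combinations(range(n), size):
            consistent = [r for r in regions
                          if all(r[i] == target[i] for i in subset)]
            if consistent == [target]:
                return size
    return n

regions = enumerate_regions()
sizes = [teaching_set_size(r, regions) for r in regions]
# 7 regions: three "corner" regions need only 2 queries, the other
# four need all 3, so the average (18/7) is below the worst case (3).
```

The grid-based enumeration is exact here because every region of this coarse arrangement contains a grid point.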
Hyperplanes in general position
We adopt a common assumption in computational geometry [2, 5] that the hyperplane arrangement is in general position, and further provide a relaxed notion of general-position hyperplane arrangement, as defined below. {definition}[General position of hyperplanes [5]]For a set of $n$ hyperplanes in $\mathbb{R}^d$, the arrangement is in general position if any subset of $i$ hyperplanes, where $i \le d$, intersects in a $(d-i)$-dimensional plane, and any larger subset has null intersection.
[Relaxed general position of hyperplanes] For a set of $n$ hyperplanes in $\mathbb{R}^d$ and $d' \le d$, the arrangement is in relaxed general position if any subset of $i$ hyperplanes, where $i \le d'$, intersects in a $(d-i)$-dimensional plane, and any subset of more than $d'$ hyperplanes has null intersection.
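A simple arrangement separating the two definitions (our worked example, not from the paper):

```latex
Take $n \ge 2$ parallel lines in $\mathbb{R}^2$. Every single line is a
$1$-dimensional plane, but no two of the lines intersect, so the
arrangement violates general position while satisfying relaxed general
position with parameter $d' = 1$. The induced regions are exactly the
$n + 1$ parallel strips, consistent with the count
$\sum_{i=0}^{d'} \binom{n}{i} = \binom{n}{0} + \binom{n}{1} = n + 1$.
```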
4 Average-case Teaching Complexity
In this section, we study the generic problem of teaching convex polytopes via halfspace queries as illustrated in Fig. 1. Before establishing our main result, we first introduce two important results inherently connected to the average teaching complexity: the number of regions (which correspond to the target hypotheses) induced by the intersections of halfspaces, and the number of faces (which correspond to the teaching sets) induced by the hyperplane arrangement. Our proofs are inspired by ideas from combinatorial geometry and affine geometry, as detailed below.
4.1 Regions and Faces Induced by Intersections of Halfspaces
Consider a set $\mathcal{H}$ of $n$ hyperplanes in $\mathbb{R}^d$. Generally, it is nontrivial to count the number of regions induced by an arbitrary hyperplane arrangement. When the hyperplane arrangement is in general position (Definition 3, Fig. 2), [5] established an exact result for counting the induced regions. However, it remains a challenging problem to identify the number of regions for more general hyperplane arrangements. We show that under the relaxed condition of Definition 3, which accounts for various nontrivial arrangements as shown in Fig. 2, one can exactly count the number of regions. {theorem}[Regions induced by relaxed general position arrangement] Consider a set $\mathcal{H}$ of $n$ hyperplanes in $\mathbb{R}^d$. If the hyperplane arrangement is in relaxed general position for some $d' \le d$, then the number of induced regions is $r = \sum_{i=0}^{d'} \binom{n}{i}$. In the following we sketch the proof of Theorem 4.1. The key insight of the proof is a reduction to the special case of general position in some subspace of dimension $d' \le d$. We show the reduction by constructing the subspace spanned by the normal vectors of the hyperplanes in $\mathcal{H}$.
As a key observation, note that this subspace is $d'$-dimensional. Let the induced set of hyperplanes in the subspace be formed by intersecting each hyperplane in $\mathcal{H}$ with the subspace. The regions of the original arrangement can then be related to the regions of the induced arrangement via the natural map sending each original region to its intersection with the subspace. The following proposition shows that this map is bijective, thereby providing an alternate way to count the regions. {proposition} The map (as defined above) is a bijection. Thus, the number of regions of the original arrangement equals that of the induced arrangement. Note that, if we can count the regions induced by the arrangement within the subspace, then the number of regions in the original space can be ascertained too. The following key lemma, proved in the appendix, shows that the induced arrangement is in general position within the subspace.
5 Connections to Learning Complexity
In this section, we consider the problem of learning a convex polytope via halfspace queries, without the presence of a helpful teacher. We consider both the passive learning setting, where the learner makes i.i.d. queries, and the active learning setting, with actively chosen queries, and provide sample complexity results accordingly.
Learning convex polytopes via halfspace queries
Consider the hyperplane set in $\mathbb{R}^d$ and a target region. For any hyperplane in the set, the labeling function of the target region, as defined in §3, specifies the hyperplane's label, i.e. the halfspace containing the target. The problem of learning a region therefore reduces to identifying the corresponding labeling function. The objective here is to learn the region by querying the labels of hyperplanes via an oracle. Similar to the teaching setting, we assume that the target is sampled uniformly at random. In the following, we establish sample complexity results, i.e., on the minimal number of halfspace queries required to determine a target region, under the settings of active and passive learning.
5.1 Active Learning of Convex Polytopes
In §4, we showed that the worst-case teaching complexity for convex polytopes is $\Theta(n)$;
this directly implies a lower bound of $\Omega(n)$ on the worst-case complexity of active learning.
We now show that when the underlying hyperplane arrangement is in relaxed general position, the average-case complexity of active learning has only a logarithmic dependency on the number of hyperplanes. We achieve this by actively selecting informative queries, using a characterization of ambiguous queries similar to that of \citetjamieson2011active for the pairwise ranking problem.
Concretely, we consider the following querying strategy: for an (unknown) target region and a uniformly random ordering of the hyperplanes, the learner checks in each iteration whether the query for the next sampled hyperplane is ambiguous (i.e. whether the hyperplane intersects the convex body defined by the hyperplanes sampled previously); it then asks for or imputes the label depending on this ambiguity.
In any iteration of the above query selection procedure, the following holds. {lemma}[Probability of ambiguity] Let $p_k$ denote the probability of the event that the $k$-th query is ambiguous, where the $k$-th hyperplane is sampled uniformly at random. If the arrangement is in relaxed general position with parameter $d'$, then there exists a positive real constant $c$ independent of $k$ such that for $k > d'$, $p_k \le c \, d'/k$. Lemma 5.1 allows us to bound the expected number of label queries. As detailed in the appendix, summing these probabilities yields an average active learning complexity of $O(d' \log n)$.
6 Teaching Separable Dichotomies as Teaching Convex Polytopes
In §4, we discussed the generic problem of teaching convex polytopes induced by intersections of halfspaces via halfspace queries.
We now consider the problem of separability of points (also see Fig. 1), which can be viewed as a variant of teaching convex polytopes. We achieve similar average-case teaching complexity results for this problem. In the seminal work [1], Cover studied the problem of separability of points, in which the task is to classify points using various types of classifiers (linear or nonlinear).
We first provide useful definitions for the domain of discussion. We define a set of $n$ points in $\mathbb{R}^d$
(referred to as the data space),
and use $\bar{x}$ to represent the first $d-1$ coordinates of a point $x$.
A map $\phi$ from the data space to a Euclidean space is called a $\phi$-map, and the image of the data space under $\phi$ is called the induced space. A dichotomy (i.e., a disjoint partition of a set into a positive and a negative part) of the data points is $\phi$-separable if there exists a vector $w$ (aka separator of the dichotomy) such that $\langle w, \phi(x) \rangle > 0$ if $x$ is in the positive part, and $\langle w, \phi(x) \rangle < 0$ if $x$ is in the negative part.
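For homogeneous linear separators, Cover's counting function gives the number of separable dichotomies of $n$ points in general position in $\mathbb{R}^d$. The sketch below (ours) also brute-forces the $\mathbb{R}^2$ case by sweeping candidate separators over the unit circle:

```python
from math import comb, cos, sin, pi

def cover_count(n: int, d: int) -> int:
    """Cover (1965): number of homogeneously linearly separable
    dichotomies of n points in general position in R^d."""
    return 2 * sum(comb(n - 1, k) for k in range(d))

def realizable_dichotomies(points, resolution=20000):
    """Brute force for d = 2: sweep unit-norm separators w around the
    circle and collect the distinct sign patterns they realize."""
    patterns = set()
    for t in range(resolution):
        theta = 2 * pi * t / resolution
        w = (cos(theta), sin(theta))
        vals = [w[0] * x + w[1] * y for x, y in points]
        if any(abs(v) < 1e-9 for v in vals):
            continue  # w is (numerically) on a boundary; skip it
        patterns.add(tuple(v > 0 for v in vals))
    return len(patterns)

# 4 points in general position in R^2: Cover's count predicts that
# 2 * (C(3,0) + C(3,1)) = 8 of the 16 dichotomies are separable.
assert cover_count(4, 2) == 8
assert realizable_dichotomies([(1, 0), (1, 1), (0, 1), (-1, 2)]) == 8
```

For $n \le d$ the formula gives $2^n$, i.e. every dichotomy is separable.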
{definition}[Relaxed general position of points]
For a set of data points in $\mathbb{R}^d$ and $d' \le d$, we say the set is in relaxed general position if every subset of at most $d'$ points is in general position.
Naturally, we define the teaching set for a separable dichotomy as the teaching set for the dual convex polytope of the induced space. Following standard practice, we call the hypothesis space (where each hypothesis/region corresponds to a dichotomy)
the dual space, and the data space the primal space.
We discuss the construction and relevant properties of duality below.
we assume that one of the data points is $e_d$ (the standard basis vector in $\mathbb{R}^d$
with the $d$-th coordinate being 1 and others being 0). Denote the set of all homogeneously linearly separable dichotomies of the point set by $\mathcal{D}$. We observe that if $w$ is a linear separator of a dichotomy, then $-w$ forms a linear separator for the complementary dichotomy.
Based on this observation, we define a relation on the elements of $\mathcal{D}$ as follows: two dichotomies are related iff they are identical or complementary. Notice that this relation is reflexive, symmetric, and transitive. Thus, it is an equivalence relation. Denote by $\tilde{\mathcal{D}}$ the set of equivalence classes, i.e. the quotient set \citep[see][]rossen2003discrete. It is easy to see that $|\mathcal{D}| = 2|\tilde{\mathcal{D}}|$, since each equivalence class contains exactly a dichotomy and its complement. Before we construct the dual map, we state a key assumption used in the construction as follows:
Assumption 1
We represent each equivalence class by the dichotomy which labels $e_d$ as positive.
This implies that if $w$ is a homogeneous linear separator of the representative dichotomy of a class, then $w_d > 0$, as $\langle w, e_d \rangle = w_d$. The dual map exploits this property of each equivalence class, i.e.
$$x_i \;\mapsto\; h_i := \{w \in \mathbb{R}^d : \langle w, x_i \rangle = 0\} \qquad (1)$$
Hence, a point $x_i$ in the primal space maps to the hyperplane $h_i$ in the dual space,
and a homogeneous linear hyperplane with normal $w$ maps to the point $w$ in the dual space. Notice that $e_d$ maps to a hyperplane which exists at infinity.
Denote the set of dual hyperplanes by $\mathcal{H}'$.
For instance, in ranking via pairwise comparisons, $n$ points $\{x_1, \ldots, x_n\} \subseteq \mathbb{R}^d$ are ranked by a reference point $r \in \mathbb{R}^d$, where $x_i$ is ranked before $x_j$ iff
$$\Vert x_i - r \Vert < \Vert x_j - r \Vert. \qquad (2)$$
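The primal-dual correspondence can be checked directly: the dichotomy a separator $w$ induces in the primal space is, by construction, the sign vector of the dual region containing the point $w$. A minimal Python sketch (ours; the helper names are hypothetical):

```python
# Points live in the primal space; candidate homogeneous separators w
# live in the dual space.
POINTS = [(1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]

def primal_labels(w, points):
    """Dichotomy that the homogeneous separator w induces in the primal."""
    return tuple(1 if w[0] * x + w[1] * y > 0 else -1 for x, y in points)

def dual_side(w, x):
    """Side of the dual hyperplane h_x = {v : <v, x> = 0} containing w."""
    return 1 if w[0] * x[0] + w[1] * x[1] > 0 else -1

w = (2.0, -1.0)
# The dichotomy separated by w in the primal equals the sign vector of
# the dual region containing the point w.
assert primal_labels(w, POINTS) == tuple(dual_side(w, x) for x in POINTS)
assert primal_labels(w, POINTS) == (1, 1, -1)
```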
Footnotes
 This idea is more formally studied in the hyperplane arrangement literature as essentialization \citep[see][chap: An introduction to hyperplane arrangement]miller2007geometric. See Appendix LABEL:appendixsub:_bijection for further discussion.
 Full algorithm is detailed in Appendix LABEL:appendixsub:_characterization.
 See [1] for the definition of general position of points.
 We use this notation to signify that exists in infinity.