# Average-case Complexity of Teaching Convex Polytopes via Halfspace Queries

Akash Kumar (akumar@mpi-sws.org)

## Abstract

We examine the task of locating a target region among those induced by intersections of $n$ halfspaces in $\mathbb{R}^d$. This generic task connects to fundamental machine learning problems, such as training a perceptron and learning a $\phi$-separable dichotomy. We investigate the average teaching complexity of the task, i.e., the minimal number of samples (halfspace queries) required by a teacher to help a version-space learner in locating a randomly selected target. As our main result, we show that the average-case teaching complexity is $\Theta(d)$, which is in sharp contrast to the worst-case teaching complexity of $\Theta(n)$. If we instead consider the average-case learning complexity, the bounds depend on $n$ as $\Theta(n)$ for i.i.d. queries and $\Theta(d \log n)$ for queries actively chosen by the learner. Our proof techniques are based on novel insights from computational geometry, which allow us to count the number of convex polytopes and faces in a Euclidean space depending on the arrangement of halfspaces. Our insights allow us to establish a tight bound on the average-case complexity for $\phi$-separable dichotomies, which generalizes the known bound on the average number of “extreme patterns” in the classical computational geometry literature (Cover, 1965).

Keywords: Teaching dimension, homogeneous halfspaces, average-case complexity

## 1 Introduction

We consider the problem of locating a target region among those induced by intersections of $n$ halfspaces in $d$ dimensions (Fig. 1). In the basic setting, the learner receives a sequence of instructions, which we refer to as halfspace queries \citep[same as membership queries in][]{ANGLUIN198787,ANGLUINb}, each specifying a halfspace the target region is in. Based on the evidence it receives, the learner then determines the location of the target region. This generic task connects to several fundamental problems in machine learning. Consider learning a linear prediction function in $\mathbb{R}^d$ (aka perceptron, see Fig. 1) over linearly separable data points. Here, every data point specifies a halfspace, and the target hypothesis corresponds to a region in the hypothesis space. The learning task reduces to identifying the convex polytope induced by the halfspace constraints in the hypothesis space \citep{bishop2006pattern}. Similarly, when the set of data points is not linearly separable, but is separable by a $\phi$-surface (aka $\phi$-separable dichotomy, see Fig. 1), the problem of finding the $\phi$-separable dichotomy can be viewed as training a perceptron in the $\phi$-induced space \citep{cover1965geometrical}.

While these fundamental problems have been extensively studied in the passive learning setting \citep{vapnik1971uniform,natarajan1987learning,blumer1989learnability,goldman1993learning}, the underlying i.i.d. sampling strategy often requires more data than necessary to learn the target concept (when one is able to control the sampling strategy). Moreover, the majority of existing work focuses on worst-case complexity measures, which are often too pessimistic and do not reflect the learning complexity in real-world scenarios \citep{haussler1994bounds,wan2010learning,nachum2019average}. As shown in Table 1, the label complexity of passive learning for the above generic task is $\Theta(n)$. Recently, there has been increasing interest in understanding the complexity of interactive learning, which aims to learn under more optimistic, realistic scenarios, in which “representative” examples are selected, and the number of examples needed for successful learning may shrink significantly. For example, in the active learning setting, the learner queries only data points that are helpful for the learning task, which can lead to exponential savings in sample complexity compared with the passive learning setting \citep{guillory2009average,jamieson2011active,hanneke2015minimax,kane2017active}.

An alternative interactive learning scenario is the setting where learning happens in the presence of a helpful teacher, who identifies useful examples for the learning task. This setting is known as machine teaching \citep{DBLP:journals/corr/ZhuSingla18}. Importantly, the label complexity of teaching provides a lower bound on the number of samples needed by active learning \citep{zilles2011models}, and therefore can provide useful insights for designing interactive learning algorithms \citep{brown2019machine}. Machine teaching has been extensively studied in terms of the worst-case label complexity \citep{goldman1995complexity,article:anthony95,zilles2008teaching,doliwa2014recursive,chen2018understanding,mansouri2019preference}. However, to the best of our knowledge, the average complexity of machine teaching, even for the fundamental tasks described above, remains significantly underexplored.

In this paper, we investigate the average teaching complexity, i.e., the minimal number of examples required by a teacher to help a learner in locating a randomly selected target. We highlight our key results below.

• We show that under the common assumption that the hyperplanes are in general position in $\mathbb{R}^d$, the average-case complexity for teaching such a target is $\Theta(d)$. This is in sharp contrast to the worst-case teaching complexity of $\Theta(n)$ (cf. §4).

• We provide a natural extension of the general-position hyperplane arrangement condition, and show that if the hyperplanes in $\mathbb{R}^d$ are in “$d'$-relaxed general position arrangement” where $d' \le d$, then one can further obtain an improved complexity of $\Theta(d')$ for average-case teaching. Our proof techniques are based on novel insights from computational geometry, which allow us to count the number of convex polytopes and faces in a Euclidean space depending on the hyperplane arrangement. Our result improves upon the existing result for arbitrary hyperplane arrangements \citep{Fukuda1991BoundingTN} (cf. §4).

• To draw a connection with the learning complexity, we show that without the presence of a teacher, a learning algorithm requires $\Theta(n)$ samples for i.i.d. queries and $\Theta(d' \log n)$ for actively chosen queries. Table 1 summarizes our main complexity results (cf. §5).

• Based on our proof framework in §4, we provide complexity results for teaching $\phi$-separable dichotomies, which recover and extend the known bound on the average number of “extreme patterns” in the classical computational geometry literature \citep{cover1965geometrical} (cf. §6).

## 2 Related Work

#### Average-case complexity of learning

While the majority of complexity measures for concept classes and data selection algorithms focus on worst-case scenarios, a few works concern the average-case complexity of various types of learning algorithms. Here we survey related work on average-case complexity in the learning setting. [3] studied how the sample complexity depends on properties of a prior distribution on the concept class and over the sequence of examples the algorithm receives. Specifically, they studied the probability of an incorrect prediction for an optimal learning algorithm using the Shannon information gain. [9] considered the problem of learning DNF formulas sampled from the uniform distribution. [6] considered the average information complexity of learning (defined as the average mutual information between the input and the output of the learning algorithm). They show that for a concept class of VC dimension $d$, there exists a proper learning algorithm that reveals $O(d)$ bits of information for most concepts. Intuitively, this result aligns with our observation that the average complexities of various data selection algorithms are significantly lower than in the worst-case scenario. [7, 8] introduce the paradigm of smoothed analysis, which differs from our average-case analysis as we do not allow perturbations to input spaces. Perhaps most similar to our approach, in terms of technical insights, is the work of [4], who studied the problem of active ranking via pairwise comparisons and used the geometrical properties of hyperplanes in $\mathbb{R}^d$ to achieve an average complexity of $O(d \log n)$ for active ranking over $n$ points. In our work, we extend their results to the general problem of active learning of halfspaces, and also consider the teaching variant of the ranking via pairwise comparison problem.

#### Connection with the PAC learning framework

Intersections of halfspaces have been studied in the PAC learning framework \citep{pacpitt,BLUM1997371,KLIVANS2004808,klivanscrypto,vempala,KHOT2011129,learnconvexpolytope}. Although we focus on exact teaching of intersections of halfspaces induced by hyperplanes, our results could be readily extended to analyze the average sample complexity for teaching a PAC learner in the realizable case. It is well known that a single halfspace can be PAC-learnt efficiently by sampling a polynomial number of data points and finding a separating hyperplane via linear programming \citep{blumer1989learnability}. Relating this to the worst-case sample complexity results in Table 1, the worst-case sample complexity for teaching a halfspace to a PAC learner is also polynomial in the VC dimension of halfspaces. One can then extend the average-case complexity results in Table 1, based on an argument similar to pool-based active learning \citep{mccallumzy1998employing}. The idea is for the teacher to draw unlabeled examples i.i.d. from the underlying data distribution. Instead of providing all labels, the teacher provides labels only for an optimal teaching set, such that the labels of all remaining unlabeled examples are implied by the given labels. The learner has thus obtained labeled examples drawn i.i.d., and classical PAC bounds still apply.

#### Relevant work in algorithmic machine teaching

As discussed above, the teaching problem for various concept classes has been explored before. The classic definition of average teaching dimension \citep{goldman1995complexity}, which coincides with our definition in the uniform setting, has been studied in various settings: \citet{article:anthony95} showed a bound of $O(n^2)$ for the class of linearly separable Boolean functions; \citet{Kushilevitz1996WitnessSF} showed an improved upper bound of $O(\sqrt{|\mathcal{C}|})$ for any concept class $\mathcal{C}$; \citet{kuhlman} proved that all classes of VC dimension 1 have an average teaching dimension of less than 2; \citet{DNFteach} have shown an $O(ns)$ bound on the class of DNFs with at most $s$ terms. In contrast, our work bypasses any dependence on the size of the concept class, and achieves an average teaching complexity of $\Theta(d')$ (where $d' \le d$). Some more powerful notions of teaching dimension in the sequential setting, namely recursive and preference-based teaching, have been studied in \citet{doliwa2014recursive,recursiveteach}; these differ from our batched setting. There is increasing interest in connecting the VC dimension to the teaching problem of concept classes \citep[stated in][]{Simon2015OpenPR,Hu2017QuadraticUB}. We note that the VC dimension of hyperplanes in general position is $d$ \citep{edelsbrunner}, which is closely related to our average-case result but far from the worst-case result.

## 3 Teaching Convex Polytopes via Halfspace Queries: A General Model

#### Convex polytopes induced by hyperplanes

Let $h = \{z \in \mathbb{R}^d : \eta \cdot z + b = 0\}$ be a hyperplane in $\mathbb{R}^d$, where $\eta \in \mathbb{R}^d$ and $b \in \mathbb{R}$. We say a point $z$ satisfies or lies in $h$ if $\eta \cdot z + b = 0$. We define a halfspace induced by a hyperplane $h$ to be one of the two connected components of $\mathbb{R}^d \setminus h$, i.e. the sets corresponding to $\eta \cdot z + b > 0$ and $\eta \cdot z + b < 0$. We define $\mathcal{H}$ as a set of $n$ hyperplanes in $\mathbb{R}^d$. The arrangement of the hyperplanes in $\mathcal{H}$, denoted as $\mathcal{A}(\mathcal{H})$, induces intersections of halfspaces which create connected components. Any connected component of $\mathbb{R}^d \setminus \bigcup_{h \in \mathcal{H}} h$ is defined as a region or convex polytope in $\mathbb{R}^d$. Equivalently, any region $r$ can be exactly specified by the intersections of halfspaces induced by hyperplanes in $\mathcal{H}$. We call the smallest subset of $\mathcal{H}$ that exactly specifies $r$ the bounding set of hyperplanes for $r$. We define the connected components induced on a hyperplane $h \in \mathcal{H}$ by the remaining hyperplanes of the arrangement as faces. Thus, the bounding set forms the faces of the polytope $r$.

###### Example (Convex polytopes induced by hyperplanes)

Fig. 1 provides an example of the arrangement of 5 hyperplanes, where arrows on the hyperplanes specify halfspaces. The bounding set for the highlighted region $r$ forms 3 faces of $r$.

We use $\mathcal{R}(\mathcal{A}(\mathcal{H}))$ to denote the regions induced by the arrangement and $|\mathcal{R}(\mathcal{A}(\mathcal{H}))|$ the number of regions. We define a labeling function $\ell_r : \mathcal{H} \to \{1, -1\}$ for an arbitrary region $r$. Note that $r$ uniquely identifies its labeling function $\ell_r$.

#### The teaching framework

We study the problem of teaching target regions (convex polytopes) induced by a hyperplane arrangement $\mathcal{A}(\mathcal{H})$ in $\mathbb{R}^d$. Our teaching model is formally stated below. Consider the set of instances $\mathcal{H}$, with label set $\{1, -1\}$ corresponding to the two halfspaces induced by a hyperplane. Our hypothesis class, denoted as $\mathcal{R}(\mathcal{A}(\mathcal{H}))$, is the set of regions induced by $\mathcal{A}(\mathcal{H})$. Consider a target region $r^* \in \mathcal{R}(\mathcal{A}(\mathcal{H}))$. Let $\mathcal{Q} \subseteq \mathcal{H} \times \{1, -1\}$ be the ground set of examples (i.e. labeled instances). We define a labeled subset $Q \subseteq \mathcal{Q}$ as halfspace queries. We assume that for any halfspace queries $Q$, the labels are consistent with the target, i.e., $\forall (h, l) \in Q$, $\ell_{r^*}(h) = l$. The version space induced by $Q$ is the subset of regions that are consistent with the labels of all the halfspace queries, i.e.,

$$\mathrm{VS}(Q) = \{\, r \in \mathcal{R}(\mathcal{A}(\mathcal{H})) \mid \forall (h, l) \in Q,\ \ell_r(h) = l \,\},$$

or equivalently, the set of convex polytopes which satisfy the halfspace queries $Q$. We define our version space learner as one which, upon seeing a set of halfspace queries, maintains a version space containing all the regions that are consistent with all the observed queries. Corresponding to a version space learner and a target region $r^*$, we define a teaching set as a minimal set of halfspace queries such that the resulting version space contains exactly $r^*$. Formally,

$$\mathrm{TS}(\mathcal{H}, r^*) \in \arg\min_{Q \subseteq \mathcal{Q}} |Q| \quad \text{s.t.} \quad \mathrm{VS}(Q) = \{r^*\}.$$

Consequently, we want to teach a target hypothesis (region), say $r^*$, by specifying the halfspace queries in the teaching set to a learner. Given a target region $r^*$, the teaching complexity \citep{goldman1995complexity} is defined as the sample size of the teaching set, i.e. $|\mathrm{TS}(\mathcal{H}, r^*)|$.
In §4, we analyze the teaching complexity of convex polytopes in both the average-case and the worst-case framework. We define the average teaching complexity of convex polytopes via halfspace queries as the expected size of the teaching set, i.e. $\mathbb{E}_{r^*}\left[|\mathrm{TS}(\mathcal{H}, r^*)|\right]$, when the target region is sampled uniformly at random. We define the worst-case teaching complexity as the largest sample size of a teaching set over all target regions in the set of hypotheses.
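To make the definitions of version space, teaching set, and average teaching complexity concrete, the following is a minimal brute-force sketch on a toy arrangement of four lines in the plane. The arrangement `LINES` and all function names are hypothetical and chosen only for illustration; the region enumeration assumes the lines are in general position (no two parallel, no three concurrent), so that every region touches a vertex and the four sign patterns around each vertex are all realized.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical toy arrangement: line (a, b, c) is the set a*x + b*y + c = 0.
# The four lines are pairwise non-parallel and no three meet in a point.
LINES = [(1, 0, 0), (0, 1, 0), (1, 1, -1), (1, -1, 2)]


def side(line, p):
    """Label (+1 or -1) of the halfspace of `line` containing p; 0 if p lies on the line."""
    a, b, c = line
    v = a * p[0] + b * p[1] + c
    return (v > 0) - (v < 0)


def regions(lines):
    """Sign vectors of all regions, collected from the four regions around each vertex."""
    found = set()
    for i, j in combinations(range(len(lines)), 2):
        (a1, b1, c1), (a2, b2, c2) = lines[i], lines[j]
        det = a1 * b2 - a2 * b1  # nonzero because no two lines are parallel
        vertex = (Fraction(b1 * c2 - b2 * c1, det), Fraction(a2 * c1 - a1 * c2, det))
        base = [side(l, vertex) for l in lines]
        for si in (-1, 1):
            for sj in (-1, 1):
                labels = list(base)
                labels[i], labels[j] = si, sj
                found.add(tuple(labels))
    return sorted(found)


def version_space(regs, queries):
    """Regions whose labels agree with every (hyperplane index, label) query."""
    return [r for r in regs if all(r[h] == l for h, l in queries)]


def teaching_set(regs, target):
    """A smallest set of consistent halfspace queries whose version space is {target}."""
    n = len(target)
    for k in range(n + 1):
        for idx in combinations(range(n), k):
            queries = [(h, target[h]) for h in idx]
            if version_space(regs, queries) == [target]:
                return queries
    raise ValueError("target is not a region of the arrangement")


def average_teaching_complexity(regs):
    """Expected teaching-set size for a target drawn uniformly from the regions."""
    return Fraction(sum(len(teaching_set(regs, r)) for r in regs), len(regs))


if __name__ == "__main__":
    regs = regions(LINES)
    print(len(regs), average_teaching_complexity(regs))
```

In this sketch the smallest teaching set of a region coincides with its bounding set of lines, so the average is the total number of region-edge incidences divided by the number of regions. The exhaustive search is exponential in the number of hyperplanes and is meant only for very small examples.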

#### Hyperplanes in general position

We adopt a common assumption in computational geometry [2, 5] that the hyperplane arrangement is in general position, and further provide a relaxed notion of general position hyperplane arrangement, as defined below.

###### Definition (General position of hyperplanes [5])

For a set $\mathcal{H}$ of $n$ hyperplanes in $\mathbb{R}^d$, the arrangement $\mathcal{A}(\mathcal{H})$ is in general position if any subset of $k$ hyperplanes, where $1 \le k \le d$, intersects in a $(d-k)$-dimensional plane, and otherwise has null intersection.

###### Definition (Relaxed general position of hyperplanes)

For a set $\mathcal{H}$ of $n$ hyperplanes in $\mathbb{R}^d$ and $d' \le d$, the arrangement $\mathcal{A}(\mathcal{H})$ is in $d'$-relaxed general position if any subset of $k$ hyperplanes, where $1 \le k \le d'$, intersects in a $(d-k)$-dimensional plane, and otherwise has null intersection.

As illustrated in Fig. 2, relaxed general position accounts for arrangements beyond general position, e.g. the parallel hyperplanes in Fig. 2. General position is the special case $d' = d$ of relaxed general position, which we discuss in detail in

## 4 Average-case Teaching Complexity

In this section, we study the generic problem of teaching convex polytopes via halfspace queries as illustrated in Fig. 1. Before establishing our main result, we first introduce two important results inherently connected to the average teaching complexity: the number of regions (which corresponds to the target hypotheses) induced by the intersections of halfspaces, and the number of faces (which corresponds to the teaching sets) induced by the hyperplane arrangement. Our proofs are inspired by ideas from combinatorial geometry and affine geometry, as detailed below.

### 4.1 Regions and Faces Induced by Intersections of Halfspaces

Consider a set $\mathcal{H}$ of $n$ hyperplanes in $\mathbb{R}^d$. Generally, it is non-trivial to count the number of regions induced by an arbitrary hyperplane arrangement $\mathcal{A}(\mathcal{H})$. When the hyperplane arrangement is in general position (§3, Fig. 2), [5] established an exact result for counting the induced regions, but identifying the number of regions for more general hyperplane arrangements remains challenging. We show that under the relaxed general position condition of §3, which accounts for various non-trivial arrangements as shown in Fig. 2, one can exactly count the number of regions.

###### Theorem 4.1 (Regions induced by $d'$-relaxed general position arrangement)

Consider a set $\mathcal{H}$ of $n$ hyperplanes in $\mathbb{R}^d$. If the hyperplane arrangement $\mathcal{A}(\mathcal{H})$ is in $d'$-relaxed general position for some $d' \le d$, then the following holds:

$$|\mathcal{R}(\mathcal{A}(\mathcal{H}))| = \sum_{i=0}^{d'} \binom{n}{i}.$$

In the following we sketch the proof of Theorem 4.1. The key insight for the proof is in reducing it to the special case of general position in some subspace of dimension $d' \le d$. We show the reduction by constructing a subspace defined as:

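As a key observation, note that this subspace is $d'$-dimensional. Let $\hat{\mathcal{H}}_{n,d'}$ be the induced set of hyperplanes in the subspace, formed by the intersections of the hyperplanes in $\mathcal{H}$ with the subspace. Therefore, the number of regions induced¹ by the arrangement of $\hat{\mathcal{H}}_{n,d'}$, denoted as $|\mathcal{R}(\mathcal{A}(\hat{\mathcal{H}}_{n,d'}))|$, is exactly $\sum_{i=0}^{d'} \binom{n}{i}$. Thus, informally, it is sufficient to rely on $\hat{\mathcal{H}}_{n,d'}$ in the subspace to understand the intersections of halfspaces induced by $\mathcal{H}$ in $\mathbb{R}^d$. We observe that every region $\hat{r} \in \mathcal{R}(\mathcal{A}(\hat{\mathcal{H}}_{n,d'}))$ is contained in exactly one region in $\mathcal{R}(\mathcal{A}(\mathcal{H}))$. With this observation, we construct the following map from the regions induced by the hyperplane arrangement $\mathcal{A}(\hat{\mathcal{H}}_{n,d'})$ to those induced by $\mathcal{A}(\mathcal{H})$: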

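$$B : \mathcal{R}(\mathcal{A}(\hat{\mathcal{H}}_{n,d'})) \longrightarrow \mathcal{R}(\mathcal{A}(\mathcal{H})), \qquad \hat{r} \longmapsto \mathrm{region}_{\mathcal{A}(\mathcal{H})}(\hat{r}),$$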

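where $\mathrm{region}_{\mathcal{A}(\mathcal{H})}(\hat{r}) = r$ for the region $r \in \mathcal{R}(\mathcal{A}(\mathcal{H}))$ such that $\hat{r} \subseteq r$. The following proposition shows that $B$ is bijective, thereby providing an alternate way to count $|\mathcal{R}(\mathcal{A}(\mathcal{H}))|$.

###### Proposition

The map $B$ (as defined above) is a bijection. Thus, $|\mathcal{R}(\mathcal{A}(\mathcal{H}))| = |\mathcal{R}(\mathcal{A}(\hat{\mathcal{H}}_{n,d'}))|$.

Note that, if we can resolve induced by the hyperplane arrangement , then can be ascertained too. The following key lemma, proved in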
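As a small numeric companion to the counting argument, here is a minimal sketch that evaluates the classical region count for $n$ hyperplanes in general position in a $d'$-dimensional space, $\sum_{i=0}^{d'} \binom{n}{i}$. It assumes, following the bijection described above, that the same expression counts the regions of a $d'$-relaxed general position arrangement; the function name `num_regions` is hypothetical.

```python
from math import comb


def num_regions(n: int, d_prime: int) -> int:
    """Number of regions cut out by n hyperplanes, assuming d'-relaxed general position."""
    return sum(comb(n, i) for i in range(d_prime + 1))


if __name__ == "__main__":
    # 4 lines in general position in the plane (d' = 2).
    print(num_regions(4, 2))
    # 5 mutually parallel hyperplanes (d' = 1): one more slab than hyperplanes.
    print(num_regions(5, 1))
```

For fixed $d'$ the count grows like $n^{d'}$, so the number of candidate targets is polynomial in $n$ rather than $2^n$.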

## 5 Connections to Learning Complexity

In this section, we consider the problem of learning a convex polytope via halfspace queries, without the presence of a helpful teacher. We consider both the passive learning setting, where the learner makes i.i.d. queries, and the active learning setting with actively chosen queries, and provide sample complexity results accordingly.

#### Learning convex polytopes via halfspace queries

Consider the hyperplane set $\mathcal{H}$ in $\mathbb{R}^d$ and a target region $r^* \in \mathcal{R}(\mathcal{A}(\mathcal{H}))$. For any hyperplane $h \in \mathcal{H}$, the labeling function $\ell_{r^*}$, as defined in §3, specifies its label (halfspace) as $\ell_{r^*}(h)$. The problem of learning a region therefore reduces to identifying the corresponding labeling function $\ell_{r^*}$. The objective here is to learn the region $r^*$ by issuing halfspace queries on hyperplanes $h \in \mathcal{H}$, each answered with the label $\ell_{r^*}(h)$. Similar to the teaching setting, we assume that the target is sampled uniformly at random. In the following, we establish sample complexity results, i.e., bounds on the minimal number of halfspace queries required to determine a target region, under the settings of active and passive learning.

### 5.1 Active Learning of Convex Polytopes

In §4, we showed that the worst-case teaching complexity for convex polytopes is $\Theta(n)$; this directly implies a lower bound of $\Omega(n)$ on the worst-case complexity of active learning. We now show that when the underlying hyperplane arrangement is in $d'$-relaxed general position, the average-case complexity of active learning has only a logarithmic dependency on the number of hyperplanes. We achieve this by actively selecting informative queries, using a characterization of ambiguous queries similar to the one considered by \citet{jamieson2011active} for the pairwise ranking problem. Concretely, we consider the following querying strategy: for an (unknown) target region $r^*$ and a uniformly random ordering of the hyperplanes in $\mathcal{H}$, the learner checks in each iteration whether the query on the next randomly selected hyperplane $h$ is ambiguous (i.e. $h$ intersects the convex body defined by the hyperplanes sampled previously); it then either asks for the label or imputes it, depending on the ambiguity.
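The querying strategy can be sketched in the simplest possible case, a one-dimensional arrangement, where hyperplanes are threshold points on the real line and the convex body defined by the labeled hyperplanes is an open interval. This is only an illustrative sketch under that assumption, not the full algorithm from the appendix; the function `active_learn` and the convention that label $+1$ means the target lies to the right of the threshold are hypothetical choices.

```python
import random


def active_learn(thresholds, target_point, rng):
    """Label every threshold, querying only the ambiguous ones.

    thresholds: distinct reals (the hyperplanes of a 1-D arrangement).
    target_point: any point inside the unknown target region (used as the oracle).
    Returns (labels, number_of_queries).
    """
    lo, hi = float("-inf"), float("inf")  # current convex body: the interval (lo, hi)
    labels, queries = {}, 0
    order = list(thresholds)
    rng.shuffle(order)  # uniformly random ordering of the hyperplanes
    for t in order:
        if lo < t < hi:
            # Ambiguous: t cuts the current interval, so the label must be requested.
            queries += 1
            label = 1 if target_point > t else -1
            if label == 1:
                lo = t
            else:
                hi = t
        else:
            # Not ambiguous: the label is implied by earlier answers and is imputed.
            label = 1 if t <= lo else -1
        labels[t] = label
    return labels, queries


if __name__ == "__main__":
    labels, queries = active_learn(range(1000), 412.5, random.Random(0))
    print(queries, "queries for", len(labels), "hyperplanes")
```

In this one-dimensional case a threshold is queried only if it is closer to the target than every threshold seen before it on the same side, which is why the number of requested labels is typically far smaller than the number of hyperplanes.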
In any iteration of the above query selection procedure², denote the event of requesting the query for a sampled hyperplane by . That is, . Note that each is a Bernoulli distribution with unknown parameter to be ascertained. If we can bound then we bound the expected number of queries as well. We define by of size as the set of hyperplanes sampled by the procedure. We notice that the sampled hyperplane is ambiguous if it intersects the convex body defined by hyperplanes in . Thus we want to bound the probability of the event that the query is ambiguous. Denote the probability of such an event as . Notice is our here. In Lemma 5.1, we show that is upper bounded by a factor of for relatively small sample size .

###### Lemma 5.1 (Probability of ambiguity)

Assume . Let denote the probability of the event that the query is ambiguous where is the sampled hyperplane. If is in $d'$-relaxed general position, then there exists a positive real constant independent of such that for , .

Lemma 5.1 allows us to bound the expected value of . As detailed in

## 6 Teaching ϕ-separable Dichotomy as Teaching Convex Polytopes

In §4, we discussed the generic problem of teaching convex polytopes induced by intersections of halfspaces via halfspace queries. We now consider the problem of $\phi$-separability of points (see also Fig. 1), which can be viewed as a variant of teaching convex polytopes, and obtain similar average-case teaching complexity results for it. In the seminal work [1], Cover studied the problem of $\phi$-separability of points, in which the task is to classify points using various types of classifiers (linear or non-linear).

We first provide useful definitions for the domain of discussion. We define a set of $n$ points in $\mathbb{R}^d$ as $\mathcal{X}$ (referred to as the data space), and use $x_{[k]}$ to represent the first $k$ coordinates of a point $x$. A map $\phi$ defined on $\mathcal{X}$ is called a $\phi$-map, and its image $\phi(\mathcal{X})$ is called the $\phi$-induced space. A dichotomy $\{\mathcal{X}^+, \mathcal{X}^-\}$ (i.e., a disjoint partition of a set) of $\mathcal{X}$ is $\phi$-separable if there exists a vector $w$ (aka separator of the dichotomy) such that: if $x \in \mathcal{X}^+$ then $w \cdot \phi(x) > 0$, and if $x \in \mathcal{X}^-$ then $w \cdot \phi(x) < 0$.

###### Definition (Relaxed general position of points)

For a set of data points in $\mathbb{R}^d$, say $\mathcal{X}$, $\mathcal{X}$ is in $d'$-general position³ for a fixed $d' \le d$ if every subset of $\mathcal{X}$ of size at most $d'$ is linearly independent.

###### Definition (Relaxed $\phi$-general position)

Consider a set $\mathcal{X}$ of data points in $\mathbb{R}^d$. For a $\phi$-map, $\mathcal{X}$ is said to be in $d'$-relaxed $\phi$-general position for a fixed $d'$ if every subset of $\phi$-induced points of size at most $d'$ is linearly independent.

We consider the problem of teaching a $\phi$-separable dichotomy as providing labels to a subset of $\mathcal{X}$ such that a separator can be taught which separates the entire dichotomy. In the remainder of this section, we show that the teaching problem of $\phi$-separability of dichotomies (Fig. 1) can be studied as a special case of teaching convex polytopes. We connect the two problems via duality. Notice that showing the duality for homogeneous linear separability of dichotomies, i.e. $\phi = \mathrm{Id}$ (identity function), suffices for general $\phi$-separability since it reduces to the homogeneous case.
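The definition of $\phi$-separability can be illustrated with a minimal sketch. The particular map $\phi(x) = (x^2, 1)$ on the real line, the sample points, and the function names below are hypothetical and chosen only because the resulting dichotomy (outer points positive, inner points negative) is not separable by a homogeneous linear function of $x$ itself, but is separated by $w = (1, -1)$ in the $\phi$-induced space.

```python
def phi(x):
    """Hypothetical phi-map from R to R^2: x -> (x^2, 1)."""
    return (x * x, 1.0)


def is_phi_separated(points, labels, w):
    """True if w . phi(x) > 0 on every positive point and < 0 on every negative point."""
    for x, y in zip(points, labels):
        score = sum(wi * fi for wi, fi in zip(w, phi(x)))
        if y * score <= 0:
            return False
    return True


if __name__ == "__main__":
    points = [-2.0, -0.5, 0.5, 2.0]
    labels = [1, -1, -1, 1]  # the dichotomy: outer points vs. inner points
    print(is_phi_separated(points, labels, (1.0, -1.0)))  # w . phi(x) = x^2 - 1
```

Checking a given separator is linear in the number of points; finding one is the perceptron training problem in the $\phi$-induced space mentioned in the introduction.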

Naturally, we define the teaching set for a $\phi$-separable dichotomy as the teaching set for the dual convex polytope of the $\phi$-induced space. Following standard practice, we call the hypothesis space (where each hypothesis/region corresponds to a separator $w$) the dual space, and the data space the primal space. We discuss the construction and relevant properties of the duality below.

We assume that $\mathcal{X}$ contains $e_d$ (the standard basis vector in $\mathbb{R}^d$ with the $d$-th coordinate being 1 and all others being 0). Consider the set of all homogeneously linearly separable dichotomies of $\mathcal{X}$. We observe that if $w$ is a linear separator of $\{\mathcal{X}^+, \mathcal{X}^-\}$, then $-w$ is a linear separator of $\{\mathcal{X}^-, \mathcal{X}^+\}$. Based on this observation, we define a relation $\sim$ on this set which relates each dichotomy to itself and to the dichotomy obtained by swapping its two sides. Notice that $\sim$ is reflexive, symmetric, and transitive. Thus, $\sim$ is an equivalence relation. Denote by $E_{\mathcal{X}}$ the set of equivalence classes, i.e. the quotient set \citep[see][]{rossen2003discrete}. It is easy to see that $E_{\mathcal{X}}$ has half as many elements as there are separable dichotomies, where $[v]$ denotes the equivalence class of a dichotomy $v$. Before we construct the dual map, we state a key assumption used in the construction:

###### Assumption 1

We represent each equivalence class by the dichotomy which labels $e_d$ as positive.

This implies that if $w$ is a homogeneous linear separator of the representative dichotomy of a class, then $w_d > 0$, as $w \cdot e_d > 0$. The dual map exploits this property of each equivalence class, i.e.

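$$w \cdot x = x \cdot w = w_d \, (x_{[d-1]}, x_d) \cdot (w_{[d-1]}/w_d, 1) \lessgtr 0 \;\Longrightarrow\; x_{[d-1]} \cdot (w_{[d-1]}/w_d) + x_d \triangleq h_{[d-1]} \cdot z_w + x_d \lessgtr 0 \tag{1}$$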
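The sign identity in Eq. (1) can be checked numerically with a short sketch. It assumes a separator with $w_d > 0$, as guaranteed by Assumption 1, and uses exact rational arithmetic; the function names are hypothetical.

```python
from fractions import Fraction


def sign(v):
    """Return +1, -1 or 0 according to the sign of v."""
    return (v > 0) - (v < 0)


def dual_point(w):
    """Map a separator w with w_d > 0 to the dual point z_w = w_[d-1] / w_d."""
    *head, last = w
    if last <= 0:
        raise ValueError("Assumption 1 requires w_d > 0")
    return [Fraction(wi, last) for wi in head]


def dual_hyperplane_value(x, z):
    """Evaluate x_[d-1] . z + x_d, i.e. the dual hyperplane of x at the point z."""
    *head, last = x
    return sum(xi * zi for xi, zi in zip(head, z)) + last


if __name__ == "__main__":
    w = (2, -1, 4)   # separator in the primal space, w_d = 4 > 0
    x = (1, 3, -1)   # data point in the primal space
    primal = sum(wi * xi for wi, xi in zip(w, x))
    dual = dual_hyperplane_value(x, dual_point(w))
    print(sign(primal) == sign(dual))
```

The two quantities differ exactly by the positive factor $w_d$, so the side of the separator on which $x$ lies in the primal space matches the side of the dual hyperplane on which $z_w$ lies.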

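Hence, a point $x \in \mathcal{X}$ maps to the hyperplane $h_x = \{z : x_{[d-1]} \cdot z + x_d = 0\}$ in $\mathbb{R}^{d-1}$ in the dual space, and a homogeneous linear hyperplane with normal $w$ maps to the point $z_w$ in $\mathbb{R}^{d-1}$. Notice that $e_d$ maps to a hyperplane which exists at infinity, since $x_{[d-1]} = 0$ and $x_d = 1$ leave the defining equation without a finite solution. Denote the set of dual hyperplanes by $\bar{\mathcal{H}}$⁴. Formally, we define our dual map as follows: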

$$\Upsilon_{\mathrm{dual}} : \mathcal{X} \to \bar{\mathcal{H}}, \quad x \mapsto h_x; \qquad \varphi_{\mathrm{dual}} : E_{\mathcal{X}} \to \mathcal{R}(\mathcal{A}(\bar{\mathcal{H}})), \quad [v] : w_{[v]} \mapsto r_{z_{[v]}} \tag{2}$$

## 7 Discussion and Conclusion

We have studied the average-case complexity of teaching convex polytopes with halfspace queries, and showed that if the hyperplane arrangement is in $d'$-relaxed general position, then the average teaching complexity is $\Theta(d')$. In contrast, the average-case sample complexity is $\Theta(d' \log n)$ for active learning and $\Theta(n)$ for passive learning. We showed that our insights can be applied to teaching $\phi$-separable dichotomies. Moreover, as discussed in detail in the Appendix, we further show that our insights can be generalized to the problem of teaching rankings over $n$ points $\{x_1, \ldots, x_n\} \subseteq \mathbb{R}^d$ (encoded by their distances to an unknown reference point $r \in \mathbb{R}^d$) via pairwise comparisons (e.g., “is $x_i$ closer to $r$ than $x_j$?”). One interesting line of future work is to understand whether our result can be extended to more general hyperplane arrangement settings. We believe our results provide useful geometrical insights for analyzing the average-case complexity of more complex hypothesis classes.

## Acknowledgements

We thank Ali Sayyadi for the helpful discussions. This work was supported in part by funding from PIMCO and Bloomberg.

### Footnotes

1. This idea is more formally studied in the hyperplane arrangement literature as essentialization \citep[see][chap: An introduction to hyperplane arrangement]{miller2007geometric}. See the Appendix for further discussion.
2. The full algorithm is detailed in the Appendix.
3. See [1] for the definition of general position of points.
4. We use this notation to signify that the dual hyperplane of $e_d$ exists at infinity.