A Divide-and-Conquer Approach to Geometric Sampling for Active Learning
Abstract
Active learning (AL) repeatedly trains a classifier under a minimum labeling budget to improve the current classification model. The training process is usually supervised by an uncertainty evaluation strategy. However, uncertainty evaluation always suffers from performance degeneration when the initial labeled set contains insufficient labels. To completely eliminate the dependence on uncertainty evaluation sampling in AL, this paper proposes a divide-and-conquer idea that directly transfers AL sampling into geometric sampling over the clusters. By dividing the points of the clusters into cluster boundary and core points, we theoretically discuss their margin distances and hypothesis relationships. Given the advantages of cluster boundary points in these two properties, we propose a Geometric Active Learning (GAL) algorithm based on the knight's tour. Experimental studies on two tasks, cluster boundary detection and AL classification, show that the proposed GAL method significantly outperforms the state-of-the-art baselines.
keywords:
Active learning, uncertainty evaluation, geometric sampling, cluster boundary. This manuscript was finally accepted by the Expert Systems with Applications journal.
1 Introduction
Active learning (Activelearning) is developed to further improve prediction accuracy in supervised learning problems without sufficient labels. It has been widely applied in various learning scenarios where unannotated data are abundant but annotating them is expensive and time-consuming, such as semi-supervised text classification (Text), image annotation (Image), transfer learning (Transfer1), etc. Generally, the proposed AL algorithms focus on the construction of an uncertainty evaluation function that guides the subsequent sampling, such as (Uncertaintysampling), (ERR), etc. However, the label diversity and distribution features of the initial labeled set determine the performance of the uncertainty evaluation process. When the initial labeled set contains only a few data points, performance degeneration of the subsequent sampling is inevitable.
Geometric sampling shows its power in various domains such as fast SVM training (CVM1), the Bayesian adversarial spheres algorithm (bekasov2018bayesian), geometric deep learning (fey2018splinecnn), etc. Especially for large-scale classification, the Core Vector Machine (CVM) (CVM2) recast the SVM as a minimum enclosing ball (MEB) problem, which is popular in hard-margin support vector data description (SVDD) (SVDD), and then iteratively calculated the ball center and radius in a (1+ε) approximation. In this process, the cluster boundary points located on the surface of each MEB are added into a special data collection called the core set. Trained on the detected core sets, the proposed CVM performed faster than the SVM and required fewer support vectors. With a Gaussian kernel in particular, a fixed radius was used to simplify the MEB problem to an enclosing ball (EB), which accelerated the calculation process of the Ball Vector Machine (BVM) (BVM). Without sophisticated heuristic searches in the kernel space, a model trained on points of the high-dimensional ball surface can still approximate the optimal solution.
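The (1+ε) MEB iteration underlying CVM can be illustrated with the classical Badoiu–Clarkson core-set construction. The sketch below is a generic approximation of that idea, not the CVM implementation itself; the function name and parameters are ours.

```python
import numpy as np

def meb_approx(X, eps=0.1):
    """(1+eps)-approximate minimum enclosing ball via the classical
    Badoiu-Clarkson core-set iteration (illustrative sketch only)."""
    c = X[0].astype(float).copy()       # arbitrary starting center
    core_set = {0}
    T = int(np.ceil(1.0 / eps ** 2))    # iterations needed for (1+eps)
    for t in range(1, T + 1):
        d = np.linalg.norm(X - c, axis=1)
        far = int(np.argmax(d))         # farthest point joins the core set
        core_set.add(far)
        c += (X[far] - c) / (t + 1)     # shift the center toward it
    r = np.linalg.norm(X - c, axis=1).max()
    return c, r, sorted(core_set)
```

The points repeatedly selected as farthest form the core set; training on them alone is what makes CVM cheaper than a full SVM pass.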
In this paper, we are motivated by the advantages of boundary points in CVM and propose a divide-and-conquer approach to geometric sampling for AL (see Figure 1). Under the MEB model, we divide the data of each class into two types: cluster boundary and core points. In geometric terms, cluster boundary points are located at the surface of a cluster while core points are distributed inside it. To study the properties of the two types of points, we compare them from two aspects: margin distance (w.r.t. Lemma 1) and hypothesis relationship (w.r.t. Lemma 2). The conclusion shows that, from a geometrical perspective, cluster boundary points play a more important role than core points in the construction of the classification hyperplane.
Our conquer step is to obtain the cluster boundary points. By setting a knight in the geometric space, the path disagreement of the tour helps us distinguish cluster boundary points from core points. We assume the tour path is decided by the update process of traversing the nearest neighbors (NN) of the current tour position (data point). The geometric disagreement in path length becomes the key of our detection method, i.e., the average tour path of boundary points is longer than that of core points. With the above divide-and-conquer analysis, we finally propose a Geometric Active Learning (GAL) algorithm that trains on the geometric cluster boundary points. The contributions of this paper are as follows.

We propose a divide-and-conquer idea for geometric AL sampling. It transfers the uncertain sampling space of AL into a set of cluster boundary points.

We provide the geometric insights for cooperating cluster boundary points in AL under the assumption of geometric classification.

An AL algorithm termed GAL is developed in this paper. It samples independently, without iteration or help from the labeled data.

We break the theoretical curse of uncertainty evaluation sampling with the GAL algorithm, since it is neither a model-based nor a label-based strategy and has fixed time and space complexities.

Extensive experiments are conducted to verify that GAL can be applied in multi-class settings, overcoming the binary classification limitation of many existing AL approaches.
The remainder of this paper is structured as follows. The related work is reported in Section 2. The preliminaries are described in Section 3 and the geometric insights on cluster boundary points in AL are presented in Section 4. The divide-and-conquer approach of the knight's tour is presented in Section 5. The experiments and results are reported in Section 6. The discussion is presented in Section 7. Finally, we conclude this paper in Section 8.
2 Related Work
In this section, we present the related work on active learning and cluster boundary research.
2.1 Active learning
The learning goal of AL is to obtain a desired error rate by annotating as few queries as possible. To improve the performance of the current classification model, the AL learner (human expert) is allowed to pick a subset from an unlabeled data pool. Those data which may largely affect the subsequent update of the learning model are the primary targets of the learner. As a policy, accessing the unlabeled data pool to sample and querying the true labels within a given budget is approved. However, all learners face an awkward and difficult situation: how to quickly select the desired data from the massive unlabeled pool.
To resolve the above challenge, uncertainty evaluation (Uncertaintysampling) was proposed to guide AL by selecting the most informative or representative instances under a given sampling scheme or distribution assumption, such as margin (Margin), uncertainty probability (ERR), maximum entropy (Entropy), confused votes by committee (Vote), etc. For example, (Margin) proposes to select the data point nearest to the current classification hyperplane, (ERR) selects the data point that maximizes the change of the error rate, (Entropy) selects the data point with the maximum entropy of the prediction probability, etc. Basically, these uncertainty-based AL algorithms aim to reduce the number of queries or make the classifier converge quickly. Over multiple iterations, querying stops when the defined sampling number is met or a satisfactory model is found. Although this technique performs well, these algorithms still need to traverse the whole data set repeatedly in this framework. Moreover, they suffer from one main limitation: heuristically searching the whole data space to obtain the optimal sampling subset is impossible because of the unpredictable scale of the candidate set.
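As an illustration of uncertainty evaluation, a minimal entropy-based query rule of the kind cited above can be sketched as follows; the function name and array layout are our own assumptions.

```python
import numpy as np

def entropy_query(probs, k=1):
    """Select the k unlabeled points whose predicted class distribution
    has maximum entropy -- a minimal sketch of entropy-based
    uncertainty sampling (not the paper's method)."""
    probs = np.clip(probs, 1e-12, 1.0)          # guard against log(0)
    H = -(probs * np.log(probs)).sum(axis=1)    # per-point prediction entropy
    return np.argsort(-H)[:k]                   # indices of most uncertain points
```

Note that every call scans the whole pool of predicted probabilities, which is exactly the repeated-traversal cost the paragraph above points out.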
In practice, incorporating unsupervised learning into the sampling process shows powerful advantages, e.g., (Preclustering) (Preclustering2) (Preclustering3), and makes it possible for the learner to overcome the previous limitation. One classical method (Hiera) performs hierarchical clustering before sampling to improve the lower bound of the subsequent training performance. By setting up a probability condition, the learner is allowed to confidently annotate a number of subtrees with the label of the root node. When the clustering structure is perfect, this is positive for the sampling. However, improper clustering results will mislead the annotation process, and performance degeneration of the subsequent sampling is then inevitable.
2.2 Cluster boundary
Cluster boundary points are a set of special objects distributed in the margin regions of each cluster. Their labels are given by the cluster structure and guide the clustering partition. However, those label assignments are uncertain. Nowadays, the practical advantage of the cluster boundary has been widely used in latent virus carrier detection (BERGE), abnormal gene segment diagnosis (Spinver), etc.
With prior experience in clustering algorithms, researchers first studied the cluster boundary detection issue in low-dimensional space and proposed a series of approaches, such as (BORDER) (BRIM) (BERGE), etc. Among these algorithms, BORDER first defines the cluster boundary points by measuring the density of their nearest neighbors, and uses the reverse NN to obtain the complete boundary points, but together with all the noises. To smooth the influence of noises, (BRIM) proposes a detection algorithm termed BRIM that analyzes the balance property of the data distributed inside and outside the cluster. Because the extracted features live in a low-dimensional space, this algorithm can only be applied in two-dimensional space. The task of detecting cluster boundary objects in high-dimensional clusters was first studied in (Spinver) by utilizing particle space inversion and the Hopkins statistic. However, the devised Euclidean Gaussian filter function cannot work well in very high-dimensional space because of the uncertainty of noises in the sparse distribution.
Notation  Definition 

classifiers  
prediction error rate of when training  
data set  
data number of  
number of labeled, unlabeled, queried data  
label set  
a data point in  
labeled data points in  
queried data points in  
training set after querying  
distance function  
core points  
cluster boundary points  
noises  
training set of []  
core points located inside the positive class  
core points located inside the negative class  
cluster boundary points located near  
noises  
core points  
boundary points  
noises  
approximation statement  
assignment statement in algorithm  
3 Preliminary
In this section, we first define AL sampling by a family of linear functions. Then, we define the cluster boundary and core points by a group of density functions. Related definitions, main notations and variables are briefly summarized in Table 1.
Given a data space and a label space, consider the classification hypothesis:
(1) 
where the two terms are the parameter vector and the constant vector, respectively. We now give the following definitions:
Definition 1.
Active learning. Optimizing to get the minimum RSS (residual sum of squares)(TED) (LLR):
(2) 
i.e.,
(3) 
where is the labeled data, is the queried data, and is the updated training set.
Definition 2.
Cluster boundary point (BORDER).
A boundary point is an object that satisfies the following conditions:
1. It is within a dense region .
2. region near , or .
Definition 3.
Core point.
A core point is an object that satisfies the following conditions:
1. It is within a dense region .
2. an expanded region based on , .
4 Geometric Insights
In clustering-based AL work, core points provide little help for the parameter training of classifiers. Considering that cluster boundary points may provide decisive factors for the support vectors, CVM and BVM iteratively use the points distributed on the surface of an enclosing ball to quickly train core support vectors in large-scale data sets. Their significant success motivates the work of this paper.
To further show the importance of cluster boundary points, we (1) clarify the performance of training cluster boundary points in Section 4.1, (2) discuss the margin distance of boundary and core points to the classification line or hyperplane in Section 4.2, and (3) analyze the hypothesis relationship when training boundary and core points in Section 4.3, where the discussion cases of (2) and (3) are binary and multi-class classifications in low- and high-dimensional space.
4.1 Performance of cluster boundary
In this section, we propose a geometrical perspective that the performance of the classification model is determined by the cluster boundary points. Our main theoretical result is summarized as follows.
Proposition 1.
Suppose a set of core points and a set of cluster boundary points are drawn from a fixed geometrical cluster, and their union forms the full training set. Let two classification hypotheses be trained with respect to the two training subsets, respectively. The following holds for the generalized error disagreement:
(4) 
where denotes the approximation symbol.
Our main theoretical result in Proposition 1 claims that the core points, distributed inside the central regions of any cluster, have little influence on training a desired hypothesis h. To demonstrate our insights, Lemma 1 and Lemma 2 provide theoretical support from different geometrical views: Lemma 1 proves that cluster boundary points have shorter margin distances to the geometric classification line or hyperplane than core points, and Lemma 2 proves that the trained models generated from core points are a subset of the models generated from the boundary points. In the next subsections, we present the detailed proofs of the two lemmas in binary and multi-class settings of low- and high-dimensional space.
4.2 Margin distance
Margin distance measures the distance of a data point to the classification line or hyperplane. The margin distance relations of boundary points and core points are described in the following lemma.
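For concreteness, under a linear hypothesis such as Eq. (1), the margin distance takes the standard point-to-hyperplane form; the symbols $w$, $b$ and $x$ below are our own notation for the weight vector, bias and data point, since the paper's symbols are not reproduced here:

```latex
d(x, h) \;=\; \frac{\left| w^{\top} x + b \right|}{\lVert w \rVert_{2}}
```

Lemma 1 then states that this quantity is smaller for the boundary points of adjacent clusters than for core points of the same cluster.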
Lemma 1.
Suppose a set of core points and a set of cluster boundary points are drawn from a fixed geometrical cluster, and let the margin distance function be given. The margin distances of boundary points are shorter than those of the core points distributed in their local geometrical space, i.e.,
(5) 
Lemma 1 is supported by Corollary 1 to 3 from different cases:

Corollary 1: holds in binary classification of low-dimensional space, where Corollaries 1.1 and 1.2 prove the claim in adjacent classes and well-separated classes, respectively.

Corollary 2: holds in the multi-class classification issue of low-dimensional space.

Corollary 3: holds in high-dimensional space.
We now present detailed proofs for the above corollaries.
Corollary 1.
holds in binary classification of low-dimensional space.
Given two facts in classification: (1) data points far from h usually have clearly assigned labels with a high prediction class probability; (2) h is always surrounded by noises and a part of the boundary points. Based on these facts, the proof is as follows.
Corollary 1.1: holds in adjacent classes of low-dimensional space.
Proof.
Given any adjacent-classes scenario with binary labels ({−1,+1}) such as Figure 2(a), let four subsets respectively denote the core points inside the positive class, the core points inside the negative class, the cluster boundary points near h, and the noises near h. The RSS analysis in such classification scenarios satisfies:
(6) 
where the four counts denote the numbers of the four types of points. In most classification issues, noises give wrong guidance to model training. We therefore only focus on the differences between the core and boundary points, that is to say,
(7) 
where the right-hand side denotes a constant. In this space, the margin distance function between a point and h can be generalized as
(8) 
Considering the form of the classifier function, we can conclude the stated inequality. Then, Lemma 1 is as stated (see Figure 2(b)). ∎
Corollary 1.2: holds in well-separated classes of low-dimensional space.
Proof.
In the well-separated classes issue (see Figure 2(c)), a model trained on any data points will lead to a strong classification result; that is to say, all AL approaches will perform well in this setting since:
(9) 
where the two subsets respectively denote the cluster boundary points near h in the positive class and in the negative class. With the corresponding substitutions, the results of Eqs. (8) and (9) still hold. ∎
Corollary 2.
holds in multi-class classification of low-dimensional space.
Proof.
In this setting, the classifier set and the cluster boundary points are segmented into multiple parts, where each part denotes the data points close to one classifier (see Figure 2(d)). Based on the result of Corollary 1, dividing the multi-class classification problem into binary classification problems, we can obtain:
(10) 
and
(11) 
where each term represents the core points near the corresponding classifier. Then, the following holds:
(12) 
∎
Corollary 3.
holds in high-dimensional space.
Proof.
In a high-dimensional space, the distance function between a point and the hyperplane can be extended as
(13) 
where the weight is an m-dimensional vector. Because the above equation is the m-dimensional extension of Eq. (8), the proof relating to low-dimensional space is still valid in high-dimensional space. ∎
4.3 Hypotheses relationship
Lemma 2 describes this relationship of the hypotheses generated from the boundary and core points.
Lemma 2.
Suppose a set of core points and a set of cluster boundary points are drawn from a fixed geometrical cluster. Let two hypotheses be trained with respect to the two training subsets, respectively. The following holds:
(14) 
It shows that models trained on the boundary points can predict well, but a model trained on the core points may sometimes not predict well. To prove this relation, we discuss three different cases:

Corollary 4: holds in binary classification of low-dimensional space, where Corollary 4.1 and Corollary 4.2 prove Lemma 2 in one-dimensional space and two-dimensional space, respectively.

Corollary 5: holds in binary classification in high-dimensional space.

Corollary 6: holds in multi-class classification.
Corollary 4.
holds in binary classification of low-dimensional space.
This corollary is supported by two different views in Corollary 4.1 and Corollary 4.2.
Corollary 4.1: holds in linear one-dimensional space.
Proof.
Given a point classifier in the linear one-dimensional space as described in Figure 3(a),
(15) 
where the given points are core points. In comparison, the boundary points have smaller distances to the optimal classification model. Therefore, it is easy to conclude: classifying the two clusters by the boundary-trained point classifier is successful, but the core-trained point classifiers may fail. ∎
Corollary 4.2: holds in two-dimensional space.
Proof.
Given two core points in the two-dimensional space, the line segment between them is described as follows:
(16) 
Training the two core points obtains the following classification hypotheses:
(17) 
where is the angle between (see Figure 3(b)).
Similarly, the classifier trained by is subject to:
(18) 
where the constraint involves the line segment between the two boundary points. Intuitively, the difference between the two hypotheses lies in their constraint equations. We can therefore conclude:
(19) 
This shows that the first classifier cannot classify the two points for certain values in the constraint equation, but for any admissible value, the second classifier classifies them correctly. ∎
Corollary 5.
holds in high-dimensional space.
Proof.
Given two core points, a bounded hyperplane between them is:
(20) 
Training the two data points yields the following classifier:
(21) 
where the angle is between the hypothesis and the normal vector of the bounded hyperplane. Given a point located on the bounded hyperplane, in the positive class or in the negative class, the trained classifier cannot predict the two points correctly. It can also be described as follows: if the hypothesis segments the bounded hyperplane between the point pairs, the trained classifier cannot classify them. Then Lemma 2 is as stated. ∎
Corollary 6.
holds in multiclass classification issue.
Proof.
Following the multi-class classification proof in Lemma 1, the multi-class problem can be segmented into multiple binary classification problems. ∎
5 Geometric Active Learning by Knight’s Tour
In our geometrical analysis, we recast AL as a geometrical sampling process over a fixed cluster. The cluster boundary points, distributed in the margin regions of any class, have been demonstrated to provide more powerful support than core points in terms of margin distance and hypothesis relationship. With this novel insight, in this section we develop a conquer method to find this special set of points. However, the cluster boundary points always have multiple potential positions because of the uncertain locations of the classification hypotheses. Given the diversity of candidate positions of the cluster boundary points, recognizing all potential positions allows us to capture all possible cluster boundary points in any multi-class scenario.
The knight's tour is a classical path planning problem that requires the knight to return to its starting square after traveling all 64 squares of the chessboard. Nowadays, this problem has become a path optimization problem in graph theory, and has also been developed into a Markov chain problem in discrete state space. Setting the knight in the data space, its step transfer matrix is:
(22) 
where each entry denotes a move between two positions in a given number of steps at a given speed. When the speed is one, we obtain the one-step transfer matrix of the knight's tour. Suppose that the knight begins the tour with a given speed and a step length equal to the path length between two positions. If the policy of the tour is to save path cost, the knight needs to estimate each potential path and uses a given probability to select the subsequent position. Therefore, we propose the probabilistic transfer matrix:
(23) 
where each entry denotes the probability of moving from one position to another. We define it by the ratio of the path length between the two positions to the lengths of all other possible paths. Let the probabilistic transfer matrix be produced by:
(24) 
where the operator denotes the Hadamard product of two matrices. With this operation, each entry denotes the length of the probabilistic transfer path when the current position of the tour is moved from one position to another. Meanwhile, we have
(25) 
This matrix has one entry per position, the probabilistic transfer path length of the tour when the knight is located at that position, and it characterizes the distribution features of the current location of the knight's tour. When the initial position of the tour is set in the central region of the cluster, the knight must spend an expensive cost to leave the cluster because it has multiple directions into which it can move. However, if the knight is set in the boundary region of the cluster, the cost decreases dramatically. Therefore, the tour path within a limited number of steps can intuitively reflect where the tour is, i.e., the boundary or the central region of the cluster. With this policy, we further characterize the step transfer path of each position by probability evaluation:
(26) 
where the index runs over the nearest neighbors of each point, and we call the result the probabilistic tour matrix. The difference between Eq. (25) and Eq. (26) is the tour space of the knight. In Eq. (25), the tour cost of leaving the cluster is calculated and the knight needs to visit all positions. However, the tour cost in the local space already characterizes the distribution features of cluster boundary and core points. Therefore, in Eq. (26) we limit the number of positions of the tour by a local neighborhood size, which turns the global tour into a local one.
Based on the above definitions and analysis, we propose the Geometric Active Learning (GAL) algorithm, whose pseudocode is summarized in Algorithm 1. Steps 4 to 8 use an R-tree to calculate the matrix that stores the NNs of each data point; the time complexity of this searching process approximates that of the R-tree NN search. Then, we calculate the probabilistic tour path of each data point using Eq. (26) and store these values in a matrix. Step 9 sorts the values of this matrix in ascending order. From a geometrical perspective, we divide the cluster into two regions: the outer cluster collection and the inner cluster collection, where the outer cluster collection removes all noises from the data set and the inner cluster collection covers all feasible core points. Therefore, the cluster boundary collection includes the data that belong to the outer collection but not to the inner collection. To implement this process, we set two parameters, named the inner cluster ratio and the outer cluster ratio, to split the data set. Let the sorted column matrix be obtained by sorting the tour matrix in ascending order; Steps 8 to 14 describe this splitting process with the following policies: 1) for any data point, if its probabilistic transfer path length is shorter than the inner threshold, it lies within the inner cluster, and 2) for any data point, if its probabilistic transfer path length is shorter than the outer threshold, it lies within the outer cluster. Finally, Step 15 returns the complement set of the inner collection with respect to the outer collection.
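The conquer step above can be sketched roughly as follows. This is our illustrative reading of Algorithm 1, not its exact pseudocode: brute-force distances stand in for the R-tree search, the summed k-nearest-neighbor distance stands in for the probabilistic tour path of Eq. (26), and `k`, `inner_ratio`, `outer_ratio` are assumed names for the neighborhood size and the inner/outer cluster ratios.

```python
import numpy as np

def gal_boundary(X, k=10, inner_ratio=0.7, outer_ratio=0.95):
    """Sketch of GAL's conquer step: score each point by a local
    knight's-tour cost, then keep the points that fall between the
    inner-cluster and outer-cluster cuts."""
    n = len(X)
    # brute-force pairwise distances (the paper uses an R-tree search)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    knn = np.sort(D, axis=1)[:, 1:k + 1]       # k nearest-neighbor distances
    tour_cost = knn.sum(axis=1)                # proxy for the local tour path length
    order = np.argsort(tour_cost)              # ascending: core -> boundary -> noise
    inner = set(order[:int(inner_ratio * n)])  # inner cluster: feasible core points
    outer = set(order[:int(outer_ratio * n)])  # outer cluster: everything but noise
    return np.array(sorted(outer - inner))     # boundary = outer minus inner
```

On a single Gaussian blob this returns the ring of points between the dense core and the sparsest tail, matching the intuition that boundary points have longer local tours than core points but shorter ones than noise.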
6 Experiments
To demonstrate the effectiveness of our proposed GAL algorithm, in this section we evaluate and compare its performance on cluster boundary detection and AL classification against existing algorithms. The structure of this section is as follows: Sections 6.1 and 6.2 respectively describe the related baselines and the tested data sets, Section 6.3 describes the preprocessing and evaluation, Section 6.4 describes the experimental settings, and Section 6.5 analyzes the results.
6.1 Baselines
For the cluster boundary detection task, the following baselines are compared:

BORDER (BORDER) uses the reverse kNN approach to detect the cluster boundary, based on the assumption that cluster boundary points have fewer reverse kNNs than core points. But its detection results always include all feasible noises, because noises have even fewer reverse kNNs than other data.

BERGE (BERGE) is an iterative cluster boundary detection algorithm which uses evidence accumulation to start the detection, but the error rate increases rapidly when noises are labeled as cluster boundary points by mistake.

The Spinver (Spinver) algorithm, whose inspiration comes from spatial inversion in particle physics, is a high-dimensional cluster boundary algorithm. It uses the Hopkins statistic to capture neighborhood characteristics after smoothing noises with a Euclidean distance-based Gaussian filtering function. But the Hopkins statistic prefers a balanced-class scenario.
For the classification task, several baselines have also been studied and will be compared with GAL:

Random, which uses a random sampling strategy to query unlabeled data, can be applied to any AL task but yields uncertain results.

Margin (Margin), which selects the unlabeled data point with the shortest distance to the classification model, can only be supported by the SVM classification model.

Hierarchical (Hiera) sampling is a very different idea compared to many existing AL approaches. It labels a subtree with the root node's label when the subtree meets the objective probability function. But incorrect labeling leads to a very bad classification result.

TED (TED) favors data points that are on the one hand hard to predict and on the other hand representative of the rest of the data.

Reactive (Reactive) learning finds the data point which has the maximum influence on the future prediction result after annotating the selected data. This novel idea does not need to query the Oracle when relabeling, but needs a well-trained classification model at the beginning. Furthermore, its reported approach cannot be applied to multi-class classification problems.
6.2 Data sets
We synthesized and collected emulated and benchmark data sets for the experiments described in this section; they are detailed as follows.
For the cluster boundary detection task, two clustering data sets, named Aggregation and Flame, are used to show the concept of cluster boundary points. Four other classical clustering data sets, Syn1 to Syn4, are tested in the boundary detection experiment. Below, the notation N×d denotes that a data set has N samples with d dimensions.

Syn1: 5400×2. The clusters are surrounded by a lot of noises.

Syn2: 4800×2. The circle cluster is embedded in the annulus cluster and a lot of noises connect them.

Syn3: 7832×2. There are two connected diamond clusters with multiple densities.

Syn4: 5034×2. A lot of noises connect the different clusters.
The following datasets are realworld medical data sets.

Biomed (http://lib.stat.cmu.edu/datasets/): 209×4. Medical data set. It has 134 normal objects and 75 virus-infected objects. 30 virus carriers among the normal objects are defined as the cluster boundary of normal people.

Cancer (Spinver): 240×2. Medical data set. It has 241 malignant tumor objects and 75 benign tumor objects. 37 benign tumor objects which may become malignant tumor patients are cluster boundary objects of normal people.

Colon (http://genomicspubs.princeton.edu/oncology/affydata/): 240×2. Gene data set. 7 cluster boundary points.

Prostate (Spinver): 240×2. Gene data set. 18 cluster boundary objects.
There are two image data sets from the target tracking field (http://research.microsoft.com/enus/um/people/jckrumm/wallflower/testimages.htm), and we will use our GAL algorithm to capture the moving targets.

Waving Trees: 287×160. This comes from continuous monitoring of one building, including 7 captured images in which a volunteer passes through the monitored area.

Moved Object: 1745×160. This comes from continuous monitoring of one office, including 363 captured images in which a volunteer enters the office and leaves after staying some time.
There is also one subset of the Basel Face Model (https://faces.dmi.unibas.ch/bfm/) relating to the light test.

Basel Face Model: This is a popular 3D face model data set covering multiple gestures and color changes. The light subset has 4488 images, all stored with 500×500 pixels. We use the GAL algorithm to detect images with strong or dark light, since only normal-light images are useful in most real-world cases.
For the AL classification task, we compare the best classification results of the different algorithms on some classical clustering data sets (http://cs.joensuu.fi/sipu/datasets/) and the letter recognition data set letter.

g2-2-30: 2048×2. There are 2 adjacent classes in the data set.

Flame: 240×2. It has 2 adjacent classes with similar densities.

Jain: 373×2. It has two adjacent classes with different densities.

Pathbased: 300×2. Two clusters are close and surrounded by an arc cluster.

Spiral: 312×2. There are three spiral-curve clusters which are linearly inseparable.

Aggregation: 788×2. There are 7 adjacent classes in the data set.

R15: 600×2. There are 7 separate clusters and 8 adjacent classes.

D31: 3100×2. It has 31 adjacent classes.

letter: 20000×16. It is a classical letter recognition data set with the 26 English letters. We select 5 pairs of letters which are difficult to distinguish from each other to test the above AL algorithms in a two-class setting: DvsP, EvsF, IvsJ, MvsN, and UvsV. For the multi-class test, we select AD, AH, AL, AP, AT, AX, and AZ, where AD is the letter set A to D, AH is the letter set A to H, … , and AZ is the letter set A to Z. The seven multi-class sets have 4, 8, 12, 16, 20, 24, and 26 classes, respectively.
In addition to the introduction of the tested data sets, all two-dimensional data sets are shown in Figure 4.
6.3 Preprocessing and Evaluation
The preprocessing methods used in this paper are reported in this section. The application cases are: preprocessing methods (a) and (b) are used for the Colon and Prostate data sets, respectively, since compressing the large value domain accelerates the calculation and reduces memory consumption; preprocessing (c) changes the image type to a numerical type and is used for Waving Trees, Moved Object and the Basel Face Model. Here we detail the specific methods:
(a) the value of each dimension of each data point is divided by ;
(b) the value of each dimension of each data point is divided by ;
(c) for each image, read the grayscale matrix and compress it into a single-column matrix (i.e., with a size of ) with the average grayscale values.
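Preprocessing method (c) can be sketched as follows — a minimal illustration in Python/NumPy rather than the MATLAB used in our experiments, and the row-wise averaging direction is an assumption:

```python
import numpy as np

def compress_grayscale(img):
    """Compress a 2-D grayscale matrix into a single-column matrix by
    averaging grayscale values (here one mean per row; the averaging
    direction in the paper's preprocessing is an assumption)."""
    img = np.asarray(img, dtype=float)
    return img.mean(axis=1, keepdims=True)  # shape (rows, 1)

# A 3x4 "image": each row collapses to its mean grayscale value.
demo = [[0, 0, 0, 0],
        [64, 64, 64, 64],
        [255, 255, 255, 255]]
col = compress_grayscale(demo)
# col is [[0.], [64.], [255.]]
```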
For the cluster boundary detection problem, we use the F1 score to evaluate the detection result. This is a popular evaluation function in information retrieval that considers both precision and recall. Because the cluster boundary detection task is also a retrieval problem, we use it to evaluate our results. For the classification problem, we use accuracy.
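The three metrics follow directly from the counts reported in Table 2: precision is Num.C/Num.det, recall is Num.C/Real.boun, and F1 is their harmonic mean. A minimal sketch (the function name is ours) reproducing the Syn1/GAL row:

```python
def boundary_f1(real_boun, num_det, num_c):
    """Precision/recall/F1 for a cluster boundary detection result.

    real_boun -- number of true boundary points in the data set
    num_det   -- number of points the detector reported as boundary
    num_c     -- number of reported points that are true boundary points
    """
    precision = num_c / num_det    # fraction of detections that are correct
    recall = num_c / real_boun     # fraction of true boundary points found
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Syn1 / GAL row of Table 2: 1077 true boundary points,
# 1043 detections, 996 of them correct.
p, r, f = boundary_f1(1077, 1043, 996)
# p ≈ 0.9549, r ≈ 0.9248, f ≈ 0.9396
```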
6.4 Experimental setting
We discuss the experimental setting of the compared algorithms over the synthetic and real data sets in this section.

Figure 5 marks the cluster boundary points on Syn1 and Syn2 to illustrate the definition of cluster boundary points.

Table 2 reports the best cluster boundary detection results on different synthetic and real data sets. The highest score in each group of experiments is marked.

Figure 6(a) shows the cluster boundary detection result on the light subset of the Basel Face Model. To compare with the detected cluster boundary images, we also show the detection results for the core points using GAL in Figure 6(b).

Table 3 shows the classification results on some synthetic data sets. The specific experimental settings are as follows: (1) we use the MATLAB random function to implement the Random algorithm and calculate the mean and STD values after running it 100 times; (2) as the Margin, Hierarchical and Reactive algorithms all need labeled data points to guide the training process, we select one data point from each class and query the Oracle, respectively. Similarly, we test these algorithms 100 times and then calculate the mean and STD values; this seeding guarantees that the labeled set covers all the label kinds of the Oracle, since the algorithms show poorer performance under purely random selection. (3) There are two important parameters for the TED algorithm: the kernel function parameter and the regularization parameter of the kernel ridge regression. We use a fixed kernel parameter of 1.8 to generate the kernel matrix and train the regularization parameter over 0.01:0.01:1, because this parameter provides important guidance for the sampling selection; after testing it many times, we limited it to a correct and stable range. (4) For our GAL algorithm, we train the parameters from 2:1: and the boundary upper bound over :1:N to record the classification results. Because this threshold segments the core points and the boundary points, we begin the training from a starting value based on the conclusion that at least 70% of the N data points are core points, which comes from our published paper (Spinver) and experience. The classifier trained in the classification experiments is LIBSVM (chang2011libsvm).
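The repeated-run protocol above (run a stochastic sampler 100 times, report mean and STD of the resulting accuracies) can be sketched as follows; the evaluation callback here is a hypothetical placeholder, not the actual classifier pipeline:

```python
import random
import statistics

def mean_std_over_runs(evaluate, runs=100, seed=0):
    """Score a stochastic AL baseline by running it `runs` times and
    reporting the mean and standard deviation of its accuracies,
    mirroring how the Random/Margin/Hierarchical results are tabulated."""
    rng = random.Random(seed)
    scores = [evaluate(rng) for _ in range(runs)]
    return statistics.mean(scores), statistics.pstdev(scores)

# Placeholder evaluation: a hypothetical accuracy drawn from [0.5, 1.0).
mean, std = mean_std_over_runs(lambda rng: 0.5 + 0.5 * rng.random())
```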
Datasets  Dimension  Algorithm  Real.boun  Num.det  Num.C  Precision  Recall  F1

Syn1  2  BORDER  1077  1252  831  0.6637  0.7716  0.7136 
BERGE  1250  940  0.7520  0.8728  0.8079  
Spinver  1049  993  0.9466  0.9220  0.9341  
GAL  1043  996  0.9549  0.9248  0.9396  
Syn2  2  BORDER  1204  1802  1089  0.6043  0.9045  0.7246 
BERGE  1456  1098  0.7541  0.9120  0.8256  
Spinver  1264  1111  0.8790  0.9228  0.9003  
GAL  1163  1040  0.8942  0.9302  0.9118  
Syn3  2  BORDER  640  723  540  0.7469  0.8438  0.7924 
BERGE  662  532  0.8036  0.8313  0.8172  
Spinver  611  542  0.8871  0.8469  0.8665  
GAL  632  580  0.9177  0.9063  0.9120  
Syn4  2  BORDER  538  669  445  0.6366  0.8271  0.7195 
BERGE  553  472  0.8535  0.8773  0.8652  
Spinver  540  482  0.8926  0.8959  0.8942  
GAL  540  496  0.9185  0.9219  0.9202  
Biomed  4  BORDER  30  26  23  0.8846  0.7667  0.8214 
BERGE  27  24  0.8889  0.8000  0.8421  
Spinver  29  27  0.9310  0.9000  0.9153  
GAL  29  28  0.9655  0.9333  0.9491  
Cancer  10  BORDER  37  37  28  0.7568  0.7568  0.7568 
BERGE  37  30  0.8108  0.8108  0.8108  
Spinver  35  34  0.9714  0.9189  0.9444  
GAL  36  35  0.9722  0.9459  0.9589  
Colon  2000  BORDER  7  7  7  1.0000  1.0000  1.0000  
BERGE  6  5  0.8333  0.7143  0.7692  
Spinver  7  7  1.0000  1.0000  1.0000  
GAL  7  7  1.0000  1.0000  1.0000  
Prostate  10,509  BORDER  18  19  18  0.9474  1.0000  0.9730  
BERGE  17  16  0.9412  0.8889  0.9143  
Spinver  18  18  1.0000  1.0000  1.0000  
GAL  18  18  1.0000  1.0000  1.0000  
Waving Trees  160  BORDER  17  17  17  1.0000  1.0000  1.0000  
BERGE  17  15  0.8824  0.8824  0.8824  
Spinver  17  17  1.0000  1.0000  1.0000  
GAL  17  17  1.0000  1.0000  1.0000  
Moved Object  160  BORDER  363  363  222  0.6116  0.6116  0.6116  
BERGE  363  250  0.6887  0.6887  0.6887  
Spinver  363  222  0.6116  0.6116  0.6116  
GAL  363  352  0.9697  0.9697  0.9697 
Data sets  Num_C  Algorithms  Number of queries (percentage of the data set)  
1%  5%  10%  15%  20%  30%  40%  50%  60%  
Biomed  2  Random  .516±.026  .546±.012  .603±.028  .652±.029  .693±.031  .767±.026  .815±.026  .849±.021  .881±.022 
Margin  .500±.000  .509±.015  .551±.047  .590±.076  .644±.103  .709±.153  .822±.139  .882±.161  .927±.188  
Hierarchical  .504±.000  .550±.000  .585±.000  .615±.000  .668±.000  .774±.014  .847±.000  .920±.011  .974±.000  
TED  .610±.000  .619±.009  .651±.003  .759±.006  .848±.007  .875±.005  .901±.005  .964±.005  .972±.000  
Reactive  -  -  -  -  -  -  -  -  -  
GAL  .724±.163  .725±.022  .790±.021  .825±.018  .886±.012  .909±.013  .927±.011  .994±.008  1.00±.000  
Cancer  2  Random  .516±.026  .546±.012  .603±.028  .652±.029  .693±.031  .767±.026  .815±.026  .849±.021  .881±.022 
Margin  .500±.000  .509±.015  .551±.047  .590±.076  .644±.103  .709±.153  .822±.139  .882±.161  .927±.188  
Hierarchical  .504±.000  .550±.000  .585±.000  .615±.000  .668±.000  .774±.014  .847±.000  .920±.011  .974±.000  
TED  .610±.000  .619±.009  .651±.003  .759±.006  .848±.007  .875±.005  .901±.005  .964±.005  .972±.000  
Reactive  -  -  -  -  -  -  -  -  -  
GAL  .724±.163  .725±.022  .790±.021  .825±.018  .886±.012  .909±.013  .927±.011  .994±.008  1.00±.000  
g2-2-30  2  Random  .516±.026  .546±.012  .603±.028  .652±.029  .693±.031  .767±.026  .815±.026  .849±.021  .881±.022 
Margin  .500±.000  .509±.015  .551±.047  .590±.076  .644±.103  .709±.153  .822±.139  .882±.161  .927±.188  
Hierarchical  .504±.000  .550±.000  .585±.000  .615±.000  .668±.000  .774±.014  .847±.000  .920±.011  .974±.000  
TED  .610±.000  .619±.009  .651±.003  .759±.006  .848±.007  .875±.005  .901±.005  .964±.005  .972±.000  
Reactive  .506±.008  .531±.029  .554±.052  .593±.065  .634±.058  .744±.060  .715±.047  .811±.000  .816±.000  
GAL  .724±.163  .725±.022  .790±.021  .825±.018  .886±.012  .909±.013  .927±.011  .994±.008  1.00±.000  
Flame  2  Random  .670±.142  .794±.106  .904±.059  .944±.036  .958±.025  .976±.014  .984±.008  .987±.005  .990±.006 
Margin  .499±.137  .596±.102  .740±.162  .872±.158  .930±.159  .935±.145  .961±.120  .963±.109  .944±.165  
Hierarchical  .720±.041  .607±.042  .855±.062  .972±.010  .999±.000  1.00±.000  1.00±.000  1.00±.000  1.00±.000  
TED  .829±.000  .950±.006  .974±.006  .988±.006  .991±.000  .995±.001  .996±.002  .996±.002  .998±.000  
Reactive  .553±.154  .804±.120  .917±.090  .966±.045  .974±.045  .993±.006  .993±.027  .996±.004  .997±.004  
GAL  .887±.004  .976±.008  .983±.005  .988±.004  .991±.002  .995±.002  1.00±.000  1.00±.000  1.00±.000  
Jain  2  Random  .659±.180  .773±.042  .816±.041  .848±.041  .881±.040  .928±.028  .958±.024  .974±.015  .981±.015 
Margin  .258±.003  .270±.074  .382±.211  .545±.306  .572±.310  .627±.347  .623±.340  .721±.347  .736±.352  
Hierarchical  .325±.013  .295±.008  .297±.010  .636±.022  .873±.024  1.00±.000  1.00±.000  1.00±.000  1.00±.000  
TED  .739±.000  .764±.006  .837±.018  .932±.019  .978±.018  .998±.002  1.00±.000  1.00±.000  1.00±.000  
Reactive  .666±.163  .748±.036  .791±.027  .836±.041  .899±.045  .994±.022  .998±.008  1.00±.000  1.00±.000  
GAL  .768±.007  .915±.026  .963±.018  .977±.013  .989±.009  1.00±.000  1.00±.000  1.00±.000  1.00±.000  
Pathbased  3  Random  .447±.157  .533±.089  .719±.096  .833±.063  .891±.046  .940±.046  .958±.016  .969±.014  .976±.010 
Margin  .366±.000  .368±.016  .407±.087  .481±.151  .686±.230  .875±.209  .960±.151  .962±.148  .988±.081  
Hierarchical  .488±.027  .500±.017  .547±.024  .717±.028  .749±.023  .861±.022  .949±.015  .970±.013  1.00±.000  
TED  .356±.000  .582±.023  .875±.032  .933±.008  .941±.005  .987±.009  .997±.002  1.00±.000  1.00±.000  
Reactive  -  -  -  -  -  -  -  -  -  
GAL  .748±.004  .811±.048  .920±.038  .950±.019  .959±.012  1.00±.000  1.00±.000  1.00±.000  1.00±.000  
Spiral  3  Random  .352±.023  .493±.049  .634±.061  .757±.059  .830±.051  .918±.034  .955±.024  .977±.017  .988±.011 
Margin  .337±.005  .344±.015  .408±.062  .513±.101  .630±.144  .893±.180  .964±.119  .965±.126  .990±.034  
Hierarchical  .380±.024  .486±.044  .498±.046  .525±.062  .627±.044  .653±.048  .770±.055  .774±.062  .865±.039  
TED  .355±.000  .678±.011  .751±.039  .828±.039  .896±.003  .920±.002  .960±.000  .990±.003  .998±.000  
Reactive  -  -  -  -  -  -  -  -  -  
GAL  .427±.017  .685±.090  .830±.097  .872±.082  .919±.063  .963±.038  .990±.021  .998±.006  1.00±.000  
Aggregation  7  Random  .339±.101  .583±.062  .775±.047  .868±.031  .923±.023  .972±.013  .987±.006  .993±.003  .996±.000 
Margin  .215±.000  .355±.092  .707±.153  .964±.098  .995±.044  1.00±.000  1.00±.000  1.00±.000  1.00±.000  
Hierarchical  .471±.038  .578±.016  .651±.009  .695±.010  .961±.009  .987±.005  .990±.005  .992±.003  .997±.000  
TED  .379±.002  .646±.019  .948±.009  .968±.001  .999±.001  1.00±.000  1.00±.000  1.00±.000  1.00±.000  
Reactive               