A Divide-and-Conquer Approach to Geometric Sampling for Active Learning


Xiaofeng Cao* Advanced Analytics Institute, University of Technology Sydney
Email: xiaofeng.cao@student.uts.edu.au.
Address: 2 Blackfriars St, Chippendale NSW 2008
Phone: +61 0481126436.
Abstract

Active learning (AL) repeatedly trains a classifier under a minimal labeling budget to improve the current classification model. The training process is usually guided by an uncertainty evaluation strategy. However, uncertainty evaluation suffers from performance degeneration when the initial labeled set contains insufficient labels. To completely eliminate the dependence of AL on uncertainty evaluation sampling, this paper proposes a divide-and-conquer idea that directly transfers AL sampling into geometric sampling over the clusters. By dividing the points of each cluster into cluster boundary points and core points, we theoretically discuss their margin distances and hypothesis relationship. Exploiting the advantages of cluster boundary points in these two properties, we propose a Geometric Active Learning (GAL) algorithm based on the knight’s tour. Experimental studies on two tasks, cluster boundary detection and AL classification, show that the proposed GAL method significantly outperforms the state-of-the-art baselines.

keywords:
Active learning, uncertainty evaluation, geometric sampling, cluster boundary.
Footnote: This manuscript was independently finished by Xiaofeng Cao when he was a PhD candidate with the Advanced Analytics Institute, University of Technology Sydney. He is now pursuing the Ph.D. degree with the Centre for Artificial Intelligence, University of Technology Sydney. He has changed his research interests from data mining to learning theory, including PAC learning, agnostic active learning, and generalization theory.
This manuscript was finally accepted by the Expert Systems with Applications journal.

1 Introduction

Active learning (Activelearning) was developed to further improve prediction accuracy in supervised learning problems that lack sufficient labels. It has been widely applied in various learning scenarios where unannotated data are abundant but annotating them is expensive and time-consuming, such as semi-supervised text classification (Text), image annotation (Image), transfer learning (Transfer1), etc. Generally, the proposed AL algorithms focus on constructing an uncertainty evaluation function that guides the subsequent sampling, such as (Uncertaintysampling), (ERR), etc. However, the label diversity and distribution features of the initial labeled set determine the performance of the uncertainty evaluation process. When the initial labeled set contains only a few labeled data, performance degeneration of the subsequent sampling is inevitable.

Figure 1: Motivation of our active learning work. (a) Original data space. (b) Cluster core points. (c) Cluster boundary points. In each sub-figure, the black line denotes the SVM classification model generated from the data points shown: (a) training on the original data space, (b) training on the cluster core points, (c) training on the cluster boundary points. We observe that the classification lines generated in (c) are similar to the models of (a) and (b).

Geometric sampling shows its power in various domains such as fast SVM training (CVM1), the Bayesian adversarial spheres algorithm (bekasov2018bayesian), geometric deep learning (fey2018splinecnn), etc. Especially for large-scale classification, the Core Vector Machine (CVM) (CVM2) reformulated SVM training as a minimum enclosing ball (MEB) problem, which is popular in hard-margin support vector data description (SVDD) (SVDD), and then iteratively calculated the ball center and radius within a (1+ε) approximation. In this process, the cluster boundary points located on the surface of each MEB are added to a special data collection called the core set. Trained on the detected core sets, the proposed CVM performed faster than the SVM and needed fewer support vectors. In particular, with the Gaussian kernel, a fixed radius was used to simplify the MEB problem to the EB (Enclosing Ball) problem, which accelerated the calculation of the Ball Vector Machine (BVM) (BVM). Without sophisticated heuristic searches in the kernel space, the model trained on points of the high-dimensional ball surface can still approximate the optimal solution.

In this paper, we are motivated by the advantages of the boundary points of CVM and propose a divide-and-conquer approach to geometric sampling for AL (see Figure 1). Under the MEB model, we divide the data of each class into two types: cluster boundary points and core points. In the geometric description, cluster boundary points are located at the surface of a cluster and core points are distributed inside the cluster. To study the properties of the two types of points, we compare them in two respects: margin distance (w.r.t. Lemma 1) and hypothesis relationship (w.r.t. Lemma 2). The conclusion shows that, from a geometrical perspective, cluster boundary points play a more important role than core points in the construction of the classification hyperplane.

Our conquer step is to obtain the cluster boundary points. By setting a knight in the geometric space, the path disagreement of its tour helps us distinguish cluster boundary points from core points. We assume the tour path is decided by the update process of traversing the 1st to the k-th nearest neighbors (kNN) of the current tour position (data point). Their geometric disagreement in path length becomes the key of our detection method, i.e., the average tour path of boundary points is longer than that of the core points. With the above divide-and-conquer analysis, we finally propose a Geometric Active Learning (GAL) algorithm that trains on the geometric cluster boundary points. The contributions of this paper are described as follows.

  • We propose a divide-and-conquer idea for geometric AL sampling, which transfers the uncertain sampling space of AL into the set of cluster boundary points.

  • We provide geometric insights for incorporating cluster boundary points in AL under the assumption of geometric classification.

  • An AL algorithm termed GAL is developed in this paper. It samples independently, without iteration and without help from labeled data.

  • We break the theoretical curse of uncertainty evaluation sampling with the GAL algorithm, since it is neither a model-based nor a label-based strategy and has fixed time and space complexities.

  • Extensive experiments are conducted to verify that GAL can be applied in multi-class settings to overcome the binary classification limitation of many existing AL approaches.

The remainder of this paper is structured as follows. The related work is reviewed in Section 2. The preliminaries are described in Section 3 and the geometric insights on cluster boundary points in AL are presented in Section 4. The divide-and-conquer approach based on the knight’s tour is presented in Section 5. The experiments and results are reported in Section 6. The discussion is presented in Section 7. Finally, we conclude this paper in Section 8.

2 Related Work

In this section, we present the related work on active learning and cluster boundary research.

2.1 Active learning

The learning goal of AL is to obtain a desired error rate with as few annotated queries as possible. To improve the performance of the current classification model, the AL learner is allowed to pick a subset from an unlabeled data pool. Those data which may strongly affect the subsequent update of the learning model are the primary targets of the learner. As a policy, accessing the unlabeled data pool for sampling and querying the true labels within a given budget is allowed. However, all learners face an awkward and difficult situation: how to quickly select the desired data from the massive unlabeled data in the pool.

To resolve the above challenge, uncertainty evaluation (Uncertaintysampling) was proposed to guide AL by selecting the most informative or representative instances under a given sampling scheme or distribution assumption, such as margin (Margin), uncertainty probability (ERR), maximum entropy (Entropy), confused votes by committee (Vote), etc. For example, (Margin) proposes to select the data point nearest to the current classification hyperplane, (ERR) selects the data point which maximizes the change of the error rate, and (Entropy) selects the data point with the maximum entropy of the prediction probability. Basically, these uncertainty-based AL algorithms aim to reduce the number of queries or to make the classifier converge quickly. Accompanied by multiple iterations, querying stops when the defined sampling number is met or a satisfactory model is found. Thus, although this technique performs well, these algorithms still need to traverse the whole data set repeatedly within this framework. They also suffer from one main limitation: heuristically searching the whole data space for the optimal sampling subset is impossible because of the unpredictable scale of the candidate set.
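To make this family of strategies concrete, the following is a minimal sketch of entropy-based uncertainty sampling (an illustration of the baselines above, not the method proposed in this paper); the scikit-learn-style predict_proba interface and all names are our assumptions.

```python
import numpy as np

def entropy_uncertainty_query(model, X_pool, batch_size=1):
    """Pick the unlabeled points whose predicted class distribution has the
    highest entropy, i.e., the most uncertain ones under the current model."""
    proba = model.predict_proba(X_pool)                        # shape (n, n_classes)
    entropy = -np.sum(proba * np.log(proba + 1e-12), axis=1)   # per-point entropy
    return np.argsort(entropy)[-batch_size:]                   # indices to query next
```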

In practice, incorporating unsupervised learning into the sampling process shows powerful advantages, such as (Pre-clustering) (Pre-clustering2) (Pre-clustering3), and makes it possible for the learner to overcome the previous limitation. One classical method (Hiera) performs hierarchical clustering before sampling to improve the lower bound of the subsequent training performance. By setting up a probability condition, the learner is allowed to confidently annotate a number of subtrees with the label of the root node. When the clustering structure is perfect, this benefits the sampling. However, improper clustering results will mislead the annotation process, and performance degeneration of the subsequent sampling is then inevitable.

2.2 Cluster boundary

Cluster boundary points are a set of special objects distributed in the margin regions of each cluster. Their labels are given by the cluster structure and guide the clustering partition. However, those label assignments are uncertain. The practical advantages of cluster boundaries have been widely exploited in latent virus carrier detection (BERGE), abnormal gene segment diagnosis (Spinver), etc.

With prior experience from clustering algorithms, researchers first studied the cluster boundary detection problem in low-dimensional space and proposed a series of approaches, such as (BORDER) (BRIM) (BERGE), etc. Among these algorithms, BORDER first defines cluster boundary points by measuring the density of their nearest neighbors and uses reverse kNN to obtain the complete set of boundary points, but together with all the noises. To smooth the influence of noises, (BRIM) proposes a detection algorithm termed BRIM that analyzes the balance property of the data distributed inside and outside the cluster. Because the extracted features live in a low-dimensional space, this algorithm can only be applied in two-dimensional space. Moreover, the task of detecting cluster boundary objects in high-dimensional clusters was first studied in (Spinver) by utilizing particle space inversion and the Hopkins statistic. However, the devised Euclidean Gaussian filter function cannot work well in very high-dimensional space because of the uncertainty of noises in the sparse distribution.

Notation Definition
classifiers
prediction error rate of when training
data set
data number of
number of labeled, unlabeled, queried data
label set
a data point in
labeled data points in
queried data points in
training set after querying
distance function
core points
cluster boundary points
noises
training set of []
core points located inside the positive class
core points located inside the negative class
cluster boundary points located near
noises
core points
boundary points
noises
approximation statement
assignment statement in algorithm
Table 1: A summary of notations

3 Preliminary

In this section, we first define AL sampling by a family of linear functions. Then, we define the cluster boundary and core points by a group of density functions. Related definitions, main notations and variables are briefly summarized in Table 1.

Given a data set X representing the data space, where each x ∈ X, and the label space Y, consider the classification hypothesis:

h(x) = w^T x + b,    (1)

where w is the parameter vector and b is the constant term. Based on this hypothesis, we give the following definitions:

Definition 1.

Active learning. Optimize the hypothesis to obtain the minimum RSS (residual sum of squares) (TED) (LLR):

(2)

i.e.,

(3)

where the three sets denote the labeled data, the queried data, and the updated training set, respectively.
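Since the symbols of Eqs. (2) and (3) are not reproduced above, one standard way to write this RSS objective, using assumed notation L for the labeled set, Q for the queried set, and h as in Eq. (1), is:

\min_{w,b} \; \sum_{(x_i, y_i) \in L \cup Q} \bigl( y_i - h(x_i) \bigr)^2 .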

Definition 2.

Cluster boundary point (BORDER).
A boundary point is an object that satisfies the following conditions:
1. It is within a dense region.
2. There exists a region near it whose density is significantly lower than that of the dense region, or which contains no objects.

Definition 3.

Core point. A core point is an object that satisfies the following conditions:
1. It is within a dense region.
2. Any expanded region based on it remains within the dense region.

4 Geometric Insights

In clustering-based AL work, core points provide little help for the parameter training of classifiers. Considering that cluster boundary points may provide decisive factors for the support vectors, CVM and BVM iteratively use the points distributed on the surface of an enclosing ball to quickly train core support vectors on large-scale data sets. Their significant success motivates the work of this paper.

To further show the importance of cluster boundary points, we (1) clarify the performance of training on cluster boundary points in Section 4.1, (2) discuss the margin distances of boundary and core points to the classification line or hyperplane in Section 4.2, and (3) analyze the hypothesis relationship when training on boundary and core points in Section 4.3, where (2) and (3) are discussed for binary and multi-class classification in low- and high-dimensional spaces.

4.1 Performance of cluster boundary

In this section, we propose a geometrical perspective that the performance of the classification model is determined by the cluster boundary points. Our main theoretical result is summarized as follows.

Proposition 1.

Suppose that a set of core points and a set of cluster boundary points are respectively drawn from a fixed geometrical cluster, and let their union form the whole training set. Let one classification hypothesis be trained on the union and another classification hypothesis be trained on the cluster boundary points alone. The following holds for their generalized error disagreement:

(4)

where ≈ denotes the approximation symbol.
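Since the symbols of Eq. (4) are not reproduced above, a plausible reading consistent with the surrounding text (using assumed notation h_{B∪C} for the hypothesis trained on the union, h_B for the hypothesis trained on the boundary points alone, and err(·) for the generalization error) is:

err\bigl(h_{B \cup C}\bigr) \;\approx\; err\bigl(h_{B}\bigr).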

Our main theoretical result in Proposition 1 claims that the core points, distributed inside the central regions of any cluster, have little influence on training a desired hypothesis h. To demonstrate our insights, Lemma 1 and Lemma 2 provide theoretical support from different geometrical views: Lemma 1 proves that cluster boundary points have shorter margin distances to the geometric classification line or hyperplane than core points, and Lemma 2 proves that the models trained on core points are a subset of the models trained on the boundary points. In the following subsections, we present the detailed proofs of the two lemmas in binary and multi-class settings of low- and high-dimensional spaces.

4.2 Margin distance

Margin distance measures the distance from a data point to the classification line or hyperplane, and we denote it by a margin distance function. The margin distance relations of boundary points and core points are described in the following lemma.
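For a linear hypothesis h(x) = w^T x + b as in Eq. (1), the standard point-to-hyperplane form of such a margin distance (the notation is ours, since the original symbols are not reproduced) is:

d(x, h) = \frac{\lvert w^{\top} x + b \rvert}{\lVert w \rVert_2}.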

Lemma 1.

Suppose that a set of core points and a set of cluster boundary points are respectively drawn from a fixed geometrical cluster, and let the margin distance function be as above. The margin distances of the boundary points are shorter than those of the core points distributed in their local geometrical space, i.e.,

(5)

Lemma 1 is supported by Corollaries 1 to 3, which cover different cases:

  • Corollary 1: Lemma 1 holds in binary classification of low-dimensional space, where Corollaries 1.1 and 1.2 prove it for adjacent classes and well-separated classes, respectively.

  • Corollary 2: Lemma 1 holds in multi-class classification of low-dimensional space.

  • Corollary 3: Lemma 1 holds in high-dimensional space.

We now present detailed proofs for the above corollaries.

Corollary 1.

Lemma 1 holds in binary classification of low-dimensional space.

Two facts hold in classification: (1) data points far from h usually have clearly assigned labels with a high predicted class probability; (2) h is always surrounded by noises and by a part of the boundary points. Based on these facts, the proof is as follows.

Figure 2: (a) An example of adjacent classes in two-dimensional space, where the black line denotes a linear classification hypothesis, the red diamonds denote samples of Class 1, and the blue squares denote samples of Class 2; two core points and two cluster boundary points are marked. This panel illustrates Eq. (7). (b) An example in the binary classification problem, illustrating Eq. (11). (c) An example of well-separated classes in two-dimensional space, illustrating Eq. (10). (d) An example of segmenting the cluster boundary points in the multi-class classification problem, with the number of segments equal to 6.

Corollary 1.1: Lemma 1 holds in adjacent classes of low-dimensional space.

Proof.

Given any adjacent-class scenario with binary labels ({-1,+1}), such as Figure 2(a), let one set denote the core points located inside the positive class, another denote the core points located inside the negative class, a third denote the cluster boundary points near h, and the last denote the noises near h. The RSS analysis in such a classification scenario satisfies:

(6)

where the coefficients denote the numbers of the four types of points. In most classification problems, noises give wrong guidance to model training. We therefore focus only on the differences between the core and boundary points, that is to say,

(7)

where the right-hand side denotes a constant. In this space, the margin distance function between a data point and the hypothesis can be generalized as

(8)

Considering the form of the classifier function, we conclude that the boundary points have shorter margin distances than the core points. Then, Lemma 1 is as stated under this condition (see Figure 2(b)). ∎

Corollary 1.2: Lemma 1 holds in well-separated classes of low-dimensional space.

Proof.

In the well-separated classes setting (see Figure 2(c)), the model trained on any data points will lead to a strong classification result; that is to say, all AL approaches will perform well in this setting since:

(9)

where one set denotes the cluster boundary points near h in the positive class and the other denotes the cluster boundary points near h in the negative class. Under this setting, the results of Eq. (8) and Eq. (9) still hold. ∎

Corollary 2.

Lemma 1 holds in multi-class classification of low-dimensional space.

Proof.

In this setting, we have multiple classes and a set of classifiers, and the cluster boundary points are segmented into parts, where each part denotes the data points close to one classifier (see Figure 2(d)). Based on the result of Corollary 1, dividing the multi-class classification problem into binary classification problems, we can obtain:

(10)

and

(11)

where the additional set represents the core points near each classifier. Then, the following holds:

(12)

Corollary 3.

Lemma 1 holds in high-dimensional space.

Proof.

In a high-dimensional space, the distance function between a data point and the hyperplane can be extended as

(13)

where the parameter is an m-dimensional vector. Because the above equation is the m-dimensional extension of Eq. (9), the proof for the low-dimensional space remains valid in the high-dimensional space. ∎

Figure 3: (a) An example in one-dimensional space with two point classifiers. (b) An example in two-dimensional space.

4.3 Hypotheses relationship

Lemma 2 describes the relationship between the hypotheses generated from the boundary points and those generated from the core points.

Lemma 2.

Suppose that a set of core points and a set of cluster boundary points are respectively drawn from a fixed geometrical cluster. Let one hypothesis be trained on the core points and another hypothesis be trained on the cluster boundary points. The following holds:

(14)

It shows that models trained on the boundary points can predict the core points well, but a model trained on the core points may sometimes fail to predict the boundary points. To prove this relation, we discuss three different cases:

  • Corollary 4: Lemma 2 holds in binary classification of low-dimensional space, where Corollary 4.1 and Corollary 4.2 prove it in one-dimensional space and two-dimensional space, respectively.

  • Corollary 5: Lemma 2 holds in binary classification of high-dimensional space.

  • Corollary 6: Lemma 2 holds in multi-class classification.

Corollary 4.

Lemma 2 holds in binary classification of low-dimensional space.

This corollary is supported by two different views in Corollary 4.1 and Corollary 4.2.

Corollary 4.1: Lemma 2 holds in linear one-dimensional space.

Proof.

Given the point classifiers in the linear one-dimensional space described in Figure 3(a),

(15)

where the points involved are core points. In comparison, the boundary points have smaller distances to the optimal classification model. Therefore, it is easy to conclude that classifying the core points by the hypothesis trained on the boundary points is successful, but we cannot classify the boundary points by the hypotheses trained on the core points. ∎

Corollary 4.2: Lemma 2 holds in two-dimensional space.

Proof.

Given two core points in the two-dimensional space, the line segment between them is described as follows:

(16)

Training the two core points obtains the following classification hypotheses:

(17)

where the angle is measured between the trained hypothesis and the line segment (see Figure 3(b)).

Similarly, the classifier trained on the two boundary points is subject to:

(18)

where the second line segment is the one between the two boundary points. Intuitively, the difference between the two classifiers lies in their constraint equations. From this, we can conclude:

(19)

This shows that the classifier trained on the core points cannot classify the boundary points for some angles in the constraint equation, whereas for any angle the classifier trained on the boundary points classifies correctly. ∎

Corollary 5.

Lemma 2 holds in high-dimensional space.

Proof.

Given two core points, a bounded hyperplane between them is:

(20)

Training the two data points yields the following classifier:

(21)

where the angle is measured between the trained classifier and the normal vector of the bounded hyperplane. Given a point located on the bounded hyperplane, if the trained classifier places it in the positive class while its counterpart falls in the negative class, the classifier cannot predict the two points correctly. It can also be described as follows: if the trained classifier segments the bounded hyperplane between the pairs of points, it cannot classify them. Then Lemma 2 is as stated. ∎

Corollary 6.

Lemma 2 holds in the multi-class classification setting.

Proof.

Following the multi-class classification proof of Lemma 1, the multi-class problem can be segmented into a set of binary classification problems. ∎

5 Geometric Active Learning by Knight’s Tour

In our geometrical analysis, we transform AL into a geometrical sampling process over a fixed cluster. The cluster boundary points, distributed in the margin regions of any class, have been shown to provide more powerful support than core points in terms of margin distance and hypothesis relationship. With this insight, in this section we develop a conquer method to find this special set of points. However, the cluster boundary points always have multiple potential positions because of the uncertain locations of the classification hypotheses. Given the diversity of the candidate positions of the cluster boundary points, recognizing all the potential positions can capture all the possible cluster boundary points in any multi-class scenario.

The knight’s tour is a classical path planning problem that requires the knight to return to its starting square after traveling over the 64 squares of the chessboard. Nowadays, this problem has become a path optimization problem in graph theory and has also been developed into a Markov chain problem over a discrete state space. Setting the knight in a data space with a given number of samples, its multi-step transfer matrix is:

(22)

where each entry denotes that the knight moves from one position to another in a given number of steps at a given speed. When the step number is one, this is the one-step transfer matrix of the knight’s tour. Suppose that the knight begins the tour with a fixed speed and a step length given by the path length between the current position and the next one. If the policy of the tour is to save path cost, the knight needs to estimate each potential path and select the subsequent position with a given probability. Therefore, we propose the transfer probability matrix:

(23)

where each entry denotes the probability of moving from one position to another. We define it as the ratio of the corresponding path length to the total length of all other possible paths. Let the probabilistic transfer matrix be produced by:

(24)

where the operator denotes the Hadamard product of two matrices. With this operation, each entry denotes the length of the probabilistic transfer path when the current position of the tour moves from one position to another. Meanwhile, for any position, we have

(25)

This is a matrix whose entries give the probabilistic transfer path length of the tour when the knight is located at each position, and it characterizes the distribution features of the current location of the knight’s tour. When the initial position of the tour is set in the central region of the cluster, the knight pays a high cost to leave the cluster because it has multiple directions into which it can move. However, if the knight starts in the boundary region of the cluster, the cost decreases dramatically. Therefore, the tour path within a limited number of steps can intuitively reflect where the tour is, i.e., in the boundary or the central region of the cluster. With this policy, we further characterize the limited-step transfer path of each position by probability evaluation:

(26)

where the neighborhood index runs over the nearest neighbors of the current position, and we call the resulting matrix the probabilistic tour matrix. The difference between Eq. (25) and Eq. (26) is the tour space of the knight. Eq. (25) calculates the tour cost of leaving the cluster, for which the knight needs to visit all positions. However, the tour cost in the local space characterizes the distribution features of cluster boundary and core points. Therefore, in Eq. (26) we limit the number of positions of the tour by a local variable, which restricts the tour to the k nearest neighbors.
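As an illustration, the following is a minimal sketch of how the local probabilistic tour path of Eq. (26) can be computed from pairwise distances under our reading of the text: each point's score sums, over its k nearest neighbors, the path length weighted by the corresponding transfer probability of Eq. (23). The use of scikit-learn for the kNN search and all names are our assumptions.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def probabilistic_tour_path(X, k=10):
    """Sketch of the local probabilistic tour path of Eq. (26).

    For every point the knight may move to one of its k nearest neighbors;
    the transfer probability of each move is the ratio of that path length
    to the total length of all candidate paths, and the score is the
    probability-weighted tour length.  Points in dense (core) regions get
    shorter paths than cluster boundary points."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    dist, _ = nn.kneighbors(X)                        # dist[:, 0] is the point itself
    d = dist[:, 1:]                                   # (n, k) path lengths to the kNN
    p = d / (d.sum(axis=1, keepdims=True) + 1e-12)    # transfer probabilities, cf. Eq. (23)
    return (d * p).sum(axis=1)                        # Hadamard product then sum, cf. Eqs. (24)-(26)
```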

Based on the above definitions and analysis, we propose a Geometric Active Learning (GAL) algorithm, whose pseudo-code is summarized in Algorithm 1. Step 3 uses the R-tree to calculate the matrix of the kNN of each data point in the data set; the time complexity of this search process is dominated by the R-tree kNN search. Steps 4 to 7 then calculate the probabilistic tour path of each data point using Eq. (26) and store these values in a matrix. Step 8 sorts the values of this matrix in ascending order. From a geometrical perspective, we divide the cluster into two regions: an outer cluster collection and an inner cluster collection, where the outer cluster collection removes all noises from the data set and the inner cluster collection covers all feasible core points. Therefore, the cluster boundary collection consists of the data that belong to the outer collection but not to the inner collection. To implement this process, we set two parameters, named the inner cluster ratio and the outer cluster ratio, to split the sorted matrix. Let the sorted column matrix be obtained by sorting the path values in ascending order; Steps 9 to 15 describe the splitting process with the following policies: 1) for any data point, if its probabilistic transfer path length is shorter than the inner threshold, it belongs to the inner cluster; and 2) if its probabilistic transfer path length is shorter than the outer threshold, it belongs to the outer cluster. Finally, Step 16 returns the complement set of the inner collection with respect to the outer collection.

1 Input: data set , number of queries , nearest neighbor number k, inner cluster ratio , outer cluster ratio , and .
2 Initialize: , , .
3 Calculate the kNN matrix of using R-tree search.
4 for each data point  do
5       Calculate using Eq. (26).
6      
7 end for
8Update via sorting by ascending.
9 while  do
10       if   then
11             Add into inner cluster collection .
12       end if
13      if   then
14             Add into outer cluster collection .
15       end if
16      Return the collection of the boundary data by .
17 end while
Algorithm 1 Geometric Active Learning
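For concreteness, below is a minimal, runnable sketch of Algorithm 1 under the assumptions above; it reuses the probabilistic_tour_path sketch given earlier, replaces the R-tree with scikit-learn's kNN search, and the ratio values are placeholders rather than the paper's exact settings.

```python
import numpy as np

def gal_select_boundary(X, k=10, inner_ratio=0.7, outer_ratio=0.9):
    """Sketch of Geometric Active Learning (Algorithm 1).

    Points are ranked by their probabilistic tour path length (Eq. (26));
    the shortest inner_ratio fraction forms the inner (core) collection,
    the shortest outer_ratio fraction forms the outer collection (which
    drops the longest paths, i.e., noises), and the returned cluster
    boundary candidates are the outer points that are not inner points."""
    n = len(X)
    path = probabilistic_tour_path(X, k=k)        # Eq. (26) scores
    order = np.argsort(path)                      # ascending path length
    inner = set(order[: int(inner_ratio * n)])    # inner cluster collection
    outer = set(order[: int(outer_ratio * n)])    # outer cluster collection
    return np.array(sorted(outer - inner))        # boundary = outer \ inner
```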

6 Experiments

To demonstrate the effectiveness of our proposed GAL algorithm, we evaluate and compare the performance of cluster boundary detection and AL classification with existing algorithms in this section. The structure of this section is as follows: Sections 6.1 and 6.2 respectively describe the related baselines and tested data sets, Section 6.3 describes the preprocessing and evaluation, Section 6.4 describes the experimental settings, and Section 6.5 analyzes the results.

6.1 Baselines

For the cluster boundary detection task, the following baselines are collected:

  • BORDER (BORDER) uses the reverse kNN approach to detect the cluster boundary, based on the assumption that the number of reverse kNN of cluster boundary points is smaller than that of core points. However, its detection results always include all feasible noises, because noises have an even smaller number of reverse kNN than other data.

  • BERGE (BERGE) is an iterative cluster boundary detection algorithm which uses evidence accumulation to start the detection, but its error rate increases rapidly when noises are mistakenly labeled as cluster boundary points.

  • Spinver (Spinver), whose inspiration comes from the spatial inversion of particle physics, is a high-dimensional cluster boundary detection algorithm. It uses the Hopkins statistic to capture the neighborhood characteristics after smoothing noises by a Euclidean distance-based Gaussian filtering function. However, the Hopkins statistic prefers a balanced class scenario.

For the classification task, several baselines are also investigated and compared against GAL:

  • Random, which uses a random sampling strategy to query unlabeled data; it can be applied to any AL task but yields uncertain results.

  • Margin (Margin), which selects the unlabeled data point with the shortest distance to the classification model, can only be used with the SVM classification model.

  • Hierarchical (Hiera) sampling is a very different idea compared to many existing AL approaches. It labels a subtree with the root node’s label when the subtree meets the objective probability function, but incorrect labeling leads to a very bad classification result.

  • TED (TED) favors data points that are on the one hand hard to predict and on the other hand representative of the rest of the data.

  • Re-active (Re-active) learning finds the data point which has the maximum influence on the future prediction result after being annotated. This novel idea does not need to query the Oracle when relabeling, but it needs a well-trained classification model at the beginning. Furthermore, its reported approach cannot be applied to multi-class classification problems.

6.2 Data sets

We synthesized and collected emulated and benchmark data sets for the experiments described in this section; they are detailed as follows.

For the cluster boundary detection task, two clustering data sets named Aggregation and Flame are used to illustrate the concept of cluster boundary points. Four other classical clustering data sets, Syn1 to Syn4, are tested in the boundary detection experiment, where the notation n×d after each name denotes a data set with n samples and d dimensions.

  • Syn1: 5400×2. The clusters are surrounded by a lot of noises.

  • Syn2: 4800×2. The circle cluster is embedded in the annulus cluster and a lot of noises connect them.

  • Syn3: 7832×2. There are two connected diamond clusters with different densities.

  • Syn4: 5034×2. A lot of noises connect the different clusters.

The following datasets are real-world medical data sets.

  • Biomed (http://lib.stat.cmu.edu/datasets/): 209×4. Medical data set. It has 134 normal objects and 75 virus-infected objects; 30 virus carriers among the normal objects are defined as the cluster boundary of normal people.

  • Cancer (Spinver): 2402. Medical data set. It has 241 malignant tumor objects and 75 benign tumor objects; 37 benign tumor objects, which may become malignant, are the cluster boundary objects of normal people.

  • Colon (http://genomics-pubs.princeton.edu/oncology/affydata/): 2402. Gene data set with 7 cluster boundary points.

  • Prostate (Spinver): 2402. Gene data set with 18 cluster boundary objects.

There are two image data sets from the target tracking field (http://research.microsoft.com/en-us/um/people/jckrumm/wallflower/testimages.htm), and we will use our GAL algorithm to capture the moving targets.

  • Waving Trees: 287×160. This comes from the continuous monitoring of one building, including 7 captured images when a volunteer passes by the monitored area.

  • Moved Object: 1745×160. This comes from the continuous monitoring of one office, including 363 captured images when a volunteer enters the office and leaves after staying for some time.

There is also one subset of the Basel Face Model (https://faces.dmi.unibas.ch/bfm/) used for the light test.

  • Basel Face Model: This is a popular 3D face model data set covering multiple gestures and color changes. The light subset has 4488 images, all stored at 500×500 pixels. We use the GAL algorithm to detect images with strong or dark light, since only normal-light images are useful in most real-world cases.

For the classification task of AL, we compare the best classification results of different algorithms on some classical clustering data sets (http://cs.joensuu.fi/sipu/datasets/) and the letter recognition data set letter.

  • g2-2-30: 2048×2. There are 2 adjacent classes in the data set.

  • Flame: 240×2. It has 2 adjacent classes with similar densities.

  • Jain: 373×2. It has two adjacent classes with different densities.

  • Pathbased: 300×2. Two clusters are close to each other and surrounded by an arc-shaped cluster.

  • Spiral: 312×2. There are three spiral-curve clusters which are linearly inseparable.

  • Aggregation: 788×2. There are 7 adjacent classes in the data set.

  • R15: 600×2. There are 7 separate clusters and 8 adjacent classes.

  • D31: 3100×2. It has 31 adjacent classes.

  • letter: 20000×16. This is a classical letter recognition data set with 26 English letters. We select 5 pairs of letters which are difficult to distinguish from each other to test the above AL algorithms in a two-class setting: DvsP, EvsF, IvsJ, MvsN, and UvsV. For the multi-class test, we select A-D, A-H, A-L, A-P, A-T, A-X, and A-Z, where A-D is the letter set A to D, A-H is the letter set A to H, ..., and A-Z is the letter set A to Z. The seven multi-class sets have 4, 8, 12, 16, 20, 24, and 26 classes, respectively.

In addition to the above description of the tested data sets, all two-dimensional data sets are shown in Figure 4.

6.3 Preprocessing and Evaluation

The preprocessing methods used in this paper are reported in this section. Their application cases are: preprocessing methods (a) and (b) are used for the Colon and Prostate data sets, respectively, since compressing the large value domain accelerates calculation and reduces memory consumption; preprocessing (c) converts images into numerical form and is used for Waving Trees, Moved Object and the Basel Face Model. The specific methods are detailed below, followed by a short illustrative sketch of method (c):

Figure 4: Classical clustering data sets. (a) Syn1 (b) Syn2 (c) Syn3 (d) Syn4 (e) g2-2-30 (f) Flame (g) Jain (h) Pathbased (i) Spiral (j) Aggregation (k) R15 (l) D31. The first four data sets are tested in the cluster boundary detection task and the others are tested in the AL classification task.
Figure 5: The marked cluster boundary points of Aggregation and Flame.

(a) , the value of each dimension of each data point is divided by ;
(b) , the value of each dimension of each data point is divided by ;
(c) , for each image, read the grayscale matrix and compress it into a single-column matrix (i.e., with a size of ) with the average grayscale values.
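A minimal sketch of preprocessing (c), under the assumption that each grayscale image is collapsed into a single-column vector by averaging its grayscale values along one axis (the averaging axis and the use of Pillow are our choices):

```python
import numpy as np
from PIL import Image

def image_to_column(path):
    """Preprocessing (c) sketch: read an image as a grayscale matrix and
    compress it into a single-column vector of average grayscale values
    (one value per pixel row)."""
    gray = np.asarray(Image.open(path).convert("L"), dtype=float)
    return gray.mean(axis=1)    # shape (height,): row-averaged grayscale values
```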

For the cluster boundary detection problem, we use the F1 score to evaluate the detection result. This is a popular evaluation function in information retrieval which considers both precision and recall. Because the cluster boundary detection task is also a retrieval problem, we use it to evaluate our results. For the classification problem, we use accuracy for evaluation.
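For completeness, the F1 score combines precision P and recall R in the standard way:

F_1 = \frac{2 P R}{P + R}.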

6.4 Experimental setting

We discuss the experimental setting of the compared algorithms over the synthetic and real data sets in this section.

  • Figure 5 marks the cluster boundary points on Aggregation and Flame. It is used to illustrate the definition of cluster boundary points.

  • Table 2 reports the best cluster boundary detection results on different synthetic and real data sets. We have marked the highest scores in each group of experiments.

  • Figure 6(a) shows the cluster boundary detection result on the light subset of the Basel Face Model. To compare with the detected cluster boundary images, we also show the detection results for the core points using GAL in Figure 6(b).

  • Table 3 shows the classification results on some synthetic data sets. The specific experimental settings are as follows: (1) we use the MATLAB random function to implement the Random algorithm and report the mean and STD values over 100 runs; (2) as the Margin, Hierarchical and Re-active algorithms all need labeled data points to guide the training process, we select one data point from each class and query the Oracle, respectively. Similarly, we run the algorithms 100 times and report the mean and STD values, in order to guarantee that the labeled set includes all the different label kinds from the Oracle; otherwise these algorithms would show poorer performance under random selection; (3) there are two important parameters for the TED algorithm: the kernel function parameter and the regularization parameter of the kernel ridge regression. We use a super-parameter of 1.8 to generate the kernel matrix and train the regularization parameter from 0.01:0.01:1, because this parameter provides important guidance for the sampling selection; after testing it many times, we limit it to a correct and stable range; (4) for our GAL algorithm, we train the neighbor parameter over a range starting from 2 with step 1 and the boundary upper bound up to N with step 1 to record the classification results. Because the inner cluster ratio separates the core points from the boundary points, we use a super-parameter to begin the training; the conclusion that at least 70% of the data points in a data set are core points comes from our published paper (Spinver) and empirical experience. The classifier trained in the classification experiments is LIBSVM (chang2011libsvm).

(a) Cluster boundary images
(b) Cluster core images
Figure 6: The cluster boundary and core image detection results using the GAL algorithm on the light subset of the Basel Face Model.
Datasets Dimension Algorithm Real.boun Num.det Num.C Precision Recall F1
Syn1 2 BORDER 1077 1252 831 0.6637 0.7716 0.7136
BERGE 1250 940 0.7520 0.8728 0.8079
Spinver 1049 993 0.9466 0.9220 0.9341
GAL 1043 996 0.9549 0.9248 0.9396
Syn2 2 BORDER 1204 1802 1089 0.6043 0.9045 0.7246
BERGE 1456 1098 0.7541 0.9120 0.8256
Spinver 1264 1111 0.8790 0.9228 0.9003
GAL 1163 1040 0.8942 0.9302 0.9118
Syn3 2 BORDER 640 723 540 0.7469 0.8438 0.7924
BERGE 662 532 0.8036 0.8313 0.8172
Spinver 611 542 0.8871 0.8469 0.8665
GAL 632 580 0.9177 0.9063 0.9120
Syn4 2 BORDER 538 669 445 0.6366 0.8271 0.7195
BERGE 553 472 0.8535 0.8773 0.8652
Spinver 540 482 0.8926 0.8959 0.8942
GAL 540 496 0.9185 0.9219 0.9202
Biomed 4 BORDER 30 26 23 0.8846 0.7667 0.8214
BERGE 27 24 0.8889 0.8000 0.8421
Spinver 29 27 0.9310 0.9000 0.9153
GAL 29 28 0.9655 0.9333 0.9491
Cancer 10 BORDER 37 37 28 0.7568 0.7568 0.7568
BERGE 37 30 0.8108 0.8108 0.8108
Spinver 35 34 0.9714 0.9789 0.9444
GAL 36 35 0.9722 0.9459 0.9589
Colon 2000 BORDER 7 7 1.0000 1.0000 1.0000
BERGE 6 5 0.8333 0.7143 0.7692
Spinver 7 7 1.0000 1.0000 1.0000
GAL 7 7 1.0000 1.0000 1.0000
Prostate 10,509 BORDER 19 18 0.9474 1.0000 0.9730
BERGE 17 16 0.9412 0.8889 0.9143
Spinver 18 18 1.0000 1.0000 1.0000
GAL 18 18 1.0000 1.0000 1.0000
Waving Trees 160 BORDER 17 17 1.0000 1.0000 1.0000
BERGE 17 15 0.8824 0.8824 0.8824
Spinver 17 17 1.0000 1.0000 1.0000
GAL 17 17 1.0000 1.0000 1.0000
Moved Object 160 BORDER 363 222 0.6116 0.6116 0.6116
BERGE 363 250 0.6887 0.6887 0.6887
Spinver 363 222 0.6116 0.6116 0.6116
GAL 363 352 0.9697 0.9697 0.9697
Table 2: The best cluster boundary detection results of the four algorithms on the synthetic and real data sets.
Data sets Num_C Algorithms Number of queries (percentage of the data set)
1% 5% 10% 15% 20% 30% 40% 50% 60%
Biomed 2 Random .516.026 .546.012 .603.028 .652.029 .693.031 .767.026 .815.026 .849.021 .881.022
Margin .500.000 .509.015 .551.047 .590.076 .644.103 .709.153 .822.139 .882.161 .927.188
Hierarchical .504.000 .550.000 .585.000 .615.000 .668.000 .774.014 .847.000 .920.011 .974.000
TED .610.000 .619.009 .651.003 .759.006 .848.007 .875.005 .901.005 .964.005 .972.000
Re-active - - - - - - - - -
GAL .724.163 .725.022 .790.021 .825.018 .886.012 .909.013 .927.011 .994.008 1.00.000
Cancer 2 Random .516.026 .546.012 .603.028 .652.029 .693.031 .767.026 .815.026 .849.021 .881.022
Margin .500.000 .509.015 .551.047 .590.076 .644.103 .709.153 .822.139 .882.161 .927.188
Hierarchical .504.000 .550.000 .585.000 .615.000 .668.000 .774.014 .847.000 .920.011 .974.000
TED .610.000 .619.009 .651.003 .759.006 .848.007 .875.005 .901.005 .964.005 .972.000
Re-active - - - - - - - - -
GAL .724.163 .725.022 .790.021 .825.018 .886.012 .909.013 .927.011 .994.008 1.00.000
g2-2-30 2 Random .516.026 .546.012 .603.028 .652.029 .693.031 .767.026 .815.026 .849.021 .881.022
Margin .500.000 .509.015 .551.047 .590.076 .644.103 .709.153 .822.139 .882.161 .927.188
Hierarchical .504.000 .550.000 .585.000 .615.000 .668.000 .774.014 .847.000 .920.011 .974.000
TED .610.000 .619.009 .651.003 .759.006 .848.007 .875.005 .901.005 .964.005 .972.000
Re-active .506.008 .531.029 .554.052 .593.065 .634.058 .744.060 .715.047 .811.000 .816.000
GAL .724.163 .725.022 .790.021 .825.018 .886.012 .909.013 .927.011 .994.008 1.00.000
Flame 2 Random .670.142 .794.106 .904.059 .944.036 .958.025 .976.014 .984.008 .987.005 .990.006
Margin .499.137 .596.102 .740.162 .872.158 .930.159 .935.145 .961.120 .963.109 .944.165
Hierarchical .720.041 .607.042 .855.062 .972.010 .999.000 1.00.000 1.00.000 1.00.000 1.00.000
TED .829.000 .950.006 .974.006 .988.006 .991.000 .995.001 .996.002 .996.002 .998.000
Re-active .553.154 .804.120 .917.090 .966.045 .974.045 .993.006 .993.027 .996.004 .997.004
GAL .887.004 .976.008 .983.005 .988.004 .991.002 .995.002 1.00.000 1.00.000 1.00.000
Jain 2 Random .659.180 .773.042 .816.041 .848.041 .881.040 .928.028 .958.024 .974.015 .981.015
Margin .258.003 .270.074 .382.211 .545.306 .572.310 .627.347 .623.340 .721.347 .736.352
Hierarchical .325.013 .295.008 .297.010 .636.022 .873.024 1.00.000 1.00.000 1.00.000 1.00.000
TED .739.000 .764.006 .837.018 .932.019 .978.018 .998.002 1.00.000 1.00.000 1.00.000
Re-active .666.163 .748.036 .791.027 .836.041 .899.045 .994.022 .998.008 1.00.000 1.00.000
GAL .768.007 .915.026 .963.018 .977.013 .989.009 1.00.000 1.00.000 1.00.000 1.00.000
Pathbased 3 Random .447.157 .533.089 .719.096 .833.063 .891.046 .940.046 .958.016 .969.014 .976.010
Margin .366.000 .368.016 .407.087 .481.151 .686.230 .875.209 .960.151 .962.148 .988.081
Hierarchical .488.027 .500.017 .547.024 .717.028 .749.023 .861.022 .949.015 .970.013 1.00.000
TED .356.000 .582.023 .875.032 .933.008 .941.005 .987.009 .997.002 1.00.000 1.00.000
Re-active - - - - - - - - -
GAL .748.004 .811.048 .920.038 .950.019 .959.012 1.00.000 1.00.000 1.00.000 1.00.000
Spiral 3 Random .352.023 .493.049 .634.061 .757.059 .830.051 .918.034 .955.024 .977.017 .988.011
Margin .337.005 .344.015 .408.062 .513.101 .630.144 .893.180 .964.119 .965.126 .990.034
Hierarchical .380.024 .486.044 .498.046 .525.062 .627.044 .653.048 .770.055 .774.062 .865.039
TED .355.000 .678.011 .751.039 .828.039 .896.003 .920.002 .960.000 .990.003 .998.000
Re-active - - - - - - - - -
GAL .427.017 .685.090 .830.097 .872.082 .919.063 .963.038 .990.021 .998.006 1.00.000
Aggregation 7 Random .339.101 .583.062 .775.047 .868.031 .923.023 .972.013 .987.006 .993.003 .996.000
Margin .215.000 .355.092 .707.153 .964.098 .995.044 1.00.000 1.00.000 1.00.000 1.00.000
Hierarchical .471.038 .578.016 .651.009 .695.010 .961.009 .987.005 .990.005 .992.003 .997.000
TED .379.002 .646.019 .948.009 .968.001 .999.001 1.00.000 1.00.000 1.00.000 1.00.000
Re-active - - - - - - -