
# Improving Semi-Supervised Support Vector Machines Through Unlabeled Instances Selection

Yu-Feng Li    Zhi-Hua Zhou
###### Abstract

Semi-supervised support vector machines (S3VMs) are a popular family of approaches that try to improve learning performance by exploiting unlabeled data. Though S3VMs have been found helpful in many situations, they may degrade performance, and the resulting generalization ability may be even worse than that obtained using the labeled data only. In this paper, we try to reduce the chance of performance degeneration of S3VMs. Our basic idea is that, rather than exploiting all unlabeled data, unlabeled instances should be selected so that only those very likely to be helpful are exploited, while highly risky ones are avoided. We propose the S3VM-us method, which uses hierarchical clustering to select the unlabeled instances. Experiments on a broad range of data sets over eighty-eight different settings show that the chance of performance degeneration of S3VM-us is much smaller than that of existing S3VMs.

Corresponding author. Email: zhouzh@nju.edu.cn

National Key Laboratory for Novel Software Technology

Nanjing University, Nanjing 210093, China

Key words: unlabeled data, performance degeneration, semi-supervised support vector machine

## 1 Introduction

In many real situations there are plentiful unlabeled training data while the acquisition of class labels is costly and difficult. Semi-supervised learning tries to exploit unlabeled data to help improve learning performance, particularly when there are limited labeled training examples. During the past decade, semi-supervised learning has received significant attention and many approaches have been developed [6, 29, 28].

Among the many semi-supervised learning approaches, S3VMs (semi-supervised support vector machines) [3, 15] are popular and have a solid theoretical foundation. However, although the performance of S3VMs is promising in many tasks, it has been found that there are cases where, by using unlabeled data, S3VMs perform even worse than SVMs that simply use the labeled data [25, 6, 7]. To enable S3VMs to be accepted by more users in more application areas, it is desirable to reduce the chance of performance degeneration caused by using unlabeled data.

In this paper, we focus on transductive learning and present the S3VM-us (S3VM with Unlabeled instances Selection) method. Our basic idea is that, given a set of unlabeled data, it may not be adequate to use all of them without any sanity check; instead, it may be better to use only the unlabeled instances that are very likely to be helpful while avoiding those that carry high risk. To exclude highly risky unlabeled instances, we first introduce two baselines: the first uses a standard clustering technique motivated by the discernibility of density sets [21], while the other uses a label propagation technique motivated by confidence estimation. Then, based on an analysis of the deficiencies of the two baseline approaches, we propose the S3VM-us method, which employs hierarchical clustering to help select unlabeled instances. Comprehensive experiments on a broad range of data sets over eighty-eight different settings show that the chance of performance degeneration of S3VM-us is much smaller than that of TSVM [15], while the overall performance of S3VM-us is competitive with TSVM.

The rest of this paper is organized as follows. Section 2 briefly reviews some related work. Section 3 introduces two baseline approaches. Section 4 presents our S3VM-us method. Experimental results are reported in Section 5. The last section concludes this paper.

## 2 Related Work

Roughly speaking, existing semi-supervised learning approaches mainly fall into four categories. The first category is generative methods, e.g., [19, 20], which extend supervised generative models by exploiting unlabeled data in parameter estimation and label estimation using techniques such as the EM method. The second category is graph-based methods, e.g., [4, 30, 26], which encode both the labeled and unlabeled instances in a graph and then perform label propagation on the graph. The third category is disagreement-based methods, e.g., [5, 27], which employ multiple learners and improve the learners through labeling the unlabeled data based on the exploitation of disagreement among the learners. The fourth category is S3VMs, e.g., [3, 15], which use unlabeled data to regularize the decision boundary to go through low density regions [8].

Though semi-supervised learning approaches have shown promising performance in many situations, many authors have indicated that using unlabeled data may hurt performance [20, 25, 11, 27, 9, 16, 2, 21]. In some application areas, especially those that require high reliability, users might be reluctant to use semi-supervised learning approaches for fear of obtaining a performance worse than that of simply neglecting unlabeled data. As typical semi-supervised learning approaches, S3VMs also suffer from this deficiency.

The usefulness of unlabeled data has been discussed theoretically [16, 2, 21] and validated empirically [9]. Many studies indicate that unlabeled data should be used carefully. For generative methods, Cozman et al. [11] showed that unlabeled data can increase error even in situations where additional labeled data would decrease error. One main conjecture attributes the performance degeneration to the difficulty of making a correct model assumption, which is needed to prevent performance from degenerating when the model is fit to unlabeled data. For graph-based methods, more and more researchers recognize that graph construction is more crucial than how the labels are propagated, and some attempts have been devoted to using domain knowledge or constructing robust graphs [1, 14]. As for disagreement-based methods, the generalization ability has been studied with plentiful theoretical results based on different assumptions [5, 12, 23, 24]. As for S3VMs, the correctness of the S3VM objective has been studied on small data sets [7].

It is noteworthy that although much work has been devoted to coping with the high complexity of S3VMs [15, 10, 7, 18], there has been no proposal on how to reduce the chance of performance degeneration caused by using unlabeled data. One relevant work uses data editing techniques in semi-supervised learning [17]; however, it tries to remove or fix suspicious unlabeled data during the training process, while our proposal tries to select unlabeled instances for S3VM and SVM predictions after the S3VM and SVM have already been trained.

## 3 Two Baseline Approaches

As mentioned, our intuition is to use only the unlabeled data that are very likely to help improve the performance and to leave high-risk unlabeled data unexploited. In this way, the chance of performance degeneration may be significantly reduced. Current S3VMs can be regarded as one extreme case, which assumes that all unlabeled data are low risk and therefore all of them should be used; inductive SVMs, which use labeled data only, can be regarded as the other extreme case, which assumes that all unlabeled data are highly risky and therefore only labeled data are used.

Specifically, we consider the following problem: once we have obtained the predictions of inductive SVM and S3VM, how can we remove the risky predictions of S3VM so that the resulting performance is often better and rarely worse than that of inductive SVM?
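The problem statement above can be read as choosing, per unlabeled instance, which of the two predictions to keep. The following is a minimal sketch of that framing only; the function name `combine_predictions` and the boolean mask `use_s3vm` are illustrative assumptions, and how the mask is computed is exactly what the methods in this paper differ on.

```python
from typing import List, Sequence


def combine_predictions(y_svm: Sequence[int],
                        y_s3vm: Sequence[int],
                        use_s3vm: Sequence[bool]) -> List[int]:
    """Keep the S3VM label where an instance is judged low risk, else the SVM label."""
    if not (len(y_svm) == len(y_s3vm) == len(use_s3vm)):
        raise ValueError("all inputs must have the same length")
    return [s3 if keep else sv for sv, s3, keep in zip(y_svm, y_s3vm, use_s3vm)]


# Toy predictions on four unlabeled instances (made-up values).
y_svm = [+1, +1, -1, -1]
y_s3vm = [+1, -1, -1, +1]

# Extreme case 1: every unlabeled instance is treated as low risk (plain S3VM).
all_s3vm = combine_predictions(y_svm, y_s3vm, [True] * 4)
# Extreme case 2: every unlabeled instance is treated as high risk (inductive SVM).
all_svm = combine_predictions(y_svm, y_s3vm, [False] * 4)
# In between: a selection rule marks only some instances as low risk.
mixed = combine_predictions(y_svm, y_s3vm, [True, False, True, False])
```

With an all-true mask the result equals the S3VM output and with an all-false mask it equals the SVM output, which mirrors the two extreme cases described above.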

Two simple ideas readily address the above problem, leading to two baseline approaches, namely S3VM-c and S3VM-p.

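In the sequel, suppose we are given a training data set $\mathcal{D} = \mathcal{L} \cup \mathcal{U}$, where $\mathcal{L} = \{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_l, y_l)\}$ denotes the set of labeled data and $\mathcal{U} = \{\mathbf{x}_{l+1}, \ldots, \mathbf{x}_{l+u}\}$ denotes the set of unlabeled data. Here $\mathbf{x} \in \mathcal{X}$ is an instance and $y \in \{+1, -1\}$ is the label. We further let $y_{SVM}(\mathbf{x})$ and $y_{S3VM}(\mathbf{x})$ denote the predicted labels on $\mathbf{x}$ by inductive SVM and S3VM, respectively.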

### 3.1 S3VM-c

The first baseline approach is motivated by the analysis in [21], which suggests that unlabeled data help when the component density sets are discernible. Here, one can simulate the component density sets by clusters, and discernibility by a condition on the disagreement between S3VM and inductive SVM. We consider the disagreement using two factors, i.e., bias and confidence. When S3VM obtains the same bias as inductive SVM and enhances the confidence of inductive SVM, one should use the results of S3VM; otherwise it may be risky to totally trust the prediction of S3VM.

Algorithm 1 gives the S3VM-c method and Figure 1(d) illustrates the intuition of S3VM-c. As can be seen, S3VM-c inherits the correct predictions of S3VM on some groups while avoiding the wrong predictions of S3VM on others.
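As an illustration of the per-cluster rule described in this subsection, here is a minimal sketch. It is not Algorithm 1 itself: it assumes that cluster assignments are already available, that the "bias" of a learner on a cluster is the sign of the sum of its predicted labels there, and that its "confidence" is the magnitude of that sum. The function name `s3vm_c_select` and these two definitions are assumptions made for the example.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def s3vm_c_select(cluster_ids: Sequence[int],
                  y_svm: Sequence[int],
                  y_s3vm: Sequence[int]) -> List[int]:
    """Per cluster, keep S3VM labels only if S3VM agrees with SVM on the
    majority class (bias) and is at least as one-sided (confidence)."""
    members: Dict[int, List[int]] = defaultdict(list)
    for idx, cluster in enumerate(cluster_ids):
        members[cluster].append(idx)

    final = list(y_svm)  # default: fall back to the inductive SVM
    for idxs in members.values():
        sum_svm = sum(y_svm[i] for i in idxs)
        sum_s3vm = sum(y_s3vm[i] for i in idxs)
        same_bias = sum_svm * sum_s3vm > 0          # assumed bias test
        more_confident = abs(sum_s3vm) >= abs(sum_svm)  # assumed confidence test
        if same_bias and more_confident:
            for i in idxs:
                final[i] = y_s3vm[i]
    return final


# Two clusters of three unlabeled instances each (made-up predictions).
cluster_ids = [0, 0, 0, 1, 1, 1]
y_svm = [+1, +1, -1, -1, -1, +1]
y_s3vm = [+1, +1, +1, +1, +1, +1]
final = s3vm_c_select(cluster_ids, y_svm, y_s3vm)
```

In this toy run, cluster 0 keeps the S3VM labels because both learners lean positive and S3VM is more one-sided, while cluster 1 falls back to the SVM labels because the two learners lean in opposite directions.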

### 3.2 S3VM-p

The second baseline approach is motivated by confidence estimation in graph-based methods, e.g., [30], where the confidence can be naturally regarded as a risk measurement of unlabeled data.

Formally, to estimate the confidence of unlabeled data, let $F_l$ be the label matrix for labeled data, built from the label vector $\mathbf{y}_l$. Let $W$ be the weight matrix of training data and let $\Lambda$ be the Laplacian of $W$, i.e., $\Lambda = D - W$, where $D$ is a diagonal matrix with entries $d_{ii} = \sum_j w_{ij}$. Then, the predictions of unlabeled data can be obtained by [30]

$$F_u = \Lambda_{u,u}^{-1} W_{u,l} F_l, \qquad (1)$$

where $\Lambda_{u,u}$ is the sub-matrix of $\Lambda$ with respect to the block of unlabeled data, while $W_{u,l}$ is the sub-matrix of $W$ with respect to the block between labeled and unlabeled data. Then, each unlabeled point is assigned a label and a confidence computed from $F_u$. After confidence estimation, similar to S3VM-c, we consider the risk of unlabeled data by two factors, i.e., bias and confidence. If S3VM has the same bias as label propagation and the confidence is high enough, we use the S3VM prediction; otherwise we take the SVM prediction.
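To make Eq. (1) concrete, the sketch below solves the linear system with the unlabeled block of the graph Laplacian using only the Python standard library. Several details are assumptions for illustration rather than the paper's specification: the label matrix is one-hot with a positive and a negative column, the label is the larger column, the confidence is the absolute difference between the two columns, and the toy weights use a Gaussian similarity with width 1.

```python
import math
from typing import List, Sequence, Tuple

Matrix = List[List[float]]


def solve(a: Matrix, b: Matrix) -> Matrix:
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [list(a[i]) + list(b[i]) for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular system")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0.0:
                factor = aug[r][col]
                aug[r] = [vr - factor * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def harmonic_labels(w: Matrix, y_l: Sequence[int]) -> List[Tuple[int, float]]:
    """Return (label, confidence) per unlabeled node.

    Rows and columns of w are ordered with the labeled nodes first.
    """
    l, m = len(y_l), len(w)
    # Assumed encoding: column 0 marks positive, column 1 marks negative.
    f_l = [[1.0, 0.0] if y > 0 else [0.0, 1.0] for y in y_l]
    degree = [sum(row) for row in w]
    # Unlabeled block of the Laplacian D - W.
    lam_uu = [[(degree[i] if i == j else 0.0) - w[i][j] for j in range(l, m)]
              for i in range(l, m)]
    # Right-hand side W_{u,l} F_l.
    rhs = [[sum(w[i][j] * f_l[j][c] for j in range(l)) for c in range(2)]
           for i in range(l, m)]
    f_u = solve(lam_uu, rhs)
    # Assumed read-out: larger column gives the label, the gap gives the confidence.
    return [(1 if pos >= neg else -1, abs(pos - neg)) for pos, neg in f_u]


# Toy 1-D data: labeled points at 0 (+1) and 4 (-1); unlabeled at 1, 3 and 2.
xs = [0.0, 4.0, 1.0, 3.0, 2.0]
sigma = 1.0
w = [[0.0 if i == j else math.exp(-(a - b) ** 2 / (2 * sigma ** 2))
      for j, b in enumerate(xs)] for i, a in enumerate(xs)]
preds = harmonic_labels(w, [+1, -1])
```

In this toy setup the point at 1 is labeled positive and the point at 3 negative, while the midpoint at 2 receives a confidence close to zero, which is the kind of instance a confidence-based rule would leave to the inductive SVM.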

Algorithm 2 gives the S3VM-p method and Figure 1(e) illustrates the intuition of S3VM-p. As can be seen, the correct predictions of S3VM on some groups are inherited by S3VM-p, while the wrong predictions of S3VM on others are avoided.

## 4 Our Proposed Method

### 4.1 Deficiencies of S3VM-c and S3VM-p

S3VM-c and S3VM-p are capable of reducing the chance of performance degeneration caused by using unlabeled data; however, they both suffer from some deficiencies. S3VM-c works in a local manner and never considers the relations between clusters, so some helpful unlabeled instances are left unexploited, e.g., unlabeled instances in some groups in Figure 2(d). For S3VM-p, as stated in [22], the confidence estimated by the label propagation approach might be incorrect if the label initialization is highly imbalanced, so some useful unlabeled instances are also left unexploited, e.g., some groups in Figure 2(e).

Moreover, both S3VM-c and S3VM-p rely heavily on the predictions of S3VM, which might become a serious issue, especially when the performance of S3VM degenerates. Figures 2(b) and 2(c) illustrate the behaviors of S3VM-c and S3VM-p in that situation. Both S3VM-c and S3VM-p erroneously inherit the wrong predictions of S3VM on group 1.

### 4.2 S3VM-us

The deficiencies of S3VM-c and S3VM-p suggest taking the relations between clusters into account and making the method insensitive to label initialization. This motivates us to use hierarchical clustering [13], leading to our proposed method S3VM-us.

Hierarchical clustering works in a greedy and iterative manner. It first initializes each single instance as a cluster and then, at each step, merges the two clusters with the shortest distance among all pairs of clusters. In this way the relations between clusters are taken into account; moreover, since hierarchical clustering works in an unsupervised setting, it does not suffer from the label initialization problem.

Suppose $p_i$ and $n_i$ are the lengths of the paths from the instance $\mathbf{x}_i$ to its nearest positive and negative labeled instances, respectively, in hierarchical clustering. We simply take the difference between $p_i$ and $n_i$ as an estimate of the confidence on the unlabeled instance $\mathbf{x}_i$. Intuitively, the larger the difference between $p_i$ and $n_i$, the higher the confidence on labeling $\mathbf{x}_i$.
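The confidence estimate just described can be sketched as follows. This is a simplified illustration and not Algorithm 3: it assumes single-linkage merging, it measures the "path length" to the nearest positive (negative) labeled instance as the merge step at which an instance's cluster first contains a positive (negative) labeled instance, and it selects instances by a plain threshold on the absolute difference. The names `path_lengths` and `s3vm_us_select`, the linkage, the path-length definition, and the thresholding rule are all assumptions.

```python
import math
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, ...]


def path_lengths(points: Sequence[Point],
                 labels: Sequence[int]) -> Tuple[List[int], List[int]]:
    """Naive single-linkage agglomerative clustering over all instances.

    labels[i] is +1 or -1 for labeled instances and 0 for unlabeled ones.
    Returns (p, n): the merge step at which each instance's cluster first
    contains a positive (p) or a negative (n) labeled instance.
    """
    if not any(y > 0 for y in labels) or not any(y < 0 for y in labels):
        raise ValueError("need at least one labeled instance of each class")
    m = len(points)
    clusters: Dict[int, List[int]] = {i: [i] for i in range(m)}
    p = [0 if labels[i] > 0 else -1 for i in range(m)]  # -1: not reached yet
    n = [0 if labels[i] < 0 else -1 for i in range(m)]
    step = 0
    while len(clusters) > 1:
        step += 1
        best = None
        keys = sorted(clusters)
        for pos, a in enumerate(keys):
            for b in keys[pos + 1:]:
                d = min(math.dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters.pop(b)  # merge the closest pair
        merged = clusters[a]
        has_pos = any(labels[i] > 0 for i in merged)
        has_neg = any(labels[i] < 0 for i in merged)
        for i in merged:
            if has_pos and p[i] < 0:
                p[i] = step
            if has_neg and n[i] < 0:
                n[i] = step
    return p, n


def s3vm_us_select(points, labels, y_svm, y_s3vm, threshold) -> Dict[int, int]:
    """Use the S3VM label on unlabeled instances whose |p - n| reaches the
    (assumed) threshold, and the SVM label on the rest."""
    p, n = path_lengths(points, labels)
    final = {}
    for i, y in enumerate(labels):
        if y == 0:
            confident = abs(p[i] - n[i]) >= threshold
            final[i] = y_s3vm[i] if confident else y_svm[i]
    return final


# Toy 1-D data: one positive, one negative, three unlabeled instances.
points = [(0.0,), (10.0,), (1.0,), (8.5,), (5.0,)]
labels = [+1, -1, 0, 0, 0]
p, n = path_lengths(points, labels)
confidence = {i: abs(p[i] - n[i]) for i in (2, 3, 4)}
final = s3vm_us_select(points, labels,
                       y_svm=[+1, -1, +1, +1, +1],
                       y_s3vm=[+1, -1, +1, -1, -1],
                       threshold=2)
```

Raising the threshold in this sketch moves more instances back to the SVM prediction, so the combined output approaches that of the inductive SVM.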

Algorithm 3 gives the S3VM-us method and Figures 1(f) and 2 illustrate the intuition of S3VM-us. As can be seen, S3VM-us avoids the wrong predictions of S3VM on some groups and inherits its correct predictions on others; moreover, it does not erroneously inherit the wrong predictions of S3VM on group 1 in Figure 2.

## 5 Experiments

### 5.1 Settings

We evaluate S3VM-us on a broad range of data sets including the semi-supervised learning benchmark data sets in [6] and sixteen UCI data sets. The benchmark data sets are g241c, g241d, Digit1, USPS, TEXT and BCI. For each data set, the archive provides two versions, one using 10 labeled examples and the other using 100 labeled examples. As for the UCI data sets, we randomly select 10 and 100 examples to be used as labeled examples, respectively, and use the remaining data as unlabeled data. The experiments are repeated 30 times and the average accuracies and standard deviations are recorded. It is worth noting that in semi-supervised learning, labeled examples are often too few to afford a valid cross validation, and therefore hold-out tests are usually used for the evaluation.
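The repeated random-split protocol described for the UCI data sets can be sketched as below. The callback `evaluate`, the fixed seed, and the use of the sample standard deviation are assumptions made for the example; the toy majority-vote learner only stands in for a real classifier.

```python
import random
import statistics
from typing import Callable, List, Tuple


def repeated_holdout(n_instances: int,
                     n_labeled: int,
                     evaluate: Callable[[List[int], List[int]], float],
                     repeats: int = 30,
                     seed: int = 0) -> Tuple[float, float]:
    """Randomly pick n_labeled labeled indices, treat the rest as unlabeled,
    and report mean and standard deviation of accuracy over the repeats."""
    rng = random.Random(seed)
    accuracies = []
    for _ in range(repeats):
        indices = list(range(n_instances))
        rng.shuffle(indices)
        labeled, unlabeled = indices[:n_labeled], indices[n_labeled:]
        accuracies.append(evaluate(labeled, unlabeled))
    return statistics.mean(accuracies), statistics.stdev(accuracies)


# Stand-in learner: predict the majority class seen among the labeled examples.
toy_labels = [1] * 60 + [-1] * 40


def majority_accuracy(labeled: List[int], unlabeled: List[int]) -> float:
    votes = sum(toy_labels[i] for i in labeled)
    guess = 1 if votes >= 0 else -1
    return sum(toy_labels[i] == guess for i in unlabeled) / len(unlabeled)


mean_acc, std_acc = repeated_holdout(len(toy_labels), 10, majority_accuracy)
```

Accuracy is measured on the unlabeled part of each split, in line with the transductive setting considered in this paper.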

In addition to S3VM-c and S3VM-p, we compare with inductive SVM and TSVM (http://svmlight.joachims.org/) [15]. Both linear and Gaussian kernels are used. For the benchmark data sets, we follow the setup in [6]. Specifically, for the case of 10 labeled examples, the SVM parameter is fixed to a value that depends on the size of the data set, and the Gaussian kernel width is set to the average distance between instances. For the case of 100 labeled examples, the SVM parameter is fixed to 100 and the Gaussian kernel width is selected from a set of candidate values by cross validation. On the UCI data sets, the SVM parameter is fixed to 1 and the Gaussian kernel width is set to a fixed value for 10 labeled examples. For 100 labeled examples, both the SVM parameter and the Gaussian kernel width are selected from sets of candidate values by cross validation. For S3VM-c, the cluster number is fixed to 50; for S3VM-p, the weight matrix is constructed via Gaussian distance and its parameter is fixed to 0.1; for S3VM-us, its parameter is fixed to 0.1.
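As a small illustration of setting the Gaussian kernel width to the average distance between instances, consider the sketch below. The kernel parameterization `exp(-||a - b||^2 / (2 * width^2))` and the choice to average over distinct pairs only are assumptions; the exact conventions used in the experiments are not stated here.

```python
import math
from itertools import combinations
from typing import Sequence, Tuple

Point = Tuple[float, ...]


def average_pairwise_distance(points: Sequence[Point]) -> float:
    """Mean Euclidean distance over all distinct pairs of instances."""
    pairs = list(combinations(points, 2))
    if not pairs:
        raise ValueError("need at least two instances")
    return sum(math.dist(a, b) for a, b in pairs) / len(pairs)


def gaussian_kernel(a: Point, b: Point, width: float) -> float:
    """Gaussian kernel under the assumed parameterization."""
    return math.exp(-math.dist(a, b) ** 2 / (2.0 * width ** 2))


# Three toy instances with pairwise distances 5, 10 and 5.
points = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
width = average_pairwise_distance(points)
k_near = gaussian_kernel(points[0], points[1], width)
k_far = gaussian_kernel(points[0], points[2], width)
```

Tying the width to the data scale in this way avoids tuning it by cross validation, which matters when only 10 labeled examples are available.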

### 5.2 Results

The results are shown in Tables 1 and 2. As can be seen, the performance of S3VM-us is competitive with TSVM. In terms of average accuracy, TSVM performs slightly better (worse) than S3VM-us on the case of 10 (100) labeled examples. In terms of pairwise comparison, S3VM-us performs better than TSVM on 13/12 and 14/16 cases with linear/Gaussian kernel for 10 and 100 labeled examples, respectively. Note that in a number of cases, TSVM has large performance improvement against inductive SVM, while the improvement of S3VM-us is smaller. This is not a surprise since S3VM-us tries to improve performance with the caution of avoiding performance degeneration.

Though TSVM achieves large improvements in a number of cases, it also suffers large performance degeneration in other cases. Indeed, as can be seen from Tables 1 and 2, TSVM is significantly inferior to inductive SVM on 8/44 and 19/44 cases for 10 and 100 labeled examples, respectively. Both S3VM-c and S3VM-p are able to reduce the number of cases of significant performance degeneration, while S3VM-us does not significantly degenerate performance in any of the experiments.

### 5.3 Parameter Influence

S3VM-us has a parameter. To study its influence, we run experiments setting it to different values (0.1, 0.2 and 0.3) with 10 labeled examples. The results are plotted in Figure 3. It can be seen that the setting of this parameter influences the improvement of S3VM-us over inductive SVM. Whether the linear or the Gaussian kernel is used, the larger the parameter value, the closer the performance of S3VM-us is to that of SVM. It may be possible to increase the performance improvement by choosing a smaller value; however, this may increase the risk of performance degeneration.

## 6 Conclusion

In this paper we propose the S3VM-us method. Rather than simply predicting all unlabeled instances with a semi-supervised learner, S3VM-us uses hierarchical clustering to help select the unlabeled instances to be predicted by the semi-supervised learner, and predicts the remaining unlabeled instances with an inductive learner. In this way, the risk of performance degeneration caused by using unlabeled data is reduced. The effectiveness of S3VM-us is validated by an empirical study.

The proposal in this paper is based on heuristics, and theoretical analysis is left for future work. It is worth noting that, along with reducing the chance of performance degeneration, S3VM-us also reduces the possible performance gains from unlabeled data. In the future it is desirable to develop truly safe semi-supervised learning approaches that are able to improve performance significantly but never degenerate performance by using unlabeled data.

## References

• [1] M. F. Balcan, A. Blum, P. P. Choi, J. Lafferty, B. Pantano, M. R. Rwebangira, and X. Zhu. Person identification in webcam images: An application of semi-supervised learning. In ICML Workshop on Learning with Partially Classified Training Data, 2005.
• [2] S. Ben-David, T. Lu, and D. Pál. Does unlabeled data provably help? Worst-case analysis of the sample complexity of semi-supervised learning. In COLT, pages 33–44, 2008.
• [3] K. Bennett and A. Demiriz. Semi-supervised support vector machines. In NIPS 11, pages 368–374. 1999.
• [4] A. Blum and S. Chawla. Learning from labeled and unlabeled data using graph mincuts. In ICML, pages 19–26, 2001.
• [5] A. Blum and T. Mitchell. Combining labeled and unlabeled data with co-training. In COLT, pages 92–100, 1998.
• [6] O. Chapelle, B. Schölkopf, and A. Zien, editors. Semi-Supervised Learning. MIT Press, Cambridge, MA, 2006.
• [7] O. Chapelle, V. Sindhwani, and S. S. Keerthi. Optimization techniques for semi-supervised support vector machines. J. Mach. Learn. Res., 9:203–233, 2008.
• [8] O. Chapelle and A. Zien. Semi-supervised learning by low density separation. In AISTATS, pages 57–64, 2005.
• [9] N. V. Chawla and G. Karakoulas. Learning from labeled and unlabeled data: An empirical study across techniques and domains. J. Artif. Intell. Res., 23:331–366, 2005.
• [10] R. Collobert, F. Sinz, J. Weston, and L. Bottou. Large scale transductive SVMs. J. Mach. Learn. Res., 7:1687–1712, 2006.
• [11] F. G. Cozman, I. Cohen, and M. C. Cirelo. Semi-supervised learning of mixture models. In ICML, pages 99–106, 2003.
• [12] S. Dasgupta, M. L. Littman, and D. McAllester. PAC generalization bounds for co-training. In NIPS 14, pages 375–382. 2002.
• [13] A.K. Jain and R.C. Dubes. Algorithms for Clustering Data. Prentice Hall, Englewood Cliffs, NJ., 1988.
• [14] T. Jebara, J. Wang, and S. F. Chang. Graph construction and b-matching for semi-supervised learning. In ICML, pages 441–448, 2009.
• [15] T. Joachims. Transductive inference for text classification using support vector machines. In ICML, pages 200–209, 1999.
• [16] J. Lafferty and L. Wasserman. Statistical analysis of semi-supervised regression. In NIPS 20, pages 801–808. 2008.
• [17] M. Li and Z. H. Zhou. SETRED: Self-training with editing. In PAKDD, pages 611–621, 2005.
• [18] Y.-F. Li, J. T. Kwok, and Z.-H. Zhou. Semi-supervised learning using label mean. In ICML, pages 633–640, 2009.
• [19] D. J. Miller and H. S. Uyar. A mixture of experts classifier with learning based on both labelled and unlabelled data. In NIPS 9, pages 571–577. 1997.
• [20] K. Nigam, A. K. McCallum, S. Thrun, and T. Mitchell. Text classification from labeled and unlabeled documents using EM. Mach. Learn., 39(2):103–134, 2000.
• [21] A. Singh, R. Nowak, and X. Zhu. Unlabeled data: Now it helps, now it doesn’t. In NIPS 21, pages 1513–1520. 2009.
• [22] J. Wang, T. Jebara, and S. F. Chang. Graph transduction via alternating minimization. In ICML, pages 1144–1151, 2008.
• [23] W. Wang and Z.-H. Zhou. Analyzing co-training style algorithms. In ECML, pages 454–465, 2007.
• [24] W. Wang and Z.-H. Zhou. A new analysis of co-training. In ICML, pages 1135–1142, 2010.
• [25] T. Zhang and F. Oles. The value of unlabeled data for classification problems. In ICML, pages 1191–1198, 2000.
• [26] D. Zhou, O. Bousquet, T. N. Lal, J. Weston, and B. Schölkopf. Learning with local and global consistency. In NIPS 16, pages 595–602. 2004.
• [27] Z.-H. Zhou and M. Li. Tri-training: Exploiting unlabeled data using three classifiers. IEEE Trans. Knowl. Data Eng., 17(11):1529–1541, 2005.
• [28] Z.-H. Zhou and M. Li. Semi-supervised learning by disagreement. Knowl. Inf. Syst., 24(3):415–439, 2010.
• [29] X. Zhu. Semi-supervised learning literature survey. Technical Report 1530, Dept. Comp. Sci., Univ. Wisconsin-Madison, 2006.
• [30] X. Zhu, Z. Ghahramani, and J. D. Lafferty. Semi-supervised learning using Gaussian fields and harmonic functions. In ICML, pages 912–919, 2003.