Training Support Vector Machines using Coresets
Note: This work was done as a course project as part of an ongoing research effort that was recently submitted [2]. The submission, done in collaboration with Murad Tukan, Dan Feldman, and Daniela Rus, supersedes the work in this manuscript.
We present a novel coreset construction algorithm for solving classification tasks using Support Vector Machines (SVMs) in a computationally efficient manner. A coreset is a weighted subset of the original data points that provably approximates the original set. We show that coresets of size polylogarithmic in $n$ and polynomial in $d$ exist for a set of $n$ input points with $d$ features, and we present an $(\epsilon, \delta)$-FPRAS for constructing coresets for scalable SVM training. Our method leverages the insight that data points are often redundant and uses an importance sampling scheme based on the sensitivity of each data point to construct coresets efficiently. We evaluate the performance of our algorithm in accelerating SVM training on real-world data sets and compare it to state-of-the-art coreset approaches. Our empirical results show that our approach outperforms a state-of-the-art coreset approach and uniform sampling in enabling computational speedups while achieving low approximation error.
Popular machine learning algorithms are computationally expensive, or worse yet, intractable to train on Big Data. Recently, the notion of using coresets [1, 8, 4], small weighted subsets of the input points that approximately represent the original data set, has shown promise in accelerating machine learning algorithms, such as $k$-means clustering, training mixture models [7], and logistic regression [12]. Support Vector Machines (SVMs) are among the most popular algorithms for classification and regression analysis. However, with the rising availability of Big Data, training SVMs on massive data sets has proven to be computationally expensive. In this paper, we present a coreset construction algorithm for efficiently training Support Vector Machines.
Our approach entails a randomized coreset construction that is based on the insight that data is often redundant and that some input points are more important than others for large-margin classification. Using importance sampling, our algorithm can be considered an $(\epsilon, \delta)$-FPRAS: it generates a coreset that can be used for training instead of the original (massive) set of input points, yet still provides, with probability at least $1 - \delta$, an $\epsilon$-approximation to the ground-truth classifier that would be obtained if all of the points were used. In this paper, we prove that such coresets, of size polylogarithmic in $n$ and polynomial in $d$, can be efficiently constructed for a set of $n$ points with $d$ features, and we present an intuitive, importance sampling-based approach for constructing them.
2 Related Work
Training a canonical Support Vector Machine (SVM) requires $O(n^3)$ time and $O(n^2)$ space [18], where $n$ is the number of training points, which may be impractical for certain applications. Work by Tsang et al. [18] investigated computationally efficient approximations to the SVM problem in terms of coresets, termed Core Vector Machines (CVMs), and leveraged existing coreset methods for the Minimum Enclosing Ball (MEB) [1, 3]. The authors propose a method that reduces the training time required for the two-class L2-SVM to a bound that is linear in $n$ and the space to an expression that is (surprisingly) independent of $n$. Similar geometric approaches based on convex hulls and extreme points were investigated by Nandan et al. [16].
Since the SVM problem is inherently a quadratic optimization problem, prior work has investigated approximations to the quadratic programming problem using the Frank-Wolfe algorithm or Gilbert’s algorithm [5]. Another line of research reduces the SVM problem to the polytope distance problem [9]. The authors establish lower and upper bounds for the polytope distance problem and use Gilbert’s algorithm to train an SVM in linear time.
Har-Peled et al. constructed coresets to approximate the maximum margin separation, i.e., a hyperplane that separates all of the input data with margin larger than $(1 - \epsilon)\rho^*$, where $\rho^*$ is the best achievable margin [10]. They study the running time of a simple coreset algorithm for binary and “structured-output” classification and the use of coresets for active learning and noise-tolerant learning in the agnostic setting.
There have also been probabilistic approaches to the SVM problem. Most notable are the works of Clarkson et al. [6] and Hazan et al. [11]. Hazan et al. used a primal-dual approach combined with Stochastic Gradient Descent (SGD) in order to learn linear SVMs in sublinear time. They propose the SVM-SIMBA approach, which returns, with at least constant probability, an $\epsilon$-approximate solution to the SVM problem that uses hinge loss as the objective function. The key idea in their method is to access single features of the training vectors rather than the entire vectors themselves. However, their method is randomized and returns a correct $\epsilon$-approximation only with probability greater than a constant. Clarkson et al. [6] present sublinear-time (in the size of the input) approximation algorithms for some optimization problems, such as training linear classifiers (e.g., perceptron) and finding the MEB. They introduce a technique that is originally applied to the perceptron algorithm, but extend it to the related problems of MEB and SVM in the hard-margin or L2-SVM formulations. The drawback of their method is that the approximation is guaranteed to be computed successfully only with high probability. Pegasos [17] employs a primal estimated sub-gradient solver whose number of iterations does not depend on the input size.
Joachims presents an alternative approach to training SVMs in linear time based on the cutting plane method that hinges on an alternative formulation of the SVM optimization problem [13]. He shows that the Cutting-Plane algorithm can be leveraged to train SVMs in $O(sn)$ time for classification and $O(sn \log n)$ time for ordinal regression, where $s$ is the average number of non-zero features. However, this approach does not trivially extend to SVMs with kernels and is not sublinear with respect to the number of points $n$.
3 Problem Definition
Given a set of training points $P = \{(x_i, y_i)\}_{i=1}^n$, where $x_i \in \mathbb{R}^d$ and $y_i \in \{-1, 1\}$ for all $i \in [n]$, the soft-margin two-class L2-SVM problem is the following quadratic program:
for some regularization parameter $C > 0$ and query space $\mathcal{Q}$, which is defined to be the set of all candidate margins. We ignore the bias term in the problem formulation for simplicity, since this term can always be embedded into the feature space by expanding the dimension to $d + 1$.
Instead of establishing new algorithms for the SVM problem, we focus on reducing the size of the training data by sampling informative points, while providing each point a proper weight.
3.1 Soft-margin SVM
If the training data are linearly separable, we can select two parallel hyperplanes that separate the two classes of data so that the distance between them is as large as possible. The region bounded by these two hyperplanes is called the margin. To extend to cases in which the data are not linearly separable, a hinge loss function is introduced that penalizes the violation of the margin constraints. This approach is known as soft-margin SVM. Each data point $(x_i, y_i)$ falls into one of three categories. If it lies beyond the margin ($y_i x_i^T w > 1$), it does not contribute to the SVM loss (2). If it lies directly on the margin ($y_i x_i^T w = 1$), the point is a support vector that directly affects the cost function but does not directly add to it. Note that the distance between the separating hyperplane and the points on the margin is exactly $1 / \|w\|_2$; this distance is commonly referred to as the margin as well. Finally, if the point lies within the margin ($y_i x_i^T w < 1$), it adds a cost to (2) proportional to the amount of constraint violation.
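The three cases above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the bias term is omitted (as in the problem definition), the loss is the plain hinge loss $\max(0, 1 - y\, x^T w)$, and the function names `margin_category` and `hinge_loss` are hypothetical helpers rather than part of the paper.

```python
def functional_margin(x, y, w):
    """Return y * <x, w>, the signed score of a labeled point (no bias term)."""
    return y * sum(xi * wi for xi, wi in zip(x, w))


def margin_category(x, y, w, tol=1e-9):
    """Classify a labeled point as lying beyond, on, or within the margin."""
    score = functional_margin(x, y, w)
    if score > 1.0 + tol:
        return "beyond"  # correct side, outside the margin: zero loss
    if score >= 1.0 - tol:
        return "on"      # support vector exactly on the margin: zero loss
    return "within"      # margin violated: positive loss


def hinge_loss(x, y, w):
    """Assumed loss: grows linearly with the amount of margin violation."""
    return max(0.0, 1.0 - functional_margin(x, y, w))


w = [2.0, 0.0]
print(margin_category([1.0, 0.0], 1, w), hinge_loss([1.0, 0.0], 1, w))
print(margin_category([0.5, 0.0], 1, w), hinge_loss([0.5, 0.0], 1, w))
print(margin_category([0.25, 0.0], 1, w), hinge_loss([0.25, 0.0], 1, w))
```

Only points in the last category contribute a nonzero loss term, which is why redundant points far beyond the margin are natural candidates for being dropped or down-weighted.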
The regularization parameter $C$ weights the relative importance of maximizing the margin and satisfying the margin constraint for each data point, and it is used to increase robustness against outliers. Accordingly, if $C$ is very small, margin constraint violation is only penalized weakly, $\|w\|_2$ is small, and the safety margin around the decision boundary will be large. Conversely, if $C$ is very large, violation of the margin constraint is penalized heavily and the formulation approaches the hard-margin SVM case, which is sensitive to outliers in the training set.
Instead of introducing an entirely new algorithm for solving the SVM problem itself, we train on small weighted subsets of the input data, i.e., coresets. The main benefit of using coresets is that we can reduce the runtime by reducing the number of training data points while maintaining a close approximation to the optimal solution of the problem. The following definitions for coresets are used throughout the paper:
Definition 1 (Query Space).
Let $P$ be the set of input points, let $\mathcal{Q}$ denote the set of possible margins over which the SVM optimization is performed, and let $f$ be the SVM objective function given by (2). Then, the tuple $(P, \mathcal{Q}, f)$ is called the query space.
Definition 2 ($\epsilon$-coreset).
Let $S \subseteq P$ be a weighted subset of the input points such that the function $u : S \to \mathbb{R}_{\ge 0}$ maps each point to its corresponding weight. The pair $(S, u)$ is called a coreset with respect to the input points $P$. Let $\mathcal{Q}$ be the query set, i.e., the set of candidate margins, and let $f$ be the SVM objective function given by (2). Then $(S, u)$ is an $\epsilon$-coreset if for every margin $w \in \mathcal{Q}$ we have
$$(1 - \epsilon)\, f(P, w) \le f((S, u), w) \le (1 + \epsilon)\, f(P, w).$$
In other words, $(S, u)$ is an $\epsilon$-coreset if $f((S, u), w)$ is a $(1 \pm \epsilon)$-approximation to the objective function value with all of the training points used, $f(P, w)$. Our overarching goal is to construct an $\epsilon$-coreset $(S, u)$ such that the cardinality of $S$ is sublinear in $n$, the number of data points. In our analysis, we will also rely on the concept of the sensitivity of a point $p \in P$, defined below, which was introduced in prior work on coresets:
Definition 3 (Sensitivity).
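The sensitivity of an arbitrary point $p \in P$ is defined as
$$s(p) = \sup_{w \in \mathcal{Q}} \frac{f(p, w)}{f(P, w)}.$$
Note that $f(p, w)$ represents the contribution of point $p$ to the objective function of the SVM and that
$$f(P, w) = \sum_{p \in P} f(p, w),$$
where $f(P, w)$ is the objective function of the two-class L2-norm SVM as in (2).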
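The coreset property of Definition 2 can be probed numerically on a finite list of candidate margins. The sketch below is illustrative only and rests on assumptions that are not taken from the paper: the objective is written as $\frac{1}{2}\|w\|_2^2 + C \sum_i u_i \max(0, 1 - y_i x_i^T w)$ with an unweighted regularizer, and `svm_objective` and `is_eps_coreset_on` are hypothetical names. Passing the check on a finite set of queries is necessary but not sufficient for the guarantee over all of the query space.

```python
def svm_objective(points, weights, w, C=1.0):
    """Assumed objective: 0.5 * ||w||^2 + C * sum_i u_i * hinge(y_i * <x_i, w>)."""
    reg = 0.5 * sum(wj * wj for wj in w)
    loss = 0.0
    for (x, y), u in zip(points, weights):
        score = y * sum(a * b for a, b in zip(x, w))
        loss += u * max(0.0, 1.0 - score)
    return reg + C * loss


def is_eps_coreset_on(queries, P, S, u, eps, C=1.0):
    """Check |f(P, w) - f((S, u), w)| <= eps * f(P, w) for each listed query w."""
    unit = [1.0] * len(P)  # the full data set carries unit weights
    for w in queries:
        f_full = svm_objective(P, unit, w, C)
        f_core = svm_objective(S, u, w, C)
        if abs(f_full - f_core) > eps * f_full:
            return False
    return True


# Redundant data: each point appears twice, so two points with weight 2 suffice.
P = [([1.0, 0.0], 1), ([1.0, 0.0], 1), ([-1.0, 0.0], -1), ([-1.0, 0.0], -1)]
S = [([1.0, 0.0], 1), ([-1.0, 0.0], -1)]
queries = [[0.0, 0.0], [0.5, 0.0], [2.0, 1.0]]
print(is_eps_coreset_on(queries, P, S, [2.0, 2.0], eps=0.01))
print(is_eps_coreset_on(queries, P, S, [1.0, 1.0], eps=0.01))
```

The toy data makes the role of the weights visible: the same two points form an exact coreset with weights of 2, but fail the check when the weights are left at 1.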
In this section, we prove under mild assumptions that an $\epsilon$-coreset of size polylogarithmic in $n$ and polynomial in $d$ can be constructed efficiently with probability at least $1 - \delta$. At a high level, our algorithm first efficiently approximates the importance of each point $p \in P$, which we refer to as the point’s sensitivity $s(p)$. The number of sample points required for an $\epsilon$-approximation is then computed as a function of the points’ sensitivities using an analogue of the Estimator Theorem covered in class (and by Motwani et al. [15]), i.e., Theorem 7 by Braverman et al. [4].
The outline of our proof is as follows. We begin by enumerating the preliminary material as well as the assumptions we impose on the problem. We then bound the sensitivity of each point by deriving a tight upper bound that can be efficiently computed for all points. Next, we sum the upper bounds on the sensitivities over all points and show that this sum is logarithmic in the number of points $n$. We then invoke Theorem 7 with the computed sum of sensitivities; a straightforward application of the theorem’s expression for the number of points required yields the existence of an $\epsilon$-coreset of size polylogarithmic in the number of points $n$ and polynomial in the number of features $d$. Combining these procedures, we finally arrive at the $(\epsilon, \delta)$-FPRAS, shown as Alg. 1.
In the following, we state some assumptions and results upon which we base our analysis.
Assumption 4 (Normalized Input).
The training data is normalized such that for any $(x, y) \in P$, we have $\|x\|_2 \le 1$.
Assumption 5 (Scaled Input).
The training data is centered around its mean $\bar{x} = \frac{1}{n} \sum_{i=1}^n x_i$, such that $\bar{x} = 0$.
Assumptions 4 and 5 are very commonly fulfilled in practical settings since both normalization and mean-centering of the input points are desirable before the training procedure for more robust results.
Assumption 6 (Bounded Query Space).
Let . The query set is then defined to be the set
In other words, we consider the set of candidate margins that do not entirely separate the labeled data, as is usually the case in target coreset applications that involve an extremely large number of data points. (In future work, we intend to relax this assumption in our sensitivity analysis by leveraging a probabilistic argument in conjunction with the fact that data points are centered.)
Assumption 6 of having a bounded query space is justified by the fact that the points lie within a unit ball and therefore margins of a corresponding scale are reasonable. Moreover, we note that in many coreset applications a bounded query space is a necessary condition for having coresets of sublinear size, as shown for the case of logistic regression [12].
Our analysis further relies on Theorem 7 given by Braverman et al. [4], which is stated below. This result states that for any given overapproximation of the sensitivity $s(p)$, a sample of sufficiently large size, where the size depends on the tightness of the overapproximation, gives an $\epsilon$-coreset with probability at least $1 - \delta$. This theorem, together with our subsequent analysis, will allow us to establish the aforementioned $(\epsilon, \delta)$-FPRAS for computing the margin of an SVM classifier.
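Theorem 7 (Braverman et al. [4]).
Let $s : P \to [0, \infty)$ be a function such that
$$s(p) \ge \sup_{w \in \mathcal{Q}} \frac{f(p, w)}{f(P, w)} \quad \text{for every } p \in P,$$
and let $t = \sum_{p \in P} s(p)$. Further, let $(P, \mathcal{Q}, f)$ denote the query space and let $d'$ be the corresponding VC dimension. Then, for all $\epsilon, \delta \in (0, 1)$, there exists some sufficiently large constant $c$ such that for a random sample $S \subseteq P$ of size
$$|S| \ge \frac{c\, t}{\epsilon^2} \left( d' \log t + \log \frac{1}{\delta} \right) \qquad (5)$$
we have that $p = q$ with probability $s(q)/t$ for every $p \in S$ and $q \in P$. Let
$$u(p) = \frac{t \cdot k(p)}{s(p)\, |S|}$$
be the weight for every $p \in S$, where $k(p)$ is the number of times point $p$ is sampled. Then, with probability at least $1 - \delta$, $(S, u)$ is an $\epsilon$-coreset for $(P, \mathcal{Q}, f)$.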
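The sampling scheme described by Theorem 7 can be sketched as follows. This is a minimal illustration, not the authors' implementation. It assumes the commonly used form of the sensitivity-sampling framework: a sample size of $\frac{c\,t}{\epsilon^2}(d' \log t + \log\frac{1}{\delta})$, i.i.d. draws with probability $s(p)/t$, and weights $t \cdot k(p) / (s(p)\,|S|)$. The constant `c` is unspecified by the theorem and is set to 1 purely for illustration, and the names `sample_size` and `sensitivity_sample` are hypothetical.

```python
import math
import random
from collections import Counter


def sample_size(t, vc_dim, eps, delta, c=1.0):
    """Assumed bound: ceil(c * t / eps^2 * (vc_dim * log t + log(1 / delta)))."""
    return math.ceil(c * t / eps ** 2 * (vc_dim * math.log(t) + math.log(1.0 / delta)))


def sensitivity_sample(s, m, rng=None):
    """Draw m i.i.d. indices with Pr[i] = s[i] / t and weight them by sample counts."""
    rng = rng or random.Random()
    t = sum(s)
    draws = rng.choices(range(len(s)), weights=s, k=m)
    counts = Counter(draws)          # k(p): number of times each point was drawn
    indices = sorted(counts)
    weights = [t * counts[i] / (s[i] * m) for i in indices]
    return indices, weights


# Toy usage with made-up sensitivity upper bounds.
s = [0.9, 0.1, 0.1, 0.5, 0.4]
m = sample_size(t=sum(s), vc_dim=3, eps=0.5, delta=0.1)
indices, weights = sensitivity_sample(s, m, random.Random(0))
print(m, indices, weights)
```

Points with a larger upper bound `s[i]` are drawn more often but receive a proportionally smaller weight per draw, which keeps the weighted sum an unbiased estimate of the full objective.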
4.2 Sensitivity Upper Bound
To be able to derive an $(\epsilon, \delta)$-FPRAS according to Theorem 7, we need to efficiently and tightly upper bound the sensitivity of each point. In particular, we start from the definition of sensitivity in its most basic form and apply multiple insights about SVMs.
Lemma 8 (Sensitivity Bounds).
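The sensitivity $s(p)$ of any arbitrary point $p \in P$ is bounded above by a quantity that can be computed efficiently for each point.
Consider the sensitivity of a particular point $p \in P$. The step labeled (11) in the derivation of the bound follows by Assumption 6.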
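From Lemma 8, we can now derive an upper bound on the total sensitivity $t = \sum_{p \in P} s(p)$:
Corollary 9.
The sum of sensitivities over all points, $t$, is bounded above by a quantity that is logarithmic in the number of points $n$.
Leveraging the inequality established by Lemma 8, we sum the per-point upper bounds over all points in $P$ to obtain the upper bound on the sum of sensitivities $t$.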
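Finally, combining the results above, we present the derivation of the $(\epsilon, \delta)$-FPRAS.
Theorem 10 ($\epsilon$-Coreset FPRAS).
Given any $\epsilon, \delta \in (0, 1)$ and a data set $P$, Algorithm 1 generates an $\epsilon$-coreset, $(S, u)$, of size polylogarithmic in $n$ and polynomial in $d$
for the L2-norm SVM problem with probability at least $1 - \delta$. Moreover, our algorithm runs efficiently, with its running time dominated by the computation of the sensitivity upper bounds.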
First, we leverage the seminal result by Vapnik et al. [19] stating that the VC dimension $d'$ of a separating hyperplane with a given margin is bounded above by a quantity that is at most $d + 1$.
Now, by Theorem 7, we have that the coreset constructed by our algorithm is an $\epsilon$-coreset for the SVM problem with probability at least $1 - \delta$, and the size of the coreset is established by plugging the bound for the sum of sensitivities from Corollary 9 into Equation (5), which yields a size polylogarithmic in $n$ and polynomial in $d$.
Moreover, note that the computation time of our algorithm is dominated by computing the upper bounds on the sensitivity in Algorithm 1, which is performed once per point. ∎
Thus, our coresets are of size polylogarithmic in the number of points $n$ and polynomial in the dimension of the points $d$. Since the dimension is typically much smaller than the number of points in the applications we are considering, the theorem above proves that our approximation scheme is capable of generating an $\epsilon$-approximation to the SVM problem with probability at least $1 - \delta$, even when the coreset size is significantly smaller than the number of original input points $n$.
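Algorithm 1
Input: A set of training points $P$ containing $n$ points, an error parameter $\epsilon \in (0, 1)$, and failure probability $\delta \in (0, 1)$.
Output: An $\epsilon$-coreset $(S, u)$ for the query space $(P, \mathcal{Q}, f)$ with probability at least $1 - \delta$.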
In this section, we give an overview of the algorithm to compute the coreset $(S, u)$, which can then be used to train the SVM classifier. We highlight the important insights of the method before explaining each step in detail.
Our algorithm is based on the idea that for any given dataset $P$, we assign an importance to each data point and then sample from the dataset according to the multinomial distribution emerging from this procedure. The crucial insight of this method is how we assign the importances. In particular, we use an overapproximation of the sensitivity $s(p)$ of each point, obtained from the analysis in the previous section, to assign importances. Following the sampling of points, we further assign weights to each data point, which are proportional to the number of times the point has been sampled. We then train an SVM classifier on the weighted coreset using any standard SVM library. The resulting algorithm is an $(\epsilon, \delta)$-FPRAS for approximating the trained classifier.
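A short calculation shows why weights proportional to sample counts are the natural choice. The derivation below is a sketch under assumptions: the $m = |S|$ points are drawn i.i.d. with probability $s(q)/t$ each, $k(q)$ counts how often $q$ is drawn (so $\mathbb{E}[k(q)] = m\, s(q)/t$), the weights take the assumed form $u(q) = t\, k(q) / (s(q)\, m)$, and the objective decomposes as $f(P, w) = \sum_{q \in P} f(q, w)$.

```latex
\begin{align*}
\mathbb{E}\Big[\sum_{p \in S} u(p)\, f(p, w)\Big]
  &= \sum_{q \in P} \mathbb{E}[k(q)] \cdot \frac{t}{s(q)\, m} \cdot f(q, w) \\
  &= \sum_{q \in P} \frac{m\, s(q)}{t} \cdot \frac{t}{s(q)\, m} \cdot f(q, w) \\
  &= \sum_{q \in P} f(q, w) \;=\; f(P, w).
\end{align*}
```

Under these assumptions the weighted coreset objective is an unbiased estimator of the full objective for every fixed query $w$. The sample size requirement of Theorem 7 is what turns this pointwise statement into a guarantee that holds simultaneously for all queries.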
The overall method to compute the desired coreset is outlined in Algorithm 1. Given a set of input data $P$, an error parameter $\epsilon$, and the desired failure probability $\delta$, the algorithm returns an $\epsilon$-coreset $(S, u)$ for the query space $(P, \mathcal{Q}, f)$ with probability at least $1 - \delta$. First, we compute the importance of each point, i.e., the upper bound on the sensitivity of the point (see Lemma 8 for more details). Next, we compute the necessary number of samples to include in $S$ according to Theorem 7, and we then sample from the resulting multinomial distribution. Finally, the sampled points are weighted according to Theorem 7.
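Once the weighted coreset is available, the final step is to train on it. The sketch below stands in for "any standard SVM library" with a small, dependency-free weighted subgradient descent. It is an assumption-laden illustration rather than the solver used in the paper: it minimizes $\frac{1}{2}\|w\|_2^2 + C \sum_i u_i \max(0, 1 - y_i x_i^T w)$ without a bias term, and the function name `train_weighted_svm`, the step size schedule, and the iteration count are arbitrary choices.

```python
def train_weighted_svm(points, weights, C=1.0, steps=500, lr=0.1):
    """Full-batch subgradient descent on 0.5 * ||w||^2 + C * sum_i u_i * hinge_i."""
    d = len(points[0][0])
    w = [0.0] * d
    for k in range(1, steps + 1):
        grad = list(w)  # gradient of the regularizer 0.5 * ||w||^2
        for (x, y), u in zip(points, weights):
            score = y * sum(a * b for a, b in zip(x, w))
            if score < 1.0:  # margin violated: hinge subgradient is -y * x
                for j in range(d):
                    grad[j] -= C * u * y * x[j]
        step = lr / k ** 0.5  # diminishing step size
        w = [wj - step * gj for wj, gj in zip(w, grad)]
    return w


def predict(w, x):
    """Sign of <x, w>, mapping a zero score to the negative class."""
    return 1 if sum(a * b for a, b in zip(x, w)) > 0 else -1


# Toy weighted coreset: two points standing in for a larger data set.
coreset = [([1.0, 0.0], 1), ([-1.0, 0.0], -1)]
u = [1.0, 1.0]
w = train_weighted_svm(coreset, u)
print(w, predict(w, [0.3, 0.2]), predict(w, [-0.4, 0.1]))
```

In practice, an off-the-shelf solver that accepts per-sample weights would replace `train_weighted_svm`; the only requirement is that each loss term is multiplied by the weight of its coreset point.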
We evaluate the performance of our coreset construction algorithm on two real-world, publicly available data sets [14] and compare its effectiveness to Core Vector Machines (CVMs), another approximation algorithm for SVMs [18], and to categorical uniform sampling, i.e., sampling a number of data points uniformly from each label. In particular, we selected a set of subsample sizes for each data set and ran all three of the aforementioned construction algorithms to construct subsamples of each size and evaluate their accuracy. The results were averaged across trials for each subsample size. Our experiments were implemented in Python and performed on a 3.2GHz i7-6900K (8 cores total) machine.
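The categorical uniform sampling baseline can be sketched as follows. The source only states that points are sampled uniformly from each label; everything else here is an assumption made for illustration, including sampling without replacement, the per-label weight of (label count) / (samples drawn), and the name `categorical_uniform_sample`.

```python
import random
from collections import defaultdict


def categorical_uniform_sample(labels, per_label, rng=None):
    """Sample up to per_label indices uniformly from each label class."""
    rng = rng or random.Random()
    by_label = defaultdict(list)
    for i, y in enumerate(labels):
        by_label[y].append(i)
    indices, weights = [], []
    for y in sorted(by_label):
        idx = by_label[y]
        k = min(per_label, len(idx))
        chosen = rng.sample(idx, k)              # uniform, without replacement
        indices.extend(chosen)
        weights.extend([len(idx) / k] * k)       # assumed weight: class size / k
    return indices, weights


labels = [1] * 6 + [-1] * 4
indices, weights = categorical_uniform_sample(labels, per_label=2, rng=random.Random(0))
print(indices, weights)
```

With this weighting, the weights within each class sum to the class size, so the subsample preserves the label proportions of the full data set.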
6.1 Credit Card Dataset
The Credit Card dataset (https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients) contains entries, each consisting of features of a client, such as education and age. The goal of the classification task is to predict whether a client will default on a payment.
Figure 2 depicts the accuracy of the subsamples generated by each algorithm and the computation time required to construct the subsamples. Our results show that for a given subsample size, our coreset construction algorithm runs more efficiently and achieves significantly higher approximation accuracy with respect to the SVM objective function in comparison to uniform sampling and CVM. Note that the relative error is still relatively high since the dataset is not very large and benefits of coresets become even more visible for significantly larger datasets.
As can be seen in Fig. 3 our coreset sampling process noticeably differs from uniform sampling and some data points are sampled with much higher probability than others. This is in line with the idea of sampling more important points with higher probability.
6.2 Skin Dataset
The Skin dataset (https://archive.ics.uci.edu/ml/datasets/Skin+Segmentation/) consists of data points whose attributes are random samples of B, G, R values from face images. The goal of the classification task is to determine whether these samples are skin or non-skin samples.
Our coreset outperforms uniform sampling for all coreset sizes (cf. Fig. 4), while the computation time of coreset generation and SVM training remains a fraction of the time required for training on the original dataset. Due to its poor performance, CVM is omitted from the results shown. Note that this dataset is significantly larger than the Credit Card dataset, and thus the advantages in error and, most significantly, runtime are more prominent.
We presented an efficient coreset algorithm for obtaining substantial speedups in SVM training at the cost of small, provably bounded approximation error. Our approach relies on the intuitive fact that data is often redundant and that some data points are more important than others. We showed that by obtaining tight bounds on the importance, i.e., sensitivity, of each point, coresets of size polylogarithmic in the number of points and polynomial in the dimension of the points can be efficiently constructed. To the best of our knowledge, this paper presents the first method for constructing coresets of this size that is also applicable to streaming settings, using the merge-and-reduce approach familiar from the coreset literature. Our favorable empirical results demonstrate the effectiveness of our algorithm in accelerating the training of SVMs on real-world data sets. We conjecture that our coreset construction method can be extended to significantly speed up SVM training for nonlinear kernels as well as other popular machine learning algorithms, such as deep learning.
- [1] Pankaj K Agarwal, Sariel Har-Peled, and Kasturi R Varadarajan. Geometric approximation via coresets. Combinatorial and computational geometry, 52:1–30, 2005.
- [2] Anonymous. Small coresets to represent large training data for support vector machines. International Conference on Learning Representations, 2018.
- [3] Mihai Badoiu and Kenneth L Clarkson. Smaller core-sets for balls. In Proceedings of the fourteenth annual ACM-SIAM symposium on Discrete algorithms, pages 801–802. Society for Industrial and Applied Mathematics, 2003.
- [4] Vladimir Braverman, Dan Feldman, and Harry Lang. New frameworks for offline and streaming coreset constructions. arXiv preprint arXiv:1612.00889, 2016.
- [5] Kenneth L Clarkson. Coresets, sparse greedy approximation, and the Frank-Wolfe algorithm. ACM Transactions on Algorithms (TALG), 6(4):63, 2010.
- [6] Kenneth L Clarkson, Elad Hazan, and David P Woodruff. Sublinear optimization for machine learning. Journal of the ACM (JACM), 59(5):23, 2012.
- [7] Dan Feldman, Matthew Faulkner, and Andreas Krause. Scalable training of mixture models via coresets. In Advances in neural information processing systems, pages 2142–2150, 2011.
- [8] Dan Feldman and Michael Langberg. A unified framework for approximating and clustering data. In Proceedings of the forty-third annual ACM symposium on Theory of computing, pages 569–578. ACM, 2011.
- [9] Bernd Gärtner and Martin Jaggi. Coresets for polytope distance. In Proceedings of the twenty-fifth annual symposium on Computational geometry, pages 33–42. ACM, 2009.
- [10] Sariel Har-Peled, Dan Roth, and Dav Zimak. Maximum margin coresets for active and noise tolerant learning. In IJCAI, pages 836–841, 2007.
- [11] Elad Hazan, Tomer Koren, and Nati Srebro. Beating SGD: Learning SVMs in sublinear time. In Advances in Neural Information Processing Systems, pages 1233–1241, 2011.
- [12] Jonathan H Huggins, Trevor Campbell, and Tamara Broderick. Coresets for scalable Bayesian logistic regression. arXiv preprint arXiv:1605.06423, 2016.
- [13] Thorsten Joachims. Training linear SVMs in linear time. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 217–226. ACM, 2006.
- [14] M. Lichman. UCI machine learning repository, 2013.
- [15] Rajeev Motwani and Prabhakar Raghavan. Randomized algorithms. Chapman & Hall/CRC, 2010.
- [16] Manu Nandan, Pramod P Khargonekar, and Sachin S Talathi. Fast SVM training using approximate extreme points. Journal of Machine Learning Research, 15(1):59–98, 2014.
- [17] Shai Shalev-Shwartz, Yoram Singer, and Nathan Srebro. Pegasos: Primal estimated sub-gradient solver for SVM. In Proceedings of the 24th international conference on Machine learning, pages 807–814. ACM, 2007.
- [18] Ivor W Tsang, James T Kwok, and Pak-Ming Cheung. Core vector machines: Fast SVM training on very large data sets. Journal of Machine Learning Research, 6(Apr):363–392, 2005.
- [19] Vladimir Naumovich Vapnik. Statistical learning theory, volume 1. Wiley New York, 1998.