Distributed Inference for Linear Support Vector Machine
The growing size of modern data brings many new challenges to existing statistical inference methodologies and theories, and calls for the development of distributed inferential approaches. This paper studies distributed inference for linear support vector machine (SVM) for the binary classification task. Despite a vast literature on SVM, much less is known about the inferential properties of SVM, especially in a distributed setting. In this paper, we propose a multi-round distributed linear-type (MDL) estimator for conducting inference for linear SVM. The proposed estimator is computationally efficient. In particular, it only requires an initial SVM estimator and then successively refines the estimator by solving simple weighted least squares problems. Theoretically, we establish the Bahadur representation of the estimator. Based on the representation, the asymptotic normality is further derived, which shows that the MDL estimator achieves the optimal statistical efficiency, i.e., the same efficiency as the classical linear SVM applied to the entire dataset in a single machine setup. Moreover, our asymptotic result avoids the condition on the number of machines or data batches, which is commonly assumed in the distributed estimation literature, and allows the case of diverging dimension. We provide simulation studies to demonstrate the performance of the proposed MDL estimator.
Keywords: Linear support vector machine, divide-and-conquer, Bahadur representation, asymptotic theory
The development of modern technology has enabled data collection of unprecedented size. Very large-scale datasets, such as collections of images, text, transactional data, and sensor network data, are becoming prevalent, with examples ranging from digitalized books and newspapers, to collections of images on Instagram, to data generated by large-scale networks of sensing devices or mobile robots. The scale of these data brings new challenges to traditional statistical estimation and inference methods, particularly in terms of memory restriction and computation time. For example, a large text corpus easily exceeds the memory limitation and thus cannot be loaded into memory all at once. In a sensor network, the data are collected by each sensor in a distributed manner. Transferring all the data to a center for processing would incur an excessively high communication cost, and moreover, the center might not have enough memory to store all the data collected from different sensors. In addition to memory constraints, these large-scale datasets also pose challenges in computation. It is computationally very expensive to directly apply an off-the-shelf optimization solver for computing the maximum likelihood estimator (or empirical risk minimizer) on the entire dataset. These challenges call for new statistical inference approaches that are able to not only handle large-scale datasets efficiently but also achieve the same statistical efficiency as classical approaches.
In this paper, we study the problem of distributed inference for linear support vector machine (SVM). SVM, introduced by Cortes and Vapnik (1995), has been one of the most popular classifiers in statistical machine learning, which finds a wide range of applications in image analysis, medicine, finance, and other domains. Due to the importance of SVM, various parallel SVM algorithms have been proposed in machine learning literature; see, e.g., Graf et al. (2005); Forero et al. (2010); Zhu et al. (2008); Hsieh et al. (2014) and an overview in Wang and Zhou (2012). However, these algorithms mainly focus on addressing the computational issue for SVM, i.e., developing a parallel optimization procedure to minimize the objective function of SVM that is defined on given finite samples. In contrast, our paper aims to address the statistical inference problem, which is fundamentally different. More precisely, the task of distributed inference is to construct an estimator for the population risk minimizer in a distributed setting and to characterize its asymptotic behavior (e.g., establishing its limiting distribution).
As the size of data becomes increasingly large, distributed inference has received a lot of attention and algorithms have been proposed for various problems (please see the related work in Section 2 and references therein for more details). However, the problem of SVM possesses its own unique challenges in distributed inference. First, SVM is a classification problem that involves binary outputs. Thus, as compared to regression problems, the noise structure in SVM is different and more complicated, which brings new technical challenges. We will elaborate on this point in more detail in Remark 3.1. Second, the hinge loss in SVM is non-smooth. Third, instead of considering the fixed dimension as in many existing theories on asymptotic properties of SVM parameters (see, e.g., Lin (1999); Zhang (2004); Blanchard et al. (2008); Koo et al. (2008)), we aim to study the diverging-dimension case, i.e., the dimension is allowed to grow to infinity with the sample size.
To address the aforementioned challenges, we focus on distributed inference for linear SVM, as a first step toward the study of distributed inference for more general SVM. (Footnote: Our result relies on the Bahadur representation of the linear SVM estimator (see, e.g., Koo et al. (2008)). For general SVM, to the best of our knowledge, the Bahadur representation in a single machine setting is still open, which has to be developed before investigating distributed inference for general SVM. Thus, we leave this for future investigation.) Our goal is three-fold:
The obtained estimator should achieve the same statistical efficiency as merging all the data together. That is, the distributed inference should not lose any statistical efficiency as compared to the “oracle” single machine setting.
We aim to avoid any condition on the number of machines (or the number of data batches). Although this condition is widely assumed in distributed inference literature (see Lian and Fan (2017) and Section 2 for more details), removing such a condition will make the results more useful in cases when the size of the entire dataset is much larger than the memory size or in applications of sensor networks with a large number of sensors.
The proposed algorithm should be computationally efficient.
To simultaneously achieve these three goals, we develop a multi-round distributed linear-type (MDL) estimator for linear SVM. In particular, by smoothing the hinge loss using a special kernel smoothing technique adopted from the quantile regression literature (Horowitz, 1998; Pang et al., 2012; Chen et al., 2018), we first introduce a linear-type estimator in a single machine setup. Our linear-type estimator requires a consistent initial SVM estimator that can be easily obtained by solving SVM on one local machine. Given the initial estimator , the linear-type estimator has a simple and explicit formula that greatly facilitates the distributed computing. Roughly speaking, given samples for , our linear-type estimator takes the form of “weighted least squares”:
where the term is a weighted gram matrix and is the weight that only depends on the -th data and . In the vector , is a fixed vector that only depends on and is the weight that only depends on . The formula in (1) has a similar structure as weighted least squares, and thus can be easily computed in a distributed environment (noting that each term in and only involves the -th data point and there is no interaction term in (1)). In addition, the linear-type estimator in (1) can be efficiently computed by solving a linear equation system (instead of computing matrix inversion explicitly), which is computationally more attractive than solving the non-smooth optimization in the original linear SVM formulation.
The linear-type estimator can easily refine itself by using the estimator on the left-hand side of (1) as the initial estimator. In other words, we can obtain a new linear-type estimator by recomputing the right-hand side of (1) using as the initial estimator. By successively refining the initial estimator for rounds/iterations, we could obtain the final multi-round distributed linear-type (MDL) estimator . The estimator not only has its advantage in terms of computation in a distributed environment, but also has desirable statistical properties. In particular, with a small number of rounds, the estimator is able to achieve the optimal statistical efficiency, that is, the same efficiency as the classical linear SVM estimator computed on the entire dataset. To establish the limiting distribution and statistical efficiency results, we first develop the Bahadur representation of our MDL estimator of SVM (see Theorem 4.3). Then the asymptotic normality follows immediately from the Bahadur representation. It is worthwhile noting that the Bahadur representation (see, e.g., Bahadur (1966); Koenker and Bassett Jr (1978); Chaudhuri (1991)) provides an important characterization of the asymptotic behavior of an estimator. For the original linear SVM formulation, Koo et al. (2008) first established the Bahadur representation. In this paper, we establish the Bahadur representation of our multi-round linear-type estimator.
Finally, it is worthwhile noting that our algorithm is similar to a recently developed algorithm for distributed quantile regression (Chen et al., 2018), where both algorithms rely on a kernel smoothing technique and linear-type estimators. However, the technique for establishing the theoretical property for linear SVM is quite different from that for quantile regression. The difference and new technical challenges in linear SVM will be illustrated in Remark 3.1 (see Section 3).
The rest of the paper is organized as follows. In Section 2, we provide a brief overview of related works. Section 3 first introduces the problem setup and then describes the proposed linear-type estimator and MDL estimator for linear SVM. In Section 4, the main theoretical results are given. Section 5 provides the simulation studies to illustrate the performance of MDL estimator of SVM. Conclusions and future works are given in Section 6. We provide the proofs of our main results in Appendix A.
2 Related Works
In the distributed inference literature, the divide-and-conquer (DC) approach is one of the most popular approaches and has been applied to a wide range of statistical problems. In the standard DC framework, the entire dataset of i.i.d. samples is evenly split into batches or distributed on local machines. Each machine computes a local estimator using the local samples. Then, the final estimator is obtained by averaging local estimators. The performance of the DC approach (or its variants) has been investigated on many statistical problems, such as density parameter estimation (Li et al., 2013), kernel ridge regression (Zhang et al., 2015), high-dimensional linear regression (Lee et al., 2017) and generalized linear models (Chen and Xie, 2014; Battey et al., 2018), semi-parametric partial linear models (Zhao et al., 2016), quantile regression (Volgushev et al., 2017; Chen et al., 2018), principal component analysis (Fan et al., 2017), one-step estimator (Huang and Huo, 2015), high-dimensional SVM (Lian and Fan, 2017), M-estimators with cubic rate (Shi et al., 2017), and some non-standard problems where rates of convergence are slower than the usual root-n rate and limit distributions are non-Gaussian (Banerjee et al., 2018). On one hand, the DC approach enjoys low communication cost since it only requires one-shot communication (i.e., taking the average of local estimators). On the other hand, almost all the existing work on DC approaches requires a constraint on the number of machines. The main reason is that averaging only reduces the variance but not the bias of each local estimator. To make the variance the dominating term in the final estimator via averaging, the constraint on the number of machines is unavoidable. In particular, in the DC approach for linear SVM in Lian and Fan (2017), the number of machines has to satisfy the condition (see Remark 1 in Lian and Fan (2017)).
As a comparison, our MDL estimator that involves multi-round aggregations successfully eliminates this condition on the number of machines.
In fact, to relax this constraint, several multi-round distributed methods have been recently developed (see Wang et al. (2017) and Jordan et al. (2018)). In particular, the key idea behind these methods is to approximate the Newton step by using the local Hessian matrix computed on a local machine. However, to compute the local Hessian matrix, their methods require second-order differentiability of the loss function and thus are not applicable to problems involving a non-smooth loss such as SVM.
The second line of related research is the support vector machine (SVM). Since it was proposed by Cortes and Vapnik (1995), there has been a large body of literature on SVM from both the machine learning and statistics communities. The readers might refer to the books (Cristianini and Shawe-Taylor, 2000; Schölkopf and Smola, 2002; Steinwart and Christmann, 2008) for a comprehensive review of SVM. In this section, we briefly mention a few relevant works on the statistical properties of linear SVM. In particular, the Bayes risk consistency and the rate of convergence of SVM have been extensively investigated (see, e.g., Lin (1999); Zhang (2004); Blanchard et al. (2008); Bartlett et al. (2006)). These works mainly concern the asymptotic risk. For the asymptotic properties of underlying coefficients, Koo et al. (2008) first established the Bahadur representation of linear SVM under the fixed-dimension setting. Jiang et al. (2008) proposed interval estimators for the prediction error for general SVM. For the large-dimension case, there are two common settings. One assumes that the dimension grows to infinity at a slower rate than (or linear in) the sample size but without any sparsity assumption. Our paper also belongs to this setup. Under this setup, Huang (2017) investigated the angle between the normal direction vectors of the SVM separating hyperplane and the corresponding Bayes optimal separating hyperplane under spiked population models. Another line of research considers high-dimensional SVM under a certain sparsity assumption on underlying coefficients. Under this setup, Peng et al. (2016) established the error bound in -norm. Zhang et al. (2016a) and Zhang et al. (2016b) investigated the variable selection problem in linear SVM.
In a standard binary classification problem setting, we consider a pair of random variables with and . The marginal distribution of is given by and where and . We assume that the random vector has a continuous distribution on given . Let be i.i.d. samples drawn from the joint distribution of random variables . In the linear classification problem, a hyperplane is defined by with . Define and the coefficient vector . For convenience purpose we also define . In this paper we consider the standard non-separable SVM formulation, which takes the following form
Here is the hinge loss and is the regularization parameter. The corresponding population loss function is defined as
We denote the minimizer for the population loss by
Koo et al. (2008) proved that under some mild conditions (see Theorems 1 and 2 of Koo et al. (2008)), there exists a unique minimizer for (4) and it is nonzero (i.e., ). We assume that these conditions hold throughout the paper. The minimizer of the population loss function will serve as the “true parameter” in our estimation problem and the goal is to construct an estimator and conduct inference on . We further define some useful quantities as follows
We use the notation because it plays a similar role in the theoretical analysis as the noise term in a standard regression problem. However, as we will show in Sections 3 and 4, the behavior of is quite different from the noise in a classical regression setting since it does not have a continuous density function (see Remark 3.1). Next, denoting by the Dirac delta function, we define
where is the indicator function.
The quantities and can be viewed as the gradient and Hessian matrix of and we assume that the smallest eigenvalue of is bounded away from 0. In fact, these assumptions can be verified under some regularity conditions (see Lemmas 2, 3 and 5 of Koo et al. (2008) for details) and are common in the SVM literature (e.g., Conditions 2 and 6 of Zhang et al. (2016b)).
3.2 A Linear-type Estimator for SVM
In this section, we first propose a linear-type estimator for SVM on a single machine which can be later extended to a distributed algorithm. The main challenge in solving the optimization problem in (2) is that the objective function is non-differentiable due to the appearance of hinge loss. Motivated by a smoothing technique from quantile regression literature (see e.g. Chen et al. (2018); Horowitz (1998); Pang et al. (2012)), we consider a smooth function satisfying if and if . We replace the hinge loss with its smooth approximation , where is the bandwidth. As the bandwidth , and approaches the indicator function and Dirac delta function respectively, and approximates the hinge loss (see Figure 1 for an example of with different bandwidths). To motivate our linear-type estimator, we first consider the following estimator with the non-smooth hinge loss in linear SVM replaced by its smooth approximation:
Since the objective function is differentiable and , by the first order condition (i.e., setting the derivative of the objective function (6) to zero), satisfies
We first rearrange the equation and express by
This fixed-point form formula for cannot be solved explicitly since appears on both sides of (7). Nevertheless, is not our final estimator and is mainly introduced to motivate our estimator. The key idea is to replace on the right hand side of (7) by a consistent initial estimator (e.g. can be constructed by solving a linear SVM on a small batch of samples). Then, we obtain the following linear-type estimator for :
Notice that (8) has a similar structure as weighted least squares (see the explanations in the paragraph below (1) in the introduction). As shown in the following section, this weighted least squares formulation can be computed efficiently in a distributed setting.
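The construction above can be sketched numerically. The following is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the smoothed loss has the form K_h(u) = u H(u/h), that the covariate matrix `X` carries a leading column of ones so the intercept is the first coefficient, that the intercept is left unpenalized, and that H is the integrated biweight kernel. The helper names (`H`, `H_prime`, `penalty_grad`, `crude_svm`, `linear_type_step`), the synthetic data, and the values of `lam` and `h` are all hypothetical choices made for the example.

```python
import numpy as np

def H(v):
    """Smooth step: 0 for v <= -1, 1 for v >= 1 (integrated biweight kernel, an assumed choice)."""
    v = np.clip(np.asarray(v, dtype=float), -1.0, 1.0)
    return 0.5 + (15.0 / 16.0) * (v - (2.0 / 3.0) * v**3 + 0.2 * v**5)

def H_prime(v):
    """Derivative of H: the biweight kernel, supported on (-1, 1)."""
    v = np.asarray(v, dtype=float)
    return np.where(np.abs(v) < 1.0, (15.0 / 16.0) * (1.0 - v**2) ** 2, 0.0)

def penalty_grad(beta, lam):
    """Gradient of (lam / 2) * ||slopes||^2; the intercept beta[0] is unpenalized (assumption)."""
    return lam * np.concatenate(([0.0], beta[1:]))

def crude_svm(Xb, yb, lam, steps=1000, lr=0.1):
    """Rough initial estimator: plain subgradient descent on the hinge objective (illustrative only)."""
    beta = np.zeros(Xb.shape[1])
    for _ in range(steps):
        active = yb * (Xb @ beta) < 1.0              # points with positive hinge loss
        grad = -(Xb[active].T @ yb[active]) / len(yb) + penalty_grad(beta, lam)
        beta = beta - lr * grad
    return beta

def linear_type_step(X, y, beta0, h, lam):
    """One linear-type update: solve a weighted-least-squares-like linear system."""
    n = X.shape[0]
    r = (1.0 - y * (X @ beta0)) / h                  # scaled margin residuals at beta0
    u = H_prime(r) / h                               # weights in the Gram matrix
    v = H(r) + H_prime(r) / h                        # weights in the right-hand-side vector
    A = (X * u[:, None]).T @ X / n                   # weighted Gram matrix
    b = X.T @ (v * y) / n - penalty_grad(beta0, lam)
    return np.linalg.solve(A, b)                     # no explicit matrix inverse

# Synthetic two-class Gaussian data (purely illustrative).
rng = np.random.default_rng(0)
n, p = 4000, 3
y = rng.choice([-1.0, 1.0], size=n)
X = np.column_stack([np.ones(n), 0.7 * y[:, None] + rng.standard_normal((n, p))])

lam, h = 0.01, 0.5
beta0 = crude_svm(X[:500], y[:500], lam)             # initial estimator from a small batch
beta1 = linear_type_step(X, y, beta0, h, lam)        # refined estimator on all samples
print("initial:", np.round(beta0, 3))
print("refined:", np.round(beta1, 3))
```

Because the labels satisfy y_i^2 = 1, this update is algebraically equal to the initial estimator minus the inverse weighted Gram matrix applied to the gradient of the smoothed objective at the initial estimator, so under these assumptions it can be read as a Newton-type step with a simplified curvature matrix.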
3.3 Multi-Round Distributed Linear-type (MDL) Estimator
It is important to notice that given the initial estimator , the linear-type estimator in (8) only involves summation of matrices and vectors computed for each individual data point. Therefore based on (8), we will construct a multi-round linear-type estimator (MDL estimator) that can be efficiently implemented in a distributed setting.
First let us assume that the total data indices are divided into subsets with equal size . Denote by the data in the -th local machine. In order to compute , for each batch of data for , we define the following quantities
Given , the quantities can be computed independently in each machine and only has to be stored and transferred to the central machine. Then after receiving from all the machines, the central machine can aggregate the data and compute the estimator by
Then can be sent to all the machines to repeat the whole process to construct using as the new initial estimator. The algorithm is repeated times for a pre-specified (see (18) for details), and is taken to be the final estimator (see Algorithm 1 for details). We name this estimator as the multi-round distributed linear-type (MDL) estimator.
We notice that instead of computing a matrix inversion in every iteration, which has a computation cost , one only needs to solve a linear system in (10). Linear systems have been studied in numerical optimization for several decades and many efficient algorithms have been developed, such as the conjugate gradient method (Hestenes and Stiefel (1952)). We also notice that we only have to solve a single optimization problem on one local machine to compute the initial estimator. Then at each iteration, only matrix multiplications and summations need to be computed locally, which makes the algorithm computationally efficient. It is worthwhile noticing that according to Theorem 4.4 in Section 4, under some mild conditions, if we choose for , the MDL estimator achieves optimal statistical efficiency as long as satisfies (18), which is usually a small number. Therefore a few rounds of iterations would guarantee a good performance for the MDL estimator.
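The multi-round procedure described above can be sketched as follows. This is a hedged illustration rather than the paper's Algorithm 1: machines are simulated as a list of `(X_k, y_k)` batches, the smoothed loss is assumed to be K_h(u) = u H(u/h) with H the integrated biweight kernel, the intercept (first coefficient) is assumed unpenalized, and the bandwidth sequence `[0.5, 0.3, 0.2]` is an arbitrary illustrative choice, not the schedule prescribed by the theory. All function names are hypothetical.

```python
import numpy as np

def H(v):
    """Smooth step: 0 for v <= -1, 1 for v >= 1 (integrated biweight kernel, an assumed choice)."""
    v = np.clip(np.asarray(v, dtype=float), -1.0, 1.0)
    return 0.5 + (15.0 / 16.0) * (v - (2.0 / 3.0) * v**3 + 0.2 * v**5)

def H_prime(v):
    """Derivative of H: the biweight kernel, supported on (-1, 1)."""
    v = np.asarray(v, dtype=float)
    return np.where(np.abs(v) < 1.0, (15.0 / 16.0) * (1.0 - v**2) ** 2, 0.0)

def penalty_grad(beta, lam):
    """Gradient of (lam / 2) * ||slopes||^2; the intercept beta[0] is unpenalized (assumption)."""
    return lam * np.concatenate(([0.0], beta[1:]))

def crude_svm(Xb, yb, lam, steps=1000, lr=0.1):
    """Rough initial estimator on one machine: subgradient descent on the hinge objective."""
    beta = np.zeros(Xb.shape[1])
    for _ in range(steps):
        active = yb * (Xb @ beta) < 1.0
        grad = -(Xb[active].T @ yb[active]) / len(yb) + penalty_grad(beta, lam)
        beta = beta - lr * grad
    return beta

def local_summaries(Xk, yk, beta, h):
    """Work done on one machine: a (p+1) x (p+1) matrix and a (p+1)-vector, both plain sums."""
    r = (1.0 - yk * (Xk @ beta)) / h
    U = (Xk * (H_prime(r) / h)[:, None]).T @ Xk
    V = Xk.T @ (yk * (H(r) + H_prime(r) / h))
    return U, V

def mdl_estimator(batches, beta_init, bandwidths, lam):
    """Multi-round refinement: one communication round per bandwidth."""
    n = sum(len(yk) for _, yk in batches)
    beta = np.array(beta_init, dtype=float)
    for h in bandwidths:
        summaries = [local_summaries(Xk, yk, beta, h) for Xk, yk in batches]
        U = sum(s[0] for s in summaries)             # aggregation on the central machine
        V = sum(s[1] for s in summaries)
        beta = np.linalg.solve(U / n, V / n - penalty_grad(beta, lam))
    return beta

# Synthetic two-class Gaussian data split evenly across simulated machines.
rng = np.random.default_rng(0)
n, p, n_machines = 6000, 3, 6
y = rng.choice([-1.0, 1.0], size=n)
X = np.column_stack([np.ones(n), 0.7 * y[:, None] + rng.standard_normal((n, p))])
batches = list(zip(np.array_split(X, n_machines), np.array_split(y, n_machines)))

lam, bandwidths = 0.01, [0.5, 0.3, 0.2]
beta_init = crude_svm(*batches[0], lam)              # initial estimator from the first machine only
beta_dist = mdl_estimator(batches, beta_init, bandwidths, lam)
beta_pooled = mdl_estimator([(X, y)], beta_init, bandwidths, lam)
print("distributed:", np.round(beta_dist, 4))
print("pooled:     ", np.round(beta_pooled, 4))
```

In this sketch each machine communicates only one small matrix and one vector per round, and because the update involves sums over individual data points with no interaction terms, the distributed result coincides with the pooled computation up to floating-point error.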
For the choice of the initial estimator in the first iteration, we propose to construct it by solving the original SVM optimization (2) only on a small batch of samples (e.g. the samples on the first machine ). The estimator is only a crude estimator for , but we will prove later that it is enough for the algorithm to produce an estimator with optimal statistical efficiency under some regularity conditions. In particular, if we compute the initial estimator in the first round on the first batch of data, we will solve the following optimization problem
Then we have the following proposition from Zhang et al. (2016b).
According to our Theorem 4.3, the initial estimator needs to satisfy , and therefore the estimator computed on the first machine is a valid initial estimator. On the other hand, one can always use different approaches to construct the initial estimator as long as it has a convergence rate .
We note that although Algorithm 1 has a form similar to the DC-LEQR estimator for quantile regression (QR) in Chen et al. (2018), the structures of the SVM and QR problems are fundamentally different and thus the theoretical development for establishing the Bahadur representations for SVM is more challenging. To see that, let us recall the quantile regression model:
where is the unobserved random noise satisfying and is known as the quantile level. The asymptotic results of QR estimators heavily rely on the Lipschitz continuity assumption on the conditional density of given , which has been assumed in almost all existing literature. In the SVM problem, the quantity plays a similar role as the noise in a regression problem. However, since is binary, the conditional distribution becomes a two-point distribution, which no longer has a density function. To address this challenge and derive the asymptotic behavior of SVM, we directly work on the joint distribution of and . As the dimension of (i.e., ) can go to infinity, we use a slicing technique by considering the one-dimensional marginal distribution of (see Condition (C2) and proof of Theorem 4.3 for more details).
4 Theoretical Results
In this section, we give a Bahadur representation of the MDL estimator and establish its asymptotic normality result. From (8), the difference between the MDL estimator and the true coefficient can be written as
where , and for any , and are defined as follows
For a good initial estimator which is close to , the quantities and are close to and . Recall that and we have
When is close to zero, the term in parenthesis of approximates . Therefore, will be close to . Moreover, since approximates Dirac delta function as , approaches
When is large, will be close to its corresponding population quantity defined in (5).
According to the above argument, when is close to , and approximate and , respectively. Therefore, by (12), we would expect to be close to the following quantity,
We will see later that (13) is exactly the main term of the Bahadur representation of the estimator. Next we formalize these statements and present the asymptotic properties of and in Proposition 4.1 and 4.2. The asymptotic properties of the MDL estimator will be provided in Theorem 4.3. To this end, we first introduce some notations and assumptions for the theoretical result.
Recall that and for let be a ()-dimensional vector with removed from . Similar notations are used for . Since we assumed that , without loss of generality, we assume and its absolute value is lower bounded by some constant (i.e., ). Let and be the density functions of when and respectively. Let be the conditional density function of given and be the joint density of . Similar notations are used for .
We state some regularity conditions to facilitate theoretical development of asymptotic properties of and .
(C0) There exists a unique nonzero minimizer for (4) with , and , for some constant .
(C1) and for some constants .
(C2) Assume that , , , , and for some constant . Also assume . Similar assumptions are made for .
(C3) Assume that and for some and .
(C4) The smoothing function satisfies if and if , and also assume that is twice differentiable and is bounded. Moreover, assume that .
As we discussed in Section 3.1, condition (C0) is a standard assumption which can be implied by some mild conditions (see (C1)-(C4) of Koo et al. (2008)). Condition (C1) is a mild condition on the boundedness of . Condition (C2) is a regularity condition on the conditional density of and , and it is satisfied by commonly used density functions, e.g., the Gaussian distribution and the uniform distribution. Condition (C3) is a sub-Gaussian condition on . Condition (C4) is a smoothness condition on the smoothing function and can be easily satisfied by a properly chosen (e.g., see an example in Section 5).
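To make the smoothness requirement concrete, the sketch below checks one candidate smoothing function numerically. The candidate is the integrated biweight kernel, which is an assumption made for illustration and not necessarily the example used in Section 5; the smoothed loss is likewise assumed to take the form K_h(u) = u H(u/h). The names `H`, `H_prime`, `H_second`, `smoothed_hinge`, and `hinge` are hypothetical.

```python
import numpy as np

def H(v):
    """Candidate smoothing function: 0 for v <= -1, 1 for v >= 1, a polynomial in between."""
    v = np.clip(np.asarray(v, dtype=float), -1.0, 1.0)
    return 0.5 + (15.0 / 16.0) * (v - (2.0 / 3.0) * v**3 + 0.2 * v**5)

def H_prime(v):
    """First derivative: the biweight kernel (15/16)(1 - v^2)^2 on (-1, 1), zero elsewhere."""
    v = np.asarray(v, dtype=float)
    return np.where(np.abs(v) < 1.0, (15.0 / 16.0) * (1.0 - v**2) ** 2, 0.0)

def H_second(v):
    """Second derivative: (15/4) v (v^2 - 1) on (-1, 1), zero elsewhere; bounded and continuous."""
    v = np.asarray(v, dtype=float)
    return np.where(np.abs(v) < 1.0, (15.0 / 4.0) * v * (v**2 - 1.0), 0.0)

def smoothed_hinge(u, h):
    """Assumed smooth approximation K_h(u) = u * H(u / h) of the hinge function max(u, 0)."""
    u = np.asarray(u, dtype=float)
    return u * H(u / h)

def hinge(u):
    return np.maximum(np.asarray(u, dtype=float), 0.0)

# Largest deviation from the hinge function on a grid, for a few bandwidths.
u = np.linspace(-2.0, 2.0, 401)
gaps = {h: float(np.max(np.abs(smoothed_hinge(u, h) - hinge(u)))) for h in (1.0, 0.5, 0.1)}

# Largest absolute second derivative on a fine grid (a boundedness check).
grid = np.linspace(-1.5, 1.5, 3001)
max_second = float(np.max(np.abs(H_second(grid))))

print("max |K_h - hinge| by bandwidth:", gaps)
print("max |H''| on grid:", max_second)
```

Under these assumptions the approximation differs from the hinge function only inside the window |u| < h, and the maximal gap shrinks proportionally to the bandwidth.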
Under the above conditions, we give Proposition 4.1 and Proposition 4.2 for the asymptotic behavior of and , respectively. Recall that and we have the following propositions. The proofs are relegated to Appendix A.
Proposition 4.1. Under conditions (C0)-(C4), assume that we have an initial estimator with where is the convergence rate of the initial estimator. If we choose the bandwidth such that , then we have
Proposition 4.2. Suppose the same conditions as in Proposition 4.1 hold. Then we have
According to the above propositions, with some algebraic manipulations and condition (C0), we have
By appropriately choosing the bandwidth such that it shrinks with at the same rate (see Theorem 4.3), becomes the dominating term on the right hand side of (15). This implies that by taking one round of refinement, the norm of improves from to (note that , see Proposition 4.1). Therefore by recursively applying the argument in (14) and setting the obtained estimator as the new initial estimator , the algorithm iteratively refines the estimator . This gives the Bahadur representation of our MDL estimator for rounds of refinements (see Algorithm 1).
Theorem 4.3. Under conditions (C0)-(C4), assume that the initial estimator satisfies . Also, assume and for some . For a given integer , let the bandwidth in the -th iteration be for . Then we have
Since the initial bandwidth is and by Proposition 3.1, the convergence rate of the initial estimator computed on a single machine is , the initial condition in Theorem 4.3 is satisfied by . The condition ensures that , which implies the consistency of . It is worthwhile noting that the choice of bandwidth in Theorem 4.3 is up to a constant. One can choose for a constant in practice and Theorem 4.3 still holds. We omit the constant for simplicity of the statement (i.e., setting ). We notice that the algorithm is not sensitive to the choice of . Even with a suboptimal constant , the algorithm still shows good performance with a few more rounds of iterations (i.e., using a larger ). Please see Section 5 for a simulation study that shows the insensitivity to the scaling constant.
According to our choice of , we can see that as long as the number of iterations satisfies
the bandwidth is . Then by (17), the Bahadur remainder term becomes
Note that as long as , the remainder term becomes and it is independent of . This implies that as long as the regularization term satisfies , the choice of will no longer affect the convergence rate of the estimator. Define . By applying the central limit theorem to (16), we have the following result on the asymptotic distribution of .
Please see Appendix A for the proofs of Theorem 4.3 and Theorem 4.4. We impose the conditions and for some constants and in order to ensure the right hand side of (18) is a constant, which implies that we only need to perform a constant number of iterations even when .
We introduce the vector since we consider the diverging regime and thus the dimension of the “sandwich matrix” is growing in . Therefore, it is notationally convenient to introduce an arbitrary vector to make the limiting variance a positive real number. Also note that the conditions and guarantee that the remainder term (19) satisfies , which enables the application of the central limit theorem.
It is also important to note that the asymptotic variance in Theorem 4.4 matches the optimal asymptotic variance of in (3), which is directly computed on all samples (see Theorem 2 of Koo et al. (2008)). This result shows that the MDL estimator does not lose any statistical efficiency as compared to the linear SVM in a single machine setup.
Remark 4.1 (Kernel SVM).
It is worthwhile to note that the proposed distributed algorithm can also be utilized in solving nonlinear SVM by using feature mapping approximation techniques. In the general SVM formulation, the objective function is defined as follows
where the function is the feature mapping function which maps to a high or even infinite dimensional space. The function defined by is called the kernel function associated with the feature mapping . With kernel mapping approximation, we construct a low dimensional feature mapping approximation such that . Then the original nonlinear SVM problem (20) can be approximated by
Several feature mapping approximation methods have been developed for kernels with some nice properties (see, e.g., Rahimi and Recht (2008); Lee and Wright (2011); Vedaldi and Zisserman (2012)) and it is also shown that the approximation error is small under some regularity conditions. We note that we should use a data-independent feature mapping approximation where only depends on the kernel function . This ensures that can be directly computed without loading data, which enables an efficient algorithm in a distributed setting. For instance, for the RBF kernel, which is defined as , Rahimi and Recht (2008) proposed a data-independent approximation as