Optimal Convergence for Distributed Learning with Stochastic Gradient Methods and Spectral Algorithms
We study generalization properties of distributed algorithms in the setting of nonparametric regression over a reproducing kernel Hilbert space (RKHS). We first investigate distributed stochastic gradient methods (SGM), with mini-batches and multiple passes over the data. We show that optimal generalization error bounds (up to a logarithmic factor) can be retained for distributed SGM provided that the partition level is not too large. We then extend our results to spectral algorithms (SA), including kernel ridge regression (KRR), kernel principal component analysis, and gradient methods. Our results improve on the state-of-the-art theory. In particular, they show that distributed SGM has a smaller theoretical computational complexity than distributed KRR and classic SGM. Moreover, even for non-distributed SA, they provide the first optimal, capacity-dependent convergence rates for the case in which the regression function may not be in the RKHS.
Keywords: Kernel Methods, Stochastic Gradient Methods, Regularization, Distributed Learning
1 Introduction
In statistical learning theory, a set of input-output pairs from an unknown distribution is observed. The aim is to learn a function that can predict future outputs given the corresponding inputs. The quality of a predictor is often measured in terms of the mean-squared error. In this case, the conditional mean, which is called the regression function, is optimal among all measurable functions (Cucker and Zhou, 2007; Steinwart and Christmann, 2008).
In nonparametric regression problems, the properties of the regression function are not known a priori. Nonparametric approaches, which can adapt their complexity to the problem, are key to good results. Kernel methods are among the most common nonparametric approaches to learning (Schölkopf and Smola, 2002; Shawe-Taylor and Cristianini, 2004). They are based on choosing an RKHS as the hypothesis space in the design of learning algorithms. With an appropriate reproducing kernel, an RKHS can be used to approximate any smooth function.
The classical algorithms for performing learning tasks are regularized algorithms, such as KRR (also called Tikhonov regularization in inverse problems), kernel principal component regression (KPCR, also known as spectral cut-off regularization in inverse problems), and more generally, SA. From the point of view of inverse problems, such approaches amount to solving an empirical, linear operator equation with the empirical covariance operator replaced by a regularized one (Engl et al., 1996; Bauer et al., 2007; Gerfo et al., 2008). Here, the regularization term controls the complexity of the solution to guard against overfitting and to ensure the best generalization ability. Statistical results on generalization error have been developed in (Smale and Zhou, 2007; Caponnetto and De Vito, 2007) for KRR and in (Caponnetto, 2006; Bauer et al., 2007) for SA.
Another type of algorithm for performing learning tasks is based on iterative procedures (Engl et al., 1996). In this kind of algorithm, an empirical objective function is optimized iteratively with no explicit constraint or penalization, and the regularization against overfitting is realized by early stopping of the empirical procedure. Statistical results on generalization error and the regularization role of the number of iterations/passes have been investigated in (Zhang and Yu, 2005; Yao et al., 2007) for gradient methods (GM, also known as the Landweber algorithm in inverse problems), in (Caponnetto, 2006; Bauer et al., 2007) for accelerated gradient methods (AGM, known as $\nu$-methods in inverse problems), in (Blanchard and Krämer, 2010) for conjugate gradient methods (CGM), and in (Lin and Rosasco, 2017b) for (multi-pass) SGM. Interestingly, GM and AGM can be viewed as special instances of SA (Bauer et al., 2007), but CGM and SGM cannot (Blanchard and Krämer, 2010; Lin and Rosasco, 2017b).
The above-mentioned algorithms suffer from computational burdens of order at least $O(N^2)$ in the sample size $N$, due to the nonlinearity of kernel methods. Indeed, a standard execution of KRR requires $O(N^2)$ in space and $O(N^3)$ in time, while SGM after $T$ iterations requires $O(N)$ in space and $O(NT)$ (or $O(T^2)$) in time. Such approaches would be prohibitive when dealing with large-scale learning problems. This motivates the study of distributed learning algorithms (Mcdonald et al., 2009; Zhang et al., 2012). The basic idea of distributed learning is very simple: randomly divide a dataset of size $N$ into $m$ subsets of equal size, compute an independent estimator using a fixed algorithm on each subset, and then average the local solutions into a global predictor. Interestingly, the distributed learning technique has been successfully combined with KRR (Zhang et al., 2015; Lin et al., 2017) and, more generally, SA (Guo et al., 2017; Blanchard and Mucke, 2016b), and it has been shown that statistical results on generalization error can be retained provided that the number of partitioned subsets is not too large. Moreover, it was highlighted (Zhang et al., 2015) that distributed KRR not only allows one to handle large datasets that are stored on multiple machines, but also leads to a substantial reduction in computational complexity versus the standard approach of performing KRR on all $N$ samples.
In this paper, we study distributed SGM, with multiple passes over the data and mini-batches. The algorithm is a combination of the distributed learning technique and (multi-pass) SGM (Lin and Rosasco, 2017b): it randomly partitions a dataset of size $N$ into $m$ subsets of equal size, computes an independent estimator by SGM for each subset, and then averages the local solutions into a global predictor. We show that with appropriate choices of algorithmic parameters, optimal generalization error bounds up to a logarithmic factor can be achieved for distributed SGM provided that the partition level $m$ is not too large.
The proposed configuration has certain advantages on computational complexity. For example, without considering any benign properties of the problem such as the regularity of the regression function (Smale and Zhou, 2007; Caponnetto and De Vito, 2007) and a capacity assumption on the RKHS (Zhang, 2005; Caponnetto and De Vito, 2007), even implementing on a single machine, distributed SGM has a convergence rate of order , with a computational complexity in space and in time, compared with in space and in time of classic SGM performing on all samples, or in space and in time of distributed KRR. Moreover, the approach dovetails naturally with parallel and distributed computation: we are guaranteed a superlinear speedup with parallel processors (though we must still communicate the function estimates from each processor).
The proof of the main results is based on an error decomposition similar to, but slightly different from, that of (Lin and Rosasco, 2017b), which decomposes the excess risk into three terms: bias, sample variance, and computational variance. The error decomposition allows one to study distributed GM and distributed SGM simultaneously. Unlike the proofs in (Lin and Rosasco, 2017b), which rely heavily on the intrinsic relationship of GM with the square loss, this paper uses an integral operator approach (Smale and Zhou, 2007; Caponnetto and De Vito, 2007), combined with some novel and refined analysis; see Subsection 6.2 for further details.
We then extend our analysis to distributed SA and derive similar optimal results on generalization error for distributed SA, based on the fact that GM is a special instance of SA.
This paper is an extended version of the conference paper (Lin and Cevher, 2018), where only results for distributed SGM are given. In this version, we additionally provide statistical results for distributed SA, including their proofs, as well as some further discussion.
We highlight that our contributions are as follows.
We provide the first results with optimal convergence rates (up to a logarithmic factor) for distributed SGM, showing that distributed SGM has a smaller theoretical computational complexity, compared with distributed KRR and non-distributed SGM. As a byproduct, we derive optimal convergence rates (up to a logarithmic factor) for non-distributed SGM, which improve the results in (Lin and Rosasco, 2017b).
Our results for distributed SA improve previous results from (Zhang et al., 2015) for distributed KRR, and from (Guo et al., 2017) for distributed SA, with a less strict condition on the partition number $m$. Moreover, they provide the first optimal rates for distributed SA in the non-attainable cases (i.e., the regression function may not be in the RKHS).
The remainder of the paper is organized as follows. Section 2 introduces the supervised learning setting. Section 3 describes distributed SGM and then presents theoretical results on generalization error for distributed SGM, followed by brief comments. Section 4 introduces distributed SA and then gives statistical results on generalization error. Section 5 discusses and compares our results with related work. Section 6 provides the proofs for distributed SGM. Finally, proofs for auxiliary lemmas and results for distributed SA are provided in the appendix.
2 Supervised Learning Problems
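We consider a supervised learning problem. Let $\rho$ be a probability measure on a measurable space $X \times Y$, where $X$ is a compact metric input space and $Y \subseteq \mathbb{R}$ is the output space. The measure $\rho$ is fixed but unknown. Information about it is available only through a set of samples $\bar{z} = \{z_i = (x_i, y_i)\}_{i=1}^{N}$ of $N$ points, which we assume to be i.i.d. We denote by $\rho_X$ the marginal measure on $X$ induced by $\rho$, and by $\rho(\cdot \mid x)$ the conditional probability measure on $Y$ with respect to $x \in X$ and $\rho$. We assume throughout that $\rho_X$ has full support in $X$.
The quality of a predictor $f: X \to Y$ can be measured in terms of the expected risk with the square loss, defined as
$$\mathcal{E}(f) = \int_{X \times Y} (f(x) - y)^2 \, d\rho(x, y).$$
In this case, the function minimizing the expected risk over all measurable functions is the regression function, given by
$$f_\rho(x) = \int_Y y \, d\rho(y \mid x), \qquad x \in X.$$
The performance of an estimator $f \in L^2_{\rho_X}$ can be measured in terms of the generalization error (excess risk), i.e., $\mathcal{E}(f) - \mathcal{E}(f_\rho)$. It is easy to prove that
$$\mathcal{E}(f) - \mathcal{E}(f_\rho) = \|f - f_\rho\|_\rho^2.$$
Here, $L^2_{\rho_X}$ is the Hilbert space of square integrable functions with respect to $\rho_X$, with its induced norm given by $\|f\|_\rho = \left( \int_X |f(x)|^2 \, d\rho_X \right)^{1/2}$.
Kernel methods are based on choosing the hypothesis space as an RKHS. Recall that a reproducing kernel $K$ is a symmetric function $K: X \times X \to \mathbb{R}$ such that $(K(u_i, u_j))_{i,j=1}^{\ell}$ is positive semidefinite for any finite set of points $\{u_i\}_{i=1}^{\ell}$ in $X$. The reproducing kernel $K$ defines an RKHS $(H_K, \|\cdot\|_K)$ as the completion of the linear span of the set $\{K_x(\cdot) := K(x, \cdot) : x \in X\}$ with respect to the inner product $\langle K_x, K_u \rangle_K := K(x, u)$.
Given only the samples $\bar{z}$, the goal is to learn the regression function through efficient algorithms.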
3 Distributed Learning with Stochastic Gradient Methods
In this section, we first state distributed SGM. We then present theoretical results for distributed SGM and non-distributed SGM, followed by a brief discussion.
3.1 Distributed SGM
Throughout this paper, as in (Zhang et al., 2015), we assume that the sample size is $N = mn$ for some positive integers $n, m$, and we randomly decompose $\bar{z}$ as $z_1 \cup z_2 \cup \dots \cup z_m$ with $|z_1| = |z_2| = \dots = |z_m| = n$. (Footnote 1: For the general case, one can consider the weighted averaging scheme, as in (Lin et al., 2017), and our analysis still applies with a simple modification.) For any $s \in [m]$ we write $z_s = \{(x_{s,i}, y_{s,i})\}_{i=1}^{n}$. We study distributed SGM, with mini-batches and multiple passes over the data, as detailed in Algorithm 1. For any positive integer $k$, the set of the first $k$ positive integers is denoted by $[k]$.
In the algorithm, at each iteration $t$, for each $s \in [m]$ the local estimator updates its current solution by subtracting a scaled gradient estimate. It is easy to see that the gradient estimate at each iteration for the $s$-th local estimator is an unbiased estimate of the full gradient of the empirical risk over $z_s$. The global predictor is the average over these local solutions. In the special case $m = 1$, the algorithm reduces to the classic multi-pass SGM.
There are several free parameters: the step-size $\eta_t$, the mini-batch size $b$, the total number of iterations/passes $T$, and the number of partitions/subsets $m$. All these parameters affect the algorithm's generalization properties and computational complexity. In the coming subsection, we show how these parameters can be chosen so that the algorithm generalizes optimally, as long as the number of subsets $m$ is not too large. Different choices of $\eta_t$, $b$, and $T$ correspond to different regularization strategies. In this paper, we are particularly interested in the cases in which both $\eta_t$ and $b$ are fixed as some universal constants that may depend on the local sample size $n$, while $T$ is tuned.
The total number of iterations $T$ can be bigger than the local sample size $n$, which means that the algorithm can use the data more than once; in other words, we can run the algorithm with multiple passes over the data. Here and in what follows, the number of (effective) 'passes' over the data after $t$ iterations of the algorithm refers to $bt/n$.
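The description above can be illustrated with a minimal Python sketch. It is not the paper's Algorithm 1 verbatim: it assumes scalar inputs, a Gaussian kernel with a hypothetical bandwidth, a constant step-size, and mini-batch indices drawn uniformly with replacement from each subset. The names `gaussian_kernel`, `local_sgm`, and `distributed_sgm` are illustrative only.

```python
import math
import random


def gaussian_kernel(x, u, bandwidth=0.2):
    """Gaussian kernel on the real line (an illustrative choice, K(x, x) = 1)."""
    return math.exp(-(x - u) ** 2 / (2.0 * bandwidth ** 2))


def local_sgm(xs, ys, step_size, batch_size, n_iters, rng):
    """Mini-batch kernel SGM on one subset.

    Returns coefficients alpha of the local estimator
    f(x) = sum_i alpha[i] * K(xs[i], x), starting from f = 0.
    """
    n = len(xs)
    alpha = [0.0] * n
    for _ in range(n_iters):
        batch = [rng.randrange(n) for _ in range(batch_size)]
        # Residuals f_t(x_j) - y_j, all evaluated at the current iterate.
        residuals = [
            sum(a * gaussian_kernel(xs[i], xs[j])
                for i, a in enumerate(alpha) if a != 0.0) - ys[j]
            for j in batch
        ]
        # f_{t+1} = f_t - step_size * (1/b) * sum_j residual_j * K(x_j, .)
        for j, r in zip(batch, residuals):
            alpha[j] -= step_size * r / batch_size
    return alpha


def distributed_sgm(xs, ys, n_subsets, step_size, batch_size, n_iters, seed=0):
    """Randomly split the data, run SGM on each subset, average the estimators."""
    rng = random.Random(seed)
    order = list(range(len(xs)))
    rng.shuffle(order)
    n = len(xs) // n_subsets  # assumes the sample size is N = m * n
    models = []
    for s in range(n_subsets):
        part = order[s * n:(s + 1) * n]
        sub_x = [xs[i] for i in part]
        sub_y = [ys[i] for i in part]
        alpha = local_sgm(sub_x, sub_y, step_size, batch_size, n_iters, rng)
        models.append((sub_x, alpha))

    def predict(x):
        local = [sum(a * gaussian_kernel(xi, x) for xi, a in zip(sx, al))
                 for sx, al in models]
        return sum(local) / len(local)

    return predict
```

Each local estimator stores only its own subset and coefficient vector, and setting `n_subsets=1` recovers plain multi-pass SGM on the full sample in this sketch.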
The numerical realization of the algorithm and its performance on synthetic data can be found in (Lin and Cevher, 2018). The space and time complexities for each local estimator are
respectively. The total space and time complexities of the algorithm are
3.2 Generalization Properties for Distributed Stochastic Gradient Methods
In this section, we state our results for distributed SGM, followed by a brief discussion. Throughout this paper, we make the following assumptions.
is separable and is continuous. Furthermore, for some ,
and for some ,
The above assumptions are quite common in statistical learning theory, see e.g., (Steinwart and Christmann, 2008; Cucker and Zhou, 2007). The constant from Equation (8) measures the noise level of the studied problem. The condition implies that the regression function is bounded almost surely,
It is trivially satisfied when is bounded, for example, in the classification problem. To state our first result, we define an inclusion operator , which is continuous under Assumption (7).
Assume that and
Consider Algorithm 1 with any of the following choices on , and .
1) for all , and
2) for all , and
Here and throughout this section, we use the notations to mean for some positive constant depending only on , and to mean .
The above result provides generalization error bounds for distributed SGM with two different choices on step-size , mini-batch size and total number of iterations/passes. The convergence rate is optimal up to a logarithmic factor, in the sense that it nearly matches the minimax rate in (Caponnetto and De Vito, 2007) and the convergence rate for KRR (Smale and Zhou, 2007; Caponnetto and De Vito, 2007). The number of passes to achieve optimal error bounds in both cases is roughly one. The above result asserts that distributed SGM generalizes optimally after one pass over the data for two different choices on step-size and mini-batch size, provided that the partition level is not too large. In the case that according to (6), the computational complexities are in space and in time, comparing with in space and in time of classic SGM.
Corollary 1 provides statistical results for distributed SGM without considering any further benign assumptions about the learning problem, such as the regularity of the regression function and the capacity of the RKHS. In what follows, we will show how the results can be further improved, if we make these two benign assumptions.
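The first benign assumption relates to the regularity of the regression function. We introduce the integral operator $\mathcal{L}: L^2_{\rho_X} \to L^2_{\rho_X}$, defined by $\mathcal{L} f = \int_X f(x) K(x, \cdot) \, d\rho_X$. Under Condition (7), $\mathcal{L}$ is a positive trace class operator (Cucker and Zhou, 2007), and hence $\mathcal{L}^\zeta$ is well defined using spectral theory.
There exist $\zeta > 0$ and $R > 0$, such that $\|\mathcal{L}^{-\zeta} f_\rho\|_\rho \le R$.
This assumption characterizes how large the subspace is in which the regression function lies. The bigger $\zeta$ is, the smaller the subspace, the stronger the assumption, and the easier the learning problem, since $\mathcal{L}^{\zeta_1}(L^2_{\rho_X}) \subseteq \mathcal{L}^{\zeta_2}(L^2_{\rho_X})$ if $\zeta_1 \ge \zeta_2$. Moreover, if $\zeta = 0$ we are making no assumption, and if $\zeta = 1/2$, we are requiring that there exists some $f_H \in H_K$ such that $f_H = f_\rho$ almost surely (Steinwart and Christmann, 2008, Section 4.5).
The next assumption relates to the capacity of the hypothesis space.
For some $\gamma \in [0, 1]$ and $c_\gamma > 0$, $\mathcal{L}$ satisfies
$$\operatorname{tr}\left(\mathcal{L} (\mathcal{L} + \lambda I)^{-1}\right) \le c_\gamma \lambda^{-\gamma} \quad \text{for all } \lambda > 0. \qquad (10)$$
The left-hand side of (10) is called the effective dimension (Zhang, 2005) or degrees of freedom (Caponnetto and De Vito, 2007). It is related to covering/entropy number conditions; see (Steinwart and Christmann, 2008). Condition (10) is naturally satisfied with $\gamma = 1$, since $\mathcal{L}$ is a trace class operator, which implies that its eigenvalues $\{\sigma_i\}_i$ satisfy $\sigma_i \lesssim i^{-1}$. Moreover, if the eigenvalues of $\mathcal{L}$ satisfy a polynomial decay condition $\sigma_i \sim i^{-c}$ for some $c > 1$, or if $\mathcal{L}$ is of finite rank, then condition (10) holds with $\gamma = 1/c$, or with $\gamma = 0$. The case $\gamma = 1$ is referred to as the capacity-independent case. A smaller $\gamma$ allows deriving faster convergence rates for the studied algorithms, as will be shown in the following results.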
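As a rough numerical illustration of the capacity condition, the following Python sketch evaluates the effective dimension $\sum_i \sigma_i / (\sigma_i + \lambda)$ for a hypothetical spectrum with polynomial decay $\sigma_i = i^{-c}$. The spectrum, the choice $c = 2$, and the truncation level are assumptions made for illustration; they do not come from the paper.

```python
import math


def effective_dimension(eigenvalues, lam):
    """Trace of L (L + lam I)^{-1} for an operator with the given eigenvalues."""
    return sum(s / (s + lam) for s in eigenvalues)


# Hypothetical spectrum sigma_i = i^{-c} with c = 2, truncated at 200000 terms.
# The capacity condition is then expected to hold with gamma = 1 / c = 0.5.
c = 2.0
gamma = 1.0 / c
eigenvalues = [i ** (-c) for i in range(1, 200001)]

ratios = []
for lam in (1e-1, 1e-2, 1e-3, 1e-4):
    n_lam = effective_dimension(eigenvalues, lam)
    # If the condition holds, n_lam * lam**gamma stays bounded as lam shrinks.
    ratios.append(n_lam * lam ** gamma)
    print(f"lambda={lam:g}  N(lambda)={n_lam:.2f}  N(lambda)*lambda^gamma={ratios[-1]:.3f}")

# A finite-rank spectrum never exceeds its rank, which corresponds to gamma = 0.
finite_rank = [1.0, 0.5, 0.25]
print(effective_dimension(finite_rank, 1e-6))
```

For this particular spectrum, comparing the sum with the integral $\int_0^\infty (1 + \lambda x^2)^{-1} dx = \pi / (2\sqrt{\lambda})$ shows that the rescaled quantity cannot exceed $\pi/2$, so a constant $c_\gamma = \pi/2$ suffices here.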
Making these two assumptions, we have the following general results for distributed SGM.
In the above result, we only consider the setting of a fixed step-size. Results with a decaying step-size can be directly derived following our proofs in the coming sections, combining with some basic estimates from (Lin and Rosasco, 2017b). The error bound from (12) depends on the number of iteration , the step-size , the mini-batch size, the number of sample points and the partition level . It holds for any pseudo regularization parameter where . When , for we can choose , and ignoring the logarithmic factor and constants, (12) reads as
The right-hand side of the above inequality is composed of three terms. The first term is related to the regularity parameter of the regression function, and it results from estimating the bias. The second term depends on the sample size, and it results from estimating the sample variance. The last term results from estimating the computational variance due to the random choices of the sample points. Compared with the error bounds derived for classic SGM performed on a local machine, one can see that averaging over the local solutions can reduce the sample and computational variances, but leaves the bias unchanged. As the number of iterations increases, the bias term decreases and the sample variance term increases. This is the so-called bias-variance trade-off in statistical learning theory. Solving this trade-off leads to the best choice of the number of iterations. Notice that the computational variance term is independent of the number of iterations; it depends on the step-size, the mini-batch size, and the partition level. To derive optimal rates, it is necessary to choose a small step-size and/or a large mini-batch size, together with a suitable partition level. In what follows, we provide different choices of these algorithmic parameters, corresponding to different regularization strategies, all leading to the same optimal convergence rates up to a logarithmic factor.
We add some comments on the above theorem. First, the convergence rate is optimal up to a logarithmic factor, as it is almost the same as that for KRR from (Caponnetto and De Vito, 2007; Smale and Zhou, 2007) and also it nearly matches the minimax lower rate in (Caponnetto and De Vito, 2007). In fact, let ( and ) be the set of probability measure on such that Assumptions 1-3 are satisfied. Then the following minimax lower rate is a direct consequence of (Caponnetto and De Vito, 2007, Theorem 2):
for some constant independent of , where the infimum in the middle is taken over all algorithms as a map . Alternative minimax lower rates (perhaps considering other quantities, and ) can be found in (Caponnetto and De Vito, 2007, Theorem 3) and (Blanchard and Mucke, 2016a, Theorem 3.5). Second, distributed SGM saturates when The reason for this is that averaging over local solutions can only reduce the sample and computational variances, not the bias. A similar saturation phenomenon is also observed when analyzing distributed KRR in (Zhang et al., 2015; Lin et al., 2017). Third, the condition is equivalent to assuming that the learning problem cannot be too difficult. We believe that such a condition is necessary for applying the distributed learning technique to reduce computational costs, as there is no means of reducing computational costs if the learning problem itself is not easy. Fourth, as the learning problem becomes easier (corresponding to a bigger ), the convergence rate becomes faster, and moreover the number of partitions can be larger. Finally, different parameter choices lead to different regularization strategies. In the first two regimes, the step-size and the mini-batch size are fixed as some prior constants (which only depend on ), while the number of iterations depends on some unknown distribution parameters. In this case, the regularization parameter is the number of iterations, which in practice can be tuned by cross-validation. In contrast, the step-size and the number of iterations in the third regime, or the mini-batch size and the number of iterations in the last regime, depend on the unknown distribution parameters, and they have some regularization effect. The above theorem asserts that distributed SGM with suitable choices of parameters can generalize optimally, provided the partition level is not too large.
3.3 Optimal Rate for Multi-pass SGM on a Single Dataset
As a direct corollary of Theorem 1, we derive the following results for classic multi-pass SGM.
The above results provide generalization error bounds for multi-pass SGM trained on a single dataset. The derived convergence rate is optimal in the minimax sense (Caponnetto and De Vito, 2007; Blanchard and Mucke, 2016a) up to a logarithmic factor. Note that SGM does not have a saturation effect, and optimal convergence rates can be derived for any Corollary 3 improves the result in (Lin and Rosasco, 2017b) in two aspects. First, the convergence rates are better than those (i.e., if or otherwise) from (Lin and Rosasco, 2017b). Second, the above theorem does not require the extra condition made in (Lin and Rosasco, 2017b).
4 Distributed Learning with Spectral Algorithms
In this section, we first state distributed SA. We then present theoretical results for distributed SA, followed by a brief discussion. Finally, we give convergence results for classic SA.
4.1 Distributed Spectral Algorithms
In this subsection, we present distributed SA. We first recall that a filter function is defined as follows.
Definition 4 (Filter functions)
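Let $\Lambda$ be a subset of $\mathbb{R}_+$. A class of functions $\{G_\lambda: [0, \kappa^2] \to [0, \infty[, \ \lambda \in \Lambda\}$, where $\kappa^2$ is the constant bounding the kernel in Condition (7), is said to be filter functions with qualification $\tau$ ($\tau \ge 0$) if there exist some positive constants $E, F_\tau < \infty$ such that
$$\sup_{\alpha \in [0,1]} \sup_{\lambda \in \Lambda} \sup_{u \in ]0, \kappa^2]} |u^\alpha G_\lambda(u)| \, \lambda^{1-\alpha} \le E$$
and
$$\sup_{\alpha \in [0,\tau]} \sup_{\lambda \in \Lambda} \sup_{u \in ]0, \kappa^2]} |1 - G_\lambda(u) u| \, u^\alpha \lambda^{-\alpha} \le F_\tau.$$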
In the algorithm, $\lambda$ is a regularization parameter which should be appropriately chosen in order to achieve the best performance. In practice, it can be tuned by cross-validation. SA is associated with some given filter functions. Different filter functions correspond to different regularization algorithms. The following examples provide several common filter functions, which lead to different types of regularization methods; see e.g. (Gerfo et al., 2008; Bauer et al., 2007).
Example 1 (KRR)
The choice $G_\lambda(u) = (u + \lambda)^{-1}$ corresponds to Tikhonov regularization or the regularized least squares algorithm. It is easy to see that $\{G_\lambda(u) : \lambda > 0\}$ is a class of filter functions with qualification $\tau = 1$, and $E = F_\tau = 1$.
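Example 2 (GM)
Let $\{\eta_k > 0\}_k$ be such that $\eta_k \kappa^2 \le 1$ for all $k$. Then, as will be shown in Section 6,
$$G_\lambda(u) = \sum_{k=1}^{t} \eta_k \prod_{j=k+1}^{t} (1 - \eta_j u),$$
where we identify $\lambda = \left(\sum_{k=1}^{t} \eta_k\right)^{-1}$, corresponds to gradient methods or the Landweber iteration algorithm. The qualification $\tau$ could be any positive number, $E = 1$, and $F_\tau = (\tau / e)^\tau$.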
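Example 3 (Spectral cut-off)
Consider the spectral cut-off or truncated singular value decomposition (TSVD) defined by
$$G_\lambda(u) = \begin{cases} u^{-1}, & \text{if } u \ge \lambda, \\ 0, & \text{if } u < \lambda. \end{cases}$$
Then the qualification $\tau$ could be any positive number and $E = F_\tau = 1$.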
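Example 4 (KRR with bias correction)
The function $G_\lambda(u) = \lambda (\lambda + u)^{-2} + (\lambda + u)^{-1}$ corresponds to KRR with bias correction. It is easy to show that the qualification is $\tau = 2$, $E = 2$, and $F_\tau = 1$.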
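The four examples can be checked numerically with a short Python sketch. It assumes the standard forms of these filters (Tikhonov, Landweber with a constant step-size, spectral cut-off, and twice-iterated Tikhonov), takes $\kappa^2 = 1$, and estimates the two filter-function constants on a finite grid. The function names and the grid are hypothetical, and a grid maximum is only a lower estimate of the true supremum.

```python
def krr_filter(u, lam):
    """Tikhonov filter."""
    return 1.0 / (u + lam)


def gm_filter(u, lam, step_size=1.0):
    """Landweber filter after t constant-step iterations, with lam = 1 / (step_size * t)."""
    t = max(1, round(1.0 / (step_size * lam)))
    return (1.0 - (1.0 - step_size * u) ** t) / u


def cutoff_filter(u, lam):
    """Spectral cut-off (truncated SVD) filter."""
    return 1.0 / u if u >= lam else 0.0


def krr_bias_corrected_filter(u, lam):
    """KRR with bias correction (Tikhonov iterated twice)."""
    return lam / (lam + u) ** 2 + 1.0 / (lam + u)


def filter_constants(g, qualification, lams, us, n_alpha=21):
    """Grid estimates of sup |u^a G(u)| lam^(1-a) over a in [0, 1] and of
    sup |1 - G(u) u| (u / lam)^b over b in [0, qualification]."""
    e_hat, f_hat = 0.0, 0.0
    for lam in lams:
        for u in us:
            gu = g(u, lam)
            for k in range(n_alpha):
                a = k / (n_alpha - 1)
                e_hat = max(e_hat, abs(u ** a * gu) * lam ** (1 - a))
                b = qualification * a
                f_hat = max(f_hat, abs(1.0 - gu * u) * (u / lam) ** b)
    return e_hat, f_hat


lams = [1.0 / t for t in (1, 2, 5, 10, 50, 100)]
us = [k / 200 for k in range(1, 201)]  # grid on ]0, 1], i.e. kappa^2 = 1

results = {
    "krr": filter_constants(krr_filter, 1.0, lams, us),
    "gm": filter_constants(gm_filter, 2.0, lams, us),
    "cutoff": filter_constants(cutoff_filter, 3.0, lams, us),
    "krr_bias_corrected": filter_constants(krr_bias_corrected_filter, 2.0, lams, us),
}
for name, (e_hat, f_hat) in results.items():
    print(f"{name}: first constant ~ {e_hat:.3f}, second constant ~ {f_hat:.3f}")
```

Raising the `qualification` argument for `krr_filter` above 1 makes the second estimate grow as the smallest value in `lams` decreases, which is one way to see why Tikhonov regularization has qualification 1 only.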
The implementation of the algorithms is standard using the representer theorem, so we skip the details.
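For concreteness, here is one possible Python sketch of such an implementation for the gradient-method filter only. It assumes scalar inputs, a Gaussian kernel with a hypothetical bandwidth, and a constant step-size of 1. Each local estimator is written as $f = \sum_i \alpha_i K(x_i, \cdot)$, and full-gradient steps on the local empirical risk are carried out on the coefficient vector. The function names are illustrative and do not come from the paper.

```python
import math
import random


def kernel(x, u, bandwidth=0.2):
    """Gaussian kernel on the real line (an illustrative choice, K(x, x) = 1)."""
    return math.exp(-(x - u) ** 2 / (2.0 * bandwidth ** 2))


def local_landweber(xs, ys, lam, step_size=1.0):
    """Coefficients alpha of the local GM estimator f = sum_i alpha[i] K(xs[i], .)
    after t = round(1 / (step_size * lam)) full-gradient steps from f = 0."""
    n = len(xs)
    gram = [[kernel(a, b) for b in xs] for a in xs]
    alpha = [0.0] * n
    for _ in range(max(1, round(1.0 / (step_size * lam)))):
        fitted = [sum(g * a for g, a in zip(row, alpha)) for row in gram]
        # f <- f - step_size * (1/n) * sum_i (f(x_i) - y_i) K(x_i, .)
        alpha = [a - step_size * (f - y) / n for a, f, y in zip(alpha, fitted, ys)]
    return alpha


def distributed_sa(xs, ys, n_subsets, lam, seed=0):
    """Random equal-size split, one local spectral estimator per subset, then averaging."""
    rng = random.Random(seed)
    order = list(range(len(xs)))
    rng.shuffle(order)
    n = len(xs) // n_subsets  # assumes the sample size is N = m * n
    models = []
    for s in range(n_subsets):
        part = order[s * n:(s + 1) * n]
        sub_x = [xs[i] for i in part]
        sub_y = [ys[i] for i in part]
        models.append((sub_x, local_landweber(sub_x, sub_y, lam)))

    def predict(x):
        local = [sum(a * kernel(xi, x) for xi, a in zip(sx, al)) for sx, al in models]
        return sum(local) / len(local)

    return predict
```

Other filters, such as KRR or spectral cut-off, would replace `local_landweber` by a linear solve or an eigendecomposition of the local Gram matrix; this sketch sticks to the gradient-method case because it needs only matrix-vector products.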
4.2 Optimal Convergence for Distributed Spectral Algorithms
We have the following general results for distributed SA.
The above results provide generalization error bounds for distributed SA. The upper bound depends on the number of partition , the regularization parameter and total sample size . When the regularization parameter by setting , the derived error bounds for can be simplified as
There are two terms in the upper bound. They arise from estimating the bias and the sample variance. Note that there is a trade-off between the bias term and the sample variance term. Solving this trade-off leads to the best choice of the regularization parameter. Note also that, as for distributed SGM, distributed SA saturates when
The convergence rate from the above corollary is optimal as it matches exactly the minimax rate in (Caponnetto and De Vito, 2007), and it is better than the rate for distributed SGM from Theorem 1, where the latter has an extra logarithmic factor. According to Corollary 5, distributed SA with an appropriate choice of regularization parameter can generalize optimally, if the number of partitions is not too large. To the best of our knowledge, the above corollary is the first optimal statistical result for distributed SA considering the non-attainable case (i.e. can be less than ). Moreover, the requirement on the number of partitions to achieve optimal generalization error bounds is much weaker than that () in (Guo et al., 2017; Blanchard and Mucke, 2016b).
4.3 Optimal Rates for Spectral Algorithms on a Single Dataset
The following results provide generalization error bounds for classic SA.
The above results assert that SA generalizes optimally if the regularization parameter is well chosen. To the best of our knowledge, the derived result is the first one with optimally capacity-dependent rates in the non-attainable case for a general SA. Note that unlike distributed SA, classic SA does not have a saturation effect.
5 Discussion
In this section, we briefly review some of the related results in order to facilitate comparison. For ease of comparison, we summarize some of the results and their computational costs in Table 1.
We first briefly review convergence results on generalization error for KRR, and more generally, SA. Statistical results for KRR with different convergence rates have been shown in, e.g., (Smale and Zhou, 2007; Caponnetto and De Vito, 2007; Wu et al., 2006; Steinwart and Christmann, 2008; Steinwart et al., 2009). In particular, Smale and Zhou (2007) proved convergence rates of order with , without considering the capacity assumption. Caponnetto and De Vito (2007) gave an optimal capacity-dependent convergence rate of order but only for the case that . The above two are based on integral operator approaches. Using an alternative argument related to covering numbers or entropy numbers, Wu et al. (2006) provided convergence rate , and (Steinwart and Christmann, 2008, Theorem 7.23) provides convergence rate , assuming that and almost surely. For GM, Yao et al. (2007) derived a convergence rate of order (for ), without considering the capacity assumption. Involving the capacity assumption, Lin and Rosasco (2017b) derived a convergence rate of order if , or if . Note that both proofs from (Yao et al., 2007; Lin and Rosasco, 2017b) rely on the special separable properties of GM with the square loss. For SA, statistical results on generalization error with different convergence rates have been shown in, e.g., (Caponnetto, 2006; Bauer et al., 2007; Blanchard and Mucke, 2016a; Dicker et al., 2017; Lin et al., 2017). The best convergence rate shown so far (without using any extra unlabeled data as in (Caponnetto, 2006)) is (Blanchard and Mucke, 2016a; Dicker et al., 2017; Lin et al., 2017) but only for the attainable case, i.e., . These results also apply to GM, as GM can be viewed as a special instance of SA. Note that some of these results also require the extra assumption that the sample size is large enough.
In comparison, Corollary 6 provides the best convergence rates for SA, considering both the non-attainable and attainable cases and without making any extra assumption. Note that our derived error bounds are in expectation, but it is not difficult to derive error bounds in high probability using our approach, and we will report this result in future work.
|KRR (Smale and Zhou, 2007)||,||1||&|
|KRR (Caponnetto and De Vito, 2007)||, ,||1||-|
|KRR (Steinwart and Christmann, 2008) [Footnote 4: The results from (Steinwart and Christmann, 2008) are based on entropy-numbers arguments while the other results summarized for KRR in the table are based on integral-operator arguments.]||, ,||1||-|
|GM (Yao et al., 2007)||1||&|
|GM (Dicker et al., 2017)||, ,||1||&|
|GM (Lin and Rosasco, 2017b)||,||1||&|
|GM (Lin and Rosasco, 2017b)||,||1||&|
|OL (Ying and Pontil, 2008)||1||&|
|AveOL (Dieuleveut and Bach, 2016)||,||1||&|
|AveOL (Dieuleveut and Bach, 2016)||1||&|
|SGM (Lin and Rosasco, 2017b)||1||&|
|SGM (Lin and Rosasco, 2017b)||1||&|
|SGM [Corollary 3]||1||&|
|SGM [Corollary 3]||1||&|
|NyKRR (Rudi et al., 2015)||, ,||1||&|
|NySGM (Lin and Rosasco, 2017a)||, ,||1||&|
|DKRR & DSA (Guo et al., 2017)||,||&||&|