Does generalization performance of $l^q$ regularization learning depend on $q$? A negative example
The research was supported by the National 973 Program (2013CB329404), the Key Program of National Natural Science Foundation of China (Grant No. 11131006), and the National Natural Science Foundation of China (Grant No. 61075054).
$l^q$-regularization has been demonstrated to be an attractive technique in machine learning and statistical modeling. It attempts to improve the generalization (prediction) capability of a machine (model) by appropriately shrinking its coefficients. The shape of an $l^q$ estimator differs with the choice of the regularization order $q$. In particular, $l^1$ leads to the LASSO estimate, while $l^2$ corresponds to the smooth ridge regression. This makes the order $q$ a potential tuning parameter in applications. To facilitate the use of $l^q$-regularization, we seek a modeling strategy in which an elaborate selection of $q$ is avoidable. In this spirit, we place our investigation within a general framework of $l^q$-regularized kernel learning under a sample dependent hypothesis space (SDHS). For a designated class of kernel functions, we show that all $l^q$ estimators in the range of $q$ considered attain similar generalization error bounds. These estimated bounds are almost optimal in the sense that, up to a logarithmic factor, the upper and lower bounds are asymptotically identical. This finding tentatively reveals that, in some modeling contexts, the choice of $q$ might not have a strong impact on the generalization capability. From this perspective, $q$ can be arbitrarily specified, or specified merely by other non-generalization criteria such as smoothness, computational complexity, and sparsity.
Keywords: Learning theory, regularization learning, sample dependent hypothesis space, learning rate
MSC 2000: 68T05, 62G07.
1. Institute for Information and System Sciences, School of Mathematics and Statistics, Xi’an Jiaotong University Xi’an 710049, P R China
2. The Methodology Center, The Pennsylvania State University, Department of Statistics, 204 E. Calder Way, Suite 400, State College, PA 16801, USA
Contemporary scientific investigations frequently encounter a common issue of exploring the relationship between a response and a number of covariates. In machine learning research, the subject is typically addressed through learning an underlying rule from the data that accurately predicts future values of the response. For instance, in the banking industry, financial analysts are interested in building a system that helps to judge the risk of a loan request. Such a system is often trained based on the risk assessments from previous loan applications together with empirical experience. An incoming loan request is then viewed as a new input, upon which the corresponding potential risk (response) is to be predicted. In such applications, the predictive accuracy of a trained rule is of key importance.
In the past decade, various strategies have been developed to improve the prediction (generalization) capability of a learning process, of which $l^q$ regularization is a well-known example. Regularization learning prevents over-fitting by shrinking the model coefficients and thereby attains a higher predictive value. To be specific, suppose that the data $(x_i, y_i)$ for $i = 1, \dots, m$ are collected independently and identically according to an unknown but definite distribution, where $y_i$ is the response of the $i$th unit and $x_i$ is the corresponding vector of $d$-dimensional covariates. Let
be a sample dependent hypothesis space (SDHS) with and being a positive definite kernel function. The coefficient-based $l^q$ regularization strategy ($l^q$ regularizer) takes the form of
where is a regularization parameter and is defined by
With different choices of the order $q$, (Equation 1) leads to various specific forms of the $l^q$ regularizer. In particular, when $q = 2$, it corresponds to the ridge regressor, which smoothly shrinks the coefficients toward zero. When $q = 1$, it leads to the LASSO, which sets small coefficients exactly to zero and thereby also serves as a variable selection operator. When $0 < q < 1$, it coincides with the bridge estimator, which tends to produce highly sparse estimates through a non-continuous shrinkage.
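To make the role of the order concrete, the following minimal sketch evaluates a coefficient-based penalty and the resulting objective. It assumes the penalty is the sum of $|a_i|^q$ over the kernel expansion coefficients and that the data-fit term is the averaged squared loss; the exact form of (Equation 1) is not reproduced above, and the names `lq_penalty` and `regularized_objective` are hypothetical.

```python
def lq_penalty(coeffs, q):
    """Sum of |a_i|^q over the coefficients (assumed form of the penalty)."""
    if q <= 0:
        raise ValueError("q must be positive")
    return sum(abs(a) ** q for a in coeffs)


def regularized_objective(coeffs, gram, ys, lam, q):
    """Averaged squared loss of f = sum_j a_j K(x_j, .) plus lam * penalty.

    gram[i][j] holds the kernel value K(x_i, x_j); lam is the
    regularization parameter. Both the 1/m normalization and the
    penalty form are assumptions made for illustration.
    """
    m = len(ys)
    preds = [sum(gram[i][j] * coeffs[j] for j in range(m)) for i in range(m)]
    loss = sum((p - y) ** 2 for p, y in zip(preds, ys)) / m
    return loss + lam * lq_penalty(coeffs, q)
```

For the same coefficient vector, a smaller `q` penalizes small nonzero coefficients relatively more heavily, which is the mechanism behind the sparsity of the $q \le 1$ estimators.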
The varying forms and properties of the $l^q$ regularizer make the choice of the order $q$ crucial in applications. Apparently, an optimal $q$ may depend on many factors, such as the learning algorithm, the purpose of the study and so forth. These factors make a simple answer to this question infeasible in general. To facilitate the use of $l^q$-regularization, alternatively, we seek a modeling strategy in which an elaborate selection of $q$ is avoidable. Specifically, we attempt to gain some insight into the role of $q$ in $l^q$-learning by answering the following question:
Problem 1. Are there any kernels such that the generalization capability of (Equation 1) is independent of $q$?
In this paper, we provide a positive answer to Problem 1 under the framework of statistical learning theory. Specifically, we provide a featured class of positive definite kernels under which the $l^q$ estimators attain similar generalization error bounds. We then show that these estimated bounds are almost essential in the sense that, up to a logarithmic factor, the upper and lower bounds are asymptotically identical. In the proposed modeling context, the choice of $q$ does not have a strong impact on the generalization capability. From this perspective, $q$ can be arbitrarily specified, or specified merely by other non-generalization criteria such as smoothness, computational complexity, and sparsity.
The remainder of the paper is organized as follows. In Section 2, we provide a literature review and explain the motivation for our research. In Section 3, we present some preliminaries, including spherical harmonics, Gegenbauer polynomials and so on. In Section 4, we introduce a class of well-localized needlet type kernels of Petrushev and Xu and show some of their crucial properties, which will play important roles in our analysis. In Section 5, we study the generalization capabilities of the $l^q$-regularizer associated with the constructed kernels for different $q$. In Section 6, we provide the proof of the main results. We conclude the paper with some useful remarks in the last section.
2 Motivation and related work
In practice, the choice of $q$ in (Equation 1) is critical, since it embodies certain potential attributes of the anticipated solutions, such as sparsity, smoothness, computational complexity, memory requirement and, of course, generalization capability. The following simple simulation illustrates that different choices of $q$ can lead to different sparsity of the solutions.
The samples are drawn independently and identically according to the uniform distribution from the two-dimensional Sinc function plus Gaussian noise with . There are 256 training samples and 256 test samples in total. In Fig. 1, we show that different choices of $q$ may induce different sparsity of the estimator for the kernel . It can be seen that some $l^q$ regularizers yield sparse estimators, while this is impossible for others.
Therefore, for a given learning task, how to choose $q$ is an important and crucial problem for $l^q$ regularization learning. In other words, which standards should be adopted to measure the quality of $l^q$ regularizers deserves study. As the most important standard of statistical learning theory, the generalization capability of the $l^q$ regularization scheme (Equation 1) may depend on the choice of kernel, the sample size, the regularization parameter, the behavior of priors, and, of course, the choice of $q$. If we regard the generalization capability of $l^q$ regularization learning as a function of $q$, we naturally wonder how this function behaves as $q$ changes for a fixed kernel. If the generalization capability depends heavily on $q$, then it is natural to choose the $q$ for which the generalization error of the corresponding $l^q$ regularizer is smallest. If the generalization capability is independent of $q$, then $q$ can be arbitrarily specified, or specified merely by other non-generalization criteria such as smoothness, computational complexity, and sparsity.
However, the relation between the generalization capability and $q$ depends heavily on the kernel selection. To show this, we compare the generalization capabilities of several $l^q$ regularization schemes for two kernels in the simulation. One case shows that the generalization capabilities of $l^q$ regularization schemes may be independent of $q$, and the other shows that the generalization capability of (Equation 1) depends heavily on $q$. In the left panel of Fig. 2, we report the relation between the test error and the regularization parameter for the first kernel. It is shown that when the regularization parameters are appropriately tuned, all of the aforementioned regularization schemes may possess similar generalization capabilities. In the right panel of Fig. 2, for the second kernel, we see that the generalization capability of $l^q$ regularization depends heavily on the choice of $q$.
From these simulations, we see that finding kernels such that the generalization capability of (Equation 1) is independent of $q$ is of special importance in theory and in practical applications. In particular, if such kernels exist, then with such kernels $q$ can be chosen solely on the basis of algorithmic and practical considerations for $l^q$ regularization. Here we emphasize that all these conclusions can, of course, only be made on the premise that the obtained generalization capabilities of all $l^q$ regularizers are (almost) optimal.
There have been several papers that focus on the generalization capability analysis of the $l^q$ regularization scheme (Equation 1). Wu and Zhou were the first, to the best of our knowledge, to establish a mathematical foundation for learning algorithms in SDHS. They claimed that the data dependent nature of the algorithm leads to an extra error term called the hypothesis error, which makes it essentially different from regularization schemes with sample independent hypothesis spaces (SIHSs). Based on this, the authors proposed a coefficient-based regularization strategy and conducted a theoretical analysis of the strategy by dividing the generalization error into approximation error, sample error and hypothesis error. Following their work, Xiao and Zhou derived a learning rate of the $l^1$ regularizer by bounding the regularization error, sample error and hypothesis error, respectively. Their result was subsequently improved by adopting a concentration inequality technique with empirical covering numbers to tackle the sample error. On the other hand, for other $l^q$ regularizers, Tong et al. deduced an upper bound for the generalization error by using a different method to cope with the hypothesis error. Later, that learning rate was improved further by giving a sharper estimate of the sample error.
In all of these studies, some strong restrictions on the probability distributions (priors) have been imposed; for instance, both a spectrum assumption on the regression function and a concentration property of the marginal distribution should be satisfied. Noting this, for the $l^2$ regularizer, Sun and Wu conducted a generalization capability analysis using the spectrum assumption on the regression function only. For the $l^1$ regularizer, by using a sophisticated functional analysis method, Zhang et al. and Song et al. built the regularized least squares algorithm on a reproducing kernel Banach space (RKBS), and they proved that the regularized least squares algorithm in RKBS is equivalent to the $l^1$ regularizer if the kernel satisfies some restrictive conditions. Following this method, Song and Zhang deduced a similar learning rate for the $l^1$ regularizer and eliminated the concentration property assumption on the marginal distribution.
Restricting $q$ to the values treated in these studies is certainly insufficient to judge whether the generalization capability of $l^q$ regularization depends on the choice of $q$. Moreover, in the context of learning theory, to intrinsically characterize the generalization capability of a learning strategy, the essential generalization bound rather than the upper bound is required; that is, we must deduce a lower and an upper bound simultaneously for the learning strategy and prove that the upper and lower bounds are asymptotically identical. We notice, however, that most of the previously known estimates of the generalization capability of the learning schemes (Equation 1) are concerned only with the upper bound. Thus, their results cannot serve as an answer to Problem 1. Different from the previous work, the essential bound estimation of the generalization error for the $l^q$ regularization schemes (Equation 1) will be presented in the present paper. As a consequence, we provide an affirmative answer to Problem 1.
In this section, we introduce some preliminaries on spherical harmonics, Gegenbauer polynomials and orthonormal basis construction, which will be used in the construction of the positive definite needlet kernel.
The Gegenbauer polynomials are defined by the generating function 
where and . The coefficients are algebraic polynomials of degree which are called the Gegenbauer polynomials associated with . It is known that the family of polynomials is a complete orthogonal system in the weighted space , , and there holds
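As a numerical companion to the definition above, the sketch below evaluates Gegenbauer polynomials by the classical three-term recurrence and compares a partial sum of the series against the classical generating function $(1 - 2tz + z^2)^{-\mu}$. The symbols `mu`, `t`, `z` and the function names are assumptions chosen for illustration (the paper's own notation is not reproduced above), and the sketch assumes `mu > 0`.

```python
def gegenbauer(n, mu, t):
    """Degree-n Gegenbauer polynomial with parameter mu, evaluated at t.

    Uses the classical recurrence
        k * G_k = 2 (k + mu - 1) t G_{k-1} - (k + 2 mu - 2) G_{k-2},
    started from G_0 = 1 and G_1 = 2 mu t.
    """
    if n < 0:
        raise ValueError("n must be non-negative")
    prev, curr = 1.0, 2.0 * mu * t
    if n == 0:
        return prev
    for k in range(2, n + 1):
        nxt = (2.0 * (k + mu - 1.0) * t * curr - (k + 2.0 * mu - 2.0) * prev) / k
        prev, curr = curr, nxt
    return curr


def generating_series(mu, t, z, terms=80):
    """Partial sum of sum_n G_n(t) z^n, which converges for |z| < 1, |t| <= 1."""
    return sum(gegenbauer(n, mu, t) * z ** n for n in range(terms))
```

With `mu = 0.5` the recurrence reproduces the Legendre polynomials, and with `mu = 1` the Chebyshev polynomials of the second kind, which gives two easy sanity checks.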
Then it is easy to see that is a complete orthonormal system for the weighted space , where . Let be the unit ball in , be the unit sphere in and be the set of algebraic polynomials of degree not larger than defined on . Denote by the zero element of . Then The following important properties of are established in .
For any integer , the restriction to of a homogeneous harmonic polynomial with degree is called a spherical harmonic of degree . The class of all spherical harmonics with degree is denoted by , and the class of all spherical polynomials with total degrees is denoted by . It is obvious that . The dimension of is given by
and that of is where denotes that there exist absolute constants and such that .
where is an arbitrary orthonormal basis of .
For and , we say that a finite subset is an -covering of if
where denotes the cardinality of the set and denotes the spherical cap with the center and the angle . The following positive cubature formula can be found in .
3.3 Basis and reproducing kernel for
constitutes an orthonormal basis for , where
is an orthonormal basis for . The following Lemma ? defines a reproducing kernel of , whose proof will be presented in Appendix A.
4 The needlet kernel: Construction and Properties
Such a function can be easily constructed out of an orthogonal wavelet mask . We define a kernel as follows
As is admissible, the constructed kernel , henceforth called the needlet kernel (or localized polynomial kernel), is positive definite. We will show that the kernel function so defined induces $l^q$ regularization learning whose learning rate is independent of the choice of $q$. To this end, we first show several useful properties of the needlet kernel.
The following Proposition ? which can be deduced directly from Lemma ? and the definition of reveals that possesses reproducing property for .
Since is an admissible function by definition, it follows that is an algebraic polynomial of degree not larger than for any fixed . At first glance, as a polynomial kernel, it may have a good frequency localization property but a poor space localization property. The following Proposition ?, which can be found in , shows however that is actually a polynomial kernel possessing very good spatial localization properties. This makes it widely applicable in approximation theory and signal processing .
be the best approximation error of . Define
It has been shown in  that the integral operator possesses the following compressive property:
By Propositions ?, ? and ?, a standard method in approximation theory  yields the following best approximation property of .
5 Almost essential learning rate
In this section, we conduct a detailed generalization capability analysis of the $l^q$ regularization scheme (Equation 1) when the kernel function is specified as . Our aim is to derive an almost essential learning rate of the $l^q$ regularization strategy (Equation 1). We first present a quick review of learning theory. Then, we give the main result of this paper, where a $q$-independent learning rate of the $l^q$ regularization schemes (Equation 1) is deduced. Finally, we present some remarks on the main result.
5.1 Statistical learning theory
Let be an input space and an output space. Assume that there exists an unknown but definite relationship between and , which is modeled by a probability distribution on . It is assumed that admits the decomposition
Let be a set of finite random samples of size , , drawn identically, independently according to from . The set of examples is called a training set. Without loss of generality, we assume that almost everywhere.
The aim of learning is to learn from a training set a function such that is an effective estimate of when is given. One natural measurement of the error incurred by using such a function for this purpose is the generalization error,
which is minimized by the regression function  defined by
We do not know this ideal minimizer , since is unknown, but we have access to random examples from sampled according to .
Let be the Hilbert space of square integrable functions on , with norm In the setting of , it is well known that, for every , there holds
The goal of learning is then to construct a function that approximates , in the norm , using the finite sample .
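The well-known identity behind this reduction, namely that the excess generalization error of a function equals its squared distance to the regression function in the marginal-weighted norm, can be checked exactly on a small discrete example. The sketch below is illustrative only: the toy distribution `rho`, the candidate `f` and all function names are hypothetical, and exact rational arithmetic is used to avoid rounding.

```python
from fractions import Fraction as F

# Toy joint distribution on pairs (x, y); the values are illustrative only.
rho = {(0, 0): F(1, 4), (0, 2): F(1, 4), (1, 1): F(1, 8), (1, 3): F(3, 8)}


def marginal(dist):
    """Marginal distribution of x."""
    out = {}
    for (x, _), p in dist.items():
        out[x] = out.get(x, 0) + p
    return out


def regression_function(dist):
    """Conditional mean of y given x."""
    marg = marginal(dist)
    sums = {}
    for (x, y), p in dist.items():
        sums[x] = sums.get(x, 0) + p * y
    return {x: sums[x] / marg[x] for x in marg}


def generalization_error(f, dist):
    """Expected squared loss of the function f (given as a dict x -> f(x))."""
    return sum(p * (f[x] - y) ** 2 for (x, y), p in dist.items())


def squared_distance(f, g, dist):
    """Squared L2 distance between f and g under the marginal of x."""
    return sum(p * (f[x] - g[x]) ** 2 for x, p in marginal(dist).items())


f_rho = regression_function(rho)
f = {0: F(0), 1: F(2)}
excess = generalization_error(f, rho) - generalization_error(f_rho, rho)
```

Here `excess` and `squared_distance(f, f_rho, rho)` coincide, which is why approximating the regression function in norm is equivalent to minimizing the generalization error.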
One of the main points of this paper is to formulate the learning problem in terms of probability estimates rather than expectation estimates. To this end, we present a formal way to measure the performance of learning schemes in probability. Let and be the class of all Borel measures on such that . For each , we enter into a competition over all estimators established in the hypothesis space , and we define the accuracy confidence function by 
Furthermore, we define the accuracy confidence function for all possible estimators based on samples by
From these definitions, it is obvious that
for all .
5.2 $q$-independent learning rate
The sample dependent hypothesis space (SDHS) associated with is then defined by
and the corresponding regularization scheme is defined by
The projection operator from the space of measurable functions to is defined by
As by assumption, it is easy to check  that
Also, for arbitrary , we denote .
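The effect of the projection step can be illustrated with a short sketch. It assumes the projection is the usual truncation of function values onto the interval $[-M, M]$ in which the responses are assumed to lie; the defining formula is not reproduced above, so this form, the symbol `M` and the function name `project` are assumptions.

```python
def project(value, M):
    """Truncate a real value onto the interval [-M, M] (assumed projection)."""
    if M < 0:
        raise ValueError("M must be non-negative")
    return max(-M, min(M, value))


def truncation_never_hurts(values, targets, M):
    """Check |project(v) - t| <= |v - t| pointwise for targets in [-M, M]."""
    return all(
        abs(project(v, M) - t) <= abs(v - t)
        for v, t in zip(values, targets)
    )
```

Because the regression function takes values in the same interval, truncating an estimator pointwise cannot move it further from the regression function, which is the kind of inequality the text says is easy to check.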
We also need to introduce the class of priors. For any , denote by or the Fourier transformation of ,
where . The inverse Fourier transformation will be denoted by . In the space , the derivative of with order is defined as
where . Here, the Fourier transformation and derivatives are all taken in the sense of distributions. Let be any positive number. We consider the Sobolev class of functions
It follows from the well known Sobolev embedding theorem that provided
Now, we state the main result of this paper, whose proof will be given in the next section.
We explain Theorem ? below in more detail. First, we explain why the accuracy function is used to characterize the generalization capability of the $l^q$ regularization schemes (Equation 8). In applications, we are often faced with the following problem: There are data available, and we are asked to produce an estimator with tolerance at most by using these data only. In such circumstances, we have to know the probability of success. It is obvious that such probability depends on and . For example, if is too small, we cannot construct an estimator within a small tolerance. This fact is quantitatively verified by Theorem ?. More specifically, ( ?) shows that if there are data available and with , then the $l^q$ regularization scheme (Equation 8) cannot yield an estimator with tolerance error smaller than . This is not a negative result, since we can also see in ( ?) that the main reason for the impossibility is the lack of data rather than inappropriateness of the learning scheme (Equation 8). More importantly, Theorem ? reveals a quantitative relation between the probability of success and the tolerance error based on samples. It says in ( ?) that if the tolerance error is relaxed to or larger, then the probability of success of $l^q$ regularization is at least . The first inequality (lower bound) of ( ?) implies that such confidence cannot be improved further. That is, we have presented an optimal confidence estimation for the $l^q$ regularization scheme (Equation 8) with . Thus, Theorem ? basically concludes the following: If , then every estimator deduced from samples by $l^q$ regularization cannot approximate the regression function with tolerance smaller than , while if , then the $l^q$ regularization schemes with any can definitely yield estimators that approximate the regression function with tolerance .
The values and are thus critical for indicating the generalization error of a learning scheme. Indeed, the upper bound of the generalization error of a learning scheme depends heavily on , while the lower bound of the generalization error is related to . Thus, in order to have a tight generalization error estimate of a learning scheme, we naturally wish to make the interval as short as possible. Theorem ? shows that, for the $l^q$ regularization scheme (Equation 8), , and , which shows that the interval is almost the shortest one in the sense that, up to a logarithmic factor, the upper bound and lower bound are asymptotically identical. Noting that the learning rate established in Theorem ? is independent of $q$, we can thus conclude that the generalization capability of $l^q$ regularization does not depend on the choice of $q$. This gives an affirmative answer to Problem 1.
The other advantage of using the accuracy confidence function to measure the generalization capability is that it allows us to expose some phenomena that cannot be found if the classical expectation standard is utilized. For example, Theorem ? shows a sharp phase transition phenomenon of $l^q$ regularization learning, that is, the behavior of the accuracy confidence function changes dramatically within the critical interval . It drops from a constant to an exponentially small quantity. We might call the interval of phase transition for a corresponding learning scheme. To make this more intuitive, let us conduct a simulation on the phase transition of the confidence function below. Without loss of generality, we implemented the $l^q$ regularization strategy (Equation 8) associated with the kernel (Equation 5) for and to yield the estimator. The regularization parameter was chosen as . The training samples were drawn independently and identically according to the uniform distribution from the well known function, that is . The number of the training samples was chosen from to and the tolerance was chosen from to with step-length . Then, there were 1000 test data in total, drawn i.i.d. according to the uniform distribution from . The test error was defined as We repeated the simulation 100 times at each point, and labeled its value as if is smaller than the tolerance error and otherwise. The simulation result is shown in Fig. 3. We can see from Fig. 3 that in the upper right part, the colors of all points are red, which means that in those settings, the probability that is smaller than the tolerance is approximately . Thus, if the number of samples is small, then the $l^q$ regularization schemes cannot provide an estimate with very small tolerance. In the lower left area, the colors of all points are blue, which means that the probability of smaller than the tolerance is approximately .
Between these two areas, there exists a band, which could be called the phase transition area, in which the colors of points vary from red to blue dramatically. It is seen that the length of the phase transition interval decreases monotonically with . All these coincide with the theoretical assertions of Theorem ?.
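The bookkeeping behind such a phase-transition plot can be sketched as follows: for each sample size and each tolerance, one records the fraction of repeated runs whose test error falls below the tolerance. The sketch takes the per-run test errors as given and does not reproduce the paper's kernel, parameters or data; the function names and the input layout are hypothetical.

```python
def success_frequency(test_errors, tolerance):
    """Fraction of repeated runs whose test error is below the tolerance."""
    if not test_errors:
        raise ValueError("need at least one run")
    hits = sum(1 for err in test_errors if err < tolerance)
    return hits / len(test_errors)


def confidence_grid(errors_by_sample_size, tolerances):
    """Map each sample size to its list of success frequencies, one per tolerance.

    errors_by_sample_size: dict mapping a sample size to the list of test
    errors obtained from repeated simulations at that size.
    """
    return {
        size: [success_frequency(errs, tol) for tol in tolerances]
        for size, errs in errors_by_sample_size.items()
    }
```

Plotting the resulting grid as a heat map over sample size and tolerance would display a transition band between the region where the frequency is near zero and the region where it is near one.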
For comparison, we also present a generalization error bound result in terms of expectation error. Corollary ? below can be directly deduced from Theorem ? and , if we notice the identity:
It is noted that the representation theorem in learning theory implies that the generalization capability of an optimal learning algorithm in SDHS is not worse than that of learning in RKHS with a convex loss function. Corollary ? then shows that if , then the generalization capability of an optimal learning scheme in SDHS associated with is not worse than that of any optimal learning algorithm in the corresponding RKHS. More specifically, ( ?) shows that as far as the learning rate is concerned, all $l^q$ regularization schemes (Equation 8) for can realize the same almost optimal theoretical rate. That is to say, the choice of $q$ has no influence on the generalization capability of the learning schemes (Equation 8). This also gives an affirmative answer to Problem 1 in the sense of expectation. Here, we emphasize that the independence of the generalization of $l^q$ regularization on $q$ is based on the understanding of attaining the same almost optimal generalization error. Thus, in applications, $q$ can be arbitrarily specified, or specified merely by other non-generalization criteria (such as complexity and sparsity).
6 Proof of Theorem
The methodology we adopt in the proof of Theorem ? appears to be novel. Traditionally, the generalization error of learning schemes in SDHS is divided into the approximation, hypothesis and sample errors (three terms) . All of the aforementioned results about coefficient regularization in SDHS fall into this style. According to , the hypothesis error has been regarded as a reflection of the data dependent nature of SDHS (sample dependent hypothesis space), and an indispensable part attributed to an essential characteristic of learning algorithms in SDHS, compared with learning in SIHS (sample independent hypothesis space). With the specific kernel function , we will divide the generalization error of $l^q$ regularization in this paper into the approximation and sample errors (two terms) only. Both of these two terms are dependent on the samples. The success in this paper then reveals that, for at least some kernels, the hypothesis error is negligible, or can be avoided in estimation, when $l^q$ regularization learning is analyzed in SDHS. We show that such a new methodology can bring the important benefit of yielding an almost optimal generalization error bound for a large class of priors. Such a benefit may reasonably be expected to extend beyond $l^q$ regularization.
We sketch the methodology to be used as follows. Due to the sample dependent property, any estimator constructed in SDHS may be a random approximant. To bound the approximation error, we first deduce a probabilistic cubature formula for algebraic polynomials. Then we can discretize the near-best approximation operator based on the probabilistic cubature formula. Thus, the well known Jackson-type error estimate can be applied to derive the approximation error. To bound the sample error, we will use a method different from the traditional approaches . Since the constructed approximant in SDHS is a random approximant, a concentration inequality such as the Bernstein inequality is not directly available. In our approach, based on the prominent property of the constructed approximant, we will bound the sample error by using the concentration inequality established in twice. Then the relation between the so-called pseudo-dimension and the covering number yields the sample error estimate for the $l^q$ regularization schemes (Equation 8) with arbitrary . Hence, we divide the proof into four subsections. The first subsection is devoted to establishing the probabilistic cubature formula. The second subsection is to construct the random approximant and study the approximation error. The third subsection is to deduce the sample error, and the last subsection is to derive the final learning rate. We present the details one by one below.
6.2 A probabilistic cubature formula
In this subsection, we establish a probabilistic cubature formula. At first, we need several lemmas. The weighted norm on the -dimensional unit sphere is defined as follows. Let and . Define
The following lemma gives a weighted Nikolskii inequality for spherical polynomials.
where is a positive constant depending only on and .
Lemma ? establishes a relation between cubature formula on the unit sphere and cubature formula on the unit ball, which can be found in .
The following Lemma ? is known as the Bernstein inequality for random variables, which can be found in .
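For orientation, the sketch below evaluates the classical two-sided Bernstein tail bound for the average of `m` independent, centered random variables that are bounded by `bound` in absolute value and have variance at most `variance`. This textbook form, $2\exp\{-m\varepsilon^2 / (2(\sigma^2 + B\varepsilon/3))\}$, may differ in constants and normalization from the lemma as stated in the paper, and the function name is hypothetical.

```python
import math


def bernstein_tail_bound(m, eps, variance, bound):
    """Classical Bernstein bound on P(|sample mean - mean| >= eps).

    m: number of i.i.d. samples; eps: deviation threshold;
    variance: upper bound on the variance of each variable;
    bound: almost-sure upper bound on |variable - mean|.
    """
    if m <= 0 or eps <= 0 or variance < 0 or bound < 0:
        raise ValueError("invalid arguments")
    exponent = -m * eps ** 2 / (2.0 * (variance + bound * eps / 3.0))
    return min(1.0, 2.0 * math.exp(exponent))
```

The bound decays exponentially in the number of samples for a fixed deviation, which is the behaviour exploited when a cubature formula is required to hold with high confidence.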
We also need a lemma showing that if is a set of independent random variables drawn identically according to a distribution , then with high confidence the cubature formula holds.
Proof. For the sake of brevity, we write in the following. Since the sampling set consists of a sequence of i.i.d. random variables on , the sampling points are a sequence of functions on some probability space . Without loss of generality, we assume for arbitrary fixed . If we set , then we have
where we have used the equality
It follows from Lemma ? that
On the other hand, we have
Then using Lemma ? again, there holds
Thus it follows from Lemma ? that with confidence at least
This means that if is a sequence of i.i.d. random variables, then the Marcinkiewicz-Zygmund inequality
holds with probability at least
By virtue of the above lemmas, we can prove the following Proposition ?.
6.3 Error decomposition and an approximation error estimate
To estimate the upper bound of
we first introduce an error decomposition strategy. It follows from the definition of that, for arbitrary ,
Since with , it follows from the Sobolev embedding theorem that . Thus, it can be deduced from Proposition ? and Proposition ? that there exists a such that
where denotes the largest integer not larger than and denotes the uniform norm on . The above inequalities together with the well known Jackson inequality  imply that there exists a such that for all with , there holds
Let , where is defined as in (Equation 10). Define
Then we have
where and are called the approximation error and sample error, respectively.
Proof. From Proposition ?, it is easy to deduce that
Thus, Lemma ? with yields that with confidence at least there exists a set of real numbers satisfying for such that
The above observation together with (Equation 11) implies that with confidence at least there exists a
such that for arbitrary , there holds
where is a constant depending only on and . Indeed, if , we have . Without loss of generality, we assume . Then there holds
If , it follows from the Hölder inequality that
Thus, for all , there holds
It thus follows from the definition of that the inequalities
hold with confidence at least
6.4 A sample error estimate
For further use, we also need to introduce some quantities to measure the complexity of a space . Let be a Banach space and a compact set in . The quantity , where is the number of elements in the least -net of , is called the -entropy of in . The quantity is called the -covering number of . For any , define
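For a finite point set, an upper bound on the covering number can be obtained by a simple greedy construction, as in the sketch below. It works in Euclidean space rather than in a general Banach space, the greedy net is not guaranteed to be a least net, and the base-2 logarithm used for the entropy is an assumption; the function names are hypothetical.

```python
import math


def greedy_net(points, eps):
    """Greedily pick centers so that every point lies within eps of a center.

    points: iterable of coordinate tuples. The size of the returned list
    is an upper bound on the eps-covering number of the point set.
    """
    if eps <= 0:
        raise ValueError("eps must be positive")
    centers = []
    for p in points:
        # Keep p as a new center only if no existing center covers it.
        if all(math.dist(p, c) > eps for c in centers):
            centers.append(p)
    return centers


def entropy_upper_bound(points, eps):
    """Base-2 logarithm of the greedy net size (an upper bound on the entropy)."""
    return math.log2(len(greedy_net(points, eps)))
```

The centers chosen this way are pairwise more than `eps` apart, so the construction also yields a packing, which is the standard link between covering and packing numbers.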
If a vector belongs to , then we denote by