Ivanov-Regularised Least-Squares Estimators over Large RKHSs and Their Interpolation Spaces

Stephen Page (s.page@lancaster.ac.uk)
STOR-i, Lancaster University, Lancaster, LA1 4YF, United Kingdom

Steffen Grünewälder (s.grunewalder@lancaster.ac.uk)
Department of Mathematics and Statistics, Lancaster University, Lancaster, LA1 4YF, United Kingdom
Abstract

We study kernel least-squares estimation under a norm constraint. This form of regularisation is known as Ivanov regularisation and it provides better control of the norm of the estimator than the well-established Tikhonov regularisation. This choice of regularisation allows us to dispose of the standard assumption that the reproducing kernel Hilbert space (RKHS) has a Mercer kernel, which is restrictive as it usually requires compactness of the covariate set. Instead, we assume only that the RKHS is separable with a bounded and measurable kernel. We provide rates of convergence for the expected squared error of our estimator under the weak assumption that the variance of the response variables is bounded and the unknown regression function lies in an interpolation space between and the RKHS. We then obtain faster rates of convergence when the regression function is bounded by clipping the estimator. In fact, we attain the optimal rate of convergence. Furthermore, we provide a high-probability bound under the stronger assumption that the response variables have subgaussian errors and that the regression function lies in an interpolation space between and the RKHS. Finally, we derive adaptive results for the settings in which the regression function is bounded.


Keywords: Ivanov Regularisation, RKHSs, Mercer Kernels, Interpolation Spaces, Training and Validation

1 Introduction

One of the key problems to overcome in nonparametric regression is overfitting, due to estimators coming from large hypothesis classes. To avoid this phenomenon, it is common to ensure that both the empirical risk and some regularisation function are small when defining an estimator. There are three natural ways to achieve this goal. We can minimise the empirical risk subject to a constraint on the regularisation function, minimise the regularisation function subject to a constraint on the empirical risk or minimise a linear combination of the two. These techniques are known as Ivanov regularisation, Morozov regularisation and Tikhonov regularisation respectively (Oneto et al., 2016). Ivanov and Morozov regularisation can be viewed as dual problems, while Tikhonov regularisation can be viewed as the Lagrangian relaxation of either.
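In symbols, writing $\hat{R}_n$ for the empirical risk, $\Omega$ for the regularisation function and $r$, $\varepsilon$, $\lambda$ for the respective tuning parameters (this notation is ours), the three problems are
\[
\text{Ivanov:}\ \min_f \hat{R}_n(f)\ \text{subject to}\ \Omega(f) \le r, \qquad
\text{Morozov:}\ \min_f \Omega(f)\ \text{subject to}\ \hat{R}_n(f) \le \varepsilon, \qquad
\text{Tikhonov:}\ \min_f \hat{R}_n(f) + \lambda\,\Omega(f).
\]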

Tikhonov regularisation has gained popularity as it provides a closed-form estimator in many situations. In particular, Tikhonov regularisation in which the estimator is selected from a reproducing kernel Hilbert space (RKHS) has been extensively studied (Steinwart and Christmann, 2008; Caponnetto and de Vito, 2007; Mendelson and Neeman, 2010; Steinwart et al., 2009). Although Tikhonov regularisation produces an estimator in closed form, it is Ivanov regularisation which provides the greatest control over the hypothesis class, and hence over the estimator it produces. For example, if the regularisation function is the norm of the RKHS, then the bound forces the estimator to lie in a ball of predefined radius inside the function space. An RKHS norm measures the smoothness of a function, so the norm constraint bounds the smoothness of the estimator. By contrast, Tikhonov regularisation provides no direct bound on the smoothness of the estimator.

The control we have over the Ivanov-regularised estimator is useful in many settings. The most obvious use of Ivanov regularisation is when the regression function lies in a ball of known radius in the RKHS. In this case, Ivanov regularisation can be used to constrain the estimator to lie in the same ball. Ivanov regularisation can also be used within larger inference methods. It is compatible with validation, allowing us to control an estimator selected from an uncountable collection. This is because the Ivanov-regularised estimator is continuous in the size of the ball containing it, so the estimators parametrised by an interval of ball sizes can be controlled simultaneously using chaining.

The norm bound provided by Ivanov regularisation can also be useful in addressing other problems. These include the use of optimal transport for the covariate shift problem, in which we are interested in bounding the squared error of our estimator where is a probability measure other than the distribution of the covariates . Having control over the norm of the estimator means that we can use the reproducing kernel property and the Cauchy–Schwarz inequality to bound the difference in the value of the estimator at two different points. When the kernel of the RKHS is Lipschitz continuous, this bound can be expressed in terms of the distance between the two points. Integrating an expression of this form with respect to a joint distribution over the two points which has different fixed marginal distributions gives the transport cost for the induced optimal transport problem. The squared error can be bounded by the sum of the squared error and the transport cost for the corresponding optimal transport problem.
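To illustrate the argument sketched above, suppose that $f$ lies in the ball of radius $r$ in $H$ and that the kernel $k$ is $L$-Lipschitz in each argument (the notation in this display is ours). Then, for any two covariate points $x_1$ and $x_2$,
\[
|f(x_1) - f(x_2)|
= |\langle f, k_{x_1} - k_{x_2} \rangle_H|
\le r\,\|k_{x_1} - k_{x_2}\|_H
= r \bigl( k(x_1, x_1) - 2 k(x_1, x_2) + k(x_2, x_2) \bigr)^{1/2}
\le r\,\bigl( 2 L \|x_1 - x_2\| \bigr)^{1/2}.
\]
Integrating a bound of this form against a coupling of the two marginal distributions produces the transport-cost term described above.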

In addition to the useful properties of the Ivanov-regularised estimator, Ivanov regularisation can be performed almost as quickly as Tikhonov regularisation. The Ivanov-regularised estimator is a support vector machine (SVM) with regularisation parameter selected to match the norm constraint, as discussed in Appendix B. This parameter can be selected to within a tolerance using interval bisection with order iterations. In general, Ivanov regularisation requires the calculation of SVMs.
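The following sketch illustrates the bisection scheme just described in the least-squares case, in which the SVM with squared loss reduces to kernel ridge regression. The function and variable names are ours and the code is an illustrative sketch rather than the authors' implementation; it uses the fact that the RKHS norm of the ridge solution is non-increasing in the Tikhonov parameter.

    import numpy as np

    def ivanov_ls(K, y, radius, tol=1e-6, max_iter=100):
        # K: n x n kernel matrix, y: responses. The estimator is f = sum_i alpha_i k(x_i, .),
        # so its squared RKHS norm is alpha' K alpha. We bisect on the Tikhonov parameter lam
        # until this norm matches the Ivanov constraint (within tol).
        n = len(y)

        def ridge(lam):
            alpha = np.linalg.solve(K + n * lam * np.eye(n), y)
            return alpha, float(np.sqrt(max(alpha @ K @ alpha, 0.0)))

        alpha, norm = ridge(1e-12)
        if norm <= radius:          # constraint inactive: return the (nearly) unregularised fit
            return alpha
        lam_lo, lam_hi = 1e-12, 1.0
        while ridge(lam_hi)[1] > radius:   # grow lam_hi until the norm falls below the radius
            lam_hi *= 2.0
        for _ in range(max_iter):
            lam = 0.5 * (lam_lo + lam_hi)
            alpha, norm = ridge(lam)
            if abs(norm - radius) <= tol:
                break
            if norm > radius:
                lam_lo = lam
            else:
                lam_hi = lam
        return alpha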

In this paper, we study the behaviour of the Ivanov-regularised least-squares estimator with regularisation function equal to the norm of the RKHS. We derive a number of novel results concerning the rate of convergence of the estimator in various settings and under various assumptions. Ivanov regularisation allows us to significantly weaken some of the fundamental assumptions made when studying Tikhonov-regularised estimators. For example, Ivanov regularisation allows us to dispose of the restrictive Mercer kernel assumption.

The current theory of RKHS regression for Tikhonov-regularised estimators only applies when the RKHS has a Mercer kernel. If the RKHS has a Mercer kernel with respect to the covariate distribution , then there is a simple decomposition of and a succinct representation of as a subspace of . These descriptions are in terms of the eigenfunctions and eigenvalues of the kernel operator on . Many results have assumed a fixed rate of decay of these eigenvalues in order to produce estimators whose squared error decreases quickly with the number of data points (Mendelson and Neeman, 2010; Steinwart et al., 2009). However, the assumptions necessary for to have a Mercer kernel are in general quite restrictive. The usual set of assumptions is that the covariate set is compact, the kernel of is continuous on and the covariate distribution satisfies (see Section 4.5 of Steinwart and Christmann, 2008). In particular, the assumption that the covariate set is compact is inconvenient, and Steinwart and Scovel (2012) have investigated how to relax this condition.

By contrast, we provide results that hold under the significantly weaker assumption that the RKHS is separable with a bounded and measurable kernel . We can remove the Mercer kernel assumption because we control empirical processes over balls in the RKHS instead of relying on the representation of the RKHS given by Mercer’s theorem. We first prove an expectation bound on the squared error of our estimator of order under the weak assumption that the response variables have bounded variance. Here, is the number of data points and parametrises the interpolation space between and containing the regression function. The definition of an interpolation space is given in Subsection 1.1. The expected squared error can be viewed as the expected squared error of our estimator at a new independent covariate with the same distribution . If we also assume that the regression function is bounded, then it makes sense to clip our estimator so that it takes values in the same interval as the regression function. This further assumption allows us to achieve an expectation bound on the squared error of the clipped estimator of order .

We then move away from the average behaviour of the error towards its behaviour in the worst case. We obtain high-probability bounds of the same order, under the stronger assumption that the response variables have subgaussian errors and the interpolation space is between and . The second assumption is quite natural as we already assume that the regression function is bounded, and can be continuously embedded in since it has a bounded kernel . This assumption also means that the set of possible regression functions is independent of the covariate distribution, which may be advantageous in some scenarios. For example, it makes sense when considering the covariate shift problem, as discussed above.

When the regression function is bounded, we also analyse an adaptive version of our estimator, which does not require us to know which interpolation space contains the regression function. This adaptive estimator obtains bounds of the same order as the non-adaptive one. Furthermore, our results match the order of Steinwart et al. (2009) when . In particular, this shows that our expectation results are of optimal order for our setting.

1.1 RKHSs and Their Interpolation Spaces

A Hilbert space of real-valued functions on is an RKHS if the evaluation functional defined by is bounded for all . In this case, the evaluation functional lies in the dual of , and the Riesz representation theorem tells us that there is some such that for all . The kernel is then given by for , and is symmetric and positive-definite.
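Explicitly, writing $k_x \in H$ for the representer of evaluation at the point $x$, the reproducing property and the kernel take the standard form
\[
f(x) = \langle f, k_x \rangle_H \quad \text{for all } f \in H, \qquad
k(x_1, x_2) = \langle k_{x_1}, k_{x_2} \rangle_H \quad \text{for all points } x_1, x_2 .
\]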

Now suppose that is a measurable space on which is a probability measure. We can define a range of spaces between and . Let be a Banach space and be a subspace of . Then the -functional of is

for and . For and , we define

and

for . The interpolation space is defined to be the set of such that for and . Smaller values of give larger spaces. The space is not much larger than when is close to 1, but we obtain spaces which get closer to as decreases. Hence, we can define the interpolation spaces , where is the space of -almost-sure equivalence classes of measurable functions on such that is integrable with respect to . We will work with , which gives the largest space of functions for a fixed . Note that although is not a subspace of , the above definitions are still valid as there is a natural embedding of into as long as the functions in are measurable on . We will also require , where is the space of bounded measurable functions on .

1.2 Literature Review

The current literature assumes that the RKHS has a Mercer kernel , as discussed above. Earlier research in this area, such as that of Caponnetto and de Vito (2007), assumes that the regression function is at least as smooth as an element of . However, their paper still allows for regression functions of varying smoothness by letting the regression function be of the form for and . Here, is the kernel operator and is the covariate distribution. By assuming that the th eigenvalue of is of order for , the authors achieve a squared error of order with high probability by using SVMs. This squared error is shown to be of optimal order for .

Later research focuses on the case in which the regression function is at most as smooth as an element of . Often, this research demands that the response variables are bounded. For example, Mendelson and Neeman (2010) assume that for to obtain a squared error of order with high probability by using Tikhonov-regularised least-squares estimators. The authors also show that if the eigenfunctions of the kernel operator are uniformly bounded in , then the order can be improved to . Steinwart et al. (2009) relax the condition on the eigenfunctions to the condition

for all and some constant . The same rate is attained by using clipped Tikhonov-regularised least-squares estimators, including clipped SVMs, and is shown to be optimal. The authors assume that is in an interpolation space between and , which is slightly more general than the assumption of Mendelson and Neeman (2010). A detailed discussion about the image of under powers of and interpolation spaces between and is given by Steinwart and Scovel (2012).

More recently, the assumption that the response variables must be bounded has been relaxed to allow for subexponential errors. However, the assumption that the regression function is bounded has been maintained. For example, Fischer and Steinwart (2017) assume that for and that is bounded. The authors also assume that is continuously embedded in with respect to an appropriate norm on for some . This gives the same squared error of order with high probability by using SVMs.

1.3 Contribution

In this paper, we provide bounds on the squared error of our Ivanov-regularised least-squares estimator when the regression function comes from an interpolation space between and an RKHS , which is separable with a bounded and measurable kernel . We use the norm of the RKHS as our regularisation function. Under the weak assumption that the response variables have bounded variance, we prove a bound on the expected squared error of order (Theorem 5). If we assume that the regression function is bounded, then we can clip the estimator and achieve an expected squared error of order (Theorem 8).

Under the stronger assumption that the response variables have subgaussian errors and the regression function comes from an interpolation space between and , we show that the squared error is of order with high probability (Theorem 16). For the settings in which the regression function is bounded, we use training and validation on the data in order to select the size of the constraint on the norm of our estimator. This gives us an adaptive estimation result which does not require us to know which interpolation space contains the regression function. We obtain a squared error of order in expectation and with high probability, depending on the setting (Theorems 13 and 19). In order to perform training and validation, the response variables in the validation set must have subgaussian errors. The expectation results are of optimal order.

The results not involving validation are summarised in Table 1. The columns for which there is an bound on the regression function also make the interpolation assumption. Orders of bounds marked with are known to be optimal.

Regression Function Interpolation Bound Interpolation
Response Variables Bounded Variance Bounded Variance Subgaussian Errors
Bound Type Expectation Expectation High Probability
Bound Order
Table 1: Orders of bounds on squared error

The validation results are summarised in Table 2. Again, the columns for which there is an bound on the regression function also make the interpolation assumption. The assumptions on the response variables relate to those in the validation set, which has data points. We assume that is at least some multiple of . Again, orders of bounds marked with are known to be optimal.

Regression Function Bound Interpolation
Response Variables Subgaussian Errors Subgaussian Errors
Bound Type Expectation High Probability
Bound Order
Table 2: Orders of validation bounds on squared error

2 Problem Definition and Assumptions

We now formally define our regression problem and the assumptions that we make in this paper. For a topological space , let be the Borel -algebra of . Let be a measurable space. Assume that are -valued random variables on the probability space for , which are i.i.d. with and integrable. We denote integration with respect to by . Since is -measurable, where is the -algebra generated by , we have that for some function which is measurable on (Section A3.2 of Williams, 1991). From the definition of conditional expectation and the identical distribution of the , it is clear that we can choose to be the same for all . The conditional expectation used is that of Kolmogorov, defined using the Radon–Nikodym derivative. Its definition is unique almost surely. Since is integrable, by Jensen’s inequality. In addition to

(1)

we assume

(2)
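A natural reading of the two assumptions displayed above, consistent with the abstract, is that the regression function is the conditional mean of the responses and that their conditional variance is uniformly bounded (the symbols $g$ and $\sigma^2$ below are ours):
\[
\mathbb{E}(Y_i \mid X_i) = g(X_i) \ \text{almost surely}, \qquad
\mathbb{E}\bigl( (Y_i - g(X_i))^2 \mid X_i \bigr) \le \sigma^2 \ \text{almost surely}.
\]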

We also assume that with norm at most , where is a separable RKHS of measurable functions on and . Finally, we assume that has kernel which is bounded and measurable on , with

We can guarantee that is separable by, for example, assuming that is continuous and is a separable topological space (Lemma 4.33 of Steinwart and Christmann, 2008). The fact that has a kernel which is measurable on guarantees that all functions in are measurable on (Lemma 4.24 of Steinwart and Christmann, 2008).

Let be the unit ball of and . Our assumption that with norm at most for provides us with the following approximation result. Theorem 3.1 of Smale and Zhou (2003) shows that

when is dense in . This additional condition is present because these are the only interpolation spaces considered by the authors, and in fact the result holds by the same proof even when is not dense in . Hence, for all there is some such that

(3)

We will estimate for small by

We also define . The estimator is calculated in Appendix B and shown to be a -valued measurable function on , where varies in . Lemma 21 of Appendix B summarises these results.
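Concretely, the Ivanov-regularised least-squares estimator with norm constraint $r$ described above is any minimiser of the empirical risk over the closed ball of radius $r$ in the RKHS; in our notation,
\[
\hat{g}_r \in \operatorname*{arg\,min}_{f \in H,\ \|f\|_H \le r} \ \frac{1}{n} \sum_{i=1}^n \bigl( f(X_i) - Y_i \bigr)^2 .
\]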

2.1 Clipping

In the majority of this paper, we will assume that . Since is bounded in , we can make and closer to by constraining them to lie in the same interval. Similarly to Chapter 7 of Steinwart and Christmann (2008) and Steinwart et al. (2009), we define the projection by

for .
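A standard choice of such a projection, writing $B$ for a bound on the supremum norm of the regression function (this symbol is ours), is pointwise clipping:
\[
(\pi_B f)(x) = \max\bigl( -B, \min( B, f(x) ) \bigr) \quad \text{for every point } x .
\]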

2.2 Validation

Let us assume that we have an independent second data set which are -valued random variables on the probability space for . Let the be i.i.d. with and

Furthermore, we assume that the moment generating function of conditional on satisfies

for all and . The random variable is said to be -subgaussian conditional on . This assumption is made so that we can use chaining in the proofs of Lemmas 10 and 17. Let and be compact. Furthermore, let

and

The minimum is attained as it is the minimum of a continuous function over a compact set. In the event of ties, we may take to be the infimum of all points attaining the minimum. We will estimate by

We can uniquely define in the same way as . Lemma 23 in Appendix B shows that the estimator is a random variable on . Hence, is a -valued random variable on .
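The training-and-validation procedure just described can be sketched as follows, with a finite grid of candidate radii standing in for the compact interval used above and with the ivanov_ls routine from the sketch in Section 1 computing the training-set estimator. All names are ours and the snippet is illustrative only.

    import numpy as np

    def select_radius(K_train, y_train, K_val_train, y_val, radii, clip=None):
        # K_train: kernel matrix of the training covariates (n x n).
        # K_val_train: cross-kernel matrix k(validation covariate, training covariate) (m x n).
        # radii: candidate norm constraints; clip: optional bound for clipping predictions.
        best_err, best_r, best_alpha = np.inf, None, None
        for r in radii:
            alpha = ivanov_ls(K_train, y_train, r)   # training-set Ivanov estimator for this radius
            preds = K_val_train @ alpha              # evaluate it at the validation covariates
            if clip is not None:
                preds = np.clip(preds, -clip, clip)  # clip when the regression function is bounded
            err = np.mean((preds - y_val) ** 2)      # validation squared error
            if err < best_err:
                best_err, best_r, best_alpha = err, r, alpha
        return best_r, best_alpha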

2.3 Suprema of Stochastic Processes

We require the measurability of certain suprema over subsets of . Since is a separable metric space, its subsets are also separable. For example, has a countable dense subset . Furthermore, is continuously embedded in and because its kernel is bounded. Here, is the empirical distribution of the . Hence, for example,

This is a random variable on since the right-hand side is a supremum of countably many random variables on . Therefore, this quantity has a well-defined expectation and we can also apply Talagrand’s inequality to it.

3 Expectation Bound for Unbounded Regression Function

First, we will bound the difference between and in the norm.

Lemma 1

The definition of shows

(4)

Proof  Since , the definition of gives

Expanding

substituting into the above and rearranging gives

Substituting

into the above and applying the Cauchy–Schwarz inequality to the second term gives

For constants we have

for by completing the square and rearranging. Applying this result to the above inequality proves the lemma.  
We now take the expectation of (4).

Lemma 2

By bounding the expectation of the right-hand side of (4), we have

Proof  We have

by (3) and

The remainder of this proof method is due to Remark 6.1 of Sriperumbudur (2016). Since , we have

by the Cauchy–Schwarz inequality. By Jensen’s inequality and the independence of the , we have

and again, by Jensen’s inequality, we have

The result follows from Lemma 1.  
The next step is to move our bound on the expectation of the squared norm of to the expectation of the squared norm of .

Lemma 3

By using Rademacher processes, we can show

Proof  Let the be i.i.d. Rademacher random variables on for , independent of the . Lemma 2.3.1 of van der Vaart and Wellner (2013) shows

by symmetrisation. Since

for all , we find

is a contraction vanishing at 0 as a function of for all . By Theorem 3.2.1 of Giné and Nickl (2016), we have

We now follow a similar argument to the proof of Lemma 2. We have

By Jensen’s inequality, we have

and again, by Jensen’s inequality, we have

The result follows.  
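For reference, the two generic tools invoked in the proof above are the symmetrisation inequality and the contraction principle. In a form sufficient for our purposes (this formulation is ours; see the cited references for the precise statements), with $\varepsilon_1, \dots, \varepsilon_n$ i.i.d. Rademacher variables independent of the data and $\mathcal{F}$ a suitable countable class of functions,
\[
\mathbb{E} \sup_{f \in \mathcal{F}} \biggl| \frac{1}{n} \sum_{i=1}^n \bigl( f(X_i) - \mathbb{E} f(X_i) \bigr) \biggr|
\le 2\, \mathbb{E} \sup_{f \in \mathcal{F}} \biggl| \frac{1}{n} \sum_{i=1}^n \varepsilon_i f(X_i) \biggr|,
\]
while, for contractions $\varphi_1, \dots, \varphi_n$ vanishing at 0,
\[
\mathbb{E} \sup_{f \in \mathcal{F}} \biggl| \sum_{i=1}^n \varepsilon_i \varphi_i\bigl( f(X_i) \bigr) \biggr|
\le 2\, \mathbb{E} \sup_{f \in \mathcal{F}} \biggl| \sum_{i=1}^n \varepsilon_i f(X_i) \biggr| .
\]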
The next result follows as a simple consequence of Lemmas 2 and 3.

Corollary 4

Since , we have

All that remains is to combine Corollary 4 with (3) by using

and to let .

Theorem 5

We have

Based on this bound, the asymptotically optimal choice of is

which gives a bound of

4 Expectation Bound and Validation for Bounded Regression Function

We will now also assume that and clip our estimator. It follows from (3) that

and from Lemma 2 that

4.1 Expectation Bound

As in Section 3, the next step is to move our bound on the expectation of the squared norm of to the expectation of the squared norm of . Working with clipped functions allows us to achieve a difference between the two squared norms of order instead of order . This is substantially smaller.

Lemma 6

By using Rademacher processes, we can show

Proof  Let the be i.i.d. Rademacher random variables on for , independent of the . Lemma 2.3.1 of van der Vaart and Wellner (2013) shows

is at most

by symmetrisation. Since

for all , we find

is a contraction vanishing at 0 as a function of for all . By Theorem 3.2.1 of Giné and Nickl (2016), we have

is at most

Therefore,

is at most

by the triangle inequality, since . Again, by Theorem 3.2.1 of Giné and Nickl (2016), we have

is at most

since is a contraction vanishing at 0. We now follow a similar argument to the proof of Lemma 2. We have

By Jensen’s inequality, we have