Multi-task Regression using Minimal Penalties
In this paper we study the kernel multiple ridge regression framework, which we refer to as multi-task regression, using penalization techniques. The theoretical analysis of this problem shows that the key element appearing for an optimal calibration is the covariance matrix of the noise between the different tasks. We present a new algorithm to estimate this covariance matrix, based on the concept of minimal penalty, which was previously used in the single-task regression framework to estimate the variance of the noise. We show, in a non-asymptotic setting and under mild assumptions on the target function, that this estimator converges towards the covariance matrix. Then plugging this estimator into the corresponding ideal penalty leads to an oracle inequality. We illustrate the behavior of our algorithm on synthetic examples.
Editor: Tong Zhang
Keywords: multi-task, oracle inequality, learning theory
A classical paradigm in statistics is that increasing the sample size (that is, the number of observations) improves the performance of the estimators. However, in some cases it may be impossible to increase the sample size, for instance because of experimental limitations. Fortunately, in many situations practitioners can find several related and similar problems, and might use these problems as if more observations were available for the initial problem. The techniques using this heuristic are called “multi-task” techniques. In this paper we study the kernel ridge regression procedure in a multi-task framework.
One-dimensional kernel ridge regression, which we refer to as “single-task” regression, has been widely studied. As we briefly review in Section 3, one has, given data points , to estimate a function , often the conditional expectation , by minimizing the quadratic risk of the estimator regularized by a certain norm. A practically important problem is the calibration of the regularization parameter, that is, choosing its value directly from the data. For kernel ridge regression (a.k.a. smoothing splines), many methods have been proposed based on different principles, for example, Bayesian criteria through a Gaussian process interpretation (see, e.g., Rasmussen and Williams, 2006) or generalized cross-validation (see, e.g., Wahba, 1990). In this paper, we focus on the concept of minimal penalty, which was first introduced by Birgé and Massart (2007) and Arlot and Massart (2009) for model selection, then extended to linear estimators such as kernel ridge regression by Arlot and Bach (2011).
In this article we consider different (but related) regression tasks, a framework we refer to as “multi-task” regression. This setting has already been studied in several papers. Some empirically show that it can lead to performance improvement (Thrun and O’Sullivan, 1996; Caruana, 1997; Bakker and Heskes, 2003). Liang et al. (2010) also obtained a theoretical criterion (unfortunately not observable) that tells when this phenomenon asymptotically occurs. Several different paths have been followed to deal with this setting. Some consider a setting where , and formulate a sparsity assumption which makes it possible to use the group Lasso, assuming all the different functions have a small set of common active covariates (see for instance Obozinski et al., 2011; Lounici et al., 2010). We exclude this setting from our analysis, because of the Hilbertian nature of our problem, and thus will not consider the similarity between the tasks in terms of sparsity, but rather in terms of a Euclidean similarity. Another theoretical approach has also been taken (see, for example, Brown and Zidek (1980), Evgeniou et al. (2005), or Ando and Zhang (2005) on semi-supervised learning), with the authors often defining a theoretical framework where the multi-task problem can easily be expressed, and where solutions can sometimes be computed. The main remaining theoretical problem is the calibration of a matrix parameter (typically of size ), which characterizes the relationship between the tasks and extends the regularization parameter from single-task regression. Because of the high-dimensional nature of the problem (that is, the small number of training observations), usual techniques, like cross-validation, are not likely to succeed. Argyriou et al. (2008) take a similar approach to ours, but solve this problem by adding a convex constraint on the matrix, which will be discussed at the end of Section 5.
Through a penalization technique we show in Section 2 that the only element we have to estimate is the covariance matrix of the noise between the tasks. We give here a new algorithm to estimate , and show that the estimation is sharp enough to derive an oracle inequality for the selection of the task similarity matrix , both with high probability and in expectation. Finally we give some simulation experiment results and show that our technique correctly deals with the multi-task setting with a low sample size.
We now introduce some notations, which will be used throughout the article.
The integer is the sample size, the integer is the number of tasks.
For any matrix , we define
that is, the vector in which the columns are stacked.
is the set of all matrices of size .
is the set of symmetric matrices of size .
is the set of symmetric positive-semidefinite matrices of size .
is the set of symmetric positive-definite matrices of size .
denotes the partial ordering on defined by: if and only if .
is the vector of size whose components are all equal to .
is the usual Euclidean norm on for any : , .
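As a concrete illustration of the column-stacking operation used throughout the paper, here is a minimal NumPy sketch (the matrix Y below is a hypothetical example, not data from the paper):

```python
import numpy as np

# Stack the columns of a matrix into a single vector (the "vec" operator).
Y = np.array([[1.0, 4.0],
              [2.0, 5.0],
              [3.0, 6.0]])  # an n x p matrix (here n = 3 observations, p = 2 tasks)

# Column-major (Fortran-order) flattening stacks the columns on top of each other.
y = Y.flatten(order="F")
# y is now [1, 2, 3, 4, 5, 6]
```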
2 Multi-task Regression: Problem Set-up
We consider kernel ridge regression tasks. Treating them simultaneously and sharing their common structure (e.g., being close in some metric space) will help in reducing the overall prediction error.
2.1 Multi-task with a Fixed Kernel
Let be some set and a set of real-valued functions over . We suppose has a reproducing kernel Hilbert space (RKHS) structure (Aronszajn, 1950), with kernel and feature map . We observe , which gives us the positive semidefinite kernel matrix . For each task , is a sample with distribution , for which a simple regression problem has to be solved. In this paper we assume for simplicity that the different tasks share the same design . When the designs of the different tasks differ, the analysis carries over similarly by defining , but the notation would be more cumbersome.
We now define the model. We assume , is a symmetric positive-definite matrix of size such that the vectors are i.i.d. with normal distribution , with mean zero and covariance matrix , and
This means that, while the observations are independent, the outputs of the different tasks can be correlated, with correlation matrix between the tasks. We now place ourselves in the fixed-design setting, that is, is deterministic and the goal is to estimate . Let us introduce some notation:
(resp. ) denotes the smallest (resp. largest) eigenvalue of .
is the condition number of .
To obtain compact equations, we will use the following definition:
We denote by the matrix and introduce the vector , obtained by stacking the columns of . Similarly we define , , and .
In order to estimate , we use a regularization procedure, which extends the classical ridge regression of the single-task setting. Let be a matrix, symmetric and positive-definite. Generalizing the work of Evgeniou et al. (2005), we estimate by
Although could have a general unconstrained form, we may restrict to certain forms, for either computational or statistical reasons.
Requiring that implies that Equation (2) is a convex optimization problem, which can be solved through the resolution of a linear system, as explained later. Moreover it allows an RKHS interpretation, which will also be explained later.
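To make the linear-system resolution concrete, here is a minimal sketch of a multi-task ridge fit. It assumes the multi-task kernel matrix is the Kronecker product of the inverse task-similarity matrix with the kernel matrix of the shared design, with all regularization constants absorbed into M; the function name and the n·p normalization are our own illustrative choices, not the paper's exact formulas.

```python
import numpy as np

def multitask_ridge_fit(K, M, Y):
    """Sketch of a multi-task ridge estimator (constants absorbed into M).

    K : n x n kernel matrix of the shared design
    M : p x p symmetric positive-definite task-similarity matrix
    Y : n x p matrix of outputs, one column per task
    Returns the n x p matrix of fitted values.
    """
    n, p = Y.shape
    K_tilde = np.kron(np.linalg.inv(M), K)   # np x np multi-task kernel matrix
    y = Y.flatten(order="F")                 # stack the columns of Y
    # Smoother matrix of the regularized least-squares problem.
    A = K_tilde @ np.linalg.inv(K_tilde + n * p * np.eye(n * p))
    return (A @ y).reshape((n, p), order="F")
```

For M equal to the identity, K_tilde is block diagonal and the estimator decouples into p independent single-task ridge regressions, one per column of Y.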
As done by Evgeniou et al. (2005), for every , define
Taking in Equation (2) leads to the criterion
Minimizing Equation (4) enforces a regularization on both the norms of the functions and the norms of the differences . Thus, matrices of the form are useful when the functions are assumed to be similar in . One of the main contributions of the paper is to go beyond this case and learn from data a more general similarity matrix between tasks.
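For concreteness, one hypothetical parametrization of such a similarity matrix (our own illustration, in the spirit of Evgeniou et al., 2005) penalizes the individual norms with a weight lam and all pairwise differences between tasks with a weight mu:

```python
import numpy as np

def similarity_matrix(lam, mu, p):
    """Hypothetical task-similarity matrix: lam * I + mu * (p * I - ones).

    The quadratic form f^T M f then equals
    lam * sum_j f_j^2 + (mu / 2) * sum_{j,k} (f_j - f_k)^2,
    penalizing both individual norms and pairwise differences.
    """
    ones = np.ones((p, p))
    return lam * np.eye(p) + mu * (p * np.eye(p) - ones)
```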
We extend Example 4 to the case where the tasks consist of two groups of close tasks. Let be a subset of , of cardinality . Let us denote by the complement of in , the vector with components , and the diagonal matrix with components . We then define
This matrix leads to the following criterion, which enforces a regularization on both the norms of the functions and the norms of the differences inside the groups and :
As stated below (Proposition 8), acts as a scalar product between the tasks. Selecting a general matrix is thus a way to express a similarity between tasks.
Following Evgeniou et al. (2005), we define the vector-space of real-valued functions over by
We now define a bilinear symmetric form over ,
With the preceding notations is a scalar product on .
is a RKHS.
In order to write down the kernel matrix in compact form, we introduce the following notations.
Definition 10 (Kronecker Product)
Let , . We define the Kronecker product as being the matrix built with blocks, the block of index being :
The kernel matrix associated with the design and the RKHS is .
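The block structure of the Kronecker product in Definition 10 can be checked directly with NumPy (a generic illustration with small hypothetical matrices, not the paper's specific kernel matrices):

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
B = np.eye(2)

# np.kron builds the block matrix whose (i, j) block is A[i, j] * B.
K = np.kron(A, B)
```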
2.2 Optimal Choice of the Kernel
When working in multi-task regression, a set of matrices is given, and the goal is to select the “best” one, that is, the one minimizing over the quadratic risk . For instance, the single-task framework corresponds to and . The multi-task case is far richer. The oracle risk is defined as
The ideal choice, called the oracle, is any matrix
Nothing here ensures the oracle exists. However in some special cases (see for instance Example 12) the infimum of over the set may be attained by a function —which we will call “oracle” by a slight abuse of notation—while the former problem does not have a solution.
From now on we always suppose that the infimum of over is attained by some function . However the oracle is not an estimator, since it depends on .
Example 12 (Partial computation of the oracle in a simple setting)
It is possible in certain simple settings to exactly compute the oracle (or, at least, some part of it). Consider for instance the set-up where the functions are taken to be equal (that is, ). In this setting it is natural to use the set
Using the estimator we can then compute the quadratic risk using the bias-variance decomposition given in Equation (36):
Computations (reported in Appendix D) show that, with the change of variables , the bias does not depend on and the variance is a decreasing function of . Thus the oracle is obtained when , leading to a situation where the oracle functions verify . It is also noticeable that, if one assumes the maximal eigenvalue of stays bounded with respect to , the variance is of order while the bias is bounded with respect to .
As explained by Arlot and Bach (2011), we choose
where the penalty term has to be chosen appropriately.
Our model (1) does not constrain the functions . Our way to express the similarities between the tasks (that is, between the ) is via the set , which represents the a priori knowledge the statistician has about the problem. Our goal is to build an estimator whose risk is as close as possible to the oracle risk. Of course, using an inappropriate set (with respect to the target functions ) may lead to poor overall performance. Explicit multi-task settings are given in Examples 3, 4 and 5 and through simulations in Section 6.
The unbiased risk estimation principle (introduced by Akaike, 1970) requires
which leads to the (deterministic) ideal penalty
Since and , we can write
Since is centered and is deterministic, we get, up to an additive factor independent of ,
that is, as the covariance matrix of is ,
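As a hedged numerical sketch of this computation: under the paper's noise model the rows of the noise matrix are i.i.d. Gaussian with covariance Sigma, so the column-stacked noise vector has covariance kron(Sigma, I_n); the exact normalization (here n * p) is our assumption for illustration only.

```python
import numpy as np

def ideal_penalty(A, Sigma, n):
    """Sketch of an ideal penalty 2 * tr(A @ Cov) / (n * p).

    A     : (n*p) x (n*p) smoother matrix of the multi-task estimator
    Sigma : p x p noise covariance matrix between the tasks
    The covariance of the column-stacked noise vector is kron(Sigma, I_n);
    the 1 / (n * p) normalization is an illustrative assumption.
    """
    p = Sigma.shape[0]
    cov = np.kron(Sigma, np.eye(n))
    return 2.0 * np.trace(A @ cov) / (n * p)
```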
In order to approach this penalty as precisely as possible, we have to sharply estimate . In the single-task case, such a problem reduces to estimating the variance of the noise and was tackled by Arlot and Bach (2011). Since our approach for estimating heavily relies on these results, they are summarized in the next section.
Note that estimating is a means towards estimating . The technique we develop later for this purpose is not purely a multi-task technique, and may also be used in a different context.
3 Single Task Framework: Estimating a Single Variance
and ; the ideal penalty becomes
By analogy with the case where is an orthogonal projection matrix, is called the effective degree of freedom, first introduced by Mallows (1973); see also the work by Zhang (2005). The ideal penalty however depends on ; in order to have a fully data-driven penalty we have to replace by an estimator inside . For every , define
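As a sketch, the effective degrees of freedom of a kernel ridge smoother can be computed as the trace of the smoother matrix; the Gaussian kernel and the smoother form K (K + n*lambda*I)^(-1) used below are standard choices taken for illustration.

```python
import numpy as np

def ridge_smoother(K, lam):
    """Smoother matrix A_lambda = K (K + n*lambda*I)^(-1) of kernel ridge regression."""
    n = K.shape[0]
    return K @ np.linalg.inv(K + n * lam * np.eye(n))

def degrees_of_freedom(K, lam):
    """Effective degrees of freedom df(lambda) = tr(A_lambda)."""
    return np.trace(ridge_smoother(K, lam))

# Toy Gaussian kernel matrix on 5 random design points.
rng = np.random.default_rng(0)
X = rng.standard_normal((5, 2))
K = np.exp(-np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1))
```

For an invertible kernel matrix, df(lambda) decreases from n towards 0 as lambda grows, matching the intuition that heavier regularization yields a simpler fit.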
We shall see now that it is a minimal penalty in the following sense. If for every
then—up to concentration inequalities— acts as a minimizer of
The preceding theoretical arguments show that
if , decreases with so that is huge: the procedure overfits;
if , increases with when is large enough so that is much smaller than when .
The following algorithm was introduced by Arlot and Bach (2011) and uses this fact to estimate .
For every , compute
Output: such that .
An efficient algorithm for the first step of Algorithm 14 is detailed by Arlot and Massart (2009), and we discuss the way we implemented Algorithm 14 in Section 6. The output of Algorithm 14 is a provably consistent estimator of , as stated in the following theorem.
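The jump heuristic behind Algorithm 14 can be sketched as follows. The grids, the minimal-penalty shape (2 df - tr(A A^T)) / n following Arlot and Bach (2011), and the n/2 threshold are illustrative choices, not the paper's exact implementation.

```python
import numpy as np

def minimal_penalty_variance(y, K, lambdas, C_grid):
    """Sketch of a minimal-penalty variance estimator (Algorithm 14 style).

    For each candidate constant C, select lambda minimizing the penalized
    empirical risk ||y - A_l y||^2 / n + C * (2 tr(A_l) - tr(A_l A_l^T)) / n,
    then return the smallest C whose selected degrees of freedom drop below
    n / 2 (the threshold is illustrative).
    """
    n = len(y)
    I = np.eye(n)
    smoothers = [K @ np.linalg.inv(K + n * lam * I) for lam in lambdas]
    dfs = np.array([np.trace(A) for A in smoothers])
    pens = np.array([(2 * np.trace(A) - np.trace(A @ A.T)) / n for A in smoothers])
    risks = np.array([np.sum((y - A @ y) ** 2) / n for A in smoothers])
    for C in sorted(C_grid):
        crit = risks + C * pens
        if dfs[np.argmin(crit)] <= n / 2:
            return C  # estimate of the noise variance
    return max(C_grid)
```

The point of the sketch is the jump phenomenon: below the noise variance the selected degrees of freedom stay close to n (overfitting), and above it they collapse, so the location of the jump estimates the variance.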
Theorem 15 (Corollary of Theorem 1 of Arlot and Bach, 2011)
Let . Suppose with , and that and exist such that
Then for every , some constant and an event exist such that and if , on ,
The values and in Algorithm 14 have no particular meaning and can be replaced by , , with . Only depends on and . Also the bounds required in Assumption (8) only impact the right hand side of Equation (9) and are chosen to match the left hand side. See Proposition 10 of Arlot and Bach (2011) for more details.
4 Estimation of the Noise Covariance Matrix
Thanks to the results developed by Arlot and Bach (2011) (summarized in Section 3), we know how to estimate a variance for any one-dimensional problem. In order to estimate , which has parameters, we can use several one-dimensional problems. Projecting onto some direction yields
with and . Therefore, we will estimate for a well-chosen set, and use these estimators to reconstruct an estimate of .
We now explain how to estimate using those one-dimensional projections.
The idea is to apply Algorithm 14 to the elements of a carefully chosen set . Denoting by the -th vector of the canonical basis of , we introduce . We can see that estimates , while estimates . Hence, can be estimated by . This leads to the definition of the following map , which builds a symmetric matrix using the latter construction.
Let be defined by
This map is bijective, and for all
This leads us to defining the following estimator of :
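The reconstruction above rests on the polarization identity Var(u + v) = Var(u) + Var(v) + 2 Cov(u, v). Here is a minimal sketch of that step (the data layout of the inputs is our own illustrative choice):

```python
import numpy as np

def reconstruct_covariance(var_single, var_sums):
    """Rebuild a symmetric matrix from one-dimensional variance estimates.

    var_single[i]  : estimated variance of the projection onto e_i
    var_sums[i][j] : estimated variance of the projection onto e_i + e_j (i < j)
    Off-diagonal entries follow from the polarization identity
    Cov(u, v) = (Var(u + v) - Var(u) - Var(v)) / 2.
    """
    p = len(var_single)
    S = np.diag(np.asarray(var_single, dtype=float))
    for i in range(p):
        for j in range(i + 1, p):
            S[i, j] = S[j, i] = (var_sums[i][j] - var_single[i] - var_single[j]) / 2
    return S
```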
Let us recall that , . Following Arlot and Bach (2011) we make the following assumption from now on:
We can now state the first main result of the paper.
Theorem 20 is proved in Section E. It shows that estimates with a “multiplicative” error controlled with large probability, in a non-asymptotic setting. The multiplicative nature of the error is crucial for deriving the oracle inequality stated in Section 5, since it allows us to show that the ideal penalty defined in Equation (7) is precisely estimated when is replaced by .
An important feature of Theorem 20 is that it holds under very mild assumptions on the mean of the data (see Remark 22). Therefore, it shows that is able to estimate a covariance matrix without prior knowledge of the regression function, which, to the best of our knowledge, has never been obtained in multi-task regression.
Remark 21 (Scaling of for consistency)
A sufficient condition for ensuring is a consistent estimator of is
which enforces a scaling between , and . Nevertheless, this condition is probably not necessary since the simulation experiments of Section 6 show that can be well estimated (at least for estimator selection purposes) in a setting where .
Remark 22 (On assumption (13))
Assumption (13) is a single-task assumption (made independently for each task). The upper bound can be multiplied by any factor (as in Theorem 15), at the price of multiplying by in the upper bound of Equation (14). More generally the bounds on the degree of freedom and the bias in (13) only influence the upper bound of Equation (14). The rates are chosen here to match the lower bound, see Proposition 10 of Arlot and Bach (2011) for more details.
Remark 23 (Choice of the set )
Other choices could have been made for ; however, ours seems easier in terms of computation, since . Choosing a larger set leads to theoretical difficulties in the reconstruction of , while taking other basis vectors leads to more complex computations. We can also note that increasing decreases the probability in Theorem 20, since it comes from a union bound over the one-dimensional estimations.
5 Oracle Inequality
This section aims at proving “oracle inequalities”, as usually done in a model selection setting: given a set of models or of estimators, the goal is to upper bound the risk of the selected estimator by the oracle risk (defined by Equation (6)), up to an additive term and a multiplicative factor. We show two oracle inequalities (Theorems 26 and 29) that correspond to two possible definitions of .
Note that “oracle inequality” sometimes has a different meaning in the literature (see for instance Lounici et al., 2011) when the risk of the proposed estimator is controlled by the risk of an estimator using information coming from the true parameter (that is, available only if provided by an oracle).
5.1 A General Result for Discrete Matrix Sets
Let be the estimator of defined by Equation (11). We define
We assume now the following holds true:
5.2 A Result for a Continuous Set of Jointly Diagonalizable Matrices
We now show a similar result when matrices in can be jointly diagonalized. It turns out a faster algorithm can be used instead of Equation (11) with a reduced error and a larger probability event in the oracle inequality. Note that we no longer assume is finite, so it can be parametrized by continuous parameters.
Suppose now the following holds, which means the matrices of are jointly diagonalizable:
Equation (19) shows that under Assumption (18), we do not need to estimate the entire matrix in order to have a good penalization procedure, but only to estimate the variance of the noise in directions.
Let be the canonical basis of , be the orthogonal basis defined by . We then define
where for every , denotes the output of Algorithm 14 applied to Problem (), and
Remark 31 (Scaling of )
It is to be noted that still influences the left hand side via .
Theorems 26 and 29 are non-asymptotic oracle inequalities, with a multiplicative term of the form . This allows us to claim that our selection procedure is nearly optimal, since our estimator is close (with respect to the empirical quadratic norm) to the oracle one. Furthermore, the term in front of the infima in Equations (16), (21), (17) and (22) can be reduced, but at the price of a larger remainder term.
Remark 33 (On assumption (18))
When Assumption (18) holds, the optimization problem is quite easy to solve, as detailed in Remark 36. If it does not hold, solving (20) may turn out to be a hard problem, and our theoretical results do not cover this setting. However, it is always possible to discretize the set or, in practice, to use gradient descent.
The main theoretical limitation comes from the fact that the probabilistic concentration tools used apply to discrete sets (through union bounds). The structure of kernel ridge regression allows us to have a uniform control over a continuous set for the single-task estimators at the “cost” of pointwise controls, which can then be extended to the multi-task setting via (18). We conjecture Theorem 29 still holds without (18) as long as is not “too large”, which could be proved similarly up to some uniform concentration inequalities.
Note also that if