Multitask Regression using Minimal Penalties
Abstract
In this paper we study the kernel multiple ridge regression framework, which we refer to as multitask regression, using penalization techniques. The theoretical analysis of this problem shows that the key element appearing for an optimal calibration is the covariance matrix of the noise between the different tasks. We present a new algorithm to estimate this covariance matrix, based on the concept of minimal penalty, which was previously used in the single-task regression framework to estimate the variance of the noise. We show, in a non-asymptotic setting and under mild assumptions on the target function, that this estimator converges towards the covariance matrix. Then plugging this estimator into the corresponding ideal penalty leads to an oracle inequality. We illustrate the behavior of our algorithm on synthetic examples.
ENS; Sierra Project-team
Département d'Informatique de l'École Normale Supérieure
(CNRS/ENS/INRIA UMR 8548)
23, avenue d'Italie, CS 81321
75214 Paris Cedex 13, France

Sylvain Arlot sylvain.arlot@ens.fr
CNRS; Sierra Project-team
Département d'Informatique de l'École Normale Supérieure
(CNRS/ENS/INRIA UMR 8548)
23, avenue d'Italie, CS 81321
75214 Paris Cedex 13, France

Francis Bach francis.bach@ens.fr
INRIA; Sierra Project-team
Département d'Informatique de l'École Normale Supérieure
(CNRS/ENS/INRIA UMR 8548)
23, avenue d'Italie, CS 81321
75214 Paris Cedex 13, France
Editor: Tong Zhang
Keywords: multitask, oracle inequality, learning theory
1 Introduction
A classical paradigm in statistics is that increasing the sample size (that is, the number of observations) improves the performance of the estimators. However, in some cases it may be impossible to increase the sample size, for instance because of experimental limitations. Fortunately, in many situations practitioners can find many related and similar problems, and might use these problems as if more observations were available for the initial problem. The techniques using this heuristic are called "multitask" techniques. In this paper we study the kernel ridge regression procedure in a multitask framework.
One-dimensional kernel ridge regression, which we refer to as "single-task" regression, has been widely studied. As we briefly review in Section 3, one has to estimate a function from data points, often the conditional expectation of the output given the input, by minimizing the quadratic risk of the estimator, regularized by a certain norm. A practically important task is to calibrate a regularization parameter, that is, to estimate it directly from data. For kernel ridge regression (a.k.a. smoothing splines), many methods have been proposed, based on different principles, for example Bayesian criteria through a Gaussian process interpretation (see, e.g., Rasmussen and Williams, 2006) or generalized cross-validation (see, e.g., Wahba, 1990). In this paper, we focus on the concept of minimal penalty, which was first introduced by Birgé and Massart (2007) and Arlot and Massart (2009) for model selection, then extended to linear estimators such as kernel ridge regression by Arlot and Bach (2011).
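For concreteness, the single-task estimator that this framework generalizes can be sketched in a few lines of NumPy; the Gaussian kernel, the bandwidth and the value of the regularization parameter below are illustrative choices, not prescriptions from the paper:

```python
import numpy as np

def gaussian_kernel(X, bandwidth=1.0):
    """Gram matrix K[i, l] = exp(-(X_i - X_l)^2 / (2 * bandwidth^2))."""
    sq_dists = (X[:, None] - X[None, :]) ** 2
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))

def ridge_fit(K, y, lam):
    """Kernel ridge fitted values: solve (K + n*lam*I) alpha = y, return K @ alpha."""
    n = K.shape[0]
    alpha = np.linalg.solve(K + n * lam * np.eye(n), y)
    return K @ alpha

# Toy one-dimensional regression problem.
rng = np.random.default_rng(0)
X = np.linspace(0.0, 1.0, 50)
y = np.sin(2.0 * np.pi * X) + 0.1 * rng.standard_normal(50)
y_hat = ridge_fit(gaussian_kernel(X, bandwidth=0.2), y, lam=1e-3)
```

Calibrating `lam` from the data is exactly the problem the minimal-penalty approach addresses.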
In this article we consider several different (but related) regression tasks, a framework we refer to as "multitask" regression. This setting has already been studied in several papers. Some empirically show that it can lead to performance improvements (Thrun and O'Sullivan, 1996; Caruana, 1997; Bakker and Heskes, 2003). Liang et al. (2010) also obtained a theoretical criterion (unfortunately unobservable) which tells when this phenomenon asymptotically occurs. Several different paths have been followed to deal with this setting. Some consider a high-dimensional linear setting and formulate a sparsity assumption which enables the use of the group Lasso, assuming all the different functions have a small set of common active covariates (see for instance Obozinski et al., 2011; Lounici et al., 2010). We exclude this setting from our analysis, because of the Hilbertian nature of our problem, and thus will not consider the similarity between the tasks in terms of sparsity, but rather in terms of a Euclidean similarity. Another theoretical approach has also been taken (see, for example, Brown and Zidek, 1980, Evgeniou et al., 2005, or Ando and Zhang, 2005, on semi-supervised learning), the authors often defining a theoretical framework where the multitask problem can easily be expressed, and where solutions can sometimes be computed. The main remaining theoretical problem is the calibration of a matrix parameter (typically of size p × p, where p is the number of tasks), which characterizes the relationship between the tasks and extends the regularization parameter from single-task regression. Because of the high-dimensional nature of the problem (that is, the small number of training observations), usual techniques like cross-validation are not likely to succeed. Argyriou et al. (2008) have an approach similar to ours, but solve this problem by adding a convex constraint on the matrix, which will be discussed at the end of Section 5.
Through a penalization technique we show in Section 2 that the only element we have to estimate is the covariance matrix Σ of the noise between the tasks. We give here a new algorithm to estimate Σ, and show that the estimation is sharp enough to derive an oracle inequality for the selection of the task-similarity matrix M, both with high probability and in expectation. Finally we give some simulation experiment results and show that our technique correctly deals with the multitask setting with a low sample size.
1.1 Notations
We now introduce some notations, which will be used throughout the article.

The integer n is the sample size; the integer p is the number of tasks.

For any n × p matrix Y, we define vec(Y) := (Y_{1,1}, …, Y_{n,1}, Y_{1,2}, …, Y_{n,p})ᵀ ∈ ℝ^{np}, that is, the vector in which the columns of Y are stacked.

M_p(ℝ) is the set of all square real matrices of size p.

S_p(ℝ) is the set of symmetric real matrices of size p.

S_p^+(ℝ) is the set of symmetric positive-semidefinite real matrices of size p.

S_p^{++}(ℝ) is the set of symmetric positive-definite real matrices of size p.

⪯ denotes the partial ordering on S_p(ℝ) defined by: A ⪯ B if and only if B − A ∈ S_p^+(ℝ).

1 denotes the vector of size p whose components are all equal to 1.

‖·‖₂ is the usual Euclidean norm on ℝ^k for any k ≥ 1: for every u ∈ ℝ^k, ‖u‖₂ := (Σ_{i=1}^k u_i²)^{1/2}.
2 Multitask Regression: Problem Setup
We consider p kernel ridge regression tasks. Treating them simultaneously and sharing their common structure (e.g., being close in some metric space) will help in reducing the overall prediction error.
2.1 Multitask with a Fixed Kernel
Let X be some set and F a set of real-valued functions over X. We suppose F has a reproducing kernel Hilbert space (RKHS) structure (Aronszajn, 1950), with kernel k and feature map Φ. We observe the design points X_1, …, X_n ∈ X, which give us the positive semidefinite kernel matrix K := (k(X_i, X_l))_{1≤i,l≤n}. For each task j ∈ {1, …, p}, a sample is observed, for which a simple regression problem has to be solved. In this paper we consider for simplicity that the different tasks share the same design (X_i)_{1≤i≤n}. When the designs of the different tasks are different, the analysis carries over similarly by working on the union of the designs, but the notations would be more complicated.
We now define the model. We assume that f^1, …, f^p ∈ F, that Σ is a symmetric positive-definite matrix of size p such that the noise vectors (ε_i)_{1≤i≤n} are i.i.d. with normal distribution N(0, Σ), with mean zero and covariance matrix Σ, and

∀ i ∈ {1, …, n}, ∀ j ∈ {1, …, p},   y_i^j = f^j(X_i) + ε_i^j .   (1)

This means that, while the observations are independent, the outputs of the different tasks can be correlated, with covariance matrix Σ between the tasks. We now place ourselves in the fixed-design setting, that is, the design (X_i)_{1≤i≤n} is deterministic and the goal is to estimate the values (f^j(X_i))_{1≤i≤n, 1≤j≤p}. Let us introduce some notation:

μ_min(Σ) (resp. μ_max(Σ)) denotes the smallest (resp. largest) eigenvalue of Σ.

c(Σ) := μ_max(Σ)/μ_min(Σ) is the condition number of Σ.
To obtain compact equations, we will use the following definition:
Definition 1
We denote by Y the n × p matrix (y_i^j)_{1≤i≤n, 1≤j≤p} and introduce the vector y := vec(Y) ∈ ℝ^{np}, obtained by stacking the columns of Y. Similarly we define the matrices F := (f^j(X_i))_{i,j} and E := (ε_i^j)_{i,j}, and the vectors f := vec(F) and ε := vec(E).
In order to estimate f, we use a regularization procedure, which extends the classical ridge regression of the single-task setting. Let M be a p × p matrix, symmetric and positive-definite. Generalizing the work of Evgeniou et al. (2005), we estimate f = (f^1, …, f^p) by

F̂_M ∈ argmin_{g = (g^1, …, g^p) ∈ F^p} { (1/(np)) Σ_{j=1}^p Σ_{i=1}^n (g^j(X_i) − y_i^j)² + Σ_{j,l=1}^p M_{j,l} ⟨g^j, g^l⟩_F } .   (2)
Although M could have a general unconstrained form, we may restrict it to certain forms, for either computational or statistical reasons.
Remark 2
Requiring that M ∈ S_p^{++}(ℝ) implies that Equation (2) is a convex optimization problem, which can be solved through the resolution of a linear system, as explained later. Moreover it allows an RKHS interpretation, which will also be explained later.
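As a sketch of this linear-system resolution: assuming (this is our reading, not a verbatim formula from the paper) that the fitted values take the smoother form ŷ_M = K̃(K̃ + np I)⁻¹ y with K̃ = M⁻¹ ⊗ K, which reduces to the usual single-task smoother K(K + nλI)⁻¹ when p = 1 and M = λ, the estimator can be computed as follows:

```python
import numpy as np

def multitask_ridge_fit(K, Y, M):
    """Fitted values for p tasks sharing one n x n kernel matrix K.

    Y is the n x p output matrix; M is a p x p symmetric positive-definite
    matrix playing the role of the regularization parameter.  The columns
    of Y are stacked into y = vec(Y), the smoother
    K_tilde @ (K_tilde + n * p * I)^{-1} with K_tilde = kron(inv(M), K)
    is applied, and the result is reshaped back to n x p.
    """
    n, p = Y.shape
    K_tilde = np.kron(np.linalg.inv(M), K)
    y = Y.reshape(-1, order="F")  # stack the columns of Y
    y_hat = K_tilde @ np.linalg.solve(K_tilde + n * p * np.eye(n * p), y)
    return y_hat.reshape(n, p, order="F")
```

For p = 1 and M = (λ), this reproduces single-task kernel ridge regression exactly.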
Example 3
Example 4
As done by Evgeniou et al. (2005), for every λ, μ > 0, define
Taking this matrix in Equation (2) leads to the criterion
(4) 
Minimizing Equation (4) enforces a regularization on both the norms of the functions g^j and the norms of the differences g^j − g^l. Thus, matrices of this form are useful when the functions are assumed to be similar in F. One of the main contributions of the paper is to go beyond this case and learn from data a more general similarity matrix M between tasks.
Example 5
We extend Example 4 to the case where the tasks consist of two groups of close tasks. Let I be a subset of {1, …, p} and I^c its complement in {1, …, p}. Denote by 1_I the vector whose j-th component is 1 if j ∈ I and 0 otherwise, and by Diag(1_I) the corresponding diagonal matrix. We then define
This matrix leads to the following criterion, which enforces a regularization on both the norms of the functions g^j and the norms of the differences inside the groups I and I^c:
(5) 
As shown in Section 6, we can estimate the set I from data (see Jacob et al., 2008, for a more general formulation).
Remark 6
Remark 7
As stated below (Proposition 8), the matrix M acts as a scalar product between the tasks. Selecting a general matrix M is thus a way to express a similarity between tasks.
Following Evgeniou et al. (2005), we define a vector space of real-valued functions over X × {1, …, p}, together with a bilinear symmetric form on it, which is a scalar product as soon as M is positive semidefinite (see proof in Appendix A) and leads to an RKHS (see proof in Appendix B):
Proposition 8
With the preceding notations, this bilinear form is a scalar product on the space defined above.
Corollary 9
This space, equipped with the scalar product of Proposition 8, is an RKHS.
In order to write down the kernel matrix in compact form, we introduce the following notations.
Definition 10 (Kronecker Product)
Let A be an m × n matrix and B a q × r matrix. We define the Kronecker product A ⊗ B as the mq × nr matrix built from m × n blocks of size q × r, the block of index (i, j) being A_{i,j} · B:
The Kronecker product is a widely used tool when dealing with matrices and tensor products. Some of its classical properties are given in Appendix E; see also Horn and Johnson (1991).
Proposition 11
The kernel matrix associated with the design (X_i)_{1≤i≤n} and the RKHS of Corollary 9 is M^{-1} ⊗ K.
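The Kronecker structure also describes the noise: since the rows ε_i of the noise matrix are i.i.d. N(0, Σ) and vec stacks columns, the covariance matrix of the stacked noise vector is Σ ⊗ I_n. A quick numerical sanity check (the dimensions and the matrix Σ below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 4, 2
Sigma = np.array([[1.0, 0.6],
                  [0.6, 2.0]])

# Draw N noise matrices of size n x p whose rows are i.i.d. N(0, Sigma).
N = 200_000
L = np.linalg.cholesky(Sigma)
E = rng.standard_normal((N, n, p)) @ L.T

# vec(E_k): stack the columns of each n x p sample into a vector of size n*p.
vecs = E.transpose(0, 2, 1).reshape(N, n * p)
emp_cov = vecs.T @ vecs / N

# The empirical covariance matches kron(Sigma, I_n): task-blocks of size n.
max_err = np.max(np.abs(emp_cov - np.kron(Sigma, np.eye(n))))
```

The block of index (j, l) of the empirical covariance is approximately Σ_{j,l} I_n, which is exactly the ordering convention of the Kronecker product above.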
2.2 Optimal Choice of the Kernel
When working in multitask regression, a set M of candidate matrices is given, and the goal is to select the "best" one, that is, the one minimizing over M the quadratic risk (1/(np)) E‖ŷ_M − f‖₂². For instance, the single-task framework corresponds to p = 1 and a set of positive scalar parameters. The multitask case is far richer. The oracle risk is defined as
inf_{M ∈ M} { (1/(np)) E‖ŷ_M − f‖₂² } .   (6)

The ideal choice, called the oracle, is any matrix

M* ∈ argmin_{M ∈ M} { (1/(np)) E‖ŷ_M − f‖₂² } .
Nothing here ensures the oracle exists. However, in some special cases (see for instance Example 12) the infimum of the risk over the set M may be attained only by a limiting function, which we will call "oracle" by a slight abuse of notation, while the minimization problem itself has no solution. From now on we always suppose that the infimum of the risk over M is attained. However, the oracle is not an estimator, since it depends on the unknown f.
Example 12 (Partial computation of the oracle in a simple setting)
It is possible in certain simple settings to compute the oracle exactly (or, at least, some part of it). Consider for instance the setup where the functions are all equal (that is, f^1 = ⋯ = f^p). In this setting it is natural to use the set
Using this estimator we can then compute the quadratic risk via the bias-variance decomposition given in Equation (36):
Computations (reported in Appendix D) show that, after a suitable change of variables, the bias does not depend on the similarity parameter while the variance is a decreasing function of it. Thus the oracle is obtained in the limit of maximal similarity, leading to a situation where the oracle single-task functions coincide. It is also noticeable that, if one assumes the relevant largest eigenvalue stays bounded with respect to n, the variance vanishes as n grows while the bias remains bounded with respect to n.
As explained by Arlot and Bach (2011), we choose

M̂ ∈ argmin_{M ∈ M} { (1/(np)) ‖ŷ_M − y‖₂² + pen(M) } ,

where the penalty term pen(M) has to be chosen appropriately.
Remark 13
Our model (1) does not constrain the functions f^j. Our way to express the similarities between the tasks (that is, between the functions f^j) is via the set M, which represents the a priori knowledge the statistician has about the problem. Our goal is to build an estimator whose risk is the closest possible to the oracle risk. Of course, using an inappropriate set M (with respect to the target functions f^j) may lead to bad overall performance. Explicit multitask settings are given in Examples 3, 4 and 5, and through simulations in Section 6.
The unbiased risk estimation principle (introduced by Akaike, 1970) requires, for every M ∈ M,

E[ (1/(np)) ‖ŷ_M − y‖₂² + pen(M) ] ≈ E[ (1/(np)) ‖ŷ_M − f‖₂² ] ,

which leads to the (deterministic) ideal penalty

pen_id(M) := E[ (1/(np)) ‖ŷ_M − f‖₂² ] − E[ (1/(np)) ‖ŷ_M − y‖₂² ] .

Since y = f + ε and ŷ_M = A_M y, where A_M denotes the matrix of the linear estimator, we can write

‖ŷ_M − f‖₂² − ‖ŷ_M − y‖₂² = 2 ⟨A_M y, ε⟩ − ‖ε‖₂² − 2 ⟨f, ε⟩ .

Since ε is centered and A_M is deterministic, we get, up to an additive factor independent of M,

pen_id(M) = (2/(np)) E[ ⟨A_M ε, ε⟩ ] ,

that is, as the covariance matrix of ε is Σ ⊗ I_n,

pen_id(M) = (2/(np)) tr( A_M (Σ ⊗ I_n) ) .   (7)
In order to approach this penalty as precisely as possible, we have to sharply estimate Σ. In the single-task case, such a problem reduces to estimating the variance of the noise, and was tackled by Arlot and Bach (2011). Since our approach for estimating Σ heavily relies on these results, they are summarized in the next section.
Note that estimating Σ is a means towards estimating the ideal penalty. The technique we develop later for this purpose is not purely a multitask technique, and may also be used in a different context.
3 Single Task Framework: Estimating a Single Variance
This section recalls some of the main results from Arlot and Bach (2011), which can be considered as solving a special case of Section 2, with p = 1, a scalar regularization parameter λ > 0 playing the role of M, and Σ = σ² > 0. Writing the fitted values as ŷ_λ = A_λ y with the smoother matrix A_λ := K (K + nλ I_n)^{-1}, the ideal penalty becomes

pen_id(λ) = (2 σ² / n) tr(A_λ) .

By analogy with the case where A_λ is an orthogonal projection matrix, df(λ) := tr(A_λ) is called the effective degree of freedom, first introduced by Mallows (1973); see also the work by Zhang (2005). The ideal penalty however depends on σ²; in order to have a fully data-driven penalty we have to replace σ² by an estimator inside pen_id. For every λ > 0, define the minimal-penalty shape

pen_min(λ) := (1/n) ( 2 tr(A_λ) − tr(A_λᵀ A_λ) ) .

We shall see now that σ² pen_min is a minimal penalty in the following sense. If, for every C ≥ 0,

λ̂₀(C) ∈ argmin_{λ > 0} { (1/n) ‖A_λ y − y‖₂² + C pen_min(λ) } ,

then, up to concentration inequalities, λ̂₀(C) acts as a minimizer of

λ ↦ (1/n) ‖(A_λ − I_n) f‖₂² + σ² + (C − σ²) pen_min(λ) .

The former theoretical arguments show that:

if C < σ², this criterion decreases with df(λ), so that df(λ̂₀(C)) is huge: the procedure overfits;

if C > σ², this criterion increases with df(λ) when df(λ) is large enough, so that df(λ̂₀(C)) is much smaller than when C < σ².
The following algorithm, introduced by Arlot and Bach (2011), uses this fact to estimate σ².
Algorithm 14

Input: the observations y and the family of smoother matrices (A_λ)_{λ > 0}.

For every C > 0, compute λ̂₀(C), the minimizer over λ of the empirical risk plus C times the minimal-penalty shape, together with the corresponding effective degree of freedom df(λ̂₀(C)).

Output: Ĉ such that df(λ̂₀(Ĉ)) lies in the prescribed intermediate range where the effective degree of freedom jumps from a large value (of order n) to a small one.
An efficient algorithm for the first step of Algorithm 14 is detailed by Arlot and Massart (2009), and we discuss the way we implemented Algorithm 14 in Section 6. The output Ĉ of Algorithm 14 is a provably consistent estimator of σ², as stated in the following theorem.
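A compact sketch of Algorithm 14 for a single task, with the minimal-penalty shape written as (2 tr A_λ − tr A_λᵀA_λ)/n; the grids and the jump threshold n/2 below are illustrative implementation choices, not values prescribed by the paper:

```python
import numpy as np

def ridge_smoother(K, lam):
    """Single-task smoother matrix A_lam = K (K + n * lam * I)^{-1}."""
    n = K.shape[0]
    return K @ np.linalg.inv(K + n * lam * np.eye(n))

def minimal_penalty_variance(K, y, lambdas, Cs):
    """Sketch of Algorithm 14: estimate the noise variance sigma^2.

    For each candidate constant C (in increasing order), select lambda
    minimizing the empirical risk plus C times the minimal-penalty shape,
    and return the first C at which the effective degree of freedom
    tr(A_lambda) jumps below n / 2 (illustrative threshold).
    """
    n = len(y)
    smoothers = [ridge_smoother(K, lam) for lam in lambdas]
    risks = np.array([np.mean((A @ y - y) ** 2) for A in smoothers])
    shapes = np.array([(2.0 * np.trace(A) - np.trace(A.T @ A)) / n
                       for A in smoothers])
    dfs = np.array([np.trace(A) for A in smoothers])
    for C in sorted(Cs):
        best = int(np.argmin(risks + C * shapes))
        if dfs[best] < n / 2:  # the degree of freedom has jumped down
            return float(C)
    return float(max(Cs))
```

On a pure-noise problem the selected degree of freedom switches from nearly n to nearly 0 as C crosses σ², so the returned value sits close to the true noise variance.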
Theorem 15 (Corollary of Theorem 1 of Arlot and Bach, 2011)
Suppose the noise of the single-task problem is Gaussian with variance σ² > 0, and that constants exist such that the bias and the effective degree of freedom satisfy
(8) 
Then some constant n₀ and an event of large probability exist such that, if n ≥ n₀, on this event,
(9) 
Remark 16
The two threshold values in Algorithm 14 have no particular meaning and can be replaced by other thresholds of the same order in n, only affecting the constants. Also, the bounds required in Assumption (8) only impact the right-hand side of Equation (9) and are chosen to match the left-hand side. See Proposition 10 of Arlot and Bach (2011) for more details.
4 Estimation of the Noise Covariance Matrix
Thanks to the results developed by Arlot and Bach (2011) (recapitulated in Section 3), we know how to estimate the variance of any one-dimensional problem. In order to estimate Σ, which has p(p+1)/2 parameters, we can use several one-dimensional problems. Projecting the observations onto a direction z ∈ ℝ^p yields

∀ i ∈ {1, …, n},   ⟨y_i, z⟩ = Σ_{j=1}^p z_j f^j(X_i) + ⟨ε_i, z⟩ ,   (10)

with noise variance Var⟨ε_i, z⟩ = zᵀ Σ z. Therefore, we will estimate zᵀ Σ z for a well-chosen set of directions z, and use these estimators to build back an estimate of Σ.
We now explain how to estimate Σ using those one-dimensional projections.
The idea is to apply Algorithm 14 to the elements of a carefully chosen set of directions. Denoting by e_j the j-th vector of the canonical basis of ℝ^p, we take the directions e_j (1 ≤ j ≤ p) and e_j + e_l (1 ≤ j < l ≤ p). We can see that the variance along e_j estimates Σ_{j,j}, while the variance along e_j + e_l estimates Σ_{j,j} + Σ_{l,l} + 2 Σ_{j,l}. Hence, Σ_{j,l} can be estimated by combining these three estimates. This leads to the definition of the following map J, which builds a symmetric matrix using the latter construction.
Definition 18
Let J map a family (a_z), indexed by the directions z introduced above, to the symmetric matrix defined by

J(a)_{j,j} := a_{e_j}   and   J(a)_{j,l} = J(a)_{l,j} := ( a_{e_j + e_l} − a_{e_j} − a_{e_l} ) / 2   for j ≠ l .

This map is bijective, and for all Σ ∈ S_p(ℝ), J( (zᵀ Σ z)_z ) = Σ.
This leads us to defining the following estimator of Σ:

Σ̂ := J( (σ̂²_z)_z ) ,   (11)

where, for each direction z, σ̂²_z denotes the output of Algorithm 14 applied to the one-dimensional problem (10).
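The reconstruction step can be sketched directly: given any routine `var_of_projection` (a hypothetical name) returning an estimate of zᵀΣz for a direction z, for instance Algorithm 14 run on the projected observations, the entries of Σ follow from the identity Var⟨ε_i, e_j + e_l⟩ = Σ_{j,j} + Σ_{l,l} + 2Σ_{j,l}:

```python
import numpy as np

def reconstruct_covariance(var_of_projection, p):
    """Build a symmetric p x p matrix from variances of 1-D projections.

    var_of_projection(z) must return an estimate of z^T Sigma z,
    e.g. obtained by projecting the data onto z and running a
    single-task variance estimator such as Algorithm 14.
    """
    basis = np.eye(p)
    Sigma_hat = np.zeros((p, p))
    diag = np.array([var_of_projection(basis[j]) for j in range(p)])
    for j in range(p):
        Sigma_hat[j, j] = diag[j]
        for l in range(j + 1, p):
            v_sum = var_of_projection(basis[j] + basis[l])
            # var(e_j + e_l) = Sigma_jj + Sigma_ll + 2 * Sigma_jl
            Sigma_hat[j, l] = Sigma_hat[l, j] = (v_sum - diag[j] - diag[l]) / 2.0
    return Sigma_hat
```

With exact variances this map recovers Σ exactly, which is the bijectivity property stated in Definition 18.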
Remark 19
Following Arlot and Bach (2011), we make the following assumption from now on:
(13) 
We can now state the first main result of the paper.
Theorem 20
Theorem 20 is proved in Appendix E. It shows that Σ̂ estimates Σ with a "multiplicative" error controlled with large probability, in a non-asymptotic setting. The multiplicative nature of the error is crucial for deriving the oracle inequality stated in Section 5, since it allows us to show that the ideal penalty defined in Equation (7) is precisely estimated when Σ is replaced by Σ̂.
An important feature of Theorem 20 is that it holds under very mild assumptions on the mean of the data (see Remark 22). Therefore, it shows that Σ̂ is able to estimate a covariance matrix without prior knowledge of the regression functions, which, to the best of our knowledge, has never been obtained in multitask regression.
Remark 21 (Scaling of n and p for consistency)
A sufficient condition ensuring that Σ̂ is a consistent estimator of Σ enforces a scaling between n, p and the condition number of Σ. Nevertheless, this condition is probably not necessary, since the simulation experiments of Section 6 show that Σ can be well estimated (at least for estimator selection purposes) in a setting where this scaling does not hold.
Remark 22 (On assumption (13))
Assumption (13) is a single-task assumption (made independently for each task). The upper bound it contains can be multiplied by any factor (as in Theorem 15), at the price of a corresponding factor in the upper bound of Equation (14). More generally, the bounds on the degree of freedom and the bias in (13) only influence the upper bound of Equation (14). The rates are chosen here to match the lower bound; see Proposition 10 of Arlot and Bach (2011) for more details.
Remark 23 (Choice of the set of directions)
Other choices could have been made for the set of projection directions; however, ours seems easier in terms of computation, since it has the minimal cardinality p(p+1)/2. Choosing a larger set leads to theoretical difficulties in the reconstruction of Σ, while taking other basis vectors leads to more complex computations. We can also note that increasing the cardinality of this set decreases the probability in Theorem 20, since it comes from a union bound over the one-dimensional estimations.
5 Oracle Inequality
This section aims at proving "oracle inequalities", as usually done in a model selection setting: given a set of models or of estimators, the goal is to upper bound the risk of the selected estimator by the oracle risk (defined by Equation (6)), up to an additive term and a multiplicative factor. We show two oracle inequalities (Theorems 26 and 29), corresponding to two possible assumptions on the set M.
Note that “oracle inequality” sometimes has a different meaning in the literature (see for instance Lounici et al., 2011) when the risk of the proposed estimator is controlled by the risk of an estimator using information coming from the true parameter (that is, available only if provided by an oracle).
5.1 A General Result for Discrete Matrix Sets
We first show that the estimator Σ̂ introduced in Equation (11) is precise enough to derive an oracle inequality when plugged into the penalty defined in Equation (7), in the case where the collection M is finite.
Definition 25
Let Σ̂ be the estimator of Σ defined by Equation (11). We define

M̂ ∈ argmin_{M ∈ M} { (1/(np)) ‖ŷ_M − y‖₂² + (2/(np)) tr( A_M (Σ̂ ⊗ I_n) ) } .
We now assume that the following holds:
(15) 
Theorem 26
5.2 A Result for a Continuous Set of Jointly Diagonalizable Matrices
We now show a similar result when the matrices in M can be jointly diagonalized. It turns out that a faster algorithm can then be used instead of Equation (11), with a reduced error and a larger probability event in the oracle inequality. Note that we no longer assume M is finite, so it can be parametrized by continuous parameters.
Suppose now that the following holds, which means the matrices of M are jointly diagonalizable:
(18) 
Let P be the orthogonal matrix defined in Assumption (18), and write M = P Diag(d_1, …, d_p) Pᵀ for every M ∈ M. Computations detailed in Appendix D show that the ideal penalty introduced in Equation (7) can then be written as

pen_id(M) = (2/(np)) Σ_{j=1}^p (Pᵀ Σ P)_{j,j} tr( K (K + np d_j I_n)^{-1} ) .   (19)
Equation (19) shows that under Assumption (18) we do not need to estimate the entire matrix Σ in order to have a good penalization procedure, but only the variance of the noise in p directions.
Definition 28
Let (e_j)_{1≤j≤p} be the canonical basis of ℝ^p and (P e_j)_{1≤j≤p} the orthogonal basis defined by the columns of P. We then define

pen̂(M) := (2/(np)) Σ_{j=1}^p σ̂_j² tr( K (K + np d_j I_n)^{-1} ) ,

where d_1, …, d_p denote the eigenvalues of M in the basis P and where, for every j ∈ {1, …, p}, σ̂_j² denotes the output of Algorithm 14 applied to Problem (10) with direction P e_j, and

M̂ ∈ argmin_{M ∈ M} { (1/(np)) ‖ŷ_M − y‖₂² + pen̂(M) } .   (20)
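Under Assumption (18) the plug-in penalty can thus be assembled coordinate-wise. The sketch below assumes, as in our reading of Equation (19), that each rotated coordinate behaves as a single-task ridge problem with smoother K(K + np d_j I)⁻¹; `sigma2_rotated` stands for the outputs of Algorithm 14 on the p rotated problems:

```python
import numpy as np

def diagonalized_penalty(K, d, sigma2_rotated):
    """Plug-in penalty for M = P diag(d) P^T under a common eigenbasis.

    d contains the eigenvalues (d_1, ..., d_p) of M in the shared basis;
    sigma2_rotated contains the p estimated noise variances of the
    rotated one-dimensional problems.  The penalty is assumed to
    decompose as (2 / (n p)) * sum_j sigma2_rotated[j] * tr(A_{d_j}).
    """
    n = K.shape[0]
    p = len(d)
    total = 0.0
    for d_j, s2_j in zip(d, sigma2_rotated):
        A_j = K @ np.linalg.inv(K + n * p * d_j * np.eye(n))
        total += s2_j * np.trace(A_j)
    return 2.0 * total / (n * p)
```

Because both the empirical risk and the penalty then separate across the p rotated coordinates, minimizing the criterion over the eigenvalues d reduces to p one-dimensional problems, which is why the jointly diagonalizable case is computationally easy.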
Theorem 29
5.3 Comments on Theorems 26 and 29
Remark 30
Remark 31 (Scaling of n and p)
When Assumption (15) holds, Equation (16) implies the asymptotic optimality of the estimator under a suitable scaling between n, p and the cardinality of M; in particular, only collections M whose size does not grow too fast are admissible. When Assumption (18) holds, the scalings required to ensure optimality in Equation (21) are more favorable.
It is to be noted that p still influences the left-hand side, through the oracle risk.
Remark 32
Theorems 26 and 29 are non-asymptotic oracle inequalities, with a multiplicative factor tending to one. This allows us to claim that our selection procedure is nearly optimal, since the risk of our estimator is close (with respect to the empirical quadratic norm) to that of the oracle. Furthermore, the constant in front of the infima in Equations (16), (21), (17) and (22) can be further diminished, but at the price of a larger remainder term.
Remark 33 (On assumption (18))
Assumption (18) means that all matrices in M can be diagonalized in a common orthogonal basis, and thus can be parametrized by their eigenvalues, as in Examples 3, 4 and 5.
In that case the optimization problem is quite easy to solve, as detailed in Remark 36. Otherwise, solving (20) may turn out to be a hard problem, and our theoretical results do not cover this setting. However, it is always possible to discretize the set M or, in practice, to use gradient descent.
Compared to the setting of Theorem 26, Assumption (18) allows a simpler estimator for the penalty (19), with an increased probability and a reduced error in the oracle inequality.
The main theoretical limitation comes from the fact that the probabilistic concentration tools we use apply to discrete sets (through union bounds). The structure of kernel ridge regression allows a uniform control over a continuous set for the single-task estimators at the "cost" of pointwise controls, which can then be extended to the multitask setting via (18). We conjecture that Theorem 29 still holds without (18) as long as M is not "too large", which could be proved similarly, up to suitable uniform concentration inequalities.
Note also that if