Multi-task Regression using Minimal Penalties

Matthieu Solnon matthieu.solnon@ens.fr
ENS; Sierra Project-team
Département d’Informatique de l’École Normale Supérieure
(CNRS/ENS/INRIA UMR 8548)
23, avenue d’Italie, CS 81321
75214 Paris Cedex 13, France

Sylvain Arlot sylvain.arlot@ens.fr
CNRS; Sierra Project-team
Département d’Informatique de l’École Normale Supérieure
(CNRS/ENS/INRIA UMR 8548)
23, avenue d’Italie, CS 81321
75214 Paris Cedex 13, France

Francis Bach francis.bach@ens.fr
INRIA; Sierra Project-team
Département d’Informatique de l’École Normale Supérieure
(CNRS/ENS/INRIA UMR 8548)
23, avenue d’Italie, CS 81321
75214 Paris Cedex 13, France

Abstract

In this paper we study the kernel multiple ridge regression framework, which we refer to as multi-task regression, using penalization techniques. The theoretical analysis of this problem shows that the key element appearing for an optimal calibration is the covariance matrix of the noise between the different tasks. We present a new algorithm to estimate this covariance matrix, based on the concept of minimal penalty, which was previously used in the single-task regression framework to estimate the variance of the noise. We show, in a non-asymptotic setting and under mild assumptions on the target function, that this estimator converges towards the covariance matrix. Then plugging this estimator into the corresponding ideal penalty leads to an oracle inequality. We illustrate the behavior of our algorithm on synthetic examples.


Editor: Tong Zhang

Keywords: multi-task, oracle inequality, learning theory

1 Introduction

A classical paradigm in statistics is that increasing the sample size (that is, the number of observations) improves the performance of the estimators. However, in some cases it may be impossible to increase the sample size, for instance because of experimental limitations. Fortunately, in many situations practitioners can find many related and similar problems, and might use these problems as if more observations were available for the initial problem. The techniques using this heuristic are called “multi-task” techniques. In this paper we study the kernel ridge regression procedure in a multi-task framework.

One-dimensional kernel ridge regression, which we refer to as “single-task” regression, has been widely studied. As we briefly review in Section 3 one has, given $n$ data points $(X_i, Y_i)_{i=1}^n$, to estimate a function $f$, often the conditional expectation $f(X_i) = \mathbb{E}[Y_i \mid X_i]$, by minimizing the quadratic risk of the estimator regularized by a certain norm. A practically important task is to calibrate a regularization parameter, that is, to estimate the regularization parameter directly from data. For kernel ridge regression (a.k.a. smoothing splines), many methods have been proposed based on different principles, for example, Bayesian criteria through a Gaussian process interpretation (see, e.g., Rasmussen and Williams, 2006) or generalized cross-validation (see, e.g., Wahba, 1990). In this paper, we focus on the concept of minimal penalty, which was first introduced by Birgé and Massart (2007) and Arlot and Massart (2009) for model selection, then extended to linear estimators such as kernel ridge regression by Arlot and Bach (2011).

In this article we consider $p \ge 2$ different (but related) regression tasks, a framework we refer to as “multi-task” regression. This setting has already been studied in different papers. Some empirically show that it can lead to performance improvement (Thrun and O’Sullivan, 1996; Caruana, 1997; Bakker and Heskes, 2003). Liang et al. (2010) also obtained a theoretical criterion (unfortunately non-observable) which tells when this phenomenon asymptotically occurs. Several different paths have been followed to deal with this setting. Some consider a setting where $p \gg n$, and formulate a sparsity assumption which enables the use of the group Lasso, assuming all the different functions have a small set of common active covariates (see for instance Obozinski et al., 2011; Lounici et al., 2010). We exclude this setting from our analysis, because of the Hilbertian nature of our problem, and thus will not consider the similarity between the tasks in terms of sparsity, but rather in terms of a Euclidean similarity. Another theoretical approach has also been taken (see for example, Brown and Zidek (1980), Evgeniou et al. (2005) or Ando and Zhang (2005) on semi-supervised learning), the authors often defining a theoretical framework where the multi-task problem can easily be expressed, and where sometimes solutions can be computed. The main remaining theoretical problem is the calibration of a matricial parameter $M$ (typically of size $p$), which characterizes the relationship between the tasks and extends the regularization parameter from single-task regression. Because of the high-dimensional nature of the problem (i.e., the small number of training observations) usual techniques, like cross-validation, are not likely to succeed. Argyriou et al. (2008) have a similar approach to ours, but solve this problem by adding a convex constraint to the matrix, which will be discussed at the end of Section 5.

Through a penalization technique we show in Section 2 that the only element we have to estimate is the covariance matrix $\Sigma$ of the noise between the tasks. We give here a new algorithm to estimate $\Sigma$, and show that the estimation is sharp enough to derive an oracle inequality for the estimation of the task similarity matrix $M$, both with high probability and in expectation. Finally we give some simulation experiment results and show that our technique correctly deals with multi-task settings with a low sample size.

1.1 Notations

We now introduce some notations, which will be used throughout the article.

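  • The integer $n$ is the sample size, the integer $p$ is the number of tasks.

  • For any $n \times p$ matrix $Y$, we define

    $$y = \operatorname{vec}(Y) := (Y_{1,1}, \ldots, Y_{n,1}, Y_{1,2}, \ldots, Y_{n,2}, \ldots, Y_{1,p}, \ldots, Y_{n,p}) \in \mathbb{R}^{np},$$

    that is, the vector in which the columns of $Y$ are stacked.

  • $\mathcal{M}_n(\mathbb{R})$ is the set of all matrices of size $n$.

  • $\mathcal{S}_p(\mathbb{R})$ is the set of symmetric matrices of size $p$.

  • $\mathcal{S}_p^{+}(\mathbb{R})$ is the set of symmetric positive-semidefinite matrices of size $p$.

  • $\mathcal{S}_p^{++}(\mathbb{R})$ is the set of symmetric positive-definite matrices of size $p$.

  • $\preceq$ denotes the partial ordering on $\mathcal{S}_p(\mathbb{R})$ defined by: $A \preceq B$ if and only if $B - A \in \mathcal{S}_p^{+}(\mathbb{R})$.

  • $\mathbf{1}$ is the vector of size $p$ whose components are all equal to $1$.

  • $\|\cdot\|_2$ is the usual Euclidean norm on $\mathbb{R}^k$ for any $k \in \mathbb{N}$: $\forall u \in \mathbb{R}^k$, $\|u\|_2^2 := \sum_{i=1}^k u_i^2$.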
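
As a small illustration of the stacking convention and of the partial ordering on symmetric matrices, the following NumPy sketch may help. The helper names `vec` and `loewner_leq` are ours (they do not come from the paper), and the ordering is assumed to be the usual one: $A \preceq B$ when $B - A$ is positive-semidefinite.

```python
import numpy as np

def vec(Y):
    """Stack the columns of an n x p matrix into a vector of length n * p."""
    Y = np.asarray(Y, dtype=float)
    return Y.reshape(-1, order="F")  # column-major order stacks the columns

def loewner_leq(A, B, tol=1e-10):
    """Return True when B - A is symmetric positive-semidefinite (up to tol)."""
    D = np.asarray(B, dtype=float) - np.asarray(A, dtype=float)
    return bool(np.allclose(D, D.T) and np.linalg.eigvalsh(D).min() >= -tol)

# n = 3 observations, p = 2 tasks: column j holds the outputs of task j.
Y = np.array([[1.0, 4.0],
              [2.0, 5.0],
              [3.0, 6.0]])
y = vec(Y)
```

With this convention the first $n$ entries of `y` belong to the first task, the next $n$ to the second task, and so on.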

2 Multi-task Regression: Problem Set-up

We consider $p$ kernel ridge regression tasks. Treating them simultaneously and sharing their common structure (e.g., being close in some metric space) will help in reducing the overall prediction error.

2.1 Multi-task with a Fixed Kernel

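Let $\mathcal{X}$ be some set and $\mathcal{F}$ a set of real-valued functions over $\mathcal{X}$. We suppose $\mathcal{F}$ has a reproducing kernel Hilbert space (RKHS) structure (Aronszajn, 1950), with kernel $k$ and feature map $\Phi: \mathcal{X} \to \mathcal{F}$. We observe $\mathcal{D}_n = (X_i, y_i^1, \ldots, y_i^p)_{i=1}^n \in (\mathcal{X} \times \mathbb{R}^p)^n$, which gives us the positive semidefinite kernel matrix $K = (k(X_i, X_\ell))_{1 \le i, \ell \le n}$. For each task $j \in \{1, \ldots, p\}$, $\mathcal{D}_n^j = (X_i, y_i^j)_{i=1}^n$ is a sample with distribution $\mathcal{P}_j$, for which a simple regression problem has to be solved. In this paper we consider for simplicity that the different tasks have the same design $(X_i)_{i=1}^n$. When the designs of the different tasks are different the analysis is carried out similarly by defining $X_i = (X_i^1, \ldots, X_i^p)$, but the notations would be more complicated.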

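We now define the model. We assume $(f^1, \ldots, f^p) \in \mathcal{F}^p$, $\Sigma$ is a symmetric positive-definite matrix of size $p$ such that the vectors $(\varepsilon_i^j)_{j=1}^p$, $i \in \{1, \ldots, n\}$, are i.i.d. with normal distribution $\mathcal{N}(0, \Sigma)$, with mean zero and covariance matrix $\Sigma$, and

$$\forall i \in \{1, \ldots, n\},\ \forall j \in \{1, \ldots, p\}, \quad y_i^j = f^j(X_i) + \varepsilon_i^j. \qquad (1)$$

This means that, while the observations are independent, the outputs of the different tasks can be correlated, with covariance matrix $\Sigma$ between the tasks. We now place ourselves in the fixed-design setting, that is, $(X_i)_{i=1}^n$ is deterministic and the goal is to estimate $(f^1(X_i), \ldots, f^p(X_i))_{i=1}^n$. Let us introduce some notation:

  • $\mu_{\min}$ (resp. $\mu_{\max}$) denotes the smallest (resp. largest) eigenvalue of $\Sigma$.

  • $c(\Sigma) := \mu_{\max} / \mu_{\min}$ is the condition number of $\Sigma$.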

To obtain compact equations, we will use the following definition:

Definition 1

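We denote by $F$ the $n \times p$ matrix $(f^j(X_i))_{1 \le i \le n,\, 1 \le j \le p}$ and introduce the vector $f := \operatorname{vec}(F) = (f^1(X_1), \ldots, f^1(X_n), \ldots, f^p(X_1), \ldots, f^p(X_n)) \in \mathbb{R}^{np}$, obtained by stacking the columns of $F$. Similarly we define $Y := (y_i^j)_{1 \le i \le n,\, 1 \le j \le p}$, $y := \operatorname{vec}(Y)$, $E := (\varepsilon_i^j)_{1 \le i \le n,\, 1 \le j \le p}$ and $\varepsilon := \operatorname{vec}(E)$.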

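In order to estimate $f$, we use a regularization procedure, which extends the classical ridge regression of the single-task setting. Let $M$ be a $p \times p$ matrix, symmetric and positive-definite. Generalizing the work of Evgeniou et al. (2005), we estimate $(f^1, \ldots, f^p) \in \mathcal{F}^p$ by

$$\hat{f}_M \in \arg\min_{g \in \mathcal{F}^p} \left\{ \frac{1}{np} \sum_{i=1}^n \sum_{j=1}^p \left( y_i^j - g^j(X_i) \right)^2 + \sum_{j=1}^p \sum_{\ell=1}^p M_{j,\ell} \langle g^j, g^\ell \rangle_{\mathcal{F}} \right\}. \qquad (2)$$

Although $M$ could have a general unconstrained form we may restrict $M$ to certain forms, for either computational or statistical reasons.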

Remark 2

Requiring that $M \succ 0$ implies that Equation (2) is a convex optimization problem, which can be solved through the resolution of a linear system, as explained later. Moreover it allows an RKHS interpretation, which will also be explained later.

Example 3

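The case where the tasks are treated independently can be considered in this setting: taking $M = M_{\mathrm{ind}}(\lambda) := \operatorname{Diag}(\lambda_1, \ldots, \lambda_p)$ for any $\lambda = (\lambda_1, \ldots, \lambda_p) \in (0, +\infty)^p$ leads to the criterion

$$\frac{1}{p} \sum_{j=1}^p \left[ \frac{1}{n} \sum_{i=1}^n \left( y_i^j - g^j(X_i) \right)^2 + p \lambda_j \| g^j \|_{\mathcal{F}}^2 \right], \qquad (3)$$

that is, the sum of the single-task criteria described in Section 3. Hence, minimizing Equation (3) over $g \in \mathcal{F}^p$ amounts to solving $p$ single-task problems independently.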

Example 4

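As done by Evgeniou et al. (2005), for every $(\lambda, \mu) \in (0, +\infty)^2$, define

$$M_{\mathrm{similar}}(\lambda, \mu) := (\lambda + p\mu) I_p - \mu \mathbf{1}\mathbf{1}^\top.$$

Taking $M = M_{\mathrm{similar}}(\lambda, \mu)$ in Equation (2) leads to the criterion

$$\frac{1}{np} \sum_{i=1}^n \sum_{j=1}^p \left( y_i^j - g^j(X_i) \right)^2 + \lambda \sum_{j=1}^p \| g^j \|_{\mathcal{F}}^2 + \frac{\mu}{2} \sum_{j=1}^p \sum_{k=1}^p \| g^j - g^k \|_{\mathcal{F}}^2. \qquad (4)$$

Minimizing Equation (4) enforces a regularization on both the norms of the functions $g^j$ and the norms of the differences $g^j - g^k$. Thus, matrices of the form $M_{\mathrm{similar}}(\lambda, \mu)$ are useful when the functions $g^j$ are assumed to be similar in $\mathcal{F}$. One of the main contributions of the paper is to go beyond this case and learn from data a more general similarity matrix $M$ between tasks.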
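
The link between the matrix of Example 4 and the two regularization terms can be checked numerically. The sketch below assumes the parametrization $(\lambda + p\mu) I_p - \mu \mathbf{1}\mathbf{1}^\top$ of Evgeniou et al. (2005) and replaces the functions $g^j$ by vectors of $\mathbb{R}^d$ with the Euclidean inner product, a finite-dimensional stand-in for the RKHS; the function name `m_similar` is ours.

```python
import numpy as np

def m_similar(p, lam, mu):
    """Assumed form of the similarity matrix: (lam + p * mu) I_p - mu * 1 1^T."""
    return (lam + p * mu) * np.eye(p) - mu * np.ones((p, p))

rng = np.random.default_rng(0)
p, d, lam, mu = 4, 3, 0.5, 2.0
G = rng.normal(size=(p, d))      # row j plays the role of g^j
M = m_similar(p, lam, mu)

# Penalty written with M: sum_{j,k} M[j, k] * <g^j, g^k>
gram = G @ G.T
quad_form = float(np.sum(M * gram))

# Same penalty written with norms and pairwise differences
norms = lam * float(np.sum(G ** 2))
diffs = G[:, None, :] - G[None, :, :]            # shape (p, p, d)
pairwise = 0.5 * mu * float(np.sum(diffs ** 2))
```

Here `quad_form` and `norms + pairwise` agree up to rounding, and the smallest eigenvalue of `M` is `lam`, so the matrix is positive-definite as soon as $\lambda > 0$.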

Example 5

We extend Example 4 to the case where the tasks consist of two groups of close tasks. Let be a subset of , of cardinality . Let us denote by the complementary of in , the vector with components , and the diagonal matrix with components . We then define

This matrix leads to the following criterion, which enforces a regularization on both the norms of the functions and the norms of the differences inside the groups and :

(5)

As shown in Section 6, we can estimate the set from data (see Jacob et al., 2008 for a more general formulation).

Remark 6

Since and can be diagonalized simultaneously, minimizing Equation (4) and Equation (5) is quite easy: it only demands optimization over two independent parameters, which can be done with the procedure of Arlot and Bach (2011).

Remark 7

As stated below (Proposition 8), $M$ acts as a scalar product between the tasks. Selecting a general matrix $M$ is thus a way to express a similarity between tasks.

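Following Evgeniou et al. (2005), we define the vector-space $\mathcal{G}$ of real-valued functions over $\mathcal{X} \times \{1, \ldots, p\}$ by

$$\mathcal{G} := \left\{ g: \mathcal{X} \times \{1, \ldots, p\} \to \mathbb{R} \ \text{such that}\ \forall j \in \{1, \ldots, p\},\ g(\cdot, j) \in \mathcal{F} \right\}.$$

We now define a bilinear symmetric form over $\mathcal{G}$,

$$\forall g, h \in \mathcal{G}, \quad \langle g, h \rangle_{\mathcal{G}} := \sum_{j=1}^p \sum_{\ell=1}^p M_{j,\ell} \langle g(\cdot, j), h(\cdot, \ell) \rangle_{\mathcal{F}},$$

which is a scalar product as soon as $M$ is positive-definite (see proof in Appendix A) and leads to a RKHS (see proof in Appendix B):

Proposition 8

With the preceding notations $\langle \cdot, \cdot \rangle_{\mathcal{G}}$ is a scalar product on $\mathcal{G}$.

Corollary 9

$(\mathcal{G}, \langle \cdot, \cdot \rangle_{\mathcal{G}})$ is a RKHS.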

In order to write down the kernel matrix in compact form, we introduce the following notations.

Definition 10 (Kronecker Product)

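Let $A \in \mathcal{M}_{m,n}(\mathbb{R})$, $B \in \mathcal{M}_{r,s}(\mathbb{R})$. We define the Kronecker product $A \otimes B$ as being the $(mr) \times (ns)$ matrix built with $m \times n$ blocks of size $r \times s$, the block of index $(i, j)$ being $A_{i,j} \cdot B$:

$$A \otimes B = \begin{pmatrix} A_{1,1} B & \cdots & A_{1,n} B \\ \vdots & \ddots & \vdots \\ A_{m,1} B & \cdots & A_{m,n} B \end{pmatrix}.$$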

The Kronecker product is a widely used tool to deal with matrices and tensor products. Some of its classical properties are given in Section E; see also Horn and Johnson (1991).

Proposition 11

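The kernel matrix associated with the design $\widetilde{X} := (X_i, j)_{1 \le i \le n,\, 1 \le j \le p}$ and the RKHS $(\mathcal{G}, \langle \cdot, \cdot \rangle_{\mathcal{G}})$ is $\widetilde{K}_M := M^{-1} \otimes K$.

Proposition 11 is proved in Appendix C. We can then apply the representer theorem (Schölkopf and Smola, 2002) to the minimization problem (2) and deduce that $\hat{f}_M = A_M y$ with

$$A_M = A_{M,K} := \widetilde{K}_M \left( \widetilde{K}_M + np I_{np} \right)^{-1} = (M^{-1} \otimes K) \left( (M^{-1} \otimes K) + np I_{np} \right)^{-1}.$$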
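
A minimal NumPy sketch of this linear estimator follows. It assumes the smoothing matrix takes the Kronecker form $(M^{-1} \otimes K)((M^{-1} \otimes K) + np I_{np})^{-1}$ with a $1/(np)$ normalization of the data-fitting term, and uses an exponential kernel on one-dimensional inputs purely for illustration. The function name `smoothing_matrix` is ours. As a sanity check, a diagonal $M$ should give the same fitted values as $p$ separate single-task ridge regressions with parameters $p\lambda_j$.

```python
import numpy as np

def smoothing_matrix(M, K):
    """Assumed form: A_M = (M^{-1} kron K) ((M^{-1} kron K) + n p I)^{-1}."""
    n, p = K.shape[0], M.shape[0]
    K_tilde = np.kron(np.linalg.inv(M), K)
    return K_tilde @ np.linalg.inv(K_tilde + n * p * np.eye(n * p))

rng = np.random.default_rng(0)
n, p = 8, 3
x = np.linspace(0.0, 1.0, n)
K = np.exp(-np.abs(x[:, None] - x[None, :]))   # illustrative kernel matrix
Y = rng.normal(size=(n, p))                    # column j = outputs of task j
y = Y.reshape(-1, order="F")                   # stack the columns of Y

lam = np.array([0.1, 0.5, 2.0])
A = smoothing_matrix(np.diag(lam), K)
fitted = (A @ y).reshape(n, p, order="F")

# Diagonal M: task j reduces to a single-task ridge with parameter p * lam[j]
single = np.column_stack([
    K @ np.linalg.solve(K + n * p * lam[j] * np.eye(n), Y[:, j])
    for j in range(p)
])
```

Forming the $np \times np$ matrix explicitly is only reasonable for small $n$ and $p$; it is used here to keep the correspondence with the formula visible.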

2.2 Optimal Choice of the Kernel

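Now when working in multi-task regression, a set $\mathcal{M} \subset \mathcal{S}_p^{++}(\mathbb{R})$ of matrices is given, and the goal is to select the “best” one, that is, the one minimizing over $M \in \mathcal{M}$ the quadratic risk $\| \hat{f}_M - f \|_2^2$. For instance, the single-task framework corresponds to $p = 1$ and $\mathcal{M} = (0, +\infty)$. The multi-task case is far richer. The oracle risk is defined as

$$\inf_{M \in \mathcal{M}} \left\{ \| \hat{f}_M - f \|_2^2 \right\}. \qquad (6)$$

The ideal choice, called the oracle, is any matrix

$$M^\star \in \arg\min_{M \in \mathcal{M}} \left\{ \| \hat{f}_M - f \|_2^2 \right\}.$$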

Nothing here ensures the oracle exists. However in some special cases (see for instance Example 12) the infimum of over the set may be attained by a function —which we will call “oracle” by a slight abuse of notation—while the former problem does not have a solution.

From now on we always suppose that the infimum of over is attained by some function . However the oracle is not an estimator, since it depends on .

Example 12 (Partial computation of the oracle in a simple setting)

It is possible in certain simple settings to exactly compute the oracle (or, at least, some part of it). Consider for instance the set-up where the functions are taken to be equal (that is, ). In this setting it is natural to use the set

Using the estimator we can then compute the quadratic risk using the bias-variance decomposition given in Equation (36):

Computations (reported in Appendix D) show that, with the change of variables , the bias does not depend on and the variance is a decreasing function of . Thus the oracle is obtained when , leading to a situation where the oracle functions verify . It is also noticeable that, if one assumes the maximal eigenvalue of stays bounded with respect to , the variance is of order while the bias is bounded with respect to .

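As explained by Arlot and Bach (2011), we choose

$$\hat{M} \in \arg\min_{M \in \mathcal{M}} \left\{ \operatorname{crit}(M) \right\} \quad \text{with} \quad \operatorname{crit}(M) = \frac{1}{np} \| y - \hat{f}_M \|_2^2 + \operatorname{pen}(M),$$

where the penalty term $\operatorname{pen}(M)$ has to be chosen appropriately.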

Remark 13

Our model (1) does not constrain the functions $f^1, \ldots, f^p$. Our way to express the similarities between the tasks (that is, between the $f^j$) is via the set $\mathcal{M}$, which represents the a priori knowledge the statistician has about the problem. Our goal is to build an estimator whose risk is as close as possible to the oracle risk. Of course using an inappropriate set $\mathcal{M}$ (with respect to the target functions $f^1, \ldots, f^p$) may lead to poor overall performance. Explicit multi-task settings are given in Examples 3, 4 and 5 and through simulations in Section 6.

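The unbiased risk estimation principle (introduced by Akaike, 1970) requires

$$\mathbb{E}\left[ \operatorname{crit}(M) \right] \approx \mathbb{E}\left[ \frac{1}{np} \| \hat{f}_M - f \|_2^2 \right],$$

which leads to the (deterministic) ideal penalty

$$\operatorname{pen}_{\mathrm{id}}(M) := \mathbb{E}\left[ \frac{1}{np} \| \hat{f}_M - f \|_2^2 \right] - \mathbb{E}\left[ \frac{1}{np} \| y - \hat{f}_M \|_2^2 \right].$$

Since $\hat{f}_M = A_M y$ and $y = f + \varepsilon$, we can write

$$\| \hat{f}_M - y \|_2^2 = \| \hat{f}_M - f \|_2^2 + \| \varepsilon \|_2^2 - 2 \langle \varepsilon, A_M \varepsilon \rangle + 2 \langle \varepsilon, (I_{np} - A_M) f \rangle.$$

Since $\varepsilon$ is centered and $M$ is deterministic, we get, up to an additive term independent of $M$,

$$\operatorname{pen}_{\mathrm{id}}(M) = \frac{2\, \mathbb{E}\left[ \langle \varepsilon, A_M \varepsilon \rangle \right]}{np},$$

that is, as the covariance matrix of $\varepsilon$ is $\Sigma \otimes I_n$,

$$\operatorname{pen}_{\mathrm{id}}(M) = \frac{2 \operatorname{tr}\left( A_M \cdot (\Sigma \otimes I_n) \right)}{np}. \qquad (7)$$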

In order to approach this penalty as precisely as possible, we have to sharply estimate $\Sigma$. In the single-task case, such a problem reduces to estimating the variance $\sigma^2$ of the noise and was tackled by Arlot and Bach (2011). Since our approach for estimating $\Sigma$ heavily relies on these results, they are summarized in the next section.

Note that estimating $\Sigma$ is a means towards estimating $M$. The technique we develop later for this purpose is not purely a multi-task technique, and may also be used in a different context.

3 Single Task Framework: Estimating a Single Variance

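This section recalls some of the main results from Arlot and Bach (2011) which can be considered as solving a special case of Section 2, with $p = 1$, $\Sigma = \sigma^2 > 0$ and $\mathcal{M} = [0, +\infty]$. Writing $M = \lambda$ with $\lambda \in [0, +\infty]$, the regularization matrix is

$$\forall \lambda \in (0, +\infty), \quad A_\lambda = A_{\lambda,K} = K (K + n\lambda I_n)^{-1},$$

$A_0 = I_n$ and $A_{+\infty} = 0$; the ideal penalty becomes

$$\operatorname{pen}_{\mathrm{id}}(\lambda) = \frac{2 \sigma^2 \operatorname{tr}(A_\lambda)}{n}.$$

By analogy with the case where $A_\lambda$ is an orthogonal projection matrix, $\operatorname{df}(\lambda) := \operatorname{tr}(A_\lambda)$ is called the effective degree of freedom, first introduced by Mallows (1973); see also the work by Zhang (2005). The ideal penalty however depends on $\sigma^2$; in order to have a fully data-driven penalty we have to replace $\sigma^2$ by an estimator $\hat{\sigma}^2$ inside $\operatorname{pen}_{\mathrm{id}}(\lambda)$. For every $\lambda \in [0, +\infty]$, define

$$\operatorname{pen}_{\min}(\lambda) = \operatorname{pen}_{\min}(\lambda, K) := \frac{2 \operatorname{tr}(A_{\lambda,K}) - \operatorname{tr}(A_{\lambda,K}^\top A_{\lambda,K})}{n}.$$

We shall see now that it is a minimal penalty in the following sense. If for every $C > 0$

$$\hat{\lambda}_0(C) \in \arg\min_{\lambda \in [0, +\infty]} \left\{ \frac{1}{n} \| A_{\lambda,K} Y - Y \|_2^2 + C \operatorname{pen}_{\min}(\lambda, K) \right\},$$

then, up to concentration inequalities, $\hat{\lambda}_0(C)$ acts as a minimizer of

$$g_C(\lambda) = \mathbb{E}\left[ \frac{1}{n} \| A_\lambda Y - Y \|_2^2 + C \operatorname{pen}_{\min}(\lambda) \right] - \sigma^2 = \frac{1}{n} \| (A_\lambda - I_n) F \|_2^2 + (C - \sigma^2) \operatorname{pen}_{\min}(\lambda).$$

The former theoretical arguments show that

  • if $C < \sigma^2$, $g_C(\lambda)$ decreases with $\operatorname{df}(\lambda)$ so that $\operatorname{df}(\hat{\lambda}_0(C))$ is huge: the procedure overfits;

  • if $C > \sigma^2$, $g_C(\lambda)$ increases with $\operatorname{df}(\lambda)$ when $\operatorname{df}(\lambda)$ is large enough so that $\operatorname{df}(\hat{\lambda}_0(C))$ is much smaller than when $C < \sigma^2$.

The following algorithm was introduced by Arlot and Bach (2011) and uses this fact to estimate $\sigma^2$.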

Algorithm 14
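  1. Input: $Y \in \mathbb{R}^n$, $K \in \mathcal{S}_n^{++}(\mathbb{R})$

  2. For every $C > 0$, compute

     $$\hat{\lambda}_0(C) \in \arg\min_{\lambda \in [0, +\infty]} \left\{ \frac{1}{n} \| A_{\lambda,K} Y - Y \|_2^2 + C \operatorname{pen}_{\min}(\lambda, K) \right\}.$$

  3. Output: $\hat{C}$ such that $\operatorname{df}(\hat{\lambda}_0(\hat{C})) \in [n/10, n/3]$.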
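
The steps of Algorithm 14 can be sketched on finite grids as follows. This is a simplified illustration rather than the implementation discussed in Section 6: the function name, the grids and the exponential kernel are our choices. The minimal penalty is assumed to be $(2\operatorname{tr}(A_\lambda) - \operatorname{tr}(A_\lambda^\top A_\lambda))/n$. The output rule is replaced by "the first $C$ on the grid whose selected degree of freedom drops to at most $n/3$", a common practical reading of the jump.

```python
import numpy as np

def minimal_penalty_variance(Y, K, lambdas, Cs):
    """Grid-based sketch: return the first C whose selected df is <= n / 3."""
    n = Y.shape[0]
    s, U = np.linalg.eigh(K)                 # K = U diag(s) U^T
    s = np.clip(s, 0.0, None)
    z2 = (U.T @ Y) ** 2                      # squared coordinates of Y
    # Eigenvalues of A_lambda = K (K + n lambda I)^{-1}, one row per lambda
    a = s[None, :] / (s[None, :] + n * lambdas[:, None])
    df = a.sum(axis=1)                       # tr(A_lambda)
    pen_min = (2.0 * df - (a ** 2).sum(axis=1)) / n
    resid = ((1.0 - a) ** 2 @ z2) / n        # ||A_lambda Y - Y||^2 / n
    for C in np.sort(Cs):
        best = np.argmin(resid + C * pen_min)
        if df[best] <= n / 3.0:
            return float(C)
    return float(np.max(Cs))

# Synthetic single-task example with known noise variance
rng = np.random.default_rng(0)
n, sigma2 = 400, 0.25
x = np.linspace(0.0, 1.0, n)
K = np.exp(-np.abs(x[:, None] - x[None, :]) / 0.5)
Y = np.sin(2 * np.pi * x) + np.sqrt(sigma2) * rng.normal(size=n)

C_hat = minimal_penalty_variance(
    Y, K, lambdas=np.logspace(-10, 2, 100), Cs=np.logspace(-3, 1, 200)
)
```

Working in the eigenbasis of $K$ makes every quantity a sum over eigenvalues, so the whole grid search costs one eigendecomposition plus matrix-vector products.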

An efficient algorithm for the first step of Algorithm 14 is detailed by Arlot and Massart (2009), and we discuss the way we implemented Algorithm 14 in Section 6. The output $\hat{C}$ of Algorithm 14 is a provably consistent estimator of $\sigma^2$, as stated in the following theorem.

Theorem 15 (Corollary of Theorem 1 of Arlot and Bach, 2011)

Let . Suppose with , and that and exist such that

(8)

Then for every , some constant and an event exist such that and if , on ,

(9)

Remark 16

The values and in Algorithm 14 have no particular meaning and can be replaced by , , with . Only depends on and . Also the bounds required in Assumption (8) only impact the right hand side of Equation (9) and are chosen to match the left hand side. See Proposition 10 of Arlot and Bach (2011) for more details.

4 Estimation of the Noise Covariance Matrix

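Thanks to the results developed by Arlot and Bach (2011) (recapitulated in Section 3), we know how to estimate a variance for any one-dimensional problem. In order to estimate $\Sigma$, which has $p(p+1)/2$ parameters, we can use several one-dimensional problems. Projecting $Y$ onto some direction $z \in \mathbb{R}^p$ yields

$$Y_z := Y \cdot z = F \cdot z + E \cdot z = f_z + \varepsilon_z, \qquad (10)$$

with $\varepsilon_z \sim \mathcal{N}(0, \sigma_z^2 I_n)$ and $\sigma_z^2 := z^\top \Sigma z$. Therefore, we will estimate $\sigma_z^2$ for $z \in \mathcal{Z}$ a well chosen set, and use these estimators to build back an estimation of $\Sigma$.

We now explain how to estimate $\Sigma$ using those one-dimensional projections.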

Definition 17

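Let $a(z)$ be the output of Algorithm 14 applied to problem (10), that is, with inputs $Y_z \in \mathbb{R}^n$ and $K$.

The idea is to apply Algorithm 14 to the elements $z$ of a carefully chosen set $\mathcal{Z}$. Noting $e_i$ the $i$-th vector of the canonical basis of $\mathbb{R}^p$, we introduce $\mathcal{Z} = \{ e_i,\ i \in \{1, \ldots, p\} \} \cup \{ e_i + e_j,\ 1 \le i < j \le p \}$. We can see that $a(e_i)$ estimates $\Sigma_{i,i}$, while $a(e_i + e_j)$ estimates $\Sigma_{i,i} + \Sigma_{j,j} + 2\Sigma_{i,j}$. Hence, $\Sigma_{i,j}$ can be estimated by $(a(e_i + e_j) - a(e_i) - a(e_j))/2$. This leads to the definition of the following map $J$, which builds a symmetric matrix using the latter construction.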

Definition 18

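Let $J: \mathbb{R}^{p(p+1)/2} \to \mathcal{S}_p(\mathbb{R})$ be defined by

$$J(a_1, \ldots, a_p, a_{1,2}, \ldots, a_{1,p}, \ldots, a_{p-1,p})_{i,i} = a_i \quad \text{if } 1 \le i \le p,$$

$$J(a_1, \ldots, a_p, a_{1,2}, \ldots, a_{1,p}, \ldots, a_{p-1,p})_{i,j} = \frac{a_{i,j} - a_i - a_j}{2} \quad \text{if } 1 \le i < j \le p.$$

This map is bijective, and for all $B \in \mathcal{S}_p(\mathbb{R})$

$$J^{-1}(B) = \left( B_{1,1}, \ldots, B_{p,p}, B_{1,1} + B_{2,2} + 2B_{1,2}, \ldots, B_{p-1,p-1} + B_{p,p} + 2B_{p-1,p} \right).$$

This leads us to defining the following estimator of $\Sigma$:

$$\hat{\Sigma} := J\left( a(e_1), \ldots, a(e_p), a(e_1 + e_2), \ldots, a(e_1 + e_p), \ldots, a(e_{p-1} + e_p) \right). \qquad (11)$$
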
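
The reconstruction of a symmetric matrix from one-dimensional variance estimates can be sketched as below. The argument `variance_estimator` is a hypothetical callable standing for $z \mapsto a(z)$, that is, any single-task variance estimator applied to the projected data. In this example it is replaced by the exact quadratic form $z^\top \Sigma z$, so that the output can be compared with the true matrix.

```python
import numpy as np

def estimate_covariance(variance_estimator, p):
    """Build a symmetric p x p matrix from a(e_i) and a(e_i + e_j)."""
    basis = np.eye(p)
    diag = np.array([variance_estimator(basis[i]) for i in range(p)])
    Sigma_hat = np.diag(diag)
    for i in range(p):
        for j in range(i + 1, p):
            a_ij = variance_estimator(basis[i] + basis[j])
            # a(e_i + e_j) targets Sigma_ii + Sigma_jj + 2 Sigma_ij
            Sigma_hat[i, j] = Sigma_hat[j, i] = 0.5 * (a_ij - diag[i] - diag[j])
    return Sigma_hat

Sigma = np.array([[2.0, 0.5, -0.3],
                  [0.5, 1.0, 0.2],
                  [-0.3, 0.2, 1.5]])

def exact_variance(z):
    """Noiseless stand-in for a(z): the variance of the noise in direction z."""
    return float(z @ Sigma @ z)

Sigma_hat = estimate_covariance(exact_variance, 3)
```

The construction calls the one-dimensional estimator $p(p+1)/2$ times. Nothing forces the output to be positive-definite when the inputs are noisy, which is why a multiplicative control of the error is useful.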
Remark 19

If a diagonalization basis (whose basis matrix is ) of is known, or if is diagonal, then a simplified version of the algorithm defined by Equation (11) is

(12)

This algorithm has a smaller computational cost and leads to better theoretical bounds (see Remark 24 and Section 5.2).

Let us recall that $\forall \lambda \in (0, +\infty)$, $A_\lambda = K(K + n\lambda I_n)^{-1}$. Following Arlot and Bach (2011) we make the following assumption from now on:

(13)

We can now state the first main result of the paper.

Theorem 20

Let be defined by Equation (11), and assume (13) holds. For every , a constant , an absolute constant and an event exist such that and if , on ,

(14)

Theorem 20 is proved in Section E. It shows $\hat{\Sigma}$ estimates $\Sigma$ with a “multiplicative” error controlled with large probability, in a non-asymptotic setting. The multiplicative nature of the error is crucial for deriving the oracle inequality stated in Section 5, since it allows us to show that the ideal penalty defined in Equation (7) is precisely estimated when $\Sigma$ is replaced by $\hat{\Sigma}$.
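
To make the notion of multiplicative error concrete, the sketch below computes the smallest $\eta$ such that $(1 - \eta)\Sigma \preceq \hat{\Sigma} \preceq (1 + \eta)\Sigma$ for two given matrices. Reading the multiplicative control in the positive-semidefinite order is our assumption about the form of the bound, and the function name is ours.

```python
import numpy as np

def multiplicative_error(Sigma_hat, Sigma):
    """Smallest eta with (1 - eta) Sigma <= Sigma_hat <= (1 + eta) Sigma."""
    w, V = np.linalg.eigh(Sigma)
    inv_sqrt = V @ np.diag(w ** -0.5) @ V.T          # Sigma^{-1/2}
    # Eigenvalues of Sigma^{-1/2} Sigma_hat Sigma^{-1/2} must lie in [1 - eta, 1 + eta]
    rel = np.linalg.eigvalsh(inv_sqrt @ Sigma_hat @ inv_sqrt)
    return float(np.max(np.abs(rel - 1.0)))

Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])

eta_scaled = multiplicative_error(1.1 * Sigma, Sigma)

Sigma_hat = Sigma + np.array([[0.10, -0.05],
                              [-0.05, 0.02]])
eta = multiplicative_error(Sigma_hat, Sigma)
```

When such a sandwich holds and the smoothing matrix is symmetric positive-semidefinite, $\operatorname{tr}(A_M (\hat{\Sigma} \otimes I_n))$ lies within a factor $1 \pm \eta$ of $\operatorname{tr}(A_M (\Sigma \otimes I_n))$, which is how a multiplicative error on the covariance carries over to the penalty.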

An important feature of Theorem 20 is that it holds under very mild assumptions on the mean $f$ of the data (see Remark 22). Therefore, it shows $\hat{\Sigma}$ is able to estimate a covariance matrix without prior knowledge on the regression function, which, to the best of our knowledge, has never been obtained in multi-task regression.

Remark 21 (Scaling of for consistency)

A sufficient condition for ensuring is a consistent estimator of is

which enforces a scaling between , and . Nevertheless, this condition is probably not necessary since the simulation experiments of Section 6 show that can be well estimated (at least for estimator selection purposes) in a setting where .

Remark 22 (On assumption (13))

Assumption (13) is a single-task assumption (made independently for each task). The upper bound can be multiplied by any factor (as in Theorem 15), at the price of multiplying by in the upper bound of Equation (14). More generally the bounds on the degree of freedom and the bias in (13) only influence the upper bound of Equation (14). The rates are chosen here to match the lower bound, see Proposition 10 of Arlot and Bach (2011) for more details.

Assumption (13) is rather classical in model selection, see Arlot and Bach (2011) for instance. In particular, (a weakened version of) (13) holds if the bias is bounded by , for some .

Remark 23 (Choice of the set $\mathcal{Z}$)

Other choices could have been made for $\mathcal{Z}$, however ours seems easier in terms of computation, since $|\mathcal{Z}| = p(p+1)/2$. Choosing a larger set $\mathcal{Z}$ leads to theoretical difficulties in the reconstruction of $\hat{\Sigma}$, while taking other basis vectors leads to more complex computations. We can also note that increasing $|\mathcal{Z}|$ decreases the probability in Theorem 20, since it comes from a union bound over the one-dimensional estimations.

Remark 24

When as defined by Equation (12), that is, when a diagonalization basis of is known, Theorem 20 still holds on a set of larger probability with a reduced error . Then, a consistent estimation of is possible whenever for some .

5 Oracle Inequality

This section aims at proving “oracle inequalities”, as usually done in a model selection setting: given a set of models or of estimators, the goal is to upper bound the risk of the selected estimator by the oracle risk (defined by Equation (6)), up to an additive term and a multiplicative factor. We show two oracle inequalities (Theorems 26 and 29) that correspond to two possible definitions of $\hat{\Sigma}$.

Note that “oracle inequality” sometimes has a different meaning in the literature (see for instance Lounici et al., 2011) when the risk of the proposed estimator is controlled by the risk of an estimator using information coming from the true parameter (that is, available only if provided by an oracle).

5.1 A General Result for Discrete Matrix Sets

We first show that the estimator $\hat{\Sigma}$ introduced in Equation (11) is precise enough to derive an oracle inequality when plugged into the penalty defined in Equation (7), in the case where $\mathcal{M}$ is finite.

Definition 25

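Let $\hat{\Sigma}$ be the estimator of $\Sigma$ defined by Equation (11). We define

$$\hat{M} \in \arg\min_{M \in \mathcal{M}} \left\{ \frac{1}{np} \| y - \hat{f}_M \|_2^2 + \frac{2 \operatorname{tr}\left( A_M \cdot (\hat{\Sigma} \otimes I_n) \right)}{np} \right\}.$$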

We assume now the following holds true:

(15)

Theorem 26

Let , and assume (13) and (15) hold true. Absolute constants , a constant and an event exist such that and the following holds as soon as . First, on ,

(16)

Second, an absolute constant exists such that

(17)

Theorem 26 is proved in Section F.

Remark 27

If is defined by Equation (12) the result still holds on a set of larger probability with a reduced error, similar to the one in Theorem 29.

5.2 A Result for a Continuous Set of Jointly Diagonalizable Matrices

We now show a similar result when matrices in $\mathcal{M}$ can be jointly diagonalized. It turns out a faster algorithm can be used instead of Equation (11) with a reduced error and a larger probability event in the oracle inequality. Note that we no longer assume $\mathcal{M}$ is finite, so it can be parametrized by continuous parameters.

Suppose now the following holds, which means the matrices of are jointly diagonalizable:

(18)

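Let $P$ be the matrix defined in Assumption (18), $\widetilde{\Sigma} = P \Sigma P^\top$ and recall that $A_{p d_j} = K(K + n p d_j I_n)^{-1}$. Computations detailed in Appendix D show that the ideal penalty introduced in Equation (7) can be written as

$$\forall M = P^\top \operatorname{Diag}(d_1, \ldots, d_p) P \in \mathcal{M}, \quad \operatorname{pen}_{\mathrm{id}}(M) = \frac{2}{np} \sum_{j=1}^p \operatorname{tr}(A_{p d_j})\, \widetilde{\Sigma}_{j,j}. \qquad (19)$$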

Equation (19) shows that under Assumption (18), we do not need to estimate the entire matrix $\Sigma$ in order to have a good penalization procedure, but only to estimate the variance of the noise in $p$ directions.
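
The reduction to $p$ directions can be verified numerically on a small example. The sketch assumes the convention $M = P^\top \operatorname{Diag}(d_1, \ldots, d_p) P$ with $P$ orthogonal, the Kronecker form of the smoothing matrix, and the penalty $2\operatorname{tr}(A_M (\Sigma \otimes I_n))/(np)$. The kernel and all numerical values are arbitrary illustrations.

```python
import numpy as np

def smoothing_matrix(M, K):
    """Assumed form: A_M = (M^{-1} kron K) ((M^{-1} kron K) + n p I)^{-1}."""
    n, p = K.shape[0], M.shape[0]
    K_tilde = np.kron(np.linalg.inv(M), K)
    return K_tilde @ np.linalg.inv(K_tilde + n * p * np.eye(n * p))

rng = np.random.default_rng(1)
n, p = 6, 3
x = np.linspace(0.0, 1.0, n)
K = np.exp(-np.abs(x[:, None] - x[None, :]))

P, _ = np.linalg.qr(rng.normal(size=(p, p)))   # a random orthogonal matrix
d = np.array([0.2, 1.0, 3.0])
M = P.T @ np.diag(d) @ P

B = rng.normal(size=(p, p))
Sigma = B @ B.T + np.eye(p)                    # a positive-definite covariance

# Penalty computed from the full np x np matrices
full = 2.0 * np.trace(smoothing_matrix(M, K) @ np.kron(Sigma, np.eye(n))) / (n * p)

# Penalty computed from p single-task degrees of freedom and p variances
Sigma_tilde = P @ Sigma @ P.T
df = [np.trace(K @ np.linalg.inv(K + n * p * dj * np.eye(n))) for dj in d]
reduced = 2.0 * sum(df[j] * Sigma_tilde[j, j] for j in range(p)) / (n * p)
```

Only the diagonal of $P \Sigma P^\top$ enters the reduced expression, so $p$ one-dimensional variance estimates (one per row of $P$) suffice in place of the $p(p+1)/2$ used for a full covariance estimate.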

Definition 28

Let be the canonical basis of , be the orthogonal basis defined by . We then define

where for every , denotes the output of Algorithm 14 applied to Problem (), and

(20)

Theorem 29

Let , and assume (13) and (18) hold true. Absolute constants , and , a constant and an event exist such that and the following holds as soon as . First, on ,

(21)

Second, an absolute constant exists such that

(22)

Theorem 29 is proved in Section F.

5.3 Comments on Theorems 26 and 29

Remark 30

Taking (hence and ), we recover Theorem 3 of Arlot and Bach (2011) as a corollary of Theorem 26.

Remark 31 (Scaling of )

When assumption (15) holds, Equation (16) implies the asymptotic optimality of the estimator when

In particular, only such that are admissible. When assumption (18) holds, the scalings required to ensure optimality in Equation (21) are more favorable:

It is to be noted that still influences the left hand side via .

Remark 32

Theorems 26 and 29 are non-asymptotic oracle inequalities, with a multiplicative term of the form $1 + o(1)$. This allows us to claim that our selection procedure is nearly optimal, since our estimator is close (with regard to the empirical quadratic norm) to the oracle one. Furthermore the term in front of the infima in Equations (16), (21), (17) and (22) can be further diminished, but this yields a greater remainder term as a consequence.

Remark 33 (On assumption (18))

Assumption (18) actually means all matrices in $\mathcal{M}$ can be diagonalized in a common orthogonal basis, and thus can be parametrized by their eigenvalues as in Examples 3, 4 and 5.

In that case the optimization problem is quite easy to solve, as detailed in Remark 36. If not, solving (20) may turn out to be a hard problem, and our theoretical results do not cover this setting. However, it is always possible to discretize the set $\mathcal{M}$ or, in practice, to use gradient descent.

Compared to the setting of Theorem 26, assumption (18) allows a simpler estimator for the penalty (19), with an increased probability and a reduced error in the oracle inequality.

The main theoretical limitation comes from the fact that the probabilistic concentration tools used apply to discrete sets (through union bounds). The structure of kernel ridge regression allows us to have a uniform control over a continuous set for the single-task estimators at the “cost” of $n$ pointwise controls, which can then be extended to the multi-task setting via (18). We conjecture Theorem 29 still holds without (18) as long as $\mathcal{M}$ is not “too large”, which could be proved similarly up to some uniform concentration inequalities.

Note also that if