Kernels for Vector-Valued Functions: a Review
Kernel methods are among the most popular techniques in machine learning. From a regularization perspective they play a central role in regularization theory as they provide a natural choice for the hypotheses space and the regularization functional through the notion of reproducing kernel Hilbert spaces. From a probabilistic perspective they are the key in the context of Gaussian processes, where the kernel function is known as the covariance function. Traditionally, kernel methods have been used in supervised learning problem with scalar outputs and indeed there has been a considerable amount of work devoted to designing and learning kernels. More recently there has been an increasing interest in methods that deal with multiple outputs, motivated partly by frameworks like multitask learning. In this paper, we review different methods to design or learn valid kernel functions for multiple outputs, paying particular attention to the connection between probabilistic and functional methods.
- 1 Introduction
- 2 Learning Scalar Outputs with Kernel Methods
- 3 Learning Multiple Outputs with Kernels Methods
4 Separable Kernels and Sum of Separable Kernels
- 4.1 Kernels and Regularizers
- 4.2 Coregionalization Models
- 4.3 Extensions
- 5 Beyond Separable Kernels
- 6 Inference and Computational Considerations
- 7 Applications of Multivariate Kernels
- 8 Discussion
Many modern applications of machine learning require solving several decision making or prediction problems and exploiting dependencies between the problems is often the key to obtain better results and coping with a lack of data (to solve a problem we can borrow strength from a distinct but related problem).
In sensor networks, for example, missing signals from certain sensors may be predicted by exploiting their correlation with observed signals acquired from other sensors . In geostatistics, predicting the concentration of heavy pollutant metals, which are expensive to measure, can be done using inexpensive and oversampled variables as a proxy . In computer graphics, a common theme is the animation and simulation of physically plausible humanoid motion. Given a set of poses that delineate a particular movement (for example, walking), we are faced with the task of completing a sequence by filling in the missing frames with natural-looking poses. Human movement exhibits a high-degree of correlation. Consider, for example, the way we walk. When moving the right leg forward, we unconsciously prepare the left leg, which is currently touching the ground, to start moving as soon as the right leg reaches the floor. At the same time, our hands move synchronously with our legs. We can exploit these implicit correlations for predicting new poses and for generating new natural-looking walking sequences . In text categorization, one document can be assigned to multiple topics or have multiple labels . In all the examples above, the simplest approach ignores the potential correlation among the different output components of the problem and employ models that make predictions individually for each output. However, these examples suggest a different approach through a joint prediction exploiting the interaction between the different components to improve on individual predictions. Within the machine learning community this type of modeling is often broadly referred to to as multitask learning. Again the key idea is that information shared between different tasks can lead to improved performance in comparison to learning the same tasks individually. These ideas are related to transfer learning [97, 20, 12, 74], a term which refers to systems that learn by transferring knowledge between different domains, for example: “what can we learn about running through seeing walking?”
More formally, the classical supervised learning problem requires estimating the output for any given input ; an estimator is built on the basis of a training set consisting of input-output pairs . The input space is usually a space of vectors, while the output space is a space of scalars. In multiple output learning (MOL) the output space is a space of vectors; the estimator is now a vector valued function . Indeed, this situation can also be described as the problem of solving distinct classical supervised problems, where each problem is described by one of the components of . As mentioned before, the key idea is to work under the assumption that the problems are in some way related. The idea is then to exploit the relation among the problems to improve upon solving each problem separately.
The goal of this survey is twofold. First, we aim at discussing recent results in multi-output/multi-task learning based on kernel methods and Gaussian processes providing an account of the state of the art in the field. Second, we analyze systematically the connections between Bayesian and regularization (frequentist) approaches. Indeed, related techniques have been proposed from different perspectives and drawing clearer connections can boost advances in the field, while fostering collaborations between different communities.
The plan of the paper follows. In chapter 2 we give a brief review of the main ideas underlying kernel methods for scalar learning, introducing the concepts of regularization in reproducing kernel Hilbert spaces and Gaussian processes. In chapter 3 we describe how similar concepts extend to the context of vector valued functions and discuss different settings that can be considered. In chapters 4 and 5 we discuss approaches to constructing multiple output kernels, drawing connections between the Bayesian and regularization frameworks. The parameter estimation problem and the computational complexity problem are both described in chapter 6. In chapter 7 we discuss some potential applications that can be seen as multi-output learning. Finally we conclude in chapter 8 with some remarks and discussion.
2 Learning Scalar Outputs with Kernel Methods
To make the paper self contained, we will start our study reviewing the classical problem of learning a scalar valued function, see for example [100, 40, 10, 82]. This will also serve as an opportunity to review connections between Bayesian and regularization methods.
As we mentioned above, in the classical setting of supervised learning, we have to build an estimator (e.g. a classification rule or a regression function) on the basis of a training set . Given a symmetric and positive bivariate function , namely a kernel, one of the most popular estimators in machine learning is defined as
where has entries , and , where is a new input point. Interestingly, such an estimator can be derived from two different, though, related perspectives.
2.1 A Regularization Perspective
We will first describe a regularization (frequentist) perspective (see [35, 105, 100, 86]). The key point in this setting is that the function of interest is assumed to belong to a reproducing kernel Hilbert space (RKHS),
Then the estimator is derived as the minimizer of a regularized functional
The first term in the functional is the so called empirical risk and
it is the sum of the squared errors.
It is a measure of the price we pay when predicting in
place of . The second term in the functional is the (squared) norm
in a RKHS. This latter concept plays a key role, so we review a few
essential concepts (see [87, 6, 105, 25]). A RKHS is a Hilbert
space of functions and can be defined by a reproducing
where is the inner product in .
The latter property, known as the reproducing property, gives the name to the space. Two further properties make RKHS appealing:
functions in a RKHS are in the closure of the linear combinations of the kernel at given points, . This allows us to describe, in a unified framework, linear models as well as a variety of generalized linear models;
the norm in a RKHS can be written as and is a natural measure of how complex is a function. Specific examples are given by the shrinkage point of view taken in ridge regression with linear models  or the regularity expressed in terms of magnitude of derivatives, as is done in spline models .
In this setting the functional (2) can be derived either
from a regularization point of view
[35, 105] or from the theory of empirical
risk minimization (ERM) . In the former, one observes
that, if the space is large enough, the minimization of the
empirical error is ill-posed, and in particular it responds in an
unstable manner to noise, or when the number of samples is low
Adding the squared norm stabilizes the problem. The latter point of view, starts from the
analysis of ERM showing that generalization to new samples can be
achieved if there is a tradeoff between fitting and
The explicit form of the estimator is derived in two steps. First, one can show that the minimizer of (2) can always be written as a linear combination of the kernels centered at the training set points,
see for example [65, 19]. The above result is the well known representer theorem originally proved in  (see also  and  for recent results and further references). The explicit form of the coefficients can be then derived by substituting for in (2).
2.2 A Bayesian Perspective
A Gaussian process (GP) is a stochastic process with the important characteristic that any finite number of random variables, taken from a realization of the GP, follows a joint Gaussian distribution. A GP is usually used as a prior distribution for functions . If the function follows a Gaussian process we write
where is the mean function and the covariance or kernel function. The mean function and the covariance function completely specify the Gaussian process. In other words the above assumption means that for any finite set if we let then
where and is the kernel matrix. In the following, unless otherwise stated, we assume that the mean vector is zero.
From a Bayesian point of view, the Gaussian process specifies our prior beliefs about the properties of the function we are modeling. Our beliefs are updated in the presence of data by means of a likelihood function, that relates our prior assumptions to the actual observations. This leads to an updated distribution, the posterior distribution, that can be used, for example, for predicting test cases.
In a regression context, the likelihood function is usually Gaussian and expresses a linear relation between the observations and a given model for the data that is corrupted with a zero mean Gaussian noise,
where corresponds to the variance of the noise. Noise is assumed to be independent and identically distributed. In this way, the likelihood function factorizes over data points, given the set of inputs and . The posterior distribution can be computed analytically. For a test input vector , given the training data , this posterior distribution is given by,
where denotes the set of parameters which include the variance of the noise, , and any parameters from the covariance function . Here we have
and finally we note that if we are interested into the distribution of the noisy predictions, , it is easy to see that we simply have to add to the expression for the predictive variance (see ).
Figure 1 represents a posterior predictive distribution for a data vector with . Data points are represented as dots in the figure. The solid line represents the mean function predicted, , while the shaded region corresponds to two standard deviations away from the mean. This shaded region is specified using the predicted covariance function, . Notice how the uncertainty in the prediction increases as we move away from the data points.
Equations for and are obtained under the assumption of a Gaussian likelihood, common in regression setups. For non-Gaussian likelihoods, for example in classification problems, closed form solutions are not longer possible. In this case, one can resort to different approximations, including the Laplace approximation and variational methods .
2.3 A Connection Between Bayesian and Regularization Point of Views
Connections between regularization theory and Gaussian process prediction or Bayesian models for prediction have been pointed out elsewhere [78, 105, 82]. Here we just give a very brief sketch of the argument. We restrict ourselves to finite dimensional RKHS. Under this assumption one can show that every RKHS can be described in terms of a feature map , that is a map , such that
In fact in this case one can show that functions in the RKHS with kernel can be written as
Then we can build a Gaussian process by assuming the coefficient to be distributed according to a multivariate Gaussian distribution. Roughly speaking, in this case the assumption becomes
As we noted before if we assume a Gaussian likelihood we have
where and . Then the posterior distribution is proportional to
and we see that a maximum a posteriori estimate will in turn give the minimization problem defining Tikhonov regularization , where the regularization parameter is now related to the noise variance.
We note that in regularization the squared error is often replaced by a more general error term . In a regularization perspective, the loss function measure the error we incur when predicting in place of . The choice of the loss function is problem dependent. Often used examples are the square loss, the logistic loss or the hinge loss used in support vector machines (see ).
The choice of a loss function in a regularization setting can be contrasted to the choice of the likelihood in a Bayesian setting. In this context, the likelihood function models how the observations deviate from the assumed true model in the generative process. The notion of a loss function is philosophically different. It represents the cost we pay for making errors. In Bayesian modeling decision making is separated from inference. In the inference stage the posterior distributions are computed evaluating the uncertainty in the model. The loss function appears only at the second stage of the analysis, known as the decision stage, and weighs how incorrect decisions are penalized given the current uncertainty. However, whilst the two notions are philosophically very different, we can see that, due to the formulation of the frameworks, the loss function and the log likelihood provide the same role mathematically.
The discussion in the previous sections shows that the notion of a kernel plays a crucial role in statistical modeling both in the Bayesian perspective (as the covariance function of a GP) and the regularization perspective (as a reproducing kernel). Indeed, for scalar valued problems there is a rich literature on the design of kernels (see for example [86, 90, 82] and references therein). In the next sections we show how the concept of a kernel can be used in multi-output learning problems. Before doing that, we describe how the concepts of RKHSs and GPs translate to the setting of vector valued learning.
3 Learning Multiple Outputs with Kernels Methods
In this chapter we discuss the basic setting for learning vector valued functions and related problems (multiclass, multilabel) and then describe how the concept of kernels (reproducing kernels and covariance function for GP) translate to this setting.
3.1 Multi-output Learning
The problem we are interested in is that of learning an unknown functional relationship between an input space , for example , and an output space . In the following we will see that the problem can be tackled either assuming that belongs to reproducing kernel Hilbert space of vector valued functions or assuming that is drawn from a vector valued Gaussian process. Before doing this we describe several related settings all falling under the framework of multi-output learning.
The natural extension of the traditional (scalar) supervised learning problem is the one we discussed in the introduction, when the data are pairs . For example this is the typical setting for problems such as motion/velocity fields estimation. A special case is that of multi-category classification problem or multi-label problems, where if we have classes each input point can be associated to a (binary) coding vector where, for example stands for presence ( for absence) of a class instance.The simplest example is the so called one vs all approach to multiclass classification which, if we have classes, amounts to the coding , where is the canonical basis of .
A more general situation is that where different outputs might have different training set cardinalities, different input points or in the extreme case even different input spaces. More formally, in this case we have a training set for each component , with , where the number of data associated with each output, might be different and the input for a component might belong to different input space .
The terminology used in machine learning often does not distinguish the different settings above and the term multitask learning is often used. In this paper we use the term multi-output learning or vector valued learning to define the general class of problems and use the term multi-task for the case where each component has different inputs. Indeed in this very general situation each component can be thought of as a distinct task possibly related to other tasks (components). In the geostatistics literature, if each output has the same set of inputs the model is called isotopic and heterotopic if each output to be associated with a different set of inputs . Heterotopic data is further classified into entirely heterotopic data, where the variables have no sample locations in common, and partially heterotopic data, where the variables share some sample locations. In machine learning, the partially heterotopic case is sometimes referred to as asymmetric multitask learning [112, 21].
The notation in the multitask learning scenario (heterotopic case) is a bit more involved. To simplify the notation we assume that the number of data for each output is the same. Moreover, for the sake of simplicity sometimes we restrict the presentation to the isotopic setting, though the models can usually readily be extended to the more general setting. We will use the notation to indicate the collection of all the training input points, , and to denote the collection of all the training data. Also we will use the notation to indicate a vector valued function evaluated at different training points. This notation has slightly different meaning depending on the way the input points are sampled. If the input to all the components are the same then and . If the input for the different components are different then , where and .
3.2 Reproducing Kernel for Vector Valued Function
The definition of RKHS for vector valued functions parallels the one in the scalar, with the main difference that the reproducing kernel is now matrix valued, see for example [65, 19] . A reproducing kernel is a symmetric function , such that for any is a positive semi-definite matrix. A vector valued RKHS is a Hilbert space of functions , such that for very , and , , as a function of belongs to and moreover has the reproducing property
where is the inner product in .
Again, the choice of the kernel corresponds to the choice of the representation (parameterization) for the function of interest. In fact any function in the RKHS is in the closure of the set of linear combinations
where we note that in the above equation each term is a matrix acting on a vector . The norm in the RKHS typically provides a measure of the complexity of a function and this will be the subject of the next sections.
Note that the definition of vector valued RKHS can be described in a component-wise fashion in the following sense. The kernel can be described by a scalar kernel acting jointly on input examples and task indices, that is
where is a scalar reproducing kernel on the space . This latter point of view is useful while dealing with multitask learning, see  for a discussion.
Provided with the above concepts we can follow a regularization approach to define an estimator by minimizing the regularized empirical error (2), which in this case can be written as
where . Once again the solution is given by the representer theorem 
and the coefficient satisfies the linear system
where are vectors obtained concatenating the coefficients and the output vectors, and is an with entries , for and (see for example ). More explicitly
where each block is an by matrix (here we make the simplifying assumption that each output has same number of training data). Note that given a new point the corresponding prediction is given by
where has entries for and .
3.3 Gaussian Processes for Vector Valued Functions
Gaussian process methods for modeling vector-valued functions follow the same approach as in the single output case. Recall that a Gaussian process is defined as a collection of random variables, such that any finite number of them follows a joint Gaussian distribution. In the single output case, the random variables are associated to a single process evaluated at different values of while in the multiple output case, the random variables are associated to different processes , evaluated at different values of [24, 37, 102].
The vector-valued function is assumed to follow a Gaussian process
where is a vector which components are the mean functions of each output and is a positive matrix valued function as in section 3.2. The entries in the matrix correspond to the covariances between the outputs and and express the degree of correlation or similarity between them.
For a set of inputs , the prior distribution over the vector is given by
where is a vector that concatenates the mean vectors associated to the outputs and the covariance matrix is the block partitioned matrix in (6). Without loss of generality, we assume the mean vector to be zero.
In a regression context, the likelihood function for the outputs is often taken to be Gaussian distribution, so that
where is a diagonal matrix with elements
For a Gaussian likelihood, the predictive distribution and the marginal likelihood can be derived analytically. The predictive distribution for a new vector is 
where , has entries for and , and denotes a possible set of hyperparameters of the covariance function used to compute and the variances of the noise for each output . Again we note that if we are interested into the distribution of the noisy predictions it is easy to see that we simply have to add to the expression of the prediction variance. The above expression for the mean prediction coincides again with the prediction of the estimator derived in the regularization framework.
In the following chapters we describe several possible choices of kernels (covariance function) for multi-output problems. We start in the next chapter with kernel functions that clearly separate the contributions of input and output. We will see later alternative ways to construct kernel functions that interleave both contributions in a non trivial way.
4 Separable Kernels and Sum of Separable Kernels
In this chapter we review a special class of multi-output kernel functions that can be formulated as a sum of products between a kernel function for the input space alone, and a kernel function that encodes the interactions among the outputs. We refer to this type of multi-output kernel functions as separable kernels and sum of separable kernels (SoS kernels).
We consider a class of kernels of the form
where are scalar kernels on and .
Equivalently one can consider the matrix expression
where is a symmetric and positive semi-definite matrix. We call this class of kernels separable since, comparing to (3), we see that the contribution of input and output is decoupled.
In the same spirit a more general class of kernels is given by
For this class of kernels, the kernel matrix associated to a data set has a simpler form and can be written as
where represents the Kronecker product between matrices. We call this class of kernels sum of separable kernels (SoS kernels).
The simplest example of separable kernel is given by setting , where is the Kronecker delta. In this case , that is all the outputs are treated as being unrelated. In this case the kernel matrix , associated to some set of data , becomes block diagonal. Since the off diagonal terms encode output relatedness. We can see that the matrix encodes dependencies among the outputs.
The key question is how to choose the scalar kernels and especially how to design, or learn, the matrices . This is the subject we discuss in the next few sections. We will see that one can approach the problem from a regularization point of view, where kernels will be defined by the choice of suitable regularizers, or, from a Bayesian point of view, constructing covariance functions from explicit generative models for the different output components. As it turns out these two points of view are equivalent and allow for two different interpretations of the same class of models.
4.1 Kernels and Regularizers
In this section we largely follow the results in [64, 65, 27] and . A possible way to design multi-output kernels of the form (9) is given by the following result. If is given by (9) then is possible to prove that the norm of a function in the corresponding RKHS can be written as
where is the pseudoinverse of and . The above expression gives another way to see why the matrix encodes the relation among the components. In fact, we can interpret the right hand side in the above expression as a regularizer inducing specific coupling among different tasks with different weights given by . This result says that any such regularizer induces a kernel of the form (9). We illustrate the above idea with a few examples.
Mixed Effect Regularizer Consider the regularizer given by
where and The above regularizer is composed of two terms: the first is a standard regularization term on the norm of each component of the estimator; the second forces each to be close to the mean estimator across the components, . The corresponding kernel imposes a common similarity structure between all the output components and the strength of the similarity is controlled by a parameter ,
where is the matrix whose entries are all equal to , and is a scalar kernel on the input space . Setting corresponds to treating all components independently and the possible similarity among them is not exploited. Conversely, is equivalent to assuming that all components are identical and are explained by the same function. By tuning the parameter the above kernel interpolates between this two opposites cases. We note that from a Bayesian perspective is a correlation matrix with all the off-diagonals equal to , which means that the output of the Gaussian process are exchangeable.
Cluster Based Regularizer. Another example of regularizer, proposed in , is based on the idea of grouping the components into clusters and enforcing the components in each cluster to be similar. Following , let us define the matrix as the matrix, where is the number of clusters, such that if the component belongs to cluster and otherwise. Then we can compute the matrix such that if components and belong to the same cluster , and is its cardinality, otherwise. Furthermore let be the index set of the components that belong to cluster . Then we can consider the following regularizer that forces components belonging to the same cluster to be close to each other:
where is the mean of the components in cluster and are parameters balancing the two terms. Straightforward calculations show that the previous regularizer can be rewritten as , where
Therefore the corresponding matrix valued kernel is .
Graph Regularizer. Following [64, 91], we can define a regularizer that, in addition to a standard regularization on the single components, forces stronger or weaker similarity between them through a given positive weight matrix ,
The regularizer can be rewritten as:
where , with . Therefore the resulting kernel will be , with a scalar kernel to be chosen according to the problem at hand.
In the next section we will see how models related to those described above can be derived from suitable generative models.
4.2 Coregionalization Models
The use of probabilistic models and Gaussian processes for multi-output learning was pioneered and largely developed in the context of geostatistics, where prediction over vector-valued output data is known as cokriging. Geostatistical approaches to multivariate modelling are mostly formulated around the “linear model of coregionalization” (LMC) [49, 37], that can be considered as a generative approach for developing valid covariance functions. Covariance functions obtained under the LMC assumption follow the form of a sum of separable kernels. We will start considering this model and then discuss how several models recently proposed in the machine learning literature are special cases of the LMC.
The Linear Model of Coregionalization
In the linear model of coregionalization, the outputs are expressed as linear combinations of independent random functions. This is done in a way that ensures that the resulting covariance function (expressed jointly over all the outputs and the inputs) is a valid positive semidefinite function. Consider a set of outputs with . In the LMC, each component is expressed as 
where the latent functions , have mean zero and covariance if , and are scalar coefficients. The processes are independent for . The independence assumption can be relaxed and such relaxation is presented as an extension in section 4.3. Some of the basic processes and can have the same covariance , while remaining independent.
where the functions , with and , have mean equal to zero and covariance if and . Expression (18) means that there are groups of functions and that the functions within each group share the same covariance, but are independent. The cross covariance between any two functions and is given in terms of the covariance functions for
The covariance is given by . Due to the independence of the functions , the above expression reduces to
with . The kernel can now be expressed as
where each is known as a coregionalization matrix. The elements of each are the coefficients appearing in equation (19). The rank for each matrix is determined by the number of latent functions that share the same covariance function , that is, by the coefficient .
Equation (18) can be interpreted as a nested structure  in which the outputs are first expressed as a linear combination of spatially uncorrelated processes with and if , otherwise it is equal to zero. At the same time, each process can be represented as a set of uncorrelated functions weighted by the coefficients , where again, the covariance function for is .
Therefore, starting from a generative model for the outputs, the linear model of coregionalization leads to a sum of separable kernels that represents the covariance function as the sum of the products of two covariance functions, one that models the dependence between the outputs, independently of the input vector (the coregionalization matrix ), and one that models the input dependence, independently of the particular set of functions (the covariance function ). The covariance matrix for is given by (10).
Intrinsic Coregionalization Model
A simplified version of the LMC, known as the intrinsic coregionalization model (ICM) (see ), assumes that the elements of the coregionalization matrix can be written as , for some suitable coefficients . With this form for , we have
where . The above expression can be seen as a particular case of the kernel function obtained from the linear model of coregionalization, with . In this case, the coefficients , and the kernel matrix for multiple outputs becomes as in (9).
The kernel matrix corresponding to a dataset takes the form
One can see that the intrinsic coregionalization model corresponds to the special separable kernel often used in the context of regularization. Notice that the value of for the coefficients , determines the rank of the matrix .
As pointed out by , the ICM is much more restrictive than the LMC since it assumes that each basic covariance contributes equally to the construction of the autocovariances and cross covariances for the outputs. However, the computations required for the corresponding inference are greatly simplified, essentially because of the properties of the Kronecker product. This latter point is discussed in detail in Section 6.
Comparison Between ICM and LMC
We have seen before that the intrinsic coregionalization model is a particular case of the linear model of coregionalization for (with ) in equation 19. Here we contrast these two models. Note that a different particular case of the linear model of coregionalization is assuming (with ). This model, known in the machine learning literature as the semiparametric latent factor model (SLFM) , will be introduced in the next subsection.
To compare the two models we have sampled from a multi-output Gaussian process with two outputs (), a one-dimensional input space () and a LMC with different values for and . As basic kernels we have used the exponentiated quadratic (EQ) kernel given as ,
where represents the Euclidian norm and is known as the characteristic length-scale. The exponentiated quadratic is variously referred to as the Gaussian, the radial basis function or the squared exponential kernel.
Figure 2 shows samples from the intrinsic coregionalization model for , meaning a coregionalization matrix of rank one. Samples share the same length-scale and have similar form. They have different variances, though. Each sample may be considered as a scaled version of the latent function, as it can be seen from equation 18 with and ,
where we have used instead of for the one-dimensional input space.
where and are sampled from the same Gaussian process. Outputs are weighted sums of two different latent functions that share the same covariance. In contrast to the ICM of rank one, we see from figure 3 that both outputs have different forms, although they share the same length-scale.