Scalable Matrix-valued Kernel Learning for High-dimensional Nonlinear Multivariate Regression and Granger Causality

Scalable Matrix-valued Kernel Learning for High-dimensional Nonlinear Multivariate Regression and Granger Causality

   Vikas Sindhwani
IBM Research
New York 10598, USA &Minh Há Quang
Italian Institute of Technology
Genova, 16163, Italy &Aurélie C. Lozano
IBM Research
New York 10598, USA

We propose a general matrix-valued multiple kernel learning framework for high-dimensional nonlinear multivariate regression problems. This framework allows a broad class of mixed norm regularizers, including those that induce sparsity, to be imposed on a dictionary of vector-valued Reproducing Kernel Hilbert Spaces. We develop a highly scalable and eigendecomposition-free algorithm that orchestrates two inexact solvers for simultaneously learning both the input and output components of separable matrix-valued kernels. As a key application enabled by our framework, we show how high-dimensional causal inference tasks can be naturally cast as sparse function estimation problems, leading to novel nonlinear extensions of a class of Graphical Granger Causality techniques. Our algorithmic developments and extensive empirical studies are complemented by theoretical analyses in terms of Rademacher generalization bounds.

1 Introduction

Consider the problem of estimating an unknown non-linear function, , from labeled examples, where is a “structured” output space [6]. In principle, may be endowed with a general Hilbert space structure, though we focus on the multivariate regression setting, where . Such problems can be naturally formulated as Tikhonov Regularization [36] in a suitable vector-valued Reproducing Kernel Hilbert Space (RKHS) [27, 1]. The theory and formalism of vector-valued RKHS can be traced as far back as the work of Laurent Schwarz in 1964 [34]. Yet, vector-valued extensions of kernel methods have not found widespread application, in stark contrast to the versatile popularity of their scalar cousins. We believe that two key factors are responsible:

  • The kernel function is much more complicated - it is matrix-valued in this setting. Its choice turns into a daunting model selection problem. Contrast this with the scalar case where Gaussian or Polynomial kernels are a default choice requiring only a few hyperparameters to be tuned.

  • The associated optimization problems, in the most general case, have much greater computational complexity than in the scalar case. For example, a vector-valued Regularized Least Squares (RLS) solver would require cubic time in the number of samples multiplied by the number of output coordinates.

Scalable kernel learning therefore becomes a basic necessity – an unavoidable pre-requisite – for even considering vector-valued RKHS methods for an application at hand. Our contributions in this paper are as follows:

  • We propose a general framework for function estimation over a dictionary of vector-valued RKHSs where a broad family of variationally defined regularizers, including sparsity inducing norms, serve to optimally combine a collection of matrix-valued kernels. As such our framework may be viewed as providing generalizations of scalar multiple kernel learning [22, 28, 23, 31] and associated structured sparsity algorithms [9].

  • We specialize our framework to the class of separable kernels [1] which are of interest due to their universality [10], conceptual simplicity and potential for scalability. Separable matrix-valued kernels are composed of a scalar input kernel component and a positive semi-definite output matrix component (to be formally defined later). We provide a full resolution of the kernel learning problem in this setting by jointly estimating both components together. This is in contrast to recent efforts [13, 20] where only one of the two components is optimized, and the full complexity of the joint problem is not addressed. Our algorithms achieve scalability by orchestrating carefully designed inexact solvers for inner subproblems, for which we also provide convergence rates.

  • We provide bounds on Rademacher Complexity for the vector-valued hypothesis sets considered by our algorithm. This complements and extends generalization results in the multiple kernel learning literature for the scalar case [12, 39, 21, 26, 22].

  • We demonstrate that the generality of our framework enables novel non-standard applications. In particular, when applied to multivariate time series problems, sparsity in kernel combinations lends itself to a natural causality interpretation. We believe that this nonlinear generalization of graphical Granger Causality techniques (see  [3, 24, 35] and references therein) may be of independent interest.

In section 6 we provide Supplementary Material containing detailed proofs of the results stated in the main body of this paper.

2 Vector-valued RLS & Separable Matrix-valued Kernels

Given labeled examples , , the vector-valued Regularized Least Squares (RLS) solves the following problem,


where is a vector-valued RKHS generated by the kernel function , and is the regularization parameter. For readers unfamiliar with vector-valued RKHS theory that is the basis of such algorithms, we provide a first-principles overview in our Supplementary Material (see section 6.4).

In the vector-valued setting, is a matrix-valued function, i.e., when evaluated for any pair of inputs , the value of is an -by- matrix. More generally speaking, the kernel function is an input-dependent linear operator on the output space. The kernel function is positive in the sense that for any finite set of input-output pairs , the following holds: . A generalization [27] of the standard Representer Theorem says that the optimal solution has the form,


where the coefficients are -dimensional vectors. For RLS, these coefficient vectors can be obtained solving a dense linear system, of the familiar form,

where assembles the coefficient vectors into a matrix; the operator stacks columns of its argument matrix into a long column vector; is a large -sized Gram matrix comprising of the blocks , for , and denotes the identity matrix of compatible size. It is easy to see that for , the above developments exactly collapse to familiar concepts for scalar RLS (also known as Kernel Ridge Regression). In general though, the above linear system requires time to be solved using standard dense numerical linear algebra, which is clearly prohibitive. However, for a family of separable matrix-valued kernels [1, 10] defined below, the computational cost can be improved to ; though still costly, this is atleast comparable to scalar RLS when is comparable to .

Separable Matrix Valued Kernel and its Gram matrix: Let be a scalar kernel function on the input space and represent its Gram matrix on a finite sample. Let be an positive semi-definite output kernel matrix. Then, the function is positive and hence defines a matrix valued kernel. The Gram matrix of this kernel is where denotes Kronecker product.

For separable kernels, the corresponding RLS dense linear system (Eqn 3 below) can be reorganized into a Sylvester equation (Eqn 4 below):


Sylvester solvers are more efficient than applying a direct dense linear solver for Eqn 3. The classical Bartel-Stewart and Hessenberg-Schur methods (e.g., see MATLAB’s dlyap function) are usually used for solving Sylvester equations. They are similar in flavor to an eigendecomposition approach [1] we describe next for completeness, though they take fewer floating point operations at the same cubic order of complexity.

Eigen-decomposition based Sylvester Solver: Let and denote the eigen-decompositions of and respectively, where . Then the solution to the matrix equation always exists when and is given by .

Output Kernel Learning: In recent work  [13] develop an elegant extension of the vector-valued RLS problem (Eqn. 1), which we will briefly describe here. We will use the shorthand to represent the implied separable kernel and correspondingly denote its RKHS by [13] attempt to jointly learn both and , for a fixed pre-defined choice of . In finite dimensional language, and are estimated by solving the following problem [13],

where denotes trace, denotes Frobenius norm, and denotes the cone of positive semi-definite matrices. It is shown that the objective function is invex, i.e., its stationary points are globally optimal. [13] proposed a block coordinate descent where for fixed , is obtained by solving Eqn. 4 using an Eigendecomposition-based solver. Under the assumption that exactly satisfies Eqn. 4, the resulting update for is then shown to automatically satisfy the constraint that . However, [13] remark that experiments on their largest dataset took roughly a day to complete on a standard desktop and that the “limiting factor was the solution of the Sylvester equation”.

3 Learning over a Vector-valued RKHS Dictionary

Our goals are two fold: one, we seek a fuller resolution of the separable kernel learning problem for vector-valued RLS problems; and two, we wish to derive extensible algorithms that are eigendecomposition-free and much more scalable. In this section, we expand Eqn. 1 to simultaneously learn both input and output kernels over a predefined dictionary, and develop optimization algorithms based on approximate inexact solvers that execute cheap iterations.

Consider a dictionary of separable matrix valued kernels, of size , sharing the same output kernel matrix : . Let denote the sum space of functions:


and equip this space with the following norms:

The infimum in the above definition is in fact attained as a minimum (see Proposition 2 in our Supplementary Material, section 6.1), so that one can write

For notational simplicity, we will denote these norms as though it should be kept in mind that their definition is with respect to a given dictionary of vector-valued RKHSs.

Note that , being the norm of the vector of norms in individual RKHSs, imposes a functional notion of sparsity on the vector-valued function . We now consider objective functions of the form,


where is constrained to belong to the spectahedron with bounded trace:

where denotes the cone of symmetric positive semi-definite matrices, and is a regularizer whose canonical choice will be the squared norm (with ), i.e.,

When , induces sparsity, while for , non-sparse uniform combinations approaching a simple sum of kernels is induced. Our algorithms work for a broad choice of regularizers that admit a quadratic variational representation of the form:


for an appropriate auxillary function . For squared norms, this auxillary function is the indicator function of a convex set [5, 37, 28]:


where for .

We rationalize this framework as follows:

  • Penalty functions of the form above define a broad family of structured sparsity-inducing norms that have extensively been used in the multiple kernel learning and sparse modeling literature [5, 37, 29]. They allow complex non-differentiable norms to be related back to weighted or RKHS norms, and optimizing the weights in many cases infact admits closed form expressions. Infact, all norms admit quadratic variational representations of related forms [5].

  • Optimizing over the Spectahedron allows us to develop a specialized version of the approximate Sparse SDP solver [17] whose iterations involve the computation of only a single extremal eigenvector of the (partial) gradient at the current iterate – this involves relatively cheap operations followed by quick rank-one updates.

  • By bounding the trace of , we show below that a Conjugate Gradient (CG) based iterative Sylvester solver for Eqn. 3 would always be invoked on well-conditioned instances and hence show rapid numerically convergence (particularly also with warm starts).

  • The trace constraint parameter , together with the regularization parameter , also naturally appears in our Rademacher complexity bounds.

3.1 Algorithms

First we give a basic result concerning sums of vector-valued RKHSs. The proof, given in our Supplementary Material (see Section 6.1), follows Section 6 of  [4] replacing scalar concepts with corresponding notions from the theory of vector-valued RKHSs [27].

Proposition 1.

Given a collection of matrix-valued reproducing kernels and positive scalars , the function:

is the reproducing kernel of the sum space with the norm given by:

This result combined with the variational representation of the penalty function in Eqn. 7 allows us to reformulate Eqn. 6 in terms of a joint optimization problem over and , where we define the weighted scalar kernel . This formulation allows us to scale gracefully with respect to , the number of kernels. Denote the Gram matrix of on the labeled data as , i.e., , where denotes the Gram matrices of the individual scalar kernel . The finite dimensional version of the reformulated problem becomes,


A natural strategy for such a non-convex problem is Block Coordinate Descent. The minimization of or keeping the other variables fixed, is a convex optimization problem. The minimization of admits closed form solution. At termination, the vector-valued function returned is

which is a matrix version of the functional form for the optimal solution as specified by the Representer theorem (Eqn. 2) for separable kernels. We next describe each of the three block minimization subproblems.

A. Conjugate Gradient Sylvester Solver: For fixed , the optimal is given by the solution of the dense linear system of Eqn 3 or the Sylvester equation 4, with . General dense linear solvers have prohibitive cost when invoked on Eqn. 3. The eigendecomposition-based Sylvester solver performs much better, but needs to be invoked repeatedly since as well as are changing across (outer) iterations. Instead, we apply a CG-based iterative solver for Eqn 3. Despite the massive size of the linear system, using CG infact has several unobvious quantifiable advantages due to the special Kronecker structure of Eqn. 3:

  • A CG solver can exploit warm starts by initializing from previous , and allow early termination at cheaper computational cost.

  • The large coefficient matrix in Eqn.3 never needs to be explicitly materialized. For any CG iterate , matrix-vector products can be efficiently computed since,

    CG can exploit additional low-rank or sparsity structure in and for fast matrix multiplication. When the base kernels are either (a) linear kernels derived from a small group of features, or (b) arise from randomized approximations, such as the random Fourier features for Gaussian Kernel [30], then where has columns. In this case, need never be explicitly materialized and the cost of matrix multiplication can be further reduced.

CG is expected to make rapid progress in a few iterations in the presence of strong regularization as enforced jointly by and the trace constraint parameter on . This is because the coefficient matrix of the linear system in Eqn. 3 is then expected to be well conditioned for all possible that the algorithm may encounter, as we formalize in the following proposition. Below, let denote the largest eigenvalue of the Gram matrix .

Proposition 2 (Convergence Rate for CG-solver for Eqn. 3 with ).

Assume norm for in Eqn. 6. Let be the CG iterate at step , be the optimal solution (at current fixed and ) and be the initial iterate (warm-started from previous value). Then,

where with . For dictionaries involving only Gaussian scalar kernels, , i.e., the convergence rate depends only on the relative strengths of regularization parameters .

The proof is given in our Supplementary Material (section 6.3).

B. Updates for : Note from Eqn. 7 that the optimal weight vector only depends on the RKHS norms of component functions, and is oblivious to the vector-valued, as opposed to scalar-valued, nature of the functions themselves. This is essentially the reason why existing results [5, 37, 28] routinely used in the (scalar) MKL literature can be immediately applied to our setting to get closed form update rules. Define where refers to previous value of . The components of the optimal weight vector are given below for two choices of .

  • For , the optimal is given by:

  • For an elastic net type penalty, , we have .

Several other choices are also infact possible, e.g., see Table 1 in [37], discussion around subquadratic norms in [5] and regularizers for structured sparsity introduced in [29].

C. Spectahedron Solver: Here, we consider the optimization subproblem, which is:


where and . Hazan’s Sparse SDP solver [17, 14] based on Frank-Wolfe algorithm [11], can be used for problems of the general form,

where is a convex, symmetric and differentiable function. It has been successfully applied in matrix completion and collaborative filtering settings [19].

In each iteration, Hazan’s algorithm optimizes a linearization of the objective function around the current iterate , resulting in updates of the form,


where , and is a constant which measures the curvature of the graph of over the Spectahedron. Here, is an approximate eigensolver which when invoked on the gradient of at the current iterate (a positive semi-definite matrix) computes the single eigenvector corresponding to the smallest eigenvalue, only to a prespecified precision. Hazan’s algorithm is appealing for us since each iteration itself tolerates approximations and the updates pump in rank-one terms. We specialize Hazan’s algorithm to our framework as follows (below, note that and ):

  • Using bounded trace constraints, , instead of unit trace is more meaningful for our setting. The following modified updates optimize over : , where is reset to the zero vector if the smallest eigenvalue is positive.

  • The gradient for our objective is: where and assembles the diagnal entries of its argument into a diagonal matrix.

  • Instead of using Hazan’s line search parameter , we do exact line search along the direction which leads to a closed form expression:

Adapting the analysis of Hazan’s algorithm in [14] to our setting, we get the following convergence rate (proof given in our Supplementary Material, section 6.3):

Proposition 3 (Convergence Rate for optimizing ).

Assume norm for in Eqn. 6. For , the iterate in Eqn. 12 satisfies where .

Remarks: An additional small smoothing term on is needed to make block coordinate descent provably convergent for our problem, as discussed in [5] (chapter 5) in the general context of re-weighted algorithms. Note that Propositions 2 and 5 offer convergence rates for the convex optimization sub-problems associated with optimizing and respectively keeping other variables fixed. These results strongly suggest that inexact solutions to subproblems may be quickly obtained in a few iterations.

3.2 Rademacher Complexity Results

Here, we complement our algorithms with statistical generalization bounds. The notion of Rademacher complexity is readily generalizable to vector-valued hypothesis spaces  [25]. Let be a class of functions where . Let be a vector of independent Rademacher variables, and similarly define the matrix The empirical Rademacher complexity of the vector-valued class is the function defined as


We now state bounds on the Rademacher complexity of hypothesis spaces considered by our algorithms, both for general matrix-valued kernel dictionaries as well as the special case of separable matrix-valued kernel dictionaries. When the output dimensionality is set to 1, our results essentially recover existing results in the scalar multiple kernel learning literature given in [12, 39, 21, 26]. Our bounds in part (B) and (C) in the Theorem below involve the same dependence on the number of kernels , and on (for norms) as given in [12], though there are slight differences in stated bounds since our hypothesis class is not exactly the same as that in [12]. In particular, for the case of (part C below), we obtain a dependence on the number of kernels which is known to be tight for the scalar case [12, 26]. Since this logarithmic dependence is rather mild, we can expect to learn effectively over a large dictionary even in the vector-valued setting.

Theorem 3.1.

Let . For , consider the hypothesis class

(A) For any , , the empirical Rademacher complexity of can be upper bounded as follows:

where . For the case of separable kernels, where such that and , we have

(B) If is such that , where , then

where . For separable kernels,

(C) If , so that , then

For separable kernels, we have

Due to space limitations, the full proof of this Theorem is provided our Supplementary Material (see Section 6.2, Theorem 2).

Using well-known results [7], these bounds on Rademacher complexity can be immediately turned into generalization bounds for our algorithms.

4 Empirical Studies

Statistical Benefits of Joint Input/Output Kernel Learning: We start with a small dataset of weekly log returns of 9 stocks from 2004, studied in [40, 33] in the context of linear multivariate regression with output covariance estimation techniques. We consider first-order vector autoregressive (VAR) models of the form where corresponds to the 9-dimensional vector of log-returns for the 9 companies at week and the function is estimated by solving Eqn. 6. Our experimental prototcol is exactly the same as [40, 33]: data is split evenly into a training and a test set and the regularizaton parameter is chosen by 10-fold cross-validation. All other parameters are left at their default values (i.e., ). We generated a dictionary of 117 Gaussian kernels defined by univariate Gaussian kernels on each of the 9 dimensions with 13 varying bandwidths. Results are shown in Table 1 where we compare our methods in terms of mean test RMSE against standard linear regression (OLS) and linear Lasso independently applied to each output coordinate, and the sparse multivariate regression with covariance estimation approaches of [33, 40], labeled MRCE and FES respectively. We see that joint input and output kernel learning (labeled IOKL) yields the best return prediction model reported to date on this dataset. As expected, it outperforms models obtained by leaving output kernel matrix fixed as the identity and only optimizing scalar kernels (IKL), or only optimizing the output kernel for fixed choices of scalar kernel (OKL). Of the 117 kernels, 13 have of the mass in the learnt scalar kernel combination.

WMT 0.98 0.42 0.41 0.40 0.43 0.43 0.44
XOM 0.39 0.31 0.31 0.29 0.32 0.31 0.29
GM 1.68 0.71 0.71 0.62 0.62 0.59 0.47
Ford 2.15 0.77 0.77 0.69 0.56 0.48 0.36
GE 0.58 0.45 0.45 0.41 0.41 0.40 0.37
COP 0.98 0.79 0.79 0.79 0.81 0.80 0.76
Ctgrp 0.65 0.66 0.62 0.59 0.66 0.62 0.58
IBM 0.62 0.49 0.49 0.51 0.47 0.50 0.42
AIG 1.93 1.88 1.88 1.74 1.94 1.87 1.79
Average 1.11 0.72 0.71 0.67 0.69 0.67 0.61
Table 1: VAR modeling on financial datasets.

Scalability and Numerical Behaviour: Our main interest here is to observe the classic tradeoff in numerical optimization between running few, but very expensive steps versus executing several cheap iterations. We use a 102-class image categorization dataset – Caltech-101 – which has been very well studied in the multiple kernel learning literature [13, 38, 15]. There are 30 training images per category for a total of 3060 training images, and 1355 test images. Targets are 102-dimensional class indicator vectors. We define a dictionary of kernels using scalar-valued kernels precomputed from visual features and made publically available by the authors of [38], for 3 training/test splits. From previous studies, it is well known that all underlying visual features contribute to object discrimination on this dataset and hence non-sparse multiple kernel learning with norms are more effective. We therefore set and without any further tuning, since their choice is not central to our main goals in this experiment. We vary the stopping criteria for our CG-based Sylvester solver () and the number of iterations () allowed in the Sparse SDP solver, for the and subproblems respectively. Note that the closed form updates (Eqn. 10) for norms take negligible time.

We compare our algorithms with an implementation in which each subproblem is solved exactly using an eigendecomposition based Sylvester solver for , and unconstrained updates for developed in [13], respectively. To make comparisons meaningful, we set to a large value so that the optimization over effectively corresponds to unconstrained minimization over the entire psd cone  . In Figure 12, we report the improvement in objective function and classification accuracy as a function of time (upto 1 hour). We see that insufficient progress is made in both extremes: when either the degree of inexactness is intolerable () or when subproblems are solved to very high precision (). Our solvers are far more efficient than eigendecomposition based implementation that takes an exorbitant amount of time per iteration for exact solutions. Approximate solvers at appropriate precision (e.g., ) make very rapid progress and return high accuracy models in just a few minutes. In fact, averaged over the three training/test splits, the classification accuracy obtained is which is highly competitive with state of the art results reported on this dataset, with the kernels used above. For example, [38] report [15] report and [13] report .

Figure 1: Objective function vs time
Figure 2: Accuracy vs time

4.1 Application: Non-linear Causal Inference

Here, our goal is to show how high-dimensional causal inference tasks can be naturally cast as sparse function estimation problems within our framework, leading to novel nonlinear extensions of Grouped Graphical Granger Causality techniques (see [35, 24] and references therein). In this setting, there is an interconnected system of distinct sources of high dimensional time series data which we denote as . We refer to these sources as “nodes”. The system is observed from time to , and the goal is to infer the causal relationships between the nodes. Let denote the adjacency matrix of the unknown causal interaction graph where implies that node causally influences node . In 1980, Clive Granger gave an operational definition for Causality:

Granger Causality [16]: A subset of nodes is said to causally influence node , if the past values of the time series collectively associated with the node subset is predictive of the future evolution of the time series associated with node , with statistical significance, and more so than the past values of alone.

A practical appeal of this definition is that it links causal inference to prediction, with the caveat that causality insights are bounded by the quality of the underlying predictive model. Furthermore, the prior knowledge that the underlying causal interactions are highly selective makes sparsity a meaningful prior to use. Prior work on using sparse modeling techniques to uncover causal graphs has focused on linear models [35, 24] while many, if not most, natural systems involve nonlinear interactions for which a functional notion of sparsity is more appropriate.

To apply our framework to such problems, we model the system as the problem of estimating nonlinear functions: , for , and where is a lag parameter. The dynamics of each node, , can be expressed as the sum of a set of vector-valued functions,


where the component , for all values of the index , only depends on the history of node , i.e., the observations . Each belongs to vector-valued RKHS whose kernel is . In other words, we set up a dictionary of separable matrix-valued kernels , where scalar kernels depend only on individual nodes alone; and the output matrix is associated with node currently being modeled. By imposing (functional) sparsity in the sum in Eqn. 14 using our framework, i.e. estimating by solving Eqn. 6, we can identify which subset of nodes are causal drivers (in the Granger sense) of the dynamics observed at node . The sparsity structure of then naturally induces a weighted causal graph :

where are the kernel weights estimated by our algorithm. Note that only if a component function associated with the history of node (for some ) is non-zero in the sum Eqn. 14). In addition to recovering the temporal causal interactions in this way, the estimated output kernel matrix associated with each captures within-source temporal dependencies. We now apply these ideas to a problem in computational biology.

Causal Inference of Gene Networks: We use time-course gene expression microarray data measured during the full life cycle of Drosophila melanogaster [2]. The expression levels of 4028 genes are simultaneously measured at 66 time points corresponding to various developmental stages. We extracted time series data for 2397 unique genes, and grouped them into 35 functional groups based on their gene ontologies. The goal is to infer causal interactions between functional groups (represented by multiple time series associated with genes in that group), as well obtain insight on within-group relationships between genes. We conducted four sets of experiments: with linear and nonlinear dictionaries (Gaussian kernels with 13 choices of bandwidths per group), and with or without output kernel learning. We use the parameters and time lag of without tuning. Figure 3 shows hold-out RMSE from the four experiments, for each of the 35 functional groups. Clearly, nonlinear models with both input and output kernel learning (labeled “nonlinear L” in Figure 3) give the best predictive performance implying greater relability in the implied causal graphs.

Figure 3: RMSE in predicing multiple time series

In consutation with a professional biologist, we analyzed the causal graphs uncovered by our approach (Figure 4). In particular the nonlinear causal model uncovered the centrality of a key cellular enzymatic activity, that of helicase, which was not recognized by the linear model. In contrast, the central nodes in the linear model are related to membranes (lipid binding and gtpase activity). Nucleic acid binding transcription factor activity and transcription factor binding are both related to the helicase activity, which is consistent with biological knowledge of them being tightly coupled. This was not captured in the linear model. Molecular chaperone functions, which connect ATPase activity and unfolded protein binding, was successfully identified by our model, while the linear model failed to recognize its relevance. It is less likely that unfolded protein and lipid activity should be linked as suggested by the linear model.

Figure 4: Causal Graphs: Linear (left) and Non-linear (right)

In addition, via output kernel matrix estimation (i.e., ), our model also provides insight on the conditional dependencies within genes, shown in Figure 5, for the unfolded protein binding group.

Figure 5: Interactions in unfolded protein binding group

5 Related work and Conclusion

Our work is the first to address efficient simultaneous estimation of both the input and output components of separable matrix-valued kernels. Two recent papers are closely related. In [13], the input scalar kernel is predefined and held fixed, while the output matrix is optimized in a block coordinate descent procedure. As discussed in Section 2, this approach involves solving Sylvester equations using eigendecomposition methods which is computationally very costly. In very recent work, concurrent with our work, [20] independently propose a multiple kernel learning framework for operator-valued kernels. Some elements of their work are similar to ours. However, they only optimize the scalar input kernel keeping the output matrix fixed. Their optimization strategy also includes eigen-decomposition, and Gauss-Siedel iterations for solving linear systems, while we exploit the quadratic nature of the objective function using CG and a fast sparse SDP solver to demonstrate the scalability benefits of inexact optimization. In addition, we provide generalization analysis in terms of bounds on the Rademacher complexity of our vector-valued hypothesis spaces, complementing analogous results in the scalar multiple kernel learning literature [12, 22]. We also outlined how our framework operationalizes nonlinear Granger Causality in high-dimensional time series modeling problems, which may be of independent interest. Future work includes extending our framework to other classes of vector-valued kernels [1, 10] and to functional data analysis problems [32].


  • [1] M. A. Alvarez, L. Rosasco, and N. Lawrence. Kernels for vector-valued functions: A review. Foundations and Trends in Machine Learning, 4(3):195–266, 2012.
  • [2] M. Arbeitman, E. Furlong, F. Imam, E. Johnson, B. Null, B. Baker, M. Krasnow, M. Scott, R. Davis, and K. White. Gene expression during the life cycle of drosophila melanogaster., 2002.
  • [3] A. Arnold, Y. Liu, and N. Abe. Temporal causal modeling with graphical granger methods. In KDD, 2007.
  • [4] N. Aronszajn. Theory of reproducing kernels. Transactions of the American Mathematical Society, 68:337–404, 1950.
  • [5] F. Bach, R. Jenatton, J. Mairal, and G. Obozinski. Optimization with sparsity inducing penalties. Foundations and Trends in Machine Learning, 2011.
  • [6] G. Bakir, T. Hofmann, B. Sch lkopf, A. Smola, B. Taskar, and S. E. Vishwanathan. Predicting Structured Data. MIT Press, 2007.
  • [7] P. Bartlett and S. Mendelson. Rademacher and gaussian complexities: Risk bounds and structural results. JMLR, 3:463–482, 2002.
  • [8] D. Bertsekas. Nonlinear Programming. Athena Scientific, Providence, RI, 2003.
  • [9] P. Buhlmann and S. V. D. Geer. Statistics for High Dimensional Data. Springer, 2010.
  • [10] A. Caponnetto, M. Pontil, C.Micchelli, and Y. Ying. Universal multi-task kernels. Journal of Machine Learning Research, 9:1615–1646, 2008.
  • [11] K. Clarkson. Coresets, sparse greedy approximation, and the frank-wolfe algorithm. ACM Transactions on Algorithms, 2010.
  • [12] C. Cortes, M. Mohri, and A. Rostamizadeh. Generalization bounds for learning kernels. In ICML, 2010.
  • [13] P. G. G. P. Francesco Dinuzzo, Cheng Soon Ong. Learning output kernels with block coordinate descent. In ICML, 2011.
  • [14] B. Gartner and J. Matousek. Approximation Algorithms and Semi-definite Programming. Springer-Verlag, Berlin Heidelberg, 2012.
  • [15] P. Gehler and S. Nowozin. On feature combination for multiclass object classification. In ICCV, 2009.
  • [16] C. Granger. Testing for causality: A personal viewpoint. Journal of Economic Dynamics and Control, 2:329–352, 1980.
  • [17] E. Hazan. Sparse approximate solutions to semi-definite programs. In LATIN, 2008.
  • [18] M. Jaggi. Sparse convex optimization methods for machine learning. PhD Thesis DISS ETH No. 20013.
  • [19] M. Jaggi and M. Sulovsky. A simple algorithm for nuclear norm regularized problems. In ICML, 2010.
  • [20] H. Kadri, A. Rakotomamonjy, F. Bach, and P. Preux. Multiple operator-valued kernel learning. In NIPS, 2012.
  • [21] S. Kakade, S. Shalev-Schwartz, and A. Tewari. Regularization techniques for learning with matrices. In Journal of Machine Learning Research, 2012.
  • [22] M. Kloft, U. Brefeld, S. Sonnenburg, and A. Zien. -norm multiple kernel learning. JMLR, 12:953–997, 2011.
  • [23] V. Koltchinskii and M. Yuan. Sparsity in multiple kernel learning. The Annals of Statistics, 38(6):3660–3695, 2010.
  • [24] A. C. Lozano, N. Abe, Y. Liu, and S. Rosset. Grouped graphical granger modeling methods for temporal causal modeling. In KDD, pages 577–586, 2009.
  • [25] A. Maurer. The rademacher complexity of linear transformation classes. In Proceedings of the Conference on Learning Theory (COLT), 2006.
  • [26] A. Maurer and M. Pontil. Structured sparsity and generalization. Journal of Machine Learning Research, 13:671–690, 2012.
  • [27] C. A. Micchelli and M. Pontil. On learning vector-valued functions. Neural Computation, 17:177–204, 2005.
  • [28] C. Michelli and M. Pontil. Learning the kernel function via regularization. JMLR, 6:1099–1125, 2005.
  • [29] C. Michelli and M. Pontil. Regularizers for structured sparsity. Advances in Computational Mathematics, 2011.
  • [30] A. Rahimi and B. Recht. Random features for large-scale kernel machines. In NIPS, 2007.
  • [31] A. Rakotomamonjy, F.Bach, S. Cano, and Y. Grandvalet. Simplemkl. Journal of Machine Learning Research, 9:2491–2521, 2008.
  • [32] J. Ramsay and B. W. Silverman. Functional Data Analysis. Springer, Providence, RI, 2005.
  • [33] A. J. Rothman, E. Levina, and J. Zhu. Sparse multivariate regression with covariance estimation. Journal of Computational and Graphical Statistics, 1:947–962, 2010.
  • [34] L. Schwartz. Sous-espaces hilbertiens d’espaces vectoriels topologiques et noyaux associés (noyaux reproduisants). J. Analyse Math., 13:115–256, 1964.
  • [35] A. Shojaie and G. Michailidis. Discovering graphical granger causality using the truncating lasso penalty. Bioinformatics, 26(18):i517–i523, Sept. 2010.
  • [36] A. Tikhonov. Regularization of incorrectly posed problems. Sov. Math. Dokl., 17:4:1035–1038, 1963.
  • [37] R. Tomioka and T. Suzuki. Regularization strategies and empirical bayesian learning for mkl. In NIPS Workshops 2010, 2010.
  • [38] A. Vedaldi, V. Gulshan, M. Varma, and A. Zisserman. Multiple kernels for object detection. In International Conference on Computer Vision, 2009.
  • [39] Y. Ying and C. Cambell. Generalization bounds for learning the kernel problem. In COLT, 2009.
  • [40] M. Yuan, A. Ekici, Z. Lu, and R. Monteiro. Dimension reduction and coefficient estimation in multivariate linear regression. Journal of the Royal Statistical Society Series B, 69:329–46, 2007.

6 Supplementary Material

6.1 Sums of operator-valued reproducing kernels

Proposition 1.

Let be operator-valued reproducing kernels of a dictionary of RKHS mapping , with respective norms .

(a) , with , , is the reproducing kernel of the Hilbert space

with norm given by


(b) If, furthermore, , , then

that is each admits a unique orthogonal decomposition

with norm


It suffices for us to consider .

(a) Consider the product space

with inner product

This inner product is well-defined by the assumption , , making a Hilbert space. Consider the map defined by

It is straightforward to show that its kernel is a closed subspace of , so that admits an orthogonal decomposition

Let be the restriction of onto , then is a bijection. Define the inner product on to be


This inner product is clearly well-defined, making a Hilbert space isomorphic to . Let be fixed, then its preimage in is

Since , is obviously the minimum norm element of . Thus


To show that is the reproducing kernel for , we need to show two properties:

(i) for all : this is clear.

(ii) The reproducing property:

To this end, let and . Then


We have by assumption:

so that . Since , this means that

Using the reproducing properties of and and rearranging, we get


Combining (18) and (19), we obtain the reproducing property

for all and as required.

(b) For the second part, let be fixed. Suppose that admits two decompositions


This means that . Thus ’s decomposition in and is unique, that is and , if and only if . In this case , , , and is isomorphic as a Hilbert space to , with

as we claimed. ∎

Proposition 2.

Let be operator-valued reproducing kernels of a dictionary of RKHS mapping , with respective norms . Let

Then for , the norm given by