###### Abstract

We extend kernelized matrix factorization with a fully Bayesian treatment and with an ability to work with multiple side information sources expressed as different kernels. Kernel functions have been introduced to matrix factorization to integrate side information about the rows and columns (e.g., objects and users in recommender systems), which is necessary for making out-of-matrix (i.e., cold start) predictions. We discuss specifically bipartite graph inference, where the output matrix is binary, but extensions to more general matrices are straightforward. We extend the state of the art in two key aspects: (i) A fully conjugate probabilistic formulation of the kernelized matrix factorization problem enables an efficient variational approximation, whereas fully Bayesian treatments are not computationally feasible in the earlier approaches. (ii) Multiple side information sources are included, treated as different kernels in multiple kernel learning that additionally reveals which side information sources are informative. Our method outperforms alternatives in predicting drug–protein interactions on two data sets. We then show that our framework can also be used for solving multilabel learning problems by considering samples and labels as the two domains where matrix factorization operates on. Our algorithm obtains the lowest Hamming loss values on 10 out of 14 multilabel classification data sets compared to five state-of-the-art multilabel learning algorithms.

Kernelized Bayesian Matrix Factorization

Mehmet Gönen {}^{1,2} mehmet.gonen@aalto.fi

Suleiman A. Khan {}^{1,2} suleiman.khan@aalto.fi

Samuel Kaski {}^{1,2,3} samuel.kaski@aalto.fi

{}^{1} Helsinki Institute for Information Technology HIIT

{}^{2} Department of Information and Computer Science, Aalto University

{}^{3} Department of Computer Science, University of Helsinki

Matrix factorization algorithms are very popular matrix completion methods (Srebro, 2004), successfully used in many applications including recommender systems and image inpainting. The main idea behind these methods is to factorize a partially observed data matrix by finding a low-dimensional latent representation for both its rows and columns. The prediction for a missing entry can be calculated as the inner product between the latent representations of the corresponding row and column. Salakhutdinov & Mnih (2008a; b) give a probabilistic formulation for matrix factorization and its fully Bayesian extension. However, these approaches are still incomplete in two major aspects: (i) It is not possible to integrate side information (e.g., social network or user profiles for recommender systems) into the model. (ii) It is not possible to make predictions for completely empty columns or rows (i.e., out-of-matrix prediction).

Algorithms for integrating side information into matrix factorization have been proposed earlier in the recommender systems literature. Ma et al. (2008) propose a probabilistic matrix factorization method that uses a social network and the rating matrix together to find better latent components. Shan & Banerjee (2010) integrate side information into a probabilistic matrix factorization model using topic models to generate latent components of the rated items (e.g., movies). Agarwal & Chen (2010) use a similar strategy to generate latent components of both users and items using topic models. Wang & Blei (2011) also combine matrix factorization and topic models for scientific article recommendation using textual content of articles as side information. All these algorithms are based on explicit feature representations; some are specific to count (e.g., text) data, and all are able to model linear dependencies. We use kernels to include nonlinear dependencies.

Lawrence & Urtasun (2009) formulate a nonlinear matrix factorization method by generating latent components via Gaussian processes without integrating any side information. Recently, Zhou et al. (2012) propose a kernelized probabilistic matrix factorization method using Gaussian process priors with covariance matrices on side information. However, with the modeling assumptions, only a maximum a posteriori (MAP) estimate for the latent components is computationally feasible, and even then the used gradient descent approach can be very slow. Furthermore, the method uses only a single kernel for each domain and needs test instances while training to be able to calculate their latent components (i.e., transductive learning).

In this paper, we focus on modeling interaction networks between two domains (e.g., biological networks between drugs and proteins), and estimating unknown interactions between objects from these two domains, which is also known as bipartite graph inference (Yamanishi, 2009). The standard pairwise kernel approaches are based on a kernel matrix over object pairs in the training set and are computationally expensive (Ben-Hur & Noble, 2005). There are also kernel-based (non-Bayesian) dimensionality reduction algorithms that map objects from both domains into the same subspace and perform prediction there (Yamanishi, 2009; Yamanishi et al., 2008; 2010).

In biological interaction networks, being composed of structured objects such as drugs and proteins, there exist several feature representations or similarity measures for the objects (Schölkopf et al., 2004). Instead of using a single specific kernel, we can combine multiple kernel functions to obtain a better similarity measure, which is known as multiple kernel learning (Gönen & Alpaydin, 2011).

We introduce a kernelized Bayesian matrix factorization method and give its details for the bipartite graph inference scenario; it can also be applied to other types of matrices with slight modifications. Our two main contributions are: (i) We formulate a novel fully conjugate probabilistic model that allows us to develop an efficient variational approximation scheme, the first fully Bayesian treatment which is still significantly faster than the earlier method for computing MAP point estimates (Zhou et al., 2012). (ii) The proposed method is able to integrate multiple side information sources by coupling matrix factorization with multiple kernel learning. We show the effectiveness of our approach on one toy data set and two drug–protein interaction data sets. We then show how our method can be used to solve multilabel learning problems and report classification results on 14 benchmark data sets.

We assume that the objects come from two domains \mathcal{X} and \mathcal{Z}. We are given two samples of independent and identically distributed training instances from each, denoted by \mathbf{X}=\{\boldsymbol{x}_{i}\in\mathcal{X}\}_{i=1}^{N_{\texttt{x}}} and \mathbf{Z}=\{\boldsymbol{z}_{j}\in\mathcal{Z}\}_{j=1}^{N_{\texttt{z}}}. In order to calculate similarities between the objects of the same domain, we have multiple kernel functions for each domain, namely, \{k_{\texttt{x},m}\colon\mathcal{X}\times\mathcal{X}\to\mathbb{R}\}_{m=1}^{P_{% \texttt{x}}} and \{k_{\texttt{z},n}\colon\mathcal{Z}\times\mathcal{Z}\to\mathbb{R}\}_{n=1}^{P_{% \texttt{z}}}. If the side information comes in the form of features instead of similarities, the set of kernels defined for a specific domain correspond to different notions of similarity on the same feature representation or may be using information coming from multiple feature representations (i.e., views).

The (i,j)th entry of the target label matrix \mathbf{Y}\in\{-1,+1\}^{N_{\texttt{x}}\times N_{\texttt{z}}} is

\displaystyle y_{j}^{i}=\begin{cases}+1& \textnormal{if $\boldsymbol{x}_{i}$ % and $\boldsymbol{z}_{j}$ are interacting,}\\ -1&\textnormal{otherwise.}\end{cases} |

The superscript indexes the rows and the subscript indexes the columns. The prediction task is to estimate unknown interactions for out-of-matrix objects, which is also known as cold start prediction in recommender systems.

Figure 1 illustrates the method we propose; it is composed of four main parts: (a) kernel-based nonlinear dimensionality reduction, (b) multiple kernel learning, (c) matrix factorization, and (d) binary classification. The first two kernel-based parts are applied to each domain separately and they are completely symmetric, hence we call them twins. One of the twins (i.e., the one that operates on domain \mathcal{Z}) is omitted for clarity. In this section, we briefly explain each part and introduce the notation used. In the following sections, we formulate a fully conjugate probabilistic model and derive a variational approximation.

Kernel-Based Nonlinear Dimensionality Reduction. In this part, we perform feature extraction using the input kernel matrices \{\mathbf{K}_{\texttt{x},m}\in\mathbb{R}^{N_{\texttt{x}}\times N_{\texttt{x}}}% \}_{m=1}^{P_{\texttt{x}}} and the common projection matrix \mathbf{A}_{\texttt{x}}\in\mathbb{R}^{N_{\texttt{x}}\times R} where R is the corresponding subspace dimensionality. We obtain the kernel-specific components \{\mathbf{G}_{\texttt{x},m}=\mathbf{A}_{\texttt{x}}^{\top}\mathbf{K}_{\texttt{% x},m}\}_{m=1}^{P_{\texttt{x}}} after the projection. The main idea is very similar to kernel principal component analysis or kernel Fisher discriminant analysis, where the columns of the projection matrix can be solved with eigendecompositions (Schölkopf & Smola, 2002). However, this solution strategy is not possible for the more complex model formulated here.

Multiple Kernel Learning. This part is responsible for combining the kernel-specific components linearly to obtain the composite components \mathbf{H}_{\texttt{x}}=\sum\nolimits_{m=1}^{P_{\texttt{x}}}e_{\texttt{x},m}% \mathbf{G}_{\texttt{x},m} where the kernel weights can take arbitrary values \boldsymbol{e}_{\texttt{x}}\in\mathbb{R}^{P_{\texttt{x}}}. If we have a single kernel function for a specific domain, we can safely ignore the composite components and the kernel weights, and use the single available kernel-specific components to represent the objects in that domain (Gönen, 2012a). The details of our method with a single kernel function for each domain are explained in the supplementary material.

Matrix Factorization. In this part, we propose to use the low-dimensional representations of objects in the unified subspace, namely, \mathbf{H}_{\texttt{x}} and \mathbf{H}_{\texttt{z}}, to calculate the predicted output matrix \mathbf{F}=\mathbf{H}_{\texttt{x}}^{\top}\mathbf{H}_{\texttt{z}}. This corresponds to factorizing the predicted outputs into two low-rank matrices.

Binary Classification. This part just assigns a class label to each object pair (\boldsymbol{x}_{i},\boldsymbol{z}_{j}) by looking at the sign of the predicted output f_{j}^{i} in the matrix factorization part. The proposed method can also be extended to handle other types of outputs (e.g., real-valued outputs used in recommender systems) by removing the binary classification part and directly generating the target outputs in the matrix factorization part. This corresponds to removing the predicted output matrix \mathbf{F} and generating target label matrix \mathbf{Y} directly from the composite components \mathbf{H}_{\texttt{x}} and \mathbf{H}_{\texttt{z}}. The details of our method for real-valued outputs are also given in the supplementary material.

For the method described in the previous section, we formulate a probabilistic model, called kernelized Bayesian matrix factorization with twin multiple kernel learning (KBMF2MKL), which has two key properties that enable us to perform efficient inference: (i) The kernel-specific and composite components are modeled explicitly by introducing them as latent variables. (ii) Kernel weights are assumed to be normally distributed without enforcing any constraints (e.g., non-negativity) on them. The reasons for introducing these two properties to our probabilistic model becomes clear when we explain our inference method.

Figure 2 gives the graphical model of KBMF2MKL with latent variables and their corresponding priors. There are some additions to the notation described earlier: The N_{\texttt{x}}\times R matrix of priors for the entries of the projection matrix \mathbf{A}_{\texttt{x}} is denoted by \mathbf{\Lambda}_{\texttt{x}}. The P_{\texttt{x}}\times 1 vector of priors for the kernel weights \boldsymbol{e}_{\texttt{x}} is denoted by \boldsymbol{\eta}_{\texttt{x}}. The standard deviations for the kernel-specific and composite components are represented as \sigma_{g} and \sigma_{h}, respectively; these hyper-parameters are not shown for clarity.

The distributional assumptions of the dimensionality reduction part are

\displaystyle\lambda_{\texttt{x},s}^{i} | \displaystyle\sim\mathcal{G}(\lambda_{\texttt{x},s}^{i};\alpha_{\lambda},\beta% _{\lambda}) | \displaystyle\;\;\;\;\;\;\forall(i,s) | ||

\displaystyle a_{\texttt{x},s}^{i}|\lambda_{\texttt{x},s}^{i} | \displaystyle\sim\mathcal{N}(a_{\texttt{x},s}^{i};0,(\lambda_{\texttt{x},s}^{i% })^{-1}) | \displaystyle\;\;\;\;\;\;\forall(i,s) | ||

\displaystyle g_{\texttt{x},m,i}^{s}|\boldsymbol{a}_{\texttt{x},s},\boldsymbol% {k}_{\texttt{x},m,i} | \displaystyle\sim\mathcal{N}(g_{\texttt{x},m,i}^{s};\boldsymbol{a}_{\texttt{x}% ,s}^{\top}\boldsymbol{k}_{\texttt{x},m,i},\sigma_{g}^{2}) | \displaystyle\;\;\;\;\;\;\forall(m,s,i) |

where \mathcal{N}(\cdot;\boldsymbol{\mu},\mathbf{\Sigma}) is the normal distribution with mean vector \boldsymbol{\mu} and covariance matrix \mathbf{\Sigma}, and \mathcal{G}(\cdot;\alpha,\beta) denotes the gamma distribution with shape parameter \alpha and scale parameter \beta. The multiple kernel learning part has the following distributional assumptions:

\displaystyle\eta_{\texttt{x},m} | \displaystyle\sim\mathcal{G}(\eta_{\texttt{x},m};\alpha_{\eta},\beta_{\eta}) | \displaystyle\;\;\forall m | ||

\displaystyle\ e_{\texttt{x},m}|\eta_{\texttt{x},m} | \displaystyle\sim\mathcal{N}(e_{\texttt{x},m};0,\eta_{\texttt{x},m}^{-1}) | \displaystyle\;\;\forall m | ||

\displaystyle h_{\texttt{x},i}^{s}|\{e_{\texttt{x},m},g_{\texttt{x},m,i}^{s}\}% _{m=1}^{P_{\texttt{x}}} | \displaystyle\sim\mathcal{N}\mathopen{}\mathclose{{}\left(h_{\texttt{x},i}^{s}% ;\sum\limits_{m=1}^{P_{\texttt{x}}}e_{\texttt{x},m}g_{\texttt{x},m,i}^{s},% \sigma_{h}^{2}}\right) | \displaystyle\;\;\forall(s,i) |

where kernel-level sparsity can be tuned by changing the hyper-parameters (\alpha_{\eta},\beta_{\eta}). Setting the gamma priors to induce sparsity, e.g., (\alpha_{\eta},\beta_{\eta})=(0.001,1000), produces results analogous to using the \ell_{1}-norm on the kernel weights, whereas using uninformative priors, e.g., (\alpha_{\eta},\beta_{\eta})=(1,1), resembles using the \ell_{2}-norm. The matrix factorization part calculates the predicted outputs using the inner products between the low-dimensional representations of the object pairs:

\displaystyle f_{j}^{i}|\boldsymbol{h}_{\texttt{x},i},\boldsymbol{h}_{\texttt{% z},j}\sim\mathcal{N}(f_{j}^{i};\boldsymbol{h}_{\texttt{x},i}^{\top}\boldsymbol% {h}_{\texttt{z},j},1)\;\;\;\;\;\;\forall(i,j) |

where the predicted outputs are introduced to speed up the inference procedures (Albert & Chib, 1993). The binary classification part forces the predicted outputs to have the same sign with their target labels:

\displaystyle y_{j}^{i}|f_{j}^{i}\sim\delta(f_{j}^{i}y_{j}^{i}>\nu)\;\;\;\;\;% \;\forall(i,j) |

where the margin parameter \nu is introduced to remove ambiguity in the scaling and to place a low-density region between two classes, similar to the margin idea in SVMs, which is generally used for semi-supervised learning (Lawrence & Jordan, 2005). Here, \delta(\cdot) is the Kronecker delta function that returns 1 if its argument is true and 0 otherwise.

Exact inference for the model is intractable and of the two readily available alternatives, Gibbs sampling and variational approximation, we choose the latter for computational efficiency. Variational methods optimize a lower bound on the marginal likelihood, which involves a factorized approximation of the posterior, to find the joint parameter distribution (Beal, 2003).

As short-hand notations, all hyper-parameters in the model are denoted by \boldsymbol{\zeta}=\{\alpha_{\eta},\beta_{\eta},\alpha_{\lambda},\beta_{% \lambda},\sigma_{g},\sigma_{h},\nu\}, all prior variables by \mathbf{\Xi}=\{\boldsymbol{\eta}_{\texttt{x}},\boldsymbol{\eta}_{\texttt{z}},% \mathbf{\Lambda}_{\texttt{x}},\mathbf{\Lambda}_{\texttt{z}}\}, and the remaining random variables by \mathbf{\Theta}=\{\mathbf{A}_{\texttt{x}},\mathbf{A}_{\texttt{z}},\boldsymbol{% e}_{\texttt{x}},\boldsymbol{e}_{\texttt{z}},\mathbf{F},\{\mathbf{G}_{\texttt{x% },m}\}_{m=1}^{P_{\texttt{x}}},\{\mathbf{G}_{\texttt{z},n}\}_{n=1}^{P_{\texttt{% z}}},\mathbf{H}_{\texttt{x}},\mathbf{H}_{\texttt{z}}\}. We omit the dependence on \boldsymbol{\zeta} for clarity. We factorize the variational approximation as

\displaystyle p(\mathbf{\Theta},\mathbf{\Xi}|\{\mathbf{K}_{\texttt{x},m}\}_{m=% 1}^{P_{\texttt{x}}},\{\mathbf{K}_{\texttt{z},n}\}_{n=1}^{P_{\texttt{z}}},% \mathbf{Y})\approx q(\mathbf{\Theta},\mathbf{\Xi})=\\ \displaystyle q(\mathbf{\Lambda}_{\texttt{x}})q(\mathbf{A}_{\texttt{x}})q(\{% \mathbf{G}_{\texttt{x},m}\}_{m=1}^{P_{\texttt{x}}})q(\boldsymbol{\eta}_{% \texttt{x}})q(\boldsymbol{e}_{\texttt{x}})q(\mathbf{H}_{\texttt{x}})\\ \displaystyle q(\mathbf{\Lambda}_{\texttt{z}})q(\mathbf{A}_{\texttt{z}})q(\{% \mathbf{G}_{\texttt{z},n}\}_{n=1}^{P_{\texttt{z}}})q(\boldsymbol{\eta}_{% \texttt{z}})q(\boldsymbol{e}_{\texttt{z}})q(\mathbf{H}_{\texttt{z}})q(\mathbf{% F}) |

and define each factor according to its full conditional:

\displaystyle q(\mathbf{\Lambda}_{\texttt{x}}) | \displaystyle=\prod\limits_{i=1}^{N_{\texttt{x}}}\prod\limits_{s=1}^{R}% \mathcal{G}(\lambda_{\texttt{x},s}^{i};\alpha(\lambda_{\texttt{x},s}^{i}),% \beta(\lambda_{\texttt{x},s}^{i})) | ||

\displaystyle q(\mathbf{A}_{\texttt{x}}) | \displaystyle=\prod\limits_{s=1}^{R}\mathcal{N}(\boldsymbol{a}_{\texttt{x},s};% \mu(\boldsymbol{a}_{\texttt{x},s}),\Sigma(\boldsymbol{a}_{\texttt{x},s})) | ||

\displaystyle q(\{\mathbf{G}_{\texttt{x},m}\}_{m=1}^{P_{\texttt{x}}}) | \displaystyle=\prod\limits_{m=1}^{P_{\texttt{x}}}\prod\limits_{i=1}^{N_{% \texttt{x}}}\mathcal{N}(\boldsymbol{g}_{\texttt{x},m,i};\mu(\boldsymbol{g}_{% \texttt{x},m,i}),\Sigma(\boldsymbol{g}_{\texttt{x},m,i})) | ||

\displaystyle q(\boldsymbol{\eta}_{\texttt{x}}) | \displaystyle=\prod\limits_{m=1}^{P_{\texttt{x}}}\mathcal{G}(\eta_{\texttt{x},% m};\alpha(\eta_{\texttt{x},m}),\beta(\eta_{\texttt{x},m})) | ||

\displaystyle q(\boldsymbol{e}_{\texttt{x}}) | \displaystyle=\mathcal{N}(\boldsymbol{e}_{\texttt{x}};\mu(\boldsymbol{e}_{% \texttt{x}}),\Sigma(\boldsymbol{e}_{\texttt{x}})) | ||

\displaystyle q(\mathbf{H}_{\texttt{x}}) | \displaystyle=\prod\limits_{i=1}^{N_{\texttt{x}}}\mathcal{N}(\boldsymbol{h}_{% \texttt{x},i};\mu(\boldsymbol{h}_{\texttt{x},i}),\Sigma(\boldsymbol{h}_{% \texttt{x},i})) | ||

\displaystyle q(\mathbf{F}) | \displaystyle=\prod\limits_{i=1}^{N_{\texttt{x}}}\prod\limits_{j=1}^{N_{% \texttt{z}}}\mathcal{TN}(f_{j}^{i};\mu(f_{j}^{i}),\Sigma(f_{j}^{i}),\rho(f_{j}% ^{i})) |

where \alpha(\cdot), \beta(\cdot), \mu(\cdot), and \Sigma(\cdot) denote the shape parameter, scale parameter, mean vector, and covariance matrix, respectively. Here, \mathcal{TN}(\cdot;\boldsymbol{\mu},\mathbf{\Sigma},\rho(\cdot)) denotes the truncated normal distribution with mean vector \boldsymbol{\mu}, covariance matrix \mathbf{\Sigma}, and truncation rule \rho(\cdot) such that \mathcal{TN}(\cdot;\boldsymbol{\mu},\mathbf{\Sigma},\rho(\cdot))\propto% \mathcal{N}(\cdot;\boldsymbol{\mu},\mathbf{\Sigma}) if \rho(\cdot) is true and \mathcal{TN}(\cdot;\boldsymbol{\mu},\mathbf{\Sigma},\rho(\cdot))=0 otherwise.

We can bound the marginal likelihood using Jensen’s inequality:

\displaystyle\log p(\mathbf{Y}|\{\mathbf{K}_{\texttt{x},m}\}_{m=1}^{P_{\texttt% {x}}},\{\mathbf{K}_{\texttt{z},n}\}_{n=1}^{P_{\texttt{z}}})\geq\\ \displaystyle\operatorname{E}_{q(\mathbf{\Theta},\mathbf{\Xi})}[\log p(\mathbf% {Y},\mathbf{\Theta},\mathbf{\Xi}|\{\mathbf{K}_{\texttt{x},m}\}_{m=1}^{P_{% \texttt{x}}},\{\mathbf{K}_{\texttt{z},n}\}_{n=1}^{P_{\texttt{z}}})]\\ \displaystyle-\operatorname{E}_{q(\mathbf{\Theta},\mathbf{\Xi})}[\log q(% \mathbf{\Theta},\mathbf{\Xi})] |

and optimize this bound by maximizing with respect to each factor separately until convergence. The approximate posterior distribution of a specific factor \boldsymbol{\tau} can be found as

\displaystyle q(\boldsymbol{\tau})\propto\\ \displaystyle\exp(\operatorname{E}_{q(\{\mathbf{\Theta},\mathbf{\Xi}\}% \backslash\boldsymbol{\tau})}[\log p(\mathbf{Y},\mathbf{\Theta},\mathbf{\Xi}|% \{\mathbf{K}_{\texttt{x},m}\}_{m=1}^{P_{\texttt{x}}},\{\mathbf{K}_{\texttt{z},% n}\}_{n=1}^{P_{\texttt{z}}})]). |

For our model, thanks to the conjugacy, the resulting approximate posterior distribution of each factor follows the same distribution as the corresponding factor. The variational updates for the approximate posterior distributions are given in the supplementary material.

Modeling Choices. Note that using the kernel-specific and composite components as latent variables in our probabilistic model introduces extra conditional independencies between the random variables and enables us to derive efficient update rules for the approximate posterior distributions. The other key property of our model is the assumption of normality of the kernel weights, which allows us to obtain a fully conjugate probabilistic model (Gönen, 2012b). In earlier Bayesian multiple kernel learning algorithms, the combined kernel has usually been defined as a convex sum of the input kernels, by assuming a Dirichlet distribution on the kernel weights (Girolami & Rogers, 2005; Damoulas & Girolami, 2008). As a consequence of the nonconjugacy between Dirichlet and normal distributions, they need to use a sampling strategy (e.g., importance sampling) to update the kernel weights when deriving variational approximations.

Convergence. The inference mechanism sequentially updates the approximate posterior distributions of the latent variables and the corresponding priors until convergence, which can be checked by monitoring the lower bound. The first term of the lower bound corresponds to the sum of exponential-form expectations of the distributions in the joint likelihood. The second term is the sum of negative entropies of the approximate posteriors in the ensemble. The only nonstandard distribution in these terms is the truncated normal distribution used for the predicted outputs, and the truncated normal distribution has a closed-form formula also for its entropy.

Computational Complexity. The most time-consuming operations of the update equations are covariance calculations because they need matrix inversions. The time complexity of the covariance updates for the projection matrices is \mathcal{O}(R\max(N_{\texttt{x}}^{3},N_{\texttt{z}}^{3})) and we can cache \sum\nolimits_{m=1}^{P_{\texttt{x}}}\mathbf{K}_{\texttt{x},m}\mathbf{K}_{% \texttt{x},m}^{\top} and \sum\nolimits_{n=1}^{P_{\texttt{z}}}\mathbf{K}_{\texttt{z},n}\mathbf{K}_{% \texttt{z},n}^{\top} before starting inference procedure to reduce the computational cost of these updates. The covariance updates for the kernel-specific components require inverting diagonal matrices. The time complexities of the covariance updates for the kernel weights and the composite components are \mathcal{O}(\max(P_{\texttt{x}}^{3},P_{\texttt{z}}^{3})). The other calculations in these updates can be done efficiently using matrix-matrix or matrix-vector multiplications. Finding the posterior expectations of the predicted outputs only requires evaluating the standardized normal cumulative distribution function and the standardized normal probability density. In summary, the total time complexity of each iteration in our variational approximation scheme is \mathcal{O}(R\max(N_{\texttt{x}}^{3},N_{\texttt{z}}^{3})+\max(P_{\texttt{x}}^{% 3},P_{\texttt{z}}^{3})), which makes the algorithm more efficient than standard pairwise kernel approaches (Ben-Hur & Noble, 2005) that require calculating a kernel matrix over pairs and training a kernel-based classifier using this kernel, resulting in \mathcal{O}(N_{\texttt{x}}^{3}N_{\texttt{z}}^{3}) complexity.

Prediction. Given a test pair (\boldsymbol{x}_{\star},\boldsymbol{z}_{*}), we want to predict the corresponding score f_{*}^{\star} or target label y_{*}^{\star}. We first replace posterior distributions of \mathbf{A}_{\texttt{x}}, \mathbf{A}_{\texttt{z}}, \boldsymbol{e}_{\texttt{x}}, and \boldsymbol{e}_{\texttt{z}} with their approximate posterior distributions q(\mathbf{A}_{\texttt{x}}), q(\mathbf{A}_{\texttt{z}}), q(\boldsymbol{e}_{\texttt{x}}), and q(\boldsymbol{e}_{\texttt{z}}). Using the approximate distributions, we obtain the predictive distributions of the kernel-specific and composite components. The predictive distribution of the target label can finally be formulated as

\displaystyle p(y_{*}^{\star}=+1|\{\boldsymbol{k}_{\texttt{x},m,\star},\mathbf% {K}_{\texttt{x},m}\}_{m=1}^{P_{\texttt{x}}},\{\boldsymbol{k}_{\texttt{z},n,*},% \mathbf{K}_{\texttt{z},n}\}_{n=1}^{P_{\texttt{z}}},\mathbf{Y})=\\ \displaystyle(\mathcal{Z}_{*}^{\star})^{-1}\Phi\mathopen{}\mathclose{{}\left(% \dfrac{\mu(f_{*}^{\star})-\nu}{\Sigma(f_{*}^{\star})}}\right) |

where \mathcal{Z}_{*}^{\star} is the normalization coefficient calculated for the test pair and \Phi(\cdot) is the standardized normal cumulative distribution function.

We first run our method on a toy data set to illustrate its kernel learning capability. We then test its performance in a real-life application with experiments on two drug–protein interaction data sets. One of them is a standard data set with a single view for each domain and the other one is a larger multiview data set we have collected. We also perform experiments on 14 benchmark multilabel classification data sets in order to show the suitability of our matrix factorization framework with side information in a nonstandard application scenario. Our Matlab implementations are available at http://research.ics.aalto.fi/mi/software/kbmf/.

We create a toy data set consisting of samples from two domains and real-valued outputs for object pairs. The data generation process is:

\displaystyle x_{i}^{m} | \displaystyle\sim\mathcal{N}(x_{i}^{m};0,1) | \displaystyle\;\;\;\;\;\;\forall(m,i) | ||

\displaystyle z_{j}^{n} | \displaystyle\sim\mathcal{N}(z_{j}^{n};0,1) | \displaystyle\;\;\;\;\;\;\forall(n,j) | ||

\displaystyle y_{j}^{i}|\boldsymbol{x}_{i},\boldsymbol{z}_{j} | \displaystyle\sim\mathcal{N}(y_{j}^{i};x_{i}^{1}z_{j}^{3}+x_{i}^{4}z_{j}^{8}+x% _{i}^{7}z_{j}^{10},1) | \displaystyle\;\;\;\;\;\;\forall(i,j) |

where (N_{\texttt{x}},N_{\texttt{z}})=(40,60), the samples from \mathcal{X} and \mathcal{Z} are generated from 15- and 10-dimensional isotropic normal distributions with unit variance (i.e., m\in\{1,\dots,15\} and n\in\{1,\dots,10\}), respectively, and the target outputs are generated using only three features from each domain. Note that this data set has not been generated from our probabilistic model.

In order to have multiple kernel functions for each domain, we calculate a separate linear kernel for each feature of the data points, i.e., (P_{\texttt{x}},P_{\texttt{z}})=(15,10). We then learn our model, intended to work as a predictive model that identifies the relevant features for the prediction task and has a good generalization performance. We use uninformative priors for the projection matrices and the kernel weights by setting (\alpha_{\eta},\beta_{\eta},\alpha_{\lambda},\beta_{\lambda})=(1,1,1,1). The standard deviations are set to (\sigma_{g},\sigma_{h},\sigma_{y})=(0.1,0.1,1), where \sigma_{y} denotes the noise level used for the target outputs. The subspace dimensionality is arbitrarily set to five (i.e., R=5).

Figure 3 shows the found posterior means of the kernel weights. We see that our method correctly identifies the relevant features for each domain (i.e., the first, fourth, and seventh features for \mathcal{X} and the third, eighth, and tenth features for \mathcal{Z}). The root mean square error between the target and predicted outputs is 0.9865 in accordance with the level of noise added.

As the first case study, we analyze a drug–protein interaction network of humans, involving enzymes in particular. This drug–protein interaction network contains 445 drugs, 664 proteins, and 2926 experimentally validated interactions between them. The data set consists of the chemical similarity matrix between drugs, the genomic similarity matrix between proteins, and the target matrix of known interactions provided by Yamanishi et al. (2008).

We compare one baseline and three matrix factorization methods: (i) Baseline simply calculates the target output averages over each column or row as the predictions, (ii) kernelized probabilistic matrix factorization (KPMF) method of Zhou et al. (2012) with real-valued outputs, (iii) our kernelized Bayesian matrix factorization (KBMF) method with real-valued outputs, and (iv) KBMF method with binary outputs.

Our experimental methodology is as follows: For KPMF, the standard deviation \sigma_{y} is set to one. For both KBMF variants, we use uninformative priors for the projection matrices and the kernel weights, i.e., (\alpha_{\eta},\beta_{\eta},\alpha_{\lambda},\beta_{\lambda})=(1,1,1,1), and the standard deviations (\sigma_{g},\sigma_{h}) are set to (0.1,0.1). For KBMF with real-valued outputs, the standard deviation \sigma_{y} is set to one. For KBMF with binary outputs, the margin parameter \nu is arbitrarily set to one. We perform simulations with eight different numbers of components, i.e., R\in\{5,10,\dots,40\}. We run five replications of five-fold cross validation over drugs and report the average area under ROC curve (AUC) over the 25 results as the performance measure.

In the results, C and G mark the chemical similarity between drugs and the genomic similarity between proteins, respectively, whereas N marks the similarity between proteins calculated from the interaction network and it is defined as the ratio between (i) the number of drugs that are interacting with both proteins and (ii) the number of drugs that are interacting with at least one of the proteins, (i.e., Jaccard index).

The results in Figure 4 reveal that KPMF is above the baseline for more than 5 components, and both variants of KBMF for all component numbers. Both variants of our new KBMF outperform the earlier KPMF for all types of inputs, where the differences between KPMF and KBMF are statistically significant (paired t-test, p<0.01). The difference is not due to KPMF having been introduced only for real-valued outputs, as even the real-output variant of KBMF is better. The difference is not due to the inability of the current version of KPMF to handle multiple data views either, as the single-kernel KBMF outperforms it. Hence the differences in the performance are due to the differences in the inference: MAP point estimates versus fully Bayesian inference. The best results are obtained with the binary-output KBMF when using all data sources.

Note that when we combine the genomic and network similarities between proteins using our method, the resulting similarity measure for proteins is better than those of single-kernel scenarios, leading to better prediction performance. This shows that when we have multiple side information sources about the objects, integrating them into the matrix factorization model in a principled way improves the results.

We study an additional drug–protein interaction network of humans, containing 855 drugs, 800 proteins, and 4659 experimentally validated interactions between them, extracted from the drugs and proteins of the data set provided by Khan et al. (2012). We have two views consisting of two standard 3D chemical structure descriptors for drugs, namely, 1120-dimensional Amanda (Duran et al., 2008) and 76-dimensional VolSurf (Cruciani et al., 2000). In order to calculate the similarity between drugs, we use a Gaussian kernel whose width parameter is selected as the square root of the dimensionality of the data points.

We repeat the same experimental procedure as in the previous experiment with one minor change only. We perform simulations with 16 different numbers of components, i.e., R\in\{5,10,\dots,80\}, due to the larger size of the interaction network.

We compare four different ways of including the drug property data. Amanda and VolSurf correspond to using a single view when calculating the kernel between drugs. Early corresponds to concatenating the two views, which is known as early combination (Schölkopf et al., 2004), before calculating the kernel between drugs. MKL corresponds to calculating two different kernels between drugs and combining them with our kernel combination approach.

The overall ordering of the results of the different matrix factorization methods is the same as in the previous case study (Figure 5). The results of KBMF with real-valued outputs, which are omitted not to clutter the figure, are in between KPMF and KBMF with binary outputs. The KPMF outperforms Baseline after 20 components, whereas KBMF is consistently better (by at least four percentage units) than KPMF for all single-kernel scenarios and the differences are statistically significant (paired t-test, p<0.01). KBMF with five components is already better than Baseline for all scenarios.

For KBMF with binary outputs, we see that Amanda and VolSurf are significantly better than Baseline and obtain similar prediction performances. Early outperforms Amanda and VolSurf with a slight margin, whereas MKL obtains consistently better results than all the other scenarios after five components.

Our method can also be interpreted as a metric learning algorithm since each kernel function can be converted into a distance metric. We test this property on the task of finding or retrieving drugs with similar functions. The idea is that since the targets are centrally important for the action mechanisms of drugs, a metric useful for predicting targets could be useful for retrieval of drugs as well. As the ground truth for relevance we use a standard therapeutic classification of the drugs according to the organ or system on which they act and/or their chemical characteristics (not used during learning); drugs having the same class are considered relevant. Figure 6 gives the precision at top-k retrieved drugs, when each drug in turn is used as the query and the rest of the 855 drugs are retrieved in the order of similarity according to the learned metric. Early is better than Amanda and VolSurf, and MKL is the best. This shows that our method is able to learn a kernel function between drugs that is better for retrieval than the kernels either on single or concatenated views.

Data Set | N_{\texttt{train}} | N_{\texttt{test}} | D | L | KBMF | Zhang’s | ML-KNN | RML | Tang’s | RankSVM |

Emotions | 391 | 202 | 72 | 6 | 0.176 | 0.195 | 0.202 | 0.241 | 0.240 | 0.234 |

Scene | 1211 | 1196 | 294 | 6 | 0.086 | 0.089 | 0.099 | 0.109 | 0.130 | 0.127 |

Yeast | 1500 | 917 | 103 | 14 | 0.189 | 0.196 | 0.195 | 0.204 | 0.190 | 0.201 |

Arts | 2000 | 3000 | 462 | 26 | 0.057 | 0.057 | 0.061 | 0.058 | 0.094 | 0.063 |

Business | 2000 | 3000 | 681 | 33 | 0.025 | 0.026 | 0.027 | 0.032 | 0.092 | 0.027 |

Computers | 2000 | 3000 | 640 | 21 | 0.036 | 0.036 | 0.041 | 0.037 | 0.097 | 0.042 |

Education | 2000 | 3000 | 606 | 22 | 0.039 | 0.038 | 0.039 | 0.050 | 0.038 | 0.048 |

Entertainment | 2000 | 3000 | 743 | 40 | 0.046 | 0.055 | 0.063 | 0.059 | 0.053 | 0.062 |

Health | 2000 | 3000 | 636 | 27 | 0.036 | 0.037 | 0.047 | 0.041 | 0.222 | 0.042 |

Recreation | 2000 | 3000 | 438 | 30 | 0.044 | 0.057 | 0.062 | 0.057 | 0.057 | 0.064 |

Reference | 2000 | 3000 | 550 | 33 | 0.027 | 0.025 | 0.032 | 0.027 | 0.087 | 0.034 |

Science | 2000 | 3000 | 612 | 32 | 0.032 | 0.031 | 0.033 | 0.051 | 0.057 | 0.038 |

Social | 2000 | 3000 | 793 | 33 | 0.022 | 0.021 | 0.022 | 0.101 | 0.072 | 0.027 |

Society | 2000 | 3000 | 1047 | 39 | 0.038 | 0.052 | 0.054 | 0.096 | 0.056 | 0.060 |

Average Rank | 1.536 | 1.964 | 3.750 | 4.464 | 4.607 | 4.679 |

In multilabel learning, each sample is associated with a set of labels instead of just a single label. Multilabel classification can be cast into our formulation as follows: Samples and labels are assumed to be from domains \mathcal{X} and \mathcal{Z}, respectively. Class membership matrix corresponds to target label matrix \mathbf{Y} in our model. Our method allows us to integrate side information about samples and labels in the form of kernel matrices. For example, we can exploit the correlation between labels by integrating a kernel calculated over them into the model.

We compare our algorithm KBMF with five state-of-the-art multilabel learning algorithms, namely, (i) RankSVM (Elisseeff & Weston, 2002), (ii) ML-KNN (Zhang & Zhou, 2007), (iii) Tang’s (Tang et al., 2009), (iv) RML (Petterson & Caetano, 2010), and (v) Zhang’s (Zhang et al., 2012). We perform experiments on 14 benchmark multilabel classification data sets whose characteristics are given in Table 1.

For KBMF, the similarities between samples are measured with five different Gaussian kernels whose widths are selected as \sqrt{D/4}, \sqrt{D/2}, \sqrt{D}, \sqrt{2D}, and \sqrt{4D}, whereas the similarity between labels is measured with the Jaccard index over the labels of training samples. The number of components R is selected from \{1,\dots,\min(L,15)\} according to training performance.

Table 1 reports the classification results on multilabel data sets. KBMF obtains the best results on 10 out of 14 data sets, whereas it obtains the second best results on the remaining four data sets. These results validate the suitability of our framework to multilabel learning.

We introduce a kernelized Bayesian matrix factorization method that can make use of multiple side information sources about the objects (both rows and columns) and be applied in various scenarios including recommender systems, interaction network modeling, and multilabel learning. Our two main contributions are: (i) formulating an efficient variational approximation scheme for inference with the help of a novel fully conjugate probabilistic model and (ii) coupling matrix factorization with multiple kernel learning to integrate multiple side information sources into the model. In contrast to the earlier kernelized probabilistic matrix factorization method of Zhou et al. (2012), for our probabilistic model, it is possible to derive a computationally feasible fully Bayesian treatment. We illustrate the usefulness of the method on one toy data set, two molecular biological data sets, and 14 multilabel classification data sets.

An interesting topic for future research is to optimize the dimensionality of the latent components using a Bayesian model selection procedure. For example, we can share the same set of precision priors for the projection matrices and determine the dimensionality using automatic relevance determination (Neal, 1996).

Acknowledgments. This work was financially supported by the Academy of Finland (Finnish Centre of Excellence in Computational Inference Research COIN, grant no 251170; Computational Modeling of the Biological Effects of Chemicals, grant no 140057) and the Finnish Graduate School in Computational Sciences (FICS).

## References

- Agarwal & Chen (2010) Agarwal, D. and Chen, B.-C. fLDA: Matrix factorization through latent Dirichlet allocation. In Proceedings of WSDM, pp. 91–100, 2010.
- Albert & Chib (1993) Albert, J. H. and Chib, S. Bayesian analysis of binary and polychotomous response data. Journal of the American Statistical Association, 88(422):669–679, 1993.
- Beal (2003) Beal, M. J. Variational Algorithms for Approximate Bayesian Inference. PhD thesis, The Gatsby Computational Neuroscience Unit, University College London, 2003.
- Ben-Hur & Noble (2005) Ben-Hur, A. and Noble, W. S. Kernel methods for predicting protein–protein interactions. Bioinformatics, 21(Suppl. 1):i38–i46, 2005.
- Cruciani et al. (2000) Cruciani, G., Pastor, M., and Guba, W. VolSurf: A new tool for the pharmacokinetic optimization of lead compounds. European Journal of Pharmaceutical Sciences, 11(Suppl. 2):S29–S39, 2000.
- Damoulas & Girolami (2008) Damoulas, T. and Girolami, M. A. Probabilistic multi-class multi-kernel learning: On protein fold recognition and remote homology detection. Bioinformatics, 24(10):1264–1270, 2008.
- Duran et al. (2008) Duran, A., Martinez, G. C., and Pastor, M. Development and validation of AMANDA, a new algorithm for selecting highly relevant regions in Molecular Interaction Fields. Journal of Chemical Information and Modeling, 48(9):1813–1823, 2008.
- Elisseeff & Weston (2002) Elisseeff, A. and Weston, J. A kernel method for multi-labelled classification. In Proceedings of NIPS, pp. 681–687, 2002.
- Girolami & Rogers (2005) Girolami, M. and Rogers, S. Hierarchic Bayesian models for kernel learning. In Proceedings of ICML, pp. 241–248, 2005.
- Gönen (2012a) Gönen, M. Predicting drug–target interactions from chemical and genomic kernels using Bayesian matrix factorization. Bioinformatics, 28(18):2304–2310, 2012a.
- Gönen (2012b) Gönen, M. Bayesian efficient multiple kernel learning. In Proceedings of ICML, pp. 1–8, 2012b.
- Gönen & Alpaydin (2011) Gönen, M. and Alpaydin, E. Multiple kernel learning algorithms. Journal of Machine Learning Research, 12(Jul):2211–2268, 2011.
- Khan et al. (2012) Khan, S. A., Faisal, A., Mpindi, J. P., Parkkinen, J. A., Kalliokoski, T., Poso, A., Kallioniemi, O. P., Wennerberg, K., and Kaski, S. Comprehensive data-driven analysis of the impact of chemoinformatic structure on the genome-wide biological response profiles of cancer cells to 1159 drugs. BMC Bioinformatics, 13(112), 2012.
- Lawrence & Jordan (2005) Lawrence, N. D. and Jordan, M. I. Semi-supervised learning via Gaussian processes. In Proceedings of NIPS, pp. 753–760, 2005.
- Lawrence & Urtasun (2009) Lawrence, N. D. and Urtasun, R. Non-linear matrix factorization with Gaussian processes. In Proceedings of ICML, pp. 601–608, 2009.
- Ma et al. (2008) Ma, H., Yang, H., Lyu, M. R., and King, I. SoRec: Social recommendation using probabilistic matrix factorization. In Proceedings of CIKM, pp. 931–940, 2008.
- Neal (1996) Neal, R. M. Bayesian Learning for Neural Networks. Springer, New York, NY, 1996.
- Petterson & Caetano (2010) Petterson, J. and Caetano, T. Reverse multi-label learning. In Proceedings of NIPS, pp. 1912–1920, 2010.
- Salakhutdinov & Mnih (2008a) Salakhutdinov, R. and Mnih, A. Probabilistic matrix factorization. In Proceedings of NIPS, pp. 1257–1264, 2008a.
- Salakhutdinov & Mnih (2008b) Salakhutdinov, R. and Mnih, A. Bayesian probabilistic matrix factorization using Markov chain Monte Carlo. In Proceedings of ICML, pp. 880–887, 2008b.
- Schölkopf & Smola (2002) Schölkopf, B. and Smola, A. J. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, Cambridge, MA, 2002.
- Schölkopf et al. (2004) Schölkopf, B., Tsuda, K., and Vert, J.-P. (eds.). Kernel Methods in Computational Biology. MIT Press, Cambridge, MA, 2004.
- Shan & Banerjee (2010) Shan, H. and Banerjee, A. Generalized probabilistic matrix factorizations for collaborative filtering. In Proceedings of ICDM, pp. 1025–1030, 2010.
- Srebro (2004) Srebro, N. Learning with Matrix Factorizations. PhD thesis, Massachusetts Institute of Technology, 2004.
- Tang et al. (2009) Tang, L., Chen, J., and Ye, J. On multiple kernel learning with multiple labels. In Proceedings of IJCAI, pp. 1255–1260, 2009.
- Wang & Blei (2011) Wang, C. and Blei, D. M. Collaborative topic modeling for recommending scientific articles. In Proceedings of KDD, pp. 448–456, 2011.
- Yamanishi (2009) Yamanishi, Y. Supervised bipartite graph inference. In Proceedings of NIPS, pp. 1841–1848, 2009.
- Yamanishi et al. (2008) Yamanishi, Y., Araki, M., Gutteridge, A., Honda, W., and Kaneisha, M. Prediction of drug–target interaction networks from the integration of chemical and genomic spaces. Bioinformatics, 24:i232–i240, 2008.
- Yamanishi et al. (2010) Yamanishi, Y., Kotera, M., Kanesiha, M., and Goto, S. Drug–target interaction prediction from chemical, genomic and pharmacological data in an integrated framework. Bioinformatics, 26:i246–i254, 2010.
- Zhang & Zhou (2007) Zhang, M.-L. and Zhou, Z.-H. ML-KNN: A lazy learning approach to multi-label learning. Pattern Recognition, 40(7):2038–2048, 2007.
- Zhang et al. (2012) Zhang, W., Xue, X., Fan, J., Huang, X., Wu, B., and Liu, M. Multi-kernel multi-label learning with max-margin concept network. In Proceedings of IJCAI, pp. 1615–1620, 2012.
- Zhou et al. (2012) Zhou, T., Shan, H., Banerjee, A., and Sapiro, G. Kernelized probabilistic matrix factorization: Exploiting graphs and side information. In Proceedings of SDM, pp. 403–414, 2012.

Supplementary Material In this supplementary material, we provide details about our method and its two variants: (id1) variational updates for the model presented in the main paper, (id1) variant for using a single kernel function on each domain instead of the multiple kernels, and (id1) variant for predicting real-valued outputs instead of the binary outputs.

The approximate posterior distributions of the dimensionality reduction part can be found as

\displaystyle q(\mathbf{\Lambda}_{\texttt{x}}) | \displaystyle=\prod\limits_{i=1}^{N_{\texttt{x}}}\prod\limits_{s=1}^{R}% \mathcal{G}\mathopen{}\mathclose{{}\left(\lambda_{\texttt{x},s}^{i};\alpha_{% \lambda}+\dfrac{1}{2},\mathopen{}\mathclose{{}\left(\dfrac{1}{\beta_{\lambda}}% +\dfrac{\widetilde{(a_{\texttt{x},s}^{i})^{2}}}{2}}\right)^{-1}}\right) | ||

\displaystyle q(\mathbf{A}_{\texttt{x}}) | \displaystyle=\prod\limits_{s=1}^{R}\mathcal{N}\mathopen{}\mathclose{{}\left(% \boldsymbol{a}_{\texttt{x},s};\Sigma(\boldsymbol{a}_{\texttt{x},s})\sum\limits% _{m=1}^{P_{\texttt{x}}}\dfrac{\mathbf{K}_{\texttt{x},m}\widetilde{(\boldsymbol% {g}_{\texttt{x},m}^{s})^{\top}}}{\sigma_{g}^{2}},\mathopen{}\mathclose{{}\left% (\operatorname{diag}(\widetilde{\boldsymbol{\lambda}_{\texttt{x}}^{s}})+\sum% \limits_{m=1}^{P_{\texttt{x}}}\dfrac{\mathbf{K}_{\texttt{x},m}\mathbf{K}_{% \texttt{x},m}^{\top}}{\sigma_{g}^{2}}}\right)^{-1}}\right) |

where the tilde notation denotes the posterior expectations as usual, i.e., \widetilde{f(\boldsymbol{\tau})}=\operatorname{E}_{q(\boldsymbol{\tau})}[f(% \boldsymbol{\tau})].

The kernel-specific components have the following approximate posterior distribution:

\displaystyle q(\{\mathbf{G}_{\texttt{x},m}\}_{m=1}^{P_{\texttt{x}}})=\prod% \limits_{m=1}^{P_{\texttt{x}}}\prod\limits_{i=1}^{N_{\texttt{x}}}\mathcal{N}% \mathopen{}\mathclose{{}\left(\boldsymbol{g}_{\texttt{x},m,i};\Sigma(% \boldsymbol{g}_{\texttt{x},m,i})\mathopen{}\mathclose{{}\left(\dfrac{% \widetilde{\mathbf{A}_{\texttt{x}}^{\top}}\boldsymbol{k}_{\texttt{x},m,i}}{% \sigma_{g}^{2}}+\dfrac{\widetilde{e_{\texttt{x},m}}\widetilde{\boldsymbol{h}_{% \texttt{x},i}}}{\sigma_{h}^{2}}-\sum\limits_{o\neq m}\dfrac{\widetilde{e_{% \texttt{x},m}e_{\texttt{x},o}}\widetilde{\boldsymbol{g}_{\texttt{x},o,i}}}{% \sigma_{h}^{2}}}\right),\mathopen{}\mathclose{{}\left(\dfrac{\mathbf{I}}{% \sigma_{g}^{2}}+\dfrac{\widetilde{e_{\texttt{x},m}^{2}}\mathbf{I}}{\sigma_{h}^% {2}}}\right)^{-1}}\right) |

where the mean and covariance parameters are affected by the kernel weights, the composite components, and other kernel-specific components in addition to the projection matrix and the corresponding kernel matrix.

The approximate posterior distributions of the multiple kernel learning part can be found as

\displaystyle q(\boldsymbol{\eta}_{\texttt{x}}) | \displaystyle=\prod\limits_{m=1}^{P_{\texttt{x}}}\mathcal{G}\mathopen{}% \mathclose{{}\left(\eta_{\texttt{x},m};\alpha_{\eta}+\dfrac{1}{2},\mathopen{}% \mathclose{{}\left(\dfrac{1}{\beta_{\eta}}+\dfrac{\widetilde{e_{\texttt{x},m}^% {2}}}{2}}\right)^{-1}}\right) | ||

\displaystyle q(\boldsymbol{e}_{\texttt{x}}) | \displaystyle=\mathcal{N}\mathopen{}\mathclose{{}\left(\boldsymbol{e}_{\texttt% {x}};\Sigma(\boldsymbol{e}_{\texttt{x}})\begin{bmatrix}\dfrac{\widetilde{% \mathbf{G}_{\texttt{x},m}^{\top}}\widetilde{\mathbf{H}_{\texttt{x}}}}{\sigma_{% h}^{2}}\end{bmatrix}_{m=1}^{P_{\texttt{x}}},\mathopen{}\mathclose{{}\left(% \operatorname{diag}(\widetilde{\boldsymbol{\eta}_{\texttt{x}}})+\begin{bmatrix% }\dfrac{\widetilde{\mathbf{G}_{\texttt{x},m}^{\top}\mathbf{G}_{\texttt{x},o}}}% {\sigma_{h}^{2}}\end{bmatrix}_{m=1,o=1}^{P_{\texttt{x}},P_{\texttt{x}}}}\right% )^{-1}}\right) |

where the mean and covariance parameters of the kernel weights are calculated using the kernel-specific and composite components.

The composite components have the following approximate posterior distribution:

\displaystyle q(\mathbf{H}_{\texttt{x}})=\prod\limits_{i=1}^{N_{\texttt{x}}}% \mathcal{N}\mathopen{}\mathclose{{}\left(\boldsymbol{h}_{\texttt{x},i};\Sigma(% \boldsymbol{h}_{\texttt{x},i})\mathopen{}\mathclose{{}\left(\sum\limits_{m=1}^% {P_{\texttt{x}}}\dfrac{\widetilde{e_{\texttt{x},m}}\widetilde{\boldsymbol{g}_{% \texttt{x},m,i}}}{\sigma_{h}^{2}}+\widetilde{\mathbf{H}_{\texttt{z}}}% \widetilde{(\boldsymbol{f}^{i})^{\top}}}\right),\mathopen{}\mathclose{{}\left(% \dfrac{\mathbf{I}}{\sigma_{h}^{2}}+\widetilde{\mathbf{H}_{\texttt{z}}\mathbf{H% }_{\texttt{z}}^{\top}}}\right)^{-1}}\right) |

where it can be seen that the inference transfers information between the two domains. Note that the composite components of each domain are the only random variables that have an effect on the other domain, i.e., only the \mathbf{H}_{\texttt{z}} variables of domain \mathcal{Z} are used when updating the random variables of the domain \mathcal{X}.

The approximate posterior distribution of the predicted outputs is a product of truncated normals:

\displaystyle q(\mathbf{F}) | \displaystyle=\prod\limits_{i=1}^{N_{\texttt{x}}}\prod\limits_{j=1}^{N_{% \texttt{z}}}\mathcal{TN}(f_{j}^{i};\widetilde{\boldsymbol{h}_{\texttt{x},i}^{% \top}}\widetilde{\boldsymbol{h}_{\texttt{z},j}},1,f_{j}^{i}y_{j}^{i}>\nu). |

We need to find the posterior expectation of \mathbf{F} to update the approximate posterior distributions of the composite components. Fortunately, the truncated normal distribution has a closed-form formula for its expectation.

We formulate a simplified probabilistic model, called kernelized Bayesian matrix factorization with twin kernels (KBMF2K), for the case with a single kernel function for each domain. Figure 7 shows the graphical model of KBMF2K with latent variables and their corresponding priors.

The distributional assumptions of the simplified model are

\displaystyle\lambda_{\texttt{x},s}^{i} | \displaystyle\sim\mathcal{G}(\lambda_{\texttt{x},s}^{i};\alpha_{\lambda},\beta% _{\lambda}) | \displaystyle\;\;\;\;\;\;\forall(i,s) | ||

\displaystyle a_{\texttt{x},s}^{i}|\lambda_{\texttt{x},s}^{i} | \displaystyle\sim\mathcal{N}(a_{\texttt{x},s}^{i};0,(\lambda_{\texttt{x},s}^{i% })^{-1}) | \displaystyle\;\;\;\;\;\;\forall(i,s) | ||

\displaystyle g_{\texttt{x},i}^{s}|\boldsymbol{a}_{\texttt{x},s},\boldsymbol{k% }_{\texttt{x},i} | \displaystyle\sim\mathcal{N}(g_{\texttt{x},i}^{s};\boldsymbol{a}_{\texttt{x},s% }^{\top}\boldsymbol{k}_{\texttt{x},i},\sigma_{g}^{2}) | \displaystyle\;\;\;\;\;\;\forall(s,i) | ||

\displaystyle f_{j}^{i}|\boldsymbol{g}_{\texttt{x},i},\boldsymbol{g}_{\texttt{% z},j} | \displaystyle\sim\mathcal{N}(f_{j}^{i};\boldsymbol{g}_{\texttt{x},i}^{\top}% \boldsymbol{g}_{\texttt{z},j},1) | \displaystyle\;\;\;\;\;\;\forall(i,j) | ||

\displaystyle y_{j}^{i}|f_{j}^{i} | \displaystyle\sim\delta(y_{j}^{i};f_{j}^{i}y_{j}^{i}>\nu) | \displaystyle\;\;\;\;\;\;\forall(i,j). |

As short-hand notations, all hyper-parameters in the model are denoted by \boldsymbol{\zeta}=\{\alpha_{\lambda},\beta_{\lambda},\sigma_{g},\nu\}, all prior variables by \mathbf{\Xi}=\{\mathbf{\Lambda}_{\texttt{x}},\mathbf{\Lambda}_{\texttt{z}}\}, and the remaining random variables by \mathbf{\Theta}=\{\mathbf{A}_{\texttt{x}},\mathbf{A}_{\texttt{z}},\mathbf{F},% \mathbf{G}_{\texttt{x}},\mathbf{G}_{\texttt{z}}\}. We again omit the dependence on \boldsymbol{\zeta} for clarity. We can write the factorized variational approximation as

\displaystyle p(\mathbf{\Theta},\mathbf{\Xi}|\mathbf{K}_{\texttt{x}},\mathbf{K% }_{\texttt{z}},\mathbf{Y})\approx q(\mathbf{\Theta},\mathbf{\Xi})=q(\mathbf{% \Lambda}_{\texttt{x}})q(\mathbf{A}_{\texttt{x}})q(\mathbf{G}_{\texttt{x}})q(% \mathbf{\Lambda}_{\texttt{z}})q(\mathbf{A}_{\texttt{z}})q(\mathbf{G}_{\texttt{% z}})q(\mathbf{F}) |

and define each factor in the ensemble just like its full conditional:

\displaystyle q(\mathbf{\Lambda}_{\texttt{x}}) | \displaystyle=\prod\limits_{i=1}^{N_{\texttt{x}}}\prod\limits_{s=1}^{R}% \mathcal{G}(\lambda_{\texttt{x},s}^{i};\alpha(\lambda_{\texttt{x},s}^{i}),% \beta(\lambda_{\texttt{x},s}^{i})) | ||

\displaystyle q(\mathbf{A}_{\texttt{x}}) | \displaystyle=\prod\limits_{s=1}^{R}\mathcal{N}(\boldsymbol{a}_{\texttt{x},s};% \mu(\boldsymbol{a}_{\texttt{x},s}),\Sigma(\boldsymbol{a}_{\texttt{x},s})) | ||

\displaystyle q(\mathbf{G}_{\texttt{x}}) | \displaystyle=\prod\limits_{i=1}^{N_{\texttt{x}}}\mathcal{N}(\boldsymbol{g}_{% \texttt{x},i};\mu(\boldsymbol{g}_{\texttt{x},i}),\Sigma(\boldsymbol{g}_{% \texttt{x},i})) | ||

\displaystyle q(\mathbf{F}) | \displaystyle=\prod\limits_{i=1}^{N_{\texttt{x}}}\prod\limits_{j=1}^{N_{% \texttt{z}}}\mathcal{TN}(f_{j}^{i};\mu(f_{j}^{i}),\Sigma(f_{j}^{i}),\rho(f_{j}% ^{i})). |

We can bound the marginal likelihood using Jensen’s inequality:

\displaystyle\log p(\mathbf{Y}|\mathbf{K}_{\texttt{x}},\mathbf{K}_{\texttt{z}}% )\geq\operatorname{E}_{q(\mathbf{\Theta},\mathbf{\Xi})}[\log p(\mathbf{Y},% \mathbf{\Theta},\mathbf{\Xi}|\mathbf{K}_{\texttt{x}},\mathbf{K}_{\texttt{z}})]% -\operatorname{E}_{q(\mathbf{\Theta},\mathbf{\Xi})}[\log q(\mathbf{\Theta},% \mathbf{\Xi})] |

and optimize this bound by maximizing with respect to each factor separately until convergence. The approximate posterior distribution of a specific factor \boldsymbol{\tau} can be found as

\displaystyle q(\boldsymbol{\tau}) | \displaystyle\propto\exp(\operatorname{E}_{q(\{\mathbf{\Theta},\mathbf{\Xi}\}% \backslash\boldsymbol{\tau})}[\log p(\mathbf{Y},\mathbf{\Theta},\mathbf{\Xi}|% \mathbf{K}_{\texttt{x}},\mathbf{K}_{\texttt{z}})]). |

The approximate posterior distributions of the ensemble can be found as

\displaystyle q(\mathbf{\Lambda}_{\texttt{x}}) | \displaystyle=\prod\limits_{i=1}^{N_{\texttt{x}}}\prod\limits_{s=1}^{R}% \mathcal{G}\mathopen{}\mathclose{{}\left(\lambda_{\texttt{x},s}^{i};\alpha_{% \lambda}+\dfrac{1}{2},\mathopen{}\mathclose{{}\left(\dfrac{1}{\beta_{\lambda}}% +\dfrac{\widetilde{(a_{\texttt{x},s}^{i})^{2}}}{2}}\right)^{-1}}\right) | ||

\displaystyle q(\mathbf{A}_{\texttt{x}}) | \displaystyle=\prod\limits_{s=1}^{R}\mathcal{N}\mathopen{}\mathclose{{}\left(% \boldsymbol{a}_{\texttt{x},s};\Sigma(\boldsymbol{a}_{\texttt{x},s})\dfrac{% \mathbf{K}_{\texttt{x}}\widetilde{(\boldsymbol{g}_{\texttt{x}}^{s})^{\top}}}{% \sigma_{g}^{2}},\mathopen{}\mathclose{{}\left(\operatorname{diag}(\widetilde{% \boldsymbol{\lambda}_{\texttt{x}}^{s}})+\dfrac{\mathbf{K}_{\texttt{x},m}% \mathbf{K}_{\texttt{x},m}^{\top}}{\sigma_{g}^{2}}}\right)^{-1}}\right) | ||

\displaystyle q(\mathbf{G}_{\texttt{x}}) | \displaystyle=\prod\limits_{i=1}^{N_{\texttt{x}}}\mathcal{N}\mathopen{}% \mathclose{{}\left(\boldsymbol{g}_{\texttt{x},i};\Sigma(\boldsymbol{g}_{% \texttt{x},i})\mathopen{}\mathclose{{}\left(\dfrac{\widetilde{\mathbf{A}_{% \texttt{x}}^{\top}}\boldsymbol{k}_{\texttt{x},i}}{\sigma_{g}^{2}}+\widetilde{% \mathbf{G}_{\texttt{z}}}\widetilde{(\boldsymbol{f}^{i})^{\top}}}\right),% \mathopen{}\mathclose{{}\left(\dfrac{\mathbf{I}}{\sigma_{g}^{2}}+\widetilde{% \mathbf{G}_{\texttt{z}}\mathbf{G}_{\texttt{z}}^{\top}}}\right)^{-1}}\right) | ||

\displaystyle q(\mathbf{F}) | \displaystyle=\prod\limits_{i=1}^{N_{\texttt{x}}}\prod\limits_{j=1}^{N_{% \texttt{z}}}\mathcal{TN}(f_{j}^{i};\widetilde{\boldsymbol{g}_{\texttt{x},i}^{% \top}}\widetilde{\boldsymbol{g}_{\texttt{z},j}},1,f_{j}^{i}y_{j}^{i}>\nu). |

We modify our proposed model for binary-valued outputs to also handle real-valued outputs. Figure 8 illustrates the graphical model of the modified kernelized Bayesian matrix factorization with twin multiple kernel learning (KBMF2MKL) with latent variables and their corresponding priors.

The distributional assumptions of the modified KBMF2MKL model are

\displaystyle\lambda_{\texttt{x},s}^{i} | \displaystyle\sim\mathcal{G}(\lambda_{\texttt{x},s}^{i};\alpha_{\lambda},\beta% _{\lambda}) | \displaystyle\;\;\;\;\;\;\forall(i,s) | ||

\displaystyle a_{\texttt{x},s}^{i}|\lambda_{\texttt{x},s}^{i} | \displaystyle\sim\mathcal{N}(a_{\texttt{x},s}^{i};0,(\lambda_{\texttt{x},s}^{i% })^{-1}) | \displaystyle\;\;\;\;\;\;\forall(i,s) | ||

\displaystyle g_{\texttt{x},m,i}^{s}|\boldsymbol{a}_{\texttt{x},s},\boldsymbol% {k}_{\texttt{x},m,i} | \displaystyle\sim\mathcal{N}(g_{\texttt{x},m,i}^{s};\boldsymbol{a}_{\texttt{x}% ,s}^{\top}\boldsymbol{k}_{\texttt{x},m,i},\sigma_{g}^{2}) | \displaystyle\;\;\;\;\;\;\forall(m,s,i) | ||

\displaystyle\eta_{\texttt{x},m} | \displaystyle\sim\mathcal{G}(\eta_{\texttt{x},m};\alpha_{\eta},\beta_{\eta}) | \displaystyle\;\;\;\;\;\;\forall m | ||

\displaystyle\ e_{\texttt{x},m}|\eta_{\texttt{x},m} | \displaystyle\sim\mathcal{N}(e_{\texttt{x},m};0,\eta_{\texttt{x},m}^{-1}) | \displaystyle\;\;\;\;\;\;\forall m | ||

\displaystyle h_{\texttt{x},i}^{s}|\{e_{\texttt{x},m},g_{\texttt{x},m,i}^{s}\}% _{m=1}^{P_{\texttt{x}}} | \displaystyle\sim\mathcal{N}\mathopen{}\mathclose{{}\left(h_{\texttt{x},i}^{s}% ;\sum\limits_{m=1}^{P_{\texttt{x}}}e_{\texttt{x},m}g_{\texttt{x},m,i}^{s},% \sigma_{h}^{2}}\right) | \displaystyle\;\;\;\;\;\;\forall(s,i) | ||

\displaystyle y_{j}^{i}|\boldsymbol{h}_{\texttt{x},i},\boldsymbol{h}_{\texttt{% z},j} | \displaystyle\sim\mathcal{N}(y_{j}^{i};\boldsymbol{h}_{\texttt{x},i}^{\top}% \boldsymbol{h}_{\texttt{z},j},\sigma_{y}^{2}) | \displaystyle\;\;\;\;\;\;\forall(i,j). |

As short-hand notations, all hyper-parameters in the model are denoted by \boldsymbol{\zeta}=\{\alpha_{\eta},\beta_{\eta},\alpha_{\lambda},\beta_{% \lambda},\sigma_{g},\sigma_{h},\sigma_{y}\}, all prior variables by \mathbf{\Xi}=\{\boldsymbol{\eta}_{\texttt{x}},\boldsymbol{\eta}_{\texttt{z}},% \mathbf{\Lambda}_{\texttt{x}},\mathbf{\Lambda}_{\texttt{z}}\}, and the remaining random variables by \mathbf{\Theta}=\{\mathbf{A}_{\texttt{x}},\mathbf{A}_{\texttt{z}},\boldsymbol{% e}_{\texttt{x}},\boldsymbol{e}_{\texttt{z}},\{\mathbf{G}_{\texttt{x},m}\}_{m=1% }^{P_{\texttt{x}}},\{\mathbf{G}_{\texttt{z},n}\}_{n=1}^{P_{\texttt{z}}},% \mathbf{H}_{\texttt{x}},\mathbf{H}_{\texttt{z}}\}. We again omit the dependence on \boldsymbol{\zeta} for clarity. We can write the factorized variational approximation as

\displaystyle p(\mathbf{\Theta},\mathbf{\Xi}|\{\mathbf{K}_{\texttt{x},m}\}_{m=% 1}^{P_{\texttt{x}}},\{\mathbf{K}_{\texttt{z},n}\}_{n=1}^{P_{\texttt{z}}},% \mathbf{Y})\approx q(\mathbf{\Theta},\mathbf{\Xi})=\\ \displaystyle q(\mathbf{\Lambda}_{\texttt{x}})q(\mathbf{A}_{\texttt{x}})q(\{% \mathbf{G}_{\texttt{x},m}\}_{m=1}^{P_{\texttt{x}}})q(\boldsymbol{\eta}_{% \texttt{x}})q(\boldsymbol{e}_{\texttt{x}})q(\mathbf{H}_{\texttt{x}})q(\mathbf{% \Lambda}_{\texttt{z}})q(\mathbf{A}_{\texttt{z}})q(\{\mathbf{G}_{\texttt{z},n}% \}_{n=1}^{P_{\texttt{z}}})q(\boldsymbol{\eta}_{\texttt{z}})q(\boldsymbol{e}_{% \texttt{z}})q(\mathbf{H}_{\texttt{z}}) |

and define each factor in the ensemble just like its full conditional:

\displaystyle q(\mathbf{\Lambda}_{\texttt{x}}) | \displaystyle=\prod\limits_{i=1}^{N_{\texttt{x}}}\prod\limits_{s=1}^{R}% \mathcal{G}(\lambda_{\texttt{x},s}^{i};\alpha(\lambda_{\texttt{x},s}^{i}),% \beta(\lambda_{\texttt{x},s}^{i})) | ||

\displaystyle q(\mathbf{A}_{\texttt{x}}) | \displaystyle=\prod\limits_{s=1}^{R}\mathcal{N}(\boldsymbol{a}_{\texttt{x},s};% \mu(\boldsymbol{a}_{\texttt{x},s}),\Sigma(\boldsymbol{a}_{\texttt{x},s})) | ||

\displaystyle q(\{\mathbf{G}_{\texttt{x},m}\}_{m=1}^{P_{\texttt{x}}}) | \displaystyle=\prod\limits_{m=1}^{P_{\texttt{x}}}\prod\limits_{i=1}^{N_{% \texttt{x}}}\mathcal{N}(\boldsymbol{g}_{\texttt{x},m,i};\mu(\boldsymbol{g}_{% \texttt{x},m,i}),\Sigma(\boldsymbol{g}_{\texttt{x},m,i})) | ||

\displaystyle q(\boldsymbol{e}_{\texttt{x}}) | \displaystyle=\mathcal{N}(\boldsymbol{e}_{\texttt{x}};\mu(\boldsymbol{e}_{% \texttt{x}}),\Sigma(\boldsymbol{e}_{\texttt{x}})) | ||

\displaystyle q(\boldsymbol{\eta}_{\texttt{x}}) | \displaystyle=\prod\limits_{m=1}^{P_{\texttt{x}}}\mathcal{G}(\eta_{\texttt{x},% m};\alpha(\eta_{\texttt{x},m}),\beta(\eta_{\texttt{x},m})) | ||

\displaystyle q(\mathbf{H}_{\texttt{x}}) | \displaystyle=\prod\limits_{i=1}^{N_{\texttt{x}}}\mathcal{N}(\boldsymbol{h}_{% \texttt{x},i};\mu(\boldsymbol{h}_{\texttt{x},i}),\Sigma(\boldsymbol{h}_{% \texttt{x},i})). |

We can bound the marginal likelihood using Jensen’s inequality:

\displaystyle\log p(\mathbf{Y}|\{\mathbf{K}_{\texttt{x},m}\}_{m=1}^{P_{\texttt% {x}}},\{\mathbf{K}_{\texttt{z},n}\}_{n=1}^{P_{\texttt{z}}})\geq\operatorname{E% }_{q(\mathbf{\Theta},\mathbf{\Xi})}[\log p(\mathbf{Y},\mathbf{\Theta},\mathbf{% \Xi}|\{\mathbf{K}_{\texttt{x},m}\}_{m=1}^{P_{\texttt{x}}},\{\mathbf{K}_{% \texttt{z},n}\}_{n=1}^{P_{\texttt{z}}})]-\operatorname{E}_{q(\mathbf{\Theta},% \mathbf{\Xi})}[\log q(\mathbf{\Theta},\mathbf{\Xi})] |

and optimize this bound by maximizing with respect to each factor separately until convergence. The approximate posterior distribution of a specific factor \boldsymbol{\tau} can be found as

\displaystyle q(\boldsymbol{\tau}) | \displaystyle\propto\exp(\operatorname{E}_{q(\{\mathbf{\Theta},\mathbf{\Xi}\}% \backslash\boldsymbol{\tau})}[\log p(\mathbf{Y},\mathbf{\Theta},\mathbf{\Xi}|% \{\mathbf{K}_{\texttt{x},m}\}_{m=1}^{P_{\texttt{x}}},\{\mathbf{K}_{\texttt{z},% n}\}_{n=1}^{P_{\texttt{z}}})]). |

The approximate posterior distributions of the ensemble can be found as

\displaystyle q(\mathbf{\Lambda}_{\texttt{x}}) | \displaystyle=\prod\limits_{i=1}^{N_{\texttt{x}}}\prod\limits_{s=1}^{R}% \mathcal{G}\mathopen{}\mathclose{{}\left(\lambda_{\texttt{x},s}^{i};\alpha_{% \lambda}+\dfrac{1}{2},\mathopen{}\mathclose{{}\left(\dfrac{1}{\beta_{\lambda}}% +\dfrac{\widetilde{(a_{\texttt{x},s}^{i})^{2}}}{2}}\right)^{-1}}\right) | ||

\displaystyle q(\mathbf{A}_{\texttt{x}}) | \displaystyle=\prod\limits_{s=1}^{R}\mathcal{N}\mathopen{}\mathclose{{}\left(% \boldsymbol{a}_{\texttt{x},s};\Sigma(\boldsymbol{a}_{\texttt{x},s})\sum\limits% _{m=1}^{P_{\texttt{x}}}\dfrac{\mathbf{K}_{\texttt{x},m}\widetilde{(\boldsymbol% {g}_{\texttt{x},m}^{s})^{\top}}}{\sigma_{g}^{2}},\mathopen{}\mathclose{{}\left% (\operatorname{diag}(\widetilde{\boldsymbol{\lambda}_{\texttt{x}}^{s}})+\sum% \limits_{m=1}^{P_{\texttt{x}}}\dfrac{\mathbf{K}_{\texttt{x},m}\mathbf{K}_{% \texttt{x},m}^{\top}}{\sigma_{g}^{2}}}\right)^{-1}}\right) | ||

\displaystyle q(\{\mathbf{G}_{\texttt{x},m}\}_{m=1}^{P_{\texttt{x}}}) | \displaystyle=\prod\limits_{m=1}^{P_{\texttt{x}}}\prod\limits_{i=1}^{N_{% \texttt{x}}}\mathcal{N}\mathopen{}\mathclose{{}\left(\boldsymbol{g}_{\texttt{x% },m,i};\Sigma(\boldsymbol{g}_{\texttt{x},m,i})\mathopen{}\mathclose{{}\left(% \dfrac{\widetilde{\mathbf{A}_{\texttt{x}}^{\top}}\boldsymbol{k}_{\texttt{x},m,% i}}{\sigma_{g}^{2}}+\dfrac{\widetilde{e_{\texttt{x},m}}\widetilde{\boldsymbol{% h}_{\texttt{x},i}}}{\sigma_{h}^{2}}-\sum\limits_{o\neq m}\dfrac{\widetilde{e_{% \texttt{x},m}e_{\texttt{x},o}}\widetilde{\boldsymbol{g}_{\texttt{x},o,i}}}{% \sigma_{h}^{2}}}\right),\mathopen{}\mathclose{{}\left(\dfrac{\mathbf{I}}{% \sigma_{g}^{2}}+\dfrac{\widetilde{e_{\texttt{x},m}^{2}}\mathbf{I}}{\sigma_{h}^% {2}}}\right)^{-1}}\right) | ||

\displaystyle q(\boldsymbol{\eta}_{\texttt{x}}) | \displaystyle=\prod\limits_{m=1}^{P_{\texttt{x}}}\mathcal{G}\mathopen{}% \mathclose{{}\left(\eta_{\texttt{x},m};\alpha_{\eta}+\dfrac{1}{2},\mathopen{}% \mathclose{{}\left(\dfrac{1}{\beta_{\eta}}+\dfrac{\widetilde{e_{\texttt{x},m}^% {2}}}{2}}\right)^{-1}}\right) | ||

\displaystyle q(\boldsymbol{e}_{\texttt{x}}) | \displaystyle=\mathcal{N}\mathopen{}\mathclose{{}\left(\boldsymbol{e}_{\texttt% {x}};\Sigma(\boldsymbol{e}_{\texttt{x}})\begin{bmatrix}\dfrac{\widetilde{% \mathbf{G}_{\texttt{x},m}^{\top}}\widetilde{\mathbf{H}_{\texttt{x}}}}{\sigma_{% h}^{2}}\end{bmatrix}_{m=1}^{P_{\texttt{x}}},\mathopen{}\mathclose{{}\left(% \operatorname{diag}(\widetilde{\boldsymbol{\eta}_{\texttt{x}}})+\begin{bmatrix% }\dfrac{\widetilde{\mathbf{G}_{\texttt{x},m}^{\top}\mathbf{G}_{\texttt{x},o}}}% {\sigma_{h}^{2}}\end{bmatrix}_{m=1,o=1}^{P_{\texttt{x}},P_{\texttt{x}}}}\right% )^{-1}}\right) | ||

\displaystyle q(\mathbf{H}_{\texttt{x}}) | \displaystyle=\prod\limits_{i=1}^{N_{\texttt{x}}}\mathcal{N}\mathopen{}% \mathclose{{}\left(\boldsymbol{h}_{\texttt{x},i};\Sigma(\boldsymbol{h}_{% \texttt{x},i})\mathopen{}\mathclose{{}\left(\sum\limits_{m=1}^{P_{\texttt{x}}}% \dfrac{\widetilde{e_{\texttt{x},m}}\widetilde{\boldsymbol{g}_{\texttt{x},m,i}}% }{\sigma_{h}^{2}}+\dfrac{\widetilde{\mathbf{H}_{\texttt{z}}}(\boldsymbol{y}^{i% })^{\top}}{\sigma_{y}^{2}}}\right),\mathopen{}\mathclose{{}\left(\dfrac{% \mathbf{I}}{\sigma_{h}^{2}}+\dfrac{\widetilde{\mathbf{H}_{\texttt{z}}\mathbf{H% }_{\texttt{z}}^{\top}}}{\sigma_{y}^{2}}}\right)^{-1}}\right). |