# Prior and Likelihood Choices for Bayesian Matrix Factorisation on Small Datasets

###### Abstract

In this paper, we study the effects of different prior and likelihood choices for Bayesian matrix factorisation, focusing on small datasets. These choices can greatly influence the predictive performance of the methods. We identify four groups of approaches: Gaussian-likelihood with real-valued priors, nonnegative priors, semi-nonnegative models, and finally Poisson-likelihood approaches. For each group we review several models from the literature, considering sixteen in total, and discuss the relations between different priors and matrix norms. We extensively compare these methods on eight real-world datasets across three application areas, giving both inter- and intra-group comparisons. We measure convergence and runtime speed, cross-validation performance, sparse and noisy prediction performance, and model selection robustness. We offer several insights into the trade-offs between prior and likelihood choices for Bayesian matrix factorisation on small datasets—such as that Poisson models give poor predictions, and that nonnegative models are more constrained than real-valued ones.


Thomas Brouwer and Pietro Lió Computer Laboratory, University of Cambridge, United Kingdom

## 1 Introduction

Matrix factorisation methods have become very popular in recent years, and are used in many applications such as collaborative filtering (?; ?) and bioinformatics (?; ?). Given a matrix relating two entity types, such as movies and users, matrix factorisation decomposes that matrix into two smaller so-called factor matrices, such that their product approximates the original one. Matrix factorisation is often used for predicting missing values in the datasets, and for analysing the resulting factor values to identify biclusters or features.

Most models can be categorised as being either non-probabilistic, such as the popular models by (?), or Bayesian. The former seek to minimise an error function (such as the squared error) between the original matrix and the approximation. In contrast, Bayesian variants treat the two smaller matrices as random variables, place prior distributions over them, and find the posterior distribution over their values after observing the data. A likelihood function, usually Gaussian, is used to capture noise in the dataset. Previous work (?) has demonstrated that Bayesian variants are much better for predictive tasks than non-probabilistic versions, which tend to overfit to noise and sparsity.

Matrix factorisation techniques can also be grouped by their constraints on the values in the factor matrices. Firstly, many approaches place no constraints, using real-valued factor matrices (commonly done in the Bayesian literature (?; ?)). Secondly, we could constrain them to be nonnegative (as is popular in the non-probabilistic literature (?; ?)), limiting their applicability to nonnegative datasets, but making the factors easier to interpret and potentially making the method more robust to overfitting. Thirdly, semi-nonnegative variants constrain one factor matrix to be nonnegative, leaving the other real-valued (?; ?). Finally, some versions work only on count data.

In the Bayesian setting, the first three groups of methods all generally use a Gaussian likelihood for noise, and place either real-valued or nonnegative priors over the matrices. For the former, the Gaussian is a common choice (?; ?; ?; ?), and for the latter options include the exponential distribution (?). The fourth group uses a Poisson likelihood to capture count data (?; ?; ?). These models are often extended with complicated hierarchical prior structures over the factor matrices, giving additional behaviour (such as automatic model selection).

This paper offers the first systematic comparison between different Bayesian variants of matrix factorisation. Similar comparisons have been provided in other fields, such as for the regression parameter in Bayesian model averaging (?; ?), which demonstrated that the choice of prior can greatly influence the predictive performance of these models. However, a similar study for Bayesian matrix factorisation is still missing. More strikingly, many papers that introduce new matrix factorisation models do not provide a thorough comparison with competing approaches, or popular non-probabilistic ones such as (?)—for example, the seminal paper by (?) compares their approach with only one other matrix factorisation method; although (?) compares with three others.

We give an overview of the different approaches that can be found in the literature, including hierarchical priors, and then study the effects of these different Bayesian prior and likelihood choices. We aim to make general statements about the behaviour of the four different groups of methods on small real-world datasets (up to a million observed entries), by considering eight datasets across three different applications—four drug effectiveness datasets, two collaborative filtering datasets, and two methylation expression datasets. Our experiments consider convergence speed, cross-validation performance, sparse and noisy prediction performance, and model selection effectiveness. This study offers novel insights into the differences between the four approaches, and the effects of popular hierarchical priors.

We note that there is a rich literature of Bayesian nonparametric matrix factorisation models, which learn the size of the factor matrices automatically. However, these models often require complex inference approaches to find good solutions, and hence their predictive performance is more determined by the inference method than the precise model choices (such as likelihood and prior). In this paper we therefore focus on parametric matrix factorisation models, to isolate the effects of likelihood and prior choices.

Finally, we acknowledge that the models we study were generally introduced for a specific application domain, and that this makes it hard to make general statements about the behaviour of these methods on different datasets. However, we believe that it is essential to provide a cross-application comparison of the different approaches, as this teaches us valuable lessons for the applications studied, and these are likely to apply to other areas as well. The lack of other studies exploring the trade-offs between likelihood and prior choices for Bayesian matrix factorisation makes this a novel and essential study.

## 2 Bayesian Matrix Factorisation

In this section we introduce the different matrix factorisation models that we study. Formally, the problem of matrix factorisation can be defined as follows. Given an observed matrix $R \in \mathbb{R}^{I \times J}$, we want to find two smaller matrices $U \in \mathbb{R}^{I \times K}$ and $V \in \mathbb{R}^{J \times K}$, each with $K$ so-called factors (columns), to solve $R = U V^T + E$, where noise is captured by the matrix $E \in \mathbb{R}^{I \times J}$. Some entries in $R$ may be unobserved; we denote the set of observed entries by $\Omega$. The unobserved entries can then be predicted by the corresponding entries of $U V^T$.
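As a concrete illustration of this setup (a minimal numpy sketch under the notation above, not any particular model's implementation), the low-rank approximation and the prediction of unobserved entries look like:

```python
import numpy as np

def predict_missing(R, U, V, mask):
    """Approximate R by U @ V.T and fill in the unobserved entries.

    mask is a boolean matrix: True where R is observed, False where it
    is missing. Observed values are kept; missing ones are predicted."""
    R_pred = U @ V.T                      # low-rank approximation
    R_filled = np.where(mask, R, R_pred)  # keep observed, predict missing
    return R_pred, R_filled

# Toy example: an exact rank-2 matrix with one entry held out.
rng = np.random.default_rng(0)
U = rng.normal(size=(5, 2))
V = rng.normal(size=(4, 2))
R = U @ V.T
mask = np.ones_like(R, dtype=bool)
mask[0, 0] = False                        # pretend R[0, 0] is unobserved

R_pred, R_filled = predict_missing(R, U, V, mask)
```

Since this toy matrix is exactly rank two, the held-out entry is recovered exactly; on real noisy data the prediction is only an approximation.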

In the Bayesian treatment of matrix factorisation, we express a likelihood function for the observed data that captures noise (such as Gaussian or Poisson). We treat the latent matrices as random variables, placing prior distributions over them. A Bayesian solution for matrix factorisation can then be found by inferring the posterior distribution over the latent variables ($U$, $V$, and any additional random variables in our model), given the observed data $R$. This posterior distribution is often intractable to compute exactly, but several methods exist to approximate it (see Section 4.1).

In Section 2.2 we introduce a wide range of models from the literature, and categorise them into four groups. The model names are highlighted in bold in the text.

### 2.1 Probability Distributions

We introduce all notation and probability distributions used in the paper below.

$\text{diag}(d_1, \dots, d_K)$ is a diagonal matrix with entries $d_1, \dots, d_K$ on the diagonal.

$\mathcal{N}(x \mid \mu, \tau) = \sqrt{\tfrac{\tau}{2\pi}} \exp\left\{-\tfrac{\tau}{2}(x - \mu)^2\right\}$ is a Gaussian distribution with precision $\tau$.

$\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma})$ is a $K$-dimensional multivariate Gaussian distribution with mean $\boldsymbol{\mu}$ and covariance matrix $\boldsymbol{\Sigma}$.

$\mathcal{G}(x \mid \alpha, \beta) = \tfrac{\beta^{\alpha}}{\Gamma(\alpha)} x^{\alpha - 1} e^{-\beta x}$ is a Gamma distribution, where $\Gamma(\alpha)$ is the gamma function.

$\mathcal{NIW}(\boldsymbol{\mu}, \boldsymbol{\Sigma} \mid \boldsymbol{\mu}_0, \beta_0, \nu_0, \boldsymbol{W}_0) = \mathcal{N}\!\left(\boldsymbol{\mu} \,\middle|\, \boldsymbol{\mu}_0, \tfrac{1}{\beta_0}\boldsymbol{\Sigma}\right) \mathcal{IW}(\boldsymbol{\Sigma} \mid \nu_0, \boldsymbol{W}_0)$ is a normal-inverse Wishart distribution, where $\mathcal{IW}$ is an inverse Wishart distribution, and $\boldsymbol{I}$ the identity matrix.

$\mathcal{L}(x \mid \mu, b) = \tfrac{1}{2b} \exp\left\{-\tfrac{|x - \mu|}{b}\right\}$ is a Laplace distribution.

$\mathcal{IG}(x \mid \mu, \lambda) = \sqrt{\tfrac{\lambda}{2\pi x^3}} \exp\left\{-\tfrac{\lambda (x - \mu)^2}{2\mu^2 x}\right\}$ is an inverse Gaussian distribution.

$\mathcal{E}(x \mid \lambda) = \lambda e^{-\lambda x} u(x)$ is an exponential distribution, where $u(x)$ is the unit step function.

$\mathcal{TN}(x \mid \mu, \tau)$ is a truncated normal: a normal distribution with precision $\tau$, with zero density below $x = 0$, and renormalised to integrate to one. $\Phi(\cdot)$ is the cumulative distribution function of $\mathcal{N}(0, 1)$.

Table 1: The sixteen Bayesian matrix factorisation models considered in this paper, grouped by category.

| Category | Name | Likelihood | Prior $U$ | Prior $V$ | Hierarchical prior |
|---|---|---|---|---|---|
| Real-valued | GGG | Gaussian | Gaussian | Gaussian | - |
| | GGGU | Gaussian | Gaussian | Gaussian | - |
| | GGGA | Gaussian | Gaussian | Gaussian | ARD (Gamma) |
| | GGGW | Gaussian | multivariate Gaussian | multivariate Gaussian | normal-inverse Wishart |
| | GLL | Gaussian | Laplace | Laplace | - |
| | GLLI | Gaussian | Laplace | Laplace | inverse Gaussian |
| | GVG | Gaussian | volume prior | Gaussian | - |
| Nonnegative | GEE | Gaussian | exponential | exponential | - |
| | GEEA | Gaussian | exponential | exponential | ARD (Gamma) |
| | GTT | Gaussian | truncated normal | truncated normal | - |
| | GTTN | Gaussian | truncated normal | truncated normal | hierarchical mean and precision |
| | GL21 | Gaussian | $L_{2,1}$-norm prior | $L_{2,1}$-norm prior | - |
| Semi-nonnegative | GEG | Gaussian | exponential | Gaussian | - |
| | GVnG | Gaussian | GVG volume prior with nonnegativity | Gaussian | - |
| Poisson | PGG | Poisson | Gamma | Gamma | - |
| | PGGG | Poisson | Gamma | Gamma | Gamma |

### 2.2 Models

There are three types of choices we make that determine the type of matrix factorisation model we use: the likelihood function, the priors we place over the factor matrices $U$ and $V$, and whether we use any further hierarchical priors. We have identified four different groups of Bayesian matrix factorisation approaches based on these choices: Gaussian-likelihood with real-valued priors, nonnegative priors (constraining the matrices to be nonnegative), semi-nonnegative models (constraining one of the two factor matrices to be nonnegative), and finally Poisson-likelihood approaches. Models within each group use different priors and hierarchical priors, and many choices can be found in the literature. In this paper we consider a total of sixteen models, as summarised in Table 1. We have focused on fully conjugate models (meaning the posterior distribution takes the same functional form as the prior) to ensure inference for each model is guaranteed to work well, so that any performance differences in Section 6 come entirely from the choice of likelihood and priors.

The first three groups all use a Gaussian likelihood for noise, by assuming each value in $R$ comes from the product of $U$ and $V$, $R_{ij} \sim \mathcal{N}(R_{ij} \mid U_i \cdot V_j, \tau^{-1})$, with Gaussian noise of precision $\tau$, for which we use a Gamma prior. The last group instead opts for a Poisson likelihood, $R_{ij} \sim \mathcal{P}(U_i \cdot V_j)$. This only works for nonnegative count data, $R_{ij} \in \mathbb{N}$, but has been studied extensively in the literature due to the popularity and prevalence of datasets like the Netflix Challenge.
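The two likelihood choices can be evaluated side by side. Below is a small sketch using scipy (not from the paper); the only subtlety is converting the precision $\tau$ to a scale, $\sigma = 1/\sqrt{\tau}$, and that the Poisson likelihood requires nonnegative factors and count-valued data:

```python
import numpy as np
from scipy.stats import norm, poisson

def gaussian_loglik(R, U, V, tau):
    """Sum of log N(R_ij | U_i . V_j, 1/tau) over all entries."""
    return norm.logpdf(R, loc=U @ V.T, scale=1.0 / np.sqrt(tau)).sum()

def poisson_loglik(R, U, V):
    """Sum of log Poisson(R_ij | U_i . V_j); requires nonnegative
    factors and count-valued R."""
    return poisson.logpmf(R, mu=U @ V.T).sum()

rng = np.random.default_rng(1)
U = rng.gamma(2.0, 1.0, size=(6, 3))   # nonnegative factors
V = rng.gamma(2.0, 1.0, size=(5, 3))
R = rng.poisson(U @ V.T)               # count data

print(gaussian_loglik(R, U, V, tau=0.5))
print(poisson_loglik(R, U, V))
```

Both log-likelihoods can be computed on the same count data; only the Poisson one is restricted to it.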

#### Real-valued matrix factorisation

The most common approach is to use independent zero-mean Gaussian priors for the entries of $U$ and $V$ (?; ?; ?; ?), which gives rise to the GGG model. The GGGU model is identical but uses a univariate posterior for inference (see supplementary materials).

The first hierarchical model (GGGA) uses the Bayesian automatic relevance determination (ARD) prior, which helps with model selection. The main idea is to replace the prior's precision hyperparameter by a factor-specific random variable $\lambda_k$, which has a further Gamma prior. This ties together all entries in the $k$-th columns of $U$ and $V$: unless enough values in those columns are high, all of them are pushed towards zero, effectively making the factor inactive. This prior has been used for real-valued (?; ?) and nonnegative matrix factorisation (?).

Another hierarchical model (GGGW) was introduced in the seminal paper of (?). Instead of assuming independence of each entry in $U$, we assume each row of $U$ comes from a multivariate Gaussian with shared mean and covariance, and similarly for $V$. We then place a further normal-inverse Wishart prior over these parameters.

An alternative to the Gaussian prior is the Laplace distribution (?), which is much more sharply peaked around zero than the Gaussian. This leads to sparser solutions, as more factor values are pushed towards zero. The basic model (GLL) can be extended with a hierarchical inverse Gaussian prior over the Laplace parameter (GLLI), which the authors claim helps with variable selection.

The final model (GVG) was introduced by (?). It places a volume prior over the $U$ matrix, penalising the volume spanned by the factors. The hyperparameter $\gamma$ determines the strength of the volume penalty (higher $\gamma$ means a stronger prior).

#### Nonnegative matrix factorisation

These models all place nonnegative prior distributions over the entries in $U$ and $V$, and as a result can only deal with nonnegative datasets.

(?) introduced a model using exponential priors over the factor matrices (GEE). This model can also be extended with ARD (?) (GEEA). Another option is the truncated normal distribution (GTT), which can be extended by placing a hierarchical prior over its mean and precision parameters (GTTN), as done by (?). This nontrivial prior cannot be sampled from directly, but is useful for inference.

Finally, we can use a prior inspired by the $L_{2,1}$ norm for both $U$ and $V$ (GL21), as we will discuss in Section 3.

#### Semi-nonnegative matrix factorisation

Instead of forcing nonnegativity on both factor matrices, we could place this constraint on only one, as was done in (?; ?). In the Bayesian setting we place a real-valued prior over one matrix, and a nonnegative prior over the other. The major advantage is that we can handle real-valued datasets, while still enforcing some nonnegativity. However, we will see in Section 6 that their performance is nearly identical to that of the real-valued approaches.

Firstly, we can use an exponential prior for the entries in $U$, and a Gaussian prior for $V$, effectively combining the GGG and GEE models into one (GEG). Another semi-nonnegative model (GVnG) comes from constraining the volume prior in the GVG model to also be nonnegative: the density is zero if any entry of $U$ is negative.

#### Poisson likelihood

The standard Poisson matrix factorisation model (PGG) uses independent Gamma priors over the entries in $U$ and $V$ (?; ?; ?). This model can be extended with a hierarchical prior (PGGG), by replacing one of the Gamma hyperparameters with a new random variable that itself has a further Gamma prior (?).

## 3 Priors and Norms

The prior distributions in Bayesian models act as a regulariser that prevents us from overfitting to the data, and hence from poor predictive performance. We can write out the log posterior of the parameters, which for a Gaussian likelihood and no hierarchical priors becomes

$$\log p(U, V \mid R) = -\frac{\tau}{2} \sum_{(i,j) \in \Omega} \left( R_{ij} - U_i \cdot V_j \right)^2 + \sum_{i,k} \log p(U_{ik}) + \sum_{j,k} \log p(V_{jk}) + C$$

for some constant $C$. Note that the first term is simply the negative Frobenius norm (squared error) of the training fit, and the remaining terms form a regularisation term over the matrices $U, V$. This training error is frequently used in the non-probabilistic matrix factorisation literature (?; ?; ?), where different regularisation terms are used. These are often based on (row-wise) matrix norms, such as

$$\|U\|_F^2 = \sum_{i,k} U_{ik}^2, \qquad \|U\|_1 = \sum_{i,k} |U_{ik}|, \qquad \|U\|_{2,1} = \sum_{i} \sqrt{\sum_k U_{ik}^2}.$$

This offers some interesting insights: the $L_2$ (Frobenius) norm is equivalent to an independent Gaussian prior (GGG), due to the square in the exponential of the Gaussian density; the $L_1$ norm is equivalent to a Laplace prior distribution (GLL); if we constrain $U$ and $V$ to be nonnegative then the $L_1$ norm is equivalent to an exponential prior distribution (GEE); and finally, the $L_{2,1}$ norm can be formulated as a nonnegative prior distribution, which we use for the GL21 model (see Table 1).
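The Gaussian-prior/$L_2$ and Laplace-prior/$L_1$ equivalences can be verified numerically: the negative log-density of each prior differs from the corresponding norm penalty only by an additive constant. A small scipy check (the precision $\tau$ and scale $b$ values are arbitrary):

```python
import numpy as np
from scipy.stats import norm, laplace

tau, b = 2.0, 0.5
xs = np.linspace(-3, 3, 7)

# Negative log-density of the priors...
neg_log_gauss = -norm.logpdf(xs, scale=1.0 / np.sqrt(tau))
neg_log_laplace = -laplace.logpdf(xs, scale=b)

# ...equals the matching norm penalty plus a constant:
l2_penalty = 0.5 * tau * xs**2        # Gaussian prior <-> L2 regulariser
l1_penalty = np.abs(xs) / b           # Laplace prior <-> L1 regulariser

# The differences are constant across x (peak-to-peak range is ~zero).
assert np.allclose(np.ptp(neg_log_gauss - l2_penalty), 0.0)
assert np.allclose(np.ptp(neg_log_laplace - l1_penalty), 0.0)
```

The additive constant does not affect the maximiser of the log posterior, which is why the prior choice and the regulariser choice are interchangeable.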

In other words, the priors chosen for Bayesian matrix factorisation determine the type of regularisation that we add to the model. Additionally, we can use hierarchical priors to model further desired behaviour (such as ARD).

## 4 Model Discussion

### 4.1 Inference

In this paper we use Gibbs sampling for all models, because it tends to be very accurate at approximating the true posterior (?); other methods, such as variational Bayesian inference, are also possible. The Gibbs sampling algorithms, together with their time complexities, are given in the supplementary materials.
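To illustrate what such a sampler looks like, here is a minimal Gibbs sweep for a Gaussian-prior model with univariate conditionals (in the style of GGGU), assuming fully observed data and fixed, hypothetical noise precision `tau` and prior precision `lam`. This is a sketch, not the paper's implementation, which handles missing entries and samples the hyperparameters too:

```python
import numpy as np

def gibbs_ggg(R, K, tau=10.0, lam=0.1, n_iter=200, seed=0):
    """Gibbs sampler for R ~ U V^T with a Gaussian likelihood
    (precision tau) and independent N(0, 1/lam) priors on the factor
    entries. Assumes R is fully observed; returns the last sample."""
    rng = np.random.default_rng(seed)
    I, J = R.shape
    U = rng.normal(size=(I, K))
    V = rng.normal(size=(J, K))

    def update(A, B, M):
        # Sample each entry A[i, k] from its Gaussian full conditional,
        # given the target M (approximately A @ B.T) and all other entries.
        for i in range(A.shape[0]):
            for k in range(A.shape[1]):
                # Residual of M[i] with factor k's contribution removed.
                resid = M[i] - A[i] @ B.T + A[i, k] * B[:, k]
                prec = lam + tau * np.sum(B[:, k] ** 2)
                mean = tau * np.sum(resid * B[:, k]) / prec
                A[i, k] = rng.normal(mean, 1.0 / np.sqrt(prec))

    for _ in range(n_iter):
        update(U, V, R)       # sample U given V
        update(V, U, R.T)     # sample V given U (transposed problem)
    return U, V

# Example: fit a small synthetic low-rank matrix.
rng = np.random.default_rng(1)
R = rng.normal(size=(8, 2)) @ rng.normal(size=(2, 6))
U, V = gibbs_ggg(R, K=2, n_iter=100, seed=0)
```

Each conditional is conjugate (Gaussian prior, Gaussian likelihood), which is exactly what makes Gibbs sampling straightforward for this family of models.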

### 4.2 Hyperparameters

The hyperparameter values we choose for each model can influence their performance, especially when the data is sparse. The hierarchical models try to automatically choose the correct values, by placing a prior over the original hyperparameters. This introduces new hyperparameters, but the models are generally less sensitive to these.

However, in our experience even the models without hierarchical priors are not very sensitive to this choice, as long as we use fairly weak priors. In particular, we used weakly informative hyperparameter values for GGG, GGGU, GEE, GTT, GL21, and GEG; for GLL; and for PGG. The distributions with these hyperparameter values are plotted in Figure 1.

For the other models (the Gaussian likelihood noise parameter, GGGA, GEEA, GGGW, GLLI, GTTN, and PGGG) we similarly used weak hyperparameter values.

We did find that the volume prior models (GVG, GVnG) were very sensitive to the choice of the hyperparameter $\gamma$. We therefore chose its value per dataset, by trying a range of values on each of GDSC, CTRP, CCLE IC$_{50}$, CCLE EC$_{50}$, MovieLens 100K, MovieLens 1M, GM, and PM, and choosing the best one.

### 4.3 Software

Implementations of all models, datasets, and experiments, are available at https://github.com/Anonymous/.

## 5 Datasets

We conduct our experiments on a total of eight real-world datasets across three different applications, allowing us to see whether our observations on one dataset or application also hold more generally. We will focus on one or two datasets at a time for more specific experiments. Also note that we make sure all datasets contain only positive integers, so that we can compare all four groups of Bayesian matrix factorisation approaches.

Table 2: Overview of the eight datasets, giving their dimensions and the fraction of entries that are observed.

| Dataset | Rows | Columns | Fraction obs. |
|---|---|---|---|
| GDSC | 707 | 139 | 0.806 |
| CTRP | 887 | 545 | 0.801 |
| CCLE IC$_{50}$ | 504 | 24 | 0.965 |
| CCLE EC$_{50}$ | 502 | 24 | 0.632 |
| MovieLens 100K | 943 | 1473 | 0.072 |
| MovieLens 1M | 6040 | 3503 | 0.047 |
| Gene body meth. | 160 | 254 | 1.000 |
| Promoter meth. | 160 | 254 | 1.000 |

The first comes from bioinformatics, in particular predicting missing values in drug sensitivity datasets, each detailing the effectiveness (IC$_{50}$ or EC$_{50}$ values) of a range of drugs on different cancer and tissue types (cell lines). We consider the Genomics of Drug Sensitivity in Cancer (GDSC v5.0 (?)), Cancer Therapeutics Response Portal (CTRP v2 (?)), and Cancer Cell Line Encyclopedia (CCLE (?), both IC$_{50}$ and EC$_{50}$) datasets. We preprocessed these datasets by undoing the natural log transform of the GDSC dataset, capping high values to 100 for GDSC and CTRP, and then casting the values to integers. We also filtered out rows and columns with only one or two observed datapoints.

The second application is collaborative filtering, where we are given movie ratings for different users (one to five stars) and we wish to predict the number of stars a user will give to an unseen movie. We use the MovieLens 100K and 1M datasets (?), with 100,000 and 1,000,000 ratings respectively.

Finally, another bioinformatics application, this time looking at methylation expression profiles (?). These datasets give the amount of methylation measured in either the body region of 160 breast cancer driver genes (gene body methylation) or the promoter region (promoter methylation) for 254 different patients. We multiplied all values by twenty and cast them as integers.

The datasets are summarised in Table 2, and the distribution of values for each dataset is visualised in Figure 2. This shows us that the drug sensitivity datasets tend to be bimodal, whereas the MovieLens and methylation datasets are more normally distributed. We can also see that the MovieLens datasets tend to be large and sparse, whereas the others are well-observed and relatively small.

## 6 Experiments

We conducted experiments to compare the four different groups of approaches. In particular, we measured their convergence speed, cross-validation performance, sparse prediction performance, and model selection effectiveness. We sometimes focus on a selection of the methods for clarity. To make the comparison complete, we also added a popular non-probabilistic nonnegative matrix factorisation model (NMF) (?) as a baseline. The results are discussed in Section 7.

### 6.1 Convergence

Firstly, we compared the convergence speed of the models on the GDSC and MovieLens 100K datasets. We ran each model with the same dimensionality $K$, and measured the average mean squared error on the training data across ten runs. We plotted the results in Figure 5, where each group is plotted in the same colour: red for real-valued, blue for nonnegative, green for semi-nonnegative, yellow for Poisson, and grey for the non-probabilistic baseline. Runtimes are given in the supplementary materials.

### 6.2 Cross-validation

Our first predictive experiment was to measure the 5-fold cross-validation performance on each of the eight datasets. We used the hyperparameter values from Section 4.2, and used 5-fold nested cross-validation to choose the dimensionality $K$. The average mean squared errors of the predictions are given in Figure 5 for all eight datasets. The average dimensionality found in nested cross-validation can be found in the supplementary materials.
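Cross-validation over a partially observed matrix splits the observed entries (rather than whole rows or columns) into folds. A small sketch of that split, with hypothetical helper names:

```python
import numpy as np

def observed_entry_folds(mask, n_folds=5, seed=0):
    """Split the observed entries of a matrix into n_folds index sets,
    for cross-validation over matrix entries rather than rows.

    mask is a boolean matrix (True = observed); returns a list of
    arrays of (row, col) index pairs, one per fold."""
    rng = np.random.default_rng(seed)
    idx = np.argwhere(mask)        # (row, col) pairs of observed entries
    rng.shuffle(idx)               # shuffle rows of the index array
    return np.array_split(idx, n_folds)

# Example: a fully observed 6x4 matrix split into 5 folds.
mask = np.ones((6, 4), dtype=bool)
folds = observed_entry_folds(mask, n_folds=5)
```

Each fold is then held out in turn: the model is trained on the remaining observed entries and evaluated on the held-out ones.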

### 6.3 Noise test

We then measured the predictive performance when the datasets are very noisy. We added different levels of Gaussian noise to the data, with the noise-to-signal ratio given by the ratio of the variance of the Gaussian noise we add, to the standard deviation of the generated data. For each noise level we split the datapoints randomly into ten folds, and measured the predictive performance of the models on one held-out set at a time. We used the same dimensionality $K$ for all methods. The results for the GDSC drug sensitivity dataset are given in Figure 5, where we plot the ratio of the variance of the data to the mean squared error of the predictions: higher values are better, and predicting the row average gives a performance of one.
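The evaluation metric used here—variance of the data divided by the mean squared error of the predictions—can be computed as below (a small helper with hypothetical names, not the paper's code). Predicting the overall mean everywhere scores exactly one, so scores above one indicate the model has learned something:

```python
import numpy as np

def variance_over_mse(R_true, R_pred):
    """Ratio of the variance of the data to the MSE of the predictions.
    Higher is better; 1.0 means no better than predicting the mean."""
    mse = np.mean((R_true - R_pred) ** 2)
    return np.var(R_true) / mse

rng = np.random.default_rng(3)
R = rng.normal(size=(20, 10))

# Baseline: predicting the overall mean everywhere scores exactly 1.
baseline = np.full_like(R, R.mean())
print(variance_over_mse(R, baseline))
```

This follows because the MSE of the mean predictor is, by definition, the variance of the data.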

### 6.4 Sparse predictions

Next we measured the predictive performance when the sparsity of the data increases. For different fractions of unobserved data, we randomly split the data based on that fraction, trained the model on the observed data, and measured the performance on the held-out test data. We used the same dimensionality $K$ for all models. The average mean squared error of ten repeats is given in Figure 7, showing the performances on both the methylation GM and GDSC drug sensitivity datasets.

### 6.5 Model selection

We also measured the robustness of the models to overfitting when the dimensionality $K$ is high. With a high dimensionality, most models will fit the training data very well, but give poor predictions on the test data. Here, we varied the dimensionality $K$ for each of the models on the GDSC drug sensitivity dataset, randomly taking out a fraction of the data as test data, and repeating ten times. The results are given in Figure 7; in the supplementary materials we look at two more datasets.

## 7 Discussion

From the results shown in the previous section, we were able to draw the following conclusions.

Observation 1: Poisson likelihood methods perform poorly compared to Gaussian likelihood ones: they overfit quickly (Figure 7a), and give worse predictive performance in cross-validation (Figure 5) and under noisy conditions (Figure 5), presumably because they do not converge to as good a training fit as the other methods (Figure 5). Only at high sparsity levels do they start to perform better (Figure 7d). Some papers (?) claim that Poisson models offer better predictions, but for small and well-observed datasets we found the opposite to be true.

Observation 2: Nonnegative models are more constrained than the real-valued ones, causing them to converge to a worse training fit (Figure 5), but also to be less likely than the standard GGG model to overfit at high sparsity levels (Figures 7c, 7h). However, the right hierarchical prior for a real-valued model (such as Wishart) can bridge this gap.

Observation 3: There is no difference in performance between real-valued and semi-nonnegative matrix factorisation, as shown in the model selection and sparsity experiments (Figures 7e and 7j): the performances of GGG and GEG, as well as of GVG and GVnG, are nearly identical.

Observation 4: There is no difference in predictive performance between univariate and multivariate posteriors (GGG, GGGU), as shown in Figures 7b and 7g.

Observation 5: The automatic relevance determination and Wishart hierarchical priors are effective ways of preventing overfitting, as shown in Figures 7b and 7c: the GGGA, GGGW, and GEEA models keep the error down as $K$ increases, whereas the GGG and GEE models start overfitting more. This has been shown before for nonnegative models (?), but the effect is even stronger for the real-valued ones.

Observation 6: Similarly, the Laplace priors are good at reducing overfitting as the dimensionality grows (Figure 7b), without requiring additional hierarchical priors.

Observation 7: Some other hierarchical priors make little difference, such as with GLLI, GTTN, and PGGG: Figures 7d and 7i show little difference in performance. They can help us choose the hyperparameters automatically, but in our experience the models are not very sensitive to this choice anyway.

Although these observations are specific to the applications and dataset sizes studied, we believe that general insights can be drawn from them about the behaviour of the four different groups of Bayesian matrix factorisation models. The behaviour of Poisson models is especially interesting, because they are often claimed to be better than Gaussian models for large datasets, but for smaller ones this does not hold. We hope that these insights will assist future researchers in their model design.

## References

- [Arngren, Schmidt, and Larsen 2011] Arngren, M.; Schmidt, M. N.; and Larsen, J. 2011. Unmixing of Hyperspectral images using bayesian non-negative matrix factorization with volume prior. In Journal of Signal Processing Systems, volume 65, 479–496. IEEE.
- [Barretina et al. 2012] Barretina, J.; Caponigro, G.; Stransky, N.; Venkatesan, K.; Margolin, A. A.; Kim, S.; Wilson, C. J.; et al. 2012. The Cancer Cell Line Encyclopedia enables predictive modelling of anticancer drug sensitivity. Nature 483(7391):603–7.
- [Brouwer and Lió 2017] Brouwer, T., and Lió, P. 2017. Bayesian Hybrid Matrix Factorisation for Data Integration. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS).
- [Brouwer, Frellsen, and Lió 2017] Brouwer, T.; Frellsen, J.; and Lió, P. 2017. Comparative Study of Inference Methods for Bayesian Nonnegative Matrix Factorisation. In Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD).
- [Chen, Wang, and Zhang 2009] Chen, G.; Wang, F.; and Zhang, C. 2009. Collaborative filtering using orthogonal nonnegative matrix tri-factorization. Information Processing and Management 45(3).
- [Ding, Li, and Jordan 2010] Ding, C.; Li, T.; and Jordan, M. I. 2010. Convex and semi-nonnegative matrix factorizations. IEEE Transactions on Pattern Analysis and Machine Intelligence 32:45–55.
- [Eicher, Papageorgiou, and Raftery 2011] Eicher, T. S.; Papageorgiou, C.; and Raftery, A. E. 2011. Default priors and predictive performance in Bayesian model averaging, with application to growth determinants. Journal of Applied Econometrics 26(1):30–55.
- [Gönen 2012] Gönen, M. 2012. Predicting drug-target interactions from chemical and genomic kernels using Bayesian matrix factorization. Bioinformatics 28(18).
- [Gopalan and Blei 2014] Gopalan, P., and Blei, D. M. 2014. Bayesian Nonparametric Poisson Factorization for Recommendation Systems. In Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics (AISTATS), volume 33, 2–4.
- [Gopalan, Hofman, and Blei 2015] Gopalan, P.; Hofman, J. M.; and Blei, D. M. 2015. Scalable recommendation with hierarchical Poisson factorization. In Proceedings of the Thirty-First Conference on Uncertainty in Artificial Intelligence, 326–335. AUAI Press.
- [Harper and Konstan 2015] Harper, F. M., and Konstan, J. A. 2015. The MovieLens Datasets: History and Context. ACM Transactions on Interactive Intelligent Systems 5(4):1–19.
- [Hu, Rai, and Carin 2015] Hu, C.; Rai, P.; and Carin, L. 2015. Zero-Truncated Poisson Tensor Factorization for Massive Binary Tensors. In Uncertainty in Artificial Intelligence (UAI).
- [Jing, Wang, and Yang 2015] Jing, L.; Wang, P.; and Yang, L. 2015. Sparse Probabilistic Matrix Factorization by Laplace Distribution for Collaborative Filtering. In Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence (IJCAI).
- [Koboldt et al. 2012] Koboldt, D. C.; Fulton, R. S.; McLellan, M. D.; Schmidt, H.; Kalicki-Veizer, J.; McMichael, J. F.; Fulton, L. L.; et al. 2012. Comprehensive molecular portraits of human breast tumours. Nature 490(7418):61–70.
- [Lee and Seung 2000] Lee, D. D., and Seung, H. S. 2000. Algorithms for Non-negative Matrix Factorization. NIPS, MIT Press 556–562.
- [Ley and Steel 2009] Ley, E., and Steel, M. F. 2009. On the effect of prior assumptions in Bayesian model averaging with applications to growth regression. Journal of Applied Econometrics 24(4):651–674.
- [Lippert, Weber, and Huang 2008] Lippert, C.; Weber, S.; and Huang, Y. 2008. Relation prediction in multi-relational domains using matrix factorization. In NIPS workshop on structured input, structured output.
- [Pauca et al. 2004] Pauca, V.; Shahnaz, F.; Berry, M.; and Plemmons, R. 2004. Text mining using non-negative matrix factorizations. In Proceedings SIAM International Conference on Data Mining (SDM), 452–456.
- [Pauca, Piper, and Plemmons 2006] Pauca, V. P.; Piper, J.; and Plemmons, R. J. 2006. Nonnegative matrix factorization for spectral data analysis. Linear Algebra and Its Applications 416(1):29–47.
- [Salakhutdinov and Mnih 2008a] Salakhutdinov, R., and Mnih, A. 2008a. Probabilistic Matrix Factorization. In Advances in Neural Information Processing Systems (NIPS), 1257–1264.
- [Salakhutdinov and Mnih 2008b] Salakhutdinov, R., and Mnih, A. 2008b. Bayesian Probabilistic Matrix Factorization using Markov Chain Monte Carlo. In International Conference on Machine Learning (ICML), 880–887. New York, New York, USA: ACM Press.
- [Schmidt and Mohamed 2009] Schmidt, M. N., and Mohamed, S. 2009. Probabilistic non-negative tensor factorization using Markov chain Monte Carlo. In 17th European Signal Processing Conference.
- [Schmidt, Winther, and Hansen 2009] Schmidt, M. N.; Winther, O.; and Hansen, L. K. 2009. Bayesian non-negative matrix factorization. In International Conference on Independent Component Analysis and Signal Separation, Springer Lecture Notes in Computer Science, Vol. 5441.
- [Seashore-Ludlow et al. 2015] Seashore-Ludlow, B.; Rees, M. G.; Cheah, J. H.; Cokol, M.; Price, E. V.; Coletti, M. E.; Jones, V.; et al. 2015. Harnessing Connectivity in a Large-Scale Small-Molecule Sensitivity Dataset. Cancer discovery 5(11):1210–23.
- [Tan and Févotte 2013] Tan, V. Y. F., and Févotte, C. 2013. Automatic relevance determination in nonnegative matrix factorization with the β-divergence. IEEE Transactions on Pattern Analysis and Machine Intelligence 35(7):1592–1605.
- [Virtanen et al. 2012] Virtanen, S.; Klami, A.; Khan, S.; and Kaski, S. 2012. Bayesian group factor analysis. In Proceedings of the 15th International Conference on Artificial Intelligence and Statistics (AISTATS).
- [Virtanen, Klami, and Kaski 2011] Virtanen, S.; Klami, A.; and Kaski, S. 2011. Bayesian CCA via Group Sparsity. In Proceedings of the 28th International Conference on Machine Learning.
- [Wang, Li, and Zhang 2008] Wang, F.; Li, T.; and Zhang, C. 2008. Semi-supervised clustering via matrix factorization. In Proceedings of the 2008 SIAM International Conference on Data Mining.
- [Yang et al. 2013] Yang, W.; Soares, J.; Greninger, P.; Edelman, E. J.; Lightfoot, H.; Forbes, S.; Bindal, N.; et al. 2013. Genomics of Drug Sensitivity in Cancer (GDSC): a resource for therapeutic biomarker discovery in cancer cells. Nucleic acids research 41(Database issue):D955–61.