Parallel Stochastic Gradient MCMC for Matrix Factorisation Models
Abstract
For large matrix factorisation problems, we develop a distributed Markov Chain Monte Carlo (MCMC) method based on stochastic gradient Langevin dynamics (SGLD) that we call Parallel SGLD (PSGLD). PSGLD has very favourable scaling properties with increasing data size and is comparable in terms of computational requirements to optimisation methods based on stochastic gradient descent. PSGLD achieves high performance by exploiting the conditional independence structure of the MF models to subsample data in a systematic manner as to allow parallelisation and distributed computation. We provide a convergence proof of the algorithm and verify its superior performance on various architectures such as Graphics Processing Units, shared memory multicore systems and multicomputer clusters.
1 Introduction
Matrix factorisation (MF) models have been widely used in data analysis and have been shown to be useful in various domains, such as recommender systems, audio processing, finance, computer vision, and bioinformatics Smaragdis & Brown (2003); Devarajan (2008); Cichoki et al. (2009). The aim of a MF model is to decompose an observed data matrix in the form: , where and are the factor matrices, known typically as the dictionary and the weight matrix respectively, to be estimated by minimising some error measure such as the Frobenious norm .
More general noise models and regularisation methods can be developed. One popular approach is using a probabilistic MF model having the following hierarchical generative model:
(1) 
where, denotes the row of and denotes the column of

Point estimates such as the maximum likelihood (ML) or maximum aposteriori (MAP):
(2) 
The full posterior distribution:
(3)
The majority of the current literature on MF focuses on obtaining point estimates via optimisation of the objective given in Equation 2. Point estimates can be useful in practical applications and there is a broad literature for solving this optimisation problem for a variety of choices of prior and likelihood functions, with various theoretical guarantees Lee & Seung (1999); Liu et al. (2010); Févotte & Idier (2011); Gemulla et al. (2011); Recht & Ré (2013). In contrast, Monte Carlo methods that sample from the often intractable full posterior distribution (in the sense of computing moments or the normalizing constant) received less attention, mainly due to the perceived computational obstacles and rather slow convergence of standard methods, such as the Gibbs sampler, for the target density in 3.
Having an efficient sampler that can generate from the full posterior in contrast to a point estimate would be useful in various applications such as model selection (i.e., estimating the ‘rank’ of the model) or estimating the Bayesian predictive densities useful for active learning. Yet, despite the well known advantages, Monte Carlo methods are typically not the method choice in large scale MF problems as they are perceived to be computationally very demanding. Indeed, classical approaches based on batch MetropolisHastings would require passing over the whole data set at each iteration and the acceptance step makes the methods even more impractical for large data sets. Recently, alternative approaches have been proposed to scaleup MCMC inference to largescale regime. An important attempt was made by Welling and Teh (2011), where the authors combined the ideas from a gradientbased MCMC method, so called the Langevin dynamics (LD) Neal (2010) and the popular optimisation method, stochastic gradient descent (SGD) Kushner & Yin (2003), and developed a scalable MCMC framework called as the stochastic gradient Langevin dynamics (SGLD). Unlike conventional batch MCMC methods, SGLD requires to ‘see’ only a small subset of the data per iteration similar to SGD. With this manner, SGLD can handle large datasets while at the same time being a valid MCMC method that forms a Markov Chain asymptotically sampling from the target density. Approximation analysis of SGLD has been studied in Sato & Nakagawa (2014) and Teh et al. (2014). Several extensions of SGLD have been proposed. Ahn et al. (2012) made use of the Fisher information besides the noisy gradients, Patterson and Teh (2013) applied SGLD on the probability simplex. Chen et al. (2014) and Ding et al. (2014) considered second order Langevin dynamics and made use of the momentum terms, extending the vanilla SGLD.
In this study, we develop a parallel and distributed MCMC method for sampling from the full posterior of a broad range of MF models, including models not easily tackled using standard methods such as the Gibbs sampler. Our approach is carefully designed for MF models and builds upon the generic distributed SGLD (DSGLD) framework that was proposed in S. Ahn & Welling (2014) where separate Markov chains are run in parallel on different subsets of the data that are distributed among worker nodes. When applied to MF models, DSGLD results in computational inefficiencies since it cannot exploit the conditional independence structure of the MF models. Besides, DSGLD requires all the latent variables (i.e., and ) to be synchronised once in a couple of iterations which introduces a significant amount of communication cost. On the other hand, for large problems it may not even be possible to store the latent variables in a single machine; one needs to distribute the latent variables among the nodes as well.
We propose a novel parallel and distributed variant of SGLD for MF models, that we call Parallel SGLD (PSGLD). PSGLD has very favourable scaling properties with increasing data size, remarkably upto the point that the resulting sampler is computationally not much more demanding than an optimisation method such as the distributed stochastic gradient descent (DSGD) Gemulla et al. (2011). Reminisicent to DSGD, PSGLD achieves high performance by exploiting the conditional independence structure of the MF models for subsampling the data in a systematic manner as to allow parallelisation. The main advantages of PSGLD can be summarised as follows:

Due to its inherently parallel structure, PSGLD is faster than SGLD by several orders of magnitude while being as accurate.

As we will illustrate in our experiments, PSGLD can easily be implemented in both sharedmemory and distributed architectures. This makes the method suitable for very large data sets that might be distributed among many nodes.

Unlike DSGLD, which requires to communicate all the parameters and among the worker nodes, PSGLD communicates only small parts of . This drastically reduces the communication cost for large and .

The probability distribution of the samples generated by PSGLD converges to the Bayesian posterior.
We evaluate PSGLD on both synthetic and real datasets. Our experiments show that, PSGLD can be beneficial in two different settings: 1) a sharedmemory setting, where we implement PSGLD on a graphics processing unit (GPU) 2) a distributed setting, where we implement PSGLD on a cluster of computers by using a message passing protocol. Our results show that, in the sharedmemory setting, while achieving the same quality, PSGLD is times faster than a Gibbs sampler on a nonnegative matrix factorisation problem; and in the distributed setting, PSGLD easily scalesup to matrices with hundreds of millions of entries.
We would like to note that, a DSGLDbased, distributed MF framework has been independently proposed by Ahn et al. (2015), where the authors focus on a particular MF model, called as the Bayesian probabilistic matrix factorisation (BPMF) Salakhutdinov & Mnih (2008). In this study, we focus on a generalised observation model family (Tweedie models), in which we can obtain several observation models that have been used in important MF models (such as BPMF, Poisson nonnegative matrix factorisation (NMF) Lee & Seung (1999), ItakuraSaito NMF Févotte et al. (2009)) as special cases.
2 Stochastic Gradient Langevin Dynamics (SGLD) for Matrix Factorisation
In the last decade, SGD has become very popular due to its low computational requirements and convergence guarantee. SGLD brings the ideas of SGD and LD together in order to generate samples from the posterior distribution in a computationally efficient way. In algorithmic sense, SGLD is identical to SGD except that it injects a Gaussian noise at each iteration. For MF models, SGLD iteratively applies the following update rules in order to obtain the samples and : and , where
Here, is the number of elements in , denotes the iteration number, is the subsample that is drawn at iteration , the set is defined as , denotes the gradients, and denotes the number of elements in . The elements of the noise matrices and are independently Gaussian distributed:
For convergence, the step size must satisfy the following conditions:
(4) 
A typical choice for the step size is .
In SGLD, the subsample can be drawn with or without replacement. When dealing with MF models, instead of subsampling the data arbitrarily, one might come up with more clever subsampling schemas that could reduce the computational burden drastically by enabling parallelism. In the next section, we will describe our novel method, PSGLD, where we utilise a systematic subsampling schema by exploiting the conditional independence structure of MF models.
3 Parallel SGLD for Matrix Factorisation
In this section, we describe the details of PSGLD. Inspired by Liu et al. (2010); Gemulla et al. (2011); Recht & Ré (2013), PSGLD utilises a biased subsampling schema where the observed data is carefully partitioned into mutually disjoint blocks and the latent factors are also partitioned accordingly. An illustration of this approach is depicted in Figure 1. In this particular example, the observed matrix is partitioned into disjoint blocks and the latent factors and are partitioned accordingly into and blocks. At each iteration, PSGLD subsamples blocks from , called as the parts, in such a way that the blocks in a part will not ‘touch’ each other in any dimension of , as illustrated in Figure 1. This biased subsampling schema enables parallelism, since given a part, the SGLD updates can be applied to different blocks of the latent factors in parallel.
In the example given in Figure 1, we arbitrarily partition the data into equalsized blocks where these blocks are obtained in a straightforward manner by partitioning using a grid. In the general case, the observed matrix will be partitioned into blocks and these blocks can be formed in a datadependent manner, instead of using simple grids.
Let us formally define a block and a part. First, we need to define a partition of a set as where contains nonempty disjoint subsets of , whose union is equal to . Here, denotes the number of subsets that the partition contains. We will define the blocks and the parts by using partitions of the sets and .
Definition 1.
A block, is the Cartesian product of two sets, one of them being in and the other one being in . Formally, it is defined as follows:
(5) 
where and .
Definition 2.
A part, at iteration , is a collection of mutually disjoint blocks and is defined as follows:
(6) 
where all the blocks are mutually disjoint, formally,
Suppose we read a part at iteration . Then the SGLD updates for can be written as follows:
(7) 
Since all are mutually disjoint, we can decompose Equation 7 into interchangeable updates (i.e., they can be applied in any order), that are given as follows: , where
(8) 
for all . Here, and are the latent factor blocks at iteration , that are determined by the current data block and are formally defined as follows:
The noise matrix is of the same size as and its entries are independently Gaussian distributed with mean and variance .
Similarly, we obtain interchangeable update rules for that are given as follows: , where
(9) 
for all . Similarly, is of the same size as and its entries are independently Gaussian distributed with mean and variance . The parallelism of PSGLD comes from the fact that all these update rules are interchangeable, so that we can apply them in parallel. The pseudocode of PSGLD is given in Algorithm LABEL:algo:psgld.
3.1 Convergence Analysis
Since we are making use of a biased subsampling schema, it is not clear that the samples generated by PSGLD will converge to the Bayesian posterior. In this section, we will define certain conditions on the selection of the parts and provided these conditions hold, we will show that the probability distribution of the samples and converges to the Bayesian posterior .
For theoretical use, we define as the parameter vector, that contains both and :
(10) 
where denotes the vectorisation operator. We also define
Then, the stochastic noise is given by
(11) 
Under the following conditions Theorem 1 holds.
Condition 1.
The step size satisfies Equation 4.
Condition 2.
The part is chosen from nonoverlapping parts whose union covers the whole observed matrix (e.g., the parts given in Figure 1). The probability of choosing a part at iteration is proportional to its size:
Condition 3.
, for integer .
Theorem 1.
Let be the probability density function of the samples that are generated by PSGLD. Then, the probability distribution of converges to the Bayesian posterior :
(12) 
Proof sketch.
Under Condition 2, we can show that is an unbiased estimator of ; therefore the stochastic noise is zeromean:
The rest of the proof is similar to Sato & Nakagawa (2014). Under conditions 1 and 3, we can show that follows the (multidimensional) FokkerPlank equation and therefore the stationary distribution of is . ∎
algocf[t] \end@float
3.2 Nonnegativity Constraints
In certain applications, all the elements of , , and are required to be nonnegative, that is known as the nonnegative matrix factorisation (NMF) Lee & Seung (1999). As we will illustrate in Section 4, the nonnegativity constraint is often a necessity in certain probabilistic models, where we essentially decompose the parameters of the probabilistic model that are nonnegative by definition (e.g., the intensity of a Poisson distribution or the mean of a gamma distribution).
In an SGD framework, the latent factors can be kept in a constraint set by using projections that apply the minimum force to keep the variables in the constraint set. However, since we are in an MCMC framework, it is not clear that appending a projection step to the PSGLD updates would still result in a proper MCMC method. Instead, similar to Patterson & Teh (2013), we make use of a simple mirroring trick, where we replace the negative entries of and with their absolute values. Formally, we let and take values in the whole , however we parametrise the prior and the observation models with the absolute values, and . Since and (similarly, and ) will be equiprobable in this setting, we can replace the negative elements of and with their absolute values without violating the convergence guarantee.
4 Experiments
In this section we will present our experiments where we evaluate PSGLD on both synthetic and real datasets using the nonnegative matrix factorisation (NMF) model. In order to be able to cover a wide range of likelihood functions, we consider the following probabilistic model:
(13) 
where , , and . Here, and denote the exponential and Tweedie distributions, respectively. The Tweedie distribution is an important special case of the exponential dispersion models Jørgensen (1997) and has shown to be useful for factorisation models Yilmaz et al. (2011). The Tweedie density can be written in the following form:
where is the mean, is the dispersion (related to the variance), is the power parameter, is the normalizing constant, and denotes the divergence that is defined as follows:
The divergence generalises many divergence functions that are commonly used in practice. As special cases, we obtain the ItakuraSaito divergence, KullbackLeibler divergence, and the Euclidean distance square, for , respectively. From the probabilistic perspective, different choices of yield important distributions such as gamma (), Poisson (), Gaussian (), compound Poisson (), and inverse Gaussian () distributions. Due to a technical condition, no Tweedie model exists for the interval , but for all other values of , one obtains the very rich family of Tweedie stable distributions Jørgensen (1997). Thanks to the flexibility of the Tweedie distribution, we are able to choose an observation model by changing a single parameter , without modifying the inference algorithm.
In most of the special cases of the Tweedie distribution, the normalizing constant is an infinite sum and cannot be written in a simple analytical form. Fortunately, provided that and are given, the normalizing constant becomes irrelevant since it does not depend on the mean parameter and therefore and . Consequently, the PSGLD updates only involve the partial derivatives of the divergence with respect to and , which is tractable.

4.1 Experimental Setup
We will compare PSGLD with different MCMC methods, namely the Gibbs sampler, LD, and SGLD. We will conduct our experiments in two different settings: 1) a sharedmemory setting where the computation is done on a single multicore computer 2) a distributed setting where we make use of a cluster of computing nodes
It is easy to derive the update equations required by the gradientbased methods for the TweedieNMF model. However, developing a Gibbs sampler for this general model is unfortunately not obvious. We could derive Gibbs samplers for certain special cases of the Tweedie model, such as the PoissonNMF Cemgil (2009) where and . Moreover, in order the full conditional distributions that are required by the Gibbs sampler, we need to introduce an auxiliary tensor and augment the probabilistic model in Equation 13 as follows:
where denotes the Poisson distribution.
The LD and Gibbs samplers require to pass on the whole observed matrix at every iteration. The Gibbs sampler further requires the whole auxiliary tensor to be sampled at each iteration.
4.2 SharedMemory Setting
In this section, we will compare the mixing rates and the computation times of all the aforementioned methods in a sharedmemory setting. We will first compare the methods on synthetic data, then on musical audio data.
We conduct all the sharedmemory experiments on a MacBook Pro with GHz Quadcore Intel Core i7 CPU, GB of memory, and NVIDIA GeForce GT 750M graphics card. We have implemented PSLGD on the GPU in CUDA C. We have implemented the other methods on the CPU in C, where we have used the GNU Scientific Library and BLAS for the matrix operations.
Experiments on Synthetic Data
In order to be able to compare all the methods, in our first experiment we use the PoissonNMF model. We first generate , , and by using the generative model. Then, we run all the methods in order to obtain the samples . For simplicity, we choose and we set . In order to obtain the blocks, we partition the sets and into equal pieces, where we simply partition by using a grid, similar to the example given in Figure 1. Initially, we choose different parts whose union cover the whole observed matrix , similar to the ones in Figure 1. At each iteration, we choose one of these parts in cyclic order, i.e. we proceed to the next part at each iteration and return the first part after iteration with integer . Since the sizes of all the parts are the same, Condition 2 is satisfied.
In LD, we use a constant step size , whereas in SGLD and PSGLD, we set the step sizes as , where . For each method, we tried several values for the parameters and report the results for the best performing ones. In LD we set , in SGLD we set , , and in PSGLD we set and . The results are not very sensitive to the actual value of and , provided these are set in a reasonable range. Furthermore, in SGLD, we draw the subsamples with a withreplacement manner, where we set .
Figure 2 shows the mixing rates and the running times of the methods under the Poisson model for different data sizes. While plotting the loglikelihood of the state of the Markov chain is not necessarily an indication of convergence to the stationary distribution, nevertheless provides a simple indicator if the sampler is stuck around a low probability mode. We set the number of rows , , and we generate samples from the Markov chain with each method. We can observe that, in all cases, SGLD achieves poor mixing rates due to the withreplacement subsampling schema while LD achieves better mixing rates than SGLD. Moreover, while the LD updates can be implemented highly efficiently using BLAS, the reduced data access of SGLD does not reflect in reduced computation time due to the random data access pattern when selecting subsamples from .
The results show that PSGLD and the Gibbs sampler seem to achieve much better mixing rates. However, we observe an enormous difference in the running times of these methods – PSGLD is times faster than the Gibbs sampler on a GPU, while achieving virtually the same quality. For example, in a model with rows, the Gibbs sampler runs for more than hours while PSGLD completes the burnin phase in nearly second and generates K samples from the Markov chain in less than seconds, even when there are more than million entries in . Naturally, this gap between PSGLD and the Gibbs sampler becomes more pronounced with increasing problem size. We also observe that PSGLD is faster than LD and SGLD by folds while achieving a much better mixing rate.

We also evaluate PSGLD with observation model, which corresponds to a compound Poisson distribution. This distribution is particularly suited for sparse data as it has a nonzero probability mass on and a continuous density on Jørgensen (1997). Even though the probability density function of this distribution cannot be written in closedform analytical expression, fortunately we can still generate random samples from the distribution in order to obtain synthetic .
Since deriving a Gibbs sampler for the compound Poisson model is not obvious, we will compare only LD, SGLD, and PSGLD on this model. Figure 2 shows the performance of these methods for . We obtain qualitatively similar results; PSGLD achieves a much better mixing rate and is much faster than the other methods.
Experiments on Audio
The TweedieNMF model has been widely used for audio and music modelling Févotte & Idier (2011). In musical audio context, the observed matrix is taken as a timefrequency representation of the audio signal, such as the power or magnitude spectra that are computed via shorttime Fourier transform. Here, the index denotes the frequency bins, whereas the index denotes the timeframes. An example audio spectrum belonging to a short piano excerpt ( seconds) is given in Figure 3.
When the audio spectrum is decomposed by using an NMF model, each column of will contain a different spectral template and each row of will contain the activations through time for a particular spectral template. In music processing applications, each spectral template is expected to capture the spectral shape of a certain musical note and the activations are expected to capture the loudness of the notes.
We decompose the audio spectrum given in Figure 3 and visually compare the dictionary matrices that are learned by LD and PSGLD. The size of is and we set . For PSGLD, we partition the sets and into equal pieces and we choose the parts in cyclic order at each iteration. With each method, we generate samples but discard the samples in the burnin phase ( samples). Figure 3 shows the Monte Carlo averages that are obtained by different methods. We observe that PSGLD successfully captures the spectral shapes of the different notes and the chords that occur in the piece, even though the method is completely unsupervised. We also observe that LD is able to capture the spectral shapes of most of the notes as well, and estimates a less sparse dictionary. Furthermore, PSGLD runs in a much smaller amount of time; the running times of the methods are and seconds respectively for PSGLD and LD – as a reference the Gibbs sampler needs to run for seconds on the same problem.
4.3 DistributedHybrid Setting
In this section, we will focus on the implementation of PSGLD in a distributed setting, where each block of might reside at a different node. We will consider a distributed architecture that contains three main components: 1) the data nodes that store the blocks of 2) the computational nodes that execute the PSGLD updates 3) the main node that is only responsible for submitting the jobs to the computational nodes only at the beginning of the sampling process.
In the distributed setting, we implement PSGLD by a message passing protocol in C using the OpenMPI library. PSGLD is naturally suited for message passing environments, and the low level control on the distributed computations provide more insight than other platforms such as Hadoop MapReduce. On the other hand, it is straightforward to implement PSGLD in a MapReduce environment for commercial and faulttolerant applications.
In our implementation, we make use of an efficient communication mechanism, where we set the number of blocks to the number of available nodes. As illustrated in Figure 4, throughout the sampling process, each node is responsible only for a certain block; however, at the end of each iteration it transfers the corresponding block to its adjacent node in a cyclic fashion. With this mechanism, the part is determined implicitly at each iteration depending on the current locations of the factor blocks and . Besides, as opposed to many distributed MCMC methods such as DSGLD, this mechanism enables PSGLD to have a much lower communication cost, especially for large , , and values.
We conduct our distributedsetting experiments on a cluster with computational nodes where each computational node has Intel Xeon GHz CPUs and GB of memory. Therefore, provided that the memory is sufficient, we are able to run concurrent processes on our cluster. In our experiments, by assuming that the network connection between the computational nodes is sufficiently fast, we will assume that we have at most computational nodes.
We evaluate PSGLD on a large movie ratings dataset, MovieLens M (grouplens.org). This dataset contains million ratings applied to movies by users, resulting in a sparse where of is nonzero. In all our experiments, we set , , and we set to the number of available nodes where we partition the sets and into equal pieces similar to the sharedmemory experiments. In these experiments, the sizes of the parts are close to each other, therefore our communication mechanism satisfies Condition .
In our first experiment, our goal is to contrast the speed of our sampling algorithm to a distributed optimisation algorithm. Clearly, the goals of both computations are different (a sampler does not solve an optimisation problem unless techniques such as simulated annealing is being used), yet monitoring the root mean squared error (RMSE) between and throughout the iterations provides a qualitative picture about the convergence behaviour of the algorithms. Figure 5 shows the RMSE values of PSGLD and the distributed stochastic gradient descent (DSGD) algorithm Gemulla et al. (2011) for iterations with . We observe that a very similar convergence behaviour and the running times for both methods. The results indicate that, PSGLD makes Bayesian inference possible for MF models even for large datasets by generating samples from the Bayesian posterior, while at the same time being as fast as the stateoftheart distributed optimisation algorithms.
In our last set of experiments, we demonstrate the scalability of PSGLD. Firstly, we differ the number of nodes from to and generate samples in each setting. Figure 6 shows the running times of PSGLD for different number of nodes. The results show that, the running time reduces almost quadratically as we increase the number of nodes until . For , the communication cost dominates and the running time increases.
Finally, in order to illustrate how PSGLD scales with the size of the data, we increase the size of while increasing the number of nodes accordingly. We start with the original dataset and nodes, then we duplicate in both dimensions (the number of elements quadruples) and set the number of nodes to . We repeat this procedure two more times, where the ultimate dataset becomes of size with million nonzero entries and the number of nodes becomes . Figure 6 shows the running times of PSGLD with for increasing data sizes and number of nodes. The results show that, even though we increase the size of the data folds, the running time of PSGLD remains nearly constant provided we can increase the number of nodes proportionally.
5 Conclusion
We have described a scalable MCMC method for sampling from the posterior distribution of a MF model and tested the performance of our approach in terms of accuracy, speed and scalability on various modern architectures. Our results suggest that, contrary to the established folklore in ML, inference methods for ‘big data’ are not limited to optimisation, and Monte Carlo methods are as competitive in this regime as well. The existence of efficient samplers paves the way to full Bayesian inference; due to lack of space we have not presented natural applications such as model selection.
We conclude with the remark that it is rather straightforward to extend PSGLD to more structured models such as coupled matrix and tensor factorisation models. Here, several datasets are decomposed simultaneously and the distributed nature of PSGLD is arguably even more attractive when data are naturally distributed to different physical locations.
Acknowledgments
This work is supported by the Scientific and Technological Research Council of Turkey (TÜBİTAK) Grant no. 113M492. Umut Şimşekli is also funded by a PhD Scholarship from TÜBİTAK.
Footnotes
 In the rest of the paper, we will use bold capital letters to denote matrices, e.g., , bold small letters to denote vectors, e.g., , and small regular letters to denote scalars, e.g., .
 For the source code for both settings (CUDA and OpenMPI), please contact the authors.
References
 Ahn, S., Korattikara, A., Liu, N., Rajan, S., and Welling, M. Largescale distributed bayesian matrix factorization using stochastic gradient mcmc. In KDD, 2015.
 Cemgil, A. T. Bayesian inference in nonnegative matrix factorisation models. Computational Intelligence and Neuroscience, 2009.
 Chen, T., Fox, E.B., and Guestrin, C. Stochastic gradient Hamiltonian Monte Carlo. In Proc. International Conference on Machine Learning, June 2014.
 Cichoki, A., Zdunek, R., Phan, A.H., and Amari, S. Nonnegative Matrix and Tensor Factorization. Wiley, 2009.
 Devarajan, Karthik. Nonnegative matrix factorization: An analytical and interpretive tool in computational biology. PLoS Computational Biology, 4, 2008.
 Ding, Nan, Fang, Youhan, Babbush, Ryan, Chen, Changyou, Skeel, Robert D., and Neven, Hartmut. Bayesian sampling using stochastic gradient thermostats. In Advances in Neural Information Processing Systems, pp. 3203–3211, 2014.
 Févotte, C., Bertin, N., and Durrieu, J. L. Nonnegative matrix factorization with the ItakuraSaito divergence. with application to music analysis. Neural Computation, 21:793–830, 2009.
 Févotte, Cédric and Idier, Jérôme. Algorithms for nonnegative matrix factorization with the divergence. Neural Computation, 23(9):2421–2456, 2011.
 Gemulla, Rainer, Nijkamp, Erik, Haas, Peter J., and Sismanis, Yannis. Largescale matrix factorization with distributed stochastic gradient descent. In ACM SIGKDD, 2011.
 Jørgensen, B. The Theory of Dispersion Models. Chapman & Hall/CRC Monographs on Statistics & Applied Probability, 1997.
 Kushner, H. and Yin, G. Stochastic Approximation and Recursive Algorithms and Applications. Springer, 2003.
 Lee, D. D. and Seung, H. S. Learning the parts of objects by nonnegative matrix factorization. Nature, 401, 1999.
 Liu, Chao, chih Yang, Hung, Fan, Jinliang, He, LiWei, and Wang, YiMin. Distributed nonnegative matrix factorization for webscale dyadic data analysis on mapreduce. In Proceedings of the 19th International World Wide Web Conference, April 2010.
 Neal, Radford M. MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, 54, 2010.
 Patterson, S. and Teh, Y. W. Stochastic gradient Riemannian Langevin dynamics on the probability simplex. In Advances in Neural Information Processing Systems, 2013.
 Recht, Benjamin and Ré, Christopher. Parallel stochastic gradient algorithms for largescale matrix completion. Mathematical Programming Computation, 2013.
 S. Ahn, A. Korattikara and Welling, M. Bayesian posterior sampling via stochastic gradient fisher scoring. In ICML, 2012.
 S. Ahn, B. Shahbaba and Welling, M. Distributed stochastic gradient mcmc. In ICML, 2014.
 Salakhutdinov, Ruslan and Mnih, Andriy. Bayesian probabilistic matrix factorization using markov chain monte carlo. In Proceedings of the 25th international conference on Machine learning, pp. 880–887, 2008.
 Sato, Issei and Nakagawa, Hiroshi. Approximation analysis of stochastic gradient langevin dynamics by using fokkerplanck equation and ito process. In Proceedings of the 31st International Conference on Machine Learning (ICML14), pp. 982–990. JMLR Workshop and Conference Proceedings, 2014.
 Smaragdis, Paris and Brown, Judith C. Nonnegative matrix factorization for polyphonic music transcription. In In IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, pp. 177–180, 2003.
 Teh, Y. W., Thiéry, A. H., and Vollmer, S. J. Consistency and fluctuations for stochastic gradient Langevin dynamics. submitted, 2014.
 Welling, M. and Teh, Y. W. Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the International Conference on Machine Learning, 2011.
 Yilmaz, Y. K., Cemgil, A. T., and Simsekli, U. Generalised coupled tensor factorisation. In NIPS, 2011.