
# Parallel Stochastic Gradient MCMC for Matrix Factorisation Models

## Abstract

For large matrix factorisation problems, we develop a distributed Markov Chain Monte Carlo (MCMC) method based on stochastic gradient Langevin dynamics (SGLD) that we call Parallel SGLD (PSGLD). PSGLD has very favourable scaling properties with increasing data size and is comparable in terms of computational requirements to optimisation methods based on stochastic gradient descent. PSGLD achieves high performance by exploiting the conditional independence structure of the MF models to sub-sample data in a systematic manner as to allow parallelisation and distributed computation. We provide a convergence proof of the algorithm and verify its superior performance on various architectures such as Graphics Processing Units, shared memory multi-core systems and multi-computer clusters.

## 1 Introduction

Matrix factorisation (MF) models have been widely used in data analysis and have been shown to be useful in various domains, such as recommender systems, audio processing, finance, computer vision, and bioinformatics Smaragdis & Brown (2003); Devarajan (2008); Cichoki et al. (2009). The aim of an MF model is to decompose an observed data matrix V in the form V ≈ WH, where W and H are the factor matrices, known typically as the dictionary and the weight matrix respectively, to be estimated by minimising some error measure such as the Frobenius norm ‖V − WH‖.

More general noise models and regularisation methods can be developed. One popular approach is using a probabilistic MF model having the following hierarchical generative model:

 p(W)=∏ik p(wik),  p(H)=∏kj p(hkj)
 p(V|W,H)=∏ij p(vij|wi,hj) (1)

where wi denotes the i-th row of W and hj denotes the j-th column of H.1 In MF problems we might be interested in two different quantities:

1. Point estimates such as the maximum likelihood (ML) or maximum a-posteriori (MAP):

 (W,H)⋆ =argmaxW,Hlogp(W,H|V) (2)
2. The full posterior distribution:

 p(W,H|V)∝p(V|W,H)p(W)p(H) (3)

The majority of the current literature on MF focuses on obtaining point estimates via optimisation of the objective given in Equation 2. Point estimates can be useful in practical applications and there is a broad literature for solving this optimisation problem for a variety of choices of prior and likelihood functions, with various theoretical guarantees Lee & Seung (1999); Liu et al. (2010); Févotte & Idier (2011); Gemulla et al. (2011); Recht & Ré (2013). In contrast, Monte Carlo methods that sample from the often intractable full posterior distribution (in the sense of computing moments or the normalizing constant) have received less attention, mainly due to the perceived computational obstacles and the rather slow convergence of standard methods, such as the Gibbs sampler, for the target density in Equation 3.

Having an efficient sampler that can generate from the full posterior, in contrast to a point estimate, would be useful in various applications such as model selection (i.e., estimating the ‘rank’ of the model) or estimating the Bayesian predictive densities useful for active learning. Yet, despite these well-known advantages, Monte Carlo methods are typically not the method of choice in large scale MF problems, as they are perceived to be computationally very demanding. Indeed, classical approaches based on batch Metropolis-Hastings would require passing over the whole data set at each iteration, and the acceptance step makes the methods even more impractical for large data sets. Recently, alternative approaches have been proposed to scale up MCMC inference to the large-scale regime. An important attempt was made by Welling and Teh (2011), where the authors combined ideas from a gradient-based MCMC method, the so-called Langevin dynamics (LD) Neal (2010), with the popular optimisation method stochastic gradient descent (SGD) Kushner & Yin (2003), and developed a scalable MCMC framework called stochastic gradient Langevin dynamics (SGLD). Unlike conventional batch MCMC methods, SGLD needs to ‘see’ only a small subset of the data per iteration, similar to SGD. In this manner, SGLD can handle large datasets while at the same time being a valid MCMC method that forms a Markov chain asymptotically sampling from the target density. The approximation properties of SGLD have been studied in Sato & Nakagawa (2014) and Teh et al. (2014). Several extensions of SGLD have been proposed: Ahn et al. (2012) made use of the Fisher information besides the noisy gradients, and Patterson and Teh (2013) applied SGLD on the probability simplex. Chen et al. (2014) and Ding et al. (2014) considered second-order Langevin dynamics and made use of momentum terms, extending the vanilla SGLD.

In this study, we develop a parallel and distributed MCMC method for sampling from the full posterior of a broad range of MF models, including models not easily tackled using standard methods such as the Gibbs sampler. Our approach is carefully designed for MF models and builds upon the generic distributed SGLD (DSGLD) framework that was proposed in S. Ahn & Welling (2014), where separate Markov chains are run in parallel on different subsets of the data that are distributed among worker nodes. When applied to MF models, DSGLD results in computational inefficiencies, since it cannot exploit the conditional independence structure of the MF models. Besides, DSGLD requires all the latent variables (i.e., W and H) to be synchronised every few iterations, which introduces a significant communication cost. Moreover, for large problems it may not even be possible to store the latent variables on a single machine; one needs to distribute the latent variables among the nodes as well.

We propose a novel parallel and distributed variant of SGLD for MF models, that we call Parallel SGLD (PSGLD). PSGLD has very favourable scaling properties with increasing data size, remarkably up to the point that the resulting sampler is computationally not much more demanding than an optimisation method such as distributed stochastic gradient descent (DSGD) Gemulla et al. (2011). Reminiscent of DSGD, PSGLD achieves high performance by exploiting the conditional independence structure of the MF models, sub-sampling the data in a systematic manner so as to allow parallelisation. The main advantages of PSGLD can be summarised as follows:

• Due to its inherently parallel structure, PSGLD is faster than SGLD by several orders of magnitude while being as accurate.

• As we will illustrate in our experiments, PSGLD can easily be implemented in both shared-memory and distributed architectures. This makes the method suitable for very large data sets that might be distributed among many nodes.

• Unlike DSGLD, which requires communicating all the parameters W and H among the worker nodes, PSGLD communicates only small blocks of H. This drastically reduces the communication cost for large W and H.

• The probability distribution of the samples generated by PSGLD converges to the Bayesian posterior.

We evaluate PSGLD on both synthetic and real datasets. Our experiments show that PSGLD can be beneficial in two different settings: 1) a shared-memory setting, where we implement PSGLD on a graphics processing unit (GPU); 2) a distributed setting, where we implement PSGLD on a cluster of computers by using a message passing protocol. Our results show that, in the shared-memory setting, while achieving the same quality, PSGLD is many times faster than a Gibbs sampler on a non-negative matrix factorisation problem; and in the distributed setting, PSGLD easily scales up to matrices with hundreds of millions of entries.

We would like to note that a DSGLD-based, distributed MF framework has been independently proposed by Ahn et al. (2015), where the authors focus on a particular MF model, called Bayesian probabilistic matrix factorisation (BPMF) Salakhutdinov & Mnih (2008). In this study, we focus on a generalised observation model family (Tweedie models), from which several observation models that have been used in important MF models (such as BPMF, Poisson non-negative matrix factorisation (NMF) Lee & Seung (1999), Itakura-Saito NMF Févotte et al. (2009)) are obtained as special cases.

Figure 1: Illustration of the parts and the blocks. Here, we partition the sets [I] and [J] into B=3 pieces. The partitions for this example are chosen as PB([I])={{1,…,I/3},{I/3+1,…,2I/3},{2I/3+1,…,I}} and PB([J])={{1,…,J/3},{J/3+1,…,2J/3},{2J/3+1,…,J}}. Part 1 consists of three non-overlapping blocks, say Π=Λ1∪Λ2∪Λ3, where Λ1={1,…,I/3}×{1,…,J/3}, Λ2={I/3+1,…,2I/3}×{J/3+1,…,2J/3}, and Λ3={2I/3+1,…,I}×{2J/3+1,…,J}. Given the blocks in a part, the corresponding blocks in W and H become conditionally independent, as illustrated in different colours and textures. Therefore, for different blocks, the PSGLD updates can be applied in parallel.

## 2 Stochastic Gradient Langevin Dynamics (SGLD) for Matrix Factorisation

In the last decade, SGD has become very popular due to its low computational requirements and convergence guarantees. SGLD brings the ideas of SGD and LD together in order to generate samples from the posterior distribution in a computationally efficient way. In an algorithmic sense, SGLD is identical to SGD except that it injects Gaussian noise at each iteration. For MF models, SGLD iteratively applies the following update rules in order to obtain the samples W(t) and H(t): W(t)=W(t−1)+ΔW(t) and H(t)=H(t−1)+ΔH(t), where

 ΔW(t) = ϵ(t) ( (N/|Ω(t)|) ∑(i,j)∈Ω(t) ∇W log p(vij|W(t−1), H(t−1)) + ∇W log p(W(t−1)) ) + Ψ(t)
 ΔH(t) = ϵ(t) ( (N/|Ω(t)|) ∑(i,j)∈Ω(t) ∇H log p(vij|W(t−1), H(t−1)) + ∇H log p(H(t−1)) ) + Ξ(t).

Here, N is the number of elements in V, t denotes the iteration number, Ω(t) is the sub-sample that is drawn at iteration t, the set [I] is defined as [I]≡{1,…,I}, ∇ denotes the gradients, and |Ω(t)| denotes the number of elements in Ω(t). The elements of the noise matrices Ψ(t) and Ξ(t) are independently Gaussian distributed:

 ψ(t)ik∼N(ψ(t)ik;0,2ϵ(t)),ξ(t)kj∼N(ξ(t)kj;0,2ϵ(t)).

For convergence, the step size must satisfy the following conditions:

 ∑_{t=0}^{∞} ϵ(t) = ∞,  ∑_{t=0}^{∞} (ϵ(t))² < ∞ (4)

A typical choice for the step size is ϵ(t) = a(b+t)^(−γ).

In SGLD, the sub-sample can be drawn with or without replacement. When dealing with MF models, instead of sub-sampling the data arbitrarily, one might come up with more clever sub-sampling schemas that could reduce the computational burden drastically by enabling parallelism. In the next section, we will describe our novel method, PSGLD, where we utilise a systematic sub-sampling schema by exploiting the conditional independence structure of MF models.
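As a sketch of these update rules, the following minimal implementation assumes (for illustration only) a Gaussian observation model with standard normal priors; the function name `sgld_step` and the mini-batch format are our own choices, not part of the original algorithm:

```python
import numpy as np

def sgld_step(V, W, H, eps, idx, rng):
    """One SGLD iteration for an MF model (Gaussian likelihood/priors assumed).

    idx: list of (i, j) index pairs -- the mini-batch Omega^(t).
    """
    N = V.size
    scale = N / len(idx)          # re-weight the mini-batch likelihood term
    gW = -W.copy()                # gradient of log N(W; 0, I) prior
    gH = -H.copy()
    for (i, j) in idx:
        r = V[i, j] - W[i, :] @ H[:, j]       # residual of this entry
        gW[i, :] += scale * r * H[:, j]
        gH[:, j] += scale * r * W[i, :]
    # Langevin proposal: gradient step plus injected Gaussian noise
    W = W + eps * gW + np.sqrt(2 * eps) * rng.standard_normal(W.shape)
    H = H + eps * gH + np.sqrt(2 * eps) * rng.standard_normal(H.shape)
    return W, H
```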

## 3 Parallel SGLD for Matrix Factorisation

In this section, we describe the details of PSGLD. Inspired by Liu et al. (2010); Gemulla et al. (2011); Recht & Ré (2013), PSGLD utilises a biased sub-sampling schema where the observed data is carefully partitioned into mutually disjoint blocks and the latent factors are partitioned accordingly. An illustration of this approach is depicted in Figure 1. In this particular example, the observed matrix V is partitioned into 3×3=9 disjoint blocks and the latent factors W and H are each partitioned accordingly into 3 blocks. At each iteration, PSGLD sub-samples a group of blocks from V, called a part, in such a way that the blocks in a part do not ‘touch’ each other in any dimension of V, as illustrated in Figure 1. This biased sub-sampling schema enables parallelism since, given a part, the SGLD updates can be applied to the different blocks of the latent factors in parallel.

In the example given in Figure 1, we arbitrarily partition the data into equal-sized blocks, where these blocks are obtained in a straightforward manner by partitioning V using a grid. In the general case, the observed matrix will be partitioned into B×B blocks, and these blocks can be formed in a data-dependent manner instead of using simple grids.

Let us formally define a block and a part. First, we need to define a partition of a set S as PB(S), where PB(S) contains B non-empty disjoint subsets of S whose union is equal to S. Here, B denotes the number of subsets that the partition contains. We will define the blocks and the parts by using partitions of the sets [I] and [J].

###### Definition 1.

A block Λ is the Cartesian product of two sets, one of them being in PB([I]) and the other one being in PB([J]). Formally, it is defined as follows:

 Λ=I×J (5)

where I∈PB([I]) and J∈PB([J]).

###### Definition 2.

A part Π(t) at iteration t is a collection of B mutually disjoint blocks and is defined as follows:

 Π(t)=∪Bb=1Λ(t)b=∪Bb=1I(t)b×J(t)b (6)

where all the blocks are mutually disjoint, formally,

 I(t)b∈PB([I]),J(t)b∈PB([J]) I(t)b∩I(t)b′=∅,J(t)b∩J(t)b′=∅,∀b≠b′.
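A grid construction in the spirit of Figure 1 can be sketched as follows; `grid_parts` is a hypothetical helper name, and the ‘diagonal’ pairing of row and column pieces is one simple way to produce parts that satisfy Definition 2:

```python
import numpy as np

def grid_parts(I, J, B):
    """Partition [I] and [J] into B contiguous pieces and return B 'diagonal'
    parts; each part is a list of B mutually disjoint blocks
    (row_indices, col_indices), and together the parts cover all of V.
    """
    rows = np.array_split(np.arange(I), B)
    cols = np.array_split(np.arange(J), B)
    parts = []
    for p in range(B):
        # part p pairs row piece b with column piece (b + p) mod B
        parts.append([(rows[b], cols[(b + p) % B]) for b in range(B)])
    return parts
```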

Suppose we read a part Π(t) at iteration t. Then the SGLD update for W can be written as follows:

 ΔW(t)= ϵ(t)(N|Π(t)|∑(i,j)∈Π(t)∇Wlogp(vij|⋅) +∇Wlogp(W(t−1)))+Ψ(t) = ϵ(t)(N|Π(t)|B∑b=1∑(i,j)∈Λ(t)b∇Wlogp(vij|⋅) +∇Wlogp(W(t−1)))+Ψ(t) (7)

Since all the blocks Λ(t)b are mutually disjoint, we can decompose Equation 7 into B interchangeable updates (i.e., they can be applied in any order), that are given as follows: W(t)b = W(t−1)b + ΔW(t)b, where

 ΔW(t)b= ϵ(t)(N|Π(t)|∑(i,j)∈Λ(t)b∇Wblogp(vij|W(t−1)b,H(t−1)b) +∇Wblogp(W(t−1)b))+Ψ(t)b (8)

for all b∈{1,…,B}. Here, W(t)b and H(t)b are the latent factor blocks at iteration t, that are determined by the current data block Λ(t)b and are formally defined as follows:

 W(t)b ≡{w(t)ik|i∈I(t)b,k∈[K]} H(t)b ≡{h(t)kj|j∈J(t)b,k∈[K]}

The noise matrix Ψ(t)b is of the same size as W(t)b and its entries are independently Gaussian distributed with mean 0 and variance 2ϵ(t).

Similarly, we obtain interchangeable update rules for H that are given as follows: H(t)b = H(t−1)b + ΔH(t)b, where

 ΔH(t)b= ϵ(t)(N|Π(t)|∑(i,j)∈Λ(t)b∇Hblogp(vij|W(t−1)b,H(t−1)b) +∇Hblogp(H(t−1)b))+Ξ(t)b (9)

for all b∈{1,…,B}. Similarly, Ξ(t)b is of the same size as H(t)b and its entries are independently Gaussian distributed with mean 0 and variance 2ϵ(t). The parallelism of PSGLD comes from the fact that all these update rules are interchangeable, so that we can apply them in parallel. The pseudo-code of PSGLD is given in Algorithm 1.
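A minimal sketch of one PSGLD iteration, again assuming a Gaussian likelihood and priors for illustration (the paper's setting is more general): because the blocks of a part touch disjoint rows of W and disjoint columns of H, the loop body below could equally well run in parallel threads or processes:

```python
import numpy as np

def psgld_iter(V, W, H, eps, part, rng):
    """One PSGLD iteration over the blocks of a part (Gaussian model assumed).

    part: list of (row_indices, col_indices) pairs -- disjoint blocks.
    """
    N = V.size
    size = sum(len(r) * len(c) for r, c in part)       # |Pi(t)|
    for r, c in part:                                  # blocks are disjoint, so
        Wb, Hb = W[r], H[:, c]                         # this loop parallelises
        R = V[np.ix_(r, c)] - Wb @ Hb                  # block residuals
        gWb = (N / size) * (R @ Hb.T) - Wb             # likelihood + prior grads
        gHb = (N / size) * (Wb.T @ R) - Hb
        W[r] = Wb + eps * gWb + np.sqrt(2 * eps) * rng.standard_normal(Wb.shape)
        H[:, c] = Hb + eps * gHb + np.sqrt(2 * eps) * rng.standard_normal(Hb.shape)
    return W, H
```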

### 3.1 Convergence Analysis

Since we are making use of a biased sub-sampling schema, it is not clear that the samples generated by PSGLD will converge to the Bayesian posterior. In this section, we define certain conditions on the selection of the parts and, provided these conditions hold, we show that the probability distribution of the samples W(t) and H(t) converges to the Bayesian posterior p(W,H|V).

For theoretical use, we define θ as the parameter vector that contains both W and H:

 θ≜[vec(W)⊤,vec(H)⊤]⊤ (10)

where vec(·) denotes the vectorisation operator. We also define

 L(θ(t)) ≜ log p(θ(t)) + ∑(i,j)∈[I]×[J] log p(vij|θ(t))
 L̂(θ(t)) ≜ log p(θ(t)) + (N/|Π(t)|) ∑(i,j)∈Π(t) log p(vij|θ(t))

Then, the stochastic noise is given by

 ζ(t)=∇θ^L(θ(t))−∇θL(θ(t)). (11)

Under the following conditions Theorem 1 holds.

###### Condition 1.

The step size ϵ(t) satisfies Equation 4.

###### Condition 2.

The part Π(t) is chosen from non-overlapping parts whose union covers the whole observed matrix (e.g., the parts given in Figure 1). The probability of choosing a part Π at iteration t is proportional to its size:

 p(Π(t)=Π)=|Π|/N.

###### Theorem 1.

Let qt(θ) be the probability density function of the samples θ(t) that are generated by PSGLD. Then, the probability distribution of θ(t) converges to the Bayesian posterior p(θ|V):

 limt→∞qt(θ)=p(θ|V). (12)
###### Proof sketch.

Under Condition 2, we can show that ∇θL̂(θ(t)) is an unbiased estimator of ∇θL(θ(t)); therefore the stochastic noise is zero-mean:

 E[ζ(t)]=0.

The rest of the proof is similar to Sato & Nakagawa (2014). Under Conditions 1 and 2, we can show that qt(θ) follows the (multi-dimensional) Fokker-Planck equation and therefore the stationary distribution of θ(t) is p(θ|V). ∎

Algorithm 1: Pseudo-code of PSGLD.

### 3.2 Non-negativity Constraints

In certain applications, all the elements of V, W, and H are required to be non-negative, which is known as non-negative matrix factorisation (NMF) Lee & Seung (1999). As we will illustrate in Section 4, the non-negativity constraint is often a necessity in certain probabilistic models, where we essentially decompose the parameters of the probabilistic model that are non-negative by definition (e.g., the intensity of a Poisson distribution or the mean of a gamma distribution).

In an SGD framework, the latent factors can be kept in a constraint set by using projections that apply the minimum force needed to keep the variables in the constraint set. However, since we are in an MCMC framework, it is not clear that appending a projection step to the PSGLD updates would still result in a proper MCMC method. Instead, similar to Patterson & Teh (2013), we make use of a simple mirroring trick, where we replace the negative entries of W and H with their absolute values. Formally, we let W and H take values in the whole Euclidean space; however, we parametrise the prior and the observation models with the absolute values |W| and |H|. Since wik and −wik (similarly, hkj and −hkj) will be equiprobable in this setting, we can replace the negative elements of W and H with their absolute values without violating the convergence guarantee.
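The mirroring trick can be sketched as a one-line modification of the Langevin step; `grad_logp` here stands for the gradient of the log-posterior term being updated, an assumption for illustration:

```python
import numpy as np

def mirrored_step(X, grad_logp, eps, rng):
    """One Langevin step followed by mirroring into the non-negative orthant.

    Because the model depends on X only through |X|, reflecting negative
    entries leaves the target distribution invariant.
    """
    X = X + eps * grad_logp(X) + np.sqrt(2 * eps) * rng.standard_normal(X.shape)
    return np.abs(X)   # reflect negative entries back to their absolute values
```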

## 4 Experiments

In this section we will present our experiments where we evaluate PSGLD on both synthetic and real datasets using the non-negative matrix factorisation (NMF) model. In order to be able to cover a wide range of likelihood functions, we consider the following probabilistic model:

 p(W)=∏ik E(wik;λw),  p(H)=∏kj E(hkj;λh)
 p(V|W,H)=∏ij TW(vij; ∑k wikhkj, ϕ, β) (13)

where i∈[I], k∈[K], and j∈[J]. Here, E and TW denote the exponential and Tweedie distributions, respectively. The Tweedie distribution is an important special case of the exponential dispersion models Jørgensen (1997) and has been shown to be useful for factorisation models Yilmaz et al. (2011). The Tweedie density can be written in the following form:

 TW(v;μ,ϕ,β) = (1/K(v,ϕ,β)) exp(−(1/ϕ) dβ(v‖μ))

where μ is the mean, ϕ is the dispersion (related to the variance), β is the power parameter, K(v,ϕ,β) is the normalizing constant, and dβ(·‖·) denotes the β-divergence that is defined as follows:

 dβ(v‖μ) = v^β/(β(β−1)) − vμ^(β−1)/(β−1) + μ^β/β.

The β-divergence generalises many divergence functions that are commonly used in practice. As special cases, we obtain the Itakura-Saito divergence, the Kullback-Leibler divergence, and the squared Euclidean distance for β = 0, 1, 2, respectively. From the probabilistic perspective, different choices of β yield important distributions such as the gamma (β=0), Poisson (β=1), Gaussian (β=2), compound Poisson (0<β<1), and inverse Gaussian (β=−1) distributions. Due to a technical condition, no Tweedie model exists for the interval 1<β<2, but for all other values of β one obtains the very rich family of Tweedie stable distributions Jørgensen (1997). Thanks to the flexibility of the Tweedie distribution, we are able to choose an observation model by changing a single parameter β, without modifying the inference algorithm.

In most of the special cases of the Tweedie distribution, the normalizing constant is an infinite sum and cannot be written in a simple analytical form. Fortunately, provided that ϕ and β are given, the normalizing constant becomes irrelevant since it does not depend on the mean parameter, and therefore on W and H. Consequently, the PSGLD updates only involve the partial derivatives of the β-divergence with respect to W and H, which are tractable.

Figure 2: Shared-memory experiments with a) the Poisson observation model b) the compound Poisson observation model.
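For concreteness, the β-divergence and its partial derivative with respect to the mean — the only model-specific ingredients the gradient updates need — can be written as follows (valid for β ∉ {0, 1}, where the limits give the Itakura-Saito and Kullback-Leibler cases):

```python
def beta_div(v, mu, beta):
    """beta-divergence d_beta(v || mu), for beta not in {0, 1}."""
    return (v**beta / (beta * (beta - 1))
            - v * mu**(beta - 1) / (beta - 1)
            + mu**beta / beta)

def beta_div_grad_mu(v, mu, beta):
    """d/dmu of d_beta(v || mu); the Tweedie normaliser does not depend
    on mu, so this is all a gradient-based sampler requires."""
    return mu**(beta - 1) - v * mu**(beta - 2)
```

For β = 2 this reduces to half the squared Euclidean distance, matching the Gaussian observation model.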

### 4.1 Experimental Setup

We will compare PSGLD with different MCMC methods, namely the Gibbs sampler, LD, and SGLD. We will conduct our experiments in two different settings: 1) a shared-memory setting, where the computation is done on a single multicore computer; 2) a distributed setting, where we make use of a cluster of computing nodes.

It is easy to derive the update equations required by the gradient-based methods for the Tweedie-NMF model. However, developing a Gibbs sampler for this general model is unfortunately not obvious. We could derive Gibbs samplers for certain special cases of the Tweedie model, such as the Poisson-NMF Cemgil (2009), where β = 1 and ϕ = 1. Moreover, in order to derive the full conditional distributions that are required by the Gibbs sampler, we need to introduce an auxiliary tensor S and augment the probabilistic model in Equation 13 as follows:

 p(wik)=E(wik;λw),p(hkj)=E(hkj;λh) p(sijk)=PO(sijk;wikhkj),vij=∑ksijk

where PO denotes the Poisson distribution.

The LD and Gibbs samplers require a pass over the whole observed matrix V at every iteration. The Gibbs sampler further requires the whole auxiliary tensor S to be sampled at each iteration.

### 4.2 Shared-Memory Setting

In this section, we will compare the mixing rates and the computation times of all the aforementioned methods in a shared-memory setting. We will first compare the methods on synthetic data, then on musical audio data.

We conduct all the shared-memory experiments on a MacBook Pro with a quad-core Intel Core i7 CPU, GB of memory, and an NVIDIA GeForce GT 750M graphics card. We have implemented PSGLD on the GPU in CUDA C. We have implemented the other methods on the CPU in C, where we have used the GNU Scientific Library and BLAS for the matrix operations.

#### Experiments on Synthetic Data

In order to be able to compare all the methods, in our first experiment we use the Poisson-NMF model. We first generate W, H, and V by using the generative model. Then, we run all the methods in order to obtain the samples W(t) and H(t). In order to obtain the blocks, we partition the sets [I] and [J] into B equal pieces, where we simply partition V by using a grid, similar to the example given in Figure 1. Initially, we choose B different parts whose union covers the whole observed matrix V, similar to the ones in Figure 1. At each iteration, we choose one of these parts in cyclic order, i.e., we proceed to the next part at each iteration and return to the first part every B iterations. Since the sizes of all the parts are the same, Condition 2 is satisfied.

In LD, we use a constant step size, whereas in SGLD and PSGLD we set the step sizes as ϵ(t) = a(b+t)^(−γ). For each method, we tried several values for the parameters and report the results for the best performing ones. The results are not very sensitive to the actual values of these parameters, provided they are set in a reasonable range. Furthermore, in SGLD, we draw the sub-samples in a with-replacement manner.

Figure 2 shows the mixing rates and the running times of the methods under the Poisson model for different data sizes. While plotting the log-likelihood of the state of the Markov chain is not necessarily an indication of convergence to the stationary distribution, it nevertheless provides a simple indicator of whether the sampler is stuck around a low-probability mode. We generate samples from the Markov chain with each method. We can observe that, in all cases, SGLD achieves poor mixing rates due to the with-replacement sub-sampling schema, while LD achieves better mixing rates than SGLD. Moreover, while the LD updates can be implemented highly efficiently using BLAS, the reduced data access of SGLD is not reflected in reduced computation time, due to the random data access pattern when selecting sub-samples from V.

The results show that PSGLD and the Gibbs sampler achieve much better mixing rates. However, we observe an enormous difference in the running times of these methods – PSGLD is many times faster than the Gibbs sampler on a GPU, while achieving virtually the same quality. For example, in our largest model the Gibbs sampler runs for hours, while PSGLD completes the burn-in phase in about a second and generates the remaining samples from the Markov chain in a matter of seconds, even when there are millions of entries in V. Naturally, this gap between PSGLD and the Gibbs sampler becomes more pronounced with increasing problem size. We also observe that PSGLD is many-fold faster than LD and SGLD while achieving a much better mixing rate.

Figure 3: a) The audio spectrum of a short piano piece b) The spectral dictionaries learned by PSGLD and LD.

We also evaluate PSGLD with a compound Poisson observation model. This distribution is particularly suited for sparse data, as it has a non-zero probability mass at zero and a continuous density on the positive reals Jørgensen (1997). Even though the probability density function of this distribution cannot be written in a closed-form analytical expression, we can fortunately still generate random samples from the distribution in order to obtain a synthetic V.

Since deriving a Gibbs sampler for the compound Poisson model is not obvious, we compare only LD, SGLD, and PSGLD on this model. Figure 2 shows the performance of these methods. We obtain qualitatively similar results; PSGLD achieves a much better mixing rate and is much faster than the other methods.

#### Experiments on Audio

The Tweedie-NMF model has been widely used for audio and music modelling Févotte & Idier (2011). In the musical audio context, the observed matrix V is taken to be a time-frequency representation of the audio signal, such as the power or magnitude spectrum computed via the short-time Fourier transform. Here, the index i denotes the frequency bins, whereas the index j denotes the time frames. An example audio spectrum belonging to a short piano excerpt is given in Figure 3.

When the audio spectrum is decomposed by using an NMF model, each column of W will contain a different spectral template and each row of H will contain the activations through time for a particular spectral template. In music processing applications, each spectral template is expected to capture the spectral shape of a certain musical note, and the activations are expected to capture the loudness of the notes.

We decompose the audio spectrum given in Figure 3 and visually compare the dictionary matrices that are learned by LD and PSGLD. For PSGLD, we partition the sets [I] and [J] into B equal pieces and we choose the parts in cyclic order at each iteration. With each method, we generate samples but discard the samples in the burn-in phase. Figure 3 shows the Monte Carlo averages that are obtained by the different methods. We observe that PSGLD successfully captures the spectral shapes of the different notes and the chords that occur in the piece, even though the method is completely unsupervised. We also observe that LD is able to capture the spectral shapes of most of the notes as well, but estimates a less sparse dictionary. Furthermore, PSGLD runs in a much shorter time than LD – as a reference, the Gibbs sampler needs to run for far longer on the same problem.

Figure 4: Illustration of the communication mechanism. There are 3 nodes and B is set to 3 accordingly. The numbers inside the blocks denote the nodes in which the corresponding blocks are located. At each iteration, node n transfers its Hb block to node (n mod B)+1. The blocks Wb are kept in the same node throughout the process. This strategy implicitly determines the part to be used at the next iteration.

### 4.3 Distributed-Hybrid Setting

In this section, we will focus on the implementation of PSGLD in a distributed setting, where each block of V might reside on a different node. We will consider a distributed architecture that contains three main components: 1) the data nodes that store the blocks of V; 2) the computational nodes that execute the PSGLD updates; 3) the main node, which is responsible only for submitting the jobs to the computational nodes at the beginning of the sampling process.

In the distributed setting, we implement PSGLD with a message passing protocol in C using the OpenMPI library. PSGLD is naturally suited to message passing environments, and the low-level control over the distributed computations provides more insight than other platforms such as Hadoop MapReduce. On the other hand, it would be straightforward to implement PSGLD in a MapReduce environment for commercial and fault-tolerant applications.

In our implementation, we make use of an efficient communication mechanism, where we set the number of blocks B to the number of available nodes. As illustrated in Figure 4, throughout the sampling process, each node is responsible only for a certain block Wb; however, at the end of each iteration it transfers its current Hb block to the adjacent node in a cyclic fashion. With this mechanism, the part is determined implicitly at each iteration, depending on the current locations of the factor blocks Wb and Hb. Besides, as opposed to many distributed MCMC methods such as DSGLD, this mechanism gives PSGLD a much lower communication cost, especially for large I, J, and K values.
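The cyclic transfer schedule can be sketched with a small helper (a simulation of the bookkeeping only, not actual MPI code); `owner_after` is a hypothetical name:

```python
def owner_after(h_block, t, B):
    """Node (1-based) holding H-block `h_block` after t ring transfers,
    when block b starts on node b and every node sends its H-block to
    node (n mod B) + 1 at the end of each iteration."""
    return ((h_block - 1 + t) % B) + 1
```

After B iterations every (Wb, Hb') pairing has been visited exactly once, which is what implicitly enumerates the parts.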

We conduct our distributed-setting experiments on a cluster where each computational node has Intel Xeon CPUs and GB of memory. Therefore, provided that the memory is sufficient, we are able to run concurrent processes on our cluster. In our experiments, by assuming that the network connection between the computational nodes is sufficiently fast, we treat each concurrent process as a separate computational node.

We evaluate PSGLD on a large movie ratings dataset, MovieLens (grouplens.org). This dataset contains millions of ratings applied to movies by users, resulting in a sparse observed matrix V. In all our experiments, we set B to the number of available nodes, where we partition the sets [I] and [J] into B equal pieces, similar to the shared-memory experiments. In these experiments, the sizes of the parts are close to each other; therefore our communication mechanism satisfies Condition 2.

In our first experiment, our goal is to contrast the speed of our sampling algorithm with that of a distributed optimisation algorithm. Clearly, the goals of the two computations are different (a sampler does not solve an optimisation problem unless techniques such as simulated annealing are used), yet monitoring the root mean squared error (RMSE) between V and WH throughout the iterations provides a qualitative picture of the convergence behaviour of the algorithms. Figure 5 shows the RMSE values of PSGLD and the distributed stochastic gradient descent (DSGD) algorithm Gemulla et al. (2011). We observe very similar convergence behaviour and running times for both methods. The results indicate that PSGLD makes Bayesian inference possible for MF models even for large datasets, by generating samples from the Bayesian posterior while at the same time being as fast as state-of-the-art distributed optimisation algorithms.

In our last set of experiments, we demonstrate the scalability of PSGLD. Firstly, we vary the number of nodes and generate the same number of samples in each setting. Figure 6 shows the running times of PSGLD for different numbers of nodes. The results show that the running time reduces almost quadratically as we increase the number of nodes, up to a point; beyond it, the communication cost dominates and the running time increases.

Finally, in order to illustrate how PSGLD scales with the size of the data, we increase the size of V while increasing the number of nodes accordingly. We start with the original dataset, then duplicate V in both dimensions (so that the number of elements quadruples) and increase the number of nodes accordingly. We repeat this procedure two more times. Figure 6 shows the running times of PSGLD for increasing data sizes and numbers of nodes. The results show that, even though the size of the data grows many-fold, the running time of PSGLD remains nearly constant provided we can increase the number of nodes proportionally.

Figure 6: Scalability of PSGLD. a) The size of the data is kept fixed, the number of nodes is increased b) The size of the data and the number of nodes are increased proportionally.

## 5 Conclusion

We have described a scalable MCMC method for sampling from the posterior distribution of an MF model and tested the performance of our approach in terms of accuracy, speed, and scalability on various modern architectures. Our results suggest that, contrary to the established folklore in ML, inference methods for ‘big data’ are not limited to optimisation: Monte Carlo methods are competitive in this regime as well. The existence of efficient samplers paves the way to full Bayesian inference; due to lack of space we have not presented natural applications such as model selection.

We conclude with the remark that it is rather straightforward to extend PSGLD to more structured models such as coupled matrix and tensor factorisation models, where several datasets are decomposed simultaneously; the distributed nature of PSGLD is arguably even more attractive when the data are naturally stored at different physical locations.

## Acknowledgments

This work is supported by the Scientific and Technological Research Council of Turkey (TÜBİTAK) Grant no. 113M492. Umut Şimşekli is also funded by a PhD Scholarship from TÜBİTAK.

### Footnotes

1. In the rest of the paper, we will use bold capital letters to denote matrices, e.g., **V**, bold small letters to denote vectors, e.g., **v**, and small regular letters to denote scalars, e.g., v.

### References

1. Ahn, S., Korattikara, A., Liu, N., Rajan, S., and Welling, M. Large-scale distributed Bayesian matrix factorization using stochastic gradient MCMC. In KDD, 2015.
2. Cemgil, A. T. Bayesian inference in non-negative matrix factorisation models. Computational Intelligence and Neuroscience, 2009.
3. Chen, T., Fox, E.B., and Guestrin, C. Stochastic gradient Hamiltonian Monte Carlo. In Proc. International Conference on Machine Learning, June 2014.
4. Cichoki, A., Zdunek, R., Phan, A.H., and Amari, S. Nonnegative Matrix and Tensor Factorization. Wiley, 2009.
5. Devarajan, Karthik. Nonnegative matrix factorization: An analytical and interpretive tool in computational biology. PLoS Computational Biology, 4, 2008.
6. Ding, Nan, Fang, Youhan, Babbush, Ryan, Chen, Changyou, Skeel, Robert D., and Neven, Hartmut. Bayesian sampling using stochastic gradient thermostats. In Advances in Neural Information Processing Systems, pp. 3203–3211, 2014.
7. Févotte, C., Bertin, N., and Durrieu, J. L. Nonnegative matrix factorization with the Itakura-Saito divergence, with application to music analysis. Neural Computation, 21:793–830, 2009.
8. Févotte, Cédric and Idier, Jérôme. Algorithms for nonnegative matrix factorization with the -divergence. Neural Computation, 23(9):2421–2456, 2011.
9. Gemulla, Rainer, Nijkamp, Erik, Haas, Peter J., and Sismanis, Yannis. Large-scale matrix factorization with distributed stochastic gradient descent. In ACM SIGKDD, 2011.
10. Jørgensen, B. The Theory of Dispersion Models. Chapman & Hall/CRC Monographs on Statistics & Applied Probability, 1997.
11. Kushner, H. and Yin, G. Stochastic Approximation and Recursive Algorithms and Applications. Springer, 2003.
12. Lee, D. D. and Seung, H. S. Learning the parts of objects by non-negative matrix factorization. Nature, 401, 1999.
13. Liu, Chao, chih Yang, Hung, Fan, Jinliang, He, Li-Wei, and Wang, Yi-Min. Distributed nonnegative matrix factorization for web-scale dyadic data analysis on mapreduce. In Proceedings of the 19th International World Wide Web Conference, April 2010.
14. Neal, Radford M. MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, 54, 2010.
15. Patterson, S. and Teh, Y. W. Stochastic gradient Riemannian Langevin dynamics on the probability simplex. In Advances in Neural Information Processing Systems, 2013.
16. Recht, Benjamin and Ré, Christopher. Parallel stochastic gradient algorithms for large-scale matrix completion. Mathematical Programming Computation, 2013.
17. Ahn, S., Korattikara, A., and Welling, M. Bayesian posterior sampling via stochastic gradient Fisher scoring. In ICML, 2012.
18. Ahn, S., Shahbaba, B., and Welling, M. Distributed stochastic gradient MCMC. In ICML, 2014.
19. Salakhutdinov, Ruslan and Mnih, Andriy. Bayesian probabilistic matrix factorization using Markov chain Monte Carlo. In Proceedings of the 25th International Conference on Machine Learning, pp. 880–887, 2008.
20. Sato, Issei and Nakagawa, Hiroshi. Approximation analysis of stochastic gradient Langevin dynamics by using Fokker-Planck equation and Itô process. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pp. 982–990. JMLR Workshop and Conference Proceedings, 2014.
21. Smaragdis, Paris and Brown, Judith C. Non-negative matrix factorization for polyphonic music transcription. In In IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, pp. 177–180, 2003.
22. Teh, Y. W., Thiéry, A. H., and Vollmer, S. J. Consistency and fluctuations for stochastic gradient Langevin dynamics. submitted, 2014.
23. Welling, M. and Teh, Y. W. Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the International Conference on Machine Learning, 2011.
24. Yilmaz, Y. K., Cemgil, A. T., and Simsekli, U. Generalised coupled tensor factorisation. In NIPS, 2011.