HOGWILD!-Gibbs Can Be PanAccurate

Constantinos Daskalakis
EECS & CSAIL, MIT
costis@csail.mit.edu
Supported by NSF CCF-1617730, CCF-1650733, and ONR N00014-12-1-0999.
   Nishanth Dikkala
EECS & CSAIL, MIT
nishanthd@csail.mit.edu
Supported by NSF CCF-1617730, CCF-1650733, and ONR N00014-12-1-0999.
   Siddhartha Jayanti
EECS & CSAIL, MIT
jayanti@mit.edu
Supported by the Department of Defense (DoD) through the National Defense Science & Engineering Graduate Fellowship (NDSEG) Program.
Abstract

Asynchronous Gibbs sampling has been recently shown to be fast-mixing and an accurate method for estimating probabilities of events on a small number of variables of a graphical model satisfying Dobrushin’s condition [DeSaOR16]. We investigate whether it can be used to accurately estimate expectations of functions of all the variables of the model. Under the same condition, we show that the synchronous (sequential) and asynchronous Gibbs samplers can be coupled so that the expected Hamming distance between their (multivariate) samples remains bounded by $O(\tau \log n)$, where $n$ is the number of variables in the graphical model, and $\tau$ is a measure of the asynchronicity. A similar bound holds for any constant power of the Hamming distance. Hence, the expectation of any function that is Lipschitz with respect to a power of the Hamming distance can be estimated with a bias that grows logarithmically in $n$. Going beyond Lipschitz functions, we consider the bias arising from asynchronicity in estimating the expectation of polynomial functions of all the variables in the model. Using recent concentration of measure results [DaskalakisDK17, GheissariLP17, GotzeSS18], we show that the bias introduced by the asynchronicity is of smaller order than the standard deviation of the function value already present in the true model. We perform experiments on a multi-processor machine to empirically illustrate our theoretical findings.

1 Introduction

The increasingly ambitious applications of data analysis, and the corresponding growth in the size of the data that needs to be processed, have brought important scalability challenges to machine learning algorithms. Fundamental methods such as Gradient Descent and Gibbs sampling, which were designed with a sequential computational model in mind, must now be applied to datasets of ever larger size. As such, there has recently been increased interest in developing techniques for parallelizing these methods. However, these algorithms are inherently sequential and are difficult to parallelize.

HOGWILD!-SGD, proposed by Niu et al. [recht2011hogwild], is a lock-free asynchronous execution of stochastic gradient descent that has been shown to converge under the right sparsity conditions. Several variants of this method, and extensions of the asynchronous execution approach, have since been proposed and have found successful application in a broad range of areas, from PageRank approximation to deep learning and recommender systems [yu2012scalable, noel2014dogwild, mitliagkas2015frogwild, mania2015perturbed, liu2015asynchronous, de2015taming].

Similar to HOGWILD!-SGD, lock-free asynchronous execution of Gibbs sampling, called HOGWILD!-Gibbs, was proposed by Smola and Narayanamurthy [smola2010architecture], and empirically shown to work well on several models [zhang2014dimmwitted]. Johnson et al. [johnson2013analyzing] provide sufficient conditions under which they show theoretically that HOGWILD!-Gibbs produces samples with the correct mean in Gaussian models, while Terenin et al. [Terenin15] propose a modification to the algorithm that is shown to converge under some strong assumptions on asynchronous computation.

Input: Set of variables $V$, initial configuration $x^{(0)} \in \Sigma^n$, distribution $\pi$
Initialization: start at state $x^{(0)}$;
for $t = 1$ to $T$ do
       Sample $v$ uniformly from $V$;
       Sample $s \sim \pi(\cdot \mid x^{(t-1)}_{-v})$ and set $x^{(t)}_v \leftarrow s$;
       For all $u \ne v$, set $x^{(t)}_u \leftarrow x^{(t-1)}_u$;
end for
Algorithm 1 Gibbs Sampling
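To make Algorithm 1 concrete, the following is a minimal sketch of the sequential sampler for the Ising models studied in Section 4. The dictionary-based graph representation and function names are our own illustrative choices, not notation from the paper; the single-site conditional it uses is the standard Ising update rule, derived in Section 2 below.

```python
import math
import random

def ising_conditional_prob_plus(theta, x, v):
    """P[x_v = +1 | x_{-v}] for an Ising model with edge parameters
    theta[v][u] and no external field: a logistic function of the
    weighted sum over v's neighborhood (see Section 2)."""
    s = sum(theta[v][u] * x[u] for u in theta[v])
    return 1.0 / (1.0 + math.exp(-2.0 * s))

def sequential_gibbs(theta, x0, num_steps, rng=random.Random(0)):
    """Algorithm 1: repeatedly pick a uniformly random node and resample
    it from its conditional distribution, keeping all other nodes fixed."""
    x = dict(x0)
    nodes = list(x)
    for _ in range(num_steps):
        v = rng.choice(nodes)
        x[v] = +1 if rng.random() < ising_conditional_prob_plus(theta, x, v) else -1
    return x
```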

In a more recent paper, De Sa et al. [DeSaOR16] propose the study of HOGWILD!-Gibbs under a stochastic model of asynchronicity in graphical models with discrete variables. Whenever the graphical model satisfies Dobrushin’s condition, they show that the mixing time of the asynchronous Gibbs sampler is similar to that of the sequential (synchronous) one. Moreover, they establish that the asynchronous Gibbs sampler accurately estimates probabilities of events on a sublinear number of variables; in particular, events on up to $o(n)$ variables can be estimated with vanishing bias in variational distance, where $n$ is the total number of variables in the graphical model (Lemma 2, [DeSaOR16]).

Our Results.

Our goal in this paper is to push the theoretical understanding of HOGWILD!-Gibbs to the estimation of functions of all the variables in a graphical model. In particular, we are interested in whether HOGWILD!-Gibbs can be used to accurately estimate the expectations of such functions. Results from [DeSaOR16] imply that an accurate estimation is possible whenever the function under consideration is Lipschitz with a good Lipschitz constant with respect to the Hamming metric. Under the same Dobrushin condition used in [DeSaOR16] (see Definition 3), and under a stochastic model of asynchronicity with weaker assumptions (see Section 2.1), we show that one can do better than the bounds implied by [DeSaOR16], even for functions with bad Lipschitz constants. For instance, consider quadratic functions $f_a(x) = \sum_{u \ne v} a_{uv}\, x_u x_v$ on an Ising model, which is a binary graphical model, and serves as a canonical example of Markov random fields [LevinPW09, MontanariS10, Felsenstein04, DaskalakisMR11, GemanG86, Ellison93]. Under appropriate normalization ($\|a\|_\infty \le 1$), these functions take values in the range $[-O(n^2), O(n^2)]$ and have a Lipschitz constant of $O(n)$ with respect to the Hamming metric. Given this, the results of [DeSaOR16] would imply we can estimate quadratic functions on the Ising model within an error of $O(\tau\, n \log n)$. We improve this error to be of the order $\tilde{O}(\tau \sqrt{n})$, where $\tilde{O}(\cdot)$ hides polylogarithmic factors in $n$. In particular, we show the following in our paper:

  • Starting at the same initial configuration, the executions of the sequential and the asynchronous Gibbs samplers can be coupled so that the expected Hamming distance between the (multivariate) samples that the two samplers maintain is bounded by $O(\tau \log n)$, where $n$ is the number of variables in the graphical model, and $\tau$ is a measure of the average contention in the asynchronicity model of Section 2.1. See Lemma 2. More generally, the expectation of the $K$-th power of the Hamming distance is bounded by $c(K, \alpha)\,(\tau \log n)^K$, for some function $c(K, \alpha)$ of $K$ and the Dobrushin parameter $\alpha$. See Lemma 3.

  • It follows from Lemmas 2 and 3 that, if a function $f$ of the variables of a graphical model is $M$-Lipschitz with respect to the $K$-th power of the Hamming distance, then the bias in the expectation of $f$ introduced by HOGWILD!-Gibbs under the asynchronicity model of Section 2.1 is bounded by $M\, c(K, \alpha)\,(\tau \log n)^K$. See Corollary 1.

  • Next, we improve the bounds of Corollary 1 for functions that are degree-$d$ polynomials of the variables of the graphical model. Low degree polynomials on graphical models are a natural class of functions which are of interest in many statistical tasks performed on graphical models (see, for instance, [DaskalakisDK18]). For simplicity we show these improvements for the Ising model, but our results extend to general graphical models. We show, in Theorem 6, that the bias introduced by HOGWILD!-Gibbs in the expectation of a degree-$d$ polynomial of the Ising model is bounded by $\tilde{O}\big(\tau\, n^{(d-1)/2}\big)$, where $\tilde{O}(\cdot)$ hides polylogarithmic factors in $n$. This bound improves upon the bound computed by Corollary 1 by a factor of about $n^{(d-1)/2}$, as the Lipschitz constant with respect to the Hamming distance of a degree-$d$ polynomial of the Ising model can be up to $O(n^{d-1})$. Importantly, the bias of $\tilde{O}\big(\tau\, n^{(d-1)/2}\big)$ that we show is introduced by the asynchronicity is of a lower order of magnitude than the standard deviation of degree-$d$ polynomials of the Ising model, which is $O(n^{d/2})$ (see Theorem 3), and which is already experienced by the sequential sampler. Moreover, in Theorem 7, we also show that the asynchronous Gibbs sampler does not add a higher-order variance to its sample. Thus, our results suggest that running Gibbs sampling asynchronously leads to a valid bias-variance tradeoff.

    Our bounds for the expected Hamming distance between the sequential and the asynchronous Gibbs samplers follow from coupling arguments, while our improvements for polynomial functions of Ising models follow from a combination of our Hamming bounds and recent concentration of measure results for polynomial functions of the Ising model [DaskalakisDK17, GheissariLP17, GotzeSS18].

  • In Section 5, we illustrate our theoretical findings by performing experiments on a multi-core machine. We experiment with graphical models over two kinds of graphs. The first is the grid graph (which we represent as a torus for degree regularity), where each node has 4 neighbors, and the second is the clique over $n$ nodes.
    We first study how valid the assumptions of the asynchronicity model are. The main assumption in the model was that the average contention parameter $\tau$ doesn’t grow as the number of nodes in the graph grows: it is a constant which depends on the hardware being used, and we observe that this is indeed the case in practice. The expected contention grows linearly with the number of processors on the machine but remains constant with respect to $n$ (see Figures 1 and 2).
    Next, we look at quadratic polynomials over graphical models associated with both the grid and clique graphs. We estimate their expected values under the sequential Gibbs sampler and HOGWILD!-Gibbs and measure the bias (absolute difference) between the two. Our theory predicts that this bias should scale as $\tilde{O}(\sqrt{n})$, and we observe that this is indeed the case (Figure 3). Our experiments are described in greater detail in Section 5.

2 The Model and Preliminaries

In this paper, we consider the Gibbs sampling algorithm as applied to discrete graphical models. The models will be defined on a graph $G = (V, E)$ with $|V| = n$ nodes and will represent a probability distribution $\pi$. We use $\Sigma$ to denote the range of values each node in $V$ can take. For any configuration $x \in \Sigma^n$, $\pi(\cdot \mid x_{-v})$ will denote the conditional distribution of variable $v$ given the values of all the other variables of state $x$.

In Section 4, we will look at Ising models, a particular class of discrete binary graphical models with pairwise local correlations. We consider the Ising model $p$ on a graph $G = (V, E)$ with $|V| = n$ nodes. This is a distribution over $\{\pm 1\}^n$ with a parameter vector $\vec{\theta}$, which has a parameter $\theta_{uv}$ corresponding to each edge $(u, v) \in E$ and a parameter $\theta_v$ corresponding to each node $v \in V$. The probability mass function assigned to a string $x \in \{\pm 1\}^n$ is

$$p(x) = \exp\Bigg(\sum_{v \in V} \theta_v x_v + \sum_{(u,v) \in E} \theta_{uv}\, x_u x_v - \Phi(\vec{\theta})\Bigg),$$

where $\Phi(\vec{\theta})$ is the log-partition function for the distribution. We say an Ising model has no external field if $\theta_v = 0$ for all $v \in V$. For ease of exposition we will focus on the case with no external field in this paper. However, the results extend to Ising models with external fields when the functions under consideration (in Section 4) are appropriately chosen to be centered. See [DaskalakisDK17].
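For concreteness, the single-site conditionals that the Gibbs sampler needs have a simple closed form in the Ising case: all terms of the exponent not involving $x_v$ (including the log-partition function) cancel, giving, in the absence of an external field,

$$\Pr[x_v = +1 \mid x_{-v}] = \frac{\exp\big(\sum_{u : (u,v) \in E} \theta_{uv} x_u\big)}{\exp\big(\sum_{u : (u,v) \in E} \theta_{uv} x_u\big) + \exp\big(-\sum_{u : (u,v) \in E} \theta_{uv} x_u\big)} = \sigma\Big(2 \sum_{u : (u,v) \in E} \theta_{uv} x_u\Big),$$

where $\sigma(z) = 1/(1 + e^{-z})$ is the logistic function. This is the update rule used in the code sketch following Algorithm 1.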

Throughout the paper we will focus on bounded functions defined on the discrete space $\Sigma^n$. For a function $f : \Sigma^n \to \mathbb{R}$, we use $\|f\|_\infty$ to denote the maximum absolute value of the function over its domain. We will use $[n]$ to denote the set $\{1, 2, \ldots, n\}$. In Section 4, we will study polynomial functions over the Ising model. Since $x_v^2 = 1$ always in an Ising model, any polynomial function of degree $d$ can be represented as a multilinear function of degree at most $d$, and we will refer to them interchangeably in the context of Ising models.

Definition 1 (Polynomial/Multilinear Functions of the Ising Model).

A degree-$d$ polynomial defined on variables $x_1, \ldots, x_n$ is a function of the following form:

$$f_a(x) = \sum_{S \subseteq [n] :\, |S| \le d} a_S \prod_{i \in S} x_i,$$

where $a = (a_S)_{S :\, |S| \le d}$ is a coefficient vector.

When the degree $d = 1$, we will refer to the function as a linear function, and when the degree $d = 2$ we will call it a bilinear function. Note that since $x_v^2 = 1$, any polynomial function of an Ising model is a multilinear function. We will use $a$ to denote the coefficient vector of such a multilinear function and $\|a\|_\infty$ to denote the maximum element of $a$ in absolute value. Note that we will use permutations of the subscripts to refer to the same coefficient, i.e., $a_{ij}$ is the same as $a_{ji}$. Also, we will use the term $d$-linear function to refer to a multilinear function of degree $d$.
At times, for simplicity, we will instead consider degree-$d$ polynomials of the form $f_a(x) = \sum_{i_1, \ldots, i_d \in [n]} a_{i_1 \cdots i_d}\, x_{i_1} \cdots x_{i_d}$. This form involves multiplicity in the terms, i.e., each monomial might appear multiple times. We can map from degree-$d$ polynomials without multiplicity of terms to functions of the above form in a straightforward manner by dividing each coefficient by an appropriate constant which captures how many times the term appears in the above notation. This constant lies between $1$ and $d!$.
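As a concrete instance of this mapping for degree $d = 2$:

$$\sum_{i < j} a_{ij}\, x_i x_j = \sum_{i \ne j} a'_{ij}\, x_i x_j \quad \text{with} \quad a'_{ij} = a'_{ji} = \frac{a_{ij}}{2},$$

since each unordered pair $\{i, j\}$ appears twice ($2 = 2!$) in the ordered sum; in general, a monomial on $d$ distinct indices appears up to $d!$ times.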

We now give a formal definition of Dobrushin’s uniqueness condition, also known as the high-temperature regime. First we define the influence of a node $u$ on a node $v$.

Definition 2 (Influence in Graphical Models).

Let $\pi$ be a probability distribution over some set of variables $V$. Let $B_u$ denote the set of state pairs $(X, Y)$ which differ only in their value at variable $u$. Then the influence of node $u$ on node $v$ is defined as

$$I(u, v) = \max_{(X, Y) \in B_u} d_{TV}\big(\pi(\cdot \mid X_{-v}),\ \pi(\cdot \mid Y_{-v})\big).$$
Now, we are ready to state Dobrushin’s condition.

Definition 3 (Dobrushin’s Uniqueness Condition).

Consider a distribution $\pi$ defined on a set of variables $V$. Let

$$\alpha = \max_{v \in V} \sum_{u \in V \setminus \{v\}} I(u, v).$$

$\pi$ is said to satisfy Dobrushin’s uniqueness condition if $\alpha < 1$.
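For the Ising model, a classical sufficient condition makes Definition 3 easy to check: the influence $I(u, v)$ is at most $\tanh(|\theta_{uv}|)$, so $\max_v \sum_u \tanh(|\theta_{uv}|) < 1$ implies Dobrushin’s condition. A minimal checker based on this standard bound (the code and graph format are our own illustration, not the paper’s):

```python
import math

def dobrushin_upper_bound(theta):
    """Upper bound on the Dobrushin parameter alpha of an Ising model,
    using the classical influence bound I(u, v) <= tanh(|theta_uv|).
    theta: dict mapping node -> {neighbor: edge parameter}."""
    return max(
        sum(math.tanh(abs(w)) for w in nbrs.values())
        for v, nbrs in theta.items()
    )

# Example: a 4-cycle with uniform edge weight 0.2 satisfies the condition,
# since each node has 2 neighbors and 2 * tanh(0.2) ~= 0.395 < 1.
theta = {
    0: {1: 0.2, 3: 0.2}, 1: {0: 0.2, 2: 0.2},
    2: {1: 0.2, 3: 0.2}, 3: {2: 0.2, 0: 0.2},
}
assert dobrushin_upper_bound(theta) < 1
```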

We have the following result from [DeSaOR16] about the mixing time of the Gibbs sampler for a model satisfying Dobrushin’s condition.

Theorem 1 (Mixing Time of Sequential Gibbs Sampling).

Assume that we run Gibbs sampling on a distribution $\pi$ that satisfies Dobrushin’s condition, $\alpha < 1$. Then the mixing time of sequential Gibbs is bounded by

$$t_{mix}(\epsilon) \le \frac{n}{1 - \alpha} \log\left(\frac{n}{\epsilon}\right).$$
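To get a feel for the scale of this bound, with illustrative numbers (not from the paper): for $n = 10^4$ variables, $\alpha = 1/2$, and accuracy $\epsilon = 10^{-2}$,

$$t_{mix}(\epsilon) \le \frac{n}{1 - \alpha} \log\frac{n}{\epsilon} = 2 \times 10^4 \cdot \log\big(10^6\big) \approx 2.8 \times 10^5$$

single-site updates, i.e., about 28 full sweeps over the variables.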

Definition 4.

For any discrete state space $\Sigma^n$ over the set of variables $V$, the Hamming distance between $x, y \in \Sigma^n$ is defined as $d_H(x, y) = \sum_{v \in V} \mathbb{1}[x_v \ne y_v]$.

Definition 5 (The greedy coupling between two Gibbs Sampling chains).

Consider two instances of Gibbs sampling associated with the same discrete graphical model over the state space $\Sigma^n$: $\{X_t\}_{t \ge 0}$ and $\{Y_t\}_{t \ge 0}$. The following coupling procedure is known as the greedy coupling. Start chain 1 at $X_0 = x_0$ and chain 2 at $Y_0 = y_0$, and in each time step $t$, choose a node $v$ uniformly at random to update in both the chains. Without loss of generality assume that $\Sigma = \{1, 2, \ldots, |\Sigma|\}$. For each $s \in \Sigma$, let $p_s$ denote the probability that the first chain sets $X_{t,v} = s$ and let $q_s$ be the probability that the second chain sets $Y_{t,v} = s$. Partition the interval $[0, 1]$ as follows: first lay down, for each $s \in \Sigma$ in increasing order, an interval $I_s$ of length $\min(p_s, q_s)$; then, on the leftover region, lay down intervals $J_s$ of length $p_s - \min(p_s, q_s)$ and, separately, intervals $J'_s$ of length $q_s - \min(p_s, q_s)$. Note that the $J_s$ and the $J'_s$ each tile the same leftover region, since $\sum_s \big(p_s - \min(p_s, q_s)\big) = \sum_s \big(q_s - \min(p_s, q_s)\big)$. Couple the updates according to the following rule:

Draw a number $U$ uniformly at random from $[0, 1]$. If $U \in I_s$ for some $s$, set both $X_{t,v} = s$ and $Y_{t,v} = s$. Otherwise, choose $X_{t,v} = s$ for the unique $s$ with $U \in J_s$, and $Y_{t,v} = s'$ for the unique $s'$ with $U \in J'_{s'}$.
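The interval construction above is the textbook maximal coupling of two categorical distributions. A minimal sketch of a single coupled update (our own illustration; `p` and `q` play the roles of the two chains’ conditional distributions at the chosen node):

```python
import random

def greedy_coupled_sample(p, q, rng=random.Random(0)):
    """Sample a coupled pair (s1, s2) with s1 ~ p, s2 ~ q that maximizes
    Pr[s1 == s2]; the disagreement probability equals d_TV(p, q).
    p, q: dicts mapping each value in Sigma to its probability."""
    overlap = {s: min(p[s], q.get(s, 0.0)) for s in p}
    u = rng.random()
    # The common intervals I_s, of length min(p_s, q_s).
    acc = 0.0
    for s, m in overlap.items():
        acc += m
        if u < acc:
            return s, s
    # The residual intervals: J_s for chain 1, J'_s for chain 2.
    def residual_sample(dist, u_res):
        acc2, last = 0.0, None
        for s in dist:
            last = s
            acc2 += dist[s] - overlap.get(s, 0.0)
            if u_res < acc2:
                return s
        return last  # guard against floating-point rounding
    u_res = u - acc  # position within the leftover mass, shared by both chains
    return residual_sample(p, u_res), residual_sample(q, u_res)
```

By construction, the pair agrees whenever $U$ falls in the common intervals, so the disagreement probability is exactly $d_{TV}(p, q)$, which is the property used in Lemma 1 below.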

We state an important property of this coupling, which holds under Dobrushin’s condition, in the following lemma.

Lemma 1.

The greedy coupling (Definition 5) satisfies the following property. Let $x, y \in \Sigma^n$ and consider two executions of Gibbs sampling associated with distribution $\pi$ and starting at $x$ and $y$ respectively. Suppose the executions are coupled using the greedy coupling, and suppose in step $t$, node $v$ is chosen to be updated in both the models. Then,

$$\Pr\big[X_{t,v} \ne Y_{t,v} \,\big|\, X_{t-1} = x,\ Y_{t-1} = y\big] = d_{TV}\big(\pi(\cdot \mid x_{-v}),\ \pi(\cdot \mid y_{-v})\big) \le \sum_{u :\, x_u \ne y_u} I(u, v). \tag{1}$$

2.1 Modeling Asynchronicity

We use the asynchronicity model from [recht2011hogwild] and [DeSaOR16]. HOGWILD!-Gibbs is a multi-threaded algorithm where each thread performs a Gibbs update on the state of a graph which is stored in shared memory (typically in RAM). We view each processor’s write as occurring at a distinct time instant, and each write starts the next time step for the process. Assuming that the writes are all serialized, one can now talk about the state of the system after $t$ writes; this will be denoted as time $t$. HOGWILD! is modeled as a stochastic system adapted to a natural filtration $\{\mathcal{F}_t\}_{t \ge 0}$: $\mathcal{F}_t$ contains all events that have occurred until time $t$.

Some of these writes happen based on a read done a few steps ago, and hence correspond to updates based on stale values in the local cache of the processor. The staleness is modeled in a stochastic manner, using the random variable $\tau_{i,t}$ to denote the delay associated with the read performed on node $i$ at time step $t$: the value of node $i$ used in the update at time $t$ is going to be $Y_{i,\, t - \tau_{i,t}}$. Delays across different node reads can be correlated. However, the delay distribution is independent of the configuration of the model at time $t$. The model imposes two restrictions on the delay distributions. First, the expected value of each delay distribution is bounded by $\tau$, i.e., $E[\tau_{i,t}] \le \tau$ for all $i$ and $t$. We will think of $\tau$ as a constant compared to $n$ in this paper. We call $\tau$ the average contention parameter associated with a HOGWILD!-Gibbs execution. [DeSaOR16] impose a second restriction which bounds the tails of the distribution of $\tau_{i,t}$. We do not need to make this assumption in this paper for our results: [DeSaOR16] need the assumption to show that the HOGWILD! chain mixes fast, whereas by using coupling arguments we can avoid the need to have the HOGWILD! chain mix, and will just use the mixing time bounds for the sequential Gibbs sampling chain instead.

Let $\mathcal{D}$ denote the set of all delay distributions. We refer to the sequential Gibbs sampler associated with a distribution $\pi$ as $\{X_t\}_{t \ge 0}$, and to the HOGWILD! Gibbs sampler associated with $\pi$ together with $\mathcal{D}$ as $\{Y_t\}_{t \ge 0}$. Note that $\{Y_t\}$ is a time-inhomogeneous Markov chain and might not converge to a stationary distribution.
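The following single-threaded simulation of this model may help make the delay semantics concrete. The geometric delay distribution (whose mean is exactly `mean_delay`) is an illustrative choice on our part; the analysis below only uses $E[\tau_{i,t}] \le \tau$.

```python
import math
import random

def hogwild_gibbs_sim(theta, x0, num_steps, mean_delay, rng=random.Random(0)):
    """Simulate HOGWILD!-Gibbs on an Ising model: the value of each
    neighbor i read at step t is its value tau_{i,t} writes ago, with
    tau_{i,t} geometric with mean `mean_delay` (one modeling choice)."""
    history = [dict(x0)]                # state after each serialized write
    x = dict(x0)
    nodes = list(x)
    p_stop = 1.0 / (1.0 + mean_delay)   # geometric delay on {0, 1, 2, ...}
    for t in range(num_steps):
        v = rng.choice(nodes)
        s = 0.0
        for u in theta[v]:
            delay = 0
            while rng.random() > p_stop:   # sample tau_{u,t}
                delay += 1
            stale = history[max(0, len(history) - 1 - delay)]
            s += theta[v][u] * stale[u]    # read a possibly stale value
        p_plus = 1.0 / (1.0 + math.exp(-2.0 * s))
        x[v] = +1 if rng.random() < p_plus else -1
        history.append(dict(x))
    return x
```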

2.2 Properties of Polynomials on Ising Models satisfying Dobrushin’s condition

Here we state some known results about polynomial functions on Ising models satisfying Dobrushin’s condition.

Theorem 2 (Concentration of Measure for Polynomial Functions of the Ising model, [DaskalakisDK17, GheissariLP17, GotzeSS18]).

Consider an Ising model $p$ without external field on a graph $G = (V, E)$ satisfying Dobrushin’s condition with Dobrushin parameter $\alpha < 1$. Let $f_a$ be a degree-$d$ polynomial over the Ising model with $\|a\|_\infty \le 1$, and let $X \sim p$. Then, there is a constant $c = c(d, \alpha) > 0$, such that, for all $t > 0$,

$$\Pr\Big[\big|f_a(X) - E[f_a(X)]\big| > t\Big] \le 2 \exp\left(-\frac{c\, t^{2/d}}{n}\right).$$

As a corollary this also implies that, with probability at least $1 - n^{-3}$, $\big|f_a(X) - E[f_a(X)]\big| \le O\big((n \log n)^{d/2}\big)$.

Theorem 3 (Concentration of Measure for Polynomial Functions of the Ising model, [DaskalakisDK17, GheissariLP17, GotzeSS18]).

Consider an Ising model $p$ without external field on a graph $G = (V, E)$ satisfying Dobrushin’s condition with Dobrushin parameter $\alpha < 1$. Let $f_a$ be a degree-$d$ polynomial over the Ising model with $\|a\|_\infty \le 1$, and let $X \sim p$. Then, there is a constant $C = C(d, \alpha) > 0$, such that,

$$\sqrt{\mathrm{Var}\big(f_a(X)\big)} \le C\, n^{d/2}.$$

As a corollary this also implies, by Jensen’s inequality, that $E\big[\big|f_a(X) - E[f_a(X)]\big|\big] \le C\, n^{d/2}$.

Theorem 4 (Marginals Bound under Dobrushin’s condition, [DaskalakisDK17]).

Consider an Ising model with no external field satisfying Dobrushin’s condition with Dobrushin parameter $\alpha < 1$. For some positive integer $d$, let $f_a$ be a degree-$d$ polynomial function. Then we have that, if $\|a\|_\infty \le 1$,

$$\big|E[f_a(X)]\big| \le O\big(n^{d/2}\big).$$
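The scale of these bounds is easy to sanity-check by exact enumeration on tiny models. The sketch below (our own illustration, not from the paper) computes $E[f_a(X)]$ and $\mathrm{Var}(f_a(X))$ exactly for a small Ising model, which can be compared against the $O(n^{d/2})$ predictions of Theorems 3 and 4.

```python
import itertools
import math

def ising_moments(theta_edges, n, f):
    """Exact E[f(X)] and Var(f(X)) for a small Ising model with no
    external field, by summing over all 2^n configurations.
    theta_edges: dict {(u, v): theta_uv} with u < v; f: tuple -> float."""
    weights, states = [], []
    for x in itertools.product((-1, +1), repeat=n):
        w = math.exp(sum(th * x[u] * x[v] for (u, v), th in theta_edges.items()))
        weights.append(w); states.append(x)
    Z = sum(weights)
    mean = sum(w * f(x) for w, x in zip(weights, states)) / Z
    second = sum(w * f(x) ** 2 for w, x in zip(weights, states)) / Z
    return mean, second - mean ** 2

# Quadratic statistic f_a(x) = sum_{u<v} x_u x_v on a 4-cycle: its standard
# deviation should be on the order of n (here n = 4), per Theorem 3.
edges = {(0, 1): 0.2, (1, 2): 0.2, (2, 3): 0.2, (0, 3): 0.2}
f = lambda x: sum(x[u] * x[v] for u in range(4) for v in range(u + 1, 4))
print(ising_moments(edges, 4, f))
```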

3 Bounding The Expected Hamming Distance Between Coupled Executions of HOGWILD! and Sequential Gibbs Samplers

In this section, we show that under the greedy coupling of the sequential and asynchronous chains, the expected Hamming distance between the two chains at any time $t$ is small. This will form the basis for our accurate estimation results of Section 4. We begin by showing that the expected Hamming distance between the states $X_t$ and $Y_t$ of a coupled run of the sequential and asynchronous executions, respectively, is bounded by $O(\tau \log n)$. At a high level, the proof of Lemma 2 proceeds by studying the expected change in the Hamming distance under one step of the coupled execution of the chains. We can bound the expected change using the Dobrushin parameter $\alpha$ and the property of the greedy coupling (Lemma 1). We then show that the expected change is negative whenever the Hamming distance between the two chains was above $\Omega(\tau \log n)$ to begin with. This allows us to argue that when the two chains start at the same configuration, the expected Hamming distance remains bounded by $O(\tau \log n)$.

Lemma 2.

Let $\pi$ denote a discrete probability distribution on $n$ variables (nodes) with Dobrushin parameter $\alpha < 1$. Let $\{X_t\}_{t \ge 0}$ denote the execution of the sequential Gibbs sampler on $\pi$ and $\{Y_t\}_{t \ge 0}$ denote the HOGWILD! Gibbs sampler associated with $\pi$, such that $X_0 = Y_0$. Suppose the two chains are run coupled in a greedy manner. Let $\mathcal{F}_t$ denote all events that have occurred until time $t$ in this coupled execution. Then we have, for all $t \ge 0$, under the greedy coupling of the two chains,

$$E\big[d_H(X_t, Y_t)\big] \le \frac{\alpha\,\tau\,(1 + \ln n)}{1 - \alpha} = O\left(\frac{\alpha\,\tau \log n}{1 - \alpha}\right).$$

Proof.

We will show the statement of the Lemma is true by induction over $t$. The statement is clearly true at time $t = 0$ since $X_0 = Y_0$. Suppose it holds for some time $t$. The state of the system at time $t$ includes the choice of nodes the two chains chose to update at each time step, the delays for each node and time step, and the states $X_t$ and $Y_t$ under the two chains. We let $v_{t+1}$ denote the node that is chosen to be updated in the two chains at time step $t+1$. As a shorthand, we use $d_t$ to denote $d_H(X_t, Y_t)$. We will first compute a bound on

$$E\big[d_{t+1} - d_t \mid \mathcal{F}_t\big]. \tag{2}$$

The induction hypothesis implies that $E[d_t] \le \frac{\alpha\,\tau\,(1 + \ln n)}{1 - \alpha}$. Given the delays for time $t+1$, denote by $\tilde{Y}_t$ the following state, whose value at each node $i$ is the (possibly stale) value of node $i$ read by the asynchronous sampler:

$$\tilde{Y}_{t,i} = Y_{i,\, t + 1 - \tau_{i, t+1}} \quad \text{for each node } i.$$

We partition the set of nodes into the set where $X_t$ and $Y_t$ have the same value and the set where they don’t. Define $A_t$ and $D_t$ as follows:

$$A_t = \{i \in V : X_{t,i} = Y_{t,i}\}, \qquad D_t = \{i \in V : X_{t,i} \ne Y_{t,i}\},$$

so that $d_t = |D_t|$. When the two samplers proceed to perform their update for time step $t+1$, in addition to the nodes in the set $D_t$, some additional nodes might appear to have different values under the asynchronous chain. This is because of the delays $\tau_{i, t+1}$. We use $B_t$ to denote the set of nodes in $A_t$ whose values read by the asynchronous chain are different from those under the sequential chain:

$$B_t = \big\{i \in A_t : \tilde{Y}_{t,i} \ne X_{t,i}\big\}. \tag{3}$$

Next, we proceed to obtain a bound on the expected size of $B_t$, which we will use later. Let $\tau_{i, t+1}$ denote the delays at time $t+1$. Let $t_i$ denote the last index before $t+1$ when node $i$’s value was updated. Observe that the $t_i$’s have to all be distinct, since exactly one node is written in each time step. A node $i$ can lie in $B_t$ only if its read reaches back past this last write, i.e., only if $\tau_{i, t+1} \ge t + 1 - t_i$. Hence, by Markov’s inequality and the fact that the gaps $t + 1 - t_i$ are distinct positive integers,

$$E\big[|B_t| \,\big|\, \mathcal{F}_t\big] \le \sum_{i=1}^n \Pr\big[\tau_{i, t+1} \ge t + 1 - t_i\big] \le \sum_{i=1}^n \frac{\tau}{t + 1 - t_i} \le \tau \sum_{k=1}^n \frac{1}{k} \le \tau\,(1 + \ln n). \tag{4}$$

Suppose a node $v$ from the set $A_t$ was chosen at step $t+1$ to be updated in both the chains. Now, $d_{t+1} - d_t$ is either $0$ or $1$. The probability that it is $1$ is bounded above by the total variation distance between the corresponding conditional distributions:

$$\Pr\big[d_{t+1} - d_t = 1 \mid \mathcal{F}_t,\ v_{t+1} = v\big] \le d_{TV}\big(\pi(\cdot \mid X_{t,-v}),\ \pi(\cdot \mid \tilde{Y}_{t,-v})\big) \tag{5}$$

$$\le \sum_{u \in D_t \cup B_t} I(u, v), \tag{6}$$

where (5) follows from the property of the greedy coupling (Lemma 1) and (6) follows from the triangle inequality because total variation distance is a metric.

Now suppose that $v$ was chosen instead from the set $D_t$. Now, $d_{t+1} - d_t$ is either $0$ or $-1$. We lower bound the probability that it is $-1$ using arguments similar to the above calculation:

$$\Pr\big[d_{t+1} - d_t = -1 \mid \mathcal{F}_t,\ v_{t+1} = v\big] \ge 1 - d_{TV}\big(\pi(\cdot \mid X_{t,-v}),\ \pi(\cdot \mid \tilde{Y}_{t,-v})\big) \tag{7}$$

$$\ge 1 - \sum_{u \in (D_t \cup B_t) \setminus \{v\}} I(u, v), \tag{8}$$

where (7) follows from the property of the greedy coupling (Lemma 1) and (8) follows from the triangle inequality because total variation distance is a metric.

Now the expected change in the Hamming distance satisfies

$$E\big[d_{t+1} - d_t \mid \mathcal{F}_t\big] \le \frac{1}{n} \sum_{v \in A_t}\, \sum_{u \in D_t \cup B_t} I(u, v) - \frac{1}{n} \sum_{v \in D_t} \Bigg(1 - \sum_{u \in (D_t \cup B_t) \setminus \{v\}} I(u, v)\Bigg) \le \frac{\alpha\big(d_t + E[|B_t| \mid \mathcal{F}_t]\big) - d_t}{n},$$

where we used the fact that $\sum_{v} I(u, v) \le \alpha$ for any $u$.

Now,

$$E\big[d_{t+1}\big] \le E\big[d_t\big]\left(1 - \frac{1 - \alpha}{n}\right) + \frac{\alpha\, E\big[|B_t|\big]}{n} \tag{9}$$

$$\le E\big[d_t\big]\left(1 - \frac{1 - \alpha}{n}\right) + \frac{\alpha\,\tau\,(1 + \ln n)}{n} \tag{10}$$

$$\le \frac{\alpha\,\tau\,(1 + \ln n)}{1 - \alpha}, \tag{11}$$

where (11) follows by plugging in the induction hypothesis. This completes the induction. ∎
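For binary state spaces the greedy coupling is particularly simple to implement: both chains threshold the same uniform draw against their respective conditional probabilities, so each update disagrees with probability exactly the total variation distance in Lemma 1. The following self-contained sketch (illustrative graph format and delay model, as in the earlier sketches) tracks the coupled Hamming distance, which Lemma 2 predicts should stabilize at $O(\tau \log n)$:

```python
import math
import random

def coupled_hamming_trace(theta, x0, num_steps, mean_delay, seed=0):
    """Run sequential and HOGWILD! Gibbs chains coupled greedily (shared
    node choices and shared uniforms) and record d_H(X_t, Y_t)."""
    rng = random.Random(seed)
    nodes = list(x0)
    X, Y = dict(x0), dict(x0)
    hist = [dict(x0)]                       # serialized writes of Y
    p_stop = 1.0 / (1.0 + mean_delay)
    def p_plus(state, v):
        s = sum(theta[v][u] * state[u] for u in theta[v])
        return 1.0 / (1.0 + math.exp(-2.0 * s))
    trace = []
    for t in range(num_steps):
        v = rng.choice(nodes)               # same node in both chains
        stale = {}
        for u in theta[v]:                  # Y reads possibly stale values
            d = 0
            while rng.random() > p_stop:
                d += 1
            stale[u] = hist[max(0, len(hist) - 1 - d)][u]
        u01 = rng.random()                  # shared uniform: greedy coupling
        X[v] = +1 if u01 < p_plus(X, v) else -1
        Y[v] = +1 if u01 < p_plus(stale, v) else -1
        hist.append(dict(Y))
        trace.append(sum(X[w] != Y[w] for w in nodes))
    return trace
```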

Next, we generalize Lemma 2 to bound also the $K$-th moment of the Hamming distance between $X_t$ and $Y_t$ obtained from the coupled executions. The proof of Lemma 3 has a similar flavor to that of Lemma 2. It is, however, more involved to bound the expected increase in the $K$-th power of the Hamming distance, and it requires some careful analysis to see that the bound doesn’t scale polynomially in $n$.

Lemma 3 ($K$-th moment bound on the Hamming distance).

Consider the same setting as that of Lemma 2. We have, for all $t \ge 0$ and any constant integer $K \ge 1$, under the greedy coupling of the two chains,

$$E\big[d_H(X_t, Y_t)^K\big] \le c(K, \alpha)\,(\tau \log n)^K,$$

where $c(K, \alpha)$ is some function of the parameters $K$ and $\alpha$.

Proof.

Again we will employ the shorthand $d_t = d_H(X_t, Y_t)$, and we write $M = \tau(1 + \ln n)$ for the bound (4) on $E[|B_t| \mid \mathcal{F}_t]$ derived in Lemma 2; we assume, without loss of generality, that $n \ge 3$ and $\tau \log n \ge 1$, so that $M \le 2\tau\log n$. The proof is structured in the following way. We will show the statement of the Lemma is true by induction over $K$ and $t$. To show the statement for a certain value of $K$ we will use the inductive hypotheses of the statement for all pairs $(K', t)$ where $K' < K$, and for all pairs $(K, t')$ where $t' < t$. Lemma 2 shows the statement for all values of $t$ for $K = 1$. We will assume the statement holds, for all $t$, for every power $K' < K$, with corresponding functions $c(K', \alpha)$ (and $c(0, \alpha) = 1$).

Now, we proceed to show it is true for $K \ge 2$. Clearly for $t = 0$ it holds because $X_0 = Y_0$. Suppose it holds for some time $t$. The state of the system at time $t$ includes the choices of nodes the two chains chose to update at each time step, the delays for each node and time step, and the states $X_t$ and $Y_t$ under the two chains. We let $v_{t+1}$ denote the node chosen to be updated in the two chains at time step $t+1$. We will proceed in a similar way as we did in Lemma 2. We first compute a bound on

$$E\big[d_{t+1}^K - d_t^K \mid \mathcal{F}_t\big] \tag{12}$$

as a function of $d_t$. The induction hypothesis implies that $E[d_t^K] \le c(K, \alpha)\,(\tau \log n)^K$. We partition the set of nodes into the set $A_t$ where $X_t$ and $Y_t$ have the same value and the set $D_t$ where they don’t, exactly as in Lemma 2, so that $d_t = |D_t|$.

When the two samplers proceed to perform their update for time step $t+1$, in addition to the nodes in the set $D_t$, some additional nodes might appear to have different values under the asynchronous chain. This is because of the delays. As before, we use $B_t$ to denote the set of nodes in $A_t$ whose values read by the asynchronous chain are different from those under the sequential chain:

$$B_t = \big\{i \in A_t : \tilde{Y}_{t,i} \ne X_{t,i}\big\}. \tag{13}$$

Next, we use the bound on the expected size of $B_t$ which was derived in Lemma 2:

$$E\big[|B_t| \,\big|\, \mathcal{F}_t\big] \le \tau\,(1 + \ln n) = M. \tag{14}$$

Suppose a node $v$ from the set $A_t$ was chosen at step $t+1$ to be updated in both the chains. Now, $d_{t+1}^K - d_t^K$ is either $0$ or $(d_t + 1)^K - d_t^K$. The probability of the latter event is bounded above by the total variation distance between the corresponding conditional distributions:

$$\Pr\big[d_{t+1} = d_t + 1 \mid \mathcal{F}_t,\ v_{t+1} = v\big] \le d_{TV}\big(\pi(\cdot \mid X_{t,-v}),\ \pi(\cdot \mid \tilde{Y}_{t,-v})\big) \tag{15}$$

$$\le \sum_{u \in D_t \cup B_t} I(u, v). \tag{16}$$

(16) follows again due to the property of greedy coupling (Lemma 1) and the metric property of the total variation distance.

Now suppose that $v$ was chosen instead from the set $D_t$. Now, $d_{t+1}^K - d_t^K$ is either $0$ or $(d_t - 1)^K - d_t^K$. The probability of the latter event is bounded below by the following:

$$\Pr\big[d_{t+1} = d_t - 1 \mid \mathcal{F}_t,\ v_{t+1} = v\big] \ge 1 - d_{TV}\big(\pi(\cdot \mid X_{t,-v}),\ \pi(\cdot \mid \tilde{Y}_{t,-v})\big) \tag{17}$$

$$\ge 1 - \sum_{u \in (D_t \cup B_t) \setminus \{v\}} I(u, v). \tag{18}$$

(18) follows again due to the property of greedy coupling (Lemma 1) and the metric property of the total variation distance.

Now, write $A = (d_t + 1)^K - d_t^K$ and $B = d_t^K - (d_t - 1)^K$, and note that $B \le A$ by convexity. Summing over the choice of $v$ exactly as in Lemma 2, the expected change in the value of the $K$-th power of the Hamming distance is

$$E\big[d_{t+1}^K - d_t^K \mid \mathcal{F}_t\big] \le \frac{1}{n}\Big(A\,\alpha\big(d_t + M\big) - B\,d_t\Big). \tag{19}$$

By the binomial expansion, $A \le K d_t^{K-1} + 2^K (d_t + 1)^{K-2}$ and $B \ge K d_t^{K-1} - 2^K (d_t + 1)^{K-2}$, so

$$E\big[d_{t+1}^K - d_t^K \mid \mathcal{F}_t\big] \le \frac{1}{n}\Big(-(1 - \alpha)\,K\, d_t^K + \alpha K M\, d_t^{K-1} + 2^{K+1}(d_t + 1)^{K-2}\,(d_t + M)\Big). \tag{20}$$

Taking expectations in (20), every term other than the first is, after expanding, a polynomial of degree at most $K - 1$ in $d_t$ with coefficients at most $2^{2K+1} K M$ in absolute value. Applying the inductive hypothesis for the powers $K' < K$, and using $M\,(\tau\log n)^j \le 2\,(\tau \log n)^{j+1} \le 2\,(\tau \log n)^K$ for all $j \le K - 1$, we obtain

$$E\big[d_{t+1}^K\big] \le E\big[d_t^K\big]\left(1 - \frac{(1 - \alpha)K}{n}\right) + \frac{C(K, \alpha)}{n}\,(\tau \log n)^K, \tag{21}$$

where $C(K, \alpha) := 2^{2K+2} K \sum_{j=0}^{K-1} c(j, \alpha)$ depends only on $K$ and $\alpha$, and crucially not on $n$.

We now distinguish two cases. Suppose first that $E[d_t^K] \le \frac{c(K, \alpha)}{2}\,(\tau \log n)^K$. Then, dropping the (helpful) negative part of the drift in (21),

$$E\big[d_{t+1}^K\big] \le \frac{c(K, \alpha)}{2}\,(\tau \log n)^K + \frac{C(K, \alpha)}{n}\,(\tau \log n)^K \le c(K, \alpha)\,(\tau \log n)^K, \tag{22}$$

as long as $c(K, \alpha) \ge 2\, C(K, \alpha)$.

Suppose instead that $E[d_t^K]$ was larger than $\frac{c(K, \alpha)}{2}\,(\tau \log n)^K$. We want to show that, for $c(K, \alpha)$ appropriately large in terms of $C(K, \alpha)$, $E[d_{t+1}^K]$ doesn’t exceed $c(K, \alpha)\,(\tau \log n)^K$. Indeed, in this case

$$\frac{C(K, \alpha)}{n}\,(\tau \log n)^K < \frac{2\, C(K, \alpha)}{c(K, \alpha)\, n}\, E\big[d_t^K\big] \le \frac{(1 - \alpha)K}{n}\, E\big[d_t^K\big], \tag{23}$$

provided $c(K, \alpha) \ge \frac{2\, C(K, \alpha)}{(1 - \alpha)K}$. Plugging (23) into (21) shows that the drift is non-positive, so

$$E\big[d_{t+1}^K\big] \le E\big[d_t^K\big] \le c(K, \alpha)\,(\tau \log n)^K, \tag{24}$$

where the last inequality is the inductive hypothesis at time $t$. (23) holds because, for $c(K, \alpha)$ sufficiently large compared to the values of $c(K', \alpha)$, where $K' < K$, the negative drift term dominates all the remaining terms in the two summations. Taking

$$c(K, \alpha) = 2\, C(K, \alpha)\,\max\left\{1,\ \frac{1}{(1 - \alpha)K}\right\} \tag{25}$$

satisfies both cases. Hence, by induction we have the desired Lemma statement. ∎

4 Estimating Global Functions Using HOGWILD! Gibbs Sampling

To begin with, we observe that our Hamming moment bounds from Section 3 imply that we can accurately estimate functions or events of the graphical model if they are Lipschitz. We show this below as a Corollary of Lemma 3. Before we state the Corollary, we first state the following simple Lemma, which quantifies how large a number of steps $T$ is required to obtain an accurate estimate from the Gibbs sampler.

Lemma 4.

Let $\pi$ be a graphical model on $n$ nodes satisfying Dobrushin’s condition with Dobrushin parameter $\alpha < 1$. Let $X_0, X_1, X_2, \ldots$ denote the steps of a Gibbs sampler running on $\pi$. Let $X \sim \pi$. Also let $f$ be a bounded function on the graphical model. Then for $T \ge \frac{n}{1 - \alpha} \log\left(\frac{2 n \|f\|_\infty}{\epsilon}\right)$,

$$\big|E[f(X_T)] - E[f(X)]\big| \le \epsilon.$$

Proof.

We have from Theorem 1 that, for $T$ as above, $d_{TV}\big(\mathcal{L}(X_T),\, \pi\big) \le \frac{\epsilon}{2 \|f\|_\infty}$, where $\mathcal{L}(X_T)$ denotes the law of $X_T$. This implies

$$\big|E[f(X_T)] - E[f(X)]\big| \le 2\,\|f\|_\infty\, d_{TV}\big(\mathcal{L}(X_T),\, \pi\big) \le \epsilon. \tag{33}$$

∎

Now, we state Corollary 1 which quantifies the error we can attain when trying to estimate expectations of Lipschitz functions using HOGWILD!-Gibbs.

Corollary 1.

Let $\pi$ denote the distribution associated with a graphical model over the set of variables $V$ ($|V| = n$) taking values in a discrete space $\Sigma^n$. Assume that the model satisfies Dobrushin’s condition with Dobrushin parameter $\alpha < 1$. Let $f : \Sigma^n \to \mathbb{R}$ be a function such that, for all $x, y \in \Sigma^n$,

$$|f(x) - f(y)| \le M\, d_H(x, y)^K.$$

Let $X \sim \pi$ and let $\{Y_t\}_{t \ge 0}$ denote an execution of HOGWILD!-Gibbs sampling on $\pi$ with average contention parameter $\tau$. For $T \ge \frac{n}{1 - \alpha} \log\left(\frac{2 n \|f\|_\infty}{\epsilon}\right)$,

$$\big|E[f(Y_T)] - E[f(X)]\big| \le M\, c(K, \alpha)\,(\tau \log n)^K + \epsilon,$$

where $c(K, \alpha)$ is the function from Lemma 3.

Proof.

Consider an execution $\{X_t\}_{t \ge 0}$, where $X_0 = Y_0$, of the synchronous Gibbs sampler associated with $\pi$, coupled greedily with the HOGWILD! chain. We have from Theorem 1 and Lemma 4 that, for $T \ge \frac{n}{1 - \alpha} \log\left(\frac{2 n \|f\|_\infty}{\epsilon}\right)$,

$$\big|E[f(X_T)] - E[f(X)]\big| \le \epsilon. \tag{34}$$

Next, we have,

$$\big|E[f(Y_T)] - E[f(X_T)]\big| \le E\big[|f(Y_T) - f(X_T)|\big] \le M\, E\big[d_H(X_T, Y_T)^K\big] \le M\, c(K, \alpha)\,(\tau \log n)^K. \tag{35}$$

Putting (34) and (35) together we get the statement of the Corollary. ∎

We note that the results of [DeSaOR16] can be used to obtain Corollary 1 when the function is Lipschitz with respect to the Hamming distance (the case $K = 1$). The above corollary provides a simple way to bound the bias introduced by HOGWILD! in the estimation of Lipschitz functions. However, many functions of interest over graphical models are not Lipschitz with good Lipschitz constants. In many cases, even when the Lipschitz constants are bad, there is still hope for more accurate estimation. As it turns out, Dobrushin’s condition provides such cases. We will focus on one such case, namely polynomial functions of the Ising model. Our goal will be to accurately estimate the expected values of constant-degree polynomials over the Ising model. Using the bounds from Lemmas 2 and 3, we now proceed to bound the bias in computing polynomial functions of the Ising model using HOGWILD! Gibbs sampling.

We first remark that linear functions (degree-1 polynomials) suffer zero bias in their expected values due to HOGWILD!-Gibbs. This is because, in an Ising model with zero external field, each node individually has equal probability of being $+1$ or $-1$. This symmetry is maintained by HOGWILD!-Gibbs since the delays are configuration-agnostic: the delays when a node is $+1$ and when it is $-1$ can be coupled perfectly, leaving the symmetry intact. More interesting cases start appearing when we go to degree-2 polynomials. Therefore, we start our investigation at quadratic polynomials. Theorem 5 states the bound we show for the bias in the computation of degree-2 polynomials of the Ising model.
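One way to make the symmetry argument concrete: with no external field, $p(x) = p(-x)$ for every configuration $x$, so for every node $v$,

$$E_{X \sim p}[X_v] = \sum_x x_v\, p(x) = \sum_x x_v\, p(-x) = -\sum_x x_v\, p(x) = -E_{X \sim p}[X_v] \;\implies\; E[X_v] = 0,$$

and since the delay distributions do not depend on the configuration, the HOGWILD! dynamics commute with the global flip $x \mapsto -x$; hence, started from any flip-symmetric initial distribution, $E[Y_{t,v}] = 0$ for all $t$ as well, and every linear function has zero bias.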

Theorem 5 (Bias in Quadratic functions of Ising Model computed using HOGWILD!-Gibbs).

Consider the quadratic function $f_a(x) = \sum_{u \ne v} a_{uv}\, x_u x_v$ with $\|a\|_\infty \le 1$. Let $p$ denote an Ising model on $n$ nodes with no external field and Dobrushin parameter $\alpha < 1$. Let $\{X_t\}$ denote a run of the sequential Gibbs sampler and $\{Y_t\}$ denote a run of HOGWILD!-Gibbs on $p$, such that $X_0 = Y_0$. Then we have, for all $t \ge 0$, under the greedy coupling of the two chains,

$$\big|E[f_a(X_t)] - E[f_a(Y_t)]\big| \le O\big(\tau\,\sqrt{n}\,\log^{3/2} n\big).$$

Proof.

Under the greedy coupling, we have that,

$$\big|E[f_a(X_t)] - E[f_a(Y_t)]\big| \le E\big[\big|f_a(X_t) - f_a(Y_t)\big|\big] \tag{36}$$

$$= E\Bigg[\Bigg|\sum_{u \ne v} a_{uv}\big(X_{t,u} X_{t,v} - Y_{t,u} Y_{t,v}\big)\Bigg|\Bigg] \tag{37}$$

$$\le 2\, E\Bigg[\sum_{v\,:\, X_{t,v} \ne Y_{t,v}} \Bigg(\Big|\sum_{u} a_{uv}\, X_{t,u}\Big| + \Big|\sum_{u} a_{uv}\, Y_{t,u}\Big|\Bigg)\Bigg], \tag{38}$$

where (38) follows by writing $X_{t,u} X_{t,v} - Y_{t,u} Y_{t,v} = X_{t,u}\big(X_{t,v} - Y_{t,v}\big) + \big(X_{t,u} - Y_{t,u}\big) Y_{t,v}$ and noting that $|X_{t,v} - Y_{t,v}| \le 2 \cdot \mathbb{1}[X_{t,v} \ne Y_{t,v}]$.