Gamma Processes, Stick-Breaking, and Variational Inference

# Gamma Processes, Stick-Breaking, and Variational Inference

Anirban Roychowdhury, Brian Kulis
Department of Computer Science and Engineering
The Ohio State University
roychowdhury.7@osu.edu,kulis@cse.ohio-state.edu
###### Abstract

While most Bayesian nonparametric models in machine learning have focused on the Dirichlet process, the beta process, or their variants, the gamma process has recently emerged as a useful nonparametric prior in its own right. Current inference schemes for models involving the gamma process are restricted to MCMC-based methods, which limits their scalability. In this paper, we present a variational inference framework for models involving gamma process priors. Our approach is based on a novel stick-breaking constructive definition of the gamma process. We prove correctness of this stick-breaking process by using the characterization of the gamma process as a completely random measure (CRM), and we explicitly derive the rate measure of our construction using Poisson process machinery. We also derive error bounds on the truncation of the infinite process required for variational inference, similar to the truncation analyses for other nonparametric models based on the Dirichlet and beta processes. Our representation is then used to derive a variational inference algorithm for a particular Bayesian nonparametric latent structure formulation known as the infinite Gamma-Poisson model, where the latent variables are drawn from a gamma process prior with Poisson likelihoods. Finally, we present results for our algorithms on nonnegative matrix factorization tasks on document corpora, and show that we compare favorably to both sampling-based techniques and variational approaches based on beta-Bernoulli priors.

## 1 Introduction

The gamma process is a versatile pure-jump Lévy process with widespread applications in various fields of science. Of late it is emerging as an increasingly popular prior in the Bayesian nonparametric literature within the machine learning community; it has recently been applied to exchangeable models of sparse graphs [1] as well as for nonparametric ranking models [2]. It also has been used as a prior for infinite-dimensional latent indicator matrices [3]. This latter application is one of the earliest Bayesian nonparametric approaches to count modeling, and as such can be thought of as an extension of the venerable Indian Buffet Process to modeling latent structures where each feature can occur multiple times for a datapoint, instead of being simply binary.

The flexibility of gamma process models allows them to be applied in a wide variety of Bayesian nonparametric settings, but their relative complexity makes principled inference nontrivial. In particular, all direct applications of the gamma process in the Bayesian nonparametric literature use Markov chain Monte Carlo samplers (typically Gibbs sampling) for posterior inference, which often suffers from poor scalability. For other Bayesian nonparametric models—in particular those involving the Dirichlet process or beta process—a successful thread of research has considered variational alternatives to standard sampling methods [4, 5, 6]. One first derives an explicit construction of the underlying “weights” of the atomic measure component of the random measures underlying the infinite priors; so-called “stick-breaking” processes for the Dirichlet and beta processes yield such a construction. Then these weights are truncated and integrated into a mean-field variational inference algorithm. For instance, stick-breaking was derived for the Dirichlet process in the seminal paper by Sethuraman [7], which was in turn used for variational inference in Dirichlet process models [4]. Similar stick-breaking representations for a special case of the Indian Buffet Process [8] and the beta process [9] have been constructed, and have naturally led to mean-field variational inference algorithms for nonparametric models involving these priors [10, 11]. Such variational inference algorithms have been shown to be more scalable than the sampling-based inference techniques normally used; moreover they work with the full model posterior without marginalizing out any variables.

In this paper we propose a variational inference framework for gamma process priors using a novel stick-breaking construction of the process. We use the characterization of the gamma process as a completely random measure (CRM), which allows us to leverage Poisson process properties to arrive at a simple derivation of the rate measure of our stick-breaking construction, and show that it is indeed equal to the Lévy measure of the gamma process. We also use the Poisson process formulation to derive a bound on the error of the truncated version compared to the full process, analogous to the bounds derived for the Dirichlet process [12], the Indian Buffet Process [10] and the beta process [11]. We then, as a particular example, focus on the infinite Gamma-Poisson model of [3] (note that variational inference need not be limited to this model). This model is a prior on infinitely wide latent indicator matrices with non-negative integer-valued entries; each column has an associated parameter independently drawn from a gamma distribution, and the matrix values are independently drawn from Poisson distributions with these parameters as means. We develop a mean-field variational technique using a truncated version of our stick-breaking construction, and a sampling algorithm that uses Monte Carlo integration for parameter marginalization, similar to [9], as a baseline inference algorithm for comparison. Finally we compare the two algorithms on a non-negative matrix factorization task involving the Psychological Review, NIPS, KOS and New York Times document corpora.

Related Work. To our knowledge there has been no previous exposition of an explicit recursive “stick-breaking”-like construction of the gamma CRM, and by extension no instance of variational algorithms for such priors. The very general inverse Lévy measure algorithm of [13] requires inversion of the exponential integral, as does the generalized CRM construction technique of [14] when applied to the gamma process; since the closed form solution of the inverse of an exponential integral is not known, these techniques do not give us an analytic construction of the weights, and hence cannot be adapted to variational techniques in a straightforward manner. Other constructive definitions of the gamma process include [15], who discusses a sampling-based scheme for the weights of a gamma process by sampling from a Poisson process. Further, the characterization of the Dirichlet process as a normalized gamma process may possibly be utilized for sampling gamma process weights, but to our knowledge no existing methods for variational inference employ these approaches. As an alternative to gamma process-based models for count modeling, recent research has examined the negative binomial-beta process and its variants [16, 17, 18]; the stick-breaking construction of [9] readily extends to such models since they have beta process priors. The beta stick-breaking construction has also been used for variational inference in beta-Bernoulli process priors [11], though they have scalability issues when applied to the count modeling problems addressed in this work, as we show in the experimental section.

## 2 Background

### 2.1 Completely random measures

A completely random measure [19, 20] on a space is defined as a stochastic process on such that for any two disjoint Borel subsets in , the random variables are independent. The canonical way of constructing a completely random measure is to first take a -finite product measure , then draw a countable set of points from a Poisson process on a Borel -algebra on with as the rate measure. Then the CRM is constructed as , where the measure given to a measurable Borel set . In this notation are referred to as weights and the as atoms.

If the rate measure is defined on as , where is an arbitrary finite continuous measure on and is some constant (or function of ), then the corresponding CRM constructed as above is known as a beta process. If the rate measure is defined as , with the same restrictions on and , then the corresponding CRM constructed as above is known as the gamma process. The total mass of the gamma process , is distributed as . The improper distributions in these rate measures integrate to infinity over their respective domains, ensuring a countably infinite set of points in a draw from the Poisson process. For the beta process, the weights are in [0,1], whereas for the gamma process they are in . In both cases however the sum of the weights is finite, as can be seen from Campbell’s theorem [19], and is governed by and the total mass of the base measure on . For completeness we note that completely random measures as defined in [19] have three components: a set of fixed atoms, a deterministic measure (usually assumed absent), and a random discrete measure. It is this third component that is explicitly generated using a Poisson process, though the fixed component can be readily incorporated into this construction [21].

If we create an atomic measure by normalizing the weights from the gamma process, i.e. where , then is known as a Dirichlet process [22], denoted as where . It is not a CRM as the random variables induced on disjoint sets lack independence because of the normalization; it belongs to the class of normalized random measures with independent increments (NRMIs).

### 2.2 Stick-breaking for the Dirichlet and Beta Processes

A recursive way to generate the weights of random measures is given by stick-breaking, where a unit interval is subdivided into fragments based on draws from suitably chosen distributions. For example, the sick-breaking construction of the Dirichlet process [7] is given by

 D=∞∑i=1Vii−1∏j=1(1−Vj)δωi,

where . Here the length of the first break from a unit-length stick is given by . In the next round, a fraction of the remaining stick of length is broken off, and we are left with a piece of length . The length of the piece in the next round is therefore given by , and so on. Note that the weights belong to (0,1), and since this is a normalized measure, the weights sum to 1 almost surely. This is consistent with the use of the Dirichlet process as a prior on probability distributions.

This construction was generalized in [9] to yield stick-breaking for the beta process:

 B=∞∑i=1Ci∑j=1V(i)iji−1∏l=1(1−V(l)ij)δωij, (1)

where . We use this representation as the basis for our stick breaking-like construction of the Gamma CRM, and use Poisson process-based proof techniques similar to [23] to derive the rate measure.

## 3 The Stick-breaking Construction of the Gamma Process

### 3.1 Constructions and proof of correctness

We propose a simple recursive construction of the gamma process CRM, based on the stick-breaking construction for the beta process proposed in [9, 23]. In particular, we augment (or ‘mark’) a slightly modified stick-breaking beta process with an independent gamma-distributed random measure and show that the resultant Poisson process has the rate measure as defined above. We show this by directly deriving the rate measure of the marked Poisson process using product distribution formulae. Our proposed stick-breaking construction is as follows:

 G=∞∑i=1Ci∑j=1G(i)ijV(i)iji∏l=1(1−V(l)ij)δωij, (2)

where . As with the beta process stick-breaking construction, the product of beta random variables allows us to interpret each as corresponding to a stick that is being broken into an infinite number of pieces. Note that the expected weight on an atom in round is . The parameter can therefore be used to control the weight decay cadence along with .

The above representation provides the clearest view of the construction, but is somewhat cumbersome to deal with in practice, mostly due to the introduction of the additional gamma random variable. We reduce the number of random variables by noting that the product of a random variable has an distribution; we also perform a change of variables on the product of the s to arrive at the following equivalent construction, for which we now prove its correctness:

###### Theorem 1.

A gamma CRM with positive concentration parameters and finite base measure may be constructed as

 G=∞∑i=1Ci∑j=1Eije−Tijδωij (3)

where .

###### Proof.

Note that, by construction, in each round in (3), each set of weighted atoms forms a Poisson point process since the are drawn from a Poisson() distribution. In particular, each of these sets is a marked Poisson process [21], where the atoms of the Poisson process on are marked with the random variables that have a probability measure on . The superposition theorem of [21] tells us that the countable union of Poisson process is itself a Poisson process on the same measure space; therefore denoting , we can say is a Poisson process on . We show below that the rate measure of this process equals that of the Gamma CRM.

Now, we note that the random variable has a probability measure on ; denote this by . We are going to mark the underlying Poisson process with this measure. The density corresponding to this measure can be readily derived using product distribution formulae. To that end, ignoring indices, if we denote , then we can derive its distribution by a change of variable. Then, denoting we can use the product distribution formula to write the density of as

 fQ(q)=1∫0αiΓ(i)(−logw)i−1wα−2ce−cqwdw,

where . Formally speaking, this is the Radon-Nikodym density corresponding to the measure , since it is absolutely continuous with respect to the Lebesgue measure on and -finite by virtue of being a probability measure. Furthermore, these conditions hold for all the measures that we have in our union of marked Poisson processes; this allows us to write the density of the combined measure as

 f(p) =∞∑i=11∫0αiΓ(i)(−logw)i−1wα−2ce−cpwdw =1∫0∞∑i=1αiΓ(i)(−logw)i−1wα−2ce−cpwdw by monotone convergence =1∫0αw−2ce−cpwdw =αp−1e−cp =cp−1e−cpαc

. Note that the measure defined on by the “improper” gamma distribution is -finite, in the sense that we can decompose into the countable union of disjoint intervals , each of which has finite measure. In particular, the measure of the interval is given by the exponential integral.

Therefore the rate measure of the process G as constructed here is where is the same as up to the multiplicative constant , and therefore satisfies the finiteness assumption imposed on . ∎

We use the form specified in the theorem above in our variational inference algorithm since the variational distributions on almost all the parameters and variables in this construction lend themselves to simple closed-form exponential family updates. As an aside, we note that the random variables have a distribution; therefore if we denote then the construction in (2) is equivalent to

 G=∞∑i=1Ci∑j=1E(i)iji∏l=1U(l)ijδωij,

where . This notation therefore relates our construction to the stick-breaking construction of the Indian Buffet Process [8], where the Bernoulli probabilities are generated as products of iid random variables : . In particular, we can view our construction as a generalization of the IBP stick-breaking, where the stick-breaking weights are multiplied with independent Exp() random variables, with the summation over providing an explicit Poissonization.

### 3.2 Truncation analysis

The variational algorithm requires a truncation level for the number of atoms for tractability. Therefore we need to analyze the closeness between the marginal distributions of the data drawn from the full prior and the truncated prior, with the stick-breaking prior weights integrated out. Our construction leads to a simpler truncation analysis if we truncate the number of rounds (indexed by in the outer sum), which automatically truncates the atoms to a finite number. For this analysis, we will use the stick-breaking gamma process as the base measure of a Poisson likelihood process, which we denote by ; this is precisely the model for which we develop variational inference in the next section. If we denote the gamma process as , with as the recursively constructed weights, then can be written as . Under this model, we can obtain the following result, which is analogous to error bounds derived for other nonparametric models [12, 10, 11] in the literature.

###### Theorem 2.

Let N samples be drawn from . If , the full gamma process, then denote the marginal density of . If is a gamma process truncated after rounds, denote the marginal density of . Then

 14∫|m∞(X)−mR(X)|dX≤1−exp{−Nγαc(α1+α)R}.
###### Proof.

The starting intuition is that if we truncate the process after R rounds, then the error in the marginal distribution of the data will depend on the probability of positive indicator values appearing for atoms after the round in the infinite version. Combining this with ideas analogous to those in [ish_james_2000] and [12], we get the following bound for the difference between the marginal distributions:

 14∫|m∞(X)−mR(X)|dX≤P{∃(k,j),k>R∑r=1Cr,1≤n≤N s.t. Xn(ωkj)>0}.

Since we have a Poisson likelihood on the underlying gamma process, this probability can be written as

 P(⋅)=1−E⎡⎢⎣E⎧⎨⎩(∞∏r=R+1Cr∏j=1e−πrj)N∣∣∣Cr⎫⎬⎭⎤⎥⎦,

where . We may then use Jensen’s inequality to bound it as follows:

 P(⋅) ≤1−exp[N∞∑r=R+1E{Cr∑j=1log(e−πrj)}] =1−exp[Nγ1c∞∑r=R+1(α1+α)r] =1−exp{−Nγαc(α1+α)R}.

## 4 Variational Inference

As discussed in Section 3.2, we will focus on the infinite Gamma-Poisson model, where a gamma process prior is used in conjunction with a Poisson likelihood function. When integrating out the weights of the gamma process, this process is known to yield a nonparametric prior for sparse, infinite count matrices [3]. We note that our approach should easily be applicable to other models involving gamma process priors.

### 4.1 The Model

To effectively perform variational inference, we re-write as a single sum of weighted atoms, using indicator variables for the rounds in which the atoms occur, similar to [9]:

 G=∞∑k=1Eke−Tkδωk, (4)

where . We also place gamma priors on . Denoting the data, the latent prior variables and the model hyperparameters by respectively, the full likelihood may be written as . We truncate the infinite gamma process to atoms, and take to be the total number of datapoints. denotes the set of the latent variables excluding those from the Poisson-Gamma prior; for instance, in factor analysis for topic models, this contains the Dirichlet-distributed factor variables (or topics).

From the Poisson likelihood, we have , independently for each . The distributions of involve the indicator functions on the round indicator variables :

 P(Tk|dk,α)=ανk(0)∏r≥1Γ(r)\mathbbm1(dk=r)Tνk(1)ke−αTk,

where , and

See [11] for a discussions on how to approximate these factors in the variational algorithm.

### 4.2 The Variational Prior Distribution

Mean-field variational inference involves minimizing the KL divergence between the model posterior, and a suitably constructed variational distribution which is used as a more tractable alternative to the actual posterior distribution. To that end, we propose a fully-factorized variational distribution on the Poisson-Gamma prior as follows:

 Q=q(α)⋅q(γ)⋅q(c)⋅K∏k=1q(Ek)⋅q(Tk)⋅q(dk)⋅N∏n=1q(znk),

where .

Instead of working with the actual KL divergence between the full posterior and the factorized proxy distribution, variational inference maximizes what is canonically known as the evidence lower bound (ELBO), a function that is the same as the KL divergence up to a constant. In our case it may be written as . We omit the full representation here for brevity.

### 4.3 The Variational Parameter Updates

Since we are using exponential family variational distributions, we leverage the closed form variational updates for exponential families wherever we can, and perform gradient ascent on the ELBO for the parameters of those distributions which do not have closed form updates. We list the updates on the distributions of the prior below. The closed-form updates for the hyperparameters in are as follows:

 κ2=K∑k=1EQ(Tk)+a2,ρ1=c1+K,ρ2=K∑k=1EQ(Ek)+c2, τ1=b1+K,τ2=∑r≥1{1−K∏k=1r−1∑´r=1φk(´r)}+b2.

The updates for the multinomial probabilities in are given by:

 φk(r)∝exp{rEQ(logα)−logΓ(r)+(r−1)EQ(logTk)−ζ⋅∑i≠kφi(r)−EQ(γ)r∑j=2∏k′≠kj−1∑r′=1φk′(r′)}.

In addition to these updates, our variational algorithm requires gradient ascent updates on and updates on as follows:

The gradients for the two variational parameters in are:

 ∂L∂´uk=∑r≥1(r−1)φk(r)ψ′(´uk)−EQ(α)´υk−N∑n=1EQ(Ek)(´υk´υk+1)´uk⋅log(´υk´υk+1)−N∑n=1EQ(znk)1´υk−(´uk−1)ψ′(´uk)−1
 ∂L∂´υk=−∑r≥1(r−1)φk(r)1´υk+EQ(α)´uk(´υk)2−N∑n=1EQ(Ek)´uk´υk´uk−1(´υk+1)´uk+1+N∑n=1EQ(znk)´uk(´υk)2−1´υk.

For the topic modeling problems, we model the observed vocabulary-vs-document corpus count matrix as , where the matrix models the factor loadings, and the models the actual factor counts in the documents. We put the truncated Poisson-Gamma prior on , and put a Dirichlet prior on the columns of .

The variational distribution consequently gets a Dirichlet distribution multiplied to it, where are the variational Dirichlet hyperparameters. This setup does not immediately lend itself to closed form updates for the -s, so we resort to gradient ascent. The gradient of the ELBO with respect to each variational hyperparameter is

In practice however we found a closed-form update facilitated by a simple lower bound on the ELBO to converge faster. We describe the update here. First note that the part of the ELBO relevant to a potential closed form variational update of can be written as

 L=−ϕvk⋅∑nEQ(znk)+∑ndvn⋅logϕvk+⋯,

which can then be lower bounded as

 L≥logϕvk⋅(−∑nEQ(znk)+∑ndvn)+⋯.

This allows us to analytically update as . This frees us from having to choose appropriate corpus-specific initializations and learning rates for the s.

A similar lower bound on the ELBO allows us to update the variational parameters of as .

## 5 The MCMC Sampler

As a baseline, we also derive and compare with a standard MCMC sampler for this model. We use the construction in (4) for sampling from the model. To avoid inferring the latent variables in all the atom weights of the Poisson-Gamma prior, we use Monte Carlo techniques to integrate them out, as in [9]. This affects posterior inference for the indicators , the round indicators and the hyperparameters . The posterior distribution for is closed form, as are those for the likelihood latent variables in . We re-write the construction of the Poisson-Gamma prior:

 G=∞∑k=1Eke−Tkδωk,

. We put improper priors on and , and a noninformative Gamma prior on . The indicator counts are given by . To avoid sampling the atom weights , we integrate them out using Monte Carlo techniques in the sampling steps for the prior.

### 5.1 Sampling the round indicators

The conditional posterior for the round indicators can be written as

 p(dk=i|{dl}k−1l=1,{Znk}Nn=1,α,c,γ)∝p({Znk}Nn=1|dk=i,α,c)p(dk=i|{dl}k−1l=1).

For the first factor, we collapse out the stick-breaking weights and approximate the resulting integral using Monte-Carlo techniques as follows:

 p({Znk}Nn=1|dk=i,α,c) =∫[0,∞]iN∏n=1Pois(Znk|gk)dG ≈1SS∑s=1N∏n=1Pois(Znk|g(s)k),

where . Here is the number of simulated samples from the integral over the stick-breaking weights. We take in our experiments.

The second factor is the same as [9]:

 p(dk=d|γ,{dl}k−1l=1)=⎧⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪⎨⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪⎩0if % d

Here . Normalizing the product of these two factors over all is infeasible, so we evaluate this product for increasing till it drops below , and normalize over the gathered values.

### 5.2 Sampling the factor variables

Here we consider the Poisson factor modeling scenario that we use to model vocabulary-document count matrices. Recall that a count matrix is modeled as , where the matrix models the factor loadings, and the models the actual factor counts in the documents.. We put the Poisson-Gamma prior on and symmetric Dirichlet priors on the columns of . The sampling steps for and are described next.

#### 5.2.1 Sampling Φ

First note that the elements of the count matrix are modeled as , which can be equivalently written as . Standard manipulations then allow us to sample the ’s from where .

Now we have . Using standard relationships between Poisson and multinomial distributions, we can derive the posterior distribution of the ’s as .

#### 5.2.2 Sampling Z

In our algorithm we sample each conditioned on all the other variables in the model; therefore the conditional posterior distribution can be written as

 p(znk|D,Φ,Zn,−k,d,α,c,γ) =p(D|Zn,Φ)p(znk|d,α,c,Zn,−k)

The distributions in both the numerator and denominator of the second factor can be sampled from using the Monte Carlo techniques described above, by integrating out the stick-breaking weights.

### 5.3 Sampling hyperparameters

As mentioned above, we put a noninformative Gamma prior on and improper (1) priors on and . The posterior sampling steps are described below:

#### 5.3.1 Sampling γ

Given the round indicators , we can recover the round-specific atom counts as described above. Then the conjugacy between the Gamma prior on and the Poisson distribution of gives us a closed form posterior distribution for : .

#### 5.3.2 Sampling α

The conditional posterior distribution of may be written as:

 p(α|Z,d,c)∝p(α)N∏n=1K∏k=1p(Z|d,α,c).

We calculate the posterior distribution of using Monte Carlo techniques as described above. Then we discretize the search space for around its current values as , where the lower and upper bounds and are chosen so that the unnormalized posterior falls below . The search space is also clipped below at 0. is then drawn from a multinomial distribution on the search values after normalization.

#### 5.3.3 Sampling c

We sample in exactly the same way as . We first write the conditional posterior as

 p(c|Z,d,α)∝p(c)N∏n=1K∏k=1p(Z|d,α,c).

The search space is then discretized using appropriate upper and lower bounds as above, and is sampled using Monte Carlo techniques. is then drawn from a multinomial distribution on the search values after normalization.

## 6 Experiments

We consider the problem of learning latent topics in document corpora. Given an observed set of counts of vocabulary words in a set of documents, represented by say a count matrix, where is the vocabulary size and the number of documents, we aim to learn latent factors and their vocabulary realizations using Poisson factor analysis. In particular, we model the observed corpus count matrix as , where the matrix models the factor loadings, and the models the actual factor counts in the documents. As a baseline, we also derive and compare with a standard MCMC sampler for this model. We use the construction in (4) for sampling from the model. To avoid inferring the latent variables in all the atom weights of the Poisson-Gamma prior, we use Monte Carlo techniques to integrate them out, as in [9]. This affects posterior inference for the indicators , the round indicators and the hyperparameters . The posterior distribution for is closed form, as are those for the likelihood latent variables in . The complete updates are described in the supplementary.

We implemented and analyzed the performance of three variational algorithms corresponding to three different priors on : the Poisson-gamma process prior from this paper (abbreviated hereafter as VGP), the Bernoulli-beta prior from [11] (VBP) and the IBP prior from [10] (VIBP), along with the MCMC sampler mentioned above (SGP). For the Bernoulli-beta priors we modeled as as in [11], where the nonparametric priors are put on and a vague Gamma prior is put on . For the VGP and SGP models we set . In addition, for all four algorithms, we put a symmetric Dirichlet prior on the columns of . We added corresponding variational distributions for the variables in the collection denoted as above. We use held-out per-word test log-likelihoods and times required to update all variables in in each iteration as our comparison metrics, with 80% of the data used for training. We used the same likelihood metric as [16], with the samples replaced by the expectations of the variational distributions.

Synthetic Data. As a warm-up, we consider the performances of VGP and SGP on some synthetic data generated from this model. We generate 200 weighted atoms from the gamma prior using the stick-breaking construction, and use the Poisson likelihood to generate 3000 values for each atom to yield the indicator matrix . We simulated a vocabulary of 200 terms, generated a 200200 factor-loading matrix using symmetric Dirichlet priors, and then generated . For the VGP, we measure the test likelihood after every iteration and average the results across 10 random restarts. These measurements are plotted in fig.0(a). As shown, VGP’s measured heldout likelihood converges within 10 iterations. The SGP traceplot shows the first thirty heldout likelihoods measured after burn-in. Per-iteration times were 15 seconds and 2.36 minutes for VGP (with =125) and SGP respectively. The SGP learned online, with values oscillating around 50. SNBP refers to the Poisson-Gamma mixture (“NB process”) sampler from [16]. Its traceplot shows the first 30 likelihoods measured after 1000 burn-in iterations. We see that it performed similarly to our algorithms, though slightly worse.

Real data. We used a similar framework to model the count data from the Psychological Review (PsyRev), NIPS222http://www.stats.ox.ac.uk/ teh/data.html, KOS and New York Timesfootnotemark: corpora. The vocabulary sizes are 2566, 13649, 6906 and 100872 respectively, while the document counts are 1281, 1740, 3430 and 300000 respectively. For each dataset, we ran all three variational algorithms with 10 random restarts each, measuring the held-out log-likelihoods and per-iteration runtimes for different values of the truncation factor . The learning rates for gradient ascent updates were kept on the order of for both VGP and VBP, with 5 gradient steps per iteration. A representative subset of results is shown in figs.0(b) through 0(f).

We used vague gamma priors on the hyperparameters in the variational algorithms, and improper (1) priors for the sampler. We found the test likelihoods to be independent of these initializations. The results for the variational algorithms were dependent on the Dirichlet prior on , as noted in fig.0(b). We therefore used the learned test likelihood after 100 iterations as a heuristic to select . We found the three variational algorithms to attain very similar test likelihoods across all four datasets after a few hours of CPU time, with the VGP and VBP having a slight edge over the VIBP. The sampler somewhat unexpectedly did not attain a competitive score for any dataset, unlike the synthetic case. For instance, as shown in fig.0(c), it oscillated around -7.45 for the PsyRev dataset, whereas the variational algorithms attained -7.23. For comparison, the NB process sampler from [16] attains -7.25 each iteration after 1000 iterations of burn-in. Also as seen in fig.0(c), VGP was faster to convergence (in less than 10 iterations in 5 seconds) than VIBP and VBP (50 iterations each). The test log-likelihoods after a few hours of runtime were largely independent of the truncation for the three variational algorithms. Behavior for the other datasets was similar.

Among the three variational algorithms, the VIBP scaled best for small to medium datasets as a function of the truncation factor due to all updates being closed-form, in spite of having to learn the additional weight matrix . The VGP running times were competitive for small values of for these datasets. However, in the large NYT dataset, VGP was orders of magnitude faster than the Bernoulli-beta algorithms (note the log-scale in fig.0(f)). For example, with a truncation of 100 atoms, VGP took around 45 seconds per iteration, whereas both VIBP and VBP took more than 3 minutes. The VBP scaled poorly for all datasets, as seen in figs.0(d) through 0(f). The reason for this is three-fold: learning the parameters for the additional matrix which is directly affected by dimensionality (also the reason for VIBP being slow for NYT dataset), gradient updates for two variables (as opposed to one for VGP) and a Taylor approximation required for these gradient updates (see [11]). The sampler SGP required around 7 minutes per iteration for the small datasets and an hour and 40 minutes on average for NYT.

To summarize, we found the VGP to post running times that are competitive with the fastest algorithm (VIBP) in small to medium datasets, and outperform the other methods completely in the large NYT dataset, all the while providing similar accuracy compared to variational algorithms for similar models, as measured by held-out likelihood. It was also the fastest to converge, typically taking less than 15 iterations. Compared with SGP, our variational method is substantially faster (particularly on large-scale data) and produces higher likelihood scores on real data.

## 7 Conclusion

We have described a novel stick-breaking representation for gamma processes and used it to derive a variational inference algorithm. This algorithm has been shown to be far more scalable for large datasets than variational algorithms for related models while attaining similar accuracy, and also outperforms sampling-based methods. We expect that recent improvements to variational techniques can also be applied to our algorithm, potentially yielding even further scalability.

## References

• [1] F. Caron and E. B. Fox. Bayesian Nonparametric Models of Sparse and Exchangeable Random Graphs, 2013. arXiv:1401.1137.
• [2] F. Caron, Y. W. Teh, and B. T. Murphy. Bayesian Nonparametric Plackett-Luce models for the Analysis of Clustered Ranked Data, 2013. arXiv:1211.5037.
• [3] M. Titsias. The Infinite Gamma-Poisson Model. In Advances in Neural Information Processing Systems, 2008.
• [4] D. Blei and M. Jordan. Variational methods for Dirichlet process mixtures. Bayesian Analysis, 1:121–144.
• [5] Y.W. Teh, K. Kurihara, and M. Welling. Collapsed variational inference for HDP. In NIPS, 2007.
• [6] C. Wang, J. Paisley, and D. Blei. Online variational inference for the hierarchical Dirichlet process. In AISTATS, 2011.
• [7] J. Sethuraman. A constructive definition of dirichlet priors. Statistica Sinica, 4:639–650, 1994.
• [8] Y. W. Teh, D. Görür, and Z. Ghahramani. Stick-breaking construction for the Indian buffet process. In Proceedings of the International Conference on Artificial Intelligence and Statistics, volume 11, 2007.
• [9] J. Paisley, A. Zaas, C. W. Woods, G. S. Ginsburg, and L. Carin. A Stick-Breaking Construction of the Beta Process. In International Conference on Machine Learning, 2010.
• [10] F. Doshi-Velez, K. Miller, J. Van Gael, and Y. W. Teh. Variational Inference for the Indian Buffet Process. In AISTATS, 2009.
• [11] J. Paisley, L. Carin, and D. M. Blei. Variational Inference for Stick-Breaking Beta Process Priors. In International Conference on Machine Learning, 2011.
• [12] H. Ishwaran and L. F. James. Gibbs Sampling Methods for Stick-Breaking Priors. Journal of the American Statistical Association, 96:161–173, 2001.
• [13] R. Wolpert and K. Ickstadt. Simulation of Lévy Random Fields. In Practical Nonparametric and Semiparametric Bayesian Statistics. Springer-Verlag, 1998.
• [14] P. Orbanz and S.Williamson. Unit-rate Poisson representations of completely random measures.
• [15] R.J. Thibaux. Nonparametric Bayesian Models for Machine Learning. PhD thesis, University of California at Berkeley, 2008.
• [16] M. Zhou and L. Carin. Augment-and-conquer negative binomial processes. In NIPS, 2012.
• [17] M. Zhou, L. Hannah, D. Dunson, and L. Carin. Beta-negative binomial process and Poisson factor analysis. In AISTATS, 2012.
• [18] T. Broderick, L. Mackey, J. Paisley, and M. I. Jordan. Combinatorial clustering and the beta negative binomial process. arXiv, 2011.
• [19] J.F.C. Kingman. Completely Random Measures. Pacific Journal of Mathematics, 21(1):59–78, 1967.
• [20] M. I. Jordan. Hierarchical Models, Nested Models and Completely Random Measures. In M.-H. Chen, D. Dey, P. Mueller, D. Sun, and K. Ye, editors, Frontiers of Statistical Decision Making and Bayesian Analysis: In Honor of James O. Berger. New York: Springer, 2010.
• [21] J. F. C. Kingman. Poisson Processes, volume 3 of Oxford Studies in Probability. Oxford University Press, New York, 1993.
• [22] T.S. Ferguson. A Bayesian Analysis of Some Nonparametric Problems. The Annals of Statistics, 1(2):209–230, 1973.
• [23] J. Paisley, D. M. Blei, and M. I. Jordan. Stick-Breaking Beta Processes and the Poisson Process. In Artificial Intelligence and Statistics, 2012.
You are adding the first comment!
How to quickly get a good reply:
• Give credit where it’s due by listing out the positive aspects of a paper before getting into which changes should be made.
• Be specific in your critique, and provide supporting evidence with appropriate references to substantiate general statements.
• Your comment should inspire ideas to flow and help the author improves the paper.

The better we are at sharing our knowledge with each other, the faster we move forward.
The feedback must be of minimum 40 characters and the title a minimum of 5 characters