Poisson Latent Feature Calculus for Generalized Indian Buffet Processes

Abstract

The purpose of this work is to describe a unified, and indeed simple, mechanism for non-parametric Bayesian analysis, construction and generative sampling of a large class of latent feature models which one can describe as generalized notions of Indian Buffet Processes (IBP). This is done via the Poisson Process Calculus as it now relates to latent feature models. The IBP, first arising in a Bayesian Machine Learning context, was ingeniously devised by Griffiths and Ghahramani (2005), and its generative scheme is cast in terms of customers sequentially entering an Indian Buffet restaurant and selecting previously sampled dishes as well as new dishes. In this metaphor, dishes correspond to latent features, attributes, or preferences shared by individuals. The IBP, and its generalizations, represent an exciting class of models well suited to handle high dimensional statistical problems now common in this information age. In a survey article, Griffiths and Ghahramani note applications to choice models, modeling protein interactions, independent component analysis and sparse factor analysis, among others. The IBP is based on the usage of conditionally independent Bernoulli random variables, coupled with completely random measures acting as Bayesian priors, that are used to create sparse binary matrices. This Bayesian non-parametric view was a key insight due to Thibaux and Jordan (2007). One way to think of generalizations is to use more general random variables. Of note in the current literature are models employing Poisson and Negative-Binomial random variables. However, unlike for their closely related counterparts, generalized Chinese restaurant processes, the ability to analyze IBP models in a systematic and general manner is not yet available. The limitations are both in terms of knowledge about the effects of different priors and in terms of models based on a wider choice of random variables. This work will not only provide a thorough description of the properties of existing models but also provide a simple template to devise and analyze new models. We close by proposing and analyzing general classes of multivariate processes.

Lancelot F. James
Hong Kong University of Science and Technology

Supported in part by the grant RGC-HKUST 601712 of the HKSAR.

AMS subject classifications: Primary 60C05, 60G09; secondary 60G57, 60E99.

Keywords: Bayesian Statistics, Indian Buffet Process, Latent Feature models, Poisson Process, Statistical Machine Learning

1 Introduction

The purpose of this work is to describe a unified, and indeed simple, mechanism for non-parametric Bayesian analysis of a large class of generative latent feature models which one can describe as generalized notions of Indian Buffet Processes (IBP). This work will not only provide a thorough description of the properties of existing models but also provide a simple template to devise and analyze new models. That is to say, although one of the goals in this article is to promote a particular calculus, the results are presented in a fashion where it is not necessary for the reader interested primarily in implementation and new constructions to have a command of that calculus.

The IBP, and its generalizations, represent an exciting class of models well suited to handle high dimensional problems now common in this information age. These points are articulated in, for instance, [[3], [5], [10], [13], [14], [15], [25], [28], [33], [37], [42], [43], [47], [48], [49]], and for placement in a wider context see [[36]]. The IBP is based on the usage of conditionally independent Bernoulli random variables; one way to think of generalizations is to use more general random variables. Of note in the current literature are models employing Poisson and Negative-Binomial random variables, as described in for instance [[4], [16], [44], [50], [51]]. These are generally coupled with gamma and beta process priors. Even in those specialized cases the properties of these models are not well understood, and the analysis conducted so far requires great care. For instance, it is unclear how replacing a gamma process with a generalized gamma process works in the model of [[44]]. Indeed a generalized gamma process is not conjugate in the Poisson model, but it does have extra flexibility over its gamma process counterpart. This article describes general results for IBP type processes based on variables having any distribution that has mass at zero and otherwise can be discrete or continuous. Examples of models employing continuous variables are given in [[28], [29]]. Furthermore, the results are presented for general prior models based on completely random measures. That is to say, the analysis does not require conjugacy.

The method proposed is via the Poisson Partition Calculus (PPC) devised by the author in [[20], [22]], and applied further in, for instance, [[23], [24]]. The partition calculus refers to a non-combinatorial approach to systematically derive expressions for combinatorial mechanisms arising from Bayesian exchangeable processes whereby are conditionally iid and where is a random probability measure, or more generally a random measure, that can be expressed as a functional of a Poisson random measure One is interested in the posterior distribution of and the marginal distribution of Due to discreteness, the marginal distribution of can be viewed as an analogue of a Blackwell-MacQueen Polya urn scheme, which is the generative sampling scheme corresponding to the marginal joint probability measure induced by setting to be a Dirichlet process random probability measure, and modeling to be iid Furthermore, these structures describe random partitions of by the number of unique values among , which can be viewed as clusters, and the size of clusters based on the tied values. Loosely speaking, these can be described as analogues of Chinese restaurant processes (CRP), most closely associated with the Dirichlet and Pitman-Yor processes, which are used as priors/models in Bayesian non-parametric statistics and employed extensively in Statistical Machine Learning. See for instance [[18], [39], [40]]. We shall provide a schematic for this method as it relates to latent feature models, but first describe more details for the IBP and its generalizations.

1.1 The Indian Buffet Process (IBP)

The basic process, for was ingeniously formulated by Griffiths and Ghahramani [[15], [14]] whereby a random binary matrix is formed with distinct features or attributes labeling an unbounded number of columns, and M rows, where each row represents the attributes/features/preferences possessed by a single individual by entering in the entry if the feature is possessed and otherwise. The matrix naturally indicates which features are shared among individuals. The generative process describing this is cast in terms of individual customers sequentially entering an Indian Buffet restaurant, whereby the -th customer chooses one of the already sampled dishes (indicating that the customer shares features already exhibited by the previous customers) according to their popularity, specifically with probability where denotes the number of customers who have already sampled dish . Otherwise the customer chooses new dishes according to a distribution. A key insight was made by Thibaux and Jordan [[43]], which connected this generative process with a formal Bayesian non-parametric framework where are modelled as iid Bernoulli processes with base measure that is selected to have a Beta process prior distribution whose iid atoms are obtained by specifying a proper probability as its base measure. The generative process is given by the distribution of In other words the basic IBP is generated from a random matrix with entries where, conditioned on the points of a Poisson random measure with mean intensity are independent Bernoulli variables. It follows that one can represent each and the Beta process
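
As a concrete illustration, a minimal simulation of this generative scheme under the standard one-parameter specification (customer i takes an existing dish k with probability m_k/i and then samples a Poisson(alpha/i) number of new dishes) might look as follows; the function name and interface here are illustrative rather than canonical.

```python
import numpy as np

def sample_ibp(M, alpha, rng=None):
    """Simulate the one-parameter Indian Buffet Process for M customers
    with mass parameter alpha.  Returns a binary matrix of shape (M, K),
    where K is the (random) number of dishes sampled."""
    rng = np.random.default_rng(rng)
    dish_counts = []          # m_k: number of customers who tried dish k
    rows = []
    for i in range(1, M + 1):
        # existing dishes: customer i takes dish k with probability m_k / i
        row = [int(rng.random() < m / i) for m in dish_counts]
        # new dishes: a Poisson(alpha / i) number of previously untried dishes
        k_new = rng.poisson(alpha / i)
        row += [1] * k_new
        dish_counts = [m + z for m, z in zip(dish_counts, row)] + [1] * k_new
        rows.append(row)
    Z = np.zeros((M, len(dish_counts)), dtype=int)
    for i, row in enumerate(rows):
        Z[i, :len(row)] = row
    return Z

Z = sample_ibp(M=10, alpha=2.0, rng=0)
```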

Remark 1.1.

The concept of Beta processes was developed in the fundamental paper of Hjort [[17]] as it relates to Bayesian nonparametric problems arising in survival analysis; Kim [[26]] is also an important reference in this regard. Naturally within this context is the related work of Doksum [[8]]. However it is used in a rather different context in the IBP setting. Other uses of a beta process within a Lévy moving process context can be found in James [[22]], which apparently is a precursor to the kernel smoothed Beta process in [[41]]. Spatial notions of Beta and other processes appear in [[22], [23]].

1.2 Generalizations

Using the theory of marked Poisson point processes we can easily describe the generalization of the IBP that includes the models already mentioned above. Simply replace the with more general variables where, conditional on, are independent random variables (possibly vector-valued) with distribution denoted as where are the points of a Poisson random measure with more general Lévy density not restricted to and possibly depending on which reflects dependence on the distribution of the on a space. Importantly, for each are the points of a Poisson random measure with mean intensity

(1.1)

Furthermore we shall assume that can take on the value zero with positive probability. So relative to write and hence

The general construction is where now

(1.2)
Remark 1.2.

Notice that one can always represent where are Bernoulli variables and is a general random variable that is not necessarily independent of So that in (1.2) can be viewed as a weighted Bernoulli process. The dependent representation makes these more general than models of the form considered in [[28], [29]].
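
For intuition, a crude finite-truncation sketch of the construction in (1.2) is given below; the atom locations, weights, truncation level and sampler names are illustrative assumptions, not the paper's notation. The example uses weighted-Bernoulli scores in the spirit of this remark.

```python
import numpy as np

def sample_scores(M, weights, draw_A, rng=None):
    """Finite-truncation sketch of (1.2): individual i receives a (possibly
    zero) score A[i, k] at atom omega_k, so Z_i = sum_k A[i, k] delta_{omega_k}.
    `draw_A(s, rng)` is any sampler with positive mass at zero."""
    rng = np.random.default_rng(rng)
    return np.array([[draw_A(s, rng) for s in weights] for _ in range(M)])

rng = np.random.default_rng(1)
K, alpha = 200, 5.0
weights = rng.beta(alpha / K, 1.0, size=K)   # crude truncation of a beta process
atoms = rng.uniform(size=K)                  # iid atom locations omega_k
# Weighted-Bernoulli scores: b ~ Bernoulli(s), multiplied by a positive
# score (here 1 + Poisson(1)) whenever the feature is exhibited.
A = sample_scores(M=10, weights=weights,
                  draw_A=lambda s, r: r.binomial(1, s) * (1 + r.poisson(1.0)))
# Row i of A, paired with `atoms`, represents the discrete measure Z_i.
```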

Here, key to our exposition, it is important to note that can always be represented as

(1.3)

where is a Poisson random measure with mean intensity

As in James [[22]], takes its values in the space of boundedly finite measures,

Remark 1.3.

It is further noted that can be defined over any Polish space. This may be relevant in applications where is specified in terms of more general parameters, such as [[29]], say. For instance, could correspond to a Normal distribution with mean and variance parameters. Such extensions from a technical point of view are easily handled and will not be discussed explicitly here in the univariate case. However the multivariate extension in section 5 covers this case.

Definition 1.1.

We say that is or more generally and also write when discussing calculations. is by construction a completely random measure and we shall specify its law by saying is or. Using this we shall say that are iid if they have the specifications in (1.2) and call the marginal distribution of or equivalently

Remark 1.4.

Formally we choose to satisfy. However we allow both infinite activity models corresponding to and finite activity/compound models where. In the latter case is a proper density.

1.3 Outline

Notice that in (1.2) the only difference in the setup is the choice of the distribution on. Obviously different choices of lead to different modeling capabilities and interpretations. We note further that the PPC is a method that does not rely on conjugacy. That is to say, if it can be applied for one choice of it can be applied for all choices of. In the forthcoming section we shall provide a schematic to derive the posterior distribution and related quantities for any and any. We will then state formally the posterior distribution for and , the marginal distribution of and the distributions of. We shall also introduce classes of iid variables which can provide a representation for the marginal distribution of, hence all that can be implemented for most choices of. This is one of the key elements to describe the appropriate analogues of the Indian Buffet generative process that can be implemented broadly. We note that in order for this to apply, must at minimum admit a positive mass at or must be taken to be a finite measure. Overall we shall show how to do the following for any, without taking limits, without guesswork, without combinatorial arguments, and lastly without long proofs. We shall then show details for all of the Bernoulli, Poisson, and Negative-Binomial choices mentioned above.

  • Generally apply the Poisson Calculus.

  • Describe

  • Hence

  • Describe the marginal structure via integration.

  • Provide descriptions for the marginal process as

    that via the introduction of variables leads to tractable sampling schemes for many

  • Describe

  • These last two points lead to rather explicit descriptions of the marginal process that apply for many and hence lead to generative schemes that generalize the IBP feature selection scheme.

In section 5 we will describe and provide parallel details for a large class of multivariate models. Much of what has been described in the univariate case carries over quite clearly to the multivariate case. We shall also highlight the use of a beta-Dirichlet process described in [[27]] as a natural extension of a Beta process in this setting.

2 Poisson Latent Feature Calculus Schematic

One should first note that Poisson random measures are the building blocks for most discrete random processes used in the Bayesian Nonparametrics literature. Certainly this is true for any case where one employs completely random measures.

2.1 Basic updating operations using the PPC

For our purposes here we will need two ingredients of the PPC that can be found as Propositions 2.1 and 2.2 in James [[22], p. 6-8] [see also [[20], p. 7-8]]. These represent direct extensions of updating mechanisms for the Dirichlet and gamma processes employed in [[30], [31], [32]] and also [[19], [21]]. Note again that

(2.1)

The Laplace functional exponential change of measure

First, [[22], Proposition 2.1] gives a Laplace functional exponential change of measure, which describes updating through changing the mean measure to

(2.2)

where in our setting and where

This action changes from to
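
For readers unfamiliar with this operation, the generic statement (in notation that may differ from the paper's) is the standard Poisson Laplace functional identity together with the resulting exponential tilt of the intensity:

```latex
% Generic reminder: for a Poisson random measure N with mean intensity \nu
% and non-negative measurable f,
\[
  \mathbb{E}\bigl[e^{-N(f)}\bigr]
    = \exp\!\Bigl(-\int\bigl(1-e^{-f(w)}\bigr)\,\nu(dw)\Bigr),
  \qquad N(f):=\int f(w)\,N(dw),
\]
% and tilting the law of N by e^{-N(f)}/\mathbb{E}[e^{-N(f)}] yields again a
% Poisson random measure, now with the exponentially tilted intensity
\[
  \tilde{\nu}(dw)=e^{-f(w)}\,\nu(dw).
\]
```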

Disintegration involving the moment measure: decomposing

Next, using apply the moment measure calculation appearing in [[22], Proposition 2.2].

(2.3)

expressed over uniquely picked points from. In other words, conditioned on, the distribution of these points is determined by the joint measure

There are two important things to notice about (2.3). First is the meaning of which corresponds to the law of the random measure

(2.4)

where is also. The quantity above is merely a decomposition of in terms of the unique points picked and those remaining. It says, remarkably, and in fact it is well known, that maintains the same law as. This translates into the following result, for any measurable function

Furthermore the marginal measure

only depends on the unique values and not the multiplicities that might be associated with them. For a proof of this see James [[22], Proposition 5.2].
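
The single-point version of this disintegration is the classical Mecke/Palm formula, recalled here in generic notation as an aid to the reader:

```latex
% For a Poisson random measure N with mean intensity \nu and any
% non-negative measurable functional g,
\[
  \mathbb{E}\Bigl[\int g(w,N)\,N(dw)\Bigr]
    = \int \mathbb{E}\bigl[g(w,\,N+\delta_{w})\bigr]\,\nu(dw),
\]
% so picking a point out of N leaves behind a measure with the same Poisson
% law plus a fixed atom at the picked location; the paper's Proposition 2.2
% is the multi-point analogue over the distinct picked points.
```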

2.2 The joint distribution

The form of the relevant likelihood can be deduced from the discussion above. However, it is perhaps instructive to first write it in a form similar to that used in [[51], Appendix A.1]

which, setting can be written as

It then follows, by arguments similar to [[51], Appendix A.1] and augmenting, that one can write the joint distribution of, where are the unique jumps, as

(2.5)

where

Now apply the exponential change of measure in (2.2) to obtain

Noticing that,

this operation results in the law of changing from to, that is, is changed to

Now apply the disintegration in (2.3) to obtain the desired final form of the joint distribution

(2.6)

or

(2.7)

where

(2.8)

and

(2.9)
Remark 2.1.

Note that in order to emphasize a notion of partial conjugacy, for any Lévy density and we could without loss of generality define a Lévy density

which we would assign to the prior measure. Note of course that. Hence it follows that for and that

We will use some variant of this for illustration in the special cases of the Bernoulli, Poisson and Negative-Binomial models.

Remark 2.2.

It is to be emphasized that once a quantity such as (2.6) has been calculated, it contains all the ingredients for a formal posterior analysis. What remains are arguments via Bayes' rule and finite-dimensional refinements involving particular choices of etc.

2.3 Describing posterior and marginal quantities

Now from (2.4) and (2.6) one can immediately deduce that for fixed that corresponds to the case where is hence can be expressed as where is a. What remains to complete the posterior distribution of is to find the distribution of given. Integrating out in (2.6) leads to the joint distribution of

(2.10)

Note this is now in the usual finite dimensional Bayesian setup. Hence are conditionally independent with density,

(2.11)

and the marginal of is

(2.12)
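
In generic notation (not necessarily the paper's), the finite-dimensional Bayes step behind (2.11) reads as follows, with the Lévy density written as \rho and the conditional score distribution as G_A(\cdot\mid s):

```latex
% Illustrative notation only: the jump s_k attached to the k-th distinct
% feature has conditional density
\[
  \eta\bigl(s \mid A_{1,k},\ldots,A_{M,k}\bigr)
    \;\propto\; \Bigl[\prod_{i=1}^{M} G_{A}\bigl(A_{i,k}\mid s\bigr)\Bigr]\rho(s),
\]
% where indices i with A_{i,k}=0 contribute the "feature not exhibited"
% factor G_A(0\mid s).
```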

2.4 The general posterior distribution

We now summarize our results. Throughout we use,

where. Note that depends on and so will differ over various models. Since it should be clear from context, we shall suppress this dependence within the notation.

Theorem 2.1.

Suppose that are iid is Then

  1. The posterior distribution of is equivalent to the distribution of

    where is and the distribution of the conditionally independent is given by (2.11).

  2. The posterior distribution of is equivalent to the distribution of

    (2.13)

    where is

  3. The marginal distribution of is given by (2.12).

This immediately leads to the next result.

Proposition 2.1.

Suppose that are iid, where is. Then, from the results in Theorem 2.1, equation (2.13) shows that the distribution of is equivalent to the distribution of

where is and each has distribution and the marginal distribution of is specified by (2.11). That is to say has the distribution are the already chosen features.

Remark 2.3.

It should be evident that it is easy to make adjustments for the case where the are independent rather than iid. Furthermore semi-parametric models may be treated as in [[21], [19]], see also [[22]].

2.5 Details for the Bernoulli, Poisson, and Negative-Binomial IBP type processes

Before we continue with a general discussion, we take a look at specifics in the Poisson, Bernoulli, and Negative-Binomial cases. Note that while we can just employ Theorem 2.1 and Proposition 2.1 directly and list out results, we provide some specific developments of those general arguments in these special cases for illustrative purposes. Note that if corresponds to a Poisson, Bernoulli, or Negative-Binomial random variable, denoted as , then respectively the probability mass function with arguments is, in the Poisson case,

In the Bernoulli case,

and in the Negative-Binomial case

Hence in the respective cases takes the values

where in the latter two cases. Furthermore, given in (2.8) takes the respective forms

Now setting it follows that takes the form
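
For reference, standard parametrizations of the three probability mass functions referred to above are recalled here; the paper's arguments and parametrizations may differ.

```latex
\[
  \mathrm{Poisson}(\lambda):\;
    \mathbb{P}(A=a)=\frac{\lambda^{a}e^{-\lambda}}{a!},\qquad
  \mathrm{Bernoulli}(p):\;
    \mathbb{P}(A=a)=p^{a}(1-p)^{1-a},
\]
\[
  \mathrm{NegativeBinomial}(r,p):\;
    \mathbb{P}(A=a)=\frac{\Gamma(a+r)}{a!\,\Gamma(r)}\,p^{a}(1-p)^{r},
\]
% with a = 0,1,2,... in the Poisson and Negative-Binomial cases and
% a in {0,1} in the Bernoulli case.
```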

Note that to simplify notation we shall do the remaining calculations for homogeneous. Readers can easily make the adjustments to the case of. First, some notation and known facts are introduced.

Remark 2.4.

It is important to note that the points and notation described next can be found in Gnedin and Pitman [[12]] and James [[23]], which alludes to a connection between the Poisson, Bernoulli, and Negative-Binomial IBP models and Regenerative Compositions/Neutral to the Right processes. Many explicit calculations and examples for these IBP models can be deduced from those works, and we encourage the reader to consult them for such details.

Let be an infinitely divisible random variable where are the ranked jumps of a subordinator determined by the Lévy density; then it follows that where

From one can construct an infinitely divisible random variable whose points are determined by a Lévy density

and there is also the converse relation

Furthermore

which is equivalent to
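
For the reader's convenience, the standard subordinator identities implicit in this passage are, in generic notation:

```latex
% If T = \sum_{k\ge1}\Delta_k is infinitely divisible with ranked jumps
% coming from a Levy density \rho on (0,\infty), then
\[
  \mathbb{E}\bigl[e^{-\lambda T}\bigr]=e^{-\psi(\lambda)},\qquad
  \psi(\lambda)=\int_{0}^{\infty}\bigl(1-e^{-\lambda s}\bigr)\rho(s)\,ds,
\]
% and exponentially tilting the jumps by e^{-\lambda s} yields another
% infinitely divisible variable whose Levy density is e^{-\lambda s}\rho(s).
```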

Remark 2.5.

Note that taking gives the homogeneous beta process, and it can be read from Gnedin [[11]] that for an integer

which is the term appearing in in the Bernoulli IBP case under a beta process CRM.

Poisson IBP Case

Call, in this model, iid processes. It follows that if is determined by the Lévy density then, for each M,

for any Lévy density. Hence denote the marginal distribution of as. Furthermore,

where . Hence setting the joint distribution in (2.7) is

(2.14)

It should be evident from (2.14) that the marginal distribution in the Poisson case is given as

(2.15)

where

In addition, the posterior distribution of corresponds to that of the random measures

(2.16)

where is and, independent of this, the distribution of is such that are conditionally independent with density

(2.17)

This implies that is equivalent in distribution to where is. Note further that if we refer to the marginal distribution of as, our analysis shows that can be written as

where is and are conditionally independent Poisson variables.

Remark 2.6.

It is easy to see that although we cannot expect and hence to be fully conjugate models, there is a sense of partial conjugacy in the following manner. Suppose that the mean intensity of is; then, within the posterior decomposition (2.16), the PRM has mean intensity
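
To fix ideas, a crude finite-dimensional (truncated) simulation of this Poisson model is sketched below, using gamma-distributed weights in anticipation of the Gamma-Poisson model discussed next; the truncation level and parametrization are illustrative assumptions rather than the paper's exact construction.

```python
import numpy as np

def poisson_ibp_truncated(M, a, c, K, rng=None):
    """K-atom approximation of the Gamma-Poisson latent feature model:
    weights lambda_k ~ Gamma(a/K, scale=1/c) (total mass roughly a) and
    scores A[i, k] | lambda_k ~ Poisson(lambda_k).  Marginally each score
    is Negative-Binomial, as noted in the text.  Illustrative sketch only."""
    rng = np.random.default_rng(rng)
    lam = rng.gamma(shape=a / K, scale=1.0 / c, size=K)  # truncated gamma CRM weights
    A = rng.poisson(lam, size=(M, K))                    # conditionally independent Poisson scores
    return lam, A

lam, A = poisson_ibp_truncated(M=20, a=3.0, c=1.0, K=500, rng=0)
# Columns with A[:, k].sum() > 0 are the "dishes" actually exhibited by
# the M customers; their number stays finite even as K grows.
```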

Gamma, Stable and generalized gamma process priors.

We now give details for the Gamma-Poisson model that was investigated in Titsias [[44]] and then move beyond that to the case where the gamma process is replaced by a generalized gamma process. There are no known results for this flexible and natural extension. We should note here that the gamma process model of Titsias bears some similarities to Lo (1982) [[30]], a reference which is apparently unknown within the general IBP literature. Lo [[30]] shows that if a weighted gamma process is used as a prior for the mean intensity of a Poisson process, then, given the observations from Poisson processes, its posterior distribution is also a weighted gamma process. In this sense it can be seen as a precursor to the class of IBP models.

We describe a scalar version of a weighted gamma process, which is just based on a gamma random variable with shape parameter and scale with law denoted as Gamma. That is to say, if is such a variable then is Gamma. This corresponds to a PRM with mean intensity

and hence is a weighted gamma process with parameters. Note that in Lo's case one takes to be a function, which again presents no extra difficulties here. From the above discussion it follows that and hence is a weighted gamma process with parameters. Furthermore the distribution of is such that are conditionally independent Gamma random variables. Hence the are Negative-Binomial variables as described in Titsias. Furthermore the Lévy exponent is and hence the joint distribution is given by

where Note under

and the are Poisson. Hence the marginals of are Negative-Binomial.
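
The Negative-Binomial marginal is the familiar gamma-Poisson mixture identity, recalled here with generic parameters (which may differ from those used in the text):

```latex
% If \lambda ~ Gamma(a, rate = c) and A | \lambda ~ Poisson(\lambda), then
\[
  \mathbb{P}(A=n)
    =\int_{0}^{\infty}\frac{\lambda^{n}e^{-\lambda}}{n!}\,
      \frac{c^{a}\lambda^{a-1}e^{-c\lambda}}{\Gamma(a)}\,d\lambda
    =\frac{\Gamma(n+a)}{n!\,\Gamma(a)}
      \Bigl(\frac{c}{1+c}\Bigr)^{a}\Bigl(\frac{1}{1+c}\Bigr)^{n},
\]
% i.e. A is Negative-Binomial with shape a and success probability 1/(1+c).
```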

Now suppose that for is a PRM with mean intensity

Then is a generalized gamma process with parameters corresponding to a random variable with density where is the density of a positive stable random variable with index. Furthermore

It follows that is a generalized gamma process with parameters and the jumps satisfy are independent Gamma random variables. Hence can be expressed via