Priors for Random Count Matrices Derived from a Family of Negative Binomial Processes

Mingyuan Zhou, Oscar Hernan Madrid Padilla, and James G. Scott
The University of Texas at Austin
The authors are with the Department of Information, Risk, and Operations Management and the Department of Statistics and Data Sciences, the University of Texas at Austin, Austin, TX 78712, USA. Address for correspondence: 2110 Speedway Stop B6500, Austin, TX 78712, USA. Email: mingyuan.zhou@mccombs.utexas.edu.
Abstract

We define a family of probability distributions for random count matrices with a potentially unbounded number of rows and columns. The three distributions we consider are derived from the gamma-Poisson, gamma-negative binomial, and beta-negative binomial processes, which we refer to generically as a family of negative-binomial processes. Because the models lead to closed-form update equations within the context of a Gibbs sampler, they are natural candidates for nonparametric Bayesian priors over count matrices. A key aspect of our analysis is the recognition that, although the random count matrices within the family are defined by a row-wise construction, their columns can be shown to be independent and identically distributed. This fact is used to derive explicit formulas for drawing all the columns at once. Moreover, by analyzing these matrices’ combinatorial structure, we describe how to sequentially construct a column-i.i.d. random count matrix one row at a time, and derive the predictive distribution of a new row count vector with previously unseen features. We describe the similarities and differences between the three priors, and argue that the greater flexibility of the gamma- and beta- negative binomial processes—especially their ability to model over-dispersed, heavy-tailed count data—makes these well suited to a wide variety of real-world applications. As an example of our framework, we construct a naive-Bayes text classifier to categorize a count vector to one of several existing random count matrices of different categories. The classifier supports an unbounded number of features, and unlike most existing methods, it does not require a predefined finite vocabulary to be shared by all the categories, and needs neither feature selection nor parameter tuning. 
Both the gamma- and beta- negative binomial processes are shown to significantly outperform the gamma-Poisson process when applied to document categorization, with comparable performance to other state-of-the-art supervised text classification algorithms.

1 Introduction

1.1 Models for count matrices

The need to model a random count matrix arises in many settings, from linguistics to marketing to ecology. For example, in text analysis, we often observe a document-term matrix, each row of which records how many times each word appeared in a given document. In a biodiversity study, we may observe a site-species matrix, where each row records the number of times each species was observed at a given site. Similar applications arise in a wide variety of fields; for examples, see Cameron and Trivedi (1998), Chib et al. (1998), Canny (2004), Buntine and Jakulin (2006), Winkelmann (2008), Titsias (2008), and Zhou et al. (2012).

Nonparametric Bayesian analysis provides a natural setting in which to study random matrices, especially those with no natural upper bound on the number of rows or columns. Yet while there is a wide selection of nonparametric Bayesian models for random count vectors and random binary matrices, prior distributions over random count matrices are relatively underdeveloped. Moreover, a major conceptual problem in modeling a random count matrix arises when new rows are added sequentially. For example, as new documents are collected and processed in text analysis, each new document (represented by a new row of the matrix) may contain previously unseen words (features). This requires that new columns be added to the existing count matrix. But it is not obvious how to define the predictive distribution of this new row of a random count matrix, if the row contains previously unseen features. This is especially important in natural language processing, where a common application is to build a naive Bayes model for classifying new documents. Without having a predictive distribution that accounts for new features, one must often use a predetermined vocabulary and simply ignore the previously unseen terms appearing in a new document.

We directly address these issues by investigating a family of nonparametric Bayesian priors for random count matrices constructed from stochastic processes: the gamma-Poisson process, the gamma-negative binomial process (GNBP), and the beta-negative binomial process (BNBP). We show that all these processes lead to random count matrices with independent and identically distributed (i.i.d.) columns, which can be constructed by drawing all the columns at once, or by adding one row at a time. In addition, we show that for the gamma-Poisson process, and for special cases of the GNBP and BNBP with common row-wise parameters, the generated random count matrices are exchangeable in both rows and columns.

Our derivation exactly marginalizes out the underlying stochastic processes to arrive at a probability mass function (PMF) for a column-i.i.d. random count matrix. In contrast to existing techniques that take the infinite limit of a finite-dimensional model, this novel procedure allows for the construction and analysis of much more flexible nonparametric priors for random matrices, and highlights certain model properties that are not evident from the finite-model limit. The argument relies upon a novel combinatorial analysis for calculating the number of ways to map a column-i.i.d. random count matrix to a structured random count matrix whose columns are ordered in a certain manner. This is a key step in deriving the predictive distribution of a new random count vector under a random count matrix.

As an application of our proposed framework, we construct a naive-Bayes text classification model. The approach does not require a predefined list of terms (features), and naturally accounts for documents with previously unseen terms. This also implies that random count matrices of different categories can be updated, analyzed, and tested completely in parallel. Moreover, the algorithm requires neither feature selection nor parameter tuning. Following Crammer et al. (2012), the algorithm may also be conveniently extended to an online learning setting. Empirical results suggest that both the proposed GNBP and BNBP models lead to substantially better out-of-sample classification performance, versus both the gamma-Poisson model and the multinomial model with Laplace smoothing. They also clearly outperform the text classification algorithms that first learn lower-dimensional feature vectors for documents and then train a multi-class classifier, and have comparable performance to the state-of-the-art discriminatively trained text classification algorithms, whose features need to be carefully constructed and parameters carefully selected.

1.2 Connections with existing work

Our paper is in the spirit of existing work on nonparametric Bayesian priors for random count vectors and random binary matrices. To model a random count vector, one may use the Chinese restaurant process, or any one of many other stochastic processes characterized by exchangeable partition probability functions (EPPFs) or sample-size dependent EPPFs; see, for example, Blackwell and MacQueen (1973), Pitman (2006), Lijoi and Prünster (2010), and Zhou and Walker (2014). Likewise, to model a random binary matrix, one may use the Indian buffet process (Griffiths and Ghahramani, 2005, Teh and Gorur, 2009). These well-studied nonparametric Bayesian priors, however, are not directly useful for describing random count matrices. To address this gap, we investigate a family of nonparametric Bayesian priors for random count matrices, each based on a previously proposed stochastic process that has not been thoroughly studied: the gamma-Poisson process (Lo, 1982, Titsias, 2008), the gamma-negative binomial process, or GNBP (Zhou and Carin, 2015); and the beta-negative binomial process, or BNBP (Zhou et al., 2012, Broderick et al., 2015).

All three models can be derived as the marginal distribution of a suitably defined stochastic process with respect to a traditional sampling model for integer-valued counts. This parallels the construction of the models for count vectors or binary matrices mentioned previously. For example, the Chinese restaurant process describes a random count vector as the marginal of the Dirichlet process (Ferguson, 1973) under multinomial sampling. Likewise, the Indian buffet process describes a random binary matrix as the marginal of the beta process (Hjort, 1990) under Bernoulli sampling (Thibaux and Jordan, 2007). Similarly, we present the negative binomial process as the marginal of the gamma process under Poisson sampling, the GNBP as the marginal of the gamma process under negative binomial sampling, and the BNBP as the marginal of the beta process under negative binomial sampling.

The remainder of the paper is organized as follows. After some preliminary definitions and notation, we introduce in Section 2 three distinct nonparametric Bayesian priors for random count matrices. In Section 3, we construct nonparametric Bayesian naive Bayes classifiers to classify a count vector to one of several existing count matrices and demonstrate their use in document categorization. The details for deriving the random count matrix priors from their underlying hierarchical stochastic processes are provided in the Supplementary Material.

1.3 Notation and preliminaries

Stochastic processes.

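A gamma process (Ferguson, 1973) $G\sim\Gamma\mathrm{P}(G_0,1/c)$ on the product space $\mathbb{R}_+\times\Omega$, where $\mathbb{R}_+=\{x:x>0\}$, is defined by two parameters: a finite and continuous base measure $G_0$ over a complete separable metric space $\Omega$, and a scale $1/c$, such that $G(A)\sim\mathrm{Gamma}(G_0(A),1/c)$ for each $A\subset\Omega$. The Lévy measure of the gamma process is $\nu(dr\,d\omega)=r^{-1}e^{-cr}\,dr\,G_0(d\omega)$. Although the Lévy measure integrates to infinity, $\int\int_{\mathbb{R}_+\times\Omega}\min\{1,r\}\,\nu(dr\,d\omega)$ is finite, and therefore a draw from the gamma process can be represented as the countably infinite sum $G=\sum_{k=1}^{\infty}r_k\delta_{\omega_k}$, $\omega_k\sim g_0$, where $\gamma_0=G_0(\Omega)$ is the mass parameter and $g_0(d\omega)=G_0(d\omega)/\gamma_0$ is the base distribution.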

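A beta process (Hjort, 1990) $B\sim\mathrm{BP}(c,B_0)$ on the product space $[0,1]\times\Omega$ is also defined by two parameters: a finite and continuous base measure $B_0$ over a complete separable metric space $\Omega$, and a concentration parameter $c>0$. The Lévy measure of the beta process in this paper is defined as

$$\nu(dp\,d\omega)=p^{-1}(1-p)^{c-1}\,dp\,B_0(d\omega). \qquad (1)$$

As $\nu([0,1]\times\Omega)=\infty$ and $\int\int_{[0,1]\times\Omega}p\,\nu(dp\,d\omega)<\infty$, a draw from $B\sim\mathrm{BP}(c,B_0)$ can be represented as $B=\sum_{k=1}^{\infty}p_k\delta_{\omega_k}$, $\omega_k\sim g_0$, where $\gamma_0=B_0(\Omega)$ is the mass parameter and $g_0(d\omega)=B_0(d\omega)/\gamma_0$ is the base distribution.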

Random count matrices.

A random count matrix is denoted generically by $\mathbf{N}_J\in\mathbb{Z}^{J\times K_J}$, $\mathbb{Z}=\{0,1,\ldots\}$, where the $J$ rows of $\mathbf{N}_J$ correspond to the samples or cases, and the $K_J$ columns to features that have been observed at least once across all rows. Throughout the paper, we will refer to count matrices constructed sequentially by row, for which we require a consistent notation. Suppose that a new case is observed; we use $\mathbf{N}_{J+1}^{+}$ to refer to the new part introduced to the matrix by adding row $J+1$. Similarly, we use $K_{J+1}^{+}$ to denote the number of new columns introduced by adding row $J+1$, meaning that $K_{J+1}=K_J+K_{J+1}^{+}$; $\boldsymbol{n}_{:k}=(n_{1k},\ldots,n_{Jk})^{T}$ to indicate the count vector corresponding to column $k$ of the matrix; and $n_{\cdot k}=\sum_{j=1}^{J}n_{jk}$ to denote the total number of counts of feature $k$ across all rows. One may think of $\mathbf{N}_{J+1}^{+}$ as the combination of two submatrices: a $1\times K_J$ row of counts appended below $\mathbf{N}_J$, and then a $(J+1)\times K_{J+1}^{+}$ submatrix, whose first $J$ rows are entirely zero, and whose columns are inserted into random locations among the original columns with their relative orders preserved.

Our convention is that a prior for a random count matrix is named by the stochastic process used to generate each of its rows. In this paper, we study three hierarchical stochastic processes, all in the family of negative binomial processes. Each such stochastic process is defined by the prior for an almost-surely discrete random measure, together with a sampling model for generating counts. We denote the distribution of such a matrix as $\mathbf{N}_J\sim\mathrm{ProcessM}(\boldsymbol{\theta})$, where “Process” is the name of the underlying hierarchical stochastic process, “M” stands for matrix, and $\boldsymbol{\theta}$ encodes the parameters of the process.

For example, to construct a gamma-Poisson or negative binomial process random count matrix, $\mathbf{N}_J\sim\mathrm{NBPM}(\gamma_0,c)$, we draw a random measure $G\sim\Gamma\mathrm{P}(G_0,1/c)$ from a gamma process. Then for each row $j$ of the matrix, we independently draw $X_j\sim\mathrm{PP}(G)$: a Poisson process such that $X_j(A)\sim\mathrm{Pois}(G(A))$ for all $A\subset\Omega$. As $G$ is atomic, we have $X_j=\sum_{k=1}^{\infty}n_{jk}\delta_{\omega_k}$ with $n_{jk}\sim\mathrm{Pois}(r_k)$. Although $G$ contains countably many atoms, we will show in later sections that only a finite number of them have nonzero counts. The count matrix is constructed by organizing all the nonzero column count vectors, $\{\boldsymbol{n}_{:k}\}_{k:\,n_{\cdot k}>0}$, in an arbitrary order into a random count matrix. Thus the statistical features we care about, such as words or species, are identified with the atoms of the underlying random measure.

Some important distributions.

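The notation $u\sim\mathrm{Log}(p)$ denotes a random variable having a logarithmic distribution (Quenouille, 1949) with PMF

$$f_U(u\,|\,p)=\frac{-1}{\ln(1-p)}\,\frac{p^{u}}{u},\qquad u\in\{1,2,\ldots\}.$$

A related distribution, called the sum-logarithmic, is defined as follows. Let $m=\sum_{t=1}^{l}u_t$, and let $u_t\sim\mathrm{Log}(p)$ independently. The marginal distribution of $m$ is a sum-logarithmic distribution (Zhou and Carin, 2015), expressed as $m\sim\mathrm{SumLog}(l,p)$, with PMF

$$f_M(m\,|\,l,p)=\frac{p^{m}\,l!\,|s(m,l)|}{m!\,[-\ln(1-p)]^{l}},$$

where $|s(m,l)|$ are unsigned Stirling numbers of the first kind. These are related to gamma functions by

$$\frac{\Gamma(m+r)}{\Gamma(r)}=\sum_{l=0}^{m}|s(m,l)|\,r^{l}. \qquad (2)$$

With $l\sim\mathrm{Pois}(-r\ln(1-p))$, the joint distribution of $m$ and $l$ is described as the Poisson-logarithmic bivariate distribution in Zhou and Carin (2015), with PMF

$$f_{M,L}(m,l\,|\,r,p)=\frac{|s(m,l)|\,r^{l}}{m!}\,p^{m}(1-p)^{r}. \qquad (3)$$

The marginalization of $l$ from this compound Poisson representation leads to the negative binomial distribution $m\sim\mathrm{NB}(r,p)$, with PMF

$$f_M(m\,|\,r,p)=\frac{\Gamma(m+r)}{m!\,\Gamma(r)}\,p^{m}(1-p)^{r}.$$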

We describe in the Supplementary Material several other useful distributions, including the logarithmic mixed sum-logarithmic (LogLog), the negative binomial mixed sum-logarithmic, the gamma-negative binomial (GNB), the beta-negative binomial (BNB), the digamma distribution, and the logbeta distributions.
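As a numerical illustration of the compound Poisson representation, the following minimal sketch checks that summing a Poisson-logarithmic joint PMF of the form $|s(m,l)|\,r^{l}p^{m}(1-p)^{r}/m!$ over $l$ recovers the negative binomial PMF $\Gamma(m+r)p^{m}(1-p)^{r}/(m!\,\Gamma(r))$. The function names are hypothetical, and the two PMF forms are assumed as written in this sentence; the Stirling numbers are computed with the standard recurrence $|s(m,l)|=(m-1)|s(m-1,l)|+|s(m-1,l-1)|$.

```python
from math import exp, factorial, lgamma, log


def unsigned_stirling_first(m_max):
    """Table s[m][l] of unsigned Stirling numbers of the first kind, 0 <= l <= m <= m_max."""
    s = [[0] * (m_max + 1) for _ in range(m_max + 1)]
    s[0][0] = 1
    for m in range(1, m_max + 1):
        for l in range(1, m + 1):
            # Standard recurrence: |s(m,l)| = (m-1)|s(m-1,l)| + |s(m-1,l-1)|
            s[m][l] = (m - 1) * s[m - 1][l] + s[m - 1][l - 1]
    return s


def nb_pmf(m, r, p):
    """Negative binomial PMF with mean r * p / (1 - p)."""
    return exp(lgamma(m + r) - lgamma(m + 1) - lgamma(r)
               + m * log(p) + r * log(1.0 - p))


def pois_log_joint_pmf(m, l, r, p, stirling):
    """Assumed Poisson-logarithmic joint PMF of the total m and the number of summands l."""
    return stirling[m][l] * r ** l / factorial(m) * p ** m * (1.0 - p) ** r


def nb_pmf_by_marginalizing_l(m, r, p, stirling):
    """Sum the joint PMF over l = 0, ..., m."""
    return sum(pois_log_joint_pmf(m, l, r, p, stirling) for l in range(m + 1))


if __name__ == "__main__":
    table = unsigned_stirling_first(10)
    for m in range(6):
        print(m, nb_pmf(m, 2.5, 0.3), nb_pmf_by_marginalizing_l(m, 2.5, 0.3, table))
```

The two printed columns should agree up to floating-point error, which is the Stirling-number identity relating $\Gamma(m+r)/\Gamma(r)$ to a polynomial in $r$.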

2 Nonparametric Priors for Random Count Matrices

In this section, we introduce three nonparametric Bayesian priors for random count matrices; for the gamma-Poisson process, we describe in detail its PMF, row- and column-wise construction, and some other basic properties; and for the GNBP and BNBP, we present their PMFs and defer other details to the Supplementary Material. We then describe the predictive distribution of a new row count vector under a random count matrix, and highlight some important differences among the three priors. Although results here are quoted without proof, and the detailed construction is deferred to the Supplementary Material, the basic manner of argument in each case is similar. Our goal is to marginalize out the infinite-dimensional random measure to obtain the unconditional PMF of the random count matrix , where . We are able to do so by separating the absolutely continuous and discrete components of the underlying random measure, and applying a result for Poisson processes known as the Palm formula (e.g. Daley and Vere-Jones, 1988, James, 2002, Caron et al., 2014), together with combinatorics. This is a very general approach, which can also be employed to derive the PMF of the Indian buffet process random binary matrix using the beta-Bernoulli process.

2.1 The gamma-Poisson or negative binomial process

Let $\mathbf{N}_J\sim\mathrm{NBPM}(\gamma_0,c)$ denote a gamma-Poisson or negative binomial process (NBP) random count matrix, parameterized by a mass parameter $\gamma_0$ and a concentration parameter $c$. This prior arises from marginalizing out the gamma process $G\sim\Gamma\mathrm{P}(G_0,1/c)$ from $J$ conditionally independent Poisson process draws $X_j\,|\,G\sim\mathrm{PP}(G)$, with the rows of $\mathbf{N}_J$ corresponding to the $X_j$’s and the columns of $\mathbf{N}_J$ corresponding to the atoms with at least one nonzero count.

2.1.1 Conditional likelihood

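As $\{X_j\}_{j=1}^{J}$ are i.i.d. given $G$, they are exchangeable according to de Finetti’s theorem. With a draw from the gamma-Poisson process expressed as $X_j=\sum_{k=1}^{\infty}n_{jk}\delta_{\omega_k}$, $n_{jk}\sim\mathrm{Pois}(r_k)$, where $r_k=G(\{\omega_k\})$ is the weight of the atom $\omega_k$ of the gamma process $G$, we may write the likelihood of $\{X_j\}_{j=1}^{J}$, given $G$, as

$$p(\{X_j\}_{j=1}^{J}\,|\,G)=\prod_{k=1}^{\infty}\frac{r_k^{\,n_{\cdot k}}\,e^{-Jr_k}}{\prod_{j=1}^{J}n_{jk}!},$$

where $n_{\cdot k}=\sum_{j=1}^{J}n_{jk}$. Let $\mathcal{D}_J=\{\omega_k: n_{\cdot k}>0\}$ denote the set of all observed atoms with nonzero counts, and let $K_J=|\mathcal{D}_J|$. Our goal is to marginalize out the random measure $G$ to obtain the unconditional PMF of the $J\times K_J$ random count matrix $\mathbf{N}_J$, and to show that this “feature count” matrix is row-column exchangeable. The rows correspond to the $X_j$’s, and the columns represent those atoms in $\mathcal{D}_J$ with at least one nonzero count across the $X_j$’s. Representing the infinite dimensional $X_j$’s as a finite random matrix brings interesting combinatorial questions that need to be carefully addressed.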

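Fix an arbitrary labeling of the indices of the atoms in $\mathcal{D}_J$ from $1$ to $K_J$. We now appeal to the definition of a gamma process and rewrite the conditional likelihood of $\{X_j\}_{j=1}^{J}$ as

$$p(\{X_j\}_{j=1}^{J}\,|\,G)=e^{-J\,G(\Omega\backslash\mathcal{D}_J)}\prod_{k=1}^{K_J}\frac{r_k^{\,n_{\cdot k}}\,e^{-Jr_k}}{\prod_{j=1}^{J}n_{jk}!}, \qquad (4)$$

where $G(\Omega\backslash\mathcal{D}_J)$ is the total mass of the rest of the (absolutely continuous) space. The idea is to first marginalize out $G$ from (4) to obtain the marginal distribution $p(\{X_j\}_{j=1}^{J}\,|\,\gamma_0,c)$, whose derivation using the Palm formula is provided in the Supplementary Material, and then use a combinatorial argument to find the marginal distribution of the random count matrix $\mathbf{N}_J$ organized from $\{X_j\}_{j=1}^{J}$.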

2.1.2 Marginal distribution and combinatorial analysis

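One of our main results is that the PMF of $\mathbf{N}_J$, with $J$ rows and a random number $K_J$ of columns, is

$$f(\mathbf{N}_J\,|\,\gamma_0,c)=\frac{\gamma_0^{K_J}\exp\!\left[-\gamma_0\ln\!\left(\frac{J+c}{c}\right)\right]}{K_J!}\prod_{k=1}^{K_J}\frac{\Gamma(n_{\cdot k})}{(J+c)^{n_{\cdot k}}\prod_{j=1}^{J}n_{jk}!}, \qquad (5)$$

where the unordered column vectors of the count matrix represent a draw from the underlying stochastic process, and the normalization constant of $1/K_J!$ arises from the fact that the mapping from a realization of $\{X_j\}_{j=1}^{J}$ to $\mathbf{N}_J$ is one-to-many, with $K_J!$ distinct column orderings.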

By construction, the rows of an NBP random count matrix are exchangeable. Moreover, one may verify by direct calculation that an NBP random count matrix with PMF (5) can be generated column by column as i.i.d. count vectors:

$$\boldsymbol{n}_{:k}\sim\mathrm{Multinomial}(n_{\cdot k},\,1/J,\ldots,1/J),\quad n_{\cdot k}\sim\mathrm{Log}\!\left(\frac{J}{J+c}\right),\quad K_J\sim\mathrm{Pois}\!\left(\gamma_0\ln\frac{J+c}{c}\right). \qquad (6)$$

It is clear from (6) that the columns of $\mathbf{N}_J$ are independent multivariate count vectors, which all follow the same logarithmic-multinomial (mixture) distribution. Thus the NBP random count matrix is row-column exchangeable (see, e.g. Hoover, 1982, Aldous, 1985, Orbanz and Roy, 2014, for a general treatment of row-column exchangeable matrices).
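The column-wise construction can be sketched in a few lines. This is a minimal illustration, assuming the number of columns is Poisson with rate $\gamma_0\ln((J+c)/c)$, each column total is logarithmic with parameter $J/(J+c)$, and each total is spread over the $J$ rows by a multinomial with uniform probabilities; the function name `sample_nbp_matrix_columnwise` and the parameter values are hypothetical.

```python
import numpy as np


def sample_nbp_matrix_columnwise(J, gamma0, c, rng):
    """Draw an NBP-style count matrix with J rows by generating all columns at once."""
    # Number of columns with at least one nonzero count.
    K = rng.poisson(gamma0 * np.log((J + c) / c))
    if K == 0:
        return np.zeros((J, 0), dtype=np.int64)
    # Column totals follow a logarithmic (log-series) distribution.
    totals = rng.logseries(J / (J + c), size=K)
    # Each total is assigned to the rows uniformly at random.
    uniform = np.full(J, 1.0 / J)
    columns = [rng.multinomial(int(n), uniform) for n in totals]
    return np.stack(columns, axis=1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    N = sample_nbp_matrix_columnwise(J=10, gamma0=5.0, c=1.0, rng=rng)
    print(N.shape)
    print(N)
```

Because every column total is at least one, no all-zero column can appear, which matches the requirement that columns correspond to features observed at least once.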

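Now consider the row-wise sequential construction of the NBP random matrix, recalling that $\mathbf{N}_{J+1}^{+}$ represents the “new” part of the matrix added by the new row. With the prior on $\mathbf{N}_J$ well defined, one may construct $\mathbf{N}_{J+1}$ in a sequential manner as

$$f(\mathbf{N}_{J+1}\,|\,\boldsymbol{\theta})=f(\mathbf{N}_J\,|\,\boldsymbol{\theta})\,p(\mathbf{N}_{J+1}^{+}\,|\,\mathbf{N}_J,\boldsymbol{\theta}),$$

where $\boldsymbol{\theta}=\{\gamma_0,c\}$ and $p(\mathbf{N}_{J+1}^{+}\,|\,\mathbf{N}_J,\boldsymbol{\theta})$ is the prediction rule to add the new part brought by row $J+1$ into the matrix $\mathbf{N}_J$. Direct calculation from the PMF (5) yields the following form for this prediction rule, expressed in terms of familiar PMFs:

$$p(\mathbf{N}_{J+1}^{+}\,|\,\mathbf{N}_J,\boldsymbol{\theta})=\frac{K_J!\,K_{J+1}^{+}!}{K_{J+1}!}\prod_{k=1}^{K_J}\mathrm{NB}\!\left(n_{(J+1)k};\,n_{\cdot k},\tfrac{1}{J+c+1}\right)\times\mathrm{Pois}\!\left(K_{J+1}^{+};\,\gamma_0\ln\tfrac{J+c+1}{J+c}\right)\prod_{k=K_J+1}^{K_{J+1}}\mathrm{Log}\!\left(n_{(J+1)k};\,\tfrac{1}{J+c+1}\right). \qquad (7)$$

This formula says that to add a new row to $\mathbf{N}_J$, we first draw an $\mathrm{NB}(n_{\cdot k},1/(J+c+1))$ count at each existing column. We then draw $K_{J+1}^{+}$ new columns as $K_{J+1}^{+}\sim\mathrm{Pois}(\gamma_0[\ln(J+c+1)-\ln(J+c)])$. Finally, each entry in the new columns has a $\mathrm{Log}(1/(J+c+1))$ distributed random count; crucially, new columns brought by the new row must have positive counts.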
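The three steps of the prediction rule can be sketched as follows. This is a minimal illustration under stated assumptions: counts at existing columns are negative binomial with shape $n_{\cdot k}$ and probability $1/(J+c+1)$, the number of new columns is Poisson with rate $\gamma_0[\ln(J+c+1)-\ln(J+c)]$, and counts in new columns are logarithmic with parameter $1/(J+c+1)$. The function names are hypothetical. New columns are simply appended to the right, so the result is the particularly ordered matrix rather than the column-i.i.d. one.

```python
import numpy as np


def add_row_nbp(N, gamma0, c, rng):
    """Append row J+1 to a J x K_J count matrix N; new columns go to the right."""
    J, K = N.shape
    p = 1.0 / (J + c + 1.0)
    totals = N.sum(axis=0)
    # Existing columns: NB(n_.k, p) with PMF proportional to p^m (1-p)^{n_.k}.
    # NumPy's negative_binomial(n, q) has PMF proportional to q^n (1-q)^m, so q = 1 - p.
    if K > 0:
        old = rng.negative_binomial(totals, 1.0 - p)
    else:
        old = np.zeros(0, dtype=np.int64)
    # Number of new columns brought by the new row.
    K_new = rng.poisson(gamma0 * np.log((J + c + 1.0) / (J + c)))
    # Counts in new columns are strictly positive (logarithmic distribution).
    if K_new > 0:
        new = rng.logseries(p, size=K_new)
    else:
        new = np.zeros(0, dtype=np.int64)
    top = np.hstack([N, np.zeros((J, K_new), dtype=np.int64)])
    bottom = np.concatenate([old, new]).astype(np.int64)
    return np.vstack([top, bottom[None, :]])


def sample_nbp_matrix_rowwise(J, gamma0, c, rng):
    """Build a J-row matrix one row at a time, starting from an empty matrix."""
    N = np.zeros((0, 0), dtype=np.int64)
    for _ in range(J):
        N = add_row_nbp(N, gamma0, c, rng)
    return N


if __name__ == "__main__":
    print(sample_nbp_matrix_rowwise(8, 5.0, 1.0, np.random.default_rng(1)))
```

To obtain a column-i.i.d. matrix from this sketch, the new columns would have to be inserted at random locations among the existing ones with relative orders preserved, instead of being appended.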

The normalizing constant $K_J!\,K_{J+1}^{+}!/K_{J+1}!$ in (7) plays a key role in our combinatorial analysis, and will appear again in both the gamma- and beta- negative binomial processes. It emerges directly from the calculations, and can also be interpreted in the following way. After drawing $K_{J+1}^{+}$ new columns, we must insert them into the original $K_J$ columns while keeping the relative orders of both the original and new columns unchanged. This is a one-to-many mapping, with the number of such order-preserving insertions given by the binomial coefficient $\binom{K_{J+1}}{K_{J+1}^{+}}$. For example, if the original $\mathbf{N}_J$ has two columns and the new row introduces two more columns, then we construct $\mathbf{N}_{J+1}$ by rearranging the two old columns 1 and 2 and the two new columns iii and iv in one of $\binom{4}{2}=6$ possible ways: (1 2 iii iv), (1 iii 2 iv), (iii 1 2 iv), (1 iii iv 2), (iii 1 iv 2), and (iii iv 1 2), where (1 2 iii iv) represents the construction appending the new columns to the right of the original matrix.
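The count of order-preserving insertions can be checked by brute-force enumeration. The sketch below, with a hypothetical function name, chooses which positions of the merged column sequence are occupied by new columns and fills the remaining positions with the old columns in their original order.

```python
from itertools import combinations
from math import comb


def order_preserving_insertions(old, new):
    """List every merge of `old` and `new` that keeps the relative order within each."""
    total = len(old) + len(new)
    merges = []
    for positions in combinations(range(total), len(new)):
        chosen = set(positions)
        old_iter, new_iter = iter(old), iter(new)
        # Slots in `chosen` take the next new column, all others the next old column.
        merged = tuple(next(new_iter) if i in chosen else next(old_iter)
                       for i in range(total))
        merges.append(merged)
    return merges


if __name__ == "__main__":
    merges = order_preserving_insertions(["1", "2"], ["iii", "iv"])
    for merged in merges:
        print(" ".join(merged))
    print(len(merges), comb(4, 2))
```

For two old and two new columns this enumerates the six arrangements listed in the text, in agreement with the binomial coefficient.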

It is instructive to compare (6), which generates an NBP random matrix by drawing all its columns at once, with (7), which generates an identically distributed random matrix one row at a time. The matrix generated with (6) has i.i.d. columns. The matrix generated with (7) adds $K_{J+1}^{+}$ new columns when it adds the $(J+1)$th row, and if the newly added columns are inserted into random locations among the original columns with their relative orders preserved, then we arrive at an identically distributed column-i.i.d. random count matrix. If the newly added columns are inserted in a particular way, then the distribution of the generated random matrix would be different up to a multinomial coefficient. For example, if we generate row vectors from $1$ to $J$ and each time we append the new columns to the right of the original matrix, then this ordered matrix will appear with probability

$$\frac{K_J!}{\prod_{j=1}^{J}K_j^{+}!}\,f(\mathbf{N}_J\,|\,\gamma_0,c). \qquad (8)$$

Shown in the first row of Figure 1 are three NBP random count matrices simulated in this manner. We note that the gamma-Poisson process is related to the model of Lo (1982), as well as the model of Titsias (2008), which can be considered as a special case of the NBP with the concentration parameter fixed at one.

2.1.3 Inference for parameters

Although the marginal likelihood alone is not amenable to posterior analysis, the NBP parameters can be conveniently inferred using both the conditional and marginal likelihoods. To complete the model, we let and . With (4), (5) and , we sample the parameters in closed form as

(9)

Similar strategies will be used to infer the parameters of the other two stochastic processes. Having closed-form update equations for parameter inference via Gibbs sampling is a unique feature shared by all the nonparametric Bayesian priors proposed in this paper.
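A parameter sampler of this kind can be sketched as below. It is only an illustration under loudly stated assumptions, not the update equations (9) themselves: the hyperpriors are taken to be $\gamma_0\sim\mathrm{Gamma}(e_0,1/f_0)$ and $c\sim\mathrm{Gamma}(c_0,1/d_0)$ with hypothetical hyperparameter names and values, the marginal PMF is assumed to depend on $\gamma_0$ through $\gamma_0^{K_J}\exp[-\gamma_0\ln((J+c)/c)]$, and the total mass $G(\Omega)$ is assumed to be $\mathrm{Gamma}(\gamma_0,1/c)$ a priori. The three conditionals then follow from gamma conjugacy.

```python
import numpy as np


def gibbs_nbp_parameters(N, n_iter, rng, e0=0.01, f0=0.01, c0=0.01, d0=0.01):
    """Sketch of a Gibbs sampler for (gamma0, c) given a J x K count matrix N.

    The hyperparameters e0, f0, c0, d0 are illustrative assumptions.
    """
    J, K = N.shape
    n_total = int(N.sum())
    gamma0, c = 1.0, 1.0
    samples = []
    for _ in range(n_iter):
        # gamma0 | N, c: uses the marginal likelihood, with G integrated out.
        gamma0 = rng.gamma(e0 + K, 1.0 / (f0 + np.log((J + c) / c)))
        # Total mass G(Omega) | N, gamma0, c: atom weights plus the unobserved remainder.
        G_total = rng.gamma(gamma0 + n_total, 1.0 / (J + c))
        # c | gamma0, G(Omega): uses the conditional likelihood of the gamma process.
        c = rng.gamma(c0 + gamma0, 1.0 / (d0 + G_total))
        c = max(c, 1e-12)  # numerical guard only, keeps log((J + c) / c) finite
        samples.append((gamma0, c))
    return samples


if __name__ == "__main__":
    N = np.array([[2, 0, 1], [1, 3, 0]])
    draws = gibbs_nbp_parameters(N, n_iter=200, rng=np.random.default_rng(0))
    print(np.mean(draws[100:], axis=0))
```

The ordering matters in this sketch: $\gamma_0$ is drawn with the random measure marginalized out, and the total mass is redrawn immediately afterwards, before it is used to update $c$.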

2.2 The gamma-negative binomial process

Let denote a gamma-negative binomial process (GNBP) random count matrix, parameterized by a mass parameter , a concentration parameter , and row-specific probability parameters . This random count matrix is the direct outcome of marginalizing out the gamma process , with data augmentation, from conditionally independent negative binomial process draws , which are defined such that for each .

As directly marginalizing out the gamma process under negative binomial sampling is difficult, our construction is based on the compound-Poisson representation of the negative binomial, described in Section 1.3. Specifically, consider the joint distribution of and a latent count matrix , whose dimension and locations of nonzero counts are the same as those of . These two matrices parallel the scalar and given in the joint PMF of the Poisson-logarithmic distribution (3). This joint distribution is defined as

(10)

where , and . The detailed derivation is in the Supplementary Material.

Similar to the analysis in Section 2.1 for the NBP, we show in the Supplementary Material that the GNBP random count matrix can be constructed by either drawing its i.i.d. columns at once or adding one row at a time, and it has closed-form Gibbs sampling update equations for model parameters. Different from the NBP random count matrix that is row-column exchangeable, the GNBP random count matrix no longer maintains row exchangeability if its row-wise probability parameters are set differently for different rows.

Shown in the second row of Figure 1 are three sequentially constructed GNBP random count matrices, with the new columns introduced by each row appended to the right of the matrix. Similar to the combinatorial arguments that lead to (8), this particularly structured matrix and its auxiliary matrix appear with probability .

2.3 The beta-negative binomial process

Let denote a beta-negative binomial process (BNBP) random count matrix, parameterized by a mass parameter , a concentration parameter , and row-specific dispersion parameters , whose PMF is defined as

(11)

where . The PMF is the direct outcome of marginalizing out the beta process from conditionally independent negative binomial process draws , which are defined such that for each , where is the weight of atom . The detailed derivation is provided in the Supplementary Material.

Similar to the analysis in Section 2.1 for the NBP, we show in the Supplementary Material that the BNBP random count matrix can be constructed by either drawing its i.i.d. columns at once or adding one row at a time using an “ice cream” buffet process, and it has closed-form Gibbs sampling update equations for all model parameters except for the concentration parameter . The BNBP random count matrix no longer maintains row exchangeability if its row-wise dispersion parameters are set differently for different rows.

Shown in the last row of Figure 1 are three sequentially constructed BNBP random count matrices, with the new columns introduced by each row appended to the right of the matrix. Similar to the combinatorial arguments that lead to (8), this particularly structured matrix appears with probability .

2.4 The predictive distribution of a new row count vector

It is critical to note that the prediction rule of the NBP shown in (7) is for sequentially constructing a column-i.i.d. random count matrix, but it is not the predictive distribution for a new row count vector. The $1\times K_J$ submatrix of $\mathbf{N}_{J+1}^{+}$ orders its columns in the same way as $\mathbf{N}_J$ does, and the $(J+1)\times K_{J+1}^{+}$ submatrix of $\mathbf{N}_{J+1}^{+}$ also maintains a certain order of its columns; however, the indexing of these columns is in fact arbitrarily chosen from $K_{J+1}^{+}!$ possible permutations. Therefore, the predictive distribution of a row vector $\boldsymbol{n}_{J+1}$ that brings $K_{J+1}^{+}$ new columns shall be

(12)
(13)

The normalizing constant $1/K_{J+1}^{+}!$ in (12) arises because the mapping from a realization of $\boldsymbol{n}_{J+1}$ to $\mathbf{N}_{J+1}^{+}$ is one-to-many, with $K_{J+1}^{+}!$ distinct orderings of the new columns brought by the $(J+1)$th row. Our experimental results show that omitting this normalizing term may significantly deteriorate the out-of-sample prediction performance.

An equivalent representation in (13) shows that one may first consider the distribution of a matrix constructed by appending the new columns brought by $\boldsymbol{n}_{J+1}$ to the right of $\mathbf{N}_J$, which is , and then apply Bayes’ rule to derive the conditional distribution of this particularly ordered matrix given $\mathbf{N}_J$. The normalizing constant $K_J!/K_{J+1}!$ in (13) can be interpreted in the following way. We need to insert the $K_{J+1}^{+}$ new columns one by one into the original matrix. The first, second, $\ldots$, and last new columns can choose from $K_J+1$, $K_J+2$, $\ldots$, and $K_{J+1}$ possible locations, respectively, thus there are $K_{J+1}!/K_J!$ ways to insert the new columns into the original ordered columns, which is again a one-to-many mapping. The same combinatorial analysis applies to both the GNBP and BNBP. For the GNBP, to compute the predictive likelihood of $\boldsymbol{n}_{J+1}$, one will need to take extra care as the computation involves an auxiliary random count matrix that is not directly observable. In Section 3, we will discuss in detail how to compute the predictive likelihood via Monte Carlo integration.
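For the NBP, the predictive log-likelihood of a new row can be sketched as follows. This is a minimal illustration with hypothetical function names, assuming the NBP marginal PMF takes the form $\gamma_0^{K_J}e^{-\gamma_0\ln((J+c)/c)}/K_J!\,\prod_k\Gamma(n_{\cdot k})/[(J+c)^{n_{\cdot k}}\prod_j n_{jk}!]$ and that the prediction rule is built from negative binomial, Poisson, and logarithmic terms with probability parameter $1/(J+c+1)$. Dividing the prediction rule by $K_{J+1}^{+}!$ leaves the overall combinatorial factor $K_J!/K_{J+1}!$.

```python
from math import lgamma, log


def nbp_log_pmf(N, gamma0, c):
    """Log of the assumed NBP marginal PMF for a count matrix given as a list of rows."""
    J = len(N)
    K = len(N[0]) if J else 0
    ll = K * log(gamma0) - gamma0 * log((J + c) / c) - lgamma(K + 1)
    for k in range(K):
        col = [N[j][k] for j in range(J)]
        n = sum(col)
        ll += lgamma(n) - n * log(J + c) - sum(lgamma(x + 1) for x in col)
    return ll


def nbp_predictive_loglik(col_totals, counts_existing, counts_new, J, gamma0, c):
    """Log predictive likelihood of row J+1, including the 1/K+! ordering correction."""
    p = 1.0 / (J + c + 1.0)
    K_old, K_new = len(col_totals), len(counts_new)
    # Combinatorial factor K_J! / K_{J+1}!.
    ll = lgamma(K_old + 1) - lgamma(K_old + K_new + 1)
    # Negative binomial terms at existing columns.
    for n_k, m in zip(col_totals, counts_existing):
        ll += (lgamma(m + n_k) - lgamma(m + 1) - lgamma(n_k)
               + m * log(p) + n_k * log(1.0 - p))
    # Poisson term for the number of new columns.
    rate = gamma0 * log((J + c + 1.0) / (J + c))
    ll += K_new * log(rate) - rate - lgamma(K_new + 1)
    # Logarithmic terms for the (strictly positive) counts in new columns.
    for m in counts_new:
        ll += m * log(p) - log(m) - log(-log(1.0 - p))
    return ll


if __name__ == "__main__":
    N_J = [[2, 0, 1], [1, 3, 0]]
    print(nbp_predictive_loglik([3, 3, 1], [0, 1, 2], [1, 4], J=2, gamma0=3.0, c=1.5))
```

A useful sanity check on this sketch is that the returned value equals the difference of `nbp_log_pmf` evaluated on the enlarged and original matrices, minus $\ln K_{J+1}^{+}!$.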

2.5 Comparison

In the Supplementary Material, we provide further details on the construction of random count matrices from the negative binomial process, as well as those derived from both the gamma-negative binomial process (GNBP) and beta-negative binomial process (BNBP). While the PMFs for all three proposed nonparametric priors are complicated, their relationship and differences become evident once we show that they all govern random count matrices with a Poisson-distributed number of i.i.d. columns. Table 1 shows the differences among the three priors’ row-wise sequential construction, and the following list shows the variance-mean relationship for each prior for the counts at existing columns. Together, these provide additional insights on how the priors differ from each other.

(16)
Model Number of new columns Counts in existing columns Counts in new columns
NBP
GNBP
BNBP
Table 1: Comparison of the prediction rules of the NBP, GNBP, and BNBP random count matrices.
Figure 1: Sequentially constructed negative binomial process (NBP), gamma-negative binomial process (GNBP), and beta-negative binomial process (BNBP) random count matrices (the blank cells indicate zero counts). The ten rows of each matrix are added one by one, with the new columns introduced by each row appended to the right of the matrix. To make the expected total count of a random matrix as and the expected number of columns approximately as , the parameters are set as and for the NBP, set as , , and for the GNBP, and set as , , and for the BNBP. The randomized row wise parameters and are generated via and , respectively.

The NBP can be used to generate a row-column exchangeable random count matrix with a potentially unbounded number of columns. However, as shown in (6), to model the total count of a column $n_{\cdot k}$, the NBP uses the logarithmic distribution, which has only one free parameter, always has the mode at one, and monotonically decreases. In addition, each column sum is assigned to the rows with a multinomial distribution that has a uniform probability vector $(1/J,\ldots,1/J)$. Furthermore, as shown in Table 1, for out-of-sample prediction, it models counts at existing columns using $\mathrm{NB}(n_{\cdot k},1/(J+c+1))$, whose variance-mean relationship (14) may be restrictive in modeling highly overdispersed counts. Finally, the expected number of new columns brought by row $J+1$, equal to $\gamma_0\ln\frac{J+c+1}{J+c}$, monotonically decreases as $J$ increases. These constraints limit the potential use of the NBP model.

Both the GNBP and BNBP relax these constraints in their own unique ways. Examining the sequential construction of the GNBP helps us understand the advantages of the GNBP over the NBP. As shown in Table 1, to model the likelihood of a new row count vector, one may find that the GNBP employs the three-parameter GNB instead of the two-parameter negative binomial distribution to model the count at an existing column, and employs the two-parameter LogLog instead of the logarithmic distribution to model the count at a new column. As the GNB random variable can be generated as , using the laws of total expectation and total variance, we express in terms of in (2.5). Since and , the GNBP can model much more overdispersed counts than the NBP. Moreover, the GNBP allows each row count vector to have its own probability parameter, allowing finer control on the expected number of new columns brought by a new row, which is . The NBP random count matrix is row-column exchangeable, whereas the GNBP random count matrix is column exchangeable, but not row exchangeable if the row-wise probability parameters are fixed at different values.

As shown in Table 1, to model the likelihood of a new row count vector, one may find that the BNBP employs the three-parameter BNB instead of the two-parameter negative binomial distribution to model the count at an existing column, and employs the two-parameter digamma instead of the logarithmic distribution to model the count at a new column. Note that the BNB random variable can be generated as , using the laws of total expectation and total variance, for , we express in terms of in (16). As and for , the BNBP can also model much more overdispersed counts than the NBP. Moreover, the BNBP allows each row count vector to have its own dispersion parameter, allowing finer control on the expected number of new columns brought by a row, which is ; the NBP random count matrix is row-column exchangeable, whereas the BNBP random count matrix is column exchangeable, but not row exchangeable if the row-wise dispersion parameters are different.

The variance-mean relationships expressed by (14)-(16) show that the GNBP and BNBP can model much more overdispersed counts than the NBP. This fact is borne out by the simulated random count matrices in Figure 1, which provide some intuition for the practical differences among the models. The parameters for the three priors have been chosen so that each random matrix has the same expected total count. Yet the counts in the NBP random count matrices have small dynamic ranges, whereas the counts in both the GNBP and BNBP matrices can contain values that are significantly above the average.

2.6 Parameter inference

An appealing feature of all three negative binomial process random count matrix priors is that their parameters can be inferred with closed-form Gibbs sampling update equations, by exploiting both the conditional and marginal distributions, together with the data augmentation and marginalization techniques unique to the negative binomial distribution. Parameter inference for the NBP is provided in Section 2.1.3. The details of parameter inference for both the GNBP and BNBP are provided in the Supplementary Material.

3 Negative Binomial Process Naive Bayes Classifiers

3.1 Background

Given a random count matrix, finding the predictive distribution of a row count vector, which may bring additional columns, involves interesting and challenging combinatorial arguments that have been thoroughly addressed in this paper. With these combinatorial structures carefully analyzed, we are ready to construct NBP, GNBP, and BNBP naive Bayes classifiers. We do so as follows. First, for each category, the training row count vectors are summarized as a random count matrix , each column of which must contain at least one nonzero count (i.e., columns with all zeros are excluded). Second, Gibbs sampling is used to infer the parameters that generate . To represent the posterior of , MCMC samples are collected. For the GNBP, a posterior MCMC sample for the auxiliary random matrix is also collected when is collected. Finally, to test a row count vector , its predictive likelihood given is calculated via Monte Carlo integration using

(17)

for both the NBP and BNBP, and using

(18)

for the GNBP. Although a larger should lead to a more accurate calculation of the predictive likelihood, the computational complexity for testing is a linear function of . It is therefore of practical importance to find out how the value of impacts the performance of the proposed nonparametric Bayesian naive Bayes classifiers. Below we consider experiments on document categorization, for which we will show that performs essentially as well as a much larger in terms of categorization accuracy.
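The Monte Carlo averages in (17) and (18) are best accumulated in log space, since the likelihood of a long count vector underflows in floating point. The following is a minimal sketch rather than the authors' implementation: `log_mean_exp` and `classify` are hypothetical helper names, the per-sample log-likelihoods are assumed to have been computed elsewhere, and a uniform prior over categories is assumed.

```python
import math

def log_mean_exp(log_values):
    """Return log(mean(exp(v) for v in log_values)), computed stably."""
    m = max(log_values)
    if m == -math.inf:
        return -math.inf
    # Subtract the maximum before exponentiating to avoid underflow.
    total = sum(math.exp(v - m) for v in log_values)
    return m + math.log(total) - math.log(len(log_values))

def classify(log_liks_by_category):
    """Pick the category with the largest Monte Carlo predictive log-likelihood.

    log_liks_by_category maps a category name to a list with one
    log-likelihood per collected MCMC sample (uniform class prior assumed).
    """
    scores = {c: log_mean_exp(v) for c, v in log_liks_by_category.items()}
    return max(scores, key=scores.get), scores

# Hypothetical log-likelihoods of one test vector under two categories.
example = {
    "sci.space": [-1050.2, -1049.7, -1051.0],
    "rec.autos": [-1100.4, -1098.9, -1099.5],
}
label, scores = classify(example)
```

The cost of `log_mean_exp` is linear in the number of collected samples, which mirrors the linear testing cost noted above.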

3.2 Experiment settings

We consider the example of categorizing the 18,774 documents of the 20 newsgroups dataset (http://qwone.com/jason/20Newsgroups/), where each bag-of-words document is represented as a word count vector under a vocabulary of size 61,188. We also consider the TDT2 corpus (NIST Topic Detection and Tracking corpus; http://www.cad.zju.edu.cn/home/dengcai/Data/TextData.html). With the documents appearing in two or more categories removed, this subset of TDT2 consists of 9,394 documents from the largest 30 categories, with a vocabulary of size 36,771; this dataset was used to compare document clustering algorithms in Cai et al. (2005). We train all three negative binomial processes using 10%, 20%, ..., or 80% of the documents in each newsgroup of the 20 newsgroups dataset, and in each category of the TDT2 corpus. We then test on the remaining documents. We report our results based on five random training/testing partitions.
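The per-category partitioning described above can be sketched as follows. This is an illustration only: `split_per_category` is a hypothetical helper, and the rounding rule and the use of a seed per partition are assumptions that the text does not specify.

```python
import random

def split_per_category(doc_ids_by_category, train_fraction, seed=0):
    """Randomly split the documents of every category into train and test sets."""
    rng = random.Random(seed)
    train, test = {}, {}
    for category, doc_ids in doc_ids_by_category.items():
        ids = list(doc_ids)
        rng.shuffle(ids)
        # Assumed rounding rule; keep at least one training document.
        n_train = max(1, round(train_fraction * len(ids)))
        train[category] = ids[:n_train]
        test[category] = ids[n_train:]
    return train, test

# Toy data: two categories with 10 and 5 documents.
toy = {"a": list(range(10)), "b": list(range(10, 15))}
train, test = split_per_category(toy, 0.2, seed=1)
```

Five random partitions would then correspond to five different seeds, for example `range(5)`.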

To compare with other commonly used text categorization algorithms, we also consider a default setting for the 20 newsgroups dataset: using the first 11,269 documents for training and the other 7,505 documents collected at later times for testing. For this setting, we report our results based on five independent runs with random initializations. This allows us to compare our performance to many other papers that have proposed text classification algorithms and benchmarked their methods using this same split of the 20 newsgroups dataset.

For the th newsgroup/category with training documents, we construct a document-term count matrix , whose element represents the number of times term appears in document . Since only the terms present in the training documents of the th category are considered, the column indices of correspond to the terms that appear at least once in training. We use to denote that is a parameter inferred from . Note that the column indices of can be arbitrarily ordered, which affects neither training nor out-of-sample prediction as long as their corresponding features are recorded.
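A category-specific count matrix of this kind can be built as sketched below. The helper name `build_count_matrix` and the tokenized-document input format are assumptions made for illustration. Columns are created only for terms seen in the category's training documents, so every column has at least one nonzero count.

```python
from collections import Counter

def build_count_matrix(documents):
    """Build a document-term count matrix for one category.

    documents: list of token lists, one per training document.
    Returns (terms, rows), where rows[j][k] is the number of times
    terms[k] appears in document j.
    """
    counts = [Counter(doc) for doc in documents]
    terms, index = [], {}
    for c in counts:
        for term in c:
            if term not in index:
                # Columns are ordered by first appearance; any order would do
                # as long as the term for each column is recorded.
                index[term] = len(terms)
                terms.append(term)
    rows = [[c.get(t, 0) for t in terms] for c in counts]
    return terms, rows

docs = [["gamma", "poisson", "gamma"], ["beta", "gamma"]]
terms, rows = build_count_matrix(docs)
```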

We collect MCMC samples of model parameters and auxiliary variables to compute the predictive likelihood for a new row count vector. In this paper, we run independent Markov chains and collect the 2500th sample of each chain. Note that one may also consider collecting samples at a certain interval from a single Markov chain after the burn-in period. We consider non-informative hyper-parameters as . For the BNBP, we set . The document-term training count matrix of the th newsgroup is modeled as , , and under the three priors respectively.
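The single-chain alternative mentioned above amounts to keeping every few iterations after burn-in. A minimal sketch follows; the function name, the burn-in length of 1000, and the interval of 500 are hypothetical values chosen only for illustration.

```python
def thinned_iterations(n_iterations, burn_in, interval):
    """Return the 1-based iteration numbers kept from a single chain."""
    return list(range(burn_in + interval, n_iterations + 1, interval))

# Hypothetical settings: 2500 iterations, burn-in of 1000, keep every 500th.
kept = thinned_iterations(2500, 1000, 500)
```

Samples from independent chains are uncorrelated across chains, whereas thinning from a single chain spends fewer iterations per collected sample.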

Note that we are facing typical “small and large ” problems as the number of rows of a document-term count matrix is typically much smaller than the number of columns. For example, the first newsgroup of the 20 newsgroups dataset contains 798 documents with 12,665 unique words, which is summarized as a 798 × 12,665 count matrix; and the 30th category of the TDT2 subset contains 52 documents with 2904 unique words, which is summarized as a 52 × 2904 count matrix. As the number of unique terms in a category might be significantly smaller than the vocabulary size of the whole corpus, our approach for both training and testing could be much faster than an approach that considers all the terms in the vocabulary of the corpus. In addition, our approach provides a principled, model-based way to handle terms that appear in a testing document but not in the training documents. By contrast, many traditional approaches have to discard the terms not present in training.

3.3 Training and posterior predictive checking

Figure 2: The parameters of the negative binomial processes are inferred using (a) the observed document-term count matrix. These parameters are used to simulate (b) an NBP random count matrix, (c) a GNBP random count matrix, and (d) a BNBP random count matrix. These matrices are visualized by arranging the new columns brought by each new row to the right of the original matrix. Counts larger than 3 are displayed as 3.

We train the NBP, GNBP, and BNBP with the document-term word count matrix that summarizes all the 52 documents in the 30th category of the TDT2 subset. We then run 2500 MCMC iterations and collect the last 1500 samples to infer the posterior means of the parameters in , , and . Using the corresponding parameters learned from the training count matrix, we regenerate an NBP, a GNBP, and a BNBP random count matrix as an informal posterior predictive check on the model. The observed count matrix is shown in Figure 2 (a), and the three simulated random count matrices are shown in Figure 2 (b)-(d). These matrices are displayed by arranging the new columns brought by a new row to the right of the original matrix.
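The display convention used for Figure 2 can be sketched as follows: columns are ordered by the first row in which they receive a nonzero count, and counts are clipped at 3. The helper `arrange_for_display` is hypothetical and only illustrates this arrangement. It is not the plotting code used for the figure.

```python
def arrange_for_display(rows, max_count=3):
    """Order columns by the row that first brings them and clip large counts."""
    n_cols = len(rows[0]) if rows else 0

    def first_row(k):
        # Index of the first row with a nonzero count in column k.
        for j, row in enumerate(rows):
            if row[k] > 0:
                return j
        return len(rows)

    # sorted() is stable, so columns brought by the same row keep their order.
    order = sorted(range(n_cols), key=first_row)
    clipped = [[min(row[k], max_count) for k in order] for row in rows]
    return order, clipped

toy = [[0, 5, 0], [2, 0, 0], [0, 1, 7]]
order, clipped = arrange_for_display(toy)
```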

It is clear that the NBP is restrictive, in that the generated random matrix looks the least similar to the observed count matrix. This is unsurprising, as the NBP has a limited ability to model highly overdispersed counts, does not model row-heterogeneity, and can barely adjust the number of new columns brought by a row. On the other hand, both the generated GNBP and BNBP random count matrices resemble the original count matrix much more closely. This is expected, since both priors use heavy-tailed count distributions to model highly overdispersed counts, and have row-wise probability or dispersion parameters to model row-heterogeneity and to control the number of new columns brought by each row. Note that the observed matrix has 2904 columns, but each of the generated random count matrices has a different (random) number of columns. This is because there are one-to-one correspondences between their row indices, but not their column indices.

3.4 Out-of-sample prediction and categorization for count vectors

For out-of-sample prediction on a new row vector, we first compute that vector’s likelihood under different categories’ training count matrices. We then use these likelihoods in a naive-Bayes classifier to categorize the new vector. For example, for testing row count vector under category , we will first match the column indices (features) of this row count vector to those of the training count matrix ; each feature that is among the features of but is not present in will be assigned a zero count; and the features that are present in vector but not in will be treated as new features brought by vector to . For the GNBP, we first find an estimate of as . For the BNBP, we first find an expectation-maximization estimate of by running the updates

iteratively for 20 iterations, where for a testing row vector with all zeros, we let . Given the column sums of and the inferred model parameters (together with auxiliary variables for the GNBP), the predictive likelihoods of a new row count vector are calculated using (17) for both the NBP and BNBP and using (18) for the GNBP.
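The feature-matching step that precedes the likelihood calculation can be sketched as below. The helper `align_test_vector` and the dictionary representation of a test document are assumptions made for illustration.

```python
def align_test_vector(train_terms, test_counts):
    """Match a test document's features to a category's training columns.

    train_terms: list of terms indexing the columns of the training matrix.
    test_counts: dict mapping each term in the test document to its count.
    Returns (existing, new): counts at the existing columns (zero when the
    term is absent from the test document) and a dict of counts for terms
    unseen in training, which are treated as new columns.
    """
    train_set = set(train_terms)
    existing = [test_counts.get(t, 0) for t in train_terms]
    new = {t: n for t, n in test_counts.items()
           if t not in train_set and n > 0}
    return existing, new

train_terms = ["gamma", "poisson", "beta"]
existing, new = align_test_vector(train_terms, {"gamma": 2, "dirichlet": 1})
```

Because the alignment is done separately for each category, the same test document may bring different new columns under different categories.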

Note that when the predictive distributions are used to calculate the likelihoods, the models are not constrained under a predetermined vocabulary. But if we are given a vocabulary of size that includes all the important terms, exploiting that information might further improve the performance. Thus to test document , we also consider using

(19)

as the likelihood for the NBP, using

(20)

as the likelihood for the GNBP, and using