Beta-Product Poisson-Dirichlet Processes

Federico Bassetti Università degli Studi di Pavia federico.bassetti@unipv.it Roberto Casarin University Ca’ Foscari, Venice r.casarin@unive.it  and  Fabrizio Leisen University Carlos III de Madrid leisen@gmail.com
July 5, 2019
Abstract.

Time series data may exhibit clustering over time and, in a multiple time series context, the clustering behavior may differ across the series. This paper is motivated by the Bayesian non–parametric modeling of the dependence between the clustering structures and the distributions of different time series. We follow a Dirichlet process mixture approach and introduce a new class of multivariate dependent Dirichlet processes (DDP). The proposed DDPs are represented in terms of a vector of stick-breaking processes with dependent weights. The weights are beta random vectors that determine different and dependent clustering effects along the dimension of the DDP vector. We discuss some theoretical properties and provide an efficient Markov chain Monte Carlo algorithm for posterior computation. The effectiveness of the method is illustrated with a simulation study and an application to the United States and the European Union industrial production indexes.

JEL: C11,C14,C32

Keywords: Bayesian non–parametrics, Dirichlet process, Poisson–Dirichlet process, Multiple Time-series non–parametrics

1. Introduction

This paper is concerned with some multivariate extensions of the Poisson-Dirichlet process. The class of models considered originates from the Ferguson Dirichlet process (DP) (see Ferguson (1973, 1974)), which is now widely used in non-parametric Bayesian statistics. Our extensions rely upon Sethuraman’s representation of the DP. Sethuraman (1994) shows that, given a Polish space and a probability measure on , a random probability measure on is a Dirichlet process of precision parameter and base measure , in short , if and only if it has the stick-breaking representation:

(1)

where and are stochastically independent, is a sequence of independent random variables (atoms) with common distribution (base measure), and the weights are defined by the stick-breaking construction:

(2)

being independent random variables with beta distribution .
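The stick-breaking construction described above can be sketched numerically. The following is a minimal illustration, assuming the standard form in which each stick variable is Beta(1, alpha) and the k-th weight is the k-th stick times the product of one minus the earlier sticks. The truncation level `n_atoms`, the function name and the standard normal base measure are assumptions made for the example and are not taken from the paper.

```python
import numpy as np

def dp_stick_breaking(alpha, base_sampler, n_atoms, rng):
    """Truncated draw from a Dirichlet process via stick-breaking.

    alpha        : precision parameter (> 0)
    base_sampler : callable(rng, size) returning i.i.d. draws from the base measure
    n_atoms      : truncation level (an approximation; the DP has infinitely many atoms)
    """
    sticks = rng.beta(1.0, alpha, size=n_atoms)  # independent Beta(1, alpha) sticks
    # weight_k = stick_k * prod_{j<k} (1 - stick_j)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - sticks)[:-1]))
    weights = sticks * remaining
    atoms = base_sampler(rng, n_atoms)           # atoms are drawn independently of the weights
    return weights, atoms

rng = np.random.default_rng(0)
# Hypothetical choice: standard normal base measure, alpha = 2.
weights, atoms = dp_stick_breaking(2.0, lambda r, n: r.normal(size=n), 200, rng)
```

The truncated weights sum to slightly less than one; the missing mass is the product of all `1 - stick` terms and shrinks geometrically with the truncation level.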

The DP has been extended in many directions. In this paper, we will build on a generalization of the DP that is called the Poisson–Dirichlet process. A Poisson–DP, , with parameters and and base measure , is a random probability measure that can be represented with a Sethuraman-like construction by taking in (1)-(2) a sequence of independent random variables with , see Pitman (2006) and Pitman and Yor (1997). Further generalizations based on the stick-breaking construction of the DP can be found in Ishwaran and James (2001).
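As a hedged illustration of the two-parameter case, the sketch below uses the standard Pitman-Yor stick-breaking form, in which the k-th stick is Beta(1 - discount, strength + k * discount) with 0 <= discount < 1 and strength > -discount. The names `discount` and `strength` and the truncation level are assumptions for the example; the paper's own symbols are not visible in this text.

```python
import numpy as np

def pitman_yor_weights(discount, strength, n_atoms, rng):
    """Truncated stick-breaking weights of a two-parameter Poisson-Dirichlet process."""
    if not (0.0 <= discount < 1.0) or strength <= -discount:
        raise ValueError("need 0 <= discount < 1 and strength > -discount")
    k = np.arange(1, n_atoms + 1)
    # k-th stick ~ Beta(1 - discount, strength + k * discount); discount = 0 gives the DP
    sticks = rng.beta(1.0 - discount, strength + discount * k)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - sticks)[:-1]))
    return sticks * remaining

py_weights = pitman_yor_weights(0.25, 1.0, 500, np.random.default_rng(1))
```

With a positive discount the weights decay more slowly than in the DP case, so a larger truncation level is needed to capture the same total mass.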

The DP and its univariate extensions are now widely used in Bayesian non-parametric statistics. A recent account of Bayesian non-parametric inference can be found in Hjort et al. (2010). The univariate DP is usually employed as a prior for a mixing distribution, resulting in a DP mixture (DPM) model (see for example Lo (1984)). The DPM models incorporate DP priors for parameters in Bayesian hierarchical models, providing an extremely flexible class of models. More specifically, the DPM models define a random density by setting:

(3)

where is a suitable density kernel. Due to the availability of simple and efficient methods for posterior computation, starting from Escobar (1994) and Escobar and West (1995), DPM models are now routinely implemented and used in many fields.
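To make the random density concrete, here is a minimal sketch that evaluates one truncated draw of a DPM density on a grid. It assumes a Gaussian kernel with a fixed standard deviation, atoms that are kernel locations, and a normal base measure; all of these choices, and the function names, are illustrative assumptions rather than the specification used in the paper.

```python
import numpy as np

def stick_breaking_weights(alpha, n_atoms, rng):
    """Truncated DP weights from Beta(1, alpha) sticks."""
    sticks = rng.beta(1.0, alpha, size=n_atoms)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - sticks)[:-1]))
    return sticks * remaining

def mixture_density(y, weights, locations, kernel_sd):
    """f(y) = sum_k weights[k] * Normal(y | locations[k], kernel_sd^2)."""
    y = np.atleast_1d(np.asarray(y, dtype=float))
    z = (y[:, None] - locations[None, :]) / kernel_sd
    kernel = np.exp(-0.5 * z**2) / (kernel_sd * np.sqrt(2.0 * np.pi))
    return kernel @ weights

rng = np.random.default_rng(3)
weights = stick_breaking_weights(alpha=1.0, n_atoms=100, rng=rng)
locations = rng.normal(0.0, 3.0, size=100)   # atoms from an assumed N(0, 3^2) base measure
grid = np.arange(-30.0, 30.0, 0.01)
density = mixture_density(grid, weights, locations, kernel_sd=0.5)
```

Each call with a new random seed gives a different density, which is the sense in which the DPM model places a prior on densities.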

The first aim of this paper is to introduce a new class of vectors of Poisson-Dirichlet processes and of DPM models. Vectors of random probability measures arise naturally in generalizations of the DP and DPM models that accommodate dependence of observations on covariates or time. Using covariates, data may be divided into different groups, and this leads one to consider group-specific random probability measures and, as we shall see, to assume that the observations are partially exchangeable.

Probably the first paper in this direction, introducing vectors of priors for partially exchangeable data, is Cifarelli and Regazzini (1978). More recently, MacEachern (1999, 2001) introduced the so-called dependent Dirichlet process (DDP) and the DDP mixture models. His specification of the DDP incorporates dependence on covariates through the atoms while assuming fixed weights. More specifically, the atoms in Sethuraman’s representation (1)-(3) are replaced with stochastic processes , being a set of covariates. There exist many applications of this specification of the DDP. For instance, De Iorio et al. (2004) proposed an ANOVA-type dependence structure for the atoms, while Gelfand et al. (2004) considered a spatial dependence structure for the atoms. Later, a DDP with both dependent atoms and dependent weights was introduced in Griffin and Steel (2006). Other constructions that incorporate a dependence structure in the weights have been proposed, for instance, in Duan et al. (2007); Chung and Dunson (2011); Dunson and Peddada (2008); Dunson et al. (2008) and Rodriguez et al. (2010).

Other approaches to the definition of dependent vectors of random measures rely upon either suitable convex combinations of independent DPs (e.g., Müller et al. (2004); Pennell and Dunson (2006); Hatjispyrosa et al. (2011); Kolossiatis et al. (2011)) or hierarchical structures of stick-breakings (e.g., Teh et al. (2006); Sudderth and Jordan (2009)).

Finally, we note that vectors of dependent random probabilities can also be defined by routes other than Sethuraman’s representation. For example, Leisen and Lijoi (2011) used normalized vectors of completely random measures, while Ishwaran and Zarepour (2009) employed bivariate gamma processes.

In this paper, we introduce a new class of multivariate Poisson-DP and DP by using a vector of stick-breaking processes with multivariate dependent weights. In the construction of the dependent weights, we consider the class of multivariate beta distributions introduced by Nadarajah and Kotz (2005), which has a tractable stochastic representation and makes the Bayesian inference procedures easier. We discuss some properties of the resulting multivariate DP and Poisson-DP and show that our process has the appealing property that the marginal distributions are DP or Poisson-DP.

The second aim of the paper is to apply the new DP to Bayesian non–parametric inference and to provide a simple and efficient method for posterior computation of DDP mixture models. We follow a data-augmentation framework and extend to the multivariate context the slice sampling algorithm described in Walker (2007) and Kalli et al. (2011). The sampling methods for the full conditional distributions of the resulting Gibbs sampling procedure are detailed and the effectiveness of the proposed algorithm is studied with a set of simulation experiments.

Another contribution of this paper is to present an application of the proposed multivariate DDP mixture models to multivariate time series modeling. In recent years, the interest in Bayesian non-parametric models for time series has increased. In this context, DPs have been employed in different ways. Rodriguez and ter Horst (2008) used a Dirichlet process to define an infinite mixture of time series models. Taddy and Kottas (2009) proposed a Markov-switching finite mixture of independent Dirichlet process mixtures. Jensen and Maheu (2010) considered Dirichlet process mixtures of stochastic volatility models and Griffin (2011) proposed a continuous-time non–parametric model for volatility. A flexible non–parametric model with a time-varying stick-breaking process has been recently proposed by Griffin and Steel (2011). In their model, a sequence of dependent Dirichlet processes is used for capturing time-variations in the clustering structure of a set of time series. In our paper, we extend the existing Bayesian non–parametric models for multiple time series by allowing each series to have a possibly different clustering structure and by accounting for dependence between the series-specific clusterings. Since we obtain a dynamic infinite-mixture model and since the number of components with negligible weights can be different in each series, our model represents a non–parametric alternative to multivariate dynamic finite-mixture models (e.g., Markov-switching models) that are usually employed in time series analysis.

The structure of the paper is as follows. Section 2 introduces vectors of dependent stick-breaking processes and defines their properties. Section 3 introduces vectors of Poisson-Dirichlet processes for prior modelling. Section 4 proposes a Markov chain Monte Carlo (MCMC) algorithm for approximate inference for vectors of DP mixtures. Section 5 provides some applications both to simulated data and to the time series of the industrial production index for the United States and the European Union. Section 6 concludes the paper.

2. Dependent stick-breaking processes

Consider a set of observations, taking values in a space , say a subset of , divided into sub–samples (or groups of observations), that is:

Above is the -th observation within sub–sample . For instance, may correspond to a space label or predictors. Typically, one assumes that the observations of the block have the same (conditional) density and that the observations are (conditionally) independent. Hence, to perform a non–parametric Bayesian analysis of the data, one needs to specify a prior distribution for the vector of densities . Moreover, in assessing a prior for , one may be interested in borrowing information across blocks. To do this, we first introduce a sequence of density kernels () (where is jointly measurable and defines a probability measure on for any in , being a dominating measure on ). Second, we define:

(4)

where is a vector of dependent stick breaking processes that will be defined in the next section.

2.1. Vectors of stick-breaking processes

Following a general definition of dependent stick-breaking processes, essentially proposed in MacEachern (1999, 2001), we let

(5)

where the vectors of weights and the atoms satisfy the following hypotheses:

  • and are stochastically independent;

  • is an i.i.d. sequence taking values in with common probability distribution ;

  • are determined via the stick breaking construction

    where for and are stochastically independent random vectors taking values in such that a.s. for every .

Note that is a vector of dependent random measures whenever or are vectors of dependent random variables. The dependence between the measures affects the dependence structure underlying the densities , which can be represented as infinite mixtures

(6)

functions of the atoms and the weights .
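The definition above can be sketched as follows: a matrix of stick variables, one column per block, is turned into a matrix of weights by applying the stick-breaking construction column by column, and all blocks share the same atoms. The dependent sticks used here (a shared uniform factor times block-specific uniforms) are a toy choice made only to show dependence across columns; they are an assumption of this example and not the specification proposed in the paper.

```python
import numpy as np

def weights_from_sticks(sticks):
    """Column-wise stick-breaking: sticks has shape (n_atoms, r), entries in [0, 1]."""
    sticks = np.asarray(sticks, dtype=float)
    remaining = np.cumprod(1.0 - sticks, axis=0)
    # shift by one row so that row k holds prod_{j<k} (1 - stick_j)
    remaining = np.vstack([np.ones((1, sticks.shape[1])), remaining[:-1]])
    return sticks * remaining

rng = np.random.default_rng(42)
n_atoms, r = 300, 2
# Toy dependent sticks (illustrative only): the shared factor makes the r columns dependent.
shared = rng.uniform(size=(n_atoms, 1))
own = rng.uniform(size=(n_atoms, r))
sticks = shared * own
weights = weights_from_sticks(sticks)   # column i holds the weights of the i-th random measure
atoms = rng.normal(size=n_atoms)        # common atoms, drawn independently of the weights
```

Each column of `weights`, paired with `atoms`, is a truncated draw of one component of the vector of random measures; the mixture density of block i is then the kernel averaged with the weights in column i.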

The above definition of dependent random measures is quite general. For the sake of completeness, we note that our specification of vectors of stick-breaking processes can be extended, even if not straightforwardly, to include richer structures such as the matrix of stick-breaking processes proposed in Dunson et al. (2008). In the rest of this section we briefly discuss the choice of atoms and analyze some general features of vectors of stick-breaking processes. In the next section we focus on the main contribution of this work, which is a new specification of the stick vectors based on multivariate beta distributions.

2.2. Atoms

The simplest assumption for the atoms is that they are common to all the measures . In other words, the base measure of the atoms is

(7)

for every measurable subsets of , being a probability measure on , which corresponds to the case

(8)

with distributed according to .

Alternatively, one can choose a more complex structure for the law of the atoms, including covariates (or exogenous effects) related to the specific block . For instance, one can assume an ANOVA-like scheme of the form

(9)

where represents the overall "mean" (of the -th mixture component) and the specific "mean" for factor (of the -th mixture component). A similar choice has been used in De Iorio et al. (2004).

In many situations it is reasonable to assume that the components of the mixture are essentially the same for all the blocks but that they have different weights. In addition, the choice (7)-(8) yields a simple form of the correlation between the related random measures. This feature, which may be useful in parameter elicitation, is discussed in the next subsection.

2.3. Correlation and Dependence Structure

In the general definition given in Subsection 2.1 the vectors of stick variables are assumed to be independent. If in addition one assumes that they have the same distribution, i.e. that

(10)

then it is easy to compute the correlation of two elements of the vector as well as the correlation between and .

For the sake of simplicity we shall consider only the case and set

(11)

and, for every ,

where denotes the -th marginal of .

Proposition 1.

If (10) holds true, then for all measurable sets and

(12)

and for every in

(13)

The proof of Proposition 1 is in the Appendix.

Corollary 2.

If (8) and (10) hold true, then for every measurable set

(14)

where is defined in (11).

2.4. Partial exchangeability

We conclude this section by observing that is a partially exchangeable random array. Indeed, the joint law of the infinite process of observations is characterized by:

(15)

where the expectation is taken with respect to the joint law of . Recall that an array is said to be row-wise partially exchangeable if, for every , all measurable sets and any permutations of ,

In other words, the joint law is not necessarily invariant to permutations of observations from different groups. From a practical point of view, the partial exchangeability represents a suitable model for sets of data that exhibit sub-samples with some possibly different features.

3. Beta-Product Poisson-Dirichlet Processes

We propose a new class of vectors of dependent probability measures in such a way that (marginally) is a Dirichlet process for every . This result follows from Sethuraman’s representation (1) if one considers a multivariate distribution for such that .

It is worth noting that there are many possible definitions of a multivariate beta distribution (see for example Olkin and Liu (2003) and Nadarajah and Kotz (2005)), but not all of them have a tractable stochastic representation and lead to simple Bayesian inference procedures. For this reason we follow Nadarajah and Kotz (2005) and consider a suitable product of independent beta random variables. More specifically, we apply the following result.

Proposition 3 (Radhakrishna Rao (1949)).

If are independent beta random variables with shape parameters , and if , , then the product is also a beta random variable with parameters .

Proof.

It is easy to check that the Mellin transform of a beta random variable of parameters is given by

Hence, since and using the independence assumption

Which gives the result for . The general case follows. ∎
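For reference, the two-factor computation behind the proof can be written out explicitly. The parameter names $(a_i, b_i)$ and the ordering convention $a_1 = a_2 + b_2$ are assumptions made for this sketch, since the symbols of the statement are not visible in the text above; the Mellin transform of a beta variable used in the first line is standard.

```latex
\begin{align*}
\mathbb{E}\left[X^{s}\right]
  &= \frac{\Gamma(a+s)\,\Gamma(a+b)}{\Gamma(a)\,\Gamma(a+b+s)},
  \qquad X \sim \mathrm{Beta}(a,b), \\
\mathbb{E}\left[(X_1 X_2)^{s}\right]
  &= \frac{\Gamma(a_1+s)\,\Gamma(a_1+b_1)}{\Gamma(a_1)\,\Gamma(a_1+b_1+s)}
     \cdot
     \frac{\Gamma(a_2+s)\,\Gamma(a_2+b_2)}{\Gamma(a_2)\,\Gamma(a_2+b_2+s)} \\
  &= \frac{\Gamma(a_2+s)\,\Gamma(a_2+b_1+b_2)}{\Gamma(a_2)\,\Gamma(a_2+b_1+b_2+s)}
  \qquad \text{if } a_1 = a_2 + b_2 .
\end{align*}
```

Under the assumed condition the factors $\Gamma(a_1+s)/\Gamma(a_2+b_2+s)$ and $\Gamma(a_2+b_2)/\Gamma(a_1)$ both equal one, and the last line is the Mellin transform of a $\mathrm{Beta}(a_2,\, b_1+b_2)$ variable. Iterating the same cancellation over further factors would give the general case.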

We obtain two alternative specifications of the multidimensional beta variables. Specifically, if we set

(16)

with independent, () and , then .

As an alternative we consider

(17)

with independent and , , that gives .
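The two constructions can be checked by simulation. The sketch below is only one parametrization consistent with Proposition 3: the exact parameters of (16) and (17) are not visible in this text, so the beta parameters, the ordering of the components in the second construction, and the function names are all assumptions. In the first construction a common factor multiplies block-specific factors; in the second, the second stick is the first one multiplied by a further independent beta variable.

```python
import numpy as np

def shared_factor_sticks(a1, a2, r, n, rng):
    """S_i = V_0 * V_i for i = 1..r; each S_i is marginally Beta(1, a1 + a2).

    Assumed parametrization: V_0 ~ Beta(1 + a1, a2), V_i ~ Beta(1, a1).
    """
    v0 = rng.beta(1.0 + a1, a2, size=(n, 1))
    vi = rng.beta(1.0, a1, size=(n, r))
    return v0 * vi

def nested_sticks(a1, a2, n, rng):
    """S_1 = V_1 and S_2 = V_1 * V_2, so S_1 >= S_2.

    Assumed parametrization: V_1 ~ Beta(1, a1), V_2 ~ Beta(1 + a1, a2),
    which gives S_1 ~ Beta(1, a1) and S_2 ~ Beta(1, a1 + a2).
    """
    v1 = rng.beta(1.0, a1, size=n)
    v2 = rng.beta(1.0 + a1, a2, size=n)
    return np.column_stack([v1, v1 * v2])

rng = np.random.default_rng(7)
s_shared = shared_factor_sticks(1.5, 0.5, r=2, n=200_000, rng=rng)
s_nested = nested_sticks(1.5, 0.5, n=200_000, rng=rng)
```

A Beta(1, b) variable has mean 1 / (1 + b), so comparing the column means of `s_shared` and `s_nested` with that value gives a quick sanity check of the marginal laws.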

It should be noted that (16) resembles the specification of the matrix stick-breaking process in Dunson et al. (2008). In that paper all the components of the stick-breaking process are products of two independent beta variables with fixed parameters and , which precludes obtaining Poisson-Dirichlet marginals. For this reason we propose the specifications in (16) and (17) that, for a special choice of the parameters , allow for Poisson-Dirichlet marginals. In the first construction we obtain a random vector with identical marginal distributions, while in the second construction the vector has different marginals. Moreover, in the second case , which induces an ordering on the concentration parameters of the Dirichlet marginals. These aspects will be further discussed in the following sections.

3.1. Dirichlet Process marginal.

For the sake of clarity, we start with . We assume (10) and we discuss how to choose the parameters in (16)-(17) in order to get DP marginals. Note that since (10) holds true, then . According to the construction schemes given in (16) and (17), with , , , two possible and alternative specifications of are:

  • , with independent, and , , where and ;

  • , with independent, and with and .

Thanks to Proposition 3, if (H1) holds, then and , while if (H2) holds, then and . Hence, we have the following result.

Proposition 4.

Under (10), if (H1) holds true, and are (marginally) Dirichlet processes with the same precision parameter and base measures , respectively. If (H2) holds true, then the first component of is a Dirichlet process with precision parameter and base measure and the second component is a Dirichlet process with precision parameter and base measure .

Since with this construction the ’s are Dirichlet processes, we call Beta-Product Dependent Dirichlet Process of parameters and base measure , in short , where for (H1) and for (H2). In addition, when we assume (7), we denote the resulting process with .

It should be noted that the two processes have different marginal behaviors. The process has marginals with the same precision parameter and should be used as a prior when the clustering along the different vector dimensions is expected to be similar. In the process, the precision parameter decreases along the vector dimension. This process should be used as a prior when one suspects a priori that the clustering features are different in the subgroups of observations.

Figure 1. Left column: correlation between and under (H1) (first row) and (H2) (second row). Right column: correlation between and , assuming (7), under (H1) (first row) and (H2) (second row). Panel titles: "Correlation under H1" (first row) and "Correlation under H2" (second row).

For parameter elicitation purposes, it is useful to analyze how the choice of affects the correlation between and . Let us start by considering the correlation between the stick variables. From Theorems 4 and 6 in Nadarajah and Kotz (2005), one obtains the following correlation between the components and in the cases of (H1) and (H2)

Fig. 1 shows the correlation level between the stick-breaking components (left column) and the random measures (right column) for different values of and . In these graphs, the white color is used for correlation values equal to one and the black is used for a correlation value equal to zero. The gray areas represent correlation values in the interval under both (H1) and (H2) beta models. According to the graph at the top-left of Fig. 1, one can conclude that the parametrization used in this paper allows for covering the whole range of possible correlation values in the (0,1) interval. For instance, under (H1), a low correlation between the components of the stick-breaking corresponds to low values of the parameter , say between 0 and 0.1, for any choice of the parameter .

Corollary 5.

Under the same assumptions of Proposition 4,

Recall that if (7) holds true, then for any measurable set :

The graphs at the bottom of Fig. 1 show how the parameters and affect the correlation between the components and of the bivariate beta used in the stick breaking process and the correlation between the random measures and – when assuming (7).

It is worth noting that, under (H1), converges (in distribution) to as , where and are independent random variables with distribution . Under (H2), instead, converges to as , where is a random variable. In particular, if one assumes (7) and (H2), when , one gets the limit situation in which all the observations are sampled from a common mixture of Dirichlet processes. In other words, in this limit case, as can be seen from (15) for , one considers the observations (globally) exchangeable, so no distinction between the two blocks is allowed. The other limiting case is when one assumes (H1) and takes to be independent random elements with probability distribution and . In the limit for , one obtains two independent Dirichlet processes and with base measures and . In other words, with this choice, one considers the blocks of observations as two independent blocks of exchangeable variables and no sharing of information is allowed.

The (H1) construction for the case follows immediately from (16) with , , , that is by assuming , , to be independent with and , , where and . In this case, has distribution for and hence is marginally . Also the formula for the correlation between two measures easily extends to the case under (7):

The (H2) construction extends to by setting in (17) and (), that is by taking to be independent random variables with , , , , where for all . In this last case for every . Hence is marginally . Under (7) the correlation between and with is given by

(18)

The proof of this last result is given in the Appendix.

3.2. Poisson-Dirichlet process marginal

Recall that a Poisson–Dirichlet process, , with parameters and , and base measure , is obtained by taking in (1)-(2) a sequence of independent random variables with . In this section we show that by a suitable choice of the parameters in (16)-(17) we obtain a vector of dependent random measures with Poisson-Dirichlet marginals.

In the first case use (16) with , and , that is take to be independent random variables such that

(19)

where , , and . Proposition 3 yields that . In the second case use (17) with , , and for , that is take to be independent random variables such that

(20)

with and . In this last case has for every .

Summarizing, we have proved the following.

Proposition 6.

If (16) and (19) hold true is a for every , while if (17) and (20) hold true is a for every .

4. Slice Sampling Algorithm for Posterior Simulation

For posterior computation, we propose an extension of the slice sampling algorithm introduced in Walker (2007); Kalli et al. (2011). For the sake of simplicity we shall describe the sampling strategy for a vector of Beta-Product DDP with (), see Subsection 3.1. The proposed algorithm can be easily extended to the case and to the Beta-Product dependent Poisson-DP.

Recall that in the stick variables are defined by

for a sequence of independent vectors with the same distribution of and the convention and under (H2).

Starting from (6), the key idea of the slice sampling is to find a finite number of variables to be sampled. First we introduce a latent variable in such a way that is the marginal density of

It is clear that given , the number of components is finite. In addition we introduce a further latent variable which indicates which of this finite number of components provides the observation, that is

(21)

Hence, the likelihood function for the augmented variables is available as a simple product of terms and crucially is finite.
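The finiteness argument can be illustrated in the univariate case. The sketch below follows the idea of Walker (2007): for a slice level `u`, only components whose weight exceeds `u` are active, and sticks need to be generated only until the remaining mass drops to `u` or below, because every later weight is bounded by that remaining mass. Drawing the sticks from a Beta(1, alpha) prior and the function name are assumptions of the example; in the sampler itself the sticks would come from their full conditional distribution.

```python
import numpy as np

def active_components(u, alpha, rng, max_atoms=100_000):
    """Generate weights until the leftover mass is <= u; return them and the active indices."""
    weights = []
    remaining = 1.0                      # mass not yet assigned to any component
    while remaining > u and len(weights) < max_atoms:
        stick = rng.beta(1.0, alpha)
        weights.append(stick * remaining)
        remaining *= 1.0 - stick         # every later weight is at most this value
    weights = np.array(weights)
    return weights, np.flatnonzero(weights > u)

slice_weights, active = active_components(u=0.01, alpha=2.0, rng=np.random.default_rng(11))
```

With several observations, the same loop would be run with the smallest slice variable, so that the generated components cover the active sets of all observations.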

To be more precise we introduce the allocation variables () taking values in and the slice variables () taking values on . We shall use the notation

and we write: for , for , for , for and for .

The random elements have the law already described. While conditionally on , the random vectors , , are stochastically independent with the joint density (21).

We conclude by observing that it can be useful to put a prior distribution even on the hyperparameters . The law of is assumed independent of the random vector of hyperparameters , while the distribution of depends on through (H1) or (H2), so we shall write .

Our target is the exploration of the posterior distribution (given ) of by Markov Chain Monte Carlo sampling.

Essentially we shall use a block Gibbs sampler which iteratively simulates given , given and given .

For the one-dimensional case, this blocking structure was introduced in Papaspiliopoulos (2008) and Kalli et al. (2011) as an alternative and more efficient version of the original algorithm of Walker (2007). Our algorithm extends the one-dimensional slice sampling of Kalli et al. (2011) to the multidimensional case. This extension is not trivial, as it involves generating random samples from vectors of allocation and slice variables of a multivariate stick-breaking process. We present an efficient Gibbs sampling algorithm by elaborating further on the blocking strategy of Kalli et al. (2011).

In order to describe in detail the full conditionals of the block Gibbs sampler sketched above, we need some more notation. Define for and ,

Moreover, let

(22)

In our MCMC algorithm we shall treat as three blocks of random length: , where

and . Note that and almost surely. The following subsections give the details of the full conditionals of the block Gibbs sampler; further details on the algorithm are given in the Appendix.

4.1. The full conditional of

The atoms given are conditionally independent and the full conditionals are:

(23)

where . The strategy for sampling from this full conditional depends on the specific form of and . In the next section we will discuss a possible strategy for Gaussian kernels.
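As a simplified illustration of an atom update, the sketch below assumes common atoms across blocks as in (7)-(8), a Gaussian kernel with known standard deviation, and a normal base measure on the kernel location. Under these assumptions the full conditional of the k-th atom is normal by conjugacy and pools the observations allocated to component k in every block. This is a textbook conjugate update used as a stand-in; the function and argument names are hypothetical, and it is not claimed to be the strategy adopted in the paper.

```python
import numpy as np

def sample_atom(y_blocks, d_blocks, k, kernel_sd, prior_mean, prior_sd, rng):
    """Draw the k-th common atom from its normal full conditional.

    y_blocks : list of 1-d arrays, observations of each block
    d_blocks : list of 1-d integer arrays, allocation of each observation
    """
    # pool the observations allocated to component k across all blocks
    y_k = np.concatenate([np.asarray(y, dtype=float)[np.asarray(d) == k]
                          for y, d in zip(y_blocks, d_blocks)])
    precision = 1.0 / prior_sd**2 + y_k.size / kernel_sd**2
    mean = (prior_mean / prior_sd**2 + y_k.sum() / kernel_sd**2) / precision
    draw = rng.normal(mean, np.sqrt(1.0 / precision))
    return draw, mean, precision

y_blocks = [np.array([1.0, 2.0, 10.0]), np.array([3.0, -5.0])]
d_blocks = [np.array([0, 0, 1]), np.array([0, 1])]
rng = np.random.default_rng(5)
draw0, mean0, prec0 = sample_atom(y_blocks, d_blocks, 0, 1.0, 0.0, 10.0, rng)
draw5, mean5, prec5 = sample_atom(y_blocks, d_blocks, 5, 1.0, 0.0, 10.0, rng)
```

A component with no allocated observations, such as `k = 5` here, is simply redrawn from the base measure, since the posterior mean and precision reduce to the prior ones.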

4.2. The full conditional of

In order to sample from the conditional distribution of given a further blocking is used:

  • given . The joint conditional distribution of given is

    (24)

    where is the prior on the concentration parameters and

    (25)