# Concentration in the Generalized Chinese Restaurant Process

A. Pereira, R. I. Oliveira, R. Ribeiro
###### Abstract.

The Generalized Chinese Restaurant Process (GCRP) describes a sequence of exchangeable random partitions of the numbers 1,…,n. This process is related to the Ewens sampling model in Genetics and to Bayesian nonparametric methods such as topic models. In this paper, we study the GCRP in a regime where the number of parts grows like nα with 0<α<1. We prove a non-asymptotic concentration result for the number of parts of size k. In particular, we show that these random variables concentrate around ckV∗nα, where V∗nα is the asymptotic number of parts and ck is a positive value depending on k. We also obtain finite-n bounds for the total number of parts. Our theorems complement asymptotic statements by Pitman and more recent results on large and moderate deviations by Favaro, Feng and Gao.

## 1. Introduction

Models of random partitions have attracted much attention in Probability and Statistics. In this paper we study a specific family of models of random partitions called generalized Chinese Restaurant processes (GCRP). These models were introduced by Pitman [13], [14] as a two-parameter generalization of Ewens’ sampling formula [7]. They are also important building blocks in topic models [11] and other Bayesian nonparametric methods [5].

The GCRP generates a sequence of random partitions Pn of [n] for n∈N. We focus on a specific setting for the model where the number of parts in Pn grows like nα for a parameter 0<α<1. Our main goal is to prove concentration for the total number of parts and for the number of parts with size k in each Pn, that is:

 Nn(k):=|{A∈Pn:|A|=k}|.

As we explain below, the Pn are mixtures of i.i.d. models, and the above random variables do not concentrate around any fixed value. Nevertheless, we show that they do concentrate around random values. Our main result – Theorem 3.2 below – shows that, for large n, with high probability,

 Nn(k)=cV∗(Γ(k−α)/Γ(k+1))nα+o((Γ(k−α)/Γ(k+1))nα)

where V∗ is a random variable with V∗>0 a.s. and c is a constant depending on the model parameters. This result holds simultaneously for all k in a range that grows polynomially in n. Since

 Γ(k−α)/Γ(k+1)=Θ(k−(1+α)) for large k,

we verify that the power-law-type behavior in k that is known to hold asymptotically for the Nn(k) is already visible for finite n. Moreover, in our proof we also obtain finite-n bounds on the number of parts in Pn (cf. Theorem 3.1 below).

Our proof method is based on martingale inequalities and is inspired by the analysis of preferential-attachment-type models [4]. However, there are some important technical differences, which we discuss in subsection 3.2. A salient feature of our approach is that the concentration-of-measure arguments we employ are somewhat delicate, and rely on Freedman’s concentration inequality [10].

The remainder of the paper is organized as follows. We fix some notation in the next paragraph. In section 2, we introduce the model, discuss its regimes, and give some background on its theory and applications. Section 3 states our main theorems. We will also outline their proofs and compare them with previous results. Section 4 contains the main concentration-of-measure results we will need, including Freedman’s inequality. Actual proofs start in Section 5 with the analysis of the number of parts in Pn. The argument for Nn(k) is more convoluted and takes four sections. Section 6 gives some preliminary results, including a recursive formula. Section 7 obtains high-probability upper and lower bounds for Nn(k). The proof of our main theorem is wrapped up in Section 8. The final section contains some concluding remarks. The appendix collects several technical estimates.

Notation: In this paper N is the set of positive integers. Given n∈N, we let [n] denote the set of all numbers from 1 to n. Given a nonempty set S, a partition P of S is a collection of pairwise disjoint and nonempty subsets of S whose union is all of S. The elements of P are called the parts. We denote the cardinality of a finite set A by |A|. In particular, for a finite partition P, |P| denotes the number of parts in P. Finally, when we talk about sequences of random or deterministic values, we will write .

## 2. The model

### 2.1. Definitions

Fix two parameters α and θ; extra conditions will be imposed later. GCRP(α,θ) – shorthand for the Generalized Chinese Restaurant Process with parameters (α,θ) – is a Markov chain

 P1,P2,P3,P4,…

where, for each n∈N, Pn is a partition of [n]. We let

 (1) Vn:=|Pn|

denote the number of parts in Pn and write

 (2) Pn={Ai,n:i=1,…,Vn},

where the Ai,n are the parts of Pn. In the colorful metaphor of the “Chinese restaurant", the Ai,n are the tables occupied by the customers 1,…,n, who arrive sequentially, with Vn being the number of occupied tables. So Pn describes the table arrangements of the first n customers.

The evolution of the process is as follows.

• Initial state: customer 1 sits by herself, i.e. P1={{1}}.

• Evolution: Given Pn, with the Ai,n as in (2), we define Pn+1 via a random choice:

• For each i∈[Vn], with probability

 (|Ai,n|−α)/(n+θ),

customer n+1 sits at the ith table. That is,

 Pn+1={Aj,n:j∈[Vn]∖{i}}∪{Ai,n∪{n+1}}.

Notice that Vn+1=Vn in this case.

• With probability

 (αVn+θ)/(n+θ),

customer n+1 sits by herself at a new table. That is, we set

 Pn+1={Ai,n:i=1,…,Vn}∪{{n+1}}.

In this case Vn+1=Vn+1.

Our focus in this paper is on Vn and the random variables

 (3) Nn(k):=|{A∈Pn:|A|=k}|=|{i∈[Vn]:|Ai,n|=k}|(k∈[n])

that count how many of the parts in Pn have size k.
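As a quick illustration of these dynamics, the following self-contained Python sketch simulates one run of the model and tabulates the part sizes. The function name, parameter values and seed are ours, chosen for illustration; they are not from the paper.

```python
import random
from collections import Counter

def simulate_gcrp(n, alpha, theta, rng=None):
    """Simulate one GCRP(alpha, theta) run up to n customers.

    Returns the list of table (part) sizes after customer n is seated.
    Intended for the polynomial regime 0 < alpha < 1, theta > -alpha."""
    rng = rng or random.Random()
    tables = [1]  # customer 1 sits alone: P_1 = {{1}}
    for m in range(1, n):  # m customers seated; customer m+1 arrives
        v = len(tables)
        # total mass: sum_i (|A_i|-alpha) + (alpha*v+theta) = m + theta
        u = rng.random() * (m + theta)
        if u < alpha * v + theta:
            tables.append(1)          # new table, prob (alpha*v+theta)/(m+theta)
        else:
            u -= alpha * v + theta
            for i, size in enumerate(tables):
                u -= size - alpha     # table i, prob (size-alpha)/(m+theta)
                if u < 0:
                    tables[i] += 1
                    break
            else:
                tables[-1] += 1       # guard against float round-off
    return tables

sizes = simulate_gcrp(5000, alpha=0.5, theta=1.0, rng=random.Random(42))
N = Counter(sizes)  # N[k] plays the role of N_n(k)
print(len(sizes), N[1], N[2])
```

Running this prints the number of parts and the counts of parts of sizes 1 and 2 for one realization; repeated runs give visibly different numbers, consistent with the random limits discussed below.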

### 2.2. Choices of parameters and different regimes

The attentive reader will have noticed that the above process only makes sense for certain values of α and θ. Specifically, there are different assumptions one can make, which lead to different behavior [13, 14].

• Bounded number of parts: if α<0 and θ=−mα for some m∈N, then Vn→m almost surely. After Vn reaches the value m, the process behaves like an urn model with m urns.

• Logarithmically growing number of parts: if α=0 and θ>0, then

 Vn/logn→θ almost surely

and Vn has Gaussian fluctuations at the scale of √logn.

• Polynomially growing number of parts: if 0<α<1 and θ>−α,

 (4) Vn/nα→Vo almost surely

where Vo is a nondegenerate random variable with a density over (0,+∞). In particular, Vo>0 almost surely.

This last regime is the focus of the present paper.
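The non-degenerate limit in the polynomial regime is easy to observe numerically: Vn is itself a Markov chain, since a new part appears with probability (αVm+θ)/(m+θ) at step m. A minimal Python sketch (ours, with illustrative parameter values):

```python
import random

def sample_Vn(n, alpha, theta, rng):
    """Number of parts after n customers; a new part appears
    with probability (alpha*v + theta)/(m + theta) at step m."""
    v = 1
    for m in range(1, n):
        if rng.random() * (m + theta) < alpha * v + theta:
            v += 1
    return v

alpha, theta, n = 0.5, 1.0, 100_000
rng = random.Random(0)
ratios = [sample_Vn(n, alpha, theta, rng) / n**alpha for _ in range(5)]
print([round(r, 3) for r in ratios])
```

Across independent runs the ratio Vn/nα settles near run-dependent values rather than a single deterministic constant, reflecting the random limit Vo.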

### 2.3. Some background

We discuss here a bit of the history and applications of the GCRP. Those interested only in results may skip to the next section.

The GCRP is an exchangeable model in the sense that the law of Pn is invariant under permutations of [n]. One consequence of this is that the natural infinite limit of the Pn is an exchangeable random partition P∞ of the natural numbers N. That is, the law of P∞ is invariant under any finite permutation of N.

A well-known result of Kingman [12] says that exchangeable random partitions of N can always be built from mixtures of paintbox partitions. Suppose p=(pj)j≥0 is a random probability distribution over the index set {0,1,2,…}. Conditionally on p, let (Zi)i∈N be an i.i.d.-p sequence. Form a partition of N by placing each i with Zi=0 in a singleton, and (for each j≥1) putting all i with Zi=j in the same part. Clearly, such a construction always leads to an exchangeable random partition, and Kingman’s theorem says that this is the only way to build such partitions. In the specific case of the infinite GCRP(α,θ), the law of the part frequencies is the two-parameter Poisson-Dirichlet distribution PD(α,θ). This can be used to derive explicit formulae for the distribution of Nn(k) for each n and k.

The GCRP was first mentioned in print by Aldous [2]. It was studied by Pitman [13], [14] as an example of a partially exchangeable model where many explicit calculations are possible. In particular, the exact distribution of the random variables Nn(k) we consider can be computed explicitly. Based on these formulae, [8], [9] obtained large and moderate deviation results for these variables. These results are briefly described in subsection 3.1 below.

The class of models we consider is also important in many applications. On the one hand, it is a generalization of Ewens’ neutral allele sampling model in population Genetics [7]. On the other hand, the GCRP and its variants are important building blocks for topic models [11] and many other Bayesian nonparametric methods. We refer to Crane’s recent survey [5] for much more information on our model, its extensions and the many contexts where it has appeared.

## 3. Results

Let k∈N and recall the definitions of Vn and Nn(k) in (1) and (3), respectively. Our theorem describes these random variables in the setting where 0<α<1 and θ>−α. Recall from Section 2.2 that in this setting the random variables Vn/nα have a nontrivial limit (cf. (4)). For our purposes, it is more convenient to work with the random variables Vn/ϕn, where

 ϕn:=Γ(1+θ)/Γ(1+θ+α)⋅Γ(n+α+θ)/Γ(n+θ).

Note that ϕn/nα converges to a constant when n→+∞. In particular, the limit

 (5) V∗:=limn→+∞Vn/ϕn almost surely

exists and is a.s. positive (it is a rescaling of Vo). Our first result quantifies the convergence in this statement.
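Numerically, ϕn/nα flattens out quickly. The sketch below (ours, not from the paper) evaluates ϕn via log-Gamma to avoid overflow, at one illustrative parameter choice:

```python
from math import gamma, lgamma, exp

def phi(n, alpha, theta):
    # phi_n = [Gamma(1+theta)/Gamma(1+theta+alpha)] * [Gamma(n+alpha+theta)/Gamma(n+theta)]
    return exp(lgamma(1 + theta) - lgamma(1 + theta + alpha)
               + lgamma(n + alpha + theta) - lgamma(n + theta))

alpha, theta = 0.5, 1.0
limit = gamma(1 + theta) / gamma(1 + theta + alpha)
for n in (10**2, 10**4, 10**6):
    print(n, phi(n, alpha, theta) / n**alpha)
```

The printed ratios approach Γ(1+θ)/Γ(1+θ+α), the constant mentioned above.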

###### Theorem 3.1 (Proven in subsection 5.3).

Consider a realization of the Generalized Chinese Restaurant Process GCRP(α,θ) with parameters 0<α<1 and θ>−α. Then there exists a constant c∗>0 such that for all δ∈(0,1) the following holds with probability ≥1−δ:

 ∀m∈N:|Vm/ϕm−V∗|≤c∗[loglog(m+2)+log(1/δ)]/(m+θ)α/2.

Our second and main result gives concentration of the random variables Nn(k) simultaneously for all k in a polynomially large range.

###### Theorem 3.2 (Main; proven in section 8).

Consider a realization of the Generalized Chinese Restaurant Process GCRP(α,θ) with parameters 0<α<1 and θ>−α. Then there exist constants C,A>0 such that the following holds. Take ε∈(0,1), and define the cutoff kε accordingly. Then the following holds with probability ≥1−ε:

 ∀k∈[kε,n]:|Nn(k)−c(α,θ)(Γ(k−α)/Γ(k+1))V∗nα|≤C(Γ(k−α)/Γ(k+1))nαεα+2(1+Alogn)

where

 c(α,θ):=αΓ(1+θ)/(Γ(1−α)Γ(1+α+θ))>0.
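The constant c(α,θ) and the weights Γ(k−α)/Γ(k+1) are easy to evaluate, and one can check numerically that the weights decay like k−(1+α), as stated in the Introduction. A small sketch (ours, with illustrative parameters):

```python
from math import gamma, lgamma, exp

def c_coeff(alpha, theta):
    # c(alpha,theta) = alpha*Gamma(1+theta) / (Gamma(1-alpha)*Gamma(1+alpha+theta))
    return alpha * gamma(1 + theta) / (gamma(1 - alpha) * gamma(1 + alpha + theta))

def weight(k, alpha):
    # Gamma(k-alpha)/Gamma(k+1), computed stably via log-Gamma
    return exp(lgamma(k - alpha) - lgamma(k + 1))

alpha, theta = 0.5, 1.0
print(c_coeff(alpha, theta))
for k in (10, 100, 1000):
    print(k, weight(k, alpha) * k**(1 + alpha))
```

The products weight(k,α)·k^(1+α) stabilize near 1 as k grows, which is the Θ(k−(1+α)) behavior.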

The following immediate corollary is perhaps somewhat easier to parse.

###### Corollary 3.1.

In the setting of Theorem 3.2, let (εn) be a sequence with εn→0 sufficiently slowly. Then there exists a sequence ξn→0 such that

 P(∀k∈[kεn,n]:(V∗−ξn)c(α,θ)(Γ(k−α)/Γ(k+1))nα≤Nn(k)≤(V∗+ξn)c(α,θ)(Γ(k−α)/Γ(k+1))nα)≥1−εn

for large enough .

###### Proof sketch.

Apply Theorem 3.2 with ε=εn, where εn→0 slowly enough. Then take:

 ξn=(C/c(α,θ))εα+2n(1+Cn).

### 3.1. Related work

One consequence of our results is the a.s. asymptotics for Nn(k):

 Nn(k)/(V∗nα)→c(α,θ)Γ(k−α)/Γ(k+1).

This kind of Law of Large Numbers was first obtained by Pitman [14, Chapter 3] with no explicit convergence rates.

Much more recently, Favaro, Feng and Gao [8, 9] have used Pitman’s explicit formulae to obtain large and moderate deviation results for the Nn(k). Reference [9], which is the closest to our work, focuses on precise estimates for probabilities like

 (6) P(Nn(k)/(nαβn)>c) when βn≫(logn)1−α.

The paper [8] considers even larger sequences βn. By contrast, we obtain finite-n estimates for deviations at smaller scales, which (as expected) are not as precise. There is also a difference in proof methods: whereas they rely on explicit formulae, our argument is based on recursions and martingales.

Another important conceptual difference between our work and that of Favaro et al. is that, for their purposes, the lack of concentration in Nn(k) is not an issue. Indeed, if one goes “deep enough" into the tail of the Nn(k), as in (6), the nontrivial distribution of V∗ becomes irrelevant. Our theorems operate at a finer scale and complement these previous papers by giving tail bounds in which V∗ and its fluctuations matter (cf. Theorem 5.1). As a result, we find in Theorem 3.2 that the sequence Nn(k) is essentially a deterministic function of V∗.

### 3.2. Proof outline

The general methodology in our proof is based on the study of degree distributions in preferential attachment random graphs, as in the book by Chung and Lu [4, Chapter 3]. However, a new phenomenon arises. In the graph setting, the total number of vertices at time n is usually linear in n (at least with high probability). By contrast, the analogue of the total number of vertices is Vn – the number of parts –, which is sublinear and not concentrated.

One consequence of this point in our analysis is that the martingale arguments are much more delicate, and rely on Freedman’s martingale inequality (cf. section 4), instead of the more usual (and less precise) Azuma-Hoeffding bound. Another point is that we must first obtain results on the number of parts Vn, which we do in section 5.

We then consider the random variables Nn(k). The general strategy is to write these variables in terms of “recursions + martingales" involving the variables for size k−1, and then observe how the “martingale" part concentrates. These first steps, which are taken in section 6, are similar to the analysis in [4, Chapter 3]. However, the results obtained are not directly employable to prove the main theorem. Section 7 then turns these arguments into actionable bounds. This leads to the proof of the main result in section 8.

## 4. Concentration inequalities

We recall here Freedman’s inequality and a particular corollary that will be important to our proofs.

###### Theorem 4.1 (Freedman’s Inequality [10]).

Let (Mn)n≥1 be a martingale with M1=0, adapted to a filtration (Fn)n≥1, and let R>0 be a constant. Write

 Wn:=n∑j=2E[(ΔMj)2|Fj−1].

Suppose

 |ΔMj|≤R,  for all j.

Then, for all λ>0 and σ2>0 we have

 P(Mn≥λ,Wn≤σ2)≤exp(−λ2/(2σ2+2Rλ/3)).

The lemma below is a straightforward consequence of Freedman’s inequality. Since we will frequently deal with the problem of bounding martingales under such constraints, it will be convenient to have this precise statement.

###### Lemma 4.1.

Let (Mn) be a martingale with M1=0 and R>0 a constant such that |ΔMj|≤R for all j, and let Wn be as in Theorem 4.1. Then for any constants λ,c1>0 we have

 P(|Mn|≥Rλ)≤2exp(−λ/(2c1+2/3))+P(Wn≥c1R2λ).
###### Proof.

It follows from the union bound and Freedman’s inequality applied to the martingales (Mn) and (−Mn):

 P(|Mn|≥Rλ) ≤P(|Mn|≥Rλ,Wn≤c1R2λ)+P(Wn≥c1R2λ) ≤2exp(−(Rλ)2/(2c1R2λ+(2R/3)(Rλ)))+P(Wn≥c1R2λ) =2exp(−λ/(2c1+2/3))+P(Wn≥c1R2λ).
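Freedman's inequality is easy to sanity-check by Monte Carlo on a toy martingale. The sketch below (ours, not from the paper) uses a ±R random walk, whose predictable quadratic variation is exactly nR², so Theorem 4.1 applies with σ²=nR²:

```python
import random
from math import exp

def walk(n, R, rng):
    """Symmetric +/-R random walk: a bounded-increment martingale with
    conditional quadratic variation W_n = n*R^2 deterministically."""
    m = 0.0
    for _ in range(n):
        m += R if rng.random() < 0.5 else -R
    return m

n, R, lam = 400, 1.0, 40.0
sig2 = n * R * R
rng = random.Random(1)
trials = 20_000
hits = sum(walk(n, R, rng) >= lam for _ in range(trials))
bound = exp(-lam**2 / (2 * sig2 + 2 * R * lam / 3))
print(hits / trials, bound)
```

The empirical frequency of {Mn≥λ} stays below the Freedman bound, as it must; for this walk the bound is loose, which is typical for exponential tail inequalities.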

## 5. Estimates on the number of parts

In this section we obtain results on the number of parts Vn of Pn. In particular, we prove Theorem 3.1 above.

In subsection 5.1 we prove a recurrence relation for Vn/ϕn. We use this in subsection 5.2 to derive concentration for the whole sequence. Finally, subsection 5.3 proves Theorem 3.1.

The following normalizing factor will appear in our proofs:

 (7) ϕn:=n−1∏j=1(1+α/(j+θ))=Γ(1+θ)/Γ(1+θ+α)⋅Γ(n+α+θ)/Γ(n+θ).

Note that by Lemma A.6 we have ϕn=Θ(nα).
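The product and Gamma-ratio forms of ϕn in (7) can be checked against each other numerically. A small sketch (ours), under illustrative parameter values:

```python
from math import lgamma, exp

def phi_product(n, alpha, theta):
    # prod_{j=1}^{n-1} (1 + alpha/(j+theta))
    p = 1.0
    for j in range(1, n):
        p *= 1.0 + alpha / (j + theta)
    return p

def phi_gamma(n, alpha, theta):
    # [Gamma(1+theta)/Gamma(1+theta+alpha)] * [Gamma(n+alpha+theta)/Gamma(n+theta)]
    return exp(lgamma(1 + theta) - lgamma(1 + theta + alpha)
               + lgamma(n + alpha + theta) - lgamma(n + theta))

alpha, theta = 0.7, 0.3
for n in (2, 10, 1000):
    assert abs(phi_product(n, alpha, theta) / phi_gamma(n, alpha, theta) - 1) < 1e-9
print("product and Gamma-ratio forms of phi_n agree")
```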

### 5.1. A recurrence relation

The first result in this section is the following Lemma.

###### Lemma 5.1 (Recurrence relation for Vn).

For all n≥m≥1 the following recurrence relation holds:

 (8) Vn/ϕn=Vm/ϕm+(Mn−Mm)+O(1)/(m+θ)α,

where (Mn)n≥1 is a martingale satisfying M1=0,

1. |ΔMj|≤2Γ(1+θ+α)/Γ(1+θ)⋅(j+θ)−α;

2. E[(ΔMj)2|Fj−1]≤2Γ(1+θ+α)α/Γ(1+θ)⋅(j+θ)−α−1(Vj−1+θ/α)/ϕj−1,

for all j≥2.

###### Proof.

Recall Vn=|Pn|. On the other hand, we also know that

 (9) P(Vn=Vn−1+1|Fn−1)=(αVn−1+θ)/(n−1+θ).

In other words, conditioned on Fn−1, the random variable ΔVn:=Vn−Vn−1 is distributed as Bernoulli((αVn−1+θ)/(n−1+θ)). In order to obtain a mean zero martingale, it will be useful to center the random variable ΔVn. Thus we may write Vn as

 (10) Vn=Vn−1+ΔVn=(1+α/(n−1+θ))Vn−1+(ΔVn−(αVn−1+θ)/(n−1+θ))+θ/(n−1+θ).

Thus, dividing the above identity by ϕn and recalling that ϕn=(1+α/(n−1+θ))ϕn−1, we obtain

 (11) Vn/ϕn=Vn−1/ϕn−1+ζn+θ/((n−1+θ)ϕn),

where

 (12) ζn:=(1/ϕn)(ΔVn−(αVn−1+θ)/(n−1+θ)).

Observe that

 (13) E[ζn|Fn−1]=0.

Iterating this argument n−m times leads to

 (14) Vn/ϕn=Vm/ϕm+(Mn−Mm)+(θn−θm),

where

 (15) Mn:=n∑j=2ζj and θn:=1+n−1∑j=1θ/((j+θ)ϕj+1).

Notice that identity (13) implies that (Mn) is a zero mean martingale.

Now we estimate the order of the deterministic contribution of θn−θm in identity (14). By Lemma A.6, the following upper bound holds:

 (16) 1/((j+θ)ϕj+1)<2Γ(1+θ+α)/Γ(1+θ)⋅1/(j+θ)1+α.

Thus, bounding the sum by an integral, we obtain

 (17) θn−θm=n−1∑j=mθ/((j+θ)ϕj+1)≤4Γ(1+θ+α)θ/(αΓ(1+θ))⋅1/(m+θ)α,

which proves the first statement of the lemma.

In the remainder of the proof we estimate the increments of the martingale (Mn) as well as its conditional quadratic variation. By the definition of ζj, recalling that |ΔVj−E[ΔVj|Fj−1]| is at most one and using Lemma A.6 again, we obtain

 (18) |ΔMj|≤1/ϕj≤2Γ(1+θ+α)/Γ(1+θ)⋅1/(j+θ)α

and also

 (19) E[(ΔMj)2|Fj−1]≤(αVj−1+θ)/((j−1+θ)ϕjϕj−1)≤2Γ(1+θ+α)α/Γ(1+θ)⋅(j+θ)−α−1(Vj−1+θ/α)/ϕj−1,

which proves the lemma. ∎
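Lemma 5.1 says that Vn/ϕn is a martingale up to a summable drift, so a single trajectory should settle down to a random limit. The sketch below (ours, with illustrative parameters and seed) tracks the number of parts v and the running product ϕ together:

```python
import random

alpha, theta = 0.5, 1.0
rng = random.Random(7)
v, phi = 1, 1.0
checkpoints = {10, 100, 1_000, 10_000, 100_000}
for m in range(1, 100_000):
    # a new part appears with probability (alpha*v + theta)/(m + theta)
    if rng.random() * (m + theta) < alpha * v + theta:
        v += 1
    phi *= 1.0 + alpha / (m + theta)   # phi is now phi_{m+1}
    if m + 1 in checkpoints:
        print(m + 1, v / phi)
ratio = v / phi
```

Successive printed values of v/phi change less and less, consistent with the a.s. convergence to V∗ in (5).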

### 5.2. Concentration and tail bounds

We combine the recurrence relation we have proven with Freedman’s inequality to obtain the following theorem.

###### Theorem 5.1.

In the GCRP(α,θ) there are constants cV>0 and A0>0 such that for all integers m≥1 and all A≥A0 we have

 P(supj≥m(Vj/ϕj−Vm/ϕm)≥A/(m+θ)α/2)≤exp(−cVA).

In particular, taking m=1 and adjusting constants, for all A≥A0 we have

 P[supj∈N(Vj/ϕj)≥A]≤exp(−cVA).
###### Proof.

We start with the particular case m=1 and then use it to prove the general result.

Case m=1. From Lemma 5.1 we know that Vn/ϕn may be written as a mean zero martingale plus a deterministic term θn, where (θn) is an increasing, positive and bounded sequence of real numbers. Thus, θn converges to some positive number θ∞. For a positive real number A, consider the following stopping time:

 (20) TA:=inf{i∈N:Vi/ϕi≥A+θi}.

Observe that

 (21) P(supj∈N(Vj/ϕj)≥A+θ∞)≤P(∃j∈N:Vj/ϕj≥A+θj)=limnP(VTA∧n/ϕTA∧n≥A+θTA∧n)=limnP(MTA∧n≥A).

By the above inequality, the first case is proven if we obtain a proper upper bound for the tail of the stopped martingale MTA∧n. We will do this via Lemma 4.1, which requires bounds on the increments and quadratic variation of MTA∧n. We obtain these bounds in the next lines. For the increments, a direct application of Lemma 5.1 gives us

 |MTA∧(j+1)−MTA∧j|≤R,

where R:=2Γ(1+θ+α)/(Γ(1+θ)(1+θ)α), by (18). For the quadratic variation we have that, also by Lemma 5.1,

 Wn∧TA ≤n∧TA∑j=22Γ(1+θ+α)α/Γ(1+θ)⋅(j+θ)−α−1(Vj−1+θ/α)/ϕj−1 (22) ≤n∧TA∑j=22Γ(1+θ+α)α/Γ(1+θ)⋅(j+θ)−α−1(A+θj+θ/(αϕj−1)).

Choosing A≥K(α,θ), which is defined below:

 (23) K(α,θ):=θ/α+supj∈N{θj};

in (22), we obtain

 Wn∧TA ≤4Γ(1+θ+α)/Γ(1+θ)⋅(1+θ)−α⋅A=2R⋅A.

Finally, applying Lemma 4.1 with λ:=A/R and

 c1:=2R=4Γ(1+θ+α)/(Γ(1+θ)(1+θ)α),

we obtain

 (24) P(MTA∧n≥A)≤2exp(−(A/R)/(2c1+2/3)),

and, enlarging A0 if needed to absorb the factor 2,

 (25) P(MTA∧n≥A)≤exp(−c2A),

for

 (26) c2:=(2R(2c1+2/3))−1.

The above inequality combined with (21) gives us

 P(supm∈N(Vm/ϕm)≥A+θ∞)≤exp(−c2A),

proving the result for m=1, after adjusting constants to absorb θ∞.

Case m>1. The proof of this case is similar to the first one, but it requires another stopping time and the case m=1 itself. So, consider the following stopping time:

 ^TB:=inf{j≥m:Vj/ϕj−Vm/ϕm≥B}=inf{j≥m:(Mj−Mm)+(θj−θm)≥B}.

Observe that, as shown in the proof of Lemma 5.1,

 (27) θn−θm≤4Γ(1+θ+α)θ/(αΓ(1+θ))⋅1/(m+θ)α.

Now, let B:=A/(m+θ)α/2 and suppose A is large enough that the right-hand side of (27) is at most B/2. Thus,

 P(supj≥m(Vj/ϕj−Vm/ϕm)≥A/(m+θ)α/2) ≤P(∃j≥m:(Mj−Mm)+(θj−θm)≥A/(m+θ)α/2) =limnP((M^TB∧n−Mm)+(θ^TB∧n−θm)≥B) (use that θ^TB∧n−θm≤B/2) ≤limnP(M^TB∧n−Mm≥A/(2(m+θ)α/2)).

Let TA be the same stopping time as defined in (20). Then:

 P(M^TB∧n−Mm≥A/(2(m+θ)α/2)) ≤P(M^TB∧n−Mm≥A/(2(m+θ)α/2),TA≥n)+P(TA<n).

As in the case m=1, by Lemma 5.1, the increments of the stopped martingale satisfy the following upper bound:

 |(M(j+1)∧^TB∧TA−Mm)−(Mj∧^TB∧TA−Mm)|≤2Γ(1+θ+α)/Γ(1+θ)⋅(m+θ)−α,

and, for the quadratic variation,

 Wn∧TA≤4Γ(1+θ+α)/Γ(1+θ)⋅(m+θ)−α⋅A.

Thus, again by Lemma 4.1 it follows that

 P(M^TB∧TA∧n−Mm≥A/(2(m+θ)α/2))≤exp(−c3A),

for some constant c3>0, which implies

 P(supj≥m(Vj/ϕj−Vm/ϕm)≥A/(m+θ)α/2) ≤exp(−c2A)+exp(−c3A)≤exp(−cVA),

for a suitable constant cV>0. ∎

### 5.3. Proof of Theorem 3.1

A consequence of Theorem 5.1 is an estimate of how large the deviation of Vn/ϕn from its limit V∗ can be, uniformly in time.

###### Proof of Theorem 3.1.

Given δ∈(0,1), define

 (28) δj=δ/((j+1)(j+2)).

Let Ej denote the following event:

 (29) Ej:={∀m≥2^j:|Vm/ϕm−V2^j/ϕ2^j|≤(log(1/δj)/cV)(2^j+θ)−α/2}.

Assuming δ is small enough that log(1/δj)/cV≥A0 for all j, we have by Theorem 5.1

 P(Ecj) ≤exp(−log(1/δj))=δ/((j+1)(j+2)),

which implies, by a union bound,

 P(⋂j≥0Ej) ≥1−∑j≥0P(Ecj) ≥1−∑j≥0δj ≥1−δ.

Now, observe that, when ⋂j≥0Ej occurs, we have for all 2^j≤m<2^{j+1}:

 |Vm/ϕm−V∗| ≤|Vm/ϕm−V2^j/ϕ2^j|+|V2^j/ϕ2^j−V∗| ≤2supm≥2^j|Vm/ϕm−V2^j/ϕ2^j| ≤(2/cV)log(1/δj)(2^j+θ)−α/2 ≤(1/cV)(2^j+θ)−α/2[4log(j+2)+2log(1/δ)],

and since m<2^{j+1} implies (2^j+θ)−α/2≤2α/2(m+θ)−α/2, it follows that

 |Vm/ϕm−V∗| ≤(32/cV)(m+θ)−α/2[loglog(m+2)+log(1/δ)],

for any such m. To finish, take c∗:=32/cV. ∎

## 6. Preliminary estimates for the number of parts of size k

This section is devoted to giving estimates for the number of parts with a fixed number k of elements at time n, namely Nn(k). As in the case of Vn, we investigate the behaviour of Nn(k) properly normalized. In this direction, we let ψn(k) be the normalization factor for Nn(k) given by the expression below:

 (30) ψn(k):=n−1∏j=k(1−(k−α)/(j+θ))=Γ(k+θ)Γ(n−k+α+θ)/(Γ(α+θ)Γ(n+θ)).

We note that, for each fixed k, ψn(k)=Θ(nα−k). The proof of this fact may be done similarly to the one given for ϕn. We also let Xn(k) be

 (31) Xn(k):=Nn(k)/ψn(k).

The first step in the analysis of the non-asymptotic behavior of Nn(k) is to prove that Xn(k) also satisfies a recurrence relation (Subsection 6.1). We then present a martingale concentration argument that will be useful in analyzing the recurrence (Subsection 6.2). Subsequent sections will use these results to give upper and lower bounds on Nn(k).
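As with ϕn, the product and Gamma-ratio forms of ψn(k) can be checked against each other numerically; the sketch below (ours) reads the product in (30) as running over j=k,…,n−1, which is the reading consistent with the Gamma-ratio expression:

```python
from math import lgamma, exp

def psi_product(n, k, alpha, theta):
    # psi_n(k) = prod_{j=k}^{n-1} (1 - (k - alpha)/(j + theta))
    p = 1.0
    for j in range(k, n):
        p *= 1.0 - (k - alpha) / (j + theta)
    return p

def psi_gamma(n, k, alpha, theta):
    # = Gamma(k+theta)*Gamma(n-k+alpha+theta) / (Gamma(alpha+theta)*Gamma(n+theta))
    return exp(lgamma(k + theta) + lgamma(n - k + alpha + theta)
               - lgamma(alpha + theta) - lgamma(n + theta))

alpha, theta = 0.5, 1.0
for n, k in ((50, 1), (200, 3), (1000, 10)):
    assert abs(psi_product(n, k, alpha, theta) / psi_gamma(n, k, alpha, theta) - 1) < 1e-9
print("product and Gamma-ratio forms of psi_n(k) agree")
```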

### 6.1. Recurrence relation for Xn(k)

The goal of this part is to derive a recurrence relation for Xn(k). The proof is essentially the same as the one we have given for Vn/ϕn.

###### Lemma 6.1.

For all n≥k≥1 the sequence (Xn(k))n≥k satisfies

 (32) Xn(1)=Mn(1)+n−1∑j=1αVj/((j+θ)ψj+1(1))+θn(1); (33) Xn(k)=Mn(k)+Xk(k)+((k−1−α)/(k−1+θ))n−1∑j=kXj(k−1),∀k>1,

where Mn(1),Mn(k) are zero mean martingales defined in (37) and (40), respectively, and

 (34) θn(1):=N1(1)+n−1∑j=1θ/((j+θ)ψj+1(1)).
###### Proof.

We treat the case k=1 separately since Nn(1) satisfies a recurrence relation slightly different from the other cases. However, the proofs in both cases follow the recipe given by the proof of Lemma 5.1, so we do not fill in all the details here.

Case . Note th