# Exchangeable Trait Allocations

Trevor Campbell, Computer Science and Artificial Intelligence Laboratory (CSAIL), Massachusetts Institute of Technology

Diana Cai, Department of Computer Science, Princeton University

Tamara Broderick, Computer Science and Artificial Intelligence Laboratory (CSAIL), Massachusetts Institute of Technology

###### Abstract.

Trait allocations are a class of combinatorial structures in which data may belong to multiple groups and may have different levels of belonging in each group. Often the data are also exchangeable, i.e., their joint distribution is invariant to reordering. In clustering—a special case of trait allocation—exchangeability implies the existence of both a de Finetti representation and an exchangeable partition probability function (EPPF), distributional representations useful for computational and theoretical purposes. In this work, we develop the analogous de Finetti representation and exchangeable trait probability function (ETPF) for trait allocations, along with a characterization of all trait allocations with an ETPF. Unlike previous feature allocation characterizations, our proofs fully capture single-occurrence “dust” groups. We further introduce a novel constrained version of the ETPF that we use to establish an intuitive connection between the probability functions for clustering, feature allocations, and trait allocations. As an application of our general theory, we characterize the distribution of all edge-exchangeable graphs, a class of recently-developed models that captures realistic sparse graph sequences.

## 1. Introduction

Representation theorems for exchangeable random variables are a ubiquitous and powerful tool in Bayesian modeling and inference. In many data analysis problems, we impose an order, or indexing, on our data points. This indexing can arise naturally—if we are truly observing data in a sequence—or can be artificially created to allow their storage in a database. In this context, exchangeability expresses the assumption that this order is arbitrary and should not affect our analysis. For instance, we often assume a sequence of data points is an infinite exchangeable sequence, i.e., that the distribution of any finite subsequence is invariant to reordering. Though this assumption may seem weak, de Finetti’s theorem (de Finetti, 1931; Hewitt and Savage, 1955) tells us that in this case, we can assume that a latent parameter exists, that our data are independent and identically distributed (i.i.d.) conditional on this parameter, and that the parameter itself has a distribution. Thus, de Finetti’s theorem may be seen as a justification for a Bayesian model and prior—and, in fact, for the infinite-dimensional priors provided by Bayesian nonparametrics (Jordan, 2010).

De Finetti-style representation theorems have provided many other useful insights for modeling and inference within Bayesian analysis. For example, consider clustering problems, where the inferential goal is to assign data points to mutually exclusive and exhaustive groups. It is typical to assume that the distribution of the clustering (i.e., the assignment of data points to clusters) is invariant to the ordering of the data points. In this case, two different representation theorems have proved particularly useful in practice. First, Kingman (1978) showed that exchangeability in clustering implies the existence of a latent set of probabilities (known as the “Kingman paintbox”) from which cluster assignments are chosen i.i.d. It is straightforward to show from the Kingman paintbox representation that exchangeable clustering models enforce linear growth in cluster size as a function of the size of the total data. By contrast, many real-world clustering problems, such as disambiguating census data or clustering academic papers by originating lab, exhibit sublinear growth in cluster size (e.g., Wallach et al., 2010; Broderick and Steorts, 2014; Miller et al., 2016). Thus, the Kingman paintbox representation allows us to see that exchangeable clustering models are misspecified for these examples. Second, Pitman (1995) showed that clustering exchangeability is equivalent to the existence of an exchangeable partition probability function (EPPF). The EPPF and similar developments have led to algorithms that allow practical inference specifically with the Dirichlet process mixture (Escobar, 1994; Escobar and West, 1995) and more generally in other clustering models (Pitman and Yor, 1997; Ishwaran and James, 2001, 2003; Lee et al., 2013).

In this work, we develop and characterize a generalization of clustering models that we call trait allocation models. Trait allocations apply when data may belong to more than one group (a trait), and may exhibit nonnegative integer levels of belonging in each group. For example, a document might exhibit multiple words in a number of topics, a participant in a social network might send multiple messages to each of her friend groups, or a DNA sequence might exhibit different numbers of genes from different ancestral populations. Trait allocations generalize both clustering, where data must belong to exactly one group, and feature allocations (Griffiths and Ghahramani, 2005; Broderick et al., 2013), where data exhibit binary membership in multiple groups. Authors have recently proposed a number of models for trait allocations (e.g., Titsias, 2008; Zhou et al., 2012; Zhou, 2014; James, 2017; Broderick et al., 2015; Roychowdhury and Kulis, 2015). But as of yet, there is no characterization either of the class of exchangeable trait allocation models or of classes of exchangeable trait allocation models that are particularly amenable to inference. The consequences of the exchangeability assumption in this setting have not been explored. In this work, we provide characterizations of both the full class of exchangeable trait allocations and those with EPPF-like probability distributions. This work not only unifies and generalizes past research on partitions and feature allocations, but provides a natural avenue for the study of other practical exchangeable combinatorial structures.

We begin by formally defining trait allocations, random sequences thereof, and exchangeability in Section 2. In Section 3, we introduce ordered trait allocations via the lexicographic ordering. We use these constructions to establish a de Finetti representation for exchangeable trait allocations in Section 4 that is analogous to the Kingman paintbox representation for clustering. Our new representation handles dust, the case where some traits may appear for just a single data point. This work therefore also extends the previous characterization of exchangeable feature allocations, which was restricted to the dustless case (Broderick et al., 2013), to the fully general case. In Section 5, we develop an EPPF-like function to describe distributions over exchangeable trait allocations and characterize the class of trait allocations to which it applies. We call these functions exchangeable trait probability functions (ETPFs). Just as in the partition and feature allocation cases, the random trait allocations with probability functions form a class that is particularly amenable to approximate posterior inference in practice, and therefore of particularly pressing interest to characterize. Also in Section 5, we introduce new concepts we call constrained ETPFs, which are the combinatorial analogue of earlier work on restricted nonparametric processes (Williamson et al., 2013; Doshi-Velez and Williamson, 2017). In Sections 5 and 6, we show how constrained ETPFs capture earlier probability functions for numerous exchangeable models within a single framework. In Section 6, we apply both our de Finetti representation and constrained ETPF to characterize edge-exchangeable graphs, a recently developed form of exchangeability for graph models that allows sparse projective sequences of graphs (Broderick and Cai, 2015; Crane and Dempsey, 2015; Cai et al., 2016; Crane and Dempsey, 2016a; Williamson, 2016).
A similar representation generalizing partitions and edge-exchangeable (hyper)graphs has been studied in concurrent work (Crane and Dempsey, 2016b) on relational exchangeability, first introduced by Ackerman (2015) and Crane and Towsner (2015). Here we additionally explore the existence of a trait frequency model, the existence of a constrained trait frequency model and its connection to clustering and feature allocations, and the various connections between frequency models and probability functions.

### 1.1. Notation and conventions

Definitions are denoted by the symbol . The natural numbers are denoted and the nonnegative reals . We let for any . Sequences are denoted with parentheses, with indices suppressed only if they are clear from context. For example, is the sequence and is the sequence , while is the sequence with fixed. The notation means is a (not necessarily proper) subset of . The indicator function is denoted ; for example, is 1 if , and 0 otherwise. For any multiset of elements in a set , we denote to be the multiplicity of in for each . Two multisets of are said to be equal, denoted , if the multiplicity of all elements are equal in both and , i.e. . For any finite or infinite sequence, we use subscript to denote the element in the sequence. For sequences of (multi)sets, if is beyond the end of the sequence, the subscript operation returns the empty set. Equality in distribution and almost surely are denoted , and convergence almost surely/in probability/in distribution is denoted . We often use cycle notation for permutations (see Dummit and Foote (2004, p. 29)): for example, is the permutation with , , , , and for . We use the notation to denote sampling from the categorical distribution on with probabilities for . The symbol for a sequence of sets denotes , their infinite product space.

## 2. Trait allocations

We begin by formalizing the concepts of a trait and trait allocation. We assume that our sequence of data points is indexed by $\mathbb{N}$. As a running example for intuition, consider the case where each data point is a document, and each trait is a topic. Each document may have multiple words that belong to each topic. The degree of membership of the document in the topic is the number of words in that topic. We wish to capture the assignment of data points to the traits they express but in a way that does not depend on the type of data at hand. Therefore, we focus on the indices to the data points. This leads to the definition of traits as multisets of the data indices, i.e., the natural numbers. For example, $\{1, 1, 3\}$ is a trait in which the datum at index 1 has multiplicity 2, and the datum at index 3 has unit multiplicity. In our running example, this trait might represent the topic about sports; the first document has two sports words, and the third document has one sports word.

###### Definition 2.1.

A trait is a finite, nonempty multiset of $\mathbb{N}$.

Let the set of all traits be denoted . A single trait is not sufficient to capture the combinatorial structure underlying the first data in the sequence: each datum may be a member of multiple traits (with varying degrees of membership). The traits have no inherent order just as the topics “sports”, “arts”, and “science” have no inherent order. And each document may contain words from multiple topics. Building from Definition 2.1 and motivated by these desiderata, we define a finite trait allocation as a finite multiset of traits. For example, represents a collection of traits expressed by the first data points in a sequence. In this case, index 1 is a member of two traits, index 2 is a member of none, and so on. Throughout, we assume that each datum at index , belongs to only finitely many latent traits. Further, for a data set of size , any index should not belong to any trait; the allocation represents traits expressed by only the first data. These statements are formalized in Definition 2.2.

###### Definition 2.2.

A trait allocation of $[N]$ is a multiset $t_N$ of traits, where

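$$\forall n \in \mathbb{N} : n \le N, \quad \sum_{\omega \in \mathbb{T}} t_N(\omega) \cdot \omega(n) < \infty \tag{2.1}$$

$$\forall n \in \mathbb{N} : n > N, \quad \sum_{\omega \in \mathbb{T}} t_N(\omega) \cdot \omega(n) = 0. \tag{2.2}$$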

Let be the set of trait allocations of , and define to be the set of all finite trait allocations, . Two notable special cases of finite trait allocations that have appeared in past work are feature allocations (Griffiths and Ghahramani, 2005; Broderick et al., 2013) and partitions (Kingman, 1978; Pitman, 1995). Feature allocations are the natural combinatorial structure underlying feature learning, where each datum expresses each trait with multiplicity at most 1. For example, is a feature allocation of . Note that each index may be a member of multiple traits. Partitions are the natural combinatorial structure underlying clustering, where the traits form a partition of the indices. For example, is a partition of , since its traits are disjoint and their union is . The theory in the remainder of the paper will be applied to recover past results for these structures as corollaries.
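Definitions 2.1 and 2.2, together with the two special cases above, can be checked mechanically. The following is a minimal sketch, assuming a hypothetical encoding that is not part of the paper: a trait is a sorted tuple of indices with repetition, and a finite trait allocation is a `Counter` over such tuples. The function names and the example allocations are illustrative only.

```python
from collections import Counter

def is_trait(tau):
    """Definition 2.1: a finite, nonempty multiset of natural numbers.

    Assumed encoding: a sorted tuple of positive ints, repeated per multiplicity.
    """
    return (
        len(tau) > 0
        and all(isinstance(n, int) and n >= 1 for n in tau)
        and tuple(sorted(tau)) == tuple(tau)
    )

def is_trait_allocation(t, N):
    """Definition 2.2: a multiset of traits that uses no index greater than N.

    Condition (2.1) holds automatically because a Counter is finite.
    """
    return all(
        mult > 0 and is_trait(tau) and max(tau) <= N
        for tau, mult in t.items()
    )

def is_feature_allocation(t, N):
    """Each index has multiplicity at most 1 within every trait."""
    return is_trait_allocation(t, N) and all(len(set(tau)) == len(tau) for tau in t)

def is_partition(t, N):
    """The traits are disjoint and their union is {1, ..., N}."""
    if not is_feature_allocation(t, N):
        return False
    cover = Counter()
    for tau, mult in t.items():
        for n in tau:
            cover[n] += mult
    return all(cover[n] == 1 for n in range(1, N + 1))

# Hypothetical examples (not the ones elided from the text).
sports = (1, 1, 3)                                # index 1 twice, index 3 once
t4 = Counter({sports: 1, (1,): 1, (3, 4): 2})     # a trait allocation of [4]
f3 = Counter({(1, 2): 1, (1, 3): 1})              # a feature allocation of [3]
p3 = Counter({(1, 3): 1, (2,): 1})                # a partition of [3]
```

Using a `Counter` keeps the allocation unordered and lets the same trait appear more than once, which matches the multiset-of-multisets structure in Definition 2.2.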

Up until this point, we have dealt solely with finite sequences of data. However, in many data analysis problems, it is more natural (or at least an acceptable simplifying approximation) to treat the observed sequence of data as the beginning of an infinite sequence. As each datum arrives, it adds its own index to the traits it expresses, and in the process introduces any previously uninstantiated traits. For example, if after 3 observations we have , then observing the next might yield . Note that when an index is introduced, none of the earlier indices’ memberships to traits are modified; the sequence of finite trait allocations is consistent. To make this rigorous, we define the restriction of a trait (allocation), which allows us to relate two trait allocations of differing and . The restriction operator —provided by Definition 2.3 and acting on either traits or finite trait allocations—removes all indices greater than from all traits, and does not modify the multiplicity of indices less than or equal to . If any trait becomes empty in this process, it is removed from the allocation. For example, . Two trait allocations are said to be consistent, per Definition 2.4, if one can be restricted to recover the other. Thus, and are consistent finite trait allocations.

###### Definition 2.3.

The restriction of a trait $\tau$ to $M \in \mathbb{N}$ is defined as

$$\tau|_M(m) := \begin{cases} \tau(m) & m \le M \\ 0 & m > M, \end{cases} \tag{2.3}$$

and is overloaded for finite trait allocations as

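$$t_N|_M(\tau) := \begin{cases} \sum_{\omega \in \mathbb{T}} \mathds{1}(\omega|_M = \tau) \cdot t_N(\omega) & \tau \ne \emptyset \\ 0 & \tau = \emptyset. \end{cases} \tag{2.4}$$

###### Definition 2.4.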

A pair of trait allocations $t_N$ of $[N]$ and $t_M$ of $[M]$ with $M \le N$ is said to be consistent if $t_N|_M = t_M$.
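Restriction and consistency (Definitions 2.3 and 2.4) translate directly into code. The sketch below assumes the same hypothetical encoding as a sorted tuple per trait and a `Counter` per allocation; the function names and example allocations are illustrative and not taken from the paper.

```python
from collections import Counter

def restrict_trait(tau, M):
    """Eq. (2.3): drop every copy of each index greater than M."""
    return tuple(n for n in tau if n <= M)

def restrict(t, M):
    """Eq. (2.4): restrict every trait, merge equal results, drop empty traits."""
    out = Counter()
    for tau, mult in t.items():
        r = restrict_trait(tau, M)
        if r:  # traits emptied by the restriction are removed
            out[r] += mult
    return out

def consistent(t_M, t_N, M):
    """Definition 2.4: restricting t_N to M recovers t_M."""
    return restrict(t_N, M) == t_M

# Hypothetical trait allocation of [4] and its restriction to [3].
t4 = Counter({(1, 1, 3): 1, (2, 4): 1, (4,): 1})
t3 = restrict(t4, 3)   # the trait (4,) becomes empty and is dropped
```

Note that distinct traits can collapse to the same trait under restriction, in which case their multiplicities add, exactly as the sum in Eq. (2.4) prescribes.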

The consistency of two finite trait allocations allows us to define the notion of a consistent sequence of trait allocations. Such a sequence can be thought of as generated by the sequential process of data arriving; each data point adds its index to its assigned traits without modifying any previous index. For example, is a valid beginning to an infinite sequence of trait allocations. The first datum expresses two traits with multiplicity 1, and the second and third each express a single one of those traits with multiplicity 1. As a counterexample, is not a valid trait allocation sequence, as the third trait allocation is not consistent with either the first or second. This sequence does not correspond to building up the traits expressed by data in a sequence; when the third datum is observed, the traits expressed by the first are modified.

###### Definition 2.5.

An infinite trait allocation $t_\infty := (t_N)$ is a sequence of trait allocations $t_N$ of $[N]$, $N \in \mathbb{N}$, for which

$$\forall N \in \mathbb{N}, \quad t_{N+1}|_N = t_N. \tag{2.5}$$

Note that since restriction is commutative ( for ), Definition 2.5 implies that all pairs of elements of the sequence are consistent. Restriction acts on infinite trait allocations in a straightforward way: given , restriction to is equivalent to the corresponding projection, .

Denote the set of all infinite trait allocations . Recall that the motivation for developing infinite trait allocations is to capture the latent combinatorial structure underlying a sequence of observed data. Since this sequence is random, its underlying structure may also be, and thus the next task is to develop a corresponding notion of a random infinite trait allocation. Given a sequence of probability spaces for with consistent measures , i.e.

$$\forall N \in \mathbb{N}, \quad \nu_N(t_N) = \sum_{t_{N+1} \in \mathcal{T}_{N+1}} \mathds{1}(t_{N+1}|_N = t_N) \cdot \nu_{N+1}(t_{N+1}), \tag{2.6}$$

the Kolmogorov extension theorem (Kallenberg, 1997, Theorem 5.16) guarantees the existence of a unique random infinite trait allocation that satisfies a.s. and has finite marginal distributions equal to the induced by restriction, i.e.

$$\forall N \in \mathbb{N}, \quad T_\infty|_N \sim \nu_N. \tag{2.7}$$

The properties of the random infinite trait allocation are intimately related to those of the observed sequence of data it represents. In many applications, the data sequence has the property that its distribution is invariant to finite permutation of its elements; in some sense, the order in which the data sequence is observed is immaterial. We expect the random infinite trait allocation associated with such an infinite exchangeable sequence to inherit a similar property (for an introduction to exchangeability and related theory, see Aldous (1985)). As a simple illustration of the extension of permutation to infinite trait allocations, suppose we observe the sequence of data exhibiting trait allocation sequence , , , and so on. If we swap and in the data sequence, resulting in the new sequence , the traits expressed by become those containing index , the traits for become those containing index , and the rest are unchanged. Therefore, the permuted infinite trait allocation is , , , and so on. Note that (resp. ) is equal to the restriction to (resp. ) of with permuted indices, while for is with its indices permuted. This demonstrates a crucial point: if the permutation affects only indices up to $M$ (there is always such an $M$ for finite permutations), we can arrive at the sequence of trait allocations for the permuted data sequence in two steps. First, we permute the indices in $t_M$ and then restrict to each $N \le M$ to get the first $M$ permuted finite trait allocations. Then we permute the indices in $t_N$ for each $N > M$.

To make this observation precise, we let be a finite permutation of the natural numbers, i.e.,

$$\pi : \mathbb{N} \to \mathbb{N}, \quad \pi \text{ is a bijection}, \quad \exists M \in \mathbb{N} : \forall m > M, \ \pi(m) = m, \tag{2.8}$$

and overload its notation to operate on traits and (in)finite trait allocations in Definition 2.6. Note that if is a finite permutation, its inverse is also a finite permutation with the same value of for which implies . Intuitively, operates on traits and finite trait allocations by permuting their indices. For example, if has the cycle and fixes all indices greater than 3, then .

###### Definition 2.6.

Given a finite permutation $\pi$ of the natural numbers that fixes all indices $m > M$, the permutation of a trait $\tau$ under $\pi$ is defined as

$$\pi\tau(m) := \tau(\pi^{-1}(m)), \tag{2.9}$$

the permutation of a trait allocation $t_N$ of $[N]$ under $\pi$ is defined as

$$\pi t_N(\tau) := t_N(\pi^{-1}\tau), \tag{2.10}$$

and the permutation of an infinite trait allocation $t_\infty$ under $\pi$ is defined as

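$$\pi t_\infty := \left( (\pi t_M)|_1, \ \dots, \ (\pi t_M)|_{M-1}, \ \pi t_M, \ \pi t_{M+1}, \ \pi t_{M+2}, \ \dots \right). \tag{2.11}$$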

As discussed above, the definition for infinite trait allocations ensures that the permuted infinite trait allocation is a consistent sequence that corresponds to rearranging the observed data sequence with the same permutation. Definition 2.6 provides the necessary framework for studying infinite exchangeable trait allocations, defined as random infinite trait allocations whose distributions are invariant to finite permutation.
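The two-step recipe for permuting an infinite trait allocation (permute the allocation at the largest moved index and restrict it downward, then permute every later allocation directly) can be sketched on a finite prefix. This is an illustration under assumptions: traits are sorted tuples, allocations are `Counter`s, the permutation is a hypothetical dict listing only the moved indices, and `permute_prefix` is an invented helper name.

```python
from collections import Counter

def permute_trait(pi, tau):
    """Eq. (2.9): index n with multiplicity tau(n) moves to index pi(n)."""
    return tuple(sorted(pi.get(n, n) for n in tau))

def permute_allocation(pi, t):
    """Eq. (2.10): apply the permutation to every trait of the allocation."""
    out = Counter()
    for tau, mult in t.items():
        out[permute_trait(pi, tau)] += mult
    return out

def restrict(t, M):
    """Remove indices above M and drop traits that become empty."""
    out = Counter()
    for tau, mult in t.items():
        r = tuple(n for n in tau if n <= M)
        if r:
            out[r] += mult
    return out

def permute_prefix(pi, ts):
    """Permute the prefix ts = [t_1, ..., t_L] of an infinite trait allocation.

    pi holds only the moved indices, so it fixes every index above
    M = max(pi). The prefix must reach at least t_M.
    """
    M = max(pi, default=1)
    if len(ts) < M:
        raise ValueError("need the prefix up to t_M")
    pt_M = permute_allocation(pi, ts[M - 1])
    head = [restrict(pt_M, N) for N in range(1, M)]          # entries 1, ..., M-1
    tail = [permute_allocation(pi, t) for t in ts[M - 1:]]   # entries M, M+1, ...
    return head + tail

# Hypothetical example: swap data points 1 and 3.
t3 = Counter({(1, 1, 3): 1, (2,): 1})
ts = [restrict(t3, 1), restrict(t3, 2), t3]
swapped = permute_prefix({1: 3, 3: 1}, ts)
```

Restricting the permuted $t_M$, rather than permuting each shorter allocation separately, is what keeps the permuted sequence consistent: the shorter allocations do not yet contain the indices that the permutation moves into their range.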

###### Definition 2.7.

An infinite exchangeable trait allocation, $T_\infty$, is a random infinite trait allocation such that for any finite permutation $\pi$,

$$\pi T_\infty \overset{d}{=} T_\infty. \tag{2.12}$$

Note that if the random infinite trait allocation is a random infinite partition/feature allocation almost surely, the notion of exchangeability in Definition 2.7 reduces to earlier notions of exchangeability for random infinite partition/feature allocations (Kingman, 1978; Aldous, 1985; Broderick et al., 2013). Exchangeability also has an analogous definition for random finite trait allocations, though this is of less interest in the present work.

As a concrete example, consider the countable set of sequences of nonnegative integers such that . For each data index, we will generate an element of and use it to represent a sequence of multiplicities in an ordered sequence of traits. In particular, we endow with probabilities for each . We start from an empty ordered trait allocation. Then for each data index , we sample a sequence ; and for each , we add index to trait with multiplicity . The final trait allocation is the unordered collection of nonempty traits. Since each data index generates its membership in the traits i.i.d. conditioned on , the sequence of trait allocations is exchangeable. This process is depicted in Fig. 1. As we will show in Section 4, all infinite exchangeable trait allocations have a similar construction.
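The generative process in this example can be simulated directly. The sketch below is illustrative only: it assumes a finitely supported distribution over multiplicity sequences (the paper allows a countable support), represents each sequence as a tuple, and uses the invented name `sample_trait_allocation`.

```python
import random
from collections import Counter

def sample_trait_allocation(support, probs, N, seed=0):
    """Draw one multiplicity sequence per data index, i.i.d., and collect traits.

    support: list of tuples of nonnegative ints (one multiplicity per ordered trait)
    probs:   matching list of probabilities (assumed to sum to 1)
    """
    rng = random.Random(seed)
    width = max(len(xi) for xi in support)
    ordered = [[] for _ in range(width)]   # ordered traits, initially empty
    for n in range(1, N + 1):
        xi = rng.choices(support, weights=probs, k=1)[0]
        for k, mult in enumerate(xi):
            ordered[k].extend([n] * mult)  # index n joins trait k with multiplicity xi[k]
    # Forget the order and drop the traits that stayed empty.
    return Counter(tuple(trait) for trait in ordered if trait)

# Hypothetical three-point distribution over multiplicity sequences.
support = [(1, 0), (0, 2), (1, 1)]
probs = [0.5, 0.3, 0.2]
t20 = sample_trait_allocation(support, probs, N=20, seed=1)
```

Because every index draws its sequence from the same distribution independently of the others, relabeling the indices does not change the distribution of the resulting allocation, which is the exchangeability property claimed in the text.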

## 3. Ordered trait allocations and lexicographic ordering

We impose no inherent ordering on the traits in a finite trait allocation via the use of (multi)sets; the allocations and are identical. This correctly captures our lack of a preferred trait order in many data analysis problems. However, ordered trait allocations are nonetheless often useful from standpoints both practical—such as when we need to store a finite trait allocation in an array in physical memory—and theoretical—such as in developing the characterization of all infinite exchangeable trait allocations in Section 4.

A primary concern in the development of an ordering scheme is consistency. Intuitively, as we observe more data in the sequence, we want the sequence of finite ordered trait allocations to “grow” but not be “shuffled”; in other words, if two finite trait allocations are consistent, the traits in their ordered counterparts at the same index should each be consistent. For partitions, this task is straightforward: each trait receives as a label its lowest index (Aldous, 1985), and the labels are used to order the traits. This is known as the order-of-appearance labeling, as traits are labeled in the order in which they are instantiated by data in the sequence. For example, in the partition of , would receive label 1 and would receive label 2, so would be before in the order. Restricting these traits will never change their order; for instance, and , which still each receive label 1 and 2, respectively. If a restriction leaves a trait empty, it is removed and does not interfere with any traits of a lower label. For finite feature allocations, this ordering is inapplicable, since multiple features may have a common lowest index. Instead, Griffiths and Ghahramani (2005) introduce a left-ordered form in which one feature precedes another if it contains an index that the other does not, and all lower indices have the same membership in both features. For example, precedes in this ordering, since the traits both have index , but only the first has index . (Other past work (Broderick et al., 2013) uses auxiliary randomness to order features, but this technique does not guarantee that orderings of two consistent finite trait allocations are themselves consistent.) In this section, we show that the well-known lexicographic ordering, which generalizes these previous orderings for partitions and feature allocations, satisfies our desiderata for an ordering on traits. We begin by defining ordered trait allocations.

###### Definition 3.1.

An ordered trait allocation $\ell_N$ of $[N]$ is a sequence $\ell_N = (\ell_{Nk})_{k=1}^{K}$, $K \in \mathbb{N} \cup \{0\}$, of traits such that no trait contains an index $n > N$.

Let be the set of ordered trait allocations of , and let be the set of all ordered finite trait allocations. As in the case of unordered trait allocations, the notion of consistency is intimately tied to that of restriction. We again require that restriction to removes all indices , and removes all traits rendered empty by that process. However, we also require that the order of the remaining traits is preserved: for example, if , the restriction of to should yield , not . Definition 3.2 satisfies these desiderata, overloading the function again for notational brevity.

###### Definition 3.2.

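The restriction of an ordered finite trait allocation $\ell_N = (\ell_{Nk})_{k=1}^{K}$ to $M \in \mathbb{N}$ is defined as

$$\ell_N|_M := \mathrm{filter}\left( (\ell_{Nk}|_M)_{k=1}^{K} \right), \tag{3.1}$$

where the function $\mathrm{filter}(\cdot)$ removes any empty sets from a sequence while preserving the order of the nonempty sets.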

In the example above, the basic restriction of to would yield , which the filter function then processes to form , as desired. Analogously to the unordered case, we say two ordered trait allocations , , of , , with , are consistent if , and define the set of infinite ordered trait allocations as the set of infinite sequences of ordered finite trait allocations with .

Given these definitions, we are now ready to make the earlier intuitive notion of a consistent trait ordering scheme precise. Definition 3.3 states that a function must satisfy two conditions to be a valid trait ordering. The first condition enforces that a trait ordering does not add, remove, or modify the traits in the finite trait allocation ; this implies that trait orderings are injective. The second condition enforces that trait orderings commute with restriction; in other words, applying a trait ordering to a consistent sequence of finite trait allocations yields a consistent sequence of ordered finite trait allocations. For example, suppose , , and we are given a proposed trait ordering where and . This would not violate either of the conditions and may be a valid trait ordering. If instead the ordering was , the proposal would not be a valid trait ordering—the traits and get “shuffled”, i.e., .

###### Definition 3.3.

A trait ordering is a function such that:

1. The ordering is exhaustive: If , then .

2. The ordering is consistent: .

The trait ordering we use throughout is the lexicographic ordering: for two traits, we pick the lowest index with differing multiplicity, and order the one with higher multiplicity first. For example, since is the lowest index with differing multiplicity, and the multiplicity of is greater in the first trait than in the second. Similarly, since has greater multiplicity in the first trait than the second, and both 1 and 2 have the same multiplicity in both traits. Definition 3.4 makes this precise.

###### Definition 3.4.

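For two traits $\tau, \omega \in \mathbb{T}$, we say that $\tau < \omega$ if there exists $n \in \mathbb{N}$ such that $\tau(n) > \omega(n)$ and all $m < n$ satisfy $\tau(m) = \omega(m)$.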

We define $[\cdot]$ as the mapping from a finite trait allocation to the ordered trait allocation induced by the lexicographic ordering. The mapping $[\cdot]$ is a trait ordering, as shown by Theorem 3.6. The proof of Lemma 3.5 is provided in Appendix A.

###### Lemma 3.5.

For any pair $\tau, \omega \in \mathbb{T}$, if $\tau \le \omega$ then $\tau|_M \le \omega|_M$ for all $M \in \mathbb{N}$.

###### Theorem 3.6.

The mapping $[\cdot]$ is a trait ordering.

###### Proof.

$[\cdot]$ is trivially exhaustive: since the restriction operation acts identically on individual traits in both ordered and unordered finite trait allocations, and empty traits are removed, both $[t_N]|_M$ and $[t_N|_M]$ have the same multiset of traits (albeit in a potentially different order). The first trait $\tau$ of $[t_N]$ satisfies $\tau \le \omega$ for any $\omega$ such that $t_N(\omega) > 0$, by definition of $[\cdot]$. By Lemma 3.5, this implies that $\tau|_M \le \omega|_M$ for all $M \in \mathbb{N}$. Therefore, the first trait in $[t_N]|_M$ is the same as the first trait in $[t_N|_M]$. Applying this logic recursively to $t_N$ with $\tau$ removed, the result follows. ∎
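The lexicographic ordering and the consistency property established by Theorem 3.6 can be checked numerically on small cases. The sketch below assumes traits are sorted tuples and allocations are `Counter`s; the sort key (negated multiplicities, read from index 1 upward) is one possible way to realize Definition 3.4, and all function names and the example allocation are hypothetical.

```python
from collections import Counter

def lex_key(tau, N):
    """Sort key: higher multiplicity at the lowest differing index comes first."""
    c = Counter(tau)
    return tuple(-c[n] for n in range(1, N + 1))

def lex_order(t, N):
    """Ordered trait allocation induced by the lexicographic ordering."""
    traits = [tau for tau, mult in t.items() for _ in range(mult)]
    return sorted(traits, key=lambda tau: lex_key(tau, N))

def restrict(t, M):
    """Unordered restriction: remove indices above M, drop empty traits."""
    out = Counter()
    for tau, mult in t.items():
        r = tuple(n for n in tau if n <= M)
        if r:
            out[r] += mult
    return out

def restrict_ordered(ordered, M):
    """Ordered restriction: restrict each trait, then filter out empty ones."""
    restricted = (tuple(n for n in tau if n <= M) for tau in ordered)
    return [tau for tau in restricted if tau]

# Hypothetical trait allocation of [4].
t4 = Counter({(1, 1, 3): 1, (1, 3): 1, (2,): 1, (1, 2): 1, (4,): 1})
ordered4 = lex_order(t4, 4)
```

For this example the ordering commutes with restriction at every level, i.e. `lex_order(restrict(t4, M), M)` equals `restrict_ordered(ordered4, M)` for each `M`, which is the consistency condition of Definition 3.3.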

## 4. De Finetti representation of exchangeable trait allocations

We now derive a de Finetti-style representation theorem for infinite exchangeable trait allocations (Definition 2.7) that extends previous results for partitions and feature allocations (Kingman, 1978; Broderick et al., 2013). It turns out that all infinite exchangeable trait allocations have essentially the same form as in the example construction at the end of Section 2, with some additional nuance.

The high-level proof sketch is as follows. We first use the lexicographic ordering from Section 3 to associate an i.i.d. sequence of uniform random labels to the traits in the sequence, in the style of Aldous (1985). We collect the multiset of labels for each index into a sequence, called the label multiset sequence; the consistency of the ordering from Theorem 3.6 implies that this construction is well-defined. We show that the label multiset sequence itself is exchangeable in the traditional sequential sense in Lemma 4.3. And we use de Finetti’s theorem (Kallenberg, 1997, Theorem 9.16) to uncover its construction from conditionally i.i.d. random quantities. Finally, we relate this construction back to the original set of infinite exchangeable trait allocations to arrive at its representation in Theorem 4.5. Throughout the remainder of the paper, is a random infinite trait allocation and .

As an example construction of the label multiset sequence, suppose we have , and . The lexicographic ordering of is . The first trait in the ordering receives the first label in the sequence, , and the second trait receives the second label, . For each index , we now collect the multiset of labels to its assigned traits with the same multiplicity. Index 1 is a member of only the first trait with multiplicity 1, so its label multiset is . Index 2 is a member of the first trait with multiplicity 2 and the second with multiplicity 1, so its label multiset is . Similarly, for index 3 it is , and for index 4 it is . Putting the multisets in order (for index 1, then 2, 3, etc.), the label multiset sequence is therefore , where the ellipsis represents the continuation beyond to , , and so on. While the may be seen as a mathematical convenience for the proof, an alternative interpretation is that they correspond to trait-specific parameters in a broader Bayesian hierarchical model. Indeed, our proof would hold for from any nonatomic distribution, not just the uniform. In the document modeling example, each could correspond to a distribution over English words; with high mass on “basketball”, “luge”, and “curling” could represent a “sports” topic. For this reason, we call the labels. Let the set of (possibly empty) finite multisets of be denoted .

###### Definition 4.1.

The label multiset sequence $Y_\infty := (Y_N)$ of elements $Y_N$ corresponding to $T_\infty$ and $\phi_\infty$ is defined by

$$Y_N(\phi) := \sum_k \mathds{1}(\phi = \phi_k) \cdot [T_N]_k(N). \tag{4.1}$$

In other words, is constructed by selecting the component of , ordering its traits , and then adding copies of to for each . Again, the can thus be thought of as labels for the traits, and is the multiset of labels representing the assignment of the datum to its traits (hence the name label multiset sequence). This construction of ensures that the “same label applies to the same trait” as increases: the a.s. consistency of the ordering introduced in Section 3 immediately implies that

$$\forall N \le M, \quad Y_N(\phi) \overset{a.s.}{=} \sum_k \mathds{1}(\phi = \phi_k) \cdot [T_M]_k(N). \tag{4.2}$$
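The construction in Definition 4.1 can be traced on a finite prefix. The sketch below is an illustration under assumptions: traits are sorted tuples, allocations are `Counter`s, the labels are two fixed numbers in $(0,1)$ rather than uniform draws, and the prefix is a hypothetical one chosen to mirror the pattern of the worked example (index 1 in the first trait once, index 2 in the first trait twice and in the second once).

```python
from collections import Counter

def lex_order(t, N):
    """Order traits so that higher multiplicity at the lowest differing index is first."""
    def key(tau):
        c = Counter(tau)
        return tuple(-c[n] for n in range(1, N + 1))
    traits = [tau for tau, mult in t.items() for _ in range(mult)]
    return sorted(traits, key=key)

def label_multisets(ts, labels):
    """Return [Y_1, ..., Y_L] for the prefix ts = [T_1, ..., T_L], following Eq. (4.1).

    labels[k] is the label of trait k in lexicographic order (0-based here);
    the list must be at least as long as the number of traits in T_L.
    """
    Y = []
    for N, t in enumerate(ts, start=1):
        y = Counter()
        for k, tau in enumerate(lex_order(t, N)):
            mult = tau.count(N)  # multiplicity of index N in the k-th ordered trait
            if mult:
                y[labels[k]] += mult
        Y.append(y)
    return Y

# Hypothetical consistent prefix T_1, ..., T_4 and fixed labels.
ts = [
    Counter({(1,): 1}),
    Counter({(1, 2, 2): 1, (2,): 1}),
    Counter({(1, 2, 2): 1, (2, 3): 1}),
    Counter({(1, 2, 2, 4): 1, (2, 3): 1}),
]
labels = [0.7, 0.2]
Y = label_multisets(ts, labels)
```

Each `Y[N - 1]` is computed from `T_N` alone, yet the same label stays attached to the same trait as `N` grows; this relies on the ordering being consistent under restriction, as Eq. (4.2) states.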

Definition 4.1 implicitly creates a mapping, which we denote . Since the are distinct a.s., we can partially invert to recover the infinite trait allocation corresponding to a.s. via

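$$T_N(\tau) \overset{a.s.}{=} \mathds{1}(\forall n > N, \ \tau(n) = 0) \cdot \left| \{ \phi \in (0,1) : \forall n \le N, \ \tau(n) = Y_n(\phi) \} \right|. \tag{4.3}$$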

The first term in the product—the indicator function—ensures that is nonzero only for traits that do not contain any index . The second term counts the number of points for which the multiplicities in match those expressed by the label multiset sequence for . Thus, there exists another mapping such that
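The partial inverse in Eq. (4.3) amounts to reading off one trait per distinct label. The following sketch assumes each label multiset is a `Counter` from labels to multiplicities and each trait is a sorted tuple; the function name and the example sequence are hypothetical.

```python
from collections import Counter

def allocation_from_labels(Y, N):
    """Recover T_N from the label multisets Y_1, ..., Y_N, following Eq. (4.3).

    Y[n - 1] maps each label to the multiplicity of index n in the trait
    carrying that label. Every label seen up to N contributes one trait.
    """
    seen = set()
    for y in Y[:N]:
        seen.update(y)
    t = Counter()
    for phi in seen:
        # The trait's multiplicity at index n is Y_n(phi).
        tau = tuple(n for n in range(1, N + 1) for _ in range(Y[n - 1][phi]))
        if tau:
            t[tau] += 1
    return t

# Hypothetical label multiset sequence for four data points and two labels.
Y = [
    Counter({0.7: 1}),
    Counter({0.7: 2, 0.2: 1}),
    Counter({0.2: 1}),
    Counter({0.7: 1}),
]
T4 = allocation_from_labels(Y, 4)
T2 = allocation_from_labels(Y, 2)
```

The recovery depends on the labels being distinct, which holds almost surely for draws from a nonatomic distribution; two traits sharing a label would be merged by this procedure.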

$$\tilde{\varphi}(\varphi(T_\infty, \phi_\infty)) \overset{a.s.}{=} T_\infty. \tag{4.4}$$

The existence of the partial inverse is a crucial element in the characterization of all distributions on infinite exchangeable trait allocations in Theorem 4.5. In particular, it guarantees that the distributions over random infinite trait allocations are in bijection with the distributions on label multiset sequences , allowing the characterization of those on (a much simpler space) instead. As the primary focus of this work is infinite exchangeable trait allocations, we therefore must deduce the particular family of distributions on that are in bijection with the infinite exchangeable trait allocations on .

Lemma 4.3 shows that this family is, as one might suspect, the exchangeable (in the classical, sequential sense) label multiset sequences. The main result required for its proof is Lemma 4.2, which states that permutation of essentially results in the same permutation of the components of , modulo reordering the labels in . In other words, permuting the data sequence represented by leads to the same permutation of . As an example, consider a setting in which , , and thus . For a finite permutation , we define and , i.e., permutations act on sequences by reordering elements. If we permute the observed data sequence that represents by , this leads to the permutation of the indices in also by , resulting in . If we then reorder with a different permutation , so , then the corresponding label multiset sequence is . This is the reordering of by , the same permutation that was used to reorder the observed data; the main result of Lemma 4.2 is that a always exists to reorder such that this is the case. The proof of Lemma 4.2 may be found in Appendix A.

###### Lemma 4.2.

For each finite permutation and infinite trait allocation , there exists a finite permutation such that

$$\pi \varphi(t_\infty, \phi_\infty) \overset{a.s.}{=} \varphi(\pi t_\infty, \pi' \phi_\infty). \tag{4.5}$$

###### Lemma 4.3.

$T_\infty$ is exchangeable iff $Y_\infty$ is exchangeable.

###### Proof.

Fix a finite permutation . Then by Lemma 4.2 there exists a collection of finite permutations that depend on such that

$$\pi Y_\infty \overset{a.s.}{=} \varphi(\pi T_\infty, \pi_{T_\infty} \phi_\infty). \tag{4.6}$$

If $Y_\infty$ is exchangeable, then using Eq. 4.6 and the definition of $\tilde\varphi$ in Eq. 4.4,

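$$T_\infty \overset{a.s.}{=} \tilde\varphi(Y_\infty) \overset{d}{=} \tilde\varphi(\pi Y_\infty) \overset{a.s.}{=} \pi T_\infty. \tag{4.7}$$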

If $T_\infty$ is exchangeable, then again using Eq. 4.6 and noting that $\phi_\infty$ is a sequence of i.i.d. random variables and hence also exchangeable,

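$$\pi Y_\infty \overset{a.s.}{=} \varphi(\pi T_\infty, \pi_{T_\infty} \phi_\infty) \overset{d}{=} \varphi(T_\infty, \phi_\infty) = Y_\infty. \tag{4.8}$$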

We are now ready to characterize all distributions on infinite exchangeable trait allocations in Theorem 4.5 using the de Finetti representation provided by Definition 4.4. At a high level, this is a constructive representation involving three steps. Recall that is the countable set of sequences of nonnegative integers such that . First, we generate a (possibly random) distribution over , i.e., a sequence of nonnegative reals such that

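$$\sum_{\xi, \xi' \in \mathbb{K}} \mu_{\xi,\xi'} = 1 \quad \text{and} \quad \forall\, \xi, \xi' \in \mathbb{K},\ \mu_{\xi,\xi'} \geq 0. \tag{4.9}$$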

Next, for each , we sample i.i.d. from this distribution, resulting in two sequences . The sequence determines the membership of index in regular traits—which may be joined by other indices—and determines its membership in dust traits—which are unique to index and will never be joined by any other index. In particular, for each , index joins trait with multiplicity ; and for each , index has additional unique traits of multiplicity . For example, in a sequence of documents generated by latent topics, one author may write a single document with a number of words that are never again used by other authors (e.g. Jabberwocky, by Lewis Carroll); in the present context, these words would be said to arise from a dust topic. Meanwhile, common collections of words expressed by many documents will group together to form regular topics. Finally, we associate each trait with an i.i.d. label, construct the label multiset sequence , and use our mapping to collect these results together to form an infinite trait allocation . We say a random infinite trait allocation is regular if it has no dust traits with probability 1, and irregular otherwise.

###### Definition 4.4.

A random infinite trait allocation has a de Finetti representation if there exists a random distribution on such that has distribution induced by the following construction:

1. generate and ,

2. for all , define the multisets of via

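\begin{align}
R_N(\phi) &= \sum_{k,j} \mathbb{1}\left(\phi = \phi_k,\ \xi_{Nk} = j\right) \cdot j && \text{(regular traits)} \tag{4.10} \\
D_N(\phi) &= \sum_{j,\ell} \mathbb{1}\left(\phi = \phi_{Nj\ell},\ \ell \leq \xi'_{Nj}\right) \cdot j && \text{(dust traits)} \tag{4.11} \\
Y_N(\phi) &= R_N(\phi) + D_N(\phi), \tag{4.12}
\end{align}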
3. assemble the label multiset sequence and set .
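
The three steps of Definition 4.4 can be sketched in code under simplifying assumptions. The sketch below assumes that the directing distribution has finite support and is passed as a dictionary `mu`, that labels are drawn uniformly on (0, 1), and that multisets are stored as `Counter` objects; the function name `sample_label_multisets` is hypothetical. It produces only the label multiset sequence; the final mapping to a trait allocation is not shown.

```python
import random
from collections import Counter

def sample_label_multisets(mu, num_points, rng):
    """Draw Y_1, ..., Y_N following the steps of Definition 4.4.

    mu maps (xi, xi_dust) -> probability, where xi[k] is the multiplicity in
    regular trait k + 1 and xi_dust[j - 1] is the number of dust traits of
    multiplicity j. Finite support is an assumption made for illustration.
    """
    pairs = list(mu)
    weights = [mu[pair] for pair in pairs]
    num_regular = max(len(xi) for xi, _ in pairs)
    # One i.i.d. label per regular trait (uniform labels are an assumption).
    regular_labels = [rng.random() for _ in range(num_regular)]
    sequence = []
    for _ in range(num_points):
        xi, xi_dust = rng.choices(pairs, weights=weights, k=1)[0]
        y = Counter()
        for k, multiplicity in enumerate(xi):
            if multiplicity > 0:
                y[regular_labels[k]] += multiplicity  # Eq. 4.10
        for j, count in enumerate(xi_dust, start=1):
            for _ in range(count):
                y[rng.random()] += j  # Eq. 4.11: a fresh label per dust trait
        sequence.append(y)  # Eq. 4.12: Y_N = R_N + D_N
    return sequence

rng = random.Random(0)
# Hypothetical directing distribution over two (xi, xi') pairs.
mu = {((1, 0), (0,)): 0.5, ((2, 1), (1,)): 0.5}
Y = sample_label_multisets(mu, 5, rng)
```

Because each dust trait receives a freshly drawn label, it is almost surely never shared with another index, which is what distinguishes dust from regular traits in the construction.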

Theorem 4.5 is the main result of this section, which shows that infinite exchangeable trait allocations—both regular and irregular—are precisely those which have a de Finetti representation per Definition 4.4. The proof of Theorem 4.5 approaches the problem by characterizing the distribution of the exchangeable label multiset sequence .

###### Theorem 4.5.

$T_\infty$ is exchangeable iff it has a de Finetti representation.

###### Proof.

If has a de Finetti representation, then it is exchangeable by the fact that the are i.i.d. random variables. In the other direction, if is exchangeable, then there is a random label multiset sequence which is exchangeable by Lemma 4.3. Since we can recover from via , it suffices to characterize and then reconstruct .

We split into its regular and dust components—that represent, respectively, traits that are expressed by multiple data points and those that are expressed only by data point —defined for by

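\begin{align}
D_N(\phi) &= \begin{cases} 0 & \exists M \neq N : Y_M(\phi) > 0 \\ Y_N(\phi) & \text{otherwise} \end{cases} \tag{4.13} \\
R_N(\phi) &= Y_N(\phi) - D_N(\phi). \tag{4.14}
\end{align}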

Choose any ordering on the countable set . Next, we extract the multiplicities in and via the sequences ,

$$\xi_{Nk} := R_N(\phi_k). \tag{4.15}$$

Note that we can recover the distribution of from that of by generating sequences and using steps 2 and 3 of Definition 4.4. Therefore it suffices to characterize the distribution of . Note that is a function of such that permuting the elements of corresponds to permuting those of in the same way. Thus since is exchangeable, so is . And since is a sequence in a Borel space, de Finetti’s theorem (Kallenberg, 1997, Theorem 9.16) states that there exists a directing random measure such that . Since the set is countable, we can represent with a probability for each tuple . ∎
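
The split into regular and dust components in Eqs. 4.13 and 4.14 can be illustrated on a finite prefix of a label multiset sequence. This is only a sketch: the name `split_regular_dust` and the `Counter` representation are assumptions, and on a finite prefix a label seen once is treated as dust even though, in the infinite sequence, a later index could still share it.

```python
from collections import Counter

def split_regular_dust(label_multisets):
    """Split each Y_N into (R_N, D_N) following Eqs. 4.13 and 4.14.

    A label is dust for index N when no other index in the given prefix
    carries it with positive multiplicity.
    """
    num_owners = Counter()
    for y in label_multisets:
        for label, multiplicity in y.items():
            if multiplicity > 0:
                num_owners[label] += 1
    regular, dust = [], []
    for y in label_multisets:
        d = Counter({l: m for l, m in y.items() if m > 0 and num_owners[l] == 1})
        r = Counter({l: m for l, m in y.items() if m > 0 and num_owners[l] > 1})
        dust.append(d)
        regular.append(r)
    return regular, dust

# Hypothetical labels 'a', 'b', 'c'; only 'a' is shared between indices.
Y = [Counter({'a': 2, 'b': 1}), Counter({'a': 1}), Counter({'c': 3})]
R, D = split_regular_dust(Y)
```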

The representation in Theorem 4.5 generalizes de Finetti representations for both clustering (the Kingman paintbox) and feature allocation (the feature paintbox) (Kingman, 1978; Broderick et al., 2013), as shown by Corollaries 4.7 and 4.8, respectively. Further, Corollary 4.8 is the first de Finetti representation for feature allocations that accounts for the possibility of dust features; previous results were limited to regular feature allocations (Broderick et al., 2013). Theorem 4.5 also makes the distinction between regular and irregular trait allocations straightforward, as shown by Corollary 4.6.

###### Corollary 4.6.

An exchangeable trait allocation is regular iff it has a de Finetti representation where implies .

###### Corollary 4.7.

A partition is exchangeable iff it has a de Finetti representation where implies either

• and , or

• , , and .

###### Corollary 4.8.

A feature allocation is exchangeable iff it has a de Finetti representation where implies that

• , , and , .

## 5. Frequency models and probability functions

The set of infinite exchangeable trait allocations encompasses a very expressive class of random infinite trait allocations: membership in different regular traits at varying multiplicities can be correlated, membership in dust traits can depend on membership in regular traits, etc. While interesting, this generality makes constructing models with efficient posterior inference procedures difficult. A simplifying assumption one can make is that given the directing measure , the membership of an index in a particular trait is independent of its membership in other traits. This assumption is often acceptable in practice, and limits the infinite exchangeable trait allocations to a subset—which we refer to as frequency models—for which efficient inference is often possible. Frequency models, as used in the present context, generalize the notion of a feature frequency model (Broderick et al., 2013) for feature allocations.

At a high level, this constructive representation consists of three steps. First, we generate random sequences of nonnegative reals and such that , , and , . The quantity is the probability that an index joins regular trait with multiplicity , while is the average number of dust traits of multiplicity for each index. Next, each index independently samples its multiplicity in regular trait from the discrete distribution , where is the probability that the index is not a member of trait . For each , each index is a member of an additional dust traits of multiplicity . Finally, we collect these results together to form an infinite trait allocation . Note that the above essentially imposes a particular form for , as given by Definition 5.1.
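
The per-index sampling step described above can be sketched as follows, assuming a truncation to finitely many regular traits and finitely many dust multiplicities. The names `sample_poisson` and `sample_frequency_model_point` are hypothetical, and the Poisson sampler uses Knuth's multiplication method, which is a choice made here for self-containedness rather than anything prescribed by the text.

```python
import math
import random

def sample_poisson(rate, rng):
    """Knuth's multiplication method; adequate for small rates."""
    threshold = math.exp(-rate)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count

def sample_frequency_model_point(theta, theta_dust, rng):
    """Sample (xi, xi') for one index.

    theta[k][j - 1] is the probability of multiplicity j in regular trait
    k + 1; the leftover mass 1 - sum(theta[k]) is the probability of
    non-membership. theta_dust[j - 1] is the mean number of dust traits of
    multiplicity j. Both lists are truncated (an assumption).
    """
    xi = []
    for probs in theta:
        u, cumulative, multiplicity = rng.random(), 0.0, 0
        for j, p in enumerate(probs, start=1):
            cumulative += p
            if u < cumulative:
                multiplicity = j
                break
        xi.append(multiplicity)
    xi_dust = [sample_poisson(rate, rng) for rate in theta_dust]
    return tuple(xi), tuple(xi_dust)

rng = random.Random(0)
# Hypothetical parameters: two regular traits, dust multiplicities 1 and 2.
point = sample_frequency_model_point([[0.3, 0.1], [0.05]], [0.2, 0.01], rng)
```

Memberships in different traits are drawn independently given the parameters, which is exactly the simplifying assumption that defines a frequency model.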

###### Definition 5.1.

A random infinite trait allocation has a frequency model if there exist two random sequences , of nonnegative real numbers such that has a de Finetti representation with

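$$\mu_{\xi,\xi'} = \left(\prod_{k=1}^{\infty} \theta_{k\xi_k}\right) \cdot \left(\prod_{j=1}^{\infty} \frac{(\theta'_j)^{\xi'_j} e^{-\theta'_j}}{\xi'_j!}\right). \tag{5.1}$$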
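
As a numerical check on Eq. 5.1, the product form can be evaluated directly when both products are truncated to finitely many factors. The function name `frequency_model_mu` and the list-based parameter layout are assumptions made for this sketch; the factor for multiplicity zero uses the non-membership probability, i.e. one minus the sum of the membership probabilities for that trait.

```python
import math

def frequency_model_mu(xi, xi_dust, theta, theta_dust):
    """Evaluate Eq. 5.1 with both products truncated (an assumption).

    theta[k][j - 1] is the probability of multiplicity j in regular trait
    k + 1, and theta_dust[j - 1] is the Poisson rate for dust traits of
    multiplicity j.
    """
    if len(xi) != len(theta) or len(xi_dust) != len(theta_dust):
        raise ValueError("multiplicity and parameter lengths must match")
    prob = 1.0
    for k, multiplicity in enumerate(xi):
        if multiplicity == 0:
            prob *= 1.0 - sum(theta[k])  # probability of non-membership
        elif multiplicity <= len(theta[k]):
            prob *= theta[k][multiplicity - 1]
        else:
            prob *= 0.0  # multiplicity outside the truncated support
    for j, count in enumerate(xi_dust):
        rate = theta_dust[j]
        prob *= rate ** count * math.exp(-rate) / math.factorial(count)
    return prob

# Hypothetical parameters: one regular trait, one dust multiplicity.
theta, theta_dust = [[0.5, 0.25]], [1.5]
total = sum(frequency_model_mu((m,), (c,), theta, theta_dust)
            for m in range(3) for c in range(60))
```

Summing over the regular multiplicities and a long enough range of dust counts, as in the last lines, should give a total close to one, consistent with Eq. 4.9.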

Although considerably simpler than general infinite exchangeable trait allocations, this representation still involves a potentially infinite sequence of parameters; a finitary representation would be more useful for computational purposes. In practice, the marginal distribution of provides such a representation (Griffiths and Ghahramani, 2005; Thibaux and Jordan, 2007; James, 2017; Broderick et al., 2018). So rather than considering a simplified class of de Finetti representations, we can alternatively consider a simplified class of marginal distributions for . In previous work on feature allocations (Broderick et al., 2013), the analog of frequency models was shown to correspond to those marginal distributions that depend only on the unordered feature sizes (the so-called exchangeable feature probability functions (EFPFs)). In the following, we develop the generalization of EFPFs for trait allocations and show that the same correspondence result holds in this generalized framework.

We let be the number of unique orderings of a trait allocation ,

$$\kappa(t_N) := \frac{\left(\sum_{\tau \in \mathbb{T}} t_N(\tau)\right)!}{\prod_{\tau \in \mathbb{T}} t_N(\tau)!}, \tag{5.2}$$

and use the multiplicity profile of , given by Definition 5.2, to capture the multiplicities of indices in its traits. (A very similar quantity is known in the population genetics literature as the site (or allele) frequency spectrum (Bustamante et al., 2001), though it is typically defined there as an ordered sequence or vector rather than as a multiset.) The multiplicity profile of a trait is defined to be the multiset of multiplicities of its elements, while the multiplicity profile of a finite trait allocation is the multiset of multiplicity profiles of its traits. As an example, the multiplicity profile of a trait is , since there are two elements of multiplicity 1, one element of multiplicity 2, and one of multiplicity 4 in the trait. If we are given the finite trait allocation , then its multiplicity profile is . Note that a multiplicity profile is itself a trait allocation, though not always of the same indices. Here, the trait allocation is of , and its multiplicity profile is a trait allocation of .

###### Definition 5.2.

The multiplicity profile of a trait is defined as

$$\bar{\tau}(n) := \left|\{m \in \mathbb{N} : \tau(m) = n\}\right|, \tag{5.3}$$

and is overloaded for finite trait allocations as

$$\overline{t_N}(\xi) := \sum_{\tau \in \mathbb{T}} \mathbb{1}\left(\bar{\tau} = \xi\right) \cdot t_N(\tau). \tag{5.4}$$
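
Definition 5.2 can be made concrete with a short sketch. The representation is an assumption: a trait is a dictionary from index to multiplicity, a trait allocation is a list of (trait, count) pairs, and a trait's multiplicity profile is encoded as a sorted tuple of multiplicities, which carries the same information as the counting function in Eq. 5.3. The example traits are hypothetical.

```python
from collections import Counter

def trait_profile(trait):
    """Multiset of multiplicities of a trait, encoded as a sorted tuple."""
    return tuple(sorted(m for m in trait.values() if m > 0))

def allocation_profile(allocation):
    """Multiplicity profile of a finite trait allocation (Eq. 5.4).

    allocation is a list of (trait, count) pairs, where count is the number
    of copies of that trait in the allocation.
    """
    profile = Counter()
    for trait, count in allocation:
        profile[trait_profile(trait)] += count
    return profile

# Hypothetical trait with two indices of multiplicity 1, one of 2, one of 4.
tau = {1: 1, 2: 2, 5: 4, 7: 1}
tau_bar = trait_profile(tau)

# Hypothetical allocation of indices {1, 2, 3}.
t3 = [({1: 1, 2: 1}, 1), ({2: 1, 3: 1}, 1), ({3: 2}, 2)]
t3_bar = allocation_profile(t3)
```

Two distinct traits with the same multiset of multiplicities contribute to the same entry of the profile, which is why the profile forgets which indices participate in each trait.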

We also extend Definition 5.2 to ordered trait allocations , where the multiplicity profile is the ordered multiplicity profiles of its traits, i.e.  is defined such that .

The precise simplifying assumption on the marginal distribution of that we employ in this work is provided in Definition 5.3, which generalizes past work on exchangeable probability functions (Pitman, 1995; Broderick et al., 2013).

###### Definition 5.3.

A random infinite trait allocation has an exchangeable trait probability function (ETPF) if there exists a function such that for all ,

$$\mathbb{P}(T_N = t_N) = \kappa(t_N) \cdot p(N, \overline{t_N}). \tag{5.5}$$
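
The combinatorial factor in Eq. 5.5 is the multinomial coefficient of Eq. 5.2, which a short sketch can compute from the multiplicities of the distinct traits in the allocation. The function name `num_orderings` and the list input are assumptions made for illustration.

```python
from math import factorial

def num_orderings(trait_counts):
    """Number of unique orderings of a trait allocation (Eq. 5.2).

    trait_counts lists the multiplicity of each distinct trait in the
    allocation; zero entries are ignored.
    """
    counts = [c for c in trait_counts if c > 0]
    result = factorial(sum(counts))
    for c in counts:
        result //= factorial(c)  # exact: multinomial coefficients are integers
    return result

# Hypothetical allocation with three distinct traits appearing 1, 1, and 2 times.
kappa = num_orderings([1, 1, 2])
```

Under Definition 5.3, multiplying this count by the value of the function p at the allocation's multiplicity profile gives the probability of the unordered allocation.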

One of the primary goals of this section is to relate infinite exchangeable trait allocations with frequency models to those with ETPFs. The main result of this section, Theorem 5.4, shows that these two assumptions are actually equivalent: any random infinite trait allocation that has a frequency model (including those with random , of arbitrary distribution) has an ETPF, and any random infinite trait allocation with an ETPF has a frequency model. Therefore, we are able to use the simple construction of frequency models in practice via their associated ETPFs.

###### Theorem 5.4.

$T_\infty$ has a frequency model iff it has an ETPF.

The key to the proof of Theorem 5.4 is the uniformly ordered infinite trait allocation, defined below in Definition 5.6. Recall that is the space of consistent, ordered infinite trait allocations and that denotes an ordering of . Here, we develop the uniform ordering : intuitively, for each , is constructed by inserting the new traits in relative to into uniformly random positions among the elements of . This guarantees that is marginally a uniform random permutation of for each , and that is a consistent sequence, i.e. . There are two advantages to analyzing rather than itself. First, the ordering removes the combinatorial difficulties associated with analyzing . Second, the traits are independent of their ordering, thereby avoiding the statistical coupling of the ordering based solely on .

The definition of the uniform ordering in Definition 5.6 is based on associating traits with the uniformly distributed i.i.d. sequence , and ordering the traits based on the order of those values. To do so, we require a definition of the finite permutation that rearranges the first elements of to be in order and leaves the rest unchanged, known as the order mapping of . For example, if , then is represented in cycle notation as , and . The precise formulation of this notion is provided by Definition 5.5.

###### Definition 5.5.

The order mapping of the sequence is the finite permutation defined by

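$$\pi_n(k) := \begin{cases} \left|\{j \in \mathbb{N} : j \leq n,\ \phi_j \leq \phi_k\}\right| & k \leq n \\ k & k > n. \end{cases} \tag{5.6}$$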
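
Eq. 5.6 amounts to replacing each of the first n values by its rank among them. A minimal sketch, assuming the sequence is given as a Python list holding at least n distinct values (distinctness holds almost surely for continuous i.i.d. labels); the function name `order_mapping` and the example values are hypothetical.

```python
def order_mapping(phi, n):
    """Return the order mapping pi_n of Eq. 5.6 as a function of k >= 1.

    phi is a list whose entry phi[k - 1] plays the role of the k-th element
    of the sequence; it must contain at least n distinct values.
    """
    head = phi[:n]

    def pi(k):
        if k > n:
            return k  # indices beyond n are left unchanged
        return sum(1 for value in head if value <= phi[k - 1])

    return pi

# Hypothetical sequence; only the first three entries are reordered.
phi = [0.5, 0.2, 0.9, 0.1]
pi3 = order_mapping(phi, 3)
image = [pi3(k) for k in range(1, 5)]
```

In this example the first three values have ranks 2, 1, and 3 among themselves, and the fourth index is a fixed point because it lies beyond n.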

Definition 5.6 shows how to use the order mapping to uniformly order an infinite trait allocation: we rearrange the lexicographic ordering of using the order mapping where is the number of traits in .

###### Definition 5.6.

The uniform ordering of is