On synthetic data with predetermined subject partitioning and cluster profiling, and prespecified categorical variable marginal dependence structure
Michail Papathomas
School of Mathematics and Statistics, University of St Andrews, United Kingdom
ABSTRACT.
A standard approach for assessing the performance of partition or mixture models is to create synthetic data sets with a prespecified clustering structure, and assess how well the model reveals this structure. A common format is that subjects are assigned to different clusters, with variable observations simulated so that subjects within the same cluster have similar profiles, allowing for some variability. In this manuscript, we consider observations from nominal, ordinal and interval categorical variables. Theoretical and empirical results are utilized to explore the dependence structure between the variables, in relation to the clustering structure for the subjects. A novel approach is proposed that allows one to control the marginal association or correlation structure of the variables, and to specify exact correlation values. Practical examples are shown and additional theoretical results are derived for interval data, commonly observed in cohort studies, including observations that emulate Single Nucleotide Polymorphisms. We compare a synthetic dataset to a real one, to demonstrate similarities and differences.
Key words: Nominal variables; Ordinal variables; Interval variables; Cohort studies; Dirichlet process; Bayesian clustering; Simulated data; Single Nucleotide Polymorphisms
1 Introduction
Partitioning and mixture models are often used to reveal the clustering structure within a sample; for example, to discover if combinations of risk factors are associated with the risk of disease (Müller et al. 2011), or to reveal dependencies in a population whilst reducing the dimensionality of the problem (Yau and Holmes, 2011). The combination of clustering with other methods can potentially lead to significant improvements in the exploration of the joint effect of covariates. See Bhattacharya and Dunson (2012) and Yang and Dunson (2013), where tensor factorizations are employed to characterize the joint density of variables that form high-dimensional data. In Zhou et al. (2015) and Papathomas and Richardson (2016), marginally independent variables are detected with the use of modelling that is directly related to Bayesian partitioning algorithms. An overview of clustering approaches is given in Hennig et al. (2016).
Assessing the performance of partitioning models involves the creation of synthetic data with a prespecified clustering structure. The model is then fitted to the simulated data to assess its performance in terms of revealing this structure. Usually, profiles are created for a number of subjects, by simulating observations from a set of variables. The subjects are assigned to different clusters, and variable observations are simulated so that subjects within the same cluster have similar profiles, allowing for some variability. The investigator controls the strength of the signal in the clustering structure (i.e. how distinct the different clusters are), and the variability of the observations within each cluster. Sometimes partitioning the variables is of interest, rather than the subjects (Marbac et al. 2014). In this manuscript we focus on the former setup, as the two frameworks are interchangeable for simulated observations.
Partitioning models for continuous observations (Jasra et al. 2005) often assume a specific correlation structure for the variables, given the cluster allocation. This typically involves a multivariate normal distribution. Such modelling does not have any adverse effect on the predetermined clustering structure. See, for example, Melnykov et al. (2012) or Waller and Jones (2016). In contrast to continuous observations, clustering approaches for categorical observations typically prescribe that variables are independent given the clustering of the subjects (Dunson and Xing, 2009; Liverani et al. 2015). The resulting dimensionality reduction is the main advantage of this local independence modelling, as determining a fully specified joint distribution between $p$ categorical variables with $L$ levels requires the specification of $L^p - 1$ probabilities, a task that quickly becomes cumbersome and unwieldy. Celeux and Govaert (2016) comment on this notable difference in the modelling of continuous and categorical observations, mentioning that, in many applications, the assumption of independence given the clustering has proven successful in achieving the main objective of a clustering algorithm, which is to bring together similar observations. In Oberski (2016), local dependence is discussed, given well defined substantive interest. In this manuscript we concentrate on categorical variables, and thus adopt the widely espoused local independence assumption.
Importantly, when creating a synthetic dataset of categorical observations, departing from the within-cluster independence assumption can interfere with the predetermined clustering structure, generating additional clusters. For example, assume the aim is to create 2 clusters of subjects, using 5 variables with three levels each (0, 1 and 2). For the first cluster, define probabilities so that 2 is by far the most likely observation for each variable. For the second cluster, define a high probability for observing 0. Assume now a high positive correlation imposed on the variables, given the cluster. (See Supplemental material, Section S1, for generating correlated categorical variable observations.) This specification will generate one prominent group of subjects with observations (2,2,2,2,2), and another with observations (0,0,0,0,0). Crucially, a smaller but notable group of subjects will also be created, with a third, distinct profile, and this will form a third cluster, despite the simulation specifications. This was demonstrated in Tackney (2014, Summer student project). This is an artifact of the strong within-cluster correlation, and not of non-identifiability. In subsequent analyses, we ensure that all generated clustering structures are identifiable by following the guidelines of Allman et al. (2009) in terms of the required number of variables, so that the model parameters are identifiable up to label swapping. Denoting by $C$ the number of clusters, all synthetic datasets satisfy the identifiability condition, $p \ge 2\lceil \log_L C \rceil + 1$.
Our focus is on Bayesian clustering with the Dirichlet process (Bigelow and Dunson, 2009; Molitor et al. 2010). Our work concerns nominal, ordinal, and interval variables, where the numerical distance between categories is meaningful and known. Interval variables are of particular interest to us, as data from epidemiological and association cohort studies, such as number of children, are often in the form of interval observations. Furthermore, continuous observations are often categorized when data from cohort studies are analyzed. This is done to alleviate the adverse effect of outlier observations (for example in dietary observations; see Bingham and Riboli, 2004), or to allow for the flexible modelling of interactions (for example in air pollution variables; see Papathomas et al. 2011). Importantly, interval categorical variables allow for the use of covariances and correlations through expectations. This enables the derivation of the mathematical results in Sections 2.4 and 4.2, further increasing the investigator’s control over the marginal dependence structure of the variables.
The variables are independent given the clustering of the subjects, but marginally dependent. In synthetic data sets, this dependence can be at odds with the dependence structure observed in real data sets. The creation of synthetic data with predetermined clustering structure is straightforward, as long as the marginal dependence structure between the variables, generated as a byproduct of the clustering structure, is ignored. To the best of our knowledge, no algorithm has yet been proposed where the clustering structure is predetermined, whilst there is also control over the dependence structure between the variables. In this manuscript, an algorithm is proposed that allows one to control the categorical variables' marginal correlation structure. In turn, this enables the creation of simulated data sets that share more characteristics with real ones, compared to synthetic data created with standard methods. Wang and Sabo (2015) discuss the simulation of correlated binary observations, incorporating cluster specific random effects, but the aim of their algorithm is not to generate clusters with distinct variable profiles. The proposed methods are generally applicable, as shown in the first six examples in this manuscript. One of the applications of our methodology is to the generation of partitioned data that emulate Single Nucleotide Polymorphism (SNP) observations. We do not touch on issues relevant to recombination and imputation (Ioannidis et al. 2009), as this is beyond the scope of this manuscript.
In Section 2, we describe the generic approach for generating data with a predetermined clustering structure, introduce association measures for nominal and ordinal variables, and explore the marginal dependence structure between interval variables, deriving theoretical results. In Section 3, we introduce a specific algorithm for constructing clusters with distinct variable profiles, and examine its properties. In Section 4, this algorithm is modified to allow for a more flexible dependence structure. The control effected on the marginal dependence of the variables is extended, and exact correlation values can be specified. Practical examples are shown and additional theoretical results are derived. In Section 5, a real data set containing SNP observations is compared to a synthetic one, demonstrating similarities and differences. We conclude with a discussion in Section 6.
2 Simulating a predetermined clustering structure and the implied correlation matrix
2.1 The clustering model
Assume $p$ categorical variables $X_j$, $j = 1, \dots, p$. Without any loss of generality, assume that each variable takes values in $\{0, 1, \dots, L-1\}$. Let $X = (X_1, \dots, X_p)$. Following Papathomas and Richardson (2016), for subject $i$, $i = 1, \dots, n$, a variable profile is a vector of categorical values $x_i = (x_{i,1}, \dots, x_{i,p})$. Let $z = (z_1, \dots, z_n)$, where $z_i$ is an allocation variable, so that $z_i = c$ denotes that subject $i$ belongs to cluster $c$. Denote by $\phi^c_{j,x}$ the probability that $X_j = x$, when the individual belongs to cluster $c$. Given the clustering allocation, the variables are assumed independent, following a multinomial distribution with cluster specific parameters $\phi^c_{j,x}$. Denote by $\pi_c$ the probability that a subject belongs to cluster $c$, $c = 1, \dots, C$.
2.2 A generic algorithm for a predetermined clustering structure for the subjects
A generic algorithm for creating observations from $p$ variables, for $n$ subjects that are partitioned in $C$ clusters, is given as:

1. Specify the number of clusters $C$.

2. Specify the number of subjects $n_c$, $c = 1, \dots, C$, allocated to cluster $c$, in accordance with cluster allocation probabilities $\pi_c$. Allocate each subject to one of the $C$ clusters in accordance with $\pi_c$.

3. Specify the variable profile of the subjects within each cluster, i.e. probabilities $\phi^c_{j,x}$, for all $j$, $x$, and $c$, to generate a distinct variable profile for the subjects in each cluster.

4. To generate $x_{i,j}$ for a subject $i$ in cluster $c$, sample $w$ from the standard normal distribution. For $x = 0, \dots, L-1$, if $\Phi^{-1}\big(\sum_{l=0}^{x-1} \phi^c_{j,l}\big) < w \le \Phi^{-1}\big(\sum_{l=0}^{x} \phi^c_{j,l}\big)$, then $x_{i,j} = x$, where $\Phi$ denotes the standard normal cumulative distribution function. Here, by definition, $\sum_{l=0}^{-1} \phi^c_{j,l} = 0$.
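As an illustration, the four steps above can be sketched in Python. The function name, the seed, and the example probabilities below are our own choices, not taken from the manuscript; sampling a standard normal and thresholding its CDF value is equivalent to inverse-CDF sampling of the categories.

```python
import math
import random

def simulate_clustered_data(n, pi, phi, seed=0):
    """Sketch of the generic algorithm in Section 2.2.

    pi  : cluster allocation probabilities, length C.
    phi : phi[c][j][x] = P(variable j takes value x | cluster c).
    Returns the allocation vector z and the n x p data matrix.
    """
    rng = random.Random(seed)
    C, p = len(phi), len(phi[0])
    z, data = [], []
    for _ in range(n):
        # Steps 2-3: allocate the subject to a cluster according to pi.
        c = rng.choices(range(C), weights=pi)[0]
        z.append(c)
        row = []
        for j in range(p):
            # Step 4: draw w ~ N(0,1); the category is the first level x
            # whose cumulative probability reaches Phi(w).
            u = 0.5 * (1.0 + math.erf(rng.gauss(0.0, 1.0) / math.sqrt(2.0)))
            cum, x = 0.0, 0
            for x, prob in enumerate(phi[c][j]):
                cum += prob
                if u <= cum:
                    break
            row.append(x)
        data.append(row)
    return z, data

# Two clusters with opposite profiles over 5 three-level variables.
phi = [[[0.05, 0.05, 0.90]] * 5,   # cluster 0: level 2 very likely
       [[0.90, 0.05, 0.05]] * 5]   # cluster 1: level 0 very likely
z, data = simulate_clustered_data(200, [0.5, 0.5], phi)
```

Fitting a clustering model to `data` should then recover the two simulated clusters, up to label swapping.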
2.3 Association measures for nominal and ordinal variables
Pairwise association measures for nominal variables are typically based on the chi-square statistic. A standard choice is Cramer's $V$, defined as,
$$V = \sqrt{\frac{X^2}{n \min(I-1, J-1)}}.$$
Here,
$$X^2 = \sum_{i=1}^{I} \sum_{j=1}^{J} \frac{(n_{ij} - \hat{n}_{ij})^2}{\hat{n}_{ij}},$$
where $n_{ij}$ denotes the number of subjects classified in the cross-tabulation cell $(i,j)$, $i = 1, \dots, I$, $j = 1, \dots, J$, and $\hat{n}_{ij} = n_{i+} n_{+j} / n$, with $n_{i+}$ and $n_{+j}$ the row and column totals. Other measures describe the proportional reduction in variance from the marginal distribution of $Y$ to the conditional distribution of $Y$ given $X$. Agresti (2002, p.69) suggests the concentration coefficient, $\tau$, as a possible choice,
$$\tau = \frac{\sum_{i=1}^{I} \sum_{j=1}^{J} \pi_{ij}^2 / \pi_{i+} \; - \; \sum_{j=1}^{J} \pi_{+j}^2}{1 - \sum_{j=1}^{J} \pi_{+j}^2},$$
where $\pi_{ij}$ denotes the probability a subject is classified in the cross-tabulation cell $(i,j)$, $\pi_{i+} = \sum_j \pi_{ij}$, and $\pi_{+j} = \sum_i \pi_{ij}$. Those probabilities are estimated by $\hat{\pi}_{ij} = n_{ij}/n$. Measures $V$ and $\tau$ vary within $[0, 1]$. One difficulty associated with these measures is developing intuition on how large a value constitutes a strong association (Agresti, 2002, p.57).
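A minimal sketch of both measures, computed from a contingency table of counts; the function names are ours.

```python
def cramers_v(table):
    """Cramer's V from an I x J contingency table of counts."""
    I, J = len(table), len(table[0])
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(table[i][j] for i in range(I)) for j in range(J)]
    # Chi-square statistic against the independence fit n_i+ n_+j / n.
    chi2 = sum((table[i][j] - row_tot[i] * col_tot[j] / n) ** 2
               / (row_tot[i] * col_tot[j] / n)
               for i in range(I) for j in range(J))
    return (chi2 / (n * min(I - 1, J - 1))) ** 0.5

def concentration_tau(table):
    """Concentration coefficient: proportional reduction in variance of
    the column variable given the row variable (Agresti 2002, p.69)."""
    n = sum(sum(row) for row in table)
    p = [[v / n for v in row] for row in table]           # cell proportions
    p_row = [sum(row) for row in p]
    p_col = [sum(row[j] for row in p) for j in range(len(p[0]))]
    num = (sum(pij ** 2 / p_row[i] for i, row in enumerate(p) for pij in row)
           - sum(q ** 2 for q in p_col))
    return num / (1 - sum(q ** 2 for q in p_col))

perfect = [[50, 0], [0, 50]]     # perfect association: V = tau = 1
indep = [[25, 25], [25, 25]]     # independence: V = tau = 0
```

Both functions operate on estimated proportions $\hat{\pi}_{ij} = n_{ij}/n$, as in the text.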
The association between two ordinal variables is commonly measured by Spearman's coefficient or Kendall's tau. The two measures typically produce similar results (Colwell and Gillett, 1982). We opt to evaluate Kendall's tau, as it has a more intuitive interpretation in terms of concordant and discordant observations. Kendall's tau is defined as,
$$\tau = \frac{n_c - n_d}{n(n-1)/2},$$
where $n_c$ is the number of concordant pairs and $n_d$ the number of discordant pairs. Two observed pairs $(x_i, y_i)$ and $(x_{i'}, y_{i'})$ are concordant either if $x_i > x_{i'}$ and $y_i > y_{i'}$, or if $x_i < x_{i'}$ and $y_i < y_{i'}$. When $x_i = x_{i'}$, or $y_i = y_{i'}$, the two pairs are neither concordant nor discordant. When ties occur, the modified Stuart-Kendall $\tau_C$ is often adopted (Berry et al. 2009), where,
$$\tau_C = \frac{2m(n_c - n_d)}{n^2(m-1)}, \qquad m = \min(I, J).$$
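The pair-counting definitions above can be sketched as follows. The helper names are ours, and taking $m$ as the smaller number of distinct observed levels is our assumption for the dimensions of the underlying cross-tabulation.

```python
def kendall_counts(x, y):
    """Count concordant (nc) and discordant (nd) pairs among all pairs."""
    nc = nd = 0
    n = len(x)
    for a in range(n):
        for b in range(a + 1, n):
            s = (x[a] - x[b]) * (y[a] - y[b])
            if s > 0:
                nc += 1
            elif s < 0:
                nd += 1
            # s == 0: a tie, neither concordant nor discordant
    return nc, nd

def kendall_tau(x, y):
    """Kendall's tau: (nc - nd) over the n(n-1)/2 possible pairs."""
    nc, nd = kendall_counts(x, y)
    n = len(x)
    return (nc - nd) / (n * (n - 1) / 2)

def stuart_kendall_tau_c(x, y):
    """Stuart-Kendall tau_C = 2m(nc - nd) / {n^2 (m - 1)}, with m taken
    here as the smaller number of distinct observed levels."""
    nc, nd = kendall_counts(x, y)
    n = len(x)
    m = min(len(set(x)), len(set(y)))
    return 2 * m * (nc - nd) / (n ** 2 * (m - 1))
```

For a perfectly monotone pair of sequences both measures equal 1, and reversing one sequence flips the sign.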
2.4 The marginal correlation structure of interval variables
Assume that $X = (X_1, \dots, X_p)$ is a vector of interval categorical variables. The marginal variance-covariance matrix is,
$$\Sigma = E(XX^\top) - E(X)E(X)^\top.$$
Element $(j, j)$, $j = 1, \dots, p$, in the diagonal of $\Sigma$ is,
$$\mathrm{Var}(X_j) = \sum_{x=0}^{L-1} x^2 P(X_j = x) - \Big\{ \sum_{x=0}^{L-1} x P(X_j = x) \Big\}^2, \qquad P(X_j = x) = \sum_{c=1}^{C} \pi_c \phi^c_{j,x}. \qquad (1)$$
Element $(j, k)$, $j, k = 1, \dots, p$, $j \ne k$, in the off-diagonal of $\Sigma$ is,
$$\mathrm{Cov}(X_j, X_k) = E(X_j X_k) - E(X_j)E(X_k).$$
As $X_j$ and $X_k$ are independent given $z$,
$$E(X_j X_k) = \sum_{c=1}^{C} \pi_c E(X_j \mid z = c) E(X_k \mid z = c).$$
Denote by $\mu_{j,c}$ the expected value for $X_j$ in cluster $c$, i.e. $\mu_{j,c} = \sum_{x} x \, \phi^c_{j,x}$. Then, for $j \ne k$,
$$\mathrm{Cov}(X_j, X_k) = \sum_{c=1}^{C} \pi_c \mu_{j,c} \mu_{k,c} - \Big(\sum_{c=1}^{C} \pi_c \mu_{j,c}\Big)\Big(\sum_{c=1}^{C} \pi_c \mu_{k,c}\Big). \qquad (2)$$
Example 1: Consider $C = 2$. Then, $\mu_{j,1} > \mu_{j,2}$ and $\mu_{k,1} > \mu_{k,2}$ implies positive correlations, as,
$$\mathrm{Cov}(X_j, X_k) = \pi_1 \pi_2 (\mu_{j,1} - \mu_{j,2})(\mu_{k,1} - \mu_{k,2}).$$
Example 2: Consider , and assume that and . Then, for , it follows from (2) that,
In the Supplemental material, Section S2, we present extended versions of the two examples above, including additional inferences after utilizing equation (2). However, the larger the number of clusters, the less helpful (2) becomes for understanding the effect of the clustering on the marginal covariance structure of the variables. More helpful is the following Theorem.
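Equation (2) is the law of total covariance under local independence, and can be checked numerically against a brute-force evaluation over the implied joint distribution. All cluster parameters below are illustrative, made-up values.

```python
from itertools import product

# Made-up specification: C = 3 clusters, two variables with levels 0..2.
pi = [0.5, 0.3, 0.2]
phi_j = [[0.7, 0.2, 0.1], [0.1, 0.2, 0.7], [0.2, 0.6, 0.2]]  # phi^c_{j,x}
phi_k = [[0.6, 0.3, 0.1], [0.1, 0.3, 0.6], [0.3, 0.4, 0.3]]  # phi^c_{k,x}
C = len(pi)

def mean(phi_c):
    """Cluster-specific mean mu = sum_x x * phi^c_{x}."""
    return sum(x * p for x, p in enumerate(phi_c))

mu_j = [mean(p) for p in phi_j]
mu_k = [mean(p) for p in phi_k]

# Equation (2): sum_c pi_c mu_jc mu_kc - (sum_c pi_c mu_jc)(sum_c pi_c mu_kc)
cov_eq2 = (sum(pi[c] * mu_j[c] * mu_k[c] for c in range(C))
           - sum(pi[c] * mu_j[c] for c in range(C))
           * sum(pi[c] * mu_k[c] for c in range(C)))

# Brute force: local independence implies
# P(Xj = x, Xk = y) = sum_c pi_c phi^c_{j,x} phi^c_{k,y}.
pxy = {(x, y): sum(pi[c] * phi_j[c][x] * phi_k[c][y] for c in range(C))
       for x, y in product(range(3), range(3))}
e_xy = sum(x * y * p for (x, y), p in pxy.items())
e_x = sum(x * p for (x, y), p in pxy.items())
e_y = sum(y * p for (x, y), p in pxy.items())
cov_direct = e_xy - e_x * e_y
```

The two computations agree to floating-point precision.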
Theorem 1: Assume that $X_j$ and $X_k$ are interval categorical variables. Under the condition that $\pi_c = 1/C$, for $c = 1, \dots, C$,
$$\mathrm{Cov}(X_j, X_k) = \frac{1}{C^2} \sum_{c < c'} (\mu_{j,c} - \mu_{j,c'})(\mu_{k,c} - \mu_{k,c'}). \qquad (3)$$
Proof: The proof is given in the Appendix.
Equation (3), although restricted to $\pi_c = 1/C$, is more helpful for examining the effect of the clustering on the covariance structure of the variables. For any number of clusters, if, for all $c < c'$, the sign of $(\mu_{j,c} - \mu_{j,c'})$ is the same as the sign of $(\mu_{k,c} - \mu_{k,c'})$, the correlation between $X_j$ and $X_k$ is positive. If, for all $c < c'$, the sign of $(\mu_{j,c} - \mu_{j,c'})$ is different to the sign of $(\mu_{k,c} - \mu_{k,c'})$, the correlation between $X_j$ and $X_k$ is negative. The correlation is zero if, for every term in the sum in (3), either $(\mu_{j,c} - \mu_{j,c'})$ is zero or $(\mu_{k,c} - \mu_{k,c'})$ is zero.
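With equal allocation probabilities, (3) is an algebraic rearrangement of (2), which can be confirmed numerically. The cluster means below are made-up values for illustration.

```python
# Made-up cluster means for two interval variables, C = 4 clusters.
mu_j = [0.3, 1.8, 1.0, 0.4]
mu_k = [0.5, 1.5, 1.2, 0.2]
C = len(mu_j)

# Equation (2) with pi_c = 1/C:
cov_eq2 = (sum(m1 * m2 for m1, m2 in zip(mu_j, mu_k)) / C
           - (sum(mu_j) / C) * (sum(mu_k) / C))

# Equation (3): (1/C^2) sum_{c < c'} (mu_jc - mu_jc')(mu_kc - mu_kc'):
cov_eq3 = sum((mu_j[c] - mu_j[d]) * (mu_k[c] - mu_k[d])
              for c in range(C) for d in range(c + 1, C)) / C ** 2
```

The pairwise-difference form makes the sign argument in the text immediate: each summand carries the product of the two mean differences.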
3 A specific algorithm for a predetermined clustering
The algorithm in this Section allows for the creation of synthetic data with a partitioning signal of adjustable strength. The signal's strength depends on how distinct the probability vectors $p_{j,H}$ and $p_{j,L}$, defined below, are. Markedly different vectors $p_{j,H}$ and $p_{j,L}$ create distinct variable profiles for the different clusters. The algorithm is given as:

1. Define the number of clusters $C$.

2. For $C$ even, define the number of variables $p$ so that, $p = q \, 2^{C/2 - 1}$, where $q$ is an integer, $q \ge 1$. For $C$ odd, define $p$ so that, $p = q \, 2^{(C-1)/2}$.

3. Specify the number of subjects allocated to cluster $c$, in accordance with cluster allocation probabilities, $\pi_c = 1/C$, $c = 1, \dots, C$. Allocate each subject to one of the $C$ clusters in accordance with $\pi_c$.

4. For each variable $j$, consider two sets of probabilities, $p_{j,L} = (p^L_{j,0}, \dots, p^L_{j,L-1})$ and $p_{j,H} = (p^H_{j,0}, \dots, p^H_{j,L-1})$, so that, $\sum_x p^L_{j,x} = 1$, and, $\sum_x p^H_{j,x} = 1$. The two sets could be distinct in the sense that the first elements of $p_{j,L}$ are considerably higher than subsequent elements, whilst the first elements of $p_{j,H}$ are considerably lower than subsequent elements. When the same probabilities apply to all variables, subscript $j$ is omitted, so that, $p_L = (p^L_0, \dots, p^L_{L-1})$, and, $p_H = (p^H_0, \dots, p^H_{L-1})$.

5. For a subject in cluster $c = 2t - 1$, $c$ odd, define its profile so that the first $p/2^{t-1}$ variables are simulated in accordance with $p_{j,L}$, the next $p/2^{t-1}$ variables in accordance with $p_{j,H}$, and so on and so forth. This is done in accordance with the fourth step of the algorithm in Section 2.2.

6. For a subject in cluster $c = 2t$, $c$ even, define its profile so that the first $p/2^{t-1}$ variables are simulated in accordance with $p_{j,H}$, the next $p/2^{t-1}$ variables in accordance with $p_{j,L}$, and so on and so forth.

7. If required, to generate observations from variables that do not contribute to the clustering, consider a set of probabilities $p_{j,A}$, distinct from $p_{j,H}$ and $p_{j,L}$. For all subjects, generate observations from $p_{j,A}$ irrespective of cluster allocation.
For even $C$, this specification generates $2^{C/2 - 1}$ groups of associated variables, where the dependence between variables within a group is stronger compared to the dependence between variables in different groups. Henceforth, we refer to those groups as homogeneous. The integer $q$ is the number of variables within each homogeneous group. This dependence structure is shown empirically for nominal and ordinal variables in Examples 3 and 4, where we also present derived correlations assuming interval variables. Proposition 1 determines theoretically the dependence structure described above for interval variables.
Proposition 1: For the algorithm proposed in Section 3, for even $C$, and for $\pi_c = 1/C$, $p_{j,H} = p_H$, and, $p_{j,L} = p_L$ for all $j$, the covariance within each homogeneous group is the same for all groups, and the highest observed in the interval variables' covariance matrix.

Proof: Without any loss of generality, assume that all variables contribute to the clustering. Each of the homogeneous groups contains adjoined variables with the same cluster profile characterized by H or L. For the variables within a homogeneous group, the differences $(\mu_{j,c} - \mu_{j,c'})$ and $(\mu_{k,c} - \mu_{k,c'})$ always carry the same sign, for any $c$ and $c'$. This is not true for variables in different groups. This translates to within-group covariances that are always positive and larger than between-group covariances, as the algorithm determines balanced sized clusters and Theorem 1 holds.
Example 3: For 8 clusters ($C = 8$) and 16 variables ($p = 16$), the variable profile for each cluster is shown in Table 1. In the Supplemental material, Section S3, Figure S1, we show the clustering of simulated data for $n$ subjects, using $p_H$ and $p_L$. The clustering of the simulated data is in accordance with the predetermined clustering. This is observed in subsequent Examples too. (Throughout the manuscript, simulated subject profiles are clustered using the R package PReMiuM (Liverani et al. 2015), which implements Bayesian clustering with the Dirichlet process.) The number of variables in each homogeneous group is $q = 2$, and the number of homogeneous groups is 8. In Figure 1(a) we show a heatmap for the derived pairwise concentration coefficient $\tau$, which corresponds to simulated observations from nominal variables. Cramer's $V$ gave the same dependence structure, with slightly smaller correlations. In Figure 1(b) we present the derived Stuart-Kendall $\tau_C$ measure, assuming ordinal observations. Results using Spearman's coefficient were very similar. In Figure 1(c), we present a heatmap of the theoretical correlations between the variables, and in Figure 1(d) the sample correlations assuming interval observations. Note that blocks of negative and zero correlations are also observed in the correlation matrix, due to the symmetry in the clustering structure. We observe that the dependence structure measured by $\tau_C$ in Figure 1(b) is very similar to the sample correlations in Figure 1(d). In Figures S2 and S3 in the Supplemental material, we present the derived correlation matrices for two further specifications.
Table 1: Cluster profiles for 16 variables ($p = 16$) and 8 clusters ($C = 8$) for Example 3. Observations are simulated using probability vectors $p_H$ and $p_L$.
Cluster 1  L  L  L  L  L  L  L  L  L  L  L  L  L  L  L  L 
Cluster 2  H  H  H  H  H  H  H  H  H  H  H  H  H  H  H  H 
Cluster 3  L  L  L  L  L  L  L  L  H  H  H  H  H  H  H  H 
Cluster 4  H  H  H  H  H  H  H  H  L  L  L  L  L  L  L  L 
Cluster 5  L  L  L  L  H  H  H  H  L  L  L  L  H  H  H  H 
Cluster 6  H  H  H  H  L  L  L  L  H  H  H  H  L  L  L  L 
Cluster 7  L  L  H  H  L  L  H  H  L  L  H  H  L  L  H  H 
Cluster 8  H  H  L  L  H  H  L  L  H  H  L  L  H  H  L  L 
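Under the stated assumptions (even $C$, $p = q\,2^{C/2-1}$), the H/L patterns of Table 1 can be generated programmatically: cluster pair $t$ alternates L and H in blocks of $p/2^{t-1}$ variables, with odd clusters starting at L. The function name below is ours, and this block-halving rule is our reading of the algorithm.

```python
def cluster_profiles(C, p):
    """H/L profile table for the Section 3 algorithm (even C).
    Cluster pair t alternates L/H in blocks of p / 2^{t-1} variables;
    odd clusters start with L, even clusters with H."""
    rows = []
    for c in range(1, C + 1):
        t = (c + 1) // 2                  # pair index 1, ..., C/2
        block = p // 2 ** (t - 1)         # L/H run length for this pair
        start = "L" if c % 2 == 1 else "H"
        other = "H" if start == "L" else "L"
        row = [start if (j // block) % 2 == 0 else other for j in range(p)]
        rows.append("".join(row))
    return rows

profiles = cluster_profiles(8, 16)   # reproduces the rows of Table 1
```

For instance, `profiles[6]` gives the alternating `LLHH...` pattern of Cluster 7.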
Example 4: For odd $C$, the number of homogeneous groups of variables is $2^{(C-1)/2}$. The homogeneous groups are not defined as clearly as for even $C$ (see Example 3). For 5 clusters ($C = 5$) and 14 variables ($p = 14$, as two additional variables do not contribute to the clustering), the variable profile for each cluster is shown in Table 2. In the Supplemental material, Section S4, Figure S4, we show the clustering structure for simulated data for $n$ subjects, using $p_H$, $p_L$, and $p_A$. The number of variables in each homogeneous group is $q = 3$, and the number of homogeneous groups is 4. In Figure 2(a) we show the derived pairwise association measure $\tau$. Again, Cramer's $V$ gave a very similar structure, with slightly smaller correlations. In Figure 2(b) we present the derived Stuart-Kendall $\tau_C$ measure. Results using Spearman's coefficient were very similar. In Figure 2(c), we present a heatmap of the theoretical correlation matrix for the 12 variables that contribute to the clustering. Figure 2(d) shows the sample correlation matrix for all 14 variables, given the simulated observations. When we assume that the simulated observations are ordinal [Figure 2(b)], the derived association measure is very similar to the correlations in Figure 2(d).
Table 2: Cluster profiles for 14 variables ($p = 14$) and 5 clusters ($C = 5$) for Example 4. Observations are simulated using probability vectors $p_H$, $p_L$, and $p_A$.
Cluster 1  L  L  L  L  L  L  L  L  L  L  L  L  A  A 
Cluster 2  H  H  H  H  H  H  H  H  H  H  H  H  A  A 
Cluster 3  L  L  L  L  L  L  H  H  H  H  H  H  A  A 
Cluster 4  H  H  H  H  H  H  L  L  L  L  L  L  A  A 
Cluster 5  L  L  L  H  H  H  L  L  L  H  H  H  A  A 
4 A modified algorithm
The algorithm introduced in Section 3 is modified in Section 4.1, to allow for homogeneous groups of different size. Observations are generated from variables split into homogeneous groups, not necessarily in a balanced manner. In Section 4.2, we derive a theoretical result for interval variables that allows one to first specify the within homogeneous group covariances or correlations. In turn, this specification determines $p_H$ and $p_L$. The number of homogeneous groups is a power of 2. We assume the number of clusters is even, as this generates a clearly defined dependence structure. The categorical variables are positively associated (nominal/ordinal data) or correlated (interval data) within each homogeneous group.
4.1 The proposed algorithm
The proposed algorithm is shown below. Explanatory comments are added in brackets.

1. Define the number of homogeneous groups $G$, where $G$ is a power of 2.

2. Define the number of variables $d_g$ in each homogeneous group $g$, $g = 1, \dots, G$.

{Solving $G = 2^{C/2 - 1}$ for $C$ gives the even number of clusters, $C = 2(\log_2 G + 1)$.}

3. Define the number of subjects, $n_c$, within each cluster.

4. For each variable $j$, consider two sets of probabilities, $p_{j,L}$, and, $p_{j,H}$, so that, $\sum_x p^L_{j,x} = 1$, and, $\sum_x p^H_{j,x} = 1$. The two sets could be distinct so that the first elements of $p_{j,L}$ are considerably higher than subsequent elements, whilst the first elements of $p_{j,H}$ are considerably lower.

5. For odd $c = 2t - 1$, define the profile of cluster $c$ so that, writing $r = G/2^{t-1}$:

- the first $d_1 + \dots + d_r$ variables are simulated in accordance with $p_{j,L}$

- the next $d_{r+1} + \dots + d_{2r}$ variables in accordance with $p_{j,H}$

- the next $d_{2r+1} + \dots + d_{3r}$ variables in accordance with $p_{j,L}$

- and so on and so forth.

6. For even $c = 2t$, define the profile of cluster $c$ so that, writing $r = G/2^{t-1}$:

- the first $d_1 + \dots + d_r$ variables are simulated in accordance with $p_{j,H}$

- the next $d_{r+1} + \dots + d_{2r}$ variables in accordance with $p_{j,L}$

- the next $d_{2r+1} + \dots + d_{3r}$ variables in accordance with $p_{j,H}$

- and so on and so forth.

{When $d_1 = \dots = d_G = q$, the two steps above simplify as follows: for odd $c = 2t - 1$, define the profile of cluster $c$ so that the first $p/2^{t-1}$ variables are simulated in accordance with $p_{j,L}$, the next $p/2^{t-1}$ variables considering $p_{j,H}$, and so on and so forth. For even $c = 2t$, the first $p/2^{t-1}$ variables are simulated considering $p_{j,H}$, the next $p/2^{t-1}$ variables in accordance with $p_{j,L}$, and so on and so forth.}

7. If required, to generate observations from variables that do not contribute to the clustering, consider $p_{j,A}$, distinct from $p_{j,H}$ and $p_{j,L}$. For all subjects, generate observations from $p_{j,A}$ irrespective of cluster allocation.
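A sketch of the profile construction above, under our reading that cluster pair $t$ alternates L and H every $G/2^{t-1}$ homogeneous groups, so block boundaries always align with group boundaries. The function name and the example group sizes are ours.

```python
def modified_profiles(d, labels=("L", "H")):
    """Cluster profiles for the Section 4.1 algorithm (a sketch).

    d : sizes d_1, ..., d_G of the G homogeneous groups, G a power of 2.
    Cluster pair t alternates L/H every G / 2^{t-1} groups (not variables),
    so runs always align with group boundaries."""
    G = len(d)
    log2G = G.bit_length() - 1           # valid because G is a power of 2
    C = 2 * (log2G + 1)                  # even number of clusters
    profiles = []
    for c in range(1, C + 1):
        t = (c + 1) // 2                 # pair index
        run = G // 2 ** (t - 1)          # groups per L/H run
        start = 0 if c % 2 == 1 else 1   # odd clusters start with L
        row = []
        for g in range(G):
            row.extend([labels[(start + g // run) % 2]] * d[g])
        profiles.append("".join(row))
    return profiles

# Example 5-style specification: G = 4 unbalanced groups over 12 variables.
profiles = modified_profiles([4, 3, 3, 2])
```

With equal group sizes this reduces to the Section 3 construction, as noted in the bracketed comment above.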
Example 5: Assume 6 clusters ($C = 6$), 12 variables ($p = 12$), and $G = 4$ homogeneous groups with sizes $d_1, d_2, d_3, d_4$ summing to 12. Consider $n$ subjects. Observations were simulated using $p_H$ and $p_L$. Figure 3(a) shows the derived pairwise association measure $\tau$. In Figure 3(b) we present the derived Stuart-Kendall $\tau_C$ measure. In Figure 3(c), we present a heatmap of the theoretical correlation matrix for this specification. In Figure 3(d), we present the sample correlation matrix for the simulated observations. Inferences are very similar to those in Examples 3 and 4.
4.2 Allowing for a predetermined covariance or correlation within each homogeneous group for interval variables
Theorem 2: Assume that interval variables $X_j$ and $X_k$ belong to the same homogeneous group. For the algorithm in Section 4.1, and for $\pi_c = 1/C$, so that Theorem 1 holds,
$$\mathrm{Cov}(X_j, X_k) = \frac{1}{4} (\mu_{j,H} - \mu_{j,L})(\mu_{k,H} - \mu_{k,L}),$$
where, $\mu_{j,H} = \sum_x x \, p^H_{j,x}$, and, $\mu_{j,L} = \sum_x x \, p^L_{j,x}$.
Proof: See Appendix.
For $X_j$ and $X_k$ in the same homogeneous group, given a target covariance $v$, cluster specific probabilities for $X_j$ and $X_k$ should be set so that,
$$(\mu_{j,H} - \mu_{j,L})(\mu_{k,H} - \mu_{k,L}) = 4v. \qquad (4)$$
In practice, one may consider the simplified scenario where variables in the same homogeneous group share the same set of possible values, $p_{j,H} = p_H$ and $p_{j,L} = p_L$. Then, given $v$, set cluster specific probabilities so that, for all $j$ in the same homogeneous group,
$$|\mu_{j,H} - \mu_{j,L}| = 2\sqrt{v}, \qquad (5)$$
where $|\cdot|$ denotes absolute value.
where denotes absolute value. Example illustrations are given below, starting with a rudimentary example involving binary variables.
Example 6 (Application to binary variables):
Given predetermined covariances: For a binary variable $X_j$, with levels 0 and 1, the cluster specific mean equals the probability of observing 1, so that,
$\mu_{j,H} = p^H_{j,1}$, and, $\mu_{j,L} = p^L_{j,1}$.
For $X_j$ and $X_k$ in the same homogeneous group, given a target covariance $v$, set cluster specific probabilities for $X_j$ and $X_k$ so that,
$(p^H_{j,1} - p^L_{j,1})(p^H_{k,1} - p^L_{k,1}) = 4v$.
In practice, set $p^H_{j,1}$ constant for all variables, and suitably high, say, close to 1. Then, allow $p^L_{j,1}$ to vary, in accordance with,
$p^L_{j,1} = p^H_{j,1} - 2\sqrt{v}$.
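A small numerical check of the binary-variable recipe above; the names `pi_H` and `pi_L` are our shorthand for the cluster specific probabilities of observing 1 under the H and L profiles.

```python
import math

def pi_L_for_cov(pi_H, v):
    """Choose the L-profile success probability attaining covariance v."""
    return pi_H - 2.0 * math.sqrt(v)

def within_group_cov(pi_H, pi_L):
    """Within-group covariance for binary variables when half the (equally
    weighted) clusters have mean pi_H and the other half pi_L:
    Cov = E[mu_c^2] - E[mu_c]^2."""
    overall = (pi_H + pi_L) / 2.0
    return (pi_H ** 2 + pi_L ** 2) / 2.0 - overall ** 2

pi_H, v = 0.9, 0.04
pi_L = pi_L_for_cov(pi_H, v)   # 0.9 - 2 * 0.2 = 0.5
```

Substituting back confirms the achieved covariance equals the target $v$, since $((\pi_H - \pi_L)/2)^2 = v$.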
Given predetermined correlations: From equation (1), and for binary $X_j$,
$$\mathrm{Var}(X_j) = P(X_j = 1)\{1 - P(X_j = 1)\}.$$
For even $C$, for half of the clusters, $\mu_{j,c} = p^H_{j,1}$. For the remaining clusters, $\mu_{j,c} = p^L_{j,1}$. Thus, as $\pi_c = 1/C$,
$$P(X_j = 1) = \frac{p^H_{j,1} + p^L_{j,1}}{2}.$$
Therefore, denoting with $\rho$ the correlation between $X_j$ and $X_k$,
$$\rho = \frac{(p^H_{j,1} - p^L_{j,1})(p^H_{k,1} - p^L_{k,1})/4}{\sqrt{\mathrm{Var}(X_j)\,\mathrm{Var}(X_k)}}.$$
To allow for different predetermined correlations within each homogeneous group of variables, set $p^H_{j,1}$ constant for all $j$, and let $p^L_{j,1}$ vary so that the expression above attains the required value of $\rho$.
Example 7 (Application to SNP variables, given predetermined covariances): Single Nucleotide Polymorphisms (SNP) are categorical observations with 3 levels, usually denoted by 0, 1 and 2 for 'Wild type', 'Heterozygous variant' and 'Homozygous variant' respectively. For a SNP $X_j$, due to the Hardy-Weinberg principle (Ziegler and König, 2010), $P(X_j = 0) = (1 - \theta_j)^2$, $P(X_j = 1) = 2\theta_j(1 - \theta_j)$, and $P(X_j = 2) = \theta_j^2$, where $\theta_j$ denotes the minor allele frequency. Thus, $E(X_j) = 2\theta_j$, $E(X_j^2) = 2\theta_j(1 - \theta_j) + 4\theta_j^2$, $\mathrm{Var}(X_j) = 2\theta_j(1 - \theta_j)$, and,
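The Hardy-Weinberg quantities above can be verified numerically; a short sketch, with function names of our own choosing.

```python
def snp_distribution(theta):
    """Genotype probabilities for a SNP with minor allele frequency theta,
    under Hardy-Weinberg equilibrium; levels 0, 1, 2 count variant alleles."""
    return [(1 - theta) ** 2, 2 * theta * (1 - theta), theta ** 2]

def snp_moments(theta):
    """Mean and variance of the genotype count: 2 theta and 2 theta (1 - theta)."""
    probs = snp_distribution(theta)
    mean = sum(x * p for x, p in enumerate(probs))
    var = sum(x * x * p for x, p in enumerate(probs)) - mean ** 2
    return mean, var

probs = snp_distribution(0.3)    # [0.49, 0.42, 0.09]
mean, var = snp_moments(0.3)
```

The probabilities sum to one, and the moments match the closed forms $2\theta$ and $2\theta(1-\theta)$.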