Compact Representation of Uncertainty in Hierarchical Clustering

Compact Representation of Uncertainty in Hierarchical Clustering


Hierarchical clustering is a fundamental task often used to discover meaningful structures in data, such as phylogenetic trees, taxonomies of concepts, subtypes of cancer, and cascades of particle decays in particle physics. When multiple hierarchical clusterings of the data are possible, it is useful to represent uncertainty in the clustering through various probabilistic quantities. Existing approaches represent uncertainty for a range of models; however, they only provide approximate inference. This paper presents dynamic-programming algorithms and proofs for exact inference in hierarchical clustering. We are able to compute the partition function, MAP hierarchical clustering, and marginal probabilities of sub-hierarchies and clusters. Our method supports a wide range of hierarchical models and only requires a cluster compatibility function. Rather than scaling with the number of hierarchical clusterings of elements (), our approach runs in time and space proportional to the significantly smaller powerset of . Despite still being large, these algorithms enable exact inference in small-data applications and are also interesting from a theoretical perspective. We demonstrate the utility of our method and compare its performance with respect to existing approximate methods.

Keywords: Hierarchical Clustering – Jet Physics – Genomics


1 Introduction

The traditional (flat) clustering task is to output a partition of a given dataset. In contrast, the output required in the hierarchical clustering task is a recursive partitioning of the data, i.e., a rooted tree structure where each node in the tree corresponds to a subset of the data, with the data points at the leaves and the non-leaf nodes corresponding to the data at their descendent leaves. The two primary advantages of a recursive, tree-shaped partition are that trees, unlike flat partitions, 1) encode structure between the data partitions at multiple levels of granularity, and 2) represent multiple partitions of the dataset at once Dasgupta (2016).

Hierarchical clustering is often used to discover meaningful structures, such as phylogenetic trees of organisms Kraskov et al. (2005), taxonomies of concepts Cimiano and Staab (2005), subtypes of cancer Sørlie et al. (2001), and jets in particle physics Cacciari et al. (2008). Among the reasons that hierarchical clustering has been found to be broadly useful is that it forms a natural data representation of data generated by a Markov tree, i.e., a tree-shaped model where the state variables are dependent only on their parent or children.

Often, probabilistic approaches, such as diffusion trees Neal (2003); Knowles and Ghahramani (2011) and coalescent models Teh et al. (2008); Boyles and Welling (2012); Hu et al. (2013), model which tree structures are likely for a given dataset. Inference in these methods is approximate, typically performed with methods including greedy best-first, beam-search, sequential Monte Carlo, and MCMC. While these approaches often discover tree structures using probabilistic inference, the corresponding inference methods do not have efficient ways to compute an exact normalized distribution over all tree structures.

Recent work Kappes et al. (2015); Kohonen and Corander (2016) has studied distributions over flat clusterings and showed that dynamic programming can be used to both efficiently compute the partition function as well as find the MAP clustering Van Os and Meulman (2004); Greenberg et al. (2018). However, analogous results for hierarchical clustering are not known.

In this paper, we provide data structures and algorithms for efficiently computing exact distributions over hierarchical clusterings. We show how these methods can be used to produce not only a normalizing constant for the distribution but also find the MAP clustering, as well as compute marginal distributions, including the probability of any sub-hierarchy or cluster. We further show how to sample exactly from the posterior distribution over hierarchical clusterings (i.e., the probability of sampling a given hierarchy is equal to the probability of that hierarchy). Our methods require time and space subquadratic in the powerset of , a substantial reduction over brute force methods, which require , making practical exact inference for datasets on the order of 20 points or fewer, thereby benefiting real-world use cases where exact inference on datasets this size is of high interest and practical importance. For example, in jet physics there are practical applications for datasets of this size when implementing simplified models for the cascades of particle decays Soper and Spannowsky (2011, 2013) or on experimental data (after preprocessing methods) from proton-proton collisions at the Large Hadron Collider at CERN. We demonstrate our methods’ capabilities for exact inference in discovering cascades of particle decays in jet physics and subtype hierarchies in cancer genomics, two applications where there is a need for exact inference on datasets made feasible by our methods. We find that greedy and beam search methods frequently return estimates with energy well below the MAP clustering, and we quantify exactly various probabilistic quantities of interest in these domains, only made possible by our methods.

2 Background: Uncertainty in Clustering

Hierarchical Clustering is the task of recursively splitting a dataset into subsets until reaching singletons (this can equivalently be viewed as starting with the set of singletons and repeatedly taking the union of sets until reaching the entire dataset). This is in contrast to flat clustering, where the task is to partition the dataset into disjoint subsets. Formally, {definition}(Hierarchical Clustering3) Given a dataset of elements, , a hierarchical clustering, , is a set of nested subsets of , s.t. , , and , either , , or . Further, , if s.t. , then s.t. . When , is sometimes referred to as a cluster. Borrowing terminology from the literature on tree data structures, when and , we refer to and as children of , and the parent of and ; if we refer to as an ancestor of and a descendent of .

We wish to be able to efficiently compute probability distributions over hierarchical clusterings. Toward that end, we adopt an energy-based model of hierarchical clustering, where each hierarchy is assigned a real-valued energy. (Hierarchies are often represented by tree structures; we use the terms hierarchy and tree interchangeably throughout this paper.) The specifics of the energy function are model dependent, and we choose a convention where a higher energy implies a more compatible hierarchy, e.g. the energy could be identified with the likelihood or with a correlation function for the pairing of vertices. However, we want to emphasize that the algorithms introduced in this paper (see section 3 for more details) are independent of the particular form of the energy. Specifically, these algorithms can be straightforwardly implemented on models where the energy of each hierarchy is factorized as a product of positive definite single vertex splittings .

In order to assign the energy of a hierarchy, a linkage function mapping pairs of clusters onto the positive reals4 is required. The linkage value for each cluster in the hierarchy is computed, and the energy assigned to a hierarchy is the product of the cluster linkages for the set of pairs of children in the hierarchy. It is important to note that a requirement for our algorithms to compute exact values for the probabilistic quantities defined in section 3, is that the domain of the linkage function be the set of pairs of clusters, not pairs of trees. While this is always the case when performing hierarchical clustering with a linkage function, there are energy models where the energy of joining a given pair of clusters depends on the sub-tree structures rooted at those clusters (i.e., the linkage over cluster pairs is one-to-many).

In order to obtain a distribution over the set of hierarchical clusterings, we must compute a normalizer, often called the partition function5, which we define as the sum of the energies of each possible hierarchical clustering (equivalent to the energy marginalized over all the hierarchies). Thus, {definition}(Energy-Based Hierarchical Clustering) Let be a dataset, be a hierarchical clustering of , and be the energy of . Then, the probability of with respect to , , is equal to the energy of normalized by the partition function, . This gives us and . where denotes the set of all possible hierarchical clusterings of dataset . We drop the subscript when the dataset is clear from context.

To compute , we require a positive definite linkage function6, , where is the power set of .

The energy of tree , , is then given by , where . Since , the energy of a hierarchy can be computed by traversing bottom up or top down the hierarchy, without having to consider all possible pairs of clusters in .

Unfortunately, the number of trees over leaves grows very rapidly, namely (see Callan (2009); Dale and Moon (1993) for more details and proof). In fact, even if we consider a lower bound given by 7, , we are not able to simply iterate over the set of hierarchical clusterings to obtain the partition function. Instead, in this paper we introduce an extension to the trellis data structure proposed in Greenberg et al. (2018), enabling its use for hierarchical clustering. Algorithms we develop using this structure allow us to efficiently compute probabilistic quantities, such as the partition function or MAP hierarchical clustering, without having to iterate over each possible hierarchy. Also, our implementation of the hierarchical trellis can systematically count the total number of hierarchies, giving a result that matches exactly the formula . The trellis data structure is a directed acyclic graph, where there is a bijection between the vertices of the graph and , and there is an edge from vertex to vertex when the set of elements associated with is a maximal superset of the set of elements associated with . The dataset associated with a trellis vertex is denoted and the trellis vertex associated with a dataset is denoted .

In Greenberg et al. (2018), the authors use the cluster trellis data structure to compute distributions over flat clusterings. In the following section, we describe how to extend this data structure and use it to compute distributions over hierarchical clusterings.

3 Cluster Trellis for Hierarchical Clustering

In order to express a distribution over hierarchical clusterings, we wish to compute the partition function, i.e., the sum of the likelihoods of all possible hierarchical clusterings. We also wish to find the MAP hierarchical clustering, as well as the marginal probabilities of given sub-hierarchies and clusters.

In this section we describe algorithms for computing the partition function, maximal energy hierarchical clustering, and marginal probabilities, all exactly, in time. The algorithms are based on a cluster data structure, described in Greenberg et al. (2018), that we extend to hierarchical clusterings. We give correctness proofs and analyze the complexity of these algorithms below as well. In Figure 1 we show a visualization for an example implementing the hierarchical trellis to compute these probabilistic quantities over a dataset of four elements. As comparison, brute force methods have to iterate through all clusterings.

Computing the Partition Function

Given a dataset of elements, , the partition function, , for the set of hierarchical clusterings over , , is given by .

This value can be expressed in terms of the partition function for the set of hierarchical clusterings over proper subsets of . In particular, {fact} For any , the hierarchical partition function can be written recursively, as


where is the set of all clusters containing the element (omitting X), i.e,. .

Figure 1: An example of using the trellis to compute the distribution over hierarchical clusterings of the dataset . The left subfigure shows all the possible trees over subsets of 3, 2, and 1 elements, with the sums of their energy values stored in corresponding trellis nodes. The right subfigure shows the computation of the partition function for the dataset, given the partition functions for the sub-sets. The colors indicate corresponding computations that are computed with and stored in the trellis.

See Appendix for Proof. To compute the partition function efficiently, we make use of Fact 3 and our extension of the cluster trellis introduced in Greenberg et al. (2018). Our algorithm works by first setting the partition function of the leaf nodes in the trellis to 1. Then, the partition function for every node in the trellis is computed in order (in a bottom-up approach), from the nodes with the smallest number of elements to the nodes with the largest number of elements, memoizing the partition function value at each node. By computing the partition functions in this order, whenever computing the partition function of a given node in the trellis, the partition functions of all of the descendent nodes will have already been computed and memoized. Note that this algorithm can also be expressed recursively. See Algorithm 1 for pseudo-code of the recursive version of this algorithm and Figure 1 for a visualization of computing distributions over hierarchical clustering using the trellis.

  for  in  do
     if  has not been assigned then
     if  has not been assigned then
Algorithm 1 Partition Function

Since the partition function for each node in the trellis is computed and the descendent nodes’ partition functions are always pre-computed due to order of computation, the time-complexity of the algorithm is , which is significantly smaller than . See Appendix for Proof.

Computing the MAP Hierarchical Clustering

It is possible to compute the MAP clustering for dataset , , defined as , with an approach analogous to the one introduced for the partition function, using the cluster trellis as well as Fact 3 for this purpose.


For any , the MAP clustering can be written recursively,

See Appendix for Proof. As in the partition function algorithm described in Section 3, the time complexity for finding the MAP clustering is . Unlike the partition function algorithm, to compute the maximal likelihood hierarchical clustering, the maximal energy of the sub-hierarchy rooted at each node is computed, rather than the partition function. Further, pointers to the children of the maximal sub-hierarchy rooted at each node are stored at that node.


It often useful to compute the marginal probabilities of any given sub-hierarchy, , defined as , as well as the marginal probabilities of any given cluster, , defined as .

Sampling from the Distribution over Trees

In this section we introduce a sampling procedure for hierarchical clusterings implemented over the trellis that samples according to the probability distribution over all hierarchical clusterings . Thus, the probability to sample a hierarchical clustering is given by


Given a dataset , after building the trellis and computing the partition function for each sub-hierarchy rooted at , with , we sample starting from the root . In particular,

Proposition: A sample can be found recursively starting from the root vertex of the trellis until reaching all singletons . At each level (and branch 8) we recursively sample a set of children nodes with probability


with . Then, probability of the sample is obtained by multiplying the values obtained at each level. See Appendix for Proof and Algorithm 2 for pseudo-code.

  Input: Dataset
  if   then
     Sample a set of vertices in , where , according to the probability in Eq. 3.
     Sampling Function
     Sampling Function
Algorithm 2 Sampling Function

4 Experiments

In this section, we conduct experiments demonstrating our approach to compute exactly various probabilistic quantities of importance for real world applications. In the case of the MAP hierarchical clustering, we compare the performance of our trellis algorithm with respect to approximate methods, e.g. greedy and beam search algorithms, and analyze the improvements. In particular, we study these probabilistic quantities for two types of problems, i.e. hierarchical clusterings where the ground truth is provided by a toy model to simulate cascades of particle physics decays in jet physics and modeling subtypes of cancer, which determine prognosis and treatment plans. Also, for the jet physics case, we provide an implementation of the sampling procedure to obtain the posterior distribution, detailed in section 3, and compare the sampled distribution with the expected one calculated using the partition function.

4.1 Jet Physics and Hierarchical Clusterings

The Large Hadron Collider (LHC) at CERN collides two beams of high-energy protons and produces many new (unstable) particles. Some of these new particles (quarks and gluons) will undergo a showering process, where they radiate many other quarks and gluons in successive binary splittings. These splittings can be represented with a binary tree, where the energy of the particles decreases after each step. When the energy is below a given threshold, the showering terminates, resulting in a spray of particles that is called a jet. The particle detectors only observe the leaves of this binary tree (the jet constituents), and the unstable particles in the showering process are unobserved. While the latent showering process is unobserved, it is described by quantum chromodynamics (QCD).

A specific jet could result from several latent trees generated by this showering process. It is natural to represent a jet and the particular clustering history that gave rise to it as a binary tree, where the inner nodes represent each of the unstable particles and the leaves represent the jet constituents. This representation connects jets physics with natural language processing (NLP) and biology, which is exciting and was first suggested in Louppe et al. (2019).


Jets are among the most common objects produced at the LHC, and a great amount of work has been done to develop techniques for a better treatment and understanding of them, from both an experimental and theoretical point of view. In particular, determining the nature (type) of the initial unstable particle (the root of the binary tree), and its children and grandchildren that gave rise to a specific jet is essential in searches for new physics, as well as precision measurements of our current model of nature, i.e., the Standard Model of particle physics. In this context, it becomes relevant and interesting to study algorithms to cluster the jet constituents (leaves) into a binary tree and metrics to compare them. Being able to improve over the current techniques that attempt to invert the showering process to reconstruct the ground truth-level tree would assist in physics searches at the Large Hadron Collider.

There are software tools called parton showers, e.g., PYTHIA, Herwig, Sherpa, that encode a physics model for the simulation of jets that are produced at the LHC. Current algorithms used by the physics community to estimate the clustering history of a jet are domain-specific sequential recombination jet algorithms, called generalized clustering algorithms Cacciari et al. (2008), and they do not use these generative models. These algorithms sequentially cluster the jet constituents by locally choosing the pairing of nodes that minimizes a distance measure. Given a pair of nodes, this measure depends on the angular distance between their momentum vector and the value of this vector in the transverse direction with respect to the collision axis between the incoming beams of protons.

Currently, generative models that implement the parton shower in full physics simulations are implicit models, i.e., they do not admit a tractable density. Extracting additional information that describes the features of the latent process is relevant to study problems where we aim to unify generation and inference, e.g inverting the generative model to estimate the clustering history of a jet. A schematic representation of this approach is shown in Figure 2.

Figure 2: Schematic representation of the tree structure of a sample jet generated with Ginkgo and the clustered tree for some clustering algorithm. For a given algorithm, labels the different variables that determine the latent structure of the tree. The tree leaves are labeled in red and the inner nodes in green.

At present, it is very hard to access the joint likelihood in state-of-the-art parton shower generators in full physics simulations. Also, typical implementations of parton showers involve sampling procedures that destroy the analytic control of the joint likelihood. Thus, to aid in machine learning (ML) research for jet physics, a python package for a toy generative model of a parton shower, called Ginkgo, was introduced in Cranmer et al. (2019a). Ginkgo has a tractable joint likelihood, and is as simple and easy to describe as possible but at the same time captures the essential ingredients of parton shower generators in full physics simulations. Within the analogy between jets and NLP, Ginkgo can be thought of as ground-truth parse trees with a known language model. A python package with a pyro implementation of the model with few software dependencies is publicly available in Cranmer et al. (2019a).

Data Methods

In this paper, we proposed a new method to efficiently consider every possible latent structure, find the MAP hierarchical clustering, partition function Z, and compute an estimate for the posterior distribution over all possible hierarchical clusterings from sampling. The generative model described by Ginkgo deviates in a subtle way from the assumptions needed for the trellis procedure to find the exact maximum likelihood tree and partition function over trees. This relates to the requirements that the linkage function only depends on pairs of clusters. In particular, the likelihood function for Gingko can assign different likelihoods to different subtrees over the same cluster. Nevertheless, we find that the trellis procedure is very effective when we identify the energy function for a pair of clusters to be the likelihood associated with the tree resulting from merging the two maximum likelihood trees corresponding to and . The trellis procedure will be exact with respect to hierarchical clusterings from this linkage function, but not exact for a more general type of linkage function that would include pairs of trees. We will compare the trellis results for the MAP hierarchical clustering with approximate methods, as described below.

The ground truth hierarchical clusterings of our dataset are generated with the toy generative model for jets, Ginkgo. This model implements a recursive algorithm to generate a binary tree, whose leaves are the jet constituents, see Cranmer et al. (2019b) for more details. Jet constituents (leaves) and intermediate state particles (inner nodes) in Ginkgo are represented by a 2 dimensional spatial momentum vector , which determines the direction of motion. It is important to mention that in this model, the linkage function depends on the pairs of trees, i.e. the energy for a pairing of clusters depends on the sub-tree structures rooted at each of those clusters. As a result, we are not guaranteed that the MAP computed with the trellis is the exact one in this particular model. Because of the same reason, the partition function we obtain is not identical to the ground truth distribution, but we can realize a partition function in this model. Thus, we will obtain a partition function and sample from the posterior distribution to show an implementation of the trellis algorithm.

Next, we review new algorithms to cluster jets, based on the joint likelihood of the jet binary tree in Ginkgo, introduced in Cranmer et al. (2020). In this work the authors explore algorithms that aim to obtain the maximum likelihood estimate (MLE) or MAP for the latent structure of a jet, given a set of jet constituents (leaves). In this approach, the tree latent structure is fixed by the algorithm. In particular, greedy and beam search algorithms are studied. The greedy one simply chooses the pairing of nodes that locally maximizes the likelihood at each step. The beam search one maximizes the likelihood of multiple steps before choosing the latent path. The current implementation only takes into account one more step ahead, with a beam size given by , with the number of jet constituents to cluster. Also, when two or more clusterings had an identical likelihood value, only one of them was kept in the beam, to avoid counting multiple times the different orderings of the same clustering (see Boyles and Welling (2012) for details about the different orderings of the internal nodes of the tree). This approach significantly improved the performance of the beam search algorithm in terms of finding the MLE.


In this section we show results for the implementation of the trellis algorithm on a jet physics dataset of 1965 Ginkgo Cranmer et al. (2019a) jets with a number of leaves between 5 and 10, and we refer to it as Ginkgo510 9. We start by comparing in Table 1 the mean difference among the MAP values for the hierarchies likelihood obtained with the trellis, beam search and greedy algorithms. We see that the energy values of the trees found using the trellis are greater than beam search, while beam search tree energies are greater than those found using the greedy approach, as expected.

Beam Search Greedy
Trellis 2.7 1.9 5.9 4.3
Beam Search 3.2 3.4

Table 1: Mean and standard deviation for the difference in log likelihood for the MAP tree found by algorithms indicated by the row and column heading on the Ginkgo510 dataset.

Even though we use beam search as a baseline to estimate the MAP value, we want to emphasize that it typically has a good performance for trees with up to about 10 leaves, but as we see in Table 1, the trellis MAP value is significantly greater. As already mentioned, the MAP computed with the trellis is not exact in this particular model due to the linkage function depending on the sub-tree structures rooted at each of the clusters. However, the Trellis allows us to obtain a MAP value computed by recursively considering each sub-tree MAP value, which covers a much larger set of hierarchies than beam search (the baseline algorithm used here only considers one step ahead).

Next, in Figure 3 we show a plot of the partition function versus the MAP for each set of leaves in Ginkgo510 dataset. It is interesting to note that there seems to be a correlation between and the Trellis MAP, dependent on the number of leaves. We want to emphasize that the implementation of the trellis algorithm allows us to access the partition function.

Figure 3: Scatter plot of the partition function vs. the trellis MAP for the Ginkgo510 dataset, with up to 10 leaves (jet constituents). The color indicates the number of leaves of each hierarchical clustering. There appears to be a correlation between and the MAP that is dependent on the number of leaves. Also, the greater the number of leaves, the faster that grows which is consistent with the rapid increase in the number of possible hierarchical clusterings.

Finally, we show the implementation of the sampling procedure introduced in section 3. We compare the sampled posterior distribution with respect to the expected one (as explained below), conditioned on a set of five leaves.

Figure 4: Posterior distribution of for a specific jet with five leaves. We compare the distribution from sampling 3 hierarchies from the posterior using the procedure described in Sec. 3 (shown as black dots with tiny statistical error bars) and the expected posterior distribution (in green). The plots show the discrete nature of the distribution. The log likelihood for the ground truth hierarchical clustering is shown as a vertical dashed red line, which happens to be near the MAP value in this case.

We show in Figure 4 the results from sampling 3 hierarchies (black dots) and the expected distribution (green) for the likelihood of each hierarchy. The expected posterior is defined as the probability density function of each possible hierarchy. In principle, this could be obtained by taking the ratio of the likelihood of each hierarchy with respect to the partition function . However, an algorithm that accesses the likelihood of each possible hierarchy would be inefficient and not feasible in practice (the trellis algorithm provides an efficient way to obtain without having to visit each hierarchy). Thus, we opt to take an approximate approach, as follows. If we sample enough number of times, we would expect each possible hierarchy to appear at least once. Thus, as a proof of concept, we sample 3 hierarchies for a set of five leaves (88 different hierarchies), keep only one of them for each unique likelihood value and normalize by and bin size. We show this result in the histogram labeled as Expected (green) in Figure 4. There is an excellent agreement between the sampled and the expected distributions.

4.2 Cancer Genomics and Hierarchical Clusterings


Hierarchical clustering is a common clustering approach for gene expression data (Sørlie et al., 2001). However, standard hierarchical clustering uses a greedy agglomerative or divisive heuristic to build a tree. It is not uncommon to have a need for clustering a small number of samples in cancer genomics studies. An analysis of data available from shows that the median sample size for 7,412 completed phase I clinical trials involving cancer is only 30.

Data and Methods

Here, we compare a greedy agglomerative clustering to our exact MAP clustering tree using the Prediction Analysis of Microarray 50 (pam50) gene expression data set. The pam50 data set (, ) is available from the UNC MicroArray Database (University of North Carolina, 2020). It has intrinsic subtype annotations for 139 of the 232 samples. Missing data values (2.65%) were filled in with zeros. We drew a stratified sample of the total data set with two samples from each known intrinsic subtype and two samples from the unknown group. See Appendix for more details.

Figure 5: Comparison of greedy hierarchical tree (left) and MAP tree (right, our method) on the subsampled pam50 () data set.


Figure 5 displays the greedy hierarchical clustering tree and the MAP tree with transformed weights (see Appendix A.6 for more details) for the twelve samples selected from the pam50 dataset. The main difference between these trees is in the split of the subtree including LumB, HER2, and unknown samples. The greedy method splits HER2 from LumB and unknown, while the MAP tree shows a different topology for this subtree. For the MAP solution, we note that the subtree rooted at is consistent. All of the correlation coefficients among this cluster are positive, so the optimal action is to split off the item with the smallest (positive) correlation coefficient.

5 Related Work

Modeling distributions over tree structures has been the subject of a large body of work. Bayesian non-parametric models typically define a posterior distribution over tree structures given data such as Dirichlet and Pitman-Yor diffusion trees Neal (2003); Knowles and Ghahramani (2011), coalescent models Teh et al. (2008); Boyles and Welling (2012); Hu et al. (2013), and in the case of grouped data, the nested Chinese restaurant processes Blei et al. (2010) and nested hierarchical Dirichlet processes Paisley et al. (2014). Other models, such as tree structured nested sticking breaking, provide a distribution over a different class of tree structures, one for which data can sit at internal nodes Ghahramani et al. (2010). These methods, while providing a distribution over trees, only support using parametric distributions to define emission probabilities and do not support the general family of linkage functions used in our approach, which can use any scoring function to define the distribution. Factor graph-based distributions over tree structures such as Wick et al. (2012) on the other hand support a flexible class of distributions over tree structures as in our approach. However inference in factor graph models as well as many of the Bayesian non-parameteric models is typically approximate or performed by sampling methods. This lends in practice to approximate MAP solutions and distributions over tree structures. Exact methods like the one proposed in this paper have not, to our knowledge, been proposed.

Dasgupta (2016) defines a cost function for hierarchical clustering. Much work has been done to develop approximate solution methods and related objectives Roy and Pokutta (2017); Cohen-Addad et al. (2019); Charikar and Chatziafratis (2017); Moseley and Wang (2017); Cohen-Addad et al. (2017); Charikar et al. (2019).

Bootstrapping methods, such as Suzuki and Shimodaira (2006), represent uncertainty in hierarchical clustering. Unlike our approach, bootstrapping methods approximate statistics of interest through repeatedly (re-)sampling from the empirical distribution.

Exact inference and exact distributions over flat clusterings has been considered by Greenberg et al. (2018), which provides the foundation of the dynamic programming approach in this work. Other work on exact inference in flat clustering uses fast convolutions via the Mobius transform and Mobius inversion Kohonen and Corander (2016). (Kappes et al., 2015) produce approximate distributions over flat clusterings using Perturb and MAP Papandreou and Yuille (2011).

6 Conclusion

This paper describes a data structure and dynamic-programming algorithm to efficiently compute and sample from probability distributions over hierarchical clusterings. Our method improves upon the computation cost of brute-force methods from to sub-quadratic in the substantially smaller powerset of . We demonstrate our methods’ utility on jet physics and cancer genomics datasets and show its improvement over approximate methods. Finally, though there are multiple approaches to sample from the posterior distribution, such as MCMC techniques, they could be very expensive while our sampling method has a bounded computation time.


KC and SM are supported by the National Science Foundation under the awards ACI-1450310 and OAC-1836650 and by the Moore-Sloan data science environment at NYU. PF is supported in part by NSF HDR TRIPODS award 1934846.

Appendix A Appendix

a.1 Proof of Fact 3


Given a dataset , pick an element . We consider all possible clusters that contain . Given , then is fixed so as to satisfy and . We want to show that the partition function can be written recursively in terms of and .

The partition function is defined as the sum of the energies of all possible hierarchical clusterings ,


where , . Also, and are the sub-hierarchies in that are rooted at and , respectively. Next, we rewrite Eq. A.1 grouping together all the hierarchies that have the same clusters 10,


with . Thus, of a cluster can be written recursively in terms of the partition function of the sub-clusters of X 11.

a.2 Proof of Sampling Proposition


At level (and branch ) we sample a set in , where and , with probability


The probability of the sampled tree is obtained by multiplying the values obtained at each level. Let us consider the contribution from two consecutive levels and , for a sub-hierarchy rooted at ,


We start from the root vertex of the trellis with partition function . We sample level by level (and branch by branch) until we reach the leaves of the tree (singletons). At each new step, the normalizing partition function cancels out with the corresponding one at the previous step (see Eq. A.2). Thus, the probability of the sampled tree can be written as follows:


where labels each level, the branches at each level and the partition function for each singleton (). Thus, this sampling procedure follows the probability distribution over the set of all hierarchical clusterings .

a.3 Proof of Lower Bound on Number of Trees

The number of trees on leaves is given exactly by , where is the number of internal nodes in the subtree rooted at node Boyles and Welling (2012). Since , this makes the number of trees on leaves . The smallest conceivable value for , which gives us the bound on the number of trees to be , as desired.

Note that this is a loose lower-bound, and that it could be improved upon as follows: say a hierarchical clustering is a caterpillar clustering is every internal node in the underlying tree has two children and the set associated with one of those children as size one. There are caterpillar clustering. To see this, note that the th level (where the root is level 1) of a caterpillar clustering has exactly one leaf for . There are choices for the corresponding singleton sets.

Note, however, that there is a closed form expression for the exact number of unordered hierarchies given by , with n the number of singletons (see Callan (2009); Dale and Moon (1993) for more details and proof).

a.4 Proof of Fact 3


We proceed in a similar way as detailed in Appendix 3, as follows. Given a dataset , pick an element . We consider all possible clusters that contain . Given , then is fixed so as to satisfy and . We want to show that the MAP clustering can be computed recursively in terms of and .

The MAP value is defined as the energy of the clustering with maximal energy among all possible hierarchical clusterings ,


where , . Also, and are the sub-hierarchies in that are rooted at and , respectively. As mentioned earlier, the cluster trellis provides an exact MAP solution conditioned on the fact that the domain of the linkage function is the set of pairs of clusters, and not pairs of trees. Thus, we can rewrite Eq. A.4 grouping together all the hierarchies that have the same clusters , as follows


with . Thus, of a cluster can be written recursively in terms of the MAP values of the sub-clusters of X 12.

a.5 Proof of Partition Function Time Complexity

The partition function is computed for each node in the trellis, and due to the order of computation, at the time of computation for node , the partition functions for all nodes in the subtrellis rooted at node i have already been computed. Therefore, the partition function for a node with elements can be computed in steps (given the pre-computed partition functions for each of the node’s descendants), since the number of nodes for the trellis rooted at node i (with i elements) corresponds to the powerset of i. There are nodes of size , making the total computation .

a.6 Additional Details on PAM50 Experiments

The Pearson correlation coefficient was used for the clustering metric for the PAM50 data set experiments. The correlation clustering input can be represented as a complete weighted graph, , where each edge has weight . The goal is to construct a clustering of the nodes that maximizes the sum of positive within-cluster edge weights minus the sum of all negative across-cluster edge weights. However, the correlations among subsampled pam50 () data set are all positive. To allow more balanced weights among vertices, we transform the weights using and then subtracted the average of all .


  1. Corresponding authors: Craig S. Greenberg , Sebastian Macaluso
  3. Technically, this is the definition of binary hierarchical clustering, which we limit our exposition to in this paper. We note that for k-ary hierarchical clustering, k=2 (i.e., binary clustering) is the most expressive, in that it encodes the greatest number of tree-consistent clusterings. While our algorithms assume a binary tree, they can naturally be extended to k-ary trees, which is left as future work.
  4. In cases where the natural linkage function maps to , the linkage value can be exponentiated, as in the Gibbs distribution.
  5. Our use of an energy is different than the use of the term energy in statistical mechanics. In the Gibbs distribution, the energy for a state is proportional to the log-probability for the system to be in that state.
  6. We abuse notation by overloading to represent both the energy of a hierarchy, , as well as the linkage function for a pair of clusters, .
  7. See Appendix for proof of bound on the number of hierarchical clustering.
  8. To simplify notation we do not explicitly include the branch label in Eq. 3.
  9. The parameters for this dataset are , and (see Cranmer et al. (2019a) for specific details about the model)
  10. The cluster trellis provides an exact solution conditioned on the fact that the domain of the linkage function is the set of pairs of clusters, and not pairs of trees.
  11. Note that for each singleton , we have .
  12. Note that for each singleton , we have .


  1. The nested chinese restaurant process and bayesian nonparametric inference of topic hierarchies. Journal of the ACM (JACM). Cited by: §5.
  2. The time-marginalized coalescent prior for hierarchical clustering. In Advances in Neural Information Processing Systems, pp. 2969–2977. Cited by: §A.3, §1, §4.1, §5.
  3. The anti- jet clustering algorithm. JHEP 04, pp. 063. External Links: Document, 0802.1189 Cited by: §1, §4.1.
  4. A combinatorial survey of identities for the double factorial. External Links: 0906.1317 Cited by: §A.3, §2.
  5. Hierarchical clustering better than average-linkage. In Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 2291–2304. Cited by: §5.
  6. Approximate hierarchical clustering via sparsest cut and spreading metrics. In Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 841–854. Cited by: §5.
  7. Learning concept hierarchies from text with a guided agglomerative clustering algorithm. In Proceedings of the ICML 2005 Workshop on Learning and Extending Lexical Ontologies with Machine Learning Methods, Cited by: §1.
  8. Hierarchical clustering: objective functions and algorithms. Journal of the ACM (JACM) 66 (4), pp. 1–42. Cited by: §5.
  9. Hierarchical clustering beyond the worst-case. In Advances in Neural Information Processing Systems (NeurIPS), Cited by: §5.
  10. Cited by: §4.1, §4.1, footnote 7.
  11. Cited by: §4.1.
  12. Cited by: §4.1.
  13. The permuted analogues of three Catalan sets. Vol. 34. External Links: Document, 0378-3758 Cited by: §A.3, §2.
  14. A cost function for similarity-based hierarchical clustering. In Symposium on Theory of Computing (STOC), Cited by: §1, §5.
  15. Tree-structured stick breaking for hierarchical data. In Advances in neural information processing systems, pp. 19–27. Cited by: §5.
  16. Compact representation of uncertainty in clustering. In Advances in Neural Information Processing Systems 31, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi and R. Garnett (Eds.), pp. 8630–8640. Note: \href Cited by: §1, §2, §2, §3, §3, §5.
  17. Binary to bushy: bayesian hierarchical clustering with the beta coalescent. In Advances in Neural Information Processing Systems (NeurIPS), Cited by: §1, §5.
  18. Probabilistic correlation clustering and image partitioning using perturbed multicuts. In International Conference on Scale Space and Variational Methods in Computer Vision, Cited by: §1, §5.
  19. Pitman-yor diffusion trees. In Conference on Uncertainty in Artificial Intelligence (UAI), Cited by: §1, §5.
  20. Computing exact clustering posteriors with subset convolution. Communications in Statistics-Theory and Methods. Cited by: §1, §5.
  21. Hierarchical clustering using mutual information. EPL (Europhysics Letters) 70 (2), pp. 278. Cited by: §1.
  22. QCD-Aware Recursive Neural Networks for Jet Physics. JHEP 01, pp. 057. External Links: Document, 1702.00748 Cited by: §4.1.
  23. Approximation bounds for hierarchical clustering: average linkage, bisecting k-means, and local search. In Advances in Neural Information Processing Systems, Cited by: §5.
  24. Density modeling and clustering using dirichlet diffusion trees. Bayesian statistics. Cited by: §1, §5.
  25. Nested hierarchical dirichlet processes. IEEE Transactions on Pattern Analysis and Machine Intelligence. Cited by: §5.
  26. Perturb-and-map random fields: using discrete optimization to learn and sample from energy models. In 2011 International Conference on Computer Vision, pp. 193–200. Cited by: §5.
  27. Hierarchical clustering via spreading metrics. The Journal of Machine Learning Research 18 (1), pp. 3077–3111. Cited by: §5.
  28. Finding physics signals with shower deconstruction. Phys. Rev. D84, pp. 074002. External Links: Document, 1102.3480 Cited by: §1.
  29. Finding top quarks with shower deconstruction. Phys. Rev. D87, pp. 054012. External Links: Document, 1211.3140 Cited by: §1.
  30. Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications. Proceedings of the National Academy of Sciences 98 (19), pp. 10869–10874. Cited by: §1, §4.2.
  31. Pvclust: an r package for assessing the uncertainty in hierarchical clustering. Bioinformatics 22 (12), pp. 1540–1542. Cited by: §5.
  32. Bayesian agglomerative clustering with coalescents. In Advances in Neural Information Processing Systems (NeurIPS), Cited by: §1, §5.
  33. UNC microarray database. Note: \href Cited by: §4.2.
  34. Improving dynamic programming strategies for partitioning. Journal of classification 21 (2), pp. 207–230. Cited by: §1.
  35. A discriminative hierarchical model for fast coreference at large scale. In Association for Computational Linguistics (ACL), Cited by: §5.
Comments 0
Request Comment
You are adding the first comment!
How to quickly get a good reply:
  • Give credit where it’s due by listing out the positive aspects of a paper before getting into which changes should be made.
  • Be specific in your critique, and provide supporting evidence with appropriate references to substantiate general statements.
  • Your comment should inspire ideas to flow and help the author improves the paper.

The better we are at sharing our knowledge with each other, the faster we move forward.
The feedback must be of minimum 40 characters and the title a minimum of 5 characters
Add comment
Loading ...
This is a comment super asjknd jkasnjk adsnkj
The feedback must be of minumum 40 characters
The feedback must be of minumum 40 characters

You are asking your first question!
How to quickly get a good answer:
  • Keep your question short and to the point
  • Check for grammar or spelling errors.
  • Phrase it like a question
Test description