1 INTRODUCTION
###### Abstract

Bayesian hierarchical clustering (BHC) is an agglomerative clustering method, where a probabilistic model is defined and its marginal likelihoods are evaluated to decide which clusters to merge. While BHC provides a few advantages over traditional distance-based agglomerative clustering algorithms, successive evaluation of marginal likelihoods and careful hyperparameter tuning are cumbersome and limit the scalability. In this paper we relax BHC into a non-probabilistic formulation, exploring small-variance asymptotics in conjugate-exponential models. We develop a novel clustering algorithm, referred to as relaxed BHC (RBHC), from the asymptotic limit of the BHC model that exhibits the scalability of distance-based agglomerative clustering algorithms as well as the flexibility of Bayesian nonparametric models. We also investigate the reducibility of the dissimilarity measure emerged from the asymptotic limit of the BHC model, allowing us to use scalable algorithms such as the nearest neighbor chain algorithm. Numerical experiments on both synthetic and real-world datasets demonstrate the validity and high performance of our method.

Bayesian Hierarchical Clustering with Exponential Family:
Small-Variance Asymptotics and Reducibility

Juho Lee and Seungjin Choi

Department of Computer Science and Engineering Pohang University of Science and Technology 77 Cheongam-ro, Nam-gu, Pohang 790-784, Korea {stonecold,seungjin}@postech.ac.kr

## 1 Introduction

Agglomerative hierarchical clustering, which is one of the most popular algorithms in cluster analysis, builds a binary tree representing the cluster structure of a dataset (Duda et al., 2001). Given a dataset and a dissimilarity measure between clusters, agglomerative hierarchical clustering starts from leaf nodes corresponding to individual data points and successively merges pairs of nodes with smallest dissimilarities to complete a binary tree.

Bayesian hierarchical clustering (BHC) (Heller and Ghahrahmani, 2005) is a probabilistic alternative of agglomerative hierarchical clustering. BHC defines a generative model on binary trees and compute the probability that nodes are merged under that generative model to evaluate the (dis)similarity between the nodes. Since this (dis)similarity is written as a probability, one can naturally decide a level, where to stop merging according to this probability. Hence, unlike traditional agglomerative clustering algorithms, BHC has a flexibility to infer a proper number of clusters for given data. The source of this flexibility is Dirichlet process mixtures (DPM) (Ferguson, 1973; Antoniak, 1974) used to define the generative model of binary trees. BHC was shown to provide a tight lower bound on the marginal likelihood of DPM (Heller and Ghahrahmani, 2005; Wallach et al., 2010) and to be an alternative posterior inference algorithm for DPM. However, when evaluating the dissimilarity between nodes, one has to repeatedly compute the marginal likelihood of clusters and careful tuning of hyperparameters are required.

In this paper, we study BHC when the underlying distributions are conjugate exponential families. Our contributions is twofold. First, we derive a non-probabilistic relaxation of BHC, referred to as RBHC, by performing small variance asymptotics, i.e., letting the variance of the underlying distribution in the model go to zero. To this end, we use the technique inspired by the recent work (Kulis and Jordan, 2012; Jiang et al., 2012), where the Gibbs sampling algorithm for DPM with conjugate exponential family was shown to approach a -means-like hard clustering algorithm in the small variance limit. The dissimilarity measure in RBHC is of a simpler form, compared to the one in the original BHC. It does not require careful tuning of hyperparameters, and yet has the flexibility of the original BHC to infer a proper number of the clusters in data. It turns out to be equivalent to the dissimilarity proposed in (Telgarsky and Dasgupta, 2012), which was derived in different perspective, minimizing a cost function involving Bregman information (Banerjee et al., 2005). Second, we study the reducibility (Bruynooghe, 1978) of the dissimilarity measure in RBHC. If the dissimilarity is reducible, one can use nearest-neighbor chain algorithm (Bruynooghe, 1978) to build a binary tree with much smaller complexities, compared to the greedy algorithm. The nearest neighbor chain algorithm builds a binary tree of data points in time and space, while the greedy algorithm does in time and space. We argue is that even though we cannot guarantee that the dissimilarity in RBHC is always reducible, it satisfies the reducibility in many cases, so it is fine to use the nearest-neighbor chain algorithm in practice to speed up building a binary tree using the RBHC. We also present the conditions where the dissimilarity in RBHC is more likely to be reducible.

## 2 Background

We briefly review agglomerative hierarchical clustering, Bayesian hierarchical clustering, and Bregman clustering, on which we base the development of our clustering algorithm RBHC. Let be a set of data points. Denote by a set of indices. A partition of is a set of disjoint nonempty subsets of whose union is . The set of all possible partitions of is denoted by For instance, in the case of , an exemplary random partition that consists of three clusters is ; its members are indexed by . Data points in cluster is denoted by for . Dissimilarity between and is given by .

### 2.1 Agglomerative Hierarchical Clustering

Given and , a common approach to building a binary tree for agglomerative hierarchical clustering is the greedy algorithm, where pairs of nodes are merged as one moves up the hierarchy, starting in its leaf nodes (Algorithm 1). A naive implementation of the greedy algorithm requires in time since each iteration needs to find the pair of closest nodes and the algorithm runs over iterations. It requires in space to store pairwise dissimilarities. The time complexity can be reduced to with priority queue, and can be reduced further for some special cases; for example, in single linkage clustering where

 d(c0,c1)=mini∈c0,j∈c1d({i},{j}),

the time complexity is since building a binary tree is equivalent to finding the minimum spanning tree in the dissimilarity graph. Also, in case of the centroid linkage clustering where the distance between clusters are defined as the Euclidean distance between the centers of the clusters, one can reduce the time complexity to where  (Day and Edelsbrunner, 1984).

### 2.2 Reducibility

Two nodes and are reciprocal nearest neighbors (RNNs) if the dissimilarity is minimal among all dissimilarities from to elements in and also minimal among all dissimilarities from . Dissimilarity is reducible (Bruynooghe, 1978), if for any ,

 d(c0,c1)≤min{d(c0,c2),d(c1,c2)} ⇒min{d(c0,c2),d(c1,c2)}≤d(c0∪c1,c2). (1)

The reducibility ensures that if and are reciprocal nearest neighbors (RNNs), then this pair of nodes are the closest pair that the greedy algorithm will eventually find by searching on an entire space. Thus, the reducibility saves the effort of finding a pair of nodes with minimal distance. Assume that are RNNs. Merging become problematic only if, for other RNNs , merging changes the nearest neighbor of (or ) to . However this does not happen since

 d(c2,c3)≤min{d(c2,c0),d(c3,c0)} ⇒min{d(c2,c0),d(c3,c0)}≤d(c2∪c3,c0) ⇒d(c0,c1)≤d(c2∪c3,c0).

The nearest neighbor chain algorithm (Bruynooghe, 1978) enjoys this property and find pairs of nodes to merge by following paths in the nearest neighbor graph of the nodes until the paths terminate in pairs of mutual nearest neighbors (Algorithm 2). The time and space complexity of the nearest neighbor chain algorithm are and , respectively. The reducible dissimilarity includes those of single linkage and Ward’s method (Ward, 1963).

### 2.3 Agglomerative Bregman Clustering

Agglomerative clustering with Bregman divergence (Bregman, 1967) as a dissimilarity measure was recently developed in (Telgarsky and Dasgupta, 2012), where the clustering was formulated as the minimization of the sum of cost based on the Bregman divergence between the elements in a cluster and center of the cluster. This cost is closely related with the Bregman Information used for Bregman hard clustering (Banerjee et al., 2005)). In (Telgarsky and Dasgupta, 2012), the dissimilarity between two clusters is defined as the change of cost function when they are merged. As will be shown in this paper, this dissimilarity turns out to be identical to the one we derive from the asymptotic limit of BHC. Agglomerative clustering with Bregman divergence showed better accuracies than traditional agglomerative hierarchical clustering algorithms on various real datasets (Telgarsky and Dasgupta, 2012).

### 2.4 Bayesian Hierarchical Clustering

Denote by a tree whose leaves are for . A binary tree constructed by BHC (Heller and Ghahrahmani, 2005) explains the generation of with two hypotheses compared in considering each merge: (1) the first hypothesis where all elements in were generated from a single cluster ; (2) the alternative hypothesis where has two sub-clusters and , each of which is associated with subtrees and , respectively. Thus, the probability of in tree is written as:

 p(Xc|Tc)=p(Hc)p(Xc|Hc) (2) + {1−p(Hc)}p(Xc0|Tc0)p(Xc1|Tc1),

where the prior is recursively defined as

 γ{i}\lx@stackreldef=α,γc\lx@stackreldef=αΓ(|c|)+γc0γc1, (3) p(Hc)\lx@stackreldef=αΓ(|c|)/γc, (4)

and the likelihood of under is given by

 p(Xc|Hc)\lx@stackreldef=∫{∏i∈cp(xi|θ)}p(dθ). (5)

Now, the posterior probability of , which is the probability of merging , is computed by Bayes rule:

 p(Hc|Xc,Tc)=p(Hc)p(Xc|Hc)p(Xc|Tc). (6)

In (Lee and Choi, 2014), an alternative formulation for the generative probability was proposed, which writes the generative process via the unnormalized probabilities (potential functions):

 (7) ϕ(Xc|Tc)\lx@stackreldef=γcp(Xc|Tc). (8)

With these definitions, (2) is written as

 ϕ(Xc|Tc)=ϕ(Xc|Hc)+ϕ(Xc0|Tc0)ϕ(Xc1|Tc1), (9)

and the posterior probability of is written as

 (10)

One can see that the ratio inside behaves as the dissimilarity between and :

 d(c0,c1)\lx@stackreldef=ϕ(Xc0|Tc0)ϕ(Xc1|Tc1)ϕ(Xc|Hc). (11)

Now, building a binary tree follows Algorithm 1 with the distance in (11). Beside this, BHC has a scheme to determine the number of clusters. It was suggested in (Heller and Ghahrahmani, 2005) that the tree can be cut at points . It is equivalent to say that the we stop the algorithm if the minimum over is greater than 1. Note that once the tree is cut, the result contains forests, each of which involves a cluster.

BHC is closely related to the marginal likelihood of DPM; actually, the prior comes from the predictive distribution of DP prior. Moreover, it was shown that computing to build a tree naturally induces a lower bound on the marginal likelihood of DPM,  (Heller and Ghahrahmani, 2005):

 Γ(α)Γ(α+n)ϕ(Xc|Tc)≤p\@setsize7pt\vipt\@viptDPM(Xc). (12)

Hence, in the perspective of the posterior inference algorithm for DPM, building tree in BHC is equivalent to computing the approximate marginal likelihood. Also, cutting the tree at the level where corresponds finding the MAP clustering of .

In (Heller and Ghahrahmani, 2005), the time complexity was claimed to be . However, this does not count the complexity required to find a pair with the smallest dissimilarity via sorting. For instance, with a sorting algorithm using priority queues, BHC requires in time.

The dissimilarity is very sensitive to tuning the hyperparameters involving the distribution over parametes required to compute . An EM-like iterative algorithm was proposed in (Heller and Ghahrahmani, 2005) to tune the hyperparameters, but the repeated execution of the algorithm is infeasible for large-scale data.

### 2.5 Bregman Diverences and Exponential Families

###### Definition 1.

((Bregman, 1967)) Let be a strictly convex differentiable function defined on a convex set. Then, the Bregman divergence, , is defined as

 Bφ(x,y)=φ(x)−φ(y)−⟨x−y,∇φ(y)⟩. (13)

Various divergences belong to the Bregman divergence. For instance, Euclidean distance or KL divergence is Bregman divergence, when or , respectively.

The exponential family distribution over with natural parameter is of the form:

 p(x|θ)=exp{⟨t(x),θ⟩−ψ(θ)−h(x)}, (14)

where is sufficient statistics, is a log-partition function, and is a base distribution. We assume that is regular ( is open) and is minimal ( s.t. ). Let be the convex conjugate of :

 φ(μ)=supθ∈Θ{⟨μ,θ⟩−ψ(θ)}. (15)

Then, the Bregman divergence and the exponential family has the following relationship:

###### Theorem 1.

(Banerjee et al., 2005) Let be the conjugate function of . Let be the natural parameter and be the corresponding expectation parameter, i.e., . Then is uniquely expressed as

 p(x|θ) = exp{−Bφ(t(x),μ)} (16) exp{φ(t(x))−h(x)}.

The conjugate prior for has the form:

 p(θ|ν,τ)=exp{⟨τ,θ⟩−νψ(θ)−ξ(ν,τ)}. (17)

can also be expressed with the Bregman divergence:

 p(θ|ν,τ)=exp{−νBφ(τ/ν,μ)} (18) ×exp{νφ(τ/ν)−ξ(ν,τ)}.

### 2.6 Scaled Exponential Families

Let , and . The scaled exponential family with scale is defined as follows (Jiang et al., 2012):

 p(x|˜θ) = exp{⟨t(x),˜θ⟩−˜ψ(˜θ)−hβ(x)} (19) = exp{β⟨t(x),θ⟩−βψ(θ)−hβ(x)}.

For this scaled distribution, the mean remains the same, and the covariance becomes  (Jiang et al., 2012). Hence, the distribution is more concentrated around its mean. The scaled distribution in the Bregman divergence form is

 p(x|˜θ) = exp{−βBφ(t(x),μ)} (20) exp{βφ(t(x))−hβ(x)}.

According to , is defined with , . Actually, this yields the same prior as non-scaled distribution.

 p(˜θ|˜ν,˜τ) = exp{⟨˜θ,˜τ⟩−˜ν˜ψ(˜θ)−ξβ(˜ν,˜τ)} (21) = exp{⟨θ,τ⟩−νψ(θ)−ξ(ν,τ)}.

## 3 Main Results

We present the main contribution of this paper. From now on, we assume that the likelihood and prior in Eq. (5) are scaled exponential families defined in Section 2.6.

### 3.1 Small-Variance Asymptotics for BHC

The dissimilarity in BHC can be rewritten as follows:

 d(c0,c1) = ϕ(Xc0|Tc0)ϕ(Xc1|Tc1)ϕ(Xc|Hc) (22) = αΓ(|c0|)Γ(|c1|)p(Xc0|Hc0)p(Xc1|Hc1)Γ(|c|)p(Xc|Hc) × {1+d(c00,c01)}{1+d(c10,c11)},

where and . We first analyze the term , as in (Jiang et al., 2012).

 (23) = βd∫exp{⟨θ,τ+β∑i∈ct(xi)⟩−(ν+β|c|)ψ(θ) −∑i∈chβ(xi)−ξ(ν,τ)}dθ = βdexp{(ν+β|c|)φ(μc)−∑i∈chβ(xi)−ξ(ν,τ)} ×∫exp{−(ν+β|c|)Bφ(μc,μ)}dθ,

where

 μc\lx@stackreldef=τ+β∑i∈ct(xi)ν+β|c|. (24)

Note that is a function of . The term inside the integral of Eq. (23) has a local minimum at , and thus can be approximated by Laplace’s method:

 =βdexp{(ν+β|c|)φ(μc)−∑i∈chβ(xi)−ξ(ν,τ)} (2πν+β|c|)d2∣∣∣∂2Bφ(μc,μ)∂θ∂θ⊤∣∣∣−12μ=μc{1+O(β−1)}. (25)

It follows from this result that, as , the asymptotic limit of dissimilarity in (22) is given by

 limβ→∞αΓ(|c0|)Γ(|c1|)p(Xc0|Hc0)p(Xc1|Hc1)Γ(|c|)p(Xc|Hc) ∝ limβ→∞αβd2exp{β(|c0|φ(μc0)+|c1|φ(μc1)−|c|φ(μc))},

Let , then we have

 =limβ→∞exp{β(|c0|φ(μc0)+|c1|φ(μc1)−|c|φ(μc)−λ)}.

As , the term inside the exponent converges to

 |c0|φ(¯tc0)+|c1|φ(¯tc1)−|c|φ(¯tc)−λ, (26)

where

 ¯tc\lx@stackreldef=1|c|∑i∈ct(xi), (27)

and this is the average of sufficient statistics for cluster . With this result, we define a new dissimilarity as

 d⋆(c0,c1) \lx@stackreldef= |c0|φ(¯tc0)+|c1|φ(¯tc1)−|c|φ(¯tc) (28) = |c0|φ(¯tc0)+|c1|φ(¯tc1) − (|c0|+|c1|)φ(|c0|¯tc0+|c1|¯tc1|c0|+|c1|).

Note that is always positive since is convex. If , the limit Eq. (3.1) diverges to , and converges to zero otherwise. When , Eq. (3.1) is the same as the limit of the dissimilarity , and thus the dissimilarity diverges when and converges otherwise. When , assume that the dissimilarities of children and converges to zero. From Eq. (22), we can easily see that converges only if . In summary,

 (29)

In similar way, we can also prove the following:

 limβ→∞d(c0,c1)d(c2,c3)={0 if d⋆(c0,c1)

which means that comparing two dissimilarities in original BHC is equivalent to comparing the new dissimilarities , and we can choose the next pair to merge by comparing instead of .

With Eqs. (29) and (30), we conclude that when , BHC reduces to Algorithm 1 with dissimilarity measure and threshold , where the algorithm terminates when the minimum exceeds .

On the other hand, a simple calculation yields

 d⋆(c0,c1)=|c0|Bφ(¯tc0,¯tc)+|c1|Bφ(¯tc1,¯tc),

which is exactly same as the dissimilarity proposed in (Telgarsky and Dasgupta, 2012). Due to the close relationship between exponential family and the Bregman divergence, the dissimilarities derived from two different perspective has the same form.

As an example, assume that and . We have and

 d⋆(c0,c1)=|c0||c1|∥∥xc0−xc1∥∥22σ2(|c0|+|c1|), (31)

which is same as the Ward’s merge cost (Ward, 1963), except for the constant . Other examples can be found in (Banerjee et al., 2005).

Note that does not need hyperparameter tunings, since the effect of prior is ignored as . This provides a great advantage over BHC which is sensitive to the hyperparameter settings.

Smoothing: In some particular choice of , the singleton clusters may have degenerate values (Telgarsky and Dasgupta, 2012). For example, when , the function has degenerate values when . To handle this, we use the smoothing strategy proposed in (Telgarsky and Dasgupta, 2012); instead of the original function , we use the smoothed functions and defined as follows:

 φ0(x)\lx@stackreldef=φ((1−α)x+αγ), (32) φ1(x)\lx@stackreldef=φ(x+αγ), (33)

where be arbitrary constant and must in the relative interior of the domain of . In general, we use as a smoothed function, but we can also use when the domain of is a convex cone.

Heuristics for choosing : As in (Kulis and Jordan, 2012), we choose the threshold value . Fortunately, we found that the clustering accuracy was not extremely sensitive to the choice of ; merely selecting the scale of could result in reasonable accuracy. There can be many simple heuristics, and the one we found effective is to use the -means clustering. With the very rough guess on the desired number of clusters , we first run the -means clustering (with Euclidean distance) with (we fixed for all experiments). Then, was set to the average value of dissimilarities between the all pair of centers.

### 3.2 Reducibility of d⋆(⋅,⋅)

The relaxed BHC with small-variance asymptotics still has the same complexities to BHC. If we can show that is reducible, we can reduce the complexities by adapting the nearest neighbor chain algorithm. Unfortunately, is not reducible in general (one can easily find counter-examples for some distributions). However, we argue that is reducible in many cases, and thus applying the nearest neighbor chain algorithm as if is reducible does not degrades the clustering accuracy. In this section, we show the reason by analyzing .

At first, we investigate a term inside the dissimilarity:

 f(¯tc)\lx@stackreldef=|c|φ(¯tc). (34)

The second-order Taylor expansion of this function around the mean yields:

 f(¯tc)=|c|φ(μ)+|c|φ(1)(μ)⊤(¯tc−μ) +Δφ(¯tc,μ)+ϵφ(¯tc,μ), (35)

where is the th order derivative of , and

 Δφ(¯tc)\lx@stackreldef=|c|2(¯tc−μ)⊤φ(2)(μ)(¯tc−μ), (36) ϵφ(¯tc)\lx@stackreldef=|c|∑|α|=3∂αφ(ν)α!(¯tc−μ)α. (37)

Here, is the multi-index notation, and for some . The term plays an important role in analyzing the reducibility of . To bound the error term , we assume that 111This assumption holds for the most of distributions we will discuss (if properly smoothed), but not holds in general.. As earlier, assume that is a average of observations generated from the same scaled-exponential family distribution:

 x1,…,xn\lx@stackreliid∼p(⋅|β,θ),¯tc=1|c|∑i∈ct(xi). (38)

By the property of the log-partition function of the exponential family distribution, we get the following results:

 E[ϵφ(¯tc)]=1β2|c|∑|α|=3∂αφ(ν)∂αψ(θ)α!. (39)

One can see that the expected error converges to zero as . Also, it can be shown that the expectation of the ratio of two terms converges to zero as :

 limβ→∞E[ϵφ(¯tc)Δφ(¯tc)]=0, (40)

which means that asymptotically dominates (detailed derivations are given in the supplementary material). Hence, we can safely approximate up to second order term.

Now, let and be clusters belong to the same super-cluster (i.e. and were generated from the same mean vector ). We don’t need to investigate the case where the pair belong to a different cluster, since then they will not be merged anyway in our algorithm. By the independence, . Applying the approximation (3.2), we have

 d⋆(c0,c1)≈Δφ(¯tc0)+Δφ(¯tc1)−Δφ(¯tc0∪c1) (41) =|c0||c1|2(|c0|+|c1|)(¯tc0−¯tc1)⊤φ(2)(μ)(¯tc0−¯tc1).

This approximation, which we will denote as , is a generalization of the Ward’s cost (31) from Euclidean distance to Mahalanobis distance with matrix (note that this approximation is exact for the spherical Gaussian case). More importantly, is reducible.

###### Theorem 2.
 ˜d⋆(c0,c1)≤min{˜d⋆(c0,c2),˜d⋆(c1,c2)} ⇒min{˜d⋆(c0,c2),˜d⋆(c1,c2)}≤˜d⋆(c0∪c1,c2).
###### Proof.

For the Ward’s cost, the following Lance-Williams update formula (Lance and Williams, 1967) holds for :

 ˜d⋆(c0∪c1,c2)=(|c0|+|c2|)˜d⋆(c0,c2)|c0|+|c1|+|c2| (42) +(|c1|+|c2|)˜d⋆(c1,c2)−|c2|˜d⋆(c0,c1)|c0|+|c1|+|c2|.

Hence, by the assumption, we get

As a result, the dissimilarity is reducible provided that the Taylor’s approximation (3.2) is accurate. In such a case, one can apply the nearest-neighbor chain algorithm with , treating as if it is reducible to build a binary tree in time and space. Unlike Algorithm 2, we have a threshold to determine the number of clusters, and we present a slightly revised algorithm (Algorithm 3). Note again that the revised algorithm generates forests instead of trees.

## 4 Experiments

### 4.1 Experiments on Synthetic Data

Testing the reducibility of : We tested the reducibility of empirically. We repeatedly generated the three clusters and from the exponential family distributions, and counted the number of cases where the dissimilarities between those clusters are not reducible. We also measured the average value of the relative error to support our arguments in Section 3.2. We tested three distributions; Poisson, multinomial and Gaussian. At each iteration, we first sampled the size of the clusters , and from . Then we sampled the three clusters from one of the three distributions, and computed and . We then first checked whether these three values satisfy the reducibility condition (2.2) (for example, if is the smallest, we checked if ). Then, for , we computed the approximate value and measured the relative error

 2×∣∣ ∣∣d⋆(c0,c1)−˜d⋆(c0,c1)d⋆(c0,c1)+˜d⋆(c0,c1)∣∣ ∣∣. (43)

We repeated this process for times for the three distributions and measured the average values. For Poisson distribution, we sampled the mean and sampled the data from . We smoothed the function as to prevent degenerate function values. For multinomial distribution, we tested the case where the dimension is and the number of trials is 5. We sampled the parameter where is -dimensional one vector, and sampled the data from . We smoothed the function