# Diversity-Promoting Bayesian Learning of Latent Variable Models

## Abstract

To address three important issues involved in latent variable models (LVMs), including capturing infrequent patterns, achieving small-sized but expressive models and alleviating overfitting, several studies have been devoted to “diversifying” LVMs, which aim at encouraging the components in LVMs to be diverse. Most existing studies fall into a frequentist-style regularization framework, where the components are learned via point estimation. In this paper, we investigate how to “diversify” LVMs in the paradigm of Bayesian learning. We propose two approaches that have complementary advantages. One is to define a diversity-promoting mutual angular prior which assigns larger density to components with larger mutual angles and use this prior to affect the posterior via Bayes’ rule. We develop two efficient approximate posterior inference algorithms based on variational inference and MCMC sampling. The other approach is to impose diversity-promoting regularization directly over the post-data distribution of components. We also extend our approach to “diversify” Bayesian nonparametric models where the number of components is infinite. A sampling algorithm based on slice sampling and Hamiltonian Monte Carlo is developed. We apply these methods to “diversify” Bayesian mixture of experts model and infinite latent feature model. Experiments on various datasets demonstrate the effectiveness and efficiency of our methods.

Pengtao Xie, Jun Zhu and Eric P. Xing

Keywords: Diversity-Promoting, Latent Variable Models, Mutual Angular Prior, Bayesian Learning, Variational Inference

## 1 Introduction

Latent variable models (LVMs) (Bishop, 1998; Knott and Bartholomew, 1999; Blei, 2014) are a major workhorse in machine learning (ML) to extract latent patterns underlying data, such as themes behind documents and motifs hiding in genome sequences. To properly capture these patterns, LVMs are equipped with a set of components, each of which is aimed to capture one pattern and is usually parametrized by a vector. For instance, in topic models (Blei et al., 2003), each component (referred to as topic) is in charge of capturing one theme underlying documents and is represented by a multinomial vector.

While existing LVMs have demonstrated great success, they are less capable of addressing two new problems that have emerged due to the growing volume and complexity of data. First, it is often the case that the frequency of patterns is distributed in a power-law fashion (Wang et al., 2014; Xie et al., 2015), where a handful of patterns occur very frequently whereas most patterns are of low frequency. Existing LVMs lack the capability to capture infrequent patterns, possibly because of the design of the objective function used for training. For example, a maximum likelihood estimator is rewarded for modeling the frequent patterns well, as they are the major contributors to the likelihood function. Infrequent patterns, on the other hand, contribute much less to the likelihood, so modeling them well is not very rewarding and LVMs tend to ignore them. Yet infrequent patterns often carry valuable information and should not be ignored. For instance, in a topic modeling based recommendation system, an infrequent topic (pattern) like losing weight is more likely to improve the click-through rate than a frequent topic like politics. Second, the number of components $K$ strikes a tradeoff between model size (complexity) and modeling power. For a small $K$, the model is not expressive enough to sufficiently capture the complex patterns behind data; for a large $K$, the model would be of large size and complexity, incurring high computational overhead. How to reduce model size while preserving modeling power is a challenging issue.

To cope with the two problems, several studies (Zou and Adams, 2012; Xie et al., 2015; Xie, 2015) propose a “diversification” approach, which encourages the components of an LVM to be mutually “dissimilar”. First, regarding capturing infrequent patterns, as posited in (Xie et al., 2015), “diversified” components are expected to be less aggregated over frequent patterns, so that part of them is spared to cover the infrequent patterns. Second, concerning shrinking model size without compromising modeling power, Xie (2015) argued that “diversified” components bear less redundancy and are mutually complementary, making it possible to capture information sufficiently well with a small set of components, i.e., to obtain LVMs possessing high representational power and low computational complexity.

The existing studies (Zou and Adams, 2012; Xie et al., 2015; Xie, 2015) of “diversifying” LVMs mostly focus on point estimation (Wasserman, 2013) of the model components, under a frequentist-style regularized optimization framework. In this paper, we study how to promote diversity under an alternative learning paradigm: Bayesian inference (Jaakkola and Jordan, 1997; Bishop and Tipping, 2003; Neal, 2012), where the components are treated as random variables whose posterior distribution is computed from data under certain priors. Compared with point estimation, Bayesian learning offers complementary benefits. First, it offers a “model-averaging” (Jaakkola and Jordan, 1997; Bishop and Tipping, 2003) effect for LVMs when they are used for decision-making and prediction, because the parameters are integrated out under a posterior distribution, which potentially alleviates overfitting on training data. Second, it provides a natural way to quantify the uncertainty of model parameters and of the downstream decisions and predictions made from them (Jaakkola and Jordan, 1997; Bishop and Tipping, 2003; Neal, 2012). Affandi et al. (2013) investigated the “diversification” of Bayesian LVMs using the determinantal point process (DPP) (Kulesza et al., 2012) prior. While Markov chain Monte Carlo (MCMC) methods (Affandi et al., 2013) have been developed for approximate posterior inference under the DPP prior, the DPP is not amenable to the other mainstream paradigm of approximate inference, variational inference (Wainwright et al., 2008), which is usually more efficient (Hoffman et al., 2013) than MCMC. In this paper, we propose alternative diversity-promoting priors that overcome this limitation.

We propose two approaches that have complementary advantages to perform diversity-promoting Bayesian learning of LVMs. Following (Xie et al., 2015), we adopt a notion of diversity under which component vectors are more diverse if the pairwise angles between them are larger. First, we define mutual angular Bayesian network (MABN) priors over the components, which assign higher probability density to components that have larger mutual angles, and use these priors to affect the posterior via Bayes’ rule. Specifically, we build a Bayesian network (Koller and Friedman, 2009) whose nodes represent the directional vectors of the components and whose local probabilities are parameterized by von Mises-Fisher (Mardia and Jupp, 2009) distributions that entail an inductive bias towards vectors with larger mutual angles. The MABN priors are amenable to approximate posterior inference of model components. In particular, they facilitate variational inference, which is usually more efficient than MCMC sampling. Second, because it is not flexible (or even possible) to define priors that capture certain diversity-promoting effects, such as a small variance of mutual angles, we adopt a posterior regularization approach (Zhu et al., 2014b), in which a diversity-promoting regularizer is directly imposed over the post-data distributions to encourage diversity; the regularizer can be flexibly defined to accommodate various desired diversity-promoting goals. We instantiate the two approaches on the Bayesian mixture of experts model (BMEM) (Waterhouse et al., 1996), and experiments demonstrate the effectiveness and efficiency of our approaches.

We also study how to “diversify” Bayesian nonparametric LVMs (BN-LVMs) (Ferguson, 1973; Ghahramani and Griffiths, 2005; Hjort et al., 2010). Unlike parametric LVMs, where the number of components is set to a finite value and does not change throughout the execution of the algorithm, in BN-LVMs the number of components is unbounded and can in principle be infinite. As more data accumulates, new components are dynamically added. Compared with parametric LVMs, BN-LVMs possess the following advantages: (1) they are highly flexible and adaptive: if new data cannot be well modeled by existing components, new components are automatically invoked; (2) the “best” number of components is determined according to the fit to data, rather than being set manually, which is a challenging task even for domain experts. To “diversify” BN-LVMs, we extend the MABN prior to an Infinite Mutual Angular (IMA) prior that encourages infinitely many components to have large angles. In this prior, the components are mutually dependent, which poses great challenges for posterior inference. We develop an efficient sampling algorithm based on slice sampling (Teh and Ghahramani, 2007) and Riemann manifold Hamiltonian Monte Carlo (Girolami and Calderhead, 2011). We apply the IMA prior to induce diversity in the infinite latent feature model (ILFM) (Ghahramani and Griffiths, 2005), and experiments on various datasets demonstrate that the IMA prior is able to (1) achieve better performance with fewer components; (2) better capture infrequent patterns; and (3) reduce overfitting.

The major contributions of this work are:

• We propose a mutual angular Bayesian network (MABN) prior which is biased towards components having large mutual angles, to promote diversity in Bayesian LVMs.

• We develop an efficient variational inference method for posterior inference of model components under the MABN priors.

• To flexibly accommodate various diversity-promoting effects, we study a posterior regularization approach which directly imposes diversity-promoting regularization over the post-data distributions.

• We extend the MABN prior from the finite case to the infinite case and apply it to “diversify” Bayesian nonparametric models.

• We develop an efficient sampling algorithm based on slice sampling and Riemann manifold Hamiltonian Monte Carlo for “diversified” BN-LVMs.

• Using Bayesian mixture of experts model and infinite latent feature model as study cases, we empirically demonstrate the effectiveness and efficiency of our methods.

The rest of the paper is organized as follows. Section 2 reviews related work. In Sections 3 and 4, we introduce how to promote diversity in Bayesian parametric and nonparametric LVMs, respectively. Section 5 gives experimental results and Section 6 concludes the paper.

## 2 Related Work

Recent works (Zou and Adams, 2012; Xie et al., 2015; Xie, 2015) have studied the diversification of components in LVMs under a point estimation framework. Zou and Adams (2012) leverage the determinantal point process (DPP) (Kulesza et al., 2012) to promote diversity in latent variable models. Xie et al. (2015) propose a mutual angular regularizer that encourages model components to be mutually different where the dissimilarity is measured by angles. Cogswell et al. (2015) define a covariance-based regularizer to reduce the correlation among hidden units in neural networks, for the sake of alleviating overfitting.

Diversity-promoting Bayesian learning of LVMs has been investigated in (Affandi et al., 2013), which utilizes the DPP prior to induce bias towards diverse components. They develop a Gibbs sampling (Gilks, 2005) algorithm. But the determinant in DPP makes variational inference based algorithms very difficult to derive. Our conference version of the paper (Xie et al., 2016) has introduced a mutual angular prior to “diversify” Bayesian parametric LVMs. This work extends the study of “diversification” to nonparametric models where the number of components is infinite.

Diversity-promoting regularization is investigated in other problems as well, such as ensemble learning and classification. In ensemble learning, many studies (Kuncheva and Whitaker, 2003; Banfield et al., 2005; Partalas et al., 2008; Yu et al., 2011) explore how to select a diverse subset of base classifiers or regressors, with the aim to improve generalization performance and reduce computational complexity. In multi-way classification, Malkin and Bilmes (2008) propose to use the determinant of a covariance matrix to encourage “diversity” among classifiers. Jalali et al. (2015) propose a class of variational Gram functions (VGFs) to promote pairwise dissimilarity among classifiers.

## 3 Diversity-Promoting Bayesian Learning of Parametric Latent Variable Models

In this section, we study how to “diversify” parametric Bayesian LVMs where the number of components is finite. We investigate two approaches: prior control and posterior regularization, which have complementary advantages.

### 3.1 Diversity-Promoting Mutual Angular Prior

The first approach we take is to define a prior that has an inductive bias towards components that are more “diverse” and to use it to affect the posterior via Bayes’ rule. We refer to this approach as prior control. While diversity can be defined in various ways, following (Xie et al., 2015) we adopt the notion that a set of component vectors is considered more diverse if the pairwise angles between them are larger. We desire the prior to have two traits. First, to favor diversity, it should assign a higher density to components having larger mutual angles. Second, it should facilitate posterior inference. In Bayesian learning, the ease of posterior inference relies heavily on the prior (Blei and Lafferty, 2006; Wang and Blei, 2013).

One possible solution is to turn the mutual angular regularizer $\Omega(\mathbf{A})$ (Xie et al., 2015), which encourages a set of component vectors $\mathbf{A}=\{\mathbf{a}_i\}_{i=1}^{K}$ to have large mutual angles, into a distribution $p(\mathbf{A})=\frac{1}{Z}\exp(\Omega(\mathbf{A}))$ based on the Gibbs measure (Kindermann et al., 1980), where $Z$ is the partition function guaranteeing that $p(\mathbf{A})$ integrates to one. The concern is that it is unclear whether $Z=\int\exp(\Omega(\mathbf{A}))\,d\mathbf{A}$ is finite, i.e., whether $p(\mathbf{A})$ is proper. When an improper prior is utilized in Bayesian learning, the posterior is also highly likely to be improper, except in a few special cases (Wasserman, 2013). Performing inference on improper posteriors is problematic.

Here we define mutual angular Bayesian network (MABN) priors possessing the aforementioned two traits, based on Bayesian networks (Koller and Friedman, 2009) and the von Mises-Fisher distribution (Mardia and Jupp, 2009). For technical convenience, we decompose each real-valued component vector $\mathbf{a}_i$ into $\mathbf{a}_i=g_i\tilde{\mathbf{a}}_i$, where $g_i=\|\mathbf{a}_i\|_2$ is the magnitude and $\tilde{\mathbf{a}}_i$ is the direction ($\|\tilde{\mathbf{a}}_i\|_2=1$). Let $\tilde{\mathbf{A}}=\{\tilde{\mathbf{a}}_i\}_{i=1}^{K}$ denote the directional vectors. Note that the angle between two vectors is invariant to their magnitudes; thus the mutual angles of the component vectors in $\mathbf{A}$ are the same as the angles of the directional vectors in $\tilde{\mathbf{A}}$. We first construct a prior which prefers vectors in $\tilde{\mathbf{A}}$ to possess large angles. The basic idea is to use a Bayesian network (BN) to characterize the dependency among directional vectors and to design local probabilities that entail an inductive bias towards large mutual angles. In the BN shown in Figure 1, each node $i$ represents a directional vector $\tilde{\mathbf{a}}_i$ and its parents $\mathrm{pa}(\tilde{\mathbf{a}}_i)$ are the nodes $1,\dots,i-1$. We define a local probability at node $i$ to encourage $\tilde{\mathbf{a}}_i$ to have large mutual angles with $\tilde{\mathbf{a}}_1,\dots,\tilde{\mathbf{a}}_{i-1}$. Since these directional vectors lie on a sphere, we use the von Mises-Fisher (vMF) distribution to model them. The probability density function of the vMF distribution is $f(\mathbf{x})=C_p(\kappa)\exp(\kappa\boldsymbol{\mu}^\top\mathbf{x})$, where the random variable $\mathbf{x}\in\mathbb{R}^p$ lies on a $(p-1)$-dimensional sphere ($\|\mathbf{x}\|_2=1$), $\boldsymbol{\mu}$ is the mean direction with $\|\boldsymbol{\mu}\|_2=1$, $\kappa>0$ is the concentration parameter and $C_p(\kappa)$ is the normalization constant. The local probability at node $i$ is defined as a von Mises-Fisher (vMF) distribution whose density is

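$$p(\tilde{\mathbf{a}}_i\mid\mathrm{pa}(\tilde{\mathbf{a}}_i))=C_p(\kappa)\exp\left(\kappa\left(-\frac{\sum_{j=1}^{i-1}\tilde{\mathbf{a}}_j}{\left\|\sum_{j=1}^{i-1}\tilde{\mathbf{a}}_j\right\|_2}\right)^{\top}\tilde{\mathbf{a}}_i\right) \tag{1}$$

with mean direction $-\sum_{j=1}^{i-1}\tilde{\mathbf{a}}_j\big/\left\|\sum_{j=1}^{i-1}\tilde{\mathbf{a}}_j\right\|_2$.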

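Now we explain why this local probability favors large mutual angles. Since $\tilde{\mathbf{a}}_i$ and $\tilde{\mathbf{a}}_j$ are unit-length vectors, $\tilde{\mathbf{a}}_j^\top\tilde{\mathbf{a}}_i$ is the cosine of the angle between $\tilde{\mathbf{a}}_i$ and $\tilde{\mathbf{a}}_j$. If $\tilde{\mathbf{a}}_i$ has larger angles with $\{\tilde{\mathbf{a}}_j\}_{j=1}^{i-1}$, then the average negative cosine similarity $-\frac{1}{i-1}\sum_{j=1}^{i-1}\tilde{\mathbf{a}}_j^\top\tilde{\mathbf{a}}_i$, to which the exponent in Eq.(1) is proportional for fixed parents, would be larger; accordingly $p(\tilde{\mathbf{a}}_i\mid\mathrm{pa}(\tilde{\mathbf{a}}_i))$ would be larger. This statement is true for all $i>1$. As a result, $p(\tilde{\mathbf{A}})$ would be larger if the directional vectors have larger mutual angles. For the magnitudes of the components, which have nothing to do with the mutual angles, we sample $g_i$ for each component independently from a gamma distribution with shape parameter $\alpha_1$ and rate parameter $\alpha_2$.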

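The generative process of $\mathbf{A}$ is summarized as follows:

• Draw $\tilde{\mathbf{a}}_1\sim\mathrm{vMF}(\boldsymbol{\mu}_0,\kappa)$

• For $i=2,\dots,K$, draw $\tilde{\mathbf{a}}_i\sim\mathrm{vMF}\left(-\sum_{j=1}^{i-1}\tilde{\mathbf{a}}_j\big/\left\|\sum_{j=1}^{i-1}\tilde{\mathbf{a}}_j\right\|_2,\ \kappa\right)$

• For $i=1,\dots,K$, draw $g_i\sim\mathrm{Gamma}(\alpha_1,\alpha_2)$

• For $i=1,\dots,K$, let $\mathbf{a}_i=g_i\tilde{\mathbf{a}}_i$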
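The generative process above can be sketched in a few lines for the special case of two-dimensional components, where a von Mises-Fisher draw reduces to a von Mises draw over an angle. This is a minimal illustration under that assumption, not the authors' implementation; the function name `sample_mabn_2d`, the default hyperparameters, and the uniform fallback used when the parent directions cancel out are all choices made for this sketch.

```python
import numpy as np

def sample_mabn_2d(K, kappa, mu0_angle=0.0, alpha1=2.0, alpha2=1.0, seed=0):
    """Draw K component vectors in R^2 from a type-I MABN-style prior.

    Assumption: p = 2, so each vMF draw is a von Mises draw over an angle.
    """
    rng = np.random.default_rng(seed)
    directions = np.zeros((K, 2))
    # First direction: concentrated around the fixed mean direction mu0.
    theta = rng.vonmises(mu0_angle, kappa)
    directions[0] = [np.cos(theta), np.sin(theta)]
    for i in range(1, K):
        s = directions[:i].sum(axis=0)
        norm = np.linalg.norm(s)
        if norm < 1e-12:
            # Mean direction is undefined when the parents cancel out;
            # fall back to a uniform angle (a choice made for this sketch).
            theta = rng.uniform(-np.pi, np.pi)
        else:
            mean_dir = -s / norm  # points away from the parents
            theta = rng.vonmises(np.arctan2(mean_dir[1], mean_dir[0]), kappa)
        directions[i] = [np.cos(theta), np.sin(theta)]
    # Magnitudes: independent Gamma(shape=alpha1, rate=alpha2) draws.
    magnitudes = rng.gamma(shape=alpha1, scale=1.0 / alpha2, size=K)
    return directions, magnitudes, directions * magnitudes[:, None]

directions, magnitudes, components = sample_mabn_2d(K=5, kappa=10.0)
```

Note that NumPy parameterizes the gamma distribution by shape and scale, so the rate parameter enters as `scale = 1 / alpha2`.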

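The probability distribution over $\mathbf{A}$ can be written as

$$p(\mathbf{A})=C_p(\kappa)\exp(\kappa\boldsymbol{\mu}_0^{\top}\tilde{\mathbf{a}}_1)\prod_{i=2}^{K}C_p(\kappa)\exp\left(\kappa\left(-\frac{\sum_{j=1}^{i-1}\tilde{\mathbf{a}}_j}{\left\|\sum_{j=1}^{i-1}\tilde{\mathbf{a}}_j\right\|_2}\right)^{\top}\tilde{\mathbf{a}}_i\right)\prod_{i=1}^{K}\frac{\alpha_2^{\alpha_1}g_i^{\alpha_1-1}e^{-g_i\alpha_2}}{\Gamma(\alpha_1)}. \tag{2}$$

According to the factorization theorem (Koller and Friedman, 2009) of Bayesian networks, it is easy to verify that $\int p(\mathbf{A})\,d\mathbf{A}=1$; thus $p(\mathbf{A})$ is a proper prior.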
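To make the bias towards large mutual angles concrete, the sketch below evaluates only the direction-dependent exponent of the type-I prior, dropping the normalizers and the gamma terms, which do not depend on the directions. The function name and the two example configurations are hypothetical; the point is only that well-spread directions receive a higher unnormalized log density than tightly clustered ones.

```python
import numpy as np

def mabn_type1_log_density_unnormalized(directions, kappa, mu0):
    """Sum of the exponents of the directional factors of the type-I prior.

    The normalizers C_p(kappa) are constant in the directions for the
    type-I prior, so they are dropped here.
    """
    directions = np.asarray(directions, dtype=float)
    score = kappa * float(mu0 @ directions[0])
    for i in range(1, len(directions)):
        s = directions[:i].sum(axis=0)
        norm = np.linalg.norm(s)
        if norm < 1e-12:
            raise ValueError("parent directions sum to zero; mean direction undefined")
        score += kappa * float((-s / norm) @ directions[i])
    return score

def to_unit(angles):
    """Unit vectors on the circle for the given angles (radians)."""
    return np.stack([np.cos(angles), np.sin(angles)], axis=1)

mu0 = np.array([1.0, 0.0])
spread = mabn_type1_log_density_unnormalized(
    to_unit(np.deg2rad([0.0, 120.0, 240.0])), 1.0, mu0)
clustered = mabn_type1_log_density_unnormalized(
    to_unit(np.deg2rad([0.0, 5.0, 10.0])), 1.0, mu0)
```

Here `spread` exceeds `clustered`, in line with the argument that the prior density grows with the mutual angles.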

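When inferring the posterior of model components using a variational inference method, we need to compute the expectation of $1\big/\left\|\sum_{j=1}^{i-1}\tilde{\mathbf{a}}_j\right\|_2$ appearing in the local probability $p(\tilde{\mathbf{a}}_i\mid\mathrm{pa}(\tilde{\mathbf{a}}_i))$, which is extremely difficult. To address this issue, we define an alternative local probability $\hat{p}(\tilde{\mathbf{a}}_i\mid\mathrm{pa}(\tilde{\mathbf{a}}_i))$ that achieves a similar modeling effect as $p(\tilde{\mathbf{a}}_i\mid\mathrm{pa}(\tilde{\mathbf{a}}_i))$ but greatly facilitates variational inference. We re-parametrize the local probability defined in Eq.(1) using the Gibbs measure: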

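$$\begin{aligned}
\hat{p}(\tilde{\mathbf{a}}_i\mid\mathrm{pa}(\tilde{\mathbf{a}}_i)) &\propto \exp\left(\kappa\left(-\sum_{j=1}^{i-1}\tilde{\mathbf{a}}_j\right)^{\top}\tilde{\mathbf{a}}_i\right),\\
\hat{p}(\tilde{\mathbf{a}}_i\mid\mathrm{pa}(\tilde{\mathbf{a}}_i)) &= C_p\left(\kappa\left\|\sum_{j=1}^{i-1}\tilde{\mathbf{a}}_j\right\|_2\right)\exp\left(\kappa\left\|\sum_{j=1}^{i-1}\tilde{\mathbf{a}}_j\right\|_2\left(-\frac{\sum_{j=1}^{i-1}\tilde{\mathbf{a}}_j}{\left\|\sum_{j=1}^{i-1}\tilde{\mathbf{a}}_j\right\|_2}\right)^{\top}\tilde{\mathbf{a}}_i\right)\\
&= C_p\left(\kappa\left\|\sum_{j=1}^{i-1}\tilde{\mathbf{a}}_j\right\|_2\right)\exp\left(\kappa\left(-\sum_{j=1}^{i-1}\tilde{\mathbf{a}}_j\right)^{\top}\tilde{\mathbf{a}}_i\right),
\end{aligned} \tag{3}$$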

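which is another vMF distribution with mean direction $-\sum_{j=1}^{i-1}\tilde{\mathbf{a}}_j\big/\left\|\sum_{j=1}^{i-1}\tilde{\mathbf{a}}_j\right\|_2$ and concentration parameter $\kappa\left\|\sum_{j=1}^{i-1}\tilde{\mathbf{a}}_j\right\|_2$. This re-parameterized local probability is proportional to $\exp\left(\kappa\left(-\sum_{j=1}^{i-1}\tilde{\mathbf{a}}_j\right)^{\top}\tilde{\mathbf{a}}_i\right)$, whose exponent measures the negative cosine similarity between $\tilde{\mathbf{a}}_i$ and its parent vectors. Thereby, $\hat{p}(\tilde{\mathbf{a}}_i\mid\mathrm{pa}(\tilde{\mathbf{a}}_i))$ still encourages large mutual angles between vectors, as $p(\tilde{\mathbf{a}}_i\mid\mathrm{pa}(\tilde{\mathbf{a}}_i))$ does. The difference between the two is that in $\hat{p}(\tilde{\mathbf{a}}_i\mid\mathrm{pa}(\tilde{\mathbf{a}}_i))$ the term $\left\|\sum_{j=1}^{i-1}\tilde{\mathbf{a}}_j\right\|_2$ is moved from the denominator to the normalizer, so we avoid computing the expectation of $1\big/\left\|\sum_{j=1}^{i-1}\tilde{\mathbf{a}}_j\right\|_2$. This incurs a new problem, namely computing the expectation of $\log C_p\left(\kappa\left\|\sum_{j=1}^{i-1}\tilde{\mathbf{a}}_j\right\|_2\right)$, which is also hard due to the complex form of the $C_p(\cdot)$ function; we resolve this problem as detailed in Section 3.1.1. We refer to the MABN prior defined in Eq.(2) as type I MABN and that with the local probability defined in Eq.(3) as type II MABN.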

#### Approximate Inference Algorithms

We develop algorithms to infer the posteriors of components under the MABN prior. Since exact posteriors are intractable, we resort to approximate inference techniques. Two main paradigms of approximate inference methods are: (1) variational inference (VI) (Wainwright et al., 2008); (2) Markov chain Monte Carlo (MCMC) sampling (Gilks, 2005). These two approaches possess benefits that are mutually complementary. MCMC can achieve a better approximation of the posterior than VI since it generates samples from the exact posterior while VI seeks an approximation. However, VI can be computationally more efficient (Hoffman et al., 2013).

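Variational Inference The basic idea of VI (Wainwright et al., 2008) is to use a “simpler” variational distribution $q(\mathbf{A})$ to approximate the true posterior $p(\mathbf{A}\mid\mathcal{D})$ by minimizing the Kullback-Leibler divergence between these two distributions, which is equivalent to maximizing the following variational lower bound w.r.t. $q(\mathbf{A})$:

$$\mathbb{E}_{q(\mathbf{A})}[\log p(\mathcal{D}\mid\mathbf{A})]+\mathbb{E}_{q(\mathbf{A})}[\log p(\mathbf{A})]-\mathbb{E}_{q(\mathbf{A})}[\log q(\mathbf{A})] \tag{4}$$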

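where $p(\mathbf{A})$ is the MABN prior and $p(\mathcal{D}\mid\mathbf{A})$ is the data likelihood. Here we choose $q(\mathbf{A})$ to be a mean field variational distribution $q(\mathbf{A})=\prod_{k=1}^{K}q(\tilde{\mathbf{a}}_k)q(g_k)$, where each $q(\tilde{\mathbf{a}}_k)$ is a von Mises-Fisher distribution and each $q(g_k)$ is a gamma distribution. Given the variational distribution, we first compute the analytical expression of the variational lower bound, in which we particularly discuss how to compute $\mathbb{E}_{q(\mathbf{A})}[\log p(\mathbf{A})]$. If $p(\mathbf{A})$ is chosen to be the type-I MABN prior (Eq.(2)), we need to compute $\mathbb{E}_{q(\mathbf{A})}\left[\left(-\sum_{j=1}^{i-1}\tilde{\mathbf{a}}_j\big/\left\|\sum_{j=1}^{i-1}\tilde{\mathbf{a}}_j\right\|_2\right)^{\top}\tilde{\mathbf{a}}_i\right]$, which is very difficult to deal with due to the presence of $1\big/\left\|\sum_{j=1}^{i-1}\tilde{\mathbf{a}}_j\right\|_2$. Instead we choose the type-II MABN prior for the convenience of deriving the variational lower bound. Under the type-II MABN, we need to compute $\mathbb{E}_{q(\mathbf{A})}[\log Z_i]$ for all $i$, where $Z_i=1\big/C_p\left(\kappa\left\|\sum_{j=1}^{i-1}\tilde{\mathbf{a}}_j\right\|_2\right)$ is the partition function of $\hat{p}(\tilde{\mathbf{a}}_i\mid\mathrm{pa}(\tilde{\mathbf{a}}_i))$. The analytical form of this expectation is difficult to derive as well due to the complexity of the $C_p(\cdot)$ function: $C_p(\kappa)=\frac{\kappa^{p/2-1}}{(2\pi)^{p/2}I_{p/2-1}(\kappa)}$, where $I_{p/2-1}(\cdot)$ is the modified Bessel function of the first kind at order $p/2-1$. To address this issue, we derive an upper bound of $\log Z_i$ and compute the expectation of the upper bound, which is relatively easy to do. Consequently, we obtain a further lower bound of the variational lower bound and learn the variational and model parameters w.r.t. the new lower bound.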
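To get a feel for the partition function of the type-II local probability, the sketch below works in three dimensions, where the Bessel function has the closed form $I_{1/2}(\kappa)=\sqrt{2/(\pi\kappa)}\sinh\kappa$ and the standard vMF normalizer becomes $C_3(\kappa)=\kappa/(4\pi\sinh\kappa)$. Restricting to $p=3$ is an assumption made purely to avoid a Bessel-function dependency; the function names are hypothetical, and the quadrature routine is only a sanity check of the closed form.

```python
import numpy as np

def log_partition_vmf_3d(kappa):
    """log Z = -log C_3(kappa) for the vMF density on the unit sphere in R^3.

    Uses I_{1/2}(k) = sqrt(2 / (pi k)) sinh(k), so C_3(k) = k / (4 pi sinh(k)).
    """
    kappa = float(kappa)
    if kappa < 1e-8:
        return np.log(4.0 * np.pi)  # limit kappa -> 0: surface area of the sphere
    # Stable log(sinh(kappa)) for large kappa.
    log_sinh = kappa + np.log1p(-np.exp(-2.0 * kappa)) - np.log(2.0)
    return np.log(4.0 * np.pi) + log_sinh - np.log(kappa)

def log_Z_type2(parents, kappa):
    """log Z_i of the type-II local probability, whose concentration
    is kappa times the norm of the summed parent directions."""
    s = np.asarray(parents, dtype=float).sum(axis=0)
    return log_partition_vmf_3d(kappa * np.linalg.norm(s))

def log_partition_numeric(kappa, n=200001):
    """Quadrature check: the integral of exp(kappa * mu^T x) over the sphere
    equals 2 * pi * int_{-1}^{1} exp(kappa * t) dt."""
    t = np.linspace(-1.0, 1.0, n)
    f = np.exp(kappa * t)
    integral = np.sum(0.5 * (f[1:] + f[:-1]) * np.diff(t))
    return np.log(2.0 * np.pi * integral)
```

Because the concentration depends on the parent directions through their summed norm, `log_Z_type2` is a random quantity under the variational distribution, which is why its expectation has to be bounded rather than computed directly.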

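Now we proceed to derive the upper bound of $\log Z_i$, which equals $\log\int\exp\left(\kappa\left(-\sum_{j=1}^{i-1}\tilde{\mathbf{a}}_j\right)\cdot\tilde{\mathbf{a}}_i\right)d\tilde{\mathbf{a}}_i$. Applying the inequality $\log\int\exp(f(\mathbf{y}))\,d\mathbf{y}\le\gamma+\int\log(1+\exp(f(\mathbf{y})-\gamma))\,d\mathbf{y}$ (Bouchard, 2007), where $\gamma$ is a variational parameter, we have

$$\log Z_i\le\gamma+\int\log\left(1+\exp\left(\kappa\left(-\sum_{j=1}^{i-1}\tilde{\mathbf{a}}_j\right)\cdot\tilde{\mathbf{a}}_i-\gamma\right)\right)d\tilde{\mathbf{a}}_i. \tag{5}$$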

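Then applying the inequality $\log(1+e^{-x})\le\log(1+e^{-\xi})-\frac{x-\xi}{2}-\frac{1/2-g(\xi)}{2\xi}(x^2-\xi^2)$ (Bouchard, 2007), where $\xi$ is another variational parameter and $g(\xi)=1/(1+\exp(-\xi))$, we have

$$\log Z_i\le\gamma+\int\left[\log(1+e^{-\xi})-\frac{\kappa\left(\sum_{j=1}^{i-1}\tilde{\mathbf{a}}_j\right)\cdot\tilde{\mathbf{a}}_i+\gamma-\xi}{2}-\frac{1/2-g(\xi)}{2\xi}\left(\left(\kappa\left(\sum_{j=1}^{i-1}\tilde{\mathbf{a}}_j\right)\cdot\tilde{\mathbf{a}}_i+\gamma\right)^2-\xi^2\right)\right]d\tilde{\mathbf{a}}_i. \tag{6}$$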
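The quadratic bound on the logistic term used in this step can be checked numerically. The sketch below assumes the usual reading of the bound, with $g$ the logistic sigmoid and the scalar `x` standing in for the quantity inside the integrand; the helper names are hypothetical. It confirms that the bound holds on a grid and is tight at $x=\pm\xi$.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softplus_neg(x):
    """log(1 + exp(-x)), computed stably."""
    return np.logaddexp(0.0, -x)

def quadratic_upper_bound(x, xi):
    """Quadratic upper bound on log(1 + exp(-x)) with variational parameter xi != 0."""
    lam = (sigmoid(xi) - 0.5) / (2.0 * xi)  # equals -(1/2 - g(xi)) / (2 * xi)
    return softplus_neg(xi) - (x - xi) / 2.0 + lam * (x ** 2 - xi ** 2)

x_grid = np.linspace(-8.0, 8.0, 401)
# Smallest gap between bound and function over the grid, per value of xi.
min_gaps = {xi: float(np.min(quadratic_upper_bound(x_grid, xi) - softplus_neg(x_grid)))
            for xi in (0.5, 1.0, 3.0)}
```

Since the bound is quadratic in `x`, integrating it over the sphere only requires the zeroth, first and second moments of the integration variable, which is what makes the next step tractable.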

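Finally, applying the following integrals on a high-dimensional sphere: (1) $\int_{\|\mathbf{y}\|_2=1}1\,d\mathbf{y}=\frac{2\pi^{(p+1)/2}}{\Gamma\left(\frac{p+1}{2}\right)}$, (2) $\int_{\|\mathbf{y}\|_2=1}\mathbf{x}^{\top}\mathbf{y}\,d\mathbf{y}=0$, (3) $\int_{\|\mathbf{y}\|_2=1}(\mathbf{x}^{\top}\mathbf{y})^2\,d\mathbf{y}\le\|\mathbf{x}\|_2^2\,\frac{2\pi^{(p+1)/2}}{\Gamma\left(\frac{p+1}{2}\right)}$, we get

$$\log Z_i\le-\frac{1/2-g(\xi)}{2\xi}\kappa^2\left\|\sum_{j=1}^{i-1}\tilde{\mathbf{a}}_j\right\|_2^2\frac{2\pi^{(p+1)/2}}{\Gamma\left(\frac{p+1}{2}\right)}+\gamma+\left[\log(1+e^{-\xi})+\frac{\xi-\gamma}{2}+\frac{1/2-g(\xi)}{2\xi}(\xi^2-\gamma^2)\right]\frac{2\pi^{(p+1)/2}}{\Gamma\left(\frac{p+1}{2}\right)} \tag{7}$$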

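The expectation of this upper bound is much easier to compute. Specifically, we need to tackle $\mathbb{E}_{q(\mathbf{A})}\left[\left\|\sum_{j=1}^{i-1}\tilde{\mathbf{a}}_j\right\|_2^2\right]$, which can be computed as

$$\begin{aligned}
\mathbb{E}_{q(\mathbf{A})}\left[\left\|\sum_{j=1}^{i-1}\tilde{\mathbf{a}}_j\right\|_2^2\right] &= \mathbb{E}_{q(\mathbf{A})}\left[\sum_{j=1}^{i-1}\tilde{\mathbf{a}}_j^{\top}\tilde{\mathbf{a}}_j+\sum_{j=1}^{i-1}\sum_{k\neq j}^{i-1}\tilde{\mathbf{a}}_j^{\top}\tilde{\mathbf{a}}_k\right]\\
&= \sum_{j=1}^{i-1}\mathrm{tr}\left(\mathbb{E}_{q(\tilde{\mathbf{a}}_j)}[\tilde{\mathbf{a}}_j\tilde{\mathbf{a}}_j^{\top}]\right)+\sum_{j=1}^{i-1}\sum_{k\neq j}^{i-1}\mathbb{E}_{q(\tilde{\mathbf{a}}_j)}[\tilde{\mathbf{a}}_j]^{\top}\mathbb{E}_{q(\tilde{\mathbf{a}}_k)}[\tilde{\mathbf{a}}_k]\\
&= \sum_{j=1}^{i-1}\mathrm{tr}(\mathrm{cov}(\tilde{\mathbf{a}}_j))+\sum_{j=1}^{i-1}\sum_{k=1}^{i-1}\mathbb{E}_{q(\tilde{\mathbf{a}}_j)}[\tilde{\mathbf{a}}_j]^{\top}\mathbb{E}_{q(\tilde{\mathbf{a}}_k)}[\tilde{\mathbf{a}}_k],
\end{aligned} \tag{8}$$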

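where the mean $\mathbb{E}_{q(\tilde{\mathbf{a}}_j)}[\tilde{\mathbf{a}}_j]$ and the covariance $\mathrm{cov}(\tilde{\mathbf{a}}_j)$ are the first two moments of the von Mises-Fisher variational distribution $q(\tilde{\mathbf{a}}_j)$, both of which are available in closed form in terms of ratios of modified Bessel functions.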
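The decomposition of the expected squared norm into covariance traces plus the squared norm of the summed means can be sanity-checked by Monte Carlo. The sketch below does this on the unit circle with independent von Mises draws standing in for the variational factors; the dimension, concentration and mean angles are arbitrary choices for illustration, and empirical moments are used in place of the closed-form vMF moments.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_parents, kappa_hat = 200_000, 3, 4.0
mean_angles = np.array([0.0, 2.0, -2.0])

# Independent draws for each parent direction, shape (n_parents, n_samples, 2).
angles = rng.vonmises(mean_angles[:, None], kappa_hat, size=(n_parents, n_samples))
samples = np.stack([np.cos(angles), np.sin(angles)], axis=-1)

# Left-hand side: direct Monte Carlo estimate of E || sum_j a_j ||^2.
lhs = np.mean(np.sum(samples.sum(axis=0) ** 2, axis=-1))

# Right-hand side: sum of covariance traces plus squared norm of the summed means.
means = samples.mean(axis=1)  # (n_parents, 2)
trace_cov = sum(np.trace(np.cov(samples[j].T)) for j in range(n_parents))
rhs = trace_cov + np.sum(means.sum(axis=0) ** 2)
```

The two estimates agree up to Monte Carlo error; the identity relies on the factors being independent, which is exactly what the mean field assumption provides.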

MCMC Sampling One potential drawback of the variational inference approach is that a large approximation error can be incurred if the variational distribution is far from the true posterior. We therefore present an alternative approximate inference method, Markov chain Monte Carlo (MCMC) (Gilks, 2005), which draws samples from the exact posterior distribution and uses the samples to represent the posterior. Specifically, we choose the Metropolis-Hastings (MH) algorithm (Gilks, 2005), which generates samples from an adaptive proposal distribution, computes acceptance probabilities based on the unnormalized true posterior, and uses these probabilities to decide whether a sample should be accepted or rejected. The most commonly used proposal distribution is based on a random walk: the newly proposed sample at step $t+1$ comes from a random perturbation around the previous sample at step $t$. For the directional variables $\{\tilde{\mathbf{a}}_i\}_{i=1}^{K}$ and magnitude variables $\{g_i\}_{i=1}^{K}$, we define the proposal distributions to be a von Mises-Fisher distribution and a normal distribution respectively:

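$$\begin{aligned}
q(\tilde{\mathbf{a}}_i^{(t+1)}\mid\tilde{\mathbf{a}}_i^{(t)}) &= C_p(\hat{\kappa})\exp\left(\hat{\kappa}\,\tilde{\mathbf{a}}_i^{(t+1)}\cdot\tilde{\mathbf{a}}_i^{(t)}\right),\\
q(g_i^{(t+1)}\mid g_i^{(t)}) &= \frac{1}{\sigma\sqrt{2\pi}}\exp\left\{-\frac{(g_i^{(t+1)}-g_i^{(t)})^2}{2\sigma^2}\right\}.
\end{aligned} \tag{9}$$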

$g_i^{(t+1)}$ is required to be positive, but the Gaussian distribution may generate non-positive samples. To address this problem, we adopt a truncated sampler (Wilkinson, 2015) which repeatedly draws samples until a positive value is obtained. Under such a truncated sampling scheme, the MH acceptance ratio needs to be modified accordingly. Please refer to (Wilkinson, 2015) for details.
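The correction to the acceptance ratio can be illustrated on a one-dimensional toy problem. The sketch below is not the paper's sampler: as a stand-in for the posterior over a magnitude it targets a plain Gamma(shape, rate) density, and the function names and settings are hypothetical. Redrawing until the candidate is positive amounts to proposing from a normal truncated to the positive half-line, whose normalizer $\Phi(g/\sigma)$ depends on the current state and therefore does not cancel in the Metropolis-Hastings ratio.

```python
import math
import random

def normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def log_gamma_unnormalized(g, alpha1, alpha2):
    """Log of the Gamma(shape=alpha1, rate=alpha2) density up to a constant."""
    return (alpha1 - 1.0) * math.log(g) - alpha2 * g

def mh_positive_random_walk(n_iter, alpha1=3.0, alpha2=2.0, sigma=1.0, seed=0):
    rng = random.Random(seed)
    g = 1.0
    samples = []
    for _ in range(n_iter):
        # Truncated proposal: redraw until the candidate is positive.
        g_new = rng.gauss(g, sigma)
        while g_new <= 0.0:
            g_new = rng.gauss(g, sigma)
        # The truncated normal has density N(g_new; g, sigma^2) / Phi(g / sigma),
        # so q(g | g_new) / q(g_new | g) = Phi(g / sigma) / Phi(g_new / sigma).
        log_ratio = (log_gamma_unnormalized(g_new, alpha1, alpha2)
                     - log_gamma_unnormalized(g, alpha1, alpha2)
                     + math.log(normal_cdf(g / sigma))
                     - math.log(normal_cdf(g_new / sigma)))
        if rng.random() < math.exp(min(0.0, log_ratio)):
            g = g_new
        samples.append(g)
    return samples

chain = mh_positive_random_walk(50_000)
posterior_mean_estimate = sum(chain[5_000:]) / len(chain[5_000:])
```

With the correction in place the chain mean settles near the target mean `alpha1 / alpha2`; dropping the two `normal_cdf` terms would bias the chain.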

MH eventually converges to a stationary distribution where the generated samples represent the true posterior. The downside of MCMC is that it can take a long time to converge, which usually makes it computationally less efficient than variational inference (Hoffman et al., 2013). Under the MH algorithm, the MABN prior is more efficient than the DPP prior: in each iteration the prior needs to be evaluated, and evaluating the MABN prior has complexity quadratic in the component number $K$, whereas evaluating the DPP has cubic complexity in $K$.

### 3.2 Diversity-Promoting Posterior Regularization

In practice, one may desire to achieve more than one diversity-promoting effect in LVMs. For example, the mutual angular regularizer (Xie et al., 2015) aims to encourage the pairwise angles between components to have not only a large mean but also a small variance, such that the components are uniformly “different” from each other and evenly spread out in different directions in the space. It would be extremely difficult, if at all possible, to define a proper prior that can accommodate all desired effects. For instance, the MABN priors defined above can encourage the mutual angles to have a large mean, but are unable to promote a small variance. To overcome this inflexibility of the prior control method, we resort to a posterior regularization approach (Zhu et al., 2014b). Instead of designing a Bayesian prior to encode the diversification desideratum and indirectly influencing the posterior, posterior regularization directly imposes a control over the post-data distributions to achieve certain goals. Given a prior $\pi(\mathbf{A})$ and data likelihood $p(\mathcal{D}\mid\mathbf{A})$, computing the posterior $p(\mathbf{A}\mid\mathcal{D})$ is equivalent to solving the following optimization problem (Zhu et al., 2014b)

$$\sup_{q(\mathbf{A})}\ \mathbb{E}_{q(\mathbf{A})}[\log p(\mathcal{D}\mid\mathbf{A})\pi(\mathbf{A})]-\mathbb{E}_{q(\mathbf{A})}[\log q(\mathbf{A})], \tag{10}$$

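where $q(\mathbf{A})$ is any valid probability distribution. The basic idea of posterior regularization is to impose a certain regularizer $\mathcal{R}(q(\mathbf{A}))$ over $q(\mathbf{A})$ to incorporate prior knowledge and structural bias (Zhu et al., 2014b) and solve the following regularized problem

$$\sup_{q(\mathbf{A})}\ \mathbb{E}_{q(\mathbf{A})}[\log p(\mathcal{D}\mid\mathbf{A})\pi(\mathbf{A})]-\mathbb{E}_{q(\mathbf{A})}[\log q(\mathbf{A})]+\lambda\mathcal{R}(q(\mathbf{A})), \tag{11}$$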

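where $\lambda$ is a tradeoff parameter. Through properly designing $\mathcal{R}(q(\mathbf{A}))$, many diversity-promoting effects can be flexibly incorporated. Here we present a specific example while noting that many other choices are applicable. Gaining insight from (Xie et al., 2015), we define $\mathcal{R}(q(\mathbf{A}))$ as

$$\Omega\left(\{\mathbb{E}_{q(\mathbf{a}_i)}[\mathbf{a}_i]\}_{i=1}^{K}\right)=\frac{1}{K(K-1)}\sum_{i=1}^{K}\sum_{j\neq i}^{K}\theta_{ij}-\gamma\,\frac{1}{K(K-1)}\sum_{i=1}^{K}\sum_{j\neq i}^{K}\left(\theta_{ij}-\frac{1}{K(K-1)}\sum_{p=1}^{K}\sum_{q\neq p}^{K}\theta_{pq}\right)^2, \tag{12}$$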

where $\theta_{ij}=\arccos\left(\frac{|\mathbb{E}[\mathbf{a}_i]\cdot\mathbb{E}[\mathbf{a}_j]|}{\|\mathbb{E}[\mathbf{a}_i]\|_2\|\mathbb{E}[\mathbf{a}_j]\|_2}\right)$ is the non-obtuse angle measuring the dissimilarity between $\mathbb{E}[\mathbf{a}_i]$ and $\mathbb{E}[\mathbf{a}_j]$, and the regularizer is defined as the mean of the pairwise angles minus $\gamma$ times their variance. The intuition behind this regularizer is: if the mean of the angles is larger (indicating that these vectors are more different from each other on the whole) and the variance of the angles is smaller (indicating that these vectors are evenly spread out in different directions), then these vectors are more diverse. Note that it is very difficult to design priors that simultaneously achieve these two effects.
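As a concrete reading of this regularizer, the sketch below computes the mean of the pairwise non-obtuse angles minus a weighted variance for a set of vectors, which in the posterior regularization setting would be the posterior means of the components. The function name and the default weight are hypothetical, and at least two vectors are assumed.

```python
import numpy as np

def mutual_angular_regularizer(vectors, gamma=1.0):
    """Mean of pairwise non-obtuse angles minus gamma times their variance."""
    V = np.asarray(vectors, dtype=float)
    K = V.shape[0]
    unit = V / np.linalg.norm(V, axis=1, keepdims=True)
    # The absolute value of the cosine yields the non-obtuse angle in [0, pi/2].
    cos = np.clip(np.abs(unit @ unit.T), 0.0, 1.0)
    theta = np.arccos(cos)
    off_diag = theta[~np.eye(K, dtype=bool)]  # the K(K-1) ordered pairs i != j
    return off_diag.mean() - gamma * off_diag.var()

orthogonal = mutual_angular_regularizer(np.eye(3))
parallel = mutual_angular_regularizer([[1.0, 0.0], [2.0, 0.0], [-3.0, 0.0]])
uneven = mutual_angular_regularizer([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
```

Mutually orthogonal vectors attain the largest value (mean angle $\pi/2$, zero variance), parallel vectors the smallest, and the `uneven` configuration is penalized for the spread between its 45 and 90 degree angles.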

While posterior regularization is more flexible, it lacks some strengths possessed by the prior control method for our purpose of diversifying latent variable models. First, prior control is a more natural way of incorporating prior knowledge, with a solid theoretical foundation. Second, prior control can facilitate sampling based algorithms that are not applicable to the above posterior regularization. In sum, the two approaches have complementary advantages and should be chosen according to the specific problem context.

### 3.3 “Diversifying” Bayesian Mixture of Experts Model

In this section, we apply the two approaches developed above to “diversify” the Bayesian mixture of experts model (BMEM) (Waterhouse et al., 1996).

#### BMEM with Mutual Angular Prior

The mixture of experts model (MEM) (Jordan and Jacobs, 1994) has been widely used for machine learning tasks where the distribution of input data is so complicated that a single model (“expert”) cannot be effective for all the data. MEM assumes that the input data inherently belongs to multiple latent groups and allocates one “expert” to each group to handle the data therein. Here we consider a classification task whose goal is to learn binary linear classifiers given the training data $\{(\mathbf{x}_i,y_i)\}_{i=1}^{N}$, where $\mathbf{x}_i$ is the input feature vector and $y_i$ is the binary class label. We assume there are $K$ latent experts, where each expert $k$ is a classifier with coefficient vector $\boldsymbol{\beta}_k$. Given a test example $\mathbf{x}$, it first goes through a gate function that decides, in a probabilistic way, which expert is best suited to classify this example. A discrete variable $z$ indicates the selected expert, and the probability that $z=k$ (assigning example $\mathbf{x}$ to expert $k$) is $\exp(\boldsymbol{\eta}_k^{\top}\mathbf{x})\big/\sum_{j=1}^{K}\exp(\boldsymbol{\eta}_j^{\top}\mathbf{x})$, where $\boldsymbol{\eta}_k$ is a coefficient vector characterizing the selection of expert $k$. Given the selected expert, the example is classified using the coefficient vector corresponding to that expert. As described in Figure 2, the generative process of $\{(\mathbf{x}_i,y_i)\}_{i=1}^{N}$ is as follows:

• For $i=1,\dots,N$

• Draw $z_i\sim\mathrm{Multi}(\boldsymbol{\zeta})$, where $\zeta_k=\exp(\boldsymbol{\eta}_k^{\top}\mathbf{x}_i)\big/\sum_{j=1}^{K}\exp(\boldsymbol{\eta}_j^{\top}\mathbf{x}_i)$

• Draw $y_i\sim\mathrm{Bernoulli}\left(1\big/\left(1+\exp(-\boldsymbol{\beta}_{z_i}^{\top}\mathbf{x}_i)\right)\right)$.
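The generative process can be sketched as follows. Two modeling details are assumptions of this sketch rather than statements from the text: the gate is taken to be a softmax over linear scores, and each expert is taken to be a logistic classifier. The function names and array shapes are likewise hypothetical.

```python
import numpy as np

def softmax(scores):
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(scores)
    return e / e.sum(axis=1, keepdims=True)

def sample_bmem_labels(X, eta, beta, seed=0):
    """X: (N, D) inputs; eta: (K, D) gate vectors; beta: (K, D) expert vectors."""
    rng = np.random.default_rng(seed)
    gate = softmax(X @ eta.T)  # (N, K) expert-selection probabilities
    # Draw one expert index per example from its gate distribution.
    z = np.array([rng.choice(len(eta), p=row) for row in gate])
    # Classify each example with the coefficient vector of its selected expert.
    logits = np.einsum("nd,nd->n", X, beta[z])
    p_pos = 1.0 / (1.0 + np.exp(-logits))
    y = rng.binomial(1, p_pos)
    return gate, z, y

demo_rng = np.random.default_rng(1)
X_demo = demo_rng.normal(size=(6, 3))
eta_demo = demo_rng.normal(size=(4, 3))
beta_demo = demo_rng.normal(size=(4, 3))
gate, z, y = sample_bmem_labels(X_demo, eta_demo, beta_demo)
```

Diversity among the rows of `beta` (and of `eta`) is what the priors and regularizers of this section are designed to encourage.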

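As of now, the model parameters $\mathcal{B}=\{\boldsymbol{\beta}_k\}_{k=1}^{K}$ and $\mathcal{H}=\{\boldsymbol{\eta}_k\}_{k=1}^{K}$ are deterministic variables. Next we place a prior over them to enable Bayesian learning (Waterhouse et al., 1996), and we desire this prior to promote diversity among the experts so as to retain the advantages of “diversifying” LVMs stated before. The mutual angular Bayesian network prior can be applied to achieve this goal:

$$p(\mathcal{B})=C_p(\kappa)\exp(\kappa\boldsymbol{\mu}_0^{\top}\tilde{\boldsymbol{\beta}}_1)\prod_{i=2}^{K}C_p(\kappa)\exp\left(\kappa\left(-\frac{\sum_{j=1}^{i-1}\tilde{\boldsymbol{\beta}}_j}{\left\|\sum_{j=1}^{i-1}\tilde{\boldsymbol{\beta}}_j\right\|_2}\right)^{\top}\tilde{\boldsymbol{\beta}}_i\right)\prod_{i=1}^{K}\frac{\alpha_2^{\alpha_1}g_i^{\alpha_1-1}e^{-g_i\alpha_2}}{\Gamma(\alpha_1)},$$

where $\tilde{\boldsymbol{\beta}}_i$ and $g_i$ are the direction and magnitude of $\boldsymbol{\beta}_i$, i.e., $\boldsymbol{\beta}_i=g_i\tilde{\boldsymbol{\beta}}_i$ with $\|\tilde{\boldsymbol{\beta}}_i\|_2=1$. The prior over $\mathcal{H}$ takes the same form, with directions $\tilde{\boldsymbol{\eta}}_i$ and magnitudes $h_i$.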

#### BMEM with Mutual Angular Posterior Regularization

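As an alternative approach, diversity in BMEM can be imposed by placing the mutual angular regularizer (Eq.(12)) over the post-data distributions (Zhu et al., 2014b). Here we instantiate the general diversity-promoting posterior regularization defined in Eq.(11) for BMEM by specifying the following parametrization. The latent variables in BMEM include $\mathcal{B}$, $\mathcal{H}$ and $\mathbf{z}=\{z_i\}_{i=1}^{N}$, and the post-data distribution over them is defined as $q(\mathcal{B},\mathcal{H},\mathbf{z})=q(\mathcal{B})q(\mathcal{H})q(\mathbf{z})$. For computational tractability, we define $q(\mathcal{B})$ and $q(\mathcal{H})$ to be $q(\mathcal{B})=\prod_{k=1}^{K}q(\tilde{\boldsymbol{\beta}}_k)q(g_k)$ and $q(\mathcal{H})=\prod_{k=1}^{K}q(\tilde{\boldsymbol{\eta}}_k)q(h_k)$, where $q(\tilde{\boldsymbol{\beta}}_k)$, $q(\tilde{\boldsymbol{\eta}}_k)$ are von Mises-Fisher distributions and $q(g_k)$, $q(h_k)$ are gamma distributions, and define $q(\mathbf{z})$ to be a product of multinomial distributions, $q(\mathbf{z})=\prod_{i=1}^{N}q(z_i)$, where each $q(z_i)$ is parameterized by a multinomial vector. The priors over $\mathcal{B}$ and $\mathcal{H}$ are specified to be $\pi(\mathcal{B})=\prod_{k=1}^{K}p(\tilde{\boldsymbol{\beta}}_k)p(g_k)$ and $\pi(\mathcal{H})=\prod_{k=1}^{K}p(\tilde{\boldsymbol{\eta}}_k)p(h_k)$, where $p(\tilde{\boldsymbol{\beta}}_k)$, $p(\tilde{\boldsymbol{\eta}}_k)$ are vMF distributions and $p(g_k)$, $p(h_k)$ are gamma distributions. Under such a parametrization, we solve the following diversity-promoting posterior regularization problem

$$\sup_{q(\mathcal{B},\mathcal{H},\mathbf{z})}\ \mathbb{E}_{q(\mathcal{B},\mathcal{H},\mathbf{z})}\left[\log p(\{y_i\}_{i=1}^{N},\mathbf{z}\mid\mathcal{B},\mathcal{H})\pi(\mathcal{B},\mathcal{H})\right]-\mathbb{E}_{q(\mathcal{B},\mathcal{H},\mathbf{z})}[\log q(\mathcal{B},\mathcal{H},\mathbf{z})]+\lambda_1\Omega\left(\{\mathbb{E}_{q(\tilde{\boldsymbol{\beta}}_k)}[\tilde{\boldsymbol{\beta}}_k]\}_{k=1}^{K}\right)+\lambda_2\Omega\left(\{\mathbb{E}_{q(\tilde{\boldsymbol{\eta}}_k)}[\tilde{\boldsymbol{\eta}}_k]\}_{k=1}^{K}\right). \tag{13}$$

Note that other parametrizations are also valid, such as placing Gaussian priors over $\mathcal{B}$ and $\mathcal{H}$ and setting $q(\mathcal{B})$, $q(\mathcal{H})$ to be Gaussian.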

## 4 Diversity-Promoting Bayesian Nonparametric Modeling

In the last section, we studied how to promote diversity among a finite number of components in parametric LVMs. In this section, we investigate how to achieve this goal in nonparametric LVMs, where the number of components is infinite in principle. We extend the mutual angular Bayesian network (MABN) prior defined in the last section to an Infinite Mutual Angular (IMA) prior that encourages infinitely many components to have large angles. In this prior, the components are mutually dependent, which poses great challenges for posterior inference. We develop an efficient sampling algorithm based on slice sampling (Teh and Ghahramani, 2007) and Riemann manifold Hamiltonian Monte Carlo (Girolami and Calderhead, 2011). We apply the IMA prior to induce diversity in the infinite latent feature model (ILFM) (Ghahramani and Griffiths, 2005).

### 4.1 Bayesian Nonparametric Latent Variable Models

A BN-LVM consists of an infinite number of components, each parameterized by a vector. For example, in Dirichlet process Gaussian mixture model (DP-GMM) (Rasmussen, 1999; Blei et al., 2006), the components are clusters, each parameterized with a Gaussian mean vector. In Indian buffet process latent feature model (IBP-LFM) (Ghahramani and Griffiths, 2005), the components are features, each parameterized by a weight vector. Given these infinitely many components, BN-LVMs design some proper mechanism to select one or a finite subset of them to model each observed data example. For example, in DP-GMM, a Chinese restaurant process (CRP) (Aldous, 1985) is designed to assign each data example to one of the infinite number of clusters. In IBP-LFM, an Indian buffet process (IBP) (Ghahramani and Griffiths, 2005) is utilized to select a finite set of features from the infinite feature pool to reconstruct each data example. A BN-LVM typically consists of two priors. One is a base distribution from which the parameter vectors of components are drawn. The other is a stochastic process – such as CRP and IBP – which designates how to select components to model data. The prior studied in this paper belongs to the first regime. It is commonly assumed that parameter vectors of the components are independently drawn from the same base distribution. For example, in both DP-GMM and IBP-LFM, the mean vectors and weight vectors are independently drawn from a Gaussian distribution. In this paper, we aim to design a prior that encourages the component vectors to be mutually different and “diverse”, under which the component vectors are not independent any more, which presents great challenges for posterior inference.

### 4.2 Infinite Mutual Angular Prior

In the MABN prior, the components are added one by one. Each new component is encouraged to have large angles with previous ones. This adding process can be repeated infinitely many times, resulting in a prior that encourages an infinite number of components to have large mutual angles:

$$p(\{\hat{w}_i\}_{i=1}^{\infty}) = p(\hat{w}_1)\prod_{i=2}^{\infty} p(\hat{w}_i \mid \mathrm{pa}(\hat{w}_i)) \tag{14}$$

The factorization theorem for Bayesian networks (Koller and Friedman, 2009) ensures that $p(\{\hat{w}_i\}_{i=1}^{\infty})$ integrates to one. The magnitudes of the components do not affect the angles (hence diversity), so they can be generated independently from a gamma distribution.

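The generative process of $\{w_i\}_{i=1}^{\infty}$ can be summarized as follows:

• Sample $\hat{w}_1 \sim \mathrm{vMF}(\mu_0, \kappa)$

• For $i = 2, \ldots, \infty$, sample $\hat{w}_i \sim \mathrm{vMF}\left(-\frac{\sum_{j=1}^{i-1}\hat{w}_j}{\left\|\sum_{j=1}^{i-1}\hat{w}_j\right\|_2}, \kappa\right)$

• For $i = 1, \ldots, \infty$, sample $r_i \sim \mathrm{Gamma}(\alpha_1, \alpha_2)$

• For $i = 1, \ldots, \infty$, $w_i = \hat{w}_i r_i$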

The probability distribution over $\{w_i\}_{i=1}^{\infty}$ can be written as

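$$p(\{w_i\}_{i=1}^{\infty}) = C_p(\kappa)\exp\left(\kappa\mu_0^{\top}\hat{w}_1\right)\prod_{i=2}^{\infty} C_p(\kappa)\exp\left(\kappa\left(-\frac{\sum_{j=1}^{i-1}\hat{w}_j}{\left\|\sum_{j=1}^{i-1}\hat{w}_j\right\|_2}\right)^{\top}\hat{w}_i\right)\prod_{i=1}^{\infty}\frac{\alpha_2^{\alpha_1} r_i^{\alpha_1-1} e^{-r_i\alpha_2}}{\Gamma(\alpha_1)} \tag{15}$$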
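Because each component depends only on its predecessors, the first $K$ components can be drawn exactly without touching the rest of the infinite sequence. The following is a minimal sketch of that truncated draw, not the authors' implementation. The function names `sample_vmf` and `sample_ima_prior` are hypothetical, the von Mises-Fisher draw uses Wood's rejection sampler (a standard choice that we assume here), and the gamma distribution is read from Eq. (15) as having shape $\alpha_1$ and rate $\alpha_2$.

```python
import numpy as np


def sample_vmf(mu, kappa, rng):
    """Draw one unit vector from vMF(mu, kappa) with Wood's rejection sampler.

    Assumes mu is a unit vector of dimension p >= 2 and kappa > 0.
    """
    p = mu.shape[0]
    # Step 1: sample t, the cosine between the draw and the mean direction.
    b = (-2.0 * kappa + np.sqrt(4.0 * kappa**2 + (p - 1) ** 2)) / (p - 1)
    x0 = (1.0 - b) / (1.0 + b)
    c = kappa * x0 + (p - 1) * np.log(1.0 - x0**2)
    while True:
        z = rng.beta((p - 1) / 2.0, (p - 1) / 2.0)
        t = (1.0 - (1.0 + b) * z) / (1.0 - (1.0 - b) * z)
        if kappa * t + (p - 1) * np.log(1.0 - x0 * t) - c >= np.log(rng.uniform()):
            break
    # Step 2: uniform direction in the subspace orthogonal to the first axis.
    v = rng.standard_normal(p - 1)
    v /= np.linalg.norm(v)
    x = np.concatenate(([t], np.sqrt(max(0.0, 1.0 - t**2)) * v))
    # Step 3: Householder reflection that maps the first axis onto mu.
    u = np.zeros(p)
    u[0] = 1.0
    u -= mu
    norm_u = np.linalg.norm(u)
    if norm_u > 1e-12:
        u /= norm_u
        x = x - 2.0 * u * (u @ x)
    return x


def sample_ima_prior(K, mu0, kappa, alpha1, alpha2, rng):
    """Draw the first K components w_i = r_i * w_hat_i of the IMA prior."""
    d = mu0.shape[0]
    W_hat = np.zeros((K, d))
    W_hat[0] = sample_vmf(mu0, kappa, rng)
    for i in range(1, K):
        s = W_hat[:i].sum(axis=0)
        # The mean direction points away from the sum of earlier directions.
        mean_dir = -s / np.linalg.norm(s)
        W_hat[i] = sample_vmf(mean_dir, kappa, rng)
    # Gamma with shape alpha1 and rate alpha2 (NumPy expects scale = 1 / rate).
    r = rng.gamma(shape=alpha1, scale=1.0 / alpha2, size=K)
    return W_hat, r, W_hat * r[:, None]


rng = np.random.default_rng(0)
mu0 = np.array([1.0, 0.0, 0.0, 0.0, 0.0])
W_hat, r, W = sample_ima_prior(K=4, mu0=mu0, kappa=50.0, alpha1=2.0, alpha2=1.0, rng=rng)
```

With a large concentration such as $\kappa = 50$, the second direction lands close to $-\hat{w}_1$, which illustrates how the prior pushes new components away from earlier ones. The sketch does not guard against the degenerate case in which the running sum of directions is numerically zero.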

### 4.3 Diversity-Promoting Infinite Latent Feature Model

In this section, we use the infinite latent feature model (ILFM) (Griffiths and Ghahramani, ) as an instance of a BN-LVM to showcase how to promote diversity among its components with the IMA prior. Given a set of data examples where , ILFM aims to invoke a finite subset of features from an infinite feature collection to construct these data examples. Each feature (which is a component in this LVM) is parameterized by a vector . For each data example , a subset of features is selected to construct it. The selection is denoted by a binary vector where denotes that the -th feature is invoked to construct the -th example and otherwise. Given the parameter vectors of features and the selection vector , the example can be represented as: . The binary selection vectors can be drawn either from an Indian buffet process (IBP) (Ghahramani and Griffiths, 2005) or from a stick-breaking construction (Teh and Ghahramani, 2007). Let be the prior probability that feature is present in a data example, and let the features be permuted such that their prior probabilities are in decreasing order: . According to the stick-breaking construction, these prior probabilities can be generated in the following way: , . Given , the binary indicator is generated as . To reduce the redundancy among the features, we impose the IMA prior over their parameter vectors to encourage them to be mutually different, which results in the IMA-LFM model.
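The stick-breaking step can be illustrated with a short sketch truncated at $K$ features. It assumes the standard form of the cited construction, in which stick lengths $\nu_k$ are drawn from $\mathrm{Beta}(\alpha, 1)$, the $k$-th prior probability is the product $\nu_1 \cdots \nu_k$, and each binary indicator is a Bernoulli draw with that probability. The symbols $\nu_k$ and $\alpha$, the function name `stick_breaking_ibp`, and the truncation level are assumptions made for illustration only.

```python
import numpy as np


def stick_breaking_ibp(N, K, alpha, rng):
    """Truncated stick-breaking draw of feature probabilities and selections.

    Assumed form: nu_k ~ Beta(alpha, 1), mu_k = nu_1 * ... * nu_k,
    z_nk ~ Bernoulli(mu_k) for N examples and the first K features.
    """
    nu = rng.beta(alpha, 1.0, size=K)
    mu = np.cumprod(nu)  # decreasing order holds by construction
    # Each column k of Z is a set of independent Bernoulli(mu_k) draws.
    Z = (rng.uniform(size=(N, K)) < mu).astype(int)
    return mu, Z


rng = np.random.default_rng(0)
mu, Z = stick_breaking_ibp(N=6, K=10, alpha=2.0, rng=rng)
```

Since every $\nu_k$ lies in $(0, 1)$, the cumulative product yields the decreasing ordering of prior probabilities described above without any explicit sorting.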

### 4.4 Algorithm

In this section, we develop a sampling algorithm to infer the posteriors of and in the IMA-LFM model. Two major challenges need to be addressed. First, the prior over is not conjugate to the likelihood function . Second, the parameter vectors are usually high-dimensional, which leads to slow mixing. To address the first challenge, we adopt the slice sampling algorithm (Teh and Ghahramani, 2007). This algorithm introduces an auxiliary slice variable , where is the prior probability of the last active feature. A feature is active if there exists an example such that and is inactive otherwise. In what follows, we discuss the sampling of the other variables.
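The slice step can be sketched as follows, assuming the usual form of the slice sampler for the stick-breaking construction: the auxiliary variable is drawn uniformly between zero and the prior probability of the last active feature, capped at one. The function name `sample_slice_variable`, the variable names, and the toy values of `mu` and `Z` are hypothetical and serve only as an illustration.

```python
import numpy as np


def sample_slice_variable(mu, Z, rng):
    """Draw the auxiliary slice variable uniformly on [0, mu_star].

    mu: prior probabilities of the represented features, in decreasing order.
    Z:  binary selection matrix (examples x features).
    mu_star is the prior probability of the last active feature; a feature
    is active if at least one example selects it.
    """
    active = Z.any(axis=0)
    mu_star = min(1.0, mu[active].min()) if active.any() else 1.0
    s = rng.uniform(0.0, mu_star)
    return s, mu_star, active


mu = np.array([0.9, 0.6, 0.3, 0.1])
Z = np.array([[1, 0, 0, 0],
              [1, 1, 0, 0],
              [0, 1, 0, 0]])
rng = np.random.default_rng(0)
s, mu_star, active = sample_slice_variable(mu, Z, rng)
```

In this toy example the last active feature is the second one, so the slice variable falls in $[0, 0.6]$. Under the assumed scheme, only features whose prior probability exceeds the slice variable need to be represented explicitly, which keeps each sampling sweep finite.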

Sample New Features. Let be the maximal feature index with and be the index such that all active features have index ( itself would be an inactive feature). If the new value of makes , then we draw new (inactive) features, including their parameter vectors and prior probabilities. The prior probabilities are drawn sequentially from using adaptive rejection sampling (ARS) (Gilks and Wild, 1992). The parameter vectors are drawn sequentially from

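$$p(w_k \mid \{w_j\}_{j=1}^{k-1}) = p(\hat{w}_k \mid \{\hat{w}_j\}_{j=1}^{k-1})\, p(r_k) = C_p(\kappa)\exp\left(\kappa\left(-\frac{\sum_{j=1}^{k-1}\hat{w}_j}{\left\|\sum_{j=1}^{k-1}\hat{w}_j\right\|_2}\right)^{\top}\hat{w}_k\right)\frac{\alpha_2^{\alpha_1} r_k^{\alpha_1-1} e^{-r_k\alpha_2}}{\Gamma(\alpha_1)} \tag{16}$$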

where we draw $\hat{w}_k$ from $p(\hat{w}_k \mid \{\hat{w}_j\}_{j=1}^{k-1})$, which is a von Mises-Fisher distribution, and draw $r_k$ from a gamma distribution, then multiply $\hat{w}_k$ and $r_k$ together, since they are independent. For each new feature $k$, the corresponding binary selection variables are initialized to zero.

Sample Existing We sample from , where .

Sample Given , we only need to sample for from , where and denotes all other elements in <