# Effects of Additional Data on Bayesian Clustering

Keisuke Yamazaki
k.yamazaki@aist.go.jp
Artificial Intelligence Research Center,
National Institute of Advanced Industrial Science and Technology
2-3-26 Aomi Koto-ku, Tokyo, Japan
###### Abstract

Hierarchical probabilistic models, such as mixture models, are used for cluster analysis. These models have two types of variables: observable and latent. In cluster analysis, the latent variable is estimated, and it is expected that additional information will improve the accuracy of the estimation of the latent variable. Many proposed learning methods are able to use additional data; these include semi-supervised learning and transfer learning. However, from a statistical point of view, a complex probabilistic model that encompasses both the initial and additional data might be less accurate due to having a higher-dimensional parameter. The present paper presents a theoretical analysis of the accuracy of such a model and clarifies which factor has the greatest effect on its accuracy, the advantages of obtaining additional data, and the disadvantages of increasing the complexity.
Keywords: unsupervised learning, semi-supervised learning, hierarchical parametric models, latent variable estimation

## 1 Introduction

Hierarchical probabilistic models, such as mixture models, are often used for data analysis. These models have two types of variables: observable and latent. Observable variables represent the data that can be observed, while latent variables represent the hidden processes that generate the data. In cluster analysis, for example, the observable variable represents the position of the observed data, and the latent variable provides a label that indicates from which cluster a given data point was generated.

Because there are two variables, there are also two estimation tasks. Prediction of the unseen data corresponds to estimating the observable variable. Studies have performed theoretical analyses of its accuracy, and the results have been used to determine the optimal model structure, such as by using an information criterion (Akaike, 1974a; Watanabe, 2010). On the other hand, there has not been sufficient analysis of the theoretical accuracy of estimating a latent variable. Recently, an error function that measures this accuracy has been defined, based on the Kullback-Leibler divergence, and an asymptotic analysis has shown that the two estimation tasks have different properties (Yamazaki, 2014a): when estimating the latent variable, a Bayesian clustering method is more accurate than the maximum-likelihood clustering method, although the two methods have the same asymptotic error in a prediction task.

From a statistical point of view, the degree to which the additional data improves the estimation is not trivial. The structure of the model will be more complex and thus able to accept additional data. There is a trade-off between the complexity of a model and the amount of data that is used; a more complex model will be less accurate due to the increased dimensionality of the parameter, but more data will improve the accuracy. In a prediction task, where the estimation target is the observable variable, it has been proven mathematically that the advantages of increasing the amount of data outweigh the disadvantages of increasing the complexity of the model (Yamazaki and Kaski, 2008). However, it is still an open question whether the use of additional data increases clustering accuracy in methods other than semi-supervised learning, where this has already been proven (Yamazaki, 2015a, b).

In the present paper, we extend the results of Yamazaki (2015a, b), and investigate the effect of additional data on the accuracy of Bayesian clustering. We consider a mixture model that uses both initial and additional data. When the additional data are ignored, only the initial data are used to determine the structure of the model, such as the dimensionality of the data and the number of clusters, and thus the dimensionality of the parameter decreases; that is, the model becomes less complex. By comparing the accuracy with and without the use of additional data, we clarify the effect of the additional data in the asymptotic case, that is, when the total amount of initial and additional data is sufficiently large. Moreover, the extension in the present paper allows us to elucidate the effect of more complicated overlap between the initial and the additional data sets. For example, we will deal with unlabeled additional data, whereas the former study (Yamazaki, 2015a) restricts the analysis to labeled data.

The remainder of this paper is organized as follows. Section 2 summarizes Bayesian clustering and considers its asymptotic accuracy when there are no additional data. Section 3 presents a formal definition of a mixture model that incorporates additional data and derives the asymptotic accuracy of the model. Section 4 determines under what conditions the use of additional data will improve the accuracy. Finally, we present a discussion and our conclusions in Sections 5 and 6, respectively.

## 2 Bayesian Clustering

In this section, we present a definition of Bayesian clustering and present an evaluation function for the clustering results.

We consider a mixture model defined by

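$$p(x|w) = \sum_{k=1}^{K} a_k f(x|b_k),$$

where $x$ expresses the position of a data point, $w$ is the parameter, and $f$ is the density function associated with a mixture component. The mixing ratio $a_k$ has constraints given by $a_k > 0$ for all $k$, and $\sum_{k=1}^{K} a_k = 1$. Let the dimension of $b_k$ be $d_c$, that is, $b_k = (b_{k1}, \dots, b_{kd_c})^\top$. Then, the parameter can be expressed as

$$w = (a_1, \dots, a_{K-1}, b_{11}, \dots, b_{1d_c}, \dots, b_{Kd_c})^\top.$$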

We define the data source, which generates i.i.d. data, as follows:

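$$q(x, y) = q(y)\, q(x|y) = a^*_y f(x|b^*_y),$$

where $y \in \{1, \dots, K\}$ indicates a cluster label, and $a^*_y$ and $b^*_y$ are constants. So that the clusters can be identified, we require $b^*_j \neq b^*_k$ for $j \neq k$. The data source is $q(x, y) = p(x, y|w^*)$, where $p(x, y|w) = a_y f(x|b_y)$ and

$$w^* = (a^*_1, \dots, a^*_{K-1}, b^*_{11}, \dots, b^*_{Kd_c})^\top.$$

Note that the labels $j$ and $k$ cannot be identified in unsupervised learning when the components have the same parameter, $b^*_j = b^*_k$. We refer to $w^*$ as the true parameter. Let $\{(x_1, y_1), \dots, (x_n, y_n)\}$ be generated by the data source. We use the notation $X^n = \{x_1, \dots, x_n\}$ and $Y^n = \{y_1, \dots, y_n\}$ for the sets of data positions and labels, respectively.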
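The mixture model and the data source defined above can be sketched in a few lines. This is only an illustration under assumptions that are not in the text: the component density $f$ is taken to be a one-dimensional unit-variance Gaussian, and the function names and the true parameter values are hypothetical.

```python
import math
import random


def gaussian_pdf(x, mean):
    """Unit-variance Gaussian density, standing in for the component f(x|b)."""
    return math.exp(-0.5 * (x - mean) ** 2) / math.sqrt(2.0 * math.pi)


def mixture_density(x, a, b):
    """p(x|w) = sum_k a_k f(x|b_k) for mixing ratios a and component means b."""
    return sum(a_k * gaussian_pdf(x, b_k) for a_k, b_k in zip(a, b))


def sample_data_source(n, a_true, b_true, seed=0):
    """Draw n i.i.d. pairs (x_i, y_i) from q(x, y) = a*_y f(x|b*_y)."""
    rng = random.Random(seed)
    labels = list(range(1, len(a_true) + 1))
    xs, ys = [], []
    for _ in range(n):
        y = rng.choices(labels, weights=a_true)[0]   # y ~ q(y) = a*_y
        x = rng.gauss(b_true[y - 1], 1.0)            # x ~ q(x|y) = f(x|b*_y)
        xs.append(x)
        ys.append(y)
    return xs, ys


# Hypothetical true parameter with K = 2 clusters.
a_true, b_true = [0.3, 0.7], [-2.0, 2.0]
X, Y = sample_data_source(5, a_true, b_true)
```

In cluster analysis only `X` would be handed to the estimator; `Y` is kept here solely so that an estimate could be compared with the labels that generated the data.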

Cluster analysis is formulated as estimating $Y^n$ when $X^n$ is given. When considered as a density estimation, the task is to estimate $p(Y^n|X^n)$. If $Y^n$ is observable, the task is instead to estimate unseen data. This is the prediction task of supervised learning, and its theoretical analysis has been thoroughly studied (Akaike, 1974a; Watanabe, 2001). Since the label is not given, the latent variable explicitly appears in clustering algorithms such as the expectation-maximization algorithm (Dempster et al., 1977) and the variational Bayes algorithm (Attias, 1999).

Bayesian clustering is then defined as

$$p(Y^n|X^n) = \int \prod_{i=1}^{n} p(y_i|x_i, w)\, p(w|X^n)\, dw,$$

where the conditional probability is defined as

$$p(y|x, w) = \frac{p(x, y|w)}{p(x|w)} = \frac{a_y f(x|b_y)}{p(x|w)},$$
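For a fixed parameter, the conditional probability of a label is the joint density normalized over the labels. A minimal sketch, again assuming a unit-variance Gaussian component (an assumption made here, not a choice stated in the text) and a hypothetical function name:

```python
import math


def gaussian_pdf(x, mean):
    """Unit-variance Gaussian density, standing in for f(x|b)."""
    return math.exp(-0.5 * (x - mean) ** 2) / math.sqrt(2.0 * math.pi)


def label_conditional(x, a, b):
    """Return [p(y=1|x,w), ..., p(y=K|x,w)] = a_y f(x|b_y) / p(x|w)."""
    joint = [a_k * gaussian_pdf(x, b_k) for a_k, b_k in zip(a, b)]
    marginal = sum(joint)  # p(x|w)
    return [j / marginal for j in joint]


probs = label_conditional(0.0, [0.5, 0.5], [-1.0, 1.0])
```

Bayesian clustering then averages products of these conditional probabilities over the posterior distribution of the parameter instead of plugging in a single estimate.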

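and $p(w|X^n)$ is the posterior distribution. When a prior distribution is given by $\varphi(w)$, the posterior distribution is defined as

$$p(w|X^n) = \frac{1}{Z(X^n)} \prod_{i=1}^{n} p(x_i|w)\, \varphi(w),$$

where $Z(X^n)$ is the normalizing factor

$$Z(X^n) = \int \prod_{i=1}^{n} p(x_i|w)\, \varphi(w)\, dw.$$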

If we replace $p(w|X^n)$ in $p(Y^n|X^n)$ with this definition, we find an equivalent expression for the estimated density:

$$p(Y^n|X^n) = \frac{\int \prod_{i=1}^{n} p(x_i, y_i|w)\, \varphi(w)\, dw}{\int \prod_{i=1}^{n} p(x_i|w)\, \varphi(w)\, dw}.$$
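The ratio form of the estimated density can be evaluated directly when the integrals are replaced by sums over a finite parameter grid. The sketch below makes several assumptions that are not in the text: a unit-variance Gaussian component, $K = 2$, a uniform prior over a small hypothetical grid, and a data set small enough to enumerate every label vector.

```python
import itertools
import math


def gaussian_pdf(x, mean):
    return math.exp(-0.5 * (x - mean) ** 2) / math.sqrt(2.0 * math.pi)


def joint(x, y, w):
    """p(x, y|w) = a_y f(x|b_y) with w = (a1, b1, b2) and a2 = 1 - a1."""
    a1, b1, b2 = w
    a, b = (a1, 1.0 - a1), (b1, b2)
    return a[y - 1] * gaussian_pdf(x, b[y - 1])


def marginal(x, w):
    """p(x|w) = sum_y p(x, y|w)."""
    return joint(x, 1, w) + joint(x, 2, w)


def bayes_clustering(X, grid):
    """p(Y^n|X^n) for every label vector, with a uniform prior on the grid.

    The uniform prior weight cancels between numerator and denominator.
    """
    evidence = sum(math.prod(marginal(x, w) for x in X) for w in grid)
    probs = {}
    for Y in itertools.product((1, 2), repeat=len(X)):
        numerator = sum(
            math.prod(joint(x, y, w) for x, y in zip(X, Y)) for w in grid
        )
        probs[Y] = numerator / evidence
    return probs


# Hypothetical grid that keeps component 1 on the left (b1 < 0 < b2).
grid = list(itertools.product((0.3, 0.5, 0.7), (-3.0, -2.0, -1.0), (1.0, 2.0, 3.0)))
X = [-3.0, 3.0, -3.0]
probs = bayes_clustering(X, grid)
best = max(probs, key=probs.get)
```

Summing the numerator over all label vectors recovers the denominator, so the returned values form a probability distribution over $Y^n$. Enumeration costs $K^n$ terms and is only practical for tiny $n$.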

Since the clustering task is formulated as a density estimation, the difference between the true density of $Y^n$ and the estimated density, $p(Y^n|X^n)$, can be used to evaluate the accuracy of the clustering. The true density is defined as

$$q(Y^n|X^n) = \frac{q(X^n, Y^n)}{q(X^n)} = \prod_{i=1}^{n} \frac{q(x_i, y_i)}{\sum_{y_i=1}^{K} q(x_i, y_i)}.$$

In the present paper, we will use the Kullback-Leibler divergence to measure the difference between the densities:

$$D(n) = \frac{1}{n} E_{X^n}\left[\sum_{Y^n} q(Y^n|X^n) \ln \frac{q(Y^n|X^n)}{p(Y^n|X^n)}\right],$$

where $E_{X^n}[\cdot]$ is the expectation over all $X^n$. We evaluate the density estimation of $Y^n$ since $Y^n$ is a random variable due to the generating process of the data source. This is why we consider the divergence instead of a deterministic loss function such as the 0-1 loss.
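The quantity inside the expectation can be computed exactly for one small data set by enumerating the label vectors. The following sketch assumes a unit-variance Gaussian component and hypothetical true parameter values; the uniform distribution used as the "estimate" is only a placeholder for a real clustering result.

```python
import itertools
import math


def gaussian_pdf(x, mean):
    return math.exp(-0.5 * (x - mean) ** 2) / math.sqrt(2.0 * math.pi)


def true_conditional(X, a_true, b_true):
    """q(Y^n|X^n) as a dict over label vectors; it factorizes over data points."""
    K = len(a_true)
    per_point = []
    for x in X:
        joint = [a_true[k] * gaussian_pdf(x, b_true[k]) for k in range(K)]
        total = sum(joint)
        per_point.append([j / total for j in joint])
    return {
        Y: math.prod(per_point[i][y - 1] for i, y in enumerate(Y))
        for Y in itertools.product(range(1, K + 1), repeat=len(X))
    }


def kl_over_labels(q, p):
    """sum_Y q(Y) ln(q(Y) / p(Y)) for two dicts keyed by label vectors."""
    return sum(q_y * math.log(q_y / p[Y]) for Y, q_y in q.items() if q_y > 0.0)


X = [-1.5, 0.2, 2.4]
q = true_conditional(X, [0.3, 0.7], [-2.0, 2.0])
uniform = {Y: 1.0 / len(q) for Y in q}   # placeholder estimate
divergence = kl_over_labels(q, uniform)
```

To approximate $D(n)$ itself, this divergence would be averaged over many data sets $X^n$ drawn from the data source and divided by $n$.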

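We wish to find the asymptotic form of the error function $D(n)$, that is, the case in which the number of data points $n$ is sufficiently large. Assume that the Fisher information matrices

$$\begin{aligned}
\{I_{XY}(w^*)\}_{ij} &= E\left[\frac{\partial \ln p(x, y|w^*)}{\partial w_i} \frac{\partial \ln p(x, y|w^*)}{\partial w_j}\right], \\
\{I_X(w^*)\}_{ij} &= E\left[\frac{\partial \ln p(x|w^*)}{\partial w_i} \frac{\partial \ln p(x|w^*)}{\partial w_j}\right]
\end{aligned}$$

exist and are positive definite, where the expectation is

$$E[f(x, y)] = \int \sum_{y=1}^{K} f(x, y)\, p(x, y|w^*)\, dx.$$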

This assumption corresponds to statistical regularity, which requires that the model has no redundant component compared with the data source. In the case where regularity is not satisfied, an algebraic geometrical analysis is available (Watanabe, 2001; Yamazaki, 2016). The present paper focuses on the regular case. We then have the following theorem (Yamazaki, 2014a).

###### Theorem 1

The error function $D(n)$ has the asymptotic form

$$D(n) = \frac{1}{2} \ln\det\left[I_{XY}(w^*)\, I_X(w^*)^{-1}\right] \frac{1}{n} + o\!\left(\frac{1}{n}\right).$$

Since the data source is described by the model, the posterior distribution converges to the true parameter. Then, the error goes to zero in the asymptotic limit. This theorem shows the convergence speed; the leading term has the order $1/n$, and its coefficient is determined by the Fisher information matrices.
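The coefficient in Theorem 1 can be evaluated numerically for a concrete model. The sketch below assumes a two-component, one-dimensional, unit-variance Gaussian mixture with $w = (a_1, b_1, b_2)$, and approximates the integrals over $x$ by a Riemann sum on a grid; the model choice, the grid, and the function names are all assumptions made for illustration.

```python
import numpy as np


def fisher_matrices(a1, b1, b2, lo=-12.0, hi=12.0, num=4801):
    """Return (I_XY, I_X) at w = (a1, b1, b2) by numerical integration."""
    x = np.linspace(lo, hi, num)
    dx = x[1] - x[0]
    a = np.array([a1, 1.0 - a1])
    b = np.array([b1, b2])
    # joint[k, m] = a_k f(x_m|b_k), i.e. p(x, y=k+1|w)
    joint = a[:, None] * np.exp(-0.5 * (x[None, :] - b[:, None]) ** 2) / np.sqrt(2.0 * np.pi)
    marginal = joint.sum(axis=0)            # p(x|w)
    resp = joint / marginal                 # p(y|x, w)
    zeros = np.zeros_like(x)
    # score[k, i, m] = d ln p(x_m, y=k+1|w) / d w_i for w = (a1, b1, b2)
    score = np.array([
        [np.full_like(x, 1.0 / a1), x - b1, zeros],
        [np.full_like(x, -1.0 / (1.0 - a1)), zeros, x - b2],
    ])
    I_xy = np.einsum("km,kim,kjm->ij", joint, score, score) * dx
    # The score of ln p(x|w) is the conditional mean of the complete-data score.
    mean_score = np.einsum("km,kim->im", resp, score)
    I_x = np.einsum("m,im,jm->ij", marginal, mean_score, mean_score) * dx
    return I_xy, I_x


def theorem1_coefficient(I_xy, I_x):
    """(1/2) ln det[I_XY I_X^{-1}], the coefficient of 1/n in D(n)."""
    _, logdet_xy = np.linalg.slogdet(I_xy)
    _, logdet_x = np.linalg.slogdet(I_x)
    return 0.5 * (logdet_xy - logdet_x)


I_xy, I_x = fisher_matrices(a1=0.5, b1=-2.0, b2=2.0)
coefficient = theorem1_coefficient(I_xy, I_x)
```

Because $I_{XY} - I_X$ is positive semidefinite (the labels can only add information), the coefficient is nonnegative; it shrinks as the components separate and the labels become easier to infer from $x$.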

## 3 Formal Definition of an Additional Data Set

In this section, we formally define an additional data set and perform a clustering task for a given data set. We then derive the asymptotic form of the error function.

### 3.1 Formulation of Data and Clustering

Let the initial data set be denoted $D_i$, that is, $D_i = X^n$. Let an additional data set be denoted $D_a$. We assume that the additional data comprise ordered elements:

$$D_a = \{z_{n+1}, \dots, z_{n+\alpha n}\},$$

where $\alpha$ is positive and real, and $\alpha n$ is an integer. The element $z_i$ is $z_i = x_i$ for an unlabeled case, and it is $z_i = (x_i, y_i)$ for a labeled case. Let $p_a(z|v)$ be the density function of $z$, where $v$ is the parameter. Assume that the data source of the additional data is defined by

$$q_a(z) = p_a(z|v^*),$$

where a constant $v^*$ is the true parameter. Also, assume that the following Fisher information matrix exists and is positive definite:

$$\{I_Z(v^*)\}_{ij} = E_z\left[\frac{\partial \ln p_a(z|v^*)}{\partial v_i} \frac{\partial \ln p_a(z|v^*)}{\partial v_j}\right],$$

where the expectation is based on $p_a(z|v^*)$; for the unlabeled case,

$$E_z[f(z)] = \int f(x)\, p_a(x|v^*)\, dx,$$

and for the labeled case,

$$E_z[f(z)] = \int \sum_{y} f(x, y)\, p_a(x, y|v^*)\, dx.$$

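Let us consider a parameter vector $u$ defined by

$$u = (u_1, \dots, u_{d_1}, u_{d_1+1}, \dots, u_{d_1+d_2}, u_{d_1+d_2+1}, \dots, u_{d_1+d_2+d_3})^\top.$$

This parameter is divided into three parts: $(u_1, \dots, u_{d_1})$ contains the elements included in $w$ but not in $v$, $(u_{d_1+1}, \dots, u_{d_1+d_2})$ contains the elements included in both $w$ and $v$, and $(u_{d_1+d_2+1}, \dots, u_{d_1+d_2+d_3})$ contains the elements included in $v$ but not in $w$. This means that there are permutations $\psi_i$ and $\psi_a$ defined by

$$\begin{aligned}
(u_1, \dots, u_{d_1}, u_{d_1+1}, \dots, u_{d_1+d_2})^\top &= \psi_i(w), \\
(u_{d_1+1}, \dots, u_{d_1+d_2}, u_{d_1+d_2+1}, \dots, u_{d_1+d_2+d_3})^\top &= \psi_a(v),
\end{aligned}$$

which are sorting functions for $w$ and $v$, respectively. Considering these permutations, we use the notation $p_i(x, y|u)$ and $p_i(x|u)$ for the initial data set, and $p_a(x, y|u)$ and $p_a(x|u)$ for the additional data set.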

Bayesian clustering with an additional data set is defined as

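$$p(Y^n|X^n, D_a) = \frac{\int \prod_{j=1}^{n} p_i(x_j, y_j|u) \prod_{i=n+1}^{n+\alpha n} p_a(z_i|u)\, \varphi(u)\, du}{\int \prod_{j=1}^{n} p_i(x_j|u) \prod_{i=n+1}^{n+\alpha n} p_a(z_i|u)\, \varphi(u)\, du},$$

where the prior distribution is $\varphi(u)$. Note that the estimation target is $Y^n$, and labels are not estimated for the additional data, even if they are unlabeled.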
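In this ratio, the additional data act as an extra likelihood factor that reweights the parameter values before the labels of the initial data are scored. A minimal sketch, assuming labeled additional data generated by the same two-component unit-variance Gaussian mixture and a hypothetical two-point parameter grid with a uniform prior (none of these choices come from the text):

```python
import itertools
import math


def gaussian_pdf(x, mean):
    return math.exp(-0.5 * (x - mean) ** 2) / math.sqrt(2.0 * math.pi)


def joint(x, y, w):
    """p(x, y|u) = a_y f(x|b_y) with w = (a1, b1, b2) and a2 = 1 - a1."""
    a1, b1, b2 = w
    a, b = (a1, 1.0 - a1), (b1, b2)
    return a[y - 1] * gaussian_pdf(x, b[y - 1])


def marginal(x, w):
    return joint(x, 1, w) + joint(x, 2, w)


def clustering_with_additional(X, labeled_extra, grid):
    """p(Y^n|X^n, D_a) over all label vectors, uniform prior on the grid."""
    # Likelihood of the additional data at each grid point.
    extra = [math.prod(joint(x, y, w) for x, y in labeled_extra) for w in grid]
    evidence = sum(
        e * math.prod(marginal(x, w) for x in X) for e, w in zip(extra, grid)
    )
    probs = {}
    for Y in itertools.product((1, 2), repeat=len(X)):
        numerator = sum(
            e * math.prod(joint(x, y, w) for x, y in zip(X, Y))
            for e, w in zip(extra, grid)
        )
        probs[Y] = numerator / evidence
    return probs


# The grid contains the same two components under both label assignments.
grid = [(0.5, -2.0, 2.0), (0.5, 2.0, -2.0)]
X = [-2.1, 1.9]
labeled_extra = [(-2.0, 1), (2.0, 2)]

no_extra = clustering_with_additional(X, [], grid)
with_extra = clustering_with_additional(X, labeled_extra, grid)
```

Without additional data the two labelings `(1, 2)` and `(2, 1)` are equally probable, because the grid is symmetric under swapping the labels. The labeled additional data break this symmetry and concentrate the probability on `(1, 2)`.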

### 3.2 Four Cases of Additional Data Sets

According to the division of the parameter dimension into $d_1$, $d_2$, and $d_3$, we will consider the following four cases:

1. $d_1 = 0$, $d_3 = 0$,

2. $d_1 > 0$, $d_3 = 0$,

3. $d_1 = 0$, $d_3 > 0$,

4. $d_1 > 0$, $d_3 > 0$,

where $d_2 > 0$ is assumed in all the cases. Note that the additional data do not affect the clustering result when $d_2 = 0$ since there is no overlap between the models for the initial and the additional data.

The following two examples belong to the first case, where $d_1 = d_3 = 0$:

###### Example 2 (Semi-supervised learning)

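Semi-supervised classification (Type II in Yamazaki, 2012; Zhu, 2007) is described by

$$\begin{aligned}
p_i(x|u) &= p(x|w) = \sum_{k=1}^{K} a_k f(x|b_k), \\
p_a(x, y|u) &= p(x, y|w) = a_y f(x|b_y),
\end{aligned}$$

where the parameter is given by

$$u = (a_1, \dots, a_{K-1}, b_{11}, \dots, b_{Kd_c})^\top.$$

In this case, $u = w = v$ and $d_1 = d_3 = 0$. The unlabeled data $X^n$ and the labeled data $D_a$ are generated by $p_i(x|u^*)$ and $p_a(x, y|u^*)$, respectively. The clustering task is to estimate the density of $Y^n$:

$$p(Y^n|X^n, D_a) = \frac{\int \prod_{i=1}^{n} a_{y_i} f(x_i|b_{y_i}) \prod_{i=n+1}^{n+\alpha n} a_{y_i} f(x_i|b_{y_i})\, \varphi(u)\, du}{\int \prod_{i=1}^{n} \sum_{y=1}^{K} a_y f(x_i|b_y) \prod_{i=n+1}^{n+\alpha n} a_{y_i} f(x_i|b_{y_i})\, \varphi(u)\, du}.$$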

The schematic relation between the initial and the additional data sets is shown in the top-left panel of Figure 1.

###### Example 3

Clustering of a partial data set (Yamazaki, 2014a) is described by

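$$p_i(x|u) = p_a(x|u) = \sum_{k=1}^{K} a_k f(x|b_k),$$

where $u = w = v$ and $d_1 = d_3 = 0$. Both the initial data and the additional data are unlabeled, which corresponds to the estimation of $n$ labels based on $(n + \alpha n)$ data points:

$$p(Y^n|X^n, D_a) = \frac{\int \prod_{i=1}^{n} a_{y_i} f(x_i|b_{y_i}) \prod_{i=n+1}^{n+\alpha n} \sum_{y=1}^{K} a_y f(x_i|b_y)\, \varphi(u)\, du}{\int \prod_{i=1}^{n} \sum_{y=1}^{K} a_y f(x_i|b_y) \prod_{i=n+1}^{n+\alpha n} \sum_{y=1}^{K} a_y f(x_i|b_y)\, \varphi(u)\, du}.$$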

The relation between the initial and the additional data sets is shown in the top-right panel of Figure 1.

The next example belongs to the second case, where $d_1 > 0$ and $d_3 = 0$:

###### Example 4

Suppose some cluster provides labeled data in the additional data set. For example, suppose the labeled data of the first cluster, which is the target, are given in $D_a$. In other words, positively labeled data are additionally given (du Plessis et al., 2015). Then, the density functions are defined as

$$\begin{aligned}
p_i(x|u) &= \sum_{k=1}^{K} a_k f(x|b_k), \\
p_a(x|u) &= f(x|b_1),
\end{aligned}$$

where $v = b_1$. The parameter vector is expressed by

$$u = (a_1, \dots, a_{K-1}, b_{21}, \dots, b_{Kd_c}, b_{11}, \dots, b_{1d_c})^\top,$$

where the common part is $(b_{11}, \dots, b_{1d_c})$, and $d_2 = d_c$. The estimated density is given by

$$p(Y^n|X^n, D_a) = \frac{\int \prod_{i=1}^{n} a_{y_i} f(x_i|b_{y_i}) \prod_{i=n+1}^{n+\alpha n} f(x_i|b_1)\, \varphi(u)\, du}{\int \prod_{i=1}^{n} \sum_{y=1}^{K} a_y f(x_i|b_y) \prod_{i=n+1}^{n+\alpha n} f(x_i|b_1)\, \varphi(u)\, du}.$$

The relation between the initial and the additional data sets is shown in the middle-left panel of Figure 1.

The third case, where $d_1 = 0$ and $d_3 > 0$, has the following example:

###### Example 5

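When a new feature $x'$ is added to $x$, the density functions are defined as

$$\begin{aligned}
p_i(x|u) &= \sum_{k=1}^{K} a_k f(x|b_k), \\
p_a(x, x'|u) &= \sum_{k=1}^{K} a_k f(x|b_k)\, g(x'|c_k),
\end{aligned}$$

where $z = (x, x')$, and $x'$ is generated by $g(x'|c_k)$. For simplicity, let $x'$ and $c_k$ be scalar, and let $x'$ be conditionally independent of $x$. Note that the asymptotic results of the present paper hold when this assumption is not satisfied. The parameter vector is expressed as

$$u = (a_1, \dots, a_{K-1}, b_{11}, \dots, b_{Kd_c}, c_1, \dots, c_K)^\top,$$

where $(a_1, \dots, a_{K-1}, b_{11}, \dots, b_{Kd_c})$ is the common part, $d_1 = 0$, and $d_3 = K$. The estimated density is given by

$$p(Y^n|X^n, D_a) = \frac{\int \prod_{i=1}^{n} a_{y_i} f(x_i|b_{y_i}) \prod_{i=n+1}^{n+\alpha n} \sum_{y=1}^{K} a_y f(x_i|b_y)\, g(x'_i|c_y)\, \varphi(u)\, du}{\int \prod_{i=1}^{n} \sum_{y=1}^{K} a_y f(x_i|b_y) \prod_{i=n+1}^{n+\alpha n} \sum_{y=1}^{K} a_y f(x_i|b_y)\, g(x'_i|c_y)\, \varphi(u)\, du}.$$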

The relation between the initial and the additional data sets is shown in the middle-right panel of Figure 1.

Lastly, the fourth case, where $d_1 > 0$ and $d_3 > 0$, has the following example:

###### Example 6

When the class-prior changes (du Plessis and Sugiyama, 2014; Yamazaki, 2015a), the density functions are described by

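$$\begin{aligned}
p_i(x|u) &= \sum_{k=1}^{K} a_k f(x|b_k), \\
p_a(x|u) &= \sum_{k=1}^{K} c_k f(x|b_k),
\end{aligned}$$

where the mixing ratio of the additional data, $c_k$ for $k = 1, \dots, K$, is different from that of the initial data, $a_k$. In the analysis of the previous study, the additional data are restricted to labeled data (Yamazaki, 2015a), which is expressed as

$$p_a(x, y|u) = c_y f(x|b_y).$$

In the present paper, we extend the situation to the unlabeled case. The parameter vector is given by

$$u = (a_1, \dots, a_{K-1}, b_{11}, \dots, b_{Kd_c}, c_1, \dots, c_{K-1})^\top,$$

where $(b_{11}, \dots, b_{Kd_c})$ is the common part, and $d_1 = d_3 = K - 1$. The estimated density is given by

$$p(Y^n|X^n, D_a) = \frac{\int \prod_{i=1}^{n} a_{y_i} f(x_i|b_{y_i}) \prod_{i=n+1}^{n+\alpha n} \sum_{y=1}^{K} c_y f(x_i|b_y)\, \varphi(u)\, du}{\int \prod_{i=1}^{n} \sum_{y=1}^{K} a_y f(x_i|b_y) \prod_{i=n+1}^{n+\alpha n} \sum_{y=1}^{K} c_y f(x_i|b_y)\, \varphi(u)\, du}.$$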

The relation between the initial and the additional data sets is shown in the bottom panel of Figure 1. The notation and show that the latent variables are based on a model with a mixing ratio of to .

Example 4 is a special case of Examples 2 and 6.
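As a quick bookkeeping aid, the case of an additional data set can be read off from the three dimensions $(d_1, d_2, d_3)$. The sketch below encodes the four cases as they follow from the parameter vectors of Examples 2 to 6; the function name and the concrete values $K = 3$, $d_c = 2$ are hypothetical.

```python
def additional_data_case(d1, d2, d3):
    """Return the case number (1-4) for the split of dim u into d1, d2, d3.

    d1: elements only in the initial-data model, d2: shared elements,
    d3: elements only in the additional-data model.
    """
    if min(d1, d2, d3) < 0:
        raise ValueError("dimensions must be nonnegative")
    if d2 == 0:
        raise ValueError("d2 = 0: the models share no parameter")
    if d1 == 0 and d3 == 0:
        return 1
    if d1 > 0 and d3 == 0:
        return 2
    if d1 == 0 and d3 > 0:
        return 3
    return 4


K, dc = 3, 2  # hypothetical number of clusters and component dimension
cases = {
    "Example 2": additional_data_case(0, K - 1 + K * dc, 0),
    "Example 4": additional_data_case((K - 1) + (K - 1) * dc, dc, 0),
    "Example 5": additional_data_case(0, K - 1 + K * dc, K),
    "Example 6": additional_data_case(K - 1, K * dc, K - 1),
}
```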

### 3.3 Error Function with Additional Data

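In the previous studies, we formulated the error function and derived its asymptotic form in each case (Yamazaki, 2014a, 2015a, b). Here, we show the unified formulation and derivation of the error function. The error function is given by

$$D_a(n) = \frac{1}{n} E_{XD}\left[\sum_{Y^n} q(Y^n|X^n) \ln \frac{q(Y^n|X^n)}{p(Y^n|X^n, D_a)}\right], \tag{1}$$

where $E_{XD}[\cdot]$ is the expectation over all $X^n$ and $D_a$. Define three Fisher information matrices as

$$\begin{aligned}
\{I_{XY}(u)\}_{jk} &= E\left[\frac{\partial \ln p_i(x, y|u)}{\partial u_j} \frac{\partial \ln p_i(x, y|u)}{\partial u_k}\right], \\
\{I_X(u)\}_{jk} &= E\left[\frac{\partial \ln p_i(x|u)}{\partial u_j} \frac{\partial \ln p_i(x|u)}{\partial u_k}\right], \\
\{I_Z(u)\}_{jk} &= E_z\left[\frac{\partial \ln p_a(z|u)}{\partial u_j} \frac{\partial \ln p_a(z|u)}{\partial u_k}\right].
\end{aligned}$$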

An asymptotic property of the error is determined by these matrices.

###### Theorem 7

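The error function $D_a(n)$ has the asymptotic form

$$D_a(n) = \frac{1}{2} \ln\det\left[J_{XY}(u^*)\, J_X(u^*)^{-1}\right] \frac{1}{n} + o\!\left(\frac{1}{n}\right),$$

where

$$\begin{aligned}
J_{XY}(u) &= I_{XY}(u) + \alpha I_Z(u), \\
J_X(u) &= I_X(u) + \alpha I_Z(u).
\end{aligned}$$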

The proof generalizes the derivation of Theorem 2 in Yamazaki (2015b).

Proof of Theorem 7:
Based on the definition, the error function can be divided into two parts:

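$$n D_a(n) = F_{XY}(n) - F_X(n),$$

where the free energy functions are given by

$$\begin{aligned}
F_{XY}(n) &= -n S_{XY} - E_{XD}\left[\ln \int \prod_{j=1}^{n} p_i(x_j, y_j|u) \prod_{i=n+1}^{n+\alpha n} p_a(z_i|u)\, \varphi(u)\, du\right], \\
F_X(n) &= -n S_X - E_{XD}\left[\ln \int \prod_{j=1}^{n} p_i(x_j|u) \prod_{i=n+1}^{n+\alpha n} p_a(z_i|u)\, \varphi(u)\, du\right].
\end{aligned}$$

The entropy functions are defined as

$$\begin{aligned}
S_{XY} &= E[-\ln p_i(x, y|u^*)], \\
S_X &= E[-\ln p_i(x|u^*)].
\end{aligned}$$

Based on the saddle point approximation and the assumptions on the Fisher information matrices $I_{XY}(u^*)$, $I_X(u^*)$, and $I_Z(u^*)$, it has been shown that the free energy functions have the following asymptotic forms (Clarke and Barron, 1990; Yamazaki, 2014a, 2015a):

$$\begin{aligned}
F_{XY}(n) &= \frac{\dim u}{2} \ln \frac{n}{2\pi e} + \ln \frac{\sqrt{\det J_{XY}(u^*)}}{\varphi(u^*)} + o(1), \\
F_X(n) &= \frac{\dim u}{2} \ln \frac{n}{2\pi e} + \ln \frac{\sqrt{\det J_X(u^*)}}{\varphi(u^*)} + o(1).
\end{aligned}$$

Rewriting the energy functions in their asymptotic form, we obtain

$$n D_a(n) = \frac{1}{2} \ln\det J_{XY}(u^*) - \frac{1}{2} \ln\det J_X(u^*) + o(1),$$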

which completes the proof. (End of Proof)
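The leading coefficient in Theorem 7 is straightforward to evaluate once the three Fisher information matrices are available. The sketch below uses small made-up diagonal matrices purely for illustration. Setting $I_Z = I_X$ mimics unlabeled additional data from the same model (as in Example 3), and $I_Z = I_{XY}$ mimics labeled additional data (as in Example 2); these identifications are my reading of the examples, not a statement from the text.

```python
import numpy as np


def logdet(matrix):
    """Log-determinant of a matrix that is expected to be positive definite."""
    sign, value = np.linalg.slogdet(matrix)
    if sign <= 0:
        raise ValueError("matrix must have a positive determinant")
    return value


def error_coefficient(I_xy, I_x, I_z, alpha):
    """(1/2) ln det[J_XY J_X^{-1}] with J = I + alpha * I_Z."""
    J_xy = I_xy + alpha * I_z
    J_x = I_x + alpha * I_z
    return 0.5 * (logdet(J_xy) - logdet(J_x))


# Made-up Fisher information matrices with I_XY - I_X positive definite.
I_xy = np.diag([2.0, 3.0])
I_x = np.diag([1.0, 1.0])

no_extra = error_coefficient(I_xy, I_x, np.zeros((2, 2)), alpha=0.0)
unlabeled = error_coefficient(I_xy, I_x, I_x, alpha=1.0)
labeled = error_coefficient(I_xy, I_x, I_xy, alpha=1.0)
```

For these matrices the coefficient drops from `no_extra` to `unlabeled` and further to `labeled`, which matches the intuition that labeled additional data carry more information about the shared parameter than unlabeled data of the same size.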

## 4 Effective Additional Data Sets

In this section, we determine when the use of an additional data set makes the estimation more accurate.

### 4.1 Formal Definition of Effective Data Set and Sufficient Condition for Effectiveness

Using the asymptotic form of the error functions, we formulate an additional data set that improves the accuracy as follows.

###### Definition 8 (Effective data set)

If the difference between the error with and without a particular additional data set satisfies the following condition, then the data set is effective: there exists a positive constant $C$ such that

$$D(n) - D_a(n) = \frac{C}{n} + o\!\left(\frac{1}{n}\right).$$

According to this definition, $D_a$ is effective if the leading term of the asymptotic form of $D_a(n)$ is smaller than that of $D(n)$.

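Let us rewrite $I_{XY}(u^*)$ and $I_X(u^*)$ as block matrices:

$$I_{XY}(u^*) = \begin{pmatrix} K_{11} & K_{12} \\ K_{21} & K_{22} \end{pmatrix}, \qquad I_X(u^*) = \begin{pmatrix} L_{11} & L_{12} \\ L_{21} & L_{22} \end{pmatrix},$$

where $K_{11}$ and $L_{11}$ are $d_1 \times d_1$ matrices, and $K_{22}$ and $L_{22}$ are $d_2 \times d_2$ matrices. We define the block elements of the inverse matrices as

$$I_{XY}(u^*)^{-1} = \begin{pmatrix} \tilde{K}_{11} & \tilde{K}_{12} \\ \tilde{K}_{21} & \tilde{K}_{22} \end{pmatrix}, \qquad I_X(u^*)^{-1} = \begin{pmatrix} \tilde{L}_{11} & \tilde{L}_{12} \\ \tilde{L}_{21} & \tilde{L}_{22} \end{pmatrix}.$$

If $d_1 = 0$, we define the block matrix as

$$I_Z(u^*) = \begin{pmatrix} A_{22} & A_{23} \\ A_{32} & A_{33} \end{pmatrix},$$

where $A_{22}$ is a $d_2 \times d_2$ matrix and $A_{33}$ is a $d_3 \times d_3$ matrix. Otherwise, we define it as

$$I_Z(u^*) = \begin{pmatrix} 0 & 0 & 0 \\ 0 & A_{22} & A_{23} \\ 0 & A_{32} & A_{33} \end{pmatrix}.$$

We also define the following inverse block matrix:

$$\begin{pmatrix} A_{22} & A_{23} \\ A_{32} & A_{33} \end{pmatrix}^{-1} = \begin{pmatrix} \tilde{A}_{22} & \tilde{A}_{23} \\ \tilde{A}_{32} & \tilde{A}_{33} \end{pmatrix}.$$

Theorem 9 provides a unified asymptotic expression for the difference between the error functions $D(n)$ and $D_a(n)$ in all cases, which gives a sufficient condition for additional data to be effective.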

###### Theorem 9

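Let the block matrices of the Fisher information matrices $I_{XY}(u^*)$, $I_X(u^*)$, and $I_Z(u^*)$ be defined as above. Let the eigenvalues of $\tilde{K}_{22}\tilde{A}_{22}^{-1}$ be $\lambda_1, \dots, \lambda_{d_2}$ and those of $\tilde{L}_{22}\tilde{A}_{22}^{-1}$ be $\mu_1, \dots, \mu_{d_2}$. The asymptotic difference of the errors $D(n)$ and $D_a(n)$ is expressed as follows:

$$D(n) - D_a(n) = \frac{1}{2} \ln\left(\prod_{i=1}^{d_2} \frac{1 + \alpha\mu_i}{1 + \alpha\lambda_i}\right) \frac{1}{n} + o\!\left(\frac{1}{n}\right).$$

The following condition is necessary and sufficient to ensure that the additional data set is effective:

$$\prod_{i=1}^{d_2} (1 + \alpha\mu_i) > \prod_{i=1}^{d_2} (1 + \alpha\lambda_i). \tag{2}$$

The following condition is sufficient: for all $i$,

$$\mu_i > \lambda_i.$$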

On the other hand, if the coefficient is negative, the additional data degrade the accuracy.

It is obvious that the sufficient condition implies

$$1 + \alpha\mu_i > 1 + \alpha\lambda_i > 0,$$

which satisfies Eq. (2).

The proof of this theorem is presented in the next subsection, and we will provide an interpretation of the theorem in Section 5.
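The eigenvalue expression can be checked numerically against the determinant form obtained from Theorems 1 and 7. The sketch below assumes that $\lambda_i$ and $\mu_i$ are the eigenvalues of $\tilde{K}_{22}\tilde{A}_{22}^{-1}$ and $\tilde{L}_{22}\tilde{A}_{22}^{-1}$, respectively, and uses random symmetric positive definite blocks rather than Fisher information matrices of an actual model, so it verifies only the algebraic identity; all names and dimensions are hypothetical.

```python
import numpy as np


def random_spd(dim, rng):
    """Random symmetric positive definite matrix."""
    m = rng.standard_normal((dim, dim))
    return m @ m.T + dim * np.eye(dim)


def logdet(matrix):
    sign, value = np.linalg.slogdet(matrix)
    assert sign > 0
    return value


def embed(block, offset, total):
    """Place a square block on the diagonal of a total x total zero matrix."""
    full = np.zeros((total, total))
    size = block.shape[0]
    full[offset:offset + size, offset:offset + size] = block
    return full


d1, d2, d3, alpha = 2, 3, 2, 0.7
total = d1 + d2 + d3
rng = np.random.default_rng(0)
K = random_spd(d1 + d2, rng)   # nonzero blocks of I_XY(u*)
L = random_spd(d1 + d2, rng)   # nonzero blocks of I_X(u*)
A = random_spd(d2 + d3, rng)   # nonzero blocks of I_Z(u*)

I_xy, I_x, I_z = embed(K, 0, total), embed(L, 0, total), embed(A, d1, total)

# Determinant form: coefficient of 1/n in D(n) - D_a(n).
direct = 0.5 * (logdet(K) - logdet(L)) - 0.5 * (
    logdet(I_xy + alpha * I_z) - logdet(I_x + alpha * I_z)
)

# Eigenvalue form from the d2 x d2 blocks of the inverse matrices.
K22_tilde = np.linalg.inv(K)[d1:, d1:]
L22_tilde = np.linalg.inv(L)[d1:, d1:]
A22_tilde_inv = np.linalg.inv(np.linalg.inv(A)[:d2, :d2])
lam = np.linalg.eigvals(K22_tilde @ A22_tilde_inv).real
mu = np.linalg.eigvals(L22_tilde @ A22_tilde_inv).real
eigen_form = 0.5 * np.sum(np.log1p(alpha * mu) - np.log1p(alpha * lam))
```

The two quantities `direct` and `eigen_form` agree up to floating-point error. The sign of the common value tells whether the additional data set would be effective for these particular matrices.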

### 4.2 Proof of Theorem 9

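We will first show the proof of the most general case, where $d_1 > 0$ and $d_3 > 0$. Since $p_i$ does not depend on $(u_{d_1+d_2+1}, \dots, u_{d_1+d_2+d_3})$ and $p_a$ does not depend on $(u_1, \dots, u_{d_1})$, $I_{XY}(u^*)$, $I_X(u^*)$, and $I_Z(u^*)$ can be rewritten as block matrices:

$$I_{XY}(u^*) = \begin{pmatrix} K_{11} & K_{12} & 0 \\ K_{21} & K_{22} & 0 \\ 0 & 0 & 0 \end{pmatrix}, \quad I_X(u^*) = \begin{pmatrix} L_{11} & L_{12} & 0 \\ L_{21} & L_{22} & 0 \\ 0 & 0 & 0 \end{pmatrix}, \quad I_Z(u^*) = \begin{pmatrix} 0 & 0 & 0 \\ 0 & A_{22} & A_{23} \\ 0 & A_{32} & A_{33} \end{pmatrix}.$$

According to Theorems 1 and 7,

$$\begin{aligned}
D(n) - D_a(n) &= \frac{C_p}{2n} + o\!\left(\frac{1}{n}\right), \\
C_p &= \ln\det\begin{pmatrix} K_{11} & K_{12} \\ K_{21} & K_{22} \end{pmatrix} - \ln\det\begin{pmatrix} L_{11} & L_{12} \\ L_{21} & L_{22} \end{pmatrix} \\
&\quad - \ln\det\left\{\begin{pmatrix} K_{11} & K_{12} & 0 \\ K_{21} & K_{22} & 0 \\ 0 & 0 & 0 \end{pmatrix} + \alpha\begin{pmatrix} 0 & 0 & 0 \\ 0 & A_{22} & A_{23} \\ 0 & A_{32} & A_{33} \end{pmatrix}\right\} \\
&\quad + \ln\det\left\{\begin{pmatrix} L_{11} & L_{12} & 0 \\ L_{21} & L_{22} & 0 \\ 0 & 0 & 0 \end{pmatrix} + \alpha\begin{pmatrix} 0 & 0 & 0 \\ 0 & A_{22} & A_{23} \\ 0 & A_{32} & A_{33} \end{pmatrix}\right\}.
\end{aligned}$$

The coefficient $C_p$ can be rewritten as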

 Cp= lndet(K11K12K21K22)−lndet(L11L12L21L22) −lndet{⎛⎜⎝K11K120K21K220000⎞⎟⎠⎛⎜ ⎜⎝Ed1000~A22~A230~A32~A33⎞⎟ ⎟⎠ +α⎛⎜⎝0000A22A230A32A33⎞⎟⎠⎛⎜ ⎜⎝Ed1000~A22~A230~A32~A33⎞⎟ ⎟⎠} +lndet{⎛⎜⎝L11L120L21L220000⎞⎟⎠⎛⎜ ⎜⎝Ed1000~A22~A230~A32