Histogram Meets Topic Model: Density Estimation by Mixture of Histograms


Abstract

The histogram method is a powerful non-parametric approach for estimating the probability density function of a continuous variable. However, compared with parametric approaches, constructing a histogram demands a large number of observations to capture the underlying density function, so it is not well suited to analyzing a sparse data set, i.e., a collection of units each containing only a small amount of data. In this paper, by employing the probabilistic topic model, we develop a novel Bayesian approach that alleviates the sparsity problem of conventional histogram estimation. Our method estimates a unit's density function as a mixture of basis histograms, in which the number of bins for each basis, as well as their heights, is determined automatically. The estimation procedure is performed by the fast and easy-to-implement collapsed Gibbs sampling. We apply the proposed method to synthetic data, showing that it performs well.

1 Introduction

The histogram, a non-parametric density estimator, has been used extensively for analyzing the probability density function of continuous variables [1]. Histograms are flexible enough to model various properties of the underlying density, such as multi-modality, although they usually demand a large number of samples to obtain a good estimate. Thus the histogram method cannot be applied directly to a sparse data set, that is, a collection of units each with only a small data set. With recent advances in technology, however, it has become increasingly important to analyze diverse continuous-variable data such as purchase timing [5], period of word appearance [7], check-in location [8], and neuronal spike time [9], which are often sparse.

In this paper, by employing the probabilistic topic model called latent Dirichlet allocation [10], we propose a novel Bayesian approach to estimating probability density functions. The proposed method estimates a unit's density function as a mixture of basis histograms, in which the unit's density function is characterized by a small number of mixture weights, alleviating the sparsity problem of the conventional histogram method. Furthermore, the model optimizes the bin width, as well as the bin heights, at the level of individual bases. Thus the model can implement a variable-width bin histogram as a mixture of regularly-binned histograms of different bin widths. We show that the estimation procedure in the proposed model can be performed by the fast and easy-to-implement collapsed Gibbs sampling [11]. We apply the proposed method to synthetic data, showing that it performs well.

2 Model

Table 1: Notation

Symbol            Definition
U                 number of units
n_u               number of variables in unit u
N                 total number of variables
K                 number of basis histograms
u                 index of a unit, u = 1, ..., U
t_j               j-th variable in the collection, j = 1, ..., N
u_j               unit which generated t_j
z_j               latent variable of t_j, z_j \in {1, ..., K}
[x_min, x_max)    half-open range of the variable
x_w               lower boundary of the w-th bin
W_k               number of bins in the k-th histogram
\theta_k^{(u)}    weight of the k-th histogram on unit u
\phi_w^{(k)}      probability mass of the w-th bin in the k-th histogram, \sum_w \phi_w^{(k)} = 1
w_j^{(k)}         index of the bin within which t_j falls under the k-th bin number W_k

Suppose that we have a collection of U units, each of which consists of n_u continuous variables generated by that unit (u = 1, ..., U). For convenience, we number all the variables in the collection from 1 to N (in an arbitrary order), and define the set of collections t = {t_1, ..., t_N} and u = {u_1, ..., u_N}, where t_j and u_j are the j-th variable and the unit which generated it, respectively. The notation is summarized in Table 1.

In our proposed model, each continuous variable t_j is assumed to be generated from a mixture of histograms. In fact, as a generative process, a histogram can be described as a piecewise-constant probability density function [2], which is a key point of our model construction. We first provide a description of the piecewise-constant distribution.

2.1 Histogram: piecewise-constant distribution

Figure 1: (A) A histogram is described by a piecewise-constant probability density function. Sampling t_j from a histogram can be implemented in two steps: draw a bin index w_j (1 \leq w_j \leq W) from a multinomial distribution, and then draw t_j from the uniform distribution defined over the corresponding bin range [x_{w_j}, x_{w_j+1}). (B) Graphical model representation of HistLDA, a model which includes a histogram.

The histogram method describes an underlying density function by (i) discretizing a half-open range of the variable, [x_min, x_max), into W contiguous intervals (bins) of equal width \Delta = (x_max - x_min)/W, and (ii) assigning a constant probability density, \phi_w/\Delta, to each bin region [x_w, x_{w+1}), for w = 1, ..., W. Here the lower boundary of the w-th bin, x_w, is given by x_w = x_min + (w - 1)\Delta. Under this construction, a continuous variable t follows the piecewise-constant distribution

    p(t | \phi, W) = \sum_{w=1}^{W} (\phi_w / \Delta) 1[x_w \leq t < x_{w+1}],    (1)

or equivalently,

    p(t | \phi, W) = \phi_{h(t)} / \Delta,    (2)

where \phi_w (with \sum_{w=1}^{W} \phi_w = 1) is the probability mass of the w-th bin region [x_w, x_{w+1}), \phi = (\phi_1, ..., \phi_W), and h(·) represents a discretization operator that transforms a continuous variable into the corresponding bin index, defined by

    h(t) = w  if  x_w \leq t < x_{w+1}.

It should be emphasized here that Eqs. (1)-(2) suggest that the observation process can be decomposed into the following two processes (see Figure 1A): (i) draw a bin index w_j from a multinomial distribution with parameter \phi; and (ii) draw a continuous variable t_j from a uniform distribution defined over the corresponding bin region, [x_{w_j}, x_{w_j+1}). Without the second process, a mixture model of piecewise-constant distributions reduces to the original latent Dirichlet allocation (LDA) [10], in which the observed variables are discrete. In the next section, we construct a mixture of histograms by incorporating the uniform process into the original LDA. We call the proposed model the Histogram Latent Dirichlet Allocation (HistLDA).
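To make the two-step observation process concrete, the following Python sketch draws samples from a histogram viewed as a piecewise-constant density. It assumes NumPy; the function name and the equal-width binning over [x_min, x_max) are illustrative choices, not code from the paper.

import numpy as np

def sample_piecewise_constant(phi, x_min, x_max, size, rng=None):
    """Draw samples from a histogram seen as a piecewise-constant density.

    phi : array of W bin probability masses (summing to 1); the W bins
          partition [x_min, x_max) into equal-width intervals.
    """
    rng = np.random.default_rng() if rng is None else rng
    W = len(phi)
    delta = (x_max - x_min) / W                      # equal bin width
    # Step (i): draw a bin index w_j from the multinomial with parameter phi
    bins = rng.choice(W, size=size, p=phi)
    # Step (ii): draw t_j uniformly within the chosen bin [x_{w_j}, x_{w_j+1})
    return x_min + (bins + rng.uniform(size=size)) * delta

# Example: a 4-bin histogram on [0, 1)
samples = sample_piecewise_constant(np.array([0.5, 0.2, 0.2, 0.1]), 0.0, 1.0, 10000)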

2.2 Histogram latent Dirichlet allocation

HistLDA estimates unit u's probability density function as a mixture of basis histograms as follows:

    p(t | \theta^{(u)}, \Phi, W) = \sum_{k=1}^{K} \theta_k^{(u)} p(t | \phi^{(k)}, W_k),    (3)

where K is the number of mixture components, z_j \in {1, ..., K} is a latent variable indicating the component from which the variable t_j is drawn, W = (W_1, ..., W_K) is the set of bin numbers, \phi^{(k)} = (\phi_1^{(k)}, ..., \phi_{W_k}^{(k)}) is the probability masses of the k-th histogram, \Phi = {\phi^{(1)}, ..., \phi^{(K)}} is its set, and \theta^{(u)} = (\theta_1^{(u)}, ..., \theta_K^{(u)}) is the weights of the components on unit u. Each of the basis histograms, p(t | \phi^{(k)}, W_k), is described by Eq. (1). Note that the set of basis histograms is shared by all the units, and heterogeneity across units is represented only through the weights \theta^{(u)}.

In accordance with LDA [10], our HistLDA assumes the following generative process for the set of collections t and u:

  1. For each basis histogram k = 1, ..., K:

    • Draw the number of bins W_k ~ Uniform(1, ..., L)

    • Draw the probability mass \phi^{(k)} ~ Dirichlet(\beta, ..., \beta)

  2. For each unit u = 1, ..., U:

    • Draw the basis weight \theta^{(u)} ~ Dirichlet(\alpha, ..., \alpha)

  3. For each observation j = 1, ..., N in the collection:

    • Draw a basis z_j ~ Multinomial(\theta^{(u_j)})

    • Draw a bin index w_j ~ Multinomial(\phi^{(z_j)})

    • Draw a variable t_j ~ Uniform(x_{w_j}, x_{w_j+1}) under the bin boundaries of basis z_j,

where Uniform(1, ..., L) is the discrete uniform distribution over {1, ..., L} (i.e., the multinomial distribution of L categories with equal choice probability 1/L), Uniform(a, b) is the uniform distribution defined over the interval [a, b), Dirichlet(\gamma, ..., \gamma) is the symmetric Dirichlet distribution with parameter \gamma, Multinomial(p) is the multinomial distribution over the categories with choice probabilities p, \alpha and \beta are the Dirichlet parameters, and L is the maximum number of bins to be considered.
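The generative process can be mirrored by the following Python sketch, which simply executes the three steps above; the function name, the data layout, and the choice of NumPy are our own illustrative assumptions.

import numpy as np

def generate_histlda(U, n_per_unit, K, L, alpha, beta, x_min, x_max, rng=None):
    """Minimal sketch of the HistLDA generative process (illustrative only)."""
    rng = np.random.default_rng() if rng is None else rng
    # 1. For each basis histogram: draw the bin number and the bin masses
    W = rng.integers(1, L + 1, size=K)                    # W_k ~ Uniform{1, ..., L}
    phi = [rng.dirichlet(np.full(int(W[k]), beta)) for k in range(K)]
    t, u, z = [], [], []
    for unit in range(U):
        # 2. For each unit: draw the basis weights
        theta = rng.dirichlet(np.full(K, alpha))
        for _ in range(n_per_unit):
            # 3. For each observation: basis -> bin -> uniform draw within the bin
            k = rng.choice(K, p=theta)
            w = rng.choice(int(W[k]), p=phi[k])
            delta = (x_max - x_min) / W[k]
            t.append(x_min + (w + rng.uniform()) * delta)
            u.append(unit)
            z.append(k)
    return np.array(t), np.array(u), np.array(z), W, phi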

The HistLDA is an extension of the original LDA in that (i) the number of bins (vocabulary size), being a random variable, is not necessarily the same among the bases (topics), and (ii) an observation is not a discrete bin index (word) drawn from a multinomial distribution, but a continuous variable further drawn from a uniform distribution defined over the corresponding bin's interval (see Figure 1). It is also worth noting that HistLDA is a novel extension of Knuth's Bayesian binning model [2] to a hierarchical structure.

Because they are conjugate to the Dirichlet priors, the multinomial parameters \theta^{(u)} and \phi^{(k)} can be marginalized out from the generative model, leading to the joint distribution of the data t, its latent bases z, and the set of bin numbers W, as follows:

    p(t, z, W | \alpha, \beta) \propto \prod_{u=1}^{U} [ (\Gamma(K\alpha) / \Gamma(n_u + K\alpha)) \prod_{k=1}^{K} \Gamma(n_k^{(u)} + \alpha) / \Gamma(\alpha) ]
                               \times \prod_{k=1}^{K} [ (W_k / (x_max - x_min))^{n_k} (\Gamma(W_k \beta) / \Gamma(n_k + W_k \beta)) \prod_{w=1}^{W_k} \Gamma(n_{kw} + \beta) / \Gamma(\beta) ],    (4)

where \Gamma(·) is the gamma function, n_k^{(u)} is the number of times a variable of unit u has been assigned to basis k, n_{kw} is the number of times that a variable assigned to the k-th basis histogram is addressed to the w-th bin of that histogram, n_k = \sum_{w=1}^{W_k} n_{kw}, and n_u = \sum_{k=1}^{K} n_k^{(u)}.
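As a rough illustration, the log of the collapsed joint in Eq. (4), as reconstructed above, can be evaluated from the count statistics with the sketch below (assuming NumPy/SciPy; variable names are ours). Such a function is convenient, for example, for monitoring convergence of the sampler described in the next section.

import numpy as np
from scipy.special import gammaln

def log_joint(n_uk, n_kw_list, W, alpha, beta, x_min, x_max, L):
    """Log of the collapsed joint p(t, z, W | alpha, beta), up to a constant.

    n_uk      : (U, K) counts of unit-u variables assigned to basis k
    n_kw_list : list of K arrays; n_kw_list[k][w] counts basis-k variables in bin w
    W         : length-K array of bin numbers
    """
    U, K = n_uk.shape
    R = x_max - x_min
    lp = -K * np.log(L)                                  # uniform prior over W_k in {1, ..., L}
    # Unit-level Dirichlet-multinomial terms
    n_u = n_uk.sum(axis=1)
    lp += U * gammaln(K * alpha) - gammaln(n_u + K * alpha).sum()
    lp += gammaln(n_uk + alpha).sum() - U * K * gammaln(alpha)
    # Basis-level terms: within-bin uniform densities and Dirichlet-multinomial over bins
    for k in range(K):
        n_kw = n_kw_list[k]
        n_k = n_kw.sum()
        lp += n_k * np.log(W[k] / R)
        lp += gammaln(W[k] * beta) - gammaln(n_k + W[k] * beta)
        lp += gammaln(n_kw + beta).sum() - W[k] * gammaln(beta)
    return lp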

2.3 Estimation by collapsed Gibbs sampling

Figure 2: A schema of the estimation procedure. (A) Each unit has a small number of continuous variables, indicated by each symbol (e.g., triangle). Each unit's data set alone is not enough to obtain a good estimate of its density function. (B) Given the number of basis histograms (three in the figure) and their bin numbers W, each variable is allocated to one of the bases, colored in red, yellow and blue, respectively, according to the posterior. Given the data allocated to each basis, the bin number is optimized at the level of individual bases based on the posterior (Binning), and the variables are then re-allocated under the new bin numbers. The optimization is performed accurately owing to the sufficient size of the pooled data. The procedure is repeated until the joint posterior p(z, W | t) has converged. In collapsed Gibbs sampling, the heights of each histogram are not estimated explicitly, making the estimation easier to implement. (C) For each unit, the weights of the bases, \theta^{(u)}, are estimated by counting the proportion of the unit's data allocated to each basis, represented by a pie chart. Finally, the density function of each unit is estimated as a mixture of the basis histograms.

Given a collection of observed variables, t, we can estimate the latent bases and the numbers of bins, z and W, based on the posterior distribution p(z, W | t) \propto p(t, z, W | \alpha, \beta). In practice, by using the simple and easy-to-implement collapsed Gibbs sampling [11], we obtain samples of z and W following p(z, W | t), from which the weights \theta^{(u)} and the probability masses \phi^{(k)}, as well as the hyperparameters \alpha and \beta, are estimated efficiently.

Collapsed Gibbs sampling

Given the set of bin numbers W and the current state of all but one latent variable z_j, denoted by z^{-j}, the basis assignment of the j-th variable is sampled from the following multinomial distribution:

    p(z_j = k | z^{-j}, W, t) \propto (n_k^{(u_j), -j} + \alpha) (W_k / (x_max - x_min)) (n_{k w_j^{(k)}}^{-j} + \beta) / (n_k^{-j} + W_k \beta),    (5)

and given z and all but one bin number W_k, denoted by W^{-k}, the k-th bin number is sampled from the following multinomial distribution:

    p(W_k | z, W^{-k}, t) \propto (W_k / (x_max - x_min))^{n_k} (\Gamma(W_k \beta) / \Gamma(n_k + W_k \beta)) \prod_{w=1}^{W_k} \Gamma(n_{kw} + \beta) / \Gamma(\beta),  for W_k = 1, ..., L,    (6)

where the superscript -j indicates a count that does not include the current assignment of the j-th variable. Here w_j^{(k)} represents the k-th histogram's bin index within which t_j falls, defined as w_j^{(k)} = h(t_j) under the k-th bin number W_k, i.e., x_{w_j^{(k)}} \leq t_j < x_{w_j^{(k)}+1}.

Eqs. (5) and (6) are easily derived from Eq. (4) (the derivation is not shown here).
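For illustration, a single collapsed Gibbs draw of z_j following Eq. (5) might look like the sketch below; the notation follows our reconstruction above, the names are our own, and the analogous draw of W_k from Eq. (6) is omitted for brevity.

import numpy as np

def bin_index(t, W_k, x_min, x_max):
    """Bin of t under W_k equal-width bins on [x_min, x_max) (the operator h)."""
    return min(int(W_k * (t - x_min) / (x_max - x_min)), W_k - 1)

def sample_z_j(j, t, u, z, W, n_uk, n_kw, alpha, beta, x_min, x_max, rng):
    """One collapsed Gibbs draw of z_j from Eq. (5); counts exclude the current assignment."""
    K = len(W)
    R = x_max - x_min
    old = z[j]
    # Remove t_j's current assignment from the sufficient statistics
    n_uk[u[j], old] -= 1
    n_kw[old][bin_index(t[j], W[old], x_min, x_max)] -= 1
    # Unnormalized conditional probabilities over the K bases
    p = np.empty(K)
    for k in range(K):
        w = bin_index(t[j], W[k], x_min, x_max)
        n_k = n_kw[k].sum()
        p[k] = (n_uk[u[j], k] + alpha) * (W[k] / R) * (n_kw[k][w] + beta) / (n_k + W[k] * beta)
    new = rng.choice(K, p=p / p.sum())
    # Add the new assignment back into the counts
    z[j] = new
    n_uk[u[j], new] += 1
    n_kw[new][bin_index(t[j], W[new], x_min, x_max)] += 1
    return new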

In each sampling sweep of z and W, the hyperparameters of the Dirichlet priors, \alpha and \beta, can be updated by using the fixed-point iteration method described in [13] as follows:

    \alpha \leftarrow \alpha [ \sum_{u=1}^{U} \sum_{k=1}^{K} \psi(n_k^{(u)} + \alpha) - UK \psi(\alpha) ] / [ K \sum_{u=1}^{U} \psi(n_u + K\alpha) - UK \psi(K\alpha) ],
    \beta \leftarrow \beta [ \sum_{k=1}^{K} \sum_{w=1}^{W_k} \psi(n_{kw} + \beta) - \psi(\beta) \sum_{k=1}^{K} W_k ] / [ \sum_{k=1}^{K} W_k ( \psi(n_k + W_k \beta) - \psi(W_k \beta) ) ],    (7)

where \psi(·) is the digamma function, defined as the derivative of \log \Gamma(·). For details, see Appendix A.

In practice, we set the initial hyperparameters as \alpha = \beta = 0.5 (Jeffreys non-informative prior [14]), and draw the initial z from its prior. The collapsed Gibbs sampling algorithm for HistLDA is summarized in Algorithm 1 (see also Figure 2B).

Estimation of basis histograms and their weights on units

By repeating the collapsed Gibbs sampling of Eqs. (5)-(7) until convergence is achieved, we first estimate the hyperparameters, \alpha and \beta, as the last updated values. At the same time, we also estimate the set of bin numbers, W, as the last updated values. Next, given \alpha, \beta and W, we further draw S samples of the basis assignment, {z[1], ..., z[S]}, according to Eq. (5), and obtain the posterior mean estimates of the weights and probability masses as follows:

    \theta_k^{(u)} = (1/S) \sum_{s=1}^{S} (n_k^{(u)}[s] + \alpha) / (n_u + K\alpha),
    \phi_w^{(k)} = (1/S) \sum_{s=1}^{S} (n_{kw}[s] + \beta) / (n_k[s] + W_k \beta),    (8)

where n_k^{(u)}[s], n_{kw}[s] and n_k[s] represent the sufficient statistics in the s-th sample, z[s]. In the following experiment, S was fixed in advance.
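A sketch of the posterior-mean estimates in Eq. (8), averaged over S stored samples of the count statistics, is shown below; the data layout and names are our own assumptions.

import numpy as np

def posterior_means(n_uk_samples, n_kw_samples, W, alpha, beta):
    """Posterior-mean estimates of theta and phi (Eq. (8)) from S Gibbs samples.

    n_uk_samples : list of S arrays of shape (U, K)
    n_kw_samples : list of S lists holding, for each basis k, its bin-count array
    """
    S = len(n_uk_samples)
    K = len(W)
    theta = np.mean(
        [(n_uk + alpha) / (n_uk.sum(axis=1, keepdims=True) + K * alpha)
         for n_uk in n_uk_samples], axis=0)
    phi = [np.mean([(s[k] + beta) / (s[k].sum() + W[k] * beta)
                    for s in n_kw_samples], axis=0) for k in range(K)]
    return theta, phi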

In Figure 2, we give an intuitive explanation of the estimation procedure in HistLDA.

Algorithm 1 Collapsed Gibbs sampling in HistLDA
Set \alpha = \beta = 0.5, and initialize W_k for k = 1, ..., K.
Draw the initial z using weights drawn from Dirichlet(\alpha, ..., \alpha).
Count the sufficient statistics n_k^{(u)}, n_{kw} and n_k for k = 1, ..., K, w = 1, ..., W_k.
repeat
  for k = 1 to K do
    Draw W_k from Eq. (6).
    Update n_{kw} for w = 1, ..., W_k under the new W_k.
  end for
  for j = 1 to N do
    Set n_{z_j}^{(u_j)} \leftarrow n_{z_j}^{(u_j)} - 1, n_{z_j w_j} \leftarrow n_{z_j w_j} - 1, n_{z_j} \leftarrow n_{z_j} - 1.
    Draw z_j from Eq. (5).
    Set n_{z_j}^{(u_j)} \leftarrow n_{z_j}^{(u_j)} + 1, n_{z_j w_j} \leftarrow n_{z_j w_j} + 1, n_{z_j} \leftarrow n_{z_j} + 1.
  end for
  Update \alpha and \beta based on Eq. (7).
until the posterior p(z, W | t) has converged.

3 Results

Figure 3: Density estimation with synthetic data. (A) The ISE between the intended and the estimated density functions against the data size per unit m. The error bars represent standard deviations of the ISE over three repetitions of the density estimation using three independently generated data sets. (B) Three units' examples of estimated probability density functions for m = 100. The solid line represents the density function estimated by each method, and the dashed line represents the true density function.

To confirm that HistLDA works well on a sparse data set, that is, a collection of units each consisting of a small number of variables, we evaluated the performance of HistLDA together with the reference methods on synthetic data.

As the reference methods, we adopted two histogram methods, the Bayesian binning method proposed by Knuth [2] (Knuth) and the penalized maximum-likelihood method by Birgé and Rozenholc [3] (BR), and a parametric method, the Gaussian mixture model (GMM). These three methods estimate a unit's probability density function based only on the unit's own observed variables.

We generated synthetic data under the scenario that each unit generates continuous variables from a complex probability density function comprising three different types of distributions: a normal distribution (with fixed mean and variance), an exponential distribution (with a fixed rate parameter), and a uniform distribution (defined over a fixed interval). Each unit was characterized by its mixing proportions, which were sampled independently for each unit from a flat (uniform) Dirichlet distribution. Generating a collection of units, each of which contained m variables, we estimated the units' underlying probability density functions from the data, and evaluated the goodness of the estimation in terms of the average integrated squared error (ISE) of the probability density function:

    ISE = (1/U) \sum_{u=1}^{U} \int_{x_min}^{x_max} ( p_u(t) - \hat{p}_u(t) )^2 dt,

where [x_min, x_max) is the range of the variable, and p_u and \hat{p}_u represent the intended and the estimated probability density functions of unit u, respectively. In the experiment, the number of mixture components was set at three for both HistLDA and GMM. The data size per unit, m, was varied over several values.
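The average ISE can be approximated numerically, for instance with the following grid-based sketch; the function name, the grid size, and the use of callables for the densities are illustrative assumptions.

import numpy as np

def average_ise(true_densities, est_densities, x_min, x_max, n_grid=10000):
    """Average integrated squared error between true and estimated densities.

    true_densities, est_densities : lists of callables (one per unit) evaluating p_u(t).
    The integral is approximated on a regular grid over [x_min, x_max).
    """
    grid = np.linspace(x_min, x_max, n_grid, endpoint=False)
    dx = (x_max - x_min) / n_grid
    ises = [np.sum((p(grid) - q(grid)) ** 2) * dx
            for p, q in zip(true_densities, est_densities)]
    return float(np.mean(ises))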

Figure 3A compares HistLDA's ISE against the results achieved by the reference methods, demonstrating that HistLDA performed better than the other methods for every data size per unit, m. The comparison between HistLDA and the other histogram methods (Knuth and BR) shows that HistLDA performed relatively much better even for small m. This suggests that HistLDA copes with the sparsity problem of the usual histogram methods. Figure 3A also shows that the histogram methods (HistLDA, Knuth and BR) performed better as the data size grew, while the performance of GMM did not improve significantly. Parametric approaches like GMM, which work robustly on sparse data, usually perform poorly under a wrong assumption about the underlying distribution. In the experiment, the underlying density function of each unit was far from Gaussian (see Figure 3B).

Figure 3B shows three units' examples of estimated probability density functions for m = 100. Each unit had a complex and unit-specific distribution, but HistLDA obtained a good estimate of each distribution by adopting small bin widths (a large number of bins). Generally, the bin width of a histogram is estimated to be smaller for a larger data set [15], a situation realized in HistLDA by allocating the whole data of all the units to a small number of basis histograms (see Figure 2). Furthermore, in Figure 3B, HistLDA appears to have adjusted the bin widths depending on the location: a large bin width was used around 0, and a small one around 1. HistLDA implements a variable-width bin histogram by way of a mixture of regular histograms with different bin widths. Figure 4 displays how the bin number (bin width) of each basis histogram was optimized during the collapsed Gibbs sampling.

Figure 4: Collapsed Gibbs sampling with synthetic data. (A) Plots of the bin numbers W = (W_1, W_2, W_3) and the hyperparameters (\alpha, \beta) versus the iteration number. (B) The sampled bin numbers and the estimated basis histograms at the 4th, 10th, and 500th iterations. In each panel, the three distributions depicted by dashed lines represent the underlying normal, exponential, and uniform distributions.

4 Summary

We have proposed a novel histogram density estimation method that overcomes the sparsity of data by incorporating latent Dirichlet allocation (LDA) into the histogram method. The proposed method estimates a density function as a mixture of histograms whose bin numbers (bin widths) are optimized, together with their heights, at the level of individual histograms. By way of a mixture of regularly-binned histograms of different bin widths, it can implement a variable-width bin histogram. As with LDA, the whole estimation procedure is performed by the fast and easy-to-implement collapsed Gibbs sampling. We assessed the goodness of the proposed method on synthetic data and demonstrated that it performs well when data sets are sparse.


A Estimation of hyperparameters

Based on the empirical Bayes method, the hyperparameters of the Dirichlet distributions, \alpha and \beta, can be estimated by maximizing the marginal likelihood,

    p(t | \alpha, \beta) = \sum_{z} \sum_{W} p(t, z, W | \alpha, \beta).    (9)

The maximization is performed by using the Monte Carlo EM algorithm, leading to the following update rule,

    (\alpha', \beta') = argmax_{\alpha, \beta} E_{p(z, W | t, \alpha_old, \beta_old)}[ \log p(t, z, W | \alpha, \beta) ],    (10)

where \alpha' and \beta' are the updated values of the hyperparameters, and E_{p(z, W | t, \alpha_old, \beta_old)}[·] represents the expectation with respect to the posterior distribution of z and W under the previous estimates of the hyperparameters, (\alpha_old, \beta_old). This time, we employ the stochastic EM, a variant of the Monte Carlo EM algorithm, in which the expectation is replaced by a sample taken from the posterior distribution (Eqs. (5)-(6)), as

    (\alpha', \beta') = argmax_{\alpha, \beta} \log p(t, z*, W* | \alpha, \beta),    (11)

where z* and W* are the values of z and W sampled by the collapsed Gibbs sampling with the previous hyperparameters, (\alpha_old, \beta_old). Although it has no analytical form, Eq. (11) can be computed by iterating the following fixed-point equations [13]:

    \alpha' = \alpha [ \sum_{u=1}^{U} \sum_{k=1}^{K} \psi(n_k^{(u)*} + \alpha) - UK \psi(\alpha) ] / [ K \sum_{u=1}^{U} \psi(n_u + K\alpha) - UK \psi(K\alpha) ],
    \beta' = \beta [ \sum_{k=1}^{K} \sum_{w=1}^{W_k*} \psi(n_{kw}* + \beta) - \psi(\beta) \sum_{k=1}^{K} W_k* ] / [ \sum_{k=1}^{K} W_k* ( \psi(n_k* + W_k* \beta) - \psi(W_k* \beta) ) ],    (12)

where \psi(·) is the digamma function, defined as the derivative of \log \Gamma(·), and n_k^{(u)*}, n_{kw}* and n_k* represent the realizations of n_k^{(u)}, n_{kw} and n_k in the collapsed Gibbs sampling with the previous hyperparameters, (\alpha_old, \beta_old), respectively.
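A minimal sketch of the fixed-point updates in Eq. (12), assuming NumPy/SciPy and a single Gibbs sample of the count statistics, is given below; the iteration cap and convergence tolerance are arbitrary choices of ours.

import numpy as np
from scipy.special import digamma

def update_alpha(alpha, n_uk, n_iter=100, tol=1e-8):
    """Fixed-point iteration for the symmetric Dirichlet parameter alpha [13].

    n_uk : (U, K) counts of unit-u variables assigned to basis k (one Gibbs sample).
    """
    U, K = n_uk.shape
    n_u = n_uk.sum(axis=1)
    for _ in range(n_iter):
        num = digamma(n_uk + alpha).sum() - U * K * digamma(alpha)
        den = K * (digamma(n_u + K * alpha).sum() - U * digamma(K * alpha))
        new = alpha * num / den
        if abs(new - alpha) < tol:
            return new
        alpha = new
    return alpha

def update_beta(beta, n_kw_list, W, n_iter=100, tol=1e-8):
    """Analogous fixed-point update for beta, where basis k has W[k] bins."""
    for _ in range(n_iter):
        num = sum(digamma(n_kw + beta).sum() - W_k * digamma(beta)
                  for n_kw, W_k in zip(n_kw_list, W))
        den = sum(W_k * (digamma(n_kw.sum() + W_k * beta) - digamma(W_k * beta))
                  for n_kw, W_k in zip(n_kw_list, W))
        new = beta * num / den
        if abs(new - beta) < tol:
            return new
        beta = new
    return beta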

References

  1. Rissanen, J., Speed, T.P. & Yu, B. Density estimation by stochastic complexity. IEEE Transactions on Information Theory, 38(2): 315-323, 1992.
  2. Knuth, K.H. Optimal data-based binning for histograms. arXiv preprint physics/0605197, 2006.
  3. Birgé, L. & Rozenholc, Y. How many bins should be put in a regular histogram. ESAIM: Probability and Statistics, 10: 24-25, 2006.
  4. Chan, S., Diakonikolas, I., Servedio, R.A., & Sun, X. Near-optimal density estimation in near-linear time using variable-width histograms. Advances in Neural Information Processing Systems 27, pages 1844-1852, 2014.
  5. Kim, H., Takaya, N., & Sawada, H. Tracking temporal dynamics of purchase decisions via hierarchical time-rescaling model. In Proc. 23rd ACM International Conference on Information and Knowledge Management (CIKM), pages 1389-1398. ACM Press, 2014.
  6. Trinh, G., Rungie, C., Wright, M., Driesener, C., & Dawes, J. Predicting future purchases with the Poisson log-normal model. Marketing Letters, 25(2): 219-234, 2014.
  7. Wang, X. & McCallum, A. Topics over time: a non-Markov continuous-time model of topical trends. In Proc.12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pages 424-433. ACM Press, 2006.
  8. Cho, E., Myers, S.A., & Leskovec, J. Friendship and mobility: user movement in location-based social networks. In Proc.17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pages 1082-1090. ACM Press, 2011.
  9. Iyengar, S. & Liao, Q. Modeling neural activity using the generalized inverse Gaussian distribution. Biological Cybernetics, 77: 289-295, 1997.
  10. Blei, D.M., Ng, A.Y., & Jordan, M.I. Latent Dirichlet allocation. Journal of Machine Learning Research, 3: 993-1022, 2003.
  11. Griffiths, T.L. & Steyvers, M. Finding scientific topics. PNAS, 101: 5228-5235, 2004.
  12. Endres, D., Oram, M., Schindelin, J., & Földiák, P. Bayesian binning beats approximate alternatives: estimating peri-stimulus time histograms. Advances in Neural Information Processing Systems 20, pages 393-400, 2008.
  13. Minka, T. Estimating a Dirichlet distribution. Technical report. MIT, 2000.
  14. Jeffreys, H. Theory of probability, 3rd ed. Oxford, Oxford University Press, 1961.
  15. Shimazaki, H. & Shinomoto, S. A method for selecting the bin size of a time histogram. Neural Computation, 19: 1503-1527, 2007.