Histogram Meets Topic Model: Density Estimation by Mixture of Histograms
Abstract
The histogram method is a powerful nonparametric approach for estimating the probability density function of a continuous variable. Compared to parametric approaches, however, constructing a histogram demands a large number of observations to capture the underlying density function. Thus it is not suitable for analyzing a sparse data set, that is, a collection of units each with only a small number of observations. In this paper, by employing a probabilistic topic model, we develop a novel Bayesian approach that alleviates the sparsity problem of conventional histogram estimation. Our method estimates a unit's density function as a mixture of basis histograms, in which the number of bins for each basis, as well as their heights, is determined automatically. The estimation procedure is performed by fast and easy-to-implement collapsed Gibbs sampling. We apply the proposed method to synthetic data, showing that it performs well.
1 Introduction
The histogram, a nonparametric density estimator, has been used extensively for analyzing the probability density function of continuous variables [1]. Histograms are flexible enough to model various properties of the underlying density, such as multimodality, but they usually demand a large number of samples to obtain a good estimate. Thus the histogram method cannot be applied directly to a sparse data set, that is, a collection of units each with only a small number of observations. With recent advances in technology, however, it has become increasingly important to analyze diverse continuous-variable data such as purchase timing [5], period of word appearance [7], check-in location [8] and neuronal spike time [9], which can be sparse in many cases.
In this paper, by employing the probabilistic topic model called latent Dirichlet allocation [10], we propose a novel Bayesian approach to estimating probability density functions. The proposed method estimates a unit's density function as a mixture of basis histograms, so that the unit's density function is characterized by a small number of mixture weights, alleviating the sparsity problem of the conventional histogram method. Furthermore, the model optimizes the bin width, as well as the bin heights, at the level of individual bases. Thus the model can implement a variable-width-bin histogram as a mixture of regularly-binned histograms of different bin widths. We show that the estimation procedure of the proposed model can be performed by fast and easy-to-implement collapsed Gibbs sampling [11]. We apply the proposed method to synthetic data, confirming that it performs well.
2 Model
Table 1: Notation.

Symbol            Definition
M                 number of units
n_m               number of variables in unit m
N                 total number of variables, N = Σ_m n_m
K                 number of basis histograms
m                 m-th unit, m = 1, ..., M
t_i               i-th variable in the collection, i = 1, ..., N
d_i               unit which generated t_i
z_i               latent variable of t_i, z_i ∈ {1, ..., K}
[V_0, V_0 + V)    half-open range of variable t_i
B_j               lower boundary of j-th bin
L_k               number of bins in k-th histogram
θ_mk              weight of k-th histogram on unit m
φ_kj              probability mass of j-th bin in k-th histogram, Σ_j φ_kj = 1
b_L(t_i)          index of bin within which t_i falls under bin number L
Suppose that we have a collection of units, each of which consists of continuous variables generated by that unit. For convenience, we number all the variables in the collection consecutively (in an arbitrary order), recording for each variable the unit which generated it. The notation is summarized in Table 1.
In our proposed model, each continuous variable is assumed to be generated from a mixture of histograms. As a generative process, a histogram can be described as a piecewise-constant probability density function [2], which is the key point of our model construction. We first describe the piecewise-constant distribution.
2.1 Histogram: piecewise-constant distribution
The histogram method describes an underlying density function by (i) discretizing a half-open range of the variable, [V_0, V_0 + V), into L contiguous intervals (bins) of equal width Δ = V/L, and (ii) assigning a constant probability density, φ_j/Δ, to each bin region [B_j, B_j + Δ), for j = 1, ..., L. Here the lower boundary of the j-th bin is given by B_j = V_0 + (j − 1)Δ. A continuous variable t then follows a piecewise-constant distribution,

p(t) = Σ_{j=1}^{L} (φ_j/Δ) · 1[B_j ≤ t < B_j + Δ],   (1)

or equivalently,

p(t) = φ_{b_L(t)} · L/V,   (2)

where φ_j is the probability mass of the j-th bin region (Σ_j φ_j = 1), and b_L(·) is a discretization operator that transforms a continuous variable into the corresponding bin index, defined by b_L(t) = ⌈(t − V_0) L / V⌉.
It should be emphasized that Eqs. (1)–(2) show that the observation process can be decomposed into the following two steps (see Figure 1A): (i) draw a bin index from a multinomial distribution with parameter φ; (ii) draw a continuous variable from a uniform distribution defined over the corresponding bin region. Without the second step, a mixture model of such distributions reduces to the original latent Dirichlet allocation (LDA) [10], in which observed variables are discrete. In the next section, we construct a mixture of histograms by incorporating the uniform process into the original LDA. We call the proposed model Histogram Latent Dirichlet Allocation (HistLDA).
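The two-step observation process above can be sketched in code (a minimal illustration with hypothetical bin masses and range; none of these values are from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical histogram on [0, 1) with L = 4 equal-width bins.
L = 4
pi = np.array([0.1, 0.4, 0.3, 0.2])   # bin probability masses, sum to 1
v_min, v_max = 0.0, 1.0
width = (v_max - v_min) / L

def sample_piecewise_constant(n):
    # (i) draw a bin index from Multinomial(pi)
    bins = rng.choice(L, size=n, p=pi)
    # (ii) draw the variable uniformly within the chosen bin
    return v_min + (bins + rng.random(n)) * width

t = sample_piecewise_constant(10_000)
# empirical bin frequencies should approximate pi
freq, _ = np.histogram(t, bins=L, range=(v_min, v_max))
```

Dropping step (ii) and keeping only the sampled bin indices recovers the discrete observation model of standard LDA.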
2.2 Histogram latent Dirichlet allocation
HistLDA estimates unit m's probability density function as a mixture of K basis histograms:

p(t_i | d_i = m) = Σ_{k=1}^{K} θ_mk p(t_i | φ_k, L_k),   (3)

where K is the number of mixture components, z_i ∈ {1, ..., K} is a latent variable indicating the component from which variable t_i is drawn, W = {L_1, ..., L_K} is the set of bin numbers, φ_k = (φ_k1, ..., φ_kL_k) is the vector of probability masses of the k-th histogram, and θ_m = (θ_m1, ..., θ_mK) are the weights of the components on unit m. Each basis histogram p(t | φ_k, L_k) is described by Eq. (1). Note that the set of basis histograms is shared by all the units, and heterogeneity across units is represented only through the weights θ_m.
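Evaluating such a mixture density at a point only requires the unit's weights and the shared bases; the following sketch illustrates this (variable names are ours, not the paper's):

```python
import numpy as np

def mixture_density(t, theta_m, phi, v_min, v_max):
    """Density of one unit at point t, as a weighted mixture of basis
    histograms: p(t) = sum_k theta_m[k] * phi[k][j] / bin_width_k,
    where j is the bin of basis k containing t.  (Illustrative sketch.)"""
    p = 0.0
    for k, pi_k in enumerate(phi):
        L_k = len(pi_k)
        width = (v_max - v_min) / L_k
        j = min(int((t - v_min) / width), L_k - 1)  # bin index of basis k
        p += theta_m[k] * pi_k[j] / width
    return p

# sanity check: a single basis with equal masses is the uniform density on [0, 1)
p = mixture_density(0.3, [1.0], [np.array([0.5, 0.5])], 0.0, 1.0)
```

Because each basis may have a different number of bins, the mixture can place fine bins where one basis is detailed and coarse bins elsewhere, which is what later yields the effective variable-width histogram.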
In accordance with LDA [10], HistLDA assumes the following generative process for the collection:

For each basis histogram k = 1, ..., K:
  Draw number of bins L_k ~ Uniform(1, ..., L_max)
  Draw probability masses φ_k ~ Dirichlet(β, ..., β)
For each unit m = 1, ..., M:
  Draw basis weights θ_m ~ Dirichlet(α, ..., α)
  For each observation i in unit m:
    Draw basis z_i ~ Multinomial(θ_m)
    Draw bin index j_i ~ Multinomial(φ_{z_i})
    Draw variable t_i ~ Uniform(B_{j_i}, B_{j_i} + V/L_{z_i}),

where Uniform(a, b) is the uniform distribution over the interval [a, b), Dirichlet(β, ..., β) is the symmetric Dirichlet distribution with parameter β, Uniform(1, ..., L_max) is the discrete uniform distribution over {1, ..., L_max} (each with probability 1/L_max), α and β are Dirichlet parameters, and L_max is the maximum number of bins to be considered.
HistLDA extends the original LDA in that (i) the number of bins (vocabulary size), being a random variable, is not necessarily the same across bases (topics), and (ii) an observation is not a discrete bin index (word) drawn from a multinomial distribution, but a continuous variable further drawn from a uniform distribution over the corresponding bin's interval (see Figure 1). It is also worth noting that HistLDA is a novel extension of Knuth's Bayesian binning model [2] to a hierarchical structure.
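The generative process above can be simulated end-to-end; the following is a hypothetical sketch (all sizes and hyperparameter values are illustrative, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative sizes: M units, K basis histograms, up to L_max bins.
M, K, L_max = 5, 3, 20
n_per_unit = 50
v_min, v_max = 0.0, 1.0
alpha, beta = 0.5, 0.5           # symmetric Dirichlet hyperparameters

# For each basis k: L_k ~ Uniform{1..L_max}, phi_k ~ Dirichlet(beta,...,beta)
L = rng.integers(1, L_max + 1, size=K)
phi = [rng.dirichlet(np.full(Lk, beta)) for Lk in L]

data = []
for m in range(M):
    theta = rng.dirichlet(np.full(K, alpha))     # unit's basis weights
    for _ in range(n_per_unit):
        k = rng.choice(K, p=theta)               # draw basis
        j = rng.choice(L[k], p=phi[k])           # draw bin index
        width = (v_max - v_min) / L[k]
        t = v_min + (j + rng.random()) * width   # draw variable within bin
        data.append((m, t))
```

Note that only θ differs across units; the bin numbers L and masses φ are shared, which is what lets sparse units borrow statistical strength from the whole collection.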
Because the multinomial parameters θ and φ are conjugate to their Dirichlet priors, they can be marginalized out of the generative model, leading to the joint distribution of the data t, its latent bases z, and the set of bin numbers W:

p(t, z, W) = (1/L_max)^K Π_{k=1}^{K} (L_k/V)^{n_k} Π_{m=1}^{M} [ Γ(Kα)/Γ(n_m + Kα) Π_{k=1}^{K} Γ(n_mk + α)/Γ(α) ] Π_{k=1}^{K} [ Γ(L_kβ)/Γ(n_k + L_kβ) Π_{j=1}^{L_k} Γ(n_kj + β)/Γ(β) ],   (4)

where Γ is the gamma function, n_mk is the number of times a variable of unit m has been assigned to basis k, n_kj is the number of times a variable assigned to the k-th basis histogram falls in the j-th bin of that histogram, n_m = Σ_k n_mk, and n_k = Σ_j n_kj.
2.3 Estimation by collapsed Gibbs sampling
Given the collection of observed variables t, we estimate the latent bases z and bin numbers W from the posterior distribution p(z, W | t) ∝ p(t, z, W). In practice, using simple and easy-to-implement collapsed Gibbs sampling [11], we obtain samples of z and W from p(z, W | t), from which the weights θ and probability masses φ, as well as the hyperparameters α and β, are estimated efficiently.
Collapsed Gibbs sampling
Given the set of bin numbers W and the current state of all latent variables but one, denoted z_{-i}, the basis assignment of the i-th variable is sampled from the following multinomial distribution:

p(z_i = k | z_{-i}, W, t) ∝ (n_{d_i k}^{-i} + α) · (n_{k, b_{L_k}(t_i)}^{-i} + β) / (n_k^{-i} + L_k β) · L_k / V,   (5)
and given z and all bin numbers but one, denoted W_{-k}, the k-th bin number is sampled from the following multinomial distribution:

p(L_k = l | z, W_{-k}, t) ∝ (l/V)^{n_k} · Γ(lβ)/Γ(n_k + lβ) Π_{j=1}^{l} Γ(n_kj + β)/Γ(β),   (6)

where the superscript −i indicates a count that excludes the current assignment of z_i. Here b_{L_k}(t_i) denotes the k-th histogram's bin index within which t_i falls, defined as b_{L_k}(t_i) = ⌈(t_i − V_0) L_k / V⌉.
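In code, the collapsed update for a single z_i takes the familiar LDA form multiplied by the uniform density within the containing bin. The sketch below assumes the standard Dirichlet-multinomial form of Eq. (5); all variable names are illustrative:

```python
import numpy as np

def z_conditional(t_i, m, n_mk, n_kj, n_k, L, alpha, beta, v_min, v_max):
    """Unnormalized p(z_i = k | z_-i, W, t) for one held-out variable.
    Counts (n_mk, n_kj, n_k) must already EXCLUDE observation i.
    Sketch: collapsed-LDA term times the 1/width density of the bin."""
    K = len(L)
    p = np.empty(K)
    for k in range(K):
        width = (v_max - v_min) / L[k]
        j = min(int((t_i - v_min) / width), L[k] - 1)  # bin of basis k
        p[k] = ((n_mk[m, k] + alpha)
                * (n_kj[k][j] + beta) / (n_k[k] + L[k] * beta)
                / width)
    return p

# tiny example: 2 bases with 2 and 4 bins on [0, 1)
n_mk = np.array([[3.0, 1.0]])
n_kj = [np.array([2.0, 1.0]), np.array([0.0, 1.0, 0.0, 0.0])]
n_k = np.array([3.0, 1.0])
p = z_conditional(0.3, 0, n_mk, n_kj, n_k, [2, 4], 0.5, 0.5, 0.0, 1.0)
probs = p / p.sum()   # normalize before sampling z_i
```

A full sampler would decrement the counts for observation i, compute `probs`, draw the new z_i, and re-increment the counts, as in Algorithm 1.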
Eqs. (5) and (6) are easily derived from Eq. (4) (derivation omitted here).
At each sweep over z and W, the hyperparameters of the Dirichlet priors, α and β, can be updated by the fixed-point iteration method described in [13]:

α ← α · [Σ_m Σ_k ψ(n_mk + α) − MK ψ(α)] / [K Σ_m ψ(n_m + Kα) − MK ψ(Kα)],   (7)

and similarly for β, where ψ is the digamma function, the derivative of log Γ. For details, see Appendix A.
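The symmetric-Dirichlet update can be sketched with SciPy's digamma, following Minka's fixed-point iteration [13] (a sketch for the α update only; the β update is analogous with bin counts):

```python
import numpy as np
from scipy.special import digamma

def fixed_point_alpha(counts, alpha, n_iter=100):
    """Minka-style fixed-point estimation of a symmetric Dirichlet
    parameter.  counts[m, k] is n_mk, the number of variables of unit m
    assigned to basis k.  (Illustrative sketch of the update in [13].)"""
    M, K = counts.shape
    n_m = counts.sum(axis=1)
    for _ in range(n_iter):
        num = np.sum(digamma(counts + alpha) - digamma(alpha))
        den = K * np.sum(digamma(n_m + K * alpha) - digamma(K * alpha))
        alpha *= num / den
    return alpha
```

Skewed count tables (units concentrated on few bases) drive the estimate toward small α, while near-uniform tables drive it upward, as expected for a Dirichlet concentration parameter.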
In practice, we set the initial hyperparameters to α = β = 0.5 (Jeffreys' noninformative prior [14]), and draw the initial z from its prior. The collapsed Gibbs sampling algorithm of HistLDA is summarized in Algorithm 1 (see also Figure 2B).
Estimation of basis histograms and their weights on units
We first run the collapsed Gibbs sampling of Eqs. (5)–(7) until convergence is achieved, taking the last updated values as the estimates of the hyperparameters α and β and of the set of bin numbers W. Next, given α, β and W, we draw S further samples of the basis assignments, [z^(1), ..., z^(S)], according to Eq. (5), and obtain the posterior mean estimates of the weights and probability masses as follows:

θ̂_mk = (1/S) Σ_{s=1}^{S} (n_mk^(s) + α) / (n_m + Kα),   φ̂_kj = (1/S) Σ_{s=1}^{S} (n_kj^(s) + β) / (n_k^(s) + L_k β),

where n_mk^(s), n_kj^(s) and n_k^(s) represent the sufficient statistics of the s-th sample z^(s). In the following experiment, S was fixed in advance.
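The weight estimate can be computed as an average of smoothed counts over the retained samples; a sketch (the mass estimate φ̂ is analogous, using the bin counts):

```python
import numpy as np

def estimate_theta(z_samples, d, M, K, alpha):
    """Posterior-mean estimate of the unit weights theta, averaged over
    S Gibbs samples of z.  z_samples[s, i] is the basis of variable i in
    sample s; d[i] is the unit that generated it.  The smoothed-count
    form is the usual Dirichlet-multinomial posterior mean (sketch)."""
    S, N = z_samples.shape
    theta = np.zeros((M, K))
    for s in range(S):
        n_mk = np.zeros((M, K))
        np.add.at(n_mk, (d, z_samples[s]), 1.0)   # count n_mk for sample s
        n_m = n_mk.sum(axis=1, keepdims=True)
        theta += (n_mk + alpha) / (n_m + K * alpha)
    return theta / S
```

Each per-sample estimate is a proper probability vector per unit, so the average is as well.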
Figure 2 gives an intuitive explanation of the estimation procedure of HistLDA.
Algorithm 1 Collapsed Gibbs sampling in HistLDA

Set α = β = 0.5, and initialize the bin numbers L_k for k = 1, ..., K.
Draw z from its prior.
Count the sufficient statistics n_mk, n_kj and n_k for all m, k, j.
repeat
  for k = 1 to K do
    Draw L_k from Eq. (6).
    Update n_kj for j = 1, ..., L_k under the new L_k.
  end for
  for i = 1 to N do
    Set n_{d_i, z_i} ← n_{d_i, z_i} − 1, n_{z_i, b} ← n_{z_i, b} − 1, n_{z_i} ← n_{z_i} − 1.
    Draw z_i from Eq. (5).
    Set n_{d_i, z_i} ← n_{d_i, z_i} + 1, n_{z_i, b} ← n_{z_i, b} + 1, n_{z_i} ← n_{z_i} + 1.
  end for
  Update α and β based on Eq. (7).
until the posterior p(z, W | t) has converged.
3 Results
To confirm that HistLDA works well on a sparse data set, that is, a collection of units each consisting of a small number of variables, we evaluated its performance together with reference methods on synthetic data.
As reference methods, we adopted two histogram methods, the Bayesian binning method proposed by Knuth [2] (Knuth) and the penalized maximum-likelihood method of Birgé and Rozenholc [3] (BR), and one parametric method, the Gaussian mixture model (GMM). These three methods estimate a unit's probability density function based only on its own observed variables.
We generated synthetic data under the scenario that each unit generates continuous variables from a complex probability density function comprising three different types of distributions: a normal distribution, an exponential distribution, and a uniform distribution. Each unit was characterized by its mixing proportions, which were sampled from a flat Dirichlet distribution independently for each unit. Generating a collection of units, each with the same number of variables, we estimated the units' underlying probability density functions from the data, and evaluated the goodness of estimation by the average integrated squared error (ISE) between the true density p(t) and the estimated density p̂(t):

ISE = ∫ (p(t) − p̂(t))² dt,

where the integral is taken over the range of the variable. In the experiment, the number of mixture components was set at three for both HistLDA and GMM, and the data size per unit was varied over several values.
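The ISE is straightforward to approximate numerically; a minimal sketch with a hypothetical true density and a deliberately poor flat estimate:

```python
import numpy as np

def integrated_squared_error(p_true, p_est, v_min, v_max, n_grid=10_000):
    """Integrated squared error between two densities on [v_min, v_max),
    approximated by a Riemann sum on a regular grid (sketch)."""
    x = np.linspace(v_min, v_max, n_grid, endpoint=False)
    dx = (v_max - v_min) / n_grid
    return float(np.sum((p_true(x) - p_est(x)) ** 2) * dx)

# example: flat estimate vs. triangular density 2x on [0, 1);
# the exact ISE is \int_0^1 (2x - 1)^2 dx = 1/3
true_pdf = lambda x: 2.0 * x
flat_pdf = lambda x: np.ones_like(x)
ise = integrated_squared_error(true_pdf, flat_pdf, 0.0, 1.0)
```

Averaging this quantity over all units gives the evaluation score used in Figure 3A.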
Figure 3A compares the ISE of HistLDA against the reference methods, demonstrating that HistLDA performed better than the other methods for every data size per unit. The comparison between HistLDA and the other histogram methods (Knuth and BR) shows that HistLDA performed relatively much better even for small data sizes, suggesting that HistLDA copes with the sparsity problem of the usual histogram methods. Figure 3A also shows that the histogram methods (HistLDA, Knuth and BR) performed better as the data size grew, while the performance of GMM did not improve significantly. Parametric approaches like GMM, which are robust to sparse data, usually perform poorly under a wrong assumption about the underlying distribution; in this experiment, the underlying density function of each unit was far from Gaussian (see Figure 3B).
Figure 3B shows estimated probability density functions for three example units. Each unit had a complex, unit-specific distribution, but HistLDA obtained a good estimate of each distribution by adopting small bin widths (large numbers of bins). Generally, the bin width of a histogram is estimated to be smaller for a larger data set [15]; this situation is realized in HistLDA by allocating the whole data of all the units to a small number of basis histograms (see Figure 2). Furthermore, in Figure 3B, HistLDA appears to have adjusted the bin width depending on location: a large bin width was used around 0, and a small one around 1. HistLDA thus implements a variable-width-bin histogram by way of a mixture of regular histograms with different bin widths. Figure 4 displays how the bin number (bin width) of each basis histogram was optimized during the collapsed Gibbs sampling.
4 Summary
We have proposed a novel histogram density estimation method that overcomes the sparsity of data by incorporating latent Dirichlet allocation (LDA) into the histogram method. The proposed method estimates a density function as a mixture of histograms whose bin numbers (bin widths) are optimized together with their heights, at the level of individual histograms. By way of a mixture of regularly-binned histograms of different bin widths, it can implement a variable-width-bin histogram. As with LDA, the whole estimation procedure is performed by fast and easy-to-implement collapsed Gibbs sampling. We assessed the proposed method on synthetic data, and demonstrated that it performs well when data sets are sparse.
A Estimation of hyperparameters
Based on the empirical Bayes method, the hyperparameters of the Dirichlet distributions, α and β, can be estimated by maximizing the marginal likelihood,

p(t | α, β) = Σ_z Σ_W p(t, z, W | α, β).   (8)

The maximization is performed by the Monte Carlo EM algorithm, leading to the following update rule,

(α', β') = argmax_{α', β'} E_{p(z, W | t, α, β)} [ log p(t, z, W | α', β') ],   (9)

where α' and β' are the updated values of the hyperparameters, and the expectation is taken with respect to the posterior distribution of z and W under the previous estimates (α, β). Here we employ the stochastic EM algorithm, a variant of Monte Carlo EM, in which the expectation is replaced by a single sample (z, W) taken from the posterior distribution via Eqs. (5)–(6). Although Eq. (9) has no analytical solution, it can be computed by iterating the following fixed-point equations [13]:

α' ← α' · [Σ_m Σ_k ψ(n_mk + α') − MK ψ(α')] / [K Σ_m ψ(n_m + Kα') − MK ψ(Kα')],
β' ← β' · [Σ_k Σ_j ψ(n_kj + β') − (Σ_k L_k) ψ(β')] / [Σ_k L_k (ψ(n_k + L_k β') − ψ(L_k β'))],   (10)

where ψ is the digamma function, the derivative of log Γ, and n_mk, n_kj, n_m, n_k and L_k are the realizations obtained by the collapsed Gibbs sampling under the previous hyperparameters (α, β).
References
[1] Rissanen, J., Speed, T.P. & Yu, B. Density estimation by stochastic complexity. IEEE Transactions on Information Theory, 38(2): 315–323, 1992.
[2] Knuth, K.H. Optimal data-based binning for histograms. arXiv preprint physics/0605197, 2006.
[3] Birgé, L. & Rozenholc, Y. How many bins should be put in a regular histogram. ESAIM: Probability and Statistics, 10: 24–45, 2006.
[4] Chan, S., Diakonikolas, I., Servedio, R.A., & Sun, X. Near-optimal density estimation in near-linear time using variable-width histograms. Advances in Neural Information Processing Systems 27, pages 1844–1852, 2014.
[5] Kim, H., Takaya, N., & Sawada, H. Tracking temporal dynamics of purchase decisions via hierarchical time-rescaling model. In Proc. 23rd ACM International Conference on Information and Knowledge Management (CIKM), pages 1389–1398. ACM Press, 2014.
[6] Trinh, G., Rungie, C., Wright, M., Driesener, C., & Dawes, J. Predicting future purchases with the Poisson log-normal model. Marketing Letters, 25(2): 219–234, 2014.
[7] Wang, X. & McCallum, A. Topics over time: a non-Markov continuous-time model of topical trends. In Proc. 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pages 424–433. ACM Press, 2006.
[8] Cho, E., Myers, S.A., & Leskovec, J. Friendship and mobility: user movement in location-based social networks. In Proc. 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pages 1082–1090. ACM Press, 2011.
[9] Iyengar, S. & Liao, Q. Modeling neural activity using the generalized inverse Gaussian distribution. Biological Cybernetics, 77: 289–295, 1997.
[10] Blei, D.M., Ng, A.Y., & Jordan, M.I. Latent Dirichlet allocation. Journal of Machine Learning Research, 3: 993–1022, 2003.
[11] Griffiths, T.L. & Steyvers, M. Finding scientific topics. PNAS, 101: 5228–5235, 2004.
[12] Endres, D., Oram, M., Schindelin, J., & Földiák, P. Bayesian binning beats approximate alternatives: estimating peri-stimulus time histograms. Advances in Neural Information Processing Systems 20, pages 393–400, 2008.
[13] Minka, T. Estimating a Dirichlet distribution. Technical report, MIT, 2000.
[14] Jeffreys, H. Theory of Probability, 3rd ed. Oxford University Press, 1961.
[15] Shimazaki, H. & Shinomoto, S. A method for selecting the bin size of a time histogram. Neural Computation, 19: 1503–1527, 2007.