Complexity of Possibly-gapped Histogram and
Analysis of Histogram (ANOHT)
Abstract
Through real examples, we demonstrate that gaps and distributional patterns embedded within real-valued measurements are inseparable biological and mechanistic information contents of the system. To discover such patterns, we develop data-driven algorithms to construct possibly-gapped histograms. Then, based on these computed patterns, we propose a geometry-based Analysis of Histogram (ANOHT) to provide multiscale comparisons among multiple treatments. Our developments begin by showing that constructing a possibly-gapped histogram is a complex problem of statistical mechanics, since its ensemble of candidate histograms is described via a two-layer Ising model in order to accommodate all local and scale-sensitive boundary parameters. This histogram construction is also a distinctive problem of information theory, since, from the perspective of data compression via Uniformity, the decoding-error criterion is shown to be nearly independent of sample size. Hence, if a Hamiltonian is defined as the sum of the total coding lengths of boundaries and the total decoding errors within bins, then a direct search for the macroscopic state, or optimal solution, is computationally infeasible. We resolve this computing issue by applying the hierarchical clustering algorithm to narrow the search to a small set of nearly optimal solutions, and we develop an algorithm to select the best one. Further, based on a computed possibly-gapped histogram, the first phase of ANOHT is developed for simultaneous comparison of multiple treatments locally and globally, while the second phase of ANOHT develops a tree geometry and checks the authenticity of branch-induced treatment groupings by applying classical empirical process theory. The well-known Iris data is used to illustrate our technical developments, and a large baseball pitching data set and a heavily right-censored divorce data set are analyzed to showcase the existential gaps and the utilities of ANOHT.
Keywords: Data Mechanics; Hierarchical Clustering Algorithm; Macrostate; Statistical Mechanics; Unsupervised Machine Learning.
1 Introduction
Without spatial and temporal coordinates, a sample of one-dimensional (1D) real-valued measurements is generally taken as a basic, simple data type and receives very limited research attention. Its simplicity is seemingly implied by the illusion that its information contents are evident and transparent, and that all its characteristic patterns should pop up right in front of our eyes. As a matter of fact, its pictorial representation, usually called an empirical distribution, embraces hidden and implicit patterns waiting to be extracted.
There are many possible patterns that can be exhibited through the piecewise step-function structure of an empirical distribution. Among them, two take the most basic forms: one is the “linear segment” and the other is the “gap”. A linear segment indicates a potential Uniform distribution embedded within the empirical distribution, while a gap indicates an interval zone that allows no observations at all. The remaining potential patterns can be very well approximated by properly combining these two basic patterns. Therefore an empirical distribution can ideally be well approximated by various serial compositions of basic patterns. Each composition is a possibly-gapped piecewise linear approximation, which is correspondingly equivalent to a possibly-gapped histogram.
From such an approximation perspective, a larger number of patterns involved in such a series of basic patterns means a higher cost in terms of data compression [3]. Nevertheless, for the sake of true and intrinsic data patterns, it seems very natural to think that the hypothetical population distribution underlying the observed empirical distribution should embrace discontinuity.
But this is not the case in the statistical literature. Even though a histogram is discrete in nature, most existing versions of its construction were developed through approximating a density function by imposing continuity and smoothness assumptions [6], [18], [21]. In sharp contrast, in the computer science literature, a histogram is always discrete because it is primarily used for visualizing data or queries from a database [14], so it does not involve approximations via continuous components of Uniform or other smooth kernel distributions.
The computing costs for a histogram under the aforementioned statistics and computer science viewpoints are not severely high. But the computing load for a possibly-gapped histogram can be theoretically very heavy, because a constructive approach ideally needs to allow an unknown number of bins with heterogeneous bin-widths and to accommodate an unknown number of gaps of different sizes. Further, the computing cost grows with the sample size because of the multiscale nature of these two basic patterns. That is why the construction of a possibly-gapped histogram is computationally complex.
In this paper, we resolve this computational issue algorithmically by treating it as if it were derived from a physical system. From this perspective, a possibly-gapped histogram should embrace deterministic structures, which are all the boundaries of the bins, and stochastic structures, which are Uniform within each bin. That is, the deterministic and stochastic structures together constitute the system information contents of a one-dimensional data set.
Such physical information contents render another interpretation from the perspective of algorithmic complexity [15]. A relatively simple way of seeing this complexity is the fact that the candidate ensemble is constructed via a two-layer Ising model [11]. This ensemble grows exponentially in size with respect to the number of data points, say $n$. Based on this ensemble description, we clearly see that all boundary parameters are characteristically local, because multiple relevant scales also depend on $n$ in a heterogeneous fashion across all potential bins.
Further, it is interesting to note that the decoding error under Uniformity is nearly independent of the sample size within each bin. On one hand, this fact becomes an effective criterion for confirming Uniformity. On the other hand, any further division of a Uniform distribution gives rise to several Uniform distributions with a lower total decoding error. Therefore we need to balance the total decoding errors against the total coding lengths of the bin boundaries (with respect to a converting index). So a Hamiltonian is defined.
Therefore constructing a possibly-gapped histogram is indeed a complex physical problem of statistical mechanics aiming at extracting the macroscopic state (or macrostate) with the lowest Hamiltonian. It is also a problem of information theory aiming at the best balance between the two costs. But it is clearly not a problem of statistics, because of its multiscale nature. Thus we need a brand-new computing protocol to seek the optimal solution, the macrostate.
It is surprising that this seemingly complex macrostate can be approximated by a relatively simple computational algorithm. By applying the popular hierarchical clustering (HC) algorithm with the complete or other modules, excluding the single-linkage one, to a one-dimensional dataset, the resultant hierarchical tree gives rise to just a few feasible candidates. We develop an algorithm to select one from this small set of candidates, from which nearly optimal solutions can be derived. Also, the decoding-error criterion of Uniformity turns out to be a practical way of checking whether a space between two consecutive bins is indeed a gap.
The merits of a possibly-gapped histogram are intriguingly profound and far-reaching. We demonstrate its potential merits through our development of a new data analysis paradigm, called Analysis of Histogram (ANOHT). The first phase of ANOHT is designed to address the questions: where and how are multiple treatments or distributions locally and globally different? The second phase of ANOHT is designed to answer: which treatments are closer to which, and far away from others? And why? ANOHT is simple and equipped with excellent visualization capability. It also attaches biological and mechanistic meaning to the gaps found within a histogram.
Our technical developments are illustrated throughout by employing the well-known Iris data. In the Results sections, ANOHT is first applied to a large pitching data set of three Major League Baseball (MLB) pitchers, and then to a heavily censored divorce data set. The implications of our developments, from building a possibly-gapped histogram to ANOHT, are discussed in the Conclusion section.
2 Methods
In this section, our technical developments are divided into four subsections: 1) building the ensemble of candidate histograms; 2) deriving the decoding errors; 3) constructing the algorithm for the possibly-gapped histogram as the first phase of ANOHT; 4) developing the tree geometry as the second phase of ANOHT. The Iris data is used as illustration throughout.
2.1 Candidate ensemble.
Let $\{x_1, \dots, x_n\}$ be the observed data points, with ranked values $x_{(1)} \le x_{(2)} \le \cdots \le x_{(n)}$ in increasing order. The construction of a possibly-gapped histogram can be precisely stated via the following two-layer one-dimensional Ising model depicted in Fig. 1.
1. Layer-1: a “spin” is placed on each spacing between a pair of consecutive observed values $(x_{(i)}, x_{(i+1)})$: an Up-spin indicates that the pair shares the same Uniform-part, while a Down-spin indicates that $x_{(i)}$ and $x_{(i+1)}$ belong to two distinct bins, with or without a gap;
2. Layer-2: a “hidden spin” is placed between every two consecutive Uniform-parts on the first layer: a “+”-spin for having a gap, and a “−”-spin for having no gap.
In total there are $n-1$ Up- or Down-spins in Layer-1. The number of spins in Layer-2 is then exactly the number of Down-spins, say $m$, in Layer-1. Since $\binom{n-1}{m}$ is the number of distinct sets of $m$ nodes chosen from the pool of $n-1$ nodes, this class of two-layer one-dimensional Ising models has cardinality growing exponentially with the sample size $n$ as
$$\sum_{m=0}^{n-1} \binom{n-1}{m}\, 2^{m} \;=\; 3^{\,n-1}.$$
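As a quick numerical check of this counting argument, a minimal sketch (the function name `ensemble_size` is ours, not from the paper) sums over all Layer-1 configurations and verifies the closed form $3^{n-1}$:

```python
from math import comb

def ensemble_size(n):
    # Layer-1: n - 1 spacings, each Up or Down; Layer-2: each of the
    # m Down-spins carries a hidden +/- spin (gap or no gap).
    return sum(comb(n - 1, m) * 2**m for m in range(n))

# Each spacing effectively has 3 states, so the total collapses to 3^(n-1).
assert ensemble_size(12) == 3**11
```

The binomial theorem gives the collapse directly: $\sum_m \binom{n-1}{m} 2^m 1^{n-1-m} = (1+2)^{n-1}$.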
This exponential growth rate reflects the essence of the fact that all boundaries are local once we revoke the continuity and smoothness assumptions. That is, the complexity of computing within this ensemble comes from dealing with multiple scales of parameters, which is one significant data-driven feature of the possibly-gapped histogram.
2.2 Decoding errors
Next we consider the decoding error coming from Uniform data-compression. First let the independently and identically distributed (i.i.d.) random variables be denoted as $U_1, \dots, U_n$; that is, each $U_i$ is distributed with respect to (w.r.t.) the standard Uniform density function $f(u) = 1$ on $[0,1]$. Denote the order statistics as $U_{(1)} \le \cdots \le U_{(n)}$; the density of the $i$th order statistic $U_{(i)}$, say $f_{(i)}(u)$, is evaluated as
$$f_{(i)}(u) \;=\; \frac{n!}{(i-1)!\,(n-i)!}\, u^{\,i-1}\,(1-u)^{\,n-i}, \qquad 0 \le u \le 1.$$
Here $f_{(i)}$ becomes approximately an exponential density for the extreme order statistics $U_{(1)}$ and $U_{(n)}$ (with $nU_{(1)}$ and $n(1 - U_{(n)})$ nearly Exponential$(1)$).
Then we have
$$E[U_{(i)}] \;=\; \frac{i}{n+1}.$$
Also, we have
$$Var[U_{(i)}] \;=\; \frac{i\,(n-i+1)}{(n+1)^2\,(n+2)}.$$
Given the observed data as $\{u_1, \dots, u_n\}$, further let $\{U_1, \dots, U_n\}$ be another set of observations distributed according to the standard Uniform distribution. Then the Decoding Error Sum of Squares (DESS) is evaluated as
$$DESS \;=\; \sum_{i=1}^{n} \Big( u_{(i)} - \frac{i}{n+1} \Big)^2 \;\approx\; \sum_{i=1}^{n} Var[U_{(i)}] \;=\; \frac{n}{6\,(n+1)} \;\approx\; \frac{1}{6},$$
since we expect the observed sum of squares to be about the same size as $\sum_{i=1}^{n} Var[U_{(i)}]$ if the $u_i$ are realized from i.i.d. standard Uniform variables.
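The near-independence of DESS from the sample size can be seen in a short simulation; this is a sketch with our own helper name `dess`, assuming the DESS definition above:

```python
import numpy as np

def dess(u):
    """DESS of a sample against the Uniform order-statistic means i/(n+1)."""
    u = np.sort(np.asarray(u, dtype=float))
    n = len(u)
    return float(np.sum((u - np.arange(1, n + 1) / (n + 1)) ** 2))

rng = np.random.default_rng(0)
for n in (100, 1000, 10000):
    avg = np.mean([dess(rng.uniform(size=n)) for _ in range(300)])
    # E[DESS] = n / (6 (n + 1)), which stays near 1/6 regardless of n
    print(n, round(float(avg), 3))
```

Each printed average hovers around $1/6 \approx 0.167$ whether $n$ is $100$ or $10{,}000$, which is exactly the sample-size-free behavior the DESS criterion exploits.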
In general, by a Uniform-part $U[a,b]$ we denote a Uniform distribution on the interval $[a,b]$ in a possibly-gapped histogram. Thus a set of observations in a bin of a histogram will be taken as coming from $U[a,b]$ if its decoding error, after rescaling the bin to the unit interval, is about $1/6$. This is a definite property requirement, which we call the ‘DESS criterion’. Since this requirement is independent of the sample size for large $n$, it must be satisfied by all parts of the possibly-gapped histogram. It is clear that the majority of candidates in the ensemble of the two-layer one-dimensional Ising model are not feasible.
The next issue is the fact that a Uniform-part $U[a,b]$ can be divided into $K$ Uniform-parts $\{U[a_{k-1}, a_k]\}_{k=1}^{K}$ with $a_0 = a$ and $a_K = b$. After rescaling $[a,b]$ to the unit interval, the total DESS of the collection is about
$$\sum_{k=1}^{K} \frac{1}{6} \Big( \frac{a_k - a_{k-1}}{b - a} \Big)^2,$$
which can be much smaller than the DESS of the original $U[a,b]$.
On the other hand, this collection of Uniform-parts incurs an increasing cost in the coding length for the boundaries $\{a_1, \dots, a_{K-1}\}$. So it is necessary to balance the DESS due to stochastic randomness against the complexity of the deterministic structure. Let $\lambda$ be the relative index between the two costs: decoding error and coding length. Based on the criterion of model simplicity over complexity, the local boundary points $a_k$'s are chosen only if the following inequality holds:
$$DESS\big(U[a,b]\big) \;>\; \sum_{k=1}^{K} DESS\big(U[a_{k-1}, a_k]\big) \;+\; (K-1)\,\lambda,$$
or, equivalently, only if the reduction in total DESS exceeds the coding-length cost $(K-1)\lambda$ of the added boundaries, where $\lambda$ is a specific index measuring how many units of decoding error are “equal to” the coding-length cost of one boundary point. That is, $\lambda$ is determined within the domain system and is independent of the sample size $n$. Thus this balancing criterion, being independent of $n$, is realistic and practical, but it is at odds with statistical modeling and model selection in the statistics literature. For instance, the principle of Minimum Description Length (MDL) is to minimize, with respect to the model $M$,
$$-\log P\big(x^{n} \mid \hat{\theta}_M\big) \;+\; \frac{k_M}{2}\, \log n,$$
where both terms on the right-hand side grow with the sample size $n$: the first term accounts for the predictive errors, while the second accounts for the precision of all $k_M$ global parameter estimators. Here we would like to point out a fundamental assumption: the ensemble of candidate models is independent of the sample size $n$. This assumption is always imposed implicitly, but hardly ever spelled out explicitly in the statistical literature. One serious implication of this assumption is a homogeneity of data structure that does not change as the sample size increases. This homogeneity apparently and practically cannot hold in histogram construction.
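The balancing rule above reduces to a one-line decision; the following sketch (with our own hypothetical name `should_split`) makes the sample-size-free trade-off explicit:

```python
def should_split(dess_parent, dess_children, lam):
    """Split a Uniform-part only if the decoding-error reduction pays for
    the coding length of the added boundaries, lam units per boundary.
    lam is fixed by the domain system, independent of sample size."""
    added_boundaries = len(dess_children) - 1
    return dess_parent - sum(dess_children) > lam * added_boundaries
```

For example, splitting a part with DESS $\approx 1/6$ into two halves with DESS $0.02$ each is worthwhile at $\lambda = 0.05$, but splitting into halves with DESS $0.08$ each is not.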
2.3 Possibly-gapped histogram and 1st phase of ANOHT
Given a value of the index $\lambda$, an observed data set, and the fact that an exhaustive search for an optimal possibly-gapped histogram within the ensemble prescribed by the two-layer one-dimensional Ising model is computationally infeasible, one practically feasible computational solution is to apply the Hierarchical Clustering (HC) algorithm with the recommended complete-linkage module. This choice of module provides the needed subdividing tendency, which fits the Uniform distribution well in the sense that subdividing a Uniform distribution reduces the total decoding errors.
The computational algorithm for a small set of potential possibly-gapped histograms is proposed based on the bifurcating feature of an HC-tree, as follows. Let's define a bifurcating internode in an HC-tree to be active if its parent internode has not been marked with a STOP sign. For practical purposes, the index $\lambda$ is used as a threshold value.
There are several ways to check whether a gap exists between two consecutive Uniform-parts of a histogram. For example, theoretically extended boundaries can be estimated via the Exponential distribution of the extreme random variables. If the two estimated boundaries do not cross each other, then the two Uniform-parts are separated, so there exists a gap between them. Another practical check is to recalculate the DESS of the two Uniform-parts after modifying both of them so that they share a common extended boundary, i.e., the midpoint between their extreme values. If both still satisfy the DESS criterion, then there is no gap; otherwise, there is one.
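The first check above can be sketched as follows. This is only an illustrative stand-in: we approximate the exponential-tail boundary extension by the bin's average spacing (width divided by count minus one), and the helper name `has_gap` is ours:

```python
def has_gap(bin_left, bin_right):
    """Gap check between two consecutive bins, each given as (min, max, count).
    Each boundary is extended outward by the bin's mean spacing, a stand-in
    for the expected overshoot of the Exponential extreme-value estimate."""
    lo_a, hi_a, n_a = bin_left
    lo_b, hi_b, n_b = bin_right
    ext_a = hi_a + (hi_a - lo_a) / max(n_a - 1, 1)  # extended right boundary
    ext_b = lo_b - (hi_b - lo_b) / max(n_b - 1, 1)  # extended left boundary
    return ext_a < ext_b  # boundaries do not cross -> declare a gap
```

Two dense bins on $[0,1]$ and $[3,4]$ are declared gapped, while bins that nearly touch are not.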
The histogram resulting from the above algorithm is the coarsest version of the possibly-gapped histogram. If an optimal histogram is needed, then all Uniform-parts with large DESS can be bifurcated according to the HC-tree branches to further improve the total decoding errors, at the cost of coding one more boundary each.
Further, the higher an HC-tree internode is, the bigger the gap potential. In fact, a gap is often visible in the coarsest version of a possibly-gapped histogram, since it gives rise to one possibly-gapped piecewise linear approximation of the empirical distribution function; see the illustrations of the four features of the Iris data in Fig. 2. Particularly in the third and fourth panels of Fig. 2, for Petal length and Petal width, we see evident gaps. The biological significance of the possibly-gapped histograms of Iris' Petal length and width is clearly revealed through the separation of the color-coded species.
Here we explicitly illustrate such computations on the Iris data set, after standardizing each feature to zero mean and unit standard deviation. When a histogram is indeed “gapped”, this gap should be identified as a corresponding break in the empirical CDF as well. We can verify this through the distributions of the extreme random variables of each bin's Uniform distribution, with $n_b$ being the bin frequency.
As in Fig. 2(c), the right boundary of the first bin of Petal length and the left boundary of the second bin can be estimated via this extreme-value approximation; likewise, in Fig. 2(d), the right boundary of the second bin and the left boundary of the third bin of Petal width are estimated. The computed left and right boundary estimates do not cross each other in either case. Hence the gap between Setosa and the other two species is significant. The DESS values of the final gapped bins are quite low and fall around the Uniform DESS criterion. In contrast, Iris' Sepal length and width do not bear such significance.
A histogram of a feature is usually taken as a simple data-visualization step within a long and complicated process of data analysis, hardly constituting any standalone goal of data analysis. Here we would like to point out that this impression is completely incorrect. In fact, a possibly-gapped histogram is an important and useful tool for data analysis in its own right.
Consider a possibly-gapped histogram constructed via Algorithm 1 applied to a data set pooling measurements from multiple treatments. We are ready to simultaneously compare the treatment-specific distributions by encoding treatment-specific colors onto each bin with respect to its member-measurements' treatment identifications. Each bin then has a composition of colored proportions, so its entropy relative to the entropy of the treatment sample-size proportions, i.e., their ratio, is an index for bin-specific local comparison among the treatments. A p-value can also be derived from a simulation study via simple random sampling without replacement. Hence we can afford to test these treatments by pointing out where and how they are different. If an overall index is needed, then the entropy weighted across all bins is calculated, and so is its p-value. This is the first phase of ANOHT.
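A minimal sketch of the bin-specific test follows, with our own helper names `entropy` and `bin_pvalue`; the p-value counts how often simple random sampling without replacement from the pooled labels produces a bin composition at least as concentrated (entropy at most as small) as the observed one:

```python
import numpy as np

def entropy(counts):
    p = np.asarray(counts, dtype=float)
    p = p[p > 0] / p.sum()
    return float(-(p * np.log(p)).sum())

def bin_pvalue(bin_labels, pooled_labels, n_sim=2000, seed=0):
    """Permutation p-value for one bin's treatment-color composition under
    simple random sampling without replacement from the pooled labels."""
    rng = np.random.default_rng(seed)
    pooled = np.asarray(pooled_labels)
    cats = np.unique(pooled)
    size = len(bin_labels)
    obs = entropy([np.sum(np.asarray(bin_labels) == c) for c in cats])
    hits = 0
    for _ in range(n_sim):
        draw = rng.choice(pooled, size=size, replace=False)
        hits += entropy([np.sum(draw == c) for c in cats]) <= obs
    return hits / n_sim
```

A bin dominated by a single color (entropy zero) gets a tiny p-value, while a bin mirroring the pooled composition gets a large one.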
In summary, a color-coded possibly-gapped histogram can simultaneously offer more informative comparisons among treatments than the Kolmogorov-Smirnov test, which is limited to pairwise distributional comparisons, and One-Way Analysis of Variance (ANOVA), which focuses on comparing mean values.
This first phase of ANOHT on the Iris data is shown in the middle-column panels of Fig. 2. All four color-coded possibly-gapped histograms reveal very informative comparisons among the three Iris species. The compositions of bin-specific color codes reveal how and where these species are different. Such locality information is essential and invaluable, and the biological meaning of the gaps in the histograms pertaining to Petal length and width becomes evident and crucial. Here, for local testing purposes with respect to all four features, we see that the majority of p-values are either zero or extremely small, due to single-color dominance or one color being completely missing. Likewise, the overall p-values are extremely small. Therefore, biologically speaking, these four features all provide informative pieces of evidence for differentiating the three species of Iris. This is a conclusion that has not been seen in the literature, particularly in machine learning.
2.4 Tree geometry and 2nd phase of ANOHT
As the first phase of ANOHT provides a way of simultaneously comparing treatments locally and globally, the second phase of ANOHT intends to precisely address the questions: which treatments are closer to which, and farther away from which? Our algorithm constructs a tree geometry upon these treatment nodes to provide comprehensive information regarding these questions. Upon this tree geometry, the algorithm further evaluates an “authenticity” index at every internode, i.e., every tree branch. Here an authentic tree branch means that its memberships are strongly supported by the data, not due to chance.
The heuristic ideas underlying this algorithm are as follows. A color-coded possibly-gapped histogram (of counts) is transformed into a matrix: one colored bin is turned into one column according to color-specific rows, that is, one color for one row. The matrix of counts is then normalized row by row into a matrix whose columns hold frequencies. Then, for simplicity, the Euclidean distance is calculated between every pair of rows and the hierarchical clustering algorithm is applied to build an HC-tree upon the row axis. The result is a heatmap superimposed with such an HC-tree, with one internode, and hence one branch, per merge.
Along the construction process of this HC-tree, each internode is associated with a tree height. We sort and digitally rank these tree heights from smallest to largest as a way of renormalization. Hence each internode specifying a branch is assigned a rank-digit. This internode-specific rank-digit indicates that the tree height at which any node within this branch meets any node outside of it must be ranked higher.
Using the empirical process theory for complete data ([7], [19]), mimicking is then applied to this heatmap by simulating each row vector. The specific multivariate Normality structure used here is given as follows. Let $\hat{F}_n$ be a generic empirical distribution estimating a true distribution $F$ pertaining to one of the treatments. The classical empirical process theory implies that
$$\sqrt{n}\,\big(\hat{F}_n(\cdot) - F(\cdot)\big) \;\Rightarrow\; B\big(F(\cdot)\big),$$
where $B(\cdot)$ is a Brownian Bridge, i.e., a Gaussian process with covariance function $Cov\big(B(s), B(t)\big) = s(1-t)$ when $s \le t$. Therefore, with $t_1 < t_2 < \cdots < t_k$ being the chosen ordered boundaries of the bins, the vector $\big(\hat{F}_n(t_1), \dots, \hat{F}_n(t_k)\big)$ comes from a family of multivariate Normal distributions, as below.
That is, asymptotically,
$$\sqrt{n}\,\big(\hat{F}_n(t_1) - F(t_1), \dots, \hat{F}_n(t_k) - F(t_k)\big) \;\sim\; N_k(0, \Sigma), \qquad \Sigma_{ij} \;=\; F(t_{i \wedge j})\,\big(1 - F(t_{i \vee j})\big),
$$
with $i \wedge j = \min(i,j)$ and $i \vee j = \max(i,j)$; the covariance matrix $\Sigma$ is invertible with a tridiagonal inverse.
Further, based on the above result, we have the following asymptotic multivariate Normality for the vector of bin proportions $\hat{p} = (\hat{p}_1, \dots, \hat{p}_k)$, with $\hat{p}_i = \hat{F}_n(t_i) - \hat{F}_n(t_{i-1})$ and $p_i = F(t_i) - F(t_{i-1})$:
$$\sqrt{n}\,(\hat{p} - p) \;\sim\; N_k(0, \Sigma_p)$$
with
$$\Sigma_p \;=\; \mathrm{diag}(p_1, \dots, p_k) \;-\; p\,p^{\top}.$$
This covariance matrix, with small and negative off-diagonal entries $-p_i p_j$, is the key component for mimicking each row vector of the aforementioned matrix (or heatmap). An HC-tree is also derived for each mimicked heatmap. By performing this mimicking procedure many times, an ensemble of HC-trees is generated. With this ensemble, we are able to compute an authenticity index at each internode of the original HC-tree by counting the proportion of mimicked HC-trees having the following property: the rank-digit pertaining to the smallest branch containing all nodes belonging to the original branch is smaller than or equal to the original rank-digit defining the original branch. See also the description of Algorithm 2 below.
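The row-mimicking step can be sketched directly from the asymptotic law above (the helper name `mimic_row` is ours); each simulated row is one draw of bin frequencies for a treatment with true proportions $p$ and sample size $n$:

```python
import numpy as np

def mimic_row(p, n, n_copies, seed=0):
    """Mimic a heatmap row of k bin frequencies via the asymptotic law
    sqrt(n) (p_hat - p) ~ N_k(0, diag(p) - p p^T)."""
    rng = np.random.default_rng(seed)
    p = np.asarray(p, dtype=float)
    cov = (np.diag(p) - np.outer(p, p)) / n  # singular (rows sum to one)
    return rng.multivariate_normal(p, cov, size=n_copies)

sims = mimic_row([0.2, 0.3, 0.5], n=1000, n_copies=4000)
```

The covariance is singular because the proportions sum to one; NumPy's `multivariate_normal` handles this positive-semidefinite case via its SVD-based sampler. The empirical off-diagonal covariances of `sims` come out negative, matching $-p_i p_j / n$.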
It is emphasized once more that the algorithmically computed tree geometry allows us to see which branch of treatment nodes is authentic, in the sense that within-branch distances are smaller than between-branch distances with significantly high probability. In other words, the formation of branch-specific memberships is not likely due to noise. This authenticity index evaluates and confirms a potential biological or mechanistic basis for the branch, serving as one part of the knowledge discovery.
By individually applying Algorithm 2 to each of the four histograms in Fig. 2, we obtain four tree geometries on the three Iris species nodes with respect to the four features, as shown in Fig. 3. For each of the four features of Iris, a high proportion of the 10,000 generations of the species tree confirm that the branch of Virginica and Versicolor stays together, with Setosa separated from them. Further, the heatmap clearly pinpoints where the significant differences are, so that the species Setosa is rather distinct from the other two species. It is not unreasonable to think that each of these four HC-trees with authenticity indices sheds light on the phylogenetics of these three species. This is one of the most significant merits of the 2nd phase of ANOHT.
3 ANOHT on right-censored data
When data is compromised, such as being right-censored in survival analysis, the construction of a histogram is rarely seen in practice ([13]) or in the literature ([4], [1], [12]). We propose to construct a possibly-gapped histogram and perform ANOHT based on the Kaplan-Meier estimate of the survival function. ANOHT is then also performed based on the Nelson-Aalen estimate of the cumulative hazard function. The latter analysis is much simpler and easier to apply than the former. We apply both analyses to a heavily censored divorce data set.
The foundation for the 1st phase of ANOHT is the fact that the Kaplan-Meier estimate puts weights only on uncensored time points, see [16], and the weighting scheme is termed redistribution-to-the-right, see [5]. Therefore the first phase of ANOHT can be carried out by building a possibly-gapped histogram on the uncensored time points, and then applying the redistribution-to-the-right weighting scheme to adjust the weights on all bins. Nonetheless, the 2nd phase of ANOHT needs somewhat more involved empirical-process theoretical results, which are given below.
3.1 Empirical process theory on the Kaplan-Meier estimate of the survival function.
Given a treatment of interest, let the right-censored data set be generically denoted as $\{(\tilde{t}_i, \delta_i)\}_{i=1}^{n}$ with $\tilde{t}_i = \min(T_i, C_i)$ and $\delta_i = \mathbf{1}[T_i \le C_i]$, where the survival time $T_i$ and the censoring time $C_i$ are independent for all $i$. The Kaplan-Meier estimate of the survival function $S(t) = P(T > t)$ is constructed as
$$\hat{S}(t) \;=\; \prod_{i:\, \tilde{t}_{(i)} \le t} \Big( 1 - \frac{\delta_{(i)}}{n - i + 1} \Big),$$
where $\tilde{t}_{(1)} \le \cdots \le \tilde{t}_{(n)}$ are the order statistics ranked from the smallest to the largest, and $\delta_{(i)}$ is the censoring status corresponding to $\tilde{t}_{(i)}$.
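The product-limit formula above translates directly into code; this sketch assumes distinct observed times (no ties), and the function name `kaplan_meier` is ours:

```python
import numpy as np

def kaplan_meier(times, events):
    """Kaplan-Meier survival estimate evaluated at the sorted time points.
    events[i] = 1 for an observed (uncensored) time, 0 for censored.
    Assumes distinct times; censored points contribute a factor of 1."""
    order = np.argsort(times)
    t = np.asarray(times, dtype=float)[order]
    d = np.asarray(events)[order]
    n = len(t)
    at_risk = n - np.arange(n)            # n - i + 1 subjects still at risk
    return t, np.cumprod(1.0 - d / at_risk)

t, s = kaplan_meier([3, 5, 7, 2, 9], [1, 0, 1, 1, 0])
```

For this toy data the estimate drops only at the uncensored times 2, 3 and 7, staying flat at the censored times 5 and 9, which is exactly the jump structure the 1st phase of ANOHT exploits.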
The right-censored version of empirical process theory ([16], [1], [12]) states that
$$\sqrt{n}\,\big(\hat{S}(\cdot) - S(\cdot)\big) \;\Rightarrow\; S(\cdot)\, W\big(v(\cdot)\big),$$
where $W$ is a Gaussian process with characteristic covariance function
$$Cov\big(W(v(s)),\, W(v(t))\big) \;=\; v(s \wedge t),$$
with
$$v(t) \;=\; \int_0^t \frac{dF(u)}{\big(1 - F(u)\big)^2\,\big(1 - G(u^-)\big)},$$
$F$ the distribution of the survival times and $G$ that of the censoring times. It is noted that $v(t) = F(t)/\big(1 - F(t)\big)$ when there is no censoring.
Likewise the $k$-dimensional vector $\big(\hat{S}(t_1), \dots, \hat{S}(t_k)\big)$ is asymptotically normally distributed; that is,
$$\sqrt{n}\,\big(\hat{S}(t_1) - S(t_1), \dots, \hat{S}(t_k) - S(t_k)\big) \;\sim\; N_k(0, \Sigma_S),$$
where
$$\Sigma_{S,\,ij} \;=\; S(t_i)\, S(t_j)\, v(t_{i \wedge j}).$$
Hence we also have the asymptotic normality of the $k$-dimensional vector of bin weights $\big(\hat{S}(t_{i-1}) - \hat{S}(t_i)\big)_{i=1}^{k}$. Throughout this section we employ this normal approximation.
It is clear that the Kaplan-Meier estimate has jumps only at uncensored time points. Hence the construction of a possibly-gapped histogram needs only one more step of readjusting the weights; that is, a histogram for a right-censored data set can be constructed via the following two steps:
1. Build a histogram upon the uncensored (or complete) data points;
2. Readjust the weights by applying the redistribution-to-the-right scheme used in constructing the Kaplan-Meier estimate.
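Step 2 can be sketched as follows, with our own helper name `km_weights`: every point starts with mass $1/n$, and each censored point passes its mass equally to all points strictly to its right (a trailing censored point simply keeps its mass in this simplified sketch):

```python
import numpy as np

def km_weights(times, events):
    """Redistribution-to-the-right weights: mass 1/n at each point; a
    censored point forwards its mass equally to all later points."""
    order = np.argsort(times)
    d = np.asarray(events)[order]
    n = len(d)
    w = np.full(n, 1.0 / n)
    for i in range(n):
        if d[i] == 0 and i < n - 1:
            w[i + 1:] += w[i] / (n - 1 - i)
            w[i] = 0.0
    return np.asarray(times, dtype=float)[order], w

t, w = km_weights([3, 5, 7, 2, 9], [1, 0, 1, 1, 0])
```

On the toy data used above, the resulting weights at the uncensored times 2, 3 and 7 coincide with the jumps of the Kaplan-Meier estimate, which is the equivalence [5] establishes.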
3.2 Empirical process theory on the Nelson-Aalen estimate of the cumulative hazard function.
Due to the Martingale central limit theory, see [1] and [12], the Nelson-Aalen estimate of the cumulative hazard function is popularly used in survival analysis as an alternative to the Kaplan-Meier estimate in statistical inference, see [8], [9] and [10].
The Nelson-Aalen estimate of the cumulative hazard function $\Lambda(\cdot)$ is denoted as
$$\hat{\Lambda}(t) \;=\; \sum_{i:\, \tilde{t}_{(i)} \le t} \frac{\delta_{(i)}}{n - i + 1},$$
and the Martingale Central Limit Theorem assures that
$$\sqrt{n}\,\big(\hat{\Lambda}(\cdot) - \Lambda(\cdot)\big) \;\Rightarrow\; W^{*}(\cdot),$$
where $W^{*}$ is a Gaussian process with the independent-increment property; that is, its covariance function is
$$Cov\big(W^{*}(s),\, W^{*}(t)\big) \;=\; \sigma^2(s \wedge t).$$
Hence $W^{*}$ acts like a time-changed Brownian motion, so the increments of the $k$-dimensional vector $\big(\hat{\Lambda}(t_1), \dots, \hat{\Lambda}(t_k)\big)$ are indeed independently distributed; that is,
$$\sqrt{n}\,\Big(\big(\hat{\Lambda}(t_i) - \hat{\Lambda}(t_{i-1})\big) - \big(\Lambda(t_i) - \Lambda(t_{i-1})\big)\Big) \;\sim\; N\big(0,\; \sigma^2(t_i) - \sigma^2(t_{i-1})\big)$$
independently across $i = 1, \dots, k$.
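The Nelson-Aalen estimate is an even simpler computation than Kaplan-Meier, a cumulative sum rather than a cumulative product; this sketch (our own name `nelson_aalen`, assuming distinct times) mirrors the formula above:

```python
import numpy as np

def nelson_aalen(times, events):
    """Nelson-Aalen cumulative-hazard estimate at the sorted time points.
    events[i] = 1 for an observed (uncensored) time, 0 for censored."""
    order = np.argsort(times)
    d = np.asarray(events)[order]
    n = len(d)
    at_risk = n - np.arange(n)            # n - i + 1 subjects still at risk
    return np.asarray(times, dtype=float)[order], np.cumsum(d / at_risk)

t, H = nelson_aalen([3, 5, 7, 2, 9], [1, 0, 1, 1, 0])
```

The independent-increment structure of the limit process is what makes the 2nd-phase mimicking simpler here than in the Kaplan-Meier case: bin increments can be simulated as independent normals.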
The asymptotic normality result for complete data is the foundation for the 2nd phase of ANOHT on complete data, while the Kaplan-Meier and Nelson-Aalen results above are the foundations for the 2nd phase of ANOHT on right-censored data.
4 Results on baseball data
In this section, we apply our algorithms to construct possibly-gapped histograms and perform Analysis of Histogram (ANOHT) on the pitching data of three well-known pitchers in Major League Baseball (MLB): Jake J. Arrieta (Chicago Cubs), Kyle C. Hendricks (Chicago Cubs) and Robert A. Dickey (Atlanta Braves). The first two pitchers, with rather distinct pitching styles, were keys to the 2016 World Series title won by the Chicago Cubs, whose previous titles dated back to 1908 and 1907. The first pitcher is the 2015 Cy Young Award winner; the second, who graduated from Dartmouth College, has the nickname “The Professor”; and the third is the first knuckleball pitcher to win the Cy Young Award. This pitching data set contains 18,732 pitches in the 2015-2016 season, including Arrieta's 6,848 pitches.
The data set was downloaded from the MLB official website and is provided by the PITCHf/x system. This system has been installed in all 30 MLB stadiums to track and digitally record the full trajectory of every live pitch since 2006. The data contains 21 features of every single pitch, along with batting results, throughout every game in the regular season. Here we only look at two important pitching characteristics: start-speed and break-length. The start-speed is the detected speed of the baseball at the release point of a pitch, while break-length is the largest measured distance from the baseball's curved trajectory to the straight line linking its release point to the front of home plate. These two features are the main characteristics of a pitch. They are related to each other in rather distinct ways among the 8 computer-classified pitch-types: 1) Four-seam Fastball (FF); 2) Fastball cutter (FC); 3) Fastball sinker (SI); 4) Slider (SL); 5) Changeup (CH); 6) Curveball (CU); 7) Knuckleball (KN); 8) Eephus (EP). The first three pitch-types are in the category of fastballs, while the last five are in the category of off-speed pitches. It needs to be kept in mind that a pitcher's repertoire of pitch-types is only a strict subset of these eight.
The start-speed of Arrieta's fastball can go up to nearly 100 mph (miles per hour), while Dickey's knuckleball can be as slow as 65 mph. It is known in general that a pitch with higher start-speed tends to have smaller break-length; in fact, their relationship is rather nonlinear. Hence, combinations of these two features are cleverly crafted by every professional baseball pitcher in order to face batters effectively. So it is of great interest to compare a single pitcher's, as well as multiple pitchers', distinct pitch-types from these two aspects.
4.1 Data analysis on one single pitcher's pitch-types.
We begin with the first phase of ANOHT by comparing the pitch-types of a single pitcher from the start-speed and break-length aspects. For instance, to compare Arrieta's 5 pitch-types, FF, SI, SL, CH and CU, we apply Algorithm 1 to construct two possibly-gapped histograms, of start-speed and break-length, as shown in panels (A) and (B) of Fig. 4, respectively. In panel (A), we see a clear gap around 84 mph. This gap bears a mechanistic difference between his curveball (CU) and the other 4 pitch-types, which have significantly higher speeds. That is, this pitch-type is purposely and distinctively carried out by this pitcher. Hence this gap indicates a clear-cut implementation due to the pitcher's control capability.
This capability is even more strikingly demonstrated via the gap shown in the histogram of break-length in panel (B). Again the pitch-type CU is completely separated from the other 4 types. It is also evident that pitch-types FF and SI are dominant in the bins at the right extreme of the start-speed histogram and, correspondingly, dominant in the bins at the left extreme of the break-length histogram.
In summary, these two color-coded possibly-gapped histograms clearly demonstrate the distinctive differences among Arrieta's five pitch-types from the two aspects. In other words, all bin-specific entropies are zero or significantly small in comparison with the entropy of the distribution pertaining to the five categories of pitch-types. That is, all p-values computed via the scheme of simple random sampling without replacement are zero or extremely small. Similar data analyses on the other two pitchers can be carried out likewise.
4.2 Data analysis on three pitchers' pitch-types.
We begin with the first phase of ANOHT to understand how similar or distinct the pitcher-specific pitch-types are from the aspects of start-speed and break-length. On top of Arrieta's 5 pitch-types, Hendricks has 5 pitch-types, FF, FC, SI, CH and CU, and Dickey has 3, FF, KN and EP. So there are 13 pitcher-specific pitch-types (or treatments) in total for comparison. One coarse-resolution and one fine-resolution possibly-gapped histogram of start-speed are constructed and reported in the two rows of triplet panels of Fig. 5, respectively. After applying Algorithm 1 with one choice of tree height, 20 bins are selected in the coarse version, as seen in panels (A-C). The piecewise linear approximation of the empirical distribution seems reasonable, as seen in panel (A), and the histogram reveals two, or potentially three, evident modes in panel (B). But all the DESS values are relatively high in panel (C).
Therefore, in order to drive the DESS values lower, we select a lower tree height and obtain a histogram with a fine resolution, as shown in panels (D)-(F). Though the piecewise-linear approximation becomes rather overwhelmingly detailed with 80 bins in panel (D), the histogram shows three evident modes in panel (E), and the corresponding DESS values become significantly smaller than those in panel (C). That is, the fine-resolution histogram, with much smaller bins on its right, seems to give rise to more detailed distributional structures than the coarse one. This coarse-vs-fine comparison of histograms illustrates why we need data-driven bins with data-driven sizes.
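Algorithm 1 itself is not reproduced in this section, but the role of the tree height can be sketched with a minimal observation: single-linkage hierarchical clustering of a 1-D sample, cut at height h, splits the sorted data at exactly those spacings larger than h, so lowering the cut height yields more, finer bins and more candidate gaps. The bimodal sample below is hypothetical, not the pitching data.

```python
import numpy as np

def possibly_gapped_bins(x, height):
    """Single-linkage HC of a 1-D sample cut at `height`: on the real
    line this is exactly a split at every spacing larger than `height`.
    Returns one (left endpoint, right endpoint, count) triple per bin;
    the spaces between consecutive bins are the candidate gaps."""
    xs = np.sort(np.asarray(x, float))
    cut = np.where(np.diff(xs) > height)[0]   # spacings that become gaps
    pieces = np.split(xs, cut + 1)            # one array per bin
    return [(b[0], b[-1], len(b)) for b in pieces]

# Hypothetical bimodal "start-speed" sample with an intended gap in the
# mid-80s mph range.
rng = np.random.default_rng(1)
speeds = np.concatenate([rng.normal(78, 1.5, 300), rng.normal(91, 2.0, 700)])
coarse = possibly_gapped_bins(speeds, height=1.0)
fine = possibly_gapped_bins(speeds, height=0.1)
# Lowering the cut height can only add splits, never remove them.
```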
We then color-code the coarse version of the histogram, rather than the fine-resolution version, for better visualization when comparing the 13 pitcher-specific pitch-types, as shown in panel (A) of Fig. 6. In this histogram of start-speed, we see that the bins on the left-hand side are dominated by Dickey’s KN, and the bins on the right-hand side are primarily dominated by Arrieta’s FF and SI, while Hendricks’ pitch-types are prevalent in the bins on both sides of the histogram’s major valley. It is clear that, by focusing on where differences actually occur, this color-coded histogram is much more informative than a traditional boxplot, as shown in panel (C) of Fig. 6, or than ANOVA, which focuses merely on comparisons of mean values.
We then turn to the second phase of ANOHT to see what the tree geometries of the 13 pitcher-specific pitch-types look like from the aspects of start-speed and break-length. Here it is worth re-emphasizing the significance of the tree geometry: it is a data-driven structure that makes branch-based group comparisons possible. A branch-based group specified by an internode is confirmed as authentic if its branch formation is significantly supported by the data, not by chance. This confirmation equivalently implies that, with significantly high probability, no tree nodes outside such a group have smaller distances (or higher similarity) to its members than the distances (or similarities) among the group’s member nodes. From a mechanistic perspective, such a concept of branch (or branch-based group) authenticity has direct implications for the “phylogenetic” information contents of the 13 pitcher-specific pitch-types, as will be seen below. From an inference perspective, its function goes beyond the multiple comparisons, such as Tukey’s pairwise comparison, typically performed within Analysis of Variance (ANOVA) in classical statistics.
We apply Algorithm 2 based on the coarse-resolution histogram with 20 bins. The results based on the fine-resolution histogram with 80 bins are rather similar, but a bit harder to visualize. The resultant heatmap is framed by a hierarchical clustering (HC) tree on its row axis, as reported in panel (B) of Fig. 6. The percentages attached to each internode of the HC tree are calculated from 10,000 mimicked HC trees. Reading the HC tree branches from top to bottom, the branch-based group of Hendricks’ CH and Arrieta’s CU, the group of Hendricks’ CU and Dickey’s KN, the group of Arrieta’s FF and SI, and the group of Hendricks’ SI and FC are all confirmed as authentic with significantly high probabilities. The memberships of these four pairs of pitcher-specific pitch-types do not mix with any pitch-type outside their groups. The triplet branch-based group of Hendricks’ CH, Arrieta’s CU and Dickey’s FF, and the branch-based group of five, Hendricks’ FF, SI and FC together with Arrieta’s SL and CH, are likewise confirmed as authentic. Therefore we may conclude that these two authentic branches, one on the slow end and one on the fast end of start-speed, are two large data-driven anchors of the tree geometry.
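The authenticity check can be sketched as follows, under our reading of the criterion stated earlier: a group is tight in a mimicked tree when every member-to-outsider distance exceeds the largest within-group distance, and the internode percentage is the fraction of mimicked trees in which the group stays tight. The bin-count matrix, the multinomial mimicking scheme and the L1 distance below are illustrative assumptions, not the authors’ exact Algorithm 2.

```python
import numpy as np

def branch_is_tight(D, group):
    """Our reading of the authenticity criterion: every distance from a
    group member to an outside node exceeds the largest within-group
    distance in the distance matrix D."""
    group = np.asarray(group)
    rest = np.setdiff1d(np.arange(D.shape[0]), group)
    within = D[np.ix_(group, group)].max()
    between = D[np.ix_(group, rest)].min()
    return between > within

def authenticity(counts, group, n_trees=1000, rng=None):
    """Fraction of mimicked trees in which `group` stays tight.
    `counts` is a (treatments x bins) matrix of bin occupancies; each
    mimicked tree redraws every treatment's bin counts multinomially."""
    rng = rng or np.random.default_rng(0)
    hits = 0
    for _ in range(n_trees):
        mim = np.array([rng.multinomial(int(c.sum()), c / c.sum())
                        for c in counts])
        P = mim / mim.sum(axis=1, keepdims=True)
        D = np.abs(P[:, None, :] - P[None, :, :]).sum(axis=2)  # L1 distance
        hits += branch_is_tight(D, group)
    return hits / n_trees

# Hypothetical bin-count matrix for four treatments over five bins:
# treatments 0 and 1 are near-identical, 2 and 3 form a second pair.
counts = np.array([[90, 5, 3, 1, 1],
                   [88, 6, 4, 1, 1],
                   [2, 3, 5, 40, 50],
                   [1, 2, 4, 45, 48]], float)
```

With these counts, the pair {0, 1} is confirmed as authentic in nearly every mimicked tree, whereas a mismatched pair such as {0, 2} almost never stays tight.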
Results from ANOHT on break-length are reported in the two panels of Fig. 7. The histogram in panel (A) reveals that the bins on the lower end are dominated by Arrieta’s and Hendricks’ fastballs, while the bins on the higher end are dominated by Arrieta’s CU and Dickey’s KN. It is also odd, but interesting, to see that the bin located at 10 is almost exclusively occupied by Dickey’s KN. This exclusiveness implies that the knuckleball-specific range of break-length is likely achieved jointly by the pitcher’s unusual pitching mechanics and the unusually low start-speed seen in the bins on the lower end of the start-speed histogram in panel (A) of Fig. 5. This is another mechanistic pattern that can be derived from ANOHT, but could hardly be derived from other methodologies.
As for the tree geometry among the 13 pitcher-specific pitch-types, the HC tree superimposed on the row axis of the heatmap reveals three authentic triplet branch-based groups, as shown in panel (B) of Fig. 7. These three triplet groups are: 1) Hendricks’ FF and FC and Arrieta’s FF; 2) Hendricks’ SI and Arrieta’s SL and SI; 3) Hendricks’ CH, Dickey’s FF and Arrieta’s CH. All three authentic branches lie on the lower end of break-length. In contrast, the indexes pertaining to the branch-based group memberships on the higher end are high, but not high enough to be comfortably confirmed as authentic.
5 Results on divorce data
We then apply ANOHT to a heavily censored divorce dataset. One significant feature of such a dataset is that heavy right-censoring typically causes many issues for parametric and semiparametric inferences in the survival analysis literature; see [17] and [2]. However, it seems to have relatively little effect on ANOHT. The key reason is that ANOHT can be based only on the reliable portion of the histograms.
The dissolution-of-marriage dataset is based on a survey conducted in the U.S. studying 3,371 couples. The unit of observation is a couple, and the event of interest is divorce. Couples lost to follow-up or widowed are treated as censored observations. There are three fixed covariates: the husband’s education and two indicators of couple ethnicity, namely whether the husband is black and whether the couple is of mixed ethnicity.
This dataset consists of 1,033 couples with completely observed event-times of dissolution of marriage in years, and 2,338 couples with right-censored event-times; the censoring rate is about 69%. The first factor is the years of the husband’s education, with three categories: 1) coded as 0 for less than 12 years; 2) coded as 1 for between 12 and 15 years; 3) coded as 2 for 16 or more years. The second factor is the husband’s ethnicity, with two categories: 1) coded as 1 for the husband being black; 2) coded as 0 otherwise. The third factor is the couple’s ethnicity, with two categories: 1) coded as 1 if the husband and wife have different ethnicities; 2) coded as 0 otherwise. The censoring status, as usual, is coded as 1 for divorce and 0 for censoring (due to widowhood or loss to follow-up).
Thus this dataset can be partitioned with respect to a single factor, a pair of factors, or all three factors together. The finest partition has 12 samples or treatments, in which each treatment is triple-coded. For instance, the treatment ‘201’ stands for the sample of couples with the husband’s education being 16 or more years (coded 2 in the first factor), a nonblack husband (coded 0 in the second factor) and mixed ethnicity (coded 1 in the third factor).
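The triple coding can be sketched directly; the rows below are hypothetical couples, while the coding scheme is the one just described.

```python
# Hypothetical couple records; `heduc`, `hblack` and `mixed` follow the
# codings described in the text, `years` is the observed event-time and
# `div` the censoring status (1 = divorce, 0 = censored).
couples = [
    {"heduc": 2, "hblack": 0, "mixed": 1, "years": 8.5, "div": 1},
    {"heduc": 1, "hblack": 1, "mixed": 1, "years": 12.0, "div": 0},
]

# Concatenating the three factor codes gives the triple-coded treatment,
# one of the 3 x 2 x 2 = 12 cells of the finest partition.
for c in couples:
    c["treatment"] = f'{c["heduc"]}{c["hblack"]}{c["mixed"]}'
# The first couple falls in treatment '201', the second in '111'.
```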
5.1 ANOHT on the divorce data
The first phase of ANOHT on this divorce dataset, carried out from four aspects, is reported in Fig. 8.
Overall, the possibly-gapped histogram shows three evident gaps along its right tail. These gaps are not likely due to old age, since they are followed by bins with significant heights; their locations seem interesting, but need more research attention for pertinent interpretation. Specifically, panel (A) of Fig. 8 shows the Kaplan-Meier estimates of the three survival functions corresponding to the three categories of education level. It is rather interesting to note that a marriage with a husband of middle education level (1**) is likely to fail much earlier than marriages with husbands at either the lowest (0**) or the highest (2**) education level. Panels (B) and (C) of Fig. 8 clearly reveal that the two categories of black husband (*1*) and mixed-ethnicity couple (**1), respectively, are associated with lower survival rates. Further, the category of black husband in a mixed-ethnicity couple (*11) has the lowest survival rate among the four categories in panel (D) of Fig. 8; this survival rate is especially lower than that of the category of nonblack husband and wife (*00).
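For completeness, the Kaplan-Meier estimate used in these panels can be sketched as follows; this is a standard textbook implementation on toy inputs, not tied to the actual divorce data.

```python
import numpy as np

def kaplan_meier(time, event):
    """Kaplan-Meier survival estimate at each distinct event time:
    S(t) = prod over event times t_i <= t of (1 - d_i / n_i), where d_i
    is the number of divorces at t_i and n_i the number of couples still
    at risk just before t_i.  Returns a list of (t_i, S(t_i)) pairs."""
    time = np.asarray(time, float)
    event = np.asarray(event, int)
    surv, s = [], 1.0
    for t in np.unique(time[event == 1]):
        d = np.sum((time == t) & (event == 1))  # divorces at t
        n = np.sum(time >= t)                   # still at risk at t
        s *= 1.0 - d / n
        surv.append((float(t), s))
    return surv

# Toy sample: event-times in years, 1 = divorce, 0 = censored.
km = kaplan_meier([2, 3, 3, 5, 8], [1, 0, 1, 1, 0])
# km steps through S(2) = 4/5, S(3) = 3/5, S(5) = 3/10.
```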
The second phase of ANOHT on the divorce dataset is reported in Fig. 9.
The group of six treatments (categories), (000, 001, 010, 100, 011, 200), is seen in panels (A) and (B) of Fig. 9; these treatments have similar Kaplan-Meier survival functions and Nelson-Aalen cumulative hazard functions. In particular, its subgroup of three treatments, (000, 100, 200), shows very high similarity in both aspects. Such evidence is interpreted as the factor of education having no effect on the occurrence of marriage failure among nonblack couples. Similarly, the presence of the subgroup of three treatments (010, 001, 011) seems to imply that the occurrence of marriage failure for a couple whose husband has the lowest education level is affected neither by the husband’s ethnicity nor by the couple’s mixture of ethnicities.
From the Kaplan-Meier survival function perspective, the tree on the heatmap of panel (A) of Fig. 9 evidently shows the authenticity of the large branch of nine treatments against the other branch of three treatments, (201, 211, 210). Another authentic branch is that of the two treatments (100, 000). In contrast, from the Nelson-Aalen cumulative hazard function perspective, the only authenticity found is that of the outlier treatment (211) against the rest of the treatments.
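The Nelson-Aalen cumulative hazard compared in panel (B) can likewise be sketched; again this is the standard estimator on toy inputs rather than the divorce data.

```python
import numpy as np

def nelson_aalen(time, event):
    """Nelson-Aalen cumulative hazard estimate at each distinct event
    time: H(t) = sum over event times t_i <= t of d_i / n_i, with d_i
    events and n_i subjects at risk at t_i.  Returns (t_i, H(t_i))."""
    time = np.asarray(time, float)
    event = np.asarray(event, int)
    H, out = 0.0, []
    for t in np.unique(time[event == 1]):
        d = np.sum((time == t) & (event == 1))  # events at t
        n = np.sum(time >= t)                   # at risk at t
        H += d / n
        out.append((float(t), H))
    return out

# Same toy sample as above: H accumulates 1/5 + 1/4 + 1/2 = 0.95.
na = nelson_aalen([2, 3, 3, 5, 8], [1, 0, 1, 1, 0])
```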
6 Conclusion
In this note we demonstrate algorithmic, data-driven computing for constructing a possibly-gapped histogram, and then develop the two phases of Analysis of Histogram (ANOHT). Their practical value is clearly illustrated and manifested through the Iris data, the baseball pitching data and the divorce data. Our data-driven computing simply and critically demonstrates the interfaces of machine learning, information theory and statistical physics. We believe such data-analytic techniques hold very high potential across a very wide spectrum of problems in the sciences.
It is surprising that the well-known hierarchical clustering algorithm, first developed in the 1950s for taxonomy ([20]), can help resolve a complex computational physics problem with exponentially growing complexity. Even more striking is that such a multiscale physical problem explicitly contrasts with, and brings out, the unspoken assumption of homogeneity implicit in all known statistical model-selection techniques.
We believe that the existential issue of gaps in a distribution will become more critical than ever in this Big Data era. Their biological and mechanistic meanings should be a part of critical knowledge discovery in data-driven analysis. In particular, a possibly-gapped histogram can provide a fundamentally new method of renormalizing data via digital coding schemes. Such a data-renormalizing method in fact plays a critical role in unsupervised machine learning for knowledge discovery.
The merits of the two phases of ANOHT are apparent in the three real examples, and are likely to appear in many areas of scientific research. We have demonstrated that they have the capability to resolve issues never discussed before, and at the same time to provide knowledge discovery from wider perspectives. Further, since they are free from the common but unrealistic constraints and assumptions imposed by statistical modeling and analysis, their results should be more authentic and closer to reality.
We devote the rest of this section to the implications of our developments for statistics. Because our developments require no continuity or smoothness assumptions, many types of potential complexity can be realistically embraced within our data analysis of a 1D dataset. For instance, the multiscale issue is clearly seen in the histograms of start-speed, as shown in Fig. 4. This issue has to be resolved in a data-driven fashion because of the lack of prior knowledge. In sharp contrast, a histogram with a prefixed number of equal-size bins is simply not realistic. We emphasize here that ignoring the realistic information contents contained in data is hardly the right way to analyze real data.
This multiscale feature also spells out the fact that the bins’ boundaries are not global parameters. Thus the ensemble of candidate histograms is far from fixed; it grows exponentially, like the ensemble of all possible spin configurations of an Ising model in statistical physics, and admits no uniform rates of convergence. Hence all statistical model-selection techniques, such as AIC, BIC and minimum description length (MDL), are not applicable. The simple underlying technical reason is that all these model-selection techniques require their ensemble of candidate models to be fixed and independent of the sample size n, so that the corresponding mathematical optimization problem is well defined.
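The exponential growth can be made concrete with a simple counting argument (our illustration): each of the n - 1 spacings between sorted data points either is or is not a bin boundary, a two-state "spin", so the ensemble of candidate possibly-gapped histograms has 2^(n-1) members. The sketch below brute-force checks this count for a tiny n.

```python
from itertools import product

def n_candidate_histograms(n_points):
    """Each of the n-1 spacings between sorted data points either is or
    is not a bin boundary -- a two-state 'spin' -- so the candidate
    ensemble has 2**(n-1) members."""
    return 2 ** (n_points - 1)

# Brute-force check for a tiny sample: enumerate the spin configurations.
n = 6
configs = list(product([0, 1], repeat=n - 1))
assert len(configs) == n_candidate_histograms(n)  # 32 configurations
# At a few hundred observations the ensemble already exceeds 1e90
# members, so directly searching for the minimum-Hamiltonian state
# (the optimal histogram) is computationally hopeless.
```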
The implications of ANOHT for statistics are clearly demonstrated and contrasted through its applications to the Iris, MLB pitching and divorce datasets. The computational simplicity and informative results pertaining to the two phases of ANOHT come from the recognized interface between a physical problem and an unsupervised machine learning algorithm.
Ethics:
No ethical approval was needed for this study. All measurements pertaining to the study subjects are available from public websites.
Data Accessibility:
The Iris data is available at the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets.html. The pitching data is available in the PITCHf/x database belonging to Major League Baseball via http://gd2.mlb.com/components/game/mlb/. The divorce data is available at: http://data.princeton.edu/wws509/datasets/#divorce
Competing Interests:
We have no competing interests.
Author’s contributions:
H.F. designed the study. T.R. collected all data for analysis. H.F. and T.R. analyzed the data. H.F. and T.R. interpreted the results and wrote the manuscript. All authors gave final approval for publication.
Funding:
No financial funding received for this study.
Acknowledgement:
We thank Kevin Fujii for help with pitching data retrieval.
References
 [1] Andersen, P. K., Borgan, O., Gill, R. D. and Keiding, N. Statistical Models Based on Counting Processes. Springer-Verlag, New York, 1992.
 [2] Bassiakos, Y. C., Meng, X.-L. and Lo, S.-H. A general estimator of the treatment effect when the data are heavily censored. Biometrika 78, 4 (1991), 741-8.
 [3] Cover, T. M. and Thomas, J. A. Elements of Information Theory (Wiley Series in Telecommunications and Signal Processing). Wiley-Interscience, 2006.
 [4] Cox, D. R. Analysis of Survival Data. Chapman and Hall, London, 1984.
 [5] Efron, B. The two sample problem with censored data. Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Vol. IV (1967), 831-853.
 [6] Hall, P. and Hannan, E. J. On stochastic complexity and nonparametric density estimation. Biometrika 75, 4 (1988), 705.
 [7] Hsieh, F. The empirical process approach for semiparametric two-sample models with heterogeneous treatment effect. J. Roy. Statist. Soc. Ser. B 57 (1995), 735-748.
 [8] Hsieh, F. The empirical process approach for a generalized hazards model. Biometrika 83 (1996a), 519-528.
 [9] Hsieh, F. Empirical process approach in a two-sample location-scale model with censored data. Ann. Statist. 24 (1996b), 2705-2719.
 [10] Hsieh, F. On heteroscedastic hazards regression model: theory and applications. J. Roy. Statist. Soc. Ser. B 63 (2001), 63-79.
 [11] Ising, E. Beitrag zur Theorie des Ferromagnetismus. Zeitschrift für Physik 31, 1 (1925), 253-258.
 [12] Kalbfleisch, J. D. and Prentice, R. L. The Statistical Analysis of Failure Time Data. John Wiley, New York, 2002.
 [13] Kleinbaum, D. G. Survival Analysis: A Self-Learning Text. Springer-Verlag, New York, 1996.
 [14] Kontkanen, P. and Myllymäki, P. MDL histogram density estimation. In Proceedings of the Eleventh International Conference on Artificial Intelligence and Statistics (AISTATS 2007) (2007), 219-226.
 [15] Li, M. and Vitányi, P. M. An Introduction to Kolmogorov Complexity and Its Applications. Springer Publishing Company, Incorporated, 2008.
 [16] Miller, R. G. Survival Analysis. John Wiley, New York, 1981.
 [17] Padgett, W. and Wei, L. J. Estimation of the ratio of scale parameters in the two-sample problem with arbitrary right censorship. Biometrika 69 (1982), 25-26.
 [18] Rissanen, J. Stochastic Complexity in Statistical Inquiry. World Scientific Publishing Co., River Edge, NJ, 1989.
 [19] Shorack, G. R. and Wellner, J. A. Empirical Processes with Applications to Statistics. Wiley, New York, 1986.
 [20] Sneath, P. H. A. and Sokal, R. R. Numerical Taxonomy: The Principles and Practice of Numerical Classification. Freeman, San Francisco, 1973.
 [21] Yu, B. and Speed, T. P. Data compression and histograms. Probability Theory and Related Fields 92, 2 (1992), 195-229.