Complexity of Possibly-gapped Histogram and Analysis of Histogram (ANOHT)

Hsieh Fushing (Correspondence: Hsieh Fushing, University of California at Davis, CA 95616. E-mail: fhsieh@ucdavis.edu) and Tania Roy
Department of Statistics, University of California, Davis
July 5, 2019

Abstract

Through real examples, we demonstrate that gaps and distributional patterns embedded within real-valued measurements are inseparable biological and mechanistic information contents of the system. To discover such patterns, we develop data-driven algorithms to construct possibly-gapped histograms. Then, based on these computed patterns, we propose a geometry-based Analysis of Histogram (ANOHT) to provide multiscale comparisons among multiple treatments. Our developments begin by showing that constructing a possibly-gapped histogram is a complex problem of statistical mechanics, since its ensemble of candidate histograms is described via a two-layer Ising model in order to accommodate all local and scale-sensitive boundary parameters. This histogram construction is also a distinctive problem of Information Theory, since, from the perspective of data compression via Uniformity, the decoding error criterion is shown to be nearly independent of sample size. So if a Hamiltonian is defined as the sum of the total coding lengths of boundaries and the total decoding errors within bins, then the direct search for the macroscopic state, or optimal solution, is computationally impossible. We resolve this computing issue by applying the Hierarchical Clustering algorithm to narrow the search within a small set of nearly optimal solutions. An algorithm is then developed to select the best one. Further, based on a computed possibly-gapped histogram, the first phase of ANOHT is developed for simultaneous comparison of multiple treatments locally and globally, while the second phase of ANOHT develops a tree geometry and checks the authenticity of branch-induced treatment-groupings by applying classical empirical process theory. The well-known Iris data is used to illustrate our technical developments, and a large baseball pitching data set and a heavily right-censored divorce data set are analyzed to showcase the existential gaps and the utilities of ANOHT.

Keywords: Data Mechanics; Hierarchical Clustering Algorithm; Macrostate; Statistical Mechanics; Unsupervised Machine Learning.

1 Introduction

Without spatial and temporal coordinates, a sample of one-dimensional (1D) real-valued measurements is generally taken as one basic, simple data type and receives very limited research attention. Its simplicity is seemingly implied by the illusion that its information contents are evident and transparent, and that all its characteristic patterns should immediately pop up right in front of our eyes. As a matter of fact, its pictorial representation, usually called an empirical distribution, indeed embraces hidden and implicit patterns waiting to be extracted.

There are many possible patterns that can be exhibited through the piecewise step-function structure of an empirical distribution. Among all possible patterns, two take the most basic forms: one is the “linear segment” and the other is the “gap”. A linear segment indicates a potential Uniform distribution being embedded within the empirical distribution, and a gap strictly indicates an interval zone in which no observations are allowed at all. As for the rest of the potential patterns, they can be very well approximated by properly combining these two basic patterns. Therefore, an empirical distribution ideally can be well approximated by various serial compositions of basic patterns. Each composition is a possibly-gapped piecewise linear approximation, which is correspondingly equivalent to a possibly-gapped histogram.

From such an approximation perspective, a larger number of patterns involved in such a series of basic patterns means a higher cost in terms of data compression [3]. Nevertheless, for the sake of true and intrinsic data patterns, it seems very natural to think that the hypothetical population distribution underlying the observed empirical distribution should embrace discontinuity.

But this is not the case in the statistical literature. Even though a histogram is discrete in nature, most existing versions of its construction were developed through approximating a density function by imposing continuity and smoothness assumptions [6], [18], [21]. In sharp contrast, in the computer science literature, a histogram is always discrete because it is primarily used for visualizing data or queries from a database [14]. So it does not involve approximations via continuous components of Uniform or other smooth kernel distributions.

The computing costs for a histogram via the aforementioned Statistics and Computer Science viewpoints are not severely high. But the computing load for a possibly-gapped histogram can be theoretically very heavy, because a constructive approach ideally needs to allow an unknown number of bins with heterogeneous bin-widths and to accommodate an unknown number of gaps of different sizes. Further, the computing cost grows with the sample size because of the multi-scale nature of these two basic patterns. That is why the construction of a possibly-gapped histogram is computationally complex.

In this paper, we resolve this computational issue algorithmically by treating it as if it were derived from a physical system. From this perspective, a possibly-gapped histogram should embrace deterministic structures, which are all the boundaries of the bins, and stochastic structures, which are Uniform within each bin. That is, the deterministic and stochastic structures together constitute the system information contents of a one-dimensional data set.

Such physical information contents render another interpretation from the perspective of algorithmic complexity [15]. A relatively simple way of seeing this complexity is the fact that the candidate ensemble is constructed via a two-layer Ising model [11]. This ensemble grows exponentially in size with respect to the number of data points, say $n$. Based on this ensemble description, we clearly see that all boundary parameters are characteristically local, because the multiple relevant scales also depend on $n$ in a heterogeneous fashion across all potential bins.

Further, it is interesting to note that the decoding error under Uniformity is nearly independent of the sample size within each bin. On one hand, this fact becomes an effective criterion for confirming Uniformity. On the other hand, any further division of a Uniform distribution would give rise to several Uniform distributions with lower total decoding errors. Therefore we need to balance the total decoding errors against the coding lengths of the boundaries of the bins (with respect to a converting index). So a Hamiltonian is defined.

Therefore constructing a possibly-gapped histogram is indeed a complex physical problem of statistical mechanics, aiming to extract the lowest-Hamiltonian macroscopic state (or macro-state). It is also a problem of information theory, aiming for the best balance between the two costs. But it is clearly not a problem of statistics, because of its multi-scale nature. Thus we need a brand-new computing protocol to seek the optimal solution, the macro-state.

It is surprising that this seemingly complex macro-state can be approximated by a relatively simple computational algorithm. By applying the popular hierarchical clustering (HC) algorithm, with the complete or other modules (excluding the single-linkage one), to a one-dimensional dataset, the resultant hierarchical tree indeed gives rise to just a few feasible candidates, from which nearly optimal solutions can be derived. We develop an algorithm to select one from this small set of candidates. Also, the criterion of the decoding error under Uniformity turns out to be a practical way of checking whether a space between two consecutive bins is indeed a gap.

The merits of a possibly-gapped histogram are intriguingly profound and far-reaching. We demonstrate its potential merits through our development of a new data-analysis paradigm, called Analysis of Histogram (ANOHT). The first phase of ANOHT is designed to address the issues of where and how multiple treatments or distributions are locally and globally different. The second phase of ANOHT is designed to answer which treatments are closer to which, but far away from others, and why. ANOHT is simple and equipped with excellent visualization capability. ANOHT also lends biological and mechanistic meaning to gaps identified within a histogram.

Our technical developments are illustrated throughout by employing the well-known Iris data. In the Results sections, ANOHT is first applied to a large pitching data set of three Major League Baseball (MLB) pitchers, and then to a heavily censored divorce data set. The implications of our developments, from building a possibly-gapped histogram to ANOHT, are discussed in the Conclusion section.

2 Methods

In this section, our technical developments are divided into four subsections: 1) building the ensemble of candidate histograms; 2) deriving the decoding errors; 3) constructing the algorithm for the possibly-gapped histogram together with the first phase of ANOHT; 4) developing the tree geometry as the second phase of ANOHT. The Iris data is used as an illustration throughout our developments.

2.1 Candidate ensemble.

Let $\{x_1, x_2, \ldots, x_n\}$ be the observed data points, and denote their ranked values as $x_{(1)} \le x_{(2)} \le \cdots \le x_{(n)}$ in increasing order. The construction of a possibly-gapped histogram can be precisely stated via the following two-layer one-dimensional Ising model depicted in Fig. 1.

1) Layer-1: a “spin” is placed on each spacing between a pair of consecutive observed values $(x_{(i)}, x_{(i+1)})$: an Up-spin indicates that $x_{(i)}$ and $x_{(i+1)}$ share the same Uniform-part, while a Down-spin indicates that $x_{(i)}$ and $x_{(i+1)}$ belong to two distinct bins, with or without a gap;

2) Layer-2: a “hidden-spin” is placed between two consecutive Uniform-parts, that is, on each Down-spin of the first layer: 1) a “+”-spin for having a gap, and 2) a “-”-spin for not having a gap.

Figure 1: Schematic illustration of the two-layer Ising model. The black dots on the blue line represent observations ordered and placed on the real line. The two layers of spins are given in the next two rows below, where up and down arrows describe “+” and “-” spins respectively. A green arrow means that the two consecutive observations will be in the same bin, and a red arrow means they will be in different bins. Hence, in the second layer, a downward red arrow indicates two separate bins with an existential gap, and an upward red arrow indicates two consecutive bins, but with no gap. The histograms are then constructed accordingly.
Figure 2: The gapped histograms of the four columns of the Iris dataset (namely, petal length, petal width, sepal length and sepal width respectively) are displayed in the four rows above. The proportion of the three species Setosa, Virginica, and Versicolor in each histogram is respectively represented by red, green and blue. The left column of panels shows the corresponding empirical cumulative distribution function, separated by red lines indicating each different bin of the histogram. The middle column of panels shows the four possibly-gapped histograms via the hierarchical clustering algorithm with the “complete” module and the chosen threshold index (tree height). The right column of panels describes the DESS for each bin of the histograms, along with the threshold plotted in red.

In total there are $n-1$ Up- or Down-spins in Layer-1. Then the number of spins in Layer-2 is exactly the number of Down-spins, say $k$, in Layer-1. Since $\binom{n-1}{k}$ is the number of distinct sets of $k$ nodes chosen from the pool of $n-1$ nodes, this class of “two-layer one-dimensional Ising model” has its cardinality growing exponentially with the sample size $n$ as:

$$\sum_{k=0}^{n-1} \binom{n-1}{k}\, 2^{k} = 3^{\,n-1}.$$

This exponential growth rate reflects the essence of the fact that all boundaries are local once we revoke continuity and smoothness assumptions. That is, the complexity of computing within this ensemble comes from dealing with multiple scales of parameters, which is one significant data-driven feature of the possibly-gapped histogram.
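To make the count above concrete, the following short Python sketch (with function names of our own) enumerates every two-layer spin configuration for small sample sizes and confirms the $\sum_{k}\binom{n-1}{k}2^{k} = 3^{n-1}$ cardinality.

```python
from itertools import product

def count_two_layer_configs(n):
    """Enumerate all two-layer Ising configurations for n ordered data points.

    Layer-1: one Up/Down spin on each of the n-1 spacings.
    Layer-2: one +/- (gap / no-gap) spin for every Down spin of Layer-1.
    """
    total = 0
    for layer1 in product(['up', 'down'], repeat=n - 1):
        k = layer1.count('down')          # number of bin boundaries
        total += 2 ** k                   # each boundary may or may not carry a gap
    return total

for n in range(2, 8):
    assert count_two_layer_configs(n) == 3 ** (n - 1)
    print(n, count_two_layer_configs(n))
```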

2.2 Decoding errors

Next we consider the decoding error coming from Uniform data-compression. First let the identically and independently distributed (i.i.d.) random variables be denoted as $U_1, \ldots, U_m$, that is, each $U_i$ is distributed with respect to (w.r.t.) the standard Uniform density function on $[0,1]$. Denote the order statistics as $U_{(1)} \le U_{(2)} \le \cdots \le U_{(m)}$; the density of the $r$-th order statistic $U_{(r)}$, say $f_{(r)}(\cdot)$, is evaluated as below:

$$f_{(r)}(u) = \frac{m!}{(r-1)!\,(m-r)!}\, u^{r-1}(1-u)^{m-r}, \qquad 0 \le u \le 1.$$

Here $f_{(r)}(\cdot)$ becomes (approximately) the Exponential density for the extreme order statistics ($r = 1$) and ($r = m$).

Then we have

$$E\bigl[U_{(r)}\bigr] = \frac{r}{m+1}.$$

Also, we have

$$\mathrm{Var}\bigl[U_{(r)}\bigr] = \frac{r(m+1-r)}{(m+1)^{2}(m+2)}.$$

Given the observed data within a single bin, let $\{U_1, \ldots, U_m\}$ be taken as a set of observations distributed with the standard Uniform distribution. Then the Decoding Error Sum of Squares (DESS) is evaluated as:

$$\mathrm{DESS} = \sum_{r=1}^{m}\Bigl(U_{(r)} - \frac{r}{m+1}\Bigr)^{2}, \qquad E[\mathrm{DESS}] = \sum_{r=1}^{m}\mathrm{Var}\bigl[U_{(r)}\bigr] = \frac{m}{6(m+1)} \approx \frac{1}{6},$$

since we expect the sum of squared deviations from the expected order statistics to be about $1/6$ if the $U_r$'s are realized from i.i.d. standard Uniform variables.

In general, by a Uniform-part $U[a,b]$, we denote a Uniform distribution on an interval $[a,b]$ in a possibly-gapped histogram. Thus a set of observations in a bin of a histogram will be taken as coming from $U[a,b]$ if its decoding error is about $(b-a)^{2}/6$. This is a definite property requirement, which we call the ‘DESS criterion’. Since this requirement is independent of the sample size for large samples, it must be satisfied by all parts of the possibly-gapped histogram. It is clear that the majority of candidates in the ensemble of the two-layer one-dimensional Ising model are not feasible.
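As a minimal computational sketch of the DESS criterion as reconstructed above (the function name and the rescaling of the expected order statistics to the bin range are our own assumptions), the following Python snippet computes the DESS of a bin and shows that, for Uniform samples, it hovers around $(b-a)^{2}/6$ regardless of the sample size.

```python
import numpy as np

def dess(bin_values, left=None, right=None):
    """Decoding Error Sum of Squares of one bin under the Uniform hypothesis.

    The m ordered values in the bin are compared with their expected positions
    r/(m+1), rescaled to [left, right]; the DESS criterion expects a value
    around (right - left)^2 / 6, nearly independent of m.
    """
    x = np.sort(np.asarray(bin_values, dtype=float))
    m = x.size
    a = x[0] if left is None else left
    b = x[-1] if right is None else right
    expected = a + (b - a) * np.arange(1, m + 1) / (m + 1.0)
    return float(np.sum((x - expected) ** 2))

# Uniform samples: DESS stays near (b - a)^2 / 6 regardless of sample size.
rng = np.random.default_rng(0)
for m in (50, 500, 5000):
    u = rng.uniform(0.0, 1.0, size=m)
    print(m, round(dess(u, 0.0, 1.0), 3), "reference:", round(1.0 / 6.0, 3))
```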

The next issue is the fact that a Uniform-part $U[a,b]$ can be divided into several Uniform-parts $\{U[a_k, a_{k+1}]\}_{k=0}^{K-1}$ with $a = a_0 < a_1 < \cdots < a_K = b$, so that $a_k < a_{k+1}$ for all $k$. Then the total DESS of the collection is about

$$\sum_{k=0}^{K-1} \frac{(a_{k+1} - a_k)^{2}}{6},$$

which can be much smaller than the DESS of the original $U[a,b]$, which is about $(b-a)^{2}/6$.

On the other hand, this collection of Uniform-parts incurs an increasing cost of coding length for the boundaries $\{a_1, \ldots, a_{K-1}\}$. So it is necessary to balance between DESS, due to stochastic randomness, and the complexity of the deterministic structure. Let $\eta$ be the relative index between the two costs: decoding error and coding length. Based on the criterion of model simplicity over complexity, the local boundary points $a_1, \ldots, a_{K-1}$ are chosen only if the following inequality holds:

$$\frac{(b-a)^{2}}{6} > \sum_{k=0}^{K-1} \frac{(a_{k+1}-a_k)^{2}}{6} + (K-1)\,\eta,$$

or,

$$\frac{(b-a)^{2}}{6} - \sum_{k=0}^{K-1} \frac{(a_{k+1}-a_k)^{2}}{6} > (K-1)\,\eta,$$

where $\eta$ is a specific index measuring how many units of decoding error are “equal to” the coding length cost of one boundary point. That is, $\eta$ is determined within the domain system and is independent of the sample size $n$. Thus this balancing criterion being independent of $n$ is realistic and practical, but is at odds with statistical modeling and its model selection in the Statistics literature. For instance, the principle of Minimum Description Length (MDL) is to minimize, with respect to a candidate model with a $d$-dimensional global parameter $\theta$,

$$\mathrm{MDL} = -\log P_{\hat{\theta}}(x_1, \ldots, x_n) + \frac{d}{2}\log n,$$

where both terms on the right-hand side grow with the sample size $n$. The first term accounts for the predictive errors, while the second term accounts for the precision of all global parameter estimators. Here we would like to point out a fundamental assumption: the ensemble of candidate models is independent of the sample size $n$. This assumption is always imposed implicitly, but hardly spelled out explicitly in the statistical literature. One serious implication of this assumption is the homogeneity of data structure that does not change as sample size increases. This homogeneity is apparently and practically not possible to hold in histogram construction.

2.3 Possibly-gapped histogram and 1st phase of ANOHT

Given a value of the index $\eta$, an observed data set $\{x_1, \ldots, x_n\}$, and the fact that an exhaustive search for an optimal possibly-gapped histogram within the ensemble prescribed by the two-layer one-dimensional Ising model is overwhelmingly impossible, one practically feasible computational solution is to apply the Hierarchical Clustering (HC) algorithm with the recommended Complete module. This choice of module provides the needed subdividing tendency. This tendency fits the Uniform distribution well in the sense that subdividing a Uniform distribution reduces the total decoding errors.

The computational algorithm for a small set of potential possibly-gapped histograms is proposed based on the bifurcating feature of an HC-tree, as follows. Let’s define a bifurcating inter-node in an HC-tree to be active if its parent inter-node has not been marked with a STOP sign. For practical purposes, the index $\eta$ is used as a threshold value.

Step-1: Compute the DESS on the single data-range $[x_{(1)}, x_{(n)}]$, taken as having only one bin. If the DESS criterion is satisfied, then stop this algorithm. If the DESS criterion is violated, go to Step-2.
Step-2: Find the next highest active bifurcating inter-node and compute its DESS, say $D$. If $D$ is smaller than the threshold $\eta$, or the DESS criterion is satisfied, then mark this inter-node with a STOP sign. Then check if there are active inter-nodes remaining in the HC-tree. If yes, then repeat Step-2, otherwise go to Step-3. If $D$ exceeds $\eta$, and the DESS criterion is violated, then repeat Step-2.
Step-3: Stop when no more active inter-nodes are left. Check for an existential gap between all consecutive Uniform-parts.
Algorithm 1 Algorithm for possibly-gapped histograms
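Below is a hedged Python sketch of Algorithm 1 under the reconstruction above, using SciPy's complete-linkage clustering on the 1D data; the stopping rule (split a node whenever its DESS exceeds both the threshold and the Uniform reference value) and all function names are our assumptions, not the authors' implementation.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, to_tree

def dess(vals):
    """DESS of one candidate bin (same reconstruction as in the earlier sketch)."""
    v = np.sort(np.asarray(vals, dtype=float))
    m = v.size
    expected = v[0] + (v[-1] - v[0]) * np.arange(1, m + 1) / (m + 1.0)
    return float(np.sum((v - expected) ** 2))

def possibly_gapped_bins(x, eta):
    """Sketch of Algorithm 1: descend a complete-linkage HC-tree built on the
    sorted 1D data; a node is split whenever its DESS exceeds both the
    threshold eta and the Uniform reference value (range^2)/6, and the
    surviving nodes become the bins of the possibly-gapped histogram."""
    x = np.sort(np.asarray(x, dtype=float))
    root = to_tree(linkage(x.reshape(-1, 1), method='complete'))

    bins = []
    def descend(node):
        vals = x[np.sort(node.pre_order())]          # members of this inter-node
        d = dess(vals)
        uniform_ref = (vals[-1] - vals[0]) ** 2 / 6.0
        if node.is_leaf() or d <= eta or d <= uniform_ref:
            bins.append((vals[0], vals[-1], vals.size))   # (left, right, count)
        else:
            descend(node.get_left())
            descend(node.get_right())
    descend(root)
    return sorted(bins)
```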

There are several ways to check whether a gap exists between two consecutive Uniform-parts of a histogram. For example, theoretically extended boundaries can be estimated via the Exponential distribution of the extreme order statistics. If the two estimated boundaries do not cross each other, then the two Uniform-parts are separated, so there exists a gap between them. Another practical way of checking is to recalculate the DESS of these two Uniform-parts after modifying both of them so that they share a common extended boundary, i.e., the mid-point between their extreme values. If both of them still satisfy the DESS criterion, then there is no gap; otherwise, there exists one.

The histogram resulting from the above algorithm is the coarsest version of a possibly-gapped histogram. If an optimal histogram is needed, then all Uniform-parts with large DESS can be bifurcated according to the HC-tree branches to further improve the total decoding errors, at the cost of coding one more boundary each time.

Further, the higher an HC-tree inter-node is, the bigger the gap potential. In fact, a gap is often visible in the coarsest version of a possibly-gapped histogram, since it gives rise to a possibly-gapped piecewise linear approximation of the empirical distribution function; see the illustrations of the four features of the Iris data in Fig. 2. Particularly in the third and fourth panels of Fig. 2, for Petal length and Petal width, we see evident gaps. The biological significance of the possibly-gapped histograms of Iris’ Petal length and width is clearly revealed through the separation of the color-coded species.

Here we explicitly illustrate such computations on the Iris data set, after standardizing each feature to zero mean and unit standard deviation. When a histogram is indeed “gapped”, the gap should also be identified as a corresponding break in the empirical CDF. We can verify this through the distributions of the extreme order statistics within each bin: for a bin containing $n_j$ observations, with $n_j$ being the bin-frequency, the distances of the bin’s minimum and maximum from the underlying bin boundaries approximately follow Exponential distributions, from which extended boundary estimates are obtained.

As in Fig. 2(c), the right boundary of the first bin of Petal length and the left boundary of the second bin are estimated in this fashion; similarly, in Fig. 2(d), the right boundary of the second bin and the left boundary of the third bin of Petal width are estimated. The computed left and right boundary estimates do not cross each other in either case. Hence the gap between Setosa and the other two species is significant. The DESS values of the final gapped bins are quite low and fall around the Uniform DESS criterion. In contrast, Iris’ Sepal length and width do not bear such significance.

As a histogram of a feature is usually taken as a simple data-visualization step within a long and complicated process of data analysis, it hardly constitutes a stand-alone goal of data analysis. Here we would like to point out that this impression is incorrect: a possibly-gapped histogram is indeed an important and useful tool for data analysis in its own right.

Consider a possibly-gapped histogram constructed via Algorithm 1 applied to a data set obtained by pooling measurements from $T$ treatments. We are ready to simultaneously compare these $T$ treatment-specific distributions by encoding treatment-specific colors onto each bin with respect to its member-measurements’ treatment identifications. Each bin then has a composition of colored proportions, so its entropy relative to the entropy of the overall treatment sample-size proportions, i.e., their ratio, is an index for bin-specific local comparison among the treatments. A p-value can also be derived from a simulation study via simple random sampling without replacement. Hence we can afford to test these treatments by pointing out where and how they are different. If an overall index is needed, then a weighted entropy across all bins is calculated, and so is its p-value. This is the first phase of ANOHT.
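The following Python sketch illustrates this first phase of ANOHT under our reading of the text: bin-specific entropy ratios relative to the overall treatment proportions, with permutation p-values obtained by simple random sampling of the treatment labels without replacement. All function names and the exact form of the index are our assumptions rather than the authors' code.

```python
import numpy as np

def entropy(p):
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

def bin_entropy_test(values, labels, bin_edges, n_perm=2000, seed=0):
    """First phase of ANOHT (sketch): for each bin of a possibly-gapped
    histogram built on the pooled data, compare the entropy of its treatment
    composition with the entropy of the overall treatment proportions, and
    attach a permutation p-value from simple random sampling of labels
    without replacement."""
    rng = np.random.default_rng(seed)
    values, labels = np.asarray(values), np.asarray(labels)
    treatments = np.unique(labels)
    overall = np.array([(labels == t).mean() for t in treatments])
    h_overall = entropy(overall)

    def bin_ratios(lab):
        ratios = []
        for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
            in_bin = lab[(values >= lo) & (values < hi)]   # right-open bins
            if in_bin.size == 0:
                ratios.append(np.nan)
                continue
            comp = np.array([(in_bin == t).mean() for t in treatments])
            ratios.append(entropy(comp) / h_overall)
        return np.array(ratios)

    observed = bin_ratios(labels)
    perms = np.array([bin_ratios(rng.permutation(labels)) for _ in range(n_perm)])
    # A small observed ratio (one colour dominating a bin) is the signal.
    pvalues = np.nanmean(perms <= observed, axis=0)
    return observed, pvalues
```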

In summary, a color-coded possibly-gapped histogram can simultaneously offer more informative comparisons among treatments than the Kolmogorov-Smirnov test, which is limited to pairwise distributional comparisons, and One-Way Analysis of Variance (ANOVA), which focuses on comparing mean values.

This first phase of ANOHT on the Iris data is shown through the middle column of panels of Fig. 2. All four color-coded possibly-gapped histograms reveal very informative comparisons among the three Iris species. The compositions of bin-specific color-codes reveal how and where these species are different. Such locality information is essential and invaluable. The biological meaning of the gaps in the histograms pertaining to Petal length and width becomes evident and crucial. Here, for local testing purposes with respect to all four features, we see that the majority of p-values are either zero or extremely small due to single-color dominance or one color being completely missing. Likewise, the overall p-values are extremely small. Therefore, biologically speaking, these four features all provide informative pieces of information for differentiating the three species of Iris. This is a conclusion that has not been seen in the literature, particularly in machine learning.

2.4 Tree geometry and 2nd phase of ANOHT

As the first phase of ANOHT provides a way of simultaneously comparing treatments locally and globally, the second phase of ANOHT intends to precisely address the issues of which treatments are closer to which, and farther away from which. Our algorithm constructs a tree geometry upon these treatment-nodes to provide comprehensive information regarding these issues. Upon this tree geometry, the algorithm further evaluates an “authenticity” index at every inter-node of a tree branch. Here an authentic tree branch means that its memberships are strongly supported by the data, not due to chance.

Heuristic ideas underlying this algorithm are as follows. A color-coded possibly-gapped histogram (of counts) is transformed into a matrix: one colored bin is turned into one column, with color-specific rows, that is, one color (treatment) per row. Then the matrix of counts is normalized row-by-row into a matrix having columns of frequencies. Then, for simplicity, the Euclidean distance is calculated between every pair of rows and the hierarchical clustering algorithm is applied to build an HC-tree upon the row axis. So a heatmap superimposed with such an HC-tree results, with $T-1$ inter-nodes, and hence $T-1$ branches, for $T$ treatments.

Along the construction process of this HC-tree, each inter-node is associated with a tree height. We sort and digitally rank these tree heights from smallest to largest as a way of re-normalization. Hence each inter-node specifying a branch is assigned a rank-digit. This inter-node-specific rank-digit indicates that the tree height at which any node within this branch meets any node outside of this branch must be ranked higher.

Using the Empirical Process Theory for complete data ([7], [19]), mimicking is then applied to this heatmap by simulating each row vector. The specific multivariate Normality structure used here is given as follows. Let $\hat{F}_n(\cdot)$ be a generic empirical distribution estimating a true distribution $F(\cdot)$ pertaining to one of the treatments, based on a sample of size $n$. The classical empirical process theory implies that

$$\sqrt{n}\,\bigl(\hat{F}_n(t) - F(t)\bigr) \Rightarrow B\bigl(F(t)\bigr),$$

where $B(\cdot)$ is a Brownian Bridge, which is a Gaussian process with covariance function $\mathrm{Cov}\bigl(B(s), B(t)\bigr) = s(1-t)$ when $s \le t$. Therefore, with $b_1 < b_2 < \cdots < b_{K-1}$ being the chosen ordered interior boundaries of the $K$ bins, the vector $\bigl(\hat{F}_n(b_1), \ldots, \hat{F}_n(b_{K-1})\bigr)$ comes from a family of multivariate Normal distributions, as below:

$$\sqrt{n}\,\Bigl(\bigl(\hat{F}_n(b_1), \ldots, \hat{F}_n(b_{K-1})\bigr)^{\top} - \bigl(F(b_1), \ldots, F(b_{K-1})\bigr)^{\top}\Bigr) \rightarrow N_{K-1}\bigl(0, \Sigma_F\bigr), \qquad (1)$$

with $[\Sigma_F]_{ij} = F(b_i)\bigl(1 - F(b_j)\bigr)$ for $i \le j$, and the matrix $\Sigma_F$ and its inverse taking the standard Brownian-Bridge forms (the inverse being tridiagonal, reflecting the Markov property). Thus we have, approximately, $\bigl(\hat{F}_n(b_1), \ldots, \hat{F}_n(b_{K-1})\bigr) \sim N_{K-1}\bigl((F(b_1), \ldots, F(b_{K-1})),\ \tfrac{1}{n}\Sigma_F\bigr)$.

Further, based on equation (1), we have the following asymptotic multivariate Normality result for the vector of bin proportions $\hat{p} = \bigl(\hat{F}_n(b_1) - \hat{F}_n(b_0), \ldots, \hat{F}_n(b_K) - \hat{F}_n(b_{K-1})\bigr)$, with $b_0$ and $b_K$ denoting the two outermost boundaries:

$$\sqrt{n}\,(\hat{p} - p) \rightarrow N_{K}\bigl(0, \Sigma_p\bigr),$$

with $p_j = F(b_j) - F(b_{j-1})$ and

$$\Sigma_p = \mathrm{diag}(p) - p\,p^{\top}, \qquad [\Sigma_p]_{jj} = p_j(1 - p_j), \quad [\Sigma_p]_{ij} = -p_i\,p_j \ (i \neq j).$$
This covariance matrix, with small and negative off-diagonal entries $-p_i p_j$, is the key component for mimicking each row vector of the aforementioned matrix (or heatmap). An HC-tree is also derived for each mimicked heatmap. By performing such a mimicking procedure many times, an ensemble of HC-trees is generated. With this ensemble, we are able to compute an authenticity index at each inter-node of the original HC-tree by counting the proportion of mimicked HC-trees having the following property: the rank-digit pertaining to the smallest branch containing all nodes belonging to the original branch is smaller than or equal to the original rank-digit defining that branch. See also the description of Algorithm 2 below.

Step-1: Build a possibly-gapped histogram on the pooled data, using Algorithm 1.
Step-2: Compute a matrix with the frequency of each treatment in the bins constructed in the possibly-gapped histogram of the pooled data. For $T$ treatments and $K$ bins, let $c_{tk}$ denote the number of subjects of the $t$-th treatment falling in the $k$-th bin, and $c_{t\cdot} = \sum_{k=1}^{K} c_{tk}$ the total frequency of the $t$-th treatment. Define $\hat{p}_{tk} = c_{tk}/c_{t\cdot}$ and $\hat{P} = [\hat{p}_{tk}]$. Build a tree on the row-axis of the matrix $\hat{P}$. This tree will be the point of reference.
Step-3: Mimic by randomly simulating a matrix $\hat{P}^{*}$ with its $t$-th row being generated with respect to $N_K\bigl(\hat{p}_{t},\ \tfrac{1}{c_{t\cdot}}\bigl(\mathrm{diag}(\hat{p}_{t}) - \hat{p}_{t}\hat{p}_{t}^{\top}\bigr)\bigr)$.
Step-4: Generate the matrix $\hat{P}^{*}$ 10,000 times, and for each generated matrix build a tree on the rows and check whether the branches of these new trees are similar to the branching of the reference tree. For each bifurcating inter-node of the reference tree, calculate an authenticity index as the percentage of times its nodes re-appear together in the mimicked trees.
Algorithm 2 ANOHT
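Below is a hedged Python sketch of Algorithm 2, assuming the multinomial-type covariance reconstructed above; the authenticity check is simplified to asking whether a reference branch re-appears as an identical leaf-set in a mimicked tree, rather than the rank-digit comparison described in the text, and all names are ours.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, to_tree

def row_tree_branches(P):
    """Complete-linkage HC-tree on the rows of P; each branch is returned
    as the frozenset of its leaf (treatment) indices."""
    root = to_tree(linkage(P, method='complete'))
    branches = []
    def collect(node):
        if not node.is_leaf():
            branches.append(frozenset(node.pre_order()))
            collect(node.get_left())
            collect(node.get_right())
    collect(root)
    return branches

def anoht_authenticity(counts, n_mimic=10000, seed=0):
    """Sketch of Algorithm 2.  counts[t, k] = frequency of treatment t in bin k.
    Reduce n_mimic from the paper's 10,000 for a quick run."""
    rng = np.random.default_rng(seed)
    counts = np.asarray(counts, dtype=float)
    n_t = counts.sum(axis=1)                       # treatment sample sizes
    P = counts / n_t[:, None]                      # row-normalized proportions
    ref_branches = row_tree_branches(P)

    hits = {b: 0 for b in ref_branches}
    for _ in range(n_mimic):
        Pstar = np.empty_like(P)
        for t, (p, n) in enumerate(zip(P, n_t)):
            cov = (np.diag(p) - np.outer(p, p)) / n    # multinomial-type covariance
            Pstar[t] = rng.multivariate_normal(p, cov)
        mimic_branches = set(row_tree_branches(Pstar))
        for b in ref_branches:
            hits[b] += b in mimic_branches             # simplified branch check
    return {tuple(sorted(b)): hits[b] / n_mimic for b in ref_branches}
```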

It is emphasized once more that the algorithmically computed tree geometry allows us to see which branch of treatment-nodes is authentic, in the sense that within-branch distances are smaller than between-branch distances with significantly high probability. In other words, the formation of branch-specific memberships is not likely due to noise. This authenticity index evaluates and confirms a potential biological or mechanistic basis for each branch. This serves as one part of the knowledge discovery.

By individually applying Algorithm 2 to each of the four histograms in Fig. 2, we obtain four tree geometries on the three Iris species-nodes with respect to the four features, as shown in Fig. 3. For each of the four features of Iris, a large majority of the 10,000 generations of the species tree confirms that the branch of Virginica and Versicolor stays together, with Setosa separated from them. Further, the heatmap clearly pinpoints where the significant differences are, so that the species Setosa is seen to be rather distinct from the two other species. It is not unreasonable to think that each of these four HC-trees with authenticity indexes will shed light on the phylogenetics of these three species. This is one of the most significant merits of the 2nd phase of ANOHT.

Figure 3: ANOHT for all columns of the Iris data set. Out of the 10,000 re-generations, the nodes Versicolor and Virginica stay together in most of the mimicked trees for the Sepal length and Petal length features (and similarly for the Sepal width and Petal width features), and in those cases Setosa stays in a different branch. A color closer to red represents a higher value, and dark blue corresponds to the lowest frequency, zero.

3 ANOHT on right censored data.

When data are compromised, such as being right-censored in survival analysis, the construction of a histogram is rarely seen in practice ([13]) as well as in the literature ([4], [1], [12]). We propose to construct a possibly-gapped histogram and perform ANOHT based on the Kaplan-Meier estimate of the survival function. ANOHT is then also performed based on the Nelson-Aalen estimate of the cumulative hazard function. The latter analysis is much simpler and easier to apply than the former. We apply both analyses to a heavily censored divorce data set.

The foundation for the 1st phase of ANOHT is the fact that the Kaplan-Meier estimate places weights only on uncensored time points, see [16], and the weighting scheme is termed redistribution-to-the-right, see [5]. Therefore the first phase of ANOHT can be carried out by building a possibly-gapped histogram on the uncensored time points, and then applying the redistribution-to-the-right weighting scheme to adjust the weights on all bins. Nonetheless, the 2nd phase of ANOHT needs somewhat more involved empirical process theoretical results, which are given below.

3.1 Empirical process theory on Kaplan-Meier estimate of the survival function.

Given a treatment of interest, let the right-censored data set be generically denoted as $\{(\tilde{t}_i, \delta_i)\}_{i=1}^{n}$, with $\tilde{t}_i = \min(t_i, c_i)$ and $\delta_i = \mathbf{1}[t_i \le c_i]$, where the event time $t_i$ and the censoring time $c_i$ are independent, for all $i = 1, \ldots, n$. The Kaplan-Meier estimate $\hat{S}_n(\cdot)$ of the survival function $S(\cdot)$ is constructed as

$$\hat{S}_n(t) = \prod_{i:\, \tilde{t}_{(i)} \le t} \Bigl(1 - \frac{\delta_{(i)}}{n - i + 1}\Bigr),$$

where $\tilde{t}_{(1)} \le \tilde{t}_{(2)} \le \cdots \le \tilde{t}_{(n)}$ are the order statistics ranked from the smallest to the largest, and $\delta_{(i)}$ is the corresponding censoring status of $\tilde{t}_{(i)}$.

The right-censored version of empirical process theory ([16], [1], [12]) states that

$$\sqrt{n}\,\bigl(\hat{S}_n(t) - S(t)\bigr) \Rightarrow W(t),$$

where $W(\cdot)$ is a Gaussian process with characteristic covariance function, for $s \le t$,

$$\mathrm{Cov}\bigl(W(s), W(t)\bigr) = S(s)\,S(t)\int_{0}^{s} \frac{d\Lambda(u)}{y(u)},$$

with $\Lambda(\cdot)$ being the cumulative hazard function and $y(u) = P(\tilde{t}_i \ge u)$ the probability of still being at risk at time $u$. It is noted that this covariance reduces to the Brownian-Bridge form $F(s)\bigl(1 - F(t)\bigr)$, with $F = 1 - S$, when there is no censoring.

Likewise, the $k$-dimensional vector $\bigl(\hat{S}_n(b_1), \ldots, \hat{S}_n(b_k)\bigr)$, evaluated at chosen bin boundaries $b_1 < \cdots < b_k$, is asymptotically normally distributed, that is,

$$\sqrt{n}\,\Bigl(\bigl(\hat{S}_n(b_1), \ldots, \hat{S}_n(b_k)\bigr)^{\top} - \bigl(S(b_1), \ldots, S(b_k)\bigr)^{\top}\Bigr) \rightarrow N_k\bigl(0, \Sigma_W\bigr),$$

where

$$[\Sigma_W]_{ij} = \mathrm{Cov}\bigl(W(b_i), W(b_j)\bigr) = S(b_i)\,S(b_j)\int_{0}^{b_i \wedge b_j} \frac{d\Lambda(u)}{y(u)}.$$

Hence we have the asymptotic normality of the $k$-dimensional vector of bin weights $\bigl(\hat{S}_n(b_0) - \hat{S}_n(b_1), \ldots, \hat{S}_n(b_{k-1}) - \hat{S}_n(b_k)\bigr)$. Throughout this section we employ the corresponding finite-sample Normal approximation for these bin weights, obtained by plugging empirical estimates into the covariance function above.

It is clear that the Kaplan-Meier estimate has jumps only at uncensored time points. Hence the construction of a possibly-gapped histogram needs only one more step of re-adjusting the weighting; that is, a histogram for a right-censored data set can be constructed via the following two steps (a code sketch follows the two steps):

  1. Build a histogram upon the uncensored (or complete) data points;

  2. Re-adjust the weights by applying the redistribution-to-the-right scheme for constructing the Kaplan-Meier estimate.
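A minimal Python sketch of these two steps, assuming no tied event times; the Kaplan-Meier jumps computed below coincide with Efron's redistribution-to-the-right weights, and the helper names are ours.

```python
import numpy as np

def km_jumps(times, status):
    """Kaplan-Meier probability mass placed on each uncensored time
    (Efron's redistribution-to-the-right weighting), assuming no ties."""
    order = np.argsort(times)
    t, d = np.asarray(times)[order], np.asarray(status)[order]
    n = t.size
    S_prev, jumps = 1.0, np.zeros(n)
    for i in range(n):
        if d[i] == 1:                              # event: KM takes a jump here
            S_new = S_prev * (1.0 - 1.0 / (n - i))
            jumps[i] = S_prev - S_new
            S_prev = S_new
    return t, d, jumps

def censored_bin_weights(times, status, bin_edges):
    """Weight of each histogram bin: sum of the KM jumps of the uncensored
    times falling into that bin.  Weights may sum to less than one when the
    largest observation is censored (the unreliable right tail)."""
    t, d, jumps = km_jumps(times, status)
    weights = []
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        in_bin = (t >= lo) & (t < hi) & (d == 1)
        weights.append(jumps[in_bin].sum())
    return np.array(weights)
```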

3.2 Empirical process theory on Nelson-Aalen estimate of cumulative hazard function.

Thanks to the Martingale central limit theory, see [1] and [12], the Nelson-Aalen estimate of the cumulative hazard function is popularly used in survival analysis as an alternative to the Kaplan-Meier estimate in statistical inference, see [8], [9] and [10].

The Nelson-Aalen estimate of the cumulative hazard function $\Lambda(\cdot)$ is denoted as:

$$\hat{\Lambda}_n(t) = \sum_{i:\, \tilde{t}_{(i)} \le t} \frac{\delta_{(i)}}{n - i + 1},$$

and the Martingale Central Limit Theorem assures that

$$\sqrt{n}\,\bigl(\hat{\Lambda}_n(t) - \Lambda(t)\bigr) \Rightarrow W^{*}(t),$$

where $W^{*}(\cdot)$ is a Gaussian process with the independent-increment property. That is, its covariance function is

$$\mathrm{Cov}\bigl(W^{*}(s), W^{*}(t)\bigr) = \int_{0}^{s \wedge t} \frac{d\Lambda(u)}{y(u)}.$$

Hence $W^{*}(\cdot)$ acts like a Brownian motion, so the components of the $k$-dimensional vector of increments $\bigl(\hat{\Lambda}_n(b_1) - \hat{\Lambda}_n(b_0), \ldots, \hat{\Lambda}_n(b_k) - \hat{\Lambda}_n(b_{k-1})\bigr)$ are indeed asymptotically independently distributed, that is, each increment is approximately Normal,

$$\hat{\Lambda}_n(b_j) - \hat{\Lambda}_n(b_{j-1}) \approx N\Bigl(\Lambda(b_j) - \Lambda(b_{j-1}),\ \frac{1}{n}\int_{b_{j-1}}^{b_j} \frac{d\Lambda(u)}{y(u)}\Bigr),$$

with increments over disjoint bins being independent.
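A small companion sketch, under the same no-ties assumption, computes the Nelson-Aalen increments over histogram bins together with the standard counting-process variance estimates; the asymptotic independence of increments over disjoint bins, as reconstructed above, is what the 2nd phase of ANOHT on censored data exploits.

```python
import numpy as np

def nelson_aalen(times, status):
    """Nelson-Aalen estimate: the cumulative hazard jumps by 1/(risk set size)
    at every uncensored time (assuming no ties)."""
    order = np.argsort(times)
    t, d = np.asarray(times)[order], np.asarray(status)[order]
    n = t.size
    at_risk = n - np.arange(n)                    # risk set size at each ordered time
    jumps = np.where(d == 1, 1.0 / at_risk, 0.0)
    var_jumps = np.where(d == 1, 1.0 / at_risk ** 2, 0.0)
    return t, jumps, var_jumps

def hazard_increments(times, status, bin_edges):
    """Bin-wise increments of the cumulative hazard and their variance
    estimates; increments over disjoint bins are asymptotically independent."""
    t, jumps, var_jumps = nelson_aalen(times, status)
    inc, var = [], []
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        in_bin = (t >= lo) & (t < hi)
        inc.append(jumps[in_bin].sum())
        var.append(var_jumps[in_bin].sum())
    return np.array(inc), np.array(var)
```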

Here the asymptotic Normality result (1) for bin proportions is the foundation for the 2nd phase of ANOHT on complete data, while the corresponding asymptotic Normality results for the Kaplan-Meier and Nelson-Aalen estimates above are the foundations for the 2nd phase of ANOHT on right-censored data.

4 Results on baseball data

In this section, we apply our algorithms to construct possibly-gapped histograms and perform Analysis of Histogram (ANOHT) on pitching data of three well-known pitchers in Major League Baseball (MLB). The three pitchers are Jake J. Arrieta (Chicago Cubs), Kyle C. Hendricks (Chicago Cubs) and Robert A. Dickey (Atlanta Braves). The first two pitchers, with rather distinct pitching styles, were key to the 2016 World Series championship won by the Chicago Cubs, whose previous titles were in 1908 and 1907. The first pitcher is the 2015 Cy Young Award winner; the second, who graduated from Dartmouth College, has the nickname “The Professor”; while the third is the first knuckleball pitcher to win the Cy Young Award. This pitching data set contains 18732 pitches from the 2015-2016 seasons, including Arrieta’s 6848 pitches.

The data set was downloaded from the MLB official website, provided by the PITCHf/x system. This system has been installed in all 30 MLB stadiums to track and digitally record the full trajectory of every live baseball pitch since 2006. The data contain 21 features of each and every single pitch and its batting result throughout every single game in the regular season. Here we only look at two important pitching characteristic features: start-speed and break-length. The start-speed is the detected speed of the baseball at the release point of a pitch, while break-length is the measured largest distance from the baseball’s curved trajectory to the straight line linking its release point to the front of home plate. These two features are main characteristics of a pitch. They are related to each other in rather distinct ways among the 8 computer-classified pitch-types: 1) Four-seam Fastball (FF); 2) Fastball cutter (FC); 3) Fastball sinker (SI); 4) Slider (SL); 5) Changeup (CH); 6) Curveball (CU); 7) Knuckleball (KN); 8) Eephus (EP). The first three pitch-types are in the category of fastballs, while the last five are in the category of off-speed pitches. It needs to be kept in mind that a pitcher’s repertoire of pitch-types is only a strict subset of these eight pitch-types.

The start-speed of Arrieta’s fastball can go up to nearly 100 mph (miles per hour), while Dickey’s knuckleball can be as slow as 65 mph. It is known in general that a pitch with higher start-speed tends to have a smaller break-length. In fact, their relationship is rather nonlinear. Hence, combinations of these two features are usually cleverly crafted by every professional baseball pitcher in order to effectively face batters. So it is of great interest to compare a single pitcher’s, as well as multiple pitchers’, distinct pitch-types from these two aspects.

Figure 4: Possibly-gapped histograms for the start-speed (A) and break-length (B) measurements of pitches thrown by Jake Arrieta in 2015-2016. The total number of pitches from the 3 pitchers is 18732, of which 6848 are from Arrieta. The five different colors in the histograms show the frequency of Arrieta’s five different pitch-types in each bin.

4.1 Data analysis on one single pitcher’s pitch-types.

We begin our first phase of ANOHT by comparing the pitch-types of a single pitcher from the start-speed and break-length aspects. For instance, to compare Arrieta’s 5 pitch-types (FF, SI, SL, CH, CU), we apply Algorithm 1 to construct two possibly-gapped histograms of start-speed and break-length, as shown in panels (A) and (B) of Fig. 4, respectively. In panel (A), we see a clear gap around 84 mph. This gap bears a mechanistic difference between his curveball (CU) and the rest of the 4 pitch-types, which have significantly higher speeds. That is, this pitch-type is purposely and distinctively carried out by this pitcher. Hence this gap indicates a clear-cut implementation due to the pitcher’s control capability.

This capability is even more strikingly demonstrated via the gap shown in the histogram of break-length in panel (B). Again the pitch-type CU is completely separated from the rest of the 4 types. It is also evident that pitch-types FF and SI are dominant in bins at the right extreme of the start-speed histogram and correspondingly dominant in bins at the left extreme of the break-length histogram.

In summary, these two color-coded possibly-gapped histograms clearly demonstrate the distinctive differences among Arrieta’s five pitch-types from the two aspects. In other words, all bin-specific entropies are zero or significantly small in comparison with the entropy of the distribution pertaining to the five categories of pitch-types. That is, all p-values computed via the scheme of simple random sampling without replacement are zero or extremely small. Similar data analysis on the other two pitchers can likewise be carried out.

4.2 Data analysis on three pitchers’ pitch-types.

We begin with the first phase of ANOHT to understand how similar or distinct the pitcher-specific pitch-types are from the aspects of start-speed and break-length. On top of Arrieta’s 5 pitch-types, Hendricks has 5 pitch-types (FF, FC, SI, CH, CU) and Dickey has 3 (FF, KN, EP). So there are 13 pitcher-specific pitch-types (or treatments) in total for comparison. One coarse and one fine resolution of possibly-gapped histograms of start-speed are constructed and reported in the two rows of triplet panels of Fig. 5, respectively. After applying Algorithm 1 with a coarse choice of threshold, there are 20 bins selected in this coarse version, as seen in panels (A-C). The piecewise linear approximation of the empirical distribution seems reasonable, as seen in panel (A), and the histogram reveals two, or potentially three, evident modes in panel (B). But all the DESS values are relatively high in panel (C).

Figure 5: Possibly-gapped histograms of the start-speed measurements of all pitches thrown by Kyle Hendricks, Jake Arrieta and R.A. Dickey in 2015-2016. Sub-figures (A-C) show a coarse version with a high decoding error allowed, and sub-figures (D-F) show a finer version where the decoding error is constrained to be less than 2.

Therefore, in order to drive the DESS values lower, we select a lower threshold and get a histogram with a fine resolution, as shown in panels (D-F). Though the piecewise linear approximation becomes somewhat overwhelmingly detailed with 80 bins in panel (D), the histogram shows three evident modes in panel (E), and the corresponding DESS values become significantly smaller than those in panel (C). That is, the fine-resolution histogram, with much smaller bins on its right, seems to give rise to more detailed distributional structures than the coarse one. This coarse-vs-fine resolution of histograms illustrates why we need to have data-driven bins with data-driven sizes.

We then color-code the coarse version of the histogram, instead of the fine-resolution version, for better visualization when we set out to compare these 13 pitcher-specific pitch-types, as shown in panel (A) of Fig. 6. Upon this histogram of start-speed, we see that bins on its left-hand side are dominated by Dickey’s KN, and bins on its right-hand side are primarily dominated by Arrieta’s FF and SI, while Hendricks’s pitch-types are prevalent in bins situated on both sides of the major valley of this histogram. It is clear that, by focusing on where differences actually occur, this color-coded histogram is much more informative than the traditional boxplot, as shown in panel (C) of Fig. 6, or ANOVA, which focuses merely on comparisons of mean values.

Figure 6: Coarse version of the possibly-gapped histogram of start-speed from Figure 5, color-coded according to the 13 different pitch-types thrown by the three pitchers, in sub-figure (A). For differentiating between the 3 pitchers, we used shades of red, blue and green for Arrieta, Dickey and Hendricks respectively. Sub-figure (B) shows the tree on the ANOHT matrix with the branch strength percentages, and sub-figure (C) shows the traditional boxplot of all start-speeds of the 13 pitching styles.

We then turn to the second phase of ANOHT to see what the tree geometries of the 13 pitcher-specific pitch-types look like from the aspects of start-speed and break-length. Here it is worth re-emphasizing the significance of tree geometry. Tree geometry is a data-driven structure that makes branch-based group comparison possible. A branch-based group specified by an inter-node is confirmed as being authentic if its branch formation is significantly supported by the data, and not by chance. This confirmation equivalently implies that, with significantly high probability, no tree nodes outside of such a group are at smaller distances (or higher similarity) to its members than the distances (or similarities) among its group-member-nodes. From a mechanistic perspective, such a concept of authenticity of a tree branch (-based group) has a direct implication for the “phylogenetic” information contents regarding the 13 pitcher-specific pitch-types, as will be seen below. From an inference perspective, its functions go beyond the multiple comparisons, such as Tukey’s pairwise comparison and others, typically performed in Analysis of Variance (ANOVA) in classical Statistics.

We apply Algorithm 2 based on the coarse-resolution version of the histogram with 20 bins. The results based on the fine-resolution version with 80 bins are rather similar, but a bit harder to visualize. The resultant heatmap is framed by a hierarchical clustering (HC) tree on its row axis, as reported in panel (B) of Fig. 6. The percentages attached to each inter-node of the HC-tree are calculated from 10,000 mimicked HC-trees. We see via the HC-tree branches, in order from top to bottom, that the branch-based group of Hendricks’s CH and Arrieta’s CU, the group of Hendricks’s CU and Dickey’s KN, the group of Arrieta’s FF and SI, and the group of Hendricks’s SI and FC are all confirmed as being authentic with significantly high probabilities. The memberships of these four pairs of pitcher-specific pitch-types are not mixed with any pitcher’s pitch-type outside of their groups. The triplet branch-based group of Hendricks’s CH, Arrieta’s CU and Dickey’s FF, and the branch-based group of five consisting of Hendricks’s FF, SI and FC and Arrieta’s SL and CH, are confirmed as being authentic as well. Therefore we may conclude that these two authentic branches, one on the slow end and one on the fast end of start-speed, are two large data-driven anchors of the tree geometry.

Results from ANOHT on break-length are reported in the two panels of Fig. 7. The histogram in panel (A) reveals that bins on the lower end are dominated by Arrieta’s and Hendricks’s fastballs, while bins on the larger end are dominated by Arrieta’s CU and Dickey’s KN. It is also odd, but interesting, to see that the bin located at 10 is nearly exclusively occupied by Dickey’s KN. This exclusiveness implies that the Knuckleball-specific range of break-length is likely jointly achieved by the pitcher’s unusual pitching mechanics and the unusually low start-speed seen in the bins on the lower end of the start-speed histogram in panel (A) of Fig. 5. This is another mechanistic pattern that can be derived from ANOHT, but could hardly be derived easily from other methodologies.

For the tree geometry among the 13 pitcher-specific pitch-types, the HC-tree superimposed on the row axis of the heatmap reveals three authentic triplet branch-based groups, as shown in panel (B) of Fig. 7. These three triplet groups are: 1) Hendricks’s FF and FC and Arrieta’s FF; 2) Hendricks’s SI and Arrieta’s SL and SI; 3) Hendricks’s CH, Dickey’s FF and Arrieta’s CH. All three authentic branches are on the lower end of break-length. In contrast, the indexes pertaining to the branch-based group memberships on the higher end are high, but not high enough to be comfortably confirmed as being authentic.

Figure 7: The possibly-gapped histogram in panel (A) for the break-length of pitches thrown by Arrieta, Dickey and Hendricks in 2015-2016, color-coded according to the 13 different pitching styles. Panel (B) shows the ANOHT tree percentages on the proportion matrix of the 13 pitching styles.

5 Results on divorce data

We then apply ANOHT to a heavily censored divorce dataset. One significant feature of such a dataset is that heavy right-censoring typically causes many issues in parametric and semi-parametric inferences in the survival analysis literature, see [17] and [2]. However, it seems to have relatively little effect on ANOHT. The key reason is that ANOHT can be based only on the reliable portion of the histograms.

The dissolution of marriage dataset is based on a survey conducted in the U.S. studying 3,371 couples. The unit of observation is a couple and the event of interest is divorce. Couples lost to follow-up or widowed are treated as censored observations. We have three fixed covariates: the education of the husband and two indicators of the couple’s ethnicity, namely whether the husband is black and whether the couple is mixed.

This dataset consists of 1033 couples with a completely observed event-time of dissolution of marriage in years, and 2338 couples with right-censored event-times. The censoring rate is thus about 69%. The first factor is the years of the husband’s education, with three categories: 1) coded as 0 for less than 12 years; 2) coded as 1 for between 12 and 15 years; 3) coded as 2 for 16 or more years. The second factor is the husband’s ethnicity, with two categories: 1) coded as 1 for the husband being black; 2) coded as 0 otherwise. The third factor is the couple’s ethnicity, with two categories: 1) coded as 1 if the husband and wife have different ethnicities; 2) coded as 0 otherwise. Censoring status as usual is coded as 1 for divorce and 0 for censoring (due to widowhood or loss to follow-up).

Thus this dataset can be partitioned with respect to one single factor, a pair of factors, or all three factors together. The finest partition has 12 samples or treatments, in which each treatment is triple-coded. For instance, the treatment ‘201’ stands for the sample of couples with the husband’s education being more than 15 years (coded 2 in the first factor), a non-black husband (coded 0 in the second factor) and mixed ethnicity (coded 1 in the third factor).

5.1 ANOHT on a divorce data

The first phase of ANOHT on this divorce dataset, with respect to these four partitions, is reported in Fig. 8.

Figure 8: The first phase of ANOHT results on divorce dataset: (A) three treatments via 1st factor; (B) 2nd factor; (C) 3rd factor; (D) four treatments with respect to the 2nd and 3rd factors.

Overall, the possibly-gapped histogram shows three evident gaps along its right tail. These gaps are not likely due to old age, since they are followed by bins with significant heights; their locations seem interesting, but need more research attention for pertinent interpretations. Specifically, panel (A) of Fig. 8 shows the Kaplan-Meier estimates of the three survival functions with respect to the three categories of education level. It is rather interesting to note that a marriage with a husband having the middle education level (1**) is likely to fail much earlier than marriages with husbands at either the lowest (0**) or highest (2**) education level. Panels (B) and (C) of Fig. 8 clearly reveal that the two categories Black-husband (*1*) and mixed-ethnicity couple (**1) are respectively associated with lower survival rates. Further, the category of Black-husband with mixed-ethnicity couple (*11) has the lowest survival rate against the other three categories in panel (D) of Fig. 8. This survival rate is especially lower than the one belonging to the category of non-black husband and wife (*00).

The second phase of ANOHT on divorce dataset is reported in the figure below:

Figure 9: The second phase of ANOHT results on divorce dataset: (A) based on Kaplan-Meier survival function estimates on 12 categories; (B) based on the Nelson-Aalen cumulative hazard function estimates.

The group of 6 treatments (categories) (000, 001, 010, 100, 011, 200) is seen in panels (A) and (B) of Fig. 9; its members have similar Kaplan-Meier survival functions and Nelson-Aalen cumulative hazard functions. In particular, its subgroup of 3 treatments (000, 100, 200) shows very high similarity on both aspects among its members. Such evidence is interpreted as indicating that the factor of education has no effect on the occurrence of marriage failure among non-black couples. In contrast, the presence of the subgroup of 3 treatments (010, 001, 011) seems to imply that the occurrence of marriage failure for a couple with a husband having the lowest education level is affected neither by the husband’s ethnicity nor by the couple’s mixture of ethnicities.

From the Kaplan-Meier survival function perspective, the tree on the heatmap of panel (A) of Fig. 9 evidently shows the authenticity of the large branch of 9 treatments against the other branch of 3 treatments (201, 211, 210). Another authentic branch is found for the two treatments (100, 000). In contrast, from the Nelson-Aalen cumulative hazard function perspective, the only authenticity found is for the outlier treatment (211) against the rest of the treatments.

6 Conclusion

In this note we demonstrate algorithmic, data-driven computing for constructing a possibly-gapped histogram, and then develop the two phases of Analysis of Histogram (ANOHT). Their practical value is clearly illustrated and manifested through the Iris data, the baseball pitching data and the divorce data. Our data-driven computing simply and critically demonstrates the interfaces of machine learning, information theory, and statistical physics. We believe such data-analytic techniques have high potential across a wide spectrum of problems in the sciences.

It is surprising that the well-known hierarchical clustering algorithm, which was first developed in the 1950s for taxonomy ([20]), can help resolve a complex computational physical problem with exponential growth of complexity. Even more striking is that such a multi-scale physical problem explicitly contrasts with, and brings out, the unspoken assumption of homogeneity implicitly made in all known statistical model selection techniques.

We believe that the existential issue of gaps in a distribution will become more critical than ever in this Big Data era. Their biological and mechanistic meanings should be a part of critical knowledge discovery in data-driven analysis. In particular, a possibly-gapped histogram can provide a fundamentally new method of re-normalizing data via digital coding schemes. Such a data re-normalizing method in fact plays a critical role in unsupervised machine learning for knowledge discovery.

The merits of the two phases of ANOHT are apparent in the three real examples, and are likely to be seen in many areas of scientific research. We have demonstrated that they have the capability to resolve issues never discussed before, and at the same time to provide knowledge discoveries from wider perspectives. Further, since they are free from the common, but unrealistic, constraints and assumptions imposed by statistical modeling and analysis, their results should be more authentic and closer to reality.

We devote the rest of this section to the implications of our developments for statistics. Because continuity and smoothness assumptions are absent from our developments, many types of potential complexity can be realistically embraced within our data analysis of a 1D dataset. For instance, the multi-scale issue is clearly seen in the histograms of start-speed, as shown in Fig. 4. This issue has to be resolved in a data-driven fashion because of the lack of prior knowledge. In sharp contrast, a histogram with a pre-fixed number of equal-size bins is just not realistic. Here we emphasize that ignoring the realistic information contents contained in data is hardly the right way of analyzing real data.

This multiscale feature also spells out the fact that the bins’ boundaries are not global parameters. Thus the ensemble of candidate histograms is far from being fixed, but grows exponentially, like the ensemble of all possible spin configurations of an Ising model in statistical physics. They render no uniform convergence or rates of convergence. Thus all statistical model selection techniques in statistics, such as AIC, BIC and minimum description length (MDL), are not applicable. The simple underlying technical reason is that all these model selection techniques require their ensemble of candidate models to be fixed and independent of the sample size $n$, so that the corresponding mathematical optimization problem can be well defined.

The implications of ANOHT on statistics are clearly demonstrated and contrasted through its applications on the Iris, MLB pitching and divorce data sets. The computing simplicity and informative results pertaining to the two phases of analyses via ANOHT come from the recognized interface of a physical problem and an unsupervised machine learning algorithm.

Ethics:

No ethical approvals are needed in this study. All measurements pertaining to study subjects are available from two public websites.

Data Accessibility:

The Iris data is available at the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets.html). The pitching data is available in the PITCHf/x database belonging to Major League Baseball via http://gd2.mlb.com/components/game/mlb/. The divorce data is available at: http://data.princeton.edu/wws509/datasets/#divorce

Competing Interests:

We have no competing interests.

Author’s contributions:

H.F. designed the study. T.R. collected all data for analysis. H.F. and T.R. analyzed the data. H.F. and T.R. interpreted the results and wrote the manuscript. All authors gave final approval for publication.

Funding:

No financial funding received for this study.

Acknowledgement:

We thank Kevin Fujii for helps in pitching data retrieval.

References

  • [1] Andersen, P. K., Borgan, O., Gill, R. D. and Keiding, N. Statistical Models Based on Counting Process. Springer-Verlag. New York, 1992.
  • [2] Bassiakos, Y. C., Meng, X.-L. and Lo, S.-H. A general estimator of the treatment effect when the data are heavily censored. Biometrika, 78, 4, (1991), 741-8.
  • [3] Cover, T. M., and Thomas, J. A. Elements of Information Theory (Wiley Series in Telecommunications and Signal Processing). Wiley-Interscience, 2006.
  • [4] Cox, D. R. Analysis of Survival data. Chapman and Hall. London, 1984.
  • [5] Efron, B. The two sample problem with censored data. Proceeding of the Fifth Berkeley Symposium on Mathematical Statistics and Probability. Vol. IV, (1967), 831-853.
  • [6] Hall, P., and Hannan, E. J. On stochastic complexity and nonparametric density estimation. Biometrika 75, 4 (1988), 705.
  • [7] Hsieh Fushing The empirical process approach for semiparametric two-sample models with heterogeneous treatment effect. J. Roy. Statist. Soc. Ser. B. 57,(1995), 735-748.
  • [8] Hsieh Fushing The empirical process approach for a generalized hazards model. Biometrika 83, (1996a), 519-528.
  • [9] Hsieh Fushing Empirical process approach in a two-sample location-scale model with censored data. Ann. Statist. 24, (1996), 2705-2719.
  • [10] Hsieh Fushing On heteroscedastic hazards regression model: theory and applications. J. Roy. Statist. Soc. Ser. B. 63, (2001), 63-79.
  • [11] Ising, E. Beitrag zur theorie des ferromagnetismus. Zeitschrift für Physik 31, 1 (1925), 253–258.
  • [12] Kalbfleisch, J. D. and Prentice, R. L. The Statistical Analysis of Failure Time Data. John Wiley, New York, 2002.
  • [13] Kleinbaum D. G. Survival Analysis: A Self-Learning text. Springer-Verlag, New York, 1996.
  • [14] Kontkanen, P., and Myllymäki, P. Mdl histogram density estimation. In Proceedings of the Eleventh International Conference on Artificial Intelligence and Statistics (AISTATS-07) (2007), 219–226.
  • [15] Li, M., and Vitányi, P. M. An Introduction to Kolmogorov Complexity and Its Applications. Springer Publishing Company, Incorporated, 2008.
  • [16] Miller, R. G. Survival Analysis. John Wiley, New York, 1981.
  • [17] Padgett, W. and Wei, L. J. Estimation of the ratio of scale parameters in the two sample problem with arbitrary right censorship. Biometrika 69, 1982, 252-6.
  • [18] Rissanen, J. Stochastic Complexity in Statistical Inquiry. World Scientific Publishing Co., Inc., River Edge, NJ, USA, 1989.
  • [19] Shorack, G. R. and Wellner, J. A. Empirical Processes with Applications in Statistics. Wiley, New York, 1986.
  • [20] Sneath, P. H. A. and Sokal, R. R. Numerical Taxonomy: The Principles and Practices of Numerical Classification. Freeman, San Francisco, 1973.
  • [21] Yu, B., and Speed, T. P. Data compression and histograms. Probability Theory and Related Fields 92, 2 (1992), 195–229.