A nonparametric framework for inferring orders of categorical data from category-real ordered pairs


Chainarong Amornbunchornvej (1), Navaporn Surasvadi, Anon Plangprasopchok, and Suttipong Thajchayapong

All authors are at National Electronics and Computer Technology Center (NECTEC), Pathum Thani, 12120, Thailand.
(1) Corresponding author (email: chainarong.amo@nectec.or.th)
Manuscript received November 15, 2019; revised November 15, 2019.

Abstract

Given a dataset of careers and incomes, how large is the difference in income between any pair of careers? Given a dataset of travel time records, how much longer does a trip take when we choose one public transportation mode instead of another? In this paper, we propose a framework that infers orders of categories as well as the magnitudes of difference of real numbers between each pair of categories, using the Estimation statistics framework. Our framework reports not only whether an order of categories exists, but also the magnitude of difference between each consecutive pair of categories in the order. On large datasets, our framework scales well compared with the existing framework. The proposed framework has been applied to two real-world case studies: 1) ordering careers by incomes based on information of 350,000 households living in Khon Kaen province, Thailand, and 2) ordering sectors by closing prices based on 1060 companies’ closing prices of the NASDAQ stock market between years 2000 and 2016. The results of career ordering show income inequality among different careers. The stock market results illustrate the dynamics of sector domination, which can change over time. Our approach can be applied in any research area that has category-real ordered pairs. Our proposed Dominant-Distribution Network provides a novel approach to gaining new insight into category orders. The software of this framework is available for researchers and practitioners as an R package: EDOIF.

Bootstrapping, Nonparametric statistics, Estimation statistics, Ordering Inference

I Introduction

We constantly use orderings of items with respect to specific properties to make decisions. For instance, when we plan to buy a new house, we might use a list of houses ordered by price or by distance from downtown. We might use travel times to order a list of transportation modes and decide which option is the best for travelling from A to B.

Ordering is related to the concept of Partial order or poset [1] in Order theory. A well-known form of poset is a Directed acyclic graph (DAG), which is widely used in the study of causality [2, 3], animal behavior [4], social networks [5, 6], etc. Additionally, in social science, ordering careers by income can be applied to the study of inequality in society (see Section V-B).

Hence, ordering is an important concept that is used daily and can impact societal decisions and scientific research. However, in the era of Big data, inferring orders of categorical items based on their real-value properties from large datasets is far from trivial.

In this paper, we investigate the problem of inferring an order of categories based on their real-value properties, the Dominant-distribution ordering inference problem, using the poset [1] concept, as well as estimating the magnitude of difference between any pair of categories. We also propose a Dominant-Distribution Network as a representation of dominant category orders. We develop our framework based on the principle of Estimation Statistics, a recent statistical methodology. The aim of estimation statistics is to resolve issues of the traditional methodology, null hypothesis significance testing (NHST), which focuses on using a p-value to answer a dichotomous yes-no question (see Section II).

Dominant-distribution ordering inference problem: In order to say that one category dominates another, real values from the first category must, on average and with high probability, be higher than values from the other category (see Figure 1). Given a set of ordered pairs of category-real values, the goal is to find an ordered list of categories with respect to their real-value distributions. If category $A$ dominates category $B$ in the list, then the probability that real-number values from $A$ are greater than the expectation of $B$’s distribution is high, and not vice versa.

In terms of scalability, our framework finishes analysing a dataset of 10,000 data points in 11 seconds, while a candidate approach needs 300 seconds for the same dataset. The software of our proposed framework is available for researchers and practitioners as a user-friendly R package: EDOIF at [7].

This paper is organized as follows. Section II reviews related work, analyzing the existing gaps and how our contributions address them. Then, Section III describes our proposed framework. Experimental setup is shown in Section IV where corresponding results are discussed in Section V. Finally, Section VI concludes this paper.

II Related works

There are several NHST frameworks of both parametric (e.g. Student’s t-test [8]) and nonparametric (e.g. Mann-Whitney test [9]) types that can compare two distributions and report whether one has a greater sample mean or median than the other using a p-value. Nevertheless, these approaches cannot provide the magnitude of the mean difference between two distributions. Moreover, there are several issues with using only p-values to compare distributions. For instance, a null hypothesis might always be rejected since, in some systems, there is always some effect, even though the effect might be too small to matter [10]. NHST also treats distribution comparison as a dichotomous yes-no question and ignores the magnitude of difference, which might be important information [11] for a research question. Besides, relying only on p-values is a major cause of the lack of repeatability in many research publications [12].

Hence, Estimation Statistics has been developed as an alternative methodology to NHST. Estimation statistics is considered to be more informative than NHST [13, 14, 15]. The primary purpose of the estimation method is to determine magnitudes of difference among distributions in terms of point estimates and confidence intervals, rather than reporting only a p-value as in NHST.

Fig. 1: An example in which the distribution of category $A$ dominates the distribution of category $B$. The probability of finding a data point in $A$ that is greater than $E[B]$ is greater than the probability of finding a data point in $B$ that is greater than $E[A]$.

Recently, the Data Analysis using Bootstrap-Coupled ESTimation in R (DABESTR) framework [15], which is an Estimation Statistics framework, has been developed. It mainly uses the Bias-corrected and accelerated (BCa) bootstrap [16] to estimate a confidence interval of the mean difference between distributions. BCa bootstrap is more robust to skewed distributions [16] than a percentile confidence interval and other approaches. However, it is not obvious whether BCa bootstrap is better than other approaches at inferring a confidence interval of mean difference when two distributions have a high level of uniform noise (see Figure 2). Moreover, DABESTR does not scale well when there are many pairs of distributions to compare; it cannot display the confidence intervals of mean difference of all pairs in a single plot. Another issue is that BCa bootstrap is slow in practice compared to other approaches (see Section IV-E). Finally, there is no formalization of the Dominant-distribution ordering inference problem, which can be formalized with Order theory using the partial order concept [1].

Fig. 2: An example in which the distribution of category $A$ dominates the distribution of category $B$ with different degrees of uniform noise w.r.t. total data density: (left) 1%, (middle) 20%, and (right) 40% of noise. The higher the degree of uniform noise, the harder it is to distinguish whether $A$ dominates $B$.

II-A Our Contributions

To fill these gaps in the field, in this paper, we formalize the Dominant-distribution ordering inference problem using the partial order concept [1] in order theory (see Appendix A). We provide a framework as a solution to the Dominant-distribution ordering inference problem. Our framework is a non-parametric framework based on the bootstrap principle that makes no assumption regarding models of data (see Appendix B). We also propose to represent a dominant order with a Dominant-Distribution Network (Definition 4). Our proposed framework is capable of:

  • Inferring an order of multiple categories: inferring orders of domination of categories and representing orders in the form of a graph;

  • Estimating a magnitude of difference between a pair of categories: estimating confidence intervals of mean difference for all pairs of categories; and

  • Visualizing a network of dominant orders and magnitudes of difference among categories: visualizing dominant orders in one graph, called a Dominant-Distribution Network, as well as illustrating the magnitudes of difference of all pairs of categories within a single plot, which no other framework is capable of.

We evaluate our framework with a sensitivity analysis of uniform noise using simulation data for which we possess the ground truth, and compare it against several methods. To demonstrate real-world applications of our framework, we also provide two case studies. The first infers income orders of household careers in order to measure income inequality in Khon Kaen province, Thailand, based on surveys of 350,000 households. The second uses our framework to study the dynamics of sector domination in the NASDAQ stock market using the stock closing prices of 1060 companies between 2000 and 2016. The assessment on these two independent, unrelated domains indicates that our framework is potentially applicable to any field of study that requires ordering of categories based on real-value data. Our Dominant-Distribution Network (Definition 4) provides a novel approach to gaining insight into category orders.

II-B Why confidence intervals?

We could simply order categories by their means or medians. However, comparing only means cannot tell us how much the distributions of two categories overlap. Hence, we need mean confidence intervals to approximate the overlapping areas, and mean-difference confidence intervals to quantify the magnitude of difference between two categories. Additionally, if there are many categories and we want to infer which categories dominate which others, we can use a network to represent these dominant relationships. In this paper, we propose a network called a Dominant-distribution network to represent dominant relationships among categories.

III Methods

Fig. 3: A high-level overview of the proposed framework.

For any given pair of categories $A, B$, we define an order in which category $A$ dominates category $B$ using their real random variables as follows.

Definition 1 (Dominant-distribution relation)

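Given two continuous random variables $X_1$ and $X_2$ where $X_1 \sim \mathcal{D}_1$, $X_2 \sim \mathcal{D}_2$, and $\mathcal{D}_1, \mathcal{D}_2$ are distributions. Assuming that $\mathcal{D}_1$ and $\mathcal{D}_2$ have the following property: $P(X_1 \geq E[X_1]) = P(X_2 \geq E[X_2])$. We say that $\mathcal{D}_2$ dominates $\mathcal{D}_1$ if $P(X_1 \geq E[X_2]) \leq P(X_2 \geq E[X_1])$; denoting $\mathcal{D}_1 \preceq \mathcal{D}_2$. We denote $\mathcal{D}_1 \prec \mathcal{D}_2$ if $P(X_1 \geq E[X_2]) < P(X_2 \geq E[X_1])$.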
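As an informal illustration of Definition 1, the following minimal Python sketch compares two samples by the empirical counterparts of the two probabilities: the fraction of values in one sample that reach the mean of the other. The function names `exceed_prob` and `empirically_dominates` are hypothetical and are not part of the EDOIF package; the sketch uses plain sample proportions and involves no statistical test.

```python
import random
from statistics import fmean


def exceed_prob(xs, threshold):
    """Fraction of values in xs that are greater than or equal to threshold."""
    return sum(x >= threshold for x in xs) / len(xs)


def empirically_dominates(a, b):
    """True if sample a looks dominant over sample b.

    Compares the empirical P(a >= mean(b)) against P(b >= mean(a)).
    This is only a point comparison (an assumption of this sketch),
    not an inference procedure.
    """
    return exceed_prob(a, fmean(b)) > exceed_prob(b, fmean(a))


rng = random.Random(0)
high = [rng.gauss(10.0, 1.0) for _ in range(1000)]  # hypothetical category A
low = [rng.gauss(0.0, 1.0) for _ in range(1000)]    # hypothetical category B
print(empirically_dominates(high, low))
```

A point comparison like this ignores sampling uncertainty, which is why the framework relies on bootstrap confidence intervals and a rank test instead.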

Since a dominant-distribution relation is a partial order relation (Theorem A.4), an order always exists in any given set of ordered pairs of category and real number. For each pair of categories $A$ and $B$, we can use a bootstrap approach to infer whether $A$ dominates $B$, and use the confidence interval inferred from bootstrapping to represent the magnitude of difference between $A$ and $B$ (see Appendix B).

We propose the Empirical Distribution Ordering Inference Framework (EDOIF) as a solution to the Dominant-distribution ordering inference problem using bootstrap and additional non-parametric methods. Fig. 3 illustrates an overview of our framework. The input of our framework is a set of ordered pairs of category-real values, where each category belongs to a set of category classes and each value is a real number. In this paper, we assume that for any two pairs that share the same category, both real values are realizations of random variables from the distribution of that category.

In the first step, we infer a sample-mean confidence interval of each and a mean-difference confidence interval between each pair of and (Section III-A). Then, in Section III-B, we provide details regarding the way to infer the Dominant-distribution network.

III-A Confidence interval inference


We separate the input set into one subset per category, where each subset contains the data points that have that category. We then sort the subsets by their sample means.

For each category subset, we perform the bootstrap approach (Appendix B-A) to infer the sample-mean distribution and its $\alpha$-confidence interval. Given $\alpha$ and the subset, the framework infers the confidence interval of the sample mean. Algorithm MeanBootstrapFunc illustrates the details of how to infer this interval using the bootstrap approach.
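The percentile variant of this step can be sketched in a few lines of Python. This is an illustrative sketch only: the function names, the default of 1000 resamples, and the index convention for the percentiles are assumptions of the sketch, and the EDOIF package itself delegates this computation to the R "boot" package.

```python
import random
from statistics import fmean


def bootstrap_means(values, n_boot=1000, seed=0):
    """Sorted means of n_boot resamples drawn with replacement from values."""
    rng = random.Random(seed)
    n = len(values)
    return sorted(fmean(rng.choices(values, k=n)) for _ in range(n_boot))


def percentile_interval(sorted_stats, alpha=0.05):
    """Percentile interval: the alpha/2 and 1 - alpha/2 empirical quantiles."""
    m = len(sorted_stats)
    lo_idx = int((alpha / 2) * m)
    hi_idx = min(m - 1, int((1 - alpha / 2) * m))
    return sorted_stats[lo_idx], sorted_stats[hi_idx]


data = [float(i) for i in range(100)]  # hypothetical values of one category
lower, upper = percentile_interval(bootstrap_means(data))
print(lower, upper)
```

The interval is read directly from the sorted bootstrap means, which is what makes the percentile approach cheap compared with BCa, where additional bias and acceleration terms must be estimated.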


In the next step, we infer an $\alpha$-mean-difference confidence interval for each pair of categories.

Given are sample-mean distributions that are obtained by bootstrapping respectively, , , and .

The framework uses the bootstrap approach to infer the sample-mean-difference distribution of each pair of categories and its $\alpha$-confidence interval. Algorithm MeanDiffBootstrapFunc illustrates the details of how to infer this interval using the bootstrap approach in general.
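A minimal Python sketch of a percentile bootstrap for the mean difference follows. It assumes that the two categories are resampled independently in each round and that the difference is taken as the mean of `b` minus the mean of `a`; both choices, as well as the name `mean_diff_interval`, are assumptions of this sketch rather than details confirmed by the algorithm listing.

```python
import random
from statistics import fmean


def mean_diff_interval(a, b, alpha=0.05, n_boot=1000, seed=0):
    """Percentile bootstrap interval for mean(b) - mean(a).

    Each round resamples a and b independently with replacement
    (an assumption of this sketch).
    """
    rng = random.Random(seed)
    diffs = sorted(
        fmean(rng.choices(b, k=len(b))) - fmean(rng.choices(a, k=len(a)))
        for _ in range(n_boot)
    )
    lo_idx = int((alpha / 2) * n_boot)
    hi_idx = min(n_boot - 1, int((1 - alpha / 2) * n_boot))
    return diffs[lo_idx], diffs[hi_idx]


rng = random.Random(1)
cat_a = [rng.gauss(0.0, 1.0) for _ in range(200)]  # hypothetical category
cat_b = [rng.gauss(3.0, 1.0) for _ in range(200)]  # hypothetical category
print(mean_diff_interval(cat_a, cat_b))
```

An interval whose lower bound is above zero suggests that the second category has the larger mean, and the width of the interval conveys how precisely the gap is estimated.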

Even though we can use a normal confidence interval as the confidence interval in line 6 of Algorithm MeanBootstrapFunc and line 7 of Algorithm MeanDiffBootstrapFunc (see Lemma B.2), the normal bound has an issue when a distribution is skewed [15, 16]. Hence, we deploy both percentile confidence intervals and the Bias-corrected and accelerated (BCa) bootstrap [16] to infer both confidence intervals: the sample-mean interval and the mean-difference interval.

For percentile confidence interval inference (our default option) and BCa bootstrap, we deploy the standard library of bootstrap approaches in the R package “boot” [17, 18, 19].

III-B Dominant-distribution network inference

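The first step of inferring the Dominant-distribution network in Definition 4 is to infer, for each pair of categories, whether one dominates the other.

In the network, each node represents a category, and a directed edge links a dominating category to each category that it dominates.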

Given two categories, we can check the normal lower bound of their mean difference in Lemma B.2 (Appendix B). If the lower bound is greater than zero, then one category dominates the other. However, we deploy the Mann-Whitney test [9] to infer whether one category dominates another due to its robustness (see Section V). Along with the Mann-Whitney test [9], we also deploy the p-value adjustment method by Benjamini and Yekutieli (2001) [20] to reduce the false positive issue.
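The combination of a one-sided rank test with a Benjamini-Yekutieli adjustment can be sketched as follows. This is a simplified Python illustration, not the code used in EDOIF: the Mann-Whitney p-value here uses the plain normal approximation without tie or continuity corrections, the pairwise counting is quadratic in the sample sizes, and all function names are hypothetical.

```python
import math
from statistics import NormalDist


def mann_whitney_greater(a, b):
    """One-sided p-value for 'values in a tend to exceed values in b'.

    Normal approximation without tie or continuity correction
    (a simplification assumed by this sketch).
    """
    n1, n2 = len(a), len(b)
    # U counts pairs (x, y) with x > y; ties contribute one half.
    u = sum((x > y) + 0.5 * (x == y) for x in a for y in b)
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    return 1.0 - NormalDist().cdf((u - mean_u) / sd_u)


def benjamini_yekutieli(pvalues):
    """Benjamini-Yekutieli adjusted p-values, in the input order."""
    m = len(pvalues)
    c_m = sum(1.0 / k for k in range(1, m + 1))
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0  # also caps adjusted values at 1
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m * c_m / rank)
        adjusted[i] = running_min
    return adjusted


high = [float(v) for v in range(10, 20)]  # hypothetical category values
low = [float(v) for v in range(0, 10)]
raw = [mann_whitney_greater(high, low), mann_whitney_greater(low, high)]
print(benjamini_yekutieli(raw))
```

Adjusting the p-values of all category pairs together is what keeps the number of falsely inferred dominance edges under control when many pairs are tested.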

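In the next step, we add a node to the network for each category. For any pair of categories, if one dominates the other, then we add a directed edge from the dominating category to the dominated one. One property of the network is that the set of nodes reachable by a path from a given node is the set of distributions that the node’s distribution dominates.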
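The construction and the reachability property can be illustrated with a small Python sketch. The `dominates` callback stands in for whatever pairwise test is used, and the density formula (number of edges divided by the number of unordered category pairs) is an assumption of this sketch, chosen because it is consistent with the density values reported in the case studies; the names used here are hypothetical.

```python
from itertools import permutations


def build_network(categories, dominates):
    """Adjacency sets with an edge u -> v whenever dominates(u, v) is true."""
    edges = {c: set() for c in categories}
    for u, v in permutations(categories, 2):
        if dominates(u, v):
            edges[u].add(v)
    return edges


def reachable(edges, start):
    """All nodes that can be reached from start by following edges."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for nxt in edges[node]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def density(edges):
    """Edges divided by the number of unordered pairs (assumed definition)."""
    n = len(edges)
    n_edges = sum(len(targets) for targets in edges.values())
    return n_edges / (n * (n - 1) / 2)


# Hypothetical example: dominance decided by comparing fixed means.
means = {"C1": 1.0, "C2": 2.0, "C3": 3.0}
net = build_network(means, lambda u, v: means[u] > means[v])
print(reachable(net, "C3"), density(net))
```

In this toy example every pair is ordered, so the density is 1; a sparser network means that fewer pairs of categories can be separated.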

III-C Visualization

We use the ggplot2 package [21] to create plots of mean confidence intervals (e.g. Figure 7) and mean-difference confidence intervals (e.g. Figure 9). For a dominant-distribution network, we visualize it using the igraph package [22] (e.g. Figure 8).

IV Experimental setup

We use both simulation and real-world datasets to evaluate the performance of our method.

IV-A Simulation data for sensitivity analysis

We simulated datasets from mixture distributions, each of which consists of a normal distribution, a Cauchy distribution, and a uniform distribution. The random variable is defined as follows.

(1)

Where is a normal distribution with mean and variance , is a Cauchy distribution with location and scale , is a uniform distribution with the minimum number and maximum number , and is a value that represents a level of uniform noise. When the increases, the ratio of uniform distribution in the mixture distribution increases. We set to generate simulation datasets in order to perform the sensitivity analysis.
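To make the role of the noise level concrete, here is a hedged Python sketch of drawing from such a mixture. The mixing rule is an assumption of this sketch: with probability `noise` a value is drawn from the uniform component, and otherwise the normal and Cauchy components are chosen with equal probability. The exact weights of the original equation are not reproduced here, and every parameter value shown is hypothetical.

```python
import math
import random


def sample_mixture(rng, noise, mu, sigma, loc, scale, low=-400.0, high=400.0):
    """Draw one value from a normal/Cauchy/uniform mixture.

    noise is the probability of drawing from the uniform component;
    the equal split between normal and Cauchy is an assumption.
    """
    if rng.random() < noise:
        return rng.uniform(low, high)
    if rng.random() < 0.5:
        return rng.gauss(mu, sigma)
    # Inverse-CDF sampling for a Cauchy(loc, scale) variable.
    return loc + scale * math.tan(math.pi * (rng.random() - 0.5))


rng = random.Random(0)
sample = [sample_mixture(rng, 0.2, mu=80.0, sigma=16.0, loc=85.0, scale=2.0)
          for _ in range(1000)]
print(min(sample), max(sample))
```

Raising `noise` spreads more of the mass evenly over the uniform range, which blurs the separation between categories in the way Figure 2 depicts.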

Fig. 4: A dominant-distribution network of simulation datasets

In all simulation datasets, there are five categories: . The dominant-distribution relations of these categories are represented as a dominant-distribution network . The network is shown in Figure 4. Only dominates others. In this paper, for , we set to generate realizations of . For , we set .

Because uniform distribution in the mixture distribution has the range between -400 and 400, but all areas of distributions of are within , a method has more issue to distinguish whether for any when we increase (see Fig 2).

The main task of inference here is to measure whether a given method can infer that w.r.t. a network in Figure 4 from these simulation datasets. We generate 100 datasets for each different value of . In total, there are 900 datasets.

To measure the performance of ordering inference, we define true positive (TP), false positive (FP), and false negative (FN) in order to calculate precision, recall, and F1 score as follows. Given any ordered pair of categories, TP is when both the ground truth (Figure 4) and the inferred result agree that the dominant-distribution relation holds. FP is when a method infers that the relation holds but the ground truth disagrees. FN is when the relation holds in the ground truth but the inferred result disagrees.
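These definitions translate directly into set operations on edges. The Python sketch below treats the ground truth and the inferred result as sets of ordered pairs (dominating category, dominated category); the function name and the example edges are hypothetical.

```python
def precision_recall_f1(true_edges, inferred_edges):
    """Precision, recall and F1 for inferred dominance edges."""
    tp = len(true_edges & inferred_edges)   # edges both agree on
    fp = len(inferred_edges - true_edges)   # inferred but not in the truth
    fn = len(true_edges - inferred_edges)   # in the truth but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


truth = {("E", "A"), ("E", "B")}
inferred = {("E", "A"), ("B", "A")}
print(precision_recall_f1(truth, inferred))  # (0.5, 0.5, 0.5)
```

Returning 0 when a denominator is empty is a convention of this sketch for the degenerate case where nothing is inferred or nothing is true.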

In the task of inferring whether one category dominates another, we compared our approach (Mann-Whitney test [9] with the p-value adjustment method [20]) against 1) t-test with Pooled Standard Deviation [23], 2) t-test with p-value adjustment [20], 3) BCa bootstrap, and 4) percentile bootstrap. For both BCa bootstrap and percentile bootstrap, we decide whether the relation holds based on the lower bound of the confidence interval of the mean difference between the two categories. If the lower bound is positive, the relation is inferred to hold; otherwise, it is not.

IV-B Real-world data: Thailand’s population household information

This dataset was obtained from Thailand household-population surveys from Thai government in 2018 [24]. The purpose of this survey was to analyze the Multidimensional Poverty Index (MPI) [25, 26], which is considered as a current main poverty index that the United Nations (UN) uses. We deployed the data of household incomes and careers information from 353,910 households of Khon Kaen province, Thailand to perform our analysis. We categorized careers of heads of household into 14 types: student (student), freelance (Freelance), plant farmer (AG-Farmer), peasant (AG-Peasant), orchardist (AG-Orchardist), fishery (AG-Fishery), animal farmer (AG-AnimalFarmer), unemployment (Unemployment), merchant (Merchant), company employee (EM-ComEmployee), business owner (Business-Owner), government’s company employee (EM-ComOfficer), government officer (EM-Officer), and others (Others). The incomes in this dataset are annual incomes of households and the unit of incomes is in Thai Baht (THB).

Given a set of ordered pairs of career and household income, we analyzed the income gaps of different types of careers in order to study the inequality of population w.r.t. people careers.

IV-C Real-world data: NASDAQ Stock closing prices

This NASDAQ stock-market dataset was obtained by the work in [4] from Yahoo! Finance (http://finance.yahoo.com/). The dataset was collected from January 2000 to January 2016. It consists of a set of time series of stock closing prices of 1060 companies. Each company’s time series has a total length of 4169 time-steps. Due to the high variety of company sectors, in this study, we separated these time series into five sectors: ‘Service & Life Style’, ‘Materials’, ‘Computer’, ‘Finance’, and ‘Industry & Technology’.

In order to observe the dynamics of domination, we separated the time series into two intervals: 2000-2014 and 2015-2016. For each interval, we aggregated each time series into a single value using the median.
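The aggregation step can be sketched in Python as follows, assuming each time series is available as a list of (year, closing price) records. The record layout, the function name, and the sample prices are hypothetical; they only illustrate how one median per interval is obtained for a company.

```python
from statistics import median


def median_by_interval(series, intervals):
    """Median closing price of one company within each year interval.

    series: list of (year, price) records.
    intervals: dict mapping a label to an inclusive (first_year, last_year).
    """
    result = {}
    for label, (first, last) in intervals.items():
        prices = [price for year, price in series if first <= year <= last]
        if prices:
            result[label] = median(prices)
    return result


intervals = {"2000-2014": (2000, 2014), "2015-2016": (2015, 2016)}
prices = [(2000, 10.0), (2010, 12.0), (2014, 14.0), (2015, 30.0), (2016, 40.0)]
print(median_by_interval(prices, intervals))
```

Pairing each company's per-interval median with its sector yields the category-real ordered pairs that the framework takes as input.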

Given a set of ordered pairs of closing-price median and sector, the purpose of this study is to find which sectors dominated others in each interval.

IV-D Parameter settings

We set the significant level and the number of times of sampling with replacement for a bootstrap approach is for all experiments unless stated otherwise.

IV-E Running time

Fig. 5: A comparison of running time between two methods of Bootstrap confidence intervals.

In this experiment, we compared the running time of two methods of bootstrapping to infer confidence intervals: BCa bootstrap (BCa) and percentile (perc) approaches, using simulation datasets from the previous section. The computer used in this experiment is a Dell 730 with an Intel Xeon E5-2630 2.4GHz CPU and 128 GB of RAM. We set the number of bootstrap rounds to 4000. Figure 5 shows that the BCa method was much slower than the percentile approach. On a dataset of 10,000 data points, BCa bootstrap required around 300 seconds while the percentile approach required only 11 seconds. Besides, for a dataset of 500,000 data points, the percentile approach finished in around 11 minutes. This indicates that the percentile approach scales better than BCa bootstrap. Hence, for a large dataset, we recommend the percentile approach since it is fast and its performance is comparable to or even better than that of the BCa method, as we show in the next section.

V Results

V-A Simulation results

In this section, we report the results of our analysis from simulation datasets (Section IV-A). The main task is ordering inference: determining, for all pairs of categories, whether one category dominates the other.

Method                 Precision  Recall  F1 score
ttest (pool.sd)        0.61       0.52    0.55
ttest                  0.72       0.72    0.72
Bootstrap: BCa         0.70       0.67    0.68
Bootstrap: Perc        0.73       0.68    0.70
EDOIF (Mann-Whitney)   0.77       0.85    0.81

TABLE I: The category ordering inference result; each approach is used to infer the order of any pair of categories w.r.t. the real values within each category.

Table I shows the category ordering inference result. Each value in the table aggregates the results over the datasets from all noise levels. The table shows that our approach (using Mann-Whitney) outperforms all other approaches. While ttest (pool.sd) performed the worst, the traditional t-test performed slightly better than both bootstrap approaches. Between the two bootstraps, percentile bootstrap performed slightly better than BCa bootstrap. Even though BCa bootstrap handles skewed distributions better than percentile bootstrap [15, 16], our result indicates that percentile bootstrap is more accurate than BCa in the task of ordering inference when noise is present.

Fig. 6: The sensitivity analysis of categories ordering inference. The simulation datasets containing different levels of noise were deployed for the experiment (best viewed in colour codes).

Figure 6 shows the result of the sensitivity analysis of all approaches when uniform noise is present in different degrees. The horizontal axis represents noise ratios and the vertical axis represents the F1 score in the task of ordering inference. According to Figure 6, our approach (using Mann-Whitney) performed better than all other methods at all levels of noise. The t-test performed slightly better than both bootstrap approaches. The performance of the two bootstrap methods is quite similar. The t-test with pool.sd performed the worst. Both Table I and Figure 6 illustrate the robustness of our approach.

V-B Case study: Ordering career categories based on Thailand’s household incomes in Khon Kaen province

Fig. 7: Confidence intervals of household incomes of the population from Khon Kaen province categorized by careers.

In this section, we report the orders of careers based on the incomes of the population in Khon Kaen province, Thailand. Because this dataset has 353,910 data points and BCa bootstrap is computationally expensive, we used percentile bootstrap as the main method. Figure 7 illustrates the bootstrap-percentile confidence intervals of mean incomes of all careers in order.

A government officer (EM-Officer) class is ranked as the 1st place of career that has the highest mean income, while a student class has the lowest mean income.

Fig. 8: A dominant-distribution network of household incomes of the population from Khon Kaen province categorized by careers. A node size represents a magnitude of sample mean of incomes in a career node.

Figure 8 shows the orders of dominant-distribution relations of career classes in the form of a dominant-distribution network. It shows that the government officer (EM-Officer) class dominates all other career classes. In a dominant-distribution network, the network density represents the level of domination; a higher network density implies that many categories are dominated by others. The network density of this network is 0.79. Since the network density is high, a higher-rank career class seems to dominate a lower-rank career class with high probability. This implies that different careers provide different incomes. In other words, gaps between careers are large. Figure 9 provides the magnitudes of income-mean difference between pairs of careers in the form of confidence intervals. It shows that the majority of pairs of different careers have gaps in annual income of at least 25,000 THB (around 800 USD).

Fig. 9: Mean-difference confidence intervals of different careers based on household incomes of the population from Khon Kaen province categorized by careers.

Since one of the definitions of economic inequality is income inequality [27, 28, 29], there is a high degree of career-income inequality in this area. In societies with a more equal distribution of incomes, people are healthier [28], so this inequality might lead to other issues such as health issues. Moreover, income inequality is associated with people’s happiness [29]. This case study shows that using our dominant-distribution network and mean-difference confidence intervals is a novel way of studying career-income inequality.

V-C Case study: Ordering aggregate-closing prices of NASDAQ stock market based on sectors

This case study reveals the dynamics of sector domination in the NASDAQ stock market. We report the patterns of dominant sectors that change over time in the market.

Fig. 10: The sectors ordering result of NASDAQ stock closing prices from 1060 companies between 2000 and 2014. a) Confidence intervals of closing prices of sectors. b) Confidence intervals of difference means of closing prices among sectors. c) A dominant-distribution network of sectors.

Figure 10 shows the sector ordering result of NASDAQ stock closing prices from 1060 companies between 2000 and 2014. The dominant sector is the ‘Finance’ sector, which dominates all other sectors. Since the network density of the dominant-distribution network is high, at 0.8, there are large gaps between sectors in this time interval.

Fig. 11: The sectors ordering result of NASDAQ stock closing prices from 1060 companies between 2015 and 2016. We separated companies into five main sectors: ‘Service & Life Style’, ‘Materials’, ‘Computer’, ‘Finance’, and ‘Industry & Technology’. a) Confidence intervals of closing prices of sectors. b) Confidence intervals of difference means of closing prices among sectors. c) A dominant-distribution network of sectors.

On the other hand, in Figure 11, the sector ordering result of NASDAQ stocks between 2015 and 2016 demonstrates that no sector dominates all other sectors. The network density is 0.4, which implies that the level of domination is lower than in the previous interval. The Finance sector is ranked 4th in the order. This is not because the Finance sector had a lower closing price in recent years, but because all other sectors had higher closing prices lately. The computer sector has a higher closing price compared to the previous time interval, which is consistent with the current situation in which IT development (e.g. big data analytics, AI, blockchain) impacts many business areas significantly [30].

VI Conclusion

In this paper, we proposed a framework that infers orders of categories based on the expectations of their real-number values using the Estimation statistics framework. Our framework reports not only whether an order of categories exists, but also the magnitude of difference between each consecutive pair of categories in the order, using confidence intervals and a dominant-distribution network. On large datasets, our framework scales well using the percentile bootstrap approach compared with the existing framework, DABESTR, which uses BCa bootstrap. The proposed framework was applied to two real-world case studies: 1) ordering careers by 350,000 household incomes from the population of Khon Kaen province in Thailand, and 2) ordering sectors based on 1060 companies’ closing prices of the NASDAQ stock market between years 2000 and 2016. The results of career ordering showed income inequality among different careers in a dominant-distribution network. The stock market results illustrated that the sectors that dominate the market can change over time. The encouraging results show that our approach can be applied to any other research area that has category-real ordered pairs. Our proposed Dominant-Distribution Network provides a novel approach to gaining new insight into category orders. The software of this framework is available for researchers and practitioners as a user-friendly R package at [7].

Appendix A Problem formalization

In this section, we show that a dominant-distribution relation is a partial order and provide the problem formalization of the Dominant-distribution ordering inference problem. In the first step, we provide the concept of equivalent distributions.

Proposition A.1

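Let $\mathcal{D}_1, \mathcal{D}_2$ be distributions such that $\mathcal{D}_1 \preceq \mathcal{D}_2$ and $\mathcal{D}_2 \preceq \mathcal{D}_1$, then $\mathcal{D}_1, \mathcal{D}_2$ are equivalent distributions denoted $\mathcal{D}_1 \equiv \mathcal{D}_2$.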

  • When and , the first obvious case is . For the case that and , this cannot happen because of contradiction. Hence, and implies only .

We provide a relationship between expectations of distribution and a dominant-distribution relation below.

Proposition A.2

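Let $\mathcal{D}_1, \mathcal{D}_2$ be distributions, and $X_1 \sim \mathcal{D}_1$, $X_2 \sim \mathcal{D}_2$ s.t. $P(X_1 \geq E[X_1]) = P(X_2 \geq E[X_2])$. $\mathcal{D}_1 \preceq \mathcal{D}_2$ if and only if $E[X_1] \leq E[X_2]$.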

  • In the forward direction, suppose . Because the center of is on the right of in the real-number axis, hence, covers almost areas of distribution except the area of . In contrast, covers only a tiny area in the far right of . This implies that or .

    In the backward direction, we use the proof by contradiction. Suppose . Because implies and , then we have the following implications.

    Let assume that . This implies that . Since , we have

    (2)

    Assuming , we also have

    (3)

    By combining inequation 2 and inequation 3, we have

    (4)

    The inequation 4 contradicts with the requirement of , which is ! Therefore, .

In the next step, we show that a dominant-distribution relation has a transitivity property.

Proposition A.3

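Let $\mathcal{D}_1, \mathcal{D}_2, \mathcal{D}_3$ be distributions such that $\mathcal{D}_1 \preceq \mathcal{D}_2$, $\mathcal{D}_2 \preceq \mathcal{D}_3$, then $\mathcal{D}_1 \preceq \mathcal{D}_3$.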

  • According to Proposition A.2, implies .

    Now, we have . The distribution must be on the right hand side of . Hence, , which implies .

Now, we are ready to conclude that a dominant-distribution relation is a partial order.

Theorem A.4

Given a set of continuous distributions s.t. for any pair , given , . The dominant-distribution relation is a partial order over a set  [1].

  • A relation is a partial order over a set if it has the following properties: Antisymmetry, Transitivity, and Reflexivity.

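    • Antisymmetry: if $\mathcal{D}_1 \preceq \mathcal{D}_2$ and $\mathcal{D}_2 \preceq \mathcal{D}_1$, then $\mathcal{D}_1 \equiv \mathcal{D}_2$ by Proposition A.1.

    • Transitivity: if $\mathcal{D}_1 \preceq \mathcal{D}_2$, $\mathcal{D}_2 \preceq \mathcal{D}_3$, then $\mathcal{D}_1 \preceq \mathcal{D}_3$ by Proposition A.3.

    • Reflexivity: $\mathcal{D}_1 \preceq \mathcal{D}_1$.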

    Therefore, by definition, the dominant-distribution relation is a partial order over a set of continuous distributions.

Suppose we have and . We can have as a random variable that represents the magnitude of difference between two distributions. Suppose is the true mean of ’s distribution, our next goal is to find the confidence interval of .

Definition 2 (-mean-difference confidence interval)

Given two continuous random variables and where are distributions, , and . An interval is -mean-difference confidence interval if .

Now, we are ready to formalize the Dominant-distribution ordering inference problem.


Appendix B Statistical inference

B-A Bootstrap approach

Suppose we have and with the unknown . We can use the mean as the point estimate of since it is an unbiased estimator. We deploy estimation statistics [13, 14, 15], a framework that focuses on estimating the effect size, , of two distributions. Compared to the null hypothesis significance testing (NHST) approach, the estimation statistics framework reports not only whether two distributions are significantly different but also the magnitude of the difference in the form of a confidence interval.

The estimation statistics framework uses the bootstrap technique [31] to approximately infer the bootstrap confidence interval of . Assuming that the number of bootstrap replicates is large, according to the Central Limit Theorem (CLT), even though the underlying distribution is not normally distributed, summary statistics (e.g. means) of random samples approach a normal distribution. Hence, we can use the normal confidence interval to approximate the confidence interval of .

Theorem B.1 (Central Limit Theorem (CLT) [32])

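Let $X_1, \dots, X_n$ be i.i.d. random variables with $E[X_i] = \mu < \infty$ and $0 < \mathrm{Var}(X_i) = \sigma^2 < \infty$, and let $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$. Then, the random variable

$$Z_n = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}}$$

converges in distribution to a standard normal random variable as $n$ goes to infinity, that is

$$\lim_{n \to \infty} P(Z_n \le x) = \Phi(x) \quad \text{for all } x \in \mathbb{R},$$

where $\Phi(x)$ is the standard normal CDF.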
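The statement of the CLT can be illustrated numerically. The sketch below is only an illustration under stated assumptions: the choice of an Exp(1) population (a skewed, non-normal distribution with mean 1 and standard deviation 1), the sample size, the number of trials, and the function name `standardized_means` are all hypothetical and not taken from the paper.

```python
import random
from statistics import NormalDist, fmean

def standardized_means(n, trials, seed=0):
    """Draw `trials` sample means of n Exp(1) variables and standardize them."""
    rng = random.Random(seed)
    mu, sigma = 1.0, 1.0  # mean and standard deviation of Exp(1)
    zs = []
    for _ in range(trials):
        xbar = fmean(rng.expovariate(1.0) for _ in range(n))
        zs.append((xbar - mu) / (sigma / n ** 0.5))
    return zs

zs = standardized_means(n=200, trials=2000)
z = NormalDist().inv_cdf(0.975)  # about 1.96
# Fraction of standardized means inside [-z, z]; the CLT suggests about 0.95.
coverage = sum(abs(v) <= z for v in zs) / len(zs)
print(round(coverage, 3))
```

If the standardized means were far from normal, the printed fraction would deviate noticeably from 0.95.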

Lemma B.2

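Given $X_{1,1}, \dots, X_{1,k}$ are random variables i.i.d. from $D_1$, $X_{2,1}, \dots, X_{2,k}$ are random variables i.i.d. from $D_2$, and $Y_1, \dots, Y_k$ are random variables where $Y_i = X_{2,i} - X_{1,i}$.

Assuming that the number $k$ is large, the distribution of $Y_i$ is unknown with an unknown variance $\sigma_Y^2$. Suppose $\bar{Y}$ is the sample mean of $Y_1, \dots, Y_k$, $\bar{Y} = \frac{1}{k}\sum_{i=1}^{k} Y_i$, and $S$ is their sample standard deviation. Given that $\Phi$ is the standard normal CDF and $z_{\alpha/2} = \Phi^{-1}(1 - \alpha/2)$, then the interval

$$\left[\bar{Y} - z_{\alpha/2}\frac{S}{\sqrt{k}},\; \bar{Y} + z_{\alpha/2}\frac{S}{\sqrt{k}}\right]$$

(5)

is approximately a $(1-\alpha)100\%$ confidence interval for $\mu_Y$.

  • Since $k$ is large, the distribution of the sample mean of $Y_1, \dots, Y_k$ follows the Central Limit Theorem. This implies that the random variable

    $$Z = \frac{\bar{Y} - \mu_Y}{\sigma_Y / \sqrt{k}}$$

    has approximately a $N(0, 1)$ distribution. Hence, $\bar{Y}$ is approximately normally distributed as $N(\mu_Y, \sigma_Y^2 / k)$. The $(1-\alpha)100\%$ confidence interval for $\mu_Y$ is $\left[\bar{Y} - z_{\alpha/2}\frac{\sigma_Y}{\sqrt{k}},\; \bar{Y} + z_{\alpha/2}\frac{\sigma_Y}{\sqrt{k}}\right]$.

    Since $\bar{Y}$ is an unbiased estimator of $\mu_Y$ and $S^2$ is an unbiased estimator of $\sigma_Y^2$, we can approximate the confidence interval of $\mu_Y$ as follows.

    $$\left[\bar{Y} - z_{\alpha/2}\frac{S}{\sqrt{k}},\; \bar{Y} + z_{\alpha/2}\frac{S}{\sqrt{k}}\right]$$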

According to Lemma B.2, we need access to a large number of to infer the confidence interval. We can generate s.t. is large using the bootstrap technique. The following theorem allows us to approximate the mean of in the bootstrap approach.

Theorem B.3 (Bootstrap convergence [33, 34])

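Given $X_1, \dots, X_n$ are random variables i.i.d. from an unknown distribution $D$ with finite variance $\sigma^2$. We choose $X^*_1, \dots, X^*_m$ from the set $\{X_1, \dots, X_n\}$ by resampling with replacement. As $n$ and $m$ approach $\infty$:

  • Asymptotic mean: the conditional distribution of $\sqrt{m}(\bar{X}^*_m - \bar{X}_n)$ given $(X_1, \dots, X_n)$ converges weakly to $N(0, \sigma^2)$.

  • Asymptotic standard deviation: $s^*_m \to \sigma$ in conditional probability: that is for any positive $\epsilon$,

    $$P\left(|s^*_m - \sigma| > \epsilon \mid X_1, \dots, X_n\right) \to 0,$$

    where $\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i$, $\bar{X}^*_m = \frac{1}{m}\sum_{i=1}^{m} X^*_i$, and $s^*_m = \sqrt{\frac{1}{m}\sum_{i=1}^{m} (X^*_i - \bar{X}^*_m)^2}$.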

From Theorem B.3, when the number of times we resample with replacement from is large, we can approximate the using the bootstrap sample mean . The same applies to the standard deviation, which can be approximated by its bootstrap version. By using , we can approximate the confidence interval in Lemma B.2.
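The bootstrap procedure described in this subsection can be sketched as follows. This is a minimal illustration, not the EDOIF implementation: the function name `bootstrap_mean_diff_ci`, its parameters, and the synthetic data are hypothetical. It also assumes a common normal-approximation bootstrap interval in which the standard deviation of the bootstrap replicates serves as the standard error; this scaling may differ from the exact form of the interval in Lemma B.2.

```python
import random
from statistics import NormalDist, fmean, stdev

def bootstrap_mean_diff_ci(x1, x2, alpha=0.05, n_boot=2000, seed=0):
    """Normal-approximation bootstrap interval for mean(x2) - mean(x1)."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        # Resample each group with replacement, keeping the original sizes.
        m1 = fmean(rng.choices(x1, k=len(x1)))
        m2 = fmean(rng.choices(x2, k=len(x2)))
        diffs.append(m2 - m1)
    center = fmean(diffs)   # bootstrap estimate of the mean difference
    spread = stdev(diffs)   # bootstrap standard error (assumption, see above)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return center - z * spread, center + z * spread

# Synthetic example data (assumption): two groups whose true means differ by 1.
data_rng = random.Random(1)
x1 = [data_rng.gauss(0.0, 1.0) for _ in range(200)]
x2 = [data_rng.gauss(1.0, 1.0) for _ in range(200)]
low, high = bootstrap_mean_diff_ci(x1, x2)
print(round(low, 2), round(high, 2))
```

Fixing the seed makes the interval reproducible; in practice the number of bootstrap replicates trades run time against the stability of the bounds.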

B-B Dominant-distribution relation inference

According to Proposition A.2, implies . Suppose that and are also random variables. If or , then . However, in reality, might not equal one due to noise. Hence, we define the following notion of the Dominant-distribution relation.

Definition 3 (-Dominant-distribution relation)

Given two continuous random variables and where are distributions, and . Suppose , we say that dominates if , denoted .

Suppose we have two empirical distributions and . From Theorem B.3 and Lemma B.2, we can define and as random variables from sample-mean distributions of empirical distributions and . We can get and by bootstrapping data from and . Suppose , then we can approximate the confidence interval of with using the interval in Lemma B.2.

Next, we use the confidence interval of to infer whether . Given , according to Definition 3, if , then . We can approximate whether with the probability by the approximate confidence interval of : . If the lower bound is greater than zero, then is approximately .

From a hypothesis-testing perspective, determining whether is the same as testing whether the expectation of is less than the expectation of where the null hypothesis is and the alternative hypothesis is . We can verify these two hypotheses by inferring the confidence interval of . If the lower bound of is greater than zero with the probability , then we can reject the null hypothesis. Moreover, the confidence interval not only tests the null hypothesis but also tells us the magnitude of the mean difference between and . Hence, the confidence interval is more informative than the NHST approach.
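The decision rule based on the lower bound can be written as a small function. The sketch below is illustrative only: the function name `infer_relation` and the returned labels are hypothetical, the interval is assumed to be for the mean of the second distribution minus the mean of the first, and the symmetric case (upper bound below zero) is an added assumption that the text does not state explicitly.

```python
def infer_relation(ci_low, ci_high):
    """Classify a pair from a confidence interval of mean(D2) - mean(D1)."""
    if ci_low > ci_high:
        raise ValueError("lower bound must not exceed upper bound")
    if ci_low > 0:
        return "D2 dominates D1"   # whole interval above zero
    if ci_high < 0:
        return "D1 dominates D2"   # symmetric case (assumption)
    return "undetermined"          # interval contains zero

print(infer_relation(0.4, 1.1))    # D2 dominates D1
print(infer_relation(-0.3, 0.5))   # undetermined
print(infer_relation(-1.2, -0.2))  # D1 dominates D2
```

Unlike a bare reject/accept outcome, the interval passed to this rule also carries the magnitude of the difference.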

Given a set of distributions , in this paper, we choose to represent -Dominant-distribution relations using a network as follows.

Definition 4 (Dominant-distribution network)

Given a set of continuous distributions and . Let be a directed acyclic graph. The graph is a Dominant-distribution network s.t. a node represents and if .
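A network of this kind can be assembled from pairwise confidence intervals. The following is a minimal sketch under stated assumptions, not the EDOIF implementation: the function names, the dictionary layout, and the example intervals are hypothetical, and the edge direction (an edge from `a` to `b` when the interval for mean(b) minus mean(a) lies entirely above zero) is an assumption about how the relation is drawn.

```python
from graphlib import CycleError, TopologicalSorter

def build_network(categories, intervals):
    """Map each category to the set of categories that dominate it.

    intervals[(a, b)] is a (low, high) interval for mean(b) - mean(a).
    """
    successors = {c: set() for c in categories}
    for (a, b), (low, _high) in intervals.items():
        if low > 0:  # whole interval above zero: b dominates a
            successors[a].add(b)
    return successors

def dominance_order(successors):
    """Return one ordering consistent with the edges, or None if cyclic."""
    predecessors = {node: set() for node in successors}
    for a, targets in successors.items():
        for b in targets:
            predecessors[b].add(a)
    try:
        return list(TopologicalSorter(predecessors).static_order())
    except CycleError:
        return None

# Hypothetical intervals for three categories.
intervals = {
    ("A", "B"): (0.4, 1.1),
    ("B", "C"): (0.2, 0.9),
    ("A", "C"): (1.0, 1.8),
    ("B", "A"): (-1.1, -0.4),
}
net = build_network(["A", "B", "C"], intervals)
order = dominance_order(net)
print(order)  # ['A', 'B', 'C']
```

Returning `None` for a cyclic edge set makes a violation of the acyclicity requirement in the definition visible instead of silently producing an ordering.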

In Section III, we discuss the proposed framework, which can infer a Dominant-distribution network from a set of ordered pairs of real value and category.

Acknowledgment

The authors would like to thank the National Electronics and Computer Technology Center (NECTEC), Thailand, for providing the resources needed to complete this work.

References

  • [1] B. Dushnik and E. W. Miller, “Partially ordered sets,” American Journal of Mathematics, vol. 63, no. 3, pp. 600–610, 1941. [Online]. Available: http://www.jstor.org/stable/2371374
  • [2] J. Pearl, Causality.   Cambridge university press, 2009.
  • [3] J. Peters, D. Janzing, and B. Schölkopf, Elements of causal inference: foundations and learning algorithms.   MIT press, 2017.
  • [4] C. Amornbunchornvej, I. Brugere, A. Strandburg-Peshkin, D. R. Farine, M. C. Crofoot, and T. Y. Berger-Wolf, “Coordination event detection and initiator identification in time series data,” ACM Trans. Knowl. Discov. Data, vol. 12, no. 5, pp. 53:1–53:33, Jun. 2018. [Online]. Available: http://doi.acm.org/10.1145/3201406
  • [5] D. Kempe, J. Kleinberg, and É. Tardos, “Maximizing the spread of influence through a social network,” in Proceedings of the ninth ACM SIGKDD.   ACM, 2003, pp. 137–146.
  • [6] T. Y. Berger-Wolf and J. Saia, “A framework for analysis of dynamic social networks,” in Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining.   ACM, 2006, pp. 523–528.
  • [7] C. Amornbunchornvej, “Empirical distribution ordering inference framework (edoif) in r,” https://github.com/DarkEyes/EDOIF, 2019, accessed: 2019-10-24.
  • [8] Student, “The probable error of a mean,” Biometrika, vol. 6, no. 1, pp. 1–25, 1908. [Online]. Available: http://www.jstor.org/stable/2331554
  • [9] H. B. Mann and D. R. Whitney, “On a test of whether one of two random variables is stochastically larger than the other,” Ann. Math. Statist., vol. 18, no. 1, pp. 50–60, 03 1947. [Online]. Available: https://doi.org/10.1214/aoms/1177730491
  • [10] J. Cohen, “The earth is round (p < .05): Rejoinder.” American Psychologist, vol. 50, no. 12, p. 1103, 1995.
  • [11] P. D. Ellis, The essential guide to effect sizes: Statistical power, meta-analysis, and the interpretation of research results.   Cambridge University Press, 2010.
  • [12] L. G. Halsey, D. Curran-Everett, S. L. Vowler, and G. B. Drummond, “The fickle p value generates irreproducible results,” Nature methods, vol. 12, no. 3, p. 179, 2015.
  • [13] G. Cumming, Understanding the new statistics: Effect sizes, confidence intervals, and meta-analysis.   Routledge, 2013.
  • [14] A. Claridge-Chang and P. N. Assam, “Estimation statistics should replace significance testing,” Nature methods, vol. 13, no. 2, p. 108, 2016.
  • [15] J. Ho, T. Tumkaya, S. Aryal, H. Choi, and A. Claridge-Chang, “Moving beyond p values: data analysis with estimation graphics,” Nature Methods, vol. 16, no. 7, pp. 565–566, 7 2019. [Online]. Available: https://doi.org/10.1038/s41592-019-0470-3
  • [16] B. Efron, “Better bootstrap confidence intervals,” Journal of the American Statistical Association, vol. 82, no. 397, pp. 171–185, 1987. [Online]. Available: https://www.tandfonline.com/doi/abs/10.1080/01621459.1987.10478410
  • [17] R Development Core Team, “R: A language and environment for statistical computing,” 2011.
  • [18] A. C. Davison and D. V. Hinkley, Bootstrap methods and their application.   Cambridge university press, 1997, vol. 1.
  • [19] A. Canty and B. D. Ripley, boot: Bootstrap R (S-Plus) Functions, 2019, r package version 1.3-23.
  • [20] Y. Benjamini, D. Yekutieli et al., “The control of the false discovery rate in multiple testing under dependency,” The annals of statistics, vol. 29, no. 4, pp. 1165–1188, 2001.
  • [21] H. Wickham, ggplot2: elegant graphics for data analysis.   Springer, 2016.
  • [22] G. Csardi, T. Nepusz et al., “The igraph software package for complex network research,” InterJournal, Complex Systems, vol. 1695, no. 5, pp. 1–9, 2006.
  • [23] J. Cohen, “Statistical power analysis for the behavioral sciences. 2nd ed,” 1988.
  • [24] C. Amornbunchornvej, N. Surasvadi, A. Plangprasopchok, and S. Thajchayapong, “Identifying linear models in multi-resolution population data using minimum description length principle to predict household income,” arXiv preprint arXiv:1907.05234, 2019.
  • [25] S. Alkire and M. E. Santos, “Multidimensional poverty index 2010: research briefing,” Oxford Poverty & Human Development Initiative (OPHI), 2010.
  • [26] S. Alkire, U. Kanagaratnam, and N. Suppa, “The global multidimensional poverty index (mpi): 2018 revision,” OPHI MPI Methodological Notes, vol. 46, 2018.
  • [27] S. Kuznets, “Economic growth and income inequality,” The American economic review, vol. 45, no. 1, pp. 1–28, 1955.
  • [28] I. Kawachi and B. P. Kennedy, “Income inequality and health: pathways and mechanisms.” Health services research, vol. 34, no. 1 Pt 2, p. 215, 1999.
  • [29] S. Oishi, S. Kesebir, and E. Diener, “Income inequality and happiness,” Psychological science, vol. 22, no. 9, pp. 1095–1100, 2011.
  • [30] X. Du, L. Deng, and K. Qian, “Current market top business scopes trend—a concurrent text and time series active learning study of nasdaq and nyse stocks from 2012 to 2017,” Applied Sciences, vol. 8, no. 5, p. 751, 2018.
  • [31] B. Efron, Bootstrap Methods: Another Look at the Jackknife.   New York, NY: Springer New York, 1992, pp. 569–593. [Online]. Available: https://doi.org/10.1007/978-1-4612-4380-9_41
  • [32] H. Pishro-Nik, Introduction to probability, statistics, and random processes.   Kappa Research, 2014.
  • [33] K. Athreya et al., “Bootstrap of the mean in the infinite variance case,” The Annals of Statistics, vol. 15, no. 2, pp. 724–731, 1987.
  • [34] P. J. Bickel, D. A. Freedman et al., “Some asymptotic theory for the bootstrap,” The annals of statistics, vol. 9, no. 6, pp. 1196–1217, 1981.

Chainarong Amornbunchornvej received the bachelor of engineering degree (with honors) in computer engineering in 2011 and the master’s degree in telecommunications engineering in 2013, both from King Mongkut’s Institute of Technology Ladkrabang, Bangkok, Thailand. He received the Ph.D. degree in computer science from the University of Illinois at Chicago, IL, USA, in 2018. He is currently a researcher at the National Electronics and Computer Technology Center, Thailand. He focuses on data science and statistical inference in general, especially time series analysis, causal inference, social network analysis, theoretical computer science, and bioinformatics.

Navaporn Surasvadi is a researcher at the National Electronics and Computer Technology Center (NECTEC), Thailand. She received the BE in computer engineering (with first class honor) from Chulalongkorn University, Bangkok, Thailand and the MSc in Management Science and Engineering from Stanford University, CA, USA. She received her PhD in Operations Management from Leonard N. Stern School of Business, New York University, NY, USA in 2014. Her current research interests include data analytics and data visualization especially in strategic data for government policy planning, as well as operations management.

Anon Plangprasopchok, Ph.D. is a research scientist at the National Electronics and Computer Technology Center (Thailand). He obtained a PhD from the Computer Science Department at the University of Southern California in 2010. His research interests lie in the area of applied data mining and machine learning techniques. He has been involved in several key government projects, including revenue forecasting models (as a principal investigator) and a data platform for poverty alleviation (as a data scientist).

Suttipong Thajchayapong received the M.S. and B.S. degrees in electrical and computer engineering from Carnegie Mellon University, Pittsburgh, PA, USA, and the Ph.D. degree in electrical and electronic engineering from Imperial College London, London, U.K. He is a Researcher with National Electronics and Computer Technology Centre, National Science and Technology Development Agency, Pathumthani, Thailand. His research interests include intelligent transportation systems with emphasis on vehicular traffic monitoring and simulation, anomaly detection, and mobility and quality of service in wireless networks. Dr. Thajchayapong is a Member of ITS Thailand.
