Testing for Normality with Neural Networks

# Testing for Normality with Neural Networks

## Abstract

In this paper, we treat the problem of testing for normality as a binary classification problem and construct a feedforward neural network that can successfully detect normal distributions by inspecting small samples from them. The numerical experiments conducted on small samples with no more than elements indicated that the neural network which we trained was more accurate and far more powerful than the most frequently used and most powerful standard tests of normality: Shapiro-Wilk, Anderson-Darling, Lilliefors and Jarque-Berra, as well as the kernel tests of goodness-of-fit. The neural network had the AUROC score of almost , which corresponds to the perfect binary classifier. Additionally, the network’s accuracy was higher than on a set of larger samples with elements. Since the normality of data is an assumption of numerous techniques for analysis and inference, the neural network constructed in this study has a very high potential for use in everyday practice of statistics, data analysis and machine learning in both science and industry.

## 1 Introduction

An underlying assumption of many statistical procedures is that the data at hand come from a normal distribution (Razali & Wah 2011, Thode 2002). Hence, normality tests are among the most frequently used statistical tools.

The standard statistical tests of normality require two hypotheses to be formulated: the null () that the data at hand come from a normal distribution, and the alternative hypothesis () that the distribution from which the data come is not normal. The tests are usually conducted and interpreted as follows. First, researchers choose the value of , the probability to incorrectly reject a true . The most common values for are and . Then, the test statistic is calculated and its -value is determined. If the -value is lower than , the null hypothesis is rejected, and the data are not considered normal. Otherwise, is not rejected. The choice of limits the Type I error rate, i.e. the frequency at which the test rejects a true .

Each statistical test for normality is based on a statistic that summarizes the sample being analyzed and considers only a specific property of normal distributions (Razali & Wah 2011), which is why tests sometimes give conflicting results (Sigut et al. 2006). Moreover, numerous simulation studies have shown that the tests exhibit low power (probability to identify a non-normal distribution) when applied to samples with fewer than elements (Esteban et al. 2001, Noughabi & Arghami 2011, Hain 2010, Razali & Wah 2011, Yap & Sim 2011, Ahmad & Khan 2015, Marmolejo-Ramos & González-Burgos 2013, Mbah & Paothong 2015, binti Yusoff & Wah 2012, Patrício et al. 2017, Wijekularathna et al. 2019). In this study, we propose a new test of normality, based on artificial neural networks. Our research hypothesis is that neural networks can learn efficient classification rules from the datasets of normal and non-normal samples and that they can be used to classify a distribution in question as normal or non-normal based on a small-size sample drawn from it. It turned out that the neural networks outperformed the standard tests of normality. To our best knowledge, this is the second attempt at using neural networks to test for normality, the first one being the study of Sigut et al. (2006), whose work we build on. Another two contributions are: (1) introduction of descriptors, the data structures which allow machine-learning algorithms to be applied to variable-size and set data, and (2) creation of large and diverse datasets of samples drawn from various distributions.

The rest of the paper is organized as follows. We review the literature in Section 2. In Section 3, we formally define the problem of normality testing and briefly sketch a number of selected tests of normality. In Section 4, we describe how we constructed neural networks that can classify samples as coming from normal or non-normal distributions. The datasets used are presented in Section 5. The results of evaluation are given in Section 6 and discussed in Section 7. The conclusions are drawn in Section 8.

We use the common mathematical notation throughout the paper: denotes an unnamed probability measure, the expectation of a random variable, the uppercase Latin letters represent random variables, is the indicator function equal to when the condition is true, and otherwise, and so on. The cumulative distribution function (CDF) of the standard normal distribution will be denoted as . The abbreviation for the empirical distribution function will be EDF. For a sample , will represent the sample’s mean and the usual estimate of its standard deviation.

## 2 Literature Review

Machine learning algorithms have been applied to several problems which were traditionally the domain of statistics. In this Section, we review the approaches taken by machine-learning researchers in solving those problems.

In the the two-sample problem, the goal is to decide if two samples of data come from the same distribution. If they are treated as the objects of two classes, then a binary classifier can be trained on their union. Its accuracy can then be regarded as a statistic upon which a new statistical test can be based. In the case that the samples were drawn from the same distribution, we expect low accuracy, and vice versa. This approach was followed in several studies, e.g.: Lopez-Paz & Oquaba (2017), Ojala & Garriga (2010), Al-Rawi & Cunha (2012), Kim et al. (2016), Blanchard et al. (2010), Rosenblatt et al. (2019). Gretton et al. (2012) continue the research of Borgwardt et al. (2006), Gretton, Borgwardt, Rasch, Schölkopf & Smola (2007), Gretton, Borgwardt, Rasch, SchÃ¶lkopf & Smola (2007), Smola et al. (2007), Gretton et al. (2009) and define non-parametric tests using kernels and reproducing kernel Hilbert spaces (RKHS) (Scholköpf & Smola 2001, Steinwart & Crhistmann 2008). The kernel tests are based on the fact that for a convenient choice of the kernel, each distribution can be mapped into a unique point in the kernel’s RKHS. The expected distance of those points is zero if and only if those two distributions are the same. The kernels can estimate the distance using the samples from those distributions.

In the statistical independence problem, one is interested in finding out if two or more random variables are statistically independent. Gretton et al. (2008) formulate a statistical test of independence of two variables based on the Hilbert-Schmidt Independence Criterion (HSIC) proposed by Gretton et al. (2005), which is a measure of statistical independence. HSIC is also used by Chwialkowski & Gretton (2014) to check if two time series are independent. Similarly, Pfister et al. (2017) propose HSIC-based tests of independence of an arbitrary number of random variables.

Finally, the goodness-of-fit problem is the one in which we check if a sample of data has been drawn from one particular distribution of interest, which we call a model. Chwialkowski et al. (2016) define the model’s Stein operator (Stein 1972) over an RKHS. For any function in the RKHS, the expected value of is zero if and only if follows the model distribution. Chwialkowski et al. (2016) show that is a measure of discrepancy between and the model, call it the Stein discrepancy and calculate it using kernels. However, the asymptotic distribution of its normalized estimator in the case of a fit has no computable closed form (Chwialkowski et al. 2016), so Chwialkowski et al. (2016) use wild bootstrap (Shao 2010, Leucht & Neumann 2013) to address that issue and account for possible correlations in the data. The same idea is independently studied by Liu et al. (2016). Liu et al. (2016) define a different estimator of the Stein discrepancy. The estimator’s distribution in the case that the data are not generated by the model is asymptotically normal, but in the case that they indeed were drawn from the model, the limiting distribution of the estimator has no analytical form. Liu et al. (2016) estimate it by the bootstrap method proposed by Arcones & Gine (1992), Huskova & Janssen (1993). Jitkrittum et al. (2017) follow the approach of Chwialkowski et al. (2015) for kernelized two-sample tests and present a variant of the Stein discrepancy statistic that is both computable in linear time and comparable to the original statistic in term of power. We describe them in more detail in Section 3 alongside other techniques to test for normality with which we compared our network.

Lloyd & Ghahramani (2015), Kellner & Celisse (2019) show how the kernel-based tests can be used to check if a sample comes from a normal distribution. Lloyd & Ghahramani (2015) fit a Gaussian model to the histogram of Simon Newcombâs 66 measurements taken to determine the speed of light. Then, they simulate a thousand points from the fit model and an carry out a kernel two-sample test on the original and simulated data. Kellner & Celisse (2019) estimate the normal model’s mean and variance from the sample, which gives a specific normal distribution and its corresponding embedding in the chosen kernel’s RKHS. The squared distance between that point and the estimator of the sample’s mean embedding is easily computable via the kernel trick. Kellner & Celisse (2019) find the -quantile of the estimator’s distribution via the fast bootstrap method of Kojadinovic & Yan (2012) and thus determine the test’s rejection region. That is equivalent to rejecting the hypothesis that the data come from a normal distribution if the -value of the test’s statistic is lower than ().

Sigut et al. (2006) also deal with testing for normality, but adopt a completely different approach. They train a neural network on a dataset consisting of many normal and non-normal samples drawn from a large number of different distributions using the Johnson system of distributions (Johnson 1949a, b). Since samples can contain any number of elements, a preprocessing step to transform them to the vectors of the same space is needed. Sigut et al. (2006) map each sample to a vector of several statistics: skewness, kurtosis, the -statistic of the Shapiro-Wilk test of normality (Shapiro & Wilk 1965), the statistic of the Lin-Mudholkar test (Lin & Mudholkar 1980), and the statistic of the Vasicek entropy test of normality (Vasicek 1976). Their idea was that the neural networks would combine the insights all those different statistics give into a sample and outperform the tests based on individual statistics.

The study of Sigut et al. (2006) is the most relevant to our study. Their network is presented in more detail in Section 3.

## 3 Tests of Normality

The problem of testing for normality can be formulated as follows.

###### Definition 1.

Given a sample drawn from a distribution modeled by a random variable , choose between the following two hypotheses:

• the null () that follows a normal distribution ();

• the alternative hypothesis () that does not follow a normal distribution ().

The problem can also be formulated for the case where and are known and we want to test if follows a specific normal distribution . However, we will focus on the formulation given above because it represents a more realistic scenario. In all that follows, we assume that is one-dimensional.

There are two main types of techniques for testing for normality:

• graphical methods, which provide qualitative assessment of normality of the sample in question (Henderson 2006): histograms, box–and–whisker plots, quantile–quantile (Q–Q) plots, probability–probability (P–P) plots;

• numerical methods, which provide quantitative assessment of normality.

The methods in the latter group can be divided into four different categories based on the characteristic of the normal distribution that they focus on (Seier 2011):

1. tests based on skewness and kurtosis;

2. tests based on empirical distribution function of the sample at hand;

3. regression and correlation tests;

4. other tests for normality.

There are forty classical statistical tests for normality (Dufour et al. 1998). We compare our network with four of them: the Shapiro–Wilk (SW) (Shapiro & Wilk 1965), Liliefors (LF) (Lilliefors 1967), Anderson-Darling (AD) (Anderson & Darling 1952, 1954), and Jarque–Bera (JB) tests (Jarque & Bera 1980, 1987). We chose those tests because they are the most frequently examined in simulation studies and are the most powerful tests of normality. Of the new tests, we included the kernel tests of normality (Jitkrittum et al. 2017), since they also bridge machine learning and statistics as our approach, but in another way. Finally, we included the network of Sigut et al. (2006) in the study as it is comparable and complementary to ours.

Throughout this Section, will denote a sample drawn from the distribution whose normality we want to test.

### 3.1 The Shapiro-Wilk Test

Let denote the vector of the expected values of order statistics of independent and identically distributed random variables sampled from the standard normal distribution. Let denote the corresponding covariance matrix.

The intuition behind the SW test is as follows. If a random variable follows normal distribution , and , then (Shapiro & Wilk 1965). For the ordered random samples and , the best linear unbiased estimate of is (Shapiro & Wilk 1965):

 ^σ=bTV−1zbTV−1b (1)

In that case, should be equal to the usual estimate of variance :

 sd(x)2=1n−1n∑i=1(xi−¯¯¯x) (2)

The value of the test statistic, , is a scaled ratio of those two estimates:

 W=(∑ni=1aixi)2∑ni=1(xi−¯¯¯x)2 (3)

where:

 a=[a1,a2,…,an]=bTV−1(bTV−1V−1b)1/2 (4)

The range of is , with higher values indicating stronger evidence in support of normality. However, the original formulation required use of the tables of the critical values of (Henderson 2006) and was limited to smaller samples with elements because the values of and were known only for small samples at the time (Hain 2010). Royston (1982) extended the upper limit of to and presented a normalizing transformation algorithm suitable for computer implementation. The upper limit was further improved by Royston (1995) who gave an algorithm AS 194 which allowed the test to be used for the samples with .

### 3.2 The Lilliefors Test

This test was introduced independently by Lilliefors (1967) and van Soest (1967). The LF test checks if the given sample comes from a normal distribution whose parameters and are taken to be the sample mean () and standard deviation (). It is equivalent to the Kolmogorov-Smirnov test of goodness-of-fit (Kolmogorov 1933) for that particular choice of and (Wijekularathna et al. 2019). The LF test is conducted as follows. If the sample at hand comes from , then its transformation , where:

 zi=xi−¯¯¯xsd(x) (5)

should follow . The difference between the EDF of , , and the CDF of , , quantifies how well the sample fits the normal distribution. In the LF test, that difference is calculated as follows

 D=nmaxi=1|edfz(zi)−Φ(zi)| (6)

Higher values of indicate greater deviation from normality.

### 3.3 The Anderson-Darling Test

Whereas the Lilliefors test focuses on the largest difference between the sample’s EDF and the CDF of the hypothesized model distribution, the AD test calculates the expected weighted difference between those two (Anderson & Darling 1954), with the weighting function designed to make use of the specific properties of the model. It is a general test designed to check if a sample comes from the hypothesized distribution. The AD statistic is:

 A=n∫+∞−∞[F∗(x)−edf(x)]2ψ(F∗(x))dF∗(x) (7)

where is the weighting function, is the sample’s EDF, and is the CDF of the distribution we want to test. When testing for normality, the weighting function is chosen to be sensitive to the tails (Anderson & Darling 1954, Wijekularathna et al. 2019):

 ψ(t)=[t(1−t)]−1 (8)

Then, the statistic (7) can be calculated in the following way (Gross & Ligges 2015, Thode 2002):

 (9)

where () is the ordered permutation of that is obtained from as in the LF test. The -values are computed from the modified statistic (Gross & Ligges 2015, Stephens 1986):

 A(1+0.75/n+2.25/n2) (10)

As noted by Wijekularathna et al. (2019), the AD test is a generalization of the Cramer - von Mises (CVM) test (Cramér 1928, von Mises 1928). When , the AD test’s statistic reduces to that of the CVM test:

 C=n∫+∞−∞[F∗(x)−edf(x)]2dF∗(x) (11)

Because takes into account the specific properties of the model distribution, the AD test may be more sensitive than the CVM test (Wijekularathna et al. 2019). In both AD and CVM tests, larger values of the statistics indicate stronger arguments in favor of non-normality.

### 3.4 The Jarque-Bera Test

The Jarque–Bera test (Jarque & Bera 1980, 1987), checks how much the sample’s skewness () and kurtosis () match those of normal distributions. Namely, for each normal distribution it holds that and . The statistic of the test, as originally defined by Jarque & Bera (1980), is computed as follows:

 J=n⎛⎜ ⎜⎝(√β1)26+(β2−3)224⎞⎟ ⎟⎠=(√β1)26/n+(β2−3)224/n (12)

We see that higher values of indicate greater deviation from the skewness and kurtosis of normal distributions. The same idea was examined by Bowman & Shenton (1975). The asymptotic expected values of the estimators of skewness and kurtosis are and , while the asymptotic variances are and for the sample of size (Bowman & Shenton 1975). The statistic is then a sum of two asymptotically independent standardized normals. However, the estimator of kurtosis slowly converges to normality, which is why the original statistic was not useful for small and medium-sized samples (Urzua 1996). Urzua (1996) adjusted the statistic by using the exact expressions for the means and variances of the estimators of skewness and kurtosis:

 J=n⎛⎜ ⎜⎝(√β1)2c1+(β2−c2)2c3⎞⎟ ⎟⎠ (13)

where:

 c1 = 6(n−2)(n+1)(n+3) c2 = 3(n−1)n+1 c3 = 24n(n−2)(n−3)(n+1)2(n+3)(n+5)

which allowed the test to be applied to smaller samples.

### 3.5 The Kernel Test of Normality

As mentioned in Section 2, there are several kernel tests of goodness-of-fit. Just as our approach, they also represent a blend of machine learning and statistics. We evaluate the test of Jitkrittum et al. (2017) against our network in this study because that test is computationally less complex than the original kernel tests of goodness-of-fit proposed by Chwialkowski et al. (2016), but comparable with them in terms of statistical power. Of the other candidates, the approach of Kellner & Celisse (2019) is of quadratic complexity and requires bootstrap, and a drawback of the approach of Lloyd & Ghahramani (2015), as Jitkrittum et al. (2017) point out, is that it requires a model to be fit and new data to be simulated from it. Also, this approach fails to exploit our prior knowledge on the characteristics of the distribution for which goodness-of-fit is being determined. So, the test formulated by Jitkrittum et al. (2017) was our choice as the representative of the kernel tests of goodness-of-fit. It is a distribution-free test, so we first describe its general version before we show how we used it to test for normality. The test is defined for multidimensional distributions, but we present it for the case of one-dimensional distributions for simplicity and because we are interested in one-dimensional Gaussians.

Let be the RKHS of real-valued functions over with the reproducing kernel . Let be the density of model . As in Chwialkowski et al. (2016), a Stein operator (Stein 1972) can be defined over :

 (Tqf)(x)=∂logq(x)∂xf(x)+∂f(x)∂x (14)

Let us note that for

 ξq(x,⋅)=∂logq(x)∂xk(x,⋅)+∂k(x,⋅)∂x (15)

it holds that:

 (Tqf)(x)=⟨f,ξq(x,⋅)⟩F (16)

If , then (Chwialkowski et al. 2016). Let be the random variable which follows the distribution from which the sample was drawn. The Stein discrepancy between and is defined as follows (Chwialkowski et al. 2016):

 (17)

where is called the Stein witness function and belongs to . Chwialkowski et al. (2016) show that if is a cc-universal kernel (Carmeli et al. 2010), then if and only if , provided a couple of mathematical conditions are satisfied.

Jitkrittum et al. (2017) follow the approach of Chwialkowski et al. (2015) for kernel two-sample tests and present the statistic that is comparable to the original one of Chwialkowski et al. (2016) in terms of power, but faster to compute. The idea is to use a real analytic kernel that makes the witness function real analytic. In that case, the values of for a sample of points , drawn from , are almost surely zero w.r.t. the density of if . Jitkrittum et al. (2017) define the following statistic which they call the finite set Stein discrepancy:

 FSSD2=1mm∑j=1g(vi)2 (18)

If , almost surely. Jitkrittum et al. (2017) use the following estimate for :

 FSSD2=2n(n−1)n∑i=2∑j

where

 Δ(xi,xj)=1mm∑ℓ=1ξq(xi,vℓ)ξq(xj,vℓ) (20)

In our case, we want to test if is equal to any normal distribution. Similarly to the LF test, we can use the sample estimates of the mean and variance as the parameters of the normal model. Then, we can randomly draw numbers from and use them as points , calculating the estimate (19) for and . As far as the kernel and its parameters are concerned, we chose the Gaussian kernel as it fulfills the conditions laid out by Jitkrittum et al. (2017). To set its bandwidth, we relied on the median heuristic (Garreau et al. 2017), which sets it to the median of the absolute differences (). The exact number of locations, , was set to .

Since is always zero if the sample comes from the normal distribution, the larger the value of , the more likely it is that the sample came from a non-normal distribution. We will refer to this test as the FSSD test from now on.

### 3.6 The Statistic-Based Neural Network

Since the neural networks are designed with a fixed-size input in mind, and samples can have any number of elements, Sigut et al. (2006) represented he samples with the statistics of several normality tests. The rationale behind this method is that even though each statistical test focuses on a selected property of normal distributions, taken together they examine samples from different and complementary perspectives, so a neural network that combines their statistics could be more accurate. Sigut et al. (2006) used the estimates of the following statistics:

1. skewness:

 √β1=m3m3/22 (21)

where ,

2. kurtosis:

 √β2=m4m22 (22)
3. the statistic of the Shapiro-Wilk test (see Equation (3)),

4. the statistic of the test proposed by Lin & Mudholkar (1980):

 Zp=12ln1+r1−rr=∑ni=1(xi−¯¯¯x)(hi−¯¯¯h)√(∑ni=1(xi−¯¯¯x)2)(∑ni=1(hi−¯¯¯h)2)hi=⎛⎜ ⎜⎝∑nj≠ix2j−1n−1(∑nj≠ixj)2n⎞⎟ ⎟⎠13¯¯¯h=1nhi (23)
5. and the statistic of the Vasicek test (Vasicek 1976):

 Km,n=n2m×sd(x)(n∏i=1(x(i+m)−x(i−m)))1n (24)

where is a positive integer smaller than , is the non-decreasingly sorted sample , for , and for .

It is not clear which activation function Sigut et al. (2006) used. They trained three networks with a single hidden layer containing , and neurons, respectively. One of the networks was designed to take the sample size into account as well so that it can be flexible and applicable to samples with any number of elements. Just as the other two, the network showed that it was capable of modeling posterior Bayesian probabilities of the input samples being normal. Sigut et al. (2006) focused on the samples with no more than elements.

In addition to our network, presented in Section 4, we trained one that follows the philosophy of Sigut et al. (2006). We refer to that network as Statistic-Based Neural Network (SBNN) because it expects an array of statistics as its input. More precisely, prior to being fed to the network, each sample is transformed to the following array:

 [√β1,β2,W,Zp,K3,n,K5,n,n] (25)

just as in Sigut et al. (2006) ( is the sample size). ReLU was used as the activation function. We trained SBNN in the same way as the network we design in Section 4 to make their comparison fair.

## 4 Neural Networks for Identifying Normal Distributions

The problem of testing for normality, designed as a statistical hypothesis testing problem in Definition 1, can be straightforwardly cast as a binary classification problem.

###### Definition 2.

Given a sample drawn from a distribution modeled by a random variable , classify

• as following a normal distribution, by assigning it label ; or

• as following a non-normal distribution, by assigning it label .

In the classification problem, the set of all normal distributions is treated as class , and the set of all non-normal distributions as class . By abusing terminology, we can say that the samples drawn from class are normal and that those coming from a distribution from class are non-normal.

The core idea of our approach is to create a dataset consisting of the normal and non-normal samples of different sizes, drawn from various normal and non-normal distributions, and train a neural network on that dataset as if this was any other binary classification problem. Moreover, since we can construct as large a dataset as we want, we are not limited by the data availability as in other learning problems. We can use as much data as our computational capacities allow.

However, since samples can contain any number of elements, and training a network for each imaginable sample size is impossible, a preprocessing step is required to transform the given sample, no matter how large, to a vector with fixed number of dimensions. We call those vectors descriptors and describe them in detail in Section 4.1.

The networks we considered in this study are simple feedforward networks with two hidden layers at most. The activation function we used is ReLU. The final layer was designed to be the softmax layer. That means that for each sample , the output is a single value which may be interpreted as the probability that the sample belongs to class , i.e. that it comes from a normal distribution. In fact, minimizing the log-loss (as we did) ensures that the network’s output is an estimate of the Bayesian posterior probability that comes from a normal distribution (Hampshire & Pearlmutter 1991, Richard & Lippmann 1991). The label assigned to , , is determined using the following decision rule:

 ^y(x)={1,p1(x)≥0.50,p1(x)<0.5 (26)

Even though the design described above is not complex, it turned out that it was able to successfully learn efficient classification rules from the data.

### 4.1 Descriptors

Neural networks require that all the objects they classify have the same dimensions. On the other hand, samples of data, analyzed in everyday practice of statistics, can have any number of elements. Even though it is the most natural way to represent samples by their actual content, i.e. by arrays containing their elements, such an approach is not convenient for machine-learning algorithms as it requires training a separate classifier for each possible value of the sample size, , which is not a feasible strategy.

Sigut et al. (2006) approached the problem by representing samples with the statistics of several normality tests. We propose to extend their approach in the following way. First of all, we note that samples can be represented not just by the statistics of normality tests, but also by a number of descriptive statistics such as the mean, median, standard deviation, etc. Then, we follow the same idea that as the LF and kernel tests. If a random variable follows any normal distribution , then it can be transformed to follow the standard normal , as formally stated by the following theorem.

###### Theorem 1 (Grinstead & Snell (2003)).

If , then:

 Z=X−μσ∼N(0,1) (27)

If the sample under consideration comes from a normal distribution, then its standardization should produce a sample that more or less looks like as if it was drawn from . A structure with which we choose to represent samples should make use of Theorem 1. Also, it should contain not only a number of summarizing statistics, but some original elements from raw samples as well. Therefore, we propose to consider evenly spaced quantiles of standardized samples, as they should convey the information about a sample’s likeness to as well as expose the network to the selected elements from the raw sample. Quantiles combined with the mean, standard deviation, minimum, maximum, and possibly other statistics, should be able to yield good classification results and enable us to train a network capable of handling samples of different sizes. In addition to statistics, there is one more piece of information that we believe is important and should be contained in sample representations, and that is the sample size. We hypothesize that if network takes the sample size into consideration, then it will be sensitive to variation in differences between normal and non-normal samples for different sample sizes. We call such structures descriptors and formally introduce them in the following definition.

###### Definition 3.

Let be a sample with elements. The descriptor of is the array

 [hq(z),h2q(z),…,h1(z),n,¯¯¯x,\emph{sd}(x),min(x),max(x),\emph{med% }(x)] (28)

where for , for , and is the EDF of . The mean and standard deviation of are and , its minimal and maximal values are and , and represents its median.

Transforming the sample into descriptors allows us to train a single classifier once and then use it to test normality of a sample with any number of elements.

The value of , that determines which quantiles from a sample get included in its descriptor, should be set to the number that will not make the descriptors too large, so that training is not slow, but will also make them representative of the samples as well. We determined the value of via cross-validation.

We name the networks that use descriptors the descriptor-based neural networks (DBNN), which is how we will refer to them in the remainder of the paper.

## 5 Datasets

Since no public dataset of samples drawn from various distributions and suitable for training was available at the time of our research, we have created different datasets to construct and evaluate our network. Four of them are artificial, and two are based on the real-world data.

We name the artificial datasets - and describe them as follows.

Set consists of normal and as much non-normal samples. The samples are of the sizes . The set is balanced by the sample classes and sizes. The normal samples were drawn from randomly selected normal distributions. Each distribution was specified by setting its mean to a random number from and deviation to a random number from . The non-normal samples were simulated from the Pearson family of distributions. Pearson (1895, 1901, 1916) identified twelve different types of distributions. The density of each distribution in this family is obtained by solving the following differential equation (Karvanen et al. 2000):

 f′(x)f(x)=x−ab0+b1x+b2x2 (29)

where and are the distributions’s parameters. They depend on the central moments as follows

 b1 =a=−m3(m4+3m22)C (30) b0 =−m2(4m2m4−3m23)C (31) b2 =−2m2m4−3m23−6m32C (32)

where and is the -th central moment of the distribution. Given , , Ð¸ ( is skewness, is kurtosis), it is possible to get the parameters , , , and , and simulate the samples from the distribution they specify.

When generating non-normal samples, for each feasible combination of and we randomly chose and . We avoided the combination of and that identifies the normal distributions. More precisely, we did not consider non-normal distributions for which the following held: and . A lot of non-normal distributions was covered in this way.

• We used of for cross-validation. We will denote that subset as . Just as , is also balanced.

• The rest of was reserved for initial testing. We will refer to that subset as . It is balanced too.

Set was generated in the same way as . The only difference is that contains the samples of the sizes . The samples of those sizes were not used to train the network. The purpose of was to check the net’s robustness with respect to the sample size.

Set contains the samples from the four groups of non-normal distributions that were used by Esteban et al. (2001) and Noughabi & Arghami (2011) to empirically evaluate several standard tests of normality. Those groups are presented in Table 1. For each sample size and , we generated samples with elements from the distributions in group , iterating over them cyclicly and drawing a sample of elements until we drew such samples. We used this set to estimate the abilities of our network to detect non-normal distributions.

Set was created in the same was as and used to additionally evaluate and analyze the DBNN.

As sets - are artificially created, we evaluated our network on two real-world datasets to gain a better insight into its practical usefulness.

The first real-life set, , is based on a part1 of data on the heights of the Dobe !Kung people collected by Howell (2010, 2017). The part contains the heights (in centimeters) of people, including the information on sex and age. For each age span , we sampled the heights of men and women separately, thus creating samples representing different groups of people. To confirm that the height in each group is normally distributed, for each sample we drew random same-sized samples from and plotted their empirical CDFs. An example is presented in Figure 1 for women in the age group . We see that the empirical CDF of the real sample appears to be same as those of simulated normal samples, so we can safely confirm that the distribution of the height in the group is normal. The plots are similar for all the other groups, so we concluded the same about their distributions. The sample sizes vary from to . Figure 1: The empirical CDF of the heights of women in the age group [24,33) and simulated CDFs of simulated normal samples

The second real-world set, , was based on the Northern California Earthquake Catalog and Phase Data (Northern California Earthquake Data Center 2014a, b). The catalog contains the records on earthquakes that took place in California between July 1966 and December 2019 and had a magnitude of at least degrees. The distribution of the magnitudes in the catalog is presented in Figure 2. As we can see, it is not normal. For each , we created samples by randomly drawing magnitudes from the catalog. Figure 2: The empirical density of the earthquake magnitudes in the Northern California Earthquake Catalog and Phase Data

## 6 Evaluation

The cross-validation results are presented in Section 6.1. Using the Boruta algorithm (Kursa & Rudnicki 2010), we inspected the importance of the features included in the descriptors to make sure that they are well-defined. The results can be found in Section 6.2. We also checked if the constructed network is well-calibrated in the sense of Guo et al. (2017). The details are in Section 6.3. The results of evaluation on sets and are presented in Sections 6.4 and 6.5. The results of comparison with other methods for normality testing on sets and can be found in Section 6.6. Evaluation on real-world data is described in Section 6.7. Finally, the results of DBNN’s runtime analysis are reported in Section 6.8.

### 6.1 Cross-validation

For training, we used ADAM (Kingma & Ba 2014) with early stopping, a positive regularization coefficient (), and the maximal number of iterations set to . The objective function to minimize was the log-loss. We conducted five-fold cross-validation with of left out for validation after training. The hyperparameters considered were , that determines the structure of the descriptors, the network’s architecture (the number and sizes of the hidden layers), and , the regularization coefficient. We conducted grid-search to set the hyperparameters.

The results of cross-validation are presented in Table 2. The best accuracy, , was achieved by the network with two inner layers, the first with and the second with neurons, the parameter set to , and trained with the regularization coefficient set to . All the combinations of hyperparameters gave good results and were trained very quickly, with the whole cross-validation done in a couple of minutes.

After cross-validation, we trained the network with the best combination of hyperparameters using the whole set .

### 6.2 Feature Importance Inspection

We used the Boruta algorithm for feature selection, formulated by Kursa & Rudnicki (2010), to check if the dimensions of the descriptors are statistically correlated with membership in classes or .

With iterations, maximal depth of trees set to levels, the quantile parameter , significance threshold , and automatically constructed trees, Boruta found, using set , that all the features we included in the definition of the descriptors were statistically correlated with the labels.

### 6.3 Checking for Calibration

The binary classifiers that output probabilities that the classified object belongs to the classes under consideration are called probabilistic. DBNN is such a classifier. Its for input can be interpreted as the posterior probability that the true class of , denoted as , is , so we can obtain the probability for class by simple subtraction: . We are interested in how reliable those probabilities are. For example, the probability is reliable if of samples for which DBNN outputs do come from a normal distribution. Classifiers that output only reliable probabilities are called perfectly calibrated, and we formally introduce them in Definition 4.

###### Definition 4 (Perfect calibration (Guo et al. 2017)).

A binary classifier is called perfectly calibrated if its posterior probability satisfies the following condition:

 IP(Y=1∣p1(X)=t)=t,∀t∈[0,1] (33)

where is a random variable that models objects to be classified, and is a random variable that models the true class of .

The reliability diagrams are often used to check if a classifier is calibrated (Murphy & Winkler 1977, 1987, BrÃ¶cker 2008). The diagrams visualize the relationship between expected and empirical probabilities and are created as follows. First, the objects (here, samples) are sorted by the output probability of belonging to class . Then, they are discretized in a number of same-sized bins. For each bin, the following values are calculated:

• the empirical proportion of the objects that belong to class ; and

• the expected proportion of the objects that belong to class , computed with respect to the posterior probabilities. It is calculated as the mean posterior probability of the objects in the bin.

The reliability diagrams plot the empirical against expected proportions. If the classifier is perfectly calibrated, one can expect a line. (Fenlon et al. 2018).

We constructed the reliability diagram of DBNN using set . It is shown in Figure LABEL:sub@subfig:reliability_diagram. We see that it resembles the diagram of a perfectly calibrated classifier, but is not quite the same. In order to account for variability, we randomly selected subsets of , each containing samples, and plotted their diagrams and the mean reliability graph in Figure LABEL:sub@subfig:reliability_diagrams_and_variability. We see that the reliability graphs constructed using the subsets capture the graph of the perfectly calibrated classifier. Therefore, we conclude that DBNN is a reliable classifier, but note that it is not perfectly calibrated.

### 6.4 Results on Set Atest

We calculated the following performance metrics:

1. Accuracy ().

2. True positive rate (), also known as recall. Its complement to is known as the Type I error rate in statistics.

3. Positive predictive value (), also known as precision: the proportion of normal samples among those classified as normal.

4. True negative rate (). Known as power in statistical literature. Its complement to is called Type II error rate in statistics.

5. Negative predictive value (). Same as , but for class .

6. The measure. The harmonic mean of and .

7. The area under the ROC curve ().

The values of those metrics on are presented in Table 3 and depicted in Figure 4. We see that DBNN achieved great accuracy and has the exceptional AUROC of almost . All the scores consistently rise as the sample size increases. Figure 4: Performance of DBNN on set Atest: TPR, PPV, TNR and NPV, per sample size n.

### 6.5 Results on Set B

We calculated the same metrics on set as for set in order to see how DBNN handles the samples of sizes not seen during training. The results are presented in Table 4 and Figure 5. The metrics on both sets are visualized in Figure 6.

The performance of DBNN on set is comparable to its performance on set .