On the Shattering Coefficient of Supervised Learning Algorithms


Rodrigo Fernandes de Mello University of São Paulo, Institute of Mathematics and Computer Sciences, Department of Computer Science, Av. Trabalhador Saocarlense, 400, São Carlos, Brazil. e-mail: mello@icmc.usp.br
Abstract

The Statistical Learning Theory (SLT) provides the theoretical background to ensure that a supervised algorithm generalizes the mapping $f: X \to Y$, given that $f$ is selected from its search space bias $\mathcal{F}$. This formal result depends on the Shattering coefficient function $\mathcal{N}(\mathcal{F}, n)$ to upper bound the empirical risk minimization principle, from which one can estimate the necessary training sample size to ensure the probabilistic learning convergence and, most importantly, the characterization of the capacity of $\mathcal{F}$, including its underfitting and overfitting abilities while addressing specific target problems. In this context, we propose a new approach to estimate the maximal number of hyperplanes required to shatter a given sample, i.e., to separate every pair of points from one another, based on the recent contributions by Har-Peled and Jones in the dataset partitioning scenario, and use such a foundation to analytically compute the Shattering coefficient function for both binary and multi-class problems. As main contributions, one can use our approach to study the complexity of the search space bias $\mathcal{F}$, estimate training sample sizes, and parametrize the number of hyperplanes a learning algorithm needs to address some supervised task, which is especially appealing to deep neural networks. Experiments were performed to illustrate the advantages of our approach while studying the search space $\mathcal{F}$ on synthetic datasets, on one toy dataset, and on two widely used deep learning benchmarks (MNIST and CIFAR-10). In order to permit reproducibility and the use of our approach, our source code is made available at https://bitbucket.org/rodrigo_mello/shattering-rcode.

1 Introduction

The Statistical Learning Theory (SLT) is amongst the most important results for the area of supervised machine learning [18]. SLT formalizes the Empirical Risk Minimization Principle (ERMP), which ensures the probabilistic convergence of the empirical risk $R_{\text{emp}}(f)$ to its expected value $R(f)$, simply referred to as risk, given a classification function $f$. By having access to the risk of functions $f \in \mathcal{F}$, one can select the best mapping from $\mathcal{F}$, by assuming that $R(f)$ is computed on the Joint Probability Distribution (JPD) $P(X \times Y)$, in which $X$ and $Y$ are the input and output spaces, respectively.

Unfortunately, there is no full access to the JPD of real-world problems [19], given the inherent continuous and unbounded nature of input attributes, making it impossible to compute the risk $R(f)$ for any $f \in \mathcal{F}$. Thus, by assuming the i.i.d. (independently and identically distributed) sampling from such a JPD, Vapnik [18] took advantage of the Law of Large Numbers (LLN) and the Chernoff bound [4] to obtain the following result:

$$P\left(\sup_{f \in \mathcal{F}} \left|R_{\text{emp}}(f) - R(f)\right| > \epsilon\right) \leq 2\,\mathcal{N}(\mathcal{F}, 2n)\,\exp\left(-\frac{n\epsilon^2}{4}\right) \qquad (1)$$

in which $\mathcal{F}$ defines the subspace of admissible functions of a given supervised learning algorithm, a.k.a. the search space bias from which such an algorithm selects classification functions; $\epsilon$ refers to an acceptable divergence between both risks; $n$ is the sample size; and $\mathcal{N}(\mathcal{F}, n)$ corresponds to the Shattering coefficient, or growth function, mapping the number of distinct classifications an algorithm provides as the sample size increases.

As later proved by Sauer [14] and Shelah [16], the Shattering coefficient function is upper bounded as follows:

$$\mathcal{N}(\mathcal{F}, n) \leq \sum_{i=0}^{d_{VC}} \binom{n}{i} \qquad (2)$$

in which $d_{VC}$ is the Vapnik-Chervonenkis dimension for a target scenario, corresponding to the size of the greatest sample that can be divided/separated/classified in all possible ways provided a binary space segmentation. For instance, considering one point in $\mathbb{R}$ and a single hyperplane, we can classify it as either positive or negative by simply placing the hyperplane on either side of the point. By extending this simple scenario, the greatest sample that can be classified in all possible ways in $\mathbb{R}^d$ using a single hyperplane contains $d+1$ points, so that $d_{VC} = d+1$ [19, 18].
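For concreteness, the upper bound in Inequation 2 can be evaluated numerically. The R sketch below (in the same language as our released code) assumes the binomial-sum form of the Sauer-Shelah lemma written above and the VC dimension $d_{VC} = d + 1$ of linear hyperplanes in $\mathbb{R}^d$; it is an illustration only.

# Sauer-Shelah upper bound on the Shattering coefficient (Inequation 2):
# sum_{i=0}^{d_VC} choose(n, i), here with d_VC = d + 1 for hyperplanes in R^d.
sauer_bound <- function(n, d_vc) {
  sum(choose(n, 0:d_vc))
}

# Example: linear classifiers in R^2 (d_VC = 3) over growing sample sizes.
sapply(c(10, 100, 1000), sauer_bound, d_vc = 3)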

In spite of such a theoretical contribution, there is no practical approach to compute tighter bounds for the Shattering coefficient function when addressing real-world supervised tasks; therefore, it becomes very complex to measure the space of admissible functions that an algorithm assumes while converging to its best possible classification function $f \in \mathcal{F}$. After studying a recent theoretical result by Har-Peled and Jones [7] in the context of dataset partitioning, we concluded that it could be useful to support tighter estimations of Shattering coefficient functions for both binary and multi-class problems, as proposed in this paper. Furthermore, based on the same result, we can estimate the maximal number of hyperplanes required to shatter a given sample, i.e., to separate every pair of points from one another, thus leading to another relevant bound that helps parametrize learning algorithms, which is especially motivating in the context of deep neural networks [5]. As part of our contributions, our source code is made available at https://bitbucket.org/rodrigo_mello/shattering-rcode, including all experimental scenarios discussed throughout this paper.

This paper is organized as follows: Section 2 briefly introduces the background information and some related work; Section 3 details our approach to compute the Shattering coefficient function; Section 4 discusses its implementation; Section 5 presents and discusses practical results on datasets; and Section 6 draws concluding remarks and future directions.

2 Background and Related Work

Vapnik [18] formulated the Statistical Learning Theory (SLT) to provide upper bounds for the probabilistic convergence of the empirical risk to its expected value, given a set of classification functions in the form $f: X \to Y$, thus mapping input instances from $X$ to output labels in $Y$. Inequation 1 was deduced in such a context and was later used to formalize the Generalization Bound:

$$R(f) \leq R_{\text{emp}}(f) + \sqrt{\frac{4}{n}\left(\log\left(2\,\mathcal{N}(\mathcal{F}, 2n)\right) - \log\delta\right)} \qquad (3)$$

after assuming:

$$\delta = 2\,\mathcal{N}(\mathcal{F}, 2n)\,\exp\left(-\frac{n\epsilon^2}{4}\right) \qquad (4)$$

in which $\delta$ refers to the probability bound for Inequation 1. Therefore, by having the Shattering coefficient function $\mathcal{N}(\mathcal{F}, n)$, one can formulate how the risk is bounded by the empirical risk plus some variance term. As a clear consequence, by computing this function for real-world tasks, one can understand this estimation process as well as define the minimal sample size to guarantee learning convergence, as discussed in [3].

For the sake of illustration, consider the estimation of the necessary training sample size by assuming some Shattering coefficient function $\mathcal{N}(\mathcal{F}, n)$. From Inequations 1 and 4, we have:

$$\delta = 2\,\mathcal{N}(\mathcal{F}, 2n)\,\exp\left(-\frac{n\epsilon^2}{4}\right) \qquad (5)$$
$$\log\delta = \log\left(2\,\mathcal{N}(\mathcal{F}, 2n)\right) - \frac{n\epsilon^2}{4} \qquad (6)$$
$$\frac{n\epsilon^2}{4} = \log\left(2\,\mathcal{N}(\mathcal{F}, 2n)\right) - \log\delta \qquad (7)$$
$$n = \frac{4}{\epsilon^2}\left(\log\left(2\,\mathcal{N}(\mathcal{F}, 2n)\right) - \log\delta\right) \qquad (8)$$

given an acceptable value for $\delta$ to ensure the probabilistic convergence. In order to do so, suppose some value for $\delta$ and let the maximum acceptable divergence between risks be $\epsilon$, thus:

(9)

so that, solving for $n$, we obtain the necessary training sample size to guarantee such a probabilistic convergence, as follows:

(10)

so that the empirical risk will diverge more than $\epsilon$ units from the risk with a probability less than or equal to $\delta$; therefore, in a fraction $1-\delta$ of the scenarios, $R_{\text{emp}}(f)$ will be a good enough estimator for its expected value according to the divergence factor $\epsilon$. As a main consequence, there is a significant guarantee of selecting an adequate learning model to address this task, once one knows how it will operate on unseen data examples [3].
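The same reasoning can be scripted. The R sketch below is a minimal illustration (not part of our released code) that searches for the smallest $n$ satisfying the bound in Inequation 8, assuming illustrative values $\epsilon = 0.05$ and $\delta = 0.05$ and a user-supplied Shattering coefficient function.

# Smallest training sample size n satisfying
#   n >= (4 / epsilon^2) * (log(2 * N(2n)) - log(delta)),
# following the bound derived in Inequations 5-8. The default values of
# epsilon and delta below are illustrative only.
min_sample_size <- function(shatter, epsilon = 0.05, delta = 0.05, n_max = 1e7) {
  for (n in 1:n_max) {
    if (n >= (4 / epsilon^2) * (log(2 * shatter(2 * n)) - log(delta)))
      return(n)
  }
  NA  # no sample size found within the search range
}

# Example with a constant Shattering coefficient.
min_sample_size(function(n) 4)

For a constant Shattering coefficient the search terminates after a few thousand candidates; for polynomial growth functions, a larger search range may be required.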

Still considering Inequation 3 under the same assumptions:

(11)

one can obtain the divergence between $R(f)$ and $R_{\text{emp}}(f)$ as the sample size increases. We also invite the reader to observe that the rightmost term approaches zero as $n \to \infty$, such that the empirical risk becomes an ideal estimator of $R(f)$.
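To visualize this behavior, the short sketch below (illustrative only) evaluates the square-root term of the Generalization Bound (Inequation 3) for a constant Shattering coefficient and increasing sample sizes, confirming that it vanishes as $n$ grows; the value of $\delta$ is illustrative.

# Divergence term of the Generalization Bound (Inequation 3) for a constant
# Shattering coefficient; it approaches zero as the sample size n increases.
divergence <- function(n, shatter_n, delta = 0.05) {
  sqrt((4 / n) * (log(2 * shatter_n) - log(delta)))
}
sapply(10^(2:6), divergence, shatter_n = 4)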

From a different perspective, Sauer [14] proved that if the density of a family $\mathcal{F}$ of subsets of a set $S$, with $|S| = n$, is less than some integer $k$, then:

(12)

there exists a family $\mathcal{F}'$ of subsets of $S$ with the same cardinality, such that the density of $\mathcal{F}'$ is bounded accordingly, allowing one to find a theoretical upper bound for the Shattering coefficient function as defined in Inequation 2. This density was then proved to be represented by the Vapnik-Chervonenkis (VC) dimension.

This formulation of the Shattering coefficient provides an important upper bound; however: (i) it may be too far from the growth function itself; (ii) it depends on the VC dimension, which may be difficult to estimate [19]; (iii) it does not take into account the number of hyperplanes adopted in practical application scenarios; and (iv) it does not consider the number of class labels of the target supervised task.

From a different perspective, Mello et al. [2] employed supervised datasets to estimate the Shattering coefficient function for Decision Trees (DT), a specific class of algorithms. They derived recurrences from a practical perspective and solved them to find closed forms that allow comparing the complexity of DT models in an attempt to select the most generalized option. However, that study does not generalize the Shattering coefficient function as proposed in this paper.

3 Computing the Shattering Coefficient

Har-Peled and Jones [7] have recently published a fundamental theoretical result that supports our approach to compute the Shattering coefficient function. Given a set $P$ of $n$ points in general position (no three of them lie on a common line) lying inside the $d$-dimensional unit cube, they proved a theorem to compute the minimum number of hyperplanes necessary to separate all pairs of points from each other (a property referred to as separability, denoted as $\text{sep}(P)$). They provided a lower bound on the minimum number of hyperplanes separating $P$ and an upper bound, in expectation, on the number of hyperplanes needed when the points are sampled uniformly at random.

Let some supervised dataset be composed of pairs $(x_i, y_i)$, corresponding to an input example and its class label, respectively. Now suppose the input space presents different degrees of class overlapping among examples, which obviously jeopardizes the classification process. Using the theorem by Har-Peled and Jones [7], we first define some cube in $\mathbb{R}^d$ to enclose all examples and then rescale their relative distances in order to ensure every space axis lies in the range $[0,1]$, so that the unit cube is obtained.
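This rescaling step is straightforward; the minimal R sketch below (illustrative, not the exact code released at the repository above) maps every attribute of an input matrix into the range $[0,1]$ so that all examples lie inside the unit cube.

# Rescale every attribute (column) of X into [0, 1], so that all examples
# lie inside the d-dimensional unit cube required by Har-Peled and Jones.
to_unit_cube <- function(X) {
  apply(X, 2, function(col) {
    rng <- range(col)
    if (diff(rng) == 0) rep(0, length(col)) else (col - rng[1]) / diff(rng)
  })
}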

As the next step, we employ the expectation result by Har-Peled and Jones [7] to estimate the number of hyperplanes needed to separate the sample, such that we have the minimum number of linear functions providing the pairwise separability of points. According to the SLT, this corresponds to the overfitting scenario in which every example is separated from every other, thus tending to the memory-based classifier, as discussed in [18, 19, 3]. Although this represents the overfitting scenario, it also works as an upper bound for the required number of hyperplanes to shatter such an input space.

Based on the result by Har-Peled and Jones [7], we observed that the number of hyperplanes found in this manner did not correspond to the expected number when addressing practical problems. For instance, consider Figure 1(a), which illustrates a two-class problem following two 2D Gaussian distributions with distinct averages and the same variance along both dimensions. By circumscribing such an input space, we obtained a constant number of hyperplanes, given $d = 2$ (the input space is bidimensional) and the fact that the reduced space contains only two instances independently of any increase in the number of examples belonging to the original sample. This is only possible because such an input space presents a clear linear separation of instances under different class labels.

Figure 1: Analyzing input instances generated using two 2D Gaussian distributions with distinct averages and the same variance along both dimensions. From left to right: (i) the input space is illustrated using two different symbols (one per class); and (ii) the reduced space after exploring homogeneous class neighborhoods.

From this simple conclusion, we notice the need to reduce the input space to the most relevant examples, such as the support vectors in the case of SVM [1], in an attempt to obtain a space such as the one illustrated in Figure 1(b), which can indeed be separated by a single hyperplane, even as the sample size grows. From this, we proposed the algorithm discussed in the next section to estimate the number of homogeneous-class examples in space neighborhoods before computing the minimum number of hyperplanes to separate all pairs of points from one another.

Having such an input space, we employed the theoretical result by Har-Peled and Jones [7] to find the number of hyperplanes as the sample size increases. This partial result was then used to compute the Shattering coefficient given the minimal number of hyperplanes required to separate disconnected points under the same class label. Figure 2(a) illustrates a more complex scenario in which two Gaussians present some class overlapping, while Figure 2(b) shows its reduction based on our algorithm. Figure 2(c) finally illustrates the number of hyperplanes as the sample size increases, which is basically affected by the amount of point overlapping.

Figure 2: Analyzing a second scenario in which input instances present some class overlapping while being generated using two 2D Gaussian distributions with distinct averages and the same variance along both dimensions. From left to right: (i) the input space is illustrated using two different symbols (one per class); (ii) the original input was reduced to non-homogeneous class instances around space neighborhoods; and (iii) the expected number of hyperplanes for the reduced space, according to Har-Peled and Jones [7].
Lemma 1.

Let the number of half-spaces be defined by $2h(n)$, given the $h(n)$ hyperplanes necessary to separate some data sample. The Shattering coefficient function for $\mathcal{F}$ is defined as follows:

(13)

thus corresponding to the combinatorics to count all classification possibilities for $c$ classes.

Proof.

Let some input space containing points in general position be classified using two hyperplanes, given some number of classes $c$. The worst-case scenario is given by all points organized in a circular shape, such that every single point is separable from every other, so that:

(14)

which is the result of applying the first hyperplane. Then, when we apply the second hyperplane without repeating the previous classification, we will have the remaining points to be somehow divided, producing:

(15)

in the worst-case scenario, thus producing:

(16)

however the first hyperplane can still separate more than a single point from every other point, resulting in:

(17)

thus we conclude that, by having a general input space in $\mathbb{R}^d$ and any number of points and classes, we will have the corresponding binomial terms and their sums. ∎

4 Implementation

We implemented our approach in the R Statistical Software [13], starting with the reduction of data points under the same class label around neighborhoods of the input space, based on hierarchical clustering algorithms [8]. Our approach assumes points in general position that belong to different classes; it then runs the selected hierarchical approach (either single, complete, or average linkage) to produce a dendrogram. Such a structure is next analyzed from the leaves to the root, and every set of connected points is assessed to confirm whether they belong to the same class label. When they do, we reduce them to a single point in space. For instance, in the case of Figure 3, all points under the same class label are reduced to a single element, so we end up with only two points in space, which are then studied using the results by Har-Peled and Jones [7] to compute the necessary number of hyperplanes to separate them.

Our code is based on the main function detailed in Algorithm 1, which is responsible for the point-cloud reduction, maintaining all points that cannot be connected to their Euclidean neighbors because they belong to different classes, in such a way that class overlapping results in more remaining elements. This is the situation illustrated in Figure 2, in which the mixing region contains more elements even after such a processing stage.

1: procedure ReduceSample(X, Y, method) ▷ X: input instances; Y: class labels; method: hierarchical clustering linkage (average, complete or single). Output: number of instances after reducing space regions under homogeneous class labels
2:     euclidean = dist(X) ▷ Computes the Euclidean distance among all pairs of instances in X
3:     hmodel = hclust(euclidean, method) ▷ Produces the hierarchical clustering model (dendrogram)
4:     classes = Y ▷ Class-label bookkeeping with the same cardinality as the input space, updated along the order in which pairs are connected by the hierarchical clustering algorithm
5:     for row in hmodel.merge do ▷ Negative entries in row index instances in X; positive entries refer to previously merged dendrogram levels
6:         c1 = find.class(row[1])
7:         c2 = find.class(row[2])
8:         if c1 == c2 then ▷ Both sides are under the same class label, so there is homogeneity to be explored
9:             operate(row) ▷ Reduce the number of instances according to such homogeneity
Algorithm 1: This pseudocode illustrates how the homogeneity of class labels among nearby instances is explored in an attempt to reduce the sample size before computing the number of hyperplanes capable of separating every pair of points in the sample.
Figure 3: All points under the same class label are reduced to a single instance up to the dendrogram level that ensures no class mixing. In this illustrative situation, two instances are obtained to proceed with the assessment of hyperplanes.

The remaining functions available with our R code are not detailed in this section because they simply compute the expected number of hyperplanes, as proved by Har-Peled and Jones [7], on top of the reduced point cloud. Then, the code employs our formulation of Lemma 1 to compute the Shattering coefficient function. Our source code is fully available at https://bitbucket.org/rodrigo_mello/shattering-rcode.
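For readers who prefer a self-contained illustration, the sketch below reimplements the reduction step in a simplified form; it is not the exact code released at the repository above. It walks the dendrogram merges from the leaves to the root and counts how many merges join class-homogeneous groups, so that the reduced sample size is the original size minus the number of such merges.

# Simplified reduction of class-homogeneous neighborhoods (illustrative sketch).
# Returns the number of instances remaining after collapsing every
# class-homogeneous dendrogram subtree into a single representative.
reduce_sample <- function(X, y, method = "single") {
  hc <- hclust(dist(X), method = method)   # dendrogram over Euclidean distances
  n <- nrow(X)
  leaf_class <- as.character(y)
  node_class <- vector("list", n - 1)      # class labels reachable from each merge
  homogeneous <- 0                         # merges joining a single class label
  classes_of <- function(j) if (j < 0) leaf_class[-j] else node_class[[j]]
  for (i in seq_len(n - 1)) {
    cls <- unique(c(classes_of(hc$merge[i, 1]), classes_of(hc$merge[i, 2])))
    node_class[[i]] <- cls
    if (length(cls) == 1) homogeneous <- homogeneous + 1
  }
  n - homogeneous                          # reduced sample size
}

The single linkage mirrors the nearest-neighbor intuition behind the neighborhood reduction; other linkages change the counts, which is why the released implementation exposes the clustering method as a parameter.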

5 Experimental Results

In order to assess our approach and illustrate its usefulness, we performed the following experiments, also included in our shared code (available at https://bitbucket.org/rodrigo_mello/shattering-rcode): (i) a simple scenario with two 2D Gaussians (Figure 4); (ii) a simple class overlapping of two 2D Gaussians (Figure 5); (iii) two concentric classes (Figure 8); (iv) the Iris dataset, in order to illustrate a classical toy example; (v) the MNIST dataset for several different space reconstructions according to the convolutional masks used along with a Convolutional Neural Network (CNN) [5]; and (vi) the CIFAR-10 dataset, also under a list of space reconstructions defined by convolutional masks.

The first two experiments consider synthetic datasets to illustrate our approach. While the first considers no class overlapping (Figure 4), the second explores a space region with significant class heterogeneity (Figure 5). After reducing the dataset of the first scenario, we obtained only two instances independently of the sample size, each one under a different class label. This basically means there is no class overlapping, so that there are two obvious space regions with complete class homogeneity. From that, we applied the formal results by Har-Peled and Jones [7] to compute the expected number of hyperplanes required to separate this dataset in the form:

in which a constant is assumed to be large enough to make the function an upper bound in the Big-$\mathcal{O}$ notation [9]. Setting $d = 2$ and $n = 2$, once the input space dimensionality and the sample size to be classified in all possible ways are both equal to two, we have:

which corresponds to the number of hyperplanes necessary to separate all pairs of points from the reduced set. From Lemma 1, the number of half-spaces is then:

given the presence of two classes ($c = 2$). As the main conclusion, we have a constant Shattering coefficient as the sample size tends to infinity; thus, from the Generalization Bound (Inequation 3), we have:

considering the constant obtained after computing:

consequently, as $n \to \infty$, the rightmost term vanishes and the empirical risk estimates the expected risk, ensuring supervised learning. Now, let us suppose acceptable values for $\epsilon$ and $\delta$ (given some confidence level for the probabilistic convergence of the empirical to the expected risk) and study the necessary sample size to ensure such a learning bound, so that:

then:

and finally:

so, supposing we accept such a divergence to influence our estimator, any training sample size greater than the resulting threshold will be enough to provide all such guarantees.

For the sake of comparison, if we admit the upper bound by Sauer [14] and Shelah [16] (Inequation 2), the Shattering coefficient function for this first experimental scenario would be bounded by:

$$\mathcal{N}(\mathcal{F}, n) \leq \sum_{i=0}^{3} \binom{n}{i} \qquad (18)$$

knowing the VC dimension is $d + 1 = 3$ for linear hyperplanes in $\mathbb{R}^2$ [19]. Observe that this upper bound would be very far from our constant estimation, since new data points would be considered as affecting the shattering even when they lie around the class averages. This is a practical situation that illustrates the usefulness of our proposed approach.

Figure 4: First experimental scenario: two 2D Gaussians following Normal distributions with distinct averages. Observe there is no class overlapping.

The second experimental scenario considers the dataset illustrated in Figure 5, which clearly contains a region with relevant class overlapping. This label-mixing situation does not permit further reductions in the sample size, as shown in Figure 6. In this situation, the sample reduction follows a linear function of the original sample size, so that there is a clear trend to maintain a roughly constant fraction of the instances contained in the original sample. If one wishes to separate such a set, then the number of hyperplanes will increase along with the sample size, as illustrated in Figure 7.

Figure 5: Second experimental scenario: two 2D Gaussians following Normal distributions with distinct averages. Observe the presence of class overlapping.
Figure 6: Second experimental scenario: Original sample size versus its reduced size after exploring class homogeneity along the input space. Triangles correspond to the experimental sample size after reduction, while the linear function is a regression on those points.
Figure 7: Second experimental scenario: Number of hyperplanes required to separate the reduced sample as its size increases. Points correspond to the experimental results, while the curve is an approximation using the Big-$\mathcal{O}$ bound by Har-Peled and Jones [7].

As the next step, we take the reduction function to compute the expected number of hyperplanes required to separate the reduced sample:

knowing the dimensionality and the multiplicative constant involved [9]. From Lemma 1, we thus have the number of half-spaces:

and, thus, we have:

as the Shattering coefficient for this experimental dataset with two classes ($c = 2$), assuming the number of hyperplanes increases along with the sample size.

Besides considering the sample size reduction, this scenario tends to overfit data instances once the number of hyperplanes is directly influenced by the separability of the sample. This is an undesirable setting, given that the Shattering coefficient would be impacted by the growth in the number of hyperplanes defining the decision boundary, thus implying successive relaxations of the algorithm search bias and finally leading to greater probabilities of representing the memory-based classifier [18]. If one decides to take this formulation to adapt the number of hyperplanes, the classification task will be prone to overfitting [3].

As a matter of fact, this analysis of the number of hyperplanes is useful to understand the linear separability of instances according to the class homogeneity. If the number of hyperplanes follows a constant, as for the first dataset, we are sure there is no overlapping region in such an input space. Observe this could help researchers analyze spaces before and after applying kernel transformations on the original data space [15]. In addition, one may also use our approach to define a stopping criterion for the number of hyperplanes and, thus, define a given supervised learning setting. For instance, considering the second dataset and its reduction function, one could conclude that a certain fraction of instances would mix and set the number of hyperplanes assuming only the homogeneous-class regions. This would take only the two homogeneous regions into account, from which we could have the same number of hyperplanes (and shattering) as for the first dataset. Of course, after taking such a decision, only a limited maximal accuracy could be asymptotically obtained, as derived from the overlapping regions.

The third experimental scenario considers two concentric classes (Figure 8), composed of instances with synthetically-generated attributes. This is a classical scenario that takes advantage of a second-order polynomial kernel [3] to obtain a linearly separable space, as illustrated in Figure 9. Our intention with this experiment is to analyze the differences of our approach in light of space kernelization.
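For completeness, one explicit feature map realizing a second-order (inhomogeneous) polynomial kernel on 2D inputs is sketched below; the exact kernel parametrization used in our experiments may differ, so this block is illustrative only.

# Explicit feature map of the inhomogeneous second-order polynomial kernel
# (x . z + 1)^2 for 2D inputs, which makes concentric classes linearly separable.
poly2_features <- function(X) {
  cbind(X[, 1]^2, X[, 2]^2, sqrt(2) * X[, 1] * X[, 2],
        sqrt(2) * X[, 1], sqrt(2) * X[, 2], 1)
}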

Figure 8: Third experimental scenario: one central Gaussian class and a sinusoidal function summed to another Normal distribution, reconstructed in a bidimensional manner around the first object.
Figure 9: Third experimental scenario after applying a second-order polynomial kernel on the original data space. Red dots correspond to triangles while black dots to circles of Figure 8.

Given that our approach already reduces class-homogeneous space regions to single data points, there is no significant effect on the sample reduction even after applying a kernel that provides linear separability. This third experiment is used to confirm this claim: we performed several runs for each scenario, with and without the kernel transformation on the original data space, in an attempt to account for the random characteristics of the data generation. We then collected the final sample size after processing the whole dataset, obtaining the averages and standard deviations for both scenarios, from which we built up the following hypothesis test (an illustrative script for such a comparison is sketched at the end of this analysis):

thus rejecting the null hypothesis and allowing us to conclude that the difference between both scenarios is statistically significant. However, both averages are too close in terms of the final sample size after being processed by our approach, leading to the same number of homogeneous space regions when applying the ceiling operator $\lceil \cdot \rceil$, which is necessary given that the sample size is a discrete number. If we suppose that, even after receiving an infinite-sized sample, the same number of data points is enough to characterize this problem behavior, then, according to Lemma 1, the number of hyperplanes necessary to classify such a sample is:

for both situations, having however different dimensionalities for the original space and after the kernel transformation, thus leading to slightly different estimates. Using multiplicative constants to represent the closed form of the Big-$\mathcal{O}$ notation [9], we find approximately the same number of hyperplanes in both cases.

Finally, the number of half-spaces is approximately the same, from which we compute the Shattering coefficient according to Inequation 13 (for two classes, $c = 2$):

thus finding:

allowing us to conclude that the Shattering coefficient is asymptotically constant, leading to tight learning guarantees according to the SLT [18].

From this particular analysis, we verify that our approach only takes advantage of a given transformation kernel when it helps to reduce the class overlapping, and not necessarily the linear separability of the data space. Besides this limitation, the number of hyperplanes provides a big picture of the complexity of the search space bias of the algorithm, which could somehow be used to infer a proper kernel. In addition, we must not forget the importance of estimating a tight Shattering coefficient for this problem, thus allowing one to study, for instance, its necessary sample size to ensure learning bounds according to the SLT, as performed in [3].
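As a side note on the statistical comparison above, such a test can be scripted in a few lines. The sketch below assumes a Welch two-sample t-test over the per-run reduced sample sizes, which is an illustrative choice rather than necessarily the exact test we employed.

# Illustrative comparison of the reduced sample sizes obtained with and
# without the kernel transformation (Welch two-sample t-test assumed here).
compare_reductions <- function(sizes_original, sizes_kernel, alpha = 0.05) {
  test <- t.test(sizes_original, sizes_kernel)
  list(p_value = test$p.value, reject_null = test$p.value < alpha)
}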

As the fourth experimental scenario, we decided to take into consideration the traditional Iris dataset [6], due to its vast use as a toy problem in the literature. In this case, the original dataset with $150$ instances and $4$ attributes was reduced to eleven examples, so that, if we suppose that even after receiving an infinite-sized sample this same number of data points is enough to characterize the problem behavior, then the number of estimated hyperplanes would be given by a constant. From this Big-$\mathcal{O}$ notation, we would obtain some constant Shattering coefficient, thus confirming the learning convergence for this particular classification task according to the SLT [3]. Of course, this could only be devised if, as the original sample size grows, eleven instances are still enough to represent class-homogeneous regions along the input space. Consider this as a motivating example, but the reader should have a greater sample size in order to study whether there is some stability around eleven homogeneous space regions.
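As an illustrative usage of the reduce_sample() sketch presented in Section 4 (results depend on the chosen linkage and are not guaranteed to match the figure reported above):

# Iris usage example for the reduce_sample() sketch from Section 4.
data(iris)
reduce_sample(as.matrix(iris[, 1:4]), iris$Species, method = "single")
# returns the number of class-homogeneous regions for this linkage choice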

The fifth and sixth experimental scenarios are based on two deep learning benchmarks: MNIST [12] and CIFAR-10 [10]. The MNIST handwritten digit database contains $70{,}000$ instances, $60{,}000$ for training and the remaining $10{,}000$ for testing purposes. This dataset was then used to study the influence of a single convolutional layer whose mask sizes were empirically set to four increasing values, from which we obtained the sample size reductions illustrated in Figure 10.

Figure 10: Fifth experimental scenario: Sample reductions for the MNIST handwritten digit database using four different convolutional mask sizes (triangle, plus, cross and diamond symbols).
Figure 11: Fifth experimental scenario: Estimating the number of hyperplanes for the MNIST handwritten digit database using four different convolutional mask sizes (triangle, plus, cross and diamond symbols).

We observe that the greater the convolutional mask, the less steep is the slope devised from the data points in Figure 10. For instance, the largest masks cause a greater reduction in the sample size than any other. In order to bring a different perspective, we modeled those curves using linear regressions, obtaining the functions and model indices listed in Table 1, which allowed us to understand the percentage of samples that is reduced as the sample size increases. First of all, we observe all linear models were fair enough to represent the data points, according to their F-statistics and p-values. Besides noticing a greater sample reduction whenever the convolutional mask is increased, another question remains: should we keep increasing the sizes of convolutional masks? In fact, no, because the dimensionality resulting from the space embedding may impact the complexity of the Shattering function, as seen next.

Table 1: Linear regression models for the MNIST sample size reduction, reporting, for each convolutional mask size, the fitted linear regression and its model indices (F-statistic and p-value).
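The model indices in Table 1 (and later in Table 2) can be obtained with ordinary least squares; the R sketch below is a minimal illustration of how such slopes, F-statistics and p-values can be extracted (variable names are ours).

# Fit a linear model of the reduced sample size as a function of the original
# sample size for one convolutional mask, and report the model indices.
reduction_model <- function(original_size, reduced_size) {
  fit <- summary(lm(reduced_size ~ original_size))
  f <- fit$fstatistic
  list(coefficients = fit$coefficients[, "Estimate"],
       f_statistic  = unname(f["value"]),
       p_value      = unname(pf(f["value"], f["numdf"], f["dendf"],
                                lower.tail = FALSE)))
}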

The mask size may impact the dimensionality of the space on which hyperplanes are built, leading, for example, an instance addressed using larger masks to a higher-dimensional space, which tends to increase the number of hyperplanes, as seen in Figure 11. Therefore, the best way of balancing the complexity of the Shattering coefficient is by analyzing how the number of hyperplanes grows as the original sample is embedded by some convolutional mask.

From Figure 11, we conclude that one particular mask size produces the best scenario among the analyzed ones, as it reduces the number of hyperplanes and regions to be considered in the calculation of the Shattering coefficient (Inequation 13, setting $c = 10$ given this benchmark considers ten classes), thus ensuring tighter learning bounds for this classification task. In addition, this study on the number of hyperplanes allows us to set up the number of convolutional units at the first layer of a Convolutional Neural Network (CNN). Observe that subsequent layers could be devised using the same criterion.

As the last experimental scenario, we consider the CIFAR-10 image dataset [10, 11], which contains $50{,}000$ training and $10{,}000$ test examples formed by $32 \times 32$ RGB images along $10$ classes. In this situation, we performed the same assessment conducted with MNIST. Figure 12 illustrates the reduction curves according to the convolutional masks, from which we observe the same trend, i.e., the greater the mask size, the less steep its slope. Table 2 lists the linear regressions and model indices estimated on top of those curves, allowing us to understand the percentage of samples that is reduced as the sample increases. Observe that the reduction is much stronger than for the MNIST dataset (see Table 1), which is always good due to its direct effect on decreasing the number of hyperplanes.

Figure 12: Sixth experimental scenario: Sample reductions for the CIFAR-10 image database using four different convolutional mask sizes (triangle, plus, cross and diamond symbols).
Table 2: Linear regression models for the CIFAR-10 sample size reduction, reporting, for each convolutional mask size, the fitted linear regression and its model indices (F-statistic and p-value).

By analyzing Figure 13, we notice the estimated number of hyperplanes is smaller for one of the mask sizes, which directly decreases its Shattering coefficient, making it the best choice for the first convolutional layer while solving the CIFAR-10 classification task. Of course, the same procedure could be extended to subsequent layers in order to compute the number of hyperplanes and their inherent complexities.

Then, using the equation representing the sample reduction for the selected mask size, we obtain the corresponding number of hyperplanes and its multiplicative constant. From this, we plug the function representing the reduction of the original sample size into the formula characterizing the number of hyperplanes:

In an attempt to illustrate the effect on the number of hyperplanes, we decided to consider the whole training set with $50{,}000$ instances, which yields input examples in $\mathbb{R}^{32 \times 32 \times 3}$ for our problem (in which $32 \times 32$ is the image size and $3$ refers to the number of channels), so that the maximal number of hyperplanes for a first convolutional layer would be given by:

so that we obtain an approximate number of convolutional units to adopt at a first convolutional layer. Of course, this number considers the pairwise separation and thus tends to overfit input data examples, so we should take it as an upper bound to be experimentally analyzed. There is another point: we did not define the multiplicative constant, which must be tackled as future work (we intend to devise it from a combination of theoretical and experimental assessments).

Figure 13: Sixth experimental scenario: Estimating the number of hyperplanes for the CIFAR-10 image database using four different convolutional mask sizes (triangle, plus, cross and diamond symbols).

We hope this section provided a wide enough view of how to employ our theoretical results in practical classification tasks, in an attempt to: (i) compute the Shattering coefficient function for binary and multi-class classification problems; and (ii) define the upper bound for the number of hyperplanes necessary to classify some input space. Observe that the fewer hyperplanes we have, the less complex is the input space we are dealing with and, inherently, the stronger is its learning bound according to the Statistical Learning Theory [18].

6 Concluding remarks

In this paper, we proposed a new approach to estimate the maximal number of hyperplanes required to separate specific data samples, based on the recent theoretical contributions by Har-Peled and Jones in the context of dataset partitioning [7], and used this approach to analytically compute the Shattering coefficient function for both binary and multi-class classification problems.

As main contributions, our approach allows us to study the complexity of the search space bias of supervised learning algorithms; to estimate the necessary training sample sizes while tackling specific learning tasks; and to parametrize the number of hyperplanes an algorithm needs while solving supervised tasks. All those aspects are even more relevant in the current state of the Machine Learning area due to the need to ensure convergence guarantees for deep learning algorithms [5, 17].

All experiments performed were intended to illustrate our contributions and how they can be employed in practical circumstances. The synthetic and toy datasets permitted us to devise our first conclusions, which were then extended when dealing with two deep learning benchmarks (MNIST and CIFAR-10). In an attempt to support other researchers, we made our source code available at https://bitbucket.org/rodrigo_mello/shattering-rcode.

References

  • [1] Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine Learning, 20(3):273–297, Sep 1995.
  • [2] Rodrigo F. de Mello, Chaitanya Manapragada, and Albert Bifet. Measuring the shattering coefficient of decision tree models. Expert Systems with Applications, 137:443 – 452, 2019.
  • [3] Rodrigo Fernandes de Mello and Moacir Antonelli Ponti. Machine Learning - A Practical Approach on the Statistical Learning Theory. Springer, 2018.
  • [4] L. Devroye, L. Györfi, and G. Lugosi. A Probabilistic Theory of Pattern Recognition. Springer, 1996.
  • [5] Martha Dais Ferreira, Débora Cristina Corrêa, Luis Gustavo Nonato, and Rodrigo Fernandes de Mello. Designing architectures of convolutional neural networks to solve practical problems. Expert Syst. Appl., 94:205–217, 2018.
  • [6] R. A. Fisher. The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7(7):179–188, 1936.
  • [7] Sariel Har-Peled and Mitchell Jones. On separating points by lines. Discrete & Computational Geometry, May 2019.
  • [8] A. K. Jain, M. N. Murty, and P. J. Flynn. Data clustering: A review. ACM Comput. Surv., 31(3):264–323, September 1999.
  • [9] Donald E. Knuth. The Art of Computer Programming, Volume 3: Sorting and Searching (2nd ed.). Addison Wesley Longman Publishing Co., Inc., Redwood City, CA, USA, 1998.
  • [10] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Master’s thesis, Department of Computer Science, University of Toronto, 2009.
  • [11] Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. Cifar-10 (canadian institute for advanced research). Web Site, Accessed in November 2019.
  • [12] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
  • [13] R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2013.
  • [14] N. Sauer. On the density of families of sets. Journal of Combinatorial Theory, 13:145–147, 1972.
  • [15] Bernhard Scholkopf and Alexander J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, Cambridge, MA, USA, 2001.
  • [16] Saharon Shelah. A combinatorial problem; stability and order for models and theories in infinitary languages. Pacific Journal of Mathematics, 41(1):247–261, 1972.
  • [17] Christopher Dane Shulby, Martha Dais Ferreira, Rodrigo Fernandes de Mello, and Sandra M. Aluísio. Theoretical learning guarantees applied to acoustic modeling. J. Braz. Comp. Soc., 25(1):1:1–1:12, 2019.
  • [18] Vladimir Vapnik. The nature of statistical learning theory. Springer science & business media, 2013.
  • [19] U. von Luxburg and B. Schölkopf. Statistical Learning Theory: Models, Concepts, and Results, volume 10, pages 651–706. Elsevier North Holland, Amsterdam, Netherlands, 05 2011.