A New Fuzzy Stacked Generalization Technique and Analysis of its Performance

Mete Ozay, Fatos T. Yarman Vural
M. Ozay and F. T. Yarman Vural are with the Department of Computer Engineering, Middle East Technical University, Ankara, Turkey.
E-mail: mozay@metu.edu.tr, vural@ceng.metu.edu.tr
Abstract

In this study, a new Stacked Generalization technique called Fuzzy Stacked Generalization (FSG) is proposed to minimize the difference between the N-sample and large-sample classification errors of the Nearest Neighbor classifier. The proposed FSG employs a new hierarchical distance learning strategy to minimize this error difference. For this purpose, we first construct an ensemble of base-layer fuzzy k-Nearest Neighbor (k-NN) classifiers, each of which receives a different feature set extracted from the same sample set. The fuzzy membership values computed in the decision space of each fuzzy k-NN classifier are concatenated to form the feature vectors of a fusion space. Finally, the feature vectors are fed to a meta-layer classifier to learn the degree of accuracy of the decisions of the base-layer classifiers for meta-layer classification.

The proposed method is examined on both artificial and real-world benchmark datasets. Experimental results obtained using artificial datasets show that the classification performance of the FSG depends on how the individual classifiers share the feature vectors of the samples. Rather than the power of the individual base-layer classifiers, the diversity and cooperation of the classifiers become the important issues for improving the overall performance of the proposed FSG. A weak base-layer classifier may boost the overall performance more than a strong classifier if it is capable of recognizing, in its own feature space, the samples which are not recognized by the rest of the classifiers. The experiments explore the type of collaboration among the individual classifiers required for an improved performance of the suggested architecture. Experiments on multiple-feature real-world datasets show that the proposed FSG performs better than state-of-the-art ensemble learning algorithms such as Adaboost, Random Subspace and Rotation Forest. On the other hand, comparable performances are observed in the experiments on single-feature multi-attribute datasets.

Index Terms: Error minimization, ensemble learning, decision fusion, nearest neighbor rule, classification.

1 Introduction

The Stacked Generalization algorithm, proposed by Wolpert [1] and used by many others [2, 3, 4, 5, 6, 7], is a widely used ensemble learning technique. The basic idea is to combine several classifiers in a variety of ways so that the performance of the Stacked Generalization (SG) is higher than that of the individual classifiers that constitute the ensemble. Although gathering the classifiers under the Stacked Generalization algorithm significantly boosts the performance in some application domains, it is observed that the performance of the overall system may become worse than that of the individual classifiers in some other cases. Wolpert defines the problem of describing the relation between the performance and the various parameters of the algorithm as a black art problem [1, 7].

In this study, we suggest a Fuzzy Stacked Generalization (FSG) technique and resolve the black art problem [1] for the minimization of the classification error of the nearest neighbor rule. The proposed technique aggregates the independent decisions of the fuzzy base-layer nearest neighbor classifiers by concatenating the membership values of each sample for each class under the same vector space, called the decision space. A meta-layer fuzzy classifier is then trained to learn the degree of correctness of the base-layer classifiers.

There are three major contributions of this study:

  1. We propose a novel hierarchical distance learning approach to minimize the difference between the N-sample and large-sample classification errors of the nearest neighbor rule using the FSG. The proposed approach enables us to define a “distance learning in feature space” problem as a “decision space design” problem, which is resolved using an ensemble of nearest neighbor classifiers.

  2. The proposed FSG algorithm enables us to extract information from different feature spaces using expert base-layer classifiers. The expertise of a base-layer classifier on a feature space is analyzed using the class membership vectors that reside in the decision space of that classifier. Therefore, the expertise of the base-layer classifiers is used for designing the distance functions of the nearest neighbor classifiers. In addition, the fusion space of the meta-layer classifier is constructed by aggregating the decision spaces of the base-layer classifiers. Therefore, the dimension of the feature vectors in the fusion space is fixed to J × C, where J and C are the number of classifiers and classes, respectively. As a result, the computational complexity of the meta-layer classifier depends only on the number of samples N in the training dataset and on the fixed fusion-space dimension J × C, rather than on the dimensions of the input feature spaces.

  3. We make a thorough empirical analysis of the black art problem of the suggested FSG. The empirical results show the effect of the samples which cannot be correctly classified by any of the base-layer classifiers on the classification performance of the FSG. It is observed that if the base-layer classifiers share all the samples in the training set, in the sense that each sample is correctly classified by at least one of them, then the performance of the overall FSG becomes higher than that of the individual base-layer classifiers. On the other hand, if a sample is misclassified by all of the base-layer classifiers, then this sample degrades the overall performance of the FSG.

The suggested Fuzzy Stacked Generalization algorithm is tested on artificial and real datasets and compared with state-of-the-art ensemble learning algorithms such as Adaboost [8], Random Subspace [9] and Rotation Forest [10].

In the next section, a literature review of SG architectures and of the distance learning methods which minimize the generalization error of the nearest neighbor rule is given. The difference between the N-sample and large-sample errors of the nearest neighbor rule is defined in Section 3. A distance learning approach which minimizes this error difference is given in Section 4. The employment of the distance learning approach in the hierarchical FSG technique and its algorithmic description are given in Section 5. Section 6 addresses the conceptual and algorithmic properties of the FSG. Experimental analyses are given in Section 7. Section 8 summarizes and concludes the paper.

2 Related Works and Motivation

Various Stacked Generalization architectures have been proposed in the literature [1, 2, 3, 4, 5, 6, 11, 12, 13, 14, 15, 16, 17]. Most of them aggregate the decisions of the base-layer classifiers by using the vector concatenation operation [1, 2, 3, 4, 5, 6, 11, 12, 13, 14, 16, 17, 18, 19, 20] or majority voting [15] techniques at the meta-layer.

Ueda [2] employs the vector concatenation operation on the feature vectors at the output feature spaces of the base-layer classifiers (which are called decision spaces in our work) and considers these operations as linear decision combination methods. Then, he experimentally compares linear combination and voting methods in an SG where Neural Networks are implemented as the base-layer classifiers. Following the same formulation, Sen and Erdogan [3] analyze various weighted and sparse linear combination methods by combining the decisions of heterogeneous base-layer classifiers such as decision trees and k-NN. Rooney et al. [4] employ homogeneous and heterogeneous classifier ensembles for stacked regression using linear combination rules. Zenko et al. [5] compare the classification performances of SG algorithms which employ linear combination rules with other combination methods (e.g. voting) and ensemble learning algorithms (e.g. Bagging and Adaboost). Akbas and Yarman Vural [14] employ an SG algorithm using fuzzy k-NN classifiers for image annotation. Sigletos et al. [15] compare the classification performances of several SG algorithms which combine nominal (i.e., crisp decision values such as class labels) or probabilistic decisions (i.e., estimations of probability distributions). Ozay and Yarman Vural [13, 16] compare the classification performances of homogeneous base-layer fuzzy k-NN classifiers and a linear meta-layer classifier using heterogeneous SG architectures.

In most of the experimental results given in the aforementioned studies, linear decision combination or aggregation methods provide comparable or better performances than the other combination methods. However, the performance evaluations of the stacked generalization methods reported in the literature are not consistent with each other. This fact is demonstrated by Dzeroski and Zenko in [17], where they employ heterogeneous base-layer classifiers in their stacked generalization architecture and report that their results contradict the observations of the previous studies on SG. The contradictory results can be attributed to the many non-linear relations among the parameters of the SG, such as the number and the structure of the base-layer and meta-layer classifiers, and their feature, decision and fusion spaces.

The selection of the parameters of the SG, and the design of the classifiers and feature spaces, have been considered a black art by Wolpert [1] and Ting and Witten [7]. For instance, popular classifiers such as k-NN, Neural Networks and Naïve Bayes can be used as the base-layer classifiers in SG to obtain nominal decisions. However, there are crucial differences among these classifiers in terms of how they process the feature vectors. Firstly, k-NN and Neural Networks are non-parametric classifiers, whereas Naïve Bayes is a parametric one. Secondly, k-NN is a local classifier which employs the neighborhood information of the features, whereas Neural Networks compute a global decision function and Naïve Bayes computes the overall statistical properties of the datasets. Therefore, tracing the feature mappings from the base-layer input feature spaces to the meta-layer input feature spaces (i.e. fusion spaces) in SG becomes an intractable and uncontrollable problem. Additionally, the outputs of heterogeneous classifiers give different types of information about the decisions of the classifiers, such as crisp, fuzzy or probabilistic class labeling.

The employment of fuzzy decisions in ensemble learning algorithms is analyzed in [6, 21, 22]. Tan et al. [6] use fuzzy k-NN algorithms as base-layer classifiers, and employ a linearly weighted voting method to combine the fuzzy decisions for face recognition. Cho and Kim [21] combine the decisions of Neural Networks implemented as base-layer classifiers using a fuzzy combination rule called the fuzzy integral. Kuncheva [22] experimentally compares various fuzzy and crisp combination methods, including the fuzzy integral and voting, to boost the classifier performances in Adaboost. In these experimental results, the classification algorithms that implement fuzzy rules outperform the algorithms that implement crisp rules. However, the effect of the employment of fuzzy rules on the classification performance of SG is left as an open problem.

In this study, most of the above-mentioned intractable problems are avoided by employing a homogeneous architecture which consists of the same type of base-layer and meta-layer classifiers in a new stacked generalization architecture called Fuzzy Stacked Generalization (FSG). This architecture allows us to concatenate the output decision spaces of the base-layer classifiers, which represent consistent information about the samples. Furthermore, we model the linear combination or feature space aggregation method as a feature space mapping from the base-layer output feature space (i.e. decision space) to the meta-layer input feature space (i.e. fusion space). In the proposed FSG, the classification rules of the base-layer classifiers are considered as feature mappings from the classifier input feature spaces to the output decision spaces. In order to control these mappings and trace the transformations of the feature vectors of the samples through the layers of the architecture, homogeneous fuzzy k-NN classifiers are used, and the behavior of the fuzzy decision rules is investigated in both the base-layer and the meta-layer. Moreover, the employment of fuzzy k-NN classifiers enables us to obtain information about the uncertainty of the classifier decisions and the belongingness of the samples to the classes [23, 24].

We analyze the classification error of a nearest neighbor classifier in two parts, namely (i) the N-sample error, which is the error of a classifier employed on a training dataset of N samples, and (ii) the large-sample error, which is the asymptotic error as the number of training samples grows without bound. A distance learning approach proposed by Short and Fukunaga [25] is used in a hierarchical FSG architecture, from a decision fusion perspective, for the minimization of the difference between the N-sample and large-sample errors. In the literature, distance learning methods have been employed using prototype [26, 27, 28, 29] and feature selection [30] or weighting [31] methods by computing the weights associated with samples and feature vectors, respectively. The computed weights are used to linearly transform the feature spaces of the classifiers into more discriminative feature spaces [32, 33, 34] in order to decrease the large-sample classification error of the classifiers [35]. A detailed literature review of prototype selection and distance learning methods for nearest neighbor classification is given in [29].

There are three main differences between our proposed hierarchical distance learning method and the methods introduced in the literature [26, 27, 28, 29, 30, 31, 32, 33, 34, 35]:

  1. The proposed method is used for the minimization of the difference between the N-sample and large-sample errors, while the aforementioned methods [26, 27, 28, 29, 30, 31, 32, 33, 34, 35] consider the minimization of the large-sample error.

  2. We employ a generative feature space mapping by computing the class posterior probabilities of the samples in the decision spaces and use the posterior probability vectors as feature vectors in the fusion spaces. On the other hand, the methods given in the literature [26, 27, 28, 29, 30, 31, 32, 33, 34, 35] use discriminative approaches by transforming the input feature spaces into more discriminative input feature spaces.

  3. The aforementioned methods, including the method of Short and Fukunaga [25], employ distance learning in a single classifier. In contrast, we employ a hierarchical ensemble learning approach for distance learning. Therefore, different feature space mappings can be employed in different classifiers of the ensemble, which gives us more control over the feature space transformations than a single feature transformation in a single classifier.

In Section 3, we define the problem of minimizing the difference between the N-sample and large-sample errors in a single classifier. Then, we introduce the distance learning approach for an ensemble of classifiers, considering the distance learning problem as a decision space design problem, in Section 4. The employment of the proposed hierarchical distance learning approach in the FSG and its algorithmic description are given in Section 5. We discuss the expertise of the base-layer classifiers, the dimensionality problems of the feature spaces in the FSG, and its computational complexity in Section 6. In order to compare the proposed FSG with state-of-the-art ensemble learning algorithms, we have implemented Adaboost, Random Subspace and Rotation Forest in the experimental analysis in Section 7. Moreover, we have used the same multi-attribute benchmark datasets with the same data splitting given in [26, 27] to compare the performance of the proposed hierarchical distance learning approach with that of the aforementioned distance learning methods. Since the classification performances of these distance learning methods are analyzed in detail in [26, 27], we do not reproduce those results in Section 7 and refer the reader to [26, 27].

3 N-sample and Large-sample Classification Errors of k-NN

Suppose that a training dataset of N samples is given, where each sample s is associated with a class label y. A sample is represented in a feature space F by a feature vector x.

Let {p(x | ω_c)}, c = 1, 2, ..., C, be a set of probability densities at the feature vector x of a sample s, such that x is observed for a given class label ω_c according to the density p(x | ω_c). Therefore, p(x | ω_c) is called the likelihood of observing x for a given ω_c. A set of functions {P(ω_c)}, c = 1, 2, ..., C, is called the set of prior probabilities of the class labels, such that P(ω_c) ≥ 0 and the priors sum to one. Then, the posterior probability of assigning the sample to a class ω_c is computed using the Bayes Theorem [36] as the ratio of p(x | ω_c)P(ω_c) to the evidence p(x), where p(x) is the sum of p(x | ω_c)P(ω_c) over all classes.

The Bayes classification rule estimates the class label of s as the class with the maximum posterior probability [36].

If a loss occurs when a sample is assigned to an incorrect class (the 0-1 loss), then the classification error of the Bayes classifier at a feature vector x is defined as [37]

e*(x) = 1 - max_c P(ω_c | x),

and the expected error is defined as [37]

E* = E[e*(x)],

where the expectation is taken over the density of the feature vectors in F.

In this work, we focus on the minimization of the classification error of the well-known k-Nearest Neighbor (k-NN) classifier [36]. Given a new test sample with feature vector x, let N_k(x) denote the set of its k nearest neighbors in the training set, i.e. the k training feature vectors that are closest to x under the employed distance function.

The nearest neighbor rule (k = 1) simply estimates y, the label of x, as the label of the nearest neighbor of x. In the k-nearest neighbor rule (k-NN), the posterior probability P(ω_c | x) is estimated as

P(ω_c | x) ≈ k_c / k,

where k_c is the number of samples which belong to ω_c in N_k(x).

Then, the probability of error of the nearest neighbor rule, computed using N training samples, is

(1)

where P(ω_c | x) and P(ω_c | x') represent the posterior probabilities at x and at its nearest neighbor x' in the training set [36].

In the asymptotic case of a large number of training samples, if the posterior probability is not singular, i.e. it is continuous at x, then the large-sample error is computed as

(2)

It is well known that there is an elegant relationship between the classification errors of the Bayes classifier and the k-NN classifier, as follows [37]:

Then, the difference between the N-sample error (1) and the large-sample error (2) is computed as

(3)

The main goal of this paper is to minimize the difference between the N-sample and large-sample errors given in (3) by employing a distance learning approach suggested by Short and Fukunaga [25] within the Fuzzy Stacked Generalization. The distance learning approach of Short and Fukunaga and its employment in a hierarchical distance learning strategy are given in Section 4. This strategy is used for modeling the Fuzzy Stacked Generalization, whose algorithmic definition is given in Section 5.
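For reference, the quantities in (1)-(3) take the following standard forms for the nearest neighbor rule; the notation here is ours and is meant only as a reminder of the classical results in [36, 37], not as a reproduction of the paper's exact displayed equations:

e_N(x) = 1 - \sum_{c=1}^{C} P(\omega_c \mid x)\, P(\omega_c \mid x'_{NN}),
\qquad
e_\infty(x) = 1 - \sum_{c=1}^{C} P(\omega_c \mid x)^2,

E^{*} \le E_\infty \le E^{*}\Big( 2 - \tfrac{C}{C-1}\, E^{*} \Big),

e_N(x) - e_\infty(x) = \sum_{c=1}^{C} P(\omega_c \mid x)\,\big[ P(\omega_c \mid x) - P(\omega_c \mid x'_{NN}) \big],

where x'_{NN} denotes the nearest neighbor of x among the N training samples, E^{*} is the Bayes error and E_\infty is the expected large-sample error of the nearest neighbor rule.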

4 Minimization of the N-sample and Large-sample Classification Error Difference using Hierarchical Distance Learning in the FSG

Let us start by defining

and an error function

for a fixed test sample x. Then, the minimization of the expected value of the error difference in (3) is equivalent to the minimization of the expected value of the error function [25]

(4)

where the expectation is computed over the N training samples.

Short and Fukunaga [25] notice that (4) can be minimized either by increasing N or by designing a distance function which minimizes (4) in the classifiers. In a classification problem, a proper distance function is computed as [25]

(5)

where

and the norm is the squared l2 norm, i.e. the squared Euclidean distance.
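To make the role of (5) concrete, one way to write a posterior-based distance of this kind, in our own notation and only as an illustration of the idea in [25] rather than the paper's exact formula, is

d(x, x_n) = \big\| P(x) - P(x_n) \big\|^2,
\qquad
P(x) = \big[\, P(\omega_1 \mid x), \dots, P(\omega_C \mid x) \,\big],

so that the nearest neighbor of a test sample under d is the training sample whose posterior vector is closest to that of the test sample, which drives the difference terms P(\omega_c \mid x) - P(\omega_c \mid x'_{NN}) in the error difference toward zero.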

In a single classifier, (5) is computed in the input feature space, using local approximations to the posterior probabilities obtained from the training and test datasets [25]. Moreover, if the N-sample error is minimized on each different feature space F_j, j = 1, 2, ..., J, then an average error over the ensemble of J classifiers, which is defined as

(6)

is minimized by using

(7)

In this study, an approach to minimize (6) using (7) is employed in a hierarchical decision fusion algorithm. For this purpose, the posterior probabilities are first estimated using individual k-NN classifiers, which are called base-layer classifiers. Then the vectors of the probability estimates of the training and test samples are concatenated to construct

and

for all training and test samples. Finally, classification is performed on the concatenated vectors by a k-NN classifier, called the meta-layer classifier, with the distance function

(8)

Note that (8) can be used for the minimization of the error difference in each feature space F_j. If a single feature space is employed, i.e. J = 1, then (8) is equal to (5). Therefore, the distance learning problem proposed by Short and Fukunaga [25] is reformulated as a decision fusion problem. Then, the distance learning approach is employed using a hierarchical decision fusion algorithm called Fuzzy Stacked Generalization (FSG), as described in the next section.
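As a concrete check of how a distance of the form (8) relates to the per-classifier distances of the form (5), the following minimal sketch (our own code, with hypothetical membership vectors) verifies that the squared Euclidean distance between two concatenated posterior vectors equals the sum of the per-classifier squared distances:

import numpy as np

# Hypothetical class-membership (posterior) vectors produced by J = 3
# base-layer classifiers for a test sample x and a training sample x_n,
# with C = 4 classes each.
p_x  = [np.array([0.70, 0.10, 0.10, 0.10]),
        np.array([0.20, 0.50, 0.20, 0.10]),
        np.array([0.25, 0.25, 0.25, 0.25])]
p_xn = [np.array([0.60, 0.20, 0.10, 0.10]),
        np.array([0.10, 0.70, 0.10, 0.10]),
        np.array([0.40, 0.20, 0.20, 0.20])]

# Per-classifier squared distances between posterior vectors (cf. (5)).
per_classifier = [float(np.sum((a - b) ** 2)) for a, b in zip(p_x, p_xn)]

# Fusion-space distance: concatenate the J membership vectors of each sample
# and take the squared Euclidean distance between the concatenations (cf. (8)).
fusion = float(np.sum((np.concatenate(p_x) - np.concatenate(p_xn)) ** 2))

print(per_classifier, sum(per_classifier), fusion)  # the last two values coincide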

5 Fuzzy Stacked Generalization

Given a training dataset, each sample is represented in J different feature spaces F_j, j = 1, 2, ..., J, by a feature vector which is extracted using the feature extractor FE_j. Therefore, the training datasets of the base-layer classifiers employed on the J feature spaces can be represented by J different feature sets.

At the base-layer, each feature vector extracted from the same sample is fed into an individual fuzzy k-NN classifier in order to estimate the posterior probabilities using the class memberships as

(9)

where the labels of the k nearest neighbors of the feature vector are used together with their distances, and m is the fuzzification parameter [38]. Each base-layer fuzzy k-NN classifier is trained, and the membership vector of each training sample is computed using leave-one-out cross validation. For this purpose, (9) is employed for each training sample using a validation set from which that sample is excluded.
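For illustration, a minimal sketch of a base-layer fuzzy k-NN membership computation is given below. It is a simplified variant of the fuzzy k-NN rule of [38], with crisp neighbor labels, inverse-distance weights raised to the power 2/(m-1), and toy data; it is not claimed to reproduce the paper's exact equation (9).

import numpy as np

def fuzzy_knn_memberships(x, X_train, y_train, k=3, m=2.0, n_classes=None, eps=1e-12):
    """Class-membership vector of x computed from its k nearest training samples."""
    if n_classes is None:
        n_classes = int(y_train.max()) + 1
    d = np.linalg.norm(X_train - x, axis=1)        # Euclidean distances to all training samples
    nn = np.argsort(d)[:k]                         # indices of the k nearest neighbors
    w = 1.0 / (d[nn] ** (2.0 / (m - 1.0)) + eps)   # distance weights with fuzzification parameter m
    mu = np.zeros(n_classes)
    for idx, wi in zip(nn, w):
        mu[y_train[idx]] += wi                     # accumulate the weight of each neighbor's class
    return mu / mu.sum()                           # memberships are normalized to sum to one

# Toy usage with two Gaussian classes in a two-dimensional feature space.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (20, 2)), rng.normal(3.0, 1.0, (20, 2))])
y = np.array([0] * 20 + [1] * 20)
print(fuzzy_knn_memberships(np.array([1.5, 1.5]), X, y, k=5))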

The class label of an unknown sample is estimated by the base-layer classifier employed on the feature space F_j as

The training performance of the base-layer classifier is computed as,

(10)

where

(11)

is the Kronecker delta, which takes the value 1 when the base-layer classifier correctly classifies a sample, i.e. when the estimated label is equal to the true label.

When a set of test samples is received, the feature vectors of the samples are extracted by each FE_j. Then, the posterior probability of each test sample is estimated by each base-layer k-NN classifier on each feature space F_j, j = 1, 2, ..., J, using the training datasets.

If the set of labels of the test samples is available, then the test performance is computed as

The output space of each base-layer classifier is spanned by the class membership vectors of the samples. It should be noted that the class membership vectors satisfy ∑_{c=1}^{C} μ_c(x) = 1.

This equation aligns each sample on the surface of a simplex at the output space of a base-layer classifier, which is called the Decision Space of that classifier. Therefore, the base-layer classifiers can be considered as transformations which map an input feature space of any dimension to a point on a simplex in a C-dimensional (number of classes) decision space (for C = 2, the simplex reduces to a line segment).

Class-membership vectors obtained at the output of each classifier are concatenated to construct a feature space called the Fusion Space for the meta-layer classifier. The fusion space consists of (J × C)-dimensional feature vectors, which form the training dataset

and the test dataset

for the meta-layer classifier.
Finally, a meta-layer fuzzy k-NN classifier is employed to classify the test samples using their feature vectors in the fusion space, together with the fusion-space feature vectors of the training samples. The meta-layer training and test performances are computed as

and

respectively. An algorithmic description of the FSG is given in Algorithm 1.

input : Training set, test set and feature extractors FE_j, j = 1, 2, ..., J.
output : Predicted class labels of the test samples.
foreach feature space F_j, j = 1, 2, ..., J do
      1 Extract the training and test feature vectors using FE_j;
      2 Compute the training and test class membership vectors using (9);
      
end foreach
3 Construct the fusion-space training and test datasets by concatenating the membership vectors;
4 Employ meta-layer classification on the fusion space to predict the class labels of the test samples;
Algorithm 1 Fuzzy Stacked Generalization.
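To make Algorithm 1 concrete, the following self-contained sketch runs the whole pipeline on synthetic data. It is our own simplified implementation: the base-layer memberships are computed with a plain inverse-distance k-NN rule rather than the fuzzy rule of (9), leave-one-out training is emulated by excluding the query point, and all dataset parameters are hypothetical.

import numpy as np

def knn_memberships(X_ref, y_ref, X_query, k, n_classes, exclude_self=False):
    """Soft class memberships of each query sample from its k nearest reference samples."""
    mu = np.zeros((len(X_query), n_classes))
    for i, x in enumerate(X_query):
        d = np.linalg.norm(X_ref - x, axis=1)
        if exclude_self:
            d[i] = np.inf                      # leave-one-out: ignore the query itself
        nn = np.argsort(d)[:k]
        w = 1.0 / (d[nn] + 1e-12)              # inverse-distance weights (simplified fuzzy rule)
        for idx, wi in zip(nn, w):
            mu[i, y_ref[idx]] += wi
        mu[i] /= mu[i].sum()
    return mu

# Synthetic problem with C = 3 classes and J = 2 feature spaces (hypothetical data).
rng = np.random.default_rng(1)
C, J, n_train, n_test, k = 3, 2, 40, 10, 5
y_train = np.repeat(np.arange(C), n_train)
y_test = np.repeat(np.arange(C), n_test)
train_sets, test_sets = [], []
for j in range(J):
    means = rng.normal(0.0, 4.0, (C, 2))       # class means differ in each feature space
    train_sets.append(np.vstack([rng.normal(means[c], 1.5, (n_train, 2)) for c in range(C)]))
    test_sets.append(np.vstack([rng.normal(means[c], 1.5, (n_test, 2)) for c in range(C)]))

# Base layer: class-membership vectors in the decision space of each classifier.
train_fusion = np.hstack([knn_memberships(X, y_train, X, k, C, exclude_self=True)
                          for X in train_sets])
test_fusion = np.hstack([knn_memberships(X, y_train, T, k, C)
                         for X, T in zip(train_sets, test_sets)])

# Meta layer: k-NN classification in the fusion space.
meta_mu = knn_memberships(train_fusion, y_train, test_fusion, k, C)
y_pred = meta_mu.argmax(axis=1)
print("meta-layer test accuracy:", float((y_pred == y_test).mean()))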

The proposed algorithm is analyzed on artificial and benchmark datasets in Section 7. A discussion of the properties of the FSG is given in the next section.

6 Remarks On The Performance Of Fuzzy Stacked Generalization

In this section, we discuss the error minimization properties of the FSG, and the relationships between the performance of the FSG and various learning parameters.

6.1 Expertise of the Base-layer Classifiers, Feature Space Dimensionality Problem and Performance of the FSG

Employing distinct feature extractors for each classifier enables us to split the various attributes of the feature spaces coherently. Therefore, each base-layer classifier gains an expertise to learn a specific property of a sample, and correctly classifies a group of samples belonging to a certain class in the training data. This approach assures the diversity of the classifiers, as suggested by Kuncheva [39], and enables the classifiers to collaborate for learning the classes or groups of samples. It also allows us to optimize the parameters of each individual base-layer classifier independently of the others.

The formation of the fusion space by concatenating the decision vectors at the output of the base-layer classifiers helps us to learn the behavior of each individual classifier in recognizing a certain feature of the sample, which may result in a substantial improvement in the performance at the meta-layer. However, this postponed concatenation technique increases the dimension of the fusion-space feature vector to J × C. If one deals with a classification problem with a high number of classes, which may also require a high number of base-layer classifiers and a large number of samples for high performance, the dimension of the feature space at the meta-layer becomes large, again causing the curse of dimensionality. An analysis showing the decrease in performance as the number of classes and classifiers increases is provided in [13]. More detailed experimental results on the change of the classification performance as the number of feature spaces increases are given by comparing the FSG with state-of-the-art ensemble learning algorithms on benchmark datasets in Section 7.

Since there are several parameters, such as the number of classes, the number of feature extractors, and the means and variances of the distributions of the feature vectors, which affect the performance of classifier ensembles, there is no generalized model that describes the behavior of the performance with respect to these parameters. However, it is desirable to define a framework which ensures an increase in the performance of the FSG compared to the performance of the individual classifiers.

In addition, the design of the feature spaces of the individual base-layer classifiers, the size of the training set, the number of classes, and the relationships among all of these parameters affect the performance. A popular approach to designing the feature space of a single classifier is to extract all of the relevant features from each sample and aggregate them in the same vector. Unfortunately, this approach creates the well-known curse of dimensionality problem. On the other hand, reducing the dimension by methods such as principal component analysis, normalization, and feature selection algorithms may cause a loss of information. Therefore, one needs to find a balance between the curse of dimensionality and information deficiency in designing the feature space.

The suggested FSG architecture establishes this balance by designing independent base-layer fuzzy k-NN classifiers, each of which receives relatively low-dimensional feature vectors compared to the concatenated feature vectors of the single-classifier approach. This approach also avoids the normalization problem that arises after the concatenation operation. Note that the dimension of the decision space is independent of the dimensions of the feature spaces of the base-layer classifiers. Therefore, no matter how high the dimensions of the individual feature vectors at the base-layer are, this architecture fixes the dimension at the meta-layer to C × J (number of classes × number of feature extractors). This may be considered a partial solution to the curse of dimensionality problem, provided that C × J is bounded so as to assure statistical stability and avoid the curse of dimensionality.

6.2 Computational Complexity of the FSG

In the analysis of the computational complexity of the proposed FSG algorithm, the computational complexities of the feature extraction algorithms are ignored, assuming that the feature sets are already computed and given.

The computational complexity of the Fuzzy Stacked Generalization algorithm is dominated by the number of samples. For a brute-force implementation, the computational complexity of a base-layer k-NN classifier employed on a D_j-dimensional feature space with N training samples is O(N² D_j), for j = 1, 2, ..., J. If each base-layer classifier is implemented by an individual processor in parallel, then the computational complexity of the base-layer classification process is O(N² D_max), where D_max is the largest feature space dimension. In addition, the computational complexity of a meta-layer classifier which employs a fuzzy k-NN on the (J × C)-dimensional fusion space is O(N² J C). Therefore, the computational complexity of the FSG is O(N² (D_max + J C)).

In the following section, we provide an empirical study to analyze the remarks given in this section.

7 Experimental Analysis

In this section, three sets of experiments are performed to analyze the behavior of the suggested FSG and to compare its performance with state-of-the-art ensemble learning algorithms.

  1. In order to examine the proposed algorithm for the minimization of the difference between the N-sample and the large-sample classification errors, we propose an artificial dataset generation algorithm following the comments of

    • Cover and Hart [37] on the analysis of the relationship between the class conditional densities of the datasets and the performance of the nearest neighbor classification algorithm, and

    • Hastie and Tibshirani [40] on the development of metric learning methods for k-NN.

    In addition, we analyze the relationship between the performances of the base-layer and meta-layer classifiers considering sample and feature shareability among the base-layer classifiers and feature spaces. Then, we examine the geometric properties of the transformations between feature spaces by visualizing the feature vectors in these spaces and tracing the samples in each feature space, i.e. the base-layer input feature space, the base-layer output decision space and the meta-layer input fusion space.

  2. Next, benchmark pattern classification datasets such as Breast Cancer, Diabetis, Flare Solar, Thyroid, German and Titanic [26, 27, 28, 29, 41, 42], the Caltech 101 Image Dataset [43] and the Corel Dataset [13] are used to compare the classification performances of the proposed approach and state-of-the-art supervised ensemble learning algorithms. We have used the same data splitting of the benchmark Breast Cancer, Diabetis, Flare Solar, Thyroid, German and Titanic datasets suggested in [26, 27] to enable the reader to compare our results with those of the aforementioned distance learning methods reported in [26, 27].

  3. Finally, we examine the FSG in a real-world target detection and recognition problem using a multi-modal dataset. The dataset is collected using a video camera and a microphone in an indoor environment to detect and recognize two moving targets. The problem is defined as a four-class classification problem, where each class represents the absence or presence of the targets in the environment. In addition, we analyze the statistical properties of the feature spaces by computing the entropy values of the distributions of the feature vectors in each feature space, and by comparing the entropy values of each feature space of each classifier computed at each layer.

In the FSG, the k values of the fuzzy k-NN classifiers are optimized by searching over candidate values, up to the number of samples N in the training datasets, using cross validation. In the experiments, fuzzy k-NN is implemented both in Matlab (an implementation is available at https://github.com/meteozay/fsg.git) and in C++. For the C++ implementation, a fuzzified modification of a GPU-based parallel k-NN is used [44]. The classification performances of the proposed algorithms are compared with state-of-the-art ensemble learning algorithms, such as Adaboost [8], Random Subspace [9] and Rotation Forest [10]. Weighted majority voting is used as the combination rule in Adaboost. Decision trees are implemented as the weak classifiers in both Adaboost and Rotation Forest, and a k-NN classifier is implemented as the weak classifier in Random Subspace. The number of weak classifiers is selected using cross-validation on the training set, with candidate values that depend on the dimension of the feature space of the samples in the datasets. Adaboost and Random Subspace are implemented using the Statistics Toolbox of Matlab.
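As an illustration of this kind of model selection, the sketch below tunes the neighborhood size k of a (non-fuzzy) k-NN classifier by cross-validation with scikit-learn; the candidate grid and the synthetic dataset are our own choices and do not reproduce the exact search used in the experiments.

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for a training dataset (hypothetical sizes and dimensions).
X, y = make_classification(n_samples=300, n_features=10, n_classes=3,
                           n_informative=5, random_state=0)

# Candidate neighborhood sizes; in practice the grid may depend on the number
# of training samples N.
grid = {"n_neighbors": [1, 3, 5, 7, 9, 15, 25]}
search = GridSearchCV(KNeighborsClassifier(), grid, cv=5)
search.fit(X, y)
print("selected k:", search.best_params_["n_neighbors"])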

Experimental analyses of the proposed FSG algorithm on artificial datasets are given in Section 7.1. In Section 7.2, the classification performances of the proposed algorithm and the state-of-the-art classification algorithms are compared using benchmark datasets.

7.1 Experiments on Artificial Datasets

The relationship between the performance of the 1-NN and k-NN algorithms and the statistical properties of the datasets has been studied by many researchers over the last decades. Cover and Hart [37] analyzed this relationship with an elegant example, which was later revisited by Devroye, Gyorfi and Lugosi [45].

In the example, suppose that the feature vectors of the samples of a training dataset of N samples are grouped in two disks with centers c1 and c2, which represent the class groups ω1 and ω2, in a two-dimensional feature space, where the distance between the centers determines the between-class variance and the disks are separated such that any two feature vectors in the same disk are closer to each other than any two feature vectors in different disks. In addition, assume that the class conditional densities are uniform over the disks.

Note that the probability that a sample belongs to the first class ω1, i.e. that its feature vector resides in the first disk, is denoted by p.

Now, assume that the feature vector x of a sample belonging to ω1 is classified by the nearest neighbor rule using the N training samples. Then, x will be misclassified if its nearest neighbor resides in the second disk. However, if the nearest neighbor of x resides in the second disk, then all of the N training feature vectors must reside in the second disk. Therefore, the classification error is the probability that all of the samples reside in the second disk, which is (1 - p)^N.

If the k-NN rule is used for classification with k = 2k0 + 1, where k0 ≥ 1, then an error occurs if k0 or fewer of the N training feature vectors reside in the first disk, which happens with probability

∑_{j=0}^{k0} C(N, j) p^j (1 - p)^(N-j),

which is the lower tail of a Binomial distribution with parameters N and p.

Then the following inequality holds:

(1 - p)^N ≤ ∑_{j=0}^{k0} C(N, j) p^j (1 - p)^(N-j).

Therefore, the classification or generalization error of the k-NN rule depends on the class conditional densities [37]; in particular, the k-NN rule with k > 1 performs better than the nearest neighbor rule (k = 1) when the between-class variance of the data distributions is smaller than the within-class variances of the classes.
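The two error probabilities in this example are easy to evaluate numerically. The sketch below uses illustrative values of p, N and k (our choices, not the paper's): the 1-NN error requires all N training samples to fall in the second disk, whereas the k-NN error is the lower tail of the corresponding Binomial distribution.

from scipy.stats import binom

p, N = 0.3, 25                  # illustrative prior of the first disk and training set size
k0 = 2                          # k-NN rule with k = 2 * k0 + 1 = 5 neighbors

err_1nn = (1.0 - p) ** N        # all N training samples fall in the second disk
err_knn = binom.cdf(k0, N, p)   # at most k0 training samples fall in the first disk

print(f"1-NN error: {err_1nn:.3e}, k-NN (k = 5) error: {err_knn:.3e}")
# The j = 0 term of the Binomial sum is (1 - p)^N, so the 1-NN error never
# exceeds the k-NN error for these separated disks.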

Although Cover and Hart [37] introduced this example to analyze the classification performances of the nearest neighbor rules, Hastie and Tibshirani [40] used the results of the example in order to define a metric, computed from the local between-class and within-class statistics, that minimizes the difference between the N-sample and large-sample errors. Since the minimization of this error difference is one of the motivations of the FSG, a similar experimental setup is designed in this section in order to analyze the performance of the FSG.

In the experiments, the feature vectors of the samples in the datasets are generated using Gaussian distributions in each feature space F_j, j = 1, 2, ..., J. While constructing the datasets, the mean vector and the covariance matrix of the class-conditional density of a class,

(12)

are systematically varied in order to observe the effect of the class overlaps on the classification performance. One can easily realize that there is an explosive number of alternatives for changing the parameters of the class-conditional densities in a multi-dimensional vector space. However, it is quite intuitive that the amount of overlap among the classes affects the performance of the individual classifiers more than the changes in the class scatter matrix. Therefore, we only control the amount of overlap during the experiments. This task is achieved by fixing the covariance matrices, in other words the within-class variances, and changing the mean values of the individual classes, which varies the between-class variances.

Denoting v as an eigenvector and λ as the corresponding eigenvalue of a covariance matrix Σ, we have Σ v = λ v. Therefore, the central position of the sample distribution constructed by the datasets in a two-dimensional space is defined by the mean vector, and its propagation (spread) is defined by the eigenvectors and eigenvalues of the covariance matrix. In the datasets, the covariance matrices are held fixed and equal. Therefore, the eigenvalues represented on both axes are the same. As a result, the datasets are generated by a circular Gaussian function with fixed radius.

In this set of experiments, a variety of artificial datasets is generated in such a way that most of the samples are correctly labeled by at least one base-layer classifier. In other words, the feature spaces are generated so as to construct classifiers which are experts on specific classes. The performances of the base-layer classifiers are controlled by fixing the covariance matrices and changing the mean values of the Gaussian distributions which are used to generate the feature vectors. Thereby, we can analyze the relationship between the classification performance, the number of samples correctly labeled by at least one base-layer classifier, and the expertise of the base-layer classifiers.

In order to avoid misleading information in this gradual overlapping process, the feature vectors of the samples belonging to different classes are first generated apart from each other to assure linear separability in the initialization step. Then, the distances between the mean values of the distributions are gradually decreased. The ratio of decrease is selected as one tenth of the between-class variance of the distributions for each class pair. The algorithm terminates when the mean values of the overlapping class pair coincide.

At each epoch, only the mean value of the distribution of one of the classes approaches the mean value of that of another class, while the rest of the mean values are kept fixed. Defining J as the number of classifiers fed by different feature extractors and C as the number of classes, the data generation method is given in Algorithm 2.

input : The number of feature spaces J, the number of classes C, the mean value vectors and the within-class variances of the class conditional densities.
output : Training and test datasets.
foreach feature space F_j, j = 1, 2, ..., J do
       foreach class c = 1, 2, ..., C do
            1 Initialize the mean value of the class-conditional density of class c in F_j;
             foreach epoch do
                   repeat
                        2 Generate feature vectors using (12);
                        3 Store the generated feature vectors for the current epoch;
                        4 Decrease the distance between the mean values of the overlapping class pair by one tenth of their between-class variance;
                        
                   until the termination condition given in Section 7.1 is satisfied;
             end foreach
            
       end foreach
      
end foreach
5 Randomly split the feature vectors into two datasets, namely training and test datasets.
Algorithm 2 Artificial data generation algorithm.
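A simplified, single-feature-space version of Algorithm 2 is sketched below: the class covariance is kept fixed (a circular Gaussian), and at each epoch the mean of one class is moved toward the mean of another by one tenth of their initial separation; all numerical values are illustrative choices of ours.

import numpy as np

rng = np.random.default_rng(2)
n_per_class, sigma = 100, 1.0
mu = np.array([[0.0, 0.0], [8.0, 0.0]])     # initially well-separated class means
step = 0.1 * (mu[1] - mu[0])                # one tenth of the initial separation

epochs = []
for epoch in range(10):                     # the class overlap grows at every epoch
    X = np.vstack([rng.normal(mu[c], sigma, (n_per_class, 2)) for c in range(2)])
    y = np.repeat([0, 1], n_per_class)
    epochs.append((X, y))
    mu[1] = mu[1] - step                    # move the second mean toward the first

# Randomly split the data of the last epoch into training and test halves.
X, y = epochs[-1]
perm = rng.permutation(len(X))
train_idx, test_idx = perm[: len(X) // 2], perm[len(X) // 2:]
print("train/test sizes:", len(train_idx), len(test_idx))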

7.1.1 Performance Analysis on Artificial Datasets

In the first set of the experiments, seven base-layer classifiers are used. The number of samples belonging to each class is fixed, and two-dimensional feature spaces are fed to each base-layer classifier as input for the 12 classes. The feature sets are prepared with a fixed and equal covariance matrix

which is the covariance matrix of the class conditional distributions in each feature space F_j, j = 1, 2, ..., J. In other words, the covariance matrices of all classes are fixed and equal in all feature spaces.

The features are distributed with different mean values and converged towards each other using Algorithm 2. The matrix whose row vectors contain the mean values of the distribution of each class in each feature space is defined as

In order to analyze the relationship between the number of samples that are correctly classified by at least one of the base-layer classifiers and the classification performance of the FSG, the average number of samples that are correctly classified by at least one base-layer classifier is also given in the experimental results.

In each epoch, the features belonging to different classes are distributed with different topologies and different overlapping ratios in each feature space. For example, the feature vectors of the samples belonging to the ninth class are located apart from those of the rest of the classes in the seventh feature space, while they overlap with the other classes in the remaining feature spaces. In this way, the classification behaviors of the base-layer classifiers are controlled through the topological distributions of the features, and the classification performances are measured by the metrics given in Section 5.

In Table I, the performances of the individual classifiers and of the proposed algorithm are given for an instance of the dataset generated by Algorithm 2, where the dataset is constructed in such a way that each sample is correctly recognized by at least one of the base-layer classifiers. Although the average performances of the individual classifiers are between 53.4% and 65.3%, the classification performance of the FSG is 99.9%. In this case, the different classes are distributed at larger relative distances and with different overlapping ratios. The matrix used in the first experiment is

Class Classifier-1 Classifier-2 Classifier-3 Classifier-4 Classifier-5 Classifier-6 Classifier-7 FSG
Class-1 66.0 63.6 67.6 62.8 61.6 85.6 50.0 100.0
Class-2 67.2 60.8 49.6 50.8 98.4 38.4 36.8 100.0
Class-3 54.4 58.8 50.8 85.2 72.4 53.6 47.6 99.2
Class-4 66.8 64.0 96.8 66.4 61.6 22.8 37.6 100.0
Class-5 60.8 90.0 56.0 63.6 75.2 38.8 48.4 100.0
Class-6 91.6 57.2 69.6 54.0 66.0 43.6 73.6 100.0
Class-7 57.2 55.2 65.2 57.6 60.8 37.2 94.4 100.0
Class-8 78.4 75.6 86.0 69.2 54.4 61.6 97.6 100.0
Class-9 40.8 41.2 36.0 36.0 32.8 26.0 99.6 100.0
Class-10 44.0 32.4 32.0 38.0 37.6 43.2 95.6 100.0
Class-11 32.0 35.2 33.6 40.0 39.6 92.8 38.8 99.6
Class-12 37.6 39.6 34.4 52.0 44.4 97.2 63.6 99.6
Average Performance (%) 58.0 56.1 56.5 56.3 58.7 53.4 65.3 99.9
TABLE I: Comparison of the classification performances (%) of the base-layer classifiers with respect to the classes (Class-ClassID) and the performance of the FSG, when each sample is correctly classified by at least one base-layer classifier.

In Table II, the performance results of the algorithms at another epoch of the experiments are given. In this experiment, only a fraction of the samples is correctly classified by at least one of the base-layer classifiers. The corresponding mean value matrix of each class at each feature space is