A New Fuzzy Stacked Generalization Technique and Analysis of its Performance
Abstract
In this study, a new Stacked Generalization technique called Fuzzy Stacked Generalization (FSG) is proposed to minimize the difference between the $N$-sample and large-sample classification errors of the Nearest Neighbor classifier. The proposed FSG employs a new hierarchical distance learning strategy to minimize this error difference. For this purpose, we first construct an ensemble of base-layer fuzzy $k$-Nearest Neighbor ($k$-NN) classifiers, each of which receives a different feature set extracted from the same sample set. The fuzzy membership values computed at the decision space of each fuzzy $k$-NN classifier are concatenated to form the feature vectors of a fusion space. Finally, the feature vectors are fed to a meta-layer classifier to learn the degree of accuracy of the decisions of the base-layer classifiers for meta-layer classification.
The proposed method is examined on both artificial and real-world benchmark datasets. Experimental results obtained using artificial datasets show that the classification performance of the FSG depends on how the individual classifiers share the feature vectors of the samples. Rather than the power of the individual base-layer classifiers, the diversity and cooperation of the classifiers become the important issues for improving the overall performance of the proposed FSG. A weak base-layer classifier may boost the overall performance more than a strong classifier if it is capable of recognizing, in its own feature space, the samples that are not recognized by the rest of the classifiers. The experiments explore the type of collaboration among the individual classifiers required for an improved performance of the suggested architecture. Experiments on multiple-feature real-world datasets show that the proposed FSG performs better than state-of-the-art ensemble learning algorithms such as Adaboost, Random Subspace and Rotation Forest. On the other hand, comparable performances are observed in the experiments on single-feature multi-attribute datasets.
1 Introduction
The Stacked Generalization algorithm, proposed by Wolpert [1] and used by many others [2, 3, 4, 5, 6, 7], is a widely used ensemble learning technique. The basic idea is to combine several classifiers in a variety of ways so that the performance of the Stacked Generalization (SG) is higher than that of the individual classifiers that constitute the ensemble. Although gathering the classifiers under the Stacked Generalization algorithm significantly boosts the performance in some application domains, the performance of the overall system may become worse than that of the individual classifiers in other cases. Wolpert defines the problem of describing the relation between the performance and the various parameters of the algorithm as a black art problem [1, 7].
In this study, we suggest a Fuzzy Stacked Generalization (FSG) technique and resolve the black art problem [1] for the minimization of the classification error of the nearest neighbor rule. The proposed technique aggregates the independent decisions of the fuzzy base-layer nearest neighbor classifiers by concatenating the membership values of each sample for each class under the same vector space, called the decision space. A meta-layer fuzzy classifier is then trained to learn the degree of correctness of the base-layer classifiers.
There are three major contributions of this study:

We propose a novel hierarchical distance learning approach to minimize the difference between the $N$-sample and large-sample classification errors of the nearest neighbor rule using the FSG. The proposed approach enables us to define a “distance learning in feature space problem” as a “decision space design problem”, which is resolved using an ensemble of nearest neighbor classifiers.

The proposed FSG algorithm enables us to extract information from different feature spaces using expert base-layer classifiers. The expertise of a base-layer classifier on a feature space is analyzed using the class membership vectors that reside in the decision space of the classifier. Therefore, the expertise of the base-layer classifiers is used for designing the distance functions of the nearest neighbor classifiers. In addition, the fusion space of the meta-layer classifier is constructed by aggregating the decision spaces of the base-layer classifiers. Therefore, the dimension of the feature vectors in the fusion space is fixed to $JC$, where $J$ and $C$ are the number of classifiers and classes, respectively. Then, the computational complexity of the meta-layer classifier is governed by the number of samples $N$ in the training dataset rather than by the dimensions of the base-layer feature spaces.

We make a thorough empirical analysis of the black art problem of the suggested FSG. The empirical results show the effect on the classification performance of the FSG of the samples which cannot be correctly classified by any of the base-layer classifiers. It is observed that if the base-layer classifiers share all the samples in the training set so as to correctly classify them, then the performance of the overall FSG becomes higher than that of the individual base-layer classifiers. On the other hand, if a sample is misclassified by all of the base-layer classifiers, then this sample degrades the performance of the overall FSG.
The suggested Fuzzy Stacked Generalization algorithm is tested on artificial and real datasets by comparing it with state-of-the-art ensemble learning algorithms such as Adaboost [8], Random Subspace [9] and Rotation Forest [10].
In the next section, a literature review of SG architectures and of distance learning methods which minimize the generalization error of the nearest neighbor rule is given. The difference between the $N$-sample and large-sample errors of the nearest neighbor rule is defined in Section 3. A distance learning approach which minimizes this error difference is given in Section 4. The employment of the distance learning approach in the hierarchical FSG technique and its algorithmic description are given in Section 5. Section 6 addresses the conceptual and algorithmic properties of the FSG. Experimental analyses are given in Section 7. Section 8 summarizes and concludes the paper.
2 Related Works and Motivation
Various Stacked Generalization architectures have been proposed in the literature [1, 2, 3, 4, 5, 6, 11, 12, 13, 14, 15, 16, 17]. Most of them aggregate the decisions of the base-layer classifiers by using vector concatenation [1, 2, 3, 4, 5, 6, 11, 12, 13, 14, 16, 17, 18, 19, 20] or majority voting [15] techniques at the meta-layer.
Ueda [2] applies the vector concatenation operation to the feature vectors at the output feature spaces of the base-layer classifiers (which are called decision spaces in our work) and considers these operations as linear decision combination methods. Then, he experimentally compares linear combination and voting methods in an SG where Neural Networks are implemented as the base-layer classifiers. Following the same formulation, Sen and Erdogan [3] analyze various weighted and sparse linear combination methods by combining the decisions of heterogeneous base-layer classifiers such as decision trees and $k$-NN. Rooney et al. [4] employ homogeneous and heterogeneous classifier ensembles for stacked regression using linear combination rules. Zenko et al. [5] compare the classification performances of SG algorithms which employ linear combination rules with other combination methods (e.g. voting) and ensemble learning algorithms (e.g. Bagging and Adaboost). Akbas and Yarman Vural [14] employ an SG algorithm using fuzzy $k$-NN classifiers for image annotation. Sigletos et al. [15] compare the classification performances of several SG algorithms which combine nominal decisions (i.e., crisp decision values such as class labels) or probabilistic decisions (i.e., estimations of probability distributions). Ozay and Yarman Vural [13, 16] compare the classification performances of homogeneous base-layer fuzzy $k$-NN classifiers and a linear meta-layer classifier using heterogeneous SG architectures.
In most of the experimental results given in the aforementioned studies, linear decision combination or aggregation methods provide comparable or better performances than the other combination methods. However, the performance evaluations of the stacked generalization methods reported in the literature are not consistent with each other. This fact is demonstrated by Dzeroski and Zenko in [17], where they employ heterogeneous base-layer classifiers in their stacked generalization architecture. They report that their results contradict the observations of the studies in the literature on SG. The contradictory results can be attributed to the many nonlinear relations among the parameters of the SG, such as the number and structure of the base-layer and meta-layer classifiers, and their feature, decision and fusion spaces.
Selection of the parameters of the SG, and the design of the classifiers and feature spaces, have been considered a black art by Wolpert [1] and by Ting and Witten [7]. For instance, popular classifiers such as $k$-NN, Neural Networks and Naïve Bayes can be used as the base-layer classifiers in SG to obtain nominal decisions. However, there are crucial differences among these classifiers in terms of processing the feature vectors. Firstly, $k$-NN and Neural Networks are nonparametric classifiers, whereas Naïve Bayes is a parametric one. Secondly, $k$-NN is a local classifier which employs the neighborhood information of the features, whereas Neural Networks compute a global linear decision function and Naïve Bayes computes the overall statistical properties of the datasets. Therefore, tracing the feature mappings from the base-layer input feature spaces to the meta-layer input feature spaces (i.e. fusion spaces) in SG becomes an intractable and uncontrollable problem. Additionally, the outputs of heterogeneous classifiers provide different types of information about the decisions of the classifiers, such as crisp, fuzzy or probabilistic class labels.
The employment of fuzzy decisions in ensemble learning algorithms is analyzed in [6, 21, 22]. Tan et al. [6] use fuzzy $k$-NN algorithms as base-layer classifiers, and employ a linearly weighted voting method to combine the fuzzy decisions for Face Recognition. Cho and Kim [21] combine the decisions of Neural Networks, which are implemented as the base-layer classifiers, using a fuzzy combination rule called the fuzzy integral. Kuncheva [22] experimentally compares various fuzzy and crisp combination methods, including the fuzzy integral and voting, to boost the classifier performances in Adaboost. In their experimental results, the classification algorithms that implement fuzzy rules outperform the algorithms that implement crisp rules. However, the effect of employing fuzzy rules on the classification performance of SG is left as an open problem.
In this study, most of the above-mentioned intractable problems are avoided by employing a homogeneous architecture which consists of the same type of base-layer and meta-layer classifiers in a new stacked generalization architecture called Fuzzy Stacked Generalization (FSG). This architecture allows us to concatenate the output decision spaces of the base-layer classifiers, which represent consistent information about the samples. Furthermore, we model the linear combination or feature space aggregation method as a feature space mapping from the base-layer output feature space (i.e. decision space) to the meta-layer input feature space (i.e. fusion space). In our proposed FSG, the classification rules of the base-layer classifiers are considered as feature mappings from the classifier input feature spaces to the output decision spaces. In order to control these mappings for tracing the transformations of the feature vectors of the samples through the layers of the architecture, homogeneous fuzzy $k$-NN classifiers are used, and the behavior of fuzzy decision rules is investigated in both the base-layer and the meta-layer. Moreover, the employment of fuzzy $k$-NN classifiers enables us to obtain information about the uncertainty of the classifier decisions and the belongingness of the samples to the classes [23, 24].
We analyze the classification error of a nearest neighbor classifier in two parts, namely (i) the $N$-sample error, which is the error of a classifier employed on a training dataset of $N$ samples, and (ii) the large-sample error, which is the error of a classifier employed on a training dataset with a large number of samples, i.e. as $N \to \infty$. A distance learning approach proposed by Short and Fukunaga [25] is used in a hierarchical FSG architecture from a Decision Fusion perspective for the minimization of the difference between the $N$-sample and large-sample errors. In the literature, distance learning methods have been employed using prototype [26, 27, 28, 29] and feature selection [30] or weighting [31] methods by computing the weights associated with the samples and the feature vectors, respectively. The computed weights are used to linearly transform the feature spaces of the classifiers into more discriminative feature spaces [32, 33, 34] in order to decrease the large-sample classification error of the classifiers [35]. A detailed literature review of prototype selection and distance learning methods for nearest neighbor classification is given in [29].
There are three main differences between our proposed hierarchical distance learning method and the methods introduced in the literature [26, 27, 28, 29, 30, 31, 32, 33, 34, 35]:

We employ a generative feature space mapping by computing the class posterior probabilities of the samples in the decision spaces, and use the posterior probability vectors as feature vectors in the fusion space. On the other hand, the methods given in the literature [26, 27, 28, 29, 30, 31, 32, 33, 34, 35] use discriminative approaches by merely transforming the input feature spaces into more discriminative input feature spaces.

The aforementioned methods, including the method of Short and Fukunaga [25], employ distance learning in a single classifier. On the other hand, we employ a hierarchical ensemble learning approach for distance learning. Therefore, different feature space mappings can be employed by different classifiers in the ensemble, which gives us more control over the feature space transformations than a single feature transformation in a single classifier.
In Section 3, we define the problem of minimizing the difference between the $N$-sample and large-sample errors in a single classifier. Then, we introduce the distance learning approach for an ensemble of classifiers, considering the distance learning problem as a decision space design problem, in Section 4. The employment of the proposed hierarchical distance learning approach in the FSG and its algorithmic description are given in Section 5. We discuss the expertise of the base-layer classifiers, the dimensionality problems of the feature spaces in the FSG, and its computational complexity in Section 6. In order to compare the proposed FSG with state-of-the-art ensemble learning algorithms, we have implemented Adaboost, Random Subspace and Rotation Forest in the experimental analysis in Section 7. Moreover, we have used the same multi-attribute benchmark datasets with the same data splitting given in [26, 27] to compare the performance of the proposed hierarchical distance learning approach with that of the aforementioned distance learning methods. Since the classification performances of these distance learning methods are analyzed in [26, 27] in detail, we do not reproduce those results in Section 7 and refer the reader to [26, 27].
3 $N$-Sample and Large-Sample Classification Errors of $k$-NN
Suppose that a training dataset $S = \{(\mathbf{x}_i, y_i)\}_{i=1}^{N}$ of $N$ samples, where $y_i$ is the label of a sample $s_i$, is given. A sample $s_i$ is represented in a feature space $\mathcal{F}$ by a feature vector $\mathbf{x}_i$.

Let $\{p(\mathbf{x} \mid \omega_c)\}_{c=1}^{C}$ be a set of probability densities at a feature vector $\mathbf{x}$ of a sample $s$, such that $\mathbf{x}$ is observed for a given class label $\omega_c$ according to the density $p(\mathbf{x} \mid \omega_c)$. Therefore, $p(\mathbf{x} \mid \omega_c)$ is called the likelihood of observing $\mathbf{x}$ for a given $\omega_c$. A set of functions $\{P(\omega_c)\}_{c=1}^{C}$ is called the set of prior probabilities of the class labels, such that $0 \le P(\omega_c) \le 1$ and $\sum_{c=1}^{C} P(\omega_c) = 1$. Then, the posterior probability of assigning the sample to a class $\omega_c$ in $\Omega = \{\omega_1, \ldots, \omega_C\}$ is computed using the Bayes Theorem [36] as
$$P(\omega_c \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid \omega_c)\, P(\omega_c)}{\sum_{c'=1}^{C} p(\mathbf{x} \mid \omega_{c'})\, P(\omega_{c'})}.$$

The Bayes classification rule estimates the class label $\hat{y}$ of $s$ as [36]
$$\hat{y} = \operatorname*{argmax}_{\omega_c \in \Omega} P(\omega_c \mid \mathbf{x}).$$

If a unit loss occurs when a sample is assigned to a wrong class, then the classification error of the Bayes classifier employed on $\mathbf{x}$ is defined as [37]
$$e^*(\mathbf{x}) = 1 - \max_{\omega_c \in \Omega} P(\omega_c \mid \mathbf{x}),$$
and the expected error is defined as [37]
$$e^* = E[e^*(\mathbf{x})],$$
where the expectation is taken over the density of the feature vectors in $\mathcal{F}$.
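The Bayes rule above can be sketched in a few lines of code. The following is a minimal numpy illustration with hypothetical one-dimensional Gaussian likelihoods; the means, variances and priors are made up for the example and are not from the paper:

```python
import numpy as np

def gaussian_pdf(x, mean, var):
    """1-D Gaussian likelihood p(x | class)."""
    return np.exp(-(x - mean) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def bayes_posteriors(x, means, variances, priors):
    """Posterior P(w_c | x) for each class c via the Bayes theorem."""
    likelihoods = np.array([gaussian_pdf(x, m, v) for m, v in zip(means, variances)])
    joint = likelihoods * np.asarray(priors)   # p(x | w_c) P(w_c)
    return joint / joint.sum()                 # normalize by the evidence p(x)

def bayes_rule(x, means, variances, priors):
    """Bayes classification rule: argmax_c P(w_c | x)."""
    return int(np.argmax(bayes_posteriors(x, means, variances, priors)))

# x = 0.4 lies closer to the class-0 mean, so class 0 is chosen.
label = bayes_rule(0.4, means=[0.0, 2.0], variances=[1.0, 1.0], priors=[0.5, 0.5])
```

With equal priors and variances, the rule reduces to picking the class whose mean is closest to the observation.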
In this work, we focus on the minimization of the classification error of the well-known $k$-Nearest Neighbor ($k$-NN) classifier [36]. Given a new test sample $s'$ with feature vector $\mathbf{x}'$, let $\mathcal{N}_k(\mathbf{x}')$ be the set of $k$ nearest neighbors of $\mathbf{x}'$ among the training feature vectors, ordered by their distance to $\mathbf{x}'$.

The nearest neighbor rule (i.e. $k = 1$) simply estimates $y'$, which is the label of $s'$, as the label of the nearest neighbor of $\mathbf{x}'$. In the $k$-nearest neighbor rule (i.e. $k$-NN), $y'$ is estimated as
$$\hat{y}' = \operatorname*{argmax}_{\omega_c \in \Omega} k_c,$$
where $k_c$ is the number of samples in $\mathcal{N}_k(\mathbf{x}')$ which belong to $\omega_c$.
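The majority-vote rule above can be sketched as follows; the toy 2-D training set is made up for illustration:

```python
import numpy as np
from collections import Counter

def knn_predict(x_query, X_train, y_train, k=3):
    """k-NN rule: label the query with the majority class among its k nearest neighbors."""
    dists = np.linalg.norm(X_train - x_query, axis=1)  # Euclidean distances to all samples
    nn_idx = np.argsort(dists)[:k]                     # indices of the k nearest neighbors
    counts = Counter(y_train[i] for i in nn_idx)       # k_c: number of neighbors per class
    return counts.most_common(1)[0][0]                 # argmax_c k_c

# Toy training set: class 0 clustered near the origin, class 1 near (3, 3).
X = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3], [3.0, 3.0], [2.9, 3.1], [3.2, 2.8]])
y = np.array([0, 0, 0, 1, 1, 1])
pred = knn_predict(np.array([0.1, 0.1]), X, y, k=3)  # all 3 neighbors are class 0
```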
Then, the probability of error of the nearest neighbor rule computed using $N$ training samples is
$$e_N = E\left[1 - \sum_{c=1}^{C} P(\omega_c \mid \mathbf{x}')\, P(\omega_c \mid \mathbf{x}'_{NN})\right], \qquad (1)$$
where $P(\omega_c \mid \mathbf{x}')$ and $P(\omega_c \mid \mathbf{x}'_{NN})$ represent the posterior probabilities at the test feature vector $\mathbf{x}'$ and at its nearest neighbor $\mathbf{x}'_{NN}$ [36].

In the asymptotic case of a large number of training samples, if $P(\omega_c \mid \cdot)$ is not singular, i.e. is continuous at $\mathbf{x}'$, then the nearest neighbor converges to $\mathbf{x}'$ and the large-sample error is computed as
$$e_\infty = E\left[1 - \sum_{c=1}^{C} P^2(\omega_c \mid \mathbf{x}')\right]. \qquad (2)$$

It is well known that there is an elegant relationship between the classification errors of the Bayes classifier and the nearest neighbor rule, where $e^*$ is the Bayes error [37]:
$$e^* \le e_\infty \le e^*\left(2 - \frac{C}{C-1}\, e^*\right).$$

Then, the difference between the $N$-sample error (1) and the large-sample error (2) is computed as
$$e_N - e_\infty = E\left[\sum_{c=1}^{C} P(\omega_c \mid \mathbf{x}')\left(P(\omega_c \mid \mathbf{x}') - P(\omega_c \mid \mathbf{x}'_{NN})\right)\right]. \qquad (3)$$
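Pointwise, the difference between the integrands of (1) and (2) equals the integrand of (3) by simple algebra. A quick numerical sanity check with randomly generated posterior vectors (the vectors themselves are arbitrary, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
C = 4  # number of classes in this toy check

# Random posterior vectors for a test point x' and for its nearest neighbor x_NN.
p_x = rng.random(C); p_x /= p_x.sum()
p_nn = rng.random(C); p_nn /= p_nn.sum()

e_N   = 1.0 - np.dot(p_x, p_nn)   # integrand of (1)
e_inf = 1.0 - np.dot(p_x, p_x)    # integrand of (2)
diff  = np.dot(p_x, p_x - p_nn)   # integrand of (3)

# (1 - p.p_nn) - (1 - p.p) = p.(p - p_nn), for any pair of posterior vectors.
assert np.isclose(e_N - e_inf, diff)
```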
The main goal of this paper is to minimize the difference between $e_N$ and $e_\infty$ in (3) by employing a distance learning approach suggested by Short and Fukunaga [25] using the Fuzzy Stacked Generalization. The distance learning approach of Short and Fukunaga and its employment in a hierarchical distance learning strategy are given in Section 4. This strategy has been used for modeling the Fuzzy Stacked Generalization, whose algorithmic definition is given in Section 5.
4 Minimization of the $N$-Sample and Large-Sample Classification Error Difference using Hierarchical Distance Learning in the FSG
Let us start by defining
$$\gamma_c(\mathbf{x}) = P(\omega_c \mid \mathbf{x}') - P(\omega_c \mid \mathbf{x}), \quad \forall \omega_c \in \Omega,$$
and an error function
$$\varepsilon(\mathbf{x}', \mathbf{x}'_{NN}) = \sum_{c=1}^{C} P(\omega_c \mid \mathbf{x}')\, \gamma_c(\mathbf{x}'_{NN})$$
for a fixed test sample $s'$. Then, the minimization of the expected value of the error difference in (3), $E[e_N - e_\infty]$, is equivalent to the minimization of the expected value of the error function [25]
$$E[\varepsilon(\mathbf{x}', \mathbf{x}'_{NN})], \qquad (4)$$
where the expectation is computed over the $N$ training samples.

Short and Fukunaga [25] notice that (4) can be minimized either by increasing $N$ or by designing a distance function which minimizes (4) in the classifiers. In a $C$-class classification problem, a proper distance function is computed as [25]
$$d(\mathbf{x}', \mathbf{x}) = \left\| \boldsymbol{\mu}(\mathbf{x}') - \boldsymbol{\mu}(\mathbf{x}) \right\|_2^2, \qquad (5)$$
where
$$\boldsymbol{\mu}(\mathbf{x}) = \big(P(\omega_1 \mid \mathbf{x}), \ldots, P(\omega_C \mid \mathbf{x})\big)$$
is the vector of posterior probabilities, and $\| \cdot \|_2^2$ is the squared $\ell_2$ norm, i.e. the squared Euclidean distance.
In a single classifier, (5) is computed in the feature space $\mathcal{F}$, using local approximations to the posterior probabilities obtained from the training and test datasets [25]. Moreover, if the $N$-sample error $e_{N,j}$ is minimized on each different feature space $\mathcal{F}_j$, $j = 1, \ldots, J$, then an average error over an ensemble of $J$ classifiers, which is defined as
$$\bar{e}_N = \frac{1}{J} \sum_{j=1}^{J} e_{N,j}, \qquad (6)$$
is minimized by using
$$d(\mathbf{x}', \mathbf{x}) = \sum_{j=1}^{J} \left\| \boldsymbol{\mu}_j(\mathbf{x}') - \boldsymbol{\mu}_j(\mathbf{x}) \right\|_2^2, \qquad (7)$$
where $\boldsymbol{\mu}_j(\mathbf{x})$ is the vector of posterior probabilities estimated in $\mathcal{F}_j$.

In this study, an approach to minimize (6) using (7) is employed in a hierarchical decision fusion algorithm. For this purpose, the posterior probabilities are first estimated using $J$ individual $k$-NN classifiers, which are called base-layer classifiers. Then the vectors of probability estimates, $\boldsymbol{\mu}_j(\mathbf{x}_i)$ and $\boldsymbol{\mu}_j(\mathbf{x}')$, are concatenated to construct
$$\boldsymbol{\mu}(\mathbf{x}_i) = \big(\boldsymbol{\mu}_1(\mathbf{x}_i), \ldots, \boldsymbol{\mu}_J(\mathbf{x}_i)\big)$$
and
$$\boldsymbol{\mu}(\mathbf{x}') = \big(\boldsymbol{\mu}_1(\mathbf{x}'), \ldots, \boldsymbol{\mu}_J(\mathbf{x}')\big)$$
for all training and test samples. Finally, classification is performed on $\boldsymbol{\mu}(\mathbf{x}_i)$ and $\boldsymbol{\mu}(\mathbf{x}')$ by a $k$-NN classifier, called the meta-layer classifier, with the distance function
$$d(\mathbf{x}', \mathbf{x}_i) = \left\| \boldsymbol{\mu}(\mathbf{x}') - \boldsymbol{\mu}(\mathbf{x}_i) \right\|_2^2 = \sum_{j=1}^{J} \left\| \boldsymbol{\mu}_j(\mathbf{x}') - \boldsymbol{\mu}_j(\mathbf{x}_i) \right\|_2^2. \qquad (8)$$

Note that (8) can be used for the minimization of the error difference in each feature space $\mathcal{F}_j$. If $J = 1$, then (8) is equal to (5). Therefore, the distance learning problem proposed by Short and Fukunaga [25] is reformulated as a decision fusion problem. The distance learning approach is then employed using a hierarchical decision fusion algorithm called Fuzzy Stacked Generalization (FSG), as described in the next section.
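The key identity in (8), that the squared Euclidean distance between concatenated membership vectors equals the sum of per-classifier squared distances as in (7), can be checked numerically. The membership vectors below are random Dirichlet draws chosen only for the check; $J$ and $C$ are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(1)
J, C = 3, 5  # hypothetical: 3 base-layer classifiers, 5 classes

# Per-classifier posterior (membership) vectors for a test sample and a training sample.
mu_test  = [rng.dirichlet(np.ones(C)) for _ in range(J)]
mu_train = [rng.dirichlet(np.ones(C)) for _ in range(J)]

# Right-hand side of (8): sum of per-classifier squared distances, as in (7).
d_sum = sum(np.sum((a - b) ** 2) for a, b in zip(mu_test, mu_train))

# Left-hand side of (8): squared Euclidean distance between the concatenated vectors.
d_concat = np.sum((np.concatenate(mu_test) - np.concatenate(mu_train)) ** 2)

assert np.isclose(d_sum, d_concat)
```

This is why running a plain Euclidean $k$-NN in the fusion space implements the ensemble distance (7) without any extra machinery.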
5 Fuzzy Stacked Generalization
Given a training dataset $S$, each sample $s_i$ is represented in $J$ different feature spaces $\mathcal{F}_j$, $j = 1, \ldots, J$, by a feature vector $\mathbf{x}_{i,j}$ which is extracted using the feature extractor $FE_j$. Therefore, the training datasets of the $J$ base-layer classifiers employed on the feature spaces can be represented by $J$ different feature sets, $S_j = \{(\mathbf{x}_{i,j}, y_i)\}_{i=1}^{N}$.

At the base layer, each feature vector extracted from the same sample is fed into an individual fuzzy $k$-NN classifier in order to estimate the posterior probabilities using the class memberships as
$$\mu_{c,j}(\mathbf{x}_{i,j}) = \frac{\sum_{n=1}^{k} \delta(y_n, \omega_c)\, \left\| \mathbf{x}_{i,j} - \mathbf{x}_{n,j} \right\|^{-\frac{2}{m-1}}}{\sum_{n=1}^{k} \left\| \mathbf{x}_{i,j} - \mathbf{x}_{n,j} \right\|^{-\frac{2}{m-1}}}, \qquad (9)$$
where the sums are taken over the $k$ nearest neighbors of $\mathbf{x}_{i,j}$, $y_n$ is the label of the $n$-th nearest neighbor $\mathbf{x}_{n,j}$, $\delta(y_n, \omega_c)$ is $1$ if $y_n = \omega_c$ and $0$ otherwise, and $m > 1$ is the fuzzification parameter [38]. Each base-layer fuzzy $k$-NN classifier is trained, and the membership vector of each sample is computed, using leave-one-out cross validation. For this purpose, (9) is employed for each $s_i$ using the validation set $S_j \setminus \{(\mathbf{x}_{i,j}, y_i)\}$, i.e. the training set with the $i$-th sample left out.
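The inverse-distance weighting in (9) can be sketched as follows. This is a minimal numpy version of the fuzzy $k$-NN membership computation with crisp training labels (following Keller et al. [38]); the small epsilon guard against zero distances and the toy data are our additions:

```python
import numpy as np

def fuzzy_knn_memberships(x_query, X_train, y_train, n_classes, k=3, m=2.0):
    """Fuzzy k-NN class memberships with crisp training labels.

    Each of the k nearest neighbors votes for its own class with weight
    ||x - x_n||^(-2/(m-1)); the memberships are normalized to sum to 1.
    """
    dists = np.linalg.norm(X_train - x_query, axis=1)
    nn = np.argsort(dists)[:k]                    # k nearest neighbors
    eps = 1e-12                                   # guard against a zero distance
    w = (dists[nn] + eps) ** (-2.0 / (m - 1.0))   # inverse-distance weights
    mu = np.zeros(n_classes)
    for weight, label in zip(w, y_train[nn]):
        mu[label] += weight                       # delta(y_n, w_c) selects the class bin
    return mu / mu.sum()

X = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [1.1, 0.9]])
y = np.array([0, 0, 1, 1])
mu = fuzzy_knn_memberships(np.array([0.05, 0.0]), X, y, n_classes=2, k=3)
# mu[0] > mu[1]: the query is much closer to the class-0 samples.
```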
The class label of an unknown sample is estimated by the base-layer classifier employed on $\mathcal{F}_j$ as
$$\hat{y}_j = \operatorname*{argmax}_{\omega_c \in \Omega} \mu_{c,j}(\mathbf{x}).$$

The training performance of the $j$-th base-layer classifier is computed as
$$Perf_j = \frac{1}{N} \sum_{i=1}^{N} \delta(\hat{y}_{i,j}, y_i), \qquad (10)$$
where
$$\delta(\hat{y}_{i,j}, y_i) = \begin{cases} 1, & \hat{y}_{i,j} = y_i, \\ 0, & \text{otherwise}, \end{cases} \qquad (11)$$
is the Kronecker delta, which takes the value $1$ when the base-layer classifier correctly classifies a sample, i.e. when $\hat{y}_{i,j} = y_i$.
When a set of test samples is received, the feature vectors of the samples are extracted by each $FE_j$. Then, the posterior probabilities of each test sample are estimated from the training datasets by each base-layer fuzzy $k$-NN classifier on each $\mathcal{F}_j$, $j = 1, \ldots, J$.

If a set of labels of the test samples is available, then the test performance is computed analogously to (10) over the test samples.

The output space of each base-layer classifier is spanned by the class membership vectors of the samples. It should be noted that the class membership vectors satisfy
$$\sum_{c=1}^{C} \mu_{c,j}(\mathbf{x}) = 1.$$
This equation aligns each sample on the surface of a simplex at the output space of a base-layer classifier, which is called the Decision Space of that classifier. Therefore, the base-layer classifiers can be considered as transformations which map an input feature space of any dimension to a point on a simplex in a $C$-dimensional (number of classes) decision space (for $C = 2$, the simplex reduces to a line segment).
The class membership vectors obtained at the output of each classifier are concatenated to construct a feature space called the Fusion Space for the meta-layer classifier. The fusion space consists of the $JC$-dimensional feature vectors $\boldsymbol{\mu}(\mathbf{x}_i)$ and $\boldsymbol{\mu}(\mathbf{x}')$, which form the training and test datasets for the meta-layer classifier. Note that the entries of each fusion-space vector sum to $J$, since each of the $J$ concatenated membership vectors sums to $1$.

Finally, a meta-layer fuzzy $k$-NN classifier is employed to classify the test samples using their feature vectors in the fusion space together with the feature vectors of the training samples. The meta-layer training and test performances are computed analogously to (10) over the training and test samples, respectively. An algorithmic description of the FSG is given in Algorithm 1.
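The whole pipeline described above can be sketched compactly. This is an illustrative Python re-implementation, not the paper's Matlab/C++ code; the toy two-feature-space Gaussian dataset and all parameter values are made up for the example:

```python
import numpy as np

def memberships(x_q, X, y, n_classes, k=3, m=2.0):
    """Fuzzy k-NN membership vector of a query (inverse-distance weighted, as in (9))."""
    d = np.linalg.norm(X - x_q, axis=1)
    nn = np.argsort(d)[:k]
    w = (d[nn] + 1e-12) ** (-2.0 / (m - 1.0))
    mu = np.zeros(n_classes)
    for wt, lab in zip(w, y[nn]):
        mu[lab] += wt
    return mu / mu.sum()

def fsg_predict(feature_sets_train, y_train, feature_sets_test, n_classes, k=3):
    """FSG sketch: feature_sets_* are lists of J arrays, one per feature extractor FE_j
    (the same samples represented in J different feature spaces)."""
    N = len(y_train)
    Z_train, Z_test = [], []
    for Xtr, Xte in zip(feature_sets_train, feature_sets_test):
        # Base layer: leave-one-out memberships for training samples,
        # plain memberships for test samples, in each feature space F_j.
        loo = np.array([memberships(Xtr[i], np.delete(Xtr, i, axis=0),
                                    np.delete(y_train, i), n_classes, k)
                        for i in range(N)])
        Z_train.append(loo)
        Z_test.append(np.array([memberships(x, Xtr, y_train, n_classes, k)
                                for x in Xte]))
    # Fusion space: concatenate the J decision spaces (dimension J * C).
    Z_train, Z_test = np.hstack(Z_train), np.hstack(Z_test)
    # Meta layer: fuzzy k-NN on the fusion-space vectors; the final label is the argmax.
    return np.array([np.argmax(memberships(z, Z_train, y_train, n_classes, k))
                     for z in Z_test])

# Toy problem: two feature spaces, two well-separated Gaussian classes.
rng = np.random.default_rng(0)
X1 = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(2, 0.3, (20, 2))])
X2 = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(2, 0.3, (20, 2))])
y = np.array([0] * 20 + [1] * 20)
preds = fsg_predict([X1, X2], y, [np.array([[0.0, 0.0], [2.0, 2.0]])] * 2, n_classes=2)
```

On this clean toy problem the two test points land squarely in the class-0 and class-1 regions of both feature spaces, so the meta-layer reproduces the obvious labels.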
The proposed algorithm is analyzed on artificial and benchmark datasets in Section 7. A discussion of the properties of the FSG is given in the next section.
6 Remarks On The Performance Of Fuzzy Stacked Generalization
In this section, we discuss the error minimization properties of the FSG, and the relationships between the performance of the FSG and various learning parameters.
6.1 Expertise of the Base-Layer Classifiers, Feature Space Dimensionality Problem and Performance of the FSG
Employing distinct feature extractors for each classifier enables us to split the various attributes of the feature spaces coherently. Therefore, each base-layer classifier gains an expertise to learn a specific property of a sample, and correctly classifies a group of samples belonging to a certain class in the training data. This approach assures the diversity of the classifiers, as suggested by Kuncheva [39], and enables the classifiers to collaborate for learning the classes or groups of samples. It also allows us to optimize the parameters of each individual base-layer classifier independently of the others.
Formation of the fusion space by concatenating the decision vectors at the output of the base-layer classifiers helps us to learn the behavior of each individual classifier in recognizing a certain feature of the sample, which may result in a substantial improvement in the performance at the meta-layer. However, this postponed concatenation technique increases the dimension of the feature vector to $JC$. If one deals with a classification problem with a high number of classes, which may also require a high number of base-layer classifiers and a large number of samples for high performance, the dimension of the feature space at the meta-layer becomes large, again causing the curse of dimensionality. An analysis showing the decrease in performance as the numbers of classes and classifiers increase is provided in [13]. More detailed experimental results on the change of the classification performance as the number of feature spaces increases are given by comparing the FSG with state-of-the-art ensemble learning algorithms on benchmark datasets in Section 7.
Since several parameters, such as the number of classes, the number of feature extractors, and the means and variances of the distributions of the feature vectors, affect the performance of classifier ensembles, there is no generalized model that defines the behavior of the performance with respect to these parameters. However, it is desirable to define a framework which ensures an increase in the performance of the FSG compared to the performance of the individual classifiers.
In addition, the design of the feature spaces of the individual base-layer classifiers, the size of the training set, the number of classes and the relationships among all of these parameters affect the performance. A popular approach to designing the feature space of a single classifier is to extract all of the relevant features from each sample and aggregate them under the same vector. Unfortunately, this approach creates the well-known curse of dimensionality problem. On the other hand, reducing the dimension by methods such as principal component analysis, normalization and feature selection may cause a loss of information. Therefore, one needs to find a balance between the curse of dimensionality and information deficiency in designing the feature space.
The suggested FSG architecture establishes this balance by designing independent base-layer fuzzy $k$-NN classifiers, each of which receives relatively low dimensional feature vectors compared to the concatenated feature vectors of the single classifier approach. This approach also avoids the normalization required after the concatenation operation. Note that the dimension of the decision space is independent of the dimensions of the feature spaces of the base-layer classifiers. Therefore, no matter how high the dimension of the individual feature vectors at the base layer is, this architecture fixes the dimension at the meta-layer to $JC$ (number of classes $\times$ number of feature extractors). This may be considered a partial solution to the curse of dimensionality problem, provided that $JC$ is bounded by a value that assures statistical stability.
6.2 Computational Complexity of the FSG
In the analysis of the computational complexity of the proposed FSG algorithm, the computational complexities of the feature extraction algorithms are ignored, assuming that the feature sets are already computed and given.

The computational complexity of the Fuzzy Stacked Generalization algorithm is dominated by the number of samples. The computational complexity of a brute-force base-layer $k$-NN classifier is $O(N^2 D_j)$, $j = 1, \ldots, J$, where $D_j$ is the dimension of $\mathcal{F}_j$. If each base-layer classifier is implemented by an individual processor in parallel, then the computational complexity of the base-layer classification process is $O(N^2 D_{max})$, where $D_{max} = \max_j D_j$. In addition, the computational complexity of the meta-layer classifier, which employs a fuzzy $k$-NN in the $JC$-dimensional fusion space, is $O(N^2 JC)$. Therefore, the computational complexity of the FSG is $O(N^2 (D_{max} + JC))$.
In the following section, we provide an empirical study to analyze the remarks given in this section.
7 Experimental Analysis
In this section, three sets of experiments are performed to analyze the behavior of the suggested FSG and to compare its performance with state-of-the-art ensemble learning algorithms.

In order to examine the proposed algorithm for the minimization of the difference between the $N$-sample and large-sample classification errors, we propose an artificial dataset generation algorithm following the comments of Cover and Hart [37].
In addition, we analyze the relationship between the performances of the base-layer and meta-layer classifiers, considering the sample and feature shareability among the base-layer classifiers and feature spaces. Then, we examine the geometric properties of the transformations between feature spaces by visualizing the feature vectors and tracing the samples in each feature space, i.e. the base-layer input feature space, the base-layer output decision space and the meta-layer input fusion space.

Next, benchmark pattern classification datasets such as Breast Cancer, Diabetis, Flare Solar, Thyroid, German and Titanic [26, 27, 28, 29, 41, 42], the Caltech 101 Image Dataset [43] and the Corel Dataset [13] are used to compare the classification performances of the proposed approach and state-of-the-art supervised ensemble learning algorithms. We have used the same data splitting of the benchmark Breast Cancer, Diabetis, Flare Solar, Thyroid, German and Titanic datasets suggested in [26, 27] to enable the reader to compare our results with the aforementioned distance learning methods by referring to [26, 27].

Finally, we examine the FSG in a real-world target detection and recognition problem using a multi-modal dataset. The dataset is collected using a video camera and a microphone in an indoor environment to detect and recognize two moving targets. The problem is defined as a four-class classification problem, where each class represents the absence or presence of the targets in the environment. In addition, we analyze the statistical properties of the feature spaces by computing the entropy values of the distributions of the feature vectors in each feature space, and comparing the entropy values of each feature space of each classifier computed at each layer.
In the FSG, the $k$ values of the fuzzy $k$-NN classifiers are optimized by a cross-validation search whose range depends on the number of samples in the training datasets. In the experiments, fuzzy $k$-NN is implemented in both Matlab and C++; a Matlab implementation is available at https://github.com/meteozay/fsg.git. For the C++ implementation, a fuzzified modification of a GPU-based parallel $k$-NN is used [44]. The classification performances of the proposed algorithms are compared with state-of-the-art ensemble learning algorithms, such as Adaboost [8], Random Subspace [9] and Rotation Forest [10]. Weighted majority voting is used as the combination rule in Adaboost. Decision trees are implemented as the weak classifiers in both Adaboost and Rotation Forest, and a $k$-NN classifier is implemented as the weak classifier in Random Subspace. The number of weak classifiers is selected using cross validation on the training set, with a search range determined by the dimension of the feature space of the samples in the datasets. The Adaboost and Random Subspace algorithms are implemented using the Statistics Toolbox of Matlab.
Experimental analyses of the proposed FSG algorithm on artificial datasets are given in Section 7.1. In Section 7.2, the classification performances of the proposed algorithms and the state-of-the-art classification algorithms are compared using benchmark datasets.
7.1 Experiments on Artificial Datasets
The relationship between the performance of the nearest neighbor ($k = 1$) and $k$-nearest neighbor algorithms and the statistical properties of the datasets has been studied by many researchers. Cover and Hart [37] analyzed this relationship with an elegant example, which was later revisited by Devroye, Györfi and Lugosi [45].
In the example, suppose that the feature vectors of the samples of a training dataset are grouped in two disks with centers and , which represent the class groups and such that in a two dimensional feature space, where is the betweenclass variance. In addition, assume that the class conditional densities are uniform and
Note that the probability that samples belong to the first class , i.e. that the feature vectors reside in the first disk, is
Now, assume that the feature vector of a training sample belonging to is classified by nearest neighbor rule. Then, will be misclassified if its nearest neighbor resides in the second disk. However, if the nearest neighbor of resides in the second disk, then each of the feature vectors must reside in the second disk. Therefore, the classification error is the probability that all of the samples reside in the second disk such that
If the $k$-NN rule is used for classification with $k = 2j + 1$, where $j \geq 1$, then an error occurs if $j$ or fewer of the $N$ training feature vectors reside in the first disk, which happens with probability
$$e_{k\text{-}NN} = \sum_{i=0}^{j} \binom{N}{i} \, p^{i} (1 - p)^{N - i} ,$$
which is the cumulative probability of a Binomial distribution. Since the $i = 0$ term of this sum equals $(1 - p)^N$, the following inequality holds:
$$e_{1\text{-}NN} = (1 - p)^N \leq e_{k\text{-}NN} .$$
Therefore, the classification (generalization) error of the $k$-NN rule depends on the class-conditional densities [37], such that the $1$-NN rule performs better than the $k$-NN rule when the between-class variance of the data distributions is smaller than the within-class variances of the classes.
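The relation between the $1$-NN and $k$-NN error expressions in this two-disk example can be checked numerically with a short sketch (the function names are ours):

```python
from math import comb

def e_1nn(p, N):
    # 1-NN error in the two-disk example: all N training samples
    # fall in the second disk
    return (1 - p) ** N

def e_knn(p, N, j):
    # k-NN error with k = 2j+1: at most j of the N training samples
    # fall in the first disk (cumulative Binomial probability)
    return sum(comb(N, i) * p**i * (1 - p)**(N - i) for i in range(j + 1))

# The 1-NN error is exactly the i = 0 term of the k-NN sum, so
# e_1nn(p, N) <= e_knn(p, N, j) for every p in [0, 1] and j >= 0.
```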
Although Cover and Hart [37] introduced this example to analyze the classification performance of nearest neighbor rules, Hastie and Tibshirani [40] used the results of the example to define a metric, computed from the between-class and within-class scatter, that minimizes the difference between the sample and large-sample errors. Since the minimization of this error difference is one of the motivations of FSG, a similar experimental setup is designed in this section to analyze the performance of FSG.
In the experiments, feature vectors of the samples in the datasets are generated using a Gaussian distribution in each feature space $F_j$, $j = 1, \ldots, J$. While constructing the datasets, the mean vector $\mu_c$ and the covariance matrix $\Sigma_c$ of the class-conditional density of a class $\omega_c$,
$$p(x \mid \omega_c) = \frac{1}{(2\pi)^{D/2} \, \lvert \Sigma_c \rvert^{1/2}} \exp\!\left( -\frac{1}{2} (x - \mu_c)^{T} \Sigma_c^{-1} (x - \mu_c) \right) , \qquad (12)$$
are systematically varied in order to observe the effect of class overlaps on the classification performance. One can easily see that there are explosively many alternatives for changing the parameters of the class-conditional densities in a $D$-dimensional vector space. However, it is quite intuitive that the amount of overlap among the classes affects the performance of the individual classifiers more than changes in the class scatter matrix do. Therefore, we restrict ourselves to controlling only the amount of overlap during the experiments. This is achieved by fixing the covariance matrices $\Sigma_c$, in other words the within-class variances, and changing the mean values of the individual classes, which varies the between-class variances.
Denoting $v_i$ as the $i$-th eigenvector and $\lambda_i$ as the corresponding eigenvalue of a covariance matrix $\Sigma_c$, we have $\Sigma_c v_i = \lambda_i v_i$. Therefore, the central position of the sample distribution in the two-dimensional space is defined by the mean vector $\mu_c$, and its propagation is defined by the eigenvectors $v_1, v_2$ and the eigenvalues $\lambda_1, \lambda_2$. In the datasets, the covariance matrices are held fixed and equal. Therefore, the eigenvalues represented on both axes are the same. As a result, the datasets are generated by circular Gaussian functions with fixed radius.
In this set of experiments, a variety of artificial datasets is generated in such a way that most of the samples are correctly labeled by at least one base-layer classifier. In other words, the feature spaces are generated so as to construct classifiers that are experts on specific classes. The performances of the base-layer classifiers are controlled by fixing the covariance matrices and changing the mean values of the Gaussian distributions used to generate the feature vectors. Thereby, we can analyze the relationship between the classification performance, the number of samples correctly labeled by at least one base-layer classifier, and the expertise of the base-layer classifiers.
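A minimal sketch of this controlled generation, under the stated assumptions (two-dimensional spaces, fixed circular covariance, overlap controlled only through the class means); the function name and parameters are illustrative:

```python
import numpy as np

def generate_feature_space(means, sigma2, n_per_class, seed=0):
    """Sample a 2-D feature space: one circular Gaussian per class.

    means: (C, 2) array of class mean vectors; sigma2: shared isotropic
    variance. Fixing the covariance to sigma2 * I fixes the within-class
    variance, so class overlap is governed purely by the distances
    between the class means.
    """
    rng = np.random.default_rng(seed)
    cov = sigma2 * np.eye(2)                  # circular (isotropic) covariance
    X = np.vstack([rng.multivariate_normal(m, cov, n_per_class) for m in means])
    y = np.repeat(np.arange(len(means)), n_per_class)
    return X, y
```

Generating one such sample set per feature space $F_j$, each with its own mean matrix, yields the multi-view datasets fed to the base-layer classifiers.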
In order to avoid misleading information in this gradual overlapping process, the feature vectors of the samples belonging to different classes are first generated apart from each other, to assure linear separability in the initialization step. Then, the distances between the mean values of the distributions are gradually decreased. The rate of decrease is selected as one tenth of the distance between the means of each class pair $(\omega_c, \omega_{c'})$, $c \neq c'$. The algorithm terminates when the means of the converging pair of classes coincide, i.e., when the distributions fully overlap.
At each epoch, only the mean value of the distribution of one of the classes approaches the mean value of that of another class, while the rest of the mean values are kept fixed. Defining $J$ as the number of classifiers fed by different feature extractors and $C$ as the number of classes, the data generation method is given in Algorithm 2.
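The per-epoch update at the core of this procedure can be sketched as follows, under our reading of the text: one class mean moves one tenth of the current between-mean distance toward another class mean, while all other means stay fixed. The function name is illustrative:

```python
import numpy as np

def converge_means(means, src, dst, step=0.1):
    """Move the mean of class `src` one step toward the mean of class `dst`.

    The step is one tenth of the current distance between the two means,
    mirroring the gradual-overlap procedure; all other class means are
    left unchanged. Returns a new mean matrix.
    """
    means = means.copy()
    means[src] = means[src] + step * (means[dst] - means[src])
    return means
```

Repeated application shrinks the between-mean distance geometrically (by a factor of 0.9 per epoch), so the class overlap increases gradually from one epoch to the next.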
7.1.1 Performance Analysis on Artificial Datasets
In the first set of experiments, $J = 7$ base-layer classifiers are used. Two-dimensional feature spaces are fed to each base-layer classifier as input for $C = 12$ classes, with an equal number of samples per class. The feature sets are prepared with fixed and equal covariance matrices of the class-conditional distributions in each feature space $F_j$, $j = 1, \ldots, 7$; in other words, the within-class variances are the same in all feature spaces.
The features are distributed with different mean values and converged towards each other using Algorithm 2. The generation is parameterized by the matrix $M$, whose row vectors contain the mean values of the distribution of each class in each feature space.
In order to analyze the relationship between the number of samples that are correctly classified by at least one of the base-layer classifiers and the classification performance of the FSG, the average fraction of samples that are correctly classified by at least one base-layer classifier is also given in the experimental results.
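This statistic can be computed directly from the base-layer predictions; a sketch with illustrative names:

```python
import numpy as np

def at_least_one_correct(preds, y):
    """Fraction of samples labeled correctly by at least one base-layer classifier.

    preds: (J, N) array of predicted labels from the J base-layer
    classifiers; y: (N,) array of ground-truth labels.
    """
    correct = (preds == y[None, :])   # (J, N) boolean correctness matrix
    return float(correct.any(axis=0).mean())
```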
In each epoch, the features belonging to different classes are distributed with different topologies, and thus different overlapping ratios, in the feature space of each classifier. For example, the feature vectors of the samples belonging to the ninth class are located apart from those of the rest of the classes in the seventh feature space, while they overlap with other classes in the remaining feature spaces. In this way, the classification behaviors of the base-layer classifiers are controlled through the topological distributions of the features, and classification performances are measured by the metrics given in Section 4.
In Table I, the performances of the individual classifiers and the proposed algorithm are given for an instance of the dataset generated by Algorithm 2, where the dataset is constructed in such a way that each sample is correctly recognized by at least one of the base-layer classifiers. Although the average performances of the individual classifiers are between 53.4% and 65.3%, the classification performance of FSG is 99.9%. In that case, the different classes are distributed at comparatively large relative distances and with different overlapping ratios, as specified by the mean value matrix $M$ used in the first experiment.
Table I: Classification performances (%) of the individual base-layer classifiers, each employing one of the feature spaces $F_1, \ldots, F_7$, and of the FSG.
  $F_1$  $F_2$  $F_3$  $F_4$  $F_5$  $F_6$  $F_7$  FSG
Class1  66.0  63.6  67.6  62.8  61.6  85.6  50.0  100.0 
Class2  67.2  60.8  49.6  50.8  98.4  38.4  36.8  100.0 
Class3  54.4  58.8  50.8  85.2  72.4  53.6  47.6  99.2 
Class4  66.8  64.0  96.8  66.4  61.6  22.8  37.6  100.0 
Class5  60.8  90.0  56.0  63.6  75.2  38.8  48.4  100.0 
Class6  91.6  57.2  69.6  54.0  66.0  43.6  73.6  100.0 
Class7  57.2  55.2  65.2  57.6  60.8  37.2  94.4  100.0 
Class8  78.4  75.6  86.0  69.2  54.4  61.6  97.6  100.0 
Class9  40.8  41.2  36.0  36.0  32.8  26.0  99.6  100.0 
Class10  44.0  32.4  32.0  38.0  37.6  43.2  95.6  100.0 
Class11  32.0  35.2  33.6  40.0  39.6  92.8  38.8  99.6 
Class12  37.6  39.6  34.4  52.0  44.4  97.2  63.6  99.6 
Average Performance (%)  58.0  56.1  56.5  56.3  58.7  53.4  65.3  99.9
In Table II, the performance results of the algorithms at another epoch of the experiments are given. In this experiment, only a portion of the samples is correctly classified by at least one of the base-layer classifiers. The class means in each feature space are again specified by the corresponding mean value matrix.