Abstract

We present a novel adaptive random subspace learning (RSSL) algorithm for prediction. The framework is flexible in that it can be combined with any learning technique. In this paper, we test the algorithm on regression and classification problems. In addition, we provide a variety of weighting schemes to increase the robustness of the algorithm. These weighting schemes were evaluated on simulated as well as real-world data sets, covering both the case where the ratio of features (attributes) to instances (samples) is large and the reverse case. The algorithm consists of six stages. First, calculate the weights of all features in the data set using the correlation coefficient and the F-statistic. Second, randomly draw samples with replacement from the data set. Third, perform regular bootstrap sampling (bagging). Fourth, draw without replacement the indices of the chosen variables; this choice is made according to a heuristic subspacing scheme. Fifth, call the base learners and build the model. Sixth, use the model for prediction on the test set. The results show that the adaptive RSSL algorithm outperforms conventional machine learning algorithms in most cases.

 

Adaptive Random SubSpace Learning (RSSL) Algorithm for Prediction

 

Mohamed Elshrif mme4362@rit.edu

Rochester Institute of Technology (RIT), 102 Lomb Memorial Dr., Rochester, NY 14623 USA

Ernest Fokoué epfeqa@rit.edu

Rochester Institute of Technology (RIT), 102 Lomb Memorial Dr., Rochester, NY 14623 USA


Introduction

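Given a dataset $\mathcal{D} = \{(\mathbf{x}_i, y_i),\ i = 1, \ldots, n\}$, where $\mathbf{x}_i \in \mathcal{X} \subseteq \mathbb{R}^p$ and $y_i \in \mathcal{Y}$ are realizations of two random variables $X$ and $Y$ respectively, we seek to use the data $\mathcal{D}$ to build estimators $\hat{f}$ of the underlying function $f$ for predicting the response $Y$ given the vector $X$ of explanatory variables. In keeping with the standard in statistical learning theory, we measure the predictive performance of any given function $f$ using the theoretical risk functional given by

$$R(f) = \mathbb{E}[\ell(Y, f(X))] = \int_{\mathcal{X} \times \mathcal{Y}} \ell(y, f(\mathbf{x}))\, dP_{XY}(\mathbf{x}, y), \tag{1}$$

with the ideal scenario corresponding to the universally best function defined by

$$f^{*} = \arg\inf_{f} R(f). \tag{2}$$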

For classification tasks, the most commonly used loss function is the zero-one loss $\ell(Y, f(X)) = \mathbb{1}\{Y \neq f(X)\}$, for which the theoretical universal best defined in (2) is the Bayes classifier given by $f^{*}(\mathbf{x}) = \arg\max_{y \in \mathcal{Y}} \Pr[Y = y \mid X = \mathbf{x}]$. For regression tasks, the squared loss $\ell(Y, f(X)) = (Y - f(X))^2$ is by far the most commonly used, mainly because of the wide variety of statistical, mathematical and computational benefits it offers. For regression under the squared loss, the universal best defined in (2) is known theoretically to be the conditional expectation of $Y$ given $X$, specifically $f^{*}(\mathbf{x}) = \mathbb{E}[Y \mid X = \mathbf{x}]$. Unfortunately, these theoretical expressions of the best estimators cannot be realized in practice because the distribution function $P_{XY}$ of $(X, Y)$ defined on $\mathcal{X} \times \mathcal{Y}$ is unknown. To circumvent this learning challenge, one has to do essentially two foundational things, namely: (a) choose a certain function class $\mathcal{F}$ (approximation) from which to search for the estimator $\hat{f}$ of the true but unknown underlying $f$, and (b) specify the empirical version of (1) based on the given sample $\mathcal{D}$, and use that empirical risk as the practical objective function. However, in this paper, we do not directly construct our estimating functions from the empirical risk. Instead, we build the estimators using other optimality criteria, and then compare their predictive performances using the average test error $\mathrm{AVTE}(\cdot)$, namely

$$\mathrm{AVTE}(\hat{f}) = \frac{1}{R} \sum_{r=1}^{R} \left\{ \frac{1}{m} \sum_{t=1}^{m} \ell\big(y^{(r)}_{t}, \hat{f}_r(\mathbf{x}^{(r)}_{t})\big) \right\}, \tag{3}$$

where $\hat{f}_r(\cdot)$ is the $r$-th realization of the estimator $\hat{f}(\cdot)$ built using the training portion of the split of $\mathcal{D}$ into training set and test set, and $(\mathbf{x}^{(r)}_{t}, y^{(r)}_{t})$ is the $t$-th of the $m$ observations in the test set at the $r$-th of the $R$ random replications of the split of $\mathcal{D}$. In this paper, we consider both multiclass classification tasks with response space $\mathcal{Y} = \{1, 2, \ldots, G\}$ and regression tasks with $\mathcal{Y} = \mathbb{R}$, and we focus on learning machines from a function class $\mathcal{F}$ whose members are ensemble learners in the sense of Definition 1.

In machine learning, in order to improve the accuracy of a regression function or a classification function, scholars tend to combine multiple estimators, because it has been shown both theoretically and empirically (Tumer & Ghosh, 1995; Tumer & Oza, 1999) that an appropriate combination of good base learners leads to a reduction in prediction error. This technique is known as ensemble learning (aggregation). Regardless of the underlying algorithm used, the ensemble learning technique on average outperforms the single learning technique, especially for prediction purposes (van Wezel & Potharst, 2007). There are many approaches to ensemble learning. Among these, two popular techniques are bagging (Breiman, 1996) and boosting (Freund, 1995). Many variants of these two techniques, such as random forest (Breiman, 2001) and AdaBoost (Freund & Schapire, 1997), have been studied previously and applied to prediction problems. Our proposed method belongs to the subclass of ensemble learning methods known as random subspace learning.
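As a minimal sketch of how the average test error in (3) can be computed, the Python snippet below repeats a random train/test split, fits an estimator on the training part, and averages the test loss over the replications. The names `avte`, `fit_mean`, `squared_loss` and `zero_one_loss`, the default test fraction, and the toy mean predictor are illustrative assumptions and are not taken from the paper.

```python
import random

def avte(data, fit, loss, n_reps=100, test_frac=0.3, seed=0):
    """Average test error over n_reps random train/test splits.

    data : list of (x, y) pairs
    fit  : callable that takes a training list and returns a predictor f(x)
    loss : callable loss(y, y_hat)
    """
    rng = random.Random(seed)
    n = len(data)
    m = max(1, int(round(test_frac * n)))  # test set size (assumed fraction)
    total = 0.0
    for _ in range(n_reps):
        idx = list(range(n))
        rng.shuffle(idx)                       # new random split each replication
        test = [data[i] for i in idx[:m]]
        train = [data[i] for i in idx[m:]]
        f_hat = fit(train)                     # r-th realization of the estimator
        total += sum(loss(y, f_hat(x)) for x, y in test) / m
    return total / n_reps

def squared_loss(y, y_hat):
    return (y - y_hat) ** 2

def zero_one_loss(y, y_hat):
    return 0.0 if y == y_hat else 1.0

def fit_mean(train):
    """Toy estimator (an assumption for illustration): predict the training mean."""
    mean_y = sum(y for _, y in train) / len(train)
    return lambda x: mean_y
```

Any estimator can be plugged in through the `fit` argument, so the same routine can score a single learner and an ensemble on identical splits.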

Definition 1

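Given an ensemble $\mathcal{H} = \{\hat{g}^{(1)}(\cdot), \ldots, \hat{g}^{(L)}(\cdot)\}$ of base learners $\hat{g}^{(l)} : \mathcal{X} \to \mathcal{Y}$, with relative weights $\alpha^{(l)} \in \mathbb{R}_{+}$ (usually $\alpha^{(l)} \in (0, 1)$ for convex aggregation), the ensemble representation of the underlying function $f$ is given by the aggregation (weighted sum)

$$\hat{f}(\cdot) = \sum_{l=1}^{L} \alpha^{(l)}\, \hat{g}^{(l)}(\cdot). \tag{4}$$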
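The weighted sum in Definition 1 can be sketched in a few lines of Python. The function name `aggregate` and the default of equal weights are assumptions made for illustration; base learners are represented simply as callables.

```python
def aggregate(base_learners, weights=None):
    """Return the ensemble predictor f_hat(x) = sum_l alpha_l * g_l(x).

    base_learners : list of callables g_l(x) returning a number
    weights       : list of non-negative weights alpha_l; equal weights
                    1/L are used when omitted (an illustrative default)
    """
    n_learners = len(base_learners)
    if n_learners == 0:
        raise ValueError("need at least one base learner")
    if weights is None:
        weights = [1.0 / n_learners] * n_learners
    if len(weights) != n_learners:
        raise ValueError("one weight per base learner is required")

    def f_hat(x):
        return sum(a * g(x) for a, g in zip(weights, base_learners))

    return f_hat
```

With equal weights this reduces to the plain average used by bagging-type methods for regression.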

A question naturally arises as to how the ensemble $\mathcal{H}$ is chosen, and how the weights are determined. Bootstrap aggregating, also known as bagging (Breiman, 1996), boosting (Freund & Schapire, 1996), random forests (Breiman, 2001), and bagging with subspaces (Panov & Dzeroski, 2007) are all predictive learning methods based on the ensemble learning principle, for which the ensemble is built from the provided data set $\mathcal{D}$ and the weights are typically taken to be equal. In this paper, we focus on learning tasks involving high dimension low sample size (HDLSS) data, and we further zero in on those data sets for which the number of explanatory variables $p$ is substantially larger than the sample size $n$. As our main contribution, we introduce, develop and apply a new adaptation of the theme of random subspace learning (Ho, 1998), using the traditional multiple linear regression (MLR) model as our base learner in regression and the generalized linear model (GLM) as our base learner in classification.

Some applications by nature possess few instances (small $n$) with a large number of features (large $p$), such as fMRI (Kuncheva et al., 2010) and DNA microarray (Bertoni et al., 2005) data sets. It is hard for a traditional (conventional) algorithm to build a regression model, or to classify the data set, when the instance-to-feature ratio is very small. The prediction problem becomes even more difficult when many of these features are highly correlated, or irrelevant to the task of building such a model, as we will show later in this paper. Therefore, we harness the power of our proposed adaptive subspace learning technique to guide the selection of good candidate features from the data set, and thereby select the best base learners, and ultimately the ensemble yielding the lowest possible prediction error.

In most typical random subspace learning algorithms, the features are selected according to an equally likely scheme. The question then arises as to whether one can devise a better scheme that chooses the candidate features more efficiently and with some predictive benefit. The answer to this question constitutes one of the central aspects of our proposed method, in the sense that we explore a variety of weighting schemes for choosing the features, most of them based on statistical measures of the relationship between the response variable and each explanatory variable. It is also of interest to assess the accuracy of our proposed algorithm under different levels of correlation among the features. As the computational section will reveal, the weighting schemes proposed here lead to a substantial improvement in the predictive performance of our method over random forest on all but one data set, arguably because our method improves the accuracy of the learning algorithm by selecting many good models (the weighting scheme allows good variables to be selected more often and therefore leads to near-optimal base learners).

Related Work

Traditionally, in a prediction problem, a single model is built on the training set and the prediction is based solely on this single fitted model. In bagging, by contrast, bootstrap samples are taken from the data set and a model is fitted to each bootstrap sample. The final prediction is the average over all bagged models. In theory, the bagged model predicts at least as accurately as the traditional single model, matching it in the worst case. However, it must be said that the gain depends on the stability of the modeling procedure. It turns out that bagging reduces the variance without affecting the bias, thereby leading to an overall reduction in prediction error, and hence its great appeal.

Any set of predictive models can be used as an ensemble in the sense defined earlier. There are many ensemble learning approaches, which can be categorized into four classes: (1) algorithms that use heterogeneous predictive models, such as stacking (Wolpert, 1992); (2) algorithms that manipulate the instances of the data sets, such as bagging (Breiman, 1996), boosting (Freund & Schapire, 1996), random forests (Breiman, 2001), and bagging with subspaces (Panov & Dzeroski, 2007); (3) algorithms that manipulate the features of the data sets, such as random forests (Breiman, 2001), random subspaces (Ho, 1998), and bagging with subspaces (Panov & Dzeroski, 2007); and (4) algorithms that manipulate the learning algorithm, such as random forests (Breiman, 2001), neural network ensembles (Hansen & Salamon, 1990), and extra-trees ensembles (Geurts et al., 2006). Since our proposed algorithm manipulates both the instances and the features of the data sets, we focus on the algorithms in the second and third categories (Breiman, 1996; 2001; Panov & Dzeroski, 2007; Ho, 1998).

Bagging (Breiman, 1996), or bootstrap aggregating, is an ensemble learning method that generates multiple predictive models. These models are obtained by drawing bootstrap replicates of the learning (training) data set and using each replicate to build a separate predictive model. A bootstrap sample is obtained by sampling uniformly at random, with replacement, from the instances of the training data set. The final decision is made by averaging the predictors in regression tasks and by majority vote in classification tasks. Bagging tends to decrease the variance while keeping the bias of a single predictor. Its accuracy gain is largest when the base learner is unstable, meaning that a small fluctuation in the training data set has a large impact on the predictions for the test data set, as is the case for trees (Breiman, 1996).

Random forests (Breiman, 2001) is an ensemble learning method that averages the prediction results from multiple independent predictor (tree) models. Like bagging (Breiman, 1996), it uses bootstrap replicates to construct different predictors. In addition, at each node of a tree it randomly selects a subset of the attributes and chooses the best attribute from that subset. It is considered an improvement over bagging because it de-correlates the trees. As Denil et al. (2014) note, three choices must be made in advance when building a random tree: (1) the leaf splitting method, (2) the type of predictor, and (3) the randomness method.

Random subspace learning (Ho, 1998) is an ensemble learning method that constructs base models from different features. It chooses a subset of the features and then learns the base model using only these features. Random subspaces reach their highest accuracy when both the number of features and the number of instances are large. In addition, the method performs well when there are redundant features in the data set.

Bagging subspaces (Panov & Dzeroski, 2007) is an ensemble learning method that combines bagging (Breiman, 1996) and random subspaces (Ho, 1998). It generates bootstrap replicates of the training data set, in the same way as bagging. Then, it randomly chooses a subset of the features, in the same manner as random subspaces. It outperforms both bagging and random subspaces. Also, it is found to yield the same performance as random forests when a decision tree is used as the base learner.

In the simulation part of this paper, we aim to answer the following research questions: (1) Is the performance of adaptive random subspace learning (RSSL) better than the performance of single classifiers? (2) How does adaptive RSSL perform compared with the most widely used classifier ensembles? (3) Is there a theoretical explanation as to why adaptive RSSL works well for most of the simulated and real-life data sets? (4) How does adaptive RSSL perform under different parameter settings and with various values of the instance-to-feature ratio (IFR)? (5) How does the correlation between features affect the predictive performance of adaptive RSSL?

Adaptive Random Subspace Learning

In this section, we present an adaptive random subspace learning algorithm for the prediction problem. We start with the formulation of the problem, followed by our proposed algorithm for tackling it. The crucial step of assessing the candidate features for building the models is explained in detail. Finally, we elucidate the strength of the new algorithm from a theoretical perspective.

Problem Formulation

As stated earlier, our proposed method belongs to the category of random subspace learning, where each base learner is constructed using a bootstrap sample and a subset of the original features. The main difference here is that we use base learners that are typically considered not to lead to any improvement when aggregated, and we also select features using weighting schemes inspired by the strength of the relationship between each feature and the response (target). Each base learner is driven by the subset of predictor variables randomly selected to build it, and by the subsample drawn with replacement from $\mathcal{D}$. For notational convenience, we use vectors of indicator variables to denote these two important quantities. The sample indicator is $\delta^{(l)} = (\delta^{(l)}_1, \ldots, \delta^{(l)}_n)$, where

$$\delta^{(l)}_i = \begin{cases} 1 & \text{if observation } i \text{ is in the } l\text{-th bootstrap sample,} \\ 0 & \text{otherwise.} \end{cases}$$

The variable indicator is $\gamma^{(l)} = (\gamma^{(l)}_1, \ldots, \gamma^{(l)}_p)$, where

$$\gamma^{(l)}_j = \begin{cases} 1 & \text{if variable } j \text{ is selected for the } l\text{-th base learner,} \\ 0 & \text{otherwise.} \end{cases}$$

The estimator of the $l$-th base learner can therefore be fully and unambiguously denoted by $\hat{g}^{(l)}(\cdot\,; \gamma^{(l)}, \delta^{(l)})$, which we refer to as $\hat{g}^{(l)}(\cdot)$ for notational simplicity. Each $\gamma^{(l)}_j$ is chosen according to one of the weighting schemes. To increase the strength of the algorithm, we introduce a weighting scheme procedure to select the important features, which facilitates building a proper model and improves the prediction accuracy. Our weighting schemes are:

  • Correlation coefficient: We measure the strength of the association between each feature vector $\mathbf{x}_j$ and the response (target) $\mathbf{y}$, and take the square of the resulting sample correlation.

  • F-statistic: For classification tasks especially, we use the observed F-statistic resulting from the analysis of variance with $\mathbf{x}_j$ as the response and the class label as the treatment group.
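The two weighting schemes above can be sketched with NumPy as follows. The function names, the normalisation of the scores into selection probabilities that sum to one, and the handling of constant features are assumptions made for this illustration; the paper does not spell out these details in the text above.

```python
import numpy as np

def _normalise(scores):
    """Turn non-negative scores into probabilities (uniform if all are zero)."""
    total = scores.sum()
    if total <= 0:
        return np.full(scores.shape, 1.0 / scores.size)
    return scores / total

def correlation_weights(X, y):
    """Squared sample correlation between each column of X and y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    num = Xc.T @ yc
    den = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    # Constant columns have den == 0 and are given correlation 0 (assumption).
    r = np.divide(num, den, out=np.zeros(X.shape[1]), where=den > 0)
    return _normalise(r ** 2)

def f_statistic_weights(X, labels):
    """One-way ANOVA F-statistic per feature, with the class label as group."""
    classes = np.unique(labels)
    n, p = X.shape
    n_groups = len(classes)
    grand_mean = X.mean(axis=0)
    ss_between = np.zeros(p)
    ss_within = np.zeros(p)
    for g in classes:
        Xg = X[labels == g]
        group_mean = Xg.mean(axis=0)
        ss_between += len(Xg) * (group_mean - grand_mean) ** 2
        ss_within += ((Xg - group_mean) ** 2).sum(axis=0)
    ms_between = ss_between / (n_groups - 1)
    ms_within = ss_within / (n - n_groups)
    # Features with zero within-group variance get F = 0 here (a simplification).
    F = np.divide(ms_between, ms_within, out=np.zeros(p), where=ms_within > 0)
    return _normalise(F)
```

Both functions return a length-$p$ vector that can be passed as sampling probabilities when drawing the feature subset for each base learner.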

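Using the ensemble $\{\hat{g}^{(l)}(\cdot),\ l = 1, \ldots, L\}$, we form the ensemble estimator of class membership as

$$\hat{f}(\mathbf{x}) = \arg\max_{y \in \mathcal{Y}} \left\{ \frac{1}{L} \sum_{l=1}^{L} \mathbb{1}\{y = \hat{g}^{(l)}(\mathbf{x})\} \right\}$$

and the ensemble estimator of regression response as

$$\hat{f}(\mathbf{x}) = \frac{1}{L} \sum_{l=1}^{L} \hat{g}^{(l)}(\mathbf{x}).$$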
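The construction described in this section (bootstrap the rows, draw a weighted feature subset without replacement, fit a base learner, aggregate) can be sketched in Python for the regression case. This is an illustration under stated assumptions rather than the authors' implementation: the function names are hypothetical, ordinary least squares stands in for the MLR base learner, the default subspace size of about $\sqrt{n}$ is an assumed heuristic, and the default of 450 learners is borrowed from the figure captions.

```python
import numpy as np

def fit_adaptive_rssl(X, y, weights, n_learners=450, subspace_size=None, seed=0):
    """Fit least-squares base learners on weighted random subspaces.

    weights : length-p selection probabilities; they must sum to 1 and have
              at least `subspace_size` non-zero entries.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    if subspace_size is None:
        # Assumed heuristic, not taken from the paper: about sqrt(n) features,
        # so each least-squares fit has more rows than columns.
        subspace_size = max(1, min(p, int(np.sqrt(n))))
    ensemble = []
    for _ in range(n_learners):
        rows = rng.integers(0, n, size=n)  # bootstrap sample (with replacement)
        cols = rng.choice(p, size=subspace_size, replace=False, p=weights)
        design = np.column_stack([np.ones(n), X[rows][:, cols]])
        beta, *_ = np.linalg.lstsq(design, y[rows], rcond=None)
        ensemble.append((cols, beta))
    return ensemble

def predict_regression(ensemble, X_new):
    """Equal-weight average of the base-learner predictions."""
    preds = [beta[0] + X_new[:, cols] @ beta[1:] for cols, beta in ensemble]
    return np.mean(preds, axis=0)

def majority_vote(label_predictions):
    """Most frequent label per column of an (L, m) array of integer labels >= 0."""
    votes = np.asarray(label_predictions)
    return np.array([np.bincount(col).argmax() for col in votes.T])
```

Passing uniform `weights` recovers plain bagging with random subspaces, so the same code can be used to compare the uniform and adaptive variants. `majority_vote` shows the aggregation rule for class membership only; fitting GLM base learners is left out of this sketch.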

Experiments

We used a collection of simulated and real-world data sets for our experiments. In addition, for comparison purposes, we used real-world data sets from previous papers that aim to solve the same problem. For each individual algorithm, we report the mean square error (MSE) for regression tasks and the misclassification rate (MCR) for classification tasks.

Simulated Data Sets

We designed our artificial data sets to fit six scenarios based on three factors: the dimensionality of the data (number of features $p$), the sample size (number of instances $n$), and the correlation $\rho$ among the features.
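A sketch of how data for one such scenario could be generated is given below. The generating model is an assumption made only for illustration, since the text above does not specify it: features are Gaussian with a common pairwise correlation $\rho$, the first few coefficients are set to one, and Gaussian noise is added. The names `simulate_regression`, `n_active` and `noise_sd` are hypothetical. The six $(n, p, \rho)$ settings are the ones listed in Table 1.

```python
import numpy as np

def simulate_regression(n, p, rho, n_active=5, noise_sd=1.0, seed=0):
    """Draw (X, y, beta) with equicorrelated Gaussian features (assumed model)."""
    if not 0.0 <= rho <= 1.0:
        raise ValueError("rho must lie in [0, 1] for this construction")
    rng = np.random.default_rng(seed)
    shared = rng.standard_normal((n, 1))   # factor shared by all features
    own = rng.standard_normal((n, p))      # feature-specific part
    # x_j = sqrt(rho) * shared + sqrt(1 - rho) * own_j has unit variance and
    # corr(x_j, x_k) = rho for j != k.
    X = np.sqrt(rho) * shared + np.sqrt(1.0 - rho) * own
    beta = np.zeros(p)
    beta[:n_active] = 1.0                  # assumed sparse signal
    y = X @ beta + noise_sd * rng.standard_normal(n)
    return X, y, beta

# (n, p, rho) settings of the six regression scenarios reported in Table 1.
SCENARIOS = [
    (200, 25, 0.05), (200, 25, 0.5),
    (25, 200, 0.05), (25, 200, 0.5),
    (50, 1000, 0.05), (1000, 50, 0.05),
]
```

Looping over `SCENARIOS` and calling `simulate_regression(n, p, rho)` yields one data set per setting, covering both the case where $n$ exceeds $p$ and the case where $p$ exceeds $n$.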

Real Data Sets

We draw our real-life data sets from the public UCI repository. For the purposes of consistency and completeness, we chose real data sets that have different characteristics in terms of the number of instances and the number of features, and that cover a variety of applications. The real data sets are summarized by task in Table 5.

Figure 1: Prior feature importance: representative simulation results for regression analysis on the synthetic data set scenario with number of instances n = 25, number of features p = 500, correlation coefficient ρ = 0.5, number of learners = 450, and number of replications = 100.

Figure 2: Prior feature importance: representative simulation results for classification analysis on the real Lymphoma data set.

Figure 3: Representative results on the synthetic data set scenario with number of instances n = 50, number of features p = 1000, correlation coefficient ρ = 0.05, number of learners = 450, and number of replications = 100. We used the correlation weighting scheme for regression analysis, shown on a logarithmic scale.

Figure 4: Representative results on the real Diabetes interaction data set with the correlation weighting scheme for regression analysis, shown on the original scale.

Figure 5: Representative results on the synthetic data set scenario with number of instances n = 200, number of features p = 25, correlation coefficient ρ = 0.05, number of learners = 450, and number of replications = 100. We used the F-statistic weighting scheme for classification analysis.

Figure 6: Representative results on the real Diabetes in Pima Indian Women data set with the F-statistic weighting scheme for classification analysis.

Weighting    n     p     ρ     MLR           Uniform MLR   Adaptive MLR  RF            Better?
Correlation  200   25    0.05  5.69±0.89     14.50±2.63    4.60±0.706    9.81±1.86
Correlation  200   25    0.5   4.78±0.81     11.67±2.55    4.77±0.94     8.46±1.97
Correlation  25    200   0.05  974.37±5.e3   18.35±6.92    8.10±3.86     18.56±7.24
Correlation  25    200   0.5   5.e3±5.e4     18.83±8.72    8.27±5.24     18.18±8.65
Correlation  50    1000  0.05  2.e4±1.e5     28.36±11.51   12.38±5.91    27.92±11.78
Correlation  1000  50    0.05  4.66±0.34     16.62±1.37    4.33±0.33     6.73±0.62
F-statistics 200   25    0.05  5.04±0.79     14.42±2.67    4.48±0.74     8.75±1.76
F-statistics 200   25    0.5   4.49±0.76     12.06±2.04    5.51±1.09     8.33±1.59
F-statistics 25    200   0.05  3.e4±2.e5     17.77±9.15    5.81±4.10     15.81±8.55
F-statistics 25    200   0.5   1.e4±1.e5     23.09±16.06   12.53±10.27   24.11±16.31
F-statistics 50    1000  0.05  4.e5±3.e6     16.65±5.38    7.65±2.83     15.54±5.31
F-statistics 1000  50    0.05  4.19±0.33     15.97±1.15    3.90±0.30     6.24±0.55
Table 1: Regression Analysis: Mean Square Error (MSE) for different machine learning algorithms on various scenarios of synthetic data sets.

Data Set    Weighting     MLR               Uniform MLR       Adaptive MLR      RF                Better?
BodyFat     correlation   17.41±2.69        23.59±3.71        19.25±3.06        19.72±3.18
BodyFat     F-statistics  17.06±2.50        23.07±3.46        17.46±2.65        19.51±2.99
Attitude    correlation   74.12±32.06       80.35±34.40       58.49±20.21       88.72±35.97
Attitude    F-statistics  75.19±36.63       74.71±33.17       51.84±15.19       82.21±35.58
Cement      correlation   10.76±7.25        NA                19.92±15.98       75.91±56.05
Cement      F-statistics  11.07±8.55        NA                24.27±18.27       62.20±46.53
Diabetes 1  correlation   2998.13±322.37    3522.30±311.81    3165.74±300.86    3203.94±311.94
Diabetes 1  F-statistics  2988.32±341.20    3533.45±375.38    3133.60±324.75    3214.11±318.6931
Diabetes 2  correlation   3916.98±782.35    4244.00±390.29    3016.54±285.89    3266.50±324.82
Diabetes 2  F-statistics  3889.00±679.55    4306.76±419.66    3076.77±338.08    3326.28±382.37
Longley     correlation   0.21±0.13         0.62±0.36         0.49±0.29         1.54±0.92
Longley     F-statistics  0.22±0.13         0.66±0.42         0.49±0.29         1.63±1.04
Prestige    correlation   66.68±15.31       73.32±14.77       64.87±13.93       55.96±11.83
Prestige    F-statistics  65.77±15.96       72.33±16.64       63.27±14.71       56.02±12.66

Table 2: Regression Analysis: Mean Square Error (MSE) for different machine learning algorithms on real data sets.

Weighting    n     p     ρ     GLM           Uniform GLM   Adaptive GLM  RF            Better?
F-statistics 200   25    0.05  0.070±0.033   0.486±0.172   0.071±0.032   0.101±0.053
F-statistics 200   25    0.5   0.140±0.045   0.498±0.221   0.138±0.043   0.136±0.058
F-statistics 50    200   0.05  0.102±0.093   0.673±0.123   0.100±0.092   0.320±0.103
F-statistics 50    200   0.5   0.058±0.141   0.346±0.346   0.049±0.121   0.178±0.188
F-statistics 50    1000  0.05  0.033±0.064   0.522±0.158   0.034±0.062   0.409±0.114
F-statistics 1000  50    0.05  0.130±0.019   0.643±0.028   0.130±0.019   0.167±0.024
Table 3: Classification Analysis: Misclassification Rate (MCR) for different machine learning algorithms on various scenarios of simulated data sets.

Data Set          Weighting  GLM           Uniform GLM   Adaptive GLM  RF            Better?
Diabetes in Pima  F-stat     0.274±0.071   0.249±0.051   0.255±0.051   0.269±0.050
Prostate Cancer   F-stat     0.425±0.113   0.355±0.093   0.332±0.094   0.343±0.098
Golub Leukemia    F-stat     0.427         0.023         0.021         0.023
Diabetes          F-stat     0.034±0.031   0.068±0.039   0.038±0.034   0.031±0.029
Lymphoma          F-stat     0.248±0.065   0.057±0.034   0.046±0.029   0.082±0.046
Lung Cancer       F-stat     0.113±0.051   0.038±0.023   0.037±0.024   0.051±0.030
Colon Cancer      F-stat     0.296±0.124   0.168±0.095   0.124±0.074   0.199±0.106
Table 4: Classification Analysis: Misclassification Rate (MCR) for different machine learning algorithms on real data sets.

Data set          Instances  Features  IFR
Regression
BodyFat           252        14        1,800%
Attitude          30         7         428.5%
Cement            13         5         260%
Diabetes 1        442        11        4,018%
Diabetes 2        442        65        680%
Longley           16         7         228.5%
Prestige          102        5         2,100%
Classification
Diabetes in Pima  200        8         2,500%
Prostate cancer   79         501       15.8%
Leukemia          72         3572      2.0%
Diabetes          145        4         3,625%
Lymphoma          180        662       27.2%
Table 5: Summary of the regression and classification real data sets.
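The IFR column appears to be the number of instances divided by the number of features, expressed as a percentage. The short Python helper below (the name `ifr_percent` is an assumption for illustration) makes that arithmetic explicit for two rows of Table 5.

```python
def ifr_percent(n_instances, n_features):
    """Instance-to-feature ratio expressed as a percentage."""
    if n_features <= 0:
        raise ValueError("n_features must be positive")
    return 100.0 * n_instances / n_features

# Two rows of Table 5, given as (instances, features).
bodyfat_ifr = ifr_percent(252, 14)     # low-dimensional regression data set
leukemia_ifr = ifr_percent(72, 3572)   # high-dimensional classification data set
```

Values far above 100% correspond to data sets with many more instances than features, and values far below 100% to the HDLSS setting that motivates the adaptive weighting.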

Figure 7: Representative results showing the relationship between the mean square error (MSE) and the correlation coefficient (ρ) for different algorithms on a synthetic data set, with the correlation weighting scheme for regression analysis, when the number of features p exceeds the number of instances n.

Figure 8: Representative results showing the relationship between the mean square error (MSE) and the correlation coefficient (ρ) for different algorithms on a synthetic data set, with the F-statistic weighting scheme for classification analysis, when the number of instances n exceeds the number of features p.

To elucidate the performance of our developed model, we compare the accuracy of the RSSL with random forest and … on the same real data sets they used before.

Discussion

Our experiments on synthetic data sets revealed that selecting more than 15-20 features (for our particular data set) yields ensemble classifiers that are highly accurate and stable. The reason is that only when the number of voters is "large enough" does the random process of attribute selection yield a sufficient number of qualitatively different classifiers to ensure high accuracy and stability of the ensemble.

How many bootstrap replications are useful? The evidence, both experimental and theoretical, is that bagging can push a good but unstable procedure a significant step towards optimality. Why was the training set chosen to be large for the real data sets, whereas the test set was chosen to be large for the simulated data sets? The bootstrap sampling was repeated 50 times, and the random division of the data was repeated 100 times. Choosing between these two strategies is not an easy task, since it involves a trade-off between bias and estimation variance over the forecast horizon.

Even though our adaptive RSSL algorithm outperforms many classifier ensembles, it has limitations. The algorithm cannot deal with data sets that have categorical features; such features must first be encoded numerically. Also, the algorithm is not designed to classify data sets with multiple classes. Moreover, the adaptive RSSL algorithm sometimes fails to select the optimal feature subsets.

Conclusion

We presented a detailed quantitative analysis of the adaptive RSSL algorithm for ensemble prediction problems. We support this analysis with a theoretical (mathematical) explanation. The key properties of the developed algorithm rest on four fundamental factors: generalization, flexibility, speed, and accuracy. We will explain each of these four factors. We present a rigorous theoretical justification of our proposed algorithm. For now, we choose a fixed number of attributes per subset. However, the algorithm should be evaluated based on its performance (accuracy) to determine the appropriate subspace dimension for the single classifiers used in the ensemble. In addition, the adaptive RSSL algorithm was tested on relatively small data sets. Our next step will be to apply the developed algorithm to big data sets.

Also, we show that the adaptive RSSL performs better than widely used ensemble algorithms even with the dependence of feature subsets.

Computational issues.

References
