Coopetitive Soft Gating Ensemble

Stephan Deist, Maarten Bieshaar, Jens Schreiber, André Gensler, and Bernhard Sick
Intelligent Embedded Systems Lab, University of Kassel, Kassel, Germany
stephan.deist@student.uni-kassel.de, mbieshaar@uni-kassel.de, jens.schreiber@uni-kassel.de, gensler@uni-kassel.de, bsick@uni-kassel.de
Abstract

In this article, we propose the Coopetitive Soft Gating Ensemble (CSGE) for general machine learning tasks. The goal of machine learning is to create models which possess a high generalisation capability. But often, problems are too complex to be solved by a single model. Therefore, ensemble methods combine the predictions of multiple models. The CSGE comprises a comprehensible combination based on three different aspects relating to the overall global historical performance, the local/situation-dependent performance, and the time-dependent performance of its ensemble members. The CSGE can be optimised according to arbitrary loss functions, making it accessible for a wider range of problems. We introduce a novel training procedure with a hyper-parameter initialisation at its heart. We show that the CSGE approach reaches state-of-the-art performance for both classification and regression tasks. Still, the CSGE allows us to quantify the influence of all base estimators by means of the three weighting aspects in a comprehensible way. In terms of Organic Computing (OC), our CSGE approach combines multiple base models towards a self-organising complex system. Moreover, we provide a scikit-learn compatible implementation.

I Introduction

The main goal of machine learning is to create models from a set of training data that have a high capability of generalisation. Often, the problems are so complex that a single estimator cannot handle the whole scope. One possible approach is to not rely on a single estimator but to use multiple estimators instead. This approach of combining multiple estimators is called an ensemble. In many fields, ensembles achieve state-of-the-art performance. Popular ensemble methods are Boosting [1], Bagging [2], and Stacking [3]. In [4], it is shown that ensembles often lead to better results than using one single estimator. When considering the bias-variance tradeoff [5], ensembles can reduce both variance and bias and therefore result in strong models.

In many use cases, it is desirable to have models that offer a high comprehensibility of the relation between data and operation. Since Boosting, Bagging, and Stacking more or less work like black boxes, a fairly new ensemble method called the Coopetitive Soft Gating Ensemble (CSGE) was proposed. In [6], [7], and [8], it is statistically shown that the CSGE can achieve state-of-the-art performance in the area of power forecasting. In this article, we aim to extend the approach to general machine learning problems.

Each estimator of an ensemble is called an ensemble member. The idea of the CSGE is to gradually weight a number of ensemble members according to their historically observed performance with respect to various aspects. There are three aspects which influence the weight: first, the overall performance of the estimator; second, the local performance of the estimator in similar historical situations; third, time-dependent effects modelling the autocorrelation in the estimator's outcome.

The ability to assess the individual performance of each base estimator for those three aspects can, in terms of Organic Computing (OC) [9] and Autonomic Computing (AC), be interpreted as a self-awareness capability. Our CSGE approach combines multiple base estimators towards a self-organising complex system with only a few parameters to be determined.

II Main Contribution

The main contribution of this article is an extended coopetitive soft gating ensemble approach. It generalises the method proposed in [6, 7] for wind power forecasting to other machine learning tasks including regression, classification, and time series forecasting. The main contributions of this article can be summarised as follows:

  • The loss function to be optimised by the CSGE is arbitrary and can be chosen by the user with only minimal conditions.

  • A novel heuristic to choose the hyper-parameters of the CSGE training algorithm reducing the required number of adjustable parameters.

  • In an extensive evaluation of our approach on common real-world reference datasets, we show that our CSGE approach reaches state-of-the-art performance compared to other ensemble methods. Moreover, the CSGE allows us to quantify the influence of all base estimators by means of the three weighting aspects in a comprehensible way.

  • A scikit-learn compatible implementation of the CSGE (https://git.ies.uni-kassel.de/csge/csge).

The remainder of this article is structured as follows. In Section III, we review the related work in the field of ensemble methods for machine learning. Then, in Section IV, we introduce the CSGE approach, outlining the main principles, the three weighting aspects, and the algorithm for ensemble training. In Section V, we present the evaluation of the CSGE approach on three synthetic datasets and four real-world reference classification and regression datasets. Finally, in Section VI, the main conclusions and open issues for future work are discussed.

III Related Work

In machine learning, the term ensemble describes the combination of multiple models. The ensemble comprises a finite set of estimators, whose predictions are aggregated to form the ensemble prediction. The theoretical justification of why ensembles can increase the overall predictive performance is given by the bias-variance decomposition [5]. The key to ensemble methods is model diversification, i.e., how to create sufficiently different models from sample data. A comprehensive review of ensembles is given in [10]. Among others, the most important design principles for ensembles are data, parameter, and structural diversity. Data diversity comprises ensembles trained on different subsets of the data. Well-known representatives of this type are bagging [2], boosting [1], and random forest [11]. The idea of parameter diversity is to induce diversity into the ensemble by varying the parameters of the ensemble members. A representative of this type is the multiple kernel learning algorithm [12], in which multiple kernels are combined. Lastly, structural diversity comprises the combination of different models, e.g., obtained by applying different learning algorithms or variable model types. These ensembles are also referred to as heterogeneous ensembles [13]. A well-known representative of this type is the stacking algorithm [3]. Another ensemble technique, Bayesian model averaging (BMA) [14], accounts for model uncertainty when deriving parameter estimates. Hence, the ensemble estimate comprises the weighted estimates of various model hypotheses. Another method, not to be confused with BMA, is Bayesian model combination [15]. It overcomes the shortcoming of BMA to converge to a single model. Recently, mixture-of-experts models, which comprise a gating model weighting the outputs of different submodels, gained a lot of attention, as they achieve state-of-the-art performance in language modelling [16] and multi-source machine translation [17]. These approaches are based on deep neural networks; hence, they require many training samples, and their weightings are barely comprehensible.

In [6, 7], the CSGE was presented in the context of renewable energy power forecasting. It comprises a hierarchical two-stage ensemble prediction system and weights the ensemble members' predictions based on three aspects, namely overall, local, and time-dependent performance. In [8], the system was extended to handle probabilistic forecasts. The approach presented in this article is a generalisation to other machine learning tasks.

IV Method

In this section, the novel Coopetitive Soft Gating Ensemble (CSGE) method, as proposed in [8], is introduced. After a brief general overview, we detail the different characteristics of the ensemble method, namely soft gating as well as global, local, and time-dependent weighting. In the final sections, we give details and recommendations for training.

Fig. 1: The architecture of the CSGE. The predictions of the ensemble members are passed to the CSGE module. Here, the weights are calculated regarding the global, local, and time-dependent weighting. After that, the predictions are weighted and aggregated.

IV-A Coopetitive Soft Gating Ensemble

The architecture of the CSGE, as depicted in Figure 1, highlights the three weighting aspects: global, local, and time-dependent weighting. For each of the weighting methods, the novel coopetitive soft gating principle is applied. Coopetitive soft gating is a conglomerate of cooperation and competition. The technique combines two well-known principles of ensemble methods, weighting and gating. Weighting combines all ensemble members in a linear combination, while gating selects exactly one of the ensemble members. The idea of the CSGE is to allow a mixture of both weighting and gating and to let the ensemble itself choose which concept to use for the combination of the different predictions.

Each of the three coopetitive soft gating weighting aspects takes the $J$ ensemble members as input. Each ensemble member $j$ provides an estimation $\hat{y}_j^{(t)}$ for the input $x$. Here, $t$ denotes the timestamp, also called lead time, when operating on time series, and $j$ denotes the index of the $j$-th ensemble member. For each prediction and estimator, the CSGE calculates the local, global, and time-dependent weighting and aggregates the results. All weights are normalised, and each prediction is weighted using the corresponding normalised weight $\omega_j^{(t)}$. In the final step, the ensemble's prediction is obtained by aggregation of the weighted predictions, as follows:

$\hat{y}^{(t)} = \sum_{j=1}^{J} \omega_j^{(t)} \cdot \hat{y}_j^{(t)}$ (1)

To ensure that the prediction is not distorted, the weights are subject to the following constraint:

$\sum_{j=1}^{J} \omega_j^{(t)} = 1$ (2)

The optimal weights $\omega_j^{(t)}$ are obtained by the CSGE w.r.t. a loss function, e.g., mean squared error, cross-entropy, etc. Each weighting aspect has different characteristics related to the loss function, summarised as follows:

  • The global weights are determined for each model regarding its overall observed performance during ensemble training. This is a fixed weighting term; thereby, overall strong models have more influence than weaker models.

  • Local weighting considers the fact that the prediction quality of different ensemble members varies over the feature space. As an example, when considering the problem of renewable energy prediction, an ensemble member could perform well on rainy weather inputs but have worse quality on sunny weather inputs. Therefore, the local weighting rewards ensemble members that performed well on similar input data with a higher weight. These weights are adjusted for each prediction during runtime.

  • The time-dependent weighting aspect is used when performing predictions on time series. Often, ensemble members perform differently for different lead times. When considering the problem of renewable energy prediction again, one method might achieve superior results on short time horizons while quickly losing quality for long-term predictions. Other methods may perform worse on short-term predictions but have greater stability over time. These weights are calculated for each prediction during runtime.

Involving the three aspects, the overall weighting for a specific ensemble member is calculated by the following equation:

$\tilde{\omega}_j^{(t)} = \omega_{g,j} \cdot \omega_{l,j} \cdot \omega_{t,j}^{(t)}$ (3)

where $\omega_{g,j}$ is the global weighting, $\omega_{l,j}$ is the local weighting, and $\omega_{t,j}^{(t)}$ is the time-dependent weighting. To calculate the final weighting, the values are normalised for the $j$-th ensemble member with the following equation:

$\omega_j^{(t)} = \frac{\tilde{\omega}_j^{(t)}}{\sum_{k=1}^{J} \tilde{\omega}_k^{(t)}}$ (4)

This equation also ensures that the constraint of Equation 2 is fulfilled.
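To make the combination concrete, the following minimal Python sketch (function names and array shapes are our own assumptions, not taken from the reference implementation) implements Equations 1, 3, and 4:

```python
import numpy as np

def combine_weights(w_global, w_local, w_time):
    """Combine the three weighting aspects (Eq. 3) and normalise them (Eq. 4).

    All arguments are arrays of shape (J,), one weight per ensemble member.
    """
    w = w_global * w_local * w_time  # Eq. 3: multiplicative combination
    return w / w.sum()               # Eq. 4: normalise so the weights sum to 1

def ensemble_prediction(member_preds, weights):
    """Aggregate the members' predictions with the normalised weights (Eq. 1)."""
    return np.dot(weights, member_preds)

# Example with J = 3 members for a single prediction:
w = combine_weights(np.array([0.5, 1.0, 0.8]),
                    np.array([0.9, 1.2, 0.4]),
                    np.array([1.0, 1.0, 1.0]))
print(w.sum())  # 1.0, satisfying the constraint of Eq. 2
print(ensemble_prediction(np.array([2.0, 2.5, 3.0]), w))
```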

IV-B Soft Gating Principle

The primary goal of the CSGE is to increase the quality of the prediction by weighting robust predictors higher than predictors with worse quality. Traditionally, in ensemble methods, one of the two paradigms, weighting or gating, is used to combine the individual ensemble members. The soft gating approach of the CSGE introduces a novel method which allows a mixture of both weighting and gating and lets the ensemble itself choose which concept to use for the combination of the different predictions. Moreover, the soft gating approach applies to all three weighting aspects.

To evaluate the quality of an individual ensemble member, a function is needed which maps the error of a prediction to a weighting. To do so, the soft gating function $\varsigma_{\eta}$ is used to determine the weight of estimator $j$ as follows:

$\varsigma_{\eta}(E, e_j) = \frac{\sum_{e \in E} e}{e_j^{\eta} + \epsilon}$ (5)

$E$ is a set which contains the reference errors of all estimators, while $e_j$ is the error of estimator $j$. The parameter $\eta \geq 0$ is chosen by the user. It controls the linearity of the weighting: for a greater $\eta$, the CSGE tends to work as gating, while a smaller $\eta$ results in a weighting approach.

Fig. 2: The error (RMSE) of a predictor is drawn on the x-axis, while the y-axis shows the corresponding weights computed by $\varsigma_{\eta}$. For a greater $\eta$, a higher error is penalised with a smaller weighting than for a smaller $\eta$.

By taking a closer look at $\varsigma_{\eta}$ in Figure 2, we can discover the following characteristics:

  • The function is monotonically decreasing.

  • For a larger $\eta$ or a larger error $e_j$ of an ensemble member, $\varsigma_{\eta}$ returns smaller weightings.

  • For $\eta = 0$, all ensemble members are weighted equally with $1/J$, regardless of the errors of their predictions.

  • $\epsilon$ is a small constant to prevent a division by zero.

To ensure that the resulting weights sum to one, $\varsigma_{\eta}$ is adjusted in the following way:

$\varsigma^{*}_{\eta}(E, e_j) = \frac{\varsigma_{\eta}(E, e_j)}{\sum_{e_k \in E} \varsigma_{\eta}(E, e_k)}$ (6)

Besides having only one parameter ($\eta$), soft gating offers the advantage of operating on the direct errors of the ensemble members and therefore having a strong correlation to the actual data.
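A minimal sketch of the soft gating function of Equations 5 and 6 could look as follows (our own rendering, not the reference implementation):

```python
import numpy as np

def soft_gate(errors, eta, eps=1e-8):
    """Normalised soft gating (Eqs. 5 and 6).

    errors: array of shape (J,) with one non-negative error per member.
    eta:    >= 0; eta = 0 yields equal weights (averaging), while a large
            eta approaches gating (the best member takes all the weight).
    """
    raw = errors.sum() / (errors ** eta + eps)  # Eq. 5
    return raw / raw.sum()                      # Eq. 6: weights sum to 1

errors = np.array([0.5, 1.0, 2.0])
print(soft_gate(errors, eta=0.0))  # ~[0.33, 0.33, 0.33]: pure averaging
print(soft_gate(errors, eta=8.0))  # ~[1.00, 0.00, 0.00]: gating to the best
```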

IV-C Global Weighting

The global weighting is calculated during ensemble training and then remains constant. Ensemble members that performed well on the training data get greater weights compared to those that showed a worse performance. Therefore, the difference between estimation and ground truth is calculated with

$e_{j,n} = \ell(\hat{y}_{j,n}, y_n)$ (7)

$\hat{y}_{j,n}$ is the prediction of the $j$-th ensemble member for the $n$-th sample, while $y_n$ is the corresponding ground truth.
$\ell$ is an arbitrary scoring function, which could, for example, be the root mean squared error (RMSE) for regression or an accuracy-based score for classification. The only condition is that the score behaves like an error, i.e., it must increase monotonically for worse predictions (for accuracy, e.g., $1 - \mathrm{ACC}$ can be used), to work correctly with the soft gating principle, see Equation 6. The error score of the $j$-th ensemble member is calculated by

$\bar{e}_j = \frac{1}{N} \sum_{n=1}^{N} e_{j,n}$ (8)
$E_g = \{\bar{e}_1, \ldots, \bar{e}_J\}$ (9)

By applying the soft gating principle to the set $E_g$ of all error scores of the ensemble members, we obtain the final global weighting with

$\omega_{g,j} = \varsigma^{*}_{\eta_g}(E_g, \bar{e}_j)$ (10)
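Putting Equations 7 to 10 together, the global weighting can be sketched as follows (reusing the soft_gate helper from Section IV-B; the RMSE is one admissible choice of scoring function):

```python
import numpy as np

def global_weights(member_preds, y, eta_g):
    """Global weighting (Eqs. 7-10).

    member_preds: array (J, N) of member predictions on the training data.
    y:            array (N,) of ground-truth targets.
    """
    # Eqs. 7/8: per-member RMSE over the training data as overall error score.
    e_global = np.sqrt(np.mean((member_preds - y) ** 2, axis=1))
    # Eqs. 9/10: soft gating over the set of error scores.
    return soft_gate(e_global, eta_g)
```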

IV-D Local Weighting

The local weighting considers the quality differences between the predictors for distinct situations over the feature space. Therefore, the local weighting rewards ensemble members that performed well on similar input data with a higher weight. In contrast to the global weighting, the local weighting is calculated for each estimation during runtime.

To assess the similarity of situations, we consider distances in the input feature space, assuming that situations with a low distance have more in common than situations with a larger distance. The set $H$ contains all data that was used during ensemble training. Often, the features of $H$ vary in their ranges and information value. Since we use feature distances to determine similar situations, it can be useful to apply a principal component analysis (PCA) to the history data $H$:

$\tilde{H} = \mathrm{PCA}_R(H)$ (11)

$\tilde{H}$ is the transformed history set, which has a dimension of $N \times R$. The parameter $R$ is chosen by the user and can be set in the range $1 \leq R \leq M$, where $M$ is the number of features of $H$. To calculate the local weight for a new input $x$, we have to transform the input data into the transformed feature space by applying the PCA:

$\tilde{x} = \mathrm{PCA}_R(x)$ (12)

Using a k-nearest neighbour search, we can determine the situations in the training data that are most similar to the new input:

$Q = \mathrm{kNN}_k(\tilde{H}, \tilde{x})$ (13)

The set $Q$ of similar situations is used to look up the training errors $e_{j,q}$ for each situation $q \in Q$ and to obtain the average local error:

$\bar{e}_{l,j} = \frac{1}{|Q|} \sum_{q \in Q} e_{j,q}$ (14)

This equation is applied for each ensemble member to obtain $E_l$, the set of all local error scores of all ensemble members:

$E_l = \{\bar{e}_{l,1}, \ldots, \bar{e}_{l,J}\}$ (15)

Finally, the local weight is calculated by using the soft gating principle:

$\omega_{l,j} = \varsigma^{*}_{\eta_l}(E_l, \bar{e}_{l,j})$ (16)
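A compact sketch of the local weighting (Equations 11 to 16) based on scikit-learn's PCA and NearestNeighbors; the per-sample error matrix and all names are our own assumptions:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import NearestNeighbors

def fit_local_index(H, R):
    """Fit the PCA (Eq. 11) and a neighbour index on the history data H (N, M)."""
    pca = PCA(n_components=R).fit(H)
    index = NearestNeighbors().fit(pca.transform(H))
    return pca, index

def local_weights(x, pca, index, member_errors, k, eta_l):
    """Local weighting for one new input x (Eqs. 12-16).

    member_errors: array (J, N) with the training error e_{j,n} of every
                   member j on every training sample n.
    """
    x_t = pca.transform(x.reshape(1, -1))          # Eq. 12
    _, q = index.kneighbors(x_t, n_neighbors=k)    # Eq. 13: similar situations
    e_local = member_errors[:, q[0]].mean(axis=1)  # Eqs. 14/15: local error scores
    return soft_gate(e_local, eta_l)               # Eq. 16 (helper from Sec. IV-B)
```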

IV-E Time-Dependent Weighting

The time-dependent weighting considers the fact that the quality of an ensemble member varies over the lead time. As with the local weighting, the time-dependent weighting is calculated for each estimation. The set $P_j$ contains all predictions of estimator $j$ from time $1$ up to time $T$:

$P_j = \{\hat{y}_j^{(1)}, \ldots, \hat{y}_j^{(T)}\}$ (17)

The error for a specific time $t$ is calculated by averaging the output error over all $N$ training samples with:

$e_{j,n}^{(t)} = \ell(\hat{y}_{j,n}^{(t)}, y_n^{(t)})$ (18)
$\bar{e}_j^{(t)} = \frac{1}{N} \sum_{n=1}^{N} e_{j,n}^{(t)}$ (19)

with $\hat{y}_{j,n}^{(t)}$ as the prediction and $y_n^{(t)}$ as the corresponding ground truth for time $t$. To calculate the error score $\rho_j^{(t)}$ for time $t$ of estimator $j$, we use the following equation:

$\rho_j^{(t)} = \frac{\bar{e}_j^{(t)}}{\frac{1}{T} \sum_{t'=1}^{T} \bar{e}_j^{(t')}}$ (20)

$\rho_j^{(t)}$ is a measure that compares the error of the prediction at time $t$ to the average error in the time interval $[1, T]$. The weight is calculated analogously to the global and local weighting using the soft gating principle with

$E_t^{(t)} = \{\rho_1^{(t)}, \ldots, \rho_J^{(t)}\}$ (21)
$\omega_{t,j}^{(t)} = \varsigma^{*}_{\eta_t}(E_t^{(t)}, \rho_j^{(t)})$ (22)
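Under our reading of Equations 17 to 22, the time-dependent weighting can be sketched as follows (again reusing the soft_gate helper from Section IV-B):

```python
import numpy as np

def time_weights(errors_per_time, eta_t):
    """Time-dependent weighting (Eqs. 17-22).

    errors_per_time: array (J, T) with the average training error of member j
                     at lead time t (Eqs. 18/19).
    Returns an array (J, T) with one weight column per lead time.
    """
    # Eq. 20: relate each member's error at lead time t to its average error.
    rho = errors_per_time / errors_per_time.mean(axis=1, keepdims=True)
    # Eqs. 21/22: apply soft gating separately for every lead time.
    return np.stack([soft_gate(rho[:, t], eta_t)
                     for t in range(rho.shape[1])], axis=1)
```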

IV-F Model Fusion and Ensemble Training

To find the optimal set of parameters for the predictions, including all three weighting aspects for all ensemble members, we aim to optimise the prediction of Equation 1. Since there are three aspects, namely global, local, and time-dependent weighting, it follows that there are also three soft gating parameters $\vec{\eta} = (\eta_g, \eta_l, \eta_t)$ to be chosen. As mentioned previously, the parameter $\eta$ controls the non-linearity of the weighting. Therefore, the following minimisation problem is solved to adjust $\vec{\eta}$:

$\min_{\vec{\eta}} \; \sum_{n=1}^{N} \ell(\hat{y}_n, y_n) + \Gamma(\vec{\eta})$ (23)

where $\hat{y}_n$ is the prediction from Equation 1 given the current weights. $\sum_{n=1}^{N} \ell(\hat{y}_n, y_n)$ are the summed errors over the training data, while $\Gamma(\vec{\eta})$ is a regularisation term to control overfitting.

However, to optimise Equation 23, adjust $\vec{\eta}$, and calculate the global weighting, we need a set of training data for the ensemble itself. In general, the ensemble members are trained on a training set and validated on a validation set. By using the same training set to train the CSGE, it will often become overfitted and will not generalise well. Therefore, we need training data for the CSGE that was not used to train the ensemble members. A simple method is shown in Figure 3: the training data is split into two sets, one to train the ensemble members and one to train the CSGE itself. While this setup is very simple, it has a disadvantage: because the training set for the ensemble members and the training set for the CSGE need to be distinct, training data is wasted.

Fig. 3: The two training sets must be distinct in order to get information about the quality of each ensemble member

A more advanced approach, shown in Figure 4, allows using the training data more efficiently. Since the CSGE is fitted on the output data of the predictors, we need these data for training. Therefore, a cross-validation with $K$ folds is used to generate them. The training set in the $i$-th step of this k-fold is split into a set $T_i$ for training and a set $V_i$ for prediction. Then a copy of the $j$-th ensemble member is trained using the set $T_i$. This temporary predictor is denoted by $f_{i,j}$, where $i$ is the index of the $i$-th step and $j$ the index of the $j$-th ensemble member. The temporary predictor is used to predict $V_i$:

$\hat{Y}_{i,j} = f_{i,j}(V_i)$ (24)

After the $K$-th iteration, we can unite all created predictions with:

$\hat{Y}_j = \bigcup_{i=1}^{K} \hat{Y}_{i,j}$ (25)

After that, all ensemble members are trained on the whole training set. Because the ensemble training data now consists of the output data of the estimators, we have to adjust the calculation of the CSGE. Therefore, we store the predictions in an $N \times J \times T$-dimensional matrix, where $N$ is the number of samples and $J$ the number of estimators; $T$ is the number of timestamps when operating on time series:

$P = \left(\hat{y}_{j,n}^{(t)}\right) \in \mathbb{R}^{N \times J \times T}$ (26)

Now, we have to adjust Equation 7, in which the difference between prediction and ground truth is calculated. We can use the stored predictions directly, since global and local weighting do not consider the time aspect:

$e_{j,n} = \ell(\hat{y}_{j,n}, y_n)$ (27)

Analogously, Equation 17, where the set $P_j$ is defined, which contains all predictions of ensemble member $j$ for a training point $n$ over the time range $1$ to $T$, becomes:

$P_{j,n} = \{\hat{y}_{j,n}^{(1)}, \ldots, \hat{y}_{j,n}^{(T)}\}$ (28)
Fig. 4: In the $i$-th step, we divide the training data into the distinct sets $T_i$ and $V_i$. $T_i$ is used to train a copy of the ensemble members, while $V_i$ is used to make predictions. After the $K$-th iteration, every element of our training data has been predicted, and we can use these predictions for ensemble training.
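This procedure corresponds to out-of-fold prediction, as provided, for instance, by scikit-learn's cross_val_predict; a sketch (the member models in the commented example are placeholders):

```python
import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

def out_of_fold_predictions(members, X, y, k=5):
    """Build the prediction matrix of Eqs. 24-26 via K-fold cross-validation.

    Column j holds the out-of-fold predictions of member j: every sample is
    predicted by a temporary model copy that never saw it during training.
    """
    return np.column_stack([cross_val_predict(m, X, y, cv=k) for m in members])

# members = [SVR(), DecisionTreeRegressor()]
# P = out_of_fold_predictions(members, X_train, y_train)
# Afterwards, each member is re-fit on the full training set (as in the text),
# while P is used to fit the CSGE weights.
```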

IV-G Regularisation Heuristic

The ensemble learning tends to choose a high $\eta$ for one single aspect and therefore low $\eta$ values for the other aspects. As an example, the $\eta_l$ for the local weighting is often chosen very high. This means that the local aspect of the CSGE works as a selecting ensemble that chooses one of the ensemble members.
In order to minimise the regularisation term $\Gamma(\vec{\eta})$, the $\eta$ of the global and time-dependent weighting is then chosen very low. This leads to an averaging ensemble for the global and time-dependent aspects that weights all ensemble members equally with $1/J$, which effectively disables these two aspects.
Even though it can be necessary to disable some aspects, it is often better to distribute the values of the $\eta$'s more evenly to obtain a more generalised ensemble model.

We propose the function $R$ to prevent this problem. $R$ weights the $\eta$'s and penalises choosing an $\eta$ too high or too low. Typically, the parameter $\eta$ lies in a limited range [6].

(29)

The minimisation problem of Equation 23 must be adjusted in the following way:

$\min_{\vec{\eta}} \; \sum_{n=1}^{N} \ell(\hat{y}_n, y_n) + \sum_{a \in \{g, l, t\}} R(\eta_a)$ (30)
Fig. 5: $R$ weights the chosen $\eta$'s to avoid too high or too low values of $\eta$.
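The resulting optimisation problem can be solved with any derivative-free optimiser; a hedged sketch using scipy.optimize (the penalty is passed in as a callable since its concrete form is given by Equation 29; the optimiser choice is our assumption):

```python
import numpy as np
from scipy.optimize import minimize

def fit_etas(training_loss, penalty, eta0=(1.0, 1.0, 1.0)):
    """Optimise (eta_g, eta_l, eta_t) according to Eq. 30.

    training_loss: callable mapping an eta vector to the summed training loss.
    penalty:       callable implementing R for a single eta (Eq. 29).
    """
    objective = lambda etas: training_loss(etas) + sum(penalty(e) for e in etas)
    result = minimize(objective, np.asarray(eta0), method="Nelder-Mead",
                      bounds=[(0.0, None)] * 3)  # the etas must be non-negative
    return result.x
```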

V Experimental Evaluation

In this section, we present the evaluation of the CSGE. We split the evaluation into two steps. First, we show the proper functionality of each of the three different weighting aspects with a distinct synthetic dataset. Second, we evaluate the CSGE on four reference real-world datasets for both regression and classification, including a comparison with other state-of-the-art ensemble methods.

V-A Synthetic Datasets

We created synthetic datasets in order to evaluate each aspect of the CSGE, i.e., global, local, and time-dependent weighting, separately. For each synthetic dataset, we created a data-generating function $f$. Since we are interested in the general functionality, we do not consider any additional noise. Furthermore, we defined some mathematical estimators which have to be combined properly by the CSGE in order to match the function $f$.

V-A1 Global Weighting

For the evaluation of the global weighting, we created $f$ in the following way:

(31)

We use two estimators as ensemble members, which are defined as follows:

(32)
(33)

The result after training the CSGE is depicted in Figure 6. Since there are neither local nor time-dependent aspects, the learning procedure chooses $\eta_l = 0$ and $\eta_t = 0$. When interpreting the chosen global weights from a mathematical point of view, we observe that they are correct. The CSGE perfectly matches $f$.

Fig. 6: The target function $f$ and the two ensemble members are drawn in yellow, green, and red. The values predicted by the CSGE are drawn as blue dots, which exactly match the target function.

V-A2 Local Weighting

For the evaluation of the local weighting, we created $f$ in the following way:

(34)

We use two estimators as ensemble members, which are defined as follows:

(35)
(36)

This experiment has no global and no time-dependent aspect; therefore, the learning algorithm chooses $\eta_g = 0$ and $\eta_t = 0$. Since $f$ can only be approximated by picking either the first or the second estimator depending on the region of the feature space, the chosen $\eta_l$ should be large, such that the local aspect works as gating. In Figure 7, we can see the results of the experiment. We observe that the CSGE is able to perfectly reconstruct the reference model $f$.

Fig. 7: The target function $f$ and the two ensemble members are drawn in red, orange, and green. The values predicted by the CSGE are drawn as blue dots, which exactly match the target function.

V-A3 Time-Dependent Weighting

For the evaluation of the time-dependent weighting, we created $f$ in the following way:

(37)

We use two estimators as ensemble members, which are defined as follows:

(38)
(39)

Figure 8 shows the results of the experiment. Since there is no global and no local aspect, the learning algorithm picks $\eta_g = 0$ and $\eta_l = 0$. We observe that, after training, the CSGE perfectly matches the reference function.

Fig. 8: The target function $f$ is drawn with blue points, while the predictions of the CSGE are drawn with red crosses.

These evaluations on synthetic data show that the CSGE works properly.

V-B Real-world Regression Datasets

In order to evaluate the CSGE on regression problems, we chose the Boston Housing and Diabetes datasets (http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_[boston|diabetes].html, last accessed: 25.06.2018). As ensemble members, we used a Support Vector Regression (SVR) with radial basis function (RBF) kernel, a Neural Network Regressor, and a Decision Tree Regressor. As ensemble methods to compare with the CSGE, we chose Stacking and Voting (i.e., averaging). For Stacking, we used a Neural Network (referred to as ANN Stacking) and a Linear Regression (referred to as Linear Stacking) as meta learners. We chose the RMSE loss to optimise the CSGE. For each dataset, we performed a ten-fold cross-validation with ten different random seeds.

We have chosen default model parameters for the ensemble members as supplied by the scikit-learn library; the parameters of the ensemble methods are optimised for each experiment. To adjust the regularisation parameter and the number of neighbours of the CSGE, we used a grid search. Since the layer sizes of the ANN Stacking also need to be optimised, we applied a grid search there, too. The Linear Regression model has no hyper-parameters to be optimised. As a reference for CSGE and Stacking, we used simple averaging.
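For orientation, a hedged sketch of the baseline comparison on the Diabetes dataset using scikit-learn (model choices follow the description above; the CSGE itself is available from the linked repository and is omitted here):

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import StackingRegressor, VotingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPRegressor
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

X, y = load_diabetes(return_X_y=True)
members = [("svr", SVR(kernel="rbf")),
           ("ann", MLPRegressor(max_iter=2000)),
           ("tree", DecisionTreeRegressor())]

baselines = {
    "Linear Stacking": StackingRegressor(members, final_estimator=LinearRegression()),
    "ANN Stacking": StackingRegressor(members, final_estimator=MLPRegressor(max_iter=2000)),
    "Averaging": VotingRegressor(members),
}
for name, model in baselines.items():
    rmse = -cross_val_score(model, X, y, cv=10,
                            scoring="neg_root_mean_squared_error").mean()
    print(f"{name}: RMSE {rmse:.2f}")
```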

V-B1 Boston Housing

The overall results, i.e., RMSE values, can be seen in Table I. We observe that both CSGE and Stacking achieve better results than each individual ensemble member. The Stacking approach with a Linear Regression meta learner achieves the best results. Even though the CSGE has slightly poorer results compared to Linear Stacking, it performs on par with ANN Stacking.

Ensemble Members
                      Linear Regression   SVM        Decision Tree
Mean                  24.6673             82.3656    21.7313
Standard Deviation     5.8063             14.7024     9.9349
Minimum               17.2139             66.0964    13.6710
Maximum               33.9569            107.5636    46.7434

Ensemble Methods
                      CSGE      Linear Stacking   ANN Stacking   Averaging
Mean                  18.9079   15.9885           18.1849        23.2753
Standard Deviation     8.7271    5.5606            5.6026         8.3834
Minimum                9.8018   10.9614           13.7156        15.7049
Maximum               34.9028   27.6670           32.0342        39.4708

TABLE I: RMSE on the Boston Housing dataset

V-B2 Diabetes

The overall results can be seen in Table II. We can see that every ensemble method achieved worse results compared to the best ensemble member (Linear Regression). The Linear Stacking accomplished the best results of all ensemble methods. Nevertheless, the CSGE performed better than Averaging and ANN Stacking.

Ensemble Members
                      Linear Regression   SVM         Decision Tree
Mean                  3083.1198           6356.0135   6518.2421
Standard Deviation     322.2800            405.2114    728.0636
Minimum               2641.9339           5620.8063   5487.6391
Maximum               3419.9466           6837.5063   7819.4436

Ensemble Methods
                      CSGE        Linear Stacking   ANN Stacking   Averaging
Mean                  3333.9250   3099.9531         3465.7875      3916.1540
Standard Deviation     453.4425    323.7526          553.6934       364.5056
Minimum               2738.1390   2664.3987         2795.0731      3380.1302
Maximum               4273.5943   3459.9381         4878.9214      4392.5727

TABLE II: RMSE on the Diabetes dataset

V-C Real-world Classification Datasets

In order to evaluate the CSGE on classification tasks, we chose the Iris and Wine datasets (http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_[iris|wine].html, last accessed: 25.06.2018). For each dataset, we performed a ten-fold cross-validation with ten different random seeds.

As ensemble members, we used a Support Vector Classification (SVC) with linear and RBF kernel and a Decision Tree Classifier. The SVC with linear kernel is referred to as Linear Classifier, while the SVC with RBF kernel is referred to as SVC. As ensemble methods to compare with the CSGE, we chose Stacking and majority Voting. For Stacking, we used a Neural Network (referred to as ANN Stacking) and an SVC with linear kernel (referred to as Linear Stacking) as meta learners. We chose the accuracy loss to optimise the CSGE.

As before with the regression tasks, we have chosen default model parameters for the ensemble members and only optimised the ensembles' parameters using a grid search. As a reference for CSGE and Stacking, we used the majority Voting ensemble.
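Analogously to the regression sketch above, the classification baselines could be assembled as follows (again a hedged illustration only):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import StackingClassifier, VotingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
members = [("linear", SVC(kernel="linear")),
           ("svc", SVC(kernel="rbf")),
           ("tree", DecisionTreeClassifier())]

baselines = {
    "Linear Stacking": StackingClassifier(members, final_estimator=SVC(kernel="linear")),
    "ANN Stacking": StackingClassifier(members, final_estimator=MLPClassifier(max_iter=2000)),
    "Voting": VotingClassifier(members, voting="hard"),
}
for name, model in baselines.items():
    acc = cross_val_score(model, X, y, cv=10, scoring="accuracy").mean()
    print(f"{name}: accuracy {acc:.3f}")
```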

V-D Iris

The overall results, i.e., classification accuracies, are depicted in Table III. We can see that the CSGE surpassed both Stacking ensembles as well as the reference ensemble method, Voting. However, all ensemble methods achieve worse results than the best single ensemble member, i.e., the SVM.

Ensemble Members
                      Linear Classifier   SVM      Decision Tree
Mean                  0.6000              0.9711   0.9333
Standard Deviation    0.2071              0.0183   0.0181
Minimum               0.2889              0.9333   0.9111
Maximum               0.9778              1.0000   0.9556

Ensemble Methods
                      CSGE     Linear Stacking   ANN Stacking   Voting
Mean                  0.9578   0.6756            0.9378         0.9511
Standard Deviation    0.0221   0.1582            0.0888         0.0204
Minimum               0.9333   0.4000            0.6889         0.9333
Maximum               0.9778   0.9333            0.9778         0.9778

TABLE III: Accuracy on the Iris dataset

Figure 9 shows the ROC curves on the Iris dataset. We can see that the CSGE achieves the best results compared to Stacking and Voting.

Fig. 9: ROC curves of the different ensemble approaches and their ensemble members on the Iris dataset. The Linear Classifier is not visible since it lies below the clipping area.

V-E Wine

The resulting accuracies are depicted in Table IV. We can see that the CSGE achieved the best results compared to Stacking and Voting. Since the Decision Tree is by far the best ensemble member, the CSGE worked as a gating ensemble by selecting only the predictions of the Decision Tree.

Ensemble Members
                      Linear Classifier   SVM      Decision Tree
Mean                  0.4944              0.4204   0.9148
Standard Deviation    0.1197              0.0800   0.0265
Minimum               0.3704              0.3148   0.8704
Maximum               0.6852              0.5185   0.9444

Ensemble Methods
                      CSGE     Linear Stacking   ANN Stacking   Voting
Mean                  0.9148   0.6907            0.8759         0.6704
Standard Deviation    0.0265   0.1615            0.0509         0.1776
Minimum               0.8704   0.4444            0.7593         0.3704
Maximum               0.9444   0.9444            0.9444         0.9259

TABLE IV: Accuracy on the Wine dataset

Figure 10 shows the ROC curves of the classifiers on the Wine dataset. We observe that the CSGE achieves the best results compared to Stacking and Voting.

Fig. 10: ROC curves of the different ensemble approaches and their ensemble members on the Wine dataset. The Linear Classifier is not visible since it lies below the clipping area.

VI Conclusion and Future Work

In this article, we proposed the CSGE for general machine learning tasks. The CSGE is an ensemble which comprises comprehensible weightings based on three basic aspects: global, local, and time-dependent weights.

The CSGE can be optimised according to arbitrary loss functions, making it accessible for a wider range of problems. Moreover, we introduced a novel hyper-parameter initialisation heuristic, enhancing the training process. We showed the applicability and comprehensibility of the approach on synthetic datasets as well as real-world datasets. For the real-world datasets, we showed that our CSGE approach reaches state-of-the-art performance compared to other ensemble methods for both classification and regression tasks.

For future work, we intend to apply the approach to real-world problems in various domains, such as trajectory forecasting of vulnerable road users, and to investigate its applicability in other domains.

VII Acknowledgment

This work results from the project DeCoInt, supported by the German Research Foundation (DFG) within the priority program SPP 1835: "Kooperativ interagierende Automobile" (cooperatively interacting automobiles), grant number SI 674/11-1. This work was also supported within the project Prophesy (0324104A), funded by the BMWi (Deutsches Bundesministerium für Wirtschaft und Energie / German Federal Ministry for Economic Affairs and Energy).

References

  • [1] R. E. Schapire, “The strength of weak learnability,” Machine Learning, vol. 5, no. 2, pp. 197–227, 1990.
  • [2] L. Breiman, “Bagging predictors,” Machine Learning, vol. 24, no. 2, pp. 123–140, 1996.
  • [3] P. Smyth and D. Wolpert, “Linearly combining density estimators via stacking,” Machine Learning, vol. 36, no. 1, pp. 59–83, Jul 1999.
  • [4] L. K. Hansen and P. Salamon, “Neural network ensembles,” TPAMI, vol. 12, no. 10, pp. 993–1001, 1990.
  • [5] R. Kohavi and D. Wolpert, “Bias plus variance decomposition for zero-one loss functions.” in Proc. of the 13th International Conference on Machine Learning, Bari, Italy, 1996, pp. 275–283.
  • [6] A. Gensler and B. Sick, “Forecasting wind power - an ensemble technique with gradual coopetitive weighting based on weather situation,” in IJCNN, Vancouver, BC, 2016, pp. 4976–4984.
  • [7] ——, “A multi-scheme ensemble using coopetitive soft gating with application to power forecasting for renewable energy generation,” CoRR, vol. arXiv:1803.06344, 2018.
  • [8] ——, “Probabilistic wind power forecasting: A multi-scheme ensemble technique with gradual coopetitive soft gating,” in SSCI, Honolulu, HI, 2017, pp. 1–10.
  • [9] C. Müller-Schloer, H. Schmeck, and T. Ungerer, Eds., Organic Computing – A Paradigm Shift for Complex Systems, ser. Autonomic Systems.    Basel, Switzerland: Birkhäuser Verlag, 2011.
  • [10] Y. Ren, L. Zhang, and P. N. Suganthan, “Ensemble classification and regression-recent developments, applications and future directions,” CIM, vol. 11, no. 1, pp. 41–53, 2016.
  • [11] L. Breiman, “Random forests,” Machine Learning, vol. 45, no. 1, pp. 5–32, 2001.
  • [12] M. Gönen and E. Alpaydın, “Multiple kernel learning algorithms,” J. Mach. Learn. Res., vol. 12, pp. 2211–2268, 2011.
  • [13] J. Mendes-Moreira, C. Soares, A. M. Jorge, and J. F. D. Sousa, “Ensemble approaches for regression: A survey,” ACM Comput. Surv., vol. 45, no. 1, pp. 10:1–10:40, 2012.
  • [14] C. M. Bishop, Pattern Recognition and Machine Learning, ser. Information Science and Statistics, M. Jordan, J. Kleinberg, and B. Schölkopf, Eds.    Secaucus, NJ: Springer-Verlag New York, 2006, vol. 1.
  • [15] K. Monteith, J. L. Carroll, K. Seppi, and T. Martinez, “Turning bayesian model averaging into bayesian model combination,” in IJCNN, San Jose, CA, 2011, pp. 2657–2663.
  • [16] N. Shazeer, A. Mirhoseini, K. Maziarz, A. Davis, Q. Le, G. Hinton, and J. Dean, “Outrageously large neural networks: The sparsely-gated mixture-of-experts layer,” in ICLR, Toulon, France, 2017.
  • [17] E. Garmash and C. Monz, “Ensemble learning for multi-source neural machine translation,” in COLING: Technical Papers, Osaka, Japan, 2016, pp. 1409–1418.