Coopetitive Soft Gating Ensemble
Abstract
In this article, we propose the Coopetitive Soft Gating Ensemble (CSGE) for general machine learning tasks. The goal of machine learning is to create models which possess a high generalisation capability. But often problems are too complex to be solved by a single model. Therefore, ensemble methods combine the predictions of multiple models. The CSGE comprises a comprehensible combination based on three different aspects relating to the overall global historical performance, the local/situation-dependent performance, and the time-dependent performance of its ensemble members. The CSGE can be optimised according to arbitrary loss functions, making it accessible for a wide range of problems. We introduce a novel training procedure including a hyperparameter initialisation at its heart. We show that the CSGE approach reaches state-of-the-art performance for both classification and regression tasks. At the same time, the CSGE allows quantifying the influence of all base estimators by means of the three weighting aspects in a comprehensible way. In terms of Organic Computing (OC), our CSGE approach combines multiple base models towards a self-organising complex system. Moreover, we provide a scikit-learn compatible implementation.
I Introduction
The main goal of machine learning is to create models from a set of training data that have a high capability of generalisation. Often, the problems are so complex that one single estimator cannot handle the whole scope. One possible approach is to not use a single estimator but multiple estimators instead. This approach of combining multiple estimators is called an ensemble. In many fields, ensembles can achieve state-of-the-art performance. Popular ensemble methods are Boosting [1], Bagging [2], or Stacking [3]. In [4], it is shown that ensembles often lead to better results than using one single estimator. When considering the bias-variance trade-off [5], ensembles can reduce both variance and bias and therefore result in strong models.
In many use cases, it is desired to have models that offer a high comprehensibility of the relation between data and model operation. Since Boosting, Bagging, and Stacking more or less work like a black box, a fairly new ensemble method called the Coopetitive Soft Gating Ensemble (CSGE) was proposed. In [6], [7], and [8], it is statistically shown that the CSGE can achieve state-of-the-art performance in the area of power forecasting. In this article, we aim to extend the approach to general machine learning problems.
Each estimator of an ensemble is called an ensemble member. The idea of the CSGE is to gradually weight a number of ensemble members according to their historically observed performance with respect to various aspects. There are three aspects which influence the weighting: first, the overall performance of the estimator; second, the local performance of the estimator in similar historical situations; third, time-dependent effects modelling the autocorrelation in the estimator's outcome.
The ability to assess the individual performance of each base estimator for these three aspects can, in terms of Organic Computing (OC) [9] and Autonomic Computing (AC), be interpreted as a self-awareness capability. Our CSGE approach combines multiple base estimators towards a self-organising complex system with only a few parameters to be determined.
II Main Contribution
The main contribution of this article is an extended coopetitive soft gating ensemble approach. It generalises the method proposed in [6, 7] for wind power forecasting to other machine learning tasks including regression, classification, and time series forecasting. The main contributions of this article can be summarised as follows:

The loss function to be optimised by the CSGE is arbitrary and can be chosen by the user with only minimal conditions.

A novel heuristic to choose the hyperparameters of the CSGE training algorithm, reducing the required number of adjustable parameters.

In an extensive evaluation of our approach on common real-world reference datasets, we show that our CSGE approach reaches state-of-the-art performance compared to other ensemble methods. Moreover, the CSGE allows quantifying the influence of all base estimators by means of the three weighting aspects in a comprehensible way.

A scikit-learn compatible implementation of the CSGE (https://git.ies.uni-kassel.de/csge/csge).
The remainder of this article is structured as follows. In Section III, we review the related work in the field of ensemble methods for machine learning. Then, in Section IV, we introduce the CSGE approach, outlining the main principles, the three weighting aspects, and the algorithm for ensemble training. In Section V, we present the evaluation of the CSGE approach on three synthetic datasets and four real-world reference classification and regression datasets. Finally, in Section VI, the main conclusions and open issues for future work are discussed.
III Related Work
In machine learning, the term ensemble describes the combination of multiple models. The ensemble comprises a finite set of estimators whose predictions are aggregated to form the ensemble prediction. The theoretical justification of why ensembles can increase the overall predictive performance is given by the bias-variance decomposition [5]. The key to ensemble methods is model diversification, i.e., how to create sufficiently different models from sample data. A comprehensive review of ensembles is given in [10]. Among others, the most important design principles for ensembles are data, parameter, and structural diversity. Data diversity comprises ensembles trained on different subsets of the data. Well-known representatives of this type are bagging [2], boosting [1], and random forest [11]. The idea of parameter diversity is to induce diversity into the ensemble by varying the parameters of the ensemble members. A representative of this type is the multiple kernel learning algorithm [12], in which multiple kernels are combined. Lastly, structural diversity comprises the combination of different models, e.g., obtained by applying different learning algorithms or variable model types. These ensembles are also referred to as heterogeneous ensembles [13]. A well-known representative of this type is the stacking algorithm [3]. Another ensemble technique, Bayesian model averaging (BMA) [14], accounts for model uncertainty when deriving parameter estimates. Hence, the ensemble estimate comprises the weighted estimates of various model hypotheses. Another method, not to be confused with BMA, is Bayesian model combination [15]. It overcomes the shortcoming of BMA of converging to a single model. Recently, mixture-of-experts models, which comprise a gating model weighting the outputs of different submodels, have gained a lot of attention, as they achieve state-of-the-art performance in language modelling [16] and multi-source machine translation [17]. These approaches are based on deep neural networks; hence, they require many training samples and their weightings are barely comprehensible.
In [6, 7], the CSGE was presented in the context of renewable energy power forecasting. It comprises a hierarchical two-stage ensemble prediction system and weights the ensemble members' predictions based on three aspects, namely overall, local, and time-dependent performance. In [8], the system was extended to handle probabilistic forecasts. The approach presented in this article is a generalisation of this method to other machine learning tasks.
IV Method
In this section, the novel Coopetitive Soft Gating Ensemble method, or CSGE for short, as proposed in [8], is introduced. After a brief general overview, we detail the different characteristics of the ensemble method, namely soft gating as well as global, local, and time-dependent weighting. In the final subsections, we give details and recommendations for training.
IV-A Coopetitive Soft Gating Ensemble
The architecture of the CSGE, as depicted in Figure 1, highlights the three weighting aspects: global, local, and time-dependent weighting. For each of the weighting aspects, the novel coopetitive soft gating principle is applied. Coopetitive soft gating is a conglomerate of cooperation and competition. The technique combines two well-known principles in ensemble methods, weighting and gating. Weighting combines all ensemble members in a linear combination, while gating selects exactly one of the ensemble members. The idea of the CSGE is to allow a mixture of both weighting and gating and to let the ensemble itself choose which concept to use for the combination of the different predictions.
Each of the three coopetitive soft gating weighting aspects takes the $J$ ensemble members as input. Each ensemble member $j$ provides an estimation $\hat{y}_{j,t}$ for the input $x$, where $t$ denotes the timestamp, also called lead time, when operating on time series, and $j$ is the index of the $j$-th ensemble member. For each prediction and estimator, the CSGE calculates the global, local, and time-dependent weighting and aggregates the results. All weights are normalised, and each prediction is weighted with the corresponding normalised weight $w_{j,t}$. In the final step, the ensemble prediction $\hat{y}_{t}$ is obtained by aggregation of the weighted predictions, as follows:

$\hat{y}_{t} = \sum_{j=1}^{J} w_{j,t} \cdot \hat{y}_{j,t} \qquad (1)$

To ensure that the prediction is not distorted, the weights have to fulfil the following constraint:

$\sum_{j=1}^{J} w_{j,t} = 1, \quad w_{j,t} \geq 0 \qquad (2)$
The optimal weights $w_{j,t}$ are obtained by the CSGE w.r.t. an arbitrary loss function, e.g., mean squared error, cross-entropy, etc. Each weighting aspect has different characteristics related to the loss function, summarised as follows:

The global weights are determined for the respective model regarding its overall performance observed during ensemble training. This is a fixed weighting term. Thereby, overall strong models have more influence than weaker models.

Local weighting considers the fact that different ensemble members have varying prediction quality across the feature space. As an example, when considering the problem of renewable energy prediction, an ensemble member could perform well on rainy weather inputs but have worse quality on sunny weather inputs. Therefore, the local weighting assigns a higher weight to ensemble members that performed well on similar input data. These weights are adjusted for each prediction during runtime.

The time-dependent weighting aspect is used when performing predictions on time series. Ensemble members may perform differently for different lead times. When considering the problem of renewable energy prediction again, one method might achieve superior results on short time horizons, while quickly losing quality for longer horizons. Other methods may perform worse on short-term predictions but have greater stability over time. These weights are calculated for each prediction during runtime.
Combining the three aspects, the overall weighting for a specific ensemble member $j$ is calculated by the following equation:

$\tilde{w}_{j,t} = w^{g}_{j} \cdot w^{l}_{j} \cdot w^{t}_{j,t} \qquad (3)$

where $w^{g}_{j}$ is the global weighting, $w^{l}_{j}$ is the local weighting computed for the current input, and $w^{t}_{j,t}$ is the time-dependent weighting. To calculate the final weighting, the values are normalised for the $j$-th ensemble member with the following equation:

$w_{j,t} = \frac{\tilde{w}_{j,t}}{\sum_{k=1}^{J} \tilde{w}_{k,t}} \qquad (4)$
This equation also ensures that the constraint of Equation 2 is fulfilled.
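As a concrete illustration of Equations 1, 3, and 4, the following minimal Python sketch combines the three weighting aspects per ensemble member, normalises the result, and aggregates the members' predictions. All names are illustrative and not part of the referenced scikit-learn compatible implementation.

import numpy as np

def combine_weights(w_global, w_local, w_time):
    # Multiply the three weighting aspects per ensemble member (Eq. 3) and
    # normalise so that the combined weights sum to one (Eq. 4).
    w = np.asarray(w_global) * np.asarray(w_local) * np.asarray(w_time)
    return w / w.sum()

def ensemble_prediction(member_predictions, weights):
    # Aggregate the members' predictions with the normalised weights (Eq. 1).
    return float(np.dot(weights, member_predictions))

# Example with three ensemble members
w = combine_weights([0.5, 0.3, 0.2], [0.2, 0.6, 0.2], [1.0, 1.0, 1.0])
print(w, ensemble_prediction(np.array([1.0, 1.2, 0.8]), w))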
IV-B Soft Gating Principle
The primary goal of the CSGE is to increase the quality of the prediction by weighting robust predictors higher than predictors with worse quality. Traditionally in ensemble methods, one of the two paradigms, weighting or gating, is used to combine the individual ensemble members. The soft gating approach of the CSGE introduces a novel method which allows a mixture of both weighting and gating and lets the ensemble itself choose which concept to use for the combination of the different predictions. Moreover, the soft gating approach applies to all three weighting aspects.
To evaluate the quality of an individual ensemble member, a function is needed which maps the prediction error of an estimator to its weighting. To do so, the soft gating function $\varsigma_{\eta}$ is used to determine the weight of estimator $j$ as follows:
$\varsigma_{\eta}(R, p_{j}) = \left(\frac{\sum_{\rho \in R} \rho}{p_{j} + \epsilon}\right)^{\eta} \qquad (5)$
$R$ is a set which contains the reference errors of all estimators, while $p_{j}$ is the error of estimator $j$. The parameter $\eta \geq 0$ is chosen by the user and controls the linearity of the weighting. For a greater $\eta$, the CSGE tends to work as gating, while a smaller $\eta$ results in a weighting approach.
By taking a closer look at $\varsigma_{\eta}$ in Figure 2, we can discover the following characteristics:

The function is monotonically decreasing in the error $p_{j}$.

For a larger $\eta$ or a larger error of an ensemble member, $\varsigma_{\eta}$ returns a smaller relative weighting.

For $\eta = 0$, every ensemble member is weighted equally, disregarding the error of its prediction.

$\epsilon$ is a small constant to prevent a division by zero.

To ensure that the resulting weights sum to one, the soft gating values are normalised over all ensemble members:

$\varsigma_{\eta}(R, p_{j}) \;\rightarrow\; \frac{\varsigma_{\eta}(R, p_{j})}{\sum_{k=1}^{J} \varsigma_{\eta}(R, p_{k})} \qquad (6)$
Besides having only one parameter ($\eta$), the soft gating offers the advantage of operating on the direct errors of the ensemble members and therefore has a strong correlation to the actual data.
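To make the principle concrete, the following minimal sketch implements a soft gating function with the properties listed above. It is an assumption-based illustration; the exact formula used in [6]-[8] may differ in detail, and the function name is illustrative.

import numpy as np

def soft_gating_weights(errors, eta, eps=1e-9):
    # Weights fall monotonically with the error, eta = 0 yields equal
    # weights, a large eta approaches hard gating, and eps prevents a
    # division by zero; the result is normalised to sum to one.
    errors = np.asarray(errors, dtype=float)
    raw = (errors.sum() / (errors + eps)) ** eta
    return raw / raw.sum()

# eta = 0 -> averaging, large eta -> (almost) gating
print(soft_gating_weights([0.2, 0.4, 0.8], eta=0.0))   # approx. [0.33, 0.33, 0.33]
print(soft_gating_weights([0.2, 0.4, 0.8], eta=8.0))   # close to [1, 0, 0]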
IV-C Global Weighting
The global weighting is calculated during ensemble training and then remains constant. Ensemble members that perform well on the training data get greater weights compared to those that showed a worse performance. Therefore, the difference between estimation and ground truth is calculated with

$e_{j,n} = \mathcal{L}(\hat{y}_{j,n}, y_{n}) \qquad (7)$

$\hat{y}_{j,n}$ is the prediction of the $j$-th ensemble member for the $n$-th training sample, while $y_{n}$ is the corresponding ground truth. $\mathcal{L}$ is an arbitrary scoring function, which could, for example, be the root mean squared error (RMSE) for regression or the accuracy score (ACC) for classification. The only condition is that the score has to be monotonic in the prediction error in order to work correctly with the soft gating principle, see Equation 6. The error score of the $j$-th ensemble member is calculated by

$p^{g}_{j} = \frac{1}{N} \sum_{n=1}^{N} e_{j,n} \qquad (8)$

$P^{g} = \{p^{g}_{1}, \dots, p^{g}_{J}\} \qquad (9)$

By applying the soft gating principle to the set of all error scores of the ensemble members, we obtain the final global weighting with

$w^{g}_{j} = \varsigma_{\eta_{g}}(P^{g}, p^{g}_{j}) \qquad (10)$
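A sketch of the global weighting (Equations 7 to 10) is given below, reusing the soft_gating_weights function from the sketch in Section IV-B; the loss function and all names are illustrative assumptions.

import numpy as np
from sklearn.metrics import mean_squared_error

def global_weights(predictions, y_true, eta_g, loss=mean_squared_error):
    # predictions has shape (n_members, n_samples); every member is scored
    # on the whole training set and the error scores are passed through the
    # soft gating function.
    scores = np.array([loss(y_true, p) for p in predictions])  # p_j^g
    return soft_gating_weights(scores, eta_g)                  # w_j^g

y = np.array([1.0, 2.0, 3.0])
preds = np.array([[1.1, 2.1, 2.9],    # strong member
                  [2.0, 3.0, 1.0]])   # weak member
print(global_weights(preds, y, eta_g=2.0))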
IV-D Local Weighting
The local weighting considers the quality differences between the predictors in distinct situations across the feature space. Therefore, the local weighting assigns a higher weight to ensemble members that performed well on similar input data. In contrast to the global weighting, the local weighting is calculated for each estimation during runtime.

To measure the similarity of situations, we consider distances in the input feature space, assuming that situations with a low distance have more in common than situations with a larger distance. The set $H$ contains all data that was used during ensemble training. Often, the features of $H$ vary in their ranges and information value. Since we use feature distances to determine similar situations, it can be useful to apply a principal component analysis (PCA) to the history data $H$:

$\tilde{H} = \mathrm{PCA}_{d}(H) \qquad (11)$

$\tilde{H}$ is the transformed history set, which has a dimension of $d$. The parameter $d$ is chosen by the user and can be set in the range $1 \leq d \leq D$, where $D$ is the number of features of $H$. To calculate the local weight of a new prediction, we have to transform the input data $x$ into the transformed feature space by applying the PCA:

$\tilde{x} = \mathrm{PCA}_{d}(x) \qquad (12)$

By using a k-nearest neighbour search, we can determine similar situations in the input data:

$N_{k}(\tilde{x}) = \mathrm{kNN}(\tilde{H}, \tilde{x}, k) \qquad (13)$

The set $N_{k}(\tilde{x})$ of similar situations is used to look up the errors $e_{j,n}$ of these situations in order to obtain the average local error:

$p^{l}_{j} = \frac{1}{k} \sum_{n \in N_{k}(\tilde{x})} e_{j,n} \qquad (14)$

This equation is applied for each ensemble member to obtain $P^{l}$, the set of all local error scores of all ensemble members:

$P^{l} = \{p^{l}_{1}, \dots, p^{l}_{J}\} \qquad (15)$

Finally, the local weight is calculated using the soft gating principle:

$w^{l}_{j} = \varsigma_{\eta_{l}}(P^{l}, p^{l}_{j}) \qquad (16)$
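The local weighting can be sketched with standard scikit-learn components, again reusing soft_gating_weights from Section IV-B; the function signature and parameter names are illustrative assumptions rather than the interface of the referenced implementation.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import NearestNeighbors

def local_weights(H, member_errors, x, eta_l, d=2, k=10):
    # H: history inputs, shape (n_samples, n_features)
    # member_errors: per-sample errors, shape (n_members, n_samples)
    # x: single query input, shape (n_features,)
    pca = PCA(n_components=d).fit(H)
    H_t = pca.transform(H)                                # Eq. (11)
    x_t = pca.transform(np.asarray(x).reshape(1, -1))     # Eq. (12)
    nn = NearestNeighbors(n_neighbors=k).fit(H_t)
    idx = nn.kneighbors(x_t, return_distance=False)[0]    # Eq. (13)
    p_local = member_errors[:, idx].mean(axis=1)          # Eqs. (14)-(15)
    return soft_gating_weights(p_local, eta_l)            # Eq. (16)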
IV-E Time-Dependent Weighting
The time-dependent weighting considers the fact that the quality of an ensemble member varies over the lead time. As with the local weighting, the time-dependent weighting is calculated for each estimation. The set $\hat{Y}_{j}$ contains all predictions of estimator $j$ from time $t=1$ to time $T$:

$\hat{Y}_{j} = \{\hat{y}_{j,1}, \dots, \hat{y}_{j,T}\} \qquad (17)$

The error for a specific lead time $t$ is calculated by averaging the output error over all training samples:

$e_{j,n,t} = \mathcal{L}(\hat{y}_{j,n,t}, y_{n,t}) \qquad (18)$

$p_{j,t} = \frac{1}{N} \sum_{n=1}^{N} e_{j,n,t} \qquad (19)$

with $\hat{y}_{j,n,t}$ as the prediction and $y_{n,t}$ as the corresponding ground truth for time $t$. To calculate the error score for time $t$ of estimator $j$, we use the following equation:

$p^{t}_{j,t} = \frac{p_{j,t}}{\frac{1}{T} \sum_{t'=1}^{T} p_{j,t'}} \qquad (20)$

$p^{t}_{j,t}$ is a measure that compares the error of the prediction at lead time $t$ to the average error in the time interval $1, \dots, T$. The weight is calculated analogously to the global and local weighting using the soft gating principle with

$P^{t}_{t} = \{p^{t}_{1,t}, \dots, p^{t}_{J,t}\} \qquad (21)$

$w^{t}_{j,t} = \varsigma_{\eta_{t}}(P^{t}_{t}, p^{t}_{j,t}) \qquad (22)$
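Assuming the per-lead-time training errors have already been computed, the time-dependent weighting can be sketched as follows (soft_gating_weights as in Section IV-B; the relative-error normalisation follows Equation 20 and is an assumption of this sketch).

import numpy as np

def time_weights(errors_per_leadtime, eta_t):
    # errors_per_leadtime: average training errors per member and lead time,
    # shape (n_members, n_leadtimes). Each lead time's error is related to
    # the member's mean error over the whole horizon and passed through
    # soft gating, yielding one weight vector per lead time.
    e = np.asarray(errors_per_leadtime, dtype=float)
    rel = e / e.mean(axis=1, keepdims=True)                  # Eq. (20)
    return np.array([soft_gating_weights(rel[:, t], eta_t)   # Eqs. (21)-(22)
                     for t in range(e.shape[1])])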
IV-F Model Fusion and Ensemble Training
To find the optimal set of parameters for the predictions, including all three weighting aspects for all ensemble members, we aim to optimise the prediction of Equation 1. Since there are three aspects, namely global, local, and time-dependent weighting, there are also three soft gating parameters $\vec{\eta} = (\eta_{g}, \eta_{l}, \eta_{t})$ to be chosen. As mentioned previously, the parameter $\eta$ controls the non-linearity of the weighting. The following minimisation problem is solved to adjust $\vec{\eta}$:

$\vec{\eta}^{*} = \operatorname*{arg\,min}_{\vec{\eta}} \; E(\vec{\eta}) + R(\vec{\eta}), \quad E(\vec{\eta}) = \sum_{n=1}^{N} \mathcal{L}(\hat{y}_{n}, y_{n}) \qquad (23)$

where $\hat{y}_{n}$ is the prediction from Equation 1 given the current weights, $E(\vec{\eta})$ is the summed error over the training data, and $R(\vec{\eta})$ is a regularisation term to control overfitting.
However, to optimise Equation 23, adjust $\vec{\eta}$, and calculate the global weighting, we need a set of training data. In general, the ensemble members are trained on a training set and validated on a validation set. By using the same training set to train the CSGE, it will often overfit and not generalise well. Therefore, we need training data for the CSGE that is not used to train the ensemble members. A simple method is shown in Figure 3: the training data is split into two sets, one to train the ensemble members and one to train the CSGE itself. Besides being very simple, this setup has a disadvantage: because the training set for the ensemble members and the training set for the CSGE need to be distinct, training data is wasted.
A more advanced approach, shown in Figure 4, allows using the training data more efficiently. Since the CSGE is trained on the output data of the predictors, we need such data for training. Therefore, a cross-validation with $K$ folds is used to generate it. In the $i$-th step of this $K$-fold procedure, the training set is split into a set $D^{(i)}_{\mathrm{train}}$ for training and a set $D^{(i)}_{\mathrm{pred}}$ for prediction. Then, a copy of the $j$-th ensemble member is trained on $D^{(i)}_{\mathrm{train}}$. This temporary predictor is denoted by $m^{(i)}_{j}$, where $i$ indexes the fold and $j$ the ensemble member. The temporary predictor is used to predict $D^{(i)}_{\mathrm{pred}}$:

$\hat{Y}^{(i)}_{j} = m^{(i)}_{j}\bigl(D^{(i)}_{\mathrm{pred}}\bigr) \qquad (24)$

After the $K$-th iteration, we take the union of all created predictions:

$\hat{Y}_{j} = \bigcup_{i=1}^{K} \hat{Y}^{(i)}_{j} \qquad (25)$

After that, all ensemble members are trained on the whole training set. Because the training data of the CSGE now consists of the output data of the estimators, we have to adjust the calculation of the CSGE. Therefore, we store the predictions in an $N \times J \times T$ dimensional matrix, where $N$ is the number of samples, $J$ the number of estimators, and $T$ the number of time steps when operating on time series:

$\hat{Y} \in \mathbb{R}^{N \times J \times T} \qquad (26)$
Now, we have to adjust Equation 7, in which the difference between prediction and ground truth is calculated. Since the global and local weighting do not consider the time aspect, the error of a training sample is averaged over all lead times:

$e_{j,n} = \frac{1}{T} \sum_{t=1}^{T} \mathcal{L}(\hat{y}_{j,n,t}, y_{n,t}) \qquad (27)$

Analogously, Equation 17, in which the set of predictions of an ensemble member is defined, now refers to a single training point $n$ and contains its predictions over the time range $t=1$ to $T$:

$\hat{Y}_{j,n} = \{\hat{y}_{j,n,1}, \dots, \hat{y}_{j,n,T}\} \qquad (28)$
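The generation of the CSGE training data can be sketched with scikit-learn's cross-validation utilities. The function below is a minimal, non-time-series illustration with assumed names, not the interface of the referenced implementation.

import numpy as np
from sklearn.base import clone
from sklearn.model_selection import KFold

def out_of_fold_predictions(members, X, y, n_splits=5):
    # For every fold, train a copy of each ensemble member on the remaining
    # folds and predict the held-out part, so that every training sample
    # receives an out-of-fold prediction from every member (Eqs. 24-26).
    # Returns an array of shape (n_samples, n_members).
    oof = np.zeros((len(X), len(members)))
    for train_idx, pred_idx in KFold(n_splits=n_splits).split(X):
        for j, member in enumerate(members):
            m = clone(member).fit(X[train_idx], y[train_idx])   # temporary predictor m_j^(i)
            oof[pred_idx, j] = m.predict(X[pred_idx])           # Eq. (24)
    return oof                                                  # union over all folds, Eq. (25)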
IV-G Regularisation Heuristic
The ensemble training tends to choose a high $\eta$ for one single aspect and, consequently, low $\eta$ values for the other aspects. As an example, the $\eta$ for the local weighting is often chosen very high. This means that the local aspect of the CSGE works as a selecting ensemble that chooses exactly one of the ensemble members.

In order to minimise the regularisation term, the $\eta$ values of the global and time-dependent weighting are then chosen very low. This leads to an averaging ensemble for the global and time-dependent aspects, which weights all ensemble members equally and thereby effectively disables these two aspects.

Even though it can be necessary to disable some aspects, it is often better to distribute the values of the $\eta$'s more evenly to obtain a more general ensemble model.

We propose the function $\Omega(\vec{\eta})$ to prevent this problem. $\Omega$ weights the $\eta$'s and penalises choosing an individual $\eta$ too high. Typically, its parameter is chosen in the range reported in [6].
(29) 
The minimisation problem of Equation 23 must be adjusted in the following way:

$\vec{\eta}^{*} = \operatorname*{arg\,min}_{\vec{\eta}} \; E(\vec{\eta}) + \Omega(\vec{\eta}) \qquad (30)$
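A generic way to carry out this minimisation is sketched below. Since the article does not fix a particular optimiser, a derivative-free SciPy routine is assumed, and the regulariser argument stands in for the penalty function of Equation 29; all names are illustrative.

import numpy as np
from scipy.optimize import minimize

def fit_etas(oof_predictions, y, weight_fn, loss_fn, regulariser,
             eta0=(1.0, 1.0, 1.0)):
    # weight_fn(etas) must return normalised member weights of shape
    # (n_samples, n_members); regulariser(etas) implements the penalty term.
    def objective(etas):
        w = weight_fn(etas)
        y_hat = np.sum(w * oof_predictions, axis=1)   # Eq. (1) per sample
        return loss_fn(y, y_hat) + regulariser(etas)
    res = minimize(objective, x0=np.asarray(eta0), method="Powell",
                   bounds=[(0.0, None)] * 3)          # eta >= 0
    return res.x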
V Experimental Evaluation
In this section, we present the evaluation of the CSGE. We split the evaluation into two steps. First, we show the proper functionality of each of the three weighting aspects on a distinct synthetic dataset. Second, we evaluate the CSGE on four real-world reference datasets for both regression and classification, including a comparison with other state-of-the-art ensemble methods.
V-A Synthetic Datasets
We created synthetic datasets in order to evaluate each aspect of the CSGE, i.e., global, local, and time-dependent weighting, separately. For each synthetic dataset, we created a data-generating function $f$. Since we are interested in the general functionality, we do not consider any additional noise. Furthermore, we defined some mathematical estimators which have to be combined properly by the CSGE in order to match the function $f$.
V-A1 Global Weighting
For the evaluation of the global weighting, we created $f$ in the following way:
(31) 
We use two estimators as ensemble members, which are defined as follows:
(32) 
(33) 
The result after training the CSGE is depicted in Fig. 6. Since there are neither local nor time-dependent aspects in this setup, the learning procedure deactivates these two weighting aspects. When interpreting the chosen global weights from a mathematical point of view, we observe that they are correct, i.e., the weighted combination of the two ensemble members reproduces $f$. The CSGE perfectly matches $f$.
V-A2 Local Weighting
For the evaluation of the local weighting, we created $f$ in the following way:
(34) 
We use two estimators as ensemble members, which are defined as follows:
(35) 
(36) 
This experiment has no global and no time-dependent aspect; therefore, the learning algorithm deactivates these two weighting aspects. Since $f$ can only be approximated by picking either one of the two estimators depending on the feature space, the chosen $\eta$ for the local weighting should be large, i.e., the local aspect should act as gating. In Figure 7, we can see the results of the experiment. We observe that the CSGE is able to perfectly reconstruct the reference model $f$.
V-A3 Time-Dependent Weighting
For the evaluation of the time-dependent weighting, we created $f$ in the following way:
(37) 
We use two estimators as ensemble members, which are defined as follows:
(38) 
(39) 
Figure 8 shows the results of the experiment. Since there is no global and no local aspect, the learning algorithm deactivates these two weighting aspects. We observe that, after training, the CSGE perfectly matches the reference function.
These evaluations on synthetic data show that the CSGE works properly.
V-B Real-World Regression Datasets
In order to evaluate the CSGE on regression problems, we chose the Boston Housing and Diabetes datasets (http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_[boston|diabetes].html, last accessed: 25.06.2018). As ensemble members, we used a Support Vector Regression (SVR) with a radial basis function (RBF) kernel, a Neural Network Regressor, and a Decision Tree Regressor. As ensemble methods to compare against the CSGE, we chose Stacking and Voting (i.e., averaging). For Stacking, we used a Neural Network (referred to as ANN Stacking) and a Linear Regression (referred to as Linear Stacking) as meta learners. We chose the RMSE loss to optimise the CSGE. For each dataset, we performed a ten-fold cross-validation with ten different random seeds.

We have chosen default model parameters for the ensemble members as supplied by the scikit-learn library, while the parameters of the ensemble methods are optimised for each experiment. To adjust the regularisation parameter and the number of neighbours $k$ of the CSGE, we used a grid search. Since the layer size of the ANN Stacking also needs to be optimised, we applied a grid search there, too. The Linear Regression model has no hyperparameters to be optimised. As a reference to the CSGE and Stacking, we used simple Averaging.
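The evaluation protocol can be sketched as follows. This is only an illustration under assumptions (default scikit-learn estimators, ten-fold cross-validation repeated with ten seeds, errors of the individual members only) and not the original evaluation script.

import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

# The Diabetes dataset; the Boston Housing data can be evaluated analogously.
X, y = load_diabetes(return_X_y=True)
members = [LinearRegression(), SVR(kernel="rbf"), DecisionTreeRegressor()]

scores = {type(m).__name__: [] for m in members}
for seed in range(10):                                   # ten different random seeds
    for train_idx, test_idx in KFold(n_splits=10, shuffle=True,
                                     random_state=seed).split(X):
        for m in members:
            fitted = clone(m).fit(X[train_idx], y[train_idx])
            err = mean_squared_error(y[test_idx], fitted.predict(X[test_idx]))
            scores[type(m).__name__].append(err)

for name, errs in scores.items():
    print(f"{name}: mean={np.mean(errs):.2f} std={np.std(errs):.2f}")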
V-B1 Boston Housing
The overall results, i.e., the RMSE values, can be seen in Table I. We observe that both the CSGE and Stacking achieve better results than each individual ensemble member. The Stacking approach with a Linear Regression meta learner achieves the best results. Even though the CSGE yields slightly poorer results than Linear Stacking, its performance is on par with ANN Stacking.
Table I: Results on the Boston Housing dataset.

Ensemble Members
                      Linear Regression   SVM        Decision Tree
Mean                  24.6673             82.3656    21.7313
Standard Deviation     5.8063             14.7024     9.9349
Minimum               17.2139             66.0964    13.6710
Maximum               33.9569            107.5636    46.7434

Ensemble Methods
                      CSGE      Linear Stacking   ANN Stacking   Averaging
Mean                  18.9079   15.9885           18.1849        23.2753
Standard Deviation     8.7271    5.5606            5.6026         8.3834
Minimum                9.8018   10.9614           13.7156        15.7049
Maximum               34.9028   27.6670           32.0342        39.4708
V-B2 Diabetes
The overall results can be seen in Table II. We can see that every ensemble method achieved worse results than the best ensemble member (Linear Regression). Linear Stacking accomplished the best results among the ensemble methods. Nevertheless, the CSGE performed better than Averaging and ANN Stacking.
Table II: Results on the Diabetes dataset.

Ensemble Members
                      Linear Regression   SVM         Decision Tree
Mean                  3083.1198           6356.0135   6518.2421
Standard Deviation     322.2800            405.2114    728.0636
Minimum               2641.9339           5620.8063   5487.6391
Maximum               3419.9466           6837.5063   7819.4436

Ensemble Methods
                      CSGE        Linear Stacking   ANN Stacking   Averaging
Mean                  3333.9250   3099.9531         3465.7875      3916.1540
Standard Deviation     453.4425    323.7526          553.6934       364.5056
Minimum               2738.1390   2664.3987         2795.0731      3380.1302
Maximum               4273.5943   3459.9381         4878.9214      4392.5727
V-C Real-World Classification Datasets
In order to evaluate the CSGE on classification tasks, we chose the Iris and Wine datasets (http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_[iris|wine].html, last accessed: 25.06.2018). As ensemble members, we used a Support Vector Classification (SVC) with a linear and an RBF kernel and a Decision Tree Classifier. The SVC with linear kernel is referred to as Linear Classifier, while the SVC with RBF kernel is referred to as SVC. As ensemble methods to compare against the CSGE, we chose Stacking and majority Voting. For Stacking, we used a Neural Network (referred to as ANN Stacking) and an SVC with linear kernel (referred to as Linear Stacking) as meta learners. We chose the accuracy score to optimise the CSGE. For each dataset, we performed a ten-fold cross-validation with ten different random seeds.
As before with the regression tasks, we have chosen default model parameters for the ensemble members and only optimised the ensembles' parameters using a grid search. As a reference to the CSGE and Stacking, we used the majority Voting ensemble.
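Since the soft gating principle operates on errors where smaller values are better, a score such as accuracy has to be turned into an error term. One straightforward possibility, assumed here for illustration and not necessarily the exact choice of the authors, is the misclassification rate:

from sklearn.metrics import accuracy_score

def misclassification_rate(y_true, y_pred):
    # Error score for classification: 1 - accuracy, so that smaller values
    # indicate better ensemble members, as required by the soft gating.
    return 1.0 - accuracy_score(y_true, y_pred)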
V-D Iris
The overall results, i.e., classification accuracies, are depicted in Table III. We can see that the CSGE surpassed both Stacking ensembles as well as the reference ensemble method, Voting. All ensemble methods achieve worse results than the best single ensemble member, i.e., the SVM.
Table III: Classification accuracies on the Iris dataset.

Ensemble Members
                      Linear Classifier   SVM      Decision Tree
Mean                  0.6000              0.9711   0.9333
Standard Deviation    0.2071              0.0183   0.0181
Minimum               0.2889              0.9333   0.9111
Maximum               0.9778              1.0000   0.9556

Ensemble Methods
                      CSGE     Linear Stacking   ANN Stacking   Voting
Mean                  0.9578   0.6756            0.9378         0.9511
Standard Deviation    0.0221   0.1582            0.0888         0.0204
Minimum               0.9333   0.4000            0.6889         0.9333
Maximum               0.9778   0.9333            0.9778         0.9778
Figure 9 shows the ROC curves for the Iris dataset; we can see that the CSGE achieves the best results compared to Stacking and Voting.
V-E Wine
The resulting accuracies are depicted in Table IV. We can see that the CSGE achieved the best results compared to Stacking and Voting. Since the Decision Tree is by far the best ensemble member, the CSGE worked as a gating ensemble by selecting only the predictions of the Decision Tree.
Table IV: Classification accuracies on the Wine dataset.

Ensemble Members
                      Linear Classifier   SVM      Decision Tree
Mean                  0.4944              0.4204   0.9148
Standard Deviation    0.1197              0.0800   0.0265
Minimum               0.3704              0.3148   0.8704
Maximum               0.6852              0.5185   0.9444

Ensemble Methods
                      CSGE     Linear Stacking   ANN Stacking   Voting
Mean                  0.9148   0.6907            0.8759         0.6704
Standard Deviation    0.0265   0.1615            0.0509         0.1776
Minimum               0.8704   0.4444            0.7593         0.3704
Maximum               0.9444   0.9444            0.9444         0.9259
Figure 10 shows the ROC curves of the classifiers on the Wine dataset. We observe that the CSGE achieves the best results compared to Stacking and Voting.
VI Conclusion and Future Work
In this article, we proposed the CSGE for general machine learning tasks. The CSGE is an ensemble which comprises comprehensible weightings based on three basic aspects, namely global, local, and time-dependent weights.

The CSGE can be optimised according to arbitrary loss functions, making it accessible for a wider range of problems. Moreover, we introduced a novel hyperparameter initialisation heuristic, enhancing the training process. We showed the applicability and comprehensibility of the approach on synthetic datasets as well as real-world datasets. For the real-world datasets, we showed that our CSGE approach reaches state-of-the-art performance compared to other ensemble methods for both classification and regression tasks.

For future work, we intend to apply the approach to real-world problems in various domains, such as trajectory forecasting of vulnerable road users, and to investigate its applicability in other domains.
VII Acknowledgment
This work results from the project DeCoInt, supported by the German Research Foundation (DFG) within the priority program SPP 1835: “Kooperativ interagierende Automobile”, grant number SI 674/11-1. This work was also supported within the project Prophesy (0324104A) funded by the BMWi (Deutsches Bundesministerium für Wirtschaft und Energie / German Federal Ministry for Economic Affairs and Energy).
References
 [1] R. E. Schapire, “The strength of weak learnability,” Machine Learning, vol. 5, no. 2, pp. 197–227, 1990.
 [2] L. Breiman, “Bagging predictors,” Machine Learning, vol. 24, no. 2, pp. 123–140, 1996.
 [3] P. Smyth and D. Wolpert, “Linearly combining density estimators via stacking,” Machine Learning, vol. 36, no. 1, pp. 59–83, Jul 1999.
 [4] L. K. Hansen and P. Salamon, “Neural network ensembles,” TPAMI, vol. 12, no. 10, pp. 993–1001, 1990.
 [5] R. Kohavi and D. Wolpert, “Bias plus variance decomposition for zero-one loss functions,” in Proc. of the 13th International Conference on Machine Learning, Bari, Italy, 1996, pp. 275–283.
 [6] A. Gensler and B. Sick, “Forecasting wind power – an ensemble technique with gradual coopetitive weighting based on weather situation,” in IJCNN, Vancouver, BC, 2016, pp. 4976–4984.
 [7] ——, “A multi-scheme ensemble using coopetitive soft gating with application to power forecasting for renewable energy generation,” CoRR, vol. arXiv:1803.06344, 2018.
 [8] ——, “Probabilistic wind power forecasting: A multi-scheme ensemble technique with gradual coopetitive soft gating,” in SSCI, Honolulu, HI, 2017, pp. 1–10.
 [9] C. Müller-Schloer, H. Schmeck, and T. Ungerer, Eds., Organic Computing – A Paradigm Shift for Complex Systems, ser. Autonomic Systems. Basel, Switzerland: Birkhäuser Verlag, 2011.
 [10] Y. Ren, L. Zhang, and P. N. Suganthan, “Ensemble classification and regression – recent developments, applications and future directions,” CIM, vol. 11, no. 1, pp. 41–53, 2016.
 [11] L. Breiman, “Random forests,” Machine Learning, vol. 45, no. 1, pp. 5–32, 2001.
 [12] M. Gönen and E. Alpaydın, “Multiple kernel learning algorithms,” J. Mach. Learn. Res., vol. 12, pp. 2211–2268, 2011.
 [13] J. Mendes-Moreira, C. Soares, A. M. Jorge, and J. F. D. Sousa, “Ensemble approaches for regression: A survey,” ACM Comput. Surv., vol. 45, no. 1, pp. 10:1–10:40, 2012.
 [14] C. M. Bishop, Pattern Recognition and Machine Learning, ser. Information Science and Statistics, M. Jordan, J. Kleinberg, and B. Schölkopf, Eds. Secaucus, NJ: Springer-Verlag New York, 2006, vol. 1.
 [15] K. Monteith, J. L. Carroll, K. Seppi, and T. Martinez, “Turning Bayesian model averaging into Bayesian model combination,” in IJCNN, San Jose, CA, 2011, pp. 2657–2663.
 [16] N. Shazeer, A. Mirhoseini, K. Maziarz, A. Davis, Q. Le, G. Hinton, and J. Dean, “Outrageously large neural networks: The sparsely-gated mixture-of-experts layer,” in ICLR, Toulon, France, 2017.
 [17] E. Garmash and C. Monz, “Ensemble learning for multi-source neural machine translation,” in COLING: Technical Papers, Osaka, Japan, 2016, pp. 1409–1418.