Global convergence of Negative Correlation Extreme Learning Machine
Abstract
Ensemble approaches introduced in the Extreme Learning Machine (ELM) literature mainly come from methods that rely on data sampling procedures, under the assumption that the training data are heterogeneous enough to set up diverse base learners. To overcome this assumption, an ELM ensemble method based on the Negative Correlation Learning (NCL) framework, called Negative Correlation Extreme Learning Machine (NCELM), was proposed. This model works in three stages: i) different ELMs are generated as base learners with random weights in the hidden layer; ii) an NCL penalty term with the information of the ensemble prediction is introduced in each ELM minimization problem, updating the base learners; iii) the second stage is iterated until the ensemble converges.
Although this NCL ensemble method was validated by an experimental study with multiple benchmark datasets, no conditions for this convergence were given. This paper mathematically presents sufficient conditions to guarantee the global convergence of NCELM. The update of the ensemble in each iteration is defined as a contraction mapping, and, through the Banach fixed-point theorem, global convergence of the ensemble is proved.
Keywords:
Ensemble, Negative Correlation Learning, Extreme Learning Machine, Fixed point, Banach, Contraction mapping.
1 Introduction
Over the years, Extreme Learning Machine (ELM) Huang et al. (2012) has become a competitive algorithm for diverse machine learning tasks: time series prediction Ren and Han (2019), speech recognition Xu et al. (2019), deep learning architectures Chang et al. (2018); Chaturvedi et al. (2018), among others. Both the Single-Hidden-Layer Feedforward Network (SLFN) and the kernel-trick versions Huang et al. (2012) are widely used in supervised machine learning problems, mainly due to their low computational burden and their powerful nonlinear mapping capability. The neural network version of the ELM framework relies on the randomness of the weights between the input and the hidden layer to speed up the training stage while keeping competitive performance results Li et al. (2020).
Ensemble learning, also known as committee-based learning Zhou (2012); Kuncheva and Whitaker (2003), has attracted much interest in the machine learning community Zhou (2012) and has been applied widely in many real-world tasks such as object detection, object recognition, and object tracking Girshick et al. (2014); Wang et al. (2012); Zhou et al. (2014); Ykhlef and Bouchaffra (2017). The main characteristic of these methodologies lies in how the training data are used to generate diversity among the base learners. Ensemble methods can be separated according to whether they promote diversity implicitly (for example, using data sampling methods, such as Bagging Breiman (1996) and Boosting Freund (1995)) or explicitly (introducing parameter diversity terms, as in the Negative Correlation Learning framework Masoudnia et al. (2012); Huanhuan Chen and Xin Yao (2009)). In this context, Bagging and Boosting are the most common approaches Domingos (1997); Wyner et al. (2017), although the convergence of these ensemble methods is not always assured Rudin et al. (2004); Mukherjee et al. (2013).
Negative Correlation Learning is a framework, originally designed for neural network ensembles, that introduces the promotion of diversity among the base learners as another term to optimize in the training stage of the model Huanhuan Chen and Xin Yao (2009). This ensemble learning method has been applied to multiclass problems Wang et al. (2010), deep learning tasks Shi et al. (2018) and semi-supervised machine learning problems Chen et al. (2018). In the Extreme Learning Machine community, Negative Correlation Extreme Learning Machine was introduced by adding the diversity term of Huanhuan Chen and Xin Yao (2009) directly to the loss function of the regularized ELM Perales-González et al. (2020). This allows managing the diversity along with the regularization and the error. However, this method relies on the convergence of the ensemble, which was not clarified in the original paper.
In this paper, training conditions for convergence are presented and discussed. The training stage of Negative Correlation Extreme Learning Machine (NCELM) is reformulated as a fixed-point iteration, and the solution of each step can be represented as a contraction mapping. By the Banach fixed-point theorem, this contraction mapping implies convergence, so the ensemble method is stable.
The manuscript is organized as follows: Extreme Learning Machine for classification problems and the ensemble method Negative Correlation Extreme Learning Machine are explained in Section 2. Conditions for convergence are studied in Section 3, and the discussion about hyperparameter boundaries and graphical examples is in Section 4. Conclusions are drawn in the final section of the article, Section 5.
2 Negative Correlation Extreme Learning Machine and its formulation
2.1 Extreme Learning Machine as base learner
For a classification problem, training data can be represented as $D = \{(\mathbf{x}_n, \mathbf{y}_n)\}_{n=1}^{N}$, where

$\mathbf{x}_n \in \mathbb{R}^{K}$ is the vector of features of the $n$-th training pattern,

$K$ is the dimension of the input features,

$\mathbf{y}_n \in \{0, 1\}^{J}$ is the target of the $n$-th training pattern, 1-of-$J$ encoded (all elements of the vector are 0 except the one corresponding to the label of the pattern, which is 1),

$J$ is the number of classes.
Following this notation, the output function of the Extreme Learning Machine classifier Huang et al. (2012) is $f(\mathbf{x}) = (f_1(\mathbf{x}), \ldots, f_J(\mathbf{x}))$, where each $f_j$ is

(1) $f_j(\mathbf{x}) = \mathbf{h}(\mathbf{x})^{T} \boldsymbol{\beta}_j$

where $\mathbf{h}(\mathbf{x})$ is the hidden layer output. The predicted class corresponds to the vector component with the highest value,

(2) $\hat{y}(\mathbf{x}) = \arg\max_{j = 1, \ldots, J} f_j(\mathbf{x})$
The ELM model estimates the coefficient vectors $\boldsymbol{\beta}_j \in \mathbb{R}^{D}$, where $D$ is the number of nodes in the hidden layer, that minimize the following equation:

(3) $\min_{\boldsymbol{\beta}_j} \left( \lVert \boldsymbol{\beta}_j \rVert^{2} + C \lVert \mathbf{H}\boldsymbol{\beta}_j - \mathbf{Y}_j \rVert^{2} \right)$

where

$\mathbf{H} = (\mathbf{h}(\mathbf{x}_1), \ldots, \mathbf{h}(\mathbf{x}_N))^{T} \in \mathbb{R}^{N \times D}$ is the output of the hidden layer for the training patterns,

$\mathbf{Y} \in \mathbb{R}^{N \times J}$ is the matrix with the desired targets,

$\mathbf{Y}_j$ is the $j$-th column of the $\mathbf{Y}$ matrix.

Because Eq. (3) is a convex minimization problem, the minimum of Eq. (3) can be found by differentiating with respect to $\boldsymbol{\beta}_j$ and setting the derivative equal to 0,

(4) $\boldsymbol{\beta}_j = \left( \frac{\mathbf{I}}{C} + \mathbf{H}^{T}\mathbf{H} \right)^{-1} \mathbf{H}^{T} \mathbf{Y}_j$
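As a numerical illustration, the closed-form solution of Eqs. (3)-(4) can be sketched as follows; this is a hedged sketch, not the authors' implementation — the function names (`elm_fit`, `elm_predict`) and the sigmoid activation are illustrative assumptions.

```python
import numpy as np

def elm_fit(X, Y, D=50, C=1.0, rng=None):
    """Train a regularized ELM: beta = (I/C + H^T H)^{-1} H^T Y (Eq. (4))."""
    rng = np.random.default_rng(rng)
    W = rng.standard_normal((X.shape[1], D))   # random input weights
    b = rng.standard_normal(D)                 # random biases
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))     # hidden-layer output (sigmoid)
    beta = np.linalg.solve(np.eye(D) / C + H.T @ H, H.T @ Y)
    return W, b, beta

def elm_predict(X, W, b, beta):
    """Class scores; the argmax over columns gives the predicted label."""
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    return H @ beta
```

Only the hidden-to-output weights are optimized; the random input weights are fixed once, which is what keeps the training cost low.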
2.2 Negative Correlation Extreme Learning Machine
The Negative Correlation Extreme Learning Machine model Perales-González et al. (2020) is an ensemble of base learners, where each $s$-th base learner is an ELM, $f^{(s)}$, $s = 1, \ldots, S$, with $S$ the number of base classifiers. The output for a test instance is defined as the average of the base learner outputs,

(5) $\bar{f}(\mathbf{x}) = \frac{1}{S} \sum_{s=1}^{S} f^{(s)}(\mathbf{x})$
In the Negative Correlation Learning proposal for the ELM framework Perales-González et al. (2020), the minimization problem for each $s$-th base learner is similar to Eq. (3), but the diversity between the output of the individual $f^{(s)}$ and the final ensemble $\bar{f}$ is introduced as a penalization, with $\lambda$ as a problem-dependent parameter that controls the diversity. The minimization problem for each $s$-th base learner is

(6) $\min_{\boldsymbol{\beta}^{(s)}_j} \left( \lVert \boldsymbol{\beta}^{(s)}_j \rVert^{2} + C \lVert \mathbf{H}\boldsymbol{\beta}^{(s)}_j - \mathbf{Y}_j \rVert^{2} + \lambda \, \boldsymbol{\beta}^{(s)T}_j \mathbf{H}^{T} \bar{\mathbf{f}}_j \bar{\mathbf{f}}_j^{T} \mathbf{H} \boldsymbol{\beta}^{(s)}_j \right), \quad j = 1, \ldots, J,$

where $\bar{\mathbf{f}}_j$ is the output of the ensemble for the $j$-th class on the training patterns,

(7) $\bar{\mathbf{f}}_j = \frac{1}{S} \sum_{s=1}^{S} \mathbf{H} \boldsymbol{\beta}^{(s)}_j$
Because $\bar{\mathbf{f}}$ appears in Eq. (6), the proposed solution for Eq. (6) is to transform the problem into an iterated sequence, with the solution of Eq. (3) as the first iteration $\boldsymbol{\beta}^{(s,0)}_j$, for $s = 1, \ldots, S$. The output weight matrices in the $k$-th iteration, $\boldsymbol{\beta}^{(s,k)}_j$, for each individual are obtained from the following optimization problem

(8) $\min_{\boldsymbol{\beta}^{(s,k+1)}_j} \left( \lVert \boldsymbol{\beta}^{(s,k+1)}_j \rVert^{2} + C \lVert \mathbf{H}\boldsymbol{\beta}^{(s,k+1)}_j - \mathbf{Y}_j \rVert^{2} + \lambda \, \boldsymbol{\beta}^{(s,k+1)T}_j \mathbf{H}^{T} \bar{\mathbf{f}}^{(k)}_j \bar{\mathbf{f}}^{(k)T}_j \mathbf{H} \boldsymbol{\beta}^{(s,k+1)}_j \right)$

where $\bar{\mathbf{f}}^{(k)}_j$ is updated as

(9) $\bar{\mathbf{f}}^{(k)}_j = \frac{1}{S} \sum_{s=1}^{S} \mathbf{H} \boldsymbol{\beta}^{(s,k)}_j$

and the solution of Eq. (8) is

(10) $\boldsymbol{\beta}^{(s,k+1)}_j = \left( \frac{\mathbf{I}}{C} + \mathbf{H}^{T}\mathbf{H} + \frac{\lambda}{C} \mathbf{H}^{T} \bar{\mathbf{f}}^{(k)}_j \bar{\mathbf{f}}^{(k)T}_j \mathbf{H} \right)^{-1} \mathbf{H}^{T} \mathbf{Y}_j$

The result of Eq. (10) is introduced in Eq. (9) in order to obtain the next ensemble output iteratively. However, the convergence of this iteration was not assured in the original paper Perales-González et al. (2020), but it can be proved with the Banach fixed-point theorem.
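The iterated sequence described above can be sketched numerically. This is a hedged sketch, not the authors' code: the sigmoid activation, the rank-one form of the penalty term and all names (`ncelm_fit`, etc.) are assumptions for illustration.

```python
import numpy as np

def sigmoid(Z):
    return 1.0 / (1.0 + np.exp(-Z))

def ncelm_fit(X, Y, D=20, C=1.0, lam=0.1, S=5, n_iter=15, rng=None):
    """Sketch of the NCELM training loop: S base ELMs are repeatedly
    refit against the current ensemble output."""
    rng = np.random.default_rng(rng)
    K, J = X.shape[1], Y.shape[1]
    # Stage i): S base ELMs with independent random hidden layers
    layers = [(rng.standard_normal((K, D)), rng.standard_normal(D))
              for _ in range(S)]
    Hs = [sigmoid(X @ W + b) for W, b in layers]
    As = [np.eye(D) / C + H.T @ H for H in Hs]
    betas = [np.linalg.solve(A, H.T @ Y) for A, H in zip(As, Hs)]  # Eq. (4)
    # Stages ii)-iii): iterate the penalized refit (Eqs. (9)-(10))
    for _ in range(n_iter):
        F_bar = sum(H @ B for H, B in zip(Hs, betas)) / S  # ensemble output
        betas = []
        for H, A in zip(Hs, As):
            B = np.empty((D, J))
            for j in range(J):
                u = H.T @ F_bar[:, j]  # rank-one coupling with the ensemble
                B[:, j] = np.linalg.solve(A + (lam / C) * np.outer(u, u),
                                          H.T @ Y[:, j])
            betas.append(B)
    return layers, betas
```

Note that each base learner keeps its own random hidden layer; only the output weights change across iterations, so each update is a linear solve.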
3 Conditions for the convergence of NCELM
3.1 Banach fixedpoint theorem
As Stephen Banach defined Banach (1922),
Theorem 3.1
Let $(X, d)$ be a non-empty complete metric space with a contraction mapping $T : X \to X$. Then $T$ admits a unique fixed point $x^*$ in $X$ ($T(x^*) = x^*$). Furthermore, $x^*$ can be found as follows: start with an arbitrary element $x_0 \in X$ and define a sequence $x_n = T(x_{n-1})$; then $x_n \to x^*$.
Let $(X, d)$ be a complete metric space; then a map $T : X \to X$ is called a contraction mapping on $X$ if there exists $q \in [0, 1)$ such that

(11) $d(T(x), T(y)) \leq q \, d(x, y), \quad \forall x, y \in X$
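As a minimal, self-contained illustration of Theorem 3.1, the sketch below iterates a contraction on the real line: $T(x) = x/2 + 1$ has $q = 1/2$ and unique fixed point $x^* = 2$, reached from any starting point.

```python
def fixed_point(T, x0, tol=1e-12, max_iter=1000):
    """Iterate x_n = T(x_{n-1}) until two successive points are within tol."""
    x = x0
    for _ in range(max_iter):
        x_next = T(x)
        if abs(x_next - x) < tol:
            return x_next
        x = x_next
    return x

# Contraction with q = 1/2; Banach's theorem guarantees convergence to x* = 2.
x_star = fixed_point(lambda x: x / 2 + 1, x0=100.0)
```

The distance to the fixed point shrinks by at least the factor $q$ per step, which is exactly the mechanism exploited in the NCELM convergence proof.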
3.2 Reformulation of NCELM model as a contraction mapping
In order to prove that the iteration of Eq. (8) over $k$ is a fixed-point iteration, the elements of the NCELM model are going to be defined in a metric space $(M, d)$ with a map $T : M \to M$. Later, it is proved that $T$ is a contraction mapping. An element $P \in M$ is defined as

(12) $P = \left( \boldsymbol{\beta}^{(1)}, \ldots, \boldsymbol{\beta}^{(S)} \right)$

thus $M$ is the subspace that contains the possible solutions of Eq. (10). The output of the ensemble, $\bar{\mathbf{f}}$, is then a function of $P$, since it is composed of all the $\boldsymbol{\beta}^{(s)}$ by definition in Eq. (7). Noting this as $\bar{\mathbf{f}}(P)$, the map

(13) $T(P) = \left( T^{(1)}(P), \ldots, T^{(S)}(P) \right)$

is Eq. (10) applied to this point $P$. The map $T$ depends on each classification problem, because $\mathbf{H}$, $\mathbf{Y}$, $C$ and $\lambda$ are problem-dependent. Individual maps can be considered,

(14) $T^{(s)}(P) = \boldsymbol{\beta}^{(s,k+1)}, \quad s = 1, \ldots, S$

Following this formulation, the NCELM model always starts from the initial point

(15) $P_0 = \left( \boldsymbol{\beta}^{(1,0)}, \ldots, \boldsymbol{\beta}^{(S,0)} \right)$

that leads to $\bar{\mathbf{f}}(P_0)$, thus the first element in the sequence is

(16) $P_1 = T(P_0)$
3.3 Definition of distance
For two points $P = (\boldsymbol{\beta}^{(1)}, \ldots, \boldsymbol{\beta}^{(S)})$ and $Q = (\boldsymbol{\gamma}^{(1)}, \ldots, \boldsymbol{\gamma}^{(S)})$ from the space $M$, the distance metric is defined as

(17) $d(P, Q) = \sum_{s=1}^{S} \sum_{j=1}^{J} \lVert \boldsymbol{\beta}^{(s)}_j - \boldsymbol{\gamma}^{(s)}_j \rVert^{2}$

where $\lVert \cdot \rVert^{2}$ is the norm raised to the power of 2,

(18) $\lVert \mathbf{v} \rVert^{2} = \mathbf{v}^{T} \mathbf{v}$

The distance after the map is

(19) $d(T(P), T(Q)) = \sum_{s=1}^{S} \sum_{j=1}^{J} \lVert T^{(s)}_j(P) - T^{(s)}_j(Q) \rVert^{2}$

so the distance is just a sum of norms. It is trivial that if the inequality holds term by term,

$\lVert T^{(s)}_j(P) - T^{(s)}_j(Q) \rVert^{2} \leq q \lVert \boldsymbol{\beta}^{(s)}_j - \boldsymbol{\gamma}^{(s)}_j \rVert^{2}, \quad \forall s, j,$

then

$d(T(P), T(Q)) \leq q \, d(P, Q),$

so it is only needed to prove that

(20) $\lVert T^{(s)}_j(P) - T^{(s)}_j(Q) \rVert^{2} \leq q \lVert \boldsymbol{\beta}^{(s)}_j - \boldsymbol{\gamma}^{(s)}_j \rVert^{2}, \quad q \in [0, 1)$
3.4 Proof that T is a contraction mapping
After computing the training data, the coefficient matrix $\mathbf{H}$ is fixed. If both points coincide, $P = Q$, then $T(P) = T(Q)$, because Eq. (4) and Eq. (10) give unique solutions, and in this case the inequality from Eq. (20) is assured.
Let us assume arbitrary $P \neq Q$; the points obtained from Eq. (10) are,
(21)  
(22) 
From these, new predictions can be obtained,
(23)  
(24) 
Note that an example of could be , .
The application of $T$ would result in
(25)  
(26) 
In order to apply the Woodbury matrix identity Woodbury (1950) in Eq. (25), the following matrices are renamed:

$\mathbf{A} = \frac{\mathbf{I}}{C} + \mathbf{H}^{T}\mathbf{H}$,

$\mathbf{u} = \mathbf{H}^{T} \bar{\mathbf{f}}_j(P)$,

so the inverse of the matrix in Eq. (25) can be rewritten as

(27) $\left( \mathbf{A} + \frac{\lambda}{C} \mathbf{u} \mathbf{u}^{T} \right)^{-1} = \mathbf{A}^{-1} - \frac{\lambda}{C} \, \alpha \, \mathbf{A}^{-1} \mathbf{u} \mathbf{u}^{T} \mathbf{A}^{-1}$

where

(28) $\alpha = \left( 1 + \frac{\lambda}{C} \mathbf{u}^{T} \mathbf{A}^{-1} \mathbf{u} \right)^{-1}$
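The rank-one case of the Woodbury identity used in Eqs. (27)-(28) (the Sherman-Morrison formula) can be checked numerically; the matrices below are arbitrary examples, not quantities from the paper.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 6
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)          # well-conditioned symmetric matrix
u = rng.standard_normal(n)
lam = 0.3

# Sherman-Morrison: (A + lam u u^T)^{-1}
#   = A^{-1} - lam A^{-1} u u^T A^{-1} / (1 + lam u^T A^{-1} u)
A_inv = np.linalg.inv(A)
Au = A_inv @ u
sm_inv = A_inv - lam * np.outer(Au, Au) / (1.0 + lam * u @ Au)

direct_inv = np.linalg.inv(A + lam * np.outer(u, u))
max_err = np.abs(sm_inv - direct_inv).max()
```

The identity turns the per-iteration inverse into a cheap rank-one correction of a matrix inverse that can be precomputed once.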
A similar result is obtained for Eq. (26). Using Eq. (27) in Eqs. (25) and (26) leads to
(29)  
(30) 
The distance can be expressed as
(31) 
Since the case $P = Q$ has been discarded, Eq. (31) can be divided by the distance $d(P, Q)$,
(32) 
and applying Eq. (20),
(33) 
Because real terms raised to the power of 2 are non-negative, it is only necessary to prove that
(34) 
The left inequality is assured. Squaring the fraction and applying norm properties,
(35)  
(36) 
we have
(38) 
and the problem of finding the maximum value of Eq. (38) is a generalized Rayleigh quotient problem Parlett (1998),
(39) 
This problem is equivalent to
(40) 
where the constraint in this problem is the distance between the two points, which is nonzero because that case was discarded. This can be solved using Lagrange multipliers,
(41) 
Maximizing with respect to the unknown vector, a Generalized Eigenvalue Problem (GEP) is obtained,
(42) 
In the GEP, the eigenvalues can be calculated as
(43) 
is the maximum eigenvalue of the quotient in Eq. (39). From Eqs. (32) and (34), if this maximum eigenvalue is smaller than 1, then the condition from Equation (20) is assured.
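The link between the generalized Rayleigh quotient of Eq. (39) and the maximum generalized eigenvalue can be illustrated numerically; the matrices A and B below are arbitrary symmetric examples, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5))
A = M @ M.T                      # symmetric positive semidefinite numerator
P = rng.standard_normal((5, 5))
B = P @ P.T + 5 * np.eye(5)      # symmetric positive definite denominator

# The maximum of the generalized Rayleigh quotient x^T A x / x^T B x
# over nonzero x equals the largest eigenvalue of B^{-1} A.
mu_max = np.linalg.eigvals(np.linalg.solve(B, A)).real.max()

# Sample the quotient at random directions: no sample exceeds mu_max.
xs = rng.standard_normal((1000, 5))
quotients = (np.einsum('ij,jk,ik->i', xs, A, xs)
             / np.einsum('ij,jk,ik->i', xs, B, xs))
```

This is the same mechanism used above: bounding the quotient by the top eigenvalue turns the contraction condition into an eigenvalue condition.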
Using the norm property in Eq. (35), and adding the previous knowledge, a lower bound can be set. Taking the inverse,
(44) 
In the remaining case, the same reasoning can be followed. From the norm property in Eq. (36), an upper bound for the norm of the inverse matrix can be set,
(45) 
(46) 
Replacing in Eq. (43), the following inequality is reached,
(47) 
so a condition on $\lambda$ can be imposed for the maximum eigenvalue to stay under the bound in equation (47),
(48) 
After rewriting this bound as
(49) 
and replacing it into Equation (48),
(50) 
$\lambda$ values can be obtained numerically by finding the zero of the following equation
(51) 
because it is an implicit equation, whose terms depend on $\lambda$. However, the bound can be relaxed using the norm property in Equation (45),
thus, a more restrictive bound can be set,
(52) 
It is trivial to see that if the stricter bound is fulfilled, then the original condition is also fulfilled. Although $\lambda$ is still implicit in Equation (52), it is shown in the next section how this problem can be avoided.
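Solving an implicit scalar equation, as required by Eq. (51), can be sketched with a simple bisection; the function g below is an arbitrary stand-in with a known zero, not the paper's actual expression.

```python
import math

def bisect_zero(g, lo, hi, tol=1e-10, max_iter=200):
    """Find a zero of g in [lo, hi], assuming g(lo) and g(hi) differ in sign."""
    g_lo = g(lo)
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        g_mid = g(mid)
        if abs(g_mid) < tol or hi - lo < tol:
            return mid
        if (g_lo < 0) == (g_mid < 0):
            lo, g_lo = mid, g_mid   # zero lies in the upper half
        else:
            hi = mid                # zero lies in the lower half
    return 0.5 * (lo + hi)

# Stand-in implicit equation g(t) = t * exp(t) - 1; its zero is near 0.567.
lam_star = bisect_zero(lambda t: t * math.exp(t) - 1.0, 0.0, 1.0)
```

Bisection needs only sign evaluations of the implicit expression, which is convenient when the terms themselves depend on the unknown parameter.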
4 Discussion
4.1 Condition on $\lambda$
For $\lambda$ values that assure the condition from Eq. (52), the inequality in Eq. (20) is also assured. Eq. (20) is more restrictive than the condition from the Banach fixed-point theorem,
which means that, under a certain condition on $\lambda$, there is an upper bound that allows NCELM to be formulated as a fixed-point iteration. Moreover, because the sequence in Eq. (10) is a fixed-point iteration, it converges, with
(53) 
the solution of the system; thus, by definition of the fixed point,
(54) 
and the condition for $\lambda$ in Eq. (48) is relaxed over the iterations, since the upper bound for $\lambda$ increases,
(55) 
as long as the iteration continues. And this is also assured, since the involved matrices exist and are nonsingular, because of Equations (21) and (22), so from Eq. (49),
Eq. (55) implies that any $\lambda$ value can be chosen, because the condition is relaxed over the iterations and the bound becomes larger and larger. If Eq. (11) were not fulfilled in the first iteration for a given $\lambda$, a smaller value could be chosen, and the boundary would be relaxed during the training stage. Using the base learners obtained at this point of the training stage, the fixed-point iteration could continue with the desired $\lambda$.
4.2 Experimental results
Because the base learners converge to an ensemble optimum, the difference between the coefficient vectors in iteration $k$ and their values in the next iteration $k+1$ always decreases. For explanatory purposes, this section graphically shows an example of this convergence.
Dataset  Size  #Attr.  #Classes  Class distribution
qsar-biodegradation  1055  41  2  (699, 356)
Hyperparameters are fixed for this example. To reduce the computational burden, the vector norm chosen for plotting was not the norm above but,