
# Global convergence of Negative Correlation Extreme Learning Machine

## Abstract

Ensemble approaches introduced in the Extreme Learning Machine (ELM) literature mainly come from methods that rely on data sampling procedures, under the assumption that the training data are heterogeneous enough to set up diverse base learners. To overcome this assumption, an ELM ensemble method based on the Negative Correlation Learning (NCL) framework, called Negative Correlation Extreme Learning Machine (NCELM), was proposed. This model works in three stages: i) different ELMs are generated as base learners with random weights in the hidden layer; ii) an NCL penalty term with the information of the ensemble prediction is introduced in each ELM minimization problem, updating the base learners; iii) the second stage is iterated until the ensemble converges.

Although this NCL ensemble method was validated by an experimental study with multiple benchmark datasets, no information was given about the conditions under which this convergence is guaranteed. This paper mathematically presents sufficient conditions to guarantee the global convergence of NCELM. The update of the ensemble in each iteration is defined as a contraction mapping, and through the Banach fixed-point theorem, global convergence of the ensemble is proved.

###### Keywords:
Ensemble · Negative Correlation Learning · Extreme Learning Machine · Fixed point · Banach · Contraction mapping.

## 1 Introduction

Over the years, Extreme Learning Machine (ELM) Huang et al. (2012) has become a competitive algorithm for diverse machine learning tasks: time series prediction Ren and Han (2019), speech recognition Xu et al. (2019), deep learning architectures Chang et al. (2018); Chaturvedi et al. (2018), among others. Both the Single-Hidden-Layer Feedforward Network (SLFN) and the kernel-trick versions Huang et al. (2012) are widely used in supervised machine learning problems, mainly due to their low computational burden and their powerful nonlinear mapping capability. The neural network version of the ELM framework relies on the randomness of the weights between the input and the hidden layer to speed up the training stage while keeping competitive performance results Li et al. (2020).

Ensemble learning, also known as committee-based learning Zhou (2012); Kuncheva and Whitaker (2003), has attracted much interest in the machine learning community Zhou (2012) and has been applied widely in many real-world tasks such as object detection, object recognition, and object tracking Girshick et al. (2014); Wang et al. (2012); Zhou et al. (2014); Ykhlef and Bouchaffra (2017). The main characteristic of these methodologies lies in how the training data are used to generate diversity among the base learners. Ensemble methods can be separated according to whether they promote the diversity implicitly (for example, using data sampling methods such as Bagging Breiman (1996) and Boosting Freund (1995)) or explicitly (introducing diversity terms over the parameters, as in the Negative Correlation Learning framework Masoudnia et al. (2012); Huanhuan Chen and Xin Yao (2009)). In this context, Bagging and Boosting are the most common approaches Domingos (1997); Wyner et al. (2017), although the convergence of these ensemble methods is not always assured Rudin et al. (2004); Mukherjee et al. (2013).

Negative Correlation Learning is a framework, originally designed for neural network ensembles, that introduces the promotion of diversity among the base learners as another term to optimize in the training stage of the model Huanhuan Chen and Xin Yao (2009). This ensemble learning method has been applied to multi-class problems Wang et al. (2010), deep learning tasks Shi et al. (2018) and semi-supervised machine learning problems Chen et al. (2018). In the Extreme Learning Machine community, Negative Correlation Extreme Learning Machine was introduced by adding to the regularized ELM Huanhuan Chen and Xin Yao (2009) the diversity term directly in the loss function Perales-González et al. (2020). This allows managing the diversity along with the regularization and the error. However, this method relies on the convergence of the ensemble, which was not clarified in the original paper.

In this paper, training conditions for convergence are presented and discussed. The training stage of Negative Correlation Extreme Learning Machine (NCELM) is reformulated as a fixed-point iteration, and the solution of each step can be represented as a contraction mapping. Using the Banach theorem, this contraction mapping implies that the iteration converges and the ensemble method is stable.

The manuscript is organized as follows: Extreme Learning Machine for classification problems and the ensemble method Negative Correlation Extreme Learning Machine are explained in Section 2. Conditions for convergence are studied in Section 3, and a discussion about hyper-parameter boundaries and graphic examples is given in Section 4. Conclusions are drawn in the final section of the article, Section 5.

## 2 Negative Correlation Extreme Learning Machine and its formulation

### 2.1 Extreme Learning Machine as base learner

For a classification problem, training data can be represented as $D = \{(\mathbf{x}_n, \mathbf{y}_n)\}_{n=1}^{N}$, where

• $\mathbf{x}_n \in \mathbb{R}^{K}$ is the vector of features of the $n$-th training pattern,

• $K$ is the dimension of the input features,

• $\mathbf{y}_n \in \mathbb{R}^{J}$ is the target of the $n$-th training pattern, 1-of-$J$ encoded (all elements of the vector are 0 except the one corresponding to the label of the pattern, which is 1),

• $J$ is the number of classes.

Following this notation, the output function of the Extreme Learning Machine classifier Huang et al. (2012) is $f(\mathbf{x}) = (f_1(\mathbf{x}), \ldots, f_J(\mathbf{x}))'$, where each $f_j$ is

$$f_j(\mathbf{x}) = \mathbf{h}'(\mathbf{x})\,\boldsymbol{\beta}_j, \tag{1}$$

where $\mathbf{h}(\mathbf{x})$ is the hidden-layer output. The predicted class corresponds to the vector component with the highest value,

$$\operatorname*{arg\,max}_{j=1,\ldots,J} f_j(\mathbf{x}). \tag{2}$$

The ELM model estimates the coefficient vectors $\boldsymbol{\beta}_j \in \mathbb{R}^{D}$, where $D$ is the number of nodes in the hidden layer, that minimize the following equation:

$$\min_{\boldsymbol{\beta}_j \in \mathbb{R}^{D}} \left( \|\boldsymbol{\beta}_j\|^2 + C\,\|H\boldsymbol{\beta}_j - Y_j\|^2 \right), \quad j = 1, \ldots, J, \tag{3}$$

where

• $H \in \mathbb{R}^{N \times D}$ is the output of the hidden layer for the $N$ training patterns,

• $Y \in \mathbb{R}^{N \times J}$ is the matrix with the desired targets,

• $Y_j$ is the $j$-th column of the $Y$ matrix.

Because Eq. (3) is a convex minimization problem, its minimum can be found by differentiating with respect to $\boldsymbol{\beta}_j$ and setting the derivative equal to 0,

$$\boldsymbol{\beta}_j = \left( \frac{I}{C} + H'H \right)^{-1} H' Y_j. \tag{4}$$
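As an illustration, the closed-form solution of Eq. (4) can be sketched in a few lines of NumPy. The concrete choices below (tanh activation, Gaussian random weights, the `elm_train`/`elm_predict` names) are illustrative assumptions, not prescribed by the paper.

```python
import numpy as np

def elm_train(X, Y, D=50, C=1.0, seed=0):
    """Basic ELM: random hidden layer, then the ridge solution of Eq. (4)."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((X.shape[1], D))  # random input weights, never trained
    b = rng.standard_normal(D)                # random biases
    H = np.tanh(X @ W + b)                    # hidden-layer output matrix (N x D)
    # beta = (I/C + H'H)^{-1} H'Y, solved for all J columns at once
    beta = np.linalg.solve(np.eye(D) / C + H.T @ H, H.T @ Y)
    return W, b, beta

def elm_predict(X, W, b, beta):
    """Predicted class of Eq. (2): argmax over the J output components."""
    return np.argmax(np.tanh(X @ W + b) @ beta, axis=1)
```

Here `X` is the $N \times K$ input matrix and `Y` the $N \times J$ 1-of-$J$ target matrix; only `beta` is fitted, which is what keeps the training cost low.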

### 2.2 Negative Correlation Extreme Learning Machine

The Negative Correlation Extreme Learning Machine model Perales-González et al. (2020) is an ensemble of $S$ base learners, where each $s$-th base learner, $s = 1, \ldots, S$, is an ELM and $S$ is the number of base classifiers. The output for a test instance $\mathbf{x}$ is defined as the average of their outputs,

$$f_j(\mathbf{x}) = \frac{1}{S}\sum_{s=1}^{S} f_j^{(s)}(\mathbf{x}) = \frac{1}{S}\sum_{s=1}^{S} \mathbf{h}^{(s)\prime}(\mathbf{x})\,\boldsymbol{\beta}_j^{(s)}. \tag{5}$$

In the Negative Correlation Learning proposal for the ELM framework Perales-González et al. (2020), the minimization problem for each $s$-th base learner is similar to Eq. (3), but the diversity between the outputs of the individual $f_j^{(s)}$ and the final ensemble $f_j$ is introduced as a penalization, with $\lambda$ as a problem-dependent parameter that controls the diversity. The minimization problem for each $s$-th base learner is

$$\min_{\boldsymbol{\beta}_j^{(s)} \in \mathbb{R}^{D \times J}} \left( \|\boldsymbol{\beta}_j^{(s)}\|^2 + C\,\|H^{(s)}\boldsymbol{\beta}_j^{(s)} - Y_j\|^2 + \lambda\,\langle H^{(s)}\boldsymbol{\beta}_j^{(s)}, F_j \rangle^2 \right), \tag{6}$$

where $F_j$ is the output of the ensemble,

$$F_j = \sum_{s'=1}^{S} H^{(s')}\boldsymbol{\beta}_j^{(s')}. \tag{7}$$

Because $F_j$ appears in Eq. (6), the proposed solution for Eq. (6) is to transform the problem into an iterated sequence, with the solution of Eq. (3) as the first iteration, $r = 0$. The output weight matrices in the $r$-th iteration, $\boldsymbol{\beta}_{j,(r)}^{(s)}$, $r \geq 1$, for each individual are obtained from the following optimization problem

$$\min_{\boldsymbol{\beta}_{j,(r)}^{(s)} \in \mathbb{R}^{D \times J}} \left( \|\boldsymbol{\beta}_{j,(r)}^{(s)}\|^2 + C\,\|H^{(s)}\boldsymbol{\beta}_{j,(r)}^{(s)} - Y_j\|^2 + \lambda\,\langle H^{(s)}\boldsymbol{\beta}_{j,(r)}^{(s)}, F_{j,(r-1)} \rangle^2 \right), \tag{8}$$

where $F_{j,(r-1)}$ is updated as

$$F_{j,(r-1)} = \frac{1}{S}\sum_{s=1}^{S} H^{(s)}\boldsymbol{\beta}_{j,(r-1)}^{(s)}. \tag{9}$$

As in Eq. (3), the solution of (8) can be obtained by differentiating and setting the derivative equal to $0$,

$$\boldsymbol{\beta}_{j,(r)}^{(s)} = \left( \frac{I}{C} + H^{(s)\prime}H^{(s)} + \frac{\lambda}{C}\,H^{(s)\prime}F_{j,(r-1)}F_{j,(r-1)}'H^{(s)} \right)^{-1} H^{(s)\prime} Y_j. \tag{10}$$

The result $\boldsymbol{\beta}_{j,(r)}^{(s)}$ is introduced in Eq. (9) in order to obtain $F_{j,(r)}$ iteratively. However, the convergence of this iteration was not assured in the original paper Perales-González et al. (2020), but it can be proved with the Banach fixed-point theorem.
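The iterative scheme of Eqs. (8)-(10) can be sketched as follows, assuming NumPy and hypothetical names (`ncelm_fit`, precomputed hidden-layer matrices `H_list`): each pass recomputes the ensemble output of Eq. (9) and then the per-learner, per-class update of Eq. (10), stopping when the weights stop moving.

```python
import numpy as np

def ncelm_fit(H_list, Y, C=1.0, lam=0.1, iters=50, tol=1e-8):
    """Fixed-point iteration of Eqs. (8)-(10), sketched for illustration.

    H_list: S hidden-layer matrices H^(s) of shape (N, D); Y: (N, J) targets.
    """
    S = len(H_list)
    J = Y.shape[1]
    # r = 0: plain regularized ELM solution of Eq. (4) for every learner
    betas = [np.linalg.solve(np.eye(H.shape[1]) / C + H.T @ H, H.T @ Y)
             for H in H_list]
    for _ in range(iters):
        F = sum(H @ B for H, B in zip(H_list, betas)) / S  # ensemble output, Eq. (9)
        new = []
        for H in H_list:
            base = np.eye(H.shape[1]) / C + H.T @ H
            cols = []
            for j in range(J):
                v = H.T @ F[:, j]                        # H^(s)' F_j
                A = base + (lam / C) * np.outer(v, v)    # matrix inverted in Eq. (10)
                cols.append(np.linalg.solve(A, H.T @ Y[:, j]))
            new.append(np.column_stack(cols))
        shift = sum(np.linalg.norm(b1 - b0) for b0, b1 in zip(betas, new))
        betas = new
        if shift < tol:  # weights stopped moving: a fixed point was reached
            break
    return betas
```

The `shift < tol` stopping rule is exactly the fixed-point view studied in the next section: the loop halts when applying the update map no longer changes the weights.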

## 3 Conditions for the convergence of NCELM

### 3.1 Banach fixed-point theorem

As Stefan Banach established Banach (1922),

###### Theorem 3.1

Let $(X, d)$ be a non-empty complete metric space with a contraction mapping $T: X \to X$. Then $T$ admits a unique fixed point $x^*$ in $X$ ($T(x^*) = x^*$). Furthermore, $x^*$ can be found as follows: start with an arbitrary element $x_0 \in X$ and define a sequence $x_n = T(x_{n-1})$; then $x_n \to x^*$.

Let $(X, d)$ be a complete metric space; then a map $T: X \to X$ is called a contraction mapping on $X$ if there exists $q \in [0, 1)$ such that

$$d(T(x), T(y)) \leq q\,d(x, y), \quad \forall x, y \in X. \tag{11}$$

This means that, after applying the mapping, the points are closer than in their original positions Ciesielski (2007). Thus, if the NCELM solution in Eq. (10) is a contraction mapping on the solutions $\boldsymbol{\beta}_{j}^{(s)}$, it can be assured that a fixed point exists for the NCELM model.
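A one-dimensional illustration of Theorem 3.1: $\cos$ is a contraction on $[0, 1]$ (its derivative is bounded by $\sin(1) < 1$ there), so iterating it from any starting point converges to its unique fixed point. The function name is an illustrative choice.

```python
import math

def iterate_to_fixed_point(T, x0, tol=1e-12, max_iter=10_000):
    """Banach iteration x_{n+1} = T(x_n); stops when successive iterates agree."""
    x = x0
    for _ in range(max_iter):
        x_new = T(x)
        if abs(x_new - x) < tol:
            return x_new
        x = x_new
    return x

x_star = iterate_to_fixed_point(math.cos, 0.5)  # converges to x* with cos(x*) = x*
```

The limit does not depend on the starting point `x0`, exactly as the theorem predicts for the NCELM iteration once Eq. (10) is shown to be a contraction.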

### 3.2 Reformulation of NCELM model as a contraction mapping

In order to prove that the iteration of Eq. (8) over $r$ is a fixed-point iteration, the elements of the NCELM model are going to be defined in a metric space $(X, d)$ with a map $T: X \to X$. Later, it is proved that $T$ is a contraction mapping. An element $B \in X$ is defined as

$$B = \begin{pmatrix} \boldsymbol{\beta}_j^{(1)} \\ \vdots \\ \boldsymbol{\beta}_j^{(s)} \\ \vdots \\ \boldsymbol{\beta}_j^{(S)} \end{pmatrix}, \tag{12}$$

thus $X$ is the subspace that contains the possible solutions of Eq. (10). The output of the ensemble, $F_j$, is then a function of $B$, since it is composed of all the $\boldsymbol{\beta}_j^{(s)}$ by definition in Eq. (7). Noting this as $F_B$, the map

$$T(B) = \begin{pmatrix} \left(\frac{I}{C} + H^{(1)\prime}H^{(1)} + \frac{\lambda}{C}H^{(1)\prime}F_B F_B' H^{(1)}\right)^{-1} H^{(1)\prime} Y_j \\ \vdots \\ \left(\frac{I}{C} + H^{(s)\prime}H^{(s)} + \frac{\lambda}{C}H^{(s)\prime}F_B F_B' H^{(s)}\right)^{-1} H^{(s)\prime} Y_j \\ \vdots \\ \left(\frac{I}{C} + H^{(S)\prime}H^{(S)} + \frac{\lambda}{C}H^{(S)\prime}F_B F_B' H^{(S)}\right)^{-1} H^{(S)\prime} Y_j \end{pmatrix}, \tag{13}$$

is Equation (10) applied to this point $B$. The map depends on each classification problem, because $H^{(s)}$ and $Y_j$ are problem-dependent. Individual maps $T^{(s)}$ can be considered,

$$T(B) = \begin{pmatrix} T^{(1)}(B) \\ \vdots \\ T^{(S)}(B) \end{pmatrix}. \tag{14}$$

Following this formulation, the NCELM model always starts from the initial point

$$B_{(0)} = \begin{pmatrix} \vec{0} \\ \vdots \\ \vec{0} \end{pmatrix}, \tag{15}$$

which leads to $F_{B_{(0)}} = \vec{0}$; thus the first element in the sequence is

$$B_{(1)} = T(B_{(0)}) = \begin{pmatrix} \left(\frac{I}{C} + H^{(1)\prime}H^{(1)}\right)^{-1} H^{(1)\prime} Y_j \\ \vdots \\ \left(\frac{I}{C} + H^{(S)\prime}H^{(S)}\right)^{-1} H^{(S)\prime} Y_j \end{pmatrix}, \tag{16}$$

and the problem from Eq. (8) is a sequence $B_{(r)} = T(B_{(r-1)})$, for $r \geq 1$. In the following section, it is shown that each $T^{(s)}$ is a contraction map, so $T$ is one as well because of Eq. (14).

### 3.3 Definition of distance

For two points $U, V$ from the space $X$, the distance metric is defined as

$$d(U, V) = \sum_{s=1}^{S} \|U^{(s)} - V^{(s)}\|^2 = \sum_{s=1}^{S} d^{(s)}(U^{(s)}, V^{(s)}), \tag{17}$$

where $d^{(s)}$ is the squared norm,

$$d^{(s)}(U^{(s)}, V^{(s)}) = \|U^{(s)} - V^{(s)}\|^2. \tag{18}$$

The distance after the map is

$$d(T(U), T(V)) = \sum_{s=1}^{S} \|T^{(s)}(U) - T^{(s)}(V)\|^2 = \sum_{s=1}^{S} d^{(s)}(T^{(s)}(U), T^{(s)}(V)), \tag{19}$$

so the distance $d$ is just a sum of $d^{(s)}$. It is trivial that if

$$d^{(s)}(T^{(s)}(U), T^{(s)}(V)) \leq d^{(s)}(U^{(s)}, V^{(s)}),$$

then

$$d(T(U), T(V)) \leq d(U, V),$$

so it is only needed to prove that

$$d^{(s)}(T^{(s)}(U), T^{(s)}(V)) \leq q\,d^{(s)}(U^{(s)}, V^{(s)}), \quad \forall s, \; q \in [0, 1). \tag{20}$$
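The blockwise structure of Eqs. (17)-(19) is easy to verify numerically. In this sketch (illustrative random data, not from the paper), every block difference is halved, so each $d^{(s)}$, and therefore the total $d$, shrinks by the same factor $q = 1/4$ in squared norm:

```python
import numpy as np

def block_distance(U, V):
    """Distance of Eq. (17): sum over the S blocks of squared norms."""
    return sum(np.linalg.norm(u - v) ** 2 for u, v in zip(U, V))

rng = np.random.default_rng(2)
U = [rng.standard_normal((4, 3)) for _ in range(3)]  # S = 3 blocks
V = [rng.standard_normal((4, 3)) for _ in range(3)]
T_U = U                                              # toy map: halve each block gap
T_V = [u - 0.5 * (u - v) for u, v in zip(U, V)]
d0, d1 = block_distance(U, V), block_distance(T_U, T_V)
# each block contracts by q = 0.25 in squared norm, hence so does the sum
```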

### 3.4 Proof that T is a contraction mapping

Once the training data are processed, the matrices $H^{(s)}$ are fixed. If both points are obtained by Eq. (4), then $U = V$, and by Eq. (10), $T(U) = T(V)$, because both equations give a unique solution; in this case the inequality from Eq. (20) is assured.

Let us assume arbitrary $\hat{F}_U, \hat{F}_V$; initial points from Eq. (10) are

$$U^{(s)} \equiv \left( \frac{I}{C} + H^{(s)\prime}H^{(s)} + \frac{\lambda}{C}H^{(s)\prime}\hat{F}_U\hat{F}_U'H^{(s)} \right)^{-1} H^{(s)\prime} Y_j, \tag{21}$$

$$V^{(s)} \equiv \left( \frac{I}{C} + H^{(s)\prime}H^{(s)} + \frac{\lambda}{C}H^{(s)\prime}\hat{F}_V\hat{F}_V'H^{(s)} \right)^{-1} H^{(s)\prime} Y_j. \tag{22}$$

From these, new predictions can be obtained,

$$F_U = \frac{1}{S}\sum_{s=1}^{S} H^{(s)}U^{(s)}, \tag{23}$$

$$F_V = \frac{1}{S}\sum_{s=1}^{S} H^{(s)}V^{(s)}. \tag{24}$$

Note that an example of such a pair $\hat{F}_U, \hat{F}_V$ could be $F_{j,(r-1)}, F_{j,(r)}$, two consecutive ensemble outputs of the iteration.

The application of $T^{(s)}$ would result in

$$T^{(s)}(U) = \left( \frac{I}{C} + H^{(s)\prime}H^{(s)} + \frac{\lambda}{C}H^{(s)\prime}F_U F_U'H^{(s)} \right)^{-1} H^{(s)\prime} Y_j, \tag{25}$$

$$T^{(s)}(V) = \left( \frac{I}{C} + H^{(s)\prime}H^{(s)} + \frac{\lambda}{C}H^{(s)\prime}F_V F_V'H^{(s)} \right)^{-1} H^{(s)\prime} Y_j. \tag{26}$$

In order to apply the Woodbury matrix identity Woodbury (1950) in Eq. (25), the following matrices are renamed:

• $A_U^{(s)} \equiv \frac{I}{C} + H^{(s)\prime}H^{(s)} + \frac{\lambda}{C}H^{(s)\prime}\hat{F}_U\hat{F}_U'H^{(s)}$, so that $U^{(s)} = A_U^{(s),-1}H^{(s)\prime}Y_j$,

• $\delta_U \equiv F_U F_U' - \hat{F}_U\hat{F}_U'$,

so the inverse of the matrix in Eq. (25) can be rewritten as

$$\left(A_U^{(s)} + DCE\right)^{-1} = A_U^{(s),-1} - A_U^{(s),-1}H^{(s)\prime}\left(\frac{C}{\lambda}I + \delta_U H^{(s)} A_U^{(s),-1} H^{(s)\prime}\right)^{-1}\delta_U H^{(s)} A_U^{(s),-1} = A_U^{(s),-1} - \Delta_U^{(s)} A_U^{(s),-1}, \tag{27}$$

where

$$\Delta_U^{(s)} \equiv A_U^{(s),-1}H^{(s)\prime}\left(\frac{C}{\lambda}I + \delta_U H^{(s)} A_U^{(s),-1} H^{(s)\prime}\right)^{-1}\delta_U H^{(s)}. \tag{28}$$
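The Woodbury rewriting in Eqs. (27)-(28) can be checked numerically. The sketch below uses random stand-ins for $H^{(s)}$, $\hat{F}_U$ and $F_U$ (hypothetical data, not from the paper) and verifies that the factored inverse equals a direct inversion:

```python
import numpy as np

rng = np.random.default_rng(3)
N, D = 10, 6
C, lam = 2.0, 0.5
H = rng.standard_normal((N, D))
F_hat = rng.standard_normal(N)               # stand-in for \hat{F}_U
F_new = rng.standard_normal(N)               # stand-in for F_U
# A_U^(s) and delta_U as renamed before Eq. (27)
A = np.eye(D) / C + H.T @ H + (lam / C) * np.outer(H.T @ F_hat, H.T @ F_hat)
delta = np.outer(F_new, F_new) - np.outer(F_hat, F_hat)
A_inv = np.linalg.inv(A)
# left-hand side of Eq. (27): direct inversion of A + (lam/C) H' delta H
direct = np.linalg.inv(A + (lam / C) * H.T @ delta @ H)
# right-hand side of Eq. (27): Woodbury form, only an N x N system is inverted
inner = np.linalg.inv((C / lam) * np.eye(N) + delta @ H @ A_inv @ H.T)
woodbury = A_inv - A_inv @ H.T @ inner @ delta @ H @ A_inv
```

The factored form stays valid even though `delta` here is a low-rank (singular) matrix, which is why the identity is written with $\delta_U$ outside the inner inverse.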

A similar result is obtained for Eq. (26). Because $A_U^{(s),-1}H^{(s)\prime}Y_j = U^{(s)}$ and $A_V^{(s),-1}H^{(s)\prime}Y_j = V^{(s)}$, using Eq. (27) in Eqs. (25), (26) leads to

$$T^{(s)}(U) = U^{(s)} - \Delta_U^{(s)} U^{(s)}, \tag{29}$$

$$T^{(s)}(V) = V^{(s)} - \Delta_V^{(s)} V^{(s)}. \tag{30}$$

The distance can be expressed as

$$d^{(s)}(T^{(s)}(U), T^{(s)}(V)) = \|U^{(s)} - V^{(s)} - (\Delta_U^{(s)}U^{(s)} - \Delta_V^{(s)}V^{(s)})\|^2 = \|U^{(s)} - V^{(s)}\|^2 + \|\Delta_U^{(s)}U^{(s)} - \Delta_V^{(s)}V^{(s)}\|^2 - 2\,\|U^{(s)} - V^{(s)}\|\,\|\Delta_U^{(s)}U^{(s)} - \Delta_V^{(s)}V^{(s)}\|. \tag{31}$$

Since the case $U = V$ is discarded, Eq. (31) can be divided by the distance $d^{(s)}(U^{(s)}, V^{(s)})$,

$$\frac{d^{(s)}(T^{(s)}(U), T^{(s)}(V))}{d^{(s)}(U^{(s)}, V^{(s)})} = 1 + \frac{\|\Delta_U^{(s)}U^{(s)} - \Delta_V^{(s)}V^{(s)}\|^2}{\|U^{(s)} - V^{(s)}\|^2} - 2\,\frac{\|\Delta_U^{(s)}U^{(s)} - \Delta_V^{(s)}V^{(s)}\|}{\|U^{(s)} - V^{(s)}\|} = \left( \frac{\|\Delta_U^{(s)}U^{(s)} - \Delta_V^{(s)}V^{(s)}\|}{\|U^{(s)} - V^{(s)}\|} - 1 \right)^2, \tag{32}$$

and applying Eq. (20),

$$\left( \frac{\|\Delta_U^{(s)}U^{(s)} - \Delta_V^{(s)}V^{(s)}\|}{\|U^{(s)} - V^{(s)}\|} - 1 \right)^2 \leq q < 1. \tag{33}$$

Because real terms raised to the power of 2 are greater than or equal to 0, it is only needed to prove that

$$\left( \frac{\|\Delta_U^{(s)}U^{(s)} - \Delta_V^{(s)}V^{(s)}\|}{\|U^{(s)} - V^{(s)}\|} - 1 \right)^2 < 1, \quad -1 < \frac{\|\Delta_U^{(s)}U^{(s)} - \Delta_V^{(s)}V^{(s)}\|}{\|U^{(s)} - V^{(s)}\|} - 1 < 1, \quad 0 < \frac{\|\Delta_U^{(s)}U^{(s)} - \Delta_V^{(s)}V^{(s)}\|}{\|U^{(s)} - V^{(s)}\|} < 2. \tag{34}$$

The left inequality is assured, due to the non-negativity of norms. Raising the fraction to the power of 2 and applying the norm properties

$$\|x + y\| \leq \|x\| + \|y\|, \tag{35}$$

$$\|BC\| \leq \|B\|\,\|C\|, \tag{36}$$

we have

$$\frac{\|\Delta_U^{(s)}U^{(s)} - \Delta_V^{(s)}V^{(s)}\|^2}{\|U^{(s)} - V^{(s)}\|^2} \leq \frac{\|\Delta_U^{(s)}U^{(s)}\|^2 + \|\Delta_V^{(s)}V^{(s)}\|^2}{\|U^{(s)} - V^{(s)}\|^2} \leq \frac{\|\Delta_U^{(s)}\|^2\|U^{(s)}\|^2 + \|\Delta_V^{(s)}\|^2\|V^{(s)}\|^2}{\|U^{(s)} - V^{(s)}\|^2}, \tag{38}$$

and the problem of the maximum value of Eq. (38) is the generalized Rayleigh quotient Parlett (1998),

$$\max_{U^{(s)}, V^{(s)}} \frac{\begin{pmatrix} U^{(s)} \\ V^{(s)} \end{pmatrix}' \begin{pmatrix} \|\Delta_U^{(s)}\|^2 & 0 \\ 0 & \|\Delta_V^{(s)}\|^2 \end{pmatrix} \begin{pmatrix} U^{(s)} \\ V^{(s)} \end{pmatrix}}{\begin{pmatrix} U^{(s)} \\ V^{(s)} \end{pmatrix}' \begin{pmatrix} 1 & -1 \\ -1 & 1 \end{pmatrix} \begin{pmatrix} U^{(s)} \\ V^{(s)} \end{pmatrix}} = \max_{W} \frac{W'XW}{W'YW}. \tag{39}$$

This problem is equivalent to

$$\max_{W} \; W'XW \quad \text{s.t.} \quad W'YW = K, \tag{40}$$

where $K$ in this problem is the distance between $U^{(s)}$ and $V^{(s)}$, which is nonzero because that case was discarded. This can be solved using Lagrange multipliers,

$$L_W = W'XW - \gamma\,(W'YW - K). \tag{41}$$

Maximizing with respect to $W$, a Generalized Eigenvalue Problem (GEP) is obtained,

$$\nabla L_W = 2(X - \gamma Y)W = 0 \;\Longrightarrow\; XW = \gamma Y W. \tag{42}$$

In the GEP, the finite eigenvalue can be calculated as

$$\gamma = \frac{\|\Delta_U^{(s)}\|^2\,\|\Delta_V^{(s)}\|^2}{\|\Delta_U^{(s)}\|^2 + \|\Delta_V^{(s)}\|^2}, \tag{43}$$

which is the maximum eigenvalue of the quotient in Eq. (39). From Eqs. (32) and (34), if $\gamma < 4$ then the condition from Equation (20) is assured.
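The closed form of Eq. (43) can be verified in the scalar case of the pencil from Eq. (39); here `a` and `b` are illustrative stand-ins for $\|\Delta_U^{(s)}\|^2$ and $\|\Delta_V^{(s)}\|^2$:

```python
import numpy as np

a, b = 3.0, 5.0                             # stand-ins for the two squared norms
X_m = np.diag([a, b])                       # numerator matrix of Eq. (39)
Y_m = np.array([[1.0, -1.0], [-1.0, 1.0]])  # denominator matrix of Eq. (39)
gamma = a * b / (a + b)                     # candidate eigenvalue, Eq. (43)
W = np.array([b, -a])                       # its generalized eigenvector
# gamma solves det(X - gamma Y) = 0 and X W = gamma Y W, the GEP of Eq. (42)
```

Note that $Y$ is singular, so the pencil has only one finite eigenvalue, which is the one of interest.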

Using the norm property in Eq. (35), and adding the previous knowledge $\|x\| \geq \|y\|$, a lower bound can also be set. Taking inverses,

$$\|x\| - \|y\| \leq \|x + y\| \leq \|x\| + \|y\|, \qquad \frac{1}{\|x\| + \|y\|} \leq \frac{1}{\|x + y\|} \leq \frac{1}{\|x\| - \|y\|}. \tag{44}$$

If $\|y\| > \|x\|$, the same reasoning could be followed. From the norm property in Eq. (36), a bound for the norm of an inverse matrix can be set,

$$\|I\| = \|BB^{-1}\| \leq \|B\|\,\|B^{-1}\|, \qquad \frac{1}{\|B^{-1}\|} \leq \|B\|. \tag{45}$$

Applying Eqs. (44) and (45) in the definition of $\Delta_U^{(s)}$ in (28), an upper bound can be found,

$$\|\Delta_U^{(s)}\|^2 \leq \frac{\alpha_U^{(s)}\|\delta_U\|^2}{\frac{C^2}{\lambda^2} - \alpha_U^{(s)}\|\delta_U\|^2}, \tag{46}$$

where $\alpha_U^{(s)} \equiv \|A_U^{(s),-1}\|^2$, and analogously for $\alpha_V^{(s)}$. Replacing in Eq. (43), the following inequality is reached,

$$\gamma \leq \frac{\alpha_U^{(s)}\alpha_V^{(s)}\|\delta_U\|^2\|\delta_V\|^2}{\frac{C^2}{\lambda^2}\left(\alpha_U^{(s)}\|\delta_U\|^2 + \alpha_V^{(s)}\|\delta_V\|^2\right) - 2\,\alpha_U^{(s)}\alpha_V^{(s)}\|\delta_U\|^2\|\delta_V\|^2}, \tag{47}$$

so $\gamma < 4$ can be imposed, forcing the maximum eigenvalue to satisfy the condition through Eq. (47),

$$\lambda_{\max} < \frac{2C}{3}\sqrt{\frac{\alpha_U^{(s)}\|\delta_U\|^2 + \alpha_V^{(s)}\|\delta_V\|^2}{\alpha_U^{(s)}\alpha_V^{(s)}\|\delta_U\|^2\|\delta_V\|^2}}. \tag{48}$$

After considering $\eta^{(s)}$ as

$$\eta^{(s)} \equiv \frac{\|A_U^{(s),-1}\|}{\|A_V^{(s),-1}\|} = \frac{\left\|\left(\frac{I}{C} + H^{(s)\prime}H^{(s)} + \frac{\lambda}{C}H^{(s)\prime}\hat{F}_U\hat{F}_U'H^{(s)}\right)^{-1}\right\|}{\left\|\left(\frac{I}{C} + H^{(s)\prime}H^{(s)} + \frac{\lambda}{C}H^{(s)\prime}\hat{F}_V\hat{F}_V'H^{(s)}\right)^{-1}\right\|}, \tag{49}$$

and replacing into Equation (48),

$$\lambda_{\max} < \frac{2C}{3\,\|A_U^{(s),-1}\|}\sqrt{\frac{\eta^{(s)}\|\delta_U\|^2 + \|\delta_V\|^2}{\|\delta_U\|^2\|\delta_V\|^2}} \equiv \lambda_{\text{bound}}. \tag{50}$$

$\lambda_{\text{bound}}$ values can be obtained numerically, by finding the zero of the following equation,

$$H(\lambda) = \lambda - \frac{2C}{3\,\|A_U^{(s),-1}\|}\sqrt{\frac{\eta^{(s)}\|\delta_U\|^2 + \|\delta_V\|^2}{\|\delta_U\|^2\|\delta_V\|^2}}, \tag{51}$$

because it is an implicit equation, where $A_U^{(s),-1}$ and $\eta^{(s)}$ depend on $\lambda$. However, the bound can be relaxed using the norm property in Equation (45),

$$\|A_U^{(s),-1}\| \leq \frac{1}{\left\|\frac{I}{C} + H^{(s)\prime}H^{(s)}\right\| - \frac{\lambda}{C}\left\|H^{(s)\prime}\hat{F}_U\hat{F}_U'H^{(s)}\right\|}, \qquad \frac{1}{\|A_U^{(s),-1}\|} \geq \left\|\frac{I}{C} + H^{(s)\prime}H^{(s)}\right\| - \frac{\lambda}{C}\left\|H^{(s)\prime}\hat{F}_U\hat{F}_U'H^{(s)}\right\|,$$

thus a more restrictive bound can be set,

$$\lambda < \frac{2C}{3}\left(\left\|\frac{I}{C} + H^{(s)\prime}H^{(s)}\right\| - \frac{\lambda}{C}\left\|H^{(s)\prime}\hat{F}_U\hat{F}_U'H^{(s)}\right\|\right)\sqrt{\frac{\eta^{(s)}\|\delta_U\|^2 + \|\delta_V\|^2}{\|\delta_U\|^2\|\delta_V\|^2}}, \quad \lambda < \frac{2\left\|I + CH^{(s)\prime}H^{(s)}\right\|}{3\left(1 + \left\|H^{(s)\prime}\hat{F}_U\hat{F}_U'H^{(s)}\right\|\right)}\sqrt{\frac{\eta^{(s)}\|\delta_U\|^2 + \|\delta_V\|^2}{\|\delta_U\|^2\|\delta_V\|^2}} \equiv \lambda'_{\text{bound}}. \tag{52}$$

It is trivial to see that, if $\lambda < \lambda'_{\text{bound}}$, then $\lambda < \lambda_{\text{bound}}$. Although $\lambda$ is still implicit in Equation (52) through $\delta_U$ and $\delta_V$, in the following section it is shown that this problem can be avoided.
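Since $H(\lambda)$ in Eq. (51) is implicit, its zero $\lambda_{\text{bound}}$ can be located numerically, for instance by bisection. The function below is a generic sketch; the concrete toy $H$ used here ($\lambda$ minus a term that decreases with $\lambda$, mimicking the shape of Eq. (51)) is purely illustrative:

```python
import numpy as np

def find_zero(H_func, lo=0.0, hi=100.0, tol=1e-10):
    """Bisection for the single sign change of an increasing H(lambda)."""
    assert H_func(lo) < 0 < H_func(hi), "bracket must straddle the zero"
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if H_func(mid) < 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# toy H(lambda) = lambda - 6 / sqrt(1 + lambda); its zero is lambda = 3
lam_bound = find_zero(lambda lam: lam - 6.0 / np.sqrt(1.0 + lam))
```

In practice the norm terms of Eq. (51) would be re-evaluated inside `H_func` for each candidate $\lambda$.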

## 4 Discussion

### 4.1 λ condition

For $\lambda$ values that assure the condition from Eq. (52), the inequality in Eq. (20) is also assured. Eq. (20) is more restrictive than the condition from the Banach fixed-point theorem,

$$d(T(U), T(V)) \leq q\,d(U, V), \quad q \in [0, 1),$$

which means that, under a certain condition on $\lambda$, there is an upper bound $\lambda_{\text{bound}}$ that allows formulating NCELM as a fixed-point iteration. Moreover, because the sequence of Eq. (10) is a fixed-point iteration, $B_{(r)} \to B_j^*$, with

$$B_j^* = \begin{pmatrix} \boldsymbol{\beta}_j^{(1)*} \\ \vdots \\ \boldsymbol{\beta}_j^{(s)*} \\ \vdots \\ \boldsymbol{\beta}_j^{(S)*} \end{pmatrix}, \tag{53}$$

the solution of the system, $T(B_j^*) = B_j^*$; thus, by the definition of $\delta$,

$$\delta_{(r)} = F_{j,(r)}F_{j,(r)}' - F_{j,(r-1)}F_{j,(r-1)}' \to 0, \tag{54}$$

and the condition for $\lambda$ in Eq. (48) is relaxed over the iterations, since the upper bound for $\lambda$ increases,

$$\lim_{r \to \infty} \sqrt{\frac{\eta^{(s)}_{(r)}\|\delta_{(r)}\|^2 + \|\delta_{(r-1)}\|^2}{\|\delta_{(r)}\|^2\,\|\delta_{(r-1)}\|^2}} = +\infty, \tag{55}$$

as long as $0 < \eta^{(s)}_{(r)} < +\infty$. And this is also assured, since the matrices $A^{(s)}_{(r-1)}$ and $A^{(s)}_{(r)}$ exist and are non-singular because of Equations (21) and (22), so from Eq. (49),

$$0 < \|A^{(s),-1}_{(r-1)}\| < +\infty, \qquad 0 < \|A^{(s),-1}_{(r)}\| < +\infty.$$

Eq. (55) implies that any $\lambda$ value can be chosen, whether $\lambda < \lambda_{\text{bound}}$ or not, because the condition is relaxed over the iterations and $\lambda_{\text{bound}}$ becomes larger and larger. If Eq. (11) were not fulfilled in the first iterations for the chosen $\lambda$, the boundary would still be relaxed during the training stage, until $\lambda < \lambda_{\text{bound}}$. Using the base learners obtained at this point of the training stage, the fixed-point iteration could continue with the chosen $\lambda$.

### 4.2 Experimental results

Because the base learners converge to an ensemble optimum, the difference between the coefficient vectors in iteration $r$ and their values in the next iteration, $r + 1$, always decreases. For explanatory purposes, this section graphically shows an example of this convergence. The dataset qsar-biodegradation from the experimental framework of the original paper Perales-González et al. (2020) is chosen.

The hyper-parameter values are fixed for this example. To reduce the computational burden, the vector norm chosen for plotting is not the Euclidean norm but a computationally cheaper one,

 d(β