
# l0-norm Based Centers Selection for Failure Tolerant RBF Networks

Hao Wang, Chi-Sing Leung, Hing Cheung So, Ruibin Feng, and Zifa Han are with the Department of Electronic Engineering, City University of Hong Kong, Kowloon Tong, Kowloon, Hong Kong.
###### Abstract

The aim of this paper is to select the centers of an RBF neural network under concurrent faults. It is well known that fault tolerance is a very attractive property for neural network algorithms, and center selection is an important step in the training process of an RBF neural network. In this paper, we address these two issues simultaneously and devise two novel algorithms. Both of them are based on the ADMM framework and utilize the technique of sparse approximation. For both methods, we first define a fault tolerant objective function. The first method then introduces the MCP function (an approximation of the $l_0$-norm) and combines it with the ADMM framework to select the RBF centers, while the second method uses ADMM together with IHT to solve the problem. The convergence of both methods is proved. Simulation results show that the proposed algorithms are superior to many existing center selection algorithms under concurrent faults.

failure tolerant, RBF, center selection, ADMM, $l_0$-norm, global convergence, MCP, IHT.

## I Introduction

The radial basis function (RBF) neural network [1, 2, 3] is a common model that is widely used in many applications. Its training process basically includes two stages. In the first stage, the RBF centers are determined. Then, the corresponding weights of these RBF centers are estimated. There are many ways to perform RBF center selection. For instance, we can use all input vectors from the training samples as centers [4], or randomly select a subset of the training set [5]. However, the first method may result in a complex network structure and overfitting, while the second cannot ensure that the constructed RBF network covers the input space well.

To overcome the disadvantages of the above two methods, researchers have proposed many other RBF center selection approaches. Among them, clustering algorithms [6], the orthogonal least squares (OLS) approach [7, 8], and support vector regression [9, 10] are the most representative. Most algorithms in this area do not consider the influence of network faults. However, during the training process of neural networks, network faults are almost inevitable. For example, when we calculate the centers' weights, round-off errors are introduced, which can be seen as a kind of multiplicative weight fault [11, 12, 13]. When the connection between two neurons is damaged, signals cannot be transmitted between them, which may result in an open weight fault [14, 15].

Over the past two decades, several fault tolerant neural networks have been proposed [16, 17, 18, 19]. Most of them consider only one kind of network fault. The work in [20] first describes the situation where multiplicative weight faults and open weight faults occur in a neural network concurrently. However, due to the modification of its objective function, the solution of the method proposed in [20] is biased. To handle this issue, a new approach based on OLS and a regularization term was proposed in [21]. The performance of this algorithm is better than that of most existing methods. However, because it uses the OLS approach, its computational cost is very high, and it can only carry out the center selection steps before the training process. To further improve the performance of the network and perform center selection and training simultaneously, we proposed an $l_1$-norm based fault tolerant RBF center selection method in our previous work [22].

In this paper, we further develop our previous work by replacing the $l_1$-norm regularization with an $l_0$-norm term, and we propose two methods to solve the resulting problem. In the first one, we further modify the objective by introducing the minimax concave penalty (MCP) function as a substitute for the $l_0$-norm term. The problem is then solved under the alternating direction method of multipliers (ADMM) framework. In the second method, the ADMM framework and iterative hard thresholding (IHT) are utilized to solve the problem. The main contributions of this paper are: (i) two novel fault tolerant RBF center selection algorithms are developed; (ii) the global convergence of the proposed methods is proved; (iii) the performance improvement of the proposed methods is very significant.

The rest of the paper is organized as follows. The background of RBF neural networks under concurrent faults and of ADMM is described in Section II. In Section III, the two proposed approaches are developed. The global convergence of the two methods is proved in Section IV. Numerical results for algorithm evaluation and comparison are provided in Section V. Finally, conclusions are drawn in Section VI.

## II Background

### II-A Notation

We use a lower-case or upper-case letter to represent a scalar, while vectors and matrices are denoted by bold lower-case and upper-case letters, respectively. The transpose operator is denoted as $(\cdot)^T$, and $\mathbf{I}$ represents the identity matrix with appropriate dimensions. Other mathematical symbols are defined at their first appearance.

### II-B RBF networks under concurrent fault situation

In this paper, the training set is expressed as

$$\mathcal{D}=\{(\mathbf{x}_i,y_i):\mathbf{x}_i\in\mathbb{R}^{K_1},\ y_i\in\mathbb{R},\ i=1,2,\ldots,N\}, \tag{1}$$

where $\mathbf{x}_i$ is the input of the $i$-th sample with dimension $K_1$, and $y_i$ is the corresponding output. Similarly, the test set can be denoted as

$$\mathcal{D}'=\{(\mathbf{x}'_{i'},y'_{i'}):\mathbf{x}'_{i'}\in\mathbb{R}^{K_1},\ y'_{i'}\in\mathbb{R},\ i'=1,2,\ldots,N'\}. \tag{2}$$

Generally speaking, an RBF approach is used to handle a regression problem. The input-output relationship of the data in $\mathcal{D}$ is approximated by a sum of radial basis functions, i.e.,

$$f(\mathbf{x})=\sum_{j=1}^{M}w_j\exp\left(-\frac{\|\mathbf{x}-\mathbf{c}_j\|_2^2}{s}\right), \tag{3}$$

where $\exp(-\|\mathbf{x}-\mathbf{c}_j\|_2^2/s)$ is the $j$-th radial basis function, $w_j$ denotes its weight, the vectors $\mathbf{c}_j$'s are the RBF centers, $s$ is a parameter which can be used to control the radial basis function width, and $M$ denotes the number of RBF centers. Normally, the centers are selected from the input data $\{\mathbf{x}_i\}$. If we directly use all training inputs as centers, we may obtain ill-conditioned solutions. Therefore, center selection is a key step in building an RBF network.

Let $\mathbf{w}=[w_1,\ldots,w_M]^T$. For a fault-free network, the training set error can be expressed as

$$\begin{aligned}E_{\text{train}}&=\frac{1}{N}\sum_{i=1}^{N}\big(y_i-f(\mathbf{x}_i)\big)^2\\&=\frac{1}{N}\sum_{i=1}^{N}\Bigg(y_i-\sum_{j=1}^{M}w_j\exp\left(-\frac{\|\mathbf{x}_i-\mathbf{c}_j\|_2^2}{s}\right)\Bigg)^2\\&=\frac{1}{N}\|\mathbf{y}-\mathbf{A}\mathbf{w}\|_2^2,\end{aligned} \tag{4}$$

where $\mathbf{y}=[y_1,\ldots,y_N]^T$ and $\mathbf{A}$ is an $N\times M$ matrix whose $(i,j)$ entry is

$$a_j(\mathbf{x}_i)=[\mathbf{A}]_{i,j}=\exp\left(-\frac{\|\mathbf{x}_i-\mathbf{c}_j\|_2^2}{s}\right). \tag{5}$$
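To make (3) and (5) concrete, here is a minimal NumPy sketch (our own variable names, not the authors' code) that builds the matrix $\mathbf{A}$ and evaluates the network output:

```python
import numpy as np

def rbf_design_matrix(X, C, s):
    """Build the N x M matrix A of (5): [A]_{i,j} = exp(-||x_i - c_j||^2 / s)."""
    d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)  # pairwise squared distances
    return np.exp(-d2 / s)

# Tiny example: 4 samples in R^2, the first two inputs used as centers.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
C = X[:2]
A = rbf_design_matrix(X, C, s=2.0)
w = np.array([0.5, -0.3])
f = A @ w        # network outputs f(x_i) of (3)
```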

However, in the implementation of an RBF network, weight failures may happen. Multiplicative weight noise and open weight faults are two common faults in RBF networks [11, 12, 13, 16, 18, 19, 23, 24]. When they are concurrent [20, 21], the weights can be modeled as

$$\tilde{w}_j=(w_j+b_j w_j)\beta_j, \tag{6}$$

where $\beta_j\in\{0,1\}$ denotes the open fault of the $j$-th weight. When the connection is opened, $\beta_j=0$; otherwise, $\beta_j=1$. The term $b_j$ in (6) is the multiplicative noise of the $j$-th weight. We can see that the magnitude of the noise is proportional to that of the weight. Normally, we assume that the $b_j$'s are independent and identically distributed (i.i.d.) zero-mean random variables with variance $\sigma_b^2$. With this assumption, the statistics of the $b_j$'s are summarized as

$$\langle b_j\rangle=0,\quad \langle b_j^2\rangle=\sigma_b^2, \tag{7a}$$
$$\langle b_j b_{j'}\rangle=0,\quad\forall\, j\neq j', \tag{7b}$$

where $\langle\cdot\rangle$ is the expectation operator. Furthermore, we assume that the $\beta_j$'s are i.i.d. binary random variables. The probability mass function of $\beta_j$ is given by

$$\mathrm{Prob}(\beta_j=0)=P_\beta, \tag{8}$$
$$\mathrm{Prob}(\beta_j=1)=1-P_\beta. \tag{9}$$

Thus, the statistics of the $\beta_j$'s are

$$\langle\beta_j\rangle=\langle\beta_j^2\rangle=1-P_\beta, \tag{10a}$$
$$\langle\beta_j\beta_{j'}\rangle=(1-P_\beta)^2,\quad\forall\, j\neq j'. \tag{10b}$$
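The fault model (6)-(10) can be simulated directly. The sketch below (an illustrative Python/NumPy sketch with hypothetical helper names) draws random fault patterns and checks that the empirical mean of the faulty weights approaches $(1-P_\beta)\mathbf{w}$:

```python
import numpy as np

def apply_concurrent_faults(w, sigma_b2, p_beta, rng):
    """Sample faulty weights w~_j = (w_j + b_j w_j) * beta_j as in (6)."""
    b = rng.normal(0.0, np.sqrt(sigma_b2), size=w.shape)   # multiplicative noise, (7)
    beta = (rng.random(w.shape) >= p_beta).astype(float)   # open fault: 0 w.p. P_beta, (8)-(9)
    return (w + b * w) * beta

rng = np.random.default_rng(0)
w = np.ones(5)
samples = np.stack([apply_concurrent_faults(w, 0.04, 0.2, rng)
                    for _ in range(20000)])
mean_w = samples.mean(axis=0)   # should approach (1 - P_beta) * w = 0.8 * w
```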

Given a particular fault pattern of $\{b_j\}$ and $\{\beta_j\}$, the training set error can be expressed as

$$\begin{aligned}\tilde{E}_{\text{train}}&=\frac{1}{N}\|\mathbf{y}-\mathbf{A}\tilde{\mathbf{w}}\|_2^2\\&=\frac{1}{N}\sum_{i=1}^{N}\Bigg[y_i^2-2y_i\sum_{j=1}^{M}\beta_j w_j a_j(\mathbf{x}_i)\\&\quad+\sum_{j=1}^{M}\sum_{j'=1}^{M}\beta_j\beta_{j'}w_j w_{j'}(1+b_j b_{j'})a_j(\mathbf{x}_i)a_{j'}(\mathbf{x}_i)\\&\quad+\sum_{j=1}^{M}\sum_{j'=1}^{M}(b_j+b_{j'})\beta_j\beta_{j'}w_j w_{j'}a_j(\mathbf{x}_i)a_{j'}(\mathbf{x}_i)\\&\quad-2y_i\sum_{j=1}^{M}b_j\beta_j w_j a_j(\mathbf{x}_i)\Bigg]. \end{aligned}\tag{11}$$

From (7) and (10), the average training set error [21] over all possible failures is given by

$$\bar{E}_{\text{train}}=\frac{P_\beta}{N}\sum_{i=1}^{N}y_i^2+\frac{1-P_\beta}{N}\|\mathbf{y}-\mathbf{A}\mathbf{w}\|_2^2+\frac{1-P_\beta}{N}\mathbf{w}^T\big[(P_\beta+\sigma_b^2)\,\mathrm{diag}(\mathbf{A}^T\mathbf{A})-P_\beta\mathbf{A}^T\mathbf{A}\big]\mathbf{w}. \tag{12}$$

In (12), the term $\frac{P_\beta}{N}\sum_{i=1}^{N}y_i^2$ can be seen as a constant with respect to the weight vector $\mathbf{w}$. Hence, after dropping this constant and the common factor $(1-P_\beta)$, the training objective function can be defined as

$$\psi(\mathbf{w})=\frac{1}{N}\|\mathbf{y}-\mathbf{A}\mathbf{w}\|_2^2+\mathbf{w}^T\mathbf{R}\mathbf{w}, \tag{13}$$

where $\mathbf{R}=\frac{1}{N}\big[(P_\beta+\sigma_b^2)\,\mathrm{diag}(\mathbf{A}^T\mathbf{A})-P_\beta\mathbf{A}^T\mathbf{A}\big]$.
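To make the derivation concrete, the sketch below checks (12)-(13) numerically on random data: the closed-form average error is compared with a Monte Carlo average over fault patterns (an illustrative check with arbitrary sizes and fault parameters, not the paper's experiment):

```python
import numpy as np

rng = np.random.default_rng(1)
N, M = 50, 8
A = rng.normal(size=(N, M))          # stands in for the RBF matrix of (5)
y = rng.normal(size=N)
w = rng.normal(size=M)
p_beta, sigma_b2 = 0.1, 0.09         # fault rate P_beta and noise variance sigma_b^2

# R as defined after (13)
AtA = A.T @ A
R = ((p_beta + sigma_b2) * np.diag(np.diag(AtA)) - p_beta * AtA) / N
psi = np.sum((y - A @ w) ** 2) / N + w @ R @ w       # objective (13)

# Closed form (12) equals P_beta/N * sum(y^2) + (1 - P_beta) * psi(w)
avg_closed = p_beta * np.sum(y ** 2) / N + (1 - p_beta) * psi

# Monte Carlo average of the faulty error over random fault patterns (6)-(10)
T = 50000
b = rng.normal(0.0, np.sqrt(sigma_b2), size=(T, M))
beta = (rng.random((T, M)) >= p_beta).astype(float)
W_faulty = (w + b * w) * beta                        # one fault pattern per row
errs = ((y[None, :] - W_faulty @ A.T) ** 2).sum(axis=1) / N
avg_mc = errs.mean()
```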

The ADMM framework is an iterative approach for solving optimization problems by breaking them into smaller pieces [25]. The algorithm can solve problems in the form

$$\min_{\mathbf{z},\mathbf{y}}\ \ \psi(\mathbf{z})+g(\mathbf{y}) \tag{14a}$$
$$\text{s.t.}\ \ \mathbf{C}\mathbf{z}+\mathbf{D}\mathbf{y}=\mathbf{b}, \tag{14b}$$

with variables $\mathbf{z}$ and $\mathbf{y}$, where $\mathbf{C}$, $\mathbf{D}$, and $\mathbf{b}$ have compatible dimensions. In the ADMM framework, we first need to construct an augmented Lagrangian function

$$L(\mathbf{z},\mathbf{y},\boldsymbol{\alpha})=\psi(\mathbf{z})+g(\mathbf{y})+\boldsymbol{\alpha}^T(\mathbf{C}\mathbf{z}+\mathbf{D}\mathbf{y}-\mathbf{b})+\frac{\rho}{2}\|\mathbf{C}\mathbf{z}+\mathbf{D}\mathbf{y}-\mathbf{b}\|_2^2, \tag{15}$$

where $\boldsymbol{\alpha}$ is the Lagrange multiplier vector, and $\rho>0$ is a trade-off parameter. The algorithm consists of the iterations:

$$\mathbf{z}^{k+1}=\arg\min_{\mathbf{z}}L(\mathbf{z},\mathbf{y}^{k},\boldsymbol{\alpha}^{k}), \tag{16a}$$
$$\mathbf{y}^{k+1}=\arg\min_{\mathbf{y}}L(\mathbf{z}^{k+1},\mathbf{y},\boldsymbol{\alpha}^{k}), \tag{16b}$$
$$\boldsymbol{\alpha}^{k+1}=\boldsymbol{\alpha}^{k}+\rho(\mathbf{C}\mathbf{z}^{k+1}+\mathbf{D}\mathbf{y}^{k+1}-\mathbf{b}). \tag{16c}$$

## III Development of Proposed Algorithms

In (13), we use $M$ RBF centers. To limit the size of the RBF network and automatically select appropriate centers during the training process, we further modify the objective function and propose two novel algorithms based on the property of the $l_0$-norm and the ADMM framework.

### III-A RBF center selection based on ADMM MCP

In the first method, we consider introducing an additional penalty term (the $l_0$-norm) into (13); then we have

$$\hat{Q}(\mathbf{w},\lambda)=\frac{1}{N}\|\mathbf{y}-\mathbf{A}\mathbf{w}\|_2^2+\mathbf{w}^T\mathbf{R}\mathbf{w}+\lambda\|\mathbf{w}\|_0, \tag{17}$$

where $\|\mathbf{w}\|_0$, although not a proper norm, represents the number of non-zero entries in the vector $\mathbf{w}$. Due to the $l_0$-norm term, the problem in (17) is NP-hard [26]. Inspired by [27, 28], the minimax concave penalty (MCP) function is a very attractive approximation of the $l_0$-norm. Hence, we further modify the function in (17) to obtain

$$Q(\mathbf{w},\lambda)=\frac{1}{N}\|\mathbf{y}-\mathbf{A}\mathbf{w}\|_2^2+\mathbf{w}^T\mathbf{R}\mathbf{w}+P_{\lambda,\gamma}(\mathbf{w}), \tag{18}$$

where $P_{\lambda,\gamma}(\mathbf{w})=\sum_{i=1}^{M}P_{\lambda,\gamma}(w_i)$ denotes the MCP function,

$$P_{\lambda,\gamma}(w_i)=\begin{cases}\lambda|w_i|-\dfrac{w_i^2}{2\gamma}, & \text{if } |w_i|\le\gamma\lambda,\\[4pt] \dfrac{1}{2}\gamma\lambda^2, & \text{if } |w_i|>\gamma\lambda,\end{cases} \tag{19}$$

and

$$\frac{\partial P_{\lambda,\gamma}(\mathbf{w})}{\partial w_i}=\lambda\,\mathrm{sign}(w_i)\left(1-\frac{|w_i|}{\lambda\gamma}\right)_{+}=\begin{cases}\lambda\,\mathrm{sign}(w_i)-\dfrac{w_i}{\gamma}, & \text{if } |w_i|\le\gamma\lambda,\\ 0, & \text{if } |w_i|>\gamma\lambda.\end{cases} \tag{20}$$

The shape of the MCP penalty function with different parameter settings is given in Fig. 1. From the figure, we can see that, with an appropriate parameter setting, the MCP penalty function can be used to replace $\lambda\|\mathbf{w}\|_0$.
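The MCP penalty (19) is also easy to evaluate directly. The sketch below (our own helper, with example parameter values) shows that the penalty grows up to the knee $|w_i|=\gamma\lambda$ and stays flat at $\gamma\lambda^2/2$ beyond it, so large weights are not penalized further, unlike the $l_1$ norm:

```python
import numpy as np

def mcp(w, lam, gamma):
    """MCP penalty (19), applied elementwise to the magnitudes of w."""
    a = np.abs(np.asarray(w, dtype=float))
    return np.where(a <= gamma * lam,
                    lam * a - a ** 2 / (2.0 * gamma),   # growing region
                    0.5 * gamma * lam ** 2)             # flat region beyond the knee

lam, gamma = 1.0, 2.0
vals = mcp([0.0, 1.0, gamma * lam, 5.0, -5.0], lam, gamma)
# vals is continuous at the knee |w| = gamma*lam, where both branches give gamma*lam^2/2.
```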

Then we use the ADMM framework to solve the problem shown in (18). Following the steps of ADMM, we first introduce a dummy variable $\mathbf{u}$ and transform the unconstrained problem into the standard constrained form

$$\min_{\mathbf{w},\mathbf{u}}\ \psi(\mathbf{w})+P_{\lambda,\gamma}(\mathbf{u}), \tag{21a}$$
$$\text{s.t.}\ \ \mathbf{u}=\mathbf{w}, \tag{21b}$$

where $\psi(\mathbf{w})$ is given by (13). Then we construct the augmented Lagrangian as

$$L(\mathbf{w},\mathbf{u},\boldsymbol{\upsilon})=\psi(\mathbf{w})+P_{\lambda,\gamma}(\mathbf{u})+\boldsymbol{\upsilon}^T(\mathbf{u}-\mathbf{w})+\frac{\rho}{2}\|\mathbf{w}-\mathbf{u}\|_2^2. \tag{22}$$

According to (16), the ADMM update of $\mathbf{u}$ is

$$\begin{aligned}\mathbf{u}^{k+1}&=\arg\min_{\mathbf{u}}L(\mathbf{w}^{k},\mathbf{u},\boldsymbol{\upsilon}^{k})\\&=\arg\min_{\mathbf{u}}P_{\lambda,\gamma}(\mathbf{u})+{\boldsymbol{\upsilon}^{k}}^{T}(\mathbf{u}-\mathbf{w}^{k})+\frac{\rho}{2}\|\mathbf{w}^{k}-\mathbf{u}\|_2^2\\&=\arg\min_{\mathbf{u}}P_{\lambda,\gamma}(\mathbf{u})+\frac{\rho}{2}\left\|\mathbf{w}^{k}-\mathbf{u}-\frac{1}{\rho}\boldsymbol{\upsilon}^{k}\right\|_2^2, \end{aligned}\tag{23}$$

where we denote $\mathbf{z}^{k}=\mathbf{w}^{k}-\frac{1}{\rho}\boldsymbol{\upsilon}^{k}$. For any $\rho>1/\gamma$, elementwise,

$$u_i^{k+1}=\begin{cases}\dfrac{S(z_i^{k},\lambda/\rho)}{1-1/(\gamma\rho)}, & \text{if } |z_i^{k}|\le\gamma\lambda,\\[4pt] z_i^{k}, & \text{if } |z_i^{k}|>\gamma\lambda,\end{cases} \tag{24}$$

where $S(\cdot,\cdot)$ denotes the soft-threshold operator [29],

$$S(z,\lambda)=\mathrm{sign}(z)\max\{|z|-\lambda,0\}.$$

Equation (24) is an approximate optimal solution of the optimization problem in (23). Besides, it is worth noting that when $\gamma$ is large, the function in (24) behaves like the soft-threshold operator; on the other hand, when $\gamma$ approaches $1/\rho$, the function is close to the hard-threshold operator.
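A sketch of the update (24) in NumPy (the function names are ours; this implements the MCP proximal step described above, assuming $\rho>1/\gamma$):

```python
import numpy as np

def soft(z, t):
    """Soft-threshold operator S(z, t) of [29]."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def mcp_prox(z, lam, gamma, rho):
    """Elementwise update (24): prox of the MCP penalty at z, assuming rho > 1/gamma."""
    shrunk = soft(z, lam / rho) / (1.0 - 1.0 / (gamma * rho))
    return np.where(np.abs(z) <= gamma * lam, shrunk, z)

z = np.array([-3.0, -0.5, 0.2, 1.5, 4.0])
u = mcp_prox(z, lam=1.0, gamma=2.0, rho=5.0)
# Entries with |z_i| > gamma*lam = 2 pass through unchanged (hard-threshold-like);
# smaller ones are soft-thresholded and slightly re-expanded.
```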

$\mathbf{w}^{k+1}$ is directly calculated from the first-order optimality condition; the solution is given by:

$$\begin{aligned}\mathbf{w}^{k+1}&=\arg\min_{\mathbf{w}}L(\mathbf{w},\mathbf{u}^{k+1},\boldsymbol{\upsilon}^{k})\\&=\arg\min_{\mathbf{w}}\psi(\mathbf{w})+{\boldsymbol{\upsilon}^{k}}^{T}(\mathbf{u}^{k+1}-\mathbf{w})+\frac{\rho}{2}\|\mathbf{w}-\mathbf{u}^{k+1}\|_2^2\\&=\arg\min_{\mathbf{w}}\psi(\mathbf{w})+\frac{\rho}{2}\left\|\mathbf{w}-\mathbf{u}^{k+1}-\frac{1}{\rho}\boldsymbol{\upsilon}^{k}\right\|_2^2\\&=\left[\frac{2}{N}\mathbf{A}^T\mathbf{A}+2\mathbf{R}+\rho\mathbf{I}\right]^{-1}\left[\frac{2}{N}\mathbf{A}^T\mathbf{y}+\rho\mathbf{u}^{k+1}+\boldsymbol{\upsilon}^{k}\right]. \end{aligned}\tag{25}$$

$\boldsymbol{\upsilon}$ is updated as

$$\boldsymbol{\upsilon}^{k+1}=\boldsymbol{\upsilon}^{k}+\rho(\mathbf{u}^{k+1}-\mathbf{w}^{k+1}). \tag{26}$$
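Putting (23)-(26) together, the first method can be sketched as follows (an illustrative implementation on a toy problem; the parameter values, the identity-matrix stand-in for $\mathbf{R}$, and the name `admm_mcp` are our own choices, not the paper's experimental setup):

```python
import numpy as np

def admm_mcp(A, y, R, lam=0.05, gamma=4.0, rho=1.0, iters=200):
    """Sketch of the first method: iterate the updates (23)-(26) for objective (18)."""
    N, M = A.shape
    w = np.zeros(M); u = np.zeros(M); v = np.zeros(M)
    H = 2.0 / N * (A.T @ A) + 2.0 * R + rho * np.eye(M)   # system matrix of (25)
    rhs0 = 2.0 / N * (A.T @ y)
    for _ in range(iters):
        z = w - v / rho
        s = np.sign(z) * np.maximum(np.abs(z) - lam / rho, 0.0)
        u = np.where(np.abs(z) <= gamma * lam,
                     s / (1.0 - 1.0 / (gamma * rho)), z)  # u-update (24)
        w = np.linalg.solve(H, rhs0 + rho * u + v)        # w-update (25)
        v = v + rho * (u - w)                             # multiplier update (26)
    return w, u

# Toy problem: y is generated by 2 of 10 candidate centers.
rng = np.random.default_rng(0)
A = rng.normal(size=(100, 10))
w_true = np.zeros(10); w_true[[2, 7]] = [1.5, -2.0]
y = A @ w_true + 0.01 * rng.normal(size=100)
R = 0.01 * np.eye(10)      # simple stand-in for the fault-tolerant regularizer
w_hat, u_hat = admm_mcp(A, y, R)
```

With these settings the spurious weights are driven to (near) zero while the two true centers keep weights close to their generating values.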

In the following part, we illustrate the convexity of the first method's objective with Theorem 1.

###### Theorem 1

Let $\lambda_{\min}$ denote the minimum eigenvalue of the matrix $\frac{2}{N}\mathbf{A}^T\mathbf{A}+2\mathbf{R}$. If $\gamma>1/\lambda_{\min}$, then the objective function $Q(\mathbf{w},\lambda)$ is strongly convex with respect to $\mathbf{w}$.

Proof:

$$\frac{\partial^2\psi(\mathbf{w})}{\partial\mathbf{w}^2}=\frac{2}{N}\mathbf{A}^T\mathbf{A}+2\mathbf{R}=\frac{2}{N}\big[(1-P_\beta)\mathbf{A}^T\mathbf{A}+(P_\beta+\sigma_b^2)\,\mathrm{diag}(\mathbf{A}^T\mathbf{A})\big]. \tag{27}$$

Obviously, (27) is positive definite, hence $\psi(\mathbf{w})$ is strongly convex. Then, we calculate the gradient of $Q(\mathbf{w},\lambda)$ with respect to $\mathbf{w}$,

$$\frac{\partial Q(\mathbf{w},\lambda)}{\partial\mathbf{w}}=\frac{2}{N}\mathbf{A}^T(\mathbf{A}\mathbf{w}-\mathbf{y})+2\mathbf{R}\mathbf{w}+\lambda\,\mathrm{sign}(\mathbf{w})\odot\left(\mathbf{1}-\frac{|\mathbf{w}|}{\lambda\gamma}\right)_{+}, \tag{28}$$

where the last term is applied elementwise. Its second-order derivative with respect to $\mathbf{w}$ is given by

$$\frac{\partial^2 Q(\mathbf{w},\lambda)}{\partial\mathbf{w}^2}=\frac{\partial^2\psi(\mathbf{w})}{\partial\mathbf{w}^2}+\mathbf{\Theta}, \tag{29}$$

where

$$\mathbf{\Theta}=\begin{bmatrix}\theta_1&\cdots&0\\ \vdots&\ddots&\vdots\\ 0&\cdots&\theta_M\end{bmatrix},$$

with, for $i=1,\ldots,M$,

$$\theta_i=\begin{cases}-1/\gamma, & \text{if } |w_i|\le\gamma\lambda,\\ 0, & \text{if } |w_i|>\gamma\lambda.\end{cases} \tag{30}$$

Thus it is obvious that if $\lambda_{\min}-1/\gamma>0$, i.e., $\gamma>1/\lambda_{\min}$, then the objective function $Q(\mathbf{w},\lambda)$ is also strongly convex with respect to $\mathbf{w}$.
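Theorem 1 can be checked numerically. The sketch below builds the Hessian (27) on random data, picks $\gamma=2/\lambda_{\min}$ (so that $\gamma>1/\lambda_{\min}$), and verifies that the Hessian (29) remains positive definite even in the worst case of (30), where every $\theta_i=-1/\gamma$ (the values of $P_\beta$ and $\sigma_b^2$ are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 60, 6
A = rng.normal(size=(N, M))
p_beta, sigma_b2 = 0.1, 0.04
AtA = A.T @ A
R = ((p_beta + sigma_b2) * np.diag(np.diag(AtA)) - p_beta * AtA) / N

H_psi = 2.0 / N * AtA + 2.0 * R           # Hessian (27) of psi
lam_min = np.linalg.eigvalsh(H_psi).min()

gamma = 2.0 / lam_min                      # satisfies gamma > 1/lam_min
Theta = -(1.0 / gamma) * np.eye(M)         # worst case of (30): every |w_i| <= gamma*lam
H_Q = H_psi + Theta                        # Hessian (29) of Q
# Since Theta shifts all eigenvalues by -1/gamma, H_Q stays positive definite.
```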

### III-B RBF center selection based on ADMM IHT

In the first method, two trade-off parameters $\lambda$ and $\gamma$ are used to regulate the relative magnitudes of the training set error and the penalty term. The appropriate settings of $\lambda$ and $\gamma$ differ when the magnitude of the training set error varies. Besides, sometimes we may need to make the number of RBF centers exactly equal to a certain value. With the first method, we would have to repeatedly try different values of $\lambda$ to meet this requirement, which is very inconvenient. Hence we propose the second method, which modifies the problem in (13) to a constrained form given by

$$\arg\min_{\mathbf{w}}\ \frac{1}{N}\|\mathbf{y}-\mathbf{A}\mathbf{w}\|_2^2+\mathbf{w}^T\mathbf{R}\mathbf{w}, \tag{31a}$$
$$\text{s.t.}\ \ \|\mathbf{w}\|_0\le K. \tag{31b}$$

With the constraint in (31b), we can ensure that the number of RBF centers is equal to or smaller than $K$. But this problem cannot be solved by the ADMM framework directly, because the constraint is not decomposable. Hence we introduce an indicator function

$$i_{c(K)}(\mathbf{w})=\begin{cases}0, & \text{if } \mathbf{w}\in c(K),\\ +\infty, & \text{otherwise},\end{cases} \tag{32}$$

where the set $c(K)=\{\mathbf{w}:\|\mathbf{w}\|_0\le K\}$. After that, the problem in (31) can be rewritten as

$$\arg\min_{\mathbf{w}}\ \frac{1}{N}\|\mathbf{y}-\mathbf{A}\mathbf{w}\|_2^2+\mathbf{w}^T\mathbf{R}\mathbf{w}+i_{c(K)}(\mathbf{w}). \tag{33}$$

Then we follow the standard steps of ADMM: introduce a dummy variable $\mathbf{u}$ and rewrite the problem in (33) as

$$\arg\min_{\mathbf{w},\mathbf{u}}\ \psi(\mathbf{w})+g(\mathbf{u}), \tag{34a}$$
$$\text{s.t.}\ \ \mathbf{w}=\mathbf{u}, \tag{34b}$$

where $g(\mathbf{u})$ denotes the indicator function $i_{c(K)}(\mathbf{u})$. Its augmented Lagrangian is

$$L(\mathbf{w},\mathbf{u},\boldsymbol{\upsilon})=\psi(\mathbf{w})+g(\mathbf{u})+\boldsymbol{\upsilon}^T(\mathbf{u}-\mathbf{w})+\frac{\rho}{2}\|\mathbf{w}-\mathbf{u}\|_2^2. \tag{35}$$

According to (35) and (16), we have:

$$\begin{aligned}\mathbf{u}^{k+1}&=\arg\min_{\mathbf{u}}L(\mathbf{w}^{k},\mathbf{u},\boldsymbol{\upsilon}^{k})\\&=\arg\min_{\mathbf{u}}g(\mathbf{u})+\frac{\rho}{2}\left\|\mathbf{w}^{k}-\mathbf{u}-\frac{1}{\rho}\boldsymbol{\upsilon}^{k}\right\|_2^2\\&\approx\mathrm{IHT}(\mathbf{w}^{k}-\boldsymbol{\upsilon}^{k}/\rho), \end{aligned}\tag{36}$$

where IHT is short for iterative hard thresholding [30]: $\mathrm{IHT}(\cdot)$ sets all entries of its argument to 0 except the $K$ largest-magnitude ones. Obviously, the iterative hard threshold restricts $\mathbf{u}^{k+1}$ to the feasible region $c(K)$. It is worth noting that the second method just replaces the MCP function in the Lagrangian (22) with the indicator function $g(\mathbf{u})$. This does not influence the updates of $\mathbf{w}$ and $\boldsymbol{\upsilon}$. Hence in the second method we still have:

$$\mathbf{w}^{k+1}=\left[\frac{2}{N}\mathbf{A}^T\mathbf{A}+2\mathbf{R}+\rho\mathbf{I}\right]^{-1}\left[\frac{2}{N}\mathbf{A}^T\mathbf{y}+\rho\mathbf{u}^{k+1}+\boldsymbol{\upsilon}^{k}\right] \tag{37}$$

and

$$\boldsymbol{\upsilon}^{k+1}=\boldsymbol{\upsilon}^{k}+\rho(\mathbf{u}^{k+1}-\mathbf{w}^{k+1}). \tag{38}$$
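The second method, (36)-(38), can be sketched in the same way (again an illustrative toy setup; `hard_threshold_topk`, the data sizes, and the stand-in $\mathbf{R}$ are our own):

```python
import numpy as np

def hard_threshold_topk(z, K):
    """The IHT step in (36): keep the K largest-magnitude entries, zero the rest."""
    u = np.zeros_like(z)
    idx = np.argsort(np.abs(z))[-K:]
    u[idx] = z[idx]
    return u

def admm_iht(A, y, R, K, rho=1.0, iters=200):
    """Sketch of the second method: iterate (36)-(38) for problem (31)."""
    N, M = A.shape
    w = np.zeros(M); u = np.zeros(M); v = np.zeros(M)
    H = 2.0 / N * (A.T @ A) + 2.0 * R + rho * np.eye(M)
    rhs0 = 2.0 / N * (A.T @ y)
    for _ in range(iters):
        u = hard_threshold_topk(w - v / rho, K)        # (36)
        w = np.linalg.solve(H, rhs0 + rho * u + v)     # (37)
        v = v + rho * (u - w)                          # (38)
    return w, u

rng = np.random.default_rng(0)
A = rng.normal(size=(100, 10))
w_true = np.zeros(10); w_true[[2, 7]] = [1.5, -2.0]
y = A @ w_true + 0.01 * rng.normal(size=100)
R = 0.01 * np.eye(10)
w_hat, u_hat = admm_iht(A, y, R, K=2)   # at most K = 2 centers are kept
```

Unlike the first method, the number of retained centers is controlled directly by $K$ rather than indirectly through $\lambda$ and $\gamma$.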

## IV Analysis of Global Convergence

In this section, we mainly discuss the convergence of the two proposed methods. We cannot directly follow the general convergence proof of nonconvex ADMM given in [31], because some of the assumptions in [31] are not satisfied; for instance, the function $g(\mathbf{u})$ is not a restricted prox-regular function. But the global convergence of our two proposed methods can still be proved. We first give a sketch of our proof in Theorem 2; the details will be discussed later.

###### Theorem 2

If the proposed methods satisfy the following three conditions:

C1 (Sufficient decrease condition) For each $k$, there exists $\tau_1>0$ such that

$$L(\mathbf{w}^{k+1},\mathbf{u}^{k+1},\boldsymbol{\upsilon}^{k+1})-L(\mathbf{w}^{k},\mathbf{u}^{k},\boldsymbol{\upsilon}^{k})\le-\tau_1\|\mathbf{w}^{k+1}-\mathbf{w}^{k}\|_2^2. \tag{39}$$

C2 (Boundedness condition) The sequences generated by the two proposed methods are bounded, and their Lagrangian functions are lower bounded.

C3 (Subgradient bound condition) For each $k$, there exist $\tau_2>0$ and $\mathbf{d}^{k+1}\in\partial L(\mathbf{w}^{k+1},\mathbf{u}^{k+1},\boldsymbol{\upsilon}^{k+1})$ such that

$$\|\mathbf{d}^{k+1}\|_2^2\le\tau_2\|\mathbf{w}^{k+1}-\mathbf{w}^{k}\|_2^2. \tag{40}$$

Then, based on the above three conditions, we can further deduce that the sequences $\{(\mathbf{w}^{k},\mathbf{u}^{k},\boldsymbol{\upsilon}^{k})\}$ have at least one limit point, and any limit point is a stationary point. Moreover, if their Lagrangian functions are Kurdyka-Łojasiewicz (KŁ) functions, then the sequences globally converge to a unique stationary point.

Proof: The theorem is similar to Proposition 2 in [31] and Theorem 2.9 in [32], and its proof is standard, so we omit it here. The details can be found in the proof of Proposition 2 in [31].

For the proof of convergence, the key step is to show that the three conditions above are satisfied. Hence, we have the following three propositions.

###### Proposition 1

If $\rho$ is greater than a certain value, the two proposed methods satisfy the sufficient decrease condition in C1.

Proof: In the following proof, we use the second method as an example. For the first method, the proof is the same except that the function $g(\mathbf{u})$ is replaced by $P_{\lambda,\gamma}(\mathbf{u})$.

For the second method, the Lagrangian function in (35) can be rewritten as

$$L(\mathbf{w},\mathbf{u},\boldsymbol{\upsilon})=\psi(\mathbf{w})+\frac{\rho}{2}\left\|\mathbf{w}-\mathbf{u}-\frac{1}{\rho}\boldsymbol{\upsilon}\right\|_2^2+g(\mathbf{u})-\frac{1}{2\rho}\|\boldsymbol{\upsilon}\|_2^2. \tag{41}$$

Since $\psi(\mathbf{w})$ is strongly convex, we can deduce that (41) is also strongly convex with respect to $\mathbf{w}$. Hence, based on the definition of a strongly convex function and the fact that $\mathbf{w}^{k+1}$ minimizes $L(\mathbf{w},\mathbf{u}^{k+1},\boldsymbol{\upsilon}^{k})$, we have

$$L(\mathbf{w}^{k+1},\mathbf{u}^{k+1},\boldsymbol{\upsilon}^{k})-L(\mathbf{w}^{k},\mathbf{u}^{k+1},\boldsymbol{\upsilon}^{k})\le-\frac{a}{2}\|\mathbf{w}^{k+1}-\mathbf{w}^{k}\|_2^2, \tag{42}$$

where $a>0$ is the strong convexity modulus of $L$ with respect to $\mathbf{w}$ (here $a=\lambda_{\min}+\rho$, with $\lambda_{\min}$ as defined in Theorem 1).

Then, from (37), we have

$$\nabla\psi(\mathbf{w}^{k+1})-\boldsymbol{\upsilon}^{k}+\rho(\mathbf{w}^{k+1}-\mathbf{u}^{k+1})=\mathbf{0}, \tag{43}$$

and combining it with (38) we can deduce that $\boldsymbol{\upsilon}^{k+1}=\nabla\psi(\mathbf{w}^{k+1})$ and $\boldsymbol{\upsilon}^{k}=\nabla\psi(\mathbf{w}^{k})$. Thus

$$\begin{aligned}&L(\mathbf{w}^{k+1},\mathbf{u}^{k+1},\boldsymbol{\upsilon}^{k+1})-L(\mathbf{w}^{k+1},\mathbf{u}^{k+1},\boldsymbol{\upsilon}^{k})\\&=(\boldsymbol{\upsilon}^{k+1}-\boldsymbol{\upsilon}^{k})^T(\mathbf{u}^{k+1}-\mathbf{w}^{k+1})\\&=\frac{1}{\rho}\|\boldsymbol{\upsilon}^{k+1}-\boldsymbol{\upsilon}^{k}\|_2^2=\frac{1}{\rho}\|\nabla\psi(\mathbf{w}^{k+1})-\nabla\psi(\mathbf{w}^{k})\|_2^2\\&\le\frac{l_\psi^2}{\rho}\|\mathbf{w}^{k+1}-\mathbf{w}^{k}\|_2^2, \end{aligned}\tag{44}$$

where $l_\psi$ is a Lipschitz constant of $\nabla\psi$, and the last inequality holds because $\psi$ has a Lipschitz continuous gradient.

Finally, because $\mathbf{u}^{k+1}$ is an approximate optimal solution of (36), it is reasonable to assume that

$$L(\mathbf{w}^{k},\mathbf{u}^{k+1},\boldsymbol{\upsilon}^{k})-L(\mathbf{w}^{k},\mathbf{u}^{k},\boldsymbol{\upsilon}^{k})\le0. \tag{45}$$

Combining (42), (44) and (45), we have

$$\begin{aligned}&L(\mathbf{w}^{k+1},\mathbf{u}^{k+1},\boldsymbol{\upsilon}^{k+1})-L(\mathbf{w}^{k},\mathbf{u}^{k},\boldsymbol{\upsilon}^{k})\\&=L(\mathbf{w}^{k+1},\mathbf{u}^{k+1},\boldsymbol{\upsilon}^{k+1})-L(\mathbf{w}^{k+1},\mathbf{u}^{k+1},\boldsymbol{\upsilon}^{k})\\&\quad+L(\mathbf{w}^{k+1},\mathbf{u}^{k+1},\boldsymbol{\upsilon}^{k})-L(\mathbf{w}^{k},\mathbf{u}^{k+1},\boldsymbol{\upsilon}^{k})\\&\quad+L(\mathbf{w}^{k},\mathbf{u}^{k+1},\boldsymbol{\upsilon}^{k})-L(\mathbf{w}^{k},\mathbf{u}^{k},\boldsymbol{\upsilon}^{k})\\&\le\left(\frac{l_\psi^2}{\rho}-\frac{a}{2}\right)\|\mathbf{w}^{k+1}-\mathbf{w}^{k}\|_2^2. \end{aligned}\tag{46}$$

To ensure sufficient decrease, we need $\frac{a}{2}-\frac{l_\psi^2}{\rho}>0$, i.e., $\rho>2l_\psi^2/a$. Hence the constant in C1 is $\tau_1=\frac{a}{2}-\frac{l_\psi^2}{\rho}$.
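The sufficient decrease condition can be observed numerically. The sketch below runs the ADMM-IHT iterations (36)-(38) with a comfortably large $\rho$ and records the augmented Lagrangian; it also checks the identity $\boldsymbol{\upsilon}^{k+1}=\nabla\psi(\mathbf{w}^{k+1})$ used in (44) (an illustrative check on random data with our own parameter choices):

```python
import numpy as np

rng = np.random.default_rng(3)
N, M, K = 80, 8, 3
A = rng.normal(size=(N, M)); y = rng.normal(size=N)
AtA = A.T @ A
R = (0.15 * np.diag(np.diag(AtA)) - 0.1 * AtA) / N   # P_beta + sigma_b^2 = 0.15, P_beta = 0.1

def grad_psi(w):
    return 2.0 / N * (A.T @ (A @ w - y)) + 2.0 * R @ w

def lagrangian(w, u, v, rho):
    # g(u) = 0 here because u always satisfies ||u||_0 <= K by construction
    return (np.sum((y - A @ w) ** 2) / N + w @ R @ w
            + v @ (u - w) + rho / 2.0 * np.sum((w - u) ** 2))

rho = 10.0                  # comfortably larger than the bound required by C1
H = 2.0 / N * AtA + 2.0 * R + rho * np.eye(M)
w = np.zeros(M); u = np.zeros(M); v = np.zeros(M)
Ls = []
for _ in range(30):
    z = w - v / rho
    u = np.where(np.abs(z) >= np.sort(np.abs(z))[-K], z, 0.0)   # keep K largest
    w = np.linalg.solve(H, 2.0 / N * (A.T @ y) + rho * u + v)
    v = v + rho * (u - w)
    # (43) combined with (38) gives v^{k+1} = grad psi(w^{k+1})
    assert np.allclose(v, grad_psi(w), atol=1e-8)
    Ls.append(lagrangian(w, u, v, rho))
```

The recorded values of the augmented Lagrangian are monotonically non-increasing, as (46) predicts for sufficiently large $\rho$.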

###### Proposition 2

If $\rho>l_\psi$, then $L(\mathbf{w}^{k},\mathbf{u}^{k},\boldsymbol{\upsilon}^{k})$ is bounded for all $k$ and converges as $k\to\infty$. Moreover, the sequences generated by the two proposed methods are bounded.

Proof: We still use the second method as an example; the proof for the first method is similar. Firstly, we prove that $L(\mathbf{w}^{k},\mathbf{u}^{k},\boldsymbol{\upsilon}^{k})$ is lower bounded for all $k$:

$$\begin{aligned}L(\mathbf{w}^{k},\mathbf{u}^{k},\boldsymbol{\upsilon}^{k})&=\psi(\mathbf{w}^{k})+g(\mathbf{u}^{k})+{\boldsymbol{\upsilon}^{k}}^{T}(\mathbf{u}^{k}-\mathbf{w}^{k})+\frac{\rho}{2}\|\mathbf{w}^{k}-\mathbf{u}^{k}\|_2^2\\&=\psi(\mathbf{w}^{k})+g(\mathbf{u}^{k})+\nabla\psi(\mathbf{w}^{k})^{T}(\mathbf{u}^{k}-\mathbf{w}^{k})+\frac{\rho}{2}\|\mathbf{w}^{k}-\mathbf{u}^{k}\|_2^2\\&\ge\psi(\mathbf{u}^{k})+\left(\frac{\rho}{2}-\frac{l_\psi}{2}\right)\|\mathbf{u}^{k}-\mathbf{w}^{k}\|_2^2+g(\mathbf{u}^{k}), \end{aligned}\tag{47}$$

where the inequality in (47) is due to Lemma 3.1 in [32] and the Lipschitz continuous gradient of $\psi$: according to Lemma 3.1 in [32],

$$\psi(\mathbf{w}^{k})+\nabla\psi(\mathbf{w}^{k})^{T}(\mathbf{u}^{k}-\mathbf{w}^{k})\ge\psi(\mathbf{u}^{k})-\frac{l_\psi}{2}\|\mathbf{u}^{k}-\mathbf{w}^{k}\|_2^2,$$

which gives the inequality in (47).

Obviously, if $\rho>l_\psi$, then $L(\mathbf{w}^{k},\mathbf{u}^{k},\boldsymbol{\upsilon}^{k})\ge\psi(\mathbf{u}^{k})+g(\mathbf{u}^{k})>-\infty$. Hence $L(\mathbf{w}^{k},\mathbf{u}^{k},\boldsymbol{\upsilon}^{k})$ is lower bounded. According to the proof of Proposition 1, $L$ is sufficiently decreasing, hence $L(\mathbf{w}^{k},\mathbf{u}^{k},\boldsymbol{\upsilon}^{k})$ is upper bounded by $L(\mathbf{w}^{0},\mathbf{u}^{0},\boldsymbol{\upsilon}^{0})$. Thus, we can say that $L(\mathbf{w}^{k},\mathbf{u}^{k},\boldsymbol{\upsilon}^{k})$ is bounded. Due to the sufficient descent property, $L(\mathbf{w}^{k},\mathbf{u}^{k},\boldsymbol{\upsilon}^{k})$ must converge as $k\to\infty$, and as long as $\tau_1>0$, we have $\|\mathbf{w}^{k+1}-\mathbf{w}^{k}\|_2\to0$.

Next, we prove that the sequences are bounded. From inequality (46), we have

$$\|\mathbf{w}^{k+1}-\mathbf{w}^{k}\|_2^2\le\frac{1}{\tau_1}\big(L(\mathbf{w}^{k},\mathbf{u}^{k},\boldsymbol{\upsilon}^{k})-L(\mathbf{w}^{k+1},\mathbf{u}^{k+1},\boldsymbol{\upsilon}^{k+1})\big).$$

Then we can deduce

$$\sum_{k=1}^{l}\|\mathbf{w}^{k+1}-\mathbf{w}^{k}\|_2^2\le\frac{1}{\tau_1}\big(L(\mathbf{w}^{0},\mathbf{u}^{0},\boldsymbol{\upsilon}^{0})-L(\mathbf{w}^{l+1},\mathbf{u}^{l+1},\boldsymbol{\upsilon}^{l+1})\big)<\infty. \tag{48}$$

Even if $l\to\infty$, we still have $\sum_{k=1}^{\infty}\|\mathbf{w}^{k+1}-\mathbf{w}^{k}\|_2^2<\infty$, since $L$ is lower bounded. Thus the sequence $\{\mathbf{w}^{k}\}$ is bounded.

From (44), we know

$$\|\boldsymbol{\upsilon}^{k+1}-\boldsymbol{\upsilon}^{k}\|_2^2\le l_\psi^2\|\mathbf{w}^{k+1}-\mathbf{w}^{k}\|_2^2.$$

Therefore we can also deduce that

$$\sum_{k=1}^{\infty}\|\boldsymbol{\upsilon}^{k+1}-\boldsymbol{\upsilon}^{k}\|_2^2<\infty. \tag{49}$$

In addition, according to (38), we have

$$\begin{aligned}\|\mathbf{u}^{k+1}-\mathbf{u}^{k}\|_2^2&=\left\|\mathbf{w}^{k+1}-\mathbf{w}^{k}+\frac{1}{\rho}(\boldsymbol{\upsilon}^{k+1}-\boldsymbol{\upsilon}^{k})-\frac{1}{\rho}(\boldsymbol{\upsilon}^{k}-\boldsymbol{\upsilon}^{k-1})\right\|_2^2\\&\le2\|\mathbf{w}^{k+1}-\mathbf{w}^{k}\|_2^2+\frac{4}{\rho^2}\|\boldsymbol{\upsilon}^{k+1}-\boldsymbol{\upsilon}^{k}\|_2^2+\frac{4}{\rho^2}\|\boldsymbol{\upsilon}^{k}-\boldsymbol{\upsilon}^{k-1}\|_2^2. \end{aligned}\tag{50}$$

Thus

$$\sum_{k=1}^{\infty}\|\mathbf{u}^{k+1}-\mathbf{u}^{k}\|_2^2<\infty. \tag{51}$$

So the sequence $\{\mathbf{u}^{k}\}$ is bounded.

###### Proposition 3

The proposed two methods satisfy the subgradient bound condition given by C3.

Proof: For the second method,

$$\frac{\partial L}{\partial\mathbf{w}}\bigg|_{(\mathbf{w}^{k+1},\mathbf{u}^{k+1},\boldsymbol{\upsilon}^{k+1})}=\nabla\psi(\mathbf{w}^{k+1})+\rho(\mathbf{w}^{k+1}-\mathbf{u}^{k+1})-\boldsymbol{\upsilon}^{k+1}=\boldsymbol{\upsilon}^{k}-\boldsymbol{\upsilon}^{k+1}, \tag{52}$$
$$\frac{\partial L}{\partial\mathbf{u}}\bigg|_{(\mathbf{w}^{k+1},\mathbf{u}^{k+1},\boldsymbol{\upsilon}^{k+1})}=\partial g(\mathbf{u}^{k+1})-\rho(\mathbf{w}^{k+1}-\mathbf{u}^{k+1})+\boldsymbol{\upsilon}^{k+1}\ni\rho(\mathbf{w}^{k}-\mathbf{w}^{k+1})+\boldsymbol{\upsilon}^{k+1}-\boldsymbol{\upsilon}^{k}, \tag{53}$$
$$\frac{\partial L}{\partial\boldsymbol{\upsilon}}\bigg|_{(\mathbf{w}^{k+1},\mathbf{u}^{k+1},\boldsymbol{\upsilon}^{k+1})}=\mathbf{u}^{k+1}-\mathbf{w}^{k+1}=\frac{1}{\rho}(\boldsymbol{\upsilon}^{k+1}-\boldsymbol{\upsilon}^{k}), \tag{54}$$

where the relation in (53) follows from the optimality condition of (36),

$$\mathbf{0}\in\partial g(\mathbf{u}^{k+1})+\boldsymbol{\upsilon}^{k}-\rho(\mathbf{w}^{k}-\mathbf{u}^{k+1}). \tag{55}$$

Thus

$$\mathbf{d}^{k+1}:=\begin{bmatrix}\boldsymbol{\upsilon}^{k}-\boldsymbol{\upsilon}^{k+1}\\ \rho(\mathbf{w}^{k}-\mathbf{w}^{k+1})+\boldsymbol{\upsilon}^{k+1}-\boldsymbol{\upsilon}^{k}\\ \frac{1}{\rho}(\boldsymbol{\upsilon}^{k+1}-\boldsymbol{\upsilon}^{k})\end{bmatrix}\in\partial L(\mathbf{w}^{k+1},\mathbf{u}^{k+1},\boldsymbol{\upsilon}^{k+1}). \tag{56}$$

Combining with the inequality in (44), we can deduce that

$$\|\mathbf{d}^{k+1}\|_2^2\le\tau_2\|\mathbf{w}^{k+1}-\mathbf{w}^{k}\|_2^2. \tag{57}$$

Apparently, for the first method, we just need to replace the function $g(\mathbf{u})$ by $P_{\lambda,\gamma}(\mathbf{u})$; all other steps are the same as the above process.

Thus, based on C1-C3 in Theorem 2, we know that the sequences generated by the first and second methods both have at least one limit point, and any limit point is a stationary point. In other words, at the least, the two proposed methods have local convergence. Finally, to prove that both of them have global convergence, we need to prove that their Lagrangian functions are KŁ functions.

Before that, in order to facilitate the following explanation, we introduce several fundamental definitions.

For a function $f$, $\mathrm{dom}\,f$ denotes the domain of $f$. A function $f$ is proper if $\mathrm{dom}\,f\neq\emptyset$ and it never attains $-\infty$. It is lower semi-continuous at a point $x_0$ if

$$\liminf_{x\to x_0}f(x)\ge f(x_0).$$

And if the function is lower semi-continuous at every point of its domain, then it is a lower semi-continuous function.

A subset $S$ of $\mathbb{R}^d$ is a real semi-algebraic set if there exists a finite number of real polynomial functions $l_{ij}$ such that

 S=q1⋃j=1q2⋂i=1{z∈Rd:lij(z