Are Direct Links Necessary in Random Vector Functional Link Networks for Regression?
A random vector functional link network (RVFL) is widely used as a universal approximator for classification and regression problems. A big advantage of RVFL is fast training without backpropagation: the weights and biases of the hidden nodes are selected randomly and stay untrained. Recently, alternative architectures with randomized learning have been developed which differ from RVFL in that they have neither direct links nor a bias term in the output layer. In this study, we investigate the effect of direct links and output node bias on the regression performance of RVFL. For generating the random parameters of hidden nodes we use the classical method and two new methods recently proposed in the literature. We test the RVFL performance on several function approximation problems with target functions of different nature: nonlinear, nonlinear with strong fluctuations, nonlinear with a linear component, and linear. Surprisingly, we found that the direct links and output node bias do not play an important role in improving RVFL accuracy for typical nonlinear regression problems.
Keywords: Random vector functional link network, neural networks with random hidden nodes, randomized learning algorithms.
1 Introduction
A random vector functional link network (RVFL) is a type of feedforward neural network (FNN) with a single hidden layer and direct links between the input and output layers. Unlike typical FNNs, in RVFL the weights and biases of the hidden nodes are selected randomly and stay fixed. The only parameters which are learned are the weights and biases of the output layer. Due to this randomization, RVFL avoids the complicated and time-consuming gradient descent methods needed to solve the optimization problem, which is non-convex in typical FNNs. It is commonly known that gradient learning methods have many drawbacks, such as sensitivity to the initial values of parameters, convergence to local minima, vanishing/exploding gradients in deep neural structures, and usually additional hyperparameters to tune. In RVFL the resulting optimization problem becomes convex and the output weights can be determined analytically using a simple, standard linear least-squares method .
RVFL is extensively used for classification and regression problems due to its adaptive nature and universal approximation property. Many simulation studies reported in the literature show that the performance of randomized models is comparable to that of fully adaptable ones. Randomization, which is cheaper than optimization, ensures faster training and simpler implementation.
RVFL is not the only FNN solution with randomization. Alternative approaches such as  and many other new solutions do not have direct links between the input and output layers . The effect of direct links, as well as of a bias in the output layer, on RVFL performance in classification tasks was investigated in . The basic conclusion of that work was that the direct links play an important performance-enhancing role in RVFL, while the bias term in the output neuron has no significant effect. In this work, we investigate the effect of direct links and output node bias on the regression performance of RVFL. For generating the random parameters of hidden nodes we use the classical method and two new methods recently proposed in the literature. We test the RVFL performance on several function approximation problems with target functions of different nature: nonlinear, nonlinear with strong fluctuations, nonlinear with a linear component, and linear.
The remainder of this paper is structured as follows. In Section 2, we briefly present the RVFL learning algorithm and the decomposition of the function built by RVFL. In Section 3, we describe three methods of generating the weights and biases of hidden nodes. Section 4 reports the simulation study and compares results for different RVFL configurations, different methods of random parameter generation, and different regression problems. Finally, Section 5 concludes the work.
2 Random Vector Functional Link Network
RVFL was proposed by Pao and Takefuji . It was proven in  that RVFL is a universal approximator for a continuous function on a bounded finite-dimensional set, with a closed-form solution. RVFL can be regarded as a single hidden layer FNN built with a specific randomized algorithm. The RVFL architecture is shown in Fig. 1. Note that in addition to a hidden layer transforming inputs nonlinearly, RVFL also has direct links connecting the input layer with the output nodes. The weights and biases of hidden nodes, , respectively, are randomly assigned and fixed during the training phase. The output weights, , are evaluated analytically using a linear least-squares method. This results in a flat-net architecture for which only the output weights must be learned. The learning problem, which is non-convex when all parameters are learned, becomes convex in RVFL. So, time-consuming gradient-based learning algorithms are not needed, which makes the learning process much easier to implement and extremely rapid.
The learning algorithm of RVFL is as follows. One output is considered, with hidden nodes and inputs. The training set is and the activation function of the hidden nodes is , a nonlinear piecewise continuous function, e.g. a sigmoid: $h(\mathbf{x}) = \frac{1}{1 + \exp\left(-(\mathbf{a}^T\mathbf{x} + b)\right)}$
Randomly generate hidden node parameters: weights and biases for all nodes, , according to any continuous sampling distribution.
Calculate the hidden layer output matrix :
where is an activation function of the -th node.
The -th column of is the -th hidden node output vector with respect to inputs . Hidden nodes nonlinearly map the inputs from the -dimensional input space to the -dimensional space. The output matrix remains unchanged because the parameters of the hidden nodes, and , are fixed.
Calculate the output weights:
where is a vector of output weights, is a vector of ones corresponding to the output node bias, is an input matrix, is a vector of target outputs, and is the Moore–Penrose generalized inverse of matrix .
The above equation for results from the following criterion for minimizing the approximation error:
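The training procedure above (random hidden parameters, then a closed-form least-squares solve for the output weights) can be sketched in Python. NumPy's `pinv` plays the role of the Moore–Penrose pseudoinverse; the `direct_links` and `output_bias` switches correspond to including or dropping the linear component and the bias term. Function and parameter names are illustrative, not the authors' implementation:

```python
import numpy as np

def train_rvfl(X, y, m=20, u=1.0, direct_links=True, output_bias=True, rng=None):
    """Train an RVFL network with m hidden sigmoid nodes.

    Hidden weights and biases are drawn uniformly from [-u, u] (the
    standard generation approach); the output weights are obtained in
    closed form by linear least squares via the Moore-Penrose
    pseudoinverse. Returns a predict(X) function.
    """
    rng = np.random.default_rng(rng)
    N, n = X.shape
    A = rng.uniform(-u, u, size=(n, m))   # hidden weights (fixed after generation)
    b = rng.uniform(-u, u, size=m)        # hidden biases (fixed after generation)

    def hidden(Xq):
        # Sigmoid activations of all hidden nodes
        return 1.0 / (1.0 + np.exp(-(Xq @ A + b)))

    def design(Xq):
        # Assemble [X | H | 1] depending on the configuration
        parts = []
        if direct_links:
            parts.append(Xq)                       # linear component
        parts.append(hidden(Xq))                   # nonlinear component
        if output_bias:
            parts.append(np.ones((Xq.shape[0], 1)))  # bias term
        return np.hstack(parts)

    beta = np.linalg.pinv(design(X)) @ y           # closed-form output weights

    def predict(Xq):
        return design(Xq) @ beta

    return predict
```

With direct links and output bias enabled, a noise-free linear target is fitted exactly, since the linear component alone can represent it.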
A function expressed by RVFL is a linear combination of inputs and activation functions :
Note that the first component in (5) is linear and represents a hyperplane, the second component expresses a nonlinear function and the last component is a bias. These three components of function are depicted in Fig. 2. The nonlinear component is a linear combination of hidden node activation functions (sigmoids in our case) which are also shown in this figure.
A natural question that arises is: are all three components of function necessary for an approximation of the target function? Would the nonlinear component alone not suffice? In the experimental part of this work, we try to answer these questions.
3 Generating Weights and Biases of Hidden Nodes
The key issue in FNN randomized learning is finding a way of generating the random hidden node parameters to obtain a good projection space . The standard approach is to generate both weights and biases randomly from a fixed interval using any continuous sampling distribution. A symmetric interval ensures the universal approximation property for functions which meet the Lipschitz condition . The appropriate selection of this interval is a problem that has not yet been solved and is considered to be one of the major research challenges in the area of FNN randomized learning , . In many cases the interval is selected as without any justification, regardless of the problem solved, the data distribution, and the activation function type. In practical applications, optimization of this interval is recommended for better model performance , , .
In the experimental part of this work, we use three methods of generating random parameters. One of them is the standard approach, where both weights and biases of hidden nodes are generated uniformly from the interval . The bound of this symmetric interval, , is adjusted to the target function (TF). This method of generating random parameters is denoted as Gs. Note that in the right panel of Fig. 2 the sigmoids are randomly, evenly distributed over the input interval, which is a correct solution. Unfortunately, the Gs method does not ensure such an even distribution (see ).
Another method (denoted as Gu in this work) was proposed in . Because the hidden node parameters have different functions, i.e. the weights express the slopes of the sigmoids and the biases express their shifts, they should be generated separately, not both from the same interval. According to the Gu method, first the weights are selected randomly from and then the biases are determined as follows:
where is one of the training points, selected randomly (see  for other variants).
Determining the biases from (6) ensures that the hidden nodes will be placed in accordance with the input data density . The Gu method ensures that all sigmoids have their steepest fragments, which are the most useful for modeling TF fluctuations, inside the input hypercube, as shown in the right panel of Fig. 2. In this way, Gu addresses a drawback of Gs, which can generate sigmoids having their saturated fragments inside the input hypercube. These fragments are useless for building a nonlinear fitted function. Moreover, in Gs it is difficult to adjust both parameters, weights and biases, when they are selected from the same interval. Gu selects the weights first and then calculates the biases depending on the weights and the data distribution.
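A minimal sketch of the Gu scheme under the description above. The bias formula assumed here, b_j = −a_jᵀx*_j with x*_j a randomly chosen training point, places each sigmoid's inflection point on a data point, which is the stated intent of Eq. (6); function and variable names are illustrative:

```python
import numpy as np

def generate_gu(X, m, u=1.0, rng=None):
    """Gu scheme (sketch): draw hidden weights uniformly from [-u, u],
    then set each bias so that the sigmoid's steepest (inflection) point
    falls on a randomly chosen training point x*, i.e. b_j = -a_j^T x*_j.
    Returns the weight matrix A (m x n) and bias vector b (m,).
    """
    rng = np.random.default_rng(rng)
    N, n = X.shape
    A = rng.uniform(-u, u, size=(m, n))    # weights control sigmoid slopes
    idx = rng.integers(0, N, size=m)       # one random training point per node
    b = -np.einsum('ij,ij->i', A, X[idx])  # biases shift nodes onto the data
    return A, b
```

By construction, each node's argument a_jᵀx + b_j vanishes at its chosen training point, so its activation there equals 0.5, the sigmoid's steepest fragment.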
The third method of generating the random parameters of hidden nodes ensures sigmoids with uniformly distributed slope angles , . This method is denoted as G in this work. In many cases G gives better model performance than Gu, especially for highly nonlinear TFs (see  for a comparison of Gs, Gu and G). In the first step, G generates the slope angles of the sigmoids , where and . The bound angles, and , are tuned to the TF. For highly nonlinear TFs with strong fluctuations, only can be adjusted, keeping . The weights are calculated on the basis of the angles from:
G ensures random slopes between and for the multidimensional sigmoids in each direction. The biases of the hidden nodes are calculated from (6) to place the sigmoids inside the input hypercube depending on the data density.
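The G scheme can be sketched similarly. The angle-to-weight conversion assumed here uses the fact that a logistic sigmoid h(z) = 1/(1+e^{-z}) has maximum slope 1/4, so a slope angle α corresponds to a weight a = 4 tan α; a random sign yields both increasing and decreasing sigmoids. Both this conversion and the reuse of the Gu bias formula are assumptions of the sketch, with illustrative names:

```python
import numpy as np

def generate_g(X, m, alpha_min=np.pi / 12, alpha_max=np.pi / 2 * 0.99, rng=None):
    """G scheme (sketch): draw sigmoid slope angles uniformly from
    [alpha_min, alpha_max] and convert them to weights. Since the
    logistic sigmoid's maximum slope is a/4, a = 4*tan(alpha) gives a
    node whose steepest fragment has slope tan(alpha); a random sign
    makes slopes both positive and negative. Biases are then set as in
    the Gu scheme so the steep fragments land on the training data.
    """
    rng = np.random.default_rng(rng)
    N, n = X.shape
    alpha = rng.uniform(alpha_min, alpha_max, size=(m, n))  # uniform slope angles
    sign = rng.choice([-1.0, 1.0], size=(m, n))
    A = 4.0 * np.tan(alpha) * sign         # angle -> weight conversion (assumed)
    idx = rng.integers(0, N, size=m)
    b = -np.einsum('ij,ij->i', A, X[idx])  # Eq. (6)-style biases
    return A, b
```

Drawing angles, rather than weights, uniformly is what distinguishes G: a uniform distribution of weights would concentrate the slope angles near 90°, producing mostly step-like nodes.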
4 Experiments and Results
In this section, to assess the impact of the direct links and the bias in the output node on RVFL performance, we consider the following RVFL configurations:
– RVFL with direct links and output node bias,
– RVFL with direct links and without output node bias,
– RVFL without direct links and with output node bias,
– RVFL without direct links and without output node bias.
We use sigmoids as activation functions. The hidden node weights and biases are generated using the three methods described in Section 3:
– the standard approach of generating both weights and biases from ,
– generating weights from and biases according to (6),
– generating weights on the basis of uniformly distributed slope angles according to (7) and biases according to (6).
The parameters of these methods, as well as the number of hidden nodes, were selected by grid search for each RVFL variant and TF from the sets: , for 2-dimensional data or for 5- and 10-dimensional data, , and .
We test RVFL performance over several regression problems using TFs defined as:
where and are variables.
Function (8) is a simple nonlinear function shown in the left panel of Fig. 3. The first component of function (9) is a highly nonlinear function shown in the middle panel of Fig. 3. The second component is a hyperplane. The TF can be composed of both components if and , or of only one component if or . The TF with both components is shown in the right panel of Fig. 3. To assess the RVFL regression performance on TFs of different character, four types of TFs were used:
The experiments were carried out for and , and , and and . As an accuracy measure, we used the root mean squared error (RMSE). In each case, RVFL networks were trained 100 times and the final errors were calculated as the averages over the 100 trials.
Tables 1–4 show the RMSE for the different TFs and RVFL variants. To confirm the significance of the error differences between RVFL without direct links and output node bias (configuration –dl–b) and the other RVFL configurations, we used a two-sided Wilcoxon signed-rank test. We performed the tests separately for Gs, Gu and G. The null hypothesis was as follows: , where is or , respectively, comes from a distribution with zero median. A p-value below 5% was assumed to indicate rejection of the null hypothesis. The cases of null hypothesis rejection are underlined in the tables (i.e. the cases +dl+b, +dl–b or –dl+b for which the error was significantly lower than for –dl–b).
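The significance testing described above can be sketched as follows, with SciPy's `wilcoxon` standing in for whatever test implementation was actually used; the function name and return convention are illustrative:

```python
import numpy as np
from scipy.stats import wilcoxon

def compare_to_baseline(err_variant, err_baseline, alpha=0.05):
    """Two-sided Wilcoxon signed-rank test on paired per-trial errors.

    Null hypothesis: the paired differences err_variant - err_baseline
    come from a distribution with zero median. Returns the p-value and
    a flag telling whether the variant's error is significantly LOWER
    than the baseline's (the -dl-b configuration in the paper's setup).
    """
    err_variant = np.asarray(err_variant, dtype=float)
    err_baseline = np.asarray(err_baseline, dtype=float)
    _, p = wilcoxon(err_variant, err_baseline, alternative='two-sided')
    significantly_lower = bool(p < alpha) and \
        float(np.median(err_variant - err_baseline)) < 0.0
    return p, significantly_lower
```

A paired test is appropriate here because each of the 100 trials compares two configurations trained under matched conditions, so per-trial differences carry the signal.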
Table 1. RMSE for NL (mean ± std over 100 trials).

| Method | Config | n = 2 | n = 5 | n = 10 |
|---|---|---|---|---|
| Gs | +dl+b | 7.50E-03 ± 9.39E-04 | 0.0121 ± 5.29E-04 | 0.0236 ± 4.72E-04 |
| Gs | +dl–b | 7.41E-03 ± 9.21E-04 | 0.0121 ± 5.33E-04 | 0.0236 ± 4.70E-04 |
| Gs | –dl+b | 7.30E-03 ± 9.42E-04 | 0.0121 ± 5.28E-04 | 0.0237 ± 4.46E-04 |
| Gs | –dl–b | 7.20E-03 ± 9.11E-04 | 0.0121 ± 5.22E-04 | 0.0237 ± 4.44E-04 |
| Gu | +dl+b | 7.45E-03 ± 9.07E-04 | 0.0128 ± 5.05E-04 | 0.0227 ± 4.10E-04 |
| Gu | +dl–b | 7.38E-03 ± 9.14E-04 | 0.0128 ± 5.13E-04 | 0.0229 ± 4.21E-04 |
| Gu | –dl+b | 7.28E-03 ± 9.75E-04 | 0.0128 ± 5.16E-04 | 0.0227 ± 4.07E-04 |
| Gu | –dl–b | 7.20E-03 ± 9.95E-04 | 0.0128 ± 5.15E-04 | 0.0230 ± 4.23E-04 |
| G | +dl+b | 7.47E-03 ± 9.39E-04 | 0.0130 ± 5.05E-04 | 0.0217 ± 4.11E-04 |
| G | +dl–b | 7.39E-03 ± 9.31E-04 | 0.0130 ± 5.10E-04 | 0.0219 ± 4.20E-04 |
| G | –dl+b | 7.27E-03 ± 9.53E-04 | 0.0129 ± 4.92E-04 | 0.0217 ± 4.06E-04 |
| G | –dl–b | 7.16E-03 ± 9.87E-04 | 0.0130 ± 4.98E-04 | 0.0219 ± 4.20E-04 |
Table 2. RMSE for NLF (mean ± std over 100 trials).

| Method | Config | n = 2 | n = 5 | n = 10 |
|---|---|---|---|---|
| Gs | +dl+b | 0.0414 ± 0.0055 | 0.2268 ± 0.0122 | 0.2203 ± 0.0098 |
| Gs | +dl–b | 0.0414 ± 0.0055 | 0.2268 ± 0.0122 | 0.2203 ± 0.0098 |
| Gs | –dl+b | 0.0415 ± 0.0056 | 0.2268 ± 0.0122 | 0.2203 ± 0.0098 |
| Gs | –dl–b | 0.0415 ± 0.0056 | 0.2268 ± 0.0122 | 0.2203 ± 0.0098 |
| Gu | +dl+b | 0.0378 ± 0.0028 | 0.2268 ± 0.0121 | 0.2203 ± 0.0098 |
| Gu | +dl–b | 0.0378 ± 0.0028 | 0.2268 ± 0.0121 | 0.2203 ± 0.0098 |
| Gu | –dl+b | 0.0379 ± 0.0027 | 0.2268 ± 0.0121 | 0.2203 ± 0.0098 |
| Gu | –dl–b | 0.0379 ± 0.0027 | 0.2268 ± 0.0121 | 0.2203 ± 0.0098 |
| G | +dl+b | 0.0335 ± 0.0021 | 0.1702 ± 0.0111 | 0.2026 ± 0.0099 |
| G | +dl–b | 0.0335 ± 0.0021 | 0.1702 ± 0.0111 | 0.2026 ± 0.0099 |
| G | –dl+b | 0.0336 ± 0.0022 | 0.1704 ± 0.0111 | 0.2030 ± 0.0099 |
| G | –dl–b | 0.0336 ± 0.0022 | 0.1704 ± 0.0111 | 0.2030 ± 0.0099 |
Table 3. RMSE for NLF+L (mean ± std over 100 trials).

| Method | Config | n = 2 | n = 5 | n = 10 |
|---|---|---|---|---|
| Gs | +dl+b | 0.0375 ± 0.0044 | 0.0887 ± 0.0030 | 0.0802 ± 0.0030 |
| Gs | +dl–b | 0.0375 ± 0.0044 | 0.0887 ± 0.0030 | 0.0802 ± 0.0030 |
| Gs | –dl+b | 0.0374 ± 0.0043 | 0.0888 ± 0.0030 | 0.0803 ± 0.0030 |
| Gs | –dl–b | 0.0374 ± 0.0043 | 0.0888 ± 0.0030 | 0.0803 ± 0.0030 |
| Gu | +dl+b | 0.0351 ± 0.0019 | 0.0887 ± 0.0030 | 0.0802 ± 0.0030 |
| Gu | +dl–b | 0.0351 ± 0.0019 | 0.0887 ± 0.0030 | 0.0802 ± 0.0030 |
| Gu | –dl+b | 0.0351 ± 0.0019 | 0.0888 ± 0.0030 | 0.0802 ± 0.0030 |
| Gu | –dl–b | 0.0351 ± 0.0019 | 0.0888 ± 0.0030 | 0.0802 ± 0.0030 |
| G | +dl+b | 0.0307 ± 0.0018 | 0.0706 ± 0.0033 | 0.0750 ± 0.0028 |
| G | +dl–b | 0.0307 ± 0.0018 | 0.0706 ± 0.0033 | 0.0754 ± 0.0028 |
| G | –dl+b | 0.0308 ± 0.0018 | 0.0707 ± 0.0033 | 0.0762 ± 0.0028 |
| G | –dl–b | 0.0308 ± 0.0018 | 0.0707 ± 0.0033 | 0.0763 ± 0.0028 |
Table 4. RMSE for L (mean ± std over 100 trials).

| Method | Config | n = 2 | n = 5 | n = 10 |
|---|---|---|---|---|
| Gs | +dl+b | 2.61E-03 ± 1.10E-03 | 1.89E-03 ± 5.40E-04 | 1.52E-03 ± 3.80E-04 |
| Gs | +dl–b | 2.62E-03 ± 1.09E-03 | 1.89E-03 ± 5.40E-04 | 1.70E-03 ± 4.10E-04 |
| Gs | –dl+b | 3.97E-03 ± 9.87E-04 | 1.98E-03 ± 5.20E-04 | 2.20E-03 ± 4.30E-04 |
| Gs | –dl–b | 4.08E-03 ± 9.63E-04 | 1.99E-03 ± 5.20E-04 | 2.33E-03 ± 3.46E-04 |
| Gu | +dl+b | 2.61E-03 ± 1.10E-03 | 1.89E-03 ± 5.40E-04 | 1.52E-03 ± 3.80E-04 |
| Gu | +dl–b | 2.72E-03 ± 1.05E-03 | 1.89E-03 ± 5.40E-04 | 1.70E-03 ± 4.10E-04 |
| Gu | –dl+b | 3.38E-03 ± 9.94E-04 | 1.93E-03 ± 5.26E-04 | 2.02E-03 ± 4.45E-04 |
| Gu | –dl–b | 3.78E-03 ± 1.04E-03 | 1.93E-03 ± 5.32E-04 | 2.26E-03 ± 3.77E-04 |
| G | +dl+b | 2.61E-03 ± 1.10E-03 | 1.89E-03 ± 5.39E-04 | 1.72E-03 ± 4.14E-04 |
| G | +dl–b | 2.73E-03 ± 1.07E-03 | 2.53E-03 ± 5.27E-04 | 3.94E-03 ± 4.25E-04 |
| G | –dl+b | 3.51E-03 ± 1.05E-03 | 4.87E-03 ± 6.33E-04 | 6.69E-03 ± 3.76E-04 |
| G | –dl–b | 3.78E-03 ± 9.96E-04 | 4.96E-03 ± 6.20E-04 | 6.73E-03 ± 3.94E-04 |
From Tables 1–3 it can be seen that for nonlinear functions all RVFL configurations (+dl+b, +dl–b, –dl+b and –dl–b) produce very similar results, even in the case of NLF+L, where the TF contains a significant linear component. Only in four cases out of 81 were the errors slightly lower than for the corresponding –dl–b configuration. These cases are: Gu +dl+b for NL, Gu –dl+b for NL, G +dl+b for NLF+L, and G +dl–b for NLF+L. Note that for 2-dimensional NL, the –dl–b configuration gave lower errors than the other configurations for each method of generating random parameters.
The optimal numbers of hidden nodes (averaged over 100 trials in each case) are shown in Table 5. Note that for NL there is no difference in the optimal number of nodes between RVFL configurations. Differences appear for multidimensional TFs with fluctuations, NLF and NLF+L, when the random parameters are generated using Gs or Gu. In these cases, the configurations with direct links (+dl) need fewer hidden nodes than those without direct links (–dl). This may be because the hyperplane introduced by the direct links is useful for modeling the linear parts of the TFs (see the linear TF regions near the corner in the middle and right panels of Fig. 3). We can see from Table 5 that for multidimensional TFs with fluctuations G needs more nodes than Gs and Gu. However, even with a small number of nodes, G still outperformed Gs and Gu in accuracy. Adding nodes led to a decrease in error for G, while for Gs and Gu an increase in error was observed at the same time . This can be related to overfitting caused by the nodes generated by Gs and Gu being steeper than those generated by G, where the node slope angles are distributed uniformly. This phenomenon needs to be explored in detail on other TFs.
Table 4 shows the results for the linear TF. This TF can be modeled with only the direct links and the bias, so the hidden layer is unnecessary. Note that the optimal number of hidden nodes for the +dl+b configurations is around one (see Table 5), which is the minimum value of in our tests. The results for configurations without direct links (–dl) for L are usually much worse than those with direct links. Only for the Gs and Gu variants at were the errors at a similar level for all network configurations. In the –dl configurations the linear TF is modeled with sigmoids, and overfitting is a real threat when the training data is noisy, as in our case. Using only direct links and a bias prevents overfitting for linear TFs. It should be noted, however, that for linear TFs we do not need to use NNs at all; simple linear regression is a better choice. Moreover, linear TFs are rare in practice.
Note that for highly nonlinear TFs, such as NLF and NLF+L, G ensures much more accurate fitting than the other methods of generating random parameters (see Tables 2 and 3). For low-dimensional TFs with fluctuations, Gu was more accurate than Gs. This is because, for low , Gs generates many sigmoids that are saturated in the input hypercube and are thus useless for modeling fluctuations. This phenomenon diminishes with (see ).
5 Conclusion
In this work, we investigated whether direct links and an output node bias are necessary in RVFL for regression problems. RVFL can be decomposed into a linear component represented by the direct links, a nonlinear component represented by the hidden nodes, and a bias term. The experimental study showed that nonlinear target functions can be modeled with only the nonlinear component: the fitting errors with and without direct links and bias were in these cases at a similar level. The linear component and the bias term, if needed, can be replaced by hidden nodes. The direct links seem to be useful for modeling target functions with linear regions; in our simulations, modeling such functions, NLF and NLF+L, required fewer hidden nodes when direct links were also used. This issue requires further research with target functions of a different nature.
In our study, we used three methods of generating the random parameters of hidden nodes. The most sophisticated method, G, recently proposed in the literature, was the most accurate, especially for highly nonlinear target functions.
Acknowledgments. Supported by Grant 2017/27/B/ST6/01804 from the National Science Centre, Poland.
- Principe, J., Chen, B.: Universal approximation with convex optimization: Gimmick or reality? IEEE Comput Intell Mag 10, 68–77 (2015)
- Schmidt, W.F., Kraaijveld, M.A., Duin, R.P.W.: Feedforward neural networks with random weights. In: Proc. 11th IAPR International Conference Pattern Recognition Methodology and Systems, vol. II, pp. 1–4 (1992)
- Wang, D., Li, M.: Stochastic configuration networks: Fundamentals and algorithms. IEEE Trans on Cybernetics 47(10), 3466–3479 (2017).
- Zhang, L., Suganthan, P.N.: A comprehensive evaluation of random vector functional link networks. Information Sciences 367–368, 1094–1105 (2016).
- Pao, Y.H., Takefuji, Y.: Functional-link net computing: theory, system architecture, and functionalities. IEEE Comput 25(5), 76–79 (1992)
- Igelnik, B., Pao, Y.H.: Stochastic choice of basis functions in adaptive function approximation and the functional-link net. IEEE Trans. Neural Netw. 6(6), 1320–1329 (1995).
- Dudek, G.: Generating random weights and biases in feedforward neural networks with random hidden nodes. Information Sciences 481, 33–56 (2019)
- Husmeier, D.: Random vector functional link (RVFL) networks. In: Neural Networks for Conditional Probability Estimation: Forecasting Beyond Point Predictions, chapter 6. Springer (1999).
- Zhang, L., Suganthan, P.N.: A survey of randomized algorithms for training neural networks. Information Sciences 364–365, 146–155 (2016).
- Cao, W., Wang, X., Ming, Z., Gao, J.: A review on neural networks with random weights. Neurocomputing 275, 278–287 (2018).
- Pao, Y.H., Park, G., Sobajic, D.: Learning and generalization characteristics of the random vector functional-link net. Neurocomputing 6(2), 163–180 (1994).
- Dudek, G.: Generating random parameters in feedforward neural networks with random hidden nodes: Drawbacks of the standard method and how to improve it. ArXiv:1908.05864 (2019).
- Tyukin, I., Prokhorov, D.: Feasibility of random basis function approximators for modeling and control. In: Proc. IEEE International Symposium on Intelligent Control, pp. 1391–1396 (2009).
- Dudek, G.: Improving randomized learning of feedforward neural networks by appropriate generation of random parameters. In: Advances in Computational Intelligence. 15th International Work-Conference on Artificial Neural Networks IWANN 2019. LNCS 11506, Springer, pp. 517–530 (2019).