Per-sample Prediction Intervals for Extreme Learning Machines


Abstract

Prediction intervals in supervised Machine Learning bound the region where the true outputs of new samples may fall. They are necessary for separating reliable predictions of a trained model from near-random guesses, for minimizing the rate of False Positives, and for other problem-specific tasks in applied Machine Learning. Many real problems have heteroscedastic stochastic outputs, which calls for input-dependent prediction intervals.

This paper proposes to estimate the input-dependent prediction intervals by a separate Extreme Learning Machine model, using the variance of its predictions as a correction term that accounts for the model uncertainty. The variance is estimated from the model’s linear output layer with a weighted Jackknife method. The methodology is very fast, robust to heteroscedastic outputs, and handles both extremely large datasets and an insufficient amount of training data.

1 Introduction

Practical applications of machine learning can be problematic in the sense that developers and practitioners often do not fully trust their own predictions. A fundamental reason for this mistrust is that the Mean Squared Error (MSE) and other error measures averaged over a dataset are commonly used to evaluate the performance of a method or to compare different methods. Averaged error measures are unfit for business processes where each particular sample is important, as it represents a customer or another existing entity [2]. On the other hand, applied Machine Learning models might skip some data samples, because they are only a part of a bigger process structure, and uncertain data might be given to human experts to be handled [22].

The trust problem can be solved by computing a sample-specific confidence value [32]. Then predictions with high confidence (and enough trust in them) are used, while data samples with uncertain predictions are passed to the next analytical stage. The Machine Learning model works as a filter, solving “easy cases” automatically with confident predictions, and reducing the amount of data remaining to be analyzed [3].

Let $\{(\mathbf{x}_i, y_i)\}_{i=1}^{N}$ be a dataset where the outputs $y_i$ are independently drawn from a normal distribution conditioned on the inputs $\mathbf{x}_i$:

$$y_i \sim \mathcal{N}\left(f(\mathbf{x}_i),\; \sigma^2(\mathbf{x}_i)\right) \qquad (1)$$

This dataset has heteroscedastic noise because the variance $\sigma^2(\mathbf{x})$ is not constant. A common homoscedasticity assumption simplifies formula (1) to $y_i \sim \mathcal{N}(f(\mathbf{x}_i), \sigma^2)$, but removes the ability to separate confident predictions from uncertain ones.

The heteroscedasticity of the outputs is a reasonable assumption because applied Machine Learning problems often have stochastic outputs. Such outputs do not have a single correct value for a given input. The variance of the random noise in the outputs may be assumed equal because the noise is independent of the inputs, but the same assumption cannot be made about the variance of the stochastic outputs, because they certainly depend on the inputs.

This work focuses on prediction intervals specifically for Extreme Learning Machines (ELM) [21, 25]. ELM is a fast non-linear model with universal approximation ability [18]. It has a feed-forward neural network structure, but with randomly fixed hidden layer weights, so only the linear output layer needs to be trained. With a large hidden layer and L2-regularization [41], ELM exhibits stable predictions [29] that are not affected by a particular initialization of the random hidden layer weights. It is an excellent Machine Learning tool for solving applied problems [4, 40], with a simple formulation, few hyper-parameters, performance at the state-of-the-art level [17, 38, 47], and scalability to Big Data [1, 39].

The idea of the method is to use an ELM to predict an output $\hat{y}(\mathbf{x})$, and a second ELM to estimate its conditional variance $\sigma^2(\mathbf{x})$. Furthermore, a variance analysis is done on the predictions of the second ELM. It provides upper and lower boundaries for the predicted variance. These boundaries describe the model uncertainty for samples with little similar training data available, and make the methodology uniformly applicable to different problems.

The rest of the paper is organized as follows. The following section describes the state of the art in prediction interval estimation, and how the proposed solution differs from it. Section 2 describes Extreme Learning Machines and the proposed methodology. Section 3 analyses the method's performance on small artificial and real-world datasets. Section 4 presents the results on a huge real-world dataset, and describes the runtime requirements compared to the original ELM. Section 5 summarizes the findings.

1.1 State-of-the-Art

Prediction with uncertainty is a well-known task. Probabilistic methods can naturally formulate a solution. Prediction intervals are available in the Bayesian formulation of ELM [12, 8], including per-sample PI [36], though the applicability is limited due to the quadratic computational cost in the number of data samples.

A fuzzy nonlinear regression approach [15] exists for problems having fuzzy inputs or outputs. It applies random weights neural networks with non-iterative training, similar to ELM, but formulates the solution in terms of fuzzy set theory [5]. Such a native fuzzy approach allows for a detailed investigation of the effects of uncertainty on the learning of a method [43, 44], and has important practical applications [6] for fuzzy data problems.

Without runtime limitations, good results are achieved with model-independent methods [33] based on clustering of the input data and re-sampling. Clustering of the inputs and repetitive model re-training during re-sampling both scale poorly with the data size, and would limit the performance of ELM, which is otherwise capable of processing billions of data samples [1].

A specific case [42] of the model-independent approach, limited to linear models (with an arbitrary solution algorithm and hyper-parameters), provides good results for heteroscedastic datasets ([42], supplementary materials), and suits the ELM output layer solution as well. The method applies to any amount of training data, and benefits from huge datasets by producing more independent models in its ensemble part. Unfortunately, it does not output prediction intervals directly.

The scope of this paper is constrained to fast ways of computing prediction intervals of outputs, tailored specifically for the Extreme Learning Machine. The proposed solution works especially well in conjunction with ELM, re-using some heavy computational parts as shown in the next section. A fast runtime is one of the key features of ELM, making it valuable for practical applications and Big Data processing. Another key feature of ELM is the approximation of complex unknown functions, and the proposed method approximates the prediction intervals of model outputs in a similar fashion, without probabilistic or fuzzy set notations.

2 Methodology

This section starts by introducing the Extreme Learning Machine. It continues with the prediction intervals idea, and its implementation suitable for ELM. The section concludes with a formal description of an algorithm.

2.1 Extreme Learning Machine

The Extreme Learning Machine [20] model is formulated as a feed-forward neural network with a single hidden layer. It has $d$ input and $L$ hidden neurons. The solution is given for one output neuron; in the case of many output neurons, each one has an independent solution. The hidden layer weights $\mathbf{W}$ are initialized with random noise and fixed. Often an extra input neuron with the constant value $1$ is added to function as a bias.

Hidden layer neurons apply a non-linear transformation function $\phi(\cdot)$ to their output. Typical functions are the sigmoid or the hyperbolic tangent, but this function may be omitted to add linear neurons. For $N$ input data samples gathered in a matrix $\mathbf{X} \in \mathbb{R}^{N \times d}$, the hidden layer output matrix $\mathbf{H} \in \mathbb{R}^{N \times L}$ is:

$$h_{ij} = \phi\left(\sum_{k=1}^{d} x_{ik} w_{kj}\right) \qquad (2)$$

where the function $\phi(\cdot)$ is applied element-wise. In matrix notation, the formula simplifies to $\mathbf{H} = \phi(\mathbf{X}\mathbf{W})$.

The output layer of ELM is a linear regression problem $\mathbf{H}\boldsymbol{\beta} = \mathbf{y}$, which is over-determined in real cases with more data samples than hidden neurons ($N > L$). The output weights $\boldsymbol{\beta}$ are given by an ordinary least squares solution computed with the Moore-Penrose pseudoinverse [35] of the matrix $\mathbf{H}$.

Random initialization may decrease the performance of a naive ELM. This problem is completely solved by including L2 regularization in the output layer solution, which becomes:

$$\boldsymbol{\beta} = \left(\mathbf{H}^{T}\mathbf{H} + \alpha\mathbf{I}\right)^{-1}\mathbf{H}^{T}\mathbf{y} \qquad (3)$$

where $\alpha$ is the L2-regularization parameter optimized by validation. With L2 regularization and a large number of hidden neurons, ELM performance becomes stable and unaffected by a particular random initialization of the hidden layer weights $\mathbf{W}$ [19].
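
For illustration, a minimal sketch of this training procedure in Python is given below (this is not the HP-ELM toolbox implementation; the tanh transformation, the layer size, and the regularization value are arbitrary assumptions):

```python
import numpy as np

def elm_hidden(X, W):
    """Hidden layer output: tanh of a random projection, with a constant bias column."""
    return np.tanh(np.column_stack([X, np.ones(len(X))]) @ W)

def train_elm(X, y, n_hidden=100, alpha=1e-3, rng=None):
    """Train a basic L2-regularized ELM; returns the fixed hidden weights W and output weights beta."""
    rng = np.random.default_rng(rng)
    W = rng.standard_normal((X.shape[1] + 1, n_hidden))   # random, fixed hidden layer weights
    H = elm_hidden(X, W)                                  # hidden layer outputs, equation (2)
    beta = np.linalg.solve(H.T @ H + alpha * np.eye(n_hidden), H.T @ y)   # equation (3)
    return W, beta

def elm_predict(X, W, beta):
    """Predictions of the linear output layer."""
    return elm_hidden(X, W) @ beta
```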

2.2 Prediction Intervals

Assume a stochastic output $y$ with an independent normal distribution conditioned on the inputs $\mathbf{x}$, as in equation (1). A model prediction $\hat{y}(\mathbf{x})$ estimates only the mean value of the output, and ignores its stochastic nature.

Prediction intervals (PI) offer a simple way of describing the uncertainty of the output by estimating boundaries on its value, such that the true output lies between those boundaries with a given probability $\gamma$. For normally distributed outputs (1), the prediction intervals at the confidence level $\gamma$ can be modelled by

$$\hat{y}(\mathbf{x}) \pm z_{\frac{1+\gamma}{2}}\, \sigma(\mathbf{x}) \qquad (4)$$

where $z_p$ is the inverse cumulative distribution function of the standard normal distribution, i.e. $z_p = \Phi^{-1}(p)$.
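
For reference, the quantile and the interval of equation (4) map directly onto standard library calls; a minimal sketch (variable names are illustrative, and the per-sample variance is assumed to be already estimated):

```python
import numpy as np
from scipy.stats import norm

def prediction_interval(y_hat, sigma2, gamma=0.95):
    """Symmetric normal prediction interval of equation (4) around the predicted means y_hat."""
    z = norm.ppf((1.0 + gamma) / 2.0)   # inverse standard normal CDF at (1 + gamma) / 2
    half_width = z * np.sqrt(sigma2)    # sigma2 is the per-sample variance of the outputs
    return y_hat - half_width, y_hat + half_width
```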

The maximum likelihood estimator for the variance of a homoscedastic output is given by the Mean Squared Error [7]. However, it provides uniform prediction intervals that fit practical applications of Machine Learning poorly.

The estimation of variance in linear regression is a well-researched topic, with a plethora of theoretical [37] and experimental [33] results available. The variance of heteroscedastic model predictions can be computed with the Bienaymé formula [26, 23] from the covariance of the model weights $\boldsymbol{\beta}$. However, the variance of the predicted outputs $\hat{y}(\mathbf{x})$ corresponds to confidence intervals and does not describe the range of possible true outputs $y$.

The relation between the heteroscedastic prediction intervals and other methods is illustrated in Figure 1.

(a) Training data points, the true projection function and the true interval boundaries.
(b) Prediction intervals from the MSE that estimate uniform boundaries.
(c) Confidence intervals of the model predictions $\hat{y}(\mathbf{x})$.
(d) Heteroscedastic prediction intervals that estimate the true boundaries, obtained with the proposed method.
Figure 1: Different types of confidence analysis on a toy heteroscedastic dataset (a). Uniform PI (b) estimate the per-sample variance of the outputs incorrectly, while confidence intervals (c) estimate the variance of the model predictions, which is different from the variance of the outputs. Only the heteroscedastic prediction intervals (d) provide a precise description of the distribution of the dataset outputs. ELM model predictions are used in (b-d).

2.3 Prediction Intervals for Extreme Learning Machines

The idea of this paper is to estimate the variance of heteroscedastic outputs using a second ELM model. The model predictions $\hat{y}(\mathbf{x}_i)$ are computed by the first ELM, then the squared residuals $(y_i - \hat{y}(\mathbf{x}_i))^2$ are used as training outputs for the second ELM, which learns to predict the conditional variance of the outputs.

However, ELM predictions can be inaccurate, and their quality must be taken into account. For that reason, the variances of the predictions of the first ELM and the second ELM are added to the predicted squared residuals to bound the true variance of the outputs $\sigma^2(\mathbf{x})$:

$$\hat{\sigma}^2(\mathbf{x}) = \hat{s}^2(\mathbf{x}) + \operatorname{Var}\left[\hat{y}(\mathbf{x})\right] + \operatorname{Var}\left[\hat{s}^2(\mathbf{x})\right] \qquad (5)$$

where $\hat{s}^2(\mathbf{x})$ is the squared residual predicted by the second ELM.

In addition to directly estimating the input-dependent variance $\sigma^2(\mathbf{x})$, this expression has the desired property of giving a larger variance for models with an insufficient amount of training data. With an excessive amount of training data ($N \to \infty$), the variances of the predicted residuals and the predicted outputs decrease to zero, and the variance of the true outputs is given by its ELM estimation: $\hat{\sigma}^2(\mathbf{x}) \to \hat{s}^2(\mathbf{x})$. A similar approach to the prediction intervals exists in feed-forward neural networks [31]; however, it is valid only for the case of abundant training data.

The output layer of ELM is a linear regression. The Bienaymé formula [26, 23] provides the variance of the predicted outputs in linear regression, and thus in ELM:

$$\operatorname{Var}\left[\hat{y}(\mathbf{x})\right] = \mathbf{h}(\mathbf{x}) \operatorname{Cov}(\boldsymbol{\beta})\, \mathbf{h}(\mathbf{x})^{T} \qquad (6)$$

where $\mathbf{h}(\mathbf{x})$ is the hidden layer output (a row vector) of an ELM for an input sample $\mathbf{x}$.
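
In code, equation (6) is a per-sample quadratic form over the hidden layer outputs; a minimal sketch, assuming the weight covariance matrix is already available (for example from the Jackknife estimator of the next subsection):

```python
import numpy as np

def prediction_variance(H, cov_beta):
    """Per-sample Var[y_hat(x)] = h(x) Cov(beta) h(x)^T from equation (6).

    H holds one hidden layer row h(x) per sample; einsum evaluates the quadratic
    form row by row without materializing the full matrix H Cov(beta) H^T.
    """
    return np.einsum('ij,jk,ik->i', H, cov_beta, H)
```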

There is a plethora of methods for estimating the covariance $\operatorname{Cov}(\boldsymbol{\beta})$ of normally distributed linear system weights. The method of choice is the weighted Jackknife estimator [45]. It is unbiased, robust against heteroscedastic noise [37, 16, 13, 10], as fast as an ELM, and scales well with the data size. Another good method for variance estimation is the Wild Bootstrap [10] with nice theoretical properties, but it is slower, as the bootstrap part requires several repetitions to converge.

2.4 Weighted Jackknife for Big Data

A summary of the Weighted Jackknife method is presented below. Its inputs are the ELM hidden layer outputs $\mathbf{H}$ and the residuals $\mathbf{r} = \mathbf{y} - \mathbf{H}\boldsymbol{\beta}$.

$$\mathbf{C} = \left(\mathbf{H}^{T}\mathbf{H} + \alpha\mathbf{I}\right)^{-1} \qquad (7)$$
$$w_i = \mathbf{h}_i \mathbf{C}\, \mathbf{h}_i^{T}, \quad i = 1, \dots, N \qquad (8)$$
$$\tilde{\mathbf{h}}_i = \frac{r_i}{\sqrt{1 - w_i}}\, \mathbf{h}_i \qquad (9)$$
$$\mathbf{D} = \tilde{\mathbf{H}}^{T}\tilde{\mathbf{H}} = \sum_{i=1}^{N} \tilde{\mathbf{h}}_i^{T}\tilde{\mathbf{h}}_i \qquad (10)$$
$$\operatorname{Cov}(\boldsymbol{\beta}) \approx \mathbf{C}\,\mathbf{D}\,\mathbf{C} \qquad (11)$$

The method uses three auxiliary matrices: $\mathbf{C}$, the weighted data matrix $\tilde{\mathbf{H}}$ with rows $\tilde{\mathbf{h}}_i$, and $\mathbf{D}$. Equation (9) creates the weighted data matrix by scaling every row $\mathbf{h}_i$ of the original data $\mathbf{H}$; its denominator includes a dot product between the two vectors $\mathbf{h}_i$ and $\mathbf{C}\mathbf{h}_i^{T}$.

Weighted Jackknife works well together with ELM and Big Data. First, the auxiliary matrix $\mathbf{C}$ in (7) is the inverse of the matrix $\mathbf{H}^{T}\mathbf{H} + \alpha\mathbf{I}$ already computed in the ELM solution (3).

Second, Big Data applications with a huge number of samples are often limited by memory size, especially if the matrix computations are run on GPUs with a very limited memory pool. Weighted Jackknife avoids such a limitation by batch computations. Let the data matrix $\mathbf{H}$ be split into $K$ equal parts with $N/K$ samples each:

$$\mathbf{H} = \left[\mathbf{H}_1^{T}\;\; \mathbf{H}_2^{T}\;\; \cdots\;\; \mathbf{H}_K^{T}\right]^{T}$$

Then the weighted data matrix $\tilde{\mathbf{H}}$ can be computed in the corresponding parts $\tilde{\mathbf{H}}_k$, and the auxiliary matrix $\mathbf{D}$ becomes a summation over all the parts: $\mathbf{D} = \sum_{k=1}^{K} \tilde{\mathbf{H}}_k^{T}\tilde{\mathbf{H}}_k$. The sizes of the matrices $\mathbf{C}$ and $\mathbf{D}$ do not depend on the number of samples $N$, and the weighting (9) may be done in-place without consuming additional memory.

Having only one data part in memory at a time reduces the total memory requirements by a factor of $K$. A large enough $K$ allows a single workstation to process billions of samples with Weighted Jackknife, the same way as presented for ELM in [1]. The practical value of $K$ is limited by the minimum size of a single batch, because small data batches cannot fully utilize the CPU/GPU computational potential [1].
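
A batched implementation might look like the sketch below. It uses a common sandwich form of a jackknife-type heteroscedasticity-robust estimator, consistent with the summary above, although the exact weighting and the batch size are assumptions of this sketch:

```python
import numpy as np

def jackknife_cov(H, r, alpha=1e-3, batch_size=100_000):
    """Heteroscedasticity-robust covariance of the ELM output weights, accumulated in batches.

    Sketch of a jackknife-type sandwich estimator: C = (H^T H + alpha I)^-1 and
    D summed from leverage-weighted residual rows, giving Cov(beta) ~ C D C.
    """
    L = H.shape[1]
    C = np.linalg.inv(H.T @ H + alpha * np.eye(L))   # re-uses the matrix of the ELM solution (3)
    D = np.zeros((L, L))
    for start in range(0, len(H), batch_size):
        Hb = H[start:start + batch_size]
        rb = r[start:start + batch_size]
        w = np.einsum('ij,jk,ik->i', Hb, C, Hb)      # per-sample leverage h_i C h_i^T
        Hw = Hb * (rb / np.sqrt(1.0 - w))[:, None]   # leverage-weighted residual rows
        D += Hw.T @ Hw                               # summation over batches, as in the text
    return C @ D @ C
```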

2.5 ELM Prediction Intervals Algorithm

Prediction intervals are computed in two stages. The first stage uses the training data to learn the two necessary ELM models and to estimate the covariances of the output weights in these models:

  1. Train an ELM model on the training data $(\mathbf{X}, \mathbf{y})$

  2. Predict outputs $\hat{\mathbf{y}}$ for the training data

  3. Use weighted Jackknife to estimate the covariance of the output weights

  4. Compute residuals $\mathbf{r} = \mathbf{y} - \hat{\mathbf{y}}$ for the training data

  5. Train another ELM model to predict the squared residuals

  6. Use weighted Jackknife to estimate the covariance of its output weights

The training data and auxiliary vectors can be discarded at this point.
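
Continuing the illustrative sketches of the previous subsections (the helpers train_elm, elm_hidden and jackknife_cov come from those sketches, not from the HP-ELM toolbox, and the toy data is only a placeholder), the first stage amounts to:

```python
import numpy as np

# Placeholder training data with heteroscedastic noise.
rng = np.random.default_rng(0)
X_train = rng.uniform(-3.0, 3.0, size=(1000, 1))
y_train = np.sin(X_train[:, 0]) + rng.normal(0.0, 0.1 + 0.2 * np.abs(X_train[:, 0]))

# Steps 1-3: first ELM, its residuals, and the covariance of its output weights.
W1, beta1 = train_elm(X_train, y_train, n_hidden=20)
H1 = elm_hidden(X_train, W1)
r = y_train - H1 @ beta1
cov1 = jackknife_cov(H1, r)

# Steps 4-6: second ELM predicting the squared residuals, and its weight covariance.
W2, beta2 = train_elm(X_train, r ** 2, n_hidden=20)
H2 = elm_hidden(X_train, W2)
cov2 = jackknife_cov(H2, r ** 2 - H2 @ beta2)
```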

The second stage uses the previously trained models to predict test outputs, their squared residuals, and all variances. Then the prediction intervals are estimated with equation (4).

  1. Compute the hidden layer outputs for the test inputs using the two ELM models

  2. Predict the test outputs $\hat{y}(\mathbf{x})$

  3. Compute the variance of the predicted outputs with equation (6)

  4. Predict the squared residuals $\hat{s}^2(\mathbf{x})$

  5. Compute the variance of the predicted squared residuals

  6. Compute the prediction intervals for a desired confidence level $\gamma$:

    $$\hat{y}(\mathbf{x}) \pm z_{\frac{1+\gamma}{2}} \sqrt{\hat{s}^2(\mathbf{x}) + \operatorname{Var}\left[\hat{y}(\mathbf{x})\right] + \operatorname{Var}\left[\hat{s}^2(\mathbf{x})\right]} \qquad (12)$$

The two models can have different optimal numbers of neurons, which should be validated. Using L2 regularization prevents numerical instabilities. Note that the predicted squared residuals might have negative values, which are replaced by zero.
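
The second stage, continuing the same illustrative sketch, then evaluates equation (12); note that the way the three variance terms are combined below follows the verbal description of equation (5) and is an assumption about the exact formula:

```python
import numpy as np
from scipy.stats import norm

def elm_pi(X_test, W1, beta1, cov1, W2, beta2, cov2, gamma=0.95):
    """Per-sample prediction intervals for test inputs, following steps 1-6 above."""
    H1, H2 = elm_hidden(X_test, W1), elm_hidden(X_test, W2)
    y_hat = H1 @ beta1                                      # step 2: predicted test outputs
    var_y = np.einsum('ij,jk,ik->i', H1, cov1, H1)          # step 3: variance of predicted outputs
    s2 = np.maximum(H2 @ beta2, 0.0)                        # step 4: squared residuals, clipped at zero
    var_s2 = np.einsum('ij,jk,ik->i', H2, cov2, H2)         # step 5: variance of predicted squared residuals
    sigma2 = s2 + var_y + var_s2                            # combination described for equation (5)
    half = norm.ppf((1.0 + gamma) / 2.0) * np.sqrt(sigma2)  # step 6: equation (12)
    return y_hat - half, y_hat + half
```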

3 Experimental Results

3.1 Artificial Dataset

An artificial dataset with heteroscedastic noise is shown in Figure 2. Additional tests are done on homoscedastic versions of the same dataset, with the same projection function but input-independent normally distributed noise. All experiments used ELMs with one linear and 10 hyperbolic tangent hidden neurons, in both the output model and the residual model.
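
The exact projection function of the artificial dataset is defined by the figure rather than in the text; a minimal sketch of generating data of this kind (the sine projection and the linearly growing noise level are assumptions, not the paper's exact setup) is:

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.uniform(-3.0, 3.0, size=500)
noise_std = 0.1 + 0.15 * np.abs(x)          # input-dependent (heteroscedastic) noise level
y = np.sin(x) + rng.normal(0.0, noise_std)  # stochastic outputs around the projection function
# The true 95% interval around sin(x) is sin(x) +/- 1.96 * noise_std.
```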

Figure 2: Artificial dataset with true 95% intervals for noise.

Figure 3 shows the computed PI on the heteroscedastic artificial dataset at the 95% confidence level. The figure also presents the standard deviation of the predicted residuals at 95% confidence, to show how it is affected by the amount of training data. As the amount of training data increases, the PI are given more precisely by the predicted squared residuals and depend less on their variance (Figure 3, right).

Figure 3: Estimated PI for heteroscedastic stochastic outputs. The variance of the predicted residuals (shaded area) captures the model uncertainty when less training data is available. Thin dashed lines are the actual PI, the solid line is the projection function, the thick dashed line is the estimated output, and black dots are training data samples.

Similar results are obtained for the datasets with homoscedastic noise, presented in Figure 4. A larger variance of the outputs makes the prediction task harder, leading to larger errors in the estimated PI (Figure 4, upper left). At the same time, the variance of the predicted residuals increases (Figure 4, shaded area), and the true PI rarely go beyond their estimated boundaries. A smaller noise variance leads to more precise PI, which still cover the true PI most of the time.

Figure 4: Estimated PI and their variance (shaded area) for homoscedastic stochastic outputs with different noise variances; more data leads to more precise PI. Thin dashed lines are the actual PI, the solid line is the projection function, the thick dashed line is the estimated projection function, and black dots are training data samples.

In the extreme case of a training set with only 30 samples (which is not enough for learning the correct shape of the true projection function), the predicted squared residuals become unreliable. However, including their variance in the predictions compensates for the model uncertainty (see Figure 5). This sometimes leads to over-estimation of the true PI, but it is a desired property that prevents an uncertain model from producing falsely highly confident predictions.

Figure 5: Estimated PI and their variance (shaded area) with an insufficient amount of training data; the PI are over-estimated in poorly predicted areas. Thin dashed lines are the actual PI, the solid line is the projection function, the thick dashed line is the estimated projection function, and black dots are training data samples.

3.2 Comparison on Real World Datasets

ELM Prediction Intervals are compared on four real datasets with four other methods presented in [24]. Details of the datasets are given in Table 1. The comparison uses two common metrics: the Prediction Interval Coverage Probability (PICP), which is the percentage of test samples whose outputs lie within the PI, and the Normalized Mean Prediction Interval Width (NMPIW), which is the average width of the PI on a test set divided by the range of the test targets. PICP shows what percentage of targets actually lie within the PI, and it should correspond to the target coverage. NMPIW shows how tight the PI are for the given task, compared to the naive approach of simply taking the full range of targets as an interval. Ideal PI have a small NMPIW with a PICP equal to the target coverage.
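
Both metrics are straightforward to compute from the estimated interval boundaries; a minimal sketch (reported in percent, as in Table 2):

```python
import numpy as np

def picp(y_true, lower, upper):
    """Prediction Interval Coverage Probability: share of targets inside their intervals, in percent."""
    return 100.0 * np.mean((y_true >= lower) & (y_true <= upper))

def nmpiw(y_true, lower, upper):
    """Normalized Mean Prediction Interval Width: mean interval width over the target range, in percent."""
    return 100.0 * np.mean(upper - lower) / np.ptp(y_true)
```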

Dataset Samples Features Reference
Concrete compressive strength 1030 8 [46]
Plasma beta-carotene 315 12 [30]
Powerplant - Steam pressure 200 5 [14]
Powerplant - Main steam temperature 200 5 [14]
Table 1: Real-world datasets used for comparison.

The two measures PICP and NMPIW are inter-dependent, as increasing the PI width also increases the coverage. The comparison work [24] proposed a combined measure to replace PICP and NMPIW, but it is subjective due to two arbitrary hyper-parameters. This paper instead presents PICP and NMPIW on the same plot.

The ELM PI method proposed in this paper is compared to four other methods of computing PI for neural networks. The Delta method [9] linearizes a neural network model around a set of parameters, then applies asymptotic theory to construct the PI. An extension of the Delta method to heteroscedastic noise is available [11], although it is still limited by the linearization. Bayesian learning of neural network weights allows for a direct derivation of the variance of particular predicted values [27], but at a very high computational cost. The Bootstrap method is directly applicable to any machine learning method including neural networks, although caution should be taken in selecting the bootstrap parameters to make the method resilient to heteroscedastic noise [10]. Finally, the Lower Upper Bound Estimation (LUBE) method proposed by [24] uses two additional outputs in a neural network to predict the lower and upper PI, training the network with a custom cost function that includes both PICP and NMPIW.

The experimental setup uses an L1-regularized ELM model [28] for automatic model structure selection on the relatively small datasets, implemented in the HP-ELM toolbox [1]. The datasets are randomly split into 70% training and 30% test samples; median results over 30 initializations are reported. Numerical experimental results are given in Table 2; comparison numbers for the other methods are available in the corresponding paper [24]. Runtime is reported for a 1.4GHz dual-core laptop.

Dataset PICP(%) NMPIW(%) Runtime(ms)
Concrete compressive strength 91.59 34.01 92
Plasma beta-carotene 92.63 40.66 36
Powerplant - Steam pressure 93.33 39.29 27
Powerplant - Main steam temperature 88.33 18.38 35
Table 2: Experimental results of ELM Prediction Intervals.

Performance of the methods is shown as points in NMPIW/PICP coordinates in Figure 6. An ideal method would be at the left edge of the dashed line (low NMPIW with precise PICP). As shown in the figure, the ELM PI method performs better on the Steam pressure dataset, a little worse on the Plasma beta-carotene dataset, and about average on the other two.

Figure 6: Comparison of the ELM PI method (filled star) with four other methods from [24]. Best performing methods have low NMPIW and the target coverage (points close to the upper left corner).

A further analysis shows possible reasons for the good performance on Steam pressure and the worse one on Plasma beta-carotene. The analysis compares against uniform PI computed from the same ELM predictions for a dataset. Such PI estimate homoscedastic noise correctly, but cannot learn heteroscedastic noise. Let uniform PI grow starting from zero width; as they grow, both the coverage and the interval width increase, generating many pairs of {NMPIW, PICP} points. These points are then connected by a line that represents the homoscedastic PI performance boundary. The homoscedastic PI performance boundary and the ELM PI for the two datasets in question are shown in Figure 7.
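
A sketch of how such a boundary can be traced, by growing a uniform interval around fixed predictions and recording the resulting {NMPIW, PICP} pairs (the number and range of width steps are arbitrary):

```python
import numpy as np

def homoscedastic_boundary(y_true, y_hat, n_widths=200):
    """(NMPIW %, PICP %) pairs for uniform intervals of growing width around fixed predictions."""
    target_range = np.ptp(y_true)
    points = []
    for width in np.linspace(0.0, 2.0 * target_range, n_widths):
        covered = np.abs(y_true - y_hat) <= width / 2.0   # symmetric uniform interval of this width
        points.append((100.0 * width / target_range, 100.0 * covered.mean()))
    return np.array(points)
```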

Figure 7: Comparison of ELM PI (black marker) with uniform PI of varying width (solid line). Heteroscedastic ELM PI perform better on the Steam pressure dataset, while uniform PI are enough for the Plasma beta-carotene dataset.

Obviously, useful heteroscedastic PI must lie above this boundary, but in practice they may end up below it due to poorer parameter estimation. Indeed, heteroscedastic PI need an interval width per sample, while homoscedastic PI need only one interval width per dataset, which is easier to estimate precisely. As seen from Figure 7, this is the situation for ELM PI on the Plasma beta-carotene dataset, where uniform PI perform better. On Steam pressure, however, heteroscedastic PI perform better than uniform ones, as they have a higher coverage with the same average width. Another possible reason for the difference in performance is that the Plasma beta-carotene dataset has homoscedastic noise, while the Steam pressure dataset actually has heteroscedastic noise (or heteroscedastic stochastic outputs), so heteroscedastic PI provide the most benefit on the latter dataset.

4 Minimizing False Positives on a Large Real Dataset

This experiment uses PI to minimize the amount of false positive predictions on a large classification task. Note that the proposed PI methodology applies equally well to regression, and monotonic classification tasks are handled even better using purposely developed implementations of ELM [48].

A 4,000,000-sample dataset of pixel colors for skin/non-skin classification is created from the FaceSkin Images dataset [34]. The inputs are the colors of the target pixel and its neighbors, 147 input features in total, and the outputs are +1 for skin pixels and -1 for non-skin ones. The dataset uses photos of various people under different lighting conditions, without any pre-processing. True skin masks are created manually and are highly accurate. Half of the dataset is used for training, and the other half for testing.

The applied ELM model uses 147 linear + 200 sigmoid neurons. The predictions of ELM are real values that are turned into classes by taking their sign. Due to the simple model and input features (which are not tailored for image processing), the performance is average, at about 87% accuracy. The goal of the experiment is to check whether the per-sample PI can be used to significantly improve the accuracy at a cost of coverage, compared to per-dataset PI computed from the MSE.

To trade coverage for precision, a threshold $\theta$ is introduced. ELM predictions with an absolute value less than $\theta$ are ignored. A value of $\theta$ corresponding to the desired coverage percentage is found by scalar optimization methods. For per-sample PI, the threshold is multiplied by the corresponding predicted interval width $\hat{\sigma}(\mathbf{x})$ of each prediction.
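
A minimal sketch of that threshold selection (the paper uses scalar optimization; an equivalent quantile computation is shown here, and a constant interval width recovers the uniform MSE-based threshold):

```python
import numpy as np

def threshold_for_coverage(scores, pi_width, target_coverage):
    """Pick theta so that predictions with |score| >= theta * pi_width keep the target share of samples.

    `scores` are the raw real-valued ELM outputs and `pi_width` the per-sample interval widths;
    passing pi_width = 1 for all samples gives a single uniform (MSE-style) threshold.
    """
    ratio = np.abs(scores) / pi_width
    return np.quantile(ratio, 1.0 - target_coverage)   # e.g. target_coverage = 0.03 keeps the best 3%
```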

The results are shown in Figure 8. Here, an ELM model with a total of 347 hidden neurons is trained on two million samples. The per-sample PI improve the true positive rate slightly. More importantly, they reach almost zero false positives at 3% coverage, and exactly zero at 1%. Contrary to the proposed method, uniform PI computed with MSE cannot achieve zero false positives. Although one percent coverage seems very little, it represents 20,000 test samples for this dataset, and it is a surprising achievement for a simple ELM model that is not optimized for False Positive reduction like custom applications are [2]. A specifically designed model, or an ensemble of multiple models, could achieve zero False Positives with a larger coverage, which would be a significant result for the practical use of ELM and Machine Learning algorithms in general.

Figure 8: True Positive versus False Positive rate for the most confident part of the predictions (depicted by percentage), for an MSE-based threshold (dashed line) and a sample-specific threshold based on PI (solid line). Per-sample PI give almost zero False Positives for the best 3% of predictions, and exactly zero for the best 1%. The True Positive rate is also overall higher than with the MSE-based threshold.

4.1 Runtime Analysis

The runtime of per-sample PI is examined on the pixel classification dataset explained above. The experiments are run on a desktop machine with a 4-core Intel Skylake CPU, using the efficient ELM toolbox from [1]. With 2,000,000 training samples and 347 hidden neurons, training an ELM takes 12 seconds (for either of the two models). Computing the covariance matrices with the weighted Jackknife method takes 25 seconds each, or only about twice as long as training an ELM itself. Test predictions take 8 seconds to compute, and test per-sample PI take 32 seconds. In total, prediction intervals increase the ELM runtime by a constant factor of about 5.

The runtime on the real-world datasets is not directly comparable with the other methods, as they are run on different machines, but it is of the same order of magnitude as the Bootstrap, an order of magnitude faster than the Delta or Bayesian methods, and an order of magnitude slower than the LUBE method. Replacing the L1-regularized ELM with a standard ELM reduces the runtime to the level of the LUBE method; however, it degrades the results on small datasets with a few hundred samples. Extremely large datasets that do not need regularization benefit from the faster speed.

5 Conclusion

The paper proposed a method of computing per-sample prediction intervals for Extreme Learning Machines. It successfully estimates the variance of heteroscedastic stochastic outputs using only ELM models and the weighted Jackknife method. The framework also works well for homoscedastic outputs, making the proposed method applicable on a general level. ELM PI is comparable to other methods of computing PI for neural networks on small datasets, while retaining very fast runtimes and scalability to Big Data.

On a large real dataset, the method has been shown to allow better precision and a lower False Positive rate. Heteroscedastic PI perform similarly to uniform PI from the Mean Squared Error on 50%-70% of the dataset samples, but they make a huge difference on the most confidently predicted 1%-10% of samples. For these samples, the proposed PI achieved a zero False Positive rate even with a basic ELM model, which is an extremely useful feature in many practical applications. The runtime is comparable to the runtime of the ELM itself, which makes the method feasible for large datasets in Big Data problems.

ELM PI can be easily extended to non-symmetric PI by using two ELM models in the second stage for predicting the upper and lower boundaries separately. An ensemble of ELMs may increase the coverage achievable at zero False Positives. These extensions will be examined and evaluated in future works on this topic.

References

  1. A. Akusok, K. Björk, Y. Miche and A. Lendasse (2015-07) High-Performance Extreme Learning Machines: A Complete Toolbox for Big Data Applications. IEEE Access 3, pp. 1011–1025. External Links: ISSN 2169-3536, Document Cited by: §1.1, §1, §2.4, §3.2, §4.1.
  2. A. Akusok, Y. Miche, J. Hegedus, R. Nian and A. Lendasse (2014-03) A Two-Stage Methodology Using K-NN and False-Positive Minimizing ELM for Nominal Data Classification. Cognitive Computation 6 (3), pp. 432–445. External Links: ISSN 1866-9956, Document Cited by: §1, §4.
  3. A. Akusok, Y. Miche, J. Karhunen, K. Björk, R. Nian and A. Lendasse (2015-05) Arbitrary Category Classification of Websites Based on Image Content. IEEE Computational Intelligence Magazine 10 (2), pp. 30–41. External Links: ISSN 1556-603X, Document Cited by: §1.
  4. A. Akusok, D. Veganzones, Y. Miche, K. Björk, P. du Jardin, E. Séverin and A. Lendasse (2015-07) MD-ELM: Originally Mislabeled Samples Detection using OP-ELM Model. Neurocomputing 159, pp. 242–250. External Links: ISSN 09252312, Document Cited by: §1.
  5. H. Asai, S. Tanaka and K. Uegima (1982) Linear regression analysis with fuzzy model. IEEE Transaction Systems Man and Cybermatics 12 (6), pp. 903–07. Cited by: §1.1.
  6. R. A. R. Ashfaq, X. Wang, J. Z. Huang, H. Abbas and Y. He (2017-02) Fuzziness based semi-supervised learning approach for intrusion detection system. Information Sciences 378, pp. 484–497. External Links: ISSN 0020-0255, Document Cited by: §1.1.
  7. C. M. Bishop (2006) Pattern Recognition and Machine Learning. Information science and statistics, Vol. 4, Springer Science+Business Media, Singapore. External Links: ISBN 978-0-387-31073-2 Cited by: §2.2.
  8. Y. Chen, J. Yang, C. Wang and D. Park (2016) Variational Bayesian extreme learning machine. Neural Computing and Applications 27 (1), pp. 185–196. External Links: ISSN 1433-3058, Document Cited by: §1.1.
  9. G. Chryssolouris, M. Lee and A. Ramsey (1996-01) Confidence interval prediction for neural network models. IEEE Transactions on Neural Networks 7 (1), pp. 229–232. External Links: ISSN 1045-9227, Document Cited by: §3.2.
  10. R. Davidson and E. Flachaire (2008-09) The wild bootstrap, tamed at last. Journal of Econometrics 146 (1), pp. 162–169. External Links: ISSN 0304-4076, Document Cited by: §2.3, §3.2.
  11. A. A. Ding and X. He (2003-03) Backpropagation of pseudo-errors: neural networks that are adaptive to heterogeneous noise. IEEE Transactions on Neural Networks 14 (2), pp. 253–262. External Links: ISSN 1045-9227, Document Cited by: §3.2.
  12. E. Soria-Olivas, J. Gomez-Sanchis, J. D. Martin, J. Vila-Frances, M. Martinez, J. R. Magdalena and A. J. Serrano (2011-03) BELM: Bayesian Extreme Learning Machine. IEEE Transactions on Neural Networks 22 (3), pp. 505–509. External Links: ISSN 1045-9227, Document Cited by: §1.1.
  13. E. Flachaire (2005-04) Bootstrapping heteroskedastic regression models: wild bootstrap vs. pairs bootstrap. 2nd CSDA Special Issue on Computational Econometrics 49 (2), pp. 361–376. External Links: ISSN 0167-9473, Document Cited by: §2.3.
  14. R. Guidorzi and R. Rossi (1974) Identification of a power plant from normal operating records. Automatic Control Theory and Applications 2 (3), pp. 63–67. Cited by: Table 1.
  15. Y. He, X. Wang and J. Z. Huang (2016-10) Fuzzy Nonlinear Regression Analysis Using a Random Weight Network. Inf. Sci. 364 (C), pp. 222–240. External Links: ISSN 0020-0255, Document Cited by: §1.1.
  16. P. S. Horn, A. J. Pesce and B. E. Copeland (1998-03) A robust approach to reference interval estimation and evaluation. Clinical Chemistry 44 (3), pp. 622–631. Cited by: §2.3.
  17. G. Huang, Z. Bai, L.L.C. Kasun and C. M. Vong (2015-05) Local Receptive Fields Based Extreme Learning Machine. IEEE Computational Intelligence Magazine 10 (2), pp. 18–29. External Links: ISSN 1556-603X, Document Cited by: §1.
  18. G. Huang, L. Chen and C. Siew (2006-07) Universal approximation using incremental constructive feedforward networks with random hidden nodes. IEEE Transactions on Neural Networks 17 (4), pp. 879–892. External Links: ISSN 1045-9227, Document Cited by: §1.
  19. G. Huang, H. Zhou, X. Ding and R. Zhang (2012-04) Extreme learning machine for regression and multiclass classification.. Systems, Man, and Cybernetics, Part B: Cybernetics, IEEE Transactions on 42 (2), pp. 513–529. External Links: ISSN 1941-0492, Document Cited by: §2.1.
  20. G. Huang, Q. Zhu and C. Siew (2006-12) Extreme learning machine: Theory and applications. Neural Networks Selected Papers from the 7th Brazilian Symposium on Neural Networks (SBRN ’04)7th Brazilian Symposium on Neural Networks 70 (1–3), pp. 489–501. External Links: ISSN 0925-2312, Document Cited by: §2.1.
  21. G. Huang, Q. Zhu and C. Siew (25-29 July 2004) Extreme learning machine: a new learning scheme of feedforward neural networks. In Neural Networks, 2004. Proceedings. 2004 IEEE International Joint Conference On, Vol. 2, pp. 985–990. External Links: ISBN 1098-7576, Document Cited by: §1.
  22. J. Hegedus, Y. Miche, A. Ilin and A. Lendasse (3-4 Dec. 2011) Methodology for Behavioral-based Malware Analysis and Detection Using Random Projections and K-Nearest Neighbors Classifiers. In 2011 Seventh International Conference on Computational Intelligence and Security, pp. 1016–1023. External Links: Document Cited by: §1.
  23. R. A. Johnson and D. W. Wichern (2002) Applied multivariate statistical analysis. Vol. 5, Prentice hall Upper Saddle River, NJ. Cited by: §2.2, §2.3.
  24. A. Khosravi, S. Nahavandi, D. Creighton and A. F. Atiya (2011-03) Lower Upper Bound Estimation Method for Construction of Neural Network-Based Prediction Intervals. IEEE Transactions on Neural Networks 22 (3), pp. 337–346. External Links: ISSN 1045-9227, Document Cited by: Figure 6, §3.2, §3.2, §3.2, §3.2.
  25. A. Lendasse, V. C. Man, Y. Miche and G. Huang (2016) Advances in extreme learning machines (ELM2014). Neurocomputing 174, Part A, pp. 1 – 3. External Links: ISSN 0925-2312, Document Cited by: §1.
  26. M. Loève (1955) Probability Theory; Foundations, Random Sequences. D. Van Nostrand Company, New York. Cited by: §2.2, §2.3.
  27. D. J. C. MacKay (1992-09) The Evidence Framework Applied to Classification Networks. Neural Computation 4 (5), pp. 720–736. External Links: ISSN 0899-7667, Document Cited by: §3.2.
  28. Y. Miche, A. Sorjamaa, P. Bas, O. Simula, C. Jutten and A. Lendasse (2010-01) OP-ELM: Optimally-Pruned Extreme Learning Machine. IEEE Transactions on Neural Networks 21 (1), pp. 158–162. External Links: Document Cited by: §3.2.
  29. Y. Miche, M. van Heeswijk, P. Bas, O. Simula and A. Lendasse (2011-09) TROP-ELM: A double-regularized ELM using LARS and Tikhonov regularization. Advances in Extreme Learning Machine: Theory and Applications Biological Inspired Systems. Computational and Ambient Intelligence Selected papers of the 10th International Work-Conference on Artificial Neural Networks (IWANN2009) 74 (16), pp. 2413–2421. External Links: ISSN 0925-2312, Document Cited by: §1.
  30. D. W. Nierenberg, T. A. Stukel, J. A. Baron, B. J. Dain and E. R. Greenberg (1989-09) Determinants of Plasma Levels of Beta-Carotene and Retinol. American Journal of Epidemiology 130 (3), pp. 511–521. Cited by: Table 1.
  31. D. A. Nix and A. S. Weigend (1995) Learning Local Error Bars for Nonlinear Regression. In Advances in Neural Information Processing Systems 7, G. Tesauro, D. S. Touretzky and T. K. Leen (Eds.), pp. 489–496. Cited by: §2.3.
  32. D. Pevec and I. Kononenko (2014-10) Input dependent prediction intervals for supervised regression.. Intelligent Data Analysis 18 (5), pp. 873–887. External Links: ISSN 1088467X Cited by: §1.
  33. D. Pevec and I. Kononenko (2015) Prediction intervals in supervised learning for model evaluation and discrimination. Applied Intelligence 42 (4), pp. 790–804. External Links: ISSN 1573-7497, Document Cited by: §1.1, §2.2.
  34. S. L. Phung, A. Bouzerdoum and Sr. Chai D. (2005-01) Skin segmentation using color pixel classification: analysis and comparison. Pattern Analysis and Machine Intelligence, IEEE Transactions on 27 (1), pp. 148–154. External Links: ISSN 0162-8828, Document Cited by: §4.
  35. C. R. Rao and S. K. Mitra (1972) Generalized inverse of a matrix and its applications. In Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Theory of Statistics, Berkeley, CA, pp. 601–620. Cited by: §2.1.
  36. Z. Shang and J. He (2015-01) Confidence-weighted extreme learning machine for regression problems. Neurocomputing 148, pp. 544–550. External Links: ISSN 0925-2312, Document Cited by: §1.1.
  37. J. Shao and C. F. J. Wu (1987) Heteroscedasticity-Robustness of Jackknife Variance Estimators in Linear Models. The Annals of Statistics 15 (4), pp. 1563–1579. External Links: ISSN 00905364 Cited by: §2.2, §2.3.
  38. D. Sovilj, E. Eirola, Y. Miche, K. Björk, R. Nian, A. Akusok and A. Lendasse (2016-01) Extreme learning machine for missing data using multiple imputations. Neurocomputing 174, Part A, pp. 220 – 231. External Links: ISSN 0925-2312, Document Cited by: §1.
  39. C. Swaney, A. Akusok, K.-M. Björk, Y. Miche and A. Lendasse (2015-01) Efficient Skin Segmentation via Neural Networks: HP-ELM and BD-SOM. INNS Conference on Big Data 2015 Program San Francisco, CA, USA 8-10 August 2015 53, pp. 400–409. External Links: ISSN 1877-0509, Document Cited by: §1.
  40. M. Termenon, M. Graña, A. Savio, A. Akusok, Y. Miche, K. Björk and A. Lendasse (2016) Brain MRI morphological patterns extraction tool based on Extreme Learning Machine and majority vote classification. Neurocomputing 174, Part A, pp. 344 – 351. External Links: ISSN 0925-2312, Document Cited by: §1.
  41. A. N. Tikhonov (1963) Solution of incorrectly formulated problems and the regularization method. Soviet Math. Dokl. 5, pp. 1035–1038. Cited by: §1.
  42. Q. Wang, J. Zhang and Z. Pang (2017-09-01) Stable prediction in high-dimensional linear models. Statistics and Computing 27 (5), pp. 1401–1412. External Links: ISSN 1573-1375 Cited by: §1.1.
  43. X. Z. Wang, H. J. Xing, Y. Li, Q. Hua, C. R. Dong and W. Pedrycz (2015-10) A Study on Relationship Between Generalization Abilities and Fuzziness of Base Classifiers in Ensemble Learning. IEEE Transactions on Fuzzy Systems 23 (5), pp. 1638–1654. External Links: ISSN 1063-6706, Document Cited by: §1.1.
  44. X. Z. Wang, T. Zhang and R. Wang (2017) Noniterative Deep Learning: Incorporating Restricted Boltzmann Machine Into Multilayer Random Weight Neural Networks. IEEE Transactions on Systems, Man, and Cybernetics: Systems PP (99), pp. 1–10. External Links: ISSN 2168-2216, Document Cited by: §1.1.
  45. C. F. J. Wu (1986-12) Jackknife, Bootstrap and Other Resampling Methods in Regression Analysis. Ann. Statist. (4), pp. 1261–1295 (en). External Links: ISSN 0090-5364, Document Cited by: §2.3.
  46. I.-C. Yeh (1998) Modeling of strength of high-performance concrete using artificial neural networks. Cement and Concrete Research 28 (12), pp. 1797–1808. Cited by: Table 1.
  47. Z. Huang, Y. Yu, J. Gu and H. Liu (2017-04) An Efficient Method for Traffic Sign Recognition Based on Extreme Learning Machine. IEEE Transactions on Cybernetics 47 (4), pp. 920–933. External Links: ISSN 2168-2267, Document Cited by: §1.
  48. H. Zhu, E. C.C. Tsang, X. Wang and R. A. R. Ashfaq (2017) Monotonic classification extreme learning machine. Neurocomputing 225 (Supplement C), pp. 205 – 213. External Links: ISSN 0925-2312 Cited by: §4.