# Autonomous Deep Learning: Incremental Learning of Denoising Autoencoder for Evolving Data Streams

###### Abstract

The generative learning phase of the autoencoder (AE) and its successor, the denoising autoencoder (DAE), enhances the flexibility of data stream methods in exploiting unlabelled samples. Nonetheless, the feasibility of DAE for data stream analytics deserves in-depth study because it has a fixed network capacity which cannot adapt to rapidly changing environments. An automated construction of a denoising autoencoder, namely the deep evolving denoising autoencoder (DEVDAN), is proposed in this paper. DEVDAN features an open structure both in the generative phase and in the discriminative phase where input features can be automatically added and discarded on the fly. A network significance (NS) method is formulated in this paper and is derived from the bias-variance concept. This method is capable of estimating the statistical contribution of the network structure and its hidden units, which signals the ideal moment to add or prune input features. Furthermore, DEVDAN is free of problem-specific thresholds and works fully in a single-pass learning fashion. The efficacy of DEVDAN is numerically validated using nine non-stationary data stream problems simulated under the prequential test-then-train protocol, where DEVDAN delivers an improvement in classification accuracy over recently published online learning works while retaining flexibility in the automatic extraction of robust input features and in adapting to rapidly changing environments.


Mahardhika Pratama^{*,1}, Andri Ashfahani^{*,2}, Yew Soon Ong^{*,3}, Savitha Ramasamy^{+,4} and Edwin Lughofer^{#,5}
^{*}School of Computer Science and Engineering, NTU, Singapore
^{+}Institute of Infocomm Research, A*Star, Singapore
^{#}Johannes Kepler University Linz, Austria
{^{1}mpratama@, ^{2}andriash001@e., ^{3}asysong@}ntu.edu.sg, ^{4}ramasamysa@i2r.a-star.edu.sg,
^{5}edwin.lughofer@jku.at

## Introduction

The underlying challenge in the design of deep neural networks (DNNs) lies in the model selection phase, where no commonly accepted methodology exists to configure the structure of DNNs (?). This issue often forces one to choose the structure of DNNs blindly. DNN model selection has recently attracted intensive research where the goal is to determine an appropriate structure for DNNs with the right complexity for given problems. It is evident that a shallow NN tends to converge much faster than a DNN and handles the small sample size problem better than DNNs. In other words, the size of DNNs strongly depends on the availability of samples. Research in this area encompasses the development of pruning (?), regularization (?), parameter prediction (?), etc. Most of these approaches start with an over-complex network followed by a complexity reduction scenario to drop the inactive components of DNNs (?). They are, however, not fully suited to streaming data problems because they rely on an iterative parameter learning scenario where the tuning phase is iterated across a number of epochs (?). Moreover, a fixed structure is considered to be the underlying bottleneck of such models because it does not embrace, or is too slow to respond to, new training patterns that result from concept change, especially if the network parameters have converged to particular points (?).

The ideas of online DNNs have started to attract research attention (?). In (?), online incremental feature learning is proposed using a denoising autoencoder (DAE) (?). The incremental learning aspect is depicted by its aptitude to handle the addition of new features and the merging of similar features. The structural learning scenario is mainly driven by feature similarity and does not fully operate in the one-pass learning mode. (?) puts forward the hedge backpropagation method to answer the research question as to how and when a DNN structure should be adapted. This work, however, assumes that an initial structure of DNN exists and is built upon a fixed-capacity network. To the best of our knowledge, the two approaches are not examined with the prequential test-then-train procedure considering the practical scenario where data streams arrive without labels, thus being impossible to first undertake the training process (?).

In the realm of DNNs, the pre-training phase plays a vital role because it addresses the random initialization problem leading to slow convergence (?). From Hinton's variational bound theory (?), the power of depth can be achieved provided the hidden layer has sufficient complexity and appropriate initial parameters. An unsupervised learning step is carried out in the pre-training phase, also known as the generative phase (?). The generative phase implements the feature learning approach which produces a higher-level representation of the input features and induces an appropriate intermediate representation (?). From the viewpoint of data streams, the generative phase offers refinement of the predictive model in the absence of true class labels. This matters because data streams often arrive without labels. Of the several approaches for the generative phase, the autoencoder (AE) is considered the most prominent method (?). DAE is a variant of AE which adopts the partial destruction of the original input features (?). This approach prevents the network from learning the identity function and opens the manifold of the original input dimension because the destroyed input variables are likely to sit further away than the clean input manifold. Nevertheless, the structure of DAE is user-defined and its training is iterative, which makes it ill-suited to data stream applications.

A deep evolving denoising autoencoder (DEVDAN) for evolving data streams is proposed in this paper. DEVDAN presents an incremental learning approach for DAE which features a fully open and single-pass working principle in both generative and discriminative phase. It is capable of starting its generative learning process from scratch without an initial structure. Its hidden nodes can be automatically generated, pruned and learned on demand and on the fly. Note that this paper considers the most challenging case where one has to grow the network from scratch but the concept is directly applicable in the presence of initial structure. The discriminative model relies on a soft-max layer which produces the end-output of DNN and shares the same trait of the generative phase: online and evolving. DEVDAN distinguishes itself from (?) because it works by means of estimation of network significance leading to approximation of bias and variance and is free of user-defined thresholds. A new hidden unit is introduced if the current structure is no longer expressive enough to represent the current data distribution - underfitting whereas an inconsequential unit is pruned in the case of high variance - overfitting. In addition, the evolving trait of DEVDAN is not only limited to the generative phase but also the discriminative phase.

The unique feature of the NS measure is its aptitude to estimate the statistical contribution of a neural network and a hidden node during their lifespan in an online fashion. This approach is defined as a limit integral representation of a generalization error which approximates both the historical and future significance of the overall network and its hidden unit. It is worth mentioning that a different approach from conventional self-organizing radial basis function networks (?; ?) has to be developed because DAE cannot be approached by an input space clustering method. The NS method offers a general framework of a statistical contribution measure and is extendable for different DNNs. Moreover, the NS method is also free of user-defined parameters which are often problem-dependent and hard to assign. It is supported by an adaptive conflict threshold dynamically adjusted with respect to the true performance of DEVDAN and current data distribution.

The performance of DEVDAN has been numerically investigated using nine prominent data stream problems: SEA (?), Hyperplane (?), HEPMASS, SUSY (?), KDDCup (?), Weather, electricity pricing (?), RLCPS (?) and the RFID localization problem. DEVDAN is capable of improving the accuracy of the conventional DAE and outperforming recently proposed data stream methods (?; ?). It offers a flexible approach to the automatic construction of robust features from data streams and operates in a one-pass learning fashion. Our numerical results are produced under the prequential test-then-train protocol, the standard evaluation procedure for data stream methods (?). The remainder of this paper is structured as follows: the paper starts with the problem formulation, followed by the automatic construction of the network structure and the discriminative training phase. The proof of concepts discusses the numerical study on nine data stream problems and the comparison of DEVDAN against state-of-the-art algorithms. Some concluding remarks are drawn in the last section of this paper.

## Problem Formulation

Evolving data streams refer to the continuous arrival of data points $B_k$ over a number of time stamps $k = 1, \dots, K$, where $B_k$ may consist of a single data point $X \in \Re^{n}$ or be formed as a data batch of a particular size, $[X_1, \dots, X_T] \in \Re^{T \times n}$. Here $n$ denotes the input space dimension and $T$ stands for the size of the data chunk. The size of the data batch often varies and the number of time stamps $K$ is in practice unknown. In real data stream environments, data points arrive in the absence of true class labels. The labelling process is carried out later and is subject to access to the ground truth or expert knowledge (?). In other words, a delay is expected in consolidating the true class labels. This issue warrants a generative learning step which can be applied to refine a predictive model in an unsupervised fashion while waiting for the operator to annotate the true class labels of data samples, which is the underlying motivation of DEVDAN's algorithmic development. This problem also hampers the suitability of the conventional cross-validation method or the direct train-test partition method as an evaluation protocol for a data stream learner. Hence, the so-called prequential test-then-train procedure is carried out here. That is, data streams are first used to test the generalization power of a learner before being exploited to update the model. The performance of a data stream method is evaluated by aggregating its performance across all time stamps.
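The prequential test-then-train procedure described above can be sketched as follows. This is a minimal illustration rather than the evaluation code used in the paper: the `prequential_accuracy` function and the toy `MajorityClass` learner are hypothetical names introduced here, and labels are assumed to become available right after each batch has been predicted.

```python
from typing import Callable, Iterable, List, Sequence, Tuple

Batch = Tuple[Sequence[Sequence[float]], Sequence[int]]


def prequential_accuracy(
    batches: Iterable[Batch],
    predict: Callable[[Sequence[float]], int],
    update: Callable[[Sequence[float], int], None],
) -> List[float]:
    """Return the classification rate of each batch, measured before training on it."""
    rates = []
    for inputs, labels in batches:
        # Test first: the learner has not seen this batch yet.
        correct = sum(predict(x) == y for x, y in zip(inputs, labels))
        rates.append(correct / len(labels))
        # Then train: labels are assumed to arrive only after prediction.
        for x, y in zip(inputs, labels):
            update(x, y)
    return rates


class MajorityClass:
    """Toy learner (hypothetical stand-in for any data stream classifier)."""

    def __init__(self):
        self.counts = {}

    def predict(self, x):
        # Predict the most frequent label seen so far, or class 0 at the start.
        return max(self.counts, key=self.counts.get) if self.counts else 0

    def update(self, x, y):
        self.counts[y] = self.counts.get(y, 0) + 1
```

Averaging the returned per-batch rates gives the aggregate performance across all time stamps.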

DEVDAN is constructed under the denoising autoencoder (DAE) (?) - a variant of the autoencoder (AE) (?) which aims to retrieve the original input information from a noise perturbation. The masking noise scenario is chosen here to induce a partially destroyed input feature vector $\tilde{x}$ by forcing some of its elements to zero. In other words, only a subset of the original input features goes through DAE. The corrupted input variables are randomly selected in every training observation, satisfying the joint distribution of clean and corrupted inputs (?). This mechanism puts DAE a step ahead of the classical AE since it never functions as an identity function but rather extracts the key features of the predictive problem. The reconstruction process is carried out via an encoding-decoding scheme formed with the sigmoid activation function as follows:

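$$y = s(\tilde{x} W + b) \tag{1}$$

$$z = s(y W' + c) \tag{2}$$

where $\tilde{x} \in \Re^{n}$ is the partially destroyed input, $W \in \Re^{n \times R}$ is a weight matrix, $b \in \Re^{R}$ and $c \in \Re^{n}$ are respectively the biases of the hidden units and the decoding function, and $s(\cdot)$ is the sigmoid function. $R$ is the number of hidden units. The weight matrix of the decoder is constrained such that $W'$ is a reverse mapping, $W' = W^{\top}$. That is, DAE has a tied weight (?).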
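A minimal pure-Python sketch of the masking noise and the tied-weight encoding-decoding scheme is given below. It assumes a row-vector convention in which the weight matrix `W` has one row per input feature and one column per hidden unit; the function names and the `n_corrupt` parameter are illustrative choices, not taken from the paper.

```python
import math
import random


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def mask_input(x, n_corrupt, rng):
    """Masking noise: force n_corrupt randomly chosen elements of x to zero."""
    corrupted = list(x)
    for i in rng.sample(range(len(x)), n_corrupt):
        corrupted[i] = 0.0
    return corrupted


def encode(x_tilde, W, b):
    """Hidden activation y_r = s(sum_i x_tilde_i * W[i][r] + b_r)."""
    return [
        sigmoid(sum(x_tilde[i] * W[i][r] for i in range(len(x_tilde))) + b[r])
        for r in range(len(b))
    ]


def decode(y, W, c):
    """Reconstruction z_i = s(sum_r y_r * W[i][r] + c_i).

    The decoder reuses W transposed, which is the tied-weight constraint.
    """
    return [
        sigmoid(sum(y[r] * W[i][r] for r in range(len(y))) + c[i])
        for i in range(len(c))
    ]
```

For example, `decode(encode(mask_input(x, 2, random.Random(0)), W, b), W, c)` reconstructs `x` from a copy with two features zeroed out.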

The typical characteristic of a data stream is the presence of concept drift, formulated as a change of the joint-class posterior probability (?). This situation renders a model built on a previously induced concept obsolete. DEVDAN features an open structure where it is capable of initiating its structure from scratch without the presence of a pre-configured structure. Its structure automatically evolves with respect to the network significance approach, which forms an approximation of the network bias and variance. In other words, DEVDAN initially extracts a single input feature, and the number of extracted input features is incrementally increased if the network signals an underfitting situation (high bias) or decreased if it suffers from an overfitting situation (high variance). In the realm of concept drift, this is supposed to handle the so-called virtual drift - a distributional change of the input space. The virtual drift is interpreted as a change of the prior probability or the class conditional probability (?). The parameter tuning scenario is driven by the stochastic gradient descent (SGD) method in a single-pass mode with the cross-entropy cost function (?).

Once the true class labels $C_k$ of a data batch have been observed, the 0-1 encoding scheme is undertaken to construct a labelled data batch $(X, Y) \in \Re^{T \times (n + m)}$, where $m$ stands for the number of target classes. The discriminative phase of DEVDAN is carried out once the generative phase has been completed, using a softmax layer trained with the SGD method for only a single epoch. Furthermore, the discriminative training process is also equipped with hidden unit growing and pruning strategies derived in a similar manner to those of the generative training process. An overview of DEVDAN's learning mechanism is depicted in Fig. 1. One must bear in mind that DEVDAN's learning scheme can also be applied with an initial model.

## Automatic Construction of Network Structure

This section formalizes the network significance (NS) method applied to grow and to prune hidden units of DAE.

### Growing Hidden Units of DAE

The power of DAE can be examined from its reconstruction error which can be formed in terms of mean square error (MSE) as follows:

$$MSE = \frac{1}{T}\sum_{t=1}^{T} (x_t - z_t)^2 \tag{3}$$

where $x_t, z_t$ respectively stand for the clean input variables and the reconstructed input features of DAE. This formula suffers from two bottlenecks for the single-pass learning scenario: 1) it calls for memory of all data points to obtain a complete picture of DAE's reconstruction capability; 2) notwithstanding that the MSE can be calculated recursively without revisiting preceding samples, this procedure does not examine the reconstruction power of DAE for unseen data samples. In other words, it does not take into account the generalization power of DAE.

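To correct this drawback, let $z$ denote the estimation of the clean input variables $x$ and let $E[z]$ stand for the expectation of DAE's output. The NS method is defined as follows:

$$NS = \int_{-\infty}^{\infty} (x - z)^2 \, p(\tilde{x}) \, d\tilde{x} \tag{4}$$

Note that $E[z] = \int_{-\infty}^{\infty} z \, p(\tilde{x}) \, d\tilde{x}$, where $p(\tilde{x})$ is the probability density estimation. The NS method can be defined in terms of the expectation of the squared reconstruction error:

$$NS = E[(x - z)^2] = E[(x - E[z] + E[z] - z)^2] \tag{5}$$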

Several mathematical derivation steps lead to the bias and variance formula as follows:

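$$NS = (E[z] - x)^2 + (E[z^2] - E[z]^2) = Bias^2 + Var \tag{6}$$

where the variance of a random variable can be expressed as $Var(z) = E[z^2] - E[z]^2$, and the cross term vanishes because $E[E[z] - z] = 0$. The key for solving (6) is to find the expectation of the recovered input attributes, which delineates the statistical contribution of DAE. It is worth mentioning that the statistical contribution captures the network contribution with respect to both past training samples and unseen samples. It is thus written as follows:

$$E[z] = \int_{-\infty}^{\infty} s(y W' + c) \, p(y) \, dy \tag{7}$$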

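It is evident that $y$ is induced by the feature extractor $s(\tilde{x} W + b)$ and is influenced by the partially destroyed input features $\tilde{x}$ due to the masking noise. Hence, (7) is modified as follows:

$$E[z] = s(E[y] W' + c) \tag{8}$$

$$E[y] = \int_{-\infty}^{\infty} s(\tilde{x} W + b) \, p(\tilde{x}) \, d\tilde{x} \tag{9}$$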

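Suppose that the normal distribution holds, so that the probability density function (PDF) is expressed as $p(\tilde{x}) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(\tilde{x} - \mu)^2}{2\sigma^2}\right)$. It is also known that the sigmoid function can be approached by the probit function $\Phi(\xi \tilde{x})$ (?), where $\Phi(\xi \tilde{x}) = \int_{-\infty}^{\xi \tilde{x}} \mathcal{N}(\theta \mid 0, 1) \, d\theta$ and $\xi^2 = \pi / 8$. Following the result of (?), (9) is derived:

$$E[y] \approx s\left(\frac{\mu}{\sqrt{1 + \pi \sigma^2 / 8}} \, W + b\right) \tag{10}$$

where $\mu, \sigma$ are respectively the mean and standard deviation of the Gaussian function, which can be calculated recursively from streaming data. The final expression of $E[z]$ is formulated as follows:

$$E[z] = s\left(s\left(\frac{\mu}{\sqrt{1 + \pi \sigma^2 / 8}} \, W + b\right) W' + c\right) \tag{11}$$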

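where (11) is a function of two sigmoid functions. This result enables us to establish the $Bias^2 = (E[z] - x)^2$ in (6). Let us recall $Var(z) = E[z^2] - E[z]^2$. The second term is derived from (11) while the first term is written:

$$E[z^2] = s(E[y^2] W' + c) \tag{12}$$

Due to the fact that $y^2 = y \cdot y$, treating $y$ as an IID variable allows us to go further as follows:

$$E[z^2] = s(E[y] E[y] W' + c) \tag{13}$$

$$E[z^2] = s(E[y]^2 W' + c) \tag{14}$$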

Consolidating all the results of (11) and (14), the final expression of the NS method is established. The NS method is derived from the expectation of MSE leading to the popular bias and variance formula. This method allows one to examine the quality of the predictive model by directly inspecting the possible underfitting or overfitting situation of a predictive model and capturing the reliability of a predictive model across the overall data space given a particular data distribution. A high NS value indicates either a high variance problem (overfitting) or a high bias problem (underfitting) which cannot be simply portrayed by a system error index. The addition of a new hidden node is supposed to reduce the high bias problem. It is, however, not to be done in the case of overfitting because it exacerbates the overfitting situation.
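The bias and variance terms of the NS method can be sketched in pure Python as follows. The sketch assumes the probit-based form in which the input mean is scaled by $\sqrt{1 + \pi\sigma^2/8}$ before it passes through the encoder, assumes $E[y^2] = E[y]^2$ as in the derivation above, and averages both terms over the input features to obtain scalars. The averaging and the function name `network_significance` are illustrative assumptions, not the authors' implementation.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def network_significance(x, mu, sigma, W, b, c):
    """Return (bias^2, variance) of a tied-weight DAE for the clean input x.

    mu, sigma: running mean and standard deviation of each input feature.
    W: n x R weight matrix, b: R hidden biases, c: n decoder biases.
    """
    n, R = len(c), len(b)
    # Probit-style scaling of the input mean by its spread (assumed form).
    p = [m / math.sqrt(1.0 + math.pi * s * s / 8.0) for m, s in zip(mu, sigma)]
    # Expected hidden activation E[y].
    Ey = [sigmoid(sum(p[i] * W[i][r] for i in range(n)) + b[r]) for r in range(R)]
    # Expected reconstruction E[z] and E[z^2], using E[y^2] = E[y]^2.
    Ez = [sigmoid(sum(Ey[r] * W[i][r] for r in range(R)) + c[i]) for i in range(n)]
    Ez2 = [sigmoid(sum(Ey[r] ** 2 * W[i][r] for r in range(R)) + c[i]) for i in range(n)]
    # Average the per-feature terms to obtain scalar indicators (assumption).
    bias2 = sum((Ez[i] - x[i]) ** 2 for i in range(n)) / n
    var = sum(Ez2[i] - Ez[i] ** 2 for i in range(n)) / n
    return bias2, var
```

The NS value is the sum of the two returned terms; keeping them separate is what allows the growing rule to monitor the bias and the pruning rule to monitor the variance.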

The hidden unit growing condition is derived from a similar idea to statistical process control which applies the statistical method to monitor the predictive quality of DEVDAN and does not rely on the user-defined parameter (?; ?). Nevertheless, the hidden node growing condition is not modelled as the binomial distribution here because DEVDAN is more concerned about how to reconstruct corrupted input variables rather than performing binary classification. Because the underlying goal of the hidden node growing process is to relieve the high bias problem, a new hidden node is added if the following condition is satisfied:

$$\mu_{Bias}^{t} + \sigma_{Bias}^{t} \ge \mu_{Bias}^{min} + \kappa \, \sigma_{Bias}^{min} \tag{15}$$

where $\mu_{Bias}^{t}, \sigma_{Bias}^{t}$ are respectively the mean and standard deviation of the Bias at the $t$-th time instant while $\mu_{Bias}^{min}, \sigma_{Bias}^{min}$ are the minimum Bias statistics up to the $t$-th observation. These variables are computed in the absence of previous data samples by simply updating their values whenever a new sample becomes available. Moreover, $\mu_{Bias}^{min}, \sigma_{Bias}^{min}$ have to be reset once (15) is satisfied. Note that the bias can be calculated by decomposing the NS formula in (6). This setting is also formalized from the fact that the Bias values should decrease while the number of training observations increases, as long as there is no change in the data distribution. On the other hand, a rise in the Bias values signals the presence of concept drift which cannot be addressed by simply learning the DAE's parameters. A similar approach is adopted in the drift detection method (DDM) (?) but no warning phase is arranged in the NS method to avoid the use of windowing approaches. (15) is derived from the so-called sigma rule where $\kappa$ governs the confidence degree of the sigma rule. $\kappa$ is selected as $1.3 \exp(-Bias^2) + 0.7$, which leads $\kappa$ to revolve around $[1, 2]$, meaning that it attains a confidence level of roughly 68.2% to 95.4%. This strategy aims to improve the flexibility of the hidden unit growing process, which adapts to the learning context and addresses the problem-specific nature of the constant $\kappa$. A high bias signifies an underfitting situation which can be resolved by adding complexity to the network structure, while the addition of a hidden unit should be avoided in the case of low bias to prevent the variance from increasing.
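The growing test can be sketched with recursively updated statistics as shown below. The running mean and standard deviation use Welford's method, which is one standard way of updating them without storing past samples; the paper does not state which recursion it uses. The confidence factor `1.3 * exp(-bias2) + 0.7` is an assumed form that stays within the interval the sigma rule above describes, and all names are hypothetical.

```python
import math


class RunningStats:
    """Mean and standard deviation updated one value at a time (Welford's method)."""

    def __init__(self):
        self.n, self.mean, self._m2 = 0, 0.0, 0.0

    def update(self, value):
        self.n += 1
        delta = value - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (value - self.mean)

    @property
    def std(self):
        # Population standard deviation of the values seen so far.
        return math.sqrt(self._m2 / self.n) if self.n else 0.0


def confidence_factor(bias2):
    """Dynamic sigma-rule factor (assumed form): 2.0 at zero bias, smaller as bias grows."""
    return 1.3 * math.exp(-bias2) + 0.7


def should_grow(mean, std, min_mean, min_std, bias2):
    """Sigma-rule test: current bias statistics against the recorded minima."""
    return mean + std >= min_mean + confidence_factor(bias2) * min_std
```

In a stream, each new squared bias value would be fed to `RunningStats.update`, the minima would be tracked alongside it, and the minima would be reset whenever `should_grow` returns `True`.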

Once a new hidden node is appended, its bias is randomly sampled from a fixed scope for simplicity while its connective weight is allocated from the current reconstruction error. This formulation comes from the fact that a new hidden unit should drive the error toward zero, given that the network now has $R + 1$ hidden units, where $R$ is the previous number of hidden units or extracted features. The new hidden node parameters play a crucial role in assuring an improvement of the reconstruction capability and in driving the network to a zero reconstruction error. It is accepted that a fixed scope does not always ensure the model's convergence. This issue can be tackled with an adaptive scope selection of random parameters (?).

### Hidden Unit Pruning Strategy

The overfitting problem occurs mainly due to a high network variance resulting from an over-complex network structure. The hidden unit pruning strategy helps to find a lower dimensional representation of feature space by discarding its superfluous components. Because a high variance designates the overfitting condition, the hidden unit pruning strategy starts from the evaluation of model’s variance. The same principle as the growing scenario is implemented where the statistical process control method is adopted to detect the high variance problem as follows:

$$\mu_{Var}^{t} + \sigma_{Var}^{t} \ge \mu_{Var}^{min} + 2 \chi \, \sigma_{Var}^{min} \tag{16}$$

where $\mu_{Var}^{t}, \sigma_{Var}^{t}$ respectively stand for the mean and standard deviation of $Var$ at the $t$-th time instant while $\mu_{Var}^{min}, \sigma_{Var}^{min}$ denote the minimum variance statistics up to the $t$-th observation. $\chi$, selected as $1.3 \exp(-Var) + 0.7$, is a dynamic constant controlling the confidence level of the sigma rule. The term 2 is arranged in (16) to overcome a direct-pruning-after-adding problem which may take place right after the feature growing process due to the temporary increase of network variance. The network variance naturally decreases as more observations are encountered. Note that $Var$ can be calculated with ease by following the mathematical derivation of the NS method. Moreover, $\mu_{Var}^{min}, \sigma_{Var}^{min}$ are reset when (16) is satisfied.

After (16) is satisfied, the contribution of each hidden unit is examined. An inconsequential hidden unit is discarded to reduce the overfitting situation. The significance of a hidden unit is tested via the concept of network significance, adapted to evaluate the hidden unit statistical contribution. This method can be derived by checking the hidden node activity in the whole corrupted feature space $\tilde{x} \in \Re^{n}$. The significance of the $i$-th hidden node is defined as its average activation degree for all possible data samples as follows:

$$HS_i = \lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} s(\tilde{x}_t W_i + b_i) \tag{17}$$

where $W_i, b_i$ stand for the connective weight and bias of the encoding function. Suppose that data samples are sampled from a certain PDF, (17) can be derived as follows:

$$HS_i = \int_{-\infty}^{\infty} s(\tilde{x} W_i + b_i) \, p(\tilde{x}) \, d\tilde{x} \tag{18}$$

Because the decoder is no longer used after the generative phase and only serves to complete the feature learning scenario, the importance of the hidden units is examined from the encoding function only. As with the growing strategy, (18) can be solved from the fact that the sigmoid function can be approached by the probit function. The importance of the hidden unit is formalized as follows:

$$HS_i = s\left(\frac{\mu}{\sqrt{1 + \pi \sigma^2 / 8}} \, W_i + b_i\right) \tag{19}$$

where $\mu, \sigma$ respectively denote the mean and standard deviation of the partially destroyed input features $\tilde{x}$. Because the significance of the hidden node is obtained from the limit integral of the sigmoid function given the normal distribution, (19) can also be interpreted as the expectation of the sigmoid encoding function. It is also seen that (19) delineates the statistical contribution of the hidden unit with respect to the recovered input attribute. A small HS value implies that the hidden unit plays a small role in recovering the clean input attributes and thus can be ruled out without significant loss of accuracy.

Since the contribution of a hidden unit is formed in terms of the expectation of an activation function, the least contributing hidden unit, having the minimum $HS$, is deemed inactive. If the overfitting situation occurs, that is, (16) is satisfied, the pruning process targets the hidden unit with the lowest $HS$ as follows:

$$\text{Pruning} \longrightarrow \min_{i = 1, \dots, R} HS_i \tag{20}$$

The condition (20) aims to mitigate the overfitting situation by getting rid of the least contributing hidden unit. This condition also signals that the original feature representation can still be reconstructed with the rest of the hidden units. Moreover, this strategy is supposed to enhance the generalization power of DEVDAN by reducing its variance.
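The pruning step can be sketched as below: a sigma-rule test on the variance statistics, the hidden unit significance computed from the running input statistics, and the removal of the least significant unit. The doubled factor mirrors the term 2 discussed above, while the form `1.3 * exp(-var) + 0.7` and every function name are assumptions made for illustration.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def should_prune(mean, std, min_mean, min_std, var):
    """Sigma-rule test on the variance, with the doubled confidence factor."""
    chi = 1.3 * math.exp(-var) + 0.7  # assumed form of the dynamic constant
    return mean + std >= min_mean + 2.0 * chi * min_std


def hidden_significance(mu, sigma, W, b):
    """Expected activation of each hidden unit under a Gaussian input assumption."""
    p = [m / math.sqrt(1.0 + math.pi * s * s / 8.0) for m, s in zip(mu, sigma)]
    return [
        sigmoid(sum(p[i] * W[i][r] for i in range(len(p))) + b[r])
        for r in range(len(b))
    ]


def prune_least_significant(W, b, hs):
    """Drop the hidden unit with the smallest significance; return new W, b and its index."""
    idx = min(range(len(hs)), key=hs.__getitem__)
    W_new = [row[:idx] + row[idx + 1:] for row in W]
    b_new = b[:idx] + b[idx + 1:]
    return W_new, b_new, idx
```

Because the weights are tied, removing one column of `W` removes the unit from both the encoder and the decoder.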

### Generative Training Phase

The parameter optimization phase is carried out using the stochastic gradient descent (SGD) approach with only a single epoch. Since data points are normalized into the range of $[0, 1]$ (?), the SGD procedure is derived using the cross-entropy loss function as follows:

$$\{W, b, c\} = \arg\min_{W, b, c} \frac{1}{T} \sum_{t=1}^{T} L(x_t, z_t) \tag{21}$$

$$L(x, z) = -\sum_{i=1}^{n} \left[ x_i \log z_i + (1 - x_i) \log (1 - z_i) \right] \tag{22}$$

where $x \in \Re^{n}$ is the noise-free input vector and $z \in \Re^{n}$ is the reconstructed input vector. $T$ is the number of samples observed thus far. Note that the cross-entropy function can be seen as the negative log-likelihood function. The minimization of the negative log-likelihood function is equivalent to the maximization of the likelihood function from the maximum likelihood optimization principle. Because the SGD method is utilized in the parameter learning scenario to update $W, b, c$, the tuning phase is carried out on a per-sample basis, that is, in a single-pass scenario. $T$ is thus set as 1. The first-order derivative in the SGD method is calculated with respect to the tied weight constraint $W' = W^{\top}$. Note that the parameter adjustment step is carried out under a dynamic network which commences with only a single input feature and grows its network structure on demand.
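One per-sample SGD update of a tied-weight DAE under the cross-entropy loss can be sketched as follows. The gradient of the shared weight matrix collects one term from the decoder and one from the encoder. The learning rate, the in-place list updates and the function name are illustrative choices rather than details given in the paper.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def dae_sgd_step(x, x_tilde, W, b, c, lr=0.1):
    """Apply one SGD update in place and return the loss measured before the update.

    x: clean input in [0, 1], x_tilde: corrupted input,
    W: n x R tied weight matrix, b: R hidden biases, c: n decoder biases.
    """
    n, R = len(x), len(b)
    y = [sigmoid(sum(x_tilde[i] * W[i][r] for i in range(n)) + b[r]) for r in range(R)]
    z = [sigmoid(sum(y[r] * W[i][r] for r in range(R)) + c[i]) for i in range(n)]
    loss = -sum(x[i] * math.log(z[i]) + (1 - x[i]) * math.log(1 - z[i]) for i in range(n))
    # Gradient at the decoder pre-activation (sigmoid combined with cross-entropy).
    dz = [z[i] - x[i] for i in range(n)]
    # Gradient at the encoder pre-activation, back-propagated through W.
    dy = [sum(dz[i] * W[i][r] for i in range(n)) * y[r] * (1 - y[r]) for r in range(R)]
    for i in range(n):
        for r in range(R):
            # Tied weights: decoder term plus encoder term.
            W[i][r] -= lr * (dz[i] * y[r] + x_tilde[i] * dy[r])
        c[i] -= lr * dz[i]
    for r in range(R):
        b[r] -= lr * dy[r]
    return loss
```

Calling the function once per incoming sample corresponds to the single-pass setting in which each sample is used for exactly one update.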

The notion of DEVDAN allows the model's structure to be self-organized in the generative phase while waiting for the operator to feed the true class labels. Furthermore, the concept of DAE discovers the salient structure of the input space by opening the manifold of the learning problem, and it expedites the parameters' convergence in the discriminative training phase. Although DEVDAN is realized in a single hidden layer architecture, it can be extended to a deep structure with ease by applying the greedy layer-wise learning process (?).

Table 1: Numerical results of the consolidated algorithms.

| Data sets | Performance | DEVDAN | pEnsemble | pEnsemble+ | AE | DAE |
|---|---|---|---|---|---|---|
| SUSY | CR | | | | | |
| | ET | K | K | K | K | |
| | HN | | | | | |
| | NoP | | | | | |
| HEPMASS | CR | | | | | |
| | ET | K | K | K | | |
| | HN | | | | | |
| | NoP | | | | | |
| RLCPS | CR | | | | | |
| | ET | K | K | K | K | K |
| | HN | | | | | |
| | NoP | | | | | |
| RFID localization | CR | | | | | |
| | ET | | | | | |
| | HN | | | | | |
| | NoP | | | | | |
| Electricity pricing | CR | | | | | |
| | ET | | | | | |
| | HN | | | | | |
| | NoP | | | | | |
| Weather | CR | | | | | |
| | ET | | | | | |
| | HN | | | | | |
| | NoP | | | | | |
| KDDCup 10% | CR | | | | | |
| | ET | | | | | |
| | HN | | | | | |
| | NoP | | | | | |
| SEA | CR | | | | | |
| | ET | | | | | |
| | HN | | | | | |
| | NoP | | | | | |
| Hyperplane | CR | | | | | |
| | ET | | | | | |
| | HN | | | | | |
| | NoP | | | | | |

CR: classification rate, ET: execution time, HN: hidden nodes, NoP: number of parameters

## Discriminative Training Phase

Once the true class labels are obtained, the 0-1 encoding scheme is applied to craft the target vector $Y \in \Re^{m}$ where $m$ is the number of target classes. That is, $Y_o = 1$ if and only if a data sample falls into the $o$-th class. A generative model is passed to the discriminative training phase and is extended with a softmax layer to infer the final classification decision as follows:

$$\hat{Y} = \mathrm{softmax}\big(s(x W + b) \, \Theta + \eta\big) \tag{23}$$

where $\Theta \in \Re^{R \times m}$ and $\eta \in \Re^{m}$ denote the output weight and bias of the discriminative network respectively, while the softmax layer outputs a probability distribution across the $m$ target classes.

The parameters $W, b, \Theta, \eta$ are further adjusted using the labelled data chunk via the SGD method with only a single epoch. The optimization problem is formulated as follows:

$$\{W, b, \Theta, \eta\} = \arg\min_{W, b, \Theta, \eta} \frac{1}{T} \sum_{t=1}^{T} L(Y_t, \hat{Y}_t) \tag{24}$$

where the loss function is akin to that of the generative training phase, the cross-entropy loss function. The adjustment process is executed in the one-pass learning fashion, leading to a per-sample adaptation process, $T = 1$.
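A per-sample update of the softmax layer can be sketched as below. For brevity the sketch adjusts only the output weights and bias on top of a fixed hidden activation, whereas the discriminative phase described above also adjusts the encoder parameters; the names `V`, `d` and `softmax_sgd_step` are hypothetical.

```python
import math


def softmax(a):
    """Numerically stable softmax over a list of scores."""
    m = max(a)
    e = [math.exp(v - m) for v in a]
    total = sum(e)
    return [v / total for v in e]


def one_hot(label, m):
    """0-1 encoding of a class index into a target vector of length m."""
    return [1.0 if o == label else 0.0 for o in range(m)]


def softmax_sgd_step(y, label, V, d, lr=0.1):
    """Update output weights V (R x m) and bias d (m) in place on one labelled sample.

    y is the hidden activation of the sample. Returns the class
    probabilities computed before the update.
    """
    m = len(d)
    scores = [sum(y[r] * V[r][o] for r in range(len(y))) + d[o] for o in range(m)]
    probs = softmax(scores)
    target = one_hot(label, m)
    for o in range(m):
        # Gradient of the cross-entropy loss with respect to the o-th score.
        delta = probs[o] - target[o]
        for r in range(len(y)):
            V[r][o] -= lr * y[r] * delta
        d[o] -= lr * delta
    return probs
```

Returning the probabilities computed before the update fits the test-then-train order: the prediction is recorded first and the label is used afterwards.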

The structural learning scenario also occurs in the discriminative learning phase, where the NS approach can be formulated with respect to the squared predictive error rather than the reconstruction error. A similar derivation can be applied here; the only difference lies in the output expression of the discriminative model, given by (23) instead of the encoding and decoding scheme shown in (1) and (2). Moreover, the hidden node growing and pruning conditions still refer to the same criteria (15) and (16). The pseudocode of DEVDAN's generative and discriminative phases is placed in the supplemental document.

## Proof of Concepts

The learning performance of DEVDAN is numerically validated using nine real-world and synthetic data stream problems: SEA, Hyperplane, SUSY, KDDCup, RLCPS, RFID localization, HEPMASS, Electricity Pricing and Weather. At least five of the nine problems have non-stationary properties, while the remaining four problems feature characteristics that are salient for examining the performance of data stream algorithms: big size, high input dimension, etc. We refer readers to the supplemental document for detailed characteristics of the nine datasets, including the number of time stamps applied in the prequential test-then-train procedure. The numerical results of DEVDAN are compared against those of the conventional AE and DAE, where the discriminative phase is adjusted using only a single training epoch to assure a fair comparison. The AE and DAE structures are initialized before the process runs. The comparison against the classic AE and DAE is shown to highlight to what extent DEVDAN outperforms its root, while DEVDAN is also compared against pENsemble (?) and pENsemble+ (?) - prominent data stream algorithms built upon an evolving ensemble classifier concept.

The learning performance of the consolidated algorithms is evaluated according to four criteria: classification rate, number of parameters, execution time and number of hidden units, while the prequential test-then-train procedure is followed as our evaluation protocol to simulate real data stream environments. The reported numerical results are averages across all time stamps and are given in Table 1. All consolidated algorithms are executed on the same computational platform under the MATLAB environment with an Intel(R) Xeon(R) CPU E5-1650 @3.20 GHz processor and 16 GB RAM. Because of the page limit, all figures pertaining to DEVDAN's learning performance and the source code of DEVDAN are placed in the supplemental documents. The source code of DEVDAN will be made publicly available once our paper is accepted.

### Numerical Results

It is reported in Table 1 that DEVDAN produces more accurate predictions than its counterparts in six problems: KDD Cup, SEA, Hyperplane, SUSY, RLCPS and HEPMASS. This fact confirms the efficacy of DEVDAN in coping with non-stationary learning environments because the Hyperplane, SEA and KDD Cup problems are well-known in the literature for their non-stationary properties. DEVDAN consistently outperforms both AE and DAE, which have a fixed structure, and is only slightly inferior to DAE in the RFID localization problem. DEVDAN also exhibits very competitive performance against the ensemble classifiers pENsemble and pENsemble+. It is worth noting that pENsemble and pENsemble+ incur much higher computational complexity than DEVDAN because they are crafted under the concept of a multi-model structure. This fact is substantiated by the execution times of pENsemble and pENsemble+, which are consistently slower than that of DEVDAN in almost all problems. The learning performance of DEVDAN is visualized in the supplemental document.

## Conclusion

This paper presents a novel denoising autoencoder (DAE), namely the deep evolving denoising autoencoder (DEVDAN). DEVDAN features a self-organizing property in both the generative and discriminative phases where input features can be incrementally constructed and discarded in a fully automated manner in the absence of a user-defined threshold. Our numerical study on nine popular data stream problems shows that DEVDAN delivers the most encouraging numerical results among the five benchmarked algorithms. Our numerical results demonstrate the advantage of DEVDAN's evolving structure, which adapts to the dynamic components of data streams. This fact also supports the relevance of the generative phase for online data streams, which contributes toward the refinement of the network structure in an unsupervised fashion. Nevertheless, it is admitted that DEVDAN is still crafted under a single hidden layer feedforward network. A deep version of DEVDAN will be the subject of our future investigation.

## References

- [Alvarez and Salzmann 2016] Alvarez, J. M., and Salzmann, M. 2016. Learning the number of neurons in deep networks. In Lee, D. D.; Sugiyama, M.; Luxburg, U. V.; Guyon, I.; and Garnett, R., eds., Advances in Neural Information Processing Systems 29. Curran Associates, Inc. 2270–2278.
- [Baldi, Sadowski, and Whiteson 2014] Baldi, P.; Sadowski, P. D.; and Whiteson, D. 2014. Searching for exotic particles in high-energy physics with deep learning. Nature communications 5:4308.
- [Bengio et al. 2006] Bengio, Y.; Lamblin, P.; Popovici, D.; and Larochelle, H. 2006. Greedy layer-wise training of deep networks. In Proceedings of the 19th International Conference on Neural Information Processing Systems, NIPS’06, 153–160. Cambridge, MA, USA: MIT Press.
- [Bengio, Courville, and Vincent 2013] Bengio, Y.; Courville, A.; and Vincent, P. 2013. Representation learning: A review and new perspectives. IEEE Trans. Pattern Anal. Mach. Intell. 35(8):1798–1828.
- [Bifet et al. 2010] Bifet, A.; Holmes, G.; Kirkby, R.; and Pfahringer, B. 2010. MOA: Massive online analysis. J. Mach. Learn. Res. 11:1601–1604.
- [Denil et al. 2013] Denil, M.; Shakibi, B.; Dinh, L.; Ranzato, M.; and de Freitas, N. 2013. Predicting parameters in deep learning. In Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 2, NIPS’13, 2148–2156. USA: Curran Associates Inc.
- [Ditzler and Polikar 2013] Ditzler, G., and Polikar, R. 2013. Incremental learning of concept drift from streaming imbalanced data. IEEE Trans. on Knowl. and Data Eng. 25(10):2283–2301.
- [Gama et al. 2014] Gama, J.; Žliobaitė, I.; Bifet, A.; Pechenizkiy, M.; and Bouchachia, A. 2014. A survey on concept drift adaptation. ACM Comput. Surv. 46(4):44:1–44:37.
- [Gama, Fernandes, and Rocha 2006] Gama, J.; Fernandes, R.; and Rocha, R. 2006. Decision trees for mining data streams. Intell. Data Anal. 10(1):23–45.
- [Gama 2010] Gama, J. 2010. Knowledge Discovery from Data Streams. Chapman & Hall/CRC, 1st edition.
- [Hinton and Salakhutdinov 2006] Hinton, G., and Salakhutdinov, R. 2006. Reducing the dimensionality of data with neural networks. Science 313(5786):504–507.
- [Hinton and Zemel 1993] Hinton, G. E., and Zemel, R. S. 1993. Autoencoders, minimum description length and Helmholtz free energy. In Proceedings of the 6th International Conference on Neural Information Processing Systems, NIPS’93, 3–10. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc.
- [Hinton, Vinyals, and Dean ] Hinton, G.; Vinyals, O.; and Dean, J. Distilling the Knowledge in a Neural Network. arXiv preprint arXiv:1503.02531.
- [Mohammadi et al. 2017] Mohammadi, M.; Al-Fuqaha, A. I.; Sorour, S.; and Guizani, M. 2017. Deep learning for IoT big data and streaming analytics: A survey. CoRR abs/1712.04301.
- [Murphy 2012] Murphy, K. P. 2012. Machine Learning: A Probabilistic Perspective. The MIT Press.
- [Platt 1991] Platt, J. 1991. A resource-allocating network for function interpolation. Neural Comput. 3(2):213–225.
- [Pratama et al. 2017] Pratama, M.; Dimla, E.; Lughofer, E.; Pedrycz, W.; and Tjahjowidodo, T. 2017. Online tool condition monitoring based on parsimonious ensemble+. CoRR abs/1711.01843.
- [Pratama, Pedrycz, and Lughofer 2018] Pratama, M.; Pedrycz, W.; and Lughofer, E. 2018. Evolving ensemble fuzzy classifier. IEEE Transactions on Fuzzy Systems 1–1.
- [Sahoo et al. 2017] Sahoo, D.; Pham, Q. D.; Lu, J.; and Hoi, S. C. 2017. Online deep learning: Learning deep neural networks on the fly. arXiv preprint arXiv:1711.03705.
- [Sariyar, Borg, and Pommerening 2011] Sariyar, M.; Borg, A.; and Pommerening, K. 2011. Controlling false match rates in record linkage using extreme value theory. Journal of Biomedical Informatics 44(4):648–654.
- [Stolfo et al. 2000] Stolfo, S. J.; Fan, W.; Lee, W.; Prodromidis, A.; and Chan, P. K. 2000. Cost-based modeling for fraud and intrusion detection: Results from the JAM project. In Proceedings of the 2000 DARPA Information Survivability Conference and Exposition, 130–144. IEEE Computer Press.
- [Street and Kim 2001] Street, W. N., and Kim, Y.-S. 2001. A streaming ensemble algorithm (SEA) for large-scale classification. In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’01, 377–382. New York, NY, USA: ACM.
- [Vincent et al. 2008] Vincent, P.; Larochelle, H.; Bengio, Y.; and Manzagol, P.-A. 2008. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th International Conference on Machine Learning, ICML ’08, 1096–1103. New York, NY, USA: ACM.
- [Wang and Li 2017] Wang, D., and Li, M. 2017. Stochastic configuration networks: Fundamentals and algorithms. IEEE transactions on cybernetics 47(10):3466–3479.
- [Yingwei, Sundararajan, and Saratchandran 1997] Yingwei, L.; Sundararajan, N.; and Saratchandran, P. 1997. A sequential learning scheme for function approximation using minimal radial basis function neural networks. Neural Comput. 9(2):461–478.
- [Yoon et al. 2018] Yoon, J.; Yang, E.; Lee, J.; and Hwang, S. J. 2018. Lifelong learning with dynamically expandable networks. ICLR.
- [Zhou, Sohn, and Lee 2012] Zhou, G.; Sohn, K.; and Lee, H. 2012. Online incremental feature learning with denoising autoencoders. Journal of Machine Learning Research 22:1453–1461.