Weakly Supervised Deep Learning Approach in Streaming Environments
This paper has been accepted for publication in The 2019 IEEE International Conference on Big Data (IEEE BigData 2019), Los Angeles, CA, USA.
The feasibility of existing data stream algorithms is often hindered by the weakly supervised condition of data streams. A self-evolving deep neural network, namely Parsimonious Network (ParsNet), is proposed as a solution to various weakly supervised data stream problems. A self-labelling strategy with hedge (SLASH) is proposed, whose auto-correction mechanism copes with the accumulation of mistakes that would otherwise significantly degrade the model's generalization. ParsNet is developed from a closed-loop configuration of self-evolving generative and discriminative training processes exploiting shared parameters, in which the structure flexibly grows and shrinks to overcome the issue of concept drift with or without labels. Numerical evaluation has been performed under two challenging problems, namely sporadic access to ground truth and infinitely delayed access to ground truth. Our numerical study shows the advantage of ParsNet by a substantial margin over its counterparts on high-dimensional data streams and under the infinite-delay simulation protocol. To support the reproducible research initiative, the source code of ParsNet along with supplementary materials is made available at https://bit.ly/2qNW7p4.
semi-supervised learning, deep learning, incremental learning, data streams, concept drifts
Although incremental learning (IL) of data streams is well-established in the literature, one major limitation of existing approaches lies in their fully supervised nature, in which the true class label has to become available immediately after a data point is received. In practice, some delay is expected in associating the target class with an incoming sample. In quality classification, for example, the only way to assess the product quality of a manufacturing process is through visual inspection, calling for frequent stoppage of the machining process. The rate of delay is not constant; it varies and is often infinite. This is exemplified by the tool condition monitoring problem, where samples of the worn case are often more difficult to obtain than those of the partially worn case. Damage to particular components of an aircraft engine often goes undetected because of the cost of fault analysis. These rationales outline the urgent need for weakly supervised data stream algorithms that handle a variety of weakly supervised learning situations regardless of the rate of delay and the class distribution. The major challenge of weakly supervised data streams is obvious from the fact that concept drift might occur at any time, with or without accompanying labels.
One approach to reducing the labelling cost of data streams makes use of the online active learning strategy, which actively queries target labels of uncertain samples for model updates [3, 4]. In , a confidence score is developed for sample selection. Although the active learning concept has been shown to reduce the labelling cost significantly , these approaches work under the same assumption that the true class label can be obtained immediately, regardless of the labelling cost. Another approach combines the popular semi-supervised hashing algorithm with the dynamic feature learning approach of an auto-encoder . In , a closed-loop configuration of generative and discriminative processes is proposed to deal with partially labelled data streams. These algorithms are not compatible with the infinite delay scenario, in which ground truth is accessible only during the warm-up phase.
Another possible scenario for weakly supervised data streams exists in the extreme latency problem: labelled samples are provided only in the initial phase, and true class labels never arrive afterward . This condition is more challenging to tackle than sporadic access to ground truth since it involves only a very small amount of labelled samples. COMPOSE is proposed as a solution to the infinite delay problem using a computational geometry approach, the α-shape. As pointed out in , COMPOSE suffers from high computational complexity. SCARGC is put forward in  and exploits a pool-based approach.  adopts a self-evolving micro-clustering approach where unlabelled samples are associated with the most similar existing clusters. These approaches are derived from non-deep-learning strategies and have limited advantage in dealing with high input dimensions.
A weakly supervised deep learning approach, namely Parsimonious Network (ParsNet), is proposed to overcome both sporadic access and infinitely delayed access to ground truth. The unique facet of ParsNet is seen in the combination of a self-evolving structure, which handles concept drift with or without labels, and the self-labelling with hedge (SLASH) method, which addresses the accumulation of mistakes in generating pseudo-labelled samples. The accumulation of mistakes is a major issue in the weakly supervised data stream problem because noisy pseudo labels feed misleading information about the ideal decision boundary, undermining the model's generalization. Note that ParsNet works in a one-scan fashion, making the accumulation of mistakes difficult to handle.
The SLASH functions as the weakly supervised learning policy in which enrichment of class labels is performed with an auto-correction mechanism (hedge). It automatically associates unlabelled samples with the most confident class, measured from the class posterior probabilities of ParsNet and the Autonomous Gaussian Mixture Model (AGMM). Nonetheless, the self-labelling process risks the accumulation of mistakes as a result of noisy pseudo labels. Our numerical study exhibits that a small quantity of noisy pseudo labels is sufficient to degrade the model's predictive performance. The hedge implements the auto-correction mechanism, automatically identifying noisy pseudo labels and preventing important parameters from being forgotten. This strategy pushes the network parameters toward their near-optimal points. In a nutshell, the hedge protects ParsNet from performance deterioration due to wrong pseudo labels while accepting clean pseudo labels. Augmentation of labelled samples is performed by injecting small noise into originally labelled samples, thereby performing consistency regularization .
Adjustment of the network width takes place in both the generative and discriminative phases, referring to the reconstruction error and the predictive error respectively, which makes identification of concept drifts with or without labels possible. The hidden node evolution is governed by an advancement of the network significance (NS) method  that relaxes the normal distribution assumption, adapts to concept drifts and facilitates direct addition of hidden nodes to arrive at the desirable network capacity quickly. This mechanism is underpinned by an autonomous Gaussian mixture model (AGMM) with a growing/pruning aptitude. That is, the approximation of the probability density function is obtained from dynamically evolved Gaussian components, where a new Gaussian component is added if a concept drift ensues.
The major contribution of this paper is summed up in four facets: 1) an online weakly supervised deep neural network, ParsNet, is proposed to handle the problem of weakly supervised data streams. ParsNet is applicable to both the label scarcity and extreme latency problems; 2) the SLASH is offered to deal with the scarcity of labelled samples via generation of pseudo labels with protection against noisy pseudo labels; 3) rapidly changing distributions of partially labelled data streams are handled by a closed-loop configuration of flexible generative and discriminative phases having shared network parameters. Hidden units are automatically added/pruned in both phases, thereby addressing the concept drift problem with or without target representation; 4) the advantage of ParsNet has been numerically validated in two challenging problems, sporadic access to ground truth and infinitely delayed access to ground truth. ParsNet demonstrates significant performance improvement over popular data stream algorithms in handling high-dimensional data streams, with over a 10% performance gap, while showing a similar pattern in the infinite delay case.
2 Problem Formulation
The weakly supervised data stream problem is defined as a learning problem over sequentially arriving data batches, where the number of data batches might not be bounded in practice. ParsNet adopts a single-pass learning mode without any epochs, where data samples of the current batch are directly discarded once learned. This scenario exemplifies ParsNet's feasibility in a strictly online learning procedure. Streaming data are captured without labels; each batch is characterized by its task size and input dimension. The true class label is solicited from particular labelling strategies. A weakly supervised data stream learner is motivated by the prohibitive labelling cost, or at least by the existence of some delay in obtaining the true class label. Note that the labelling delay varies in nature and each target class incurs a different labelling cost, thereby causing sporadic access to true class labels. Another typical characteristic is observed in the presence of concept drifts , which might occur at any time with or without target labels.
ParsNet is numerically validated under two simulation scenarios: sporadic access to ground truth and infinitely delayed access to ground truth. In the first case, labelled samples arrive sporadically without any assurance of a balanced class proportion. The target vector only covers a subset of the original data batch and has a much smaller size than the data batch itself. This hinders the application of active learning approaches since access to labelled samples beyond this subset is not provided. This problem reflects the different labelling costs of target classes and human factors such as boredom, fatigue, etc. The second case considers the more complex situation where a small amount of labelled samples is made available only in the warm-up period, while the remainder of the data batches suffers from the absence of target classes. Moreover, the selection of initially labelled samples is taken without any assurance of the class distribution. That is, the first batch of each problem is chosen as the initially labelled samples without changing the original order. Initially labelled samples are made available once, without being carried over to the next batches. ParsNet then only relies on augmented samples for model updates. The two procedures are simulated in the prequential test-then-train protocol.
3 ParsNet: Parsimonious Networks
A complete picture of the ParsNet learning policy is visualized in Fig. 1. The learning process is initiated with the creation of the AGMM, which navigates the structural evolution: estimation of the complex probability density function and the addition factor of hidden nodes. The AGMM features an open-structure principle addressing the concept drift issue, which renders the current estimate of the probability density function obsolete . Both ParsNet and the AGMM start their learning process from scratch without a predefined structure.
The generative phase takes place afterward and sets the initial condition of the discriminative phase. The generative phase also functions as an early alarm of changing learning conditions since adjustment of the network width takes place with respect to the reconstruction error in the absence of ground truth. The discriminative step further evolves the generative network and runs the same structural learning phase with access to the ground truth. In other words, both phases utilize shared parameters.
The weakly supervised learning mechanism of ParsNet is governed by the idea of SLASH, performing a self-labelling strategy with auto-correction. The salient feature of the SLASH lies in its solution to noisy pseudo labels, uncharted in the existing literature. Note that wrong pseudo labels are a major problem due to what we call the accumulation of mistakes. Because of the absence of ground truth, mislabelled samples perturb the contour of the decision boundary, affecting the classification decision. This problem is compounded in streaming environments because access to old samples is prohibited. The auto-correction mechanism of the SLASH prevents ParsNet from forgetting its important parameters and automatically forces its parameters back to their nearest optimum in the case of noisy pseudo labels. In addition, augmentation of originally labelled samples is performed to increase the target representation.
The generative network is framed under the denoising auto-encoder (DAE), making use of partially corrupted input features via masking noise. This step aims to extract robust features and to perform implicit regularization. As pointed out in , the noise-injecting mechanism in the DAE supports faster convergence than the use of explicit regularization if a suitable noise level is selected. The generative network is expressed mathematically as follows:

h = s(W x̃ + b),  x̂ = s(W' h + c)

where x̃ denotes the partially destroyed input features, obtained by setting a subset of input features to blank. W, b are the connective weight and bias of the encoder, while W', c are the connective weight and bias of the decoder. The tied-weight constraint is applied here and leads to W' = Wᵀ. The encoder parameters W, b are shared with the discriminative network, and a softmax layer is inserted outputting the class posterior probability:
where W_s, b_s are respectively the connective weight and bias of the softmax layer, and the output is the vector of posterior probabilities over the target classes.
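As a concrete illustration, the tied-weight forward pass above can be sketched in a few lines of NumPy. This is a minimal sketch: the symbol names (W, b, c, Ws, bs), the choice of the sigmoid for s(·) and the masking probability are generic assumptions rather than the paper's exact configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mask_noise(x, p, rng):
    """Partially destroy the input by blanking a random subset of features."""
    keep = rng.random(x.shape) > p          # each feature survives with prob 1-p
    return x * keep

def dae_forward(x_tilde, W, b, c):
    """Encoder h = s(W x_tilde + b); tied-weight decoder x_hat = s(W.T h + c)."""
    h = sigmoid(W @ x_tilde + b)
    x_hat = sigmoid(W.T @ h + c)            # tied weights: decoder reuses W.T
    return h, x_hat

def softmax_head(h, Ws, bs):
    """Shared encoder features feed a softmax layer in the discriminative phase."""
    z = Ws @ h + bs
    z = z - z.max()                         # numerical stability
    e = np.exp(z)
    return e / e.sum()
```

Because the encoder parameters are shared, any update in the generative phase immediately shifts the features seen by the softmax head, which is what closes the loop between the two phases.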
3.1 Network Evolution of ParsNet
Growing and pruning processes occur in the generative and discriminative phases using an extension of the NS method.
Adjustment of network width.
Note that the generative and discriminative learning phases differ only in the target variable. The reconstruction error is utilized in the generative phase, whereas the empirical (predictive) error is utilized in the discriminative phase. (4) can be expanded as follows:
where the two terms denote the network significance of the discriminative and generative phases, respectively. The NS method provides a statistical approximation of the network's generalization power via the bias and variance decomposition to signify high-bias and high-variance situations .
The AGMM is proposed here to approximate the complex probability density function and distinguishes itself from the original NS method , which is derived from the unrealistic assumption of a static Gaussian distribution. The AGMM features a fully single-pass characteristic and an open structure with growing and pruning of Gaussian components; a new component can be added if concept drift is present. The use of the AGMM leads to the estimation of the statistical contribution of hidden nodes as follows:
where each mixture component is a Gaussian with its own mixing coefficient, center and width. The integral can be solved independently for each Gaussian component by following the finding of , where the sigmoid function can be approximated by the probit function and the integral of a probit function is another probit function. The final expression is derived as follows:
where the mixing coefficients meet the partition-of-unity property. This approach not only summarizes the distributional characteristics of already seen samples but also portrays the network contribution as navigated by the AGMM.
where the left-hand side combines the empirical mean and standard deviation of the network bias, and the right-hand side involves the minimum bias recorded up to the current observation. The empirical mean and standard deviation are reset once (9) is satisfied, whereas the recorded minima are not reset, because our numerical study exhibits performance degradation otherwise. That is, the estimation of the network's bias is reliable when considering all observations. (9) is motivated by the sigma rule , where the confidence level depends on the choice of the confidence factor. To correct this shortcoming, the factor is set to be dynamic, lowering the confidence bar in a high-bias situation so that hidden nodes are added more readily when a drift occurs. Note that (9) has an inherent drift detection capability because it is formulated from the statistical process control method .
If condition (9) is satisfied, a number of hidden nodes equal to the number of Gaussian components is added simultaneously and initialized using Xavier initialization. This approach is one step ahead of existing approaches adopting one-by-one addition of hidden nodes, which is considered too slow in adapting to concept drifts, notably if a network structure has to be constructed from scratch . This strategy is supported by the nature of the AGMM, which explores the true complexity of the data distribution in an online manner.
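The growing test of condition (9) can be sketched as a small online detector. This sketch uses Welford's algorithm for the running bias statistics and an assumed illustrative form for the dynamic confidence factor; the paper's exact formula may differ.

```python
import math

class GrowthDetector:
    """Sketch of growing condition (9): hidden nodes are added when the
    running mean + std of the network bias drifts k standard deviations
    above the recorded minimum. On growth the running statistics are reset
    while the minima are kept, as the text describes. The dynamic form of
    k below is an assumed illustration, not the paper's exact formula."""

    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0
        self.min_mean, self.min_std = math.inf, 0.0

    def update(self, bias):
        # Welford's online estimate of the bias mean and std
        self.n += 1
        d = bias - self.mean
        self.mean += d / self.n
        self.m2 += d * (bias - self.mean)
        std = math.sqrt(self.m2 / self.n)
        # record the minimum bias (and its std) over ALL observations
        if self.mean < self.min_mean:
            self.min_mean, self.min_std = self.mean, std
        # dynamic confidence factor: a high bias lowers the bar (assumed form)
        k = 1.3 * math.exp(-bias ** 2) + 0.7
        grow = self.n > 1 and self.mean + std > self.min_mean + k * self.min_std
        if grow:
            self.n, self.mean, self.m2 = 0, 0.0, 0.0   # reset stats, keep minima
        return grow
```

A steadily improving bias never triggers the test, while a sudden bias jump (a drift) does, which is exactly the statistical-process-control behaviour the text describes.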
ParsNet also monitors the network's variance, where a complexity reduction procedure is carried out to cope with the high-variance situation implying an over-complex network structure prone to over-fitting. The variance can be obtained from (5) and (6), where the first term is assumed to hold the i.i.d. condition. The high-variance condition is derived as follows:
where a reset scenario is applied to the running statistics if (10) is met. It is observed that (10) is similar to (9) except for the presence of an extra term meant to avoid the direct-pruning-after-adding situation and the use of a variance factor. Once the high-variance condition is met, the statistical contribution of each hidden node is examined as per (8). Hidden nodes are deemed inactive and thus pruned without loss of generalization if their statistical contribution falls into the following condition:
where the threshold is set from the mean and standard deviation of the hidden nodes' statistical contributions. Note that this condition opens the possibility for more than one hidden node to be pruned at once. This trait is important to induce a sufficient reduction of the network variance.
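A minimal sketch of condition (11) follows; the half-standard-deviation threshold k=0.5 is an assumed default, not the paper's stated value.

```python
import statistics

def prune_mask(contributions, k=0.5):
    """Sketch of pruning condition (11): flag hidden nodes whose statistical
    contribution falls below mean - k*std of all contributions. The factor
    k=0.5 is an assumed default. Several nodes can be flagged at once, which
    matches the multi-node pruning behaviour noted in the text."""
    mu = statistics.fmean(contributions)
    sd = statistics.pstdev(contributions)
    return [c < mu - k * sd for c in contributions]
```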
Autonomous Gaussian Mixture Model.
The AGMM is utilized here to determine the probability density function, thereby relaxing the normal distribution assumption. Unlike a conventional GMM, the AGMM features an open structure and operates in a one-pass mode. That is, AGMM parameters are not only tuned in an online fashion, but growing and pruning scenarios are also carried out to track any variation of the data streams.
The growing process is governed by a Gaussian clustering method where a new component is incorporated to handle an uncovered input space region as follows:
This growing condition enumerates the spatial proximity of the most adjacent component to the incoming sample, where a distance operator is utilized to deal with unstructured problems (images, texts, etc.) possessing a high input dimension. Furthermore, no gradient needs to be calculated, making the use of this operator feasible. The threshold on the right-hand side is designed under the assumption that the majority of data points lies within the component's span; beyond that, a sample is considered abnormal. A scaling term is inserted to take into account the effect of the input dimension. Moreover, a dynamic factor enables the confidence level of the sigma rule to be adjusted with the level of network bias, thereby relaxing the overly strict normal distribution case.
The use of (12) alone in the growing process often leads to an explosion of Gaussian components due to low-variance directions of the data distribution. A vigilance test is therefore incorporated to control the number of AGMM components using the concept of available space. A new Gaussian component is added only if the most adjacent component does not have any available space to expand its size, that is, if tuning the winning component would incur significant overlap with other components. The concept of available space is realized by comparing the coverage span of the winning component to all components. This strategy is also meant to prevent overly general Gaussian components, which lose their relevance to portray particular concepts. The vigilance test is formalized as follows:
where the vigilance test parameter controls the decision. In this work, we adopt an adaptive vigilance test parameter adapting to the overlapping degree of the winning Gaussian component with other components. It is designed as the portion of all components located outside the winning component per input dimension, as in (14). Failing the test supports the addition of a new component, whereas passing it allows the winning Gaussian to expand its size. Finally, the growing conditions (12) and (13) are combined with an AND operator to determine the introduction of a new Gaussian component. The new component is crafted by simply setting the sample of interest as its centre, while the spread is assigned a small constant value.
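The distance part of the growing test, condition (12), can be sketched as below. The Euclidean distance operator, the 2-sigma factor and the square-root scaling by input dimension are illustrative assumptions; in the paper this test is ANDed with the vigilance test (13) before a new component is actually created.

```python
import math

def nearest_component(x, centers):
    """Index of the most adjacent Gaussian component (Euclidean distance)."""
    return min(range(len(centers)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(x, centers[i])))

def grows_new_component(x, centers, widths, k=2.0):
    """Sketch of growing condition (12): propose a new Gaussian when the
    incoming sample lies beyond k widths of its nearest component, with the
    threshold scaled by sqrt(input dimension). k=2 reflects the assumption
    in the text that the majority of data points lies within the span and
    anything beyond is abnormal; the exact scaling is an assumption."""
    i = nearest_component(x, centers)
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, centers[i])))
    return dist > k * widths[i] * math.sqrt(len(x))
```

No gradient is involved in this test, which is why a simple distance operator remains feasible for high-dimensional unstructured inputs.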
The tuning scenario is carried out if the growing condition, (12) AND (13), is violated. This step is meant to attain a fine-grained coverage of the input space, where the winning component having the closest proximity to an incoming sample  is adjusted as follows:
where the adaptation rate depends on the support, or population, of the component. An incoming sample is associated with the winning component while increasing its support. This step assures the convergence of the tuning scenario as the component's population increases. The mixing coefficient is obtained through the posterior probability of the component, satisfying the partition of unity as follows :
where the likelihood function is the Gaussian density, while the prior probability is set as the relative support of the component.
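The winner-update and the support-based prior can be sketched as follows. The 1/support adaptation rate gives the convergence property noted above; the width-update form is an illustrative assumption.

```python
import math

def tune_winner(x, center, width, support):
    """Sketch of the winning-component update: the adaptation rate 1/support
    decays as the population grows, which guarantees convergence of the
    tuning scenario. The width update form is an illustrative assumption."""
    support += 1
    rate = 1.0 / support
    center = [c + rate * (xi - c) for c, xi in zip(center, x)]
    width = [math.sqrt((1 - rate) * w ** 2 + rate * (xi - c) ** 2)
             for w, c, xi in zip(width, center, x)]
    return center, width, support

def mixing_coefficients(supports):
    """Relative supports as priors, satisfying the partition of unity."""
    total = sum(supports)
    return [s / total for s in supports]
```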
The pruning procedure is implemented in the AGMM to get rid of inconsequential Gaussian components which play little role during their lifespans or lose their relevance due to rapidly changing environments. Given the activation degree of a component, its relevance is introduced as the average of its activity over its lifespan [12, 13]:
where the lifespan denotes the time period of a Gaussian component since it was inserted. The component pruning procedure is controlled by:
(18) follows the half-sigma rule. Furthermore, no component is pruned during a certain evaluation period, making sure a new cluster is given the opportunity to develop its shape. The self-evolving property of the AGMM enables its component count to be used in the growing process of ParsNet since it explores the complexity of the data distribution.
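Rules (17)-(18) can be sketched together; the grace-period length is an assumed default.

```python
def prune_components(activity_sums, ages, grace=10):
    """Sketch of the relevance-based pruning rules (17)-(18): relevance is
    the average activity over a component's lifespan; a component is removed
    when its relevance falls half a standard deviation below the mean of all
    relevances (the 'half sigma rule'), unless it is still inside a grace
    period that lets a young cluster develop its shape."""
    rel = [a / max(t, 1) for a, t in zip(activity_sums, ages)]
    mu = sum(rel) / len(rel)
    sd = (sum((r - mu) ** 2 for r in rel) / len(rel)) ** 0.5
    return [r < mu - 0.5 * sd and t > grace for r, t in zip(rel, ages)]
```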
The Self-Labelling Strategy with Hedge (SLASH) governs the parameter learning strategy formally expressed as a joint optimization problem as follows:
where the four terms are, respectively, the loss function of the generative phase, the loss function of the true class labels, the loss function of the augmented class labels, and the loss function of the pseudo class labels. The augmented labels are variations of originally labelled samples via the introduction of external perturbation, while the pseudo-labelled data points are generated from the self-labelling process. Furthermore, the self-labelling process is applied solely to unlabelled samples.
The last term functions as the hedge, performing auto-correction against noisy pseudo labels, and adopts the Synaptic Intelligence (SI) concept , originally proposed to overcome the catastrophic forgetting problem in the continual learning domain. We extend this concept as a hedge mechanism correcting noisy pseudo labels. A regularization variable controls how much information is accepted from pseudo labels, while a parameter importance indicator memorizes influential network parameters before receiving the pseudo-label-induced model update. The hedge affects all network parameters, pulling them back toward the optimal network parameters induced by the last true class label. Note that the originally labelled samples, the augmented samples and the pseudo-labelled samples are mixed, thereby making it possible for the hedging strategy to be seamlessly executed. Note also that structural evolution takes place only for the first two loss terms, because augmented samples are simply variations of labelled samples, while pseudo-labelled samples risk injecting noisy information leading to incorrect estimation of the network bias and variance. Since (19) is formulated as an unconstrained optimization problem, SGD with no epochs is carried out alternately over the terms. This setting is meant to demonstrate the feasibility of ParsNet under the most challenging condition of data stream processing.
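The hedge term of (19) can be sketched as an SI-style quadratic penalty; the argument names and the exact composition of the total loss here are illustrative assumptions.

```python
import numpy as np

def hedged_loss(pseudo_loss, theta, theta_star, omega, lam):
    """Sketch of the hedge term in (19): a Synaptic-Intelligence-style
    quadratic penalty pulls the parameters theta back toward theta_star,
    the optimum induced by the last true labels, weighted per parameter by
    the importance indicator omega and globally by the regularisation
    factor lam derived from the reconstruction error."""
    penalty = float(np.sum(omega * (theta - theta_star) ** 2))
    return pseudo_loss + lam * penalty
```

With lam close to one and a large omega, a pseudo-label update that drags parameters far from theta_star becomes expensive, which is precisely how noisy pseudo labels are discounted.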
The self-labelling process is carried out by examining the confidence degrees of ParsNet and the AGMM, in which a label is propagated only if both show confident and consistent predictions:
where the output posterior probability of the AGMM is determined from the class posterior probability weighted by the class frequency counts . The output posterior probability of ParsNet is normalized over its two most dominant output classes . This normalization renders the class-invariant output of the softmax layer, behaving similarly to the binary classification case where an uncertain prediction is indicated by an output posterior probability close to 0.5. This case is often associated with data samples falling near the decision boundary. Two confidence thresholds gate the decision. Note that a pseudo label is assigned only if both the AGMM and ParsNet output the same class label.
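The agreement-and-confidence test can be sketched as follows; the threshold values are assumed defaults rather than the paper's tuned settings.

```python
def pseudo_label(p_net, p_gmm, tau_net=0.75, tau_gmm=0.75):
    """Sketch of the SLASH label-propagation test: a pseudo label is issued
    only when ParsNet and the AGMM predict the same class and both exceed
    their confidence thresholds. ParsNet's softmax output is renormalised
    over its two most dominant classes, so a value near 0.5 marks an
    uncertain, near-boundary sample as in the binary case. The tau_*
    thresholds are assumed defaults."""
    c_net = max(range(len(p_net)), key=p_net.__getitem__)
    c_gmm = max(range(len(p_gmm)), key=p_gmm.__getitem__)
    top1, top2 = sorted(p_net, reverse=True)[:2]
    conf_net = top1 / (top1 + top2)          # normalised two-class posterior
    if c_net == c_gmm and conf_net >= tau_net and p_gmm[c_gmm] >= tau_gmm:
        return c_net
    return None                              # sample stays unlabelled
```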
Learning with Hedge.
The notion of a hedge is designed to prevent the performance drop due to noisy pseudo labels produced by the self-annotation process. It brings parameters that deviate due to a wrong pseudo label back to their closest optimum values, guided by the parameter importance indicator. This mechanism is triggered before learning pseudo-labelled samples and controls the influence of pseudo-labelled samples on model updates. In other words, the regularization approach protects ParsNet against noisy pseudo labels by allowing the parameters to move from their optimal values only if a reliable pseudo label is fed by the self-labelling process. We utilize the reconstruction error to determine the regularization factor, where a score scales the reconstruction error to the unit range. This strategy implies a proportional reduction of the model update as the reconstruction error increases; that is, a noisy pseudo label distracts the direction of the gradients, directly influencing the network's reconstruction power. As with the SI method , the parameter importance indicator plays an important role in steering ParsNet's parameters to their closest optimum since it memorizes important network parameters. It is defined:
Since the hedge functions as compensation for the self-labelling mechanism, the importance indicator is updated only when observing originally labelled and augmented samples, to adjust the optimal parameters, while being frozen when observing pseudo labels. Updating it on augmented samples aims to further reinforce the memorization process since, in this case, ParsNet has access to the ground truth. The indicator accumulates the total parameter movement over the originally labelled and augmented samples received thus far, combining the parameter movement between two samples with the gradient, and a small constant avoids division by zero. The regularization via the hedge adjusts network parameters proportionally to the change of the network parameters and the network gradient. In practice, normalization is applied to the gradients as an effort to avoid exploding gradients. The underlying difference from the original version  lies in the use of the regularization as a watchdog of the pseudo-labelled samples, where the transition between old and new parameters is unclear.
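The importance bookkeeping can be sketched in the SI style; the class and argument names, and the eps constant, are illustrative assumptions.

```python
import numpy as np

class SynapticImportance:
    """Sketch of the SI-style importance indicator behind the hedge: while
    learning truly labelled or augmented samples, w accumulates each
    parameter's contribution (-grad * step) to the loss decrease; it stays
    frozen for pseudo-labelled samples. At consolidation, omega divides w
    by the squared total movement plus a small constant (division-by-zero
    guard). Names and eps are illustrative assumptions."""

    def __init__(self, n_params, eps=1e-3):
        self.w = np.zeros(n_params)
        self.eps = eps

    def accumulate(self, grad, step):
        self.w += -grad * step               # path integral of the update

    def consolidate(self, theta, theta_star):
        omega = self.w / ((theta - theta_star) ** 2 + self.eps)
        self.w[:] = 0.0                      # open a fresh accumulation window
        return omega
```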
Generation of Augmented Samples.
A label enrichment scenario is also performed by augmentation of originally labelled samples, also known as the consistency regularization approach. Augmented samples are formed as variations of labelled samples produced by a noise-injecting mechanism, where random Gaussian noise with zero mean is used to produce a corrupted version of the labelled samples in non-image problems, while image problems apply a corruption of the image itself. Since augmented points are sampled from originally labelled samples, they are not subject to the SLASH mechanism; rather, the SLASH parameters are adjusted here. This mechanism is crucial in the infinite delay case because labelled samples can only be accessed during the warm-up period. ParsNet only relies on augmented samples and pseudo-labelled samples in the infinite delay problem.
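For non-image data the augmentation step above amounts to a few lines; sigma and the number of copies are assumed defaults.

```python
import numpy as np

def augment(x, y, n_copies=3, sigma=0.01, rng=None):
    """Sketch of the consistency-regularisation augmentation for non-image
    data: each labelled sample is replicated with small zero-mean Gaussian
    noise while keeping its label (image problems would corrupt the image
    itself instead). sigma and n_copies are assumed defaults."""
    rng = rng or np.random.default_rng()
    xs = [x + rng.normal(0.0, sigma, size=x.shape) for _ in range(n_copies)]
    return xs, [y] * n_copies
```

Because the noise is small, the perturbed copies stay inside the same class density, which is why the labels can be carried over unchanged.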
In this section, the effectiveness of ParsNet is examined in two challenging cases: sporadic access to ground truth and infinitely delayed access to ground truth. An ablation study is conducted to analyze each learning component.
4.1 Sporadic Access to Ground Truth
Sporadic access to ground truth arranges a simulation environment where continuously arriving data batches comprise partially labelled samples. Our numerical study considers two proportions of labelled samples, 25% and 50%, in each data batch, where the target classes are randomly distributed without any treatment of the class distributions. Six prominent data stream problems are put forward: Rotated MNIST, Permuted MNIST , weather , SEA , Hyperplane  and RFID localization (courtesy of Dr. Huang Sheng, Singapore). All of them except the RFID problem exhibit non-stationary properties; the RFID problem, however, presents a multi-class problem. The properties of the six problems are outlined in Table 1.
Table 1 reports, per dataset, the input attributes (IA), classes (C), data points (DP), number of tasks and class proportion (%).
Baseline and Parameter Setting.
Five recently published algorithms, namely ADL , DEVDAN , OMB , Learn++ (L++)  and Learn++NSE (LNSE) , are considered as baselines. We execute each algorithm in the same simulation protocol and with the same computational resources using their published codes to ensure a fair comparison. Their original hyper-parameter settings are retained; if their performance is surprisingly poor, they are tuned and their best results are reported here. The predefined parameters of ParsNet, including the learning rates of the generative and discriminative phases, are fixed for all numerical studies in this paper and are obtained simply from minor hand-tuning, to show the non-ad-hoc nature of ParsNet. The source code of ParsNet along with all datasets and a supplemental document is offered to support the reproducible research initiative; the link is given in the abstract of this paper. Numerical results of other algorithms are produced using 50% labelled samples.
The prequential test-then-train protocol is followed here as per the guideline of . A model is forced to predict all samples of each incoming batch before utilizing them for model updates. Reported figures are obtained from the average of the numerical results per data batch. The numerical results of ParsNet are taken from the average of five independent runs to arrive at a conclusive learning performance. Numerical results are statistically validated using the Wilcoxon signed-rank test, where a marker indicates statistically significant differences.
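The evaluation loop described above can be sketched as follows; the predict/update interface on `model` is an illustrative assumption.

```python
def prequential(batches, model):
    """Sketch of the prequential test-then-train protocol: every incoming
    batch is first predicted in full, then handed to the model for its
    update, and the reported figure averages the per-batch accuracies.
    `model` is any object exposing predict(xs) and update(xs, ys); this
    interface is an illustrative assumption."""
    accs = []
    for xs, ys in batches:
        preds = model.predict(xs)                         # test first...
        accs.append(sum(p == y for p, y in zip(preds, ys)) / len(ys))
        model.update(xs, ys)                              # ...then train
    return sum(accs) / len(accs)
```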
The advantage of ParsNet is shown in Table 2, where ParsNet outperforms the other algorithms in four of six datasets. Substantial performance improvements are attained by ParsNet in the Rotated MNIST and Permuted MNIST problems, with over a 10% margin. Note that the numerical results of Learn++ and Learn++NSE cannot be produced for these two problems because they do not scale well to high-dimensional data streams. This result signifies the efficacy of the SLASH, helping to achieve a major performance improvement in complex problems having a high input dimension. The self-labelling mechanism of SLASH enriches the labelled samples, while the hedging mechanism prevents the accumulation of mistakes. The hedge rejects noisy pseudo labels by preventing the network parameters from being distracted too far from their previous optimal values.
CR: classification rate, TrT: training time, HN: hidden nodes, HL: hidden layers, PS: number of pseudo labels. The significance marker indicates that the numerical results of ParsNet and the other methods are significantly different.
This observation also has a strong correlation with the nature of the image classification problem, where the noise-injecting mechanism in the data augmentation process is capable of increasing the number of labelled samples without suffering from the consistency issue. In short, the noise-injecting mechanism does not affect the class densities. ParsNet also produces better accuracy in the Hyperplane and SEA problems, with statistically significant differences. Although ParsNet is behind Learn++ in the weather problem, this is mainly attributed to the NN's characteristic of not coping well with high uncertainties; a similar situation is also observed for both DEVDAN and ADL. ParsNet is also less accurate than Learn++NSE in the RFID problem but still better than the other algorithms. In terms of execution time, ParsNet incurs a moderate increase of computational overhead over DEVDAN and ADL but remains much faster than Learn++NSE and Learn++. This observation is not surprising due to the additional training steps of ParsNet compared to ADL and DEVDAN. Another important finding is shown in Table 3, reporting the precision and recall of ParsNet produced using 50% labelled samples. Precision and recall exhibit a relatively small gap, meaning that ParsNet's classification decision is not biased toward one of the classes.
P: Precision. R: Recall.
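The bias check behind Table 3 can be reproduced with standard per-class precision and recall, sketched here from first principles; the function name and interface are our own.

```python
def precision_recall(y_true, y_pred, n_classes):
    """Per-class precision and recall; a small gap between the two
    suggests the classifier is not biased toward any single class."""
    prec, rec = [], []
    for c in range(n_classes):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec.append(tp / (tp + fp) if tp + fp else 0.0)
        rec.append(tp / (tp + fn) if tp + fn else 0.0)
    return prec, rec
```

A classifier biased toward one class shows high recall but low precision on that class, and the opposite pattern on the neglected ones; balanced values across classes indicate the behavior reported for ParsNet.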
4.2 Infinitely Delayed Access to Ground Truth
The second numerical study follows the infinite delay protocol, where true class labels are provided only during the warm-up phase while the remainder of the stream arrives without labels. It uses the same datasets and procedure as the first numerical study, except that only the first data batch is used for the warm-up phase, without changing the original data order, leading to a very small number of labelled samples. The ratio of labelled samples to all samples, as well as the class proportions, are provided in Table 4.
Table 4 reports, for each dataset (R. MNIST, P. MNIST, Weather, SEA, Hyperplane, RFID), the number of all samples, the number of labelled samples, and the class proportion (%). NS: number of all samples. NLS: number of labelled samples.
SCARGC  is put forward as the baseline because it is specifically designed for the infinite delay problem and is considered a state-of-the-art algorithm; it has been reported to outperform other algorithms such as COMPOSE . Two versions of SCARGC, built upon SVM (S-SVM) and 1NN (S-1NN), are used here; their results are obtained using the published codes under the same computational environment as ParsNet. In addition, DEVDAN is included as a baseline because its generative phase enables it to handle the infinite delay situation. Table 4 reports the numerical results.
Our numerical results in Table 4 confirm the advantage of ParsNet in the infinite delay scenario: ParsNet beats the three baselines with statistically significant performance differences in all cases. It is worth mentioning that the infinite delay problem is more challenging than sporadic access to ground truth due to the very low proportion of labelled samples; both ParsNet and DEVDAN suffer a performance drop compared to the first case.
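The infinite-delay evaluation described above can be sketched as a test-then-train loop in which labels are visible only during warm-up. The model interface below (`update_supervised`, `update_unsupervised`, `score`) is a hypothetical stand-in, not ParsNet's actual API.

```python
def infinite_delay_protocol(batches, model, warm_up=1):
    """Prequential evaluation under infinite label delay: labels are
    visible only for the first `warm_up` batches; afterwards the model
    is tested first, then updated without labels."""
    accuracies = []
    for t, (x, y) in enumerate(batches):
        if t < warm_up:
            model.update_supervised(x, y)         # warm-up phase only
        else:
            accuracies.append(model.score(x, y))  # test step (labels used
            model.update_unsupervised(x)          # for scoring only)
    return accuracies
```

Note that the true labels of post-warm-up batches are used solely to measure accuracy; the model never trains on them, which is what makes this setting harder than sporadic access.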
4.3 Ablation Study
The ablation study covers three configurations: (A) without AGMM, (B) without network evolution, and (C) without SLASH. It is carried out under sporadic access to ground truth, and the learning performance is studied under a fixed proportion of labelled samples using the rotated MNIST problem. Numerical results are reported in Table 5. Configuration (A) shows that deactivating AGMM significantly compromises training performance because the network becomes too slow to reach the desired capacity and to adapt to concept drift. Poor performance is likewise observed in configuration (B) when the structural learning mechanism is switched off. The absence of SLASH in configuration (C) contributes around a 3% degradation in accuracy. The concept of a self-evolving structure has been studied intensively in the context of fuzzy and RBF networks .
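The three ablated configurations can be expressed as simple feature switches over a full model; the config class and field names below are hypothetical, chosen only to mirror components (A), (B), and (C).

```python
from dataclasses import dataclass

@dataclass
class ParsNetConfig:
    """Hypothetical switches mirroring the three ablated components."""
    use_agmm: bool = True          # (A) AGMM density estimation
    evolve_structure: bool = True  # (B) node growing and pruning
    use_slash: bool = True         # (C) self-labelling with hedge

ABLATIONS = {
    "full": ParsNetConfig(),
    "A": ParsNetConfig(use_agmm=False),
    "B": ParsNetConfig(evolve_structure=False),
    "C": ParsNetConfig(use_slash=False),
}
```

Each ablation disables exactly one component while leaving the others active, so the observed accuracy drop can be attributed to that component alone.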
A weakly-supervised deep learning algorithm, ParsNet, has been proposed for handling various weakly supervised data stream problems. Our numerical study under a scarcity of labelled samples as well as extreme label latency demonstrates the efficacy of ParsNet: major performance improvements of over 10% are attained in handling high-dimensional data streams under minimal supervision and in dealing with the infinite delay scenario.
-  Joao Gama. Knowledge Discovery from Data Streams. Chapman & Hall/CRC, 1st edition, 2010.
-  Vinícius M. A. de Souza, Diego Furtado Silva, João Gama, and Gustavo E. A. P. A. Batista. Data stream classification guided by clustering on nonstationary environments and extreme verification latency. In SIAM SDM, 2015.
-  Ahsanul Haque, Latifur Khan, and Michael Baron. SAND: semi-supervised adaptive novel class detection and classification over data stream. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, 2016.
-  B. Krawczyk, B. Pfahringer, and M. Wozniak. Combining active learning with concept drift detection for data stream mining. In IEEE International Conference on Big Data, Big Data 2018, Seattle, WA, USA, December 10-13, 2018, pages 2239–2244, 2018.
-  Yanchao Li, Yongli Wang, Qi Liu, Cheng Bi, Xiaohui Jiang, and Shurong Sun. Incremental semi-supervised learning on streaming data. Pattern Recognition, 2019.
-  Mahardhika Pratama, Andri Ashfahani, Edwin Lughofer, and Yew-Soon Ong. DEVDAN: Deep evolving denoising autoencoder. Neurocomputing, 2019.
-  Vinícius M. A. de Souza, Diego Furtado Silva, Gustavo E. A. P. A. Batista, and João Gama. Classification of evolving data streams with infinitely delayed labels. In 14th IEEE ICMLA, 2015.
-  David Berthelot, Nicholas Carlini, Ian J. Goodfellow, Nicolas Papernot, Avital Oliver, and Colin Raffel. MixMatch: A holistic approach to semi-supervised learning. CoRR, abs/1905.02249, 2019.
-  Andri Ashfahani and Mahardhika Pratama. Autonomous deep learning: Continual learning approach for dynamic environments. In SIAM SDM, 2019.
-  Arnu Pretorius, Steve Kroon, and Herman Kamper. Learning dynamics of linear denoising autoencoders. In Proceedings of the 35th ICML, 2018.
-  Kevin P. Murphy. Machine Learning: A Probabilistic Perspective. The MIT Press, 2012.
-  Richard Jayadi Oentaryo, Meng Joo Er, Linn San, and Xiang Li. Online probabilistic learning for fuzzy inference system. Expert Syst. Appl., 2014.
-  P. Angelov and X. Zhou. Evolving fuzzy systems from data streams in real-time. In 2006 International Symposium on Evolving Fuzzy Systems, pages 29–35, Sep. 2006.
-  Friedemann Zenke, Ben Poole, and Surya Ganguli. Continual learning through synaptic intelligence. In Proceedings of the 34th ICML, 2017.
-  B. Vigdor and B. Lerner. The Bayesian ARTMAP. IEEE Transactions on Neural Networks, 18(6):1628–1644, November 2007.
-  Edwin Lughofer. Single-pass active learning with conflict and ignorance. Evolving Systems, 3(4):251–271, 2012.
-  Rupesh K Srivastava, Jonathan Masci, Sohrob Kazerounian, Faustino Gomez, and Jürgen Schmidhuber. Compete to compute. In NIPS 26. 2013.
-  R. Elwell and R. Polikar. Incremental learning of concept drift in nonstationary environments. Trans. Neur. Netw., October 2011.
-  W. N. Street and Y-S Kim. A streaming ensemble algorithm (SEA) for large-scale classification. In Proceedings of the Seventh ACM SIGKDD, KDD ’01. ACM, 2001.
-  A. Bifet, G. Holmes, R. Kirkby, and B. Pfahringer. MOA: Massive online analysis. J. Mach. Learn. Res., 2010.
-  Young Hun Jung, Jack Goetz, and Ambuj Tewari. Online multiclass boosting. In Advances in Neural Information Processing Systems 30. 2017.
-  Robi Polikar, Lalita Upda, Satish S Upda, and Vasant Honavar. Learn++: An incremental learning algorithm for supervised neural networks. IEEE transactions on systems, man, and cybernetics, 2001.
-  João Gama, Raquel Sebastião, and Pedro Pereira Rodrigues. On evaluating stream learning algorithms. Machine Learning, 90(3), 2013.
-  Karl B. Dyer, Robert Capo, and Robi Polikar. COMPOSE: A semisupervised learning framework for initially labeled nonstationary streaming data. IEEE Trans. Neural Netw. Learning Syst., 2014.
-  I Skrjanc, J Iglesias, A Sanchis, D Leite, E. D Lughofer, and F Gomide. Evolving fuzzy and neuro-fuzzy approaches in clustering, regression, identification, and classification: A survey. Information Sciences, 490:344–368, 2019.