Implicit Context-aware Learning and Discovery for Streaming Data Analytics

Kin Gwn Lore, Kishore K. Reddy United Technologies Research Center, East Hartford, Connecticut 06118
Email:{lorek,reddykk}@utrc.utc.com
Abstract

The performance of a machine learning model can be further improved if contextual cues are provided as input along with base features that are directly related to an inference task. In offline learning, one can inspect historical training data to identify contextual clusters either through feature clustering or by hand-crafting additional features to describe a context. While offline training enjoys the privilege of learning reliable models based on already-defined contextual features, online training for streaming data may be more challenging: the data is streamed through time, and the underlying context during a data generation process may change. Furthermore, the problem is exacerbated when the number of possible contexts is not known. In this study, we propose an online-learning algorithm that uses a neural network-based autoencoder to identify contextual changes during training, and then compares the currently inferred context to a knowledge base of learned contexts as training advances. Results show that classifier training benefits from the automatically discovered contexts, demonstrating quicker learning convergence during contextual changes compared to current methods.

Streaming data, context-aware framework, autoencoders, incremental learning

I Introduction

Contextual cues can greatly benefit learning of predictive tasks in a machine learning model. A single datapoint may be meaningless. However, if the scope is expanded to include the context from which the data is obtained, one can obtain greater insights and perhaps substantially improve decision making. In the domain of computer vision, context can exist in the form of semantic context, spatial context, and scale context. These contexts, when leveraged, have shown great success in various works from object detection to image segmentation [wang2017attribute, japuria2017casnsc, li2016object, liu2017face]. Another application area concerns time-series data that is obtained from sensors or is transactional in nature, where contextual features could be derived from time, weather, or event logs [matei2017context, gao2017detecting, liu2016multi]. Furthermore, mobile devices can use contextual features such as location, user activities inferred from on-board accelerometers, and user events to provide tailored mobile services or improve user experience [mezhoudi2013user, kabir2015machine, wang2010camf, zhou2012context, kwapisz2011activity]. However, such contextual features are often explicit in nature, as they are specially selected and incorporated to fit the purpose of the machine learning task. In contrast, we are interested in automatically identifying context changes without knowing what the contexts are.

Consider the illustration shown in Fig. 1. Suppose we have data from an engine with various sensors attached, from which we can derive the engine load. Suppose there are two classes present in the data, one being the engine under excessive load (positive event) and the other being the engine under nominal load (negative event). If the context is not known, the data distribution of the two classes can be represented as in Fig. 1(a). If the model is given the task of discriminating between nominal and excessive loads, it is inevitable that an under-performing model will be learned due to contradictory evidence observed in data within high-confusion regimes. On the other hand, consider a contextual drift that happens along a third dimension; in this case, it is time. As time progresses, an aircraft might change its mode of operation from taking off to cruising. In this case, we do not explicitly identify time as a contextual feature, as it has no predictive power. We only know that the data distribution has changed with the passage of time, caused by some change in context that is implicit in nature. Given these contextual clues (Fig. 1(b)), there is a clear decision boundary that separates nominal and excessive engine loads, which in turn provides more useful information for a successful classification model.

Fig. 1: Illustration on the importance of contextual features in a classification task.

Usually, in offline learning, one can inspect historical training data to identify contextual clusters either through feature clustering or by hand-crafting additional features to describe a context. While offline training enjoys the privilege of learning reliable models based on already-defined contextual features, online training for streaming data may be more challenging: the data is streamed through time, and the underlying context during a data generation process may change. Furthermore, the problem is exacerbated when the number of possible contexts is not known. In this paper, our contributions are threefold:

  1. A novel online-learning algorithm is proposed;

  2. Contrary to handcrafting contextual features to be used as part of the input, the algorithm discovers new contexts as new samples arrive;

  3. The algorithm is compared with multiple baseline approaches over four different applications and datasets;

This paper is organized as follows. In Section II, we outline related works and highlight the contributions of the present work. In Section III, we set up the problem and describe how one might choose to approach it. Then, we present the proposed method and outline the online learning algorithm in Section IV. Multiple experiments are performed to benchmark the proposed method against its non-context-aware counterparts, and the results are presented and discussed in Section V. Finally, we summarize the paper and end with conclusions in Section VI.

II Related works

In the context of concept and context drift, the authors of [widmer1996recognition] proposed MetaL(B) and MetaL(BI) as meta-learning frameworks to identify potential contextual clues during online training. However, the method is limited to a problem set where external contextual features are already known and prespecified; contextual changes are detected by changes in contextual features. In [widmer1996learning], the authors presented an approach to learn in the presence of concept drift and hidden contexts, where object descriptions from the past are stored in addition to detecting the changes to the current hypothesis. Descriptions are specified for each of the positive and negative groups, and a running count of the descriptions is tracked.

On the other hand, there are applications that use neural network-based approaches for streaming data. However, most applications are directed to anomaly detection, where autoencoders are used to detect anomalies in streaming data. The authors of [dong2018threaded] observed improved anomaly detection by implementing threaded ensembles of autoencoders, while the authors of [zhou2012online] use denoising autoencoders to effectively learn features during online learning.

Further, there are existing works that combine both worlds, learning under context/concept drift with neural network-based approaches. [yan2016correcting] highlights the use of autoencoders to learn a drift-corrected representation of the dataset for the purpose of transfer learning. Additionally, some authors handle concept drift in online streaming data by constructing a new classifier for every newly encountered concept and context [budiman2016adaptive].

Recently, neural networks have gained immense traction with the advent of deep learning and the rise of computing power. Autoencoders, being a class of neural networks, are known to perform particularly well in identifying hidden representations in data via unsupervised learning. This provides a huge advantage, as clusters of feature representations can be learned without the need for explicit labels from experts. The algorithm proposed in the present work leverages the power of autoencoders to identify underlying contexts, and then relates the current learning task to the inferred context. More details follow in the next section. To the best of our knowledge, at this point there is no existing work that identifies new contexts on the fly and dynamically appends the contextual feature to the data matrix without handcrafting contextual features.

III Problem setup

Consider an online learning task centered around classification. Data, or the feature matrix $X$, arrives as a $d$-dimensional vector $x_t$ associated with the class label $y_t$. At every time step $t$, the model receives a new data sample to train the classification model $f$. For each newly arriving sample, the model performs a class prediction $\hat{y}_t$ and compares the predicted output with the supplied ground-truth label $y_t$. Any wrongly classified sample provides feedback to the model to update its learned parameters.

A concept shift happens whenever there is a change in the relationship between the supplied features $X$ and the corresponding target values $y$. Typically, this happens whenever there is a change in the process that results from a change in the mode of operation. Consider the operation of a jet engine; suppose that in one mode of operation, monitoring specific feature values observed in attached sensors can provide insights into the health of the engine. However, when a change occurs such that the engine is operating under a different mode, what used to be healthy could now be considered unhealthy (i.e., the decision threshold has changed, which affects the labeling strategy). In such a scenario, typical online-learning algorithms that do not incorporate context awareness will fail to adjust to this change in context.

To evaluate the proposed algorithm, we present three datasets that can simulate the aforementioned problem setup. We first look at the stagger dataset [widmer1993effective], which comprises basic attributes such as shape, size, and color; the model predicts a binary value, and the labeling logic hidden behind the training data switches at fixed intervals. Next, we demonstrate on the MNIST-digits dataset [lecun1998mnist], where the task is to perform digit classification while the source of the data switches back and forth between the two datasets at fixed intervals during training. Finally, we evaluate the algorithm on the Naval Propulsion dataset from the UCI Data Repository (propulsion) [coraddu2016machine] in the context of condition-based maintenance. Each of these datasets is briefly described below.

III-A The stagger dataset

The stagger dataset [widmer1993effective] is defined in a simple block-world defined by three nominal attributes:

  • size $\in$ {small, medium, large}

  • color $\in$ {red, green, blue}

  • shape $\in$ {square, circular, triangular}

Additionally, three target concepts are defined and enumerated below:

  1. size = small $\land$ color = red

  2. color = green $\lor$ shape = circular

  3. size = (medium $\lor$ large)

Fig. 2: The stagger dataset is defined by objects with three nominal attributes: size, color, and shape, each with three values per attribute.

The hidden target concept switches from one of these definitions to the next, following the sequence 1-2-3-1-2-3, which results in extreme concept drift. The key metric for evaluating the model is then its ability to converge to high accuracy at a rate quicker than the baseline approaches. In the experiments, each time step consists of one sample, and the concept switches every 200 samples. Since there are a total of 6 concept segments (3 concepts, each visited twice), there are a total of 1,200 samples available for online training and evaluation.
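The stream described above can be sketched as follows. This is only an illustrative generator, not the authors' code: drawing each attribute uniformly at random, the function name `stagger_stream`, and the seed are assumptions introduced here.

```python
import random

SIZES = ("small", "medium", "large")
COLORS = ("red", "green", "blue")
SHAPES = ("square", "circular", "triangular")

# The three target concepts listed above, as predicates over (size, color, shape).
CONCEPTS = (
    lambda size, color, shape: size == "small" and color == "red",
    lambda size, color, shape: color == "green" or shape == "circular",
    lambda size, color, shape: size in ("medium", "large"),
)

def stagger_stream(segment_length=200, sequence=(0, 1, 2, 0, 1, 2), seed=0):
    """Yield (step, concept_index, (size, color, shape), label) tuples."""
    rng = random.Random(seed)
    step = 0
    for concept_index in sequence:
        concept = CONCEPTS[concept_index]
        for _ in range(segment_length):
            # Assumption: attributes are sampled independently and uniformly.
            obj = (rng.choice(SIZES), rng.choice(COLORS), rng.choice(SHAPES))
            yield step, concept_index, obj, int(concept(*obj))
            step += 1

stream = list(stagger_stream())
```

With the default arguments the generator yields 1,200 samples, and the hidden concept index changes at steps 200, 400, 600, 800, and 1,000.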

III-B The MNIST-digits dataset

The MNIST dataset [lecun1998mnist] is a well-known benchmark dataset for digit recognition (digits 0-9), consisting of grayscale images with pixel dimensions of $28 \times 28$ px. Like MNIST, the digits dataset is also a database of handwritten digits, obtained from scikit-learn and originally based on the USPS dataset [seewald2005digits], with image dimensions of $8 \times 8$ px in a vectorized format (i.e., 64-dimensional). To bring both datasets onto a common space and keep computational complexity low, we resized the MNIST digits to the same $8 \times 8$ px dimension as the digits dataset.

Fig. 3: The MNIST dataset is a collection of digits ranging from 0-9.

Each context draws its samples from a single dataset (either MNIST or digits) and lasts for 1,000 samples. We simulate training data arrival by randomly sampling from the superset, which contains 50,000 samples from MNIST and 1,797 samples from digits. Denoting MNIST as M and digits as D, the context-switching sequence is M-D-M-D-M-D, which totals 6,000 samples.
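As a rough illustration of this sampling scheme, the sketch below builds only the schedule of (source, index) pairs; loading the images is left out. Sampling with replacement, the function name `context_schedule`, and the seed are assumptions, since the text does not specify them.

```python
import random

def context_schedule(pool_sizes, sequence="MDMDMD", segment_length=1000, seed=0):
    """Return (source, index) pairs naming the pool and row of each streamed sample."""
    rng = random.Random(seed)
    schedule = []
    for source in sequence:
        for _ in range(segment_length):
            # Assumption: indices are drawn uniformly with replacement.
            schedule.append((source, rng.randrange(pool_sizes[source])))
    return schedule

# Pool sizes as stated in the text: 50,000 MNIST samples and 1,797 digits samples.
schedule = context_schedule({"M": 50_000, "D": 1_797})
```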

III-C The propulsion dataset

The UCI Repository Naval Propulsion Plants dataset [coraddu2016machine] is a real-world dataset collected for the development of machine-learning driven condition-based maintenance models. The dataset contains sensor readings and parameters such as lever position, ship speed, gas turbine shaft torque, propeller torque, etc., along with real-valued information such as compressor decay and turbine decay. To frame it as a condition-based maintenance problem, we are interested in determining whether the compressor state is healthy or unhealthy (i.e., a binary classification problem). The dataset has 16 features (thus 16-dimensional), so the interactions between features are moderately high-dimensional. Based on the statistics of the dataset, we determine that the compressor is in a healthy state if the corresponding real-valued variable, compressor decay, is higher than a percentile threshold derived from the data. With this, the switching contexts are:

  1. Label = Unhealthy (1) if (Compressor decay)

  2. Label = Unhealthy (1) if (Compressor decay)

and Label is Healthy (0) if the conditions are not met. Each operating mode lasts for 300 samples and the sequence is defined as C1-C2-C1-C2 (totalling 1,200 samples).

IV Proposed method

Prerequisites: autoencoders. The autoencoder [vincent2010stacked, hinton2006reducing] (Fig. 4) is able to automatically learn useful features from the input data and reconstruct the input data based on the learned features. It has shown promising results across a wide variety of cross-domain anomaly detection problems [sakurada2014anomaly, martinelli2004electric, nolle2016unsupervised]. The autoencoder is trained to learn a low-dimensional representation of the normal input data by attempting to reconstruct its input $x$ to get $\hat{x}$ with the following objective function:

$$\theta^{*} = \arg\min_{\theta} \mathcal{L}(x, \hat{x})$$

where $\theta$ denotes the parameters of the autoencoder (i.e., weights of the neural network), $\mathcal{L}$ is the loss function (typically the $\ell_2$ loss), $x$ is the input, and $\hat{x}$ is the reconstruction of the autoencoder. When presented with data that comes from a different data-generating distribution, a high reconstruction error is expected. The errors can be modeled as a normal distribution: an anomaly (in this case, out-of-context data) is detected when the probability density of the average reconstruction error is below a certain threshold.
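To make the role of the reconstruction error concrete, here is a small sketch in which a linear autoencoder, fitted in closed form by truncated SVD (PCA), stands in for the neural network autoencoder used in this work. The synthetic data, the function names, and the code dimension are assumptions chosen for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_linear_autoencoder(X, code_dim):
    """Fit a linear encoder/decoder pair by truncated SVD (a PCA stand-in)."""
    mean = X.mean(axis=0)
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:code_dim]            # rows span the code (bottleneck) space

def reconstruction_error(model, X):
    """Per-sample squared l2 reconstruction error."""
    mean, components = model
    code = (X - mean) @ components.T      # encode into the bottleneck
    X_hat = code @ components + mean      # decode back to input space
    return np.sum((X - X_hat) ** 2, axis=1)

# Two synthetic "contexts": 5-D data lying near two different 2-D subspaces.
W_a = rng.normal(size=(2, 5))
W_b = rng.normal(size=(2, 5))
X_in = rng.normal(size=(500, 2)) @ W_a + 0.01 * rng.normal(size=(500, 5))
X_out = rng.normal(size=(500, 2)) @ W_b

model = fit_linear_autoencoder(X_in, code_dim=2)
err_in = reconstruction_error(model, X_in).mean()
err_out = reconstruction_error(model, X_out).mean()
```

Data drawn from the second subspace is reconstructed much worse than data from the subspace the model was fitted on, which is the signal used for detecting out-of-context samples.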

Fig. 4: A neural network-based autoencoder learns a latent representation of the underlying normal data, where the code vector is encoded in the bottleneck layer. The autoencoder attempts to reconstruct the input based on this compressed representation of the input. The reconstruction error between the input and the output is used directly for anomaly detection.

Proposed method. In this work, we propose a novel method for context-aware incremental learning by jointly training a neural network-based autoencoder for contextual inference and a classifier for the designated task. In general, there is only a single classifier driven by the classification task. However, during the learning phase, the autoencoder receives the features $X$ and the data labels $y$ to learn the underlying joint representation of the features and the labels. Through this approach, a nominal concept representation for a particular operating mode is derived. For this approach to function, we assume that classification accuracy will experience a sharp decrease in the event of concept drift. Hence, a sharp decrease in classification accuracy indicates a potential change in concept, which triggers the autoencoder to compare the representation of the new data with the average representation of the learned context. If the representation is substantially different (i.e., high reconstruction error), a new concept is hypothesized, and a contextual variable $c$ is introduced as a flag for the new concept. The subsequent training data matrix is denoted as $[X, c]$, which indicates that the newly added contextual variable is now part of the feature matrix given as input to the classifier.
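Appending the contextual variable to the feature matrix can be sketched as below. Encoding the context as a single integer-valued column is an assumption made for brevity; a one-hot encoding would be an equally plausible choice, and the helper name `append_context` is hypothetical.

```python
import numpy as np

def append_context(X, context_ids):
    """Append the inferred context ID as an extra feature column."""
    X = np.asarray(X, dtype=float)
    c = np.asarray(context_ids, dtype=float).reshape(-1, 1)
    return np.hstack([X, c])

# Three samples with two base features; the last one belongs to a new context.
X = np.array([[0.2, 1.0], [0.4, 0.9], [0.3, 1.1]])
X_ctx = append_context(X, [0, 0, 1])
```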

One autoencoder is learned for each concept as a descriptor for the training instances. At every potential concept switch, the new sample is evaluated against a knowledge base of autoencoders to derive the reconstruction errors $e_1, \dots, e_K$, where $K$ is the number of seen (and hypothesized) contexts. If the reconstruction error is above a certain threshold for all autoencoders, then a new context is hypothesized. On the other hand, if one of the autoencoders returns a reconstruction error falling within a predefined threshold (based on statistical significance), then the new sample is thought to come from a context that has been encountered previously. Using this framework, the online learning algorithm is able to adapt to changing concepts and contexts without requiring complete prior knowledge of the number of possible concepts present in the data. The framework is summarized in Fig. 5.
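The lookup against the knowledge base can be sketched as follows. Each stored context is represented only by a callable that returns a reconstruction error and by a threshold; the class name `ContextMemory`, the distance-to-prototype error function, and the tie-breaking rule (pick the lowest error among matches) are assumptions made for this illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Sequence

@dataclass
class ContextMemory:
    """Knowledge base of per-context reconstruction-error functions."""
    error_fns: List[Callable[[Sequence[float]], float]] = field(default_factory=list)
    thresholds: List[float] = field(default_factory=list)

    def add(self, error_fn, threshold) -> int:
        """Store a descriptor for a newly hypothesized context; return its ID."""
        self.error_fns.append(error_fn)
        self.thresholds.append(threshold)
        return len(self.error_fns) - 1

    def match(self, sample) -> Optional[int]:
        """Return the ID of a previously seen context, or None if every error is too high."""
        errors = [fn(sample) for fn in self.error_fns]
        within = [k for k, e in enumerate(errors) if e <= self.thresholds[k]]
        if not within:
            return None
        # Assumption: if several contexts match, prefer the lowest error.
        return min(within, key=lambda k: errors[k])

def distance_to(prototype):
    """Toy stand-in for an autoencoder: squared distance to a stored prototype."""
    return lambda x: sum((a - b) ** 2 for a, b in zip(x, prototype))

memory = ContextMemory()
memory.add(distance_to((0.0, 0.0)), threshold=0.5)
memory.add(distance_to((5.0, 5.0)), threshold=0.5)
seen = memory.match((4.9, 5.1))       # close to the second prototype
unseen = memory.match((10.0, -3.0))   # far from both stored contexts
```

A `None` result corresponds to hypothesizing a new context, after which a new descriptor would be added with `add`.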

In the framework, we model reconstruction errors as a normal distribution. Under a fixed context or concept, the mean and standard deviation of the reconstruction error specific to that context are monitored. As previously mentioned, whenever the online training experiences a sharp drop in training accuracy, we pass the current training sample through the autoencoder related to the current context to compute its reconstruction error $e$. Given a past history of verified samples coming from the evaluated context, we can derive $\mu$ and $\sigma$. For a new sample under decreased accuracy, we consider the sample as out-of-context if:

$$|e - \mu| > \tau \sigma$$

where $\tau$ is a defined threshold. The online training algorithm is outlined in Algorithm 1.
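One way to monitor the error statistics online and apply the out-of-context test is sketched below. Expressing the test as a deviation from the mean in units of the standard deviation is an assumption; under a normal model it is equivalent to thresholding the probability density. The class name, the default threshold of 3, and the example errors are all hypothetical.

```python
import math

class RunningNormal:
    """Running mean and sample standard deviation (Welford's algorithm)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0   # sum of squared deviations from the running mean

    def update(self, value):
        self.n += 1
        delta = value - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (value - self.mean)

    @property
    def std(self):
        return math.sqrt(self._m2 / (self.n - 1)) if self.n > 1 else 0.0

def is_out_of_context(error, stats, tau=3.0):
    """Flag an error that deviates from the context mean by more than tau std."""
    if stats.n < 2 or stats.std == 0.0:
        return False     # not enough verified history to judge
    return abs(error - stats.mean) > tau * stats.std

stats = RunningNormal()
for e in (0.10, 0.12, 0.11, 0.09, 0.10):   # errors of verified in-context samples
    stats.update(e)

flag_far = is_out_of_context(0.50, stats)
flag_near = is_out_of_context(0.11, stats)
```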

Fig. 5: Flowchart of the proposed algorithm.
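Initialization:
- Initialize current context ID, $c$
- Initialize historical accuracies, $A$
- Initialize classifier, $f$
- Initialize context autoencoder, $\mathrm{AE}_c$
- Initialize autoencoder list, $\{\mathrm{AE}_k\}$
- Define accuracy threshold, $\alpha$
- Define autoencoder window, $w$
- Train classifier $f$
- Train autoencoder $\mathrm{AE}_c$
while data is streaming do
    Make prediction, $\hat{y}_t$
    Evaluate accuracy, $a_t$
    Append accuracy to list, $A$
    if mean($A$) $\geq \alpha$ then
        Select $\mathrm{AE}_c$
        Update autoencoder $\mathrm{AE}_c$
    else
        Pass sample through stored list of autoencoders $\{\mathrm{AE}_k\}$
        Compute reconstruction error $e_k$ for each $\mathrm{AE}_k$, for $k = 1, \dots, K$
        if $e_k$ exceeds the threshold for all $k$ then
            Update autoencoder list
            Increment context ID $c$
            Initialize new autoencoder $\mathrm{AE}_c$
        else
            Assign context ID $c$ of the matching autoencoder
        end if
    end if
end while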
Algorithm 1 Implicit Context-aware Learning on Streaming Data
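The control flow of Algorithm 1 can be walked through with deliberately simple stand-ins. In the sketch below the classifier is a lookup table keyed by (feature, context), and each "autoencoder" is replaced by the set of (feature, label) pairs seen in that context, so that the reconstruction error is 0 for a seen pair and 1 otherwise. These stand-ins, the warm-up length, the window and threshold values, and the reset of the accuracy window after a switch are all assumptions for illustration; they are not the models used in the experiments.

```python
from collections import deque

def run_ical_mem(stream, acc_threshold=0.8, window=4, warmup=4):
    """Toy walk-through of Algorithm 1; returns the context ID used at each step."""
    table = {}          # toy classifier: (x, context) -> last label seen
    memory = [set()]    # toy context descriptors: (x, y) pairs seen per context
    context = 0
    recent = deque(maxlen=window)   # correctness flags of recent predictions
    contexts = []
    for step, (x, y) in enumerate(stream):
        if step < warmup:
            # Initial training of the classifier and the first descriptor.
            memory[context].add((x, y))
        else:
            y_hat = table.get((x, context), 0)
            recent.append(int(y_hat == y))
            if sum(recent) / len(recent) >= acc_threshold:
                memory[context].add((x, y))      # accuracy is fine: update descriptor
            else:
                # Accuracy dropped: compare the sample with every stored descriptor.
                matches = [k for k, seen in enumerate(memory) if (x, y) in seen]
                if matches:
                    context = matches[0]         # previously seen context
                else:
                    memory.append({(x, y)})      # hypothesize a new context
                    context = len(memory) - 1
                recent.clear()                   # assumption: restart accuracy window
        table[(x, context)] = y                  # train classifier on the new sample
        contexts.append(context)
    return contexts

def labelled(rule, n):
    """Binary feature alternating 0, 1, ... with labels given by `rule`."""
    return [(t % 2, rule(t % 2)) for t in range(n)]

# Concept A (y = x), then concept B (y = 1 - x), then concept A again.
stream = labelled(lambda x: x, 20) + labelled(lambda x: 1 - x, 20) + labelled(lambda x: x, 20)
contexts = run_ical_mem(stream)
```

On this toy stream the second segment is assigned a new context ID, and the third segment is mapped back to the first ID instead of creating another one, which is the behavior the memory component is meant to provide.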

One apparent advantage of this framework is that for each new concept encountered, the current concept is stored as a contextual variable to aid decision support. While not directly linked to the class labels, contextual variables are important because decisions can be conditioned upon them. One problem arises during online prediction (without training data), when there is no way to know the true labels. This prevents the use of the autoencoder, as it requires both the data and the label to reconstruct from the learned representation. From here, there are two possible ways forward: (1) one can define a placeholder default value in place of the label, or (2) an expert can inspect the data and provide the contextual flag explicitly by examining past contexts. In (1), using default values in place of the expected labels will not provide a reliable estimate of the reconstruction error for a single sample, but by waiting a few more time steps to gather a batch of data and evaluating it against the autoencoders to obtain an average representation, we can compare the representations with higher certainty. Method (2), on the other hand, can first trigger an alarm to attract the attention of the expert, who can use human reasoning to assign a context, after which the context is defined for all subsequent data.

V Results and discussions

V-A Compared approaches

In our method, we emphasize context awareness: for every new concept encountered, a new contextual variable is defined as an input feature. We compared the online training framework against baseline approaches without context awareness, that is, approaches in which the context is assumed to stay the same and no new contextual features are constructed. For clarity, we designate the following names and descriptions for the algorithms:

  1. *Implicit Context-aware Learning with Memory component (ICAL-Mem): The proposed approach with context awareness achieved through autoencoders. The classifier is a tree-based model due to its ease of training.

  2. *Implicit Context-aware Learning (ICAL): A simplified version of the proposed approach without the memory component. This approach does not store autoencoders as concept descriptors; rather, a drop in online classification accuracy directly signifies a change in context, and all information related to previously learned concepts is forgotten.

  3. Non-context-aware Learning (Non-CAL): A classifier which does not respond to contextual changes and learns from a long history of available data, at a time scale larger than the time span of a particular context. A CART algorithm is selected for this approach.

  4. Myopic Non-context-aware Learning (Myopic Non-CAL): A classifier that learns from data coming from a short sliding window. Such an approach is expected to respond to misclassifications, as old knowledge is purged and forgotten from training. A CART algorithm is also selected for this approach.

Methods with an asterisk (*) are the methods proposed in this paper, which are centered upon context-aware learning.
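The difference between the two non-context-aware baselines lies in which samples are kept for training. A minimal sketch of that bookkeeping is shown below; the class name `TrainingBuffer` and the window length of 3 are hypothetical, and the classifier itself is omitted.

```python
from collections import deque

class TrainingBuffer:
    """Holds the samples a classifier would be refit on after each new arrival."""

    def __init__(self, window=None):
        # window=None keeps the full history (Non-CAL); an integer keeps only
        # the most recent `window` samples (Myopic Non-CAL).
        self.samples = deque(maxlen=window)

    def add(self, x, y):
        self.samples.append((x, y))

    def training_set(self):
        return list(self.samples)

full, myopic = TrainingBuffer(), TrainingBuffer(window=3)
for t in range(10):
    full.add(t, t % 2)
    myopic.add(t, t % 2)
```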

V-B Evaluation metric

During online training, the predicted label and the true label of each new sample are stored. A running exponentially weighted moving average (EWMA) of the classification accuracy is presented, and then the average accuracy for the whole experiment is computed. If an algorithm recovers its classification accuracy slowly whenever a new context is encountered, its post-experiment average accuracy will be lower than that of an algorithm that recovers quickly. The metric is intuitive, and is the usual arithmetic mean of accuracy defined as:

$$\bar{a} = \frac{1}{T} \sum_{t=1}^{T} g(\hat{y}_t, y_t)$$

where $T$ is the number of evaluation time steps, and $g$ is the evaluation function (e.g., rolling mean of classification accuracy) operating on the predicted label $\hat{y}_t$ and the true label $y_t$ at step $t$.
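The metric can be sketched in a few lines. The smoothing factor `alpha`, the initialization of the EWMA with its first value, and the function names are assumptions, as the text does not state them.

```python
def ewma(values, alpha=0.1):
    """Exponentially weighted moving average, initialized with the first value."""
    out, state = [], None
    for v in values:
        state = v if state is None else alpha * v + (1 - alpha) * state
        out.append(state)
    return out

def average_accuracy(y_pred, y_true, evaluate=ewma):
    """Arithmetic mean over time of the running accuracy curve."""
    correct = [float(p == t) for p, t in zip(y_pred, y_true)]
    scores = evaluate(correct)
    return sum(scores) / len(scores)

# One mistake at the second step, smoothed with alpha = 0.5.
score = average_accuracy([1, 0], [1, 1], evaluate=lambda c: ewma(c, alpha=0.5))
```

Because every time step contributes to the mean, an algorithm that spends many steps at low accuracy after a context switch is penalized more than one that recovers within a few samples.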

V-C Results on stagger dataset

The top plot in Fig. 6 shows the EWMA accuracies of the considered algorithms evaluated on the stagger dataset. During the first context, all algorithms performed equally well and converged to good accuracies at the same rate. As the data advanced into a new context, all algorithms suffered a significant drop in accuracy, but their awareness of changing contexts enabled ICAL-Mem and ICAL to recover quickly. Non-CAL and Myopic Non-CAL, on the other hand, lack contextual variables and struggled to identify underlying patterns due to conflicting evidence from previous learning. However, Myopic Non-CAL was able to recover quicker than Non-CAL by forgetting older contradictory evidence from the previous contexts.

Fig. 6: (Top) EWMA accuracies of all considered algorithms evaluated on the stagger dataset. Concept switches are indicated by the grey vertical lines; the 6 partitions correspond to concepts 1-2-3-1-2-3. (Bottom) Inferred contexts during online training.

The bottom plot in Fig. 6 shows the contexts identified using the autoencoder approach. Being aware of contextual changes, ICAL increments the number of contexts by 1 at every change and starts relearning without considering old memories. ICAL-Mem is able to do better by remapping the new incoming sample to a previously learned context, and subsequently switches training mode to condition upon that context. Hence, it was able to identify the same contexts at every second encounter. Note that during the second encounters (i.e., partitions 4, 5, 6) the EWMA accuracy of ICAL-Mem is much higher than that of ICAL without memory, as well as that of the non-context-aware counterparts (i.e., Non-CAL and Myopic Non-CAL).

V-D Results on MNIST-digits dataset

Results for this dataset are presented in the same format as before. For this dataset, the contexts are fairly similar, which prevents the autoencoder from inferring contexts effectively. Due to the similarity of the MNIST and digits datasets in the feature space, the classification model was still able to perform well despite the context change. One striking difference between this dataset and the stagger dataset is that the features undergo a change, not the labeling strategy. Coupled with similar features (i.e., grayscale handwritten digits), there is only a small benefit in using context-aware approaches. This is shown by the Non-CAL method tracking closely with the context-aware counterparts (ICAL-Mem and ICAL). In terms of the inferred context, the first sharp contextual change is identified after the first partition. Interestingly, by the second context change the model had seen enough training data to perform equally well. Hence, the classification accuracy did not suffer and consequently did not trigger context reidentification.

Fig. 7: (Top) EWMA accuracies of all considered algorithms evaluated on the MNIST-digits dataset. Context switches are indicated by the grey vertical lines; the 6 partitions correspond to contexts 1-2-1-2-1-2. (Bottom) Inferred contexts during online training.

V-E Results on Propulsion dataset

Results for this dataset are striking: there is a large gain from implicit context-aware learning with the autoencoder-based memory component. The ICAL-Mem approach clearly recovers faster during every concept drift, and correctly maps the inferred context to one that has been learned. As a result of the reassignment to a previously known context, the online classification accuracy converges at a much higher rate compared to the baseline approaches. Myopic Non-CAL requires a long time to recover, while the Non-CAL approach was not able to recover quickly due to contrasting evidence in previously learned data. The online accuracy plots are presented in Fig. 8.

Fig. 8: (Top) EWMA accuracies of all considered algorithms evaluated on the Propulsion dataset. Context switches are indicated by the grey vertical lines; the 4 partitions correspond to concepts 1-2-1-2. (Bottom) Inferred contexts during online training.

V-F Discussions

Table I shows a summary of the average accuracy throughout the entirety of the online training process. As a reminder, a higher average accuracy is an indicator of quick convergence/recovery during a context/concept switch. In general, there is a much greater benefit in incorporating context awareness during training, especially when a new context/concept is discovered. The benefit is more apparent when there is a change in the ground truth than when there are changes in the features. This is because, given enough training data, the model might learn to generalize better to accommodate shifts within the feature space, while it is not possible for the model to generalize to a different label with a similar set of features.

Method            Stagger   MNIST-Digits   Propulsion
*ICAL-Mem         93.94%    93.17%         93.77%
*ICAL             92.61%    93.52%         88.11%
Non-CAL           68.23%    92.29%         64.20%
Myopic Non-CAL    82.61%    72.33%         66.00%
TABLE I: Averaged accuracies through the entirety of the training process. Methods marked with asterisks (*) are methods proposed in this paper.
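As a quick arithmetic check on Table I, the snippet below computes the gain of ICAL-Mem over Non-CAL on each dataset, in percentage points. The numbers are copied from the table; the variable names are introduced here for illustration.

```python
# Average accuracies (%) copied from Table I.
table = {
    "ICAL-Mem":       {"Stagger": 93.94, "MNIST-Digits": 93.17, "Propulsion": 93.77},
    "ICAL":           {"Stagger": 92.61, "MNIST-Digits": 93.52, "Propulsion": 88.11},
    "Non-CAL":        {"Stagger": 68.23, "MNIST-Digits": 92.29, "Propulsion": 64.20},
    "Myopic Non-CAL": {"Stagger": 82.61, "MNIST-Digits": 72.33, "Propulsion": 66.00},
}
datasets = ("Stagger", "MNIST-Digits", "Propulsion")

# Gain of ICAL-Mem over the Non-CAL baseline, in percentage points.
gains = {d: round(table["ICAL-Mem"][d] - table["Non-CAL"][d], 2) for d in datasets}

# Best-scoring method on each dataset.
best = {d: max(table, key=lambda m: table[m][d]) for d in datasets}
```

The gains come out at roughly 25.7, 0.9, and 29.6 points for Stagger, MNIST-Digits, and Propulsion respectively, so the largest gap is on the Propulsion dataset, where the labeling strategy changes between contexts.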

VI Conclusions

In this paper, we proposed a novel online-learning algorithm that generates new contextual identifiers to aid the learning of classification tasks under context or concept drift. The building blocks of the algorithm consist of a task-oriented classifier, in addition to autoencoders that store low-dimensional representations of the data and that can increase in number as new contexts or concepts are discovered. The algorithm is evaluated against several baselines that lack context awareness, a fundamental component of successful online learning under drift. Results have shown that an improvement of up to 30 percentage points can be achieved by incorporating context awareness during online training, as opposed to the non-context-aware counterparts.

References
