A Machine-Learning Phase Classification Scheme for Anomaly Detection in Signals with Periodic Characteristics

Lia Ahrens Deutsches Forschungszentrum für Künstliche Intelligenz, Lia.Ahrens@dfki.de    Julian Ahrens Deutsches Forschungszentrum für Künstliche Intelligenz, Julian.Ahrens@dfki.de    Hans D. Schotten Deutsches Forschungszentrum für Künstliche Intelligenz, Technische Universität Kaiserslautern, Hans_Dieter.Schotten@dfki.de
Abstract

In this paper we propose a novel machine-learning method for anomaly detection. Focusing on data with periodic characteristics where randomly varying period lengths are explicitly allowed, a multi-dimensional time series analysis is conducted by training a data-adapted classifier consisting of deep convolutional neural networks performing phase classification. The entire algorithm including data pre-processing, period detection, segmentation, and even dynamic adjustment of the neural nets is implemented for a fully automatic execution. The proposed method is evaluated on three example datasets from the areas of cardiology, intrusion detection, and signal processing, presenting reasonable performance.

Keywords: anomaly detection, time series analysis, phase classification, machine learning, convolutional neural networks

1 Introduction

Many real-world systems, both natural and anthropogenic, exhibit periodic behaviour. Monitoring such systems necessarily produces periodic time series. In one particular instance of such a monitoring application, one is interested in automatically detecting changes in the periodically repeating pattern and thus anomalies in the system's operation. This type of anomaly detection occurs in a wide range of fields and applications: in medicine, e.g. diagnosing diseases of the cardiovascular and respiratory systems; in industrial contexts, e.g. monitoring the operation of a transformer or rotating machinery; and in signal processing and communications. The aims pursued range from simple monitoring to intrusion detection and prevention.

Traditionally, anomaly detection is performed in the form of outlier detection in mathematical statistics. Numerous methods have been proposed, including but not limited to distance- and density-based techniques [7, 5] and subspace- or submanifold-based techniques [13, 6, 25]. Most of these techniques make no explicit use of the concept of time and are therefore usually less suited for the analysis of time series. Methods making explicit use of the temporal structure include classical models from statistical time series analysis such as autoregressive–moving-average (ARMA) models [17], hidden Markov models [28] such as Kalman filters [24], and rolling-window distance-based methods such as matrix profiles [27]. Distance-based methods are effective for clean data but not robust against noise, whereas distribution-based methods from mathematical statistics remain powerful in the presence of noise but require data-specific parametrisation. In the past few years, non-linear methods, such as different types of recurrent neural networks (RNNs) and long short-term memory (LSTM) networks, have also come into use [10, 26]. Many of these methods are difficult to train [1, 12, 18] or need large amounts of data in order to achieve reasonable performance while avoiding overfitting. On the other hand, in recent years, convolutional neural networks (CNNs) have gained popularity in image processing [15, 23], where they are used mainly for classification tasks. The same principles that ensure the success of CNNs in image processing carry over to other types of signal processing when the number of dimensions of the convolutional kernels is changed accordingly. Most of the work using recurrent or convolutional networks for time series analysis focusses on forecasting or on detecting certain patterns explicitly known at training time. On these tasks, convolutional networks have recently been shown to outperform the previous state-of-the-art LSTMs [2].

In this paper we consider data with periodic characteristics and design a machine-learning algorithm for time series analysis, in particular anomaly detection, applying convolutional neural nets in a manner which, to the best of the authors' knowledge, has not been proposed previously. In contrast to existing methods and inspired by machine-learning methods for image processing, we employ a convolutional net acting not as a predictor or estimator but as a classifier whose classes depend on phase, i.e. the relative location in time. We also integrate general procedures for data pre-processing and automated phase reclustering so that no manual action is required in between.

Our algorithm is tested on three datasets: a cardiology dataset (ECG database) [3], an industrial network dataset for cyber attack research (SCADA dataset) [16], and a synthetic waveform dataset described in detail in Section 3.3.1. It turns out that, to a certain extent, our method is robust against unclean data and the related neural nets do not show high sensitivity to the hyperparameters and are therefore relatively easy to train.

The remainder of the paper is organised as follows. In Section 2 we introduce our general concept for anomaly detection including data pre-processing, the mathematical basis of convolutional neural networks, and the training algorithm. In Section 3, the three aforementioned example datasets are evaluated. For the ECG database (Section 3.1), a part of the data from healthy patients is used as normal data for training and validation and the classifier is tested on the remaining data including healthy and ill patients. For the evaluation of the SCADA dataset (Section 3.2), we choose one subdataset with normal network traffic including a certain amount of interference for training and three correctly labeled subdatasets with malicious activities for testing the intrusion detection algorithm. For the synthetic dataset (Section 3.3), a wave generator based on fundamentals from calculus and the theory of stochastic processes is built in a first step, along with injected anomalies generated essentially by varying different parameters of the underlying waves. Each generated wave without anomaly is divided into training and validation data and the resulting classifier is tested on the remaining parts of the wave with injected anomalies; the average detection rate over separately generated waves is employed for evaluation. In Appendix A, dealing with the issue of randomly varying period length which shows up in the ECG data (Section 3.1) and the synthetic waves (Section 3.3), an auxiliary period detection scheme is designed based on classical principles of signal processing.

2 Method

2.1 Concept

Pioneering research reveals that training convolutional networks for image classification yields convincing results, taking into account the spatial structure of the considered images. Motivated by this, the machine-learning approach proposed in this paper is based on the following key ideas:

  1. Conducting multi-dimensional time series analysis by means of multi-channel deep convolutional neural nets, where each channel in the input layer corresponds to a single feature (dimension) of the considered time series

  2. Identifying relative locations (order of occurrence) of subpatterns from time series with periodic characteristics by means of training data-adapted classifiers so that subpatterns over different periods of the underlying time series are properly divided into a certain number of classes

To be more specific, for a time series presenting periodic characteristics and beginning at a local extremum, , with fixed period length and a pre-determined initial number of classes , sampling with a sliding window of length and stride length , each subpattern of the form , , is assigned to the class labeled .

The time series analysis by means of training classifiers as above can be employed for anomaly detection. Indeed, for seasonal data, subpatterns sampled from the time series occur repeatedly and in fixed order within each single period. A successfully trained classifier outputs the correct label indicating the relative location (i.e. distance between subpattern and period begin) of the input subpattern. Abnormal datapoints in an input pattern are expected to cause a false classification result and therefore to be identified as anomalies.
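The labelling idea described above can be illustrated with a minimal sketch. It assumes a fixed period length, a stride equal to the period length divided by the number of classes, and a label equal to the window index modulo the number of classes; the function names `phase_windows` and `flag_anomalies` are hypothetical and not taken from the paper.

```python
def phase_windows(series, period, n_classes, window):
    """Cut `series` into sliding windows and label each window by its phase.

    Assumption: the stride is period / n_classes, so that exactly
    n_classes windows start within each period.
    """
    stride = period / n_classes
    samples, labels = [], []
    k = 0
    while True:
        start = round(k * stride)
        if start + window > len(series):
            break
        samples.append(series[start:start + window])
        labels.append(k % n_classes)  # relative location within the period
        k += 1
    return samples, labels


def flag_anomalies(predicted, labels):
    """Flag a window as anomalous when the predicted phase is not the expected one."""
    return [p != y for p, y in zip(predicted, labels)]
```

A trained classifier would supply `predicted`; any window whose predicted phase disagrees with its expected phase is then treated as a candidate anomaly.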

2.2 Data pre-processing

2.2.1 Period detection

In general, the periodic characteristics of a time series can be recognised by examining the autocorrelogram (cf. [4, 2.1.4]) or periodogram (cf. [4, 2.2.1]). In many cases the period length is fixed and known. In case of a fluctuating (cf. e.g. data from cardiology), an auxiliary period detector is designed in Appendix A, capturing the local extremum (considered as period begin in our setting) of each period and using cross-correlations in order to achieve robust period detection. Note that for randomly varying period length , the stride length in our setting varies proportionally to so that the number of sliding windows considered within each period remains constant.
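For the simple case of a fixed period length, reading the period off the autocorrelogram can be sketched as follows. This is only an illustration of the autocorrelation idea mentioned above, not the detector of Appendix A; the names `autocorrelation` and `estimate_period` and the search bounds `min_lag` and `max_lag` are assumptions made for this example.

```python
import math


def autocorrelation(x, lag):
    """Sample autocorrelation of the sequence x at the given lag."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x)
    cov = sum((x[t] - mean) * (x[t + lag] - mean) for t in range(n - lag))
    return cov / var


def estimate_period(x, min_lag, max_lag):
    """Return the lag in [min_lag, max_lag] with the largest autocorrelation."""
    return max(range(min_lag, max_lag + 1), key=lambda lag: autocorrelation(x, lag))


# Example: a clean sine wave with 25 samples per period.
signal = [math.sin(2 * math.pi * t / 25) for t in range(500)]
period = estimate_period(signal, min_lag=5, max_lag=40)
```

The lower bound `min_lag` is needed because the autocorrelation is trivially large at very small lags.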

2.2.2 Sliding window

The classification accuracy of our approach turns out not to be highly sensitive to the length of the sliding window . In the context of anomaly detection, the value of should be kept relatively small (e.g. less than or equal to three times the average duration of a single abnormal data sequence) in order to highlight the local effect of the abnormal data points on the time series. We use a window size of (approximately three times the stride length), where refers to the average value of (recall that in general may vary over time). Empirically, this has proven to be adequate for our purpose. Note that the length of the sliding window remains constant even in the case of randomly varying period length ; the varying stride length merely affects the amount of overlap between adjacent sliding windows.

2.2.3 Normalisation

In order to remove trend components and avoid skewed results due to dominating extreme values, the samples within the sliding window are normalised by the local mean and variance, that is, considering a -dimensional time series as input data, for and the vector is fed into channel of the convolutional neural net, where

with
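As a rough illustration of the normalisation step in this subsection, the following sketch standardises each window by its own mean and standard deviation. The small constant `eps` guarding against division by zero and the function names are assumptions made for this example; the paper's exact formula may differ in such details.

```python
import statistics


def normalise_window(window, eps=1e-8):
    """Subtract the local mean and divide by the local standard deviation."""
    mean = statistics.fmean(window)
    std = statistics.pstdev(window)
    return [(v - mean) / (std + eps) for v in window]


def normalise_channels(windows):
    """Normalise each feature (channel) of a multi-dimensional window separately."""
    return [normalise_window(w) for w in windows]
```

Because the statistics are computed per window, slow trends and differences in scale between features are removed before the data reach the network.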

2.3 Convolutional neural networks

The core of our phase classifier is a convolutional neural network (CNN). CNNs are a special type of feedforward neural network that exploits structures of space or time by sharing many of the weights among different neurons. We provide a short description of the mathematical basis of a convolutional neural network.

Basically, a feedforward neural network is a function , mapping an input vector to an output vector , using a vector of parameters to adapt the mapping. When acting as a classifier, is the number of classes and the predicted class of a given input is taken to be . The network can be decomposed into layers, each of which represents a different function mapping vectors to vectors, i.e.,

where is the number of layers and for , are the transformations performed by each of the single layers and the vectors are again parameter vectors used to adapt the mapping and given as subvectors of . For ease of notation, let us denote the input to the function by , starting with and the output of the function by , ending with .

In the most simple feedforward neural networks, each of the transformations is given by a multiplication with a matrix called the weight matrix followed by an addition of a vector called the bias vector followed by the application of some non-linear function called the activation function to each of the components of the resulting vector, i.e.,

The entries of the matrix and the vector are exactly the components of the parameter vector . This is called a fully connected layer.

In the case of a one-dimensional convolutional layer, the affine transformation is replaced with a more restrictive kind of affine transformation, the so-called batched convolution. For this, the vector is reindexed to form a two-dimensional array with . Similarly, the parameter vector is distributed not into a matrix and a vector , but into a matrix of vectors called the convolution kernels and a vector . The operation performed by the function is now given by

and we have the constraint that . In many convolutional networks, the input vectors are extended (padded) by additional zero entries prior to being convolved. When padding with exactly zeros, the output vectors are of the same size as the input vectors. This is referred to as ‘SAME’-padding.
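A one-dimensional multi-channel convolutional layer with 'SAME'-padding can be sketched in plain Python as follows. The sketch assumes the cross-correlation convention used by most deep-learning libraries and a split of the padding into `(k - 1) // 2` zeros on the left and the rest on the right; the function name and argument layout are hypothetical.

```python
import math


def conv1d_same(x, kernels, bias, activation=math.tanh):
    """Apply a 1-D convolutional layer with zero padding that preserves length.

    x:       list of input channels, each a list of equal length
    kernels: kernels[c_out][c_in] is a list of k filter taps
    bias:    one bias value per output channel
    """
    length = len(x[0])
    k = len(kernels[0][0])
    left = (k - 1) // 2
    right = k - 1 - left
    padded = [[0.0] * left + list(channel) + [0.0] * right for channel in x]
    out = []
    for c_out, kernel in enumerate(kernels):
        row = []
        for t in range(length):
            s = bias[c_out]
            for c_in, taps in enumerate(kernel):
                for j, w in enumerate(taps):
                    s += w * padded[c_in][t + j]
            row.append(activation(s))
        out.append(row)
    return out
```

The same taps are reused at every position `t`, which is the weight sharing that distinguishes this layer from a fully connected one.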

In our networks, we also use a type of layer called a max pooling layer between two convolutional layers. The transfer function of this layer is given by

where is a positive integer called the pool size and we have the constraints and . Note that max pooling layers have no adjustable parameters .
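Max pooling over non-overlapping blocks can be sketched as below. The sketch assumes that the input length is divisible by the pool size and that pooling is applied to each channel independently; the name `max_pool1d` is hypothetical.

```python
def max_pool1d(x, pool_size):
    """Keep the maximum of each block of `pool_size` consecutive entries per channel."""
    length = len(x[0])
    if length % pool_size != 0:
        raise ValueError("input length must be divisible by the pool size")
    return [
        [max(channel[i:i + pool_size]) for i in range(0, length, pool_size)]
        for channel in x
    ]
```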

In our network, we employ both convolutional and regular fully connected layers. We apply ‘SAME’-padding in all convolutional layers and use the hyperbolic tangent () as activation function, which is a common choice in feedforward neural networks.

The exact layout of the convolutional network used for our task is displayed in Table 1. Here , , and denote the dimension of the input time series, the sliding window size and the current number of classes, respectively. The layer and kernel sizes were chosen to best adapt to varying input time series dimensions, sliding window sizes, and numbers of classes.

Layer Type Sizes
0 convolutional , ,
1 max pooling , ,
2 convolutional , ,
,
3 fully connected
4 fully connected
5 output
Table 1: Layers of classifier neural network

The method, by which the parameter vector is adjusted and the network adapts, is the minimization of a function applied to the output of the neural network, called the loss function. Since we are classifying phases, to each training input (and hence to each output ), there corresponds a label . In our case, we use the cross entropy loss function, which is given by

where denotes a weight by which the losses of each class are scaled. The weights are statically determined and are in our case chosen to be proportional to the inverse of the number of training examples for each class in order to counteract bias caused by unbalanced classes.
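A class-weighted cross entropy with inverse-frequency weights can be sketched as follows. The softmax applied to the raw network outputs and the rescaling of the weights so that they sum to the number of classes are assumptions of this example, as are all function names.

```python
import math
from collections import Counter


def inverse_frequency_weights(labels, n_classes):
    """Weights proportional to 1 / (number of training examples per class)."""
    counts = Counter(labels)
    raw = [1.0 / counts[c] if counts[c] else 0.0 for c in range(n_classes)]
    scale = n_classes / sum(raw)  # assumed normalisation: weights sum to n_classes
    return [r * scale for r in raw]


def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_cross_entropy(logits, label, weights):
    """Cross entropy of one example, scaled by the weight of its true class."""
    return -weights[label] * math.log(softmax(logits)[label])
```

Rare classes receive larger weights, so that misclassifying them costs as much in total as misclassifying frequent ones.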

2.4 Training algorithm

The neural networks in our algorithm are trained by the ADAM training algorithm which is a refined version of stochastic gradient descent (SGD). In SGD, the average loss for a set containing pairs of training inputs and corresponding labels is minimised by changing the randomly initialised parameters of the neural network according to the update rule

where is a tuning parameter called the learning rate. This minimises the average of the loss values . The set is called a mini-batch and is taken to be a subset of the set of all available training inputs. The update steps are performed with changing disjoint mini-batches until the entire training dataset is exhausted. Each pass through the entire set of available training data is referred to as an epoch. To enhance the training process (cf. [21]), we change the size of the mini-batches during the training: later epochs use larger mini-batch sizes. The adaptive adjustments performed by the ADAM algorithm detailed in [14] provide further enhancements to this process.
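The mechanics of one epoch of plain SGD with disjoint mini-batches and a growing batch size can be sketched as follows. This does not implement ADAM, and the schedule parameters `initial`, `growth`, `every`, and `maximum` are placeholder values chosen for illustration, not values from the paper.

```python
import random


def minibatches(n_examples, batch_size, rng):
    """Split a shuffled index range into disjoint mini-batches covering all examples."""
    indices = list(range(n_examples))
    rng.shuffle(indices)
    return [indices[i:i + batch_size] for i in range(0, n_examples, batch_size)]


def sgd_step(params, grads, learning_rate):
    """Move each parameter against the gradient of the average mini-batch loss."""
    return [p - learning_rate * g for p, g in zip(params, grads)]


def batch_size_for_epoch(epoch, initial=32, growth=2, every=10, maximum=512):
    """Hypothetical schedule: multiply the batch size by `growth` every `every` epochs."""
    return min(maximum, initial * growth ** (epoch // every))
```

In a full training loop, `grads` would be the gradient of the average loss over the current mini-batch, computed by backpropagation.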

In contrast to usual classifiers, our algorithm encapsulates the gradient descent algorithm in a decision process that monitors the necessity of dynamic reclustering, which aims to optimise the classification accuracy. The complete algorithm is given in Algorithm 1; the individual steps are described in more detail in the remainder of this section.

for  do
     if  then
          return stored net
     end if
     
     while true do
          initialise net and labels
          repeat
               perform training iteration
          until no improvement in validation loss within consecutive epochs    ▷ training stop crit.
          if minimum class accuracy  then    ▷ reclustering stop crit.
               store net
               
               break
          end if
          if  or  then
               break
          end if
          
          recluster according to overall confusion matrix
          update weights of loss function
     end while
end for
if  then
     return stored net
else
     change and rerun
end if
Algorithm 1 Training algorithm

Each time the neural network has been initialised for separating the currently considered classes, the gradient descent optimiser is run until a training-progress-monitoring stop criterion is fulfilled (cf. training stop criterion in Section 2.4.2 for more details). The classification ability of the underlying neural net is evaluated by means of so-called confusion matrices (cf. Section 2.4.1) throughout the entire training. If at the end of training all classes are evaluated with sufficient accuracy (cf. reclustering stop criterion in Section 2.4.2), the trained neural net is stored. Otherwise, a relabeling procedure based on the overall confusion matrix is conducted: the class with the lowest average evaluation accuracy is merged into the class to which its inputs are most commonly misclassified during the training, and the neural net is re-initialised with respect to the updated classes (cf. Section 2.4.3). Among all stored neural nets, the ultimate classifier is chosen as the one having the maximum number of output classes (cf. Section 2.4.4).

In the subsequent subsections, the aforementioned reclustering process and stop criteria are described in detail.

2.4.1 Confusion matrix

In order to track the progression of classification accuracy during the training, we record the confusion matrix evaluated on the training data after each epoch. For a current number of classes and existing classes labeled as , the confusion matrix evaluated after the -th epoch is an -dimensional matrix denoted by , where the entry refers to the number of training inputs labeled as and predicted by the neural net during the -th training epoch as class , .
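Computing a confusion matrix and the resulting per-class accuracies for one epoch can be sketched as follows. The convention that rows index true labels and columns index predicted classes is an assumption of this example, as are the function names.

```python
def confusion_matrix(labels, predictions, n_classes):
    """matrix[i][j] counts inputs with true label i that were predicted as class j."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for y, p in zip(labels, predictions):
        matrix[y][p] += 1
    return matrix


def per_class_accuracy(matrix):
    """Fraction of correctly classified inputs for each true class."""
    return [
        row[k] / sum(row) if sum(row) else 0.0
        for k, row in enumerate(matrix)
    ]
```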

During the experimentation, we observe that classes which are easily distinguishable can already be separated after very few training iterations, whereas classes sharing more similarity perform significantly worse in the beginning and also show a slower increase of evaluation accuracy during the training. Taking into account that the evaluated value of the loss function commonly follows a convex decreasing trend throughout the entire training, the above observation motivates us to assess the separation ability of the underlying neural net during training by weighting the confusion matrix with the respective contribution to the training progress and to introduce the overall confusion matrix denoted by and defined as

(1)

where , , and refer to the current number of classes, the number of training epochs that are performed until the training stop criterion (cf. Section 2.4.2) is satisfied, and the average training loss during the -th epoch, respectively.

In our setting, the overall confusion matrix serves as the key object of the decision criterion for our dynamic reclustering (cf. Sections 2.4.2 and 2.4.3). Its definition in terms of (1) by taking the weighted average throughout the entire training and dropping the values from the initial epoch () aims to mitigate the random effect of the initialisation of the neural network. Empirically, this yields robust reclustering results during different test runs for fixed .

2.4.2 Stop criteria

The criteria for stopping the loops are related to parametrised effectiveness and accuracy requirements in the following manner:

Training stop criterion

We monitor the training progress by evaluating the average loss of validation data over each training epoch. For each (re-)initialised neural network, training is stopped if no improvement in the average validation loss during the latest epochs can be observed.
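A patience-based check of this kind can be sketched as follows, assuming that "no improvement" means that none of the latest `patience` epochs achieved a validation loss below the best value seen before them. The name `should_stop` and the parameter name `patience` are hypothetical.

```python
def should_stop(validation_losses, patience):
    """Return True if the last `patience` epochs did not improve on the earlier best."""
    if len(validation_losses) <= patience:
        return False  # not enough history to decide yet
    best_before = min(validation_losses[:-patience])
    return min(validation_losses[-patience:]) >= best_before
```

The function would be called once per epoch with the list of average validation losses recorded so far.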

Reclustering stop criterion

Allowing a maximum per-class margin of error , the reclustering procedure is stopped if

where , and refer to the current number of classes, number of epochs for training the related network (i.e. until the training stop criterion is fulfilled), and the respective confusion matrix evaluated at the end of training (recall the definition in Section 2.4.1), respectively.
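One plausible reading of this criterion, consistent with the "minimum class accuracy" test in Algorithm 1, is that every class must reach an accuracy of at least one minus the allowed margin of error on the final confusion matrix. The sketch below implements that reading; it is an assumption, and the names `reclustering_done` and `margin` are hypothetical.

```python
def reclustering_done(final_confusion, margin):
    """Check that every class is classified correctly with error at most `margin`.

    final_confusion[i][j] counts inputs of true class i predicted as class j
    at the end of training.
    """
    for k, row in enumerate(final_confusion):
        total = sum(row)
        if total == 0 or row[k] / total < 1.0 - margin:
            return False
    return True
```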

2.4.3 Reclustering

As long as the reclustering stop criterion is not fulfilled, the subsequent reclustering procedure is considered necessary.

For a current number of classes and existing classes labeled as , let and denote the worst evaluated class and the corresponding most misassigned class during the entire training of the respective neural net (i.e. until the training stop criterion is fulfilled), which are defined as

and

respectively (recall the definition of in (1)). The class labeled as is merged into class . Furthermore, since we always assume the labels to be consecutive, the inputs with the largest label are assigned the label of the dropped class .
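The merge-and-relabel step can be sketched as follows. The sketch assumes that the worst class is the one with the lowest row-normalised diagonal entry of the overall confusion matrix and that the merge target is the class receiving most of its misclassified inputs; the function name and return convention are hypothetical.

```python
def merge_worst_class(labels, overall_confusion):
    """Merge the worst class into its most confused class and keep labels consecutive.

    Returns the relabelled list together with the merged and the target class.
    """
    n = len(overall_confusion)
    accuracy = [row[k] / sum(row) for k, row in enumerate(overall_confusion)]
    worst = min(range(n), key=lambda k: accuracy[k])
    target = max(
        (j for j in range(n) if j != worst),
        key=lambda j: overall_confusion[worst][j],
    )
    last = n - 1

    def new_label(y):
        if y == worst:
            y = target  # merge the worst class into its target
        if y == last and worst != last:
            y = worst  # reuse the freed label for the largest label
        return y

    return [new_label(y) for y in labels], worst, target
```

After the call, the labels again form a consecutive range with one class fewer, so a freshly initialised network with a smaller output layer can be trained on them.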

Each time after relabeling, the weights corresponding to the remaining classes in the cost function are adjusted to be again inversely proportional to the current shares of the classes in order to warrant a well-balanced training of the updated classifier and the neural net is re-initialised.

2.4.4 Final number of classes

In the context of anomaly detection, we are dealing with a trade-off between optimising the classification accuracy on normal data to prevent false positives (i.e. eliminating classes that are easily confused) and maintaining the ability to misclassify abnormal data for the sake of anomaly detection (i.e. retaining sufficiently many classes characterising different phases within a period). Keeping this in mind, the final number of classes determining the ultimate classifier neural network is selected in the following manner:

Given a maximum allowed number of classes with an even number , the starting initial number of classes is set to . Each time for an updated initial number of classes , the relabeling procedure described in Section 2.4.3 is run at most -times (i.e. with at least remaining classes). If the reclustering stop criterion is fulfilled after relabeling -times, the candidate final number of classes related to is set to and the corresponding neural net is stored. If , the updating processes of is finished; otherwise is reduced by . The overall final number of classes refers to the maximum of taken along the entire path of , i.e.  and the final classifier neural network is the one stored when this overall maximum was achieved. If this maximum was achieved more than once, we choose the neural network corresponding to the highest such that achieved this maximum. This is because a high value of corresponds to a narrow sliding window (cf. Section 2.2.2) and hence maximises the sensitivity of the anomaly detector.

If in the end no suitable network has been stored, we increase and rerun the algorithm.
Finally, it is worth mentioning that once all the hyperparameters are determined, the whole training algorithm introduced above, including data pre-processing and reclustering, is implemented in a machine-learning manner so that the classification and anomaly detection process can be accomplished fully automatically.

3 Applications and results on example datasets

In the subsequent sections, we present the results of our machine-learning algorithm for anomaly detection applied to three example datasets corresponding to the domains of cardiology, industry, and signal processing, confirming the feasibility of the method in a range of applications. The cardiology dataset is the most complex and challenging dataset, as the recordings taken from healthy control patients exhibit a high level of diversity which needs to be captured by the classifier. This diversity mandates the use of a more complex representation which is one of the strengths of deep neural networks over other parametric models. The other two datasets demonstrate the applicability of the method in different contexts, including the detection of anomalies occurring only at certain instances in time.

3.1 Cardiology dataset

The PTB Diagnostic ECG Database is a database created by the Physikalisch-Technische Bundesanstalt (PTB) consisting of 549 electrocardiogram (ECG) records gathered from 290 subjects aged 18 to 87. The ECGs were recorded using a non-commercial PTB prototype recorder, the specifications of which can be found on the database website (https://physionet.org/physiobank/database/ptbdb/). The dataset is part of PhysioNet [11] and is further described in [3].

3.1.1 Input data

We use and of the measurements from healthy patients for training and validation, respectively. The trained classifier is tested on the remainder of the data from healthy patients and data from all ill patients.

Due to the large data volume, we manually resample the input data to a sample rate of samples per second instead of the original before feeding it into the neural network. This operation is not strictly necessary, but it speeds up the training process. Also, we only use the first 60 periods of each recording during training and for testing. We train our classifier with resampled time series from healthy patients and use the data coming from all conventional leads and Frank leads (cf. [9]) for the ECG diagnostic, resulting in a convolutional neural net with channels on the input layer.

3.1.2 Period detection

The first challenge when analysing ECG data consists in detecting the randomly varying periods of individual patients, for which we design a period detector. This detector is described in greater detail in Appendix A. The detector has a number of parameters which need to be adjusted to the dataset; the actual values used here are given in Table 2.

Parameter Value
prefilter window half-length
minimum base period length
maximum base period length
maximum period length deviation factor
reference window half-length factor
Table 2: Parameters for period detector on ECG database

For this dataset, the entire time series for feature ‘i’ is used as both the reference and input time series to the period detector. However, in order to ensure the requirement that no trend component exists in the signal, the first difference of the signal is used instead of the raw signal. In order to adjust for the offsets thus introduced at peak detection, between Steps 4 and 5 the reference window is adjusted to be precisely centred on the corresponding peak in the original (smoothed but not differentiated) signal, i.e., its midpoint is changed to

The maximum allowed adjustment of has empirically been found to yield satisfactory results.

The median of all observed period lengths approximately amounts to (ms).

3.1.3 Results and discussion

Figure 1: Validation loss over epochs for training the final classifier neural nets with ,
Epochs Merge New Labels
N/A
to
to
N/A
Table 3: Label History

During the training, the maximum allowed number of classes and per-class margin of error are set to and , respectively, which results in an ultimate classifier with an initial number of classes and final number of classes . The related label history is shown in Table 3. The average validation loss during the training of the respective neural nets is presented in Figure 1. A training accuracy of and a validation accuracy of are achieved.

Figures 2 and 3 illustrate the result of testing the trained classifier on three patients from the category ‘healthy control’ and three ill patients: the measurements on feature ‘i’ from the test patients are presented in a temporal resolution of ms and the bars in the upper and lower halves of the figures refer to the predicted classes and the true labels of the input data from the corresponding sliding windows, respectively. (Note that here and in the sequel the coloured bars in these diagrams are always plotted between the beginnings of adjacent windows, thus only covering approximately the first third of each sliding window.)

Figure 4 presents a statistical evaluation of the per-patient test results on patients from the most recorded categories in the considered database: ‘dysrhythmia’, ‘valvular heart disease’, ‘cardiomyopathy/heart failure’, ‘bundle branch block’, ‘hypertrophy’, ‘myocardial infarction’, and ‘healthy control’. The lines in different colours represent the empirical distribution functions of the per-patient classification accuracy from the aforementioned categories. Observe that the blue line related to healthy patients is located in the bottom right corner of the diagram, to the left of which all other lines corresponding to ill patients are centred (cf. the median for each category), which enables us to distinguish ill patients from healthy patients in some cases. For instance, if we take per-patient classification accuracy as threshold value, all test patients from the categories ‘dysrhythmia’ and ‘valvular heart disease’, and nearly of the patients from the categories ‘cardiomyopathy/heart failure’ and ‘myocardial infarction’, respectively, and over of the patients from the categories ‘bundle branch block’ and ‘hypertrophy’ will be considered as anomalies, whereas apart from three false positive results ( of) all tested patients from the category ‘healthy control’ will be assessed as normal. Since the sample sizes provided for the individual categories vary considerably (e.g. there are subjects for ‘myocardial infarction’ whereas the entire category ‘healthy control’ consists of only subjects including training, validation and test data applied in our context), we are not in a position to make a general statement on the choice of an ideal threshold value. Table 4 provides a statistical evaluation of the per-disease average classification accuracy. It turns out that the category ‘healthy control’ presents by far the best test result compared to all other categories related to heart disease (anomaly).
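The per-patient evaluation described above can be sketched as follows: an empirical distribution function over per-patient accuracies, and a simple threshold rule that flags patients whose accuracy falls below a chosen value. The function names, patient identifiers, and numbers are purely illustrative and are not results from the paper.

```python
def empirical_cdf(values, x):
    """Fraction of observations that are less than or equal to x."""
    return sum(v <= x for v in values) / len(values)


def flag_patients(accuracy_by_patient, threshold):
    """Mark a patient as anomalous when the classification accuracy is below the threshold."""
    return {pid: acc < threshold for pid, acc in accuracy_by_patient.items()}


# Illustrative values only.
flags = flag_patients({"patient_a": 0.60, "patient_b": 0.97}, threshold=0.90)
```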

Note that our anomaly detection scheme does not incorporate any specific cardiological knowledge. It gives an indication of whether a patient may be ill or not: it detects deviations from the known healthy data and does not classify the diseases separately. It also only gives a statistical indication, which is a result somewhat similar to the one reported in [8], where it was observed that the ECGs of ill patients showed deviations in certain affine dependencies usually present between the 12-lead and 3-lead ECGs of healthy patients.

Figure 2: Classifier applied to patients from category ‘healthy control’
(a) Classifier applied to a dysrhythmia patient
(b) Classifier applied to a valvular-heart-disease patient
(c) Classifier applied to a myocardial-infarction patient
Figure 3: Classifier applied to ill patients
Figure 4: Distribution of per-patient classification accuracy evaluated on test patients from different categories
Disease Classification Accuracy
Valvular heart disease
Dysrhythmia
Cardiomyopathy/Heart failure
Myocardial infarction
Bundle branch block
Hypertrophy
Healthy control
Table 4: Results of per-disease classification accuracy

3.2 SCADA dataset

In [16], Antoine Lemay and José M. Fernandez describe a simulation of an industrial control system, specifically designed for providing supervisory control and data acquisition (SCADA) network datasets for intrusion detection research. The generated datasets are openly available on GitHub (https://github.com/antoine-lemay/Modbus_dataset) and contain periods of regular operation, manual interactions with the system, and anomalies caused by network intrusions. Since the operation of the simulated system is cyclic, the resulting data is mostly periodic.

3.2.1 Input data

Among the available datasets with common characteristics, we choose the first and the last of the dataset named ‘characterization_modbus_6RTU_with_operate’ with a duration of 5.5 minutes in total for training and validation, respectively, where neither the injected malicious activities nor the manual operations included are labeled, both resulting in a certain proportion of abnormal data points (noise) in the corresponding time series. The trained classifier is tested on the only three correctly labeled datasets ‘moving_two_files_modbus_6RTU’ (‘Test Data 1’), ‘CnC_uploading_exe_modbus_6RTU_with_operate’ (‘Test Data 2’), and ‘send_a_fake_command_modbus_6RTU_with_operate’ (‘Test Data 3’), including no manual operations, a small portion of manual operations, and a large amount of noise, e.g. manual operations (causing non-intrusion anomalies), respectively. In each dataset, four features are considered: number and total size of sent packets, and number of active IP and port pairs. At one-second intervals, we record the increase in each feature and consider the corresponding four-dimensional time series.

The given 10-second polling interval yields periodic characteristics of the considered time series with a fixed period length of .

3.2.2 Results and discussion

Setting and for training, the final classifier uses and . The respective label history and the evolution of the average validation loss are presented in Table 5 and Figure 5, respectively.

In Figure 6, the number of active port pairs extracted from ‘Test Data 1’ is plotted against time (in seconds) and the bars in the upper and lower halves represent the classes predicted by our trained neural net and the true labels of the input data from the corresponding sliding windows, respectively; data points which result in prediction errors are considered anomalies.
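The decision rule stated above (a window is anomalous when its predicted class differs from its true phase label) can be sketched as follows. This is a minimal illustration under stated assumptions rather than the implementation used in the experiments: the helper names, the toy period of 4 samples, and the phase-to-class mapping are hypothetical.

```python
def true_phase_labels(n_windows, period, phase_to_class, offset=0):
    """Label window i with the class of its phase (i + offset) mod period."""
    return [phase_to_class[(i + offset) % period] for i in range(n_windows)]


def flag_anomalies(predicted, expected):
    """Return indices of windows whose predicted class differs from the label."""
    return [i for i, (p, e) in enumerate(zip(predicted, expected)) if p != e]


# Hypothetical example: period of 4 samples, phases 2 and 3 merged into class 2.
phase_to_class = {0: 0, 1: 1, 2: 2, 3: 2}
expected = true_phase_labels(8, 4, phase_to_class)
predicted = [0, 1, 2, 2, 0, 2, 2, 2]  # window 5 is misclassified
anomalies = flag_anomalies(predicted, expected)
```

Merging similar phases into one class, as in the toy mapping, means that confusing two merged phases is not counted as an error.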

The final results of our anomaly detection algorithm on the entire test data are summarised in Table 6. In the first two (cleaner) test datasets, which contain no or only a small amount of manual operations, all cyber attacks in the test data are detected along with a single false positive detection (corresponding to false detection rate in ‘Test Data 1’). On the last test dataset, which includes a large amount of noise, the classifier does not perform as well. This is to be expected, since only malicious activities, but no manual operations or other types of interference, are labeled as anomalies, and our time series analysis does not take the respective context into account.

Indeed, the SCADA datasets which are applicable in our setting are quite small. Due to the incompatibility between datasets with small and large amounts of noise (i.e. non-intrusion anomalies appearing in the form of pulses), it is difficult to choose one suitable dataset for training and to test the detector on datasets with different characteristics; e.g., it would be infeasible to train a detector on one of the cleaner datasets and then test it against a noisy dataset, or vice versa. For a more extensive treatment of anomaly detection on a richer dataset, cf. Section 3.3.

Figure 5: Validation loss over epochs for training the final classifier neural nets with ,
Epochs Merge New Labels
N/A
to
to
to
to
to
to
N/A
Table 5: Label History
Figure 6: Classifier applied to test data
Dataset | Detection Rate | False Positives
Test Data 1
Test Data 2
Test Data 3
Table 6: Results of Intrusion Detection

3.3 Wave dataset

The waves dataset is a synthetic dataset loosely modelled on a system transmitting a periodic signal. From the theory of Fourier analysis, every differentiable periodic signal with frequency can be decomposed into its frequency components

cf. [22, Theorem 2.1], which motivates the principal rule of our wave generator. In our consideration, the generated waves have no DC offset, i.e. , and components only up to frequency , i.e.  for all . The signals are supposed to be transmitted over a noisy channel which we assume to add filtered Brownian and white noise. The wave generator also has some inherent randomness in the form of clock jitter, amplitude noise, and phase noise. There are also a number of fault conditions which form the basis of the anomalies to be detected.
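For orientation, the decomposition referred to at the start of this section can be written in a standard amplitude-phase form. The symbols used here (amplitudes a_k, phases \varphi_k, base frequency f, cut-off index K) are assumptions made for this illustration and need not match the notation of the original display.

```latex
x(t) = a_0 + \sum_{k=1}^{\infty} a_k \cos\bigl(2\pi k f t + \varphi_k\bigr),
\qquad a_0 = 0, \quad a_k = 0 \ \text{for all } k > K .
```

The two side conditions express the absence of a DC offset and the restriction to finitely many frequency components, respectively.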

3.3.1 Wave generator

The waves in this dataset are of the form

, with given by

and . Here, is a Gaussian white noise process, i.e., are independent and identically distributed (i.i.d.) random variables with for all , and , , , and are independent (discrete) Ornstein-Uhlenbeck processes with individual sets of parameters. In general, the Ornstein-Uhlenbeck process obeys the stochastic differential equation

(2)

where , , , and is a standard Brownian motion, cf. e.g. [20, Ex. 6.6]. In discrete time, a process following (2) can be approximated by generating i.i.d. random variables with for all and exponentially smoothing them:

(3)

Indeed, letting

the process with

is a random walk with Gaussian increments and thus corresponds to a discretely sampled standard Brownian motion [19, (1.9)]. Therefore, (3) can be written as

which yields a discrete counterpart of (2). The Ornstein-Uhlenbeck process can be thought of as a process performing a random walk where the increments are biased towards the mean . As such, it behaves locally like a Brownian motion, causing the power of the higher-frequency parts of its spectrum to average (Brownian noise). The process can be used to model parameters of systems that tend to shift over time, while generally remaining close to a certain average value.
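A discrete mean-reverting process of this kind can be simulated in a few lines. The sketch below assumes an Euler step of size 1 and uses the hypothetical parameter names `mu` (mean), `theta` (rate of mean reversion), and `sigma` (noise scale); it illustrates the idea of exponentially smoothing Gaussian noise and is not claimed to reproduce the exact recursion (3).

```python
import random


def ornstein_uhlenbeck(n, mu, theta, sigma, x0=None, seed=None):
    """Simulate n samples of a discrete-time mean-reverting process.

    Update rule (Euler step of size 1, an assumption):
        x[k+1] = x[k] + theta * (mu - x[k]) + sigma * eps[k],
    with eps[k] i.i.d. standard normal. Equivalently,
        x[k+1] = (1 - theta) * x[k] + theta * mu + sigma * eps[k],
    i.e. exponential smoothing with factor theta driven by Gaussian noise.
    """
    rng = random.Random(seed)
    x = mu if x0 is None else x0
    path = [x]
    for _ in range(n - 1):
        x = x + theta * (mu - x) + sigma * rng.gauss(0.0, 1.0)
        path.append(x)
    return path


# Without noise the process relaxes geometrically towards the mean.
noise_free = ornstein_uhlenbeck(4, mu=2.0, theta=0.5, sigma=0.0, x0=0.0)
# With noise it fluctuates around the mean.
noisy = ornstein_uhlenbeck(5000, mu=2.0, theta=0.05, sigma=0.1, seed=0)
```

The noise-free run makes the pull towards the mean visible, while the noisy run shows the slow drift around it that motivates using such processes for amplitude, phase, and clock variations.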

For each single wave, a set of parameters controlling the governing processes is randomly generated using the parameters in Table 7.

Process
N/A N/A
Table 7: Parameters for processes governing generated waves ()

The means of the processes for amplitude and phase variation are sampled according to the following law:

for , where denotes the uniform distribution on the interval . They remain constant throughout the wave and determine the overall shape of the wave.

3.3.2 Generated anomalies

Based on the parameters and processes employed by the wave generator, we inject the following four types of anomalies or noise:

Amplitude anomalies

The amplitude process of one of the frequency components (i.e., for a single ) is increased by , where is randomly sampled for each anomaly according to the law .

Phase anomalies

The phase process of one of the frequency components is changed. The amount of change is randomly sampled for each anomaly from the distribution resulting in a random phase change of at least and at most .

Pulse anomalies

A pulse of random amplitude is added onto the wave. For each anomaly, the amplitude of the pulse is randomly sampled according to the law and the pulse width is a random integer drawn from the interval .

White noise

The white noise process is amplified by a factor which is randomly sampled for each anomaly according to the law .

For each wave, a segment of samples is generated. Then segments, each consisting of samples, are generated, and the anomaly or noise is injected into the last samples of each of them. For the evaluation, we use generated waves, resulting in a number of anomalies and waves with increased white noise in the test dataset.
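As an illustration of how such a fault can be injected, the following sketch adds a pulse to a generated wave. The rectangular pulse shape, the function name, and the example values are assumptions; the amplitude and width distributions of the actual generator are not reproduced here.

```python
def inject_pulse(wave, start, width, amplitude):
    """Return a copy of `wave` with a rectangular pulse on [start, start + width)."""
    if start < 0 or width <= 0 or start + width > len(wave):
        raise ValueError("pulse must lie inside the wave")
    out = list(wave)
    for i in range(start, start + width):
        out[i] += amplitude
    return out


clean = [0.0] * 10
faulty = inject_pulse(clean, start=6, width=3, amplitude=1.5)
```

Because the function returns a copy, the clean wave remains available for training and validation while the modified copy serves as test data.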

3.3.3 Input data and period detection

The generated waves are considered in groups, where each group consists of a normal wave recorded over periods and related abnormal waves each corresponding to a single type of anomaly with a normal start-up time of at least time units (i.e. the first entrance time of anomalies following the respective normal wave is to the right of the time stamp ). In each group, we take the first and the remainder of the normal wave for training and validation, respectively, and subsequently test the trained classifier on the respective abnormal waves.

Since the simulated waves contain interference in the time component which results in random period lengths , we again make use of the period detector described in Appendix A using the parameters specified in Table 8.

Parameter Value
prefilter window half-length
minimum base period length
maximum base period length
maximum period length deviation factor
reference window half-length factor
Table 8: Parameters for period detector on wave dataset

Note that in contrast to the treatment of ECG data, in each data group the reference window is selected among the subpatterns extracted from the training data.

By construction, the average period length equals time units.

3.3.4 Results

Throughout the entire training, we set the maximum number of classes and allowed per-class margin of error to and , respectively. Overall, an average classification accuracy of is achieved on both training and validation data.

Figures 7, 8, and 9 and Figure 10 present the detection results of our classifiers, each trained on an individual example wave, for different types of anomalies and for white noise, respectively. Again, in each diagram the bars in the upper and lower halves refer to the predicted classes and true labels of the data from the corresponding sliding windows fed into the trained classifier, respectively. Notice that in Figure 10, slightly increased white noise does not lead to any classification errors, which suggests a degree of robustness of our classifier against noise.

The final results of our anomaly detection algorithm tested on the groups of synthetic waves are shown in Table 9. The numbers of anomalies and white-noise cases are obtained by counting the waves injected with the respective type of interference, whereas the denominator for evaluating false positives equals the number of available prediction windows in the clean test data. Overall, our algorithm yields high detection rates for all types of injected anomalies ( on average). The small rate of false positives () confirms the high prediction accuracy of our classification algorithm, and the low error rate in the presence of increased white noise indicates that the classifier is, to a certain extent, robust against noise.
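The two kinds of rates use different denominators, which the following sketch makes explicit. The function names and all counts are hypothetical and serve only to illustrate the bookkeeping; they are not results from Table 9.

```python
def detection_rate(detected_flags):
    """Fraction of anomalous waves in which the anomaly was reported.

    `detected_flags` holds one boolean per wave injected with an anomaly.
    """
    return sum(1 for flag in detected_flags if flag) / len(detected_flags)


def false_positive_rate(false_alarm_windows, clean_windows):
    """Fraction of prediction windows in clean data that raised an alarm."""
    return false_alarm_windows / clean_windows


# Hypothetical counts: 3 of 4 anomalous waves detected,
# 3 false alarms among 1200 clean prediction windows.
rate = detection_rate([True, True, False, True])
fp_rate = false_positive_rate(3, 1200)
```

Counting detections per wave but false positives per window keeps a single long clean recording from being scored as one all-or-nothing case.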

Figure 7: Classifier with , applied to wave with shock injected at time stamp 3072
Figure 8: Classifier with , applied to wave with abnormal phases starting at time stamp 2048
Figure 9: Classifier with , applied to wave with abnormal amplitudes starting at time stamp 2048
Figure 10: Classifier with , applied to wave with slightly increased white noise () starting at time stamp 2048
Type Detection Rate %
Phases
Amplitudes
Pulse
Total Anomalies
False Positives
White noise ()
White noise ()
Table 9: Results of Anomaly Detection

4 Conclusion

In this paper, we proposed a novel approach to detecting anomalies in time series exhibiting periodic characteristics, where we applied deep convolutional neural networks for phase classification and automated phase similarity tagging. We evaluated our approach on three datasets corresponding to the domains of cardiology, industry, and signal processing, confirming that our method is feasible in a number of contexts.

Appendix A Period detection scheme

In this section, we provide the details for the period detection scheme used for the ECG and synthetic datasets. This period detection scheme is primed using a reference signal and then applied to the actual input signal . It is assumed that the input signals do not have a trend component, which can be achieved by a suitable transformation of the input signals, such as taking the first difference as in the ECG data case, cf. Section 3.1. The detection is now performed in the following steps:

  1. Smooth the signals by applying a rolling mean

  2. Infer approximate base period using the autocorrelation of the reference signal

  3. Detect peaks in the reference signal spaced approximately one base period apart using a simple peak detection logic

  4. Take the average of segments around the detected peaks and find one reference segment which most closely matches this average

  5. Cross-correlate the input signal with the reference segment

  6. Detect peaks in the cross-correlation spaced approximately one base period apart using again the simple peak detection logic

The steps are described in more detail in the following paragraphs:

Step 1: The raw signals and are subjected to a rolling mean filter, resulting in smoothed signals and , respectively, i.e.,

The window length of this filter is chosen to provide just enough filtering to dampen some of the noise contained within the input signal.
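A rolling mean of this kind can be sketched as follows. The window is assumed to be centred with a given half-length (in line with the "prefilter window half-length" parameter), and it is truncated at the signal boundaries; the edge handling is an assumption, since it is not specified above.

```python
def rolling_mean(signal, half_length):
    """Centred moving average with window length 2 * half_length + 1.

    Near the boundaries the window is truncated to the available samples
    (an assumption; the edge handling is not specified in the text).
    """
    n = len(signal)
    smoothed = []
    for t in range(n):
        lo = max(0, t - half_length)
        hi = min(n, t + half_length + 1)
        smoothed.append(sum(signal[lo:hi]) / (hi - lo))
    return smoothed


noisy = [0.0, 3.0, 0.0, 3.0, 0.0, 3.0, 0.0]
smooth = rolling_mean(noisy, half_length=1)
```

In the toy example the sample-to-sample swing shrinks from 3 to 1, which is the damping effect the prefilter is meant to provide.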

Step 2: The sample autocorrelation of the (smoothed) reference signal at lag is computed via

with

(cf. [4, 2.1.5]), where and denote the sample size and sample mean of the reference signal , respectively. Now the mean period length is inferred by taking the of restricted to some interval , i.e.,

A plot of an example autocorrelation function is shown in Figure 11; the inferred mean period length is indicated by the vertical line.

Figure 11: Autocorrelation function of one of the ECG database records
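Step 2 can be sketched with the standard sample autocorrelation estimator r_k = c_k / c_0 from [4]. The function names, the search interval, and the sinusoidal test signal are assumptions chosen for this illustration.

```python
import math


def sample_autocorrelation(z, lag):
    """Sample autocorrelation r_k = c_k / c_0, where
    c_k = (1/N) * sum_{t=0}^{N-k-1} (z[t] - mean) * (z[t+k] - mean)."""
    n = len(z)
    mean = sum(z) / n

    def c(k):
        return sum((z[t] - mean) * (z[t + k] - mean) for t in range(n - k)) / n

    return c(lag) / c(0)


def base_period(z, min_period, max_period):
    """Lag in [min_period, max_period] maximising the sample autocorrelation."""
    return max(range(min_period, max_period + 1),
               key=lambda k: sample_autocorrelation(z, k))


# Hypothetical reference signal with a period of 25 samples.
reference = [math.sin(2 * math.pi * t / 25) for t in range(500)]
period = base_period(reference, min_period=10, max_period=40)
```

Restricting the search to an interval of plausible period lengths excludes both the trivial maximum at lag 0 and the maxima at integer multiples of the period.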

Step 3: The reference signal is now fed into a simple peak detector which proceeds to inductively find peaks spaced approximately one base period apart via

where is a tolerance value to account for the variability of period lengths in the signals.
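One plausible reading of such a simple peak detector is sketched below. The concrete rule (first peak as the maximum over the first base period, each further peak as the maximum within a tolerance window one base period after the previous peak) is an assumption standing in for the formula of Step 3, as are the function and parameter names.

```python
def detect_peaks(signal, base_period, tolerance):
    """Greedy peak detector: peaks spaced roughly one base period apart.

    Assumed rule: the first peak is the maximum over the first base period;
    each further peak is the maximum over the window
    [p + (1 - tolerance) * T, p + (1 + tolerance) * T] after the previous
    peak p. Peaks whose window would extend past the signal end are skipped.
    """
    if not signal or base_period < 1 or not 0 <= tolerance < 1:
        raise ValueError("need a non-empty signal, T >= 1, 0 <= tolerance < 1")
    n = len(signal)
    peaks = [max(range(min(base_period, n)), key=lambda t: signal[t])]
    while True:
        prev = peaks[-1]
        lo = max(prev + 1, prev + round((1 - tolerance) * base_period))
        hi = prev + round((1 + tolerance) * base_period)
        if hi >= n:
            break
        peaks.append(max(range(lo, hi + 1), key=lambda t: signal[t]))
    return peaks


# Hypothetical signal whose period lengths vary between 9 and 11 samples.
signal = [0.0] * 40
for position in (5, 15, 26, 35):
    signal[position] = 1.0
peaks = detect_peaks(signal, base_period=10, tolerance=0.2)
```

Because each search window is anchored at the previous peak, a single wrong detection shifts all subsequent windows, which is one way to understand why this detector is sensitive to glitches.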

Step 4: The detector now extracts subpatterns from the reference signal centred at the peaks , i.e., . Here, is another tolerance parameter to mitigate the effects of period length variability. Then the seasonal means

are computed. Here denotes the total number of subpatterns. Let now

The choice of ensures that is the subpattern with maximum similarity to the mean and is thus suited as a reference pattern.
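The selection of the reference pattern can be sketched as follows. Similarity is measured here by the squared Euclidean distance to the mean pattern, which is an assumption; the function name and the toy signal are likewise hypothetical.

```python
def select_reference_pattern(signal, peaks, half_length):
    """Pick the subpattern closest to the mean of all subpatterns.

    Subpatterns are windows of length 2 * half_length + 1 centred at the
    peaks; peaks too close to the boundary are skipped. Closeness is the
    squared Euclidean distance to the mean pattern (an assumption).
    """
    patterns = [signal[p - half_length:p + half_length + 1]
                for p in peaks
                if p - half_length >= 0 and p + half_length < len(signal)]
    if not patterns:
        raise ValueError("no complete subpattern available")
    length = 2 * half_length + 1
    mean = [sum(pat[j] for pat in patterns) / len(patterns)
            for j in range(length)]

    def distance(pat):
        return sum((a - b) ** 2 for a, b in zip(pat, mean))

    return min(patterns, key=distance), mean


# Hypothetical signal with three bumps of different height.
signal = [0.0, 1.0, 0.0, 0.0, 2.0, 0.0, 0.0, 9.0, 0.0]
reference, mean_pattern = select_reference_pattern(signal, [1, 4, 7], 1)
```

Choosing an actually observed subpattern instead of the mean itself avoids a reference that is blurred by averaging over periods of slightly different length.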

Step 5: The reference pattern is now used for detecting the periods in the input signal by computing the cross-correlation function:

Step 6: Finally the simple peak detector from Step 3 is applied to the cross-correlation to obtain the final segment beginnings.
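Steps 5 and 6 rely on a sliding inner product between the input signal and the reference pattern. The sketch below uses an unnormalised cross-correlation over fully overlapping shifts, which is an assumption, and a hypothetical toy signal containing three copies of the pattern.

```python
def cross_correlation(signal, pattern):
    """c[t] = sum_j signal[t + j] * pattern[j] for fully overlapping shifts t."""
    m = len(pattern)
    return [sum(signal[t + j] * pattern[j] for j in range(m))
            for t in range(len(signal) - m + 1)]


pattern = [-1.0, 2.0, -1.0]
signal = [0.0] * 20
for start in (3, 10, 16):  # hypothetical segment beginnings
    for j, value in enumerate(pattern):
        signal[start + j] += value

xcorr = cross_correlation(signal, pattern)
matches = [t for t, c in enumerate(xcorr) if c == max(xcorr)]
```

The cross-correlation peaks exactly where the pattern occurs, so feeding it to the peak detector of Step 3 yields segment beginnings that depend on the whole shape of a period rather than on a single extreme sample.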

A comparison of the periods detected by the simple peak detector from Step 3 and the cross-correlating period detector from Step 6 can be seen in Figure 12. The top graph shows the input to the simple peak detector, the bottom graph shows the cross-correlation; the gray boxes in the top half of the backgrounds represent the segments inferred by the simple peak detector, those in the bottom half represent those found by the cross-correlating period detector. Notice how glitches in the input signal easily manage to confuse the simple peak detector while the cross-correlating period detector is robust to such perturbations.

Figure 12: Comparison of periods detected in Steps 3 and 6

References

  • [1] Yoshua Bengio, Patrice Simard, and Paolo Frasconi. Learning Long-Term Dependencies with Gradient Descent is Difficult. IEEE Transactions on Neural Networks, 5(2):157–166, March 1994.
  • [2] Anastasia Borovykh, Sander Bohte, and Cornelis W. Oosterlee. Dilated Convolutional Neural Networks for Time Series Forecasting. Journal of Computational Finance (online early), October 2018.
  • [3] R. Bousseljot, D. Kreiseler, and A. Schnabel. Nutzung der EKG-Signaldatenbank CARDIODAT der PTB über das Internet. Biomedizinische Technik/Biomedical Engineering, 40(s1):317–318, 1995.
  • [4] George E. P. Box, Gwilym M. Jenkins, and Gregory C. Reinsel. Time Series Analysis: Forecasting and Control. Wiley Series in Probability and Statistics. John Wiley & Sons, Inc., fourth edition, July 2008.
  • [5] M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander. LOF: Identifying Density-Based Local Outliers. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, pages 93–104. SIGMOD, 2000.
  • [6] Jinghui Chen, Saket Sathe, Charu Aggarwal, and Deepak Turaga. Outlier Detection with Autoencoder Ensembles. In Proceedings of the 2017 SIAM International Conference on Data Mining, pages 90–98, 2017.
  • [7] Taurus T. Dang, Henry Y. T. Ngan, and Wei Liu. Distance-Based k-Nearest Neighbors Outlier Detection Method in Large-Scale Traffic Data. In Proceedings of the 2015 IEEE International Conference on Digital Signal Processing (DSP 2015), pages 507–510, 2015.
  • [8] Drew Dawson, Hui Yang, Milind Malshe, Satish T. S. Bukkapatnam, Bruce Benjamin, and Ranga Komanduri. Linear affine transformations between 3-lead (Frank XYZ leads) vectorcardiogram and 12-lead electrocardiogram signals. Journal of Electrocardiology, 42(6):622–630, 2009.
  • [9] Ernest Frank. An Accurate, Clinically Practical System For Spatial Vectorcardiography. Circulation, 13(5):737–749, May 1956.
  • [10] Felix A. Gers, Jürgen Schmidhuber, and Fred Cummins. Learning to Forget: Continual Prediction with LSTM. Neural Computation, 12(10):2451–2471, October 2000.
  • [11] A. L. Goldberger, L. A. N. Amaral, L. Glass, J. M. Hausdorff, P. Ch. Ivanov, R. G. Mark, J. E. Mietus, G. B. Moody, C.-K. Peng, and H. E. Stanley. PhysioBank, PhysioToolkit, and PhysioNet: Components of a New Research Resource for Complex Physiologic Signals. Circulation, 101(23), June 2000.
  • [12] Sepp Hochreiter, Yoshua Bengio, Paolo Frasconi, and Jürgen Schmidhuber. Gradient Flow in Recurrent Nets: the Difficulty of Learning Long-Term Dependencies. In John F. Kolen and Stefan C. Kremer, editors, A Field Guide to Dynamical Recurrent Networks, pages 237–244. John Wiley & Sons, January 2001.
  • [13] Karin Kailing, Hans-Peter Kriegel, and Peer Kröger. Density-Connected Subspace Clustering for High-Dimensional Data. In Proceedings of the 2004 SIAM International Conference on Data Mining, pages 246–256, 2004.
  • [14] Diederik P. Kingma and Jimmy L. Ba. Adam: A Method for Stochastic Optimization. In International Conference on Learning Representations, 2015.
  • [15] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet Classification with Deep Convolutional Neural Networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 1097–1105. Curran Associates, Inc., 2012.
  • [16] Antoine Lemay and José M. Fernandez. Providing SCADA network data sets for intrusion detection research. In 9th USENIX Workshop on Cyber Security Experimentation and Test (CSET ’16), Austin, TX. USENIX Association, August 2016.
  • [17] Hamid Louni. Outlier detection in ARMA models. Journal of Time Series Analysis, 29(6), November 2008.
  • [18] Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. On the difficulty of training Recurrent Neural Networks. In 30th International Conference on Machine Learning, ICML, pages 2347–2355, November 2013.
  • [19] Daniel Revuz and Marc Yor. Continuous Martingales and Brownian Motion, volume 293 of Grundlehren der mathematischen Wissenschaften. Springer-Verlag Berlin Heidelberg, third edition, 1999.
  • [20] Steven Shreve. Stochastic Calculus for Finance II: Continuous-Time Models. Springer Finance Textbooks. Springer-Verlag New York, first edition, 2004.
  • [21] Samuel L. Smith, Pieter-Jan Kindermans, and Quoc V. Le. Don’t Decay the Learning Rate, Increase the Batch Size. In International Conference on Learning Representations, 2018.
  • [22] Elias M. Stein and Rami Shakarchi. Fourier Analysis: An Introduction, volume 1 of Princeton Lectures in Analysis. Princeton University Press, 2003.
  • [23] Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alex A. Alemi. Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning. In ICLR 2016 Workshop, 2016.
  • [24] Jo-Anne Ting, Evangelos Theodorou, and Stefan Schaal. A Kalman filter for robust outlier detection. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 1514–1519, 2007.
  • [25] Haowen Xu, Wenxiao Chen, Nengwen Zhao, Zeyan Li, Jiahao Bu, Zhihan Li, Ying Liu, Youjian Zhao, Dan Pei, Yang Feng, Jie Chen, Zhaogang Wang, and Honglin Qiao. Unsupervised Anomaly Detection via Variational Auto-Encoder for Seasonal KPIs in Web Applications. In Proceedings of the 2018 World Wide Web Conference, WWW ’18, pages 187–196, Republic and Canton of Geneva, Switzerland, 2018. International World Wide Web Conferences Steering Committee.
  • [26] Hongju Yan and Hongbing Ouyang. Financial Time Series Prediction Based on Deep Learning. Wireless Personal Communications, 102(2):683–700, September 2018.
  • [27] Chin-Chia Michael Yeh, Yan Zhu, Liudmila Ulanova, Nurjahan Begum, Yifei Ding, Hoang Anh Dau, Diego Furtado Silva, Abdullah Mueen, and Eamonn Keogh. Matrix Profile I: All Pairs Similarity Joins for Time Series: A Unifying View That Includes Motifs, Discords and Shapelets. In 2016 IEEE 16th International Conference on Data Mining (ICDM), pages 1317–1322, December 2016.
  • [28] Qingbo Yin, Li-Ran Shen, Ru-Bo Zhang, Xue-Yao Li, and Hui-Qiang Wang. Intrusion detection based on hidden Markov model. In Proceedings of the Second International Conference on Machine Learning and Cybernetics, pages 3115–3118, November 2003.