BGU SISE Thesis - Tomer Meirman

Ben-Gurion University of the Negev

Faculty of Engineering Sciences

The Department of Software and Information Systems Engineering

Anomaly Detection for Aggregated Data Using Multi-Graph Autoencoder

THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE M.Sc DEGREE

By: Tomer Meirman

September 2020

Supervised By: Dr. Gilad Katz and Dr. Roni Stern

Abstract

In data systems, activities or events are continuously collected in the field to trace their proper execution. Logging, which means recording sequences of events, can be used for analyzing system failures and malfunctions, and for identifying the causes and locations of such issues. In our research we focus on creating anomaly detection models for system logs. The task of anomaly detection is identifying unexpected events in a dataset, which differ from the normal behavior. Anomaly detection models also assist in data system analysis tasks.

Modern systems may produce such a large number of events that monitoring every individual event is not feasible. In such cases, the events are often aggregated over a fixed period of time, reporting the number of times every event has occurred in that time period. This aggregation facilitates scaling, but requires a different approach for anomaly detection. In this research, we present a thorough analysis of the aggregated data and the relationships between aggregated events. Based on the initial phase of our research, we present graph representations of our aggregated dataset, which capture the different relationships between aggregated instances in the same context.

Using the graph representation, we propose the Multiple-graphs autoencoder (MGAE), a novel convolutional graph-autoencoder model which exploits the relationships of the aggregated instances in our unique dataset. MGAE outperforms standard graph-autoencoder models in the different experiments. With our novel MGAE we present a 60% decrease in reconstruction error in comparison to a standard graph autoencoder, which is expressed in reconstructing high-degree relationships.

Keywords: Machine learning; Graph autoencoder; Graph convolutional networks; anomaly detection; unsupervised learning; Convolutional autoencoders

Acknowledgements

I would like to acknowledge both of the advisors in this research, Dr. Gilad Katz and Dr. Roni Stern. I first came to Dr. Stern with minimal knowledge of the field of research, and through his guidance, assertiveness and patience I took my first steps in analyzing, investigating and truly researching a new subject. I met Dr. Katz when he taught the "Deep Learning" course, and when he joined the research team as an advisor he had an impact on my creativity, self-dedication and research skills. Both of the advisors in this team made me love being a researcher and helped my decision to continue to PhD studies.

Finally, I would like to thank my mother Vered and my life partner Dana, who both supported me through the late nights and weekends of studying, with great understanding, respect and support.

Chapter 1 Introduction

In many software systems, runtime data, such as SQL transactions or activities, are continuously collected in the field to trace executions. These data are typically stored in log files. Logging sequences of events (or activities) can be used for analyzing system failures and malfunctions; it eases the process of identifying the causes and locations of problems in a user system or application, and increases the efficiency of finding a solution for such behavior [32].

Generally, system administrators, analysts and other IT teams are required to monitor and respond to anomalous activities in the log files. To analyze and detect anomalies in massive amounts of data, machine learning algorithms, and specifically anomaly detection, may be used as a helpful tool [3].

When monitoring an entire company or network activities, hundreds of thousands of records can be recorded every hour. Due to this, there are different ways of aggregating (summarizing) the records to create reports which will be more storage efficient.

In our research we considered the case where our data is aggregated based on frequency, i.e. aggregation based on how many times (count) we recorded the same type of log or event in the same context.

The dataset made available by IBM to this research includes log entries of all SQL commands issued by users to a database. The main goal which led us in this research was creating an anomaly detection model for this unique dataset. A major challenge in this research was that the available dataset did not contain information about specific invocations of SQL commands. Rather, the available dataset contains information about how many times (frequency aggregation) each SQL command has been issued in a specific time range, by a specific database user, in a specific session. While academic literature provides anomaly detection models for system logs based on sequential relationships (i.e. a temporal relation in which one event is reported to have occurred after another) or semantic relationships (i.e. associations that exist between the meanings of words) [37, 18, 23, 13], these models require a sequence of events, and therefore do not consider cases where the data is aggregated (count) or how to deal with such cases. On the other hand, the literature that handles aggregation, such as [21, 2, 23], uses methods which consider instances individually and might miss correlations between instances that are reflected in their frequencies.

During our research, we discovered and analyzed correlations between the different aggregated types of events. The intuition behind this analysis was that in many cases activity includes repetitive patterns, due to both repeating business routines and high-level user activities that are composed of multiple events. For example, performing a login activity by the user may always include two SQL commands: a SELECT command to fetch the stored user and password, and an INSERT command to record the login attempt. Thus, we expect that the counts of the aggregated logs that correspond to these two SQL commands will be similar.
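The login example can be illustrated with a minimal sketch of frequency aggregation. The raw log, the session identifiers, and the helper name `aggregate_by_frequency` are hypothetical and are not taken from the IBM dataset; the sketch only shows how individual invocations collapse into per-session counts, and why the SELECT and INSERT counts would be expected to match.

```python
from collections import Counter

# Hypothetical raw log: one (session_id, sql_command) pair per invocation.
raw_log = [
    ("s1", "SELECT"), ("s1", "INSERT"),   # first login in session s1
    ("s1", "SELECT"), ("s1", "INSERT"),   # second login in session s1
    ("s2", "SELECT"), ("s2", "INSERT"),   # one login in session s2
    ("s2", "UPDATE"),                     # an unrelated command
]

def aggregate_by_frequency(log):
    """Collapse individual invocations into per-session event counts."""
    counts = {}
    for session_id, command in log:
        counts.setdefault(session_id, Counter())[command] += 1
    return counts

aggregated = aggregate_by_frequency(raw_log)
for session_id, session_counts in aggregated.items():
    print(session_id, dict(session_counts))
```

After aggregation the order of the commands is lost and only the counts remain, which is the setting this research addresses.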

While our raw data is not relational data, which would fit a graph representation naturally, the initial phase of this research has shown great promise in representing each session in our data as a graph. This is due to the correlation between each couple of events in the same context, and also due to the analysis of higher-degree relationships in our dataset.

Based on the analysis conducted in this research, we propose the Multiple-graphs autoencoder (MGAE), a novel autoencoder model based on the multi-graph data representation:

MGAE is a type of autoencoder, which is an unsupervised learning technique where a neural network is trained to attempt to copy its input to its output [19]. The model uses a graph convolutional network (GCN, based on Kipf [25]) for both the encoder and the decoder. MGAE’s task is to reconstruct the multiple-graphs representation of the aggregated data in order to learn the patterns of different dynamic sessions. During the research we analysed the learning abilities of our model, comparing it to a single-graph autoencoder, which is a state-of-the-art technique with many different usages, as we present in the background section. Figure 1.1 illustrates the MGAE learning process. The process consists of transforming the dataset into graph representations. Then, the adjacency matrices are extracted from the graphs for training the multiple-graphs autoencoder, which reconstructs the adjacency matrices.

Our research contributions are as follows:

  • We are the first to propose a graph-based representation of user activity logs (in this research, user activity is presented as a session or context).

  • We are the first to apply GCNs to a multiple-graphs representation and to anomaly detection.

  • We evaluate and compare our multiple-graphs autoencoder approach to a standard graph autoencoder. The evaluation shows the decrease in reconstruction error and the benefits of using MGAE over a standard graph autoencoder.

Figure 1.1: Illustration - single-graph autoencoder architecture, from data creation to training

Chapter 2 Background and Related Work

In this chapter we review the relevant literature on anomaly detection methods, focusing on methods for logs and time series analysis. In addition, we review autoencoders and graph convolutional networks, which are the main focus of this thesis.

2.1 Linear Regression

Linear regression is considered one of the most widely used techniques for analyzing multi-factor data [35]. The technique is used in various fields such as economics [11], finance [7], accounting [8], marketing [42], politics [26], agriculture [30] and more.

The linear regression method predicts a numeric target variable by fitting the best-matching linear relationship between the target and the independent variables. The form of the best match depends on the number of independent variables. In our research we use the linear regression algorithm to fit a relationship between a target and one independent variable. The outcome of a simple linear regression model is a relationship that may be described by the following formula, as presented in [46]:

$$\hat{y}_i = \beta_0 + \beta_1 x_i$$

where $\hat{y}_i$ is the estimated value of the target based on the value $x_i$ of the independent parameter in sample or observation $i$. The value $\beta_0$ is the intercept, and the value of $\beta_1$ is the slope or gradient.

The simple linear regression model is fitted using the "Least Squares" method, which makes the sum of the squared distances between the line and the actual observations as small as possible. The least squares method goes back as far as the 1800s, when it was first applied by Legendre (1805) and Gauss (1809) for the prediction of planetary movement [40].

To understand the strength of the relationship, or in other words the correlation between the target variable and the independent variable, we can compute the coefficient of determination, also known as $R^2$ [12], based on the values predicted using the formula above. The coefficient is defined as $R^2 = 1 - u/v$, where $u = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$ is the residual sum of squares, $v = \sum_{i=1}^{n} (y_i - \bar{y})^2$ is the total sum of squares, $\bar{y}$ is the mean of the observed values, and $n$ is the number of samples. The best possible score is 1.0, and the score can be negative (because the model can be arbitrarily worse). In our research we compute the coefficient to deduce whether two events are correlated with each other.
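As a minimal sketch of the two quantities described in this section, the snippet below fits a simple least-squares line and computes the coefficient of determination from the residual and total sums of squares. The function names and the toy counts are illustrative assumptions; the thesis does not specify which implementation was used.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x (requires non-constant xs)."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return intercept, slope

def r_squared(xs, ys, intercept, slope):
    """Coefficient of determination 1 - u/v (requires non-constant ys)."""
    y_mean = sum(ys) / len(ys)
    u = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))  # residual sum of squares
    v = sum((y - y_mean) ** 2 for y in ys)                               # total sum of squares
    return 1.0 - u / v

# Hypothetical counts of two events over four sessions.
select_counts = [1, 2, 3, 4]
insert_counts = [2, 4, 6, 8]

intercept, slope = fit_line(select_counts, insert_counts)
score = r_squared(select_counts, insert_counts, intercept, slope)
print(intercept, slope, score)
```

In this toy case the second count is always twice the first, so the fitted line passes through every point and the score reaches its maximum of 1.0.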

2.2 Anomaly Detection

Anomaly detection refers to the problem of finding patterns in data that do not conform to expected behavior [5]; those patterns can also be referred to as outliers or anomalies. Over time, a variety of anomaly detection techniques have been developed in several research communities. Many of these techniques have been specifically developed for certain application domains, while others are more generic. There has been an abundance of research on anomaly detection algorithms [5]. Some anomaly detection algorithms are based on supervised learning, and rely on having access to both normal and abnormal examples. Other algorithms are based on unsupervised learning, where neither abnormal nor normal data is labeled and other methods need to be used to detect outliers. In our research the data is not labeled, and therefore the methods we used relate to unsupervised learning or to an adaptation of supervised learning, such as linear regression, for the same task.

2.2.1 Unsupervised Anomaly Detection

Some anomaly detection algorithms are unsupervised, and work by observing and characterizing the normal behavior and identifying strong deviations from this normal behavior as abnormal. Different unsupervised anomaly detection methods have been proposed, such as [1, 28, 16, 22, 14] and more. Our approach fits this unsupervised type of anomaly detection algorithm. However, we do not characterize a single SQL command as normal or abnormal, as we do not have access to individual commands but rather to their count, i.e., the number of times each command was issued in each session. Note that one can still apply unsupervised anomaly detection on the count of each command, e.g., identifying when a single event ID has been issued too many times. In this research we go beyond this, and explore whether anomalies can be identified by considering the relation between the counts of different events.

2.2.2 Anomaly Detection for System Logs

Our data set consists of SQL logs of every SQL query made by users in different applications. While system logs may differ, these types of logs can be parsed similarly, and the literature regarding anomaly detection or pattern recognition in system logs may give us a direction on how to proceed and what methods have been proposed. Log-based anomaly detection has become a research topic of practical importance both in academia and industry [22]. Different methods have been proposed to detect anomalies in system logs, such as [22, 21, 14, 27, 3].

The authors of LogMine [21] propose a method of map-reducing instances in linear time, and then using those instances to hierarchically create clusters of similar types of logs. LogMine focuses on finding patterns in logs. While LogMine focuses on first map-reducing instances of strings and later using those map-reduced patterns for an individual instance, we use the frequency feature to find correlations between instances. Landauer [27] offered a similar clustering approach to create an anomaly detection model for system logs. The procedure they propose, similarly to the model Brown [3] proposed, relies on the timestamps of each instance to detect patterns and correlations between log lines. The authors of DeepLog [14] proposed a deep neural network model utilizing Long Short-Term Memory (LSTM) to model a system log as a natural language sequence.

While there are different methods and anomaly detection models, the challenge in our research is different due to two things:

  1. The different proposed methods treat each log individually, or consider the timestamps of each log as a part of the relationship. In our research there are no timestamps for the data, and each log event is aggregated for each session; therefore it cannot be treated individually.

  2. We do not have the full query (which could be treated similarly to a log’s structure), since the data is compressed; therefore we cannot use NLP methods to model the types of instances.

We can see that while there are many state of the art solutions for anomaly detection for system logs, they do not provide us with a solution to our research questions.

2.2.3 Related Anomaly Detection Models

Sequential anomaly detection refers to detecting anomalies by considering the relationship between observed objects, and has been studied in the context of anomaly detection in time series or sequences of events [10, 14, 29]. In all cases, the available data has a clear notion of sequence. Research on anomaly detection for such cases often focuses on developing intelligent aggregating functions that allow effective anomaly detection.

In our data set we use a feature named session, which is a continuous connection between a user and an application (or server), where different types of instances (referred to as events) are aggregated by their count.

The literature also presents semantic anomaly detection [37, 17] (i.e. associations that exist between the meanings of words), which uses correlation between instances which have the same attribute origin (similar to sessions in our research). While the data may be processed and aggregated in a similar manner to our data set, the authors use semantic evaluation during the training.

Generally, sessions may give us an idea of which types of instances (events) may be related to each other, either by sequence or by semantics. However, in our dataset we lose both the semantic and the sequential evaluation options between the types of instances. While the methods above may give us an idea of how to correlate between instances, we still need to find a way to overcome the aggregation.

Zhang (2013) [51] developed an anomaly detection method for software defined networks (SDN) that considered aggregated data about network flows in its anomaly detection model. Similar to our research, they used a linear prediction method to find correlations between instances. While both their method and ours rely on aggregation based on the count (frequency) of instances, in their setting they were given access to the entire flow of events, and could therefore modify the granularity of the aggregated data dynamically.

The literature reviewed in this section shows that work in the anomaly detection field in general, and on detecting anomalies in logs more specifically, applies different methods to deal with the main goal. On the other hand, exploiting relationships in aggregated data and representing the data as graphs for graph autoencoders, which we present in the next sections, is a novel approach in this field.

2.3 Graph Learning

This section of the literature review describes the algorithm known as Graph Convolutional Networks. First we describe the algorithm and its properties; later we explain the idea of modeling temporal data using Graph Convolutional Networks; and lastly, we explain the method that we intend to take in order to analyze faults in our dataset. This section also presents and explains related works that are relevant to the research topic.

2.3.1 Graph Convolutional Networks

Graph Convolutional Networks, or GCNs, are a neural network architecture for machine learning on graphs. GCNs can be used for classification [25, 48], link prediction, pattern recognition [24, 47], etc. Formally, given a graph $G = (V, E)$ with $N$ nodes, a GCN takes as input the following:

  1. Feature Matrix $X$ ($N \times F$), where $F$ is the number of features for each node (in our case a node will be a construct id / event type).

  2. Adjacency Matrix $A$ ($N \times N$), which represents directed edges and their values (in our case, an edge value will be defined as the division between the constructs’ counts).

A hidden layer in a GCN can be written as $H^{(l+1)} = f(H^{(l)}, A)$, where $f$ is the propagation rule, i.e. the activation applied after multiplying the inputs with the weights, and $H^{(0)}$ is the feature matrix $X$.
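The hidden layer described above can be sketched with the propagation rule of Kipf and Welling [25], $f(H, A) = \mathrm{ReLU}(\hat{D}^{-1/2} (A + I) \hat{D}^{-1/2} H W)$. This is a minimal NumPy illustration, not the implementation used in this thesis: the toy graph, the random weights, and the function names are assumptions, and the normalization assumes a symmetric, non-negative adjacency matrix rather than the directed, ratio-valued matrix used for our data.

```python
import numpy as np

def normalize_adjacency(A):
    """Return D^{-1/2} (A + I) D^{-1/2} for a symmetric, non-negative A."""
    A_tilde = A + np.eye(A.shape[0])                 # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))  # degrees are >= 1 after self-loops
    return A_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gcn_layer(H, A_norm, W):
    """One propagation step: ReLU(A_norm @ H @ W)."""
    return np.maximum(A_norm @ H @ W, 0.0)

# Toy path graph with 3 nodes and one-hot node features.
A = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])
X = np.eye(3)

rng = np.random.default_rng(0)
W0 = rng.normal(size=(3, 2))   # hypothetical weights: 3 input features -> 2 hidden units

A_norm = normalize_adjacency(A)
H1 = gcn_layer(X, A_norm, W0)  # hidden representation, one row per node
print(H1.shape)
```

Stacking several such layers, each with its own weight matrix, gives a multi-layer GCN in which every node's representation mixes information from its neighbours.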

2.3.2 Modeling Relational Data with GCN

In the article by Schlichtkrull [39], the concept of Relational Graph Convolutional Networks (R-GCNs) is introduced and used for link prediction and entity classification. The link prediction model can be regarded as an autoencoder consisting of:

  1. Encoder: an R-GCN producing latent feature representations of entities

  2. Decoder: a tensor factorization model exploiting these representations to predict labeled edges.

The researchers represent a relation type (or an edge with a relation value) between two entities as $(v_i, r, v_j)$, where $v_i$ and $v_j$ are nodes and $r$ is the relation. For entity classification, the model uses an encoder with only the existing nodes (while ignoring unlabeled nodes), treating each node separately. For the task of link prediction, the authors used an encoder-decoder architecture.

We can create a similar representation by changing the relations to a numeric value that fits our dataset’s representation. However, their model operates on a static (non-temporal) graph, whereas each session represented by a graph in our dataset is dynamic and requires a different set of edges, nodes and relations for every graph representation.

2.3.3 Classification with GCNs

Generally, GCNs show great promise when it comes to scalability and efficiency. In recent years, they have been widely used for classification purposes [48, 54]. When reviewing the work published by Kipf et al. [25], we noticed two relevant contributions:

  1. A simple and well-behaved layer-wise propagation rule for neural network models which operate directly on graphs, together with a demonstration of how it can be motivated from a first-order approximation of spectral graph convolutions.

  2. Results that show high accuracy and efficiency in semi-supervised learning.

As a part of our research, we attempted to use the presented propagation rule, as both of the graph representations of the data set are sparse and large, and for that reason efficient learning is needed. However, the semi-supervised learning approach was irrelevant for our research, since the learning phase in our model is conducted with an unlabeled dataset. In addition, their method is used to learn and model a static graph representation while our dataset is dynamic; in that case our output layer had to be different, and so did the loss function.

We have also reviewed [54], which proposes dual-graph convolutional networks for semi-supervised classification tasks. The authors proposed dual graph representations which are transformations of the same diffusion matrix (similar to the adjacency matrix). While their approach proposes the usage of two graphs in convolutional networks for learning, our approach is different, since we present two graphs that represent different relationships between our instances.

2.3.4 Graph Convolutional Network link prediction

Another use of GCNs is link prediction. Link prediction has various usages, such as traffic prediction, as presented in [52, 20, 49, 9]. The problem presented is mapping a road network into an unweighted graph $G = (V, E)$, where $V$ is a set of road nodes and $E$ represents connections (whether they exist or not) between roads. Each node also has attribute features such as traffic speed, traffic flow and traffic density. The goal of traffic forecasting is to predict information on each road based on the historical information about it.

We have reviewed works which use a temporal-dynamic GCN approach. In this research, we examined a model similar to their methods, with slight changes to the features. However, the difference was that we needed to predict the relations between nodes in the current timestamp, i.e. the adjacency matrix, while in both articles the authors proposed methods to predict the nodes’ features in the next timestamp, i.e. the feature matrix. Therefore, we adjusted the model so that the output includes the adjacency matrix, in order to find outliers in it.

2.3.5 Unsupervised Learning with Graph Autoencoders

GNNs are typically used for supervised or semi-supervised learning problems [53]. Recently, there have been attempts to extend auto-encoders (AE) to graph domains. Graph auto-encoders aim at representing nodes as low-dimensional vectors in an unsupervised training manner. Graph Auto-Encoder (GAE) [24] first uses GCNs to encode the nodes in the graph. Then it uses a simple decoder to reconstruct the adjacency matrix and computes the loss from the similarity between the original adjacency matrix and the reconstructed matrix. In addition, the GAE model can be trained in a variational manner, in which case the model is named the variational graph autoencoder (VGAE).
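The GAE pipeline of [24] can be sketched as a forward pass: a two-layer GCN encoder produces node embeddings $Z$, and an inner-product decoder reconstructs the adjacency matrix as $\hat{A} = \sigma(Z Z^\top)$. The snippet below is a minimal NumPy illustration under stated assumptions: the weights are random and untrained, the toy graph is binary and symmetric, and binary cross-entropy is used as the reconstruction loss. It is not the MGAE model of this thesis.

```python
import numpy as np

def normalize_adjacency(A):
    """Return D^{-1/2} (A + I) D^{-1/2} for a symmetric, non-negative A."""
    A_tilde = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))
    return A_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gae_forward(A, X, W0, W1):
    """Two-layer GCN encoder followed by an inner-product decoder."""
    A_norm = normalize_adjacency(A)
    H = np.maximum(A_norm @ X @ W0, 0.0)        # first GCN layer with ReLU
    Z = A_norm @ H @ W1                         # second GCN layer, linear: node embeddings
    A_hat = 1.0 / (1.0 + np.exp(-(Z @ Z.T)))    # sigmoid of pairwise inner products
    return Z, A_hat

def reconstruction_loss(A, A_hat, eps=1e-9):
    """Mean binary cross-entropy between a binary A and its reconstruction."""
    return float(-np.mean(A * np.log(A_hat + eps)
                          + (1.0 - A) * np.log(1.0 - A_hat + eps)))

# Toy path graph with 3 nodes and one-hot node features.
A = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])
X = np.eye(3)

rng = np.random.default_rng(0)
W0 = rng.normal(size=(3, 4))   # hypothetical encoder weights
W1 = rng.normal(size=(4, 2))   # hypothetical embedding size of 2

Z, A_hat = gae_forward(A, X, W0, W1)
loss = reconstruction_loss(A, A_hat)
print(Z.shape, A_hat.shape, loss)
```

Training would adjust `W0` and `W1` by gradient descent to lower this loss; entries of the adjacency matrix that remain poorly reconstructed are natural candidates for anomalies.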

In addition, Berg (2017) [44] uses graph autoencoders in recommender systems and proposes the graph convolutional matrix completion model (GC-MC). Based on Berg’s research, the GC-MC model outperforms other baseline models on the MovieLens dataset. The Adversarially Regularized Graph Auto-encoder (ARGA) [36] employs generative adversarial networks (GANs) to regularize a graph convolutional-based graph auto-encoder in an attempt to follow a prior distribution. There are other graph auto-encoders, such as NetRA [50], DNGR [4], SDNE [45] and DRNE [43]. These latter autoencoders do not use GCNs in their architecture; instead they use different graph embedding techniques.

Our MGAE (Multi-graphs autoencoder) model uses a method similar to Kipf’s GAE [24]; however, the graphs we reconstruct are dynamic graphs. In addition, our method uses multiple graphs to represent the data, while the other graph autoencoder methods we have presented all use a single graph representation.

Chapter 3 Research Goal

The primary purpose of a system log is to record system states, so that significant events can be detected at various critical points to help debug system failures and perform root cause analysis [14]. Due to the large number of operations (in our case, SQL transactions), logging requires a tremendous amount of storage space. Therefore, data may be aggregated for efficient storage. In such cases, debugging and analysing the aggregated data is a non-trivial task.

In this thesis, we have two main goals:

  1. Exploration of graph based representation of the aggregated data set.

  2. Exploit the graph based representation to create a GCN-based approach for anomaly detection.

Based on those goals, we will define the following objectives:

  • Analyze correlations between events to test whether correlations exist in the aggregated events dataset. For this task we present an analysis of relationships between couples of events in different contexts.

  • Create a graph representation based on the relationships which were discovered. The representations are made for each session separately.

  • Create a multiple-graph representation based on more advanced relationships and further knowledge of the dataset.

  • Design an architecture for graph and multiple-graphs autoencoders using the graph representations which were extracted from the raw dataset.

  • Analyse the learning abilities of the graph autoencoders to detect whether the graph representation can be reconstructed and used for learning tasks.

  • Compare the learning abilities and benefits of using MGAE over a single-graph autoencoder model, to understand whether, and which, advantages a multiple-graphs autoencoder may provide.

We limited our research to the challenging dataset we were presented with. The results we present may be beneficial for other datasets and for larger graphs. In addition, the latest model we present was not tested as an anomaly detection model, although we believe it may prove better than other anomaly detection models for aggregated data.

Chapter 4 Research Contributions

Based on the analysis conducted in this research, we propose the MGAE model (Multiple-graphs autoencoder). The MGAE model is a type of autoencoder, which is an unsupervised learning technique where a neural network is trained to attempt to copy its input to its output [19]. MGAE’s task is to reconstruct the multiple-graphs representation of the aggregated data in order to learn the patterns of different dynamic sessions. During the research we analysed the learning abilities of our model, comparing it to a single-graph autoencoder, which is a state-of-the-art technique with many different usages, as we present in the background section.

Our research contributions are as follows:

  • We present a thorough analysis of the unique dataset. The analysis framework may be used on other aggregated datasets where different types of relationships can be used for learning tasks. Through the analysis we reached the graph form of the data set, which was later used for the development of the MGAE model.

  • We present the multiple-graphs autoencoder (MGAE) models, based on three different graph representations of our dataset, in addition to a relevant traditional graph autoencoder. The different models exploit the relationships between entities on a session level.

  • We present the benefits of using this model rather than a single-graph autoencoder for more complex and high-degree relationships between events. We used different techniques to represent graphs and adjust the adjacency matrix data to show the model’s superiority over single-graph autoencoder models. However, our analysis is currently only an indicator of the learning ability of such a model for the challenging dataset we present in this research.

Chapter 5 Research Problem Formulation

Our proposed methods are based on relationships between events. First we present the formulation for our preliminary research on discovering whether we can detect patterns between aggregated events. Then we present the formulations of the autoencoder methods.

5.1 Aggregated Dataset Formulation

The dataset made available to this research includes information about SQL commands issued by users to a database. A major challenge in this research is that the available dataset does not contain information about specific invocations of SQL commands. Rather, the available dataset contains information about how many times each SQL command has been issued in a specific time range, by a specific database user, in a specific session.

Throughout this thesis, to describe entities in our dataset, we use the following definitions:

  1. Session (context) - A continuous connection between user and an application (or server). A specific session will be referred to with session id.

  2. Event (may be referred to as type of event or aggregated event) - We define an event as a SQL query (or transaction) made to a specific database in a session (context). The event will be referred to by its event id. In each session (context), each type of event appears only once, due to the aggregation in our dataset.

  3. Count (of event) - The number of times a unique event was made in the session.

  4. Report (file) - Hourly log file containing all of the aggregated events.

Based on the definitions we have presented, to describe relationships between events we will define the following terminology:

  1. Couple - two events that were recorded in the same session.

  2. Strong correlation - Based on the linear regression score ($R^2$ score close to 1).

  3. Stable couple - A couple that was found in the same session (context) in both the training set and the testing set.

  4. Strongly correlated couple - Stable couple with strong correlation both in the training set and testing set.

5.1.1 Dependencies Between Entities

Figure 5.1: Illustration - Multiple instances of one event in a single session aggregated to a single event.

After discussions with a domain expert, we established the following functional dependencies between some of the fields in our dataset:

  1. event ID → verb object

  2. Session ID → instance ID, made by the same user using the same connection, IP address, database and application.

  3. Session ID, event ID → count

The aggregation of a single event is illustrated in Figure 5.1.
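A functional dependency such as event ID → verb object can be checked mechanically on the aggregated records. The sketch below is illustrative only: the record fields, the sample values, and the helper name `holds_functional_dependency` are hypothetical and do not reflect the actual schema of the dataset.

```python
def holds_functional_dependency(records, determinant, dependent):
    """Return True if equal determinant values always imply equal dependent values."""
    seen = {}
    for record in records:
        key = tuple(record[field] for field in determinant)
        value = tuple(record[field] for field in dependent)
        # setdefault stores the first value seen for this key and returns it afterwards
        if seen.setdefault(key, value) != value:
            return False
    return True

# Hypothetical aggregated records.
records = [
    {"session_id": "s1", "event_id": "e1", "verb_object": "SELECT users", "count": 3},
    {"session_id": "s1", "event_id": "e2", "verb_object": "INSERT logins", "count": 3},
    {"session_id": "s2", "event_id": "e1", "verb_object": "SELECT users", "count": 5},
]

print(holds_functional_dependency(records, ["event_id"], ["verb_object"]))          # True
print(holds_functional_dependency(records, ["session_id", "event_id"], ["count"]))  # True
print(holds_functional_dependency(records, ["event_id"], ["count"]))                # False
```

The last call fails because the same event has different counts in different sessions, which is why the count depends on the session ID and the event ID together.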

5.1.2 Detecting Correlations Between Events

Generally, to find couples (of events), either strongly correlated or just stable, we used Linear Regression as the model to find the correlation. To try and find simple patterns based on our hypothesis, we created a simple modeling mechanism as presented in Algorithm 1.

Algorithm 1 Simple Correlations Model

1: For each event id $e_i$, find all events that appear in the same session with it.

     1. For each event id $e_j$ that was found with $e_i$, create couple points based on the counts of the events in the same session. The points are $(x, y)$, where $x$ and $y$ are the counts of $e_i$ and $e_j$ respectively.

     2. For each session where $e_i$ appears and $e_j$ does not, add a point based on the count of $e_i$ in the session and 0 for the count of $e_j$, and vice versa.

2: For each couple, create a linear regression model.

3: Try to predict y’s counts based on x’s and check the score. If the score is higher than 0.8 (both in training and testing), consider the couple as strongly correlated.

Using this algorithm we can find simple patterns that may point to existing correlations between two events in both the training set and the testing set.
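The steps of Algorithm 1 can be sketched in plain Python as follows. This is a hedged illustration rather than the code used in the thesis: the input layout (a dictionary from session id to per-event counts), the function names, and the toy data are assumptions, and the sketch assumes the line is fitted on the training points and then scored on both the training and the testing points.

```python
from itertools import combinations

def couple_points(sessions, a, b):
    """Points (count of a, count of b) for every session containing a or b; a missing event counts as 0."""
    return [(c.get(a, 0), c.get(b, 0)) for c in sessions.values() if a in c or b in c]

def fit_line(points):
    """Least-squares (intercept, slope), or None when the fit is undefined."""
    if len(points) < 2:
        return None
    x_mean = sum(x for x, _ in points) / len(points)
    y_mean = sum(y for _, y in points) / len(points)
    sxx = sum((x - x_mean) ** 2 for x, _ in points)
    if sxx == 0:
        return None
    slope = sum((x - x_mean) * (y - y_mean) for x, y in points) / sxx
    return y_mean - slope * x_mean, slope

def r_squared(points, intercept, slope):
    """Coefficient of determination 1 - u/v, or None when v is zero."""
    if not points:
        return None
    y_mean = sum(y for _, y in points) / len(points)
    u = sum((y - (intercept + slope * x)) ** 2 for x, y in points)
    v = sum((y - y_mean) ** 2 for _, y in points)
    return None if v == 0 else 1.0 - u / v

def strongly_correlated_couples(train, test, threshold=0.8):
    events = sorted({e for counts in train.values() for e in counts})
    found = []
    for a, b in combinations(events, 2):
        if not any(a in c and b in c for c in train.values()):
            continue  # not a couple: the two events never share a training session
        model = fit_line(couple_points(train, a, b))
        if model is None:
            continue
        scores = [r_squared(couple_points(s, a, b), *model) for s in (train, test)]
        if all(score is not None and score > threshold for score in scores):
            found.append((a, b))
    return found

# Hypothetical sessions: e2 is always issued twice as often as e1, e3 is unrelated.
train = {"s1": {"e1": 1, "e2": 2}, "s2": {"e1": 2, "e2": 4, "e3": 7},
         "s3": {"e1": 3, "e2": 6, "e3": 1}, "s4": {"e1": 4, "e2": 8, "e3": 4}}
test = {"t1": {"e1": 5, "e2": 10, "e3": 2}, "t2": {"e1": 6, "e2": 12, "e3": 9},
        "t3": {"e1": 1, "e2": 2}}

couples = strongly_correlated_couples(train, test)
print(couples)
```

Only the couple whose counts follow a stable linear relation in both sets passes the 0.8 threshold; couples involving the unrelated event are rejected.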

5.2 Graph Autoencoder

In this section we present the definitions related to the graph (convolutional) autoencoder (GAE). We use convolutional layers, because filter parameters are typically shared over all locations in the graph (Kipf [25]). For the autoencoder models, the goal is to reconstruct a graph $G = (V, E)$, or multiple graphs $G_1, \dots, G_k$ for MGAE, where $V$ are the nodes (or vertices) and $E$ are the edges of the graphs. The input for the model is:

  • Matrix $A$, which represents the graph structure in a two-dimensional form. $A$ may also be referred to as the adjacency matrix. The dimensions of $A$ will be $N \times N$, where $N$ is the number of nodes.

The output of the model will be a matrix $\hat{A}$, with the same dimensions as $A$. Matrix $\hat{A}$ represents the structure of the reconstructed graph. Graph-level outputs can be modeled by introducing some form of pooling operation (Duvenaud [15]).

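    Every neural network layer can then be written as a non-linear function $H^{(l+1)} = f(H^{(l)}, A)$, with $H^{(0)}$ being the model input and $H^{(L)}$ the model output (or $z$ for graph-level outputs), $L$ being the number of layers and $A$ the adjacency matrix. The specific models then differ only in how $f(\cdot, \cdot)$ is chosen and parameterized.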

    Chapter 6 Methods

    Our overarching goal is to create a graph-based method that can be used for anomaly detection on our aggregated data. The graph representation of our data should leverage the relationships between aggregated events in different sessions. To this end, we present the methods used during the following parts of the research:

    • Aggregated dataset analysis

    • Dataset graph representations

    • Graph and multiple graphs autoencoder

    In Section 6.1 we present the analyses and methods used to detect whether correlations between aggregated events in our dataset exist and how we may exploit them. In Section 6.2 we present the different graph representations of our dataset, which represent different relationships between the events. In Section 6.3 we present the multiple-graph autoencoder, which is the main contribution of this research.

    6.1 Aggregated Dataset Analysis

    In our research we created different models to find correlations between two or more instances, based on analysis and understanding of the data. In the following sections we describe the method of analyzing the different events and their behavior, and first results based on the couples we found in this section (both strongly correlated and stable events).

    6.1.1 Analyzing Events

    First, we count the cumulative number of unique events as a function of the time range over which the events were collected. This is done to better understand the behavior of existing (training set) and new (testing set) events in our system.

    From the events recorded each day, we select the ones we found to be stable (i.e., part of a stable couple) in order to assess stability. That way we can tell whether there is evidence of repetitive patterns that can be found with Algorithm 1.

    To understand the scale of the stable couples we found (or of the events in those couples), we measure the percentage of stable events out of the whole testing set. In other words, we try to understand how many of the event instances in the report are part of a couple.

    In our research, we intend to further analyze events and strongly correlated events, to understand how to find the ones that exist both in the training set and in future testing sets, and to use them as the basis of our anomaly detection model.

    6.1.2 Analyzing Correlations Between Events

    Next, we searched for pairs of stable event IDs whose counts are strongly correlated. To this end, we collected all event IDs that were recorded together in the same session. Then, we fitted a linear regression model on their counts. An output of such a fitting is the $R^2$ score, which is 1.0 for a perfect correlation and 0 or less for no correlation. We considered every pair of event IDs for which the $R^2$ score is at least 0.8, as described in Algorithm 1. Then, we checked whether these correlated pairs maintain their correlation on other testing data. That is, we used the linear regression model learned for every pair to predict the counts in the week after the week used for training.

    Later, we checked how many of the correlations identified during training remain stable, i.e., are still good predictors of the counts in the testing set. To do so, we counted all the strong correlations we found for 3, 4 and 5 days of training, and tested how many of them we could predict correctly.

    Finally, we checked how common it is to observe a stable event with a correlation of 0.8 or higher with another event.

    In our research, we intend to further analyze the time decay of the correlations found in our training set. That way, we can review the behavior of such correlations and create a more sophisticated and efficient algorithm that detects them at an early training stage while discarding the correlations that do not hold over time.

    6.1.3 Multiple-Linear Regression Anomaly Detection Model

    To test the couple correlations found by Algorithm 1, and whether we can use them for anomaly detection, we created an anomaly detection model based on that algorithm. To test a single point (the counts of a pair of events), we created the following prediction error check:

    .

    Here a perfect prediction yields 0 and the worst possible prediction yields 1. The error threshold is the maximum allowed error between the prediction and the true count value.

    Algorithm 2 Couples correlations-based anomaly detection algorithm

    1: Find strongly correlated couples of events from the training set using the Simple Correlations Model (Algorithm 1).

    2: For a given (Session, event1, event2, count1, count2) tuple, check:

        1. If (event1, event2) appears in the trained model, predict count2 using the linear regression model with count1 as the input.

        2. If the prediction error check is higher than the maximum allowed error, output an anomaly.

    In addition, we added an "exoneration" technique, which exonerates an event that was classified as an anomaly when tested against one event but was later found to be strongly correlated with a different event. We also defined a threshold parameter: the number of broken correlations required for a specific instance. When the threshold is larger than 1, an event is output as an anomaly only if it was tested against at least two other events and its count was predicted incorrectly with both. While this anomaly detection model is fairly simple, it was the first step in testing our hypothesis.
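To make the interplay between the error check, the threshold and the exoneration step concrete, the following sketch applies them to a single session. It is an illustration under stated assumptions and not the model used in this research: the relative error in `prediction_error` is a stand-in that only matches the stated range (0 for a perfect prediction, 1 for the worst), and the exoneration rule (drop the flag if any other learned correlation of the same event still holds) is one possible reading of the description above. All names and default values are hypothetical.

```python
def prediction_error(predicted, actual):
    """Relative error in [0, 1]; 0 is a perfect prediction.

    ASSUMPTION: this normalisation is illustrative. It matches the stated
    range of the check, not necessarily its exact formula.
    """
    predicted = max(predicted, 0.0)  # counts cannot be negative
    largest = max(predicted, actual)
    if largest == 0:
        return 0.0
    return abs(predicted - actual) / largest


def detect_anomalies(session_counts, models, max_error=0.2, threshold=1,
                     exonerate=True):
    """Flag events of one session whose learned correlations are broken.

    `session_counts` maps event ID -> count for a single session.
    `models` maps (event1, event2) -> (slope, intercept) of a trained
    regression that predicts count2 from count1.
    """
    broken = {}
    held = {}
    for (event1, event2), (slope, intercept) in models.items():
        if event1 not in session_counts or event2 not in session_counts:
            continue
        predicted = slope * session_counts[event1] + intercept
        error = prediction_error(predicted, session_counts[event2])
        bucket = broken if error > max_error else held
        bucket[event2] = bucket.get(event2, 0) + 1
    anomalies = []
    for event, n_broken in broken.items():
        if n_broken < threshold:
            continue
        if exonerate and held.get(event, 0) > 0:
            continue  # a different correlation still holds for this event
        anomalies.append(event)
    return sorted(anomalies)


# Hypothetical trained couples and one session to test.
models = {("A", "B"): (2.0, 0.0), ("C", "B"): (1.0, 0.0), ("A", "D"): (3.0, 0.0)}
session = {"A": 2, "B": 10, "C": 10, "D": 6}

with_exoneration = detect_anomalies(session, models)
without_exoneration = detect_anomalies(session, models, exonerate=False)
strict_threshold = detect_anomalies(session, models, exonerate=False, threshold=2)
```

In this example the count of B breaks its correlation with A but agrees with its correlation with C, so B is flagged only when exoneration is disabled and the threshold is 1.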

    6.1.4 Advanced Relationships

    The discussion so far has been limited to one type of relationship between construct counts: a linear, pairwise correlation. To assess the potential for more complex relationships, we constructed a graph whose vertices are the stable constructs and in which there is an edge between two constructs if there is a strong correlation (larger than 0.8) between their counts. For example, if Event A and Event B were found to be strongly correlated, they are represented in the graph as nodes A and B respectively, with an edge between them.
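The construction described above can be sketched as follows. The sketch assumes the pairwise scores are already available as a dictionary keyed by couples; the function names and the toy scores are hypothetical. Connected components are found with a standard breadth-first search, which is one simple way to look for groups of constructs that are linked beyond single pairs.

```python
from collections import deque


def build_correlation_graph(scores, min_score=0.8):
    """Undirected graph with an edge between two events if score > min_score."""
    graph = {}
    for (a, b), score in scores.items():
        graph.setdefault(a, set())
        graph.setdefault(b, set())
        if score > min_score:
            graph[a].add(b)
            graph[b].add(a)
    return graph


def connected_components(graph):
    """Return the connected components of `graph` as a list of node sets."""
    seen = set()
    components = []
    for start in graph:
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        component = set()
        while queue:
            node = queue.popleft()
            component.add(node)
            for neighbour in graph[node] - seen:
                seen.add(neighbour)
                queue.append(neighbour)
        components.append(component)
    return components


# Hypothetical pairwise scores between five events.
scores = {("A", "B"): 0.95, ("B", "C"): 0.9, ("C", "D"): 0.3, ("D", "E"): 0.85}
graph = build_correlation_graph(scores)
components = connected_components(graph)
```

Here A, B and C end up in one component although A and C were never tested as a couple, which is the kind of higher-degree relationship this graph is meant to expose.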

    While we did not create a graph model that represents the method in this section (i.e., one in which only a strong correlation produces an edge), it later led us to construct the graph representation of the count (ratio) graphs, which was presented in Chapter 5. In the next sections we present the graph representations of our data and then the different autoencoder models that use those representations.

    6.2 Dataset Graph Representations

    A significant part of our contribution relies on the graph representations of our aggregated dataset. The first graph representation was derived from the correlation-based anomaly detection presented in the previous section. For our graph autoencoder models, the raw data had to be transformed into graphs that represent each entity. The entity we chose to represent is the session, since an event may appear only once per session. Based on the results of the events analysis presented in the previous section, we deduced that thousands of new unique events are created daily. However, due to memory and storage limits, we wanted to create a graph that represents only the most common events. Therefore, we kept only the events that satisfy a minimum-appearances parameter. The remaining event types still account for the majority of our dataset (over 50% of the total set).

    Next, we represent each session as a directed graph, which we refer to as the Relations graph. Each graph is represented by its adjacency matrix, i.e., the matrix that holds the value of each edge (the value of the edge from construct i to construct j appears in cell (i,j) of the adjacency matrix).

    This results in one graph for each session in our dataset. The creation of the Relations graph is presented in Algorithm 3.

    Algorithm 3 Relations graph creation

    1: Generate a directed graph that represents session $S_i$ as follows:

        1. The graph's nodes are created by modeling all the events in a specific session (including events whose counts are zero).

        2. The graph's edges are created for all pairs of events. Each edge holds a value; an edge with value 0 (or no edge) represents a connection that does not exist in the specific timestamp (session).

    The second representation is denoted as the Appearances graph and is presented in Algorithm 4. In the appearances graph, the edges represent the percentage of times the nodes (events) appeared together throughout the training set report. In other words, the appearances graph representation may indicate how likely it is for an edge, which states that both events appear together and how often (in percentage), to exist in a specific session. The intuition behind this graph representation is the task of link prediction, which may be used for anomaly detection as well.

    Algorithm 4 Appearances graph creation

    1: Generate a directed graph that represents session $S_i$ as follows:

        1. The graph's nodes are created by modeling all the events in a specific session (including events whose counts are zero).

        2. The graph's edges are created as follows: the value of edge (i,j) is the number of sessions in which both entity (event) i and entity (event) j appeared together, divided by the total number of sessions in which entity i appeared throughout the report(s), or: $A_{ij} = \frac{|\text{sessions containing both } i \text{ and } j|}{|\text{sessions containing } i|}$. For example, if event x appeared in 50 sessions, event y appeared in 30 sessions, and they appeared together in 25 sessions, the value of edge (x,y) is 25/50 and the value of edge (y,x) is 25/30.

        3. The value of edge (i,j) is 0 (or no edge) if either of the events does not appear in the session.
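The appearances graph of Algorithm 4 can be sketched in two steps: compute the co-appearance ratios once over the training report, then mask them per session. The code below is a minimal sketch; the function names and the representation of a session as a set of event IDs are assumptions made for the example. It reproduces the worked example from the algorithm (50 sessions with x, 30 with y, 25 with both).

```python
def appearance_ratios(sessions, events):
    """ratios[i][j] = (#sessions with both i and j) / (#sessions with i)."""
    n = len(events)
    index = {event: k for k, event in enumerate(events)}
    together = [[0] * n for _ in range(n)]
    for present in sessions:
        ids = [index[e] for e in present if e in index]
        for i in ids:
            for j in ids:
                together[i][j] += 1
    ratios = [[0.0] * n for _ in range(n)]
    for i in range(n):
        total_i = together[i][i]  # number of sessions in which event i appears
        if total_i:
            for j in range(n):
                ratios[i][j] = together[i][j] / total_i
    return ratios


def appearances_graph(session_events, ratios, events):
    """Adjacency matrix of one session: the ratio if both events are present, else 0."""
    present = [e in session_events for e in events]
    n = len(events)
    return [[ratios[i][j] if present[i] and present[j] else 0.0
             for j in range(n)] for i in range(n)]


# Worked example from Algorithm 4: x in 50 sessions, y in 30, both in 25.
events = ["x", "y"]
report = [{"x", "y"}] * 25 + [{"x"}] * 25 + [{"y"}] * 5
ratios = appearance_ratios(report, events)
both_present = appearances_graph({"x", "y"}, ratios, events)
only_x = appearances_graph({"x"}, ratios, events)
```

Note that in this sketch the diagonal cell of a present event is 1, since an event always appears together with itself. Whether self-edges are kept is not specified above, so this is a choice of the example.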

    The last graph we present is denoted as the Count graph and represents the count property of each event in the session. We created this graph using two methods. The first results in an adjacency matrix with the count values of the aggregated events on the diagonal of the matrix. The second results in an adjacency matrix in which all the cells in row i contain the count value of event i. In both methods the count values are normalized by the diagonal sum. The creation of the count graph is presented in Algorithm 5.

    Algorithm 5 Count graph creation

    1: Generate a directed graph that represents session $S_i$ as follows:

        1. The graph's nodes are created by modeling all the events in a specific session (including events whose counts are zero).

        2. The graph's edges are created according to one of two options:

            (a) Count diagonal: edge (i,i) exists for all events that exist in the session. The value of edge (i,i) is the normalized count value of the event.

            (b) Full count row: in addition to the count diagonal, the edges (i,j) all hold the same value as edge (i,i), for each j in the graph.
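Both variants of the count graph can be sketched with one small function. This is an illustrative sketch: the function name, the `full_row` flag and the toy session are hypothetical, and the normalization divides each count by the sum of all counts in the session, which is how we read "normalized by the diagonal sum".

```python
def count_graph(counts, events, full_row=False):
    """Adjacency matrix of the count graph for one session.

    `counts` maps event ID -> count in the session; `events` fixes the node
    order. With full_row=False only the diagonal is filled (count diagonal);
    with full_row=True every cell of row i holds the value of cell (i, i).
    """
    total = sum(counts.get(e, 0) for e in events)
    n = len(events)
    matrix = [[0.0] * n for _ in range(n)]
    if total == 0:
        return matrix
    for i, event in enumerate(events):
        value = counts.get(event, 0) / total  # normalized count of event i
        if full_row:
            matrix[i] = [value] * n
        else:
            matrix[i][i] = value
    return matrix


# Toy session (hypothetical): A occurred 3 times, C once, B not at all.
events = ["A", "B", "C"]
session = {"A": 3, "C": 1}
diagonal = count_graph(session, events)
full = count_graph(session, events, full_row=True)
```

For this session the diagonal variant holds 0.75 and 0.25 in cells (A,A) and (C,C), while the full count row variant repeats those values along the rows of A and C and leaves the row of B at zero.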

    6.3 Multiple-Graphs Autoencoder

    In this section we present the different graph autoencoder models, which are based on the graph representations presented in the previous section. For consistent results, the architecture is the same for the models with all the different data representations. It consists of 4 layers, as follows:

    1. Encoder:

      1. Convolutional layer 2D – with 64 filters and 3x3 kernel size

      2. Convolutional layer 2D – with 32 filters and 3x3 kernel size

    2. Decoder:

      1. Convolutional transpose layer 2D – with 64 filters and 3x3 kernel size

      2. Output – convolutional transpose layer 2D with 1 or 3 filters (depending on the model) and 3x3 kernel size
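As a quick sanity check of the architecture listed above, the number of trainable parameters can be worked out by hand. The sketch below does this arithmetic in plain Python instead of building the network. It assumes that every layer has a bias term, which is not stated in the text; the layer names and function names are hypothetical.

```python
def conv_params(in_channels, filters, kernel=(3, 3), bias=True):
    """Weights of a 2D (transposed) convolution: kh * kw * in * out (+ out)."""
    kh, kw = kernel
    return kh * kw * in_channels * filters + (filters if bias else 0)


def autoencoder_params(graph_layers):
    """Parameter count per layer for an input with `graph_layers` channels (1 or 3)."""
    layers = [
        ("encoder conv 1", graph_layers, 64),
        ("encoder conv 2", 64, 32),
        ("decoder conv-transpose 1", 32, 64),
        ("output conv-transpose", 64, graph_layers),
    ]
    return {name: conv_params(c_in, c_out) for name, c_in, c_out in layers}


single_graph_total = sum(autoencoder_params(1).values())
multi_graph_total = sum(autoencoder_params(3).values())
```

Under these assumptions the single-graph model has 38,177 parameters and the three-layer model has 40,483, so moving from one graph to three graphs changes only the first and last layers and adds about 6% to the model size.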

    In Algorithm 6 we present the creation of both the training and testing set for the single-graph autoencoder.

    Algorithm 6 Dataset creation for single-graph autoencoder

    1: Split the dataset (hourly reports of aggregated events) into session entities.
    2: Filter out the events that occurred fewer times than the minimum appearances threshold.
    3: For each session, create the relations graph as described in Algorithm 3.
    4: Extract the adjacency matrix of the relations graph and save it as both the input and the target of the autoencoder.

    The training process for the single-graph autoencoder consists of reconstructing the adjacency matrix that was extracted from each session, using the architecture, hyper-parameters and settings mentioned above. The full process, from the original session to the newly predicted session, is illustrated in Figure 6.1.

    Figure 6.1: Illustration - single-graph autoencoder architecture, from data creation to training

    The dataset creation for the multi-graph autoencoder model is similar to that of the single-graph autoencoder, except that we create a three-layer matrix that holds all three graph representations of our dataset (relations graph, appearances graph and count graph).

    Algorithm 7 Dataset creation for multi-graph autoencoder

    1: Split the dataset (hourly reports of aggregated events) into session entities.
    2: Filter out the events that occurred fewer times than the minimum appearances threshold.
    3: For each session, create the relations graph as described in Algorithm 3, the appearances graph as described in Algorithm 4 and the count graph as described in Algorithm 5.
    4: Extract the adjacency matrices of all three graphs, resulting in a three-layer matrix, and save it as both the input and the target of the autoencoder.
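The last step of Algorithm 7 amounts to stacking three matrices of the same size into one sample. A minimal sketch is given below. It assumes a channels-last layout (N x N x 3), which is a convention chosen for the example and not stated in the text, and the function names and toy matrices are hypothetical.

```python
def stack_graphs(relations, appearances, counts):
    """Stack three N x N adjacency matrices into one N x N x 3 sample."""
    n = len(relations)
    for matrix in (relations, appearances, counts):
        if len(matrix) != n or any(len(row) != n for row in matrix):
            raise ValueError("all graphs must be N x N with the same N")
    return [[[relations[i][j], appearances[i][j], counts[i][j]]
             for j in range(n)] for i in range(n)]


def make_dataset(sessions_graphs):
    """Return (inputs, targets); an autoencoder reconstructs its own input."""
    inputs = [stack_graphs(*graphs) for graphs in sessions_graphs]
    targets = inputs  # the target of each sample is the sample itself
    return inputs, targets


# One toy session with two events (all values are hypothetical).
relations = [[1.0, 2.0], [0.5, 1.0]]
appearances = [[1.0, 0.5], [0.8, 1.0]]
counts = [[0.25, 0.0], [0.0, 0.75]]
inputs, targets = make_dataset([(relations, appearances, counts)])
```

Each cell (i,j) of a sample then carries three values, one per graph, in the same way that a pixel of an RGB image carries three color channels.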

    The training process for the multiple-graph autoencoder consists of reconstructing the adjacency matrices of all three graphs we presented: relations graph, appearances graph and count graph. The three adjacency matrices are saved as one three-layer matrix that represents one session. We note that we created two types of multi-graph representations: one in which the count graph has count diagonal edges, and one in which it has full count rows.

    To distinguish between the two representations, we define the following:

    1. The representation that contains the count diagonal count graph is denoted as "Three layers - Count diagonal"

    2. The representation that contains the full count row count graph is denoted as "Three layers - Full count row"

    Using the same architecture, hyper-parameters and settings described for the single-graph autoencoder, we train the model on both representations. The full process for the multi-graph autoencoders, from the original session to the newly predicted session, is illustrated in Figure 6.2.

    Figure 6.2: Illustration - multi-graph autoencoder architecture, from data creation to training

    Chapter 7 Evaluation

    Our evaluation consists of two main parts. The first part presents the analysis of the events (or types of events), the analysis of correlations between events, and the results of the first anomaly detection model (the multiple-linear regression model). The second part analyzes the different graph autoencoder models under different settings and presents the benefit of our multiple-graphs autoencoder over the single-graph autoencoder.

    7.1 Events Analysis

    The main purpose of this section is to analyze events, which we may also refer to in this section as constructs or construct IDs. The term construct is used in our dataset and describes a type of SQL query made by a specific user to a specific database. In Chapter 5 we also defined the terms session, stable and strongly correlated constructs (events), strong correlations, and stable correlated construct (event).

    We also presented the functional dependencies between the fields in our dataset:

    1. Construct ID (event) → verb object

    2. Session ID → instance ID, made by the same user using the same connection, IP address, database and application.

    3. Session ID, Construct ID (event) → count

    7.1.1 Results of Constructs (events) Analysis

    We started by reviewing the number of unique constructs (events) that we observe across different hourly reports. In Figure 7.1 we can see that the number of unique construct IDs grows linearly with time. At first glance, this means that new SQL commands emerge all the time, leaving scarce hope for finding patterns. However, our subsequent analysis, described below, shows that the number of stable construct IDs remains relatively stable.

    Figure 7.1: Cumulative unique construct ids per hour

    In Figure 7.2, we plot the number of unique stable construct IDs (events) that were recorded on each day of the week (X-axis).

    Figure 7.2: Stable constructs daily

    The upper lines in the graph are for constructs recorded through the whole day, and the lower lines are for one hour (14:00) every day. The Y-axis shows the number of unique constructs (stable and total during the day). This shows that the number of stable construct IDs indeed remains more consistent. The exceptions are Saturday and Sunday, which show a decrease in the number of stable construct IDs. We conjecture that this is because these are vacation days, which exhibit fewer events.

    Figure 7.3: Percent of stable constructs daily

    Figure 7.3 shows the percentage of stable constructs out of all the unique constructs on the tested day. The X-axis shows which day we tested (Tuesday to Sunday). The Y-axis shows the percentage of stable constructs out of the total constructs on each day. We see that approximately 50% to 60% of the unique constructs every day are stable. Thus, most construct IDs are actually stable construct IDs.

    7.1.2 Results of Correlations Analysis

    Figure 7.4: Number Of strong Correlations for different training times

    The X-axis shows the number of testing days in the following week. The Y-axis shows how many strong correlations we found in total. The results clearly show that many pairs of construct IDs remain correlated in their counts during testing. This suggests that these correlations are stable patterns that may later be used for anomaly detection (Research question 1).

    Figure 7.5: Percent of correct prediction during tests for different training times

    The X-axis shows the number of testing days on the following week. The Y-axis shows the percent of the correct correlations after testing.

    The results indeed show that over 70% of the correlations identified during training were stable. This is a very encouraging result, as it shows that learning correlations from past data can be useful for identifying normal or abnormal behavior in the future (Research questions 1 and 2).

    Frequency of Observing a Stable Correlation. In the dataset used for the figures so far, with 5 days of training and 5 days of testing, there were a total of 1275 stable constructs, and 964 of these constructs were involved in at least one correlation with $R^2$ equal to or larger than 0.8.

    Figure 7.6: Percent of Stable and Stable correlated constructs during tests

    Figure 7.6 shows the percentage of sessions that include at least one stable construct with a strong correlation. The X-axis shows the number of testing days on the following week, and the y-Axis shows the percent of sessions with stable constructs (the upper gray line) and the percent of sessions that include stable constructs that have at least one strong correlation. As can be seen, this is indeed the majority of the sessions.

    We can thus conclude that even though the number of unique constructs is relatively low (every day we record 6,000-10,000 unique constructs), correlative constructs are very common. This suggests that anomaly detection methods based on such correlations may be very effective (Research questions 1 and 2).

    7.1.3 Results of Advanced Relationships Analysis

    Figure 7.7 below shows the structure of the graph mentioned in the method description for the advanced relationships analysis. The graph is created by defining nodes as strongly correlated constructs (events) and adding an edge between every two constructs that have a strong correlation.

    Figure 7.7: Overview of potential multiple constructs correlations

    As can be seen, there are multiple non-trivial connected components in the graph. Figure 7.8 shows the number of connected components (Y-axis) for every connected component size. Note that the majority of the constructs are, in fact, part of at least one non-trivial connected component. Moreover, some connected components are quite large, suggesting that more sophisticated relationships exist beyond pairwise correlations.

    Figure 7.8: Connected components sizes

    For example, one can consider checking correlation between triplets of construct counts, or quadruplets or more. Observe that for a correlative triplet to exist, it must be a 3-clique in the correlation graph. Below, we plotted the number of maximal cliques in this graph as a function of their size (X-axis).

    Figure 7.9: Cliques in the graph

    Clearly, these results show that complex patterns exist, well beyond simple correlative pairs. For example, this data shows that there is even a 23-clique in the correlation graph. This means that there are 23 constructs that are strongly correlated with each other. Future research will investigate such higher-degree relationships between constructs’ counts.

    Based on the last section and its evaluation, we deduced that modeling a higher degree of relationships is required. Based on previous sections, we also deduced that the stable constructs (events), i.e., the most common constructs, make up the majority of the dataset. Therefore, we came up with a number of ways to represent our dataset in graph form. The next section presents our evaluation of the different graph autoencoders.

    7.2 Experimental Setting for Graph-Autoencoders

    In this section we present the methods we used to evaluate and compare the different models. First, we present the model configuration used across the different autoencoders. Then, we present how we compared three different models that reconstruct three different graph representations. Finally, we discuss the different evaluations we performed to test which model outperforms the others.

    7.2.1 Models Configuration

    For all the different autoencoders we used the Adam optimizer (a standard optimizer) with a learning rate of 0.001 (which showed better results than larger learning rates in previous research). The activation function of each layer is ReLU. The reconstruction error for our models is the mean squared error (MSE). MSE was chosen over other errors because it is the default error for both vector and image reconstruction in simple, non-binary autoencoders. MSE measures the distance between the predicted matrix and the actual matrix we intend to reconstruct. The training stage for all models consists of 100 epochs, with validation on a dataset that consists of reports from the next day or the first day of the next week. The saved model was the one with the best training MSE. The training phase was similar for all types of models.

    7.2.2 Graph Autoencoders Reconstruction Comparison

    In Section 6.2 we presented three different types of graph representations of our data. While the representations differ (single graph or multiple graphs), one graph is created for every type: the relations graph. The relations graph is the best representative for understanding whether we have an anomaly or for predicting links, since it considers both the count and the relationship between constructs in each session, and, unlike the appearances matrix, it is dynamic across the different sessions. It also holds relational values rather than distinct values, which allows us to exploit the correlations between events presented in the previous section.

    7.2.3 Evaluation and Loss Functions for Graph Autoencoders

    The mean squared error (MSE) was chosen as the main loss function for the reconstruction of the single and multiple graph representations. It was chosen because multiple works in the literature present convolutional autoencoders with the same error, such as [33] [34] [31] [41] [6] and more. The mean squared error is defined as follows:

    $$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (\hat{Y}_i - Y_i)^2$$

    where n is the number of samples in the set, $\hat{Y}_i$ is reconstructed sample i and $Y_i$ is true sample i.

    In addition to the mean squared error, we present some of the results with the root mean squared error (RMSE). We did not use it for training or testing. However, some results show a large deviation, and the analysis was clearer with the scale of RMSE than with that of MSE. RMSE is defined as follows:

    $$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (\hat{Y}_i - Y_i)^2}$$

    where n is the number of samples in the set, $\hat{Y}_i$ is reconstructed sample i and $Y_i$ is true sample i.
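For matrix-valued samples such as adjacency matrices, the two errors can be computed as in the following sketch. It assumes that the squared error of one sample is averaged over all its cells before averaging over the n samples, which is a common convention but is not spelled out above. The function names and toy matrices are hypothetical.

```python
import math


def mse(true_samples, reconstructed_samples):
    """Mean squared error over n samples, each sample a matrix (list of rows)."""
    n = len(true_samples)
    total = 0.0
    for true, rec in zip(true_samples, reconstructed_samples):
        cells = [(r - t) ** 2 for t_row, r_row in zip(true, rec)
                 for t, r in zip(t_row, r_row)]
        total += sum(cells) / len(cells)  # mean over the cells of one sample
    return total / n


def rmse(true_samples, reconstructed_samples):
    """Root mean squared error: the square root of `mse`."""
    return math.sqrt(mse(true_samples, reconstructed_samples))


# One toy 2 x 2 adjacency matrix in which a single edge was not reconstructed.
true_samples = [[[1.0, 0.0], [0.0, 1.0]]]
reconstructed = [[[1.0, 0.0], [0.0, 0.0]]]
toy_mse = mse(true_samples, reconstructed)
toy_rmse = rmse(true_samples, reconstructed)
```

In this toy case one of four cells is off by 1, so the MSE is 0.25 and the RMSE is 0.5. RMSE is on the same scale as the edge values, which is why it can be easier to read in plots.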

    The last evaluation method we used was measuring the precision and recall of reconstructing the relations graph. This evaluation method was used to understand which model comes closer to predicting the actual values of the edges in our graphs.

    Precision and recall are defined as follows:

    $$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}$$

    where TP = true positives, FP = false positives, and FN = false negatives.

    These evaluation methods are meant for binary classification. Therefore, we did not base the conclusions of this research on their results, but the results allowed us to compare the models to each other. To be able to compute the precision and recall, we defined the following thresholds:

    1. Zero threshold – If the value of cell (i,j) of the predicted matrix is under the zero threshold, we consider the result of this cell as 0 (or negative).

    2. Error threshold – If the difference between the value of cell (i,j) of the predicted matrix and the true value of cell (i,j) in the true matrix A is under the error threshold, we consider the result of the cell as 1 (or positive).

    Using these thresholds, we were able to give each predicted matrix a precision and recall score. Then, we calculated the average precision and recall scores over all reconstructions for all the models.
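One way to turn the two thresholds into true positives, false positives and false negatives is sketched below. The mapping is an assumption, since the text does not fix it cell by cell: a missing edge counts as a false positive when the prediction is not under the zero threshold, and an existing edge counts as a true positive when the absolute difference from the true value is under the error threshold and as a false negative otherwise. The function names and toy matrices are hypothetical.

```python
def confusion_counts(true, predicted, zero_threshold=0.1, error_threshold=0.1):
    """Count TP, FP and FN over the cells of one reconstructed matrix."""
    tp = fp = fn = 0
    for t_row, p_row in zip(true, predicted):
        for t, p in zip(t_row, p_row):
            if t == 0:
                if p >= zero_threshold:
                    fp += 1  # predicted an edge that does not exist
            elif abs(p - t) < error_threshold:
                tp += 1      # existing edge reconstructed closely enough
            else:
                fn += 1      # existing edge missed or badly reconstructed
    return tp, fp, fn


def precision_recall(tp, fp, fn):
    """Precision and recall, defined as 0 when the denominator is 0."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy 2 x 2 example with one good edge, one bad edge and one spurious edge.
true = [[0.0, 0.5], [1.0, 0.0]]
predicted = [[0.05, 0.45], [0.6, 0.3]]
tp, fp, fn = confusion_counts(true, predicted)
precision, recall = precision_recall(tp, fp, fn)
```

With this mapping, a reconstruction has no false positives when `fp == 0` and no false negatives when `fn == 0`, which are the per-sample conditions one would check to decide whether a whole graph was reconstructed without spurious or missing edges.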

    7.3 Graph Autoencoders Analysis and Comparison

    In this section we present the results and analysis of the single and multi-graph autoencoders. The first part presents a summary of precision and recall scores for the different sets of models under different settings and thresholds. Then we present the effect of training on sessions with a larger number of events versus training on all unfiltered sessions, along with a comparison between the three types of models using this method. The next part shows the analysis of the sessions we predicted and the challenges in using the model. In the last part we present a comparison between normalized and non-normalized values of the relations graph and an analysis of the results.

    All the results we present used the same test set, which was a report of the aggregated analysis recorded in the week following the training period. The test set consists of 10,000 normal samples, of which 5,000 are sessions with 10 or more events; these were considered more complex due to the higher number of non-zero values in the adjacency matrix. The test set contains only sessions with no known anomalies, in order to analyze how well the different models perform in reconstructing normal sessions.

    7.3.1 Models Comparison

    In this section we will present comparison between the performance of the different models we have presented.

    We begin by presenting the precision and recall scores of the different models. We used different settings of the error threshold to try to understand which model provides the best results. The results were tested by comparing the reconstructions of the relations graph. Figure 7.10 shows the precision and recall scores. While it seems that the single-graph (single-layer) representation outperforms both multi-graph representations, in the next analysis we show why this evaluation is problematic and why we did not continue with it in further experiments.

    Figure 7.10: Precision and recall comparison with different error thresholds

    Later, we wanted to test the capabilities of fully reconstructing the relations graph for the different models. We have split the problem based on the number of events (constructs) in each session, to get a better understanding of how well the models are doing when the graphs are more complex (i.e the adjacency matrix contains more values different than 0). For this part of the analysis we have defined the following terms:

    • Good FP - a graph that was reconstructed with no false-positive cells, i.e successfully did not reconstruct edges that did not exist in the true relations graph.

    • Good FN - a graph that was reconstructed with no false-negatives cells, i.e successfully reconstructed all the edges that existed in the true relations graph.

    | Model                         | Good FP | Good FN |
    |-------------------------------|---------|---------|
    | Single-graph                  | 80%     | 52%     |
    | Multi-graphs - diagonal count | 57%     | 42%     |
    | Multi-graphs - full count     | 63%     | 16%     |

    Table 7.1: Percentage of good FPs and good FNs from reconstructed samples

    We checked the percentage of good FPs and good FNs among the reconstructed samples, and the results are presented in Table 7.1. All the results were reviewed with the same setting: both the error threshold and the zero threshold were 0.1. The results show that even though the recall score is relatively high (over 0.9) for the single-graph model, we can fully reconstruct only 52% of the samples in the sense of no missing edges. We changed the threshold settings and obtained similar or worse results.

    While continuing to analyze the good FN and good FP reconstructions we discovered, as can be seen in Figure 7.11, that the number of good FN sessions drops when the number of events in a session increases above 8. In addition, there are no good FN sessions with more than 13 events, even though such sessions make up over 40% of the samples in the test set.

    Figure 7.11: Frequency of samples with good FN vs not good FN. Y-axis represents the frequency of samples and X-axis represents the number of events in session

    Based on the analysis and results presented in this section, we have reached the following conclusions:

    • The scores are very easily biased by the different thresholds settings.

    • It is nearly impossible to fully reconstruct the relations graph using these methods when the number of constructs (events) in a session is large, even when the MSE is relatively low.

    • It would be very difficult to create an anomaly detection algorithm that tolerates that many false positives or false negatives.

    Based on those conclusions, we have decided to not continue to compare and analyze the different methods with the precision and recall scores.

    7.3.2 Analysis of Training Samples with Larger Number of Events

    For further analysis and comparison between our models, we chose to evaluate the MSE loss in reconstructing the relations graph. In addition, we wanted to analyze the standard deviation, because of the low number of graphs we were able to fully reconstruct in the previous analysis. Table 7.2 shows the results of a simple comparison between the three models when reconstructing the same graph. In contrast to the previous results, the overall results show that the novel multiple-graph autoencoder (MGAE) with the count diagonal representation outperforms the single-graph autoencoder, while the full count variant does not.

    Model type                      Average MSE    Standard deviation
    Single-graph                    1.341          12.81
    Multi-graphs - Count diagonal   0.479          7.339
    Multi-graphs - Full count       3.118          61.872
    Table 7.2: Average MSE and standard deviation comparison

    Since these results differ from the ones presented in Table 7.1 and Figure 7.10, we wanted to further review the effect of the number of events per session on the performance of each model. To do so, we removed all duplicate sessions from our test set and computed the average MSE for each session size (session size refers to the number of events in the session).

    Figure 7.12: Average RMSE for different session sizes (number of constructs). "Not full count" refers to the three-graphs - count diagonal model

    We plot the results in Figure 7.12 using the average RMSE loss, to make the analysis more visual and the differences easier to distinguish. The performance of the no-layers (single-graph) model decreases as the session size increases, while the performance of the other models remains more stable. On the other hand, the single-graph model is less sensitive in some of the more problematic sessions, i.e., sessions that are harder to reconstruct and produce a relatively high MSE loss.
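
The per-session-size aggregation described above can be sketched as follows. This is a minimal illustration, not the code used in the experiments: the function names are hypothetical, and the RMSE of a group is assumed to be the square root of that group's average MSE.

```python
import math
from collections import defaultdict


def mse(original, reconstructed):
    """Mean squared error over all cells of two equally sized matrices."""
    squared = [
        (o - r) ** 2
        for orig_row, rec_row in zip(original, reconstructed)
        for o, r in zip(orig_row, rec_row)
    ]
    return sum(squared) / len(squared)


def average_error_by_size(samples):
    """Group (session_size, mse) pairs by size.

    Returns {size: (average MSE, RMSE)}, where RMSE is taken as the
    square root of the group's average MSE (an assumption).
    """
    grouped = defaultdict(list)
    for size, error in samples:
        grouped[size].append(error)
    result = {}
    for size in sorted(grouped):
        mean_error = sum(grouped[size]) / len(grouped[size])
        result[size] = (mean_error, math.sqrt(mean_error))
    return result


# Hypothetical per-session errors: (number of events, reconstruction MSE).
samples = [(2, 0.04), (2, 0.12), (5, 1.0)]
for size, (avg_mse, rmse) in average_error_by_size(samples).items():
    print(size, avg_mse, rmse)
```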

    In an attempt to create a model that performs better on higher degrees of relationships, we trained additional models with similar data representations but with a minimum session size of 10. In addition, we added more samples to the training set of those models so that they would have a similar number of samples to the models without a minimum session size. This resulted in the 6 models we compared in this phase:

    1. Two single-graph (or layer) models - with and without minimum session size

    2. Two multi-graphs - Count diagonal models - with and without minimum session size

    3. Two multi-graphs - Full count models - with and without minimum session size

    Since three of our models were trained only on sessions with 10 events or more, we also filtered the test set to contain only sessions with 10 events or more in order to evaluate all models. In addition, we filtered out duplicate sessions (about 70% of the sessions had two or more similar sessions), allowing each unique sample to have a similar weight in the results.
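
The test-set filtering described above can be sketched as follows. The sketch assumes that a session is represented as a mapping from construct ID to count and that two sessions are duplicates when they contain exactly the same (construct, count) pairs; both the representation and the function names are illustrative assumptions.

```python
def session_key(session):
    """Order-independent key: the set of (construct_id, count) pairs."""
    return frozenset(session.items())


def filter_test_set(sessions, min_size=10):
    """Keep the first copy of each unique session with at least `min_size` events."""
    seen = set()
    kept = []
    for session in sessions:
        if len(session) < min_size:
            continue  # too few events for the minimum-session-size comparison
        key = session_key(session)
        if key in seen:
            continue  # duplicate of a session that was already kept
        seen.add(key)
        kept.append(session)
    return kept


# Tiny example with min_size=2 so that the effect is visible.
sessions = [
    {"c1": 1, "c2": 2},
    {"c2": 2, "c1": 1},  # same pairs in a different order: duplicate
    {"c1": 1},           # below the minimum size
    {"c1": 1, "c2": 3},  # different count: kept
]
print(filter_test_set(sessions, min_size=2))
```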

    Model type                  MSE      std       min10 MSE   min10 std
    Single graph                3.043    15.879    20.568      245.857
    M-graphs - Count diagonal   0.228    5.058     0.917       32.133
    M-graphs - Full count       6.812    37.928    14.812      169.834
    Table 7.3: Average MSE and standard deviation comparison, with and without minimum session size

    Table 7.3 shows the average MSE and standard deviation for the 6 models. The two left columns give the MSE and std for the models trained without a minimum session size, and the two right columns (min10) give the results for the models trained with a minimum session size of 10. Even though the min10 models were trained on more samples (sessions) that included more events, i.e., denser matrices, the models trained with no minimum session size performed better.

    Figure 7.13: Average RMSE for different session sizes for the multi-graph - count diagonal model, trained with a minimum session size compared to without one.

    In Figure 7.13 we can see that on problematic samples (i.e., sessions with relatively high reconstruction error) the model trained without a minimum session size performs better, whereas on non-problematic samples the model trained with a minimum session size of 10 performs better. From this section on, we refer to problematic samples as sessions that were reconstructed by the multi-graph - count diagonal model with a reconstruction error (MSE) above 1. Specifically, there were 98 such problematic samples for the model trained with a minimum session size of 10, and 38 for the model with no minimum session size, which is another indication that the model trained with no minimum session size performs better.

    7.3.3 Analyzing Problematic Sessions

    To further understand the difficulties in reconstructing the problematic sessions, we extracted the relevant problematic sessions from our data set. We reviewed the sessions with emphasis on the following details:

    • The number of events in the created relations graph

    • The number of events in the original session

    • The count values for the different events in the session (with emphasis on the largest and smallest values)

    From our review, we noticed that the same two types of events appear throughout all sessions, each time with very different ratios (from 1 to 0.016), and in 24 of 25 cases the maximum count belonged to the same type of event. In addition, in all problematic sessions the minimum count was 1, while the maximum was between 390 and 9984. Furthermore, all those sessions contained between 68 and over 500 different types of events that were not part of the relations graph. We reviewed those events and found that they did not appear in the training set, i.e., they were unique or uncommon in comparison to the events that did appear.

    In Figure 7.8 we can see that the preliminary analysis found connected components larger than 300, which included strong correlations between the events, while our graph representation kept only 228 types of events.

    7.3.4 Data Normalization Comparison

    In the previous section we have presented problematic sessions. Throughout the analysis of those sessions we could conclude that we had two main issues:

    1. Less common or new events that were not represented in the graphs.

    2. Large difference between the minimum and maximum count value.

    At this stage of the research, we were not able to handle new events. In addition, the architecture of our model did not allow us to add less common events because of storage and memory limits. However, we wanted to deal with the large difference between count values. The value of edge (i,j) in the relations graph is the inverse of the value of edge (j,i), which may cause some edges to have very small values while their inverse edges have very large values. Therefore, we suggested two normalization techniques:

    1. Log normalization

    2. Limit normalization

    The logarithmic normalization (or transformation) is formulated as follows:

    $$ s = c \cdot \log(1 + r) $$

    where $s$ is the output image, $r$ is the input image, and $c$ is the scaling constant, which is given by:

    $$ c = \frac{1}{\log\left(1 + \max(r)\right)} $$

    This results in a scaled matrix whose cells all have values between 0 and 1. When a logarithmic transformation is applied to a digital image, the darker intensity values are given brighter values, making the details in darker or gray areas of the image more visible to the human eye. Our intuition in using this transformation is that it makes the smallest values more visible and easier to learn.
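
A minimal sketch of such a logarithmic scaling is given below. It assumes the common image-processing form s = c * log(1 + r), with the constant c chosen so that the largest cell maps to 1; the function name and the example matrix are illustrative and not taken from the experiments.

```python
import math


def log_normalize(matrix):
    """Scale non-negative cells into [0, 1] using s = c * log(1 + r).

    Assumption: c = 1 / log(1 + max(r)), so the largest cell maps to 1.
    """
    peak = max(max(row) for row in matrix)
    if peak <= 0:
        # An all-zero matrix stays all-zero.
        return [[0.0 for _ in row] for row in matrix]
    c = 1.0 / math.log1p(peak)
    return [[c * math.log1p(value) for value in row] for row in matrix]


# Hypothetical adjacency values spanning two orders of magnitude.
matrix = [[0.0, 1.0], [99.0, 9.0]]
for row in log_normalize(matrix):
    print(row)
```

In this example the value 9, which is about 9% of the maximum before scaling, ends up at roughly half of the maximum afterwards, which matches the intuition of making smaller values more visible.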

    The limit normalization consists of limiting the maximum count value for each session to 100, and then applying matrix normalization with the Euclidean norm. It can be formulated as follows:

    $$ \hat{A} = \frac{A}{\lVert A \rVert}, \qquad \lVert A \rVert = \sqrt{\sum_{i=1}^{n} \sum_{j=1}^{n} a_{ij}^{2}} $$

    where $A$ is the adjacency matrix, $n$ is the matrix's dimension, and $a_{ij}$ is cell (i,j) of the adjacency matrix, limited to a maximum value of 100. The intuition behind this normalization technique is to limit the extreme count values (over 90% of the count values are 100 or less), so that the differences between non-zero values of the adjacency matrix are smaller than they would be with plain matrix normalization.
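
The limit normalization can be sketched in a few lines. The sketch assumes that the limit is applied cell by cell to the adjacency matrix and that the Euclidean norm is taken over all cells (the Frobenius norm); the function name and example values are illustrative.

```python
import math


def limit_normalize(matrix, limit=100.0):
    """Clip every cell at `limit`, then divide by the Euclidean norm of all cells."""
    clipped = [[min(value, limit) for value in row] for row in matrix]
    norm = math.sqrt(sum(value * value for row in clipped for value in row))
    if norm == 0:
        return clipped  # all-zero matrix: nothing to scale
    return [[value / norm for value in row] for row in clipped]


# A cell of 9984 is first limited to 100, so it no longer dwarfs the others.
matrix = [[0.0, 9984.0], [30.0, 40.0]]
for row in limit_normalize(matrix):
    print(row)
```

After clipping, the largest cell is at most 100, so the ratio between the largest and smallest non-zero cells is bounded by the limit rather than by the most extreme count in the session.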

    The data set and testing phase for these two normalization methods are similar to those of the previous sections, where we review the loss of the reconstructed relations graph.

    Model type                 Normalization   norm - MSE   norm - std
    Single graph               Limit           0.0000084    0.0000136
    M-Graph - diagonal count   Limit           0.0000014    0.0000017
    M-Graph - full count       Limit           0.0000067    0.0000037
    Single graph               Log             0.00102      0.00168
    M-Graph - diagonal count   Log             0.00036      0.00047
    M-Graph - full count       Log             0.00038      0.00053
    Table 7.4: Average MSE and standard deviation comparison, different normalization methods

    First, we review the comparison for each normalization method in Table 7.4. From this table we can see that our results from the previous sections still stand: the multi-graph - count diagonal model still provides the best results for each technique. In addition, we wanted to see which of the methods handles high degrees of relationships better and has a lower relative standard deviation, which would correspond to better reconstruction of the problematic sessions presented in the previous section.

    Figure 7.14: Average RMSE for different session sizes for different models with the limit normalization method.
    Figure 7.15: Average RMSE for different session sizes for different models with the log normalization method.

    In Figure 7.14 and Figure 7.15, we can see that for lower degrees of relationships the single-graph model may provide better results, although by a very small margin. When it comes to standard deviation and higher degrees of relationships, the model that performs best with both methods is the multi-graph - count diagonal model.

    While we have observed the benefits of using the MGAE, we wanted to determine which normalization method would benefit us the most. In order to do so, we scaled the reconstruction errors of all test samples for each model to have mean = 1, and calculated the standard deviation.

    Model type             Limit norm - std   Log norm - std   No norm - std
    Single graph           1.62               1.64             2.33
    M-Graph - diag count   1.16               1.31             22.11
    M-Graph - full count   0.55               1.39             12.16
    Table 7.5: Relative standard deviation comparison, different normalization methods
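
The relative standard deviation can be sketched as follows, reading the scaling step as dividing every error by the mean error so that the scaled errors have mean 1. This reading is an assumption, although it agrees with the ratio of std to MSE in Table 7.4 (for the single-graph model with limit normalization, 0.0000136 / 0.0000084 is approximately 1.62). The use of the population standard deviation is also an assumption.

```python
import statistics


def relative_std(errors):
    """Standard deviation of the errors after rescaling them to have mean 1.

    This equals std / mean of the raw errors (the coefficient of variation).
    """
    mean_error = statistics.fmean(errors)
    scaled = [error / mean_error for error in errors]
    return statistics.pstdev(scaled)


# Hypothetical reconstruction errors for a handful of test sessions.
errors = [0.2, 0.4, 0.1, 3.3]
print(relative_std(errors))

# The same quantity computed from the aggregate values of Table 7.4.
print(0.0000136 / 0.0000084)
```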

    The results for the scaled standard deviation are presented in Table 7.5. In addition, we plotted the comparison for the same MGAE method (count diagonal) in Figure 7.16, where we can see the benefit of using either normalization method rather than leaving the data unnormalized.

    Figure 7.16: Average RMSE for different session sizes for MGAE - Count diagonal with different normalization methods.

    Chapter 8 Summary and Conclusions

    In this thesis we researched how relationships in aggregated data can be exploited for learning tasks. The main task that drove this research was creating an anomaly detection mechanism for aggregated log data sets, which posed a unique challenge that did not exist in previous literature, yet is a real-world problem.

    Our first main contribution in this thesis is an analysis of relationships between aggregated entities. The analysis allowed us to create a completely different dataset representation, which exploits the relationships between the entities in different ways.

    We defined the entities referred to during the research, the event (construct) and the session, and showed how we could find a simple correlation to exploit later in the research. After defining and analyzing the effect of the correlations, we presented the graph representations. The graph representations allowed us to exploit different properties of the relationships we researched, and therefore we could successfully build different types of autoencoder models.

    Finally, we proposed the multi-graph autoencoder (MGAE) model. This model leverages the relationships between instances (events) in the same context (session) to reconstruct the session and its representation. This reconstruction may be used for link prediction and anomaly detection tasks in future work. Additionally, we presented an analysis of the advantage of using this model over a single-graph autoencoder.

    An article about the MGAE model is in preparation and will be submitted in the foreseeable future.

    8.1 Conclusions

    In recent years, there have been attempts to extend autoencoders (AE) to graph domains. Graph autoencoders aim to represent nodes as low-dimensional vectors through unsupervised training. Graph autoencoders can be used for node clustering and link prediction and have shown promising performance on such tasks [38].

    Our main objective in the first part of the research was to investigate the existence of correlations in our dataset, and to assess the potential of learning these correlations and using them to detect anomalies. In addition, we presented simple anomaly detection models that exploit such correlations.

    Later, we demonstrated how to create different graph representations based on the simple patterns we detected in this part. We presented three different graphs that represent our data, where each graph presents a different aspect of the relationships between events.

    In the next part of our research, we presented different ways to represent the data so as to benefit most from the graphs and increase the ability to reconstruct the adjacency matrix. Based on the tests and analysis performed during this research, we may conclude that our MGAE model can reconstruct graphs better than single-graph autoencoder models for our data set.

    For future work, we suggest the following improvements and extensions:

    1. Create an anomaly detection model based on the graph autoencoder - In order to create such a model, we would need to create an experimental framework that uses the reconstruction error as a measurement. We would also need to add baseline anomaly detection algorithms in addition to the one we previously created based on simple correlations.

    2. Add a feature matrix to the graph representation - In some GCN methods a feature matrix is added as part of the network's architecture; this would allow adding other features of the events or sessions that we have not yet included.

    3. Use existing graph embedding models for the different representations - Using existing embedding models would allow our graphs to contain more nodes and edges, and thereby increase the scope of our model.

    4. Extend the model to other datasets - We researched an aggregated dataset that was unique and challenging on its own. However, we may find a method to create multiple graphs for different (and preferably public) datasets, and thereby provide further evidence of the benefits of our model.

    5. Create a more complex autoencoder architecture - The autoencoder architecture we used was meant to demonstrate the abilities and benefits of our model. A better architecture that fits the link prediction or anomaly detection task may allow the model to learn higher degrees of relationships better and decrease the reconstruction error.

    Appendix A Dataset Information

    The dataset made available to this research includes information about SQL commands issued by users to a database protected by a Guardium instance. A major challenge in this research is that the available dataset does not contain information about specific invocations of SQL commands. Rather, it contains information about how many times each SQL command has been issued in a specific time range, by a specific database user, in a specific session. In detail, each record has the following fields:

    • Instance_id. The ID of the Guardium instance.

    • Period_start. The start time of the specified time range.

    • Session_id. A unique identifier created for a specific user and a specific connection.

    • Construct_id. A unique identifier that describes the issued SQL command in a session.

    • Verb_object. The verb object provides additional information about the issued SQL command, in the form of a list of pairs (object, action), where the object represents a database entity such as a table, and an action represents the action done to the object, e.g., select or insert. The verb object is a list of (object,action) pairs since an SQL command may consist of performing multiple actions, e.g., if the SQL command calls a stored procedure.

    • Count. The number of times an event with this verb object has been executed in the same session in the specified time range.

    • Failed. The number of times the query has failed.

    • Db_user. The database user used to issue the SQL command.

    • Os_user. The relevant OS user.

    • Source_program. The application the user used to access the DB.

    • Server_IP. The server IP.

    • Client_IP. The client IP.

    • Service_name. The name of the DB.

    • Host_name. The client's computer name.

    After discussions with a domain expert, we established the following functional dependencies between some of the fields in our dataset:

    1. Construct_ID → verb_object

    2. Session_ID → instance_ID, period_start, db_user, os_user, source_program, server_IP, client_IP, service_name, host_name

    3. Session_ID, Construct_ID → count, failed

    The main fields we considered in the thesis are session_ID, construct_ID, period_start, and count.
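
Functional dependencies such as those listed above can be checked mechanically on a set of records. The sketch below is a minimal illustration: the helper `holds`, the lowercase field names, and the toy records are hypothetical and are not taken from the dataset.

```python
def holds(records, determinant, dependent):
    """Return True if the functional dependency determinant -> dependent holds.

    `records` is a list of dicts; `determinant` and `dependent` are lists of
    field names. The dependency fails as soon as one determinant value is
    seen with two different dependent values.
    """
    seen = {}
    for record in records:
        key = tuple(record[field] for field in determinant)
        value = tuple(record[field] for field in dependent)
        if seen.setdefault(key, value) != value:
            return False
    return True


# Toy records using a subset of the fields described in this appendix.
records = [
    {"session_id": "s1", "construct_id": "c1", "verb_object": "(t1, select)", "db_user": "alice", "count": 3},
    {"session_id": "s1", "construct_id": "c2", "verb_object": "(t2, insert)", "db_user": "alice", "count": 1},
    {"session_id": "s2", "construct_id": "c1", "verb_object": "(t1, select)", "db_user": "bob", "count": 7},
]

print(holds(records, ["construct_id"], ["verb_object"]))
print(holds(records, ["session_id", "construct_id"], ["count"]))
print(holds(records, ["construct_id"], ["count"]))
```

In the toy data the construct alone does not determine the count, because the same construct appears with different counts in two sessions, which is why the third dependency in the list uses the pair of session and construct.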

    References

    1. S. Ahmad, A. Lavin, S. Purdy and Z. Agha (2017-06) Unsupervised real-time anomaly detection for streaming data. Neurocomputing.
    2. H. Barringer, A. Groce, K. Havelund and S. Margaret (2010-11) Formal analysis of log files. Journal of Aerospace Computing, Information, and Communication 7.
    3. A. Brown, A. Tuor, B. Hutchinson and N. Nichols (2018) Recurrent neural network attention mechanisms for interpretable system log anomaly detection. MLCS’18, New York, NY, USA. ISBN 9781450358651.
    4. S. Cao, W. Lu and Q. Xu (2016) Deep neural networks for learning graph representations. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, AAAI’16, pp. 1145–1152.
    5. V. Chandola, A. Banerjee and V. Kumar (2009-07) Anomaly detection: a survey. ACM Comput. Surv. 41.
    6. M. Chen, X. Shi, Y. Zhang, D. Wu and M. Guizani (2017) Deep features learning for medical image analysis with convolutional autoencoder neural network. IEEE Transactions on Big Data.
    7. D. O. Cook, R. Kieschnick and B. D. McCullough (2008) Regression analysis of proportions in finance with self selection. Journal of empirical finance 15 (5), pp. 860–867.
    8. T. E. Cooke (1998) Regression analysis in accounting disclosure studies. Accounting and business research 28 (3), pp. 209–224.
    9. Z. Cui, K. Henrickson, R. Ke and Y. Wang (2019) Traffic graph convolutional recurrent neural network: a deep learning framework for network-scale traffic learning and forecasting. IEEE Transactions on Intelligent Transportation Systems.
    10. Y. Dang, B. Wang, R. Brant, Z. Zhang, M. Alqallaf and Z. Wu (2017) Anomaly detection for data streams in large-scale distributed heterogeneous computing environments. In International Conference on Management Leadership and Governance.
    11. T. E. Dielman (2001) Applied regression analysis for business and economics. Duxbury/Thomson Learning Pacific Grove, CA.
    12. N. R. Draper and H. Smith (1998) Applied regression analysis. Vol. 326, John Wiley & Sons.
    13. M. M. Drugan (2016) A bayesian model for anomaly detection in sql databases for security systems. In 2016 IEEE Symposium Series on Computational Intelligence (SSCI), pp. 1–8.
    14. M. Du, F. Li, G. Zheng and V. Srikumar (2017-10) DeepLog: anomaly detection and diagnosis from system logs through deep learning. pp. 1285–1298.
    15. D. K. Duvenaud, D. Maclaurin, J. Iparraguirre, R. Bombarell, T. Hirzel, A. Aspuru-Guzik and R. P. Adams (2015) Convolutional networks on graphs for learning molecular fingerprints. In Advances in neural information processing systems, pp. 2224–2232.
    16. E. Eskin, A. Arnold, M. Prerau, L. Portnoy and S. Stolfo (2002) A geometric framework for unsupervised anomaly detection. In Applications of data mining in computer security, pp. 77–101.
    17. Q. Fu, J. Lou, Y. Wang and J. Li (2009) Execution anomaly detection in distributed systems through unstructured log analysis. In 2009 ninth IEEE international conference on data mining, pp. 149–158.
    18. S. Ghanbari and C. Amza (2008-07) Semantic-driven model composition for accurate anomaly diagnosis. pp. 35–44. ISBN 978-0-7695-3175-5.
    19. I. Goodfellow, Y. Bengio and A. Courville (2016) Deep learning. The MIT Press. ISBN 0262035618.
    20. S. Guo, Y. Lin, N. Feng, C. Song and H. Wan (2019) Attention based spatial-temporal graph convolutional networks for traffic flow forecasting. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 922–929.
    21. H. Hamooni, B. Debnath, J. Xu, H. Zhang, G. Jiang and A. Mueen (2016) LogMine: fast pattern recognition for log analytics. In CIKM.
    22. S. He, J. Zhu, P. He and M. R. Lyu (2016) Experience report: system log analysis for anomaly detection. In 2016 IEEE 27th International Symposium on Software Reliability Engineering (ISSRE), pp. 207–218.
    23. M. Imran Khan, B. O’Sullivan and S. N. Foley (2018-01) A semantic approach to frequency based anomaly detection of insider access in database management systems. pp. 18–28. ISBN 978-3-319-76686-7.
    24. T. N. Kipf and M. Welling (2016) Variational graph auto-encoders. arXiv preprint arXiv:1611.07308.
    25. T. N. Kipf and M. Welling (2016) Semi-supervised classification with graph convolutional networks. CoRR abs/1609.02907.
    26. J. M. Kousser (1973) Ecological regression and the analysis of past politics. The Journal of Interdisciplinary History 4 (2), pp. 237–262.
    27. M. Landauer, M. Wurzenberger, F. Skopik, G. Settanni and P. Filzmoser (2018) Dynamic log file analysis: an unsupervised cluster evolution approach for anomaly detection. Computers & Security 79, pp. 94–116.
    28. K. Leung and C. Leckie (2005) Unsupervised anomaly detection in network intrusion detection using clusters. In Proceedings of the Twenty-eighth Australasian conference on Computer Science-Volume 38, pp. 333–342.
    29. K. Lu, Z. Chen, Z. Jin and J. Guo (2003) An adaptive real-time intrusion detection system using sequences of system call. In CCECE 2003-Canadian Conference on Electrical and Computer Engineering. Toward a Caring and Humane Technology (Cat. No. 03CH37436), Vol. 2, pp. 789–792.
    30. J. Majumdar, S. Naraseeyappa and S. Ankalaki (2017) Analysis of agriculture data using data mining techniques: application of big data. Journal of Big data 4 (1), pp. 20.
    31. A. Makhzani and B. Frey (2014) A winner-take-all method for training sparse convolutional autoencoders. In NIPS Deep Learning Workshop.
    32. L. Mariani and F. Pastore (2008-11) Automated identification of failure causes in system logs. In 2008 19th International Symposium on Software Reliability Engineering (ISSRE), pp. 117–126. ISSN 1071-9458.
    33. J. Masci, U. Meier, D. Cireşan and J. Schmidhuber (2011) Stacked convolutional auto-encoders for hierarchical feature extraction. In Artificial Neural Networks and Machine Learning – ICANN 2011, T. Honkela, W. Duch, M. Girolami and S. Kaski (Eds.), Berlin, Heidelberg, pp. 52–59. ISBN 978-3-642-21735-7.
    34. J. Masci, U. Meier, D. Cireşan and J. Schmidhuber (2011) Stacked convolutional auto-encoders for hierarchical feature extraction. In International conference on artificial neural networks, pp. 52–59.
    35. D. C. Montgomery, E. A. Peck and G. G. Vining (2012) Introduction to linear regression analysis. Vol. 821, John Wiley & Sons.
    36. S. Pan, R. Hu, G. Long, J. Jiang, L. Yao and C. Zhang (2018) Adversarially regularized graph autoencoder for graph embedding. arXiv:1802.04407.
    37. O. Raz, P. Koopman and M. Shaw (2002-02) Semantic anomaly detection in online data sources. pp. 302–312. ISBN 1-58113-472-X.
    38. G. Salha, R. Hennequin and M. Vazirgiannis (2019) Keep it simple: graph autoencoders without graph convolutional networks. arXiv:1910.00942.
    39. M. Schlichtkrull, T. N. Kipf, P. Bloem, R. van den Berg, I. Titov and M. Welling (2017) Modeling relational data with graph convolutional networks. arXiv:1703.06103.
    40. S. M. Stigler (1986) The history of statistics: the measurement of uncertainty before 1900. Harvard University Press.
    41. S. Tan and B. Li (2014) Stacked convolutional auto-encoders for steganalysis of digital images. In Signal and Information Processing Association Annual Summit and Conference (APSIPA), 2014 Asia-Pacific, pp. 1–4.
    42. N. Todua, P. Babilua and T. Dochviri (2013) On the multiple linear regression in marketing research. Bull. Georg. Natl. Acad. Sci 7 (3), pp. 135–139.
    43. K. Tu, P. Cui, X. Wang, P. S. Yu and W. Zhu (2018) Deep recursive network embedding with regular equivalence. KDD ’18, New York, NY, USA, pp. 2357–2366. ISBN 9781450355520.
    44. R. van den Berg, T. N. Kipf and M. Welling (2017) Graph convolutional matrix completion. arXiv:1706.02263.
    45. D. Wang, P. Cui and W. Zhu (2016) Structural deep network embedding. KDD ’16, New York, NY, USA, pp. 1225–1234. ISBN 9781450342322.
    46. S. Weisberg (2005) Applied linear regression. Vol. 528, John Wiley & Sons.
    47. S. Yan, Y. Xiong and D. Lin (2018) Spatial temporal graph convolutional networks for skeleton-based action recognition. arXiv preprint arXiv:1801.07455.
    48. L. Yao, C. Mao and Y. Luo (2019) Graph convolutional networks for text classification. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 7370–7377.
    49. B. Yu, H. Yin and Z. Zhu (2018-07) Spatio-temporal graph convolutional networks: a deep learning framework for traffic forecasting. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI-18, pp. 3634–3640.
    50. W. Yu, C. Zheng, W. Cheng, C. Aggarwal, D. Song, B. Zong, H. Chen and W. Wang (2018-07) Learning deep network representations with adversarially regularized autoencoders. pp. 2663–2671.
    51. Y. Zhang (2013-12) An adaptive flow counting method for anomaly detection in sdn. pp. 25–30.
    52. L. Zhao, Y. Song, C. Zhang, Y. Liu, P. Wang, T. Lin, M. Deng and H. Li (2019) T-gcn: a temporal graph convolutional network for traffic prediction. IEEE Transactions on Intelligent Transportation Systems, pp. 1–11. ISSN 1558-0016.
    53. J. Zhou, G. Cui, Z. Zhang, C. Yang, Z. Liu, L. Wang, C. Li and M. Sun (2018) Graph neural networks: a review of methods and applications. arXiv preprint arXiv:1812.08434.
    54. C. Zhuang and Q. Ma (2018) Dual graph convolutional networks for graph-based semi-supervised classification. In Proceedings of the 2018 World Wide Web Conference, pp. 499–508.