Accelerating Dependency Graph Learning from Heterogeneous Categorical Event Streams via Knowledge Transfer

Chen Luo (Rice University, Houston, Texas) cl67@rice.edu; Zhengzhang Chen (NEC Labs America, Princeton, New Jersey) zchen@nec-labs.com; Lu-An Tang (NEC Labs America, Princeton, New Jersey) ltang@nec-labs.com; Anshumali Shrivastava (Rice University, Houston, Texas) anshumali@rice.edu; and Zhichun Li (NEC Labs America, Princeton, New Jersey) zhichun@nec-labs.com
Abstract.

A dependency graph is a heterogeneous graph representing the intrinsic relationships between different pairs of system entities; it is essential to many data analysis applications, such as root cause diagnosis and intrusion detection. Given a well-trained dependency graph from a source domain and an immature dependency graph from a target domain, how can we extract the entity and dependency knowledge from the source to enhance the target? One way is to directly apply a mature dependency graph learned from a source domain to the target domain. But due to the domain variety problem, directly using the source dependency graph often cannot achieve good performance. Traditional transfer learning methods mainly focus on numerical data and are not applicable.

In this paper, we propose ACRET, a knowledge transfer based model for accelerating dependency graph learning from heterogeneous categorical event streams. In particular, we first propose an entity estimation model to filter out irrelevant entities from the source domain based on entity embedding and manifold learning. Only the entities with statistically high correlations are transferred to the target domain. On the surviving entities, we propose a dependency construction model for constructing the unbiased dependency relationships by solving a two-constraint optimization problem. The experimental results on synthetic and real-world datasets demonstrate the effectiveness and efficiency of ACRET. We also apply ACRET to a real enterprise security system for intrusion detection. Our method is able to achieve superior detection performance at least 20 days earlier, with more than 70% accuracy.

1. Introduction

Heterogeneous categorical event data are ubiquitous. Consider system surveillance data in enterprise networks, where each data point is a system event that involves heterogeneous types of entities: time, user, source process, destination process, and so on. Mining such event data is a challenging task due to two unique characteristics of the data: (1) the exponentially large event space. For example, in a typical enterprise network, hundreds (or thousands) of hosts incessantly generate operational data, and a single host alone normally generates a large number of events per second; and (2) the data varieties and dynamics. The variety of system entity types may necessitate high-dimensional features in subsequent processing, and the event data may change dramatically over time, especially considering the heterogeneous categorical event streams (Aggarwal et al., 2003; Aggarwal, 2007; Manzoor et al., 2016).

Figure 1. Comparison between the traditional workflow and the ACRET workflow for learning a dependency graph. ACRET extracts knowledge from a well-trained dependency graph to speed up the training process of a new dependency graph.

To address the above challenges, dependency graphs (He and Zhang, 2011; King et al., 2005; Xu et al., 2016) have recently attracted growing interest. Such dependency graphs can be applied to model a variety of systems, including enterprise networks (Xu et al., 2016), societies (Myers et al., 2014), ecosystems (Kawale et al., 2013), etc. For instance, we can represent an enterprise network as a dependency graph, with nodes representing system entities such as processes, files, or network sockets, and edges representing the system events between entities (e.g., a process reads a file). This enterprise system dependency graph can be applied to many forensic analysis tasks such as intrusion detection, risk analysis, and root cause diagnosis (Xu et al., 2016). A social network can also be modeled as a dependency graph representing the social interactions between different users. Then, this social dependency graph can be used for user behavior analysis or abnormal user detection (Kim and Leskovec, 2011).

However, due to the aforementioned data characteristics, learning a mature dependency graph from heterogeneous categorical event streams often requires a long period of time. For instance, the dependency graph of an enterprise network needs to be trained for several weeks before it can be applied to intrusion detection or risk analysis, as illustrated in Fig. 1. Furthermore, every time the system is deployed in a new environment, the entire dependency graph needs to be rebuilt. This process is both time and resource consuming.

Inspired by cloud services (Krutz and Vines, 2010), one way to avoid the time-consuming rebuilding process is to reuse a unified dependency graph model in different domains/environments. However, due to the domain variety, directly applying the dependency graph learned from an old domain to a new domain often cannot achieve good performance. For example, the enterprise network of an IT company (an active environment) is very different from the enterprise network of an electric company (a stable environment). Thus, the enterprise dependency graph of the IT company contains many unique system entities that cannot be found in the dependency graph of the electric company. Nevertheless, there is still a lot of room for transfer learning.

Transfer learning has shed light on how to tackle such domain differences (Pan and Yang, 2010). It has been successfully applied in various data mining and machine learning tasks, such as clustering and classification (Chen et al., 2016). However, most transfer learning algorithms focus on numerical data (Dai et al., 2007; Sun et al., 2015; Chattopadhyay et al., 2013). When it comes to graph-structured data, there is far less existing work (Fang et al., 2015; He et al., 2009), not to mention dependency graphs. This motivates us to propose a novel knowledge transfer-based method for dependency graph learning.

In this paper, we propose ACRET, a knowledge transfer based method for accelerating dependency graph learning from heterogeneous categorical event streams. ACRET consists of two sub-models: EEM (Entity Estimation Model) and DCM (Dependency Construction Model). Specifically, EEM first filters out irrelevant entities from the source domain based on entity embedding and manifold learning. Only the entities with statistically high correlations are transferred to the target domain. Then, based on the retained entities, the DCM model effectively constructs unbiased dependency relationships between different entities for the target dependency graph by solving a two-constraint optimization problem. We conduct an extensive set of experiments on both synthetic and real-world data to evaluate the performance of ACRET. The results demonstrate the effectiveness and efficiency of our proposed algorithm. We also apply ACRET to a real enterprise security system for intrusion detection. Our method is able to achieve superior detection performance at least 20 days earlier, with more than 70% accuracy.

2. Preliminaries and Problem Statement

Figure 2. An overview of an enterprise security system.

In this section, we introduce some notations and define the problem.

Heterogeneous Categorical Event. A heterogeneous categorical event $e = (a_1, \ldots, a_m)$ is a record that contains $m$ different categorical attributes, where the $i$-th attribute value $a_i$ denotes an entity from the type $A_i$.

For example, in an enterprise system (as illustrated in Fig. 2), a process event (e.g., a program opens a file or connects to a server) can be regarded as a heterogeneous categorical event. It contains information such as timing, type of operation, information flow directions, user, and source/destination process.
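To make the record structure concrete, here is a minimal sketch in Python; the field names and example values are illustrative assumptions, not the paper's actual schema.

```python
# A heterogeneous categorical event: each attribute value is an entity
# drawn from a typed domain (user, process, file, socket, ...).
from dataclasses import dataclass

@dataclass(frozen=True)
class Event:
    timestamp: str     # e.g., "2017-03-01T12:00:05"
    user: str          # entity of type "user"
    operation: str     # e.g., "open", "connect"
    src_process: str   # entity of type "process"
    dst_entity: str    # entity of type "file" or "internet socket"

e = Event("2017-03-01T12:00:05", "alice", "open", "python", "/etc/passwd")
```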

By continuously monitoring/auditing the heterogeneous categorical event data (streams) generated by the physical system, one can generate the corresponding dependency graph of the system, as in (He and Zhang, 2011; King et al., 2005; Xu et al., 2016). This dependency graph is a heterogeneous graph representing the dependencies/interactions between different pairs of entities. Formally, we define the dependency graph as follows:

Dependency Graph. A dependency graph is a heterogeneous undirected weighted graph $G = (V, E)$, where $V$ is the set of heterogeneous system entities, and $n = |V|$ is the total number of entities in the dependency graph; $E$ is the set of dependency relationships/edges between different entities. For ease of discussion, we use the terms edge and dependency interchangeably in this paper. An undirected edge $e_{ij}$ between a pair of entities $v_i$ and $v_j$ exists depending on whether they have a dependency relation or not. The weight $w_{ij}$ of the edge denotes the intensity of the dependency relation.

In an enterprise system, a dependency graph can be a weighted graph between different system entities, such as processes, files, users, and Internet sockets. The edges in the dependency graph are the causality relations between different entities.
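As a minimal illustration (not the paper's implementation), such a graph can be represented with networkx, storing the entity type on each node and the dependency intensity as the edge weight; all identifiers below are hypothetical.

```python
import networkx as nx

G = nx.Graph()
G.add_node("p:sshd", etype="process")
G.add_node("f:/etc/passwd", etype="file")
G.add_node("s:10.0.0.5:22", etype="internet_socket")
# edge weight = intensity of the dependency (e.g., observed event frequency)
G.add_edge("p:sshd", "f:/etc/passwd", weight=12.0)
G.add_edge("p:sshd", "s:10.0.0.5:22", weight=3.0)
```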

As shown in Fig. 2, the enterprise security system utilizes the accumulated historical heterogeneous system data from event streams to construct the system dependency graph and updates the graph periodically. The learned dependency graph is applied to forensic analysis applications such as intrusion detection, risk analysis, and incident backtracking.

The problems of cold-start and time-consuming training reflect a great demand for an automated tool for effectively transferring dependency graphs between different domains. Motivated by this, this paper focuses on accelerating the dependency graph learning via knowledge transfer. Based on the definitions described above, we formally define our problem as follows:

Knowledge Transfer for Dependency Graph Learning. Given two domains: a source domain $\mathcal{D}_S$ and a target domain $\mathcal{D}_T$. In the source domain $\mathcal{D}_S$, we have a well-trained dependency graph $G_S$ generated from the heterogeneous categorical event streams. In the target domain $\mathcal{D}_T$, we have a small incomplete dependency graph $G_T$ trained over a short period of time. The task of knowledge transfer for dependency graph learning is to use $G_S$ to help construct a mature dependency graph $\hat{G}_T$ in the domain $\mathcal{D}_T$.

There are two major assumptions for this problem: (1) the event streams in the source domain and target domain are generated by the same physical system; and (2) the entity size of the source dependency graph $G_S$ should be larger than that of the intersection graph $G_S \cap G_T$, because transferring knowledge from a less informative dependency graph to a more informative one is unreasonable.

Figure 3. An overview of the ACRET model.

3. The ACRET Model

To learn a mature dependency graph $\hat{G}_T$, intuitively, we would like to leverage the entity and dependency information from the well-trained source dependency graph $G_S$ to help complete the original small dependency graph $G_T$. One naive way is to directly transfer all the entities and dependencies from the source domain to the target domain. However, due to the domain difference, it is likely that many entities and their corresponding dependencies appear in the source domain but not in the target domain. Thus, one key challenge in our problem is how to identify the domain-specific/irrelevant entities from the source dependency graph. After removing the irrelevant entities, another challenge is how to construct the dependencies between the transferred entities by adapting to the domain difference while following the same dependency structure as in $G_S$. To address these two key challenges, we propose a knowledge transfer algorithm with two sub-models: EEM (Entity Estimation Model) and DCM (Dependency Construction Model), as illustrated in Fig. 3. We first introduce these two sub-models separately in detail, and then combine them into a unified algorithm.

3.1. Eem: Entity Estimation Model

For the first sub-model, Entity Estimation Model, our goal is to filter out the entities in the source dependency graph that are irrelevant to the target domain. To achieve this, we need to deal with two main challenges: (1) the lack of intrinsic correlation measures among categorical entities, and (2) heterogeneous relations among different entities in the dependency graph.

To overcome the lack of intrinsic correlation measures among categorical entities, we embed entities into a common latent space where their semantics can be preserved. More specifically, each entity, such as a user or a process in a computer system, is represented as a $d$-dimensional vector that is automatically learned from the data. In the embedding space, the correlation of entities can be naturally computed by distance/similarity measures in the space, such as Euclidean distance, vector dot product, and so on. Compared with other distance/similarity metrics defined on sets, such as Jaccard similarity, the embedding method is more flexible and has nice properties such as transitivity (Zhang et al., 2015).

To address the challenge of heterogeneous relations among different entities, we use the meta-path proposed in (Sun and Han, 2012) to model the heterogeneous relations. For example, in a computer system, a meta-path can be a “Process-File-Process” or a “File-Process-Internet Socket”. “Process-File-Process” denotes the relationship of two processes loading the same file, and “File-Process-Internet Socket” denotes the relationship of a file being loaded by a process that opened an Internet socket.

The meta-paths induced from a heterogeneous network can be potentially infinite, but not every one of them is relevant and useful for the specific task of interest. There is existing work (Chen and Sun, 2017) on automatically selecting the meta-paths for specific tasks.
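To illustrate how a single meta-path induces a homogeneous graph, the sketch below derives the “Process-File-Process” adjacency from a process-file adjacency matrix; building it via a matrix product is a standard construction, assumed here rather than prescribed by the paper.

```python
import numpy as np

A_pf = np.array([[1, 0, 1],    # process 0 touches files 0 and 2
                 [0, 1, 1],    # process 1 touches files 1 and 2
                 [1, 0, 0]])   # process 2 touches file 0
A_pfp = A_pf @ A_pf.T          # (i, j) = number of files shared by i and j
np.fill_diagonal(A_pfp, 0)     # drop self-loops
```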

Given a set of meta-paths $\mathcal{P} = \{P_1, P_2, \ldots, P_m\}$, where $P_i$ denotes the $i$-th meta-path and $m$ denotes the number of meta-paths, we can construct $m$ graphs $\{G_i\}_{i=1}^{m}$ by each time extracting only the corresponding meta-path from the dependency graph (Sun and Han, 2012). Let $U \in \mathbb{R}^{n \times d}$ denote the vector representations of the entities in $G_S$. Then, we model the relationship between two entities $v_i$ and $v_j$ as:

(1)  $S_{ij} = u_i u_j^{\top},$

where $u_i$ is the embedding of $v_i$, i.e., the $i$-th row of $U$.

In the above, $S$ is a weighted average of all the similarity matrices $S_1, \ldots, S_m$:

(2)  $S = \sum_{i=1}^{m} w_i S_i,$

where the $w_i$'s are non-negative coefficients, and $S_i$ is the similarity matrix constructed by calculating the pairwise shortest paths between entities in $G_i$. Here, $G_i$ is represented by the adjacency matrix $A_i$ of the corresponding meta-path graph. By using the shortest path in the graph, one can capture the long-term relationships between different entities (Bondy and Murty, 1976). Putting Eq. 2 into Eq. 1, we have:

(3)  $\min_{U} \Big\| \sum_{i=1}^{m} w_i S_i - U U^{\top} \Big\|_F^2,$

where $\| \cdot \|_F$ is the Frobenius norm (Han et al., 2011).
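Below is a minimal sketch of Eqs. 1-2 under stated assumptions: each $S_i$ is derived from pairwise shortest paths on the corresponding meta-path graph, and distances are mapped to similarities via $1/(1+d)$, which is our choice rather than the paper's.

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path

def similarity_matrix(A):
    """Pairwise shortest-path similarity for one meta-path graph G_i."""
    d = shortest_path(A, method="D", directed=False, unweighted=True)
    return 1.0 / (1.0 + d)     # unreachable pairs (d = inf) map to 0.0

def combined_similarity(adjacency_list, w):
    """Eq. 2: S = sum_i w_i * S_i."""
    return sum(wi * similarity_matrix(A) for wi, A in zip(w, adjacency_list))
```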

Then, the objective function of the EEM model is:

(4)  $\mathcal{J} = \Big\| \sum_{i=1}^{m} w_i S_i - U U^{\top} \Big\|_F^2 + \theta \, \|w\|_p,$

where $w = (w_1, \ldots, w_m)$ with each $w_i \geq 0$, $\|w\|_p$ is the generalization term (Han et al., 2011), which prevents the model from over-fitting, and $\theta$ is the trade-off factor of the generalization term. In practice, we can choose $p$ as 1 or 2, which bears resemblance to the Hamming distance and Euclidean distance, respectively.

Putting everything together, we get:

(5)  $\min_{U, w} \; \Big\| \sum_{i=1}^{m} w_i S_i - U U^{\top} \Big\|_F^2 + \theta \, \|w\|_p \quad \text{s.t. } w_i \geq 0.$

Then, the optimized values can be obtained by $(U^{*}, w^{*}) = \arg\min_{U, w} \mathcal{J}$.

3.1.1. Inference Method

The objective function in Eq. 5 contains two sets of parameters: (1) the entity vector matrix $U$, and (2) the meta-path weight vector $w$. We therefore propose a two-step iterative method for optimizing the objective, where the entity vector matrix $U$ and the weights $w$ of the meta-paths mutually enhance each other. In the first step, we fix the weight vector $w$ and learn the best entity vector matrix $U$. In the second step, we fix the entity vector matrix $U$ and learn the best weight vector $w$.

Fix $w$ and learn $U$: when we fix $w$, the problem is reduced to $\min_U \| S - U U^{\top} \|_F^2$, where $S = \sum_{i=1}^{m} w_i S_i$ is a constant similarity matrix. Then, the optimization process becomes a traditional manifold learning problem. Fortunately, we have a closed-form solution to this problem via so-called multi-dimensional scaling (Han et al., 2011). To obtain such an embedding, we compute the eigenvalue decomposition of the following matrix:

$-\frac{1}{2} J S J = Q \Lambda Q^{\top},$

where $J = I - \frac{1}{n} \mathbf{1}\mathbf{1}^{\top}$ is the double centering matrix, $Q$ has the eigenvectors as columns, and $\Lambda$ is a diagonal matrix of the eigenvalues. Then, the embedding can be chosen as:

(6)  $U = Q_d \Lambda_d^{1/2},$

where $Q_d$ contains the top $d$ eigenvectors and $\Lambda_d$ the corresponding eigenvalues.
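A minimal sketch of this closed-form step, applying classical multi-dimensional scaling directly to the combined similarity matrix $S$ (the exact preprocessing of $S$ is our assumption):

```python
import numpy as np

def mds_embedding(S, d):
    n = S.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n        # double centering matrix
    B = -0.5 * J @ S @ J
    vals, vecs = np.linalg.eigh(B)             # eigenvalues in ascending order
    idx = np.argsort(vals)[::-1][:d]           # keep the top-d eigenpairs
    lam = np.sqrt(np.maximum(vals[idx], 0.0))  # clamp tiny negative eigenvalues
    return vecs[:, idx] * lam                  # U = Q_d Lambda_d^{1/2}
```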

Fix $U$ and learn $w$: when fixing $U$, the problem is reduced to:

(7)  $\min_{w} \Big\| \sum_{i=1}^{m} w_i S_i - C \Big\|_F^2 + \theta \, \|w\|_p,$

where $C = U U^{\top}$ is a constant matrix, and $\theta$ is also a constant. Then, this function becomes a linear regression, so we also have a closed-form solution for $w$.
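A minimal sketch of this step for $p = 2$, where the problem reduces to ridge regression over the vectorized similarity matrices; the final non-negativity projection is our assumption:

```python
import numpy as np

def solve_w(similarity_list, C, theta):
    X = np.stack([S.ravel() for S in similarity_list], axis=1)  # (n*n, m)
    y = C.ravel()                                               # C = U U^T
    m = X.shape[1]
    w = np.linalg.solve(X.T @ X + theta * np.eye(m), X.T @ y)   # closed form
    return np.maximum(w, 0.0)     # project onto the non-negative orthant
```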

After we get the embedding vectors $U$, the relevance matrix $R$ between different entities can be obtained as:

(8)  $R = U U^{\top}.$

One can use a user-defined threshold to select the entities highly correlated with the target domain for transferring. But a user-defined threshold often suffers from the lack of domain knowledge. So here, we introduce a hypothesis-test based method for automatically thresholding the selection of the entities.

For each entity in $G_S$, we first normalize its relevance scores by $z = (r - \mu)/\sigma$, where $\mu$ is the average value of the scores in $R$ and $\sigma$ is their standard deviation. The standardized scores can be approximated with a Gaussian distribution. Then, the threshold is the critical value $z_{\alpha}$ at a chosen significance level $\alpha$ (e.g., $z_{\alpha} = 1.65$ for $\alpha = 0.05$, or $z_{\alpha} = 2.33$ for $\alpha = 0.01$) (Han et al., 2011). By using this threshold, one can filter out all the statistically irrelevant entities from the source domain, and transfer the highly correlated entities to the target domain.
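A minimal sketch of this thresholding step, assuming one aggregated relevance score per source entity (e.g., its maximum relevance to the target entities):

```python
import numpy as np

def select_transferable(scores, z_crit=1.65):
    """Keep entities whose standardized score passes the one-sided test."""
    z = (scores - scores.mean()) / scores.std()
    return np.where(z > z_crit)[0]   # indices of statistically relevant entities
```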

By combining the transferred entities with the original target domain dependency graph $G_T$, we get an intermediate graph $\tilde{G}_T$, as shown in Fig. 3. Then, the next step is to construct the missing dependencies in $\tilde{G}_T$.

3.2. DCM: Dependency Construction Model

To construct the missing dependencies/edges in $\tilde{G}_T$, two constraints need to be considered:

  • Smoothness Constraint: The predicted dependency structure in $\hat{G}_T$ needs to be close to the dependency structure of the original graph $G_T$. The intuition behind this constraint is that the dependencies already learned in $G_T$ should remain intact in $\hat{G}_T$ as much as possible.

  • Consistency Constraint: The inconsistency between $\hat{G}_T$ and $G_S^{(2)}$ should be similar to the inconsistency between $G_T$ and $G_S^{(1)}$. Here, $G_S^{(1)}$ and $G_S^{(2)}$ are the sub-graphs of $G_S$ that have the same entity sets as $G_T$ and $\hat{G}_T$, respectively. This constraint guarantees that the target graph learned by our model keeps the original domain difference with the source graph.

Before we model the above two constraints, we first need a measure to evaluate the inconsistency between different domains. In this work, we propose a novel metric named the dynamic factor between two dependency graphs $G_1$ and $G_2$ (defined over the same entity set) from two different domains:

(9)  $\delta(G_1, G_2) = \dfrac{\sum_{i<j} |A_1(i,j) - A_2(i,j)|}{n(n-1)/2},$

where $n$ is the number of entities in $G_1$ (and $G_2$), $A_1$ and $A_2$ denote the adjacency matrices of $G_1$ and $G_2$, respectively, and $n(n-1)/2$ is the number of edges of a fully connected graph with $n$ entities (Bondy and Murty, 1976).
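A minimal sketch of Eq. 9, assuming binary adjacency matrices over a shared entity set:

```python
import numpy as np

def dynamic_factor(A1, A2):
    n = A1.shape[0]
    differing = np.abs(np.triu(A1, 1) - np.triu(A2, 1)).sum()  # edges that differ
    return differing / (n * (n - 1) / 2)  # normalize by complete-graph edge count
```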

Next, we introduce the Dependency Construction Model in detail.

3.2.1. Modeling Smoothness Constraint

We first model the smoothness constraint as follows:

(10)  $\mathcal{L}_1 = \big\| F F^{\top} - A_T \big\|_F^2 + \lambda \, \|F\|_F^2,$

where $F$ is the vector representation of the entities in $\tilde{G}_T$, $A_T$ is the adjacency matrix of the original graph $G_T$ (zero-padded to the entity set of $\tilde{G}_T$), and $\lambda \|F\|_F^2$ is the regularization term.

3.2.2. Modeling Consistency Constraint

We then model the consistency constraint as follows:

(11)  $\mathcal{L}_2 = \big( \delta(\hat{G}_T, G_S^{(2)}) - \delta(G_T, G_S^{(1)}) \big)^2,$

where $\delta(\cdot,\cdot)$ is the dynamic factor as defined before. Then, putting Eq. 9 and $\hat{A}_T = F F^{\top}$ into Eq. 11, we get:

(12)  $\mathcal{L}_2 = \Big( \dfrac{\sum_{i<j} |(F F^{\top})_{ij} - A_S^{(2)}(i,j)|}{n(n-1)/2} - \delta_0 \Big)^2,$

where $\delta_0 = \delta(G_T, G_S^{(1)})$ is a constant and $A_S^{(2)}$ is the adjacency matrix of $G_S^{(2)}$.

3.2.3. Unified Model

Having proposed the modeling approaches in Sections 3.2.1 and 3.2.2, we now put the two constraints together. The unified model for dependency construction is proposed as follows:

(13)  $\min_{F} \; \mathcal{L} = \big\| F F^{\top} - A_T \big\|_F^2 + \lambda \, \|F\|_F^2 + \beta \, \mathcal{L}_2,$

where $\mathcal{L}_2$ is the consistency term of Eq. 12. The first term of the model incorporates the Smoothness Constraint component, which keeps $\hat{G}_T$ close to the target domain knowledge existing in $G_T$; the second term is the regularization. The third term considers the Consistency Constraint: the inconsistency between $\hat{G}_T$ and $G_S^{(2)}$ should be similar to the inconsistency between $G_T$ and $G_S^{(1)}$.

$\lambda$ and $\beta$ are important parameters that capture the importance of each term, and we will discuss them in Section 3.3. To optimize the model in Eq. 13, we use the stochastic gradient descent method (Han et al., 2011). The derivative with respect to $F$ is given as:

(14)  $\dfrac{\partial \mathcal{L}}{\partial F} = 4 \big( F F^{\top} - A_T \big) F + 2 \lambda F + \beta \, \dfrac{\partial \mathcal{L}_2}{\partial F}.$
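A minimal sketch of one gradient-descent step under this reconstruction of Eqs. 10, 12, and 14; np.sign supplies a subgradient for the absolute-value term, and all names are illustrative:

```python
import numpy as np

def dcm_step(F, A_T, A_S2, delta0, lam, beta, lr=1e-3):
    n = F.shape[0]
    full = n * (n - 1) / 2                     # complete-graph edge count
    P = F @ F.T                                # predicted dependency strengths
    grad_smooth = 4.0 * (P - A_T) @ F + 2.0 * lam * F           # from Eq. 10
    D = np.triu(P - A_S2, 1)                   # upper-triangular differences
    d_hat = np.abs(D).sum() / full             # dynamic factor of the estimate
    M = np.sign(D) / full
    grad_cons = 2.0 * (d_hat - delta0) * (M + M.T) @ F          # from Eq. 12
    return F - lr * (grad_smooth + beta * grad_cons)
```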

3.3. Overall Algorithm

Input: $G_S$, $G_T$
Output: $\hat{G}_T$
1  Select a set of meta-paths $\mathcal{P} = \{P_1, \ldots, P_m\}$ from $G_S$;
2  Extract the $m$ meta-path networks $\{G_i\}$ from $G_S$;
3  Calculate all the similarity matrices $\{S_i\}$;
4  // The Entity Estimation Process
5  while not converged do
6      Calculate the combined similarity $S = \sum_{i} w_i S_i$ and its eigenpairs $Q$, $\Lambda$;
7      Update $U$ by Eq. 6;
8      Calculate $C = U U^{\top}$;
9      Update $w$ by the closed-form solution of Eq. 7;
10 end while
11 Construct $\tilde{G}_T$ based on the method introduced in Section 3.1.1;
12 // The Dependency Construction Process
13 while not converged do
14     Update $F$ using the gradient in Eq. 14;
15 end while
16 Construct $\hat{G}_T$ from $F$ based on the method introduced in Section 3.2.3;
Algorithm 1: ACRET: Knowledge Transfer based Algorithm for Dependency Graph Learning

The overall algorithm is summarized as Algorithm 1. In the algorithm, lines 5 to 11 implement the Entity Estimation Model, and lines 12 to 16 implement the Dependency Construction Model.

3.3.1. Setting Parameters

There are two parameters, $\lambda$ and $\beta$, in our model. For $\lambda$, as in (Sun and Han, 2012; Han et al., 2011), it is always assigned manually based on experiments and experience. For $\beta$, when a large number of entities are transferred to the target domain, a large $\beta$ can improve the transfer result, because more information needs to be added from the source domain. On the other hand, when only a small number of entities are transferred to the target domain, a larger $\beta$ will bias the result. Therefore, the value of $\beta$ depends on how many entities are transferred from the source domain to the target domain. In this sense, we can use the proportion of the transferred entities in $\tilde{G}_T$ to calculate $\beta$. Given the entity size of $G_T$ as $n_1$ and the entity size of $\tilde{G}_T$ as $n_2$, $\beta$ can be calculated as:

(15)  $\beta = \dfrac{n_2 - n_1}{n_2}.$
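A minimal sketch of Eq. 15; for example, if 300 of the 1,000 entities in $\tilde{G}_T$ were transferred from the source, then $\beta = 0.3$ (the counts are illustrative):

```python
def compute_beta(n1, n2):
    """Eq. 15: n1 = |V(G_T)|, n2 = |V(G~_T)| after entity transfer."""
    return (n2 - n1) / n2
```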

The experimental results in Section 4.6 demonstrate the effectiveness of the proposed parameter selection method.

(a) Varying graph size
(b) Varying dynamic factor
(c) Varying graph maturity
Figure 4. Performance on synthetic data.

3.3.2. Complexity Analysis

As shown in Algorithm 1, the time for learning our model is dominated by computing the objective functions and their corresponding gradients against feature vectors.

For the Entity Estimation Model, the time complexity of computing the embedding $U$ in Eq. 6 is bounded by $O(dn)$ per iteration, where $n$ is the number of entities in $G_S$ and $d$ is the dimension of the vector space of $U$. The time complexity of computing $w$ is also bounded by $O(dn)$. So, supposing the number of training iterations for EEM is $t_1$, the overall complexity of the EEM model is $O(t_1 d n)$. For the Dependency Construction Model, the time complexity of computing the gradients of $\mathcal{L}$ against $F$ is $O(t_2 d' n)$, where $t_2$ is the number of iterations and $d'$ is the dimensionality of the feature vectors. As shown in our experiments (see Section 4.5), $t_1$, $t_2$, $d$, and $d'$ are all small numbers, so we can regard them as a constant $C$; the overall complexity of our method is then $O(Cn)$, which is linear in the size of the entity set. This makes our algorithm practicable for large-scale datasets.

4. Experiments

In this section, we evaluate ACRET using synthetic data and real system surveillance data collected in enterprise networks.

4.1. Comparing Methods

We compare ACRET with the following methods:

NT: This method directly uses the original small target dependency graph without knowledge transfer. In other words, the estimated target dependency graph is $\hat{G}_T = G_T$.

DT: This method directly combines the source dependency graph and the original target dependency graph. In other words, the estimated target dependency graph is $\hat{G}_T = G_S \cup G_T$.

RW-DCM: This is a modified version of the ACRET method. Instead of using the proposed EEM model to perform entity estimation, this method uses the random walk to evaluate the correlations between entities and perform entity estimation. Random walk is a widely-used method for relevance search in a graph (Kang et al., 2012).

EEM-CMF: This is another modified version of the ACRET method. In this method, we replace DCM model with collective matrix factorization (Singh and Gordon, 2008) method. Collective matrix factorization has been applied for link prediction in multiple domains (Singh and Gordon, 2008).

4.2. Evaluation Metrics

Since the ACRET algorithm uses a hypothesis test for thresholding the selection of entities and dependencies, similar to (Luo et al., 2014b; Han et al., 2011), we use the F1-score to evaluate the accuracy of all the methods.

F1-score is the harmonic mean of precision and recall. In our experiment, the final F1-score is calculated by averaging the entity F1-score and dependency/edge F1-score.

To calculate the precision (recall) of both entities and edges, we compare the estimated entity (edge) set with the ground-truth entity (edge) set. Then, precision and recall can be calculated as follows:

$\text{precision} = \dfrac{N_c}{N_e}, \qquad \text{recall} = \dfrac{N_c}{N_g},$

where $N_c$ is the number of correctly estimated entities (edges), $N_e$ is the total number of estimated entities (edges), and $N_g$ is the number of ground-truth entities (edges).
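A minimal sketch of these metrics over entity (or edge) sets, assuming non-empty sets with at least one correct estimate:

```python
def f1_score(estimated, ground_truth):
    correct = len(set(estimated) & set(ground_truth))
    precision = correct / len(estimated)
    recall = correct / len(ground_truth)
    return 2 * precision * recall / (precision + recall)
```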

4.3. Synthetic Experiments

We first evaluate ACRET on synthetic graph datasets to have a more controlled setting for assessing algorithmic performance. We control three aspects of the synthetic data to stress test the performance of our ACRET method:

  • Graph size is defined as the number of entities in a dependency graph. Here, we use $n_S$ to denote the source domain graph size and $n_T$ to denote the target one.

  • Dynamic factor, denoted as $\delta$, has the same definition as in Section 3.2.

  • Graph maturity score, denoted as M, is defined as the percentage of entities/edges of the ground-truth graph $\hat{G}_T$ that are used for constructing the original small graph $G_T$. Here, the graph maturity score is used to simulate the period of learning time needed for $G_T$ to reach maturity in the real system.

Then, given $n_S$, $n_T$, $\delta$, and M, we generate the synthetic data as follows: We first randomly generate an undirected graph as the source dependency graph $G_S$ based on the value of $n_S$ (West, 2001). Then, we randomly assign three different labels to the entities. Due to space limitations, we only show the results with three labels, but similar results have been achieved in graphs with more than three labels. We further construct the ground-truth target dependency graph $\hat{G}_T$ by randomly adding/deleting a $\delta$ fraction of the edges and deleting $n_S - n_T$ entities from $G_S$. Finally, we randomly select an M fraction of the entities/edges from $\hat{G}_T$ to form $G_T$.

4.3.1. How Does ACRET's Performance Scale with Graph Size?

We first explore how ACRET's performance changes with the graph sizes $n_S$ and $n_T$. Here, we fix the maturity score M, the dynamic factor $\delta$, and the target domain dependency graph size $n_T$, and then gradually increase the source graph size $n_S$. From Fig. 4(a), we observe that with the increase of the size difference $n_S - n_T$, the performances of DT and RW-DCM get worse. This is due to the poor ability of DT and RW-DCM to extract useful knowledge from the source domain. In contrast, the performance of ACRET and EEM-CMF increases with the size difference. This demonstrates the great capability of the EEM model for entity knowledge extraction.

4.3.2. How Does ACRET's Performance Scale with Domain Dynamic Factor?

We now vary the dynamic factor $\delta$ to understand its impact on ACRET's performance. Here, the graph maturity score M and the two domain sizes $n_S$ and $n_T$ are fixed. Fig. 4(b) shows that the performances of all the methods go down with the increase of the dynamic factor. This is expected, because transferring the dependency graph from a very different domain will not work well. On the other hand, the performances of ACRET, RW-DCM, and EEM-CMF only decrease slightly with the increase of the dynamic factor. Since RW-DCM and EEM-CMF are variants of the ACRET method, this demonstrates that the two sub-models of the ACRET method are both robust to large dynamic factors.

4.3.3. How Does ACRET's Performance Scale with Graph Maturity?

Third, we explore how the graph maturity score M impacts the performance of ACRET. Here, the dynamic factor $\delta$ and the graph sizes $n_S$ and $n_T$ are fixed. Fig. 4(c) shows that with the increase of M, the performances of all the methods get better. The reason is straightforward: as the maturity score increases, the challenge of domain difference becomes smaller for all the methods. In addition, ACRET and its variants RW-DCM and EEM-CMF perform much better than DT and NT. This demonstrates the great ability of the sub-models of ACRET for knowledge transfer. Furthermore, ACRET still achieves the best performance.

4.4. Real-World Experiments

(a) Windows dataset
(b) Linux dataset
Figure 5. Performance on real-world data.

Two real-world system monitoring datasets are used in this experiment. The data is collected from an enterprise network system composed of Linux machines and Windows machines from two departments, over a time span of 14 consecutive days. In both datasets, we collect two types of system events: (1) communications between processes, and (2) system activity of processes sending or receiving Internet connections to/from other machines at destination ports. Three different types of system entities are considered: (1) processes, (2) Unix domain sockets, and (3) Internet sockets. The Windows dataset (120 million events) is roughly an order of magnitude larger than the Linux dataset (10 million events). Both the Windows and Linux datasets are split into a source domain and a target domain according to the department name. The detailed statistics of the two datasets are shown in Table 1.

Data                       Win            Linux
# System Events            120 Million    10 Million
# Source Domain Machines   62             24
# Target Domain Machines   61             23
Time Span                  14 days        14 days

Table 1. The statistics of the real datasets.

In this experiment, we construct one target domain dependency graph per day by increasing the learning time daily. The final graph is the one learned for 14 days. From Fig. 5, we observe that for both the Windows and Linux datasets, with the increase of the training time, the performances of all the algorithms get better. On the other hand, compared with all the other methods, ACRET achieves the best performance on both the Windows and Linux datasets. In addition, our proposed ACRET algorithm can make the dependency graph deployable in less than four days, instead of two weeks or longer by directly learning on the target domain.

(a) EEM on synthetic data
(b) DCM on synthetic data
(c) EEM on Windows data
(d) DCM on Windows data
(e) EEM on Linux data
(f) DCM on Linux data
Figure 6. Performance on convergence.

4.5. Convergence Analysis

As described in Section 3.3.2, the performance bottleneck of the ACRET model is the learning process of the two sub-models: EEM (Entity Estimation Model) and DCM (Dependency Construction Model). In this section, we report the convergence speed of our approach.

We use both synthetic and real-world data to validate the model convergence speed. For the synthetic data, we choose one configuration with fixed dynamic factor $\delta$, dependency graph sizes $n_S$ and $n_T$, and graph maturity M. For the two real-world datasets, we fix the learning time of the target dependency graph.

From Fig. 6, we can see that on all three datasets, ACRET converges very fast (i.e., in fewer than 10 iterations). This makes our model applicable to real-world large-scale systems.

4.6. Parameter Study

(a) Synthetic dataset
(b) Windows dataset
(c) Linux dataset
Figure 7. Sensitivity analysis on parameter $\beta$.

In this section, we study the impact of the parameter $\beta$ in Eq. 13. We use the same datasets as in Section 4.5. As shown in Fig. 7, when the value of $\beta$ is too small or too large, the results are not good, because $\beta$ controls the leverage between the source domain information and the target domain information, and an extreme value of $\beta$ (too large or too small) will bias the result. On the other hand, Fig. 7 shows that the best results appear just around the values of $\beta$ calculated by Eq. 15 for the synthetic, Windows, and Linux datasets. This demonstrates that our proposed method for setting the value of $\beta$ is very effective, and it successfully addresses the parameter pre-assignment issue.

4.7. Case Study on Intrusion Detection

As aforementioned, a dependency graph is essential to many forensic analysis applications like root cause diagnosis and risk analysis. In this section, we evaluate ACRET's performance in a real commercial enterprise security system (see Fig. 2) for intrusion detection.

In this case, the dependency graph, which represents the normal profile of the enterprise system, is the core analysis model of the intrusion detection engine. It is built from the normal system process event streams (see Section 4.4 for the data description) during the training period. The same security system has been deployed in two companies: one Japanese electric company and one US IT company. We obtain one dependency graph from the IT company after 30 days' training, and two dependency graphs from the electric company after 3 and 30 days' training, respectively. ACRET is applied to leverage the well-trained dependency graph from the IT company to complete the 3 days' immature graph from the electric company.

During the one-day testing period, we tried different types of attacks (Jones and Sielken, 2000), including the Snowden attack, APT attack, botnet attack, sniffer attack, etc., which resulted in the ground-truth alerts. All other alerts reported during the testing period are considered as false positives.

Table 2 shows the intrusion detection results in the electric company using the dependency graphs generated by the different transfer learning methods, as well as the graph obtained by 30 days' training in the electric company. From the results, we can clearly see that ACRET outperforms all the other transfer learning methods by at least 0.18 in precision and 0.13 in recall. On the other hand, the performance of the dependency graph (3 days' model) accelerated by ACRET is very close to that of the ground truth model (30 days' model). This means that, by using ACRET, we can achieve similar performance in one-tenth of the training time, which is of great significance for mission critical environments.

Method                 Precision    Recall
NT                     0.01         0.10
DT                     0.15         0.30
RW-DCM                 0.38         0.57
EEM-CMF                0.42         0.60
ACRET                  0.60         0.73
Real 30 days' model    0.58         0.76

Table 2. Intrusion detection performance.

5. Related Work

5.1. Transfer Learning

Transfer learning has been widely studied in recent years (Cao et al., 2010; Pan and Yang, 2010). Most of the traditional transfer learning methods focus on numerical data (Dai et al., 2007; Sun et al., 2015; Chattopadhyay et al., 2013). When it comes to graph (network) structured data, there is less existing work. In (Fang et al., 2015), the authors presented TrGraph, a novel transfer learning framework for network node classification. TrGraph leverages information from the auxiliary source domain to help the classification process of the target domain. In one of their earlier works, a similar approach was proposed (Fang et al., 2013) to discover common latent structure features as useful knowledge to facilitate collective classification in the target network. In (He et al., 2009), the authors proposed a framework that propagates the label information from the source domain to the target domain via an example-feature-example tripartite graph. Transfer learning has also been applied to the deep neural network structure. In (Chen et al., 2016), the authors introduced Net2Net, a technique for rapidly transferring the information stored in one neural net into another. Net2Net utilizes function-preserving transformations to transfer knowledge between neural networks. Different from existing methods, we aim to expedite the dependency graph learning process through knowledge transfer.

5.2. Link Prediction and Relevance Search

Graph link prediction is a well-studied research topic (Liben-Nowell and Kleinberg, 2007; Hofman et al., 2017). In (Ye et al., 2013), Ye et al. presented a transfer learning algorithm to address the edge sign prediction problem in signed social networks. Because edge instances are not associated with a pre-defined feature vector, this work was proposed to learn the common latent topological features shared by the target and source networks, and then adopt an AdaBoost-like transfer learning algorithm with instance weighting to train a classifier. Collective matrix factorization (Singh and Gordon, 2008) is another popular technique that can be applied to detect missing links by combining the source domain and target domain graphs. However, none of the existing link prediction methods can deal with the dynamics between the source domain and the target domain introduced in our problem.

Finding relevant nodes or similarity search in graphs is also related to our work. Many different similarity metrics have been proposed, such as the Jaccard coefficient, cosine similarity, the Pearson correlation coefficient (Bondy and Murty, 1976), and random walks (Kang et al., 2012; Sun et al., 2005). However, none of these similarity measures consider the multiple relations that exist in the data. Recent advances in heterogeneous information networks (Sun and Han, 2012) have offered several similarity measures for heterogeneous relations, such as meta-path and relation path (Luo et al., 2014c, a). However, these methods cannot deal with multi-domain knowledge.

6. Conclusion

In this paper, we investigate the problem of transfer learning on dependency graphs. Different from traditional methods that mainly focus on numerical data, we propose ACRET, a two-step approach for accelerating dependency graph learning from heterogeneous categorical event streams. By leveraging entity embedding and constrained optimization techniques, ACRET can effectively extract useful knowledge (e.g., entities and dependency relations) from the source domain, and transfer it to the target dependency graph. ACRET can also adaptively learn the differences between the two domains, and construct the target dependency graph accordingly. We evaluate the proposed algorithm using extensive experiments. The experimental results convince us of the effectiveness and efficiency of our approach. We also apply ACRET to a real enterprise security system for intrusion detection. Our method is able to achieve superior detection performance at least 20 days earlier, with more than 70% accuracy.

References

  • Aggarwal (2007) Charu C Aggarwal. 2007. Data streams: models and algorithms. Vol. 31. Springer Science & Business Media.
  • Aggarwal et al. (2003) Charu C Aggarwal, Jiawei Han, Jianyong Wang, and Philip S Yu. 2003. A framework for clustering evolving data streams. In Proceedings of the 29th International Conference on Very Large Data Bases-Volume 29. VLDB Endowment, 81–92.
  • Bondy and Murty (1976) John Adrian Bondy and Uppaluri Siva Ramachandra Murty. 1976. Graph theory with applications. Vol. 290. New York: Elsevier.
  • Cao et al. (2010) Bin Cao, Nathan N Liu, and Qiang Yang. 2010. Transfer learning for collective link prediction in multiple heterogenous domains. In Proceedings of the 27th International Conference on Machine Learning. 159–166.
  • Chattopadhyay et al. (2013) Rita Chattopadhyay, Wei Fan, Ian Davidson, Sethuraman Panchanathan, and Jieping Ye. 2013. Joint transfer and batch-mode active learning. In International Conference on Machine Learning. 253–261.
  • Chen et al. (2016) Tianqi Chen, Ian Goodfellow, and Jonathon Shlens. 2016. Net2net: Accelerating learning via knowledge transfer. In Proceedings of the 4th International Conference on Learning Representations.
  • Chen and Sun (2017) Ting Chen and Yizhou Sun. 2017. Task-Guided and Path-Augmented Heterogeneous Network Embedding for Author Identification. In Proceedings of the Tenth ACM International Conference on Web Search and Data Mining. ACM, 295–304.
  • Dai et al. (2007) Wenyuan Dai, Qiang Yang, Gui-Rong Xue, and Yong Yu. 2007. Boosting for transfer learning. In Proceedings of the 24th International Conference on Machine Learning. ACM, 193–200.
  • Fang et al. (2013) Meng Fang, Jie Yin, and Xingquan Zhu. 2013. Transfer learning across networks for collective classification. In 2013 IEEE 13th International Conference on Data Mining. IEEE, 161–170.
  • Fang et al. (2015) Meng Fang, Jie Yin, Xingquan Zhu, and Chengqi Zhang. 2015. TrGraph: Cross-network transfer learning via common signature subgraphs. IEEE Transactions on Knowledge and Data Engineering 27, 9 (2015), 2536–2549.
  • Han et al. (2011) Jiawei Han, Jian Pei, and Micheline Kamber. 2011. Data mining: concepts and techniques. Elsevier.
  • He et al. (2009) Jingrui He, Yan Liu, and Richard Lawrence. 2009. Graph-based transfer learning. In Proceedings of the 18th ACM Conference on Information and Knowledge Management. ACM, 937–946.
  • He and Zhang (2011) Miao He and Junshan Zhang. 2011. A dependency graph approach for fault detection and localization towards secure smart grid. IEEE Transactions on Smart Grid 2, 2 (2011), 342–351.
  • Hofman et al. (2017) Jake M Hofman, Amit Sharma, and Duncan J Watts. 2017. Prediction and explanation in social systems. Science 355, 6324 (2017), 486–488.
  • Jones and Sielken (2000) Anita K Jones and Robert S Sielken. 2000. Computer system intrusion detection: A survey. Computer Science Technical Report (2000), 1–25.
  • Kang et al. (2012) U Kang, Hanghang Tong, and Jimeng Sun. 2012. Fast random walk graph kernel. In Proceedings of the 2012 SIAM International Conference on Data Mining. SIAM, 828–838.
  • Kawale et al. (2013) Jaya Kawale, Stefan Liess, Arjun Kumar, Michael Steinbach, Peter Snyder, Vipin Kumar, Auroop R Ganguly, Nagiza F Samatova, and Fredrick Semazzi. 2013. A graph-based approach to find teleconnections in climate data. Statistical Analysis and Data Mining: The ASA Data Science Journal 6, 3 (2013), 158–179.
  • Kim and Leskovec (2011) Myunghwan Kim and Jure Leskovec. 2011. Modeling social networks with node attributes using the multiplicative attribute graph model. In Proceedings of the Twenty-Seventh Conference on Uncertainty in Artificial Intelligence. AUAI Press, 400–409.
  • King et al. (2005) Samuel T King, Z Morley Mao, Dominic G Lucchetti, and Peter M Chen. 2005. Enriching intrusion alerts through multi-host causality. In Proceedings of the 2005 Network and Distributed System Security Symposium (NDSS).
  • Krutz and Vines (2010) Ronald L Krutz and Russell Dean Vines. 2010. Cloud security: A comprehensive guide to secure cloud computing. Wiley Publishing.
  • Liben-Nowell and Kleinberg (2007) David Liben-Nowell and Jon Kleinberg. 2007. The link-prediction problem for social networks. Journal of the Association for Information Science and Technology 58, 7 (2007), 1019–1031.
  • Luo et al. (2014a) Chen Luo, Renchu Guan, Zhe Wang, and Chenghua Lin. 2014a. HetPathMine: A Novel Transductive Classification Algorithm on Heterogeneous Information Networks. In Proceedings of the 36th European Conference on Information Retrieval. Springer, 210–221.
  • Luo et al. (2014b) Chen Luo, Jian-Guang Lou, Qingwei Lin, Qiang Fu, Rui Ding, Dongmei Zhang, and Zhe Wang. 2014b. Correlating events with time series for incident diagnosis. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 1583–1592.
  • Luo et al. (2014c) Chen Luo, Wei Pang, Zhe Wang, and Chenghua Lin. 2014c. Hete-cf: Social-based collaborative filtering recommendation using heterogeneous relations. In 2014 IEEE International Conference on Data Mining. IEEE, 917–922.
  • Manzoor et al. (2016) Emaad Manzoor, Sadegh M Milajerdi, and Leman Akoglu. 2016. Fast Memory-efficient Anomaly Detection in Streaming Heterogeneous Graphs. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 1035–1044.
  • Myers et al. (2014) Seth A Myers, Aneesh Sharma, Pankaj Gupta, and Jimmy Lin. 2014. Information network or social network? the structure of the twitter follow graph. In Proceedings of the 23rd International Conference on World Wide Web. ACM, 493–498.
  • Pan and Yang (2010) Sinno Jialin Pan and Qiang Yang. 2010. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering 22, 10 (2010), 1345–1359.
  • Singh and Gordon (2008) Ajit P Singh and Geoffrey J Gordon. 2008. Relational learning via collective matrix factorization. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 650–658.
  • Sun et al. (2005) Jimeng Sun, Huiming Qu, Deepayan Chakrabarti, and Christos Faloutsos. 2005. Relevance search and anomaly detection in bipartite graphs. ACM SIGKDD Explorations Newsletter 7, 2 (2005), 48–55.
  • Sun et al. (2015) Qian Sun, Mohammad Amin, Baoshi Yan, Craig Martell, Vita Markman, Anmol Bhasin, and Jieping Ye. 2015. Transfer learning for bilingual content classification. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2147–2156.
  • Sun and Han (2012) Yizhou Sun and Jiawei Han. 2012. Mining heterogeneous information networks: principles and methodologies. Synthesis Lectures on Data Mining and Knowledge Discovery 3, 2 (2012), 1–159.
  • West (2001) Douglas Brent West. 2001. Introduction to graph theory. Vol. 2. Prentice hall Upper Saddle River.
  • Xu et al. (2016) Zhang Xu, Zhenyu Wu, Zhichun Li, Kangkook Jee, Junghwan Rhee, Xusheng Xiao, Fengyuan Xu, Haining Wang, and Guofei Jiang. 2016. High fidelity data reduction for big data security dependency analyses. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security. ACM, 504–516.
  • Ye et al. (2013) Jihang Ye, Hong Cheng, Zhe Zhu, and Minghua Chen. 2013. Predicting positive and negative links in signed social networks by transfer learning. In Proceedings of the 22nd International Conference on World Wide Web. ACM, 1477–1488.
  • Zhang et al. (2015) Kai Zhang, Qiaojun Wang, Zhengzhang Chen, Ivan Marsic, Vipin Kumar, Guofei Jiang, and Jie Zhang. 2015. From categorical to numerical: Multiple transitive distance learning and embedding. In Proceedings of the 2015 SIAM International Conference on Data Mining. SIAM, 46–54.