Applications of Graph Integration to Function Comparison and Malware Classification

Michael Slawinski Cylance Inc.
Irvine, CA
mslawinski@cylance.com
   Andy Wortman Cylance Inc.
Irvine, CA
awortman@cylance.com
Abstract

We classify .NET files as either benign or malicious by examining directed graphs derived from the set of functions comprising the given file. Each graph is viewed probabilistically as a Markov chain where each node represents a code block of the corresponding function, and by computing the PageRank vector (Perron vector with transport), a probability measure can be defined over the nodes of the given graph. Each graph is vectorized by computing Lebesgue antiderivatives of hand-engineered functions defined on the vertex set of the given graph against the PageRank measure. Files are subsequently vectorized by aggregating the set of vectors corresponding to the set of graphs resulting from decompiling the given file. The result is a fast, intuitive, and easy-to-compute glass-box vectorization scheme, which can be leveraged for training a standalone classifier or to augment an existing feature space. We refer to this vectorization technique as PageRank Measure Integration Vectorization (PMIV). We demonstrate the efficacy of PMIV by training a vanilla random forest on 2.5 million samples of decompiled .NET, evenly split between benign and malicious, from our in-house corpus and compare this model to a baseline model which leverages a text-only feature space. The median time needed for decompilation and scoring was 24 ms. Code is available at https://github.com/gtownrocks/grafuple

I Introduction

We classify .NET files as either malicious or benign by understanding the structural and textual differences between various types of labeled directed graphs resulting from decompilation. The graphs under consideration are the function call graph and the set of shortsighted data flow graphs (SDFG) derived from traversing the abstract syntax trees, one for each function in the given file.

Each SDFG is viewed as a Markov chain and is vectorized by considering both topological features of the unlabeled graphs and the textual features of the nodes. Under this paradigm, a heuristic notion of average file behavior can be defined by computing expected values of specially-chosen functions defined on the vertex sets of the given graphs against the PageRank measure.

For each graph $G$, we construct a filtration of subsets of the vertex set $V(G)$ defined by specifying a sequence of upper bounds on the set of PageRank values. The resulting sequence of expected values corresponds to a Lebesgue antiderivative $F$ of the given function $f$. As there are typically many SDFGs per file, we vectorize by computing, for each pair $(f, t)$ of a function and an upper bound, percentiles of $\{F_i(t)\}_i$, where $i$ indexes the set of SDFGs present in the file.

Model interpretability is a consequence of our approach by construction, because each hand-designed function $f$, and therefore its antiderivative $F$, is interpretable.

This vectorization technique and its application to malware classification are the main contributions of this paper.

I-A Motivation

Static analysis classifiers trained on high dimensional data can suffer from susceptibility to adversarial examples (See [1] or [2]) due to a large proportion of the feature space consisting of execution and semantics agnostic file features. These include embedded unreferenced strings, certain header information, file size, etc. See [3], [4], and [5] for in-depth discussions.

Gross, K., et al. [6] show that even Deep Neural Networks trained to distinguish malicious files from benign files are vulnerable to adversarial attacks. See [7] for a more recent example.

Ironically, many of these features have a high area under the curve (AUC) because most malware is copied and pasted from existing code, but a model trained on such features can easily be tricked by perturbing them. This is possible because altering these features has no effect on the runtime behavior of the file.

Graph-based feature engineering approaches address this shortcoming by considering features extracted from the semantic structure of the file.

I-B Related Work

The signature-based approach to malware detection historically has been characterized by hand picking features for the sake of either a rule-based approach or a regression approach as in [8]. Both signatures (hand-written static rules) and regression models fit into this category. This approach is effective on known samples, but is prone to overfitting. This issue was the main motivator for moving towards modeling approaches which leverage semantic structure.

The leveraging of control-flow-based vectorization of executable files for the sake of both supervised and unsupervised learning is well established in the literature, and has proven to be a technique robust to overfitting and robust to adversarial examples. See [9] for one of the first such contributions. Other early approaches involved differentiating files based on sequences of API calls. In [10] the author builds a model based on n-grams of API calls. See [11] or [12] for similar approaches.

In addition to the sequential structure of function calls, one can also take into account the combinatorial graph structure of the calling relationships. Anderson, B., et al. [13] construct graph similarity kernels by viewing control flow graphs as Markov chains. They construct a malicious/benign classifier with these kernels, which showed significant improvement over a model built only on function call n-grams.

Chae et al. [14] successfully leveraged the information present in the combinatorial structure of the control flow graph to compute the sequences and frequency of APIs by considering random walk kernels similar to those constructed above. See [15] for a similar approach.

We restrict our attention in this work to decompiled .NET, but the graph-based approach has been leveraged successfully in the similar realm of disassembly. Indeed, [16] discusses the use of graph similarity to compare disassembled files, which results in a kind of file-level isomorphism useful for finding trojans. Similar kernel methods applied to graphs arising via disassembly have been shown to be effective at detecting self-mutating malware by measuring the similarity between observed control flow graphs and known control flow graphs associated with malware. See [17] for details.

Deep learning has also been used to extend similarity detection by constructing neural networks built on top of features derived from graph embeddings in order to measure cross-platform binary code similarity. In [18] the neural network learns a graph embedding for the sake of measuring control flow graph similarity. See [19] for a similar approach using graph convolutional networks. Graph embedding for the sake of measuring control flow similarity has also been applied to bug search and plagiarism detection. See [20], [21], or [22] for further details and [23] for a mathematical exposition of graph embedding.

Reinforcement learning has also been used in the security space to train models robust to adversarial examples created via gradient-based attacks on differentiable models, or genetic algorithm-based attacks on non-differentiable models. See [24] for further discussion and the authors’ game-theoretic reinforcement approach to adversarial training.

Pure character-level sequence approaches (LSTM/GRU), which do not necessarily leverage the combinatorial structure of function call or control flow graphs have also been explored, as in [25]. The authors first train a language model in order to learn a feature representation of the file and then train a classifier on this latent representation. See [26] for a more basic RNN approach.

Our approach combines a graph-based feature representation with the interpretability of a logistic regression, while avoiding the training and architectural complexity common to state-of-the-art graph convolutional neural networks.

II Data

The dataset used in this work was curated from our internal corpus and consisted of 25 million samples of .NET with 2.5 million remaining post deduplication, evenly split between benign and malicious.

The deduplication process involved decompiling, hashing each resulting function, sorting and concatenating these hashes, and then hashing the result.
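The deduplication steps above can be sketched in a few lines of Python. This is a minimal sketch, not the pipeline used to build the corpus: the choice of SHA-256, the representation of each decompiled function as a string, and the helper names `function_digest`, `file_fingerprint`, and `deduplicate` are all assumptions made for illustration.

```python
import hashlib


def function_digest(function_source: str) -> str:
    """Hash one decompiled function (assumed here to be a string)."""
    return hashlib.sha256(function_source.encode("utf-8")).hexdigest()


def file_fingerprint(decompiled_functions) -> str:
    """Hash each function, sort and concatenate the hashes, hash the result."""
    digests = sorted(function_digest(src) for src in decompiled_functions)
    return hashlib.sha256("".join(digests).encode("ascii")).hexdigest()


def deduplicate(files: dict) -> list:
    """Keep the first file name seen for every distinct fingerprint."""
    first_seen = {}
    for name, functions in files.items():
        first_seen.setdefault(file_fingerprint(functions), name)
    return sorted(first_seen.values())


# Hypothetical corpus: "x" and "y" contain the same functions in a different order.
corpus = {"x": ["a", "b"], "y": ["b", "a"], "z": ["c"]}
unique_files = deduplicate(corpus)
```

Sorting the per-function hashes before concatenation makes the fingerprint independent of the order in which the decompiler emits functions, which is why `x` and `y` collapse to a single entry.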

Labels were assigned via the rule: label(file) = malicious iff any$_v$ {label$_v$(file) = malicious}, where $v$ indexes the set of vendors participating on VirusTotal [27] at the time of labeling.

III Decompilation of .NET

Decompilation is a program transformation by which compiled code is transformed into a high-level human-readable form, and is used in this work to study the control flow of the files in our .NET corpus. Program control flow is understood by studying the structure of two types of control flow graphs resulting from decompilation. The function call graph describes the calling structure of the functions (subroutines) constituting the overall program. The control flow of each constituent function is understood by constructing a graph from the set of possible traversals of the associated abstract syntax tree.

III-A Abstract Syntax Trees

An abstract syntax tree (AST) is a tree representation of the syntactic structure of the given routine in terms of operators and operands.

For example, consider the expression

$5 * 3 + (4 + 2 \% 2 * 8)$

consisting of mathematical operators and numeric operands. We may express the syntactic structure of this expression with the binary tree:

    +
    ├── *
    │   ├── 5
    │   └── 3
    └── +
        ├── 4
        └── *
            ├── %
            │   ├── 2
            │   └── 2
            └── 8

The root and the subsequent internal nodes represent operators and the leaves represent operands. The distilled semantic structure in this case is the familiar order of operations for arithmetic expressions.
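To make the operator/operand reading concrete, the following Python sketch encodes the tree from this example, with root `+`, left subtree `* [5] [3]`, and right subtree `+ [4] [* [% [2] [2]] [8]]`, as nested tuples and evaluates it bottom-up. The tuple encoding and the helper names `evaluate` and `to_infix` are illustrative assumptions, not part of the decompiler.

```python
import operator

# Binary operators appearing in the example tree.
OPS = {"+": operator.add, "*": operator.mul, "%": operator.mod}

# Internal nodes are (operator, left, right); leaves are plain integers.
TREE = ("+", ("*", 5, 3), ("+", 4, ("*", ("%", 2, 2), 8)))


def evaluate(node):
    """Evaluate the tree bottom-up: leaves are operands, internal nodes operators."""
    if isinstance(node, int):
        return node
    op, left, right = node
    return OPS[op](evaluate(left), evaluate(right))


def to_infix(node):
    """Render the tree as a fully parenthesized infix expression."""
    if isinstance(node, int):
        return str(node)
    op, left, right = node
    return f"({to_infix(left)} {op} {to_infix(right)})"


result = evaluate(TREE)      # (5 * 3) + (4 + ((2 % 2) * 8)) = 15 + 4 = 19
expression = to_infix(TREE)
```

The parenthesization produced by `to_infix` makes the order of operations encoded by the tree explicit.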

More generally, each node of an AST represents some construct occurring in the source code, and a directed edge connects two nodes if the code representing the target node conditionally executes immediately after the code represented by the source node. These trees facilitate the distillation of the semantics of the program.



III-B Abstract Syntax Trees for the CLR

Each node of a given AST is labeled by an operation performed on the Common Language Runtime (CLR) virtual machine. A subset of these operations is listed as follows (see Appendix B for the complete list and the details thereof):

  • AddressOf

  • Assignment

  • BinaryOp

  • break

  • Call

  • ClassRef

  • CLRArray

  • continue

  • CtorCall

  • Dereference

  • Entrypoint

  • FieldReference

  • FnPtrObj

  • LocalVar


Example III.1.

An AST snippet from a benign .NET sample.


    {…
      "30": {
        "type": "LocalVar",
        "name": "variable7"
      },
      "28": {
        "type": "LocalVar",
        "name": "locals[0]"
      },
      "29": {
        "type": "CLRVariableWithInitializer",
        "varType": "System.Web.UI",
        "name": "variable8",
        "value": "28"
      },
      "64": {
        "fnName": "AddParsedSubObject",
        "type": "Call",
        "target": "62",
        "arguments": [
          "63"
        ]
    …}

As shown in the example, the metadata available at each node is a function of the CLR operation being performed at that node.



III-C Traversals of Abstract Syntax Trees

We consider all possible execution paths through a given abstract syntax tree and merge these paths together to form a shortsighted data flow graph (SDFG). Consider the following code snippet:

Example III.2.

Small code block resulting in a nonlinear SDFG.


    if (foo()) {
        bar();
    }
    else {
        baz();
    }
    bla();

The two possible execution paths through this code snippet are given by foo → bar → bla and foo → baz → bla. See Figure 1 for the resulting SDFG.



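[Figure 1: directed graph on the nodes foo, bar, baz, and bla, with edges foo → bar (condition evaluates to true), foo → baz (condition evaluates to false), bar → bla, and baz → bla.]

Fig. 1: SDFG: Merging the two possible evaluation paths through this block yields the SDFG.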
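Merging execution paths into a single graph can be sketched as follows. The two paths through the `foo`/`bar`/`baz`/`bla` snippet are written out by hand here; in practice they would come from traversing the abstract syntax tree. The adjacency-set representation and the helper name `merge_paths` are assumptions made for illustration.

```python
def merge_paths(paths):
    """Merge execution paths (lists of node names) into one directed graph.

    The graph is returned as a dict mapping each node to the set of its successors.
    """
    graph = {}
    for path in paths:
        for node in path:
            graph.setdefault(node, set())
        for source, target in zip(path, path[1:]):
            graph[source].add(target)
    return graph


# The two execution paths through the if/else snippet, one per branch.
paths = [["foo", "bar", "bla"], ["foo", "baz", "bla"]]
sdfg = merge_paths(paths)
```

Shared nodes such as `foo` and `bla` appear once in the merged graph, so the branching structure of the code shows up as a node with more than one successor.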

III-D Function Call Graphs

The function call graph represents the calling relationships between the subroutines of the file. The function call graphs in our corpus tended to be less linear than the SDFGs, and contained features which improved accuracy. Notably these features were purely graph-based and were not derived via the imposition of a Markov structure, PageRank computation, or subsequent Lebesgue integration.



IV The PageRank Vector

The PageRank [28] vector describes the long-run diffusion of random walks through a strongly connected directed graph. Indeed, the probability measure over the nodes obtained via repeated multiplication of an initial distribution vector over the nodes by the associated probability transition matrix converges to the PageRank vector, and is in practice a very efficient method for computing it to a close approximation.

Intuitively, the PageRank vector is obtained by considering many random walks through the given graph and for each node computing the number of times we observed the walker at the given node as a proportion of all observations. See [29] for more details.

Viewing the graph $G$ in question as a Markov chain, we order the vertices $v_1, \dots, v_n$ of the graph and define the probability transition matrix $P$ by

$$P_{ij} = \begin{cases} 1/|E_i| & \text{if } (v_i, v_j) \in E_i, \\ 0 & \text{otherwise,} \end{cases} \tag{1}$$

where $E_i$ is the set of edges emanating from vertex $v_i$ and $n = |V(G)|$.

In order to apply the Perron-Frobenius theorem, the probability transition matrix $P$ constructed via row-normalizing the adjacency matrix $A$, where $A_{ij} = 1$ if there is an edge from node $v_i$ to node $v_j$ and 0 otherwise, must be irreducible. To this end, we add a smoothing term to obtain the matrix

$$M = (1 - p) P + p B, \tag{2}$$

where $B$ is the $n \times n$ matrix with every entry equal to $1/n$.

The addition of the term $pB$ ensures the irreducibility of $M$ as required by the Perron-Frobenius theorem, where $p$ is the probability of the Markov chain moving between any two vertices without traversing an edge and governs the extent to which the topology of the original graph is ignored. See Figure 2.

The resulting Markov chain is defined by

$$\pi^{(k+1)} = \pi^{(k)} M,$$

where $\pi^{(k)}$ is the distribution over the vertices after $k$ steps and $p$ is set heuristically in this work. The sensitivity of the results to $p$ is left to a future paper.
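The repeated-multiplication scheme described in this section can be sketched in pure Python. The sketch assumes the smoothed matrix has the common form $M = (1 - p)P + pB$ with $B$ the uniform matrix. It also makes two assumptions that are not stated in the text: a default smoothing probability `p=0.15`, and treating a node with no outgoing edges as jumping uniformly to any node. The function names are hypothetical.

```python
def transition_matrix(adjacency):
    """Row-normalize a 0/1 adjacency matrix given as a list of lists."""
    n = len(adjacency)
    rows = []
    for row in adjacency:
        out_degree = sum(row)
        if out_degree == 0:
            # Assumption: a node without outgoing edges jumps uniformly.
            rows.append([1.0 / n] * n)
        else:
            rows.append([entry / out_degree for entry in row])
    return rows


def pagerank(adjacency, p=0.15, tol=1e-12, max_iter=1000):
    """Approximate the stationary row vector pi with pi = pi M by power iteration."""
    n = len(adjacency)
    P = transition_matrix(adjacency)
    # Smoothed matrix M = (1 - p) P + p B, where every entry of B equals 1/n.
    M = [[(1 - p) * P[i][j] + p / n for j in range(n)] for i in range(n)]
    pi = [1.0 / n] * n  # start from the uniform distribution
    for _ in range(max_iter):
        nxt = [sum(pi[i] * M[i][j] for i in range(n)) for j in range(n)]
        if sum(abs(a - b) for a, b in zip(nxt, pi)) < tol:
            return nxt
        pi = nxt
    return pi


# Nodes ordered foo, bar, baz, bla; edges foo->bar, foo->baz, bar->bla, baz->bla.
adjacency = [[0, 1, 1, 0], [0, 0, 0, 1], [0, 0, 0, 1], [0, 0, 0, 0]]
ranks = pagerank(adjacency)
```

On this small graph the node `bla`, which every path ends in, receives the largest share of the measure, while the two branch nodes `bar` and `baz` receive equal shares.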

Note IV.1.

One can view this concept in the context of a running program as the repeated calling of a particular function as represented by the SDFG, where a particular execution path is viewed as a random walk through the graph. See [29] for more details.

Theorem IV.2.

Perron-Frobenius. If $M$ is an irreducible stochastic matrix then $M$ has a unique (up to scaling) left eigenvector $\pi$ with eigenvalue 1.

The eigenvector $\pi$ can be normalized such that $\sum_i \pi_i = 1$, so $\pi$ defines a probability measure over the vertices of $G$, which we will write as $\mu_G$, or just $\mu$ if the reference graph is either clear from the context or irrelevant.



V Integration of Functions on Graphs

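Given two labeled graphs $G_1$, $G_2$ and a mapping $f: V(G_1) \sqcup V(G_2) \to \mathbb{R}$, where $f$ assigns a real number to each element $v$ of the disjoint union $V(G_1) \sqcup V(G_2)$ based on the label of $v$, the connectivity at $v$, or some other scheme, a pointwise comparison of $f|_{V(G_1)}$ and $f|_{V(G_2)}$ may not be possible. Consider for example the simple case of $|V(G_1)| \neq |V(G_2)|$.

We address this difficulty by defining a probability measure $\mu_G$ on $V(G)$ for each $G \in \mathcal{G}$, where $\mathcal{G}$ is a set of labeled directed graphs. Then for any subset $A \subseteq [0, 1]$ of PageRank values, we can directly compare the Lebesgue integrals of $f$ over the corresponding vertex sets, $\int_{\{v \in V(G_1) : \mu_{G_1}(v) \in A\}} f \, d\mu_{G_1}$ and $\int_{\{v \in V(G_2) : \mu_{G_2}(v) \in A\}} f \, d\mu_{G_2}$.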


Let $\pi_G$ be the PageRank vector given by the unique left eigenvector with eigenvalue 1 of the probability transition matrix of the directed graph $G$, viewed as a Markov chain. Each file under consideration contains multiple graphs, and we wish to find a way to not only compare these graphs, but understand the ensemble of graphs in the given file.

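Let $\mathcal{P} = \{0 = t_0 < t_1 < \dots < t_m = 1\}$ be a partition of $[0, 1]$ and let $G$ be a directed graph. Let $\mu_G$ be the probability measure on $V(G)$ given by the PageRank vector $\pi_G$. Consider a function $f: V(G) \to \mathbb{R}$. The function $F: \mathcal{P} \to \mathbb{R}$ is defined by

$$F(t_j) = \int_{A_j} f \, d\mu_G = \sum_{v \in A_j} f(v)\, \mu_G(v), \tag{3}$$

where $A_j = \{v \in V(G) : \mu_G(v) \le t_j\}$ and $t_j \in \mathcal{P}$. Mathematically, $F$ is the Lebesgue antiderivative of $f$ over $V(G)$ with measure given by $\mu_G$.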
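Because the vertex set is finite, the antiderivative reduces to partial sums of $f(v)$ weighted by the PageRank of $v$. The sketch below is one reading of the construction, assuming the filtration sets are $\{v : \pi(v) \le t\}$ for each upper bound $t$, as suggested by the description of the filtration in the introduction. The helper name `lebesgue_antiderivative` and the numbers in the usage example are hypothetical.

```python
def lebesgue_antiderivative(values, pagerank, thresholds):
    """Integrate f against the PageRank measure over a filtration of vertex sets.

    values[i]   -- f(v_i), the function evaluated at node v_i
    pagerank[i] -- pi(v_i), the PageRank of node v_i
    thresholds  -- ascending upper bounds on PageRank values
    Returns one partial sum per threshold t: sum of f(v) * pi(v) over {v : pi(v) <= t}.
    """
    return [
        sum(f_v * pi_v for f_v, pi_v in zip(values, pagerank) if pi_v <= t)
        for t in thresholds
    ]


# Hypothetical four-node graph: function values, PageRank vector, upper bounds.
values = [2.0, 0.0, 1.0, 3.0]
pagerank_vector = [0.1, 0.2, 0.3, 0.4]
thresholds = [0.0, 0.25, 0.5, 1.0]
antiderivative = lebesgue_antiderivative(values, pagerank_vector, thresholds)
```

The last entry, taken at the upper bound 1, is the full expected value of the function under the PageRank measure; earlier entries restrict the sum to low-PageRank nodes.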

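The above process of building a function on $\mathcal{P}$ from a graph $G$ and a rule $f$ which can be applied consistently to any element of $\mathcal{G}$ can be formulated as a mapping

$$I_f : \mathcal{G} \to \mathcal{M}(\mathcal{P}, \mathbb{R}), \qquad G \mapsto F, \tag{4}$$

where $\mathcal{M}(\mathcal{P}, \mathbb{R})$ is the set of functions from $\mathcal{P}$ to $\mathbb{R}$.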




VI Similarity Measure on Graph Space

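Let $\mathcal{G}_\Sigma$ be the set of directed graphs with vertices labeled from the alphabet $\Sigma$ and let $\mathcal{F} = \{f_1, \dots, f_k\}$ be a set of functions defined on the vertex sets of the graphs in $\mathcal{G}_\Sigma$. Define the vectorization map

$$\Psi : \mathcal{G}_\Sigma \to \mathbb{R}^k, \qquad \Psi(G) = \left(\mathbb{E}[f_1], \dots, \mathbb{E}[f_k]\right), \tag{5}$$

where the expected value is taken with respect to the PageRank measure as defined in the previous section.

We construct a similarity function via (5),

$$d(G_1, G_2) = \lVert \Psi(G_1) - \Psi(G_2) \rVert,$$

for $G_1, G_2 \in \mathcal{G}_\Sigma$ and $\lVert \cdot \rVert$ a fixed norm on $\mathbb{R}^k$.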


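Definition VI.1.

A metric on a set $X$ is a function

$$d : X \times X \to [0, \infty)$$

satisfying

(i) $d(x, y) = 0$ if and only if $x = y$,
(ii) $d(x, y) = d(y, x)$ for all $x, y \in X$,
(iii) $d(x, z) \le d(x, y) + d(y, z)$ for all $x, y, z \in X$.

Condition (ii) is satisfied since $d(G_1, G_2) = d(G_2, G_1)$ for all $G_1, G_2$ and (iii) is satisfied since $d(G_1, G_3) \le d(G_1, G_2) + d(G_2, G_3)$ for all $G_1, G_2, G_3$ by construction.

However, it is possible that $d(G_1, G_2) = 0$ for $G_1 \neq G_2$, meaning that while $d$ is effective as a measure of similarity of labeled directed graphs, it is not a metric on $\mathcal{G}_\Sigma$.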

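Indeed, one can choose graphs $G_1 \neq G_2$ and a set of functions $\mathcal{F}$ such that $\Psi(G_1) = \Psi(G_2)$, which implies $d(G_1, G_2) = 0$. The graphs $G_1$ and $G_2$ have the same topology and the same combinatorial structure, but the set of functions $\mathcal{F}$ is insufficient to distinguish $G_1$ from $G_2$.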



Additional conditions must be imposed on $\mathcal{G}_\Sigma$ and the functions defined thereon in order to guarantee the injectivity of $\Psi$, a necessary condition for $d$ to define a metric. We leave this analysis to a future paper.



VII Application of Lebesgue Integration on Graphs to SDFGs

The machinery developed in the previous sections lends itself to two immediate applications.

The first is the use of the vectorization map

(6)
(7)

applied to .NET files, constructed via decompilation followed by integration of selected functions on SDFGs as described in Equation (4), to i) construct an $n$-class classifier on a given corpus of labeled .NET files, and ii) cluster these files in $\mathbb{R}^N$ using any of the classic metrics defined on Euclidean space.

The second application is classification and clustering of .NET files within the metric space $(\mathcal{G}_\Sigma, d)$, described in Section VI. The remainder of this paper concerns the applications of the vectorization map (6).



VII-A Feature Hashing

Feature hashing allows for the vectorization of data which is both categorical in nature and is such that the full set of categories is unknown at the time of vectorization. We construct a hash map on strings by wrapping the hash function from the Python standard library as follows:

We take a log in order to bring the integer resulting from the hash function down to a more manageable size. This has no effect on the model, as random forests are invariant to monotone transformations of feature values.
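A wrapper of this kind might look like the sketch below. It is not the authors' listing: Python's built-in `hash` is salted per process for strings unless `PYTHONHASHSEED` is fixed, so this sketch swaps in `hashlib.md5` to keep feature values reproducible across runs. The function name `hash_feature`, the use of the first eight digest bytes, and the `1 +` offset inside the logarithm are all assumptions.

```python
import hashlib
import math


def hash_feature(text: str) -> float:
    """Map a string to a float by hashing it and taking a logarithm.

    Assumption: a stable digest (MD5) stands in for the built-in hash()
    so that the same string yields the same value in every process.
    """
    digest = hashlib.md5(text.encode("utf-8")).digest()
    value = int.from_bytes(digest[:8], "big")  # non-negative 64-bit integer
    return math.log(1 + value)                 # 1 + value avoids log(0)


feature = hash_feature("System.Web.UI")
```

Categories never seen at vectorization time still receive a value, which is the property that motivates hashing over a fixed vocabulary.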



VII-B Functions on SDFGs

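For the sake of clarity, we illustrate the typical form such a function takes with an example. Consider the SDFG snippet given in Example III.1. We define the function

$$\text{NumPass2Call} : V(G) \to \mathbb{R}$$

by

$$\text{NumPass2Call}(v) = \begin{cases} \text{the number of arguments passed to the call} & \text{if } v \text{ is a Call node}, \\ 0 & \text{otherwise.} \end{cases}$$

The complete set of functions

$$\mathcal{F} = \{f_1, \dots, f_k\}$$

leveraged in this work is listed in Appendix A.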
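A node-level function of this kind can be sketched against the JSON node format of Example III.1. The sketch assumes that NumPass2Call counts the entries of the `arguments` list of a `Call` node and is zero on every other node type, consistent with the appendix remark that it is trivial on non-Call nodes. The Python name `num_pass_2_call` is hypothetical.

```python
def num_pass_2_call(node: dict) -> float:
    """Number of arguments passed to a call; zero on non-Call nodes (assumed definition)."""
    if node.get("type") != "Call":
        return 0.0
    return float(len(node.get("arguments", [])))


# Two nodes taken from the snippet in Example III.1.
nodes = {
    "30": {"type": "LocalVar", "name": "variable7"},
    "64": {
        "fnName": "AddParsedSubObject",
        "type": "Call",
        "target": "62",
        "arguments": ["63"],
    },
}
values = {node_id: num_pass_2_call(node) for node_id, node in nodes.items()}
```

Every function in the feature set follows the same pattern: inspect the metadata available at a node and return a real number, defaulting to zero when the node type does not match.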



VII-C Lebesgue Integration of Functions on SDFGs

Because the nodeset of any SDFG is finite and the PageRank measure defined thereon is discrete, the Lebesgue antiderivatives of the functions defined in the previous section take the form of sequences of dot products.

We illustrate the nature of the map via an example.

Example VII.1.

Consider a SDFG G representing the traversals of some function’s abstract syntax tree. Assume and that PageRank. Assume the nodes both correspond to function calls , where represent the set of arguments passed to .

(8)
(9)

Take the partition of defined by

The Lebesgue antiderivative of NumPass2Call on

takes the form

In general, each entry of the vector is a linear combination of the form $\sum_i \pi_i f(v_i)$, where $\pi_i$ is the element of the PageRank vector assigned to node $v_i$ and $f(v_i)$ is a real number resulting from applying $f$ to node $v_i$.
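As a worked illustration with hypothetical numbers (not those of Example VII.1), take a graph with three nodes, PageRank vector $\pi = (0.2, 0.3, 0.5)$, function values $f(v_1) = 2$, $f(v_2) = 0$, $f(v_3) = 1$, and upper bounds $0.25 < 0.4 < 1$, assuming the filtration sets are $\{v : \pi(v) \le t\}$:

```latex
\begin{align*}
F(0.25) &= \pi_1 f(v_1) = 0.2 \cdot 2 = 0.4, \\
F(0.4)  &= \pi_1 f(v_1) + \pi_2 f(v_2) = 0.4 + 0.3 \cdot 0 = 0.4, \\
F(1)    &= \pi_1 f(v_1) + \pi_2 f(v_2) + \pi_3 f(v_3) = 0.4 + 0 + 0.5 \cdot 1 = 0.9.
\end{align*}
```

The resulting vector is $(0.4, 0.4, 0.9)$, and each entry is a dot product of the PageRank vector with the function values restricted to the nodes admitted by the corresponding upper bound.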

Fig. 2: PageRank Distributions. The PageRank measure defined on the nodes of a given graph depends on the topology of the graph, and thus the expected values of functions also depend on the topology of the graph, where $\mathbb{E}[f] = \sum_i \pi_i f(v_i)$ for $\pi_i$ the PageRank of node $v_i$ and $f$ a function on the nodes.
Fig. 3: ClassRef_ExpectedType_60: We compute the standard deviation of the set of expected values indexed by the set of SDFGs resulting from decompilation, where we take $f$ to be 1 if true and 0 if false. There is one such expected value per SDFG graph and this feature is obtained by computing the mean of the set of these expected values across all SDFGs in the file.

VIII Vectorization of Function Call Graphs

The features of the form extracted from the function call graphs are limited to:

This function, unlike those applied to the SDFGs, is not integrated. We simply include the crypto flag as a feature directly.

The remaining features extracted from the function call graphs are combinatorial and topological in nature.

Let $G$ be the function call graph for a single .NET file and let $C_1, \dots, C_k$ be the connected components thereof. Let $\deg(v)$ represent the number of edges connected to the vertex $v$. Let $N = \{n_1, \dots, n_k\}$, where $n_i$ is the number of nodes of component $C_i$. We extract the following features:

  • max()/min()

  • mean

  • std








IX Experiments

We compare PMIV to a baseline method we call Uniform Measure Integration Vectorization (UMIV).

Uniform Measure Integration Vectorization is similar to PMIV in that the method is defined by computing a graph-based integral of functions defined over the node sets, where these functions are exactly those used for PMIV. The critical difference is that UMIV is defined via integration against the uniform measure.

This means that instead of computing

$$\int f \, d\mu_G = \sum_{v} f(v)\, \mu_G(v)$$

as in $F$ defined via Equation (3), we compute

$$\frac{1}{|V(G)|} \sum_{v \in V(G)} f(v), \tag{10}$$

i.e., a simple average of the given function over the node set of the given graph. PMIV and UMIV similarly leverage the textual information embedded in SDFGs, but UMIV ignores the combinatorial structure of the SDFGs.
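The difference between the two measures can be seen on a single graph. In the sketch below, the same function values are integrated once against a PageRank vector and once against the uniform measure. The numbers and the helper names `expectation` and `uniform_measure` are hypothetical.

```python
def expectation(values, measure):
    """Integrate a function on a finite node set against a probability measure."""
    return sum(f_v * m_v for f_v, m_v in zip(values, measure))


def uniform_measure(n):
    """Uniform probability measure on n nodes, as used by UMIV."""
    return [1.0 / n] * n


# Hypothetical graph: the function is nonzero only on the highest-ranked node.
values = [0.0, 0.0, 0.0, 4.0]
pagerank_vector = [0.1, 0.2, 0.3, 0.4]

pmiv_value = expectation(values, pagerank_vector)              # weights nodes by PageRank
umiv_value = expectation(values, uniform_measure(len(values)))  # simple average
```

Two graphs with identical node labels but different topology yield the same uniform average, whereas the PageRank-weighted value changes with the edge structure.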




IX-A Parsing (same for PMIV and UMIV)

Each .NET file is decompiled resulting in i) an abstract syntax tree for each function within the file and ii) the function call graph. The abstract syntax trees are traversed individually resulting in a single SDFG for each function within the file. The function call graph is a directed graph indicating which functions call which other functions.

Example IX.1.

Consider the C# program


    using System;

    class Hello
    {
        static void Main()
        {
            Console.WriteLine("Hello, World!");
        }
    }

Three graphs result from decompilation: an empty function call graph and two linear SDFGs. See the appendix on decompilation for the decompiler output.


IX-B Vectorization (PMIV)

Each file is vectorized by applying both the vectorization map (6) to the set of shortsighted data flow graphs (many per file) and the vectorization of the function call graph (one per file) as described in Section VIII.


IX-B1 SDFG

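Given a file marked by its hash $h$, we consider the set of SDFG graphs $\{G^h_i\}_{i=1}^{n_h}$ obtained by decompiling $h$.

For each function $f_k \in \mathcal{F}$, hash $h$, and partition point $t_j$, we can compute the values

$$F^h_{i,k}(t_j) = \int_{\{v \in V(G^h_i) \,:\, \mu_{G^h_i}(v) \le t_j\}} f_k \, d\mu_{G^h_i}, \qquad i = 1, \dots, n_h.$$

We can then compute both the mean and standard deviation of the set $\{F^h_{i,k}(t_j)\}_{i}$ for each pair $(k, j)$. As the number of SDFGs varies by file, this is necessary to guarantee that every file in the corpus can be mapped to $\mathbb{R}^N$ for some fixed $N$.

The file is mapped, via integrating over the SDFGs $G^h_i$, to the feature space given by the coordinates $\left(\operatorname{mean}_i F^h_{i,k}(t_j), \operatorname{std}_i F^h_{i,k}(t_j)\right)$.

The file is then described by the feature vector given by

$$\left(\operatorname{mean}_i F^h_{i,k}(t_j), \operatorname{std}_i F^h_{i,k}(t_j)\right)_{k, j},$$

where $k$ is an index running over the set of functions $\mathcal{F}$. See Figure 3 for a resulting feature histogram for a particular $f_k$.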
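The reduction from a variable number of per-graph vectors to one fixed-length file vector can be sketched as follows. The use of the population standard deviation, the ordering of means before standard deviations, and the helper name `file_vector` are assumptions made for illustration.

```python
from statistics import mean, pstdev


def file_vector(per_graph_vectors):
    """Aggregate one vector per SDFG into a single fixed-length file vector.

    per_graph_vectors -- list of equal-length lists, one per SDFG in the file
    Returns the coordinate-wise means followed by the coordinate-wise
    (population) standard deviations.
    """
    columns = list(zip(*per_graph_vectors))
    return [mean(col) for col in columns] + [pstdev(col) for col in columns]


# Hypothetical files with two and three SDFGs; each SDFG yields two integrated values.
small_file = file_vector([[0.0, 1.0], [2.0, 1.0]])
large_file = file_vector([[0.0, 1.0], [2.0, 1.0], [4.0, 1.0]])
```

Both files land in the same feature space regardless of how many SDFGs they contain, which is the property the aggregation is meant to guarantee.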




IX-B2 Function Call Graphs

The function call graph features are included as components of the final file-level vector directly without computing means and standard deviations, as there is a single such graph per file.



IX-C Vectorization (UMIV)

Files are vectorized in essentially the same way as in PMIV, except that we assign the probability $1/|V(G)|$ to each vertex of an SDFG $G$.


The vectorization scheme is defined in Equation (10), and the final reduction of these values across SDFGs into a single vector corresponding to a single file is identical to that of PMIV.


IX-D Algorithm

We train a separate random forest ([30][31]) for each vectorization method, each with identical hyperparameters.

A random forest is an ensemble learning method for classification, regression, and other tasks that operates by constructing a multitude of decision trees at training time and scoring via a polling (classification) or averaging (regression) procedure over its constituent trees.

This algorithm is especially valuable in malware classification as scoring inaccuracy caused by unavoidable label noise is somewhat mitigated by the ensemble.


IX-E Training and Validation (same for PMIV and UMIV)

The .NET corpus was first deduplicated by decompiling each file, hashing each resulting graph, lexicographically sorting and concatenating these hashes, and then hashing the result.

The deduplicated corpus was split into training (70%), validation (10%), and test (20%) sets. We used the grid search functionality of scikit-learn with cross-validation for hyperparameter tuning of the random forest. The optimal model is described in Table I.


Hyperparameter              Value
max leaf nodes              None
min samples leaf            1
warm start                  False
min weight fraction leaf    0
oob score                   False
min samples split           2
criterion                   gini
class weight                None
min impurity split          2.09876756095e-05
n estimators                480
max depth                   None
bootstrap                   True
max features                sqrt

TABLE I: Random Forest Hyperparameters

X Experimental Results

X-A Accuracy, Precision, Recall (PMIV and UMIV)

The model is 98.3% accurate on the test set using only 400 features, which is tiny for a static classifier.

The precision on malicious files was 98.94%, meaning that of the files classified as malicious by the model, 98.94% of them were actually malicious. Precision on benign files was 97.88% and recall on benign files was 99.37%.

The recall on malicious files was 96.47%, meaning that of the malicious files, 96.47% of them were correctly scored as malicious. Of the four precision/recall values, malicious recall was the weakest. There are very likely features of malicious .NET files that are not captured by the set of functions we currently leverage to construct our feature space.

As shown in Tables II and III, our graph-structure-based vectorization method PMIV outperforms our baseline UMIV method by wide margins, demonstrating the efficacy of our graph integration construction.





Class        Precision   Recall    F1-score   Support
Benign       97.88%      99.37%    98.62%     696827
Malware      98.94%      96.47%    97.69%     424420
avg/total    98.28%      98.27%    98.27%     1121247

False Positive Rate    1.10%
False Negative Rate    1.72%

TABLE II: PMIV Performance

Class        Precision   Recall    F1-score   Support
Benign       90.61%      87.04%    88.79%     696827
Malware      87.80%      91.18%    89.46%     424420
avg/total    89.19%      89.13%    89.13%     1121247

False Positive Rate    8.79%
False Negative Rate    12.96%

TABLE III: UMIV Performance
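The F1-scores reported in Tables II and III follow from the listed precision and recall values as their harmonic mean. A short Python check, with the helper name `f1_score` chosen here for illustration:

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Precision and recall taken from Tables II and III, as fractions.
pmiv_benign = 100 * f1_score(0.9788, 0.9937)
pmiv_malware = 100 * f1_score(0.9894, 0.9647)
umiv_benign = 100 * f1_score(0.9061, 0.8704)
```

Rounded to two decimals, these reproduce the tabulated F1-scores for the corresponding rows.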

XI Conclusion

We have engineered a robust control flow graph-based vectorization scheme for exposing features which reveal semantically interesting constructs of .NET files. The vectorization scheme is interpretable and glass-box by construction, which will facilitate scalable taxonomy operations in addition to high-accuracy classification as benign or malicious.

The control flow-type graphs include both function call graphs, one for each file, and SDFG graphs, one for each function defined within the file. Leveraging the combinatorial structure of these graphs results in a rich feature space, within which even a simple classifier can effectively distinguish between benign and malicious files.

The vectorization scheme introduced here may be leveraged to train a standalone model or to augment the feature space of an existing model. Although we limited our experiments to decompiled .NET, we see no obstruction to applying the PMIV concept to a wider class of graph-based file data, such as disassembly.

Future work will involve the addition of new functions for control flow-type graphs to the vectorization scheme, as well as the clustering of files and the functions of which they consist within both the codomain of the vectorization map and within the graph space $\mathcal{G}$. We will also explore the extent to which these functions and files can be parameterized through manifold learning in Euclidean as well as graph space.






























Acknowledgment

The authors would like to thank former colleague Brian Wallace for both deduplicating our .NET corpus and applying the decompiler at scale. Without his efforts, this project would not have been possible.






References

Appendix A Complete List of Functions on SDFGs

All functions are assumed to be zero on nodes for which the associated AST member is inconsistent with the function definition. For example, NumPass2Call is trivial on all non-Call nodes.

Note that value, expr are floats and are ints.



Appendix B CLR AST Dictionary

The dictionary of terms relating to the CLR is as follows:



















