SAFE: Self-Attentive Function Embeddings for Binary Similarity

Abstract

The binary similarity problem consists in determining whether two functions are similar by only considering their compiled form. Advanced techniques for binary similarity recently gained momentum as they can be applied in several fields, such as copyright disputes, malware analysis, vulnerability detection, etc., and thus have an immediate practical impact. Current solutions compare functions by first transforming their binary code into multi-dimensional vector representations (embeddings), and then comparing vectors through simple and efficient geometric operations. However, embeddings are usually derived from binary code using manual feature extraction, which may fail to consider important function characteristics, or may consider features that are not important for the binary similarity problem. In this paper we propose SAFE, a novel architecture for the embedding of functions based on a self-attentive neural network. SAFE works directly on disassembled binary functions, does not require manual feature extraction, is computationally more efficient than existing solutions (i.e., it does not incur the computational overhead of building or manipulating control flow graphs), and is more general as it works on stripped binaries and on multiple architectures. We report the results of a quantitative and qualitative analysis that show how SAFE provides a noticeable performance improvement with respect to previous solutions. Furthermore, we show how clusters of our embedding vectors are closely related to the semantics of the implemented algorithms, paving the way for further interesting applications (e.g., semantic-based binary function search).

1 Introduction

In the last years there has been an exponential increase in the creation of new content. Like all products, software is also subject to this trend. As an example, the number of apps available on the Google Play Store increased from 30K in 2010 to 3 million in 2018.1 This increase directly leads to more vulnerabilities, as reported by CVE,2 which witnessed a 120% growth in the number of discovered vulnerabilities from 2016 to 2017. At the same time complex software spreads to several new devices: the Internet of Things has multiplied the number of architectures on which the same program has to run, and COTS software components are increasingly integrated in closed-source products.

This multidimensional increase in quantity, complexity and diffusion of software makes the resulting infrastructures difficult to manage and control, as part of their internals are often inaccessible for inspection to their own administrators. As a consequence, system integrators are looking for novel solutions that take into account such issues and provide functionalities to automatically analyze software artifacts in their compiled form (binary code). One prototypical problem in this regard is the one of binary similarity [13, 18, 3], where the goal is to find similar functions in compiled code fragments.

Binary similarity has been recently subject to a lot of attention [9, 10, 14]. This is due to its centrality in several tasks, such as discovery of known vulnerabilities in large collection of software, dispute on copyright matters, analysis and detection of malicious software, etc.

In this paper, in accordance with [16] and [29], we focus on a specific version of the binary similarity problem in which we define two binary functions to be similar if they are compiled from the same source code. As already pointed out in [29], this assumption does not make the problem trivial.

Inspired by [16] we look for solutions that solve the binary similarity problem using embeddings. Loosely speaking, each binary function is first transformed into a vector of numbers (an embedding), in such a way that code compiled from the same source results in vectors that are similar.

This idea has several advantages. First of all, once the embeddings are computed, checking the similarity is relatively cheap and fast (we consider the scalar product of two constant-size vectors as a constant time operation). Thus we can pre-compute a large collection of embeddings for interesting functions and check against such a collection in linear time. In light of the many use cases, this characteristic is extremely useful. Another advantage comes from the fact that such embeddings can be used as input to other machine learning algorithms, which can in turn cluster functions, classify them, etc.

Current solutions that adopt this approach, still come with several shortcomings:

  • they [29] use manually selected features to calculate the embeddings, introducing potential bias in the resulting vectors. Such bias stems from the possibility of overlooking important features (that do not get selected), or including features that are expensive to process while not providing noticeable performance improvements; for example, including features extracted from the function’s control flow graph (CFG) imposes a speed penalty with respect to features extracted from the disassembled code3;

  • they [12] assume that call symbols to dynamically linked libraries (such as libc, msvc, etc.) are available in binary functions, while this is not true for binaries that are stripped and statically linked4 or for partial fragments of binaries (e.g., extracted during volatile memory forensic analyses);

  • they work only on specific CPU architectures [12].

Considering these shortcomings, in this paper we introduce SAFE: Self-Attentive Function Embeddings, a solution we designed to overcome all of them. In particular we considered two specific goals:

  • design a solution to quickly generate embeddings for several hundreds of binaries;

  • design a solution that could be applicable in the vast majority of cases, i.e. able to work with stripped binaries with statically linked libraries, and on multiple architectures (in particular we consider AMD64 and ARM as target platforms for our study).

The core of SAFE is based on recent advancements in the area of natural language processing. Specifically, we designed SAFE on a Self-Attentive Neural Network recently proposed in [20]. The idea is to directly consider the sequence of instructions in the binary function and to model them as natural language. A GRU Recurrent Neural Network (GRU RNN) is then used to capture the sequential interaction of the instructions. In addition, an attention mechanism allows the final function embedding to consider all the GRU hidden states, and to automatically focus (i.e., give more weight) on the portions of the binary code that help the most in accomplishing the training objective, i.e., recognising whether two binary functions are similar.

We also investigate the possibility of semantically classifying (i.e. identifying the general semantic behavior of) binary functions by clustering similar embeddings. To the best of our knowledge we are the first to investigate the feasibility of this task through machine learning tools, and to perform a quantitative analysis on this subject. The results are encouraging, showing a classification accuracy of 95% for 4 different broad classes of algorithms (namely Encryption, Sorting, Mathematical and String Manipulation functions). Finally, we also applied our semantic classifier to known malware, and we were able to use it to accurately recognize functions implementing encryption algorithms.

1.1 Contribution

The main contributions of our work are:

  • we describe SAFE, a general architecture for calculating binary function embeddings starting from disassembled binaries;

  • we extensively evaluate SAFE showing that it provides better performance than previous state-of-the-art systems with similar requirements. Specifically, we compare it with the recent Gemini [29], showing a performance improvement on several metrics that ranges from to depending on the task at hand;

  • we apply SAFE to the problem of identifying vulnerable functions in binary code, a common application task for binary similarity solutions; also in this task SAFE provides better performance than state-of-the-art solutions.

  • we show that embeddings produced by SAFE can be used to automatically classify binary functions in semantic classes. On a dataset of K functions, we can recognize whether a function implements an encryption algorithm, a sorting algorithm, generic math operations, or a string manipulation, with an accuracy of .

  • we apply SAFE to the analysis of known malware, to identify encryption functions. Interestingly, we achieve good performance: among 10 functions flagged by SAFE as Encryption, only one was a false positive.

The remainder of this paper is organized as follows. Section 2 discusses related work, followed by Section 3 where we define the problem and report an overview of the solution we tested. In Section 4 we describe SAFE in detail, and in Section 5 we provide implementation details and information on the training. In Section 6 we describe the experiments we performed and report their results. Finally, in Section 7 we discuss the speed of SAFE.

2 Related Work

We can broadly divide the binary similarity literature into works that propose embedding-based solutions, and works that do not.

2.1 Works not based on embeddings

Single Platform solutions — Regarding the literature on binary similarity for a single platform, a family of works is based on matching algorithms for function CFGs. In Bindiff [13] matching among vertices is based on the syntax of code, and it is known to perform poorly across different compilers (see [9]). Pewny et al. [24] proposed a solution where each vertex of a CFG is represented with an expression tree; similarity among vertices is computed by using the edit distance between the corresponding expression trees.

Other works use different solutions that do not rely on graph matching. David and Yahav [11] proposed to represent a function as several independent execution traces, called tracelets; similar tracelets are then matched by using a custom edit-distance. A related concept is used by David et al. in [9] where functions are divided into pieces of independent code, called strands. The matching between functions is based on how many statistically significant strands are similar. Intuitively, a strand is significant if it is not statistically common. Strand similarity is computed using an SMT-solver to assess semantic similarity. Note that all previous solutions are designed around matching procedures that work pair-to-pair, and they cannot be adapted to pre-compute a constant-size signature of a binary function on which similarity can be assessed.

Egele et al. in [14] proposed a solution where each function is executed multiple times in a random environment. During the executions some features are collected and then used to match similar functions. This solution can be used to compute a signature for each function. However, it needs to execute a function multiple times, which is both time-consuming and difficult to perform in the cross-platform scenario. Furthermore, it is not clear if the features identified in [14] are useful for cross-platform comparison. Finally, Khoo et al. [18] proposed a matching approach based on n-grams computed on instruction mnemonics and graphlets. Even if this strategy does produce a signature, it cannot be immediately extended to cross-platform similarity.

  • source code
  • compiler
  • binary function compiled from a source code
  • embedding vector of a function
  • list of instructions in a function
  • number of instructions in a function
  • instruction in a function
  • embedding vector of an instruction
  • list of instruction embeddings in a function
  • i2v: instruction embedding model (instruction2vector)
  • i-th component of a vector
  • i-th hidden state of the bi-directional Recurrent Neural Network (RNN)
  • dimension of an instruction embedding vector
  • dimension of the RNN state
  • hyperbolic tangent function
  • softmax function
  • ReLU: rectified linear unit function
  • weight matrices of the attention mechanism
  • weight matrices of the output layer of the Self-Attentive network
  • embedding matrix
  • attention depth (parameter of the function embedding network)
  • number of attention hops (parameter of the function embedding network)
  • ground truth label associated with the i-th input pair of functions
  • network hyper-parameters
  • network objective function
  • number of results of a function search query
Table 1: Notation.

Cross-Platform solutions — Pewny et al. [23] proposed a graph-based methodology, i.e. a matching algorithm on the CFGs of functions. The idea is to transform the binary code into an intermediate representation; on such representation the semantic of each CFG vertex is computed by sampling the code executions using random inputs. Feng et al. [15] proposed a solution where each function is expressed as a set of conditional formulas; integer programming is then used to compute the maximum matching between formulas. Note that both [23] and [15] allow pair-to-pair checks only.

David et al. [10] propose to transform binary code into an intermediate representation. Then, functions are partitioned into slices of independent code, called strands. An involved process guarantees that strands with the same semantics will have similar representations. Functions are deemed similar if enough of their significant strands match. Note that this solution does generate a signature as a collection of hashed strands. However, it has two drawbacks: the first is that the signature is not constant-size but depends on the number of strands contained in the function. The second drawback is that it is not immediate to transform such signatures into embeddings that can be directly fed to other machine learning algorithms.

2.2 Based on embeddings

The works most closely related to ours are those that propose embeddings for binary similarity, specifically the ones that target cross-platform scenarios.

Single-Platform solutions — Recently, [12] proposed a function embedding solution called Asm2Vec. This solution is based on the PV-DM model [19] for natural language processing. Operatively, Asm2Vec computes the CFG of a function, and then performs a series of random walks on top of it. Asm2Vec outperforms several state-of-the-art solutions in the field of binary similarity. Despite being a really promising solution, Asm2Vec does not fulfill all the design goals of our system: firstly, it requires libc call symbols to be present in the binary code as tokens to produce the embedding of a function; secondly, it is only suitable for single-platform embeddings.

Cross-Platform solutions — Feng et al. [16] introduced a solution that uses a clustering algorithm over a set of functions to obtain centroids for each cluster. Then, they used these centroids and a configurable feature encoding mechanism to associate a numerical vector representation with each function. Xu et al. [29] proposed an architecture called Gemini, where function embeddings are computed using a deep neural network. Interestingly, [29] shows that Gemini outperforms [16] both in terms of accuracy and performance (measured as time required to train the model). In Gemini the CFG of a function is first transformed into an annotated CFG, a graph containing manually selected features, and then embedded into a vector using the graph embedding model of [8]. The manual features used by Gemini do not need call symbols. To the best of our knowledge Gemini is the state-of-the-art solution for cross-platform embeddings based on deep neural networks that works on code without call symbols. In an unpublished technical report [4] we proposed a variation of Gemini where manual features are replaced with an unsupervised feature learning mechanism. This single change led to a performance improvement over the baseline represented by Gemini.

Finally, in [30] the authors propose the use of a recurrent neural network based on LSTM (Long Short-Term Memory) to solve a subtask of binary similarity, namely finding similar CFG blocks.

3 Problem Definition and Solution Overview

For clarity of exposition we summarize all the notation used in this paper in Table 1. Let us first define the similarity problem.

We say that two binary functions are similar if they are the result of compiling the same original source code with different compilers. Essentially, a compiler is a deterministic transformation that maps a source code to a corresponding binary function. In this paper we consider as a compiler the specific software, e.g. gcc-5.4.0, together with the parameters that influence the compiling process, e.g. the optimization flags -O.

We indicate the list of assembly instructions composing a function as its instruction list. Our aim is to represent a function as a vector of real numbers. This is achieved with an embedding model that maps the function to an embedding vector, preserving structural similarity relations between binary functions.

Function Semantic. Loosely speaking, a function can be seen as an implementation of an algorithm. We can partition algorithms into classes, where each class is a group of algorithms solving related problems. In this paper we focus on four classes: E (Encryption), S (Sorting), SM (String Manipulation), M (Mathematical). A function belongs to class E if it is the implementation of an encryption algorithm (e.g., AES, DES); it belongs to class S if it implements a sorting algorithm (e.g., bubblesort, mergesort); it belongs to class SM if it implements an algorithm to manipulate a string (e.g., string reverse, string copy); it belongs to class M if it implements math operations (e.g., computing a Bessel function). We say that a classifier recognizes the semantic of a function if it is able to guess the class to which the function belongs.

3.1 SAFE Overview.

We use an embedding model structured in two phases: in the first phase the Assembly Instructions Embedding component transforms a sequence of assembly instructions into a sequence of vectors; in the second phase a Self-Attentive Neural Network transforms this sequence of vectors into a single embedding vector. See Figure 1 for a schematic representation of the overall architecture of our embedding network.

Assembly Instructions Embedding (i2v)

In the first phase of our strategy we map each instruction to a vector of real numbers, using the word2vec model [22]. Word2vec is an extremely popular feature learning technique in natural language processing. We use a large corpus of instructions to train our instruction embedding model (see Section 5.1); we call our mapping instruction2vec (i2v). The final outcome of this step is a sequence of instruction embedding vectors.

Self-Attentive Network

For our Self-Attentive Network we use the network recently proposed in [20]. In this Self-Attentive Network, a bi-directional recurrent neural network is fed with the sequence of assembly vectors. Intuitively, for each instruction vector the RNN computes a summary vector taking into account the instruction itself and its context in the function. The final embedding of the function is a weighted sum of all summary vectors. The weights of such summation are computed by a two-layer fully-connected neural network.

We selected the Self-Attentive Network for two reasons. First, it shows state-of-the-art performance on natural language processing tasks [20]. Secondly, it suffers less from the long-memory problem5 of classic RNNs: in the Self-Attentive case the RNN computes only a local summary of each instruction. Our research hypothesis is that it would behave well over the long sequences of instructions composing binary functions; this hypothesis is indeed confirmed in our experiments (see Section 6).

4 Details of the SAFE Function Embedding Network

Figure 1: Architecture of SAFE. The vertex feature extractor component refers to the Unsupervised Feature Learning case.

We denote the entire embedding network as SAFE: Self-Attentive Function Embeddings.

4.1 Assembly Instructions Embedding (i2v)

The first step of our solution consists in associating an embedding vector with each instruction contained in the function. We achieve this by training the embedding model i2v using the skip-gram method [22]. The idea of skip-gram is to use the current instruction to predict the instructions around it. A similar approach has also been used in [7].

We train the i2v model using assembly instructions as tokens (i.e., a single token includes both the instruction mnemonic and the operands). We do not use the raw instruction but we filter it as follows. We examine the operands and replace all base memory addresses with the special symbol MEM and all immediates whose absolute value is above some threshold (see Section 5.1 for the value used in our experiments) with the special symbol IMM. We do this filtering because we believe that using raw operands is of small benefit; for instance, the displacement given by a jump is useless (instructions do not carry their own memory address with them), and, on the contrary, it may decrease the quality of the embedding by artificially inflating the number of different instructions. As an example, the instruction mov EAX,<large immediate> becomes mov EAX,IMM, the instruction mov EAX,<memory address> becomes mov EAX,MEM, while the instruction mov EAX,EBP is not modified. Intuitively, the last instruction accesses a location different from the one accessed by mov EAX,[EBP], and this distinction remains intact with our filtering.
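As an illustration, a minimal sketch of this normalization step is shown below; the regular expressions, the MAX_IMM threshold, and the normalize helper are our own assumptions rather than the exact rules used by SAFE.

```python
import re

MAX_IMM = 5000  # hypothetical threshold: the value actually used is given in Section 5.1

def normalize(instruction: str) -> str:
    """Replace memory-address operands with MEM and large immediates with IMM."""
    mnemonic, _, operands = instruction.partition(" ")
    if not operands:
        return mnemonic
    normalized = []
    for op in (o.strip() for o in operands.split(",")):
        if re.fullmatch(r"\[0x[0-9a-fA-F]+\]", op):        # direct memory reference
            normalized.append("MEM")
        elif re.fullmatch(r"-?(0x[0-9a-fA-F]+|\d+)", op):   # immediate operand
            normalized.append("IMM" if abs(int(op, 0)) >= MAX_IMM else op)
        else:                                               # registers, stack offsets, ...
            normalized.append(op)
    return mnemonic + " " + ",".join(normalized)

print(normalize("mov EAX,[0x402010]"))  # -> mov EAX,MEM
print(normalize("mov EAX,123456"))      # -> mov EAX,IMM
print(normalize("mov EAX,EBP"))         # -> mov EAX,EBP (unchanged)
```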

4.2 Self-Attentive Network

We based our Self-Attentive Network on the one proposed by [20]. The overall structure is detailed in Figure 2. We compute the embedding of a function by using its sequence of instruction vectors. These vectors are fed into a bi-directional recurrent neural network, obtaining for each instruction vector $\vec{\iota_i}$ a summary vector $h_i$ of size $2u$:

$$h_i = \overrightarrow{h_i} \parallel \overleftarrow{h_i}, \qquad \overrightarrow{h_i} = \overrightarrow{\mathrm{RNN}}(\overrightarrow{h_{i-1}}, \vec{\iota_i}), \qquad \overleftarrow{h_i} = \overleftarrow{\mathrm{RNN}}(\overleftarrow{h_{i+1}}, \vec{\iota_i})$$

where $\parallel$ is the concatenation operator, $\overrightarrow{\mathrm{RNN}}$ (resp., $\overleftarrow{\mathrm{RNN}}$) is the forward (resp., backward) RNN cell, and $\overrightarrow{h_i}$ and $\overleftarrow{h_i}$ are the forward and backward states of the RNN. The state of each RNN cell has size $u$.

From these summary vectors we obtain a matrix $H$ of size $m \times 2u$, whose rows are the summary vectors. An attention matrix $A$ of size $r \times m$ is computed using a two-layer neural network:

$$A = \mathrm{softmax}\big(W_{s2}\,\tanh(W_{s1} H^{\top})\big)$$

where $W_{s1}$ is a weight matrix of size $d_a \times 2u$ and the parameter $d_a$ is the attention depth of our model. The matrix $W_{s2}$ is a weight matrix of size $r \times d_a$ and the parameter $r$ is the number of attention hops of our model.

Intuitively, when $r = 1$, $A$ collapses into a single attention vector, where each value is the weight of a specific summary vector. When $r > 1$, $A$ becomes a matrix and each row is an independent attention hop. Loosely speaking, each hop weights the attention on a different aspect of the binary function.

The embedding matrix $B$ of our sequence is:

$$B = A H$$

and it has fixed size $r \times 2u$. In order to transform the embedding matrix into a vector of fixed size, we flatten the matrix and feed the flattened vector into a two-layer fully connected neural network with ReLU activation function:

$$\vec{f} = W_{out2}\,\mathrm{ReLU}\big(W_{out1}\,\mathrm{flatten}(B)\big)$$

where $W_{out1}$ and $W_{out2}$ are the weight matrices of the two fully connected layers.

Figure 2: Self-Attentive Network: detailed architecture.

Learning Parameters Using a Siamese Architecture: we learn the network parameters using a pairwise approach, a technique also called siamese network in the literature [5]. The main idea is to join two identical function embedding networks with a similarity score (by identical we mean that the networks share the same parameters). The final output of the siamese architecture is the similarity score between the two input functions (see Figure 3).

In more detail, from a pair of input functions two embedding vectors $\vec{f_1}$ and $\vec{f_2}$ are obtained by using the same function embedding network. These vectors are compared using cosine similarity as distance metric, with the following formula:

$$\mathrm{similarity}(\vec{f_1},\vec{f_2}) = \frac{\sum_{i=1}^{n} \vec{f_1}[i]\cdot\vec{f_2}[i]}{\sqrt{\sum_{i=1}^{n} \vec{f_1}[i]^2}\cdot\sqrt{\sum_{i=1}^{n} \vec{f_2}[i]^2}} \qquad (1)$$

where $\vec{f}[i]$ indicates the $i$-th component of the vector $\vec{f}$.

Figure 3: Siamese network.

To train the network we require as input a set of function pairs with ground truth labels, where the label indicates whether the two input functions are similar or dissimilar. Then, using the siamese network output, we define an objective function over all training pairs that rewards assigning high similarity scores to similar pairs and low scores to dissimilar ones, augmented with a regularization term on the attention matrices.

The objective function is minimized by using, for instance, stochastic gradient descent. The penalty term is introduced to discourage the choice of the same weights for each attention hop in the attention matrix (see [20]).
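As a hedged illustration, the sketch below computes the loss for a single pair under the assumption that the objective is the squared difference between the cosine similarity and a label in {-1, +1}, plus the Frobenius-norm penalty on the attention matrices proposed in [20]; the exact loss and label encoding used by SAFE may differ.

```python
import numpy as np

def cosine_similarity(f1, f2):
    return float(np.dot(f1, f2) / (np.linalg.norm(f1) * np.linalg.norm(f2)))

def attention_penalty(A):
    """||A A^T - I||_F^2: discourages identical attention hops (see [20])."""
    r = A.shape[0]
    return float(np.linalg.norm(A @ A.T - np.eye(r), ord="fro") ** 2)

def pair_loss(f1, f2, y, A1, A2, penalty_weight=1.0):
    """Loss for one pair; y is assumed to be +1 for similar functions and -1 otherwise."""
    sim = cosine_similarity(f1, f2)
    return (sim - y) ** 2 + penalty_weight * (attention_penalty(A1) + attention_penalty(A2))
```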

5 Implementation Details and Training

In this section we first discuss the implementation details of our system, and then we explain how we trained our models.

5.1 Implementation Details and i2v setup

We developed a prototype implementation of SAFE using Python and the Tensorflow [1] framework. For static analysis of binaries we used the ANGR framework [27], radare2 [25]6 and IDA Pro. To train the network we used a batch size of 250, learning rate , Adam optimizer.

In our SAFE prototype we used the following parameters: the RNN cell is the GRU cell [6]; the value is , , , , .

We decided to truncate the number of instructions inside each function to a maximum value of ; this represents a good trade-off between training time and accuracy, since the great majority of functions in our datasets (more than 90%) is below this threshold.

i2v model

We trained two i2v models using the two training corpora described below: one model for the ARM instruction set and one for AMD64. With this choice we tried to capture the different syntaxes and semantics of these two assembly languages. The model that we use for i2v (for both the AMD64 and ARM versions) is the skip-gram implementation of word2vec provided in [28]. We used as parameters: embedding size 100, window size and word frequency .
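For illustration, an equivalent skip-gram model can be trained with gensim as sketched below; the window size and minimum word frequency are placeholders, since the paper uses the TensorFlow implementation [28] with its own values for these parameters.

```python
from gensim.models import Word2Vec

# Each "sentence" is the list of filtered instructions of one function, e.g.:
corpus = [
    ["push RBP", "mov RBP,RSP", "mov EAX,IMM", "pop RBP", "ret"],
    ["push RBP", "mov RBP,RSP", "mov EAX,MEM", "pop RBP", "ret"],
]

i2v = Word2Vec(
    sentences=corpus,
    vector_size=100,   # embedding size used by SAFE
    sg=1,              # skip-gram, as in the paper
    window=8,          # placeholder: the actual window size is given in Section 5.1
    min_count=1,       # placeholder: the actual minimum word frequency is given in Section 5.1
)

vector = i2v.wv["mov EAX,IMM"]  # 100-dimensional instruction embedding
```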

Datasets for training the i2v models We collected the assembly code of a large number of functions, and we used it to build two training corpora, one for the i2v AMD64 model and one for the i2v ARM model. We built both corpora by disassembling several UNIX executables and libraries using IDA Pro. The libraries and the executables have been randomly sampled from repositories of Debian packages.

We avoided multiple inclusions of common functions and libraries by using a duplicate detection mechanism; we tested the uniqueness of a function by computing a hash of all its instructions, where instructions are filtered by replacing the operands containing immediates and memory locations with a special symbol.
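A possible implementation of this duplicate-detection step is sketched below; the input is assumed to already contain the filtered instructions, and the helper names are our own.

```python
import hashlib

def function_fingerprint(instructions):
    """Hash a function from its filtered instructions (operands containing immediates
    and memory locations already replaced by a special symbol, as described above)."""
    canonical = "\n".join(instructions)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def deduplicate(functions):
    """Keep one representative per distinct instruction sequence.
    `functions` maps a function name to its list of filtered instructions."""
    seen, unique = set(), {}
    for name, instructions in functions.items():
        fp = function_fingerprint(instructions)
        if fp not in seen:
            seen.add(fp)
            unique[name] = instructions
    return unique
```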

From 2.52 GB of AMD64 binaries we obtained the assembly code of 547K unique functions. From 3.05 GB of ARM binaries we obtained the assembly code of 752K unique functions. Overall, the AMD64 corpus contains millions of assembly code lines, and so does the ARM corpus.

5.2 Training Single and Cross Platform Models

We trained SAFE models using the same methodology as Gemini, see [29]. We trained both a single platform and a cross platform model, which were then evaluated on several tasks (see Section 6 for the results).

Datasets

  • AMD64multipleCompilers Dataset. This dataset has been obtained by compiling the following libraries for AMD64: binutils-2.30, ccv 0.7, coreutils-8.29, curl-7.61.0, gsl-2.5, libhttpd-2.0, openmpi-3.1.1, openssl-1.1.1-pre8, valgrind-3.13.0. The compilation has been done using 3 different compilers, clang-3.9, gcc-5.4, gcc-3.4,7 and 4 optimization levels. The compiled object files have been disassembled with ANGR, obtaining a total of 452598 functions.

  • AMD64ARMOpenSSL Dataset. To align our experimental evaluation with state-of-the-art studies we built the AMD64ARMOpenSSL Dataset in the same way as the one used in [29]. In particular, the AMD64ARMOpenSSL Dataset consists of a set of 95535 functions generated from all the binaries included in two versions of OpenSSL (v1_0_1f and v1_0_1u) that have been compiled for AMD64 and ARM using gcc-5.4 with 4 optimization levels. The resulting object files have been disassembled using ANGR; we discarded all the functions that ANGR was not able to disassemble.

Training

We generate our training and test pairs as reported in [29]. The pairs can be of two kinds: similar pairs, obtained by pairing together two binary functions originated by the same source code, and dissimilar pairs, obtained by randomly pairing functions that do not derive from the same source code.

Specifically, for each function in our datasets we create two pairs: a similar pair and a dissimilar pair, each associated with the corresponding ground truth label; we thus obtain a total number of pairs that is twice the total number of functions.
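For illustration, a possible way to generate such pairs is sketched below; `groups` is a hypothetical mapping from each source-level function to its compiled variants, and the +1/-1 labels are an assumed encoding consistent with the cosine-similarity output.

```python
import random

def generate_pairs(groups):
    """groups: dict mapping a source function id to the list of its compiled variants.
    Assumes at least two distinct source functions are present."""
    pairs = []
    ids = list(groups)
    for fid, variants in groups.items():
        for func in variants:
            # similar pair: another variant compiled from the same source code
            sibling = random.choice([v for v in variants if v is not func] or [func])
            pairs.append((func, sibling, +1))
            # dissimilar pair: a variant of a randomly chosen different source function
            other = random.choice(ids)
            while other == fid:
                other = random.choice(ids)
            pairs.append((func, random.choice(groups[other]), -1))
    return pairs
```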

The functions in AMD64multipleCompilers Dataset are partitioned in three sets: train, validation, and test (75%-15%-15%).

The functions in AMD64ARMOpenSSL Dataset are partitioned into two sets: train and test (80%-20%); in this case we do not need a validation set because in Task 1 (Section 6.1) we perform a cross-validation.

The test and validation pairs will be used to assess performances in Task 1, see Section 6.1.

As in [29], pairs are partitioned preventing that two similar functions are in different partitions (this is done to avoid that the network sees during training functions similar to the ones on which it will be validated or tested).

We train our models for a fixed number of epochs (an epoch represents a complete pass over the whole training set). In each epoch we regenerate the training pairs, that is, we create new similar and dissimilar pairs using the functions contained in the training split. We pre-compute the pairs used in each epoch, in such a way that each method is tested on the same data. Note that we do not regenerate the validation and test pairs.

6 Evaluation

We perform an extensive evaluation of SAFE investigating its performance on several tasks:

  • Task 1 - Single Platform and Cross Platform Models Tests: we test our single platform and cross platform models following the same methodology of [29]. We achieve a performance improvement of in the single platform case and of in the cross platform case. We remark that in these tests our models behave almost perfectly (within from what a perfect model may achieve). This task is described in Section 6.1.

  • Task 2 - Function Search: in this task we are given a certain binary function and we have to search for similar functions in a large dataset created using several compilers (including compilers that were not used in the training phase). We achieve a precision above for the first 15 results, and a recall of in the first 50 results. Section 6.2 is devoted to Task 2.

  • Task 3 - Vulnerability Search: in this task we evaluate our system on a use-case scenario in which we search for vulnerable functions. Our tests on several vulnerabilities show a recall of in the first 10 results. Task 3 is the focus of Section 6.3.

  • Task 4 - Semantic Classification: in this task we classify the semantic of binary functions using the embeddings built with SAFE. We reach an accuracy of on our test dataset. Moreover, we test our classifier on real-world malware, showing that we can identify encryption functions. Task 4 is explained in Section 6.4.

During our evaluation we compare SAFE with Gemini.8

6.1 Task 1 - Single and Cross Platform tests

In this task we evaluate the performance of SAFE using the same testing methodology of Gemini. We use the test split and the validation split computed as discussed in Section 5.2.1.

Test methodology

We perform two disjoint tests.

  • On AMD64multipleCompilers Dataset, we compute performance metrics on the validation set for all the epochs. Then, we use the model hyper parameters that led to the best performance on the validation set to compute a final performance score on the test set.

  • On AMD64ARMOpenSSL Dataset, we perform a 5-fold cross validation: we partition the dataset into 5 sets; for each possible union of 4 partitions we train the classifiers on such union and then we test them on the remaining partition. The reported results are the average of 5 independent runs, one for each possible fold chosen as test set. This approach is more robust than a fixed train/validation/test split since it reduces the variability of the results.

Measures of Performance As in [29], we measure the performance using the Receiver Operating Characteristic (ROC) curve [17]. Following the best practices of the field we measure the area under the ROC curve, or AUC (Area Under Curve). Loosely speaking, the higher the AUC value, the better the predictive performance of the algorithm.

Experimental Results

AMD64multipleCompilers Dataset The results for the single platform case are in Figure 4(a). Our AUC is , the AUC of Gemini is . Even if the improvement is 6.8, it is worth noticing that SAFE provides performance that is close to the perfect case (0.99 AUC).

AMD64ARMOpenSSL Dataset
(a) Test on AMD64multipleCompilers Dataset. ROC curves for the comparison between SAFE and Gemini on the test set. The dashed line is the ROC for Gemini, the continuous line the ROC for SAFE. The AUC of our solution is , the AUC of Gemini is .
(b) Test on AMD64ARMOpenSSL Dataset. ROC curves for the comparison between SAFE and Gemini using 5-fold cross validation. The lines represent the ROC curves obtained by averaging the results of the five runs; the dashed line is the average for Gemini, the continuous line the average for our solution. For both we color the area between the ROC curves with minimum AUC and maximum AUC. The average AUC of our solution is , the average AUC of Gemini is .
Figure 4: ROC curves on AMD64multipleCompilers Dataset and AMD64ARMOpenSSL Dataset. Task 1 - Validation and Test of Single Platform and Cross Platform Models.

We compare ourselves with Gemini in the cross-platform case. The results are in Figure 4(b); they show the average ROC curves over the five runs of the 5-fold cross validation. The Gemini results are reported with an orange dashed line while we use a continuous blue line for our results. For both solutions we additionally highlight the area between the ROC curves with minimum and maximum AUC over the five runs. The better prediction performance of SAFE is clearly visible; the average AUC obtained by Gemini is with a standard deviation of over the five runs, while the average AUC of SAFE is with a standard deviation of . The average improvement with respect to Gemini is of .

6.2 Task 2 - Function Search

In this task we evaluate the function search capability of the model trained on AMD64multipleCompilers Dataset. We take a target function, compute its embedding and search for similar functions in the AMD64PostgreSQL Dataset (details of this dataset are given below). Given the target, a search query returns the ordered list of the nearest embeddings in AMD64PostgreSQL Dataset.

Dataset

We built AMD64PostgreSQL Dataset by compiling PostgreSQL 9.6.0 for AMD64 using 12 compilers: gcc-3.4, gcc-4.7, gcc-4.8, gcc-4.9, gcc-5.4, gcc-6, gcc-7, clang-3.8, clang-3.9, clang-4.0, clang-5.0, clang-6.0. For each compiler we used all 4 optimization levels. We took the object files, i.e. we did not create the executables by linking object files together, and we disassembled them with radare2, obtaining a total of 581640 functions. For each function the AMD64PostgreSQL Dataset thus contains a certain average number of similar functions; we do not reach the maximum possible average9 because some functions are lost due to disassembler errors.

Measures of Performances

We compute the usual measures of precision, i.e. the fraction of similar functions in the returned results over all returned results, and recall, i.e. the fraction of similar functions in the returned results over all similar functions in the dataset. Moreover, we also compute the normalised Discounted Cumulative Gain (nDCG) [2]:

$$\mathrm{nDCG}_k = \frac{1}{\mathrm{IDCG}_k}\sum_{i=1}^{k} \frac{rel_i}{\log_2(i+1)}$$

where $rel_i$ is 1 if the $i$-th returned function is similar to the target and 0 otherwise, and $\mathrm{IDCG}_k$ is the Discounted Cumulative Gain of the optimal query answering. This measure is between 0 and 1, and it takes into account the ordering of the similar functions in the results, giving better scores to responses that put similar functions first.

As an example, let us suppose we have two results for the same query, where 1 means that the corresponding position in the result list is occupied by a similar function and 0 otherwise: a result that ranks the similar functions first and one that ranks them last have the same precision, but nDCG scores the first better.
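For reference, the nDCG of a single query answer can be computed as in the following sketch, where `relevances` is the 0/1 relevance list of the returned functions and `n_similar` is the number of functions in the dataset that are similar to the target; the example rankings at the end are illustrative, not the ones from the text.

```python
from math import log2

def dcg(relevances):
    return sum(rel / log2(i + 2) for i, rel in enumerate(relevances))

def ndcg(relevances, n_similar):
    """relevances: 1 if the i-th result is similar to the target, 0 otherwise."""
    k = len(relevances)
    ideal = dcg([1] * min(n_similar, k))   # optimal answer: all similar functions first
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# Two illustrative answers with equal precision but different nDCG:
print(ndcg([1, 1, 0, 0], n_similar=2))  # similar functions ranked first -> 1.0
print(ndcg([0, 0, 1, 1], n_similar=2))  # similar functions ranked last  -> lower
```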

Experimental Results

(a) Precision for the top- answers with .
(b) nDCG for the top- answers with .
(c) Recall for the top- answers with .
Figure 5: Results for Task 2 - Function Search, on AMD64PostgreSQL Dataset (581K functions), average over 160K queries.

Our results on precision, nDCG and recall are reported in Figure 5.

The performances were calculated by averaging the results of 160K queries. The queries are obtained by sampling, in AMD64PostgreSQL Dataset, 10K functions for each compiler and optimization level in the set {clang-4.0, clang-6.0, gcc-4.8, gcc-7} × {O0, O1, O2, O3}.

Let us recall that, on average, for each query we have 33 similar functions (e.g., functions compiled from the same source code) in the dataset.

  • Precision: The results are reported in Figure 5(a). The precision is above for , and it is above for . The increase in performance over Gemini is around 10% on the entire range considered. Specifically, at we have values for SAFE and for Gemini.

  • nDCG: The results are reported in Figure 5(b). Our solution has a performance above for . This implies that we have a good ordering of the results and the similar functions are among the first results returned. The value is always above . There is a clear improvement with respect to Gemini; the increase is around on the entire range considered. Specifically, at we have values for SAFE and for Gemini.

  • Recall: The results are reported in Figure 5(c). We have a recall at of 47% (vs. 39% for Gemini), and the recall at is (vs. 45% for Gemini). Specifically, at we have values for SAFE and for Gemini.

6.3 Task 3 - Vulnerability Search

In this task we evaluate our ability to search for vulnerable functions on a dataset specifically designed for this purpose. The methodology and the performance measures of this test are the same as in Task 2.

Dataset and methodology

The dataset used is the vulnerability dataset of [9]. The dataset contains several vulnerable binaries compiled with 11 compilers from the clang, gcc and icc families. The total number of different vulnerabilities is 8.10 We disassembled the dataset with ANGR, obtaining 3160 binary functions. The average number of vulnerable functions for each of the 8 vulnerabilities is ; with a minimum of vulnerable functions and a maximum of 11. We performed a lookup for each of the 8 vulnerabilities, computing the precision, nDCG, and recall on each result. Finally, we averaged these performances over the 8 queries.

Experimental Results

The results of our experiments are reported in Figure 6. We can see that SAFE outperforms Gemini for all values of in all tests. Our nDCG is very large, showing that SAFE effectively finds most of the vulnerable functions in the nearest results. For we reach a recall of , while Gemini reaches a recall of . For our recall is (vs. a recall of for Gemini, an increase in performance of 29%), and we reach a maximum of (vs. for Gemini). One of the reasons why the accuracy quickly decreases is that, on average, we have few similar functions; this means that even a perfect system will have an accuracy that is less than for large enough . This metric problem is not shared by the nDCG reported in Figure 6(b); recall that the nDCG is normalized on the behaviour of the perfect query answering system. During our tests we observed an ideal behaviour on the infamous Heartbleed vulnerability: SAFE found all the vulnerable functions in the first results, while Gemini had a recall at 13 of around .

(a) Precision for the top- answers with .
(b) nDCG for the top- answers with .
(c) Recall for the top- answers with .
Figure 6: Results for Task 3 - Vulnerability Search.

6.4 Task 4 - Semantic Classification

In Task 4 we evaluate the semantic classification using the embeddings computed with the model trained on AMD64multipleCompilers Dataset. We calculate the embeddings for all functions in the Semantic Dataset (details on the dataset below). We split our embeddings into a train set and a test set, and we train and test an SVM classifier using a 10-fold cross validation. We use an SVM classifier with rbf kernel, and parameters and . We compare our embeddings with the ones computed with Gemini.
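As a sketch of this classification step, the following uses scikit-learn with placeholder data in place of the actual SAFE embeddings; the rbf-kernel parameters C and gamma shown here are illustrative, not the values used in the paper.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Placeholder data standing in for the SAFE embeddings of the Semantic Dataset
# (in the real experiment X holds one embedding per function, y its class E/S/SM/M).
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 100))                  # 400 functions, 100-dimensional embeddings
y = rng.choice(["E", "S", "SM", "M"], size=400)  # class labels

clf = SVC(kernel="rbf", C=1.0, gamma="scale")    # C and gamma are placeholders, not the paper's values
scores = cross_val_score(clf, X, y, cv=10)       # 10-fold cross validation
print("mean accuracy:", scores.mean())
```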

Class Number of functions
S (Sorting) 4280
E (Encryption) 2868
SM (String Manipulation) 3268
M (Math) 4742
Total 15158
Table 2: Number of functions for each class in the Semantic Dataset.

Dataset

The Semantic Dataset has been generated from a source code collection containing 443 functions that have been manually annotated as implementing algorithms in one of the 4 classes: E (Encryption), S (Sorting), SM (String Manipulation), M (Mathematical). The Semantic Dataset contains multiple functions that refer to different implementations of the same algorithm. We compiled the sources for AMD64 using the 12 compilers and 4 optimization levels used for AMD64PostgreSQL Dataset; we took the object files and, after disassembling them with ANGR, we obtained a total of 15158 binary functions (see details in Table 2). It is customary to use auxiliary functions when implementing complex algorithms (e.g. a swap function used by a quicksort algorithm). When we disassemble the Semantic Dataset we take special care to include the auxiliary functions in the assembly code of the caller. This step is done to make sure that the semantic of the function is not lost due to the scattering of the algorithm semantic among helper functions. Operatively, we include in the caller all the callees up to depth 2.

Measures of Performances

As performance measures we use precision, recall and F-1 score.

Experimental Results

Class Embedding Model Precision Recall F1-Score
E (Encryption) SAFE 0.92 0.94 0.93
Gemini 0.82 0.85 0.83
M (Math.) SAFE 0.98 0.95 0.96
Gemini 0.96 0.90 0.93
S (Sorting) SAFE 0.91 0.93 0.92
Gemini 0.87 0.92 0.89
SM (String Manipulation) SAFE 0.98 0.97 0.97
Gemini 0.90 0.89 0.89
Weighted Average SAFE 0.95 0.95 0.95
Gemini 0.89 0.89 0.89
Table 3: Results of semantic classification using embeddings computed with the SAFE model and with Gemini. The classifier is an SVM with rbf kernel.

The results of our semantic classification tests are reported in Table 3. First and foremost, we have a strong confirmation that it is indeed possible to classify the semantics of the algorithms using function embeddings. The use of an SVM classifier on the embedding vector space leads to good performance. There is limited variability of performance between different classes. The classes on which SAFE performs better are SM and M. We speculate that the moderate simplicity of the algorithms belonging to these classes creates limited variability among the binaries. The M class is also one of the classes where the Gemini embeddings perform better; this is probably due to the fact that one of the manual features used by Gemini is the number of arithmetic assembly instructions inside a code block of the CFG. By analyzing the output of the classifier we found that the most common error, a mistake common to both Gemini and SAFE, is the confusion between encryption and sorting algorithms. A possible explanation for this behaviour is that simple encryption algorithms, such as RC5, share many similarities with sorting algorithms (e.g., nested loops over an array).

Finally, we can see that, in all cases, the embeddings computed with our architecture outperform the ones computed with Gemini; the improvement ranges between and . The average improvement, weighted on the cardinality of each class, is around .

Qualitative Analysis of the Embeddings We performed a qualitative analysis of the embeddings produced with SAFE. Our aim is to understand how the network captures the information on the inner semantics of the binary functions, and how it represents such information in the vector space.

Figure 7: 2-dimensional visualization of the embedding vectors for all binary functions in Semantic Dataset. The four different categories of algorithms (Encryption, Sorting, Math and String Manipulation) are represented with different symbols and colors.

To this end we computed the embeddings for all functions in the Semantic Dataset. In Figure 7 we report the two-dimensional projection of the vector space where the binary function embeddings lie, obtained using the t-SNE visualisation technique12 [21]. From Figure 7 it is possible to observe a quite clear separation between the different classes of algorithms considered. We believe this behaviour is really interesting and it further confirms our quantitative experiments on semantic classification.
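Figure 7 was produced with the TensorBoard implementation of t-SNE; a comparable two-dimensional projection can be obtained with scikit-learn as sketched below, using placeholder data in place of the actual embeddings.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Placeholder data standing in for the SAFE embeddings of the Semantic Dataset.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 100))
y = rng.choice(["E", "S", "SM", "M"], size=400)

X_2d = TSNE(n_components=2).fit_transform(X)   # project embeddings to 2 dimensions
for cls in np.unique(y):
    mask = y == cls
    plt.scatter(X_2d[mask, 0], X_2d[mask, 1], label=cls, s=5)
plt.legend()
plt.show()
```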

Real use case of Task 4 - Detecting encryption functions in Windows malware We decided to test the semantic classification on a real use case scenario. We trained a new SVM classifier using the Semantic Dataset with only two classes, encryption and non-encryption. We then used this classifier on malware. We analyzed two samples of Windows malware found in well-known malware repositories: the TeslaCrypt and Vipasana ransomwares. We disassembled the malware with radare2, including in the caller the code of the callee functions up to depth 2. We processed the disassembled functions with our classifier, and we selected only the functions that are flagged as encryption with a probability score above the chosen threshold. Finally, we manually analyzed the malware samples to assess the quality of the selected functions.

  • TeslaCrypt 13: on a total of 658 functions, the classifier flags the ones at addresses 0x41e900, 0x420ec0, 0x4210a0, 0x4212c0, 0x421665, 0x421900, 0x4219c0. We confirmed that these are either encryption (or decryption) functions or helper functions directly called by the main encryption procedures.

  • Vipasana 14: on a total of 1254 functions, the classifier flags the ones at addresses 0x406da0, 0x414a58, 0x415240. We confirmed that two of these are either encryption (or decryption) functions or helper functions directly called by the main encryption procedures. The false positive is 0x406da0.

As a final remark, we want to stress that these malware samples are 32-bit Windows binaries, while we trained our entire system on ELF executables for AMD64. This shows that our model is able to generate good embeddings also for cases that are largely different from the ones seen during training.

7 Speed considerations.

As reported in the introduction, one of the advantages of SAFE is that it ditches the use of CFGs. From our tests with radare2, disassembling a function is several times faster than computing its CFG. Once functions are disassembled, an Nvidia K80 running our model computes their embeddings in around a second.

More precisely, we ran our tests on a virtual machine hosted on the Google Cloud Platform. The machine has 8 Intel Sandy Bridge cores, 30 GB of RAM, an Nvidia K80 and an SSD hard-drive. We disassembled all object files in postgres 9.6 compiled with gcc-6 for all optimization levels. During the disassembling we assume to know the starting address of each function; see [26] for a paper using neural networks to find functions in a binary.

The time needed to disassemble and pre-process 3432 binaries is 235 seconds; the time needed to compute the embeddings of the resulting 32592 functions is seconds. The end-to-end time to compute embeddings for all functions in postgres starting from binary files is less than 5 minutes. We repeated the same test with openssl 1.1.1 compiled with gcc-5 for all optimization levels. The end-to-end time to compute the embeddings for all functions in openssl is less than 4 minutes.

Gemini is up to 10 times slower: it needs 43 minutes for postgres and 26 minutes for openssl.

8 Conclusions and future works

In this paper we introduced SAFE, an architecture for computing function embeddings in the cross-platform case that does not use debug symbols. Our architecture does not need the CFG, and this leads to a considerable speed advantage. SAFE creates thousands of embeddings per second on a mid-range COTS GPU. Even when we factor in the disassembling time, our end-to-end system (from binary file to function embedding) processes more than 100 functions per second. This considerable speed comes with a significant increase in predictive performance with respect to the state of the art. Summing up, SAFE is both faster and more precise than previous solutions.

Finally, we think that our experiments on semantic detection are really interesting, and they pave the way to more complex and refined analyses, with the final purpose of building binary classifiers that rival the classifiers available today for image recognition.

Future Works There are several immediate lines of improvement that we plan to investigate in the near future. The first one is to retrain our i2v model to make use of libc call symbols. This will allow us to quantify the impact of such information on embedding quality. We believe that symbols could lead to a further increase in performance, at the cost of assuming more information and the integrity of the binary that we are analyzing.

The use of libc symbols would enable a more fine-grained semantic classification: for example, we could be able to distinguish a function that is sending encrypted content on a socket from a function that is writing encrypted content to a file. Summarizing, the field of applied machine learning for binary analysis is still in its infancy and there are several opportunities for future work.

Acknowledgments. The authors would like to thank Google for providing free access to its cloud computing platform through the Education Program. Moreover, the authors would like to thank NVIDIA for partially supporting this work through the donation of a GPGPU card used during prototype development. Finally, the authors would like to thank Davide Italiano for the insightful discussions. This work is supported by a grant of the Italian Presidency of the Council of Ministers and by the CINI (Consorzio Interuniversitario Nazionale Informatica) National Laboratory of Cyber Security.

Footnotes

  1. https://www.statista.com/statistics/266210/number-of-available-applications-in-the-google-play-store/
  2. https://www.cvedetails.com/browse-by-date.php
  3. Tests conducted using the Radare2 disassembler [25].
  4. Conversely, recognizing library functions in stripped statically linked binaries is an application of the binary similarity problem without symbolic calls.
  5. Classic RNNs do not cope well with really long sequences.
  6. We build our system to be compatible with two opensource leader disassemblers.
  7. Note that gcc-3.4 has been released more than 10 years before gcc-5.4.
  8. Gemini has not been distributed publicly. We implemented it using the information contained in [29]. For Gemini the parameters are: function embeddings of dimension , number of rounds , and a number of layers . These parameters are the ones that give the better performance for Gemini, according to our experiments and the one in the original Gemini paper.
  9. 48 = 12 compilers × 4 optimization levels.
  10. cve-2014-0160, cve-2014-6271, cve-2015-3456, cve-2014-9295, cve-2014-7169, cve-2011-0444, cve-2014-4877, cve-2015-6862.
  11. Some vulnerable functions are lost during the disassembling process
  12. We used the TensorBoard implementation of t-SNE
  13. The sample is available at url https://github.com/ytisf/theZoo/tree/master/malwares/Binaries/Ransomware.TeslaCrypt. The variant analyzed is the one with hash 3372c1eda…
  14. The sample is available at url https://github.com/ytisf/theZoo/tree/master/malwares/Binaries/Ransomware.Vipasana. The variant analyzed is the one with hash 0442cfabb…4b6ab

References

  1. M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving and M. Isard (2016) Tensorflow: a system for large-scale machine learning.. In Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation, (OSDI), pp. 265–283. Cited by: §5.1.
  2. A. Al-Maskari, M. Sanderson and P. Clough (2007) The relationship between ir effectiveness measures and user satisfaction. In Proceedings of the 30th International ACM Conference on Research and Development in Information Retrieval, (SIGIR), pp. 773–774. Cited by: §6.2.2.
  3. S. Alrabaee, P. Shirani, L. Wang and M. Debbabi (2015) SIGMA: a semantic integrated graph matching approach for identifying reused functions in binary code. Digital Investigation 12, pp. S61 – S71. Cited by: §1.
  4. R. Baldoni, G. A. Di Luna, L. Massarelli, F. Petroni and L. Querzoni (2018) Unsupervised features extraction for binary similarity using graph embedding neural networks. In arXiv preprint arXiv:1810.09683, Cited by: §2.2.
  5. J. Bromley, I. Guyon, Y. LeCun, E. Säckinger and R. Shah (1994) Signature verification using a “siamese” time delay neural network. In Proceedings of the 6th International Conference on Neural Information Processing Systems, (NIPS), pp. 737–744. Cited by: §4.2.
  6. K. Cho, B. van Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk and Y. Bengio (2014) Learning phrase representations using rnn encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, (EMNLP), Cited by: §5.1.
  7. Z. L. Chua, S. Shen, P. Saxena and Z. Liang (2017) Neural nets can learn function type signatures from binaries. In Proceedings of 26th USENIX Security Symposium, (USENIX Security), pp. 99–116. Cited by: §4.1.
  8. H. Dai, B. Dai and L. Song (2016) Discriminative embeddings of latent variable models for structured data. In Proceedings of the 33rd International Conference on Machine Learning, (ICML), pp. 2702–2711. Cited by: §2.2.
  9. Y. David, N. Partush and E. Yahav (2016) Statistical similarity of binaries. In Proceedings of the 37th ACM SIGPLAN Conference on Programming Language Design and Implementation, (PLDI), pp. 266–280. Cited by: §1, §2.1, §2.1, §6.3.1.
  10. Y. David, N. Partush and E. Yahav (2017) Similarity of binaries through re-optimization. In ACM SIGPLAN Notices, Vol. 52, pp. 79–94. Cited by: §1, §2.1.
  11. Y. David and E. Yahav (2014) Tracelet-based code search in executables. In Proceedings of the 35th ACM SIGPLAN Conference on Programming Language Design and Implementation, (PLDI), pp. 349–360. Cited by: §2.1.
  12. S. H. Ding, B. C. Fung and P. Charland (2019) Asm2Vec: boosting static representation robustness for binary clone search against code obfuscation and compiler optimization. In (to Appear) Proceedings of 40th Symposium on Security and Privacy, (SP), Cited by: 2nd item, 3rd item, §2.2.
  13. T. Dullien, R. Rolles and R. Bochum (2005) Graph-based comparison of executable objects. In Proceedings of Symposium sur la sécurité des technologies de l’information et des communications, (STICC), Cited by: §1, §2.1.
  14. M. Egele, M. Woo, P. Chapman and D. Brumley (2014) Blanket execution: dynamic similarity testing for program binaries and components. In Proceedings of 23rd USENIX Security Symposium, (USENIX Security), pp. 303–317. Cited by: §1, §2.1.
  15. Q. Feng, M. Wang, M. Zhang, R. Zhou, A. Henderson and H. Yin (2017) Extracting conditional formulas for cross-platform bug search. In Proceedings of the 12th ACM on Asia Conference on Computer and Communications Security, (ASIA CCS), pp. 346–359. Cited by: §2.1.
  16. Q. Feng, R. Zhou, C. Xu, Y. Cheng, B. Testa and H. Yin (2016) Scalable graph-based bug search for firmware images. In Proceedings of the 23rd ACM SIGSAC Conference on Computer and Communications Security, (CCS), pp. 480–491. Cited by: §1, §1, §2.2.
  17. J. L. Herlocker, J. A. Konstan, L. G. Terveen and J. T. Riedl (2004) Evaluating collaborative filtering recommender systems. ACM Transactions on Information Systems 22 (1), pp. 5–53. Cited by: §6.1.1.
  18. W. M. Khoo, A. Mycroft and R. Anderson (2013) Rendezvous: a search engine for binary code. In Proceedings of the 10th Working Conference on Mining Software Repositories, (MSR), pp. 329–338. Cited by: §1, §2.1.
  19. Q. V. Le and T. Mikolov (2014) Distributed representations of sentences and documents. In Proceedings of the 31th International Conference on Machine Learning, (ICML), pp. 1188–1196. Cited by: §2.2.
  20. Z. Lin, M. Feng, C. Nogueira dos Santos, M. Yu, B. Xiang, B. Zhou and Y. Bengio (2017) A structured self-attentive sentence embedding. Arxiv: arXiv:1703.03130. Cited by: §1, §3.1, §3.1, §4.2, §4.2.
  21. L. v. d. Maaten and G. Hinton (2008) Visualizing data using t-sne. Journal of machine learning research 9 (Nov), pp. 2579–2605. Cited by: §6.4.3.
  22. T. Mikolov, I. Sutskever, K. Chen, G. Corrado and J. Dean (2013) Distributed representations of words and phrases and their compositionality. In Proceedings of the 26th International Conference on Neural Information Processing Systems, (NIPS), pp. 3111–3119. Cited by: §3.1, §4.1.
  23. J. Pewny, B. Garmany, R. Gawlik, C. Rossow and T. Holz (2015) Cross-architecture bug search in binary executables. In Proceedings of the 34th IEEE Symposium on Security and Privacy, (SP), pp. 709–724. Cited by: §2.1.
  24. J. Pewny, F. Schuster, L. Bernhard, T. Holz and C. Rossow (2014) Leveraging semantic signatures for bug search in binary programs. In Proceedings of the 30th Annual Computer Security Applications Conference, (ACSAC), pp. 406–415. Cited by: §2.1.
  25. Radare2 Team (2017) Radare2 disassembler repository. https://github.com/radare/radare2. Cited by: §5.1, footnote 3.
  26. E. C. R. Shin, D. Song and R. Moazzezi (2015) Recognizing functions in binaries with neural networks. In Proceedings of the 24th USENIX Conference on Security Symposium, (USENIX Security), pp. 611–626. Cited by: §7.
  27. Y. Shoshitaishvili, C. Kruegel, G. Vigna, R. Wang, C. Salls, N. Stephens, M. Polino, A. Dutcher, J. Grosen and S. Feng (2016) Sok:(state of) the art of war: offensive techniques in binary analysis. In Proceedings of the 37th IEEE Symposium on Security and Privacy, (SP), pp. 138–157. Cited by: §5.1.
  28. TensorFlow Authors (last accessed 11/2018) Word2vec skip-gram implementation in TensorFlow. https://www.tensorflow.org/tutorials/representation/word2vec. Cited by: §5.1.1.
  29. X. Xu, C. Liu, Q. Feng, H. Yin, L. Song and D. Song (2017) Neural network-based graph embedding for cross-platform binary code similarity detection. In Proceedings of the 24th ACM SIGSAC Conference on Computer and Communications Security, (CCS), pp. 363–376. Cited by: 1st item, 2nd item, §1, §2.2, 2nd item, §5.2.2, §5.2.2, §5.2, 1st item, §6.1.1, footnote 8.
  30. F. Zuo, X. Li, Z. Zhang, P. Young, L. Luo and Q. Zeng (2018) Neural machine translation inspired binary code similarity comparison beyond function pairs. arXiv preprint arXiv:1808.04706. Cited by: §2.2.