SAFE: Self-Attentive Function Embeddings for Binary Similarity
Abstract
The binary similarity problem consists of determining whether two functions are similar by considering only their compiled form. Advanced techniques for binary similarity recently gained momentum as they can be applied in several fields, such as copyright disputes, malware analysis, vulnerability detection, etc., and thus have an immediate practical impact. Current solutions compare functions by first transforming their binary code into multidimensional vector representations (embeddings), and then comparing vectors through simple and efficient geometric operations. However, embeddings are usually derived from binary code using manual feature extraction, which may fail to capture important function characteristics, or may include features that are not relevant to the binary similarity problem. In this paper we propose SAFE, a novel architecture for the embedding of functions based on a self-attentive neural network. SAFE works directly on disassembled binary functions, does not require manual feature extraction, is computationally more efficient than existing solutions (i.e., it does not incur the computational overhead of building or manipulating control flow graphs), and is more general as it works on stripped binaries and on multiple architectures. We report the results from a quantitative and qualitative analysis that show how SAFE provides a noticeable performance improvement with respect to previous solutions. Furthermore, we show how clusters of our embedding vectors are closely related to the semantics of the implemented algorithms, paving the way for further interesting applications (e.g., semantic-based binary function search).
1 Introduction
In recent years there has been an exponential increase in the creation of new content. Software, like all products, follows this trend. As an example, the number of apps available on the Google Play Store increased from 30K in 2010 to 3 million in 2018.
This multidimensional increase in quantity, complexity and diffusion of software makes the resulting infrastructures difficult to manage and control, as part of their internals are often inaccessible for inspection by their own administrators. As a consequence, system integrators are looking for novel solutions that take such issues into account and provide functionalities to automatically analyze software artifacts in their compiled form (binary code). One prototypical problem in this regard is that of binary similarity [13, 18, 3], where the goal is to find similar functions in compiled code fragments.
Binary similarity has recently received a lot of attention [9, 10, 14]. This is due to its centrality in several tasks, such as discovery of known vulnerabilities in large collections of software, disputes on copyright matters, analysis and detection of malicious software, etc.
In this paper, in accordance with [16] and [29], we focus on a specific version of the binary similarity problem in which we define two binary functions to be similar if they are compiled from the same source code. As already pointed out in [29], this assumption does not make the problem trivial.
Inspired by [16], we look for solutions that solve the binary similarity problem using embeddings. Loosely speaking, each binary function is first transformed into a vector of numbers (an embedding), in such a way that code compiled from the same source results in vectors that are similar.
This idea has several advantages. First of all, once the embeddings are computed, checking the similarity is relatively cheap and fast (we consider the scalar product of two constant-size vectors as a constant-time operation). Thus we can precompute a large collection of embeddings for interesting functions and check against such a collection in linear time. In the light of the many use cases, this characteristic is extremely useful. Another advantage comes from the fact that such embeddings can be used as input to other machine learning algorithms, which can in turn cluster functions, classify them, etc.
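To make the advantage concrete, the following sketch shows how a database of precomputed embeddings can be queried with a linear scan of constant-time cosine comparisons. The function names and vectors are toy placeholders, not data from the paper:

```python
import numpy as np

def cosine_similarity(a, b):
    # Constant-time comparison of two fixed-size embedding vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def search(query, database):
    # Linear scan over precomputed embeddings; no binary analysis is
    # repeated at query time.
    scores = [(name, cosine_similarity(query, vec)) for name, vec in database.items()]
    return sorted(scores, key=lambda s: -s[1])

# Toy database of precomputed embeddings (illustrative values only).
db = {
    "memcpy_gcc_O2": np.array([0.9, 0.1, 0.0]),
    "aes_encrypt":   np.array([0.1, 0.8, 0.6]),
}
ranked = search(np.array([1.0, 0.0, 0.1]), db)  # most similar first
```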
Current solutions that adopt this approach, still come with several shortcomings:

they [29] use manually selected features to calculate the embeddings, introducing potential bias in the resulting vectors. Such bias stems from the possibility of overlooking important features (that do not get selected), or of including features that are expensive to process while not providing noticeable performance improvements; for example, including features extracted from the function's control flow graph (CFG) imposes a speed penalty with respect to features extracted from the disassembled code;
they [12] assume that call symbols to dynamically linked libraries (such as libc, msvc, etc.) are available in binary functions, while this is not true for binaries that are stripped and statically linked, or for partial fragments of binaries (e.g., extracted during volatile memory forensic analyses);
they work only on specific CPU architectures [12].
Considering these shortcomings, in this paper we introduce SAFE: Self-Attentive Function Embeddings, a solution we designed to overcome all of them. In particular, we considered two specific goals:

design a solution able to quickly generate embeddings for several hundred binaries;

design a solution that is applicable in the vast majority of cases, i.e., able to work with stripped binaries with statically linked libraries, and on multiple architectures (in particular, we consider AMD64 and ARM as target platforms for our study).
The core of SAFE is based on recent advancements in the area of natural language processing. Specifically, we designed SAFE around a Self-Attentive Neural Network recently proposed in [20]. The idea is to directly consider the sequence of instructions in the binary function and to model it as natural language. A GRU Recurrent Neural Network (GRU RNN) is then used to capture the sequential interaction of the instructions. In addition, an attention mechanism allows the final function embedding to consider all the GRU hidden states, and to automatically focus on (i.e., give more weight to) the portion of the binary code that helps the most in accomplishing the training objective, i.e., recognizing whether two binary functions are similar.
We also investigate the possibility of semantically classifying binary functions (i.e., identifying their general semantic behavior) by clustering similar embeddings. To the best of our knowledge, we are the first to investigate the feasibility of this task through machine learning tools, and to perform a quantitative analysis on this subject. The results are encouraging, showing 95% classification accuracy for 4 different broad classes of algorithms (namely Encryption, Sorting, Mathematical and String Manipulation functions). Finally, we also applied our semantic classifier to known malware, and we were able to accurately recognize functions implementing encryption algorithms.
1.1 Contribution
The main contributions of our work are:

we describe SAFE, a general architecture for calculating binary function embeddings starting from disassembled binaries;

we extensively evaluate SAFE, showing that it provides better performance than previous state-of-the-art systems with similar requirements. Specifically, we compare it with the recent Gemini [29], showing a performance improvement on several metrics that varies depending on the task at hand;

we apply SAFE to the problem of identifying vulnerable functions in binary code, a common application task for binary similarity solutions; also in this task SAFE provides better performance than state-of-the-art solutions;

we show that embeddings produced by SAFE can be used to automatically classify binary functions into semantic classes. On a dataset of thousands of functions, we can recognize whether a function implements an encryption algorithm, a sorting algorithm, generic math operations, or a string manipulation, with high accuracy;

we apply SAFE to the analysis of known malware, to identify encryption functions. Interestingly, we achieve good performance: among 10 functions flagged by SAFE as Encryption, only one was a false positive.
The remainder of this paper is organized as follows. Section 2 discusses related work, followed by Section 3, where we define the problem and give an overview of the solution we tested. In Section 4 we describe SAFE in detail, and in Section 5 we provide implementation details and information on the training. In Section 6 we describe the experiments we performed and report their results. Finally, in Section 7 we discuss the speed of SAFE.
2 Related Work
We can broadly divide the binary similarity literature in works that propose embeddingbased solutions, and works that do not.
2.1 Works not based on embeddings
Single Platform solutions — Regarding the binary similarity literature for a single platform, a family of works is based on matching algorithms for function CFGs. In Bindiff [13], matching among vertices is based on the syntax of code, and it is known to perform poorly across different compilers (see [9]). Pewny et al. [24] proposed a solution where each vertex of a CFG is represented with an expression tree; similarity among vertices is computed by using the edit distance between the corresponding expression trees.
Other works use different solutions that do not rely on graph matching. David and Yahav [11] proposed to represent a function as several independent execution traces, called tracelets; similar tracelets are then matched by using a custom edit distance. A related concept is used by David et al. in [9], where functions are divided into pieces of independent code, called strands. The matching between functions is based on how many statistically significant strands are similar; intuitively, a strand is significant if it is not statistically common. Strand similarity is computed using an SMT solver to assess semantic similarity. Note that all previous solutions are designed around matching procedures that work pair-to-pair, and they cannot be adapted to precompute a constant-size signature of a binary function on which similarity can be assessed.
Egele et al. in [14] proposed a solution where each function is executed multiple times in a random environment. During the executions, some features are collected and then used to match similar functions. This solution can be used to compute a signature for each function. However, it needs to execute a function multiple times, which is both time consuming and difficult to perform in the cross-platform scenario. Furthermore, it is not clear whether the features identified in [14] are useful for cross-platform comparison. Finally, Khoo et al. [18] proposed a matching approach based on n-grams computed on instruction mnemonics and graphlets. Even if this strategy does produce a signature, it cannot be immediately extended to cross-platform similarity.
Table 1: Summary of the notation used in the paper.
source code
compiler
binary function compiled from a source code
embedding vector of a function
list of instructions in a function
number of instructions in a function
instruction in a function
embedding vector of an instruction
list of instruction embeddings in a function
i2v: instruction embedding model (instruction2vector)
component of a vector
hidden state of the bidirectional Recurrent Neural Network (RNN)
dimension of a vector
dimension of the RNN state
hyperbolic tangent function
softmax function
ReLU: rectified linear unit function
first weight matrix of the attention mechanism
second weight matrix of the attention mechanism
first weight matrix of the output layer of the Self-Attentive network
second weight matrix of the output layer of the Self-Attentive network
embedding matrix
attention depth (parameter of the function embedding network)
number of attention hops (parameter of the function embedding network)
ground truth label associated with an input pair of functions
network hyperparameters
network objective function
number of results of a function search query
Cross-Platform solutions — Pewny et al. [23] proposed a graph-based methodology, i.e., a matching algorithm on the CFGs of functions. The idea is to transform the binary code into an intermediate representation; on such representation, the semantics of each CFG vertex is computed by sampling code executions with random inputs. Feng et al. [15] proposed a solution where each function is expressed as a set of conditional formulas; integer programming is then used to compute the maximum matching between formulas. Note that both [23] and [15] allow pair-to-pair checks only.
David et al. [10] proposed to transform binary code into an intermediate representation. Functions are then partitioned into slices of independent code, called strands. An involved process guarantees that strands with the same semantics have similar representations. Functions are deemed to be similar if enough significant strands match. Note that this solution does generate a signature, as a collection of hashed strands. However, it has two drawbacks: first, the signature is not constant-size, but depends on the number of strands contained in the function; second, it is not immediate to transform such signatures into embeddings that can be directly fed to other machine learning algorithms.
2.2 Based on embeddings
The works most related to ours are those that propose embeddings for binary similarity, and specifically the ones that target cross-platform scenarios.
Single-Platform solutions — Recently, [12] proposed a function embedding solution called Asm2Vec. This solution is based on the PV-DM model [19] for natural language processing. Operatively, Asm2Vec computes the CFG of a function, and then performs a series of random walks on top of it. Asm2Vec outperforms several state-of-the-art solutions in the field of binary similarity. Despite being a really promising solution, Asm2Vec does not fulfill all the design goals of our system: firstly, it requires libc call symbols to be present in the binary code as tokens to produce the embedding of a function; secondly, it is only suitable for single-platform embeddings.
Cross-Platform solutions — Feng et al. [16] introduced a solution that uses a clustering algorithm over a set of functions to obtain centroids for each cluster. They then used these centroids and a configurable feature encoding mechanism to associate a numerical vector representation with each function. Xu et al. [29] proposed an architecture called Gemini, where function embeddings are computed using a deep neural network. Interestingly, [29] shows that Gemini outperforms [16] both in terms of accuracy and performance (measured as the time required to train the model). In Gemini, the CFG of a function is first transformed into an annotated CFG, a graph containing manually selected features, and then embedded into a vector using the graph embedding model of [8]. The manual features used by Gemini do not need call symbols. To the best of our knowledge, Gemini is the state-of-the-art solution for cross-platform embeddings based on deep neural networks that works on cross-platform code without call symbols. In an unpublished technical report [4] we proposed a variation of Gemini where manual features are replaced with an unsupervised feature learning mechanism. This single change led to a performance improvement over the baseline represented by Gemini.
Finally, in [30] the authors propose the use of a recurrent neural network based on LSTMs (Long Short-Term Memory) to solve a subtask of binary similarity, namely finding similar CFG blocks.
3 Problem Definition and Solution Overview
For clarity of exposition we summarize all the notation used in this paper in Table 1. Let us first define the similarity problem.
We say that two binary functions are similar if they are the result of compiling the same original source code with different compilers.
Essentially, a compiler is a deterministic transformation that maps a source code into a corresponding binary function. In this paper we consider as a compiler the specific software, e.g., gcc-5.4.0, together with the parameters that influence the compiling process, e.g., the optimization flags.
We indicate with a list of assembly instructions the content of a function. Our aim is to represent the function as a vector of real numbers. This is achieved with an embedding model that maps the function to an embedding vector, preserving structural similarity relations between binary functions.
Function Semantics. Loosely speaking, a function can be seen as the implementation of an algorithm. We can partition algorithms into classes, where each class is a group of algorithms solving related problems. In this paper we focus on four classes: E (Encryption), S (Sorting), SM (String Manipulation), and M (Mathematical). A function belongs to class E if it implements an encryption algorithm (e.g., AES, DES); to class S if it implements a sorting algorithm (e.g., bubblesort, mergesort); to class SM if it implements an algorithm that manipulates a string (e.g., string reverse, string copy); to class M if it implements math operations (e.g., computing a Bessel function). We say that a classifier recognizes the semantics of a function if it is able to guess the class to which the function belongs.
3.1 SAFE Overview.
We use an embedding model structured in two phases: in the first phase, the Assembly Instructions Embedding component transforms a sequence of assembly instructions into a sequence of vectors; in the second phase, a Self-Attentive Neural Network transforms the sequence of vectors into a single embedding vector. See Figure 1 for a schematic representation of the overall architecture of our embedding network.
Assembly Instructions Embedding (i2v)
In the first phase of our strategy we map each instruction to a vector of real numbers using the word2vec model [22]. Word2vec is an extremely popular feature learning technique in natural language processing. We use a large corpus of instructions to train our instruction embedding model (see Section 5.1); we call our mapping instruction2vec (i2v). The final outcome of this step is a sequence of instruction embedding vectors.
Self-Attentive Network
For our Self-Attentive Network we use the network recently proposed in [20]. In this Self-Attentive Network, a bidirectional recurrent neural network is fed with the sequence of assembly vectors. Intuitively, for each instruction vector the RNN computes a summary vector, taking into account the instruction itself and its context in the function. The final embedding of the function is a weighted sum of all summary vectors. The weights of such summation are computed by a two-layer fully-connected neural network.
We selected the Self-Attentive Network for two reasons. First, it shows state-of-the-art performance on natural language processing tasks [20]. Secondly, it suffers less from the long-memory problem.
4 Details of the SAFE Function Embedding Network
We denote the entire embedding network as SAFE: Self-Attentive Function Embeddings.
4.1 Assembly Instructions Embedding (i2v)
The first step of our solution consists of associating an embedding vector to each instruction of a function. We achieve this by training the embedding model i2v using the skip-gram method [22]. The idea of skip-gram is to use the current instruction to predict the instructions around it. A similar approach has been used also in [7].
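The skip-gram objective can be illustrated by the way training pairs are generated from a function's instruction stream: each instruction predicts its neighbours within a context window. A minimal sketch (window size and tokens are illustrative, not SAFE's actual training pipeline):

```python
def skipgram_pairs(tokens, window=2):
    # For each instruction, emit (current, context) training pairs for
    # every neighbour within `window` positions, as in skip-gram.
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

# Toy instruction stream (tokens already filtered as described in Section 4.1).
toks = ["push RBP", "mov RBP,RSP", "xor EAX,EAX", "pop RBP", "ret"]
pairs = skipgram_pairs(toks, window=1)
```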
We train the i2v model using assembly instructions as tokens (i.e., a single token includes both the instruction mnemonic and the operands). We do not use the raw instruction but filter it as follows. We examine the operands and replace all base memory addresses with the special symbol MEM and all immediates whose absolute value is above a certain threshold (see Section 5.1) with the special symbol IMM. We do this filtering because we believe that using raw operands is of little benefit; for instance, the displacement given by a jump is useless (instructions do not carry their memory addresses with them), and, on the contrary, it may decrease the quality of the embedding by artificially inflating the number of distinct instructions. As an example, an instruction moving a large immediate into a register becomes mov EAX,IMM, an instruction loading from a memory address becomes mov EAX,MEM, while the instruction mov EAX,EBP is not modified. Intuitively, the last instruction accesses a register, unlike mov EAX,[EBP], which accesses a stack variable, and this distinction remains intact with our filtering.
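A possible implementation of this filtering is sketched below; the regular expressions and the threshold value are illustrative assumptions, not SAFE's exact rules:

```python
import re

# Hypothetical cutoff for "large" immediates (illustrative value only).
IMM_THRESHOLD = 5000

def normalize(instr):
    # Split "mnemonic op1,op2" and rewrite each operand.
    mnemonic, _, ops = instr.partition(" ")
    if not ops:
        return mnemonic
    out = []
    for op in (o.strip() for o in ops.split(",")):
        if re.fullmatch(r"\[.*\]", op):
            out.append("MEM")                       # memory reference
        elif re.fullmatch(r"-?(0x[0-9a-fA-F]+|\d+)", op):
            value = int(op, 16) if op.lstrip("-").startswith("0x") else int(op)
            out.append("IMM" if abs(value) > IMM_THRESHOLD else op)
        else:
            out.append(op)                          # register etc.: unchanged
    return mnemonic + " " + ",".join(out)
```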
4.2 Self-Attentive Network
We based our Self-Attentive Network on the one proposed by [20]. The overall structure is detailed in Figure 2. We compute the embedding of a function by using its sequence of instruction vectors. These vectors are fed into a bidirectional recurrent neural network, obtaining for each vector a summary vector of size $u$:

$$h_i = \overrightarrow{RNN}(\overrightarrow{h}_{i-1}, \vec{e}_i) \;\|\; \overleftarrow{RNN}(\overleftarrow{h}_{i+1}, \vec{e}_i)$$

where $\|$ is the concatenation operand, $\overrightarrow{RNN}$ (resp., $\overleftarrow{RNN}$) is the forward (resp., backward) RNN cell, and $\overrightarrow{h}_i$ and $\overleftarrow{h}_i$ are the forward and backward states of the RNN (we set the initial states $\overrightarrow{h}_0$ and $\overleftarrow{h}_{m+1}$ to zero). The state of each RNN cell has size $u/2$.
From these summary vectors we obtain a matrix $H \in \mathbb{R}^{m \times u}$ that has the summary vectors as rows. An attention matrix $A$ of size $r \times m$ is computed using a two-layer neural network:

$$A = \mathrm{softmax}\left(W_{s2} \tanh\left(W_{s1} H^{\top}\right)\right)$$

where $W_{s1}$ is a weight matrix of size $d_a \times u$ and the parameter $d_a$ is the attention depth of our model. The matrix $W_{s2}$ is a weight matrix of size $r \times d_a$ and the parameter $r$ is the number of attention hops of our model.
Intuitively, when $r = 1$, $A$ collapses into a single attention vector, where each value is the weight of a specific summary vector. When $r > 1$, $A$ becomes a matrix and each row is an independent attention hop. Loosely speaking, each hop weighs a different aspect of the binary function.
The embedding matrix of our sequence is:

$$M = A H$$

and it has fixed size $r \times u$. In order to transform the embedding matrix into a vector, we flatten the matrix and feed the flattening into a two-layer fully connected neural network with ReLU activation function:

$$\vec{f} = W_2 \,\mathrm{ReLU}\left(W_1 \cdot \mathrm{flatten}(M)\right)$$

where $W_1$ is a weight matrix of size $e \times (r \cdot u)$ and $W_2$ a weight matrix of size $n \times e$, with $e$ the size of the hidden layer and $n$ the size of the final function embedding.
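The attention computation described above can be checked with a small numpy sketch; the sizes are toy values and the weights are random placeholders, not trained parameters:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

m, u = 6, 8        # m instructions, summary vectors of size u
d_a, r = 4, 3      # attention depth and number of attention hops
rng = np.random.default_rng(0)

H = rng.normal(size=(m, u))       # rows: bi-RNN summary vectors
W_s1 = rng.normal(size=(d_a, u))  # first attention weight matrix
W_s2 = rng.normal(size=(r, d_a))  # second attention weight matrix

A = softmax(W_s2 @ np.tanh(W_s1 @ H.T), axis=1)  # r x m attention weights
M = A @ H                                        # r x u embedding matrix
```

Each of the r rows of A is a probability distribution over the m instructions, so each hop independently attends to different parts of the function.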
Learning Parameters Using a Siamese Architecture: we learn the network parameters using a pairwise approach, a technique also called siamese network in the literature [5]. The main idea is to join two identical function embedding networks with a similarity score (with identical we mean that the networks share the same parameters). The final output of the siamese architecture is the similarity score between the two input functions (see Figure 3).
In more detail, from a pair of input functions $\langle f_1, f_2 \rangle$, two vectors $\vec{f}_1$ and $\vec{f}_2$ are obtained by using the same function embedding network. These vectors are compared using cosine similarity as distance metric, with the following formula:

$$\mathrm{similarity}(\vec{f}_1, \vec{f}_2) = \frac{\sum_{i=1}^{n} \vec{f}_1[i] \cdot \vec{f}_2[i]}{\sqrt{\sum_{i=1}^{n} \vec{f}_1[i]^2} \cdot \sqrt{\sum_{i=1}^{n} \vec{f}_2[i]^2}} \quad (1)$$

where $\vec{f}[i]$ indicates the $i$-th component of the vector $\vec{f}$.
To train the network we require in input a set of function pairs $\langle f_1^{(i)}, f_2^{(i)} \rangle$ with ground truth labels $y_i \in \{+1, -1\}$, where $y_i = +1$ indicates that the two input functions are similar and $y_i = -1$ otherwise. Then, using the siamese network output, we define the following objective function:

$$J = \sum_{i} \left(\mathrm{similarity}(\vec{f}_1^{(i)}, \vec{f}_2^{(i)}) - y_i\right)^2 + \left\| A A^{\top} - I \right\|_F^2$$

The objective function is minimized by using, for instance, stochastic gradient descent. The term $\| A A^{\top} - I \|_F^2$ is introduced to penalize the choice of the same weights for each attention hop in matrix $A$ (see [20]).
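A toy numerical check of this objective for a single pair (using numpy; a minimal sketch, not the actual training code):

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def loss(f1, f2, y, A):
    # Squared error between the cosine similarity and the label y in
    # {+1, -1}, plus the Frobenius penalty on the attention matrix A.
    r = A.shape[0]
    penalty = np.linalg.norm(A @ A.T - np.eye(r)) ** 2
    return (cosine(f1, f2) - y) ** 2 + penalty

f = np.array([1.0, 0.0])
orthogonal_hops = np.eye(2)  # A A^T = I, so the penalty term vanishes
```

With orthogonal attention hops the penalty is zero, so the loss is zero exactly when the cosine similarity matches the label.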
5 Implementation Details and Training
In this section we first discuss the implementation details of our system, and then we explain how we trained our models.
5.1 Implementation Details and i2v setup
We developed a prototype implementation of SAFE using Python and the TensorFlow [1] framework. For static analysis of binaries we used the ANGR framework [27] and radare2 [25].
In our SAFE prototype the RNN cell is the GRU cell [6].
We decided to truncate the number of instructions inside each function to a fixed maximum value; this represents a good trade-off between training time and accuracy, since the great majority of functions in our datasets (more than 90%) is below this threshold.
i2v model
We trained two i2v models using the two training corpora described below: one model for the instruction set of ARM and one for AMD64. With this choice we tried to capture the different syntaxes and semantics of these two assembly languages. The model that we use for i2v (for both the AMD64 and ARM versions) is the skip-gram implementation of word2vec provided in [28]. We used an embedding size of 100; the remaining parameters are the window size and the minimum word frequency.
Datasets for training the i2v models. We collected the assembly code of a large number of functions, and we used it to build two training corpora, one for the i2v AMD64 model and one for the i2v ARM model. We built both corpora by disassembling several UNIX executables and libraries using IDA PRO. The libraries and the executables have been randomly sampled from repositories of Debian packages.
We avoided multiple inclusions of common functions and libraries by using a duplicate detection mechanism: we tested the uniqueness of a function by computing a hash of all its instructions, where instructions are filtered by replacing operands containing immediates and memory locations with a special symbol.
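The deduplication step can be sketched as follows; the masking regular expressions are illustrative, not the exact ones used in the paper:

```python
import hashlib
import re

def fingerprint(instructions):
    # Mask memory operands and immediates, then hash the whole listing:
    # two copies of the same function compiled into different binaries
    # (different addresses and constants) collapse to the same hash.
    masked = [re.sub(r"\[[^\]]*\]", "MEM", ins) for ins in instructions]
    masked = [re.sub(r"-?\b(0x[0-9a-fA-F]+|\d+)\b", "IMM", ins) for ins in masked]
    return hashlib.sha256("\n".join(masked).encode()).hexdigest()

a = ["mov EAX,[0x404050]", "add EAX,1", "ret"]
b = ["mov EAX,[0x404058]", "add EAX,2", "ret"]  # same logic, other constants
```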
From 2.52 GB of AMD64 binaries we obtained the assembly code of 547K unique functions. From 3.05 GB of ARM binaries we obtained the assembly code of 752K unique functions. Overall, both corpora contain millions of assembly code lines.
5.2 Training Single and Cross Platform Models
We trained SAFE models using the same methodology of Gemini, see [29]. We trained both a single-platform and a cross-platform model, which were then evaluated on several tasks (see Section 6 for the results).
Datasets

AMD64multipleCompilers Dataset. This dataset has been obtained by compiling the following libraries for AMD64: binutils-2.30, ccv-0.7, coreutils-8.29, curl-7.61.0, gsl-2.5, libhttpd-2.0, openmpi-3.1.1, openssl-1.1.1-pre8, valgrind-3.13.0. The compilation has been done using 3 different compilers (clang-3.9, gcc-5.4, gcc-3.4) and 4 optimization levels (i.e., O0-O3). The compiled object files have been disassembled with ANGR, obtaining a total of 452598 functions.
AMD64ARMOpenSSL Dataset. To align our experimental evaluation with state-of-the-art studies, we built the AMD64ARMOpenSSL Dataset in the same way as the one used in [29]. In particular, the AMD64ARMOpenSSL Dataset consists of a set of 95535 functions generated from all the binaries included in two versions of OpenSSL (v1_0_1f and v1_0_1u) that have been compiled for AMD64 and ARM using gcc-5.4 with 4 optimization levels (i.e., O0-O3). The resulting object files have been disassembled using ANGR; we discarded all the functions that ANGR was not able to disassemble.
Training
We generate our training and test pairs as reported in [29]. The pairs can be of two kinds: similar pairs, obtained by pairing together two binary functions originated from the same source code, and dissimilar pairs, obtained by randomly pairing functions that do not derive from the same source code.
Specifically, for each function in our datasets we create two pairs: a similar pair, associated with training label +1, and a dissimilar pair, associated with training label -1; we thus obtain a total number of pairs that is twice the total number of functions.
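The pair generation can be sketched as follows; the function names are illustrative placeholders, not the paper's dataset code:

```python
import random

def make_pairs(functions, rng):
    # `functions` maps a source-level identity to its compiled variants.
    # For every entry we emit one similar pair (label +1) and one
    # dissimilar pair (label -1), as described above.
    pairs, names = [], list(functions)
    for name, variants in functions.items():
        f1, f2 = rng.sample(variants, 2)                  # same source code
        pairs.append((f1, f2, +1))
        other = rng.choice([n for n in names if n != name])
        pairs.append((f1, rng.choice(functions[other]), -1))
    return pairs

funcs = {
    "quicksort": ["qs_gcc_O0", "qs_gcc_O3", "qs_clang_O2"],
    "aes_ecb":   ["aes_gcc_O0", "aes_clang_O3"],
}
pairs = make_pairs(funcs, random.Random(0))
```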
The functions in AMD64multipleCompilers Dataset are partitioned in three sets: train, validation, and test (75%-15%-15%).
The functions in AMD64ARMOpenSSL Dataset are partitioned in two sets: train and test (80%-20%); in this case we do not need a validation set because in Task 1 (Section 6.1) we perform a cross-validation.
The test and validation pairs will be used to assess performances in Task 1, see Section 6.1.
As in [29], pairs are partitioned preventing two similar functions from ending up in different partitions (this avoids the network seeing, during training, functions similar to the ones on which it will be validated or tested).
We train our models for several epochs (an epoch represents a complete pass over the whole training set). In each epoch we regenerate the training pairs, that is, we create new similar and dissimilar pairs using the functions contained in the training split. We precompute the pairs used in each epoch, in such a way that each method is tested on the same data. Note that we do not regenerate the validation and test pairs.
6 Evaluation
We perform an extensive evaluation of SAFE, investigating its performance on several tasks:

Task 1 - Single Platform and Cross Platform Model Tests: we test our single-platform and cross-platform models following the same methodology of [29], achieving a clear performance improvement in both cases. We remark that in these tests our models behave almost perfectly, coming close to what a perfect model may achieve. This task is described in Section 6.1.

Task 2 - Function Search: in this task we are given a certain binary function and we have to search for similar functions in a large dataset created using several compilers (including compilers that were not used in the training phase). We achieve high precision on the first 15 results, and high recall within the first 50 results. Section 6.2 is devoted to Task 2.

Task 3 - Vulnerability Search: in this task we evaluate our system on a use-case scenario in which we search for vulnerable functions. Our tests on several vulnerabilities show high recall in the first 10 results. Task 3 is the focus of Section 6.3.

Task 4 - Semantic Classification: in this task we classify the semantics of binary functions using the embeddings built with SAFE. We reach high accuracy on our test dataset. Moreover, we test our classifier on real-world malware, showing that we can identify encryption functions. Task 4 is explained in Section 6.4.
During our evaluation we compare SAFE with Gemini [29].
6.1 Task 1 - Single and Cross Platform Tests
In this task we evaluate the performance of SAFE using the same testing methodology of Gemini. We use the test split and the validation split computed as discussed in Section 5.2.1.
Test methodology
We perform two disjoint tests.

On AMD64multipleCompilers Dataset, we compute performance metrics on the validation set for all the epochs. Then, we use the model hyperparameters that led to the best performance on the validation set to compute a final performance score on the test set.

On AMD64ARMOpenSSL Dataset, we perform a 5-fold cross-validation: we partition the dataset in 5 sets; for each possible union of 4 partitions, we train the classifier on such union and then test it on the remaining partition. The reported results are the average of the 5 independent runs, one for each possible fold chosen as test set. This approach is more robust than a fixed train/validation/test split since it reduces the variability of the results.
Measures of Performances As in [29], we measure the performance using the Receiver Operating Characteristic (ROC) curve [17]. Following the best practices of the field, we measure the area under the ROC curve, or AUC (Area Under Curve). Loosely speaking, the higher the AUC value, the better the predictive performance of the algorithm.
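AUC can also be computed directly from its rank interpretation: the probability that a randomly chosen similar pair receives a higher score than a randomly chosen dissimilar pair. A minimal sketch (not the paper's evaluation code):

```python
def auc(scores, labels):
    # AUC = probability that a randomly drawn similar pair (label +1)
    # scores higher than a randomly drawn dissimilar pair (label -1);
    # ties count one half.
    pos = [s for s, y in zip(scores, labels) if y == +1]
    neg = [s for s, y in zip(scores, labels) if y == -1]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

perfect = auc([0.9, 0.8, 0.3, 0.1], [+1, +1, -1, -1])  # similars ranked first
```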
Experimental Results
AMD64multipleCompilers Dataset The results for the single platform case are shown in Figure 4(a). SAFE obtains a higher AUC than Gemini. Even if the improvement is 6.8%, it is worth noticing that SAFE provides performance close to the perfect case (0.99 AUC).
AMD64ARMOpenSSL Dataset
We compare ourselves with Gemini in the cross-platform case. The results are in Figure 4(b), showing the average ROC curves over the five runs of the 5-fold cross-validation. The Gemini results are reported with an orange dashed line, while we use a continuous blue line for our results. For both solutions we additionally highlighted the area between the ROC curves with minimum and maximum AUC over the five runs. The better prediction performance of SAFE is clearly visible: the average AUC of SAFE over the five runs is higher than the average AUC obtained by Gemini, with comparable standard deviations.
6.2 Task 2 - Function Search
In this task we evaluate the function search capability of the model trained on the AMD64multipleCompilers Dataset. We take a target function f, compute its embedding, and search for similar functions in the AMD64PostgreSQL Dataset (details of this dataset are given below). Given the target f, a search query returns r, the ordered list of the k nearest embeddings in the AMD64PostgreSQL Dataset.
Dataset
We built the AMD64PostgreSQL Dataset by compiling PostgreSQL 9.6.0 for AMD64 using 12 compilers: gcc3.4, gcc4.7, gcc4.8, gcc4.9, gcc5.4, gcc6, gcc7, clang3.8, clang3.9, clang4.0, clang5.0, clang6.0. For each compiler we used all 4 optimization levels. We took the object files (i.e., we did not create the executables by linking object files together) and disassembled them with radare2, obtaining a total of 581640 functions. For each function, the AMD64PostgreSQL Dataset contains an average of 33 similars, below the theoretical maximum of 48 (12 compilers times 4 optimization levels).
Measures of Performances
We compute the usual measures of precision (the fraction of similar functions in r over all functions in r) and recall (the fraction of similar functions in r over all similar functions in the dataset). Moreover, we also compute the normalised Discounted Cumulative Gain (nDCG) [2]:

nDCG(r) = \frac{1}{\mathrm{IDCG}_k} \sum_{i=1}^{k} \frac{rel_i}{\log_2(i+1)}

where rel_i is 1 if the i-th entry of r is a function similar to the target f and 0 otherwise, and IDCG_k is the Discounted Cumulative Gain of the optimal query answer. This measure is between 0 and 1, and it takes into account the ordering of the similar functions in r, giving better scores to responses that put similar functions first.
As an example, let us suppose we have two results for the same query: r1 = (1, 1, 0, 0) and r2 = (0, 0, 1, 1) (where 1 means that the corresponding position in the result list is occupied by a similar function and 0 otherwise). These results have the same precision (i.e., 0.5), but nDCG scores the first better.
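The definition above can be sketched directly in code; `rel` is the binary relevance list of a query answer (the index shift `i + 2` accounts for Python's zero-based enumeration versus the one-based rank in the formula):

```python
import math

def ndcg(rel, k):
    """Normalized DCG at k; rel[i] = 1 if the i-th result is similar
    to the target function, 0 otherwise."""
    dcg = sum(r / math.log2(i + 2) for i, r in enumerate(rel[:k]))
    # Ideal DCG: the same relevant results, ranked first.
    ideal = sorted(rel, reverse=True)
    idcg = sum(r / math.log2(i + 2) for i, r in enumerate(ideal[:k]))
    return dcg / idcg if idcg > 0 else 0.0

# Same precision (0.5), different ordering: nDCG rewards answers
# that rank the similar functions first.
print(ndcg([1, 1, 0, 0], k=4))  # 1.0
print(ndcg([0, 0, 1, 1], k=4))  # ~0.57
```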
Experimental Results
Our results on precision, nDCG and recall are reported in Figure 5.
The performances were calculated by averaging the results of 160K queries. The queries are obtained by sampling from the AMD64PostgreSQL Dataset 10K functions for each compiler and optimization level in the set {clang4.0, clang6.0, gcc4.8, gcc7} x {O0, O1, O2, O3}.
Let us recall that, on average, for each query we have 33 similar functions (i.e., functions compiled from the same source code) in the dataset.
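A query of this kind reduces to a nearest-neighbour search over the embedding matrix. The sketch below (illustrative names and toy data, not the SAFE codebase) ranks dataset embeddings by cosine similarity to the target:

```python
# Sketch: answering a function-search query by cosine similarity.
import numpy as np

def top_k_similar(target, embeddings, k):
    # Normalize rows so that a dot product equals cosine similarity.
    t = target / np.linalg.norm(target)
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = e @ t
    # Indices of the k embeddings most similar to the target.
    return np.argsort(-sims)[:k]

rng = np.random.default_rng(0)
db = rng.normal(size=(1000, 100))              # toy database of embeddings
query = db[42] + 0.01 * rng.normal(size=100)   # near-duplicate of entry 42
print(top_k_similar(query, db, k=5))           # entry 42 ranks first
```

Precision, recall, and nDCG are then computed on the returned index list against the ground-truth similarity labels.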

Precision: The results are reported in Figure 5(a). Precision remains high for small values of k and degrades slowly as k grows; the improvement over Gemini is around 10% on the entire range considered (exact values for each k are reported in the figure).

nDCG: The results are reported in Figure 5(b). Our solution maintains a high nDCG over the range considered, which implies that the results are well ordered and the similar functions are among the first results returned. There is a clear improvement with respect to Gemini on the entire range considered (exact values for each k are reported in the figure).

Recall: The results are reported in Figure 5(c). We reach a recall of 47% where Gemini stops at 39%, and at a larger cutoff SAFE again exceeds Gemini's 45% (exact values for each k are reported in the figure).
6.3 Task 3 - Vulnerability Search
In this task we evaluate our ability to search for vulnerable functions on a dataset specifically designed for this purpose. The methodology and the performance measures of this test are the same as in Task 2.
Dataset and methodology
The dataset used is the vulnerability dataset of [9]. It contains several vulnerable binaries compiled with 11 compilers from the clang, gcc, and icc families. The total number of different vulnerabilities is 8.
Experimental Results
The results of our experiments are reported in Figure 6. We can see that SAFE outperforms Gemini for all values of k in all tests. Our nDCG is consistently high, showing that SAFE effectively ranks most of the vulnerable functions among the nearest results, and our recall exceeds Gemini's at every cutoff considered, with an improvement of up to 29% (exact values for each k are reported in the figure). One of the reasons why precision quickly decreases with k is that, on average, the dataset contains only a few similar functions per query; this means that even a perfect system will have, for large k, a precision well below 1. This metric problem is not shared by the nDCG reported in Figure 6(b): recall that the nDCG is normalized on the behaviour of the perfect query-answering system. During our tests we observed an ideal behaviour on the infamous Heartbleed vulnerability: SAFE found all the vulnerable functions in the first results, while Gemini's recall at the same cutoff was markedly lower.
6.4 Task 4 - Semantic Classification
In Task 4 we evaluate the semantic classification using the embeddings computed with the model trained on the AMD64multipleCompilers Dataset. We calculate the embeddings for all functions in the Semantic Dataset (details on the dataset below). We then train and test an SVM classifier on these embeddings using a 10-fold cross validation, with an RBF kernel. We compare our embeddings with the ones computed with Gemini.
Table 2: Composition of the Semantic Dataset.

| Class | Number of functions |
| --- | --- |
| S (Sorting) | 4280 |
| E (Encryption) | 2868 |
| SM (String Manipulation) | 3268 |
| M (Math) | 4742 |
| Total | 15158 |
Dataset
The Semantic Dataset has been generated from a source code collection containing 443 functions manually annotated as implementing algorithms in one of 4 classes: E (Encryption), S (Sorting), SM (String Manipulation), M (Mathematical). The Semantic Dataset contains multiple functions corresponding to different implementations of the same algorithm. We compiled the sources for AMD64 using the 12 compilers and 4 optimization levels used for the AMD64PostgreSQL Dataset; we took the object files and, after disassembling them with ANGR, obtained a total of 15158 binary functions (see details in Table 2). It is customary to use auxiliary functions when implementing complex algorithms (e.g., a swap function used by a quicksort implementation). When disassembling the Semantic Dataset we take special care to include the auxiliary functions in the assembly code of the caller, to make sure that the semantics of a function is not lost through scattering among helper functions. Operatively, we include in the caller all the callees up to depth 2.
Measures of Performances
As performance measures we use precision, recall and F1 score.
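A minimal sketch of this evaluation pipeline follows, using synthetic stand-in data rather than the paper's embeddings; the SVM hyperparameters and data dimensions are assumptions, since the text does not report them:

```python
# Sketch: 10-fold cross-validated RBF-kernel SVM over function
# embeddings, scored with precision, recall, and F1.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_predict
from sklearn.svm import SVC
from sklearn.metrics import classification_report

# Synthetic stand-in for the embedding matrix and its 4 class labels
# (E, S, SM, M); shapes are illustrative only.
X, y = make_classification(n_samples=600, n_features=100,
                           n_informative=20, n_classes=4, random_state=0)

clf = SVC(kernel="rbf")                        # hyperparameters assumed
y_pred = cross_val_predict(clf, X, y, cv=10)   # 10-fold cross validation
print(classification_report(y, y_pred, digits=2))
```

`classification_report` prints per-class precision, recall, and F1 plus the weighted averages, mirroring the layout of Table 3.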
Experimental Results
Table 3: Semantic classification results for SAFE and Gemini embeddings.

| Class | Embedding Model | Precision | Recall | F1-Score |
| --- | --- | --- | --- | --- |
| E (Encryption) | SAFE | 0.92 | 0.94 | 0.93 |
| E (Encryption) | Gemini | 0.82 | 0.85 | 0.83 |
| M (Math.) | SAFE | 0.98 | 0.95 | 0.96 |
| M (Math.) | Gemini | 0.96 | 0.90 | 0.93 |
| S (Sorting) | SAFE | 0.91 | 0.93 | 0.92 |
| S (Sorting) | Gemini | 0.87 | 0.92 | 0.89 |
| SM (String Manipulation) | SAFE | 0.98 | 0.97 | 0.97 |
| SM (String Manipulation) | Gemini | 0.90 | 0.89 | 0.89 |
| Weighted Average | SAFE | 0.95 | 0.95 | 0.95 |
| Weighted Average | Gemini | 0.89 | 0.89 | 0.89 |
The results of our semantic classification tests are reported in Table 3. First and foremost, we have a strong confirmation that it is indeed possible to classify the semantics of algorithms using function embeddings: an SVM classifier on the embedding vector space achieves good performance. The variability of performance between classes is limited. The classes on which SAFE performs best are SM and M. We speculate that the moderate simplicity of the algorithms in these classes results in limited variability among the binaries. The M class is also one of the classes on which the Gemini embeddings perform best; this is probably because one of the manual features used by Gemini is the number of arithmetic assembly instructions inside each code block of the CFG. By analyzing the output of the classifier, we found that the most common error, shared by both Gemini and SAFE, is confusing encryption and sorting algorithms. A possible explanation for this behaviour is that simple encryption algorithms, such as RC5, share many structural similarities with sorting algorithms (e.g., nested loops over an array).
Finally, we can see that, in all cases, the embeddings computed with our architecture outperform the ones computed with Gemini; across classes, the F1 improvement ranges from 3 to 10 points. The average improvement, weighted on the cardinality of each class, is around 6 points (0.95 vs. 0.89).
Qualitative Analysis of the Embeddings
We performed a qualitative analysis of the embeddings produced by SAFE. Our aim is to understand how the network captures information on the inner semantics of binary functions, and how it represents such information in the vector space.
To this end we computed the embeddings for all functions in the Semantic Dataset. Figure 7 reports a two-dimensional projection, obtained using the t-SNE technique [21], of the high-dimensional vector space in which the binary function embeddings lie.
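A projection of this kind can be reproduced with an off-the-shelf t-SNE implementation (toy clusters standing in for the real embeddings; the paper uses the TensorBoard implementation instead):

```python
# Sketch: projecting high-dimensional function embeddings to 2-D
# with t-SNE for visual inspection.
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Two toy clusters standing in for two semantic classes of embeddings.
emb = np.vstack([rng.normal(0, 1, size=(50, 100)),
                 rng.normal(5, 1, size=(50, 100))])

proj = TSNE(n_components=2, perplexity=30,
            random_state=0).fit_transform(emb)
print(proj.shape)  # (100, 2): one 2-D point per embedding
```

Plotting `proj` colored by class label reveals whether embeddings of the same semantic class cluster together, as observed in Figure 7.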
Real use case of Task 4 - Detecting encryption functions in Windows malware
We decided to test the semantic classification in a real use-case scenario. We trained a new SVM classifier on the Semantic Dataset with only two classes: encryption and non-encryption. We then used this classifier on malware. We analyzed two samples of Windows malware found in well-known malware repositories: the TeslaCrypt and Vipasana ransomwares. We disassembled the malware with radare2, including in each caller the code of the callee functions up to depth 2. We processed the disassembled functions with our classifier, and we selected only the functions flagged as encryption with a high probability score. Finally, we manually analyzed the malware samples to assess the quality of the selected functions.

TeslaCrypt^{13}: on a total of 658 functions, the classifier flags the ones at addresses 0x41e900, 0x420ec0, 0x4210a0, 0x4212c0, 0x421665, 0x421900, 0x4219c0. We confirmed that these are either encryption (or decryption) functions or helper functions directly called by the main encryption procedures.
Vipasana^{14}: on a total of 1254 functions, the classifier flags the ones at addresses 0x406da0, 0x414a58, 0x415240. We confirmed that two of these are either encryption (or decryption) functions or helper functions directly called by the main encryption procedures; the false positive is 0x406da0.
As a final remark, we want to stress that these malware samples are 32-bit Windows binaries, while we trained our entire system on ELF executables for AMD64. This shows that our model is able to generate good embeddings even for cases that are largely different from the ones seen during training.
7 Speed considerations
As reported in the introduction, one of the advantages of SAFE is that it ditches the use of CFGs. In our tests with radare2, disassembling a function is considerably faster than computing its CFG. Once functions are disassembled, an Nvidia K80 running our model computes thousands of function embeddings per second.
More precisely, we ran our tests on a virtual machine hosted on the Google Cloud Platform. The machine has 8 Intel Sandy Bridge cores, 30 GB of RAM, an Nvidia K80 GPU, and an SSD hard drive. We disassembled all object files of PostgreSQL 9.6 compiled with gcc6 for all optimization levels. During disassembly we assume to know the starting address of each function; see [26] for a paper using neural networks to find function starts in a binary.
The time needed to disassemble and preprocess the 3432 binaries is 235 seconds; computing the embeddings of the resulting 32592 functions takes the remaining part of an end-to-end time (from binary files to function embeddings) of less than 5 minutes for the whole of PostgreSQL. We repeated the same test with OpenSSL 1.1.1 compiled with gcc5 for all optimization levels: the end-to-end time to compute the embeddings for all functions in OpenSSL is less than 4 minutes.
Gemini is up to 10 times slower: it needs 43 minutes for PostgreSQL and 26 minutes for OpenSSL.
8 Conclusions and future works
In this paper we introduced SAFE, an architecture for computing embeddings of functions in the cross-platform case that does not use debug symbols. Our architecture does not need the CFG, which leads to a considerable speed advantage: SAFE creates thousands of embeddings per second on a mid-range COTS GPU. Even when we factor in the disassembly time, our end-to-end system (from binary file to function embedding) processes more than 100 functions per second. This considerable speed comes with a significant increase in predictive performance with respect to the state of the art. Summing up, SAFE is both faster and more precise than previous solutions.
Finally, we believe that our experiments on semantic detection are particularly interesting: they pave the way for more complex and refined analyses, with the final purpose of building binary classifiers that rival those available today for image recognition.
Future Works
There are several lines of improvement that we plan to investigate in the immediate future. The first is to retrain our i2v model to make use of libc call symbols, which will allow us to quantify the impact of such information on embedding quality. We believe that symbols could lead to a further increase in performance, at the cost of assuming more information about, and the integrity of, the binary under analysis.
The use of libc symbols would also enable a more fine-grained semantic classification: for example, we could distinguish a function that sends encrypted content over a socket from one that writes encrypted content to a file.
Summarizing, the field of applied machine learning for binary analysis is still in its infancy and there are several opportunities for future works.
Acknowledgments. The authors would like to thank Google for providing free access to its cloud computing platform through the Education Program. Moreover, the authors would like to thank NVIDIA for partially supporting this work through the donation of a GPGPU card used during prototype development. Finally, the authors would like to thank Davide Italiano for the insightful discussions. This work is supported by a grant of the Italian Presidency of the Council of Ministers and by the CINI (Consorzio Interuniversitario Nazionale Informatica) National Laboratory of Cyber Security.
Footnotes
 https://www.statista.com/statistics/266210/number-of-available-applications-in-the-google-play-store/
 https://www.cvedetails.com/browse-by-date.php
 Tests conducted using the Radare2 disassembler [25].
 Conversely, recognizing library functions in stripped statically linked binaries is an application of the binary similarity problem without symbolic calls.
 Classic RNNs do not cope well with really long sequences.
 We built our system to be compatible with two leading open-source disassemblers.
 Note that gcc3.4 has been released more than 10 years before gcc5.4.
 Gemini has not been distributed publicly; we implemented it using the information contained in [29]. The Gemini hyperparameters (function-embedding dimension, number of rounds, and number of layers) are set to the values that give the best performance according to our experiments and to the original Gemini paper.
 48 = 12 compilers times 4 optimization levels.
 CVE-2014-0160, CVE-2014-6271, CVE-2015-3456, CVE-2014-9295, CVE-2014-7169, CVE-2011-0444, CVE-2014-4877, CVE-2015-6862.
 Some vulnerable functions are lost during the disassembling process.
 We used the TensorBoard implementation of t-SNE.
 The sample is available at https://github.com/ytisf/theZoo/tree/master/malwares/Binaries/Ransomware.TeslaCrypt. The variant analyzed is the one with hash 3372c1eda…
 The sample is available at https://github.com/ytisf/theZoo/tree/master/malwares/Binaries/Ransomware.Vipasana. The variant analyzed is the one with hash 0442cfabb…4b6ab
References
 [1] (2016) TensorFlow: a system for large-scale machine learning. In Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI), pp. 265–283.
 [2] (2007) The relationship between IR effectiveness measures and user satisfaction. In Proceedings of the 30th International ACM Conference on Research and Development in Information Retrieval (SIGIR), pp. 773–774.
 [3] (2015) SIGMA: a semantic integrated graph matching approach for identifying reused functions in binary code. Digital Investigation 12, pp. S61–S71.
 [4] (2018) Unsupervised features extraction for binary similarity using graph embedding neural networks. arXiv preprint arXiv:1810.09683.
 [5] (1994) Signature verification using a "siamese" time delay neural network. In Proceedings of the 6th International Conference on Neural Information Processing Systems (NIPS), pp. 737–744.
 [6] (2014) Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP).
 [7] (2017) Neural nets can learn function type signatures from binaries. In Proceedings of the 26th USENIX Security Symposium (USENIX Security), pp. 99–116.
 [8] (2016) Discriminative embeddings of latent variable models for structured data. In Proceedings of the 33rd International Conference on Machine Learning (ICML), pp. 2702–2711.
 [9] (2016) Statistical similarity of binaries. In Proceedings of the 37th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), pp. 266–280.
 [10] (2017) Similarity of binaries through re-optimization. In ACM SIGPLAN Notices, Vol. 52, pp. 79–94.
 [11] (2014) Tracelet-based code search in executables. In Proceedings of the 35th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), pp. 349–360.
 [12] (2019) Asm2Vec: boosting static representation robustness for binary clone search against code obfuscation and compiler optimization. In (to appear) Proceedings of the 40th Symposium on Security and Privacy (SP).
 [13] (2005) Graph-based comparison of executable objects. In Proceedings of the Symposium sur la sécurité des technologies de l'information et des communications (SSTIC).
 [14] (2014) Blanket execution: dynamic similarity testing for program binaries and components. In Proceedings of the 23rd USENIX Security Symposium (USENIX Security), pp. 303–317.
 [15] (2017) Extracting conditional formulas for cross-platform bug search. In Proceedings of the 12th ACM Asia Conference on Computer and Communications Security (ASIA CCS), pp. 346–359.
 [16] (2016) Scalable graph-based bug search for firmware images. In Proceedings of the 23rd ACM SIGSAC Conference on Computer and Communications Security (CCS), pp. 480–491.
 [17] (2004) Evaluating collaborative filtering recommender systems. ACM Transactions on Information Systems 22(1), pp. 5–53.
 [18] (2013) Rendezvous: a search engine for binary code. In Proceedings of the 10th Working Conference on Mining Software Repositories (MSR), pp. 329–338.
 [19] (2014) Distributed representations of sentences and documents. In Proceedings of the 31st International Conference on Machine Learning (ICML), pp. 1188–1196.
 [20] (2017) A structured self-attentive sentence embedding. arXiv preprint arXiv:1703.03130.
 [21] (2008) Visualizing data using t-SNE. Journal of Machine Learning Research 9 (Nov), pp. 2579–2605.
 [22] (2013) Distributed representations of words and phrases and their compositionality. In Proceedings of the 26th International Conference on Neural Information Processing Systems (NIPS), pp. 3111–3119.
 [23] (2015) Cross-architecture bug search in binary executables. In Proceedings of the 34th IEEE Symposium on Security and Privacy (SP), pp. 709–724.
 [24] (2014) Leveraging semantic signatures for bug search in binary programs. In Proceedings of the 30th Annual Computer Security Applications Conference (ACSAC), pp. 406–415.
 [25] (2017) Radare2 disassembler repository. https://github.com/radare/radare2
 [26] (2015) Recognizing functions in binaries with neural networks. In Proceedings of the 24th USENIX Security Symposium (USENIX Security), pp. 611–626.
 [27] (2016) SoK: (state of) the art of war: offensive techniques in binary analysis. In Proceedings of the 37th IEEE Symposium on Security and Privacy (SP), pp. 138–157.
 [28] (last accessed 11/2018) Word2vec skip-gram implementation in TensorFlow. https://www.tensorflow.org/tutorials/representation/word2vec
 [29] (2017) Neural network-based graph embedding for cross-platform binary code similarity detection. In Proceedings of the 24th ACM SIGSAC Conference on Computer and Communications Security (CCS), pp. 363–376.
 [30] (2018) Neural machine translation inspired binary code similarity comparison beyond function pairs. arXiv preprint arXiv:1808.04706.