Function Naming in Stripped Binaries Using Neural Networks
In this paper we investigate the problem of automatically naming pieces of assembly code, where by naming we mean assigning to a portion of code the string of words that a human reverse engineer would likely assign to it. We formally and precisely define the framework in which our investigation takes place: we define the problem, we justify the choices made in designing the training and test steps, and we perform a statistical analysis of function names in a large real-world corpus of over 4 million functions. Within this framework we test several baselines from the field of NLP (e.g., Seq2Seq networks and transformers). Moreover, we provide a set of tailored solutions that beat the aforementioned baselines.
The last few years have witnessed a growing trend of applying machine learning (ML) and natural language processing (NLP) techniques to code, as illustrated in [sourcecode]. The vast and increasing amount of high-quality software available through open source repositories such as GitHub has made it possible to use source code as a ground truth for building statistical models of code. The choice of using ML to build code models is motivated by the naturalness hypothesis, which underlines the similarities between programming languages and human languages. According to this hypothesis, software is a form of human communication with statistical properties similar to those of natural language, and these properties can be exploited to build better software engineering tools [sourcecode]. Applying ML and NLP techniques to code has proved effective in many tasks, such as predicting program bugs, predicting program behavior, automatically creating new code, predicting identifier names and translating code between programming languages. The success of NLP techniques on source code has led researchers to investigate their use on binary code as well.
Following this research line, in this paper we investigate the feasibility of using the same techniques to predict the names of functions in stripped binary programs. These are binary executable files that contain only low-level information such as instructions, registers and memory addresses, but no debug symbols, since the latter are not necessary for program execution. Debug symbols are generated by the compiler from the source code and typically include information about functions and variables, such as name, location, type and size, which is helpful for debugging and security analysis of a binary. Being non-essential for execution, symbols are often removed from a program after compilation, which makes reverse engineering the software more complex.
Reconstructing symbols in a binary program can be very useful in all those fields where reverse engineering plays a crucial role, e.g. malware analysis. Usually, after disassembling the malware, a reverse engineer analyzes the assembly instructions of the program looking for specific functions (e.g. encryption or network functions) that might reveal the malicious nature of the software. This task can be daunting, especially when the binary is stripped, since functions are then identified only by symbols that carry no semantic information. In this case, explanatory names for such functions would save the reverse engineer a lot of effort.
Recently, a few works have proposed solutions to this problem. For example, [DEBIN] and [NERO] show that the problem can be addressed by directly predicting the full name of a function, or by generating its name from the calls made by the function alone. However, the problem at hand is far from being solved, for several reasons:
existing solutions work only under strong assumptions, such as the presence of calls to dynamically linked libraries in each function, or a closed set of assignable names to predict from;
the evaluation of these solutions has been performed on small datasets that contain binary programs with little variance;
there is no public and common dataset to compare existing solutions.
Starting from these issues, this paper makes the following contributions:
we provide a general definition of the problem;
we describe a new dataset with a large variance of real-world binaries;
we test different deep neural network architectures that work in the general setting of the problem and do not require any underlying assumption.
After this introduction, Section II discusses the state of the art, Section III introduces the function naming problem, Section IV describes the dataset used for the evaluation, Section V details the proposed solution and Section VI reports the experimental results. Finally, Section VII concludes the paper.
II Related Work
II-A Debug symbols prediction
Predicting debugging symbols, including function names, through ML is a rather new field of research with few contributions. The most notable work in this area is DEBIN, proposed by He et al. [DEBIN]. DEBIN uses conditional random fields to predict function names, variable names and types. Differently from our work, DEBIN assigns to functions only names that are already known to be function names, i.e. it cannot generalize and generate new names.
This limitation is overcome by NERO [NERO], which models the problem of predicting function names as a neural machine translation (NMT) task where each function is represented by a set of call site sequences that retain information about the names of dynamically linked functions.
Finally, DIRE [lacomis2019dire] proposes a probabilistic technique for variable name recovery that uses both lexical and structural information recovered by the decompiler. To train its neural network, DIRE uses the extended context provided by the decompiler's internal abstract syntax tree (AST) representation of the decompiled binary, which encodes additional structural information.
II-B Deep Learning for Binary Analysis
Recently, many works have focused on applying deep learning techniques to the binary similarity problem. Given two binary functions, the binary similarity problem consists in determining whether they are the result of compiling the same source code in different ways. This problem is related to the one addressed in this paper, in the sense that predicting function names from stripped binaries can also be seen as first clustering similar functions and then assigning similar names to them. A few works propose to use embeddings to represent binary code in an n-dimensional numeric space, so that code similarity translates to Euclidean vicinity.
GEMINI [GEMINI] and SAFE [SAFE] both make use of graph embedding networks to compute function embeddings. In particular, they build the control flow graph (CFG) of each function and then use the structure2vec graph embedding technique to translate the CFG to a vector of real numbers. The difference between these two works lies in the way features are extracted from the CFG: GEMINI uses manually selected features, whereas SAFE uses an unsupervised feature learning mechanism.
It is worth pointing out the strong similarity between the SAFE architecture, used to compute function embeddings, and the encoder of the architecture used in this paper to predict function names. Moreover, SAFE employs an instruction embedding technique to map assembly instructions to vectors; the same embeddings are used to feed the networks proposed in this work.
Recently, Ding et al. [ASM2VEC] proposed a function embedding solution called Asm2Vec. This solution is based on the PV-DM model [Le:2014aa] for NLP. Operatively, Asm2Vec computes the CFG of a function, and then it performs a series of random walks on top of it. Asm2Vec outperforms several state-of-the-art solutions in the field of binary similarity; however, its applicability is limited by the fact that it requires libc call symbols to be present in the binary code as tokens to produce the embedding of a function and, furthermore, it cannot generate cross-platform embeddings.
The binary similarity problem has also been applied to solve different problems. Katz et al. [DBLP:journals/corr/abs-1905-08325] proposed an approach to binary software decompilation based on neural machine translation [kalchbrenner2013recurrent]. Their solution applies an NMT model to small snippets of canonical binary code, translating them to an intermediate code representation that is then transformed into C code using templates. More recently, Fu et al. [NIPS2019_8628] proposed a new framework for solving this problem, called Coda. While not directly linked to the function naming problem, binary code decompilation still represents a fundamental and orthogonal approach to reverse engineering software.
III Problem definition
We are interested in exploring solutions for the function naming problem. That is, given a binary code , representing a functional unit of code
The above problem is very challenging, and by its nature it cannot be defined more precisely without engaging in complex reasoning about what the "semantics" of code is.
Statistical learning methods are especially suitable for problems with fuzzy definitions. However, in order for these methods to be effective, their models need to be trained and tested on a suitable dataset. Such a dataset must be reasonably large (millions of functions) and must contain both the binary code of the functions and their names.
Function names will be used to train statistical models, thus their content is crucial for the trained models to correctly embed the function semantics. In particular, we assume that programmers, when writing code in a high-level language, assign to functions names that include information about the function semantics and their context of execution. This assumption does not always hold true. However, we find it reasonable to assume that it holds most of the time, especially in large projects developed by professional or skilled programmers who follow common function naming conventions.
There is an unavoidable ambiguity in the output of any method that names something, because several different meaningful and descriptive names can be associated with a given function. As an example, a function implementing quick sort on an array can be named "Quick-sort", "quick-array-sort", "sort", etc. This creates a problem in the way any solution is tested: in order to evaluate its accuracy we should consider all possible meaningful names. This is unfeasible, so we test our method on a partition of the dataset mentioned above. We say that a prediction is correct if it is the name present in the dataset; other predictions are deemed wrong even if they could be meaningful.
Vocabulary and restricted names
Unfortunately, we found that function names in our dataset are noisy, e.g. many functions of the OpenGL library contain the bigram "gl". Such patterns recur in many libraries and programs, as developers use words and acronyms that have a meaning in the software itself but add little to the semantics of the function. In order to clean our dataset we designed a filtering process (described in detail in Section IV-C). This filtering process associates each original name with a reduced name over a restricted vocabulary of words. We found that such a restriction preserves part of the semantics of the majority of names in our dataset while solving the problem described above.
Apart from restricting the vocabulary (and thus the set of possible names), we make an additional assumption. We transform our problem from that of predicting a string, and thus a sequence of words, to that of predicting a set of words. More concretely, if the actual name of the function is "open file", our goal is to predict the set {open, file}. This simplifying assumption allows us to treat the problem with several deep learning techniques: not only those designed for predicting sequences (such as the well-known Seq2Seq), but also those developed for multi-label classification. The tokens in the vocabulary are the labels, and predicting a set of tokens for a function is equivalent to classifying it with multiple possible labels.
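To make the multi-label view concrete, the following minimal sketch encodes a name as a multi-hot vector over the vocabulary and decodes a vector of per-label scores back into a set of tokens. The five-token vocabulary, the helper names and the 0.5 decision threshold are illustrative assumptions, not details taken from the experiments.

```python
# Hypothetical toy vocabulary; the real one contains the selected name tokens.
VOCAB = ["close", "file", "open", "read", "write"]

def encode_name(tokens, vocabulary):
    """Return a multi-hot vector: 1 for each vocabulary token present in the name."""
    present = set(tokens)
    return [1 if token in present else 0 for token in vocabulary]

def decode_name(scores, vocabulary, threshold=0.5):
    """Turn per-label scores into a set of tokens (the threshold is an assumption)."""
    return {token for token, score in zip(vocabulary, scores) if score >= threshold}

# Token order in the name does not affect the target vector.
target = encode_name(["open", "file"], VOCAB)
same_target = encode_name(["file", "open"], VOCAB)
predicted = decode_name([0.1, 0.8, 0.9, 0.2, 0.0], VOCAB)
```

Because the target is a set, "open file" and "file open" map to the same vector, which is exactly the simplification described above.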
In this section we describe the dataset used in the experiments.
IV-A Executable Sources
The dataset has been obtained by crawling all available packages from the Ubuntu apt repository, collecting 22040 packages. Executables inside the packages have been disassembled using IDA Pro; we successfully disassembled 119504 distinct ELF files. Finally, we discarded all executables without debug symbols and we filtered out duplicate functions. At the end of this process we obtained 5470954 named functions.
IV-B Functions processing and Filtering
One straightforward representation for a binary function is the control flow graph (CFG). However, using this representation as input for deep learning networks has been shown to produce less accurate results [BAR19]. Based on this observation, we follow several recent works, such as [SAFE] and [ASM2VEC], that represent a function as an ordered list of instructions, where the order is given by the address of each instruction inside the program. The average number of instructions per function is , we set a maximum length of 500 instructions per function and we truncate functions longer than this threshold.
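The linear representation can be sketched as follows: instructions are sorted by address and the sequence is cut at a maximum length. The `(address, text)` pair format and the helper name are assumptions made for illustration; the default limit of 500 is the maximum instruction sequence length listed among the training parameters.

```python
MAX_INSTRUCTIONS = 500  # maximum sequence length listed among the training parameters

def linearize(instructions, max_len=MAX_INSTRUCTIONS):
    """Order (address, text) pairs by address and truncate to max_len instructions."""
    ordered = sorted(instructions, key=lambda pair: pair[0])
    return [text for _, text in ordered[:max_len]]

# Toy function whose instructions arrive out of order (hypothetical data).
example = [
    (0x401004, "mov rbp, rsp"),
    (0x401000, "push rbp"),
    (0x401007, "ret"),
]
sequence = linearize(example, max_len=2)
```

Truncation keeps the lowest-addressed instructions, so the tail of very long functions is the part that is discarded.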
IV-C Name Preprocessing and Filtering
After collecting the dataset we apply a normalization process to function names. Names are often modified by the compiler, which encodes additional information in them to enforce their uniqueness; this operation is usually called name mangling. Furthermore, differently from natural language, function names do not contain blank spaces, which makes splitting them into tokens much more difficult. Starting from these observations, we preprocess and filter function names in five steps: demangling, function name splitting, stemming, tokens selection and out-of-vocabulary tokens removal.
Demangling: The aim of this step is to remove the encoding added by the compiler. Since compilers perform mangling in a standard way, it is possible to demangle names using standard libraries. During this step we also filtered out all the functions coming from binaries not written in C/C++ (e.g. code written in Go, Haskell, etc.).
Function name splitting: This step consists in splitting function names into tokens. This is achieved by using the natural partition provided by camel and snake notations, which are generally adopted for function names.
Stemming: This is a technique used in information retrieval to reduce inflected (or sometimes derived) words to their base form. It turns out to be very useful in this context as well, since it reduces the vocabulary of function name tokens by mapping different forms of the same token to a unique one (for example, the tokens "shared" and "sharing" are both mapped to their base form "share"). Stemming is important because the network does not need to learn the syntactically correct token to associate with each function, but rather the semantically correct one.
Tokens selection: The fourth step consists in selecting only a given number of tokens as the vocabulary. This operation has the goal of removing useless and meaningless tokens from function names, which could introduce noise during the learning process. The token selection process first assigns a score to each token and then extracts the top n tokens with the highest score (the chosen value for n is 1000). These top n tokens are used as the vocabulary of function name tokens. The score of each token $t$ was calculated as:
$$s(t) = \sum_{p} \mathbb{1}_{p}(t)$$
where $p$ is an index over all the packages in the dataset and $\mathbb{1}_{p}(t)$ is equal to $1$ if the token $t$ appears in the package $p$, and $0$ otherwise. This choice permits us to exclude tokens that appear in only a few packages, independently of the frequency of the token inside each package. In this way we avoid assigning a high score to tokens that are not semantically relevant.
OOV removal and final filtering: The fifth and final step consists in removing from function names all the tokens not contained in the vocabulary obtained in the previous step. In a few cases a function name consists only of out-of-vocabulary tokens, resulting in an empty name; such functions are removed from the dataset. Applying this last step to our dataset removes 13% of the functions.
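The splitting, token selection and out-of-vocabulary removal steps can be sketched together as below. This is a simplified illustration under stated assumptions: demangling and stemming are omitted, the splitting regular expression is one possible way to handle camel and snake notations, the score is read as the number of packages in which a token occurs, ties are broken alphabetically, and the package and function names are invented toy data.

```python
import re
from collections import Counter

# One possible camelCase tokenizer: acronyms, capitalized words, digit runs.
_WORD = re.compile(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|[0-9]+")

def split_name(name):
    """Split a (demangled) name on snake_case and camelCase boundaries."""
    tokens = []
    for part in re.split(r"[^A-Za-z0-9]+", name):
        tokens.extend(match.lower() for match in _WORD.findall(part))
    return tokens

def select_vocabulary(names_by_package, n=1000):
    """Score tokens by the number of packages they occur in and keep the top n."""
    scores = Counter()
    for names in names_by_package.values():
        tokens_in_package = {tok for name in names for tok in split_name(name)}
        scores.update(tokens_in_package)  # each package counts a token once
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return {token for token, _ in ranked[:n]}

def restrict_name(name, vocabulary):
    """Drop out-of-vocabulary tokens; return None when the name becomes empty."""
    kept = [tok for tok in split_name(name) if tok in vocabulary]
    return kept or None

# Hypothetical toy corpus: "gl" is frequent inside pkg_a but absent elsewhere.
names_by_package = {
    "pkg_a": ["glOpenFile", "glReadFile", "glFlush"],
    "pkg_b": ["open_file", "read_buffer"],
    "pkg_c": ["OpenSocket", "read_file"],
}
vocabulary = select_vocabulary(names_by_package, n=3)
reduced = restrict_name("glOpenFile", vocabulary)
dropped = restrict_name("glFlush", vocabulary)
```

In this toy corpus the token "gl" occurs three times in one package but scores only 1, so it is excluded from the vocabulary, while "glFlush" loses all its tokens and would be removed from the dataset.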
IV-D Data Splitting
In order to train and evaluate the naming models we need to split the dataset into the canonical three sets: training, validation and test. To avoid information leakage from the training set to the validation or test set, we split the functions in our dataset according to their packages: all functions from a given package belong to exactly one of the three sets. We put of functions in the training set, in the validation set and in the test set.
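A package-level split can be sketched as follows. The 80/10/10 ratios, the seed and the helper name are placeholders chosen for illustration only, and for simplicity the ratios are applied to the number of packages rather than to the number of functions.

```python
import random

def split_by_package(function_packages, ratios=(0.8, 0.1, 0.1), seed=0):
    """Assign whole packages to train/validation/test.

    function_packages maps a function identifier to its package name.
    The ratios are hypothetical and are applied to packages, not functions.
    """
    packages = sorted(set(function_packages.values()))
    random.Random(seed).shuffle(packages)
    n_train = int(ratios[0] * len(packages))
    n_val = int(ratios[1] * len(packages))
    assignment = {}
    for index, package in enumerate(packages):
        if index < n_train:
            assignment[package] = "train"
        elif index < n_train + n_val:
            assignment[package] = "validation"
        else:
            assignment[package] = "test"
    splits = {"train": [], "validation": [], "test": []}
    for function, package in function_packages.items():
        splits[assignment[package]].append(function)
    return splits

# Toy dataset: 10 packages with 2 functions each (invented identifiers).
functions = {f"f{i}_{j}": f"pkg{i}" for i in range(10) for j in range(2)}
splits = split_by_package(functions)
```

Since the assignment is made per package, two functions from the same package can never end up on opposite sides of the train/test boundary.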
V Solution Overview
Sequence-based solutions are commonly used for NLP problems such as machine translation or question answering. These models take a sequence as input and output another sequence. We used this architecture for our experiments because translation problems look very similar to function naming, except that the output sequences can be much shorter than the input.
The sequence-to-sequence network (Seq2Seq) is also called encoder-decoder network, since it consists of two Recurrent Neural Networks (RNNs) called encoder and decoder. The encoder takes as input the list of assembly instructions and encodes its information into a set of vectors, whereas the decoder predicts the function name by decoding this information.
The encoder consists of a bidirectional RNN with long short-term memory (LSTM) cells. The use of a bidirectional encoder is important because it makes it possible to compute, for each instruction, a hidden state vector that takes into account the instruction itself together with its previous and following context.
The decoder is a forward RNN that starts by taking as input the last hidden state of the encoder and the attention matrix. At each time step the hidden state of the decoder is passed through a softmax layer that outputs a probability distribution over all tokens of the target vocabulary. From this probability distribution it is possible to choose the next token of the function name. The attention mechanism makes it possible to better model long-distance dependencies, which are a critical aspect of assembly code. In fact, the attention decoder, while generating function name tokens, is able to focus on the part of the context that actually matters. Moreover, the attention decoder does not work with a single vector of fixed size that "squashes" all the information about the input sequence of assembly instructions, as a traditional encoder-decoder network does. Instead, it keeps a vector for each assembly instruction and references these vectors at each decoding step. One common strategy to choose the next token from the probabilistic output of the decoder is beam search. This method keeps several candidate sequences at each time step and explores the output distribution to find an approximation of the sequence that maximizes the probability of the full sequence.
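The beam-search strategy described above can be illustrated with the following minimal sketch. It is a generic textbook formulation rather than the decoder of the Seq2Seq implementation used in the experiments: `step_fn` stands in for the decoder's softmax output, hypotheses are ranked by summed log-probability without length normalization, and the toy distribution is invented.

```python
import math

def beam_search(step_fn, start_token, end_token, beam_width=5, max_len=50):
    """Approximate the most probable sequence.

    step_fn(prefix) returns a dict mapping each candidate next token to its
    probability given the prefix (a stand-in for the decoder softmax).
    """
    beams = [(0.0, [start_token])]  # (log-probability, token sequence)
    finished = []
    for _ in range(max_len):
        candidates = []
        for score, sequence in beams:
            for token, prob in step_fn(sequence).items():
                if prob > 0.0:
                    candidates.append((score + math.log(prob), sequence + [token]))
        if not candidates:
            break
        candidates.sort(key=lambda item: item[0], reverse=True)
        beams = []
        for score, sequence in candidates[:beam_width]:
            # Completed hypotheses are set aside; the others stay in the beam.
            (finished if sequence[-1] == end_token else beams).append((score, sequence))
        if not beams:
            break
    finished.extend(beams)  # keep unfinished hypotheses if max_len was reached
    return max(finished, key=lambda item: item[0])[1]

# Toy next-token distribution (hypothetical): the greedy first choice "open"
# leads to a less probable full sequence than "read file".
TOY = {
    "<s>": {"open": 0.6, "read": 0.4},
    "open": {"file": 0.5, "socket": 0.5},
    "read": {"file": 0.9, "</s>": 0.1},
    "file": {"</s>": 1.0},
    "socket": {"</s>": 1.0},
}

def toy_step(prefix):
    return TOY[prefix[-1]]

best = beam_search(toy_step, "<s>", "</s>", beam_width=5)
greedy = beam_search(toy_step, "<s>", "</s>", beam_width=1)
```

With a beam width of 1 the search degenerates to greedy decoding and commits to "open", whereas a wider beam recovers the sequence with the higher overall probability.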
The implementation of the Seq2Seq network used for the experiments is the one provided in [Seq2Seqimpl], whereas the implementation of the transformer network is the one provided in [tensor2tensor]. The tool used to disassemble binaries is IDA Pro. Token stemming was implemented using the TreeTagger Python library [treetagger], whereas function name demangling was implemented with cxxfilt [demangling].
In order to speed up the evaluation phases performed during training, the validation set has been reduced by randomly selecting 50000 functions. Parameters have been tuned on the validation set, whereas the final model has been evaluated on the test set. Early stopping has been used to stop training when performance on the validation set starts degrading or stops improving. The metric used for stopping the training was ROUGE-1 [lin2004rouge] instead of the loss function on the validation set, since the former does not take into account the order of tokens, whereas the latter does. Decoding of the test set has always been carried out using a beam width of 5.
The limit of 500 instructions as the maximum function length ensures that only ~6% of the instruction sequences in the Ubuntu dataset were sliced. All the other instruction sequences were considered by the network in their entirety.
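The early-stopping rule can be sketched as a simple check on the history of validation scores. The patience of three evaluations and the function name are assumptions for illustration; the text above only states that training stops when the validation metric degrades or stops improving.

```python
def should_stop(history, patience=3):
    """Return True when the best validation score is at least `patience` evaluations old.

    history holds one score per evaluation, higher is better (e.g. ROUGE-1).
    The patience value is a hypothetical choice.
    """
    if len(history) <= patience:
        return False
    # max() returns the first best index, so a plateau counts as no improvement.
    best_index = max(range(len(history)), key=history.__getitem__)
    return len(history) - 1 - best_index >= patience

# Invented validation curve: improves, plateaus, then degrades.
scores = [0.10, 0.15, 0.18, 0.18, 0.17, 0.16]
stop_early = should_stop(scores[:5])
stop_now = should_stop(scores)
```

Any order-insensitive validation metric can be plugged in as the score; the rule itself only needs a "higher is better" sequence.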
Regarding the Seq2Seq training process, experiments used the following set of parameters:
Bidirectional RNNs with LSTM cells for both the encoder and decoder
Dropout to the outputs of each RNN layer using 1.0 keep probability for both encoder and decoder
Dropout to the inputs of each RNN layer using 0.8 keep probability
Adam optimization algorithm
Learning rate 0.0001
Batch size 32
Training steps 1000000
Maximum length of instructions sequences 500
Maximum length of function name tokens sequences 50
Instruction embedding size 100
Target embedding size 512
Number of LSTM cells units 512 for both the encoder and decoder
In order to train the Seq2Seq network, pre-trained instruction embeddings have been used. Such embeddings are computed using the i2v model also employed in SAFE. All the details about the dataset and parameters used to train this model are provided in [SAFE]. The network was also tested with no pre-trained instruction embeddings. In this case embeddings for single instructions were randomly initialized and trained together with the network itself.
The performance of the two Seq2Seq networks, with and without pre-trained instruction embeddings, was evaluated by calculating the precision, recall and F1-score of the predicted tokens. Table I reports the results.
Table I: Precision, recall and F1-score of the predicted tokens.
|Model|Precision|Recall|F1-score|
|Seq2Seq with pre-trained embeddings|0.25|0.17|0.20|
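Token-level precision, recall and F1-score can be computed as in the following sketch. It assumes micro-averaging over all functions and treats each name as a set of tokens; the exact averaging used for the reported numbers is not stated, so this is one plausible reading, and the example names are invented.

```python
def token_prf(predictions, references):
    """Micro-averaged precision, recall and F1 over sets of name tokens."""
    true_pos = false_pos = false_neg = 0
    for predicted, reference in zip(predictions, references):
        predicted, reference = set(predicted), set(reference)
        true_pos += len(predicted & reference)   # correctly predicted tokens
        false_pos += len(predicted - reference)  # predicted but not in the name
        false_neg += len(reference - predicted)  # in the name but not predicted
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Two toy functions (hypothetical predictions and ground-truth names).
predictions = [["open", "file"], ["read"]]
references = [["open", "socket"], ["read", "file", "buffer"]]
precision, recall, f1 = token_prf(predictions, references)
```

Because names are compared as sets, a prediction with the right tokens in a different order is scored as fully correct.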
The performance obtained with the Seq2Seq network with pre-trained embeddings is close to that achieved with the LSTM-text configuration in [NERO]. The main difference between the configuration used in this experiment and LSTM-text lies in the input given to the Seq2Seq network: LSTM-text feeds the network with assembly code tokens, whereas in this experiment the network receives entire assembly instructions as input. The problem with LSTM-text is that considering separate assembly code tokens inevitably leads to extremely long input sequences (nearly three times the length of sequences of whole assembly instructions). Therefore, in order to prevent the network from slicing the assembly code sequences of too many functions, it is necessary to increase the maximum source sequence length parameter, leading to very slow training.
To better understand the quality of the obtained results, performance has been compared to that achievable by a random predictor. The random prediction has been created by drawing name lengths from the same probability distribution as the lengths of the original function names in the test set, and by randomly assigning tokens to each function. As can be seen from the table, the results achieved by the model are clearly better than those achieved by a random prediction. This suggests that, even if the model does not provide strong performance overall, there are functions for which it is able to understand the behavior and assign correct tokens.
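A random baseline of this kind can be sketched as follows. Lengths are drawn from the empirical length distribution of the reference names; drawing tokens uniformly and without repetition from the vocabulary is an assumption, since the text does not specify how tokens are sampled. The names, vocabulary and seed are invented.

```python
import random

def random_predictor(reference_names, vocabulary, seed=0):
    """Predict random token lists whose lengths follow the reference length distribution.

    vocabulary must be a sequence; tokens are drawn uniformly without
    repetition, which is an assumption of this sketch.
    """
    rng = random.Random(seed)
    lengths = [len(name) for name in reference_names]
    predictions = []
    for _ in reference_names:
        length = min(rng.choice(lengths), len(vocabulary))
        predictions.append(rng.sample(vocabulary, length))
    return predictions

# Hypothetical test-set names and vocabulary.
reference_names = [["open", "file"], ["read"], ["write", "buffer", "file"]]
vocab = ["buffer", "close", "file", "open", "read", "write"]
random_names = random_predictor(reference_names, vocab)
```

Scoring these predictions with the same token-level metrics as the model gives the chance level against which the Seq2Seq results can be read.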
Regarding the performance difference between the two Seq2Seq networks, the 1%-2% advantage obtained by using pre-trained instruction embeddings is consistent with the results reported in [SAFE].
This paper introduces a solution to the problem of assigning names to functions in stripped binary code, based on a Seq2Seq network architecture borrowed from the machine translation research field.
The solution was tested on a new dataset consisting of 5470954 functions extracted from 22040 packages in the Ubuntu apt repository. Experimental results show some initial positive findings: the network is able to predict correct tokens for some functions, providing useful contextual help for reverse engineers. However, good predictions are restricted to a limited number of cases, and overall performance is still limited.
- for simplicity we assume a functional unit of code to always coincide with a binary function.
- e.g. https://swift.org/documentation/api-design-guidelines/ or https://www.oracle.com/technetwork/java/codeconventions-135099.html