Survey on the Usage of Machine Learning Techniques for Malware Analysis

Daniele Ucci (ucci@diag.uniroma1.it), Research Center of Cyber Intelligence and Information Security, “La Sapienza” University of Rome
Leonardo Aniello (l.aniello@soton.ac.uk), Cyber Security Research Group, University of Southampton
Roberto Baldoni (baldoni@diag.uniroma1.it), Research Center of Cyber Intelligence and Information Security, “La Sapienza” University of Rome
Abstract

Coping with malware is getting more and more challenging, given their relentless growth in complexity and volume. One of the most common approaches in literature is using machine learning techniques to automatically learn models and patterns behind such complexity, and to develop technologies for keeping pace with the speed of development of novel malware. This survey aims at providing an overview on the way machine learning has been used so far in the context of malware analysis in Windows environments, i.e. for the analysis of Portable Executables. We systematize surveyed papers according to their objectives (i.e., the expected output, what the analysis aims at), what information about malware they specifically use (i.e., the features), and what machine learning techniques they employ (i.e., what algorithm is used to process the input and produce the output). We also outline a number of problems concerning the datasets used in considered works, and finally introduce the novel concept of malware analysis economics, regarding the study of existing tradeoffs among key metrics, such as analysis accuracy and economic costs.

keywords:
portable executable, malware analysis, machine learning, benchmark, malware analysis economics
journal: Computers and Security

1 Introduction

Despite the significant improvement of security defence mechanisms and their continuous evolution, malware are still spreading and keep succeeding in pursuing their malicious goals. Malware analysis concerns the study of malicious samples with the aim of developing a deeper understanding of several aspects of malware, including their behaviour, how they evolve over time, and how they intrude specific targets. The outcomes of malware analysis should allow security firms to update their defence solutions, in order to keep pace with malware evolution and consequently prevent new security incidents.

Within the unceasing arms race between malware developers and analysts, each progress of security mechanisms is likely to be promptly followed by the realization of some evasion trick. The ease of overcoming novel defensive measures also depends on how well they capture the malicious traits of samples. For example, a detection rule based on the MD5 hash of a known malware can be easily eluded by applying standard obfuscation techniques: these change the binary of the malware, and thus its hash, but leave its behaviour unmodified. On the other side, detection rules that capture the semantics of a malicious sample are much more difficult to circumvent, as malware developers would have to apply more complex modifications.

Given the importance of producing defensive technologies that are as challenging as possible for malware producers to overcome, a major goal of malware analysis should be to capture aspects and traits having the broadest scope. In this way, the resulting security measures would become harder to circumvent, and consequently the effort required of attackers to adapt existing malware would become infeasible. Machine learning is a natural choice to support such a process of knowledge extraction. The plentiful availability of samples to analyse, and thus of really large training sets, has fostered the adoption of machine learning for malware analysis. Indeed, many works in literature have taken this direction, with a variety of approaches, objectives and obtained results.

1.1 Surveying Approach

This survey aims at reviewing and systematising existing academic works where machine learning is used to support malware analysis of Windows executables, i.e. Portable Executables (PEs). Indeed, although mobile malware represents an ever growing threat, especially for the Android platform, Windows largely remains the preferred target operating system for malware developers (AV-TEST, Security Report 2016/17, 2017: https://www.av-test.org/fileadmin/pdf/security_report/AV-TEST_Security_Report_2016-2017.pdf). We selected 65 recent papers on the basis of their bibliographic significance, and reviewed and systematised them according to three fundamental dimensions:

  • the specific objective of the presented malware analysis (i.e., the output),

  • what types of features they consider (i.e., the input),

  • what machine learning algorithms they consider.

Such a simple characterisation of academic works helps in understanding how machine learning can be used within the context of malware analysis, so as to also identify possible novel relevant objectives that have not been investigated yet. We distinguished 7 different objectives: malware detection, malware variants detection (variants selection and families selection), malware category detection, malware novelty and similarity detection, malware development detection, malware attribution, and malware triage.

The review of the features that can be gathered from a sample provides a comprehensive view of available information, and how it can be used with reference to the identified malware analysis objectives. Smart combinations of this information can lead to the extraction of additional knowledge, to be used to achieve further objectives or refine existing ones. We grouped the features used by surveyed papers in 15 types: strings, byte sequences, opcodes, APIs/System calls, memory accesses, file system, Windows registry, CPU registers, function length, PE file characteristics, raised exceptions, network, AV/Sandbox submissions, and code stylometry.

Examining used algorithms provides an effective overview about how selected inputs can be processed to achieve a specific malware analysis objective. The frequent employment of a particular algorithm to achieve a given objective means that such an algorithm is likely to be really effective for that objective. On the other hand, observing that some class of algorithms has never been used for a certain objective may suggest novel directions to investigate further. We arranged algorithms in 4 classes: signature-based (malicious signature matching, malicious graph matching), classification (rule-based classifier, Bayes classifier, support vector machine, prototype-based classification, decision tree, k-Nearest neighbors, artificial neural network), clustering (clustering with locality sensitive hashing, clustering with distance and similarity metrics, k-Means clustering, density-based spatial clustering of applications with noise, hierarchical clustering, prototype-based clustering, self-organizing maps), and others (expectation maximization, learning with local and global consistency, belief propagation).

1.2 Main Findings

The thorough study we have carried out has highlighted some interesting points that deserve to be dealt with in more detail; indeed, we claim they can be developed to extend and improve current academic research on the usage of machine learning for malware analysis.

1.2.1 Feature Choice

A first point concerns a general lack of proper explanation about the reasons why a specific set of features enables the proper characterisation of the malicious traits of samples. The common approach is to take all the available features, feed them to the chosen machine learning algorithms, and compute accuracy metrics on the obtained (usually good) results. Some works include a feature selection phase where the subset of most determining features is extracted. Except for a few papers, the vast majority does not delve into explaining the connection between considered features and achieved results, which leaves the whole analysis rather incomplete. We advocate the need to properly address this aspect whenever machine learning algorithms are used for malware analysis.

1.2.2 Need for a Benchmark

Another point regards the set of samples used for training and testing the chosen model. Most of the reviewed papers do not describe the employed dataset in detail, nor do they share it publicly, which prevents others from reproducing published results, and thus from properly comparing newly proposed solutions. This is obviously a significant obstacle to streamlining advancements in the field of malware analysis through machine learning. As a matter of fact, in other research areas where reference benchmarks are available, it is easy to prove (and also to disprove) that a novel technique is better than the state of the art, and thus to assert a progress. On the other hand, establishing a benchmark of samples acknowledged by the academic malware analysis community is extremely challenging. Indeed, benchmarks should be as stable as possible over time to be used as reference points for measurements, but malware are characterized by a strong intrinsic evolutionary nature. Novel and more advanced malicious samples are developed daily, hence each malware becomes less interesting from a research perspective as time goes by. Despite this, we believe that more effort should be spent along the direction of enabling researchers to reproduce published results, and thus to correctly compare different solutions. In this regard, we outline some desired properties that a dataset of samples should have to become a valid reference for research purposes.

1.2.3 Malware Analysis Economics

A final point is about the novel concept of malware analysis economics. The final purpose of malware analysis is expanding the knowledge on malware, by means of a learning process that is continuous over time, whose effectiveness can be measured along two dimensions. The first is the pace of knowledge growth, which relates to how fast this knowledge develops with respect to the evolution of malware over time. The second is the accuracy of the knowledge, which refers to the extent such knowledge matches the real characteristics of malware. Both pace and accuracy depend on several factors, some being hard to assess, others being easily measurable. When machine learning comes into play, these quantifiable factors include how many novel samples are considered, how many features are extracted from each sample, and what kinds of algorithms are used. Having bigger datasets at one's disposal (i.e., a larger number of samples) generally leads to learning more accurate malware knowledge, at the cost of greater effort for the collection, feature extraction, and elaboration of a larger number of samples. The required time is likely to increase too, which impacts negatively on the pace of malware knowledge growth. To keep this time as constant as possible, more physical resources should be employed to parallelise to some extent the whole malware analysis process, which in turn entails additional costs because of the higher provisioning requested. What emerges is the existence of a trade-off between the cost of the analysis on the one hand, and the growth pace and accuracy of the acquired knowledge on the other. Analogous trade-offs also apply to what features are extracted from samples, and to what algorithms are used. Since the costs of malware analysis could easily rise to unaffordable levels in an attempt to achieve the highest accuracy and the fastest growth of malware knowledge, a careful understanding of the dynamics of such trade-offs turns out to be highly strategic. Thus, we claim the importance of investigating thoroughly the relationships between what is required to improve the effectiveness of malware analysis (i.e., growth pace and accuracy of obtained malware knowledge) and how much it costs (i.e., in terms of time, realization complexity, and needed resources). This would make it possible to define clear guidelines on setting up a malware analysis process able to meet specific requirements on effectiveness at the minimum cost.

1.3 Related Work

In literature, some works have already addressed the problem of surveying contributions dealing with the usage of machine learning techniques for malware analysis. Bazrafshan et al. Bazrafshan2013 () and Ye et al. Ye2017 () analyse scientific papers on malware detection only. The former article identifies three main methods for detecting malicious software, i.e. based on signatures, behaviours and machine learning techniques, while the latter examines different aspects of malware detection processes, focusing on feature extraction/selection and classification/clustering algorithms. Gandotra et al. Gandotra:2014 () survey papers that use machine learning techniques for malware analysis and only distinguish between malware detection and family classification. In SahuAhirwarHemlata2014 (), the authors focus on papers proposing techniques based on pattern matching to recognize malware. Basu et al. Basu2016 () examine different works relying on data mining and machine learning techniques, whose primary objective is the detection of possibly malicious software. They outline the types of analysis that a malware analyst can carry out and discuss different types of inputs that can be potentially used (e.g. byte sequences, opcodes, PE file characteristics). Compared to our work, the above mentioned surveys focus on just a few malware analysis objectives, and their studies are limited to descriptions of the proposed approaches, without any attempt to structure the surveyed contributions.

At the time of writing, the work most similar to ours is the one published by LeDoux and Lakhotia LeDoux2015 (). Their article points out the problems related to malware analysis and how machine learning can help in solving them. Similarly to our work, they provide a wider overview of machine learning concepts, list a set of features useful for analysing malware, and discuss the complexity of gathering ground truth to evaluate analysis results. However, as the final objective of malware analysis, they only consider the timely detection, removal, and recovery from infection, while in this survey we identify 7 different objectives.

1.4 Paper Structure

The rest of the paper is structured as follows. Section 2 introduces some basic notions on malware analysis. Section 3 outlines the possible objectives of malware analysis, Section 4 describes what types of input data are used for the analysis, and Section 5 reports what machine learning methods are employed. The characterization of surveyed papers according to the inputs, outputs and algorithms described in previous sections is reported in Section 6. Section 7 describes the datasets used in each paper: it discusses sample collections and the issues related to experimental evaluation reproducibility. Malware analysis economics is investigated in Section 8. Finally, conclusions and possible future works are presented in Section 9.

2 Background on Malware Analysis

With malware analysis, we refer to the process of studying a generic sample (i.e., a file), with the aim of acquiring knowledge about its potentially malicious nature. The analysis of a sample includes an initial phase where the required data are extracted from the file, and an elaboration phase where these data are examined, and possibly correlated to some available knowledge base, to gain further added-value information. What information is mined depends on the specific objective to achieve. In the works considered in this survey, the information extraction process is performed through either static or dynamic analysis, or a combination of both, while examination and correlation are carried out by using machine learning techniques. Approaches based on static analysis look at the content of samples without requiring their execution, while dynamic analysis works by running samples to examine their behaviour. Execution traces are indeed among the inputs used in examination and correlation phases when dynamic analysis is employed. For an extensive dissertation on dynamic analyses, refer to EST12 ().

Malware development strategies are in line with software engineering recommendations for what concerns code reuse, in the sense that any specific malware is usually updated to the minimal extent required to evade latest detection techniques. Indeed, as a new malware is discovered by security firms and then neutralised by releasing the correspondent antivirus detection rule (e.g., its signature) or software patch, malware developers are likely to produce a variant of that malware, which keeps most of the code and characteristics of the original version but differs for a few aspects to guarantee its effectiveness in evading current recognition mechanisms. These mechanisms are commonly evaded by employing obfuscation and encryption techniques to automatically generate variants. These variants are referred to as polymorphic and metamorphic malware. Polymorphism changes the appearance of the original malicious code by means of encryption and data appending/prepending. These modifications are performed by mutation engines, usually bundled within the malware itself. The limitation of this variant generation approach is that malware code remains the same once decrypted by a mutation engine, which makes in-memory signature-based detection methods effective. On the other hand, metamorphic malware can still evade these recognition mechanisms thanks to more advanced morphing techniques. These include insertion of a number of No Operation (NOP) and garbage instructions, function reordering, control flow modification, and variation in data structure usage. Malicious software exploiting metamorphism automatically recodes itself before propagating or being redistributed by the attacker. This kind of variants can be detected by focussing on the semantics of an executable.

Variants originating from a same “root” malware are usually grouped in a malware family, which by consequence includes a set of samples sharing many similarities, yet being different enough among each other from the point of view of anti-malware tools.

3 Malware Analysis Objectives

This section details the analysis goals of the surveyed papers, organized in 7 distinct objectives.

3.1 Malware Detection

The most common objective in the context of malware analysis is detecting whether a given sample is malicious. From a practical point of view, this objective is also the most important, because knowing in advance that a sample is dangerous allows preventing it from being harmful to a system. Indeed, the majority of reviewed works has this as their main goal Shultz:2001 (); Kolter:2006 (); Ahmed2009 (); Chau2010 (); FirdausiLimErwinEtAl2010 (); AndersonQuistNeilEtAl2011 (); Santos:2011 (); AndersonStorlieLane2012 (); Yonts2012 (); SantosDevesaBrezoEtAl2013 (); EskandariKhorshidpourHashemi2013 (); Vadrevu2013 (); BaiWangZou2014 (); KruczkowskiSzynkiewicz2014 (); TamersoyRoundyChau2014 (); UppalSinhaMehraEtAl2014 (); Chen:2015 (); ElhadiMaarofBarry2015 (); FengXiongCaoEtAl2015 (); Ghiasi:2015 (); Ahmadi:2015 (); Kwon2015 (); Mao2015 (); Saxe2015 (); WuechnerOchoaPretschner2015 (); Raff2017 (). According to the specific machine learning technique employed in the detection process, the output generated by the analysis can be provided with a confidence value. The higher this value, the more likely the output of the analysis is to be correct. Hence, the confidence value can be used by malware analysts to understand whether a sample under analysis needs further inspection.

3.2 Malware Variants Detection

Developing variants is one of the most effective and cheapest strategies for an attacker to evade detection mechanisms, while reusing as much as possible already available code and resources. Recognizing that a sample is actually a variant of a known malware prevents such a strategy from succeeding, and paves the way to understanding how malware evolve over time through the continuous development of new variants. This objective too has been deeply studied in literature, and several papers included in this survey target the detection of variants. More specifically, we identify two slightly different variations of this objective: variants selection and families selection.

3.3 Malware Category Detection

Malware can be categorized according to their prominent behaviours and objectives. As an example, malicious software can be interested in spying on users’ activities and stealing their sensitive information (i.e., spyware), encrypting documents and asking for a ransom in some cryptocurrency (i.e., ransomware), or gaining remote control of an infected machine (i.e., remote access toolkits). Even if more sophisticated malware fit several behaviours and objectives at once, using these categories is a coarse-grained yet significant way of describing malicious samples Tian:2008 (); Siddiqui:2009 (); ChenRoussopoulosLiangEtAl2012 (); Comar:2013 (); Kwon2015 (); Sexton2015 (). Although cyber security firms have still not agreed upon a standardized taxonomy of malware categories, effectively recognizing the categories of a sample can add valuable information to the analysis.

3.4 Malware Novelty and Similarity Detection

Along the line of acquiring knowledge about unknown samples by comparing them against known malware, it is really interesting to identify the specific similarities and differences of the binaries under analysis with respect to those already analysed and stored in the knowledge base. We can distinguish between two distinct types of novelty and similarity detection.

3.5 Malware Development Detection

An undeniable advantage for malware developers is the wide availability of the most used defence mechanisms, such as antiviruses, sandboxes, and online scanning services. Indeed, these tools can be used to test the evasion capabilities of the samples being developed. The latter can consequently be refined to avoid being detected by specific technologies, which can also depend on the actual targets of the attackers. Malware analysts can leverage this practice by analysing the submissions of samples to online services, like VirusTotal and public sandboxes, in order to identify those submissions that seem related to the testing of novel malware ChenRoussopoulosLiangEtAl2012 (); Graziano:2015 (). In particular, by analysing submissions and their metadata, researchers found out that malicious samples involved in famous targeted attacks had been previously submitted to Anubis Sandbox Graziano:2015 ().

3.6 Malware Attribution

Another aspect malware analysts are interested in is the identification of who developed a given malware Caliskan-Islam:2015 (). Anytime a cyber threat is detected, a three-level analysis can be carried out: technical, operational, and strategic. From a technical perspective, a malware analyst looks at specific indicators in the executable: what programming language has been used, whether it contains any IP address or URL, and the language of comments and resources. Another distinctive trait, in case the malware exchanges messages with a command and control center, is the time slot in which the attacker triggers the malware. The operational analysis consists in correlating technical information related to other cyber threats that share similarities with the malicious sample under analysis. During the strategic analysis, the extracted technical and operational knowledge can be merged with intelligence information and political evaluations in the attempt of attributing a (set of) malware sample(s) to a cyber threat actor or group.

3.7 Malware Triage

A last objective addresses the need to provide a fast and accurate prioritization for new samples when they come at a fast rate and have to be analysed. This process is referred to as triage, and is becoming relevant because of the growing volume of new samples developed daily. Malware triage shares some aspects with the detection of variants, novelties and similarities, since these give key information to support the prioritization process. Nevertheless, triage should be considered a different objective because it requires faster execution at the cost of possibly lower accuracy, hence different techniques are usually employed JangBrumleyVenkataraman2011 (); Kirat2013 ().

4 Malware Analysis Features

This section provides an overview on the data used by reviewed papers to achieve the objectives outlined in Section 3. The features given as input to machine learning algorithms derive from these data. Since almost all the works we examined considered Windows executables, the inputs taken into account are extracted from the content of the PEs themselves or from traces and logs related to their execution.

In many cases, surveyed works only refer to macro-classes without mentioning the specific features they employed. As an example, when n-grams are used, only a minority of works mention the size of n. Whenever possible, for each feature type we provide a table reporting what specific features are used, with proper references.

4.1 Strings

A PE can be inspected to explicitly look for the strings it contains, such as code fragments, author signatures, file names, and system resource information. These types of strings have been shown Shultz:2001 () to provide valuable information for the malware analysis process (see Table 1). Once cleartext strings are extracted, it is possible to gather information like the number and presence of specific strings, which can unveil key cues to gain additional knowledge on a PE Shultz:2001 (); IslamTianBattenEtAl2013 (); Saxe2015 (). In Ahmadi:2015 (), the authors use histograms representing how string lengths are distributed in the sample.

String extraction tools. Strings (https://technet.microsoft.com/en-us/sysinternals/strings.aspx) and pedump (https://github.com/zed-0xff/pedump) are two well-known tools for extracting strings from a PE. While pedump outputs the list of the strings found in a Windows executable, Strings allows using wild-card expressions and tuning search parameters. Conversely to Strings, pedump is able to detect most common packers, hence it can also be used when the PE is packed. Both tools fail if the strings contained in the executable are obfuscated. Another remarkable tool is FLOSS (FireEye Labs Obfuscated Strings Solver: https://github.com/fireeye/flare-floss), which combines different static analysis techniques to deal with obfuscated strings found in analysed samples.

Strings
Number of strings; presence of “GetProcAddress”, “CopyMemory”, “CreateFileW”, “OpenFile”, “FindFirstFileA”, “FindNextFileA”, “RegQueryValueExW” IslamTianBattenEtAl2013 ()
Distribution of string lengths Ahmadi:2015 ()
Table 1: List of features employed in the surveyed papers for the input type Strings
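
To make these string-based features concrete, the following Python sketch (an illustration, not code from any surveyed paper) extracts printable ASCII strings from a binary, much as the tools above do, and derives a few features; the minimum string length, the histogram bins, and the file path are arbitrary assumptions.

import re
from collections import Counter

def extract_strings(path, min_len=4):
    # Runs of printable ASCII bytes, similar to what the Strings tool reports.
    with open(path, "rb") as f:
        data = f.read()
    return [s.decode("ascii") for s in re.findall(rb"[ -~]{%d,}" % min_len, data)]

def string_length_histogram(strings, bins=(5, 10, 20, 50, 100)):
    # Count strings falling into each (arbitrary) length bucket.
    hist = Counter()
    for s in strings:
        hist[next((b for b in bins if len(s) <= b), "longer")] += 1
    return hist

strings = extract_strings("sample.exe")
features = {"num_strings": len(strings),
            "has_GetProcAddress": any("GetProcAddress" in s for s in strings)}
features.update(string_length_histogram(strings))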

4.2 Byte sequences

A binary can be characterised by computing features on its byte-level content. Analysing the specific sequences of bytes in a PE is a widely employed technique (see Table 2). A few works use chunks of bytes of specific sizes Shultz:2001 (); SrakaewPiyanuntcharatsr:2015 (); Raff2017 (), while many others rely on n-grams  Kolter:2006 (); AndersonQuistNeilEtAl2011 (); JangBrumleyVenkataraman2011 (); RieckTriniusWillemsEtAl2011 (); AndersonStorlieLane2012 (); DahlStokesDengEtAl2013 (); UppalSinhaMehraEtAl2014 (); Ahmadi:2015 (); Chen:2015 (); FengXiongCaoEtAl2015 (); Lin:2015 (); Sexton2015 (); SrakaewPiyanuntcharatsr:2015 (); UpchurchZhou2015 (); WuechnerOchoaPretschner2015 ().

An n-gram is a sequence of n bytes, and features correspond to the different combinations of these bytes, namely each feature represents how many times a specific combination of bytes occurs in the binary. Different works use n-grams of diverse sizes. Among those that have specified the size, the majority relies on sequences no longer than 3 (i.e., trigrams) SrakaewPiyanuntcharatsr:2015 (); Lin:2015 (); Ahmadi:2015 (); AndersonQuistNeilEtAl2011 (); AndersonStorlieLane2012 (); Sexton2015 (); DahlStokesDengEtAl2013 (); IslamTianBattenEtAl2013 (). Indeed, the number of features to consider grows exponentially with n.

Byte sequences
Chunks of a fixed size in KB, or equal to the file size SrakaewPiyanuntcharatsr:2015 ()
1-grams SrakaewPiyanuntcharatsr:2015 (); Lin:2015 (); Ahmadi:2015 ()
2-grams AndersonQuistNeilEtAl2011 (); AndersonStorlieLane2012 (); SrakaewPiyanuntcharatsr:2015 (); Lin:2015 (); Sexton2015 ()
3-grams DahlStokesDengEtAl2013 (); SrakaewPiyanuntcharatsr:2015 (); Lin:2015 (); IslamTianBattenEtAl2013 ()
4-grams UppalSinhaMehraEtAl2014 (); Lin:2015 ()
5-grams, 6-grams Lin:2015 ()
Table 2: List of features employed in the surveyed papers for the input type Byte sequences
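
As a minimal sketch of byte n-gram extraction, the following Python fragment counts the occurrences of every n-byte combination in a binary; the file path is a placeholder, and raw occurrence counts are assumed to be the desired feature values.

from collections import Counter

def byte_ngrams(path, n=2):
    # Each key is one n-byte combination; each value its occurrence count.
    with open(path, "rb") as f:
        data = f.read()
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))

# With n = 2 there are up to 256**2 = 65536 possible features, which is
# why most works keep n at 3 or below.
features = byte_ngrams("sample.exe", n=2)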

4.3 Opcodes

Opcodes identify the machine-level operations executed by a PE, and can be extracted by examining the assembly code Siddiqui:2009 (); Ye2010 (); AndersonQuistNeilEtAl2011 (); AndersonStorlieLane2012 (); ShoshaLiuGladyshevEtAl2012 (); HuShinBhatkarEtAl2013 (); Kong:2013 (); SantosBrezoUgarte-PedreroEtAl2013 (); SantosDevesaBrezoEtAl2013 (); Ahmadi:2015 (); Gharacheh:2015 (); KhodamoradiFazlaliMardukhiEtAl2015 (); Pai:2015 (); Sexton2015 (); SrakaewPiyanuntcharatsr:2015 (). As shown in Table 3, opcode frequency is a type of feature employed in some surveyed papers. It measures the number of times each specific opcode appears within the assembly of a PE, or is executed by it Ye2010 (); KhodamoradiFazlaliMardukhiEtAl2015 (). Others AndersonStorlieLane2012 (); KhodamoradiFazlaliMardukhiEtAl2015 () count opcode occurrences by aggregating them by operation scope, e.g., math instructions, memory access instructions. Similarly to n-grams, sequences of opcodes are also used as features Ye2010 (); Gharacheh:2015 (); KhodamoradiFazlaliMardukhiEtAl2015 (); SrakaewPiyanuntcharatsr:2015 (). Given the executable, the Interactive DisAssembler (IDA: https://www.hex-rays.com/products/ida/index.shtml) is the most popular commercial solution for extracting the assembly code.

Opcodes
Branch instruction count, privileged instruction count, and memory instruction count AndersonStorlieLane2012 ()
Math instruction count, logic instruction count, stack instruction count, NOP instruction count, and other instruction count AndersonStorlieLane2012 (); KhodamoradiFazlaliMardukhiEtAl2015 ()
Sequences of length 1 and 2 SantosBrezoUgarte-PedreroEtAl2013 ()
Instruction frequencies Ahmadi:2015 (); KhodamoradiFazlaliMardukhiEtAl2015 ()
Data define instruction proportions Ahmadi:2015 ()
Count of instructions manipulating a single-bit, data transfer instruction count, data conversion instruction count, pointer-related instruction count, compare instruction count, jump instruction count, I/O instruction count, and set byte on conditional instruction count KhodamoradiFazlaliMardukhiEtAl2015 ()
Table 3: List of features employed in the surveyed papers for the input type Opcodes
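
Although the surveyed works typically rely on IDA, opcode frequency extraction can be approximated with the open-source pefile and Capstone libraries, as in the hedged sketch below; it assumes a 32-bit PE and an illustrative file path, and is not the pipeline of any specific paper.

import pefile
from collections import Counter
from capstone import Cs, CS_ARCH_X86, CS_MODE_32

def opcode_frequencies(path):
    # Disassemble the executable sections and count opcode mnemonics.
    pe = pefile.PE(path)
    md = Cs(CS_ARCH_X86, CS_MODE_32)  # assumes a 32-bit PE
    counts = Counter()
    for section in pe.sections:
        if section.Characteristics & 0x20000000:  # IMAGE_SCN_MEM_EXECUTE
            base = pe.OPTIONAL_HEADER.ImageBase + section.VirtualAddress
            for insn in md.disasm(section.get_data(), base):
                counts[insn.mnemonic] += 1
    return counts

print(opcode_frequencies("sample.exe").most_common(10))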

4.4 APIs/System calls

Similarly to opcodes, APIs and system calls enable the analysis of samples’ behaviour, but at a higher level (see Table 4). They can be extracted by analysing either the disassembled code (to get the list of all calls that can be potentially executed) or the execution traces (for the list of calls actually invoked). While APIs allow characterising in general what actions are executed by a sample Ahmed2009 (); ShoshaLiuGladyshevEtAl2012 (); IslamTianBattenEtAl2013 (); Kong:2013 (); BaiWangZou2014 (); Egele:2014 (); Ahmadi:2015 (); KawaguchiOmote2015 (); LiangPangDai2016 (), looking at system call invocations provides a specific view on the interaction of the PE with the operating system LeeMody2006 (); BayerComparettiHlauschekEtAl2009 (); ParkReevesMulukutlaEtAl2010 (); RieckTriniusWillemsEtAl2011 (); AndersonStorlieLane2012 (); KwonLee2012 (); DahlStokesDengEtAl2013 (); PalahanBabicChaudhuriEtAl2013 (); SantosDevesaBrezoEtAl2013 (); Egele:2014 (); UppalSinhaMehraEtAl2014 (); Asquith2015 (); ElhadiMaarofBarry2015 (); Mao2015 (). The data extracted by observing APIs and system calls can be really large, and many works carry out additional processing to reduce the feature space by using convenient data structures. The next subsections give an overview of the data structures used in surveyed papers.

APIs/System calls
Process spawning BaileyOberheideAndersenEtAl2007 (); LindorferKolbitschComparetti2011 (); ChionisNikolopoulosPolenakis2013 ()
Syscall dependencies BayerComparettiHlauschekEtAl2009 (); PalahanBabicChaudhuriEtAl2013 (); ElhadiMaarofBarry2015 ()
Syscall sequences ParkReevesMulukutlaEtAl2010 (); KwonLee2012 (); UppalSinhaMehraEtAl2014 (); Asquith2015 ()
MIST representation RieckTriniusWillemsEtAl2011 ()
“RegOpenKeyEx” count, “RegOpenKeyExW” count, “RegQueryValueExW” count, “Compositing” count, “MessageBoxW” count, “LoadLibraryW” count, “LoadLibraryExW” count, “0x54” count IslamTianBattenEtAl2013 ()
Referred APIs count, Referred DDLs count BaiWangZou2014 ()
Process activity Lin:2015 ()
Is API ‘x’ present in the analysed sample? KawaguchiOmote2015 ()
Table 4: List of features employed in the surveyed papers for the input type APIs/System calls
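
A minimal sketch of how statically referred APIs can be collected with the pefile library (dynamically invoked calls would instead require a sandbox); the boolean feature at the end mirrors the “is API ‘x’ present?” style of KawaguchiOmote2015 (), with a hypothetical API name and file path.

import pefile

def imported_apis(path):
    # Walk the import table of the PE.
    pe = pefile.PE(path)
    apis = set()
    for entry in getattr(pe, "DIRECTORY_ENTRY_IMPORT", []):
        for imp in entry.imports:
            if imp.name:  # imports by ordinal have no name
                apis.add(imp.name.decode(errors="replace"))
    return apis

apis = imported_apis("sample.exe")
features = {"uses_LoadLibraryW": "LoadLibraryW" in apis,  # hypothetical feature
            "referred_api_count": len(apis)}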

4.4.1 Call graphs and data dependent call graphs

Call graphs allow analysing how data is exchanged among procedure calls Ryder:1979 () by storing the relationships among these procedures, possibly including additional information on variables and parameters. Call graphs have been employed in KwonLee2012 (); Kong:2013 (); ElhadiMaarofBarry2015 (); Graziano:2015 () to extract relationships among invoked functions. API call graphs have been subsequently extended in ParkReevesMulukutlaEtAl2010 (); ChionisNikolopoulosPolenakis2013 (); ElhadiMaarofBarry2015 () by integrating data dependencies among APIs. Formally, two API calls are data dependent if the one invoked afterwards uses a parameter given in input to the other.

4.4.2 Control flow graphs, enriched control flow graphs, and quantitative data flow graphs

Control flow graphs were introduced to allow compilers to produce an optimized version of a program Allen:1970 (). Each graph models control flow relationships, which can be used to represent the behaviour of a PE and extract the program structure. Works employing control flow graphs for sample analysis are AndersonStorlieLane2012 (); ChionisNikolopoulosPolenakis2013 (); Graziano:2015 (). In EskandariKhorshidpourHashemi2013 (), graph nodes are enriched with dynamic analysis information for deciding whether conditional jumps are actually executed. Wüchner et al. WuechnerOchoaPretschner2015 () use quantitative data flow graphs to model the amount of data flowing among system resources. Analogously, Polino et al. leverage data flow analysis in order to track data dependencies and control operations in main memory Polino:2015 ().

4.4.3 Multilayer dependency chains

These chains represent function templates organized according to sample behaviors LiangPangDai2016 (). Each behavior is stored in a different chain and is characterized by six sub-behaviors capturing the interactions between samples and the system where they run (i.e., interactions with files, the Windows Registry, services, processes, the network, and the operating system itself). Thus, each chain contains a complete sequence of API calls invoked by a sample, along with call metadata. All this information provides analysts with a more detailed characterization of malware behaviors.

4.4.4 Causal dependency graphs

They were initially proposed in BaileyOberheideAndersenEtAl2007 () for tracking the activities of a malware by monitoring persistent state changes in the target system. These persistent state changes make it possible to define malware behaviour profiles and recognize classes of malware exhibiting similar behaviours. In Kong:2013 (), causal dependency graphs are used to discover the entry point exploited by an attacker to gain access to a system.

4.4.5 Influence graphs

Influence graphs are directed graphs that encode information about the downloads performed by samples. They prove particularly effective in detecting droppers Kwon2015 ().

4.4.6 Markov chains and Hidden Markov Models

Markov chains are memoryless random processes evolving with time. In Ahmed2009 (); AndersonQuistNeilEtAl2011 (); AndersonStorlieLane2012 (); Sexton2015 (), Markov chains are used to model the binary content and execution traces of a sample to perform malware classification. Similarly to the approach designed in EskandariKhorshidpourHashemi2013 () for building enriched control flow graphs, instructions are categorized in 8 coarse-grained classes (e.g., math and logic instructions).
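
The following sketch illustrates this kind of Markov chain modelling: given a sequence of coarse-grained instruction classes, it estimates a transition matrix whose flattened form can serve as a fixed-length feature vector. The 8 class names are placeholders, since the exact categorization varies across papers.

import numpy as np

# Hypothetical coarse-grained instruction classes.
CLASSES = ["math", "logic", "memory", "stack", "branch", "call", "nop", "other"]

def transition_matrix(sequence):
    # Estimate transition probabilities from an observed class sequence.
    idx = {c: i for i, c in enumerate(CLASSES)}
    counts = np.zeros((len(CLASSES), len(CLASSES)))
    for a, b in zip(sequence, sequence[1:]):
        counts[idx[a], idx[b]] += 1
    rows = counts.sum(axis=1, keepdims=True)
    return np.divide(counts, rows, out=np.zeros_like(counts), where=rows > 0)

# The flattened matrix is a fixed-length feature vector.
features = transition_matrix(["math", "memory", "branch", "math"]).flatten()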

Hidden Markov models are probabilistic functions of Markov chains. The states of a hidden Markov model are unobservable, while the output of a state can be observed and obeys a probability distribution (continuous or discrete). Pai et al. train various hidden Markov models with 2 hidden states to recognize malware belonging to specific families Pai:2015 ().

4.5 Memory accesses

Any data of interest, such as user generated content, Windows Registry keys, configuration and network activity, passes through main memory, hence analysing how memory is accessed can reveal important information about the behaviour of an executable Hal:2012 () (see Table 5). Some works ShoshaLiuGladyshevEtAl2012 (); Kong:2013 () either rely on Virtual Machine Monitoring and Introspection techniques, or statically trace reads and writes in main memory. Egele et al. dynamically trace the values read from and written to stack and heap Egele:2014 ().

Memory analysis tools. Different open-source tools are available to analyse memory during sample execution, such as Volatility (http://www.volatilityfoundation.org/25) and Rekall (http://www.rekall-forensic.com/pages/at_a_glance.html), both included in the SANS Investigative Forensic Toolkit (SIFT: https://digital-forensics.sans.org/community/downloads).

Memory accesses
Performed read and write operations in main memory Kong:2013 ()
Values read/written from/to stack and heap Egele:2014 ()
Table 5: List of features employed in the surveyed papers for the input type Memory accesses

4.6 File system

What samples do with files is fundamental to grasp evidence about their interaction with the environment and possible attempts to gain persistence. Features of interest mainly concern how many files are read or modified, what types of files and in what directories, and which files appear in non-infected/infected machines LeeMody2006 (); BaileyOberheideAndersenEtAl2007 (); Chau2010 (); ChionisNikolopoulosPolenakis2013 (); Kong:2013 (); Graziano:2015 (); Lin:2015 (); Mao2015 (); MohaisenAlrawiMohaisen2015 () (see Table 6). Sandboxes and memory analysis toolkits include modules for monitoring interactions with the file system, usually modelled by counting the number of files created/deleted/modified by the PE under analysis. In MohaisenAlrawiMohaisen2015 (), the size of these files is considered as well, while Lin et al. leverage the number of created hidden files Lin:2015 ().

File system access analysis tools. Activities performed on the file system can be monitored using programs like MS Windows Process Monitor (https://docs.microsoft.com/en-us/sysinternals/downloads/procmon) and SysAnalyzer (http://sandsprite.com/iDef/SysAnalyzer/). While SysAnalyzer by default takes snapshots over a user-defined time interval to reduce the amount of data presented to malware analysts, Process Monitor has been designed for real-time monitoring. Nevertheless, SysAnalyzer can also be used in live-logging mode. ProcDOT (http://www.procdot.com) allows the integration of Process Monitor with network traces, produced by standard network sniffers (e.g. Windump), and provides an interactive visual analytics tool to analyse the binary's activity.

File system
Number of created/deleted/modified files, size of created files MohaisenAlrawiMohaisen2015 ()
Number of created hidden files Lin:2015 ()
Table 6: List of features employed in the surveyed papers for the input type File system

4.7 Windows Registry

The registry is one of the main sources of information for a PE about the environment, and also represents a fundamental tool for hooking into the operating system, for example to gain persistence. Discovering what keys are queried, created, deleted and modified can shed light on many significant characteristics of a sample LeeMody2006 (); ShoshaLiuGladyshevEtAl2012 (); Lin:2015 (); Mao2015 (); MohaisenAlrawiMohaisen2015 () (see Table 7). Usually, works relying on file system inputs (see Section 4.6) also monitor the Windows Registry. In ShoshaLiuGladyshevEtAl2012 (), changes to the Windows Registry are used in conjunction with file system accesses.

Windows Registry analysis tools. Process Monitor, introduced in Section 4.6, also takes care of detecting changes to the Windows Registry. Similarly, RegShot (https://sourceforge.net/projects/regshot/) is an open-source lightweight tool for examining variations in the Windows Registry by taking a snapshot before and after the sample is executed. Since it is based on snapshots, malware analysts are not overwhelmed with data belonging to the entire execution of the samples. As mentioned for memory and file system accesses, sandboxes usually monitor Windows Registry key creations/deletions/modifications, reporting the occurred changes.

Windows Registry
Number of created/deleted/modified Registry keys LeeMody2006 (); ShoshaLiuGladyshevEtAl2012 (); Lin:2015 (); MohaisenAlrawiMohaisen2015 ()
Table 7: List of features employed in the surveyed papers for the input type Windows Registry

4.8 CPU registers

The way CPU registers are used can also be a valuable indication, including whether any hidden register is used, and what values are stored in the registers, especially in the FLAGS register (see Table 8). Both Kong:2013 () and Ahmadi:2015 () rely on static analysis of registers, whereas Egele:2014 () and Ghiasi:2015 () employ a dynamic approach. Some works examine register use to detect malware variants Kong:2013 (); Ahmadi:2015 (); Ghiasi:2015 (). While in Kong:2013 () and Ahmadi:2015 () the objective is to identify samples belonging to one or more specific families, Ghiasi:2015 () aims to select the variants of the malware under analysis. Static analyses of CPU registers involve counting the reads and writes performed on each register Kong:2013 (), tracking the number of changes to FLAGS Kong:2013 (), and measuring the frequency of register usage Ahmadi:2015 (). Conversely, Ghiasi:2015 () applies dynamic analysis to get features associated with the values contained in CPU registers. Instead, Egele:2014 () monitors only returned values, with the objective of detecting similarities among executables.

Registers
No. of changes to FLAGS register, register read/write count Kong:2013 ()
Returned values in the eax register upon function completion Egele:2014 ()
Registers usage frequency Ahmadi:2015 ()
Registers values Ghiasi:2015 ()
Table 8: List of features employed in the surveyed papers for the input type CPU registers

4.9 Function length

Another characterizing feature is the function length, measured as the number of bytes contained in a function. In particular, the frequency of function lengths is used ChionisNikolopoulosPolenakis2013 () (see Table 9). This input alone is not sufficient to discriminate malicious executables from benign software; indeed, it is usually employed in conjunction with other features. This idea, formulated in Tian:2008 (), is adopted in IslamTianBattenEtAl2013 (), where function length frequencies are combined with other static and dynamic features.

Function length
Function length frequencies IslamTianBattenEtAl2013 ()
Linearly/polynomially/exponentially spaced bins of length ranges ChionisNikolopoulosPolenakis2013 ()
Table 9: List of features employed in the surveyed papers for the input type Function length

4.10 PE file characteristics

A static analysis of a PE can provide a large set of valuable information, such as sections, imports, symbols, and used compilers (see Table 10). As malware generally present slight differences with respect to benign PEs, this information can be used to understand whether a file is malicious or not LeeMody2006 (); Yonts2012 (); Kirat2013 (); BaiWangZou2014 (); Asquith2015 (); Saxe2015 ().

PE file characteristics
Resource icon’s checksum LeeMody2006 (); Asquith2015 ()
Number of symbols, pointer to symbol table, PE timestamp, and PE characteristics flags Yonts2012 ()
Section count Yonts2012 (); BaiWangZou2014 ()
Resource’s directory table, items in .reloc section count and symbols in export table count BaiWangZou2014 ()
Disassembled file size, sample size, number of lines in the disassembled file, first bytes sequence address, entropy, and symbol frequencies Ahmadi:2015 ()
PE header checksum, resource strings’ checksum, resource metadata checksum, section names, section sizes, import table location, import table size, and entry point offset Asquith2015 ()
Section attributes Ahmadi:2015 (); Asquith2015 ()
Table 10: List of features employed in the surveyed papers for the input type PE file characteristics
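
Many of the features in Table 10 can be read directly from the PE headers; the sketch below, based on the pefile library, shows an illustrative selection rather than the exact feature set of any surveyed work.

import pefile

def pe_header_features(path):
    # An illustrative selection of static header features (cf. Table 10).
    pe = pefile.PE(path)
    return {
        "timestamp": pe.FILE_HEADER.TimeDateStamp,
        "section_count": pe.FILE_HEADER.NumberOfSections,
        "symbol_count": pe.FILE_HEADER.NumberOfSymbols,
        "entry_point": pe.OPTIONAL_HEADER.AddressOfEntryPoint,
        "section_names": [s.Name.rstrip(b"\x00").decode(errors="replace")
                          for s in pe.sections],
        # Shannon entropy per section; packed or encrypted sections score high.
        "section_entropies": [s.get_entropy() for s in pe.sections],
    }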

4.11 Raised exceptions

The analysis of the exceptions raised during execution can help in understanding what strategies a malware adopts to evade analysis systems SantosDevesaBrezoEtAl2013 (); Asquith2015 (). A common trick to deceive analysts is throwing an exception to run a malicious handler, registered at the beginning of malware execution. In this way, examining the control flow becomes much more complex. This is the case of version 2.08 of the Andromeda botnet (New Anti-Analysis Tricks In Andromeda 2.08: https://blog.fortinet.com/2014/05/19/new-anti-analysis-tricks-in-andromeda-2-08). Even if such version is outdated, Andromeda is still active and targets victims with spam campaigns (Andromeda Botnet Targets Italy in Recent Spam Campaigns: http://researchcenter.paloaltonetworks.com/2016/07/unit42-andromeda-botnet-targets-italy-in-recent-spam-campaigns/).

4.12 Network

A wealth of key information can be obtained by observing how the PE interacts with the network. Contacted addresses, generated traffic, and received packets can unveil valuable aspects, e.g., regarding communications with a command and control center. Statistics on used protocols, TCP/UDP ports, HTTP requests, and DNS-level interactions are other features of this type (see Table 11). Many surveyed works require dynamic analysis to extract this kind of information LeeMody2006 (); BaileyOberheideAndersenEtAl2007 (); BayerComparettiHlauschekEtAl2009 (); LindorferKolbitschComparetti2011 (); ChionisNikolopoulosPolenakis2013 (); NariGhorbani2013 (); Graziano:2015 (); Kwon2015 (); MohaisenAlrawiMohaisen2015 (); LiangPangDai2016 (). Other papers extract network-related inputs by monitoring the network and analysing incoming and outgoing traffic Comar:2013 (); KruczkowskiSzynkiewicz2014 (); Vadrevu2016 (). A complementary approach consists in analysing the download patterns of network users in a monitored network Vadrevu2013 (). It does not require sample execution and focuses on network features related to the download of a sample, such as the website from which the file has been downloaded.

Network
Connection count BaileyOberheideAndersenEtAl2007 (); Graziano:2015 (); MohaisenAlrawiMohaisen2015 ()
TCP flags, packet direction, total packet count, total count of packets with no payload, total transferred byte count, total transferred payload byte count, flow duration, average inter-arrival time, size of the i-th packet, inter-arrival time of the i-th packet, payload size of the i-th packet, maximum payload size, minimum payload size, average payload size, payload size standard deviation, maximum packet size, minimum packet size, average packet size, and packet size standard deviation NariGhorbani2013 ()
Download domain, download history, download request, and queried URLs Vadrevu2013 ()
Destination IP Vadrevu2013 (); KruczkowskiSzynkiewicz2014 ()
Timestamp KruczkowskiSzynkiewicz2014 ()
Unique IP count, protocol type, HTTP request/response type count, DNS A/PTR/CNAME/MX/NS/SOA record lookup count, request/response packet size MohaisenAlrawiMohaisen2015 ()
Table 11: List of features employed in the surveyed papers for the input type Network
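
As an illustration of how trace-level statistics of the kind listed by NariGhorbani2013 () can be computed, the following sketch processes a packet capture with the scapy library; the chosen feature subset and the assumption that a pcap of the sample's traffic is available are ours.

from statistics import mean, pstdev
from scapy.all import rdpcap, IP

def flow_features(pcap_path):
    # Per-trace packet statistics computed from a capture taken while
    # the sample was executing in an instrumented environment.
    pkts = [p for p in rdpcap(pcap_path) if p.haslayer(IP)]
    sizes = [len(p) for p in pkts]
    gaps = [float(b.time - a.time) for a, b in zip(pkts, pkts[1:])]
    return {
        "total_packets": len(pkts),
        "unique_dst_ips": len({p[IP].dst for p in pkts}),
        "avg_packet_size": mean(sizes) if sizes else 0,
        "packet_size_std": pstdev(sizes) if sizes else 0,
        "avg_inter_arrival": mean(gaps) if gaps else 0,
    }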

4.13 AV/Sandbox submissions

Malware developers may use online public services like VirusTotal (https://www.virustotal.com) and Malwr (https://malwr.com) to test the effectiveness of their samples in evading the most common antiviruses. Querying these online services can provide additional information useful for the analysis: submission time, how many online antiviruses classify the sample as malicious, and other data on the submission (see Table 12). Graziano et al. Graziano:2015 () leverage submissions to a sandbox for identifying cases of malware development. Surprisingly, samples used in infamous targeted campaigns had been submitted to public sandboxes months or years earlier.

AV/Sandbox submissions
Occurred errors  SantosDevesaBrezoEtAl2013 (); Asquith2015 (); Graziano:2015 ()
Created hidden files/registries, hidden connections, process/file activity, frequencies of specific words in the AV/Sandbox report Graziano:2015 (); Lin:2015 ()
Count of registry types and registries modifications Graziano:2015 (); MohaisenAlrawiMohaisen2015 ()
PE file characteristics, timestamps, AV labels, submitting user information, and sandbox analysis results Graziano:2015 ()
Table 12: List of features employed in the surveyed papers for the input type AV/Sandbox submissions

4.14 Code stylometry

Features related to the coding style of an anonymous malware author can reveal important details about her identity. Unfortunately, this kind of features requires the availability of source code, which is very rare in malware analysis. However, in case of leaks and/or public disclosures, source code can become available. In Caliskan-Islam:2015 (), the author's coding style of generic software (i.e. not necessarily malicious) is captured through syntactic, lexical, and layout features (see Table 13). These are extracted both from the source code and from the abstract syntax tree representing it. Syntactic features model keywords and the properties of the abstract syntax tree, while lexical and layout features allow gathering information about the author's code-writing preferences.

Code stylometry
Source code’s lexical, layout, and syntactic properties Caliskan-Islam:2015 ()
Table 13: List of features employed in the surveyed papers for the input type Code stylometry

5 Malware Analysis Algorithms

In this section we briefly describe the machine learning algorithms used by the reviewed papers. Different algorithms aim at different goals, e.g., finding a match with respect to some available knowledge base, classifying samples by assigning them labels, or clustering PEs on the basis of some metric. We accordingly organize malware analysis algorithms in four categories: signature-based (Section 5.1), classification (Section 5.2), clustering (Section 5.3), and others (Section 5.4).

5.1 Signature-based

Signature-based approaches are widely employed by commercial antiviruses to detect malicious samples. Signatures are computed by software analysts to find patterns in the potentially malicious samples under analysis. Many surveyed works propose to extract signatures from call graphs, control flow graphs, and behavior profiles. When performed by humans, this task is error-prone and time-consuming EST12 (). Indeed, these signatures should be general enough to detect variants of the same malware but, if too general, legitimate samples could be mistakenly classified as malicious, with a consequent non-negligible number of false positives.

5.1.1 Malicious signature matching

Malicious signature matching approaches proceed in two phases. In a preliminary phase, malware signatures are collected inside a knowledge base (e.g., a storage unit). In the second phase, signatures are extracted from the samples under analysis and then compared with those stored in the knowledge base. If one or more matches are found, the next steps depend on the objective of the analysis, e.g., samples are marked as malicious, or are assigned a specific label. Malicious signature matching has been used both for malware detection in FengXiongCaoEtAl2015 () and for malware variants selection in KwonLee2012 () and ShoshaLiuGladyshevEtAl2012 ().
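
A toy illustration of this two-phase scheme, with the knowledge base reduced to an in-memory set; the MD5 digest and byte pattern of the EICAR test file stand in for real malicious signatures.

import hashlib

# Toy knowledge base built in the preliminary phase.
KNOWN_HASHES = {"44d88612fea8a8f36de82e1278abb02f"}  # EICAR test file MD5
KNOWN_PATTERNS = [b"X5O!P%@AP"]                      # EICAR prefix

def matches_signature(path):
    # Second phase: compare the sample against every stored signature.
    with open(path, "rb") as f:
        data = f.read()
    if hashlib.md5(data).hexdigest() in KNOWN_HASHES:
        return True
    return any(pattern in data for pattern in KNOWN_PATTERNS)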

5.1.2 Malicious graph matching

Signatures can also be computed from graphs representing specific features or behaviours of the sample under analysis. Similarly to malicious signature matching, these approaches need an initial phase where graph representations are extracted from a dataset of samples and stored in the knowledge base. The matching procedure, instead, varies from work to work. As an example, while in ElhadiMaarofBarry2015 () signatures are extracted from data dependent call graphs (see Section 4.4.1) transformed into strings, Park et al. measure the similarity among data dependent graphs by calculating the maximal common subgraph distance ParkReevesMulukutlaEtAl2010 (). Similarly to Park et al., KwonLee2012 () represent sample behaviour with graphs as well, and matching is performed by XORing matrix representations of behavioural graphs. Malicious graph matching is also applied in ShoshaLiuGladyshevEtAl2012 () to generate evasion-resistant signatures.

5.2 Classification

Classification is the task of assigning an observation to a specific category, and it is performed through a statistical model called a classifier. A classifier takes as input a vector of features representing measurable properties of an observation. In the following, several classifier implementations are discussed.

5.2.1 Rule-based classifier

Rule-based classification relies on a set of conditions consisting of a series of tests that, if successfully passed, allow the classifier to label the instances accordingly. These tests can be connected by logical conjunctions or more general logical expressions Witten:2005 (), as in Shultz:2001 () and FengXiongCaoEtAl2015 (). Conditions can also be applied to model similarity Ghiasi:2015 (); Sexton2015 (); LiangPangDai2016 () and the exceeding of distance thresholds Tian:2008 (), as well as scores. To this end, Lindorfer et al. use a rule-based classifier to compute the probability that a sample implements evasion techniques LindorferKolbitschComparetti2011 (). In Ahmed2009 (), a rule learner is employed to detect malicious software.

5.2.2 Bayes Classifier

Bayesian models are usually employed for document classification. Given a document and a fixed set of classes, a Bayesian model outputs the predicted class of the input document. Bayesian models perform best when features are completely independent, boolean, and not redundant. The more redundant the features, the more the classification is biased towards such features. Many surveyed works apply Bayesian models to malware analysis SantosBrezoUgarte-PedreroEtAl2013 (); SantosDevesaBrezoEtAl2013 (); UppalSinhaMehraEtAl2014 (); KawaguchiOmote2015 (); WuechnerOchoaPretschner2015 ().

Naïve Bayes

The naïve Bayes classifier is the simplest among the probabilistic Bayes models. It assumes strong independence among features and a normal distribution of feature values. While the latter assumption can be changed by using other probability distributions (e.g. Bernoulli), the former needs to hold to avoid compromising classification results. Shultz:2001 (); Kolter:2006 (); FirdausiLimErwinEtAl2010 (); UppalSinhaMehraEtAl2014 (); KawaguchiOmote2015 (); Sexton2015 (); WuechnerOchoaPretschner2015 () employ naïve Bayes for analysing malware.
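
A minimal sketch of applying naïve Bayes to boolean malware features with scikit-learn's Bernoulli event model; the feature matrix and labels are synthetic placeholders for the outputs of the extraction steps of Section 4.

import numpy as np
from sklearn.naive_bayes import BernoulliNB
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(500, 50))  # 500 samples, 50 boolean features
y = rng.integers(0, 2, size=500)        # 1 = malicious, 0 = benign

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
# The Bernoulli event model fits boolean features (e.g., API presence)
# better than a normality assumption on continuous values.
clf = BernoulliNB().fit(X_tr, y_tr)
print("accuracy:", clf.score(X_te, y_te))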

Bayesian Network

Conversely to naïve Bayes classifiers, Bayesian networks provide a tool for graphically representing probability distributions in a concise way Witten:2005 (). Bayesian networks can be drawn as directed acyclic graphs, where nodes represent features and categories, while edges describe conditional dependence between nodes. Nodes are data structures storing a probability distribution over represented feature values. These probabilistic graphical models have been used in EskandariKhorshidpourHashemi2013 (); SantosBrezoUgarte-PedreroEtAl2013 (); SantosDevesaBrezoEtAl2013 ().

5.2.3 Support Vector Machine (SVM)

Support vector machines are binary classifiers employed in a wide range of application fields, including control theory, medicine, biology, pattern recognition, and information security. In malware analysis, support vector machines have been applied in a large number of surveyed papers Kolter:2006 (); Ahmed2009 (); FirdausiLimErwinEtAl2010 (); AndersonQuistNeilEtAl2011 (); ChenRoussopoulosLiangEtAl2012 (); Comar:2013 (); IslamTianBattenEtAl2013 (); Kong:2013 (); SantosBrezoUgarte-PedreroEtAl2013 (); SantosDevesaBrezoEtAl2013 (); KruczkowskiSzynkiewicz2014 (); UppalSinhaMehraEtAl2014 (); Ahmadi:2015 (); FengXiongCaoEtAl2015 (); KawaguchiOmote2015 (); Lin:2015 (); MohaisenAlrawiMohaisen2015 (); Sexton2015 (); WuechnerOchoaPretschner2015 (). In Egele:2014 (), an SVM is employed to compute the optimal weights to associate with the used features. These classifiers are able to deal with high-dimensional and sparse data KruczkowskiSzynkiewicz2014 (). In order to work with non-linearly separable data, support vector machines rely on kernel methods. Support vector machines are usually bundled with three default kernel functions: polynomial, sigmoid, and radial basis function.
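
For illustration, a support vector machine with the radial basis function kernel can be set up in a few lines with scikit-learn; the data below is synthetic, and feature scaling is included because SVMs are sensitive to feature magnitudes.

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.random((500, 100))          # placeholder feature vectors
y = rng.integers(0, 2, size=500)    # placeholder benign/malicious labels

# RBF is one of the three default kernels mentioned above.
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
clf.fit(X, y)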

5.2.4 Multiple Kernel Learning

Instead of using a single kernel, multiple kernel learning combines different kernel functions to classify observations. Chosen kernels may either capture complementary aspects of observations under analysis or aggregate features coming from different sources Gonen:2011 (). In AndersonStorlieLane2012 (), Anderson et al. combine six kernels, each one corresponding to a different data source (e.g. PE file information, system calls), and leverage multiple kernel learning for detecting malicious samples.
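The following sketch approximates the idea with a fixed-weight combination of two precomputed kernels, one per hypothetical feature source; proper multiple kernel learning would learn the combination weights instead of fixing them at 0.5.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics.pairwise import polynomial_kernel, rbf_kernel
from sklearn.svm import SVC

# Two synthetic "sources" stand in for, e.g., static and dynamic features.
X, y = make_classification(n_samples=200, n_features=30, random_state=1)
X_static, X_dynamic = X[:, :15], X[:, 15:]

# Fixed-weight combination of one kernel per source; proper multiple
# kernel learning would learn these weights instead of fixing them.
K = 0.5 * rbf_kernel(X_static) + 0.5 * polynomial_kernel(X_dynamic, degree=2)
clf = SVC(kernel="precomputed").fit(K, y)
print(round(clf.score(K, y), 3))
```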

5.2.5 Prototype-based Classification

This approach relies on the concept of prototypes, which correspond to malware activity reports output by sandboxes or obtained from emulators, virtual machines, or physical machines provided with monitoring tools. In RieckTriniusWillemsEtAl2011 (), malware activity is extracted by means of system calls, which are converted into feature vectors; as discussed in Section 4.4, system calls are representative of sample behaviour. A prototype combines all the feature vectors into groups of homogeneous behaviours, according to the available reports. In the same work, Rieck et al. propose an approximation algorithm for extracting prototypes from a dataset of malware activity reports. Classification is performed by extracting the prototype of the sample under analysis and computing its nearest prototype in the dataset, where the distance between prototypes is measured using the Euclidean distance.
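A minimal sketch of nearest-prototype classification with Euclidean distance; the two prototypes, their feature values, and the group names are hypothetical.

```python
import numpy as np

# Hypothetical prototypes: one averaged feature vector per behaviour group.
prototypes = {"ransomware": np.array([0.9, 0.1, 0.7]),
              "spyware":    np.array([0.2, 0.8, 0.3])}

def nearest_prototype(x: np.ndarray) -> str:
    """Assign a sample to the group of its Euclidean-nearest prototype."""
    return min(prototypes, key=lambda g: np.linalg.norm(x - prototypes[g]))

print(nearest_prototype(np.array([0.8, 0.2, 0.6])))  # -> ransomware
```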

5.2.6 Decision Tree

Decision tree classifiers are widely employed in many fields. In general, nodes test input features against some learned threshold Witten:2005 (). One of the main strengths of decision trees is that they can be trained using boolean, numeric, or nominal features. During the test phase, feature values guide the path to follow along the tree until a leaf node is reached, and the instance is classified according to the category assigned to that leaf. In malware analysis, observations are typically related to samples. Works using decision trees are Kolter:2006 (); Ahmed2009 (); FirdausiLimErwinEtAl2010 (); IslamTianBattenEtAl2013 (); NariGhorbani2013 (); SantosBrezoUgarte-PedreroEtAl2013 (); SantosDevesaBrezoEtAl2013 (); BaiWangZou2014 (); UppalSinhaMehraEtAl2014 (); KawaguchiOmote2015 (); KhodamoradiFazlaliMardukhiEtAl2015 (); MohaisenAlrawiMohaisen2015 (); SrakaewPiyanuntcharatsr:2015 (). Interestingly, decision trees can be reduced to rule-based classifiers (see Section 5.2.1): every path in the tree leading to a leaf node can be represented as a set of rules connected by logical AND.
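The reduction to rules can be observed directly with scikit-learn, as in the sketch below on synthetic data: every printed root-to-leaf path is a conjunction of threshold tests.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
# Every printed root-to-leaf path is a conjunction ("AND") of threshold
# tests, i.e., the tree rewritten as a rule-based classifier.
print(export_text(tree, feature_names=[f"f{i}" for i in range(5)]))
```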

Random Forest

These classifiers are ensembles of decision trees, each fed with feature values independently sampled using the same distribution for all trees Breiman:2001 (). Classifier ensembles are usually obtained by means of bagging, boosting, or randomization. Random forests use bagging and additionally introduce randomness in the choice of the features considered at each split. Random forest classifiers have been applied in Siddiqui:2009 (); Comar:2013 (); IslamTianBattenEtAl2013 (); UppalSinhaMehraEtAl2014 (); Ahmadi:2015 (); KawaguchiOmote2015 (); KhodamoradiFazlaliMardukhiEtAl2015 (); Kwon2015 (); Mao2015 (); WuechnerOchoaPretschner2015 ().

Gradient Boosting Decision Tree

Conversely to random forest classifiers, gradient boosting decision trees are tree ensembles built by using the boosting technique. They aim to minimize the expected value of a specified loss function on a training set. In ChenRoussopoulosLiangEtAl2012 () and Sexton2015 (), gradient boosting decision trees are used to detect the category of malicious samples.

Logistic Model Tree

Logistic model tree classifiers are decision trees having logistic regression models at their leaves. These models are linear, built on independent variables representing the considered classes, and weighted on the basis of the samples in the training set; weights are computed by maximizing the log-likelihood. Graziano et al. employ logistic model trees for analysing malware submitted to a public sandbox Graziano:2015 (), whereas DahlStokesDengEtAl2013 (); Sexton2015 (); PalahanBabicChaudhuriEtAl2013 () leverage logistic regression classifiers to detect, respectively, families of malware, their categories, and novelties or similarities with respect to other samples.

5.2.7 k-Nearest Neighbors (k-NN)

For each labeled d-dimensional feature vector contained in a training set of size n, a k-NN algorithm transforms it into a point in d-dimensional space. Labels can, for example, report whether a sample is malicious or benign. In the test phase, given a dataset of samples under analysis, these are also transformed into d-dimensional points, and their k nearest neighbours are found by means of a distance or similarity measure (e.g., the Euclidean distance). The category of an unknown instance is chosen by picking the most popular class among its neighbours. The main advantage of these classifiers is that they require short training times: using a worst-case analysis model, the time required to train a k-Nearest Neighbor classifier is constant, since the training samples only need to be stored, while the test phase has time complexity linear in the training set size Manning:2008 (). Classic implementations of k-Nearest Neighbor can be further refined to improve their query time to logarithmic by employing k-d tree data structures. k-NN classifiers have been successfully employed to assign malware variants to their families LeeMody2006 (); Kong:2013 (); MohaisenAlrawiMohaisen2015 (); Raff2017 (), detect malicious software Ahmed2009 (); FirdausiLimErwinEtAl2010 (); IslamTianBattenEtAl2013 (); KawaguchiOmote2015 (), and spot potentially interesting samples SantosBrezoUgarte-PedreroEtAl2013 ().
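A brief scikit-learn sketch on synthetic data; the kd_tree option gives the logarithmic-time neighbour queries mentioned above, while algorithm="brute" would scan all training points per query.

```python
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=500, n_features=8, random_state=0)
# algorithm="kd_tree" gives logarithmic-time neighbour queries;
# algorithm="brute" would scan all n training points per query.
knn = KNeighborsClassifier(n_neighbors=5, algorithm="kd_tree").fit(X, y)
print(knn.predict(X[:3]))
```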

5.2.8 Artificial Neural Network

Neural networks are computing systems formed by a set of highly interconnected processing units organized in layers. Each processing unit has an activation function and is linked to other units by means of weighted connections, which are modified according to a specified learning rule. Artificial neural networks are widely employed in pattern recognition and novelty detection, time series prediction, and medical diagnosis. They perform best if the system they model is error-tolerant, and can be helpful when the relationships among inputs are unclear or difficult to describe with other models. Limitations strongly depend on the activation function and learning rule used. Dahl et al. take advantage of neural networks for detecting malware families, with unsatisfactory results: the two-class error rate they obtain remains non-negligible even with an ensemble of neural networks DahlStokesDengEtAl2013 (). In Saxe2015 (), artificial neural networks are employed with the objective of detecting malware.

Multilayer Perceptron Neural Network

Multilayer Perceptrons are neural networks whose connections are acyclic and each layer is fully connected with the next one. For this reason, they can be represented through directed graphs. These classifiers employ non-linear activation functions and, in the training phase, use backpropagation. In FirdausiLimErwinEtAl2010 (), Multilayer Perceptron Neural Networks are used for detecting malware.

5.3 Clustering

Clustering is the process of grouping similar elements. The ideal clustering arranges similar elements together, in groups (also called clusters) that are distant from one another. The notion of distance is defined according to specific distance or similarity metrics.

5.3.1 Clustering with locality sensitive hashing

Locality-sensitive hashing is a dimensionality reduction technique for approximating clusters and neighbour search. It relies on locality-sensitive hash families, which map elements into buckets so that similar elements are more likely to be mapped to the same bucket. Locality-sensitive hash families and, hence, locality-sensitive hashing are defined only for specific similarity and distance measures, such as cosine or Jaccard similarity and Hamming or Euclidean distance. Locality-sensitive hashing is the building block for grouping similar malicious samples and has been applied in some works BayerComparettiHlauschekEtAl2009 (); TamersoyRoundyChau2014 (); UpchurchZhou2015 ().
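The sketch below illustrates the banding idea on MinHash signatures, whose per-entry collision probability approximates Jaccard similarity; the API names in the two behaviour profiles are hypothetical, and this is a toy approximation rather than a production locality-sensitive hashing implementation.

```python
import random

def minhash(s, n_hashes=64, seed=0):
    """MinHash signature whose per-entry collision probability roughly
    tracks the Jaccard similarity of the input sets (toy version)."""
    rnd = random.Random(seed)
    masks = [rnd.getrandbits(64) for _ in range(n_hashes)]
    return [min(hash(e) ^ m for e in s) for m in masks]

def lsh_buckets(signature, bands=32):
    """Split the signature into bands; similar sets collide in some band."""
    rows = len(signature) // bands
    return {hash((i, tuple(signature[i * rows:(i + 1) * rows])))
            for i in range(bands)}

a = set("kernel32.dll CreateFileA WriteFile".split())   # behaviour profile
b = set("kernel32.dll CreateFileA ReadFile".split())    # similar profile
print(bool(lsh_buckets(minhash(a)) & lsh_buckets(minhash(b))))  # likely True
```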

5.3.2 Clustering with Distance and Similarity Metrics

As already discussed, clustering can be performed by taking into account either distance or similarity among different samples. Several metrics have been used in malware analysis: Euclidean RieckTriniusWillemsEtAl2011 (); MohaisenAlrawiMohaisen2015 () and Hamming distances MohaisenAlrawiMohaisen2015 (), cosine MohaisenAlrawiMohaisen2015 () and Jaccard similarities MohaisenAlrawiMohaisen2015 (); Polino:2015 (). Distances can also be computed on signatures extracted from samples using tools such as ssdeep (http://ssdeep.sourceforge.net) and sdhash (http://roussev.net/sdhash/sdhash.html). Both are block-based fuzzy hashing algorithms: anytime a sufficient amount of input has been processed, a small block is generated, and each generated block is a portion of the final signature. Samples can be grouped together, in conjunction with other features, on the basis of their signatures obtained with block-based fuzzy hashing, as in Graziano:2015 () and UpchurchZhou2015 ().

Time complexity of algorithms based on distance and similarity metrics depends both on the used measures and on their implementations. For commonly applied metrics, such as cosine similarity and the Euclidean and Hamming distances, the time required to compute them between two d-dimensional points is O(d). Thus, given a dataset of n samples, the time complexity of clustering them on these metrics is O(n^2 * d). Depending on the data, speedups based on the triangle inequality for faster clustering are known to exist in literature.

5.3.3 k-Means Clustering

k-means is one of the simplest and most used clustering algorithms Rokach:2005 (). It is a variant of the Expectation Maximization algorithm, belongs to the class of partition algorithms, and separates data into k clusters. The number k of clusters is chosen a priori and the initial cluster centers, called centroids, are picked randomly. Iteratively, each instance of the dataset is assigned to its nearest centroid so as to minimize the within-cluster sum of squares, that is, the squared Euclidean distance. In non-pathological scenarios, k-means halts in two cases: the sum of squared errors falls under a threshold, or no change occurs in the clusters. In these scenarios, such halting criteria guarantee reaching a local optimum and convergence in a finite number of iterations. Even though it has been theoretically proven that k-means has an exponential worst-case running time, a relatively recent smoothed analysis has shown that the number of iterations is bounded by a polynomial in the number of data points Arthur:2009 (). In practice, the running time of k-means is often linear in the dataset size. However, there exist corner cases in which instances are reassigned from one cluster to another without either minimizing the sum of squares or converging in a finite number of iterations: in these cases, the running time of k-means is unbounded unless the number of iterations is bounded. Bounding the iterations allows k-means to run in linear time even in pathological scenarios. Both Huang et al. and Pai et al. use k-means clustering for performing family selection as objective of their analyses HuangYeJiang2009 (); Pai:2015 ().
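A short scikit-learn sketch on synthetic blobs; max_iter bounds the number of iterations, which, as noted above, guarantees termination even in pathological scenarios.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)
# max_iter bounds the number of iterations, guaranteeing termination
# even in pathological scenarios.
km = KMeans(n_clusters=4, max_iter=100, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_.shape)  # (4, n_features)
```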

5.3.4 k-Medoids

k-medoids is another popular partitional clustering algorithm, initially proposed by Vinod Vinod1969 (), to group similar elements. Differently from k-means (Section 5.3.3), it relies on medoids (i.e., the most centrally located elements of the corresponding clusters) for assigning instances to clusters. Similar considerations about pathological scenarios, discussed in Section 5.3.3, apply to k-medoids as well, but it is less sensitive to outliers Park2009 (). Ye et al. combine k-medoids Ye2010 () with hierarchical clustering (see Section 5.3.6) and propose a k-medoids variation for identifying variants of malicious samples.

5.3.5 Density-based Spatial Clustering of Applications with Noise

DBSCAN is a widely known density-based clustering algorithm, initially proposed by Ester et al. for grouping objects in large databases Ester:1996 (). It is able to efficiently compute clusters of arbitrary shape through a two-step process. The first step involves the selection of an entry having at least a certain number of other entries in its neighbourhood (i.e., a core point). Its neighbours can be obtained by transforming the database into a collection of points and then measuring the distance among them: if the distance is lower than a chosen threshold, the two points are considered neighbours. In the second step, the cluster is built by grouping all the points that are density-reachable from the core point. The notion of density-reachability, defined in Ester:1996 (), regards the constraints that allow a point to be directly reachable from another: the latter must be a core point and the former must lie in its neighbourhood. Rather than referring to two single points, density-reachability applies to a path in which points are directly reachable from each other. The algorithm runs on a database storing entries that can be transformed into points, and mainly serves neighbourhood queries. These queries can be answered efficiently in O(log n) (e.g. by using R*-trees), thus the expected running time of DBSCAN is O(n log n). Vadrevu et al. use DBSCAN to detect variants of malware Vadrevu2016 ().
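A minimal scikit-learn sketch on a synthetic two-moons dataset, which illustrates the ability of DBSCAN to recover clusters of arbitrary shape; eps is the neighbourhood distance threshold and min_samples the core-point condition.

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaving half-moons: clusters of arbitrary shape.
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)
# eps is the neighbourhood distance threshold; min_samples is the number
# of neighbours required for an entry to be a core point.
db = DBSCAN(eps=0.2, min_samples=5).fit(X)
print(set(db.labels_))  # cluster ids; -1 marks noise points
```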

5.3.6 Hierarchical Clustering

A hierarchical clustering schema recursively partitions instances and constructs a tree of clusters called a dendrogram. The tree structure allows exploring clusters at different levels of granularity. Hierarchical clustering can be performed in two ways: agglomerative and divisive. Agglomerative approaches are bottom-up: they start with one cluster per element, then the closest cluster pairs are iteratively merged until a unique cluster contains all the elements. Divisive approaches are top-down: all the elements are initially included in a unique cluster, then they are divided into smaller sub-clusters until clusters with only one element are obtained.

Closeness can be modelled using different criteria: complete-link, single-link, average-link, centroid-link, and Ward's-link Johnson:1967 (); MohaisenAlrawiMohaisen2015 (). Agglomerative clustering is used more than divisive, mainly because in the worst case it has a polynomial running time (cubic in the number of elements for naive implementations), while the divisive approach is exponential. In malware analysis, hierarchical clustering has been applied in JangBrumleyVenkataraman2011 (); MohaisenAlrawiMohaisen2015 ().
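For illustration, the SciPy sketch below builds a complete-link dendrogram over synthetic feature vectors and cuts it into three clusters; a real analysis would replace the random matrix with sample features.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(0)
X = rng.random((20, 5))            # synthetic feature vectors
Z = linkage(X, method="complete")  # agglomerative, complete-link dendrogram
labels = fcluster(Z, t=3, criterion="maxclust")  # cut into 3 clusters
print(labels)
```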

5.3.7 Prototype-based Clustering

Analogously to what is described in Section 5.2.5, prototypes can also be used to cluster malware that are similar to each other RieckTriniusWillemsEtAl2011 (). In RieckTriniusWillemsEtAl2011 (), Rieck et al. leverage the hierarchical complete-linkage clustering technique to group reports (see Section 5.2.5 for the prototype/report mapping). The algorithm running time grows with the number of prototypes rather than with the number of reports; since the prototypes are far fewer than the reports, prototype-based clustering provides a significant speedup with respect to the running time of exact hierarchical clustering.

5.3.8 Self-Organizing Maps

Self-organizing maps are useful data exploration tools that can also be used to cluster data. They are applied to a vast range of application fields, from industry and finance to natural sciences and linguistics Kohonen:2013 (). Self-organizing maps can be represented as an ordered regular grid in which more similar models are automatically arranged in adjacent positions, far away from less similar models. This disposition of models allows getting additional information from the topographic relationships of the data, a capability that is fundamental when dealing with high-dimensional data. In a first phase, self-organizing maps are calibrated using an input dataset; then, each time a new input instance feeds the map, it is elaborated by the best-matching model. Analogously to what is described in Section 5.3.2, a model best-matches an instance on the basis of a specific similarity or distance measure, and the proposed self-organizing maps rely on different ones (e.g., dot product, Euclidean and Minkowski distances). Self-organizing maps have been used in ChenRoussopoulosLiangEtAl2012 () for analysing malware.

5.4 Others

This subsection presents machine learning algorithms that fall into none of the previous groups: they are neither signature-based, nor classification, nor clustering algorithms.

5.4.1 Expectation Maximization

Expectation-maximization is a general-purpose statistical iterative method of maximum likelihood estimation used in cases where observations are incomplete. It is often applied for clustering as well: given a set of observations characterizing each element of the dataset, an expectation step assigns each element to its most likely cluster, whereas a maximization step recomputes the centroids. The main strengths of expectation-maximization are stability, robustness to noise, and the ability to deal with missing data; moreover, the algorithm has been proven to converge to a local maximum. Expectation maximization has been employed by Pai et al. for detecting malware variants belonging to the same families Pai:2015 ().
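As an illustration, Gaussian mixture fitting in scikit-learn runs exactly this expectation/maximization loop, here on synthetic data:

```python
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
# fit() alternates the expectation step (cluster responsibilities) and the
# maximization step (re-estimating means and covariances) until it reaches
# a local maximum of the likelihood or the iteration bound.
gm = GaussianMixture(n_components=3, max_iter=100, random_state=0).fit(X)
print(gm.predict(X[:5]), gm.converged_)
```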

5.4.2 Learning with Local and Global Consistency

Learning with local and global consistency is a semi-supervised approach: information on known labels is spread to neighbours until a globally stable state is reached Zhou:04 (). It has been proved effective on synthetic data, in digit recognition, and in text categorization Zhou:04 (). In malware analysis, this learning approach has been applied in Santos:2011 (), achieving high accuracy.

5.4.3 Belief Propagation

Belief propagation is an approach for performing inference over graphical models (e.g. Bayesian networks) and, in general, graphs. It works iteratively: at each iteration, each pair of inter-connected nodes exchanges messages reporting the nodes' opinions about their probabilities of being in a certain state. Belief propagation converges when probabilities stop changing significantly or after a fixed number of iterations. Chau et al. Chau2010 () apply it for detecting malware on terabytes of data, while both TamersoyRoundyChau2014 () and Chen:2015 () adapt belief propagation to malware analysis by proposing new mathematical representations. In Chen:2015 (), Chen et al. properly tune the belief propagation algorithm and improve on the approach used in TamersoyRoundyChau2014 () and on other classification algorithms (e.g., support vector machines and decision trees).

6 Characterization of Surveyed Papers

In this section we leverage the discussed objectives (Section 3), feature classes (Section 4), and algorithm types (Section 5) to precisely characterize each reviewed paper. The characterization is organized by objective: for each objective, we provide an overall view of which works aim at that objective, what machine learning algorithms they use, and what feature classes they rely on. Due to space constraints, the following tables use these abbreviations:

  • Str. for strings

  • Byte seq. for byte sequences

  • Ops for opcodes

  • APIs/Sys. calls for APIs/System calls

  • Memory or Mem. for memory accesses

  • File sys. for file system

  • Win. Reg. for Windows registry

  • CPU reg. for CPU registers

  • Func. length for function length

  • PE file char. for PE file characteristics

  • Raised excep. for raised exceptions

  • Net for network

  • Submissions for AV/Sandbox submissions

6.1 Malware detection

Table 14 lists all the reviewed works having malware detection as objective. It can be noted that the most used inputs are:

  • byte sequences, extracted from either the PE or its disassembled code, and organized in n-grams

  • API/system call invocations, derived by executing the samples

Most of the works employ several algorithms, in order to find out which one guarantees the most accurate results.

6.2 Malware variants detection

As explained in Section 3.2, variants detection includes two slightly diverse objectives: given a malware, identifying either its variants or its family. Tables 15 and 16 detail surveyed works aiming to identify variants and families, respectively. In both characterizations, APIs and system calls are largely employed, as well as their interactions with the environment, modeled through memory, file system, Windows registry, and CPU registers. Some of the surveyed papers use byte sequences and opcodes instead, while others add features related to the samples' network activity.

Paper Algorithms Strings Byte seq. Ops APIs/Sys. calls File system Win. Reg. CPU reg. PE file char. Raised excep. Network
Schultz et al. Shultz:2001 () Rule-based classifier, Naïve Bayes
Kolter and Maloof Kolter:2006 () Decision Tree, Naïve Bayes, SVM
Ahmed et al. Ahmed2009 () Decision Tree, Naïve Bayes, SVM
Chau et al. Chau2010 () Belief propagation
Firdausi et al. FirdausiLimErwinEtAl2010 () Decision Tree, Naïve Bayes, SVM, k-NN, Multilayer Perceptron Neural Network
Anderson et al. AndersonQuistNeilEtAl2011 () SVM
Santos et al. Santos:2011 () Learning with Local and Global Consistency
Anderson et al. AndersonStorlieLane2012 () Multiple Kernel Learning
Yonts Yonts2012 () Rule-based classifier
Eskandari et al. EskandariKhorshidpourHashemi2013 () Bayesian Network
Santos et al. SantosDevesaBrezoEtAl2013 () Bayesian Network, Decision Tree, k-NN, SVM
Vadrevu et al. Vadrevu2013 () Random Forest
Bai et al. BaiWangZou2014 () Decision Tree, Random Forest
Kruczkowski and Szynkiewicz KruczkowskiSzynkiewicz2014 () SVM
Tamersoy et al. TamersoyRoundyChau2014 () Clustering with locality sensitive hashing
Uppal et al. UppalSinhaMehraEtAl2014 () Decision Tree, Random Forest, Naïve Bayes, SVM
Chen et al. Chen:2015 () Belief propagation
Elhadi et al. ElhadiMaarofBarry2015 () Malicious graph matching
Feng et al. FengXiongCaoEtAl2015 () Rule-based classifier, Malicious signature matching, SVM
Ghiasi et al. Ghiasi:2015 () Rule-based classifier
Kwon et al. Kwon2015 () Random Forest
Mao et al. Mao2015 () Random Forest
Saxe and Berlin Saxe2015 () Neural Networks
Srakaew et al. SrakaewPiyanuntcharatsr:2015 () Decision Tree
Wüchner et al. WuechnerOchoaPretschner2015 () Naïve Bayes, Random Forest, SVM
Raff and Nicholas Raff2017 () k-NN with Lempel-Ziv Jaccard distance
Table 14: Characterization of surveyed papers having malware detection as objective.
Paper Algorithms Byte seq. Ops APIs/Sys. calls Memory File system Win. Reg. CPU reg. PE file char. Network
Kwon and Lee KwonLee2012 () Malicious signature matching
Shosha et al. ShoshaLiuGladyshevEtAl2012 () Malicious signature matching
Chionis et al. ChionisNikolopoulosPolenakis2013 () Malicious signature matching
Gharacheh et al. Gharacheh:2015 () — (see table note)
Ghiasi et al. Ghiasi:2015 () Rule-based classifier
Khodamoradi et al. KhodamoradiFazlaliMardukhiEtAl2015 () Decision Tree, Random Forest
Upchurch and Zhou UpchurchZhou2015 () Clustering with locality sensitive hashing
Liang et al. LiangPangDai2016 () Rule-based classifier
Vadrevu and Perdisci Vadrevu2016 () DBSCAN clustering
Table 15: Characterization of surveyed papers having malware variants selection as objective. Instead of using machine learning techniques, Gharacheh et al. rely on Hidden Markov Models to detect variants of the same malicious sample Gharacheh:2015 ().
Paper Algorithms Str. Byte seq. Ops APIs/Sys. calls Mem. File sys. Win. Reg. CPU reg. Func. length PE file char. Raised excep. Net.
Huang et al. HuangYeJiang2009 () k-Means-like algorithm
Park et al. ParkReevesMulukutlaEtAl2010 () Malicious graph matching
Ye et al. Ye2010 () k-Medoids variants
Dahl et al. DahlStokesDengEtAl2013 () Logistic Regression, Neural Networks
Hu et al. HuShinBhatkarEtAl2013 () Prototype-based clustering
Islam et al. IslamTianBattenEtAl2013 () Decision Tree, k-NN, Random Forest, SVM
Kong and Yan Kong:2013 () SVM, k-NN
Nari and Ghorbani NariGhorbani2013 () Decision Tree
Ahmadi et al. Ahmadi:2015 () SVM, Random Forest, Gradient Boosting Decision Tree
Asquith Asquith2015 () — (see table note)
Lin et al. Lin:2015 () SVM
Kawaguchi and Omote KawaguchiOmote2015 () Decision Tree, Random Forest, k-NN, Naïve Bayes
Mohaisen et al. MohaisenAlrawiMohaisen2015 () Decision Tree, k-NN, SVM, Clustering with different similarity measures, Hierarchical clustering
Pai et al. Pai:2015 () k-Means, Expectation Maximization
Raff and Nicholas Raff2017 () k-NN with Lempel-Ziv Jaccard distance
Table 16: Characterization of surveyed papers having malware families selection as objective. Asquith describes aggregation overlay graphs for storing PE metadata, without further discussing any machine learning technique that could be applied on top of these new data structures.

6.3 Malware category detection

These articles focus on the identification of specific threats and, thus, on particular inputs such as byte sequences, opcodes, executable binaries’ function lengths, and network activity regarding samples. Table 17 reports the works whose objectives deal with the detection of the specific category of a malware sample.

Paper Algorithms Byte seq. Ops Func. length Net.
Tian et al. Tian:2008 () Rule-based classifier
Siddiqui et al. Siddiqui:2009 () Decision Tree, Random Forest
Chen et al. ChenRoussopoulosLiangEtAl2012 () Random Forest, SVM
Comar et al. Comar:2013 () Random Forest, SVM
Kwon et al. Kwon2015 () Random Forest
Sexton et al. Sexton2015 () Rule-based classifier, Logistic Regression, Naïve Bayes, SVM
Table 17: Characterization of surveyed papers having malware category detection as objective.
Paper Algorithms Byte seq. APIs/Sys. calls Mem. File sys. Win. Reg. CPU reg. Network
Bailey et al. BaileyOberheideAndersenEtAl2007 () Hierarchical clustering with normalized compression distance
Bayer et al. BayerComparettiHlauschekEtAl2009 () Clustering with locality sensitive hashing
Rieck et al. RieckTriniusWillemsEtAl2011 () Prototype-based classification and clustering with Euclidean distance
Palahan et al. PalahanBabicChaudhuriEtAl2013 () Logistic Regression
Egele et al. Egele:2014 () SVM (see table note)
Table 18: Characterization of surveyed papers having malware similarities detection as objective. SVM is used only for computing the optimal values of weight factors associated to each feature chosen to detect similarities among malicious samples.
Paper Algorithms Byte seq. Ops APIs/Sys. calls Net.
Bayer et al. BayerComparettiHlauschekEtAl2009 () Clustering with locality sensitive hashing
Lindorfer et al. LindorferKolbitschComparetti2011 () Rule-based classifier
Rieck et al. RieckTriniusWillemsEtAl2011 () Prototype-based classification and clustering, clustering with Euclidean distance
Palahan et al. PalahanBabicChaudhuriEtAl2013 () Logistic Regression
Santos et al. SantosBrezoUgarte-PedreroEtAl2013 () Decision Tree, k-NN, Bayesian Network, Random Forest
Polino et al. Polino:2015 () Clustering with Jaccard similarity
Table 19: Characterization of surveyed papers having malware differences detection as objective.

6.4 Malware novelty and similarity detection

Similarly to Section 6.2, this characterization groups works whose objective is to spot (dis)similarities among samples. According to the final objective of the analysis, one can be more interested in finding either the similarities or the differences characterizing a group of samples. All the analysed papers but SantosBrezoUgarte-PedreroEtAl2013 () rely on API and system call collection (see Tables 18 and 19).

While works aiming at spotting differences among PEs generally do not take into account the interactions with the system in which samples are executed, the ones identifying similarities do leverage such information.

6.5 Malware development detection

Table 20 outlines the very few works that propose approaches to deal with malware development cases. While Chen et al. use just byte sequences for their analysis ChenRoussopoulosLiangEtAl2012 (), in Graziano:2015 () both information related to the execution of malicious samples in a sandbox and their submission-related metadata are used.

Paper Algorithms Byte seq. APIs/Sys. calls File sys. Win. Reg. Net. Submissions
Chen et al. ChenRoussopoulosLiangEtAl2012 () Gradient Boosting Decision Tree, Self-Organizing Maps, SVM
Graziano et al. Graziano:2015 () Logistic Model Tree, Clustering using ssdeep tool
Table 20: Characterization of surveyed papers having malware development detection as objective.

6.6 Malware attribution

Malware attribution is one of the most important tasks at the strategic level (see Section 3.6). Indeed, both the U.S. and U.K. governments have already identified cyber threat attribution as a strategic goal within their national cyber security strategies dhs2009 (); co2011 (). In addition to the typical inputs used in malware analysis, Caliskan-Islam et al. Caliskan-Islam:2015 () also rely on code stylometry to capture the coding style of application source code, which, in turn, can be represented through lexical, layout, and syntactic features (refer to Table 21).

Paper Algorithms Code stylometry
Caliskan-Islam et al. Caliskan-Islam:2015 () Random Forest
Table 21: Characterization of surveyed papers having malware attribution as objective.

6.7 Malware triage

Even if serious attention has been paid to malware detection in general, much less consideration has been given to malware triage, as shown in Table 22. Jang et al. are the only ones, among the surveyed works, that perform triage by using the PE's byte sequences and API/system call invocations JangBrumleyVenkataraman2011 ().

Paper Algorithms Byte seq. APIs/Sys. calls PE file char.
Jang et al. JangBrumleyVenkataraman2011 () Clustering with locality sensitive hashing, full hierarchical clustering
Kirat et al. Kirat2013 () k-NN
Table 22: Characterization of surveyed papers having malware triage as objective.

7 Datasets

In most of the surveyed works, datasets contain both malicious and benign samples; nevertheless, a smaller share of the papers perform their experimental evaluations using datasets containing solely malicious executables. The objectives of these works are variants and families selection LeeMody2006 (); HuangYeJiang2009 (); ShoshaLiuGladyshevEtAl2012 (); HuShinBhatkarEtAl2013 (); Kong:2013 (); NariGhorbani2013 (); Ahmadi:2015 (); MohaisenAlrawiMohaisen2015 (); UpchurchZhou2015 (); LiangPangDai2016 (); Vadrevu2016 (), category detection Tian:2008 (), and malware novelty and similarity detection BaileyOberheideAndersenEtAl2007 (); BayerComparettiHlauschekEtAl2009 (); LindorferKolbitschComparetti2011 (); RieckTriniusWillemsEtAl2011 (). Just two works rely on benign datasets only Egele:2014 (); Caliskan-Islam:2015 (): since their objectives are identifying sample similarities and attributing the ownership of the source code under analysis, respectively, they do not necessarily need malware.

Figure 1: Dataset sources for malicious and benign samples

Figure 1 reports a summary of the employed sources for malicious and benign samples. The used datasets come from legitimate applications, public repositories, security vendors, sandboxes, AV companies and research centers, Internet Service Providers, honeypots, and CERTs; in some cases, datasets are built by the researchers themselves.

It is worth noting that most of the benign datasets consist of legitimate applications, while most of the malware have been obtained from public repositories, security vendors, and popular sandboxed analysis services. Legitimate applications include PEs contained in the “Program Files” or “system” folders, and downloads from “trusted” (i.e. signed) companies. Examples of these benign programs are Cygwin, Putty, the Microsoft Office Suite, and Adobe Acrobat. The most popular public repository in the examined works is VX Heavens (http://vxheaven.org), followed by Offensive Computing (http://www.offensivecomputing.net) and the Malicia Project (http://malicia-project.com). The first two repositories are still actively maintained at the time of writing, while the Malicia Project has been permanently shut down due to dataset aging and lack of maintainers.

Security vendors, popular sandboxed analysis services, and AV companies have access to a huge number of samples. Surveyed works rely on the CWSandbox and Anubis sandboxed analysis services. CWSandbox is a commercial sandbox, now named Threat Analyzer, actively developed by ThreatTrack Security (https://www.threattrack.com/malware-analysis.aspx). Anubis (https://anubis.iseclab.org) is no longer maintained by its creators and by iSecLab (https://iseclab.org), the international laboratory that hosted the sandbox. As can be observed from Figure 1, these sandboxes are mainly used for obtaining malicious samples. The same trend holds for security vendors, AV companies, and research centers. Internet Service Providers (ISPs), honeypots, and Computer Emergency Response Teams (CERTs) share both benign and malicious datasets with researchers.

An interesting case is represented by samples developed by the researchers themselves. A few works use in their evaluations malicious samples developed by the authors ShoshaLiuGladyshevEtAl2012 (); Gharacheh:2015 (); KhodamoradiFazlaliMardukhiEtAl2015 (). These samples are created also by using malware toolkits such as the Next Generation Virus Construction Kit (http://vxheaven.org/vx.php?id=tn02), Virus Creation Lab (http://vxheaven.org/vx.php?id=tv03), Mass Code Generator (http://vxheaven.org/vx.php?id=tm02), and Second Generation Virus Generator (http://vxheaven.org/vx.php?id=tg00). Finally, a minority of the analysed papers do not mention the source of their datasets.

One of the most common problems of these datasets is that, very often, they are not balanced. A proper training of machine learning models requires that each class contains an almost equal number of samples, otherwise inaccurate models can be obtained. The same problem holds when benign datasets are also used: malicious samples should be roughly as many as benign ones. In Yonts2012 (), Yonts supports his choice of using a smaller benign dataset by pointing out that changes in standard system files and legitimate applications are small. However, there are surveyed works that rely on benign datasets whose size is bigger than the size of the malicious ones Kolter:2006 (); Siddiqui:2009 (); BilgeBalzarottiRobertsonEtAl2012 (); TamersoyRoundyChau2014 (); UppalSinhaMehraEtAl2014 (); Chen:2015 (); KhodamoradiFazlaliMardukhiEtAl2015 (); Sexton2015 ().
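A naive balancing strategy is to undersample the majority class, as in the sketch below on synthetic data; real evaluations would need care not to discard informative samples.

```python
import numpy as np

def undersample(X, y, seed=0):
    """Naive balancing: randomly drop samples from the larger classes."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n = counts.min()
    idx = np.concatenate([rng.choice(np.flatnonzero(y == c), n, replace=False)
                          for c in classes])
    return X[idx], y[idx]

X = np.random.default_rng(1).random((1000, 4))
y = np.array([0] * 900 + [1] * 100)   # 9:1 benign/malicious imbalance
Xb, yb = undersample(X, y)
print(np.bincount(yb))                # [100 100]
```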

Figure 2: Dataset labeling methods

Samples contained in the datasets used in the considered papers are generally already labeled. Figure 2 reports statistics on whether the considered works perform further labeling on these samples. The majority of reviewed papers do not perform any additional labeling operation on their already-labeled datasets. However, some works analyse the samples again and recompute labels, to check whether they match the ones provided together with the datasets. Label computation can be manual, automated, or both. Manual labeling is a highly time-consuming task, but provides more accurate results, since this activity is usually performed by security experts; consequently, the number of samples that can be labeled with this method is quite small compared to automated labeling techniques. A few works use manual labeling DahlStokesDengEtAl2013 (); Graziano:2015 (); Polino:2015 (); UpchurchZhou2015 (), while others combine manual and automated methods BayerComparettiHlauschekEtAl2009 (); LindorferKolbitschComparetti2011 (); HuShinBhatkarEtAl2013 (); MohaisenAlrawiMohaisen2015 ().

Differently from other research fields, in malware analysis there are no reference benchmarks available to compare accuracy and performance with previous works. In addition, since the datasets used for experimental evaluations are rarely shared, it is difficult, if not impossible, to compare works. Among the papers we have surveyed, only two works have shared the dataset used in their evaluations Shultz:2001 (); UpchurchZhou2015 (), while a third one plans to provide a reference to the analysed malware samples in the future MohaisenAlrawiMohaisen2015 (). To this end, Schultz et al. made available the dataset used in their experimental evaluations for detecting malware; at the time of writing, it is no longer accessible. It contained 1,001 legitimate applications and 3,265 malware consisting of trojans and viruses. The applied labeling methodology was automated, performed using a commercial antivirus. On the other hand, Upchurch et al. UpchurchZhou2015 () share a gold standard test dataset for future works aiming to perform malware variants selection analyses. The dataset is imbalanced and only includes 85 samples, organized in 8 families containing trojans, information stealers, bots, keyloggers, backdoors, and potentially unwanted programs. All the above samples have been analysed by professional malware analysts and tested against 5 different malware variant detection approaches. Experimental evaluations report performance and accuracy of the tested methodologies and compare the obtained results with the ones published in the original papers. Sample metadata include MD5 hashes, but no temporal information, i.e., when a sample first appeared. Miller et al. have extensively shown that the lack of this critical information considerably affects the accuracy of experimental results Miller2015 ().

Given such lack of reference datasets, we propose three desiderata for malware analysis benchmarks.

  1. Benchmarks should be labeled according to the specific objectives to achieve. As an example, benchmarks for families selection should label samples with their families, while benchmarks for category detection should label malware with their categories.

  2. Benchmarks should be balanced: samples of different classes should be in equal proportion to avoid issues on classification accuracy.

  3. Benchmarks should be actively maintained and updated over time with new samples, trying to keep pace with the malware industry. Samples should also be provided with temporal information, e.g., when they were first spotted. In this way, analysts would have at their disposal a sort of timeline of malware evolution, and they could also discard obsolete samples.

From the above dataset descriptions, it is clear that the samples shared in Shultz:2001 () and UpchurchZhou2015 () are correctly labeled according to the malware detection and malware variants selection objectives, respectively. Neither dataset is balanced: in Schultz et al. Shultz:2001 (), the described dataset is biased towards malicious programs, while in UpchurchZhou2015 () the diverse groups of variants contain a different number of samples. Finally, the analyzed datasets are not actively maintained and do not contain any temporal information (in Shultz:2001 (), the authors do not mention whether such information has been included in the dataset).

8 Malware Analysis Economics

Analysing samples through machine learning techniques requires complex computations, both for extracting the desired features and for running the chosen algorithms. The time complexity of these computations has to be carefully taken into account because of the need to complete them fast enough to keep pace with the speed at which new malware is developed. Space complexity has to be considered as well: the feature space can easily become excessively large (e.g., using n-grams), and the memory required by machine learning algorithms can also grow to the point of saturating the available resources.

Time and space complexities can be either reduced to adapt to processing and storage capacity at disposal, or they can be accommodated by supplying more resources. In the former case, the accuracy of the analysis is likely to worsen, while in the latter accuracy levels can be kept at the cost of providing additional means, e.g., in terms of computing machines, storage, network. There exist tradeoffs between maintaining high accuracy and performance of malware analysis on one hand, and supplying the required equipment on the other.

Another potential cost of malware analysis through machine learning techniques is given by the effort for obtaining sample labels. As outlined in Section 7, samples can be labeled either manually or in an automated fashion. Manual labeling definitely requires more time than automated techniques. However, the former produces more accurate results since this task is usually carried out by security experts.

We refer to the study of these tradeoffs as malware analysis economics, and in this section we provide some initial qualitative discussions on this novel topic.

The time needed to analyse a sample through machine learning is mainly spent in feature extraction and algorithm execution. There exist in literature plenty of works discussing the time complexity of machine learning algorithms; the same does not apply to the study of the execution time of the feature extraction process. The main aspect to take into account in such a study is whether the desired features come from static or dynamic analysis, which considerably affects execution time: the former does not require running the samples, while the latter does. Table 23 categorizes surveyed works on the basis of the type of analysis they carry out, i.e., static, dynamic, or hybrid. The majority of works relies on dynamic analyses, while the others use, in equal proportions, either static analyses alone or a combination of static and dynamic techniques.

Malware analysis Papers
Static Shultz:2001 (); Kolter:2006 (); Tian:2008 (); HuangYeJiang2009 (); Siddiqui:2009 (); Santos:2011 (); ChenRoussopoulosLiangEtAl2012 (); KwonLee2012 (); Yonts2012 (); HuShinBhatkarEtAl2013 (); Kong:2013 (); SantosBrezoUgarte-PedreroEtAl2013 (); Vadrevu2013 (); BaiWangZou2014 (); TamersoyRoundyChau2014 (); Ahmadi:2015 (); Caliskan-Islam:2015 (); Chen:2015 (); FengXiongCaoEtAl2015 (); Gharacheh:2015 (); KhodamoradiFazlaliMardukhiEtAl2015 (); Pai:2015 (); Sexton2015 (); SrakaewPiyanuntcharatsr:2015 (); UpchurchZhou2015 ()
Dynamic LeeMody2006 (); BaileyOberheideAndersenEtAl2007 (); BayerComparettiHlauschekEtAl2009 (); FirdausiLimErwinEtAl2010 (); ParkReevesMulukutlaEtAl2010 (); AndersonQuistNeilEtAl2011 (); LindorferKolbitschComparetti2011 (); RieckTriniusWillemsEtAl2011 (); ShoshaLiuGladyshevEtAl2012 (); Comar:2013 (); DahlStokesDengEtAl2013 (); NariGhorbani2013 (); PalahanBabicChaudhuriEtAl2013 (); KruczkowskiSzynkiewicz2014 (); UppalSinhaMehraEtAl2014 (); ElhadiMaarofBarry2015 (); Ghiasi:2015 (); KawaguchiOmote2015 (); Lin:2015 (); MohaisenAlrawiMohaisen2015 (); WuechnerOchoaPretschner2015 (); LiangPangDai2016 ()
Hybrid JangBrumleyVenkataraman2011 (); AndersonStorlieLane2012 (); ChionisNikolopoulosPolenakis2013 (); EskandariKhorshidpourHashemi2013 (); IslamTianBattenEtAl2013 (); SantosDevesaBrezoEtAl2013 (); Egele:2014 (); Graziano:2015 (); Polino:2015 (); Vadrevu2016 ()
Table 23: Surveyed papers arranged according the type of analysis performed on input samples.
Analysis Str. Byte seq. Ops APIs/Sys. calls Mem. File sys. Win. Reg. CPU reg. Func. len. PE file char. Exc. Net. Submis.
Static
Dynamic
Table 24: Type of analysis required for extracting the inputs presented in Section 4: strings, byte sequences, opcodes, APIs/system calls, memory, file system, Windows Registry, CPU registers, function length, PE file characteristics, raised exceptions, network, and AV/Sandbox submissions.

To deepen this point even further, Table 24 reports, for each feature type, whether it can be extracted through static or dynamic analysis. It is interesting to note that certain types of features can be extracted both statically and dynamically, with significant differences in execution time as well as in malware analysis accuracy. Indeed, while certainly more time-consuming, dynamic analysis enables gathering features that improve the overall effectiveness of the analysis damodaran2015comparison (). As an example, consider the features derived from API calls (see Table 24), which can be obtained both statically and dynamically. Tools like IDA provide the list of imports used by a sample and can statically trace what API calls are present in the sample code. Malware authors usually hide their suspicious API calls by inserting a huge number of legitimate APIs in the source code. By means of dynamic analysis, it is instead possible to obtain the list of the APIs that the sample has actually invoked, thus simplifying the identification of the suspicious ones. As a consequence, in this case dynamic analysis is likely to generate more valuable features compared to static analysis.

Although choosing dynamic analysis over, or in addition to, static seems obvious, its inherently higher time complexity constitutes a potential performance bottleneck for the whole malware analysis process, which can undermine the possibility to keep pace with malware evolution speed. The natural solution is to provision more computational resources to parallelise analysis tasks and thus remove bottlenecks. In turn, such solution has a cost to be taken into account when designing a malware analysis environment, such as the one presented by Laurenza et al. Laurenza2016 ().

The qualitative tradeoffs we have identified are between accuracy and time complexity (i.e., higher accuracy requires larger times), between time complexity and analysis pace (i.e., larger times imply a slower pace), between analysis pace and computational resources (faster analysis demands more resources), and between computational resources and economic cost (obviously, additional equipment has a cost). Similar tradeoffs also hold for space complexity and sample labeling. As an example, when using n-grams as features, it has been shown that larger values of n lead to more accurate analyses, at the cost of having the feature space grow exponentially with n UppalSinhaMehraEtAl2014 (); Lin:2015 (). As another example, using larger datasets in general enables more accurate machine learning models and thus better accuracy, provided that enough space is available to store all the samples of the dataset and the related analysis reports. The analysis accuracy is heavily impacted by the employed labeling methodology as well. Manual labeling ensures higher accuracy, but requires security experts to spend a lot of time analyzing samples: this largely slows down the analysis pace and necessitates hiring many malware analysts to complete sample analyses at a reasonable rate. On the contrary, the usage of automated labeling techniques allows speeding up the labeling process at the price of potentially reducing analysis accuracy (i.e., if samples are incorrectly labeled, machine learning models could be inaccurate). The only costs associated with this solution regard the acquisition of licenses for potentially commercial automated-labeling solutions (e.g., antivirus products).
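The n-gram example can be made concrete with a few lines of Python: the sketch below counts byte n-grams over a stand-in byte string and prints the observed number of distinct features against the 256^n worst-case feature-space size.

```python
from collections import Counter

def byte_ngrams(data: bytes, n: int) -> Counter:
    """Count byte n-grams; the feature space is 256**n in the worst case."""
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))

blob = bytes(range(256)) * 4          # stand-in for a PE's raw bytes
for n in (1, 2, 3):
    grams = byte_ngrams(blob, n)
    print(n, len(grams), 256 ** n)    # observed vs. worst-case size
```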

We claim the significance of investigating these tradeoffs more in details, with the aim of outlining proper guidelines and strategies to design a malware analysis environment in compliance with requirements on analysis accuracy and pace, and also by respecting budget constraints.

9 Conclusion

We presented a survey of the existing literature on malware analysis through machine learning techniques. There are three main contributions of our work. First, we proposed an organization of the reviewed works according to three orthogonal dimensions: the objective of the analysis, the type of features extracted from samples, and the machine learning algorithms used to process these features. We identified different malware analysis objectives (ranging from malware detection to malware triage), grouped features according to their specific type (e.g., strings and byte sequences), and organized machine learning algorithms for malware analysis into distinct classes. Such a characterization provides an overview of how machine learning algorithms can be employed in malware analysis, emphasising which specific feature classes allow achieving the objective(s) of interest. In this first contribution, we also discussed the general lack of justifications for using a specific set of features to properly describe the malicious traits of samples: the majority of reviewed papers do not explain the correlation between the considered features and the obtained results.

Second, we highlighted some issues regarding the datasets used in literature and outlined three desiderata for building enhanced datasets. Currently, there is a shortage of publicly available datasets suitable for specific objectives; for example, datasets where samples are properly labelled by family are a rarity. Furthermore, the datasets employed in the reviewed experimental evaluations are rarely shared. In the majority of examined papers, the used datasets are not balanced, hence preventing the construction of really accurate models. When malware samples are used for evaluating novel analysis techniques, their fast obsolescence becomes an additional and relevant issue: the effectiveness of new approaches should be tested on samples as recent as possible, otherwise there is the risk that such approaches turn out to be poorly accurate when applied in the real world. At today's malware evolution pace, samples are likely to become outdated in a few months, but reference datasets commonly include malware of a few years ago. Thus, we proposed three desired characteristics for malware analysis benchmarks: they should be (i) labeled according to the specific objectives to achieve, (ii) balanced, and (iii) actively maintained and updated over time.

Third, we introduced the novel concept of malware analysis economics, concerning the investigation and exploitation of existing tradeoffs between performance metrics of malware analysis (e.g., analysis accuracy and execution time) and economical costs. We have identified tradeoffs between accuracy, time complexity, analysis pace with respect to malware evolution, required computational resources, and economic cost. Similar tradeoffs also hold for space complexity.

Noteworthy research directions to investigate can be linked to each of the three contributions. The organization of malware analysis works along three dimensions can be further refined by better identifying and characterizing analysis objectives, extracted features, and used machine learning algorithms. Novel combinations of objectives, features and algorithms can be investigated to achieve better performance compared to the state of the art. Moreover, observing that some classes of algorithms have never been used for a certain objective may suggest novel directions to examine further. The discussion on malware analysis datasets can drive academic works aimed at building new datasets in accordance with the three identified desiderata. Providing better datasets would enable better and fairer comparisons among results claimed by diverse works, hence would ease effective progresses in the malware analysis field. Finally, the initial set of general tradeoffs described in the context of malware analysis economics can be deepened to derive quantitative relationships among the key metrics of interest, which would allow defining effective approaches to design and setup analysis environments.

References

  • (1) Z. Bazrafshan, H. Hashemi, S. M. H. Fard, A. Hamzeh, A survey on heuristic malware detection techniques, in: Information and Knowledge Technology (IKT), 2013 5th Conference on, IEEE, 2013, pp. 113–120.
  • (2) Y. Ye, T. Li, D. Adjeroh, S. S. Iyengar, A survey on malware detection using data mining techniques, ACM Computing Surveys (CSUR) 50 (3) (2017) 41.
  • (3) E. Gandotra, D. Bansal, S. Sofat, Malware analysis and classification: A survey, Journal of Information Security 2014.
  • (4) M. K. Sahu, M. Ahirwar, A. Hemlata, A review of malware detection based on pattern matching technique, Int. J. of Computer Science and Information Technologies (IJCSIT) 5 (1) (2014) 944–947.
  • (5) I. Basu, Malware detection based on source data using data mining: A survey, American Journal Of Advanced Computing 3 (1).
  • (6) C. LeDoux, A. Lakhotia, Malware and machine learning, in: Intelligent Methods for Cyber Warfare, Springer, 2015, pp. 1–42.
  • (7) M. Egele, T. Scholte, E. Kirda, C. Kruegel, A survey on automated dynamic malware-analysis techniques and tools, ACM Comput. Surv. 44 (2) (2008) 6:1–6:42. doi:10.1145/2089125.2089126.
    URL http://doi.acm.org/10.1145/2089125.2089126
  • (8) M. G. Schultz, E. Eskin, F. Zadok, S. J. Stolfo, Data mining methods for detection of new malicious executables, in: Security and Privacy (S&P 2001), Proceedings of the 2001 IEEE Symposium on, 2001, pp. 38–49. doi:10.1109/SECPRI.2001.924286.
  • (9) J. Z. Kolter, M. A. Maloof, Learning to detect and classify malicious executables in the wild, J. Mach. Learn. Res. 7 (2006) 2721–2744.
    URL http://dl.acm.org/citation.cfm?id=1248547.1248646
  • (10) F. Ahmed, H. Hameed, M. Z. Shafiq, M. Farooq, Using spatio-temporal information in api calls with machine learning algorithms for malware detection, in: Proceedings of the 2nd ACM workshop on Security and artificial intelligence, ACM, 2009, pp. 55–62.
  • (11) D. H. Chau, C. Nachenberg, J. Wilhelm, A. Wright, C. Faloutsos, Polonium: Tera-scale graph mining for malware detection, in: Acm sigkdd conference on knowledge discovery and data mining, 2010.
  • (12) I. Firdausi, C. Lim, A. Erwin, A. S. Nugroho, Analysis of machine learning techniques used in behavior-based malware detection, in: ACT ’10, IEEE, 2010, pp. 201–203.
  • (13) B. Anderson, D. Quist, J. Neil, C. Storlie, T. Lane, Graph-based malware detection using dynamic analysis, Journal in Computer Virology 7 (4) (2011) 247–258.
  • (14) I. Santos, J. Nieves, P. G. Bringas, International Symposium on Distributed Computing and Artificial Intelligence, Springer Berlin Heidelberg, Berlin, Heidelberg, 2011, Ch. Semi-supervised Learning for Unknown Malware Detection, pp. 415–422. doi:10.1007/978-3-642-19934-9_53.
    URL http://dx.doi.org/10.1007/978-3-642-19934-9_53
  • (15) B. Anderson, C. Storlie, T. Lane, Improving malware classification: bridging the static/dynamic gap, in: Proceedings of the 5th ACM workshop on Security and artificial intelligence, ACM, 2012, pp. 3–14.
  • (16) J. Yonts, Attributes of malicious files, Tech. rep., The SANS Institute (2012).
  • (17) I. Santos, J. Devesa, F. Brezo, J. Nieves, P. G. Bringas, Opem: A static-dynamic approach for machine-learning-based malware detection, in: CISIS ’12-ICEUTE´ 12-SOCO´, Springer, 2013, pp. 271–280.
  • (18) M. Eskandari, Z. Khorshidpour, S. Hashemi, Hdm-analyser: a hybrid analysis approach based on data mining techniques for malware detection, Journal of Computer Virology and Hacking Techniques 9 (2) (2013) 77–93.
  • (19) P. Vadrevu, B. Rahbarinia, R. Perdisci, K. Li, M. Antonakakis, Measuring and Detecting Malware Downloads in Live Network Traffic, Springer Berlin Heidelberg, Berlin, Heidelberg, 2013, pp. 556–573. doi:10.1007/978-3-642-40203-6_31.
    URL http://dx.doi.org/10.1007/978-3-642-40203-6_31
  • (20) J. Bai, J. Wang, G. Zou, A malware detection scheme based on mining format information, The Scientific World Journal 2014.
  • (21) M. Kruczkowski, E. N. Szynkiewicz, Support vector machine for malware analysis and classification, in: Web Intelligence (WI) and Intelligent Agent Technologies (IAT), IEEE Computer Society, 2014, pp. 415–420.
  • (22) A. Tamersoy, K. Roundy, D. H. Chau, Guilt by association: large scale malware detection by mining file-relation graphs, in: Proceedings of the 20th ACM SIGKDD international conference on Knowledge Discovery and Data Mining, ACM, 2014, pp. 1524–1533.
  • (23) D. Uppal, R. Sinha, V. Mehra, V. Jain, Malware detection and classification based on extraction of api sequences, in: Advances in Computing, Communications and Informatics (ICACCI, 2014 International Conference on, IEEE, 2014, pp. 2337–2342.
  • (24) L. Chen, T. Li, M. Abdulhayoglu, Y. Ye, Intelligent malware detection based on file relation graphs, in: Semantic Computing (ICSC), 2015 IEEE International Conference on, 2015, pp. 85–92. doi:10.1109/ICOSC.2015.7050784.
  • (25) E. Elhadi, M. A. Maarof, B. Barry, Improving the detection of malware behaviour using simplified data dependent api call graph, Journal of Security and Its Applications.
  • (26) Z. Feng, S. Xiong, D. Cao, X. Deng, X. Wang, Y. Yang, X. Zhou, Y. Huang, G. Wu, Hrs: A hybrid framework for malware detection, in: Proceedings of the 2015 ACM International Workshop on Security and Privacy Analytics, ACM, 2015, pp. 19–26.
  • (27) M. Ghiasi, A. Sami, Z. Salehi, Dynamic vsa: a framework for malware detection based on register contents, Engineering Applications of Artificial Intelligence 44 (2015) 111–122. doi:10.1016/j.engappai.2015.05.008.
    URL http://www.sciencedirect.com/science/article/pii/S0952197615001190
  • (28) M. Ahmadi, G. Giacinto, D. Ulyanov, S. Semenov, M. Trofimov, Novel feature extraction, selection and fusion for effective malware family classification, CoRR abs/1511.04317.
    URL http://arxiv.org/abs/1511.04317
  • (29) B. J. Kwon, J. Mondal, J. Jang, L. Bilge, T. Dumitras, The dropper effect: Insights into malware distribution with downloader graph analytics, in: Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security, ACM, 2015, pp. 1118–1129.
  • (30) W. Mao, Z. Cai, D. Towsley, X. Guan, Probabilistic inference on integrity for access behavior based malware detection, in: International Workshop on Recent Advances in Intrusion Detection, Springer, 2015, pp. 155–176.
  • (31) J. Saxe, K. Berlin, Deep neural network based malware detection using two dimensional binary program features, in: Malicious and Unwanted Software (MALWARE), 2015 10th International Conference on, IEEE, 2015, pp. 11–20.
  • (32) T. Wüchner, M. Ochoa, A. Pretschner, Robust and effective malware detection through quantitative data flow graph metrics, in: Detection of Intrusions and Malware, and Vulnerability Assessment, Springer, 2015, pp. 98–118.
  • (33) E. Raff, C. Nicholas, An alternative to ncd for large sequences, lempel-ziv jaccard distance, in: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2017, pp. 1007–1015.
  • (34) J. Kwon, H. Lee, Bingraph: Discovering mutant malware using hierarchical semantic signatures, in: Malicious and Unwanted Software (MALWARE), 2012 7th International Conference on, IEEE, 2012, pp. 104–111.
  • (35) A. F. Shosha, C. Liu, P. Gladyshev, M. Matten, Evasion-resistant malware signature based on profiling kernel data structure objects, in: CRiSIS, 2012, IEEE, 2012, pp. 1–8.
  • (36) I. Chionis, S. Nikolopoulos, I. Polenakis, A survey on algorithmic techniques for malware detection.
  • (37) M. Gharacheh, V. Derhami, S. Hashemi, S. M. H. Fard, Proposing an hmm-based approach to detect metamorphic malware, in: Fuzzy and Intelligent Systems (CFIS), 2015, pp. 1–5. doi:10.1109/CFIS.2015.7391648.
  • (38) P. Khodamoradi, M. Fazlali, F. Mardukhi, M. Nosrati, Heuristic metamorphic malware detection based on statistics of assembly instructions using classification algorithms, in: Computer Architecture and Digital Systems (CADS), 2015 18th CSI International Symposium on, IEEE, 2015, pp. 1–6.
  • (39) J. Upchurch, X. Zhou, Variant: a malware similarity testing framework, in: 2015 10th International Conference on Malicious and Unwanted Software (MALWARE), IEEE, 2015, pp. 31–39.
  • (40) G. Liang, J. Pang, C. Dai, A behavior-based malware variant classification technique, International Journal of Information and Education Technology 6 (4) (2016) 291.
  • (41) P. Vadrevu, R. Perdisci, Maxs: Scaling malware execution with sequential multi-hypothesis testing, in: ASIA CCS ’16, ACM, New York, NY, USA, 2016, pp. 771–782. doi:10.1145/2897845.2897873.
    URL http://doi.acm.org/10.1145/2897845.2897873
  • (42) T. Lee, J. J. Mody, Behavioral classification, in: EICAR Conference, 2006, pp. 1–17.
  • (43) K. Huang, Y. Ye, Q. Jiang, Ismcs: an intelligent instruction sequence based malware categorization system, in: Anti-counterfeiting, Security, and Identification in Communication, 2009, IEEE, 2009, pp. 509–512.
  • (44) Y. Park, D. Reeves, V. Mulukutla, B. Sundaravel, Fast malware classification by automated behavioral graph matching, in: Workshop on Cyber Security and Information Intelligence Research, ACM, 2010, p. 45.
  • (45) Y. Ye, T. Li, Y. Chen, Q. Jiang, Automatic malware categorization using cluster ensemble, in: Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining, ACM, 2010, pp. 95–104.
  • (46) G. E. Dahl, J. W. Stokes, L. Deng, D. Yu, Large-scale malware classification using random projections and neural networks, in: Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2013, pp. 3422–3426.
  • (47) X. Hu, K. G. Shin, S. Bhatkar, K. Griffin, Mutantx-s: Scalable malware clustering based on static features, in: USENIX Annual Technical Conference, 2013, pp. 187–198.
  • (48) R. Islam, R. Tian, L. M. Batten, S. Versteeg, Classification of malware based on integrated static and dynamic features, Journal of Network and Computer Applications 36 (2) (2013) 646–656.
  • (49) D. Kong, G. Yan, Discriminant malware distance learning on structural information for automated malware classification, in: KDD ’13, ACM, New York, NY, USA, 2013, pp. 1357–1365. doi:10.1145/2487575.2488219.
    URL http://doi.acm.org/10.1145/2487575.2488219
  • (50) S. Nari, A. A. Ghorbani, Automated malware classification based on network behavior, in: Computing, Networking and Communications (ICNC), 2013 International Conference on, IEEE, 2013, pp. 642–647.
  • (51) N. Kawaguchi, K. Omote, Malware function classification using apis in initial behavior, in: Information Security (AsiaJCIS), 2015 10th Asia Joint Conference on, IEEE, 2015, pp. 138–144.
  • (52) C.-T. Lin, N.-J. Wang, H. Xiao, C. Eckert, Feature selection and extraction for malware classification, Journal of Information Science and Engineering 31 (3) (2015) 965–992.
  • (53) A. Mohaisen, O. Alrawi, M. Mohaisen, Amal: High-fidelity, behavior-based automated malware analysis and classification, Computers & Security.
  • (54) S. Pai, F. Di Troia, C. A. Visaggio, T. H. Austin, M. Stamp, Clustering for malware classification.
  • (55) R. Tian, L. M. Batten, S. C. Versteeg, Function length as a tool for malware classification, in: Malicious and Unwanted Software (MALWARE), 2008 3rd International Conference on, IEEE, 2008, pp. 69–76. doi:10.1109/MALWARE.2008.4690860.
  • (56) M. Siddiqui, M. C. Wang, J. Lee, Detecting internet worms using data mining techniques, Journal of Systemics, Cybernetics and Informatics (2009) 48–53.
  • (57) Z. Chen, M. Roussopoulos, Z. Liang, Y. Zhang, Z. Chen, A. Delis, Malware characteristics and threats on the internet ecosystem, Journal of Systems and Software 85 (7) (2012) 1650–1672.
  • (58) P. M. Comar, L. Liu, S. Saha, P. N. Tan, A. Nucci, Combining supervised and unsupervised learning for zero-day malware detection, in: INFOCOM, 2013 Proceedings IEEE, 2013, pp. 2022–2030. doi:10.1109/INFCOM.2013.6567003.
  • (59) J. Sexton, C. Storlie, B. Anderson, Subroutine based detection of apt malware, Journal of Computer Virology and Hacking Techniques (2015) 1–9. doi:10.1007/s11416-015-0258-7.
    URL http://dx.doi.org/10.1007/s11416-015-0258-7
  • (60) M. Bailey, J. Oberheide, J. Andersen, Z. M. Mao, F. Jahanian, J. Nazario, Automated classification and analysis of internet malware, in: Recent advances in intrusion detection, Springer, 2007, pp. 178–197.
  • (61) U. Bayer, P. M. Comparetti, C. Hlauschek, C. Kruegel, E. Kirda, Scalable, behavior-based malware clustering, in: NDSS, Vol. 9, Citeseer, 2009, pp. 8–11.
  • (62) K. Rieck, P. Trinius, C. Willems, T. Holz, Automatic analysis of malware behavior using machine learning, Journal of Computer Security 19 (4) (2011) 639–668.
  • (63) S. Palahan, D. Babić, S. Chaudhuri, D. Kifer, Extraction of statistically significant malware behaviors, in: Computer Security Applications Conference, ACM, 2013, pp. 69–78.
  • (64) M. Egele, M. Woo, P. Chapman, D. Brumley, Blanket execution: Dynamic similarity testing for program binaries and components, in: USENIX Security ’14, USENIX Association, San Diego, CA, 2014, pp. 303–317.
    URL https://www.usenix.org/conference/usenixsecurity14/technical-sessions/presentation/egele
  • (65) M. Lindorfer, C. Kolbitsch, P. M. Comparetti, Detecting environment-sensitive malware, in: Recent Advances in Intrusion Detection, Springer, 2011, pp. 338–357.
  • (66) I. Santos, F. Brezo, X. Ugarte-Pedrero, P. G. Bringas, Opcode sequences as representation of executables for data-mining-based unknown malware detection, Information Sciences 231 (2013) 64–82.
  • (67) M. Polino, A. Scorti, F. Maggi, S. Zanero, Jackdaw: Towards Automatic Reverse Engineering of Large Datasets of Binaries, in: M. Almgren, V. Gulisano, F. Maggi (Eds.), Detection of Intrusions and Malware, and Vulnerability Assessment, Lecture Notes in Computer Science, Springer International Publishing, 2015, pp. 121–143. doi:10.1007/978-3-319-20550-2_7.
    URL http://link.springer.com/chapter/10.1007/978-3-319-20550-2_7
  • (68) M. Graziano, D. Canali, L. Bilge, A. Lanzi, D. Balzarotti, Needles in a haystack: Mining information from public dynamic analysis sandboxes for malware intelligence, in: USENIX Security ’15, 2015, pp. 1057–1072.
    URL https://www.usenix.org/conference/usenixsecurity15/technical-sessions/presentation/graziano
  • (69) A. Caliskan-Islam, R. Harang, A. Liu, A. Narayanan, C. Voss, F. Yamaguchi, R. Greenstadt, De-anonymizing programmers via code stylometry, in: USENIX Security ’15, USENIX Association, Washington, D.C., 2015, pp. 255–270.
    URL https://www.usenix.org/conference/usenixsecurity15/technical-sessions/presentation/caliskan-islam
  • (70) J. Jang, D. Brumley, S. Venkataraman, Bitshred: feature hashing malware for scalable triage and semantic analysis, in: Computer and communications security, ACM, 2011, pp. 309–320.
  • (71) D. Kirat, L. Nataraj, G. Vigna, B. Manjunath, Sigmal: A static signal processing based malware triage, in: Proceedings of the 29th Annual Computer Security Applications Conference, ACM, 2013, pp. 89–98.
  • (72) S. Srakaew, W. Piyanuntcharatsr, S. Adulkasem, On the comparison of malware detection methods using data mining with two feature sets, Journal of Security and Its Applications 9 (2015) 293–318.
  • (73) M. Asquith, Extremely scalable storage and clustering of malware metadata, Journal of Computer Virology and Hacking Techniques (2015) 1–10.
  • (74) B. G. Ryder, Constructing the call graph of a program, IEEE Transactions on Software Engineering SE-5 (3) (1979) 216–226. doi:10.1109/TSE.1979.234183.
  • (75) F. E. Allen, Control flow analysis, in: Proceedings of a Symposium on Compiler Optimization, ACM, New York, NY, USA, 1970, pp. 1–19. doi:10.1145/800028.808479.
    URL http://doi.acm.org/10.1145/800028.808479
  • (76) H. Pomeranz, Detecting malware with memory forensics, http://www.deer-run.com/~hal/Detect_Malware_w_Memory_Forensics.pdf, accessed: 2016-11-28 (2012).
  • (77) I. H. Witten, E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, Second Edition (Morgan Kaufmann Series in Data Management Systems), Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2005.
  • (78) M. Gönen, E. Alpaydın, Multiple kernel learning algorithms, J. Mach. Learn. Res. 12 (2011) 2211–2268.
    URL http://dl.acm.org/citation.cfm?id=1953048.2021071
  • (79) L. Breiman, Random forests, Mach. Learn. 45 (1) (2001) 5–32. doi:10.1023/A:1010933404324.
    URL http://dx.doi.org/10.1023/A:1010933404324
  • (80) C. D. Manning, P. Raghavan, H. Schütze, Introduction to Information Retrieval, Cambridge University Press, New York, NY, USA, 2008.
  • (81) L. Rokach, O. Maimon, Clustering Methods, Springer US, Boston, MA, 2005, pp. 321–352. doi:10.1007/0-387-25465-X_15.
    URL http://dx.doi.org/10.1007/0-387-25465-X_15
  • (82) D. Arthur, B. Manthey, H. Röglin, k-means has polynomial smoothed complexity, in: FOCS ’09, 2009, pp. 405–414. doi:10.1109/FOCS.2009.14.
  • (83) H. D. Vinod, Integer programming and the theory of grouping, Journal of the American Statistical Association 64 (326) (1969) 506–519.
    URL http://www.jstor.org/stable/2283635
  • (84) H.-S. Park, C.-H. Jun, A simple and fast algorithm for k-medoids clustering, Expert Systems with Applications 36 (2, Part 2) (2009) 3336–3341. doi:10.1016/j.eswa.2008.01.039.
    URL http://www.sciencedirect.com/science/article/pii/S095741740800081X
  • (85) M. Ester, H.-P. Kriegel, J. Sander, X. Xu, A density-based algorithm for discovering clusters in large spatial databases with noise, in: Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD-96), AAAI Press, 1996, pp. 226–231.
  • (86) S. C. Johnson, Hierarchical clustering schemes, Psychometrika 32 (3) (1967) 241–254. doi:10.1007/BF02289588.
    URL http://dx.doi.org/10.1007/BF02289588
  • (87) T. Kohonen, Essentials of the self-organizing map, Neural Netw. 37 (2013) 52–65. doi:10.1016/j.neunet.2012.09.018.
    URL http://dx.doi.org/10.1016/j.neunet.2012.09.018
  • (88) D. Zhou, O. Bousquet, T. N. Lal, J. Weston, B. Schölkopf, Learning with local and global consistency, in: Advances in Neural Information Processing Systems 16, MIT Press, 2004, pp. 321–328.
  • (89) Department of Homeland Security, A roadmap for cybersecurity research, Tech. rep. (2009).
  • (90) Cabinet Office, The UK Cyber Security Strategy, https://www.gov.uk/government/uploads/system/uploads/attachment_data/file/60961/uk-cyber-security-strategy-final.pdf (2011).
  • (91) L. Bilge, D. Balzarotti, W. Robertson, E. Kirda, C. Kruegel, Disclosure: detecting botnet command and control servers through large-scale netflow analysis, in: ACSAC ’12, ACM, 2012, pp. 129–138.
  • (92) B. Miller, A. Kantchelian, S. Afroz, R. Bachwani, R. Faizullabhoy, L. Huang, V. Shankar, M. Tschantz, T. Wu, G. Yiu, et al., Back to the future: Malware detection with temporally consistent labels, CoRR.
  • (93) A. Damodaran, F. Di Troia, C. A. Visaggio, T. H. Austin, M. Stamp, A comparison of static, dynamic, and hybrid analysis for malware detection, Journal of Computer Virology and Hacking Techniques (2015) 1–12.
  • (94) G. Laurenza, D. Ucci, L. Aniello, R. Baldoni, An architecture for semi-automatic collaborative malware analysis for cis, in: Dependable Systems and Networks Workshop, 2016 46th Annual IEEE/IFIP International Conference on, IEEE, 2016, pp. 137–142.