# FeatureAnalytics: An approach to derive relevant attributes for analyzing Android Malware

Research and Development Centre, Bharathiar University, Coimbatore, India
Department of IT and Science, Dr.GRD College of Science, Coimbatore, India
Thapar Institute of Engineering and Technology, Punjab, India
Deepa K., Radhamani G., Vinod P., M. Shojafar, N. Kumar, M. Conti
17 September 2018
###### Abstract

The ever-increasing number of Android malware samples has long been a concern for cybersecurity professionals. Even though plenty of anti-malware solutions exist, a rational and pragmatic approach to the problem is rare and merits further investigation. In this paper, we propose a novel two-step feature selection approach based on Rough Set and Statistical Test, named RSST, to extract relevant system calls. To address the problem of a high-dimensional attribute set, we derive a suboptimal system call space by applying the proposed feature selection method to maximize the separability between malware and benign samples. Comprehensive experiments conducted on a dataset of 3500 samples with 30 RSST-derived essential system calls resulted in an accuracy of 99.9%, an Area Under Curve (AUC) of 1.0, and a 1% False Positive Rate (FPR). In contrast, other feature selectors used in the domain of malware analysis (Information Gain, CFsSubsetEval, ChiSquare, FreqSel and Symmetric Uncertainty) resulted in an accuracy of 95.5% with 8.5% FPR. Moreover, empirical analysis shows that RSST-derived system calls outperform other attributes such as permissions, opcodes, API, methods, call graphs, Droidbox attributes and network traces.

Deepa K., Radhamani G., Vinod P., M. Shojafar, N. Kumar, and M. Conti (2018), FeatureAnalytics: An approach to derive relevant attributes for analyzing Android Malware, Concurrency and Computation: Practice and Experience, 2018;00:1–26.

Keywords: Android malware detection, machine learning, system calls, feature selection, rough set, statistical test.
[1] Deepa K., [2] Radhamani G., [1] Vinod P., [3] Mohammad Shojafar, [4] Neeraj Kumar, [2] Mauro Conti

## 1 Introduction

In the past few years, Android has been widely adopted as the preferred operating system of smartphones, tablets, and even Internet of Things (IoT) devices. In particular, smartphones, being portable and having extensive computing capabilities, have gained more widespread attention than personal computers. Reports from [1] state that in 2017 the sales of Android-based smartphones surpassed 1.32 billion. The smartphone industry is steadily growing and is estimated to reach 1.71 billion in 2020 [1]. Since December 2017 [1], over 3.5 million new apps have been uploaded to the Google Play Store. Unfortunately, security issues with the Android system evolve due to the tremendous growth of third-party app stores, which host numerous malware applications. Recently, Sophos lab [2] reported the submission of 10 million Android samples by the end of December 2017, of which 77% were identified as malware. Notably, the popularity of Android phones and their ubiquitous nature have also attracted adversaries who exfiltrate critical information from compromised devices. Moreover, these malicious apps are used for phone-tapping, stealing sensitive information and geographic locations, and sending premium-rate messages. Considering the above-mentioned circumstances, immediate attention is required to secure smart devices against malignant applications.

There are broadly two approaches for malware detection: (a) static analysis and (b) dynamic analysis. Traditionally, static analysis is used to create malware signatures. A signature may be expressed as a sequence of instructions, strings, bits or hashes. Even though signature-based techniques can rapidly identify malicious applications, they can be easily evaded by code polymorphism or techniques involving source code transformation. Additionally, signature-based detectors prove ineffective for detecting zero-day malware. Dynamic analysis is also known as a behavior-based approach. Here, the antimalware engine evaluates the actions of an app to determine whether the application demands unauthorized access to sensitive resources.

Moreover, several reinforcement solutions have been proposed to study malware code and its behavior in order to understand the threats [3]. To be precise, machine learning-based approaches (MLA) in malware detection have gained increased acceptance due to their improved detection accuracy [4, 5].

In this context, several questions arise: Is it possible to present an Android malware detection framework that categorizes various applications and distinguishes between the malicious and benign ones? How can we select significant features, either using static or dynamic analysis, to identify the malware apps? How can we construct an optimal feature vector to improve classification? The goal of this paper is to shed light on these issues.

In this paper, we present and investigate malware detection by developing a feature set comprising system calls. The strength of this feature space is its power to represent the behavior of an application. To expose the intended actions of a monitored application, Android Monkey [6] is employed to supply random inputs (in the form of swipes, clicks, etc.) to the sample. Each event triggers the invocation of a set of characteristic system calls. The extracted set of attributes might contain irrelevant calls that do not help in identifying malicious applications. Hence, a two-step feature selection method is proposed. Initially, an optimal feature vector is derived by applying the rough set feature selection approach [7, 8]. Then, to boost the performance of classifiers, we refine these attributes using a statistical test, precisely the Large Population Test [9], to generate the list of prominent malware and benign attributes. These features are subsequently utilized to develop classification models using algorithms such as AdaBoostM1-J48 [10], Random forest [11], and Rotation forest [12]. The main contributions of this study are summarized as follows:

1. We propose a two-step feature selection approach inspired by the rough set and statistical test (named as RSST), capable of eliminating irrelevant attributes (i.e., system calls) for improving classifier detection performance.

2. We perform an extensive analysis to identify the optimal feature vector, which is ascertained by varying the number of system calls in the feature space. The feature set derived with the Large Population Test (LPT) yielded an accuracy of 99.8% with a False Positive Rate (FPR) of 0.001.

3. We perform an extensive analysis using different categories of features extracted with static and dynamic analysis. In particular, the static features considered are permissions, opcodes, and APIs, and the dynamic features include system calls, network traces, system call graphs and attributes extracted from Droidbox [13]. Moreover, we demonstrate that the classification model created with system calls performs better than the others.

4. We thoroughly evaluate the performance of our proposed feature selection approach against other attribute selectors traditionally utilized in the domain of malware detection. The experiments show that the set of system calls derived by our method separates malware and legitimate instances with 99.9% accuracy, outperforming the alternate techniques.

The rest of this paper is structured as follows. Section 2 discusses the related work in Android malware detection. Section 3 presents our methodology. In Section 4, we discuss the experiments and present the different experimental settings used to obtain relevant calls. Section 5 presents the static and dynamic analysis used to validate the efficiency of our proposed system call feature set. In Section 6, the detection performance of our proposed two-step feature selection model is compared with that of other attribute selectors. Section 7 then describes the essential insights and achievements of our solution and the related experiments. We compare our method holistically against state-of-the-art approaches in Section 8 and list its limitations in Section 9. Finally, the conclusion is given in Section 10.

## 2 Related work

In the following, we briefly discuss the solutions adopted for Android malware analysis, grouped into static (see Section 2.1), dynamic (see Section 2.2) and hybrid (see Section 2.3) categories.

### 2.1 Static Analysis

The authors of [14] proposed DREBIN, which extracts features from both manifest files and bytecode and embeds all of the attributes into a joint vector space to detect malicious apps using support vector machines (SVM). DREBIN detects 94% of the malware samples with an FPR of 1%. However, it is quite ineffective at detecting new types of malware because of the enormous size of the feature set.

The authors in [15] present a systematic characterization of Android malware based on the method of installation, activation mechanism and malicious payload.

Another solution [16] combines application permissions, broadcast receivers, the presence of embedded Android applications and native code. It adopts random decision trees that build rules comprising two or three features.

A probabilistic discriminative model based on regularized logistic regression for detecting malicious apps from decompiled source code was proposed by the authors in [17]. API calls and permissions were used as features. Further, feature selection methods, i.e., Information gain and Chi-squared, were utilized for determining significant attributes. A comprehensive analysis was performed on datasets collected from different sources. Finally, classifier performance was evaluated using metrics such as precision, recall, and Area Under Curve (AUC).

Droid Detective [18] uses a rule-based approach to classify apps as benign or malware. Combinations of permissions together with their frequencies are utilized to create a set of rules. Subsequently, the rules that best discriminate malware and benign samples were extensively investigated.

Wei et al. [19] proposed a framework for identifying malapps and trusted applications along with the categorization of benign samples into diverse groups. Features such as requested/used permissions, filtered intents, code-related information, restricted/suspicious API calls and hardware-related attributes were extracted from a large collection of applications. Support Vector Machine (SVM) was used to rank attributes. An ensemble of SVM, CRT, Naive Bayes, Random forest and K-NN was considered to label an app with its respective class. The decision for unseen samples was reached using majority voting.

### 2.2 Dynamic Analysis

In [20], the authors considered a vectorized representation of CPU utilization, network traffic, power and memory consumption by apps as features. Information gain feature selection filtered 10 prominent attributes out of a total of 32 features. Malware detection models employing Random forest, Naive Bayes, Logistic Regression and Support Vector Machine (SVM) were created. An F-Measure of 0.993 with 0.998 AUC obtained with Random forest demonstrates the suitability of ensemble classifiers for developing malware classification models.

The authors in [21] presented a detection framework using system calls that can be implemented in a resource-constrained environment. To this end, they proposed a filtering and abstraction process on 200 popular applications. During the filtering phase, system calls that are irrelevant for describing the behaviour of the applications are eliminated. Later, system calls with identical functions are consolidated. However, we argue that the abstraction phase might lose important information, affecting the identification of malicious samples among legitimate instances. Further, the return types and the parameters passed as arguments to the calls are distinct, so mapping multiple calls to a few representative ones is not always feasible.

The authors in [22] present an ML classification system employing 59 Linux-based features characterizing memory, CPU, and network from the Android OS to detect malicious applications. The analysis was carried out by eliminating a set of attributes to estimate the performance of the classification model. Finally, 36 out of 59 features learned with SVM resulted in 98.85% accuracy with a 0.67% False Positive Rate.

In [23], TCP packets captured during active communication between the infected system and the attacker server were used to build the feature set. The ClassifierSubsetEval feature selection method implemented in WEKA filtered six out of 11 attributes. Algorithms such as Bayes network, multi-layer perceptron, decision tree (J48), k-nearest neighbor and Random forest were considered for developing learning models. The experimental results indicated 99.99% accuracy.

A client-side application capable of executing on the device to detect deviations of legitimate apps from their malicious counterparts was proposed in [24]. The detection system consisted of a machine learning model trained with network traces. The study demonstrated that applications were easily distinguishable by analyzing the traffic patterns. Thus, the behavior of applications could be modelled by analyzing network behavior. It was also possible to verify whether the behavior of an application was what it claimed to be.

Andromaly, a lightweight host-based framework for anomaly detection on Android smartphones, was discussed in [25]. Andromaly monitors various system metrics, such as CPU usage, volume of data transferred through the network, number of active processes and battery usage. Then, Andromaly receives the feature vectors from the main service and analyzes them (i.e., using rule-based or knowledge-based classifiers or anomaly detectors) to perform threat assessment with the Threat Weighting Unit (TWU), whose output is eventually used in the detection process.

In [26], sequences of system calls with different depths were used as features. Initially, a set of malicious and benign applications was executed on a smartphone; later, the call logs were processed outside the device. System call names were extracted and used to create malicious and legitimate patterns with respect to an experimentally determined threshold value. Experiments conducted on 2000 malicious and benign applications resulted in an accuracy exceeding 90%.

### 2.3 Hybrid Analysis

In [27], the authors present a two-sided malicious app defense scanner using an ML technique modelled on the Random forest classifier. Firstly, the scanner executes the samples in a sandbox environment, and system calls are collected. Secondly, the categorized applications, which are labelled as malware or benign, are rechecked by monitoring the network activity of each app. Finally, wrongly labelled files are corrected if any application exhibits suspicious network activity.

The detection of smartphone malware using a subset of system calls, the weighted sum of permissions and combinations of permissions was addressed in [28]. The experimental study reported a statistical difference in the open, read, recv and write system calls. The authors also claim that these system calls can be used for appropriately classifying malware and trusted applications. An overall precision of approximately 85% was obtained by estimating values for different evaluation parameters. Besides, in [29], the authors provide information flow control along with declassification policies on unannotated programs, with support for runtime security labels. Such a solution represents hybrid approaches that cover dynamic labels and execution constraints to handle legacy, untrusted and mobile code.

On the other hand, the authors in [30] present Manilyzer, an information-exploiting system that adopts several classification algorithms, including SVM and C4.5. Their results report 90% accuracy in classifying an app corpus of 617 applications. However, being a static analysis, it fails to generalize to the patterns of new malware specimens, for which the authors discuss detection by capturing attributes from network packets.

The authors in [31] present a three-phase detection and classification framework: Permission-Based Detection (PBD), System-Call Based Detector (SBD), and classification of malware into their respective types. Experiments on a dataset comprising 933 benign and 265 malware samples resulted in a 97% True Positive Rate (TPR) with a 3% False Positive Rate (FPR) and 98% accuracy.

After reviewing the previously published papers, we conclude that researchers have concentrated on improving classification outcomes by deriving attributes or by using feature selection methods commonly used in the machine learning domain. Different from prior work, we focus on developing a novel feature selection method for improving results, along with investigating robust attributes that can be used for developing effective malware identification models.

## 3 Proposed Methodology

In this section, we present our approach for identifying malicious samples. The framework, shown in Fig. 1, is designed as a sequence of phases. In the initial stage, we collect malware and benign samples from multiple data stores. Each sample is installed in an emulator and then exercised with Android Monkey. We use the strace command to capture the system calls of the Android application. System call logs are processed to filter call names, which are further used as features. In the subsequent phase, the feature set is refined using a two-step feature selection approach: first, irrelevant calls are eliminated using the rough set approach; then, the large population test is applied to determine discriminant system calls. Classification models are developed with the extracted attribute set, and finally, samples are separated into one of the two classes, i.e., either malware or benign. In the subsequent sections, we detail each phase involved in our proposed approach.

### 3.1 Dataset Description

The dataset consists of 2000 benign applications downloaded from Google Play Store [32], a Chinese market [33], Koodous [34], and third-party Android markets [35, 36]. Each sample is submitted to VirusTotal [37], an online antimalware service, to confirm that the applications are indeed legitimate. The malware dataset comprises 1500 samples: 554 malicious apps randomly collected from the Drebin project, 450 taken from Koodous (a collaborative platform for Android malware analysis that combines analysis tools with social interactions), and a set of 496 ransomware apps [38].

### 3.2 Extraction of System calls

System call logs recorded during the execution of each sample are used as the features for classifying a file as malware or benign. Linux system calls can be categorized by operating system functionality, such as process management, file management, memory management, device management, information management and communication. System calls act as an interface between the user and the kernel. All requests made in user mode are forwarded through the system call interface before being executed by the hardware. For instance, if a user intends to make a phone call using a call application, the user request is transferred to the Telephony Manager Service and then to a set of library calls, which in turn results in multiple invocations of system calls. During the execution of a system call, control is transferred from user mode to kernel mode. When the system call completes, control is returned to user mode. Thus, the interaction of a program with the OS can be precisely captured by representing the feature set with system calls or sequences of system calls. Moreover, static features (e.g., permissions, metadata, opcodes, APIs, and intents) are susceptible to change due to obfuscation, whereas system calls are relatively resilient to obfuscation compared with attributes extracted during static analysis.

Each application is installed in the Android emulator using the adb install command. Initially, we keep track of the zygote process, which starts at init. Whenever a new application is launched, the zygote is forked. Using the strace utility, we monitor the zygote and later filter out the process id, i.e., the pid of the required application. Using Android Monkey, random events consisting of touches, clicks, gestures, etc. are supplied to the application. In particular, we subject the application to SMS, phone and direction events one after another to elicit its actual behavior. During each event, system calls are recorded. Employing Android Monkey, 2500 events are supplied to an application, and the call trace is collected. The execution trace consists of system call names, parameters and return values. Using a customized parser, we extract the call names, which are utilized as features. The following steps are employed to extract system calls.

• The .apk files are installed in the emulator using the adb install command.

• Afterwards, the call trace is recorded with the strace utility [39]. The inputs to strace are the pid and the log file name.
strace -p <pid> -o sdcard/logfilename

• The aapt command is run to obtain the package name of an apk. The following command is used to interact with an application.
adb shell monkey -p <package-name> -v <# events>

• Other fake actions (such as sending SMS, making/receiving calls or setting locations) are performed using some commands shown below:

• Connect to the emulator, using telnet:
telnet localhost 5554

• To make a phone call
gsm call <callerPhoneNumber>

• Send an SMS
sms send <senderPhoneNumber> <textMessage>

• To change geo-locations
geo fix <longitude value> <latitude value>

• The log file stored in the emulator is copied to the host machine using adb pull.
adb -s emulator-<id> pull sdcard/logfilename destination-path

• The emulator instance is killed and the android device is restored to the previous clean state.
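The customized parser mentioned above is not described in detail. As a minimal sketch of how call names could be pulled from an strace log, the following assumes the usual strace line layout `name(args) = ret`, optionally preceded by a pid or timestamp; the function name `extract_call_names` and the sample log lines are hypothetical and only serve as illustration.

```python
import re
from collections import Counter

# Optional "[pid N]" or bare pid prefix, optional HH:MM:SS timestamp,
# then the system call name immediately followed by "(".
CALL_RE = re.compile(
    r"^(?:\[pid\s+\d+\]\s+|\d+\s+)?"
    r"(?:\d{2}:\d{2}:\d{2}(?:\.\d+)?\s+)?"
    r"([a-z_][a-z0-9_]*)\("
)

def extract_call_names(lines):
    """Return the system call names found in strace output lines, in order."""
    names = []
    for line in lines:
        match = CALL_RE.match(line.strip())
        if match:  # signal, exit and "resumed" lines do not match and are skipped
            names.append(match.group(1))
    return names

# Hypothetical log excerpt, not taken from the paper's dataset.
sample_log = [
    'openat(AT_FDCWD, "/data/app/base.apk", O_RDONLY) = 34',
    'read(34, "PK\\3\\4"..., 4096) = 4096',
    '[pid  2101] write(1, "ok", 2) = 2',
    '--- SIGCHLD {si_signo=SIGCHLD} ---',
    '<... read resumed> "", 16) = 0',
    '+++ exited with 0 +++',
]
call_names = extract_call_names(sample_log)
call_counts = Counter(call_names)  # per-sample call frequencies
```

Only the call name is retained here, mirroring the text; parameters and return values are discarded.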

### 3.3 Representation of Feature Vector

The goal of a malware classification system is to map a collection of applications (or apks) into a fixed number of predefined categories, i.e., malware ($M$) or benign ($B$). Hence, this is a supervised learning problem. To this end, the preliminary task is to transform each apk into a group of features. Formally speaking, each system call in this case corresponds to a feature. To adapt attributes into a feature vector, representative calls from samples are converted to a specific value. In the conventional approach, feature vectors are represented as boolean values (presence/absence of an attribute is expressed as 1/0) or as the number of times an attribute occurs in the samples. Since the classification is based on the contents (i.e., system calls), an attribute weighting scheme known as Term Frequency-Inverse Document Frequency (TF-IDF) [40] is utilized. Specifically, each element of a vector is the TF-IDF weight of a system call. This representation assigns a higher weight to system calls that are typical of a sample, compared to calls that are relatively rare in the whole collection of instances. The collection of feature vectors is referred to as the Feature Vector Table (FVT), a data structure consisting of rows of instances (vectors) and columns of system calls (see Fig. 1). As supervised learning is used, each vector is labeled as malware ($M$) or benign ($B$).
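The exact TF-IDF variant is not stated in this section. The sketch below builds a small Feature Vector Table under one common assumption: term frequency is the call count divided by the trace length, and inverse document frequency is $\log(N/\mathrm{df})$. The function name `tfidf_table` and the three example traces are hypothetical.

```python
import math
from collections import Counter

def tfidf_table(traces):
    """Build a Feature Vector Table from {sample_name: [call names]}.

    Assumed weighting: tf = count / trace length, idf = log(N / document frequency).
    Returns the sorted vocabulary (columns) and one weight row per sample.
    """
    vocab = sorted({call for calls in traces.values() for call in calls})
    n_samples = len(traces)
    doc_freq = Counter(call for calls in traces.values() for call in set(calls))
    rows = {}
    for name, calls in traces.items():
        counts = Counter(calls)
        total = len(calls) or 1  # guard against empty traces
        rows[name] = [
            (counts[call] / total) * math.log(n_samples / doc_freq[call])
            for call in vocab
        ]
    return vocab, rows

# Hypothetical traces for illustration only.
example_traces = {
    "app_a": ["open", "read", "read", "close"],
    "app_b": ["open", "sendto", "sendto"],
    "app_c": ["open", "read", "close"],
}
vocab, rows = tfidf_table(example_traces)
```

In this toy table, `open` occurs in every trace and therefore receives weight 0, while `sendto`, which appears only in `app_b`, receives the largest weight in that row.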

### 3.4 Feature Selection Approach

A property of an application that is measured and characterizes it is known as a feature (or attribute). One of the dominant problems in machine learning has been identifying a sub-optimal feature vector with a strength equivalent to the full feature space. Feature selection is a combinatorial optimization problem that aims to minimize redundant or irrelevant attributes. A feature is characterized as redundant if the information it conveys is more or less identical to that of one or more other features. In contrast, a feature is considered irrelevant if it does not carry essential information for identifying target classes.

Generally speaking, the objective of a feature selection approach is to determine an optimal set from a large finite set of attributes, or to reduce the number of possible solutions. In such problems, an exhaustive search is not feasible. In the context of malware detection, a set of features (or attributes) is considered relevant if it can identify the target classes. Techniques using feature selection attempt to identify a small subset of attributes based on a fixed criterion. Relevance can be computed using specific statistical or information-theoretic approaches. Also, feature selection derives the useful attributes without changing their physical meaning.

In this work, we applied a forward feature selection method. Feature selection was performed using a search approach by applying the SFE method. In this regard, the concepts of rough set theory [41] are used. Here, a table is represented as a tuple $(U, A)$, where $U$ represents the universe of instances (i.e., apk files), $A$ denotes the set of attributes, and $a \in A$ is an attribute. Let $P \subseteq A$ be the subset of features obtained by eliminating sparse features. The attribute set is partitioned into a set of conditional attributes $C$ (i.e., system calls) and a set of decision attributes $D$ (i.e., $M$ or $B$).

The indiscernibility relation, $IND(P)$, is an equivalence relation defined as in Equation (1).

$$IND(P) = \{(x, y) \in U \times U \mid \forall a \in P,\ a(x) = a(y)\}, \tag{1}$$

where $a(x)$ is the feature value of object $x$; here $(x, y) \in IND(P)$ denotes that $x$ and $y$ are indiscernible with respect to $P$; and $U/IND(P)$ represents all equivalence classes in $IND(P)$.

$L_{\star}X$ is the lower approximation of $X$, which represents the elements of $U$ that are surely in $X$. It is denoted by

$$L_{\star}X = \bigcup \{E \in U/IND(L) : E \subseteq X\}, \tag{2}$$

$L^{\star}X$ is the upper approximation of $X$, which represents the set of elements that are possibly classified as elements in $X$. Equation (3) contains the definition of the upper approximation.

$$L^{\star}X = \bigcup \{E \in U/IND(L) : E \cap X \neq \emptyset\}, \tag{3}$$

The positive region $POS_C(D)$ is the set of applications in $U$ that can be classified with certainty into the classes $U/IND(D)$ using the attributes $C$. In this paper, the significance of a feature, i.e., a system call, calculated from the positive region is used as the criterion for feature selection. To construct a feature set, we estimate the reducts of the conditional attributes (i.e., the set of system calls) with respect to the decision attributes (i.e., the target classes). Johnson's greedy algorithm [42] can be used to determine reducts. Reducts thus eliminate all superfluous attributes from the feature set. Formally, a reduct of the conditional attributes $C$ with respect to the decision attributes $D$ is a set of system calls $R \subseteq C$ that satisfies the following properties: (1) the classification obtained with $R$ is equivalent to that obtained with $C$, specifically the positive regions for $R$ and $C$ are identical, i.e., $POS_R(D) = POS_C(D)$; and (2) $R$ is minimal, i.e., $POS_{R \setminus \{r\}}(D) \neq POS_R(D)$ for every $r \in R$. The positive region is given by Equation (4), where $\underline{C}X$ denotes the lower approximation of $X$ with respect to $C$.

$$POS_C(D) = \bigcup_{X \in U/IND(D)} \underline{C}X, \tag{4}$$

Finally, the significance of the decision attributes ($D$) on $C$ is defined as:

$$\Psi_C(D) = \frac{|POS_C(D)|}{|U|}. \tag{5}$$

where $|U|$ is the cardinality of the set $U$. A system call $s$ is irrelevant in the system call set $C$ if $\Psi_{C \setminus \{s\}}(D) = \Psi_C(D)$; otherwise $s$ is regarded as a relevant feature in $C$ with respect to the target classes $D$. Therefore, the set of attributes in a reduct preserves the separation of classes. Subsequently, the FVT records the TF-IDF scores of system calls. These scores are mapped into four bins, where bin $B_1$ contains TF-IDF values between 0.0 and 0.25, $B_2$ contains values between 0.26 and 0.5, $B_3$ contains values between 0.51 and 0.75, and $B_4$ contains values between 0.76 and 1.0. Finally, this binned representation of the feature vector table is used to determine the relevant attributes in the feature space.
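To make the significance measure of Equation (5) concrete, the following sketch bins scores, partitions the samples by their binned values, and counts the samples that fall in blocks containing a single class. It is an illustration under stated assumptions: bin edges are treated as closed upper bounds (0.25, 0.5, 0.75), and the four-sample table and all function names are hypothetical, not the paper's data or implementation.

```python
from collections import defaultdict

def to_bin(score):
    """Map a TF-IDF score in [0, 1] to a bin index 1..4 (assumed closed upper edges)."""
    if score <= 0.25:
        return 1
    if score <= 0.5:
        return 2
    if score <= 0.75:
        return 3
    return 4

def partition(table, attrs):
    """Equivalence classes of samples that agree on every attribute in attrs."""
    blocks = defaultdict(set)
    for sample, row in table.items():
        blocks[tuple(row[a] for a in attrs)].add(sample)
    return list(blocks.values())

def positive_region(table, labels, attrs):
    """Samples lying in blocks whose members all share one class label."""
    region = set()
    for block in partition(table, attrs):
        if len({labels[s] for s in block}) == 1:
            region |= block
    return region

def significance(table, labels, attrs):
    """Ratio |POS| / |U| for the given attribute subset."""
    return len(positive_region(table, labels, attrs)) / len(table)

# Hypothetical binned table: rows are samples, values are bin indices.
toy_table = {
    "x1": {"s1": 1, "s2": 2},
    "x2": {"s1": 1, "s2": 3},
    "x3": {"s1": 2, "s2": 2},
    "x4": {"s1": 2, "s2": 3},
}
toy_labels = {"x1": "M", "x2": "B", "x3": "M", "x4": "M"}
psi_s1 = significance(toy_table, toy_labels, ["s1"])
psi_both = significance(toy_table, toy_labels, ["s1", "s2"])
```

In this toy case neither call alone separates all samples (`psi_s1` is 0.5), whereas the pair places every sample in a pure block (`psi_both` is 1.0).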

The selection of a suitable/relevant subset of system calls is explained with the steps listed in Algorithms 1 and 2. Specifically, in Algorithm 1, lines 8 to 13 generate a set consisting of apks with identical values of the conditional and decision attributes. Similarly, steps 14 to 18 create a set of apks having similar values of the conditional attributes but dissimilar decision attributes. Steps 19-22 append the elements of these two sets to their respective dictionaries. The cardinality of the first set is subsequently used to determine the significance/dependency value of each system call, as illustrated in line 29. The system call with the highest significance/dependency value is determined (steps 30 to 33) and returned to the procedure for generating reducts (i.e., Algorithm 2). In Algorithm 2, the procedure for computing reducts takes three parameters as input: a significant system call, the set of conditional attributes $C$, and the decision attributes $D$. Algorithm 2 starts with the most significant system call. Subsequently, it applies a forward approach that incrementally adds to the reduct the attribute with the highest significance value (refer to lines 8-10).
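Algorithms 1 and 2 are not reproduced in this section, so the following is only a sketch of a greedy forward search in their spirit, not the authors' implementation. It assumes a binned table as input, adds at each step the attribute that raises the dependency value the most, and stops when the dependency of the full attribute set is reached or no attribute improves it. The names `dependency` and `forward_reduct` and the toy data are hypothetical.

```python
from collections import defaultdict

def dependency(table, labels, attrs):
    """Fraction of samples in blocks (grouped by attrs) that contain one class only."""
    blocks = defaultdict(list)
    for sample, row in table.items():
        blocks[tuple(row[a] for a in attrs)].append(sample)
    certain = sum(
        len(block) for block in blocks.values()
        if len({labels[s] for s in block}) == 1
    )
    return certain / len(table)

def forward_reduct(table, labels, candidates):
    """Greedy forward selection; ties are broken by candidate order."""
    reduct, best = [], 0.0
    remaining = list(candidates)
    target = dependency(table, labels, remaining)  # dependency of the full set
    while remaining and best < target:
        scored = [(dependency(table, labels, reduct + [a]), a) for a in remaining]
        score, attr = max(scored, key=lambda pair: pair[0])
        if score <= best:  # simplification: stop when no single addition helps
            break
        reduct.append(attr)
        remaining.remove(attr)
        best = score
    return reduct, best

# Hypothetical binned table; s3 is constant and therefore carries no information.
toy_table = {
    "x1": {"s1": 1, "s2": 2, "s3": 1},
    "x2": {"s1": 1, "s2": 3, "s3": 1},
    "x3": {"s1": 2, "s2": 2, "s3": 1},
    "x4": {"s1": 2, "s2": 3, "s3": 1},
}
toy_labels = {"x1": "M", "x2": "B", "x3": "M", "x4": "M"}
selected, score = forward_reduct(toy_table, toy_labels, ["s1", "s2", "s3"])
```

On this toy input the search keeps `s1` and `s2` and never adds the constant attribute `s3`. A greedy search of this kind is not guaranteed to return a minimal reduct in general.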

Complexity Analysis: The time spent computing reducts depends on the number of feature vector comparisons and on all possible values of the vectors. In our case, the continuous values of elements are mapped to one of the four possible bins ($B_1$ to $B_4$). With two classes and four bins, we have 8 possible cases. Thus, maximum values of feature vector cannot be more than , which itself is a huge number. Moreover, a naive design of the algorithm would require as the worst case time complexity. However, in our implementation, feature vectors are represented as a binary tree. In a nutshell, in our solution, the overall worst case time complexity for comparing feature vectors is estimated as .

### 3.5 An Illustrative Example

Let us consider an example to illustrate the selection of relevant system calls using the proposed feature selection approach. To demonstrate the procedure, we make use of an example feature vector table as shown in Table 2. There are three system calls (i.e., $s_1$, $s_2$, $s_3$), seven applications (i.e., $x_1$-$x_7$) and a decision attribute $D$. The decision attribute takes one of two values, $M$ or $B$. In this example, each application is represented as a vector, and the elements of a vector are mapped to bins, i.e., $B_1$ to $B_4$ for a system call (see Table 2).

The procedure begins with an empty reduct set $R$. Each system call is selected and its significance is computed using Equation 5. The call with the highest significance value is selected and assigned to the reduct set $R$ (refer to Algorithms 1 and 2). In this example, system call $s_3$ is added to $R$.

 U/As1={{x1,x6B1},{x3,x4}B2,{x5,x7}B3},
 Ψs1=∣{x5,x7}∣∣{x1,x2,x3,x4,x5,x6,x7}∣=27=0.28,
 U/As2={{x2,x3B1},{x4,x5,x6,x7}B2,{x1}B4},
 Ψs2=∣{x1,x2,x3}∣∣{x1,x2,x3,x4,x5,x6,x7}∣=37=0.42,
 U/As3={{x1,x4B1},{x2,x3}B2,{x6,x7}B3,{x5}B4},
 Ψs3=∣{x1,x2,x3,x4,x5}∣∣{x1,x2,x3,x4,x5,x6,x7}∣=57=0.71,
 ∴R←{s3},

Subsequently, the other attributes ($s_1$ or $s_2$) are considered for addition to $R$ by evaluating the significance of each system call together with the attributes already in $R$ (i.e., $s_3$). Hence, a forward feature selection strategy is employed.

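$$
\begin{aligned}
U/A_{\{s_1, s_3\}} &= \{\{x_1\}, \{x_2, x_3\}, \{x_5\}, \{x_6\}, \{x_7\}\},\\
\Psi_{\{s_1, s_3\}} &= \frac{|\{x_1, x_2, x_3, x_4, x_5, x_6, x_7\}|}{|\{x_1, x_2, x_3, x_4, x_5, x_6, x_7\}|} = \frac{7}{7} = 1.0,\\
U/A_{\{s_2, s_3\}} &= \{\{x_2, x_3\}, \{x_1\}, \{x_4\}, \{x_5\}, \{x_6, x_7\}\},\\
\Psi_{\{s_2, s_3\}} &= \frac{|\{x_1, x_2, x_3, x_4, x_5\}|}{|\{x_1, x_2, x_3, x_4, x_5, x_6, x_7\}|} = \frac{5}{7} \approx 0.71,\\
\therefore R &\leftarrow \{s_1, s_3\}.
\end{aligned}
$$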

Finally, the set $R = \{s_1, s_3\}$ is taken as the final feature set and is eventually utilized to construct the classification model.

### 3.6 Classification Phase

After the construction of the feature set, our system creates classification models using three algorithms. In contrast to conventional signature-based scanners, machine learning-based models require fewer updates, because few malware samples are reported to form new families [43]. We employ classification algorithms commonly reported in the malware detection literature, namely Random forest [11], Rotation forest [12], and AdaBoost (with J48 as base classifier), as implemented in WEKA [44].

### 3.7 Evaluation Parameters

To evaluate the effectiveness of our proposed method, we use the classical quantities applied in machine learning:

• True Positive (TP): the number of malicious applications that are correctly identified.

• True Negative (TN): the number of benign instances that are correctly classified.

• False Positive (FP): the number of benign instances wrongly classified as malware.

• False Negative (FN): the number of malware instances wrongly classified as legitimate applications.

Using the above quantities, the following metrics measure the effectiveness of our proposed system:

• Accuracy (Acc): the number of applications that the classifier labels correctly, divided by the total number of malicious and legitimate applications.

$$\mathrm{Acc}=\frac{TP+TN}{TP+FN+TN+FP}. \tag{6}$$

• False Positive Rate (FPR): the number of legitimate applications misclassified as malware, divided by the total number of benign applications.

$$\mathrm{FPR}=\frac{FP}{FP+TN}. \tag{7}$$

• Area Under Curve (AUC): AUC combines the FPR and the true positive rate (TPR), where $\mathrm{TPR}=TP/(TP+FN)$ 45. In particular, AUC measures the tradeoff between TPR and FPR. It is especially useful when the data set is imbalanced (skewed sample distribution) and the model must not be over-fitted to the class with the larger number of instances. AUC lies between 0 and 1: a value of 1 means perfect prediction and a value greater than 0.5 is reasonable, whereas a value below 0.5 means that the decision of the classification model should be reversed.

$$\mathrm{AUC}=\frac{1}{2}\left(\frac{TP}{TP+FN}+\frac{TN}{TN+FP}\right). \tag{8}$$
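As a quick numerical check of these metrics, the sketch below computes them from a confusion matrix. The counts are hypothetical and the function name `confusion_metrics` is ours. The AUC term is the single-threshold form that averages the true positive rate TP/(TP+FN) and the true negative rate TN/(TN+FP); using the true positive rate in the first term is an assumption based on the statement that AUC combines TPR and FPR.

```python
def confusion_metrics(tp, tn, fp, fn):
    """Accuracy, FPR, TPR and single-threshold AUC from confusion counts."""
    acc = (tp + tn) / (tp + fn + tn + fp)
    fpr = fp / (fp + tn)
    tpr = tp / (tp + fn)
    tnr = tn / (tn + fp)  # equals 1 - FPR
    auc = 0.5 * (tpr + tnr)
    return {"Acc": acc, "FPR": fpr, "TPR": tpr, "AUC": auc}


# Hypothetical counts: 100 malware and 100 benign applications.
metrics = confusion_metrics(tp=90, tn=95, fp=5, fn=10)
```

With these counts the accuracy and the single-threshold AUC coincide because the two classes have the same size; on an imbalanced set they generally differ.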

## 4 Evaluation of Results

The experiments are conducted on a system with an Intel Core i7 2 GHz quad-core processor and 8 GB of internal memory. We evaluate each classification model with a 10-fold cross-validation procedure [40],[20] to obtain a model with improved generalization capability. The dataset is divided into ten equal subsets; in each fold, 90% of the samples are used to train the model and the remaining 10% are used as the test set. Extensive analyses are performed on features extracted using static and dynamic mechanisms. The following sections present the experiments and the analysis of the work.
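The fold construction can be illustrated with a short sketch. It is a plain, unstratified split written with the standard library only; the function name `k_fold_indices` and the fixed seed are hypothetical, and the sample count of 3500 is taken from the abstract. A stratified variant that preserves class proportions in every fold would need the class labels as an extra input.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train, test) index lists for k roughly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# 3500 samples: each fold trains on 3150 samples and tests on 350.
splits = list(k_fold_indices(3500, k=10))
```

Every sample appears in exactly one test fold, so the averaged metrics use each application once for testing.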

### 4.1 Performance obtained with System call attributes

The effectiveness of the classification system is evaluated under the following settings:

1. Outcome of classification obtained with prominent benign system calls

2. Performance of model developed using set of malicious system calls

3. Classifier results ascertained with subset of feature space derived using statistical test. Furthermore, the analysis is conducted using significant benign and malicious system call set.

#### 4.1.1 Evaluation on significant benign attributes

Each trusted application is executed in the emulator and, applying the procedure discussed in Section 3, we extract the system call names, which are treated as attributes. Irrelevant attributes are then removed using the feature selection approach discussed in Section 3.4. Classification models are generated from the filtered system calls, and their performance is evaluated with 10-fold cross-validation. We observe that Rotation forest and Random forest yield similar performance. Table 3 shows the weighted average of the metrics (refer to the last row): both Rotation forest and Random forest reach an AUC of 1.0, with accuracy in the range of 99.42-99.54% and FPR of 0.005 and 0.01, respectively.

In Table 3, we see that Random forest 11 with 80 system calls results in an AUC of 1.0 with an FPR of 0.003. The model created with Rotation forest 12 provides an AUC of 1.0 using 50 system calls. However, the best outcome with AdaBoost (an FPR of 0.009 and an AUC of 0.972) is obtained with 80 attributes.

#### 4.1.2 Performance on malware attributes

We observe from Table 4 that Random forest provides an AUC in the range of 0.99-1.0 with an FPR between 0.005 and 0.013 and an accuracy in the range of 99.344-99.787%. The weighted average of the evaluation metrics obtained for Random forest is better than that of Rotation forest and AdaBoostM1. Also, the highest accuracy of 99.782% is obtained with Random forest using 10 system calls, demonstrating its efficacy for constructing a malware detection model.

### 4.2 Evaluation of a set of system calls employing the large population test

Because smartphones have limited computing resources, a lightweight machine learning model is required on such devices. With this in mind, we apply a two-step feature selection approach. Initially, the system call set is derived using Rough set-based feature selection. Subsequently, these attributes are further pruned with a statistical test, namely the large population test; hence the method is named RSST. The two-sample large population test is used to estimate whether the population means differ 9. Specifically, we apply the test to determine the set of system calls with increased divergence across the target classes. To carry out this task, we consider the same attribute set supplied to the feature selection approach discussed in Section 3.4: the system call set used to build the previous malicious model is further filtered using the large population test, and the same procedure is repeated for the feature set extracted from benign applications. The statistical method prunes around 50% of the features. Significance is determined using a two-tailed test, with the null hypothesis (H0) and the alternative hypothesis (H1) defined as follows:

• Null Hypothesis (H0): The mean of a system call is the same in malware and benign applications.

• Alternate Hypothesis (H1): The mean of a system call differs significantly between the malware and benign sets.

The difference in the mean of each system call is computed between the malware and benign sets. The test statistic is computed using equation (9) and compared against the chosen significance level.

$$z=\frac{\bar{X}_{i}^{M}-\bar{X}_{i}^{B}}{\sqrt{\dfrac{(\sigma_{i}^{M})^{2}}{|M|}+\dfrac{(\sigma_{i}^{B})^{2}}{|B|}}}, \tag{9}$$

where $\bar{X}_{i}^{M}$ and $\bar{X}_{i}^{B}$ denote the means of system call $s_i$ in the malware and benign classes, respectively. Likewise, $\sigma_{i}^{M}$ and $\sigma_{i}^{B}$ are the standard deviations of system call $s_i$ in the two classes, and $|M|$ and $|B|$ are the numbers of malware and benign samples. The null hypothesis of the two-tailed test is rejected if and only if $z$ falls outside the critical values for the chosen significance level, indicating a significant difference in the mean of the system call across the target classes. On examining the outcome, 50% of the system calls, those with a small difference in means, are excluded. In other words, the attribute space is constructed from the system calls that pass the statistical test. This yields two feature lists: one consisting of calls predominantly found in malware samples, and another of calls dominant in legitimate instances. As in the previous experiments, the evaluation metrics are measured for a varying number of features, in increments of 10 system calls at a time. Overall, 92 attributes were observed to satisfy the rejection condition.
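The test can be illustrated with a small sketch that computes the two-sample z statistic for one system call. It uses sample variances in the standard error, as in the usual two-sample z statistic, and is not the authors' code. The per-application counts are hypothetical and far smaller than a real "large population", and the default critical value of 1.96 (two-tailed test at the 5% level) is an assumption chosen for illustration.

```python
import math
from statistics import mean, variance


def two_sample_z(malware_values, benign_values):
    """z statistic for the difference in means of one system call."""
    standard_error = math.sqrt(
        variance(malware_values) / len(malware_values)
        + variance(benign_values) / len(benign_values)
    )
    return (mean(malware_values) - mean(benign_values)) / standard_error


def passes_two_tailed(z, critical=1.96):
    """Reject H0 when |z| exceeds the critical value (assumed 5% level)."""
    return abs(z) > critical


# Hypothetical invocation counts of one system call per application.
malware_counts = [4, 6, 8, 10]
benign_counts = [1, 2, 3, 4]

z = two_sample_z(malware_counts, benign_counts)
keep_call = passes_two_tailed(z)
```

A call is retained when `passes_two_tailed` returns `True`; the sign of `z` then tells whether the call is more frequent in the malware or in the benign set.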

#### 4.2.1 RSST-based Benign System calls

We observe from Table 5 that Random forest provides an AUC of 1.0 with a false positive rate of 0.001 at a feature length of 30. Rotation forest likewise reaches an AUC of 1.0 with 30 system calls. AdaBoost again shows an FPR of 0.009 and an AUC of 0.969 with 38 features (those found to pass the statistical test). The average evaluation metrics show similar trends for Random forest and Rotation forest. An important aspect to notice is the improvement in the performance of AdaBoost compared to the previous experiments discussed in Sections 4.1.1 and 4.1.2.

#### 4.2.2 RSST-based Malware System calls

In Table 6, we observe that the average evaluation metrics obtained with Random forest and Rotation forest are approximately similar to those of the previous experiments. Random forest results in an AUC of 1.0 with a false positive rate of 0.001 at a feature length of 30, while Rotation forest gives an AUC of 1.0 with 20 system calls. AdaBoost yields average evaluation metrics comparable to the results reported in Sections 4.1.1 and 4.1.2, interestingly with a better FPR.

Figure 2 shows the z-score values of the prominent system calls in the system call space. These calls satisfy the alternative hypothesis, indicating a substantial difference between the feature vectors of the target classes.

## 5 Comparative Analysis with Static and Dynamic Features

To validate the efficiency of the system call feature set for identifying malicious instances, we conducted a further series of experiments using diverse attributes extracted with static and dynamic analysis. The framework of our experiment is shown in Fig. 3. Here, we also report the F-measure along with the metrics used in the previous experiments. The F-measure (or F-score) is the harmonic mean of precision and recall: few false positives give high precision, and few false negatives give high recall. The F-measure reaches its best value at 1 and its worst at 0. Most classification problems involve a trade-off between precision and recall. If one of the two is favoured at the expense of the other, the harmonic mean decreases quickly; for a given average of precision and recall, the F-measure is greatest when both are equal.

Feature sets based on static analysis are formed by reverse engineering AndroidManifest.xml and the collection of smali files. In particular, attributes such as permissions, hardware components, app components, opcodes, and methods are considered. Additionally, features are collected by executing the apks. Specifically, we derive attributes from .pcap files (i.e., network-based features), system call graphs and information filtered using DroidBox. The following sections introduce these variables and the performance achieved by classification models that incorporate them.
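The behaviour of the harmonic mean described above is easy to verify numerically. The following sketch uses hypothetical confusion counts, and the function name `precision_recall_f1` is ours.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and their harmonic mean (F-measure)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Balanced case: precision and recall are both 0.9.
balanced = precision_recall_f1(tp=90, fp=10, fn=10)

# Skewed case: perfect precision but only half of the malware is found.
skewed = precision_recall_f1(tp=50, fp=0, fn=50)
```

In the skewed case the arithmetic mean of precision and recall is 0.75, yet the F-measure drops to about 0.67, which shows how the harmonic mean penalizes favouring one quantity over the other.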

### 5.1 Evaluation on Static Features

Static features are extracted from AndroidManifest.xml and the smali code of each application. As in the previous experiments, each apk is transformed into a vector representation, and the vectors are used to create a feature occurrence matrix. Prominent attributes are then derived by applying the feature selection approach discussed earlier. The classification models are evaluated and the results are shown in Table 7.

The tabulated results demonstrate better performance for permissions compared to the other static attributes. An application is not installed until the user accepts all requested permissions. Developers may sometimes declare permissions that are not actually needed by an apk. Such applications are over-privileged and expose devices to threats. The top 5 permissions demanded by malicious applications are INTERNET, READ_PHONE_STATE, WRITE_EXTERNAL_STORAGE, READ_SMS and WRITE_SMS.

A machine learning system based on permissions can be defeated by applications that initially request fewer permissions at installation time. In particular, an adversary may craft malicious applications whose statistical distribution of permissions matches that of the benign dataset, and the application may then demand additional permissions during execution. Under this scenario, the developed models will yield a higher misclassification rate. Studies in 46 report exfiltration of sensitive data from devices by apps demanding zero permissions during installation, while the authors of 47 illustrate that a zero-permission app could be used to infer a user's location and traveled routes using the accelerometer, magnetometer and gyroscope.

External storage such as the SD card contains sensitive data such as pictures, videos, configuration files, and backup documents. Generally, applications have read-only access to the SD card, which allows an attacker to fetch the list of stored files. Alternatively, an adversary can query /data/system/packages.list to find the list of installed applications and subsequently determine exploits to compromise the smartphone. Additionally, basic device information such as the kernel version, device ID, and custom ROM can be obtained through access to the /proc/version file.

Table 7 shows the performance obtained by considering permissions. Extracted permissions are represented as binary vectors, where the presence of a permission is denoted by 1 and its absence by 0. With 100 significant permissions, an F1-measure of 0.952 with an FPR of 0.074 is obtained. Permissions are ineffective for identifying malicious samples. From the end user's point of view, the list of permissions is generally viewed as a license agreement: it does not convey the context of risk, nor does it indicate how hazardous the installed application is. Besides, some permissions are used so frequently by many applications that users do not pay attention to them.
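The binary representation can be sketched as follows. The vocabulary below reuses the five permissions named earlier in this section; the two example applications and the function name `permission_vector` are hypothetical.

```python
def permission_vector(requested, vocabulary):
    """Binary vector: 1 if the permission is requested, 0 otherwise."""
    requested = set(requested)
    return [1 if permission in requested else 0 for permission in vocabulary]


vocabulary = [
    "INTERNET",
    "READ_PHONE_STATE",
    "WRITE_EXTERNAL_STORAGE",
    "READ_SMS",
    "WRITE_SMS",
]

# Hypothetical applications and the permissions they declare.
app_a = permission_vector(["INTERNET", "READ_SMS", "WRITE_SMS"], vocabulary)
app_b = permission_vector(["INTERNET"], vocabulary)
```

Because every position carries only presence or absence, an application that declares a permission without using it is indistinguishable from one that relies on it, which is one reason such vectors are easy to manipulate.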

Furthermore, analysis is performed by extracting instructions from the smali code. Generally, an instruction is composed of a mnemonic and a list of operands. To create the feature set, we consider the mnemonics and neglect the operands. Prominent opcodes are selected with feature selection, and two models are constructed: (a) one with relevant benign opcodes, and (b) another using malware opcodes. An F1-measure of 0.98 was obtained with 400 malware opcodes (see Table 7). We argue that opcodes and n-grams of opcodes cannot be effective in detecting unknown malware samples, as they can be easily obfuscated. Specifically, trivial obfuscation methods such as renaming of classes, methods and identifiers can thwart detection. Consequently, scanners based on statistical signatures will identify new samples imprecisely.

As an extension to the experiments, we extract APIs from malware and trusted applications. An F1-measure of 0.908 is obtained with APIs (see Table 7). We thus argue that APIs are weak attributes, as their performance is inferior to that of permissions and opcodes. Present-day malware employs reflection, so that applications refer to malicious code and libraries only during execution. Malicious intentions are invisible to static analysis, as the sets of APIs in the malware and benign sets appear identical. Since the distribution of attributes across feature vectors is likely to be uniform in the target classes, the classifier assigns incorrect labels to the apks.

### 5.2 Evaluation on Dynamic Features

To evaluate our scheme, we conducted experiments by executing the applications. Further feature sets are created using network traces, DroidBox information and system call graphs. The performance obtained with these features is compared with that of the system call set derived by applying the two-step feature selection approach (i.e., the feature set created by applying rough set followed by the large population test). The attributes and classification results are discussed in the following subsections.

#### 5.2.1 DroidBox attributes

DroidBox consists of two modules, the Host and the Target. The Target inherits the functionality of TaintDroid, a dynamic taint analysis tool. It is launched from an emulator, which monitors data at a low level. The Host is a collection of Python scripts; it connects to the emulator and receives information from the Target about the application being monitored. Finally, the outcome of the analysis is displayed in graphical or textual format. A few important sections retrieved from DroidBox are listed below:

1. accessedFiles: a list of files accessed by the application.

2. cryptousage: operations associated with the Android crypto API.

3. dataleaks: information about leaks of the user's data.

4. fdaccess: read/write operations performed by the application on files.

5. opennet/closenet: opening or closing of a socket.

6. recvnet/sendnet: data received or transmitted via the network.

7. sendsms/phonecall: SMS sent or call placed to a specific number.

The DroidBox tracing file is a record of actions in JSON format. Almost all sections in the JSON file have the following format.

    "Section name": {
        "Time of operation": {
            "Parameter (e.g., for accessedFiles, path of the accessed file)"
        }
    }


Consider the example shown below: the application being monitored repeatedly accessed the same files, i.e., abc.png and images/12345/54321-example.jpg, which may indicate suspicious activity.

    accessedFiles: {
        images/12345/54321-example.jpg,
        images/12345/54321-example.jpg,
        ...
    }


Using each section and its associated fields/parameters, we represent each app as a vector of integers (i.e., the number of occurrences of each operation together with its parameters). Table 8 depicts the performance obtained with DroidBox.
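One way to turn such a trace into an integer vector is sketched below. The JSON snippet is hypothetical and only mimics the section, time, parameter layout described above; real DroidBox output has richer per-event fields. The function names `operation_counts` and `to_vector` are ours, and counting each (section, parameter) pair is an assumption about how occurrences are tallied.

```python
import json
from collections import Counter

# Hypothetical trace following the section -> time -> parameter layout.
raw = """
{
  "accessedFiles": {"0.10": "abc.png", "0.35": "abc.png", "0.80": "config.xml"},
  "sendsms": {"1.20": "12345"}
}
"""


def operation_counts(trace):
    """Count how often each (section, parameter) pair occurs in a trace."""
    counts = Counter()
    for section, events in trace.items():
        for parameter in events.values():
            counts[(section, str(parameter))] += 1
    return counts


def to_vector(counts, vocabulary):
    """Order the counts by a fixed vocabulary of (section, parameter) pairs."""
    return [counts.get(key, 0) for key in vocabulary]


counts = operation_counts(json.loads(raw))
vocabulary = sorted(counts)  # in practice, built over all applications
vector = to_vector(counts, vocabulary)
```

The repeated access to `abc.png` shows up as a count of 2 in the first position of the vector.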

#### 5.2.2 Network trace

Network traffic is captured using tcpdump 48 after installing the application in the emulator. As in the experimental setting used for system call analysis, Android Monkey is used to interact with the application. The network traffic is recorded until a fixed number of random events has been generated. Finally, the network trace is stored in a .pcap file. The basic structure of the tcpdump output is shown below:

    {timestamp} {network protocol} {source ip}.{source port} {dest ip}.{dest port}

Subsequently, we extract six features and build a classification model. An AUC of 0.994 with an FPR of 0.033 (refer to Table 8) is obtained using the six features listed below:

1. Raw traffic size (RS): the total packet size measured over the time frame of analysis.

2. Number of packets (PN): the count of packets in a pcap file.

3. Average length of packet (AL): the average length of packets in each pcap file.

4. Outbound packets in a file (ON): the number of packets transmitted from a source IP to other hosts.

5. Inbound packets in a file (IN): the number of packets received by a source IP.

6. Total outbound and inbound packets (OIN): the sum of inbound and outbound packets.
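The six features listed above can be computed from already parsed packet records, as in the following sketch. The record layout `(source ip, destination ip, length)`, the function name `traffic_features`, and the sample addresses are hypothetical; parsing the .pcap file itself is outside the scope of this example.

```python
def traffic_features(packets, device_ip):
    """Compute RS, PN, AL, ON, IN and OIN from (src, dst, length) records."""
    sizes = [length for _, _, length in packets]
    outbound = sum(1 for src, _, _ in packets if src == device_ip)
    inbound = sum(1 for _, dst, _ in packets if dst == device_ip)
    total_size = sum(sizes)
    count = len(packets)
    return {
        "RS": total_size,
        "PN": count,
        "AL": total_size / count if count else 0.0,
        "ON": outbound,
        "IN": inbound,
        "OIN": outbound + inbound,
    }


# Hypothetical capture: the emulated device talks to one remote host.
device = "10.0.2.15"
remote = "93.184.216.34"
packets = [
    (device, remote, 100),
    (remote, device, 300),
    (device, remote, 60),
    (remote, device, 140),
]

features = traffic_features(packets, device)
```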

#### 5.2.3 System call graph

To ascertain the relationships among the logged system calls, a directed graph $G=(V,E)$ is constructed, where $V$ is the set of vertices and $E$ is the set of edges. The graph is represented as an adjacency matrix. For each pair of extracted system calls, an edge between the corresponding vertices is created, and the weight of the edge is incremented for each appearance of that pair. Subsequently, we compute the in-degree, the out-degree, and the standard deviations of in-degree and out-degree, which are used as features for building the classification model. Table 8 shows the classifier outcome with graph-based features. We noticed that the detection performance of graph-based features is better than that of attributes obtained from network traces and DroidBox. This indicates that the interactions of an application with the operating system (through system calls) are a strong candidate for developing a malware detection system.
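A minimal sketch of the graph construction is given below. It assumes that a "pair" means two consecutive calls in the trace and that degrees are weighted by edge counts; both are assumptions, as are the function name `call_graph_features` and the toy trace.

```python
from statistics import pstdev


def call_graph_features(trace):
    """Weighted adjacency matrix and degree statistics of a call trace."""
    calls = sorted(set(trace))
    index = {call: i for i, call in enumerate(calls)}
    n = len(calls)
    adjacency = [[0] * n for _ in range(n)]
    for current, following in zip(trace, trace[1:]):
        adjacency[index[current]][index[following]] += 1
    out_degree = [sum(row) for row in adjacency]
    in_degree = [sum(adjacency[i][j] for i in range(n)) for j in range(n)]
    return {
        "calls": calls,
        "adjacency": adjacency,
        "in_degree": in_degree,
        "out_degree": out_degree,
        "in_std": pstdev(in_degree),
        "out_std": pstdev(out_degree),
    }


# Hypothetical trace of system call names in order of invocation.
trace = ["open", "read", "read", "close", "open", "read"]
graph = call_graph_features(trace)
```

The rows and columns of the adjacency matrix follow the alphabetical order stored in `graph["calls"]`, so the entry in the `open` row and `read` column counts how often `read` directly follows `open`.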

## 6 Performance with Conventional Feature Selection Approach

In this section, we briefly introduce well-known feature selection algorithms used in the domain of malware analysis. These algorithms select a subset of system calls that typically yields an improved classification rate. In particular, the feature selection algorithms Information Gain (IG), Chi-Square (CHI), Symmetric Uncertainty (SU), Correlation-based Filter (CFS) and Wrapper Subset Evaluator (WSE) are used to select relevant attributes before developing the classification model. We use the open source implementations of these algorithms included in WEKA. The detection performance of our proposed two-step feature selection model is then compared with that of models developed from these feature selection algorithms.

Information Gain (IG) is based on information theory. The algorithm calculates the amount of information carried by a system call $s$. $IG(s)$ involves computing the entropy of the class and subtracting the conditional entropy of $s$ after observing the class, i.e., $H(s\mid C)$. Hence, for a classification system, $IG(s)$ is expressed using equations (10)-(12).

$$IG(s)=H(C)-H(s\mid C), \tag{10}$$

$$H(s)=-\sum_{s\in A}p(s)\log_2\big(p(s)\big), \tag{11}$$

$$H(s\mid C)=-\sum_{\{M,B\}\in C}p(C)\sum_{s\in A}p(C\mid s)\log_2\big(p(C\mid s)\big), \tag{12}$$

Finally, the IG values of all system calls are arranged in descending order, and the top-ranked system calls are used for modeling.

Chi-Square (CHI) feature selection tests the independence of two events. In our case, the two events are the occurrence of a system call and the presence of the class. Precisely, we want to evaluate whether the occurrence of a system call and the class are independent. Our aim is to determine the set of calls whose presence is highly dependent on the class. The importance of a system call is calculated using equation (13).

The quantities involved in equation (13) are the total number of apks, the numbers of malware and benign applications containing the system call, and the numbers of malware and benign instances without it. As with IG, system calls are sorted by their chi-square value, and the top-ranked system calls are selected for developing the classification model.

Symmetric Uncertainty (SU) compensates for the bias of IG towards attributes with more values and normalizes it to the range [0,1]. It measures the information contained jointly in the variables $A$ and $C$ relative to the information contained independently in $A$ and $C$. A value of 1 indicates that knowledge of one variable determines the other, i.e., the two variables are highly correlated, whereas a value of 0 signifies that the variables are independent. Symmetric Uncertainty is defined as:

$$SU(A,C)=\frac{2\,IG(A\mid C)}{H(A)+H(C)}. \tag{14}$$

System calls with higher correspondence to the class are used for developing classification models.

CfsSubsetEval is a correlation-based feature selection approach. The algorithm selects predominant system calls based on two aspects: (i) the correlation between an attribute and the class must be high, which assures the relevance of the system call to the class, and (ii) the system calls obtained from the previous step must not be highly correlated with each other (higher correlation means larger redundancy). In other words, calls are effective if their correlation with the class is large, and redundant calls are discarded.

Wrapper Subset Evaluator (WSE) searches for attributes in conjunction with a given classifier, so a search mechanism is needed to find the subset of calls. We employed two well-known search approaches, namely Genetic search (GS) and Breadth First Search (BFS).
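The three ranking scores can be compared on a single system call with the sketch below. It treats the call as a binary attribute (present or absent in an application) and uses the textbook definitions: information gain as $H(C)-H(C\mid s)$, the chi-square statistic of a 2x2 contingency table, and symmetric uncertainty as twice the information gain over the sum of the two entropies. These definitions, the function names and the counts are assumptions for illustration and are not taken from the WEKA source.

```python
import math


def entropy(counts):
    """Shannon entropy (base 2) of a list of non-negative counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)


def rank_scores(mal_with, ben_with, mal_without, ben_without):
    """IG, chi-square and SU of one binary system call attribute."""
    n = mal_with + ben_with + mal_without + ben_without
    with_call = mal_with + ben_with
    without_call = mal_without + ben_without
    malware = mal_with + mal_without
    benign = ben_with + ben_without

    h_class = entropy([malware, benign])
    h_call = entropy([with_call, without_call])
    h_class_given_call = (
        with_call / n * entropy([mal_with, ben_with])
        + without_call / n * entropy([mal_without, ben_without])
    )
    ig = h_class - h_class_given_call

    denom = with_call * without_call * malware * benign
    cross = mal_with * ben_without - ben_with * mal_without
    chi2 = n * cross ** 2 / denom if denom else 0.0

    su = 2 * ig / (h_call + h_class) if (h_call + h_class) > 0 else 0.0
    return {"IG": ig, "CHI": chi2, "SU": su}


# Hypothetical counts: a call seen only in malware, and an uninformative call.
discriminative = rank_scores(50, 0, 0, 50)
uninformative = rank_scores(25, 25, 25, 25)
```

A call that occurs only in malware receives the maximum IG and SU of 1, while a call that is equally frequent in both classes scores 0 on all three measures.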

After applying the aforementioned feature selection methods, lists of relevant system calls are obtained. These sets of system calls constitute our feature sets. Classification performance is determined by varying the number of features. Higher accuracy and AUC are obtained with our proposed feature selector than with IG, CHI, SU, CFS, and WSE (with both BFS and GS search techniques). Figures 3(a)-3(c) show the achieved outcomes. In particular, the accuracies obtained with the conventional approaches are between 89% and 92%, the AUC is in the range of 0.96-0.97 and the FPR lies between 0.08 and 0.11. Since the WEKA implementations of CFS and WSE return a single subset of system calls, we estimated the performance of the models on these subsets. CFS reported an accuracy of 90.28%, with an FPR of 0.0971 and an AUC of 0.966 at 14 attributes. WSE (BFS search) resulted in 79.16% accuracy with an FPR of 0.241 and an AUC of 0.832 with two significant calls. Additionally, using WSE (GS search) we obtained 84.4% accuracy with an FPR of 0.157 and an AUC of 0.918.

We note that the classification models created from conventional feature selection approaches do not perform better. Hence, a performance assessment of the combined system call set is undertaken. The combined feature set collects the attributes from the individual classification models and is expected to produce improved results. In particular, the sets of 16, 17, 30, 50 and 50 significant calls filtered with WSE (GS), CFS, IG, CHI, and SU are grouped together, which yields 54 unique calls. This attribute set is created to retain system calls with reasonable predictive capability from several models in order to achieve increased detection. The best outcomes of this experiment are 91.5% accuracy, with an FPR of 0.085 and an AUC of 0.972. We thus conclude that the combined feature space does not improve detection compared with the selectors considered independently. To validate this, the feature set was inspected, and we noticed that it was augmented with irrelevant calls. Hence, a comprehensive approach for feature fusion 49 needs to be investigated; this is not within the scope of the present study and will be considered in future experiments.

## 7 Discussion

In this section, we discuss the important conclusions drawn from the investigations conducted across multiple experiments. The inferences are summarized as follows:

1. The detection performance reported with feature sets comprising system calls, call graph attributes and network traces is better than that obtained with static features. An app may utilize reflection and native code 50, 51 to make its real program logic undetectable by static analysis. One of the important conclusions drawn from the extensive experiments is that the behaviour of malicious and benign samples is appropriately represented by attributes derived from dynamic analysis. Static analysis is not resilient to typical obfuscation techniques 52. Statistics from 53 report that 43% of Google Play Store apps are obfuscated, and that 73% of third-party market apps and 63.5% of malicious apps use identifier renaming for obfuscation. Moreover, the same study demonstrated that malware authors employ string encryption to hide the true intentions of the malicious code, which is rarely observed in legitimate applications. Besides, source code can be conveniently altered using ProGuard 54, an obfuscation tool. Additionally, solutions based on static control flow analysis can be defeated by adopting DashO 55, a Java and Android obfuscator. Thus, a machine learning based solution that depends on static features is expected to give a higher misclassification rate. Studies in 53 state that trusted applications are obfuscated just as their malicious counterparts are, in order to optimize the bytecode and protect benign apks against code reversing attacks. The aforesaid obfuscation techniques do not affect the performance of machine learning approaches that utilize system calls and features derived from system call graphs.

2. It is evident from Fig. 5 that the average values of the evaluation metrics obtained with the proposed feature selection method, which applies a statistical test after rough set, are better than those of commonly employed approaches. The proposed attribute selector first determines a subset of system calls with high class dependency (significance). Subsequently, the large population test filters out irrelevant system calls, keeping those with a significant mean difference across the classes and thus generating a reduced system call set.

3. Our exhaustive experiments demonstrate that a feature set comprising few system calls is not representative of the target classes. In other words, too few attributes are incapable of exhibiting separation (or numerical variance) between feature vectors, so the detection rate is low. This conclusion can be clearly drawn from Fig. 4, especially up to 30 system calls, where the accuracy and AUC are small and the FPR is large. Adding calls to the attribute set increases the difference between feature vectors, and the prediction capability of the classifier improves; this is visible for feature sets comprising 30-50 system calls. However, augmenting the feature set with additional calls makes the classifier learn attributes that add noise. Hence, the performance remains steady and then gradually tends to decrease. This trend can be perceived for feature lengths beyond 50 attributes. The decrease in performance is primarily due to the increased variability across the samples, which results in wrong predictions.

4. It has been observed that, like any security solution, machine learning based malware detectors can be bypassed by adversarial examples. In general, adversarial samples are crafted by adding small perturbations so that they are misclassified by the learned models. In particular, the adversarial samples are crafted to match the distribution of attributes in the malicious and legitimate sets. Classification models developed with permissions and APIs are normally learned from applications represented with binary values, where the presence of a feature is indicated by one and its absence by zero. The perturbations are induced by changing the values of some permissions/APIs from zero to one. These extraneous attributes are included to evade anti-malware systems. Specifically, permission-based detectors can be easily bypassed by augmenting AndroidManifest.xml files with extra permissions.

Machine learning models created with system calls are difficult to bypass with adversarial examples. To do so, malware developers must learn the statistical difference of each system call from a large surrogate data set. The source code of each app must then be modified to include functionality that invokes calls so as to match the distribution of calls in the benign set. However, this is difficult to achieve, as it would require malicious apps to consume more execution time. Besides, such samples can be identified without substantial effort by employing trivial heuristics such as (a) battery used; (b) percentage of CPU utilized; (c) amount of free memory available; and (d) invocations of extraneous system calls (e.g., getting the current date/time or the list of files/threads/processes).

5. We estimated the time required to extract static features from AndroidManifest.xml and smali code. The total time spent extracting manifest features (permissions, services, activities, hardware, etc.) is 1073 seconds, and the average time per sample is estimated to be 0.3256 seconds. The time spent gathering APIs and opcodes is 3229 seconds, so that the average time for a single instance is 0.922 seconds. Since dynamic analysis involves executing an application in an emulated environment, the average time required to extract system calls is approximately 3 minutes. The time spent constructing the classification model and generating results with cross-validation is between 0.425 and 0.577 seconds for all feature selectors considered in this paper. The minimum time is obtained for the model developed with CfsSubsetEval, and the maximum time of 0.577 seconds is required for models created using RSST and learned with malware system calls.

6. Looking at the performance of the classification algorithms in terms of both sample identification and running time, we conclude that Random forest and Rotation forest are both effective in detecting malware samples with high classification accuracy, i.e., 99.9%. In contrast, experiments with AdaBoost resulted in a best accuracy of 90.19% on average, which is far less than that of the other two algorithms. Essentially, Random forest and Rotation forest assign the class label of a sample using majority voting. On examining the evaluation metrics, we conclude that a fair selection between Random forest and Rotation forest cannot be made. However, we preferred Random forest over Rotation forest, as the time required to build the classification model with Random forest was observed to be less than with Rotation forest (refer to Figure 6). Furthermore, the running time of Rotation forest increases with the size of the attribute set; similar trends were also observed by the author of 56.

## 8 Comparison with Prior Works

In this section, we compare the results of our proposed work with earlier studies. The authors of 57 proposed an Android malware detector known as M0Droid, consisting of two modules: a client agent and a server analyzer. The agent executes in the background and submits the application to the server, which executes it in an emulator. During execution, each apk is subjected to random events generated by Android Monkey, and the resulting system calls are recorded. A signature is then created for each application, represented as a sequence of pairs of system call identifier and occurrence. Unknown instances are detected by comparing them with the known signatures in the repository using the Spearman correlation coefficient. The experimental study demonstrated a 60.16% detection rate with 39.43% FPR on 200 applications.

Xiao et al. 58 considered 196 system calls. To achieve improved classification accuracy, a back-propagation neural network on Markov chains built from system call sequences was considered. Each system call was treated as a state, thus yielding 196 states. Android Monkey was used to generate 1000 pseudo-random events, and strace was used to record the system call events. The analysis was performed to determine the optimal network structure (i.e., depth, hidden layer and learning parameter) to be deployed for identifying unseen malware instances. Thus, three-layer and four-layer neural networks were assessed. Three hidden layer with 37 nodes resulted in highest of 0.980 at a of 0.974 and of 0.0134. However, with a four-layer network, better performance was achieved when the size of the first hidden layer was between 450 and 650. The highest of 0.982 at of 0.977 and a value of 0.013 was reported. Moreover, a better classification outcome was obtained with a learning rate of 0.7. The methodology suffers from certain limitations: (a) the number of system call entries increases with varying kernel versions, so a classification model with fewer discriminative calls must be determined; otherwise, noisy attributes might result in over-parameterization, leading to poor generalization; and (b) the larger the number of system calls, the more time-consuming training becomes.
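
The idea of treating each system call as a Markov state can be sketched as follows. The code estimates first-order transition probabilities from a single trace and flattens the matrix into a feature vector; the four-call state set and the trace are hypothetical, and using the flattened matrix as the network input is our assumption about how such a model could be fed, not a description of the implementation in 58.

```python
def transition_matrix(sequence, states):
    """Estimate first-order Markov transition probabilities from a call trace."""
    index = {s: i for i, s in enumerate(states)}
    n = len(states)
    counts = [[0] * n for _ in range(n)]
    # Count each observed transition prev -> curr.
    for prev, curr in zip(sequence, sequence[1:]):
        counts[index[prev]][index[curr]] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        # Rows for states that never occur as a source stay all zero.
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix

def flatten(matrix):
    """Concatenate the rows into one feature vector."""
    return [p for row in matrix for p in row]

# Hypothetical state set and trace (a real setting would use all 196 calls).
states = ["open", "read", "write", "close"]
trace = ["open", "read", "read", "write", "close", "open", "read"]
matrix = transition_matrix(trace, states)
features = flatten(matrix)
```

With 196 states, the flattened vector would have 196 × 196 = 38416 entries, which illustrates why limitation (b) above grows quickly with the number of system calls.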

In 59, system call sequences were considered, and insignificant calls were eliminated by determining the relative class difference for each system call. The execution trace of an application was represented using boolean values. The study also estimated the mutual information of each system call with respect to the class labels to ascertain the calls representative of a target class. Subsequently, feature-to-feature correlation was also estimated, which was reported to deliver poor results. A malware detection accuracy of 97% was obtained with this approach. The principal limitation we observed in the feature selection approach of that paper is the absence of any association between class weight and the occurrence of a system call.
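
As an illustration of the mutual information criterion, the sketch below scores a boolean system call feature against class labels using the standard discrete definition. The sample data are hypothetical, and the code is not taken from 59.

```python
from collections import Counter
from math import log2

def mutual_information(feature, labels):
    """Mutual information (in bits) between a discrete feature and class labels."""
    n = len(labels)
    joint = Counter(zip(feature, labels))
    px = Counter(feature)
    py = Counter(labels)
    mi = 0.0
    for (x, y), count in joint.items():
        pxy = count / n
        mi += pxy * log2(pxy / ((px[x] / n) * (py[y] / n)))
    return mi

# Hypothetical data: 1 means the call appears in the sample's trace.
labels = ["malware"] * 4 + ["benign"] * 4
discriminative_call = [1, 1, 1, 1, 0, 0, 0, 0]  # present only in malware
uninformative_call = [1, 0, 1, 0, 1, 0, 1, 0]   # equally present in both classes
mi_high = mutual_information(discriminative_call, labels)
mi_low = mutual_information(uninformative_call, labels)
```

A call that appears only in malware traces attains the maximum score for a balanced two-class problem (1 bit), whereas a call spread evenly over both classes scores zero and would be a candidate for elimination.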

In 60, the authors proposed the Feature Extraction and Selection Tool (FEST) for malware detection. The feature extraction module was designed to extract permissions and APIs. Subsequently, prominent attributes were picked using the proposed feature selection algorithm, FrequenSel. The study reported an accuracy of 98% with a 2% false positive rate. We implemented a similar version of the feature selection algorithm (i.e., FrequenSel) on the set of system calls extracted from our dataset, and obtained an accuracy of 90.43% with 8.9% FPR. Also, with 25 and 6 significant system calls independently extracted from the benign and malware datasets, accuracies of 89.916% and 89.635% were obtained, with 9% and 8.9% FPR, respectively.

## 9 Limitations and Future Direction

Our approach carries the general weaknesses of supervised learning models. The proposed method uses training data to build the model, under the assumption that the testing data to which the model will be applied is drawn from the same population as the training data. This assumption does not hold in reality, since malicious applications may evolve. Hence, the model needs to be updated with new training data, including new benign and malicious applications. As an extension of our work, we will consider system call sequences and call graphs to construct semantic features. These attributes will be used to classify malware instances into different families. To this end, we may consider creating an iterative classification system. In this scheme, we plan to develop a multi-tier classification approach. Instances will initially be classified using a set of classifiers and feature selection methods. Misclassified samples from previous layers will be passed to subsequent layers employing diverse classifiers and feature selectors. The features from each layer will then be combined to generate the final attribute set, which will be used for modeling and prediction. We would also like to extend the aforementioned approach by employing deep learning methods.

Finally, we intend to extend our study by evaluating the approach on multiple datasets. Besides, we plan to investigate the robustness of the classification system by incorporating an attribute fusion approach that couples call sequences with other features derived from the source code of applications.

## 10 Conclusions

In this paper, we propose a two-step feature selection approach utilizing the predictive capabilities of rough sets and a statistical test for determining relevant system calls. We observed that with 30 significant system calls we could separate malware and goodware (i.e., non-malicious applications) with 99.9% accuracy, AUC of 1.0, and 1% FPR. A comprehensive comparison of the proposed feature selector with conventional feature selection methods such as Information Gain, Symmetric Uncertainty, ChiSquare, and CFsSubsetEval was performed. The results demonstrate that the proposed feature selection algorithm outperformed the traditional techniques. Exhaustive experiments with static attributes derived from manifest files and smali code, and with features obtained using dynamic analysis, including network traces, call graph attributes and Droidbox information, suggest that the feature set comprising system calls exhibited better performance.

## Acknowledgments

This work is also partially supported by the grant n. 2017-166478 (3696) from Cisco University Research Program Fund and Silicon Valley Community Foundation, and by the grant "Scalable IoT Management and Key security aspects in 5G systems" from Intel. Moreover, the work is supported by the project “Adaptive Failure and QoS-aware Controller over Cloud Data Center to Preserve Robustness and Integrity of the Incoming Traffic” funded by the University of Padua, Italy.

## References

• 1 Smart phone sale. http://www.statista.com/topics/840/smartphones/ [Date last accessed May 2018].
• 2 Sophos Website. http://www.sophos.com/en-us/security-news-trends/whitepapers.aspx [Date last accessed May 2018].
• 3 Faruki Parvez, Bharmal Ammar, Laxmi Vijay, et al. Android security: a survey of issues, malware penetration, and defenses. IEEE communications surveys & tutorials. 2015;17(2):998–1022.
• 4 Amos Brandon, Turner Hamilton, White Jules. Applying machine learning classifiers to dynamic android malware detection at scale. In: pp. 1666–1671. IEEE; 2013.
• 5 Gardiner Joseph, Nagaraja Shishir. On the Security of Machine Learning in Malware C&C Detection: A Survey. ACM Comput. Surv. 2016;49(3):59:1–59:39.
• 6 Android Monkey. http://developer.android.com/studio/test/monkey.html [Date last accessed May 2018].
• 7 Zhang Mi, Yao JT. A rough sets based approach to feature selection. In: pp. 434–439. IEEE; 2004.
• 8 Świniarski Roman W. Rough sets methods in feature reduction and classification. International Journal of Applied Mathematics and Computer Science. 2001;11:565–582.
• 9 Massey Adam, Miller Steven J. Tests of hypotheses using statistics. Mathematics Department, Brown University, Providence, RI. 2006;2912.
• 10 Freund Yoav, Schapire Robert, Abe Naoki. A short introduction to boosting. Journal-Japanese Society For Artificial Intelligence. 1999;14(771-780):1612.
• 11 Breiman Leo. Random forests. Machine learning. 2001;45(1):5–32.
• 12 Kuncheva Ludmila I, Rodríguez Juan J. An experimental study on rotation forest ensembles. In: pp. 459–468. Springer; 2007.
• 13 Droidbox. http://github.com/pjlantz/droidbox [Date last accessed May 2018].
• 14 Arp Daniel, Spreitzenbarth Michael, Hubner Malte, Gascon Hugo, Rieck Konrad, Siemens CERT. DREBIN: Effective and Explainable Detection of Android Malware in Your Pocket. In: pp. 23–26; 2014.
• 15 Zhou Yajin, Jiang Xuxian. Dissecting android malware: Characterization and evolution. In: pp. 95–109. IEEE; 2012.
• 16 Glodek William, Harang Richard. Rapid permissions-based detection and analysis of mobile malware using random decision forests. In: pp. 980–985. IEEE; 2013.
• 17 Cen Lei, Gates Christopher S, Si Luo, Li Ninghui. A probabilistic discriminative model for android malware detection with decompiled source code. IEEE Transactions on Dependable and Secure Computing. 2015;12(4):400–412.
• 18 Talha Kabakus Abdullah, Alper Dogru Ibrahim, Aydin Cetin. APK Auditor: Permission-based Android malware detection system. Digital Investigation. 2015;13:1–14.
• 19 Wang Wei, Li Yuanyuan, Wang Xing, Liu Jiqiang, Zhang Xiangliang. Detecting Android malicious apps and categorizing benign apps with ensemble of classifiers. Future Generation Computer Systems. 2018;78:987–994.
• 20 Ham Hyo-Sik, Choi Mi-Jung. Analysis of android malware detection performance using machine learning classifiers. In: pp. 490–495. IEEE; 2013.
• 21 Amamra Abdelfattah, Robert Jean-Marc, Abraham Andrien, Talhi Chamseddine. Generative versus discriminative classifiers for android anomaly-based detection system using system calls filtering and abstraction process. Security and Communication Networks. 2016;9(16):3483–3495.
• 22 Kim Hwan-Hee, Choi Mi-Jung. Linux kernel-based feature selection for Android malware detection. In: pp. 1–4. IEEE; 2014.
• 23 Narudin Fairuz Amalina, Feizollah Ali, Anuar Nor Badrul, Gani Abdullah. Evaluation of machine learning classifiers for mobile malware detection. Soft Computing. 2016;20(1):343–357.
• 24 Chekina Lena, Mimran Dudu, Rokach Lior, Elovici Yuval, Shapira Bracha. Detection of Deviations in Mobile Applications Network Behavior. CoRR. 2012;abs/1208.0564.
• 25 Shabtai Asaf, Kanonov Uri, Elovici Yuval, Glezer Chanan, Weiss Yael. Andromaly: a behavioral malware detection framework for android devices. Journal of Intelligent Information Systems. 2012;38(1):161–190.
• 26 Tong Fei, Yan Zheng. A hybrid approach of mobile malware detection in Android. Journal of Parallel and Distributed Computing. 2017;103:22–31. Special Issue on Scalable Cyber-Physical Systems.
• 27 Su Xin, Chuah M, Tan Gang. Smartphone dual defense protection framework: Detecting malicious applications in android markets. In: pp. 153–160. IEEE; 2012.
• 28 Canfora Gerardo, Mercaldo Francesco, Visaggio Corrado Aaron. A classifier of malicious android applications. In: pp. 607–614. IEEE; 2013.
• 29 Rocha Bruno PS, Conti Mauro, Etalle Sandro, Crispo Bruno. Hybrid static-runtime information flow and declassification enforcement. IEEE transactions on information forensics and security. 2013;8(8):1294–1305.
• 30 Feldman Stephen, Stadther Dillon, Wang Bing. Manilyzer: automated android malware detection through manifest analysis. In: pp. 767–772. IEEE; 2014.
• 31 Lin Ying-Dar, Lai Yuan-Cheng, Lu Chun-Nan, Hsu Peng-Kai, Lee Chia-Yin. Three-phase behavior-based detection and classification of known and unknown malware. Security and Communication Networks. 2015;8(11):2004–2015.
• 33 Chinese. http://www.appinchina.co/market/ [Date last accessed May 2018].
• 34 Koodous. http://koodous.com/ [Date last accessed May 2018].
• 35 1Mobile Market. http://m.1mobile.com/me.onemobile.android.html [Date last accessed May 2018].
• 36 9apps. http://www.9apps.com/ [Date last accessed May 2018].
• 37 VirusTotal. http://www.virustotal.com/ [Date last accessed May 2018].
• 38 Ransomware. http://ransom.mobi/ [Date last accessed May 2018].
• 39 strace. http://linux.die.net/man/1/strace [Date last accessed May 2018].
• 40 Ramos Juan, et al. Using tf-idf to determine word relevance in document queries. In: pp. 133–142; 2003.
• 41 Hu Xiaohua. Knowledge discovery in databases: an attribute-oriented rough set approach. PhD thesis, University of Regina; 1995.
• 42 Jensen Richard, Shen Qiang. Rough set based feature selection: A review. Rough computing: theories, technologies and applications. 2007:70–107.
• 43 AV-Test. http://goo.gl/Rg6NDN [Date last accessed May 2018].
• 44 WEKA. http://www.cs.waikato.ac.nz/ml/weka/ [Date last accessed May 2018].
• 45 Idrees Fauzia, Rajarajan Muttukrishnan, Conti Mauro, Chen Thomas M, Rahulamathavan Yogachandran. PIndroid: A novel Android malware detection system using ensemble learning methods. Computers & Security. 2017;68:36–46.
• 46 Moonsamy Veelasha, Batten Lynn. Zero permission android applications-attacks and defenses. In: pp. 5–9. School of Information Systems, Deakin University; 2012.
• 47 Narain Sashank, Vo-Huu Triet D, Block Kenneth, Noubir Guevara. Inferring user routes and locations using zero-permission mobile sensors. In: pp. 397–413. IEEE; 2016.
• 49 Ruta Dymitr, Gabrys Bogdan. An overview of classifier fusion methods. Computing and Information systems. 2000;7(1):1–10.
• 50 Felt Adrienne Porter, Chin Erika, Hanna Steve, Song Dawn, Wagner David. Android permissions demystified. In: pp. 627–638. ACM; 2011.
• 51 Rastogi Vaibhav, Chen Yan, Jiang Xuxian. Droidchameleon: evaluating android anti-malware against transformation attacks. In: pp. 329–334. ACM; 2013.
• 52 Rastogi Vaibhav, Chen Yan, Jiang Xuxian. Catch me if you can: Evaluating android anti-malware against transformation attacks. IEEE Transactions on Information Forensics and Security. 2014;9(1):99–108.
• 53 Dong Shuaike, Li Menghao, Diao Wenrui, et al. Understanding Android Obfuscation Techniques: A Large-Scale Investigation in the Wild. arXiv preprint arXiv:1801.01633. 2018.
• 54 Proguard. http://developer.android.com/tools/help/proguard.html [Date last accessed May 2018].
• 55 Dasho. http://www.preemptive.com/solutions/android-obfuscation/ [Date last accessed May 2018].
• 56 Du Peijun, Samat Alim, Waske Björn, Liu Sicong, Li Zhenhong. Random Forest and Rotation Forest for Fully Polarized SAR Image Classification using Polarimetric and Spatial Features. ISPRS Journal of Photogrammetry and Remote Sensing. 2015;105:38–53.
• 57 Damshenas Mohsen, Dehghantanha Ali, Choo Kim-Kwang Raymond, Mahmud Ramlan. M0droid: An android behavioral-based malware detection model. Journal of Information Privacy and Security. 2015;11(3):141–157.
• 58 Xiao Xi, Wang Zhenlong, Li Qing, Xia Shutao, Jiang Yong. Back-propagation neural network on Markov chains from system call sequences: a new approach for detecting Android malware with system call sequences. IET Information Security. 2016;11(1):8–15.
• 59 Canfora Gerardo, Medvet Eric, Mercaldo Francesco, Visaggio Corrado Aaron. Detecting android malware using sequences of system calls. In: pp. 13–20. ACM; 2015.
• 60 Zhao Kai, Zhang Dafang, Su Xin, Li Wenjia. Fest: A feature extraction and selection tool for Android malware detection. In: pp. 714–720. IEEE; 2015.

## Author Biography

Deepa K. is currently pursuing her Ph.D. at Bharathiar University, Coimbatore. She holds MCA and M.Phil. degrees in Computer Science from Bharathiar University. Her main research interests are cyber security and machine learning.

Radhamani G. is presently working as Professor and Director, School of Information Technology and Science, Dr. G R Damodaran College of Science, affiliated to Bharathiar University, India. Formerly, she worked as Head, Department of IT, Ministry of Manpower, Sultanate of Oman. Prior to that, she served as a Research Associate at IIT (India) and as a faculty member in the Department of Information Technology, Multimedia University, Malaysia. She received her M.Sc and M.Phil degrees from the P.S.G College of Technology, India, and her Ph.D degree from the Multimedia University, Malaysia. She has been invited as Keynote Speaker and Chair for international conferences in India and abroad. She has published papers in international journals and conferences.

Vinod P. is a Post Doc at the Department of Mathematics, University of Padua, Italy. He holds a Ph.D in Computer Engineering from Malaviya National Institute of Technology, Jaipur, India. He has more than 70 research articles published in peer-reviewed journals and international conferences. He is a reviewer for a number of security journals, and has also served as a programme committee member for international conferences related to computer and information security. His current research involves the development of a malware scanner for mobile applications using machine learning techniques. Vinod’s areas of interest are Adversarial Machine Learning, Malware Analysis, Context-aware Privacy-preserving Data Mining, Ethical Hacking and Natural Language Processing.

Mohammad Shojafar is an Intel Innovator and senior researcher in the SPRITZ Security and Privacy Research Group at the University of Padua, Italy. He was a CNIT Senior Researcher at the University of Rome Tor Vergata, where he contributed to the European H2020 “SUPERFLUIDITY” project. He also completed the Italian projects “SAMMClouds”, “V-FoG” and “PRIN15”, which aim to address some of the open issues related to Software as a Service (SaaS) and Infrastructure as a Service (IaaS) systems in Cloud and Fog computing, and which are supported by the University of Sapienza Rome and the University of Modena and Reggio Emilia, Italy. He received the Ph.D. degree from Sapienza University of Rome, Rome, Italy, in 2016 with an “Excellent” degree. He received the MSc and BSc from QIAU and Iran University of Science and Technology, Tehran, Iran, in 2010 and 2006, respectively. He has published over 90 refereed articles in prestigious venues such as IEEE TCC, IEEE TSC and IEEE TGCN. He was a programmer/analyzer at National Iranian Oil Company (NIOC) and Tidewater ltd in Iran from 2008 to 2013.

Neeraj Kumar is currently an Associate Professor in the Department of Computer Science and Engineering, Thapar University, Patiala (Pb.), India. He has published more than 200 technical research papers in leading journals and conferences from IEEE, Elsevier, Springer, John Wiley, etc. Some of his research findings are published in top cited journals such as IEEE TIE, IEEE TDSC, IEEE TITS, IEEE TCE, IEEE Netw., IEEE Comm., IEEE WC, IEEE IoTJ, IEEE SJ, FGCS, JNCA, and ComCom. He has guided many research scholars leading to Ph.D. and M.E./M.Tech. His research is supported by funding from Tata Consultancy Service, the Council of Scientific and Industrial Research, and the Department of Science & Technology. He is a senior member of IEEE and a committee member of different societies of ComSoc. He is an editorial board member of IEEE Communication Magazine, Journal of Networks and Computer Applications, International Journal of Communication Systems, and Security and Privacy.

Mauro Conti is Full Professor at the University of Padua, Italy. He obtained his Ph.D. from Sapienza University of Rome, Italy, in 2009. After his Ph.D., he was a Post-Doc Researcher at Vrije Universiteit Amsterdam, The Netherlands. In 2011 he joined the University of Padua as Assistant Professor, where he became Associate Professor in 2015. In 2017, he obtained the national habilitation as Full Professor for Computer Science and Computer Engineering. He has been Visiting Researcher at GMU (2008, 2016), UCLA (2010), UCI (2012, 2013, 2014), TU Darmstadt (2013), UF (2015), and FIU (2015, 2016). He has been awarded a Marie Curie Fellowship (2012) by the European Commission, and a Fellowship by the German DAAD (2013). His main research interest is in the area of security and privacy. In this area, he has published more than 200 papers in topmost international peer-reviewed journals and conferences. He is Associate Editor for several journals, including IEEE Communications Surveys & Tutorials and IEEE Transactions on Information Forensics and Security. He was Program Chair for TRUST 2015, ICISS 2016, WiSec 2017, and General Chair for SecureComm 2012 and ACM SACMAT 2013. He is Senior Member of the IEEE.
