Neural Classification of Malicious Scripts: A study with JavaScript and VBScript


Jack W. Stokes (jstokes@microsoft.com)
Microsoft Research, Redmond, WA 98052, USA

Rakshit Agrawal (ragrawa1@ucsc.edu)
Department of Computer Science, University of California, Santa Cruz, Santa Cruz, CA 95064, USA

Geoff McDonald (geofm@microsoft.com)
Microsoft Corp., Vancouver, BC, V6E 4M3, CA
Abstract

Malicious scripts are an important computer infection threat vector. Our analysis reveals that the two most prevalent types of malicious scripts are JavaScript and VBScript, and that the percentage of detected JavaScript attacks is on the rise. To address these threats, we investigate two deep recurrent models, LaMP (LSTM and Max Pooling) and CPoLS (Convoluted Partitioning of Long Sequences), which process JavaScript and VBScript as byte sequences. Lower layers capture the sequential nature of these byte sequences, while higher layers classify the resulting embedding as malicious or benign. Unlike previously proposed solutions, our models are trained in an end-to-end fashion, allowing discriminative training even for the sequential processing layers. Evaluating these models on a large corpus of 296,274 JavaScript files indicates that the best performing LaMP model has a 65.9% true positive rate (TPR) at a false positive rate (FPR) of 1.0%. Similarly, the best CPoLS model has a TPR of 45.3% at an FPR of 1.0%. LaMP and CPoLS yield a TPR of 69.3% and 67.9%, respectively, at an FPR of 1.0% on a collection of 240,504 VBScript files.


1 Introduction

Malicious scripts are widely abused by malware authors to infect users' computers. In this paper, we show that in the current threat landscape, the two most prevalent types of script malware that Windows users encounter are JavaScript (JS) and VBScript (VBS). JavaScript is an interpreted scripting language developed by Netscape that is often included in webpages to provide additional dynamic functionality (Mozilla). VBScript, or Microsoft Visual Basic Scripting Edition, is an active scripting language originally designed for Internet Explorer and the Microsoft Internet Information Services web server (Microsoft).

Spearphishing attacks have been a key component of several recent large-scale data breaches (CRN; Snell). In the typical spearphishing attack shown in Figure 1, a user is sent an email stating that they have an outstanding invoice. An archive is attached to the email, and inside the archive is a VBScript file called "invoice.vbs". If the user opens the VBScript file, it is executed through the default file association using a native script execution host on Windows (in this example, "wscript.exe"). Once the malicious script is running on the computer, these attacks commonly download and execute further malware such as ransomware (Corporation (2016)). Figure 2 presents examples of malicious JavaScript and VBScript content.

Figure 1: Example of an email-based social engineering attack using an attached VBScript file.
(a) Malicious JavaScript
(b) Malicious VBScript
Figure 2: Example a) malicious JavaScript file from the TrojanDownloader:JS/Swabfex malware family, and b) malicious VBScript file from the Worm:VBS/Jenxcus malware family.

While a wide range of machine learning models have been proposed for detecting malicious executable files (Gandotra et al. (2014)), there has been little work investigating malicious JavaScript, and even less research has been devoted to detecting malicious VBScript. Previous JavaScript solutions include those based on static analysis (Likarish et al. (2009); Maiorca et al. (2015); Shah (2016)), and on combined static and dynamic analysis (Corona et al. (2014)). Two previous solutions for VBScript are based on static analysis (Kim et al. (2006); Wael et al. (2017)). In addition, deep recurrent models have recently been proposed for detecting malicious system API call sequences in PE files (Athiwaratkun and Stokes (2017); Kolosnjaji et al. (2016); Pascanu et al. (2015)), JavaScript (Wang et al. (2016)), and Powershell (Hendler et al. (2018)).

There are several challenges posed by trying to detect malicious JavaScript and VBScript. One main challenge is the lack of labeled data. While obtaining malicious samples is challenging enough, creating a large benign set of script files is extremely difficult given strict email privacy policies which prevent manual inspection of undetected email. Furthermore, malicious scripts include obfuscation to hide the malicious content, and often unpack or decrypt the underlying malicious script only upon execution. Complicating this is the fact that, in some cases, the same obfuscators are used by both benign and malicious files. Thus, pure static analysis of the primary script often fails to detect some malicious activity. Another problem is that anti-virus (AV) automation systems, such as sandboxing environments, are designed primarily to handle Windows Portable Executable (PE) files (e.g., .exe and .dll). Accordingly, the number of labeled script files is typically much lower than for executable files.

In this paper, we propose ScriptNet, a deep recurrent neural classification system which can be trained to detect either malicious JavaScript or VBScript using a combination of static and dynamic analysis. We first use a production anti-virus engine to dynamically execute a script in a sandboxed environment inside the engine. This allows the AV engine to safely analyze any child scripts which are dropped during script execution without infecting the computer.

We investigate two different models for the task of detecting malicious JavaScript and VBScript. Both models encode sequential information using one or more long short-term memory (LSTM) layers. The LSTM and Max Pooling (LaMP) model follows a two-stage approach where the first stage learns a language model for the individual characters in the script content. The second stage includes a (potentially deep) neural network for the final classification of the script as malicious or benign. To allow the processing of longer script files, we next investigate the Convoluted Partitioning of Long Sequences (CPoLS) model, which adds an additional layer consisting of a one-dimensional convolutional neural network. LaMP is similar to the model proposed by Athiwaratkun and Stokes (2017) for PE files, but differs in two respects. First, while Athiwaratkun's model also has an LSTM-based language model followed by a neural network classification stage, each component is trained in isolation: the language model is first trained in an unsupervised fashion, and this trained language model is then frozen and used to generate the embeddings for the classification stage. Instead, LaMP is trained with end-to-end learning where all the model parameters, including those in the language model and the classifier, are learned simultaneously directly from the characters in the script content. Similarly, CPoLS is also trained in an end-to-end manner. Second, LaMP extends the model in Athiwaratkun and Stokes (2017) to allow for stacked (i.e., multiple) LSTM layers. Since our models operate directly on the script content encoded as bytes, they do not require the careful and potentially computationally expensive feature engineering proposed by other solutions. The main contributions of this paper include:

  • We study the detection percentage and threat vectors of malicious JavaScript and VBScript from telemetry generated by a production anti-virus product.

  • We investigate two deep recurrent neural network models for the detection of malicious JavaScript and VBScript.

  • We evaluate these models on two large corpora of JavaScript and VBScript files.

   

Figure 3: Percentage of malicious files detected by the Windows Defender anti-malware engine over time in the categories of JavaScript and VBScript attacks.
Figure 5: Percentage of total detections by file type in 2017. The remaining 92.5% of detections were for PE files.

2 Motivation

The detection of malicious JavaScript and VBScript is important for protecting users against modern malware attacks. With advances in browser and operating system security making browser exploit attacks more difficult, miscreants are instead relying on social engineering attacks. Figure 3 illustrates the percentage of malicious files detected by the Windows Defender anti-malware engine in the categories of JavaScript and VBScript attacks. The percentage of malicious JavaScript-based attacks has been rising recently, while the percentage of detected attacks involving VBScript has remained relatively constant since 2014. Figure 5 indicates the percentage of all non-PE files detected in the Windows Defender telemetry, and shows that JavaScript and VBScript are the two most prevalent types of detected scripts found in the telemetry data. Since the remaining 92.5% of the detections are for PE files, malicious scripts are still a small minority of the detected files in the wild.

Figure 7 illustrates the identified arrival methods of malicious JavaScript and VBScript based on the telemetry data from 2017. Archive file detections, the most prevalent threat vector for JavaScript, are generated when the user extracts the script from within an archive, and are often associated with social engineering attacks. Interestingly, removable drives (e.g., thumbdrives, external USB hard drives) were responsible for the second largest share of JavaScript attacks. Only 11.1% of detected malicious JavaScript files were encountered from malicious email, and 3.8% of the files were directly downloaded from the internet.

The distribution of the attack sources for malicious VBScript tells a different story. The main threat vector of malicious VBScript is email, followed closely by downloads. Archives and removable drives play a smaller role in VBScript attacks, but they are still important threat vectors.

Figure 7: Arrival methods for malicious JavaScript and VBScript files detected by the Windows Defender anti-malware engine in 2017.

3 Threat Model

It is necessary to specify the assumptions that we make about the attacker. The most important assumption is that the model is able to learn a deep embedding which identifies malware-related activity from the first bytes (e.g., the first 200 or 1000 bytes) of the script. If an attacker fills these first bytes with random content, the models will fail to detect the activity that captures the malicious intent.

Another assumption is that the behavior which identifies an unknown malicious script is also found in labeled scripts in the training set. If the training set does not contain scripts which are somehow related to the unknown script being evaluated, the classifier may again fail to accurately predict whether the script is malicious.

As part of the scanning process, the anti-malware engine emulates an unknown file and attempts to extract any child scripts. It may be possible that the anti-malware engine fails to successfully extract all the child scripts. In this case, the model may also fail to detect the malicious script if the parent script is predicted to be benign, and the child script which executes the malicious activity is not successfully extracted.

4 Data

Scripts: Building a dataset of malicious and benign scripts for training is a challenge. A sizable percentage of malicious scripts are delivered in email and for privacy reasons cannot be collected. For this research, samples were selected randomly from the files observed on users’ computers during June 2017 that had been successfully collected, with permission, by the Windows Defender backend. These samples are collected by many sources including users directly submitting suspicious files for analysis, files shared through sample exchanges such as VirusTotal, and scripts that were extracted from installer packages or archives.

Labels: Another challenge in training a classifier for detecting malicious scripts is obtaining enough labeled data. Since we are trying to predict if a script is malware or benign, we must obtain both types of labels.

A script is labeled as malware if it has been inspected by our AV partner’s analysts and determined to be malicious. In addition, the script is labeled as malicious if it has been detected by the company’s detection signatures. Finally, scripts are labeled as malware if eight or more other anti-virus vendors detect the script as malware.

Obtaining enough benign scripts is a challenge because labeling a script as benign often requires manual inspection. Thus, a script is labeled as benign by a number of methods. First, the script is considered benign if it has been labeled as benign by an analyst or has been collected from a trusted source, such as being downloaded from a legitimate webpage. Since this does not provide enough labeled benign scripts, we augment the benign dataset with scripts which are not detected by any trusted scanner at least 15 days after our AV partner first encountered them in the wild.
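As a rough illustration, the labeling policy above can be sketched as a simple decision function. The helper and its field names are hypothetical; only the thresholds (eight vendor detections, 15 clean days) come from the text.

```python
def label_script(analyst_verdict=None, signature_hit=False,
                 vendor_detections=0, trusted_source=False,
                 days_observed_clean=0):
    """Return 'malware', 'benign', or None (unlabeled) per the
    heuristics described in the text (a sketch, not the real system)."""
    # Malware: analyst verdict, partner signature, or >= 8 vendor detections.
    if analyst_verdict == "malicious" or signature_hit or vendor_detections >= 8:
        return "malware"
    # Benign: analyst verdict or collection from a trusted source.
    if analyst_verdict == "benign" or trusted_source:
        return "benign"
    # Augmented benign set: undetected by any trusted scanner for >= 15 days.
    if vendor_detections == 0 and days_observed_clean >= 15:
        return "benign"
    return None  # insufficient evidence either way
```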

Datasets: Our anti-virus partners provided the first 1000 bytes of 296,274 JavaScript files which contained 166,179 malicious and 130,095 benign scripts. We randomly assigned these scripts into training, validation, and test sets containing 207,392, 29,627, and 59,255 samples, respectively. The validation set is a small dataset which is used for hyperparameter tuning during the training phase. By doing so, we are later able to make a fair assessment of the final model’s performance on the held-out test set. Similarly, our partners provided a VBScript dataset with 240,504 examples including 66,028 malicious scripts and 174,476 benign scripts. This dataset was then randomly split into 168,353 training scripts, 24,050 validation scripts, and 48,101 test scripts.
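The splits above follow a roughly 70/10/20 train/validation/test proportion, which can be sketched as follows (the `split_dataset` helper and the fixed seed are illustrative, not the actual tooling used):

```python
import random

def split_dataset(samples, train_frac=0.70, valid_frac=0.10, seed=0):
    """Randomly split samples into train/validation/test sets;
    the remainder after train and validation becomes the test set."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = int(n * train_frac)
    n_valid = int(n * valid_frac)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_valid],
            shuffled[n_train + n_valid:])
```

Holding out the validation set for hyperparameter tuning keeps the test set untouched for the final assessment.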

5 System

Figure 8 presents an overview of the proposed neural script classification system. The labeled collection of malicious and benign scripts (e.g., JavaScript or VBScript files), described in the previous section, are first scanned with the Windows Defender anti-malware engine. During this scanning operation, the script is emulated and unpacked and may drop one or more additional scripts. Each child script is also emulated and unpacked which may generate even more scripts. This process continues until all scripts have been extracted and scanned.

Figure 8: Overview of the neural script classification system.

These scripts are next normalized. All whitespace characters, except line breaks, are first removed. Next, the text is standardized to lowercase and converted to the US-ASCII character set. Any characters which are not included in the US-ASCII character set, such as non-English language characters, are replaced by the constant character '?'. Figure 9 illustrates an example script before and after normalization.
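A minimal Python sketch of this normalization step (the `normalize_script` helper is hypothetical; it applies the three rules just described):

```python
def normalize_script(text):
    """Normalize a script: drop whitespace except line breaks,
    lowercase, and replace any non-US-ASCII character with '?'."""
    out = []
    for ch in text:
        if ch.isspace() and ch not in "\r\n":
            continue  # remove spaces/tabs, keep line breaks
        ch = ch.lower()
        out.append(ch if ord(ch) < 128 else "?")
    return "".join(out)
```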

Figure 9: Example malicious packed JavaScript file from the TrojanDownloader:JS/Crimace.A malware family before (left), and after normalization (right).

Before training the model, each normalized script is written to the file system. To avoid storing malicious content on the hard drive, the characters are next encoded by their numeric ASCII encoding (e.g., ’97’ for the character ’a’) delimited by commas. This delimited, encoded sequence data is then used to train the neural script malware model.
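The comma-delimited ASCII encoding can be sketched as below (helper names are illustrative):

```python
def encode_script(normalized_text):
    """Encode each character as its numeric ASCII value, comma-delimited,
    so no executable script content is stored on disk."""
    return ",".join(str(ord(ch)) for ch in normalized_text)

def decode_script(encoded):
    """Inverse transform, used when feeding sequences to the model."""
    return [int(tok) for tok in encoded.split(",")]
```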

To evaluate an unknown file, the system uses the trained model to produce a prediction which indicates the probability that the unknown script is malicious.

6 Models

Static and dynamic analysis of script files, like VBScript and JavaScript, allows our system to use information hidden in the script's unpacked content to learn its malicious nature. In this section, we discuss our models, which process the script files and learn to recognize malicious intent using neural classifiers and sequential learning.

Translation to Sequences: The raw scripts can be considered to be documents containing a limited vocabulary set. As such, the scripts are long ordered sequences of encoded characters. For normalized script files, we define our vocabulary as the set of all possible bytes (8-bits). This leads to a vocabulary of size 256. Each normalized script, therefore, is a sequence of these bytes.

Sequential Learning: In language models over document-like datasets, sequential learning is a commonly used learning methodology (Józefowicz et al. (2016); Sutskever et al. (2014)). Neural network-based models for sequential learning use Recurrent Neural Networks (RNNs), and their variants, to capture the ordered nature of elements, while learning generally over each individual item. In our models, we use a specific memory-based gated variant of RNNs, known as the Long Short-Term Memory (LSTM) model (Gers et al. (2000); Hochreiter and Schmidhuber (1997)). LSTMs are used extensively for processing long sequences of data. In speech and language models in particular, enhanced LSTMs define the state-of-the-art (Cho et al. (2014); Graves et al. (2013a, b); Sutskever et al. (2014)). However, their general neural nature, along with the ability to learn using backpropagation through time (Werbos (1990)), makes them useful in many domains. For our byte sequences, we therefore use LSTMs as the primary element for capturing the sequential attributes of the data. LSTMs are often implemented with minor variations in their structure. The implementation used in our models, at each timestep t, is described by the following equations:

i_t = σ(W_i x_t + U_i h_{t-1} + b_i)
f_t = σ(W_f x_t + U_f h_{t-1} + b_f)
o_t = σ(W_o x_t + U_o h_{t-1} + b_o)
c̃_t = tanh(W_c x_t + U_c h_{t-1} + b_c)        (1)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t
h_t = o_t ⊙ tanh(c_t)

where the nonlinearity σ corresponds to the logistic sigmoid function. The variables i_t, f_t, o_t, and c̃_t are the input gate, forget gate, output gate, and cell activation, respectively. U_i, U_f, U_o, U_c are the weight matrices for each gate corresponding to the recurrent input from the previous timestep, W_i, W_f, W_o, W_c are the input weight matrices per gate, and b_i, b_f, b_o, b_c are the biases for each gate. The operator ⊙ represents the pairwise (elementwise) product between two vectors.

The network takes an input vector x_t at each timestep t, and updates two properties of the LSTM. It first updates the cell memory c_t using the gates as well as the cell memory c_{t-1} from the previous timestep. It then updates the hidden activation h_t for timestep t using the gates and the cell memory. The input vector provided to the LSTM cell can be of any structure depending on the data. In a categorical representation, it can be a one-hot encoded vector, while in the case of embeddings, it can be a dense vector. For sparse featured data, the input can simply be a sparse vector.
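A minimal NumPy sketch of one LSTM timestep following these update equations. The dictionary-keyed weight layout is an illustrative assumption, not the paper's implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM timestep. W, U, b hold the input weights, recurrent
    weights, and biases for the input (i), forget (f), output (o),
    and candidate-cell (g) gates."""
    i = sigmoid(W["i"] @ x_t + U["i"] @ h_prev + b["i"])
    f = sigmoid(W["f"] @ x_t + U["f"] @ h_prev + b["f"])
    o = sigmoid(W["o"] @ x_t + U["o"] @ h_prev + b["o"])
    g = np.tanh(W["g"] @ x_t + U["g"] @ h_prev + b["g"])
    c_t = f * c_prev + i * g    # update cell memory
    h_t = o * np.tanh(c_t)      # update hidden activation
    return h_t, c_t
```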

Model Architectures: For sequential learning, we designed two neural model architectures. The primary difference between the two is their resilience to very long input sequences. We discuss these properties in detail below.

LSTM and Max Pooling: In the LSTM and Max Pooling (LaMP) architecture, illustrated in Figure 10, we first use an embedding layer, Embedding, to process the input byte sequence x_{1:T}. Since each element in x_{1:T} corresponds to a byte from the vocabulary, it is symbolic in nature. We use the embedding layer to transform each byte into a dense vector (i.e., an embedding) which captures relatedness among different bytes, thereby assisting the overall model in learning. The sequence of embeddings is then passed through multiple LSTM layers stacked on top of each other. The LSTM generates representations h_{1:T} for the elements in the input sequence. In order for us to perform classification on the sequence and identify its hidden malicious content, we transform the sequence into a vector highlighting significant information, while reducing its dimensionality. For this purpose, we use a temporal max pooling layer, MaxPool1d, as proposed by Pascanu et al. (2015). Given an input vector sequence of length T, where each vector is d-dimensional, MaxPool1d computes an output vector m whose components are m_j = max_t h_{t,j}.
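Temporal max pooling reduces a (T, d) sequence of LSTM outputs to a single d-vector by taking a per-dimension maximum over time; a one-line NumPy sketch:

```python
import numpy as np

def max_pool_1d(h):
    """Temporal max pooling: given a (T, d) array of per-timestep
    vectors, return the d-vector m with m_j = max over t of h[t, j]."""
    return h.max(axis=0)
```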

We pass the sequence h_{1:T} through MaxPool1d to obtain the vector m. Next, m is passed through one or more dense neural layers employing a rectified linear (Relu) nonlinear activation function. This helps learn an additional layer of weights before performing the final prediction. The Relu-activated vector is finally used by a sigmoid layer to generate the final probability p indicating whether the script is malicious or benign. We can formally define LaMP on an input byte sequence x_{1:T} as:

h_{1:T} = LSTM(Embedding(x_{1:T}))
m = MaxPool1d(h_{1:T})
r = Relu(W_r m)                                 (2)
p = sigmoid(W_s r)

where W_r is the weight matrix for the dense Relu hidden layer, and W_s is the weight matrix for the final sigmoid classification layer.
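A minimal Keras sketch of the LaMP architecture under these definitions. The layer sizes are illustrative defaults, smaller than the settings in Table 1, and the exact layer composition is an assumption based on the description above:

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_lamp(vocab_size=256, seq_len=1000, embed_dim=64,
               lstm_units=128, num_lstm=1, num_hidden=1):
    """LaMP sketch: Embedding -> stacked LSTM -> temporal max pooling
    -> dense ReLU layer(s) -> sigmoid probability."""
    inp = layers.Input(shape=(seq_len,), dtype="int32")
    x = layers.Embedding(vocab_size, embed_dim)(inp)  # byte -> dense vector
    for _ in range(num_lstm):                         # stacked LSTM layers
        x = layers.LSTM(lstm_units, return_sequences=True)(x)
    x = layers.GlobalMaxPooling1D()(x)                # temporal max pooling
    for _ in range(num_hidden):                       # dense ReLU classifier
        x = layers.Dense(lstm_units, activation="relu")(x)
    out = layers.Dense(1, activation="sigmoid")(x)    # P(malicious)
    return tf.keras.Model(inp, out)
```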

While LaMP provides a simple model to capture sequences directly, it is limited by the length of the input sequences. As the length of the input sequence increases, the model becomes both more difficult to train and more memory-intensive. In the case of detecting malicious content, long sequences can often separate two or more bytes far from each other even when their combined presence is the cause of the malicious intent. When learning directly on a sequence, it is possible for the model to lose the context of a byte identified earlier in the sequence when processing a new byte at a larger distance. To cope with such problems, we therefore propose another architecture called Convoluted Partitioning of Long Sequences (CPoLS).

Figure 10: LaMP model for detecting malicious JavaScript and VBScript files.

Convoluted Partitioning of Long Sequences: Convoluted Partitioning of Long Sequences (CPoLS) is a neural model architecture designed specifically to extract classification information hidden deep within long sequences. In this model illustrated in Figure 11, we process the input sequence in parts by splitting it first into smaller pieces of fixed length. By performing this step, we generate a sequence of multiple partitions, each of which is a sequence in itself of a smaller length.

We use Convolutional Neural Networks (CNNs) (LeCun and Bengio (1995)) in this model, along with the other LaMP modules. CNNs are widely used in computer vision (Krizhevsky et al. (2012); Russakovsky et al. (2015)), and they have also recently shown success in sequential learning domains (Gehring et al. (2016, 2017)).

Given an input byte sequence x, the model first splits it into a partitioned list P containing several small subsequences p_i, where i is the index of each partition in P. To translate the bytes in these subsequences from symbols to dense vectors, we pass them through an embedding layer, Embedding, and obtain the sequence E, where each element e_i corresponds to the sequence of embeddings for partition p_i in P. Each of these partitions e_i is now processed separately, while still maintaining the overall sequential nature. We call this method RecurrentConvolutions. In this method, we pass each partition e_i through a one-dimensional CNN, Conv1D, which applies multiple filters to the input sequence and generates a tensor c_i representing the convolved output of the vector sequence e_i. We then reduce the dimensionality of c_i by performing temporal max pooling, MaxPool1d, which takes a tensor input and extracts a vector from it. Applying RecurrentConvolutions to each partition in this way yields the updated vectors v_i. These vectors are finally combined in the same order to create an updated sequence V of learned partition representations. With the help of partitioning, the length of V is also limited to a trainable length.

At this stage, the model uses the sequence V as an input to the LaMP model and learns the probability that the script is malicious. Therefore, we use a combination of an LSTM, a second MaxPool1d layer, dense Relu activations, and a final sigmoid layer to generate the prediction on the new input sequence V. Formally, we define the CPoLS model as:

v_i = MaxPool1d(Conv1D(Embedding(p_i))),  i = 1, ..., K
p = LaMP(V),  where V = [v_1, ..., v_K]                 (3)

Such a model is resilient to extremely long sequence lengths and can also find malicious objects hidden very late in the sequence.
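A hedged Keras sketch of the CPoLS pipeline: partition the embedded sequence, apply Conv1D plus temporal max pooling per partition via TimeDistributed, then feed the shortened sequence of partition vectors to the LaMP-style stages. The partition length and layer sizes are illustrative assumptions:

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_cpols(vocab_size=256, seq_len=1000, part_len=100, embed_dim=64,
                filters=128, window=10, stride=5, lstm_units=128):
    """CPoLS sketch: embed bytes, split into fixed-length partitions,
    run RecurrentConvolutions per partition, then LaMP-style stages."""
    num_parts = seq_len // part_len
    inp = layers.Input(shape=(seq_len,), dtype="int32")
    x = layers.Embedding(vocab_size, embed_dim)(inp)
    # Split the embedded sequence into fixed-length partitions.
    x = layers.Reshape((num_parts, part_len, embed_dim))(x)
    # RecurrentConvolutions: Conv1D + temporal max pooling per partition.
    x = layers.TimeDistributed(
        layers.Conv1D(filters, window, strides=stride, activation="relu"))(x)
    x = layers.TimeDistributed(layers.GlobalMaxPooling1D())(x)
    # The shortened sequence of partition vectors feeds the LaMP stages.
    x = layers.LSTM(lstm_units, return_sequences=True)(x)
    x = layers.GlobalMaxPooling1D()(x)
    x = layers.Dense(lstm_units, activation="relu")(x)
    out = layers.Dense(1, activation="sigmoid")(x)
    return tf.keras.Model(inp, out)
```

Note how partitioning shortens the sequence seen by the LSTM from seq_len to seq_len / part_len elements, which is what makes very long scripts trainable.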

Figure 11: Convoluted Partitioning of Long Sequences (CPoLS) model for detecting malicious JavaScript and VBScript files.
Script Type Model Parameter Description Value
JavaScript LaMP Minibatch Size 200
JavaScript LaMP LSTM Hidden Layer Size 1500
JavaScript LaMP Embedding Layer Size 64
JavaScript CPoLS Minibatch Size 50
JavaScript CPoLS LSTM Hidden Layer Size 1500
JavaScript CPoLS Embedding Layer Size 64
JavaScript CPoLS CNN Window Size 10
JavaScript CPoLS CNN Window Stride 5
JavaScript CPoLS Number of CNN Filters 128
VBScript LaMP Minibatch Size 100
VBScript LaMP LSTM Hidden Layer Size 1500
VBScript LaMP Embedding Layer Size 128
VBScript CPoLS Minibatch Size 100
VBScript CPoLS LSTM Hidden Layer Size 1500
VBScript CPoLS Embedding Layer Size 128
VBScript CPoLS CNN Window Size 10
VBScript CPoLS CNN Window Stride 5
VBScript CPoLS Number of CNN Filters 128
Table 1: Settings for the various model parameters.

End-to-End Learning: To train the models described above, we perform an end-to-end learning process. Since the data available to us is in the form of a sequence and an associated binary label, we need to train the entire model solely from this label. In end-to-end learning, we pass each sequence through all layers of our model to derive the probability p. Using this probability, together with the true label y, we measure the cross-entropy loss L. This loss is used to compute the gradients required for updating the weights in each layer of the model. Therefore, we simultaneously learn all the parameters for the primary classification objective.
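The end-to-end training loop can be sketched with Keras as below: a single binary cross-entropy loss at the sigmoid output drives gradient updates through every layer (classifier, pooling, LSTM, and embedding) simultaneously. The tiny stand-in model, synthetic data, and optimizer choice are illustrative assumptions, not the paper's configuration:

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

# Tiny stand-in model; the real LaMP/CPoLS networks would plug in here.
model = tf.keras.Sequential([
    layers.Embedding(256, 8),
    layers.LSTM(8),
    layers.Dense(1, activation="sigmoid"),
])
# Cross-entropy loss over the sigmoid output p and true label y.
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
# Synthetic byte sequences and labels, only to make the sketch runnable.
x = np.random.randint(0, 256, size=(64, 50))
y = np.random.randint(0, 2, size=(64, 1))
model.fit(x, y, batch_size=16, epochs=1, verbose=0,
          callbacks=[tf.keras.callbacks.EarlyStopping(patience=2)])
```

The EarlyStopping callback mirrors the early-stopping behavior described in the experimental setup.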

7 Experimental Results

We next evaluate the performance of the proposed neural malware script classifier models on JavaScript and VBScript files using the data described in Section 4. We start by describing the experimental setup used to generate the results. Instead of training a single model to detect both JavaScript and VBScript, we train individual models for each script type, since a specialized model can better learn the nuances of each particular scripting language. Accordingly, we first evaluate the LaMP and CPoLS models trained on JavaScript files and then repeat the evaluation for models trained on VBScript files.

Experimental Setup: All the experiments were performed using Keras (Chollet et al. (2015)) with the TensorFlow (Abadi et al. (2015)) backend. The models were trained and evaluated on a cluster of NVIDIA K40 graphics processing unit (GPU) cards. All models were trained for a maximum of 15 epochs, with early stopping employed if the model fully converged before reaching the maximum number of epochs.

We performed hyperparameter tuning of the various input parameters for both types of script models, and the results are summarized in Table 1. To do so, we first set the other hyperparameters to fixed values and then vary the hyperparameter under consideration. For example, to evaluate different minibatch sizes for the JavaScript LaMP classifier, we first fix the LSTM's hidden layer size, the embedding dimension, the number of LSTM layers, and the number of hidden layers in the classifier. With these settings, we evaluate the classification error rate on the validation set for the JavaScript dataset. Table 1 indicates the final hyperparameter settings used for the remainder of the experiments.

JavaScript: We evaluate the performance of the LaMP model on the JavaScript dataset in Figure 12(a) for several different combinations of stacked LSTM layers and classifier hidden layers. Similarly, the CPoLS model is evaluated on the JavaScript files in Figure 12(b). For LaMP, adding either another stacked LSTM layer or another classifier hidden layer improves the detection results. On the other hand, the simplest CPoLS model, with one LSTM layer and one neural network hidden layer, performs best. For lower FPRs, LaMP offers significant performance advantages over CPoLS. This result indicates that sequential modeling of the individual characters in the JavaScript content better captures the underlying behavior than sequential modeling of the output of the convolutional processing of the subsequences in CPoLS.

At a false positive rate (FPR) of 1.0%, the best performing JavaScript LaMP model has a true positive rate (TPR) of 67.2%. Similarly, the best performing CPoLS model yields a TPR of 45.3% at an FPR of 1.0%.

(a) LaMP
(b) CPoLS
Figure 12: ROC curves for different JavaScript models.

VBScript: Next we evaluate the LaMP and CPoLS models for VBScript in Figures 13(a) and 13(b), respectively. Similar to the JavaScript CPoLS model results, the simplest LaMP and CPoLS VBScript models, with a single LSTM layer and a single classifier hidden layer, offer the best, or nearly the best, performance compared to the more complex models. At an FPR of 1.0%, the TPR of the LaMP model is 69.3%. Similarly, CPoLS yields a TPR of 67.1% at this FPR.

(a) LaMP
(b) CPoLS
Figure 13: ROC curves for different VBScript models.

8 Discussion

In this section, we consider several limitations of the proposed ScriptNet neural malware script classification system. These include limitations due to the size of the GPU memory and adversarial learning-based attacks.

One limitation is the maximum sequence length employed by the LaMP models. This value was primarily chosen because it allows the LaMP models to be trained within the 12 GB of memory on the NVIDIA K40. If the length were increased much beyond this value, we could not train all the models investigated in this study. More advanced GPUs with larger memories might allow better performance by extending the maximum sequence length.

Attacks based on adversarial learning are another important concern. Both architectures used in this study include recurrent LSTM and possibly deep neural network (DNN) components. While researchers have not directly attacked LSTM structures using adversarial learning-based attacks, Papernot et al. (2016) have shown that standard RNN cells (i.e., SimpleRNN) are vulnerable by unrolling the recurrent loop. Like DNNs, this unrolled structure can then be attacked using a number of methods for crafting adversarial samples (Hu and Tan (2017); Papernot et al. (2015)). One possible defense is to run the classifier in a secure enclave such as Intel's SGX (Ohrimenko et al. (2016)). Other defenses, including distillation and ensembles, have been explored for PE files (Grosse et al. (2017); Stokes et al. (2017)).

9 Related Work

JavaScript: Maiorca et al. (2015) propose a static analysis-based system to detect malicious PDF files which uses features constructed from both the content of the PDF, including JavaScript, as well as its structure. Once these features are extracted, the authors use a boosted decision tree trained with the AdaBoost algorithm to detect malicious PDFs. Cova et al. (2010) use the approach of anomaly detection for detecting malicious JavaScript code. They learn a model representing normal (benign) JavaScript code, and then use it to detect anomalous code. They also learn specific features that help characterize intrinsic events of a drive-by download. Hallaraker and Vigna (2005) present an auditing system for the JavaScript interpreter in Mozilla. They provide logging and monitoring of downloaded JavaScript, which can be integrated with intrusion detection systems for malicious behavior detection. Likarish et al. (2009) classify obfuscated malicious JavaScript using several different types of classifiers including Naive Bayes, an Alternating Decision Tree (ADTree), a Support Vector Machine (SVM) with the Radial Basis Function (RBF) kernel, and the rule-based Ripper algorithm. In their static analysis-based study, the SVM performed best using tokenized unigrams and bigrams chosen by feature selection. A PDF classifier proposed by Laskov and Šrndić (2011) uses a one-class SVM to detect malicious PDFs which contain JavaScript code. Laskov's system is based solely on static analysis; the features are derived from lexical analysis of JavaScript code extracted from the PDF files in their dataset. Corona et al. (2014) propose Lux0R, a system to select API references for the detection of malicious JavaScript in PDF documents. These references include JavaScript APIs as well as functions, methods, keywords, and constants. The authors propose a discriminant analysis feature selection method.
The features are then classified with an SVM, a Decision Tree, and a Random Forest model. Like ScriptNet, Lux0R performs both static and dynamic analysis. However, it does not use deep learning and requires the extraction of the JavaScript API references.  Wang et al. (2016) use deep learning models in combination with sparse random projections and logistic regression. They also extract features from JavaScript code using auto-encoders. While they use deep learning models, their feature extraction and model architectures limit how much information can be extracted from the JavaScript code.  Shah (2016) proposes using a statistical n-gram language model to detect malicious JavaScript. Our proposed system instead uses an LSTM neural model as the language model. Other papers which investigate the detection of malicious JavaScript include Liu et al. (2014); Schütt et al. (2012); Wang et al. (2013); Xu et al. (2012, 2013).
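Several of the static detectors above operate on tokenized character unigrams and bigrams. As a point of contrast with the byte-level neural models in this paper, the sketch below shows the general shape of such n-gram feature extraction; it is an illustrative simplification, not the exact feature pipeline of Likarish et al. (2009) or Shah (2016).

```python
from collections import Counter

def char_ngrams(script: str, n_values=(1, 2)) -> Counter:
    """Count character unigrams and bigrams over a script, the kind
    of features fed to classical classifiers (SVM, Naive Bayes, ...)
    in the static analysis work discussed above."""
    counts = Counter()
    for n in n_values:
        for i in range(len(script) - n + 1):
            counts[script[i:i + n]] += 1
    return counts

feats = char_ngrams("eval(x)")
```

Feature selection would then typically prune this (very sparse) count vector before training; the neural models studied here avoid this manual step by learning representations directly from byte sequences.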

VBScript: While more research has been devoted to detecting malicious JavaScript, partly because of its inclusion in malicious PDFs, only a few previous studies have considered malicious VBScript. In Kim et al. (2006), a conceptual graph is first computed for VBScript files, and new malware is detected by identifying graphs which are similar to those of known malicious VBScript files. The method is based on static analysis of the VBScripts.  Wael et al. (2017) propose a number of different classifiers to detect malicious VBScript including Logistic Regression, a Support Vector Machine with an RBF kernel, a Random Forest, a Multilayer Perceptron, and a Decision Table. The features are created based on static analysis. The best performing classifier in their study is the SVM. Zhao and Chen (2010) detect malicious applets, JavaScript, and VBScript based on a method which models immunoglobulin secretion.

Other File Types: A number of deep learning models have been proposed for detecting malicious PE files including Athiwaratkun and Stokes (2017); Dahl et al. (2013); Huang and Stokes (2016); Kolosnjaji et al. (2016); Pascanu et al. (2015). In particular, a character-level CNN has been proposed for detecting malicious PE files (Athiwaratkun and Stokes (2017)) and PowerShell script files (Hendler et al. (2018)).  Raff et al. (2017) discuss a model similar to CPoLS but note that it did not work for PE files; they do not provide any results for that model.

10 Conclusions

Malicious script classification is an important problem facing anti-virus companies. Failure to detect a malicious script may result in a successful spearphishing, ransomware, or drive-by download attack. Neural language models have shown promising results in the detection of malicious executable files. Similarly, we show that these types of models can also detect malicious JavaScript and VBScript files with relatively high true positive rates at low false positive rates. These results are even more remarkable because the best performing models only utilize the first 200 characters in the script, making them fast enough for large-scale production.

The performance results confirm that the LaMP and CPoLS architectures using LSTM and CNN neural models are able to learn and generate representations of byte sequences in the scripts. In particular, the LaMP JavaScript malware classification model using two LSTM layers and one dense neural network layer offers the best results, while for VBScript malware, the LaMP model with one LSTM and one hidden layer is significantly better than the competing models. The embeddings generated by these models therefore capture important sequential information from within the script file and help predict its malicious nature through neural training over these embeddings.
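Since the best-performing models consume only the first 200 characters of each script, the input preprocessing reduces to truncation, integer encoding, and padding before the embedding layer. The sketch below illustrates this step; the padding index and the off-by-one vocabulary shift are assumptions made for the example, not details taken from the paper.

```python
MAX_LEN = 200  # prefix length used by the best-performing models
PAD = 0        # assumed padding index; 1..256 hold the byte values

def script_to_sequence(script: bytes, max_len: int = MAX_LEN):
    """Truncate a script to its first max_len bytes, shift each byte
    by one so that index 0 can serve as padding, and right-pad
    shorter scripts to a fixed length."""
    seq = [b + 1 for b in script[:max_len]]
    return seq + [PAD] * (max_len - len(seq))

seq = script_to_sequence(b"WScript.Echo 1")
```

Fixed-length integer sequences like `seq` would then feed an embedding layer followed by the LSTM and max-pooling (LaMP) or convolutional (CPoLS) stages described earlier.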


References

  • Abadi et al. (2015) Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. URL http://tensorflow.org/. Software available from tensorflow.org.
  • Athiwaratkun and Stokes (2017) B. Athiwaratkun and J. W. Stokes. Malware classification with LSTM and GRU language models and a character-level CNN. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 2482–2486, March 2017. doi: 10.1109/ICASSP.2017.7952603.
  • Cho et al. (2014) Kyunghyun Cho, Bart van Merrienboer, Çaglar Gülçehre, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. CoRR, abs/1406.1078, 2014. URL http://arxiv.org/abs/1406.1078.
  • Chollet et al. (2015) François Chollet et al. Keras. https://github.com/fchollet/keras, 2015.
  • Corona et al. (2014) Igino Corona, Davide Maiorca, Davide Ariu, and Giorgio Giacinto. Lux0r: Detection of malicious pdf-embedded javascript code through discriminant analysis of api references. In Proceedings of the 2014 Workshop on Artificial Intelligent and Security Workshop, AISec ’14, pages 47–57, New York, NY, USA, 2014. ACM.
  • Corporation (2016) Microsoft Corporation. Don’t let this Black Friday/Cyber Monday spam deliver Locky ransomware to you, 2016. URL https://cloudblogs.microsoft.com/microsoftsecure/2016/11/23/dont-let-this-black-friday-cyber-monday-spam-deliver-locky-ransomware-to-you/.
  • Cova et al. (2010) Marco Cova, Christopher Kruegel, and Giovanni Vigna. Detection and analysis of drive-by-download attacks and malicious javascript code. In Proceedings of the 19th International Conference on World Wide Web, WWW ’10, pages 281–290, New York, NY, USA, 2010. ACM.
  • (8) CRN. Pentagon data breach shows growing sophistication of phishing attacks. URL https://www.crn.com/news/security/300077701/pentagon-data-breach-shows-growing-sophistication-of-phishing-attacks.htm.
  • Dahl et al. (2013) George E. Dahl, Jack W. Stokes, Li Deng, and Dong Yu. Large-scale malware classification using random projections and neural networks. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2013.
  • Gandotra et al. (2014) E. Gandotra, D. Bansal, and S. Sofat. Malware analysis and classification: A survey. pages 55–64, 2014.
  • Gehring et al. (2016) Jonas Gehring, Michael Auli, David Grangier, and Yann N. Dauphin. A convolutional encoder model for neural machine translation. CoRR, abs/1611.02344, 2016. URL http://arxiv.org/abs/1611.02344.
  • Gehring et al. (2017) Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin. Convolutional sequence to sequence learning. CoRR, abs/1705.03122, 2017. URL http://arxiv.org/abs/1705.03122.
  • Gers et al. (2000) Felix A. Gers, Jürgen Schmidhuber, and Fred A. Cummins. Learning to forget: Continual prediction with LSTM. Neural Computation, 12(10):2451–2471, 2000.
  • Graves et al. (2013a) A. Graves, A. r. Mohamed, and G. Hinton. Speech recognition with deep recurrent neural networks. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 6645–6649, May 2013a. doi: 10.1109/ICASSP.2013.6638947.
  • Graves et al. (2013b) Alex Graves, Navdeep Jaitly, and Abdel-rahman Mohamed. Hybrid speech recognition with Deep Bidirectional LSTM. In 2013 IEEE Workshop on Automatic Speech Recognition and Understanding, pages 273–278. IEEE, dec 2013b. ISBN 978-1-4799-2756-2. doi: 10.1109/ASRU.2013.6707742. URL http://ieeexplore.ieee.org/document/6707742/.
  • Grosse et al. (2017) Kathrin Grosse, Nicolas Papernot, Praveen Manoharan, Michael Backes, and Patrick McDaniel. Adversarial perturbations against deep neural networks for malware classification. In Proceedings of the European Symposium on Research in Computer Security (ESORICS), 2017.
  • Hallaraker and Vigna (2005) O. Hallaraker and G. Vigna. Detecting malicious javascript code in mozilla. In 10th IEEE International Conference on Engineering of Complex Computer Systems (ICECCS’05), pages 85–94, June 2005. doi: 10.1109/ICECCS.2005.35.
  • Hendler et al. (2018) D. Hendler, S. Kels, and A. Rubin. Detecting Malicious PowerShell Commands using Deep Neural Networks. ArXiv e-prints, April 2018.
  • Hochreiter and Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997. ISSN 0899-7667. doi: 10.1162/neco.1997.9.8.1735.
  • Hu and Tan (2017) Weiwei Hu and Ying Tan. Generating adversarial malware examples for black-box attacks based on gan. arXiv preprint 1702.05983, 2017.
  • Huang and Stokes (2016) Wenyi Huang and Jack W. Stokes. MtNet: A multi-task neural network for dynamic malware classification. In Proceedings of Detection of Intrusions and Malware, and Vulnerability Assessment (DIMVA), pages 399–418, 2016.
  • Józefowicz et al. (2016) Rafal Józefowicz, Oriol Vinyals, Mike Schuster, Noam Shazeer, and Yonghui Wu. Exploring the limits of language modeling. CoRR, abs/1602.02410, 2016. URL http://arxiv.org/abs/1602.02410.
  • Kim et al. (2006) Sungsuk Kim, Chang Choi, Junho Choi, Pankoo Kim, and Hanil Kim. A method for efficient malicious code detection based on conceptual similarity. In International Conference on Computational Science and Its Applications (ICCSA), volume 3983, pages 567–576, 2006.
  • Kolosnjaji et al. (2016) Bojan Kolosnjaji, Apostolis Zarras, George Webster, and Claudia Eckert. Deep learning for classification of malware system call sequences. In Australasian Joint Conference on Artificial Intelligence, pages 137–149. Springer International Publishing, 2016.
  • Krizhevsky et al. (2012) Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
  • Laskov and Šrndić (2011) Pavel Laskov and Nedim Šrndić. Static detection of malicious javascript-bearing pdf documents. In Proceedings of the 27th Annual Computer Security Applications Conference, ACSAC ’11, pages 373–382, New York, NY, USA, 2011. ACM. ISBN 978-1-4503-0672-0.
  • LeCun and Bengio (1995) Yann LeCun and Yoshua Bengio. Convolutional networks for images, speech, and time series. 1995.
  • Likarish et al. (2009) Peter Likarish, Eunjin Jung, and Insoon Jo. Obfuscated malicious javascript detection using classification techniques. In 2009 4th International Conference on Malicious and Unwanted Software (MALWARE), pages 47–54. IEEE, oct 2009. ISBN 978-1-4244-5786-1. doi: 10.1109/MALWARE.2009.5403020. URL http://ieeexplore.ieee.org/document/5403020/.
  • Liu et al. (2014) D. Liu, H. Wang, and A. Stavrou. Detecting malicious javascript in pdf through document instrumentation. In 2014 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, pages 100–111, June 2014. doi: 10.1109/DSN.2014.92.
  • Maiorca et al. (2015) Davide Maiorca, Davide Ariu, Igino Corona, and Giorgio Giacinto. A structural and content-based approach for a precise and robust detection of malicious pdf files. In Proceedings of the International Conference on Information Systems Security and Privacy (ICISSP), 2015.
  • (31) Microsoft. VBScript. URL https://msdn.microsoft.com/en-us/library/t0aew7h6.aspx.
  • (32) Mozilla. JavaScript. URL https://developer.mozilla.org/en-US/docs/Web/JavaScript.
  • Ohrimenko et al. (2016) Olga Ohrimenko, Felix Schuster, Cédric Fournet, Aastha Mehta, Sebastian Nowozin, Kapil Vaswani, and Manuel Costa. Oblivious multi-party machine learning on trusted processors. In USENIX Security Symposium, pages 619–636, 2016.
  • Papernot et al. (2015) Nicolas Papernot, Patrick McDaniel, Somesh Jha, Matt Fredrikson, Z Berkay Celik, and Ananthram Swami. The limitations of deep learning in adversarial settings. Proceedings of the 1st IEEE European Symposium on Security and Privacy, 2015.
  • Papernot et al. (2016) Nicolas Papernot, Patrick McDaniel, Ananthram Swami, and Richard Harang. Crafting adversarial input sequences for recurrent neural networks. In Proceedings of the Military Communications Conference (MILCOM), 2016.
  • Pascanu et al. (2015) R. Pascanu, J. W. Stokes, H. Sanossian, M. Marinescu, and A. Thomas. Malware classification with recurrent networks. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1916–1920, April 2015. doi: 10.1109/ICASSP.2015.7178304.
  • Raff et al. (2017) Edward Raff, Jon Barker, Jared Sylvester, Robert Brandon, Bryan Catanzaro, and Charles K. Nicholas. Malware detection by eating a whole exe. CoRR, abs/1710.09435, 2017.
  • Russakovsky et al. (2015) Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
  • Schütt et al. (2012) Kristof Schütt, Marius Kloft, Alexander Bikadorov, and Konrad Rieck. Early detection of malicious behavior in javascript code. In Proceedings of the 5th ACM Workshop on Security and Artificial Intelligence, AISec ’12, pages 15–24, New York, NY, USA, 2012. ACM. ISBN 978-1-4503-1664-4.
  • Shah (2016) Anumeha Shah. Malicious JavaScript Detection using Statistical Language Model. Master's Projects, page 70, 2016. URL http://scholarworks.sjsu.edu/etd_projects/476.
  • (41) Elizabeth Snell. Verizon finds phishing attacks, malware top data breach causes. URL https://healthitsecurity.com/news/verizon-finds-phishing-attacks-malware-top-data-breach-causes.
  • Stokes et al. (2017) Jack W. Stokes, De Wang, Mady Marinescu, Marc Marino, and Brian Bussone. Attack and defense of dynamic analysis-based, adversarial neural malware classification models. CoRR, abs/1712.05919, 2017.
  • Sutskever et al. (2014) Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 3104–3112. Curran Associates, Inc., 2014. URL http://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf.
  • Wael et al. (2017) D. Wael, A. Shosha, and S. G. Sayed. Malicious vbscript detection algorithm based on data-mining techniques. In 2017 Intl Conf on Advanced Control Circuits Systems (ACCS) Systems 2017 Intl Conf on New Paradigms in Electronics Information Technology (PEIT), pages 112–116, Nov 2017. doi: 10.1109/ACCS-PEIT.2017.8303028.
  • Wang et al. (2013) Wei-Hong Wang, Yin-Jun Lv, Hui-Bing Chen, and Zhao-Lin Fang. A static malicious javascript detection using svm. In Proceedings of the 2nd International Conference on Computer Science and Electronics Engineering, 2013.
  • Wang et al. (2016) Yao Wang, Wan dong Cai, and Peng cheng Wei. A deep learning approach for detecting malicious javascript code. Security and Communication Networks, 11(9):1520–1534, 2016.
  • Werbos (1990) Paul J. Werbos. Backpropagation Through Time: What It Does and How to Do It. Proceedings of the IEEE, 78(10):1550–1560, 1990. ISSN 15582256. doi: 10.1109/5.58337.
  • Xu et al. (2012) W. Xu, F. Zhang, and S. Zhu. The power of obfuscation techniques in malicious javascript code: A measurement study. In 2012 7th International Conference on Malicious and Unwanted Software, pages 9–16, Oct 2012. doi: 10.1109/MALWARE.2012.6461002.
  • Xu et al. (2013) Wei Xu, Fangfang Zhang, and Sencun Zhu. Jstill: Mostly static detection of obfuscated malicious javascript code. In Proceedings of the Third ACM Conference on Data and Application Security and Privacy, CODASPY ’13, pages 117–128, New York, NY, USA, 2013. ACM. ISBN 978-1-4503-1890-7.
  • Zhao and Chen (2010) H. Zhao and W. Chen. A web page malicious script detection method inspired by the process of immunoglobulin secretion. In 2010 International Symposium on Intelligence Information Processing and Trusted Computing, pages 241–245, Oct 2010. doi: 10.1109/IPTC.2010.100.