The distribution of the attack sources for malicious VBScript tells a different story. The main threat vector for malicious VBScript is email, followed closely by downloads. Archives and removable drives play a smaller role in VBScript attacks, but they remain important threat vectors.
3 Threat Model
It is necessary to specify the assumptions that we make about the attacker. The most important assumption is that the model is able to learn a deep embedding which identifies activity related to malware from the first bytes (e.g., 200, 1000) of the script. If those first bytes are effectively random and do not somehow capture the malicious intent, the models will fail to detect the activity.
Another assumption is that the behavior which identifies an unknown malicious script is also found in labeled scripts in the training set. If the training set does not contain scripts which are somehow related to the unknown script being evaluated, the classifier may again fail to accurately predict the script type.
As part of the scanning process, the anti-malware engine emulates an unknown file and attempts to extract any child scripts. It may be possible that the anti-malware engine fails to successfully extract all the child scripts. In this case, the model may also fail to detect the malicious script if the parent script is predicted to be benign, and the child script which executes the malicious activity is not successfully extracted.
Scripts: Building a dataset of malicious and benign scripts for training is a challenge. A sizable percentage of malicious scripts are delivered in email and for privacy reasons cannot be collected. For this research, samples were selected randomly from the files observed on users’ computers during June 2017 that had been successfully collected, with permission, by the Windows Defender backend. These samples are collected by many sources including users directly submitting suspicious files for analysis, files shared through sample exchanges such as VirusTotal, and scripts that were extracted from installer packages or archives.
Labels: Another challenge in training a classifier for detecting malicious scripts is obtaining enough labeled data. Since we are trying to predict if a script is malware or benign, we must obtain both types of labels.
A script is labeled as malware if it has been inspected by our AV partner’s analysts and determined to be malicious. In addition, the script is labeled as malicious if it has been detected by the company’s detection signatures. Finally, scripts are labeled as malware if eight or more other anti-virus vendors detect the script as malware.
Obtaining enough benign scripts is a challenge because labeling a script as benign often requires manual inspection. Thus, a script is labeled as benign by a number of methods. First, the script is considered benign if it has been labeled as benign by an analyst or has been collected by a trusted source such as being downloaded from a legitimate webpage. However, this does not provide enough labeled benign scripts so we augment this benign dataset with scripts which are not detected by any trusted scanner at least 15 days after our AV partner has first encountered it in the wild.
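The labeling rules above can be sketched as a simple triage function. This is an illustrative assumption of how the rules compose, not the production pipeline; the parameter names, rule ordering, and return values are hypothetical.

```python
from datetime import timedelta

def label_script(analyst_verdict=None, signature_hit=False,
                 vendor_detections=0, first_seen_age=None,
                 trusted_source=False, any_trusted_scanner_hit=True):
    """Hypothetical sketch of the labeling heuristics described above.

    Returns 'malware', 'benign', or None (unlabeled)."""
    # Malware rules: analyst verdict, signature detection, or 8+ AV vendors.
    if analyst_verdict == 'malware' or signature_hit or vendor_detections >= 8:
        return 'malware'
    # Benign rules: analyst verdict or collection from a trusted source.
    if analyst_verdict == 'benign' or trusted_source:
        return 'benign'
    # Augmentation rule: clean for at least 15 days after first being seen.
    if (first_seen_age is not None and first_seen_age >= timedelta(days=15)
            and not any_trusted_scanner_hit):
        return 'benign'
    return None  # insufficient evidence; excluded from the training set
```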
These scripts are next normalized. All whitespace characters, except line breaks, are first removed. Next the text is standardized to lowercase and converted to the US-ASCII character set. Any characters which are not included in the US-ASCII character set, such as non-English language characters, are replaced by the constant character ‘?’. Figure 9 illustrates an example script before and after normalization.
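A minimal sketch of this normalization step in Python; the exact whitespace and encoding handling in the production pipeline may differ.

```python
import re

def normalize_script(text: str) -> str:
    """Normalize a raw script as described above: strip whitespace
    (except line breaks), lowercase, and map non-US-ASCII chars to '?'."""
    # Remove all whitespace characters except line breaks.
    text = re.sub(r'[^\S\n]', '', text)
    # Standardize the text to lowercase.
    text = text.lower()
    # Replace any character outside the US-ASCII set with '?'.
    return ''.join(ch if ord(ch) < 128 else '?' for ch in text)
```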
Before training the model, each normalized script is written to the file system. To avoid storing malicious content on the hard drive, the characters are next encoded by their numeric ASCII encoding (e.g., ’97’ for the character ’a’) delimited by commas. This delimited, encoded sequence data is then used to train the neural script malware model.
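The comma-delimited numeric encoding can be sketched as a pair of inverse helpers (the function names are illustrative):

```python
def encode_script(normalized: str) -> str:
    """Encode each character as its numeric ASCII code, comma-delimited,
    so no malicious text is stored verbatim on disk."""
    return ','.join(str(ord(ch)) for ch in normalized)

def decode_script(encoded: str) -> str:
    """Inverse transform, recovering the normalized script."""
    return ''.join(chr(int(code)) for code in encoded.split(','))
```

For example, `encode_script("a?")` yields the string `"97,63"`.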
To evaluate an unknown file, the system uses the trained model to produce a prediction which indicates the probability that the unknown script is malicious.
Translation to Sequences: The raw scripts can be considered to be documents containing a limited vocabulary set. As such, the scripts are long ordered sequences of encoded characters. For normalized script files, we define our vocabulary as the set of all possible bytes (8-bits). This leads to a vocabulary of size 256. Each normalized script, therefore, is a sequence of these bytes.
Sequential Learning: In language models over document-like datasets, sequential learning is a commonly used learning methodology (Józefowicz et al. (2016); Sutskever et al. (2014)). Neural network-based models for sequential learning use Recurrent Neural Networks (RNNs), and their variants, to capture the ordered nature of elements, while learning generally over each individual item. In our models, we use a specific memory-based gated variant of RNNs, known as the Long Short-Term Memory (LSTM) model (Gers et al. (2000); Hochreiter and Schmidhuber (1997)). LSTMs are used extensively for processing long sequences of data. In speech and language models in particular, enhanced LSTMs define the state-of-the-art (Cho et al. (2014); Graves et al. (2013a, b); Sutskever et al. (2014)). However, their general neural nature, along with the ability to learn using backpropagation through time (Werbos (1990)), makes them useful in many domains. For our byte sequences, we therefore use LSTMs as the primary element for capturing the sequential attributes of the data. LSTMs can often be implemented with minor variations in their structure. The implementation used in our models, at each timestep $t$, is described by the following equations:

$$\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i), \\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f), \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o), \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c), \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t, \\
h_t &= o_t \odot \tanh(c_t),
\end{aligned}$$

where the nonlinearity defined by $\sigma$ corresponds to the logistic sigmoid function. The variables $i_t$, $f_t$, $o_t$, and $\tilde{c}_t$ are the input gate, forget gate, output gate and cell activation, respectively. $U_i$, $U_f$, $U_o$, $U_c$ are the weight matrices for each gate corresponding to the recurrent input from the previous timestep, $W_i$, $W_f$, $W_o$, $W_c$ are the input weight matrices per gate, and $b_i$, $b_f$, $b_o$, $b_c$ are the biases for each gate. The operator $\odot$ represents the pairwise product between two vectors.
The network takes input vector $x_t$ at each timestep $t$, and updates two properties of the LSTM. It updates the cell memory $c_t$ using the gates as well as the cell memory $c_{t-1}$ from the previous timestep. It then updates the hidden activation $h_t$ for timestep $t$ by using the gates and cell memory. The input vector $x_t$ provided to the LSTM cell can be of any structure depending on the data. In a categorical representation, it can be a one-hot encoded vector, while in the case of embeddings, it can be in the form of a dense vector. For sparse featured data, the input can simply be a sparse vector.
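A single timestep of these gate and state updates can be sketched in NumPy. The dict-based weight layout (keys `'i'`, `'f'`, `'o'`, `'g'`) is an assumption made for readability, not part of the model definition:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM timestep matching the gate equations above.

    W, U, b are dicts keyed by gate name: 'i' (input), 'f' (forget),
    'o' (output), 'g' (cell candidate)."""
    i_t = sigmoid(W['i'] @ x_t + U['i'] @ h_prev + b['i'])   # input gate
    f_t = sigmoid(W['f'] @ x_t + U['f'] @ h_prev + b['f'])   # forget gate
    o_t = sigmoid(W['o'] @ x_t + U['o'] @ h_prev + b['o'])   # output gate
    g_t = np.tanh(W['g'] @ x_t + U['g'] @ h_prev + b['g'])   # cell candidate
    c_t = f_t * c_prev + i_t * g_t   # update cell memory
    h_t = o_t * np.tanh(c_t)         # update hidden activation
    return h_t, c_t
```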
Model Architectures: In our experiments for sequential learning, we designed two neural model architectures. The primary difference between these two architectures is their resilience to very long sequences. We discuss these properties in detail below.
LSTM and Max Pooling: In the LSTM and Max Pooling (LaMP) architecture, illustrated in Figure 10, we first use an embedding layer, Embedding, to process the input byte sequence $b = [b_1, \ldots, b_T]$. Since each element in $b$ corresponds to a byte from the vocabulary, it is symbolic in nature. We use the embedding layer to transform each byte into a dense vector (i.e., an embedding) which captures relatedness among different bytes, thereby assisting the overall model in learning. The sequence of embeddings is then passed through multiple LSTM layers stacked on top of each other. The LSTM generates representations for each element in the input sequence as $[h_1, \ldots, h_T]$. In order for us to perform classification on the sequence and identify its hidden malicious content, we transform the sequence into a vector highlighting significant information, while reducing its dimensionality. For this purpose, we use a temporal max pooling layer, MaxPool1d, as proposed by Pascanu et al. (2015). Given an input vector sequence of length $T$, where each vector is a $k$-dimensional vector, MaxPool1d computes an output vector $m$ whose components are $m_j = \max_{t \in \{1, \ldots, T\}} h_{t,j}$ for $j = 1, \ldots, k$.
We pass the sequence $[h_1, \ldots, h_T]$ through MaxPool1d to obtain vector $m$. Next, $m$ is passed through one or more dense neural layers employing a rectified linear (Relu) nonlinear activation function. This helps learn an additional layer of weights before performing the final prediction. The Relu-activated vector is finally used by a sigmoid layer to generate the final probability $p$ indicating if the script is malicious or benign. We can formally define LaMP on an input byte sequence $b$ as:

$$\begin{aligned}
e &= \mathrm{Embedding}(b), \\
[h_1, \ldots, h_T] &= \mathrm{LSTM}(e), \\
m &= \mathrm{MaxPool1d}([h_1, \ldots, h_T]), \\
r &= \mathrm{Relu}(W_r m), \\
p &= \mathrm{sigmoid}(W_s r),
\end{aligned}$$

where $W_r$ is the weight matrix for the dense Relu hidden layer, and $W_s$ is the weight matrix for the final sigmoid classification layer.
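LaMP can be sketched in Keras, the framework used for the experiments. The embedding and LSTM sizes below follow the VBScript hyperparameter table; the dense-layer size and the number of LSTM layers are assumptions:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_lamp(vocab_size=256, embed_dim=128, lstm_units=1500,
               num_lstm_layers=1, dense_units=64, max_len=None):
    """Sketch of LaMP: Embedding -> stacked LSTM -> temporal max
    pooling -> Relu dense layer -> sigmoid output."""
    inputs = layers.Input(shape=(max_len,), dtype='int32')
    x = layers.Embedding(vocab_size, embed_dim)(inputs)
    for _ in range(num_lstm_layers):
        x = layers.LSTM(lstm_units, return_sequences=True)(x)
    x = layers.GlobalMaxPooling1D()(x)          # temporal MaxPool1d
    x = layers.Dense(dense_units, activation='relu')(x)
    outputs = layers.Dense(1, activation='sigmoid')(x)
    return models.Model(inputs, outputs)
```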
While LaMP provides a simple model to capture sequences directly, it is limited by the length of the input sequences. As the length of the input sequence increases, the model becomes both more difficult to train and more memory-intensive. In the case of detecting malicious content, long sequences can often separate two or more bytes far from each other even when their combined presence is the cause of the malicious intent. When learning directly on a sequence, the model can lose the context of a byte identified earlier in the sequence while processing a new byte at a large distance from it. To cope with such problems in detection, we therefore propose another architecture called Convoluted Partitioning of Long Sequences (CPoLS).
Convoluted Partitioning of Long Sequences: Convoluted Partitioning of Long Sequences (CPoLS) is a neural model architecture designed specifically to extract classification information hidden deep within long sequences. In this model illustrated in Figure 11, we process the input sequence in parts by splitting it first into smaller pieces of fixed length. By performing this step, we generate a sequence of multiple partitions, each of which is a sequence in itself of a smaller length.
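The partitioning step can be sketched as follows; padding the final partition to the fixed length is an assumed implementation detail:

```python
def partition_sequence(seq, part_len, pad_value=0):
    """Split a long byte sequence into fixed-length partitions,
    padding the final partition so all pieces share the same length."""
    parts = [seq[i:i + part_len] for i in range(0, len(seq), part_len)]
    if parts and len(parts[-1]) < part_len:
        parts[-1] = parts[-1] + [pad_value] * (part_len - len(parts[-1]))
    return parts
```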
We use Convolutional Neural Networks (CNNs) (LeCun and Bengio (1995)) in this model, along with the other LaMP modules. CNNs are widely used in computer vision (Krizhevsky et al. (2012); Russakovsky et al. (2015)), and they have also recently shown success in sequential learning domains (Gehring et al. (2016, 2017)).
Given an input byte sequence $b$, the model first splits it into a partitioned list $P = [P_1, \ldots, P_N]$ containing several small subsequences, where $i$ is the index of each partition $P_i$ in $P$. To translate the bytes in these sequences from symbols to dense vectors, we pass them through an embedding layer, Embedding, and obtain the sequence $E = [e_1, \ldots, e_N]$, where each element $e_i$ corresponds to the sequence of embeddings for partition $P_i$ in $P$. Each of these partitions $e_i$ is now separately processed, while still maintaining their overall sequential nature. We call this method RecurrentConvolutions. In this method, we pass each partition $e_i$ through a one-dimensional CNN, Conv1D, which applies multiple filters on the input sequence and generates a tensor $v_i$ representing the convolved output of the vector sequence $e_i$. The combined list of these convolved partitions is referred to as $V = [v_1, \ldots, v_N]$. In RecurrentConvolutions, we then reduce the dimensionality of each $v_i$ by performing temporal max pooling with MaxPool1d, which takes a tensor input and extracts a vector $u_i$ from it. Applying RecurrentConvolutions to each partition in this way, we obtain the updated vectors $u_1, \ldots, u_N$. These vectors are finally combined in the same order to create an updated sequence $U = [u_1, \ldots, u_N]$ of learned partition representations. With the help of partitioning, the length of $U$ is also limited to a trainable length.
At this stage, the model uses the sequence $U$ as an input to the LaMP model and learns the probability that the script is malicious. Therefore, we use a combination of an LSTM, a second MaxPool1d layer, dense Relu activations, and a final sigmoid layer for generating the prediction on the new input sequence $U$. Formally, we define the CPoLS model as:

$$\begin{aligned}
[P_1, \ldots, P_N] &= \mathrm{Partition}(b), \\
e_i &= \mathrm{Embedding}(P_i), \\
u_i &= \mathrm{MaxPool1d}(\mathrm{Conv1D}(e_i)), \\
[h_1, \ldots, h_N] &= \mathrm{LSTM}([u_1, \ldots, u_N]), \\
m &= \mathrm{MaxPool1d}([h_1, \ldots, h_N]), \\
p &= \mathrm{sigmoid}(W_s\,\mathrm{Relu}(W_r m)).
\end{aligned}$$
Such a model is resilient to extremely long sequence lengths and can also find malicious objects hidden very late in the sequence.
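CPoLS can likewise be sketched in Keras. Applying Conv1D and max pooling per partition via TimeDistributed is one plausible way to realize RecurrentConvolutions; the Relu activation on the convolution and the dense-layer size are assumptions, while the embedding size, filter count, window size, and stride follow the hyperparameter table:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_cpols(num_parts, part_len, vocab_size=256, embed_dim=128,
                num_filters=128, window=10, stride=5, lstm_units=1500,
                dense_units=64):
    """Sketch of CPoLS: per-partition Conv1D + max pooling, followed
    by the LaMP stages on the sequence of partition representations."""
    inputs = layers.Input(shape=(num_parts, part_len), dtype='int32')
    # Embed each byte: (batch, parts, part_len) -> (..., embed_dim).
    x = layers.Embedding(vocab_size, embed_dim)(inputs)
    # RecurrentConvolutions: convolve each partition, then reduce it
    # to a single vector with temporal max pooling.
    x = layers.TimeDistributed(
        layers.Conv1D(num_filters, window, strides=stride,
                      activation='relu'))(x)
    x = layers.TimeDistributed(layers.GlobalMaxPooling1D())(x)
    # LaMP stages on the sequence of partition representations.
    x = layers.LSTM(lstm_units, return_sequences=True)(x)
    x = layers.GlobalMaxPooling1D()(x)
    x = layers.Dense(dense_units, activation='relu')(x)
    outputs = layers.Dense(1, activation='sigmoid')(x)
    return models.Model(inputs, outputs)
```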
| Script Type | Model | Hyperparameter          | Value |
|-------------|-------|-------------------------|-------|
| VBScript    | LaMP  | LSTM Hidden Layer Size  | 1500  |
| VBScript    | LaMP  | Embedding Layer Size    | 128   |
| VBScript    | CPoLS | LSTM Hidden Layer Size  | 1500  |
| VBScript    | CPoLS | Embedding Layer Size    | 128   |
| VBScript    | CPoLS | CNN Window Size         | 10    |
| VBScript    | CPoLS | CNN Window Stride       | 5     |
| VBScript    | CPoLS | Number of CNN Filters   | 128   |
End-to-End Learning: To train the models described above, we perform an end-to-end learning process. Since the data available to us is in the form of a sequence and an associated binary label, we need to train the entire model, solely from this label. In end-to-end learning, we pass each sequence through all layers of our model to derive the probability . Using this probability, with the true label , we measure the cross-entropy loss . This loss is used to compute the gradients required for updating the weights in each layer of the model. Therefore, we simultaneously learn all the parameters for the primary classification objective.
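The cross-entropy loss used in this end-to-end process can be written out directly; a minimal NumPy sketch for a predicted probability $p$ and true label $y \in \{0, 1\}$:

```python
import numpy as np

def binary_cross_entropy(p, y, eps=1e-7):
    """Cross-entropy loss L(p, y) between the predicted probability p
    and the true binary label y, as minimized during training."""
    p = np.clip(p, eps, 1 - eps)   # avoid log(0)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))
```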
7 Experimental Results
Experimental Setup: All the experiments were performed using Keras (Chollet et al. (2015)) with the TensorFlow (Abadi et al. (2015)) backend. The models were trained and evaluated on a cluster of NVIDIA K40 graphical processing unit (GPU) cards. All models were trained with a maximum of 15 epochs, but early stopping was employed if the model fully converged before reaching the maximum number of epochs.
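Early stopping of this kind can be configured in Keras with the EarlyStopping callback. The monitored metric and patience value below are assumptions; the paper specifies only the 15-epoch cap:

```python
from tensorflow.keras.callbacks import EarlyStopping

# Stop training before the 15-epoch cap once the validation metric
# stops improving; patience and monitor are assumed settings.
early_stop = EarlyStopping(monitor='val_loss', patience=2,
                           restore_best_weights=True)

# Typical usage (model and data are placeholders):
# model.fit(x_train, y_train, validation_data=(x_val, y_val),
#           epochs=15, callbacks=[early_stop])
```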
8 Limitations

In this section, we consider several limitations of the proposed ScriptNet neural malware script classification system. These include limitations due to the size of the GPU memory and adversarial learning-based attacks.
One limitation is the maximum sequence length employed by the LaMP models. This parameter value was primarily chosen because it allows the LaMP models to be trained in the 12 GB of memory on the NVIDIA K40. If the length were increased much beyond this value, we could not train all the models investigated in this study. Future GPUs with more memory may allow better performance by extending the maximum sequence length.
Attacks based on adversarial learning are another important concern. Both architectures used in this study include recurrent LSTM and possibly deep neural network (DNN) components. While researchers have not directly attacked LSTM structures using adversarial learning-based attacks, Papernot et al. (2016) have shown that standard RNN cells (i.e., SimpleRNN) are vulnerable once the recurrent loop is unrolled. Like DNNs, this unrolled structure can then be attacked using a number of methods for crafting adversarial samples (Hu and Tan (2017); Papernot et al. (2015)). One possible defense is to run the classifier in a secure enclave such as Intel's SGX (Ohrimenko et al. (2016)). Other defenses, including distillation and ensembles, have been explored for PE files (Grosse et al. (2017); Stokes et al. (2017)).
9 Related Work
Other File Types: A number of deep learning models have been proposed for detecting malicious PE files, including Athiwaratkun and Stokes (2017); Dahl et al. (2013); Huang and Stokes (2016); Kolosnjaji et al. (2016); Pascanu et al. (2015). In particular, a character-level CNN has been proposed for detecting malicious PE files (Athiwaratkun and Stokes (2017)) and PowerShell script files (Hendler et al. (2018)). Raff et al. (2017) discuss a model which is similar to CPoLS but note that it did not work for PE files; they did not provide any results for that model.
- Abadi et al. (2015) Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. URL http://tensorflow.org/. Software available from tensorflow.org.
- Athiwaratkun and Stokes (2017) B. Athiwaratkun and J. W. Stokes. Malware classification with LSTM and GRU language models and a character-level CNN. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 2482–2486, March 2017. doi: 10.1109/ICASSP.2017.7952603.
- Cho et al. (2014) Kyunghyun Cho, Bart van Merrienboer, Çaglar Gülçehre, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. CoRR, abs/1406.1078, 2014. URL http://arxiv.org/abs/1406.1078.
- Chollet et al. (2015) François Chollet et al. Keras. https://github.com/fchollet/keras, 2015.
- Corporation (2016) Microsoft Corporation. Don't let this Black Friday/Cyber Monday spam deliver Locky ransomware to you, 2016. URL https://cloudblogs.microsoft.com/microsoftsecure/2016/11/23/dont-let-this-black-friday-cyber-monday-spam-deliver-locky-ransomware-to-you/.
- (8) CRN. Pentagon data breach shows growing sophistication of phishing attacks. URL https://www.crn.com/news/security/300077701/pentagon-data-breach-shows-growing-sophistication-of-phishing-attacks.htm.
- Dahl et al. (2013) George E. Dahl, Jack W. Stokes, Li Deng, and Dong Yu. Large-scale malware classification using random projections and neural networks. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2013.
- Gandotra et al. (2014) E. Gandotra, D. Bansal, and S. Sofat. Malware analysis and classification: A survey. pages 55–64, 2014.
- Gehring et al. (2016) Jonas Gehring, Michael Auli, David Grangier, and Yann N. Dauphin. A convolutional encoder model for neural machine translation. CoRR, abs/1611.02344, 2016. URL http://arxiv.org/abs/1611.02344.
- Gehring et al. (2017) Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin. Convolutional sequence to sequence learning. CoRR, abs/1705.03122, 2017. URL http://arxiv.org/abs/1705.03122.
- Gers et al. (2000) Felix A. Gers, Jürgen Schmidhuber, and Fred A. Cummins. Learning to forget: Continual prediction with LSTM. Neural Computation, 12(10):2451–2471, 2000.
- Graves et al. (2013a) A. Graves, A. r. Mohamed, and G. Hinton. Speech recognition with deep recurrent neural networks. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 6645–6649, May 2013a. doi: 10.1109/ICASSP.2013.6638947.
- Graves et al. (2013b) Alex Graves, Navdeep Jaitly, and Abdel-rahman Mohamed. Hybrid speech recognition with Deep Bidirectional LSTM. In 2013 IEEE Workshop on Automatic Speech Recognition and Understanding, pages 273–278. IEEE, dec 2013b. ISBN 978-1-4799-2756-2. doi: 10.1109/ASRU.2013.6707742. URL http://ieeexplore.ieee.org/document/6707742/.
- Grosse et al. (2017) Kathrin Grosse, Nicolas Papernot, Praveen Manoharan, Michael Backes, and Patrick McDaniel. Adversarial perturbations against deep neural networks for malware classification. In Proceedings of the European Symposium on Research in Computer Security (ESORICS), 2017.
- Hendler et al. (2018) D. Hendler, S. Kels, and A. Rubin. Detecting Malicious PowerShell Commands using Deep Neural Networks. ArXiv e-prints, April 2018.
- Hochreiter and Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997. ISSN 0899-7667. doi: 10.1162/neco.1997.9.8.1735.
- Hu and Tan (2017) Weiwei Hu and Ying Tan. Generating adversarial malware examples for black-box attacks based on gan. arXiv preprint 1702.05983, 2017.
- Huang and Stokes (2016) Wenyi Huang and Jack W. Stokes. Mtnet: A multi-task neural network for dynamic malware classfication. In Proceedings of Detection of Intrusions and Malware, and Vulnerability Assessment (DIMVA), pages 399–418, 2016.
- Józefowicz et al. (2016) Rafal Józefowicz, Oriol Vinyals, Mike Schuster, Noam Shazeer, and Yonghui Wu. Exploring the limits of language modeling. CoRR, abs/1602.02410, 2016. URL http://arxiv.org/abs/1602.02410.
- Kim et al. (2006) Sungsuk Kim, Chang Choi, Junho Choi, Pankoo Kim, and Hanil Kim. A method for efficient malicious code detection based on conceptual similarity. In International Conference on Computational Science and Its Applications (ICCSA), volume 3983, pages 567–576, 2006.
- Kolosnjaji et al. (2016) Bojan Kolosnjaji, Apostolis Zarras, George Webster, and Claudia Eckert. Deep learning for classification of malware system call sequences. In Australasian Joint Conference on Artificial Intelligence, pages 137–149. Springer International Publishing, 2016.
- Krizhevsky et al. (2012) Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
- LeCun and Bengio (1995) Yann LeCun and Yoshua Bengio. Convolutional networks for images, speech, and time series. In The Handbook of Brain Theory and Neural Networks. MIT Press, 1995.
- Maiorca et al. (2015) Davide Maiorca, Davide Ariu, Igino Corona, and Giorgio Giacinto. A structural and content-based approach for a precise and robust detection of malicious pdf files. In Proceedings of the International Conference on Information Systems Security and Privacy (ICISSP), 2015.
- (31) Microsoft. VBScript. URL https://msdn.microsoft.com/en-us/library/t0aew7h6.aspx.
- Ohrimenko et al. (2016) Olga Ohrimenko, Felix Schuster, Cédric Fournet, Aastha Mehta, Sebastian Nowozin, Kapil Vaswani, and Manuel Costa. Oblivious multi-party machine learning on trusted processors. In USENIX Security Symposium, pages 619–636, 2016.
- Papernot et al. (2015) Nicolas Papernot, Patrick McDaniel, Somesh Jha, Matt Fredrikson, Z Berkay Celik, and Ananthram Swami. The limitations of deep learning in adversarial settings. Proceedings of the 1st IEEE European Symposium on Security and Privacy, 2015.
- Papernot et al. (2016) Nicolas Papernot, Patrick McDaniel, Ananthram Swami, and Richard Harang. Crafting adversarial input sequences for recurrent neural networks. In Proceedings of the Military Communications Conference (MILCOM), 2016.
- Pascanu et al. (2015) R. Pascanu, J. W. Stokes, H. Sanossian, M. Marinescu, and A. Thomas. Malware classification with recurrent networks. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1916–1920, April 2015. doi: 10.1109/ICASSP.2015.7178304.
- Raff et al. (2017) Edward Raff, Jon Barker, Jared Sylvester, Robert Brandon, Bryan Catanzaro, and Charles K. Nicholas. Malware detection by eating a whole exe. CoRR, abs/1710.09435, 2017.
- Russakovsky et al. (2015) Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
- (41) Elizabeth Snell. Verizon finds phishing attacks, malware top data breach causes. URL https://healthitsecurity.com/news/verizon-finds-phishing-attacks-malware-top-data-breach-causes.
- Stokes et al. (2017) Jack W. Stokes, De Wang, Mady Marinescu, Marc Marino, and Brian Bussone. Attack and defense of dynamic analysis-based, adversarial neural malware classification models. CoRR, abs/1712.05919, 2017.
- Sutskever et al. (2014) Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 3104–3112. Curran Associates, Inc., 2014. URL http://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf.
- Wael et al. (2017) D. Wael, A. Shosha, and S. G. Sayed. Malicious vbscript detection algorithm based on data-mining techniques. In 2017 Intl Conf on Advanced Control Circuits Systems (ACCS) Systems 2017 Intl Conf on New Paradigms in Electronics Information Technology (PEIT), pages 112–116, Nov 2017. doi: 10.1109/ACCS-PEIT.2017.8303028.
- Werbos (1990) Paul J. Werbos. Backpropagation Through Time: What It Does and How to Do It. Proceedings of the IEEE, 78(10):1550–1560, 1990. ISSN 15582256. doi: 10.1109/5.58337.
- Zhao and Chen (2010) H. Zhao and W. Chen. A web page malicious script detection method inspired by the process of immunoglobulin secretion. In 2010 International Symposium on Intelligence Information Processing and Trusted Computing, pages 241–245, Oct 2010. doi: 10.1109/IPTC.2010.100.