A Convolutional Transformation Network for Malware Classification
Modern malware employs a variety of evasion techniques to bypass state-of-the-art detection methods. An emerging trend for dealing with this issue is to combine image transformation with machine learning to classify and detect malware. However, existing works in this field apply only simple image transformation methods, which limits detection accuracy. In this paper, we introduce a novel approach that classifies malware by applying a deep network to images transformed from binary samples. In particular, we first develop a novel hybrid image transformation method that converts binaries into color images conveying the binary semantics. A deep convolutional neural network is trained on these images and later classifies test inputs as benign or malicious. In extensive experiments, our proposed method surpasses all baselines and achieves 99.14% accuracy on the testing set.
The number of new malware samples and variants on the Internet has been continuously increasing. According to a technical report from Kaspersky Lab, 323,000 new malware files are detected daily, and there are one billion unique malware files in their cloud database. This significant growth is due to the existence of many automatic malware creation toolkits such as Zeus and SpyEye. These kits commonly employ evasion techniques such as encryption, obfuscation, and packing to create new malware from existing samples and bypass detection by anti-malware scanners.
In recent years, machine learning and deep learning techniques have been adopted in the malware classification domain by researchers and anti-malware vendors to counter malware evasion techniques [41, 1] (the next section discusses related work in detail). Machine learning algorithms for malware classification are based on a set of features extracted from file samples by static and/or dynamic analysis techniques. These analyses require either disassembled code or code execution, and the accuracy of the resulting models therefore depends on the analysis tools and on the features selected from the analyses. In addition, the analysis and feature selection process sometimes needs security experts to revise and disambiguate intermediate results, and thus cannot be fully automated [13, 38].
An alternative approach that overcomes the limitations mentioned above is the adoption of image processing and classification techniques in the malware domain. The principal tenet of this approach is that it requires neither disassembled code nor code execution: binary files are converted and mapped to images, which makes the approach resilient to known anti-analysis techniques [38, 23].
Although the image-based approach is new to malware classification and can avoid anti-analysis techniques, existing works in this domain, e.g., [23, 24, 21, 18, 5, 30], use simple mapping algorithms to transform malware binaries into images. Thus, the semantics of the malware may be disregarded. Our observation is that the more information is given to a classifier, the higher the accuracy that can be achieved. Motivated by this, our work proposes a novel hybrid image transformation method for binaries for malware classification. Our new technique tackles the semantic issue by adding and visually highlighting essential sequences using entropy. A binary file is transformed into a color image whose channels encode semantic information. By visualizing continuous sequences with the same semantic entropy values, the image can highlight suspicious sections in a binary for further analysis, such as packed or encrypted sections, or small cryptographic artifacts like decryption keys or passwords that are meant to be hidden from human view. Our approach starts with a simple color scheme in which bytes are classified into a small number of categories to get an overview of the structure of a file. This approach allows us to select the best color scheme to represent the semantic meaning of the binary data. Then, we take the byte stream, split it into 32-bit blocks, and calculate the Shannon entropy of each block. Using this second scheme, we can locate encrypted or packed sections. Starting from the 256 different byte values, we compress them into a few common character classes and calculate byte entropy over a sliding window of these selected bytes. Each of these color schemes has its own advantages in representing malware behavior. The character class scheme covers the most common padding bytes and nicely highlights strings in malware, while the entropy scheme locates encrypted and compressed sections.
To automatically recognize malware variants, their shared patterns should be identified and learned. If malware authors make a small change to the original binary, its image retains the global structure. Thus, image representations of different binaries from the same malware family appear similar. Based on these observations, we have developed a deep neural network to learn the global patterns shared among malware.
Our proposed method consists of the following steps in a supervised training phase. First, we divide a given binary executable (benign or malicious) into blocks of byte sequences and calculate the Shannon entropy value of each block. Next, these blocks are analyzed conditionally and assigned corresponding colors. The transformed images are fed into convolutional, pooling, and fully connected layers. Finally, the network outputs the predicted label. In summary, the contributions of our work are as follows.
We propose a novel image transformation method to convert binary executables into color images that convey the semantics of the binary data.
We develop a Convolutional Transformation Network, or CTN in short, for classifying malware based on transformed images.
We report the experimental results of our proposed network on a large dataset of malicious and benign binary samples. In extensive experiments, our proposed method achieves 99.14% accuracy, a much higher rate than similar work on the same test set.
The composition of this paper is as follows: Section II describes related work on malware analysis, detection, and classification methods. In Section III, a malware analysis using color images is proposed, and Section IV illustrates the experimental results. Finally, the conclusions of this paper are presented, and future work is discussed in Section V.
II Related Work
In this section, we first review the progress of image classification. Then, we summarize the malware classification and the integration of the image transformation into the malware classification.
II-A Image classification
Image classification is a fundamental problem in computer vision. In the early stage, Haralick et al. described some easily computable textural features based on gray-tone spatial dependencies and illustrated their application in category-identification tasks. Later, LeCun et al. proposed convolutional neural networks (ConvNets or CNNs) by stacking several convolutional operators into a network. CNNs can create a hierarchy of progressively more abstract features and perform well on handwritten digit classification. However, limited hardware resources restricted further investigation of CNNs, so many works relied on hand-crafted features for image classification. At the global scale, GIST is computed over the entire image as a global image descriptor for scene classification. At the local scale, Lowe introduced the SIFT feature, which is extracted from interest points. Similarly, Dalal et al. proposed the histogram of oriented gradients (HOG), which can be used for both image classification and object detection in a sliding-window manner.
Recently, along with the development of GPUs, CNNs were revived in deep learning for image classification. Krizhevsky et al. trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into 1000 different classes. The network has many layers, namely convolution layers, max-pooling layers, and fully-connected layers with a final 1000-class softmax. Furthermore, there are many extensions to deeper networks that achieve higher classification rates [32, 33, 9].
II-B Malware Classification
Malware classification is a method to detect whether a given software program is malicious. Conventional techniques in the anti-malware industry use signature-based algorithms; however, these techniques cannot detect new malware or malware modified with evasion techniques such as encryption, packing, polymorphism, obfuscation, and metamorphism [41, 1]. To detect these new types of malware, there have been various efforts in recent years to adopt advanced machine learning techniques in the domain of malware classification. In this subsection, we summarize the latest efforts in this area. We also highlight several works closely related to ours that adopt image classification techniques to detect malware.
II-B1 Feature-based Approach
These classification techniques first extract various features from file samples and then use these features to train a classifier with machine learning methods. Feature extraction can be performed by static analysis, dynamic analysis, or a hybrid combination of the two. Static analysis techniques perform a string search on the program to collect features. Many efforts have used static analysis to construct a feature vector for classification, such as [15, 29, 34, 11, 2, 28]. The limitations of static analysis are that static features suffer from binary obfuscation and are limited in representing the true behavior of malware. Moreover, these techniques are platform-specific and only applicable to Windows PE. Various other works have used dynamic techniques to extract features from operations on system resources [27, 10], call sequences, and control flow or function call graphs. According to recent findings, e.g., [1, 40, 39], feature-based malware classification methods still struggle with the evasion techniques implemented in modern malware.
II-B2 Image-based Approach
There have been several efforts to adapt image processing techniques to malware classification. In [41, 38], various visualization techniques for malware analysis, including image processing, are surveyed. Lately, more efforts have adopted the image-based approach for malware classification, e.g., [21, 18, 5, 30]. A common feature of these efforts is that they transform binary malware samples into different image forms and then use image classification techniques to classify the malware based on its image representation. We highlight a few works closely related to ours and categorize them by their image transformation methods and their representation of malware in images. In [24, 25], the binary samples are mapped to a vector of 8-bit unsigned integers of byte values, which is reshaped and converted into a gray-scale image in the range [0,255]. The malware is then classified with K-nearest neighbors using texture features computed from the images. Han et al. [7, 6] proposed several transformation methods to convert the opcode sequences extracted from malware samples into image matrices represented in RGB-colored pixels. In a later work by the same authors, binaries are converted into bitmap images, which are then converted to entropy graphs to calculate similarities. This approach, however, only works for Windows PE files because it needs PE header information to decide which sections to convert. Besides, this method cannot deal with packed samples. Liu et al. proposed a method to transform disassembly files into gray-scale images, which are later compressed and mapped into feature vectors for classification using K-means and diversity selection. In another approach, malware is executed on a virtual machine to capture user-mode API calls, which are then sorted and assigned a color based on their maliciousness.
The most malicious APIs, such as DeleteFileA, are assigned hot colors (starting from red (1,0,0) in RGB), while the most benign APIs are mapped to cold colors (the coldest color is blue (0,0,1) in RGB). The API color points are rendered into an image in the order in which they appear in the log trace, and this image is used to classify maliciousness. Kancherla et al. plot raw byte values into 2-D images and extract intensity, Wavelet, and Gabor-based features. These features are then fed into an SVM classifier to perform malware classification. Our work stands apart from the literature by introducing a novel transformation method that converts binaries to images with multiple layers. Thanks to this approach, more features of the binaries are carried in the images, resulting in higher classification accuracy than existing works.
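As an illustration of the simplest transformation surveyed above, the byte-to-gray-scale mapping described for [24, 25] can be sketched as follows. This is a minimal sketch, not the cited authors' code: the function name `bytes_to_grayscale`, the fixed image width, and the zero-padding of the last row are assumptions made here for illustration.

```python
import math


def bytes_to_grayscale(data: bytes, width: int = 16) -> list[list[int]]:
    """Reshape raw bytes into rows of `width` gray-scale pixels (0-255).

    The width and the zero-padding of the final row are illustrative
    assumptions; the surveyed works may choose them differently.
    """
    if width <= 0:
        raise ValueError("width must be positive")
    rows = math.ceil(len(data) / width)
    # Pad the tail with zero bytes so that the last row is complete.
    padded = data + bytes(rows * width - len(data))
    return [list(padded[r * width:(r + 1) * width]) for r in range(rows)]


# Each byte value becomes one pixel intensity.
image = bytes_to_grayscale(b"MZ\x90\x00\x03", width=4)
```

Because every pixel carries only the raw byte value, this mapping holds no information about what the bytes mean, which is the limitation the multi-channel transformation in this paper targets.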
III System design
In this section, we introduce our proposed Convolutional Transformation Network (CTN) for malware classification. Our network consists of two major components, namely, input transformation and convolutional network-based classification.
Malware uses obfuscation techniques to bypass antivirus software and hide its malicious activities. To better capture its behavior, we not only use static features such as strings and imports but also need to understand the encoding techniques. The output of a malware detection system can give users better intuition about malware than a single decision. In other words, it is beneficial to point out suspicious sections in a program so that security analysts can perform further analysis on them. To achieve these goals, we first analyze the raw byte contents of a program and split them into blocks of byte sequences. We then calculate the entropy of each block and transform the blocks into color images. Color images have been shown in prior work to be more effective than their gray-scale counterparts.
III-B Input Transformation
In this section, we explore multiple ways to encode and transform the binary input into images. We start with a simple technique called the byte class scheme.
III-B1 Byte class
This scheme only includes information about strings. Specifically, each byte belongs to one of four categories: the lowest byte value (0), the highest byte value (255), printable characters, and non-printable characters. For example, tab (09), newline (0a), and carriage return (0d) are considered to be text. This method can be used to get an overview of the file structure.
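The byte class scheme can be sketched as a small lookup function. This is a minimal sketch under stated assumptions: the printable range 0x20 to 0x7E, the class names, and the function name `byte_class` are illustrative choices, and the colors attached to each class are not specified here.

```python
# Whitespace bytes that the scheme treats as text: tab, newline, carriage return.
TEXT_CONTROL_BYTES = {0x09, 0x0A, 0x0D}


def byte_class(value: int) -> str:
    """Assign a byte to one of the four byte-class categories."""
    if not 0 <= value <= 255:
        raise ValueError("byte value must be in 0..255")
    if value == 0x00:
        return "lowest"
    if value == 0xFF:
        return "highest"
    # Assumption: printable means ASCII 0x20-0x7E plus the text control bytes.
    if 0x20 <= value <= 0x7E or value in TEXT_CONTROL_BYTES:
        return "printable"
    return "non-printable"


classes = [byte_class(b) for b in b"\x00A\n\x01\xff"]
```

With only four classes, long runs of padding and string tables show up as uniform regions, which is what makes the scheme useful for a structural overview.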
III-B2 Gradient based
The gradient color scheme is similar to the byte class method, except that the color varies with the byte ordinal value, which ranges from 0 to 255.
This scheme can reveal structural details that do not appear in the byteclass scheme.
The detail color scheme assigns a distinct color to each byte value. It tries to maximize the difference between colors while keeping the colors of bytes that are close in value as similar as possible. To balance these two conflicting constraints, we resort to the Hilbert curve. In this paper, we project the 1-dimensional sequence of byte values onto a 3-dimensional Hilbert curve traversal of the RGB color cube.
The entropy scheme computes the Shannon entropy of each byte block:

$$H(X) = -\sum_{i=0}^{255} p(x_i) \log_2 p(x_i)$$

where $X$ is a random variable over the 256 byte symbols, taken over a block of a given block size. The probability mass function $p(x_i)$ is the probability of byte value $i$ within a given byte block. Entropy values are then rescaled to between 0 and 1. We assign entropy values to printable and encoded strings, and transform them into color RGB images.
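The per-block entropy computation can be sketched as follows. This is a minimal sketch: the function names, the default block size of 256 bytes, and the rescaling by division by 8 (the maximum entropy in bits of a distribution over 256 symbols) are assumptions, since the text only states that values are rescaled to between 0 and 1.

```python
import math
from collections import Counter


def block_entropy(block: bytes) -> float:
    """Shannon entropy of one byte block, rescaled to the range [0, 1]."""
    if not block:
        return 0.0
    n = len(block)
    counts = Counter(block)
    # sum of p * log2(1 / p) equals -sum of p * log2(p).
    bits = sum((c / n) * math.log2(n / c) for c in counts.values())
    # Assumption: rescale by log2(256) = 8 bits, the maximum for byte symbols.
    return bits / 8.0


def entropy_per_block(data: bytes, block_size: int = 256) -> list[float]:
    """Split `data` into consecutive blocks and compute each block's entropy."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return [block_entropy(data[i:i + block_size])
            for i in range(0, len(data), block_size)]


# A constant block has zero entropy; a block holding every byte value once
# reaches the maximum.
values = entropy_per_block(bytes(256) + bytes(range(256)), block_size=256)
```

Note that a block shorter than 256 bytes cannot reach the value 1 under this rescaling, so the choice of block size affects how strongly packed or encrypted regions stand out.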
|Scheme||Training (GIST)||Validation (GIST)||Training (CNN)||Validation (CNN)|
III-C Hybrid Image Transformation (HIT)
We extend the entropy color scheme from the previous section to capture more semantic information about PE files. By encoding byte values over a wide range, HIT can capture not only obfuscation information but also semantic information in the file headers, such as imported functions or libraries. We encode the semantic information in the green channel of the RGB image. To select the best number of partitions of the 256 symbols, we rely on the observation that information in the PE headers exists in a visible form as strings and numbers. We start by splitting the symbol range into four smaller ranges (lowest, highest, printable, and other bytes) and keep splitting the range, assigning each partition a binary value determined by its partition index, until we obtain the best performance. We keep the red and blue channels the same as in the entropy method and put more light into the green channel for standard characters. Table II illustrates a typical example of byte splitting and encoding. The motivation behind our color scheme is that a detailed image can improve classification performance and that, by using more ranges of byte values, metadata in an executable, such as PE headers, can be visually identified. Since most malware samples are packed to hinder analysis, HIT can be applied to detect both obfuscation patterns and malicious indicators in executables. A regular binary has lower entropy than a packed binary since it follows software coding standards and contains printable characters. Packed malware has higher entropy than benign software since obfuscation or packing makes the bytes appear random.
We store entropy information in the red and blue channels and put string information in the green channel, as green is the channel to which human vision is most sensitive and it has the highest coefficient in color-to-grayscale conversion. In particular, our HIT method uses the entropy value for red and blue and fixes the green value to a binary value determined by the partition index. Thus, for low entropy, HIT outputs lower red and blue values, and the pixel tends toward green. In this way, regular files have more green pixels than malicious files, whose higher entropy yields higher red and blue values.
Designing HIT, however, requires selecting the number of partitions. If we divide the color range into too many partitions, the output pixels in the image become close to random because the patterns are removed. Furthermore, as the number of partitions increases, classifiers have difficulty learning patterns from an image that contains random pixels, and the training process overfits more easily. We therefore perform a heuristic search for the best cut.
Table II: Example of byte splitting and encoding in HIT.
|Byte value||Color|
|0||RGB(r, 0, b)|
|255||RGB(r, 255, b)|
|[a-w]||RGB(r, 126, b)|
|[A-W]||RGB(r, 64, b)|
|[0-9]||RGB(r, 32, b)|
|special character||RGB(r, 16, b)|
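The encoding in the table can be sketched as a per-byte pixel function. This is a minimal sketch under several assumptions: the function names are hypothetical, every byte not matched by an earlier row is treated as a "special character", and the red and blue channels are both set to the rescaled block entropy multiplied by 255, which is one plausible reading of "entropy value for red and blue" rather than a confirmed detail.

```python
def green_value(byte: int) -> int:
    """Green-channel value for a byte, following the table rows in order."""
    if not 0 <= byte <= 255:
        raise ValueError("byte value must be in 0..255")
    if byte == 0x00:
        return 0
    if byte == 0xFF:
        return 255
    ch = chr(byte)
    if "a" <= ch <= "w":
        return 126
    if "A" <= ch <= "W":
        return 64
    if "0" <= ch <= "9":
        return 32
    # Assumption: all remaining bytes fall into the "special character" row.
    return 16


def hit_pixel(byte: int, entropy: float) -> tuple[int, int, int]:
    """RGB pixel for one byte, given the rescaled entropy of its block."""
    if not 0.0 <= entropy <= 1.0:
        raise ValueError("entropy must be rescaled to [0, 1]")
    # Assumption: red and blue carry the same entropy-derived intensity.
    level = round(255 * entropy)
    return (level, green_value(byte), level)


# A lowercase letter in a low-entropy block yields a green-dominated pixel.
pixel = hit_pixel(ord("a"), 0.0)
```

Under this reading, text in low-entropy regions produces green-dominated pixels, while bytes in high-entropy regions shift toward red and blue.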
III-D Network Layers
There exist standard convolutional and pooling layers that are widely used in image classification, such as those in AlexNet. In our work, we design a simple neural network that can be used to extract features from images. In particular, our CTN model contains three convolutional layers, two pooling layers, and one fully connected layer to learn to classify transformed images automatically. Note that the input of our network is the transformed images, as discussed earlier. Fig. 2 represents the network operators used in our model.
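To make the layer stack concrete, the sketch below traces tensor shapes through three convolutional layers, two pooling layers, and one fully connected layer. Everything numeric in it is an assumption for illustration: the interleaving of the layers, the 3x3 kernels, the 2x2 pooling, the channel counts, and the 64x64 input size are not taken from the paper.

```python
def out_size(size: int, kernel: int, stride: int) -> int:
    """Output length of an unpadded convolution or pooling along one axis."""
    return (size - kernel) // stride + 1


# (name, kernel, stride, output channels); None keeps the channel count.
# Hypothetical layout: conv, pool, conv, pool, conv.
LAYERS = [
    ("conv1", 3, 1, 16),
    ("pool1", 2, 2, None),
    ("conv2", 3, 1, 32),
    ("pool2", 2, 2, None),
    ("conv3", 3, 1, 64),
]


def trace_shapes(height: int, width: int, channels: int = 3):
    """Return (layer name, (channels, height, width)) for every stage."""
    shapes = [("input", (channels, height, width))]
    for name, kernel, stride, out_channels in LAYERS:
        height = out_size(height, kernel, stride)
        width = out_size(width, kernel, stride)
        if out_channels is not None:
            channels = out_channels
        shapes.append((name, (channels, height, width)))
    return shapes


shapes = trace_shapes(64, 64)
c, h, w = shapes[-1][1]
# The fully connected layer maps the flattened features to two classes
# (benign or malicious).
fc_in_features, fc_out_features = c * h * w, 2
```

Tracing shapes this way shows how the input image size determines the number of inputs to the fully connected layer, which in turn dominates the parameter count of such a small network.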
For the evaluation, we collected malware samples from Virusshare and Windows executable software as benign files. We use the Microsoft Software Removal Tool to label the samples. Our dataset contains 525 malicious and 525 benign selected samples. We further partition the dataset into two parts: a training set (80%) and a validation set (20%).
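The 80/20 partition can be sketched as a seeded random split. This is a minimal sketch: the function name `split_dataset`, the fixed seed, the placeholder sample names, and the use of a plain (non-stratified) shuffle are assumptions, since the text does not say how samples were assigned to each part.

```python
import random


def split_dataset(samples, train_fraction: float = 0.8, seed: int = 0):
    """Shuffle samples reproducibly and split them into train/validation."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# 525 malicious and 525 benign samples, represented by placeholder names.
dataset = ([(f"malicious_{i}", 1) for i in range(525)]
           + [(f"benign_{i}", 0) for i in range(525)])
train, validation = split_dataset(dataset)
```

With 1,050 samples in total, this yields 840 training samples and 210 validation samples.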
IV-B Experimental Results
We first conduct parameter selection. As shown in Fig. 3, our proposed model achieves the best performance at cut point 8. When we increase the cut point to 16, the performance of the model decreases because the additional random pixels cannot be learned well. A larger cut point also slows down the classifiers and makes them overfit. Therefore, we adopt cut point 8 for the rest of our experiments.
Table I shows the performance of different image transformation methods and different image features. CNN generally performs better than GIST. Our HIT performs the best on both the training set and the test set, which means that HIT generalizes well to unseen samples. The performance on the training set is generally better than on the validation set. The one exception is the entropy transformation, which performs better on new unseen data with CNN. With GIST features, the gray image transformation performs the best, with 94.27%. With CNN features, HIT reaches the top performance, with 99.14% accuracy. The results indicate that using color images with a CNN architecture is better than using gray-scale images.
Fig. 4 visualizes the output average taken from the last network in our model. We can observe that the mean outputs of benign and malicious samples differ. Fig. 4(a) shows that benign samples on average have most of their locations in the green regions, while malicious samples have more sections in the dark blue regions, as shown in Fig. 4(b). This indicates that malware has evolved obfuscation techniques to hide visible indicators.
In this paper, we present a Convolutional Transformation Network for malware classification based on the combination of deep learning and the conversion of binary files into color images. In other words, we cast the malware classification problem as an image classification task. We improve the accuracy by enhancing the image color coding. The results of malware classification show that our method achieves 99.14% accuracy, surpassing all the baselines.
In the future, we would like to extend our work to the malware segmentation problem to detect the specific malicious segments inside malware programs. We also aim to apply our work to the classification of polymorphic malware.
The project leading to this paper has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 675320 (NeCS: European Network for Cyber Security).
[1] (2017) Evading Machine Learning Malware Detection. In Black Hat USA 2017, July 22-27, 2017, Las Vegas, NV, USA.
[2] (2013) Large-scale malware classification using random projections and neural networks. pp. 3422–3426.
[3] (2005) Histograms of oriented gradients for human detection. pp. 886–893.
[4] (2012) Number by colors: a guide to using color to understand technical data. Springer Science & Business Media.
[5] (2015) Malware analysis using visualized images and entropy graphs. International Journal of Information Security 14 (1), pp. 1–14.
[6] (2014) Malware analysis using visualized image matrices. The Scientific World Journal 2014.
[7] (2013) Malware analysis method using visualization of binary files. pp. 317–321.
[8] (1973) Textural features for image classification. IEEE Trans. Systems, Man, and Cybernetics 3 (6), pp. 610–621.
[9] (2016-06) Deep residual learning for image recognition. pp. 770–778.
[10] (2016) MtNet: a multi-task neural network for dynamic malware classification. In Detection of Intrusions and Malware, and Vulnerability Assessment, pp. 399–418.
[11] (2013) Classification of malware based on integrated static and dynamic features. Journal of Network and Computer Applications 36 (2), pp. 646–656.
[12] (2013) Image visualization based malware detection. pp. 40–44.
[13] (2016-12) Kaspersky Lab Number of the Year 2016: 323,000 Pieces of Malware Detected Daily. https://goo.gl/ELzMyu
[14] (2016) Deep learning for classification of malware system call sequences. In Proceedings of 29th Australasian Joint Conference on Artificial Intelligence, pp. 137–149.
[15] (2006) Learning to detect and classify malicious executables in the wild. Journal of Machine Learning Research 7 (Dec), pp. 2721–2744.
[16] (2012) Imagenet classification with deep convolutional neural networks. pp. 1097–1105.
[17] (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324.
[18] (2016) Malware classification using gray-scale images and ensemble learning. pp. 1018–1022.
[19] (2004) Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision 60 (2), pp. 91–110.
[20] (2007) Using entropy analysis to find encrypted and packed malware. IEEE Security & Privacy 5 (2).
[21] (2017-02) Malware class recognition using image processing techniques. In 2017 International Conference on Data Management, Analytics and Innovation (ICDMAI), pp. 76–80.
[22] (2019) (Website).
[23] (2011) Malware images: visualization and automatic classification. pp. 4.
[24] (2016-03) SPAM: Signal Processing to Analyze Malware. IEEE Signal Processing Magazine 33 (2), pp. 105–117.
[25] (2011) A comparative assessment of malware classification using binary texture analysis and dynamic analysis. pp. 21–30.
[26] (2001) Modeling the shape of the scene: a holistic representation of the spatial envelope. International Journal of Computer Vision 42 (3), pp. 145–175.
[27] (2008) Learning and classification of malware behavior. pp. 108–125.
[28] (2015) Deep neural network based malware detection using two dimensional binary program features. pp. 11–20.
[29] (2001) Data mining methods for detection of new malicious executables. pp. 38–49.
[30] (2014) Malware behavior image for malware variant identification. pp. 238–243.
[31] (1951) Prediction and entropy of printed english. Bell Labs Technical Journal 30 (1), pp. 50–64.
[32] (2014) Very deep convolutional networks for large-scale image recognition. CoRR abs/1409.1556.
[33] (2015) Going deeper with convolutions.
[34] (2009) An automated classification system based on the strings of trojan and virus families. pp. 23–30.
[35] (2010) Evaluating color descriptors for object and scene recognition. IEEE Trans. Pattern Anal. Mach. Intell. 32 (9), pp. 1582–1596.
[36] (2010) Evaluating color descriptors for object and scene recognition. IEEE Trans. Pattern Anal. Mach. Intell. 32 (9), pp. 1582–1596.
[37] (2019) (Website).
[38] (2015) A Survey of Visualization Systems for Malware Analysis. In Eurographics Conference on Visualization (EuroVis) – STARs, R. Borgo, F. Ganovelli, and I. Viola (Eds.).
[39] (2016-02) Automatically Evading Classifiers: A Case Study on PDF Malware Classifiers. In 2016 Network and Distributed System Security Symposium (NDSS).
[40] (2018-02) Feature Squeezing: Detecting Adversarial Examples in Deep Neural Networks. In 2018 Network and Distributed System Security Symposium (NDSS).
[41] (2017-06) A survey on malware detection using data mining techniques. ACM Comput. Surv. 50 (3), pp. 41:1–41:40.
[42] (2017-05) Autoencoder-based feature learning for cyber security applications. In 2017 International Joint Conference on Neural Networks (IJCNN), pp. 3854–3861.