Random CapsNet Forest Model for Imbalanced Malware Type Classification Task
Abstract
The behavior of malware varies with its type. Therefore, knowing the type of a malware sample affects the strategy of system protection software. Many malware type classification models empowered by machine learning and deep learning achieve high accuracy in predicting malware types. Machine learning based models require heavy feature engineering, and the quality of that feature engineering dominates model performance. Deep learning based models, on the other hand, require less feature engineering. However, traditional deep learning architectures and components lead to very complex and data-sensitive models. Unlike classical convolutional neural network architectures, the capsule network architecture reduces this complexity and data sensitivity. This paper proposes an ensemble capsule network model based on the bootstrap aggregating technique. The proposed method is tested on two malware datasets for which state-of-the-art results are well known.
keywords:
Capsule Networks, Malware, Ensemble, Deep Learning
1 Introduction
Malware type classification is as important as malware detection, because system protection software can choose its strategy according to the malware family type. Malware families differ in their behavior and in their effects on a computer system. For instance, each malware family uses different resources, files, ports and other components of the operating system. In addition, new malware types have emerged as technology trends have changed.
Malware type classification is one of the most common problems in the cybersecurity domain, because the strategies of protection systems vary with the malware family type. Malware type classification can be done in three different ways: static, dynamic and image based nataraj2011comparative; abijah2019intelligent; ni2018malware. This paper focuses on the image based malware family classification problem. To that end, two important datasets are used and the obtained results are compared with other models in the literature.
This paper proposes a new model named random capsule network forest (RCNF), based on the bootstrap aggregation (bagging) ensemble technique and the capsule network (CapsNet) breiman1996bagging; sabour2017dynamic. The main idea behind the proposed method is to reduce the variance of different CapsNet models (as weak learners) using bagging. From this perspective, the main contributions of this paper can be listed as below:

This paper introduces the first application of CapsNet on malware type classification.

The first ensemble model of CapsNet is presented in the paper.

The proposed model uses simple architecture engineering instead of complex convolutional neural network architectures and domain specific feature engineering techniques.
This paper also presents a detailed comparison of the proposed model with other models in the literature.
The paper is organized as follows. Section 2 presents a literature survey of CapsNet applications and previous malware analysis studies. The datasets used in the paper are described in section 3. Section 4 gives details of the inspiring model and the proposed model. Results are investigated in section 5. Section 6 concludes the paper.
2 Related Work
Convolutional neural networks (CNNs) are very popular and successful models for image based datasets. As expected, many CNN architectures have been used in visual static malware analysis. CapsNet, a new CNN structure, was introduced in 2017 sabour2017dynamic. CapsNet has many applications, especially in the health domain jimenez2018capsule. For instance, Afshar et al. afshar2018brain use CapsNet for brain tumor classification, and a similar approach has been applied to the classification of breast cancer histology images iesmantas2018convolutional. Mobiny et al. mobiny2018fast create a fast CapsNet architecture for lung cancer diagnosis. Another important application area of CapsNet is object segmentation. LaLonde et al. lalonde2018capsules use CapsNet for object segmentation. A prominent idea in machine learning is generative adversarial networks (also known as GANs). GANs are based on CNN structures. CapsNet is useful for improving GANs because it removes the weakest point of CNNs jaiswal2018capsulegan. In other words, these studies show that CapsNet is a promising alternative to the standard CNN.
Although there are many applications of CapsNets in the literature, one important area is missing: computer and information security. This gap is easily seen in a preprint survey of CapsNets patrick2019capsule. This paper therefore aims to build a robust malware prediction model based on the CapsNet architecture for imbalanced malware datasets. This model is the first application of CapsNets in the malware prediction domain.
3 Datasets
The base CapsNet and the proposed RCNF models have been tested on two of the best-known malware datasets, Malimg and Microsoft Malware 2015 (BIG2015). This section describes these two datasets.
3.1 Malimg
Nataraj et al. introduced a new malware family type classification approach based on visual analysis and published a new malware dataset nataraj2011malware. This dataset has 9339 samples and 25 different classes. Table 1 presents the number of samples for each malware family. This distribution shows that the dataset is highly imbalanced.
Table 1: Number of samples for each malware family in the Malimg dataset.
No.  Family Name  Number of Samples 

1  Allaple.L  1591 
2  Allaple.A  2949 
3  Yuner.A  800 
4  Lolyda.AA 1  213 
5  Lolyda.AA 2  184 
6  Lolyda.AA 3  123 
7  C2Lop.P  146 
8  C2Lop.gen!g  200 
9  Instantaccess  431 
10  Swizzor.gen!I  132 
11  Swizzor.gen!E  128 
12  VB.AT  408 
13  Fakerean  381 
14  Alueron.gen!J  198 
15  Malex.gen!J  136 
16  Lolyda.AT  159 
17  Adialer.C  125 
18  Wintrim.BX  97 
19  Dialplatform.B  177 
20  Dontovo.A  162 
21  Obfuscator.AD  142 
22  Agent.FYI  116 
23  Autorun.K  106 
24  Rbot!gen  158 
25  Skintrim.N  80 
Figure 1 shows malware images created from byte files. All images are single channel and have been resized to 224 x 224 for the CapsNet architecture. This is the largest size that can be processed on our computer system.
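The conversion from a byte file to a single channel image can be sketched as follows. This is a minimal illustration, not the preprocessing used in the paper: the function name `bytes_to_gray_rows`, the fixed row width and the zero padding of the last row are assumptions, and the subsequent resizing step is omitted.

```python
import math


def bytes_to_gray_rows(data, width=256):
    """Lay raw bytes out row by row as 8-bit gray levels (0..255).

    `width` is a hypothetical row width; the paper does not state one.
    """
    if width <= 0:
        raise ValueError("width must be positive")
    n_rows = math.ceil(len(data) / width)
    # Zero-pad the tail so that the last row is complete (an assumption).
    padded = data + bytes(n_rows * width - len(data))
    return [list(padded[r * width:(r + 1) * width]) for r in range(n_rows)]


# Ten bytes laid out on rows of four pixels give three rows.
image = bytes_to_gray_rows(bytes(range(10)), width=4)
```

Each byte becomes one pixel intensity, so the image keeps the byte order of the file; an image library would then be needed to resize the grid to the fixed input size.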
3.2 Microsoft Malware 2015 (BIG2015)
The BIG2015 dataset was released as part of a Kaggle competition ronen2018microsoft; Microsof31:online. Table 2 presents the sample distribution for each malware family in the BIG2015 dataset. The distribution shows that the dataset is highly imbalanced and that Simda, with only 42 samples, is the hardest malware family to predict. The dataset contains 10868 BYTE (bytes) files, 10868 ASM (assembly code) files and 9 different malware family types.
Table 2: Number of samples for each malware family in the BIG2015 dataset.
No.  Family Name  Number of Samples 

1  Ramnit  1541 
2  Lollipop  2478 
3  Kelihos_ver3  2942 
4  Vundo  475 
5  Simda  42 
6  Tracur  751 
7  Kelihos_ver1  398 
8  Obfuscator.ACY  1228 
9  Gatak  1013 
Figure 2 depicts the image representations created from the BYTE and ASM files of the same malware sample in the Ramnit malware family. All images are single channel. All images have been resized to 112 x 112 for our CapsNet architecture, because the architecture uses both BYTE and ASM image representations at the same time.
4 Model
In this section, general capsule networks, the base CapsNet architecture for Malimg and the base CapsNet architecture for the Microsoft Malware 2015 dataset are described. The CapsNet architectures for Malimg and BIG2015 differ from each other.
4.1 Capsule Networks
Capsule networks are special convolutional neural network architectures that aim to minimize the information loss caused by max pooling sabour2017dynamic. Max pooling is the weakest point of CNNs for preserving spatial information iesmantas2018convolutional. A CapsNet contains capsules, which are similar to autoencoders krizhevsky2011using; sabour2017dynamic. Each capsule learns how to represent an instance of a given class. Therefore, each capsule creates a fixed length feature vector that serves as input to a classifier layer, without using max pooling layers in its internal structure. In this way, the capsule structure aims to preserve texture and spatial information with minimum loss.
Sabour et al. propose an efficient method to train CapsNet architectures sabour2017dynamic. This method is called the dynamic routing algorithm. The algorithm uses a new nonlinear activation function called squashing, shown in equation (1). The equation ensures that short vectors are shrunk to almost zero length and long vectors are shrunk to a length slightly below 1 sabour2017dynamic. In this equation, $\mathbf{v}_i$ is the output of the $i$th capsule and $\mathbf{s}_i$ is the total input of this capsule.
$$\mathbf{v}_i = \frac{\|\mathbf{s}_i\|^2}{1 + \|\mathbf{s}_i\|^2} \, \frac{\mathbf{s}_i}{\|\mathbf{s}_i\|} \tag{1}$$
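The effect of squashing on short and long vectors can be checked numerically. The sketch below assumes the form given by Sabour et al., in which a vector s keeps its direction and its length becomes ||s||^2 / (1 + ||s||^2); the helper names and the small `eps` guard against division by zero are illustrative choices, not part of the paper.

```python
import math


def squash(s, eps=1e-9):
    """Squash a capsule's total input vector s, given as a list of floats."""
    sq_norm = sum(x * x for x in s)
    norm = math.sqrt(sq_norm)
    # Length factor ||s||^2 / (1 + ||s||^2) times the unit vector s / ||s||.
    scale = sq_norm / (1.0 + sq_norm) / (norm + eps)
    return [scale * x for x in s]


def length(v):
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in v))


short_out = squash([0.01, 0.0])   # a short input vector
long_out = squash([30.0, 40.0])   # a long input vector of length 50

print(length(short_out), length(long_out))
```

The short vector ends up with a length close to zero and the long one with a length just below 1, while the direction of each vector is unchanged.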
Visualizing the squash activation function described in equation (1) is very hard because its input is a high dimensional vector. If the activation function is treated as a single variable function, as described in marchisio2019capsacc, then the behavior of the function and its derivative can be visualized as in figure 3.
A basic CapsNet architecture contains two parts, standard convolution blocks and a capsule layer, as shown in figure 4. A convolution block is a combination of convolution filters and the ReLU activation function. At the end of the convolution block, the obtained feature maps are reshaped and projected to a vector representation. This representation feeds each capsule in the capsule layer. Each capsule learns how to represent and reconstruct a given sample, like an autoencoder architecture krizhevsky2011using. These representations are used to calculate the class probabilities for the classification task.
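One way to read the single variable simplification, assuming the input is restricted to a non-negative scalar $x$ that plays the role of the input length (this restriction is an assumption made here for illustration), is the following:

```latex
f(x) = \frac{x^{2}}{1 + x^{2}}, \qquad
f'(x) = \frac{2x}{\left(1 + x^{2}\right)^{2}}, \qquad x \ge 0 .

% Limits: f(0) = 0 and f(x) \to 1 as x \to \infty.
% The slope is largest where f''(x) = 0:
f''(x) = \frac{2\left(1 - 3x^{2}\right)}{\left(1 + x^{2}\right)^{3}} = 0
\;\Longrightarrow\; x = \frac{1}{\sqrt{3}}, \qquad
f'\!\left(\tfrac{1}{\sqrt{3}}\right) = \frac{3\sqrt{3}}{8} \approx 0.65 .
```

Under this reading the function saturates towards 1 for large inputs and its derivative vanishes both at zero and for large inputs, which matches the description of short vectors being shrunk to almost zero and long vectors to almost unit length.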
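The margin loss function is used for CapsNet. This function is similar to the hinge loss rosasco2004loss. Equation (2) defines the margin loss for capsule $k$,
$$L_k = T_k \max(0, m^{+} - \|\mathbf{v}_k\|)^2 + \lambda \, (1 - T_k) \max(0, \|\mathbf{v}_k\| - m^{-})^2 \tag{2}$$
where $m^{+}$ and $m^{-}$ are the upper and lower margins, $\lambda$ down-weights the loss of absent classes, $T_k$ denotes the actual class ($T_k = 1$ if class $k$ is present and 0 otherwise) and $\|\mathbf{v}_k\|$ represents the current prediction. The summation of $L_k$ over all capsules gives the total loss. To minimize the margin loss, the most applicable optimizer for CapsNet is Adam kingma2014adam; sabour2017dynamic. In our experiments, CapsNet did not converge to the minimum loss value with optimizers other than Adam. This remains an open issue for future CapsNet studies.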
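The margin loss can be illustrated on a toy example. The sketch below follows the formulation of Sabour et al.; the default values 0.9, 0.1 and 0.5 for the two margins and the down-weighting factor are taken from that work and are an assumption here, since the values used in this paper are not stated in the text above.

```python
def margin_loss(lengths, target, m_plus=0.9, m_minus=0.1, lam=0.5):
    """Total margin loss for one sample.

    lengths: output vector length of every class capsule (the prediction).
    target:  index of the actual class.
    """
    total = 0.0
    for k, v_len in enumerate(lengths):
        t_k = 1.0 if k == target else 0.0
        # Penalty when the capsule of the actual class is too short.
        present = t_k * max(0.0, m_plus - v_len) ** 2
        # Down-weighted penalty when a capsule of another class is too long.
        absent = lam * (1.0 - t_k) * max(0.0, v_len - m_minus) ** 2
        total += present + absent
    return total


confident = margin_loss([0.95, 0.05, 0.05], target=0)
mistaken = margin_loss([0.20, 0.80, 0.05], target=0)
print(confident, mistaken)
```

A confident correct prediction lies inside both margins and contributes no loss, whereas the mistaken prediction is penalized both for the short capsule of the actual class and for the long capsule of a wrong class.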
Our main assumption is that the CapsNet architecture will be able to classify malware family types successfully using raw pixel values obtained from malware binary and assembly files. In addition, this paper aims to increase the accuracy of the CapsNet malware type classification architecture with the bagging ensemble method.
4.2 Base Capsule Network Model for Malimg Dataset
Before creating an ensemble CapsNet model, a base CapsNet estimator must be built. Its architecture depends on the dataset. For Malimg, the base CapsNet estimator has a single convolution line, as shown in figure 5. The convolution line contains two sequential blocks, and each block contains two sequential convolution and ReLU layers. The first two convolutional layers have 3x3 kernels and 32 filters. The second two convolutional layers have 3x3 kernels and 64 filters. The feature maps are reshaped to a feature vector of length 128. After the reshape step, there is a capsule layer containing 25 capsules; the dimension of each capsule is 8 and the number of routing iterations is 3. According to our experiments, this is the best-performing CapsNet architecture for the Malimg dataset.
4.3 Base Capsule Network Model for BIG2015 Dataset
The BIG2015 dataset has two different files for each sample. One of them is a binary file and the other is an assembly file. Thus, it is possible to design a CapsNet that is fed by two different image inputs at the same time. Figure 6 shows a CapsNet architecture that has two identical convolution lines. In this architecture, the first two sequential layers contain 3x3 kernels and 64 filters. The second two sequential layers contain 3x3 kernels and 128 filters.
The features extracted from the ASM image and the BYTE image are concatenated and the final feature vector is reshaped to a vector of length 128. At the next level, this feature vector is fed to a capsule layer containing 9 capsules. In this layer, the dimension of each capsule is 8 and the number of routing iterations is 3. According to our experiments, this hyperparameter set is the best for the base CapsNet estimator on the BIG2015 dataset.
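The layer sizes given for the two base architectures can be cross-checked against the parameter counts reported in the results tables (90592 and 527232). The sketch below is a consistency check under stated assumptions, not a description of the authors' implementation: every convolution has single channel input at the first layer and a bias per filter, the two BIG2015 convolution lines do not share weights, the capsule layer holds one 8 x 128 transformation matrix per class capsule without bias, and no other layer is trainable.

```python
def conv_params(kernel, in_ch, out_ch):
    """Weights plus one bias per filter of a square 2-D convolution."""
    return kernel * kernel * in_ch * out_ch + out_ch


def conv_line_params(filters, in_ch=1, kernel=3):
    """Parameters of a chain of convolutions fed by a single channel image."""
    total = 0
    for out_ch in filters:
        total += conv_params(kernel, in_ch, out_ch)
        in_ch = out_ch
    return total


def capsule_params(n_caps, caps_dim, in_dim):
    """One caps_dim x in_dim transformation matrix per output capsule."""
    return n_caps * caps_dim * in_dim


malimg = conv_line_params([32, 32, 64, 64]) + capsule_params(25, 8, 128)
big2015 = 2 * conv_line_params([64, 64, 128, 128]) + capsule_params(9, 8, 128)
print(malimg, big2015)  # 90592 527232
```

Under these assumptions both totals agree with the reported numbers, and most of the parameters sit in the convolution lines rather than in the capsule layer.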
4.4 Random CapsNet Forest Model for Both Two Datasets
Random CapsNet Forest (RCNF) is an ensemble model inspired by the Random Forest algorithm breiman2001random. The basic idea behind RCNF is to treat identical CapsNet models as weak learners and to create a different training set for each model from the original training set using the bootstrap resampling technique, as shown in algorithm (1). The training algorithm (1) is a variant of bootstrap aggregating (also known as bagging) breiman1996bagging for the CapsNet model; bagging reduces the variance of the model and increases its robustness breiman1996bias. In this paper, bagging is preferred over boosting freund1996experiments for creating an ensemble of CapsNets, because boosting has been shown to be prone to overfitting quinlan1996bagging. During the training phase, each epoch updates the weights of the CapsNet. Therefore, the weights of the best model according to the validation score are saved at the end of each epoch, in order to increase model performance and consistency against the random weight initialization of the CapsNet.
The prediction method is described in algorithm (2). The saved weights of each CapsNet model are loaded and the model predicts the test samples. The predicted probabilities are accumulated in a running sum; this step is known as average ensembling. At the end of the estimation loop, the index of the highest accumulated probability is assigned as the predicted class.
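The resampling part of the training procedure can be sketched as follows. This is an illustration of bagging under assumptions, not the paper's algorithm (1): `train_one` is a hypothetical callback that would train one CapsNet on the selected samples and return its best weights according to the validation score, and the seed is arbitrary.

```python
import random


def bootstrap_indices(n_samples, rng):
    """Draw n_samples indices uniformly with replacement."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]


def train_forest(n_samples, n_estimators, train_one, seed=0):
    """Train n_estimators base models, each on its own bootstrap sample."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_estimators):
        idx = bootstrap_indices(n_samples, rng)
        # train_one is expected to return the weights with the best
        # validation score over all epochs (hypothetical callback).
        forest.append(train_one(idx))
    return forest


# Stand-in for training: record how many distinct samples each model sees.
unique_counts = train_forest(7004, 5, train_one=lambda idx: len(set(idx)))
print([round(u / 7004, 3) for u in unique_counts])
```

Each bootstrap sample contains roughly 63% of the distinct training samples, so the base models see different data even though their architectures are identical, which is what lets averaging reduce variance.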
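The average ensembling step can be sketched on toy probabilities. The function name and the nested list layout are assumptions made for illustration; in practice each inner table would come from one trained CapsNet.

```python
def ensemble_predict(prob_per_model):
    """Sum class probabilities over models and take the arg max per sample.

    prob_per_model[m][i][c] is the probability that model m assigns to
    class c for test sample i.
    """
    n_samples = len(prob_per_model[0])
    n_classes = len(prob_per_model[0][0])
    total = [[0.0] * n_classes for _ in range(n_samples)]
    for probs in prob_per_model:
        for i, row in enumerate(probs):
            for c, p in enumerate(row):
                total[i][c] += p
    # Dividing by the number of models would not change the arg max.
    return [max(range(n_classes), key=row.__getitem__) for row in total]


model_a = [[0.60, 0.40], [0.40, 0.60]]
model_b = [[0.30, 0.70], [0.10, 0.90]]
model_c = [[0.55, 0.45], [0.20, 0.80]]
predicted = ensemble_predict([model_a, model_b, model_c])
print(predicted)  # [1, 1]
```

For the first sample two of the three models prefer class 0, yet the summed probabilities favor class 1; averaging probabilities takes the confidence of each model into account, unlike majority voting.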
5 Experiment and Results
The CapsNet and RCNF ensemble models are tested on two different datasets, Malimg and BIG2015. The Malimg dataset has been divided into three parts: training, validation and test sets. The training set has 7004 samples, the validation set has 1167 samples and the test set has 1167 samples. BIG2015 has also been divided into three parts in the same way. In the BIG2015 experiments, the training set has 8151 samples, the validation set has 1359 samples and the test set has 1358 samples.
The first experiment obtains the performance of a single base CapsNet estimator for each dataset. The second experiment measures the performance of the RCNF model. Models are evaluated in terms of accuracy, F1 measure and the number of parameters of the deep neural networks. The performance metrics are defined as follows:
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{3}$$
$$F_1 = \frac{2\,TP}{2\,TP + FP + FN} \tag{4}$$
where true positive (TP) and false positive (FP) are the numbers of instances correctly and wrongly classified as positive, respectively. True negative (TN) and false negative (FN) are the numbers of instances correctly and wrongly classified as negative, respectively. Accuracy is the ratio of the number of correct predictions to the number of all instances in the set, as shown in equation (3). F1 is given in equation (4) in terms of true positives, false negatives and false positives.
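The two metrics can be computed from a confusion matrix as sketched below. The toy matrix is invented for illustration, and macro averaging of the per-class F1 scores is an assumption; the text does not state how F1 is aggregated over malware families.

```python
def accuracy(tp, tn, fp, fn):
    """Ratio of correct predictions to all instances (binary counts)."""
    return (tp + tn) / (tp + tn + fp + fn)


def f1_score(tp, fp, fn):
    """F1 in terms of true positives, false positives and false negatives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def per_class_counts(confusion, k):
    """TP, FP, FN, TN of class k; confusion[i][j] counts true i predicted j."""
    n = len(confusion)
    tp = confusion[k][k]
    fp = sum(confusion[i][k] for i in range(n)) - tp
    fn = sum(confusion[k]) - tp
    tn = sum(sum(row) for row in confusion) - tp - fp - fn
    return tp, fp, fn, tn


def macro_f1(confusion):
    """Unweighted mean of the per-class F1 scores (assumed averaging)."""
    scores = []
    for k in range(len(confusion)):
        tp, fp, fn, _ = per_class_counts(confusion, k)
        scores.append(f1_score(tp, fp, fn))
    return sum(scores) / len(scores)


def overall_accuracy(confusion):
    """Correct predictions on the diagonal divided by all instances."""
    correct = sum(confusion[k][k] for k in range(len(confusion)))
    return correct / sum(sum(row) for row in confusion)


# Toy imbalanced problem: one frequent class and two rare classes.
toy = [[50, 0, 0],
       [2, 8, 0],
       [1, 0, 4]]
print(round(overall_accuracy(toy), 4), round(macro_f1(toy), 4))
```

On this toy matrix the macro F1 is lower than the overall accuracy because the errors fall on the rare classes, which is why both metrics are informative for imbalanced datasets.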
Figure 7 shows the confusion matrices for the test part of both datasets. The confusion matrices (figures 7(a) and 7(b)) show that a model containing a single CapsNet incorrectly predicts rare malware families in both datasets.
Figure 8 is the confusion matrix of RCNF containing 5 base CapsNet models. This confusion matrix shows the prediction accuracy of the model for each malware family type in the Malimg test set. The RCNF model makes wrong predictions for classes 8, 10, 20 and 21. On the other hand, the model is very successful at correctly predicting the other malware types in the test set. This confusion matrix also shows that RCNF is successful at correctly predicting rare malware types in the Malimg test set.
In the second experiment, RCNF is tested on the BIG2015 dataset. Figure 8 shows the prediction results of RCNF containing 10 base CapsNet models for the BIG2015 dataset. Class 4 is the rarest malware type in the whole dataset. The training, validation and test sets are stratified, so the class distribution is preserved in each partition. The result shows that RCNF predicts even the rarest malware type well. Classes 0, 1, 2 and 6 are predicted perfectly by RCNF. Comparing the performance of RCNF with that of a single CapsNet model shows that RCNF is better than a single CapsNet at predicting rare malware families in imbalanced datasets.
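Table 3: Test performance of the proposed models and others on the Malimg dataset.
Model  Number of Parameters  F1 Score  Accuracy

Yue yue2017imbalanced  20M  -  0.9863
Cui et al. cui2018detection  -  0.9455  0.9450
CapsNet (proposed)  90592  0.9658  0.9863
RCNF (proposed)  5 x 90592  0.9661  0.9872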
Table 4: Test performance of the proposed models and others on the BIG2015 dataset.
Model  Number of Parameters  F1 Score  Accuracy

Cao et al. cao2018efficient  -  -  0.95
Gibert et al. gibert2018end  -  0.9813  0.9894
Kreuk et al. kreuk2018deceiving  -  -  0.9921
Le et al. le2018deep  268949  0.9605  0.9820
Chen chen2018deep  -  -  0.9925
Jung et al. jung2018malware  148489  -  0.99
Abijah et al. abijah2019intelligent  -  -  0.9914
Zhao et al. zhao2019maldeep  -  -  0.929
Khan et al. khan2018analysis  -  -  0.8836
Safa et al. safabenchmarking  -  -  0.9931
Kebede et al. kebede2017classification  -  -  0.9915
Kim et al. kim2018classifying  -  -  0.9266
Kim et al. kim2018detecting  -  0.8936  0.9697
Yan et al. yan2018detecting  -  -  0.9936
Naeem et al. naeem2019identification  -  0.971  0.9840
CapsNet (proposed)  527232  0.9779  0.9926
RCNF (proposed)  10 x 527232  0.9820  0.9956
Table 3 shows the test performance of the proposed models and others on the Malimg dataset. Yue yue2017imbalanced uses a weighted loss function to handle the imbalanced class distribution in the Malimg dataset and also uses transfer learning yosinski2014transferable to classify malware family types. Because of transfer learning, that architecture has 20M parameters, which makes the model very large. Cui et al. cui2018detection use classical machine learning methods such as K-Nearest Neighbor and support vector machines. They train these algorithms on GIST and GLCM features, which are feature engineering methods for images, and apply resampling to the dataset to address the class imbalance. RCNF uses neither a weighted loss function nor any sampling method to overcome the imbalanced dataset problem. Our results are at least as high as those of these two methods (a single CapsNet matches the accuracy of Yue and RCNF exceeds it), and they also show that CapsNet and RCNF do not require any extra feature engineering on the Malimg dataset. A single CapsNet architecture for the Malimg dataset has 90592 trainable parameters and RCNF has 452960 trainable parameters, so our proposed methods are considerably smaller than Yue's method.
Table 4 compares the test performance of the proposed models and others on the BIG2015 dataset. Chen chen2018deep and Khan et al. khan2018analysis use transfer learning architectures for the dataset. The test results of our proposed methods are better than those of these two models, although our results are very close to Chen chen2018deep in terms of accuracy. Our proposed models are better than Gibert et al. gibert2018end in terms of accuracy, but the F1 score of RCNF is very close to that of their model. Our proposed models are clearly better than the models of Cao et al. cao2018efficient, Zhao et al. zhao2019maldeep, Kim et al. kim2018classifying and Kim et al. kim2018detecting. Jung et al. jung2018malware propose a model that is considerably smaller than ours in terms of the number of parameters, but our model has a higher accuracy score. Abijah et al. abijah2019intelligent, Safa et al. safa2019benchmarking, Kebede et al. kebede2017classification and Yan et al. yan2018detecting propose deep learning models whose accuracy scores are close to those of our proposed models. Data resampling, weighted loss functions for imbalanced datasets, transfer learning and data augmentation were not used at any point to train the CapsNet and RCNF models on either BIG2015 or Malimg. In other words, these results are the baseline performance of these models.
For these tables, recent studies using the Malimg and BIG2015 datasets were chosen. For a fair comparison, the models are drawn from visual static malware analysis studies.
6 Conclusion
This paper introduces the first application of CapsNet to the imbalanced malware family type classification task. Moreover, the first ensemble model of CapsNet, called RCNF, is introduced in this paper. The proposed models do not require any complex feature engineering methods or complex deep network architectures. To show this, we used two different malware family type datasets, Malimg and BIG2015. These datasets are used for image based visual static malware classification. Our proposed models can use these datasets directly through raw pixel values.
The datasets in the paper are highly imbalanced in terms of class distribution. CapsNet and RCNF do not use oversampling, undersampling or a weighted loss function during the training phase. The results show that, among the compared models, CapsNet and RCNF are the ones that suffer least from the imbalanced class distribution.
The experimental results show that a single CapsNet model performs well on both the BIG2015 and Malimg datasets. However, we assumed that an ensemble model of CapsNets could increase generalization performance, and, as expected, RCNF has better generalization performance than a single CapsNet model.
As future work, we plan to develop a hybrid architecture for malware classification. This hybrid method will be based on the CapsNet architecture. This paper shows that CapsNets are useful for imbalanced malware datasets in the visual static malware analysis problem domain.
References
Footnotes
 journal: Computers & Security