DomainGAN: Generating Adversarial Examples to Attack Domain Generation Algorithm Classifiers

Isaac Corley1, Jonathan Lwowski2, Justin Hoffman3 Booz Allen Hamilton
San Antonio, Texas
Email: 1corley_isaac@bah.com, 2lwowski_jonathan@bah.com, 3hoffman_justin@bah.com
Abstract

Domain Generation Algorithms (DGAs) are frequently used to generate large numbers of domains for use by botnets. These domains serve as rendezvous points between malware and its command and control servers. Many algorithms are used to generate domains, but many of them are simplistic and easy to detect using classical machine learning techniques. In this paper, three different variants of generative adversarial networks (GANs) are used to improve domain generation by making the domains more difficult for machine learning algorithms to detect. The domains generated by traditional DGAs and the GAN based DGAs are then compared using state-of-the-art machine learning based DGA classifiers. The results show that the GAN based DGAs are detected by the DGA classifiers significantly less often than the traditional DGAs. An analysis of the GAN variants is also performed to show which GAN variant produces the most usable domains. As verified by testing results and analysis, the Wasserstein GAN with Gradient Penalty (WGANGP) is the best GAN variant to use as a DGA.

Domain Generation Algorithms, Generative Adversarial Networks, Machine Learning

I Introduction

Numerous types of malware utilize Domain Generation Algorithms (DGAs) to produce large numbers of pseudo-domains. The malware will try to connect to many or all of these domains in an attempt to find a Command and Control (C2) server. These C2 servers exchange further updates with the malware, such as gathered intelligence [Woodbridge2016PredictingDG] or sensitive information exfiltrated from compromised machines. For the malware to be successful, only a few of the generated domains need to be registered. Conversely, to cause the malware to fail completely, all domains generated and used by the malware must be blacklisted. This makes the task of detecting DGAs very difficult, because the DGA detector must maintain a high detection accuracy.

Fig. 1: Autoencoder and GAN Architectures

The task of detecting DGAs is very difficult and has become an important research topic. There are many DGA detectors [Woodbridge2016PredictingDG, yu2018character, kim2014convolutional]; however, many of these detection algorithms have only been tested using traditional DGAs. For example, Woodbridge et al. developed a DGA classifier using Long Short-Term Memory (LSTM) networks. Their model achieved over 90% accuracy with a very low false positive rate; however, it was only trained and tested on the Alexa Top 1 Million dataset [alexa] and the Bambenek feeds [Bambenek]. The Bambenek feeds mostly contain domains produced by traditional, easy-to-detect DGAs. More importantly, the Bambenek feeds do not contain adversarial DGA domains designed to break DGA classifiers. In addition to LSTM based classification, Yu et al. compared state-of-the-art machine learning DGA classifiers [yu2018character]. The classifiers compared include several different Convolutional Neural Network (CNN) and LSTM based models. These models were also trained on the benign Alexa Top 1 Million domains as well as the Bambenek DGA domain lists. Their models had testing accuracies varying from 78% to 98%. However, since these models were only trained on the Bambenek feeds, they suffer from the same issues as the model of Woodbridge et al., such as being vulnerable to adversarial examples.

With the improvement of DGA classifiers, adversarial DGAs have become prevalent [sidi2019maskdga, Peck2019CharBotAS]. These adversarial DGAs can be difficult to detect using traditional DGA detection algorithms. For example, Sidi et al. use a substitute model to algorithmically perturb generated domains to make them more likely to evade DGA classifiers. They show that their adversarial DGA degrades the accuracy of various DGA classifiers from 97% to 49%. Another adversarial DGA, developed by Peck et al., uses an algorithmic method that introduces a small number of typographical errors into benign domains [Peck2019CharBotAS].

With the emergence of neural networks, machine learning based DGAs have been developed [Spooren, anderson2016deepdga] to specifically evade DGA classifier detection. The DGA developed by Spooren et al. [Spooren] uses feature engineering along with an iterative DGA development process to produce domains that can fool DGA classifiers. Anderson et al. [anderson2016deepdga] developed a generative DGA, DeepDGA, which trains a GAN on the Alexa Top 1 Million dataset to generate benign-like samples that evade DGA classifiers. They tested their DGA samples against a random forest classifier and showed that their model had a 48% detection rate versus the 96% detection rate on samples generated by traditional algorithmic DGAs. However, one notable drawback of their model is that it tends to produce very short domains [sidi2019maskdga]. Short domains can be very expensive, have a higher chance of already being registered, and have a higher chance of having been previously generated by the DGA.

In this paper, three different variants of generative adversarial networks (GANs) are used to improve domain generation by making the domains more difficult for machine learning algorithms to detect. The domains generated by traditional DGAs and the GAN based DGA are then compared by using state-of-the-art neural network DGA classifiers. Our results show that the GAN based DGAs are detected by the DGA classifiers significantly less than the traditional DGAs. Additionally, further analysis of the samples generated by each GAN variant is performed to show which GAN variant produces the most usable domains for a botnet. As verified by our results and analysis, the Wasserstein GAN with Gradient Penalty (WGANGP), is the best GAN variant to use as a DGA to evade DGA classifier detection.

The rest of the paper is organized as follows. The dataset to train the DomainGAN is analyzed in Section II. Our proposed GAN based DGA will be discussed in Section III, followed by an analysis of the results in Section IV. Finally, the conclusions and future works will be discussed in Section V.

II Dataset

Fig. 2: Encoder Architecture for the Autoencoder
Fig. 3: Decoder Architecture for the Autoencoder

The Alexa Top 1 Million dataset [alexa] was used throughout our experiments for generating realistic domain samples. This dataset is composed of the URLs of the top 1 million web sites. The domains are ranked using the Alexa traffic ranking, which is determined using a combination of the browsing behavior of users on the website, the number of unique visitors, and the number of pageviews. In more detail, unique visitors are the number of unique users who visit a website on a given day, and pageviews are the total number of user URL requests for the website. However, multiple requests for the same website by the same user on the same day are counted as a single pageview. The website with the highest combination of unique visitors and pageviews is ranked the highest [alexa_support]. This ranking supports the hypothesis that the Alexa domains are benign and not generated by DGAs. Prior to any experiments, top level domains, e.g. .com, .net, .org, are removed from all domains. A few example domains from the dataset are shown in Table I.

Ranking Domain
1 google.com
2 youtube.com
3 baidu.com
900,000 aileencooks.com
900,001 alrei.org
900,002 amco.co.in
TABLE I: Alexa Top 1 Million Dataset Examples

III Domain Generation Model

Our proposed GAN model consists of four main components: an encoder, a decoder, a generator, and a discriminator. As seen in Figure 1, the autoencoder is initially trained to take an input domain from the Alexa Top 1 Million dataset, encode that domain into a small, fixed-size embedding using the encoder network, and then decode the compressed representation back into the original domain using the decoder network. After this training process, the autoencoder networks are rearranged into the GAN framework, where the decoder network is repurposed as the generator network and the encoder network is utilized as the discriminator network. The generator is then trained to produce domains which are as similar as possible to the Alexa Top 1 Million domains. The discriminator model then detects whether a given domain was produced by the generator network or sampled from the Alexa Top 1 Million dataset. The generator and discriminator networks iteratively learn how to fool and detect the other, respectively. This process is repeated until the generator is able to produce realistic benign-like domains.

III-A Autoencoder Model

Similarly to the experiments of [anderson2016deepdga], we initialize the generator network’s weights by pretraining an autoencoder to learn a compressed representation of important domain specific features in the embedded space. The autoencoder consists of an encoder, seen in Figure 2, and a decoder, seen in Figure 3, both of which are inspired by the sentence classification network from [kim2014convolutional]. We note that without pretraining, GAN training becomes highly unstable and consistently diverges to unusable samples.

The encoder begins by taking a domain from the Alexa Top 1 Million dataset as input. This domain is then tokenized and fed into an embedding layer with 39 input dimensions representing the set of possible tokens, an embedding dimension of 39, and a maximum input sequence length of 60 tokens. The output of the embedding layer is then fed into three parallel 1-dimensional convolutional layers. All three layers have 256 filters and Rectified Linear Unit (ReLU) activations [nair2010rectified]. The three layers have kernel sizes of 2, 3, and 4, respectively, which in principle lets them extract n-gram features of different sizes from the domain names. The outputs of the three parallel convolution layers are then concatenated and fed into another convolution layer with 8 filters, a kernel size of 2, and a ReLU activation. Finally, the output of the last convolution layer is flattened into a single vector to form the compressed encoder output. This architecture is visualized in Figure 2.
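
The tokenization step can be sketched as follows. The vocabulary used here (a padding token, lowercase letters, digits, hyphen, and dot, 39 tokens in total) is an assumption chosen to match the stated vocabulary size; the text does not list the actual token set, and the helper names are hypothetical.

```python
import string

# Assumed vocabulary: index 0 is padding, followed by a-z, 0-9, '-' and '.'.
# This yields 39 tokens, matching the stated size, but the exact set is a guess.
VOCAB = ["<pad>"] + list(string.ascii_lowercase + string.digits + "-.")
CHAR_TO_ID = {ch: i for i, ch in enumerate(VOCAB)}
MAX_LEN = 60  # maximum input sequence length stated in the text


def tokenize(domain, max_len=MAX_LEN):
    """Map a domain to a fixed-length list of token ids, right-padded with 0."""
    ids = [CHAR_TO_ID[ch] for ch in domain.lower()[:max_len]]
    return ids + [0] * (max_len - len(ids))


def detokenize(ids):
    """Invert tokenize(), dropping padding tokens."""
    return "".join(VOCAB[i] for i in ids if i != 0)
```

Each domain then becomes a length-60 integer sequence that an embedding layer can consume, and `detokenize` recovers the original string from such a sequence.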

The decoder begins by taking the output of the encoder as its input. The input is then reshaped into a 2-dimensional matrix and fed into 3 parallel convolution layers, similarly to the encoder architecture. The layers’ outputs are concatenated and fed into another convolution layer with 32 filters, a kernel size of 3, and a ReLU activation. The decoder’s final convolution layer is then trained to reproduce the original domain which was fed into the encoder. This layer has 39 filters, a kernel size of 3, and a softmax activation. The softmax output represents the probability distribution across tokens at each position. This architecture is visualized in Figure 3.

III-B Generator Model

Once the decoder has been trained to decode the low-dimensional representation of benign domains, it is repurposed for use as the generator in the GAN framework. The generator, seen in Figure 4, takes as its input a latent vector $z$ whose components are sampled from a uniform distribution on the interval $[-1, 1]$, or more formally $z_i \sim \mathcal{U}(-1, 1)$. This vector is fed into a fully-connected layer with 480 neurons and a ReLU activation. The output of this layer is then fed into the pretrained decoder. The pretrained decoder’s weights are frozen, and the output of the decoder is the generated domain. Intuitively, the fully-connected layer learns a mapping from a uniform distribution to the low-dimensional distribution of the embedded space learned by the encoder in order to produce realistic benign domains. The generator architecture is displayed in Figure 4.

Fig. 4: Generator Architecture
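
One way to read the choice of 480 neurons: if the encoder's convolutions preserve the sequence length ("same" padding, which is an assumption since the padding mode is not stated), the final encoder layer outputs 60 positions with 8 filters each, or 480 values when flattened. The fully-connected layer would then emit a vector of exactly the size the pretrained decoder expects. The sketch below shows this arithmetic together with sampling a latent vector; the function names and the `latent_dim` parameter are hypothetical.

```python
import random

SEQ_LEN = 60        # maximum sequence length from the encoder description
FINAL_FILTERS = 8   # filters in the encoder's last convolution layer


def flattened_encoder_size(seq_len=SEQ_LEN, filters=FINAL_FILTERS):
    """Flattened encoder output size, assuming length-preserving padding."""
    return seq_len * filters


def sample_latent(latent_dim, rng):
    """Draw a latent vector with each component uniform on [-1, 1]."""
    return [rng.uniform(-1.0, 1.0) for _ in range(latent_dim)]


print(flattened_encoder_size())  # 480 under the stated assumption
```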

III-C Discriminator Model

Similar to the generator, the discriminator is developed using pretrained autoencoder weights as its initialization, in this case those of the encoder. The discriminator, seen in Figure 5, takes a real or generated domain as its input. The domain is then fed into the pretrained encoder from the autoencoder. The encoder’s weights are frozen as well. The output of the encoder is then fed into a single neuron output layer with linear activation. The output of this layer indicates whether the input domain was sampled from the Alexa Top 1 Million dataset or generated. The discriminator architecture is displayed in Figure 5.

Fig. 5: Discriminator Architecture

IV Results

Fig. 6: Generated Domain Lengths Distributions

IV-A Autoencoder Training Results

The autoencoder was trained on the Alexa Top 1 Million dataset discussed in Section II. The dataset is randomly shuffled and split into train and test sets with a percentage split criterion of 75%/25%. The autoencoder is trained for 400 epochs with a batch size of 64. We then calculate the mean squared error (MSE) on the test set which resulted in a MSE of . By sampling the maximum token probability from the softmax output distributions we note that the autoencoder is able to perfectly recreate the test set domains. Examples of input and output domains from the trained autoencoder can be seen in Table II.

Input Domain Output Domain
google google
yahoo yahoo
netflix netflix
TABLE II: Example Input Domains and Corresponding Autoencoder Decoded Output Domains
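
Decoding by maximum token probability can be sketched as below. The 39-token vocabulary (padding, a-z, 0-9, hyphen, dot) is an assumption, since the actual token set is not given, and the toy softmax rows are fabricated purely to exercise the function.

```python
import string

# Assumed 39-token vocabulary; index 0 is the padding token.
VOCAB = ["<pad>"] + list(string.ascii_lowercase + string.digits + "-.")


def decode_argmax(probs):
    """Pick the most probable token at each position and drop padding."""
    chars = []
    for position in probs:
        best = max(range(len(position)), key=position.__getitem__)
        if best != 0:
            chars.append(VOCAB[best])
    return "".join(chars)


def peaked_row(token_id, peak=0.9):
    """Build a toy softmax row that puts most of its mass on token_id."""
    rest = (1.0 - peak) / (len(VOCAB) - 1)
    return [peak if i == token_id else rest for i in range(len(VOCAB))]


# Toy decoder output for "abc" followed by two padding positions.
toy_output = [peaked_row(i) for i in (1, 2, 3, 0, 0)]
print(decode_argmax(toy_output))  # abc
```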

IV-B GAN Variants

After the autoencoder has been trained, the model is split into the encoder and decoder networks, which are then used as components of the discriminator and generator networks, respectively. To train the GAN, the generator network produces batches of “fake” domains, which are paired with an equivalent number of real domains sampled from the Alexa Top 1 Million dataset. The discriminator then attempts to determine whether the domains are fake or real. Based on how well the discriminator is able to classify the domains, the weights of the generator and the discriminator are both updated using a loss function and backpropagation. It is known that GANs suffer greatly from instability during training. As a result, convergence during optimization is generally difficult to achieve [mescheder2018training]. To combat this issue, multiple variants of GANs have been developed to improve upon the originally proposed framework. These variants commonly propose new loss functions which, in theory, provide a more meaningful measure of how strongly the discriminator judges a given sample to be real or generated. Our experiments analyze the task of generating realistic domains by comparing three GAN variants: Least Squares GAN (LSGAN), Wasserstein GAN with Gradient Penalty (WGANGP), and the original GAN utilized by DeepDGA [anderson2016deepdga].

The original GAN loss function addresses the binary classification problem of determining whether an input to the discriminator network was sampled from the real data or generated by the generator network. The discriminator's output layer uses a sigmoid activation, so its output can be interpreted as either 1 (real) or 0 (generated/fake). The objective function is given in Equation 1.

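$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}(x)}\left[\log D(x)\right] + \mathbb{E}_{z \sim p_z(z)}\left[\log\left(1 - D(G(z))\right)\right] \tag{1}$$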
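
The original GAN objective is typically optimized as two per-batch losses. The sketch below is a minimal plain-Python version, assuming the discriminator outputs sigmoid probabilities. It uses the common non-saturating generator loss, which is a widespread practical substitute and not necessarily the authors' choice; the function names are hypothetical.

```python
import math

EPS = 1e-12  # guards against log(0)


def discriminator_loss(d_real, d_fake):
    """Binary cross-entropy with real samples labeled 1 and generated ones 0."""
    real_term = sum(-math.log(p + EPS) for p in d_real) / len(d_real)
    fake_term = sum(-math.log(1.0 - p + EPS) for p in d_fake) / len(d_fake)
    return real_term + fake_term


def generator_loss(d_fake):
    """Non-saturating generator loss: push D(G(z)) toward 1."""
    return sum(-math.log(p + EPS) for p in d_fake) / len(d_fake)
```

A discriminator that cannot tell the two sources apart outputs 0.5 everywhere, giving a discriminator loss of 2 ln 2, while a confident and correct discriminator drives that loss toward zero.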

The LSGAN framework [lsgan] was proposed to solve the vanishing gradient problem inherent in neural network classifiers with sigmoid outputs. The modified discriminator output is meant to provide an unbounded measurement of correctness to more effectively penalize the discriminator’s classifications. This change effectively makes the discriminator network a critic instead of a classifier, as it provides a value which is closer to a continuous score than to a classification. The notable changes within the GAN framework are the replacement of the discriminator's sigmoid output activation with a linear activation and the optimization of the discriminator with an MSE loss function. The objective functions for the LSGAN framework are provided in Equations 2 and 3.

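$$\min_D V(D) = \frac{1}{2}\,\mathbb{E}_{x \sim p_{\text{data}}(x)}\left[\left(D(x) - 1\right)^2\right] + \frac{1}{2}\,\mathbb{E}_{z \sim p_z(z)}\left[D(G(z))^2\right] \tag{2}$$
$$\min_G V(G) = \frac{1}{2}\,\mathbb{E}_{z \sim p_z(z)}\left[\left(D(G(z)) - 1\right)^2\right] \tag{3}$$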
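
A minimal sketch of the least-squares losses, assuming target labels of 1 for real and 0 for generated samples (the same coding used for the original GAN above) and a discriminator with unbounded linear output. The function names and default label values are assumptions.

```python
def lsgan_discriminator_loss(d_real, d_fake, real_label=1.0, fake_label=0.0):
    """Mean squared distance of critic scores from their target labels."""
    real_term = sum((d - real_label) ** 2 for d in d_real) / len(d_real)
    fake_term = sum((d - fake_label) ** 2 for d in d_fake) / len(d_fake)
    return 0.5 * (real_term + fake_term)


def lsgan_generator_loss(d_fake, target=1.0):
    """The generator is penalized by how far its samples score from 'real'."""
    return 0.5 * sum((d - target) ** 2 for d in d_fake) / len(d_fake)
```

Unlike the sigmoid cross-entropy loss, the penalty keeps growing quadratically for scores far from the target, so samples that are confidently classified still produce a gradient.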

The final GAN variant we utilize throughout our experiments is the WGANGP framework. The WGANGP framework, seen in Equation 4, utilizes the Earth Mover’s distance, or Wasserstein-1 distance, provided in Equation 5. Because the discriminator network outputs a continuous value instead of a class probability, it is commonly referred to as a critic. The critic provides a continuous metric for comparing real and generated samples, which has been shown to be a more meaningful measure of the difference between the data distributions. In addition to the change in loss function, the WGANGP framework uses a gradient penalty, provided in Equation 6, which penalizes the critic when the norm of its gradient with respect to its input deviates from 1.

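$$L = \mathbb{E}_{\tilde{x} \sim p_g}\left[D(\tilde{x})\right] - \mathbb{E}_{x \sim p_{\text{data}}}\left[D(x)\right] + \lambda\,\mathrm{GP} \tag{4}$$
$$W(p_{\text{data}}, p_g) = \inf_{\gamma \in \Pi(p_{\text{data}}, p_g)} \mathbb{E}_{(x, y) \sim \gamma}\left[\lVert x - y \rVert\right] \tag{5}$$
$$\mathrm{GP} = \mathbb{E}_{\hat{x} \sim p_{\hat{x}}}\left[\left(\lVert \nabla_{\hat{x}} D(\hat{x}) \rVert_2 - 1\right)^2\right] \tag{6}$$
Here $p_g$ is the distribution of generated samples, $\hat{x}$ is sampled uniformly along straight lines between pairs of real and generated samples, and $\lambda$ is the penalty coefficient.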
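
Computing the gradient penalty normally requires automatic differentiation. To keep the example dependency-free, the sketch below uses a hypothetical linear critic D(x) = w·x, whose gradient with respect to its input is simply w at every point. The penalty coefficient of 10 is the default from the WGAN-GP paper, not a value reported here, and all function names are assumptions.

```python
import math
import random


def critic(w, x):
    """Hypothetical linear critic D(x) = w . x."""
    return sum(wi * xi for wi, xi in zip(w, x))


def critic_grad(w, x_hat):
    """Gradient of the linear critic w.r.t. its input; constant in x_hat."""
    return list(w)


def gradient_penalty(w, real, fake, rng):
    """Two-sided penalty (||grad D(x_hat)||_2 - 1)^2 averaged over the batch."""
    total = 0.0
    for x, x_tilde in zip(real, fake):
        eps = rng.random()
        # Interpolate between a real and a generated sample.
        x_hat = [eps * a + (1.0 - eps) * b for a, b in zip(x, x_tilde)]
        grad = critic_grad(w, x_hat)
        grad_norm = math.sqrt(sum(g * g for g in grad))
        total += (grad_norm - 1.0) ** 2
    return total / len(real)


def critic_loss(w, real, fake, rng, lam=10.0):
    """Mean score on fakes minus mean score on reals, plus the penalty."""
    mean_fake = sum(critic(w, x) for x in fake) / len(fake)
    mean_real = sum(critic(w, x) for x in real) / len(real)
    return mean_fake - mean_real + lam * gradient_penalty(w, real, fake, rng)
```

With a real network the only change is that `critic_grad` would be evaluated by the framework's autograd at each interpolated point.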

To determine which framework provides the most appealing samples, we generate 1 million domains using each trained GAN variant and analyze them using several methods, such as domain length and n-gram distribution analysis.

IV-C Domain Length Analysis

An analysis was performed to compare the lengths of the generated domains to those of the Alexa Top 1 Million domains. Generated samples with lengths similar to benign domains are important for evasion because DGA classifiers typically learn features such as domain length to differentiate benign from DGA domains. Additionally, shorter domains increase the likelihood of a domain collision, which makes the domain more expensive to register. A domain collision occurs when a DGA generates a domain which already exists or is owned by another entity. DGAs should therefore attempt to match the domain length distribution of benign domains. As seen in Figure 6, the original GAN learns to generate notably short domains, shorter even than those in the Alexa domain length distribution. The length distribution of the WGANGP model, however, is visually closer to that of the Alexa Top 1 Million dataset than those of the other GAN variants.
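
The comparison in Figure 6 is visual. One way to quantify it, offered here as an assumption and not as the authors' method, is the total variation distance between normalized length histograms. The domain lists below are toy inputs for illustration only.

```python
from collections import Counter


def length_distribution(domains):
    """Normalized histogram mapping domain length to relative frequency."""
    counts = Counter(len(d) for d in domains)
    total = sum(counts.values())
    return {length: n / total for length, n in counts.items()}


def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


benign = ["google", "youtube", "baidu"]   # toy benign sample
generated = ["abc", "abcd", "google"]     # toy generated sample
distance = total_variation(length_distribution(benign),
                           length_distribution(generated))
```

A distance of 0 means the two length distributions are identical and 1 means they share no lengths at all, so a lower value indicates a closer match to the benign distribution.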

IV-D Existing Domain Collision Analysis

To further analyze the effect of domain length on a GAN variant's ability to produce usable domains, an analysis was performed to calculate the percentage of domains produced by each GAN variant that are already owned. This is important because if a given domain already exists, the generated domain is unusable by a botnet unless it is purchased from the existing owner. To measure how often each GAN variant generates unusable existing domains, 1000 domains were generated by each GAN variant, and each generated domain was checked to see if it already exists online. Each generated second level domain was concatenated with 3 top level domains: “.com”, “.org”, and “.net”. As seen in Table III, the WGANGP produces significantly fewer existing domain collisions. The WGANGP produces 12.3% existing domain collisions, the LSGAN 19.6%, and the GAN 29.6%.

GAN Variant Existing Domain Collision %
GAN 29.6%
LSGAN 19.6%
WGANGP 12.3%
TABLE III: Likelihood of Existing Domain Collision of Generated Domains by GAN Variants
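
The text does not state how existence was checked. The sketch below assumes a DNS lookup as a proxy (a domain can be registered without resolving, so this may undercount) and computes the rate over all name and TLD combinations; both choices are assumptions. The existence check is injectable so the logic can be exercised offline.

```python
import socket

TLDS = (".com", ".org", ".net")


def resolves(domain):
    """Return True if the name resolves via DNS (needs network access)."""
    try:
        socket.gethostbyname(domain)
        return True
    except OSError:
        return False


def existing_collision_rate(names, exists=resolves):
    """Fraction of name+TLD candidates for which exists() reports a hit."""
    candidates = [name + tld for name in names for tld in TLDS]
    hits = sum(1 for c in candidates if exists(c))
    return hits / len(candidates)
```

For example, passing `exists={"google.com", "google.org"}.__contains__` with the names `["google", "qzxv"]` yields a rate of 2 out of 6 candidates.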

IV-E Repeated Domain Collision

Another important aspect to consider when comparing the GAN variants is repeated domain collision. A repeated domain collision occurs when the DGA produces the same domain more than once within a batch of generated samples. To analyze repeated domain collisions, all duplicates were removed from the 1 million generated domains. As seen in Table IV, the original GAN had the highest rate of repeated domain collisions at 53.2%, while the WGANGP had the lowest at 7.4%. This is likely because the original GAN produces shorter domains than the WGANGP, and shorter domains are more likely to be repeated.

GAN Variant Domain Collision %
GAN 53.2%
LSGAN 16.1%
WGANGP 7.4%
TABLE IV: Likelihood of Repeated Domain Collision of Generated Domains by GAN Variants
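
Under one reading of this analysis, the repeated collision rate is the share of generated samples removed by deduplication. A minimal sketch, with a hypothetical function name:

```python
def repeated_collision_rate(domains):
    """Fraction of generated samples that are duplicates of an earlier sample."""
    return 1.0 - len(set(domains)) / len(domains)


print(repeated_collision_rate(["a", "b", "a", "a"]))  # 0.5
```

Read this way, a rate of 53.2% on 1 million samples would correspond to roughly 468,000 distinct domains.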

IV-F Unigram and Bigram Distribution Analysis

To further compare generated and benign samples, the unigram and bigram distributions of the three GAN variants are calculated and analyzed. As with domain lengths, DGA classifiers will typically learn n-gram statistics of domains to differentiate between DGA and benign domains. Therefore, if the GAN can mimic the unigram and bigram distributions of the Alexa Top 1 Million dataset, it is more likely to evade detection by DGA classifiers. In Figure 7 and Figure 8, we plot the unigram and bigram distributions of the Alexa and generated domains, ordered by decreasing frequency in the Alexa Top 1 Million dataset. For both n-gram distributions, the WGANGP framework models the Alexa Top 1 Million distribution noticeably better than the LSGAN and GAN variants.

Fig. 7: Unigram Character Distributions of Alexa Top 1M and Generated Domains
Fig. 8: Bigram Character Distributions of Alexa Top 1M and Generated Domains
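
The character n-gram distributions behind Figures 7 and 8 can be computed along the following lines. This is a sketch with hypothetical function names; `ranked_comparison` orders n-grams by decreasing frequency in the reference set, mirroring the ordering used in the plots.

```python
from collections import Counter


def ngram_distribution(domains, n):
    """Relative frequency of every character n-gram across the domains."""
    counts = Counter(d[i:i + n] for d in domains for i in range(len(d) - n + 1))
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.items()}


def ranked_comparison(reference, other):
    """List (ngram, reference_freq, other_freq) by decreasing reference_freq."""
    order = sorted(reference, key=lambda g: (-reference[g], g))
    return [(g, reference[g], other.get(g, 0.0)) for g in order]
```

Calling `ngram_distribution(domains, 1)` gives unigram frequencies and `ngram_distribution(domains, 2)` gives bigram frequencies.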

IV-G DGA Classifier Results

Furthermore, the domains generated by the GAN variants were tested against various DGA classifiers, which performed poorly at detecting them. These classifiers are Endgame, Invincea, CMU, MIT, NYU, and Baseline [yu2018character]. After testing the original classifiers, the classifiers were fine-tuned using domains generated by the GAN variants. The GAN generated domains were then tested again against the fine-tuned models, which showed significantly improved detection.

IV-H Spoofing the Original Classifiers

To train the original DGA classifiers, the Alexa Top 1 Million domains and 1 million DGA domains from the Bambenek feeds [Bambenek] were used as the input dataset. Seventy percent of the input dataset was used as the training data, and the rest was used as the testing dataset. Each of the DGA classification models was trained for 50 epochs, but only the best model was saved. The best model was determined using the validation loss at the end of every epoch. After training the DGA classification models, the training and testing results were similar to the results in the paper by Yu et al. [yu2018character]. The training and testing results of each model can be seen in Table V.

Classifier   Train Accuracy   Test Accuracy
Endgame      95.86%           96.02%
Invincea     98.44%           98.55%
CMU          95.51%           95.47%
MIT          98.21%           98.08%
NYU          98.45%           98.36%
Baseline     95.49%           95.58%
TABLE V: Training and Testing Accuracy of the Original DGA Classifiers

After training the original DGA classifiers, the 1 million generated domains from each of the GAN variants were classified using each of the classifiers. As seen in Table VI, all of the models fail to classify the vast majority of the GAN generated domains as DGA. This means domains generated using the GAN variants would evade the DGA classifiers at a very high rate.

Classifier   GAN Evasion %   LSGAN Evasion %   WGANGP Evasion %
Endgame      98.93%          95.58%            96.14%
Invincea     97.43%          94.94%            94.93%
CMU          99.23%          98.84%            97.63%
MIT          98.90%          97.65%            97.78%
NYU          97.74%          95.58%            96.14%
Baseline     99.63%          98.89%            97.22%
TABLE VI: Probability that GAN Variants Generated Domains Evade Detection By DGA Classifiers
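
The evasion percentage can be read as the fraction of generated domains that a classifier labels benign. A minimal sketch, assuming each classifier emits a DGA score in [0, 1] and that scores below 0.5 count as benign; the threshold and function name are assumptions, since the decision rule is not stated.

```python
def evasion_rate(dga_scores, threshold=0.5):
    """Fraction of generated domains scored below the DGA decision threshold."""
    evaded = sum(1 for score in dga_scores if score < threshold)
    return evaded / len(dga_scores)


print(evasion_rate([0.1, 0.2, 0.9, 0.4]))  # 0.75
```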

IV-I Fine-Tuned Classifiers

Because the original DGA classifiers had low accuracy at detecting DGA domains from the GAN variants, the models were fine-tuned on the GAN generated domain samples for each of the variants. The datasets for fine-tuning included 500,000 domains from the Bambenek feeds, 500,000 domains generated from each of the GAN variants, and 1 million domains from the Alexa Top 1 Million dataset. Each of the classifiers was then fine-tuned by retraining the model with its weights initialized from the original training. Each model was fine-tuned for 50 epochs, with only the best model being saved based on the validation accuracy at the end of each epoch. As seen in Table VII, the fine-tuned classification models have lower accuracy than the original models; this is expected because the dataset includes GAN generated domains, which are harder to classify due to their similarity to benign domains. However, the accuracies are still relatively high, making these models more usable than the original classifiers.

GAN Variant for Fine-Tuning   Classifier   Train Accuracy   Test Accuracy
GAN                           Endgame      89.77%           89.64%
GAN                           Invincea     94.79%           95.22%
GAN                           CMU          90.59%           90.50%
GAN                           MIT          93.67%           93.59%
GAN                           NYU          94.55%           94.39%
GAN                           Baseline     81.94%           81.89%
LSGAN                         Endgame      87.16%           87.34%
LSGAN                         Invincea     92.20%           92.84%
LSGAN                         CMU          88.03%           87.99%
LSGAN                         MIT          90.95%           90.86%
LSGAN                         NYU          91.93%           91.69%
LSGAN                         Baseline     79.40%           79.44%
WGANGP                        Endgame      83.81%           83.93%
WGANGP                        Invincea     91.72%           92.58%
WGANGP                        CMU          84.60%           84.47%
WGANGP                        MIT          88.73%           88.71%
WGANGP                        NYU          90.83%           90.63%
WGANGP                        Baseline     78.14%           78.30%
TABLE VII: Training and Testing Accuracy of DGA Classifiers After Fine-Tuning on GAN Generated Samples

To test the ability of the fine-tuned classification models to correctly identify GAN generated domains, 500,000 new GAN generated domains were fed into each of the fine-tuned models. As seen in Table VIII, the fine-tuned models are better at detecting GAN generated domains, making it more difficult for the GAN based DGAs to evade detection.

Fine-Tuned Classifier   GAN Variant   GAN Evasion %   LSGAN Evasion %   WGANGP Evasion %
Endgame                 GAN           11.01%          63.92%            74.83%
Invincea                GAN           3.24%           63.69%            75.60%
CMU                     GAN           10.84%          65.40%            76.89%
MIT                     GAN           9.16%           65.39%            79.37%
NYU                     GAN           6.77%           65.61%            77.93%
Baseline                GAN           30.69%          64.13%            76.36%
Endgame                 LSGAN         45.18%          25.42%            72.65%
Invincea                LSGAN         36.01%          7.82%             56.78%
CMU                     LSGAN         49.44%          25.26%            74.31%
MIT                     LSGAN         41.08%          16.03%            68.97%
NYU                     LSGAN         43.44%          16.30%            66.61%
Baseline                LSGAN         64.07%          43.00%            77.63%
Endgame                 WGANGP        56.43%          65.76%            34.45%
Invincea                WGANGP        50.95%          52.54%            11.39%
CMU                     WGANGP        66.77%          73.24%            37.90%
MIT                     WGANGP        52.36%          59.07%            25.71%
NYU                     WGANGP        55.55%          63.09%            23.19%
Baseline                WGANGP        89.50%          88.44%            61.77%
TABLE VIII: Probability that the GAN Variants Can Produce Domains That Will Evade Detection By The Fine-Tuned DGA Classifiers

IV-J Summary of Results

To summarize the results in the previous sections, it is necessary to compare the percentage of domains generated by each GAN variant that will actually be usable. The main factors that affect whether a generated domain is usable are “Repeated Domain Collisions”, “Existing Domain Collisions”, and “DGA Classifier Detections”. If a domain encounters any of those issues, it cannot be considered usable. Using the 1 million generated domains, the probability of a domain being usable was calculated. Although a WGANGP generated domain has a higher chance of being detected by a DGA classifier, the WGANGP has the highest probability of generating a usable domain. As seen in Figure 9, the WGANGP produces more usable domains because it has significantly fewer repeated domains along with fewer existing domain collisions. This makes the WGANGP generator the best GAN variant to use as a DGA.

Fig. 9: Summary of analysis for domains generated from GAN Variants
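
The text does not spell out how the three factors were combined. One plausible reading, sketched below with injectable checks, treats a generated domain as usable only if it is the first occurrence of its string, does not already exist, and is not flagged by the classifier. The function name and the treatment of first occurrences are assumptions.

```python
def usable_fraction(domains, exists, detected):
    """Share of generated samples that are new, unregistered, and undetected."""
    seen = set()
    usable = 0
    for domain in domains:
        if domain in seen:
            continue  # repeated domain collision
        seen.add(domain)
        if exists(domain) or detected(domain):
            continue  # existing domain collision or classifier detection
        usable += 1
    return usable / len(domains)
```

For instance, with samples `["a", "a", "b", "c", "d"]`, where `"b"` already exists and `"c"` is detected, two of the five samples (`"a"` and `"d"`) are usable.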

V Conclusion

In this paper, three different variants of generative adversarial networks (GANs) were used to improve domain generation by making the domains more difficult for machine learning algorithms to detect. The domains generated by traditional DGAs and the GAN based DGAs were then compared using state-of-the-art machine learning based DGA classifiers. The results show that the GAN based DGAs are detected by the DGA classifiers significantly less often than the traditional DGAs. An analysis of the GAN variants was also performed to show which GAN variant produces the most usable domains. As verified by testing results and analysis, the Wasserstein GAN with Gradient Penalty (WGANGP) is the best GAN variant to use as a DGA.

In the future, we plan to use Reinforcement Learning (RL) to create another DGA model. We believe the use of RL will not only improve the DGA’s ability to evade detection by DGA classifiers, but will also allow the model to continuously learn and improve based on whether or not the DGA model is successful, thus making the RL DGA model very difficult to detect.

References
