WhiteNet: Phishing Website Detection by Visual Whitelists

Sahar Abdelnabi, Katharina Krombholz and Mario Fritz CISPA Helmholtz Center for Information Security

Phishing websites are still a major threat in today’s Internet ecosystem. Despite numerous previous efforts, black and white listing methods do not offer sufficient protection – in particular against zero-day phishing attacks. This paper contributes WhiteNet, a new similarity-based phishing detection framework, based on a triplet network with three shared Convolutional Neural Networks (CNNs). WhiteNet learns profiles for websites in order to detect zero-day phishing websites by a “visual whitelist”. We furthermore present WhitePhish, the largest dataset to date that facilitates visual phishing detection in an ecologically valid manner. We show that our method outperforms the state-of-the-art by a large margin while being robust against a range of evasion attacks.

I Introduction

Phishing pages impersonate legitimate websites without permission [whittaker2010large] to steal sensitive data from users causing major financial losses and privacy violations [corona2017deltaphish, jain2017phishing, thomas2017data, khonji2013phishing]. Phishing attacks have increased due to the advances in creating phishing kits that enabled the deployment of phishing pages on larger scales [corona2017deltaphish, oest2018inside]. According to the Anti-Phishing Working Group (APWG) [APWG], an international association aiming at fighting phishing attacks, 180,768 attempts have been reported in the first quarter of 2019 which is higher than the total number of phishing attempts in the third and fourth quarters of 2018, indicating that phishing attacks are continuously increasing.

There have been numerous attempts to combat the threats posed by phishing attacks by automatically detecting phishing pages. Modern browsers mostly rely on blacklisting [sheng2009empirical], a fundamentally reactive mechanism. However, a recent empirical study [oest2019phishfarm] found that new phishing pages using cloaking techniques were both harder and slower for blacklists to detect, which motivates the development of proactive solutions. An example of the latter is the use of heuristics that are based on monitored phishing pages [khonji2013phishing]. These heuristics can be extracted from URL strings [blum2010lexical, nguyen2014novel, zouina2017novel] or HTML [chou2004client, li2019stacking] to detect anomalies between the claimed identity of a webpage and its features [pan2006anomaly]. However, since phishing attacks are continuously evolving, these heuristics are subject to continuous change and might not be effective in detecting future attacks (e.g. HTTPS is now more common in phishing webpages [APWG], whereas its absence was formerly used as a feature to detect phishing pages [pan2006anomaly]).

Fig. 1: Phishing detection using WhiteNet. Whitelisted pages are granted based on their URLs. The embeddings of other pages are compared to the whitelist pages’ ones and the decision is based on the visual similarity. Phishing webpages are visually similar to the whitelist with closer features, unlike other legitimate websites outside the whitelist.

Since the key factor in deceiving users is the high visual similarity between phishing webpages and their corresponding legitimate ones, detecting such similarity was used in many previous detection studies [jain2017phishing]. In these methods, a whitelist of websites is maintained (domain names and screenshots), and whenever a user visits a page that is not in the whitelist, its content is compared against the whitelist’s ones. If a high visual similarity is detected, then this page is classified as a phishing page as it impersonates one of the whitelist’s websites. Similarity-based methods have the advantage of not relying on heuristics that are likely to fail and instead they rely on the strong incentive of the adversary to design pages that are similar to trustworthy websites. This makes them less prone to an arms race between defenders and attackers. Similarity can be detected from rendered screenshots of webpages which allows the detection of webpages composed entirely of images or embedded objects that attackers might use to hide textual information and avoid detection by HTML methods [jain2017phishing].

These efforts still have limitations. First, their whitelists are too small (e.g. 4-14 websites in [mao2013baitalarm, mao2017phishing, dalgic2018phish, dunlop2010goldphish]) which makes them able to detect attacks against these few websites only. Second, existing approaches fall short in detecting zero-day phishing webpages as they only protect certain webpages of the legitimate websites such as login forms where phishing pages are assumed to have a close copy of them [jain2017phishing, fu2006detecting, lam2009counteracting, rao2015computer, bozkir2016use]. Consequently, attackers can bypass detection by crafting phishing pages that show differences from the corresponding legitimate webpages (e.g. by obfuscation using advertisement banners and changed layout [chen2010detecting]), in addition to using other webpages from the targeted websites that were not contained in the whitelist.

Our work targets the limitations mentioned above. First, we present a new dataset (WhitePhish) that enlarges the covered whitelist to 155 websites. For these websites, we included unique screenshots of phishing pages as well as legitimate pages with different designs and views for each website as a training set. Also, we collected a legitimate test set of websites that are not included in the whitelist.

Second, we propose WhiteNet, a similarity-based detection model that is the first to utilize triplet convolutional neural networks to learn a more robust visual similarity metric between different designs and webpages of the same website. A conceptual overview of our method is depicted in Figure 1 in which we show a potential whitelist of websites. The figure shows a learnt feature space in which whitelist’s webpages belonging to the same website have high proximity. Additionally, phishing webpages have high visual similarity and closer embeddings to the whitelist, thus, they would be classified as phishing. Finally, websites that are outside the whitelist have genuine identities and relatively different features.


Key Contributions:

  • WhitePhish: largest dataset to date which we constructed to mitigate the limitations of previous datasets, facilitate visual phishing detection and improve the ecological validity when evaluating phishing detection frameworks.

  • WhiteNet: similarity-based detection model utilizing triplet convolutional neural networks with a learnt visual similarity metric between different designs and webpages of the same website that is more robust to evasion attacks. The concept is shown in Figure 1.

II Related Work

The similarity between phishing and whitelisted websites can be inferred by extracting features that represent text content (e.g. most frequent words) and style information (e.g. font name and color, etc.), which then can be compared against whitelisted identities [huang2010mitigate, liu2006antiphishing]. Also, Document Object Model (DOM) comparison between two webpages can be used to detect similarity as DOM represents the logical structure of HTML or XML files [rosiello2007layout]. However, these methods fail if attackers used images to represent the webpage instead of HTML text [fu2006detecting]. Additionally, they are vulnerable to code obfuscation techniques where different code produces similar rendered images [fu2006detecting].

Consequently, another line of work infers similarity directly from rendered screenshots. As an example, layout similarity that is deduced from the matching of screenshots’ segmentation blocks was used in [lam2009counteracting]. Also, Earth Mover’s Distance (EMD) was used to compute the similarity between low-resolution screenshots in [fu2006detecting]. Besides, discriminative keypoint features were often used to match screenshots, such as the use of Scale-Invariant Feature Transform (SIFT) in [afroz2011phishzoo], Speeded-Up Robust Features (SURF) in [rao2015computer], Histogram of Oriented Gradients (HOG) in [bozkir2016use], and Oriented FAST and rotated BRIEF (ORB) in [malisa2017detecting] to detect mobile applications spoofing.

An approach similar in spirit was recently proposed in [woodbridge2018detecting], but only to detect the visual similarity between URL pairs using Siamese CNNs. In contrast, we propose a visual similarity metric based on screenshots as a general approach, with further optimizations adapting to the harder problem, to potentially detect more phishing pages which goes beyond homoglyph attacks.

III Objective and Threat Model

Despite previous efforts, we believe that our work explores new territory in phishing detection research with no similar precedent. Most previous methods assume a close similarity in layout and content between the phishing and legitimate image pair, whereas we aim to capture the look and feel of each website by learning a visual profile for it that can generalize to partially copied and changed pages, or to zero-day pages that were not seen in the whitelist.

Threat model: We consider phishing pages targeting the collected large whitelist. We assume that the attacker would be motivated to target websites that are widely known and trusted, therefore, we assume a high coverage of phishing pages could be achieved by the collected whitelist. For these websites, we assume that the attacker could craft the phishing page to be entirely or partially similar to any page from the targeted websites, or to have a new design with a combination of these pages as an evasion technique to avoid detection by exact matching. We assume other evasion techniques that introduce small imperceptible changes to the phishing page to reduce the similarity to the targeted website. We consider conceivable adversarial evasion techniques in addition to introducing hand-crafted perturbations. For all these attempts, we assume that the adversary has an incentive to create seemingly trusted pages by not introducing very perceptible noise on the page that might decrease the perceived design quality.

IV Analyses and Limitations of Published Datasets

In this section, we discuss previously published datasets and their limitations in addition to the contributions of the WhitePhish dataset.

Unfortunately, only a small number of datasets for the phishing detection task using screenshots are publicly available. One of these is DeltaPhish [corona2017deltaphish] for detecting phishing pages hosted within compromised legitimate websites. The dataset consists of groups having the same domain, where each group contains one phishing page and a few other benign pages from the compromised hosting website. Thus, the legitimate examples only cover the hosting websites, not the websites spoofed by the collected phishing pages. Consequently, this dataset is not suitable for similarity-based detection. Moreover, we observed that a large percentage of phishing pages’ screenshots in this dataset are duplicates since PhishTank (https://www.phishtank.com/) reports unique URLs which do not necessarily contain unique screenshots. We also found that the legitimate and phishing examples had different designs as phishing examples generally consisted of login forms with few page elements, while legitimate examples contained more details, which might work as a confounding factor. This could cause the trained model to be biased to these design changes and, thus, could fail when tested with legitimate pages with login forms.

The Phish-IRIS dataset [dalgic2018phish] for similarity-based detection consists of phishing pages collected from PhishTank targeting 14 websites and an “other” class collected from the Alexa top 300 websites (https://www.alexa.com) representing legitimate examples outside the whitelist. However, this dataset has a limited number of whitelisted websites, and the screenshots of the whitelisted websites were taken only from phishing reports which skews the dataset towards poorly designed phishing pages.

WhitePhish contributions: Based on the previously mentioned limitations, we collected the WhitePhish dataset, which facilitates similarity-based detection approaches and closes the following gaps: 1) we increased the size of the whitelist to detect more phishing attacks; 2) we collected a phishing webpage corpus with duplicate screenshots removed; 3) instead of only training on phishing pages, we also collected legitimate pages of the targeted websites with different page designs and views (i.e. the training whitelist); 4) we collected a legitimate test set of websites outside the whitelist that limits bias as far as possible (e.g. login forms should also be well represented in this test set).

Fig. 2: A histogram of the minimum distances between the pre-trained VGG16 visual representation of the phishing test set and the corresponding targeted website in the training whitelist (red), in comparison with the ones between the legitimate test set and the training whitelist (blue).

We define zero-day phishing pages as the ones which were not included in the training whitelist of legitimate websites. For that to be satisfied in the collected phishing set, the pixel-wise similarity between the phishing pages and the whitelist’s pages should be small and comparable to the one between other websites’ pages and the whitelist’s ones. To evaluate this, we computed the similarity (defined by the minimum distances between the pre-trained deep VGG16 visual representations) between the phishing test pages and the corresponding spoofed website in the training whitelist. We then compared this to the similarity between the legitimate test pages (which represent other identities) and the whitelist’s ones. If the phishing pages had near-identical counterparts in the whitelist, they would have considerably smaller distances to the whitelist than other pages. However, as can be seen from the two histograms in Figure 2, the distance ranges in both sets are comparable, with high overlap. The histograms also show that the percentage of phishing pages with very small distances to the training whitelist is small. In conclusion, the phishing pages are generally different from the training whitelist’s ones; therefore, the collected phishing test set can be used as a proxy to evaluate the performance on future zero-day phishing pages.

V Constructing the WhitePhish Dataset

In this section, we show how we constructed and analysed WhitePhish.

Fig. 3: A histogram of the 23 most frequent websites that were found in the unique phishing set.

Phishing pages

To collect the phishing examples, we crawled and saved the screenshots of the active verified phishing pages from PhishTank, which yielded 10250 pages. We observed that the same phishing screenshot design could be found with multiple URLs, so we manually inspected the saved screenshots to remove duplicates as well as pages from broken or not-found URLs. An uncorrelated phishing set is important for obtaining an accurate error estimate and for avoiding duplicates across the training and test splits. After filtering, the phishing set contained 1195 phishing pages targeting 155 websites. We observed that phishing pages targeting one website differ in elements’ locations, colors, scales, text languages and designs; therefore, the phishing set can be used to test the model’s robustness to these variations. Additionally, the majority of these phishing pages belonged to a small subset of these 155 websites, as we show in Figure 3. Therefore, even though whitelisting methods cannot detect attacks against non-whitelisted websites, high coverage of phishing pages could be achieved by including a few websites in the whitelist. Also, the most frequent websites belonged to categories such as social media platforms, Software as a Service (SaaS) websites, and banking websites, which is consistent with the APWG reported statistics [APWG] and previous studies [dalgic2018phish].

Targeted legitimate websites’ pages

Besides collecting phishing webpages, we collected legitimate pages from those 155 targeted websites to work as a visual whitelist. Instead of gathering the legitimate pages that correspond to the found designs of the phishing set, we crawled all internal links that were parsed from the HTML file of the homepage and saved the corresponding screenshots. As a result, not all phishing pages have corresponding similar legitimate pages in this whitelist. We saved all webpages from the website to get different page designs, possible login forms, and different languages to make the similarity model trained with this dataset robust against these differences. For these 155 websites, we collected 9363 screenshots, where the number of collected screenshots for each website depends on the number of hyperlinks found in the homepage.

Top-ranked legitimate websites’ pages

Furthermore, we also queried the top 500 ranked websites from Alexa, the top 100 websites from SimilarWeb (https://www.similarweb.com/), in addition to the top 100 websites in categories most prone to phishing such as banking, finance, and governmental services. In total, we collected a list of 400 websites from SimilarWeb. From these lists, we excluded the 155 websites we collected from the phishing pages’ targets, and then we downloaded the screenshots of a set of 57 websites from SimilarWeb (1612 screenshots) and 59 different websites from Alexa (844 screenshots).

Training and test pages split

We mainly have three data components: a training whitelist of legitimate pages, phishing pages targeting the websites in the whitelist, and benign examples of websites outside the whitelist, where the latter two sets are used to test the model. In similarity detection approaches, the objective is to differentiate the phishing pages from other benign examples based on their similarity to the whitelist.

Fig. 4: A histogram of the categories in the legitimate test set.

We used the first legitimate set that we built from the phishing pages’ targets (155 websites) as our main training whitelist since these websites have corresponding phishing pages that could be used to test the model. We also included a subset from the other top-ranked legitimate websites in training to test the robustness of the model when adding new websites as we explain later in our evaluation section.

To test the model, we used the phishing set mentioned earlier. In addition, we constructed a legitimate test set of 683 benign examples from the top-ranked websites’ pages that we crawled. Unlike the training whitelist, where we train on all variations of a website to gain robustness against different potential phishing designs, here we collected an uncorrelated set to obtain an accurate error estimate. Also, the benign examples should simulate a general user’s browsing behaviour, spanning many websites from different categories rather than multiple webpages from the same website; therefore, we selected 3-7 non-redundant screenshots from each website to form the legitimate test set.

In order not to have a biased dataset that might give optimistic or spurious results only because the legitimate and phishing test sets have different designs, the legitimate test set should contain an adequate number of forms and have a similar distribution of categories as phishing pages’ ones (e.g. banks or payment). With a well-balanced test set, we can accurately evaluate the similarity model performance and whether it can find the website identity instead of relying on other unrelated features such as the page layout (e.g. having forms). Accordingly, we inspected the categories in the legitimate test set in a qualitative analysis which we show in Figure 4. As can be observed, we found that nearly 41% of the screenshots contain forms; we believe that these are the most challenging pages to be classified as different from the phishing pages since the latter usually contain forms. We also found that categories most prone to phishing are well represented in the legitimate set which makes our test set unbiased. Finally, the test set has high coverage of possible categories a user might face.

Whitelist analysis

In addition to the whitelist we built from PhishTank, we also examined alternative sources for building whitelists without needing to crawl phishing data. This could help in taking proactive steps to protect websites that might be attacked in the future if the adversary decided to avoid detection by targeting websites other than the ones already known to be vulnerable. For attacks to succeed, attackers have an incentive to target websites that are trusted and known to a large percentage of users; therefore, top-ranked websites have a high potential to be useful in building alternative whitelists. Based on that, we built our analysis on the top 500 websites from Alexa and the top 400 websites from SimilarWeb in categories most prone to phishing. To evaluate whether these lists can represent the targets that might be susceptible to attacks, we computed the intersection between those lists and the PhishTank whitelist. To visualize our analysis, Figure 5 shows cumulative percentages of phishing instances whose targets are included in ascending percentiles of the Alexa list, the SimilarWeb list, and the concatenation of both. We found that including both lists covered around 88% of the phishing instances we collected from PhishTank, which indicates that the top-ranked websites are relevant for constructing whitelists. We also observed that the SimilarWeb list covered more instances than the Alexa list; we attribute this to the fact that the former was built from categories such as banks, SaaS and payment, in addition to the general top websites. We therefore conclude that this categorization approach is more effective in forming potential whitelists since important categories are less likely to change in future phishing attacks.
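The coverage analysis can be illustrated with a minimal sketch. It assumes a ranked list of candidate websites and one target label per phishing instance; the function name `cumulative_coverage` and the site labels are hypothetical, and the curve is computed per rank rather than per percentile for brevity.

```python
from collections import Counter


def cumulative_coverage(ranked_sites, phishing_targets):
    """Percentage of phishing instances whose target is in the top-k of a ranked list.

    Returns one value per rank k = 1..len(ranked_sites).
    """
    counts = Counter(phishing_targets)  # phishing instances per targeted site
    total = len(phishing_targets)
    covered = 0
    seen = set()
    curve = []
    for site in ranked_sites:
        if site not in seen:  # a concatenated list may repeat a site
            seen.add(site)
            covered += counts[site]  # Counter yields 0 for untargeted sites
        curve.append(100.0 * covered / total)
    return curve


# Hypothetical toy data: four phishing instances, three ranked websites.
ranked = ["site_a", "site_b", "site_c"]
targets = ["site_a", "site_a", "site_c", "site_z"]
curve = cumulative_coverage(ranked, targets)
print(curve)  # [50.0, 50.0, 75.0]
```

In this toy example the instance targeting `site_z` is never covered, which mirrors the portion of phishing instances that falls outside both ranked lists.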

Fig. 5: Percentage of phishing instances whose targets are covered by ascending percentiles of Alexa, SimilarWeb and by the concatenation of both lists.

Fig. 6: An overview of WhiteNet. Our model utilizes triplet networks with convolutional sub-networks to simultaneously learn visual similarity between screenshots from the same website (denoted by same shaped symbols), and dissimilarity between screenshots belonging to different websites. Our network has two training stages; first, training is performed with uniform random sampling from all screenshots. Second, training is performed by iteratively finding hard examples according to the model’s latest checkpoint.

VI WhiteNet

As we presented in Figure 1, similarity-based phishing detection is based on whether there is a high visual similarity between a visited webpage and any of the whitelisted websites, while the page has a different domain. If the visited page is found to be not similar enough to the whitelist, it is classified as a legitimate page with a genuine identity. Therefore, our objective can be considered as a similarity learning problem rather than a multi-class classification between the whitelist’s websites and an “other” class. Including a subset of “other” websites in training with a multi-class classification method could cause the model to fail at test time when testing with new websites. Motivated by these reasons and adapting to the harder problem posed by the whitelist size in the dataset, we treated the problem as a similarity learning problem with deep learning using Siamese or triplet networks, which have been successfully used in applications such as face verification [taigman2014deepface], signature verification [dey2017signet], and character recognition [koch2015siamese]. In each of these applications, the identity of an image is compared against a database and the model verifies whether this identity matches any of those in the database. They have also been used in few-shot or one-shot learning tasks [koch2015siamese] by learning a good representation that encapsulates the identity from few learning examples. These reasons make this deep learning paradigm suitable for similarity-based phishing detection.

Our network, WhiteNet, adopts the triplet network paradigm with three shared convolutional networks. We show an overview of the training of WhiteNet in Figure 6 which consists of two stages: in the first stage, training is performed on all database screenshots with a random sampling of examples. The second training stage fine-tunes the model weights by iteratively training on hard examples that were wrongly classified by the model’s last checkpoint according to the distance between the learned embeddings. By learning these deep embeddings, we build a profile for each website that encapsulates its identity, which would enable us to detect zero-day webpages that are not necessarily contained in the whitelist database. The rest of this section illustrates in more detail each aspect of the WhiteNet model.

VI-A Triplet Networks

Siamese networks are two networks with shared weights trained with the goal of learning a feature representation of the input such that similar images have higher proximity in the new feature space than different images. The sub-networks share weights and parameters, and the weight updates are mirrored for each of them. The sub-networks are then joined with a loss function that minimizes the distance between similar objects’ embeddings while maximizing the distance between dissimilar objects’ ones [dey2017signet].

The triplet network, which we used in WhiteNet, extends this approach; it was initially used in the FaceNet system [schroff2015facenet] to learn an embedding for the face verification task. This type of architecture performs the training on three images: an anchor image, a positive image whose identity is the same as the anchor, and a negative image with a different identity than the anchor. The overall objective of the network is to learn a feature space in which the distance between the positive and anchor images’ embeddings is smaller than the distance between the anchor and negative images’ ones. This is achieved by minimizing the loss function

$$L = \sum_{i=1}^{N} \max\left(\left\|f(x_i^a) - f(x_i^p)\right\|_2^2 - \left\|f(x_i^a) - f(x_i^n)\right\|_2^2 + \alpha,\; 0\right)$$

where $f(x)$ represents the embedding space, $(x_i^a, x_i^p, x_i^n)$, $i = 1, \dots, N$, is a set of possible triplets (anchor, positive, and negative), and $\alpha$ is a margin that is enforced between positive and negative pairs, which achieves a relative distance constraint. The loss penalizes the triplet examples in which the distance between the anchor and positive images is not smaller by at least the margin than the distance between the anchor and negative images. In our problem, the positive image is a screenshot of the same website as the sampled anchor, and similarly, the negative image is a screenshot of a website that is different from the anchor.
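The margin-based penalty described above can be sketched in a few lines. This is an illustration only: it assumes squared L2 distances as in the FaceNet formulation, and the function names and toy embeddings are hypothetical rather than taken from our implementation.

```python
def squared_l2(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(triplets, margin):
    """Sum of hinge terms over (anchor, positive, negative) embedding triplets."""
    total = 0.0
    for anchor, positive, negative in triplets:
        d_pos = squared_l2(anchor, positive)  # same website
        d_neg = squared_l2(anchor, negative)  # different website
        # Penalize only when the negative is not farther than the positive by the margin.
        total += max(d_pos - d_neg + margin, 0.0)
    return total


# Hypothetical 2-D embeddings (anchor, positive, negative).
easy = ([0.0, 0.0], [0.1, 0.0], [3.0, 0.0])  # negative is far away: no penalty
hard = ([0.0, 0.0], [1.0, 0.0], [1.0, 0.0])  # negative as close as the positive

print(triplet_loss([easy], margin=2.2))  # 0.0
print(triplet_loss([hard], margin=2.2))  # 2.2
```

The easy triplet already satisfies the relative distance constraint and contributes nothing, which is why the choice of triplets matters for training.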

To produce a representation for screenshots that will be used in the triplet loss, we used a VGG16 network pre-trained on the ImageNet dataset [simonyan2014very]. We used all convolutional layers without the top fully connected layers; we then added a new convolutional layer of size 5x5 with 512 filters and ReLU activations, initialized randomly with He initialization [he2015delving]. Instead of using a fully connected layer after the convolutional layers, we used a Global Max Pooling (GMP) layer that better fits the task of detecting possible local discriminating patterns in patches such as logos. The output of the GMP layer is used as the final embedding vector with 512 dimensions. To match the VGG image size, all screenshots were resized to 224x224 with the three RGB channels.
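To make the role of the GMP layer concrete, the following standalone sketch applies global max pooling to a small feature map stored as nested lists. It is not the Keras layer used in our implementation; the function name and the toy values are hypothetical.

```python
def global_max_pool(feature_map):
    """Collapse an H x W x C feature map (nested lists) into a C-dimensional vector."""
    channels = len(feature_map[0][0])
    return [
        max(cell[c] for row in feature_map for cell in row)
        for c in range(channels)
    ]


# Hypothetical 2 x 2 feature map with 3 channels.
fmap = [
    [[0.1, 0.0, 0.5], [0.9, 0.2, 0.1]],
    [[0.3, 0.7, 0.0], [0.2, 0.1, 0.4]],
]
embedding = global_max_pool(fmap)
print(embedding)  # [0.9, 0.7, 0.5]
```

Each output dimension keeps only the strongest response of one filter, wherever it occurs in the image, so a local pattern such as a logo can dominate the embedding regardless of its position on the page.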

VI-B Triplet Sampling

Since there are a large number of possible combinations of triplets, the training is usually done based on sampling or mining of triplets instead of forming all combinations. However, random sampling could produce a large number of triplets that easily satisfy the condition due to having zero or small loss which would not contribute to training. Therefore, mining of hard examples was previously used in FaceNet to speed-up convergence [schroff2015facenet].

Accordingly, as we show in Figure 6, our training process has two stages. In the first stage, we used random sampling of triplets to cover all combinations. In this stage, each website has the same probability of being selected for either the anchor image or the negative image, so that all websites are covered uniformly. Also, all screenshots belonging to one website have the same probability of selection.
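A minimal sketch of this first-stage sampling is given below. It assumes the training data is available as a mapping from website to a non-empty list of screenshot identifiers, and that anchor websites have at least two screenshots so that the positive differs from the anchor; both the data layout and the function name are hypothetical.

```python
import random


def sample_random_triplet(screenshots_by_site, rng=random):
    """Draw (anchor, positive, negative) uniformly: website first, then screenshot."""
    # Assumption: an anchor website needs at least two screenshots.
    anchor_sites = [s for s, shots in screenshots_by_site.items() if len(shots) >= 2]
    anchor_site = rng.choice(anchor_sites)
    negative_site = rng.choice([s for s in screenshots_by_site if s != anchor_site])
    # Screenshots within a website are equally likely to be selected.
    anchor, positive = rng.sample(screenshots_by_site[anchor_site], 2)
    negative = rng.choice(screenshots_by_site[negative_site])
    return anchor, positive, negative


# Hypothetical identifiers; the prefix encodes the website.
data = {"a": ["a1", "a2", "a3"], "b": ["b1", "b2"], "c": ["c1"]}
print(sample_random_triplet(data, random.Random(0)))
```

Sampling the website before the screenshot keeps websites with many crawled pages from dominating the triplets.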

After training the network with random sampling, we then fine-tuned the model by iteratively finding the hard examples to form a new training subset. We first randomly sample a query set representing one screenshot from each website, then with the latest model checkpoint, we compute the L2 distance between embeddings of the query set and all the rest of training screenshots. In this feature space, the distance between a query image and any screenshot from the same website should ideally be closer than the distance from the same query image to any image from different websites. Based on this, we can find the examples that have the largest error in distance. Hence, we retrieve the one example from the same website that had the largest distance to the query (hard positive example), and the one example from a different website that had the smallest distance to the query (hard negative example). We then form a new training subset by taking the hard examples along with the sampled query set altogether, and we continue the training process with triplet sampling on this new subset.

For the same query set, we repeat the process of finding a new subset of hard examples for a defined number of iterations for further fine-tuning. Finally, we repeat the overall process by sampling a new query set and selecting the training subsets for this new query set accordingly. Sampling different query sets is motivated by avoiding overtraining to a fixed query set which might have outliers or might not be the strongest representation of each website.
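The selection of hard examples for a single query can be sketched as follows. The sketch assumes the embeddings of the remaining training screenshots have already been computed with the latest checkpoint and stored as a mapping from screenshot identifier to a (website, vector) pair; the names and toy values are hypothetical.

```python
import math


def mine_hard_examples(query, query_site, embeddings):
    """Return (hard_positive_id, hard_negative_id) for one query embedding.

    embeddings maps screenshot id -> (site, vector) and excludes the query itself.
    """
    hard_pos, hard_neg = None, None
    max_pos_dist, min_neg_dist = -1.0, math.inf
    for shot_id, (site, vec) in embeddings.items():
        dist = math.dist(query, vec)  # L2 distance in the embedding space
        if site == query_site:
            if dist > max_pos_dist:  # farthest screenshot of the same website
                hard_pos, max_pos_dist = shot_id, dist
        elif dist < min_neg_dist:  # closest screenshot of a different website
            hard_neg, min_neg_dist = shot_id, dist
    return hard_pos, hard_neg


# Hypothetical 2-D embeddings for two websites.
pool = {
    "a1": ("site_a", [0.1, 0.0]),
    "a2": ("site_a", [2.0, 0.0]),
    "b1": ("site_b", [0.5, 0.0]),
    "b2": ("site_b", [5.0, 0.0]),
}
print(mine_hard_examples([0.0, 0.0], "site_a", pool))  # ('a2', 'b1')
```

Repeating this for every query in the sampled query set and collecting the returned screenshots yields the training subset for the next round of triplet sampling.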

This hard example mining framework can be considered as an approximation to a training scheme where a query image is paired with screenshots from all websites and a Softmin function is then applied on top of the pairwise distances, with a supervised label marking the pair formed by the query and the same website’s screenshot as the one with the minimum distance. However, this paradigm would not scale well with the number of websites in the whitelist, and it is therefore not tractable in our case, as a single training example would have 155 pairs (one per whitelist website). The paradigm we use, on the other hand, finds the most violating examples across all training data after each defined number of iterations and then continues the regular triplet training on them.
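For reference, the Softmin-based scheme that the mining procedure approximates could look like the sketch below. It is a hypothetical illustration of that alternative, not something we trained: the negative log-likelihood target and the function names are assumptions.

```python
import math


def softmin(distances):
    """Softmax over negated distances: the smallest distance gets the largest weight."""
    d_min = min(distances)
    exps = [math.exp(-(d - d_min)) for d in distances]  # shifted for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def same_site_nll(distances, same_site_index):
    """Negative log-likelihood of the same-website pair under the Softmin weights."""
    return -math.log(softmin(distances)[same_site_index])


# Hypothetical distances from one query to one screenshot per website.
dists = [0.2, 1.5, 3.0]
probs = softmin(dists)
print(probs.index(max(probs)))  # 0
```

With one distance per whitelisted website, every training example would require a forward pass over 155 pairs, which is the scaling issue noted above.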

VI-C Prediction

At test time, the closest screenshot in distance to a phishing test page targeting a website should ideally be a screenshot of the same website. Therefore, the decision does not require comparing all triplets; it can be made by finding the screenshot with the minimum distance to the query image. To this end, we use the shared network to compute the embeddings, and then we compute the L2 distance between the embedding of the test screenshot and those of all training screenshots. After computing the pairwise distances, the test screenshot is assigned to the website of the screenshot that has the minimum distance. This step could identify the targeted website if the test page is a phishing page.

As depicted in Figure 1, if the minimum distance between a page and the whitelist is smaller than a defined threshold, the page would be classified as a phishing page that tries to impersonate one of the whitelisted websites by having a high visual similarity. On the other hand, if the distance is not small enough, the page would be classified as a legitimate page with a genuine identity. Therefore, we apply a threshold on the minimum distance for classification.
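The decision rule can be summarized in a short sketch. It assumes that embeddings have already been computed, that the whitelist is a list of (website, vector) pairs, and that trusted domains are kept in a set; the function name, the domains and the threshold value are all hypothetical.

```python
import math


def classify_page(domain, embedding, trusted_domains, whitelist, threshold):
    """Return (label, matched_site) for a visited page."""
    if domain in trusted_domains:
        # Pages served from a whitelisted domain are trusted based on their URL.
        return "whitelisted", None
    # Nearest whitelist screenshot in the embedding space (L2 distance).
    site, dist = min(
        ((site, math.dist(embedding, vec)) for site, vec in whitelist),
        key=lambda pair: pair[1],
    )
    if dist < threshold:
        return "phishing", site  # visually close to a whitelisted website
    return "legitimate", None


# Hypothetical whitelist with one 2-D embedding per website.
whitelist = [("site_a", [0.0, 0.0]), ("site_b", [5.0, 5.0])]
trusted = {"site-a.example", "site-b.example"}

print(classify_page("site-a.example", [0.0, 0.0], trusted, whitelist, 1.0))
print(classify_page("evil.example", [0.3, 0.4], trusted, whitelist, 1.0))
print(classify_page("blog.example", [2.5, 2.5], trusted, whitelist, 1.0))
```

The second call returns the matched website alongside the label, which corresponds to identifying the target of a phishing page.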

VII Evaluation

In this section, we present our main experiments along with their results. First, we show the implementation details of WhiteNet and its performance as our finally used model, then we present the results of an ablation study and further experiments to evaluate the robustness of WhiteNet.

VII-A WhiteNet: Final Model

Evaluation metrics

Since our method is based on the visual similarity of a phishing page to websites in the whitelist, we computed the percentage of correct matches between a phishing page and its targeted website. We also calculated the overall accuracy of the binary classification between legitimate test pages and phishing pages at different distance thresholds to calculate the Receiver Operating Characteristic (ROC) curve area.
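As a rough illustration of the second metric, the ROC area can be obtained by sweeping the distance threshold over the minimum distances of both test sets. The sketch below is an assumption about how such a sweep could be written, not the evaluation code used here; the function name `roc_area` and the toy distances are hypothetical.

```python
import numpy as np

def roc_area(phish_dists, legit_dists):
    """ROC area when 'phishing' is predicted for distance < threshold."""
    phish = np.asarray(phish_dists, dtype=float)
    legit = np.asarray(legit_dists, dtype=float)
    all_d = np.concatenate([phish, legit])
    # Use every observed distance as a threshold, plus one value above the
    # maximum so that the curve ends at (FPR, TPR) = (1, 1).
    thresholds = np.concatenate([np.unique(all_d), [all_d.max() + 1.0]])
    tpr = np.array([(phish < t).mean() for t in thresholds])
    fpr = np.array([(legit < t).mean() for t in thresholds])
    # Trapezoidal rule over the (FPR, TPR) points.
    return float(np.sum((fpr[1:] - fpr[:-1]) * (tpr[1:] + tpr[:-1]) / 2.0))

# Toy minimum distances (hypothetical): phishing pages lie closer to the whitelist.
separable = roc_area([1.0, 2.0, 3.0], [5.0, 6.0, 7.0])
inverted = roc_area([5.0, 6.0, 7.0], [1.0, 2.0, 3.0])
```

With perfectly separated distances the area is 1; if legitimate pages were systematically closer to the whitelist than phishing pages, it would fall to 0.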

Implementation details

To train the network, we used the Adam optimizer [kingma2014adam] with momentum values of , and a learning rate of 0.00002 with a decay of 1% every 300 mini-batches, where we used a batch size of 32 triplets. We set the margin () in the triplet loss to 2.2. The first stage of triplet sampling had 21,000 mini-batches, followed by hard examples fine-tuning, which had 18,000 mini-batches divided as follows: we sampled 75 random query sets; for each, we found a training subset that was used for 30 mini-batches, and we repeated this step 8 times. We used 40% of the phishing examples in training and the remaining 60% for the test set. We used the same training/test split in the two phases of training. We tested the model with the legitimate set consisting of 683 screenshots that we collected; this set was only used in testing and was not included in training the model. We used Keras with TensorFlow backend for our implementation and all the following experiments.
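The schedule above can be checked with a short calculation. The sketch assumes a step-wise multiplicative decay that runs across both training stages, which is one plausible reading of "a decay of 1% every 300 mini-batches"; the constant and function names are hypothetical.

```python
BASE_LR = 2e-5      # learning rate reported above
DECAY = 0.01        # 1% decay
DECAY_EVERY = 300   # mini-batches between decay steps

def learning_rate(step):
    """Assumed staircase schedule: shrink by 1% after every 300 mini-batches."""
    return BASE_LR * (1.0 - DECAY) ** (step // DECAY_EVERY)

# Hard-example fine-tuning: 75 query sets, each with 8 rounds of
# subset selection, and 30 mini-batches per selected subset.
QUERY_SETS = 75
ROUNDS_PER_QUERY_SET = 8
BATCHES_PER_SUBSET = 30
hard_batches = QUERY_SETS * ROUNDS_PER_QUERY_SET * BATCHES_PER_SUBSET

STAGE_ONE_BATCHES = 21_000
total_batches = STAGE_ONE_BATCHES + hard_batches
```

The product 75 x 8 x 30 reproduces the 18,000 fine-tuning mini-batches stated in the text, for 39,000 mini-batches overall.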


Using WhiteNet, 81% of the phishing test pages were matched to their correct website using the top-1 closest screenshot. After computing the correct matches, we computed the false positives and true positives at different thresholds (where the positive class is phishing) which yielded a ROC curve area of 0.9879 outperforming all the other examined models and re-implemented state-of-the-art approaches which we show in the following sections.

VII-B Ablation Study

Given the results of WhiteNet, this sub-section investigates the effects of different parameters in the model. We summarize our experiments in Table I, which shows the top-1 match and the ROC area for each model in comparison with WhiteNet. We also show the corresponding ROC curve for each model in Figure 7.

We first evaluated the triplet network by experimenting with a Siamese network as an alternative. We used a similar architecture to the one used in [koch2015siamese] with two convolutional networks and a supervised label of 1 if the two sampled screenshots are from the same website, and 0 otherwise. The network was then trained with binary cross-entropy loss. We also examined both L1 and L2 as the distance function used in the triplet loss. Besides, we inspected different architectural parameters of the shared sub-network, including the added convolution layer and the final layer that is used as the embedding vector, where we experimented with Global Average Pooling (GAP) [lin2013network], a fully connected layer, and taking all spatial locations by flattening the final feature map. In addition to VGG16, we evaluated ResNet50 as well. We also studied the effect of the second training phase of hard examples training by comparing it with a model that was only trained by random sampling.


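Network | Added Layer | Last Layer | Network type | Distance | Training | Phishing used in training | Top-1 Match | ROC Area
VGG16 | Conv 5x5(512) | GMP | Triplet | L2 | 2 stages | 40% | 81.03% | 0.9879
" | " | " | Siamese | " | " | " | 75.31% | 0.8871
" | " | FC (1024) | Siamese | L1 | " | " | 64.8% | 0.655
" | " | " | " | L1 | " | " | 73.91% | 0.9739
" | " | GAP | " | " | " | " | 68.61% | 0.6449
" | " | FC (1024) | " | " | " | " | 78.94% | 0.8517
" | " | Flattening | " | " | " | " | 80.05% | 0.8721
" | Conv 3x3(512) | " | " | " | " | " | 80.19% | 0.9174
" | No new layer | " | " | " | " | " | 79.91% | 0.8703
ResNet | No new layer | " | " | " | " | " | 78.52% | 0.8526
" | " | " | " | " | Random | " | 75.3% | 0.9477
" | " | " | " | " | " | 20% | 74.37% | 0.9899
TABLE I: A summary of the ablation study. Row 1 is the finally used model; cells marked with a ditto mark (") have the same value as the corresponding cell of row 1 (WhiteNet).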
Fig. 7: ROC curves for the experiments in Table I. The legend follows the same order of rows in Table I.

As can be seen from Table I and Figure 7, the triplet network outperformed the Siamese network. Also, the second training phase of hard examples improved the performance, which indicates the importance of this step to reach convergence as previously reported in [schroff2015facenet]. We also show that the used parameters in WhiteNet outperform the other studied parameters.

Motivated by the observation that some phishing pages had poor-quality designs and differed from their corresponding legitimate websites, we studied the robustness of WhiteNet to the ratio of phishing examples seen in training. We thus reduced the training phishing set to only 20% and tested with the other 80%. Although the top-1 match decreased, the ROC area of the binary classification was similar to that of the model trained with 40%, which suggests the model’s ability to generalize to potential future phishing pages without over-fitting to specific designs.

VII-C Robustness with Whitelist Expansion

In addition to the PhishTank whitelist gathered from phishing reports, we studied other sources of whitelists as per the analysis presented earlier in our dataset collection procedure. We then studied the robustness of WhiteNet’s performance when adding new websites to the training whitelist. To that end, we categorized the training websites into three lists (as shown in Figure 8): the PhishTank whitelist, a subset containing 32 websites (418 screenshots) from the SimilarWeb top 400 list, and a subset containing 38 websites (576 screenshots) from the Alexa top 500 list. Since we have phishing pages only for the websites in the PhishTank whitelist, the other two lists can be used in training as distractors to the performance on the phishing examples. When training on one of these additional lists, its websites are not used in the legitimate test set, which is formed from the rest of the websites, yielding test sets of 562 and 573 screenshots when adding the SimilarWeb and Alexa lists respectively.

Fig. 8: The three main lists used in training, the whitelist collected from PhishTank that contains 155 websites, a list of 38 websites from Alexa, and a list of 32 websites from SimilarWeb. The smaller lists are added to training in addition to the list derived from PhishTank.
Experiment Top-1 Match ROC Area
PhishTank whitelist (155 websites) 81.03% 0.9879
Add SimilarWeb list (32+155 websites) 78.3% 0.9764
Add Alexa list (38+155 websites) 78.1% 0.9681
TABLE II: A summary of our experiments when adding more websites from Alexa and SimilarWeb to the training whitelist.

As shown in Table II, when adding new websites to the training whitelist, the performance of the classification (indicated by the ROC area and the top-1 match) decreased. However, this decrease in performance was relatively slight, which indicates the robustness of WhiteNet to adding a few more websites to training.

Method Top-1 Match ROC Area
SIFT 6.55% 0.488
HOG 27.61% 0.58
ORB 24.9% 0.6922
VGG16 51.32% 0.8134
WhiteNet 81.03% 0.9879
TABLE III: A summary of our experiments comparing WhiteNet with alternative approaches.
Fig. 9: t-SNE visualizations of WhiteNet’s embeddings (panels (a) and (b), first row) compared with the pre-trained VGG16 ones as a baseline (panels (c) and (d), second row). Figures (a) and (c) show whitelist’s webpages color-coded by websites. Figures (b) and (d) show whitelist’s webpages (blue) and their phishing pages (red and orange) in comparison with legitimate test pages outside the whitelist (green).

VII-D Comparison with Prior Work

Furthermore, we compared WhiteNet with alternative approaches that we re-implemented on the WhitePhish dataset. As we discussed in the WhitePhish collection procedure, the pages collected for the training whitelist do not necessarily share the designs and layouts of the phishing pages targeting the same websites. This makes methods based on layout similarity and segmentation unsuitable for our problem. Therefore, we compared WhiteNet with image matching by feature descriptors (SIFT, HOG, and ORB) that have been previously used in phishing detection literature. Since deep learning methods have not been used in previous visual similarity detection studies, we also compared WhiteNet with a baseline that uses the features of a VGG16 network pre-trained on ImageNet. A summary of our experiments with these alternative approaches is given in Table III. In all of these experiments, similar to WhiteNet training, 40% of the phishing set was included in the training, and the prediction was performed based on the minimum distance to the training set. As shown in Table III, VGG16 outperformed the other features; however, WhiteNet achieves the highest ROC curve area and top-1 correct match, with a significant performance gain.

VII-E Embeddings Visualization

WhiteNet produces a feature vector (dimensions: 512) for each screenshot that represents an encoding that resulted from minimizing the triplet loss. In this learned feature space, screenshots belonging to the same website should be in closer proximity compared with screenshots belonging to different websites. To verify this, we used t-Distributed Stochastic Neighbor Embedding (t-SNE) [maaten2008visualizing] to reduce the embedding vectors to two dimensions. We show the visualization results in Figure 9, in which we compare the embeddings of WhiteNet with a baseline of pre-trained VGG16 ones. We first visualized the embeddings of the training whitelist’s webpages categorized by websites, as shown in Figure 9(a) and Figure 9(c) for WhiteNet and VGG16 respectively. As can be observed from the figure, the learned embeddings show higher inter-class separation between websites in the case of WhiteNet when compared with VGG16.

Fig. 10: Distance threshold selection. (a) shows a density histogram of the minimum distances between the phishing (red) and legitimate (blue) validation sets to the training whitelist. The vertical green line shows the threshold that achieves an equal error rate between the two fitted Gaussian Probability Density Functions (PDF). (b) shows the true and false positive rates of the test sets over thresholds, the vertical green line marks the threshold from (a).

Additionally, Figure 9(b) and Figure 9(d) show the training whitelist’s pages in comparison with phishing and legitimate test ones for WhiteNet and VGG16 respectively. For successful phishing detection, phishing pages should have smaller distances to the whitelist’s pages than legitimate test pages do, a condition that is better satisfied by WhiteNet than by VGG16. Not only does this experiment demonstrate the efficacy of WhiteNet, but it also shows that using a pre-trained baseline is not adequate for the problem and that further optimization, as done in WhiteNet, is needed.

VII-F Distance Threshold Selection

To determine a suitable distance/similarity threshold for the binary classification between phishing and legitimate test sets, we split the phishing and legitimate hold-out sets into validation and test sets. We computed the minimum distances of both to the training whitelist. Figure 10(a) shows the two density histograms and the fitted Gaussian Probability Density Functions (PDFs) of the minimum distance for the validation sets of both classes. The vertical line (at 8.1) represents a threshold value with an equal error rate for both classes. Additionally, Figure 10(b) shows the true and false positive rates of the test sets over different thresholds, where the indicated threshold is the one deduced from Figure 10(a). As can be observed, the threshold deduced from the validation set is predictive on the test set: the false positive rate stays consistently low for small thresholds and increases gradually after the true positive rate has begun to saturate.
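One way to read the equal-error-rate criterion is to fit a Gaussian to each class of validation distances and pick the threshold at which the phishing miss rate equals the legitimate false alarm rate. The sketch below follows that reading as an assumption; the function names and the toy validation distances are hypothetical, and it presumes that phishing pages have the smaller mean distance.

```python
import math
import numpy as np

def normal_cdf(x, mu, sd):
    """Cumulative distribution function of a Gaussian with mean mu and std sd."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sd * math.sqrt(2.0))))

def eer_threshold(phish_dists, legit_dists):
    """Threshold at which the two fitted Gaussians give equal error rates."""
    mu_p, sd_p = float(np.mean(phish_dists)), float(np.std(phish_dists))
    mu_l, sd_l = float(np.mean(legit_dists)), float(np.std(legit_dists))
    # Solve P(phish > t) = P(legit < t), i.e. (t - mu_p) / sd_p = (mu_l - t) / sd_l.
    return (mu_p * sd_l + mu_l * sd_p) / (sd_p + sd_l)

# Toy validation distances (hypothetical values, not taken from Figure 10).
phish_val = np.array([1.0, 2.0, 3.0])
legit_val = np.array([8.0, 10.0, 12.0])

t = eer_threshold(phish_val, legit_val)
miss_rate = 1.0 - normal_cdf(t, phish_val.mean(), phish_val.std())
false_alarm_rate = normal_cdf(t, legit_val.mean(), legit_val.std())
```

Because the class with the wider spread pushes the threshold away from its own mean, the result is generally not the midpoint between the two means.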

Perturbation | Original Image | Blurring | Brightening | Darkening | Gaussian noise | Salt and Pepper | Occlusion | Shift
Parameters | - | Sigma=1.5 | Gamma=1.3 | Gamma=0.8 | Var=0.01 | Noise=5% | Last quarter | (-30,-30) pixels
Matching (%) | 81.03% | 79.91% | 77.54% | 79.63% | 79.49% | 79.35% | 80.05% | 78.52%
Parameters | - | Sigma=3.5 | Gamma=1.5 | Gamma=0.5 | Var=0.1 | Noise=15% | Second quarter | (-50,-50) pixels
Matching (%) | - | 77.68% | 76.42% | 75.87% | 75.59% | 75.73% | 76.7% | 75.73%
TABLE IV: Top-1 correct match of WhiteNet with respect to possible attacks applying different perturbations (with different parameters in the two rows) to the phishing test examples.

VII-G Security Evaluation

To test the robustness of WhiteNet, we define two models for evasion techniques. In the first one, we study how susceptible WhiteNet is to small changes in the input (e.g. change of color, noise, and position). In the second one, we assume a white-box attack where the adversary has full access to the target model and the dataset used in training (including the closest point to the phishing page). In both models, we assume that the attacker’s goal is to violate the target model’s integrity (in our case: similarity detection of pages belonging to the same website) by crafting phishing pages that differ from the corresponding original pages that might be included in the whitelist. However, we assume that the adversary is motivated not to introduce very obvious changes or noise, so that the phishing page still seems trustworthy and succeeds in luring users.

Performance against hand-crafted perturbations

We studied 7 types of perturbations [yu19iccv] that we applied to the phishing test set (without augmentation in training): blurring, brightening, darkening, Gaussian noise, salt and pepper noise, occlusion by insertion of boxes, and shifting. Our goal is to study the effect of these changes on the detection of similarity, as measured by the matching accuracy. Table IV lists each of these changes along with the corresponding top-1 match. Our findings revealed that the matching accuracy dropped slightly (by up to about 3.5 percentage points at worst) for the imperceptible noise in each perturbation case, while it dropped further (by up to about 5.4 percentage points at worst) for the stronger noise, which we assume is less likely to be used.
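Four of these perturbations can be sketched with NumPy alone. The functions below are illustrative assumptions rather than the code used in the experiments: images are taken to be float arrays in [0, 1] of shape (H, W, 3), gamma is applied as an exponent of 1/gamma so that values above 1 brighten (matching the parameter labels in Table IV), and pixels vacated by a shift are filled with zeros.

```python
import numpy as np

def adjust_gamma(img, gamma):
    """Brighten (gamma > 1) or darken (gamma < 1) an image in [0, 1]."""
    return img ** (1.0 / gamma)

def gaussian_noise(img, var, rng):
    """Add zero-mean Gaussian noise with the given variance, then clip."""
    return np.clip(img + rng.normal(0.0, np.sqrt(var), img.shape), 0.0, 1.0)

def salt_and_pepper(img, amount, rng):
    """Set roughly `amount` of the pixels to black or white, half each."""
    out = img.copy()
    mask = rng.random(img.shape[:2])
    out[mask < amount / 2] = 0.0        # pepper
    out[mask > 1 - amount / 2] = 1.0    # salt
    return out

def shift(img, dy, dx, fill=0.0):
    """Translate the content by (dy, dx) pixels; vacated pixels get `fill`."""
    h, w = img.shape[:2]
    out = np.full_like(img, fill)
    src_y, dst_y = slice(max(0, -dy), min(h, h - dy)), slice(max(0, dy), min(h, h + dy))
    src_x, dst_x = slice(max(0, -dx), min(w, w - dx)), slice(max(0, dx), min(w, w + dx))
    out[dst_y, dst_x] = img[src_y, src_x]
    return out

rng = np.random.default_rng(0)
page = np.full((64, 64, 3), 0.5)        # hypothetical stand-in for a screenshot

brighter = adjust_gamma(page, 1.3)
darker = adjust_gamma(page, 0.8)
noisy = gaussian_noise(page, 0.01, rng)
speckled = salt_and_pepper(page, 0.05, rng)
shifted = shift(page, -30, -30)
```

Each perturbed image would then be embedded and matched against the whitelist exactly like an unperturbed test page.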

Adversarial perturbations

Another direction for evasion attacks is crafting adversarial perturbations with imperceptible noise that would change the model decision when added to the input test points [kurakin2016adversarial]. There is a lot of work towards fixing the evasion problem [biggio2018wild], however, adversarial perturbations are well-known for classification models. In contrast, WhiteNet is based on a metric learning approach that, at test time, is used to compute distances to the training points. We are not aware of any prior adversarial perturbation methods on similarity-based networks and therefore we propose and investigate an adaptation of the adversarial example generation methods to our problem by using the Fast Gradient Sign Method (FGSM) [goodfellow2014explaining] defined as:

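$$\tilde{x} = x + \epsilon \cdot \mathrm{sign}\left(\nabla_{x} J(\theta, x, y)\right)$$

where $\tilde{x}$ is the adversarial example, $x$ is the original example, $y$ is the original example’s target (0 in the triplet loss), $\theta$ denotes the model’s parameters, $\epsilon$ is the noise magnitude, and $J$ is the cost function used in training (triplet loss in WhiteNet).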

Adapting this to our system, we used the phishing test example as the anchor image, sampled an image from the same website as the positive image (from the training whitelist), and an image from a different website as the negative image. We then computed the gradient with respect to the anchor image (the phishing test image) to produce the adversarial example. We experimented with two values for the noise magnitude ($\epsilon$): 0.005 and 0.01; however, the 0.01 noise value is no longer imperceptible and causes noticeable noise in the input (as shown in Table V). We also examined different triplet sampling approaches when generating the adversarial examples. In the first one, we select the positive image randomly from the website’s images. However, since the matching decision is based on the closest distance, in the second approach we select the closest point as the positive image, since it is more critical. We report our results in Table VI, where we show the matching accuracy for each case averaged over 5 trials, as we randomly sample triplets for each example. Our results showed that the matching accuracy dropped to about 72.5% for the 0.005 noise and to about 62.5% for the higher 0.01 noise. We also found that targeting the closest example did not differ from sampling a random positive image. In addition, we tested an iterative approach of adding noise to the closest point at each step, which was comparable to adding noise with a larger magnitude (0.01) in a single step.
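The adaptation can be illustrated on a toy model in which a linear map stands in for the shared CNN, so that the gradient with respect to the anchor has a closed form. Everything in this sketch is an assumption made for illustration: the linear embedding `W`, the squared L2 distances in the loss, the 8-dimensional inputs, and the deliberately violating triplet. With the real network, the gradient would come from automatic differentiation instead.

```python
import numpy as np

def embed(W, x):
    """Toy linear embedding standing in for the shared CNN."""
    return W @ x

def triplet_loss(W, a, p, n, margin):
    """Hinge triplet loss on squared L2 distances (an assumed form)."""
    d_pos = np.sum((embed(W, a) - embed(W, p)) ** 2)
    d_neg = np.sum((embed(W, a) - embed(W, n)) ** 2)
    return max(d_pos - d_neg + margin, 0.0)

def grad_wrt_anchor(W, a, p, n, margin):
    """Analytic gradient of the triplet loss with respect to the anchor input."""
    if triplet_loss(W, a, p, n, margin) == 0.0:
        return np.zeros_like(a)  # hinge inactive: zero gradient
    # d/da (||W(a-p)||^2 - ||W(a-n)||^2) = 2 W^T (W n - W p)
    return 2.0 * W.T @ (embed(W, n) - embed(W, p))

def fgsm(W, a, p, n, margin, eps):
    """One FGSM step on the anchor, keeping pixel values in [0, 1]."""
    g = grad_wrt_anchor(W, a, p, n, margin)
    return np.clip(a + eps * np.sign(g), 0.0, 1.0)

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 8))
anchor = rng.random(8)                           # phishing test example
positive = rng.random(8)                         # same-website whitelist example
negative = np.clip(anchor + 0.01, 0.0, 1.0)      # violating negative, hinge active

margin, eps = 2.2, 0.005
loss_before = triplet_loss(W, anchor, positive, negative, margin)
adversarial = fgsm(W, anchor, positive, negative, margin, eps)
loss_after = triplet_loss(W, adversarial, positive, negative, margin)
```

The step moves each input dimension by at most eps in the direction that increases the triplet loss, which pushes the anchor away from its positive example relative to the negative one.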

TABLE V: Adversarial examples generated with FGSM on the triplet loss with $\epsilon = 0.005$ (first row) and $\epsilon = 0.01$ (second row).
Epsilon ($\epsilon$) | Sampling | Top-1 Match
0.005 | random | 72.52% ± 0.44%
0.005 | closest point | 72.83% ± 0.59%
0.01 | random | 62.54% ± 0.91%
0.002 | closest point (5 iterations) | 64.15% ± 0.51%
0 (original) | - | 81.03%
TABLE VI: A summary of our experiments to evaluate the performance of adversarial examples generated by FGSM.
Epsilon ($\epsilon$) | Sampling | Top-1 Match
0.005 | random | 78.97% ± 0.37%
0.01 | random | 73.10% ± 1.1%
0 (original) | - | 81.03%
TABLE VII: Matching accuracy of adversarial examples after adversarial training.

To test how adversarial training changes the matching accuracy on adversarial examples, we fine-tuned the trained WhiteNet for 3000 mini-batches on the same training data. In each mini-batch, half of the triplets were adversarial examples generated with FGSM with an epsilon value randomly drawn from the range 0.003 to 0.01. After training, we again applied FGSM on the phishing test set using the tuned model. As shown in Table VII, the matching accuracy increased for both the 0.005 (reaching performance comparable to that on the original phishing set) and the 0.01 epsilon values. These results demonstrate that WhiteNet, when trained with perturbed examples, is robust against possible adversarial attacks with slightly added noise.

VIII Discussion

TABLE VIII: Examples of test phishing webpages (first row) that were correctly matched to the targeted websites (closest match from the training set in the second row) with a closest page that has a similar layout but different colors and content.

TABLE IX: Examples of test phishing webpages (first row) that were correctly matched to the targeted websites (closest match from the training set in the second row) despite having large differences in layout and content.

We discuss the implications of the efficacy of WhiteNet by showcasing examples of phishing pages that were correctly detected, and failure modes with both false positive and false negative examples.

VIII-A Evaluating Successful Cases

We categorize the successfully classified phishing pages into three main categories. The first one is the easily classified ones consisting of exact or very close copying of a corresponding legitimate webpage that is contained in the training whitelist. However, our model still showed robustness to small variations such as the text language of login forms (which shows an advantage over text-similarity methods), small advertisements’ images changes, the addition or removal of elements in the page, and changes in their locations. We observed that these pages have approximately a minimum distance in the range of 0-2 to the training set (as shown in the distances’ histogram in Figure 10) and constitute around 25% of the correct matches.

The second category, which is relatively harder than the first one, is the phishing webpages that look similar in style (e.g. location of elements and layout of the page) to training pages, however, they are highly different in content (e.g. images, colors, and text). We show examples of this second category in Table VIII. Similarly, these pages correspond approximately to the distance range of 2-4 in Figure 10 and constitute around 35% of the correct matches.

Finally, the hardest category is the phishing pages showing disparities in design when compared to the training examples as shown in Table IX. These pages had distances to the training set which were higher than 4 and increased according to their differences and they constitute around 40% of the correct matches. For example, the first three columns show a match between pages with different designs and elements’ locations. Also, the fourth phishing page has a pop-up window that partially occludes information and changes the page’s colors. The fifth phishing page is challenging as it does not show the company logo, yet it was correctly matched to the targeted website due to having other similar features. This suggests that WhiteNet captures the look and feel of websites, which makes it have an advantage over previous methods that relied only on logo matching such as [afroz2011phishzoo, dunlop2010goldphish]. The last two pages are highly dissimilar to the matched page except for having the same logo. Even though these examples could arguably be easily recognized as phishing pages by users, they are more challenging to be detected based on similarity and therefore they were excluded in previous studies such as [mao2017phishing]. This analysis shows the ability of WhiteNet to detect the similarity of phishing pages that are partially copied or created with poor quality in addition to phishing pages with no corresponding similar pages in the training whitelist (i.e. zero-day pages) which all are possible attempts to evade detection in addition to the ones we previously discussed. Additionally, our work motivates a follow-up study where the perceptual aspect could be studied to evaluate how likely poorly designed, obfuscated, or perturbed images (as an evasion mechanism) are to succeed in deceiving users [malisa2017detecting].

Fig. 11: Histogram of wrong matches per website. The most frequent 19 websites are shown.

TABLE X: Examples of test phishing webpages (first row) that were matched to the wrong website (closest match from the training set in the second row).

TABLE XI: Examples of test legitimate pages (first row) that are highly similar to pages from the training set (second row).

VIII-B Evaluating Failure Modes

We also analysed the failure modes of the model by analysing the wrong website matches. To this end, we computed a histogram of the wrong matches per website for the top 19 websites with the largest numbers of phishing pages, as we show in Figure 11, where the highest mismatches are for phishing examples belonging to Facebook, Dropbox, Microsoft OneDrive, Microsoft Office, and Adobe. We found that these websites have many phishing pages that have little similarity to the targeted legitimate websites, such as the first three phishing pages targeting Facebook and Microsoft Excel in Table X. We also found some phishing pages that used outdated designs or earlier versions of certain login forms, such as the fourth example in Table X, and were therefore matched to a wrong website. This could be improved by including earlier versions of websites in the training data.

Moreover, the last three examples in Table X show some of the main limitations. Since our whitelist contains a large number of screenshots per website, we have many distractors of potentially similar pages to the query screenshot, such as the fifth and sixth examples in Table X that were matched to similar screenshots from different websites. We also found that some phishing pages have pop-up windows that completely covered the logo and the page’s colors and structure, and were then matched to pages with darker colors such as the last example in Table X. The wrong matches had generally higher distances than the correct matches which could make them falsely classified as legitimate examples.

We also show some examples of legitimate test pages that had high similarity to pages from the training set in Table XI and would be falsely classified as phishing pages based on the threshold in Figure 10. We observed that pages with forms were harder to identify as dissimilar to other pages with forms in the whitelist especially when having similar colors and layout, since they contain few distinguishable and salient elements and they are otherwise similar. We believe that using the screenshot’s text (possibly extracted by OCR), or logo detection by region-based convolution [ren2015faster] could be future possible model optimization directions to help reduce the false positives and also improve the matching of hard examples.

VIII-C Practical Aspects

We here discuss practical considerations for the deployment of our system, such as the required storage space and computation time. First, our system does not require storing all screenshots of the whitelist, as it suffices to store the embedding vectors of screenshots (512-dimensional vectors). Also, the system is computationally feasible since the training whitelist embeddings can be pre-computed, which leaves only the relatively small computations of the query image embedding and the L2 distances at test time. On a typical computer with 8 GB of RAM and an Intel Core i7-8565U 1.80GHz processor, the average time for prediction (computing the embeddings and comparing to the whitelist) was 1.1 ± 0.7 seconds, which decreased to 0.46 ± 0.25 seconds on an NVIDIA Tesla K80 GPU. If further speed-up is needed, the search for the closest point could be optimized. Besides, the decision could be computed only when the user attempts to submit information. We also show in our analysis of possible perturbations that the learned similarity is robust against partial removal of parts of the page, which suggests that a page could be detected even if it was only partially loaded. Regarding the WhitePhish dataset, we point out that the manual work in curating the dataset was mainly for constructing unbiased and uncorrelated test sets; it is less needed in collecting the training whitelist of legitimate websites. This enables the automatic update of the whitelist to add new websites when needed. Nevertheless, detecting duplicates can be automated by finding the closest pages to the newly added one based on pixel-wise features (such as VGG features).

IX Conclusion

In this work, we presented a new framework for phishing detection using visual similarity. We presented a new dataset (WhitePhish) that covers the largest visual whitelist so far (155 websites). To overcome the observed previous limitation, we improved the validity of the dataset by providing a non-duplicated phishing set, a training whitelist having different variations of each website, and a legitimate test set of websites outside the whitelist that reduces bias as far as possible by not restricting on specific page designs. Besides, we performed an analysis of different whitelist sources that provides valuable insights for constructing potential whitelists instead of only inferring them from previous phishing reports.

To detect zero-day phishing pages, the developed model should be able to identify the visual similarity of phishing pages that were not seen in the whitelist. To that end, we proposed WhiteNet that learns a visual profile of websites by learning the similarity between any two webpages belonging to the same website despite having different designs or layouts. We performed a qualitative analysis of the successful cases of WhiteNet and we found that our network identified easy phishing pages (highly similar to pages in the whitelist), and more importantly, phishing pages that were partially copied, obfuscated, or different from the training whitelist’s ones. WhiteNet was found to be robust against a range of the possible evasion attacks that we studied, which makes our model less prone to the fierce arms race between attackers and defenders.

In conclusion, our work introduces important contributions to phishing detection research by learning a robust and proactive visual similarity metric that demonstrates a leap in performance over the state-of-the-art and outperforms prior work with an increase of 56 percentage points in matching accuracy and 30 percentage points in the ROC area under the curve in phishing website detection.

