A First Look at Ad-block Detection –
A New Arms Race on the Web
The rise of ad-blockers is viewed as an economic threat by online publishers, especially those who primarily rely on advertising to support their services. To address this threat, publishers have started retaliating by deploying ad-block detectors, which scout for ad-blocker users and react by restricting their content access, pushing them to whitelist the website, or asking them to disable their ad-blockers altogether. The clash between ad-blockers and ad-block detectors has resulted in a new arms race on the web.
In this paper, we present the first systematic measurement and analysis of ad-block detection on the web. We have designed and implemented a machine learning based technique to automatically detect ad-block detection, and we use it to study the deployment of ad-block detectors on Alexa top-100K websites. The approach is promising, with a precision of 94.8% and a recall of 93.1%. We characterize the spectrum of strategies used by websites for ad-block detection. We find that most publishers use fairly simple passive approaches for ad-block detection. However, we also note that a few websites use third-party services, e.g., PageFair, for ad-block detection and response. These third-party services use active deception and other sophisticated tactics to detect ad-blockers. We also find that the third-party services can successfully circumvent ad-blockers and display ads on publisher websites.
Muhammad Haris Mughees, Zhiyun Qian, Zubair Shafiq, Karishma Dash, Pan Hui
Hong Kong University of Science and Technology, University of California-Riverside, The University of Iowa
The online advertising industry has largely fueled the World Wide Web for many years. According to the Interactive Advertising Bureau (IAB), annual online ad revenues for 2014 totaled $49.5 billion, 15.6% higher than in 2013 [?]. Online advertising plays a critical role in allowing web content to be offered free of charge to end-users, with the implicit assumption that end-users agree to watch ads to support these “free” services. However, online advertising is not without its problems. The economic magnetism of the online advertising industry has made ads an attractive target for various types of abuse, driven by incentives for higher monetary gain. Since publishers are paid on a per-impression or per-click basis, many choose to place ads such that they interfere with the organic content and annoy end-users [?]. Such placements range from autoplay video ads, rollovers, pop-ups, and flash animation ads to the ever-popular homepage takeover with sidebars that follow user scrolling. Another major issue with online advertising is the widespread tracking of users across websites, which raises privacy and corporate surveillance concerns. Several recent studies have shown that ad exchanges aggressively profile users and invade user privacy [?]. Malvertising (using ads to spread malware) is also on the rise [?, ?].
In addition to the above problems, many users simply desire an ad-free web experience, which is much cleaner and smoother. Ad-blockers have therefore become popular in recent years; they can block ads seamlessly without requiring any user input. A wide range of ad-blocking extensions is available for popular web browsers such as Chrome and Firefox [?]. Adblock Plus is the most prominent among these extensions [?]. According to a recent academic study, 22% of the most active residential broadband users of a major European ISP use Adblock Plus [?]. In addition, a recent report [?] estimates that $22 billion will be lost due to ad-blocking in 2015, almost twice the amount estimated in 2014. To the advertising industry and content publishers, ad-blockers are becoming a growing threat to their business model. To combat this, two strategies have emerged: (1) companies such as Google and Microsoft have begun to pay ad-blockers to have their ads whitelisted; and (2) websites have begun to detect the presence of ad-blockers and may refuse to serve any user with an ad-blocker turned on, e.g., Yahoo Mail reportedly did so recently [?].
As not every website is willing or able to pay ad-blockers, the second strategy is a low-cost solution that can be easily deployed. Even though anecdotes exist about websites starting to detect ad-blockers, the scale at which this occurs remains largely unknown. To fill this gap, in this paper we perform the first systematic characterization of the ad-block detection phenomenon. Specifically, we are interested in understanding: (1) how many websites perform ad-block detection; (2) what types of technical approaches are used; and (3) how ad-blockers can counter or circumvent such detection.
Key Contributions. The key contributions of the paper are the following:
We conduct a measurement study of Alexa top-100K websites using a machine learning based approach to identify the websites that use ad-block detection. The approach is promising, with a precision of 94.8% and a recall of 93.1%. The results show that around 300–1100 websites are currently performing ad-block detection (details in §[?]).
In this section, we provide an overview of ad-blockers and ad-block detectors.
The rise of ad-blockers. The issues with online ads have resulted in a proliferation of ad-blocking software. Ad-blocking software (or an ad-blocker) is an effective tool that blocks ads seamlessly, primarily distributed as extensions for web browsers such as Chrome and Firefox [?]. More recently, Apple has also allowed content blocking plugins for Safari on iOS devices [?]. Other popular related tools include Ghostery [?] and DisconnectMe [?]; however, they are primarily focused on protecting user privacy. With respect to functionality, these ad-blockers (1) block ads on websites and (2) protect user privacy by filtering network requests that profile browsing behaviors. Recent reports have shown that the number of users of ad-blocking software has rapidly increased worldwide. According to PageFair, up to 198 million users around the world now use ad-blocking software [?]. According to a recent academic study, 22% of the most active residential broadband users of a major European ISP use Adblock Plus [?]. These ad-blocking users have been estimated to cost publishers more than $22 billion in lost revenue in 2015 [?].
How do ad-blockers work? Ad-blockers eliminate ads through page element removal and web request blocking. For page element removal, ad-blockers use CSS selectors to access the offending elements and remove them. Similarly, for web requests, ad-blockers look for particular URLs and block the ones that belong to advertisers. For both of these actions, ad-blockers depend on filter lists that contain sets of rules (as regular expressions) specifying the domains to block and the element selectors to remove. Various filter lists are available for inclusion in ad-blockers, each serving a different purpose. For example, Adblock Plus by default includes EasyList [?], which provides rules for removing ads from English-language websites. Similarly, Fanboy [?] is another popular list that removes only annoying ads from websites. Additionally, EasyPrivacy [?] helps ad-blockers remove spyware.
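As a rough illustration of request blocking, the following sketch matches URLs against simplified Adblock-style rules. The rule syntax shown is a small, hypothetical subset of real EasyList syntax (which also supports `@@` exception rules, `$` options, element-hiding rules, and more):

```python
import re

def rule_to_regex(rule):
    """Translate a simplified Adblock-style blocking rule into a regex.
    '||domain^' anchors at the start of a (sub)domain; '*' is a wildcard."""
    if rule.startswith("||"):
        body = re.escape(rule[2:].rstrip("^"))
        return r"^https?://([^/]+\.)?" + body
    return re.escape(rule).replace(r"\*", ".*")

def is_blocked(url, rules):
    """Return True if any rule matches the request URL."""
    return any(re.search(rule_to_regex(r), url) for r in rules)

rules = ["||doubleclick.net^", "/banner/*/ads/"]
print(is_blocked("https://ad.doubleclick.net/pixel", rules))  # True
print(is_blocked("https://example.org/article", rules))       # False
```

A real ad-blocker compiles thousands of such rules from its filter lists and applies them to every outgoing request, in addition to hiding matched page elements via CSS selectors.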
The rise of ad-block detection. The widespread use of ad-blockers has prompted a cat-and-mouse game between publishers and ad-blocking software. More specifically, publishers have started to detect whether users are visiting their websites with ad-blocking software enabled. Once detected, publishers notify users to turn off their ad-blocking software. These notifications range from a mild, non-intrusive message integrated into the website content to more aggressive blocking of website content and/or functionality. Figure [?] shows examples of both cases. We note that the aggressive approach prevents users from accessing any website content. To detect the use of ad-blocking software, publishers include scripts in the code of their web pages. When a user with ad-blocking software opens such a website, these scripts typically monitor the visibility of ads on the page to identify the use of ad-blockers. If the scripts find that ads are hidden or removed, publishers take countermeasures according to their policies. It is noteworthy that the strategies used by publishers to detect ad-blockers are evolving.
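One common detection tactic (a general technique, not one attributed to any specific site here) is to inject a “bait” element with ad-like class names and then check whether it was hidden or removed. The sketch below models that check in Python, with a dict standing in for the rendered DOM node; the property names mirror their DOM counterparts:

```python
def adblocker_detected(bait):
    """Return True if the bait element was removed or collapsed.
    `bait` is a dict modeling the rendered node (None if removed)."""
    if bait is None:                              # element removed from the DOM
        return True
    return (bait.get("offsetHeight", 0) == 0      # collapsed to zero height
            or bait.get("display") == "none"      # hidden via display property
            or bait.get("visibility") == "hidden")

# An element-hiding rule matched the bait's ad-like class name:
print(adblocker_detected({"offsetHeight": 0, "display": "none"}))    # True
# Bait rendered normally, so no ad-blocker is present:
print(adblocker_detected({"offsetHeight": 90, "display": "block"}))  # False
```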
In this section, we design and implement our approach for automatically identifying websites that employ ad-block detection. The main premise of our approach is that websites conducting ad-block detection make distinct changes to their web page content for users with ad-blockers as compared to users without them. Our goal is to identify, quantify, and extract such distinctive features, which can then be leveraged for training machine learning models to automatically detect websites that employ ad-block detection.
We want to identify distinctive features that capture the changes made by ad-block detectors to the HTML structure of web pages. To this end, we first conducted pilot studies to test the behavior of websites that employ ad-block detection. Based on these pilot studies, we found that the changes made by ad-block detectors can be categorized into: (1) addition of extra DOM nodes, (2) changes in the style of existing DOM nodes, and (3) changes in the textual content. We also found a few cases where websites completely changed the web page content. In addition, a few websites with ad-block detectors reacted by redirecting users to warning pages. Note that Adblock Plus is installed with the default configuration, which allows acceptable ads [?]. This likely suppresses many ad-block detections and results in underestimating their prevalence. However, since most regular users keep the default configuration, we believe our study represents what most users would observe with regard to ad-block detection. Below, we provide an overview of our proposed features and discuss how they capture the changes made by ad-block detectors.
Node additions. We found that, to show notifications to users with ad-blockers, websites dynamically create and add new DOM nodes. Thus, node additions in the DOM can potentially indicate ad-block detection. We can log the total number of DOM elements inserted in a web page.
Style changes. We found that a few websites include ad-block detection notifications which are in their page content but hidden. If these websites detect the use of ad-blockers, they change the visibility of their notification. To cover such cases, we can log attribute changes to DOM elements of a web page.
Text changes. Other than structural changes, we found that some websites change the textual content (i.e., text-related nodes) in response to ad-blockers. Therefore, we can log changes in the textual content of a web page and the addition of text-related nodes.
Miscellaneous features. In addition to the above-mentioned features, we also consider other features like innerHTML to detect whether the structure is completely changed and URL to detect redirection.
Figure [?] provides an overview of our methodology to automatically measure ad-block detection on the web. We conduct A/B testing to compare the contents of a web page with and without ad-blocking software. To automate this process, we use the Selenium Web Driver [?] to open two separate instances of the Chrome web browser, with and without Adblock Plus (➊). We implemented a custom Chrome browser extension to record changes in the content of web pages during the page load process. Our extension records the structure of the DOM tree, all textual content, and HTML code of the web page (➋). We implemented a feature extraction script to process the collected data and generate a feature vector for each website (➌). We feed the extracted features to a supervised classification algorithm for training and testing (➍). We train the machine learning model using a labeled set of websites with and without ad-block detectors. Below we describe these steps in detail.
Web automation for A/B testing. Using the Selenium Web Driver [?], we implemented a web automation tool to conduct automated measurements. For A/B testing, our tool first loads a website without Adblock Plus and then opens it with Adblock Plus in a separate browser instance. However, we found that many websites host dynamic content that changes at very small timescales. For example, some websites include dynamic images (e.g., logos), which can introduce noise in our A/B testing. Similarly, most news websites update their content frequently, which can also add noise. Without care, we might incorrectly attribute these changes to the ad-blocker or to the ad-block detector used by the publisher. To mitigate the impact of such noise, our tool opens multiple instances of each website in parallel and excludes content that changes across those instances.
Data collection using a custom Chrome extension. To collect data while a web page is loading, we use DOM Mutation Observers [?] to track changes in the DOM (e.g., DOMNodeAdded, DOMAttrModified, etc.). The changes we track include the addition of new DOM nodes or scripts, node attribute changes such as class or style changes, the removal of nodes, and changes in text. We implemented the data collection module as a Chrome extension. The extension is preloaded in the browser instances launched by our web automation tool. As soon as a web page starts loading, the extension attaches an observer to it. Whenever an event occurs, the listener fires and we record the information. For example, we record the identifier, type, value, name, parent nodes, and attributes of the corresponding node. For each attribute change, in addition to the above-mentioned information, we record the name of the attribute that changed (e.g., style or class) along with its old and new values. We also log page-level data such as the complete DOM tree, innerText, and innerHTML.
Node features: # nodes
Attribute features: total changes in style
Text features: bag of words
Feature extraction. We then process the output of the data collector to extract a set of informative features that can distinguish changes due to ad-block detection from other changes. Recall that we load each page multiple times to mitigate noise. Let A denote the data collected with the ad-blocker, and let B and B’ denote the data collected by loading a web page twice without an ad-blocker. We provide details of the feature extraction process below. Table [?] lists all features used in our study.
Node features. For each instance, we extract DOM-related nodes because our pilot experiments revealed that websites using ad-block detection add only DOM-related nodes. More specifically, we extract the list of anchor, div, h1, h2, h3, img, table, p, and iframe nodes for each instance. Once we have a list of DOM nodes for each instance, we compare A vs. B’ and B vs. B’ to obtain the lists of differences between these nodes. We denote these lists as AB’ and BB’. As explained earlier, to remove node differences due to dynamic website content, we cross-validate nodes in AB’ against BB’ using their properties. Our key insight is that if a publisher adds randomized nodes to a web page, they may have different identifiers, but most of their other properties will be similar. Thus, we remove the nodes from AB’ that also appear in BB’.
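This cross-validation step can be sketched as a set difference over stable node properties (the node representation and property set here are illustrative assumptions, not the exact implementation):

```python
# AB' holds node differences between the ad-block load (A) and a clean
# load (B'); BB' holds differences between two clean loads (B vs. B').
def node_key(node):
    """Stable properties used for matching, ignoring randomized identifiers."""
    return (node["tag"], node.get("parent"), node.get("text", ""))

def denoise(ab_diff, bb_diff):
    """Drop nodes from AB' that also appear in BB' (dynamic-content noise)."""
    noise = {node_key(n) for n in bb_diff}
    return [n for n in ab_diff if node_key(n) not in noise]

ab = [{"tag": "div", "parent": "body", "text": "Please disable your ad-blocker"},
      {"tag": "img", "parent": "header", "text": ""}]
bb = [{"tag": "img", "parent": "header", "text": ""}]   # rotating logo: noise
print([n["text"] for n in denoise(ab, bb)])  # ['Please disable your ad-blocker']
```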
Attribute features. For each instance, we extract changes in the style of DOM related nodes. More specifically, we focus on changes to the display-related property of nodes. For instance, we log whether the visibility property of a node changes from hidden to non-hidden. We also log changes to the display property of a node, e.g., the number of changes in height, width, and opacity of nodes. Similar to node features, we compare A, B, and B’ to eliminate attribute changes from AB’ that also appear in BB’.
Text features. We obtain the list of all text nodes in A, B, and B’. Using these lists, we identify pairs of nodes with differing text. We particularly focus on line-level differences rather than character-level differences to mitigate noise (e.g., differences in clock time). We again compare A, B, and B’ to eliminate changes in textual features from AB’ that also appear in BB’.
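A minimal sketch of the line-level comparison, using Python's difflib (an illustrative assumption; the paper does not specify the diff implementation):

```python
import difflib

def changed_lines(text_a, text_b):
    """Lines present in A (ad-block load) but not in B (clean load)."""
    diff = difflib.ndiff(text_b.splitlines(), text_a.splitlines())
    return [line[2:] for line in diff if line.startswith("+ ")]

clean = "Breaking news\nWeather: sunny"
adblock = "Breaking news\nWeather: sunny\nAd-blocker detected!"
print(changed_lines(adblock, clean))  # ['Ad-blocker detected!']
```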
Structural features. We compare differences in the overall page HTML using the cosine similarity metric. If the cosine similarity between A and B/B’ is very low, it indicates significant content change. To check for potential URL redirections, we also track changes in URL.
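The structural comparison can be sketched as bag-of-words cosine similarity over the raw HTML (whitespace tokenization is an illustrative assumption):

```python
import math
from collections import Counter

def cosine_similarity(doc_a, doc_b):
    """Cosine similarity between whitespace-token counts of two documents."""
    a, b = Counter(doc_a.split()), Counter(doc_b.split())
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

# Identical pages score 1.0; a completely rewritten page scores near 0.
print(round(cosine_similarity("<div> ad </div>", "<div> ad </div>"), 2))  # 1.0
```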
Classification model training and testing. We feed the extracted features to a machine learning classifier to automatically detect websites that employ ad-block detection. However, in order to train the classification algorithm, we need a sufficient number of labeled examples of websites that detect ad-blockers (i.e., positive samples) and websites that do not (i.e., negative samples). To get positive samples, we first use a crowd-sourced list of such websites [?]. We manually validated the websites in this list and excluded those that did not detect and respond to ad-blockers. We also manually opened the Alexa top-1000 websites and identified four websites that use ad-block detection. During the manual verification, we found that the response of websites after ad-block detection varies. Most websites detect and respond to ad-blockers on the homepage without waiting for any input from users. In contrast, some websites respond to ad-blockers only when a particular content type is requested (e.g., a video is played) or when the user navigates to other pages. Since it is not practical to automatically identify such requirements, we restrict ourselves to the former category of websites. Also note that some websites include ad-block detection logic but do not respond to ad-blockers. We excluded these websites from the list as well. Overall, we identified a total of 200 positive training samples. Since the vast majority of Alexa top-1000 websites do not deploy ad-block detection, we use them as negative training samples.
Feature: Relative information gain
# text nodes added: 27.89%
# lines added: 18.13%
# nodes added: 17.37%
# characters added: 17.19%
# div nodes added: 13.01%
# height property changed: 10.67%
# display property changed: 8.67%
# styles attribute changed: 7.20%
# images added: 5.82%
In this section, we analyze the extracted features to quantitatively understand their usefulness in identifying ad-block detection. We first visualize the distributions of a few features. Figure [?] plots the cumulative distribution functions (CDFs) of two features. We observe that websites which employ ad-block detection tend to change more lines and add more div elements than other websites. These distributions confirm our intuition that ad-block detectors make distinguishable changes to the web content.
To systematically study the usefulness of different features, we employ the concept of information gain [?], which uses entropy to quantify how our knowledge of a feature reduces the uncertainty in the class variable. The key benefit of information gain over other correlation-based analysis methods is that it can capture non-monotone dependencies. Let $H(X)$ denote the entropy (i.e., uncertainty) of a feature $X$. $H(X)$ is defined as:
$$H(X) = -\sum_{x} P(X = x) \log_2 P(X = x).$$
Let $H(C)$ denote the entropy (i.e., uncertainty) of the binary class variable $C$. Information gain is computed as:
$$IG(C; X) = H(C) - H(C \mid X).$$
We can normalize information gain, also called relative information gain, as:
$$RIG(C; X) = \frac{H(C) - H(C \mid X)}{H(C)}.$$
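The definitions above can be computed directly; this sketch evaluates relative information gain for a discretized feature against the binary class labels:

```python
import math
from collections import Counter

def entropy(values):
    """H(V) = -sum p(v) log2 p(v) over the empirical distribution."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def relative_info_gain(feature, labels):
    """RIG(C; X) = (H(C) - H(C|X)) / H(C)."""
    h_c = entropy(labels)
    # Conditional entropy H(C|X): weighted entropy of labels per feature value.
    h_c_given_x = 0.0
    for x, cnt in Counter(feature).items():
        subset = [c for f, c in zip(feature, labels) if f == x]
        h_c_given_x += (cnt / len(feature)) * entropy(subset)
    return (h_c - h_c_given_x) / h_c if h_c else 0.0

# A feature that perfectly predicts the class has RIG = 1.0:
print(relative_info_gain([1, 1, 0, 0], ["pos", "pos", "neg", "neg"]))  # 1.0
```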
Using this, we can quantify what an input feature tells us about the use of ad-block detection. Table [?] ranks the top features by their information gain. We note that text-based features (number of words changed and number of text nodes added) have the highest information gain, both exceeding 25%. They are followed by node- and style-based features (e.g., number of div elements added, number of nodes for which the height property changed, etc.).
C4.5 Decision Tree: 87.0%, 89.0%, 91.3%
We train machine learning classification models using the labeled set of 1000 negative samples and 200 positive samples. We use the standard k-fold cross-validation methodology to verify the accuracy of the trained models. For this purpose we select k = 5, dividing the data into 5 folds; in each round, four folds are used as the training set while the remaining fold is used for validation. To quantify the classification accuracy of the trained models, we use the standard ROC metrics: precision, recall, and area under the ROC curve (AUC).
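The splitting procedure can be sketched as follows (pure Python for illustration; in practice a library implementation would typically be used):

```python
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    # Each round: one fold held out for testing, the remaining k-1 for training.
    for i in range(k):
        test = folds[i]
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, test

splits = list(k_fold_indices(1200, k=5))        # 1000 negative + 200 positive
print(len(splits))                              # 5
print(len(splits[0][0]), len(splits[0][1]))     # 960 240
```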
We test multiple machine learning models on our data set. We tuned various parameters of each of these models to optimize their classification performance. Table [?] summarizes the classification accuracy of these classifiers. We note that the random forest classifier, which is an ensemble of tree classifiers, clearly outperforms the C4.5 decision tree and naive Bayes classifiers. The random forest classifier achieves 93.1% recall, 94.8% precision, and 96.0% AUC.
To further evaluate the effectiveness of different feature sets in identifying ad-block detection, we conduct experiments using standalone feature sets and then evaluate all possible combinations of them. We divide the features into node features, attribute features, and text features. Among the standalone feature sets, text-based features provide the best classification accuracy. We also observe that combining feature sets improves the classification accuracy; the best performance is achieved when all feature sets are combined.
To gain further intuition into the trained machine learning models, we visualize a pruned version of the decision tree model trained on the labeled data in Figure [?]. As expected from the information gain analysis, a text feature (word difference) is the root node of the decision tree. If there is a positive word difference, the model detects ad-block detection. Similarly, if node visibility is changed, the model detects ad-block detection. It is interesting to note that the top three features in the decision tree belong to different feature categories. This indicates that the feature sets complement each other rather than capturing similar information, which we also observed earlier when evaluating different combinations of features.
We want to analyze the strategies and methods used by publishers for ad-block detection. To this end, we first apply the random forest model to the Alexa top-100K websites to identify ad-block detectors. Our machine learning model found a total of 292 websites that detect and respond to ad-blockers. Table [?] (in the Appendix) lists these 292 ad-block detecting websites along with their Alexa ranks. We note that the vast majority of these websites have relatively low Alexa ranks, likely because (1) the top websites have paid ad-blockers to be whitelisted, or (2) the top websites are worried about losing users if they take an aggressive stance against ad-blocker users. Using additional string-based features (e.g., “Adblock”, “Adblock Plus”), we also found a total of 797 websites that have ad-block detection scripts but do not exhibit visible behaviors, likely due to the default-on acceptable ads in our Adblock Plus extension. It is also possible that such websites are currently tracking the usage of ad-blockers but are not yet ready to move aggressively against users. Overall, we found 1,089 ad-block detecting websites in the Alexa top-100K list. In this section, we focus our attention on the ad-block detecting websites that not only detect ad-blockers but also respond to them.