# A First Look at Ad-block Detection – A New Arms Race on the Web

Muhammad Haris Mughees¹, Zhiyun Qian², Zubair Shafiq³, Karishma Dash², Pan Hui¹
¹Hong Kong University of Science and Technology, ²University of California-Riverside, ³The University of Iowa
###### Abstract

The rise of ad-blockers is viewed as an economic threat by online publishers, especially those who rely primarily on advertising to support their services. To address this threat, publishers have started retaliating by employing ad-block detectors, which scout for ad-blocker users and react to them by restricting their content access and pushing them to whitelist the website or disable ad-blockers altogether. The clash between ad-blockers and ad-block detectors has resulted in a new arms race on the web.

In this paper, we present the first systematic measurement and analysis of ad-block detection on the web. We have designed and implemented a machine learning based technique to automatically detect ad-block detection, and we use it to study the deployment of ad-block detectors on the Alexa top-100K websites. The approach is promising, with a precision of 94.8% and a recall of 93.1%. We characterize the spectrum of strategies used by websites for ad-block detection. We find that most publishers use fairly simple passive approaches for ad-block detection. However, we also note that a few websites use third-party services, e.g., PageFair, for ad-block detection and response. These third-party services use active deception and other sophisticated tactics to detect ad-blockers. We also find that the third-party services can successfully circumvent ad-blockers and display ads on publisher websites.

The online advertising industry has largely fueled the World Wide Web for many years. According to the Interactive Advertising Bureau (IAB), online ad revenues for 2014 totaled $49.5 billion, which is 15.6% higher than in 2013 [?]. Online advertising plays a critical role in allowing web content to be offered free of charge to end-users, with the implicit assumption that end-users agree to watch ads to support these “free” services.

However, online advertising is not without its problems. The economic magnetism of the online advertising industry has made ads an attractive target for various types of abuse, driven by incentives for higher monetary benefits. Since publishers are paid on a per-impression or per-click basis, many publishers choose to place ads such that they interfere with the organic content and annoy end-users [?]. Such ads include autoplay video ads, rollovers, pop-ups, flash animation ads, and the ever-popular homepage takeover with sidebars that follow user scrolling. Another major issue with online advertising is the widespread tracking of users across websites, which raises privacy and corporate surveillance concerns. Several recent studies have shown that ad exchanges aggressively profile users and invade user privacy [?]. Malvertising (using ads to spread malware) is also on the rise [?, ?].

In addition to the above problems, many users simply desire an ad-free web experience, which is much cleaner and smoother. Ad-blockers have therefore become popular in recent years, and they can block ads seamlessly without requiring any user input. A wide range of ad-blocking extensions is available for popular web browsers such as Chrome and Firefox [?]. Adblock Plus is the most prominent of these extensions [?]. According to a recent academic study, 22% of the most active residential broadband users of a major European ISP use Adblock Plus [?]. In addition, a recent report [?] estimates that $22 billion will be lost due to ad-blocking in 2015, almost twice the amount estimated in 2014.

To the advertising industry and content publishers, ad-blockers are a growing threat to their business model. To combat this, two strategies have emerged: (1) companies such as Google and Microsoft have begun to pay ad-blockers to have their ads whitelisted; and (2) websites have begun to detect the presence of ad-blockers and may refuse to serve any user with an ad-blocker turned on, e.g., Yahoo Mail reportedly did so recently [?].

As not every website is willing or able to pay ad-blockers, the second strategy becomes a low-cost solution that can be easily deployed. Even though anecdotes exist about websites starting to detect ad-blockers, the scale at which this occurs remains largely unknown. To fill this gap, in this paper we perform the first systematic characterization of the ad-block detection phenomenon. Specifically, we are interested in understanding: (1) how many websites are performing ad-block detection; (2) what types of technical approaches are used; and (3) how ad-blockers can counter or circumvent such detection.

Key Contributions. The key contributions of the paper are the following:

We conduct a measurement study of the Alexa top-100K websites using a machine learning based approach to identify the websites that use ad-block detection. The approach is promising, with a precision of 94.8% and a recall of 93.1%. The results show that around 300–1100 websites are currently performing ad-block detection (details in §[?]).

We cluster different ad-block detection approaches based on the JavaScripts that are inserted in the websites. The results indicate that there is a spectrum of detection solutions ranging from fairly simple (passive detection) to complex (active deception). We conduct several case studies to illustrate the strengths and limitations of different approaches.

In this section, we provide an overview of ad-blockers and ad-block detectors.

In this section, we design and implement our approach for automatically identifying websites that employ ad-block detection. The main premise of our approach is that websites conducting ad-block detection make distinct changes to their web page content for ad-block users as compared to users without ad-block. Our goal is to identify, quantify, and extract such distinct features that can be leveraged for training machine learning models to automatically detect websites that employ ad-block detection.

Node additions. We found that, in order to show notifications to users with ad-blockers, websites dynamically create and add new DOM nodes. Thus, node additions in the DOM can potentially indicate ad-block detection. We can log the total number of DOM elements inserted in a web page.

Style changes. We found that a few websites include ad-block detection notifications which are in their page content but hidden. If these websites detect the use of ad-blockers, they change the visibility of their notification. To cover such cases, we can log attribute changes to DOM elements of a web page.

Text changes. Other than structural changes, we found that some websites change the textual content (i.e., text-related nodes) in response to ad-blockers. Therefore, we can log changes in the textual content of a web page and the addition of text-related nodes in a web page.

Miscellaneous features. In addition to the above-mentioned features, we also consider other features like innerHTML to detect whether the structure is completely changed and URL to detect redirection.

Figure [?] provides an overview of our methodology to automatically measure ad-block detection on the web. We conduct A/B testing to compare the contents of a web page with and without ad-blocking software. To automate this process, we use the Selenium Web Driver [?] to open two separate instances of the Chrome web browser, with and without Adblock Plus (➊). We implemented a custom Chrome browser extension to record changes in the content of web pages during the page load process. Our extension records the structure of the DOM tree, all textual content, and the HTML code of the web page (➋). We implemented a feature extraction script to process the collected data and generate a feature vector for each website (➌). We feed the extracted features to a supervised classification algorithm for training and testing (➍). We train the machine learning model using a labeled set of websites with and without ad-block detectors. Below we describe these steps in detail.

Web automation for A/B testing. Using the Selenium Web Driver [?], we implemented a web automation tool to conduct automated measurements. For A/B testing, our tool first loads a website without Adblock Plus, and then opens it with Adblock Plus in a separate browser instance. However, we found that many websites host dynamic content that changes at very small timescales. For example, some websites include dynamic images (e.g., logos), which can introduce noise in our A/B testing. Similarly, most news websites update their content frequently, which can also add noise. Without correction, we may incorrectly attribute these changes to the ad-blocker or the ad-block detector used by the publisher. To mitigate the impact of such noise, our tool opens multiple instances of each website in parallel and excludes content that changes across multiple instances.

Data collection using a custom Chrome extension. To collect data while a web page is loading, we use DOM Mutation Observers [?] to track changes in the DOM (e.g., DOMNodeAdded, DOMAttrModified). The changes we track include the addition of new DOM nodes or scripts, node attribute changes such as class or style changes, the removal of nodes, and changes in text. We implemented the data collection module as a Chrome extension. The extension is preloaded in the browser instances that are launched by our web automation tool. As soon as a web page starts loading, the extension attaches an observer to it. Whenever an event occurs, the listener fires and we record the information. For example, we record the identifier, type, value, name, parent nodes, and attributes of the corresponding node. For each attribute change, in addition to the above-mentioned information, we record the name of the attribute that changes (e.g., style or class) and its old and new values. We also log page-level data such as the complete DOM tree, innerText, and innerHTML.

Feature extraction. We then process the output of the data collector to extract a set of informative features that can distinguish changes due to ad-block detection from other changes. Recall that we load each page multiple times to mitigate noise. Let A denote the data collected with an ad-blocker, and let B and B’ denote the data collected by loading a web page twice without an ad-blocker. We provide details of the feature extraction process below. Table [?] lists all features used in our study.

Node features. For each instance, we extract DOM related nodes because our pilot experiments revealed that websites using ad-block detection add only DOM related nodes. More specifically, we extract the list of anchor, div, h1, h2, h3, img, table, p, and iframe nodes for each instance. Once we have a list of DOM nodes for each instance, we compare A vs. B’ and B vs. B’ to obtain the list of differences between these nodes. We denote these lists as the AB’ and BB’ lists. As explained earlier, to remove node differences caused by the dynamic content of websites, we cross-validate nodes in AB’ against BB’ using their properties. Our key idea is that if a publisher adds random nodes to a web page, they may have different identifiers but most of their other properties will be nearly identical. Thus, we remove the nodes from AB’ that also appear in BB’.
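The cross-validation of AB’ against BB’ can be illustrated with a minimal Python sketch. It is illustrative only: it assumes that each node is a dictionary of recorded properties and that two nodes match when their tag, class, and text agree. The `MATCH_KEYS` tuple, the helper names, and the sample nodes are hypothetical.

```python
from typing import Dict, List, Tuple

Node = Dict[str, str]

# Hypothetical choice of properties used to match nodes. The identifier is
# left out on purpose, since injected nodes may carry random identifiers.
MATCH_KEYS = ("tag", "class", "text")


def signature(node: Node) -> Tuple[str, ...]:
    """Return the properties that decide whether two nodes count as the same."""
    return tuple(node.get(key, "") for key in MATCH_KEYS)


def diff_nodes(first: List[Node], second: List[Node]) -> List[Node]:
    """Nodes of `first` whose signature does not occur in `second`."""
    seen = {signature(node) for node in second}
    return [node for node in first if signature(node) not in seen]


# Made-up sample data: one load with an ad-blocker (a), two loads without.
header = {"tag": "div", "id": "hdr", "class": "header", "text": "News"}
a = [
    header,
    {"tag": "div", "id": "r99", "class": "promo", "text": "Deal A"},
    {"tag": "div", "id": "n1", "class": "notice",
     "text": "Please disable your ad-blocker"},
]
b = [header, {"tag": "div", "id": "r17", "class": "promo", "text": "Deal A"}]
b_prime = [header, {"tag": "div", "id": "r42", "class": "promo", "text": "Deal B"}]

ab_list = diff_nodes(a, b_prime)         # AB': candidate ad-block changes
bb_list = diff_nodes(b, b_prime)         # BB': changes due to dynamic content
filtered = diff_nodes(ab_list, bb_list)  # AB' without nodes also seen in BB'
print([node["class"] for node in filtered])
```

In this toy example the rotating promo node appears in both AB’ and BB’ despite its changing identifier, so only the notification node remains after filtering.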

Attribute features. For each instance, we extract changes in the style of DOM related nodes. More specifically, we focus on changes to the display-related property of nodes. For instance, we log whether the visibility property of a node changes from hidden to non-hidden. We also log changes to the display property of a node, e.g., the number of changes in height, width, and opacity of nodes. Similar to node features, we compare A, B, and B’ to eliminate attribute changes from AB’ that also appear in BB’.
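As a rough illustration of the attribute features, the sketch below inspects the old and new values of a recorded `style` attribute. It assumes that styles are logged as inline CSS strings, and it treats `display: none` as hidden in addition to `visibility: hidden`, which is an assumption of this example. All function names are hypothetical.

```python
from typing import Dict, List, Sequence


def parse_style(style: str) -> Dict[str, str]:
    """Parse an inline CSS string such as 'display: none; height: 0px'."""
    props: Dict[str, str] = {}
    for declaration in style.split(";"):
        if ":" in declaration:
            name, value = declaration.split(":", 1)
            props[name.strip().lower()] = value.strip().lower()
    return props


def is_hidden(props: Dict[str, str]) -> bool:
    """Assumption: a node is hidden if visibility is hidden or display is none."""
    return props.get("visibility") == "hidden" or props.get("display") == "none"


def became_visible(old_style: str, new_style: str) -> bool:
    """True when a node goes from hidden to non-hidden."""
    return is_hidden(parse_style(old_style)) and not is_hidden(parse_style(new_style))


def changed_properties(old_style: str, new_style: str,
                       names: Sequence[str] = ("height", "width", "opacity")) -> List[str]:
    """Display-related properties whose value differs between old and new style."""
    old, new = parse_style(old_style), parse_style(new_style)
    return [name for name in names if old.get(name) != new.get(name)]


# Made-up attribute change for a notification that is revealed on detection.
old = "display: none; height: 0px; opacity: 0"
new = "display: block; height: 120px; opacity: 1"
print(became_visible(old, new), changed_properties(old, new))
```

Counting such events per page would yield features of the kind described above, e.g., the number of nodes whose visibility or height changed.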

Text features. We get the list of all text nodes in A, B, and B’. Using these lists, we identify pairs of nodes with different text. We focus on line-level differences rather than character-level differences to mitigate noise (e.g., differences in clock time). We again compare A, B, and B’ to eliminate changes in textual features from AB’ that also appear in BB’.
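A line-level comparison of this kind can be sketched with Python's standard `difflib` module. This is an illustrative sketch, not the tool's implementation: it assumes the page text is available as plain strings, and the helper names and sample texts are hypothetical.

```python
import difflib
from typing import List


def clean_lines(text: str) -> List[str]:
    """Split text into stripped, non-empty lines."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def added_lines(base: str, other: str) -> List[str]:
    """Lines of `other` that a line-level diff reports as new relative to `base`."""
    diff = difflib.ndiff(clean_lines(base), clean_lines(other))
    return [entry[2:] for entry in diff if entry.startswith("+ ")]


# Made-up page texts: the headline rotates between loads (noise), while the
# notice appears only when the ad-blocker is enabled.
b_prime = "Top stories\nHeadline Y\nFooter"
b = "Top stories\nHeadline X\nFooter"
a = "Top stories\nHeadline X\nPlease disable your ad-blocker to continue\nFooter"

ab_lines = added_lines(b_prime, a)   # AB': text differences with ad-blocker
bb_lines = added_lines(b_prime, b)   # BB': text differences between plain loads
kept = [line for line in ab_lines if line not in bb_lines]

features = {
    "lines_changed": len(kept),
    "words_changed": sum(len(line.split()) for line in kept),
}
print(features)
```

Here the rotating headline is reported in both AB’ and BB’ and is therefore discarded, leaving one changed line with six words.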

Structural features. We compare differences in the overall page HTML using the cosine similarity metric. If the cosine similarity between A and B/B’ is very low, it indicates significant content change. To check for potential URL redirections, we also track changes in URL.
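The cosine similarity between two page snapshots can be sketched as follows. How the HTML is turned into vectors is not specified above, so this example assumes a simple term-frequency vector over alphanumeric tokens; the tokenizer, function names, and sample pages are hypothetical.

```python
import math
import re
from collections import Counter


def term_vector(html: str) -> Counter:
    """Assumed representation: counts of lowercase alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9_]+", html.lower()))


def cosine_similarity(html_a: str, html_b: str) -> float:
    """Cosine of the angle between the two term-frequency vectors."""
    va, vb = term_vector(html_a), term_vector(html_b)
    dot = sum(va[token] * vb[token] for token in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(count * count for count in va.values()))
    norm_b = math.sqrt(sum(count * count for count in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up snapshots: a normal page, a slightly updated reload, and a page
# whose content was replaced by an ad-block notice.
page = "<html><body><div class='article'>Breaking news story text</div></body></html>"
reload = "<html><body><div class='article'>Breaking news story text updated</div></body></html>"
wall = "<html><body><h1>Ad-blocker detected</h1><p>Please whitelist this site</p></body></html>"

print(cosine_similarity(page, reload) > cosine_similarity(page, wall))
```

A page that is largely replaced by a notice shares few tokens with the original, so its similarity to B or B’ drops well below that of two ordinary reloads.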

In this section, we analyze the extracted features to quantitatively understand their usefulness in identifying ad-block detection. We first visualize the distributions of a few features. Figure [?] plots the cumulative distribution functions (CDFs) of two features. We observe that websites which employ ad-block detection tend to change more lines and add more div elements than other websites. These distributions confirm our intuition that ad-block detectors make distinguishable changes to the web content.

To systematically study the usefulness of different features, we employ the concept of information gain [?], which uses entropy to quantify how our knowledge of a feature reduces the uncertainty in the class variable. The key benefit of information gain over other correlation-based analysis methods is that it can capture non-monotone dependencies. Let $H(X)$ denote the entropy (i.e., uncertainty) of feature $X$. $H(X)$ is defined as:

$$H(X) = -\sum_i p_i \log p_i$$

where $p_i$ is the probability that $X$ takes its $i$-th value. Let $H(Y)$ denote the entropy (i.e., uncertainty) of the binary class variable $Y$. Information gain is computed as:

$$IG(Y|X) = H(Y) - H(Y|X).$$

We can normalize information gain, also called relative information gain, as:

$$\frac{H(Y) - H(Y|X)}{H(Y)}.$$

Using this, we can quantify what an input feature tells us about the use of ad-block detection. Table [?] ranks the top 10 features based on their information gain. We note that text-based features (number of words changed and number of text nodes added) have the highest information gain, both exceeding 25%. They are followed by node and style based features (e.g., number of div elements added, number of nodes for which the height property is changed, etc.).
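For concreteness, relative information gain can be computed for a discrete feature as in the following sketch. It assumes the feature has already been discretized and uses base-2 logarithms; the choice of base cancels out in the normalized form. The function names and toy data are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, Hashable, List, Sequence


def entropy(values: Sequence[Hashable]) -> float:
    """H = -sum_i p_i log2 p_i over the empirical distribution of `values`."""
    total = len(values)
    return -sum((count / total) * math.log2(count / total)
                for count in Counter(values).values())


def conditional_entropy(xs: Sequence[Hashable], ys: Sequence[Hashable]) -> float:
    """H(Y|X): entropy of Y within each value of X, weighted by P(X = x)."""
    groups: Dict[Hashable, List[Hashable]] = {}
    for x, y in zip(xs, ys):
        groups.setdefault(x, []).append(y)
    return sum(len(group) / len(xs) * entropy(group) for group in groups.values())


def relative_information_gain(xs: Sequence[Hashable], ys: Sequence[Hashable]) -> float:
    """(H(Y) - H(Y|X)) / H(Y), or 0.0 when Y carries no uncertainty."""
    h_y = entropy(ys)
    if h_y == 0:
        return 0.0
    return (h_y - conditional_entropy(xs, ys)) / h_y


# Toy data: the class is 1 only for the middle feature value, a non-monotone
# dependency that the feature nevertheless explains completely.
feature = [0, 1, 2, 0, 1, 2]
label = [0, 1, 0, 0, 1, 0]
print(relative_information_gain(feature, label))
```

The toy data shows why the measure suits non-monotone dependencies: the label does not increase or decrease with the feature value, yet knowing the feature removes all uncertainty about the label.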

We train machine learning classification models using the labeled set of 1000 negative samples and 200 positive samples. We use the standard $k$-fold cross validation methodology to verify the accuracy of the trained models. For this purpose we select $k = 5$ and divide the data into 5 folds, where one fold is used as the training set while the rest of the folds are used for verification. To quantify the classification accuracy of the trained models, we use standard ROC metrics such as precision, recall, and area under the ROC curve (AUC).

$$\text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}}$$

$$\text{Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}$$
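As a small worked example of these two definitions, the sketch below computes precision and recall from binary labels, where 1 marks a website that uses ad-block detection. The function name and the sample labels are made up for illustration.

```python
from typing import Sequence, Tuple


def precision_recall(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float]:
    """Precision and recall for the positive class (label 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for truth, pred in pairs if truth == 1 and pred == 1)
    fp = sum(1 for truth, pred in pairs if truth == 0 and pred == 1)
    fn = sum(1 for truth, pred in pairs if truth == 1 and pred == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up labels: 3 true positives, 1 false positive, 1 false negative.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
print(precision_recall(y_true, y_pred))
```

With three true positives, one false positive, and one false negative, both precision and recall equal 3/4.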

We test multiple machine learning models on our data set. We tuned various parameters of each of these models to optimize their classification performance. Table [?] summarizes the classification accuracy of these classifiers. We note that the random forest classifier, which is a combination of tree classifiers, clearly outperforms the C4.5 decision tree and the naive Bayes classifiers. The random forest classifier achieves 93.1% recall, 94.8% precision, and 96.0% AUC.

To further evaluate the effectiveness of different feature sets in identifying ad-block detection, we conduct experiments using standalone feature sets and then evaluate all of their possible combinations. We divide the features into node features, attribute features, and text features. Among the standalone feature sets, text-based features provide the best classification accuracy. We also observe that combining feature sets improves the classification accuracy. The best classification performance is achieved when all feature sets are combined.

To gain further intuition from the trained machine learning models, we visualize a pruned version of the decision tree model trained on labeled data in Figure [?]. As expected from the information gain analysis, we note that a text feature (words difference) is the root node of the decision tree. If there is a positive word difference, the model detects ad-block detection. Similarly, if node visibility is changed, the model detects ad-block detection. It is interesting to note that the top three features in the decision tree belong to different feature categories. This indicates that different feature sets complement each other, rather than capturing similar information, which we also observed earlier when evaluating different combinations of features.

Our goal here is to characterize how different ad-block detection strategies operate under the hood. We cluster ad-block detection strategies based on their JavaScript code similarity. Our analysis allows us to measure the popularity of specific strategies and third-party ad-block detection services, e.g. PageFair. The result of the analysis will also help us design countermeasures against the state-of-the-art ad-block detectors.

As a first step, we collect the JavaScript code of all websites that employ ad-block detection. Analyzing the functionality of JavaScript code is non-trivial because the code can be packed inside functions such as eval. To overcome this issue, we leverage the fact that the code needs to unpack itself before execution. We attach a debugger between the Chrome V8 JavaScript engine [?] and the web pages. Specifically, we observe the script.parsed function, which is invoked when eval is called or when new code is added with <iframe> or <script> tags. We implement the debugger as a Chrome extension, collect all JavaScript snippets parsed on a web page, and identify the snippet responsible for ad-block detection.
