The Privacy Policy Landscape After the GDPR

Thomas Linden, University of Wisconsin, E-mail: tlinden2@wisc.edu · Hamza Harkous, École Polytechnique Fédérale de Lausanne, E-mail: hamza.harkous@gmail.com · Kassem Fawaz, University of Wisconsin
Abstract

Every new privacy regulation brings along the question of whether it results in improving privacy for users or whether it creates more barriers to understanding and exercising their rights. The EU General Data Protection Regulation (GDPR) is one of the most demanding and comprehensive privacy regulations of all time. Hence, a few months after it went into effect, it is natural to study its impact on the landscape of privacy policies online. In this work, we conduct the first longitudinal, in-depth, and at-scale assessment of privacy policies before and after the GDPR. We gauge the complete consumption cycle of these policies, from the first user impressions until the compliance assessment. We create a diverse corpus of 3,086 English-language privacy policies for which we fetch the pre-GDPR and the post-GDPR versions.
Via a user study with 530 participants on Amazon Mechanical Turk, we discover that the visual presentation of privacy policies has slightly improved in limited data-sensitive categories in addition to the top European websites. We also find that the readability of privacy policies suffers under the GDPR, due to almost 30% more sentences and words, despite efforts to reduce the reliance on passive sentences.
We further develop a new workflow for the automated assessment of requirements in privacy policies, building on automated natural language processing techniques. We first use that workflow to perform ambiguity assessment of the policies’ content. We find evidence for positive changes triggered by the GDPR, with the ambiguity level, averaged over 8 metrics, improving in over 20.5% of the policies. Finally, we show that privacy policies cover more data practices, particularly around data retention, user access rights, and specific audiences, and that an average of 15.2% of the policies improved across 8 compliance metrics. Our analysis, however, reveals a large gap that exists between the current status quo and the ultimate goals of the GDPR.


1 Introduction

For more than two decades since the emergence of the World Wide Web, the “Notice and Choice” framework has been the governing practice for the disclosure of online privacy practices. This framework follows a market-based approach of voluntarily disclosing the privacy practices and meeting the fair information practices [21]. The EU’s recent General Data Protection Regulation (GDPR) promises to drastically change this privacy landscape. As the most sweeping privacy regulation so far, the GDPR requires information processors, across all industries, to be transparent and informative about their privacy practices.

The GDPR is not the first regulatory attempt at improving the transparency of companies' privacy practices. In the past, regulatory bodies have targeted specific industries, such as the healthcare industry (e.g., with the Health Insurance Portability and Accountability Act (HIPAA) of 1996 [28]) and the financial industry (e.g., with the Gramm-Leach-Bliley Act (GLBA) of 1999 [8]).

Researchers have conducted comparative studies of how privacy policies change over time, particularly in the light of such regulations [4, 2, 1, 22]. Interestingly, the main outcomes of these studies have been consistent: (1) the percentage of websites with privacy policies has been growing, (2) the detail-level and descriptiveness of these policies have increased, and (3) the readability and clarity of the policies have suffered.

The GDPR aims to address the shortcomings of previous regulations by going further than any of them did. One of its distinguishing features is that non-complying entities can face heavy fines, which can reach 20 million Euros or 4 percent of annual global turnover, whichever is higher. Companies and service providers raced to change their privacy notices by May 2018 to comply with the new regulation [7]. With the avalanche of updated privacy notices that users had to accommodate, a natural question follows:

What is the impact of the GDPR on the landscape of online privacy policies?

Researchers have recently started looking into this question by evaluating companies' behavior in the light of the GDPR. Their approaches, however, are limited to a very small number of websites (at most 14) [5, 26]. Concurrent to our work, Degeling et al. [7] performed the first large-scale study (covering 6,579 EU websites) focused on the evolution of cookie consent notices, which have been hugely reshaped by the GDPR. They also touched upon the growth of privacy policies, finding that the percentage of sites with privacy policies has grown by 4.9%.

Still, these studies have not provided a comprehensive answer to the above question when it comes to privacy policies. We answer this question by presenting the first at-scale, longitudinal study of privacy policies' content in the context of the GDPR. We develop an automated pipeline for the collection and analysis of a set of 3,086 English-language privacy policies, comparing their Pre-GDPR and Post-GDPR versions. These policies cover the privacy practices of websites from different topics, regions, and popularity levels. We approach the problem by measuring the change effected on the entire experience of users interacting with privacy policies.

We break down this experience into five stages:

  1. Presentation: Are users encouraged to read privacy policies in the first place?

  2. Readability: If they read, how easy is the policy to understand?

  3. Coverage: Are the policies covering the set of topics they are supposed to address?

  4. Ambiguity: If users understood the text, how ambiguous is the policy about the data practices?

  5. Compliance: Finally, how well do the privacy policies cover the compliance expectations set by regulators?

To quantify the progress in privacy policies' presentation, we gauge the change in users' perception of their interfaces via a user study involving 530 participants on Amazon Mechanical Turk. We also quantify the structural and visual changes in the policies by analyzing their HTML content. Next, we study the change in the policies' readability using standard metrics, and we quantify the evolution in their length.

Unlike previous longitudinal studies that either relied on manual investigation or on heuristics-based search queries [1, 2, 7], we build on the recent trend of automated semantic analysis of privacy policies, and we develop a total of 25 advanced, in-depth queries that allow us to assess the evolution of content among the set of studied policies. We use this approach, inspired by goal-driven requirements engineering [30], to evaluate the policies' topic coverage and to assess the different aspects of ambiguity present in the policies. We also use it to codify several GDPR compliance metrics, namely those provided by the UK Information Commissioner's Office (ICO). We conduct this analysis by building on top of the Polisis framework (https://pribot.org), a recent system developed for the automated analysis of privacy policies [9]. Following this approach, we reach a deeper level of semantic understanding that keyword-based approaches fall short of.

Our contributions in this work can be summarized as follows:

  1. We provide evidence, via our user study, of an improvement in the visual clarity and simplicity of the policies' presentation for websites in data-sensitive categories (4.1% in health, 8.5% in adult categories) and for top websites in the European region (3.6%). Further, we demonstrate that the policies use more structured content than before.

  2. Our readability assessment shows that reading privacy policies has become a more demanding task, with the text featuring 28% more words and 33% more sentences on average, resulting in lower readability scores (e.g., Flesch-Kincaid indicating an additional level of education required). However, service providers are noticeably exerting effort to formulate their policies with more active sentences and clearer subjects.

  3. We find a noticeable improvement in the coverage of topics like data retention (6.3%), handling special audiences (6.5%), and user access rights (7.8%) in the policies.

  4. Our ambiguity analysis across eight metrics covering data collection, sharing, and retention shows that 20.5% of the policies had improved (i.e., lower) ambiguity, while 14.3% had worse values. This shows a slight trend of providers paying special attention to using less vague language under regulatory scrutiny; other providers are attempting to cover more practices in their policies at the expense of specificity.

  5. Finally, we demonstrate that considerable effort remains to be done around satisfying compliance recommendations. Studying 8 aspects in that context, we find that although more high-level topics are covered, the detailed analysis shows a gap in fulfilling the requirements. Still, the GDPR's effect is evident: significantly more companies improved (15.2% on average) on these metrics than worsened (7.0%).

2 GDPR Background

As the most comprehensive privacy regulation to date, the General Data Protection Regulation (GDPR), passed on April 14, 2016, and enforced on May 25, 2018, is the European Union’s approach to online privacy. The GDPR specifies a set of organizational requirements to protect the privacy of the users’ data. In this context, the GDPR defines four entities: data subjects, data controllers, data processors, and third parties. The data subjects are the users of the information systems from which data is collected. The data controller is typically the service provider (e.g., website or mobile app) with a vested interest in receiving and processing the user data. A data controller might employ a processor to process the data on its behalf. Finally, the data controller might authorize the third party (e.g., analytics agency) to process some of the user’s data.

Chapter III of the GDPR describes the rights of the data subjects; the first (article 12) is the right to be informed about the service provider’s privacy practices “in a concise, transparent, intelligible and easily accessible form, using clear and plain language.” The service provider has to communicate to the user their practices regarding data collection and sharing (articles 13 and 14) as well as the rights of users associated with data collection and processing (articles 15-22).

Under the GDPR, the service provider has to inform the user about the contact information of the controller, the purposes for data collection, the recipients of shared data, retention period and the types of data collected. Furthermore, the service provider has to promptly notify the users about updates in their privacy practices. Articles 13 and 14 make a distinction between data collected directly from the user or obtained indirectly. The service providers have to inform the users about the source and type of information when obtained indirectly. Articles 15-22 list the rights of users regarding data collection and processing. These rights include the right of access, rights of rectification and erasure, right to the restriction of processing, right to data portability, and right to object.

Companies have reacted to the enforcement of the GDPR by modifying their privacy policies to meet the requirements above. In this paper, we focus on the requirements set in chapter III of the GDPR and study whether and how privacy policies meet those requirements.

3 Creation of Policies Dataset

In the following, we describe our methodology to create the corpus of privacy policies from English-language websites before and after the GDPR enforcement.

Website Selection

Our methodology aims at selecting websites that exhibit a topical and geographical mix. Hence, we used the Alexa service (http://alexa.com/) to obtain the top links in each of 16 Alexa categories, spanning different topics (e.g., adult, arts, business, regional). We amended these categories by considering the subcategories of the regional category (e.g., North America, Middle East, Europe). For each of the resulting 25 categories, we took the 500 top visited websites. We did not limit our selection to the top websites in Europe as we wanted to gauge the impact of the GDPR beyond the European population. This step resulted in a set of 11,440 URLs, of which 8,877 were unique.

Retrieving Latest Policies

Next, we obtained the most recent privacy policy link for each URL in our starting set. We conducted this step over July and August 2018 to ensure a minimal grace period after the GDPR enforcement date. As shown in Fig. 1, we engaged in a multi-stage methodology to get the privacy policy link. First, we automatically crawl the home page of the website. We identify a set of candidate privacy policy links on the home page based on regular expressions (e.g., the presence of words like privacy, statement, notice, or policy in the URL or the title). Then we crawl their HTML using the Selenium framework (https://www.seleniumhq.org/) and a headless Chrome browser. We use Boilerpipe [14] to get the cleaned HTML of the webpage, without the unnecessary components (e.g., headers and footers). We extract the body text using the Beautiful Soup library (https://www.crummy.com/software/BeautifulSoup/). We developed a custom "Is Policy?" classifier, described below, which we use to decide whether the text belongs to a valid English-language policy. If we still do not find a valid policy URL, we query Google's Custom Search API (https://developers.google.com/custom-search/json-api/v1/overview) with a query of the form "site:example.com privacy policy" to find candidate links on the same domain. We then pass a maximum of 3 such links (if any) through the "Is Policy?" classifier to find the valid ones. If no link is valid, we label that policy URL as not found. Hence, we end up with a maximum of one privacy policy URL per website.
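As an illustration of the first stage, the sketch below extracts candidate privacy-policy links from a homepage using a regular expression over anchor URLs and titles. It uses plain requests instead of the Selenium/headless-Chrome setup described above, and the function name and example domain are ours:

```python
import re
import requests
from bs4 import BeautifulSoup

# Heuristic keywords matched in a link's URL or anchor text
# (mirrors the regular-expression stage described above).
PRIVACY_PATTERN = re.compile(r"privacy|statement|notice|policy", re.IGNORECASE)

def find_candidate_links(homepage_url):
    """Return candidate privacy-policy URLs found on a homepage."""
    html = requests.get(homepage_url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    candidates = []
    for anchor in soup.find_all("a", href=True):
        href = anchor["href"]
        text = anchor.get_text(" ", strip=True)
        if PRIVACY_PATTERN.search(href) or PRIVACY_PATTERN.search(text):
            candidates.append(requests.compat.urljoin(homepage_url, href))
    return candidates

print(find_candidate_links("https://example.com"))
```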

Figure 1: Our methodology of retrieving the policy URLs.

The “Is Policy?” classifier consists of two modules. The first is a language detection module, using langid.py [20], that labels non-English websites as invalid. The second module is a one-layer Convolutional Neural Network (CNN), whose input is a vector of tokenized words and whose output is the probability that the input text belongs to a privacy policy. The data used to train the classifier is composed of (1) a set of 1000 privacy policies labeled as valid from the ACL/COLING 2014 privacy policies' dataset released by Ramanath et al. [23] and (2) an invalid set consisting of the text from 1000 web pages, fetched from random links within the homepages of the top 500 Alexa websites. We ensured that the latter pages did not have any of the keywords associated with privacy policies in their URL or title. The data was split into an 80% training set and a 20% testing set, and the classifier yielded a 99.09% accuracy on the testing set. The details of the architecture can be found in Appendix A.
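A minimal sketch of the two-module classifier follows: langid.py filters out non-English text, and a one-layer CNN outputs the probability that the text is a privacy policy. The hyperparameters and the tokenization step are illustrative placeholders, not the configuration from Appendix A:

```python
import langid
from tensorflow.keras import layers, models

def is_english(text):
    # Module 1: language detection; non-English pages are labeled invalid.
    lang, _ = langid.classify(text)
    return lang == "en"

def build_policy_cnn(vocab_size=20000, max_len=2000):
    # Module 2: a one-layer CNN over tokenized words whose output is the
    # probability that the input text belongs to a privacy policy.
    # Hyperparameters here are illustrative, not those of Appendix A.
    model = models.Sequential([
        layers.Input(shape=(max_len,)),
        layers.Embedding(vocab_size, 100),
        layers.Conv1D(128, kernel_size=5, activation="relu"),
        layers.GlobalMaxPooling1D(),
        layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model

def is_policy(text, model, vectorize):
    # `vectorize` (not shown) maps raw text to a padded sequence of
    # token ids with shape (1, max_len).
    if not is_english(text):
        return False
    return float(model.predict(vectorize(text))[0][0]) > 0.5
```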

Out of the initial 11,440 URLs, our pipeline assigned a valid label (i.e., an English privacy policy) with a probability higher than 0.5 to a set of 7,787 English privacy policy links (68.8%). Of these, 4,800 are unique privacy policy links. It is worth noting that, in many cases, distinct initial URLs refer to the same privacy policy link, due to the same company owning multiple websites (e.g., YouTube owned by Google or Xbox owned by Microsoft). On the other hand, for 977 URLs (8.5% of the sites), our pipeline found no candidate URLs on the site or via Google's search. The remaining set of 2,676 URLs (23.4% of the sites) had candidate links for privacy policies, but they were considered invalid by our classifier (i.e., either classified as non-English or as non-privacy-policy). We manually analyzed a random subset from each of the two sets that failed to return a privacy policy link. The rejected links either (1) did not contain an English policy, (2) were unreachable altogether, (3) had a very short policy, or (4) had the privacy policy embedded within the longer terms of service. We consider all these rejected cases to be either non-suitable or too noisy for our further automated analysis. Hence, we focus on the set of 4,800 valid, unique policy links that we successfully obtained.

Website Category   Websites   Pre-GDPR Policies   Post-GDPR Policies
Adult 500 224 146
Arts 500 414 297
Business 500 385 257
Computers 475 393 318
Games 400 303 203
Health 450 355 274
Home 500 395 252
Kids And Teens 450 323 243
News 500 416 294
Recreation 500 382 253
Reference 500 365 265
Regional 500 427 293
Reg/Africa 420 167 108
Reg/Asia 500 254 187
Reg/Caribbean 319 141 101
Reg/Central America 200 46 29
Reg/Europe 500 383 237
Reg/Middle East 500 155 80
Reg/North America 500 418 304
Reg/Oceania 350 233 187
Reg/South America 400 103 81
Science 475 325 217
Shopping 500 399 207
Society 500 365 179
Sports 500 415 161

Table 1: The distribution of websites, Pre-GDPR and Post-GDPR policies over the Alexa website categories.

Retrieving Archived Policies

To obtain the content of Pre-GDPR privacy policies, we resorted to the Wayback Machine (https://archive.org/help/wayback_api.php), which holds 336 billion web pages, with thousands of snapshots of URLs from 1996 to the present. We used the Wayback Machine API and the Memento Project API (https://timetravel.mementoweb.org/guide/api/) to identify the archived links of the privacy policies. We set the time filter to extract the archived version of each privacy policy closest to July 1, 2017. We note that, due to differences in the archival frequency of these sites, 3% of the retrieved versions dated back to 2016 and less than 1.5% dated back to earlier than 2016. We targeted that period as a middle ground between being too close to the GDPR enforcement date (May 25, 2018), thus risking missing the pre-GDPR version, and picking a policy too far in the past, thus making it difficult to attribute the changes to the regulation itself. After retrieving the archived policy links, we crawled these links and only kept the ones that were assigned a valid label with a probability higher than 0.5 by the “Is Policy?” classifier.
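One way to retrieve the closest snapshot is the Wayback Machine availability API (documented at the link above); the sketch below queries it for the snapshot of a policy URL nearest to July 1, 2017:

```python
import requests

def closest_snapshot(url, timestamp="20170701"):
    """Return the Wayback Machine snapshot of `url` closest to
    `timestamp` (YYYYMMDD), or None if no snapshot exists."""
    resp = requests.get("https://archive.org/wayback/available",
                        params={"url": url, "timestamp": timestamp},
                        timeout=10)
    closest = resp.json().get("archived_snapshots", {}).get("closest")
    return closest["url"] if closest else None

print(closest_snapshot("example.com/privacy"))
```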

Finally, we ended up with a corpus of 3,086 policies that have both the Pre-GDPR and the Post-GDPR versions labeled as valid. We used this corpus in our subsequent analysis. We manually examined a sample of the archived versions that failed to give a valid policy. We observed that either: (1) the privacy policy link had changed, (2) the website had "robots.txt" configured to block indexing, or (3) the website did not have a privacy policy before the GDPR.

Analyzed Dataset

Table 1 shows the distribution of the valid, English, pre-GDPR and post-GDPR privacy policies in our corpus over the considered 25 categories (double-counting policies that apply to multiple sites). The regional categories (Central America, South America, Middle East, Africa, and Caribbean) have the lowest ratios of valid policies to websites (less than 45%); many of these websites are not in English. On the other end of the spectrum, websites from the News, Sports, Regional/North America, and Regional (containing many of the popular websites) categories have the highest ratios of valid privacy policies. The table also shows that the distribution of the number of archived policies (fourth column) among the categories roughly follows that of the valid current policies.

Figure 2: The number of valid policies grouped by rank on the Alexa Top-1M list.

Fig. 2 shows the distribution of the websites according to their ranking in the Alexa top-1M list, regardless of their category ranking. It also indicates which websites have valid and which have missing privacy policies. The distribution of the websites over rank exhibits a long tail: most of the analyzed websites are clustered in the top 100K region. Of those websites, less than a third do not have a valid English privacy policy; this ratio increases for lower-ranked websites. The same trend holds within the top 100K websites (the zoomed region in the figure).

4 Presentation Analysis

Our first step to understand the evolution of the privacy policies is to assess whether these policies became more presentable to the users, delivering a better first impression that encourages them to read. Towards that end, we conducted a user study and then analyzed the structural composition of these policies.

4.1 User Study

We followed a within-subjects study design, with two conditions (Pre-GDPR and Post-GDPR). Our goal was to have each participant evaluate how presentable a screenshot of a privacy policy is.

4.1.1 Study Setup

Recruitment: We recruited 530 participants from Amazon Mechanical Turk (www.mturk.com) who have a high HIT approval rate and have achieved masters status. We paid each respondent $1.75 to fill in the survey, which lasted 9 minutes on average. Of the respondents, 35% are female, 51% hold a Bachelor's degree, 16% do not hold a degree, and 79% are from North America. The average age of the respondents is 35 years. We limited the survey to assessing the privacy policy screenshots and did not ask for any personally identifiable information.

Study Material: We chose a representative set of 400 unique privacy policies by sampling 16 policies from each of the 25 categories under study. In each category, we sorted the policies by their rank in the Alexa 1M list and sampled 4 policies from each rank quartile. As is customary in studies of website aesthetics [24, 16], we assessed presentability using screenshots of the webpages instead of live versions to avoid any bias from webpage loading times, internet speed, or localized versions. We used the "webkit2png" tool (http://www.paulhammond.org/webkit2png/) to capture a full screenshot of each privacy policy using the live (Post-GDPR) link as well as the web archive (Pre-GDPR) link, ending up with a total of 800 screenshots. As these screenshots included the full policy scrolled to the bottom, we cropped a fixed-height region from the top of each screenshot to display to the respondents. Two of the authors manually inspected each of the screenshots and corrected any screenshot that was incorrectly captured. They also ensured that the screenshots had no rendering issues (especially those coming from the Wayback Machine) and that they did not carry any indication of their sources.

Survey Design: We presented each respondent with a random set of 20 screenshots from both conditions. The screenshots' order was randomized per participant to compensate for the effects of learning and fatigue. The respondents were also not primed about the comparative nature of the study. As we were interested only in assessing the users' first impressions of the policies, we partially followed the approach of Reinecke et al. [24] and Lindgaard et al. [16], with the difference of not setting a maximal limit on the duration users spend on each screenshot. On average, each screenshot received 13 evaluations. We explicitly required the respondents not to read the content of each screenshot but rather to give their assessment of its look and feel. For each screenshot, the respondent indicated how much they agree or disagree with a set of three statements over a 5-point Likert scale (Strongly Disagree (SD), Disagree (D), Neither (N), Agree (A), Strongly Agree (SA)). A snapshot of the survey is available in Fig. 10 of Appendix B. These statements, which closely follow the usability measurement questions in [19], are as follows:

  1. s1. This policy has an attractive appearance.

  2. s2. This policy has a clean and simple presentation.

  3. s3. This policy creates a positive experience for me.

Additionally, we placed two anchor questions that contain badly formatted "lorem ipsum" text. We used these questions to weed out respondents with low-quality answers. At the end of the survey, the respondents filled in an optional demographics survey.

4.1.2 Results

Figure 3: Comparison of the agreement percentage across categories between the Pre-GDPR and the Post-GDPR policies for statement s3.
Overall Results

We compare the aggregated user responses between the two conditions. Our first finding is that, for all three statements, there is a significant difference between the overall responses for the Pre-GDPR and the Post-GDPR screenshots: the null hypothesis of responses being independent of the source of the screenshot was rejected for s1, s2, and s3. The p-values (for this and subsequent tests of this section) are computed using the Chi-square test over the distribution of responses for the Pre-GDPR and Post-GDPR conditions.
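The test can be reproduced with scipy by arranging the response counts as a 2 × 5 contingency table (conditions × Likert levels); the counts below are illustrative placeholders, not our study's data:

```python
from scipy.stats import chi2_contingency

# Hypothetical response counts for one statement: rows are the two
# conditions (Pre-GDPR, Post-GDPR); columns are the Likert levels
# (SD, D, N, A, SA).
observed = [
    [60, 190, 190, 390, 140],   # Pre-GDPR
    [60, 170, 190, 400, 160],   # Post-GDPR
]

chi2, p_value, dof, _ = chi2_contingency(observed)
print(f"chi2 = {chi2:.2f}, p = {p_value:.4f}, dof = {dof}")
```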

Stmt  Condition   SD(%)  D(%)   N(%)   A(%)   SA(%)
s1    Pre-GDPR    6.0    19.0   19.0   39.0   14.0
s1    Post-GDPR   6.0    17.0   19.0   40.0   16.0
s2    Pre-GDPR    5.0    14.0   16.0   41.0   21.0
s2    Post-GDPR   4.0    13.0   16.0   42.0   23.0
s3    Pre-GDPR    6.0    17.0   25.0   34.0   16.0
s3    Post-GDPR   6.0    14.0   24.0   36.0   16.0
Table 2: The resulting scores for the user study, grouped by statement and study condition.

As evident from Table 2, the user responses have shifted towards the "Agree" and "Strongly Agree" categories for the Post-GDPR policies compared to their Pre-GDPR counterparts for all three statements. Hence, considering the whole sample of policies, the Post-GDPR policies were significantly more attractive and presentable and created a better first impression.

Category-based Analysis

In order to delve further into our results, we investigated the websites grouped by category. This breakdown enabled us to understand the website categories where the change in presentation and appeal is more noticeable. Interestingly, for the majority of the categories, we did not find evidence for a significant difference between the Pre-GDPR and the Post-GDPR groups (p > 0.05). The only exceptions, triggering the overall statistical significance, were the adult, health, home, and Regional/Europe (for s2 only) categories (p < 0.05). Still, the trend in most of the categories is a general shift in the user responses towards "Agree"/"Strongly Agree" for the three statements. Fig. 3 shows the percentage of respondents who ranked the policies of each category as "Agree" or "Strongly Agree" for statement s3 (the results for the other two statements are very similar but omitted due to space limits). It is clear that in most categories, more respondents ranked the Post-GDPR policies as more appealing than their Pre-GDPR counterparts.

Our main conclusion is that service providers in particular categories have collectively made more efforts to make their privacy policies more appealing. The GDPR places special emphasis on sensitive personal information, which would explain why health and adult websites have improved the appeal of their privacy policies. This aligns with previous research showing that adult sites are more likely to give clear notice of privacy practices [21]. Similarly, websites in the European category have opted for a simpler and cleaner presentation according to our user survey, as a large percentage of their audience is in the EU. Finally, it is evident that the improvement in the Post-GDPR presentation compared to the Pre-GDPR version cannot be attributed to the fact that the source of the latter policies is the Wayback Machine: the changes were not significant in most categories and, for a few categories, the older versions had an even better appeal. In our manual observations of the archived websites, the Wayback Machine preserved the original visual layout, with a toolbar added on top (which we removed in our study).

4.2 HTML Layout Analysis

We study the HTML composition of privacy policies as another indicator of their presentation. We target two major types of HTML tags: structural and style tags. Structural tags, such as those corresponding to tables (<table>) and lists (<ul>,<ol>) allow the text on a web-page to be formatted in an itemized (and potentially more reader-friendly) manner. On the other hand, style tags, such as bold (<b>, <strong>) and italic (<i>, <em>) can potentially improve the overall experience as readers can easily distinguish key terms from the supporting text. By highlighting the important expressions, the reader can rapidly scan the page for the desired information.

4.2.1 Evolution Analysis

Figure 4: The average number of HTML tags for Pre-GDPR and Post-GDPR website policies, grouped by the purpose of the HTML tag.

Fig. 4 plots the average number of structural and style tags for the Pre-GDPR and Post-GDPR policies. For both types of HTML tags, there has been an increase in the average number of tags per policy between the Pre-GDPR and Post-GDPR policies. This increase is statistically significant, with p-values near 0 according to Welch's unequal variances t-test for both HTML tag types. The more drastic increase post-GDPR is in the number of tags corresponding to the web page's structure. For example, we observe that, on average, 60% of policies have increased their HTML tag count corresponding to lists ("li", "dl", "dd", "ul", or "ol"). At face value, these results indicate that service providers have been improving the layout of their policies to be more readable.
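The significance test can be run with scipy, which implements Welch's unequal variances t-test via the equal_var=False flag; the per-policy tag counts below are illustrative placeholders:

```python
from scipy.stats import ttest_ind

# Hypothetical per-policy structural-tag counts; in the study these
# come from parsing each policy's HTML.
pre_counts = [12, 30, 8, 45, 22, 17, 40, 9]
post_counts = [20, 41, 15, 60, 35, 25, 52, 18]

# equal_var=False selects Welch's unequal variances t-test.
t_stat, p_value = ttest_ind(post_counts, pre_counts, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```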

4.2.2 Association with User Study

We investigate whether the evolution of specific HTML tags correlates with the change in the policies' presentation observed in the user study. Towards that end, we computed Pearson's correlation between (1) the average policy score (mapped from "Strongly Disagree": 1 to "Strongly Agree": 5) and (2) the ratio of each HTML tag to the whole set of tags in the policy. Across all policies, we did not find any meaningful correlation for any of the HTML tags for any of the three statements. Further, we studied whether the change in the number of HTML tags from the Pre-GDPR policy to the Post-GDPR policy correlates with the change in the average score between the two policies. Similarly, we observe that the change in each HTML tag count exhibits no correlation with the change in average response per policy.

In conclusion, HTML tags are a poor indicator of the appeal of a privacy policy to the users. This observation is consistent with our choice to run a user study to analyze the appeal and presentation of the privacy policies. There are other factors that contribute to the look and feel of a privacy policy, such as CSS styling, which we do not investigate.

5 Readability Analysis

The next angle we consider is readability. We used the text extracted from the two groups of Pre-GDPR and Post-GDPR policies in Sec. 3 and evaluated a set of readability metrics on that text.

Readability Metrics

We utilized the following readability metrics to measure how easy or difficult the text is to read:

  1. Lexicon Count: gives the number of words in the text.

  2. Syllable Count: gives the total number of syllables in the text.

  3. Sentence Count: gives the number of sentences present in the text.

  4. Passive Voice Index: gives the percentage of sentences that contain passive verb forms. To compute this score, we tokenized the text into sentences. Then we performed dependency parsing on each sentence using the spaCy library (https://spacy.io/). We considered a sentence to contain a passive voice if it followed the pattern: nsubjpass (Nominal subject (passive)), followed by aux (Auxiliary), followed by auxpass (Auxiliary (passive)). This matches sentences similar to "Data is collected" (a code sketch follows this list).

  5. Flesch-Kincaid Grade [13]: measures readability by accounting for the number of words per sentence and the number of syllables per word. It is presented as a U.S. grade level and computed as: 0.39 × (total words / total sentences) + 11.8 × (total syllables / total words) − 15.59.

  6. Dale-Chall Readability Score [6]: uses a white-list of 3,000 words that groups of fourth-grade American students could reliably understand, declaring all other words to be difficult. The raw score is computed as: 0.1579 × (percentage of difficult words) + 0.0496 × (average number of words per sentence).

    If the percentage of difficult words is greater than 5%, the score adds 3.6365. The score is mapped to the U.S. grade level as follows: 4.9 or lower: 4th-grade or lower; 5.0–5.9: 5th-6th grade; 6.0–6.9: 7th-8th grade; 7.0–7.9: 9th-10th grade; 8.0–8.9: 11th-12th grade; 9.0–9.9: 13th-15th grade (college) student.
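The passive-voice detection in metric 4 and the remaining counts can be approximated with off-the-shelf libraries. The sketch below uses spaCy for dependency parsing and the textstat package for the standard readability formulas; flagging a sentence as passive when an nsubjpass token is later followed by an auxpass token is our simplification of the pattern described above, and textstat is our assumption rather than the paper's stated implementation:

```python
import spacy
import textstat

# Requires: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

def passive_voice_index(text):
    """Percentage of sentences containing a passive construction,
    approximated by an nsubjpass token followed by an auxpass token."""
    doc = nlp(text)
    total, passive = 0, 0
    for sent in doc.sents:
        total += 1
        deps = [tok.dep_ for tok in sent]
        if "nsubjpass" in deps and "auxpass" in deps[deps.index("nsubjpass"):]:
            passive += 1   # e.g., "Data is collected."
    return 100.0 * passive / max(total, 1)

def readability_report(text):
    return {
        "words": textstat.lexicon_count(text),
        "syllables": textstat.syllable_count(text),
        "sentences": textstat.sentence_count(text),
        "passive_%": passive_voice_index(text),
        "flesch_kincaid": textstat.flesch_kincaid_grade(text),
        "dale_chall": textstat.dale_chall_readability_score(text),
    }

print(readability_report("Data is collected. We use your data to improve the service."))
```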

Readability Results

Metric        Pre-GDPR mean (SD)   Post-GDPR mean (SD)   p-value
Words         2073.02 (1621.03)    2655.17 (7500.74)     0.000
Syllables     3649.24 (3125.78)    4965.70 (18116.86)    0.000
Sentences     88.62 (68.15)        118.28 (598.75)       0.006
Passive       13.02 (8.69)         12.40 (6.80)          0.001
Flesch-Kinc.  14.16 (4.91)         14.81 (7.30)          0.000
Dale-Chall    7.42 (0.70)          7.40 (0.70)           0.260
Syl/W         1.75 (0.41)          1.78 (0.60)           0.015
W/Sent        23.27 (4.74)         23.98 (5.74)          0.000
Table 3: Readability metrics for Pre-GDPR and Post-GDPR policies (mean with standard deviation in parentheses).

We extract the six metrics from the text of our policies and present the results in Table 3. We also show the number of syllables per word and the number of words per sentence. Using the unpaired t-test, we compared the mean of each Pre-GDPR metric to the corresponding Post-GDPR mean. All metrics had p < 0.05, with the exception of the Dale-Chall metric. First, we observe that the Post-GDPR policies are noticeably longer, with 28% more words and 33% more sentences on average. Sentences and words are also slightly longer in these policies, as evident from the last two rows of the table, leading to an increase in the Flesch-Kincaid score as well. Overall, these results indicate that users need to spend more effort and time to fully comprehend the data practices of service providers. On the other hand, we notice that the Passive Voice Index has declined by around 5% (still statistically significant), meaning that there is a slight move towards introducing more clarity in the text. Finally, the Dale-Chall metric stayed almost the same (less than 0.3% change). This is understandable given that this metric relies on a white-list approach and that the vocabulary used in privacy policies is quite distinctive. Notice that, in our case, this metric gives a different grade compared to Flesch-Kincaid, but this discrepancy has been observed in previous studies [11].

Our findings suggest that service providers, in the aim for higher transparency (and compliance), are moving towards longer and more elaborate policies, thereby negatively impacting the readability. This is despite a slight effort on reducing complexity in the form of using more active-voice sentences with clearer subjects.

Association with User Study

Finally, we studied whether the presentation metrics of Sec. 4 are correlated with the readability metrics of the privacy policies. We computed the correlation between the average response to each statement of each policy and the readability metrics of Table 3. Regardless of the policy category, readability metric, presentation statement, or policy period, there is at most a weak Pearson's correlation between the readability and the policy's appeal to the respondents. This observation validates our methodology in Sec. 4, where we wanted to gauge the policy's appearance without focusing on its readability. The respondents appear to have focused on the appearance instead of reading the text in the policy screenshot.

6 Automated Requirement Analysis

While the readability analysis captures whether the privacy policies are reader-friendly, its metrics are domain-agnostic and cannot quantify the complexity of the policies' content. Hence, we follow a different methodology to delve deeper and assess both the ambiguity and the compliance angles in the following sections.

Methodology Overview

Our methodology starts by defining a set of goals or requirements. These goals are high-level and independent of the implementation methodology. Then, we code these goals by extending a technique called structured querying over privacy policies, introduced by Harkous et al. [9]. This technique allows crafting first-order logic queries over low-level labels; the labels are predicted using a set of deep learning classifiers.

This technique offers an advantage over approaches based on heuristics and keyword analysis (e.g., [3, 7]): it allows us to better cover text with varying wordings but similar semantics. Further, this technique avoids the shortcomings of approaches that directly use machine learning to quantify the goals (e.g., [5, 17, 15]): it is more flexible for adapting the goals as needed, without having to create new labeled data for each new goal.

In this work, we are the first to conduct a comprehensive analysis of privacy-related goals using structured querying. Compared to the work by Harkous et al. [9], our main contribution is in the goals’ definition, the translation of these goals into queries, the volume of goals we measure, and the comparative nature of this study.

Polisis Overview

To introduce our methodology, we first give a brief overview of Polisis [9] that automatically labels privacy policies. Polisis pre-processes a privacy policy and breaks it into a set of smaller segments. A segment is a set of consecutive sentences of a privacy policy that are semantically coherent. Polisis passes each segment through a set of classifiers to extract the embedded privacy practices. It uses the privacy taxonomy of the OPP-115 dataset by Wilson et al. [31] (Fig. 5) to label each segment with high-level privacy categories as well as values for lower-level privacy attributes.

In particular, Polisis assigns each segment s of a privacy policy a set category(s) containing one or more of nine high-level privacy categories (dark shaded in Fig. 5). Also, Polisis labels each segment with a set of values for each of 20 lower-level privacy attributes (some of which are in light shade in Fig. 5). The values corresponding to each attribute are shown as tables in Fig. 5. For example, the attribute "purpose" indicates the purposes of data processing and is represented by a set of values such as basic-service-feature, personalization, and marketing.

If category(s) = {first-party-collection-use} and purpose(s) = {basic-service-feature, personalization, marketing}, we conclude that the segment describes multiple purposes for first-party data collection: to provide basic features, personalize the service, and use data for marketing. In addition to the labels, Polisis returns a probability associated with each label; elements of the sets mentioned above are the ones classified with a probability larger than 0.5. We interact with Polisis via an API provided to us by its developers. The API receives the privacy policy link and returns the policy analysis. We refer the reader to the work of Harkous et al. [9] for a breakdown of the accuracy of the high-level and each of the low-level classifiers. A detailed description of all the taxonomy attributes and their values is present within the OPP-115 dataset (https://usableprivacy.org/data).

Figure 5: The privacy taxonomy of Wilson et al. [31]. The top level of the hierarchy (darkly shaded blocks) defines high-level privacy categories. The lower level defines a set of privacy attributes (light shaded blocks), each assuming a set of values. We show examples of values for some of the attributes. The taxonomy has more attributes that we do not show for space considerations.

Automated Goal Analysis

We build on the automated analysis provided by Polisis to evaluate custom-defined goals. Let us consider, for instance, the following goal: "Quantify the ambiguity by measuring the ratio of cases where the policy does not specify the third party entity receiving the user information". We code this goal in two steps. The first step is filtering: among the segments returned by Polisis, consider the set S of segments s such that category(s) = {third-party-sharing-collection} and third-party-entity(s) ≠ ∅. The second step is scoring: given the set S, retrieve the subset S_u ⊆ S with third-party-entity(s) = unspecified; then compute the ambiguity score as |S_u| / |S|. We follow the spirit of this filtering-scoring approach for our in-depth analysis in the following sections.
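A minimal sketch of this filtering-scoring computation for the example goal follows. The segment representation (dicts with a 'categories' set and per-attribute value sets) is our assumption; the actual Polisis API response format may differ:

```python
# Each segment is assumed to carry only the labels that Polisis
# predicted with probability > 0.5 (see above); this dict layout is
# our illustration, not the actual API response format.
segments = [
    {"categories": {"third-party-sharing-collection"},
     "third-party-entity": {"unspecified"}},
    {"categories": {"third-party-sharing-collection"},
     "third-party-entity": {"named-third-party"}},
    {"categories": {"first-party-collection-use"},
     "purpose": {"advertising"}},
]

def ambiguity_third_party_entity(segments):
    # Filtering: segments about third-party sharing with an entity label.
    S = [s for s in segments
         if "third-party-sharing-collection" in s["categories"]
         and s.get("third-party-entity")]
    # Scoring: fraction of those whose entity is left unspecified.
    S_u = [s for s in S if "unspecified" in s["third-party-entity"]]
    return len(S_u) / len(S) if S else None  # None: practice not covered

print(ambiguity_third_party_entity(segments))  # -> 0.5
```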

Privacy Category Description
First Party Collection/Use Service provider’s collection and use of user data.
Third Party Sharing/Collection Sharing and collection of user data with third parties (e.g., advertisers)
User Choice/Control Options for choices and control users have for their collected data
User Access, Edit, & Deletion Options for users to access, edit or delete their stored data.
Data Retention The period and purpose for storing user’s data.
Data Security The protection mechanisms for user’s data.
Policy Change Communicating changes to the privacy policy to the users.
International & Specific Audiences Practices related to a specific group of users (e.g., children, Europeans).
Table 4: Description of the relevant high-level privacy categories from Wilson et al. [31].

7 Coverage Analysis

Figure 6: Category coverage for policies before and after the GDPR; p-values are derived from applying the Chi-square test.

As the first step of our content-based analysis of privacy policies, we study the policies' coverage of the high-level privacy categories of Fig. 5. For each policy and each category c, the filtering step consists of selecting the set S_c of segments s such that c ∈ category(s). The score for the policy is 1 if S_c is non-empty and 0 otherwise.
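A sketch of this coverage scoring under the same assumed segment representation follows; the slugged category names below are our approximations of the OPP-115 taxonomy labels:

```python
# Slugged names approximating the high-level OPP-115 categories used
# in this paper (the eight listed in Table 4).
CATEGORIES = [
    "first-party-collection-use", "third-party-sharing-collection",
    "user-choice-control", "user-access-edit-deletion", "data-retention",
    "data-security", "policy-change", "international-specific-audiences",
]

def coverage_scores(segments):
    """Per-category score: 1 if at least one segment carries the
    category label, 0 otherwise."""
    covered = set().union(*(s["categories"] for s in segments)) if segments else set()
    return {c: int(c in covered) for c in CATEGORIES}
```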

Fig. 6 displays the portion of policies with a coverage score of 1 for each of the high-level privacy categories. We consider the two groups: Pre-GDPR and Post-GDPR. The Pre-GDPR policies already have near perfect coverage of three categories: First Party Data Collection, Third Party Data Sharing, and User Choice. Coverage of the Third Party Data Sharing and User Choice categories has increased slightly among the Post-GDPR policies, with the source of the policy (Pre-GDPR vs. Post-GDPR) being statistically significant (p < 0.05 after applying the Chi-square test on the number of policies covering/missing each category between Pre-GDPR and Post-GDPR).

The highest increase in coverage appears in the Data Retention (6.3%), Special Audiences (6.5%), and User Access (7.8%) categories, with the increase being statistically significant. This positive trend of the policies covering more privacy practices is also consistent with the GDPR requirements. The GDPR emphasizes the data retention periods, safeguarding the user data, and providing the users with the options to access and rectify their information (cf. Sec. 2).

8 Ambiguity Analysis

A-Q1: Quantify the ambiguity resulting from the policy not specifically indicating how the first party is obtaining user data.
Filtering: consider the set S of segments s such that category(s) = {first-party} ∧ purpose(s) = ∅ ∧ action-first-party(s) ≠ ∅.
Scoring: take S_u ⊆ S such that action-first-party(s) = unspecified. The ambiguity score is |S_u| / |S|.

A-Q2: Quantify the ambiguity resulting from the policy not specifically indicating how the third party is collecting user data.
Filtering: category(s) = {third-party} ∧ action-third-party(s) ≠ ∅.
Scoring: take S_u ⊆ S with action-third-party(s) = unspecified. The ambiguity score is |S_u| / |S|.

A-Q3: Quantify the ambiguity resulting from the policy not specifically indicating the type of information accessed by the first party.
Filtering: category(s) = {first-party} ∧ purpose(s) = ∅ ∧ info-type(s) ≠ ∅.
Scoring: take S_u ⊆ S with info-type(s) = unspecified. The ambiguity score is |S_u| / |S|.

A-Q4: Quantify the ambiguity resulting from the policy not specifically indicating the type of information shared with the third party.
Filtering: category(s) = {third-party} ∧ info-type(s) ≠ ∅.
Scoring: take S_u ⊆ S with info-type(s) = unspecified. The ambiguity score is |S_u| / |S|.

A-Q5: Quantify the ambiguity resulting from the policy not specifically indicating the third party receiving user information.
Filtering: category(s) = {third-party} ∧ third-party-entity(s) ≠ ∅.
Scoring: take S_u ⊆ S with third-party-entity(s) = unspecified. The ambiguity score is |S_u| / |S|.

A-Q6: Quantify the ambiguity resulting from the policy's coverage of first party collection purposes relative to all possible purposes in our taxonomy.
Filtering: let P be the set of all purposes, and let P_c ⊆ P be the purposes p for which there exists a segment s with category(s) = {first-party} ∧ p ∈ purpose(s).
Scoring: the ambiguity score is 1 − |P_c| / |P|.

A-Q7: Quantify the ambiguity resulting from the policy's coverage of third party sharing purposes relative to all possible purposes in our taxonomy.
Filtering: let P be the set of all purposes, and let P_c ⊆ P be the purposes p for which there exists a segment s with category(s) = {third-party} ∧ p ∈ purpose(s).
Scoring: the ambiguity score is 1 − |P_c| / |P|.

A-Q8: Quantify the ambiguity resulting from the policy not specifically indicating a purpose for the data retention.
Filtering: category(s) = {data-retention} ∧ purpose(s) ≠ ∅.
Scoring: take S_u ⊆ S with purpose(s) = unspecified. The ambiguity score is |S_u| / |S|.

Table 5: The ambiguity queries applied to each pre-GDPR and post-GDPR policy. Note the separate scoring function for A-Q6 and A-Q7.

In this section, we aim to assess the change in the level of ambiguity present in the privacy policies.

8.1 Queries

We consider the ambiguity of the service provider in the context of eight privacy practices, which we extract from articles 15-22 of the GDPR. We use the filtering-scoring approach of Sec. 6 to quantify this ambiguity. Table 5 describes the eight ambiguity queries (A-Q1 to A-Q8) that quantify how explicit the privacy policy is in describing: how the first party is collecting data, how the third party is obtaining data, the information type collected, the information type shared, the third-party entities receiving user information, purposes for data collection, purposes for data sharing, and purposes for data retention. For all the queries, a higher score indicates higher ambiguity.

The reader might notice a discrepancy in Table 5 between the scoring step for the ambiguity queries focusing on the purpose attribute (first party (A-Q6) and third party (A-Q7) purposes vs. data retention purposes (A-Q8)). We treated these cases differently due to the nature of privacy policies as well as Polisis' output. Within the Polisis system, purpose(s) = unspecified does not always imply vagueness in describing the purpose of data collection/sharing. Rather, it might indicate that the purpose is not the subject of that segment. Hence, most of the segments that focus on the data types being collected or shared will carry an unspecified label. Accordingly, we quantify purpose ambiguity, in the first party and third party contexts, based on the ratio of the number of stated purposes in the policy (|P_c|) to the total number of possible purposes (|P|), so that a smaller ratio yields a higher ambiguity score. On the other hand, data retention is typically addressed by one or two segments in the policy, and these segments almost always describe the purpose. If those segments have purpose(s) = unspecified, then we expect that to signal an ambiguity in describing the purpose for data retention.
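The following sketch illustrates an A-Q6 style purpose-coverage score under the same assumed segment representation; the purpose values listed are an illustrative subset of the taxonomy, and the one-minus form reflects the convention that higher scores indicate higher ambiguity:

```python
# Illustrative subset of the taxonomy's purpose values (assumption).
ALL_PURPOSES = {
    "basic-service-feature", "additional-service-feature", "advertising",
    "marketing", "analytics-research", "personalization-customization",
    "legal-requirement", "service-operation-and-security",
}

def purpose_ambiguity_first_party(segments):
    """A-Q6 style score: 1 minus the fraction of taxonomy purposes the
    policy states for first-party collection (higher = more ambiguous)."""
    stated = set()
    for s in segments:
        if "first-party-collection-use" in s["categories"]:
            stated |= s.get("purpose", set()) & ALL_PURPOSES
    return 1 - len(stated) / len(ALL_PURPOSES)
```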

8.2 Results

Figure 7: Comparison of the ambiguity scores of policies before and after the GDPR.

We analyze the evolution of the eight ambiguity scores between the Pre-GDPR and Post-GDPR policies. We consider five cases in Fig. 7: (1) the privacy practice is not covered in either the Pre-GDPR or the Post-GDPR policy, (2) no ambiguity in both the Pre-GDPR and Post-GDPR policies, (3) the ambiguity score is the same between the two versions of the policy, (4) worse ambiguity, with the score of the Post-GDPR policy higher than the Pre-GDPR one, and (5) improved ambiguity, with the score of the Pre-GDPR policy higher than the Post-GDPR one.

Privacy Not Covered

Consistent with the results of Section 7, we find that the purpose of the data retention practice (A-Q8) is not frequently covered among the studied policies. The other practices (A-Q1 to A-Q5) are also not fully covered (each missing in less than 30% of the policies). For these queries, the reason for the missing coverage is that we filter out the segments that mention particular purposes. To understand our motivation, notice that privacy policies frequently describe data collection/sharing independently of the purpose (i.e., in different segments). Hence, when the purposes are stated, the information type being collected is not repeated; the segment references previous sections instead. To avoid such purpose segments being counted as ambiguous with respect to information types (or similar attributes), we decided to exclude them in the filtering stage. By doing this, we potentially lose data from less than 30% of the policies; the resulting ambiguity scores, however, have much less noise. Last but not least, all policies in both the Pre-GDPR and Post-GDPR groups covered at least one purpose related to first party data collection (A-Q6) and third party sharing (A-Q7).

No Ambiguity

For the privacy practices covered in A-Q1 and A-Q2 (and to a lesser degree the rest of the practices), the ambiguity values stay at zero. These subsets of policies mention the specific methods of collecting and sharing user data. Nevertheless, this result does not indicate full compliance with the GDPR requirements about informing the users about how data is collected. We show later in Sec. 9 that while policies mention specific data collection processes, they do not cover all the GDPR requirements about collecting user data.

Same Ambiguity

A large portion of the policies exhibited the same ambiguity scores for the analyzed privacy practices. This result is not surprising given that 30% of the policies did not change between Pre-GDPR and Post-GDPR (the policy text is exactly the same). The other policies maintain the same ambiguity levels even when their content changes.

Worse Ambiguity

Interestingly, we observe a considerable portion of policies exhibiting higher ambiguity in describing their privacy practices. We attribute this to the policies trying to be more comprehensive and general in describing the data practices at the expense of the specificity of the data-practice clauses. This is a representative example from the Post-GDPR privacy policy of hrc.org (https://www.hrc.org/hrc-story/privacy-policy):
“We also may be required to release information if required to do so by law or if, in our sole discretion, such disclosure is reasonably necessary to comply with legal process or to protect the rights, property, or personal safety of our web site, HRC, our officers, directors, employees, members, other users, and/or the public. We may also share personal information in the event of an organizational restructuring”.

While the Pre-GDPR policy contained segments related to sharing data with third party entities, this newly added segment does not specify the type of personal information released.

Improved Ambiguity

Finally, we observe that a large portion of the privacy policies have improved their ambiguity scores by using more precise phrases to describe the data collected and shared along with the purposes. Except for A-Q5, we notice that there are more policies with improved ambiguity than with worse ambiguity (20.5% vs. 14.3% on average, and three times more in A-Q1 and A-Q7). Over all the practices, we find that the improvement in ambiguity of the Post-GDPR policies is significant (p < 0.05 after applying the Chi-squared test with Yates's correction).

Conclusion

In conclusion, privacy policies appear to be moving in the right direction. While many of them have maintained the same ambiguity levels (due to unchanged policies or low coverage of data practices), a considerable number of policies have changed. Of those, a minority changed negatively: in an effort to comply with the GDPR, they tried to be more comprehensive in describing their practices at the expense of being less specific. The majority of the policies that changed, however, have become more specific in informing the users about their privacy practices. Our analysis finds that the introduction of the GDPR, with its specific requirements about informing users, has contributed significantly to this positive trend.

9 Compliance Analysis

ICO-Q1: “The purposes of processing user data.”
Filtering: category(s) = {first-party} ∧ purpose(s) ≠ ∅ ∧ unspecified ∉ purpose(s).
Scoring: score = |S|.

ICO-Q2: “The categories of obtained personal data (if personal data is not obtained from the individual it relates to).”
Filtering: category(s) = {first-party} ∧ action-first-party(s) ∈ A ∧ unspecified ∉ info-type(s).
Scoring: score = |S|.

ICO-Q3: “The recipients of the user’s personal data.”
Filtering: category(s) = {third-party} ∧ unspecified ∉ third-party-entity(s).
Scoring: score = |S|.

ICO-Q4: “The retention periods of the user’s personal data.”
Filtering: category(s) = {data-retention} ∧ retention-period(s) = {stated}.
Scoring: score = 1 if |S| ≥ 1, else 0.

ICO-Q5: “The right for the user to withdraw consent from data processing.”
Filtering: category(s) ⊇ {first-party, user-choice-control} ∧ choice-type(s) ∩ {opt-out-link, opt-out-via-contacting-company} ≠ ∅ ∧ choice-scope(s) = {first-party-use}.
Scoring: score = 1 if |S| ≥ 1, else 0.

ICO-Q6: “The source of the personal data (if the personal data is not obtained from the individual it relates to).”
Filtering: category(s) = {first-party} ∧ action-first-party(s) ∈ A.
Scoring: score = |S|.

ICO-Q7: “If we plan to use personal data for a new purpose, we update our privacy information and communicate the changes to individuals before starting any new processing.”
Filtering: category(s) = {policy-change} ∧ type-of-policy-change(s) = {privacy-relevant-change} ∧ unspecified ∉ how-notified(s).
Scoring: score = 1 if |S| ≥ 1, else 0.

ICO-Q8: “Individuals have the right to access their personal data.”
Filtering: category(s) = {user-access-edit-deletion} ∧ access-type(s) ∩ {view, export, edit-information} ≠ ∅.
Scoring: score = 1 if |S| ≥ 1, else 0.

Table 6: The list of queries derived from ICO’s GDPR checklists. ICO-Q1 – ICO-Q7 are from the “Right to be Informed” checklist. ICO-Q8 is from the “Right of Access” checklist. A = {collect-from-user-on-other-websites, receive-from-other-parts-of-company-affiliates, receive-from-other-service-third-party-named, receive-from-other-service-third-party-unnamed, track-user-on-other-websites}.

Finally, we study the content of the policies in the light of the compliance requirements introduced by the GDPR. We rely on the UK Information Commissioner's Office (ICO) guide to the GDPR (https://ico.org.uk/for-organisations/guide-to-the-general-data-protection-regulation-gdpr/). This guide contains a set of guidelines for organizations to meet the provisions set in the GDPR. In particular, the guide includes a checklist for organizations to inform users of their rights under the GDPR. ICO's guide provides an official and structured interpretation of the GDPR, which obviates the need for our own customized interpretation of the law. We translate these requirements via the filtering-scoring approach of Sec. 6 in a way that allows us to compare the privacy practices of service providers before and after the introduction of the GDPR. Since the taxonomy employed in Polisis precedes the GDPR, some of the items in the ICO's checklists are inadmissible (they cover newer concepts). Table 6 shows the admissible ICO checklist items, their descriptions, and their corresponding filtering and scoring logic.

Figure 8: The comparison of ICO scores of policies before and after the GDPR. The queries for the ICO checklist can be found in Table 6.

For the requirements ICO-Q1, ICO-Q2, ICO-Q3, and ICO-Q6, we consider the number of segments satisfying the associated queries as the score for compliance. For the rest of the requirements, we consider the compliance evidence as a binary metric denoted by the existence of a segment satisfying the associated clause. The choice of different scoring methods stems from the nature of the requirements: the items ICO-Q1, ICO-Q2, ICO-Q3, and ICO-Q6 require the privacy policy to list purposes, categories, recipients, and sources of data, while the other items require the policy to mention a user right, such as withdrawing consent or receiving privacy updates.
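To illustrate the two scoring modes, the sketch below implements a count-based query (ICO-Q3) and a binary query (ICO-Q4) under the same assumed segment representation used earlier:

```python
def ico_q3_score(segments):
    """Count-based evidence (ICO-Q3): number of segments naming
    concrete recipients of the user's personal data."""
    return sum(1 for s in segments
               if "third-party-sharing-collection" in s["categories"]
               and s.get("third-party-entity")
               and "unspecified" not in s["third-party-entity"])

def ico_q4_score(segments):
    """Binary evidence (ICO-Q4): 1 if any segment states a concrete
    retention period, 0 otherwise."""
    return int(any("data-retention" in s["categories"]
                   and "stated" in s.get("retention-period", set())
                   for s in segments))
```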

We assess the change of compliance for each policy, according to the eight requirements, by comparing their Pre-GDPR and Post-GDPR versions. We break down the change for each requirement into four cases: (1) requirement still missing, (2) requirement still covered, (3) requirement worsened and (4) requirement improved. Fig. 8 shows, for each ICO requirement, the percentage of policies falling into each of the four cases.

Case 1: Requirements Still Missing

Despite the GDPR's emphasis on the right of users to be informed, we find that many of its requirements are still missing from privacy policies. Most notably, a vast majority of the privacy policies still do not cover specific retention periods, do not provide explicit opt-out options, and do not establish a clear notification channel for privacy policy updates. ICO-Q4 is the most frequently missed requirement of all. Most privacy policies do mention data retention but are vague about the periods during which user data is retained. This observation is consistent with our findings in Sec. 8: most policies do not mention any purpose related to data retention. The following is a representative example of the data retention clause of the privacy policy of trulia.com:

  • Pre-GDPR: We will retain your Personal Information for as long as you have an active account, as needed to provide you with the Services, to comply with our legal, financial reporting, or compliance obligations, and to enforce this privacy policy.

  • Post-GDPR: We will retain your information for as long as necessary to fulfill the purposes outlined in this Privacy Policy unless a longer retention period is required or permitted by law.

The GDPR also makes a distinction based on the source of the personal data: the service provider has to specify the sources and categories of personal data that are not obtained directly from the users. While privacy policies generally articulate the information they collect directly from users (e.g., through a mobile app or website), they do not address the data they obtain about their users from other sources (e.g., other companies). It is evident from Fig. 8 that the requirements ICO-Q2 and ICO-Q6 are still missing from more than 80% of the policies in our dataset. Although not all policies are expected to obtain such data indirectly, these numbers are large enough to warrant monitoring in future benchmarks.

Case 2: Requirements Still Covered

A smaller portion of the policies maintains coverage of the eight ICO requirements. The requirements ICO-Q1 and ICO-Q3 are still maintained to a high degree, indicating that many policies continue to provide the purposes of their data collection and the recipients of the collected data. However, we found no example of a policy that maintained coverage of all eight ICO requirements; half of the policies maintain at most one or two ICO requirements between their Pre-GDPR and Post-GDPR versions.

Case 3: Requirements Worsened

For the vast majority of the policies, there has been no noticeable decline in the covered ICO requirements; except for ICO-Q1, the decline is limited to less than 10% of the policies. We observe that ICO-Q1 and ICO-Q3, referring to the purposes and recipients of collected data, exhibit the highest decline among the eight requirements. Recall that for both requirements, we calculate the score based on the number of segments satisfying the associated queries. While the number of such segments has declined, it is still larger than zero, indicating that the requirements are covered, albeit with fewer segments.

Interestingly, we found several cases where policies have lower coverage of the binary requirements, indicating that the policy no longer covers the practice. We manually analyzed several of these cases and found that some policies have moved the descriptions to other webpages (e.g., oath.com with data retention). In other cases, the increased ambiguity in describing the practices resulted in lower coverage of the requirements. The following example from the privacy policy of www.mckinsey.com shows a decline in the coverage of ICO-Q4 (retention period):

  • Pre-GDPR: McKinsey would like to know whether a job candidate has previously applied. A minimal amount of personally-identifiable information will be retained for this purpose, and it will be deleted after 5 years.

  • Post-GDPR: McKinsey retains personal data, as necessary, for the duration of the relevant business relationship. We may also retain personal data for longer than the duration of the business relationship should we need to retain it to protect ourselves against legal claims, use it for analysis or historical record-keeping, or comply with our information management policies and schedules.

Case 4: Requirements Improved

Finally, a considerable portion of the policies has improved coverage of the ICO requirements. ICO-Q1, ICO-Q3, and ICO-Q8 have seen the highest improvement among the policies in our dataset. We observe that only a minority of the policies (mostly from the more popular websites) have started addressing the previously missing requirements (such as ICO-Q2, ICO-Q4, and ICO-Q7). For example, NYTimes has added a new clause to address the requirement of notifying users about changes to the privacy policy:
We evaluate this privacy policy periodically in light of changing business practices, technology and legal requirements. As a result, it is updated from time to time. Any such changes will be posted on this page. If we make a significant or material change in the way we use or share your personal information, you will be notified via email and/or prominent notice within the NYT Services at least 30 days prior to the changes taking effect.

Conclusion

Similar to the ambiguity analysis, we find privacy policies to be increasingly inclusive of the GDPR’s privacy requirements. More than 90% of the policies list the purposes for processing as well as the recipients of the user’s data. While many privacy policies are still missing several of the GDPR’s new requirements, others have begun addressing them; for example, nearly half of the policies provide users with options to access their data.

10 Limitations

Despite our efforts to cover a diverse set of websites and to understand privacy policies’ evolution from multiple angles, we acknowledge that this work has several limitations.

First, our approach to assessing the content of privacy policies does not fully capture all the efforts introduced due to the GDPR. For instance, several companies have adopted layered privacy notices, which give users two levels of understanding: an overview of high-level practices and an in-depth description of these practices. Unfortunately, such notices are difficult to analyze automatically, as they come in a variety of formats, such as expandable hidden text and multi-page policies. This could have affected our readability and ambiguity metrics in particular. Still, our user study partially captures such visual changes, and our compliance metrics are largely unaffected. Moreover, our presentation analysis considers HTML tags only, while CSS styling might also have affected the look and feel of a policy.

Second, our study is limited to English-language policies because we wanted to leverage existing techniques for advanced semantic understanding of natural language. We traded language coverage for depth, deciding against a keyword-based analysis, which would have been easier to extend across languages with a dictionary-based approach.

Third, using automated approaches to select and analyze privacy policies is inherently error-prone. Hence, our ambiguity and compliance analyses might have been affected by errors made by the machine learning models. Nevertheless, given the recent success and increasing accuracy of such models across different tasks [31, 9], we believe that these techniques, coupled with manual post-analysis, are a highly efficient avenue for in-depth analysis at scale.

11 Related Work

Over the past two decades, researchers have investigated the effectiveness of privacy policies. We survey the evolution of the privacy policy landscape, particularly in relation to regulatory intervention. We also describe recent research studying the GDPR’s impact on privacy policies and the broader trend of automated analysis of privacy policies.

Evolution of the Privacy Policies Landscape

In 2002, the Progress and Freedom Foundation (PFF) studied a random sample of the most visited websites and found that, compared to two years earlier, websites were collecting less personally identifiable information and offering more opt-in and fewer opt-out choices [1]. Another longitudinal analysis, performed by Milne and Culnan over the 1998–2001 period, confirmed the positive change in the number of websites including notices about information collection, third-party disclosures, and user choices [22]. In the same period, Liu and Arnett found that slightly more than 30% of Global 500 websites provided privacy policies on their home pages [18].

Despite the increased proliferation of privacy policies, their lack of clarity was one of the major motivations for the regulatory measures before the GDPR. In 2004, Antón et al. showed that 40 online privacy statements from 9 financial institutions had questionable compliance with the requirements of the Gramm-Leach-Bliley Act (GLBA); they assessed the requirement of having “clear and conspicuous” policies via keyword-based investigation and readability metrics [2]. In 2007, Antón et al. studied the effect of the Health Insurance Portability and Accountability Act (HIPAA) via a longitudinal study of 24 healthcare privacy policy documents from 9 healthcare websites [4]. A similar conclusion held: although HIPAA resulted in more descriptive policies, the overall trend was reduced readability and less clarity. Resorting to a user study, Vail et al. showed in 2008 that users perceive traditional, paragraph-form privacy policies to be more secure than shorter, simplified alternatives, yet also demonstrated that these policies are significantly more difficult to comprehend than other formats [29].

In a recent study, Turow et al. examined surveys about the “privacy policy” label over the period from 2003 to 2015 in the US. They found that users’ misplaced confidence in this label not only carries implications for their personal lives but may also affect their actions as citizens in response to government regulations or corporate activities [27].

Privacy Policies After the GDPR

Since the GDPR went into effect on May 25, 2018, a few studies have investigated its impact on the privacy practices of companies. Despite the initial trials with automated approaches in these studies, they have been limited in terms of scale, which is the main goal behind automation. Contissa et al. conducted a preliminary study on 14 privacy policies of top companies in an attempt to automatically measure their compliance with the GDPR [5]. They found a frequent presence of unclear language, problematic processing, and insufficient information, using Claudette, a recent system designed to detect such types of issues in privacy policies [17]. Tesfay et al. introduced a tool, inspired by the GDPR, to classify privacy policy content into eleven privacy aspects [26]; they validated their approach with 10 privacy policies.

The first large-scale study concurrent to ours is by Degeling et al., who performed a longitudinal analysis of the privacy policies and cookie consent notices of 6,759 websites, representing the 500 most popular websites in each of the 28 EU member states [7]. They found that the number of websites with privacy policies increased by 4.9% and that 50% of websites updated their privacy policies just before the GDPR came into effect in May 2018. Unlike our work, however, theirs focuses on cookie consent notices and on terminology-based analysis of privacy policies, without an in-depth treatment of the semantic change of privacy policies.

Automated Analysis of Privacy Policies

Attempts at automating the analysis of privacy policies have accelerated over the past two years [10, 9, 32, 31, 25, 15]. Since the introduction of the Online Privacy Policies dataset (OPP-115) by Wilson et al. [31], several works have employed that dataset as a basis for automating the analysis of privacy policies at various degrees of granularity. Zimmeck et al. used it to check the discrepancy between permissions and privacy policies in mobile apps [32]. Sathyendra et al. used it to automatically classify the opt-out choices in privacy policies [25]. Recently, Harkous et al. developed Polisis, the first system with a unified architecture for in-depth modeling of privacy policies [9]. Our work is the first to employ the automatic analysis of privacy policies in a longitudinal study.

12 Conclusion

In this paper, we seek to answer a question about the impact of the recently introduced GDPR on online privacy policies. To answer this question, we analyze a sample of more than 3,000 Pre-GDPR and Post-GDPR policies along five dimensions: presentation, readability, coverage, ambiguity, and compliance. We study the presentation of privacy policies through a user study with 530 participants and find that the user-perceived appeal of privacy policies has improved with the introduction of the GDPR. We compare the readability of Pre-GDPR and Post-GDPR policies using several widely used metrics and conclude that policies have become considerably longer, demanding more user time and effort. For the last three dimensions, we employ a novel goal-based methodology to automatically extract the privacy categories, ambiguity levels, and compliance indicators from the policies. We observe that policies have improved their coverage of some privacy practices, and ambiguity levels have declined in a significant number of policies. Nevertheless, from a compliance perspective, a considerable gap remains to be filled on several levels.

References

  • [1] W. F. Adkinson, J. A. Eisenach, and T. M. Lenard, “Privacy online: A report on the information practices and policies of commercial web sites,” Progress and Freedom Foundation, 2002.
  • [2] A. I. Antón, J. B. Earp, Q. He, W. Stufflebeam, D. Bolchini, and C. Jensen, “Financial privacy policies and the need for standardization,” IEEE Security & Privacy, vol. 2, no. 2, pp. 36–45, 2004.
  • [3] A. I. Antón, J. B. Earp, and A. Reese, “Analyzing website privacy requirements using a privacy goal taxonomy,” in Requirements Engineering, 2002. Proceedings. IEEE Joint International Conference on.   IEEE, 2002, pp. 23–31.
  • [4] A. I. Antón, J. B. Earp, M. W. Vail, N. Jain, C. M. Gheen, and J. M. Frink, “HIPAA’s effect on web site privacy policies,” IEEE Security & Privacy, vol. 5, no. 1, pp. 45–52, 2007.
  • [5] G. Contissa, K. Docter, F. Lagioia, M. Lippi, H.-W. Micklitz, P. Pałka, G. Sartor, and P. Torroni, “Claudette meets GDPR: Automating the evaluation of privacy policies using artificial intelligence,” 2018.
  • [6] E. Dale and J. S. Chall, “A formula for predicting readability: Instructions,” Educational Research Bulletin, vol. 27, no. 2, pp. 37–54, 1948. [Online]. Available: http://www.jstor.org/stable/1473669
  • [7] M. Degeling, C. Utz, C. Lentzsch, H. Hosseini, F. Schaub, and T. Holz, “We value your privacy… now take some cookies: Measuring the GDPR’s impact on web privacy,” arXiv preprint arXiv:1808.05096, 2018.
  • [8] Federal Trade Commission, “How to comply with the privacy of consumer financial information rule of the Gramm-Leach-Bliley Act,” https://www.ftc.gov/tips-advice/business-center/guidance/how-comply-privacy-consumer-financial-information-rule-gramm, 2002, accessed: 2018-07-01.
  • [9] H. Harkous, K. Fawaz, R. Lebret, F. Schaub, K. Shin, and K. Aberer, “Polisis: Automated analysis and presentation of privacy policies using deep learning,” in 27th USENIX Security Symposium (USENIX Security 18).   USENIX Association, 2018.
  • [10] H. Harkous, K. Fawaz, K. G. Shin, and K. Aberer, “Pribots: Conversational privacy with chatbots,” in Workshop on the Future of Privacy Notices and Indicators, SOUPS 2016, Denver, CO, USA, June 22, 2016.   USENIX Association, 2016.
  • [11] D. Janan and D. Wray, “Readability: the limitations of an approach through formulae,” in British Educational Research Association Annual Conference, University of Manchester, Manchester, England, 2012.
  • [12] Y. Kim, “Convolutional neural networks for sentence classification,” in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, October 25-29, 2014, Doha, Qatar, A meeting of SIGDAT, a Special Interest Group of the ACL, 2014, pp. 1746–1751. [Online]. Available: http://aclweb.org/anthology/D/D14/D14-1181.pdf
  • [13] J. P. Kincaid, J. A. Aagard, J. W. O’Hara, and L. K. Cottrell, “Computer readability editing system,” IEEE Transactions on Professional Communication, vol. PC-24, no. 1, pp. 38–42, March 1981.
  • [14] C. Kohlschütter, P. Fankhauser, and W. Nejdl, “Boilerplate detection using shallow text features,” in Proceedings of the third ACM international conference on Web search and data mining.   ACM, 2010, pp. 441–450.
  • [15] L. Lebanoff and F. Liu, “Automatic detection of vague words and sentences in privacy policies,” arXiv preprint arXiv:1808.06219, 2018.
  • [16] G. Lindgaard, G. Fernandes, C. Dudek, and J. Brown, “Attention web designers: You have 50 milliseconds to make a good first impression!” Behaviour & information technology, vol. 25, no. 2, pp. 115–126, 2006.
  • [17] M. Lippi, P. Palka, G. Contissa, F. Lagioia, H.-W. Micklitz, G. Sartor, and P. Torroni, “Claudette: an automated detector of potentially unfair clauses in online terms of service,” arXiv preprint arXiv:1805.01217, 2018.
  • [18] C. Liu and K. P. Arnett, “Raising a red flag on global WWW privacy policies,” Journal of Computer Information Systems, vol. 43, no. 1, pp. 117–127, 2002.
  • [19] E. T. Loiacono, R. T. Watson, D. L. Goodhue et al., “Webqual: A measure of website quality,” Marketing theory and applications, vol. 13, no. 3, pp. 432–438, 2002.
  • [20] M. Lui and T. Baldwin, “langid.py: An off-the-shelf language identification tool,” in Proceedings of the ACL 2012 System Demonstrations.   Association for Computational Linguistics, 2012, pp. 25–30.
  • [21] F. Marotta-Wurgler, “Self-regulation and competition in privacy policies,” The Journal of Legal Studies, vol. 45, no. S2, pp. S13–S39, 2016.
  • [22] G. R. Milne and M. J. Culnan, “Using the content of online privacy notices to inform public policy: A longitudinal analysis of the 1998-2001 us web surveys,” The Information Society, vol. 18, no. 5, pp. 345–359, 2002.
  • [23] R. Ramanath, F. Liu, N. M. Sadeh, and N. A. Smith, “Unsupervised alignment of privacy policies using hidden markov models,” in Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, ACL 2014, June 22-27, 2014, Baltimore, MD, USA, Volume 2: Short Papers, 2014, pp. 605–610. [Online]. Available: http://aclweb.org/anthology/P/P14/P14-2099.pdf
  • [24] K. Reinecke, T. Yeh, L. Miratrix, R. Mardiko, Y. Zhao, J. Liu, and K. Z. Gajos, “Predicting users’ first impressions of website aesthetics with a quantification of perceived visual complexity and colorfulness,” in Proceedings of the SIGCHI Conference on Human Factors in Computing Systems.   ACM, 2013, pp. 2049–2058.
  • [25] K. M. Sathyendra, S. Wilson, F. Schaub, S. Zimmeck, and N. Sadeh, “Identifying the provision of choices in privacy policy text,” in Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, 2017, pp. 2774–2779.
  • [26] W. B. Tesfay, P. Hofmann, T. Nakamura, S. Kiyomoto, and J. Serna, “PrivacyGuide: Towards an implementation of the EU GDPR on internet privacy policy evaluation,” in Proceedings of the Fourth ACM International Workshop on Security and Privacy Analytics.   ACM, 2018, pp. 15–21.
  • [27] J. Turow, M. Hennessy, and N. Draper, “Persistent misperceptions: Americans’ misplaced confidence in privacy policies, 2003–2015,” Journal of Broadcasting & Electronic Media, vol. 62, no. 3, pp. 461–478, 2018.
  • [28] US Department of Health and Human Services, “Summary of the HIPAA privacy rule,” Washington, 2003.
  • [29] M. W. Vail, J. B. Earp, and A. I. Antón, “An empirical study of consumer perceptions and comprehension of web site privacy policies,” IEEE Transactions on Engineering Management, vol. 55, no. 3, pp. 442–454, 2008.
  • [30] A. Van Lamsweerde, “Goal-oriented requirements engineering: A guided tour,” in Requirements Engineering, 2001. Proceedings. Fifth IEEE International Symposium on.   IEEE, 2001, pp. 249–262.
  • [31] S. Wilson, F. Schaub, A. A. Dara, F. Liu, S. Cherivirala, P. G. Leon, M. S. Andersen, S. Zimmeck, K. M. Sathyendra, N. C. Russell, T. B. Norton, E. H. Hovy, J. R. Reidenberg, and N. M. Sadeh, “The creation and analysis of a website privacy policy corpus,” in Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, August 7-12, 2016, Berlin, Germany, Volume 1: Long Papers, 2016. [Online]. Available: http://aclweb.org/anthology/P/P16/P16-1126.pdf
  • [32] S. Zimmeck, Z. Wang, L. Zou, R. Iyengar, B. Liu, F. Schaub, S. Wilson, N. Sadeh, S. M. Bellovin, and J. Reidenberg, “Automated analysis of privacy requirements for mobile apps,” in 24th Annual Network and Distributed System Security Symposium, NDSS 2017, 2017.

Appendix A Policy Classifier Architecture

Fig. 9 shows the detailed architecture of the single-label classifier used in Sec. 3 for checking whether the crawled pages are valid privacy policies. The input text, obtained from the web page, is tokenized into words using the Penn Treebank Tokenizer in nltk (http://www.nltk.org/). The words are then mapped into vectors at the word-embeddings layer. The word vectors are input to a convolutional layer, followed by a Rectified Linear Unit (ReLU) and a max-pooling layer. The next layer is a fully connected (dense) layer followed by another ReLU. Finally, we apply a softmax on the output dense layer to obtain a probability distribution over the two labels “valid” and “invalid”. For more details about this kind of classifier, we refer the reader to the work of Kim [12] on sentence classification.

Figure 9: Architecture of the “Is Policy?” classifier used to determine whether a crawled page is a valid privacy policy. Hyperparameters: embeddings size: 100; number of filters: 250; filter size: 3; dense layer size: 250; batch size: 32.
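For concreteness, the following is a minimal Keras sketch of this architecture, using the hyperparameters from Fig. 9. The vocabulary size and maximum sequence length are illustrative assumptions, as they are not specified above; this is a sketch of the described design, not the authors’ exact implementation.

from nltk.tokenize import TreebankWordTokenizer  # Penn Treebank tokenization
from tensorflow.keras import layers, models

VOCAB_SIZE = 20000  # assumption: vocabulary size is not specified
MAX_LEN = 1000      # assumption: number of tokens kept per page

tokenizer = TreebankWordTokenizer()  # tokenizes page text into words

model = models.Sequential([
    layers.Embedding(VOCAB_SIZE, 100, input_length=MAX_LEN),  # embeddings size: 100
    layers.Conv1D(250, 3, activation="relu"),   # 250 filters of size 3, with ReLU
    layers.GlobalMaxPooling1D(),                # max-pooling layer
    layers.Dense(250, activation="relu"),       # dense layer of size 250, with ReLU
    layers.Dense(2, activation="softmax"),      # distribution over "valid"/"invalid"
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# Training would use mini-batches of 32, per the hyperparameters in Fig. 9.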

Appendix B User Survey

In Fig. 10, we show a screenshot of an example step from the user survey that we presented to the users in Sec. 4.

Figure 10: Example step from our user survey where the users had to respond to the three questions