Temporal Limits of Privacy in Human Behavior
Large-scale collection of human behavioral data by companies raises serious privacy concerns. We show that behavior captured in the form of application usage data collected from smartphones is highly unique even in very large datasets encompassing millions of individuals. This makes behavior-based re-identification of users across datasets possible. We study 12 months of data from 3.5 million users and show that four apps are enough to uniquely re-identify 91.2% of users using a simple strategy based on public information. Furthermore, we show that there is seasonal variability in uniqueness and that application usage fingerprints drift over time at an average constant rate.
Author contributions: V.S. and H.J. conceived the study. V.S. and H.J. designed measures and analyses. V.S. collected and curated the data. V.S. conducted the analysis. V.S., E.M., and H.J. wrote the manuscript. All authors interpreted and discussed the findings. The authors have no conflict of interest. V.S. and H.J. led the research and contributed equally to this work.
Tracking behavior is a fundamental part of the emerging big-data economy, allowing companies and organizations to segment, profile and understand their users in increasingly greater detail. Modeling context and interests of users has proven to have various advantages: products can be designed to better fit customers’ needs; content can be adapted; and advertising can be made more relevant (agrawal1993mining, bell2007lessons, chen2009large, mislove2010you, dodds2010measuring, mislove2011understanding). Efficient user modeling requires the collection of large-scale datasets of human behavior, which has led to a growing proportion of human activities to be recorded and stored (conte2012manifesto). Today, most of our interactions with computers are stored in a database, whether it is an e-mail, phone call, credit-card transaction, Facebook like, or online search, and the rate of information growth is expected to accelerate even further in the future (lazer2009computational). These rich digital traces can be compiled into detailed representations of human behavior and can revolutionize how we organize our societies, fight diseases, and perform research; however, they also raise serious privacy concerns (blumberg2009locational, eckersley2010unique, de2013unique, hannak2013measuring, greenwood2014new, de2015unique, sapiezynski2015tracking, mayer2016evaluating). For example, Narayanan et al. demonstrated the feasibility of inferring political views of IMDb users through re-identification of movie ratings (narayanan2008robust). Another infamous case is the hacking (and eventual erasure of personal data) of multiple accounts of a journalist, which was carried out by the attacker being able to connect two different databases (wiredhack).
The ubiquity and sensing capabilities of mobile phones, together with our seemingly symbiotic relationship to them, render these devices good tools for tracking and studying human behavior (eagle2006reality, stopczynski2014measuring). Mobile phones are ubiquitous and have permeated nearly every human society: in the year 2015 98.3% of the world’s population had a mobile subscription (itu2016). Mobile phones have transformed the way people access the internet as well: today the majority of traffic to web pages stems from mobile devices rather than from desktop computers (stonetemple), leading advertisers to target mobile phones to a higher degree. With the standard methods based on cookies for identifying customers not being used in smartphone apps, along with the rising usage of ad-blockers among users (pagefair2017), advertisers and so-called data brokers are now targeting smartphone applications to replace the rich data cookies provided in the past. Advertisement identifiers are one such ID embedded in applications, but they do not allow data brokers to track users across multiple applications or devices, and they can even be reset by the user. Application usage behavior, however, cannot be cleared, and it is hard (and in many cases not feasible) for users to change or manipulate. This creates an economic incentive for global population tracking of application usage. This tracking is in conflict with users’ perception of permissible usage of data (martin2018penalty). Also, in general, users are not knowledgeable enough about what data is collected about them to make an informed decision (posner1981economics).
A majority of the online services people interact with on a daily basis collect personal information and sell the data to data brokers (third parties) (anthes2015data). In a recent report released by the U.S. Federal Trade Commission, it was shown that data broker companies obtain vast amounts of personal data, which they further enrich with additional online and offline sources, and re-sell these improved datasets to the highest bidder, typically without the explicit consent or knowledge of the users (ramirez2014data). According to U.S. privacy laws, data is considered anonymous if it does not contain personally identifiable information (PII) such as name, home address, email address, phone number, social security number, or any other obvious identifier. As a result, it is legal for companies to share and sell anonymized versions of a dataset. However, as studies have shown, the mere absence of PII in a dataset does not necessarily guarantee anonymity due to the fact that it is relatively easy to compromise the privacy of individuals (narayanan2008robust, sweeney2002k, de2013unique).
Human behavior, although imbued with routines, is inherently diverse. Previous work has shown that 99.4% of smartphone users have unique app usage patterns and established the viability of using apps as markers of human identity, similar in application to fingerprints in forensic science (falaki2010diversity, welke2016differentiating, achara2015unicity). It has further been demonstrated that the software infrastructure we use to access the Internet can be used to identify users (eckersley2010unique). The digital breadcrumbs we leave online can be used to infer many aspects of our lives. It has been shown, for example, that age, gender, relationship status, education level, political beliefs, sexual orientation, religion, and even personality can be predicted from Facebook likes (kosinski2013private, youyou2015computer), or based on the apps people use on their smartphones (chittaranjan2013mining, seneviratne2014predicting, malmi2016you). Human mobility traces have been shown to be highly unique, and research has further shown that 4 spatio-temporal points are sufficient to re-identify a majority of individuals (de2013unique).
This study demonstrates how easy it is to uniquely identify individuals from their smartphone usage patterns given only a handful of data points, and investigates the temporal patterns of uniqueness, revealing that humans are easier to identify during certain periods of the year. We define identification as matching a behavior pattern against an (anonymous) quasi-identifier consisting of a similar pattern. In the dataset we use, no further information can be gained about the user beyond matching two patterns. However, in a real-world scenario, an attacker could use this method for connecting two datasets to learn new information about the re-identified user, e.g. email address, age, or gender, depending on the data available to the attacker. Our study focuses on applications (apps), small software programs which users can download to their smartphones and which provide a near unlimited range of functions, from simple functions such as flashlights or calculators to more advanced, artificial-intelligence-like functions. Each new phone comes with a set of apps pre-loaded by the manufacturer, but users are free to customize their devices to suit their specific needs, with access to millions of apps on app stores such as Google Play (approx. 2.8 million apps) (appbrain).
Uniqueness of human behavior
To evaluate the likelihood of identifying individuals within smartphone usage data we use a dataset that spans 12 months (Feb. 1st 2016 to Jan. 31st 2017) and encompasses 3.5 million people using in total 1.1 million unique apps. We have chosen to disregard phone vendor specific apps, such as alarm clock apps, built-in phone dialer apps, etc. and only focus on apps that are downloadable from Google Play. From this we form app fingerprints for each user, i.e. a binary vector containing information about which apps the user has used for every month. We only consider apps actually used by a user in a month, not apps that were installed but never used. Figure 1 illustrates the typical patterns of app usage, with individuals continuously changing their app-fingerprint over the course of a year by trying out new apps and ceasing to use others. As such, app-fingerprints slowly drift over time, with the average rate of change being roughly constant between consecutive months (Figure S1). In combination with fingerprints drifting, the number of apps people use on their smartphones is constant over time as well, suggesting that humans have a limited capacity for interacting, navigating, and managing the plethora of services and social networks offered by smartphones (Figure S2). This limiting effect has been observed in other aspects of life such as interactions among people (dunbar1992neocortex) or geo-spatial exploration (alessandretti2016evidence).
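The fingerprint representation and its drift can be illustrated with a minimal sketch. It stores each monthly binary fingerprint as a set of used apps and measures month-to-month change with the Jaccard distance. The Jaccard distance is an assumption chosen for illustration, not necessarily the measure behind Figure S1, and the app names and helper functions are hypothetical.

```python
from typing import List, Set


def jaccard_distance(a: Set[str], b: Set[str]) -> float:
    """Return 1 - |a & b| / |a | b|; two empty fingerprints count as identical."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def monthly_drift(fingerprints: List[Set[str]]) -> List[float]:
    """Distance between each pair of consecutive monthly fingerprints."""
    return [jaccard_distance(p, q) for p, q in zip(fingerprints, fingerprints[1:])]


# Hypothetical user: one app is swapped in month 2, one is added in month 3.
months = [
    {"maps", "chat", "news"},
    {"maps", "chat", "fitness"},
    {"maps", "chat", "fitness", "travel"},
]
drift = monthly_drift(months)
print(drift)
```

A set of used apps is equivalent to a binary vector over all apps, but avoids storing a vector of length 1.1 million per user and month.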
The risk of re-identifying individuals is estimated by means of unicity (de2013unique, de2015unique). Here, re-identification corresponds to successful assignment of an app-fingerprint to a single unique user in our dataset. This does not entail that we can directly get the real identity of a person, such as name, address, e-mail, social security number, etc. This, however, would become possible if this knowledge were cross-referenced with other data sources, of which there have unfortunately been countless examples (narayanan2008robust, barbaro2006face, barth2012re, sweeney2013identifying, tockar2014riding). Given an individual’s app-fingerprint, unicity quantifies the number of apps needed to uniquely re-identify that person; the fewer apps we need, the more unique a person is, and vice versa. Given a dataset of app-fingerprints and a set of apps $i$, $j$, and $k$, a user $u$ is uniquely identifiable if that user, and only that user, in the dataset has used apps $i$, $j$, and $k$, i.e. matching the fingerprint of user $u$. In our dataset we evaluate uniqueness as the percentage of users we can re-identify using $n$ apps.
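The uniqueness condition described above can be sketched in a few lines. This is an illustrative check on toy data, assuming fingerprints are stored as sets of app names; the user IDs, app names, and function names are hypothetical.

```python
from typing import Dict, List, Set


def matching_users(fingerprints: Dict[str, Set[str]], apps: Set[str]) -> List[str]:
    """Return every user whose fingerprint contains all of the given apps."""
    return [user for user, used in fingerprints.items() if apps <= used]


def is_uniquely_identifiable(
    fingerprints: Dict[str, Set[str]], user: str, apps: Set[str]
) -> bool:
    """True if `user`, and only `user`, has used all apps in `apps`."""
    return matching_users(fingerprints, apps) == [user]


# Toy dataset with three hypothetical users.
fingerprints = {
    "u1": {"chat", "maps", "chess"},
    "u2": {"chat", "maps", "news"},
    "u3": {"chat", "news", "radio"},
}

common_pair = is_uniquely_identifiable(fingerprints, "u1", {"chat", "maps"})
rare_app = is_uniquely_identifiable(fingerprints, "u1", {"chess"})
print(common_pair, rare_app)
```

In this toy dataset two popular apps fail to single out `u1`, while one rarely used app succeeds.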
To attack the dataset without any prior knowledge of the system itself, the most realistic strategy is to pick apps at random. Figure 2A shows the efficiency of this type of random sampling of apps when using 4 apps. Although the resulting rate means that only about 1 in every 5 individuals can be re-identified, it is surprisingly high given that we only use binary features (that is, has the user used the app or not) and have no information regarding when an app was used or for how long, features which would only make fingerprints more unique. In case of a real attack, however, the above results might give the general public a false sense of security, as it is possible to use free, publicly available information to formulate an attack strategy that greatly outperforms the random strategy.
The popularity of apps follows a heavy-tailed distribution (olmstead2016apps) (see also Figure S3); a few apps are used by millions or even billions of individuals, while an overwhelming majority of apps only have a couple of users. All this information is available on Google Play, from where it can be retrieved by automatic means, or it can be purchased from vendors such as AppMonsta. Because this information is so easily attainable, we formulate a strategy that takes the user base of apps (popularity of apps) into account, starting with the least used apps: the popularity strategy. Rather than using the popularity in terms of downloads on Google Play, we use the popularity counted as the number of users that use an app in our dataset (see Methods for details). A real-world re-identification attack strategy could use the Google Play download numbers for each app to reduce the amount of computation required. Figure 2B shows that just using 2 apps with the popularity strategy greatly outperforms the random strategy, and using 4 apps, we are able to re-identify 91.2% of users.
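The difference between the two selection strategies can be sketched as follows. This is a minimal illustration, assuming popularity is counted as the number of users per app within the dataset, as described above; the tie-breaking by app name, the toy data, and all function names are assumptions.

```python
import random
from collections import Counter
from typing import Dict, List, Set


def app_popularity(fingerprints: Dict[str, Set[str]]) -> Counter:
    """Count, for every app, how many users in the dataset have used it."""
    counts: Counter = Counter()
    for apps in fingerprints.values():
        counts.update(apps)
    return counts


def pick_random(fingerprint: Set[str], n: int, rng: random.Random) -> List[str]:
    """Random strategy: draw up to n apps uniformly, without replacement."""
    apps = sorted(fingerprint)  # sorted so a given seed is reproducible
    return rng.sample(apps, min(n, len(apps)))


def pick_least_popular(fingerprint: Set[str], n: int, popularity: Counter) -> List[str]:
    """Popularity strategy: take the n apps with the smallest user base."""
    ranked = sorted(fingerprint, key=lambda app: (popularity[app], app))
    return ranked[:n]


# Toy dataset with three hypothetical users.
fingerprints = {
    "u1": {"chat", "maps", "chess"},
    "u2": {"chat", "maps", "news"},
    "u3": {"chat", "news", "radio"},
}
popularity = app_popularity(fingerprints)
rare_pick = pick_least_popular(fingerprints["u1"], 2, popularity)
random_pick = pick_random(fingerprints["u1"], 2, random.Random(0))
print(rare_pick, random_pick)
```

Starting from the least used apps shrinks the set of candidate users fastest, which is why this ordering needs fewer apps than random selection.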
Seasonal variability of anonymity
Human lives, routines and behaviors evolve over time (kossinets2006empirical, sekara2016fundamental, alessandretti2016evidence), and therefore individual app-fingerprints might become harder (or easier) to identify. To quantify the seasonal variability of uniqueness, we construct monthly fingerprints for all individuals and evaluate anonymity using the unicity framework. Figure 3 shows the fraction of individuals that are re-identifiable per month, and reveals an increased fraction of identifications for June, July, and August—months which are typically considered vacation months. The increase in uniqueness is independent of how we select apps (random, or by popularity). In fact, during these three months the process of identifying individuals from randomly selected apps is respectively and more effective when using and apps. For the popularity scheme, we note and higher rates of identifications when using and apps. The increase in identifiability stems from a combination of related behavioral changes (Figure S4). Apps related to categories such as travel, weather, sports, and health & fitness gain popularity during the summer months (June, July, August), related to people traveling and downloading apps that help them navigate new cities, using fitness apps to motivate them to exercise more, and using apps that enable them to follow global sports events such as the 2016 UEFA European Championship in football (soccer). Simultaneously, apps related to categories such as education and business become less popular. This suggests an interplay between our physical behavior and our app-fingerprint, indicating that when we change our geo-spatial routines by traveling and exploring new places, we also change our app usage. This change in phone behavior makes our app-fingerprints more unique and easier to identify.
Hiding in the crowd
Our dataset is limited to 3.5 million users, similar in size to a small country, but how will uniqueness change as more users are added (increased sample-size)? Will it become possible to hide in the crowd? More precisely, how does the population size affect the extent to which a specific app-fingerprint remains unique? That is, as more and more users are added to our sample, does the likelihood to observe multiple individuals with identical fingerprints also increase? This corresponds to an inverse k-anonymity problem (sweeney2002k), where one needs to estimate the number of users that should be added in order to increase the overall anonymity of the dataset. (Bearing in mind that overall anonymity is not a good measure for the sensitivity of individual traces.) To understand the effect of sample-size on unicity, we first slice our dataset into smaller subsamples and use them to estimate the uniqueness for sample sizes ranging from 100,000 to 3.5 million individuals. Figure 4A reveals that sample size has a large effect on the re-identification rate when selecting apps using a random heuristic. Considering , the average re-identification rate decreases from for a sample size of 1 million individuals to for 2 million individuals and for the full sample of 3.5 million people. The popularity-based attack scheme is considerably less affected (Figure 4B). For we find that the re-identification rates are respectively , and for sample sizes of 1, 2 and 3.5 million individuals. As such, increasing the sample size by 250% (from 1 to 3.5 million individuals) only reduces uniqueness by approximately 4 percentage points.
In order to estimate uniqueness for sample sizes larger than the study population, we extrapolate results from Figure 4B for 5 apps. We express uniqueness of fingerprints using multiple functional forms, including power-laws, exponentials, stretched exponentials, and linear functions, each a function of the sample size with a scaling factor. The stretched exponential and power-law show the highest agreement with the data (Figure S6), and roughly suggest that 5 apps are enough to re-identify 75%–80% of individuals for 10 times larger samples (35 million individuals). Although the applied analysis displays high uncertainty with regard to extrapolations, it illustrates the observation that increasing the population size does not help us in hiding in the crowd (that is, uniqueness is not a characteristic of small sample sizes).
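As a sketch of the four candidate families, one common way to write them is given below. The symbols are assumptions for illustration: $y$ is the fraction of unique fingerprints, $x$ the sample size, $\gamma > 0$ the scaling factor, and $a$, $b$ free fit constants. The signs are chosen so that uniqueness decreases with sample size, and the exact parameterizations behind Figure S6 may differ.

```latex
\begin{align}
\text{power law:} \quad & y(x) = a\, x^{-\gamma} \\
\text{exponential:} \quad & y(x) = a\, e^{-\gamma x} \\
\text{stretched exponential:} \quad & y(x) = a\, e^{-(x/b)^{\gamma}} \\
\text{linear:} \quad & y(x) = a - \gamma x
\end{align}
```

Under these forms, the power law and the stretched exponential (with $\gamma < 1$) decay slowly for large $x$, which is consistent with uniqueness remaining high at 35 million individuals.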
The economic incentives, together with the ease and global scale of collecting and trading this data without users’ knowledge, create serious concerns, especially since this practice is in violation of users’ expectations (martin2018penalty, posner1981economics). The EU General Data Protection Regulation (GDPR) may be a first step towards addressing these concerns through regulation, since it does mention unicity (gdpr) and applies globally to data about any EU citizen. Our conclusion from this study is that application usage data should be considered personal information, since it is a unique fingerprint.
This study was performed using app usage data collected from Android phones from a single vendor only. As phone vendor specific apps were disregarded in the analysis, we expect the results to generalize across all Android devices. Further, we have no reason to believe that app usage behavior and uniqueness are fundamentally different for individuals using iOS devices compared to Android users.
To estimate the uniqueness of app-fingerprints, we apply the unicity framework (de2013unique) on samples of 10,000 randomly selected individuals. For each individual we select $n$ apps (without replacement) from the person’s app-fingerprint. With the popularity-based attack, apps with a low user base are selected to increase the uniqueness of the app usage pattern. The person is then said to be unique if they are the only individual in the dataset whose app-fingerprint contains those apps. In cases where $n$ is larger than the total length of a person’s app-fingerprint, we instead select all apps in the fingerprint. Uniqueness for a sample is then estimated as the fraction of the users that have unique traces. Overall uniqueness is the average of the samples, and error-bars are given by the standard deviation. We use .
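The estimation procedure above can be sketched for the random strategy as follows. This is a minimal illustration rather than the authors' implementation: the inverted index (app to set of users), the use of the population standard deviation, the treatment of empty fingerprints, and all names and default values are assumptions.

```python
import random
import statistics
from typing import Dict, List, Set, Tuple


def build_index(fingerprints: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Map every app to the set of users who have used it."""
    index: Dict[str, Set[str]] = {}
    for user, apps in fingerprints.items():
        for app in apps:
            index.setdefault(app, set()).add(user)
    return index


def is_unique(index: Dict[str, Set[str]], apps: List[str]) -> bool:
    """True if exactly one user in the dataset has used all of `apps`."""
    if not apps:
        return False  # assumption: an empty selection identifies nobody
    user_sets = sorted((index[app] for app in apps), key=len)
    candidates = set(user_sets[0])  # start from the rarest app
    for users in user_sets[1:]:
        candidates &= users
    return len(candidates) == 1


def estimate_uniqueness(
    fingerprints: Dict[str, Set[str]],
    n: int,
    sample_size: int,
    n_samples: int,
    seed: int = 0,
) -> Tuple[float, float]:
    """Mean and standard deviation of the unique fraction over several samples."""
    rng = random.Random(seed)
    index = build_index(fingerprints)
    users = sorted(fingerprints)
    fractions = []
    for _ in range(n_samples):
        sample = rng.sample(users, min(sample_size, len(users)))
        unique = 0
        for user in sample:
            apps = sorted(fingerprints[user])
            # Select n apps without replacement, or all apps if fewer exist.
            chosen = rng.sample(apps, min(n, len(apps)))
            unique += is_unique(index, chosen)
        fractions.append(unique / len(sample))
    return statistics.mean(fractions), statistics.pstdev(fractions)


# Two extreme toy datasets: fully disjoint and fully identical fingerprints.
disjoint = {f"u{i}": {f"app{i}a", f"app{i}b"} for i in range(50)}
identical = {f"u{i}": {"chat", "maps"} for i in range(50)}
print(estimate_uniqueness(disjoint, n=2, sample_size=10, n_samples=5))
print(estimate_uniqueness(identical, n=2, sample_size=10, n_samples=5))
```

Uniqueness is always checked against the full dataset, while only the sampled individuals are attacked, which matches the description of samples of 10,000 individuals drawn from the dataset.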
Subsampling the dataset
To quantify the relation between sample size and uniqueness, we subsample the dataset by selecting a fraction of the original dataset. For each sample we estimate uniqueness using the above methodology. To account for selection bias we estimate uniqueness as the average of multiple realizations of a sample size. We use 20 realizations for sample sizes between 100,000 and 500,000, 10 realizations for samples between 600,000 and 900,000, and 5 realizations for sample sizes above 1,000,000 individuals.
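The realization schedule can be sketched as a small helper plus a subsampling generator. The thresholds follow the text above; how sizes that fall between the stated ranges, or exactly at 1,000,000, are handled is an assumption, as are the function names and the seed.

```python
import random
from typing import Iterator, List, Sequence


def realizations_for(sample_size: int) -> int:
    """Number of independent subsamples to draw for a given sample size."""
    if sample_size <= 500_000:
        return 20
    if sample_size < 1_000_000:
        return 10
    return 5  # assumption: exactly 1,000,000 is treated like larger sizes


def subsamples(
    users: Sequence[str], sample_size: int, n_realizations: int, seed: int = 0
) -> Iterator[List[str]]:
    """Yield independent random subsets of the user population."""
    rng = random.Random(seed)
    for _ in range(n_realizations):
        yield rng.sample(list(users), sample_size)


# Toy population of 100 hypothetical user IDs, three realizations of size 10.
population = [f"u{i}" for i in range(100)]
draws = list(subsamples(population, sample_size=10, n_realizations=3))
print(realizations_for(100_000), realizations_for(900_000), realizations_for(3_500_000))
```

Uniqueness for a given sample size would then be the average of the per-realization estimates, with fewer realizations needed for large samples because their estimates vary less.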
V.S. and H.J. would like to thank Sune Lehmann for useful discussions and feedback.
Supplementary Information
We use a dataset that spans 12 months, from Feb. 1st 2016 to Feb. 1st 2017, and contains monthly app-fingerprints for 3,556,083 individuals. Each fingerprint is a binary vector composed of the apps a person has used during a month. We do not consider apps that are installed but unused.
We further disregard phone vendor specific apps such as: alarm clock, phone dialer, settings etc. and only focus on apps that are downloadable from Google Play. This removes vendor bias, and makes re-identification harder. The users are selected from major markets in the Americas, Europe and Asia. Thus, the impact of regional variations on uniqueness due to local applications is smaller than if we had sampled users from anywhere in the world.
In total, the number of unique apps in the dataset is 1,129,110, and each individual in the dataset uses at least 3 apps per month.