Privacy Leakage through Innocent Content Sharing in Online Social Networks
The increased popularity and ubiquitous availability of online social networks and globalised Internet access have affected the way in which people share content. The information that users willingly disclose on these platforms can be used for various purposes, from building consumer models for advertising, to inferring personal, potentially invasive, information.
In this work, we use Twitter, Instagram and Foursquare data to convey the idea that the content shared by users, especially when aggregated across platforms, can potentially disclose more information than was originally intended.
We perform two case studies: First, we perform user de-anonymization by mimicking the scenario of finding the identity of a user making anonymous posts within a group of users. Empirical evaluation on a sample of real-world social network profiles suggests that cross-platform aggregation introduces significant performance gains in user identification.
In the second task, we show that it is possible to infer physical location visits of a user on the basis of shared Twitter and Instagram content. We present an informativeness scoring function which estimates the relevance and novelty of a shared piece of information with respect to an inference task. This measure is validated using an active learning framework which chooses the most informative content at each given point in time. Based on a large-scale data sample, we show that by doing this, we can attain an improved inference performance. In some cases this performance exceeds even the use of the user’s full timeline.
SIGIR PIR ’16, July 21, 2016, Pisa, Italy
User privacy is a topic that has increasingly gained traction with the rise of online social networks (OSN). These platforms allow users to communicate, connect with peers and share content. Originally, OSNs mainly focused on these core aspects, but nowadays the term also includes platforms which are primarily user-centric, allowing members to broadcast personal thoughts and content. In 2010, Kumar and Tomkins found OSNs among the most frequently visited Web sites for a large population of users. Due to their prevalence and abundance of personal content, OSNs lend themselves to the study of human behavior at scale .
Recent successful initial public offerings (IPO) and high market valuations underline the monetary value of OSNs. However, the relation between the number of registered users, their online activity, and these valuations is not entirely clear. It has been shown in several studies that user characteristics, such as personality traits  or future route intentions , can be reliably inferred from corresponding OSN profiles. Although the general value of personal data is widely accepted, there have not been many studies which assign a tangible value to OSN profiles. As a consequence, both for users and platform providers, the value of information remains a vague notion, at best. This situation is detrimental to users, who cannot be expected to make informed decisions about privacy controls as long as they do not know the value and potential risk of disclosing a given information item. It is equally detrimental to platform and service providers, who blindly buy and sell user data in bulk instead of saving resources by concentrating on select relevant portions of information.
In this paper, we aim to draw attention to accidental privacy leakage through content sharing in online social networks and make a first step toward describing a formal metric of task-specific informativeness of pieces of shared content.
Our empirical study relies on three popular OSN platforms: (1) Twitter, a microblogging platform whose main content comes in tweets, posts limited to 140 characters which can contain text, media (video or images), links to external Web sites, references to other users and hashtags (terms starting with the # symbol, which are used to mark keywords or topics in a tweet). (2) Instagram, a photo sharing platform. Its main content is visual in nature, accompanied by optional textual descriptors. (3) Foursquare, a location service platform concentrating on the notion of check-ins. Check-ins correspond to real-world venues that the user has visited. In addition to the venue name, further information such as location and venue categories is available.
Our investigation is driven by the following research questions:
How well can we uniquely identify a user based on matching a set of unseen posts to a user’s online footprint and is there a benefit in modelling user identities across more than one OSN?
For the same user, is the information posted in one OSN indicative of the information contained in another OSN?
Can we quantify the amount of new information that a piece of shared content carries, with respect to a concrete inference task?
In particular, to address RQ 1, we mimic the following scenario: given a collection of users’ online footprints and a set of anonymous posts, we test whether we can correctly identify the author of the anonymous posts after seeing part of their online footprint.
As for RQ 2 and 3, we consider that a user may unintentionally expose personal information through seemingly innocuous shared content. For example, when a user shares a venue check-in, it is easy to infer which venue category was visited. However, a post which does not mention a place explicitly might still contain information about a potential behaviour or visit intention. Consider two tweets from the same user: “Lol should start heading to the gym #fitness” and “What a great sunny day!”. It is clear that the first tweet contains more information about the user’s intention to visit a venue type than the second. To this end, we devise an informativeness metric for shared content. The metric explicitly models the item’s relevance towards a given inference task as well as its novelty in comparison to the previously seen timeline. Such a score can serve as an indication of the amount of novel information disclosure associated with an information item and can, in the future, help both service providers and privacy advocates in making informed decisions.
This paper makes three novel contributions beyond the state of the art:
We formulate a scoring function to quantify the information value of shared content from the perspective of improving the performance of a concrete inference task.
We present a practical way of exploiting the theoretical model by integration into an active learning scheme that results in enhanced user model learning rates.
In both practical settings, we put particular emphasis on cross-platform models of user identity.
The remainder of this paper is structured as follows: Section 2 gives an overview of related work dedicated to user modelling as well as privacy protection in Web scenarios. Section 3 introduces a parallel corpus of OSN profiles belonging to the same natural person. On this basis, Section 4 discusses unique user identification methods employing intra-platform as well as cross-platform information. Section 5 formally describes an informativeness score for shared OSN content and applies it to the task of predicting user traits manifested on one platform based on the user’s activity on other OSNs. Finally, Section 6 concludes with a brief discussion of our main findings as well as an outlook on ongoing and future extensions to this work.
2 Related Work
There is a wealth of work dedicated to privacy protection in Web information systems such as search engines or social networks. A number of early studies investigate the common privacy concerns of information system users [17, 18, 16, 9], finding that general concerns are abundant among Internet users but remain vague and imprecise. Many users are aware of the information collection and behavioral profiling activities undertaken by service providers as well as the wide range of data-driven inference efforts that have been presented by the academic and industrial communities [15, 12]. In spite of this knowledge, however, even technology-affine users cannot reliably quantify the exact risks entailed by careless information disclosure.
Privacy concerns become especially prevalent in mobile computing environments . De Montjoye et al.  show that as few as four hourly GPS samples are enough to uniquely identify 95% of all individuals in a 500k-user phone log. We encounter an even greater potential for privacy hazards in settings that go beyond raw positional traces, joining them with topical information, e.g., in Web search queries  or contextual advertising .
To counter such de-anonymization and tracking efforts, various strategies have been proposed. Dwork’s concept of differential privacy  considers adding noise to aggregate queries, preventing individual contributions to the overall aggregate from being singled out. Similarly, Carpineto and Romano  rely on the notion of k-anonymity, ensuring that no query should return fewer than k individual records. In the domain of personal information, these approaches may not go far enough, since certain frequent characteristics that would be flagged neither under k-anonymity nor under differential privacy could still cause severe privacy hazards.
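As a toy illustration of the k-anonymity idea (our own sketch, not code from either cited work), a query result can be suppressed whenever it matches fewer than k individual records:

```python
# Minimal k-anonymity guard for aggregate queries: release a result
# only if at least k individuals contribute to it, otherwise suppress.

def k_anonymous_release(records, predicate, k=5):
    """Return matching records only if at least k individuals match."""
    matches = [r for r in records if predicate(r)]
    return matches if len(matches) >= k else None

# Hypothetical query log: 3 users searched a rare term, 7 a common one.
logs = [{"user": i, "query": "flu symptoms" if i < 3 else "weather"}
        for i in range(10)]

# Only 3 matches -> suppressed under k=5.
rare = k_anonymous_release(logs, lambda r: r["query"] == "flu symptoms", k=5)
# 7 matches -> released.
common = k_anonymous_release(logs, lambda r: r["query"] == "weather", k=5)
```

As the text notes, such record-count thresholds do not protect against inferences drawn from frequent characteristics that pass the threshold.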
This paper, in spirit, follows the reasoning of Howard et al.  by measuring not just the amount of information contained in a given message, but also with respect to an inference task which can be economically relevant. In this way it attributes an economic dimension to messages, which can be an interesting measure both for industry players as well as for the message’s original author. On the basis of a number of concrete classification tasks, this paper aims to close the gap between the rich body of work on empirical analysis of privacy hazards on the one hand and the large range of available privacy protection measures on the other. We argue that only by understanding the concrete implications of information disclosure (e.g., in the form of the value of a piece of information) can users be expected to make educated decisions about the appropriate protection measures they are willing to take.
3 Dataset
As our research questions are concerned with the relationship between parallel user profiles of the same natural person across different OSNs, we rely on the methodology described in  to assemble our dataset. We obtain a collection of 618 distinct users who cross-post content from corresponding profiles in multiple social networks, totalling 1.1 million tweets, 18000 Instagram posts and 99000 Foursquare check-ins.
4 User De-anonymization
Our first use case is concerned with, given a number of anonymous social media posts $p$ and a collection of users $U$, finding the particular user $u \in U$ that authored the posts. Inspired by general text matching strategies , each user’s known previous posts are described in the form of a unigram language model $\theta_u$, and the likelihood of said user having authored the anonymous text corresponds to $P(u \mid p)$. Using Bayes’ law, one can write:
$$P(u \mid p) = \frac{P(p \mid u)\,P(u)}{P(p)}$$
And to select the most likely user:
$$u^* = \operatorname*{arg\,max}_{u \in U} \frac{P(p \mid u)\,P(u)}{P(p)}$$
To simplify the expression further, we assume that $P(p)$ is constant for all users and treat $P(u)$ as uniform across all $u \in U$. Thus, we find the most likely user by estimating $P(p \mid \theta_u)$, the probability of the posts being generated by the language model derived from $u$’s available timeline.
These timelines are projected into an $n$-dimensional TF-IDF weighted vector space. To preserve the natural way in which users write, no further vocabulary pre-processing (such as lemmatization or exclusion of less common words) was applied. Based on this representation, we estimate $P(p \mid \theta_u)$ as the product across all terms $w$ in the vocabulary $V$:
$$P(p \mid \theta_u) = \prod_{w \in V} P(w \mid \theta_u)^{tf(w,\,p)}$$
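A minimal sketch of this matching scheme in plain Python, assuming add-one (Laplace) smoothing for unseen words, a detail the text does not specify:

```python
import math
from collections import Counter

# Build a unigram language model (word counts plus total) per user.
def train_lm(timeline):
    tokens = " ".join(timeline).lower().split()
    return Counter(tokens), len(tokens)

# Log-likelihood of an anonymous post under a user's smoothed model.
def log_likelihood(post, lm, vocab_size):
    counts, total = lm
    return sum(math.log((counts[w] + 1) / (total + vocab_size))
               for w in post.lower().split())

# Toy timelines standing in for the cross-platform user footprints.
timelines = {
    "alice": ["heading to the gym", "gym session was great"],
    "bob": ["what a sunny day", "lovely sunny weather today"],
}
models = {u: train_lm(t) for u, t in timelines.items()}
vocab = {w for t in timelines.values() for post in t
         for w in post.lower().split()}

# Pick the user whose model best explains the anonymous post.
anonymous = "off to the gym again"
best = max(models, key=lambda u: log_likelihood(anonymous, models[u], len(vocab)))
# best -> "alice"
```

Working in log space avoids floating-point underflow when the product runs over many vocabulary terms.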
As described in Section 3, the dataset contains the online footprints of the same users on Twitter, Instagram and Foursquare. To mimic the described task, textual data from one OSN is used as the source of anonymous posts, while textual data from the remaining OSNs is used to generate the user language model $\theta_u$.
To generate the training data, randomly sampled sections of varying length from the training source are used to generate pairs of the form $(p, u)$. For the test set, we remove any form of user mentions to mimic an anonymous post.
Our experiments investigate six combinations of (training, test) data sources:
1. (Twitter, Twitter)
2. (Twitter + Foursquare + Instagram, Twitter)
3. (Twitter, Foursquare)
4. (Twitter + Instagram, Foursquare)
5. (Twitter, Instagram)
6. (Twitter + Foursquare, Instagram)
For the first two conditions, we split the Twitter timeline into separate training and test portions. However, the more interesting conditions are 3-6, where the source of anonymous posts differs from the data sources used to generate the language model.
Additionally, we vary the amount of available profiling information by successively revealing larger parts of the training data. Furthermore, we also study the impact of changing the number of anonymous posts to be matched.
Multiple training sources are combined (Conditions 2, 4, and 6) in the following way: for a fixed amount of available profiling information, 20% of it is made up from Instagram or Foursquare data (or 40% if these are put in together) and the rest from Twitter, due to the relative abundance of Twitter data.
The performance of our classifier is given by the accuracy of predicting the correct user who generated the anonymous posts. The results are averaged across 10 randomization runs.
In Table 1, the results for Conditions 1-2 can be found. We note that the usage of additional OSNs does not improve the de-anonymization performance. This is not too surprising, as the anonymous posts themselves come from Twitter. The results for Conditions 3-4 can be found in Table 2. We remark that the classifier’s performance does not improve with the addition of Instagram data, as users can cross-post check-ins on Twitter and can check in at a venue multiple times.
The more interesting results can be found in Table 3, which presents the results for Conditions 5-6. We note that in this case, as training and test data come from different sources, there is some improvement in the de-anonymization performance when we include Foursquare data as an extra training source.
Table 1: De-anonymization accuracy (%) by number of anonymous posts (Conditions 1-2).

| Train Source | Posts Seen | 1 | 5 | 15 | 20 |
|---|---|---|---|---|---|
| Twitter + Instagram + Foursquare | 50 | 17.56 | 46.17 | 69.82 | 73.24 |
Table 2: De-anonymization accuracy (%) by number of anonymous posts (Conditions 3-4).
Table 3: De-anonymization accuracy (%) by number of anonymous posts (Conditions 5-6).
With respect to RQ 1, we note that it is possible to match user profiles across OSNs based only on their textual data, achieving a maximum accuracy of 77.96% when using only 500 posts (the equivalent of 20% of the average length of the Twitter timelines at our disposal). Furthermore, the usage of multiple OSNs as training data sources seems to improve the classifier’s performance when the source of anonymous posts and training data are distinct, suggesting there is a consistency in user language and vocabulary across the chosen OSNs. We also observe that, in general, the more anonymous posts are available, the better the performance of the designed classifier becomes.
5 Information Valuation
Let us again start from an OSN user base $U$, in which each user $u$ is defined by the set of his associated timelines. Further, let $s$ be an OSN such that we can define the set of all posts made by $u$ in $s$ as his timeline $T_u^s$. We treat the timeline as one long consecutive piece of text in which each post constitutes a sentence. We use information from the timeline to estimate the probability of a user manifesting a certain property $v$. This probability is denoted by $P(y_v \mid T_u^s)$, where $y_v$ denotes “$u$ shows Property $v$” and $T_u^s$ is the user’s timeline. Due to our definition of timelines, the same method can be used for full timelines or subsets of posts. Regardless of the chosen scope, we now project the timeline into an $n$-dimensional TF-IDF weighted vector space that allows us to train a classifier $c$, estimating the final $P(y_v \mid T_u^s)$.
5.1 Measuring informativeness
Our objective is to find a function which quantifies the information carried in a post. On the one hand, we are interested in capturing the relevance of a post with respect to a certain inference task; on the other hand, in order to avoid redundancy and attributing high scores to already seen information, we aim to capture the novelty of the content with respect to what is already known. In a spirit similar to , we model the information content in two ways:
Relevance of the post with respect to an inference task or a set of tasks;
Novelty of the post with respect to the user’s previously posted content.
We form our informativeness score as a convex combination of these two quantities, thus introducing a mixture parameter $\alpha \in [0, 1]$. Now, for each newly authored post $p$, we can define an informativeness function as follows:
$$I(p) = \alpha \cdot \mathrm{Rel}(p) + (1 - \alpha) \cdot \mathrm{Nov}(p)$$
Measuring the relevance of shared content can intuitively be thought of as determining which piece of shared content contains features that are important for the classifier’s decision. A popular choice of such a feature importance function is the Gini importance $I_G$. For a feature $\theta$, the Gini importance for a classifier is defined as:
$$I_G(\theta) = \sum_{T} \sum_{\nu \in T} \Delta i_\theta(\nu, T)$$
where $\nu$ is a node, $T$ a decision tree, and $\Delta i$ the decrease in Gini impurity. The Gini importance indicates how often a particular feature was selected for a split, and how large its overall discriminative value was for a particular classification problem. We estimate the overall relevance of a post $p$ by summing up the importance scores across features contained in the post:
$$\mathrm{Rel}(p) = \sum_{\theta \in p} I_G(\theta)$$
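The relevance component can be illustrated with the paper's own two-tweet example; the importance values below are made-up stand-ins for the Gini importances a trained tree ensemble (e.g. scikit-learn's `feature_importances_`) would supply:

```python
# Hypothetical per-feature Gini importances for a venue-visit classifier.
gini_importance = {"gym": 0.30, "fitness": 0.25, "sunny": 0.02, "day": 0.01}

def relevance(post, importance):
    """Sum feature importances over the (lightly cleaned) words of a post."""
    return sum(importance.get(w.strip("#!.,?"), 0.0)
               for w in post.lower().split())

rel_a = relevance("Lol should start heading to the gym #fitness", gini_importance)
rel_b = relevance("What a great sunny day!", gini_importance)
# rel_a (0.55) > rel_b (0.03): the gym tweet is far more task-relevant.
```

Words absent from the classifier's feature set simply contribute zero, so off-topic chatter scores low by construction.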
For a fixed user $u$ and OSN $s$, let $x_p$ be the vector representation of a shared content $p$, and $T$ the user’s timeline.
Informally, the novelty function should have the following properties:
$p$ should have low novelty if it is contained in $T$.
$p$ should have low novelty if it is similar to $T$.
$p$ should have high novelty if it is distinct from $T$.
The proposed novelty measure is a non-symmetric function of the post and the previously seen timeline. For each word in the post, the novelty contribution decays with the number of times that the word has already appeared. By regulating a decay parameter, we can control how many times we have to observe a word before we no longer consider it novel.
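Since the text does not give the exact functional form of the decay, the sketch below assumes a per-word exponential decay $e^{-\mathrm{count}/\lambda}$, averaged over the post, together with the convex combination of relevance and novelty described above:

```python
import math
from collections import Counter

# Assumed decay form: a word seen `count` times in the timeline
# contributes exp(-count/lam) novelty; lam controls how quickly
# repeated words stop counting as novel.
def novelty(post, timeline_counts, lam=5.0):
    words = post.lower().split()
    return sum(math.exp(-timeline_counts[w] / lam) for w in words) / len(words)

# Convex combination of task relevance and novelty (mixture alpha).
def informativeness(post, timeline_counts, relevance_score, alpha=0.5, lam=5.0):
    return alpha * relevance_score + (1 - alpha) * novelty(post, timeline_counts, lam)

# Timeline in which "gym" has already appeared five times.
seen = Counter("gym gym gym gym gym sunny".split())
nov_repeat = novelty("gym", seen)   # much-repeated word -> low novelty
nov_fresh = novelty("yoga", seen)   # unseen word -> maximal novelty 1.0
info = informativeness("gym", seen, relevance_score=0.55)
```

Any monotonically decreasing function of the occurrence count would satisfy the three informal properties; the exponential form is just one convenient choice.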
5.2 Experimental Setup
As a concrete example of the general data-driven label prediction problem introduced previously, we turn towards the task of predicting whether a person will visit a particular type of location (e.g., an Italian restaurant or a golf course) based on their social network timeline(s). These timelines are projected into a TF-IDF-weighted vector space. The vocabulary is curated by removing all links and user mentions, stop words, and words occurring fewer than 5 times, and by lemmatising words using WordNet where possible. For the prediction task, we use the AdaBoost algorithm  with decision trees as weak learners, since these generally work well without refined parameter tuning. The classifier’s performance is evaluated under 10-fold cross-validation.
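The vocabulary curation steps can be sketched as follows; the stop-word list is a tiny illustrative stand-in, and the WordNet lemmatisation step is omitted here for self-containedness:

```python
import re
from collections import Counter

# Illustrative stand-in for a real stop-word list.
STOP = {"the", "a", "to", "and", "of"}

def curate(posts, min_count=5):
    """Build a vocabulary: drop links, user mentions, stop words,
    non-alphabetic tokens, and words occurring fewer than min_count times."""
    tokens = []
    for text in posts:
        text = re.sub(r"https?://\S+|@\w+", " ", text)  # links and mentions
        tokens += [w for w in text.lower().split()
                   if w.isalpha() and w not in STOP]
    counts = Counter(tokens)
    return {w for w, c in counts.items() if c >= min_count}

# "rare" and "word" appear once and are pruned; the link and mention vanish.
vocab = curate(["gym time @friend http://x.co"] * 5 + ["rare word"])
# vocab -> {"gym", "time"}
```

The surviving vocabulary then defines the dimensions of the TF-IDF vector space used by the classifier.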
We begin by training one binary classifier per venue type that decides whether or not a user’s timeline suggests they are likely to visit that type of location. For every test user $u$, we initiate the procedure by randomly sampling a single post from their timeline and create a truncated timeline. Then, at each iteration, we sample a constant number of posts from the timeline, add them to the truncated timeline and make a prediction using this iteratively updated input vector. The procedure is iterated until user $u$ has no more posts left (or until the truncated timeline reaches a fixed amount of posts). We obtain an ordered sequence of predictions $(\hat{y}_1, \ldots, \hat{y}_k)$, where $k$ represents the number of iterations. When the step size is 1, i.e., we add one post at a time, we simulate the situation in which an existing user profile is updated over time as new content is being shared.
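This iterative evaluation loop can be sketched as follows; `predict` is a stand-in for the trained per-venue classifier, and the helper name is hypothetical:

```python
import random

def iterate_predictions(timeline, predict, step=1, seed=0):
    """Reveal `step` posts per iteration and re-predict each time."""
    rng = random.Random(seed)
    remaining = timeline[:]
    rng.shuffle(remaining)
    truncated = [remaining.pop()]        # single randomly sampled seed post
    predictions = [predict(truncated)]
    while remaining:
        for _ in range(min(step, len(remaining))):
            truncated.append(remaining.pop())  # grow the truncated timeline
        predictions.append(predict(truncated))
    return predictions

# Toy classifier: predicts a visit once any post mentions "gym".
predict = lambda posts: any("gym" in p for p in posts)
timeline = ["sunny day", "coffee time", "gym session", "new shoes"]
preds = iterate_predictions(timeline, predict, step=1)
# preds has one entry per iteration; the final one sees the full timeline.
```

With `step=1` each prediction corresponds to one newly shared post, mirroring the profile-update scenario described above.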
We aggregate results taking into account the varying timeline length across users in the following way: the maximum timeline length $L_{max}$ is calculated. Then, for each user whose timeline is shorter than $L_{max}$, the last prediction is repeated to generate a sequence of predictions of length $L_{max}$. As a performance baseline, we randomly sample posts to be added. In the active method, instead, we select the posts which are most informative according to our metric presented in the previous section.
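The length normalisation amounts to padding each user's prediction sequence with its last value, which can be sketched in a few lines (hypothetical helper name):

```python
def pad_predictions(sequences):
    """Pad every sequence to the maximum length by repeating its
    last prediction, so per-iteration averages are well defined."""
    l_max = max(len(s) for s in sequences)
    return [s + [s[-1]] * (l_max - len(s)) for s in sequences]

# A user with 3 predictions and one with a single prediction.
padded = pad_predictions([[0, 1, 1], [1]])
# padded -> [[0, 1, 1], [1, 1, 1]]
```

Repeating the final prediction reflects the assumption that a user's model stops changing once their timeline is exhausted.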
Foursquare offers a hierarchical taxonomy of places that display very different relative popularity. Figure 4 plots the percentage of our user population that has visited each of the more than 500 categories. We can note that the distribution is heavily skewed. While virtually all users have, at some point, visited places that fall into broad categories such as [Arts & Entertainment] or [Food], others are so specific that they remain almost empty (e.g., [South Tyrolean Restaurants] or [Hunting Supplies]). For the purpose of venue prediction, we are forced to make a subselection of categories that are neither so broad that the prediction task would become trivial nor so specific that the classifier would not find sufficiently many training examples. For this reason, we focus on those categories that were visited by 25% to 35% of the population, giving us a set of 37 venues (highlighted in yellow in the figure).
Each user’s timeline is initialized with a single post to which we iteratively add additional, randomly sampled, posts to form an updated user model. Figure 5 shows the classifier performance as a function of the number of posts available to the user model. From this overview, we note three recurring slope patterns. Some venue-specific classifiers quickly reach their optimal performance after as few as 750 posts have been observed (top figure), for others, significantly more iterations are required (center figure), and lastly, for some particular venues, the classification accuracy hardly benefits at all from using more posts (bottom figure). We refer to these three situations as quick-to-learn, slow-to-learn and hard-to-learn venues, respectively. Figure 6 gives a complete overview of the relative frequency of mentions of the chosen venue categories and their affiliation to the three slope types. The general tendency seems to be that frequently mentioned venues tend to be quicker to learn than rare ones, while hard-to-learn venues appear to be randomly spread across the observation frequency range.
Active resource selection
After having confirmed the intuitive assumption that (within the limits of our three slope types) more data results in more accurate predictions, we now proceed to describing an active selection scenario in which we expand the user model by the most informative posts according to our metric rather than random ones. To this end, we fix the novelty decay parameter such that after a word appears 5 times, its novelty becomes negligible. Table 4 highlights this method’s performance at different settings of $\alpha$. 50 posts are actively selected for this experiment, and we note that our selection scheme biased towards informativeness delivers significantly better performance.
Let us return to our previously introduced categorization of learning curve slope types. Table 5 shows the influence of $\alpha$ on the performance of the three slope categories. We observe clearly diverging tendencies between quick and slow-to-learn venues. While quick-to-learn venues benefit from low novelty contributions, their more slowly evolving counterparts benefit from novelty-biased informativeness scores. Examples can be seen in Figure 7. Again, hard-to-learn venue types do not show any noticeable response to different choices of $\alpha$, as long as the relevance component is not fully turned off.
Furthermore, for some particular venues, the classifier attained a better performance when using only 50 actively selected posts than when using the full timeline of the users. Some examples can be found in Table 6.
Regarding RQ 2, we show again that there is consistency in terms of content shared across OSNs; in particular, we show that it is possible to predict venue type visits based on what is shared on Twitter and Instagram. Furthermore, with respect to RQ 3, we show that using our designed metric, we can find the posts which are most relevant to predict venue type visits. In particular, using the active learning framework with our informativeness measure as selection criterion, we observe overall quicker learning rates, and in some cases a significantly reduced number of posts suffices to attain a classifier performance comparable to using the full timeline of the user.
| | Sushi Rest. | Cocktail Bar | Gastropub | Brewery | Nightclub |
|---|---|---|---|---|---|
| Truncated + Random Selection | 11.49 | 4.97 | 5.36 | 21.51 | 20.11 |
| Truncated + Active Selection | 61.72 | 55.84 | 54.79 | 66.15 | 54.23 |
In this paper, we studied privacy hazards pertaining to cross-platform social network usage. Individually innocuous posts can lead to leakage of critical information when aggregated along or across a user’s OSN profiles. We quantify this effect in two experiments: (1) uniquely identifying users in an anonymous pool and (2) predicting user properties manifested on one OSN platform based on content from other parallel profiles.
In the user de-anonymization task, we note that it is possible to match user profiles across OSNs based only on their textual data, with as little as 10% to 20% of the user’s full timeline. Furthermore, the inclusion of multiple OSNs as training data sources has been shown to improve the classifiers’ performance when the source of anonymous posts and training data are distinct. This suggests that there is a consistency in user language and vocabulary across OSNs.
In the information valuation task, we propose a general-purpose metric of textual informativeness in order to model the value of shared information items both for service providers (predictive power) as well as the user (potential privacy hazards). We show experimentally that the metric reflects the relative importance of posts with respect to the inference task being performed. When actively selecting a subset of posts per user, this method was always able to beat a random selection baseline. While choosing posts according to their relevance seems to lead to better performance in general, we noted that only for some venues there was a noticeable benefit in including a strong novelty component in the information scoring function.
This work focused on showing the privacy hazards that arise from sharing content which seems uninformative or harmless. In the future, we are excited to extend this line of work by a dedicated investigation of information valuation scores on the user side (e.g., of an OSN) as it would greatly help people understand their own digital footprint and enable them to recognize moments of critical information disclosure. Furthermore, part of this work focused on proposing a metric for information valuation with respect to an inference task. We are interested to extend this line of work by an in-vivo study of monetary efficiency of advertisers as a consequence of introducing an informativeness-aware resource selection scheme in their real-time bidding (RTB) pipelines.
- Average score for all classifiers is computed using the average of precision and recall across all classifiers.
- Ashish Agarwal, Kartik Hosanagar, and Michael D Smith. Location, location, location: An analysis of profitability of position in online advertising markets. Journal of marketing research, 48(6):1057–1073, 2011.
- Adam Berger. Statistical Machine Learning for Information Retrieval. PhD thesis, Pittsburgh, PA, USA, 2001. AAI3168516.
- Leo Breiman. Arcing classifier (with discussion and a rejoinder by the author). Ann. Statist., 26:801–849, 1998.
- Claudio Carpineto and Giovanni Romano. Semantic search log k-anonymization with generalized k-cores of query concept graph. In Advances in Information Retrieval, pages 110–121. Springer, 2013.
- Charles L.A. Clarke, Maheedhar Kolla, Gordon V. Cormack, Olga Vechtomova, Azin Ashkan, Stefan Büttcher, and Ian MacKinnon. Novelty and diversity in information retrieval evaluation. In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’08, pages 659–666, New York, NY, USA, 2008. ACM.
- Yves-Alexandre de Montjoye, César A Hidalgo, Michel Verleysen, and Vincent D Blondel. Unique in the crowd: The privacy bounds of human mobility. Scientific reports, 3, 2013.
- Cynthia Dwork. Differential privacy. In Encyclopedia of Cryptography and Security, pages 338–340. Springer, 2011.
- Maria Han Veiga and Carsten Eickhoff. A cross-platform collection of social network profiles. In Proceedings of the 39th SIGIR Conference on Research and Development in Information Retrieval. ACM, 2016.
- Weiyin Hong and James YL Thong. Internet privacy concerns: An integrated conceptualization and four empirical studies. MIS Quarterly, 37(1):275–298, 2013.
- Ronald A. Howard. Information value theory. IEEE Transactions on Systems Science and Cybernetics, 2(1):22–26, 1966.
- Mark J Keith, Samuel C Thompson, Joanne Hale, Paul Benjamin Lowry, and Chapman Greer. Information disclosure on mobile devices: Re-examining privacy calculus with actual user behavior. International journal of human-computer studies, 71(12):1163–1173, 2013.
- Michal Kosinski, David Stillwell, and Thore Graepel. Private traits and attributes are predictable from digital records of human behavior. Proceedings of the National Academy of Sciences of the United States of America (PNAS), 110(15):5802–5805, 2013.
- Ravi Kumar and Andrew Tomkins. A characterization of online browsing behavior. In Proceedings of the 19th International Conference on World Wide Web, WWW ’10, pages 561–570. ACM, 2010.
- David Lazer, Alex Pentland, and et al. Computational social science. Science, 323(5915):721–723, 2009.
- Wen Li, Carsten Eickhoff, and Arjen P. de Vries. Want a coffee?: predicting users’ trails. In William R. Hersh, Jamie Callan, Yoelle Maarek, and Mark Sanderson, editors, SIGIR, pages 1171–1172. ACM, 2012.
- Naresh K Malhotra, Sung S Kim, and James Agarwal. Internet users’ information privacy concerns (iuipc): The construct, the scale, and a causal model. Information Systems Research, 15(4):336–355, 2004.
- H Jeff Smith, Sandra J Milberg, and Sandra J Burke. Information privacy: measuring individuals’ concerns about organizational practices. MIS quarterly, pages 167–196, 1996.
- Kathy A Stewart and Albert H Segars. An empirical examination of the concern for information privacy instrument. Information Systems Research, 13(1):36–49, 2002.
- Robert West, Ryen W White, and Eric Horvitz. Here and there: Goals, activities, and predictions about location from geotagged queries. In Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval, pages 817–820. ACM, 2013.