You Tweet What You Eat: Studying Food Consumption Through Twitter
Food is an integral part of our lives, cultures, and well-being, and is of major interest to public health. The collection of daily nutritional data involves keeping detailed diaries or periodic surveys and is limited in scope and reach. Alternatively, social media is infamous for allowing its users to update the world on the minutiae of their daily lives, including their eating habits. In this work we examine the potential of Twitter to provide insight into US-wide dietary choices by linking the tweeted dining experiences of K users to their interests, demographics, and social networks. We validate our approach by relating the caloric values of the foods mentioned in the tweets to the state-wide obesity rates, achieving a Pearson correlation of across the 50 US states and the District of Columbia. We then build a model to predict county-wide obesity and diabetes statistics based on a combination of demographic variables and food names mentioned on Twitter. Our results show significant improvement over previous CHI research . We further link this data to societal and economic factors, such as education and income, illustrating that areas with higher education levels tweet about food that is significantly less caloric. Finally, we address the somewhat controversial issue of the social nature of obesity (Christakis & Fowler ) by inducing two social networks using mentions and reciprocal following relationships.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from firstname.lastname@example.org.
CHI 2015, April 18–23, 2015, Seoul, Republic of Korea.
Copyright © 2015 ACM 978-1-4503-3145-6/15/04 …$15.00.
H.3.1Content Analysis and Indexing \categoryH.3.5Online Information Services \categoryJ.3Life and Medical SciencesHealth
Food is a part of our daily lives that determines our well-being, health, and longevity. It is an important social activity and an expression of our culture and beliefs. The study of dietary habits is important for both cultural understanding and for monitoring public health. Heart disease, diabetes, osteoarthritis, and even cancer have all been linked to weight gain
Until now, large-scale dietary studies of food consumption used questionnaires and food diaries to keep track of the daily activities of their participants, which can be intrusive and expensive to conduct . Alternatively, social media is notorious for providing its users with a means of documenting the minutiae of their daily lives, including their dietary choices.
Can we use social media – and Twitter in particular – to get insights into dietary habits of an entire country? After all, tweeting “I’m having a sandwich” has become a classic example of the “pointless babble” commonly found on Twitter.
But is the data sensitive enough to pick out more personal dietary variations? To answer this question, we use demographic information from the US Census, extrapolate users’ gender using their screen names, and characterize users’ zip codes as being either urban or rural. We find gender differences among the Twitter users, with women generally tweeting about less caloric foods than men. Estimated education level also proves to be a significant factor, with fewer calories being mentioned in areas with higher education levels. Not only do we see a difference in the density of the foods mentioned by users, but we also find qualitative differences in their dietary selections. Alcoholic beverages tend to be mentioned in urban environments, whereas pizza and chocolate are popular in the rural ones. Users’ self-disclosed interests are also related to their diets, with those who mention an interest in cooking decreasing the chance of being obese by %, which is in line with research showing that lack of cooking at home gives rise to obesity .
Does social network influence personal dietary habits of users? Unlike traditional surveys that focus on isolated, randomly chosen individuals, social networks come with a network structure. The links of this network allow to analyze individuals not just in isolation, but in the context of their social circle. It has been suggested in  that social trends can spread in society much like diseases, resulting in a social infection. By inducing two kinds of social networks – one using reciprocal communication and another using following relationships – we show the assortativity in the dietary habits of users far beyond that which would be expected by chance.
Barring several limitations, which we outline in the Discussion section, we hope this study provides a case for the use of social media in public health monitoring in the dietary domain. We show that it is possible not only to detect indications of country-wide health trends, but zoom in on demographic and interest groups, potentially informing public health awareness campaigns.
2 Related Work
Unlike in the animal kingdom, as omnivores, humans make nutritional selection not only based on their physiological needs, but also based on their culture and identity. As Fischler  puts it, by selecting and cooking food, one “transfers nutritional raw materials from the state of Nature to the state of Culture”. The rules which are applied to food differ according to one’s nationality, gender, and age, and social circumstances of the meal dictate its content, timing, and atmosphere. Below we outline the latest attempts to track public health using social media, and most notably nutritional research, which has been mostly thus far focused on recipe websites.
Recently, Twitter has been used as a source of data for public health monitoring, such as for tracking flu-like symptoms [3, 9, 21], adverse side-effects of drugs , tobacco use , and county-level health statistics  . Using a text classifier, Sadilek & Kautz  detect tweets which mention the user being sick. They find that the higher social status of the users, the better their health, with poverty, education, and race (originating from the census data) explaining 8.7% of the variation in observed health. The most predictive variables were proximity to polluted sites and encounters with sick individuals. Culotta  uses well-defined lexicons such as LIWC and some demographic variables about users to predict county-wide health statistics (e.g., Obesity and Diabetes) of the top 100 most populous counties in the US. Prier et al.  use LDA to find topics related to tobacco, such as addiction recovery, other drug use, and anti-smoking campaigns. Paul & Dredze  apply an Ailment Topic Aspect Model to tweets to discover mentions of various ailments, including allergies, obesity, and insomnia. More generally, life satisfaction has been mined from Twitter by Schwartz & Eichstaedt et al. , who evaluated their approach using phone survey data. Using LDA, they find word topics which correlate with demographics and socio-economic status, and provide insights into the sources of well-being, such as donating money and having rewarding jobs. These efforts, including those utilizing other social media websites like Craigslist , have aimed to augment the current data collection practices, making them faster, cheaper, and potentially more accurate.
In this paper we focus on the dietary choices of a large population of social media users. Culture-specific ingredient connections have been discovered by Ahn et al.  who mine recipes to create a “flavor network”. Temporal nature of food consumption has been explored by West et al. , who mine logs of recipe-related queries. Using Fourier transforms, they illustrate the yearly and weekly periodicity in food density of the accessed recipes, with different trends in Southern and Northern hemispheres, suggesting a link between food selection and climate. Focusing on users who decided to go on a diet (as signified by them adding a book on dieting to their shopping cart), authors show the dip in caloric value per serving of the recipes users search for, and a gradual return to the pre-diet levels. Geographical distribution of food has been explored by Zhu et al. , who, unlike , find climate (operationalized using temperature) to have little correlation with ingredient use, while finding geographical proximity to be a key factor in shaping regional cuisines. A recent work by Wagner et al.  on german-language recipe site shows similar negative relation between recipes and geographic distance of their seekers. We also find caloric content of foods mentioned in text a useful quantification of dietary selections, but unlike these studies, we relate it to public health statistics in order to validate its use. Furthermore, the nature of social media, unlike the recipe search logs, allows us to enrich our data with information on user demographics, interests and social network. In particular, our work on user interests echoes that in , where interests of Facebook users were found to be related to their BMI. However, we take a more general approach to detecting user interests, using the network of the Twitter users.
Social nature of obesity has been hypothesized by Christakis & Fowler , who tracked a densely interconnected social network of people across years. They found a person’s chances of becoming obese increased by % if he or she had a friend who became obese in a given interval. These effects were not seen among neighbors in the immediate geographic location, emphasizing the importance of social ties. They provide three explanations for the collective dynamics of obesity: homophily which is a tendency of people to associate with people who are similar to them, confounding which occurs when people share attributes or jointly experience events, and induction which refers to a person-to-person spread of behaviors and traits. These findings have been contested, however, by Cohen-Cole & Fletcher , who claim “social network effect” becomes negligible once “standard econometric techniques are implemented”. Recently, Silva et al. , who use Foursquare checkins to gauge food culture similarity between geographical locales, also found that countries closer in geographic proximity are not necessarily similar in their check-in behavior. In this work we examine both demographic and social aspects of food tweeting behavior, and provide some support to the social affinity that is not local in geographic sense.
We begin by collecting 50M tweets through the Twitter Streaming
We take several bootstrapping steps in order to improve our detection of
food-related tweets and to extend our list of keywords. First, we examine the
most frequent terms used in the tweets detected by the initial list,
hand-select other terms unambiguously related to food. This new “food”
filter was now applied to the users’ historical tweets. We then label a subset
of tweets detected using these two filters, as well as a sample of those which
were not thus far identified as food-related, in order to train a classifier.
Finally, we select
most popular terms in the tweets the classifier deems to be on food-related
topic and further enrich them with nutritional information – mainly
calories per serving. To estimate this value, we search a nutritional
In terms of food twitting activity, we detect weekly periodicity, as well as some major holidays: Thanksgiving, Fourth of July, and Valentine’s Day. Weekly periods spike on Saturday and are at lowest on Mondays.
Now, we can use this list to perform longest n-gram matching to detect the foods in the tweet text, and aggregate their caloric content. Upon manual examination of 800 tweets containing these foods, we found 70% of tweets to mention some food, and out of these 63% either explicitly or implicitly being about food consumption and another 12.5% wishing for a food. Most false matches happened when the foods were used in figurative sense (“they scattered like fish”) or referring to others (“chocolate man”), though many of these references may still reveal cultural relationships with food, such as reference to Thanksgiving as “turkey day”. The most frequently mentioned foods in our dataset are pizza, chocolate, chicken, ice cream, and apple. Among the top drinks are coffee, beer, wine, and tea. We then get the food that most distinguishes a given state, by first considering the top most popular foods in each state, and computing the difference in probability of each word from that of appearing in the overall food-related corpus – a technique similar to feature selection for a binary classification – to find the term most likely to be found in tweets of one state, and not the others. We find local peculiarities like California wine, Florida orange, Maryland crab, and Alaskan salmon. We also find possible erroneous matchings, as in the case of New York apple (from “Big Apple”) and Missouri arrowhead. These special cases, having particular prevalence in specific localities, attest the difficulty of identifying the proper context in a limited-length text of tweets. We leave further tuning of our food lexicon for future work.
4 State-level correlations
Although the cyclical nature of food consumption and major holidays can be
detected in this data, there is still a concern whether the data sampling is
representative of US population or, at least, useful to detect
differences in food consumption. Thus, we correlate the caloric values
foods mentioned in tweets per each state to the obesity
|Significance: ***, **, *|
Table 1 shows Pearson product-moment correlation and Spearman rank correlation coefficient between the average caloric density of tweets and these health statistics across the US states (including Washington DC). For each tweet, we use exact string matching to identify the foods (many of which would result in erroneous matches otherwise) and, if more than one is found, average their caloric value. In all we consider all entries, we also differentiate between solid foods and non-alcoholic and alcoholic beverages. The correlation is the highest when we consider all foods, with the Pearson correlation of with obesity and with diabetes. For both ailments, beverage caloric value alone has higher correlations than solid food alone. However, the importance of alcoholic beverages differs drastically, with being somewhat correlated with obesity at and having no statistically significant relationship with diabetes. The reasons for this differentiation may be physiological, but also cultural. In the next section we also illustrate the association of alcohol with urban locales.
We further explore the relationship between obesity and caloric value of the
mentioned foods in Figure 1, where we color the
according to their geographic region. The grey dashed line shows the linear
regression line. We find the Southern states to be in the upper right corner,
with Louisiana (LA) and Arkansas (AR) in the extreme right.
5 County-wide model fitting
Despite the inherent sparsity problem observed at county-level as compared to the state level, we found a Pearson correlation of the caloric value of all foods with obesity at and diabetes for counties with at least users. For counties with at least users (), the correlations were even better with for obesity and for diabetes. Encouraged by this results, we wanted to investigate further the extent to which food mentions could capture county-wide health signals such as obesity and diabetes. To this end, we designed an experiment that compares our “food mention” model to the one presented in . Similarly to the paper we train a regression model using different kinds of variables to predict obesity and diabetes rates at county level.
Culotta’s model  uses lexicon categories along with a selection of demographic variables
to predict county-level health statistics of the top 100 most populous counties in the US.
Their experimental study revealed a strong predictive accuracy of demographic variables which can
be improved if combined with linguistic variables derived from LIWC
To enable the comparison with Culotta’s model, we first build a user model with the following variables:
LIWC categories: For each user, we compute a binary vector of LIWC categories extracted from her Twitter profile. A LIWC category (e.g., Social, Family) takes the value of 1 if the user mentions at least one word in her profile that belongs to the category, 0 otherwise.
Food names: For each user, we compute a binary vector of Food names mentioned in her tweets. We use our handcrafted dictionary of food names and apply an exact matching to the tweets.
avgCal: this variable reflects the average caloric value computed across all food names mentioned by the user in her tweets.
Demographic variables: Each user is assigned a list of five demographic variables derived from census data related to the county to which she belongs. These variables are: Under_18 (proportion of people under the age of 18), Over_65 (proportion of people above the age of 65) , Female (proportion of females), Afro-Hispanic (proportion of Afro-American and Hispani), and Income (log of median household annual income).
Next, we aggregate user models at the level of counties. We generate for each county, a vector of LIWC categories (64), food names (461), and demographic variables (5). The weight of a given food name (resp. LIWC category) reflects the proportion of users in that county who mentioned that food name (resp. LIWC category). Note that unlike Culotta’s work where only the top 100 most populous counties (based on census data) are considered, we have retained all counties with at least 100 users (346 counties).
We run a series of regressions using various models (features) to predict obesity and diabetes scores of the 346 counties. We consider six models: Demog (demographic variables), Liwc (LIWC categories), Calories ( variable), Food (food name variables), Liwc-Demog (LIWC categories and demographic variables), and Food-Demog (food names and demographic variables). For each model, we use a five fold cross-validation to assess the generalization of its accuracy. Folds are selected in a way that prevents counties from the same state to appead simultaneousely in both training and test sets. Finally, we use Ridge regression in order to reduce overfitting, espacially in models with large number of variables (e.g., Food (461)).
Figure 2 shows the Pearson’s correlations (which are all statistically significant) achieved by different models along with their corresponding SEM (standard error of the mean) scores. Recall that the SEM score is equal to the corrected standard deviation of the sample divided by the root square of the size of that sample. Food-Demog model, which combines food names and demographic variables, is found to outperform Culotta’s model Liwc-Demog which combines LIWC linguistic variables with Census demographics. If fact, Food-Demog achieves held-out correlations of 0.775 for obesity and 0.804 for diabetes while Liwc-Demog achieves respectively 0.679 and 0.708 for the two health statistics. Surprisingly, the simple Food model based on our handcrafted dictionary of food names significantly outperforms the Liwc model. Also, we notice that Calories model which has only one variable achieves reasonably good correlation scores (0.450 for obesity and 0.398 for diabetes) compared to Liwc (0.470 for obesity, 0.380 for diabetes) which has 64 variables. Note that we also tried a baseline using top 1,000 most frequent hashtags in our dataset and find very low correlations with the county-wide obesity and diabetes rates of around 0.06. Finally, we built another baseline that uses generic tweet statistics, including number of tweets, retweets, number of replies, and number of hashtags. This baseline achieved comparable correlations (0.38 for obesity and 0.44 for diabetes) to those of the “Calories” model. \note[Yelena]Discuss why this generic model may be performing so well.
Finally, we observe that Food model significantly outperforms – the aggregated – Calories model for the prediction of both obesity and diabetes. Yet, Calories model has several advantages. First, Calories model is much simpler than all the other models as it relies on a single variable () and does not require any model fitting. Second, the model is easily interpretable and arguably closer to the “root cause” of obesity and diabetes. Third, the caloric value of food has shown very strong correlations at the state level. Next, we use Calories model and the best-performing model Food-Demog to zoom in to the personal, user level of analysis.
6 Characterizing Users
6.1 Income, Education and Gender
The effect of obesity in US has been shown to vary according to income, education, and gender. According to CDC, whereas among men, obesity prevalence is generally similar at all income levels, higher income women are less likely to be obese than their low income counterparts
We supplement this with a per-user gender classification using Genderize API
Figures 3 shows an exploration of the dimensions of gender, educational attainment (likelihood of a Bachelor’s degree) and median household income. Two variables are plotted: the likelihood of obesity (estimated using the model built in the previous section) the average caloric values of the tweets produced by each user. We find that although the trends differ in for the genders in the kind of heaviness of food they tweet, the distinction is not present in the obesity rates. It is understandable, given gender-mixed populations on which the model was trained. We witness a stronger negative correlation for educational level with both the caloric values and the estimated obesity rates. These trends echo those found by USDA, which finds obesity prevalence increase as education decreases (though more so for women).
Next, we mapped the zip codes to major metropolitan areas (top US cities by population) and labeled all of these as “urban”. We find ( of all) users to be in these urban areas. Although one may suppose eating at restaurants (in urban areas) may result in tweets mentioning more caloric food, we find quite the opposite. There is a significant (at ) distinction between the two populations, with users in rural areas having an average of calories in their tweets, and urban .
The above distinctions are identifiable not only in the caloric value of the foods mentioned in the tweets, but in the foods themselves. Figure 4 shows the distinguishing foods between users in rural and urban areas, computed using probability differences such that the foods which are popular in one area but not in another get a higher score. Urban food vocabulary distinguishes itself with alcoholic drinks (wine and beer), and more uncommon foods like avocado and crab, whereas the rural cuisine emphasizes pizza and common deserts like chocolate and ice cream.
There are, of course, other factors which contribute to one’s diet. For instance, people who are interested in healthy eating and organic food, or those with families would consciously alter their diets. To detect individuals with such interests, we compiled a high-precision keyword lists to detect users interested in, or at least mentioning cooking, dieting, organic food, and health. We also detected users who mentioned being a family member (being a father, mother, having kids, etc.). These filters were then applied to user profiles.
The above method can be used to detect interests users feel comfortable declaring in their profile, however not all user characteristics can be extracted. We take a different approach to detect users who declare themselves to be overweight by the use of hash tags such as #fatgirlproblems or #fatguyproblems. These are typically used by users in self-reference, for example “I have more pictures of food than I do selfies .. #fatgirlproblems” Doing this, we detected users using at least one such tag.
Table 3 shows the difference in estimated obesity of population for which a variable has been detected (say, mentioning sports) from that where the variable was not detected, as well as the number of instances found for the class. Thus, if the difference is positive, the class has a higher chance of obesity by that many percentage points. Note that for this experiment only users whose names resulted in an identified gender were used (totaling in data points). We find that people who are interested in cooking and organic food decrease their likelihood of obesity by and percentage points, respectively. Although our on diet keyword filter produced a low recall – of only users – the chance of decrease is near %. The detection of #fatproblems hashtags resulted in matches, but we see only a moderate (although statistically significant) positive change in the obesity rate.
|***, **, *|
We go further and identify fine-grained interests as in . Using WeFollow
Our findings partially confirm a previous study on Facebook interests of users in US and New York metropolitan area  in which it was found that a greater proportion of the population with interest in television was associated with higher prevalence of obesity. However, our observations of interests in sports is less straightforward, with sports and NFL categories positively relating to obesity, whereas the Facebook study finds activity-related interests to be associated associated with a lower predicted prevalence of obese and/or overweight people. The separation between watching and participating in sports would shed more light on this discrepancy.
7 Social Nature of Food
Social circumstances play an important role in how we consume our food, and which food we consume. We attempt to discern the relationship between social interactions, expressed in the text of the tweet and in the follower network of the user, and users’ eating behavior (operationalized by predicted probabilities of obesity and diabetes, frequency of food mentioning, and caloric value of tweets). We first study the impact of two types of relationship networks on the obesity and diabetes scores of users. Then, we use a threshold model to quantify the circumstances under which a user would get exposed to higher obesity and diabetes risks. Finally, we look at the influence of friends at varying degrees of closeness.
7.1 User-level obesity, diabetes and food frequency
In this section, we focus on the impact of social relationships on obesity, diabetes and the amount of food related tweets. For each user, we compute the fraction of tweets citing at least one food name. To estimate the obesity and diabetes scores at individual levels, we use the Food model that relies on food names only. For training a model, each user is assigned the obesity and diabetes rates of their county. A Ridge regression is used to learn the models and predict individual user scores. Predicted scores could be seen as a user’s risk level of obesity and diabetes, based on the food mentioned in their tweets.
Note that for the following analysis we deliberately discarded demographic variables. Friends on Twitter are also likely to live in close proximity  and hence are likely to share similar estimates of demographic variables, based on geographic census data. So friends are likely to be similar along this dimension. However, we are more interested in observing the connection between social network closeness and propensity to share “unhealthy tweeting behavior”, such as the use of food names associated with higher obesity levels.
7.2 User Networks
We explore the social nature of eating habits by constructing two social networks, namely Friendship network (FN) and Mention network (MN). The Friendship network relates to the structural aspect of Twitter. In FN, two users are considered to be “friends” only if they follow each other. This definition follows the principal of mutual reachability introduced by Xie et al.  to identify real-life friends from Twitter. Alternatively, Mention network relates to the behavioral aspect of users. Here, a user (mentioner) has a link to user (mentionee) only if has mentioned in at least one tweet.
Note that users without gender have been removed from both MN and FN, and only links between users with a known gender (i.e. male or female) in the initial set of US users are considered. Table 4 provides some statistics on the two networks.
|Network||# Links||# Users|
7.3 Obesity and Diabetes Activation (Spread)
Focusing on obesity, we ask to which extend does being connected to users with a high likelihood of obesity and diabetes (based on the food names they tweet) increases the likelihood of a user for these two health issues? A threshold model postulates that a success or failure of a social diffusion process depends on the reaching of a certain critical number of adopters . This model has been used, for example, to model the adoption of a particular notation for source attribution on Twitter . Following the idea of “social activation”, we label users beyond the percentile (i.e. the top 10%) in terms of obesity and/or diabetes scores as “active” users. Then, for every user we calculate the number of active users to whom they are connected in both friendship and mention networks. Finally, we compute the activation probability (i.e., the probability of being an active user) given that a user is connected to active users.
Figure 5 shows the activation probability scores of users as a function of the number of their active friends. As expected, the probability of showing a strongly increased probability of obesity, estimated by the food names mentioned, increases as the number of friends with high obesity scores increases. This increasing trend is particularly pronounced for up to four active friends. Though the standard errors of the computed probabilities increase dramatically beyond this point, with only few users having more than five active friends, there seems to be a plateau effect. The same behaviour is observed in the activation graph of diabetes.
To alleviate the influence of content spread phenomena in which people may tweet about a food name just because their friends tweeted about it, we removed all tweets that are ”replies to” or ”retweets of” other tweets. We trained a new Food model and run our activation algorithm which resulted in the same activation curve as Figure 5. Thus, although we cannot exclude offline motivational effects, excluding them from data does not affect our results. We also investigate the effect of geography to make sure that the activation process is not the result of people living in the same region as their friends. We removed from the social graph all friendship links between users living in the same US state. We observed a small drop in activation probability scores, yet the increasing trend function of the number of active friends was still there.
7.4 Clique-ness Analysis
Another way to assess the influence of social connections on users’ dietary habits is to take into consideration the strength of user connections. Here, we define the link strength between any pair of users having an edge in the mention network or the friendship network as the fraction of their common friends in the same network. As mentioned in the dataset description (section Data), we have collected up to friends for each of the K US users, leading to a user-friend bipartite graph with 180M links. Then, we use Jaccard index to compute the similarity of the friend sets between all pairs of users in both networks. Based on the computed scores, users and their links are assigned to different bins corresponding to different Jaccard index intervals. As expected, the distribution of Jaccard scores is heavily skewed. For instance, more than 71.6% of the total number of links have a strength score lower that in the mention network.
In order to check whether tightly connected friends share a higher degree of content similarity, we correlate the fraction of food tweets of each user to that of their friends within each bin. Figure 6 shows the bins from most dissimilar users () to those most similar in terms of ego-network overlap. Note that the intervals are not normalized due to the highly skewed distribution of the data, with the last bin of having only users in Mentions and in Friendship networks. We witness an increase in the correlation of the food-related tweet fraction of users and their friends as network overlap increases until the last two buckets – those with users sharing the most friends.
Upon manual examination, we find highly related users to be in the same social locales, such as students in the same university or school. The phenomena may also be related to the shape of the exposure curve, as proposed by Romero et al. , which models the rise of initial interest as the number of exposures increases, and the decrease after over-exposure to a phenomenon.
8 Discussion & Future Work
Using data from social media to study sociological phenomena and to “nowcast” variables such as obesity rates has both advantages and disadvantages compared to traditional survey-based methods. Two of the most commonly given advantages include cost and timeliness. Data for hundreds of thousands of users can be obtained at relatively low cost, and this data can be collected, aggregated and analyzed in a matter of days or even hours, rather than weeks or months for comparable surveys. However, in addition to these two quantitative differences, there is also an important qualititative advantage: the availability of network information. Having access to even a subset of people that a given person interacts with makes it possible to look for evidence of social mechanisms such as homophily. In the health domain, having data on the strength of such effects makes it possible to consider social interventions. Rather than trying to change unhealthy behavior of individuals by providing information and incentives derived solely from their own behavior, information about their friends’ physical activity or healthy dietary choices could be provided. Similarly, altruistic incentives of the type “Run a total of 30km this week and your friend @JohnDoe has a chance to win an iPhone” could be based on automatically derived, intricate knowledge of a user’s social circle. Or if a set of friends are planning a physical activity, they may be encouraged to invite others from their social circle.
The availability of rich data related to hobbies and interests is also a potential advantage for studies using social media. Public health initiatives targeting various segments of population already exist, such as UK’s Change 4 Life (http://www.nhs.uk/change4life) targeting parents and USDA’s MyPlate on Campus for university students (http://www.choosemyplate.gov/MyPlateOnCampus). In our study, we link users’ obesity likelihood, inferred by a model combining demographic estimates with food names mentioned on Twitter, to their interests, such as TV shows, sports, and movies. The relationship between each of these activities and food would help target population segments which are especially at risk, and maximize the return on media advertising expenditures. But it is our vision that automated tools will help provide personalized messages to social media users, providing context-aware, real-time information, suggestions, and motivation. For instance, if a user tweets about an intention to exercise, links to tutorials and videos can be suggested. The development of these approaches first requires automated assessment of a user’s dietary behavior, and this paper takes first steps in that direction.
Of course, sociological studies which use social media also have significant drawbacks. Typically, such studies suffer from a user sampling bias with an over-representation of affluent and tech-savvy demographic groups. This bias can be more pronounced when only users with GPS-enabled devices are considered.
Indeed, the users in our subset come from neighborhoods with average household income at , well above the US average of in
It is an advantage of big data analysis that high-precision methods, such as hand-crafted keyword filters, can be effectively applied to glimpse a phenomena of interest. Yet, such approaches are not robust under temporal changes in the vocabulary, and may suffer from low recall. For example, the #fatproblems hash tags are a convention for people to admit (often humorously) to feeling overweight. However, there may be many more subtle (and more serious) ways one can detect self-image expressions. Being able to identify with high accuracy users that are overweight would allow a more fine-grained validation of our techniques, in addition to the state-level validation that we currently focus on.
Despite the significant relationship between food mentions and their caloric value in the tweets and health problems, one needs to be careful not to assume that the user consumed every food about which they tweeted. In fact, it is difficult to extrapolate whether the tweet is about an actual dining experience, even if we detect a mention of food. A crowd-sourced effort to label a training set of tweets for detecting dining experiences has shown the task to be difficult, with user agreement at 78%, and the resulting trained classifier producing noisy output. We leave determining the exact nature of the food mention in a tweet to future work.
Another concern is whether the tweets are about foods not normally eaten by their writers, that is, written on special occasions or on special topics. Thus, we checked the intervals at which the users in our dataset tweet, finding the top 1% tweeting at a median of 18.8 food tweets per week, and the median overall to be at 1.2 food tweets per week. This suggests that our users at least do not tweet just about their birthday cake, but are actively engaged on a weekly basis. Although in our analysis we normalize our data per user, we still may have unusual very-active users who tweet differently than the rest. Thus, we check the overlap between the top mentioned food between the top 1% and the bottom 50%, and find 43 out of 50 foods overlap. Finally, even the top foods we find are every-day ones, like pizza, coffee, and chicken. These findings encourage us that we capture at least in part the day-to-day dietary habits of Twitter users.
Our analysis describes correlations, not causations. However, we believe that insights gained from this type of analysis are required before deciding where to drill deeper through, ideally, controlled experiments. Targeting Twitter users with a particular behavior could also be a promising step for interventions such as public health awareness campaigns.
Finally, active promotion of dietary habits were studied by Yom-Tov et al. , who track the dissemination of pro- and anti-anorexia photos on Flickr and the emerging social networks. They find the two groups to interact mostly within each respective community, but for pro-recovery group to tag their content with terms which would ensure their content is visible to pro-anorexia users. Such fine-grained analysis is beyond the scope of our current study, but the direct social influence in terms of verbal and non-verbal interactions is an enticing future direction of this research.
In this paper we describe a large-scale study of the Twittersphere, as it allows us to monitor US-wide nutritional behavior. We show that the foods mentioned in the daily tweets of users are predictive of the national obesity and diabetes statistics, with values of and across the 50 US states and the District of Columbia. We show how the calories tweeted are linked to user interest and demographic indicators, and that users sharing more friends are more likely to display a similar interest toward food. More needs to be done to develop sensitive and accurate tools for user characterization, with both textual and social network information available. As a documentation of users’ interests, opinions, and behaviors, this study is another example of the potential Twitter has for public health research.
- A 2009 study found that 40% of all tweets fell under the label of “pointless babble”, see http://www.pearanalytics.com/blog/2009/twitter-study-reveals-interesting-results-40- percent-pointless-babble/. This study is feeding, pun intended, on a lot of such tweets.
- See http://content.time.com/time/health/article/0,8599,1909406,00.html on why that may be the case.
- Abisheva, A., Garimella, V. R. K., Garcia, D., and Weber, I. Who watches (and shares) what on youtube? and when?: using twitter to understand youtube viewership. In WSDM (2014).
- Ahn, Y.-Y., Ahnert, S. E., Bagrow, J. P., and Barabási, A.-L. Flavor network and the principles of food pairing. Nature Scientific reports (2011).
- Aramaki, E., Maskawa, S., and Morita, M. Twitter catches the flu: Detecting influenza epidemics using twitter. In EMNLP (2011).
- Bian, J., Topaloglu, U., and Yu, F. Towards large-scale twitter mining for drug-related adverse events. In SHB (2012).
- Cawley, J., and Meyerhoefer, C. The medical care costs of obesity: an instrumental variables approach. Journal of health economics 31, 1 (2012), 219–230.
- Christakis, N. A., and Fowler, J. H. The spread of obesity in a large social network over 32 years. New England journal of medicine 357, 4 (2007), 370–379.
- Chunara, R., Bouton, L., Ayers, J. W., and Brownstein, J. S. Assessing the online social environment for surveillance of obesity prevalence. PloS one 8, 4 (2013), e61373.
- Cohen-Cole, E., and Fletcher, J. M. Is obesity contagious? social networks vs. environmental factors in the obesity epidemic. Journal of Health Economics 27, 5 (2008).
- Culotta, A. Lightweight methods to estimate influenza rates and alcohol sales volume from twitter messages. Language Resources and Evaluation (2013), 1–22.
- Culotta, A. Estimating county health statistics with twitter. In CHI (2014).
- Cutler, D. M., Glaeser, E. L., and Shapiro, J. M. Why have americans become more obese? Journal of Economic Perspectives 17 (2003), 93–118.
- de Castro, J. M. Seasonal rhythms of human nutrient intake and meal pattern. Physiology & behavior 50, 1 (1991), 243–248.
- Donohew, L., Ray, E., and Donohew, L. Public health campaigns: Individual message strategies and a model. Communication and health: Systems and applications (1990).
- Fischler, C. Food, self and identity. Social science information 27, 2 (1988), 275–292.
- Gershenson, C., Gasca-González, O. O., Delgado-Reyes, L., Pérez-Cruz, J. C., Orozco-Gómez, L. P., Hernández-Salazar, L., Moguel-Ancheita, S., Ramírez-Moreno, M. A., Morales-Cruz, M. V., Cortés-Gastélum, M. A., et al. Epidemiology and social networks. Cir Cir 79 (2011), 199–200.
- Haimson, O. L., Brubaker, J. R., and Hayes, G. R. Ddfseeks same: Sexual health-related language in online personal ads for men who have sex with men. In CHI (2014).
- Kooti, F., Mason, W. A., Gummadi, K. P., and Cha, M. Predicting emerging social conventions in online social networks. In CIKM (2012).
- Paul, M. J., and Dredze, M. You are what you tweet: Analyzing twitter for public health. In ICWSM (2011).
- Prier, K. W., Smith, M. S., Giraud-Carrier, C., and Hanson, C. L. Identifying health-related topics on twitter. In Social computing, behavioral-cultural modeling and prediction. Springer, 2011, 18–25.
- Romero, D. M., Meeder, B., and Kleinberg, J. Differences in the mechanics of information diffusion across topics: idioms, political hashtags, and complex contagion on twitter. In WWW (2011).
- Sadilek, A., and Kautz, H. Modeling the impact of lifestyle on health at scale. In WSDM (2013).
- Schwartz, H. A., Eichstaedt, J. C., Kern, M. L., Dziurzynski, L., Agrawal, M., Park, G. J., Lakshmikanth, S. K., Jha, S., Seligman, M. E., Ungar, L., et al. Characterizing geographic variation in well-being using tweets. In ICWSM (2013).
- Silva, T. H., Vaz de Melo, P. O. S., Almeida, J., Musolesi, M., and Loureiro, A. You Are What You Eat (and Drink): Identifying Cultural Boundaries by Analyzing Food and Drink Habits in Foursquare. In ICWSM (2014).
- Takhteyev, Y., Gruzd, A., and Wellman, B. Geography of twitter networks. Social networks 34, 1 (2012), 73–81.
- Valente, T. W. Social network thresholds in the diffusion of innovations. Social networks 18, 1 (1996), 69–89.
- Wagner, C., Singer, P., and Strohmaier, M. Spatial and Temporal Patterns of Online Food Preferences. In WWW (Republic and Canton of Geneva, Switzerland, 2014).
- West, R., White, R. W., and Horvitz, E. From cookies to cooks: insights on dietary patterns via analysis of web usage logs. In WWW (2013).
- Xie, W., Li, C., Zhu, F., Lim, E.-P., and Gong, X. When a friend in twitter is a friend in life. In WebSci (2012).
- Yom-Tov, E., Fernandez-Luque, L., Weber, I., and Crain, S. P. Pro-anorexia and pro-recovery photo sharing: a tale of two warring tribes. Journal of medical Internet research (2012).
- Zhu, Y.-X., Huang, J., Zhang, Z.-K., Zhang, Q.-M., Zhou, T., and Ahn, Y.-Y. Geography and similarity of regional cuisines in china. PloS one 8, 11 (2013), e79161.