Role of Temporal Diversity in Inferring Social Ties Based on Spatio-Temporal Data
The last two decades have seen a tremendous surge in research on social networks and their implications. These studies include inferring social relationships, which in turn have been used for targeted advertising, recommendations, search customization, etc. However, the offline experiences of humans, the conversations and face-to-face interactions that govern our lives, have received less attention. We introduce the DAIICT Spatio-Temporal Social Network (DSSN), a spatiotemporal dataset of 0.7 million data points of continuous location data logged at 2-minute intervals by the mobile phones of 46 subjects.
Our research focuses on inferring relationship strength between students from the spatiotemporal data and comparing the results with self-reported data. In that pursuit we introduce Temporal Diversity, which we show contributes more to predicting relationship strength than its counterparts. We also explore how Temporal Diversity evolves with time.
Our rich dataset opens various other avenues of research that require fine-grained location data with bounded movement of participants within a limited geographical area. The advantage of having a bounded geographical area such as a university campus is that it provides us with a microcosm of the real world, where each such geographic zone has an internal context and function and a high percentage of mobility is governed by schedules and time-tables. The bounded geographical region in addition to the age homogeneous population gives us a minute look into the active internal socialization of students in a university.
The emergence of new technologies that enable the collection of geospatial and temporal data, along with the explosive growth of research on online social networks and their implications, has paved the way for research on the offline experiences of humans: the social behaviour and movement of people in the real world. Online social networking sites facilitated easy collection of data and social graphs, which motivated a large number of studies: from analyzing the structure of networks and identifying the most influential people in a network, to predictive models for inferring social connections and evolution studies of communities in social graphs.
However, the properties of online social networks may not necessarily apply to the real world. Even the social ties projected in the online world may not capture the social ties that exist in reality. Online social networks are curated and rather embellished versions of real-life mobility and social interactions. It is therefore imperative to bridge the gap between the amount of research on online social networks and that on real lives. This brings forth the need to examine real-world social networks and individual mobility patterns. Such real-world mobility data can be collected through various online services such as geo-tagged tweets, check-ins from Foursquare and Facebook, or data from mobile apps such as WhatsApp. Since this data is collected from people who visit certain places at a certain time, the properties inferred from it are applicable to the real world as opposed to online social networks.
There are certain disadvantages of collecting data from online services. First, the data can be highly irregular, depending on whether the user checked in at particular places, whether the application was running at that time, whether the GPS was switched on by the user, etc. Second, it does not track the fine-grained movements of the user at regular intervals. Third, the check-in location is often known, but the duration spent at the location may remain unknown. Finally, the number of check-in locations visited by the user is unbounded, and it is therefore relatively hard to attach a context to every location.
We attempt to demonstrate the power of collecting fine-grained behavioral social network data from the mobile phones of users. We introduce the DAIICT Spatio-Temporal Social Network (DSSN), a spatiotemporal dataset which addresses the above challenges and produces fine-grained data, with each location visited by the users recorded at short, regular intervals. The dataset is complemented by an extensive survey of the participants in which demographic, sociability and, most importantly, ground truth data about social ties with other participants is collected. We then turn to experiments on one of the possible lines of research with this dataset: finding the inter-relationship between reported friendship and the time spent together, introducing Temporal Diversity as a feature to infer the strength of friendship between two individuals, and comparing its performance to measures of diversity in previous works. We believe Temporal Diversity can be beneficial in environments where mobility is governed by schedules and time-tables, much like in real life. We also explore how Temporal Diversity evolves with time. Finally, we discuss the research avenues that can be pursued using such a fine-grained spatiotemporal dataset.
2 DAIICT Spatio-Temporal Social Network (DSSN) Dataset
2.1 Subject pool
The subjects of this study were students pursuing B-Tech ICT at DAIICT (Dhirubhai Ambani Institute of Information and Communication Technology), a university located in Gujarat, India. It has a residential campus which spans 60 acres and houses approximately 1,500 students. This choice of location provided several advantages.
The movements of the subjects were largely encapsulated within the geographical boundary of the campus. This provided us with a finite set of locations that are visited by the subjects at different times.
Each location inside the campus has an inherent function attached to it. For example, the library is used for reading purposes while the sports center is used for recreational purposes.
The subjects of the study, who are pursuing the B-Tech ICT program, are required to reside within the campus. The in-campus residence, along with the distinct designated buildings for recreation, academics, etc. inside the campus, makes it an interesting and reliable microcosm of real-world activities, movements and social behaviour.
The data was collected from 46 subjects between the months of March and May, 2016. For this paper’s analyses, we used a subset of the data collected during the month of April, 2016. Out of the 46 subjects that participated in the study, 36 of them completed the survey conducted in July, 2016. The subjects volunteered to become part of the experiment.
2.2 Important Statistics
For each user, the timestamp, latitude and longitude, elevation, accuracy, satellites and network provider are recorded. The total number of data points collected is 733,403. The total number of data points within the month of April is 659,268 (this subset of data is used for the following analysis). The total number of subjects using the application to record data is 46. The recorded data varies in accuracy, with an average accuracy of 36.0 meters.
2.3 GPSLogger Software for Data Collection
The data was collected from the Android-based mobile phones of the subjects. The subjects installed the GPS-Logger app, which is available on the Google Play Store (https://play.google.com/store/apps/details?id=com.crearo.gpslogger). The application uses the GPS capabilities of the mobile phones to log co-ordinates and runs as a background process at all times.
The following functionalities are present in the application:
Data is logged from the mobile phones of subjects to local storage at a regular interval of 2 minutes. Since the application is only programmed to log location data, the privacy invasion is minimal compared to previous approaches which log voice calls, messages, active applications, the phone's charging status, etc.
The data is sent to the server periodically (every 2 hours) if the mobile phone is connected to the internet. If not, the data file is pushed to a queue and resent at a later time.
The subjects cannot switch off the application (without “Force Kill” or un-installing the application). If the subject attempts to do so, the application restarts in 30 minutes.
The battery consumption is optimized to last as long as possible.
It is a user-friendly application where the subject only has to press “start logging” after installing the application.
Since the application automatically restarts following any crash, data losses mainly occur due to powered-off devices. The application can be assumed to be running while the phone is powered on. However, the accuracy of the generated dataset relies on the strength of the GPS signal captured, which varies with the location as well as the hardware of the phone. We therefore have to deal with the accuracy loss and random chunks of missing data that are characteristic of any real-time GPS data collection.
An anonymized version of the dataset can be downloaded at: (https://github.com/deshanadesai/Geospat).
2.4 Ground Truth: Self Reported Survey data
We conducted an online survey for the subjects who participated in the DSSN data collection. The survey is detailed and focuses on questions to report strength of friendship and estimated average proximity with each subject. It also includes general questions regarding the subjects’ social behaviour, participation in various activities, anxiety levels, academic performance etc.
Questions to be answered for each subject in the study.
Estimate your average proximity with the Person. (Time spent together on an average per day)
Scale of 1-5 (1- 0 to 5 minutes, 2- 5 to 30 minutes, 3- 30 minutes to 2 hours, 4- 2 to 4 hours, 5- 4 hours and above)
Estimate strength of friendship with the person.
Scale of 0-5 (0- Do not know the person, 1- Acquainted, 2- Sort of friends, 3- Friends, 4- Good friends, 5- Very good friends)
Rate your participation in the following activities:
Sports, Programming, Quizzing, Debate, Electronics, Writing college magazine, Dance, E-sports, Music, Drama, Academics, Research
Native Language, Birthplace
Rate the amount of stress experienced, GPA (academic performance), productivity with the time in college, satisfaction with time in college, social comfort and self-confidence.
3 Problem Definition
3.1 Problem Statement
Given a set of users $U = (u_1, u_2, \ldots, u_n)$ and a set of data points recorded by every user $u$, where each data point consists of the user's location $l$ (latitude and longitude values), the timestamp $t$, the provider $p$ used to measure the location and the accuracy $A$ logged by the provider, the objective is to infer the relationship score between each pair of users based on quantitative values.
Definition 1: Relationship strength is a quantitative measure that tells how strongly associated two people are.
Definition 2: An encounter is an event in which two users co-occur at the same place at the same time, that is, their logged locations at the same time lie within a distance threshold d of each other. The distance threshold d can be varied with the mean accuracy of the logged data points.
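Definition 2 can be illustrated with a minimal sketch. It assumes that each user's points are already keyed by an aligned timestamp and that distance is measured with the haversine formula. The function names, the default threshold of 50 meters and the sample coordinates are hypothetical and not taken from the paper.

```python
import math

EARTH_RADIUS_M = 6371000.0  # mean Earth radius in metres

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def find_encounters(points_i, points_j, d=50.0):
    """points_i, points_j: dict mapping an aligned timestamp -> (lat, lon).

    Returns the sorted timestamps at which both users have a point and
    the two points are at most d metres apart (d is an assumed value).
    """
    common = points_i.keys() & points_j.keys()
    return sorted(
        ts for ts in common
        if haversine_m(*points_i[ts], *points_j[ts]) <= d
    )

# Hypothetical sample data: about 11 m apart at 08:00, about 222 m apart at 08:05.
user_a = {"08:00": (23.1880, 72.6280), "08:05": (23.1880, 72.6280)}
user_b = {"08:00": (23.1881, 72.6280), "08:05": (23.1900, 72.6280)}
encounters = find_encounters(user_a, user_b, d=50.0)
```

In practice d would be tuned against the mean logged accuracy, as the definition suggests.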
3.2 Related Works
One of the pioneering papers that study the behavioral characteristics of friendship and infer the social network structure of the real world using mobile data is by Nathan Eagle et al. They study social networks with binary relational ties (i.e. are two students friends or not?). However, these binary indicators only provide a very coarse indication of the nature of the relationships, and do not embrace the complexity of human relationships. Our purpose is to estimate the strength of people’s relationships based on their interaction frequency and other proximity variables, and to describe it on a discrete scale reflecting the degree of friendship. Their collection of data includes mobile phone logs, calls, messages, usage of applications, etc., which is considerably privacy-invasive and cannot be done at scale. Their results showed that the behavioural data collected by the mobile phones and the self-reported data are indeed related. In addition, the amount of communication was the most significant predictor of friendship.
This is extended by the work of Cranshaw et al., who introduce various features such as specificity, location entropy, etc. to analyze social connections. This study provides an insight into the social network structure, showing that there exists a relationship between the mobility patterns of a user and the number of friends that the user has in their social network. Further, Pham, Shahabi and Liu (EBM) included the impact of co-incidences and co-occurrences at locations to infer a continuous variable predicting the relationship strength between a pair of subjects. They compute the strength of a relationship by conducting multiple linear regression over location diversity (the spread of encounters over different locations) and the frequency of encounters weighted by the location entropy (how crowded a place is).
[?] explore the inherent structure in mobility patterns, which are governed by geographic and social constraints. They find that short-ranged travel is periodic both spatially and temporally and not affected by social ties, while long-distance travel is more influenced by social network ties.
Temporal Representation: The day is divided into intervals of t minutes. For example, if t is 5 minutes, the day can be divided into 288 intervals from 00:00-00:05, 00:05-00:10 to 23:55-00:00.
The variable t can be a maximum of 1440 minutes where the entire day is counted as one interval.
Temporal Encounter Vector: Temporal Encounter Vector T contains the total number of times user i and user j had an encounter for each t minute interval of the day over a period of N days. The day can be divided into 1440/t intervals and hence the dimensions of the vector are fixed.
Example, for t=5 minutes, T = (1,3,0,0,0,0,…2) would mean that user 1 and user 2 had one encounter between 00:00-00:05, three between 00:05-00:10 and two between 23:55-00:00 across N days.
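The construction of the Temporal Encounter Vector can be sketched as follows. This is a minimal illustration, assuming encounters are available as `datetime` objects pooled over all N days and that t divides 1440 evenly. The function name and the sample timestamps are hypothetical.

```python
from datetime import datetime

def temporal_encounter_vector(encounter_times, t=5):
    """Count encounters per t-minute interval of the day.

    encounter_times: iterable of datetime objects, one per encounter,
    pooled over all N days. Assumes t divides 1440 evenly.
    """
    if 1440 % t != 0:
        raise ValueError("t must divide 1440")
    vector = [0] * (1440 // t)
    for ts in encounter_times:
        minute_of_day = ts.hour * 60 + ts.minute
        vector[minute_of_day // t] += 1
    return vector

# Hypothetical encounters reproducing the pattern T = (1, 3, 0, ..., 2).
times = [
    datetime(2016, 4, 1, 0, 2),
    datetime(2016, 4, 2, 0, 6),
    datetime(2016, 4, 3, 0, 7),
    datetime(2016, 4, 4, 0, 9),
    datetime(2016, 4, 5, 23, 57),
    datetime(2016, 4, 6, 23, 59),
]
T = temporal_encounter_vector(times, t=5)
```

The date part of each timestamp is deliberately ignored, so encounters from different days accumulate in the same interval.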
We introduce a new feature called Temporal Diversity which can be beneficial in predicting social ties based on spatio-temporal data.
Motivation: Over a period of days, encounters with a close friend will be spread across different time intervals in a day, as compared to encounters with someone you only meet because of scheduled activities. Essentially, encounters with friends over time don’t follow any rules or schedules and are randomly spread across a day’s time-span. On the other hand, encounters with people whom a person meets due to scheduled activities (eg: for work, lectures, tutorials) are routine and therefore would occur repeatedly at a scheduled time. We aim to capture this diversity in encounters over time to infer relationship strength.
Definition: Temporal Diversity quantifies the effective spread of encounters across time-intervals in a day’s timespan.
Given two users i and j, the Temporal Encounter Vector contains the total number of times user i and user j had an encounter in each t-minute interval of the day over a period of N days. Let
$$ r_{i,j}^{l,t} $$
represent an encounter between user i and user j in location l and time interval t. Let $R_{i,j}^{t}$ be the set of encounters of users i and j in time interval t, and let
$$ R_{i,j} = \bigcup_{t} R_{i,j}^{t} $$
be the set of co-occurrences of user i and j in all time intervals.
The probability that a randomly picked encounter from the set $R_{i,j}$ happened at time interval t is:
$$ P_{i,j}^{t} = \frac{|R_{i,j}^{t}|}{|R_{i,j}|} $$
If we randomly pick an encounter from the set $R_{i,j}$ and define its time interval as a random variable, then the uncertainty associated with this random variable is defined by the Shannon entropy for user i and j as follows (natural logarithm, summing over intervals with $P_{i,j}^{t} > 0$):
$$ H_{i,j} = -\sum_{t} P_{i,j}^{t} \log P_{i,j}^{t} $$
Diversity D is the effective number of t-minute time intervals user i and j have been together for in a day:
$$ D_{i,j} = \exp\left(H_{i,j}\right) $$
The more spread the encounters are across different time intervals, the higher the diversity.
Example: If we use the temporal representation with t=120 minutes, you can divide the day in 12 parts as follows: 00:00-02:00, 02:00-04:00 till 22:00-00:00.
User A and User B, and likewise User A and User C, have Temporal Encounter Vectors $T_{AB}$ and $T_{AC}$ calculated over 14 days such that,
$T_{AB} = (0,0,0,2,10,3,0,0,0,2,0,0)$
$T_{AC} = (3,0,4,0,3,0,2,2,0,0,0,3)$
Expanding $T_{AB}$ for clarity, User A and User B had 2 encounters in the 06:00-08:00 interval, 10 encounters in the 08:00-10:00 interval, 3 encounters in the 10:00-12:00 interval and 2 encounters in the 18:00-20:00 interval over 14 days. Note: User A and User B have had 17 encounters in 14 days, and so have User A and User C.
Temporal Entropy for $T_{AB}$ = 1.1218
Temporal Diversity for $T_{AB}$ = 3.1
Temporal Entropy for $T_{AC}$ = 1.7623
Temporal Diversity for $T_{AC}$ = 5.8
Most encounters between User A and User B happened in the time interval 08:00-10:00, which could suggest that they were governed by some schedule such as a common lecture or breakfast routine. Encounters between User A and User C are spread across the day and are more random or, one might say, diverse.
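The worked example can be checked numerically. The sketch below assumes the natural-logarithm Shannon entropy and diversity taken as its exponential, which reproduces the reported values. The function names are hypothetical.

```python
import math

def temporal_entropy(vector):
    """Shannon entropy (natural log) of a Temporal Encounter Vector."""
    total = sum(vector)
    if total == 0:
        return 0.0  # assumption: no encounters means zero entropy
    return -sum((c / total) * math.log(c / total) for c in vector if c > 0)

def temporal_diversity(vector):
    """Effective number of time intervals: exp of the Shannon entropy."""
    return math.exp(temporal_entropy(vector))

# Vectors from the example with t = 120 minutes (12 intervals per day).
T_AB = [0, 0, 0, 2, 10, 3, 0, 0, 0, 2, 0, 0]
T_AC = [3, 0, 4, 0, 3, 0, 2, 2, 0, 0, 0, 3]

for name, vec in (("A-B", T_AB), ("A-C", T_AC)):
    print(name, round(temporal_entropy(vec), 4), round(temporal_diversity(vec), 1))
```

Both pairs have 17 encounters, so the difference in diversity comes purely from how those encounters are spread over the day.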
We want to give less weight to time intervals in which the users have encountered each other a lot, because this could be due to some schedule shared between the users. Such encounters still contribute, but less than encounters which were not part of schedules. Our assumption is that interactions with friends over time are outside the purview of schedules, and meeting times will be distributed across the entire day if averaged over N days.
4 Experiments with Temporal Diversity
4.1 Data Pre-processing
We use the DAIICT Spatio-Temporal Social Network (DSSN) dataset with the following pre-processing:
All timestamps are rounded off to the nearest 5 minute interval to align data collected from different devices/users.
The duplicate points occurring for a particular ID at any timestamp are dropped.
Only data points collected between 2016-04-01 00:00:00 and 2016-05-01 00:00:00 are used.
We keep the GPS points whose reported accuracy from the phone is less than 60 meters.
For a user, if less than 20% of a day's data points were collected, we discard that day’s data for the user. We assume the app malfunctioned or wasn’t switched on for that day.
Users with data for fewer than 5 days are dropped from the analysis.
Only those users for whom ground truth data is also available are picked.
The reported closeness levels Friends and Sort of Friends are merged into one category, leaving a total of 5 categories.
Table 1: Re-grouping
|Closeness||Description|
|0||Don’t know the person|
|1||Acquaintance|
|2||Friends|
|3||Good Friends|
|4||Very Good Friends|
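The pre-processing steps listed above could be sketched roughly as follows. This is an illustration only: the record layout, the function names, the order in which the filters are applied, the half-open date range and the interpretation of the 20% rule as a fraction of the 288 five-minute slots in a day are all assumptions, not details confirmed by the paper.

```python
from collections import defaultdict
from datetime import datetime, timedelta

START = datetime(2016, 4, 1)
END = datetime(2016, 5, 1)      # assumed exclusive upper bound
MAX_ACCURACY_M = 60.0

def round_to_5_minutes(ts):
    """Round a datetime to the nearest 5-minute boundary."""
    seconds = ts.hour * 3600 + ts.minute * 60 + ts.second
    rounded = ((seconds + 150) // 300) * 300
    midnight = ts.replace(hour=0, minute=0, second=0, microsecond=0)
    return midnight + timedelta(seconds=rounded)

def preprocess(records):
    """records: iterable of dicts with keys 'user', 'ts' (datetime), 'accuracy'.

    Rounds timestamps, keeps April 2016 points with accuracy below 60 m,
    and drops duplicate (user, timestamp) points, keeping the first seen.
    """
    seen, cleaned = set(), []
    for rec in records:
        ts = round_to_5_minutes(rec["ts"])
        if not (START <= ts < END) or rec["accuracy"] >= MAX_ACCURACY_M:
            continue
        if (rec["user"], ts) in seen:
            continue
        seen.add((rec["user"], ts))
        cleaned.append({**rec, "ts": ts})
    return cleaned

def drop_sparse(cleaned, min_day_fraction=0.2, min_days=5, slots_per_day=288):
    """Drop user-days below the coverage fraction, then users with too few days."""
    slots = defaultdict(set)
    for rec in cleaned:
        slots[(rec["user"], rec["ts"].date())].add(rec["ts"])
    good_days = {key for key, s in slots.items()
                 if len(s) >= min_day_fraction * slots_per_day}
    days_per_user = defaultdict(int)
    for user, _day in good_days:
        days_per_user[user] += 1
    return [rec for rec in cleaned
            if (rec["user"], rec["ts"].date()) in good_days
            and days_per_user[rec["user"]] >= min_days]
```

The ground-truth filter and the closeness re-grouping of Table 1 are left out here because they depend on the survey file layout.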
Users do not necessarily have data for all 30 days of the experiment. A fraction of users left or joined the experiment during the month of data collection. Hence, it is important to consider the number of common days between two people rather than the number of days a user had data overall. We require N, the number of common days, to be at least 7. Selecting too small a number leaves us with too few data points per pair, and selecting too large a number leaves us with too few user pairs. Hence, all our results reflect a scenario where any two users have at least 7 days' worth of data in common.
It’s important to note that reported closeness between two users is directional: User A and User B have not necessarily reported the same closeness to each other. Hence, the ordered pairs (A, B) and (B, A) are treated as distinct in the experiments.
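The pair selection described above can be sketched as follows, assuming each user's usable days are available as a set and the survey answers as a dictionary keyed by ordered pair. The function name, data layout and sample values are hypothetical.

```python
from itertools import permutations

def eligible_directed_pairs(days_by_user, closeness, min_common_days=7):
    """Return (rater, rated, n_common_days, score) for every ordered pair
    that shares at least min_common_days days and has a reported score.

    days_by_user: dict user -> set of days with usable data.
    closeness: dict (rater, rated) -> reported closeness of rater towards rated.
    """
    rows = []
    for a, b in permutations(sorted(days_by_user), 2):
        common = days_by_user[a] & days_by_user[b]
        if len(common) >= min_common_days and (a, b) in closeness:
            rows.append((a, b, len(common), closeness[(a, b)]))
    return rows

# Hypothetical example: days are numbered 1..30 within the month.
days = {
    "A": set(range(1, 11)),   # days 1-10
    "B": set(range(3, 13)),   # days 3-12, 8 days in common with A
    "C": set(range(9, 16)),   # days 9-15, too few days in common with A or B
}
reported = {("A", "B"): 4, ("B", "A"): 2, ("A", "C"): 1}
pairs = eligible_directed_pairs(days, reported)
```

Keeping (A, B) and (B, A) as separate rows lets each direction carry its own reported score while sharing the same encounter features.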
We attempt to evaluate the effectiveness of Location Diversity, Temporal Diversity and Average Encounters per day as features in predicting reported closeness scores between users. We primarily use F-tests and Pearson correlations.
Sweeping over width of time interval (t)
The width of the time intervals into which the day is divided in the temporal representation is an important parameter. For example, take User A and User B as discussed in the example with the Temporal Encounter Vector T = (0,0,0,2,10,3,0,0,0,2,0,0). User A and User B have 10 encounters in the 08:00-10:00 interval. We assume User A and User B have a common lecture at this time. If the interval width were shorter, such as 5 minutes, it is likely that these 10 encounters would be distributed across the intervals 08:00-08:05, 08:05-08:10, up to 09:55-10:00. User A and User B may not enter the classroom at the same time, or may not always be in each other's exact vicinity for an encounter to be recorded. So, even though the meeting is a scheduled encounter, Temporal Diversity is not able to account for it; rather, in all likelihood Temporal Diversity increases because of it.
As we saw, this effect can be mitigated by having wider time-intervals. But if the time-intervals are kept too wide, the actual random encounters across the day would be grouped into one and information would be lost.
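The effect of the interval width can be seen on a toy example. The sketch below uses hypothetical encounter times, all inside a scheduled 08:00-10:00 lecture, and computes the Shannon-based diversity at several widths. The function name and the data are illustrative assumptions.

```python
import math

def diversity_at_width(minutes_of_day, t):
    """Shannon-based diversity of encounters binned into t-minute intervals.

    minutes_of_day: encounter times as minutes after midnight, pooled over days.
    """
    counts = {}
    for m in minutes_of_day:
        counts[m // t] = counts.get(m // t, 0) + 1
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return math.exp(entropy)

# Ten hypothetical encounters, one per day, all between 08:00 and 10:00.
lecture_encounters = [482, 491, 503, 517, 524, 538, 546, 559, 571, 588]

for width in (5, 60, 120):
    print(width, round(diversity_at_width(lecture_encounters, width), 2))
```

With 5-minute bins every encounter lands in its own interval and the scheduled lecture looks maximally diverse, while 120-minute bins collapse it into a single interval.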
We calculate Temporal Diversity scores for the same set of encounters at different widths t. The cross-correlation between each regressor and the target is computed, and is then converted to an F score and a p-value.
|Width t (minutes)||F Value||p-value|
Observation: A width of 60 minutes has the highest F-value and hence can be taken as the optimum width t for our dataset. F-values increase from the 5-minute interval to the 60-minute interval and thereafter decrease up to the 720-minute interval.
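The conversion from a correlation to an F score mentioned above can be sketched for the single-regressor case, where F = r² / (1 − r²) × (n − 2). This is a generic univariate linear regression test, assumed here to match the procedure used. The function names and sample values are hypothetical.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def f_score_from_r(r, n):
    """F statistic with (1, n - 2) degrees of freedom for one regressor."""
    return (r * r) / (1.0 - r * r) * (n - 2)

# Hypothetical feature values (e.g. a diversity score) and closeness labels.
feature = [1, 2, 3, 4, 5]
closeness = [2, 1, 4, 3, 5]
r = pearson_r(feature, closeness)
f_value = f_score_from_r(r, len(feature))
```

A p-value would additionally require the survival function of the F distribution with (1, n − 2) degrees of freedom, which is omitted to keep the sketch dependency-free.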
Predicting Reported Closeness through Regression.
Location Diversity: We calculate Location Diversity based on the explanation given in EBM. The only difference is that we use geohashing to divide the campus into fixed regions and map each encounter to a location id. We use hashes of length 8, which give a precision of roughly 19 meters.
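As a rough illustration of mapping encounters to location ids, the sketch below implements the standard public geohash encoding and computes a Shannon-based diversity over the resulting cells. The paper does not state which geohash implementation was used, and EBM also defines a Renyi-based variant, so the function names and the choice of the Shannon form are assumptions.

```python
import math
from collections import Counter

_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet

def geohash_encode(lat, lon, precision=8):
    """Encode a latitude/longitude pair as a geohash string."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars, bits, value = [], 0, 0
    use_lon = True  # bits alternate, starting with longitude
    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            bit = 1 if lon >= mid else 0
            lon_lo, lon_hi = (mid, lon_hi) if bit else (lon_lo, mid)
        else:
            mid = (lat_lo + lat_hi) / 2
            bit = 1 if lat >= mid else 0
            lat_lo, lat_hi = (mid, lat_hi) if bit else (lat_lo, mid)
        value = (value << 1) | bit
        use_lon = not use_lon
        bits += 1
        if bits == 5:  # every 5 bits become one base-32 character
            chars.append(_BASE32[value])
            bits, value = 0, 0
    return "".join(chars)

def location_diversity(encounter_points, precision=8):
    """Shannon-based diversity of encounters over geohash cells.

    encounter_points: list of (lat, lon), one per encounter of a user pair.
    """
    counts = Counter(geohash_encode(lat, lon, precision)
                     for lat, lon in encounter_points)
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return math.exp(entropy)
```

Encounters whose coordinates fall into the same length-8 cell share a location id, so repeated meetings at one spot do not increase the diversity.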
We apply an F-test to Location Diversity, Temporal Diversity and Mean Encounters, regressing each of them against the outcome (closeness).
|Feature||F Value||p-value||Correlation (r)|
|Temporal Diversity (t=60)||69.83||0||0.30|
Observations: Among the three features, Temporal Diversity is ranked the highest, followed by Location Diversity and Mean Encounters. Temporal Diversity is the most correlated with closeness.
We further explore how Temporal Diversity, Location Diversity and Mean Encounter scores are distributed across different closeness sub-groups.
Mean Temporal Diversity for sub-group Very Good Friends, is significantly higher than the rest.
There are many outliers in the Don’t Know and Acquaintance sub-groups for Temporal Diversity. This suggests that a lot of coincidences are happening. It is worthwhile to explore further ways of accounting for co-incidences in encounters. Renyi entropy has been explored previously to account for co-incidental encounters, but in the domain of locations. We apply the same concept to the domain of time.
Temporal Diversity isn’t able to differentiate between Friends and Good Friends. The overlap between both is high.
Location Diversity shows a healthy increase in mean diversity across sub-groups but contains a lot of overlaps between different sub-groups both in the lower quartile and the upper quartile.
Mean Encounters for Very Good Friends is significantly different from that of the other sub-groups. This suggests that pairs in that sub-group had a healthy number of encounters on every common day. Strikingly, the same effect is not visible for pairs in Good Friends or even Friends, where the median of Mean Encounters is approximately 1, as compared to 8 for Very Good Friends. But this sparsity in encounter data does not negatively affect Temporal Diversity for the said sub-groups. We can say Temporal Diversity is robust to sparsity in the daily data, which is an important factor in real-life geospatial applications.
Renyi Entropy and Co-incidences
We use Renyi entropy as used in Pham et al. (EBM), which is a generalization of Shannon entropy. For an order q (q ≥ 0, q ≠ 1), the Renyi entropy of the encounters of users i and j is
$$ H^{R}_{i,j} = \frac{1}{1-q} \log \sum_{t} \left(P_{i,j}^{t}\right)^{q} $$
where $P_{i,j}^{t}$ is the probability that a randomly picked encounter of users i and j happened in time interval t, and the sum runs over intervals with $P_{i,j}^{t} > 0$. Let $D^{R}_{i,j} = \exp(H^{R}_{i,j})$ be the diversity calculated from the Renyi entropy.
Cases of Renyi Entropy:
As q approaches zero, the Renyi entropy increasingly weighs all possible events more equally, regardless of their probabilities. In the limit q -> 0, the Renyi entropy is just the logarithm of the size of the support of the random variable X.
When q < 1, the temporal diversity tends to give more weight to the low-valued local frequencies in the Temporal Encounter Vector. In other words, the lower the number of times a pair of students have met in a particular time slot, the more weight that slot gets and the more impact its local frequency has on the diversity.
The limit q -> 1 is the Shannon entropy.
When q > 1, the opposite of q < 1 occurs: the Renyi entropy $H^{R}_{i,j}$, and consequently the diversity $D^{R}_{i,j}$, more favorably considers the high values of the Temporal Encounter Vector.
As q approaches infinity, the Renyi entropy is increasingly determined by the events of highest probability.
When q < 1 is used, for example q = 0.5, the Renyi entropy, and consequently the diversity, considers lower values of local temporal frequencies more favourably than higher values.
Previous work explores in detail how the effect of co-incidences can be controlled by using Renyi entropy based diversity in the domain of locations. We apply the same concept to Temporal Diversity and sweep over the parameter q, the order of diversity, which decides the sensitivity of the final Temporal Diversity to the number of encounters in each time interval.
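The sweep over the order q can be sketched with the standard Renyi entropy, H_q = log(Σ p^q) / (1 − q), with diversity taken as exp(H_q) and the Shannon case used at q = 1. The function name is hypothetical, and the vector reuses the 12-interval example from earlier in the paper.

```python
import math

def renyi_diversity(vector, q):
    """Diversity of order q for a Temporal Encounter Vector.

    Uses exp of the Renyi entropy; q = 1 falls back to the Shannon limit.
    """
    total = sum(vector)
    if total == 0:
        raise ValueError("vector has no encounters")
    probs = [c / total for c in vector if c > 0]
    if abs(q - 1.0) < 1e-12:
        entropy = -sum(p * math.log(p) for p in probs)
    else:
        entropy = math.log(sum(p ** q for p in probs)) / (1.0 - q)
    return math.exp(entropy)

T_AB = [0, 0, 0, 2, 10, 3, 0, 0, 0, 2, 0, 0]

for q in (0.0, 0.5, 1.0, 2.0):
    print(q, round(renyi_diversity(T_AB, q), 3))
```

At q = 0 the diversity is simply the number of non-empty intervals, and it shrinks as q grows because the heavily populated 08:00-10:00 interval dominates more and more.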
Refer to Table 4 for results.
We observe that the lower the order of the diversity (q), the more effective Temporal Diversity is in predicting closeness. This is in line with the results reported in previous work for Location Diversity.
|Order of Diversity||F-Value|
Changing Temporal Diversities Over Time.
How does Temporal Diversity between two people change over time? We’ve emphasized before how the essence of Temporal Diversity lies in the growing randomness of encounters across a day over time for users with strong social ties. Hence, it’s important to explore how Temporal Diversity changes over days for different sub-groups of closeness.
Methodology: Each pair of users has a different number of common days for which they share data. For this experiment, we calculate Temporal Diversity between pairs of users using only their first d common days, while we sweep d from 1 to 11. For example, for d=3 we calculate Temporal Diversity between pairs based on data from their respective first three common days. In case a pair has only 2 common days, Temporal Diversity for d=3 is set as empty. Then, for each sub-group of closeness and each value of d, we calculate the average Temporal Diversity.
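The methodology above can be sketched for a single pair as follows. The sketch assumes encounters are grouped per common day in chronological order and expressed as minutes after midnight. The function names, the sample data and the choice of `None` both for missing days and for days without any encounter are assumptions.

```python
import math
from collections import Counter

def _diversity(counts):
    """Shannon-based diversity from a list of per-interval counts."""
    total = sum(counts)
    entropy = -sum((c / total) * math.log(c / total) for c in counts)
    return math.exp(entropy)

def diversity_by_common_days(encounters_by_day, t=60, max_d=11):
    """Temporal Diversity using only the first d common days, for d = 1..max_d.

    encounters_by_day: list with one entry per common day (chronological),
    each a list of encounter times in minutes after midnight.
    Returns {d: diversity or None}; None marks an empty value.
    """
    result = {}
    for d in range(1, max_d + 1):
        if d > len(encounters_by_day):
            result[d] = None  # pair has fewer than d common days
            continue
        counts = Counter(m // t for day in encounters_by_day[:d] for m in day)
        result[d] = _diversity(list(counts.values())) if counts else None
    return result

# Hypothetical pair with three common days.
pair_days = [[485], [490, 720], [1100]]
curve = diversity_by_common_days(pair_days, t=60, max_d=4)
```

Averaging these per-pair curves within each closeness sub-group, skipping the empty values, would give one curve per sub-group.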
Temporal Diversity in general increases as the number of common days between two users increases.
For the sub-group Very Good Friends, average temporal diversity is starkly higher than the averages in other sub-groups.
Average temporal diversity of encounters between Very Good Friends increases at a much higher rate than that of other sub-groups.
In this paper, we first introduce and release the DAIICT Spatio-temporal Social Network (DSSN) dataset, a granular dataset of the movement of 46 participants in a residential college campus over a span of thirty days. It reflects human patterns guided by routine and schedule in the geographically bounded area of a university. The data is complemented with rich self-reported data about the participants, based on an extensive survey capturing demographics, the extent of participation in various campus activities, proximity to other participants in the survey and questions pertaining to happiness, sociability and self-confidence.
Further, we introduce a new feature called Temporal Diversity, which in our experiments showed superior predictive power for inferring social ties compared to its counterparts, Location Diversity and Mean Encounters. Next, we examined how Temporal Diversity changes with time and gets better at differentiating between sub-groups of social ties as the number of common days increases. Lastly, we saw that Temporal Diversity is robust to sparsity of the data collected between two users and can function well even with less data. We believe Temporal Diversity can be a useful feature for predicting social ties in environments where users are governed by fixed schedules and the number of meaningful unique locations is small. We believe a bustling residential university campus is, to some extent, a microcosm of urban lifestyles.
This work along with releasing of the DSSN dataset opens up opportunities to answer interesting multi-disciplinary questions. How does mobility affect self-confidence, sociability, happiness or GPA? Do people with similar interests end up in the same social groups? Do people in relationships have a particular mobility pattern? Are certain time-intervals of meeting more helpful in predicting social ties than others? How can one effectively predict social-ties in an environment where co-incidental encounters are abundant due to schedules and closed spaces like in workplaces? We wish to investigate some of these issues in our future work and at the same time hope that the research community finds the dataset useful to explore some of their own questions.
The authors would like to heartily thank Professor Sanjay Srivastava of DA-IICT for his encouragement, initiatives and help. We also thank the software creators of the publicly available GPS Logger (https://github.com/mendhak/gpslogger) and Vraj Delhivala for the additions to the software application. Finally, we are thankful to all the participants for their help with the data collection and survey.
- D. Crandall, L. Backstrom, D. Cosley, S. Suri, D. Huttenlocher and J. Kleinberg. Inferring social ties from geographic coincidences. Proc. National Academy of Sciences, December 2010.
- Nathan Eagle, Alex (Sandy) Pentland, David Lazer. Inferring Social Network Structure using Mobile Phone Data.
- Charles Blundell, Katherine A. Heller, Jeffrey M. Beck. Modelling Reciprocating Relationships with Hawkes Processes.
- Zhenyu Wu, Ming Zou. An incremental community detection method for social tagging systems using locality-sensitive hashing.
- Ivan Brugere, Venkata M. V. Gunturi, Shashi Shekhar. Modeling and analysis of spatio-temporal social networks.
- Huy Pham, Cyrus Shahabi, Yan Liu. EBM - An Entropy-Based Model to Infer Social Strength from Spatiotemporal Data.
- Huy Pham, Ling Hu, Cyrus Shahabi. Towards Integrating Real-World Spatiotemporal Data with Social Networks.
- Nancy Katz, David Lazer, Holly Arrow, Noshir Contractor. Network Theory and Small Groups.
- Quannan Li, Yu Zheng, Xing Xie, Yukun Chen, Wenyu Liu, Wei-Ying Ma. Mining User Similarity Based on Location History.
- Tobias Hecking, Tilman Göhnert, Sam Zeini, Ulrich Hoppe. Task and Time Aware Community Detection in Dynamically Evolving Social Networks.
- Wikipedia. Geohash. Wikipedia, The Free Encyclopedia. [Online; accessed 20-August-2016].