Multidimensional Outlier Detection in Temporal Interaction Networks: An Application to Political Communication on Twitter†

†This article is a substantially extended and revised version of the authors' COMPLENET 2019 paper [33], with updated research, literature review and methodology, along with new data analysis.
Abstract
In the social network Twitter, users can interact with each other and spread information via retweets. These millions of interactions may result in media events whose influence goes beyond the Twitter framework. In this paper, we thoroughly explore interactions to provide a better understanding of the emergence of certain trends. First, we consider an interaction on Twitter to be a triplet (u, v, t) meaning that user u, called the spreader, has retweeted a tweet of user v, called the author, at time t. We model this set of interactions as a data cube with three dimensions: spreaders, authors and time. Then, we provide a method which builds different contexts, where a context is a set of features characterizing the circumstances of an event. Finally, these contexts allow us to find relevant unexpected behaviors, according to several dimensions and various perspectives: a user whose behavior during a given hour is abnormal compared to its usual behavior, a relationship between two users which is abnormal compared to all other relationships, etc. We apply our method to a set of retweets related to the 2017 French presidential election and show that one can build interesting insights regarding political organization on Twitter.
I Introduction
The use of social networks has exploded over the past fifteen years. The microblogging service Twitter is currently one of the most popular and fastest-growing of them. Within this social network, users can post information via tweets as well as spread information by retweeting tweets of other users. This leads to a dissemination of information from a variety of perspectives, thus affecting users' ideas and opinions.
As discussed in the works of Murthy et al. [19] and Weller et al. [32], for some of the most active users, Twitter even constitutes the primary medium by which they get informed. These users only represent a negligible fraction of the population. Nevertheless, hot topics emerging on Twitter's data stream are relayed by traditional media and therefore reach a much broader audience. While such trends often arise naturally from discussions or are consequences of the reaction of all users to real-world events, they may also originate from the intensive activity of a small group and mislead other users on the significance of certain topics.
The volume of user-generated data is considerable: over 500 million tweets are posted every day on Twitter. Moreover, this data results from interactions of millions of users over time and therefore includes numerous complex structures. In this context, it is difficult for users to have a concrete vision of the trends taking place and, even more, to apprehend the way in which all interactions are organised and can lead to media events.
In this paper, we seek to make this task achievable. More precisely, we aim at finding outliers in interaction data formed from a set of retweets. For instance, an event in a data stream is an outlier: it can be viewed as a statistical deviation of the total number of retweets at a given point in time. More generally, outliers, depending on which dimensions define them, highlight instants, users, users during given periods, or interactions for which the retweeting process behaves unusually. Therefore, they constitute important information which is worth noticing from the perspective of the user. In order to find these unexpected behaviors, we design a multidimensional and multilevel analysis method.
First of all, we consider an interaction on Twitter to be a triplet (u, v, t) meaning that user u, called the spreader, has retweeted a tweet of user v, called the author, at time t. We model the set of interactions as a data cube with three dimensions: spreaders, authors and time. This representation enables us to access local information, that is, the number of retweets between two users during a specific hour, as well as more global and aggregated information, for instance the total number of retweets during a given hour. Afterwards, we combine and compare these different quantities of interactions in order to find outliers according to different contexts. Using the two previous quantities, we could, for instance, find an unexpected relationship between a spreader u and an author v during an hour h, if the number of retweets from u to v during h is significantly large given the total number of retweets observed during this hour. This analysis gives us insight into the possible reasons why some events emerge more than others and, in particular, whether they are global phenomena or, on the contrary, whether they originate from specific actors only.
Our method applies to all types of temporal interaction networks. One can add attributes to interactions by adding dimensions to the problem. In this paper, we add a semantic dimension referring to tweet contents by considering the tuples (u, v, k, t) meaning that u retweeted a tweet written by v and containing hashtag k at time t. This allows us to explore interactions from other perspectives and gain crucial information on events taking place.
The paper is organized as follows. We review the related work on outlier detection within Twitter in Section II. We introduce the modelling of interactions as a data cube in Section III, then we describe our method to build relevant contexts in Section IV. After describing our datasets in Section V, we present a case study in Section VI. In particular, we investigate the causes of emergence of events found in the temporal dimension by exploring authors, spreaders, then hashtags dimensions. In Section VII, we discuss two future works that can be achieved using our method, in particular, a characterization of the second screen usage and a usertopic link prediction. Finally, we conclude the paper in Section VIII.
II Related Work
The problem of outlier detection on Twitter has attracted a significant amount of interest among scientists and has been approached in various ways depending on how outliers are defined and on the techniques used.
Some researchers consider outliers as real-world events happening at a given place and at a given moment. For example, Sakaki et al. [25] and Bruns et al. [2] trace specific keywords attributed to a real-world event and find such outliers by monitoring temporal changes in word usage within tweets. There are also methods based on tweet clustering. In these approaches, authors infer, from timestamps, geolocations and tweet contents, a similarity between each pair of tweets and find real-world events within clusters of similar tweets. These techniques include the one of Dong et al. [10], which computes similarities with a wavelet-based method between time series of keywords; the one of Li et al. [18], which aims at finding crime- and disaster-related events in a real-time fashion; and the one of Walther et al. [31], which focuses on small-scale events located in space.
Other researchers, instead, seek entities like bots, spammers, hateful users or influential users. Thus, they consider outliers as users with abnormal behaviors according to different criteria. Varol et al. [30] detect bots by means of a supervised machine learning technique. They extract features related to user activities along time, user friendships as well as tweet contents, and use these features to identify bots by means of a labelled dataset. Stieglitz et al. [28] identify influential users by investigating the correlation between the vocabulary they use in tweets and the number of times they are retweeted. Ribeiro et al. [24] detect hateful users. They start by classifying users with a lexicon-based method and then show that hateful users differ from normal ones in terms of their activity patterns and network structure.
Finally, other works aim at finding privileged relationships between users. Among those, the work of Wong et al. [34] applies this idea to political leaning by combining an analysis of the number of retweets between two users with a sentiment analysis of the retweeted tweets.
All these works, although providing meaningful results, use different methods for different kinds of outliers. Moreover, they only consider one perspective in the way they define them. With our approach, we want to treat these different types of outliers – keywords, users, relationships – in a unified way, as well as to consider different contexts in which outliers are considered abnormal. Hence, not only do we consider different entities – abnormal users, abnormal relationships, abnormal behaviors of users during specific hours, etc. – but also different contexts in which outliers are defined. Thus, an abnormal user may be abnormal during a given hour compared to the way it usually behaves during other hours, but also compared to the behavior of all other users during the same hour. In this way, our framework aims to give a more complete and systematic picture of how users act, interact, and are organized along time, in a way similar to what Grasland et al. [15] do in the case of media coverage in newspapers.
In practice, instead of characterizing and detecting outliers using tweets' content, as many current approaches do, including those set out above, we focus on the volume and structure of interactions. Indeed, text-mining techniques face challenges such as the ambiguity of language and the fact that the resulting models are language-dependent and topic-dependent. Moreover, the structure of communication alone is already quite informative. Other authors point in this direction. For instance, Chavoshi et al. [5] use a technique similar to the one of Varol et al. [30], but only exploit user activities through their number of tweets and retweets. In the same vein, Chierichetti et al. [6] look at the tweet/retweet volume and detect points in time when important events happen. Instead of volume-based features, another alternative to text-mining techniques is to use graph-based features. Song et al. [27], for instance, identify spammers in real time with a measure of distance and connectivity between users in the directed friendship graph (followers and followees). Bild et al. [1] designed a similar method, but based on the retweet graph instead. Also based on the retweet graph, the method of Ten et al. [29] detects trends by noticing changes in the size and in the density of the largest connected component. Another example is the approach of Coletto et al. [7], which combines an analysis of the friendship graph and of the retweet graph to identify controversies in threads of discussion.
In this paper, we design a method able to handle multiple types of outliers by observing the retweet volume in numerous different contexts. We believe that this multidimensional and multilevel analysis is essential to detect subtle unexpected behaviors, as well as to fully understand the way in which millions of interactions may result in media events.
III Formalism
We denote the set of interactions by a set of triplets (u, v, t) such that (u, v, t) indicates that user u, called the spreader, has retweeted a tweet written by user v, called the author, at time t. We represent this set of interactions by a data cube [16]. In this section, we formally define this tool as well as the possible operations we can perform to manipulate the data.
III-A Data Cube Definition
A data cube is a general term used to refer to a multidimensional array of values [16]. Given n dimensions characterized by sets D_1, …, D_n, we can build 2^n data cubes, each representing a different degree of aggregation of the data. The binomial coefficient C(n, k) corresponds to the number of data cubes of dimension n − k in which k dimensions are aggregated. Within this set of data cubes, we call the base cuboid the n-dimensional data cube which has the lowest degree of aggregation. More generally, a k-dimensional data cube is denoted (D, f), where D = D_1 × ⋯ × D_k is the Cartesian product of the sets D_i, and f is a feature which maps each tuple to a value in a value space V:

f : D → V.

In the following, tuples (d_1, …, d_k) ∈ D are also called cells of the cube and denoted c, such that f(c) ∈ V.
Dimensions are the sets of entities with respect to which we want to study the data. As a first step, we consider three dimensions: spreaders, denoted U, authors, denoted V, and time, denoted T. In addition, we can organise the elements of a dimension into subdimensions. For instance, the temporal dimension can be organised depending on temporal granularity. In our case, we divide it into the two subdimensions days, denoted D, and hours of the day, denoted H, such that t = (d, h) denotes the hour h of day d, with d ∈ D and h ∈ H. While the set of days D depends on the dataset, H = {0, 1, …, 23} is the set of hours of the day.
The feature f is a numerical measure which provides the quantities according to which we want to analyse relationships between dimensions. Here, we consider the quantity of interaction, denoted n. It gives the number of retweets for any combination of the four dimensions. In the base cuboid (U × V × D × H, n), the value n(u, v, d, h) gives the number of times u retweeted v during hour h of day d (see Figure 1).
Data cubes of smaller dimensions are obtained by aggregating the base cuboid along one or several dimensions. We discuss this operation along with others in the next subsection.
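As a concrete illustration of this representation (a sketch, not the authors' implementation), the base cuboid can be stored as a sparse mapping from cells to counts; all user names below are hypothetical:

```python
from collections import Counter

def base_cuboid(retweets):
    """Build the base cuboid as a sparse map from cells
    (spreader, author, day, hour) to retweet counts;
    absent cells implicitly hold the value 0."""
    return Counter(retweets)

# Hypothetical interactions: u1 retweets v1 twice during hour 9 of day 3.
rts = [("u1", "v1", 3, 9), ("u1", "v1", 3, 9), ("u2", "v1", 3, 10)]
cube = base_cuboid(rts)
print(cube[("u1", "v1", 3, 9)])  # → 2
print(cube[("u2", "v2", 3, 9)])  # → 0 (empty cell)
```

The sparse representation is convenient here because, in practice, the overwhelming majority of (spreader, author, day, hour) cells contain no retweet.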
III-B Data Cube Operations
We can explore the data through three operations called aggregation, expansion and filtering.
Aggregation is the operation which consists in seeing information at a more global level. Given a data cube (D_1 × ⋯ × D_k, n), the aggregation operation along dimension D_i leads to a data cube of dimension k − 1. Formally, a dimension is aggregated by adding up the values of the feature for all elements d_i ∈ D_i. We indicate by a ∗ the dimension which is aggregated. Hence, the aggregated cube is constituted of (k − 1)-dimensional cells denoted (d_1, …, ∗, …, d_k), where

n(d_1, …, ∗, …, d_k) = Σ_{d_i ∈ D_i} n(d_1, …, d_i, …, d_k).
For instance, one can aggregate along the dimension of hours of the day, such that

n(u, v, d, ∗) = Σ_{h ∈ H} n(u, v, d, h)

gives the total number of times u retweeted v during day d. As opposed to the base cuboid, the apex cuboid is the most summarized cuboid. It is aggregated along all dimensions and hence consists in only one cell containing the grand total n(∗, ∗, ∗, ∗). In our case, the apex cuboid contains the total number of retweets.
We can also aggregate interactions according to a partition P of dimension D_i, i.e., a set of subsets of D_i such that the intersection of any two distinct sets in P is empty and the union of the sets in P is equal to D_i. Then, given a data cube (D_1 × ⋯ × D_k, n), the aggregation operation along P leads to a data cube (D_1 × ⋯ × P × ⋯ × D_k, n). This cube is constituted of k-dimensional cells denoted (d_1, …, P_j, …, d_k), with P_j ∈ P, such that

n(d_1, …, P_j, …, d_k) = Σ_{d_i ∈ P_j} n(d_1, …, d_i, …, d_k).
For instance, one can aggregate according to a partition of hours P = {H_night, H_day}, where H_night is the set of nocturnal hours and H_day the set of daytime hours, such that

n(u, v, d, H_night) = Σ_{h ∈ H_night} n(u, v, d, h)

in (U × V × D × P, n) gives the total number of times u retweeted v during nocturnal hours of day d.
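Both aggregation variants can be sketched on the sparse dictionary representation, marking an aggregated dimension with a '*'. This is a minimal illustration assuming cells are (spreader, author, day, hour) tuples; the night/day split of hours is a hypothetical partition:

```python
from collections import Counter

def aggregate(cube, dim):
    """Aggregate along dimension `dim` by replacing it with '*'
    and summing the counts of the merged cells."""
    out = Counter()
    for cell, n in cube.items():
        key = cell[:dim] + ("*",) + cell[dim + 1:]
        out[key] += n
    return out

def aggregate_partition(cube, dim, block_of):
    """Aggregate dimension `dim` onto the blocks of a partition;
    `block_of` maps each element of the dimension to its block label."""
    out = Counter()
    for cell, n in cube.items():
        key = cell[:dim] + (block_of[cell[dim]],) + cell[dim + 1:]
        out[key] += n
    return out

cube = Counter({("u1", "v1", 3, 9): 2, ("u1", "v1", 3, 23): 1})
daily = aggregate(cube, 3)                  # sum over the hour dimension
print(daily[("u1", "v1", 3, "*")])          # → 3
night_day = {h: "night" if h < 7 or h >= 22 else "day" for h in range(24)}
parts = aggregate_partition(cube, 3, night_day)
print(parts[("u1", "v1", 3, "night")])      # → 1
```

Filtering, the third operation, amounts to keeping only the cells whose coordinates belong to the selected subsets before aggregating.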
Expansion is the reverse operation, which consists in seeing information at a more local level by introducing additional dimensions. Given a data cube of dimension k − 1 aggregated along D_i, the expansion operation on dimension D_i leads back to a data cube of dimension k.
Filtering is the operation which consists in focusing on one specific subset of the data. Given a data cube (D_1 × ⋯ × D_k, n), the filtering operation leads to a subcube (D_1′ × ⋯ × D_k′, n) by selecting subsets of elements within one or more dimensions, such that D_i′ ⊆ D_i for each i.
It is also possible to combine operations. For instance, we can filter the data cube aggregated on the partition of hours, (U × V × D × P, n), in order to focus on spreaders that abnormally retweet authors overnight on a given day. Note that the resulting data cube is different from the data cube merely filtered on nocturnal hours: in the first case, a cell gives the total number of times u retweeted v during nocturnal hours of day d; while in the second case, a cell gives the number of times u retweeted v during hour h of day d, where h is a nocturnal hour.
Figure 2 shows a set of all data cubes that can be obtained considering the three dimensions: spreaders, authors and time. It also illustrates how to navigate from one to another thanks to the previously described operations.
IV Method
In this paper, our goal is to find abnormal data cube cells, i.e., entities for which the observation is abnormal. As an observation's abnormality is relative to the elements to which it is compared [17], a given cell may be abnormal or not depending on the context. The context is the set of elements which are taken into account in order to assess the abnormality of an entity. In this section, we design a set of steps in order to shape various contexts and show, through several examples, that it leads to a deeper exploration of interactions compared to an elementary outlier detection.
IV-A Construction of a Context
An abnormal entity is an entity whose behavior deviates from its expected one. Hence, one way to find outliers in a set of entities is to consider the following elements:
– a set of observed values O;
– a set of expected values E;
– a set of deviation values Δ, which quantify the differences between observed and expected values.
Together, these elements constitute the context. Then, given a context, an outlier is a cell whose absolute deviation value is significantly larger than most other deviation values.
We build more or less elaborate contexts by varying the observed, expected and deviation values considered.
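As a generic sketch of this recipe, one can compute a deviation value per cell and flag the cells whose deviation lies far from the mean deviation (here with the three-standard-deviations convention, a common choice). All names and values are illustrative:

```python
from statistics import mean, stdev

def outliers(observed, expected, deviation):
    """Return cells whose deviation value lies more than three standard
    deviations away from the mean deviation value."""
    devs = {c: deviation(observed[c], expected[c]) for c in observed}
    mu, sigma = mean(devs.values()), stdev(devs.values())
    return [c for c, d in devs.items() if abs(d - mu) > 3 * sigma]

# Toy context: twenty ordinary cells and one cell far above expectation.
observed = {f"c{i}": 10 for i in range(20)}
observed["x"] = 120
expected = {c: 10 for c in observed}
ratio = lambda o, e: o / e
print(outliers(observed, expected, ratio))  # → ['x']
```

Swapping the `observed` cube, the `expected` model, or the `deviation` function yields the different contexts developed in the next subsections.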
IV-B Observed values
According to the type of unexpected behaviors we are looking for, the first step consists in choosing a cube among the set of cubes obtained from the base cuboid using one or several operations. This cube constitutes the set of entities and observed values.
For instance, we can look for abnormal authors at given hours. To do so, we focus on the cube aggregated on spreaders, (V × D × H, n). We may also want to find abnormal authors during nocturnal hours only. In this case, we consider the aggregated and filtered data cube (V × D × H_night, n).
In the first case, we consider all entities of the same type: we are in a global context. On the contrary, when we only consider a subset of all entities, as in the second example with nocturnal hours, we are in a local context.
IV-C Expected values
Once the set of observed values is fixed, we build a model of expected behavior based on a combination of other data cubes, called comparison data cubes. For the context to be relevant, these must derive from the aggregation of the observed cube on one or more dimensions. In the following, we build three different types of contexts: the basic, aggregative and multi-aggregative contexts.
IV-C1 Basic Contexts
When seeking abnormal cells within a data cube (D_1 × ⋯ × D_k, n), the most elementary context we can consider is the one in which the expected value is a constant, identical for each cell. We call it the basic context. The model of expected behavior is that interactions are uniformly distributed over cells. In this case, the comparison data cube is the apex cuboid and the expected value is the average number of interactions per cell:

e = n(∗, …, ∗) / |D_1 × ⋯ × D_k|.
For instance, in the data cube (V × D × H, n), an abnormal cell (v, d, h) indicates that during hour h of day d, author v has been retweeted an abnormal number of times compared to the average number of times any author is retweeted during any hour.
IV-C2 Aggregative Contexts
To find more subtle and local outliers, expected values must be more specific to each cell. The process is the same as in the basic context, except that the considered comparison cube is not aggregated over all the dimensions of the observed cube: only a subset of dimensions is aggregated. The expected value of a cell is then obtained by redistributing the aggregated value uniformly over the aggregated dimension:

e(d_1, …, d_k) = n(d_1, …, ∗, …, d_k) / |D_i|,

where D_i is the aggregated dimension. Defined as such, the expected value is the value that one should observe if all interactions were homogeneously distributed over D_i. We call these contexts aggregative contexts.

For instance, in the data cube (V × H, n), relative to the comparison data cube (∗ × H, n) aggregated on authors, expected values are

e(v, h) = n(∗, h) / |V|,

and an abnormal cell (v, h) indicates a significant deviation between the number of retweets received by v during hour h and the one that should have been observed if all authors had received the same number of retweets during hour h.
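These aggregative expected values can be sketched as follows, assuming the observed cube maps (author, hour) cells to counts; authors that never appear in the cube are ignored in this simplified illustration:

```python
from collections import Counter

def aggregative_expected(cube):
    """Expected values under the hypothesis that the retweets of each
    hour are spread uniformly over the observed authors."""
    authors = {v for v, _ in cube}
    per_hour = Counter()                   # comparison cube n(*, h)
    for (_, h), n in cube.items():
        per_hour[h] += n
    return {(v, h): per_hour[h] / len(authors) for (v, h) in cube}

# Hypothetical counts: at hour 9, v1 concentrates most of the activity.
cube = {("v1", 9): 8, ("v2", 9): 2, ("v1", 10): 5, ("v2", 10): 5}
expected = aggregative_expected(cube)
print(expected[("v1", 9)])  # → 5.0 (10 retweets at hour 9, 2 authors)
```

Here the observed value 8 for ("v1", 9) exceeds its expectation of 5.0, while both cells of hour 10 match their expectation exactly.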
IV-C3 Multi-aggregative Contexts
Aggregative contexts assume that interactions are homogeneously distributed among the aggregated dimensions. It is possible to create contexts which differentiate the repartition of interactions according to each cell's activity. We call them multi-aggregative contexts. Unlike the other two, they require multiple comparison data cubes. There is no generic formula: the number and types of comparison cubes, as well as expected values, depend on the application.
For instance, if we return to the previous example, we can consider, instead, the following expected values:

e(v, h) = n(∗, h) · n(v, ∗) / n(∗, ∗).

This way, it is expected that the number of retweets during hour h is distributed among authors proportionally to their mean activity. We can also add information on authors' activity during specific hours of the day, and consider the cubes (∗ × D × H, n), (V × ∗ × H, n) and (∗ × ∗ × H, n), such that

e(v, d, h) = n(∗, d, h) · n(v, ∗, h) / n(∗, ∗, h).

In this context, an abnormal cell (v, d, h) indicates a significant deviation between the number of retweets received by v during hour h of day d and the one that should have been observed if v had been retweeted the way it usually is during hour h on other days.
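A sketch of these multi-aggregative expected values, computing the three comparison cubes by summation over the base counts; cell keys are hypothetical (author, day, hour) tuples:

```python
from collections import Counter

def multi_aggregative_expected(cube):
    """Expected counts when each author keeps, on every day, its usual
    share of the activity observed at that hour of the day:
    e(v, d, h) = n(*, d, h) * n(v, *, h) / n(*, *, h)."""
    per_day_hour, per_author_hour, per_hour = Counter(), Counter(), Counter()
    for (v, d, h), n in cube.items():
        per_day_hour[(d, h)] += n       # comparison cube n(*, d, h)
        per_author_hour[(v, h)] += n    # comparison cube n(v, *, h)
        per_hour[h] += n                # comparison cube n(*, *, h)
    return {(v, d, h):
            per_day_hour[(d, h)] * per_author_hour[(v, h)] / per_hour[h]
            for (v, d, h) in cube}

# v1 usually receives 3/4 of the hour-9 activity; day 2 is a busier day.
cube = {("v1", 1, 9): 3, ("v2", 1, 9): 1, ("v1", 2, 9): 9, ("v2", 2, 9): 3}
expected = multi_aggregative_expected(cube)
print(expected[("v1", 2, 9)])  # → 9.0 (12 retweets on day 2 × share 12/16)
```

In this toy example every observed count matches its expectation, since each author keeps a constant share across days; a cell deviating from this pattern would stand out.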
Each of these contexts can either be global or local depending on the chosen set of observed values.
IV-D Deviation values
Finally, for each cell, we measure the deviation between the observed value o and its expected value e. In this paper, we use two different deviation functions: the ratio and the Poisson deviation.
The ratio between an observed value and an expected value is defined such that

r(o, e) = o / e.

Note that this deviation function does not account for the magnitude of the values: two cells with the same ratio but very different volumes are treated identically.
To take into account the significance with which a value deviates, we define another deviation function: the Poisson deviation. Indeed, in the cases in which the feature consists in counting the number of interactions during a given period, as n does, it can be modelled by a Poisson counting process of intensity e [15]. In this case, the Poisson deviation can be defined as follows. If o ≤ e, we calculate the probability of observing a value o or less, knowing that we should have observed e on average. This probability is the cumulative distribution function of a Poisson distribution with parameter e, evaluated at o. Accordingly, we denote it P_e(o). Then, by symmetry, we define δ such that:

δ(o, e) = log P_e(o) if o ≤ e, and δ(o, e) = −log(1 − P_e(o − 1)) if o > e,

where the logarithm is calculated for convenience in order to have a better range of values.
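A possible stdlib-only sketch of this deviation function, using a direct summation of the Poisson probability mass function (for very extreme deviations, a numerically robust implementation such as SciPy's Poisson distribution would be preferable):

```python
import math

def poisson_cdf(k, mu):
    """P(X <= k) for X ~ Poisson(mu), by direct summation of the pmf."""
    return sum(math.exp(-mu) * mu ** i / math.factorial(i)
               for i in range(k + 1))

def poisson_deviation(o, e):
    """Signed log-probability deviation: strongly negative when o is
    abnormally low for a Poisson of mean e, strongly positive when
    abnormally high, and close to 0 for typical values."""
    if o <= e:
        return math.log10(poisson_cdf(o, e))
    return -math.log10(1.0 - poisson_cdf(o - 1, e))

print(round(poisson_deviation(0, 10), 2))  # → -4.34, since P(X<=0) = e^-10
print(poisson_deviation(30, 10) > 4)       # strong positive deviation
```

Unlike the ratio, this deviation grows with the volume of data: observing 30 retweets when 10 are expected is far more significant than observing 3 when 1 is expected, and the Poisson deviation reflects that.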
In both cases, most observed values are expected to be similar to their corresponding expected values. Consequently, the distribution of deviation values is expected to follow a normal distribution in which most values fluctuate around a mean: 1 for the ratio and 0 for the Poisson deviation. Outlying cells, instead, correspond to deviation values significantly distant from the mean¹.

¹We use the classical assumption that a value is anomalous if its distance to the mean exceeds three times the standard deviation [4, 16].
IV-E Examples
Figure 3 illustrates several situations in which we find different abnormal authors during given hours by considering different contexts and a ratio deviation function:
– A first triplet is abnormal in the global basic context: its number of retweets is higher than that of all other triplets.
– A second triplet is abnormal in the global aggregative context: its proportion of retweets is higher than that of all other triplets.
– A third triplet is abnormal in the global multi-aggregative context: the deviation of its author's activity with respect to its usual activity at that hour is higher than for all other triplets.
– A fourth triplet is abnormal in the local aggregative context: its proportion of retweets is higher than that of the other triplets whose author is not an influential one.
As this example shows, and as we will show in practice in the next sections, our approach, which combines data cubes to build different contexts, leads to numerous kinds of outliers and allows us to thoroughly analyse temporal interactions under different perspectives.
V Datasets
In this paper, we choose to study the organization of interactions on Twitter by analysing different sets of politics-related retweets. Indeed, since Twitter is an integral part of the means of communication used by political leaders to disseminate information to the public, finding abnormal entities corresponding to different kinds of unexpected behaviors in this setting is of great interest. To do so, we use two different datasets.
Our first dataset is a set of retweets related to political communication during the 2017 French presidential election. We use a subset of the dataset collected by Gaumont et al. [12] as part of the Politoscope project. It contains politics-related retweets during the month of August 2016. Formally, this dataset consists of the set of retweets (u, v, d, h), such that (u, v, d, h) means that u retweeted v at hour h of day d, where either the corresponding tweet contains politics-related keywords, or one of the users belongs to a set of French political actors listed by the Politoscope project. In this dataset, the set of days D spans the month of August 2016.
Our second dataset is the same as the first, except that it contains an additional dimension. It consists of the set of retweets (u, v, k, d, h), such that (u, v, k, d, h) means that u retweeted a tweet written by v and containing the hashtag k at hour h of day d.
In the following, usernames are only mentioned when they correspond to official Twitter accounts of politicians or public organizations, such as city halls, newspapers, or shows. Otherwise, they are designated by generic terms user-n, where n is an integer used to differentiate anonymous users.
VI Experiments
As a first illustration of our method, we present a case study which, based on events found in the temporal dimension, proposes possible causes of their emergence by exploring other dimensions. First, we apply our method to the first dataset and focus on the three dimensions: spreaders, authors and time. Then, we add the hashtag dimension with the second dataset in order to gain more insight on events.
VI-A Events
We define an event to be a set of consecutive abnormal hours. For convenience, we denote it e_d when all its hours span the same day d.
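Grouping abnormal hours into such events can be sketched as follows; this simplified version does not merge runs of hours that cross midnight:

```python
def group_events(abnormal_hours):
    """Group abnormal (day, hour) pairs into events, i.e. maximal runs
    of consecutive hours within a same day."""
    events, current = [], []
    for d, h in sorted(abnormal_hours):
        if current and (d, h - 1) == current[-1]:
            current.append((d, h))      # extends the current run
        else:
            if current:
                events.append(current)  # close the previous event
            current = [(d, h)]
    if current:
        events.append(current)
    return events

hours = [(5, 21), (5, 22), (12, 9), (12, 11)]
print(group_events(hours))
# → [[(5, 21), (5, 22)], [(12, 9)], [(12, 11)]]  (three events)
```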
Figure 4 shows the evolution of the number of retweets per hour (note that, due to a server failure from a Tuesday to the following Thursday, no activity is observed during this period). We can distinguish three distinct behaviors:
– nocturnal hours, characterized by a low number of retweets;
– daytime hours during the first part of August, characterized by a higher number of retweets;
– daytime hours during the second part of August, characterized by a global increase in the number of retweets.
VI-A1 Basic Context
First of all, we look for events in the basic context. The sets of entities and observed values are provided by the data cube (D × H, n) aggregated on spreaders and authors. Expected values are defined as the average number of retweets per hour:

e = n(∗, ∗) / (|D| · |H|).
Figure 5 (Left) shows the distribution of deviation values obtained with the ratio-based deviation. We find seven abnormal hours, leading to three events.

We see that these hours correspond to the three peaks of activity in Figure 4. Hence, this context does not highlight local anomalies but only global ones, deviating from all observations. It is therefore biased by circadian and weekly rhythms and does not give access to abnormal nocturnal hours, nor to hours located during the first part of the month.
VI-A2 Aggregative Context
To take into account the overall increase in the number of retweets during the month, we need to use an aggregative context in which expected values incorporate the overall activity of the day, provided by the data cube (D × ∗, n):

e(d, h) = n(d, ∗) / |H|.
As such, deviation values are independent of daily variations in the data. This is what we observe in Figure 5 (Center). We find 10 abnormal hours. Among those, six are part of the first period of the month. Nevertheless, extreme values are still biased by circadian rhythms, which prevents us from detecting abnormal nocturnal hours.
VI-A3 Multi-aggregative Context
To address this issue, we use a multi-aggregative context in which we add aggregated information relating to the typical activity per hour of the day, provided by the data cubes (∗ × H, n) and (∗ × ∗, n):

e(d, h) = n(d, ∗) · n(∗, h) / n(∗, ∗).
Moreover, we take the Poisson deviation as a deviation measure to account for the significance of deviations. We find 40 abnormal hours (see Figure 5 (Right)). Among those, several are adjacent, which leads to 17 distinct events (see Table I).
Consider an abnormal hour (d, h). On average, at hour h, we expect to observe a fraction n(∗, h)/n(∗, ∗) of the total number of retweets of a day. Hence, during hour h of day d, we expect to observe n(d, ∗) · n(∗, h)/n(∗, ∗) retweets. However, the number of retweets actually observed deviates from this expected value much more than the deviations observed for most hours. As a consequence, (d, h) is an abnormal hour in this particular multi-aggregative context.
In Table I, we see several hours of generally low activity, such as nocturnal hours. This last result shows that using more sophisticated contexts leads to more subtle outliers.
TABLE I: The 17 events detected in the multi-aggregative context and the corresponding media events.
VI-B Abnormal authors during events
Now, we focus on determining whether an abnormal event is due to specific authors which have been retweeted predominantly, or, on the contrary, results from a more global phenomenon in which we observe an overall increase of the activity.
To do so, we use a local and multi-aggregative context. Observed values are provided by the data cube filtered on the hours of an abnormal event e and aggregated over them. A cell within this cube gives the total number of times author v has been retweeted during event e. This way, we focus on how interactions are organized among authors within each event.
We proceed in a similar way to obtain expected values. Instead of considering the set of authors during event e only, we consider the set of authors during each of the hourly periods corresponding to e on all days. We denote this set of hours H_e. We focus on the data cube aggregated on the partition of the hours induced by H_e. The operations performed to switch from the original cube to this data cube are depicted in Figure 6.
Finally, expected values are defined using the comparison data cubes obtained by aggregation, and by aggregation and filtering, of the previous cube:

e(v, e) = n(∗, e) · n(v, H_e) / n(∗, H_e),

where n(∗, e) is the number of retweets observed during e; n(v, H_e) is the total number of retweets author v received during the hours of H_e; and n(∗, H_e) is the total number of retweets observed during H_e.
According to this context, a pair (v, e) is abnormal when there is a significant deviation between the number of retweets received by v during e and the number of retweets v is expected to receive on average during the corresponding period on other days. In the following, we discuss the three different situations which arise, through specific examples.
1) One main author
Figure 7 (Left) displays the distribution of deviation values for a first event. Most observations follow a Gaussian distribution centred on the mean. We find 14 abnormal values. Among those, the one corresponding to NicolasSarkozy significantly deviates from the others. Indeed, in the considered context, NicolasSarkozy is expected to account for only a small fraction of all retweets observed during the corresponding hourly period; yet, during the event, he was retweeted far more than this expected number, which explains his large deviation value.
Table I lists the events with a similar distribution. In most cases, we observe that the corresponding media event is centred on the main author. For instance, these events often indicate a political meeting of this author.
2) Several main authors
Figure 7 (Center) displays the distribution of deviation values for a second event. Once more, most observations follow a Gaussian distribution centred on the mean. We detect several outliers whose values significantly deviate from the mean, indicating, this time, several main authors.
These events are not due to a single popular author, but to several considerably retweeted authors. In contrast to the previous example, this suggests that they originate from the reaction of a few authors to some external fact in which they have an interest. This is what we observe in Table I: media events related to events with similar distributions are often indicative of situations to which the main authors are not related, but on which they react. For example, one event concerning the intervention of the police in a church, and another concerning burkini wearing, are media events intensely taken up by politicians of the right and far-right wings.
3) No main authors
Figure 7 (Right) displays the distribution of deviation values for a third event