Discovering Emerging Topics in Social Streams via Link Anomaly Detection
Abstract
Detection of emerging topics is now receiving renewed interest, motivated by the rapid growth of social networks. Conventional term-frequency-based approaches may not be appropriate in this context, because the information exchanged is not only text but also images, URLs, and videos. We focus on the social aspects of these networks, that is, the links between users that are generated dynamically, intentionally or unintentionally, through replies, mentions, and retweets. We propose a probability model of the mentioning behaviour of a social network user, and propose to detect the emergence of a new topic from the anomaly measured through the model. We combine the proposed mention anomaly score with a recently proposed change-point detection technique based on the Sequentially Discounting Normalized Maximum Likelihood (SDNML) coding, or with Kleinberg’s burst model. Aggregating anomaly scores from hundreds of users, we show that we can detect emerging topics based only on the reply/mention relationships in social network posts. We demonstrate our technique on a number of real data sets we gathered from Twitter. The experiments show that the proposed mention-anomaly-based approaches can detect new topics at least as early as the conventional term-frequency-based approach, and sometimes much earlier when the keyword is ill-defined.
Keywords: Topic Detection, Anomaly Detection, Social Networks, Sequentially Discounting Normalized Maximum Likelihood Coding, Burst Detection
1 Introduction
Communication through social networks, such as Facebook and Twitter, is becoming increasingly important in our daily life. Since the information exchanged over social networks is not only text but also URLs, images, and videos, they are challenging test beds for the study of data mining.
There is another type of information that is intentionally or unintentionally exchanged over social networks: mentions. By mentions we mean links to other users of the same social network in the form of message-to, reply-to, retweet-of, or explicitly in the text. One post may contain a number of mentions. Some users may include mentions in their posts rarely; other users may be mentioning their friends all the time. Some users (like celebrities) may receive mentions every minute; for others, being mentioned might be a rare occasion. In this sense, mention is like a language with the number of words equal to the number of users in a social network.
We are interested in detecting emerging topics from social network streams based on monitoring the mentioning behaviour of users. Our basic assumption is that a new (emerging) topic is something people feel like discussing, commenting on, or forwarding to their friends. Conventional approaches for topic detection have mainly been concerned with the frequencies of (textual) words [1, 2]. A term-frequency-based approach could suffer from the ambiguity caused by synonyms or homonyms. It may also require complicated preprocessing (e.g., segmentation) depending on the target language. Moreover, it cannot be applied when the contents of the messages are mostly non-textual information. On the other hand, the “words” formed by mentions are unique, require little preprocessing to obtain (the information is often separated from the contents), and are available regardless of the nature of the contents.
Figure 1 shows an example of the emergence of a topic through posts on social networks. The first post by Bob contains mentions to Alice and John, who are both probably friends of Bob’s; so there is nothing unusual here. The second post by John is a reply to Bob but it is also visible to many friends of John’s who are not direct friends of Bob’s. Then in the third post, Dave, one of John’s friends, forwards (called retweet in Twitter) the information further down to his own friends. It is worth mentioning that the topic of this conversation is not clear from the textual information, because they are talking about something (a new gadget, car, or jewelry) that is shown as a link in the text.
In this paper, we propose a probability model that can capture the normal mentioning behaviour of a user, which consists of both the number of mentions per post and the frequency of users occurring in the mentions. Then this model is used to measure the anomaly of future user behaviour. Using the proposed probability model, we can quantitatively measure the novelty or possible impact of a post reflected in the mentioning behaviour of the user. We aggregate the anomaly scores obtained in this way over hundreds of users and apply a recently proposed change-point detection technique based on the Sequentially Discounting Normalized Maximum Likelihood (SDNML) coding [3]. This technique can detect a change in the statistical dependence structure in the time series of aggregated anomaly scores, and pinpoint where the topic emergence is; see Figure 2. The effectiveness of the proposed approach is demonstrated on four data sets we have collected from Twitter. We show that our approach can detect the emergence of a new topic at least as fast as a method using the best keyword, which was not obvious at the time. Furthermore, we show that in two out of four data sets, the proposed link-anomaly-based method can detect the emergence of the topics earlier than keyword-frequency-based methods, which can be explained by the keyword ambiguity we mentioned above.
2 Related work
Detection and tracking of topics have been studied extensively in the area of topic detection and tracking (TDT) [1]. In this context, the main task is to either classify a new document into one of the known topics (tracking) or to detect that it belongs to none of the known categories. Subsequently, the temporal structure of topics has been modeled and analyzed through dynamic model selection [4], temporal text mining [5], and factorial hidden Markov models [6].
Another line of research is concerned with formalizing the notion of “bursts” in a stream of documents. In his seminal paper, Kleinberg modeled bursts using a time-varying Poisson process with a hidden discrete process that controls the firing rate [2]. Recently, He and Parker developed a physics-inspired model of bursts based on the change in the momentum of topics [7].
All the above-mentioned studies make use of the textual content of the documents, but not the social content of the documents. The social content (links) has been utilized in the study of citation networks [8]. However, citation networks are often analyzed in a stationary setting.
The novelty of the current paper lies in focusing on the social content of the documents (posts) and in combining this with a change-point analysis.
3 Proposed Method
The overall flow of the proposed method is shown in Figure 2. We assume that the data arrives from a social network service in a sequential manner through some API. For each new post, we use samples from the corresponding user within a past time interval to train the mention model we propose below. We assign an anomaly score to each post based on the learned probability distribution. The score is then aggregated over users and further fed into a change-point analysis.
3.1 Probability Model
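We characterize a post in a social network stream by the number of mentions $k$ it contains, and the set $V$ of names (IDs) of the users mentioned in the post. Formally, we consider the following joint probability distribution
$$P(k, V \mid \theta, \{\pi_v\}) = P(k \mid \theta) \prod_{v \in V} \pi_v. \tag{1}$$
Here the joint distribution consists of two parts: the probability of the number of mentions $k$ and the probability of each mention given the number of mentions. The probability of the number of mentions $P(k \mid \theta)$ is defined as a geometric distribution with parameter $\theta$ as follows:
$$P(k \mid \theta) = (1 - \theta)^{k}\, \theta. \tag{2}$$
On the other hand, the probability of mentioning users in $V$ is defined as independent, identical multinomial distribution with parameters $\pi_v$ ($\sum_v \pi_v = 1$).
Suppose that we are given $n$ training examples $\mathcal{T} = \{(k_1, V_1), \ldots, (k_n, V_n)\}$, from which we would like to learn the predictive distribution
$$P(k, V \mid \mathcal{T}) = P(k \mid \mathcal{T}) \prod_{v \in V} P(v \mid \mathcal{T}). \tag{3}$$
First we compute the predictive distribution with respect to the number of mentions $P(k \mid \mathcal{T})$. This can be obtained by assuming a beta distribution as a prior and integrating out the parameter $\theta$. The density function of the beta prior distribution is written as follows:
$$p(\theta \mid \alpha, \beta) = \frac{(1 - \theta)^{\beta - 1}\, \theta^{\alpha - 1}}{B(\alpha, \beta)},$$
where $\alpha$ and $\beta$ are parameters of the beta distribution and $B(\alpha, \beta)$ is the beta function. By the Bayes rule, the predictive distribution can be obtained as follows:
$$P(k \mid \mathcal{T}, \alpha, \beta) = \frac{P(k, k_1, \ldots, k_n \mid \alpha, \beta)}{P(k_1, \ldots, k_n \mid \alpha, \beta)} = \frac{\int_0^1 (1 - \theta)^{\sum_{i=1}^{n} k_i + k + \beta - 1}\, \theta^{n + 1 + \alpha - 1}\, d\theta}{\int_0^1 (1 - \theta)^{\sum_{i=1}^{n} k_i + \beta - 1}\, \theta^{n + \alpha - 1}\, d\theta}.$$
Both the integrals in the numerator and denominator can be obtained in closed form as beta functions, and the predictive distribution can be rewritten as follows:
$$P(k \mid \mathcal{T}, \alpha, \beta) = \frac{B\left(n + 1 + \alpha,\ \sum_{i=1}^{n} k_i + k + \beta\right)}{B\left(n + \alpha,\ \sum_{i=1}^{n} k_i + \beta\right)}.$$
Using the relation between the beta function and the gamma function, we can further simplify the expression as follows:
$$P(k \mid \mathcal{T}, \alpha, \beta) = \frac{n + \alpha}{m + k + \beta} \prod_{j=0}^{k} \frac{m + \beta + j}{n + m + \alpha + \beta + j}, \tag{4}$$
where $m = \sum_{i=1}^{n} k_i$ is the total number of mentions in the training set $\mathcal{T}$.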
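As a sanity check on the predictive distribution of the number of mentions, the following sketch evaluates it in two ways: as a ratio of beta functions and in the product form of (4). The function names and the default prior parameters (0.5 each) are assumptions made for illustration, not values taken from the text.

```python
import math

def log_beta(a, b):
    """log B(a, b) via log-gamma, to avoid overflow for large arguments."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def log_pred_num_mentions(k, n, m, alpha=0.5, beta=0.5):
    """log P(k | T) as a ratio of beta functions.

    n: number of training posts, m: total number of mentions in them.
    alpha, beta: beta-prior parameters (illustrative defaults).
    """
    return log_beta(n + 1 + alpha, m + k + beta) - log_beta(n + alpha, m + beta)

def pred_num_mentions_product(k, n, m, alpha=0.5, beta=0.5):
    """The same probability written in the product form of (4)."""
    p = (n + alpha) / (m + k + beta)
    for j in range(k + 1):
        p *= (m + beta + j) / (n + m + alpha + beta + j)
    return p

if __name__ == "__main__":
    # A user with 10 past posts containing 7 mentions in total.
    for k in range(4):
        print(k, pred_num_mentions_product(k, n=10, m=7))
```

Both forms should agree up to floating-point error, and the probabilities over all non-negative counts should sum to one.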
Next, we derive the predictive distribution $P(v \mid \mathcal{T})$ of mentioning user $v$. The maximum likelihood (ML) estimator is given as $P(v \mid \mathcal{T}) = m_v / m$, where $m$ is the number of total mentions and $m_v$ is the number of mentions to user $v$ in the data set $\mathcal{T}$. The ML estimator, however, cannot handle users that did not appear in the training set $\mathcal{T}$; it would assign probability zero to all these users, which would appear infinitely anomalous in our framework. Instead we use the Chinese Restaurant Process (CRP; see [9]) based estimation. The CRP based estimator assigns probability to each user $v$ that is proportional to the number of mentions $m_v$ in the training set $\mathcal{T}$; in addition, it keeps probability proportional to $\gamma$ for mentioning someone who was not mentioned in the training set $\mathcal{T}$. Accordingly the probability of known users is given as follows:
$$P(v \mid \mathcal{T}) = \frac{m_v}{m + \gamma} \quad (\text{for } v \text{ with } m_v \geq 1). \tag{5}$$
On the other hand, the probability of mentioning a new user is given as follows:
$$P(\{v : m_v = 0\} \mid \mathcal{T}) = \frac{\gamma}{m + \gamma}. \tag{6}$$
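The CRP-based estimator can be illustrated with a few lines of code. This is a minimal sketch: the function name and the example user IDs are hypothetical, and the concentration parameter is passed in explicitly because its value is not fixed here.

```python
from collections import Counter

def crp_mention_probs(past_mentions, gamma):
    """CRP-style predictive probabilities for mentionees.

    past_mentions: flat list of user IDs mentioned in the training posts.
    gamma: mass reserved for users never mentioned before (assumed > 0).
    Returns (probabilities of known users, probability of a new user).
    """
    counts = Counter(past_mentions)
    m = sum(counts.values())
    known = {v: c / (m + gamma) for v, c in counts.items()}
    new_user = gamma / (m + gamma)
    return known, new_user

if __name__ == "__main__":
    known, new_user = crp_mention_probs(["alice", "alice", "bob"], gamma=1.0)
    print(known, new_user)
```

Unlike the ML estimator, an unseen user receives a small but non-zero probability, so a post mentioning a stranger is scored as unusual rather than infinitely anomalous.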
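3.2 Computing the link-anomaly score
In order to compute the anomaly score of a new post $x = (t, u, k, V)$ by user $u$ at time $t$ containing $k$ mentions to users $V$, we compute the probability (3) with the training set $\mathcal{T}_u^{(t)}$, which is the collection of posts by user $u$ in the time period $[t - T, t]$ (we use days in this paper). Accordingly the link-anomaly score is defined as follows:
$$s(x) = -\log\left( P(k \mid \mathcal{T}_u^{(t)}) \prod_{v \in V} P(v \mid \mathcal{T}_u^{(t)}) \right) = -\log P(k \mid \mathcal{T}_u^{(t)}) - \sum_{v \in V} \log P(v \mid \mathcal{T}_u^{(t)}). \tag{7}$$
The two terms in the above equation can be computed via the predictive distribution of the number of mentions (4), and the predictive distribution of the mentionee (5)–(6), respectively.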
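Putting the two predictive distributions together, the score of a single post can be sketched as below. The function name, the data layout (each training post as a list of mentioned user IDs), and the default hyperparameters are assumptions for illustration only.

```python
import math
from collections import Counter

def link_anomaly_score(mentioned, training_posts, alpha=0.5, beta=0.5, gamma=0.5):
    """Negative log predictive probability of one post's mentions.

    mentioned:      user IDs mentioned in the new post
    training_posts: past posts by the same user, each a list of mentioned IDs
    alpha, beta, gamma: hyperparameters (illustrative defaults)
    """
    n = len(training_posts)
    counts = Counter(v for post in training_posts for v in post)
    m = sum(counts.values())
    k = len(mentioned)

    # Number-of-mentions term, product form of (4).
    log_p_k = math.log(n + alpha) - math.log(m + k + beta)
    for j in range(k + 1):
        log_p_k += math.log(m + beta + j) - math.log(n + m + alpha + beta + j)

    # Mentionee term, (5) for known users and (6) for unseen ones.
    log_p_v = 0.0
    for v in mentioned:
        mass = counts[v] if counts[v] > 0 else gamma
        log_p_v += math.log(mass / (m + gamma))

    return -(log_p_k + log_p_v)

if __name__ == "__main__":
    history = [["alice"], ["alice", "bob"], [], ["alice"]]
    print(link_anomaly_score(["alice"], history))
    print(link_anomaly_score(["carol", "dave", "erin"], history))
```

A post that mentions several previously unseen users receives a markedly higher score than one that mentions a frequent contact, which is the behaviour the score is meant to capture.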
3.3 Combining Anomaly Scores from Different Users
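The anomaly score in (7) is computed for each user depending on the current post of user $u$ and his/her past behaviour $\mathcal{T}_u^{(t)}$. In order to measure the general trend of user behaviour, we propose to aggregate the anomaly scores obtained for posts $x_1, \ldots, x_n$ using a discretization of window size $\tau > 0$ as follows:
$$s'_j = \frac{1}{\tau} \sum_{t_i \in [\tau (j - 1),\, \tau j]} s(x_i), \tag{8}$$
where $x_i = (t_i, u_i, k_i, V_i)$ is the post at time $t_i$ by user $u_i$ including $k_i$ mentions to users $V_i$.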
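The aggregation step amounts to binning per-post scores by time and dividing by the window size. The sketch below assumes timestamps are non-negative numbers measured from the start of the stream and uses half-open windows so that no post is counted twice; both choices, and the function name, are assumptions.

```python
from collections import defaultdict

def aggregate_scores(times, scores, tau):
    """Aggregate per-post anomaly scores into windows of width tau.

    Returns a list whose entry j is (1 / tau) times the sum of the scores
    of posts with tau * j <= time < tau * (j + 1).
    """
    if not times:
        return []
    sums = defaultdict(float)
    for t, s in zip(times, scores):
        sums[int(t // tau)] += s
    # Windows without posts contribute an aggregate score of zero.
    return [sums[j] / tau for j in range(max(sums) + 1)]

if __name__ == "__main__":
    print(aggregate_scores([0, 10, 59, 60, 130], [1.0, 2.0, 3.0, 4.0, 5.0], tau=60))
```

Because the sum runs over all users, the aggregate rises both when many users post slightly unusual mentions and when a few users post very unusual ones.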
3.4 Change-point detection via Sequentially Discounting Normalized Maximum Likelihood Coding
Given an aggregated measure of anomaly (8), we apply a change-point detection technique based on the SDNML coding [3]. This technique detects a change in the statistical dependence structure of a time series by monitoring the compressibility of each new piece of data. The sequential version of normalized maximum likelihood (NML) coding is employed as a coding criterion. More precisely, a change point is detected through two layers of scoring processes (see also [10, 11]); in each layer, the SDNML code length based on an autoregressive (AR) model is used as a criterion for scoring. Although the NML code length is known to be optimal [12], it is often hard to compute. The SNML proposed in [13] is an approximation to the NML code length that can be computed in a sequential manner. The SDNML proposed in [3] further employs discounting in the learning of the AR models.
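Algorithmically, the change-point detection procedure can be outlined as follows. For convenience, we denote the aggregate anomaly score as $x_j$ instead of $s'_j$.
 1. 1st layer learning

Let $x^{j-1} := \{x_1, \ldots, x_{j-1}\}$ be the collection of aggregate anomaly scores from discrete time $1$ to $j - 1$. Sequentially learn the SDNML density function $p_{\mathrm{SDNML}}(x_j \mid x^{j-1})$ ($j = 1, 2, \ldots$); see Appendix A for details.
 2. 1st layer scoring

Compute the intermediate change-point score by smoothing the log loss of the SDNML density function with window size $\kappa$ as follows:
$$y_j = \frac{1}{\kappa} \sum_{j' = j - \kappa + 1}^{j} \left( -\log p_{\mathrm{SDNML}}(x_{j'} \mid x^{j'-1}) \right).$$
 3. 2nd layer learning

Let $y^{j-1} := \{y_1, \ldots, y_{j-1}\}$ be the collection of smoothed change-point scores obtained as above. Sequentially learn the second layer SDNML density function $p_{\mathrm{SDNML}}(y_j \mid y^{j-1})$ ($j = 1, 2, \ldots$); see Appendix A for details.
 4. 2nd layer scoring

Compute the final change-point score by smoothing the log loss of the SDNML density function as follows:
$$\mathrm{Score}(y_j) = \frac{1}{\kappa} \sum_{j' = j - \kappa + 1}^{j} \left( -\log p_{\mathrm{SDNML}}(y_{j'} \mid y^{j'-1}) \right). \tag{9}$$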
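The two-layer structure (learn, score, smooth, then repeat on the smoothed scores) can be sketched independently of the particular density estimator. The example below deliberately swaps in a much simpler sequential model, a Gaussian with exponentially discounted mean and variance, in place of the SDNML density; it illustrates only the layering and smoothing, and all names and default values are assumptions.

```python
import math

def discounted_gaussian_log_loss(xs, r=0.05, min_var=1e-6):
    """Stand-in for the SDNML log loss: score each sample under a Gaussian
    whose mean and variance are updated with discounting rate r."""
    mean, var = xs[0], 1.0
    losses = []
    for x in xs:
        # Score the sample before it is used for learning.
        losses.append(0.5 * math.log(2 * math.pi * var) + (x - mean) ** 2 / (2 * var))
        mean = (1 - r) * mean + r * x
        var = max((1 - r) * var + r * (x - mean) ** 2, min_var)
    return losses

def smooth(values, kappa):
    """Moving average over the last kappa values (shorter at the start)."""
    out = []
    for j in range(len(values)):
        window = values[max(0, j - kappa + 1): j + 1]
        out.append(sum(window) / len(window))
    return out

def two_layer_change_scores(xs, kappa=5, r=0.05):
    """First layer: smoothed log loss of xs. Second layer: the same applied
    to the first-layer scores."""
    first = smooth(discounted_gaussian_log_loss(xs, r), kappa)
    return smooth(discounted_gaussian_log_loss(first, r), kappa)

if __name__ == "__main__":
    series = [0.1 * math.sin(i) for i in range(60)] + [5 + 0.1 * math.sin(i) for i in range(60)]
    scores = two_layer_change_scores(series)
    print(max(range(len(scores)), key=scores.__getitem__))
```

The second layer reacts to changes in the level of the first-layer loss rather than to individual outliers, which is why the procedure is described as detecting change points instead of isolated anomalies.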
3.5 Dynamic Threshold Optimization (DTO)
We raise an alarm if the change-point score exceeds a threshold, which is determined adaptively using the method of dynamic threshold optimization (DTO) proposed in [14].
In DTO, we use a one-dimensional histogram to represent the score distribution. We learn it in a sequential and discounting way. Then, for a specified value $\rho$, we determine the threshold to be the largest score value such that the tail probability beyond the value does not exceed $\rho$. We call $\rho$ a threshold parameter.
The details of DTO are summarized as follows: Let be a given positive integer. Let be a 1 dimensional histogram with bins where is an index of bins, with a smaller index indicating a bin having a smaller score. For given such that , bins in the histogram are set as: and . Let be a histogram updated after seeing the th score. The procedures of updating the histogram and DTO are given in Algorithm 1.
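The idea of a sequentially discounted histogram with a tail-probability threshold can be sketched as follows. This is only one plausible reading, not the procedure of [14]: the bin layout (one open bin below `a`, one open bin above `b`, equal-width bins in between), the additive smoothing, the update rule, and all parameter names are assumptions.

```python
import math

def dto_alarms(scores, a, b, n_bins=20, rho=0.05, r_h=0.005, lam_h=0.01):
    """Return one boolean per score: True where the score reaches the
    threshold derived from the histogram learned on earlier scores."""
    width = (b - a) / (n_bins - 2)

    def bin_index(s):
        if s < a:
            return 0
        if s >= b:
            return n_bins - 1
        return 1 + min(int((s - a) / width), n_bins - 3)

    q1 = [1.0 / n_bins] * n_bins          # unsmoothed, discounted histogram
    alarms = []
    for s in scores:
        total = sum(q1) + n_bins * lam_h
        q = [(v + lam_h) / total for v in q1]   # smoothed histogram
        # First bin at which the cumulative mass reaches 1 - rho.
        cum, idx = 0.0, n_bins - 1
        for h, mass in enumerate(q):
            cum += mass
            if cum >= 1.0 - rho:
                idx = h
                break
        # Threshold is the upper edge of that bin (infinite for the last bin).
        threshold = math.inf if idx == n_bins - 1 else a + width * idx
        alarms.append(s >= threshold)
        # Discount old counts and add the new score.
        q1 = [(1.0 - r_h) * v for v in q1]
        q1[bin_index(s)] += r_h
    return alarms

if __name__ == "__main__":
    flags = dto_alarms([0.1] * 200 + [0.9], a=0.0, b=1.0, r_h=0.05, lam_h=0.001)
    print(flags[-2], flags[-1])
```

Because the histogram is discounted, the threshold follows the recent score distribution, so a level that was alarming early in a stream may stop raising alarms once it becomes the norm.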
4 Experiments
4.1 Experimental setup
We collected four data sets from Twitter. Each data set is associated with a list of posts in a service called Togetter.
We compared our proposed approach with a keyword-based change-point detection method. In the keyword-based method, we looked at a sequence of occurrence frequencies (observed within one minute) of a keyword related to the topic; the keyword was manually selected to best capture the topic. Then we applied DTO described in Section 3.5 to the sequence of keyword frequency. In our experience, the sparsity of the keyword frequency seems to be a bad combination with the SDNML method; therefore we did not use SDNML in the keyword-based method. We use the smoothing parameter , and the order of the AR model 30 in the experiments; the parameters in DTO were set as , , , .
Furthermore, we have implemented a two-state version of Kleinberg’s burst detection model [2] using the link-anomaly score (8) and keyword frequency (as in the keyword-based change-point analysis) to select relevant posts. For the link-anomaly score, we used a threshold to select the posts to include in the burst analysis. For the keyword frequency, we used all posts that include the keyword for the burst analysis. We used the firing rate parameter of the Poisson point process (1/s) for the non-burst state and (1/s) for the burst state, and the transition probability . We consider the transition from the non-burst state to the burst state as an “alarm”.
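A two-state burst model of this kind can be decoded with the Viterbi algorithm over the gaps between consecutive relevant posts. The sketch below is a simplified hidden-Markov-style variant written for illustration: gaps are modelled as exponentially distributed with a state-dependent rate, the sequence is assumed to start in the non-burst state, and the function name and example rates are hypothetical.

```python
import math

def two_state_bursts(gaps, rate_low, rate_high, p_switch):
    """Most likely state sequence (0 = non-burst, 1 = burst) for the given
    inter-arrival gaps, with exponential gap densities in each state."""
    def emit(rate, x):
        # Negative log density of an exponential distribution.
        return rate * x - math.log(rate)

    stay, switch = -math.log(1.0 - p_switch), -math.log(p_switch)
    rates = (rate_low, rate_high)
    # Starting in the burst state costs one switch.
    cost = [emit(rate_low, gaps[0]), emit(rate_high, gaps[0]) + switch]
    back = []
    for x in gaps[1:]:
        new_cost, ptr = [], []
        for s in (0, 1):
            cands = [cost[prev] + (stay if prev == s else switch) for prev in (0, 1)]
            best = 0 if cands[0] <= cands[1] else 1
            new_cost.append(cands[best] + emit(rates[s], x))
            ptr.append(best)
        back.append(ptr)
        cost = new_cost
    # Trace the cheapest path backwards.
    state = 0 if cost[0] <= cost[1] else 1
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path

def alarm_indices(path):
    """Positions where the path moves from non-burst to burst."""
    return [i for i in range(1, len(path)) if path[i - 1] == 0 and path[i] == 1]

if __name__ == "__main__":
    gaps = [100.0] * 10 + [1.0] * 20 + [100.0] * 10
    path = two_state_bursts(gaps, rate_low=0.01, rate_high=1.0, p_switch=0.3)
    print(alarm_indices(path))
```

The switching probability acts as a penalty that keeps a single short gap from triggering an alarm; only a sustained run of short gaps makes the burst state cheaper overall.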
A drawback of the keyword-based methods (dynamic thresholding and burst detection) is that the keyword related to the topic must be known in advance, although this is not always the case in practice. The change point detected by the keyword-based methods can be thought of as the time when the topic really emerges. Hence our goal is to detect emerging topics as early as the keyword-based methods.
Table 1: Number of participants in each data set.

| Data set | # of participants |
|---|---|
| “Job hunting” | 200 |
| “Youtube” | 160 |
| “NASA” | 90 |
| “BBC” | 47 |
4.2 “Job hunting” data set
This data set is related to a controversial post by a famous person in Japan, stating that “the reason students are having difficulty finding jobs is that they are stupid”, and various replies to that post.
The keyword used in the keyword-based methods was “Job hunting.” Figures 3a and 3b show the results of the proposed link-anomaly-based change detection and burst detection, respectively. Figures 3c and 3d show the results of the keyword-frequency-based change detection and burst detection, respectively.
The first alarm time of the proposed link-anomaly-based change-point analysis was 22:55, whereas that for the keyword-frequency-based counterpart was 22:57; see also Table 2. The earliest detection was achieved by the keyword-frequency-based burst detection method. Nevertheless, from Figure 3, we can observe that the proposed link-anomaly-based methods were able to detect the emerging topic almost as early as the keyword-frequency-based methods.
Table 2: Number of detections and first detection time for each method and data set.

| Method | | “Job hunting” | “Youtube” | “NASA” | “BBC” |
|---|---|---|---|---|---|
| Link-anomaly-based change-point detection | # of detections | 4 | 4 | 14 | 3 |
| | 1st detection time | 22:55, Jan 08 | 08:44, Nov 05 | 20:11, Dec 02 | 19:52, Jan 21 |
| Keyword-frequency-based change-point detection | # of detections | 1 | 1 | 1 | 1 |
| | 1st detection time | 22:57, Jan 08 | 00:30, Nov 05 | 04:10, Dec 03 | 22:41, Jan 21 |
| Link-anomaly-based burst detection | # of detections | 1 | 9 | 25 | 2 |
| | 1st detection time | 23:07, Jan 08 | 00:07, Nov 05 | 00:44, Nov 30 | 20:51, Jan 21 |
| Keyword-frequency-based burst detection | # of detections | 6 | 15 | 11 | 1 |
| | 1st detection time | 22:50, Jan 08 | 23:59, Nov 04 | 08:34, Dec 03 | 22:32, Jan 21 |
4.3 “Youtube” data set
This data set is related to the recent leak of a confidential video by a Japan Coast Guard officer.
The keyword used in the keyword-based methods is “Senkaku.” Figures a and b show the results of link-anomaly-based change detection and burst detection, respectively. Figures c and d show the results of keyword-frequency-based change detection and burst detection, respectively.
The first alarm time of the proposed link-anomaly-based change-point analysis was 08:44, whereas that for the keyword-based counterpart was 00:30; see also Table 2. Although the aggregated anomaly score (8) in Figure a is elevated around midnight, Nov 05, it seems that SDNML fails to detect this elevation as a change point. In fact, the link-anomaly-based burst detection (Figure b) raised an alarm at 00:07, which is earlier than the keyword-frequency-based change-point analysis and closer to the keyword-frequency-based burst detection at 23:59, Nov 04.
4.4 “NASA” data set
This data set is related to the discussion among Twitter users interested in astronomy that preceded NASA’s press conference about the discovery of an arsenic-eating organism.
The keyword used in the keyword-based models is “arsenic.” Figures a and b show the results of link-anomaly-based change detection and burst detection, respectively. Figures c and d show the same results for the keyword-frequency-based methods.
The first alarm times of the two link-anomaly-based methods were 20:11, Dec 02 (change-point detection) and 00:44, Nov 30 (burst detection), respectively. Both of these are earlier than NASA’s official press conference (04:00, Dec 03) and earlier than the keyword-frequency-based methods (change-point detection at 04:10, Dec 03 and burst detection at 08:34, Dec 03); see Table 2.
4.5 “BBC” data set
This data set is related to angry reactions among Japanese Twitter users against a BBC comedy show that asked “who is the unluckiest person in the world” (the answer is a Japanese man who was hit by nuclear bombs in both Hiroshima and Nagasaki but survived).
The keyword used in the keyword-based models is “British” (or “Britain”). Figures a and b show the results of link-anomaly-based change detection and burst detection, respectively. Figures c and d show the same results for the keyword-frequency-based methods.
The first alarm times of the two link-anomaly-based methods were 19:52 (change-point detection) and 20:51 (burst detection), both of which are earlier than the keyword-frequency-based counterparts at 22:41 (change-point detection) and 22:32 (burst detection). See Table 2.
4.6 Discussion
Within the four data sets we have analyzed above, the proposed link-anomaly-based methods compared favorably against the keyword-frequency-based methods on the “NASA” and “BBC” data sets. On the other hand, the keyword-frequency-based methods detected the topics earlier on the “Job hunting” and “Youtube” data sets.
The above observation is natural, because for the “Job hunting” and “Youtube” data sets, the keywords seem to have been unambiguously defined from the beginning of the emergence of the topics, whereas for the “NASA” and “BBC” data sets, the keywords are more ambiguous. In particular, in the case of the “NASA” data set, people had been mentioning the “arsenic”-eating organism earlier than NASA’s official release but only rarely (see Figure d). Thus, the keyword-frequency-based methods could not detect the keyword as an emerging topic, although the keyword “arsenic” appeared earlier than the official release. For the “BBC” data set, the proposed link-anomaly-based burst model detects two bursty areas (Figure b). Interestingly, the link-anomaly-based change-point analysis only finds the first area (Figure a), whereas the keyword-frequency-based methods only find the second area (Figures c and d). This is probably because there was an initial stage where people reacted individually using different words and later there was another stage in which the keywords were more unified.
In our approach, the alarm was raised if the change-point score exceeded a dynamically optimized threshold based on the significance level parameter $\rho$. Table 3 shows results for a number of threshold parameter values. We see that as $\rho$ increased, the number of false alarms also increased. Meanwhile, even when $\rho$ was small, our approach was still able to detect the emerging topics as early as the keyword-based methods. We set $\rho = 0.05$ as a default parameter value in our experiment. Although there are several alarms for the “NASA” data set, most of them are more or less related to the emerging topic.
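Table 3: Number of alarms for different values of the threshold parameter.

| Threshold parameter | “Job hunting” | “Youtube” | “NASA” | “BBC” |
|---|---|---|---|---|
| 0.01 | 4 | 2 | 9 | 3 |
| 0.05 | 4 | 4 | 14 | 3 |
| 0.1 | 8 | 6 | 30 | 3 |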
Notice again that in the keyword-based methods the keyword related to the topic must be known in advance, which is not always the case in practice. Further note that our approach only uses links (mentions), hence it can be applied to the case where topics are concerned with information other than texts, such as images, video, sounds, etc.
5 Conclusion
In this paper, we have proposed a new approach to detect the emergence of topics in a social network stream. The basic idea of our approach is to focus on the social aspect of the posts reflected in the mentioning behaviour of users instead of the textual contents. We have proposed a probability model that captures both the number of mentions per post and the frequency of mentionees. We have combined the proposed mention model with the SDNML change-point detection algorithm [3] and Kleinberg’s burst detection model [2] to pinpoint the emergence of a topic.
We have applied the proposed approach to four real data sets we have collected from Twitter. The four data sets included a widespread discussion about a controversial topic (“Job hunting” data set), a quick propagation of news about a video leaked on Youtube (“Youtube” data set), a rumor about the upcoming press conference by NASA (“NASA” data set), and an angry response to a foreign TV show (“BBC” data set). In all the data sets our proposed approach showed promising performance. In most data sets, the detection by the proposed approach was as early as that by term-frequency-based approaches, which relied on a keyword that best describes the topic and that we chose manually in hindsight. Furthermore, for the “NASA” and “BBC” data sets, in which the keyword that defines the topic is more ambiguous than in the first two data sets, the proposed link-anomaly-based approaches detected the emergence of the topics much earlier than the keyword-based approaches.
All the analysis presented in this paper was conducted offline, but the framework itself can be applied online. We are planning to scale up the proposed approach to handle social streams in real time. It would also be interesting to combine the proposed link-anomaly model with content-based topic detection approaches to further boost the performance and reduce false alarms.
Acknowledgments
This work was partially supported by MEXT KAKENHI 23240019, 22700138, Aihara Project, the FIRST program from JSPS, initiated by CSTP, Hakuhodo Corporation, NTT Corporation, and Microsoft Corporation (CORE Project).
Appendix A Sequentially discounting normalized maximum likelihood coding
This section describes the sequentially discounting normalized maximum likelihood (SDNML) coding that we use for change-point detection in Section 3.4. The basic idea behind SDNML-based change detection is as follows: when the data arrives in a sequential manner, we can consider that a change has occurred if a new piece of data cannot be compressed using the statistical nature of the past. The original papers [10, 11] used the predictive stochastic complexity as a measure of compressibility, whereas Urabe et al. [3] proposed to employ a tighter coding scheme based on the SDNML.
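Suppose that we observe a discrete time series $x_t$ ($t = 1, 2, \ldots$); we denote the data sequence by $x^t := x_1 \cdots x_t$. Consider the parametric class of conditional probability densities $\mathcal{F} = \{ p(x_t \mid x^{t-1} : \theta) : \theta \in \mathbb{R}^d \}$, where $\theta$ is the $d$-dimensional parameter vector and we assume $x^0$ to be an empty set. We denote the maximum likelihood (ML) estimator given the data sequence $x^t$ by $\hat{\theta}(x^t)$; i.e., $\hat{\theta}(x^t) := \arg\max_{\theta} \prod_{j=1}^{t} p(x_j \mid x^{j-1} : \theta)$. The sequential normalized maximum likelihood (SNML) model is a coding distribution (see e.g., [15]) that is known to be optimal in the sense of the conditional minimax [16] problem:
$$\min_{q(\cdot \mid x^{t-1})} \max_{x_t} \left( -\log q(x_t \mid x^{t-1}) + \log p\left(x^t : \hat{\theta}(x^t)\right) \right), \tag{10}$$
where $p(x^t : \hat{\theta}(x^t)) := \prod_{j=1}^{t} p(x_j \mid x^{j-1} : \hat{\theta}(x^t))$ is the joint density over $x^t$ induced by the conditional densities from $\mathcal{F}$. The minimization is taken over all conditional density functions and tries to minimize the regret (10) over any possible outcome of the new sample $x_t$.
The SNML distribution is obtained as the optimal conditional density of the minimax problem (10) as follows [16]:
$$p_{\mathrm{SNML}}(x_t \mid x^{t-1}) := \frac{p\left(x^t : \hat{\theta}(x^t)\right)}{K_t(x^{t-1})}, \tag{11}$$
where the normalization constant $K_t(x^{t-1}) := \int p(x^t : \hat{\theta}(x^t))\, dx_t$ is necessary because the new sample $x_t$ is used in the estimation of the parameter vector $\hat{\theta}(x^t)$ and the numerator in (11) is not a proper density function. We call the quantity $-\log p_{\mathrm{SNML}}(x_t \mid x^{t-1})$ the SNML codelength. It is known from [16, 13] that the cumulative SNML codelength, which is the sum of the SNML codelengths over the sequence, is optimal in the sense that it asymptotically achieves the shortest codelength.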
The sequentially discounting normalized maximum likelihood (SDNML) is obtained by applying the above SNML to the class of autoregressive (AR) models and replacing the ML estimation in (11) with a discounted ML estimation, which makes the SDNML-based change-point detection algorithm more flexible than an SNML-based one. Let $\bar{x}_t := (x_{t-1}, x_{t-2}, \ldots, x_{t-p})^\top \in \mathbb{R}^p$ for each $t$. We define the $p$th order AR model as follows:
$$p(x_t \mid x^{t-1} : \theta) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp\left( -\frac{\left(x_t - A^\top \bar{x}_t\right)^2}{2 \sigma^2} \right),$$
where $\theta = (A, \sigma^2)$ with $A \in \mathbb{R}^p$ is the parameter vector.
In order to compute the SDNML density function we need the discounted ML estimators of the parameters in $\theta$. We define the discounted ML estimator of the regression coefficient $\hat{A}_t$ as follows:
$$\hat{A}_t = \arg\min_{A \in \mathbb{R}^p} \sum_{j = t_0 + 1}^{t} w_{t-j} \left( x_j - A^\top \bar{x}_j \right)^2, \tag{12}$$
where $w_{j'} = r (1 - r)^{j'}$ is a sequence of sample weights with the discounting coefficient $r$ ($0 < r < 1$); $t_0$ is the smallest number of samples such that the minimizer (12) is unique. Note that the error terms from older samples receive geometrically decreasing weights in (12). The larger the discounting coefficient $r$ is, the smaller the weights of the older samples become; thus we have a stronger discounting effect. Moreover, we obtain the discounted ML estimator of the variance as follows:
where we define and . Clearly when the discounted estimator of the AR coefficient is available, can be computed in a sequential manner.
In the sequel, we first describe how to efficiently compute the AR estimator . Finally we derive the SDNML density function using the discounted ML estimators .
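The AR coefficient $\hat{A}_t$ can simply be computed by solving the least-squares problem (12). It can, however, be obtained more efficiently using the iterative formula described in [16, 13]. Here we repeat the formula for the discounted version presented in [3]. First define the sufficient statistics $V_t \in \mathbb{R}^{p \times p}$ and $\chi_t \in \mathbb{R}^p$ as follows, where $\bar{x}_j$ is the vector of the $p$ samples preceding $x_j$ and $w_{t-j}$ are the sample weights in (12):
$$V_t := \sum_{j = t_0 + 1}^{t} w_{t-j}\, \bar{x}_j \bar{x}_j^\top, \qquad \chi_t := \sum_{j = t_0 + 1}^{t} w_{t-j}\, \bar{x}_j x_j.$$
Using the sufficient statistics, the discounted AR coefficient from (12) can be written as follows:
$$\hat{A}_t = V_t^{-1} \chi_t.$$
Note that $\chi_t$ can be computed in a sequential manner. The inverse matrix $V_t^{-1}$ can also be computed sequentially using the Sherman-Morrison-Woodbury formula as follows:
$$V_t^{-1} = \frac{1}{1 - r} V_{t-1}^{-1} - \frac{r}{1 - r} \cdot \frac{V_{t-1}^{-1} \bar{x}_t \bar{x}_t^\top V_{t-1}^{-1}}{1 - r + c_t},$$
where $c_t = r\, \bar{x}_t^\top V_{t-1}^{-1} \bar{x}_t$.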
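The sequential update of the inverse matrix is the standard rank-one (Sherman-Morrison) step used in recursive least squares with forgetting. The sketch below applies it to estimate AR coefficients with discounting; it assumes the recursions $V_t = (1 - r) V_{t-1} + r\, \bar{x}_t \bar{x}_t^\top$ and $\chi_t = (1 - r) \chi_{t-1} + r\, \bar{x}_t x_t$, and starts from a small ridge matrix so that the inverse exists from the first step. The function name, the ridge initialization, and the test signal are assumptions.

```python
import math

def discounted_ar_fit(xs, order=2, r=0.05, ridge=1e-3):
    """Discounted least-squares AR coefficients via rank-one inverse updates."""
    p = order
    # Inverse of the initial matrix ridge * I.
    v_inv = [[(1.0 / ridge if i == j else 0.0) for j in range(p)] for i in range(p)]
    chi = [0.0] * p
    coef = [0.0] * p
    for t in range(p, len(xs)):
        u = [xs[t - 1 - i] for i in range(p)]      # lagged vector (x_{t-1}, ..., x_{t-p})
        vu = [sum(v_inv[i][j] * u[j] for j in range(p)) for i in range(p)]
        c = r * sum(u[i] * vu[i] for i in range(p))
        scale = r / (1.0 - r + c)
        # Rank-one update of the inverse; v_inv stays symmetric.
        v_inv = [[(v_inv[i][j] - scale * vu[i] * vu[j]) / (1.0 - r) for j in range(p)]
                 for i in range(p)]
        chi = [(1.0 - r) * chi[i] + r * u[i] * xs[t] for i in range(p)]
        coef = [sum(v_inv[i][j] * chi[j] for j in range(p)) for i in range(p)]
    return coef

if __name__ == "__main__":
    # A sinusoid satisfies x_t = 2 cos(w) x_{t-1} - x_{t-2} exactly.
    w = 0.3
    signal = [math.sin(w * t) for t in range(400)]
    print(discounted_ar_fit(signal), [2 * math.cos(w), -1.0])
```

Each update costs on the order of $p^2$ operations instead of the $p^3$ needed to re-solve the least-squares problem, which matters when the score is recomputed for every new window.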
Finally the SDNML density function is written as follows:
where the normalization factor is calculated as follows:
with .
Footnotes
 http://togetter.com/
References
 [1] J. Allan, J. Carbonell, G. Doddington, J. Yamron, Y. Yang et al., “Topic detection and tracking pilot study: Final report,” in Proceedings of the DARPA broadcast news transcription and understanding workshop, 1998.
 [2] J. Kleinberg, “Bursty and hierarchical structure in streams,” Data Min. Knowl. Disc., vol. 7, no. 4, pp. 373–397, 2003.
 [3] Y. Urabe, K. Yamanishi, R. Tomioka, and H. Iwai, “Real-time change-point detection using sequentially discounting normalized maximum likelihood coding,” in Proceedings of the 15th PAKDD, 2011.
 [4] S. Morinaga and K. Yamanishi, “Tracking dynamics of topic trends using a finite mixture model,” in Proceedings of the 10th ACM SIGKDD, 2004, pp. 811–816.
 [5] Q. Mei and C. Zhai, “Discovering evolutionary theme patterns from text: an exploration of temporal text mining,” in Proceedings of the 11th ACM SIGKDD, 2005, pp. 198–207.
 [6] A. Krause, J. Leskovec, and C. Guestrin, “Data association for topic intensity tracking,” in Proceedings of the 23rd ICML, 2006, pp. 497–504.
 [7] D. He and D. S. Parker, “Topic dynamics: an alternative model of bursts in streams of topics,” in Proceedings of the 16th ACM SIGKDD, 2010, pp. 443–452.
 [8] H. Small, “Visualizing science by citation mapping,” Journal of the American Society for Information Science, vol. 50, no. 9, pp. 799–813, 1999.
 [9] D. Aldous, “Exchangeability and related topics,” in École d’Été de Probabilités de Saint-Flour XIII—1983. Springer, 1985, pp. 1–198.
 [10] K. Yamanishi and J. Takeuchi, “A unifying framework for detecting outliers and change points from non-stationary time series data,” in Proceedings of the 8th ACM SIGKDD, 2002.
 [11] J. Takeuchi and K. Yamanishi, “A unifying framework for detecting outliers and change points from time series,” IEEE T. Knowl. Data En., vol. 18, no. 4, pp. 482–492, 2006.
 [12] J. Rissanen, “Strong optimality of the normalized ML models as universal codes and information in data,” IEEE T. Inform. Theory, vol. 47, no. 5, pp. 1712–1717, 2002.
 [13] J. Rissanen, T. Roos, and P. Myllymäki, “Model selection by sequentially normalized least squares,” Journal of Multivariate Analysis, vol. 101, no. 4, pp. 839–849, 2010.
 [14] K. Yamanishi and Y. Maruyama, “Dynamic syslog mining for network failure monitoring,” in Proceedings of the 11th ACM SIGKDD, p. 499, 2005.
 [15] T. M. Cover and J. A. Thomas, Elements of Information Theory. New York: Wiley & Sons, 1991, 2nd edition, 2006.
 [16] T. Roos and J. Rissanen, “On sequentially normalized maximum likelihood models,” in Workshop on information theoretic methods in science and engineering, 2008.