# To Post or Not to Post: Using Online Trends to Predict Popularity of Offline Content

###### Abstract.

Predicting the popularity of online content has attracted much attention in the past few years. In news rooms, for instance, journalists and editors are keen to know, as soon as possible, the articles that will bring the most traffic into their website. The relevant literature includes a number of approaches and algorithms to perform this forecasting. Most of the proposed methods require monitoring the popularity of content during some time after it is posted, before making any longer-term prediction. In this paper, we propose a new approach for predicting the popularity of news articles before they go online. Our approach complements existing content-based methods, and is based on a number of observations regarding article similarity and topicality. First, the popularity of a new article is correlated with the popularity of similar articles of recent publication. Second, the popularity of the new article is related to the recent historical popularity of its main topic. Based on these observations, we use time series forecasting to predict the number of visits an article will receive. Our experiments, conducted on a real data collection of articles in an international news website, demonstrate the effectiveness and efficiency of the proposed method.

## 1. Introduction

Monitoring the performance of news articles is a core task within any news media organization. The highly crowded news market, and the fast growth of online news platforms and applications in recent years, have pushed editors into a fierce competition for the attention of news readers.
Social media are changing the way people consume news (kwak:2010; pew2013trends), but they still constitute a small portion of the overall online news traffic. For instance, Andrew Miller, Guardian News and Media CEO, said that social media all combined add up to around 10% of their newspaper’s traffic.[^1]

[^1]: https://blog.twitter.com/2013/guardian-says-twitter-surpassing-other-social-media-for-breaking-news-traffic
Currently, editors focus on popularity in terms of number of visits and visitors to news websites as the most important performance metric for news articles online.

Measuring popularity, however, is not sufficient. The ability to anticipate online news popularity enables editorial teams to take tactical and strategic decisions to maximize the impact of their online content, such as promoting or demoting articles in their web pages, changing the wording of headers, allocating editorial resources to follow-up stories or features, designing promotional campaigns, etc. Given the high velocity of news, editors and journalists need to have popularity forecasts for news articles as early as possible after publishing the article—and ideally, even before that.

The research community has addressed the problem of predicting the popularity of news articles in several recent papers including (Lerman:2010; Tsagkias:2010; Bandari:2012; Tatar:2012; Castillo:2014). Most of the proposed techniques rely on early measurements of visits and visitors to news websites, and are based on the auto-correlation of the time series that describe ebbs and flows in news popularity.

For example, a common method introduced by Szabo:2010 is based on the observation that in some websites, there is a strong linear relationship between log-transformed early popularity and log-transformed long-term popularity, with remarkably high correlations. This result makes it possible to forecast the future popularity of an article based on its early observed popularity. Generalizations of this method have emerged since, including (li2013popularity; rowe2011forecasting) and others.

Naturally, the quality of these forecasts is lower the earlier the predictions are made, both because there is less data available, and because the time span between prediction time and target time is longer. Moreover, predictions made before articles go online are desirable, as these predictions allow editorial teams to take news management decisions without having to wait for early popularity measurements. Approaches that can dispense with early popularity measurements have been explored through the development of predictive models that use features such as the words in the title of the article, e.g. (yu_2011_predicting; lakkaraju_2013_reddit). Our approach is complementary to such content-based methods, and provides a novel extension where topic popularity forecasts are used to improve news article popularity predictions.

Our contribution. We introduce a new method for early prediction of popularity of news articles that combines article topicality and article similarity. We show that the popularity of a topic (the total number of visits received by all articles on that topic) depends on the popularity of related topics, and describe how to use this dependency to improve topic popularity predictions. Next, we show that the popularity of an article depends on the popularity of recent articles similar to it, and on the popularity of its primary topic, which we can predict with a high level of accuracy. Finally, we propose an extension of the emerging approach, where topic popularity forecasts are used to improve news article popularity predictions. We explore two forecasting algorithms that exploit these observations, and test them on a large collection of news articles published by an international news organization over 18 months in 2013 and 2014. The ensuing results yield a mean absolute percentage error (MAPE) as low as 11%, demonstrating the efficacy of the approach in predicting news article popularity.

The paper is organized as follows. First, we provide an overview of related work. Then, we provide a detailed description of the data used in our study and discuss some of their characteristics (Section 3). Next, we present two predictive models of topic popularity (Section 4), and proceed with a discussion of article popularity prediction (Section 5). We conclude by summarizing the novelty and impact of this research and its future extensions.

## 2. Related Work

The increasing use of predictive models of online content popularity in the news industry has promoted the growth of the already significant interest in predictive models of online user behavior in the research community. For ease of exposition, we limit our review to research that is closely related to the study presented in this paper.

Methods Based on Early Measurements. The success of the auto-correlation approach pioneered by Szabo:2010; szabo2012predicting has encouraged many researchers to use early popularity measurements as predictors of future popularity. Predictive models of online popularity based on auto-correlation have been used by: Jamali:2009 with reference to votes in Digg; lee_2010_popularity for comments to articles; lerman_2010_news for visits to articles; kim_2011_temperature for visits to blog posts; tatar_2011_predicting for comments on articles; ruan_2012_prediction for number of Twitter messages—“tweets”; Pinto:2013 for views in YouTube, and Ahmed:2013 for views in YouTube and Vimeo, and for votes in Digg. Many of these works use content metadata, such as publication date, and in some cases information about the users who post this content (e.g. (Jamali:2009; ruan_2012_prediction)). Closer to the topic of this paper, the number of postings received by an article in social media (e.g. Twitter or Facebook) has been shown to be useful to predict visits to the article (Castillo:2014; Hsieh:2013).

Our approach differs from these auto-correlation approaches in two main regards. First, early popularity measurements are not needed to provide reliable popularity predictions, although they can be incorporated in the algorithm. Second, we introduce the use of cross-correlations among topics as an important factor to improve the accuracy of predictions for topic and article popularity.

Topic-Based Methods. Bandari:2012 used information about the category of a news article (e.g. sports, politics, technology) together with information about the communication source, language subjectivity, and named entities present in the article to predict the popularity of news articles in social media, prior to their publication. Scores for the communication source and category were computed as the average number of tweets per article for each news source and each category. The named entity score was computed in the same way, except that only the highest-scoring named entity was selected among those appearing in each article (other variations were also tested). The prediction was done using linear regression. tatar_2011_predicting; Tatar:2012 predicted the number of comments to articles on a large news website. The prediction was based on linear regression using early data measurements. Articles in this website are separated into categories (world, sports, economy, etc.). Interestingly, a per-category model showed no improvements over a generic model that was oblivious to the category of an article.

We exploit the insight emerging from these methods that the popularity of a topic in the distant past may not be the best predictor of future success, and provide a methodology for establishing the ideal time window.

Methods Based on Keywords. Some predictive methods use a selection of keywords present in an article or headline as features for the popularity prediction model (tsagkias2009predicting; lakkaraju_2011_attention; Berger:2012; lakkaraju_2013_reddit). The intuition behind this approach is that some keywords may be important for stylistic reasons (e.g. words such as “shocking” or “dramatic” may attract more clicks), or because they refer to prominent people or powerful countries, which are important news values (galtung_1965_foreign). For instance, the authors of (tsagkias2009predicting) studied the prediction of comments on news articles, using metadata about the articles (e.g. publication date), the number of articles posted at the same time, the number of similar articles posted at the same time in other sources, and named entities mentioned in the article. Others (Berger:2012) looked at articles that make it into the “most emailed” list of a large online newspaper, The New York Times. Their focus was on two aspects of the articles’ sentiment: polarity (“valence”) and emotionality (“arousal”), obtained through automated sentiment analysis. In (lakkaraju_2013_reddit), the authors measured the popularity (positive minus negative votes) of image re-posts for different communities in a popular content-sharing site, Reddit, with prediction quality varying considerably across communities. Finally, the authors of (lakkaraju_2011_attention) focused on Facebook data to predict the number of comments a post will get; support vector regression (SVR) was used to create predictive models whose outputs correlated well with the observed values. Our approach differs from, and is complementary to, the approaches reviewed in this section, in that it relies on article similarity and topic popularity rather than on keyword features.

Social Cascade Predictions. The prediction of information cascades in social networks has been an extremely active topic in recent years, particularly at the macroscopic level (i.e. how many nodes will be activated by a cascade), e.g. (Cheng:2014; li2013popularity; myers_2012_external; huang_2012_predicting) and many others. However, a setting in which social influence occurs may bring a high degree of unpredictability. Salganik:2006 claim that under social influence the popularity of an item is not an aggregate of individual preferences and therefore cannot be predicted even with perfect information: “there are inherent limits on the predictability of outcomes, irrespective of how much skill or information one has” (Salganik:2006). These methods offer useful insights, but are not directly relevant to the problem studied in this paper.

## 3. Dataset

In this section, we discuss the data used in our study. We describe how we generated the dataset from the source data (Section 3.1), provide some insights on the intrinsic features that characterize the popularity of articles within our collection (Section 3.2) and measure their effective life-span (Section 3.3).

### 3.1. Dataset Generation

We use data provided by (omitted for double-blind review), a large international news network operating multiple television channels and websites. We harvested articles from the English version of this website, which has millions of visits per month. The data covers a time span from September 2012 through April 2014. Our collection comprises two types of articles: News and Opinion. The first category refers to breaking news, reporting events and issues happening in different locations around the world. The second category refers to opinions and features contributed by named writers to present their opinion or analysis of a topic of public interest. The collection consists of a sample of 8,065 News articles and 4,357 Opinion articles. Each article includes: title, content, and publication date.

For each article, we also retrieved a time series of the number of visits the article gets after its publication. These time series are captured by a large-scale real-time process that records per-session user activity on a minute-by-minute basis.

### 3.2. Distribution of Visits

The overall time series of visits for the two sets of articles is shown in Figure 1. The time series for News is more variable than that for Opinion articles. This difference reflects the more ephemeral nature of breaking news as compared to Opinion pieces, and is corroborated by the shorter shelf-life of breaking news, shown in Figure 4(b).

The average number of visits for each article is on the order of a few thousand, but some articles have hundreds of thousands of visits, and others only a few hundred.[^2] Figure 2 shows the complementary cumulative distribution function (CCDF) of the number of visits articles receive in the first 30 days after publication. The popularity distribution is heavy-tailed, which is in agreement with observations in e.g. (Tsagkias:2010; Lerman:2010; Castillo:2014).

[^2]: Due to our legal agreement with the data provider, including the business-competitive nature of this data, we are not allowed to provide exact figures that can be used to estimate the total traffic to the website.

### 3.3. Shelf-Life of Articles: an Elusive Concept

Readers’ interest in news articles decreases sharply as time passes (as observed e.g. in (Tsagkias:2010; tatar_2011_predicting; Castillo:2014; Tatar:2012)). For example, 48% of the visits for an average News article in our dataset, over a 30-day period, occur within the first three days, as shown in Figure 3.

To measure the shelf-life of articles, we follow (dezso_2006_dynamics; Castillo:2014; bitly_2011_halflife; bitly_2012_halflife) and compute the time required for an article to reach a certain percentage of its visits. Specifically, we use the notion of shelf-life at 90% (Castillo:2014), which is the time an article requires to accumulate 90% of the visits it will receive in its lifetime. Figure 4(a) depicts the shelf-life at 90% for News (4.1 days on average) and Opinion (7.7 days on average). We observe that visits are more concentrated around the publication date in News articles as compared to Opinion articles, where visits are more spread-out in time. This is probably due to the fact that Opinion articles are usually discussed longer and are not posted in reaction to immediate events as News articles tend to be.

In our 18-month dataset, articles posted online continued receiving visits long after their date of publication. New visitors may be directed to the article page as the result of a search engine query, through hyperlinks in more recent articles, or by consulting one of several thematic indexes on news websites.[^3]
This makes defining an absolute shelf-life difficult: it depends on the time horizon used to compute it, as Figure 4(b) shows.
In general, we observe a monotonically decreasing trend in the ratio between the shelf-life at 90% and the time horizon used to compute it. While the shelf-life at 90% accounts for no less than 39% of the time horizon for News articles and 57% for Opinion articles when the horizon equals 7 days, these proportions decrease to 10% and 25% respectively when the horizon is extended to 60 days.

[^3]: This is in contrast with other measurements such as those for Twitter postings. People rarely tweet “old” articles on Twitter, so one can define the “longevity” of a news item as simply the time between the first and last tweet referring to the article (Hsieh:2013).

There are interesting differences between News and Opinion articles, such as the longer shelf-life of Opinion articles, that may have an impact on the prediction of article popularity. In the remainder of the paper, we treat the two kinds of articles as a single content class. We can easily obtain separate results for the two types of articles, since the prediction method is the same, and we plan to do so in an extended version of this paper.

**Table 1.** Summary of notation.

| Symbol | Description |
| --- | --- |
| $Z$ | Set of topics |
| $z$ | A topic |
| $K$ | Number of topics |
| $A$ | Set of articles |
| $a$ | An article |
| $n$ | Number of articles |
| $\mathrm{sim}(a, a')$ | Similarity of articles $a$ and $a'$ |
| $p(z \mid a)$ | Relevance score of article $a$ to topic $z$ |
| $z(a)$ | The most relevant topic for $a$: $\arg\max_{z} p(z \mid a)$ |
| $\mathrm{pub}(a)$ | Publication date of article $a$ |
| $\delta$ | Time lag expressed in days |
| $L$ | Set of time lags $\{1, 2, \dots, \ell\}$ |
| $N_{\delta,\tau}(a)$ | Set of articles $a'$ with $\mathrm{sim}(a, a') \ge \tau$, published on date $\mathrm{pub}(a) - \delta$ |
| $v_t(a)$ | For $t \ge \mathrm{pub}(a)$, cumulative number of visits received by article $a$ on days $\mathrm{pub}(a), \dots, t$ |
| $V_t(z)$ | Total number of visits to topic $z$ received on day $t$ |

## 4. Predicting Topic Volume

The first task we describe is the prediction of the total volume of visits to a topic $z$, i.e. the sum of the visits of all articles that have $z$ as their main topic. We apply Latent Dirichlet Allocation (LDA) as the topic modeling method (Section 4.1), determine the optimal number of topics using supervised classification (Section 4.2), describe the forecasting methods we use for topic volume prediction (Section 4.3), and discuss their application to our dataset and the ensuing results (Section 4.4).

### 4.1. Modeling Method for Topics: LDA

We use the Latent Dirichlet Allocation (LDA) algorithm to uncover the topics in our collection of articles (Blei:2003). LDA is a probabilistic generative method that uses a Bayesian network to discover a set of latent topics $Z$ from a set of documents. To prepare our articles for LDA, we first concatenate the title of each article with its body, then remove stop words, and stem the remaining words using the stemmer implementation by Paice and Husk, also known as the Lancaster stemming algorithm (Paice:1990).

LDA outputs the probability that an article $a$ is about a topic $z$, which we denote as $p(z \mid a)$. This and other notation used throughout this paper are summarized in Table 1.
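As an illustration, the pipeline above (preprocessing followed by LDA topic probabilities) can be sketched with scikit-learn. This is not the authors' exact implementation: the corpus is a toy example, and a trivial suffix-stripping function stands in for the Lancaster stemmer.

```python
# Sketch of the preprocessing + LDA step. `crude_stem` is a placeholder for
# the Lancaster stemmer; articles, stopword list, and K=2 are toy assumptions.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

def crude_stem(token):
    # Placeholder for the Paice/Husk (Lancaster) stemmer used in the paper.
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess(title, body, stopwords):
    # Concatenate title and body, lowercase, drop stop words, stem the rest.
    text = (title + " " + body).lower()
    tokens = [t for t in text.split() if t.isalpha() and t not in stopwords]
    return " ".join(crude_stem(t) for t in tokens)

articles = [
    ("Election results", "voters cast ballots in the national election"),
    ("Football final", "the team won the championship match on sunday"),
    ("Election debate", "candidates debated policies before the election"),
    ("Transfer news", "the club signed a striker before the match"),
]
stop = {"the", "in", "on", "a", "before"}
docs = [preprocess(t, b, stop) for t, b in articles]

vec = CountVectorizer()
X = vec.fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic = lda.fit_transform(X)    # row a holds p(z|a) over the K topics
primary = doc_topic.argmax(axis=1)  # z(a): most relevant topic per article
```

Each row of `doc_topic` is a probability distribution over topics, from which the primary topic $z(a)$ is obtained by taking the arg-max.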

### 4.2. Determining the Number of Topics

Like many methods used for identifying latent topics in documents, including non-negative matrix factorization (NMF) (Lee:2000) and Probabilistic Latent Semantic Analysis (pLSA) (Hofmann:1999), LDA assumes the number of topics is known in advance. However, determining the optimal number of topics remains an open research question (Arun:2010). This choice is critical for our application, because topic volume prediction is sensitive to the number of topics selected (see below). Empirically, if we use a small number of topics, LDA returns broad topics such as politics, sports and armed conflicts. But if we request a large number of topics, LDA creates specialized topics around specific stories or journalistic beats, such as the US elections, the Egyptian elections, the Syrian conflict, and politics in Latin America.

We use supervised classification to find the “appropriate” number of topics $K$. The intuition is that the topics should yield a partition of the documents in the dataset that can be accurately recognized by a classifier trained on classes of documents, each class corresponding to one of the selected topics. First, we run LDA with different numbers of topics $K$. Let $Z_K$ be the topic set produced by LDA for each value of $K$. For each set of topics, we label every article $a$ with its primary topic $z(a)$ such that $z(a) = \arg\max_{z \in Z_K} p(z \mid a)$.

Then, we select 80% of the entire collection of labeled articles for training and the remaining 20% for testing. We train a Multinomial Naive Bayes classifier (MNB) (McCallum:1998) using the training data, and evaluate the classifier on the test data. The feature space is defined by the scores of the stems within each article. The classification quality achieved by the MNB is measured in terms of precision, recall, and $F_1$ score (the harmonic mean of precision and recall).

Results for a varying number of topics are reported in Figure 5. While the precision of the classifier is nearly constant across the larger values of $K$, the recall and $F_1$ scores are maximized at $K = 20$. Based on these experiments, we select $K = 20$ as the number of topics used to forecast topic volume for our dataset.

We note that the number of topics that yields the best classification model for a dataset is sensitive to the number and timespan of the articles in the dataset. In general, we have observed that as datasets get smaller so does the number of topics needed to yield the best possible classification model for the dataset. For example, 10 topics yield a better classification model than 20 topics do for a collection of about 1/3 of the articles contained in our dataset.
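The model-selection loop of this section can be sketched as follows; the corpus, the candidate values of $K$, and the hyperparameters are toy assumptions rather than the paper's settings.

```python
# Sketch: for each candidate K, fit LDA, label each article with its primary
# topic, train Multinomial Naive Bayes on those labels, and keep the K with
# the best macro-F1 on held-out articles. Data below is synthetic.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
vocab = [["vote", "ballot", "election"], ["goal", "match", "league"]]
# 60 synthetic articles drawn from two disjoint vocabularies.
docs = [" ".join(rng.choice(vocab[i % 2], size=12)) for i in range(60)]

X = CountVectorizer().fit_transform(docs)
scores = {}
for K in (2, 5, 10):
    doc_topic = LatentDirichletAllocation(
        n_components=K, random_state=0).fit_transform(X)
    labels = doc_topic.argmax(axis=1)   # primary topic z(a) per article
    if len(set(labels)) < 2:            # degenerate labeling, skip this K
        continue
    Xtr, Xte, ytr, yte = train_test_split(
        X, labels, test_size=0.2, random_state=0)
    pred = MultinomialNB().fit(Xtr, ytr).predict(Xte)
    scores[K] = f1_score(yte, pred, average="macro")

best_K = max(scores, key=scores.get)
```

The paper's version of this loop uses the real article collection and an 80/20 split; the structure (label by arg-max topic, classify, compare $F_1$ across $K$) is the same.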

### 4.3. Topic Popularity Prediction Methods

We use a machine learning approach to forecasting, where each training sample is a pair $(x, y)$: $x$ is an input vector of features for the time series to be learned, and $y$ is its associated value. The aim of the machine learning algorithm is to find a function $f$ that, for each $x$ in the training dataset, approximates its value $y$ as closely as possible. The resulting function is then used to predict values $n$ steps ahead of the time series data used for training. We compare results from two algorithms, one based on linear regression (LR) and the other on support vector regression (SVR).

Linear Regression (LR). In a linear regression approach to forecasting (rowe2011forecasting; li2013popularity), $V_t(z)$, the total number of visits to articles in topic $z$ at time $t$, is given by

$$V_t(z) = \beta_0 + \sum_{\delta \in L} \beta_\delta\, V_{t-\delta}(z) + \epsilon \tag{1}$$

where $\beta_0$ and $\beta_\delta$ are coefficients of the linear regression, $\epsilon$ is a residual term, and $L$ is the set of time lags $\{1, 2, \dots, \ell\}$. In a more general version, we assume that the volume of visits of a topic $z_i$ depends not only on that topic (due to auto-correlation) but on all the other topics (due to cross-correlations):

$$V_t(z_i) = \beta_0 + \sum_{j=1}^{K} \sum_{\delta \in L} \beta_{j,\delta}\, V_{t-\delta}(z_j) + \epsilon \tag{2}$$
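A minimal sketch of the cross-correlated model described above, assuming synthetic topic-volume data: one linear regression per topic, with the lagged volumes of all topics as input features.

```python
# Sketch of the cross-correlated lagged regression. `volumes` is a synthetic
# (days x K) matrix of topic-volume series; days, K, and lags are assumptions.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
days, K, lags = 120, 4, (1, 2, 3)
volumes = np.abs(rng.normal(1000, 100, size=(days, K))).cumsum(axis=0) / 50

max_lag = max(lags)
# Design matrix: for each day t, the volumes of all K topics at t-1, t-2, t-3.
X = np.hstack([volumes[max_lag - l : days - l] for l in lags])
models, preds = [], []
for z in range(K):
    y = volumes[max_lag:, z]            # today's volume of topic z
    m = LinearRegression().fit(X, y)    # one regression per target topic
    models.append(m)
    preds.append(m.predict(X[-1:])[0])  # one-step-ahead forecast for topic z
```

Each target topic gets its own regression, but all of them share the same lagged features of every topic, which is what lets cross-correlations contribute to the forecast.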

Support Vector Regression (SVR). Within a Support Vector Regression (SVR) approach to time series forecasting (muller1997predicting), the prediction function is given by the formula:

$$f(x_t) = \langle w, x_t \rangle + b \tag{3}$$

where $w$ is the weight vector, i.e. a linear combination of the training patterns that support the regression function, $x_t$ is the vector containing the input features available at time $t$ (that is, all $V_{t-\delta}(z_j)$ for $\delta \in L$ and $j = 1, \dots, K$), and $b$ is the bias term, computed as an average over the support vectors that lie on the margins set by the loss function (see below).

The objective of SVR is to learn the weight vector $w$ of smallest possible norm, so as to avoid over-fitting. To ease the regression task, deviations of up to $\varepsilon$ from the target values are allowed with no penalty, while larger deviations are penalized linearly through the slack variables $\xi_i$ and $\xi_i^*$. The weight vector is obtained by minimizing the loss function

$$\frac{1}{2} \lVert w \rVert^2 + C \sum_{i=1}^{m} \left( \xi_i + \xi_i^* \right)$$

subject to the constraints:

$$y_i - \langle w, x_i \rangle - b \le \varepsilon + \xi_i, \qquad \langle w, x_i \rangle + b - y_i \le \varepsilon + \xi_i^*, \qquad \xi_i, \xi_i^* \ge 0.$$

The solution is given by the equation

$$w = \sum_{i=1}^{m} \left( \alpha_i - \alpha_i^* \right) x_i,$$

where $\alpha_i$ and $\alpha_i^*$ are Lagrange multipliers; see (smola2004tutorial) for details.

Feature selection. There are numerous input variables $V_{t-\delta}(z_j)$, a total of $K \times \ell$, which can be relatively large compared to the number of observations. This may lead to over-fitting, so a topic selection method could in principle lead to better results; indeed, we show in the next section that this is the case. We apply a feature selection step in which we select, for each topic $z$, the topics most correlated with it among the set of $K$ topics. Concretely, instead of using as input variables all $V_{t-\delta}(z_j)$ with $j = 1, \dots, K$, we select only the topics that have the largest cross-correlation with topic $z$ (in practice, this includes the topic $z$ itself).
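The feature selection step can be sketched as follows; `top_correlated_topics` is a hypothetical helper name, and the data standing in for real topic volumes is synthetic.

```python
# Sketch of cross-correlation feature selection: for each topic z, keep the
# k_sel topics whose volume series have the largest Pearson correlation with
# z's series (z itself always qualifies, since its self-correlation is 1).
import numpy as np

def top_correlated_topics(volumes, k_sel):
    """volumes: (days x K) matrix; returns, per topic, the indices of the
    k_sel most correlated topics (by Pearson r)."""
    corr = np.corrcoef(volumes.T)  # (K x K) correlation matrix
    return [np.argsort(-corr[z])[:k_sel].tolist()
            for z in range(corr.shape[0])]

rng = np.random.default_rng(2)
base = rng.normal(size=(200, 1))
# Three strongly correlated series plus two independent noise series.
volumes = np.hstack([base + 0.1 * rng.normal(size=(200, 1))
                     for _ in range(3)]
                    + [rng.normal(size=(200, 2))])
selected = top_correlated_topics(volumes, k_sel=2)
```

The retained topic indices then determine which lagged columns enter the regression, shrinking the feature count from $K \times \ell$ to $k_{sel} \times \ell$ per target topic.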

### 4.4. Topic Volume Prediction Results

We use Pearson’s correlation ($r$) to measure auto-correlations and cross-correlations between topics, and the Mean Absolute Percentage Error (MAPE) to evaluate forecasting results. MAPE is one of the most common measures of forecast error (armstrong1992error). It expresses the error of the forecasted time series as a percentage:

$$\mathrm{MAPE} = \frac{100\%}{n} \sum_{t=1}^{n} \left| \frac{V_t(z) - \hat{V}_t(z)}{V_t(z)} \right| \tag{4}$$

where $V_t(z)$ and $\hat{V}_t(z)$ are respectively the observed and forecasted values for topic $z$ at time $t$. When there is a perfect fit, MAPE is 0%. There is no upper bound on the lack of fit.
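The MAPE definition translates directly into code; the example values below are illustrative.

```python
# Direct implementation of MAPE: mean absolute percentage error between an
# observed series and a forecast, returned as a percentage.
import numpy as np

def mape(observed, forecast):
    observed = np.asarray(observed, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    return 100.0 * np.mean(np.abs((observed - forecast) / observed))

# A perfect forecast gives 0%; a uniform 10% miss gives 10%.
perfect = mape([100, 200], [100, 200])   # -> 0.0
off_ten = mape([100, 200], [110, 180])   # -> 10.0
```

Note that MAPE is undefined when an observed value is zero, so in practice days with zero topic volume must be excluded or smoothed before evaluation.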

Topic Volume Auto-Correlation. We first verify that the topics we determine are not only coherent in terms of content (as shown in the previous section), but also exhibit auto-correlation in the time series of topic volume. This auto-correlation means that, for instance, a topic that was popular yesterday (or $\delta$ days ago) is likely to be popular today. Specifically, we compute the correlation of each time series of total topic volume $V_t(z)$ with a $\delta$-shifted version of itself, $V_{t-\delta}(z)$. We varied $\delta$ from 1 day to 7 days. The average auto-correlation across topics is shown in Table 2.

**Table 2.** Average auto-correlation of topic volume at time lag $\delta$.

| $\delta$ (days) | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Avg. auto-correlation | 0.70 | 0.52 | 0.43 | 0.36 | 0.32 | 0.30 | 0.27 |

Unsurprisingly, Table 2 shows that topics are strongly auto-correlated at small time lags. For instance, a correlation of 0.70 is observed between popularity scores taken one day apart ($\delta = 1$). This means that if a topic was highly popular yesterday, it is very likely to be popular today. The auto-correlation decreases as the time lag increases.

Impact of Feature Selection and LR vs SVR. We next run an experiment to test the feature selection method and to compare LR and SVR. We present the results using features up to a time lag of $\ell = 3$ days (results with time lags of 2, 4, and 5 days are basically equivalent). Given that we have $K = 20$ topics, this yields a total of 60 variables. When applying feature selection, we select for each topic the top 4 topics whose volumes are most correlated to it (in terms of $r$), yielding a total of 12 variables.

We train on a sliding window of 50 days (we show the impact of the time window size next), meaning that predictions for articles posted on day $t$ are done with a model trained on data from the 50 days preceding $t$. To evaluate each method, we predict the topic volume for every topic at 2, 3, 7, 15, and 30 steps (days) ahead. We report the achieved MAPE scores averaged across topics, comparing the prediction error obtained using all 60 features, shown in Figure 6(a), with the prediction error using the subset of 12 features, shown in Figure 6(b).

We make the following observations from Figure 6. First, as expected, the more steps ahead we try to forecast, the larger the error. Second, the SVR method yields better MAPE results, particularly when no feature selection is applied. Third, and most importantly, feature selection dramatically reduces the MAPE.
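The sliding-window evaluation protocol described above can be sketched as follows; the series is synthetic, and the window, lag, and SVR hyperparameters are assumptions rather than the paper's exact settings.

```python
# Sketch of sliding-window forecasting with SVR: to predict day t, we train
# only on the `window` days preceding t, using lagged values as features.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(3)
days, lags, window = 200, (1, 2, 3), 50
# Synthetic "topic volume" series: a weekly-ish cycle plus noise.
series = 1000 + 100 * np.sin(np.arange(days) / 7) + rng.normal(0, 10, days)

def lagged_features(series, t, lags):
    # Feature vector for day t: the values at t-1, t-2, t-3.
    return [series[t - l] for l in lags]

errors = []
for t in range(window + max(lags), days):
    X_train = [lagged_features(series, s, lags)
               for s in range(t - window, t)]
    y_train = series[t - window : t]
    model = SVR(kernel="rbf", C=100.0).fit(X_train, y_train)
    pred = model.predict([lagged_features(series, t, lags)])[0]
    errors.append(abs(series[t] - pred) / series[t])

mape_pct = 100 * np.mean(errors)
```

Retraining inside the loop is what makes the window "slide": older observations fall out of the training set as the prediction day advances, which matters when the underlying popularity dynamics drift.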

Determining the size of the training window. We now address the selection of an appropriate size for the training time-window. In general, the size of the training set impacts the results of any machine learning algorithm; this is particularly true for time series forecasting. A larger training window means more data is used for training, but if the underlying model changes over time, then incorporating training data that is too old may actually be counterproductive. The number of time lags $\ell$ to use is another important parameter: a larger $\ell$ means more variables are used for the prediction, which may lead to over-fitting.

We train our prediction models with training windows of different sizes and different time lag values, varying both the size of the sliding training window and the lag $\ell$, expressed in days. As before, we apply feature selection, keeping the 4 topics most correlated to each topic.

Figure 7 reports the average MAPE scores computed for different values of time lags and sizes of the training set. Each reported MAPE value is the average of the scores achieved when predicting different steps ahead (2, 3, 7, 15, and 30). Linear regression (LR) results are shown in Figure 7(a). A high variation of MAPE scores is observed for small training sets, before the scores stabilize for training sets of 50 days or more. Support vector regression (SVR), shown in Figure 7(b), exhibits a different behavior. First, it achieves much lower MAPE scores than LR for all the training set sizes we consider. Second, with SVR the ideal size of the training window is 7 days; thereafter, adding more observations increases the error rate. Finally, adding more lags (larger $\ell$) also increases the error rate.

To summarize, the best topic prediction model we find is SVR with feature selection, a training window of 7 days, and a small number of time lags.

## 5. Article Predictions

We now address the problem of predicting the number of visits to an article. We predict the number of cumulative visits to an article $a$ during its first $w$ days after publication, which we denote as $V(a)$; the horizon $w$ is set conservatively with respect to the effective shelf-life measured in Section 3.3.

Our objective is to assess to what extent topicality and article similarity can help predict the number of visits an article will receive. We start by computing the popularity of a news article as a function of the popularity that similar articles have attained in the last few days (Section 5.1). Then, we present a method that complements this approach with information about topic popularity in the last few days (Section 5.2). Next, we integrate topic popularity predictions into the overall forecasting model to provide (plausible) knowledge about popularity in the future (Section 5.3). Finally, we complement our prediction with early traffic observations to improve over both methods (Section 5.4).

### 5.1. Prediction Based on Article Similarity Using Nearest Neighbors (NN)

We hypothesize that similar articles posted within a relatively small time window receive a similar number of visits. The rationale behind this hypothesis is that people who visited an article about a developing story yesterday (or a few days ago), are likely to visit similar articles published today or at a later day. Sets of follow-up articles can be understood as playing the role of ephemeral pseudo-topics.

We measure article similarity by representing each article as a vector over the concatenation of its content and title. The similarity between each pair of articles $a$ and $a'$ is measured using the cosine similarity $\mathrm{sim}(a, a')$.

To predict article visits, we use these similarities as input to a nearest-neighbors estimation method (NN). This method consists of estimating the value of a function at a given point as an aggregate of the values of that function at a set of points near it (atkeson_1997_locally; navot_2006_nn). We use a variant of the kNN method applied to popularity prediction by li2013popularity, where the number of views of an item is the weighted sum of the number of views of similar items in the past few days.

Given an article posted on day , and a similarity threshold , we define as the set of articles published on day whose similarity with is greater than or equal to :

(5) |

We next define a function which gives the weighted average of the number of visits to articles in (for ) up to date :

(6) |

where is the cumulative number of visits received by article from its publication up to and including the publication date of , . Finally, our estimator is based on linear regression:

$$\hat{V}(a) = \beta_0 + \sum_{\delta \in \Delta} \beta_\delta\, v_\delta(a) + \epsilon \qquad (7)$$

where $\Delta$, as before, is the set of time lags under consideration, $\beta_0$ and $\beta_\delta$ are the linear regression coefficients, and $\epsilon$ is the residual term.
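The aggregation and regression steps above can be sketched as follows; function and variable names are illustrative, and ordinary least squares stands in for whatever fitting procedure is used in practice.

```python
import numpy as np

def neighbor_aggregate(similarities, visits):
    """Equation (6): similarities and visits are parallel sequences over the
    neighbor set N_delta(a); returns the weighted average v_delta(a)."""
    s = np.asarray(similarities, dtype=float)
    v = np.asarray(visits, dtype=float)
    return float(np.dot(s, v) / s.sum()) if s.sum() > 0 else 0.0

def fit_visits_regression(v_lagged, observed_visits):
    """Equation (7): v_lagged is an (articles x lags) matrix of v_delta(a)
    values; returns (beta_0, betas) fit by ordinary least squares."""
    X = np.column_stack([np.ones(len(observed_visits)), v_lagged])
    coef, *_ = np.linalg.lstsq(X, observed_visits, rcond=None)
    return float(coef[0]), coef[1:]
```

Prediction for a new article then amounts to computing its $v_\delta(a)$ values for each lag and applying the fitted coefficients.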

Results are shown in Figure 8(a). The model is trained on 80% of the data, and tested on the remaining 20%. We vary the maximum lag $\delta_{\max}$ (i.e., $\Delta = \{1, \ldots, \delta_{\max}\}$) from 1 to 7 days, and set $\tau$ to values in $\{0.05, 0.1, 0.2, 0.3\}$. We observe that adding more days does not significantly improve the results. Values of $\tau$ close to 0.1 and 0.2 yield in general better results than 0.05 (which may cover too many articles distantly related to the one for which the prediction is being made) or 0.3 (which may be too strict a criterion and include too few neighbors). We experimented with SVR and found the results to be no better than those obtained with linear regression (LR); in the remainder we report only the results with LR, which is a simpler model.

### 5.2. Prediction Based on Topic Volume (NN+T)

Let us now consider a predictor of visits to article $a$ based on the topic volume of its main topic $z_a$. This predictor is simply:

$$\hat{V}(a) = \beta_0 + \sum_{\delta \in \Delta} \beta_\delta\, f(z_a, t - \delta) + \epsilon \qquad (8)$$

where $f(z, t)$ is the number of visits to topic $z$ at time $t$. The result is the dashed line in Figure 8(a). We observe that its MAPE is 1.33 percentage points lower than the one obtained with the method based on NN.

Given that this method is complementary to the one using nearest neighbors, we can combine them using:

$$\hat{V}(a) = \beta_0 + \sum_{\delta \in \Delta} \left( \beta_\delta\, v_\delta(a) + \gamma_\delta\, f(z_a, t - \delta) \right) + \epsilon \qquad (9)$$

where $v_\delta(a)$ is the aggregate of visits to nearest neighbors defined in Equation 6.
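The combined model amounts to concatenating the two groups of lagged features into one design matrix before fitting a single regression. A minimal sketch, with illustrative names:

```python
import numpy as np

def combined_features(v_lagged, topic_lagged):
    """Stack the (articles x lags) neighbor aggregates v_delta(a) and the
    (articles x lags) topic volumes f(z_a, t - delta) side by side."""
    return np.hstack([np.asarray(v_lagged, dtype=float),
                      np.asarray(topic_lagged, dtype=float)])

def fit_combined(v_lagged, topic_lagged, visits):
    """Fit Equation (9) by ordinary least squares; returns the coefficient
    vector [beta_0, beta_1..beta_L, gamma_1..gamma_L]."""
    X = np.column_stack([np.ones(len(visits)),
                         combined_features(v_lagged, topic_lagged)])
    coef, *_ = np.linalg.lstsq(X, visits, rcond=None)
    return coef
```

Keeping both feature groups in one regression lets the fit decide, per lag, how much weight each signal deserves.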

Results are shown in Figure 8(b). We observe that the combined method is better than the method based only on topic volume for most parameter settings, and that in general the MAPE for $\delta_{\max} = 2$ or $\delta_{\max} = 3$ is lower than for other values of $\delta_{\max}$.

### 5.3. Adding Predicted Topic Volume (NN+T+PT)

We further improve the results by creating an ensemble forecast that operates in two steps. First, we predict the future popularity $\hat{f}(z_a, t)$ of $a$'s topic $z_a$ at time $t$, using the best estimator from Section 4.4. Next, we incorporate this prediction as an additional input variable for the regression:

$$\hat{V}(a) = \beta_0 + \sum_{\delta \in \Delta} \left( \beta_\delta\, v_\delta(a) + \gamma_\delta\, f(z_a, t - \delta) \right) + \eta\, \hat{f}(z_a, t) + \epsilon \qquad (10)$$

Results are shown in Figure 9. We observe a small but consistent improvement when incorporating this variable into our best predictor so far. Again, the best results are observed using $\delta_{\max} = 2$ or $\delta_{\max} = 3$.
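The two-step procedure can be sketched as follows. Here `predict_topic_volume` is a naive last-value stand-in for the paper's Section 4.4 estimator, and all names are illustrative assumptions:

```python
import numpy as np

def predict_topic_volume(history):
    """Placeholder topic forecaster: predict the last observed volume.
    (The paper uses its best time-series estimator here instead.)"""
    return float(history[-1])

def build_design_matrix(v_lagged, topic_lagged, topic_histories):
    """Step 1: forecast each article's topic volume for the publication day.
    Step 2: append that forecast to the NN+T features of Equation (9)."""
    predicted = [[predict_topic_volume(h)] for h in topic_histories]
    return np.hstack([np.asarray(v_lagged, dtype=float),
                      np.asarray(topic_lagged, dtype=float),
                      np.asarray(predicted, dtype=float)])
```

The resulting matrix is then fed to the same least-squares fit as before, with one extra coefficient for the forecasted topic volume.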

### 5.4. Incorporating Early Observations

Finally, we compare our method to the standard auto-regressive models based on early measurements (e.g., (rowe2011forecasting; li2013popularity)). Results are shown in Figure 10. We observe that our method yields an error rate on the same scale as methods that use early observations. There is a smooth transition between the error rate of our method (which can be used before publishing the article) and the error rates of methods that use 5 minutes, 1 hour, or 6 hours of early observations.

On average, our method yields a MAPE of 11.47%, while early predictions after 5 minutes, 1 hour, and 6 hours obtain error rates of 9.59%, 6.83%, and 4.75%, respectively.
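For reference, the evaluation metric used throughout this section, mean absolute percentage error (MAPE), can be computed as:

```python
def mape(actual, predicted):
    """Mean absolute percentage error, in percent.
    Assumes all actual values are non-zero."""
    n = len(actual)
    return 100.0 * sum(abs(a - p) / abs(a)
                       for a, p in zip(actual, predicted)) / n
```

For example, predictions of 90 and 220 against true visit counts of 100 and 200 give relative errors of 10% each, hence a MAPE of 10%.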

In the news domain, it is not realistic that an editor would publish a news article just to verify whether it will have a large impact. Once a news article is published, it cannot be withdrawn without a reputational cost. Hence, our method provides a unique competitive advantage over methods based on early measurements.

## 6. Conclusions and Future Work

Predicting the popularity of an article before its date of publication requires combining content-based methods, which capture the article’s communicative frame, with time series methods, which capture the evolution of people’s attention around different issues. Our approach successfully combines two dimensions in the forecasting of visits for an article: the popularity of similar articles of recent publication, and the popularity of the topics that the article treats. More specifically, we have shown that an integration of these two dimensions improves over each dimension on its own. Furthermore, integrating topic predictions—which we can do with as little error as 2.5%—yields a final mean absolute percentage error of about 11% when information from the 2 or 3 preceding days is taken into account.

Next, we plan to use a Content Analysis paradigm to develop a systematic augmentation of the dimensions of article popularity used in this paper. According to holsti1969content, the analysis of a message entails an understanding of who are the source and recipient of the communicative act, what is being said and how, and what are the purpose and potential reach of the message. In this study we have primarily focused on content and style. In future work, we will integrate information about source, target, purpose (e.g. attitude towards the topic treated) and potential reach (e.g. readability, trust), as well as possible sources of competition for attention (e.g. similar articles on the same day), as a way of increasing the accuracy and robustness of the approach we have presented.