Web search queries can predict stock market volumes.
Ilaria Bordino, Stefano Battiston, Guido Caldarelli, Matthieu Cristelli, Antti Ukkonen, Ingmar Weber
1 Yahoo! Research, Avinguda Diagonal 177, Barcelona, Spain
2 ETH Chair of System Design, Kreutzplatz 5, Zurich Switzerland
3 Inst. of Complex Systems CNR, Dip. Fisica, “Sapienza” Univ., P.le Moro 5 00185 Rome, Italy
4 London Institute for Mathematical Sciences, South Street 22, Mayfair London, UK
5 IMT, Institute for Advanced Studies, Piazza S. Ponziano, 6, 55100 Lucca, Italy
Email: matthieu.cristelli@roma1.infn.it
Abstract
We live in a computerized and networked society where many of our actions leave a digital trace and affect other people’s actions. This has led to the emergence of a new data-driven research field: mathematical methods of computer science, statistical physics and sociometry provide insights on a wide range of disciplines ranging from social science to human mobility. A recent important discovery is that search engine traffic (i.e., the number of requests submitted by users to search engines on the www) can be used to track and, in some cases, to anticipate the dynamics of social phenomena. Successful examples include unemployment levels, car and home sales, and epidemics spreading. A few recent works applied this approach to stock prices and market sentiment. However, it remains unclear whether trends in financial markets can be anticipated by the collective wisdom of online users on the web. Here we show that daily trading volumes of stocks traded in NASDAQ-100 are correlated with daily volumes of queries related to the same stocks. In particular, query volumes anticipate in many cases peaks of trading by one day or more. Our analysis is carried out on a unique dataset of queries, submitted to an important web search engine, which also enables us to investigate user behavior. We show that the query volume dynamics emerges from the collective but seemingly uncoordinated activity of many users. These findings contribute to the debate on the identification of early warnings of financial systemic risk, based on the activity of users of the www.
Introduction
Nowadays many of our activities leave a digital trace: credit card transactions, web activities, e-commerce, mobile phones, GPS navigators, etc. This networked reality has favored the emergence of a new data-driven research field where mathematical methods of computer science [1], statistical physics [2] and sociometry provide effective insights on a wide range of disciplines [3] such as social sciences [4] and human mobility [5].
Recent investigations showed that Web search traffic can be used to
accurately track several social phenomena [6, 7, 8, 9].
One of the most successful results in this direction concerns the
epidemic spreading of the influenza virus among people in the USA. It has
been shown that the activity of people querying search engines for
keywords related to influenza and its treatment makes it possible to anticipate
the actual spreading as measured by official data on contagion
collected by Health Care Agencies [10].
In this paper, we address the question of whether a similar approach can be
applied to obtain early indications of movements in the financial
markets [11, 12, 13] (see Fig. 1 for a graphical representation of this issue). Indeed, financial turnovers, financial contagion and, ultimately,
crises often originate from collective phenomena such as herding
among investors (or, in extreme cases, panic), which signal the
intrinsic complexity of the financial system [14].
Therefore, the possibility of anticipating anomalous collective
behavior of investors is of great interest to policy makers [15, 16, 17] because
it may allow for a more prompt intervention, when this is appropriate.
For instance, the authors of [18] predict economic outcomes starting from social data; however,
these predictions are not in the context of financial markets.
Furthermore, it has been shown that volume shifts can be correlated with
price movements [19, 20, 21].
Here, we focus on queries submitted to the Yahoo! search engine that are related to companies listed on the NASDAQ stock exchange. Our analysis is twofold. On the one hand, we assess the relation over time between the daily number of queries (“query volume”, hereafter) related to a particular stock and the amount of daily exchanges over the same stock (“trading volume” hereafter). We do so by means of both a time-lagged cross-correlation analysis and the Granger-causality test. On the other hand, our unique data set allows us to analyze the search activity of individual users in order to provide insights into the emergence of their collective behavior.
Results
In our analysis we consider a set of companies (“NASDAQ-100 set”
hereafter) that consists of the companies included in the NASDAQ-100
stock market index (the 100 largest non-financial companies traded on
NASDAQ). We list these companies in Table 1.
Previous studies [12] looked at
stock prices at a weekly time resolution and found that the volume of
queries is correlated with the volume of transactions for all stocks
in the S&P 500 set for a time lag of zero weeks, i.e. the
present week query volumes of companies in the S&P 500 are
significantly correlated with present week trading volumes of the S&P
500. In addition, differently from [12], we
use daily data from the Yahoo! search engine and we look at query
volumes for single stocks without aggregating these volumes. The
authors of [12] suggest that the query volume can be interpreted as reflecting the attractiveness of trading a stock. Further, they find that this attractiveness effect lasts for several weeks and that present price movements seem to influence the search volume in the following weeks; they also point out that new analyses on data at a smaller time scale are needed.
This last observation is the starting point of the present work. Is it
possible to better investigate the relation between search traffic and
market activity on a daily time scale? And, even more important, can
query volumes anticipate market movements and be a proxy for market
activity? In other words in this paper we are addressing the question
whether web searches can be a forecasting tool for financial markets
and not only a nowcasting one. This is a novel analysis which tries to
quantify the link, and the direction of the link, between search traffic and financial activity.
We consider search traffic as well as market activity at a daily frequency and find a strong
correlation between query volumes and trading volumes for all stocks
in the NASDAQ-100 set.
Fig. 2 (top panel) shows the time evolution of the query volume of the
ticker “NVDA” and the trading volume of the corresponding company
stock “NVIDIA Corporation” and Fig. 3 (top panel) shows the same
plot for query volume of the ticker “RIMM” and the trading volume of
the company stock “Research In Motion Limited” (see also Section
“Materials and Methods”). A simple visual inspection of these figures (see also Fig. 4) reveals a
clear correlation between the two time series because peaks in one
time series tend to occur close to peaks in the other.
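The lower panels of Figs. 2 and 3 report the values of the cross-correlation
between trading and query volume as a function of the time lag $\delta$,
defined as the time-lagged Pearson cross-correlation coefficient
between two time series $Q_t$ and $T_t$:
$$r(\delta) = \frac{\sum_{t}(Q_t - \bar{Q})(T_{t+\delta} - \bar{T})}{\sqrt{\sum_{t}(Q_t - \bar{Q})^2 \, \sum_{t}(T_t - \bar{T})^2}} \qquad (1)$$
where $\bar{Q}$, $\bar{T}$ are the sample averages of the two time series (in this case $Q$ and $T$ represent query and trading volumes, respectively). The coefficient $r(\delta)$ can range from $-1$ (anticorrelation) to $1$ (correlation).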
The cross-correlation coefficients for positive values of the time lag
(solid lines) are always larger than the ones for negative time lags
(broken lines). This means that query volumes tend to anticipate
trading volumes. Such an anticipation spans a few days at most;
beyond that, the correlation of query volumes
with trading volumes vanishes. In Table 4 we report the cross-correlation function between queries and trading volumes averaged over the 87 companies in the NASDAQ-100 for which we have a clean query-log signal. In Table 6 we report instead the cross-correlation functions for some of the 87 companies investigated in Table 4 (for the sake of completeness, in Tables S1 and S2 of the Supporting Information we report the cross-correlation functions for all the clean stocks, and in Table S3 those for the stocks whose query volume has a spurious origin).
As a first result from this analysis, we find that the significant correlation between query volumes and trading volumes at zero time lag confirms the results of [12] also at a daily timescale. Our findings (i.e. positive correlation for negative time lags) also support the view that present market activity influences future users’ activity but, in contrast with [12], the length of this influence appears to be much shorter than expected (only a few days). It appears that this correlation only emerges at a daily scale and is not observed at weekly resolution.
However, the most striking result is that the cross-correlation
coefficients between present query volumes and future trading volumes appear to be larger than the coefficients of the opposite case. In the remainder of this paper we discuss this anticipation effect in detail and give a statistical validation of our finding.
Statistical validation
In order to assess the statistical significance of the results for the NASDAQ-100 set, we construct a reshuffled data set in which the query volume time series of a company $i$ is randomly paired with the trading volume time series of another company $j$. The values of the cross-correlation coefficient averaged over the permutations are much smaller than the original one. The residual correlation present in the reshuffled dataset can be explained in terms of general trends of the market and of the specific (technological) sector considered [22, 23, 24].
As a second test, we remove the top five (and ten) largest events from the trading volume time series in order to verify whether the results shown in Table 6 (the results for all the stocks are reported in Tables S4 and S5 of Supporting Information) are dominated by these events. In Table 7 we report the comparison between the values of the cross-correlation coefficient of the two series for a selection of stocks. A significant correlation is still observed for most of the stocks considered. This important test supports the robustness of our findings. In fact, even if the drop indicates that the distributions underlying the investigated series are fat-tailed (see Figs. S1-S6 of Supporting Information and the discussion about the validity of the Granger test later in the paper) and that a significant fraction of the correlation is driven by the largest events, more than half of the correlation cannot be explained by these extreme events.
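The robustness check just described can be illustrated with a minimal sketch. The function names (`pearson`, `drop_top_events`) and the toy numbers are hypothetical and are not taken from the paper; the sketch only shows the idea of deleting the days with the largest trading volume from both series and recomputing the zero-lag correlation.

```python
from math import sqrt

def pearson(x, y):
    """Zero-lag Pearson correlation between two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den

def drop_top_events(queries, trades, k):
    """Remove the k days with the largest trading volume from both series."""
    order = sorted(range(len(trades)), key=lambda i: trades[i], reverse=True)
    top_days = set(order[:k])
    kept = [i for i in range(len(trades)) if i not in top_days]
    return [queries[i] for i in kept], [trades[i] for i in kept]

# Toy series (assumed values) with one extreme event on day 3.
queries = [3, 4, 2, 40, 5, 3, 6, 4]
trades = [10, 12, 9, 100, 13, 11, 15, 12]

r_before = pearson(queries, trades)
q_filtered, t_filtered = drop_top_events(queries, trades, k=1)
r_after = pearson(q_filtered, t_filtered)
```

Comparing `r_before` with `r_after` indicates how much of the correlation is carried by the removed event; in the paper the same comparison is made with the top five and top ten events.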
Turning now to the validation of the fact that query volumes anticipate trading volumes, a first issue is that trading volumes and volatility are well known to be correlated, and that volatility is autocorrelated [25, 26, 27] (the decay of the volatility autocorrelation is well described by a power law). Therefore the correlation between the query volumes and the future trading volumes shown in Figs. 2 and 3 could be explained in terms of these two effects. To check this, we compare the lagged cross-correlation function between a proxy for the volatility (the absolute value of price returns) and the query volumes with the results shown in Table 4. As shown in Fig. 5, the positive-lag branch in the volatility case is equal to or even smaller than the negative-lag one, differently from the trading volume case. If the effect were due to the autocorrelation component of the volatility, we would expect a similar behavior for both cross-correlation functions. In addition, we observe that the volatility autocorrelation function decays much more slowly (from weeks to months) than the typical time decay of the cross-correlations investigated here (a few days). This supports the conclusion that the anticipation effect does not originate from the autocorrelation of volatility.
As a second measure of the anticipation effect, we also performed a
Granger causality test [28] in order to determine whether today’s
search traffic provides significant information for forecasting
the trading volumes of tomorrow. We find that trading volumes can be
considered Granger-caused by the query volume. We want to point out that Granger causality does not imply
a causal relation between the two series. In fact, it can be argued
with a simple counterexample that two series linked by a Granger relation may be
driven by a third process, and therefore the interpretation of the
Granger relation as a causal link would be wrong. In our analysis
the results of the Granger test are only used to assess the direction
of the anticipation between queries and trading activity. In this
sense we claim that query volumes observed today are informative of (and consequently forecast) tomorrow’s trading volumes.
Furthermore, the fat-tailed nature of the distributions under
investigation (see Figs. S1-S6 of Supporting Information) may weaken the results of
the Granger test which, in principle, requires Gaussian distributions
for the error term of the regressions
[28]. However, we perform a series of additional analyses and tests which support and confirm the picture coming from the Granger-test results (see Section “Materials and Methods”
for further details).
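To make the logic of the Granger test concrete, the following is a minimal sketch under strong simplifying assumptions: a single lag of one day, ordinary least squares solved through the normal equations, and only the F statistic (no p-value). The helper names `ols_rss` and `granger_f` and the synthetic data are hypothetical; the sketch is not the procedure used in the paper, whose regression settings are not specified in this section.

```python
import random

def ols_rss(rows, y):
    """Residual sum of squares of the least-squares fit of y on the given rows."""
    k = len(rows[0])
    # Augmented normal equations [X'X | X'y].
    a = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * v for r, v in zip(rows, y))] for i in range(k)]
    # Gaussian elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda row: abs(a[row][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for row in range(col + 1, k):
            factor = a[row][col] / a[col][col]
            for c in range(col, k + 1):
                a[row][c] -= factor * a[col][c]
    # Back substitution for the regression coefficients.
    beta = [0.0] * k
    for i in reversed(range(k)):
        tail = sum(a[i][j] * beta[j] for j in range(i + 1, k))
        beta[i] = (a[i][k] - tail) / a[i][i]
    return sum((v - sum(b * x for b, x in zip(beta, r))) ** 2
               for r, v in zip(rows, y))

def granger_f(cause, effect):
    """F statistic for 'cause Granger-causes effect' using one lag."""
    y = effect[1:]
    restricted = [[1.0, effect[i]] for i in range(len(effect) - 1)]
    unrestricted = [[1.0, effect[i], cause[i]] for i in range(len(effect) - 1)]
    rss_r = ols_rss(restricted, y)
    rss_u = ols_rss(unrestricted, y)
    dof = len(y) - 3  # observations minus parameters of the unrestricted model
    return (rss_r - rss_u) / (rss_u / dof)

# Synthetic data (assumed): trading volume follows yesterday's queries plus noise.
rng = random.Random(1)
queries = [rng.gauss(0, 1) for _ in range(200)]
trades = [0.0] + [0.8 * queries[i - 1] + rng.gauss(0, 0.5) for i in range(1, 200)]

f_forward = granger_f(queries, trades)   # queries -> trades
f_backward = granger_f(trades, queries)  # trades -> queries
```

A large `f_forward` together with a small `f_backward` is the pattern that indicates the direction of the anticipation; as stressed above, it says nothing about a causal link.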
Users’ behavior
In the second part of our investigation we focus on the activity of single users. We are able to track the users who have registered to Yahoo! and thus have a Yahoo! profile. One could expect that users regularly query a set of tickers corresponding to stocks of their interest. This is because, for queries that match the ticker of a stock, the search engine shows the user up-to-date market information about the stock in a separate display that appears above the normal search results. In addition, if any important news appears, the corresponding page would show among the top links in the search results. Therefore, we first compute the distribution of the number of tickers searched by each user in various time windows and time resolutions (see Fig. 7). Interestingly, most users search only one ticker, not only within a month, but also within the whole year. This result is robust over the time interval under observation and across tickers. As a further step, among the users who search a given ticker at least once in a certain time window, we compute the distribution of the number of different days on which they search again for the same ticker. In this case, we restrict the analysis to some specific tickers, namely those with the highest cross-correlation between query volumes and trading volumes (e.g., those for Apple Inc., Amazon.com, Netflix Inc.). Surprisingly, as shown in Section “Materials and Methods”, Figs. 7-10, the majority of users searched the ticker only once, not only during a month, but also within a year. Again, this result is robust over the 12 months in our dataset. Altogether, we find that most users search for one “favorite” stock, only once. The fact that these users do not regularly check a wide portfolio of stocks suggests that they are not financial experts. In addition, there is no consistent pattern over time: users perform their searches in a seemingly uniform way over the months.
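The two distributions discussed in this paragraph can be sketched as follows. The record layout `(user, ticker, day)` and the function names are assumptions made for illustration; the actual query-log schema is not described at this level of detail.

```python
from collections import Counter, defaultdict

def tickers_per_user(searches):
    """Map 'number of distinct tickers searched' to 'number of users'."""
    seen = defaultdict(set)
    for user, ticker, _day in searches:
        seen[user].add(ticker)
    return Counter(len(tickers) for tickers in seen.values())

def active_days_per_user(searches, ticker):
    """Map 'number of distinct days on which `ticker` was searched' to 'number of users'."""
    days = defaultdict(set)
    for user, searched, day in searches:
        if searched == ticker:
            days[user].add(day)
    return Counter(len(d) for d in days.values())

# Hypothetical search records within one time window.
searches = [
    ("u1", "AAPL", "2010-07-01"),
    ("u1", "AAPL", "2010-07-02"),
    ("u2", "AMZN", "2010-07-01"),
    ("u3", "AAPL", "2010-07-05"),
    ("u3", "NFLX", "2010-07-05"),
    ("u4", "AAPL", "2010-07-09"),
]

ticker_distribution = tickers_per_user(searches)
day_distribution = active_days_per_user(searches, "AAPL")
```

In this toy window three of the four users search a single ticker, and two of the three users who search "AAPL" do so on one day only, which mirrors the kind of statistic reported in the text.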
In addition, we find that our results are typical and very stable in time. In this respect, we do not observe any correlation between the influx of one-time searchers and large fluctuations of trading volume or large price drops. In Fig. 11 we show the evolution of the number of one-time searchers, which appears to be very stable in time.
Overall, combining the evidence on the relation between query and trading volumes with the evidence on individual user behavior yields a quite surprising picture: movements in trading volume can be anticipated by volumes of queries submitted by non-expert users, a sort of wisdom-of-crowds effect.
Discussion
In conclusion, we crawled the information stored in query-logs of the
Yahoo! search engine to assess whether signals in the querying activity of
web users interested in particular stocks can anticipate movements in
the trading activity of the same stocks.
Differently from previous studies, we considered daily time series and
we focused on trading volumes rather than prices.
Daily volumes of queries related to a stock were compared with the
effective trading volume of the same stock by computing the
time-delayed cross-correlation.
Our results show the existence of a positive correlation between today’s stock-related web search traffic and the trading volume of the same stocks in the following days. The direction of the correlation is confirmed by several statistical tests.
Furthermore, the analysis of individual users’ behavior shows that most of the users query only one stock and only once in a month. This seems to suggest that movements in the market are anticipated by a sort of “wisdom of crowds” [29]. These findings do not explain the origin of the market movements but show that search traffic can be a good proxy for them.
Furthermore, if one could assume that the queries of a user reflect the
composition of her investment portfolio, our finding would suggest
that most of the investors place their investments in only one or two
financial instruments. The assumption that queries reflect portfolio
composition is a strong hypothesis and cannot be verified in our data
at the current stage. The finding would then deviate from the
diversification strategy of the well-known Markowitz approach, but
would be in line with previous empirical works carried out on
specific financial markets.
This result, if confirmed, could have very important consequences. In
epidemics, taking for granted that everybody has the mean
number of contacts leads to incorrect results on disease propagation. Here, the assumption that
investors’ portfolios are balanced, while they are not, could explain why domino effects in the market are
faster and more frequent than expected.
This does not mean that we can straightforwardly apply the models of
epidemic spreading [30, 31, 32] to financial markets.
In fact, in the latter case (differently from ordinary diseases) panic spreads mostly by
news. In an ideal market, all the financial agents can become “affected” at the same time by the
same piece of information.
This fundamental difference makes the typical time scale of reactions in financial markets much
shorter than the one in disease spreading.
It is exactly for that reason that any early sign of market behavior
must be considered carefully in order to promptly take the necessary countermeasures.
We think that this information can be effectively used in order to detect early signs of financial
distress.
We also believe this field to be very promising and we are currently working on the extension of this kind of web analysis to Twitter data and semantic analysis of blogs.
Materials and Methods
In this section we give a detailed overview of the investigations carried out in this paper. The first contribution of our work consists, as previously said, of an analysis of the relation between the activity of the users of the Yahoo! search engine and real events taking place within the stock market. Our basic assumption is that any market activity in an individual stock may find some correspondence in the search activity of the users interested in that stock. Thus we study whether significant variations in the stock trading volumes are anticipated by analogous variations in the volume of related Web searches. To investigate the existence of a correlation between query volumes and trading volumes, we compute time-lagged cross-correlation coefficients of these two series.
We conduct such analysis performing separate experiments to test the two different query definitions that we take into consideration, i.e., queries containing the stock ticker string, or queries matching the company name. The results of this first set of experiments are presented in Subsection “Correlation between query volumes and trading volumes”.
We then apply permutation tests, the Granger-causality test and several further analyses to assess the significance of the correlations found. These experiments are described in Subsection “Statistical validation of the query anticipation”.
Finally, Subsection “Analysis of users’ behavior” presents details of the last part of our work, where we try to gain a better knowledge of the typical behavior of the users who issue queries related to finance. Here we refine our analysis of the information extracted from query logs to understand what a typical user searches for, such as whether she looks for many different tickers or just for a few, and whether she looks for them regularly or just sporadically.
Database
The stocks analyzed
In this work we compare query volumes and trading volumes of a set of companies traded in the NASDAQ (National Association of Securities Dealers Automated Quotation) stock exchange, which is the largest electronic screen-based equity securities trading market in the United States and the second-largest by market capitalization in the world. Precisely, we analyze the companies included in the NASDAQ-100 stock-market capitalization index. These companies are amongst the largest non-financial companies that are listed on the NASDAQ (technically, the NASDAQ-100 is a modified capitalization-weighted index; it does not contain financial companies and it also includes companies incorporated outside the United States). We list these companies in Table 1. The daily financial data for all the stocks is publicly available from Yahoo! Finance (http://finance.yahoo.com/) and we focus our attention on the daily trading volumes.
Query data
The query-log data we analyze is a segment of the Yahoo! US search-engine log, spanning a time interval of one year, from mid-2010 to mid-2011. The query-log stores information about actions performed by users during their interactions with the search engine, including the queries they submitted and the result pages they were returned, as well as the specific documents they decided to click on.
We compute query volume time series by extracting and aggregating on a daily basis two different types of queries for each traded company:
(i) all queries whose text contains the stock ticker string (e.g. “YHOO” for Yahoo!) as a distinct word;
(ii) all queries whose text exactly matches the company name (after removing the legal ending, “Incorporated” or “Corporation” or “Limited”, and all their possible abbreviations).
All queries in the log are associated with a timestamp that represents the exact moment the query was issued to the search engine. We use this temporal information to aggregate the query volumes at different levels of granularity. Furthermore, every action is also annotated with a cookie, representing the user who submitted the query. These cookies make it possible to track the activity of a single user during a time window of a month. By using this information, we also computed user volumes by counting the daily number of distinct users who made at least one search related to one company (according to the query definitions provided above). Thus, for each stock taken into consideration, we can compare the daily volumes of related queries, as well as the number of distinct users issuing such queries per day, with the daily trading volumes gathered from Yahoo! Finance.
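As an illustration of the aggregation just described, the sketch below counts, for one ticker, the daily number of matching queries and the daily number of distinct cookies. The record layout `(timestamp, cookie, text)`, the function name `daily_volumes` and the case-insensitive word matching are assumptions; the real log format and matching rules may differ.

```python
import re
from collections import defaultdict
from datetime import datetime

def daily_volumes(log_records, ticker):
    """Return per-day query counts and per-day distinct-user counts for a ticker."""
    # Match the ticker only as a distinct word (case-insensitive by assumption).
    pattern = re.compile(r"\b" + re.escape(ticker) + r"\b", re.IGNORECASE)
    query_volume = defaultdict(int)
    users = defaultdict(set)
    for timestamp, cookie, text in log_records:
        if pattern.search(text):
            day = timestamp.date()
            query_volume[day] += 1
            users[day].add(cookie)
    user_volume = {day: len(cookies) for day, cookies in users.items()}
    return dict(query_volume), user_volume

# Hypothetical log records: (timestamp, cookie, query text).
records = [
    (datetime(2010, 7, 1, 9, 30), "u1", "yhoo stock price"),
    (datetime(2010, 7, 1, 10, 5), "u1", "YHOO earnings"),
    (datetime(2010, 7, 1, 11, 0), "u2", "yhoo"),
    (datetime(2010, 7, 1, 12, 0), "u3", "yahoo mail"),
    (datetime(2010, 7, 2, 9, 0), "u2", "buy yhoo shares"),
]

query_volume, user_volume = daily_volumes(records, "YHOO")
```

Here the query "yahoo mail" is not counted because the ticker does not appear as a distinct word, and the two queries issued by the same cookie on the first day count twice in the query volume but once in the user volume.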
Correlation between query volumes and trading volumes
We compare the query volume of every stock with the trading volume of the same stock. The two definitions of queries introduced are used in separate experiments, that is, in one case we aggregate all the queries containing the ticker of a company, and in another case we only consider queries that match the company name.
We extract from both data sources (the query volumes and the trading volumes of a given stock) a time series composed of daily values in the time interval ranging from mid-2010 to mid-2011. Although the query-log contains information collected during holidays and weekends, as shown in Fig. 6 for the case of the AAPL stock, the financial information is obviously only available for trading days. Thus, for the sake of uniformity, we filter out all the non-working days from the query volume time series. In the end, we obtain two time series of 250 working days for every stock.
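The filtering step can be sketched as below. The function name `align_to_trading_days`, the dictionary-based representation and the choice to treat a missing query count as zero are assumptions made for this example.

```python
from datetime import date

def align_to_trading_days(query_volume, trading_volume):
    """Keep only the days present in the trading data and return aligned lists."""
    days = sorted(trading_volume)
    # A trading day without any logged query is counted as zero (assumption).
    queries = [query_volume.get(day, 0) for day in days]
    trades = [trading_volume[day] for day in days]
    return days, queries, trades

# Hypothetical data: queries exist on weekends too, trades only on trading days.
query_volume = {
    date(2010, 7, 1): 120,  # Thursday
    date(2010, 7, 2): 95,   # Friday
    date(2010, 7, 3): 40,   # Saturday
    date(2010, 7, 4): 35,   # Sunday
    date(2010, 7, 6): 150,  # Tuesday
}
trading_volume = {
    date(2010, 7, 1): 1_000_000,
    date(2010, 7, 2): 900_000,
    date(2010, 7, 6): 1_400_000,
}

days, queries, trades = align_to_trading_days(query_volume, trading_volume)
```

Using the trading calendar as the reference index guarantees that both series have the same length and that each position refers to the same day.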
As a second step, given the time series $Q$ of the query volumes and the time series $T$ of the trading volumes, we compute the cross-correlation coefficient $r(\delta)$
for every company.
This correlation coefficient ranges from $-1$ to $1$.
Although the above coefficient can be computed for all delays $\delta$, we chose to consider a maximum lag of one week (five working days).
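A minimal sketch of this computation is given below. It assumes that the means and variances are taken over the full series and that the sum runs only over the index pairs available at each lag; the function name `lagged_cross_correlation` and the toy series are hypothetical.

```python
from math import sqrt

def lagged_cross_correlation(q_series, t_series, lag):
    """Pearson cross-correlation between q_series[k] and t_series[k + lag].

    A positive lag pairs a query volume with a later trading volume.
    """
    n = len(q_series)
    if n != len(t_series):
        raise ValueError("series must have the same length")
    q_mean = sum(q_series) / n
    t_mean = sum(t_series) / n
    num = sum((q_series[k] - q_mean) * (t_series[k + lag] - t_mean)
              for k in range(n) if 0 <= k + lag < n)
    den = sqrt(sum((x - q_mean) ** 2 for x in q_series)
               * sum((y - t_mean) ** 2 for y in t_series))
    return num / den

# Toy data (assumed): the trading series is the query series delayed by one day.
queries = [1, 5, 2, 8, 3, 9, 2, 4, 7, 1]
trades = [0] + queries[:-1]

# Lags from minus one week to plus one week (five working days).
profile = {lag: lagged_cross_correlation(queries, trades, lag)
           for lag in range(-5, 6)}
best_lag = max(profile, key=profile.get)
```

For this toy pair the profile peaks at a lag of one day, which is the signature of queries anticipating trades by one day.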
Tables 2 and 3 report the results obtained for these experiments. Columns correspond to different values of the time lag used in the calculation of the cross-correlation coefficients. We observe that the cross-correlation coefficients always assume values nearly equal to zero for larger time lags.
When the first query definition is taken into consideration (ticker query), the average crosscorrelation coefficient in the base case of is equal to . Similar values are obtained if a timelag in the range is considered. It is worth noticing that for some individual companies we observe much higher correlations. On this account Table 6 presents the best results for single stocks (see Tables S1 and S2 of Supporting Information for the complete results: it is worth noticing that considering only the stocks for which , there are 8 stocks for which , for 68 stocks it holds that while for the remaining 11 stocks we observe ). For these companies, we also report in Table 7 (Tables S4 and S5 for all the results) the basic crosscorrelation at lag after removing from the time series the days corresponding to the top and values of the trading volume. It is interesting to observe that the correlations are still significant and so the correlation does not seem to be due only to peak events, which generally correspond to headlines in the news, product announcements or dividend payments.
When the second query definition (company names) is considered, we observe weaker correlations than the previous case. The average crosscorrelation coefficient in the base case is equal to .
In addition, we point out that the process of extracting data from query-logs can introduce spurious queries which have a non-financial origin. In particular, some of the ticker queries match our above definition, but are nonetheless unrelated to the stock represented by the ticker. For instance, some ticker strings correspond to natural language words, such as “FAST” (Fastenal Company) and “LIFE” (Life Technologies Corp.). As one can reasonably expect, the overwhelming majority of queries containing these words are completely unrelated to the companies that are the subject of our study. Other cases of companies for which we discovered very large levels of noise included e-commerce portals like eBay. In all these cases the ticker often appears in navigational queries that are unrelated to the company stock (see Table S3 of Supporting Information). For this reason, we filter out all companies whose query volumes are discovered to be noisy, retaining a smaller, but cleaner set of companies for which the spurious queries are a negligible fraction. By restricting the computation of the cross-correlation function to these companies, we observe a larger value of the average cross-correlation. Table 4 reports the results obtained for the first query definition (queries including the ticker as a distinct word), which represents the case for which the best performances of the queries are observed. The average cross-correlation at time lag is .
Besides query volumes, we also consider user volumes, i.e., the number of distinct users who issued queries related to a company in any given day. For reasons listed above, this analysis is restricted to the 87 NASDAQ100 companies for which we have a clean querylog signal. Crosscorrelations between user volumes and trading volumes are shown in Table 5. We observe similar findings to the ones obtained in the previous experiments, although the average crosscorrelation is smaller than the one obtained with query volumes. The average crosscorrelation between user volumes and trading volumes at time lag is .
Statistical validation of the query anticipation
Permutation test
A permutation test, also called randomization test, is a statistical significance test where random rearrangements (or permutations) of the data are used to validate a model. Under the null hypothesis of such a test, data permutations have no effect on the outcome, and the reshuffled data present the same properties as the true instance. The rank of the real test statistic among the shuffled test statistics determines the empirical “p-value”, which is the probability that the test statistic would be at least as extreme as observed, if the null hypothesis were true. For example, if the value of the original statistic is greater than a given fraction of the random values, we can reject the null hypothesis at the corresponding confidence level: the probability that we would observe a value as extreme as the true one, if the null hypothesis were true, is less than the complementary fraction. In our setting, the aim is to verify the significance of the correlation between the queries containing the ticker of a company and the trade volumes of the same company. In particular, we want to assess whether the cross-correlation between query volume and trading volume of a given company $i$ is higher than the cross-correlation between the query volume of company $i$ and the trading volume of some other company $j$. The purpose of this test is to show that the correlations we observe are not merely a consequence of stock-market-related web search activity being correlated with stock market activity in general.
Our original data is given by the set of pairs of time series $(Q_i, T_i)$ previously considered. Every pair in this set contains information concerning a given company $i$. As already indicated, $Q_i$ is the time series of the query volumes of $i$, whereas $T_i$ is the time series of the trading volumes of $i$. We use as test statistic the cross-correlation coefficient between $Q_i$ and $T_i$. Starting from the above data, we apply 1000 random permutations to create an ensemble of 1000 distinct datasets, each one composed of pairs $(Q_i, T_j)$, where the time series of query volumes of a company $i$ is randomly paired with the time series of trade volumes of a different company $j$. For each pair included in each randomly generated dataset, we compute the cross-correlation between $Q_i$ and $T_j$.
We then compare the (macro)average cross-correlation that we get for the real data with the average values obtained for the 1000 randomized datasets in which the queries of a company are always paired with the trades of another company. The values obtained for the test statistic when the random permutations are applied are much smaller than the average result that we get for the original data. Therefore we get an empirical p-value of 0.001, meaning that the correlations observed on the real data are statistically significant.
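The procedure can be sketched as follows, under several assumptions: the zero-lag Pearson coefficient is used as the test statistic, the random pairings are drawn by rejection sampling so that no company keeps its own trading series, and the empirical p-value uses the add-one convention. All function names and the synthetic data are hypothetical.

```python
import random
from math import sqrt

def pearson(x, y):
    """Zero-lag Pearson correlation between two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den

def random_derangement(n, rng):
    """Random pairing in which no company keeps its own trading series."""
    while True:
        perm = list(range(n))
        rng.shuffle(perm)
        if all(i != j for i, j in enumerate(perm)):
            return perm

def permutation_test(queries, trades, n_perm=1000, seed=0):
    """Return the observed macro-average correlation and its empirical p-value."""
    rng = random.Random(seed)
    n = len(queries)
    observed = sum(pearson(queries[i], trades[i]) for i in range(n)) / n
    at_least_as_large = 0
    for _ in range(n_perm):
        perm = random_derangement(n, rng)
        shuffled = sum(pearson(queries[i], trades[perm[i]]) for i in range(n)) / n
        if shuffled >= observed:
            at_least_as_large += 1
    # Add-one convention (an assumption): the p-value is never exactly zero.
    return observed, (at_least_as_large + 1) / (n_perm + 1)

# Synthetic data (assumed): each trading series follows its own query series.
data_rng = random.Random(42)
queries = [[data_rng.gauss(0, 1) for _ in range(50)] for _ in range(6)]
trades = [[2 * q + data_rng.gauss(0, 0.3) for q in series] for series in queries]

observed, p_value = permutation_test(queries, trades, n_perm=200)
```

With 1000 permutations and no reshuffled dataset reaching the observed value, this convention gives a p-value of about 0.001, the same order as the value reported above.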
We also check the significance of the correlations obtained for individual companies separately. Our goal here is to understand on a deeper level what companies are actually correlated with the corresponding queries, and which ones are not. We consider the two scenarios below.

In the first case, the null hypothesis is the following: The correlation between trading volume of company and query volume of the same company is not higher than the correlation between trading volume of company and query volume of some other company . For every company , we compare the real data with the 1000 pairs where each comes from one of the 1000 random datasets generated before. The test statistic that we use for the comparison is the same as before, that is, the crosscorrelation coefficient between the two time series forming any given pair. For every company , we compute the empirical pvalue by taking the rank of the real test statistic within the sorted order of the values computed from reshuffled data.

Similarly, in the second scenario, our null hypothesis is: the correlation between the query volume of a company and the trading volume of the same company is not higher than the correlation between the query volume of that company and the trading volume of some other company. Now, for any query volume, the real data is still given by the pair formed with the trading volume of the same company. We compare this with the 1000 pairs in which the query volume is matched with a trading volume taken from a different random dataset each time. We calculate the cross-correlation between the two time series included in every pair, and determine the p-values in the same way as above.
In both scenarios, the test rejected the null hypothesis for most of the companies. More specifically,

First scenario: we got the minimum p-value for 50 companies (out of 87). The p-value was in 19 cases.

Second scenario: we got the minimum p-value in 48 cases. The p-value was in 26 cases.
To summarize, we observe that for of the stocks the correlation between query volume and trading volume can not be explained by a simple global correlation between finance related search traffic and market activity in general.
It is worth noting that large p-values correspond to companies for which the query-log data and the trading volume are poorly correlated, possibly because of the large noise in the dataset.
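The two per-company tests can be sketched as follows. This is only an illustration under stated assumptions: the partner company is drawn uniformly at random for each null pair (instead of reusing the 1000 random datasets), the cross-correlation is the zero-lag Pearson coefficient, and the function names are hypothetical.

```python
import random
from statistics import mean

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def company_p_values(queries, trades, i, n_null=1000, seed=0):
    """Empirical p-values of company i in the two scenarios."""
    rng = random.Random(seed)
    others = [j for j in range(len(queries)) if j != i]
    real = pearson(queries[i], trades[i])
    # Scenario 1: trading volume of i against the query volume of another company.
    null1 = [pearson(queries[rng.choice(others)], trades[i]) for _ in range(n_null)]
    # Scenario 2: query volume of i against the trading volume of another company.
    null2 = [pearson(queries[i], trades[rng.choice(others)]) for _ in range(n_null)]

    def p_value(null):
        # Rank of the real statistic among the null values (add-one convention).
        return (sum(1 for s in null if s >= real) + 1) / (n_null + 1)

    return p_value(null1), p_value(null2)
```

A company reaches the minimum p-value when none of the reshuffled pairs is at least as correlated as its real pair.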
Correlation between query volume and volatility
Trading volume and volatility are correlated and volatility is autocorrelated. Therefore a source of the correlation between present query volume and future trading volume can be the autocorrelation component of volatility. Here we show that the origin of these correlations cannot be traced back to volatility. In order to perform such a task we compare the correlation between query volume and absolute price returns (i.e., a proxy for the volatility) with the one between query volume and trading volume.
We define the price return of a day as follows:
where is the closing price of the day . For each stock in our NASDAQ100 clean list we compute the price returns and build three time series:

The time series of the unsigned price returns:

The time series of the positive price returns:

The time series of the negative price returns:
The time series of the unsigned price returns has elements, being the length (number of days) of the time interval covered by our data ().
Similarly to the experiments involving trading volumes, we compute for every stock the crosscorrelation between the price returns and the query volume of the same company.
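As an illustration, the three series can be built from a list of daily closing prices as in the sketch below. The return formula used here (relative change between consecutive closing prices) is an assumption, since other conventions such as logarithmic returns are also common; the function name is hypothetical.

```python
def price_return_series(closes):
    """Build the unsigned, positive and negative return series from closing prices.

    Assumption: the return of day t is (p_t - p_{t-1}) / p_{t-1}.
    The positive and negative series keep the day index, because the
    gap between two consecutive elements is variable.
    """
    returns = [(closes[t] - closes[t - 1]) / closes[t - 1]
               for t in range(1, len(closes))]
    unsigned = [abs(r) for r in returns]
    positive = [(t + 1, r) for t, r in enumerate(returns) if r > 0]
    negative = [(t + 1, r) for t, r in enumerate(returns) if r < 0]
    return unsigned, positive, negative
```

For a price history covering a given number of days, the unsigned series has one element fewer than the number of days, since the first day has no previous close.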
Fig. 5 (broken line) reports the crosscorrelation function between the unsigned price returns and query volume. The average value of the basic crosscorrelation at lag between query volume and price returns is . This result reflects the fact that in days when the prices of the NASDAQ100 stocks exhibit a large variation (either positive or negative), there is a considerable amount of web search activity concerning the same stocks.
However, as shown in Fig. 5, the cross-correlation between query volume and volatility (broken line) is significantly smaller than the one between query volume and trading volume (solid line). Moreover, in the case of volatility the positive-lag branch is equal to or even smaller than the negative-lag one. If the effect originated from the autocorrelation component of volatility, we would expect a similar behavior for both cross-correlation functions. These facts support the conclusion that the correlation between today's query volume and future trading volume does not originate from the autocorrelation of volatility.
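The comparison between the two branches of a cross-correlation function can be sketched as follows. The convention assumed here is that a positive lag pairs the query volume of a day with the second series a given number of days later, and that the coefficient is computed on the overlapping part of the two series; the normalization used in the actual analysis may differ.

```python
from statistics import mean

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def cross_correlation(q, t, lag):
    """Correlation between q[s] and t[s + lag] over the overlapping days.

    A positive lag means that q is compared with later values of t,
    i.e. q anticipates t if the coefficient is large.
    """
    if lag >= 0:
        a, b = q[:len(q) - lag], t[lag:]
    else:
        a, b = q[-lag:], t[:len(t) + lag]
    return pearson(a, b)

def ccf(q, t, max_lag=5):
    """Cross-correlation function for lags -max_lag, ..., max_lag."""
    return {lag: cross_correlation(q, t, lag) for lag in range(-max_lag, max_lag + 1)}
```

An asymmetry in favour of positive lags in the output of `ccf` is what the anticipation effect looks like in this representation.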
For the time series of positive returns and of negative returns, we only computed the cross-correlation with query volumes at zero lag. The reason is that the time gap between two consecutive elements of those series is variable. The average correlations obtained for the clean NASDAQ tickers are reported in Table 8. The results are similar to the ones we get for the unsigned price returns.
Granger Causality
The Granger-causality test is widely used in time-series analysis to determine whether a time series X is useful in forecasting another time series Y. The idea is that X Granger-causes Y if Y can be better predicted using both the histories of X and Y rather than using only the history of Y. The test can be carried out by regressing Y on its own time-lagged values and on those of X. An F-test is then used to examine whether the null hypothesis that Y is not Granger-caused by X can be rejected.
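A minimal sketch of the test with a single lag is given below. It fits the restricted and the unrestricted regression by ordinary least squares and returns the F statistic; the function names are illustrative, only one lag is used, and the p-value still has to be read from the F distribution with (1, n - 3) degrees of freedom (for instance with `scipy.stats.f.sf`, if SciPy is available).

```python
def ols_rss(rows, y):
    """Residual sum of squares of the least-squares fit of y on rows.
    Each row must already contain a 1.0 for the intercept."""
    k = len(rows[0])
    # Normal equations A b = c, with A = X^T X and c = X^T y.
    a = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    c = [sum(r[i] * v for r, v in zip(rows, y)) for i in range(k)]
    # Gaussian elimination with partial pivoting.
    for col in range(k):
        piv = max(range(col, k), key=lambda r: abs(a[r][col]))
        a[col], a[piv] = a[piv], a[col]
        c[col], c[piv] = c[piv], c[col]
        for r in range(col + 1, k):
            f = a[r][col] / a[col][col]
            for j in range(col, k):
                a[r][j] -= f * a[col][j]
            c[r] -= f * c[col]
    # Back substitution.
    beta = [0.0] * k
    for i in reversed(range(k)):
        beta[i] = (c[i] - sum(a[i][j] * beta[j] for j in range(i + 1, k))) / a[i][i]
    return sum((v - sum(b * x for b, x in zip(beta, r))) ** 2 for r, v in zip(rows, y))

def granger_f(x, y):
    """F statistic for the null hypothesis 'x does not Granger-cause y' (one lag)."""
    target = y[1:]
    restricted = [[1.0, y[t]] for t in range(len(y) - 1)]          # y on its own past
    unrestricted = [[1.0, y[t], x[t]] for t in range(len(y) - 1)]  # plus the past of x
    rss_r = ols_rss(restricted, target)
    rss_u = ols_rss(unrestricted, target)
    n = len(target)
    # One restriction; n - 3 residual degrees of freedom in the unrestricted model.
    return (rss_r - rss_u) / (rss_u / (n - 3))
```

Calling `granger_f(x, y)` and `granger_f(y, x)` on the same pair of series gives the two directions of the test.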
In this work, we apply the Granger-causality test to analyze the relation between query volumes and trading volumes, and also between user volumes and trading volumes. Our aim is to show that search activity related to a company Granger-causes the trading volume of the company's stock. However, we also want to verify whether the notion of Granger causality holds in the opposite direction. Hence, we apply the test in the two possible directions.
Again, we first consider all companies included in the NASDAQ100 data set. However, given that we know from the previous analysis that in some cases the query volumes are very noisy and not related to the traded company they have been extracted for, we also perform the test on the smaller set of companies obtained through manual filtering.
Table 9 presents the results of the Granger-causality test. Each row in the table summarizes the outcome of an experiment. The table specifies the query-log time series considered (query volumes Q or user volumes U), which is compared with the trading volume T (comparisons are always made for each company independently), the lag applied (expressed in number of days), and the direction in which the test is applied: X → Y means that the null hypothesis is “X does not Granger-cause Y”. The last three columns provide a summary of the results obtained for all companies that are taken into consideration during the test. The fourth and fifth column respectively report the percentage of companies for which the null hypothesis was rejected with . The last column reports the average reduction in RSS.
In all cases, the test yields much stronger results in the direction from the query-log series to the trading volume than in the opposite direction. That is, we obtained stronger support for the case that time series extracted from the query log Granger-cause the trading volume of the same company, as opposed to trading volume Granger-causing query or user volumes. This is especially the case when significance at 1% is required.
For instance, let us consider rows 9 and 11 in Table 9. When the clean set of tickers is examined, we observe that in of the cases the null hypothesis (Q does not Granger-cause T) is rejected with , and for of the companies the same held with . A much weaker result is obtained when the opposite direction is considered. Only for of the companies the null hypothesis could be rejected with .
As we have already observed in the cross-correlation experiment, we get slightly weaker results when considering user volumes. Observe row 11 in the table: in of the cases the trading volume is Granger-caused by the user volume with probability greater than . The average reduction in RSS is .
In short, adding information about today's query volume reduces the average prediction error (in an autoregressive model) for tomorrow's trading volume by about . For half of the companies the reduction is statistically significant at , that is, both query volume and user volume Granger-cause the trading volume. We can also interpret this as follows: query/user volume helps to predict the trading volume, but the reverse does not hold.
It can now be argued that the Granger test, in principle, should be used only on series for which the error term in the regressions is Gaussian. Here, instead, we are dealing with fat-tailed distributions underlying the query volume and trading volume series (see Figs. S1-S6 of Supporting Information). However, in the next section we present a series of analyses which confirm the significance of the results found here. In particular, they all support the evidence that today's web search traffic is more informative about tomorrow's trading activity than the reverse.
Beyond Granger Causality
To study the anticipation effect and the power of search engine data for predicting stock trading volumes, we performed several statistical tests checking various hypotheses. The tests are detailed below.
Test 1
To test if query volume can predict future trading volume, denoted , we use four different regression models:

:
We predict trading volume of tomorrow using trading volume of today. 
:
We predict trading volume of tomorrow using both trading and query volume of today. 
:
We predict query volume of tomorrow using query volume of today. 
:
We predict query volume of tomorrow using both trading and query volume of today.
Let denote the sum of squared residuals for model . We define
In other words is the variation of when we use to predict in addition to . Likewise, is the variation in when is added to an autoregressive model of .
Our aim is to test the following hypotheses:

Nullhypothesis : and are not significantly different.

Alternative hypothesis : is significantly larger than .

Alternative hypothesis : is significantly larger than .
To compare and , we apply a bootstrap procedure to estimate their distribution.
We generate samples for and samples for , using the case resampling strategy. We denote by the bootstrap distribution of , and by the bootstrap distribution of .
Given and , we can derive an empirical p-value of being larger than .
This p-value, which we denote by , is computed as the rank of in the list of sorted values divided by , where is the number of bootstrap samples.
Depending on the chosen significance level, by the empirical p-value we can now reject , and support .
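The procedure of Test 1 can be sketched as follows. Several details are assumptions made for illustration and are not taken from the analysis itself: the four models are fitted as linear regressions with an intercept and a one-day lag, the reduction is measured as the relative decrease of the sum of squared residuals, and the empirical p-value is the share of bootstrap values of the query-side reduction that reach the observed trading-side reduction. All function names are hypothetical.

```python
import random

def ols_rss(rows, y):
    """Residual sum of squares of the least-squares fit of y on rows
    (each row already contains a 1.0 for the intercept)."""
    k = len(rows[0])
    a = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    c = [sum(r[i] * v for r, v in zip(rows, y)) for i in range(k)]
    for col in range(k):  # Gaussian elimination with partial pivoting
        piv = max(range(col, k), key=lambda r: abs(a[r][col]))
        a[col], a[piv] = a[piv], a[col]
        c[col], c[piv] = c[piv], c[col]
        for r in range(col + 1, k):
            f = a[r][col] / a[col][col]
            for j in range(col, k):
                a[r][j] -= f * a[col][j]
            c[r] -= f * c[col]
    beta = [0.0] * k
    for i in reversed(range(k)):  # back substitution
        beta[i] = (c[i] - sum(a[i][j] * beta[j] for j in range(i + 1, k))) / a[i][i]
    return sum((v - sum(b * x for b, x in zip(beta, r))) ** 2 for r, v in zip(rows, y))

def reductions(cases):
    """cases: list of (T_today, Q_today, T_tomorrow, Q_tomorrow).
    Returns the relative RSS reductions for predicting T and for predicting Q."""
    t_next = [c[2] for c in cases]
    q_next = [c[3] for c in cases]
    rss1 = ols_rss([[1.0, c[0]] for c in cases], t_next)        # T from T
    rss2 = ols_rss([[1.0, c[0], c[1]] for c in cases], t_next)  # T from T and Q
    rss3 = ols_rss([[1.0, c[1]] for c in cases], q_next)        # Q from Q
    rss4 = ols_rss([[1.0, c[1], c[0]] for c in cases], q_next)  # Q from Q and T
    return (rss1 - rss2) / rss1, (rss3 - rss4) / rss3

def bootstrap_test(t, q, n_boot=1000, seed=0):
    """Case-resampling bootstrap comparing the two reductions."""
    rng = random.Random(seed)
    cases = [(t[i], q[i], t[i + 1], q[i + 1]) for i in range(len(t) - 1)]
    d_t, d_q = reductions(cases)
    boot_q = []
    for _ in range(n_boot):
        sample = [rng.choice(cases) for _ in cases]  # resample whole cases
        boot_q.append(reductions(sample)[1])
    # Share of bootstrap query-side reductions reaching the observed trading-side one.
    p = sum(1 for v in boot_q if v >= d_t) / n_boot
    return d_t, d_q, p
```

Swapping the roles of the two series gives the corresponding check for the opposite direction.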
We run this test for the list of clean NASDAQ100 tickers.
For 26 companies we obtain an empirical pvalue lower than : this result suggests that, for these companies, we can reject the null hypothesis at the significance level of , finding support for .
Table S6 (see Supporting Information) reports the list of these companies, together with the respective p-values and . The third column of the table contains the value of the basic cross-correlation at lag between query volume and trading volume.
We also test the opposite direction. To verify if there is any support for , we take and , and use the same procedure as above to compute the empirical p-value of being larger than . This time, all p-values that we obtain for the 87 clean tickers are very large. In almost every case is smaller than the values in . This suggests that trading volumes of today do not help in predicting query volumes of tomorrow.
In Table S7 we report the ten tickers with the smallest : observe that even the smallest values are much larger than , thus we do not find any convincing support for .
Test 2
The previous test is based on the idea of comparing the improvement in after adding information from the second time series to an autoregressive model. The test that we present below is based on the direct comparison of the values of and .
We consider the two following regressive models:
We perform the two regressions above, and compute the respective values, which we call and . If , then we conclude , and vice versa.
To assess the significance of the test, we generate bootstrap vectors starting from the real data and applying random sampling with replacement.
We compute and on the bootstrap vectors, obtain the corresponding residuals, and extract the th percentiles and , that is, the values such that, for of the bootstrap vectors, the sum of squared residuals is below these values.
Then we compare with , and with .
We run this test on the clean set of NASDAQ100 tickers. For a significance level of , the outcome is the following:

61 companies with a significant difference at between and values: support , and support (These are: joyg, lltc, rost, teva, vrsn, vrtx).

26 companies have no significant difference between the two directions (see Table S8 and S9 of Supporting Information).
Test 3
In this test we again consider the four regression models that are used for the first test:

:
We predict trading volume of tomorrow using the trading volume of today. 
:
We predict trading volume of tomorrow using both trading and query volume of today. 
:
We predict query volume of tomorrow using the query volume of today. 
:
We predict query volume of tomorrow using both trading and query volume of today.
We consider the following hypothesis:

Nullhypothesis :

Alternative hypothesis : .
To test if , we compute the regression models and , and derive the corresponding residuals and . We then compute bootstrap estimates of both for and . Next we compare these two bootstrap samples by applying the Mann-Whitney U test, also known as the Wilcoxon rank-sum test.
The test is aimed at assessing whether one of two samples of independent observations tends to have larger values than the other. It is based on the null hypothesis of the two samples having equal medians.
We also test the opposite direction .
We compute the regression models and , and the corresponding residuals and .
We compute bootstrap estimates of both for and , and we apply again the Mann-Whitney U test.
For the 87 clean NASDAQ100 tickers, we get the following results (see Table S10 of Supporting Information):

Only 3 out of 87 clean Nasdaq tickers are not significant at when testing for . These are LINTA , CHKP and FISV .

In the other direction, , only 19 tickers are not significant at .

In every other case the p-value is approximately . This might be due to the Mann-Whitney test being better suited for small sample sizes.
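For reference, the Mann-Whitney U statistic and a two-sided p-value can be computed as in the following sketch. It uses the normal approximation without tie correction in the variance, which is a simplification; the function name is illustrative and this is not the implementation used for the results above.

```python
from math import erf, sqrt

def mann_whitney_u(a, b):
    """U statistic of sample a against sample b and a two-sided p-value
    (normal approximation, no tie correction in the variance)."""
    pooled = sorted([(v, 0) for v in a] + [(v, 1) for v in b])
    # Assign average ranks (1-based) to tied values.
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        for k in range(i, j + 1):
            ranks[k] = (i + j) / 2 + 1
        i = j + 1
    rank_sum_a = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    n_a, n_b = len(a), len(b)
    u = rank_sum_a - n_a * (n_a + 1) / 2
    mu = n_a * n_b / 2
    sigma = sqrt(n_a * n_b * (n_a + n_b + 1) / 12)
    z = (u - mu) / sigma
    p = 1 - erf(abs(z) / sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return u, p
```

With two samples of 1000 values each, even a modest shift between them already yields a p-value that is numerically close to zero, which illustrates how the size of the bootstrap samples drives the outcome of the test.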
Analysis of users’ behavior
We now investigate the typical behavior of search-engine users who issue queries related to NASDAQ100 tickers. In particular, our goal is to answer the following questions:

What does a typical user search for?

Does a user look for many different tickers, or just for a few ones or even one?

Does a user ask the same question repeatedly on a certain regular basis, or sporadically?

Can we identify groups of users with a similar behavior?
First, we compute the distribution of the number of distinct tickers that any user looks at within a month. We then obtain an average monthly distribution by averaging over the 12 months in our period of observation, as shown in Fig. 7. We also compute the distribution of the number of distinct tickers that any user looked at within the whole year, as shown in Fig. 7. The distributions show very clearly that the overwhelming majority of the users search only for one ticker, not only within one month, but also within the whole year.
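The distribution of the number of distinct tickers per user can be computed from a query log as in the sketch below. The log format (one pair of user identifier and ticker per query) and the function name are assumptions made for illustration.

```python
from collections import Counter, defaultdict

def distinct_ticker_distribution(log):
    """log: iterable of (user_id, ticker) pairs, one per query.
    Returns {k: fraction of users who searched exactly k distinct tickers}."""
    tickers_by_user = defaultdict(set)
    for user, ticker in log:
        tickers_by_user[user].add(ticker)
    counts = Counter(len(tickers) for tickers in tickers_by_user.values())
    n_users = len(tickers_by_user)
    return {k: c / n_users for k, c in sorted(counts.items())}
```

Applying the function to the queries of each month separately and averaging the twelve results gives the average monthly distribution, while applying it to the whole log gives the yearly one.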
To further characterize the behavior of users with respect to this one ticker they look for, we then check how frequently people look for their favorite ticker, and whether they search for it regularly over time (once a day, once a week, once a month). To conduct this study we focus on three of the tickers characterized by the highest cross-correlation between query volumes and trading volumes: AAPL (Apple Inc.), AMZN (Amazon.com), and NFLX (NetFlix, Inc.).
For each of these tickers, we consider the set of users who made at least one search related to the ticker during the whole year, and we compute the distribution of the number of days on which any user searched the ticker. We first consider, separately, the distribution for each month, and then we take the average over the twelve months. We also compute the distribution over the whole year. The yearly and monthly distributions for the three tickers are shown in Figs. 8, 9, 10. Surprisingly, in all the cases considered, a major fraction of the users () looks at their favorite ticker only once, both within a month and within the whole year.
Given the correlation and the anticipation of query volumes over trading volumes described in the previous section, one could expect to observe a significant fraction of users regularly querying for a stock and doing so more frequently in coincidence with peaks of trading activity. In contrast, the typical behavior of users suggests the profile of people who are neither financial experts nor regular followers of the market trend. It is thus remarkable that, despite emerging from the uncoordinated action of “normal” people, the query activity still works well as a proxy to anticipate market trends.
Finally, for the subset of users who have a registered Yahoo! profile, we also analyze the personal data that they provide concerning gender, age, country. To check if the users who seek NASDAQ100 tickers behave differently from the rest of the Yahoo! users, we compare the set of registered users who submitted at least one query related to a NASDAQ100 ticker with a random sample containing half of the registered users who were tracked in the log during the whole year. We compute the distributions of the demographic properties for the two aforementioned sets of users.
Table 10 and Table 11 respectively report the age distribution for the random sample and for the set of NASDAQ100 users. It is worth observing that the population of NASDAQ100 users contains a smaller fraction of old people. Altogether, of the NASDAQ100 users are people of working age, while this fraction is equal to in the other sample, which we assume to be a fair representative of the whole set of Yahoo! users.
As for gender, we observe that of the NASDAQ100 users are males, and are females. The random sample has of male users, and of females. Thus the set of users who searched NASDAQ100 tickers includes a slightly larger fraction of males.
For the country distribution, we get similar findings for the two sets of users. In both cases, the top five states which the users come from are California (), Texas (), New York (), Florida () and Illinois (). These fractions are expected, given that the aforementioned states are the most populated within the United States.
Acknowledgments
This research was supported by EU Grant FET Open Project 255987 “FOC”.
References
 1. Mitchell T. (2009) Mining our reality. Science 326: 1644.
 2. Vespignani A. (2009) Predicting the behavior of techno-social systems. Science 325: 425.
 3. Evans J., Rzhetsky A. (2010) Machine science. Science 329: 399.
 4. Lazer D., Pentland A., Adamic L., Aral S., Barabasi A. L., et al. (2009) Life in the network: the coming age of computational social science. Science 323: 721.
 5. Gonzalez M., Hidalgo C., Barabasi A. L. (2008) Understanding individual human mobility patterns. Nature 453: 479.
 6. Choi H., Varian H. (2009) Predicting the present with google trends. Technical report .
 7. Goel S., Hofman J., Lahaie S., Pennock D., Watts D. (2010) Predicting consumer behaviour with web search. Proc. Natl. Acad. Sci. USA 107: 17486.
 8. Golder S., Macy M. (2011) Diurnal and seasonal mood vary with work, sleep, and daylength across diverse cultures. Science 333: 1878.
 9. Preis T., Moat H. S., Stanley H. E., Bishop S. R. (2012) Quantifying the advantage of looking forward. Scientific Reports 2: 350.
 10. Ginsberg J., Mohebbi M., Patel R., Brammer L., Smolinski M., et al. (2009) Detecting influenza epidemics using search engine query data. Nature 457: 1012.
 11. Saavedra S., Hagerty K., Uzzi B. (2011) Synchronicity, instant messaging, and performance among financial traders. Proc. Natl. Acad. Sci. USA 108: 5296.
 12. Preis T., Reith D., Stanley H. E. (2010) Complex dynamics of our economic life on different scales: insights from search engine query data. Phil. Trans. R. Soc. A 368: 5707.
 13. Bollen J., Mao H., Zeng X. J. (2011) Twitter mood predicts the stock market. Journal of Computational Science 2: 1.
 14. Bouchaud J.-P. (2009) The (unfortunate) complexity of the economy. Physics World 04: 28.
 15. Haldane A. G., May R. M. (2011) Systemic risk in banking ecosystems. Nature 469: 351–355.
 16. Schweitzer F., Fagiolo G., Sornette D., Vega-Redondo F., Vespignani A., et al. (2009) Economic Networks: The New Challenges. Science 325: 422–425.
 17. Bouchaud J.-P. (2008) Economics needs a scientific revolution. Nature 455: 1181.
 18. Sitaram Asur, Huberman Bernardo A. (2010) Predicting the Future With Social Media. arXiv:1003.5699.
 19. Podobnik B., Horvatic D., Petersen A., Stanley H. E. (2009) Cross-correlations between volume change and price change. Proc. Natl. Acad. Sci. USA 106: 22079.
 20. Plerou V., Gopikrishnan P., Rosenow B., Amaral L., Stanley H. E. (2000) Econophysics: financial time series from a statistical physics point of view. Physica A 279: 443.
 21. Yamasaki K., Muchnik L., Havlin S., Bunde A., Stanley H. E. (2005) Scaling and memory in volatility return intervals in financial markets. Proc Natl Acad Sci USA 102: 9424.
 22. Onnela J.-P., Chakraborti A., Kaski K., Kertesz J., Kanto A. (2003) Asset trees and asset graphs in financial markets. Physica Scripta 106: 48.
 23. Onnela J.-P., Chakraborti A., Kaski K., Kertesz J. (2002) Dynamic asset trees and portfolio analysis. European Physical Journal B 30: 285.
 24. Garlaschelli D., Battiston S., Castri M., Servedio V.D.P., Caldarelli G. (2005) The scale-free topology of market investments. Physica A 350: 491.
 25. Cont R. (2001) Empirical properties of asset returns: stylized facts and statistical issues. Quantitative Finance 1:223.
 26. Liu Y., Gopikrishnan P., Cizeau P., Meyer M., Peng C.-K., Stanley H. E. (1999) Statistical properties of the volatility of price fluctuations. Phys. Rev. E 60: 1390.
 27. Bouchaud J.-P., Potters M. (2009) Theory of Financial Risk and Derivative Pricing: From Statistical Physics to Risk Management. Cambridge University Press, 2nd edition.
 28. Granger C. (1969) Investigating causal relations by econometric models and cross-spectral methods. Econometrica 37: 424.
 29. Easley D., Kleinberg J. (2010) Networks, Crowds, and Markets: Reasoning About a Highly Connected World. Cambridge University Press.
 30. Balcan D., Colizza V., Goncalves B., Hu H., Ramasco J., et al. (2009) Multiscale mobility networks and the spatial spreading of infectious diseases. Proc. Natl. Acad. Sci. USA 106: 21484.
 31. Pastor-Satorras R., Vespignani A. (2010) Patterns of complexity. Nature Physics 6: 480.
 32. Colizza V., Pastor-Satorras R., Vespignani A. (2007) Reaction-diffusion processes and metapopulation models in heterogeneous networks. Nature Physics 3: 276.
Figure Legends
Tables
Activision Blizzard (ATVI)  Adobe Systems Incorporated (ADBE)  Akamai Technologies, Inc (AKAM) 
Altera Corporation (ALTR)  Amazon.com, Inc. (AMZN)  Amgen Inc. (AMGN) 
Apollo Group, Inc. (APOL)  Apple Inc. (AAPL)  Applied Materials, Inc. (AMAT) 
Autodesk, Inc. (ADSK)  Automatic Data Processing, Inc. (ADP)  Baidu.com, Inc. (BIDU) 
Bed Bath & Beyond Inc. (BBBY)  Biogen Idec, Inc (BIIB)  BMC Software, Inc. (BMC) 
Broadcom Corporation (BRCM)  C. H. Robinson Worldwide, Inc. (CHRW)  CA, Inc. (CA) 
Celgene Corporation (CELG)  Cephalon, Inc. (CEPH)  Cerner Corporation (CERN) 
Check Point Software Technologies Ltd. (CHKP)  Cisco Systems, Inc. (CSCO)  Citrix Systems, Inc. (CTXS) 
Cognizant Tech. Solutions Corp. (CTSH)  Comcast Corporation (CMCSA)  Costco Wholesale Corporation (COST) 
Ctrip.com International, Ltd. (CTRP)  Dell Inc. (DELL)  Dentsply International Inc. (XRAY) 
DirecTV (DTV)  Dollar Tree, Inc. (DLTR)  eBay Inc. (EBAY) 
Electronic Arts Inc. (ERTS)  Expedia, Inc. (EXPE)  Expeditors Int. of Washington, Inc. (EXPD) 
Express Scripts, Inc. (ESRX)  F5 Networks, Inc. (FFIV)  Fastenal Company (FAST) 
First Solar, Inc. (FSLR)  Fiserv, Inc. (FISV)  Flextronics International Ltd. (FLEX) 
FLIR Systems, Inc. (FLIR)  Garmin Ltd. (GRMN)  Genzyme Corporation (GENZ) 
Gilead Sciences, Inc. (GILD)  Google Inc. (GOOG)  Henry Schein, Inc. (HSIC) 
Illumina, Inc. (ILMN)  Infosys Technologies (INFY)  Intel Corporation (INTC) 
Intuit, Inc. (INTU)  Intuitive Surgical Inc. (ISRG)  Joy Global Inc. (JOYG) 
KLA-Tencor Corporation (KLAC)  Lam Research Corporation (LRCX)  Liberty Media Corp., Int. Series A (LINTA) 
Life Technologies Corporation (LIFE)  Linear Technology Corporation (LLTC)  Marvell Technology Group, Ltd. (MRVL) 
Mattel, Inc. (MAT)  Maxim Integrated Products (MXIM)  Microchip Technology Incorporated (MCHP) 
Micron Technology, Inc. (MU)  Microsoft Corporation (MSFT)  Millicom International Cellular S.A. (MICC) 
Mylan, Inc. (MYL)  NetApp, Inc. (NTAP)  Netflix, Inc. (NFLX) 
News Corporation, Ltd. (NWSA)  NII Holdings, Inc. (NIHD)  NVIDIA Corporation (NVDA) 
O'Reilly Automotive, Inc. (ORLY)  Oracle Corporation (ORCL)  PACCAR Inc. (PCAR) 
Paychex, Inc. (PAYX)  Priceline.com, Incorporated (PCLN)  Qiagen N.V. (QGEN) 
QUALCOMM Incorporated (QCOM)  Research in Motion Limited (RIMM)  Ross Stores Inc. (ROST) 
SanDisk Corporation (SNDK)  Seagate Technology Holdings (STX)  Sears Holdings Corporation (SHLD) 
Sigma-Aldrich Corporation (SIAL)  Staples Inc. (SPLS)  Starbucks Corporation (SBUX) 
Stericycle, Inc (SRCL)  Symantec Corporation (SYMC)  Teva Pharmaceutical Industries Ltd. (TEVA) 
Urban Outfitters, Inc. (URBN)  VeriSign, Inc. (VRSN)  Vertex Pharmaceuticals (VRTX) 
Virgin Media, Inc. (VMED)  Vodafone Group, plc. (VOD)  Warner Chilcott, Ltd. (WCRX) 
Whole Foods Market, Inc. (WFMI)  Wynn Resorts Ltd. (WYNN)  Xilinx, Inc. (XLNX) 
Yahoo! Inc. (YHOO) 
Lag  -5  -4  -3  -2  -1  0  1  2  3  4  5

CCF  0.0067  0.0487  0.0507  0.0806  0.1510  0.3150  0.2367  0.0940  0.0675  0.0433  0.0197 
Lag  -5  -4  -3  -2  -1  0  1  2  3  4  5

CCF  0.0159  0.0629  0.0508  0.0455  0.0639  0.1196  0.1083  0.0561  0.0509  0.0299  0.0169 
Correlations are lower than the case in which we consider the queries deriving from the tickers (Table 2).
Lag  -5  -4  -3  -2  -1  0  1  2  3  4  5

CCF  0.0176  0.0604  0.0657  0.0993  0.1816  0.3641  0.2700  0.1145  0.0834  0.0540  0.0312 
By clean stocks we mean that we remove those stocks which give rise to spurious queries, such as those containing a common word like LIFE, or for instance the stock EBAY. In Tables S1 and S2 of Supporting Information we report the cross-correlation functions of the 87 stocks on which the average is performed.
Lag  -5  -4  -3  -2  -1  0  1  2  3  4  5

CCF  0.0078  0.0344  0.0501  0.0736  0.1482  0.3194  0.2349  0.0876  0.0623  0.0345  0.0151 
The results from the queries of Yahoo! users or from all searches (Table 4) are almost identical.
Ticker  -5  -4  -3  -2  -1  0  1  2  3  4  5 

ADBE  0.08  0.12  0.14  0.19  0.47  0.83  0.51  0.19  0.09  0.10  0.11 
CEPH  0.16  0.26  0.22  0.14  0.32  0.80  0.44  0.24  0.12  0.13  0.15 
APOL  0.02  0.06  0.10  0.21  0.43  0.79  0.55  0.22  0.12  0.07  0.03 
NVDA  0.23  0.36  0.38  0.46  0.56  0.79  0.68  0.47  0.42  0.38  0.29 
CSCO  0.04  0.07  0.13  0.36  0.53  0.74  0.63  0.34  0.26  0.17  0.12 
AKAM  0.04  0.06  0.03  0.07  0.22  0.72  0.49  0.20  0.11  0.02  0.01 
NFLX  0.10  0.16  0.16  0.24  0.47  0.68  0.54  0.25  0.19  0.16  0.13 
ISRG  0.07  0.13  0.18  0.21  0.38  0.67  0.64  0.29  0.20  0.11  0.05 
RIMM  0.03  0.12  0.11  0.14  0.31  0.66  0.58  0.24  0.20  0.11  0.05 
FFIV  0.06  0.06  0.13  0.21  0.35  0.65  0.56  0.33  0.21  0.14  0.13 
The values of the cross-correlation function for positive lags are always higher than the corresponding values for negative lags. From this evidence it appears that query volumes anticipate trading volumes by one or two days. See Tables S1 and S2 of Supporting Information for the complete results for the 87 clean stocks.
Ticker  All days  Top 5 removed  Top 10 removed

ADBE  0.83  0.51  0.32 
CEPH  0.80  0.32  0.24 
APOL  0.79  0.55  0.46 
NVDA  0.79  0.70  0.64 
CSCO  0.74  0.56  0.46 
AKAM  0.72  0.51  0.39 
NFLX  0.68  0.62  0.62 
ISRG  0.67  0.57  0.55 
RIMM  0.66  0.59  0.52 
FFIV  0.65  0.55  0.50 
We compute the cross-correlation coefficient between query and trading volumes after removing the days characterized by the highest trading volumes: the top five and the top ten events are removed, respectively. We note that a significant correlation is still observed for most of the stocks considered. This important test supports the robustness of our findings. See Tables S4 and S5 of Supporting Information for the complete results for the 87 clean stocks.
Volume  Price returns  Avg correlation 

searches  
searches  
searches  
users  
users  
users 
Dataset  lag (days)  Direction  Avg reduction in RSS  

Q (100 tickers)  1  Q → T  
Q (100 tickers)  1  T → Q  
U (100 tickers)  1  U → T  
U (100 tickers)  1  T → U  
Q (100 tickers)  2  Q → T  
Q (100 tickers)  2  T → Q  
U (100 tickers)  2  U → T  
U (100 tickers)  2  T → U  
Q (87 tickers)  1  Q → T  
Q (87 tickers)  1  T → Q  
U (87 tickers)  1  U → T  
U (87 tickers)  1  T → U  
Q (87 tickers)  2  Q → T  
Q (87 tickers)  2  T → Q  
U (87 tickers)  2  U → T  
U (87 tickers)  2  T → U 
Adding information about yesterday’s query volume reduces the average prediction error (in an autoregressive model) for today’s trade volume by about , and for half of the companies the reduction is statistically significant at .
Age Range  Fraction of Users 

Average age distribution for a random sample collecting half of the data
Age Range  Fraction of Users 

We observe some minor differences between the age of common users and the one of the users corresponding to queries belonging to NASDAQ100 sample.
Supporting Information
Appendix A Data Analysis and Results: all the NASDAQ100 stocks
In this section we report the complete results of the stocks on which the averages shown and discussed in the main paper are performed.
Ticker  -5  -4  -3  -2  -1  0  1  2  3  4  5 

AAPL  0.03  0.07  0.07  0.13  0.30  0.58  0.40  0.19  0.13  0.05  0.04 
ADBE  0.08  0.12  0.14  0.19  0.47  0.83  0.51  0.19  0.09  0.10  0.11 
ADP  0.19  0.21  0.23  0.16  0.15  0.15  0.18  0.19  0.15  0.14  0.15 
ADSK  0.16  0.15  0.03  0.01  0.09  0.19  0.26  0.03  0.09  0.04  0.09 
AKAM  0.04  0.06  0.03  0.07  0.22  0.72  0.49  0.20  0.11  0.02  0.01 
ALTR  0.34  0.40  0.40  0.37  0.42  0.55  0.53  0.39  0.41  0.37  0.38 
AMAT  0.05  0.04  0.03  0.03  0.05  0.10  0.15  0.04  0.01  0.07  0.10 
AMGN  0.02  0.03  0.03  0.10  0.19  0.36  0.35  0.18  0.14  0.11  0.02 
AMZN  0.07  0.03  0.04  0.02  0.13  0.48  0.43  0.04  0.02  0.01  0.02 
APOL  0.02  0.06  0.10  0.21  0.43  0.79  0.55  0.22  0.12  0.07  0.03 
ATVI  0.04  0.05  0.10  0.16  0.27  0.39  0.39  0.22  0.23  0.18  0.10 
BBBY  0.04  0.19  0.13  0.12  0.21  0.43  0.39  0.14  0.09  0.07  0.14 
BIDU  0.10  0.09  0.12  0.19  0.32  0.49  0.42  0.16  0.11  0.05  0.04 
BIIB  0.06  0.09  0.13  0.10  0.21  0.59  0.23  0.20  0.10  0.07  0.09 
BMC  0.05  0.14  0.08  0.19  0.04  0.17  0.20  0.21  0.18  0.10  0.12 
BRCM  0.02  0.02  0.04  0.09  0.22  0.53  0.45  0.15  0.07  0.05  0.01 
CELG  0.00  0.03  0.01  0.02  0.20  0.65  0.29  0.03  0.05  0.04  0.05 
CEPH  0.16  0.26  0.22  0.14  0.32  0.80  0.44  0.24  0.12  0.13  0.15 
CHKP  0.03  0.06  0.04  0.07  0.06  0.09  0.03  0.02  0.05  0.03  0.01 
CHRW  0.03  0.12  0.07  0.05  0.00  0.16  0.23  0.07  0.05  0.06  0.06 
CMCSA  0.20  0.16  0.15  0.16  0.10  0.02  0.05  0.12  0.12  0.13  0.11 
CSCO  0.04  0.07  0.13  0.36  0.53  0.74  0.63  0.34  0.26  0.17  0.12 
CTRP  0.02  0.08  0.01  0.11  0.19  0.57  0.26  0.06  0.03  0.04  0.06 
CTSH  0.06  0.02  0.07  0.11  0.15  0.38  0.12  0.07  0.06  0.01  0.05 
CTXS  0.11  0.15  0.14  0.18  0.26  0.55  0.35  0.14  0.14  0.10  0.06 
DLTR  0.07  0.16  0.17  0.14  0.23  0.42  0.25  0.24  0.15  0.07  0.04 
DTV  0.04  0.02  0.06  0.09  0.05  0.03  0.03  0.05  0.05  0.03  0.01 
ERTS  0.06  0.17  0.22  0.24  0.34  0.62  0.53  0.18  0.05  0.02  0.02 
ESRX  0.14  0.23  0.17  0.21  0.21  0.43  0.31  0.17  0.16  0.09  0.05 
EXPD  0.01  0.07  0.08  0.09  0.24  0.37  0.31  0.22  0.18  0.16  0.11 
EXPE  0.10  0.14  0.13  0.16  0.27  0.52  0.40  0.17  0.17  0.19  0.15 
FFIV  0.06  0.06  0.13  0.21  0.35  0.65  0.56  0.33  0.21  0.14  0.13 
FISV  0.03  0.01  0.05  0.08  0.08  0.28  0.12  0.02  0.02  0.06  0.11 
FLIR  0.13  0.11  0.10  0.15  0.14  0.13  0.09  0.12  0.16  0.16  0.16 
FSLR  0.01  0.05  0.02  0.13  0.29  0.55  0.44  0.18  0.12  0.02  0.01 
GILD  0.03  0.13  0.13  0.12  0.13  0.18  0.14  0.06  0.08  0.03  0.02 
GOOG  0.09  0.03  0.00  0.01  0.04  0.02  0.09  0.02  0.05  0.01  0.07 
GRMN  0.07  0.10  0.07  0.05  0.23  0.46  0.24  0.12  0.09  0.07  0.08 
HSIC  0.20  0.18  0.13  0.10  0.04  0.07  0.02  0.07  0.00  0.03  0.16 
ILMN  0.01  0.06  0.12  0.16  0.20  0.40  0.39  0.31  0.27  0.21  0.14 
INFY  0.05  0.09  0.02  0.06  0.14  0.53  0.20  0.06  0.10  0.03  0.00 
INTC  0.07  0.04  0.00  0.05  0.18  0.44  0.40  0.14  0.09  0.05  0.03 
INTU  0.08  0.10  0.10  0.07  0.00  0.31  0.22  0.10  0.03  0.05  0.10 
ISRG  0.07  0.13  0.18  0.21  0.38  0.67  0.64  0.29  0.20  0.11  0.05 
JOYG  0.00  0.05  0.13  0.10  0.17  0.27  0.13  0.09  0.06  0.09  0.05 
KLAC  0.36  0.40  0.40  0.46  0.45  0.43  0.49  0.46  0.47  0.43  0.39 
LINTA  0.06  0.04  0.02  0.00  0.04  0.04  0.01  0.01  0.02  0.04  0.06 
LLTC  0.16  0.22  0.18  0.22  0.32  0.39  0.32  0.21  0.13  0.10  0.12 
LRCX  0.01  0.00  0.02  0.04  0.17  0.24  0.20  0.16  0.14  0.03  0.00 
MAT  0.06  0.21  0.07  0.10  0.09  0.04  0.06  0.03  0.02  0.05  0.02 
MCHP  0.22  0.21  0.22  0.23  0.23  0.24  0.32  0.25  0.18  0.14  0.10 
MICC  0.04  0.10  0.06  0.14  0.16  0.21  0.17  0.06  0.04  0.02  0.03 
MRVL  0.06  0.09  0.02  0.06  0.12  0.40  0.37  0.02  0.01  0.03  0.00 
MSFT  0.09  0.02  0.06  0.02  0.17  0.42  0.35  0.02  0.05  0.04  0.09 
MU  0.13  0.03  0.05  0.07  0.06  0.05  0.05  0.10  0.08  0.07  0.15 
MXIM  0.11  0.09  0.19  0.18  0.22  0.29  0.11  0.04  0.03  0.03  0.01 
MYL  0.10  0.07  0.07  0.10  0.11  0.07  0.07  0.06  0.07  0.04  0.01 
NFLX  0.10  0.16  0.16  0.24  0.47  0.68  0.54  0.25  0.19  0.16  0.13 
NIHD  0.10  0.11  0.20  0.24  0.30  0.56  0.34  0.25  0.15  0.11  0.09 
NTAP  0.06  0.02  0.01  0.06  0.26  0.61  0.46  0.18  0.09  0.09  0.11 
NVDA  0.23  0.36  0.38  0.46  0.56  0.79  0.68  0.47  0.42  0.38  0.29 
NWSA  0.04  0.03  0.03  0.10  0.01  0.06  0.09  0.04  0.04  0.03  0.08 
ORCL  0.09  0.17  0.09  0.07  0.23  0.52  0.43  0.13  0.16  0.10  0.03 
PAYX  0.06  0.08  0.00  0.05  0.04  0.04  0.03  0.00  0.00  0.06  0.03 
PCAR  0.04  0.14  0.14  0.15  0.16  0.27  0.28  0.14  0.14  0.15  0.06 
PCLN  0.10  0.04  0.03  0.01  0.20  0.51  0.37  0.06  0.01  0.06  0.06 
QCOM  0.15  0.11  0.12  0.06  0.09  0.24  0.15  0.06  0.10  0.09  0.14 
QGEN  0.09  0.09  0.06  0.11  0.09  0.35  0.31  0.15  0.13  0.10  0.21 
RIMM  0.03  0.12  0.11  0.14  0.31  0.66  0.58  0.24  0.20  0.11  0.05 
ROST  0.22  0.12  0.15  0.11  0.17  0.08  0.12  0.10  0.13  0.20  0.16 
SBUX  0.08  0.03  0.08  0.09  0.19  0.41  0.25  0.18  0.11  0.06  0.04 
SHLD  0.10  0.14  0.11  0.22  0.21  0.38  0.26  0.17  0.15  0.15  0.07 
SIAL  0.05  0.00  0.02  0.03  0.05  0.05  0.00  0.01  0.01  0.01  0.02 
SNDK  0.04  0.02  0.11  0.23  0.30  0.45  0.37  0.09  0.13  0.11  0.01 
SPLS  0.19  0.17  0.17  0.17  0.04  0.11  0.02  0.15  0.16  0.18  0.11 
SRCL  0.05  0.02  0.04  0.01  0.12  0.27  0.24  0.21  0.08  0.05  0.05 
STX  0.11  0.23  0.16  0.20  0.24  0.37  0.31  0.13  0.03  0.05  0.01 
SYMC  0.00  0.02  0.11  0.17  0.25  0.58  0.44  0.21  0.14  0.04  0.04 
TEVA  0.15  0.17  0.23  0.24  0.29  0.40  0.24  0.21  0.17  0.14  0.11 
URBN  0.00  0.08  0.10  0.09  0.17  0.37  0.32  0.14  0.10  0.01  0.01 
VMED  0.09  0.14  0.13  0.13  0.12  0.09  0.09  0.13  0.09  0.08  0.12 
VOD  0.10  0.10  0.07  0.10  0.11  0.17  0.15  0.13  0.17  0.15  0.03 
VRSN  0.00  0.05  0.01  0.18  0.44  0.56  0.40  0.26  0.22  0.18  0.16 
VRTX  0.02  0.14  0.32  0.42  0.30  0.50  0.24  0.07  0.19  0.14  0.16 
WCRX  0.05  0.06  0.06  0.07  0.17  0.51  0.23  0.11  0.05  0.05  0.01 
WFMI  0.00  0.05  0.01  0.06  0.23  0.45  0.31  0.03  0.03  0.06  0.06 
YHOO  0.06  0.15  0.15  0.16  0.23  0.38  0.25  0.02  0.00  0.02  0.03 
The values of the crosscorrelation function for positive time lags are on average larger than the values for negative time lags. In fact, considering only the stocks for which (there are 8 stocks for which ) we observe that for 68 stocks it holds that while for the remaining 11 stocks we observe .
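The comparison between lags summarized in this caption can be sketched in code. The following is a minimal illustration, not the authors' implementation: it assumes that the crosscorrelation at lag d is the Pearson correlation between the query volume on day i and the trading volume on day i + d, so that a positive lag means queries lead trading. The function names and the synthetic series are hypothetical.

```python
import numpy as np

def lagged_correlation(q, t, lag):
    """Pearson correlation between q[i] and t[i + lag] (assumed sign convention)."""
    q = np.asarray(q, dtype=float)
    t = np.asarray(t, dtype=float)
    if lag >= 0:
        a, b = q[:len(q) - lag], t[lag:]      # pair q[i] with t[i + lag]
    else:
        a, b = q[-lag:], t[:len(t) + lag]     # pair q[i] with an earlier t
    return float(np.corrcoef(a, b)[0, 1])

def cross_correlation_row(q, t, max_lag=5):
    """One table row: the correlation at every lag from -max_lag to +max_lag."""
    return {lag: lagged_correlation(q, t, lag)
            for lag in range(-max_lag, max_lag + 1)}

# Synthetic data in which trading volume follows query volume by one day.
rng = np.random.default_rng(0)
q = rng.normal(size=300)
t = np.roll(q, 1) + 0.5 * rng.normal(size=300)   # t[i] is close to q[i - 1]
row = cross_correlation_row(q, t)
best_lag = max(row, key=row.get)                  # lag with the largest correlation
```

On this synthetic series the largest value falls at lag +1, which is the pattern the caption describes for most stocks: correlations on the positive-lag side exceed those on the negative-lag side.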
Ticker  -5  -4  -3  -2  -1  0  1  2  3  4  5

CERN  0.03  0.02  0.03  0.02  0.00  0.03  0.01  0.00  0.03  0.02  0.06 
COST  0.28  0.21  0.18  0.19  0.20  0.17  0.06  0.07  0.16  0.10  0.11 
DELL  0.05  0.04  0.01  0.01  0.11  0.11  0.05  0.04  0.05  0.02  0.07 
EBAY  0.08  0.07  0.10  0.16  0.21  0.18  0.10  0.18  0.17  0.20  0.20 
FAST  0.14  0.12  0.12  0.12  0.06  0.08  0.07  0.14  0.11  0.10  0.13 
FLEX  0.12  0.05  0.15  0.19  0.20  0.09  0.08  0.16  0.20  0.18  0.18 
LIFE  0.07  0.01  0.04  0.01  0.09  0.08  0.05  0.06  0.01  0.11  0.11 
ORLY  0.01  0.01  0.04  0.00  0.04  0.11  0.14  0.07  0.09  0.07  0.10 
WYNN  0.01  0.03  0.06  0.08  0.02  0.02  0.10  0.03  0.00  0.04  0.08 
XLNX  0.03  0.00  0.02  0.06  0.09  0.14  0.14  0.03  0.03  0.12  0.03 
XRAY  0.12  0.12  0.22  0.18  0.18  0.10  0.08  0.11  0.05  0.07  0.12 
Most of the query volume associated with these tickers can be traced back to a non-financial origin.
Ticker  All days  Top 5 removed  Top 10 removed

AAPL  0.5826  0.4769  0.4481 
ADBE  0.8326  0.5196  0.3137 
ADP  0.1456  0.1246  0.1120 
ADSK  0.1933  0.1966  0.1795 
AKAM  0.7243  0.4893  0.4059 
ALTR  0.5546  0.5229  0.4956 
AMAT  0.1014  0.1145  0.0732 
AMGN  0.3563  0.3165  0.3138 
AMZN  0.4838  0.3356  0.1784 
APOL  0.7927  0.5547  0.4614 
ATVI  0.3854  0.3291  0.2410 
BBBY  0.4300  0.2963  0.2290 
BIDU  0.4891  0.3355  0.3001 
BIIB  0.5877  0.3449  0.3320 
BMC  0.1676  0.1508  0.1600 
BRCM  0.5342  0.2219  0.2338 
CELG  0.6508  0.3171  0.1942 
CEPH  0.7959  0.3208  0.2339 
CHKP  0.0939  0.0838  0.0808 
CHRW  0.1619  0.0559  0.0530 
CMCSA  0.0242  0.0299  0.0456 
CSCO  0.7352  0.5614  0.5014 
CTRP  0.5659  0.3203  0.2963 
CTSH  0.3791  0.2344  0.1756 
CTXS  0.5522  0.3525  0.2897 
DLTR  0.4243  0.3567  0.2830 
DTV  0.0308  0.0860  0.1069 
ERTS  0.6190  0.4764  0.3225 
ESRX  0.4319  0.3371  0.2189 
EXPD  0.3749  0.3186  0.3048 
EXPE  0.5177  0.3473  0.2712 
FFIV  0.6534  0.5410  0.5034 
FISV  0.2754  0.0568  0.0589 
FLIR  0.1267  0.1959  0.1932 
FSLR  0.5464  0.4577  0.4020 
GILD  0.1775  0.1901  0.2013 
GOOG  0.0199  0.0440  0.1211 
GRMN  0.4564  0.2749  0.2763 
HSIC  0.0706  0.0198  0.0053 
ILMN  0.4004  0.3020  0.3062 
INFY  0.5338  0.1080  0.0469 
INTC  0.4357  0.3178  0.3067 
INTU  0.3096  0.0262  0.0665 
ISRG  0.6683  0.5432  0.5590 
JOYG  0.2660  0.2147  0.1841 
KLAC  0.4307  0.4260  0.4305 
LINTA  0.0446  0.0066  0.0156 
LLTC  0.3896  0.3286  0.2471 
LRCX  0.2424  0.2749  0.2157 
MAT  0.0441  0.0104  0.1008 
MCHP  0.2411  0.1850  0.2042 
MICC  0.2099  0.1548  0.1556 
MRVL  0.3966  0.2554  0.2236 
MSFT  0.4216  0.3808  0.3361 
MU  0.0458  0.0440  0.0411 
MXIM  0.2948  0.2009  0.1671 
MYL  0.0665  0.0871  0.1243 
NFLX  0.6757  0.6314  0.6253 
NIHD  0.5553  0.3644  0.2925 
NTAP  0.6102  0.4173  0.2906 
NVDA  0.7856  0.6866  0.6481 
NWSA  0.0620  0.0729  0.0794 
ORCL  0.5156  0.3493  0.3218 
PAYX  0.0365  0.1071  0.1005 
PCAR  0.2725  0.1737  0.1798 
PCLN  0.5091  0.3054  0.2211 
QCOM  0.2444  0.0681  0.0853 
QGEN  0.3508  0.2262  0.2092 
RIMM  0.6587  0.5946  0.5564 
ROST  0.0847  0.1247  0.1385 
SBUX  0.4095  0.3263  0.2085 
SHLD  0.3826  0.3706  0.3563 
SIAL  0.0475  0.0053  0.0396 
SNDK  0.4510  0.3761  0.3404 
SPLS  0.1144  0.0184  0.0031 
SRCL  0.2695  0.1365  0.1023 
STX  0.3738  0.2979  0.2242 
SYMC  0.5761  0.3703  0.4122 
TEVA  0.4005  0.2934  0.3379 
URBN  0.3714  0.2841  0.2409 
VMED  0.0938  0.1070  0.0922 
VOD  0.1682  0.1599  0.1100 
VRSN  0.5551  0.3389  0.3199 
VRTX  0.5007  0.2135  0.1679 
WCRX  0.5106  0.3447  0.1688 
WFMI  0.4544  0.2279  0.1042 
YHOO  0.3750  0.2145  0.1299 
We compute the crosscorrelation coefficient between query and trading volumes after removing the days with the highest trading volumes: the top five and the top ten events, respectively. A significant correlation is still observed for most of the stocks considered.
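The robustness check described in this caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: it assumes that "removing the top events" means dropping the k days with the largest trading volume from both series and then computing the zero-lag Pearson correlation on the remaining days. The function name and the synthetic data are hypothetical.

```python
import numpy as np

def correlation_without_top_days(q, t, k):
    """Zero-lag Pearson correlation after dropping the k highest-volume trading days."""
    q = np.asarray(q, dtype=float)
    t = np.asarray(t, dtype=float)
    order = np.argsort(t)                 # day indices sorted by trading volume
    keep = order[: len(t) - k]            # every day except the k largest
    return float(np.corrcoef(q[keep], t[keep])[0, 1])

# Synthetic data: independent noise plus three shared spikes. Here the
# correlation is driven entirely by the spikes, so removing them destroys it.
rng = np.random.default_rng(1)
q = rng.normal(size=200)
t = rng.normal(size=200)
spike_days = [20, 90, 150]
q[spike_days] += 20.0
t[spike_days] += 20.0

full = correlation_without_top_days(q, t, 0)     # all days kept
top5 = correlation_without_top_days(q, t, 5)     # five largest days removed
top10 = correlation_without_top_days(q, t, 10)   # ten largest days removed
```

The synthetic series is the adverse case in which a few extreme days produce the whole correlation. The table above shows the opposite behavior for most stocks: the coefficient decreases after removal but remains clearly positive.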
Appendix B Beyond Granger Tests: Tables
Ticker  CCF  

atvi  
csco  
expe  
ilmn  
isrg  
nflx  
nvda  
rimm  
altr  
msft  
symc  
mrvl  
orcl  
erts  
amgn  
ffiv  
ntap  
bbby  
apol  
amzn  
urbn  
vrtx  
adbe  
qgen  
chrw  
stx 
Ticker  

dltr  
mxim  
lltc  
rost  
cmcsa  
vrsn  
infy  
flir  
vmed  
intu 
Ticker  Outcome 

aapl  
adbe  
adp  
akam  
altr  
amgn  
amzn  
apol  
atvi  
bbby  
bidu  
biib  
brcm  
celg  
ceph  
csco  
ctrp  
ctxs  
dltr  
erts  
esrx  
expd  
expe  
ffiv  
fslr  
gild  
grmn  
ilmn  
infy  
intc  
isrg  
klac  
lrcx  
mchp  
micc  
mrvl  
msft  
nflx  
nihd  
ntap  
nvda  
orcl  
pcar  
pcln  
rimm  
sbux  
shld  
sndk  
srcl  
stx  
symc  
urbn  
wcrx  
wfmi  
yhoo 
Ticker  Outcome 

joyg  
lltc  
rost  
teva  
vrsn  
vrtx 
Ticker  Ticker  

aapl  joyg  
adbe  klac  
adp  linta  
adsk  lltc  
akam  lrcx  
altr  mat  
amat  mchp  
amgn  micc  
amzn  mrvl  
apol  msft  
atvi  mu  
bbby  mxim  
bidu  myl  
biib  nflx  
bmc  nihd  
brcm  ntap  
celg  nvda  
ceph  nwsa  
chkp  orcl  
chrw  payx  
cmcsa  pcar  
csco  pcln  
ctrp  qcom  
ctsh  qgen  
ctxs  rimm  
dltr  rost  
dtv  sbux  
erts  shld  
esrx  sial  
expd  sndk  
expe  spls  
ffiv  srcl  
fisv  stx  
flir  symc  
fslr  teva  
gild  urbn  
goog  vmed  
grmn  vod  
hsic  vrsn  
ilmn  vrtx  
infy  wcrx  
intc  wfmi  
intu  yhoo  
isrg 