A Unified Framework for Long Range and Cold Start Forecasting of Seasonal Profiles in Time Series
Abstract
Providing longrange forecasts is a fundamental challenge in time series modeling, which is only compounded by the challenge of having to form such forecasts when a time series has never previously been observed. The latter challenge is the time series version of the coldstart problem seen in recommender systems which, to our knowledge, has not been directly addressed in previous work. In addition, modern time series datasets are often plagued by missing data. We focus on forecasting seasonal profiles—or baseline demand—for periods on the order of a year long, even in the coldstart setting or with otherwise missing data. Traditional time series approaches that perform iterated stepahead methods struggle to provide accurate forecasts on such problems, let alone in the missing data regime. We present a computationally efficient framework which combines ideas from highdimensional regression and matrix factorization on a carefully constructed data matrix. Key to our formulation and resulting performance is (1) leveraging repeated patterns over fixed periods of time and across series, and (2) metadata associated with the individual series. We provide analyses of our framework on large messy realworld datasets.
1 Introduction
Large collections of time series are now commonplace in many domains. Examples include environmental monitoring based on sensors at millions of locations, product demand and purchase curves for millions of products, and web traffic over time for billions of websites. From such large, messy data sets, we focus on the prediction challenge of forming longrange forecasts:

LongRange Forecasting. How can we forecast demand for a product in the coming year, so as to appropriately allocate inventory?
Our particular interest is in full seasonal forecasts on the order of a year, whereas traditional time series methods, like autoregressive (AR) based models (e.g. vector AR and seasonal AR), focus on shortterm stepahead forecasts. Unfortunately, iterating stepahead forecasts to form longrange forecasts is prone to an accumulation of error with the length of the forecast window. Furthermore, these past methods struggle to scale to our highdimensional settings of interest without making overly simplistic assumptions about the relationships between series.
The challenge of longrange forecasts is only compounded when having to form such forecasts in the presence of missing data—a common scenario in large, realworld time series—or in the more challenging task of forecasting a previously unobserved time series:

Missing data. Sensors or websites may temporarily go down. Products may be temporarily unavailable. How can we be robust to and/or impute such missing data?

ColdStart Forecasting. Products on Amazon are introduced everyday, and old products are taken down. New websites are created every minute and old ones disappear. How can we provide a longrange forecast of demand for this new product/website?
Again, traditional time series models are not robust to such missingness, especially when the missingness is structured (e.g., continguous segments are not present). To our knowledge, the coldstart challenge for time series has not been directly addressed in previous work.
When discussing the goal of forecasting, there are many features of time series one might be interested in predicting. For example, the recent paper of [1] addressed the challenge of intermittent demand. There is also the challenge of predicting a spike in activity based on an unforeseeable event (e.g., movie release, news event, etc.), or yearly trend increase (e.g. growing popularity of a social media platform). We instead emphasize forecasting seasonal profiles—or baseline demand—based on patterns observed both across fixed periods of time and series.
In order to perform such forecasts, we focus on a few characteristics common to time series datasets. Many products, websites, and sensors show similar seasonal profiles and reactions to unobserved latent trends, resulting in shared lowdimensional structure and seasonality. A number of authors have proposed applying matrix factorization techniques to collections of time series [2, 3, 4, 5, 6] to either provide retrospective analysis of shared latent structure or form shortterm forecasts. However, these approaches ignore the coldstart challenge and do not directly nor robustly handle longrange forecasts. In order to address these latter challenges, we can leverage metadata features, which is common to many large datasets. For example, products have user reviews and product descriptions, sensors have locations and proximities to different points of interest, and websites have content and network information. Prediction of time series curves from features, or covariates, has traditionally been studied in the field of functional data analysis (FDA) [7, 8]. Most uses of FDA relevant to our approach are only developed for a very small number of covariates—however, our metadata vectors are on the order of thousands of dimensions. More generally, such methods struggle to scale to the massive size of the datasets of interest to us.
To harness the above characteristics of time series to address prediction challenges (P1)(P3), we propose a computationally efficient framework that combines ideas of highdimensional regression and matrix factorization. Our approach naturally tackles prediction of new series and missing data imputation using both highdimensional metadata and shared seasonality structure. The regression component of our model seeks to predict entire yearly profiles from high dimensional metadata vectors. This component allows us to form coldstart predictions (P3) in addition to yearlong forecasts for previously seen items (P1). The matrix factorization component captures lowrank structure in the residuals unexplained by the metadata predictions. Both ingredients leverage the shared seasonal profiles across years for a single item and across items by exploiting common relationships between (i) the metadata vector and the seasonal profiles and (ii) the mean deviations captured in the residuals.
Within our modular framework, one can propose different regression functions to tailor the framework to the dataset. To this end, we examine a few different regression functions motivated by realworld scenarios. For example, when the metadata is highdimensional, we introduce a lowrank regression structure, and show this structure is crucial to our predictive performance. We explore the potential benefits and shortcomings for each of our considered cases in an analysis of two large data sets—page traffic for popular Wikipedia articles and Google flu trends for regions around the world—that shows our straightforward approach efficiently and effectively addresses the problem of forecasting seasonal profiles.
2 Background
2.1 Matrix Factorization for Time Series
Let denote a matrix of time series where each column, , represents a single length time series. Matrix factorization (MF) in this setting approximates with a lowrank product of two matrices , where denotes the latent rank of the approximation. and may be computed by solving the general problem
(1) 
where and are regularization terms for and , respectively. The columns of represent latent time series features and the rows of are the feature loadings for each series. If entries in are missing, the learned latent series and loadings may be used to impute the missing values. Missing data imputation using matrix factorization is referred to as collaborative filtering [9].
Variations of Problem (1) have appeared in the literature. [5] use a penalty on to encourage an autoregressive model of the latent time series factors. Smoothness of the latent factors across time may also be enforced via a Gaussian process prior [2]. Both [3] and [6] perform nonnegative matrix factorization on where both . This approach is used for interpretability of the latent factors and factor loadings. Since our goal is prediction, we consider unrestricted and matrices.
2.2 Leveraging metadata in prediction
Collaborative filtering is greatly improved by leveraging metadata [9, 10]. For example, in the coldstart setting when a user has not rated any items, user metadata may be used to help predict ratings. Linear regression on the user and item metadata is added to the matrix factorization objective to obtain coldstart predictions [11, 12, 13, 10], and leads to improved performance.
In the time series context, one may have metadata or features about each series. Prediction of entire time series curves from features has been classically studied using functional regression [7]. Here one predicts the value of time series at time as a linear combination of the length feature vectors
(2) 
where is the coefficient at time , the covariate for series , and is mean zero noise. To share information across time, the coefficients are assumed to vary smoothly. One common approach for enforcing smoothness uses basis functions. See the Supplement for details.
In typical time varying regression settings, the dimensionality of the predictors is small [7]. In contrast, we consider time series prediction and forecasting in settings where the available metadata is large (e.g. on the order of tens of thousands). In this setting, the number of potential regression coefficients is also extremely large, . Below, we introduce methods to deal with such high dimensional metadata in time series prediction. Our approach also marries the standard time varying regression of Eq. (2) with the matrix factorization decomposition of multiple time series in Eq. (1) after a careful data matrix manipulation, to provide a unified approach to entire time series forecasting and missing data imputation.
3 Proposed Framework
3.1 Setup
Consider a multivariate time series with number of series and total time points . In order to leverage repeated patterns across fixed periods and series, we treat each year of each time series as a column in our data matrix. Note that this is in contrast to [3] who treated columns as an average of years of a series. Thus, we form a new data matrix
(3) 
where denotes the year of the time series, denotes the number of years of data for series . denotes the number of observations per year (e.g. 52 for weekly data, 365 for daily data) and is the number of observed years, i.e. . The reorganized matrix has dimensions , see Figure 1. We perform all modeling with this data matrix. We assume that each is accompanied by the same metadata vector for . While we assume that a year is the length of a seasonal period, any relevant length can be used.
3.2 Model Framework
Our analysis harnesses both the common seasonal profiles across years for a given product and across items for all years. Given a matrix as in Eq. (3), we develop a series of models that leverage datadriven discovery of lowdimensional structure via matrix factorization with various approaches for incorporating highdimensional metadata. These methods are summarized in Table 1.
To fill in missing data (P2), one straightforward approach is matrix factorization, represented by the matrices in Table 1 in red, with . Here, is the latent dimension. This approach leverages an inherent lowrank structure due to similar time series or multiple similar years of the same time series (i.e., columns of the data matrix ). Note that the reorganization of the data matrix in Eq. (3) encourages such lowrank structure. can be viewed as a factor loadings matrix, where each column is a latent time series. can be viewed as a latent factor matrix where column is the latent factor vector (i.e., learned weights on the latent time series) for time series .
However, when we encounter a new time series or want to predict the next year of an existing time series (unobserved column in the matrix ), we have no data to inform the corresponding latent factor and regularization will typically push the latent factor to zero. To combat this, we introduce a regression component with parameters to allow metadata, for series , to inform predictions. Importantly, the value of introducing the regression term depends on how predictive the metadata is at capturing seasonal features of the series. In particular, there must be shared correlations between features of the metadata across series and features of the seasonal profiles. As we see in our experiments, this is often the case in realworld datasets. We explore a number of possible formulations for the regression term, as outlined below:
Simple  LowRank  Functional  Neural Network  

No MF  
MF 
Approach 1: Simple regression
The simplest regression approach is to introduce a full weightings matrix as in multioutput regression. See Table 1.
Approach 2: Lowrank regression
When the metadata dimensionality is high, the matrix becomes very large. To help reduce the parameterization, we perform a lowrank approximation to the weight matrix, resulting in the matrices in Table 1, with . Here, is the assumed rank with . The columns of serve as timevarying weights, while represents a lowdimensional representation of the metadata features. This structure is not only beneficial computationally, but also because of the lowrank structure present in the data matrix.
Approach 3: Smoothly varying latent factors
When the dataset exhibits temporal regularity such as smoothness, we can enhance our longrange forecasts by restricting the nature of its evolution. We achieve this by introducing a functional regression approach as shown in Table 1. The matrix is a Bspline matrix consisting of knots, resulting in basis functions (no intercept). is a matrix of weights for each of the functional coefficients. One can view these models as using the smoothness assumption to fix in the lowrank regression models to a smooth basis matrix (note that we typically choose ).
Approach 4: Neural network
Since the regression component is a modular piece in our framework, we can introduce an arbitrary function approximator in the form of a neural network which allows for nonlinear featuretooutput mappings. Many different architectures can be explored here.
We also consider models that only utilize the regression component without the matrix factorization, shown in the first row of Table 1. When paired with the regression component, the matrix factorization term can be viewed as structured noise (lowrank covariance) that captures extra structure in the error residuals not captured by the regression. Additionally, for all considered models, we include an unregularized bias term (by appending with a 1) and additive noise .
Note that, as presented, our framework makes the following assumptions. First, it assumes that the baseline mean profile as a function of the metadata is the same across years. This occurs since we assume is constant across years ; for applications where metadata is available at each year, this may be relaxed to obtain changing baseline demand. Second, shared deviations from this mean profile are captured by lowrank residual errors.
3.3 Optimization Objectives
Our goal is to fit these models in the missing data setting. Let be the set of observed data indices of the data matrix in Eq. (3). To fit each of these models, we compute regularized maximum likelihood estimates of the parameters. In our framework, this amounts to solving this optimization problem:
(4) 
where is a regularization function. As an example, for the MF + lowrank regression model, we use the squared Frobenius norm to regularize and , resulting in the objective:
For the nonMF models we fix and for all . For optimization, we use gradient or stochastic gradient descent. See the Supplement for details on all models.
One could have addressed our set of challenges with a multistage approach that first imputes missing values with matrix factorization and then learns the regression component for forecasting with imputed data. However, jointly learning the regression and latent residual factors in the presence of missing data leads to a cleaner and more efficient estimation method. We empirically demonstrate the strength of our joint approach versus this multistage approach in the Supplement.
3.4 Forming Predictions
To predict with the MF models on the longrange (P1) and coldstart (P3) prediction challenges, we use the learned regression component only to forecast since the latent factors are uninformed by data and driven to zero by the optimization objective. As a result, prediction with an MF model from our framework and its corresponding model without MF utilizes the same set of parameters, although these parameters were learned differently (one in the presence of the MF term and the other not). For the missing data challenge (P2), the prediction involves all model parameters.
4 Experiments
We evaluate the trade offs in performance of the proposed methods on all three challenges on real world datasets. We describe each experimental setup for the prediction challenges:

LongRange Forecasting: The training set includes multiple previous years and metadata. The test set contains the final year for each series. We randomly remove 20% of the training data to make the learning even more difficult. Our goal is to forecast this last year for all series (i.e., entire dotted blue, green, and red lines shown in the right hand side of Figure 1).

Missing data: The test set consists of contiguous chunks of missing elements in the data matrix representing, for example, sensors temporarily failing. The training set contains the remaining observations in and all metadata. Our goal is to impute the missing elements (i.e., portions of dotted lines embedded in the solid lines in the right hand side of Figure 1).

ColdStart Forecasting: The test set consists of entirely heldout time series and associated metadata; none of the series that comprise our training set share this metadata. We also randomly remove 20% of the training data. Our goal is to forecast a year for new time series (i.e., entire dotted orange line in Figure 1) solely from metadata.
For each experiment, we run 5fold cross validation over a grid to select the model hyperparameters. Due to the nonconvexity of some of our loss functions, we run random restarts; this computationally intensive process leads us to perform an approximation to full grid search described in the Supplement.
Dataset  # series ()  # columns of ()  period ()  metadata dim () 

Google Flu Trends  311  3197  52  3356 
Wikipedia Page Traffic  4031  29093  365  22193 
4.1 Analysis of Real World Datasets
We explore our framework on prediction tasks (P1)(P3) arising in two very different datasets: Google Flu Trends (Flu) and Wikipedia page traffic (Wiki). The Flu dataset consists of weekly estimates of influenza rates in 311 worldwide regions [14]. To create our Wiki dataset, we scraped multiple years of daily page traffic data from 4031 of the most popular Wikipedia articles. We detrend the data using the method of [15]. For both datasets, our metadata is curated by scraping the relevant Wikipedia summaries—either from the geographic region’s page (Flu) or of the popular page itself (Wiki)—and calculating TFIDF representations of the text after standard preprocessing. We standardize the time series within each dataset to put them on the same scale. Full details of data collection and preprocessing can be found in the Supplement. Table 2 summarizes the dimensionality of each dataset. Note that the metadata dimensionality for Flu is much smaller than that of Wiki; this is due to the similarity and sparsity of vocabulary used in Wikipedia page summaries for geographic regions. Due to the highdimensionality of the metadata in Wiki, we drop the linear regression models from our comparison for computational reasons.
The two datasets differ significantly in structure. Because Wiki contains many articles corresponding to popular topics such as actors and movies, the data exhibits unpredictable spikes (such as movie releases). On the other hand, the Flu dataset exhibits a simpler seasonal structure with regions within countries and/or hemispheres sharing very similar yearly patterns.
From historical estimates of influenza rates, policy makers and healthcare providers are interested in forming longrange forecasts (P1) to properly supply vaccinations and provide other forms of treatments and interventions. There is also interest in imputing past missing values (P3)—which could arise from lack of funding or personnel to collect such data—to understand what happened and reevaluate policies. Coldstart forecasts (P2) provide estimates in regions where data is not being collected. For Wiki, we view this dataset as serving as a proxy for many types of demand datasets with motivations for (P1)(P3) provided earlier.
4.1.1 Quantitative Analysis
Within our framework we consider using full regression (Reg), lowrank regression (LR), functional regression (FuncReg), and neural network regression (NN), both with and without matrix factorization (MF). We compare these models to two stepahead baselines (where applicable): Temporal Regularized Matrix Factorization (TRMF) [5] and standard AR models. We also consider the following simple baselines: for longrange (P1), we average all past years (avg_PY); for coldstart (P3), we use the nearest neighbors (NN) in metadata (Euclidean) space and perform a weighted average of the associated time series; for missing data (P2), we compare to plain MF. Since AR, avg_PY, and NN baselines cannot handle missing data, we learn these models in the fully observed data setting (rather than 20% missing), aiding in their predictions. Model training details can be found in the Supplement.
We report our results in Figure 2 in the form of quantile boxandwhisker boxplots computed from median average percentage error (mdAPE). For a single time series, this is calculated as
(5) 
where is the forecast at time and is the true value. Mean average percentage error (MAPE) is a commonly used metric in time series forecast evaluation. mdAPE provides a more robust metric when the data has outliers and values close to zero.
Longrange forecasts
It is clear that the stepahead AR and TRMF methods perform poorly on this problem due to error propagation. Within our framework, the LR term seems key, with additional benefits seen by adding MF. (Note, however, that when the data are truly smooth, FuncReg can be helpful, as illustrated in synthetic data experiments in the Supplement.) The simple avg_PY baseline seems competitive, but exhibits worse performance in the right tails. This method is also not adept at handling missing data, and was aided by our providing it with fully observed training data.
Missing data forecasts
Compared to plain MF, adding metadata significantly improves our performance on the Flu data. Interestingly, the performance of any of the regression models alone is worse; the gains arise from the combined regression plus MF. On Wiki, the results are similar, but less pronounced. The regression plus MF models provide lower 10th, 25th, and 50th quantiles, and similar 75th quantiles. Here, LR+MF is the best performer amongst any combination of regression with or without MF. For Flu, FuncReg+MF takes a small lead over the lowrank counterpart, perhaps due to the smoothness of the trends. In this smooth setting, TRMF performs very well; filling in large contiguous blocks of missing data is hard for plain MF, but enforcing an AR regularization appears to help a lot. Indeed, TMRF was designed specifically for such scenarios. However, Wiki does not exhibit such a simple structure and TRMF performs worse compared to our class of models.
Coldstart forecasts
In this setting, the stepahead methods are not applicable. The simple NN baseline performs fairly well, especially on Flu where nearest neighbors have very similar profiles. However, NN suffers from similar drawbacks to avg_PY in addition to being sensitive to the choice of . Our framework provides competitive or better performance. We again see the LR models excelling on Wiki. Flu is a very simple dataset, and thus all methods seem to perform similarly.
Summary
The stepahead methods struggle to provide competitive longrange forecasts and are not immediately applicable to the coldstart scenario. The stepahead TMRF does well in certain missing data scenarios, but not universally, as demonstrated on Wiki. The simple avg_PY and NN baselines perform fairly competitively compared to our framework; however, they are not robust to structured missingness, e.g. a specific sensor goes down almost every winter due to heavy storms in the region. Furthermore, our methods could straightforwardly be embedded within a statistical framework for providing estimates of uncertainty, whereas the same is not true for the simple baselines. Overall, we see that the LR+MF model appears to be a robust and effective method across all considered scenarios.
4.1.2 Qualitative Analysis
Since LR+MF appears to be a fairly robust general purpose model, we hone in on this model when visually examining the performance of our predictions. In Figure 3, we plot selected time series from the Wiki test set and our associated predictions. We show more plots that qualitatively evaluate the successes and failures of our framework (and baselines) in the Supplement.
Longrange forecasts
For the longrange forecasting experiment (P3), our model captures a variety of interesting yearly patterns. Figures 2(a) and 2(b) display a weekly oscillation and a summer and winter slump. This may be due to students visiting these articles during the school year. Looking at Figure 2(c) for “Cherry Blossom”, we see predicted increases in page traffic during the months of the year when cherry blossoms are in full bloom; however, it does not quite capture the right scale.
Coldstart forecasts
In the more challenging coldstart scenario, the model still excels at learning the common weekly oscillation with summer and winter slumps, when appropriate, as shown in Figure 2(d). Interestingly, we capture a different type of weekly oscillation in Figure 2(e), matching the true trend. We also capture a (small) predicted spike when the real data has a huge spike. This is most likely the WrestleMania event. In Figure 2(f), the model predicts increases in page traffic around days 75 to 100, which is most likely due to March Madness. Prediction of activity spikes is challenging since many activity spikes result from viral news events, rather than recurring yearly events. Even recurring events can change in their exact timing. Such spikes are also tough to predict using only static text features. The changing daily features (e.g., holidays and promotions) is crucial to the performance in the recent work of [1] that focuses on spiky demand.
Overall, our very straightforward LR+MF model is able to forecast very intricate seasonal profiles over the course of an entire year both when provided with a past history of the series, and for completely unseen series. Interestingly, this model does not explicitly build in any time series model, but rather leverages the lowdimensional shared structure from a specially formed data matrix.
5 Discussion
We presented a computationally efficient framework for providing both longrange and coldstart forecasts of baseline demand while straightforwardly handling missing data. The modular framework consists of a matrix factorization term that exploits lowrank structure arising from shared patterns across fixed periods and series, and a regression component that leverages highdimensional metadata. Key to the framework is a carefully structured data matrix manipulated from the observed time series.
We explored different approaches to the regression component and examined the proposed formulations on Google Flu Trends and a large, messy Wikipedia dataset. Our experiments demonstrated that our framework excels at capturing baseline demand with intricate seasonal structure and can produce longrange forecasts of new series. Such longrange forecasts are extremely challenging for standard time series models that perform iterated stepahead predictions, leading to large error accumulation. Under model misspecification, such approaches lead to very poor longrange fits [16]. Our framework, on the other hand, side steps this issue by focusing prediction on the global behavior.
There are many interesting directions for future work. First, we plan to investigate how our framework can be augmented to predict other time series features, such as trends and burstiness. Both could be partially addressed by using timevarying metadata. Second, one could leverage ideas from trend filtering to allow for learned latent factors to exhibit varying levels of smoothness, which would be more descriptive than our current functional regression approach. Finally, it is interesting to explore the performance of our framework on predicting other types of multivariate functional output data from text, such as images based on captions.
References
 [1] Matthias W Seeger, David Salinas, and Valentin Flunkert. Bayesian intermittent demand forecasting for large inventories. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 4646–4654. Curran Associates, Inc., 2016.
 [2] Trung V Nguyen and Edwin V Bonilla. Collaborative multioutput gaussian processes. In UAI, pages 643–652, 2014.
 [3] Wei Sun and Dmitry Malioutov. Time series forecasting with shared seasonality patterns using nonnegative matrix factorization. In NIPS, Time Series Workshop, 2015.
 [4] Steve ChengXian Li and Benjamin Marlin. Collaborative multioutput gaussian processes for collections of sparse multivariate time series. NIPS 2015 Time Series Workshop, 2015.
 [5] HsiangFu Yu, Nikhil Rao, and Inderjit S Dhillon. Temporal regularized matrix factorization for highdimensional time series prediction. In Advances in Neural Information Processing Systems, pages 847–855, 2016.
 [6] Ruairí de Fréin, Konstantinos Drakakis, Scott Rickard, and Andrzej Cichocki. Analysis of financial data using nonnegative matrix factorization. In International Mathematical Forum, volume 3, pages 1853–1870. Journals of Hikari Ltd, 2008.
 [7] James O.. Ramsay and Bernard W Silverman. Functional Data Analysis. Springer, 2005.
 [8] J. S. Morris. Functional Regression. ArXiv eprints, June 2014.
 [9] Yehuda Koren, Robert Bell, and Chris Volinsky. Matrix factorization techniques for recommender systems. Computer, 42(8), 2009.
 [10] Deepak Agarwal and BeeChung Chen. Regressionbased latent factor models. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 19–28. ACM, 2009.
 [11] István Pilászy and Domonkos Tikk. Recommending new movies: even a few ratings are more valuable than metadata. In Proceedings of the third ACM conference on Recommender systems, pages 93–100. ACM, 2009.
 [12] Andrew I Schein, Alexandrin Popescul, Lyle H Ungar, and David M Pennock. Methods and metrics for coldstart recommendations. In Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval, pages 253–260. ACM, 2002.
 [13] Zeno Gantner, Lucas Drumond, Christoph Freudenthaler, Steffen Rendle, and Lars SchmidtThieme. Learning attributetofeature mappings for coldstart recommendations. In Data Mining (ICDM), 2010 IEEE 10th International Conference on, pages 176–185. IEEE, 2010.
 [14] Jeremy Ginsberg, Matthew H Mohebbi, Rajan S Patel, Lynnette Brammer, Mark S Smolinski, and Larry Brilliant. Detecting influenza epidemics using search engine query data. Nature, 457(7232):1012–1014, 2009.
 [15] Robert B Cleveland, William S Cleveland, Jean E McRae, and Irma Terpenning. Stl: A seasonaltrend decomposition procedure based on loess. Journal of Official Statistics, 6(1):3–73, 1990.
 [16] Yingcun Xia and Howell Tong. Feature matching in time series modeling. Statist. Sci., 26(1):21–46, 02 2011.
 [17] Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. The Stanford CoreNLP natural language processing toolkit. In Association for Computational Linguistics (ACL) System Demonstrations, pages 55–60, 2014.
 [18] Wikipedia. Wikipedia pageview statistics: Wiki trends. http://www.wikipediatrends.com/, 2016.
Appendix A Details on Functional Regression
To allow smoothness of the regression weights across time steps we utilize a basis expansion. In particular, we assume that each element of may be written as a linear combination of basis functions evaluated at time step . Specifically, we have that element of vector , , is written as
(6) 
where is the th smooth basis function evaluated at time point and the are linear combination coefficients that are constant across time. The goal is to learn the basis weights which fully describes the functional regression model. Under this parameterization, Eq. (3) in the main paper becomes
(7)  
(8) 
where the basis function weights are collected into the matrix and here is the vector with elements .
Appendix B Framework Optimization Objectives
Here we provide the specific optimization objectives for all methods shown in Table 1 in the main paper.

Regression:

Matrix factorization + regression:

Lowrank regression:

Matrix factorization + lowrank regression:

Functional regression:

Matrix factorization + functional regression:

Neural Network regression:

Matrix factorization + neural network regression:
Appendix C Experiment Details
Cross validation
For each forecasting experiment, we separately run 5fold cross validation over a grid in order to select the model hyperparameters. Because some of our loss functions are nonconvex, this requires running random restarts. Since this is computationally expensive, we perform an approximation to full grid search. We perform a twostage cross validation where we first run standard 5fold cross validation over the hyperparameters in the nonmatrix factorization models. The regularization parameters for these models are a subset of the respective corresponding model that includes matrix factorization, so we then fix those hyperparameter values in the corresponding models with matrix factorization and cross validate over the remaining hyperparameters. Looking at Eq. (4) in the main paper, we would first cross validate over and fix the selected value while cross validating over .
c.1 Dataset Collection Details
c.1.1 Google Flu Trends
The Flu Trends dataset consists of weekly estimates of influenza rates in 311 worldwide regions [14]. We scraped all of the data on the existing webpage, resulting in a matrix of 311 time series of weekly influenza rate estimates (). For each time series, we zeromean the data and divide by the standard deviation to standardize the scale. After reorganizing the matrix as described in Section 3 of the main paper, there are 3197 observed years of data. For metadata, we scraped the summary of the relevant Wikipedia page for each region. We removed stopwords, performed tokenization using the Stanford CoreNLP toolkit [17], and calculated TFIDF representations of these summaries to use as our metadata vectors. Additionally, each word used in our feature vector must be present in at least two articles. We also included latitude and longitude coordinates of each region if available. After preprocessing, the dimension of the feature vector is . This matrix is sparse with roughly nonzero entries.
c.1.2 Wikipedia Page Traffic Dataset
For the Wikipedia dataset, we collected daily page traffic counts from 4031 Wikipedia articles from the beginning of 2008 to the end of 2014 by querying Wiki Trends [18]. The pages were selected by obtaining a list of the 5000 most popular pages of a given week in March 2016. We employ the method of [15] to detrend the time series data. For each time series, we zeromean the data and divide by the standard deviation. For each article, we have anywhere between 1 to 7 years of page traffic counts starting from 2008 to 2014. After reorganizing the time series data matrix as described in Section 3 in the main paper, the total number of columns of is . Each year of article traffic has days. Furthermore, we shifted each year to start on the first Sunday and padded with zeros. We scraped the summary of the corresponding Wikipedia page, removed stopwords, performed tokenization using the Stanford CoreNLP toolkit [17], and calculated TFIDF representations of these summaries to use as our metadata vectors. Additionally, each word used in our feature vector must be present in at least two articles. After preprocessing, the dimension of the features is . This matrix is extremely sparse with roughly nonzero entries.
c.2 Real Data Experiment Details
Test set selection

LongRange Forecasting: We included the last year of page traffic for each article in the test set. We further removed 20% of values from the training data to demonstrate the efficacy of our prediction methods in the presence of missing data.

Missing Data: For each time series, we pick a time point uniformly from the year for the starting point of the missing chunk, and sample a geometric random variable with mean that determines the length of the missing chunk.

ColdStart Forecasting: We randomly selected 25% of the articles and included the last year of page traffic in the test set. We did not include any previous years of such articles from the training set. We further removed 20% of values from the training data to demonstrate the efficacy of our prediction methods in the presence of missing data.
Neural network architecture
For the neural network regression, we use a two hidden layer network with ReLU activations. The first hidden layer has 100 nodes and the second hidden layer has nodes. Finally, the output layer is a regression layer (uses the identity function as the activation function). For simplicity we also do not regularize the weights of the neural network.
Model training
Google Flu
We set and train the models with minibatch stochastic gradient descent (SGD). We use a minibatch size of 300 and 30,000 iterations. We individually tuned constant step sizes for each model, although they all turned out to be close to 1e2. We use 5fold cross validation to select hyperparameters. For both and , we select them from 10 values evenly spread on a log scale from 0.1 to 1000. We choose the number of (evenly spaced) knots from 4, 8, …, 20. After selecting hyperparameters, we trained each final model with 50,000 iterations and 10 random restarts. We use a validation set to tune the number of epochs the neural network regression models need.
Wikipedia Page Traffic
We set and train the models with minibatch SGD. We use a minibatch size of 500 and 10,000 iterations. We individually tuned constant step sizes for each model, although they all turned out to be close to 1e1 (with the exception that the neural network regression models used a step size of 1e2). In this setting, since our data is large enough, we use a single validation set to select our hyperparameters as opposed to cross validation. For both and , we select them from 10 values evenly spread on a log scale from 0.01 to 1000. We choose the number of (evenly spaced) knots from . After selecting hyperparameters, we trained each final model with 20,000 iterations and 10 random restarts. We use a validation set to tune the number of epochs the neural network regression models need.
Baselines
For TRMF [5], we fix latent dimension ( in our framework) to be the same as what we set for our framework (5 for Google Flu, 20 for the Wikipedia dataset). For both TRMF and the AR baseline, we use a lag set of for the Google Flu data, and a lag set of . The chosen lag sets allow for the stepahead models to learn seasonal yearly patterns. For TRMF, we run rolling cross validation to select hyperparameters, using the code provided by [5], while for AR, we simply perform maximum likelihood under Gaussian errors. For pure MF, we run cross validation to select the regularization (Frobenius norm).
Appendix D Synthetic Experiments
We run experiments showing that when the data exhibits temporal regularity in the form of smoothness, the functional regression models of our framework become very useful.
Synthetic data generation
We generated data according to the matrix factorization + functional regression model. However, instead of using a Bspline basis for , we sampled each column of from a smooth Gaussian process with zero mean and double exponential kernel. We sample each column of as a sine wave with a random period, , , , and where ; this is the spike and slab exponential distribution for to emulate highdimensional features that are sparse and positive, e.g. bagofwords features. We generated a synthetic dataset with dimensions .
Neural network architecture
For the neural network regression, we use a two hidden layer network with ReLU activations. The first hidden layer has 100 nodes and the second hidden layer has nodes. Finally, the output layer is a regression layer (uses the identity function as the activation function).
Model training
We run the models with and train the models with minibatch SGD. We use a minibatch size of 300 and 5,000 iterations. We use a constant step size of 1.0 for the pure regression and lowrank regression models, 0.5 for the functional regression models, and 0.05 for the neural network models. For both and , we cross validate over a log scale of 10 values from 0.0001 to 100. To choose the number of (evenly spaced) knots in the functional regression models, we cross validate over 10, 20, …, 100. After selecting hyperparameters, we trained each final model with 50,000 iterations and 10 random restarts to avoid local minima. Recall that the neural network models don’t have any hyperparameters; We use a validation set to tune the number of epochs the neural network regression models need.
Results
We show quantile boxandwhisker plots of MSE per time series in order to show the distribution of errors in Figure 4. The box encapsulates the 1st to 3rd quartiles, and the whiskers encapsulate the 10th to 90th quantile. Yellow indicates models without MF, and light blue indicates models with MF.
The functional regression models tend to outperform the other models across all prediction challenges as it captures the smoothness of the regression weights across time. The neural network models do not perform as well as the functional regression models, perhaps due to the neural network model not leveraging the smoothness of the series. Adding the matrix factorization component tends to lower high tail error quantiles in most cases, even in the flexible neural network case. As far as we know, such a structure for neural networks has not been explored. The matrix factorization term captures lowrank structure in the residuals, which results in improved learning of the regression component. The matrix factorization term also clearly aids performance in the missing data setting (P2).
Appendix E Twostage comparison
In this section, we show an empirical example of where a twostage approach of 1) imputing missing data via matrix factorization and then 2) learning a regression component on the imputed data leads to weaker prediction results.
Setup
We generate data the same way as described in the synthetic experiments (Section 4 in this Supplement). To generate missing data, we generate large contiguous chunks of missing data exactly as described in the experiments section of the main paper (Section 4 in the main paper). This type of structure missingness hides the lowrank nature of the data matrix quite well, making it hard for MF to do a good job of imputing the missing values.
Learning
To perform the twostage approach, we first perform matrix factorization on the training data (which has missing data). Using the learned lowdimensional representation, we impute the missing data to get a fully observed matrix . We then learn a functional regression model (without MF, as shown in Table 1 in the main paper) on , and use this functional regression model to perform longrange and coldstart forecasts. We compare this to our jointly learned MF + functional regression model as shown in Table 1 (of the main paper).
Results
We show boxplots results in Figure 5 for all three prediction challenges. We first note that our joint learning approach results in improved performance across all quantiles compared to the twostage approach. Importantly, the missingness structure makes it difficult for matrix factorization to impute missing values correctly. This leads to learning a regression component on noisy values. Our jointly learned model is able to robustly avoid this issue and provide more accurate forecasts. In the missing data challenge (P2), we see the largest increase in prediction accuracy.
Appendix F Plots
In this section, we show many more plots to qualitatively evaluate the successes and failures of our matrix factorization + lowrank regression model, which we observed to be most consistently successful across all three forecasting challenges on both the Google flu and Wikipedia data sets. In Figures 6 and 7, we plot example predictions from the MF + lowrank model on Wikipedia test series for the longrange and coldstart challenges, respectively. We aim to illustrate both the challenge of prediction on complex realworld data and the variety of seasonal profiles that our method is able to forecast.
In Figure 8, we show a comparison between TRMF and MF + lowrank in terms of longrange Wikipedia predictions. We plot the five series closest to the quantile of mdAPE for TRMF, thus selecting a set of examples that we know a priori to be very favorable for this method. Nonetheless, the results show that our MF + lowrank approach is generally able to produce forecasts with superior prediction of seasonal and oscillatory behavior, even on the test series where TRMF performs close to its best. In this figure, TRMF predictions are plotted in blue.
Lastly, to give a sense of the distribution of MF + lowrank prediction quality over the Wikipedia longrange test set, we plot the three series closest to each of the , , and quantiles of mdAPE error in Figure 9. For all figures, recall that the observed time series is in magenta and predictions are overlayed in green.