High-dimensional Time Series Prediction with Missing Values
Abstract
High-dimensional time series prediction is needed in applications as diverse as demand forecasting and climatology. Often, such applications require methods that are both highly scalable and able to deal with noisy data in terms of corruptions or missing values. Classical time series methods usually fall short of handling both these issues. In this paper, we propose to adapt matrix completion approaches that have previously been applied successfully to large-scale noisy data, but which fail to adequately model high-dimensional time series due to temporal dependencies. We present a novel temporal regularized matrix factorization (TRMF) framework which supports data-driven temporal dependency learning and brings forecasting ability to matrix factorization. TRMF is highly general, and subsumes many existing matrix factorization approaches for time series data. We make interesting connections to graph regularized matrix factorization methods in the context of learning the dependencies. Experiments on both real and synthetic data show that TRMF outperforms several existing approaches for common time series tasks.
1 Introduction
Time series data plays a central role in many applications, from demand forecasting to speech and video processing to climatology. Such applications involve data collected over a large time frame, and also a very large number of possibly interdependent time series. For example, climatology applications involve data collected from possibly thousands of sensors, every hour (or less) over several days. Similarly, a store tracking its inventory would track thousands of items every day for multiple years. Not only is the scale of such problems huge, but they also typically involve missing values, due to sensor malfunctions, occlusions, or simple human errors. Thus, modern time series applications present two challenges to practitioners: scalability in the presence of high-dimensional time series and the flexibility to handle missing values.
Classical approaches to handle time-varying data include autoregressive (AR) models and dynamic linear models (DLM) [8, 23]. These approaches fall short of handling the aforementioned issues [1]. Specifically, for $T$ time points of an $n$-dimensional time series, an AR model of order $p$ has $O(pn^2)$ parameters to estimate, which is prohibitive even for moderate values of $n$. Similarly, Kalman filter based DLM approaches need $O(k^3)$ operations per step to update parameters, where $k$ is the latent dimensionality, which may be larger than $n$ [15]. As a specific example, the maximum likelihood estimator implementation in the widely used R-DLM package [14], which relies on a general optimization solver, cannot scale beyond very small $n$ (see Appendix B for details).
Recently, low-rank matrix completion or matrix factorization (MF) methods have found use in large-scale collaborative filtering [10, 26], multi-label learning [27], and many other applications where data often scale into the millions. A natural way to model high-dimensional time series data is as a matrix, with rows corresponding to time series and columns corresponding to time points. In light of this, it is prudent to ask
“Can we generalize matrix factorization to analyze large scale data with temporal dependencies?”
Let $Y \in \mathbb{R}^{n \times T}$ be the observed data matrix of $n$ time series of length $T$. Many classic time series models can be described in the following general form [15, 23]:
(1) $y_t \approx F x_t,$
(2) $x_t \approx \mathcal{M}_{\Theta}\big(\{x_{t-l} : l \in \mathcal{L}\}\big),$
where $y_t \in \mathbb{R}^n$ is the snapshot of the $n$ time series at the $t$-th time point, $F \in \mathbb{R}^{n \times k}$, $x_t \in \mathbb{R}^k$ is the latent embedding for the $t$-th time point, and the noise terms are Gaussian. $\mathcal{M}_{\Theta}$ in (2) is a time series model parameterized by $\mathcal{L}$ and $\Theta$: $\mathcal{L}$ is a set of lag indices, each denoting a dependency between the $t$-th and $(t-l)$-th time points, while $\Theta$ captures the weighting information of the temporal dependencies (such as the transition matrices in AR models). When we stack the $x_t$s into a matrix $X = [x_1, \dots, x_T] \in \mathbb{R}^{k \times T}$ and let $f_i^{\top}$ be the $i$-th row of $F$ as shown in Figure 1, we clearly see that $Y \approx F X$, and we can attempt to learn the factors in (1) by considering the standard matrix factorization formulation (3):
(3) $\min_{F, X}\; \sum_{(i,t) \in \Omega} \big(Y_{it} - f_i^{\top} x_t\big)^2 + \lambda_f R_f(F) + \lambda_x R_x(X),$
where $\Omega$ is the set of observed entries, and $R_f(\cdot)$ and $R_x(\cdot)$ are regularizers for $F$ and $X$, respectively.
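To make (3) concrete, the following sketch fits the factorization over only the observed entries by alternating ridge regressions, one row of $F$ and one column of $X$ at a time. This is a minimal illustration under the squared-Frobenius choice of both regularizers; the function and parameter names are our own, not from the paper.

```python
import numpy as np

def mf_with_missing(Y, mask, k=2, lam=0.1, iters=50, seed=0):
    """Alternating minimization for (3) with squared Frobenius regularizers.

    Y    : n x T data matrix (entries outside `mask` are ignored)
    mask : n x T boolean array, True where Y_it is observed
    """
    rng = np.random.default_rng(seed)
    n, T = Y.shape
    F = 0.1 * rng.standard_normal((n, k))
    X = 0.1 * rng.standard_normal((k, T))
    I = np.eye(k)
    for _ in range(iters):
        for i in range(n):            # ridge regression for each row f_i
            obs = mask[i]
            Xo = X[:, obs]
            F[i] = np.linalg.solve(Xo @ Xo.T + lam * I, Xo @ Y[i, obs])
        for t in range(T):            # ridge regression for each column x_t
            obs = mask[:, t]
            Fo = F[obs]
            X[:, t] = np.linalg.solve(Fo.T @ Fo + lam * I, Fo.T @ Y[obs, t])
    return F, X
```

Each inner solve is a small $k \times k$ system, which is what makes such alternating methods scale to very large, sparsely observed matrices.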
This connection made, there now arise two major challenges to extend MF approaches to analyze data with temporal dependencies:

How to describe and incorporate the structure of temporal dependencies into the MF formulation?

How to efficiently forecast values of future time points?
It is clear that the common choice of the squared Frobenius norm for $R_x(X)$ is no longer appropriate, as it does not take into account the dependencies among the columns of $X$. Most existing MF adaptations that handle temporal dependencies [2, 28, 24, 16, 18] are graph-based approaches, where the dependencies are described by a graph and incorporated through a Laplacian regularizer [20]. However, graph-based regularization fails in cases where there are negative correlations between two time points. Furthermore, unlike scenarios where explicit graph information comes with the data (such as a social network or a product co-purchasing graph for recommender systems), explicit temporal dependency structures are usually unavailable and have to be inferred or approximated. Hence, this approach requires practitioners to either perform a separate procedure to estimate the dependencies or consider very short-term dependencies with simple fixed weights. Moreover, existing MF approaches, while yielding good estimates for missing values at past time points, are poor at forecasting future values, which is crucial in time series analysis.
1.1 Our Contributions:
In this paper, we propose a novel temporal regularized matrix factorization (TRMF) framework for data with temporal dependencies. TRMF generalizes several existing temporal matrix factorization approaches [2, 28, 24, 16, 18] and connects to several existing latent time-series models [8, 11, 15, 12, 23]. Our approach not only supports data-driven temporal dependency learning but also brings the ability to forecast future values to matrix factorization methods. Also, unlike classic time-series models, TRMF easily handles high-dimensional time series data even in the presence of many missing values.
Next, based on AR models, we design a novel temporal regularizer to encourage temporal dependencies among the latent embeddings $\{x_t\}$. We also make interesting connections between the proposed regularizer and graph-based approaches [20]. This connection not only leads to a better understanding of the dependency structure incorporated by our framework, but also brings the benefit of applying off-the-shelf efficient solvers such as GRALS [17] directly to TRMF.
The rest of this paper is organized as follows. In Section 2, we review existing approaches and their limitations on data with temporal dependencies. We present the proposed TRMF framework in Section 3, and show that the method is highly general and can be used for a variety of time series applications. We introduce a novel AR temporal regularizer in Section 4, and make connections to graph-based regularization approaches. We demonstrate the superiority of the proposed approach via extensive experimental results in Section 5 and conclude the paper in Section 6.
2 Data with Temporal Dependency: Existing Approaches and Limitations
2.1 Time-Series Models
Models such as AR and DLM are not suitable for modern multiple high-dimensional time series data (i.e., both $n$ and $T$ are large) due to their inherent computational inefficiency (see Section 1). To reduce the number of parameters and avoid overfitting in AR models, there have been studies with various structured transition matrices, such as low-rank and sparse matrices [5, 13]. The focus of most of these works is on devising better statistical consistency guarantees, and the issue of scalability of AR models remains a challenge. On the other hand, it is also challenging for many classic time-series models to deal with data with missing values [1].
DLM unifies many other time series models [15, 23]. The original idea goes back to the Kalman filter [8] in the control theory community. In situations where the model parameters such as $F$ in (1) and $\Theta$ in (2) are either given or designed by practitioners, the Kalman filter is used to perform forecasting, while Kalman smoothing is used to impute missing entries. In situations where the model parameters are unknown, EM algorithms are applied to estimate both the model parameters and the latent embeddings for DLM [19, 3, 11, 12, 21]. As most EM approaches for DLM contain the Kalman filter as a building block, they cannot scale to very high dimensional time series data. Indeed, as shown in Section 5, the popular R package for DLMs does not scale beyond data with a few hundred observations and dimensions.
2.2 Existing Matrix Factorization Approaches for Time Series
In standard matrix factorization (3), the squared Frobenius norm is the usual regularizer of choice for $R_x(X)$. Because the squared Frobenius norm assumes no dependencies among the columns $\{x_t\}$, the standard MF formulation is invariant to column permutation and not applicable to data with temporal dependencies. Hence most existing temporal MF approaches [2, 28, 24, 16, 18] turn to the framework of graph-based regularization [20] for temporally dependent $\{x_t\}$, with a graph encoding the temporal dependencies.
Graph regularization for temporal dependency.
The framework of graph-based regularization is a way to describe and incorporate general dependencies among variables. Let $G$ be a graph over the $T$ time points and $G_{st}$ be the weight of the edge between the $s$-th and the $t$-th nodes. A popular regularizer that incorporates $G$ into an objective function is the following:
(4) $\mathcal{G}(X \mid G, \eta) = \frac{1}{2} \sum_{(s,t)\in E} G_{st}\,\|x_s - x_t\|^2 + \frac{\eta}{2} \sum_{t} \|x_t\|^2,$
where $(s,t) \in E$ denotes an edge between the $s$-th and $t$-th nodes, and the second summation term is used to guarantee strong convexity.
A large $G_{st}$ ensures that $x_s$ and $x_t$ are close to each other in the Euclidean sense when (4) is minimized. Note that to guarantee the convexity of $\mathcal{G}$, the weights $G_{st}$ need to be nonnegative.
To apply graph-based regularizers to temporal dependencies, we need to specify the (repeating) dependency pattern by a lag set $\mathcal{L}$ and a weight vector $w$ such that all edges of distance $l$ (i.e., between the $t$-th and $(t-l)$-th nodes) share the same weight $w_l$. See Figure 2 for an example. Given $\mathcal{L}$ and $w$, the corresponding graph regularizer becomes
(5) $\mathcal{G}(X \mid \mathcal{L}, w, \eta) = \frac{1}{2} \sum_{l \in \mathcal{L}} w_l \sum_{t} \|x_t - x_{t-l}\|^2 + \frac{\eta}{2} \sum_t \|x_t\|^2.$
This direct use of the graph-based approach for temporal dependency, while intuitive, has two issues: (a) there might be negatively correlated dependencies between two time points, yet standard graph regularization requires nonnegative weights; (b) unlike many applications where such regularizers are used, the explicit temporal dependency structure is usually not available and has to be inferred. As a result, most existing temporal MF approaches consider only very simple temporal dependencies, such as a small lag set with uniform weights. For example, a simple chain graph is used to design the smoothness regularizer in TCF [24]. This leads to the poor forecasting ability of existing MF methods in large-scale time series applications.
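For concreteness, the lag-set regularizer (5) can be evaluated directly as below (illustrative code with our own naming; `X` is $k \times T$). Note that setting every $w_l$ to zero kills the first term entirely, which previews the trivial-solution issue discussed in Section 2.3.

```python
import numpy as np

def graph_temporal_reg(X, lags, w, eta=0.1):
    """Lag-set graph regularizer (5):
    0.5 * sum_l w_l * sum_t ||x_t - x_{t-l}||^2 + 0.5 * eta * sum_t ||x_t||^2."""
    T = X.shape[1]
    total = 0.0
    for l, wl in zip(lags, w):
        diffs = X[:, l:] - X[:, :T - l]   # all column pairs at distance l
        total += 0.5 * wl * np.sum(diffs ** 2)
    return total + 0.5 * eta * np.sum(X ** 2)
```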
2.3 Challenges to Learn Temporal Dependencies
One could try to learn the weights automatically by using the same regularizer as in (5), but with the weights $w$ unknown. This leads to the following optimization problem:
(6) $\min_{F, X, w}\; \sum_{(i,t)\in\Omega} \big(Y_{it} - f_i^{\top} x_t\big)^2 + \lambda_f R_f(F) + \lambda_x\, \mathcal{G}(X \mid \mathcal{L}, w, \eta) \quad \text{s.t. } w \geq 0,$
where $0$ is the all-zeros vector, and $w \geq 0$ is the constraint imposed by graph regularization.
It is not hard to see that the above optimization yields the trivial all-zero solution $w = 0$, meaning the objective function is minimized when no temporal dependencies exist! To avoid the all-zero solution, one might impose a simplex constraint on $w$ (i.e., $\sum_{l\in\mathcal{L}} w_l = 1$, $w \geq 0$). Again, it is not hard to see that this results in $w$ being a 1-sparse vector whose single nonzero component equals 1 at the lag $l^* = \arg\min_{l\in\mathcal{L}} \sum_t \|x_t - x_{t-l}\|^2$.
Thus, looking to learn the weights automatically by simply plugging the regularizer (5) into the MF formulation is not a viable solution.
3 Temporal Regularized Matrix Factorization
In order to resolve the limitations mentioned in Sections 2.2 and 2.3, we propose the Temporal Regularized Matrix Factorization (TRMF) framework, which is a novel approach to incorporate temporal dependencies into matrix factorization models.
We propose to use well-known time series models (e.g., AR) to describe the temporal dependencies in $X$ explicitly. Unlike the aforementioned graph-based approaches, there are many well-studied models which are specifically designed for data with temporal dependency [15]. Let $\mathcal{M}_{\Theta}$ denote the time-series model for $\{x_t\}$ introduced in (2). To incorporate the temporal dependency into TRMF, we propose to design a new regularizer $T_{\mathcal{M}}(X \mid \Theta)$ which encourages $X$ to follow the structure induced by $\mathcal{M}_{\Theta}$.
Taking a standard approach to model time series, we set $T_{\mathcal{M}}(X \mid \Theta)$ to be the negative log-likelihood of observing the particular realization $x_1, \dots, x_T$ under the model $\mathcal{M}_{\Theta}$:
(7) $T_{\mathcal{M}}(X \mid \Theta) = -\log \mathbb{P}\big(x_1, x_2, \dots, x_T \mid \Theta\big).$
When $\Theta$ is given, we can use $T_{\mathcal{M}}(X \mid \Theta)$ as $R_x(X)$ in the MF formulation (3) to encourage $X$ to follow the temporal dependency induced by $\mathcal{M}_{\Theta}$. When $\Theta$ is unknown, we can treat it as another set of variables and include an additional regularizer $R_{\theta}(\Theta)$ in (3) as follows:
(8) $\min_{F, X, \Theta}\; \sum_{(i,t)\in\Omega} \big(Y_{it} - f_i^{\top} x_t\big)^2 + \lambda_f R_f(F) + \lambda_x T_{\mathcal{M}}(X \mid \Theta) + \lambda_{\theta} R_{\theta}(\Theta),$
which can be solved by an alternating minimization procedure over $F$, $X$, and $\Theta$.
Data-driven Temporal Dependency Learning in TRMF:
Recall that in Section 2.3, we showed that directly using graph-based regularizers to incorporate temporal dependencies leads to trivial solutions for the weights. We now show that the TRMF framework circumvents this issue. When $F$ and $X$ are fixed, (8) reduces to the following problem:
(9) $\min_{\Theta}\; \lambda_x T_{\mathcal{M}}(X \mid \Theta) + \lambda_{\theta} R_{\theta}(\Theta),$
which is a maximum-a-posteriori (MAP) estimation problem (in the Bayesian sense, assuming priors on the data) for estimating the best $\Theta$ for a given $X$ under the model. There are well-developed algorithms to solve the MAP problem (9) and obtain nontrivial dependency parameters $\Theta$. Thus, unlike most existing temporal matrix factorization approaches, where the strength of dependencies is fixed a priori, $\Theta$ in TRMF can be learned automatically from the data.
3.1 Time-Series Analysis with TRMF
We can see that TRMF (8) lends itself seamlessly to a variety of commonly encountered tasks in analyzing data with temporal dependencies:

Time-series Forecasting: As TRMF comes with a full specification of the model $\mathcal{M}_{\Theta}$ for the latent embeddings $\{x_t\}$, we can use $\mathcal{M}_{\Theta}$ to predict future latent embeddings $x_t$ for $t > T$, and thus obtain nontrivial forecasts $y_t = F x_t$ for $t > T$.

Missing-value Imputation: In some time-series applications, certain entries of $Y$ might be unobserved, for example due to faulty sensors in electricity usage monitoring or occlusions in the case of motion recognition in video. We can use $f_i^{\top} x_t$ to impute these missing entries, much like standard matrix completion. This task is useful in many recommender systems [24] and sensor networks [28].

Time-series classification/clustering: The obtained $f_i$ can be used as the latent embedding for the $i$-th time series of $Y$. These latent features can be used to perform classification or clustering of the time series. Note that this can be done even when there are missing entries in the observed data, as missing entries are not a bottleneck for learning $F$.
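The forecasting task above can be sketched as follows for the diagonal-weight AR dependency model used later in Section 4: roll the learned lag weights forward to produce future latent embeddings, then map through $F$. This is an illustrative sketch under our own variable conventions (`W` holds, in column `j`, the diagonal of the weight matrix for lag `lags[j]`), not the paper's implementation:

```python
import numpy as np

def forecast_latents(X, lags, W, horizon):
    """Roll x_t = sum_j W[:, j] * x_{t - lags[j]} forward for `horizon` steps.

    X : k x T matrix of learned latent embeddings
    W : k x len(lags) matrix of diagonal AR weights
    """
    k, T = X.shape
    Z = np.concatenate([X, np.zeros((k, horizon))], axis=1)
    for t in range(T, T + horizon):
        Z[:, t] = sum(W[:, j] * Z[:, t - l] for j, l in enumerate(lags))
    return Z[:, T:]          # the forecasted embeddings for t = T+1, ...
```

The forecast of the observations is then simply `F @ forecast_latents(X, lags, W, horizon)`.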
3.2 Extensions to Incorporate Extra Information
Like standard matrix factorization approaches, TRMF (8) can be extended to incorporate additional information. Below we briefly describe three extensions; more details can be found in the appendix.

Known features for time series: In many applications, one is given additional features along with the observed time series. Specifically, given a feature vector $a_i \in \mathbb{R}^d$ for the $i$-th row of $Y$, we can look to solve
(10) $\min_{F, X, \Theta}\; \sum_{(i,t)\in\Omega} \big(Y_{it} - a_i^{\top} F x_t\big)^2 + \lambda_f R_f(F) + \lambda_x T_{\mathcal{M}}(X \mid \Theta) + \lambda_{\theta} R_{\theta}(\Theta),$ where now $F \in \mathbb{R}^{d \times k}$. That is, the observation $Y_{it}$ is posited to be a bilinear function of the feature vector $a_i$ and the latent vector $x_t$. Such an inductive framework has two advantages. First, we can generalize TRMF to a new time series without any observations up to time $T$ (i.e., a new row of $Y$ without any observations): as long as its feature vector $a$ is available, the model learned by TRMF can be used to estimate the series via $a^{\top} F X$. Second, prediction can be significantly sped up when $d \ll n$, since the number of rows of $F$ is reduced from $n$ to $d$. Such methods for standard multi-label learning and matrix completion have been previously considered in [7, 25, 27].

Graph information among time series: Often, separate features for the time series are not known, but other relational information is available. When a graph encoding pairwise interactions among the multiple time series is available, one can incorporate it in our framework using the graph regularization approach (4). Such cases are common in inventory and sales tracking, where sales of one item are related to sales of other items. Given a graph $G_f$ describing the relationships among the multiple time series, we can formulate a graph-regularized problem:
(11) $\min_{F, X, \Theta}\; \sum_{(i,t)\in\Omega} \big(Y_{it} - f_i^{\top} x_t\big)^2 + \lambda_f\, \mathcal{G}(F \mid G_f, \eta) + \lambda_x T_{\mathcal{M}}(X \mid \Theta) + \lambda_{\theta} R_{\theta}(\Theta),$ where $\mathcal{G}(F \mid G_f, \eta)$ is the graph regularizer defined in (4), here capturing pairwise interactions between time series. Graph regularized matrix completion methods have been previously considered in [29, 17].

Temporal-regularized tensor factorization: Naturally, TRMF can be easily extended to temporal collaborative filtering applications [24, 21], where the targeted data is a tensor with certain modes evolving over time. For example, let $\mathcal{Y}$ be a 3-way tensor with $\mathcal{Y}_{ijt}$ encoding the rating given by the $i$-th user to the $j$-th item at the $t$-th time point. We can consider the following temporal regularized tensor factorization (TRTF):
(12) $\min_{F, H, X, \Theta}\; \sum_{(i,j,t)\in\Omega} \Big(\mathcal{Y}_{ijt} - \sum_{r=1}^{k} F_{ir} H_{jr} X_{rt}\Big)^2 + \lambda_f R_f(F) + \lambda_h R_h(H) + \lambda_x T_{\mathcal{M}}(X \mid \Theta) + \lambda_{\theta} R_{\theta}(\Theta),$ where $F$ and $H$ are the latent embeddings for the users and items, respectively.
4 A Novel Autoregressive Temporal Regularizer
Up until now, we have described the TRMF framework in a very general sense, with the regularizer $T_{\mathcal{M}}(X \mid \Theta)$ incorporating dependencies specified by the time series model $\mathcal{M}_{\Theta}$. In this section, we specialize this to the case of AR models.
An AR model, parameterized by a lag set $\mathcal{L}$ and weight matrices $\mathcal{W} = \{W^{(l)} \in \mathbb{R}^{k \times k} : l \in \mathcal{L}\}$, assumes that $x_t$ is a noisy linear combination of some previous points; that is,
(13) $x_t = \sum_{l \in \mathcal{L}} W^{(l)} x_{t-l} + \varepsilon_t,$
where $\varepsilon_t$ is a Gaussian noise vector. For simplicity, we assume that $\varepsilon_t \sim \mathcal{N}(0, I)$ (if the known covariance matrix is not the identity, we can suitably modify the regularizer). As a result, the temporal regularizer corresponding to this AR model can be written as follows:
(14) $T_{AR}(X \mid \mathcal{L}, \mathcal{W}, \eta) = \frac{1}{2} \sum_{t=m}^{T} \Big\|x_t - \sum_{l\in\mathcal{L}} W^{(l)} x_{t-l}\Big\|^2 + \frac{\eta}{2} \sum_{t} \|x_t\|^2,$
where $m = 1 + \max(\mathcal{L})$, and the last term with a positive $\eta$ is used to guarantee the strong convexity of (14). We denote the first term of (14) (i.e., the case $\eta = 0$) by $T_{AR}(X \mid \mathcal{L}, \mathcal{W})$.
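A direct evaluation of (14) may clarify the bookkeeping: the residual term is accumulated only for time points that have a full history of lags available. This is our own illustrative code, with `Ws[j]` holding the $k \times k$ weight matrix for lag `lags[j]`:

```python
import numpy as np

def ar_temporal_reg(X, lags, Ws, eta=0.1):
    """AR regularizer (14):
    0.5 * sum_{t >= m} ||x_t - sum_l W^(l) x_{t-l}||^2 + 0.5 * eta * ||X||_F^2,
    where m is the first (0-based) index with all required lags available."""
    k, T = X.shape
    m = max(lags)
    resid = 0.0
    for t in range(m, T):
        pred = sum(W @ X[:, t - l] for l, W in zip(lags, Ws))
        resid += np.sum((X[:, t] - pred) ** 2)
    return 0.5 * resid + 0.5 * eta * np.sum(X ** 2)
```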
TRMF allows us to learn the weights $\mathcal{W}$ when they are unknown. Since each $W^{(l)}$ is a $k \times k$ matrix, there will be $k^2 |\mathcal{L}|$ variables to learn, which may lead to overfitting. To prevent this and to yield more interpretable results, we consider diagonal $W^{(l)}$, which reduces the number of parameters to $k |\mathcal{L}|$. To simplify notation, we use $W$ to denote the $k \times \max(\mathcal{L})$ matrix whose $l$-th column constitutes the diagonal elements of $W^{(l)}$. Note that for $l \notin \mathcal{L}$, the $l$-th column of $W$ is a zero vector. Let $\bar{x}_r$ be the $r$-th row of $X$ and $\bar{w}_r$ be the $r$-th row of $W$. Then (14) can be written as follows:
(15) $T_{AR}(X \mid \mathcal{L}, W, \eta) = \sum_{r=1}^{k} T_{AR}(\bar{x}_r \mid \mathcal{L}, \bar{w}_r, \eta),$
(16) $T_{AR}(\bar{x}_r \mid \mathcal{L}, \bar{w}_r, \eta) = \frac{1}{2} \sum_{t=m}^{T} \Big(\bar{x}_{rt} - \sum_{l\in\mathcal{L}} \bar{w}_{rl}\,\bar{x}_{r,t-l}\Big)^2 + \frac{\eta}{2} \|\bar{x}_r\|^2,$
where $\bar{x}_{rt}$ is the $t$-th element of $\bar{x}_r$, and $\bar{w}_{rl}$ is the $l$-th element of $\bar{w}_r$.
Correlations among Multiple Time Series. TRMF retains the power to capture correlations among the time series via the factor $F$, even after simplifying the $W^{(l)}$ to be diagonal, since the diagonal restriction affects only the structure of the latent embeddings $\{x_t\}$. Indeed, as the $r$-th latent dimension is modeled by $\bar{w}_r$ in (8), one can view each row $f_i$ of the low-rank $F$ as a $k$-dimensional latent embedding of the $i$-th time series. This low-rank embedding captures correlations among the multiple time series. Furthermore, $F$ acts as a set of time series features, which can be used to perform classification/clustering even in the presence of missing values.
Choice of Lag Index Set $\mathcal{L}$. Unlike most MF approaches mentioned in Section 2.2, the choice of $\mathcal{L}$ in TRMF is more flexible. Thus, TRMF provides two important advantages for practitioners. First, because there is no need to hand-specify the weight parameters $W$, $\mathcal{L}$ can be chosen larger to account for long-range dependencies, which also yields more accurate and robust forecasts. Second, the indices in $\mathcal{L}$ can be discontiguous, so one can easily embed domain knowledge about periodicity or seasonality; for example, one might include lags around 52 for weekly data with a one-year seasonality.
Connections to Graph Regularization. We now establish connections between $T_{AR}$ and the graph regularization (4) for matrix factorization. Since (15) decomposes over the rows of $X$, it suffices to consider the one-dimensional regularizer (16) applied to a single sequence $x \in \mathbb{R}^{T}$ with a weight vector $\theta$. We then have the following result:
Theorem 1.
Given a lag index set $\mathcal{L}$, a weight vector $\theta$, and $\eta \geq 0$, there exist a weighted signed graph $G^{(\mathcal{L},\theta)}$ with $T$ nodes and a diagonal matrix $D$ such that
(17) $T_{AR}(x \mid \mathcal{L}, \theta, \eta) = \mathcal{G}\big(x \mid G^{(\mathcal{L},\theta)}\big) + \frac{1}{2}\, x^{\top} D\, x,$
where $\mathcal{G}(\cdot)$ is the graph regularization (4) applied to the sequence $x$. The edge weights of $G^{(\mathcal{L},\theta)}$ and the diagonal elements of $D$ are given explicitly in Appendix A.1.
See Appendix A.1 for the detailed proof of Theorem 1. From Theorem 1, we can determine, for each distance $d$, whether the graph $G^{(\mathcal{L},\theta)}$ contains edges between nodes that are $d$ apart, and thus construct the dependency graph for any given $\mathcal{L}$. Figure 3 demonstrates an example: in addition to edges at the distances in $\mathcal{L}$ itself, there are also edges at distances equal to differences of two lags in $\mathcal{L}$ (the dotted edges in Figure 3).
Although Theorem 1 shows that AR-based regularizers are similar to the graph-based regularization framework, we note the following key differences:

The graph $G^{(\mathcal{L},\theta)}$ defined in Theorem 1 is a signed weighted graph, containing both positive and negative edges. This implies that the AR temporal regularizer is able to support negative correlations, which the standard graph-based regularizer does not. This can make $\mathcal{G}\big(x \mid G^{(\mathcal{L},\theta)}\big)$ nonconvex.

While $\mathcal{G}\big(x \mid G^{(\mathcal{L},\theta)}\big)$ might be nonconvex, the addition of the second term in (17) still leads to a convex regularizer $T_{AR}$.
4.1 Optimization for TRMF with AR Temporal Regularization
Plugging $T_{AR}$ into (8), we obtain the following optimization problem:
(18) $\min_{F, X, W}\; \sum_{(i,t)\in\Omega} \big(Y_{it} - f_i^{\top} x_t\big)^2 + \lambda_f \|F\|_F^2 + \lambda_x T_{AR}(X \mid \mathcal{L}, W, \eta) + \lambda_w R_w(W),$
where $R_w(W)$ is a regularizer for $W$. We will refer to (18) as TRMF-AR. We can apply alternating minimization to solve (18); in fact, solving for each block of variables reduces to a well-known problem for which highly efficient algorithms exist:
Updates for $F$. When $X$ and $W$ are fixed, the subproblem of updating $F$ is the same as updating $F$ with $X$ fixed in (3). Thus, fast algorithms such as alternating least squares [10] or coordinate descent [26] can be applied directly to find $F$.
Updates for $X$. To update $X$ with $F$ and $W$ fixed, we solve:
(19) $\min_{X}\; \sum_{(i,t)\in\Omega} \big(Y_{it} - f_i^{\top} x_t\big)^2 + \lambda_x T_{AR}(X \mid \mathcal{L}, W, \eta).$
From Theorem 1, we see that $T_{AR}$ shares the same form as the graph regularizer, so we can apply GRALS [17] to solve (19).
Updates for $W$. How to update $W$ with $F$ and $X$ fixed depends on the choice of $R_w(W)$. There are many parameter estimation techniques developed for AR models with various regularizers [22, 13]. For simplicity, we consider the squared Frobenius norm $R_w(W) = \|W\|_F^2$. As a result, each row $\bar{w}_r$ of $W$ can be updated independently by solving the following one-dimensional autoregressive problem:
(20) $\min_{\bar{w}_r}\; \lambda_x T_{AR}(\bar{x}_r \mid \mathcal{L}, \bar{w}_r, \eta) + \lambda_w \|\bar{w}_r\|^2$
(21) $\;=\; \min_{\bar{w}_r}\; \frac{\lambda_x}{2} \sum_{t=m}^{T} \Big(\bar{x}_{rt} - \sum_{l\in\mathcal{L}} \bar{w}_{rl}\,\bar{x}_{r,t-l}\Big)^2 + \lambda_w \|\bar{w}_r\|^2,$ where terms independent of $\bar{w}_r$ have been dropped.
(21) is a simple $|\mathcal{L}|$-dimensional ridge regression problem with $T - \max(\mathcal{L})$ instances, which can be solved efficiently by a Cholesky factorization (e.g., via the backslash operator in MATLAB).
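The per-row weight update (21) amounts to building a lagged design matrix and solving the normal equations. A minimal sketch with our own naming (the 1/2 on the squared loss as in (21) gives the factor of 2 on the ridge term):

```python
import numpy as np

def learn_lag_weights(x, lags, lam=0.1):
    """Ridge regression (21) for one latent dimension:
    min_w 0.5 * sum_{t >= m} (x_t - sum_j w_j x_{t - lags[j]})^2 + lam * ||w||^2."""
    T = len(x)
    m = max(lags)
    A = np.column_stack([x[m - l:T - l] for l in lags])   # lagged copies of x
    b = x[m:]
    d = len(lags)
    # Normal equations of the ridge problem: (A^T A + 2*lam*I) w = A^T b.
    return np.linalg.solve(A.T @ A + 2.0 * lam * np.eye(d), A.T @ b)
```

For example, a pure sinusoid satisfies the exact recurrence $s_t = 2\cos(a)\, s_{t-1} - s_{t-2}$, and the solver recovers these two weights when `lam` is tiny.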
Note that since our method is highly modular, one can resort to any suitable method to solve the optimization subproblem that arises for each module. Moreover, as seen in Section 3.2, TRMF can also be used with different regularization structures, making it highly adaptable.
4.2 Connections to Existing MF Approaches
TRMF-AR, while being a special family of TRMF, can still be seen as a generalization of many existing MF approaches to handling data with temporal dependencies; several such approaches arise from specific choices of $\mathcal{L}$ and $W$.
4.3 Connections to Gaussian Markov Random Field
The Gaussian Markov random field (GMRF) is a general way to model multivariate data with dependencies. A GMRF assumes that the data are generated from a multivariate Gaussian distribution whose covariance matrix $\Sigma$ describes the dependencies among the variables, i.e., $x \sim \mathcal{N}(0, \Sigma)$ (a nonzero mean can be handled similarly). If the unknown $x$ is assumed to be generated from this model, the negative log-likelihood of the data can be written as
$-\log \mathbb{P}(x) = \frac{1}{2}\, x^{\top} \Sigma^{-1} x + \text{const},$
ignoring the constants, where $\Sigma^{-1}$ is the inverse covariance of the Gaussian distribution. This prior can be incorporated into an empirical risk minimization framework as a regularizer. Furthermore, it is known that if $(\Sigma^{-1})_{st} = 0$, then $x_s$ and $x_t$ are conditionally independent, given the other variables. In Theorem 1 we established connections to graph-based regularizers, and such methods can be seen as regularizing with the inverse covariance matrix of a Gaussian [29]. We thus have the following result:
Corollary 1.
For any lag set $\mathcal{L}$, weight vector $\theta$, and $\eta > 0$, the inverse covariance matrix of the GMRF model corresponding to the quadratic regularizer $T_{AR}(x \mid \mathcal{L}, \theta, \eta)$ shares the same off-diagonal nonzero pattern as the matrix induced by the graph $G^{(\mathcal{L},\theta)}$ defined in Theorem 1.
A detailed proof is in Appendix A.2. As a result, our proposed AR-based regularizer is equivalent to imposing a Gaussian prior on $x$ with a structured inverse covariance described by the construction in Theorem 1. Moreover, the step to learn $\theta$ has a natural interpretation: the lag set imposes the nonzero pattern of the graphical model on the data, and then we solve a simple least squares problem to learn the weights corresponding to the edges.
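To illustrate the correspondence, the quadratic form of the one-dimensional AR regularizer can be assembled explicitly as a $T \times T$ matrix, which is (up to scaling) the inverse covariance of the induced GMRF; its off-diagonal nonzero pattern is exactly the signed graph of Theorem 1. A sketch with our own naming:

```python
import numpy as np

def ar_precision_matrix(T, lags, w, eta=0.1):
    """Assemble M such that the 1-D AR regularizer equals 0.5 * x^T M x.

    Each time point with full lag history contributes a residual vector
    a = e_t - sum_l w_l e_{t-l}, and M = eta * I + sum_t a a^T."""
    M = eta * np.eye(T)
    m = max(lags)
    for t in range(m, T):
        a = np.zeros(T)
        a[t] = 1.0
        for l, wl in zip(lags, w):
            a[t - l] -= wl
        M += np.outer(a, a)
    return M
```

Note that a negative lag weight produces positive off-diagonal entries in M, i.e., negative edges in the graph, which is precisely what nonnegative graph regularization cannot express.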
5 Experimental Results
In this section, we perform extensive experiments on time series forecasting and missing value imputation, on both synthetic and real-world datasets.
Datasets Used:

synthetic: a small synthetic dataset. We generate $X$ from the autoregressive process (13) with a fixed lag index set $\mathcal{L}$, randomly generated weights $W$, and additive white Gaussian noise. We then randomly generate a matrix $F$ and obtain the observations $Y = FX + E$, where $E$ is additive Gaussian noise.

electricity (https://archive.ics.uci.edu/ml/datasets/ElectricityLoadDiagrams20112014): the electricity usage in kW recorded every 15 minutes for $n = 370$ clients. We convert the data to reflect hourly consumption by aggregating blocks of 4 columns, obtaining $T = 26{,}304$ time points.

traffic (https://archive.ics.uci.edu/ml/datasets/PEMSSF): a collection of 15 months of daily data from the California Department of Transportation. The data describes the occupancy rate, between 0 and 1, of $n = 963$ car lanes of the San Francisco Bay Area freeways. The data was sampled every 10 minutes, and we again aggregate columns to obtain hourly traffic data, giving $T = 10{,}560$ time points.

walmart1 & walmart2: two proprietary datasets from Walmart E-commerce containing weekly sale information of 1,350 and 1,582 items over 187 weeks, respectively. The time series of sales for each item start and end at different time points; for modeling purposes we assume a common start and end timestamp by padding each series with missing values. This, along with other missing values due to out-of-stock reasons, leads to 55.3% and 49.3% of entries being missing.
Compared Methods/Implementations:

TRMF-AR: The proposed formulation (18) with $R_w(W) = \|W\|_F^2$. For the lag set $\mathcal{L}$, we use dataset-appropriate lags: short-range lags for synthetic, lags covering daily and weekly periodicity for electricity and traffic, and lags covering recent history and the yearly seasonality for walmart1 and walmart2.

SVD-AR(1): The best rank-$k$ approximation $Y \approx U \Sigma V^{\top}$ is first obtained by singular value decomposition. After setting $F = U$ and $X = \Sigma V^{\top}$, a $k$-dimensional AR(1) model is learned on $X$ for forecasting.

TCF: Matrix factorization with the simple temporal regularizer proposed in [24].

AR(1): an $n$-dimensional AR(1) model.

Mean: the baseline approach, which predicts every value to be the mean of the observed portion of $Y$.
Evaluation Criteria: As the range of values varies across time series datasets, we compute two normalized criteria, normalized deviation (ND) and normalized RMSE (NRMSE), over the set of evaluated entries $\Omega_{\mathrm{test}}$:
Normalized deviation (ND): $\mathrm{ND} = \dfrac{\sum_{(i,t)\in\Omega_{\mathrm{test}}} |\hat{Y}_{it} - Y_{it}|}{\sum_{(i,t)\in\Omega_{\mathrm{test}}} |Y_{it}|}$
Normalized RMSE (NRMSE): $\mathrm{NRMSE} = \dfrac{\sqrt{\frac{1}{|\Omega_{\mathrm{test}}|}\sum_{(i,t)\in\Omega_{\mathrm{test}}} (\hat{Y}_{it} - Y_{it})^2}}{\frac{1}{|\Omega_{\mathrm{test}}|}\sum_{(i,t)\in\Omega_{\mathrm{test}}} |Y_{it}|}$
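Both criteria are straightforward to compute over the evaluated entries; a minimal sketch with our own naming, where `mask` selects the test entries:

```python
import numpy as np

def nd(Y_hat, Y, mask):
    """Normalized deviation: total absolute error over total absolute value."""
    return np.abs(Y_hat - Y)[mask].sum() / np.abs(Y)[mask].sum()

def nrmse(Y_hat, Y, mask):
    """RMSE normalized by the mean absolute value of the true entries."""
    rmse = np.sqrt(np.mean((Y_hat - Y)[mask] ** 2))
    return rmse / np.abs(Y)[mask].mean()
```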
For each method and dataset, we perform a grid search over the parameters (such as the rank $k$ and the regularization weights $\lambda$) and report the best ND and NRMSE.
Table 1: Forecasting performance (ND / NRMSE). TRMF-AR, SVD-AR(1), and TCF are matrix factorization models; AR(1), DLM, and R-DLM are time series models; Mean is the baseline. "–" indicates the method could not be run on that dataset.

| | TRMF-AR | SVD-AR(1) | TCF | AR(1) | DLM | R-DLM | Mean |
|---|---|---|---|---|---|---|---|
| Forecasting with full observation | | | | | | | |
| synthetic | 0.373 / 0.487 | 0.444 / 0.872 | 1.000 / 1.424 | 0.928 / 1.401 | 0.936 / 1.391 | 0.996 / 1.420 | 1.000 / 1.424 |
| electricity | 0.255 / 1.397 | 0.257 / 1.865 | 0.349 / 1.838 | 0.219 / 1.439 | 0.435 / 2.753 | – | 1.410 / 4.528 |
| traffic | 0.185 / 0.421 | 0.555 / 1.194 | 0.624 / 0.931 | 0.275 / 0.536 | 0.639 / 0.951 | – | 0.560 / 0.826 |
| Forecasting with missing values | | | | | | | |
| walmart1 | 0.533 / 1.958 | – | 0.540 / 2.231 | – | 0.602 / 2.293 | – | 1.239 / 3.103 |
| walmart2 | 0.432 / 1.065 | – | 0.446 / 1.124 | – | 0.453 / 1.110 | – | 1.097 / 2.088 |
5.1 Forecasting
We compare the forecasting performance of various approaches. The detailed results are shown in Table 1.
Remark on the implementations of DLM.
We first note that the numbers for R-DLM on electricity and traffic are not available because the R package crashes when the dimension of the time series is large (see Appendix B for source code demonstrating that R-DLM fails for moderately large $n$). Thus, the only numbers we obtained for R-DLM are on synthetic. Furthermore, the dlmMLE routine in R-DLM uses a general optimization solver (L-BFGS-B with a finite-difference approximation to obtain gradients), which is orders of magnitude slower than the DLM implementation provided in [12].
Forecasting with Full Observations. We first compare various methods on the task of forecasting values in the test set, given fully observed training data. Three datasets are considered: synthetic, electricity, and traffic. For synthetic, we consider the one-point-ahead forecasting task and use the last ten time points as the test period. For electricity and traffic, we consider the 24-hour-ahead forecasting task and use the last seven days as the test period. The results are shown in the first part of Table 1. We can clearly observe the superiority of TRMF-AR: other than the ND for electricity, TRMF-AR outperforms both the existing matrix factorization approaches and the classic time-series approaches.
Forecasting with Missing Values. We next compare the methods on the task of forecasting in the presence of missing values in the training data. We use the Walmart datasets here, considering six-week-ahead forecasting with the last 54 weeks as the test period. Note that SVD-AR(1) and AR(1) cannot handle missing values. The second part of Table 1 shows that TRMF-AR again outperforms the other methods.
Table 2: Missing value imputation performance (ND / NRMSE) at varying observed percentages. TRMF-AR, TCF, and MF are matrix factorization models; DLM is a time series model; Mean is the baseline.

| | observed | TRMF-AR | TCF | MF | DLM | Mean |
|---|---|---|---|---|---|---|
| synthetic | 20% | 0.467 / 0.661 | 0.713 / 1.030 | 0.688 / 1.064 | 0.933 / 1.382 | 1.002 / 1.474 |
| | 30% | 0.336 / 0.455 | 0.629 / 0.961 | 0.595 / 0.926 | 0.913 / 1.324 | 1.004 / 1.445 |
| | 40% | 0.231 / 0.306 | 0.495 / 0.771 | 0.374 / 0.548 | 0.834 / 1.259 | 1.002 / 1.479 |
| | 50% | 0.201 / 0.270 | 0.289 / 0.464 | 0.317 / 0.477 | 0.772 / 1.186 | 1.001 / 1.498 |
| electricity | 20% | 0.245 / 2.395 | 0.255 / 2.427 | 0.362 / 2.903 | 0.462 / 4.777 | 1.333 / 6.031 |
| | 30% | 0.235 / 2.415 | 0.245 / 2.436 | 0.355 / 2.766 | 0.410 / 6.605 | 1.320 / 6.050 |
| | 40% | 0.231 / 2.429 | 0.242 / 2.457 | 0.348 / 2.697 | 0.196 / 2.151 | 1.322 / 6.030 |
| | 50% | 0.223 / 2.434 | 0.233 / 2.459 | 0.319 / 2.623 | 0.158 / 1.590 | 1.320 / 6.109 |
| traffic | 20% | 0.190 / 0.427 | 0.208 / 0.448 | 0.310 / 0.604 | 0.353 / 0.603 | 0.578 / 0.857 |
| | 30% | 0.186 / 0.419 | 0.199 / 0.432 | 0.299 / 0.581 | 0.286 / 0.518 | 0.578 / 0.856 |
| | 40% | 0.185 / 0.416 | 0.198 / 0.428 | 0.292 / 0.568 | 0.251 / 0.476 | 0.578 / 0.857 |
| | 50% | 0.184 / 0.415 | 0.193 / 0.422 | 0.251 / 0.510 | 0.224 / 0.447 | 0.578 / 0.857 |
5.2 Missing Value Imputation
We next consider the task of imputing missing values in the data. As in [11], we assume that entire blocks of data are missing; this corresponds, for example, to a sensor malfunctioning over a stretch of time.
To create data with missing entries, we first fixed the percentage of data to observe, and then uniformly at random occluded blocks of a predetermined length. The goal was to predict the occluded values. Table 2 shows that TRMF outperforms the compared methods in almost all cases.
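The occlusion procedure can be sketched as follows: hide fixed-length blocks at random positions until the target fraction of hidden entries is reached. Block length, seeding, and naming here are our own illustrative choices, not the paper's exact setup:

```python
import numpy as np

def occlude_blocks(shape, observed_frac, block_len, seed=0):
    """Return a boolean mask (True = observed) with random blocks hidden.

    Blocks of `block_len` consecutive entries within a row are occluded,
    possibly overlapping, until at least (1 - observed_frac) of all
    entries are hidden."""
    rng = np.random.default_rng(seed)
    n, T = shape
    mask = np.ones((n, T), dtype=bool)
    target_hidden = int((1.0 - observed_frac) * n * T)
    while (~mask).sum() < target_hidden:
        i = rng.integers(n)                      # random row
        t = rng.integers(T - block_len + 1)      # random block start
        mask[i, t:t + block_len] = False
    return mask
```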
6 Conclusions
In this paper, we have introduced a novel temporal regularized matrix factorization (TRMF) framework for large-scale multiple time series problems with missing values. TRMF not only models temporal dependency among the data points, but also supports data-driven dependency learning. Our method generalizes several well-known methods, and also yields superior performance when compared to other state-of-the-art methods on real-world datasets.
Acknowledgments
This research was supported by NSF grant CCF-1320746 and gifts from Walmart Labs and Adobe. We thank Abhay Jha for his help with the experiments on the real-world datasets from Walmart E-commerce.
References
 [1] Oren Anava, Elad Hazan, and Assaf Zeevi. Online time series prediction with missing data. In Proceedings of the International Conference on Machine Learning, pages 2191–2199, 2015.
 [2] Zhe Chen and Andrzej Cichocki. Nonnegative matrix factorization with temporal smoothness and/or spatial decorrelation constraints. Laboratory for Advanced Brain Signal Processing, RIKEN, Tech. Rep, 68, 2005.
 [3] Zoubin Ghahramani and Geoffrey E. Hinton. Parameter estimation for linear dynamical systems. Technical Report CRG-TR-96-2, Dept. of Computer Science, University of Toronto, 1996.
 [4] Ronald L. Graham, Donald E. Knuth, and Oren Patashnik. Concrete Mathematics: A Foundation for Computer Science. Addison-Wesley Longman Publishing Co., Inc., 2nd edition, 1994.
 [5] Fang Han and Han Liu. Transition matrix estimation in high dimensional time series. In Proceedings of the International Conference on Machine Learning, pages 172–180, 2013.
 [6] Robert J. Hodrick and Edward C. Prescott. Postwar U.S. business cycles: An empirical investigation. Journal of Money, Credit, and Banking, pages 1–16, 1997.
 [7] Prateek Jain and Inderjit S. Dhillon. Provable inductive matrix completion. arXiv preprint arXiv:1306.0626, 2013.
 [8] Rudolph Emil Kalman. A new approach to linear filtering and prediction problems. Journal of Fluids Engineering, 82(1):35–45, 1960.
 [9] Seung-Jean Kim, Kwangmoo Koh, Stephen Boyd, and Dimitry Gorinevsky. ℓ1 trend filtering. SIAM Review, 51(2):339–360, 2009.
 [10] Yehuda Koren, Robert M. Bell, and Chris Volinsky. Matrix factorization techniques for recommender systems. IEEE Computer, 42:30–37, 2009.
 [11] Lei Li, James McCann, Nancy S. Pollard, and Christos Faloutsos. DynaMMo: Mining and summarization of coevolving sequences with missing values. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 507–516. ACM, 2009.
 [12] Lei Li and B Aditya Prakash. Time series clustering: Complex is simpler! In Proceedings of the International Conference on Machine Learning, pages 185–192, 2011.
 [13] William B. Nicholson, David S. Matteson, and Jacob Bien. Structured regularization for large vector autoregressions. Technical report, Cornell University, 2014.
 [14] Giovanni Petris. An r package for dynamic linear models. Journal of Statistical Software, 36(12):1–16, 2010.
 [15] Giovanni Petris, Sonia Petrone, and Patrizia Campagnoli. Dynamic Linear Models with R. Use R! Springer, 2009.
 [16] Swati Rallapalli, Lili Qiu, Yin Zhang, and Yi-Chao Chen. Exploiting temporal stability and low-rank structure for localization in mobile networks. In International Conference on Mobile Computing and Networking, MobiCom ’10, pages 161–172. ACM, 2010.
 [17] Nikhil Rao, Hsiang-Fu Yu, Pradeep K. Ravikumar, and Inderjit S. Dhillon. Collaborative filtering with graph information: Consistency and scalable methods. In Advances in Neural Information Processing Systems 28, 2015.
 [18] Matthew Roughan, Yin Zhang, Walter Willinger, and Lili Qiu. Spatiotemporal compressive sensing and internet traffic matrices (extended version). IEEE/ACM Transactions on Networking, 20(3):662–676, June 2012.
 [19] Robert H. Shumway and David S. Stoffer. An approach to time series smoothing and forecasting using the EM algorithm. Journal of time series analysis, 3(4):253–264, 1982.
 [20] Alexander J. Smola and Risi Kondor. Kernels and regularization on graphs. In Learning theory and kernel machines, pages 144–158. Springer, 2003.
 [21] John Z. Sun, Kush R. Varshney, and Karthik Subbian. Dynamic matrix factorization: A state space approach. In Proceedings of International Conference on Acoustics, Speech and Signal Processing, pages 1897–1900. IEEE, 2012.
 [22] Hansheng Wang, Guodong Li, and ChihLing Tsai. Regression coefficient and autoregressive order shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 69(1):63–78, 2007.
 [23] Mike West and Jeff Harrison. Bayesian Forecasting and Dynamic Models. Springer Series in Statistics. Springer, 2013.
 [24] Liang Xiong, Xi Chen, Tzu-Kuo Huang, Jeff G. Schneider, and Jaime G. Carbonell. Temporal collaborative filtering with Bayesian probabilistic tensor factorization. In SIAM International Conference on Data Mining, pages 223–234, 2010.
 [25] Miao Xu, Rong Jin, and Zhi-Hua Zhou. Speedup matrix completion with side information: Application to multi-label learning. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 2301–2309, 2013.
 [26] Hsiang-Fu Yu, Cho-Jui Hsieh, Si Si, and Inderjit S. Dhillon. Parallel matrix factorization for recommender systems. Knowledge and Information Systems, 41(3):793–819, 2014.
 [27] Hsiang-Fu Yu, Prateek Jain, Purushottam Kar, and Inderjit S. Dhillon. Large-scale multi-label learning with missing labels. In Proceedings of the International Conference on Machine Learning, pages 593–601, 2014.
 [28] Yin Zhang, Matthew Roughan, Walter Willinger, and Lili Qiu. Spatiotemporal compressive sensing and internet traffic matrices. SIGCOMM Comput. Commun. Rev., 39(4):267–278, August 2009.
 [29] Tinghui Zhou, Hanhuai Shan, Arindam Banerjee, and Guillermo Sapiro. Kernelized probabilistic matrix factorization: Exploiting graphs and side information. In SDM, volume 12, pages 403–414. SIAM, 2012.
Appendix A Proofs
A.1 Proof of Theorem 1
Proof.
In this proof, we use the notation and summation-manipulation techniques of Graham et al. [4]. To prove (17), it suffices to prove that
(22) 
The LHS of (22) can be expanded and regrouped as follows.
Let’s look at the first term :
where we can see that it is equivalent to the first term of the RHS of (22).
Now, we consider the second term :
Let  be a diagonal matrix with  being the coefficient associated with  in . Combining the results of , and ,  can be written as follows.
It is clear that . Note that , . ∎
A.2 Proof of Corollary 1
Proof.
It is well known that graph regularization can be written in the quadratic form [20] as follows.
where is the graph Laplacian for defined as:
Based on the above fact and the results from Theorem 1, we obtain the quadratic form for as follows.
Because  is diagonal, the nonzero pattern of the off-diagonal entries of the inverse covariance for  is determined by , which shares the same nonzero pattern as . ∎
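For reference, the well-known quadratic form of graph regularization invoked in this proof [20] can be stated explicitly. The notation below is generic (symmetric edge weights $W$, Laplacian $L$), not the paper's own symbols, which are garbled in this extraction:

```latex
% Graph regularization as a quadratic form, with rows f_i^\top stacked into F.
\frac{1}{2}\sum_{i,j} W_{ij}\,\lVert f_i - f_j \rVert^2
  \;=\; \operatorname{tr}\!\bigl( F^\top L F \bigr),
\qquad
L = D - W, \quad D_{ii} = \sum_{j} W_{ij}.
```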
Appendix B Details: Scalability Issue of the R-DLM Package
In this section, we show source code demonstrating that R-DLM fails to handle high-dimensional time series even with n = 32. Interested readers can run the following R code to see that the dlmMLE() function in R-DLM is able to run on a 16-dimensional time series. However, when we increase the dimension to 32, dlmMLE() crashes the entire R program.
library(dlm)

# Build a DLM with n-dimensional observations and a k-dimensional latent state;
# all model matrices are packed into one parameter vector for dlmMLE().
builderFactory <- function(n, k) {
  init <- c(rep(0, k), rep(0.1, 3), 0.1 * rnorm(n * k), 0.1 * rnorm(k * k))
  build <- function(x) {
    m0 <- x[1:k]
    C0 <- abs(x[k + 1]) * diag(k)
    V  <- abs(x[k + 2]) * diag(n)
    W  <- abs(x[k + 3]) * diag(k)
    FF <- matrix(nrow = n, ncol = k, data = x[(k + 3 + 1):(k + 3 + n * k)])
    GG <- matrix(nrow = k, ncol = k, data = x[(k + 3 + n * k + 1):(k + 3 + n * k + k * k)])
    dlm(m0 = m0, C0 = C0, FF = FF, GG = GG, V = V, W = W)
  }
  list(n = n, k = k, init = init, build = build)
}

Rdlm_train <- function(Y, k, maxit = 10) {
  # Accept either a dense matrix or a (row, column, value) triplet format.
  if (ncol(Y) == 3) {
    Ymat <- matrix(nrow = max(Y[, 1]), ncol = max(Y[, 2]))
    Ymat[cbind(Y[, 1], Y[, 2])] <- Y[, 3]
  } else {
    Ymat <- Y
  }
  n <- nrow(Ymat)
  dlm_builder <- builderFactory(n, k)
  mle <- dlmMLE(Ymat, dlm_builder$init, build = dlm_builder$build,
                control = list(maxit = maxit))
  dlmFilter(Ymat, dlm_builder$build(mle$par))
}

# Stack copies of the Nile series to form a 16-dimensional series: this runs.
tmp <- t(as.matrix(Nile))
tmp <- rbind(tmp, tmp); tmp <- rbind(tmp, tmp)
tmp <- rbind(tmp, tmp); tmp <- rbind(tmp, tmp)
print(nrow(tmp))
Rdlm_train(tmp, 4)
print("works")

# Doubling once more to 32 dimensions crashes the entire R session.
tmp <- rbind(tmp, tmp)
print(nrow(tmp))
Rdlm_train(tmp, 4)