BRITS: Bidirectional Recurrent Imputation for Time Series
Abstract
Time series are widely used as signals in many classification/regression tasks. It is ubiquitous that time series contains many missing values. Given multiple correlated time series data, how to fill in missing values and to predict their class labels? Existing imputation methods often impose strong assumptions of the underlying data generating process, such as linear dynamics in the state space. In this paper, we propose BRITS, a novel method based on recurrent neural networks for missing value imputation in time series data. Our proposed method directly learns the missing values in a bidirectional recurrent dynamical system, without any specific assumption. The imputed values are treated as variables of RNN graph and can be effectively updated during the backpropagation. BRITS has three advantages: (a) it can handle multiple correlated missing values in time series; (b) it generalizes to time series with nonlinear dynamics underlying; (c) it provides a datadriven imputation procedure and applies to general settings with missing data. We evaluate our model on three realworld datasets, including an air quality dataset, a healthcare data, and a localization data for human activity. Experiments show that our model outperforms the stateoftheart methods in both imputation and classification/regression accuracies.
BRITS: Bidirectional Recurrent Imputation for Time Series
Wei Cao Dong Wang Jian Li Hao Zhou Lei Li Yitan Li Tsinghua University Duke University Toutiao AILab
noticebox[b]Preprint. Work in progress.\end@float
1 Introduction
Multivariate time series data are abundant in many application areas, such as financial marketing [5, 4], healthcare [10, 22], meteorology [31, 26], and traffic engineering [29, 35]. Time series are widely used as signals for classification/regression in various applications of corresponding areas. However, missing values in time series are very common, due to unexpected accidents, such as equipment damage or communication error, and may significantly harm the performance of downstream applications.
Much prior work proposed to fix the missing data problem with statistics and machine learning approaches. However most of them require fairly strong assumptions on missing values. We can fill the missing values using classical statistical time series models such as ARMA or ARIMA (e.g., [1]). But these models are essentially linear (after differencing). Kreindler et al. [19] assume that the data are smoothable, i.e., there is no sudden wave in the periods of missing values, hence imputing missing values can be done by smoothing over nearby values. Matrix completion has also been used to address missing value problems (e.g., [30, 34]). But it typically applies to only static data and requires strong assumptions such as lowrankness. We can also predict missing values by fitting a parametric datagenerating model with the observations [14, 2], which assumes that data of time series follow the distribution of hypothetical models. These assumptions make corresponding imputation algorithms less general, and the performance less desirable when the assumptions do not hold.
In this paper, we propose BRITS, a novel method for filling the missing values for multiple correlated time series. Internally, BRITS adapts recurrent neural networks (RNN) [16, 11] for imputing missing values, without any specific assumption over the data. Much prior work uses nonlinear dynamical systems for time series prediction [9, 24, 3]. In our method, we instantiate the dynamical system as a bidirectional RNN, i.e., imputing missing values with bidirectional recurrent dynamics. In particular, we make the following technical contributions:

We design a bidirectional RNN model for imputing missing values. We directly use RNN for predicting missing values, instead of tuning weights for smoothing as in [10]. Our method does not impose specific assumption, hence works more generally than previous methods.

We regard missing values as variables of the bidirectional RNN graph, which are involved in the backpropagation process. In such case, missing values get delayed gradients in both forward and backward directions with consistency constraints, which makes the estimation of missing values more accurate.

We simultaneously perform missing value imputation and classification/regression of applications jointly in one neural graph. This alleviates the error propagation problem from imputation to classification/regression. Additionally, the supervision of classification/regression makes the estimation of missing values more accurate.

We evaluate our model on three realworld datasets, including an air quality dataset, a healthcare dataset and a localization dataset of human activities. Experimental results show that our model outperforms the stateoftheart models for both imputation and classification/regression accuracies.
2 Related Work
There is a large body of literature on the imputation of missing values in time series. We only mention a few closely related ones. The interpolate method tries to fit a "smooth curve" to the observations and thus reconstruct the missing values by the local interpolations [19, 14]. Such method discards any relationships between the variables over time. The autoregressive method, including ARIMA, SARIMA etc., eliminates the nonstationary parts in the time series data and fit a parameterized stationary model. The state space model further combines ARIMA and Kalman Filter [14, 15], which provides more accurate results. Multivariate Imputation by Chained Equations (MICE) [2] first initializes the missing values arbitrarily and then estimates each missing variable based on the chain equations. The graphical model [20] introduces a latent variable for each missing value, and finds the latent variables by learning their transition matrix. There are also some datadriven methods for missing value imputation. Yi et al. [32] imputed the missing values in air quality data with geographical features. Wang et al. [30] imputed missing values in recommendation system with collaborative filtering. Yu et al. [34] utilized matrix factorization with temporal regularization to impute the missing values in regularly sampled time series data.
Recently, some researchers attempted to impute the missing values with recurrent neural networks [7, 10, 21, 12, 33]. The recurrent components are trained together with the classification/regression component, which significantly boosts the accuracy. Che et al. [10] proposed GRUD, which imputes missing values in healthcare data with a smooth fashion. It assumes that a missing variable can be represented as the combination of its corresponding last observed value and the global mean. GRUD achieves remarkable performance on healthcare data. However, it has many limitations on general datasets [10]. A closely related work is MRNN proposed by Yoon et al. [33]. MRNN also utilizes bidirectional RNN to impute missing values. Differing from our model, it drops the relationships between missing variables. The imputed values in MRNN are treated as constants and cannot be sufficiently updated.
3 Preliminary
We first present the problem formulation and some necessary preliminaries.
Definition 1 (Multivariate Time Series)
We denote a multivariate time series as a sequence of observations. The th observation consists of features , and was observed at timestamp (the time gap between different timestamps may be not same). In reality, due to unexpected accidents, such as equipment damage or communication error, may have the missing values (e.g., in Fig. 1, in is missing). To represent the missing values in , we introduce a masking vector where,
In many cases, some features can be missing for consecutive timestamps (e.g., blue blocks in Fig. 1). We define to be the time gap from the last observation to the current timestamp , i.e.,
See Fig. 1 for an illustration.
In this paper, we study a general setting for time series classification/regression problems with missing values. We use to denote the label of corresponding classification/regression task. In general, can be either a scalar or a vector. Our goal is to predict based on the given time series . In the meantime, we impute the missing values in as accurate as possible. In another word, we aim to design an effective multitask learning algorithm for both classification/regression and imputation.
4 Brits
In this section, we describe the BRITS. Differing from the prior work which uses RNN to impute missing values in a smooth fashion [10], we learn the missing values directly in a recurrent dynamical system [25, 28] based on the observed data. The missing values are thus imputed according to the recurrent dynamics, which significantly boosts both the imputation accuracy and the final classification/regression accuracy. We start the description with the simplest case: the variables observed at the same time are mutually uncorrelated. For such case, we propose algorithms for imputation with unidirectional recurrent dynamics and bidirectional recurrent dynamics, respectively. We further propose an algorithm for correlated multivariate time series in Section 4.3.
4.1 Unidirectional Uncorrelated Recurrent Imputation
For the simplest case, we assume that for the th step, and are uncorrelated if (but may be correlated with some ). We first propose an imputation algorithm by unidirectional recurrent dynamics, denoted as RITSI.
In a unidirectional recurrent dynamical system, each value in the time series can be derived by its predecessors with a fixed arbitrary function [9, 24, 3]. Thus, we iteratively impute all the variables in the time series according to the recurrent dynamics. For the th step, if is actually observed, we use it to validate our imputation and pass to the next recurrent steps. Otherwise, since the future observations are correlated with the current value, we replace with the obtained imputation, and validate it by the future observations. To be more concrete, let us consider an example.
Example 1
Suppose we are given a time series , where and are missing. According to the recurrent dynamics, at each time step , we can obtain an estimation based on the previous steps. In the first steps, the estimation error can be obtained immediately by calculating the estimation loss function for . However, when , we cannot calculate the error immediately since the true values are missing. Nevertheless, note that at the th step, depends on to . We thus obtain a “delayed" error for at the th step.
4.1.1 Algorithm
We introduce a recurrent component and a regression component for imputation. The recurrent component is achieved by a recurrent neural network and the regression component is achieved by a fullyconnected network. A standard recurrent network [17] can be represented as
where is the function, , and are parameters, and is the hidden state of previous time steps.
In our case, since may have missing values, we cannot use as the input directly as in the above equation. Instead, we use a “complement" input derived by our algorithm when is missing. Formally, we initialize the initial hidden state as an allzero vector and then update the model by:
(1)  
(2)  
(3)  
(4)  
(5) 
Eq. (1) is the regression component which transfers the hidden state to the estimated vector . In Eq. (2), we replace missing values in with corresponding values in , and obtain the complement vector . Besides, since the time series may be irregularly sampled, in Eq. (3), we further introduce a temporal decay factor to decay the hidden vector . Intuitively, if is large (i.e., the values are missing for a long time), we expect a small to decay the hidden state. Such factor also represents the missing patterns in the time series which is critical to imputation [10]. In Eq. (4), based on the decayed hidden state, we predict the next state . Here, indicates the concatenate operation. In the mean time, we calculate the estimation error by the estimation loss function in Eq. (5). In our experiment, we use the mean absolute error for . Finally, we predict the task label as
where can be either a fullyconnected layer or a softmax layer depending on the specific task, and is the weight for different hidden state which can be derived by the attention mechanism^{1}^{1}1The design of attention mechanism is out of this paper’s scope. or the mean pooling mechanism, i.e., . We calculate the output loss by . Our model is then updated by minimizing the accumulated loss .
4.1.2 Practical Issues
In practice, we use LSTM as the recurrent component in Eq. (4) to prevent the gradient vanishing/exploding problems in vanilla RNN [17]. Standard RNN models including LSTM treat as a constant. During backpropagation, gradients are cut at . This makes the estimation errors backpropagate insufficiently. For example, in Example 1, the estimation errors of to are obtained at the th step as the delayed errors. However, if we treat to as constants, such delayed error cannot be fully backpropagated. To overcome such issue, we treat as a variable of RNN graph. We let the estimation error passes through during the backpropagation. Fig. 2 shows how RITSI method works in Example 1. In this example, the gradients are backpropagated through the opposite direction of solid lines. Thus, the delayed error is passed to steps and . In the experiment, we find that our models are unstable during training if we treat as a constant. See Appendix C for details.
4.2 Bidirectional Uncorrelated Recurrent Imputation
In the RITSI, errors of estimated missing values are delayed until the presence of the next observation. For example, in Example 1, the error of is delayed until is observed. Such error delay makes the model converge slowly and in turn leads to inefficiency in training. In the meantime, it also leads to the bias exploding problem [6], i.e., the mistakes made early in the sequential prediction are fed as input to the model and may be quickly amplified. In this section, we propose an improved version called BRITSI. The algorithm alleviates the abovementioned issues by utilizing the bidirectional recurrent dynamics on the given time series, i.e., besides the forward direction, each value in time series can be also derived from the backward direction by another fixed arbitrary function.
To illustrate the intuition of BRITSI algorithm, again, we consider Example 1. Consider the backward direction of the time series. In bidirectional recurrent dynamics, the estimation reversely depends on to . Thus, the error in the th step is backpropagated from not only the th step in the forward direction (which is far from the current position), but also the th step in the backward direction (which is closer). Formally, the BRITSI algorithm performs the RITSI as shown in Eq. (1) to Eq. (5) in forward and backward directions, respectively. In the forward direction, we obtain the estimation sequence and the loss sequence . Similarly, in the backward direction, we obtain another estimation sequence and another loss sequence . We enforce the prediction in each step to be consistent in both directions by introducing the “consistency loss”:
(6) 
where we use the mean squared error as the discrepancy in our experiment. The final estimation loss is obtained by accumulating the forward loss , the backward loss and the consistency loss . The final estimation in the th step is the mean of and .
4.3 Correlated Recurrent Imputation
The previously proposed algorithms RITSI and BRITSI assume that features observed at the same time are mutually uncorrelated. This may be not true in some applications. For example, in the air quality data [32], each feature represents one measurement in a monitoring station. Obviously, the observed measurements are spatially correlated. In general, the measurement in one monitoring station is close to those values observed in its neighboring stations. In this case, we can estimate a missing measurement according to both its historical data, and the measurements in its neighbors.
In this section, we propose an algorithm, which utilizes the feature correlations in the unidirectional recurrent dynamical system. We refer to such algorithm as RITS. The feature correlated algorithm for bidirectional case (i.e., BRITS) can be derived in the same way. Note that in Section 4.1, the estimation is only correlated with the historical observations, but irrelevant with the the current observation. We refer to as a “historybased estimation". In this section, we derive another “featurebased estimation" for each , based on the other features at time . Specifically, at the th step, we first obtain the complement observation by Eq. (1) and Eq. (2). Then, we define our featurebased estimation as where
(7) 
, are corresponding parameters. We restrict the diagonal of parameter matrix to be all zeros. Thus, the th element in is exactly the estimation of , based on the other features. We further combine the historicalbased estimation and the featurebased estimation . We denote the combined vector as , and we have that
(8)  
(9) 
Here we use as the weight of combining the historybased estimation and the featurebased estimation . Note that is derived from by Eq. (7). The elements of can be historybased estimations or truly observed values, depending on the presence of the observations. Thus, we learn the weight by considering both the temporal decay and the masking vector as shown in Eq. (8). The rest parts are similar as the feature uncorrelated case. We first replace missing values in with the corresponding values in . The obtained vector is then fed to the next recurrent step to predict memory :
(10)  
(11) 
However, the estimation loss is slightly different with the feature uncorrelated case. We find that simply using leads to a very slow convergence speed. Instead, we accumulate all the estimation errors of , and :
5 Experiment
Our proposed methods are applicable to a wide variety of applications. We evaluate our methods on three different realworld datasets. The download links of the datasets, as well as the implementation codes can be found in the GitHub page^{2}^{2}2https://github.com/NIPSBRITS/BRITS.
5.1 Dataset Description
5.1.1 Air Quality Data
We evaluate our models on the air quality dataset, which consists of PM2.5 measurements from monitoring stations in Beijing. The measurements are hourly collected from 2014/05/01 to 2015/04/30. Overall, there are values are missing. For this dataset, we do pure imputation task. We use the exactly same train/test setting as in prior work [32], i.e., we use the , , and months as the test data and the other months as the training data. See Appendix D for details. To train our model, we randomly select every consecutive steps as one time series.
5.1.2 Healthcare Data
We evaluate our models on healthcare data in PhysioNet Challenge 2012 [27], which consists of multivariate clinical time series from intensive care unit (ICU). Each time series contains measurements such as Albumin, heartrate etc., which are irregularly sampled at the first hours after the patient’s admission to ICU. We stress that this dataset is extremely sparse. There are up to missing values in total. For this dataset, we do both the imputation task and the classification task. To evaluate the imputation performance, we randomly eliminate of observed measurements from data and use them as the groundtruth. At the same time, we predict the inhospital death of each patient as a binary classification task. Note that the eliminated measurements are only used for validating the imputation, and they are never visible to the model.
5.1.3 Localization for Human Activity Data
The UCI localization data for human activity [18] contains records of five people performing different activities such as walking, falling, sitting down etc (there are activities). Each person wore four sensors on her/his left/right ankle, chest, and belt. Each sensor recorded a 3dimensional coordinates for about to millisecond. We randomly select consecutive steps as one time series, and there are time series in total. For this dataset, we do both imputation and classification tasks. Similarly, we randomly eliminate observed data as the imputation groundtruth. We further predict the corresponding activity of observed time series (i.e., walking, sitting, etc).
5.2 Experiment Setting
5.2.1 Model Implementations
We fix the dimension of hidden state in RNN to be . We train our model by an Adam optimizer with learning rate and batch size . For all the tasks, we normalize the numerical values to have zero mean and unit variance for stable training.
We use different early stopping strategies for pure imputation task and classification tasks. For the imputation tasks, we randomly select of nonmissing values as the validation data. The early stopping is thus performed based on the validation error. For the classification tasks, we first pretrain the model as an imputation task. Then we use fold cross validation to further optimize both the imputation and classification losses simultaneously.
We evaluate the imputation performance in terms of mean absolute error () and mean relative error (). Suppose that is the groundtruth of the th item, is the output of the th item, and there are items in total. Then, and are defined as
For air quality data, the evaluation is performed on its original data. For heathcare data and activity data, since the numerical values are not in the same scale, we evaluate the performances on their normalized data. To further evaluate the classification performances, we use area under ROC curve () [8] for healthcare data, since such data is highly unbalanced (there are patients who died in hospital). We then use standard accuracy for the activity data, since different activities are relatively balanced.
5.2.2 Baseline Methods
We compare our model with both RNNbased methods and nonRNN based methods. The nonRNN based imputation methods include:

Mean: We simply replace the missing values with corresponding global mean.

KNN: We use nearest neighbor [13] to find the similar samples, and impute the missing values with weighted average of its neighbors.

Matrix Factorization (MF): We factorize the data matrix into two lowrank matrices, and fill the missing values by matrix completion [13].

MICE: We use Multiple Imputation by Chained Equations (MICE), a widely used imputation method, to fill the missing values. MICE creates multiple imputations with chained equations [2].

ImputeTS: We use ImputeTS package in R to impute the missing values. ImputeTS is a widely used package for missing value imputation, which utilizes the state space model and kalman smoothing [23].

STMVL: Specifically, we use STMVL for the air quality data imputation. STMVL is the stateoftheart method for air quality data imputation. It further utilizes the geolocations when imputing missing values [32].
We implement KNN, MF and MICE based on the python package fancyimpute^{3}^{3}3https://github.com/iskandr/fancyimpute. In recent studies, RNNbased models in missing value imputations achieve remarkable performances [10, 21, 12, 33]. We also compare our model with existing RNNbased imputation methods, including:

GRUD: GRUD is proposed for handling missing values in healthcare data. It imputes each missing value by the weighted combination of its last observation, and the global mean, together with a recurrent component [10].

MRNN: MRNN also uses bidirectional RNN. It imputes the missing values according to hidden states in both directions in RNN. MRNN treats the imputed values as constants. It does not consider the correlations among different missing values[33].
We compare the baseline methods with our four models: RITSI (Section 4.1), RITS (Section 4.2), BRITSI (Section 4.3) and BRITS (Section 4.3). We implement all the RNNbased models with PyTorch, a widely used package for deep learning. All models are trained with GPU GTX 1080.
5.3 Experiment Results
Method  Air Quality  Healthcare  Human Activity  
NonRNN  Mean  ()  
KNN  ()  
MF  ()  
MICE  ()  ()  
ImputeTS  ()  
STMVL  ()  /  /  
RNN  GRUD  /  
MRNN  ()  
Ours  RITSI  ()  
BRITSI  ()  
RITS  ()  
BRITS  () 
Table 1 shows the imputation results. As we can see, simply applying naıve mean imputation is very inaccurate. Alternatively, KNN, MF, and MICE perform much better than mean imputation. ImputeTS achieves the best performance among all the nonRNN methods, especially for the heathcare data (which is smooth and contains few sudden waves). STMVL performs well on the air quality data. However, it is specifically designed for geographical data, and cannot be applied in the other datasets. Most of RNNbased methods, except GRUD, demonstrates significantly better performances in imputation tasks. We stress that GRUD does not impute missing value explicitly. It actually performs well on classification tasks (See Appendix A for details). MRNN uses explicitly imputation procedure, and achieves remarkable imputation results. Our model BRITS outperforms all the baseline models significantly. According to the performances of our four models, we also find that both bidirectional recurrent dynamics, and the feature correlations are helpful to enhance the model performance.
We also compare the accuracies of all RNNbased models in classification tasks. GRUD achieves an AUC of on healthcare data , and accuracy of on human activity; MRNN is slightly worse. It achieves and on two tasks. Our model BRITS outperforms the baseline methods. The accuracies of BRITS are and respectively. See Appendix A for more classification results.
6 Conclusion
In this paper, we proposed BRITS, a novel method to use recurrent dynamics to effectively impute the missing values in multivariate time series. Instead of imposing assumptions over the datagenerating process, our model directly learns the missing values in a bidirectional recurrent dynamical system, without any specific assumption. Our model treats missing values as variables of the bidirectional RNN graph. Thus, we get the delayed gradients for missing values in both forward and backward directions, which makes the imputation of missing values more accurate. We performed the missing value imputation and classification/regression simultaneously within a joint neural network. Experiment results show that our model demonstrates more accurate results for both imputation and classification/regression than stateoftheart methods.
References
 [1] C. F. Ansley and R. Kohn. On the estimation of arima models with missing values. In Time series analysis of irregularly observed data, pages 9–37. Springer, 1984.
 [2] M. J. Azur, E. A. Stuart, C. Frangakis, and P. J. Leaf. Multiple imputation by chained equations: what is it and how does it work? International journal of methods in psychiatric research, 20(1):40–49, 2011.
 [3] A. Basharat and M. Shah. Time series prediction by chaotic modeling of nonlinear dynamical systems. In Computer Vision, 2009 IEEE 12th International Conference on, pages 1941–1948. IEEE, 2009.
 [4] B. BatresEstrada. Deep learning for multivariate financial time series, 2015.
 [5] S. Bauer, B. Schölkopf, and J. Peters. The arrow of time in multivariate time series. In International Conference on Machine Learning, pages 2043–2051, 2016.
 [6] S. Bengio, O. Vinyals, N. Jaitly, and N. Shazeer. Scheduled sampling for sequence prediction with recurrent neural networks. In Advances in Neural Information Processing Systems, pages 1171–1179, 2015.
 [7] M. Berglund, T. Raiko, M. Honkala, L. Kärkkäinen, A. Vetek, and J. T. Karhunen. Bidirectional recurrent neural networks as generative models. In Advances in Neural Information Processing Systems, pages 856–864, 2015.
 [8] A. P. Bradley. The use of the area under the roc curve in the evaluation of machine learning algorithms. Pattern recognition, 30(7):1145–1159, 1997.
 [9] P. Brakel, D. Stroobandt, and B. Schrauwen. Training energybased models for timeseries imputation. The Journal of Machine Learning Research, 14(1):2771–2797, 2013.
 [10] Z. Che, S. Purushotham, K. Cho, D. Sontag, and Y. Liu. Recurrent neural networks for multivariate time series with missing values. Scientific reports, 8(1):6085, 2018.
 [11] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase representations using rnn encoderdecoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.
 [12] E. Choi, M. T. Bahadori, A. Schuetz, W. F. Stewart, and J. Sun. Doctor ai: Predicting clinical events via recurrent neural networks. In Machine Learning for Healthcare Conference, pages 301–318, 2016.
 [13] J. Friedman, T. Hastie, and R. Tibshirani. The elements of statistical learning, volume 1. Springer series in statistics Springer, Berlin, 2001.
 [14] D. S. Fung. Methods for the estimation of missing values in time series. 2006.
 [15] A. C. Harvey. Forecasting, structural time series models and the Kalman filter. Cambridge university press, 1990.
 [16] S. Hochreiter and J. Schmidhuber. Long shortterm memory. Neural Computation, 9(8):1735, 1997.
 [17] G. Ian, B. Yoshua, and C. Aaron. Deep learning. Book in preparation for MIT Press, 2016.
 [18] B. Kaluža, V. Mirchevska, E. Dovgan, M. Luštrek, and M. Gams. An agentbased approach to care in independent living. In International joint conference on ambient intelligence, pages 177–186. Springer, 2010.
 [19] D. M. Kreindler and C. J. Lumsden. The effects of the irregular sample and missing data in time series analysis. Nonlinear Dynamical Systems Analysis for the Behavioral Sciences Using Real Data, page 135, 2012.
 [20] L. Li, J. McCann, N. S. Pollard, and C. Faloutsos. Dynammo: Mining and summarization of coevolving sequences with missing values. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 507–516. ACM, 2009.
 [21] Z. C. Lipton, D. Kale, and R. Wetzel. Directly modeling missing data in sequences with rnns: Improved classification of clinical time series. In Machine Learning for Healthcare Conference, pages 253–270, 2016.
 [22] Z. Liu and M. Hauskrecht. Learning linear dynamical systems from multivariate time series: A matrix factorization based framework. In Proceedings of the 2016 SIAM International Conference on Data Mining, pages 810–818. SIAM, 2016.
 [23] S. Moritz and T. BartzBeielstein. imputeTS: Time Series Missing Value Imputation in R. The R Journal, 9(1):207–218, 2017.
 [24] T. Ozaki. 2 nonlinear time series models and dynamical systems. Handbook of statistics, 5:25–83, 1985.
 [25] R. Pascanu, T. Mikolov, and Y. Bengio. On the difficulty of training recurrent neural networks. In International Conference on Machine Learning, pages 1310–1318, 2013.
 [26] S. Rani and G. Sikka. Recent techniques of clustering of time series data: a survey. International Journal of Computer Applications, 52(15), 2012.
 [27] I. Silva, G. Moody, D. J. Scott, L. A. Celi, and R. G. Mark. Predicting inhospital mortality of icu patients: The physionet/computing in cardiology challenge 2012. In Computing in Cardiology (CinC), 2012, pages 245–248. IEEE, 2012.
 [28] D. Sussillo and O. Barak. Opening the black box: lowdimensional dynamics in highdimensional recurrent neural networks. Neural computation, 25(3):626–649, 2013.
 [29] D. Wang, W. Cao, J. Li, and J. Ye. Deepsd: supplydemand prediction for online carhailing services using deep neural networks. In Data Engineering (ICDE), 2017 IEEE 33rd International Conference on, pages 243–254. IEEE, 2017.
 [30] J. Wang, A. P. De Vries, and M. J. Reinders. Unifying userbased and itembased collaborative filtering approaches by similarity fusion. In Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval, pages 501–508. ACM, 2006.
 [31] S. Xingjian, Z. Chen, H. Wang, D.Y. Yeung, W.K. Wong, and W.c. Woo. Convolutional lstm network: A machine learning approach for precipitation nowcasting. In Advances in neural information processing systems, pages 802–810, 2015.
 [32] X. Yi, Y. Zheng, J. Zhang, and T. Li. Stmvl: filling missing values in geosensory time series data. 2016.
 [33] J. Yoon, W. R. Zame, and M. van der Schaar. Multidirectional recurrent neural networks: A novel method for estimating missing data. 2017.
 [34] H.F. Yu, N. Rao, and I. S. Dhillon. Temporal regularized matrix factorization for highdimensional time series prediction. In Advances in neural information processing systems, pages 847–855, 2016.
 [35] J. Zhang, Y. Zheng, and D. Qi. Deep spatiotemporal residual networks for citywide crowd flows prediction. AAAI, November 2017.
Appendix A Performance Comparison for Classification Tasks
Table 2 shows the performances of different RNNbased models on the classification tasks. As we can see, our model BRITS outperforms other baseline methods for classifications. Comparing with Table 1, GRUD performs well for classification tasks. Furthermore, we find that it is very important for GRUD to carefully apply the dropout techniques in order to prevent the overfitting (we use dropout layer on the top classification layer). However, our models further utilize the imputation errors as the supervised signals. It seems that dropout is not necessary for our models during training.
Method  Healthcare (AUC)  Human Activity (Accuracy) 

GRUD  
MRNN  
RITSI  
BRITSI  
RITS  
BRITS 
Appendix B Performance Comparison for Univariate Synthetic Data
To better understand our model, we generate a set of univariate synthetic time series. Speficically, we randomly generate a time series with length , using the statespace representation [15]:
where is the th value in time series. The residual terms , , and are randomly drawn from a normal distribution . We eliminate about values from the generated series, and compare our model BRITSI (note the data is univariate) and ImputeTS. We show three examples in Fig. 3.
The first row corresponds to the imputations of ImputeTS and the second row corresponds to our model BRITSI. As we can see, our model demonstrates better performance than ImputeTS. Especially, ImputeTS fails when the start part of time series is missing. However, for our model, the imputation errors are backpropagated to previous time steps. Thus, our model can adjust the imputations in the start part with delayed gradient, which leads to much more accurate results.
Appendix C Performance for Nondifferentiable
As we claimed in Section 4.1, we regard the missing values as variables of RNN graph. During the backpropagation, the imputed values can be further updated sufficiently. In the experiment, we find that if we cut the gradient of (i.e., regard it as constant), the models are unstable and easy to overfit during the training. We refer to the model with nondifferentiate as BRITScut. Fig. 4 shows the validation errors of BRITS and BRITScut during training for the healthcare data imputation. At the first iterations, the validation error of BRITScut decreases fast. However, it soon fails due to the overfitting.
Appendix D Test Data of Air Quality Imputation
We use the same method as in prior work [32] to select the test data for air quality imputation. Specifically, we use the , , and months as the test set and the rest data as the training set. To evaluate our model, we select the test timeslots by the following rule: if a measurement is observed at a timeslot in one of these four months (e.g., 8 o’clock 2015/03/07), we check the corresponding position in its previous month (e.g., 8 o’clock 2015/02/07). If it is absent in the previous month, we eliminate such value and use it as the imputation groundtruth of test data. Recall that to train our model, we randomly select consecutive time steps as one time series. Thus, a test timeslot may be contained in multiple time series. We use the mean of the imputations in different time series as the final result.