Multivariate Time Series Imputation
with Variational Autoencoders
Abstract
Multivariate time series with missing values are common in many areas, for instance in healthcare and finance. To face this problem, modern data imputation approaches should (a) be tailored to sequential data, (b) deal with high dimensional and complex data distributions, and (c) be based on the probabilistic modeling paradigm for interpretability and confidence assessment. However, many current approaches fall short in at least one of these aspects. Drawing on advances in deep learning and scalable probabilistic modeling, we propose a new deep sequential variational autoencoder approach for dimensionality reduction and data imputation. Temporal dependencies are modeled with a Gaussian process prior and a Cauchy kernel to reflect multiscale dynamics in the latent space. We furthermore use a structured variational inference distribution that improves the scalability of the approach. We demonstrate that our model exhibits superior imputation performance on benchmark tasks and challenging realworld medical data.
1 Introduction
Time series are often associated with missing values, for instance due to faulty measurement devices, partially observed states, or costly measurement procedures [13]. These missing values impair the usefulness and interpretability of the data, leading to the problem of data imputation: estimating those missing values from the observed ones [34].
Multivariate time series, consisting of multiple correlated univariate time series or channels, give rise to two distinct ways of imputing missing information: (1) by exploiting temporal correlations within each channel, and (2) by exploiting correlations across channels, for example by using lowerdimensional representations of the data. For instance in a medical setting, if the blood pressure of a patient is unobserved, it can be informative that the heart rate at the current time is higher than normal and that the blood pressure was also elevated an hour ago. An ideal imputation model for multivariate time series should therefore take both of these sources of information into account. Another desirable property of such models is to offer a probabilistic interpretation, allowing for uncertainty estimation.
Unfortunately, current imputation approaches fall short with respect to at least one of these desiderata. While there are many timetested statistical methods for multivariate time series analysis (e.g., Gaussian processes [33]) that work well in the case of complete data, these methods are generally not applicable when features are missing. On the other hand, classical methods for time series imputation often do not take the potentially complex interactions between the different channels into account [25, 30]. Finally, recent work has explored the use of nonlinear dimensionality reduction using variational autoencoders for i.i.d. data points with missing values [29, 1, 27] , but this work has not considered temporal data and strategies for sharing statistical strength across time.
Following these considerations, it is promising to combine nonlinear dimensionality reduction with an expressive time series model. This can be done by jointly learning a mapping from the data space (where features are missing) into a latent space (where all dimensions are fully determined). A statistical model of choice can then be applied in this latent space to model temporal dynamics. If the dynamics model and the mapping for dimensionality reduction are both differentiable, the approach can be trained endtoend.
In this paper, we propose an architecture that uses deep variational autoencoders (VAEs) to map the missing data time series into a latent space without missingness, where we model the lowdimensional dynamics with a Gaussian process (GP). As we will discuss below, we hereby propose a prior model that efficiently operates at multiple time scales, taking into account that the multivariate time series may have different channels (e.g., heart rate, blood pressure, etc.) that change with different characteristic frequencies. Finally, our variational inference approach makes use of efficient structured variational approximations, where we fit another multivariate Gaussian process in order to approximate the intractable true posterior.
We make the following contributions:

A new model. We propose a VAE architecture for multivariate time series imputation with a GP prior in the latent space to capture temporal dynamics. We propose a Cauchy kernel to allow the time series to display dynamics at multiple scales in a reduced dimensionality.

Efficient inference. We use a structured variational approximation that models posterior correlations in the time domain. By construction, inference is efficient and the time complexity for sampling from the variational distribution, used for training, is linear in the number of time steps (as opposed to cubic when done naïvely).

Benchmarking on realworld data. We carry out extensive comparisons to classical imputation methods as well as stateoftheart deep learning approaches, and perform experiments on data from two different domains. Our method outperforms the baselines in both cases.
2 Related work
Classical statistical approaches.
The problem of missing values has been a longstanding challenge in many time series applications, especially in the field of medicine [30]. The earliest approaches to deal with this problem often relied on heuristics, such as mean imputation or forward imputation. Despite their simplicity, these methods are still widely applied today due to their efficiency and interpretability [13]. Orthogonal to these ideas, methods along the lines of expectationmaximization (EM) have been proposed, but they often require additional modeling assumptions [4].
Bayesian methods.
When it comes to estimating likelihoods and uncertainties relating to the imputations, Bayesian methods, such as Gaussian processes (GPs) [31], have a clear advantage over nonBayesian methods such as single imputation [25]. There has been much recent work in making these methods more expressive and incorporating prior knowledge from the domain (e.g., medical time series) [37, 9] or adapting them to work on discrete domains [10], but their widespread adoption is hindered by their limited scalability and the challenges in designing kernels that are robust to missing values.
Deep learning techniques.
Another avenue of research in this area uses deep learning techniques, such as variational autoencoders (VAEs) [29, 1, 27, 8] or generative adversarial networks (GANs) [38, 22]. It should be noted that VAEs allow for tractable likelihoods, while GANs generally do not and have to rely on additional optimization processes to find latent representations of a given input [24]. Unfortunately, none of these models explicitly take the temporal dynamics of time series data into account. Conversely, there are deep probabilistic models for time series [e.g., 19, 20, 11], but those do not explicitly handle missing data.
HiVae.
Our approach borrows some ideas from the HIVAE [29]. This model deals with missing data by defining an ELBO whose reconstruction error term only sums over the observed part of the data. For inference, the incomplete data are filled with arbitrary values (e.g., zeros) before they are fed into the inference network, which induces an unavoidable bias. The main difference to our approach is that the HIVAE was not formulated for sequential data and therefore does not exploit temporal information in the imputation task.
Deep learning for time series imputation.
While the mentioned deep learning approaches are very promising, most of them do not take the time series nature of the data directly into account, that is, they do not model the temporal dynamics of the data when dealing with missing values. To the best of our knowledge, the only deep learning model for missing value imputation that does account for the time series nature of the data is the GRUIGAN [26], which we describe in Sec. 4. We will show that our approach outperforms this baseline on our considered data sets.
Other related work.
Our proposed model combines several ideas from the domains of Bayesian deep learning and classical probabilistic modeling; thus, removing elements from our model naturally relates to other approaches. For example, removing the latent GP for modeling dynamics as well as our proposed structured variational distribution results in the HIVAE [29] described above. Furthermore, our idea of using a latent GP in the context of a deep generative model bears similarities to the GPPVAE [7], but note that the GPPVAE was not proposed to model time series data and does not take missing values into account. Lastly, the GP prior with the Cauchy kernel is reminiscent of Jähnichen et al. [15] and the structured variational distribution is similar to the one used by Bamler and Mandt [3] in the context of modeling word embeddings over time. Neither of these two works, however, considered amortized inference or VAEs.
3 Model
We propose a novel architecture for missing value imputation, an overview of which is depicted in Figure 1. Our model can be seen as a way to perform amortized approximate inference on a latent Gaussian process model.
The main idea of our proposed approach is to embed the data into a latent space of reduced dimensionality, in which every dimension is fully determined, and then model the temporal dynamics in this latent space. Since many features in the data might be correlated, the latent representation captures these correlations and uses them to reconstruct the missing values. Moreover, the GP prior in the latent space encourages the model to embed the data into a representation in which the temporal dynamics are smoother and more easily explainable than in the original data space. Finally, the structured variational distribution of the inference network allows the model to integrate temporal information into the representations, such that the reconstructions of missing values can not only be informed by correlated observed features at the same time point, but also by the rest of time series.
Specifically, we combine ideas from VAEs [18], GPs [31], Cauchy kernels [15], structured variational distributions with efficient inference [3], and a special ELBO for missing data [29] and synthesize these ideas into a general framework for missing data imputation on time series. In the following, we will outline the problem setting, describe the assumed generative model, and derive our proposed inference scheme.
3.1 Problem setting and notation
We assume a data set with data points . Let us assume that the data points were measured at consecutive time points with . By convention, we usually set . The data can thus be viewed as a time series of length in time.
We moreover assume that any number of these data features can be missing, that is, that their values can be unknown. We can now partition each data point into observed and unobserved features. The observed features of data point are . Equivalently, the missing features are with .
We can now use this partitioning to define the problem of missing value imputation. Missing value imputation describes the problem of estimating the true values of the missing features given the observed features . Many methods assume the different data points to be independent, in which case the inference problem reduces to separate problems of estimating . In the time series setting, this independence assumption is not satisfied, which leads to the more complex estimation problem of .
3.2 Generative model
In this subsection, we describe the details of our proposed approach of reducing the observed time series with missing data into a representation without missingness, and modeling dynamics in this lowerdimensional representation using Gaussian processes. Yet, it is tempting to try to skip the step of dimensionality reduction and instead directly try to model the incomplete data in the observed space using GPs. We argue that this is not practical for several reasons.
Gaussian processes are well suited for time series modeling [33] and offer many advantages, such as dataefficiency and calibrated posterior probabilities. However, they come at the cost of inverting the kernel matrix, which has a time complexity of . Moreover, designing a kernel function that accurately captures correlations in feature space and also in the temporal dimension is difficult.
This problem becomes even worse if certain observations are missing. One option is to fill the missing values with some numerical value (e.g., zero) to make the kernel computable. However, this arbitrary filling may make two data points with different missingness patterns look very dissimilar when in fact they are close to each other in the groundtruth space. Another alternative is to treat every channel of the multivariate time series separately and let the GP infer missing values, but this ignores valuable correlations across channels.
In this work, we overcome the problem of defining a suitable GP kernel in the data space with missing observations by instead applying the GP in the latent space of a variational autoencoder where the encoded feature representations are complete. That is, we assign a latent variable for every , and model temporal correlations in this reduced representation using a GP, . This way, we decouple the step of filling in missing values and capturing instantaneous correlations between the different feature dimensions from modeling dynamical aspects. The graphical model is depicted in Figure 1.
A remaining practical difficulty that we encountered is that many multivariate time series display dynamics at multiple time scales. One of our main motivations is to model time series that arise in medical setups where doctors measure different patient variables and vital signs, such as heart rate, blood pressure, etc. When using conventional GP kernels (e.g., the RBF kernel, ), one implicitly assumes a single time scale of relevance (). We found that this choice does not reflect the dynamics of medical data very well.
In order to model data that varies at multiple time scales, we consider a mixture of RBF kernels with different ’s [31]. By defining a Gamma distribution over the length scale, that is, , we can compute an infinite mixture of RBF kernels,
This yields the socalled Rational Quadratic kernel [31]. For and , it reduces to the Cauchy kernel
(1) 
which has previously been successfully used in the context of robust dynamic topic modeling where similar multiscale time dynamics occur [15]. We therefore choose this kernel for our Gaussian process prior.
Given the latent time series , the observations are generated timepointwise by
(2) 
where is a potentially nonlinear function parameterized by the parameter vector . Considering the scenario of a medical time series, can be thought of as the latent physiological state of the patient and would be the process of generating observable measurements (e.g., heart rate, blood pressure, etc.) from that physiological state. In our experiments, the function is implemented by a deep neural network.
3.3 Inference model
In order to learn the parameters of the deep generative model described above, and in order to efficiently infer its latent state, we are interested in the posterior distribution . Since the exact posterior is intractable, we use variational inference [16, 6, 39]. Furthermore, to avoid inference over perdatapoint (local) variational parameters, we apply inference amortization [18]. To make our variational distribution more expressive and capture the temporal correlations of the data, we employ a structured variational distribution [36] with efficient inference that leads to an approximate posterior which is also a GP.
We approximate the true posterior with a multivariate Gaussian variational distribution
(3) 
where indexes the dimensions in the latent space. Our approximation implies that our variational posterior is able to reflect correlations in time, but breaks dependencies across the different dimensions in space (which is typical in VAE training [18, 32]).
We choose the variational family to be the family of multivariate Gaussian distributions in the time domain, where the precision matrix is parameterized in terms of a product of bidiagonal matrices,
(4) 
Above, the ’s are local variational parameters and is an upper triangular band matrix. Similar structured distributions were also employed by Blei and Lafferty [5], Bamler and Mandt [2].
This parameterization automatically leads to being positive definite, symmetric, and tridiagonal. Samples from can thus be generated in linear time in [14, 28, 3] as opposed to the cubic time complexity for a fullrank matrix. Moreover, compared to a fully factorized variational approximation, the number of variational parameters are merely doubled. Note that while the precision matrix is sparse, the covariance matrix can still be dense, allowing to reflect longrange dependencies in time.
Instead of optimizing and separately for every data point, we amortize the inference through an inference network with parameters that computes the variational parameters based on the inputs as . In the following, we accordingly denote the variational distribution as . Following standard VAE training, the parameters of the generative model and of the inference network can be jointly trained by optimizing the evidence lower bound (ELBO),
(5) 
Following Nazabal et al. [29] (see Sec. 2), we evaluate the ELBO only on the observed features of the data since the remaining features are unknown, and set these missing features to a fixed value (zero) during inference. Our training objective is thus the RHS of (5).
Neural network architectures.
We use a convolutional neural network (CNN) as an inference network and a fully connected multilayer perceptron (MLP) as a generative network. The inference network convolves over the time dimension of the input data and allows for sequences of variable lengths. It consists of a number of convolutional layers that integrate information from neighboring time steps into a joint representation using a fixed receptive field (see Figure 1). The CNN outputs a tensor of size , where is the dimensionality of the latent space. Every row corresponds to a time step and contains parameters, which are used to predict the mean vector as well as the diagonal and offdiagonal elements that characterize at the given time step. More details about the network structure are given in the appendix (Sec. A).
4 Experiments
We performed experiments on the benchmark data set Healing MNIST [19], which combines the classical MNIST data set [21] with properties common to medical time series, the SPRITES data set [23], and on a realworld medical data set from the 2012 Physionet Challenge [35]. We compared our model against conventional single imputation methods [25], GPbased imputation [31], VAEbased methods that are not specifically designed to handle temporal data [18, 29], and modern stateoftheart deep learning methods for temporal data imputation [26]. We found strong quantitative and qualitative evidence that our proposed model outperforms the baseline methods in terms of imputation quality on all three tasks. In the following, we are first going to give an overview of the baseline methods and then present our experimental findings. Details about the data sets and neural network architectures can be found in the appendix (Sec. A).
4.1 Baseline methods
Forward imputation and mean imputation.
Forward and mean imputation are socalled single imputation methods, which means that they do not attempt to fit a distribution over possible values for the missing features, but only predict one estimate [25]. Forward imputation always predicts the last observed value for any given feature, while mean imputation predicts the mean of all the observations of the feature in a given time series.
Gaussian process in data space.
One option to deal with missingness in multivariate time series is to fit independent Gaussian processes to each channel. As discussed previously (Sec. 3.2), this ignores the correlation between channels. The missing values are then imputed by taking the mean of the respective posterior of the GP for that feature.
VAE and HIVAE.
The VAE [18] and HIVAE [29] are fit to the data using the same training procedure as the proposed GPVAE model. The VAE uses a standard ELBO that is defined over all the features, while the HIVAE uses the ELBO from (5), which is only evaluated on the observed part of the feature space. During inference, missing features are filled with constant values, such as zero.
GruiGan.
The GRUIGAN [26] uses a recurrent neural network (RNN), namely a gated recurrent unit (GRU). Once the network is trained, a time series is imputed by optimizing the latent vector in the input space of the generator, such that the generator’s output on the observed features is closest to the true values.
4.2 Healing MNIST
Time series with missing values play a crucial role in the medical field, but are often hard to obtain. Krishnan et al. [19] generated a data set called Healing MNIST, which is designed to reflect many properties that one also finds in real medical data. We benchmark our method on a variant of this data set. It was designed to incorporate some properties that one also finds in real medical data, and consists of short sequences of moving MNIST digits [21] that rotate randomly between frames. The analogy to healthcare is that every frame may represent the collection of measurements that describe a patient’s health state, which contains many missing measurements at each moment in time. The temporal evolution represents the nonlinear evolution of the patient’s health state. The image frames contain around 60 % missing pixels and the rotations between two consecutive frames are normally distributed.
The benefit of this data set is that we know the ground truth of the imputation task. We compare our model against a standard VAE (no latent GP and standard ELBO over all features), the HIVAE [29], as well as mean imputation and forward imputation. The models were trained on time series of digits from the Healing MNIST training set (50,000 time series) and tested on digits from the Healing MNIST test set (10,000 time series). Negative log likelihoods on the ground truth values of the missing pixels and mean squared errors (MSE) are reported in Table 1, and qualitative results shown in Figure 2. To assess the usefulness of the imputations for downstream tasks, we also trained a linear classifier on the imputed MNIST digits to predict the digit class and measured its performance in terms of area under the receiveroperatorcharacteristic curve (AUROC) (Tab. 1).
Healing MNIST  SPRITES  

Model  NLL  MSE  AUROC  NLL  MSE 
Mean imputation [25]    0.178 0.000  0.787    0.013 0.000 
Forward imputation [25]    0.174 0.000  0.779    0.028 0.000 
VAE [18]  0.4798 0.0016  0.152 0.001  0.738  0.3702 0.0140  0.034 0.000 
HIVAE [29]  0.2896 0.0010  0.088 0.000  0.811  0.3309 0.0126  0.035 0.000 
GPVAE (proposed)  0.2606 0.0008  0.078 0.000  0.826  1.9595 0.0011  0.002 0.000 
Our approach outperforms the baselines in terms of likelihood and MSE. The reconstructions (Fig. 2) reveal the benefits of the GPVAE approach: related approaches yield unstable reconstructions over time, while our approach offers more stable reconstructions, using temporal information from neighboring frames. Moreover, our model also yields the most useful imputations for downstream classification in terms of AUROC. The downstream classification performance correlates well with the test likelihood on the ground truth data, supporting the intuition that it is a good proxy measure in cases where the ground truth likelihood is not available.
4.3 SPRITES data
To assess our model’s performance on more complex data, we applied it to the SPRITES data set, which has previously been used with sequential autoencoders [23]. The dataset consists of 9,000 sequences of animated characters with different clothes, hair styles, and skin colors, performing different actions. Each frame has a size of pixels and each time series features 8 frames. We again introduced about 60 % of missing pixels and compared the same methods as above. The results are reported in Table 1 and example reconstructions are shown in Figure 2. As in the previous experiment, our model outperforms the baselines in terms of likelihood and MSE and also yields the most convincing reconstructions. The HIVAE seems to suffer from posterior collapse in this setting, which might be due to the large dimensionality of the input data.
4.4 Real medical time series data
We also applied our model to the data set from the 2012 Physionet Challenge [35]. The data set contains 12,000 patients which were monitored on the intensive care unit (ICU) for 48 hours each. At each hour, there is a measurement of 36 different variables (heart rate, blood pressure, etc.), any number of which might be missing. We again compare our model against the standard VAE and HIVAE, as well as a GP fit featurewise in the data space and the GRUIGAN model [26], which reported stateoftheart imputation performance.
The main challenge is the absence of ground truth data for the missing values. This cannot easily be circumvented by introducing additional missingness since (1) the mechanism by which measurements were omitted is not random, and (2) the data set is already very sparse with about 90 % of the features missing. To overcome this issue, Luo et al. [26] proposed a downstream task as a proxy for the imputation quality. They chose the task of mortality prediction, which was one of the main tasks of the Physionet Challenge on this data set, and measured the performance in terms of AUROC. In this paper, we adopt this measure.
For sake of interpretability, we used a linear support vector machine (SVM) as a downstream classification model. This model tries to optimally separate the whole time series in the input space using a linear hyperplane. The choice of model follows the intuition that under a perfect imputation similar patients should be located close to each other in the input space, while that is not necessarily the case when features are missing, or when the imputation is poor. Note that it is unrealistic to ask for high accuracies in this task, as the clean data are unlikely to be perfectly separable. As seen in Table 1, this proxy measure correlates well with the ground truth likelihood.
The performances of the different methods under this measure are reported in Table 4.2. Our model outperforms all baselines, including the GRUIGAN, which provides strong evidence that our model is well suited for realworld medical time series imputations.
5 Conclusion
We presented a deep probabilistic model for multivariate time series imputation, where we combined ideas from variational autoencoders and Gaussian processes. The VAE maps the missing data from the input space into a latent space where every dimension is completely determined. The GP then models the temporal dynamics in this latent space. To flexibly model dynamics on different time scales, we use a Cauchy kernel for the latent GP prior. Moreover, we use structured variational inference to approximate the latent GP posterior, which reflects the temporal correlations of the data more accurately than a fully factorized approximation. At the same time, inference in our variational distribution is still efficient, as opposed to inference in the full GP posterior.
We empirically validated our proposed model on Healing MNIST benchmark data, SPRITES data, and realworld medical time series data from the 2012 Physionet Challenge. We observe that our model outperforms classical baselines as well as modern deep learning approaches on these tasks. This suggests that the model can successfully learn the temporal dynamics of realworld processes, even under high missingness rates.
In future work, it would be interesting to assess the applicability of the model to other data domains (e.g., natural videos) and explore a larger variety of kernel choices (e.g., learned kernels) for the latent GP. Moreover, it could be a fruitful avenue of research to choose more sophisticated neural network architectures for the inference model and the generative model. The inference network could for instance use an architecture that factorizes across features [27] or groups of features [1], in order to handle missing values even more flexibly. The generative network, on the other hand, could be extended with an autoregressive model [12], in order to improve the coherence of the output time series even further.
References
 Ainsworth et al. [2018] Samuel K Ainsworth, Nicholas J Foti, and Emily B Fox. Disentangled vae representations for multiaspect and missing data. arXiv preprint arXiv:1806.09060, 2018.
 Bamler and Mandt [2017a] Robert Bamler and Stephan Mandt. Dynamic word embeddings. In Proceedings of the 34th International Conference on Machine LearningVolume 70, pages 380–389. JMLR. org, 2017a.
 Bamler and Mandt [2017b] Robert Bamler and Stephan Mandt. Structured black box variational inference for latent time series models. arXiv preprint arXiv:1707.01069, 2017b.
 Bashir and Wei [2018] Faraj Bashir and HuaLiang Wei. Handling missing data in multivariate time series using a vector autoregressive modelimputation (varim) algorithm. Neurocomputing, 276:23–30, 2018.
 Blei and Lafferty [2006] David M Blei and John D Lafferty. Dynamic topic models. In Proceedings of the 23rd international conference on Machine learning, pages 113–120. ACM, 2006.
 Blei et al. [2017] David M Blei, Alp Kucukelbir, and Jon D McAuliffe. Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518):859–877, 2017.
 Casale et al. [2018] Francesco Paolo Casale, Adrian Dalca, Luca Saglietti, Jennifer Listgarten, and Nicolo Fusi. Gaussian process prior variational autoencoders. In Advances in Neural Information Processing Systems, pages 10369–10380, 2018.
 Dalca et al. [2019] Adrian V Dalca, John Guttag, and Mert R Sabuncu. Unsupervised data imputation via variational inference of deep subspaces. arXiv preprint arXiv:1903.03503, 2019.
 Fortuin and Rätsch [2019] Vincent Fortuin and Gunnar Rätsch. Deep mean functions for metalearning in gaussian processes. arXiv preprint arXiv:1901.08098, 2019.
 Fortuin et al. [2018a] Vincent Fortuin, Gideon Dresdner, Heiko Strathmann, and Gunnar Rätsch. Scalable gaussian processes on discrete domains. arXiv preprint arXiv:1810.10368, 2018a.
 Fortuin et al. [2018b] Vincent Fortuin, Matthias Hüser, Francesco Locatello, Heiko Strathmann, and Gunnar Rätsch. Somvae: Interpretable discrete representation learning on time series. arXiv preprint arXiv:1806.02199, 2018b.
 Gulrajani et al. [2017] Ishaan Gulrajani, Kundan Kumar, Faruk Ahmed, Adrien Ali Taiga, Francesco Visin, David Vazquez, and Aaron Courville. Pixelvae: A latent variable model for natural images. International Conference on Learning Representations, 2017.
 Honaker and King [2010] James Honaker and Gary King. What to do about missing values in timeseries crosssection data. American Journal of Political Science, 54(2):561–581, 2010.
 Huang and McColl [1997] Y Huang and WF McColl. Analytical inversion of general tridiagonal matrices. Journal of Physics A: Mathematical and General, 30(22):7919, 1997.
 Jähnichen et al. [2018] Patrick Jähnichen, Florian Wenzel, Marius Kloft, and Stephan Mandt. Scalable generalized dynamic topic models. Conference on Artificial Intelligence and Statistics, 2018.
 Jordan et al. [1999] Michael I Jordan, Zoubin Ghahramani, Tommi S Jaakkola, and Lawrence K Saul. An introduction to variational methods for graphical models. Machine learning, 37(2):183–233, 1999.
 Kingma and Ba [2015] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. International Conference on Learning Representations, 2015.
 Kingma and Welling [2014] Diederik P Kingma and Max Welling. Autoencoding variational bayes. International Conference on Learning Representations, 2014.
 Krishnan et al. [2015] Rahul G Krishnan, Uri Shalit, and David Sontag. Deep kalman filters. arXiv preprint arXiv:1511.05121, 2015.
 Krishnan et al. [2017] Rahul G Krishnan, Uri Shalit, and David Sontag. Structured inference networks for nonlinear state space models. In ThirtyFirst AAAI Conference on Artificial Intelligence, 2017.
 LeCun et al. [1998] Yann LeCun, Léon Bottou, Yoshua Bengio, Patrick Haffner, et al. Gradientbased learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
 Li et al. [2019] Steven ChengXian Li, Bo Jiang, and Benjamin Marlin. Misgan: Learning from incomplete data with generative adversarial networks. International Conference on Learning Representations, 2019.
 Li and Mandt [2018] Yingzhen Li and Stephan Mandt. Disentangled sequential autoencoder. International Conference on Machine Learning, 2018.
 Lipton and Tripathi [2017] Zachary C Lipton and Subarna Tripathi. Precise recovery of latent vectors from generative adversarial networks. International Conference on Learning Representations, 2017.
 Little and Rubin [2002] Roderick JA Little and Donald B Rubin. Single imputation methods. Statistical analysis with missing data, pages 59–74, 2002.
 Luo et al. [2018] Yonghong Luo, Xiangrui Cai, Ying Zhang, Jun Xu, et al. Multivariate time series imputation with generative adversarial networks. In Advances in Neural Information Processing Systems, pages 1596–1607, 2018.
 Ma et al. [2018] Chao Ma, Sebastian Tschiatschek, Konstantina Palla, Jose Miguel Hernandez Lobato, Sebastian Nowozin, and Cheng Zhang. Eddi: Efficient dynamic discovery of highvalue information with partial vae. International Conference on Machine Learning, 2018.
 Mallik [2001] Ranjan K Mallik. The inverse of a tridiagonal matrix. Linear Algebra and its Applications, 325(13):109–139, 2001.
 Nazabal et al. [2018] Alfredo Nazabal, Pablo M Olmos, Zoubin Ghahramani, and Isabel Valera. Handling incomplete heterogeneous data using vaes. arXiv preprint arXiv:1807.03653, 2018.
 Pedersen et al. [2017] Alma B Pedersen, Ellen M Mikkelsen, Deirdre CroninFenton, Nickolaj R Kristensen, Tra My Pham, Lars Pedersen, and Irene Petersen. Missing data and multiple imputation in clinical epidemiological research. Clinical epidemiology, 9:157, 2017.
 Rasmussen [2003] Carl Edward Rasmussen. Gaussian processes in machine learning. In Summer School on Machine Learning, pages 63–71. Springer, 2003.
 Rezende et al. [2014] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. International Conference on Machine Learning, 2014.
 Roberts et al. [2013] Stephen Roberts, Michael Osborne, Mark Ebden, Steven Reece, Neale Gibson, and Suzanne Aigrain. Gaussian processes for timeseries modelling. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 371(1984):20110550, 2013.
 Rubin [1976] Donald B Rubin. Inference and missing data. Biometrika, 63(3):581–592, 1976.
 Silva et al. [2012] Ikaro Silva, George Moody, Daniel J Scott, Leo A Celi, and Roger G Mark. Predicting inhospital mortality of icu patients: The physionet/computing in cardiology challenge 2012. In 2012 Computing in Cardiology, pages 245–248. IEEE, 2012.
 Wainwright and Jordan [2008] Martin J Wainwright and Michael Jordan. Graphical models, exponential families, and variational inference. Foundations and Trends® in Machine Learning, 1(1–2):1–305, 2008.
 Wilson et al. [2016] Andrew Gordon Wilson, Zhiting Hu, Ruslan Salakhutdinov, and Eric P Xing. Deep kernel learning. In Artificial Intelligence and Statistics, pages 370–378, 2016.
 Yoon et al. [2018] Jinsung Yoon, James Jordon, and Mihaela Van Der Schaar. Gain: Missing data imputation using generative adversarial nets. International Conference on Machine Learning, 2018.
 Zhang et al. [2018] Cheng Zhang, Judith Butepage, Hedvig Kjellstrom, and Stephan Mandt. Advances in variational inference. IEEE transactions on pattern analysis and machine intelligence, 2018.
Appendix
Appendix A Implementation details
a.1 Healing MNIST
Hyperparameter  Value 

Number of CNN layers in inference network  1 
Number of filters per CNN layer  256 
Filter size (i.e., time window size)  10 
Number of feedforward layers in inference network  2 
Width of feedforward layers  256 
Dimensionality of latent space  256 
Length scale of Cauchy kernel  1.0 
Number of feedforward layers in generative network  3 
Width of feedforward layers  256 
Activation function of all layers  ReLU 
Learning rate during training  0.0001 
Optimizer  Adam [17] 
Number of training epochs  20 
Train/val/test split of data set  50,000/10,000/10,000 
Dimensionality of time points  784 
Length of time series  10 
a.2 Sprites
Hyperparameter  Value 

Number of CNN layers in inference network  3 
Number of filters per CNN layer  1 
Filter size (i.e., time window size)  10 
Number of feedforward layers in inference network  2 
Width of feedforward layers  256 
Dimensionality of latent space  256 
Length scale of Cauchy kernel  1.0 
Number of feedforward layers in generative network  3 
Width of feedforward layers  256 
Activation function of all layers  ReLU 
Learning rate during training  0.0005 
Optimizer  Adam [17] 
Number of training epochs  30 
Train/val/test split of data set  8,000/1,000/1,000 
Dimensionality of time points  12288 
Length of time series  8 
a.3 Real medical time series data
Hyperparameter  Value 

Number of CNN layers in inference network  1 
Number of filters per CNN layer  32 
Filter size (i.e., time window size)  10 
Number of feedforward layers in inference network  1 
Width of feedforward layers  32 
Dimensionality of latent space  32 
Length scale of Cauchy kernel  1.0 
Number of feedforward layers in generative network  2 
Width of feedforward layers  32 
Activation function of all layers  ReLU 
Learning rate during training  0.0005 
Optimizer  Adam [17] 
Number of training epochs  20 
Train/val/test split of data set  4,000/4,000/4,000 
Dimensionality of time points  36 
Length of time series  48 