Multivariate Time Series Imputation with Variational Autoencoders

Vincent Fortuin (fortuin@inf.ethz.ch), Institute of Machine Learning, ETH Zürich, Zürich, Switzerland; Gunnar Rätsch, Institute of Machine Learning, ETH Zürich, Zürich, Switzerland; Stephan Mandt, Donald Bren School of Information and Computer Sciences, UC Irvine, California, USA
Abstract

Multivariate time series with missing values are common in many areas, for instance in healthcare and finance. To address this problem, modern data imputation approaches should (a) be tailored to sequential data, (b) deal with high-dimensional and complex data distributions, and (c) be based on the probabilistic modeling paradigm for interpretability and confidence assessment. However, many current approaches fall short in at least one of these aspects. Drawing on advances in deep learning and scalable probabilistic modeling, we propose a new deep sequential variational autoencoder approach for dimensionality reduction and data imputation. Temporal dependencies are modeled with a Gaussian process prior and a Cauchy kernel to reflect multi-scale dynamics in the latent space. We furthermore use a structured variational inference distribution that improves the scalability of the approach. We demonstrate that our model exhibits superior imputation performance on benchmark tasks and challenging real-world medical data.

1 Introduction

Time series are often associated with missing values, for instance due to faulty measurement devices, partially observed states, or costly measurement procedures [13]. These missing values impair the usefulness and interpretability of the data, leading to the problem of data imputation: estimating those missing values from the observed ones [34].

Multivariate time series, consisting of multiple correlated univariate time series or channels, give rise to two distinct ways of imputing missing information: (1) by exploiting temporal correlations within each channel, and (2) by exploiting correlations across channels, for example by using lower-dimensional representations of the data. For instance in a medical setting, if the blood pressure of a patient is unobserved, it can be informative that the heart rate at the current time is higher than normal and that the blood pressure was also elevated an hour ago. An ideal imputation model for multivariate time series should therefore take both of these sources of information into account. Another desirable property of such models is to offer a probabilistic interpretation, allowing for uncertainty estimation.

Unfortunately, current imputation approaches fall short with respect to at least one of these desiderata. While there are many time-tested statistical methods for multivariate time series analysis (e.g., Gaussian processes [33]) that work well in the case of complete data, these methods are generally not applicable when features are missing. On the other hand, classical methods for time series imputation often do not take the potentially complex interactions between the different channels into account [25, 30]. Finally, recent work has explored the use of non-linear dimensionality reduction using variational autoencoders for i.i.d. data points with missing values [29, 1, 27], but this work has not considered temporal data and strategies for sharing statistical strength across time.

Following these considerations, it is promising to combine non-linear dimensionality reduction with an expressive time series model. This can be done by jointly learning a mapping from the data space (where features are missing) into a latent space (where all dimensions are fully determined). A statistical model of choice can then be applied in this latent space to model temporal dynamics. If the dynamics model and the mapping for dimensionality reduction are both differentiable, the approach can be trained end-to-end.

In this paper, we propose an architecture that uses deep variational autoencoders (VAEs) to map the missing data time series into a latent space without missingness, where we model the low-dimensional dynamics with a Gaussian process (GP). As we will discuss below, we hereby propose a prior model that efficiently operates at multiple time scales, taking into account that the multivariate time series may have different channels (e.g., heart rate, blood pressure, etc.) that change with different characteristic frequencies. Finally, our variational inference approach makes use of efficient structured variational approximations, where we fit another multivariate Gaussian process in order to approximate the intractable true posterior.

We make the following contributions:

  • A new model. We propose a VAE architecture for multivariate time series imputation with a GP prior in the latent space to capture temporal dynamics. We propose a Cauchy kernel to allow the time series to display dynamics at multiple time scales in the reduced-dimensional latent space.

  • Efficient inference. We use a structured variational approximation that models posterior correlations in the time domain. By construction, inference is efficient and the time complexity for sampling from the variational distribution, used for training, is linear in the number of time steps (as opposed to cubic when done naïvely).

  • Benchmarking on real-world data. We carry out extensive comparisons to classical imputation methods as well as state-of-the-art deep learning approaches, and perform experiments on data from two different domains. Our method outperforms the baselines in both cases.

We start by reviewing the related literature in Sec. 2, describe the general setting in Sec. 3.1 and introduce our model and inference scheme in Sec. 3.2 and Sec. 3.3, respectively. Experiments and conclusions are presented in Sec. 4 and 5.

2 Related work

Classical statistical approaches.

The problem of missing values has been a long-standing challenge in many time series applications, especially in the field of medicine [30]. The earliest approaches to deal with this problem often relied on heuristics, such as mean imputation or forward imputation. Despite their simplicity, these methods are still widely applied today due to their efficiency and interpretability [13]. Orthogonal to these ideas, methods along the lines of expectation-maximization (EM) have been proposed, but they often require additional modeling assumptions [4].

Bayesian methods.

When it comes to estimating likelihoods and uncertainties relating to the imputations, Bayesian methods, such as Gaussian processes (GPs) [31], have a clear advantage over non-Bayesian methods such as single imputation [25]. There has been much recent work in making these methods more expressive and incorporating prior knowledge from the domain (e.g., medical time series) [37, 9] or adapting them to work on discrete domains [10], but their widespread adoption is hindered by their limited scalability and the challenges in designing kernels that are robust to missing values.

Deep learning techniques.

Another avenue of research in this area uses deep learning techniques, such as variational autoencoders (VAEs) [29, 1, 27, 8] or generative adversarial networks (GANs) [38, 22]. It should be noted that VAEs allow for tractable likelihoods, while GANs generally do not and have to rely on additional optimization processes to find latent representations of a given input [24]. Unfortunately, none of these models explicitly take the temporal dynamics of time series data into account. Conversely, there are deep probabilistic models for time series [e.g., 19, 20, 11], but those do not explicitly handle missing data.

HI-VAE.

Our approach borrows some ideas from the HI-VAE [29]. This model deals with missing data by defining an ELBO whose reconstruction error term only sums over the observed part of the data. For inference, the incomplete data are filled with arbitrary values (e.g., zeros) before they are fed into the inference network, which induces an unavoidable bias. The main difference to our approach is that the HI-VAE was not formulated for sequential data and therefore does not exploit temporal information in the imputation task.

Deep learning for time series imputation.

While the mentioned deep learning approaches are very promising, most of them do not take the time series nature of the data directly into account, that is, they do not model the temporal dynamics of the data when dealing with missing values. To the best of our knowledge, the only deep learning model for missing value imputation that does account for the time series nature of the data is the GRUI-GAN [26], which we describe in Sec. 4. We will show that our approach outperforms this baseline on our considered data sets.

Other related work.

Our proposed model combines several ideas from the domains of Bayesian deep learning and classical probabilistic modeling; thus, removing elements from our model naturally relates to other approaches. For example, removing the latent GP for modeling dynamics as well as our proposed structured variational distribution results in the HI-VAE [29] described above. Furthermore, our idea of using a latent GP in the context of a deep generative model bears similarities to the GPPVAE [7], but note that the GPPVAE was not proposed to model time series data and does not take missing values into account. Lastly, the GP prior with the Cauchy kernel is reminiscent of Jähnichen et al. [15] and the structured variational distribution is similar to the one used by Bamler and Mandt [3] in the context of modeling word embeddings over time. Neither of these two works, however, considered amortized inference or VAEs.

3 Model

(a) Architecture sketch

(b) Graphical model
Figure 1: Overview of our proposed model with a convolutional inference network, a deep feed-forward generative network, and a Gaussian process prior in the latent space. The CNN blocks share their parameters, as do the MLP blocks. The Gaussian process prior has mean function $m_z(\cdot)$ and kernel function $k_z(\cdot, \cdot)$.

We propose a novel architecture for missing value imputation, an overview of which is depicted in Figure 1. Our model can be seen as a way to perform amortized approximate inference on a latent Gaussian process model.

The main idea of our proposed approach is to embed the data into a latent space of reduced dimensionality, in which every dimension is fully determined, and then model the temporal dynamics in this latent space. Since many features in the data might be correlated, the latent representation captures these correlations and uses them to reconstruct the missing values. Moreover, the GP prior in the latent space encourages the model to embed the data into a representation in which the temporal dynamics are smoother and more easily explainable than in the original data space. Finally, the structured variational distribution of the inference network allows the model to integrate temporal information into the representations, such that the reconstructions of missing values can not only be informed by correlated observed features at the same time point, but also by the rest of the time series.

Specifically, we combine ideas from VAEs [18], GPs [31], Cauchy kernels [15], structured variational distributions with efficient inference [3], and a special ELBO for missing data [29] and synthesize these ideas into a general framework for missing data imputation on time series. In the following, we will outline the problem setting, describe the assumed generative model, and derive our proposed inference scheme.

3.1 Problem setting and notation

We assume a data set with data points $\mathbf{x}_t \in \mathbb{R}^d$. Let us assume that the data points were measured at consecutive time points $\tau_1 < \tau_2 < \dots < \tau_T$ with $\tau_t \in \mathbb{R}$. By convention, we usually set $\tau_1 = 0$. The data can thus be viewed as a time series $\mathbf{X} = [\mathbf{x}_1, \dots, \mathbf{x}_T]^\top \in \mathbb{R}^{T \times d}$ of length $\tau_T$ in time.

We moreover assume that any number of these data features can be missing, that is, that their values can be unknown. We can now partition each data point $\mathbf{x}_t$ into observed and unobserved features. The observed features of data point $\mathbf{x}_t$ are $\mathbf{x}_t^{obs} := [x_{tj} \mid x_{tj} \text{ is observed}]$. Equivalently, the missing features are $\mathbf{x}_t^{mis} := [x_{tj} \mid x_{tj} \text{ is missing}]$ with $\mathbf{x}_t^{obs} \cup \mathbf{x}_t^{mis} = \mathbf{x}_t$.

We can now use this partitioning to define the problem of missing value imputation. Missing value imputation describes the problem of estimating the true values of the missing features $\mathbf{x}_t^{mis}$ given the observed features $\mathbf{x}_t^{obs}$. Many methods assume the different data points to be independent, in which case the inference problem reduces to $T$ separate problems of estimating $p(\mathbf{x}_t^{mis} \mid \mathbf{x}_t^{obs})$. In the time series setting, this independence assumption is not satisfied, which leads to the more complex estimation problem of $p(\mathbf{x}_{1:T}^{mis} \mid \mathbf{x}_{1:T}^{obs})$.

3.2 Generative model

In this subsection, we describe the details of our proposed approach: reducing the observed time series with missing data into a latent representation without missingness, and modeling the dynamics in this lower-dimensional representation using Gaussian processes. It may seem tempting to skip the step of dimensionality reduction and instead model the incomplete data directly in the observed space using GPs. We argue that this is not practical for several reasons.

Gaussian processes are well suited for time series modeling [33] and offer many advantages, such as data-efficiency and calibrated posterior probabilities. However, they come at the cost of inverting the kernel matrix, which has a time complexity of $\mathcal{O}(T^3)$ in the number of time points. Moreover, designing a kernel function that accurately captures correlations in feature space and also in the temporal dimension is difficult.

This problem becomes even worse if certain observations are missing. One option is to fill the missing values with some numerical value (e.g., zero) to make the kernel computable. However, this arbitrary filling may make two data points with different missingness patterns look very dissimilar when in fact they are close to each other in the ground-truth space. Another alternative is to treat every channel of the multivariate time series separately and let the GP infer missing values, but this ignores valuable correlations across channels.

In this work, we overcome the problem of defining a suitable GP kernel in the data space with missing observations by instead applying the GP in the latent space of a variational autoencoder where the encoded feature representations are complete. That is, we assign a latent variable $\mathbf{z}_t \in \mathbb{R}^k$ for every $\mathbf{x}_t$, and model temporal correlations in this reduced representation using a GP, $\mathbf{z}(\tau) \sim \mathcal{GP}\big(m_z(\cdot),\, k_z(\cdot, \cdot)\big)$. This way, we decouple the step of filling in missing values and capturing instantaneous correlations between the different feature dimensions from modeling dynamical aspects. The graphical model is depicted in Figure 1.

A remaining practical difficulty that we encountered is that many multivariate time series display dynamics at multiple time scales. One of our main motivations is to model time series that arise in medical setups where doctors measure different patient variables and vital signs, such as heart rate, blood pressure, etc. When using conventional GP kernels (e.g., the RBF kernel, $k_{\mathrm{RBF}}(\tau, \tau') = \exp\!\left(-\frac{(\tau - \tau')^2}{2\sigma^2}\right)$), one implicitly assumes a single time scale of relevance ($\sigma$). We found that this choice does not reflect the dynamics of medical data very well.

In order to model data that varies at multiple time scales, we consider a mixture of RBF kernels with different $\sigma$'s [31]. By defining a Gamma distribution over the inverse squared length scale, that is, $l := \sigma^{-2} \sim \mathrm{Gamma}(\alpha, \beta)$, we can compute an infinite mixture of RBF kernels,

$$\int p(l;\, \alpha, \beta)\, \exp\!\left(-\frac{l\, (\tau - \tau')^2}{2}\right) \mathrm{d}l \;\propto\; \left(1 + \frac{(\tau - \tau')^2}{2\beta}\right)^{-\alpha}.$$

This yields the so-called Rational Quadratic kernel [31]. For $\alpha = 1$ and $\lambda := 2\beta$, it reduces to the Cauchy kernel

$$k_{\mathrm{Cau}}(\tau, \tau') = \left(1 + \frac{(\tau - \tau')^2}{\lambda}\right)^{-1}, \qquad (1)$$

which has previously been successfully used in the context of robust dynamic topic modeling where similar multi-scale time dynamics occur [15]. We therefore choose this kernel for our Gaussian process prior.
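To make the prior concrete, the following NumPy sketch evaluates the Cauchy kernel from (1) on a grid of time points and draws one latent channel from the resulting GP prior; the length-scale value and the added jitter are illustrative choices, not the settings used in our experiments.

```python
import numpy as np

def cauchy_kernel(taus, lam=4.0):
    """Cauchy kernel from Eq. (1): k(tau, tau') = 1 / (1 + (tau - tau')^2 / lambda),
    evaluated on a vector of time points `taus`. The value of `lam` is illustrative."""
    diff = taus[:, None] - taus[None, :]   # pairwise time differences
    return 1.0 / (1.0 + diff ** 2 / lam)

# Example: prior covariance over T = 10 equally spaced time steps and one prior draw
T = 10
K = cauchy_kernel(np.arange(T, dtype=float)) + 1e-6 * np.eye(T)   # jitter for stability
z_prior_sample = np.random.multivariate_normal(np.zeros(T), K)    # one latent channel
```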

Given the latent time series $\mathbf{z}_{1:T}$, the observations are generated time-point-wise by

$$p_\theta(\mathbf{x}_t \mid \mathbf{z}_t) = \mathcal{N}\!\big(\mathbf{x}_t \,;\, g_\theta(\mathbf{z}_t),\; \sigma_x^2 \mathbf{I}\big), \qquad (2)$$

where $g_\theta(\cdot)$ is a potentially nonlinear function parameterized by the parameter vector $\theta$. Considering the scenario of a medical time series, $\mathbf{z}_t$ can be thought of as the latent physiological state of the patient and $g_\theta(\cdot)$ would be the process of generating observable measurements (e.g., heart rate, blood pressure, etc.) from that physiological state. In our experiments, the function $g_\theta(\cdot)$ is implemented by a deep neural network.
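As an illustration of the observation model in (2), the following sketch generates one observation from a toy two-layer MLP decoder; the weights, layer sizes, and noise level are hypothetical placeholders rather than the architecture used in our experiments (see the appendix for the actual settings).

```python
import numpy as np

def generate_observation(z_t, params, sigma2_x=0.1, rng=None):
    """Minimal sketch of the observation model in (2): x_t ~ N(g_theta(z_t), sigma2_x * I),
    with g_theta a small two-layer MLP. All sizes and values here are illustrative."""
    rng = rng or np.random.default_rng(0)
    W1, b1, W2, b2 = params
    h = np.tanh(z_t @ W1 + b1)                  # hidden layer of g_theta
    mean_x = h @ W2 + b2                        # mean of the observation distribution
    return mean_x + np.sqrt(sigma2_x) * rng.standard_normal(mean_x.shape)

# Toy usage: latent dimensionality k = 4, observed dimensionality d = 6
rng = np.random.default_rng(0)
params = (rng.normal(size=(4, 16)), np.zeros(16), rng.normal(size=(16, 6)), np.zeros(6))
x_t = generate_observation(rng.normal(size=4), params, rng=rng)
```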

3.3 Inference model

In order to learn the parameters of the deep generative model described above, and in order to efficiently infer its latent state, we are interested in the posterior distribution $p(\mathbf{z}_{1:T} \mid \mathbf{x}_{1:T}^{obs})$. Since the exact posterior is intractable, we use variational inference [16, 6, 39]. Furthermore, to avoid inference over per-datapoint (local) variational parameters, we apply inference amortization [18]. To make our variational distribution more expressive and capture the temporal correlations of the data, we employ a structured variational distribution [36] with efficient inference that leads to an approximate posterior which is also a GP.

We approximate the true posterior with a multivariate Gaussian variational distribution

$$q\big(\mathbf{z}_{1:T}^{j} \mid \mathbf{x}_{1:T}^{obs}\big) = \mathcal{N}\big(\mathbf{m}_j,\; \mathbf{\Lambda}_j^{-1}\big), \qquad (3)$$

where $j$ indexes the dimensions in the latent space. Our approximation implies that our variational posterior is able to reflect correlations in time, but breaks dependencies across the different dimensions in $\mathbf{z}$-space (which is typical in VAE training [18, 32]).

We choose the variational family to be the family of multivariate Gaussian distributions in the time domain, where the precision matrix $\mathbf{\Lambda}_j$ is parameterized in terms of a product of bidiagonal matrices,

$$\mathbf{\Lambda}_j := \mathbf{B}_j^\top \mathbf{B}_j, \qquad \text{with} \qquad \{\mathbf{B}_j\}_{tt'} = \begin{cases} b_{tt'}^{j} & \text{if } t' \in \{t, t+1\}, \\ 0 & \text{otherwise,} \end{cases} \qquad (4)$$

Above, the $b_{tt'}^{j}$'s are local variational parameters and $\mathbf{B}_j$ is an upper triangular band matrix. Similar structured distributions were also employed by Blei and Lafferty [5] and Bamler and Mandt [2].

This parameterization automatically leads to a precision matrix $\mathbf{\Lambda}_j$ that is positive definite, symmetric, and tridiagonal. Samples from $q$ can thus be generated in linear time in $T$ [14, 28, 3], as opposed to the cubic time complexity for a full-rank precision matrix. Moreover, compared to a fully factorized variational approximation, the number of variational parameters is merely doubled. Note that while the precision matrix is sparse, the covariance matrix can still be dense, allowing the posterior to reflect long-range dependencies in time.
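The linear-time sampling argument can be illustrated with a short NumPy sketch: since $\mathbf{\Lambda}_j = \mathbf{B}_j^\top \mathbf{B}_j$ with $\mathbf{B}_j$ upper bidiagonal, a sample with covariance $\mathbf{\Lambda}_j^{-1}$ can be obtained by solving a bidiagonal system via back-substitution. The variable names and parameter values below are illustrative.

```python
import numpy as np

def sample_structured_q(m, b_diag, b_offdiag, rng):
    """Draw one sample from N(m, Lambda^{-1}) with Lambda = B^T B, where B is an
    upper bidiagonal matrix with main diagonal `b_diag` (length T) and first
    super-diagonal `b_offdiag` (length T-1). Solving B z' = eps by back-substitution
    costs O(T), and Cov(B^{-1} eps) = (B^T B)^{-1}."""
    T = len(m)
    eps = rng.standard_normal(T)
    z = np.empty(T)
    z[-1] = eps[-1] / b_diag[-1]                    # last row: b_TT * z_T = eps_T
    for t in range(T - 2, -1, -1):                  # back-substitution over the band
        z[t] = (eps[t] - b_offdiag[t] * z[t + 1]) / b_diag[t]
    return m + z                                    # shift by the variational mean

# Toy usage for one latent dimension with T = 5 time steps
rng = np.random.default_rng(0)
m = np.zeros(5)
b_diag = np.full(5, 2.0)        # in the model, these come from the inference network
b_offdiag = np.full(4, -0.5)
z_sample = sample_structured_q(m, b_diag, b_offdiag, rng)
```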

Instead of optimizing $\mathbf{m}_j$ and $\mathbf{B}_j$ separately for every data point, we amortize the inference through an inference network with parameters $\psi$ that computes the variational parameters based on the inputs $\mathbf{x}_{1:T}^{obs}$. In the following, we accordingly denote the variational distribution as $q_\psi(\mathbf{z}_{1:T} \mid \mathbf{x}_{1:T}^{obs})$. Following standard VAE training, the parameters $\theta$ of the generative model and $\psi$ of the inference network can be jointly trained by optimizing the evidence lower bound (ELBO),

$$\log p\big(\mathbf{x}_{1:T}^{obs}\big) \;\geq\; \mathbb{E}_{q_\psi(\mathbf{z}_{1:T} \mid \mathbf{x}_{1:T}^{obs})}\!\left[\sum_{t=1}^{T} \log p_\theta\big(\mathbf{x}_t^{obs} \mid \mathbf{z}_t\big)\right] \;-\; \mathrm{D}_{\mathrm{KL}}\!\left(q_\psi\big(\mathbf{z}_{1:T} \mid \mathbf{x}_{1:T}^{obs}\big) \,\big\|\, p\big(\mathbf{z}_{1:T}\big)\right) \qquad (5)$$

Following Nazabal et al. [29] (see Sec. 2), we evaluate the ELBO only on the observed features of the data since the remaining features are unknown, and set these missing features to a fixed value (zero) during inference. Our training objective is thus the RHS of (5).
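The following sketch illustrates how the reconstruction term of (5) can be restricted to observed features via a binary mask, assuming a Gaussian likelihood as in (2); the KL term is passed in as a precomputed scalar, and all names and values are illustrative placeholders.

```python
import numpy as np

def masked_gaussian_elbo(x, mask, x_mean, sigma2, kl_term):
    """Sketch of the ELBO in (5): the Gaussian reconstruction term is summed only
    over observed entries (mask == 1); missing entries do not contribute. `kl_term`
    stands for a precomputed D_KL(q || p) between posterior and GP prior."""
    log_lik = -0.5 * (np.log(2 * np.pi * sigma2) + (x - x_mean) ** 2 / sigma2)
    recon = np.sum(mask * log_lik)                 # only observed features contribute
    return recon - kl_term                         # ELBO = E_q[log p(x^obs | z)] - KL

# Toy usage: T = 3 time steps, d = 4 features, roughly half missing
rng = np.random.default_rng(1)
x = rng.normal(size=(3, 4))
mask = (rng.random((3, 4)) > 0.5).astype(float)    # 1 = observed, 0 = missing
x_filled = x * mask                                # missing features set to zero for the encoder
elbo = masked_gaussian_elbo(x, mask, x_mean=np.zeros((3, 4)), sigma2=1.0, kl_term=2.3)
```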

Neural network architectures.

We use a convolutional neural network (CNN) as an inference network and a fully connected multilayer perceptron (MLP) as a generative network. The inference network convolves over the time dimension of the input data and allows for sequences of variable lengths. It consists of a number of convolutional layers that integrate information from neighboring time steps into a joint representation using a fixed receptive field (see Figure 1). The CNN outputs a tensor of size $T \times 3k$, where $k$ is the dimensionality of the latent space. Every row corresponds to a time step and contains $3k$ parameters, which are used to predict the mean vector as well as the diagonal and off-diagonal elements that characterize $\mathbf{B}$ at the given time step. More details about the network structure are given in the appendix (Sec. A).
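As a sketch of how the CNN output could be mapped to the variational parameters of (3) and (4), the snippet below splits a $T \times 3k$ tensor into means, (positive) diagonal, and off-diagonal entries; the softplus transform and the exact split are illustrative assumptions rather than the precise parameterization of our implementation.

```python
import numpy as np

def split_variational_params(cnn_out):
    """Illustrative split of an inference network output of shape (T, 3k) into the
    per-time-step means and the diagonal / off-diagonal entries of the bidiagonal
    matrices B_j (cf. Eq. 4)."""
    T, three_k = cnn_out.shape
    k = three_k // 3
    mean = cnn_out[:, :k]                            # variational means, shape (T, k)
    b_diag = np.log1p(np.exp(cnn_out[:, k:2 * k]))   # softplus keeps the diagonal positive
    b_offdiag = cnn_out[:, 2 * k:]                   # unconstrained off-diagonal entries
    return mean, b_diag, b_offdiag[:-1]              # B has only T-1 super-diagonal entries

# Toy usage: T = 6 time steps, latent dimensionality k = 4
mean, b_diag, b_offdiag = split_variational_params(np.random.randn(6, 12))
```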

4 Experiments

We performed experiments on the benchmark data set Healing MNIST [19], which combines the classical MNIST data set [21] with properties common to medical time series, the SPRITES data set [23], and on a real-world medical data set from the 2012 Physionet Challenge [35]. We compared our model against conventional single imputation methods [25], GP-based imputation [31], VAE-based methods that are not specifically designed to handle temporal data [18, 29], and modern state-of-the-art deep learning methods for temporal data imputation [26]. We found strong quantitative and qualitative evidence that our proposed model outperforms the baseline methods in terms of imputation quality on all three tasks. In the following, we are first going to give an overview of the baseline methods and then present our experimental findings. Details about the data sets and neural network architectures can be found in the appendix (Sec. A).

4.1 Baseline methods

Forward imputation and mean imputation.

Forward and mean imputation are so-called single imputation methods, which means that they do not attempt to fit a distribution over possible values for the missing features, but only predict one estimate [25]. Forward imputation always predicts the last observed value for any given feature, while mean imputation predicts the mean of all the observations of the feature in a given time series.
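For concreteness, a minimal NumPy implementation of these two single imputation baselines could look as follows; the mask convention (1 = observed) is an illustrative choice.

```python
import numpy as np

def forward_impute(x, mask):
    """Forward imputation: carry the last observed value of each feature forward.
    `x` has shape (T, d); `mask` is 1 where a feature is observed. Entries with no
    previous observation are left as NaN."""
    out = np.where(mask.astype(bool), x, np.nan)
    for t in range(1, out.shape[0]):
        nan_idx = np.isnan(out[t])
        out[t, nan_idx] = out[t - 1, nan_idx]        # copy last observed (or propagated) value
    return out

def mean_impute(x, mask):
    """Mean imputation: replace each missing entry by the per-feature mean of the
    observed values in the same time series."""
    obs = np.where(mask.astype(bool), x, np.nan)
    feature_means = np.nanmean(obs, axis=0)          # mean over time, ignoring missing values
    return np.where(mask.astype(bool), x, feature_means)
```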

Gaussian process in data space.

One option to deal with missingness in multivariate time series is to fit independent Gaussian processes to each channel. As discussed previously (Sec. 3.2), this ignores the correlation between channels. The missing values are then imputed by taking the mean of the respective posterior of the GP for that feature.
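A per-channel GP imputation along these lines can be sketched as follows, here with the Cauchy kernel of (1) and illustrative hyperparameters; it conditions on the observed time points of a single channel and imputes the missing ones with the posterior mean, ignoring cross-channel correlations by construction.

```python
import numpy as np

def gp_impute_channel(taus, y, mask, lam=4.0, noise=0.1):
    """GP regression on one channel: condition on observed time points and impute
    missing ones with the posterior mean. Kernel and hyperparameters are illustrative."""
    def k(a, b):
        d = a[:, None] - b[None, :]
        return 1.0 / (1.0 + d ** 2 / lam)                 # Cauchy kernel, cf. Eq. (1)

    obs = mask.astype(bool)
    K_oo = k(taus[obs], taus[obs]) + noise * np.eye(obs.sum())
    K_mo = k(taus[~obs], taus[obs])
    post_mean = K_mo @ np.linalg.solve(K_oo, y[obs])      # GP posterior mean at missing points
    imputed = y.copy()
    imputed[~obs] = post_mean
    return imputed
```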

VAE and HI-VAE.

The VAE [18] and HI-VAE [29] are fit to the data using the same training procedure as the proposed GP-VAE model. The VAE uses a standard ELBO that is defined over all the features, while the HI-VAE uses the ELBO from (5), which is only evaluated on the observed part of the feature space. During inference, missing features are filled with constant values, such as zero.

GRUI-GAN.

The GRUI-GAN [26] uses a recurrent neural network (RNN), namely a gated recurrent unit (GRU). Once the network is trained, a time series is imputed by optimizing the latent vector in the input space of the generator, such that the generator’s output on the observed features is closest to the true values.

4.2 Healing MNIST

Time series with missing values play a crucial role in the medical field, but are often hard to obtain. Krishnan et al. [19] generated a data set called Healing MNIST, which is designed to reflect many properties that one also finds in real medical data. We benchmark our method on a variant of this data set, which consists of short sequences of moving MNIST digits [21] that rotate randomly between frames. The analogy to healthcare is that every frame may represent the collection of measurements that describe a patient’s health state, which contains many missing measurements at each moment in time. The temporal evolution represents the non-linear evolution of the patient’s health state. The image frames contain around 60 % missing pixels and the rotations between two consecutive frames are normally distributed.

The benefit of this data set is that we know the ground truth of the imputation task. We compare our model against a standard VAE (no latent GP and standard ELBO over all features), the HI-VAE [29], as well as mean imputation and forward imputation. The models were trained on time series of digits from the Healing MNIST training set (50,000 time series) and tested on digits from the Healing MNIST test set (10,000 time series). Negative log likelihoods on the ground truth values of the missing pixels and mean squared errors (MSE) are reported in Table 1, and qualitative results are shown in Figure 2. To assess the usefulness of the imputations for downstream tasks, we also trained a linear classifier on the imputed MNIST digits to predict the digit class and measured its performance in terms of area under the receiver-operator-characteristic curve (AUROC) (Tab. 1).

Model | Healing MNIST NLL | Healing MNIST MSE | Healing MNIST AUROC | SPRITES NLL | SPRITES MSE
Mean imputation [25] | - | 0.178 ± 0.000 | 0.787 | - | 0.013 ± 0.000
Forward imputation [25] | - | 0.174 ± 0.000 | 0.779 | - | 0.028 ± 0.000
VAE [18] | 0.4798 ± 0.0016 | 0.152 ± 0.001 | 0.738 | -0.3702 ± 0.0140 | 0.034 ± 0.000
HI-VAE [29] | 0.2896 ± 0.0010 | 0.088 ± 0.000 | 0.811 | -0.3309 ± 0.0126 | 0.035 ± 0.000
GP-VAE (proposed) | 0.2606 ± 0.0008 | 0.078 ± 0.000 | 0.826 | -1.9595 ± 0.0011 | 0.002 ± 0.000
Table 1: Performance of the different models on the Healing MNIST test set and the SPRITES test set in terms of negative log likelihood (NLL) and mean squared error (MSE) (lower is better), as well as downstream classification performance (AUROC, higher is better). The reported values are means and their respective standard errors over the test set. The proposed model outperforms all the baselines.
Figure 2: Reconstructions from Healing MNIST and SPRITES. The GP-VAE (proposed) is stable over time and yields the highest fidelity.

Model | AUROC
Mean imputation [25] | 0.502
Forward imputation [25] | 0.552
GP [31] | 0.576
VAE [18] | 0.588
HI-VAE [29] | 0.595
GRUI-GAN [26] | 0.595
GP-VAE (proposed) | 0.617
Table 2: Performance of the different models on the Physionet test set in terms of AUROC of a linear SVM trained on the imputed time series. We observe that the proposed model outperforms the competitors.

Our approach outperforms the baselines in terms of likelihood and MSE. The reconstructions (Fig. 2) reveal the benefits of the GP-VAE approach: related approaches yield unstable reconstructions over time, while our approach offers more stable reconstructions, using temporal information from neighboring frames. Moreover, our model also yields the most useful imputations for downstream classification in terms of AUROC. The downstream classification performance correlates well with the test likelihood on the ground truth data, supporting the intuition that it is a good proxy measure in cases where the ground truth likelihood is not available.

4.3 SPRITES data

To assess our model’s performance on more complex data, we applied it to the SPRITES data set, which has previously been used with sequential autoencoders [23]. The data set consists of 9,000 sequences of animated characters with different clothes, hair styles, and skin colors, performing different actions. Each frame has a size of 64 × 64 pixels (with three color channels) and each time series features 8 frames. We again introduced about 60 % of missing pixels and compared the same methods as above. The results are reported in Table 1 and example reconstructions are shown in Figure 2. As in the previous experiment, our model outperforms the baselines in terms of likelihood and MSE and also yields the most convincing reconstructions. The HI-VAE seems to suffer from posterior collapse in this setting, which might be due to the large dimensionality of the input data.

4.4 Real medical time series data

We also applied our model to the data set from the 2012 Physionet Challenge [35]. The data set contains 12,000 patients who were monitored on the intensive care unit (ICU) for 48 hours each. At each hour, there is a measurement of 36 different variables (heart rate, blood pressure, etc.), any number of which might be missing. We again compare our model against the standard VAE and HI-VAE, as well as a GP fit feature-wise in the data space and the GRUI-GAN model [26], which reported state-of-the-art imputation performance.

The main challenge is the absence of ground truth data for the missing values. This cannot easily be circumvented by introducing additional missingness since (1) the mechanism by which measurements were omitted is not random, and (2) the data set is already very sparse with about 90 % of the features missing. To overcome this issue, Luo et al. [26] proposed a downstream task as a proxy for the imputation quality. They chose the task of mortality prediction, which was one of the main tasks of the Physionet Challenge on this data set, and measured the performance in terms of AUROC. In this paper, we adopt this measure.

For the sake of interpretability, we used a linear support vector machine (SVM) as a downstream classification model. This model tries to optimally separate the whole time series in the input space using a linear hyperplane. The choice of model follows the intuition that under a perfect imputation similar patients should be located close to each other in the input space, while that is not necessarily the case when features are missing, or when the imputation is poor. Note that it is unrealistic to ask for high accuracies in this task, as the clean data are unlikely to be perfectly separable. As seen in Table 1, this proxy measure correlates well with the ground truth likelihood.
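A sketch of this evaluation protocol is shown below, using scikit-learn's linear SVM and AUROC metric on flattened, imputed time series; the specific hyperparameters and the use of scikit-learn are illustrative assumptions rather than the exact original setup.

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.metrics import roc_auc_score

def downstream_auroc(x_train, y_train, x_test, y_test):
    """Train a linear SVM on flattened (T, d) imputed time series and score it with
    AUROC on held-out patients. Hyperparameters here are illustrative."""
    clf = LinearSVC(C=1.0, max_iter=10000)
    clf.fit(x_train.reshape(len(x_train), -1), y_train)       # flatten each time series
    scores = clf.decision_function(x_test.reshape(len(x_test), -1))
    return roc_auc_score(y_test, scores)
```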

The performances of the different methods under this measure are reported in Table 2. Our model outperforms all baselines, including the GRUI-GAN, which provides strong evidence that our model is well suited for real-world medical time series imputations.

5 Conclusion

We presented a deep probabilistic model for multivariate time series imputation, where we combined ideas from variational autoencoders and Gaussian processes. The VAE maps the missing data from the input space into a latent space where every dimension is completely determined. The GP then models the temporal dynamics in this latent space. To flexibly model dynamics on different time scales, we use a Cauchy kernel for the latent GP prior. Moreover, we use structured variational inference to approximate the latent GP posterior, which reflects the temporal correlations of the data more accurately than a fully factorized approximation. At the same time, inference in our variational distribution is still efficient, as opposed to inference in the full GP posterior.

We empirically validated our proposed model on Healing MNIST benchmark data, SPRITES data, and real-world medical time series data from the 2012 Physionet Challenge. We observe that our model outperforms classical baselines as well as modern deep learning approaches on these tasks. This suggests that the model can successfully learn the temporal dynamics of real-world processes, even under high missingness rates.

In future work, it would be interesting to assess the applicability of the model to other data domains (e.g., natural videos) and explore a larger variety of kernel choices (e.g., learned kernels) for the latent GP. Moreover, it could be a fruitful avenue of research to choose more sophisticated neural network architectures for the inference model and the generative model. The inference network could for instance use an architecture that factorizes across features [27] or groups of features [1], in order to handle missing values even more flexibly. The generative network, on the other hand, could be extended with an autoregressive model [12], in order to improve the coherence of the output time series even further.

References

  • Ainsworth et al. [2018] Samuel K Ainsworth, Nicholas J Foti, and Emily B Fox. Disentangled VAE representations for multi-aspect and missing data. arXiv preprint arXiv:1806.09060, 2018.
  • Bamler and Mandt [2017a] Robert Bamler and Stephan Mandt. Dynamic word embeddings. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 380–389. JMLR.org, 2017a.
  • Bamler and Mandt [2017b] Robert Bamler and Stephan Mandt. Structured black box variational inference for latent time series models. arXiv preprint arXiv:1707.01069, 2017b.
  • Bashir and Wei [2018] Faraj Bashir and Hua-Liang Wei. Handling missing data in multivariate time series using a vector autoregressive model-imputation (var-im) algorithm. Neurocomputing, 276:23–30, 2018.
  • Blei and Lafferty [2006] David M Blei and John D Lafferty. Dynamic topic models. In Proceedings of the 23rd international conference on Machine learning, pages 113–120. ACM, 2006.
  • Blei et al. [2017] David M Blei, Alp Kucukelbir, and Jon D McAuliffe. Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518):859–877, 2017.
  • Casale et al. [2018] Francesco Paolo Casale, Adrian Dalca, Luca Saglietti, Jennifer Listgarten, and Nicolo Fusi. Gaussian process prior variational autoencoders. In Advances in Neural Information Processing Systems, pages 10369–10380, 2018.
  • Dalca et al. [2019] Adrian V Dalca, John Guttag, and Mert R Sabuncu. Unsupervised data imputation via variational inference of deep subspaces. arXiv preprint arXiv:1903.03503, 2019.
  • Fortuin and Rätsch [2019] Vincent Fortuin and Gunnar Rätsch. Deep mean functions for meta-learning in Gaussian processes. arXiv preprint arXiv:1901.08098, 2019.
  • Fortuin et al. [2018a] Vincent Fortuin, Gideon Dresdner, Heiko Strathmann, and Gunnar Rätsch. Scalable Gaussian processes on discrete domains. arXiv preprint arXiv:1810.10368, 2018a.
  • Fortuin et al. [2018b] Vincent Fortuin, Matthias Hüser, Francesco Locatello, Heiko Strathmann, and Gunnar Rätsch. SOM-VAE: Interpretable discrete representation learning on time series. arXiv preprint arXiv:1806.02199, 2018b.
  • Gulrajani et al. [2017] Ishaan Gulrajani, Kundan Kumar, Faruk Ahmed, Adrien Ali Taiga, Francesco Visin, David Vazquez, and Aaron Courville. PixelVAE: A latent variable model for natural images. International Conference on Learning Representations, 2017.
  • Honaker and King [2010] James Honaker and Gary King. What to do about missing values in time-series cross-section data. American Journal of Political Science, 54(2):561–581, 2010.
  • Huang and McColl [1997] Y Huang and WF McColl. Analytical inversion of general tridiagonal matrices. Journal of Physics A: Mathematical and General, 30(22):7919, 1997.
  • Jähnichen et al. [2018] Patrick Jähnichen, Florian Wenzel, Marius Kloft, and Stephan Mandt. Scalable generalized dynamic topic models. Conference on Artificial Intelligence and Statistics, 2018.
  • Jordan et al. [1999] Michael I Jordan, Zoubin Ghahramani, Tommi S Jaakkola, and Lawrence K Saul. An introduction to variational methods for graphical models. Machine learning, 37(2):183–233, 1999.
  • Kingma and Ba [2015] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. International Conference on Learning Representations, 2015.
  • Kingma and Welling [2014] Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. International Conference on Learning Representations, 2014.
  • Krishnan et al. [2015] Rahul G Krishnan, Uri Shalit, and David Sontag. Deep kalman filters. arXiv preprint arXiv:1511.05121, 2015.
  • Krishnan et al. [2017] Rahul G Krishnan, Uri Shalit, and David Sontag. Structured inference networks for nonlinear state space models. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.
  • LeCun et al. [1998] Yann LeCun, Léon Bottou, Yoshua Bengio, Patrick Haffner, et al. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
  • Li et al. [2019] Steven Cheng-Xian Li, Bo Jiang, and Benjamin Marlin. MisGAN: Learning from incomplete data with generative adversarial networks. International Conference on Learning Representations, 2019.
  • Li and Mandt [2018] Yingzhen Li and Stephan Mandt. Disentangled sequential autoencoder. International Conference on Machine Learning, 2018.
  • Lipton and Tripathi [2017] Zachary C Lipton and Subarna Tripathi. Precise recovery of latent vectors from generative adversarial networks. International Conference on Learning Representations, 2017.
  • Little and Rubin [2002] Roderick JA Little and Donald B Rubin. Single imputation methods. Statistical analysis with missing data, pages 59–74, 2002.
  • Luo et al. [2018] Yonghong Luo, Xiangrui Cai, Ying Zhang, Jun Xu, et al. Multivariate time series imputation with generative adversarial networks. In Advances in Neural Information Processing Systems, pages 1596–1607, 2018.
  • Ma et al. [2018] Chao Ma, Sebastian Tschiatschek, Konstantina Palla, Jose Miguel Hernandez Lobato, Sebastian Nowozin, and Cheng Zhang. EDDI: Efficient dynamic discovery of high-value information with partial VAE. International Conference on Machine Learning, 2018.
  • Mallik [2001] Ranjan K Mallik. The inverse of a tridiagonal matrix. Linear Algebra and its Applications, 325(1-3):109–139, 2001.
  • Nazabal et al. [2018] Alfredo Nazabal, Pablo M Olmos, Zoubin Ghahramani, and Isabel Valera. Handling incomplete heterogeneous data using VAEs. arXiv preprint arXiv:1807.03653, 2018.
  • Pedersen et al. [2017] Alma B Pedersen, Ellen M Mikkelsen, Deirdre Cronin-Fenton, Nickolaj R Kristensen, Tra My Pham, Lars Pedersen, and Irene Petersen. Missing data and multiple imputation in clinical epidemiological research. Clinical epidemiology, 9:157, 2017.
  • Rasmussen [2003] Carl Edward Rasmussen. Gaussian processes in machine learning. In Summer School on Machine Learning, pages 63–71. Springer, 2003.
  • Rezende et al. [2014] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. International Conference on Machine Learning, 2014.
  • Roberts et al. [2013] Stephen Roberts, Michael Osborne, Mark Ebden, Steven Reece, Neale Gibson, and Suzanne Aigrain. Gaussian processes for time-series modelling. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 371(1984):20110550, 2013.
  • Rubin [1976] Donald B Rubin. Inference and missing data. Biometrika, 63(3):581–592, 1976.
  • Silva et al. [2012] Ikaro Silva, George Moody, Daniel J Scott, Leo A Celi, and Roger G Mark. Predicting in-hospital mortality of ICU patients: The PhysioNet/Computing in Cardiology Challenge 2012. In 2012 Computing in Cardiology, pages 245–248. IEEE, 2012.
  • Wainwright and Jordan [2008] Martin J Wainwright and Michael Jordan. Graphical models, exponential families, and variational inference. Foundations and Trends® in Machine Learning, 1(1–2):1–305, 2008.
  • Wilson et al. [2016] Andrew Gordon Wilson, Zhiting Hu, Ruslan Salakhutdinov, and Eric P Xing. Deep kernel learning. In Artificial Intelligence and Statistics, pages 370–378, 2016.
  • Yoon et al. [2018] Jinsung Yoon, James Jordon, and Mihaela Van Der Schaar. GAIN: Missing data imputation using generative adversarial nets. International Conference on Machine Learning, 2018.
  • Zhang et al. [2018] Cheng Zhang, Judith Butepage, Hedvig Kjellstrom, and Stephan Mandt. Advances in variational inference. IEEE transactions on pattern analysis and machine intelligence, 2018.

Appendix

Appendix A Implementation details

A.1 Healing MNIST

Hyperparameter | Value
Number of CNN layers in inference network | 1
Number of filters per CNN layer | 256
Filter size (i.e., time window size) | 10
Number of feedforward layers in inference network | 2
Width of feedforward layers | 256
Dimensionality of latent space | 256
Length scale of Cauchy kernel | 1.0
Number of feedforward layers in generative network | 3
Width of feedforward layers | 256
Activation function of all layers | ReLU
Learning rate during training | 0.0001
Optimizer | Adam [17]
Number of training epochs | 20
Train/val/test split of data set | 50,000/10,000/10,000
Dimensionality of time points | 784
Length of time series | 10
Table S1: Hyperparameters used in the VAE, HI-VAE and GP-VAE models for the experiment on Healing MNIST. Some of the parameters are only relevant in a subset of the models.

A.2 SPRITES

Hyperparameter | Value
Number of CNN layers in inference network | 3
Number of filters per CNN layer | 1
Filter size (i.e., time window size) | 10
Number of feedforward layers in inference network | 2
Width of feedforward layers | 256
Dimensionality of latent space | 256
Length scale of Cauchy kernel | 1.0
Number of feedforward layers in generative network | 3
Width of feedforward layers | 256
Activation function of all layers | ReLU
Learning rate during training | 0.0005
Optimizer | Adam [17]
Number of training epochs | 30
Train/val/test split of data set | 8,000/1,000/1,000
Dimensionality of time points | 12288
Length of time series | 8
Table S2: Hyperparameters used in the VAE, HI-VAE and GP-VAE models for the experiment on SPRITES. Some of the parameters are only relevant in a subset of the models.

A.3 Real medical time series data

Hyperparameter | Value
Number of CNN layers in inference network | 1
Number of filters per CNN layer | 32
Filter size (i.e., time window size) | 10
Number of feedforward layers in inference network | 1
Width of feedforward layers | 32
Dimensionality of latent space | 32
Length scale of Cauchy kernel | 1.0
Number of feedforward layers in generative network | 2
Width of feedforward layers | 32
Activation function of all layers | ReLU
Learning rate during training | 0.0005
Optimizer | Adam [17]
Number of training epochs | 20
Train/val/test split of data set | 4,000/4,000/4,000
Dimensionality of time points | 36
Length of time series | 48
Table S3: Hyperparameters used in the VAE, HI-VAE and GP-VAE models for the experiment on medical time series from the Physionet data set. Some of the parameters are only relevant in a subset of the models.