Deep Self-Organization: Interpretable Discrete Representation Learning on Time Series
Abstract
Human professionals are often required to make decisions based on complex multivariate time series measurements in an online setting, e.g. in health care. Since human cognition is not optimized to work well in high-dimensional spaces, these decisions benefit from interpretable low-dimensional representations. However, many representation learning algorithms for time series data are difficult to interpret. This is due to non-intuitive mappings from data features to salient properties of the representation and non-smoothness over time.
To address this problem, we propose to couple a variational autoencoder to a discrete latent space and introduce a topological structure through the use of self-organizing maps. This allows us to learn discrete representations of time series, which give rise to smooth and interpretable embeddings with superior clustering performance. Furthermore, to allow for a probabilistic interpretation of our method, we integrate a Markov model in the latent space. This model uncovers the temporal transition structure, improves clustering performance even further and provides additional explanatory insights as well as a natural representation of uncertainty.
We evaluate our model on static (Fashion-)MNIST data, a time series of linearly interpolated (Fashion-)MNIST images, a chaotic Lorenz attractor system with two macro states, as well as on a challenging real-world medical time series application. In the latter experiment, our representation uncovers meaningful structure in the acute physiological state of a patient.
1 Introduction
Interpretable representation learning of time series is a seminal problem for uncovering the latent structure in complex physical systems, such as chaotic dynamical systems or medical time series from the intensive care unit. Clustering is a classical method for unsupervised representation learning, and is a natural solution for this problem at first sight. However, clustering used in the naïve sense makes misleading i.i.d. assumptions about the data, neglecting its rich temporal structure and smooth behaviour over time. Another concern for interpretability is the lack of topological structure in the discrete embedding space.
Deep neural networks have a very successful tradition in representation learning [5]. In recent years, they have increasingly been combined with generative modeling through the advent of generative adversarial networks (GANs) [12] and variational autoencoders (VAEs) [19]. However, the representations learned by these models can sometimes be cryptic and do not offer the necessary interpretability. This is why a lot of work has been done to improve them in this regard, in GANs [6] as well as VAEs [15, 9].
In areas where humans have to make decisions based on large amounts of data, interpretability is fundamental to ease the human task. When decisions have to be made in a timely manner and rely on observing some external process over time, such as finance or medicine, the need for intuitive interpretations is even stronger. Clinicians, for instance, might need to visualize and manually process the trajectory of a patient in order to decide on their treatment [17]. They would benefit heavily from a representation learning model, in which those patient trajectories become intuitively understandable and relevant health features are salient.
From this point of view, an interesting development in the field of VAEs is the discretization of the latent space, as implemented in the vector quantized VAE (VQ-VAE) [35]. The VQ-VAE encodes a data point into a continuous latent space and then maps it to the closest of a set of prototype latent vectors, from which the decoder has to reconstruct the original input. The prototype vector is then pulled in the direction of the latent encoding. This algorithm leads to a vector quantization of the latent space and yields an interpretable clustering of latent representations.
Clinicians in the intensive care unit (ICU) are already used to classifying patients into a finite set of discrete categories (e.g. using diagnosis codes). Learned discrete health states may thus fit their intuition of a patient’s development. This advantage may be even more pronounced, if one introduces a neighborhood relationship between the latent states, such as in a self-organizing map (SOM) [22]. The discrete latent space of the SOM can then be used to visualize dynamic health trajectories of the patients, since it provides more structure [33]. Lastly, modeling uncertainty is equally important in this scenario, as it yields a distribution over the likely future states of the patient. It is therefore potentially fruitful to combine the ideas of probabilistic modeling and representation learning in an end-to-end model for health trajectories.
In order to demonstrate this idea, we propose a new VQ-VAE-like architecture that is coupled to a SOM in the latent space, train it on synthetic and real-world time series and examine the arising representations. To test the basic idea of modeling an evolving discrete latent space, we use benchmark data sets and a synthetic dynamical system with chaotic behavior, before we evaluate the model on real-world medical data.
Our main contributions are to

Devise a novel framework for discrete representation learning based on the vector quantized autoencoder and self-organizing maps that achieves superior clustering performance on benchmark tasks.

Incorporate a latent probabilistic model into the representation learning architecture and show that it further improves clustering and interpretability of the representations.

Show superior performance of the full model on benchmark tasks and a challenging real-world medical data set.
2 Related Work
From the early inception of the k-means algorithm for clustering [25], there has been much methodological improvement, including methods that perform clustering in the latent space of (variational) autoencoders [2] or use a mixture of autoencoders for the clustering [38, 26]. The method most related to our work is the VQ-VAE [35], which also forms the basis for our architecture. Its authors have put a stronger focus on the discrete representation as a form of compression instead of clustering. Hence, we have made a few changes in adapting their model (see Sec. 3.1). All these methods have in common that they only yield a single number as a cluster assignment and provide no interpretable structure of relationships between clusters.
An algorithm that provides such an interpretable structure is the self-organizing map (SOM) [22]. It maps the data manifold to a lower-dimensional discrete space, which can be easily visualized in the 2D case. It has been extended to model dynamical systems [4] and combined with probabilistic models for time series [31], although without using learned representations. While there are approaches to turn the SOM into a “deeper” model [8], combine it with multilayer perceptrons [10] or with metric learning [30], it has (to the best of our knowledge) not been proposed to use SOMs in the latent space of (variational) autoencoders or any other form of unsupervised deep learning model.
Interpretable models for clustering and temporal predictions are especially crucial in fields where humans have to take responsibility for the model’s predictions, such as in health care. The prediction of a patient’s future state is an important problem, particularly in the intensive care unit (ICU) [14, 3]. Probabilistic models, such as Gaussian processes, are a popular choice in this domain [7, 32]. Recently, deep generative models have been proposed [16], sometimes even in combination with probabilistic modeling [24]. To the best of our knowledge, SOMs have only been used to learn interpretable static representations of patients [33], but not dynamic ones.
3 Probabilistic SOM-VAE
Our proposed model is a combination of a self-organizing map [22], a vector quantized variational autoencoder [35] and a Markov model. In the following, we review the SOM and VQ-VAE in order to introduce the general concepts and the notation.
A self-organizing map consists of $k$ nodes $V = \{v_1, \dots, v_k\}$, where every node corresponds to an embedding in the data space $e_v \in \mathbb{R}^d$ and a representation in a lower-dimensional discrete space $m_v \in M$, where usually $\dim(M) \ll d$. During training on a data set $\mathcal{D} = \{x_1, \dots, x_n\}$, a winner node $\tilde{v}$ is chosen for every point $x_i$ according to $\tilde{v} = \arg\min_{v \in V} \|e_v - x_i\|^2$. The embedding vector for every node $u \in V$ is then updated according to $e_u \leftarrow e_u + N(m_u, m_{\tilde{v}}) \, \eta \, (x_i - e_u)$, where $\eta$ is the learning rate and $N(m_u, m_{\tilde{v}})$ is a neighborhood function between the nodes defined on the representation space $M$. There can be different design choices for $N(m_u, m_{\tilde{v}})$.
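The SOM training step described above can be sketched in a few lines. The following minimal NumPy illustration uses a Gaussian kernel as one possible design choice for the neighborhood function; the function name, the 2-D grid layout and the specific kernel are our assumptions for illustration, not a prescribed implementation:

```python
import numpy as np

def som_step(embeddings, x, lr=0.1, sigma=1.0):
    """One training step of a classical 2-D SOM on a single point x.

    embeddings: array of shape (H, W, d), one embedding vector per map node.
    Returns the (row, col) index of the winner node.
    """
    H, W, _ = embeddings.shape
    # Winner node: the map node whose embedding is closest to x in data space.
    dists = np.linalg.norm(embeddings - x, axis=-1)
    wi, wj = np.unravel_index(np.argmin(dists), (H, W))
    # Neighborhood function on the discrete map coordinates; a Gaussian
    # kernel over the grid distance to the winner is one common choice.
    ii, jj = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    neighborhood = np.exp(-((ii - wi) ** 2 + (jj - wj) ** 2) / (2 * sigma ** 2))
    # Pull every embedding toward x, weighted by its map distance to the winner.
    embeddings += lr * neighborhood[..., None] * (x - embeddings)
    return (wi, wj)
```

Each step moves the winner's embedding (and, to a lesser degree, its map neighbors') toward the data point, which is what induces the topological ordering of the map.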
The VQ-VAE is a variational autoencoder which discretizes its latent space with an objective function that induces a vector quantization. The encoder neural network maps an input $x$ to a latent encoding $z_e(x) \in \mathbb{R}^m$ (usually $m < d$), which is then assigned to an embedding $z_q(x)$ in the set of embeddings $E = \{e_1, \dots, e_K\}$ according to $z_q(x) = e_k$ with $k = \arg\min_j \|z_e(x) - e_j\|_2$. The decoder neural network reconstructs the input from $z_q(x)$, which can therefore be seen as a discrete representation of the input. The objective function is defined as

$\mathcal{L}_{\text{VQ-VAE}}(x, \hat{x}) = \|x - \hat{x}\|_2^2 + \|\mathrm{sg}[z_e(x)] - z_q(x)\|_2^2 + \beta \, \|z_e(x) - \mathrm{sg}[z_q(x)]\|_2^2 \qquad (1)$

where $\hat{x}$ is the reconstruction of $x$ from decoding $z_q(x)$, $\mathrm{sg}[\cdot]$ is an operator stopping the gradient flow, and $\beta$ is a tuning parameter. The first term in this loss function is the reconstruction error, while the second and third term train the embeddings and encoder, respectively. The authors call the third term “commitment loss” and claim that the optimization is robust to a large range of $\beta$ values, including $\beta = 1$, such that the last two terms could potentially be written as one.
The main challenge in optimizing the VQ-VAE and related architectures is the non-differentiability of the discrete cluster assignment step. Due to this, the gradients from the reconstruction loss cannot flow back into the encoder. In order to mitigate this, the authors of the VQ-VAE propose to copy the gradients from $z_q(x)$ to $z_e(x)$. They acknowledge that this is an ad hoc approximation, but observed that it works well in their experiments [35].
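For concreteness, the quantization step and the loss of Eq. 1 can be written out as plain forward computations. This sketch only evaluates the scalar loss value; the stop-gradient operator $\mathrm{sg}[\cdot]$ matters solely for backpropagation and is therefore only indicated in comments (function and variable names are illustrative):

```python
import numpy as np

def vq_vae_loss(x, x_hat, z_e, codebook, beta=0.25):
    """Scalar value of the VQ-VAE objective (Eq. 1).

    z_e: encoder output of shape (m,).  codebook: embeddings of shape (K, m).
    Returns the loss value and the index of the assigned codebook vector.
    """
    # Discrete assignment: z_q = e_k with k = argmin_j ||z_e - e_j||.
    k = int(np.argmin(np.linalg.norm(codebook - z_e, axis=1)))
    z_q = codebook[k]
    reconstruction = np.sum((x - x_hat) ** 2)
    # In the real model sg[.] stops gradients: the codebook term only trains
    # the embedding e_k, the commitment term only trains the encoder.
    codebook_term = np.sum((z_e - z_q) ** 2)  # ||sg[z_e(x)] - e||^2
    commitment = np.sum((z_e - z_q) ** 2)     # ||z_e(x) - sg[e]||^2
    return reconstruction + codebook_term + beta * commitment, k
```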
3.1 Self-organizing map variational autoencoder
By connecting the embeddings in the VQ-VAE model in the form of a SOM, we can induce an interpretable neighborhood relation on them. A schematic overview of our proposed model is depicted in Figure 1. We choose to use a two-dimensional map because it facilitates visualization similar to [33]. We implement it in a way such that any time an embedding $e_{i,j}$ at position $(i, j)$ in the map gets updated, it also updates all the embeddings in its immediate neighborhood, which is defined as $N(e_{i,j}) = \{e_{i-1,j}, e_{i+1,j}, e_{i,j-1}, e_{i,j+1}\}$ for a two-dimensional map.
The loss function for a single input $x$ looks like

$\mathcal{L}_{\text{SOM-VAE}}(x, \hat{x}_q, \hat{x}_e) = \mathcal{L}_{\text{reconstruction}}(x, \hat{x}_q, \hat{x}_e) + \alpha \, \mathcal{L}_{\text{commitment}}(x) + \beta \, \mathcal{L}_{\text{SOM}}(x) \qquad (2)$

where $\hat{x}_q$ is the reconstruction of $x$ from decoding $z_q(x)$, $\hat{x}_e$ is the reconstruction of $x$ from decoding $z_e(x)$, and $\alpha$ and $\beta$ are weighting hyperparameters.
Every term in this function is specifically designed to optimize a different model component. The first term is the reconstruction loss $\mathcal{L}_{\text{reconstruction}}(x, \hat{x}_q, \hat{x}_e) = \|x - \hat{x}_q\|_2^2 + \|x - \hat{x}_e\|_2^2$. The first subterm $\|x - \hat{x}_q\|_2^2$ is the discrete reconstruction loss as used in [35] (see Eq. 1), which encourages the assigned SOM node $z_q(x)$ to be an informative representation of the input.
Due to our smaller number of embeddings compared to the original VQ-VAE paper [35], the average distance between an encoding $z_e(x)$ and its closest embedding $z_q(x)$ is much larger in our case. The gradient copying (see above) thus ceases to be a feasible approximation, because the true gradients at points in the latent space which are farther apart will likely be very different. In order to still overcome the non-differentiability issue, we propose to add the second reconstruction subterm $\|x - \hat{x}_e\|_2^2$ to $\mathcal{L}_{\text{reconstruction}}$, where the reconstruction $\hat{x}_e$ is decoded directly from the encoding $z_e(x)$. This adds a fully differentiable credit assignment path from the loss to the encoder and encourages $z_e(x)$ to also be an informative representation of the input, which is a desirable model feature. Most importantly, it works well in practice (see Sec. 4.1). Note that since $z_e(x)$ is continuous and therefore much less constrained than $z_q(x)$, this term is optimized easily and becomes small early in training. After that, mostly the discrete subterm $\|x - \hat{x}_q\|_2^2$ contributes to $\mathcal{L}_{\text{reconstruction}}$. One could therefore view the $\|x - \hat{x}_e\|_2^2$ term as an initial encouragement to place the data encodings at sensible positions in the latent space, after which the actual clustering task dominates the training objective.
The term $\mathcal{L}_{\text{commitment}}$ encourages the encodings and assigned SOM nodes to be close to each other and is defined as $\mathcal{L}_{\text{commitment}}(x) = \|z_e(x) - z_q(x)\|_2^2$. Closeness of encodings and embeddings should be expected to already follow from the $\mathcal{L}_{\text{reconstruction}}$ term in a fully differentiable architecture. However, due to the non-differentiability in our model, the $\mathcal{L}_{\text{commitment}}$ term has to be explicitly added to the objective in order for the encoder to get gradient information about $z_q(x)$. The term is very similar to the commitment loss in the VQ-VAE (see Eq. 1), except that we do not split it into two terms with gradient stopping. We observe that in our model it works just as well this way.
The SOM loss $\mathcal{L}_{\text{SOM}}$ encourages the neighbors of the assigned SOM node $z_q(x)$ to also be close to $z_e(x)$, thus enabling the embeddings to exhibit a self-organizing map property, while stopping the gradients on $z_e(x)$ such that the encoding is not pulled in the direction of the neighbors. It is defined as $\mathcal{L}_{\text{SOM}}(x) = \sum_{\tilde{e} \in N(z_q(x))} \|\tilde{e} - \mathrm{sg}[z_e(x)]\|_2^2$, where $N(\cdot)$ is the set of neighbors in the discrete space as defined above. This term enforces a neighborhood relation between the discrete codes and encourages all SOM nodes to ultimately receive gradient information from the data. The gradient stopping in this term is motivated similarly to its use in the VQ-VAE loss (see Eq. 1 and [35]). We want to optimize the embeddings based on their neighbors, but not the respective encodings, since any single encoding should be as close as possible to its assigned embedding and not receive gradient information from any other embeddings that it is not assigned to.
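Putting the three terms together, a forward evaluation of the loss in Eq. 2 can be sketched as follows. This is a simplified illustration with hypothetical names; gradient stopping is again only indicated in comments, since it does not affect the loss value itself:

```python
import numpy as np

def som_vae_loss(x, x_hat_q, x_hat_e, z_e, embeddings, node, alpha=1.0, beta=1.0):
    """Forward value of the SOM-VAE loss (Eq. 2).

    embeddings: SOM codebook of shape (H, W, m); node = (i, j) is the index
    of the assigned SOM node for this input.
    """
    H, W, _ = embeddings.shape
    i, j = node
    z_q = embeddings[i, j]
    # L_reconstruction: discrete subterm plus the differentiable z_e subterm.
    l_rec = np.sum((x - x_hat_q) ** 2) + np.sum((x - x_hat_e) ** 2)
    # L_commitment: pull the encoding and its assigned embedding together.
    l_commit = np.sum((z_e - z_q) ** 2)
    # L_SOM: neighbors of (i, j) on the 2-D map, clipped at the borders;
    # in the real model z_e carries a stop-gradient in this term.
    nbrs = [(i + di, j + dj) for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1))
            if 0 <= i + di < H and 0 <= j + dj < W]
    l_som = sum(np.sum((embeddings[n] - z_e) ** 2) for n in nbrs)
    return l_rec + alpha * l_commit + beta * l_som
```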
3.2 Latent Markov Transition Model
Our ultimate goal is to predict the development of time series in an interpretable way. This means that not only the state representations but also the predictions should be interpretable. To this end, we use a temporal probabilistic model.
Learning a probabilistic model in a high-dimensional continuous space can be challenging. Thus, we exploit the low-dimensional discrete space induced by our SOM to learn a temporal model. For that, we define a system state as the assigned node in the SOM and then learn a Markov model for the transitions between those states. The model is learned jointly with the SOM-VAE, where the loss function becomes
$\mathcal{L}_{\text{SOM-VAE-prob}}(x^{t-1}, x^t, \hat{x}^t_q, \hat{x}^t_e) = \mathcal{L}_{\text{SOM-VAE}}(x^t, \hat{x}^t_q, \hat{x}^t_e) + \gamma \, \mathcal{L}_{\text{transitions}}(x^{t-1}, x^t) + \tau \, \mathcal{L}_{\text{smoothness}}(x^{t-1}, x^t) \qquad (3)$

with weighting hyperparameters $\gamma$ and $\tau$.
The term $\mathcal{L}_{\text{transitions}}$ encourages the probabilities of actually observed transitions to be high. It is defined as $\mathcal{L}_{\text{transitions}}(x^{t-1}, x^t) = -\log P_M(z_q(x^{t-1}) \to z_q(x^t))$, with $P_M(z_q(x^{t-1}) \to z_q(x^t))$ being the probability of a transition from state $z_q(x^{t-1})$ to state $z_q(x^t)$ in the Markov model.
The term $\mathcal{L}_{\text{smoothness}}$ encourages the probabilities for transitions to nodes that are far away from the current data point to be low, or respectively the nodes with high transition probabilities to be proximal. It achieves this by taking large values only for transitions to far-away nodes that have a high probability under the model. It is defined as $\mathcal{L}_{\text{smoothness}}(x^{t-1}, x^t) = \mathbb{E}_{P_M(z_q(x^{t-1}) \to \tilde{e})}\left[\|\tilde{e} - z_e(x^t)\|_2^2\right]$. The probabilistic model can inform the evolution of the SOM through this term, which encodes our prior belief that transitions in natural data happen smoothly and that future time points will therefore mostly be found in the neighborhood of previous ones. In a setting where the data measurements are noisy, this improves the clustering by acting as a temporal smoother.
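The two probabilistic terms can be evaluated as in the following sketch, given a row-stochastic transition matrix over the flattened SOM nodes. The function and variable names are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def markov_loss_terms(T, k_prev, k_curr, z_e_t, embeddings_flat):
    """Values of L_transitions and L_smoothness from Eq. 3.

    T: row-stochastic transition matrix over the flattened SOM nodes.
    k_prev, k_curr: assigned node indices for x^{t-1} and x^t.
    embeddings_flat: codebook flattened to shape (K, m).
    """
    # L_transitions: negative log probability of the observed transition.
    l_transitions = -np.log(T[k_prev, k_curr])
    # L_smoothness: expected squared distance between z_e(x^t) and the
    # embedding of the next state, under the transition distribution
    # conditioned on the previous state.
    sq_dists = np.sum((embeddings_flat - z_e_t) ** 2, axis=1)
    l_smoothness = float(T[k_prev] @ sq_dists)
    return l_transitions, l_smoothness
```

Note how the smoothness term is small exactly when the model assigns high probability to nodes whose embeddings lie close to the current encoding.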
4 Experiments
We performed experiments on MNIST handwritten digits [23], Fashion-MNIST images of clothing [37], synthetic time series of linear interpolations of those images, time series from a chaotic dynamical system and real-world medical data from the eICU Collaborative Research Database [11]. For model implementation details, we refer to the appendix (Sec. A).
We found that our method achieves a superior clustering performance compared to other methods. We also show that we can learn a temporal probabilistic model concurrently with the clustering, which is on par with the maximum likelihood solution, while improving the clustering performance. Moreover, we can learn interpretable state representations of a chaotic dynamical system and discover meaningful patterns in real medical data.
4.1 Clustering on benchmark data sets
In order to test the clustering component of the SOM-VAE, we performed experiments on MNIST and Fashion-MNIST. We compare our model (including different adjustments to the loss function) against k-means [25] (sklearn package [29]), the VQ-VAE [35], a standard implementation of a SOM (minisom package [36]) and our own TF-SOM (TensorFlow SOM), which is essentially a SOM-VAE where the encoder and decoder are set to be identity functions.
The results of the experiment in terms of purity and normalized mutual information (NMI) are shown in Table 1. The SOM-VAE outperforms the other methods w.r.t. the clustering performance measures. It should be noted here that while k-means is a strong baseline, it is not density matching, i.e. the density of cluster centers is not proportional to the density of data points. Hence, the representation of data in a space induced by the k-means clusters can be misleading.
As argued in the appendix (Sec. B), NMI is a more balanced measure of clustering performance than purity. If one uses 512 embeddings in the SOM, one gets a lower NMI due to the penalty term for the number of clusters, but it yields an interpretable two-dimensional representation of the manifolds of MNIST (Fig. 2, Supp. Fig. S1) and Fashion-MNIST (Supp. Fig. S2).
The experiment shows that the SOM in our architecture improves the clustering (SOM-VAE vs. VQ-VAE) and that the VAE does so as well (SOM-VAE vs. TF-SOM). Both parts of the model therefore seem to be beneficial for our task. It also becomes apparent that our reconstruction loss term on $z_e(x)$ works better in practice than the gradient copying trick from the VQ-VAE (SOM-VAE vs. gradcopy), for the reasons described in Section 3.1. If one removes the reconstruction loss and does not copy the gradients, the encoder network does not receive any gradient information any more and the learning fails completely (no_grads). Another interesting observation is that stochastically optimizing a SOM using Adam [18] seems to discover a more performant solution than the classical SOM algorithm (TF-SOM vs. minisom). Since k-means seems to be the strongest competitor, we include it as a reference baseline in the following experiments as well.
                   MNIST                            Fashion-MNIST
Method         Purity          NMI              Purity          NMI
k-means        0.690 ± 0.000   0.541 ± 0.001    0.654 ± 0.001   0.545 ± 0.000
minisom        0.406 ± 0.006   0.342 ± 0.012    0.413 ± 0.006   0.475 ± 0.002
TF-SOM         0.653 ± 0.007   0.519 ± 0.005    0.606 ± 0.006   0.514 ± 0.004
VQ-VAE         0.538 ± 0.067   0.409 ± 0.065    0.611 ± 0.006   0.517 ± 0.002
no_grads*      0.114 ± 0.000   0.001 ± 0.000    0.110 ± 0.009   0.018 ± 0.016
gradcopy*      0.583 ± 0.004   0.436 ± 0.004    0.556 ± 0.008   0.444 ± 0.005
SOM-VAE*       0.731 ± 0.004   0.594 ± 0.004    0.678 ± 0.005   0.590 ± 0.003
4.2 Markov transition model on the discrete representations
In order to test the probabilistic model in our architecture and its effect on the clustering, we generated synthetic time series data sets of (Fashion-)MNIST images being linearly interpolated into each other. Each time series consists of 64 frames, starting with one image from (Fashion-)MNIST and smoothly changing sequentially into four other images over the length of the time course.
After training the model on these data, we constructed the maximum likelihood estimate (MLE) for the Markov model’s transition matrix by fixing all the weights in the SOM-VAE and making another pass over the training set, counting all the observed transitions. This MLE transition matrix reaches a negative log likelihood of , while our transition matrix, which is learned concurrently with the architecture, yields . Our model is therefore on par with the MLE solution.
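The count-based MLE described here amounts to the following small sketch (the function name is ours; the input is assumed to be sequences of assigned SOM node indices):

```python
import numpy as np

def mle_transition_matrix(state_sequences, num_states):
    """Count-based MLE of the Markov transition matrix, obtained in one pass
    over sequences of assigned SOM node indices."""
    counts = np.zeros((num_states, num_states))
    for seq in state_sequences:
        # Count every observed transition s -> s_next.
        for s, s_next in zip(seq[:-1], seq[1:]):
            counts[s, s_next] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    # Normalize each row; rows without observed transitions fall back to uniform.
    return np.where(row_sums > 0, counts / np.maximum(row_sums, 1.0), 1.0 / num_states)
```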
Comparing these results with the clustering performance on the standard MNIST and Fashion-MNIST test sets, we observe that the performance in terms of NMI is not impaired by the inclusion of the probabilistic model into the architecture (Tab. 2). On the contrary, the probabilistic model even slightly increases the performance on Fashion-MNIST. Note that we are using 64 embeddings in this experiment instead of 16, leading to a higher clustering performance in terms of purity, but a slightly lower performance in terms of NMI compared to Table 1. This shows again that the measure of purity has to be interpreted with care when comparing different experimental setups and that therefore the normalized mutual information should be preferred to make quantitative arguments.
This experiment shows that we can indeed fit a valid probabilistic transition model concurrently with the SOM-VAE training, while at the same time not hurting the clustering performance. It also shows that for certain types of data the clustering performance can even be improved by the probabilistic model (see Sec. 3.2).
                   MNIST                            Fashion-MNIST
Method         Purity          NMI              Purity          NMI
k-means        0.791 ± 0.005   0.537 ± 0.001    0.703 ± 0.002   0.492 ± 0.001
SOM-VAE        0.868 ± 0.003   0.595 ± 0.002    0.739 ± 0.002   0.520 ± 0.002
SOM-VAE-prob   0.858 ± 0.004   0.596 ± 0.001    0.724 ± 0.003   0.525 ± 0.002
4.3 Interpretable representations of chaotic time series
In order to assess whether our model can learn an interpretable representation of more realistic chaotic time series, we train it on synthetic trajectories simulated from the famous Lorenz system [27]. The Lorenz system is a good example for this assessment, since it offers two well-defined macro states (given by the attractor basins) which are occluded by some chaotic noise in the form of periodic fluctuations around the attractors. A good interpretable representation should therefore learn to largely ignore the noise and model the changes between attractor basins. For a review of the Lorenz system and details about the simulations and the performance measure, we refer to the appendix (Sec. C.1).
In order to compare the interpretability of the learned representations, we computed entropy distributions over simulated subtrajectories in the real system space, the attractor assignment space and the representation spaces for k-means and our model. The computed entropy distributions over all subtrajectories in the test set are depicted in Figure 3.
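The entropy of a subtrajectory in a discrete representation space can be computed as in this small sketch (illustrative, in nats; the paper does not prescribe this exact form):

```python
import numpy as np

def assignment_entropy(states, num_states):
    """Shannon entropy (nats) of the empirical distribution of discrete
    representation states along one subtrajectory."""
    # Empirical state distribution over the subtrajectory.
    p = np.bincount(np.asarray(states), minlength=num_states) / len(states)
    p = p[p > 0]  # 0 * log(0) is taken as 0
    return float(-np.sum(p * np.log(p)))
```

A subtrajectory that stays in a single state has entropy zero; noisy representations that jump between many states yield high entropies, which is the contrast the figure visualizes.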
The experiment shows that the SOM-VAE representations (Fig. 3) are much closer in entropy to the ground-truth attractor basin assignments (Fig. 3) than the k-means representations (Fig. 3). For most of the subtrajectories without an attractor basin change, they assign a very low entropy, effectively ignoring the noise, while the k-means representations partially assign very high entropies to those trajectories. In total, the k-means representations’ entropy distribution is similar to the entropy distribution in the noisy system space (Fig. 3). The representations learned by the SOM-VAE are therefore more interpretable than the k-means representations with regard to this interpretability measure. As could be expected from these figures, the SOM-VAE representation is also superior to the k-means one in terms of purity with respect to the attractor assignment ( vs. ) as well as NMI ( vs. ).
Finally, we use the learned probabilistic model on our SOM-VAE representations to sample new latent system trajectories and compute their entropies. The distribution looks qualitatively similar to the one over real trajectories (Fig. 3), but our model slightly overestimates the attractor basin change probabilities, leading to a heavier tail of the distribution.
4.4 Learning representations of acute physiological states in the ICU
In order to demonstrate interpretable representation learning on a complex real-world task, we trained our model on time series measurements of intensive care unit (ICU) patients. We analyze the performance of the resulting clustering w.r.t. the acute patient physiology state in Table 3. For details regarding the data selection and processing, we refer to the appendix (Sec. C.2).
Method         score_6             score_12            score_24
k-means        0.0411 ± 0.0007     0.0384 ± 0.0006     0.0366 ± 0.0005
SOM-VAE        0.0407 ± 0.0005     0.0376 ± 0.0004     0.0354 ± 0.0004
SOM-VAE-prob   0.0474 ± 0.0006     0.0444 ± 0.0006     0.0421 ± 0.0005
Our full model (including the latent Markov model) performs best on the given tasks, i.e. better than k-means and also better than the SOM-VAE without the probabilistic model. This could be due to the noisiness of the medical data and the probabilistic model’s smoothing tendency (see Sec. 3.2).
In order to qualitatively assess the interpretability of the probabilistic SOM-VAE, we analyzed the average physiology score per cluster (Fig. 4). Our model exhibits clusters where higher scores are enriched compared to the background level. Moreover, these clusters form compact structures, facilitating interpretability. We do not observe such interpretable structures in the other methods.
As an illustrative example, we show the trajectories of two patients that start in the same cluster. One patient stays in the regions of the map with low average physiology score and eventually gets discharged from the hospital healthily. The other one moves into map regions with high average physiology score and ultimately dies. Such knowledge could be helpful for doctors, who could determine the risk of a patient for certain endpoints from a glance at their trajectory in the SOMVAE representation.
5 Conclusion
The SOM-VAE provides an improvement over the standard VQ-VAE framework in terms of clustering performance and offers a way to learn discrete two-dimensional representations of the data manifold in concurrence with the reconstruction task. The learned map offers interpretability of states in different data sets and outperforms baseline methods with respect to different clustering performance measures. The probabilistic component of our model can be learned end-to-end and concurrently with the rest of the architecture, while offering predictive performance close to the maximum likelihood solution and further improving the clustering on noisy data. The model can recover interpretable state representations on time series of chaotic dynamical systems.
On a challenging real-world medical data set, our model learns more informative representations with respect to medically relevant dynamical endpoints than competitor methods. It uses the probabilistic model component in this setting to improve the clustering. The learned representations can be visualized in an interpretable way and could be helpful for clinicians to understand patients’ health states and trajectories more intuitively. It will be interesting to see in future work whether the probabilistic model can be extended to not just improve the clustering and interpretability of the whole model, but also enable us to make predictions. Promising avenues in that direction could be to increase the complexity by applying a higher-order Markov model, a hidden Markov model or a Gaussian process.
Acknowledgments
FL is supported by the Max Planck/ETH Center for Learning Systems. MH is supported by the Grant No. 205321_176005 “Novel Machine Learning Approaches for Data from the Intensive Care Unit” of the Swiss National Science Foundation (to GR). VF, FL, MH and HS are partially supported by ETH core funding (to GR). We thank Natalia Marciniak for her administrative efforts, Marc Zimmermann for technical support, Gideon Dresdner, Stephanie Hyland, Viktor Gal and Maja Rudolph for helpful discussions; and Ron Swanson for his inspirational attitude.
References
 [1] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mane, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viegas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems. arXiv Preprints, 2016.
 [2] Elie Aljalbout, Vladimir Golkov, Yawar Siddiqui, and Daniel Cremers. Clustering with Deep Learning: Taxonomy and New Methods. arXiv Preprints, 2018.
 [3] Omar Badawi, Xinggang Liu, Erkan Hassan, Pamela J. Amelung, and Sunil Swami. Evaluation of ICU Risk Models Adapted for Use as Continuous Markers of Severity of Illness Throughout the ICU Stay. Critical Care Medicine, 46(3):361–367, 2018.
 [4] Guilherme A. Barreto and Aluizio F. R. Araújo. Identification and Control of Dynamical Systems Using the Self-Organizing Map. IEEE Transactions on Neural Networks, 15(5):1244–1259, 2004.
 [5] Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation Learning: A Review and New Perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798–1828, 2013.
 [6] Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. InfoGAN: Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets. arXiv Preprints, 2016.
 [7] Glen Wright Colopy, Marco A F Pimentel, Steven J Roberts, and David A Clifton. Bayesian Gaussian Processes for Identifying the Deteriorating Patient. Annual International Conference of the Engineering in Medicine and Biology Society, pages 5311–5314, 2016.
 [8] M. Dittenbach, D. Merkl, and A. Rauber. The growing hierarchical self-organizing map. In Proceedings of the International Joint Conference on Neural Networks, pages 15–19. IEEE, 2000.
 [9] Babak Esmaeili, Hao Wu, Sarthak Jain, N. Siddharth, Brooks Paige, and JanWillem van de Meent. Hierarchical Disentangled Representations. arXiv Preprints, 2018.
 [10] Tetsuo Furukawa, Kazuhiro Tokunaga, Kenji Morishita, and Syozo Yasui. Modular Network SOM (mnSOM): From Vector Space to Function Space. Proceedings of the International Joint Conference on Neural Networks, 3:1581–1586, 2005.
 [11] Ary L Goldberger, Luis A N Amaral, Leon Glass, Jeffrey M Hausdorff, Plamen Ch Ivanov, Roger G Mark, Joseph E Mietus, George B Moody, Chung-Kang Peng, and H Eugene Stanley. PhysioBank, PhysioToolkit, and PhysioNet. Components of a New Research Resource for Complex Physiologic Signals. Circulation, 101(23), 2000.
 [12] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative Adversarial Nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
 [13] Klaus Greff, Aaron Klein, Martin Chovanec, Frank Hutter, and Jürgen Schmidhuber. The Sacred Infrastructure for Computational Research. In Proceedings of the 15th Python in Science Conference, 2017.
 [14] Hrayr Harutyunyan, Hrant Khachatrian, David C Kale, and Aram Galstyan. Multitask Learning and Benchmarking with Clinical Time Series Data. arXiv Preprints, 2017.
 [15] Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework. In Proceedings of the International Conference on Learning Representations, 2017.
 [16] Stephanie L. Hyland, Cristóbal Esteban, and Gunnar Rätsch. Realvalued (Medical) Time Series Generation with Recurrent Conditional GANs. arXiv Preprints, 2017.
 [17] Alistair E W Johnson, Mohammad M Ghassemi, Shamim Nemati, Katherine E Niehaus, David A Clifton, and Gari D Clifford. Machine learning and decision support in critical care. Proc. IEEE, 104(2):444–466, 2016.
 [18] Diederik P. Kingma and Jimmy Ba. Adam: A Method for Stochastic Optimization. In Proceedings of the International Conference on Learning Representations, 2015.
 [19] Diederik P Kingma and Max Welling. Auto-Encoding Variational Bayes. arXiv Preprints, 2013.
 [20] Aaron Klein, Stefan Falkner, Numair Mansur, and Frank Hutter. RoBO: A Flexible and Robust Bayesian Optimization Framework in Python. In NIPS Bayesian Optimization Workshop, 2017.
 [21] W A Knaus, E A Draper, D P Wagner, and J E Zimmerman. APACHE II: a severity of disease classification system. Critical Care Medicine, 13(10):818–29, 1985.
 [22] Teuvo Kohonen. The self-organizing map. Neurocomputing, 21(1-3):1–6, 1998.
 [23] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-Based Learning Applied to Document Recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
 [24] Bryan Lim and Mihaela van der Schaar. Disease-Atlas: Navigating Disease Trajectories using Deep Learning. arXiv Preprints, 2018.
 [25] Stuart Lloyd. Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2):129–137, 1982.
 [26] Francesco Locatello, Damien Vincent, Ilya Tolstikhin, Gunnar Rätsch, Sylvain Gelly, and Bernhard Schölkopf. Clustering Meets Implicit Generative Models. arXiv Preprints, 2018.
 [27] Edward N. Lorenz. Deterministic Nonperiodic Flow. Journal of the Atmospheric Sciences, 20(2):130–141, 1963.
 [28] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008.
 [29] Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake Vanderplas, Alexandre Passos, David Cournapeau, Matthieu Brucher, Matthieu Perrot, and Édouard Duchesnay. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12(Oct):2825–2830, 2011.
 [30] Piotr Płoński and Krzysztof Zaremba. Improving Performance of Self-Organising Maps with Distance Metric Learning Method. arXiv Preprints, 2014.
 [31] Huiyan Sang, Alan E. Gelfand, Chris Lennard, Gabriele Hegerl, and Bruce Hewitson. Interpreting self-organizing maps through space-time data models. Annals of Applied Statistics, 2(4):1194–1216, 2008.
 [32] Peter Schulam and Raman Arora. Disease Trajectory Maps. arXiv Preprints, 2016.
 [33] Santosh Tirunagari, Norman Poh, Guosheng Hu, and David Windridge. Identifying Similar Patients Using Self-Organising Maps: A Case Study on Type-1 Diabetes Self-care Survey Responses. arXiv Preprints, 2015.
 [34] Warwick Tucker. The Lorenz attractor exists. Comptes Rendus de l’Académie des Sciences - Series I - Mathematics, 328(12):1197–1202, 1999.
 [35] Aaron van den Oord, Oriol Vinyals, and Koray Kavukcuoglu. Neural Discrete Representation Learning. arXiv Preprints, 2017.
 [36] Giuseppe Vettigli. MiniSom: a minimalistic implementation of the Self Organizing Maps, 2017.
 [37] Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms. arXiv Preprints, 2017.
 [38] Dejiao Zhang, Yifan Sun, Brian Eriksson, and Laura Balzano. Deep Unsupervised Clustering Using Mixture of Autoencoders. arXiv Preprints, 2017.
Appendix
Appendix A Implementation details
The hyperparameters of our model were optimized using Robust Bayesian Optimization, with the packages sacred and labwatch [13] for the parameter handling and RoBO [20] for the optimization, using the mean squared reconstruction error as the optimization criterion. Since our model defines a general framework, some competitor models can be seen as special cases of our model, where certain parts of the loss function are set to zero or parts of the architecture are omitted. We used the same hyperparameters for those models. For external competitor methods, we used the hyperparameters from the respective publications where applicable and otherwise the default parameters from their packages. The models were implemented in TensorFlow [1] and optimized using Adam [18].
Appendix B Clustering performance measures
Since one of our main tasks is the clustering of data, we need performance measures to objectively compare the quality of a clustering across methods. The measures that we use, and that have been used extensively in the literature, are purity and normalized mutual information (NMI) [28]. We briefly review them in the following.
Let the set of ground truth classes in the data be $C = \{c_1, c_2, \ldots, c_J\}$ and the set of clusters that result from the algorithm $\Omega = \{\omega_1, \omega_2, \ldots, \omega_K\}$. The purity is then defined as
$$\mathrm{purity}(C, \Omega) = \frac{1}{N} \sum_{k=1}^{K} \max_{j} \, \lvert \omega_k \cap c_j \rvert$$
where $N$ is the total number of data points. Intuitively, the purity is the accuracy of the classifier that assigns the most prominent class label in each cluster to all of its respective data points.
While the purity has a very simple interpretation, it also has some shortcomings. One can for instance easily observe that a clustering with $K = N$, i.e. one cluster for every single data point, will yield a purity of $1$ but still probably not be very informative for most tasks. It would therefore be more sensible to have another measure that penalizes the number of clusters. The normalized mutual information is one such measure.
The NMI is defined as
$$\mathrm{NMI}(C, \Omega) = \frac{2 \, I(C; \Omega)}{H(C) + H(\Omega)}$$
where $I(C; \Omega)$ is the mutual information between $C$ and $\Omega$ and $H(\cdot)$ is the Shannon information entropy. While the entropy of the classes $H(C)$ is a data-dependent constant, the entropy of the clustering $H(\Omega)$ increases with the number of clusters. It can therefore be seen as a penalty term to regularize the trade-off between low intra-cluster variance and a small number of clusters. Both NMI and purity are normalized, i.e. take values in $[0, 1]$.
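As a concrete sketch, both measures can be computed as follows. The `purity` helper is a minimal implementation of the definition above, while the NMI comes from scikit-learn [29]; note that scikit-learn's normalization of the mutual information is configurable via `average_method`, and the arithmetic mean used below is one common choice:

```python
import numpy as np
from sklearn.metrics import normalized_mutual_info_score

def purity(true_labels, cluster_assignments):
    """Purity: each cluster votes for its most frequent ground-truth class."""
    true_labels = np.asarray(true_labels)
    cluster_assignments = np.asarray(cluster_assignments)
    total = 0
    for c in np.unique(cluster_assignments):
        members = true_labels[cluster_assignments == c]
        # count of the most prominent class label in this cluster
        total += np.bincount(members).max()
    return total / len(true_labels)

# toy example: two ground-truth classes, two clusters
truth = [0, 0, 0, 1, 1, 1]
clusters = [0, 0, 1, 1, 1, 1]
p = purity(truth, clusters)  # (2 + 3) / 6
nmi = normalized_mutual_info_score(truth, clusters,
                                   average_method="arithmetic")
```

The degenerate case from the text is easy to verify: `purity(truth, range(len(truth)))` returns `1.0`, while the corresponding NMI stays below one because $H(\Omega)$ grows with the number of clusters.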
Appendix C Experimental methods
C.1 Interpretable representations of chaotic time series
The Lorenz system is the system of coupled ordinary differential equations defined by
$$\frac{dx}{dt} = \sigma (y - x), \qquad \frac{dy}{dt} = x (\rho - z) - y, \qquad \frac{dz}{dt} = x y - \beta z$$
with tuning parameters $\sigma$, $\rho$ and $\beta$. For parameter choices $\sigma = 10$, $\rho = 28$ and $\beta = \frac{8}{3}$, the system shows chaotic behavior by forming a strange attractor [34] with the two attractor points being given by
$$p_{1,2} = \left( \pm\sqrt{\beta (\rho - 1)}, \; \pm\sqrt{\beta (\rho - 1)}, \; \rho - 1 \right).$$
We simulated 100 trajectories of 10,000 time steps each from the chaotic system and trained the SOM-VAE as well as k-means on it with 64 clusters/embeddings respectively. The system chaotically switches back and forth between the two attractor basins. By computing the Euclidean distance between the current system state and each of the two attractor points $p_1$ and $p_2$, we can identify the current attractor basin at each time point.
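This setup can be sketched as follows. The paper does not specify the integrator, so the forward-Euler step and the step size `dt` below are illustrative assumptions; the parameters and attractor points match the definitions above:

```python
import numpy as np

# Lorenz parameters from the text: sigma = 10, rho = 28, beta = 8/3
SIGMA, RHO, BETA = 10.0, 28.0, 8.0 / 3.0

def lorenz_step(state, dt=0.01):
    """One forward-Euler step of the Lorenz equations (an illustrative
    integrator choice; the paper does not specify one)."""
    x, y, z = state
    dx = SIGMA * (y - x)
    dy = x * (RHO - z) - y
    dz = x * y - BETA * z
    return np.array([x + dt * dx, y + dt * dy, z + dt * dz])

def simulate(n_steps, x0, dt=0.01):
    """Simulate one trajectory of `n_steps` time steps from state `x0`."""
    traj = np.empty((n_steps, 3))
    traj[0] = x0
    for t in range(1, n_steps):
        traj[t] = lorenz_step(traj[t - 1], dt)
    return traj

# The two attractor points p_{1,2} = (±sqrt(beta(rho-1)), ±sqrt(beta(rho-1)), rho-1)
r = np.sqrt(BETA * (RHO - 1))
P = np.array([[r, r, RHO - 1], [-r, -r, RHO - 1]])

def attractor_basin(traj):
    """Assign each time point to the Euclidean-nearest attractor point."""
    d = np.linalg.norm(traj[:, None, :] - P[None, :, :], axis=-1)
    return d.argmin(axis=1)
```

The basin assignment is simply the index of the nearest attractor point, which yields the noise-free ground-truth macro-state sequence used below.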
In order to assess the interpretability of the learned representations, we have to define an objective measure of interpretability. We define interpretability as the similarity between the representation and the system’s ground truth macrostate. Since representations at single time points are meaningless with respect to this measure, we compare the evolution of representations and system state over time in terms of their entropy.
We divided the simulated trajectories from our test set into spans of 100 time steps each. For every subtrajectory, we computed the entropies in the real system space (macro-state and noise), the assigned attractor basin space (noise-free ground-truth macro-state), the SOM-VAE representation and the k-means representation. We also observed for every subtrajectory whether or not a change between attractor basins has taken place. Note that the attractor assignments and representations are discrete, while the real system space is continuous. In order to make the entropies comparable, we discretize the system space into unit hypercubes for the entropy computation. For a representation $R$ with assignments $R_t$ at time $t$ and starting time $t_{\mathrm{start}}$ of the subtrajectory, the entropies are defined as
$$H_{t_{\mathrm{start}}}(R) = H\left( \left\{ R_t \mid t_{\mathrm{start}} \leq t < t_{\mathrm{start}} + 100 \right\} \right) \tag{4}$$
with $H(\cdot)$ being the Shannon information entropy of a discrete set.
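A minimal sketch of this entropy computation over discrete assignment sequences; the span length of 100 matches the text, while the natural-log base is an assumption (the base only rescales all entropies by the same factor and so does not affect the comparison):

```python
import numpy as np
from collections import Counter

def shannon_entropy(assignments):
    """Shannon entropy (in nats) of the empirical distribution of a
    discrete assignment sequence."""
    counts = np.array(list(Counter(assignments).values()), dtype=float)
    p = counts / counts.sum()
    return float(-(p * np.log(p)).sum())

def subtrajectory_entropies(assignments, span=100):
    """Entropy of each non-overlapping span of `span` time steps,
    as in Equation (4)."""
    return [shannon_entropy(assignments[t:t + span])
            for t in range(0, len(assignments) - span + 1, span)]
```

A smooth representation that stays within one discrete state over a span yields entropy zero, while a representation that jumps between states accrues a higher entropy, which is what the interpretability comparison measures.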
C.2 Learning representations of acute physiological states in the ICU
All experiments were performed on dynamic data extracted from the eICU Collaborative Research Database [11]. Irregularly sampled time series data were extracted from the raw tables and then resampled to a regular time grid using a combination of forward filling and missing value imputation using global population statistics. We chose a grid interval of one hour to capture the rapid dynamics of patients in the ICU.
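This resampling scheme can be sketched with pandas as below; the column name `hr` and the `population_means` dictionary are illustrative placeholders, not the actual eICU schema:

```python
import pandas as pd

def to_hourly_grid(df, population_means):
    """Resample irregular measurements to a one-hour grid: forward-fill
    the last observed value, then impute any remaining gaps (e.g. before
    the first measurement) with global population statistics.
    `df` is indexed by measurement timestamp."""
    hourly = df.resample("1h").mean()  # regular one-hour grid
    hourly = hourly.ffill()            # carry the last observation forward
    return hourly.fillna(population_means)

# toy usage: two heart-rate measurements, almost three hours apart
idx = pd.to_datetime(["2020-01-01 00:10", "2020-01-01 03:00"])
df = pd.DataFrame({"hr": [80.0, 90.0]}, index=idx)
grid = to_hourly_grid(df, {"hr": 75.0})
```

In the toy example, the empty 01:00 and 02:00 bins are filled with the last observed value of 80, mirroring how the forward-fill step handles gaps between measurements.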
Each sample in the time grid was then labeled using a dynamic variant of the APACHE score [21], which is a proxy for the instantaneous physiological state of a patient in the ICU. Specifically, the variables MAP, Temperature, Respiratory rate, HCO3, Sodium, Potassium, and Creatinine were selected from the score definition, because they could be easily defined for each sample in the eICU time series. The value range of each variable was binned into ranges of normal and abnormal values, in line with the definition of the APACHE score, where a higher score for a variable is obtained for abnormally high or low values. The scores were then summed up, and we define the predictive score as the worst (highest) score in the next $t$ hours, for different choices of the horizon $t$. Patients are thus stratified by their expected pathology in the near future, which corresponds closely to how a physician would perceive the state of a patient. The training set consisted of 7000 unique patient stays, while the test set contained 3600 unique stays.
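The labeling scheme can be sketched as follows; the normal ranges and per-step increments are illustrative placeholders, not the actual APACHE II cut-offs:

```python
import numpy as np

def score_variable(value, normal_low, normal_high, step):
    """Bin one variable into a score: 0 inside its normal range, and +1
    per `step` of deviation outside it, so abnormally high or low values
    both score higher (placeholder bins, not the APACHE II definition)."""
    if normal_low <= value <= normal_high:
        return 0
    dist = normal_low - value if value < normal_low else value - normal_high
    return int(np.ceil(dist / step))

def worst_future_score(summed_scores, horizon):
    """Label each time step with the worst (highest) summed score within
    the next `horizon` steps, including the current one."""
    summed_scores = np.asarray(summed_scores)
    return np.array([summed_scores[t:t + horizon].max()
                     for t in range(len(summed_scores))])
```

Summing `score_variable` over the selected variables at each hour gives the instantaneous score, and `worst_future_score` then produces the predictive label that stratifies patients by their expected near-future pathology.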