A Local Information Criterion for Dynamical Systems
Abstract
Encoding a sequence of observations is an essential task with many applications. The encoding can become highly efficient when the observations are generated by a dynamical system. A dynamical system imposes regularities on the observations that can be leveraged to achieve a more efficient code. We propose a method to encode a given or learned dynamical system. Apart from its application for encoding a sequence of observations, we propose to use the compression achieved by this encoding as a criterion for model selection. Given a dataset, different learning algorithms result in different models. But not all learned models are equally good. We show that the proposed encoding approach can be used to choose the learned model which is closer to the true underlying dynamics. We provide experiments for both encoding and model selection, and theoretical results that shed light on why the approach works.
Arash Mehrjou, Department of Empirical Inference, Max Planck Institute for Intelligent Systems, arash.mehrjou@tuebingen.mpg.de
Friedrich Solowjow, Intelligent Control Systems Group, Max Planck Institute for Intelligent Systems, fsolowjow@is.mpg.de
Sebastian Trimpe, Intelligent Control Systems Group, Max Planck Institute for Intelligent Systems, trimpe@is.mpg.de
Bernhard Schölkopf, Department of Empirical Inference, Max Planck Institute for Intelligent Systems, bs@tuebingen.mpg.de
NIPS 2018 submission.
1 Introduction
Objects in nature vary in complexity. A round stone looks simpler than a convoluted, rough piece of rock; a constant beep-like sound is simpler than an orchestra. We humans have internal ideas about how complex objects are. Complexity can also be defined for abstract objects such as mathematical constructs. The focus of this paper is on the complexity of dynamical systems that model the laws of nature [1]. To our eyes, a dynamical system is nothing more than a temporal sequence of observations. We might use the data sequence to infer a model. But which is the better representation of the dynamical system, the data or the model, and which model should we use? In this paper, we take a closer look at efficient encoding of dynamical systems and, based on that, propose a model selection criterion with practical use in empirical inference.
For illustration, assume the following scenario: Alice and Bob are friends and they are talking over the phone. Alice is watching a dynamical system and wants to share her experience with Bob. Alice knows what Bob knows about nature, mathematics, etc. She is watching a temporal sequence of observations caused by an underlying mathematical expression . Unfortunately, the transmission line from Alice to Bob charges her for every voltage pulse. Therefore, Alice would like to transmit her experience to Bob at the lowest phone cost. Due to physical constraints, Alice can observe samples from the model with sampling frequency where is the time interval between two consecutive observations. Assume the phone call starts at time and Alice can describe each of her observations with bits. One trivial solution is that Alice talks constantly with Bob and tells him every observation at each time instant for an indeterminate amount of time. Despite its simplicity, this approach will cost Alice an amount that increases without bound as . More cleverly, Alice can use her prior assumptions about nature and her belief that her observations are not totally random. Hence, she is able to infer the underlying dynamics by a nonparametric model from her observations . Assume this model is chosen from a hypothesis set and both Alice and Bob agree on the members of . Thus, Alice only needs to inform Bob about the initial state of the system and the model she has learned about the dynamics. Bob can reconstruct the sequence on his side by running starting from the initial state . Notice that the state dynamics may cover only a small subset of the state space, which removes the need to model on its whole domain. We use this property of dynamical systems for compressing their information and obtaining an optimal local tradeoff between model complexity and prediction accuracy.
The underlying questions of this example are also highly relevant for artificial intelligent systems. Imagine autonomous vehicles flying or driving in a formation [2], or multiple robots coordinating their actions [3]. These systems need to know about each other; that is, agents need to transmit dynamics information to one another. An intelligent agent, however, will use its resources wisely and thus communicate only when and what is necessary. In this scenario, better encoding of dynamical systems means reduced communication, lower bandwidth requirements, and thus reduced cost. Likewise, intelligent agents may store various internal models for the purpose of simulation, prediction, or control [4]. Better representations here may mean improved performance, reduced memory requirements, and faster computation.
Contribution. In this paper, we propose to encode dynamical systems through local representations, which are computed to yield (locally) an optimal tradeoff between model complexity and predictive performance. The criterion automatically chooses the ‘right’ complexity: locally simple dynamics are represented by low-order models, while higher-order representations are automatically used in areas with more complex dynamics. Because the representation thus adapts to the local information content of the dynamics, the proposed encoding scheme represents a novel information criterion for dynamical systems, which we call the Local Information Criterion for Dynamical Systems (LICDS).
LICDS is motivated by compressing information through local representations. Since there are theories and empirical evidence in machine learning confirming the relation between generalization and compression [5, 6], we hypothesized that LICDS is also useful for model selection. Indeed, we show that the information criterion can be used (in addition to efficient encoding) to choose among different models learned from a given dataset. In particular, we show that it is possible to choose between different architectures of neural networks (NNs) and to compare different types of learned models (e.g., NNs versus GPs) solely with the aid of the compression score and without test data. We extend our empirical findings with theoretical results, which confirm the correctness of our method for certain function classes and provide insight into why LICDS is a useful criterion for model selection. Fig. 1 illustrates the two proposed applications of LICDS: encoding and model selection.
Related Work. The subject of obtaining a representation of a dynamical system from its input-output data is known as system identification [7, 8] or model learning [9]. Two major approaches in system identification are gray-box and black-box approaches [7, 8]. Gray-box methods learn the parameters of a known model [10], where parameters typically have a physical interpretation. In contrast, black-box methods need to identify both the structure and the parameters of the model [11]. In black-box system identification, or machine learning in general, choosing the appropriate structure is usually done by investigating model performance on a left-out validation set. An information criterion is a different approach to model selection that takes into account model complexity and data explanation at the same time [12]. Many information criteria have been proposed and used for supervised [13] and unsupervised learning [14]. Despite some recent work [15, 16] on the information criterion approach to dynamical systems, the field is not yet well explored. This work proposes a compression method for dynamical systems that can be used as an information criterion and for model selection as well.
Models of dynamical systems take very different representations. At one end of the spectrum, there are classical parametric models such as linear transfer functions and state-space models [7], as well as nonlinear gray-box models with known structure (e.g., from first principles) and some free parameters. In these, the model structure is relatively rigid and information is encoded in a small number of parameters, often with some physical interpretation. Neural networks (NNs) [17, 18] can also learn model structure and encode information in a large number of weight parameters without direct physical interpretation. Fuzzy models such as Takagi-Sugeno [19] encode dynamical systems as a set of fuzzy rules or sets and associated models. Nonparametric methods such as Gaussian process (GP) models [20, 21, 22] and classical time- or frequency-domain methods [23] represent dynamical system information essentially in a dataset (in the time or frequency domain). Herein, we propose to encode dynamical systems in local models whose complexity is adapted to the data stream. Our encoding thus provides a middle ground between encoding in a dataset and a single (global) parametric model.
The benefits of local modeling approaches for dynamical systems have long been recognized [24, 25, 26]. These include, in particular, the ability to learn fast and incrementally from a continuous stream of (possibly large) data, while allowing for nonstationary distributions [26]. This is critical in real-time learning such as robot control [27]. While in these works the complexity of the local model must be chosen a priori (most often, locally linear models are used), our method allows for determining the optimal model complexity, which is adapted to the data.
The proposed encoding scheme for dynamical systems was first considered in [28], but in a different context from the one herein. While in [28] the true dynamics are assumed known and LICDS is used to efficiently communicate state information between agents in a networked system, we consider encoding of dynamics models learned from data. Moreover, the proposal of LICDS for model selection (Sec. 4), and all theoretical (Theorems 1-3) and empirical results (Secs. 3 and 5) are novel contributions.
2 Proposed Local Encoding
In this section, we explain our proposed encoding scheme for dynamical systems
(1) 
We present our idea based on the concepts of algorithmic complexity [29], Universal Turing Machines (UTM) [30], and minimum message length (MML) [31]. A UTM is a programmable machine that receives a message as input and produces the desired output. The minimum length of the input message can be seen as the complexity of the output and is called algorithmic complexity (AC). The MML principle chooses a model for the observed data such that the joint AC of the tuple (model, data) is minimized. The detailed prerequisite definitions are deferred to the supplementary material.
General notion. Our aim is to construct a brief and efficient explanation for the observed data from the model. The explanation is a message consisting of two parts. The first part encodes some general assertion (theory) about the source of the data, and the second part is the explanation of the data given that the assertion is correct [31]. Throughout this paper, we assume the data takes finite discrete values with a certain precision. Hence, each data example can be encoded as a finite sequence of ‘0s’ and ‘1s’. This is a reasonable assumption because, in practice, values are usually stored in a quantized way on digital computers, and we shall consider a finite horizon hereafter.
Alice’s encoding of a dynamical system. Assume Alice is given a long sequence of observations to be transmitted to Bob over the phone. Alice thinks of a message consisting of two parts. The first part encodes her belief about the dynamical system that has generated the sequence, and the second part encodes the portion of the data left unexplained by the assumed dynamical system. In this setting, Bob has a UTM that decodes and reconstructs the original sequence. The first part of the message teaches Bob Alice’s belief about the source dynamical system, and the second part teaches Bob how to recover the observations given this dynamical system. Assume that Alice and Bob have agreed on a finite set of functions from which the dynamics is chosen. Therefore, the first part of the message takes bits to choose one member of this set. The second part of the message encodes the initial point , from which the dynamical system starts evolving.
Algorithm 1 (LICDS; the expressions in the individual steps are not recoverable)
Input: dynamical function, initial state, global time horizon, maximum number of partitions, maximum number of expansion terms for each local model.
Output: approximated states, optimal total cost, optimal total complexity, optimal number of partitions.
Steps: lines 1-5 set up the observations, the optimal total cost, the optimal total complexity, the optimal number of partitions, and the local time horizon. The outer loop (lines 6-25) runs over the candidate numbers of partitions; for each candidate it resets three running quantities, runs an inner loop over the partitions (lines 10-16), and then updates the stored optimum either on the first candidate or when a two-part condition holds (lines 17-24). Line 26 returns the outputs.
Algorithm 2 (optimal local complexity; the expressions in the individual steps are not recoverable)
Input: dynamical function, index of local window, local time horizon, relative weight, initial state of local time horizon, maximum allowed complexity.
Output: local approximation of the state trajectory, optimal local cost, optimal local complexity.
Steps: lines 1-3 set up the observations, the optimal local cost, and the optimal local complexity. The loop (lines 6-15) runs over the candidate complexities and updates the stored optimum either on the first candidate or when a two-part condition holds. Line 16 returns the outputs.
Again, we assume that state values are discrete, finite, and chosen from an alphabet set . This assumption is justified when the states are bounded and stored with finite precision. This requires bits to encode the initial point. In total, the number of bits required to encode the sequence of observations can be seen as an Information Criterion for Dynamical Systems (ICDS). For a deterministic dynamical system, having suffices to recover for all (within the assumed precision). Therefore, the ICDS number of bits is sufficient information to recover the sequence .
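The two-part bit count can be made concrete with a small sketch. It assumes a uniform code over each finite set, so that indexing one of N items costs ceil(log2 N) bits; the function names and the numbers (1024 candidate functions, 16-bit states, 1000 observations) are hypothetical and chosen only for illustration.

```python
def uniform_code_bits(n_items: int) -> int:
    """Bits needed to index one of n_items items with a uniform code."""
    if n_items < 1:
        raise ValueError("need at least one item")
    # (n - 1).bit_length() equals ceil(log2(n)) for every n >= 1.
    return (n_items - 1).bit_length()


def icds_bits(n_functions: int, alphabet_size: int, state_dim: int = 1) -> int:
    """Two-part message: choose the dynamics, then the initial state."""
    return uniform_code_bits(n_functions) + state_dim * uniform_code_bits(alphabet_size)


def naive_bits(n_observations: int, alphabet_size: int, state_dim: int = 1) -> int:
    """Baseline: transmit every observation verbatim."""
    return n_observations * state_dim * uniform_code_bits(alphabet_size)


# Hypothetical sizes: 1024 agreed-upon functions, 16-bit scalar states.
model_bits = icds_bits(n_functions=1024, alphabet_size=2**16)
raw_bits = naive_bits(n_observations=1000, alphabet_size=2**16)
print(model_bits, raw_bits)
```

Under these assumed sizes the two-part message stays at a fixed length, while the verbatim transmission grows linearly with the number of observations.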
Can Alice do better? The states of a dynamical system move along a certain trajectory in the state space depending on and . Therefore, we do not need to encode for its entire input space. If takes a simple shape around the working point, we can save many bits by encoding locally rather than globally. This idea results in the Local Information Criterion for Dynamical Systems (LICDS). Assume the state space is adaptively partitioned into pieces along the state trajectory and the complexity of the system within each partition is also adaptively chosen. The input tape of the UTM is formed as a concatenation of several messages (instead of two as before), i.e., as , where each tuple corresponds to the local partition in the state space of the dynamical system of Eq. 1. In each tuple, reprograms the UTM into the simulator of the local approximation to and decodes to its corresponding initial point from the observations. This means that the local trajectory corresponding to each local model starts from a point belonging to the correct trajectory, which prevents error from propagating from one local model to the next. Formally speaking, we look for a local representation of a function based on a finite set of basis functions
(2) 
where is the local approximation to around working point . In other words, approximates the function in its local partition of the state space to which belongs. The set is chosen from the hypothesis space with cardinality . The set is chosen rich enough such that it contains basis functions that are able to approximate arbitrarily well as . Different classes of basis functions can be used, e.g., Taylor expansion, Fourier series, Legendre polynomials, etc. [32]. In this paper, we use Taylor expansion to showcase our points, but the concepts are generally applicable to other expansions as well. Let us next assume the coefficients are chosen from a finite discrete set . The coefficients are bounded because we approximate the dynamics function by a smooth function (e.g., NN with tanh nonlinearity or GP) and the derivatives are bounded. In addition, the coefficients are continuous quantities, but we again assume they are represented with finite precision (as on a computer). Therefore, each local message requires the following number of bits: . On the other hand, if is encoded globally, we have that encodes on its whole input domain. The idea of this section is that in many practical dynamical systems, needs to be much larger than to give a good approximation to on its whole domain, which may result in (see Fig. 2).
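The local-versus-global argument can be checked numerically on a single smooth function. The sketch below is only an illustration under assumed choices: tanh as the function, the working point 0, and the third-order Taylor polynomial x - x^3/3. It compares the worst-case error close to the working point with the worst-case error on a much wider interval.

```python
import math


def taylor_tanh_order3(x: float) -> float:
    """Third-order Taylor polynomial of tanh around the working point 0."""
    return x - x**3 / 3.0


def max_abs_error(radius: float, n_points: int = 201) -> float:
    """Largest |tanh(x) - polynomial(x)| on a grid over [-radius, radius]."""
    grid = [-radius + 2.0 * radius * i / (n_points - 1) for i in range(n_points)]
    return max(abs(math.tanh(x) - taylor_tanh_order3(x)) for x in grid)


local_error = max_abs_error(0.3)   # neighbourhood of the working point
global_error = max_abs_error(2.0)  # much wider part of the domain
print(local_error, global_error)
```

The same few coefficients are accurate near the working point and poor far from it, which is why a global encoding with comparable accuracy would need more terms.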
2.1 Practical Algorithm
In this section, we present a practical algorithm to implement the above-mentioned idea of encoding (the exposition of this subsection follows [28]). Taylor expansion is used as the method for local approximation of the dynamics function as in Eq. 2. LICDS does not differentiate between whether the model is known () or is learned () and considers both as the function to be locally approximated. In this section, we simply write to refer to either one of them. The difference will, however, matter for model selection in Sec. 4.
Local time horizon. Local approximation relies on partitioning the input space of the dynamics function . Because governs a dynamical system, partitioning space amounts to partitioning space. This means we divide the global time horizon into local time horizons with length where is a hyperparameter. The detailed cost function is then written as
(3) 
for each local time horizon delimited by and and for all partitions. Finding the optimal local complexity is implemented by Alg. 2, which is used as a module of LICDS in Alg. 1. The total cost function is then written as . The optimal number of partitions is found by
(4) 
where is the maximum allowed number of partitions. The key message of this section is that the minimum value of usually occurs for , which implies that the proposed method gives a better encoding than global encoding, where . Notice that and are hyperparameters of the model, which are chosen according to our prior belief about the complexity of the dynamics function (larger values for more complicated functions). We observed that reasonably high values for these hyperparameters, e.g., and worked well for a variety of systems and benchmarks that we have considered in the paper and in the supplementary document.
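The two nested searches described in this subsection can be sketched as follows. This is a minimal illustration and not the authors' implementation: the dynamics x' = sin(x), the forward-Euler rollout, and in particular the cost (squared state error plus a weight lam times the number of Taylor terms) are assumptions standing in for Eq. 3, and all function names are hypothetical. Each local model is restarted from the reference trajectory at the beginning of its window, as described in Sec. 2.

```python
import math
import numpy as np


def f(x):
    """Hypothetical scalar dynamics x' = sin(x) (an assumption)."""
    return math.sin(x)


def taylor_model(a, order):
    """Taylor polynomial of sin around a, truncated after `order`."""
    # The k-th derivative of sin at a is sin(a + k*pi/2).
    coeffs = [math.sin(a + k * math.pi / 2) / math.factorial(k)
              for k in range(order + 1)]
    return lambda x: sum(c * (x - a) ** k for k, c in enumerate(coeffs))


def euler(fn, x0, n_steps, dt):
    """Forward-Euler rollout returning n_steps + 1 states."""
    xs = [x0]
    for _ in range(n_steps):
        xs.append(xs[-1] + dt * fn(xs[-1]))
    return np.array(xs)


def best_local_order(window, dt, lam, max_order):
    """Inner search (role of Alg. 2): cheapest order on one window."""
    best = None
    for order in range(max_order + 1):
        approx = euler(taylor_model(window[0], order), window[0],
                       len(window) - 1, dt)
        # Assumed cost: squared state error + lam * number of terms.
        cost = float(np.sum((approx - window) ** 2)) + lam * (order + 1)
        if best is None or cost < best[0]:
            best = (cost, order)
    return best


def licds_sketch(x0, n_steps, dt, lam, max_parts, max_order):
    """Outer search (role of Alg. 1): cheapest number of partitions."""
    reference = euler(f, x0, n_steps, dt)
    best = None
    for n_parts in range(1, max_parts + 1):
        edges = np.linspace(0, n_steps, n_parts + 1).astype(int)
        total, orders = 0.0, []
        for lo, hi in zip(edges[:-1], edges[1:]):
            cost, order = best_local_order(reference[lo:hi + 1], dt, lam,
                                           max_order)
            total += cost
            orders.append(order)
        if best is None or total < best[0]:
            best = (total, n_parts, orders)
    return best


cost, n_parts, orders = licds_sketch(x0=0.1, n_steps=300, dt=0.01, lam=0.05,
                                     max_parts=6, max_order=4)
print(cost, n_parts, orders)
```

In this sketch a very large weight drives the search to a single partition with a constant model, while a small weight lets accuracy dominate; the weight therefore plays the balancing role attributed to the hyperparameter in the text.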
How to choose ? The hyperparameter acts as a balancing weight between the complexity of the local Taylor approximation and the error in the prediction of states. It can also be interpreted from an information-theoretic perspective: Assume the values of the coefficients of the Taylor expansion come from a Gaussian distribution, i.e., . In the optimal coding scheme, the number of bits required to encode the coefficients equals the Shannon entropy of the normal distribution, . Thus, is log-proportional to the variance of the coefficients that is caused by the fluctuations of the dynamics function. In the current version, we manually choose such that the two terms of Eq. 3 are of the same order.
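The logarithmic dependence on the variance can be seen from the standard closed form for the differential entropy of a Gaussian, 0.5 * log2(2 * pi * e * sigma^2) bits. The sketch below evaluates it for a few assumed standard deviations; the function name is hypothetical. For coefficients quantized with step size Delta, the code length is approximately this entropy minus log2(Delta), so the quantization only shifts the values by a constant.

```python
import math


def gaussian_entropy_bits(sigma: float) -> float:
    """Differential entropy of N(mu, sigma^2) in bits."""
    return 0.5 * math.log2(2.0 * math.pi * math.e * sigma**2)


# Doubling the standard deviation adds exactly one bit per coefficient.
for sigma in (0.5, 1.0, 2.0):
    print(sigma, gaussian_entropy_bits(sigma))
```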
2.2 Theoretical Results
In this section, we will prove that it is possible to control the error introduced by local approximations. We distinguish here between two objects, the states and the dynamics , both as a function of time . We rely heavily on the identity in Eq. 1, which adds a lot of regularity to this problem. Therefore, we can derive statements of the type: if and are close in some sense, then the state trajectories and are close as well. And even better, the opposite is also true – close states imply close dynamical functions. This guarantees sufficiently accurate state prediction, while being able to reduce model complexity. Furthermore, we will elaborate later on the other direction in order to deploy LICDS as a model selection criterion.
First, we show that accurate local approximations imply precise state estimates. The proofs of this and all following theorems are given in the supplementary material.
Theorem 1.
Consider Eq. 1 with Lipschitz-continuous on . Furthermore, assume a Lipschitz-continuous approximation is used to obtain state approximations . Then, for ,
(5) 
In particular, this implies: if , then .
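The qualitative content of Theorem 1 can be illustrated numerically. The sketch below does not reproduce the bound of Eq. 5; it uses the textbook Grönwall-type estimate |x(t) - x_hat(t)| <= (eps / L) * (exp(L * t) - 1), valid when the two dynamics differ by at most eps and the Lipschitz constant is L. The dynamics x' = sin(x) (so L = 1), the constant perturbation eps, and the forward-Euler rollout are assumptions made for the example.

```python
import math


def euler_rollout(fn, x0, n_steps, dt):
    """Forward-Euler rollout returning a list of n_steps + 1 states."""
    xs = [x0]
    for _ in range(n_steps):
        xs.append(xs[-1] + dt * fn(xs[-1]))
    return xs


def max_state_gap(eps, x0=0.1, n_steps=200, dt=0.01):
    """Largest gap between rollouts of sin(x) and sin(x) + eps."""
    exact = euler_rollout(math.sin, x0, n_steps, dt)
    approx = euler_rollout(lambda x: math.sin(x) + eps, x0, n_steps, dt)
    return max(abs(a - b) for a, b in zip(exact, approx))


horizon = 200 * 0.01  # total simulated time
for eps in (1e-1, 1e-2, 1e-3):
    gap = max_state_gap(eps)
    bound = eps * (math.exp(horizon) - 1.0)  # Groenwall-type bound, L = 1
    print(eps, gap, bound)
```

As the approximation error of the dynamics shrinks, the gap between the state trajectories shrinks with it and stays below the bound.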
Next, we show the opposite direction: close state trajectories imply close dynamical systems.
Theorem 2.
3 Experiments: Encoding Dynamical Systems
In this section, we illustrate how the encoding scheme proposed in Sec. 2 looks in practice. We elaborate in detail on the algorithm with the aid of two examples. More descriptive examples are in the supplementary material. We consider: 1) the one-dimensional system ; and 2) a pendulum with two states and , where is a standard white noise process. We use the Euler–Maruyama method [33] to sample multiple trajectories from the systems and use those to learn the dynamics. In this example, we train a shallow NN as a model of the dynamics (details on the learning method are given in the supplementary material). The function is now the input to the LICDS algorithm, which computes a local approximation . Fig. 2 depicts the functionality of the proposed method and highlights the local approximations as a function of the number of partitions and the complexity order. In particular, the last column of Fig. 2 shows that LICDS prefers nontrivial solutions with , which results in the simplest model that still gives accurate state predictions. Similar experiments for a more sophisticated system (quadrotor) are presented in the supplementary material with similar conclusions.
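The Euler–Maruyama sampling step can be sketched as below. The update x_{n+1} = x_n + f(x_n) * dt + sigma * sqrt(dt) * xi_n with standard normal xi_n is the usual scheme for additive noise; the drift, the noise level, the step size, and the range of initial points are hypothetical and are not the systems used in the experiments.

```python
import numpy as np


def euler_maruyama(drift, x0, n_steps, dt, sigma, rng):
    """Sample one path of dx = drift(x) dt + sigma dW on a uniform grid."""
    xs = np.empty(n_steps + 1)
    xs[0] = x0
    for n in range(n_steps):
        # The Brownian increment over dt has standard deviation sqrt(dt).
        noise = sigma * np.sqrt(dt) * rng.standard_normal()
        xs[n + 1] = xs[n] + drift(xs[n]) * dt + noise
    return xs


def hypothetical_drift(x):
    """Stand-in drift; not one of the systems from the paper."""
    return -x + np.sin(x)


rng = np.random.default_rng(0)
initial_points = rng.uniform(-2.0, 2.0, size=10)
trajectories = np.stack([
    euler_maruyama(hypothetical_drift, x0, n_steps=99, dt=0.01,
                   sigma=0.1, rng=rng)
    for x0 in initial_points
])
print(trajectories.shape)
```

Each row is one noisy trajectory; pairs of consecutive states (or finite-difference estimates of the derivative) can then serve as training data for a learned model of the dynamics.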
4 Local Information Criterion for Model Selection
In this section, we extend our idea in order to regard LICDS as a model selection criterion. From an abstract point of view, we can motivate our approach in terms of information compression and argue that simpler functions should be preferred when they explain the data equally well. In addition to empirical findings, we support our claim with theoretical results, which give insight into the applicability of the proposed method. The schematic in Fig. 1(b) summarizes the key ideas of this section.
Again, we consider the three objects , , and , which are, respectively, the true dynamical function, the output of an arbitrary learning algorithm, and the local encoding. Assume we are given a collection of observed sequences all generated by the underlying dynamical system (1). Based on this dataset, we can deploy several different learning algorithms in order to obtain approximations . These approximations will most likely differ in their quality, which raises the question of which of the learned functions should be selected.
Frequently used methods to learn dynamical functions are, for example, NNs and GPs (cf. ‘Related Work’). However, determining the depth of the NN and finding a suitable kernel function for the GP are nontrivial tasks. Ill-considered choices can lead to overfitting and bad performance, and the resulting models should be discarded as early as possible. For example, an overparameterized NN may overfit to the training data and result in zero training error while being far from the correct dynamics function .
We propose a new way to compare learned functions, which is based on LICDS and facilitates choosing among them. In particular, we claim that, for a certain class of functions, the function with the smallest is closest to the true dynamics. We quantify this statement in the following theorem.
Theorem 3 (Model Selection).
Remark 1.
The proposed algorithm works well for certain types of systems, which we confirm with empirical results. However, the theorem does not guarantee that it works for all systems; clearly, for large and , the above theorem is not very meaningful. The condition in Eq. 26 is an interesting starting point to investigate the suitable class of functions for which the theorem yields a meaningful bound. Finally, since Theorem 1 can also be stated in with slightly different assumptions [34], it is also possible to derive a result similar to Theorem 3 purely in the norm.
The proposed method quantifies the model accuracy along a trajectory, which depends on the initial point . Since we are interested in results that are representative of the whole domain of the training data , we propose to randomize within and to average over the obtained results.
5 Experiments: Model Selection for Dynamical Systems
After discussing the capabilities of LICDS as a model selection criterion, we also present empirical results in order to provide more evidence for our claims. First, consider a dynamical system as in Eq. 1, with and additive white noise . This system is used to generate 10 noisy trajectories starting from randomly chosen initial points, and 100 data points each are sampled with the aid of the Euler–Maruyama method [33]. This results in a training set with a total size of 1000 samples. In Fig. 3, the learned functions are depicted together with the true function . Figure 3 clearly shows the connection between the score and the respective fit of the different NN architectures: the lowest score corresponds to the best fit.
Next, we consider dynamical systems of the form of Eq. 1 for some benchmark problems. Similar benchmark problems are considered, for example, in [21] and [35], which we used to shape the nonlinearities for the problems considered herein; these are summarized in Table 1. After generating noisy data based on the dynamical system, we deploy several NNs with different depths and widths and a GP to capture the behavior of the system (details of the learning procedures are given in the supplementary material). The results in terms of the score and actual distance to the underlying function (in the norm sense) are shown in Table 1. We emphasize that the best learned function achieves the lowest score.
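The distance between a learned function and the underlying function can be approximated on a grid. The sketch below assumes the L2 norm over an interval (the specific norm symbol is not legible in this copy) and uses a trapezoidal rule; the true function and the two candidate functions are hypothetical stand-ins, not the benchmark systems or the learned NN and GP models.

```python
import numpy as np


def l2_distance(f, g, low, high, n_grid=2001):
    """Grid approximation of the L2 distance between f and g on [low, high]."""
    x = np.linspace(low, high, n_grid)
    sq = (f(x) - g(x)) ** 2
    # Trapezoidal rule written out explicitly.
    integral = np.sum((sq[1:] + sq[:-1]) * np.diff(x)) / 2.0
    return float(np.sqrt(integral))


true_f = np.sin  # hypothetical ground-truth dynamics
candidates = {
    "offset": lambda x: np.sin(x) + 0.1,
    "scaled": lambda x: 0.5 * np.sin(x),
}
distances = {name: l2_distance(true_f, g, 0.0, 1.0)
             for name, g in candidates.items()}
print(distances)
```

Such a distance requires access to the true function and is therefore only available in benchmark settings; the point of the comparison in Table 1 is that the score, which needs no ground truth, ranks the candidates consistently with it.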
The proposed method does not aim at improving any of the model learning methods. Instead, we provide a structured way to postprocess learned models and select the best among several candidates. However, the presented ideas might also be incorporated into the training process.
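Model labels from the results figure and table (numerical entries not recoverable):
NN=[1] | NN=[10] | NN=[40] | GP
NN=[1] | NN=[2] | NN=[5] | GP
NN=[1] | NN=[2] | NN=[30] | GP
NN=[1] | NN=[2] | NN=[5] | GP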
6 Discussion
In this paper, we proposed LICDS as a method to efficiently encode information of a dynamical system, which is either known or learned from a sequence of observations. We built the encoding scheme on top of the minimum message length principle and developed a practical method to approximate the algorithmic complexity of dynamical systems by means of local approximations. In addition to efficient encoding, we showed through experiments and theorems that the proposed encoding criterion can be used for model selection as well. By comparing LICDS scores for different learned models (e.g., NN and GP), the model that is closer to the underlying dynamics can be selected. For future work, we aim to apply LICDS to efficient communication in networked multi-agent systems. Also, we seek to characterize more precisely the class of dynamical systems for which LICDS is effective (cf. Remark 1), and to investigate extensions to stochastic systems.
Acknowledgments
This work was supported in part by the Max Planck Society, the Cyber Valley Initiative, and the German Research Foundation (DFG) grant TR 1433/11.
References
 [1] Isaac Newton. Philosophiae naturalis principia mathematica, volume 1. G. Brookman, 1833.
 [2] Assad Alam, Bart Besselink, Valerio Turri, Jonas Martensson, and Karl H Johansson. Heavy-duty vehicle platooning for sustainable freight transportation: A cooperative method to enhance safety and efficiency. IEEE Control Systems, 35(6):34–56, 2015.
 [3] Michael Rubenstein, Alejandro Cornejo, and Radhika Nagpal. Programmable self-assembly in a thousand-robot swarm. Science, 345(6198):795–799, 2014.
 [4] Eduardo F Camacho and Carlos Bordons Alba. Model predictive control. Springer, 2013.
 [5] Vladimir Vapnik. The nature of statistical learning theory. Springer science & business media, 2013.
 [6] Ulrike Von Luxburg, Olivier Bousquet, and Bernhard Schölkopf. A compression approach to support vector model selection. Journal of Machine Learning Research, 5(Apr):293–323, 2004.
 [7] Lennart Ljung. System Identification: Theory for the User. Prentice Hall PTR, 1999.
 [8] Oliver Nelles. Nonlinear system identification. Springer, 2013.
 [9] Duy Nguyen-Tuong and Jan Peters. Model learning for robot control: a survey. Cognitive processing, 12(4):319–340, 2011.
 [10] Herbert JAF Tulleken. Grey-box modelling and identification using physical knowledge and Bayesian techniques. Automatica, 29(2):285–308, 1993.
 [11] Jonas Sjöberg, Qinghua Zhang, Lennart Ljung, Albert Benveniste, Bernard Delyon, Pierre-Yves Glorennec, Håkan Hjalmarsson, and Anatoli Juditsky. Nonlinear black-box modeling in system identification: a unified overview. Automatica, 31(12):1691–1724, 1995.
 [12] Kiyoshi Yamaoka, Terumichi Nakagawa, and Toyozo Uno. Application of Akaike’s information criterion (AIC) in the evaluation of linear pharmacokinetic equations. Journal of pharmacokinetics and biopharmaceutics, 6(2):165–175, 1978.
 [13] David B Fogel. An information criterion for optimal neural network selection. IEEE Transactions on Neural Networks, 2(5):490–497, 1991.
 [14] Arash Mehrjou, Reshad Hosseini, and Babak Nadjar Araabi. Improved Bayesian information criterion for mixture model selection. Pattern Recognition Letters, 69:22–27, 2016.
 [15] David Darmon. Information-theoretic model selection for optimal prediction of stochastic dynamical systems from data. Physical Review E, 97(3):032206, 2018.
 [16] Niall M Mangan, J Nathan Kutz, Steven L Brunton, and Joshua L Proctor. Model selection for dynamical systems via sparse regression and information criteria. Proc. R. Soc. A, 473(2204):20170009, 2017.
 [17] Wen-Xu Wang, Ying-Cheng Lai, and Celso Grebogi. Data based identification and prediction of nonlinear and complex dynamical systems. Physics Reports, 644:1–76, 2016.
 [18] Kumpati S Narendra and Kannan Parthasarathy. Identification and control of dynamical systems using neural networks. IEEE Transactions on neural networks, 1(1):4–27, 1990.
 [19] Tomohiro Takagi and Michio Sugeno. Fuzzy identification of systems and its applications to modeling and control. In Readings in Fuzzy Sets for Intelligent Systems, pages 387–403. 1993.
 [20] Roger Frigola, Fredrik Lindsten, Thomas B Schön, and Carl Edward Rasmussen. Bayesian inference and learning in Gaussian process state-space models with particle MCMC. In Advances in Neural Information Processing Systems, pages 3156–3164, 2013.
 [21] Andreas Doerr, Christian Daniel, Duy Nguyen-Tuong, Alonso Marco, Stefan Schaal, Marc Toussaint, and Sebastian Trimpe. Optimizing long-term predictions for model-based policy search. In Proceedings of Machine Learning Research, volume 78, pages 227–238, November 2017.
 [22] Stefanos Eleftheriadis, Tom Nicholson, Marc Deisenroth, and James Hensman. Identification of Gaussian process state space models. In Advances in Neural Information Processing Systems, pages 5315–5325, 2017.
 [23] Peter E Wellstead. Nonparametric methods of system identification. Automatica, 17(1):55–69, 1981.
 [24] Christopher G Atkeson, Andrew W Moore, and Stefan Schaal. Locally weighted learning for control. In Lazy learning, pages 75–113. Springer, 1997.
 [25] Oliver Nelles and Rolf Isermann. Basis function networks for interpolation of local linear models. In Decision and Control, 1996., Proceedings of the 35th IEEE Conference on, volume 1, pages 470–475, 1996.
 [26] Jo-Anne Ting, Franziska Meier, Sethu Vijayakumar, and Stefan Schaal. Locally Weighted Regression for Control, pages 1–14. Springer US, Boston, MA, 2016.
 [27] S. Schaal and C. Atkeson. Learning control in robotics. IEEE Robotics Automation Magazine, 17(2):20–29, June 2010.
 [28] Friedrich Solowjow, Arash Mehrjou, Bernhard Schölkopf, and Sebastian Trimpe. Minimum information exchange in distributed systems. arXiv preprint arXiv:1805.09714, 2018.
 [29] Christopher S Wallace. Statistical and inductive inference by minimum message length. Springer Science & Business Media, 2005.
 [30] Alan Mathison Turing. On computable numbers, with an application to the entscheidungsproblem. Proceedings of the London mathematical society, 2(1):230–265, 1937.
 [31] Chris S. Wallace and David L. Dowe. Minimum message length and kolmogorov complexity. The Computer Journal, 42(4):270–283, 1999.
 [32] Larry C Andrews. Special functions of mathematics for engineers. McGraw-Hill New York, 1992.
 [33] Zeev Schuss. Stochastic differential equations. Wiley Online Library, 1988.
 [34] Gabriel Acosta and Ricardo G Durán. An optimal Poincaré inequality in L^1 for convex domains. Proceedings of the American Mathematical Society, pages 195–202, 2004.
 [35] Andreas Kroll and Horst Schulte. Benchmark problems for nonlinear system identification and control using soft computing methods: need and overview. Applied Soft Computing, 25:496–513, 2014.
 [36] Claude Elwood Shannon. A mathematical theory of communication. Bell system technical journal, 27(3):379–423, 1948.
 [37] Andrei N Kolmogorov. Three approaches to the quantitative definition of ‘information’. Problems of information transmission, 1(1):1–7, 1965.
 [38] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. Tensorflow: A system for large-scale machine learning. In OSDI, volume 16, pages 265–283, 2016.
 [39] Gabriel Hoffmann, Haomiao Huang, Steven Waslander, and Claire Tomlin. Quadrotor helicopter flight dynamics and control: Theory and experiment. In AIAA Guidance, Navigation and Control Conference and Exhibit, page 6461, 2007.
Appendices
Appendix A Algorithmic Complexity and Universal Turing Machine
The concepts of algorithmic complexity and the universal Turing machine were briefly introduced in the main paper and are presented in more detail here. Using the Alice-Bob scenario, we briefly review some necessary terms. The message is a sequence of symbols chosen from an alphabet. Each symbol is encoded by a binary subsequence called a word, and the codes for all symbols, pieced together, form the entire message. Shannon’s information theory considers the message as a sequence of outcomes of a random process [36]. Assume the message draws its words from a set $\{w_i\}$ and let the probability of word $w_i$ be $p_i$ for all $i$. The goal is to find a code that maps each word $w_i$ to a binary string of length $\ell_i$ such that the expected length of the string, which is defined as
(9) $\mathbb{E}[\ell] = \sum_i p_i \ell_i$
is minimized. It can be proved that the optimal code in this sense is obtained if $\ell_i = -\log p_i$; the resulting expected length is known as Shannon’s entropy. We take the base of the logarithm to be 2 throughout this paper. The code length $\ell_i = -\log p_i$ of a word $w_i$ can thus be taken as a measure of its information content. Nonetheless, the major limitation of Shannon’s approach to information is its explicit dependence on a probabilistic source of the message. Algorithmic complexity (AC) is a different approach that removes this assumption and gives a more general notion of information. To present the core idea of AC, some preliminary definitions are required, which are outlined in the following.
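Before turning to those definitions, the Shannon code lengths above can be illustrated with a small numerical example. This is a minimal sketch: the four word probabilities are hypothetical and chosen as powers of two so that the ideal lengths are whole numbers of bits.

```python
import math


def shannon_code_lengths(probs):
    """Ideal code length -log2(p) in bits for each word probability."""
    return [-math.log2(p) for p in probs]


def expected_length(probs, lengths):
    """Expected code length: sum of probability times length over all words."""
    return sum(p * l for p, l in zip(probs, lengths))


# Hypothetical word probabilities (assumption, not from the paper).
probs = [0.5, 0.25, 0.125, 0.125]

lengths = shannon_code_lengths(probs)
entropy = expected_length(probs, lengths)

# A fixed-length code for four words needs 2 bits per word.
fixed = expected_length(probs, [2.0] * len(probs))

print("ideal lengths:", lengths)
print("expected length with ideal code:", entropy)
print("expected length with fixed 2-bit code:", fixed)
```

With these probabilities the ideal code spends fewer bits on frequent words, so its expected length is below that of the fixed-length code.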
Universal Turing Machine— A Turing machine (TM) is a machine with

A clock that synchronizes all activities of the machine.

A finite set of internal states indexed by . The machine may change its state at the clock tick.

A binary work tape which can be moved to the right or left and be updated by the machine.

A oneway binary input tape which forms the input to the machine. The input tape cannot be moved backward.

A oneway binary output tape that carries the machine’s output.

An instruction list that determines the action of the machine at each clock tick depending on the current value of the input tape, work tape and the internal state of the machine. The action may include moving the input tape, updating and moving the work tape, updating and moving the output tape, or moving to a new internal state.
Given a binary string $S$ representing some data or information, the amount of information in $S$, also known as its algorithmic complexity (AC), with respect to a particular Turing machine (TM) is the length of the shortest input tape $I$ that causes the TM to output $S$, i.e., $\mathrm{AC}(S) = |I|$. It is obvious from this definition that the information content of a message depends on the chosen TM. The concept of a universal Turing machine (UTM) helps here. Apart from its detailed definition, which can be looked up in [37], a UTM has the interesting property of being programmable: the input tape may consist of two concatenated parts, $I = I_0 I_1$, such that $I_0$ pushes the initial Turing machine TM_{0} into a state from which the UTM behaves as another Turing machine TM_{1}. The second part, $I_1$, is then decoded by TM_{1} rather than TM_{0}. This capacity of the UTM enables us to achieve a universal measure of complexity or information content. In the next section we discuss how the information content of dynamical systems can be described in the framework of a UTM.
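The programmable, two-part input tape can be illustrated with a toy analogy. The sketch below is not a Turing machine. The decoder names (`decode_raw`, `decode_run_length`), the one-bit header, and the run-length format are hypothetical choices, made only to show how a first part of the input selects the machine that decodes the second part.

```python
def decode_raw(payload):
    """Candidate second machine: copy the payload to the output unchanged."""
    return payload


def decode_run_length(payload):
    """Candidate second machine: payload is a sequence of 5-bit groups,
    each one symbol bit followed by a 4-bit repeat count."""
    out = []
    for i in range(0, len(payload), 5):
        symbol = payload[i]
        count = int(payload[i + 1:i + 5], 2)
        out.append(symbol * count)
    return "".join(out)


# The first part of the tape (here a single header bit) selects the decoder.
DECODERS = {"0": decode_raw, "1": decode_run_length}


def universal_decode(tape):
    """Split the tape into header and payload, then run the selected decoder."""
    header, payload = tape[0], tape[1:]
    return DECODERS[header](payload)


target = "1" * 12 + "0" * 8

tape_raw = "0" + target                  # header 0, then the string itself
tape_rle = "1" + "1" + "1100" + "0" + "1000"  # header 1, then (1 x 12), (0 x 8)

print(len(tape_raw), universal_decode(tape_raw) == target)
print(len(tape_rle), universal_decode(tape_rle) == target)
```

Both tapes reproduce the same string, but the second is shorter. The length of the shorter tape is only an upper bound on the description length of `target` with respect to this toy decoder.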
Appendix B Proofs of Theoretical Results
We provide here the proofs to our theoretical results:
Theorem 4.
Consider Eq. (1) with Lipschitz-continuous on . Furthermore, assume a Lipschitz-continuous approximation is used to obtain state approximations . Then, for ,
(10) 
In particular, this implies: if , then .
Proof of Theorem 1.
We start the proof by noting that the considered ODE has a well-defined solution, which follows from the Picard–Lindelöf theorem.
Next, we show how to bound a function by its derivative, as is frequently done in Poincaré inequalities. Depending on the given assumptions, these results all look slightly different. Here, we use , and prove .
We start with the fundamental theorem of calculus and obtain
(11) 
Hence, we obtain for the absolute value
(12) 
Now we insert a multiplicative factor of one and apply the Cauchy-Schwarz inequality
(13) 
Since and everything is nonnegative, we obtain
(14) 
Taking the square and integrating does not change the inequality, since the right hand side is not dependent on anymore. This yields the final result
(15) 
Now we substitute and obtain
(16) 
∎
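A bound of this type can be sanity-checked numerically. The sketch below assumes a differentiable function u on [0, T] with u(0) = 0 and checks that the integral of u² is at most T² times the integral of (u')². Under those assumptions this is what the chain of steps (fundamental theorem of calculus, Cauchy-Schwarz, squaring, integrating) yields. The test function u(t) = sin(t), the horizon T = 2, and the midpoint rule are illustrative choices, and the constant in Eq. (15) may differ.

```python
import math


def midpoint_integral(g, a, b, n=10_000):
    """Approximate the integral of g over [a, b] with the midpoint rule."""
    h = (b - a) / n
    return h * sum(g(a + (i + 0.5) * h) for i in range(n))


T = 2.0          # illustrative horizon (assumption)
u = math.sin     # illustrative function with u(0) = 0 (assumption)
du = math.cos    # its derivative

lhs = midpoint_integral(lambda t: u(t) ** 2, 0.0, T)
rhs = T ** 2 * midpoint_integral(lambda t: du(t) ** 2, 0.0, T)

print("integral of u^2        :", lhs)
print("T^2 * integral of u'^2 :", rhs)
print("bound holds            :", lhs <= rhs)
```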
Lemma 1.
Assume and on the domain . We can show
(17) 
Proof.
Since we consider a bounded domain and the derivative is bounded, we conclude that is bounded as well. We prove the statement by considering the worst-case scenario, which is a triangle in this case. What can essentially happen is that the support of the function shrinks while the maximum remains constant. However, by bounding the derivative we control the growth of the area beneath the function. Therefore, the extreme case is a triangle with the maximal slope and peak point . This yields
(18) 
and
(19) 
Hence for we obtain our claim. ∎
Lemma 2.
Assume the function is monotonically increasing in . Then the variation of is given by
(20) 
Proof.
The proof is straightforward and follows immediately from monotonicity and a telescoping-sum argument. ∎
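The telescoping argument can be checked numerically: for a monotonically increasing function, the sum of absolute increments over any grid collapses to the difference of the endpoint values. The function `f`, the interval, and the two grids below are illustrative choices, not taken from the paper.

```python
import math
import random


def variation_on_grid(f, grid):
    """Sum of absolute increments of f over consecutive grid points."""
    return sum(abs(f(right) - f(left)) for left, right in zip(grid, grid[1:]))


f = math.atan        # monotonically increasing (illustrative choice)
a, b = -1.0, 3.0     # illustrative interval

# An equidistant grid and a random, non-equidistant grid on [a, b].
uniform_grid = [a + (b - a) * i / 50 for i in range(51)]
random.seed(0)
random_grid = [a] + sorted(random.uniform(a, b) for _ in range(20)) + [b]

v_uniform = variation_on_grid(f, uniform_grid)
v_random = variation_on_grid(f, random_grid)

print("uniform grid :", v_uniform)
print("random grid  :", v_random)
print("f(b) - f(a)  :", f(b) - f(a))
```

Because every increment is nonnegative, the absolute values can be dropped and the inner terms cancel pairwise, so both grids give the same value up to rounding.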
Theorem 5.
Let the assumptions of Theorem 4 hold.
Additionally, assume
and .
Then, we have
(21) 
This implies: if then .
Proof of Theorem 2.
We use bounded variation type arguments here. In particular, we start again with
(22) 
It is a well-known fact in analysis that the quantity can be used to compute the total variation of a smooth function. We will use an equivalent approach to quantify the total variation and use this to bound the derivative with the states. We use
(23) 
where we take the supremum over all possible grids, which are not necessarily equidistant. Hence, is the number of grid points, which can in general go to infinity. This is even possible for functions with bounded variation, as long as the function value decays fast enough. The assumption ensures a finite number of oscillations on a bounded domain, which combined with Lemma 2 yields that . Hence, there exists an optimal grid with a finite number of points and we can use the bound
(24) 
We can split this apart with the triangle inequality and make the quantity even bigger by dropping the grid. Hence, we obtain
(25) 
With the aid of Lemma 1 and the same argument as at the end of the proof of Theorem 1, we conclude this proof. ∎
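The grid-based definition of total variation used in this proof can be explored numerically. The sketch below uses sin on [0, 2π] as an illustrative function with finitely many oscillations. It shows that grid sums grow as the grid is refined and that a grid containing the turning points already attains the total variation, by applying the monotone case on each piece. The function and grids are assumptions made for illustration.

```python
import math


def variation_on_grid(f, grid):
    """Sum of absolute increments of f over consecutive grid points."""
    return sum(abs(f(right) - f(left)) for left, right in zip(grid, grid[1:]))


def uniform_grid(a, b, n):
    """Equidistant grid with n cells on [a, b]."""
    return [a + (b - a) * i / n for i in range(n + 1)]


f = math.sin                  # illustrative function (assumption)
a, b = 0.0, 2.0 * math.pi

coarse = variation_on_grid(f, uniform_grid(a, b, 3))
fine = variation_on_grid(f, uniform_grid(a, b, 1000))

# Grid made of the endpoints and the two turning points of sin on [0, 2*pi].
extrema_grid = [a, math.pi / 2, 3 * math.pi / 2, b]
exact = variation_on_grid(f, extrema_grid)

print("coarse grid  :", coarse)
print("fine grid    :", fine)
print("extrema grid :", exact)
```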
Theorem 6 (Model Selection).
If the previous assumptions hold, and
(26) 
then
(27) 
where are constants, which depend on certain properties of the dynamical systems.
Proof of Theorem 3.
We consider three objects in this proof: the true dynamical function , an approximation , which is most likely obtained from a learning algorithm, and the local approximation , obtained through a local expansion, e.g. Taylor.
We start by inserting and obtain with the triangle inequality
(28) 
Now, we use Theorem 2 and obtain
(29) 
The assumption expands to
(30) 
Hence, for we obtain
(31) 
Now we apply the Cauchy-Schwarz inequality to transform the norm into the norm and obtain
(32) 
With the aid of Theorem 1 it follows that
(33) 
With the aid of the triangle inequality we can again show
(34) 
∎
Appendix C Learning dynamical systems
A more detailed description of the method we used to learn the dynamics function from observational data is presented here. We use a simple black-box approach to learn the dynamical system from a set of trajectories.
Learning by Neural network— Assume we are given a collection of sequences of observations . Each sequence covers a trajectory in the state space starting from some starting point . We use each sequence as a minibatch of observations and train by the following simple relationship between its input/output pairs:
(35)  
(36)  
(37) 
The reason for using a collection instead of a single sequence is clear. A single sequence starting from an initial point is unlikely to be representative enough so that is learned as a good approximation to . Once is learned, we can use automatic differentiation to compute its derivative w.r.t. the input [38].
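The data pipeline described above can be sketched as follows. This is a minimal sketch under assumptions not taken from the paper: a scalar system dx/dt = -x stands in for the true dynamics, trajectories are produced with forward Euler steps, finite differences serve as regression targets, and a one-parameter linear model fitted by least squares replaces the neural network so that the example needs no external libraries.

```python
import random


def true_f(x):
    """Hypothetical ground-truth dynamics dx/dt = -x (assumption)."""
    return -1.0 * x


def simulate(x0, dt, steps):
    """Generate one trajectory with forward Euler steps from x0."""
    xs = [x0]
    for _ in range(steps):
        xs.append(xs[-1] + dt * true_f(xs[-1]))
    return xs


def training_pairs(trajectory, dt):
    """Input/target pairs: state and finite-difference estimate of its derivative."""
    return [(x, (x_next - x) / dt)
            for x, x_next in zip(trajectory, trajectory[1:])]


def fit_linear(pairs):
    """Least-squares slope of a model y = theta * x through the origin."""
    num = sum(x * y for x, y in pairs)
    den = sum(x * x for x, _ in pairs)
    return num / den


random.seed(0)
dt = 0.01

# A collection of trajectories from different starting points, not just one.
trajectories = [simulate(random.uniform(-2.0, 2.0), dt, 200) for _ in range(10)]
pairs = [p for traj in trajectories for p in training_pairs(traj, dt)]

theta = fit_linear(pairs)


def learned_f(x):
    """Learned stand-in for the dynamics function."""
    return theta * x


print("learned slope:", theta)
```

For this linear stand-in, the derivative with respect to the input is simply `theta`. With a neural network it would instead be obtained by automatic differentiation, as stated above.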
Learning Gaussian Process— The discretization of the nonlinear dynamics function is done just like above. We used a vanilla GP without any sparse approximations and a squared exponential kernel function.
Appendix D More experiments
Some parts of the experiments section have been moved here from the main text. This appendix includes more sophisticated experiments with higher-dimensional and physical dynamical systems.
D.1 Enlarged version of the illustrative example
As an illustrative example, let’s assume the dynamical system with depicted in Fig. 4(a). This dynamical system is stable and the evolution of its state is depicted in Fig. 4. If the initial point resides in the positive region of the state space, it never leaves the nonnegative side of the state space. Hence, encoding for the negative domain is not necessary and we can safely only encode in its positive domain. This can be formalized in terms of algorithmic complexity as
(38) 
meaning that knowing allows us to design a better code for .
D.2 Illustrative example:
D.3 Local approximations to the learned function:
In this section, the system is used to generate samples based on the method explained for the experiment in the main text. The dynamics function is then learned and depicted in Fig. 6(a). Once the function is learned, multiple local Taylor approximations are computed and shown in Fig. 6(b-e), corresponding to different numbers of partitions. Figure 6(f) shows that the score is optimal for a nontrivial number of partitions.
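The idea of replacing a learned function by local Taylor approximations on a partition of the domain can be sketched as follows. This is a minimal sketch under assumptions: sin on [0, π] stands in for the learned function, its derivative is supplied analytically rather than by automatic differentiation, the partition is equidistant, and each cell uses a first-order expansion at its center. None of these choices is taken from the paper.

```python
import math


def local_taylor_models(f, df, a, b, k):
    """First-order Taylor coefficients at the center of each of k equal cells."""
    width = (b - a) / k
    centers = [a + (i + 0.5) * width for i in range(k)]
    return [(c, f(c), df(c)) for c in centers]


def evaluate(models, a, b, x):
    """Evaluate the piecewise-linear approximation at x in [a, b]."""
    k = len(models)
    width = (b - a) / k
    i = min(int((x - a) / width), k - 1)  # clamp so x == b falls in the last cell
    c, fc, dfc = models[i]
    return fc + dfc * (x - c)


def max_error(f, models, a, b, n=2000):
    """Largest absolute deviation from f on an evaluation grid of n + 1 points."""
    xs = [a + (b - a) * j / n for j in range(n + 1)]
    return max(abs(f(x) - evaluate(models, a, b, x)) for x in xs)


f, df = math.sin, math.cos   # illustrative stand-in for the learned function
a, b = 0.0, math.pi

errors = {k: max_error(f, local_taylor_models(f, df, a, b, k), a, b)
          for k in (1, 2, 4, 8)}

for k, err in errors.items():
    print(f"{k} partitions: max error {err:.4f}, parameters to encode {2 * k}")
```

More partitions lower the approximation error but increase the number of local parameters that must be encoded, which is the trade-off between model complexity and accuracy that the score is meant to balance.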
D.4 Pendulum dynamics:
Part of this experiment is in the main text; this is the complete version. We consider a realistic physical system, a pendulum with two-dimensional dynamics and . The result for how is learned from and how approximates is depicted in Fig. 7. The corresponding result in the state space is shown in Fig. 8, where is the angular position and the angular velocity of the pendulum. Again, LICDS finds a good trade-off between model complexity and prediction accuracy.
D.5 Quadrotor dynamics:
As a compression method, LICDS can be beneficial in any setting where state information must be transmitted to a distant node. A typical example is an unmanned aerial vehicle (UAV), whose states are measured continuously onboard but transmitted only occasionally to the ground station. Continuous transmission of data is expensive in terms of battery power and bandwidth. To show the performance of LICDS on more complex dynamics, we test it on the deterministic dynamical system of a quadrotor UAV [39].
In the following state-space equation, is roll (rotation around the x axis) and is pitch (rotation around the y axis) in the earth frame. The values of the angles can violate the bounds, but their interpretation is circular. The vector contains the linear and angular velocities in the body frame. The functions , , and are sine, cosine, and tangent, respectively. The vector contains the wind forces and time-varying disturbance, wind torques, and