# Model Reduction with Memory and the Machine Learning of Dynamical Systems

###### Abstract

The well-known Mori-Zwanzig theory tells us that model reduction leads to memory effect. For a long time, modeling the memory effect accurately and efficiently has been an important but nearly impossible task in developing a good reduced model. In this work, we explore a natural analogy between recurrent neural networks and the Mori-Zwanzig formalism to establish a systematic approach for developing reduced models with memory. Two training models-a direct training model and a dynamically coupled training model-are proposed and compared. We apply these methods to the Kuramoto-Sivashinsky equation and the Navier-Stokes equation. Numerical experiments show that the proposed method can produce reduced model with good performance on both short-term prediction and long-term statistical properties.

In science and engineering, many high-dimensional dynamical systems are too complicated to solve in detail. Nor is it necessary since usually we are only interested in a small subset of the variables representing the gross behavior of the system. Therefore, it is useful to develop reduced models which can approximate the variables of interest without solving the full system. This is the celebrated model reduction problem. Even though model reduction has been widely explored in many fields, to this day there is still a lack of systematic and reliable methodologies for model reduction. One has to rely on uncontrolled approximations in order to move things forward.

On the other hand, there is in principle a rather solid starting point, the Mori-Zwanzig (M-Z) theory, for performing model reduction [1], [2]. In M-Z, the effect of unresolved variables on resolved ones is represented as a memory and a noise term, giving rise to the so-called generalized Langevin equation (GLE). Solving the GLE accurately is almost equivalent to solving the full system, because the memory kernel and noise terms contain the full information for the unresolved variables. This means that the M-Z theory does not directly lead to a reduction of complexity or the computational cost. However, it does provide a starting point for making approximations. In this regard, we mention in particular the -model proposed by Chorin et al [3]. In [4] reduced models of the viscous Burgers equation and -dimensional Navier-Stokes equation were developed by analytically approximating the memory kernel in the GLE using the trapezoidal integration scheme. Li and E [5] developed approximate boundary conditions for molecular dynamics using linear approximation of the M-Z formalism. In [6], auxiliary variables are used to deal with the non-Markovian dynamics of the GLE. Despite all of these efforts, it is fair to say that there is still a lack of systematic and reliable procedure for approximating the GLE. In fact, dealing with the memory terms explicitly does not seem to be a promising approach for deriving systematic and reliable approximations to the GLE.

One of the most successful approaches for representing memory effects has been the recurrent neural networks (RNN) in machine learning. Indeed there is a natural analogy between RNN and M-Z. The hidden states in RNN can be viewed as a reduced representation of the unresolved variables in M-Z. We can then view RNN as a way of performing dimension reduction in the space of the unresolved variables. In this paper, we explore the possibility of performing model reduction using RNNs. We will limit ourselves to the situation when the original model is in the form of a conservative partial differential equation (PDE), the reduced model is an averaged version of the original PDE. The crux of the matter is then the accurate representation of the unresolved flux term.

We propose two kinds of models. In the first kind, the unresolved flux terms in the equation are learned from data. This flux model is then used in the averaged equation to form the reduced model. We call this the direct training model. A second approach, which we call the coupled training model, is to train the neural network together with the averaged equation. From the viewpoint of machine learning, the objective in the direct training model is to fit the unresolved flux. The objective in the coupled training model is to fit the resolved variables (the averaged quantities).

For application, we focus on the Kuramoto-Sivashinsky (K-S) equation and the Navier-Stokes (N-S) equation. The K-S equation writes as

(1) | |||

(2) |

We are interested in a low-pass filtered solution of the K-S equation, , and want to develop a reduced system for . In general, can be written as the convolution of with a low pass filter :

(3) |

Hence, by performing filtering on (1), we get

(4) |

where

(5) |

is the sub-grid stress. We refer to this as the macro-scale equation, and the original K-S equation as the micro-scale equation. We refer to as the macro-scale solution. To develop the reduced system for , we need to model the sub-grid stress in (4). Later we will see that, from the M-Z theory, this stress can be approximated by a memory term of .

We also apply the same principle to the -D shear flow, whose governing equation is the following N-S equation

(6) |

where , . In this problem, the macro-scale solution and the sub-grid stress are defined similarly as for the K-S equation. We model the sub-grid stress using the history of the solution. Numerical experiments show that our models give good performance on both short-term and long-term predictions.

Machine learning tools, especially neural networks, have received much attention in recent years. Not surprisingly, they have also been used in model reduction such as large-eddy simulation [7] [8] and Reynolds averaged turbulence modeling [9]. In these papers, fully-connected neural network models are used to represent the appropriate stress terms. Other machine learnig tools are also used in model reduction. For example, in [10], a random forest based method is employed to improve the Reynolds-averaged Navier-Stokes model. In [11], the same problem is studied from a Bayesian viewpoint. A time series method is used to construct reduced system for the K-S equation in [12]. The difference between our work and these previous papers is that we attempt to develop a systematic approach starting from M-Z, and we explore the natural connection between M-Z and RNNs. In later work, we will further study the mathematical and algorithmic problems along this line.

The paper is organized as follows. Section 1 briefly introduces the idea of M-Z. Section 2 introduces RNNs and LSTMs. In Section 3, we describe the two training models. Numerical results on the Kuramoto-Sivashinsky equation and -D shear flow problem are present in Section 4. Concluding remarks and comments on future work are presented in Section 5.

## 1 The Mori-Zwanzig formalism

The starting point of the M-Z formalism is to divide all the variables into resolved and unresolved ones. The basic idea is to project functions of all variables into the space of functions of only the resolved variables. The original dynamical system then becomes an equation for the resolved variables with memory and noise. This equation is called the Generalized Langevin Equation (GLE). Below we first demonstrate the idea of the M-Z formalism using a simple linear example. We then use the Kuramoto-Sivashinsky equation as an example to show how M-Z formalism help us develop a reduced model.

### 1.1 A brief review to the M-Z formalism

Consider a linear system

(7) | |||

(8) |

with initial value , . We take as the resolved variable, as the unresolved variable, and want to develop a reduced system for . To achieve this, we can first fix in (8) and solve for , and then insert the result into (7). In this way we get

(9) |

There are three terms at the right hand side of (9). The first term is a Markovian term of ; the second term is a memory term. The third term contains the initial value of , this term will be viewed as a noise term. Hence, by eliminating from the original linear system, we obtain an equation for with memory and noise.

Similar deduction can be applied to non-linear ODE systems. Assume we have a system

(10) |

with and being vectors, and we split into and take as resolved variables, then by the Mori-Zwanzig formalism we can get a GLE for ,

(11) |

In (11), denotes at time with initial value . Although (11) is more complicated than (9), the essential ingredients are similar. The M-Z formalism tells us that model reduction leads to memory effects, and inspired us to pursuing the application of RNNs as a tool for performing efficient yet rigorous model reduction.

### 1.2 Application to the K-S equation

Going back to the K-S equation (1), let and be Fourier transforms of solution and filter , respectively. Then, by (3), we have for all frequency . Here, we take a spectrally sharp filter which satisfies

for a certain positive integer . Then, the Fourier transform of is a truncation of the Fourier transform of , and our resolved variables are those with . Writing the K-S equation in Fourier space, and put all terms with unresolved variables to the right hand side of the equation, we can get

(12) |

Applying Fourier transform to (4), and compare with (12), we have

(13) |

for . Using the M-Z theory, the sub-grid stress can be expressed as a memory term and a noise term. By Galilean invariance, the sub-grid stress can be expressed as a function of the history of the resolved strain, , when the noise effect is negligible.

## 2 Recurrent Neural Networks and LSTM

In this section, we briefly introduce recurrent neural networks and long short-term memory networks.

### 2.1 Recurrent neural networks

RNNs are neural networks with recurrent connections, designed to deal with time series. Usually, a recurrent connection links some units in a neural network to themselves. This connection means that the values of these units depend on their values at the previous time step. For a simple example, consider a one-layer RNN with time series as input. Let and be the hidden values and output values at time step , respectively. Then, the simplest RNN model is given by

(14) | |||||

where , , , , and are matrices or vectors of trainable parameters, and is a nonlinear activation function. To deal with complicated time series, sometimes the output of an RNN is fed to another RNN to form a multi-layer RNN.

Note that in RNNs the same group of parameters are shared by all time steps. In this sense, RNNs are stationary models. Among other things, this allows RNNs to learn information in long time series without inducing too many parameters. The training of RNNs is similar to the training of regular feed-forward neural networks, except that we have to consider gradient through time direction, such as the gradient of with respect to for . Usually RNNs are trained by the so-called back-propagation through time (BPTT) method.

### 2.2 Long short-term memory networks

Theoretically, RNNs is capable of learning long-term memory effects in the time series. However, in practice it is hard for RNN to catch such dependencies, because of the exploding or shrinking gradient effects [13], [14]. The Long Short-Term Memory (LSTM) network is designed to solve this problem. Proposed by Hochreiter et al. [15], the LSTM introduces a new group of hidden units called states, and uses gates to control the information flow through the states. Since the updating rule of the states is incremental instead of compositional as in RNN, the gradients are less likely to explode or shrink. The computational rule of an LSTM cell is

where , , are called the forget gate, input gate, and output gate, respectively, is the state, and is the hidden value. The LSTMs can also be trained by BPTT.

## 3 The Two Training Models

For simplicity, we will focus on learning the memory-dependent terms in the GLE, neglecting for now the noise term. For the K-S equation, as shown in Figure 3, by directly fitting the stress using the history of the strain, we can reduce the error to nearly while maintaining a small gap between the testing and training error. This shows that we are not overfitting, and the history of the strain determines most of the stress.

We will discuss two models for learning the memory effect: a direct training model and a coupled training model. For both models, we generate data using direct numerical simulation (DNS). In the direct training model, after generating the data, we directly train the RNN to fit the sub-grid stress using the time history of the strain. The loss function is defined as the difference between the output of the RNN and the true stress. The neural network model for the stress is then used in the macro-scale equation for the reduced system. In the coupled training model, the loss function is defined as the difference between the solutions of the reduced model (with stress represented by the neural network) and the ground truth solution. Therefore in the coupled model, the RNN is coupled with the solver of the macro-scale equation when training.

### 3.1 The direct training model

In the direct training model, a RNN is used to represent the stress as a function of the time history of the macrocell strain. The loss function is simply the squared difference of the predicted stress and the ground truth stress. This RNN model is then used in the reduced system. Figure 1 shows the flow chart of the direct training model.

### 3.2 The coupled training model

In the direct training model, we train an accurate stress model and we hope that this stress model can help produce accurate results for the macro-scale solution. In the coupled model, getting accurate macro-scale solutions is ensured by defining the loss function to be the difference between the solution of the reduced system and the ground truth, with the stress in the reduced system represented by a RNN. Specifically, in every epoch, we solve the reduced system for steps from some initial condition and get a prediction of the solution steps later (). We use to denote the predicted solution. The loss function is defined to be some function of . From another perspective, we can also view this coupled system as a single large RNN, part of the hidden units of this RNN is updated according to a macro-scale solver. Let be the predicted solution at the -th step, and , , be the state of RNN, stress, and strain at the step, respectively. Then the update rule of this large RNN can be written as

(15) | |||

(16) |

Figure 2 shows part of the computational graph of the coupled training model.

#### Computation of the gradient

In the coupled training model, the trainable parameters are still in the RNN, but to compute the gradient of the loss with respect to the parameters, we have to do back-propagation (BP) through the solver of the macro-scale equation, i.e. through (16). In many applications, this solver is complicated and it is hard to do BP through the solver. To deal with this problem, we take one step back to perform BP directly through the differential equations. This is done by writing down the differential equation satisfied by the gradient, which we call the backward equation. Another solver is used to solve this backward equation. In this way, BP is done by solving the backward equation, and is decoupled from the macro-scale solver.

As an illustration, let be the solution of the K-S equation at time in the Fourier space, and assume that we want to perform back-propagation through the K-S equation from time back to , which means we are going to compute the Jacobian . Let

(17) |

then we have and we want to compute . In the reduced system, we solve the K-S equation with stress,

(18) |

where is the sub-grid stress in Fourier space, means convolution, and represents a diagonal matrix with vector being the diagonal entries. Taking derivative of with respect to , and assuming that , we obtain

(19) |

Hence, as long as we know the solution for , we can compute the Jacobian by solving the equation (19) from to , with initial condition .

## 4 Numerical Experiments

We now present some numerical results of the proposed models for the K-S equation and the 2-dimensional shear flow problem. Below when we talk about true solution or ground truth, we mean the exact solution after filtering.

### 4.1 The Kuramoto-Sivashinsky equation

#### Experiment setting

The K-S equation is considered in the Fourier space. To generate the training data, we solve the K-S equation with Fourier modes to approximate the accurate solution, using a order integral factor Runge-Kutta method [16]. We set and the time step . The settings are similar to that in [12]. The micro-scale equation is solved for time units and filterd outputs are saved for every time units. We drop the results of the first time units, and take the results from the following time units as the training data, and the results of the last time units as the testing data.

For the reduced system, we solve (4) in Fourier space. We take Fourier modes and the time step . The macro-scale solver is still a order integral factor Runge-Kutta method. The solver works in the Fourier space, and takes the output of the RNN as the sub-grid stress.

On the machine learning side, we use an LSTM to predict the stress. The LSTM has two layers and hidden units in each layer. An output layer with linear activation is applied to ensure that the dimension of the outputs is . The LSTM works in the physical space: it takes strains in the physical space as inputs, and outputs predicted stresses in the physical space. A Fourier transform is applied to the output of the LSTM before feeding it into the macro solver.

#### Direct training

For the direct training model, we train the LSTM to fit the stress as a (memory-dependent) function of the strains for the last time steps. The network is trained by the Adam algorithm [17] for iterations, with batch size being . Figure 3 shows the relative training and testing error during the training process. We can see that the two curves go down simultaneously, finally reaching a relative error of about . There is almost no gap between the training and the testing error, hence little overfitting. This also suggests that most contribution to the sub-grid stress can be explained by the time history of the strain.

First we consider some a priori results. Figure 4 presents the relative error and the correlation coefficients between the predicted stress and the true stress at different locations in the physical space. We can see that our prediction of the stress has relative errors of about and correlation coefficients very close to .

We next examine some a posteriori results. After training the model, we pick some times in the testing set, and solve the reduced system initialized from the true solution at these times. Then, we compare the reduced solution with the true solution at later times. Figure 5 shows two examples. In these figures, the -axis measures the number of time units from the initial solution. From these figures we can see that the reduced solution given by the direct training model produces satisfactory prediction for - time units (- time steps of macro-scale solver).

As for the eventual deviation between the two solutions, it is not clear at this point what is more responsible, the intrinsic unstable behavior in the model or the model error. We will conduct careful studies to resolve this issue in future work.

Next we compare the long-term statistical properties of the solution to the reduced model with the true solution. We solve the reduced system for a long time ( time units in our case). We then compute the distributions and the autocorrelation curves of Fourier modes of the reduced solution, and compare with that of the true solution. Figure 6 shows some results. From these figures we see that the reduced model can satisfactorily recover statistical properties of the true solution.

#### Coupled training

For the coupled training model, we train the LSTM together with the equation solver. A detailed description of this training model is given in Section 3.2. Here we choose during training. The size of the LSTM is the same as that in the direct training model. A simple one-step forward scheme is used to solve the backward equation for the Jacobian (19). iterations were performed using an Adam optimizer with batch size . During the training process, we measured the relative error of predicted solutions steps later from the initial true solution. The results for different ’s are given in Figure 7. From the figure we can see that the relative error for different goes down to after training, and there is no gap between the different curves, which means that solutions with different steps from the initial time are fitted nearly equally well.

Short-term prediction and long-term statistical performance are also considered for the coupled training model. Results are shown on Figure 8 and 6. From the figures we see that, the reduced system trained by the coupled training model gives satisfactory prediction for time units, which is shorter than the direct training model. However, the auto-correlation of Fourier modes match better with the true solution, while the distribution is as good as the direct training model. This suggests that, compared to the direct training model, the coupled model is less competitive for short-term prediction, but performs better for long-term statistical properties.

### 4.2 The -dimensional shear flow

#### Experiment setting

Consider the -dimensional shear flow in channel, whose governing equation is given by (6). We take and use periodic boundary condition in direction and zero boundary condition at . We choose , and as a constant driving force. To numerically solve the equation, we employ the spectral method used in [18]. We take Fourier modes in direction and Legendre modes in direction. The time step are chosen to be .

For ease of implementation, we only do model reduction for Fourier modes in direction, and keep all Legendre modes in direction. The macro solution has Fourier modes in direction. Still, we generate accurate solution by DNS and compute the macro-scale strain and stress, and use an LSTM to fit the stress as a function of the history of the strain. In this experiment, the neural network we use is a -layer LSTM with hidden units in each layer. As for the K-S equation, the input and output of the LSTM are in the physical space.

In this problem, the stress has components (, , ). Since each component has modes in the spectral space, we need at least variables in the physical space to represent the stress. If directly fit the stress, our LSTM will have about thousand outputs. This can make the training very difficult. Here, noting that both the stress and the strain are periodic in the direction, we choose to fit the stress by column. We train an LSTM to predict one column of the stress (the stress at the same in the physical space), using strains at this column and the neighboring columns. Figure 9 shows this idea. In practice, when predicting the -th column of the stress, we use strains from the -th to the -th column.

For the -dimensional shear flow problem, we only show results from the direct training model. The LSTM is trained by an Adam optimizer for steps, with batch size being 64. Still, the stress is predicted using the strains from the last time steps.

#### Numerical results

Figure 10 shows the relative training and testing error during the training process. We can see that the training error goes down to , while the testing error goes down to about . Figure 11 shows the correlation coefficients of the predicted stress and the true stress at different . The three curves represent three components of the stress, respectively. we can see that the correlation coefficients are close to except in the area close to the boundary. Considering the boundary condition, the stress near the boundary is close to . Hence, its reasonable for the prediction to have a low correlation with the true stress.

Next, we solve the reduced system initialized from a true solution, and compare the quality of the reduced solution at later times with the solution given by the Smagorinsky model [19]. The Smagorinsky model is a classical and widely used model for large eddy simulation (LES). In the two dimensional case, the Smagorinsky model for the sub-grid stress can be written as

(20) |

where , is called the turbulent eddy viscosity, and

The turbulent eddy viscosity can be expressed as

(21) |

with being the grid size and being a constant. In Figure 12, we compare the deviation of our reduced solution and Smagorinsky solution from the true solution. We see that the prediction given by our reduced system is much better than that by the Smagorinsky model.

## 5 Conclusion

Much work needs to be done to develop the methods proposed here into a systematic and practical approach for model reduction for a wide variety of problems. For example, in the coupled training model, it is crucial to find an accurate and efficient way to compute the Jacobian. How to make the method scalable for larger systems is a problem that should be studied. For reduced systems where the noise effects cannot be neglected, how to model the noise term in the GLE is a problem that should be dealt with. Finally, we also need theoretical understanding to answer questions such as how to choose the size of the neural network, and how large the dataset should be in order to obtain a good reduced system.

## References

- [1] Hazime Mori. Transport, collective motion, and brownian motion. Progress of theoretical physics, 33(3):423–455, 1965.
- [2] Robert Zwanzig. Nonlinear generalized langevin equations. Journal of Statistical Physics, 9(3):215–220, 1973.
- [3] Alexandre J Chorin, Ole H Hald, and Raz Kupferman. Optimal prediction with memory. Physica D: Nonlinear Phenomena, 166(3-4):239–257, 2002.
- [4] Eric J Parish and Karthik Duraisamy. Non-markovian closure models for large eddy simulations using the mori-zwanzig formalism. Physical Review Fluids, 2(1):014604, 2017.
- [5] Xiantao Li and Weinan E. Variational boundary conditions for molecular dynamics simulations of crystalline solids at finite temperature: treatment of the thermal bath. Physical Review B, 76(10):104107, 2007.
- [6] Zhen Li, Hee Sun Lee, Eric Darve, and George Em Karniadakis. Computing the non-markovian coarse-grained interactions derived from the mori–zwanzig formalism in molecular systems: Application to polymer melts. The Journal of chemical physics, 146(1):014104, 2017.
- [7] Masataka Gamahara and Yuji Hattori. Searching for turbulence models by artificial neural network. Physical Review Fluids, 2(5):054604, 2017.
- [8] Antoine Vollant, Guillaume Balarac, and C Corre. Subgrid-scale scalar flux modelling based on optimal estimation theory and machine-learning procedures. Journal of Turbulence, 18(9):854–878, 2017.
- [9] Julia Ling, Andrew Kurzawski, and Jeremy Templeton. Reynolds averaged turbulence modelling using deep neural networks with embedded invariance. Journal of Fluid Mechanics, 807:155–166, 2016.
- [10] Jian-Xun Wang, Jin-Long Wu, and Heng Xiao. Physics-informed machine learning approach for reconstructing reynolds stress modeling discrepancies based on dns data. Physical Review Fluids, 2(3):034603, 2017.
- [11] H Xiao, J-L Wu, J-X Wang, R Sun, and CJ Roy. Quantifying and reducing model-form uncertainties in reynolds-averaged navier–stokes simulations: A data-driven, physics-informed bayesian approach. Journal of Computational Physics, 324:115–136, 2016.
- [12] Fei Lu, Kevin K. Lin, and Alexandre J. Chorin. Data-based stochastic model reduction for the kuramotoâsivashinsky equation. Physica D Nonlinear Phenomena, 340, 2016.
- [13] Yoshua Bengio, Patrice Simard, and Paolo Frasconi. Learning long-term dependencies with gradient descent is difficult. IEEE transactions on neural networks, 5(2):157–166, 1994.
- [14] Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. On the difficulty of training recurrent neural networks. In International Conference on Machine Learning, pages 1310–1318, 2013.
- [15] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
- [16] Chi-Wang Shu and Stanley Osher. Efficient implementation of essentially non-oscillatory shock-capturing schemes. Journal of computational physics, 77(2):439–471, 1988.
- [17] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- [18] Weinan E and Jianchun Wang. A thermodynamic study of the two-dimensional pressure-driven channel flow. Discrete & Continuous Dynamical Systems-A, 36(8):4349–4366, 2016.
- [19] Joseph Smagorinsky. General circulation experiments with the primitive equations: I. the basic experiment. Monthly weather review, 91(3):99–164, 1963.