Comparing recurrent and convolutional neural networks for predicting wave propagation
Abstract
Dynamical systems can be modelled by partial differential equations, and a need for their numerical solution appears in many areas of science and engineering. In this work, we investigate the performance of recurrent and convolutional deep neural network architectures in predicting the propagation of surface waves governed by the Saint-Venant equations. We improve on the long-term prediction over previous methods while keeping the inference time at a fraction of that of numerical simulations. We also show that convolutional networks perform at least as well as recurrent networks in this task. Finally, we assess the generalisation capability of each network by extrapolating for longer times and in different physical settings.
1 Introduction
Many physical systems in science and engineering are described by partial differential equations (PDEs). This study investigates the performance of recurrent and convolutional deep neural networks to model such phenomena. Accurately predicting the evolution of such systems is usually done through numerical simulations, a task that requires significant computational resources. Simulations usually need extensive tuning and need to be rerun from scratch even for small variations in the parameters. With their potential to learn hierarchical representations, deep learning techniques have emerged as an alternative to numerical solvers, by offering a desirable balance between accuracy and computational cost (Carleo et al., 2019).
Here, we focus on the modelling of surface wave propagation governed by the Saint-Venant (SV) equations. This phenomenon offers a good testbed for controlled analyses on two-dimensional sequence prediction of PDEs for several reasons. First, in contrast to some physical systems, such as fluid flow, the evolution of the real system is unlikely to enter chaotic regimes. From a representation learning point of view, this makes model training and assessment relatively straightforward. Despite this, the SV equations are strongly related to the Navier-Stokes equations, widely used in computational fluids. Further, computational modelling of surface waves is used in seismology, computer animation, in predictions of surface runoff from rainfall – a critical aspect of the water cycle (Moussa and Bocquillon, 2000) – and flood modelling (Ersoy et al., 2017).
This study provides three contributions. First, we identify three relevant architectures for spatiotemporal prediction. Two of these architectures lead to improved accuracy in long-term prediction over previous attempts (Sorteberg et al., 2019) while keeping the inference time orders of magnitude smaller than typical solvers. Secondly, our comparison between recurrent and purely convolutional models indicates that both can be equally effective in spatiotemporal prediction of SV PDEs. This is in alignment with the findings of Bai et al. (2018), who demonstrate that convolutional models are as effective as recurrent models in one-dimensional sequence modelling. Finally, we evaluate the generalisation of the models in situations not seen during training and indicate their shortcomings.
Timestep ahead  20  80
LSTM (baseline)  0.08 ± 0.00  0.19 ± 0.03
ConvLSTM  0.05 ± 0.00  0.15 ± 0.01
Causal LSTM  0.02 ± 0.01  0.09 ± 0.01
UNet  0.02 ± 0.00  0.07 ± 0.01
2 Related work
Deep learning methods have been proposed for spatiotemporal forecasting in various fields including the solution of PDEs. Recurrent neural networks have proven a good fit for the task, due to their innate ability to capture temporal correlations. Srivastava et al. (2015) use a convolutional encoder-decoder architecture where an LSTM module is used to propagate the latent space to the future. Variations of this technique have been successfully applied to the long-term prediction of physical systems, such as sliding objects (Ehrhardt et al., 2017) and wave propagation (Sorteberg et al., 2019). Convolutional LSTMs (ConvLSTM) use convolutions inside the LSTM cell to complement the temporal state with spatial information. Whilst initially proposed for precipitation nowcasting, ConvLSTMs were also found successful for video prediction (Shi et al., 2015). Wang et al. (2018) proposed the Causal LSTM, featuring spatial memory that traverses the stacked cells in the network and improves the accuracy of short-term prediction over ConvLSTMs.
Feedforward models have also been used in spatiotemporal forecasting. Mathieu et al. (2015) used a CNN to encode video frames in a latent space and extrapolated the latent vectors to the future. Tompson et al. (2017) employed CNNs to speed up the projection step in fluid flow simulations. UNet has been used for optical flow estimation in videos (Dosovitskiy et al., Technical report) as well as in physical systems, such as sea temperature predictions (de Bezenac et al., 2017) and accelerating the simulation of the Navier-Stokes equations (Thuerey et al., 2018). While both recurrent and convolutional models have been successfully applied to the prediction of PDEs, there is a paucity of studies comparing the two categories from a representation learning point of view.
Other architectures for spatiotemporal prediction include Generative Adversarial Networks for fluid simulations (Kim et al., 2018) and Graph Networks for wind-farm power estimation (Park and Park, 2019). There is also a growing body of research on physics-inspired networks for solving PDEs (Raissi et al., 2017; Perdikaris and Yang, 2019).
Timestep ahead  20  40  60  80 
Test set  0.02  0.03  0.05  0.07 
Opposite Illum.  0.02  0.03  0.05  0.07 
Random Illum.  0.04  0.06  0.08  0.10
Double Drop  0.04  0.07  0.10  0.13 
Lines  0.11  0.16  0.18  0.19 
Shallow Depth  0.04  0.09  0.13  0.16 
Big Tank  0.08  0.14  0.16  0.17 
Small Tank  0.19  0.22  0.23  0.23 
3 Evaluated models
Four different models are assessed in this work. Three of them are recurrent (LSTM, ConvLSTM, Causal LSTM) and one is feedforward (UNet). A detailed description of all the implementations can be found in Section B of the Appendix. The LSTM model was specifically developed for wave propagation prediction (Sorteberg et al., 2019) and serves as a baseline on which we sought improvement. It is composed of a convolutional encoder and decoder with three LSTMs in the middle. The LSTM modules use the vector output of the encoder as an inner representation and propagate it forward in time. Each LSTM propagates a different part of the sequence (see Appendix).
The other models were selected on the basis of their applicability to relevant tasks. ConvLSTM and Causal LSTM have been empirically shown to perform well at short-term spatiotemporal predictions. The rationale for using them in long-term prediction is that the underlying physics of wave propagation does not change. If a model learns a good representation of short-term dynamics, then the error accumulation should remain low in the long term. Both models use convolutions inside the recurrent cell to create a synergy between spatial and temporal modelling. Additionally, Causal LSTMs employ a spatial memory that traverses the vertical stack to increase short-term accuracy.
The feedforward model is based on the UNet architecture used in spatiotemporal prediction. For example, it has been used to infer optical flow (Fischer, 2015), motion fields (de Bezenac et al., 2017) and velocity fields (Thuerey et al., 2018). In contrast, we train the network end-to-end and conditional on its own predictions; the latter shifts the focus from short-term to long-term accuracy.
4 Results
4.1 Long term prediction: Extrapolation in time
We evaluated how well the models extrapolate in time. Given ground-truth simulations of 100 frames in length, we tested the model predictions up to 80 steps ahead, well beyond the maximum of 20-frame sequences that the models are trained on. The RMSE at each time step is calculated as an average over all the test sequences. Results show that the baseline LSTM gives the worst performance. Its RMSE rises sharply after frame 10 and reaches 0.10 after only 21 frames (Figure 1). A probable cause is the use of three distinct LSTMs, which require more data to train. The ConvLSTM offers an improvement: it reaches 0.10 RMSE only after 53 frames. The error trend is also very gradual, almost linear. An even greater improvement comes from the Causal LSTM, which provides a very low error over the whole prediction range. Its maximum error at frame 80 is 0.091, substantially lower than that of the LSTM (0.186) and the ConvLSTM (0.150) (Table 1). This confirms the finding of Wang et al. (2018) that the Causal LSTM outperforms the ConvLSTM. UNet is on par with Causal LSTM until frame 34, but has better long-term prediction, reaching 0.071 RMSE at frame 80 versus 0.091 for the Causal LSTM. Relative to the baseline (0.186), the UNet decreases the RMSE at frame 80 by about 62%. It is also the fastest model, providing a 241x speedup over the numerical solver that we used (Table 6 in Appendix).
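The error curve and the relative improvement discussed above can be reproduced with a few lines of code. The following is a minimal sketch, not the authors' evaluation script: the function names are hypothetical, frames are represented as flat lists of pixel values, and it assumes that the per-frame RMSE is computed first and then averaged over sequences (the text does not state the pooling order).

```python
import math

def frame_rmse(pred, true):
    """RMSE between two frames given as flat lists of pixel values."""
    return math.sqrt(sum((p - y) ** 2 for p, y in zip(pred, true)) / len(true))

def rmse_per_step(predictions, targets):
    """One RMSE value per time step, averaged over all sequences (assumed order)."""
    n_steps = len(targets[0])
    return [
        sum(frame_rmse(p[t], y[t]) for p, y in zip(predictions, targets)) / len(targets)
        for t in range(n_steps)
    ]

def relative_reduction(baseline, improved):
    """Percentage by which `improved` lowers the error relative to `baseline`."""
    return 100.0 * (baseline - improved) / baseline

# Toy data: 2 sequences, 2 time steps, 2 pixels per frame.
preds = [[[0.0, 0.0], [1.0, 1.0]], [[0.0, 0.0], [0.0, 0.0]]]
truth = [[[0.0, 0.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 2.0]]]
curve = rmse_per_step(preds, truth)

# Frame-80 RMSE values reported in the text: LSTM baseline 0.186, UNet 0.071.
unet_gain = relative_reduction(0.186, 0.071)
```

With the reported frame-80 values, `unet_gain` comes out at roughly 62%.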
Qualitatively, it appears that the Causal LSTM propagates its internal representation one step at a time while the UNet predicts multiple frames in one pass. How the output is reconstructed in the last layer is indicative of the differences (Figure 2).
4.2 Generalisation: Extrapolation in other physical settings
Here, we evaluate the capabilities and limitations of our models by testing under different initial conditions, illumination models and tank dimensions (Table 3 in Appendix). For conciseness, we only present the results of the UNet but the same conclusions stand for all the models.
The UNet seems to be quite robust to changes in illumination. The RMSE for the opposite illumination angle is indistinguishable from that of the original test set (Figure 3 and Table 2). This indicates that the learned representation is invariant to a perpendicular phase shift in lighting conditions. Propagation of linear waves appears to be more challenging: the RMSE exceeds 0.10 after just 12 frames. The visualisation shows how the morphology of the prediction is qualitatively different, containing circular artefacts reminiscent of the training data (Figure 4). When two drops are used, the RMSE is fairly low but the two wavefronts of the predictions are sometimes blurred. We also varied the tank size to study the effect of wave speed. Both cases are challenging, with the smaller tank size, or equivalently faster waves, exceeding 0.10 RMSE after just 5 frames. Predictions in Figure 4 demonstrate how the network miscalculates the wave speed, so that its predictions are either faster or slower than the ground truth. Please note that direct comparisons between datasets based on the RMSE are not without shortcomings. Each dataset has its own inherent “variation” which affects the RMSE, e.g. waves move faster in a small tank (see Figure 11 in the Appendix for a discussion).
5 Conclusions and Future Work
In this work we investigated the use of deep networks for approximating wave propagation. Using a UNet architecture, we managed to reduce the long-term approximation RMSE to 0.071 against the previous baseline of 0.186. At the same time, the UNet is 241x faster than the numerical simulation. Our results suggest that the UNet outperforms state-of-the-art recurrent models. It is unclear why UNet models perform so well in this task. It has been demonstrated that convolutional networks are effective at modelling one-dimensional temporal sequences (Bai et al., 2018); the same might be true for higher-dimensional data. Furthermore, the simulated data are based on few-step solvers. In such a case the memory modules may not offer a significant advantage. Lastly, we extensively assessed how the networks generalise in unseen physical settings and pointed out current limitations.
In the future, we aim to introduce noise in the simulation so the system becomes stochastic. It would be interesting to see if in this case the recurrent models learn the dynamics better than the UNet. A big shortcoming of the current models is generalisation in other physical settings. We plan to address this by a physics-inspired latent space factorisation and meta-learning.
Appendix
Appendix A Datasets
The datasets were created by simulating the Saint-Venant equations:
(1) 
The package triflow (Cellier, 2019) was used for the simulation. The Coriolis force and viscosity terms were neglected, the kinematic viscosity was set close to that of water, the height H was set to 10 m and the size of the tank was randomly selected in each simulation between 10 and 20 m. The initial wave excitation is in the form of a Gaussian droplet at random locations. For rendering, we used a fixed lighting azimuth and altitude. Each sequence is 100 steps long with a time step of 0.01 s. In total, 3,000 sequences were rendered. The rendered frames were subsequently resampled down to a lower resolution. The generalisation datasets were created with the same method by varying the physical properties of the simulation (Table 3).
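For orientation, a common textbook (non-conservative) form of the two-dimensional shallow water, or Saint-Venant, equations is sketched below. This is an assumption for illustration and the exact terms and notation of Equation (1) may differ. Here h is the deviation of the surface from the mean height H, u and v are the horizontal velocity components, g is the gravitational acceleration, f the Coriolis coefficient, k a drag coefficient and \nu the kinematic viscosity; all symbols other than H are introduced here and do not come from the text.

```latex
\begin{aligned}
\frac{\partial h}{\partial t}
  + \frac{\partial}{\partial x}\big((H+h)\,u\big)
  + \frac{\partial}{\partial y}\big((H+h)\,v\big) &= 0, \\
\frac{\partial u}{\partial t}
  + u\frac{\partial u}{\partial x}
  + v\frac{\partial u}{\partial y} - f v
  &= -g\frac{\partial h}{\partial x} - k u
  + \nu\left(\frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2}\right), \\
\frac{\partial v}{\partial t}
  + u\frac{\partial v}{\partial x}
  + v\frac{\partial v}{\partial y} + f u
  &= -g\frac{\partial h}{\partial y} - k v
  + \nu\left(\frac{\partial^2 v}{\partial x^2} + \frac{\partial^2 v}{\partial y^2}\right).
\end{aligned}
```

Neglecting a term corresponds to setting its coefficient to zero, for example f = 0 for the Coriolis force.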
We also used image normalisation which is known to improve performance on image prediction tasks. Normalising the pixel values to zero mean and standard deviation 1 worked best for us. Note that the normalising values are computed from the training set alone and applied to the validation and test sets. Data augmentation techniques like horizontal and vertical flips were employed on a per sequence basis. From the 3000 sequences of the original dataset, 70% were used for training, 15% for validation and 15% for testing.
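The split and normalisation procedure can be sketched as follows. This is a minimal illustration under stated assumptions: the helper names are hypothetical, the split is assumed to be a random shuffle at the sequence level (the text gives only the proportions), and pixels are represented as a flat list.

```python
import random
import statistics

def split_indices(n_sequences, train=0.70, val=0.15, seed=0):
    """Shuffle sequence indices and split them into train/validation/test."""
    indices = list(range(n_sequences))
    random.Random(seed).shuffle(indices)  # random split is an assumption
    n_train = round(train * n_sequences)
    n_val = round(val * n_sequences)
    return (indices[:n_train],
            indices[n_train:n_train + n_val],
            indices[n_train + n_val:])

def fit_normaliser(train_pixels):
    """Mean and standard deviation computed from training pixels only."""
    return statistics.fmean(train_pixels), statistics.pstdev(train_pixels)

def normalise(pixels, mean, std):
    """Shift and scale pixels to zero mean and unit standard deviation."""
    return [(p - mean) / std for p in pixels]

train_idx, val_idx, test_idx = split_indices(3000)

# Toy pixel values; the statistics come from the training data alone
# and are then reused unchanged for validation and test data.
mean, std = fit_normaliser([0.0, 2.0, 4.0, 6.0])
normalised_test = normalise([3.0, 9.0], mean, std)
```

Reusing the training statistics for the held-out sets keeps information from the validation and test data out of the preprocessing step.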
Dataset Name  Initial Condition  Height(m)  Tank Size(m)  Illum. Azimuth  Sequences 
Training/Validation/Test  Droplet  10  [10, 20]  3000  
Double Drop  Double Droplet  10  [10, 20]  500  
Lines  Line wave  10  [10, 20]  500  
Opposite Illumination  Droplet  10  [10, 20]  500  
Random Illumination  Droplet  10  [10, 20]  Random  500 
Shallow Depth  Droplet  5  [10, 20]  500  
Small Tank  Droplet  10  [5, 10]  500  
Big Tank  Droplet  10  [20, 40]  500 
Appendix B Models
B.1 LSTM
The encoder consists of 4 convolutional layers with 60, 120, 240, 480 feature maps, kernel sizes 7, 3, 3, 3 and padding of 2, 1, 1, 1 pixels. Dimensionality reduction is achieved by using a kernel stride of size in all layers. After each convolutional layer, there is a batch normalisation layer and a tanh nonlinearity. In the last convolutional layer, dropout is used on of the units, that are chosen randomly in each pass. The final part of the encoder is a fully connected layer of width . This is the latent vector input to the three LSTMs. One LSTM is used for the first input, the second LSTM is for predicting the 10th frame (midway) and the third LSTM for all the other frames. The decoder is based on deconvolutions that double the spatial dimensions of the feature maps in each layer until the original size is reached. It is a mirror of the encoder in terms of feature map size while the kernel is 3, the padding is 1 and the stride is 2 for all the layers. Figure 6 depicts the architecture.
B.2 ConvLSTM
Our architecture uses a stack of 3 ConvLSTM cells. Initially, a convolutional encoder with 8, 64, 192 feature maps respectively reduces the spatial dimensions to . All layers have kernels of size 3, zero padding of width 1 and Leaky ReLU nonlinearities with slope 0.2. A stride of 2 pixels is used to reduce the dimensionality. At the final layer, the input is represented by a tensor. Inside the ConvLSTMs we use kernels of size 3 and zero padding of 1 pixel to avoid the dimensionality reduction. The decoder uses deconvolutions with stride 2 to double up the pixel dimensions in each layer.
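The effect of the strided layers on the spatial dimensions can be checked with the standard output-size formulas for convolutions and transposed convolutions. The sketch below is illustrative only: the input resolution of 128 is a hypothetical value, and the decoder kernel size, padding and `output_padding` are assumptions chosen so that each layer exactly doubles the size; only the encoder kernel size 3, padding 1 and stride 2 come from the text.

```python
def conv_out(size, kernel, stride, padding):
    """Spatial output size of a convolution (floor convention)."""
    return (size + 2 * padding - kernel) // stride + 1

def deconv_out(size, kernel, stride, padding, output_padding=0):
    """Spatial output size of a transposed convolution."""
    return (size - 1) * stride - 2 * padding + kernel + output_padding

size = 128  # hypothetical input resolution, not taken from the text

# Encoder: three layers with kernel 3, stride 2 and zero padding of width 1.
encoder_sizes = [size]
for _ in range(3):
    encoder_sizes.append(conv_out(encoder_sizes[-1], kernel=3, stride=2, padding=1))

# Decoder: stride-2 deconvolutions; kernel 3, padding 1 and
# output_padding 1 are assumed values that double the size per layer.
decoder_sizes = [encoder_sizes[-1]]
for _ in range(3):
    decoder_sizes.append(
        deconv_out(decoder_sizes[-1], kernel=3, stride=2, padding=1, output_padding=1)
    )
```

Under these assumptions each encoder layer halves the resolution and each decoder layer doubles it, so the output matches the input size.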
B.3 Causal LSTM
The unfolding of the model through time is presented in Figure 8. The vertical stack is comprised of one convolutional, one max pooling and four Causal LSTM layers. The convolutional layer has a kernel of size 3, no padding and outputs 8 feature maps. In the original paper, they do not use any dimensionality reduction because their input dimensions are per frame. Our input dimensions () are too big to fit in available GPU memory, so we used maxpooling with stride 4 to reduce the dimensions to pixels. Following the original paper, we used 4 Causal LSTM layers but reduced the size of all of them to 64 channels each to meet hardware memory constraints. We used convolutional kernels of size 3. The forecaster uses a deconvolutional layer kernel size 7 and stride 4 to restore the internal state to the original dimensions.
B.4 UNet
The encoder is composed of four blocks, each containing two convolutional layers with kernel size 3 and padding 1, followed by ReLU nonlinearities. The first three blocks include a max-pooling layer of stride 2 that halves the spatial size. The number of feature maps doubles in each layer. For the expanding part, we use bilinear interpolation with scale factor 2 instead of deconvolutions to keep the number of parameters low. Skip connections are also employed to copy feature maps from earlier layers but, contrary to the original paper, we do not reduce the dimensions of the copied feature maps. This way, high-level, coarser feature maps are combined with fine-grained local information of lower layers over the whole domain. The network architecture can be seen in Figure 9.
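To make the data flow concrete, the sketch below traces channel counts and spatial sizes through such an encoder and through the concatenations created by the skip connections. It is a shape calculation only, not the authors' implementation: the function names, the input resolution of 128 and the 32 base feature maps are hypothetical, and the feature maps are assumed to double from block to block.

```python
def unet_encoder_shapes(size, base_channels, n_blocks=4):
    """(channels, spatial size) at the output of each encoder block."""
    shapes = []
    channels = base_channels
    for block in range(n_blocks):
        # Two 3x3 convolutions with padding 1 keep the spatial size unchanged.
        shapes.append((channels, size))
        if block < n_blocks - 1:
            size //= 2      # max pooling with stride 2 halves the size
            channels *= 2   # feature maps double per block (assumption)
    return shapes

def decoder_concat_shapes(encoder_shapes):
    """(channels, spatial size) after upsampling and skip concatenation."""
    shapes = []
    for level in range(len(encoder_shapes) - 1, 0, -1):
        deep_channels, deep_size = encoder_shapes[level]
        skip_channels, skip_size = encoder_shapes[level - 1]
        # Bilinear upsampling by 2 brings the deep map to the skip resolution.
        assert deep_size * 2 == skip_size
        # The copied feature maps are concatenated without being reduced.
        shapes.append((deep_channels + skip_channels, skip_size))
    return shapes

encoder = unet_encoder_shapes(size=128, base_channels=32)  # hypothetical values
decoder_inputs = decoder_concat_shapes(encoder)
```

Because bilinear interpolation has no learnable weights, the upsampling itself adds no parameters; only the convolutions that follow each concatenation do.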
Appendix C Hyperparameters
Each model takes a fixed number of input frames and predicts a fixed number of output frames. For each training iteration, we randomly select subsequences covering the input and output frames from each simulated sequence. The models were trained to minimise the MSE over their respective output length. In each iteration, the weights are updated using an Adam optimiser while a scheduling scheme adjusts the learning rate (LR) by a fixed scaling factor if there is no improvement in validation error after a given number of epochs (patience). The hyperparameters of interest are the input length, training output length, samples per sequence between weight updates, batch size, LR and patience. Grid search was used to find the best set of hyperparameters for each model. The training budget was 24 hours. To obtain an arbitrarily long prediction, we feed the output back as the next input. The goal is to obtain networks with a low error in long-term prediction so, during model selection, we chose the hyperparameters that gave the lowest validation error over 50 frames regardless of the output size of the model. The final hyperparameters and model sizes can be found in Tables 4 and 5.
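Producing a prediction longer than the training output length amounts to an autoregressive rollout in which predictions are appended to the input window. The following is a minimal sketch, assuming a model that maps a list of input frames to a list of predicted frames; `rollout` and `toy_model` are hypothetical names, and frames are scalars here purely to keep the example small.

```python
def rollout(model, initial_frames, n_input, n_total):
    """Predict n_total future frames by feeding predictions back as input."""
    if len(initial_frames) < n_input:
        raise ValueError("need at least n_input initial frames")
    history = list(initial_frames[-n_input:])
    predictions = []
    while len(predictions) < n_total:
        # The model only ever sees the most recent n_input frames,
        # which after the first pass include its own predictions.
        new_frames = model(history[-n_input:])
        predictions.extend(new_frames)
        history.extend(new_frames)
    return predictions[:n_total]

def toy_model(frames):
    """Stand-in predictor: continues an arithmetic progression for 2 steps."""
    step = frames[-1] - frames[-2]
    return [frames[-1] + step, frames[-1] + 2 * step]

future = rollout(toy_model, [0.0, 1.0, 2.0, 3.0, 4.0], n_input=5, n_total=5)
```

Any error in an early prediction is carried into later input windows, which is why model selection on a long validation horizon is a reasonable choice for this setting.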
Model  Input length  Output length  Samples per sequence  Batch size  Patience
LSTM  5  20  10  16  5
ConvLSTM  5  10  5  8  7
Causal LSTM  5  20  5  4  3
UNet  5  20  10  16  7
Appendix D Model size and speed
Models were implemented in PyTorch and the code is publicly available on GitHub. Models were trained on a GTX 1060 GPU with 6GB of memory. Total training time includes evaluation overhead.
Model 






LSTM  88.2M  12m  75  71  24h  
ConvLSTM  12.3M  36m  24  18  24h  
Causal LSTM  2.5M  33m  43  36  24h  
UNet  7.8M  8m  171  166  24h 
Method  Time per frame (ms)  Speedup 
Numerical simulator  630.7   
LSTM  15.0  40x 
ConvLSTM  4.5  141x 
Causal LSTM  9.2  68x 
UNet  2.6  241x 
Appendix E Results Addendum
E.1 Predicting the tank size from the latent space
Here we check whether the trained UNet acquired any understanding of the physical properties of the system. We focus on the tank size, or inversely the speed of the wave, for two reasons. First, the UNet failed to extrapolate to different tank sizes, and this experiment could provide some insight into why this failure happens. Secondly, tank size information is readily available. Each dataset sequence corresponds to a different tank size, and the tank is always square. In the training and testing datasets the tank size lies in [10, 20] meters. For the smaller tank we used [5, 10] meters and for the bigger [20, 40] meters (Table 3).
The question we try to answer is: does the latent representation of the UNet capture the tank size information? We take the pretrained encoder from the UNet and add some additional layers so that the output is a single number (Figure 13). The system is trained to predict the tank size when given 5 consecutive frames. Only the additional part is updated during training; the weights of the encoder are kept frozen. We compare the pretrained encoder against a randomly initialised encoder. We also compare the models to a dummy regressor that always predicts the mean tank size for each dataset, i.e. 15 for the test set, 7.5 for the small tank and 30 for the big tank. Results in Table 6 indicate that the pretrained encoder can be used to extract the tank size with relatively low error (0.14) while the random encoder gives a much higher error of 2.27, slightly lower than that of the dummy regressor (2.45). This indicates that the pretrained encoder encapsulates physically relevant information relating to the tank size. When it comes to the bigger and smaller tanks, both the pretrained and the random encoders fail to extrapolate and give errors higher than the dummy regressor.
Model  Test set  Bigger Tank  Smaller Tank
Pretrained encoder  0.14  6.65  2.23 
Random encoder  2.27  14.21  6.35 
Dummy regressor  2.45  5.19  1.22 
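The dummy regressor row can be sanity-checked with a short simulation. The sketch below assumes that tank sizes are drawn uniformly from each range and that the reported error is a mean absolute error; neither is stated explicitly, and `dummy_regressor_mae` is a hypothetical helper. Under these assumptions the expected error of always predicting the midpoint of a range [a, b] is (b - a) / 4, i.e. 2.5, 5.0 and 1.25 for the three datasets, which is close to the reported 2.45, 5.19 and 1.22.

```python
import random

def dummy_regressor_mae(low, high, n_samples=100_000, seed=0):
    """Mean absolute error of always predicting the midpoint of [low, high]."""
    rng = random.Random(seed)
    # Uniform sampling of tank sizes is an assumption.
    sizes = [rng.uniform(low, high) for _ in range(n_samples)]
    prediction = (low + high) / 2  # 15, 30 and 7.5 for the three datasets
    return sum(abs(s - prediction) for s in sizes) / n_samples

mae_test = dummy_regressor_mae(10, 20)    # training/test range
mae_big = dummy_regressor_mae(20, 40)     # bigger tank
mae_small = dummy_regressor_mae(5, 10)    # smaller tank
```

Against this reference, an encoder is only informative about the tank size if its error falls clearly below the dummy value for the same dataset.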
Footnotes
 Code and data available at github.com/stathius/wave_propagation
References
 Bai et al. (2018). An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv preprint arXiv:1803.01271.
 Carleo et al. (2019). Machine learning and the physical sciences.
 Cellier (2019). Scikit-fdiff / skfdiff. https://gitlab.com/celliern/scikitfdiff/ [Online; accessed 1182019].
 de Bezenac et al. (2017). Deep learning for physical processes: incorporating prior scientific knowledge. arXiv preprint arXiv:1711.07970.
 Dosovitskiy et al. FlowNet: Learning Optical Flow with Convolutional Networks. Technical report.
 Ehrhardt et al. (2017). Learning A Physical Long-term Predictor.
 Ersoy et al. (2017). A Saint-Venant shallow water model for overland flows with precipitation and recharge. arXiv:1705.05470.
 Fischer (2015). FlowNet: learning optical flow with convolutional networks. arXiv preprint arXiv:1504.06852.
 Kim et al. (2018). Deep Fluids: A Generative Network for Parameterized Fluid Simulations.
 Mathieu et al. (2015). Deep multi-scale video prediction beyond mean square error. arXiv preprint arXiv:1511.05440.
 Moussa and Bocquillon (2000). Approximation zones of the Saint-Venant equations for flood routing with overbank flow. Hydrology and Earth System Sciences Discussions 4 (2), pp. 251–260.
 Park and Park (2019). Physics-induced graph neural network: An application to wind-farm power estimation. Energy 187, pp. 115883. ISSN 0360-5442.
 Perdikaris and Yang (2019). Modeling stochastic systems using physics-informed deep generative models.
 Raissi et al. (2017). Physics Informed Deep Learning (Part I): Data-driven Solutions of Nonlinear Partial Differential Equations.
 Shi et al. (2015). Convolutional LSTM Network: A Machine Learning Approach for Precipitation Nowcasting.
 Sorteberg et al. (2019). Approximating the Solution of Surface Wave Propagation Using Deep Neural Networks. In INNS Big Data and Deep Learning.
 Srivastava et al. (2015). Unsupervised learning of video representations using LSTMs. In International Conference on Machine Learning, pp. 843–852.
 Thuerey et al. (2018). Deep Learning Methods for Reynolds-Averaged Navier-Stokes Simulations of Airfoil Flows.
 Tompson et al. (2017). Accelerating Eulerian fluid simulation with convolutional networks. In ICML, pp. 3424–3433.
 Wang et al. (2018). PredRNN++: Towards A Resolution of the Deep-in-Time Dilemma in Spatiotemporal Predictive Learning. arXiv preprint arXiv:1804.06300.