Comparing recurrent and convolutional neural networks for predicting wave propagation
Dynamical systems can be modelled by partial differential equations and a need for their numerical solution appears in many areas of science and engineering. In this work, we investigate the performance of recurrent and convolutional deep neural network architectures to predict the propagation of surface waves governed by the Saint-Venant equations. We improve on the long-term prediction over previous methods while keeping the inference time at a fraction of numerical simulations. We also show that convolutional networks perform at least as well as recurrent networks in this task. Finally, we assess the generalisation capability of each network by extrapolating for longer times and in different physical settings.
Many physical systems in science and engineering are described by partial differential equations (PDEs). This study investigates the performance of recurrent and convolutional deep neural networks to model such phenomena. Accurately predicting the evolution of such systems is usually done through numerical simulations, a task that requires significant computational resources. Simulations usually need extensive tuning and need to be re-run from scratch even for small variations in the parameters. With their potential to learn hierarchical representations, deep learning techniques have emerged as an alternative to numerical solvers, by offering a desirable balance between accuracy and computational cost (Carleo et al., 2019).
Here, we focus on the modelling of surface wave propagation governed by the Saint-Venant (SV) equations. This phenomenon offers a good test-bed for controlled analyses on two-dimensional sequence prediction of PDEs for several reasons. First, in contrast to some physical systems, such as fluid flow, the evolution of the real system is unlikely to enter chaotic regimes. From a representation learning point of view, this makes model training and assessment relatively straightforward. Despite this, the SV equations are strongly related to the Navier-Stokes equations, widely used in computational fluid dynamics. Further, computational modelling of surface waves is used in seismology, in computer animation, in predictions of surface runoff from rainfall – a critical aspect of the water cycle (Moussa and Bocquillon, 2000) – and in flood modelling (Ersoy et al., 2017).
This study provides three contributions. First, we identify three relevant architectures for spatiotemporal prediction. Two of these architectures lead to improved accuracy in long-term prediction over previous attempts (Sorteberg et al., 2019) while keeping the inference time orders of magnitude smaller than typical solvers. Secondly, our comparison between recurrent and purely convolutional models indicates that both can be equally effective in spatiotemporal prediction of SV PDEs. This is in line with the findings of Bai et al. (2018), who demonstrate that convolutional models are as effective as recurrent models in one-dimensional sequence modelling. Finally, we evaluate the generalisation of the models in situations not seen during training and indicate their shortcomings.
|LSTM (baseline)||0.08 ± 0.00||0.19 ± 0.03|
|ConvLSTM||0.05 ± 0.00||0.15 ± 0.01|
|Causal LSTM||0.02 ± 0.01||0.09 ± 0.01|
|U-Net||0.02 ± 0.00||0.07 ± 0.01|
2 Related work
Deep learning methods have been proposed for spatiotemporal forecasting in various fields, including the solution of PDEs. Recurrent neural networks have proven a good fit for the task, due to their innate ability to capture temporal correlations. Srivastava et al. (2015) use a convolutional encoder-decoder architecture where an LSTM module is used to propagate the latent space to the future. Variations of this technique have been successfully applied to the long-term prediction of physical systems, such as sliding objects (Ehrhardt et al., 2017) and wave propagation (Sorteberg et al., 2019). Convolutional LSTMs (ConvLSTM) use convolutions inside the LSTM cell to complement the temporal state with spatial information. Whilst initially proposed for precipitation nowcasting, ConvLSTMs were also found successful for video prediction (Shi et al., 2015). Wang et al. (2018) proposed the Causal LSTM, featuring spatial memory that traverses the stacked cells in the network and improves the accuracy of short-term prediction over ConvLSTMs.
Feed-forward models have also been used in spatiotemporal forecasting. Mathieu et al. (2015) used a CNN to encode video frames in a latent space and extrapolated the latent vectors to the future. Tompson et al. (2017) employed CNNs to speed up the projection step in fluid flow simulations. U-Net has been used for optical flow estimation in videos (Dosovitskiy et al., 2015) as well as in physical systems, such as sea temperature predictions (de Bezenac et al., 2017) and accelerating the simulation of the Navier-Stokes equations (Thuerey et al., 2018). While both recurrent and convolutional models have been successfully applied to the prediction of PDEs, there is a paucity of studies comparing the two categories from a representation learning point of view.
Other architectures for spatiotemporal prediction include Generative Adversarial Networks for fluid simulations (Kim et al., 2018) and Graph Networks for wind-farm power estimation (Park and Park, 2019). There is also a growing body of research on physics-inspired networks for solving PDEs (Raissi et al., 2017; Perdikaris and Yang, 2019).
3 Evaluated models
Four different models are assessed in this work. Three of them are recurrent (LSTM, ConvLSTM, Causal LSTM) and one is feed-forward (U-Net). A detailed description of all the implementations can be found in Section B of the Appendix. The LSTM model was specifically developed for wave propagation prediction (Sorteberg et al., 2019) and serves as a baseline on which we sought improvement. It is composed of a convolutional encoder and decoder with three LSTMs in the middle. The LSTM modules use the vector output of the encoder as an inner representation and propagate it forward in time. Each LSTM propagates a different part of the sequence (see Appendix).
The other models were selected on the basis of their applicability to relevant tasks. ConvLSTM and Causal LSTM have been empirically shown to perform well at short-term spatiotemporal predictions. The rationale for using them in long-term prediction is that the underlying physics of wave propagation does not change. If a model learns a good representation of short-term dynamics, then the error accumulation should remain low in the long term. Both models use convolutions inside the recurrent cell to create a synergy between spatial and temporal modelling. Additionally, Causal LSTMs employ a spatial memory that traverses the vertical stack to increase short-term accuracy.
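For reference, the ConvLSTM cell as formulated by Shi et al. (2015) can be written as below. The notation follows that paper rather than this one: $*$ denotes convolution, $\circ$ the Hadamard product, $X_t$ the input, $H_t$ the hidden state and $C_t$ the cell state. Whether the implementation evaluated here keeps the peephole terms ($W_{ci}$, $W_{cf}$, $W_{co}$) is not stated, so they should be read as part of the original formulation only.

```latex
\begin{aligned}
i_t &= \sigma\left(W_{xi} * X_t + W_{hi} * H_{t-1} + W_{ci} \circ C_{t-1} + b_i\right) \\
f_t &= \sigma\left(W_{xf} * X_t + W_{hf} * H_{t-1} + W_{cf} \circ C_{t-1} + b_f\right) \\
C_t &= f_t \circ C_{t-1} + i_t \circ \tanh\left(W_{xc} * X_t + W_{hc} * H_{t-1} + b_c\right) \\
o_t &= \sigma\left(W_{xo} * X_t + W_{ho} * H_{t-1} + W_{co} \circ C_t + b_o\right) \\
H_t &= o_t \circ \tanh\left(C_t\right)
\end{aligned}
```

Because every state is a feature map rather than a vector, the spatial layout of the wave field is preserved as the recurrence advances in time.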
The feed-forward model is based on the U-Net architecture, which has previously been used in spatiotemporal prediction. For example, it has been used to infer optical flow (Fischer, 2015), motion fields (de Bezenac et al., 2017) and velocity fields (Thuerey et al., 2018). In contrast, we train the network end-to-end and conditioned on its own predictions; the latter shifts the focus from short-term to long-term accuracy.
4.1 Long term prediction: Extrapolation in time
We evaluated how well the models extrapolate in time. Given ground-truth simulations of 100 frames in length, we tested the model predictions up to 80 steps, far beyond the maximum of 20-frame sequences that the models were trained upon. The RMSE at each time step is calculated as an average over all the test sequences. Results show that the baseline LSTM gives the worst performance: its RMSE rises sharply after frame 10 and reaches 0.10 after only 21 frames (Figure 1). A probable cause is the use of three distinct LSTMs, which require more data to train. The ConvLSTM offers an improvement: it reaches 0.10 RMSE only after 53 frames, and its error grows gradually, almost linearly. An even greater improvement comes from the Causal LSTM, which keeps the error low over the whole prediction range. Its error at frame 80 is 0.091, substantially lower than that of the LSTM (0.186) and the ConvLSTM (0.150) (Table 1). This is consistent with the findings of Wang et al. (2018) that the Causal LSTM is more efficient than the ConvLSTM. The U-Net is on par with the Causal LSTM until frame 34 but has better long-term prediction, reaching 0.071 RMSE at frame 80 versus 0.091 for the Causal LSTM. Compared to the baseline, the U-Net lowers the RMSE at frame 80 from 0.186 to 0.071. It is also the fastest model, providing a speed-up over the numerical solver that we used (Table 6 in Appendix).
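The per-frame error curve described above can be sketched as follows. This is an illustration only, not the authors' evaluation code: the function names are hypothetical, frames are represented as flat lists of pixel values, and the RMSE is assumed to be computed per sequence and then averaged across sequences (the text does not say whether the pooling is done before or after the square root).

```python
import math


def rmse_per_frame(predictions, targets):
    """Average RMSE at each time step over all sequences.

    predictions, targets: lists of sequences; each sequence is a list of
    frames; each frame is a flat list of pixel values.
    """
    n_frames = len(targets[0])
    curve = []
    for t in range(n_frames):
        per_sequence = []
        for pred_seq, true_seq in zip(predictions, targets):
            squared = [(p - q) ** 2 for p, q in zip(pred_seq[t], true_seq[t])]
            per_sequence.append(math.sqrt(sum(squared) / len(squared)))
        # Assumption: average the per-sequence RMSE values at this time step.
        curve.append(sum(per_sequence) / len(per_sequence))
    return curve


def first_frame_exceeding(curve, threshold=0.10):
    """Return the 1-based frame index at which the curve first exceeds the threshold."""
    for frame, value in enumerate(curve, start=1):
        if value > threshold:
            return frame
    return None
```

A helper such as `first_frame_exceeding` corresponds to statements of the form "the RMSE reaches 0.10 after N frames" used in this section.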
Qualitatively, it appears that the Causal LSTM propagates its internal representation one step at a time while the U-Net predicts multiple frames in one pass. How the output is reconstructed in the last layer is indicative of the differences (Figure 2).
4.2 Generalisation: Extrapolation in other physical settings
Here, we evaluate the capabilities and limitations of our models by testing under different initial conditions, illumination models and tank dimensions (Table 3 in Appendix). For conciseness, we only present the results of the U-Net but the same conclusions stand for all the models.
The U-Net seems to be quite robust to changes in illumination. The RMSE for the opposite illumination angle is indistinguishable from that of the original test set (Figure 3 and Table 2). This indicates that the learned representation is invariant to a perpendicular phase shift in lighting conditions. Propagation of linear waves appears to be more challenging: the RMSE exceeds 0.10 after just 12 frames. The visualisation shows how the morphology of the prediction is qualitatively different, containing circular artefacts reminiscent of the training data (Figure 4). When two drops are used, the RMSE is fairly low but the two wave-fronts of the predictions are sometimes blurred. We also varied the tank size to study the effect of wave speed. Both cases are challenging, with the smaller tank size, or equivalently faster waves, exceeding 0.10 RMSE after just 5 frames. Predictions in Figure 4 demonstrate how the network miscalculates the wave speed, so that its predictions are either faster or slower than the ground truth. Please note that direct comparisons between datasets based on the RMSE are not without shortcomings. Each dataset has its own inherent “variation” which affects the RMSE, e.g. waves move faster in a small tank (see Figure 11 in the Appendix for a discussion).
5 Conclusions and Future Work
In this work we investigated the use of deep networks for approximating wave propagation. Using a U-Net architecture, we managed to reduce the long-term approximation RMSE to 0.071 against the previous baseline of 0.186. At the same time, the U-Net is faster than the simulation. Our results suggest that the U-Net outperforms state-of-the-art recurrent models. It is unclear why U-Net models perform so well in this task. It has been demonstrated that convolutional networks are effective at modelling one-dimensional temporal sequences (Bai et al., 2018); the same might be true for higher-dimensional data. Furthermore, the simulated data are based on few-step solvers. In such a case the memory modules may not offer a significant advantage. Lastly, we extensively assessed how the networks generalise in unseen physical settings and pointed out current limitations.
In the future, we aim to introduce noise in the simulation so the system becomes stochastic. It would be interesting to see if in this case the recurrent models learn the dynamics better than the U-Net. A big shortcoming of the current models is generalisation in other physical settings. We plan to address this by a physics-inspired latent space factorisation and meta-learning.
Appendix A Datasets
The datasets were created by simulating the Saint-Venant equations:
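A standard form of the two-dimensional Saint-Venant (shallow water) equations for a flat bottom is sketched below for reference. It is written without Coriolis and viscous terms and is not necessarily the exact system passed to the solver. The symbols are assumptions introduced here: $h$ is the surface elevation above the resting depth $H$, $(u, v)$ is the depth-averaged horizontal velocity and $g$ is the gravitational acceleration.

```latex
\begin{aligned}
\frac{\partial h}{\partial t}
  + \frac{\partial}{\partial x}\big((H + h)\,u\big)
  + \frac{\partial}{\partial y}\big((H + h)\,v\big) &= 0 \\
\frac{\partial u}{\partial t}
  + u\,\frac{\partial u}{\partial x}
  + v\,\frac{\partial u}{\partial y} &= -g\,\frac{\partial h}{\partial x} \\
\frac{\partial v}{\partial t}
  + u\,\frac{\partial v}{\partial x}
  + v\,\frac{\partial v}{\partial y} &= -g\,\frac{\partial h}{\partial y}
\end{aligned}
```

In the linearised limit, small disturbances travel at $c = \sqrt{gH}$, which is about 9.9 m/s for $H = 10$ m.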
The package triflow (Cellier, 2019) was used for the simulation. The Coriolis force and viscosity terms were neglected, and the kinematic viscosity was set to a value close to that of water. The height H is set to 10 m and the size of the tank is randomly selected in each simulation between 10 and 20 m. The initial wave excitation is in the form of a Gaussian droplet at random locations. For rendering, we used a fixed lighting azimuth and altitude. Each sequence is 100 steps long while the time step is 0.01 s. In total, 3,000 sequences were rendered. The rendered frames were subsequently re-sampled down to a smaller size. The generalisation datasets were created with the same method by varying the physical properties of the simulation (Table 3).
We also used image normalisation, which is known to improve performance on image prediction tasks. Normalising the pixel values to zero mean and unit standard deviation worked best for us. Note that the normalising values are computed from the training set alone and applied to the validation and test sets. Data augmentation techniques such as horizontal and vertical flips were employed on a per-sequence basis. Of the 3,000 sequences of the original dataset, 70% were used for training, 15% for validation and 15% for testing.
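A minimal sketch of the split and normalisation described above follows. It uses only the Python standard library and hypothetical helper names. It assumes the split is made at the level of whole sequences after a random shuffle, and that a single mean and standard deviation are computed over all training pixels; neither detail is confirmed by the text.

```python
import random
import statistics


def split_indices(n_sequences, train_frac=0.70, val_frac=0.15, seed=0):
    """Shuffle sequence indices and split them into train/validation/test."""
    indices = list(range(n_sequences))
    random.Random(seed).shuffle(indices)
    n_train = round(n_sequences * train_frac)
    n_val = round(n_sequences * val_frac)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]
    return train, val, test


def fit_normaliser(train_pixels):
    """Compute the mean and standard deviation from training pixels only."""
    return statistics.fmean(train_pixels), statistics.pstdev(train_pixels)


def normalise(pixels, mean, std):
    """Apply training-set statistics to any split (train, validation or test)."""
    return [(p - mean) / std for p in pixels]
```

Fitting the statistics on the training split alone keeps information from the validation and test sets out of the preprocessing step.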
|Dataset Name||Initial Condition||Height (m)||Tank Size (m)||Illum. Azimuth||Sequences|
|Double Drop||Double Droplet||10||[10, 20]|| ||500|
|Lines||Line wave||10||[10, 20]|| ||500|
|Opposite Illumination||Droplet||10||[10, 20]|| ||500|
|Random Illumination||Droplet||10||[10, 20]||Random||500|
|Shallow Depth||Droplet||5||[10, 20]|| ||500|
|Small Tank||Droplet||10||[5, 10]|| ||500|
|Big Tank||Droplet||10||[20, 40]|| ||500|
Appendix B Models
b.1 LSTM
The encoder consists of 4 convolutional layers with 60, 120, 240, 480 feature maps, kernel sizes 7, 3, 3, 3 and padding of 2, 1, 1, 1 pixels. Dimensionality reduction is achieved by using a strided kernel in all layers. After each convolutional layer, there is a batch normalisation layer and a tanh non-linearity. In the last convolutional layer, dropout is applied to a subset of the units that is chosen randomly in each pass. The final part of the encoder is a fully connected layer, whose output is the latent vector input to the three LSTMs. One LSTM is used for the first input, the second LSTM is for predicting the 10th frame (midway) and the third LSTM for all the other frames. The decoder is based on deconvolutions that double the spatial dimensions of the feature maps in each layer until the original size is reached. It is a mirror of the encoder in terms of feature map size while the kernel is 3, the padding is 1 and the stride is 2 for all the layers. Figure 6 depicts the architecture.
b.2 ConvLSTM
Our architecture uses a stack of 3 ConvLSTM cells. Initially, a convolutional encoder with 8, 64, 192 feature maps respectively reduces the spatial dimensions. All layers have kernels of size 3, zero padding of width 1 and Leaky ReLU non-linearities with slope 0.2. A stride of 2 pixels is used to reduce the dimensionality. At the final layer, the input is represented by a tensor with 192 feature maps. Inside the ConvLSTMs we use kernels of size 3 and zero padding of 1 pixel to avoid any further dimensionality reduction. The decoder uses deconvolutions with stride 2 to double the pixel dimensions in each layer.
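The effect of these kernel, stride and padding choices on the spatial size can be checked with the usual output-size formulas for convolutions and transposed convolutions. The sketch below is illustrative: the function names are hypothetical, and the input width of 128 pixels is an assumed value chosen for the example, not the frame size used in the paper.

```python
def conv_out(size, kernel, stride, padding):
    """Output width of a convolution (dilation 1)."""
    return (size + 2 * padding - kernel) // stride + 1


def deconv_out(size, kernel, stride, padding, output_padding=0):
    """Output width of a transposed convolution (dilation 1)."""
    return (size - 1) * stride - 2 * padding + kernel + output_padding


def encoder_sizes(size, n_layers=3, kernel=3, stride=2, padding=1):
    """Spatial width after each strided encoder layer, starting from the input."""
    sizes = [size]
    for _ in range(n_layers):
        size = conv_out(size, kernel, stride, padding)
        sizes.append(size)
    return sizes


# Hypothetical 128-pixel input: each stride-2 layer halves the width.
print(encoder_sizes(128))        # [128, 64, 32, 16]
# Kernel 3 with padding 1 and stride 1 leaves the width unchanged.
print(conv_out(16, 3, 1, 1))     # 16
```

With kernel 3, padding 1 and stride 2, a transposed convolution doubles the width exactly only when an output padding of 1 is added (`deconv_out(16, 3, 2, 1, output_padding=1)` gives 32, while the default gives 31).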
b.3 Causal LSTM
The unfolding of the model through time is presented in Figure 8. The vertical stack comprises one convolutional, one max-pooling and four Causal LSTM layers. The convolutional layer has a kernel of size 3, no padding and outputs 8 feature maps. The original paper does not use any dimensionality reduction because its input frames are smaller. Our input dimensions are too big to fit in the available GPU memory, so we used max-pooling with stride 4 to reduce the spatial dimensions. Following the original paper, we used 4 Causal LSTM layers but reduced the size of all of them to 64 channels each to meet hardware memory constraints. We used convolutional kernels of size 3. The forecaster uses a deconvolutional layer with kernel size 7 and stride 4 to restore the internal state to the original dimensions.
b.4 U-Net
The encoder is composed of four blocks, each containing two convolutional layers with kernel size 3 and padding 1, followed by ReLU non-linearities. The first three blocks include a max-pooling layer of stride 2 that reduces the size by half. The number of feature maps doubles in each layer. For the expanding part, we use bilinear interpolation with scale factor 2 instead of deconvolutions to keep the number of parameters low. Skip connections are also employed to copy feature maps from earlier layers but, contrary to the original paper, we do not reduce the dimensions of the copied feature maps. This way, high-level, coarser feature maps are combined with fine-grained local information of lower layers over the whole domain. The network architecture can be seen in Figure 9.
Appendix C Hyperparameters
Each model takes a fixed number of input frames and produces a fixed number of output frames. For each training iteration, we randomly select sub-sequences from each simulated sequence. The models were trained to minimise the MSE over their respective output length. In each iteration, the weights are updated using an Adam optimiser while a scheduling scheme rescales the learning rate (LR) by a fixed factor if there is no improvement in validation error after a given number of epochs (patience). The hyperparameters of interest are the input length, the training output length, the number of samples per sequence between weight updates, the batch size, the LR and the patience. Grid search was used to find the best set of hyperparameters for each model. The training budget was 24 hours. To obtain an arbitrarily long prediction we feed the output back as the next input. The goal is to obtain networks with a low error in long-term prediction so, during model selection, we chose the hyperparameters that gave the lowest validation error over 50 frames regardless of the output size of the model. The final hyperparameters and model sizes can be found in Tables 4 and 5.
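The feedback loop used to obtain long predictions can be sketched as below. This is a schematic illustration under stated assumptions, not the released implementation: `rollout` is a hypothetical name, `model` is any callable that maps a list of input frames to a list of output frames, and the input window is assumed to be refilled with the most recent frames (ground truth first, then predictions).

```python
def rollout(model, initial_frames, n_steps):
    """Predict n_steps frames by repeatedly feeding predictions back as input.

    model: callable taking a list of n_in frames and returning a list of
           n_out predicted frames.
    initial_frames: the n_in ground-truth frames that seed the prediction.
    """
    n_in = len(initial_frames)
    window = list(initial_frames)
    predicted = []
    while len(predicted) < n_steps:
        outputs = list(model(window))
        predicted.extend(outputs)
        # Slide the window: keep only the most recent n_in frames.
        window = (window + outputs)[-n_in:]
    # The last call may overshoot, so trim to the requested length.
    return predicted[:n_steps]


# Toy stand-in for a network: "frames" are numbers, two outputs per call.
toy_model = lambda frames: [frames[-1] + 1, frames[-1] + 2]
print(rollout(toy_model, [1, 2, 3, 4, 5], 5))  # [6, 7, 8, 9, 10]
```

After the first few calls the window contains only predicted frames, which is why errors can accumulate over long horizons.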
Appendix D Model size and speed
Models were implemented in PyTorch and the code is publicly available on GitHub. Models were trained on a GTX 1060 GPU with 6GB of memory. Total training time includes evaluation overhead.
|Method||Time per frame (ms)||Speed-up|
Appendix E Results Addendum
e.1 Predicting the tank size from the latent space
Here we check if the trained U-Net acquired any understanding of the physical properties of the system. We focus on the tank size, or inversely the speed of the wave, for two reasons. First of all, the U-Net failed to extrapolate to different tank sizes. This experiment could provide some insights on why this failure happens. Secondly, tank size information is readily available. Each dataset sequence corresponds to a different tank size, and the tank is always square. In the training and testing datasets the tank size ranges from 10 to 20 meters. For the smaller tank we used 5 to 10 meters and for the bigger 20 to 40 meters (Table 3).
The question we try to answer is: does the latent representation of the U-Net capture the tank size information? We take the pre-trained encoder from the U-Net and add some additional layers so that the output is a single number (Figure 13). The system is trained to predict the tank size when given 5 consecutive frames. Only the additional layers are updated during training; the weights of the encoder are kept frozen. We compare the pre-trained encoder against a randomly initialised encoder. We also compare the models to a dummy regressor that always predicts the mean tank size for each dataset, i.e. 15 for the test set, 7.5 for the small tank and 30 for the big tank. Results in Table 6 indicate that the pre-trained encoder can be used to extract the tank size with relatively low error (0.14) while the random encoder gives a much higher error of 2.27, slightly lower than the dummy regressor (2.45). This indicates that the pre-trained encoder encapsulates physically relevant information relating to the tank size. When it comes to the bigger and smaller tanks, both the pre-trained and the random encoders fail to extrapolate and give errors higher than the dummy regressor.
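To put the dummy regressor's error in context, the sketch below estimates the error of always predicting the mean tank size. It rests on two assumptions that the text does not confirm: tank sizes are drawn uniformly from the stated range, and the reported error is a mean absolute error. The function names are hypothetical.

```python
import random


def mean_absolute_error(predictions, targets):
    """Average absolute difference between predictions and targets."""
    return sum(abs(p - t) for p, t in zip(predictions, targets)) / len(targets)


def dummy_regressor_mae(low, high, n_samples=100_000, seed=0):
    """Error of always predicting the mean of a uniform tank-size range."""
    rng = random.Random(seed)
    # Assumption: tank sizes are uniformly distributed in [low, high].
    sizes = [rng.uniform(low, high) for _ in range(n_samples)]
    mean_size = (low + high) / 2  # e.g. 15 for the [10, 20] m range
    return mean_absolute_error([mean_size] * n_samples, sizes)
```

For a uniform range the expected value is `(high - low) / 4`, i.e. 2.5 for the 10 to 20 m range, which is close to the 2.45 reported for the dummy regressor on the test set.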
|Test set||Bigger Tank||Smaller Tank|
- Code and data available at github.com/stathius/wave_propagation
- An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv preprint arXiv:1803.01271. Cited by: §1, §5.
- Machine learning and the physical sciences. Cited by: §1.
- Scikit-fdiff / skfdiff. https://gitlab.com/celliern/scikit-fdiff/ [Online; accessed 11-8-2019]. Cited by: Appendix A.
- Deep learning for physical processes: incorporating prior scientific knowledge. arXiv preprint arXiv:1711.07970. Cited by: §2, §3.
- FlowNet: Learning Optical Flow with Convolutional Networks. Technical report. Cited by: §2.
- Learning A Physical Long-term Predictor. Cited by: §2.
- A Saint-Venant shallow water model for overland flows with precipitation and recharge. Cited by: §1.
- FlowNet: learning optical flow with convolutional networks. arXiv preprint arXiv:1504.06852. Cited by: §3.
- Deep Fluids: A Generative Network for Parameterized Fluid Simulations. Cited by: §2.
- Deep multi-scale video prediction beyond mean square error. arXiv preprint arXiv:1511.05440. Cited by: §2.
- Approximation zones of the Saint-Venant equations for flood routing with overbank flow. Hydrology and Earth System Sciences Discussions 4 (2), pp. 251–260. Cited by: §1.
- Physics-induced graph neural network: An application to wind-farm power estimation. Energy 187, pp. 115883. Cited by: §2.
- Modeling stochastic systems using physics-informed deep generative models. Cited by: §2.
- Physics Informed Deep Learning (Part I): Data-driven Solutions of Nonlinear Partial Differential Equations. Cited by: §2.
- Convolutional LSTM Network: A Machine Learning Approach for Precipitation Nowcasting. Cited by: §2.
- Approximating the Solution of Surface Wave Propagation Using Deep Neural Networks. In INNS Big Data and Deep Learning. Cited by: Figure 5, Figure 6, §1, §2, §3.
- Unsupervised learning of video representations using LSTMs. In International Conference on Machine Learning, pp. 843–852. Cited by: §2.
- Deep Learning Methods for Reynolds-Averaged Navier-Stokes Simulations of Airfoil Flows. Cited by: §2, §3.
- Accelerating Eulerian fluid simulation with convolutional networks. In ICML, pp. 3424–3433. Cited by: §2.
- PredRNN++: Towards A Resolution of the Deep-in-Time Dilemma in Spatiotemporal Predictive Learning. Cited by: §2.
- PredRNN++: towards a resolution of the deep-in-time dilemma in spatiotemporal predictive learning. arXiv preprint arXiv:1804.06300. Cited by: §4.1.