Learning Similarity Metrics for Numerical Simulations
Abstract
We propose a neural networkbased approach that computes a stable and generalizing metric (LSiM), to compare field data from a variety of numerical simulation sources. Our method employs a Siamese network architecture that is motivated by the mathematical properties of a metric. We leverage a controllable data generation setup with partial differential equation (PDE) solvers to create increasingly different outputs from a reference simulation in a controlled environment. A central component of our learned metric is a specialized loss function that introduces knowledge about the correlation between single data samples into the training process. To demonstrate that the proposed approach outperforms existing simple metrics for vector spaces and other learned, imagebased metrics, we evaluate the different methods on a large range of test data. Additionally, we analyze benefits for generalization and the impact of an adjustable training data difficulty. The robustness of LSiM is demonstrated via an evaluation on three realworld data sets.
1 Introduction
Evaluating computational tasks for complex data sets is a fundamental problem in all computational disciplines. Regular vector space metrics, such as the distance, were shown to be very unreliable (Wang et al., 2004; Zhang et al., 2018), and the advent of deep learning techniques with convolutional neural networks (CNNs) made it possible to more reliably evaluate complex data domains such as natural images, texts (Benajiba et al., 2018), or speech (Wang et al., 2018). Our central aim is to demonstrate the usefulness of CNNbased evaluations in the context of numerical simulations. These simulations are the basis for a wide range of applications ranging from blood flow simulations to aircraft design. Specifically, we propose a novel learned simulation metric (LSiM) that allows for a reliable similarity evaluation of simulation data.
Potential applications of such a metric arise in all areas where numerical simulations are performed, or similar data is gathered from observations. For example, accurate evaluations of existing and new simulation methods with respect to a known ground truth solution (Oberkampf et al., 2004) can be performed more reliably than with a regular vector norm. Another good example is weather data for which complex transport processes and chemical reactions make inplace comparisons with common metrics unreliable (Jolliffe and Stephenson, 2012). Likewise, the longstanding, open questions of turbulence (Moin and Mahesh, 1998; Lin et al., 1998) can benefit from improved methods for measuring the similarity and differences in data sets and observations.
In this work, we focus on field data, i.e., dense grids of scalar values, similar to images, which were generated with known partial differential equations (PDEs) in order to ensure the availability of ground truth solutions. While we focus on 2D data in the following to make comparisons with existing techniques from imaging applications possible, our approach naturally extends to higher dimensions. Every sample of this 2D data can be regarded a high dimensional vector, so metrics on the corresponding vector space are applicable to evaluate similarities. These metrics, in the following denoted as shallow metrics, are typically simple, elementwise functions such as or distances. Their inherent problem is that they cannot compare structures on different scales or contextual information.
Many practical problems require solutions over time and need a vast number of nonlinear operations that often result in substantial changes of the solutions even for small changes of the inputs. Hence, despite being based on known, continuous formulations, these systems can be seen as chaotic. We illustrate this behavior in Fig. 1, where two smoke flows are compared to a reference simulation. A single simulation parameter was varied for these examples, and a visual inspection shows that smoke plume (a) is more similar to the reference. This matches the data generation process: version (a) has a significantly smaller parameter change than (b), as shown in the inset graph on the right. LSiM robustly predicts the ground truth distances while the metric labels plume (b) as more similar. In our work, we focus on retrieving the relative distances of simulated data sets. Thus, we do not aim for retrieving the absolute parameter change but a relative distance that preserves ordering with respect to this parameter.
Using existing image metrics based on CNNs for this problem is not optimal either: natural images only cover a small fraction of the space of possible 2D data, and numerical simulation outputs are located in a fundamentally different data manifold within this space. Hence, there are crucial aspects that cannot be captured by purely learning from photographs. Furthermore, we have full control over the data generation process for simulation data. As a result, we can create arbitrary amounts of training data with gradual changes and a ground truth ordering. With this data, we can learn a metric that is not only able to directly extract and use features but also encodes interactions between them. The central contributions of our work are as follows:

We propose a Siamese network architecture with feature map normalization, which is able to learn a metric that generalizes well to unseen simulations methods.

We propose a novel loss function that combines a correlation loss term with a mean squared error to improve the accuracy of the learned metric.

In addition, we show how a data generation approach for numerical simulations can be employed to train networks with general and robust feature extractors for metric calculations.
2 Related Work
One of the earliest methods to go beyond using simple metrics based on norms for natural images was the structural similarity index (Wang et al., 2004). Despite improvements, this method can still be considered a shallow metric. Over the years multiple large databases for human evaluations of natural images were presented, for instance CSIQ (Larson and Chandler, 2010), TID2013 (Ponomarenko et al., 2015), and CID:IQ (Liu et al., 2014). With this data and the discovery that CNNs can create very powerful feature extractors that are able to recognize patterns and structures, deep feature maps quickly became established as means for evaluation (Amirshahi et al., 2016; Berardino et al., 2017; Bosse et al., 2016; Kang et al., 2014; Kim and Lee, 2017). Recently, these methods were improved by predicting the distribution of human evaluations instead of directly learning distance values (Prashnani et al., 2018; Talebi and Milanfar, 2018b). Zhang et al. compared different architecture and levels of supervision, and showed that metrics can be interpreted as a transfer learning approach, by applying a linear weighting to the feature maps of any network architecture to form the image metric LPIPS v0.1. Typical use cases of these imagebased CNN metrics are computer vision tasks like detail enhancement (Talebi and Milanfar, 2018a), style transfer, and superresolution (Johnson et al., 2016). Generative adversarial networks also leverage CNNbased losses by training a discriminator network in parallel to the generation task (Dosovitskiy and Brox, 2016).
Siamese network architectures are known to work well for a variety of comparison tasks such as audio (Zhang and Duan, 2017), satellite images (He et al., 2019) or the similarity of interior product designs (Bell and Bala, 2015). Furthermore, they yield robust object trackers (Bertinetto et al., 2016), algorithms for image patch matching (Hanif, 2019), and for descriptors for fluid flow synthesis (Chu and Thuerey, 2017). Inspired by these works we use a similar Siamese neural network architecture for our metric learning task. In contrast to other work on selfsupervised learning that utilizes spatial or temporal changes to learn meaningful representations (Agrawal et al., 2015; Wang and Gupta, 2015), our method does not rely on tracked keypoints in the data.
While correlation terms have been used for learning joint representations by maximizing correlation of projected views (Chandar et al., 2016), and are popular for style transfer applications via the Gram matrix (Ruder et al., 2016), they were not used for learning distance metrics. As we demonstrate below, they can yield significant improvements in terms of the inferred distances.
Similarity metrics for numerical simulations are a topic of ongoing investigation. A variety of specialized metrics have been proposed to overcome the limitations of norms, such as the displacement and amplitude score from the area of weather forecasting (Keil and Craig, 2009), and permutation based metrics for energy consumption forecasting (Haben et al., 2014). Turbulent flows, on the other hand, are often evaluated in terms of aggregated frequency spectra (Pitsch, 2006). Crowdsourced evaluations based on the human visual system were also proposed to evaluate simulation methods for physicsbased animation (Um et al., 2017), and for comparing nonoscillatory discretization schemes (Um et al., 2019). These results indicate that visual evaluations in the context of field data are possible and robust, but they require extensive (and potentially expensive) user studies. Additionally, our method naturally extends to higher dimensions, while human evaluations inherently rely on projections with at most two spatial and one time dimension.
3 Constructing a CNNbased Metric
In the following, we explain our considerations when employing CNNs as evaluation metrics. For a comparison that corresponds to our intuitive understanding of distances, an underlying metric has to obey certain criteria. More precisely, a function is a metric on its input space if it satisfies the following properties :
nonnegativity  (1)  
symmetry  (2)  
triangle ineq.  (3)  
identity of indisc.  (4) 
The properties (1) and (2) are crucial as distances should be symmetric and have a clear lower bound. Eq. (3) ensures that direct distances cannot be longer than a detour. Property (4), on the other hand, is not really useful for discrete operations as approximation errors and floating point operations can easily lead to a distance of zero for slightly different inputs. Hence, we focus on a relaxed, more meaningful definition , which leads to a socalled pseudometric. It allows for a distance of zero for different inputs but has to be able to spot identical inputs.
We realize these requirements for a pseudometric with an architecture that follows popular perceptual metrics such as LPIPS: The activations of a CNN are compared in latent space, accumulated with a set of weights, and the resulting perfeature distances are aggregated to produce a final distance value. Fig. 2 gives a visual overview of this process.
3.1 Base Network
The sole purpose of the base network is to extract feature maps from both inputs. The Siamese architecture implies that the weights of the base network are shared for both inputs, meaning all feature maps are comparable. In addition, this ensures the identity of indiscernibles because, for identical feature maps, a distance of zero is guaranteed by the following operations. In the last years, a variety of powerful CNNbased feature extraction architectures were proposed. We experimented with various networks, such as AlexNet (Krizhevsky et al., 2017), VGG (Simonyan and Zisserman, 2015), SqueezeNet (Iandola et al., 2016), and a fluid flow prediction network (Thuerey et al., 2018). In all cases, only the feature extracting layers are used, and the remaining layers, which are responsible for the original task (e.g., classification), are discarded. Building on this previous work, we consider three variants of the networks below: using the original pretrained weights, finetuning them, or retraining the full networks from scratch. In contrast to typical CNN tasks where only the result of the final output layer is further processed, we make use of the full range of extracted features across the layers of a CNN (see Fig. 2). This implies a slightly different goal: while early features should be general enough to allow for extracting more complex features in deeper layers, this is not their sole purpose. Rather, features in earlier layers of the network can directly participate in the final distance calculation and can yield important cues. As we will demonstrate below, we achieved the best performance for our data sets using a custom architecture with five layers, similar to a reduced AlexNet, that was trained from scratch (see App. B.1).
3.2 Feature Map Normalization
The goal of normalizing the feature maps is to transform the extracted features of each layer, which typically have very different orders of magnitude, into comparable ranges. While this task could potentially be performed by the learned weights, we found the normalization to yield improved performance in general (see App. B.2). Zhang et al. proposed a length unit normalization using a division by the Euclidean norm in channel dimension to only measure the angle between the latent space vectors with a cosine distance. Instead, we suggest to interpret all possible feature maps as a normal distribution and to normalize them to a standard normal distribution. This is achieved via a preprocessing step using the full training data set: we subtract the mean of the feature maps and then divide by their standard deviation in channel dimension for each layer. As a result, we can measure angles for the latent space vectors and compare their magnitude in the global length distribution.
3.3 Latent Space Differences
Combining the latent spaces representations that consist of all extracted features from the two inputs lies at the core of the metric computation. Here, the most obvious approach to employ an elementwise difference, i.e., , is not advisable as this would directly violate the metric properties above. Instead, nonnegativity and symmetry can be ensured via or . We found that both work equally well in practice. Considering the importance of comparing the extracted features, the simple operations used for comparing the features do not seem optimal. Rather, one can imagine that improvements in terms of comparing one set of feature activations could lead to overall improvements for derived metrics. Hence, we experimented with replacing these operations with a pretrained CNNbased metric for each feature map. This creates a recursive process or ”metametric”, that reformulates the initial problem of learning input similarities in terms of learning feature space similarities. However, as detailed in App. B.3, we have not found this recursive approach to yield any substantial improvements. This implies that once a large enough number of expressive features is available for comparison, the inplace difference of each feature is sufficient to compare two inputs. In the following, we compute the feature difference maps (Fig. 2, in yellow) via .
3.4 Aggregations
The subsequent aggregation operations (Fig. 2, in green) are applied to the difference maps to compress the contained per feature differences along the different dimensions into a single distance value. The aggregation operations only need to preserve the metric properties already established via the latent space difference. To aggregate the difference maps along the channel dimension, we found the weighted average proposed by Zhang et al. to work very well. Thus, we use one learnable weight to control the importance of a feature. The weight is a multiplier for the corresponding difference map before summation along the channel dimension. To preserve nonnegativity and the triangle inequality, the weights are clamped to be nonnegative. A negative weight would mean that a larger difference in this feature produces a smaller overall distance, which is not helpful. For spatial and layer aggregation, functions like a summation or averaging are sufficient and generally interchangeable. We tested more intricate aggregation functions, e.g., by learning a spatial average or determining layer importance weights dynamically from the inputs. When the base network is fixed and the metric only has very few trainable weights, this did improve the overall performance. But, with a fully trained base network, the feature extraction seems to automatically adopt these aspects making a more complicated aggregation unnecessary.
Additional details for the steps above are given in App. A, where we also show that the proposed Siamese architecture by construction qualifies as a pseudometric.
4 Data Generation and Training
Similarity data sets for natural images typically rely on changing already existing images with distortions, noise, or other operations and assigning ground truth distances according to the strength of the operation. Since we can control the data creation process for numerical simulations directly, we can generate large amounts of simulation data with increasing dissimilarities by altering the parameters used for the simulations. As a result, the data contains more information about the nature of the problem, i.e., which changes of the data distribution should lead to increased distances, than by applying modifications as a postprocess.
4.1 Data Generation
Given a set of model equations, e.g., a PDE from fluid dynamics, typical solution methods consist of a solver that, given a set of boundary conditions, computes discrete approximations of the necessary differential operators. The discretized operators and the boundary conditions typically contain problem dependent parameters, which we collectively denote with in the following. We only consider time dependent problems, and our solvers start with initial conditions at to compute a series of time steps until a target point in time () is reached. At that point, we obtain a reference output field from one of the PDE variables, e.g., a velocity.
For data generation, we incrementally change a single parameter in steps to create a series of outputs . We consider a series obtained in this way to be increasingly different from . To create natural variations of the resulting data distributions, we add Gaussian noise fields with zero mean and adjustable variance to an appropriate simulation field such as a velocity. This noise allows us to generate a large number of varied data samples for a single simulation parameter . In addition, it is similar in nature to numerical errors introduced by discretization schemes. Thus, these perturbations enlarge the space covered by the training data, and we found that training networks with suitable noise levels improves robustness as we will demonstrate below. The process for data generation is summarized in Fig. 3.
As PDEs can model extremely complex and chaotic behaviour, there is no guarantee that the outputs always exhibit increasing dissimilarity with the increasing parameter change. This behaviour is what makes the task of similarity assessment so challenging. Even if the solutions are essentially chaotic, their behaviour is not arbitrary but rather governed by the rules of the underlying PDE. For our data set, we choose the following range of representative PDEs: We include a pure AdvectionDiffusion model (AD) and Burger’s equation (BE), which introduces a viscosity term. Furthermore, we use the full NavierStokes equations (NSE), which introduce a conservation of mass constraint. When combined with a deterministic solver and a suitable parameter step size, all these PDEs exhibit chaotic behaviour at small scales and the medium to large scale characteristics of the solutions shift smoothly with increasing changes of the parameters . The noise amplifies the chaotic behaviour to larger scales to create an environment with a controlled amount of perturbations. This lets the network learn about the nature of the chaotic behaviour of PDEs without overwhelming it with data where patterns are not observable anymore. The latter can easily happen when or grows too large and produce essentially random outputs. Instead, we specifically target solutions that are difficult to evaluate in terms of a shallow metric. We choose the smallest and such that the ordering of several random output samples with respect to their difference drops below a correlation value of 0.8.
4.2 Training
For training, the 2D scalar fields from the simulations were individually normalized and augmented with random color maps, flips, rotations, and cropping to obtain an input size of . Afterwards, each input is standardized to a standard normal distribution by subtracting the mean and dividing by the standard deviation (precomputed from all training data). Unless noted otherwise, networks were trained with a batch size of 1 for 40 epochs with the Adam optimizer using a learning rate of that was reduced to after 15 epochs. To evaluate the trained networks on validation and test inputs, only a bilinear resizing and the standardization step is applied.
5 Correlation Loss Function
The central goal of our networks is to identify relative differences of input pairs produced via numerical simulations. Thus, instead of employing a loss that forces the network to only infer given labels or distance values, we train our networks to infer the ordering of a given sequence of varying inputs . We propose to use the Pearson correlation coefficient (see Pearson, 1920), which yields a value in that measures the linear relationship between two distributions. A value of implies that a linear equation describes their relationship perfectly. We compute this coefficient for a full series of outputs such that the network can learn to extract features that arrange this data series in the correct ordering. Each training sample of our network consists of every possible pair of inputs of the sequence and the corresponding ground truth distance distribution representing the parameter change from the data generation. For a distance prediction of our network for one sample, we compute the loss with:
(5) 
Here, the mean of a distance vector is denoted by and for ground truth and prediction, respectively. The first part of the loss is a regular MSE term, which minimizes the difference between predicted and actual distances. The second part is the Pearson correlation coefficient, which is inverted such that the optimization results in a maximization of the correlation. As this formulation depends on the length of the input sequence, the two terms are scaled to adjust their relative influence with and . For the training, we chose 10 variations for each reference simulation, i.e., . If should vary during training, the influence of both terms needs to be adjusted accordingly. We found that scaling both terms to a similar order of magnitude worked best in our experiments.
In Fig. 4, we investigate how the proposed loss function compares to other commonly used loss formulations for our full network and a pretrained network. In addition to our full loss function, we consider a loss function that replaces the Pearson correlation with a simpler crosscorrelation . We also include networks trained with only the MSE or only the correlation terms for each of the two variants.
As shown in Fig. 4, a simple MSE loss yields a low accuracy of less than . Using any correlation based loss function for the AlexNet\textsubscriptfrozen metric (see Section 6.2) improves the results, but there is no major difference due to the limited number of only 1152 trainable weights. For LSiM, the proposed combination of MSE loss with the Pearson correlation performs significantly better than using crosscorrelation or a variant without the MSE loss. Interestingly, combining cross correlation with MSE yields worse results than cross correlation alone. This is caused by the cross correlation term influencing absolute distance values, which potentially conflicts with the MSE term. For our loss, the Pearson correlation only handles the relative ordering while MSE deals with the absolute distances, leading to an improved correlation for the inferred distances.
6 Results
In the following, we will discuss how the data generation approach was employed to create a large range of training and test data from different PDEs. Afterwards, the proposed metric is compared to other metrics, and its robustness is evaluated with several external data sets.
6.1 Data Sets
We created four training (Smo, Liq, Adv and Bur) and two test data sets (LiqN and AdvD) with ten parameter steps for each reference simulation. Based on two 2D NSE solvers, the smoke and liquid simulation training sets (Smo and Liq) add noise to the velocity field and feature varied initial conditions such as fluid position or obstacle properties, in addition to variations of buoyancy and gravity forces. The two other training sets (Adv and Bur) are based on 1D solvers for AD and BE, concatenated over time to form a 2D result. In both cases, noise was injected into the velocity field, and the varied parameters are changes to the field initialization and forcing functions.
For the test data set, we substantially change the data distribution by injecting noise into the density instead of the velocity field for AD simulations to obtain the AdvD data set and by including background noise for the velocity field of a liquid simulation (LiqN). In addition, we employed three more test sets (Sha, Vid, and TID) created without PDE models to explore the generalization for data far from our training data setup. We include a shape data set that features multiple randomized moving rigid shapes, a video data set consisting of frames from random video footage, and the perceptual image data set TID2013 (Ponomarenko et al., 2015). Below, we additionally list a combined correlation score for all test sets (All) apart from TID, which is excluded due to its different structure.
6.2 Performance Evaluation
To evaluate the performance of a metric on a data set, we first compute the distances from each reference simulation to all corresponding variations. Then, the predicted and the ground truth distance distributions over all samples are combined and compared using Spearman’s rank correlation coefficient (see Spearman, 1904). Like the Pearson correlation, it is a value in to compare distributions, but it measures the correlation between ranking variables, i.e., monotonic relationships.
Metric  Validation data sets  Test data sets  
Smo  Liq  Adv  Bur  TID  LiqN  AdvD  Sha  Vid  All  
0.67  0.80  0.72  0.59  0.83  0.72  0.58  0.50  0.77  0.56  
SSIM  0.67  0.75  0.75  0.68  0.80  0.25  0.70  0.35  0.70  0.48 
LPIPS v0.1.  0.72  0.75  0.75  0.72  0.80  0.63  0.60  0.81  0.82  0.66 
AlexNet\textsubscriptrandom  0.64  0.75  0.67  0.64  0.84  0.64  0.67  0.61  0.78  0.62 
AlexNet\textsubscriptfrozen  0.67  0.70  0.68  0.70  0.79  0.40  0.64  0.84  0.81  0.62 
Optical flow  0.62  0.57  0.36  0.37  0.55  0.49  0.28  0.61  0.75  0.48 
NonSiamese  0.72  0.82  0.76  0.69  0.29  0.73  0.62  0.60  0.72  0.63 
Skip\textsubscriptfrom scratch  0.65  0.84  0.74  0.67  0.80  0.79  0.59  0.83  0.79  0.70 
LSiM\textsubscriptnoiseless  0.64  0.83  0.74  0.60  0.80  0.81  0.58  0.84  0.76  0.69 
LSiM\textsubscriptstrong noise  0.63  0.82  0.71  0.61  0.80  0.78  0.50  0.80  0.77  0.65 
LSiM (ours)  0.68  0.82  0.76  0.70  0.78  0.80  0.61  0.85  0.76  0.71 
The top part of Tab. 1 shows the performance of the shallow metrics and SSIM as well as the LPIPS metric (Zhang et al., 2018) for all our data sets. The results clearly show that shallow metrics are not suitable to compare the samples in our data set and only achieve good correlation values on the TID2013 data set, which contains a large number of pixelbased image variations without contextual structures. The perceptual LPIPS metric performs better in general and outperforms our method on the image data sets Vid and TID. This is not surprising as LPIPS is specifically trained for such images. For the simulation data sets LiqN and Sha, however, it performs significantly worse than for the image content. The last row of Tab. 1 shows the results of our LSiM network with a very good performance across all data sets. Note that although it was not trained with any natural image content, it still performs well for the image test sets.
The middle block of Tab. 1 contains several interesting variants: AlexNet\textsubscriptrandom and AlexNet\textsubscriptfrozen are small models, where the base network is the original AlexNet with pretrained weights. AlexNet\textsubscriptrandom contains purely random aggregation weights without training, whereas AlexNet\textsubscriptfrozen only has trainable weights for the channel aggregation and therefore lacks the flexibility to fully adjust to the data distribution of the numerical simulations. The random model performs surprisingly well, pointing to powers of the underlying CNN architecture.
Recognizing that many PDEs include transport phenomena, we investigated optical flow (Horn and Schunck, 1981) as a means to compute motion from field data. For the Optical flow metric, we used FlowNet2 (Ilg et al., 2016) to bidirectionally compute the optical flow field between two inputs and aggregate it to a single distance value by summing all flow vector magnitudes. On the data sets Sha and Vid that are similar to the training data of FlowNet2, it performs relatively well, but in most other cases, it performs poorly. This shows that computing a simple warping from one input to the other is not enough for a stable metric although it seems like an intuitive solution. A more robust metric needs the knowledge of the underlying features and their changes to generalize better to new data.
For the NonSiamese metric, we used a nonSiamese architecture that directly predicts the distance from both inputs to evaluate whether a Siamese architecture is really beneficial. For this purpose, we employed a modified version of AlexNet that reduces the weights of the feature extractor by 50% and of the remaining layers by 90%. As expected, this metric works great on the validation data but has huge problems with generalization. In addition, even simple metric properties like symmetry are no longer guaranteed because this architecture does not have the inherent constraints of the Siamese setup. Finally, we experimented with multiple fully trained base networks. As retraining existing feature extractors only provided small improvements, we used a custom base network with skip connections for the Skip\textsubscriptfrom scratch metric. Its results already come close to the proposed approach on most data sets.
The last block in Tab. 1 shows variants of the proposed approach, trained with varied noise levels. This inherently changes the difficulty of the data. Hence, LSiM\textsubscriptnoiseless was trained with relatively simple data without perturbations, whereas LSiM\textsubscriptstrong noise was trained with strongly varying data. Both cases decrease the generalizing capabilities of the trained model on the test data. This indicates that the network needs to see a certain amount of variation at training time in order to become robust, but overly large changes hinder the learning of useful features (see App. B and C).
6.3 Evaluation on RealWorld Data
To evaluate the generalizing capabilities of our trained metric, we turn to three representative and publicly available data sets of captured and simulated realworld phenomena, namely buoyant flows, turbulence, and weather. For the former, we make use of the ScalarFlow data set from Eckert et al., which consists of captured velocities of buoyant scalar transport flows. Additionally, we include data from the Johns Hopkins Turbulence Database (JHTDB) (Perlman et al., 2007), which represents direct numerical simulations of fully developed turbulence. As a third case, we use temperature and geopotential fields from the WeatherBench repository (Rasp et al., 2020), which contains global climate data on a Cartesian latitudelongitude grid of the earth.
For the evaluation, we extracted sequences of frames with fixed temporal and spatial intervals from each data set to obtain a ground truth ordering. We use six different interval spacings with 60240 sequences each, yielding more than 360 evaluations for every data source (details in App. E). We then measure how well our metric recovers the original ordering in the presence of the complex changes of content, driven by the underlying physical processes. We employ the pretrained LSiM metric model that was generated as outlined in the previous sections. It recovers the ordering of all three cases with high accuracy yielding averaged correlations of , , and for ScalarFlow, JHTDB, and WeatherBench, respectively. Visual examples of the three data sets are shown on the left of Fig. 6 while the right side summarizes the robustness of the LSiM evaluation.
7 Conclusion
We have presented the LSiM metric to reliably and robustly compare outputs from numerical simulations. Our method significantly outperforms existing shallow metric functions and provides better results than other learned metrics. We demonstrated the usefulness of the correlation loss, showed the benefits of a controlled data generation environment, and highlighted the stability of the obtained metric for a range of realworld data sets.
Our trained LSiM metric has the potential to impact a wide range of fields: from the fast and reliable accuracy assessment of new simulation methods, over robust optimizations of parameters for reconstructions of observations, to guiding generative models of physical systems. Furthermore, it will be highly interesting to evaluate other loss functions, e.g., mutual information (Bachman et al., 2019) or contrastive predictive coding (Hénaff et al., 2019). We also plan to evaluate our approach for an even larger collection of PDEs as well as for 3D and 4D data sets. Especially turbulent flows are a highly relevant and interesting area for future work on learned evaluation metrics.
This supplemental document contains an analysis of the proposed metric design with respect to properties of metrics in general (App. A), and details to the used network architectures (App. B). Afterwards, material that deals with the data sets is provided. It contains examples and failure cases for each of the data domains and analyzes the impact of the data difficulty (App. C and D). Next, the evaluation on realworld data is described in more detail (App. E). Finally, we explore additional metric evaluations (App. F), and give an overview on the used notation (App. G).
The source code for using and evaluating the trained LSiM metric, and some examples from each data set are published alongside this submission. The full source code for retraining the model, the data generation scripts, and the entire data sets will be published in the near future.
Appendix A Discussion of Metric Properties
To analyze if the proposed method qualifies as a metric, it is split in two functions and which operate on the input space and the latent space . Through flattening elements from the input or latent space into vectors, and where and are the dimensions of the input data or all feature maps respectively, and both values have a similar order of magnitude. describes the nonlinear function computed by the base network combined with the following normalization and returns a point in the latent space. uses two points in the latent space to compute a final distance value, so it includes the latent space difference and the aggregation along the spatial, layer, and channel dimensions. With the Siamese network architecture the resulting function for the entire approach is
The identity of indiscernibles mainly depends on because even if itself guarantees this property, could still be noninjective, which means it can map different inputs to the same point in latent space for . Due to the complicated nature of it is difficult to make accurate predictions about the injectivity of . Each base network layer of recursively processes the result of the preceding layer with various feature extracting operations so intuitively, significant changes in the input should produce different feature map results in some layer. But very small changes in the input do lead to zero valued distances predicted by the CNN (i.e. an identical latent space for different inputs), meaning is in practice not injective. In an additional experiment, the proposed architecture was evaluated on about 3500 random inputs from all our data sets, where the CNN received one unchanged and one slightly modified input. The modification consisted of multiple pixel adjustments by one bit (on 8bit color images) in random positions and channels. When adjusting only a single pixel in the input, the CNN predicts a zero valued distance on about 23% of the inputs, but we never observed an input where seven or more changed pixels resulted in a distance of zero in all experiments.
In this context, the problem of numerical errors is important, because even two slightly different latent space representations could lead to a result that seems to be zero if the difference vanishes in the aggregation operations or is smaller than the floating point precision. On the other hand, an automated analysis to find points that have a different input but an identical latent space image is a challenging problem and left as future work.
The evaluation of the base network and the normalization is deterministic, and hence holds. Further, we know that if guarantees that . Thus, the remaining properties nonnegativity, symmetry, and the triangle inequality only depend on , since for them the original inputs are not relevant, only their respective images in the latent space. The resulting structure with a relaxed identity of indiscernibles is called a pseudometric, where :
(6)  
(7)  
(8)  
(9) 
Notice, that has to fulfill these properties with respect to the latent space and not the input space. If is carefully constructed the metric properties still apply, independently of the actual design of the base network or the feature map normalization.
A first observation concerning is that if all aggregations were sum operations and the elementwise latent space difference was the absolute value of a difference operation, would be equivalent to computing the norm of the difference vector in latent space.
Similarly, adding a square operation to the elementwise distance in the latent space and computing the square root at the very end leads to the norm of the latent space difference vector. In the same way, it is possible to use any norm with the corresponding operations.
In both cases, this forms the metric induced by the corresponding norm which by definition has all desired properties (6), (7), (8), and (9). If we change all aggregation methods to a weighted average operation, each term in the sum is multiplied by a weight . This is even possible with learned weights, as they are constant at evaluation time, if they are clamped to be positive as described above. Now, can be attributed to both inputs by distributivity, meaning each input is elementwise multiplied with a constant vector before applying the metric, which leaves the metric properties untouched. The reason is that it is possible to define new vectors in the same space, equal to the scaled inputs. This renaming trivially provides the correct properties.
Accordingly, doing the same with the norm idea is possible, each just needs a suitable adjustment before distributivity can be applied, keeping the metric properties once again.
With these weighted terms for , it is possible to describe all used aggregations and latent space difference methods. The proposed method deals with multiple higher order tensors instead of a single vector, so the weights additionally depend on constants like the direction of the aggregations and their position in the latent space tensors. But it is easy to see that mapping a higher order tensor to a vector and keeping track of additional constants still retains all properties in the same way. As a result, the described architecture by design yields a pseudometric that is suitable for comparing simulation data in a way that corresponds to our intuitive understanding of distances.
Appendix B Architectures
The following sections provide details regarding the architecture of the base network and some experimental design.
b.1 Base Network Design
Fig. 7 shows the architecture of the base network for the LSiM metric. Its purpose is extracting features from both inputs of the Siamese architecture that are useful for the further processing steps. To maximise the usefulness and to avoid feature maps that show overly similar features, the chosen kernel size and stride of the convolutions is important. Starting with larger kernels and strides means the network has a big receptive field and can consider simple, lowlevel features in large regions of the input. For the two following layers, the large strides are replaced by additional MaxPool operations that serve a similar purpose and reduce the spatial size of the feature maps.
For the three final layers only small convolution kernels and strides are used, but the number of channels is significantly larger than before. These deep features maps typically contain highlevel structures, which are most important to distinguish complex changes in the inputs. Keeping the number of trainable weights as low as possible was an important consideration for this design, to prevent overfitting to certain simulations types and increase generality. We explored a weight range by using the same architecture and only scaling the number of feature maps in each layer. The final design shown in Fig. 7 with about 0.62 million weights worked best for our experiments.
In the following, we analyze the contributions of the perlayer features of two different metric networks to highlight differences in terms of how the features are utilized for the distance estimation task. In Fig. 8 the learned feature map aggregation weights of our LSiM network show a very similar mean and a small standard deviation throughout the five layers. This means, all feature maps similarly contribute to establishing the distances, and the aggregation just finetunes the relative importance of each feature. In addition, all features receive a weight greater than zero, and as a result no feature is excluded from contributing to the final distance value.
Employing a fixed pretrained feature extractor on the other hand shows a very different picture: Although the mean across the different network layers is similar, the contributions of different features vary strongly, which is visible in the standard deviation being significantly larger. Furthermore, 2–10% of the feature maps in each layer receive a weight of zero and hence were deemed not useful at all for establishing the distances. This illustrates the usefulness of a targeted network in which all features contribute to the distance inference.
b.2 Feature Map Normalization
Here we analyze how feature maps are best normalized for the ”normalization” step of our algorithm (cf. Figure 2 of the main paper). Let denote a 3^{rd} order feature tensor with dimensions from one layer of the base network. We can form a series for every possible content of this tensor across our training samples (computed as a preprocessing step). Below, we evaluate three different normalization methods and also consider not normalizing at all, denoted by
The normalization only happens in the channel dimension, so all following operations accumulate values along while keeping and constant, i.e., are applied independently of the spatial dimensions. The unit length normalization proposed by Zhang et al.
only considers the current sample. In this case is a 2^{nd} order tensor with the Euclidean norms of along the channel dimension. Combined with summation as the aggregation operation in channel direction, this results in a cosine distance which only measures angles of the latent space vectors. Using a learned average instead means the angles are no longer uniform, but warped according to the importance of each feature (i.e. the resulting angle changes differently for the same amount of change in two separate features). Extending this idea to consider other training samples as well, leads to a global unit length normalization
where the maximum Euclidean norm of all available samples is employed. As a result, not only the angle of the latent space vectors, but also their magnitude compared to the largest feature vector is available in the aggregation. This formulation is not really robust yet, because the largest feature vector could be an outlier w.r.t. the typical content. Instead, we can consider the full feature vector as a normal distribution and transform it to a standard normal distribution with the proposed
In addition to the angle, this formulation allows for a robust comparison of the magnitude of each feature vector in the global magnitude distribution.
Fig. 9 shows a comparison of these normalization methods on the combined test data. Using no normalization is detrimental in both cases as succeeding operations can not reliably compare the features. Interestingly, the unit length normalization works best for the AlexNet\textsubscriptfrozen metric (similar to LPIPS from Zhang et al.) that only uses learned aggregation weights with a fixed AlexNet feature extractor. This observation allows for a conclusion about the features extracted by AlexNet. For the original task of image classification, the magnitude of a feature vector does not seem to carry information about the feature. Interpreting the length as part of the feature for our task in and , obviously harms the performance of the metric. Therefore, training the feature extractor such that the magnitude of the feature vectors bears some meaning should improve the results for the complex normalizations. The performance of our approach with a fully trained feature extractor in Fig. 9 shows exactly this behaviour: A more complex normalization directly yields better results since the features can be adapted to utilize it.
b.3 Recursive ”MetaMetric”
Since comparing the feature maps is a central operation of the proposed metric calculations, we experimented with replacing it with an existing CNNbased metric. In theory, this would allow for a recursive, arbitrarily deep network that repeatedly invokes itself: first, the deep representations of inputs are used, then the deep representations of the deep representations, etc. In practice, however, using more than one recursion step is currently not feasible due to increasing computational requirements in addition to vanishing gradients.
Fig. 10 shows how our computation method can be modified for a CNNbased latent space difference, instead of an elementwise operation. Here we employ LPIPS (Zhang et al., 2018). There are two main differences compared to proposed method: First, the LPIPS latent space difference creates single distance values for a pair of feature maps instead of a spatial feature difference. As a result, the following aggregation is a single learned average operation and spatial or layer aggregations are no longer necessary. We also performed experiments with a spatial LPIPS version here, but due to memory limitations these were not successful. Second, the convolution operations in LPIPS have a lower limit for spatial resolution, and some feature maps of our base network are quite small (see Fig. 7). Hence, we upscale the feature maps below the required spatial size of using nearest neighbor interpolation.
On our combined test data, such a metric with a fully trained base network achieves a performance comparable to AlexNet\textsubscriptrandom or AlexNet\textsubscriptfrozen.
b.4 Optical Flow Metric
In the following, we describe our approach to compute a metric via optical flow (OF). For an efficient OF evaluation we employed a pretrained network (Ilg et al., 2016). From an OF network with two inputs data fields , we get the flow vector field , where and denote the location and and denote the components of the flow vectors. In addition, we have a second flow field computed from the reversed input ordering. We can now define a function :
Intuitively, this function computes the sum over the magnitudes of all flow vectors in both vector fields. With this definition, it is obvious that fulfills the metric properties of nonnegativity and symmetry (see Eq. (6) and (7)). Under the assumption that identical inputs create a zero flow field, a relaxed identity of indiscernibles holds as well (see Eq. (9)). Compared to the proposed approach there is no guarantee for the triangle inequality though, so only qualifies as a pseudosemimetric.
Fig. 11 shows flow visualizations on data examples produced by FlowNet2. The metric works relatively well for inputs that are similar to the training data from FlowNet2, like the shape data example in the top row. For data that provides some outline, for instance the smoke simulation example in the middle row or also liquid data, the metric does not work as well but still provides a reasonable flow field. But for full spatial examples like from the Burger’s or AdvectionDiffusion equation (see bottom row) the network is no longer able to produce meaningful flow fields. The results are often a very uniform flow with similar magnitude and direction.
b.5 NonSiamese Architecture
To compute a metric without the Siamese architecture outlined above, we use a network structure with a single output, as shown in Fig. 12. Thus, instead of having two identically feature extractors and combining the feature maps, here the distance is directly predicted from the stacked inputs with a single network with about 1.24 million weights. After using the same feature extractor as described in Section B.1, the final set of feature maps is spatially reduced with an adaptive MaxPool operation. Next, the result is flattened and three consecutive fully connected layers process the data to form the final prediction. Here, the last activation function is a sigmoid instead of ReLU. The reason is that a ReLU would clamp every negative intermediate value to a zero distance, while a sigmoid compresses the intermediate value to a small distance that is more meaningful than directly clamping it.
In terms of metric properties, this architecture only provides nonnegativity (see Eq. (6)) due to the final sigmoid function. All other properties can not be guaranteed without further constraints. This is the main disadvantage of a nonSiamese network. These issues could be alleviated with specialized training data or by manually adding constraints to the model, e.g., to have some amount of symmetry (see Eq. (7)) and at least a weakened identity of indiscernibles (see Eq. (9)). However, compared to a Siamese network that guarantees them by design, these extensions are clearly suboptimal. As a result of the missing properties, this network has significant problems with generalization. While it performs well on the training data, the performance noticeably deteriorates for several of the test data sets.
b.6 Skip Connections in Base Network
As explained above, our base network primarily serves as a feature extractor to produce activations that are employed to evaluate a learned metric. In many stateoftheart methods, networks with skip connections are employed (Ronneberger et al., 2015; He et al., 2016; Huang et al., 2017), as experiments have shown that these connections help to preserve information from the inputs. In our case, the classification ”output” of a network such as the AlexNet plays no actual role. Rather, the features extracted along the way are crucial. Hence, skip connections should not improve the inference task for our metrics. To verify that this is the case, we have included tests with a base network similar to the popular UNet architecture (Ronneberger et al., 2015). For our experiments, we kept the early layers closely in line with the feature extractors that worked well for the base network (see Section B.1). Only the layers in the decoder part have an increased spatial feature map size to accommodate the skip connections. As expected, this network can be used to compute reliable metrics for the input data without negatively affecting the performance. However, as expected, the improvements of skip connections for regular inference tasks do not translate into improvements for the metric calculations.
Appendix C Impact of Data Difficulty
We shed more light on the aspect of noise levels and data difficulty via six reduced data sets, that consist of a smaller amount of Smoke and AdvectionDiffusion data with differently scaled noise strength values. Results are shown in Fig. 14. Increasing the noise level creates more difficult data as shown by the dotted and dashed plots representing the performance of the and the LPIPS metric on each data set. Both roughly follow an exponentially decreasing function. Each point on the solid line plot is the test result of a reduced LSiM model trained on the data set with the corresponding noise level. Apart from the data, the entire training setup was identical. This shows that the training process is very robust to the noise, as the result on the test data only slowly decreases for very high noise levels. Furthermore, small amounts of noise improve the generalization compared to a the model that was trained without any noise. This is somewhat expected, as a model that never saw noisy data during training can not learn to extract features which are robust w.r.t. noise.
Appendix D Data Set Details
In the following sections the generation of each used data set is described. For each figure showing data samples (consisting of a reference simulation and several variants with a single changing initial parameter), the leftmost image is the reference and the images to the right show the variants in order of increasing parameter change. For the figures 15, 16, 17, and 18 the first subfigure (a) demonstrates that medium and large scale characteristics behave very nonchaotic for simulations without any added noise. They are only included for illustrative purposes and are not used for training. The second and third subfigure (b) and (c) in each case show the training data of LSiM, where the large majority of data falls into the category (b) of normal samples that follow the generation ordering, even with more varying behaviour. Category (c) is a small fraction of the training data and the shown examples are specifically picked, worst case examples to show how the chaotic behaviour can sometimes override the ordering intended by the data generation. In some cases, category (d) is included to show how normal data samples from the test set differ from the training data.
d.1 NavierStokes Equations
These equations describe the general behaviour of fluids with respect to advection, viscosity, pressure, and mass conservation. Eq. (10) defines the conservation of momentum and Eq. (11) the conservation of mass inside the fluid.
(10)  
(11) 
In this context, is the velocity, is the pressure the fluid exerts, is the density of the fluid (usually assumed to be constant), is the kinematic viscosity coefficient that indicates the thickness of the fluid, and denotes the acceleration due to gravity. With this PDE three data sets were created using a smoke and a liquid solver. For all data, 2D simulations were run until a certain step and useful data fields were exported afterwards.
Smoke
For the smoke data, a standard Eulerian fluid solver using a preconditioned pressure solver based on conjugate gradient and a SemiLagrangian advection scheme was employed.
The general setup for every smoke simulation consists of a rectangular smoke source at the bottom with a fixed additive noise pattern to provide smoke plumes with more details. Additionally, there is a downwards directed, spherical force field area above the source which divides the smoke in two major streams along it. We chose this solution over an actual obstacle in the simulation, in order to avoid overfitting to a clearly defined black obstacle area inside the smoke data. Once the simulation reaches a predefined time step, the density, pressure, and velocity field (separated by dimension) is exported and stored. Some example sequences can be found in Fig. 15.
With this setup the following initial conditions were varied in isolation:

Smoke buoyancy in x and ydirection

Strength of noise added to the velocity field

Amount of force in x and ydirection provided by the force field

Orientation and size of the force field

Position of the force field in x and ydirection

Position of the smoke source in x and ydirection
Overall, 768 individual smoke sequences were used for training, and the validation set contains 192 sequences with different initialization seeds.
Liquid
For the liquid data, a solver based on the fluid implicit particle (FLIP) method proposed by Zhu and Bridson was employed. It is a EulerianLagrangian hybrid approach that replaces the SemiLagrangian advection scheme with particle based advection to achieve higher accuracy and prevent the loss of mass. Still, this method is not optimal as we experienced problems with mass loss, especially for larger noise values.
The simulation setup consists of a large breaking dam and several smaller liquid areas for more detailed splashes. After the dam hits the simulation boundary a large, single drop of liquid is created in the middle of the domain that hits the already moving liquid surface. Then, the extrapolated level set values, binary indicator flags, and the velocity field (separated by dimension) are saved, with some examples shown in Fig. 16. The list of varied parameters include:

Radius of the liquid drop

Position of the drop in x and ydirection

Amount of additional gravity force in x and ydirection

Strength of noise added to the velocity field
The liquid training set consists of 792 sequences and the validation set of 198 sequences with different random seeds. For the liquid test set additional background noise was added to the velocity field of the simulations as displayed in Fig. 16(d). Because this only alters the velocity field, the extrapolated level set values and binary indicator flags are not used for this data set, leading to 132 sequences.
d.2 AdvectionDiffusion and Burger’s Equation
For these PDEs our solvers only discretized and solve the corresponding equation in 1D. Afterwards, the different time steps of the solution process are concatenated along a new dimension to form 2D data with one spatial and one time dimension.
AdvectionDiffusion Equation
This equation describes how a passive quantity is transported inside a velocity field due to the processes of advection and diffusion. Eq. (12) is the simplified AdvectionDiffusion equation with constant diffusivity and no sources or sinks.
(12) 
Here, denotes the density, is the velocity, and is the kinematic viscosity (also known as diffusion coefficient) that determines the strength of the diffusion. Our solver employed a simple implicit time integration and a diffusion solver based on conjugate gradient without preconditioning. The initialization for the 1D fields of the simulations was created by overlaying multiple parameterized sine curves with random frequencies and magnitudes.
In addition, continuous forcing controlled by further parameterized sine curves was included in the simulations over time. In this case, the only initial conditions to vary are the forcing and initialization parameters of the sine curves and the strength of the added noise. From this PDE only the passive density field was used as shown in Fig. 17. Overall, 798 sequences are included in the training set and 190 sequences with a different random initialization in the validation set.
For the AdvectionDiffusion test set the noise was instead added directly to the passive density field of the simulations. This results in 190 sequences with more small scale details, as shown in Fig. 17(d).
Burger’s Equation
This equation is very similar to the AdvectionDiffusion equation and describes how the velocity field itself changes due to diffusion and advection.
(13) 
Eq. (13) is known as the viscous form of the Burger’s equation that can develop shock waves, and again is the velocity and denotes the kinematic viscosity. Our solver for this PDE used a slightly different implicit time integration scheme, but the same diffusion solver as used for the AdvectionDiffusion equation.
The simulation setup and parameters were also the same; the only difference is that the velocity field instead of the density is exported. As a consequence, the data in Fig. 18 looks relatively similar to the results from the AdvectionDiffusion equation. The training set features 782 sequences, and the validation set contains 204 sequences with different random seeds.
d.3 Other
The remaining data sets are not based on PDEs and thus not generated with the proposed method. The data is only used to test the generalization of the discussed metrics and not for training or validation. The Shapes test set contains 160 sequences, the Video test set consists 131 sequences, and the TID test set features 216 sequences.
Shapes
This data set tests if the metrics are able to track simple, moving geometric shapes. To create it, a straight path between two random points inside the domain is generated and a random shape is moved along this path in steps of equal distance. The size of the used shape depends on the distance between start and end point, such that a significant fraction of the shape overlaps between two consecutive steps. It is also ensured that no part of the shape leaves the domain at any step, by using a sufficiently big boundary area when generating the path.
With this method, multiple random shapes for a single data sample are produced and their paths can overlap, such that they occlude each other to provide an additional challenge. All shapes are moved in their parametric representation and only when exporting the data, they are discretized onto a fixed binary grid. To add more variations to this simple approach, we also apply them in a nonbinary way with smoothed edges and include additive gaussian noise over the entire domain. Examples for the different exports can be seen in Fig. 19.
Video
For this data set, different publicly available video recordings were acquired and processed in three steps. First, videos with abrupt cuts, scene transitions or camera movements were discarded, and afterwards the footage was broken down into single frames. Then, each frame was resized to match the spatial size of our other data by linear interpolation. Since directly using consecutive frames is no challenge for any analyzed metric and all of them recovered the ordering almost perfectly, we achieved a more meaningful data set by skipping several intermediate frames. For the final data set, we defined the first frame of every video as the reference and subsequent frames in an interval step of ten frames as the increasingly different variations. Some data examples can be found in Fig. 20.
Tid2013
This data set was created by Ponomarenko et al. and used without any further modifications. It consists of 25 reference images with 24 distortion types in five levels. As a result it is not directly comparable to our data sets, so it is excluded from the test set aggregations. The distortions focus on various types of noise, image compression and color changes. Fig. 21 contains examples from the data set.
d.4 Data Augmentation
To add more variation to our data, we include different data augmentation techniques. These augmentations help the network to become more invariant to typical data transformations. Examples for the augmented data can be found in Fig. 22. In order to compare to existing natural image feature extractors trained for RGB images, we focus on three channel inputs. As a first step, the scalar, i.e. greyscale, data is converted to a three channel RGB input with a random color map (out of five fixed variations) or no color map at all by copying the data directly to each channel.
Next, the data is randomly flipped on either axis and rotated in increments of 90 to provide robustness to rotations. The rotated data fields are then cropped from their simulation size to a size of which is the typical input size for existing feature extractors. Finally, each input is normalized to a standard normal distribution. The mean and standard deviation are computed from all available training data without augmentations in a preprocessing step with an online algorithm from Welford. Note that each data sample gets a new augmentation every time it is used, and that the corresponding reference receives the identical transformations the keep comparability (see Fig. 22). For all validation and test data, only a bilinear interpolation to the correct input size and the final normalization is performed.
d.5 Hardware
Data generation, training, and metric evaluations were performed on a machine with an Intel i76850 (3.60Ghz) CPU and an NVIDIA GeForce GTX 1080 Ti GPU.
Appendix E RealWorld Data
Below we give details of the three datasets used for the evaluation in Section 6.3 of the main paper.
e.1 ScalarFlow
The ScalarFlow data set from Eckert et al. contains 3D velocities of realworld scalar transport flows reconstructed from multiple camera perspectives. For our evaluation, we cropped the volumetric grids to such that they only contain the area of interest and convert them to 2D with two variants: either by using the center slice or by computing the mean along the zdimension. Afterwards, the magnitudes of the velocity vectors are computed, linearly interpolated to , and then normalized. Variations for each reconstructed plume are acquired by using frames in equal temporal intervals. We employed the velocity field reconstructions from 30 plumes (with simulation IDs ) for both compression methods. Fig. 23 shows some example sequences.
e.2 Johns Hopkins Turbulence Database
The Johns Hopkins Turbulence Database (JHTDB) (Perlman et al., 2007) features various data sets of 3D turbulent flow fields created with direct numerical simulations (DNS). Here, we used three forced isotropic turbulence data sets with different resolutions (isotropic1024coarse, isotropic1024fine, and isotropic4096), two channel flows with different Reynolds numbers (channel and channel5200), the forced magnetohydrodynamic isotropic turbulence data set (mhd1024), and the rotating stratified turbulence data set (rotstrat4096).
For the evaluation, five reference slices in the x/yplane from each of the seven data sets are used. The spatial and temporal position of each slice is randomized within the bounds of the corresponding simulation domain. We compute velocity magnitudes, and normalize the value range. Variants for each reference are created by gradually varying the x and zposition of the slice in equal intervals. The temporal position of each slice is varied as well, if a sufficient amount of temporally resolved data is available (for isotropic1024coarse, isotropic1024fine, channel, and mhd1024). This leads to 90 sequences in total. Fig. 24 shows examples from six of the JHTDB data sets.
e.3 WeatherBench
The WeatherBench repository from Rasp et al. represents a collection of various weather measurements of different atmospherical quantities such as precipitation, cloud coverage, wind velocities, geopotential, and temperature. The data ranges from 1979 to 2018 with a fine temporal resolution and is stored on a Cartesian latitudelongitude grid of the earth. In certain subsets of the data, an additional dimension such as altitude or pressure levels are available. As all measurements are available as scalar fields, only a linear interpolation to the correct input size and a normalization was necessary in order to prepare the data. We used the lowresolution geopotential data set at 500hPa (i.e., at around 5.5km height) with a size of yielding smoothly changing features when upsampling the data. In addition, the highres temperature data with a size of for small scale details was used. For the temperature field, we used the middle atmospheric pressure level at 850hPa corresponding to an altitude of 1.5km in our experiments.
To create sequences with variations for a single time step of the weather data we used frames in equal time intervals, similar to the ScalarFlow data. Due to the very fine temporal discretization of the data, we consider a temporal interval of two hours as the smallest interval step of one in Fig. 26. We sampled three random starting points in time from each of the 40 years of measurements, resulting in 120 sequences overall. Fig. 25 shows a collection of example sequences.
e.4 Detailed Results
For each of the variants explained in the previous sections, we create test sets with six different spatial and temporal intervals. Fig. 26 shows the combined correlation of the sequences for different interval spacing when evaluating LSiM. For the results in the main paper, all correlation values shown here are aggregated by data source via mean and standard deviation.
While our metric reliably recovers the increasing distances within the data set, the individual measurements exhibit interesting differences in terms of their behavior for varying distances. As JHTDB and WeatherBench contain relatively uniform phenomena, a larger step interval creates more difficult data as the simulated and measured states contain changes that are more and more difficult to analyze along a sequence. For ScalarFlow, on the other hand, the difficulty decreases for larger intervals due to the largescale motion of the reconstructed plumes. As a result of buoyancy forces the observed smoke rises upwards into areas where no smoke has been before. For the network this makes predictions relatively easy, as the largescale translations are indicative of the temporal progression, and small scale turbulence effects can be largely ignored. For this data set smaller intervals are more difficult as the overall shape of the plume barely changes while the complex evolution of small scale features becomes more important.
Overall, the LSiM metric recovers the ground truth ordering of the sequences very well as indicated by the consistently high correlation values in Fig. 26.
Appendix F Additional Evaluations
Metric  Validation data sets  Test data sets  
Smo  Liq  Adv  Bur  TID  LiqN  AdvD  Sha  Vid  All  
0.66  0.81  0.71  0.58  0.85  0.72  0.57  0.55  0.76  0.56  
SSIM  0.67  0.74  0.75  0.68  0.83  0.25  0.69  0.34  0.66  0.47 
LPIPS v0.1.  0.71  0.75  0.76  0.72  0.79  0.63  0.61  0.82  0.80  0.65 
AlexNet\textsubscriptrandom  0.63  0.75  0.66  0.64  0.80  0.63  0.65  0.68  0.77  0.60 
AlexNet\textsubscriptfrozen  0.67  0.70  0.68  0.70  0.79  0.40  0.63  0.84  0.80  0.61 
Optical flow  0.63  0.56  0.37  0.39  0.49  0.45  0.28  0.61  0.74  0.48 
NonSiamese  0.71  0.82  0.75  0.69  0.26  0.72  0.62  0.65  0.68  0.63 
Skip\textsubscriptfrom scratch  0.65  0.83  0.74  0.66  0.72  0.78  0.59  0.83  0.78  0.70 
LSiM\textsubscriptnoiseless  0.64  0.82  0.74  0.60  0.69  0.80  0.58  0.83  0.75  0.68 
LSiM\textsubscriptstrong noise  0.63  0.81  0.71  0.61  0.70  0.78  0.50  0.80  0.76  0.64 
LSiM (ours)  0.68  0.82  0.76  0.70  0.71  0.79  0.61  0.86  0.76  0.71 
In the following, we demonstrate other ways to compare the performance of the analyzed metrics on our data sets. In Tab. 2 the Pearson correlation coefficient is used instead of Spearman’s rank correlation coefficient. While Spearman’s correlation measures monotonic relationships by using ranking variables, it directly measures linear relationships.
The results in Tab. 2 match very closely to the values computed with Spearman’s rank correlation coefficient. The best performing metrics in both tables are identical, only the numbers slightly vary. Since a linear and a monotonic relation describes the results of the metrics similarly well, there are no apparent nonlinear dependencies that can not be captured using the Pearson correlation.
In the Tables 4 and 3 we employ a different, more intuitive approach to determine combined correlation values for each data set using the Pearson correlation. We no longer analyzing the entire predicted distance distribution and the ground truth distribution at once as done above. Instead, we individually compute the correlation between the ground truth and the predicted distances for the single data samples of the data set. From the single correlation values, we compute the mean and standard deviations shown in the tables. Note that this approach potentially produces less accurate comparison results, as small errors in the individual computations can accumulate to larger deviations in mean and standard deviation. Still, both tables lead to very similar conclusions: The best performing metrics are almost the same and low combined correlation values match with results that have a high standard deviation and a low mean.
Metric  Validation data sets  

Smo  Liq  Adv  Bur  
0.71 (0.34)  0.84 (0.23)  0.76 (0.28)  0.63 (0.41)  
SSIM  0.73 (0.30)  0.78 (0.29)  0.80 (0.26)  0.72 (0.38) 
LPIPS v0.1.  0.77 (0.28)  0.79 (0.24)  0.81 (0.26)  0.77 (0.32) 
AlexNet\textsubscriptrandom  0.68 (0.36)  0.79 (0.28)  0.71 (0.36)  0.69 (0.36) 
AlexNet\textsubscriptfrozen  0.72 (0.31)  0.74 (0.29)  0.73 (0.35)  0.75 (0.33) 
Optical flow  0.66 (0.38)  0.59 (0.47)  0.38 (0.52)  0.41 (0.49) 
NonSiamese  0.76 (0.27)  0.87 (0.19)  0.80 (0.24)  0.75 (0.33) 
Skip\textsubscriptfrom scratch  0.69 (0.34)  0.87 (0.19)  0.79 (0.26)  0.72 (0.34) 
LSiM\textsubscriptnoiseless  0.68 (0.33)  0.85 (0.24)  0.78 (0.30)  0.66 (0.37) 
LSiM\textsubscriptstrong noise  0.67 (0.36)  0.85 (0.22)  0.76 (0.33)  0.67 (0.39) 
LSiM (ours)  0.72 (0.31)  0.85 (0.22)  0.81 (0.23)  0.75 (0.33) 
Metric  Test data sets  

TID  LiqN  AdvD  Sha  Vid  All  
0.98 (0.19)  0.76 (0.29)  0.59 (0.45)  0.61 (0.29)  0.82 (0.33)  0.68 (0.37)  
SSIM  0.92 (0.40)  0.26 (0.49)  0.73 (0.36)  0.45 (0.45)  0.77 (0.37)  0.57 (0.46) 
LPIPS v0.1.  0.94 (0.33)  0.66 (0.38)  0.65 (0.41)  0.83 (0.24)  0.85 (0.30)  0.74 (0.36) 
AlexNet\textsubscriptrandom  0.96 (0.27)  0.68 (0.34)  0.69 (0.38)  0.68 (0.26)  0.82 (0.33)  0.71 (0.34) 
AlexNet\textsubscriptfrozen  0.94 (0.33)  0.41 (0.49)  0.67 (0.39)  0.85 (0.21)  0.85 (0.29)  0.70 (0.40) 
Optical flow  0.74 (0.67)  0.50 (0.34)  0.32 (0.53)  0.63 (0.45)  0.78 (0.45)  0.53 (0.49) 
NonSiamese  0.47 (0.88)  0.76 (0.24)  0.66 (0.41)  0.67 (0.28)  0.76 (0.42)  0.71 (0.35) 
Skip\textsubscriptfrom scratch  0.99 (0.14)  0.85 (0.15)  0.61 (0.42)  0.84 (0.23)  0.82 (0.33)  0.76 (0.33) 
LSiM\textsubscriptnoiseless  0.98 (0.19)  0.86 (0.15)  0.61 (0.41)  0.84 (0.26)  0.79 (0.38)  0.76 (0.34) 
LSiM\textsubscriptstrong noise  0.99 (0.14)  0.83 (0.19)  0.52 (0.45)  0.81 (0.23)  0.82 (0.35)  0.73 (0.36) 
LSiM (ours)  0.97 (0.23)  0.83 (0.22)  0.64 (0.42)  0.86 (0.23)  0.80 (0.37)  0.77 (0.34) 
Fig. 27 shows a visualization of predicted distances against ground truth distances for different metrics on every sample from the test sets. Each plot contains over 6700 individual data points to illustrate the global distance distributions created by the metrics, without focusing on single cases. A theoretical optimal metric would recover a perfectly narrow distribution along the line , while worse metrics recover broader, more curved distributions. Overall, the sample distribution of an metric is very wide. LPIPS manages to follow the optimal diagonal a lot better, but our approach approximates it with the smallest deviations, as also shown in the tables above. The metric performs very poorly on the shape data indicated by the too steeply increasing blue lines that flatten after a ground truth distance of 0.3. LPIPS already significantly reduces this problem, but LSiM still works slightly better.
A similar issue is visible for the AdvectionDiffusion data, where for a larger number of red samples is below the optimal line, than for the other metrics. LPIPS has the worst overall performance for liquid test set, indicated by the large number of fairly chaotic green lines in the plot. On the video data, all three metrics perform similarly well.
A finegrained distance evaluation in 200 steps of and our LSiM metric via the mean and standard deviation of different data samples is shown in Fig. 28. Similar to Fig. 27, the mean of an optimal metric would follow the ground truth line with a standard deviation of zero, while the mean of worse metrics deviates around the line with a high standard deviation. The plot on the left combines eight samples with different seeds from the Sha data set, where only a single shape is used. Similarly, the center plot aggregates eight samples from Sha with more than one shape. The right plot shows six data samples from the LiqN test set that vary by the amount of noise that was injected into the simulation.
The task of only tracking a single shape in the example on the left is the easiest of the three shown cases. Both metrics have no problem to recover the position change until a variation of 0.4, where can no longer distinguish between the different samples. Our metric recovers distances with a continuously rising mean and a very low standard deviation. The task in the middle is already harder, as multiple shapes can occlude each other during the position changes. Starting at a position variation of 0.4 both metrics have a quite high standard deviation, but the proposed method stays closer to the ground truth line. shows a similar issue as before, because it flattens relatively fast. The plot on the right features the hardest task. Here, both metrics perform similar as each has a different problem in addition to an unstable mean. Our metric stays close to the ground truth, but has a quite high standard deviation starting at about a variation of 0.4. The standard deviation of is lower, but instead it starts off with a big jump from the first few data points. To some degree this is caused by the normalization of the plots, but it still overestimates the relative distances for small variations in the simulation parameter.
These findings also match with the distance distribution evaluations in Fig. 27 and the tables above: Our method has a significant advantage over shallow metrics on shape data, while the differences of both metrics become much smaller for the liquid test set.
Appendix G Notation
In this work, we follow the notation suggested by Goodfellow et al.. Vector quantities are displayed in bold and tensors use a sansserif font. Doublebarred letters indicate sets or vector spaces. The following symbols are used:
Real numbers  
Indexing in different contexts  
Input space of the metric, i.e., color images/field data of size  
Dimension of the input space when flattened to a single vector  
Elements in the input space  
Latent space of the metric, i.e., sets of 3^{rd} order feature map tensors  
Dimension of the latent space when flattened to a single vector  
Elements in the latent space , corresponding to 
Weights for the learned average aggregation (1 per feature map)  
Initial conditions / parameters of a numerical simulation  
Number of steps when varying a simulation parameter, thus length of the network input sequence  
Series of outputs of a simulation with increasing ground truth distance to  
Amount of change in a single simulation parameter  
Time steps of a numerical simulation  
Strength of the noise added to a simulation  
Ground truth distance distribution, determined by the data generation via  
Predicted distance distribution (supposed to match the corresponding )  
Mean of the distributions and  
Euclidean norm of a vector  
Entire function computed by our metric  
First part of , i.e., base network and feature map normalization  
Second part of , i.e., latent space difference and the aggregations  
3^{rd} order feature tensor from one layer of the base network  
Channel dimension () and spatial dimensions () of  
Optical flow network  
Flow fields computed by an optical flow network from two inputs in  
Components of the flow field  
Gradient () and Laplace operator ()  
Partial derivative operator  
Time in our PDEs  
Velocity in our PDEs  
Kinematic viscosity / diffusion coefficient in our PDEs  
Density in our PDEs 
Pressure in the NavierStokes Equations  
Gravity in the NavierStokes Equations 
References
 Learning to see by moving. In 2015 IEEE International Conference on Computer Vision (ICCV), Vol. , pp. 37–45. External Links: Document Cited by: §2.
 Image Quality Assessment by Comparing CNN Features between Images. Journal of Imaging Sience and Technology 60 (6). External Links: Document Cited by: §2.
 Learning representations by maximizing mutual information across views. CoRR abs/1906.00910. External Links: Link, 1906.00910 Cited by: §7.
 Learning visual similarity for product design with convolutional neural networks. ACM Transactions on Graphics 34 (4), pp. 98:1–98:10. External Links: Document Cited by: §2.
 Siamese networks for semantic pattern similarity. CoRR abs/1812.06604. External Links: Link, 1812.06604 Cited by: §1.
 EigenDistortions of Hierarchical Representations. In Advances in Neural Information Processing Systems 30 (NIPS 2017), Vol. 30. External Links: Link, 1710.02266 Cited by: §2.
 FullyConvolutional Siamese Networks for Object Tracking. In Computer Vision  ECCV 2016 Workshops, PT II, Vol. 9914, pp. 850–865. External Links: Document Cited by: §2.
 Neural NetworkBased FullReference Image Quality Assessment. In 2016 Picture Coding Symposium (PCS), External Links: Document Cited by: §2.
 Correlational neural networks. Neural Computation 28 (2), pp. 257–285. External Links: Document Cited by: §2.
 DataDriven Synthesis of Smoke Flows with CNNbased Feature Descriptors. ACM Transactions on Graphics 36 (4), pp. 69:1–69:14. External Links: Document Cited by: §2.
 Generating Images with Perceptual Similarity Metrics based on Deep Networks. In Advances in Neural Information Processing Systems 29 (NIPS 2016), Vol. 29. External Links: Link, 1602.02644 Cited by: §2.
 ScalarFlow: a largescale volumetric data set of realworld scalar transport flows for computer animation and machine learning. ACM Transactions on Graphics 38 (6). External Links: Document Cited by: §E.1, §6.3.
 Deep learning. MIT Press. External Links: Link Cited by: Appendix G.
 A new error measure for forecasts of householdlevel, high resolution electrical energy consumption. International Journal of Forecasting 30 (2), pp. 246–256. External Links: Document Cited by: §2.
 Patch match networks: Improved twochannel and Siamese networks for image patch matching. Pattern Recognition Letters 120, pp. 54–61. External Links: Document Cited by: §2.
 Learning to match multitemporal optical satellite images using multisupportpatches Siamese networks. Remote Sensing Letters 10 (6), pp. 516–525. External Links: Document Cited by: §2.
 Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778. External Links: Document Cited by: §B.6.
 Dataefficient image recognition with contrastive predictive coding. CoRR abs/1905.09272. External Links: Link, 1905.09272 Cited by: §7.
 Determining optical flow. Artificial intelligence 17 (13), pp. 185–203. External Links: Document Cited by: §6.2.
 Densely connected convolutional networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2261–2269. External Links: Document Cited by: §B.6.
 SqueezeNet: alexnetlevel accuracy with 50x fewer parameters and <1mb model size. CoRR abs/1602.07360. External Links: Link, 1602.07360 Cited by: §3.1.
 FlowNet 2.0: evolution of optical flow estimation with deep networks. CoRR abs/1612.01925. External Links: Link, 1612.01925 Cited by: §B.4, §6.2.
 Perceptual Losses for RealTime Style Transfer and SuperResolution. In Computer Vision  ECCV 2016, PT II, Vol. 9906, pp. 694–711. External Links: Document Cited by: §2.
 Forecast verification: a practitioner’s guide in atmospheric science. John Wiley & Sons. External Links: Document Cited by: §1.
 Convolutional Neural Networks for NoReference Image Quality Assessment. In 2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1733–1740. External Links: Document Cited by: §2.
 A displacement and amplitude score employing an optical flow technique. Weather and Forecasting 24 (5), pp. 1297–1308. External Links: Document Cited by: §2.
 Deep Learning of Human Visual Sensitivity in Image Quality Assessment Framework. In 30TH IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), pp. 1969–1977. External Links: Document Cited by: §2.
 ImageNet classification with deep convolutional neural networks. Communications of the ACM 60 (6), pp. 84–90. External Links: Document Cited by: §3.1.
 Most apparent distortion: fullreference image quality assessment and the role of strategy. Journal of Electronic Imaging 19 (1). External Links: Document Cited by: §2.
 Turbulent transport reduction by zonal flows: massively parallel simulations. Science 281 (5384), pp. 1835–1837. External Links: Document Cited by: §1.
 CID:IQ  A New Image Quality Database. In Image and Signal Processing, ICISP 2014, Vol. 8509, pp. 193–202. External Links: Document Cited by: §2.
 Direct numerical simulation: a tool in turbulence research. Annual review of fluid mechanics 30 (1), pp. 539–578. External Links: Document Cited by: §1.
 Verification, validation, and predictive capability in computational engineering and physics. Applied Mechanics Reviews 57, pp. 345–384. External Links: Document Cited by: §1.
 Notes on the History of Correlation. Biometrika 13 (1), pp. 25–45. External Links: Document Cited by: §5.
 Data exploration of turbulence simulations using a database cluster. In SC ’07: Proceedings of the 2007 ACM/IEEE Conference on Supercomputing, pp. 1–11. External Links: Document Cited by: §E.2, §6.3.
 Largeeddy simulation of turbulent combustion. Annu. Rev. Fluid Mech. 38, pp. 453–482. External Links: Document Cited by: §2.
 Image database TID2013: Peculiarities, results and perspectives. Signal ProcessingImage Communication 30, pp. 57–77. External Links: Document Cited by: Figure 21, §D.3, §2, §6.1.
 PieAPP: perceptual imageerror assessment through pairwise preference. CoRR abs/1806.02067. External Links: Link, 1806.02067 Cited by: §2.
 WeatherBench: a benchmark dataset for datadriven weather forecasting. CoRR abs/2002.00469. External Links: Link, 2002.00469 Cited by: §E.3, §6.3.
 Unet: convolutional networks for biomedical image segmentation. CoRR abs/1505.04597. External Links: Link, 1505.04597 Cited by: §B.6.
 Artistic style transfer for videos. In Pattern Recognition, pp. 26–36. External Links: Document Cited by: §2.
 Very deep convolutional networks for largescale image recognition. In ICLR, External Links: Link, 1409.1556 Cited by: §3.1.
 The proof and measurement of association between two things. The American Journal of Psychology 15 (1), pp. 72–101. External Links: Document Cited by: §6.2.
 Learned Perceptual Image Enhancement. In 2018 IEEE International Conference on Computational Photography (ICCP), External Links: Document Cited by: §2.
 NIMA: Neural Image Assessment. IEEE Transactions on Image Processing 27 (8), pp. 3998–4011. External Links: Document Cited by: §2.
 Well, how accurate is it? A study of deep learning methods for reynoldsaveraged navierstokes simulations. CoRR abs/1810.08217. External Links: Link, 1810.08217 Cited by: §3.1.
 Perceptual Evaluation of Liquid Simulation Methods. ACM Transactions on Graphics 36 (4). External Links: Document Cited by: §2.
 Spot the Difference: Accuracy of Numerical Simulations via the Human Visual System. CoRR abs/1907.04179. External Links: Link, 1907.04179 Cited by: §2.
 Unsupervised learning of visual representations using videos. In 2015 IEEE International Conference on Computer Vision (ICCV), Vol. , pp. 2794–2802. External Links: Document Cited by: §2.
 L2 Mispronunciation Verification Based on Acoustic Phone Embedding and Siamese Networks. In 2018 11TH International Symposium on Chinese Spoken Language Processing (ISCSLP), pp. 444–448. External Links: Document Cited by: §1.
 Image quality assessment: From error visibility to structural similarity. IEEE Transactions on Image Processing 13 (4), pp. 600–612. External Links: Document Cited by: §1, §2.
 Note on a method for calculating corrected sums of squares and products. Technometrics 4 (3), pp. 419–420. External Links: Document Cited by: §D.4.
 The Unreasonable Effectiveness of Deep Features as a Perceptual Metric. In 2018 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 586–595. External Links: Document Cited by: §B.2, §B.2, §B.3, §1, §2, §3.2, §3.4, §6.2.
 IMINET: Convolutional SemiSiamese Networks for Sound Search by Vocal Imitation. In 2017 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, pp. 304–308. External Links: Document Cited by: §2.
 Animating sand as a fluid. In ACM SIGGRAPH 2005 Papers, New York, NY, USA, pp. 965–972. External Links: Document Cited by: §D.1.