DisCo: Physics-Based Unsupervised Discovery of Coherent Structures in Spatiotemporal Systems
Abstract
Extracting actionable insight from complex unlabeled scientific data is an open challenge and key to unlocking data-driven discovery in science. Complementary and alternative to supervised machine learning approaches, unsupervised physics-based methods based on behavior-driven theories hold great promise. Due to computational limitations, practical application on real-world domain science problems has lagged far behind theoretical development. However, powerful modern supercomputers provide the opportunity to narrow the gap between theory and practical application. We present our first step towards bridging this divide: DisCo, a high-performance distributed workflow for the behavior-driven local causal state theory. DisCo provides a scalable unsupervised physics-based representation learning method that decomposes spatiotemporal systems into their structurally relevant components, which are captured by the latent local causal state variables. Complex spatiotemporal systems are generally highly structured and organize around a lower-dimensional skeleton of coherent structures, and in several firsts we demonstrate the efficacy of DisCo in capturing such structures from observational and simulated scientific data. To the best of our knowledge, DisCo is also the first application software developed entirely in Python to scale to over 1000 machine nodes, providing good performance while ensuring domain scientists' productivity. We developed scalable, performant methods optimized for Intel many-core processors that will be upstreamed to open-source Python library packages. Our capstone experiment, using the newly developed DisCo workflow and libraries, performs unsupervised spacetime segmentation analysis of CAM5.1 climate simulation data, processing an unprecedented 89.5 TB in 6.6 minutes end-to-end using 1024 Intel Haswell nodes on the Cori supercomputer, obtaining 91% weak-scaling and 64% strong-scaling efficiency.
This enables us to achieve state-of-the-art unsupervised segmentation of coherent spatiotemporal structures in complex fluid flows.
I Introduction
I-A Data-Driven Discovery in Science
Over the last decade, the Data Deluge [4] has brought dramatic progress across all of science [9, 8, 48, 62, 20]. For data-driven science to flourish by extracting meaningful scientific insights [13, 18], new methods are required that discover and mathematically describe complex emergent phenomena, uncover the underlying physical and causal mechanisms, and are better able to predict the occurrence and evolution of these phenomena over time. Increasingly, scientists are leaning upon machine learning (ML) [44, 36, 6, 34, 69] and, more recently, deep learning (DL) [43, 7, 57, 5, 41] to fill this role.
While these techniques show great promise, serious challenges arise when they are applied to scientific problems. To better elucidate the challenges of scientific application of DL methods, we will focus on a particular problem of utmost and imminent importance that necessitates data-driven discovery: detection and identification of extreme weather events in climate data [72, 55, 66]. Driven by an ever-warming climate, extreme weather events are changing in frequency and intensity at an unprecedented pace [14, 71]. Scientists are simulating a multitude of climate change scenarios using high-resolution, high-fidelity global climate models, producing 100s of TBs of data per simulation. Currently, climate change is assessed in these simulations using summary statistics (e.g. mean global sea surface temperature), which are inadequate for analyzing the full impact of climate change. Due to the sheer size and complexity of these simulated data sets, it is essential to develop robust and automated methods that can provide the deeper insights we seek.
Recently, supervised DL techniques have been applied to address this problem [46, 10, 11], including one of the 2018 Gordon Bell award winners [35]. However, there is an immediate and daunting challenge for these supervised approaches: ground-truth labels do not exist for pixel-level identification of extreme weather events [66]. The DL models used in the above studies are trained using the automated heuristics of TECA [55] for proximate labels. While the results in [46] qualitatively show that DL can improve upon TECA, the results in [11] reach accuracy rates over 97%, essentially reproducing the output of TECA. The supervised learning paradigm of optimizing objective metrics (e.g. training and generalization error) breaks down here [18], since TECA is not ground truth and we do not know how to train a DL model to disagree with TECA in just the right way to get closer to "ground truth".
I-B Behavior-Driven Theories for Scientific Machine Learning
With the absence of ground-truth labels, many scientific problems are fundamentally unsupervised problems. Rather than attempt to adapt unsupervised DL approaches to a problem like extreme weather detection, we instead take a behavior-driven approach and start from physical principles to develop a novel physics-based representation learning method for discovering structure in spatiotemporal systems directly from unlabeled data.
At the interface of physics and machine learning, behavior-driven theories (e.g. [74, 59, 76, 70, 77]) leverage physical principles to extract actionable scientific insight directly from unlabeled data. Focusing directly on system behavior rather than the governing equations is necessary for complex, nonlinear systems. For these systems it is generally not possible to deduce properties of emergent behavior from the underlying equations [1]. As an example, despite knowing the equations of hydrodynamics and thermodynamics, which critically govern the dynamics of hurricanes, many aspects of how hurricanes form and evolve are still poorly understood [15].
For the problem of unsupervised segmentation of extreme weather events in climate data, we view these events as particular cases of more general hydrodynamic coherent structures. Atmospheric dynamics, and hydrodynamic flows more generally, are highly structured and largely organize around a lower-dimensional skeleton of collective features referred to as coherent structures [29, 28]. More broadly, coherent structures in spatiotemporal systems can be understood as key organizing features that heavily dictate the dynamics of the full system, and, as with extreme weather, the coherent structures are often the features of interest. Project DisCo ('Discovery of Coherent Structures') combines the behavior-driven local causal state theory of coherent structures with a first-of-its-kind performant and highly scalable HPC implementation in Python.
In Section III we describe the mathematical details of the theory and its use for unsupervised segmentation, and we present an overview of the distributed DisCo workflow and each of its stages. We then demonstrate its utility by identifying known coherent structures in 2D turbulence simulation data and in observational data of Jupiter's clouds from the NASA Cassini spacecraft in Section VII. Finally, we show promising results on CAM5.1 water vapor data and outline the path to extreme weather event segmentation masks.
I-C Need for High Performance Computing
Theoretical developments in behavior-driven theories have far outpaced their implementation and application to real science problems due to significant computational demands. Theorists typically use high-productivity languages like Python, which often incur performance penalties, only for prototyping their methods and demonstrating them on small idealized data sets. Since these prototypes are not typically optimized for production-level performance, their use in science applications with big data sets is limited. To solve real science problems, domain scientists often have to rewrite applications, or portions of them, in programming languages like C, C++, and Fortran [58].
Making high-productivity languages performant and scalable on HPC systems requires highly optimized, platform-specialized libraries with easy-to-use APIs, distributed-memory processing modes seamlessly integrated with popular Python libraries (like scikit-learn), and efficient use of JIT compilers like Numba. In Project DisCo, we use all of these techniques to enable optimized Python code from prototype development to production deployment on more than 1000 nodes of an HPC system. This brings us closer to bridging the performance-productivity disconnect that typically exists in HPC, and to streamlining the process from theoretical development to deployment at scale for science applications.
A challenge specific to DisCo is the need for distance-based clustering of lightcone data structures (described in more detail in Sec. III). Compared to traditional clustering data sets, lightcones are very high-dimensional objects. Though lightcone dimensionality depends on reconstruction parameters, even the baseline lower bound of O(100) is already very high for typical implementations of clustering methods, and to facilitate discovery our experiments used lightcones of even higher dimension. Moreover, creation of lightcone vectors substantially inflates the on-node data volume. In our largest run, we process 89.5 TB of lightcone data, which is several orders of magnitude larger than previously reported lightcone-based methods.
To enable novel data-driven discovery at the frontiers of domain science with the ability to process massive amounts of high-dimensional data, we created a highly parallel, distributed-memory, performance-optimized implementation of the DisCo software, including two specialized clustering methods (K-Means [3] and DBSCAN [17]). In keeping with our goal of maintaining scientists' software development productivity, the libraries use standard Python APIs (scikit-learn). These distributed implementations will be upstreamed to benefit the growing community of Python developers.
I-D Contributions
Project DisCo makes the following contributions:

First distributed-memory implementation of a novel physics-based representation learning method, allowing unprecedented data processing capability on large scientific data sets.

Performs unsupervised coherent structure segmentation that qualitatively outperforms state-of-the-art methods for complex realistic fluid flows.

Demonstrates good single-node, weak-scaling, and strong-scaling performance up to 1024 nodes.

Distributed implementation of K-Means and DBSCAN clustering methods for high-dimensional data using standard Python APIs.

Achieves high performance while maintaining developer productivity by using newly developed optimized Python library functions and efficiently using parallelizing compilers.
II Related Work
The basic algorithm for real-valued local causal state reconstruction used by DisCo largely follows that of LICORS [21, 45]. Without an HPC implementation, this work focused on statistical properties of the algorithm, e.g. convergence, and on small proof-of-concept experiments. Further, it used the pointwise entropy over local causal states for coherent structure filters [64], but this approach cannot produce objective segmentation masks, as our method can.
The first real-valued local causal state reconstruction was done in [32], which also analyzed complex fluid flows and climate simulations. They were able to work with these data sets due to efficient data reuse and data subsampling in a low-productivity single-node implementation written from scratch. Even with these optimizations, DisCo produces much higher resolution results with our high-productivity, HPC-optimized implementation; compare the bottom row of Fig. 5 in [32] with Fig. 4 in Sec. VII. They also used the local causal state entropy, and so were likewise not capable of a structural segmentation analysis.
Lagrangian Coherent Structures (LCS) is a collection of approaches grounded in nonlinear dynamical systems theory that seek to describe the most repelling, attracting, and shearing material surfaces that form the skeletons of Lagrangian particle dynamics [28]. These approaches are the structural segmentation methods for fluid flows most relevant to DisCo. [25] gives a survey of LCS methods, including two benchmark data sets we use here. This provides us a key point of comparison to the state of the art for method validation, given in Section VII.
DisCo’s segmentation semantics are built on a structural decomposition provided by the local causal states. Such a decomposition is similar to empirical dimensionality reduction methods, such as PCA [33] and DMD [68]. These methods are used extensively in fluid dynamics [29] and climate analytics [67].
The key step in the DisCo pipeline requires an unsupervised clustering method. We focus on the popular K-Means [3] method and the density-based DBSCAN [17]. Further discussion of clustering in the DisCo pipeline is given in Sections III, IV, and VII.
Several distributed implementations of K-Means have been developed over the years. [50] is a C-based implementation that uses both MPI and OpenMP for parallelization. It evenly partitions the data to be clustered among all processes and replicates the cluster centers. At the end of each iteration, a global-sum reduction over all cluster centers is performed to generate the new cluster centers. [49] is an extension of this work to larger data sets of billions of points, and [37] optimizes K-Means performance on Intel KNC processors through efficient vectorization. The authors of [38] propose a hierarchical scheme for partitioning data based on data flow, centroids (clusters), and dimensions. Our K-Means implementation partitions the problem based on data size, since its application to climate data is a weak-scaling problem. We process much larger data sizes, though [38] showcases good performance for much higher dimensionality and cluster counts than our use case. For comparison, in our capstone problem, the K-Means stage of the DisCo workflow processes 70E9 lightcones (70E6/node) of 84 dimensions into 8 clusters in 2.32 s/iteration on Intel E5-2698 v3 (vs. 2.5E6 samples of 68 dimensions into 10,000 clusters in 2.42 s/iteration on 16 nodes of Intel i7-3770K processors in [38]). We also use a custom distance metric for applying temporal decay (described in Section III-B2), which doubles the number of floating point operations.
Several distributed implementations have been developed for density-based algorithms. BD-CATS is a distributed DBSCAN implementation using union-find data structures that scales to 8000 nodes on Cori [51]. POPTICS [53] is a distributed implementation of the OPTICS algorithm using the disjoint-set data structure, scaled to 3000 nodes. HDBSCAN from Petuum analytics [54] uses the NN-Descent algorithm and approximates the kNN graph. While their method has been shown to work for high-dimensional data, it has not been shown to work at scale. Other implementations such as PDBSCAN [24], PSDBSCAN [30], PDSDBSCAN [52], HPDBSCAN [22], etc. have been shown to scale well, but they use specialized indexing structures like k-d trees or ball trees, which are suboptimal for clustering high-dimensional data. To the best of our knowledge, this is the first implementation to demonstrate clustering of O(100)-dimensional data at this data scale (56 million points per node and 57 billion points in total).
III Description of the DisCo Project
Project DisCo combines the first distributed HPC implementation of local causal state reconstruction with theoretical advances in using local causal states to decompose spatiotemporal systems into structurally relevant components and objectively identify coherent structures [60, 61]. When fully deployed, DisCo promises to discover and identify extreme weather events in climate data using an unsupervised segmentation analysis with local causal states. We now outline the mathematical basis for this claim.
III-A Local Causal States: Theory
Similar to the intuition behind autoencoder neural networks, the pattern and structure of a dynamical system derive from optimal prediction with minimal resources. Thus the mathematical representation of a system's structure is learned through a minimal, optimally predictive stochastic model [75, 23, 63]. A body of behavior-driven theory, known as computational mechanics [12], gives a principled and constructive realization of this idea through the causal equivalence relation

$x^-_i \sim_\epsilon x^-_j \iff \Pr(X^+ \mid x^-_i) = \Pr(X^+ \mid x^-_j).$

Two pasts $x^-_i$ and $x^-_j$ are causally equivalent if they make the same prediction of the future. The equivalence classes generated by the causal equivalence relation are the causal states. They are the unique minimal sufficient statistic of the past for optimally predicting the future [63].
Generalizing to spatiotemporal systems, lightcones are used as local notions of past and future. The past lightcone of a point $x(\vec{r}, t)$ in a spacetime field is defined as the set of all points in the past (up to some finite horizon $h^-$) that could possibly influence $x(\vec{r}, t)$:

$\ell^-(\vec{r}, t) = \{ x(\vec{r}', t') : t - h^- \leq t' \leq t,\ ||\vec{r}' - \vec{r}|| \leq c\,(t - t') \},$

where $c$ is the speed of information propagation in the system. The future lightcone of $x(\vec{r}, t)$ is similarly the set of all points in the future (up to a finite horizon $h^+$) that it can possibly affect:

$\ell^+(\vec{r}, t) = \{ x(\vec{r}', t') : t < t' \leq t + h^+,\ ||\vec{r}' - \vec{r}|| \leq c\,(t' - t) \}.$
From this we arrive at the local causal equivalence relation:

$\ell^-_i \sim_\epsilon \ell^-_j \iff \Pr(L^+ \mid \ell^-_i) = \Pr(L^+ \mid \ell^-_j).$

The associated equivalence classes are the local causal states [65]. They are the unique minimal sufficient statistics of past lightcones for optimal prediction of future lightcones. Each local causal state $\xi$ is the set of all past lightcones $\ell^-$ with $\epsilon(\ell^-) = \xi$. The $\epsilon$ function, which generates the causal equivalence classes, maps from past lightcones to local causal states: $\xi = \epsilon(\ell^-)$.
Segmentation is achieved by applying the $\epsilon$ function to all points in spacetime, mapping the observable field $x(\vec{r}, t)$ to its latent local causal state field $s(\vec{r}, t)$ in a process known as causal filtering. Every feature $x(\vec{r}, t)$ is mapped to its classification label (local causal state) via its past lightcone: $s(\vec{r}, t) = \epsilon(\ell^-(\vec{r}, t))$. Crucially, this ensures the latent field shares the same coordinate geometry with the observable field, such that $s(\vec{r}, t)$ is the local latent variable corresponding to the local observable $x(\vec{r}, t)$. This means the learned representation is directly utilizable for discovering pattern and structure in the physical observable field. In particular, coherent structures in $x$ are identified through locally broken symmetries in $s$ [60].
III-B Local Causal States: Reconstruction
The core reconstruction parameters are the past lightcone horizon $h^-$, the future lightcone horizon $h^+$, and the speed of information propagation $c$. These define the lightcone template, as shown in Figure 1.
III-B1 Lightcone extraction
The main task in local causal state reconstruction is empirical estimation of the conditional distributions $\Pr(L^+ \mid \ell^-)$, known as future morphs. Ultimately this comes down to counting past lightcone and future lightcone pairs $(\ell^-, \ell^+)$. Thus the first step is to extract all such lightcone pairs from the given spacetime field(s) and store them in the paired lists ([plcs], [flcs]). Lightcones are stored as flattened vectors whose dimension is determined by the lightcone template $(h^-, h^+, c)$ and the spatial dimension of the field.
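For concreteness, the extraction step can be sketched in NumPy for a 1+1D field (one space and one time dimension) with periodic boundaries; the function name and looping scheme are illustrative and not the optimized DisCo implementation, which handles 2+1D fields and extracts future lightcones analogously:

```python
import numpy as np

def extract_past_lightcones(field, h_minus, c):
    """Extract flattened past lightcones from a 1+1D spacetime field.

    field: array of shape (T, X) (time by space, periodic in space).
    Returns one vector per spacetime point with full history,
    in spacetime order. Depth d = 0 is the point itself; at depth d,
    all cells within spatial distance c*d are included.
    """
    T, X = field.shape
    plcs = []
    for t in range(h_minus, T):          # need h_minus timesteps of history
        for x in range(X):
            lc = []
            for d in range(h_minus + 1):
                for dx in range(-c * d, c * d + 1):
                    lc.append(field[t - d, (x + dx) % X])
            plcs.append(lc)
    return np.asarray(plcs)
```

For $h^- = 1$ and $c = 1$ each vector has $1 + 3 = 4$ entries, illustrating how lightcone dimension grows with the template parameters.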
III-B2 Cluster lightcones
For real-valued systems, like the fluid flows considered here, unique $(\ell^-, \ell^+)$ pairs will never repeat. Some form of discretization is needed for counting. The best way to do this is to discretize over the space of lightcones, rather than discretizing the original data itself [32]. We do this by performing (separate) distance-based clustering on the space of past lightcones and on the space of future lightcones [21].
Let $\mathcal{C}$ be the set of clusters over the real-valued past lightcones that results from some distance-based clustering of [plcs], with individual clusters denoted $C_i$ and stored in [pasts]. Two lightcones are considered equivalent if they are assigned to the same distance-based cluster:

$\ell^-_i \sim \ell^-_j \iff \gamma(\ell^-_i) = \gamma(\ell^-_j).$

The function $\gamma$ maps past lightcones to their associated distance-based cluster.
All prior work has used Euclidean distance for lightcone clustering. This gives uniform weight to all points within the finite time horizon of the lightcone, and no weight to all points outside. To smooth this step discontinuity, we introduce a lightcone distance with an exponential temporal decay. Consider two finite lightcones given as flattened vectors $\ell$ and $\ell'$, each of length $n$:

$D(\ell, \ell') = \sqrt{\sum_{i=1}^{n} (\ell_i - \ell'_i)^2 \, e^{-\tau d(i)}},$

where $\tau$ is the temporal decay rate and $d(i)$ is the temporal depth of the lightcone vector at index $i$.
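A minimal NumPy sketch of this decayed distance (the function name is ours; $\tau = 0$ recovers the plain Euclidean distance):

```python
import numpy as np

def lightcone_distance(a, b, depths, tau):
    """Lightcone distance with exponential temporal decay.

    a, b   : flattened lightcone vectors (1D arrays of equal length)
    depths : depths[i] is the temporal depth d(i) of vector index i
    tau    : temporal decay rate (tau = 0 gives Euclidean distance)
    """
    a, b, depths = map(np.asarray, (a, b, depths))
    return np.sqrt(np.sum((a - b) ** 2 * np.exp(-tau * depths)))
```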
III-B3 Build morphs
After clustering [plcs] and [flcs] to produce [pasts] and [futures], respectively, we can empirically estimate the future morphs as distributions over future clusters conditioned on past clusters, $\Pr(F \mid C_i)$. The justification for this is the assumption of continuous histories [21, Assumption 3.1]: if two past lightcones $\ell^-_i$ and $\ell^-_j$ are very close in lightcone-space, their future morphs $\Pr(L^+ \mid \ell^-_i)$ and $\Pr(L^+ \mid \ell^-_j)$ must be very similar. Using the $\gamma$ function, we state a more actionable version of this assumption, which is implicitly used, but not formally stated, in Ref. [21]: if $\gamma(\ell^-_i) = \gamma(\ell^-_j)$, then $\Pr(L^+ \mid \ell^-_i) \approx \Pr(L^+ \mid \ell^-_j)$.
The conditional distributions $\Pr(F \mid C_i)$ are found as rows of the joint distribution matrix $\mathcal{D}$, where $\mathcal{D}_{ij} = \Pr(C_i, F_j)$. To get $\mathcal{D}$ we simply count occurrences of $(C_i, F_j)$ pairs in ([pasts], [futures]).
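The counting step amounts to a histogram over label pairs; a minimal sketch, with hypothetical names:

```python
import numpy as np

def build_joint_counts(pasts, futures, n_past, n_future):
    """Count co-occurrences of (past cluster, future cluster) labels.

    pasts, futures : cluster labels for each spacetime point, in the
                     same spacetime order.
    Returns the count matrix; normalizing it gives the joint
    distribution, and normalizing each row gives the future morphs
    Pr(F | C_i).
    """
    D = np.zeros((n_past, n_future), dtype=np.int64)
    for p, f in zip(pasts, futures):
        D[p, f] += 1
    return D
```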
III-B4 Causal equivalence
With the estimated morphs in hand we can reconstruct causal equivalence over past clusters. Two past clusters $C_i$ and $C_j$ are equivalent if they have the same conditional distribution over future clusters:

$C_i \sim_\epsilon C_j \iff \Pr(F \mid C_i) = \Pr(F \mid C_j).$

The resulting equivalence classes are the approximated local causal states, and the approximation of the $\epsilon$ function assigns each past lightcone to the equivalence class of its cluster: $\hat{\epsilon}(\ell^-) = \hat{\xi}$, where $\hat{\xi}$ is the approximated state containing $\gamma(\ell^-)$.
We reconstruct equivalence using hierarchical agglomerative clustering. Distribution similarity is evaluated using a chi-squared test with a chosen p-value threshold.
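A simplified, greedy sketch of this step, with SciPy's chi-squared contingency test standing in for the actual hierarchical agglomerative clustering; the threshold alpha and all names are illustrative:

```python
import numpy as np
from scipy.stats import chi2_contingency

def causal_equivalence(D_counts, alpha=0.05):
    """Group past clusters whose future morphs are indistinguishable.

    D_counts[i, j] holds counts of (past cluster i, future cluster j).
    Rows are compared with a two-sample chi-squared test; rows that
    are not significantly different (p > alpha) are merged into the
    same approximated local causal state.
    """
    n = D_counts.shape[0]
    states = [-1] * n          # state label for each past cluster
    next_state = 0
    for i in range(n):
        if states[i] != -1:
            continue
        states[i] = next_state
        for j in range(i + 1, n):
            if states[j] != -1:
                continue
            table = np.vstack([D_counts[i], D_counts[j]])
            table = table[:, table.sum(axis=0) > 0]  # drop empty columns
            if table.shape[1] < 2:
                p = 1.0        # both morphs on the same single outcome
            else:
                _, p, _, _ = chi2_contingency(table)
            if p > alpha:      # cannot distinguish the two morphs
                states[j] = next_state
        next_state += 1
    return states
```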
III-B5 Causal filter
Using the approximated $\hat{\epsilon}$ map we can perform spacetime segmentation of $x(\vec{r}, t)$ through causal filtering. The $\gamma$ function has already been applied to produce [pasts] by clustering [plcs]. We then apply the learned map from causal equivalence to [pasts] to produce [states]. Because all these lists are in spacetime order, we simply reshape [states] to get the approximated local causal state field $\hat{s}(\vec{r}, t)$.
III-C Distributed Reconstruction Pipeline

Data loading: Stripe the spacetime data so that the spatial fields for each timestep are stored individually to allow for parallel I/O. Let workset be the timesteps that each process will extract lightcones from. Because lightcones extend in time, each process must load extra timesteps (halos): $h^-$ at the beginning of its workset and $h^+$ at the end. Each process loads its workset plus halos in parallel.

Lightcone extraction: The temporal haloing removes any need for communication during lightcone extraction, which proceeds independently for each process.

Communication barrier: Ensure all processes have their local [plcs] and [flcs] lists before proceeding.

Cluster lightcones: First cluster the past lightcones across all processes. Store the cluster assignment labels locally, in order. Then do the same for future lightcones.

Build local morphs: Each process counts pairs in its local ([pasts], [futures]) to build its local count matrix $\mathcal{D}$.

Communication barrier: Wait for all processes to build their local $\mathcal{D}$.

Build global morphs: Execute an allreduce sum of all local $\mathcal{D}$ matrices to yield the global $\mathcal{D}$ on all processes.

Causal equivalence: Since each process has the global $\mathcal{D}$, they can independently reconstruct the approximated local causal states and the $\hat{\epsilon}$ map.

Causal filter: Each process independently applies the $\hat{\epsilon}$ map to its workset to produce its local segment of $\hat{s}$.

Write output: Each process independently saves its local segment with time-order labels, so that the global field $\hat{s}$ can be constructed from all segments.
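The morph-counting and aggregation steps of this pipeline can be sketched without MPI by simulating the allreduce as an in-process sum over per-process count matrices; in the actual workflow an MPI4Py Allreduce performs this sum across nodes. Names are illustrative:

```python
import numpy as np

def local_counts(pasts, futures, n_past, n_future):
    """One process's local (past cluster, future cluster) count matrix."""
    D = np.zeros((n_past, n_future), dtype=np.int64)
    np.add.at(D, (np.asarray(pasts), np.asarray(futures)), 1)
    return D

def global_morphs(local_Ds):
    """Allreduce-sum of local count matrices (simulated in-process),
    followed by row normalization to obtain the future morphs
    Pr(F | C_i). With MPI4Py this sum would be comm.Allreduce."""
    D = np.sum(local_Ds, axis=0)
    row_sums = D.sum(axis=1, keepdims=True)
    return D / np.where(row_sums == 0, 1, row_sums)
```

Because every process ends up with the same global matrix, the causal equivalence step that follows can run independently and identically on each process, as described above.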
IV Challenges of Lightcone Clustering
The most significant step, both computationally and conceptually, in the DisCo pipeline is the discretization of lightcone-space via distance-based clustering. While there are many choices for distance-based clustering, we focus on two of the most popular clustering algorithms in the scientific community: K-Means [3] and DBSCAN [17].
The use of clustering in a structural decomposition pipeline, along with the need to conform to the continuous histories assumption, would seem to favor a density-based method like DBSCAN over a prototype-based method like K-Means. Density-connectivity should ensure nearby lightcones are clustered together, whereas K-Means must return exactly $K$ clusters and therefore may put cuts in lightcone-space that separate nearby past lightcones, violating the continuous histories assumption.
Because we do not want to put any geometric restrictions on the structures captured by local causal states, the ability of DBSCAN to capture arbitrary cluster shapes seems preferable to K-Means, which only captures convex, isotropic clusters. Furthermore, the restriction to $K$ clusters, as opposed to an arbitrary number of clusters with DBSCAN, puts an upper bound on the number of reconstructed local causal states. To test these hypotheses we experimented with both K-Means and DBSCAN at scale to evaluate their parallel scaling performance and the quality of clustering in the DisCo pipeline on real-world data sets. These experiments and results are discussed in Sections V, VI and VII.
IV-A Distributed K-Means
We developed a distributed K-Means implementation which will be upstreamed to daal4py [31], a Python package similar in usage to scikit-learn. Daal4py provides a Python interface to a large set of conventional ML algorithms highly tuned for Intel® platforms. In contrast to other distributed frameworks for ML in Python, daal4py uses a strict SPMD approach, and so assumes the input data to be pre-partitioned. All communication within the algorithms is handled under the hood using MPI.
Our single-node K-Means implementation performs one iteration of the algorithm in the following way: all data points are split into small blocks to be processed in parallel. For each block, distances from all points within the block to all current centroids are computed. Based on these distances, points are reassigned to clusters and each thread computes the partial sums of coordinates for each cluster. At the end of the iteration the partial sums are reduced across all threads to produce new centroids. We use Intel® AVX2 or Intel® AVX-512 instructions, depending on the hardware platform, for vectorizing distance computations.
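A serial NumPy sketch of one such blocked iteration (threading and AVX vectorization omitted; the function name and block size are ours, not the daal4py API):

```python
import numpy as np

def kmeans_iteration(points, centroids, block=1024):
    """One blocked K-Means iteration.

    For each block of points, distances to all current centroids are
    computed, points are assigned to the nearest centroid, and
    per-cluster coordinate sums are accumulated; the sums are reduced
    into new centroids at the end of the iteration.
    """
    k, dim = centroids.shape
    sums = np.zeros((k, dim))
    counts = np.zeros(k, dtype=np.int64)
    for start in range(0, len(points), block):
        chunk = points[start:start + block]
        # squared distances from each point in the block to each centroid
        d2 = ((chunk[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        assign = d2.argmin(axis=1)
        np.add.at(sums, assign, chunk)
        counts += np.bincount(assign, minlength=k)
    nonempty = counts > 0
    new_centroids = centroids.copy()
    new_centroids[nonempty] = sums[nonempty] / counts[nonempty, None]
    return new_centroids
```

In the multi-node setting described below, the per-block partial sums and counts are what each node contributes before the reduction that produces the new centroids.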
Our multi-node K-Means implementation follows the same general pattern: on each iteration the current centroids are broadcast to all nodes, each node computes the assignments and partial sums of coordinates for each centroid, and then one of the nodes collects all partial sums and produces new centroids. We use MPI4Py for collecting partial sums. We integrate the various methods for finding the initial set of centroids provided by Intel® DAAL: first feature vectors, random feature vectors, and K-Means++ [3].
IV-B Distributed DBSCAN
We developed both single-node and multi-node implementations of DBSCAN optimized for use with high-dimensionality lightcone data.
The single-node DBSCAN implementation computes neighborhoods without using indexing structures, like k-d trees or ball trees, which are less suitable for high-dimensional data. The overall algorithmic complexity is quadratic in the number of points and linear in feature size (lightcone dimension). Neighborhood computation for blocks of data points is done in parallel without use of pruning techniques. We use Intel® AVX2 or Intel® AVX-512 instructions, depending on the hardware platform, to compute distances between points, giving a 2-2.5x speedup compared to the non-vectorized version.
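The index-free neighborhood computation can be sketched as a blocked brute-force scan; names are illustrative:

```python
import numpy as np

def region_query(points, i, eps, block=1024):
    """Indices of all points within eps of points[i], computed by
    brute force over blocks of points (no k-d tree or ball tree),
    matching the index-free neighborhood computation described above."""
    neighbors = []
    q = points[i]
    for start in range(0, len(points), block):
        chunk = points[start:start + block]
        d2 = ((chunk - q) ** 2).sum(axis=1)   # squared distances
        hits = np.nonzero(d2 <= eps * eps)[0] + start
        neighbors.extend(hits.tolist())
    return neighbors
```

Each block's distance computation is a dense vector operation, which is what makes the AVX vectorization mentioned above effective despite the quadratic complexity.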
For multi-node DBSCAN clustering, the first step is a geometric repartitioning of the data to gather nearby points on the same node, inspired by the DBSCAN implementation of [51]. It is performed using the following recursive procedure: for a group of nodes, we choose some dimension, find an approximate median of this dimension over all points currently stored on those nodes, split the group of nodes into two halves (holding points with values of the chosen dimension less than or greater than the median), and reshuffle all points so that each node contains only points satisfying this property. The geometric repartitioning is then applied recursively to the two resulting halves (groups) of nodes. Next, each node gathers from other nodes any extra points that fall into its bounding box (extended by epsilon in each direction), similar to [52]. Using these gathered points, clustering is performed locally on each node (the single-node implementation) and the results from all nodes are then merged into a single clustering.
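A single-process sketch of the recursive median-split partitioning (the dimension choice, the distributed approximate median, and the reshuffling are simplified; n_parts is assumed to be a power of two):

```python
import numpy as np

def geometric_partition(points, n_parts):
    """Recursively split points into n_parts groups by the median of
    one dimension at each level (here the dimension of largest
    spread). Each group corresponds to the points one node would
    hold after reshuffling."""
    if n_parts == 1:
        return [points]
    dim = np.argmax(points.max(axis=0) - points.min(axis=0))
    med = np.median(points[:, dim])
    lower = points[points[:, dim] <= med]
    upper = points[points[:, dim] > med]
    return (geometric_partition(lower, n_parts // 2) +
            geometric_partition(upper, n_parts // 2))
```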
Because we use an approximate value of the median, the geometric partitions can sometimes have imbalanced sizes. This can impact overall performance, since different nodes will complete local clustering at different times and no node can proceed until every node has finished. Also, the number of extra points gathered by partitions lying in low- and high-density regions of the data set may differ, which can likewise cause load imbalance among nodes.
V Experimental Setup
Here we describe the data sets used for both the science results and scaling measurements. We also describe the HPC system – Cori – on which these computations were performed.
V-A Description of the Cori System
All of our experiments were run on the Cori system at the National Energy Research Scientific Computing Center (NERSC) at Lawrence Berkeley National Laboratory (LBNL). Cori is a Cray XC40 system featuring 2,004 nodes of Intel® Xeon™ Processor E5-2698 v3 (Haswell) and 9,688 nodes of Intel® Xeon Phi™ Processor 7250 (KNL). Both Haswell and KNL nodes were used.
Haswell compute nodes have two 16-core Haswell processors. Each processor core has a 32 KB private L1 instruction cache, a 32 KB private L1 data cache, and a 256 KB private L2 cache. The 16 cores in each Haswell processor are connected with an on-die interconnect and share a 40 MB L3 cache. Each Haswell compute node has 128 GB of DDR4-2133 DRAM.
Each KNL compute node has a single KNL processor with 68 cores (each with 4 simultaneous hardware threads, a 32 KB L1 instruction cache, and a 32 KB L1 data cache), 16 GB of on-package MCDRAM, and 96 GB of DDR4-2400 DRAM. Every two cores share a 1 MB L2 cache (32 MB in aggregate). The 68 cores are connected in a 2D mesh network. All measurements on KNL reported in this paper are performed with the MCDRAM in "cache" mode (configured as a transparent, direct-mapped cache to the DRAM).
Compute nodes in both the Haswell and KNL partitions are connected via the high-speed Cray Aries interconnect. Cori also has a Sonexion 2000 Lustre filesystem, which consists of 248 Object Storage Targets (OSTs) and 10,168 disks, giving nearly 30 PB of storage and a maximum of 700 GB/sec I/O performance.
V-B Libraries and Environment
The DisCo application code is written in Python using both open-source and vendor-optimized library packages. We use the Intel® Distribution for Python (IDP) 3.6.8. IDP incorporates optimized libraries such as Intel® MKL and Intel® DAAL for machine learning and data analytics operations to improve performance on Intel platforms. We also use NumPy (1.16.1), SciPy (1.2.0), Numba (0.42.1), Intel® TBB (2019.4), Intel® DAAL (2019.3) and Intel® MKL (2019.3) from Intel® Anaconda channels. MPI4Py (3.0.0) is built to use the Cray MPI libraries.
For all K-Means experiments, our optimized implementation was built from source with Cray MPI and ICC (18.0.1 20171018); it will be contributed back to Intel® daal4py. We compile our DBSCAN implementation with the Intel® C/C++ Compilers (ICC 18.0.1 20171018) and without the -fp-model strict compiler switch, which can impact vectorization performance. Both K-Means and DBSCAN are linked against Cray MPI binaries as well.
For scaling tests we installed the conda environments on the Lustre filesystem to improve Python package import times for large runs on Cori [58]. For K-Means experiments, we run the code with 1 MPI process per Haswell socket and limit the number of TBB threads to 32 per node by passing the -m tbb -p 32 flags to the Python interpreter. For DBSCAN experiments we run the code with 1 MPI process per node and 68 TBB threads on KNL, and 1 MPI process per node with 32 threads on Haswell nodes.
V-C Datasets
Two benchmark data sets, 2D turbulence and the clouds of Jupiter, are chosen for validation against a survey of LCS methods from [25], and a simulated climate data set is used to demonstrate scientific and scaling performance on a real-world scientific application, as in [35].
The Jupiter data is interpolated RGB video taken over a period of 10 days by the NASA Cassini spacecraft, converted to integer grayscale [47]. The 2D turbulence data set is the vorticity field from direct numerical solutions of the 2-dimensional Navier-Stokes equations using pseudo-spectral methods in a periodic domain [19]. The climate data set, used for scaling experiments, is simulated data of Earth's climate from the 0.25-degree Community Atmospheric Model (CAM5.1) [73]. Climate variables are stored on a 1152 x 768 spatial grid (float32), with a temporal resolution of 3 hours. Over 100 years of simulation output is available as NetCDF files. Our hero run processed 89.5 TB of lightcone data (obtained from 580 GB of simulated climate data).
VI Performance Results
We performed both K-Means and DBSCAN weak-scaling and strong-scaling experiments with the CAM5.1 climate dataset. K-Means experiments are run on Cori Haswell nodes, and DBSCAN experiments are run on both Cori Haswell and Cori KNL nodes. The performance of each stage of the pipeline, as well as the total time to solution (including synchronization) for an end-to-end single run, is measured in seconds. These measurements capture the end-to-end capability of the system and software, including the single-node optimizations, the efficiency of the distributed clustering methods, and the interconnect subsystems.
VI-A K-Means Performance
VI-A1 Single-node performance
Table I shows the breakdown of execution time of the different stages of the DisCo pipeline, developed from scratch, on one Haswell node and one KNL node.
The data read stage simply reads the spacetime field data into memory, which is then processed in extract to generate lightcone vectors. This involves reading the spatiotemporal neighbors of each point in the field and flattening them into an n-dimensional vector. These are serial read/write operations that are unavoidable, but the memory access pattern can be optimized. Using Numba decorators for just-in-time compilation to improve caching and vectorization performance, we obtained a 64x speedup on Haswell and a 134x speedup on KNL for this stage, resulting in an overall speedup of 16.9x on Haswell and 62x on KNL over the baseline implementation inspired by [32]. For the cluster_lc stage, our optimized K-Means implementation gives 20x better performance than stock scikit-learn [39]. The other three stages take only a small fraction of the execution time and have little to gain from directed optimization.
TABLE I: Per-stage execution time (seconds) on a single Haswell node and a single KNL node, baseline vs. optimized.

Stage | Haswell baseline (s) | Haswell optimized (s) | Speedup | KNL baseline (s) | KNL optimized (s) | Speedup
data read | 3.32 | 3.29 | 1 | 8.44 | 7.01 | 1.2
extract | 519.64 | 7.85 | 65 | 4713.47 | 35.13 | 134
cluster_lc | 399.63 | 19.03 | 21 | 513.63 | 26.34 | 19
morphs | 0.85 | 0.86 | 1 | 6.19 | 6.16 | 1
equivalence | 0.002 | 0.002 | 1 | 0.56 | 0.02 | 26
causal filter | 0.14 | 0.14 | 0.3 | 0.49 | 0.49 | 1
Total | 923.58 | 31.20 | 29.6 | 5242.78 | 74.93 | 70
VI-A2 Multi-node scaling
All experiments were conducted with fixed lightcone parameters, number of clusters K=8, and iterations=100 for K-Means clustering. The results are shown in Figure 2. The extract stage is embarrassingly parallel and thus shows excellent scaling.
For weak scaling on Haswell, we used 220 MB/node of raw data (80 timesteps of 1152 x 768 spatial fields). After lightcone extraction (84-dimensional vectors of float32 data), the size of the input to the clustering stage increases to 87.44 GB/node. We achieved a weak-scaling efficiency of 91% at 1024 nodes, measured against performance at 8 nodes. The modest loss of efficiency is expected from the increased time spent in communication at the end of each K-Means iteration as node concurrency increases.
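The per-iteration communication pattern behind these numbers can be sketched as follows: each rank assigns its local lightcone vectors to the nearest centroid, accumulates partial sums and counts, and combines them in a single reduction. This is a minimal NumPy sketch of one Lloyd-update step; the function name and the allreduce_sum hook (standing in for an MPI4Py comm.allreduce) are illustrative assumptions, not the pipeline's actual API.

```python
import numpy as np

def kmeans_step(local_points, centroids, allreduce_sum=None):
    """One distributed K-Means (Lloyd) iteration on a rank's local data.

    local_points : (n, dim) lightcone vectors held by this rank.
    centroids    : (K, dim) current global centroids.
    allreduce_sum: optional reduction across ranks (e.g. MPI allreduce);
                   None models a single-rank run.
    """
    K, dim = centroids.shape
    # squared distances to every centroid, shape (n, K)
    d2 = ((local_points[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    labels = d2.argmin(axis=1)
    sums = np.zeros((K, dim))
    counts = np.zeros(K)
    for k in range(K):
        mask = labels == k
        sums[k] = local_points[mask].sum(axis=0)
        counts[k] = mask.sum()
    if allreduce_sum is not None:  # the one synchronization point per iteration
        sums, counts = allreduce_sum(sums), allreduce_sum(counts)
    # empty clusters keep their old centroid
    new_centroids = np.where(counts[:, None] > 0,
                             sums / np.maximum(counts, 1)[:, None],
                             centroids)
    return new_centroids, labels
```

The single reduction per iteration is why communication time grows with node concurrency while local compute stays flat under weak scaling.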
For strong-scaling experiments on Haswell, we used 64 timesteps per node on 128 nodes, 32 timesteps per node on 256 nodes, 16 timesteps per node on 512 nodes, and 8 timesteps per node on 1024 nodes. After lightcone extraction, the total size of the input data to the clustering stage is 54 GB. We achieved 64% overall scaling efficiency and 81% clustering efficiency at 1024 nodes. At 1024 nodes, the local computation per node is small compared to the cost of the synchronization steps within K-Means and in the end-to-end pipeline.
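For clarity, the weak- and strong-scaling efficiencies quoted throughout follow the standard definitions; this small helper (our naming, not part of the DisCo pipeline) spells them out.

```python
def weak_scaling_efficiency(t_ref, t_n):
    """Weak scaling: work per node is fixed, so ideal runtime is constant.
    Efficiency is the reference runtime divided by the measured runtime."""
    return t_ref / t_n

def strong_scaling_efficiency(t_ref, n_ref, t_n, n_nodes):
    """Strong scaling: total work is fixed, so ideal runtime shrinks as 1/n.
    Efficiency compares measured speedup against the ideal linear speedup."""
    return (t_ref * n_ref) / (t_n * n_nodes)
```

For example, a run that takes the same time at 1024 nodes as the 8-node reference has 100% weak-scaling efficiency; a fixed-size problem whose runtime fails to drop 8x when going from 128 to 1024 nodes scores below 100% strong-scaling efficiency.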
VI-B DBSCAN Performance
VI-B1 Single-node performance
We used the pipeline optimized for K-Means; results are shown in Table I. In the cluster_lc stage, we use our implementation of the DBSCAN algorithm discussed in Section III. Designed for high-dimensional clustering, it does not use indexing data structures for nearest-neighbor calculations. On the 2D turbulence data set, the scikit-learn DBSCAN with brute-force neighborhood computation is more than 3x faster than the default scikit-learn DBSCAN, which uses k-d trees, while producing reasonable results (with only a small fraction of noise points). In turn, our DBSCAN implementation is more than 3x faster than the scikit-learn brute-force implementation (same clustering parameters) due to better on-node parallelization and use of AVX2 and AVX-512 vector instructions for computing neighborhoods and distances.
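The brute-force neighborhood computation referred to above can be expressed as one dense matrix computation, which is exactly the kind of kernel that maps well onto wide vector units. This is an illustrative NumPy sketch (function name is ours), not the optimized C/C++ implementation described in Section III.

```python
import numpy as np

def brute_force_neighborhoods(points, eps):
    """Eps-neighborhood counts via a dense pairwise-distance computation.

    For high-dimensional lightcone vectors, tree-based indexing (k-d trees)
    degrades to worse than brute force, so all pairwise squared distances
    are computed in one vectorized expression instead.
    """
    # ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b  (one matrix product, no loops)
    sq = (points ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * points @ points.T
    np.maximum(d2, 0.0, out=d2)           # clamp tiny negatives from rounding
    return (d2 <= eps * eps).sum(axis=1)  # neighbor counts, including self
```

Points whose neighbor count reaches minpts are DBSCAN core points; the matrix-product formulation is what lets compilers keep the vector units busy.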
VI-B2 Multi-node scaling
We performed DBSCAN weak-scaling and strong-scaling runs using the climate data set on both Haswell and KNL nodes. All experiments were conducted with minpts=10 and eps=0.05 for DBSCAN clustering. The results are shown in Figure 3.
For weak scaling on Haswell and KNL, we split a single timestep of the 1152 x 768 spatial field across 8 nodes. At 1024 nodes, we achieved a scaling efficiency of 34.6%. The poor scaling efficiency can be attributed to several factors. First, as discussed in Section III, distributed DBSCAN uses geometric partitioning to gather adjacent points on the same node; at each step, every node then clusters its local data subset before merging results among different nodes. Second, since we did not use indexing data structures for local clustering in DBSCAN, the complexity of each step is quadratic in the number of local points. Third, the total clustering time is equal to the running time of the slowest node, which is the node containing the largest data partition. As the number of nodes increases, the imbalance in the number of points between nodes grows (Table II), increasing the total running time, as can be seen in Figure 3. We are exploring ways of better partitioning the initial data to resolve the load-imbalance issue while maintaining scalability with increasing number of dimensions.
TABLE II: Distribution of points per node for the DBSCAN weak-scaling runs.

Nodes | 128 | 256 | 512 | 1024
Min | 87392 | 81960 | 84963 | 70824
Max | 130867 | 130147 | 154574 | 172377
Average | 105156 | 105156 | 105156 | 105156
Median | 104378 | 104759 | 103854 | 103666
For strong scaling on Haswell and KNL, we used a single timestep of the 1152 x 768 spatial field per node for the 128-node run; one timestep across 2 nodes for the 256-node run; one timestep across 4 nodes for the 512-node run; and one timestep across 8 nodes for the 1024-node run. We achieved an overall scaling efficiency of 38% and clustering efficiency of 52% on 1024 Haswell nodes. Increasing the number of nodes, while preserving the total input data size, results in a proportional decrease of the partition size gathered per node. From the quadratic dependency on the number of points mentioned earlier, reducing the size of each partition by a factor of k should decrease the execution time by a factor of k^2. However, since the partitions are not balanced, the efficiency obtained from increasing the number of nodes is marginally lower than this expected reduction in execution time.
VI-C Hero Run
Our hero run processed 89.5 TB of lightcone data (obtained from 580 GB of simulated climate data) with distributed K-Means clustering on 1024 Intel Haswell nodes with 2 MPI ranks per node and 32 TBB threads per processor. We do not use the Cori burst buffer. The 580 GB of climate data is read from the Cori /cscratch/ filesystem to generate nearly 90 TB of lightcone data after extract, which resides in on-node memory. The left column of Figure 2 shows execution times for this run. The total time to solution was 6.6 minutes with a weak-scaling efficiency of 91%, which suggests that further scaling is possible to process unprecedented amounts of scientific data and facilitate physics-based discovery in realistic systems.
VII Science Results
Snapshot images of our segmentation results on the three scientific data sets using K-Means clustering in the DisCo pipeline are shown in Figure 4. DBSCAN results are discussed at the end of this section. We emphasize that DisCo produces a spacetime segmentation; the images shown are single-time snapshots taken from spacetime videos. The left image in each row of Figure 4 – (a), (d), and (f) – is a snapshot from the unlabeled “training” data used for local causal state reconstruction. The other image(s) in each row are corresponding snapshots from the local causal state segmentation fields. Full segmentation videos are available on the DisCo YouTube channel [56].
The extreme weather event segmentation masks shown in [35] have the following semantics: cyclone, atmospheric river, and background. In contrast, the segmentation classes of DisCo are the local causal states. Each unique color in the segmentation images – Figure 4 (b), (c), (e), and (g) – corresponds to a unique local causal state. Further post-processing is needed to assign semantic labels such as cyclone and atmospheric river to sets of local causal states. We will discuss this further in Sec. VII-1 and Sec. VII-3.
VII-1 Structural Decomposition
The local causal state fields that are the direct output of DisCo, without additional semantic labels, can be considered a “structural decomposition” of the flow. Incorporating the physics of local interactions to generalize the computational mechanics theory of structure to spatiotemporal systems, the local causal states are a more principled and well-motivated decomposition approach than empirical dimensionality-reduction methods such as PCA and DMD (see Sec. II), or automated heuristics like TECA.
But does the structural decomposition of the local causal states capture meaningful “structure”? What constitutes physically meaningful structure in complex fluid flows is an incredibly challenging open problem [29, 25]. Even something as seemingly obvious as a fluid vortex does not have a generally accepted rigorous definition [16]. This is to say that it is impossible to give a quantitative assessment of how close our method gets to ground truth because ground truth for this problem currently does not exist.
VII-2 Lagrangian Coherent Structures
In the absence of a quality metric to compare different methods against, the community standard is to qualitatively compare methods against each other. In particular, the Lagrangian approach to coherent structures in complex fluids is gaining wide acceptance, and [25] surveys the current state-of-the-art Lagrangian Coherent Structure methods (see Sec. II). We directly compare our results with the geodesic and LAVD approaches (described below) on the 2D turbulence data set from [25] and the Jupiter data set from [25] and [26].
There are three classes of flow structures in the LCS framework: elliptic LCS are rotating vortex-like structures, parabolic LCS are generalized Lagrangian jet cores, and hyperbolic LCS are tendril-like stable-unstable manifolds in the flow. The geodesic approach [28, 26] is the state-of-the-art method designed to capture all three classes of LCS, and has a nice interpretation for the structures it captures in terms of characteristic deformations of material surfaces. The Lagrangian-Averaged Vorticity Deviation (LAVD) [27] is the state-of-the-art method specifically for elliptic LCS, but it is not designed to capture parabolic or hyperbolic LCS.
The local causal states are not a Lagrangian method (they are built from spacetime fields, not from Lagrangian particle flow), so they are not specifically designed to capture these structures. However, LCS are of interest because they heavily dictate the dynamics of the overall flow, and so signatures of LCS should be captured by the local causal states. As we will see in the results and comparisons in Sections VII-A and VII-B, this is indeed the case.
VII-3 Reconstruction Parameters
The complex fluid flows of interest are multiscale phenomena and so the question of how they are structured may not have a single answer. Different notions of structure may exist at different length and time scales. With this in mind, we have found that essentially all reconstruction parameters yield a physically valid structural decomposition. Varying parameters adjusts the structural details captured in a satisfying way.
Larger values of K in K-Means produce refinements of the structural decompositions obtained with smaller values of K, capturing finer levels of detail. The speed of information propagation controls the spatial scale of the structural decomposition, and the decay rate controls the temporal coherence scale. Because uniqueness and optimality of the local causal states are asymptotic properties, lightcone horizons should be set as large as is computationally feasible, though some finite cutoff must always be used. The lightcone horizon creates a discrete cut in the amount of information from local pasts that is taken into account, as opposed to the smooth drop-off of the temporal decay.
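The smooth temporal drop-off can be illustrated as a per-dimension weighting of the lightcone vectors. The exponential decay profile, function name, and argument layout below are illustrative assumptions of ours, not the paper's exact formulation; the sketch only shows how a decay rate smoothly down-weights older lightcone entries.

```python
import numpy as np

def apply_temporal_decay(lightcones, depths, tau):
    """Down-weight older lightcone entries by a smooth temporal decay.

    lightcones : (n, lc_dim) array of flattened lightcone vectors.
    depths     : per-dimension temporal depth of each lightcone entry.
    tau        : decay rate; 0 keeps all depths equally weighted
                 (assumed exp(-tau * depth) profile, for illustration).
    """
    weights = np.exp(-tau * np.asarray(depths, dtype=float))
    return lightcones * weights  # broadcasts the weights over all rows
```

As tau grows, only the depth-0 (present-time) entries retain weight, which mirrors the limit discussed below in which clustering reduces to an instantaneous image segmentation.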
The local causal states, using different parameter values, provide a suite of tools for analyzing structure in complex, multiscale spatiotemporal systems at various levels of description. Finally, we note that the limit of zero lightcone depth (or, equivalently, of infinitely fast temporal decay) produces a standard K-Means image segmentation, which captures instantaneous structure and does not account for coherence across time and space.
VII-A 2D Turbulence
While still complex and multiscale, the idealized 2D turbulence data provides the cleanest Lagrangian Coherent Structure analysis using our DisCo structural decomposition. Figure 4 (a) shows a snapshot of the vorticity field, and (b) and (c) show corresponding snapshots from structural decompositions using different reconstruction parameter values. Both use the same lightcone template. Figure 4 (b) uses parameter values chosen to reveal finer structural details that persist on shorter time scales, while Figure 4 (c) was produced using values chosen to isolate the coherent vortices, which persist at longer time scales. As can be seen in (b), the local causal states distinguish between positive and negative vortices, so for (c) we modded out this symmetry by reconstructing from the absolute value of vorticity.
All three images are annotated with color-coded bounding boxes outlining elliptic LCS to directly compare with the geodesic and LAVD LCS results from Figure 9, (k) and (l) respectively, in [25]. Green boxes are vortices identified by both the geodesic and LAVD methods, and red boxes are additional vortices identified by LAVD but not by the geodesic method. Yellow boxes are new structural signatures of elliptic LCS discovered by DisCo. Because there is a single background state, colored white, in (c), all objects with a bounding box can be assigned a semantic label of coherent structure, since they satisfy the local causal state definition given in [60]: spatially localized, temporally persistent deviations from generalized spacetime symmetries (i.e. local causal state symmetries). Significantly, DisCo has discovered the vortices in (c) as coherent structures with this very general interpretation as locally broken symmetries.
In the finer-scale structural decomposition of (b), we still have a unique set of states outlining the coherent vortices, as we would expect: if they persist on longer time scales, they will also persist on the shorter time scale. The additional structure of the background potential flow largely follows the hyperbolic LCS stable-unstable manifolds. Because these act as transport barriers, they partition the flow on either side, and the partitions are given by two distinct local causal states with the boundary between them running along the hyperbolic LCS in the unstable direction. For example, the narrow dark-blue-colored state in the upper right of (b) indicates a narrow flow channel squeezed between two hyperbolic LCS.
VII-B Jupiter
Figure 4 (d) shows a snapshot from the Jupiter cloud data, with the corresponding structural decomposition snapshot in (e). The Great Red Spot, highlighted with a blue arrow, is the most famous structure in Jupiter’s atmosphere. As it is a giant vortex, the Great Red Spot is identified as an elliptic LCS by both the geodesic and LAVD methods [26, 25]. While the local causal states in (e) do not capture the Great Red Spot as cleanly as the vortices in (b) and (c), the decomposition does exhibit the same nested state structures as the turbulence vortices. There are other smaller vortices in Jupiter’s atmosphere, most notably the “string of pearls” in the Southern Temperate Belt, four of which are highlighted with blue bounding boxes. We can see in (e) that the pearls are nicely captured by the local causal states, similar to the turbulence vortices in (b).
Perhaps the most distinctive features of Jupiter’s atmosphere are the zonal belts. The east-west zonal jet streams that form the boundaries between bands are of particular relevance to Lagrangian Coherent Structure analysis. Figure 11 in [26] uses the geodesic method to identify these jet streams as shearless parabolic LCS, indicating they act as transport barriers that separate the zonal belts. The particular decomposition shown in (e) captures a fair amount of detail inside the bands, but the edges of the bands have neighboring pairs of local causal states with boundaries that extend contiguously in the east-west direction along the parabolic LCS transport barriers. Two such local causal state boundaries are highlighted in green, for comparison with Figure 11 (a) in [26]. The topmost green line, in the center of (d) and (e), is the southern equatorial jet, shown in more detail in Figure 11 (b) and Figure 12 of [26]. Its north-south meandering is clearly captured by the local causal states.
VII-C Extreme Weather Events
Finally, we return to climate science. Figure 4 (g) shows the local causal state decomposition of the CAM5.1 water vapor field shown in Figure 4 (f). While our decompositions of the turbulence and Jupiter data align nicely with the LCS analysis in [25] and [26], we are not yet able to use the climate decomposition to construct segmentation masks of weather patterns.
However, given the promising decomposition in Figure 4 (g), we believe this is achievable. The climate decomposition shown here was performed solely on the water vapor field, and not on all physical variables of the CAM5.1 climate model, as was done in [35]. While we see signatures of cyclones and atmospheric rivers outlined in Figure 4 (g), it is not surprising that these structures are not uniquely identified from the water vapor field alone; this would be akin to describing hurricanes as just local concentrations of water vapor. More contextual spacetime information is needed, including additional physical fields and the use of larger lightcones in reconstruction.
A key distinguishing feature of hurricanes is their rotation. While rotation has its signature in the water vapor field, the three timesteps used for the lightcones in our local causal state reconstruction cannot capture much of this rotation. The vorticity field, as used for the 2D turbulence data, gives instantaneous local circulation, and from Figure 4 (b) and (c) we see that vortices are easier to capture from vorticity. Additionally, hurricanes have distinctive patterns in temperature and pressure, for example a warm, low-pressure inner core. Including the vorticity, temperature, and pressure fields in the climate decomposition should help yield a distinct set of hurricane states.
Similarly, atmospheric rivers (ARs) have distinguishing characteristics in other physical fields, most notably water vapor transport, that will help identify ARs when added to the decomposition. The use of larger lightcones should be particularly helpful for creating AR masks, as their geometry, which extends over a much larger spatial scale than can be captured by the depth-3 lightcones used in the reconstruction here, is crucial for their identification.
VII-D Lightcone Clustering Revisited
As a density-based method, DBSCAN faces the curse of dimensionality [78, 40] far more acutely than K-Means. Distinguishing density-separated regions becomes exponentially more difficult in high-dimensional spaces, such as the lightcone spaces used in our applications. Further, a limiting assumption of DBSCAN is that all clusters are defined by a single density threshold. Adaptive methods like OPTICS [2] and HDBSCAN [42] may perform better for complex data. The results we observe from experiments with DBSCAN suggest that the lightcone spaces of complex fluid flows do not contain clear density-separated regions, and thus do not yield to a density-based discretization.
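The distance-concentration effect behind this curse of dimensionality is easy to demonstrate numerically. The following sketch (function name and sample size are ours, unrelated to the DisCo pipeline) measures the contrast between the largest and smallest distances from the origin for uniform random data; as dimensionality grows, this contrast collapses, leaving density-based methods with no separable density regions.

```python
import numpy as np

rng = np.random.default_rng(0)

def distance_concentration(dim, n=2000):
    """Contrast ratio (max - min) / min of origin distances for uniform data.

    In low dimension some points sit much closer to the origin than others
    (large ratio); in high dimension all distances concentrate around a
    common value (ratio near zero), so a single density threshold cannot
    separate clusters.
    """
    pts = rng.uniform(size=(n, dim))
    d = np.linalg.norm(pts, axis=1)
    return (d.max() - d.min()) / d.min()
```

For the 84-dimensional lightcone spaces used here, this concentration is one reason a single eps threshold struggles to distinguish dense regions from the background.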
Figure 5 shows snapshots of spacetime segmentation results on the turbulence data set using both DBSCAN (a) and K-Means (b). The K-Means result in Figure 5 (b) is copied from Figure 4 (b) for easier visual comparison with the DBSCAN result in Figure 5 (a), which used the same lightcone reconstruction parameters, with minpts = 10 for DBSCAN clustering. We can see that the DBSCAN segmentation does outline structure in the flow, but it is all detailed structure of the background potential flow. None of our experiments with different parameter values were able to isolate vortices with a unique set of local causal states, which, as demonstrated above, is possible with K-Means. This inability to isolate vortices, along with the patchwork pattern of decomposition in parts of the background flow, suggests that a single density threshold on lightcone space, as assumed by DBSCAN, is incapable of coherent structure segmentation.
Similarly, our DBSCAN experiments on the climate data typically yielded outcomes that either classify most points as noise or place them into one single cluster. Some density parameter settings do produce multiple clusters, but most of these clusters contain only a small number of lightcones. None of these cases yields physically meaningful segmentation output, which again points to the shortcoming of a single density threshold. As can be seen in Figure 4 (f), the CAM5.1 water vapor data is very heterogeneous in space; the water vapor values are much more uniform towards the poles (the top and bottom of the image) than in the tropics (center). The polar regions will contribute a relatively small, but very dense, region in lightcone space compared to the contributions from the tropics.
From experiments it appears that K-Means attempts to uniformly partition lightcone space, which is consistent with these spaces not being density-separated. If this is in fact the case, the convex clusters that uniformly separate lightcone space, which result from K-Means, are actually the most natural discretization of lightcone space concordant with the continuous-histories assumption.
Despite prior assumptions and intuitions, K-Means appears to be a much more effective clustering method for hydrodynamic coherent structure segmentation using our DisCo pipeline. That being said, there are plenty of other applications for which density-based clustering using DBSCAN is the appropriate choice. Our DBSCAN implementation has now made this algorithm available for large, high-dimensional applications, with the same easy-to-use Python API as found in scikit-learn.
VIII Conclusions
We present several firsts in this paper, including a physics-based, behavior-driven approach for unsupervised segmentation of coherent spatiotemporal structures in scientific data. Our high-performance distributed pipeline, written entirely in Python with newly developed libraries, demonstrates the application of K-Means and DBSCAN clustering methods to 84-dimensional clustering problems with over 10^{11} points. We achieve 91% weak-scaling and 64% strong-scaling efficiency with distributed K-Means clustering on 1024 Intel^{®} Haswell nodes (32-core). We also demonstrate state-of-the-art segmentation results for three complex scientific datasets, including 89.5 TB of simulated climate data (as lightcones) processed in 6.6 minutes end-to-end. Our pipeline can be scaled further, which will facilitate accurate and precise structural segmentation on unprecedented amounts of scientific spatiotemporal data.
Disclaimers
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors.
Optimization Notice: Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice Revision #20110804
Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to www.intel.com/benchmarks.
Performance results are based on testing as of April 10, 2019 and may not reflect all publicly available security updates. See configuration disclosure for details. No product or component can be absolutely secure.
Configurations: Testing on Cori (see Sec. V) was performed by NERSC, UC Davis, and Intel, with the spectre_v1 and meltdown patches.
Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. Check with your system manufacturer or retailer or learn more at [intel.com].
Intel, the Intel logo, Intel Xeon, Intel Xeon Phi, Intel DAAL are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others.
Acknowledgements
AR and JPC would like to acknowledge Intel^{®} for supporting the IPCC at UC Davis. KK and P were supported by the Intel^{®} Big Data Center. This research is based upon work supported by, or in part by, the U. S. Army Research Laboratory and the U. S. Army Research Office under contracts W911NF-13-1-0390 and W911NF-18-1-0028, and used resources of the National Energy Research Scientific Computing Center, a DOE Office of Science User Facility supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231.
References
 [1] (1972) More is different. Science 177 (4047), pp. 393–396. Cited by: §IB.
 [2] (1999) OPTICS: ordering points to identify the clustering structure. In ACM Sigmod record, Vol. 28, pp. 49–60. Cited by: §VIID.
 [3] (2007) K-means++: the advantages of careful seeding. In Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms, pp. 1027–1035. Cited by: §IC, §II, §IVA, §IV.
 [4] (2009) Beyond the data deluge. Science 323 (5919), pp. 1297–1298. External Links: Document, ISSN 00368075, Link, http://science.sciencemag.org/content/323/5919/1297.full.pdf Cited by: §IA.
 [5] (2018) Deep neural networks for physics analysis on lowlevel wholedetector data at the LHC. In Journal of Physics: Conference Series, Vol. 1085, pp. 042034. Cited by: §IA.
 [6] (2018) Machine learning for molecular and materials science. Nature 559 (7715), pp. 547. Cited by: §IA.
 [7] (2017) Solving the quantum manybody problem with artificial neural networks. Science 355 (6325), pp. 602–606. External Links: Document Cited by: §IA.
 [8] (2017) How big data advances physics. Note: https://www.elsevier.com/connect/howbigdataadvancesphysics Accessed: 2019-04-07. Cited by: §IA.
 [9] (2018) How big data is changing science. Note: https://mosaicscience.com/story/howbigdatachangingsciencealgorithmsresearchgenomics/ Accessed: 2019-04-07. Cited by: §IA.
 [10] (2019) Spherical CNNs on unstructured grids. In International Conference on Learning Representations, Cited by: §IA.
 [11] (2019) Gauge equivariant convolutional networks and the icosahedral CNN. In Proceedings of the 36th International Conference on Machine Learning, K. Chaudhuri and R. Salakhutdinov (Eds.), Proceedings of Machine Learning Research, Vol. 97, Long Beach, California, USA, pp. 1321–1330. Cited by: §IA.
 [12] (2012) Between order and chaos. Nature Physics 8 (January), pp. 17–24. External Links: Document Cited by: §IIIA.
 [13] (2014) The dreams of theory. WIRES Comp. Stat. 6 (March/April), pp. 75–79. External Links: Document Cited by: §IA.
 [14] (1987) The dependence of hurricane intensity on climate. Nature 326 (6112), pp. 483. Cited by: §IA.
 [15] (2003) Tropical cyclones. Annual review of earth and planetary sciences 31 (1), pp. 75–104. Cited by: §IB.
 [16] (2017) Review of vortex identification methods. In 55th AIAA Aerospace Sciences Meeting, pp. 0989. Cited by: §VII1.
 [17] (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In KDD, Vol. 96, pp. 226–231. Cited by: §IC, §II, §IV.
 [18] (2014) A big data guide to understanding climate change: the case for theoryguided data science. Big data 2 (3), pp. 155–163. External Links: Document Cited by: §IA, §IA.
 [19] (2016) An adjoint-based approach for finding invariant solutions of Navier-Stokes equations. J. Fluid Mech. 795, pp. 278–312. Cited by: §VC.
 [20] (2016) Mass spectrometry applied to bottomup proteomics: entering the highthroughput era for hypothesis testing. Annual Review of Analytical Chemistry 9, pp. 449–472. Cited by: §IA.
 [21] (2012) LICORS: light cone reconstruction of states for nonparametric forecasting of spatiotemporal systems. arXiv:1206.2398. Cited by: §II, §IIIB2, §IIIB3.
 [22] (2015) HPDBSCAN: highly parallel DBSCAN. In Proceedings of the workshop on machine learning in high-performance computing environments, pp. 2. Cited by: §II.
 [23] (1986) Toward a quantitative theory of selfgenerated complexity. Intl. J. Theo. Phys. 25, pp. 907. Cited by: §IIIA.
 [24] (2002) A fast parallel clustering algorithm for large spatial databases, high performance data mining. Cited by: §II.
 [25] (2017) A critical comparison of Lagrangian methods for coherent structure detection. Chaos 27 (5), pp. 053104. Cited by: §II, §VC, §VII1, §VII2, §VIIA, §VIIB, §VIIC.
 [26] (2016) Geodesic transport barriers in Jupiter’s atmosphere: videobased analysis. Siam Review 58 (1), pp. 69–89. Cited by: §VII2, §VII2, §VIIB, §VIIB, §VIIC.
 [27] (2016) Defining coherent vortices objectively from the vorticity. Journal of Fluid Mechanics 795, pp. 136–173. Cited by: §VII2.
 [28] (2015) Lagrangian coherent structures. Ann. Rev. Fluid Mech. 47, pp. 137–162. Cited by: §IB, §II, §VII2.
 [29] (2012) Turbulence, coherent structures, dynamical systems and symmetry. Cambridge university press. Cited by: §IB, §II, §VII1.
 [30] (2017) PS-DBSCAN: an efficient parallel DBSCAN algorithm based on Platform of AI (PAI). arXiv preprint arXiv:1711.01034. Cited by: §II.
 [31] (Accessed: 2019-04-08) Fast, scalable and easy machine learning with daal4py. Note: https://intelpython.github.io/daal4py/index.html Cited by: §IVA.
 [32] (2007) Multifield visualization using local statistical complexity. IEEE Trans. Vis. Comp. Graphics 13 (6), pp. 1384–1391. Cited by: §II, §IIIB2, §VIA1.
 [33] (2011) Principal component analysis. Springer. Cited by: §II.
 [34] (2017) How machine learning could help to improve climate forecasts. Nature News 548 (7668), pp. 379. Cited by: §IA.
 [35] (2018) Exascale deep learning for climate analytics. In Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, pp. 51. Cited by: §IA, §VC, §VIIC, §VII.
 [36] (2006) Machine learning in bioinformatics. Briefings in Bioinformatics 7 (1), pp. 86–112. External Links: Document Cited by: §IA.
 [37] (2016) Evaluation of k-means data clustering algorithm on Intel Xeon Phi. In 2016 IEEE International Conference on Big Data (Big Data), pp. 2251–2260. Cited by: §II.
 [38] (2018) Large-scale hierarchical k-means for heterogeneous many-core supercomputers. In SC18: International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 160–170. Cited by: §II.
 [39] (Accessed: 2019-04-08) Performing numerical analysis and data analytics with Python at scale. Note: https://www.ixpug.org/documents/1526053887xpugmaidanovscalablescience.pdf Cited by: §VIA1.
 [40] (1979) Nearest Neighbour Searches and the Curse of Dimensionality. IMA Journal of Applied Mathematics 24 (1), pp. 59–70. External Links: Document Cited by: §VIID.
 [41] (2018) CosmoFlow: using deep learning to learn the universe at scale. In SC18: International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 819–829. Cited by: §IA.
 [42] (2017) Hdbscan: hierarchical density based clustering. The Journal of Open Source Software 2 (11), pp. 205. Cited by: §VIID.
 [43] (2017) Deep learning in bioinformatics. Briefings in Bioinformatics 18 (5), pp. 851–869. Cited by: §IA.
 [44] (2001) Machine learning for science: state of the art and future prospects. Science 293 (5537), pp. 2051–2055. Cited by: §IA.
 [45] (2015) The LICORS cabinet: nonparametric algorithms for spatiotemporal prediction. arXiv preprint arXiv:1506.02686. Cited by: §II.
 [46] (2017) Segmenting and tracking extreme climate events using neural networks. In Deep Learning for Physical Sciences (DLPS) Workshop, held with NIPS Conference. Cited by: §IA.
 [47] (Accessed: 2019-03-14) Jupiter cloud sequence from Cassini. Note: https://svs.gsfc.nasa.gov/cgi-bin/details.cgi?aid=3610 Cited by: §VC.
 [48] (2011) Climate data challenges in the 21st century. Science 331 (6018), pp. 700–702. External Links: Document, ISSN 0036-8075, Link, http://science.sciencemag.org/content/331/6018/700.full.pdf Cited by: §IA.
 [49] (Accessed: 2019-04-08) Parallel k-means data clustering for large data sets. Note: http://www.ece.northwestern.edu/~wkliao/Kmeans/kmeans_int64.html Cited by: §II.
 [50] (Accessed: 2019-04-08) Parallel k-means data clustering. Note: http://www.ece.northwestern.edu/~wkliao/Kmeans/index.html Cited by: §II.
 [51] (2015) BD-CATS: big data clustering at trillion particle scale. In SC’15: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1–12. Cited by: §II, §IVB.
 [52] (2015) PDSDBSCAN source code. Web page http://users.eecs.northwestern.edu/~mpatwary/Software.html. Cited by: §II, §IVB.
 [53] (2013) Scalable parallel OPTICS data clustering using graph algorithmic techniques. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, pp. 49. Cited by: §II.
 [54] (Accessed: 2019-04-08) Scalable clustering for exploratory data analysis. Note: https://medium.com/@Petuum/scalable-clustering-for-exploratory-data-analysis-60b27ea0fb06 Cited by: §II.
 [55] (2012) TECA: a parallel toolkit for extreme climate analysis. Procedia Computer Science 9, pp. 866–876. Cited by: §IA, §IA.
 [56] (Accessed: 2019-04-10) Project DisCo segmentation videos. Note: https://www.youtube.com/channel/UCwKTJloOOFQHVHDwkpqIdYA Cited by: Fig. 4, §VII.
 [57] (2019) Deep learning and process understanding for data-driven Earth system science. Nature 566 (7743), pp. 195. Cited by: §IA.
 [58] (2017) Python in the NERSC Exascale Science Applications Program for Data. In Proceedings of the 7th Workshop on Python for High-Performance and Scientific Computing, pp. 4. Cited by: §IC, §VB.
 [59] (2015) Identifying causal gateways and mediators in complex spatiotemporal systems. Nature communications 6, pp. 8502. Cited by: §IB.
 [60] (2018) Local causal states and discrete coherent structures. Chaos 28 (7), pp. 1–22. External Links: Document Cited by: §IIIA, §III, §VIIA.
 [61] (2018) Spacetime symmetries, invariant sets, and additive subdynamics of cellular automata. arXiv preprint arXiv:1812.11597. Cited by: §III.
 [62] (2014) Putting big data to good use in neuroscience. Nature Neuroscience 17 (11), pp. 1440. Cited by: §IA.
 [63] (2001) Computational mechanics: pattern and prediction, structure and simplicity. J. Stat. Phys. 104, pp. 817–879. Cited by: §IIIA.
 [64] (2006) Automatic filters for the detection of coherent structure in spatiotemporal systems. Phys. Rev. E 73 (3), pp. 036104. Cited by: §II.
 [65] (2003) Optimal nonlinear prediction of random fields on networks. Discrete Mathematics & Theoretical Computer Science. Cited by: §IIIA.
 [66] (2018) Atmospheric river tracking method intercomparison project (ARTMIP): project goals and experimental design. Geoscientific Model Development 11 (6), pp. 2455–2474. External Links: Document Cited by: §IA, §IA.
 [67] (2015) An early warning indicator for atmospheric blocking events using transfer operators. Chaos: An Interdisciplinary Journal of Nonlinear Science 25 (3), pp. 036406. External Links: Document Cited by: §II.
 [68] (2014) On dynamic mode decomposition: theory and applications. Journal of Computational Dynamics 1 (2), pp. 391–421. Cited by: §II.
 [69] (2018) Machine learning out-of-equilibrium phases of matter. Phys. Rev. Lett. 120, pp. 257204. External Links: Document Cited by: §IA.
 [70] (2019) Unsupervised machine learning based on non-negative tensor factorization for analyzing reactive-mixing. Journal of Computational Physics. Cited by: §IB.
 [71] (2005) Changes in tropical cyclone number, duration, and intensity in a warming environment. Science 309 (5742), pp. 1844–1846. Cited by: §IA.
 [72] (2010) Towards direct simulation of future tropical cyclone statistics in a highresolution global atmospheric model. Advances in Meteorology 2010. Cited by: §IA.
 [73] (2014) The effect of horizontal resolution on simulation quality in the Community Atmospheric Model, CAM5.1. Journal of Advances in Modeling Earth Systems 6, pp. 980–997. External Links: Document Cited by: §VC.
 [74] (2015) A data-driven approximation of the Koopman operator: extending dynamic mode decomposition. Journal of Nonlinear Science 25 (6), pp. 1307–1346. Cited by: §IB.
 [75] (1984) Computation theory of cellular automata. Comm. Math. Phys. 96, pp. 15. Cited by: §IIIA.
 [76] (2015) Equationfree mechanistic ecosystem forecasting using empirical dynamic modeling. Proceedings of the National Academy of Sciences 112 (13), pp. E1569–E1576. Cited by: §IB.
 [77] (2019) Causal deconvolution by algorithmic generative models. Nature Machine Intelligence 1 (1), pp. 58. Cited by: §IB.
 [78] (2012) A survey on unsupervised outlier detection in high-dimensional numerical data. Statistical Analysis and Data Mining: The ASA Data Science Journal 5 (5), pp. 363–387. External Links: Document Cited by: §VIID.