On the Scalability of Data Reduction Techniques in Current and Upcoming HPC Systems from an Application Perspective

This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 654220. An award of computer time was provided by the Innovative and Novel Computational Impact on Theory and Experiment (INCITE) program. This research used resources of the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725.
We implement and benchmark parallel I/O methods for the fully-manycore driven particle-in-cell code PIConGPU. Identifying throughput and overall I/O size as major challenges for applications on today’s and future HPC systems, we present a scaling law characterizing performance bottlenecks in state-of-the-art approaches for data reduction. Consequently, we propose, implement and verify multi-threaded data transformations for the I/O library ADIOS as a feasible way to trade underutilized host-side compute potential on heterogeneous systems for reduced I/O latency.
Production-scale research simulation codes have been optimized in recent years to achieve maximum compute performance on heterogeneous leadership computing systems such as the Titan supercomputer at Oak Ridge National Laboratory (ORNL). With close to perfect weak scaling, domain scientists can increase the spatial and temporal resolution of their simulations and explore systems without reducing dimensionality or feature resolution.
We present the consequences of near-perfect weak scaling of such a code in terms of I/O demands from an application perspective, based on production runs using the particle-in-cell (PIC) code PIConGPU [1, 2]. PIConGPU demonstrates a typical use case in which a PFlop/s-scale, performance-portable simulation [3, 4] automatically leads to PByte-scale output even for single runs.
PIConGPU is an electro-magnetic PIC code [5, 6] implemented via abstract, performance-portable C++11 kernels on manycore hardware utilizing the Alpaka library [3, 4]. Its applications range from general plasma physics and laser-matter interaction to laser-plasma based particle accelerator research.
Since its initial open-source release in 2013 with CUDA support, PIConGPU has reportedly been the fastest particle-in-cell code in the world in terms of sustained peak Flop/s. We achieved this by porting not only the bottlenecks of the PIC algorithm to new compute hardware but the complete code, thus minimizing data transfer. PIConGPU data structures are tiled, and swapping of frequently updated data residing in device memory over low-bandwidth bottlenecks such as the PCI bus is avoided.
The overall simulation is spatially domain decomposed and only nearby border areas need to be communicated across compute nodes (and accelerators) between iterations. Iterations in PIConGPU are performed with a frequency of about 10 Hz on current accelerator architectures (GPUs) when simulating 3D spatial domains and up to 60 Hz for two-dimensional domains. Each iteration updates electro-magnetic fields and plasma particles, which together constitute the simulation’s state.
1.2 Physical Observables
We define primary observables as variables directly accessible and iterated within the simulation. In terms of an electro-magnetic PIC code these are the electric field, the magnetic field and plasma particles’ properties such as position, momentum, charge-to-mass ratio and weighting. Primary observables are convenient for the domain expert for exploration, but of limited use for theories and models and nearly always inaccessible directly in experiments.
We define secondary observables as computable on-the-fly, PIC examples being the electric current density, position-filtered energy histograms or projected phase space distributions. In practice, analysis of a specific setup needs multiple additional, study-specific derivations from already derived observables, which we summarize as tertiary observables. Examples in the domain of plasma physics are integrals over phase-space trajectories, time-averaged fields, sample trajectories or particle distributions in gradients of fields, flux over time, growth rates, etc. Usually, observables accessible by experiments fall into this last category and can be compared to theoretical model predictions.
1.3 Two Example Workflows to Explore Complex Systems
In daily modeling work we usually iterate between two operational modes while investigating a new physical system. We start with an exploratory phase guided by initial hypotheses, looking at primary observables via visualizations or utilizing existing analysis pipelines to iterate over the result of strongly reduced secondary and tertiary observables. During this phase, we develop new study-specific analysis steps and working hypotheses.
The second phase continues with a high-resolution, high-throughput scan of an identified regime of the physical system to prove or falsify our working hypotheses. Due to higher resolution and full physical modeling, new observations will emerge from that step. Research is then about iterating both steps in a refined manner until a system is well understood and a model is found to describe the complex processes of interest.
1.4 Structure of this Paper
As our guiding example, we describe the Titan and Summit systems at ORNL and their I/O bandwidth hierarchies from the special perspective of a fully GPU-driven, massively parallel PIC code. We then evaluate the performance of PIConGPU’s I/O implementation, the overhead it introduces and mitigation strategies via on-the-fly data reductions. We address issues in current state-of-the-art compression schemes for our application and compare them to self-implemented compression schemes that make optimal use of underutilized hardware components. More specifically, we integrated the meta compression library blosc into ADIOS, thereby for the first time enabling multi-threaded compression within ADIOS.
2 ORNL Titan & Summit Systems
With the launch of the Titan supercomputer to the public in 2013, manycore-powered supercomputing finally became accessible on large-scale installations. Since then, the share of accelerator hardware in the TOP 100 systems has risen to one third. Such heterogeneous systems concentrate their compute performance in the accelerator component, usually exceeding the host system’s compute potential by an order of magnitude, a trend that seems to continue on upcoming systems such as Summit.
2.1 I/O Limitations in State-of-the-Art Systems
The parallel file system Atlas at ORNL, partitioned into two islands of 14 PBytes each, provides an overall design parallel bandwidth of 1 TByte/s. It is worth noting that if a hypothetical application were constantly writing at this maximum parallel bandwidth, Atlas would run out of disk space in less than 9 hours. Within each 8-hour production run of our plasma simulation code PIConGPU, we managed to write about 1 PByte of zlib-compressed data, sampling the full system state every 2000 iteration steps. PIConGPU thus presents a realistic use case that can consume a significant fraction of those resources. With the upper limit of shared storage in mind, it is clear that data reduction is of great value. Additionally, fast migration to and from tape storage and a strictly imposed short data lifetime on Atlas encourage users to avoid occupying disk space for too long.
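The storage arithmetic above can be checked with a short sketch. It assumes decimal units (1 PByte = 10^15 Byte) and a design bandwidth of 1 TByte/s, a value inferred from the 1000 GiByte/s entry for Titan in Tab. 1 rather than quoted from a specification; all function and variable names are illustrative.

```python
# Back-of-the-envelope check of the storage figures quoted above.
# Assumptions: decimal units and a file system design bandwidth of 1 TByte/s.

PBYTE = 1e15  # bytes
TBYTE = 1e12  # bytes


def hours_to_fill(capacity_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Time in hours to fill a file system when writing at full bandwidth."""
    return capacity_bytes / bandwidth_bytes_per_s / 3600.0


def capacity_fraction(output_bytes: float, capacity_bytes: float) -> float:
    """Share of the total capacity consumed by one output campaign."""
    return output_bytes / capacity_bytes


atlas_capacity = 2 * 14 * PBYTE   # two islands of 14 PByte each
design_bandwidth = 1 * TBYTE      # assumed value, see lead-in

fill_hours = hours_to_fill(atlas_capacity, design_bandwidth)
run_fraction = capacity_fraction(1 * PBYTE, atlas_capacity)

print(f"time to fill Atlas at full bandwidth: {fill_hours:.1f} h")
print(f"one 1 PByte production run: {100 * run_fraction:.1f} % of capacity")
```

Under these assumptions a single production run occupies a few percent of the shared file system, which is why storage size alone already motivates data reduction.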
An equally severe limitation for I/O besides maximum data size is the overall time for file I/O, including data preparation time, compared to one iteration of the simulation. Relative to the time one iteration takes without I/O, this introduces an overhead to the application run time, so that the runtime of a single iteration with I/O becomes the sum of both. When considering applications scaling to the full Titan system, reaching 1 TByte/s overall throughput results in a maximum node-average throughput of 55 MByte/s. Applications with near-perfect scaling can generate GPU data at two-digit Hz rates, amounting to node-local data rates at the GiByte/s scale (device global memory) and exceeding the file system performance by three orders of magnitude. Asynchronous I/O lowers this dramatic gap temporarily, but still throttles the application at least to 1/10th of the bandwidth of the CPU-GPU interconnect, not accounting for data reorganization from tiled GPU memory to per-node contiguous memory as expected by parallel I/O APIs.
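The gap described above can be illustrated numerically. The sketch below takes the node count from Tab. 1 and assumes an aggregate bandwidth of 1 TByte/s; the node-local data rate is a purely hypothetical example (6 GByte of device memory rewritten at 10 Hz), not a measured value.

```python
import math

TBYTE = 1e12  # bytes, decimal units assumed
GBYTE = 1e9


def node_average_throughput(total_bytes_per_s: float, nodes: int) -> float:
    """Aggregate file system bandwidth shared equally among all nodes."""
    return total_bytes_per_s / nodes


def orders_of_magnitude(a: float, b: float) -> float:
    """Decimal orders of magnitude separating two rates."""
    return math.log10(a / b)


titan_nodes = 18000                                   # from Tab. 1
per_node = node_average_throughput(1 * TBYTE, titan_nodes)

# Hypothetical node-local rate: 6 GByte of device memory refreshed at 10 Hz.
device_rate = 6 * GBYTE * 10
gap = orders_of_magnitude(device_rate, per_node)

print(f"node-average file system throughput: {per_node / 1e6:.1f} MByte/s")
print(f"gap to node-local data rate: {gap:.1f} orders of magnitude")
```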
2.2 Staging, Burst Buffers and I/O Backlog
Even at moderate data rates, asynchronous writing can quickly overlap with the next consecutive write period. Staging [10, 11], if operating off-node, can reduce that data pressure but is similarly limited by another order-of-magnitude gap in throughput as soon as the interconnect is accessed.
Systems such as NERSC’s Cori recently introduced so-called burst buffers. Located either off-node similar to I/O nodes or in-node as with the upcoming Summit system, the overall size of those burst buffers is usually similar to that of the global host RAM, with access bandwidth ranging between network-interconnect and parallel filesystem bandwidth.
Burst buffers provide an interesting means for temporary checkpointing and error recovery. Coupled applications that only act as either a data sink or a source for the main application are also major beneficiaries of burst buffers. Prominent examples in HPC are in situ visualizations copying on-demand snapshots [13, 14] or accessing the primary observables directly [15, 16, 17].
Nevertheless, with the current absolute sizes of burst-buffers it is close to impossible to keep data between application lifetime and parallel filesystem data lifetime, simply because they cannot store a useful multiple of primary observables. As soon as a single stage in the I/O hierarchy is not drained as fast as it is filled, a backlog throughout all previous stages is inevitable even when buffers are used.
3 I/O Measurements
PIConGPU implements I/O for outputs and checkpoints within its plugin system. Plugins are tightly coupled algorithms that can register within the main application for execution after selected iterations. They share full access to primary observables (read and write) of the application.
I/O modules implemented are parallel HDF5 and ADIOS (1.10.0). In order to tailor domain-specific needs for particle-mesh algorithms, libSplash is used as an abstraction layer. Data objects are described by the meta-data standard openPMD in human- and machine-readable markup, allowing for cross-application exchangeability as needed in post- and pre-processing workflows.
3.1 Preparation of PIConGPU Primary Observables for I/O
In preparation of GPU device data for I/O libraries, PIConGPU field data are copied from device to host via CUDA 3D memory copies, while plasma particle attributes stored in tiled data structures are copied via the mallocMC heap manager. Subsequently, scalar particle attributes are concatenated in a parallelized manner using OpenMP in preparation for efficient parallel I/O. The single GPU data size needed for saving a complete system state is typically 4 GiByte (assuming of device global memory for primary observables). The overall time for preparing these 4 GiByte of data for one GPU is typically s on the systems considered in this publication.
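The concatenation step can be pictured with a small sketch. This is not PIConGPU code: the tile layout (a list of per-tile dictionaries holding short attribute lists) and all names are hypothetical, and plain Python lists stand in for the OpenMP-parallel host code.

```python
from itertools import chain


def concatenate_attributes(tiles):
    """Turn tiled particle storage into one contiguous list per attribute.

    `tiles` is a hypothetical stand-in for tiled particle storage: each tile
    maps an attribute name to the values of the particles it holds.
    """
    if not tiles:
        return {}
    names = list(tiles[0].keys())
    return {
        name: list(chain.from_iterable(tile[name] for tile in tiles))
        for name in names
    }


# Two tiles with a different number of particles each.
tiles = [
    {"position_x": [0.1, 0.2], "weighting": [1.0, 1.0]},
    {"position_x": [0.7], "weighting": [2.0]},
]
flat = concatenate_attributes(tiles)
print(flat)
```

One contiguous array per scalar attribute is the shape that parallel I/O APIs typically expect, which is why this reorganization is part of the preparation time.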
3.2 I/O Performance in a Realistic Production Scenario
Measurements of the I/O performance are based on one of the default benchmarks implemented in PIConGPU, a simulation of the relativistic Kelvin-Helmholtz Instability [1, 23]. Starting from two spatially homogeneous, counterpropagating neutral plasma streams, a shear flow instability develops. This scenario shows good load balancing due to a nearly homogeneous data distribution across all GPUs with data size per output and GPU of GiByte. For the sake of simplicity, we thus assume in the following analysis that each node has the same output size and the same bandwidth, and that I/O operations have the same impact on all nodes of a system.
| | Titan | Hypnos (queue: ‘k20’) |
| GPUs / node | K20x | K20m |
| CPUs / node | AMD Opteron 6274 | Intel Xeon E5-2609 |
| CPU-cores / GPU | 16 (8 FP) | 2 |
| GPU / CPU Flop/s (DP) | 9.3 : 1 | 7.6 : 1 |
| * N [GiByte/s] | 1000 | 20 |
| maximum number of nodes | 18000 | 16 |
Our benchmark systems are Titan (ORNL) and the K20 queue of Hypnos (HZDR), see Tab. 1. We chose the second system intentionally, since it has roughly the same age, a similar ratio of Flop/s between CPU host and GPU device, multiple GPUs per node as in upcoming systems, even fewer CPU cores per GPU and an even higher single-node average filesystem bandwidth compared to Titan. All measurement input and results of the following sections are available in the supplementary materials, and all software used is open source.
Most relevant from an application point of view is the absolute overhead in seconds caused by enabling I/O, since it equals ‘wasted’ computing time that could otherwise be spent iterating the problem further or at higher resolution. We define the effective parallel I/O throughput in GiByte/s as the total data size written, i.e. the number of nodes times the data size per node, divided by the difference between the execution time with I/O and without I/O. Besides the (included) correction for intrinsic overheads in scaling the application, all measurements are performed as a weak scaling of PIConGPU, which is near-perfect up to the full size of Titan. We average over 11 outputs within 2000 iterations with an average application iteration frequency of one Hertz.
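The effective parallel throughput defined above can be written as a formula. The symbols are chosen here for illustration and are not taken from the original notation: $T_\mathrm{eff}$ is the effective parallel throughput, $N$ the number of nodes, $D$ the data size per node, and $t_\mathrm{w/}$ and $t_\mathrm{w/o}$ the execution times with and without I/O.

```latex
\begin{equation*}
  T_\mathrm{eff} = \frac{N \cdot D}{t_\mathrm{w/} - t_\mathrm{w/o}}
\end{equation*}
```

As a purely hypothetical example, 1000 nodes writing 4 GiByte each with a total overhead of 40 s would correspond to an effective throughput of 100 GiByte/s.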
In the following we model the I/O time per node as the sum of two contributions, eq. (2): the preparation time needed to concatenate data into large, I/O-API compatible chunks and the write time needed to synchronously send the data off RAM. The preparation time can potentially be lowered by reorganizing data on the accelerator, where RAM is usually fully utilized by the application alone, while asynchronous (non-blocking) writes that hide data transfer latency require sufficiently large temporary buffers to avoid backlog (see discussion in Sect. 2.2) as well as I/O library support. It is thus that will dominate overhead compared to iterations without I/O.
Figure 1 shows the achieved effective parallel I/O throughput on Titan. We noticed HDF5 I/O overhead becoming prohibitively large for production runs, as its parallelism is currently limited by the number of allocatable Lustre OSTs () across which one global file needs to be striped. After optimizing HDF5 performance with MPI-I/O and HDF5 hints, first manually via best practices and later using T3PIO, we abandoned the strategy of parallel output into one global file (June 2014) and started adopting ADIOS aggregators, which transparently enable striping on subgroups of processes over a limited number of OSTs (latest benchmark: September 2015). When using ADIOS in this manner, we were able to reach an overall application throughput close to 280 GByte/s, see Fig. 2. We are not aware of substantial changes in the Atlas filesystem during this period of time and therefore expect both benchmarks to be comparable.
It is important to note that measuring the I/O throughput indirectly via the introduced overhead masks the actual filesystem bandwidth, which is always higher than the previously defined effective parallel throughput for raw, untransformed data as seen by the application. This is important to keep in mind, as the effective parallel throughput determines the application performance in most realistic scenarios.
As mentioned in Sect. 2.1, absolute I/O size during production runs quickly becomes a show-stopper. Compressing data streams on the fly suggests itself as a data reduction technique, either lossless or lossy, depending on application needs. In ADIOS, compression schemes are implemented transparently for the user as so-called data transforms. One would expect not only a reduction in data size but also an increase in effective bandwidth, since the size of the compressed data written to the filesystem is lowered by the compression ratio compared to the initial size. We observed that this expectation could not be fulfilled even with the fastest compression algorithm implemented in ADIOS at the time, zlib, see Fig. 1.
We therefore expanded our model to account for the time it takes to reduce the data by compression or other means and copy it from an application-side buffer to an I/O library buffer. Up to now, data transforms in ADIOS are performed before starting to send the data off-node, while parallel HDF5 does not yet support data compression (an experimental development preview with compression support in parallel HDF5 was announced after our measurements in February 2017). In order to account for data reduction, eq. (2) needs to be extended by a synchronous reduction overhead, eq. (3).
In this extension, two parameters characterize the throughput for compression and filesystem writes, respectively, both normalized to the in-node memory copy throughput. We acknowledge that the reduction overhead could in principle be lowered by copying the data to an I/O stage immediately and performing compression there, again within the limits of the discussion in Sect. 2.2.
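One way to write the two-term model and its extension, consistent with the description above, is sketched below. The notation is an assumption made for illustration and may differ from the original: $D$ is the data size per node, $B_\mathrm{mem}$ the in-node memory copy throughput, $\tau_C$ and $\tau_W$ the compression and write throughputs normalized to $B_\mathrm{mem}$, and $f$ the compression ratio understood as compressed size over initial size.

```latex
\begin{align*}
  % two-term model: preparation plus synchronous write
  t_\mathrm{I/O} &= t_\mathrm{prep} + t_\mathrm{write} \\
  % extension by a synchronous reduction step C
  t_\mathrm{I/O}^{C} &= t_\mathrm{prep} + t_\mathrm{reduce} + t_\mathrm{write}^{C}
    = t_\mathrm{prep} + \frac{D}{B_\mathrm{mem}}
      \left( \frac{1}{\tau_C} + \frac{f}{\tau_W} \right)
\end{align*}
```

Here $t_\mathrm{reduce} = D / (\tau_C B_\mathrm{mem})$ is the time to reduce the full data set and $t_\mathrm{write}^{C} = f D / (\tau_W B_\mathrm{mem})$ the time to write the reduced data.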
Consequently, for a given normalized per-node filesystem throughput, any data reduction algorithm C needs to fulfill the break-even relation, eq. (4), in order to reduce not only the data size but also the perceived write time. This inequality arises from eq. (3) by comparing against a reduce operation that is as fast as possible. The left-hand side of eq. (4), which we call the break-even threshold for a given data transform algorithm and single (parallel) I/O stage, is discussed in greater detail in the following section.
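A break-even condition of this kind can be derived in a few steps. The sketch below rests on assumptions that are not fixed by the text: $\tau_C$ and $\tau_W$ denote the compression and write throughputs normalized to the in-node memory copy throughput, $f$ is the compression ratio (compressed size over initial size), and the pass-through case without reduction is taken to cost exactly one memory copy, i.e. $\tau_C = 1$ and $f = 1$.

```latex
\begin{align*}
  % time with reduction must be smaller than time of plain pass-through
  \frac{1}{\tau_C} + \frac{f}{\tau_W} &< 1 + \frac{1}{\tau_W} \\
  % multiply by \tau_W and collect terms
  \tau_W \left( \frac{1}{\tau_C} - 1 \right) + f &< 1
\end{align*}
```

In this form the preparation time drops out, the left-hand side plays the role of a break-even threshold, and a slower filesystem (smaller $\tau_W$) makes the condition easier to satisfy.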
In order to confirm this observation, we measured I/O performance on the K20 queue of the HZDR compute cluster Hypnos, see Fig. 3 (data points ‘no transform’ and ‘zlib’). Following eq. (4) it should be even harder for a compression algorithm, lossless or lossy, to fulfill the requirements for break-even on Hypnos. Therefore, an improvement in the latter case will be automatically favorable for Titan or a Summit-like system.
3.3 Measurement of Compression Performance
In the interest of exploring feasible compression methods for PIConGPU data, we performed ex situ benchmarks on generated data. Visualized in Fig. 4, such a measurement directly allows a prediction for individual systems and user data when compared to our model, eq. (4).
PIConGPU currently only utilizes one host thread per GPU, so we decided to implement and explore compression throughput for blosc as an example of a multi-threaded algorithm and compare it to other, previously implemented compression algorithms. Blosc provides several bitshuffle pre-conditioners, which we found to be of great importance for floating-point compression performance, in agreement with recent studies. Further benchmarks with four threads on Hypnos’ K20 queue, limited to two host threads per GPU without oversubscription, indicated that on Hypnos application throughput would benefit from more physical CPU cores per GPU since the recent filesystem upgrade to GPFS.
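The idea of combining a shuffle pre-conditioner with threaded compression can be sketched with the Python standard library alone. This is not blosc and not the ADIOS transform: zlib replaces the blosc codecs, a byte-level shuffle stands in for the bitshuffle filter, and the chunk size, thread count and function names are illustrative choices. Threads give real parallelism here because CPython's zlib releases the global interpreter lock while compressing.

```python
import math
import struct
import zlib
from concurrent.futures import ThreadPoolExecutor


def byte_shuffle(buf: bytes, typesize: int) -> bytes:
    """Group the k-th byte of every element together (byte-level shuffle)."""
    return b"".join(buf[k::typesize] for k in range(typesize))


def byte_unshuffle(buf: bytes, typesize: int) -> bytes:
    """Inverse of byte_shuffle."""
    n = len(buf) // typesize
    out = bytearray(len(buf))
    for k in range(typesize):
        out[k::typesize] = buf[k * n:(k + 1) * n]
    return bytes(out)


def compress_threaded(buf, typesize=4, chunk_elems=1 << 14, threads=4, level=1):
    """Shuffle and compress independent chunks in a thread pool."""
    if len(buf) % typesize:
        raise ValueError("buffer length must be a multiple of typesize")
    step = typesize * chunk_elems
    chunks = [buf[i:i + step] for i in range(0, len(buf), step)]

    def work(chunk):
        return zlib.compress(byte_shuffle(chunk, typesize), level)

    with ThreadPoolExecutor(max_workers=threads) as pool:
        return list(pool.map(work, chunks))  # map() preserves chunk order


def decompress(blobs, typesize=4):
    """Decompress and unshuffle all chunks, restoring the original buffer."""
    return b"".join(byte_unshuffle(zlib.decompress(b), typesize) for b in blobs)


# Smooth single-precision test signal as a stand-in for field data.
n = 100_000
raw = struct.pack(f"<{n}f", *(math.sin(1e-3 * i) for i in range(n)))
blobs = compress_threaded(raw)
restored = decompress(blobs)
ratio = sum(len(b) for b in blobs) / len(raw)  # compressed size / initial size
print(f"{len(blobs)} chunks, compression ratio {ratio:.2f}")
```

Compressing chunks independently costs some compression ratio compared to a single stream, but it lets throughput grow with the number of otherwise idle host cores.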
Fully accelerator-driven applications can use ‘the last 10 % of system performance’ on the host side in order to trade compute performance for I/O latency. The Titan system provides up to 16 physical CPU cores per GPU and Summit is expected to allow for an order of magnitude higher parallelization on the host. This section explores the limits to data reduction methods in terms of data reduction ratio and throughput for an individual I/O stage independently of the method of data reduction and only exemplified for compression methods.
4.1 Overhead of Compression in Parallel I/O
From eq. (3), the relative I/O performance ratio when using data reduction instead of direct pass-through in an I/O stage follows, where we assume that the time for reducing the data is at minimum as long as the time for copying data from node RAM to the I/O buffer. In terms of I/O throughput, reduction algorithms are clearly beneficial if they improve this ratio compared to I/O without reduction. Cases without such an improvement can still be relevant in case of limited disk space. Note that decreasing the preparation time would increase the gradient of the performance ratio, but not affect the position at which we expect break-even.
Fig. 5 shows the effect of threaded compression, keeping the compression ratio constant along iso-compression lines. Following the graph to the right, the higher the throughput of a compression algorithm, the less influence it has on the performance ratio compared to the compression ratio. Thus, an important limit to the performance ratio is the high-throughput limit for fast compression algorithms below the break-even threshold. For these, the performance ratio over non-compressed I/O can barely be improved further via throughput, but solely via compression ratio.
Exactly the opposite is true for any reduction algorithm with low throughput, to the left of the graph. Above the break-even threshold (dashed line), data reduction quickly becomes impractical for medium- to high-throughput tasks on a specific system, as the relative wasted computing time never reaches break-even, even for a small compression ratio.
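Both regimes can be explored numerically with a small sketch. It rests on assumptions that are not fixed by the text: throughputs are normalized to the in-node memory copy throughput, the compression ratio is compressed size over initial size, pass-through costs exactly one memory copy, and the performance ratio is oriented so that values above one mean that reduction pays off. All names and numbers are illustrative.

```python
def performance_ratio(tau_write, tau_reduce, ratio, t_prep_rel=0.0):
    """I/O time of pass-through divided by I/O time with reduction.

    All times are expressed in units of one in-node memory copy of the data;
    t_prep_rel is the preparation time in the same units.
    """
    passthrough = t_prep_rel + 1.0 + 1.0 / tau_write
    reduced = t_prep_rel + 1.0 / tau_reduce + ratio / tau_write
    return passthrough / reduced


def breaks_even(tau_write, tau_reduce, ratio):
    """True if reduction shortens the synchronous I/O time."""
    return tau_write * (1.0 / tau_reduce - 1.0) + ratio < 1.0


# Hypothetical stage: the filesystem is 100x slower than a memory copy,
# and the algorithm halves the data size.
tau_write, ratio = 0.01, 0.5
scan = [
    (tau_reduce, performance_ratio(tau_write, tau_reduce, ratio),
     breaks_even(tau_write, tau_reduce, ratio))
    for tau_reduce in (0.001, 0.01, 0.1, 1.0)
]
for tau_reduce, gain, ok in scan:
    print(f"tau_reduce={tau_reduce:<6} ratio={gain:.2f} break-even={ok}")
```

In this example the gain saturates once the reduction throughput is well above the write throughput, so further improvement has to come from the compression ratio, while slow algorithms stay below one regardless of how well they compress.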
Following the last argument, one can further derive from eq. (4) in the limit of ‘perfect reduction’: for any given I/O stage with given write and reduction throughput, the effective time an application spends in (synchronous) I/O can only be reduced if the data reduction operation provides at least a minimum throughput set by the write throughput of that stage.
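The limiting case can be made explicit under a stated model. As an assumption for this sketch, the break-even condition is taken in the form $\tau_W (1/\tau_C - 1) + f < 1$, with $\tau_C$ and $\tau_W$ the reduction and write throughputs normalized to the in-node memory copy throughput and $f$ the compression ratio (compressed size over initial size); the original notation may differ.

```latex
\begin{align*}
  % perfect reduction: f -> 0
  \tau_W \left( \frac{1}{\tau_C} - 1 \right) &< 1 \\
  \Longleftrightarrow \quad \tau_C &> \frac{\tau_W}{1 + \tau_W}
\end{align*}
```

For a filesystem much slower than a memory copy, $\tau_W \ll 1$, this bound approaches $\tau_C > \tau_W$: even a reduction that removes all data must run at least about as fast as the write stage it is meant to relieve.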
5 Summary and Outlook
We implemented and benchmarked parallel I/O methods on top of state-of-the-art I/O libraries for the massively parallel, fully-manycore driven, open source PIC code PIConGPU. We outlined performance bottlenecks for medium- to high-throughput applications in general and the possibility to overcome these with general data reduction techniques such as compression. We then derived and verified a scaling law that gives limits to the expected application speedup when using data reduction schemes for medium- to high-throughput applications. With this we were able to derive a system- and application-specific break-even threshold that allows for predicting when reducing data is beneficial in terms of I/O throughput compared to I/O without reduction.
5.1 Compression Algorithms
For the special case of compression algorithms, future designs to soften I/O bottlenecks first and foremost need to improve throughput for floating point data. Even for a relatively large gap between local memory and filesystem throughput as on the current Titan system, many single-threaded compression algorithms that are still in use today do not fulfill the break-even threshold in eq. (4).
Existing high-throughput compression algorithms would benefit from research improving the compression ratio instead of throughput [29, 30]. This case is of importance since, due to high entropy in HPC applications’ primary observables (e.g. floating point), only lossy compression algorithms are likely to bridge the upcoming throughput gaps between node-local high-bandwidth memory and storage accessible longer than application lifetime.
For ADIOS we proposed, implemented and benchmarked for the first time host-side multi-threaded transform methods as a feasible step to reach the break-even threshold. With that, we successfully traded unused compute performance within a heterogeneous application for overall I/O performance.
5.2 I/O Libraries
Burst buffers are identified as enablers to reduce the blocking time of the application caused by synchronous transformations within I/O libraries, but are vulnerable to backlog. Nonetheless, burst buffers alone cannot cover the gap that will arise between expected I/O on systems today and in the future. Further applications of burst buffers are coupled multi-scale simulations, in situ processing and checkpointing, which are not in the scope of this paper.
Nevertheless, for both explorative-qualitative and medium- to high-throughput quantitative studies I/O libraries need to act now to provide transparent and easily programmable means for multi-stage I/O. For any practical application, the first I/O stage should immediately start with a maximum-throughput memcopy from user RAM to I/O buffer, ideally asynchronously, while later stages need to follow fully asynchronously. Copied memory (in unutilized RAM or burst-buffers) will need several off-node user-programmable transformations which are finally staged transparently through a subsequent non-blocking data reduction (compression) pipeline. In each I/O stage, the break-even threshold derived in this paper needs to be fulfilled or backlog will occur for successive outputs and the overall application will be throttled by that specific bottleneck. With deeper memory hierarchies, user-programmability of stages will be a human bottleneck and needs to be addressed with easy and fast turnaround APIs to design application- and study-specific stages, e.g. via Python/Numba.
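The staged design described above can be illustrated with a minimal two-stage sketch. It is not an ADIOS or PIConGPU interface: the bounded queue stands in for a limited I/O buffer, zlib stands in for an arbitrary reduction, and all names and sizes are hypothetical.

```python
import queue
import threading
import zlib


def run_pipeline(outputs, buffer_slots=2, level=1):
    """Stage 1 copies each output into a bounded buffer and returns at once;
    stage 2 reduces the staged copies asynchronously in a worker thread.
    Returns the reduced blobs in output order."""
    staged = queue.Queue(maxsize=buffer_slots)  # put() blocks when full: backlog
    results = []

    def reducer():
        while True:
            item = staged.get()
            if item is None:        # sentinel: no more outputs
                break
            results.append(zlib.compress(item, level))

    worker = threading.Thread(target=reducer)
    worker.start()
    for data in outputs:
        staged.put(bytes(data))     # copy out of the application buffer
    staged.put(None)
    worker.join()
    return results


# Three hypothetical outputs held in mutable application buffers.
outputs = [bytearray([i]) * 100_000 for i in range(3)]
blobs = run_pipeline(outputs)
print([len(b) for b in blobs])
```

If the reducer drains the queue more slowly than the application fills it, `put()` blocks and the application is throttled by that stage, which is the backlog behaviour the break-even threshold is meant to predict.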
In conclusion, introducing data reduction for I/O will be necessary because of limited medium to long term storage size expected for future systems. Our analysis and measurements show that even today one should however not expect I/O performance gains when using reduction. Parallelization of reduction algorithms is one way to gain overall I/O performance but requires compute resources in addition to those used by the application. Even for fully GPU accelerated applications one should not assume resources to be ‘free’ for I/O and analysis tasks, since loosely coupled application workflows and models that depend heavily on hardly-parallelizable aspects such as atomic data lookups will in the future be more widespread and compete for the exact same resources.
[1] Michael Bussmann, Heiko Burau, Thomas E. Cowan, Alexander Debus, Axel Huebl, Guido Juckeland, Thomas Kluge, Wolfgang E. Nagel, Richard Pausch, Felix Schmitt, Ulrich Schramm, Joseph Schuchart, and René Widera. Radiative signatures of the relativistic Kelvin-Helmholtz instability. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, SC ’13, pages 5:1–5:12. ACM, 2013. DOI: 10.1145/2503210.2504564.
[2] Axel Huebl, René Widera, Alexander Grund, Richard Pausch, Heiko Burau, Alexander Debus, Marco Garten, Benjamin Worpitz, Erik Zenker, Frank Winkler, Carlchristian Eckert, Stefan Tietze, Benjamin Schneider, Maximilian Knespel, and Michael Bussmann. PIConGPU 0.2.4: Charge of bound electrons, openPMD axis range, manipulate by position, March 2017. DOI: 10.5281/zenodo.346005.
[3] Erik Zenker, Benjamin Worpitz, René Widera, Axel Huebl, Guido Juckeland, Andreas Knüpfer, Wolfgang E. Nagel, and Michael Bussmann. Alpaka–an abstraction library for parallel kernel acceleration. In Parallel and Distributed Processing Symposium Workshops, 2016 IEEE International, pages 631–640. IEEE, 2016.
[4] Erik Zenker, René Widera, Axel Huebl, Guido Juckeland, Andreas Knüpfer, Wolfgang E. Nagel, and Michael Bussmann. Performance-Portable Many-Core Plasma Simulations: Porting PIConGPU to OpenPower and Beyond, pages 293–301. Springer International Publishing, 2016. DOI: 10.1007/978-3-319-46079-6_21.
[5] C.K. Birdsall and A.B. Langdon. Plasma physics via computer simulation. The Adam Hilger series on plasma physics. McGraw-Hill, 1985. ISBN: 9780070053717.
[6] R.W. Hockney and J.W. Eastwood. Computer Simulation Using Particles. Taylor & Francis, 1988. ISBN: 9780852743928.
[7] Heiko Burau, René Widera, Wolfgang Honig, Guido Juckeland, Alexander Debus, Thomas Kluge, Ulrich Schramm, Thomas E. Cowan, Roland Sauerbrey, and Michael Bussmann. PIConGPU: A fully relativistic particle-in-cell code for a GPU cluster. IEEE Transactions on Plasma Science, 38(10):2831–2839, 2010.
[8] Francesc Alted. blosc 1.11.4-dev, March 2017.
[9] Hans Werner Meuer, Erich Strohmaier, Jack Dongarra, Horst Simon, and Martin Meuer. November 2016 - TOP500 Supercomputer Sites. https://www.top500.org/lists/2016/11/, June 2016. [Online; accessed March 22, 2017].
[10] Hasan Abbasi, Matthew Wolf, Greg Eisenhauer, Scott Klasky, Karsten Schwan, and Fang Zheng. Datastager: scalable data staging services for petascale applications. Cluster Computing, 13(3):277–290, 2010. DOI: 10.1007/s10586-010-0135-6.
[11] Ciprian Docan, Manish Parashar, and Scott Klasky. DataSpaces: An interaction and coordination framework for coupled simulation workflows. In Proc. of 19th International Symposium on High Performance and Distributed Computing (HPDC’10), June 2010. DOI: 10.1007/s10586-011-0162-y.
[12] Wahid Bhimji, Debbie Bard, Melissa Romanus, David Paul, Andrey Ovsyannikov, Brian Friesen, Matt Bryson, Joaquin Correa, Glenn K. Lockwood, Vakho Tsulaia, et al. Accelerating science with the NERSC burst buffer early user program. Proceedings of Cray Users Group, 2016.
[13] Brad Whitlock, Jean M. Favre, and Jeremy S. Meredith. Parallel in situ coupling of simulation with a fully featured visualization system. In Torsten Kuhlen, Renato Pajarola, and Kun Zhou, editors, Eurographics Symposium on Parallel Graphics and Visualization. The Eurographics Association, 2011. DOI: 10.2312/EGPGV/EGPGV11/101-109.
[14] Utkarsh Ayachit, Andrew Bauer, Berk Geveci, Patrick O’Leary, Kenneth Moreland, Nathan Fabian, and Jeffrey Mauldin. ParaView Catalyst: Enabling in situ data analysis and visualization. In Proceedings of the First Workshop on In Situ Infrastructures for Enabling Extreme-Scale Analysis and Visualization, ISAV2015, pages 25–29. ACM, 2015. DOI: 10.1145/2828612.2828624.
[15] Alexander Matthes, Axel Huebl, René Widera, Sebastian Grottel, Stefan Gumhold, and Michael Bussmann. In situ, steerable, hardware-independent and data-structure agnostic visualization with ISAAC. Supercomputing Frontiers and Innovations, 3(4), 2016.
[16] NVIDIA Corporation. NVIDIA IndeX 1.4.
[17] Jeremy S. Meredith, Sean Ahern, Dave Pugmire, and Robert Sisneros. EAVL: The extreme-scale analysis and visualization library. In Hank Childs, Torsten Kuhlen, and Fabio Marton, editors, Eurographics Symposium on Parallel Graphics and Visualization. The Eurographics Association, 2012. DOI: 10.2312/EGPGV/EGPGV12/021-030.
[18] The HDF Group. Hierarchical data format version 5 (C-API: 1.8.14), 2000-2017.
[19] Qing Liu, Jeremy Logan, Yuan Tian, Hasan Abbasi, Norbert Podhorszki, Jong Youl Choi, Scott Klasky, Roselyne Tchoua, Jay Lofstead, Ron Oldfield, et al. Hello ADIOS: the challenges and lessons of developing leadership class I/O frameworks. Concurrency and Computation: Practice and Experience, 26(7):1453–1473, 2014.
[20] Axel Huebl, Felix Schmitt, René Widera, Alexander Grund, Conrad Schumann, Carlchristian Eckert, Aleksandar Bukva, and Richard Pausch. libSplash: 1.6.0: SerialDataCollector filename API, October 2016. DOI: 10.5281/zenodo.163609.
[21] Axel Huebl, Rémi Lehe, Jean-Luc Vay, David P. Grote, Ivo Sbalzarini, Stephan Kuschel, and Michael Bussmann. openPMD 1.0.0: A meta data standard for particle and mesh based data, November 2015. DOI: 10.5281/zenodo.33624.
[22] Carlchristian Helmut Johannes Eckert. Enhancements of the massively parallel memory allocator scatteralloc and its adaption to the general interface mallocMC, October 2014. DOI: 10.5281/zenodo.34461.
[23] T. Grismayer, E.P. Alves, R.A. Fonseca, and L.O. Silva. dc-magnetic-field generation in unmagnetized shear flows. Phys. Rev. Lett., 111:015005, Jul 2013. DOI: 10.1103/PhysRevLett.111.015005.
[24] Axel Huebl et al. Supplementary materials: On the scalability of data reduction techniques in current and upcoming HPC systems from an application perspective, April 2017. DOI: 10.5281/zenodo.545780.
[25] Robert McLay, Doug James, Si Liu, John Cazes, and William Barth. A user-friendly approach for tuning parallel file operations. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC ’14, pages 229–236. IEEE Press, 2014. DOI: 10.1109/SC.2014.24.
[26] Yann Collet, Przemyslaw Skibinski, Nick Terrell, Sean Purcell, and Contributors. Zstandard (zstd) 1.1.4 - fast real-time compression algorithm, March 2017.
[27] Steinar H. Gunderson, Alkis Evlogimenos, and Contributors. Snappy 1.1.1 - a fast compressor/decompressor, 2011.
[28] Peter Lindstrom. Fixed-rate compressed floating-point arrays. IEEE Transactions on Visualization and Computer Graphics, 20(12):2674–2683, Dec 2014. DOI: 10.1109/TVCG.2014.2346458.
[29] Martin Burtscher, Hari Mukka, Annie Yang, and Farbod Hesaaraki. Real-time synthesis of compression algorithms for scientific data. In SC16: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 264–275, Nov 2016. DOI: 10.1109/SC.2016.22.
[30] Dingwen Tao, Di Sheng, Zizhong Chen, and Franck Cappello. Significantly improving lossy compression for scientific data sets based on multidimensional prediction and error-controlled quantization. In IPDPS’17: Proceedings of the 31th IEEE International Parallel & Distributed Processing Symposium, May 2017.