Reproducible Workflow on a Public Cloud for Computational Fluid Dynamics

Olivier Mesnard, Lorena A. Barba
Mechanical and Aerospace Engineering, the George Washington University, Washington, DC 20052.

In a new effort to make our research transparent and reproducible by others, we developed a workflow to run computational studies on a public cloud. It uses Docker containers to create an image of the application software stack. We also adopt several tools that facilitate creating and managing virtual machines on compute nodes and submitting jobs to these nodes. The configuration files for these tools are part of an expanded “reproducibility package” that includes workflow definitions for cloud computing, in addition to input files and instructions. This facilitates re-creating the cloud environment to re-run the computations under the same conditions.

1 Introduction

Reproducible research and replication studies are essential components for the progress of evidence-based science, even more now when nearly all fields advance via computation. We use computer simulations and data models to create new knowledge, but how do we provide evidence that this new knowledge is justified? Traditional journal publications exclude software and data products from the peer-review process, yet reliance on ever more complex computational artifacts and methods is the norm. Lacking standards for documenting, reporting and reviewing the computational facets of research, it becomes difficult to verify and corroborate the findings presented in journals and conferences [1].

The literature is cluttered with confused and sometimes contradictory definitions for reproducible research, reproducibility, replicability, repetition, etc.[2]. It is thus worth clarifying how we use these terms. “Reproducible research” was used by geophysics professor Jon Claerbout in the 1990s to mean computational studies that can be reproduced by other scientists. His research group at Stanford created a reproducible-research environment[3] whose goal was complete documentation of scientific computations, in such a way that a reader could reproduce all the results and figures in a paper using the author-provided computer programs and raw data. This requires open data and open source software and, for this reason, the reproducibility movement is closely linked with the open science movement. The term “replication” has been adopted to refer to an independent study generating new data which, when analyzed, lead to the same findings [4]. We follow this convention, adopted also recently in a report from the National Academies of Sciences, Engineering, and Medicine [5].

Many efforts to develop cyberinfrastructure that supports reproducible research have been launched in past years. They address concerns like automatic capture of changes to software (version control systems), persistent data archival, global registration of data identifiers, workflow management, and more. But capturing the whole computational environment used in a research project remains one of the most difficult problems. Computational researchers often use a multi-layer stack of software applications that can be laborious to build from scratch. Container technology like Docker is a recent addition to the reproducibility toolbox [6]. In this work, we develop and assess a workflow for reproducible research on a public cloud, adopting Docker containers and several other tools to automate and fully document scientific computations.

Universities and national laboratories spend millions of dollars to deploy and maintain on-site high-performance computing (HPC) clusters. At the George Washington University, we have access to a cluster called Colonial One. The cluster is now years old and approaching its end-of-life. On average, its computational resources are idle of the time and unavailable to users for roughly 5 days a year (due to maintenance). Administrators of the cluster have been recently considering integrating cloud-computing platforms in the research-computing portfolio. Over the past decade, cloud-computing platforms have rapidly evolved, now offering solutions for scientific applications. From a user’s point of view, a cloud platform offers great flexibility (hardware and virtual machines) with instantaneous availability of infinite (in appearance) computational resources. No more waiting time in job-submission queues! This promises to greatly facilitate code development, debugging, and testing. Resources allocated on the cloud are released as soon as the job is done, avoiding paying for idle time. On a public-cloud platform (such as Microsoft Azure, Google Cloud, or Amazon AWS) with a “pay-as-you-go” type of subscription, a user directly sees how much it costs to run a scientific application, while such information usually remains obscure to the end-user on university-managed clusters. Cost models that make sense for researchers, labs, and universities are still unclear. Yet, this information is key when making a decision to adopt a cloud workflow for research. We report here what we have learned from nearly two years of using cloud computing for our computational fluid dynamics (CFD) simulations, with the hope that it will shed some light on the costs and benefits, and help others considering using cloud computing for their research.

2 Reproducible cloud-based workflow

Scientific publications reporting computational results often lack sufficient details to reproduce the researcher’s computational environment; e.g., they may fail to mention external libraries used along with the main computational code. We have learned the hard way how different versions of the same external library can alter the numerical results and even the scientific findings of a computational study[7]. This section presents an overview and mini-tutorial of the workflow we developed to aim for the highest level of reproducibility of our computational research, with the best available technology solutions. In the process of creating this reproducible workflow, we also evaluated the suitability of public cloud offerings by Microsoft Azure for our research computing needs. The tools we adopted for computing on cloud resources are specific to this provider.

2.1 Use of Container Technology

Docker terminology:

  • Docker: An open-source OS-level virtualization software to create and run multiple independent, isolated, and portable containers on the same host operating system.

  • Docker image: Union of layered filesystems stacked on top of each other. Each layer defines a set of differences from the previous layer. A user composes (builds) a Docker image using a Dockerfile, usually starting from a base image (such as ubuntu:16.04).

  • Docker container: A standardized unit created from a Docker image to deploy an application or a runtime environment. A Docker container can be seen as an instance of a Docker image that includes an additional writable layer at the top of the layered stack. When a container is deleted, so is the writable layer, while the image remains unchanged.

  • Dockerfile: An ASCII file including the sequence of instructions to create a Docker image for the computational runtime environment. A Dockerfile contains Docker keywords such as FROM, RUN, or COPY. Each instruction in the Dockerfile creates a layer in the Docker image.

  • DockerHub: The official registry of Docker Inc.; a cloud-based registry service to store, share, and retrieve (public or private) Docker images.

To overcome the so-called “dependency hell” and facilitate reproducibility and portability, we use the container technology provided by the open-source project Docker. A container represents an isolated user space where application programs run directly on the operating system’s kernel of the host (with limited access to its resources). In contrast with virtual machines, containers do not include a full operating system, making them lighter and faster. Containers allow re-creating the same runtime environment of an application (including all its dependencies) from one machine to another. They empower researchers to share pre-built images of their software stack, ameliorating one of the biggest pain points for reproducible research: building the software on a different machine. The majority of enterprise developers today are familiar with Docker container technology (first released six years ago). But in academic settings, many are still unaware of its value. We present this section as an overview for research software developers unfamiliar with Docker, but comfortable with scripting, version control, and distributed collaboration.

A container is an instance of an image. The developer builds this image on a local machine and pushes it to a public registry to share it with other users. Users pull the image from the public registry (in our case, DockerHub) and create containers out of it. To create a Docker image, the developer writes a Dockerfile: an ASCII file containing instructions that tell Docker what to do. For example, we start building a new image from a base image using the keyword FROM. We then write the different shell instructions (with the RUN instruction) to build a multi-layered runtime environment of the application that includes all its dependencies. Once we have built and tested the image on the local machine, we push it to a repository on the DockerHub registry: a place to store and retrieve Docker images. Listing 1 provides the command lines to build (docker build) and push (docker push) an image of our CFD software (barbagroup/petibm:0.4-GPU-IntelMPI-ubuntu) that we used to obtain some of the results presented in the next section. Here, CLOUDREPRO is an environment variable set to the local path of the GitHub repository for this paper, cloud-repro, cloned on the user’s machine.

$ cd $CLOUDREPRO/docker/petibm
$ docker build --tag=barbagroup/petibm:0.4-GPU-IntelMPI-ubuntu --file=Dockerfile .
$ docker push barbagroup/petibm:0.4-GPU-IntelMPI-ubuntu
Listing 1: Build and push a Docker image.

Anyone can now pull the application image from DockerHub, and create a Docker container to run the CFD application software in a faithfully reproduced local environment. The great advantage of container technology is that the same container that was built and tested on the developer’s local machine can now run at scale in production mode on virtual machines, bare metal, or even cloud platforms. Our objective is to create and run containers on a public cloud provider such as Microsoft Azure. Figure 1 shows a graphical representation of the workflow we developed using Docker and various tools for running CFD simulations on Microsoft Azure. The next section explains these tools.

Fig. 1: Reproducible workflow on the public cloud provider Microsoft Azure. Our CFD software is version-controlled with Git and GitHub. We push to DockerHub a Docker image of our CFD application with all its dependencies. Azure CLI is used to configure accounts on Microsoft Azure and to upload/download data to/from an Azure Storage account. With Batch Shipyard, we create a pool on Azure Batch and run container-based simulations using our Docker image.

2.2 Use of Public Cloud Resources

To run computational jobs on Microsoft Azure, we use several tools that facilitate creating and managing virtual machines on compute nodes, and submitting jobs to those nodes. We use a service called Azure Batch that leverages Microsoft Azure at no extra cost, relieving the user from manually creating, configuring, and managing an HPC-capable cluster of cloud nodes, including virtual machines, virtual networks, and the job- and task-scheduling infrastructure. Azure Batch works with both embarrassingly parallel workloads and tightly coupled MPI jobs (the latter being the case of our CFD software). To use Azure Batch, we first need to configure a workspace on Microsoft Azure. This can be done either via the Azure Portal in a web browser or from a local terminal using the open-source tool Azure CLI (version 2.0.57). We prefer to use the command-line solution (program az), as it allows us to keep track of the steps taken to configure the cloud workspace (see Listing 2). First, we set the Azure subscription we want to use (let’s call it reprosubscription). Next, we create a resource group (reprorg) located in this case in the East US region, which will contain all the Azure resources. We create an Azure Storage account (reprostorage) in the resource group, as well as an Azure Batch account (reprobatch) associated with the storage account. Finally, we create a fileshare in the storage account, setting its size with the --quota option.

$ az account set --subscription reprosubscription
$ az group create --name reprorg --location eastus
$ az storage account create --name reprostorage --resource-group reprorg --sku Standard_LRS --location eastus
$ az batch account create --name reprobatch --resource-group reprorg --location eastus --storage-account reprostorage
$ az storage share create --name fileshare --account-name reprostorage --account-key storagekey --quota 10
Listing 2: Configure the workspace on Microsoft Azure.

To create computational nodes and submit container-based jobs to Azure Batch, we use the open-source command-line utility Batch Shipyard (version 3.6.1). Batch Shipyard is entirely driven by configuration files: the utility parses user-written YAML files to automatically create pools of compute nodes on Azure Batch and to submit jobs to those pools. Typically, we need to provide four configuration files:

  • config.yaml contains information about the Azure Storage account and Docker images to use.

  • credentials.yaml stores the necessary credentials to use the different Microsoft Azure service platforms (e.g., Azure Batch and Azure Storage).

  • pool.yaml is where the user configures the pool of virtual machines to create.

  • jobs.yaml details the configuration of the jobs to submit to the pool.

Once the configuration files are written, we invoke Batch Shipyard through the shipyard program on our local machine. The folder examples/snake2d2k35/config_shipyard in the repository accompanying this paper contains example YAML files to create a pool of two NC24r compute nodes (featuring K80 GPUs and using an InfiniBand network). Listing 3 shows the commands to run in your local terminal to create a pool of compute nodes on Azure Batch and submit jobs to it. The Docker image of our CFD application is pulled from the registry to the virtual machines during the pool creation (shipyard pool add). We then upload the input files to the compute nodes (shipyard data ingress) and submit jobs to the pool (shipyard jobs add). The tasks for a job will start automatically upon submission.

$ cd $CLOUDREPRO/examples/snake2d2k35
$ az storage directory create --name snake2d2k35 --share-name fileshare --account-name reprostorage
$ export SHIPYARD_CONFIGDIR=config_shipyard
$ shipyard pool add
$ shipyard data ingress
$ shipyard jobs add
Listing 3: Create a pool and submit jobs to it.
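
Because Batch Shipyard is driven entirely by the four YAML files listed above, it can be convenient to check that they are all present before creating a pool. The following Python sketch is not part of the paper's repository; the helper name missing_config_files is hypothetical, and the fallback directory name config_shipyard simply mirrors the value exported in Listing 3.

```python
import os

# File names taken from the list of Batch Shipyard configuration files above.
REQUIRED_FILES = ("config.yaml", "credentials.yaml", "pool.yaml", "jobs.yaml")


def missing_config_files(config_dir):
    """Return the required configuration files that are absent from config_dir."""
    return [
        name
        for name in REQUIRED_FILES
        if not os.path.isfile(os.path.join(config_dir, name))
    ]


# Use the same environment variable as Listing 3; the default is an assumption.
config_dir = os.environ.get("SHIPYARD_CONFIGDIR", "config_shipyard")
missing = missing_config_files(config_dir)
if missing:
    print("Missing Batch Shipyard configuration files:", ", ".join(missing))
else:
    print("All four configuration files found in", config_dir)
```

Running such a check before shipyard pool add avoids allocating billable nodes only to discover that a configuration file is missing.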

Once the simulations are done (i.e., the job tasks are complete), we delete the jobs and the pool (Listing 4). The output of the computation is now stored in the fileshare in our Azure Storage account. We can download the data to our local machine to perform additional post-processing steps (such as flow visualizations).

$ shipyard pool del
$ shipyard jobs del
$ mkdir output
$ az storage file download-batch --source fileshare/snake2d2k35 --destination output --account-name reprostorage
Listing 4: Delete the pool and jobs, and download the output to a local machine.

Reproducible research requires authors to make their code and data available. Thus, the Dockerfile and YAML configuration files should be made part of an extended reproducibility package that includes workflow instructions for cloud computing, in addition to other input files. Such a reproducibility package facilitates re-creating the cloud environment to run the simulations under the same conditions. The reproducibility packages of the examples showcased in the next section are available in the GitHub repository cloud-repro, which includes instructions on how to reproduce the results.

3 Results

The top concerns of researchers considering cloud computing are performance and cost. Until just a few years ago, the products offered by cloud providers were unsuitable to the needs of computational scientists using HPC, due to performance overhead of virtualization or lack of support for fast networking [8]. Azure only introduced nodes with GPU devices in late 2016 and InfiniBand support for Linux virtual machines on the NC-series in 2017. Our first objective was to assess performance on cloud nodes for the type of computations in our research workflows with tightly coupled parallel applications. We present results from benchmarks and test cases showing that we are able to obtain similar performance in terms of latency and bandwidth using the Azure virtual network, compared to a traditional university-managed HPC cluster (Colonial One). Table I lists the hardware specifications of the nodes used on Microsoft Azure and Colonial One. Our target research application relies on three-dimensional CFD simulations with our in-house research software. We include here a sample of the types of results needed to answer our research question, obtained by running on the public cloud using the reproducible workflow described in Section 2. The goal is to showcase the potential of cloud computing for CFD, share the lessons we learned in the process, and analyze the cost scenarios for full applications.

Platform      Node    Intel Xeon CPU                    # threads  NVIDIA GPU  RAM (GiB)  SSD Storage (GiB)
Azure         NC24r   Dual 12-Core E5-2690v3 (2.60GHz)  24         4 x K80     224        1440
Azure         H16r    Dual 8-Core E5-2667v3 (3.20GHz)   16         -           112        2000
Colonial One  Ivygpu  Dual 6-Core E5-2620v2 (2.10GHz)   12         2 x K20     120        93
Colonial One  Short   Dual 8-Core E5-2650v2 (2.60GHz)   16         -           120        93
TABLE I: Hardware specifications of nodes used on Microsoft Azure and Colonial One. On both platforms, MPI applications take advantage of RDMA (Remote Direct Memory Access) network with FDR InfiniBand.

3.1 MPI Communication Benchmarks

We ran point-to-point MPI benchmarks from the Ohio State University Micro-Benchmarks suite (OSU Micro-Benchmarks, version 5.6) on Microsoft Azure and Colonial One, to investigate performance in terms of latency and bandwidth. The latency test is carried out in a ping-pong fashion and measures the time elapsed to get a response; the sender sends a message with a certain data size and waits for the receiver to send back the message with the same data size. The bandwidth test measures the maximum sustained rate that can be achieved on the network; the sender sends a fixed number of messages to a receiver that replies only after receiving all of them. The tests ran on NC24r nodes on Azure and Ivygpu nodes on Colonial One, all of them featuring a network interface for RDMA (Remote Direct Memory Access) connectivity to communicate over an InfiniBand network. (RDMA allows direct access to a remote machine’s memory without involving the operating system of the host and remote.) Fig. 2 reports the mean latencies and bandwidths obtained over repetitions on both platforms. For small message sizes, the average latencies on Colonial One and Azure are and s, respectively. As the message size increases, the latency becomes smaller on Azure compared to Colonial One. The maximum sustained bandwidth rates for Colonial One and Azure are on average and GB/s, respectively. For all message sizes, a higher bandwidth rate was achieved on Azure. Over the last few years, Microsoft Azure has indeed improved its HPC solutions to provide networking capabilities that are comparable to or even better than those of our 6-year-old university-managed cluster.
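
To make the two measurements concrete, the Python sketch below turns raw timings into a latency and a bandwidth figure. It is only an illustration of the arithmetic, not the benchmark code itself: reporting the latency as half of the mean round-trip time is our assumption about the usual ping-pong convention, the function and parameter names are hypothetical, and the timings in the example are made up.

```python
def one_way_latency_us(elapsed_s, iterations):
    """Estimate the one-way latency in microseconds from a ping-pong test.

    Each iteration is one round trip (send, then wait for the echo), so the
    one-way latency is taken as half of the mean round-trip time (assumption).
    """
    return elapsed_s / (2.0 * iterations) * 1.0e6


def bandwidth_mb_per_s(message_bytes, messages_per_iteration, iterations, elapsed_s):
    """Estimate the sustained bandwidth in MB/s (1 MB = 1e6 bytes).

    The sender pushes a fixed number of messages per iteration and the
    receiver replies only once all of them have arrived.
    """
    total_bytes = message_bytes * messages_per_iteration * iterations
    return total_bytes / 1.0e6 / elapsed_s


# Hypothetical timings, for illustration only (not measurements from this study).
latency = one_way_latency_us(elapsed_s=0.02, iterations=10000)
bandwidth = bandwidth_mb_per_s(
    message_bytes=1 << 20, messages_per_iteration=64, iterations=20, elapsed_s=0.25
)
print(f"latency: {latency:.2f} us, bandwidth: {bandwidth:.1f} MB/s")
```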

Fig. 2: Point-to-point latency (top) and bandwidth (bottom) obtained on Colonial One (Ivygpu nodes) and on Microsoft Azure (NC24r nodes). Benchmark results are averaged over repetitions.

3.2 Poisson Benchmarks

CFD algorithms often require solving linear systems with iterative methods. For example, the Navier-Stokes solver implemented in our software requires the solution of a Poisson system at every time step to project the velocity field onto the divergence-free space (to satisfy the incompressibility constraint). We investigated the time-to-solution for a three-dimensional Poisson system (obtained with a 7-point stencil central-difference scheme) on different nodes of Microsoft Azure and Colonial One. The solution method was a Conjugate-Gradient (CG) method with a classical algebraic multi-grid (AMG) preconditioner using an exit criterion set to an absolute tolerance of . The iterative solver ran on CPU nodes (H16r instances on Azure and Short nodes on Colonial One) using the CG algorithm from the PETSc library[9] and an AMG preconditioner from Hypre BoomerAMG. Fig. 3 (top) reports the mean runtimes (averaged over 5 repetitions) to iteratively solve the system, on a uniform grid of 50 million cells (), as we increase the number of nodes in the pool. We also solved the Poisson system with the NVIDIA AmgX library on multiple GPU devices using NC24r instances on Azure and Ivygpu nodes on Colonial One. The Poisson system for a base mesh of 6.25 million cells () was solved on a single compute node using MPI processes and GPU devices; we then doubled the mesh size as we doubled the number of nodes (keeping the same number of MPI processes and GPUs per node). The number of iterations to reach convergence increases with the size of the system, so we normalize the runtimes by the number of iterations. Fig. 3 (bottom left) shows the normalized mean runtimes (5 repetitions) obtained on Azure and Colonial One. The smaller runtimes on Azure are explained by the fact that the NC-series of Microsoft Azure features NVIDIA K80 GPU devices with a higher compute capability than the K20 GPUs on Colonial One. The bottom-right panel of the figure reports the normalized mean runtimes obtained on Azure when we load the bandwidth with a larger problem; the base mesh now contains 25 million cells (). We observe larger variations in the time-to-solution for the Poisson system on Microsoft Azure; runtimes are more uniform on Colonial One.
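
For readers unfamiliar with this kind of benchmark, the following pure-Python sketch solves a tiny three-dimensional Poisson system discretized with the 7-point stencil, using a plain (unpreconditioned) Conjugate-Gradient iteration with an absolute-tolerance exit criterion. It is a toy illustration under several assumptions of ours: homogeneous Dirichlet boundaries, a 6x6x6 grid, a manufactured right-hand side, and no AMG preconditioner. It is not the PETSc, Hypre, or AmgX configuration used for the benchmarks.

```python
import math


def apply_neg_laplacian(u, n, h):
    """Apply the 7-point stencil of -Laplacian on an n^3 grid (zero Dirichlet)."""
    out = [0.0] * (n * n * n)
    for k in range(n):
        for j in range(n):
            for i in range(n):
                c = (k * n + j) * n + i
                s = 6.0 * u[c]
                if i > 0:
                    s -= u[c - 1]
                if i < n - 1:
                    s -= u[c + 1]
                if j > 0:
                    s -= u[c - n]
                if j < n - 1:
                    s -= u[c + n]
                if k > 0:
                    s -= u[c - n * n]
                if k < n - 1:
                    s -= u[c + n * n]
                out[c] = s / (h * h)
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def conjugate_gradient(b, n, h, atol=1.0e-8, max_iter=1000):
    """Unpreconditioned CG; stops when the residual 2-norm drops below atol."""
    x = [0.0] * len(b)
    r = list(b)  # residual for the zero initial guess
    p = list(r)
    rr = dot(r, r)
    for iteration in range(max_iter):
        if math.sqrt(rr) <= atol:
            return x, iteration
        Ap = apply_neg_laplacian(p, n, h)
        alpha = rr / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rr_new = dot(r, r)
        p = [ri + (rr_new / rr) * pi for ri, pi in zip(r, p)]
        rr = rr_new
    return x, max_iter


# Manufactured problem: pick a solution, build the right-hand side, solve.
n = 6
h = 1.0 / (n + 1)
x_true = [math.sin(0.1 * (c + 1)) for c in range(n * n * n)]
b = apply_neg_laplacian(x_true, n, h)
x, iterations = conjugate_gradient(b, n, h)
max_error = max(abs(xi - ti) for xi, ti in zip(x, x_true))
print(f"converged in {iterations} iterations, max error {max_error:.2e}")
```

Dividing a measured runtime by the returned iteration count gives a per-iteration time, which is the normalization applied to the AmgX results.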

Fig. 3: Runtimes to solve a Poisson system on Colonial One and on Microsoft Azure. The benchmark was repeated 5 times and we report the mean runtimes as well as the extrema. The Poisson system was solved on Azure using PETSc on H16r nodes and using AmgX on NC24r nodes. On Colonial One, we solved the system with PETSc on Short nodes and with AmgX on Ivygpu nodes. The runs with PETSc solved a Poisson system on a mesh of 50 million cells (top). With AmgX (bottom left), the base mesh contains 6.25 million cells and we scale it with the number of nodes. On Azure, we also ran the Poisson benchmark with a finer base mesh of 25 million cells (bottom right). Runtimes obtained with AmgX were normalized by the number of iterations required to reach an absolute tolerance of .

3.3 Flow Around a Flying Snake Cross-Section

Our research lab is interested in understanding the aerodynamics of flying animals via CFD simulations. One of our applications deals with the aerodynamics of a snake species, Chrysopelea paradisi, that lives in South-East Asia. This arboreal reptile has the remarkable capacity to turn its entire body into a wing and glide over several meters[10]. The so-called “flying snake” jumps from tree branches, undulates in the air, and is able to produce lift by expanding its ribcage to flatten its ventral surface (morphing its normally circular cross-section into a triangular shape).

To study the flow around the flying snake, we developed CFD software called PetIBM[11], an open-source toolbox that solves the two- and three-dimensional incompressible Navier-Stokes equations using a projection method (seen as an approximate block-LU decomposition of the fully discretized equations[12]) and an immersed-boundary method (IBM). Within this framework, the fluid equations are solved over an extended grid that does not conform to the surface of the immersed body. To model the presence of the body, the momentum equation is augmented with a forcing term that is activated in the vicinity of the immersed boundary. This technique allows the use of simple fixed structured Cartesian grids to solve the equations. PetIBM implements immersed-boundary methods that fit into the projection method; in the present study, we use the IBM scheme proposed in [13]. PetIBM runs on distributed-memory architectures using the efficient data structures and parallel routines from the PETSc library. The software can also solve linear systems on multiple GPU devices distributed across the nodes with the NVIDIA AmgX library and our AmgXWrapper[14]. One of the requirements for reproducible computational results is to make the code available under a public license (allowing reuse and modification by others). In that regard, PetIBM (version 0.4) is open source, version-controlled on GitHub, and shared under the permissive (non-copyleft) BSD 3-Clause license. We also provide a Docker image of PetIBM on DockerHub and its Dockerfile is available in the GitHub repository of the software.

3.3.1 2D Flow Around a Snake Cross-Section

We submitted a job on Azure Batch (with Batch Shipyard) to compute the two-dimensional flow around an anatomically accurate sectional shape of the gliding snake. The cross-section has a chord-length and forms a -degree angle of attack with the incoming freestream flow. The Reynolds number, based on the freestream speed, the body chord-length, and the kinematic viscosity, is set to . The immersed boundary is centered in a computational domain that contains just over million cells. The grid is uniform with the highest resolution in the vicinity of the body and stretched to the external boundaries with a constant ratio (see Table II for details about the grid). The discretization of the immersed boundary has the same resolution as the background fluid grid. A convective condition was used at the outlet boundary while freestream conditions were enforced on the three other boundaries. The job was submitted to a pool of two NC24r nodes, using MPI processes and GPU devices per node. It completed time steps (i.e., time units of flow simulation with time-step size ) in just above hours (wall-clock time).
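
For reference, the two relations used in this paragraph can be written explicitly. The symbols are our own notation, introduced only for this illustration: freestream speed U_infty, chord-length c, kinematic viscosity nu, number of time steps N_t, and time-step size Delta t.

```latex
\mathrm{Re} = \frac{U_\infty \, c}{\nu},
\qquad
T = N_t \, \Delta t
```

Here T is the simulated time, expressed in the same non-dimensional time units as in the text.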

Fig. 5 shows the history of the force coefficients on the two-dimensional cross-section. The lift coefficient only maintains its maximum mean value during the early stage of the simulation (up to time units). Between and time units, the mean value starts to drop. The time-averaged force coefficients (between and time units) are reported in Table III. Fig. 4 shows snapshots of the vorticity field after , , , and time units of flow simulation. After time units, the vortices shed from the snake section are almost aligned in the near wake (with a slight deflection towards the lower part of the domain). Snapshots of the vorticity at time units and show that the initial alignment is perturbed by vortex-merging events (same-sign vortices merging together to form a stronger one). Following that, the wake signature is altered for the rest of the simulation: vortices are not aligned anymore, the wake becomes wider (leading to lower aerodynamic forces) with a 1S+1P vortex signature (a single clockwise-rotating vortex on the upper part and a vortex pair on the lower part).

Fig. 4: Filled contour of the vorticity field () after , , , and time units of flow simulation with PetIBM for the snake cross-section at a -degree angle of attack and Reynolds number . Vortex merging events trigger a change in the wake signature causing the drop in the mean value of the lift coefficient.

3.3.2 3D Flow Around a Snake Cylinder

Although the two-dimensional simulations can give us some insights into the flow dynamics happening in the wake behind the snake, we know that at this Reynolds number (), three-dimensional structures will develop in the wake. We submitted jobs on Azure Batch to perform direct numerical simulation of the three-dimensional flow around a snake model: a cylinder with the same anatomically accurate cross-section. The computational domain extends in the -direction over a length of and the grid now contains about million cells (with a uniform discretization in the -direction and periodic boundary conditions; see Table II). The job was submitted to a pool of two NC24r nodes using MPI processes and GPU devices per node. The task completed time steps ( time units with a time-step size ) in about days and hours.

Fig. 5 compares the history of the force coefficients between the two- and three-dimensional configurations. The force coefficients resulting from the two-dimensional simulation of the snake are higher than those obtained for the snake cylinder; we computed a relative difference of and for the time-averaged drag and lift coefficients, respectively (see Table III). Two-dimensional computational simulations of fundamentally three-dimensional flows lead to incorrect estimation of the force coefficients, as is well known[15].
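
The comparison above involves two steps: time-averaging a force-coefficient history over a window, and computing the relative difference of the 2D value with respect to the 3D one. A minimal Python sketch of both steps follows. The trapezoidal rule is our assumption about how the average could be computed, the function names are hypothetical, and the sample history is invented for illustration.

```python
def time_average(times, values, t_start, t_end):
    """Time-average a signal over [t_start, t_end] with the trapezoidal rule."""
    points = [(t, v) for t, v in zip(times, values) if t_start <= t <= t_end]
    if len(points) < 2:
        raise ValueError("need at least two samples inside the averaging window")
    integral = sum(
        0.5 * (v0 + v1) * (t1 - t0)
        for (t0, v0), (t1, v1) in zip(points, points[1:])
    )
    return integral / (points[-1][0] - points[0][0])


def relative_difference(value_2d, value_3d):
    """Relative difference of the 2D value with respect to the 3D one."""
    return (value_2d - value_3d) / value_3d


# Invented sample history, for illustration only.
times = [0.0, 1.0, 2.0, 3.0, 4.0]
lift = [0.0, 2.0, 2.0, 2.0, 10.0]
mean_lift = time_average(times, lift, t_start=1.0, t_end=3.0)
difference = relative_difference(2.2, 2.0)
print(f"mean: {mean_lift:.3f}, relative difference: {100 * difference:.1f}%")
```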

Fig. 6 shows the instantaneous spanwise vorticity averaged along the -direction after and time units. Compared to the snapshots from the two-dimensional simulation, we observe that (1) free-shear layers roll up into vortices further away from the snake body and (2) the von Kármán street exhibits a narrower wake than in the two-dimensional simulations. We also note the presence of an unsteady recirculation region just behind the snake cylinder, and alternating regions of positive and negative cross-flow velocity showing the presence of von Kármán vortices (Fig. 7). Fig. 8 shows a side-view of the isosurfaces of the Q-criterion after time units and highlights the complexity of the three-dimensional turbulent wake generated by the snake model.
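
As a reminder of what the Q-criterion measures, the Python sketch below evaluates it at a single point from the velocity-gradient tensor, using the standard definition Q = (||W||^2 - ||S||^2) / 2, where S and W are the symmetric (strain-rate) and antisymmetric (rotation-rate) parts of the gradient. Positive values mark regions where rotation dominates strain. The function name and the two sample tensors are ours; this is not the post-processing code used to produce the figure.

```python
def q_criterion(grad_u):
    """Q-criterion from a 3x3 velocity gradient, grad_u[i][j] = d(u_i)/d(x_j)."""
    q = 0.0
    for i in range(3):
        for j in range(3):
            strain = 0.5 * (grad_u[i][j] + grad_u[j][i])    # symmetric part
            rotation = 0.5 * (grad_u[i][j] - grad_u[j][i])  # antisymmetric part
            q += 0.5 * (rotation * rotation - strain * strain)
    return q


# Solid-body rotation about the z-axis: rotation dominates, so Q is positive.
q_rotation = q_criterion([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 0.0]])
# Pure planar strain: no rotation, so Q is negative.
q_strain = q_criterion([[1.0, 0.0, 0.0], [0.0, -1.0, 0.0], [0.0, 0.0, 0.0]])
print(q_rotation, q_strain)
```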

The selection of results presented here corresponds to a typical set of experiments included in a CFD study. A comprehensive study might include several similar sets, leading to additional insights about the flow dynamics, and clues to future avenues of research. We include this selection here to represent a typical workflow, and combine our discussion with a meaningful analysis of the costs associated with running CFD studies in a public cloud.

Fig. 5: History of the force coefficients obtained with two- and three-dimensional simulations at Reynolds number for a snake cross-section with a -degree angle of attack. As expected, the two-dimensional simulations over-estimate the force coefficients of what is essentially a three-dimensional flow problem.
Case Domain Uniform region Smallest cell-width Stretching ratio Size
TABLE II: Details about the computational grids used for the snake simulations with PetIBM. Distances are expressed in terms of chord-length units.
2D () ()
TABLE III: Time-averaged force coefficients on the snake model at Reynolds number and angle of attack for the two- and three-dimensional configurations. (We average the force coefficients between and time units of flow simulation and report the relative difference of the 2D values with respect to the 3D ones.)
Fig. 6: Filled contour of the spanwise-averaged z-component of the vorticity () field after and time units of a three-dimensional flow simulation with PetIBM for the snake cylinder with a cross-section at a -degree angle of attack and Reynolds number .
Fig. 7: Filled contour of the streamwise velocity (top) and cross-flow velocity (bottom) behind the snake cylinder in the plane at in the wake of the snake cylinder with a -degree angle of attack at Reynolds number after time units of flow simulation. There are 52 contours from to . The grey area shows a projection of the snake cylinder in the plane. The solid black line defines the contour with (top) and (bottom).
Fig. 8: Lateral view of the isosurfaces of the Q-criterion () in the wake of the snake cylinder (with a -degree angle of attack) at Reynolds number . The isosurfaces are colored with the streamwise vorticity (). The figure was generated using the visualization software VisIt [16].
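
The post-processing behind Table III (a time average over a window, then the relative difference of the 2D value with respect to the 3D one) can be sketched as follows. This is only an illustration: it assumes a plain arithmetic mean over uniformly spaced samples, which may differ from the averaging used in our scripts, and the force-coefficient histories below are hypothetical values, not data from the simulations.

```python
def time_average(times, values, t_start, t_end):
    """Arithmetic mean of the samples whose time lies in [t_start, t_end].

    Assumes uniformly spaced samples (an assumption of this sketch).
    """
    window = [v for t, v in zip(times, values) if t_start <= t <= t_end]
    if not window:
        raise ValueError("no samples inside the averaging window")
    return sum(window) / len(window)


def relative_difference(value_2d, value_3d):
    """Relative difference of the 2D value with respect to the 3D one."""
    return (value_2d - value_3d) / value_3d


# Hypothetical drag-coefficient histories (NOT data from the paper).
times = [0.0, 1.0, 2.0, 3.0, 4.0]
cd_2d = [2.0, 1.2, 1.1, 1.3, 1.2]
cd_3d = [2.0, 1.0, 0.9, 1.1, 1.0]

# Skip the initial transient by starting the window at t = 1.
mean_2d = time_average(times, cd_2d, t_start=1.0, t_end=4.0)
mean_3d = time_average(times, cd_3d, t_start=1.0, t_end=4.0)
difference = relative_difference(mean_2d, mean_3d)
print(f"2D: {mean_2d:.3f}  3D: {mean_3d:.3f}  difference: {100 * difference:+.1f}%")
```

A positive relative difference means the two-dimensional run over-estimates the coefficient, which is the trend reported in Fig. 5.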

4 Cost Analysis and User Experience

For the year , Microsoft Azure granted our research lab a sponsorship (in terms of cloud credits) to run CFD simulations with our in-house software. With a “pay-as-you-go” subscription, users can immediately see how much it costs to run a scientific application on a public cloud; such information is often hidden from end users on university-managed clusters. From May to December, we spent a total of USD to run CFD simulations, including a couple dozen snake simulations. (The output of those simulations is now being processed to further analyze the complex flow dynamics generated behind the snake model.) During that period, we were charged for data management, networking, storage, bandwidth, and virtual machines (Table IV). More than of the charges incurred were for the usage of virtual machines, mainly instances from the NC-series (Table V). To run PetIBM simulations of the snake model, we used the NC24r virtual machines to get access to NVIDIA K80 GPUs and InfiniBand networking. For example, the two-dimensional snake run reported in the present article cost USD ( NC24r instances with an hourly price of USD for about hours). The three-dimensional computation cost USD ( NC24r instances for about hours). (Note that with 3-year-reserved instances, the two- and three-dimensional snake simulations would have cost only and USD, respectively.)

Service name Cost (USD) % of total cost
Data Management
Virtual Machines
TABLE IV: Charges incurred for the usage of different services on Microsoft Azure.
Instance  Cores  RAM (GiB)  Disk size (GiB)  GPU      Pay-as-you-go  Low-priority  1-year reserved  3-year reserved
NC6       6      56         340              1 x K80  0.90           0.18          0.5733           0.3996
NC12      12     112        680              2 x K80  1.80           0.36          1.1466           0.7991
NC24      24     224        1,440            4 x K80  3.60           0.72          2.2932           1.5981
NC24r*    24     224        1,440            4 x K80  3.96           0.792         2.5224           1.7578
* The NC24r configuration provides a low-latency, high-throughput network interface optimized for tightly coupled parallel applications.
TABLE V: NC series on Microsoft Azure. (Prices as of March 24, 2019, for CentOS or Ubuntu Linux Virtual Machines in the East US region.)
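
The cost of a run follows from simple arithmetic: number of instances, times hours, times hourly price. The sketch below illustrates it with the NC24r prices listed in Table V, assuming the first price given for each instance is the pay-as-you-go hourly rate in USD and the last two are the 1-year and 3-year reserved rates. The job size (`num_instances`, `hours`) is a hypothetical example, not the configuration of our runs.

```python
# Hourly prices (USD) for the NC24r instance, copied from Table V.
# Assumption: the first listed price is the pay-as-you-go rate.
NC24R_HOURLY_USD = {
    "pay-as-you-go": 3.96,
    "1-year reserved": 2.5224,
    "3-year reserved": 1.7578,
}


def job_cost(num_instances, hours, hourly_price):
    """Cost in USD of running `num_instances` nodes for `hours` hours."""
    return num_instances * hours * hourly_price


# Hypothetical job size (NOT the values used in the paper).
num_instances, hours = 2, 10.0

for plan, price in NC24R_HOURLY_USD.items():
    cost = job_cost(num_instances, hours, price)
    print(f"{plan:>16}: {cost:8.2f} USD")
```

The same function applies to any row of Table V; only the hourly price changes.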

Running CFD simulations on Microsoft Azure was a first for our research lab. As novices in cloud computing, we needed several months to become familiar with the technical vocabulary and the infrastructure of Microsoft Azure before we could submit our first simulation of the flow around a snake profile.

Command-line utilities such as Azure CLI and Batch Shipyard were of great help to create and manage resources on Microsoft Azure for executing our reproducible research workflow. (In our lab, we tend to avoid graphical user interfaces, so as to keep a trace of the commands run and make the workflow more reproducible.) Azure CLI helped us set up the Azure Batch and Storage accounts and move data between the cloud platform and our local machines. Thanks to Batch Shipyard, we did not have to dig into the software development kit of Azure Batch to use the service. Writing YAML configuration files was all we needed to do to create a pool of virtual machines with Batch Shipyard and to submit jobs to it. With Batch Shipyard, we painlessly submitted jobs to Azure Batch to run multi-instance tasks in Docker containers, all from the command-line terminal on a local machine. Note that Batch Shipyard also supports Singularity containers.
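
To give a flavor of this configuration-driven approach, the sketch below writes a minimal pool description and lists the commands one would then run. It is an incomplete, hedged illustration: the key names (`pool_specification`, `vm_size`, `vm_count`, `inter_node_communication_enabled`) and the `shipyard pool add` and `shipyard jobs add` commands are written from memory of the Batch Shipyard documentation and should be checked against it, the pool identifier and VM size string are hypothetical, and a real setup also needs credentials, a global configuration, a jobs file, and a VM image section. The file is serialized with `json` because JSON text is also valid YAML, which avoids a third-party dependency here.

```python
import json
import tempfile
from pathlib import Path


def write_pool_config(config_dir, pool_id, vm_size, num_nodes):
    """Write a minimal pool description to <config_dir>/pool.yaml.

    Key names are assumptions based on the Batch Shipyard documentation.
    """
    pool = {
        "pool_specification": {
            "id": pool_id,
            "vm_size": vm_size,
            # Dedicated nodes only: no low-priority nodes are requested.
            "vm_count": {"dedicated": num_nodes, "low_priority": 0},
            # Required for multi-instance (MPI) tasks across nodes.
            "inter_node_communication_enabled": True,
        }
    }
    path = Path(config_dir) / "pool.yaml"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(pool, indent=2) + "\n")
    return path


def shipyard_commands(config_dir):
    """Commands to create the pool, then submit the jobs (not executed here)."""
    return [
        ["shipyard", "pool", "add", "--configdir", str(config_dir)],
        ["shipyard", "jobs", "add", "--configdir", str(config_dir)],
    ]


# Hypothetical pool name and size, written to a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    pool_file = write_pool_config(tmp, "snake-pool", "STANDARD_NC24R", 2)
    print(pool_file.read_text())
    for command in shipyard_commands(tmp):
        print(" ".join(command))
```

Keeping such files under version control is what lets a reader re-create the same pool and resubmit the same jobs.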

Our in-house CFD software, PetIBM, relies on MPI to run applications on distributed-memory architectures. The Poisson system is solved on distributed GPU devices using the NVIDIA AmgX library. Thus, we used Azure instances of the NC-series, which feature a network interface for remote direct memory access (RDMA) connectivity, allowing nodes in the same pool to communicate over an InfiniBand network. As of this writing, only Intel MPI 5.x versions are supported with the Azure Linux RDMA drivers.

Microsoft Azure offers the possibility of taking advantage of surplus capacity with “low-priority” virtual machines that are substantially cheaper than “dedicated” ones (Table V). For example, the low-priority NC24r instance costs USD per hour, times cheaper than its dedicated counterpart. The reader should keep in mind that low-priority virtual machines may not be available for allocation or may be preempted at any time. Thus, low-priority instances should be avoided for long-running MPI jobs in pools where inter-node communication is enabled. We only used dedicated virtual machines to run our CFD simulations. Moreover, our job tasks used a shared filesystem (GlusterFS on compute nodes) for I/O operations, and Batch Shipyard would fail to create the pool if low-priority nodes were requested.
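
The price gap between the two kinds of virtual machines can be read off Table V. The short sketch below computes it for each NC instance, assuming that the first two prices listed in each row of the table are the dedicated and low-priority hourly rates (in USD), in that order.

```python
# (dedicated, low-priority) hourly prices in USD, read from Table V.
# Assumption: these are the first two price columns of the table.
PRICES_USD_PER_HOUR = {
    "NC6": (0.90, 0.18),
    "NC12": (1.80, 0.36),
    "NC24": (3.60, 0.72),
    "NC24r": (3.96, 0.792),
}


def savings_factor(instance):
    """How many times cheaper a low-priority node is than a dedicated one."""
    dedicated, low_priority = PRICES_USD_PER_HOUR[instance]
    return dedicated / low_priority


for instance in PRICES_USD_PER_HOUR:
    factor = savings_factor(instance)
    print(f"{instance}: low-priority is about {factor:.1f}x cheaper")
```

The saving only materializes if the job survives preemption, which is why it does not apply to our tightly coupled MPI runs.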

Microsoft Azure implements quotas and limits on resources with the Azure Batch service, such as the maximum number of dedicated cores that can be used in a certain region. To be able to run our simulations, we had to contact Azure Support through Azure Portal to request a quota increase for a given instance type in a specific region and for a given subscription. We had to go through this process at least five times during our sponsorship period. Readers should be aware of these quotas and limits before scaling up workloads on Azure Batch.
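
Before requesting a pool, it is worth checking the number of dedicated cores it needs against the quota of the subscription. A minimal sketch, using the core counts per instance from Table V and a hypothetical quota value (the actual quota depends on the subscription and region):

```python
# Cores per instance for the NC series, from Table V.
CORES_PER_INSTANCE = {"NC6": 6, "NC12": 12, "NC24": 24, "NC24r": 24}


def required_cores(instance, num_instances):
    """Total dedicated cores consumed by a pool of identical instances."""
    return CORES_PER_INSTANCE[instance] * num_instances


def fits_quota(instance, num_instances, core_quota):
    """True if the pool stays within the dedicated-core quota."""
    return required_cores(instance, num_instances) <= core_quota


# Hypothetical quota (NOT an actual Azure default).
core_quota = 48
for num_instances in (1, 2, 4):
    needed = required_cores("NC24r", num_instances)
    status = "ok" if fits_quota("NC24r", num_instances, core_quota) else "request an increase"
    print(f"{num_instances} x NC24r needs {needed} cores: {status}")
```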

5 Conclusions

Exploring public cloud services as an alternative (or complementary) solution to university-managed HPC clusters for running CFD simulations with our in-house software, we used Microsoft Azure to generate reproducible computational results. Our reproducible workflow makes use of Docker containers to faithfully port the runtime environment of our CFD application to Azure, and of the command-line utilities Azure CLI and Batch Shipyard to run multi-instance tasks within Docker containers on the Azure Batch platform. We ran CFD simulations on instances of the NC-series that feature NVIDIA K80 GPU devices and have access to an InfiniBand network. Latency and bandwidth micro-benchmarks show that our university-managed HPC system and Microsoft Azure can deliver similar performance. Thanks to a Microsoft Azure sponsorship, we were able to run dozens of CFD simulations of a gliding-snake model. Our study of the three-dimensional flow around a cylinder with an anatomically accurate cross-section of the flying snake is ongoing. We plan to analyze the output data from these runs and publish the results in a later paper.

In this work, we show that public cloud resources can today deliver performance similar to that of a university-managed cluster, and thus can be regarded as a suitable solution for research computing. But in addition to performance, researchers also worry about cost. We share the actual costs incurred to run two- and three-dimensional CFD simulations using cloud services with a pay-as-you-go model, and report on the reduced costs possible with reserved instances. Universities (and possibly funding agencies) may obtain even more favorable pricing through bids or medium-term contracts. Researchers thinking of adding cloud computing to proposal budgets might encounter other barriers, however. For example, some universities exempt large equipment purchases from indirect (facilities and administrative, F&A) costs, while they may add these overhead rates to cloud purchases. Until these internal policies are adjusted, cloud computing may not be adopted widely.

In a renewed effort to make our research transparent and reproducible by others, we maintain in a public version-controlled repository all the files necessary to re-run the examples highlighted in this paper. The repository is found at Our expanded reproducibility packages contain input files, Batch Shipyard configuration files, command-line instructions to set up and submit jobs to Microsoft Azure, as well as the post-processing scripts to reproduce the figures of this paper. Although they are not required to reproduce the computations, the Dockerfiles to build Docker images are also made available. These files contain details about all the libraries needed to create the computational runtime environment. We even include a Dockerfile to reproduce the local environment with the visualization tools needed for post-processing. In addition, we have deposited in the Zenodo service all secondary data required to reproduce the figures of the paper without running the simulations again [17].

Acknowledgements

This work was made possible by a sponsorship from the Microsoft Azure for Research program and was supported by NSF Grant No. CCF-1747669. We would like to thank Kenji Takeda, Fred Park, and Joshua Poulson of Microsoft for their help.

References

  • [1] D. L. Donoho, A. Maleki, I. U. Rahman, M. Shahram, and V. Stodden, “Reproducible research in computational harmonic analysis,” Computing in Science & Engineering, vol. 11, no. 1, 2009.
  • [2] L. A. Barba, “Terminologies for reproducible research,” arXiv preprint arXiv:1802.03311, 2018.
  • [3] M. Schwab, N. Karrenbach, and J. Claerbout, “Making scientific computations reproducible,” Computing in Science & Engineering, vol. 2, no. 6, pp. 61–67, 2000.
  • [4] R. D. Peng, “Reproducible research in computational science,” Science, vol. 334, no. 6060, pp. 1226–1227, 2011.
  • [5] National Academies of Sciences, Engineering, and Medicine, Open Source Software Policy Options for NASA Earth and Space Sciences. Washington, DC: The National Academies Press, 2018.
  • [6] C. Boettiger, “An introduction to Docker for reproducible research,” ACM SIGOPS Operating Systems Review, vol. 49, no. 1, pp. 71–79, 2015.
  • [7] O. Mesnard and L. A. Barba, “Reproducible and replicable computational fluid dynamics: it’s harder than you think,” Computing in Science & Engineering, vol. 19, no. 4, p. 44, 2017.
  • [8] C. Freniere, A. Pathak, M. Raessi, and G. Khanna, “The feasibility of Amazon’s cloud computing platform for parallel, GPU-accelerated, multiphase-flow simulations,” Computing in Science & Engineering, vol. 18, no. 5, p. 68, 2016.
  • [9] S. Balay, S. Abhyankar, M. F. Adams, J. Brown, P. Brune, K. Buschelman, L. Dalcin, A. Dener, V. Eijkhout, W. D. Gropp, D. Kaushik, M. G. Knepley, D. A. May, L. C. McInnes, R. T. Mills, T. Munson, K. Rupp, P. Sanan, B. F. Smith, S. Zampini, H. Zhang, and H. Zhang, “PETSc users manual,” Argonne National Laboratory, Tech. Rep. ANL-95/11 - Revision 3.10, 2018.
  • [10] J. J. Socha, “Gliding flight in Chrysopelea: Turning a snake into a wing,” Integrative and Comparative Biology, vol. 51, no. 6, pp. 969–982, 2011.
  • [11] P.-Y. Chuang, O. Mesnard, A. Krishnan, and L. A. Barba, “PetIBM: toolbox and applications of the immersed-boundary method on distributed-memory architectures,” The Journal of Open Source Software, vol. 3, no. 25, p. 558, May 2018.
  • [12] J. B. Perot, “An analysis of the fractional step method,” Journal of Computational Physics, vol. 108, no. 1, pp. 51–58, 1993.
  • [13] R.-Y. Li, C.-M. Xie, W.-X. Huang, and C.-X. Xu, “An efficient immersed boundary projection method for flow over complex/moving boundaries,” Computers & Fluids, vol. 140, pp. 122–135, 2016.
  • [14] P.-Y. Chuang and L. A. Barba, “AmgXWrapper: An interface between PETSc and the NVIDIA AmgX library,” The Journal of Open Source Software, vol. 2, no. 16, Aug. 2017.
  • [15] R. Mittal and S. Balachandar, “Effect of three-dimensionality on the lift and drag of nominally two-dimensional cylinders,” Physics of Fluids, vol. 7, no. 8, pp. 1841–1865, 1995.
  • [16] H. Childs, E. Brugger, B. Whitlock, J. Meredith, S. Ahern, D. Pugmire, K. Biagas, M. Miller, C. Harrison, G. H. Weber, H. Krishnan, T. Fogal, A. Sanderson, C. Garth, E. W. Bethel, D. Camp, O. Rübel, M. Durant, J. M. Favre, and P. Navrátil, “VisIt: An End-User Tool For Visualizing and Analyzing Very Large Data,” in High Performance Visualization–Enabling Extreme-Scale Scientific Insight, Oct. 2012, pp. 357–372.
  • [17] O. Mesnard and L. A. Barba, “Cloud-repro: Reproducible workflow on a public cloud for computational fluid dynamics,” Zenodo archive, April 2019.