Virtualizing the Stampede2 Supercomputer with Applications to HPC in the Cloud
Methods developed at the Texas Advanced Computing Center (TACC) are described and demonstrated for automating the construction of an elastic, virtual cluster emulating the Stampede2 high performance computing (HPC) system. The cluster can be built and/or scaled in a matter of minutes on the Jetstream self-service cloud system and shares many properties of the original Stampede2, including: i) common identity management, ii) access to the same file systems, iii) an equivalent software application stack and module system, and iv) a similar job scheduling interface via Slurm.
We measure time-to-solution for a number of common scientific applications on our virtual cluster against equivalent runs on Stampede2 and develop an application profile where performance is similar or otherwise acceptable. For such applications, the virtual cluster provides an effective form of “cloud bursting” with the potential to significantly improve overall turnaround time, particularly when Stampede2 is experiencing long queue wait times. In addition, the virtual cluster can be used for test and debug without directly impacting Stampede2. We conclude with a discussion of how science gateways can leverage the TACC Jobs API web service to incorporate this cloud bursting technique transparently to the end user.
As the demand for HPC continues to grow, centers must cater to a wide range of researchers bringing forward more numerous and challenging workflows. The ability to respond automatically and intelligently when computing needs outstrip a current system’s supply provides a pathway towards reducing overall time-to-solution. With multiple on-premises production systems available at TACC, we are investigating the potential for automatically offloading work from one system to another in ways that are largely transparent to the user.
In particular, this work catalogues and demonstrates a set of necessary components for the creation of a virtualized HPC execution environment on the Jetstream cloud system. Ultimately, this setup emulates the environment found on the Stampede2 HPC system, allowing for on-demand overflow capacity with effectively no user-level adaptation required.
Although the concept of virtualization has been around for decades (Reuther et al., 2012), HPC virtualization for industrial and scientific research has gained significant attention recently. In a case study, Huang et al. (Huang et al., 2006) describe the benefits of HPC virtualization. Virtualization can effectively address the issues of system management and allow scalability by cloud bursting. System administrators can easily configure the required runtime setup and spin up VMs to run various applications. Hardware upgrades and failures on large scale systems can be handled gracefully, using a checkpoint/restart mechanism. Also, operating systems and runtime environments can be customized extensively to gain optimal performance, which benefits HPC applications.
Walters et al. (Walters et al., 2008) discuss three major categories of virtualization: full virtualization, paravirtualization, and operating system level virtualization. Their work evaluates the suitability of these techniques for industry standard VMs (VMWare Server, Xen, and OpenVZ). Their test setup evaluates the performance of various VMs running HPC applications in terms of file reads/writes, network utilization, and other scientific benchmarks such as symmetric multiprocessor performance and MPI scalability. Although none of the VMs exactly matches the performance of the base system, the authors point out areas where the systems perform well and where there is scope for improvement.
Guo et al. (Guo et al., 2014) describe a predictive cost model that optimizes cloud bursting to a remote cluster. Applications that run in one or more VMs in an enterprise’s private cloud can be moved to a public cloud when more resources are required. Their predictive model uses integer linear programming and heuristics to determine when and which VMs to move. An important consideration in their strategy is how to manage the movement of large VMs and other data across low bandwidth networks. The problem addressed in this paper is how to achieve reasonable time-to-solution for HPC batch jobs when the overflow cloud resources are colocated with the HPC system and data movement is not a primary consideration.
Multiple studies analyze the feasibility of running computationally intensive scientific applications on commodity clouds (He et al., 2010; Rehr et al., 2010), such as Amazon EC2 (Inc, 2006). However, not all public clouds are capable of running HPC applications, due to network latency or the lack of a fast interconnect between the virtual machines. Even so, some previous experiments obtained satisfactory results when HPC applications were run on Amazon EC2 (He et al., 2010). Unlike these studies, in this paper we chose to run HPC applications by virtualizing the Stampede2 system (Stanzione et al., 2017), which has the proven capability to successfully run hundreds of compute-intensive scientific applications.
This paper outlines a method for constructing a virtual HPC cluster in an elastic cloud environment that uses disparate hardware and system management techniques when compared with the original HPC system. The method addresses issues common to any HPC virtualization effort and explains the particular choices tailored for TACC’s Stampede2 system. Initial experiments comparing HPC and cloud time-to-solution for some common scientific applications provide proof-of-concept that cloud bursting HPC workloads can be practical and effective. Moreover, this work demonstrates how a job scheduler or middleware such as Agave can be used to tie HPC and cloud environments together seamlessly.
The rest of this paper is organized as follows: Section 2 provides implementation details for our setup, including the system configuration, virtualization techniques, and platform details. Section 3 describes the experimental setup and time-to-solution results for the HPC and cloud systems. In Section 4, we conclude this paper and discuss scope for future work.
To provide a virtual extension to existing HPC infrastructure, it is necessary to identify core software and hardware components that allow for a specific design level of interoperability. The tighter the integration, the more seamless the experience of the intended audience. This section introduces key components utilized in and tailored for both the Stampede2 HPC system and the Jetstream self-service cloud system. Additional components crafted specifically for this work that ultimately provide a consistent software experience across systems are also discussed.
2.1. Basic System Configuration
System management at large scale has two main thrusts: provisioning and change management. Together, these two applied concepts are responsible for maintaining the state of the cluster. Generally, the provisioning step selects resources from a pool of network servers to load the appropriate software (including operating system, device drivers, middleware, and basic applications), customize unique network information, and associate storage resources. Once a server or VM is provisioned, a change management application functions to provide incremental software updates beyond the initial state of the machine.
At its heart, Stampede2 uses Cobbler and LosF (McLay et al., 2011; DeHaan, 2018b; Schulz, 2017) for its provisioning and change management systems while Jetstream utilizes a combination of OpenStack and Ansible (Sefraoui et al., 2018; DeHaan, 2018a). The divergence in system administration approaches between these systems rules out the more natural extension of simply adding compute resources to an existing server pool.
At a hardware level, Stampede2 and Jetstream have little in common. For the purposes of this discussion, Stampede2 consists of three different node classes: master, compute, and login. In the compute class, there are 4200 single socket Intel Xeon Phi 7250 (KNL) nodes with 16GB of MCDRAM plus 96GB DDR4 RAM. In addition, there are 1736 2-socket Intel Xeon Platinum 8160 (SKX) nodes with 192GB of DDR4 RAM. These nodes are connected via a 100Gb/s Intel Omni-Path (OPA) network in a 7:5 oversubscribed fat tree topology. The front-facing login node class consists of 8 2-socket Intel Xeon Gold 6132 (SKX) nodes with 96GB DDR4 RAM each connected into the OPA network and serving the outside world via two network-bonded 10 gigabit Ethernet connections. The master node is a single 2-socket Intel Xeon CPU E5-2680 v4 (BDW) node with 96GB of DDR4 RAM also connected to the OPA network but isolated from the outside. Connecting all nodes, three main parallel Lustre file systems referred to as /home, /work, and /scratch can provide single-job aggregate performance write I/O of 300GB/s across 512 nodes with a total capacity of more than 30PB.
Jetstream’s hardware at TACC consists of 320 Intel Xeon E5-2680 v3 (HSW) compute nodes with 128GB DDR4 RAM that are connected at 10Gb/s to a 40Gb/s Ethernet backbone, with a shared external connection of 120Gb/s. Each node contains 2TB of local storage as well as access to a 960TB network attached global Ceph storage system. An additional factor unique to Jetstream is that multiple virtual machines may be colocated on each compute node, necessitating the sharing of resources including the CPU, memory, and network interface card.
At a basic software level, both Jetstream and Stampede2 are capable of running the same Linux CentOS 7.4.1708 distribution. Stampede2 is heavily customized with over 250 staff-supported software packages in addition to several hundred more that are community-provided in the form of RPMs. These RPMs are installed via LosF in addition to the standard OS distribution RPMs provided via Cobbler. Jetstream VMs, on the other hand, start from community developed and maintained collections of RPMs that are tailored to specific functions that a user may need. Generally, the sets of RPMs tend to be close to the original OS distribution RPM set with additions made from online community repositories.
2.2. Virtualization Ingredients
For a seamless virtual extension, Jetstream needed to share or emulate a few common features of Stampede2. These include the shell start-up environment, file systems, batch scheduler, identity management, and internode communication pathways. How closely these features are reproduced on the cloud extension system dictates the difference between packaging up an autonomous unit of computing work to run on a completely generic cloud and running that same unit of work on a system that is practically indistinguishable from the original HPC system.
Identity management for both systems can be synchronized through the use of TACC’s LDAP directory service to ensure that users exist with the same name and group structures on both machines. This is facilitated through administrative queries to a set of servers that maintain up-to-date records site-wide.
The ability to share user and group information between systems allowed for a straightforward mechanism to present Stampede2 file systems in a logical and secure manner on Jetstream. For the purposes of this demonstration, the /home, /work, and /scratch Lustre file systems were re-exported via NFS as well as the locations of the majority of staff-supported software in the local /opt directory. Jetstream VMs were then free to mount these file systems upon start-up to provide a functional, if not as performant, approach to interacting natively with files and applications on Stampede2.
Files that customize and present the user with a unique shell environment upon login, as well as other crucial system configuration files, also need to be transferred to the cloud extension system. These files, which constitute the TACC user environment, have been honed over several generations of compute systems to provide a scalable and flexible platform to meet a wide range of computing needs. A minimum core of RPMs was identified and outfitted for use on a generic cloud platform and served out from a custom-built Yum repository. This configuration allows any node or VM with a connection to the outside world to import this TACC repository and its associated signed GPG key and install these RPMs as part of a change management step.
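As a rough illustration of the repository import step, the sketch below renders a Yum `.repo` stanza with GPG signature checking enabled. The repository ID, URL, and key location are hypothetical placeholders, not the actual TACC repository details, which the text does not give.

```python
import configparser
import io


def render_repo_file(repo_id: str, name: str, baseurl: str, gpgkey: str) -> str:
    """Render a Yum .repo stanza with GPG signature checking enabled."""
    # interpolation=None keeps any '%' characters in URLs from being
    # interpreted by configparser.
    parser = configparser.ConfigParser(interpolation=None)
    parser[repo_id] = {
        "name": name,
        "baseurl": baseurl,
        "enabled": "1",
        "gpgcheck": "1",  # refuse unsigned or wrongly signed RPMs
        "gpgkey": gpgkey,
    }
    buffer = io.StringIO()
    parser.write(buffer, space_around_delimiters=False)
    return buffer.getvalue()


if __name__ == "__main__":
    # All values below are hypothetical placeholders.
    print(render_repo_file(
        repo_id="example-user-env",
        name="Example HPC user environment",
        baseurl="https://repo.example.org/centos7/x86_64",
        gpgkey="https://repo.example.org/RPM-GPG-KEY-example",
    ))
```

A file like this would typically be placed under `/etc/yum.repos.d/` by the change management tool before the RPMs are installed.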
The other key set of RPMs that were presented from the TACC repository included the basic components of the Slurm workload manager used on Stampede2 for batch scheduling of users’ compute jobs (Yoo et al., 2018). The three node classes discussed in Section 2.1 were configured via Ansible to support the Slurm controller host, Slurm worker hosts and the job submission host on the Jetstream master node VM, compute node VMs, and login node VM, respectively. The Jetstream Slurm controller was configured to tap into a common Slurm database housed within the Stampede2 system. This allowed for inquiries and submission requests to pass from one system to another without the need for any other intermediary service for communication.
Typically, a user’s job on an HPC system will take advantage of at least one of three parallel paradigms (multiprocessing, multithreading, or job packing) to utilize as much of the available resources as possible. For workloads that take advantage of internode communication, usually via an MPI resource manager, information needs to be delivered to appropriately allocate tasks across a dynamic set of compute hosts. For an HPC system, this is typically handled internally and automatically via specialized logic in conjunction with information from the workload manager, such that a user need not be concerned with setting parameters beyond the number of compute nodes and MPI processes to be used. Thanks to built-in features of the MPI libraries from Intel and MVAPICH, it is possible to bootstrap from the Slurm workload manager, using its information and launch mechanism, such that a user may only need to issue an “mpirun” on either the Stampede2 or Jetstream system.
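To make the kind of information passed from the workload manager concrete, the following sketch expands the compact task layout that Slurm exports to a job in the `SLURM_TASKS_PER_NODE` environment variable (for example `2(x3),1`). It only illustrates the data an MPI launcher can draw on; it is not the launch logic used on Stampede2 or Jetstream, and the function name is hypothetical.

```python
import os
import re
from typing import List


def expand_tasks_per_node(spec: str) -> List[int]:
    """Expand a spec such as '2(x3),1' into one task count per node."""
    counts: List[int] = []
    for item in spec.split(","):
        match = re.fullmatch(r"(\d+)(?:\(x(\d+)\))?", item.strip())
        if match is None:
            raise ValueError(f"unrecognized layout item: {item!r}")
        tasks = int(match.group(1))
        repeat = int(match.group(2) or 1)  # '(xN)' means N consecutive nodes
        counts.extend([tasks] * repeat)
    return counts


if __name__ == "__main__":
    # Inside a Slurm job the variable is set by the scheduler; the fallback
    # value here is only for running the sketch outside of a job.
    layout = expand_tasks_per_node(os.environ.get("SLURM_TASKS_PER_NODE", "2(x8)"))
    print(f"{len(layout)} nodes, {sum(layout)} MPI tasks")
```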
2.3. Virtual Cluster Creation
To begin, a persistent master node VM was created to serve as the orchestration point for the rest of the virtual cluster. Other basic configurations included setting up a virtual private network and adding a pair of shared SSH keys automatically upon creation of any new VMs. A non-production node on Stampede2 was designated for serving out the file systems via NFS while the site firewall was configured to allow traffic for this as well as for the Slurm controller to interact with the Stampede2 Slurm database.
Next, a dedicated and persistent login node VM was created as a front end for users to interact with the virtual cluster. With file systems mounted, RPMs installed, and the appropriate Slurm configuration in place, this VM would closely mimic the environment that users would experience on a Stampede2 login node.
From there, compute node VMs ranging from 1 virtual core to 44 could be instantiated via OpenStack commands issued from the master node. As compute nodes were instantiated, each IP was added to an Ansible host inventory to be brought up and ready for use. Updates were first applied, followed by the RPMs in the TACC repository. Next, the Stampede2 file systems were mounted, Slurm was installed and configured, and authorization via LDAP completed the process.
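The inventory step described above can be sketched as follows: each new VM address is validated, de-duplicated, and written into an INI-style Ansible inventory group. The group name, addresses, and remote user are hypothetical; the text does not specify how the actual inventory is laid out.

```python
import ipaddress
from typing import Iterable, List


def render_inventory(group: str, addresses: Iterable[str], remote_user: str) -> str:
    """Render an INI-style Ansible inventory group from VM IP addresses."""
    hosts: List[str] = []
    for address in addresses:
        # ip_address() raises ValueError on malformed input, so a bad entry
        # is caught before any playbook runs against it.
        normalized = str(ipaddress.ip_address(address))
        if normalized not in hosts:
            hosts.append(normalized)
    lines = [f"[{group}]"]
    lines += [f"{host} ansible_user={remote_user}" for host in hosts]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical private addresses for two newly instantiated compute VMs.
    print(render_inventory("compute", ["10.0.0.5", "10.0.0.6"], "centos"))
```

A playbook run against this group would then apply the steps in the order given in the text: updates, TACC repository RPMs, file system mounts, Slurm configuration, and LDAP authorization.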
2.4. Job Submission
With compute nodes up and running, a user logged into any of the Stampede2 login nodes via SSH could request that a job be directed to either the Jetstream or Stampede2 systems with one additional Slurm command-line option. Similarly, a user logged into the Jetstream login node VM could request that a job either run locally or on Stampede2. A third option, Agave, which provides an API and web portal front end to researchers, was also configured for initial testing and future cloud expansion endeavours.
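The text does not name the additional command-line option. One plausible candidate, assumed here, is Slurm's `--clusters` (`-M`) option to `sbatch`, which directs a submission to a named cluster sharing the same accounting database. The sketch below only builds the command line; the cluster name and script name are hypothetical.

```python
import shlex
from typing import List, Optional


def build_sbatch_command(script: str, nodes: int, ntasks: int,
                         cluster: Optional[str] = None) -> List[str]:
    """Build an sbatch argument list, optionally targeting another cluster."""
    command = ["sbatch", f"--nodes={nodes}", f"--ntasks={ntasks}"]
    if cluster is not None:
        # Assumption: --clusters is the "one additional option" in the text.
        command.append(f"--clusters={cluster}")
    command.append(script)
    return command


if __name__ == "__main__":
    local = build_sbatch_command("job.slurm", nodes=8, ntasks=16)
    remote = build_sbatch_command("job.slurm", nodes=8, ntasks=16,
                                  cluster="jetstream")  # hypothetical name
    for command in (local, remote):
        print(" ".join(shlex.quote(part) for part in command))
```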
Concretely, Agave is a set of containerized servers that support a REST API (Dooley et al., 2012; Dooley, 2018). It supports computationally intensive research by running and monitoring scientific applications on behalf of a user while recording all inputs, outputs, environment settings, software versions, and hardware used by a job to support experimental traceability and reproducibility. The main components in Agave are shown in Table 1.
Table 1. Main components in Agave.
|Component|Description|
|---|---|
|Execution system|System used for computation where application binaries can be run|
|Storage system|Data repository that can be accessed through Agave for I/O|
|Application|Executable code invoked by Agave on a specific execution system|
|Job|Runtime instance of an application with parameters|
Existing systems can be defined in Agave as execution or storage systems. Credentials provided to Agave are used to transfer data and run commands on those systems, usually using a protocol like SSH. Web portals such as Designsafe (Rathje et al., 2017; Rathje et al., 2018) and CyVerse (Merchant et al., 2018) are built on top of Agave to provide researchers a customized graphical user interface to HPC systems.
Table 2. Applications chosen, with version and description.
|Application|Version|Description|
|---|---|---|
|GROMACS|2016.4|Molecular dynamics simulation for biochemical molecules|
|NAMD|2.10|Molecular dynamics simulation of large biomolecular systems|
|OpenSeesSP|2.5.0|Earthquake simulation and modeling for structural and geotechnical systems|
|WRF|3.6.1|Mesoscale numerical weather prediction system|
Table 3. Average run times on Stampede2 and Jetstream.
|Application|Stampede2 avg. run time (H:MM:SS)|Stampede2 CPU|Stampede2 nodes|Stampede2 runs|Jetstream avg. run time (H:MM:SS)|Jetstream CPU|Jetstream nodes|Jetstream runs|
|---|---|---|---|---|---|---|---|---|
|NAMD|0:02:40|16|8|3|0:03:58|16|8|3|
|OpenSeesSP|0:03:46|1|1|3|0:06:43|1|1|3|
Several applications were chosen to run on the Jetstream cloud extension system that are regularly executed on Stampede2 and may be flexibly scaled by the number of processors used in computation. Table 2 provides the applications chosen along with version information and a brief description of each. GROMACS (Abraham et al., 2015), NAMD (Phillips et al., 2005), OpenSeesSP (McKenna et al., 2018), and WRF (Skamarock et al., 2005) were chosen because input data and historical information were readily available. All applications were launched via the Slurm sbatch command on both Stampede2 and Jetstream. NAMD and OpenSeesSP were additionally launched with Agave.
To gauge the practicability of seamlessly offloading a typical HPC workload to a cloud system, the same application binaries were run on both Stampede2 and Jetstream. The applications themselves are built as multi-architecture binaries that allow for code branching depending on what level of vectorization instruction is supported on the underlying chip. AVX2 instructions serve as the baseline to support the Haswell architecture on Jetstream while the newer Skylake and KNL architectures of Stampede2 can take advantage of AVX-512 instructions.
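The branching idea can be illustrated with the CPU feature flags that Linux reports in `/proc/cpuinfo` (`avx2`, `avx512f`). In the real applications this dispatch happens inside the compiled binary. The sketch below only mirrors the decision logic, and its function names are hypothetical.

```python
from typing import Iterable, Set


def read_cpu_flags(path: str = "/proc/cpuinfo") -> Set[str]:
    """Return the feature flags of the first CPU listed by the Linux kernel."""
    with open(path) as handle:
        for line in handle:
            if line.startswith("flags"):
                return set(line.split(":", 1)[1].split())
    return set()


def select_code_path(cpu_flags: Iterable[str]) -> str:
    """Pick the widest vector code path the CPU supports, AVX2 as baseline."""
    flags = set(cpu_flags)
    if "avx512f" in flags:  # Skylake and KNL expose AVX-512 Foundation
        return "avx512"
    if "avx2" in flags:     # Haswell baseline
        return "avx2"
    raise RuntimeError("CPU lacks the AVX2 baseline required by the binary")


if __name__ == "__main__":
    try:
        print(select_code_path(read_cpu_flags()))
    except (OSError, RuntimeError) as error:
        print(f"cannot select a code path here: {error}")
```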
The input cases used are described briefly below: For GROMACS, a pure water case was simulated for 1.536 million atoms in the isothermal isobaric (NpT) ensemble at 300K and 1atm with initial coordinates and system parameters coming from the application website for twenty thousand time steps. For NAMD, the APOA1 case is a 92 thousand atom Particle Mesh Ewald (PME) molecular dynamics simulation for three thousand time steps of Apolipoprotein A-1 which is the primary component of the high-density lipoprotein cholesterol molecule. For OpenSeesSP, the input case conducts transient load analysis of a two-dimensional structure over twenty thousand time steps. For WRF, the input case represents a fixed grid weather forecast at 12km resolution over the continental United States for a 24 hour period.
In this work, we demonstrate the feasibility of HPC workload offload to a cloud extension system. As such, the performance characteristics of interest focus on reasonable relative time-to-solution when comparing systems. Enough runs to see stable execution times for a handful of canonical input examples were performed on both systems, launching via Slurm and Agave. Results are displayed in Table 3. The Stampede2 runs were conducted on the Skylake architecture and generally outperform runs conducted on Jetstream’s Haswell architecture. Eight Jetstream compute node VMs were instantiated with two virtual cores and 4GB of memory each. Compared runs were conducted with the same number of MPI processes per node with no explicit affinity settings.
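The relative time-to-solution can be worked out directly from the average run times in Table 3. The sketch below assumes that the shorter time in each row is the Stampede2 figure and the longer one the Jetstream figure, consistent with the statement that Stampede2 generally outperforms Jetstream. The helper names are hypothetical.

```python
def to_seconds(h_mm_ss: str) -> int:
    """Convert an 'H:MM:SS' string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in h_mm_ss.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def slowdown(hpc_time: str, cloud_time: str) -> float:
    """Ratio of cloud run time to HPC run time (greater than 1 means slower)."""
    return to_seconds(cloud_time) / to_seconds(hpc_time)


if __name__ == "__main__":
    # Average run times as listed in Table 3: (assumed Stampede2, Jetstream).
    table3 = {
        "NAMD": ("0:02:40", "0:03:58"),
        "OpenSeesSP": ("0:03:46", "0:06:43"),
    }
    for application, (hpc, cloud) in table3.items():
        print(f"{application}: {slowdown(hpc, cloud):.2f}x")
```

Under that assumption, the NAMD case runs roughly 1.5 times slower on the cloud system and the OpenSeesSP case roughly 1.8 times slower.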
[Table 4: median queue wait times on Stampede1, by requested time (Req. Time) and requested node count.]
Both NAMD and OpenSeesSP were launched directly with Slurm and through Agave’s job submission REST API with no difference in run times. Ultimately, Agave launches a job through Slurm’s sbatch command after configuring a job’s inputs, outputs and other miscellaneous settings. The Agave REST interface hides the details of different HPC schedulers and makes those schedulers accessible from web applications.
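Because an Agave application is bound to a specific execution system (Table 1), targeting the cloud instead of the HPC system can amount to submitting the same job request against a different application ID. The sketch below builds two such request bodies. The field names follow our understanding of the Agave v2 jobs request format and should be checked against the platform documentation, and the application IDs are hypothetical.

```python
import json
from typing import Dict


def build_job_request(name: str, app_id: str, node_count: int,
                      processors_per_node: int, max_run_time: str) -> Dict[str, object]:
    """Build a JSON-serializable job request body (field names are assumed)."""
    return {
        "name": name,
        "appId": app_id,  # selects the application and thus the execution system
        "nodeCount": node_count,
        "processorsPerNode": processors_per_node,
        "maxRunTime": max_run_time,
        "archive": False,
    }


if __name__ == "__main__":
    # Hypothetical application IDs registered for each execution system.
    for app_id in ("namd-stampede2-2.10", "namd-jetstream-2.10"):
        request = build_job_request("apoa1", app_id, 8, 2, "00:10:00")
        print(json.dumps(request, indent=2))
```

The two requests differ only in `appId`, which is the sense in which the cloud extension looks like just another Slurm system to a web portal.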
In this paper, we have briefly described the components needed to construct a Jetstream cloud extension that closely emulates a Stampede2 HPC execution environment. Because a portion of the Jetstream and Stampede2 systems are colocated, we were able to exercise administrative privilege to share file systems, network configurations, identity management, and other facility resources that would be more difficult or intractable to share when interacting with a more general cloud infrastructure. Both systems use the same job scheduler, shell environment, and application stack, which helps to provide a consistent user experience.
With this setup, migrating applications from Stampede2 to Jetstream requires much less work for the user than migrating to an off-site, public cloud. Input, output, application, and library directories are already mounted and ready for use. This also means that VMs are comparatively lightweight in their instantiation, minimizing initialization times. File system mounting and network locality further help to avoid potentially expensive application and data transfer costs.
The Agave middleware layer provided a high-level interface from which users could launch applications on Stampede2 and on the Jetstream cloud with no additional timing overhead. When interacting with Agave’s API, the Jetstream cloud extension is simply another HPC system running Slurm; no additional customization was necessary. This submission technique is an effective mechanism for web portals such as science gateways to dynamically leverage available computing resources.
Four common HPC applications were chosen as part of an initial demonstration for the HPC cloud extension. Multi-architecture application binaries were built to take advantage of the heterogeneous instruction sets of the different systems. Ultimately, time-to-solution was the metric used to establish viability of this experiment. As expected, application runs on the cloud system result in slower but still acceptable solution times, due largely to underlying hardware differences. Thus, when HPC queue wait times are long, offloading work to the cloud can both decrease any backlog on the HPC system and improve end user response time.
4.1. Future Work
The results reported in this paper establish the efficacy of on-premises virtualization of HPC systems. Yet, this drive toward automatic cloud bursting raises a number of interesting questions. For instance, more experimentation is needed to determine what policies should be in place to decide which applications should even be considered for cloud bursting. How well would massively parallel computations involving hundreds or even thousands of processors run in this cloud environment? What about offloading jobs involving I/O heavy workloads? Moreover, is there a way to statically qualify or disqualify an application from being considered for cloud execution by inspecting its code, data, and dependencies?
One key factor in deciding whether to move an HPC job to the cloud is whether time-to-solution is improved. If the amount of time between job submission and the availability of the job’s results decreases, then, all things being equal, moving to the cloud is beneficial. The time a job waits in the queue can be a substantial percentage of the overall time-to-solution. It’s been reported that in one national laboratory, wait times may be up to four times longer than execution times (Bicer et al., 2011). Table 4 suggests the median wait times for the retired Stampede1 system tend to be significantly lower than this figure in most cases. While queue wait times follow a heavily skewed distribution towards lower values, there still remains the potential for interacting with the job scheduler and/or historical data to help determine when a job may have a significant wait ahead, providing the conditions in which a cloud burst might be beneficial.
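The criterion above can be written as a comparison of wait time plus run time on each system. The sketch below is a minimal illustration with hypothetical function names, using the NAMD run times from Table 3 (160 s and 238 s, taking the shorter as the Stampede2 figure) as example inputs.

```python
def time_to_solution(wait_seconds: float, run_seconds: float) -> float:
    """Time between job submission and availability of results."""
    return wait_seconds + run_seconds


def should_burst(hpc_wait: float, hpc_run: float,
                 cloud_wait: float, cloud_run: float) -> bool:
    """True if the cloud delivers results sooner than the HPC system."""
    return time_to_solution(cloud_wait, cloud_run) < time_to_solution(hpc_wait, hpc_run)


def break_even_wait(hpc_run: float, cloud_run: float, cloud_wait: float = 0.0) -> float:
    """HPC queue wait above which bursting to the cloud becomes beneficial."""
    return cloud_wait + cloud_run - hpc_run


if __name__ == "__main__":
    hpc_run, cloud_run = 160.0, 238.0
    print("break-even HPC wait (s):", break_even_wait(hpc_run, cloud_run))
    print("burst with a 10 minute HPC wait:", should_burst(600.0, hpc_run, 0.0, cloud_run))
```

For this example the cloud run is 78 s slower, so any HPC queue wait longer than 78 s (with an idle cloud) favors bursting.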
Another consideration involves automation of the cloud bursting process. This initial implementation uses a common Slurm database and command-line flags to transfer jobs from the Stampede2 controller to the Jetstream controller. In the future, it is possible to enable Slurm’s federation process that will submit a job to all federated clusters simultaneously only to remove pending duplicates once one of the systems is able to schedule the job. Another possibility would be to construct a job submission filter either via Slurm or Agave to realize a more sophisticated predictive model. One example might be as described by Guo et al. (Guo et al., 2014) and would be able to dynamically route jobs to the cloud as HPC backlogs grow.
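The federated behaviour described above (submit everywhere, keep the copy that can be scheduled first, revoke the rest) can be modelled in a few lines. This is a toy model of the outcome, not Slurm's implementation, and the cluster names and start-time estimates are hypothetical.

```python
from typing import Dict, List, Tuple


def resolve_federated_submission(start_times: Dict[str, float]) -> Tuple[str, List[str]]:
    """Return the cluster that starts the job first and the revoked duplicates.

    start_times maps a cluster name to the time (in seconds from submission)
    at which that cluster could start its copy of the job.
    """
    if not start_times:
        raise ValueError("the job must be submitted to at least one cluster")
    # Ties are broken by cluster name so the result is deterministic.
    winner = min(start_times, key=lambda cluster: (start_times[cluster], cluster))
    revoked = sorted(cluster for cluster in start_times if cluster != winner)
    return winner, revoked


if __name__ == "__main__":
    winner, revoked = resolve_federated_submission(
        {"stampede2": 5400.0, "jetstream": 120.0})
    print("runs on:", winner, "| pending duplicates removed from:", revoked)
```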
Future work will include dynamically scaling the number of compute node VMs available based on the HPC system’s congestion and the cloud system’s current idle resource availability. Finally, adaptations and policy decisions to integrate accounting mechanisms for the two distinct systems will need to be investigated.
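One possible shape for such a scaling rule is sketched below. The policy itself, including the wait threshold and every parameter name, is a hypothetical illustration and not a design taken from this work. It grows the virtual cluster only while the HPC system is congested, and never beyond what the cloud currently has idle.

```python
def target_vm_count(pending_nodes: int, hpc_wait_seconds: float,
                    wait_threshold_seconds: float,
                    current_vms: int, idle_vm_slots: int) -> int:
    """Number of compute node VMs the virtual cluster should have.

    pending_nodes: nodes requested by queued jobs eligible for bursting.
    idle_vm_slots: additional VMs the cloud could host right now.
    """
    if min(pending_nodes, current_vms, idle_vm_slots) < 0:
        raise ValueError("counts must be non-negative")
    # Only burst when the HPC system is congested.
    wanted = pending_nodes if hpc_wait_seconds >= wait_threshold_seconds else 0
    # Never ask for more than the cloud can host on top of what already runs.
    return min(wanted, current_vms + idle_vm_slots)


if __name__ == "__main__":
    # Hypothetical snapshot: 12 nodes queued, 1 hour wait, 30 minute threshold.
    print(target_vm_count(12, 3600.0, 1800.0, current_vms=4, idle_vm_slots=5))
```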
We thank Harika Gurram for reconfiguring her production Agave definitions to allow our experiments to run.
- Abraham et al. (2015) M. J. Abraham, T. Murtola, R. Schulz, S. Páll, J. C. Smith, B. Hess, and E. Lindahl. 2015. GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers. SoftwareX 1 (Sept. 2015), 19–25. https://doi.org/10.1016/j.softx.2015.06.001
- Bicer et al. (2011) Tekin Bicer, David Chiu, and Gagan Agrawal. 2011. A framework for data-intensive computing with cloud bursting. In 2011 IEEE International Conference on Cluster Computing (CLUSTER). IEEE, Piscataway, New Jersey, US, 169–177.
- DeHaan (2018a) Michael DeHaan. 2018a. Ansible. (2018). Retrieved March 20, 2018 from https://www.ansible.com
- DeHaan (2018b) Michael DeHaan. 2018b. Cobbler. (2018). Retrieved March 20, 2018 from https://cobbler.github.io
- Dooley (2018) Rion Dooley. 2018. Agave Platform: SaaS platform for open source community. (2018). Retrieved March 20, 2018 from https://agaveapi.co
- Dooley et al. (2012) Rion Dooley, Matthew Vaughn, Dan Stanzione, Steve Terry, and Edwin Skidmore. 2012. Software-as-a-service: the iPlant foundation API. In 5th IEEE Workshop on Many-Task Computing on Grids and Supercomputers (MTAGS). IEEE, Piscataway, New Jersey, US, 8.
- Guo et al. (2014) Tian Guo, Upendra Sharma, Prashant J. Shenoy, Timothy Wood, and Sambit Sahu. 2014. Cost-Aware Cloud Bursting for Enterprise Applications. ACM Trans. Internet Techn. 13, 3 (2014), 10:1–10:24. https://doi.org/10.1145/2602571
- He et al. (2010) Qiming He, Shujia Zhou, Ben Kobler, Dan Duffy, and Tom McGlynn. 2010. Case Study for Running HPC Applications in Public Clouds. In Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing (HPDC ’10). ACM, New York, NY, USA, 395–401. https://doi.org/10.1145/1851476.1851535
- Huang et al. (2006) Wei Huang, Jiuxing Liu, Bulent Abali, and Dhabaleswar K. Panda. 2006. A Case for High Performance Computing with Virtual Machines. In Proceedings of the 20th Annual International Conference on Supercomputing (ICS ’06). ACM, New York, NY, USA, 125–134. https://doi.org/10.1145/1183401.1183421
- Inc (2006) Amazon Inc. 2006. Amazon Elastic Compute Cloud (Amazon EC2). (2006). Retrieved March 20, 2018 from http://aws.amazon.com/ec2
- McKenna et al. (2018) Frank McKenna, Michael H Scott, and Gregory L Fenves. 2018. Opensees Parallel. (2018). Retrieved March 20, 2018 from http://opensees.berkeley.edu/OpenSees/parallel/parallel.php
- McLay et al. (2011) Robert McLay, Karl W Schulz, William L Barth, and Tommy Minyard. 2011. Best practices for the deployment and management of production HPC clusters. In State of the Practice Reports. ACM, New York, NY, USA, 9.
- Merchant et al. (2018) Nirav Merchant, Eric Lyons, Stephen Goff, Matthew Vaughn, Doreen Ware, David Micklos, and Parke Antin. 2018. CyVerse: Transforming Science through Data-Driven Discovery. (2018). Retrieved March 20, 2018 from http://www.cyverse.org
- Phillips et al. (2005) James C Phillips, Rosemary Braun, Wei Wang, James Gumbart, Emad Tajkhorshid, Elizabeth Villa, Christophe Chipot, Robert D Skeel, Laxmikant Kale, and Klaus Schulten. 2005. Scalable molecular dynamics with NAMD. Journal of Computational Chemistry 26, 16 (2005), 1781–1802.
- Rathje et al. (2018) Ellen M Rathje, Clint Dawson, Jamie E Padgett, Jean-Paul Pinelli, Dan Stanzione, Ashley Adair, Pedro Arduino, Scott J Brandenberg, Tim Cockerill, Charlie Dey, et al. 2018. Designsafe-CI: A natural hazards engineering research infrastructure. (2018). Retrieved March 20, 2018 from https://www.designsafe-ci.org
- Rathje et al. (2017) Ellen M. Rathje, Clint Dawson, Jamie E. Padgett, Jean-Paul Pinelli, Dan Stanzione, Ashley Adair, Pedro Arduino, Scott J. Brandenberg, Tim Cockerill, Charlie Dey, Maria Esteva, Fred L. Haan, Matthew Hanlon, Ahsan Kareem, Laura Lowes, Stephen Mock, and Gilberto Mosqueda. 2017. DesignSafe: New Cyberinfrastructure for Natural Hazards Engineering. Natural Hazards Review 18, 3 (2017), 06017001. https://doi.org/10.1061/(ASCE)NH.1527-6996.0000246
- Rehr et al. (2010) John J Rehr, Fernando D Vila, Jeffrey P Gardner, Lucas Svec, and Micah Prange. 2010. Scientific computing in the cloud. Computing in Science & Engineering 12, 3 (2010), 34–43.
- Reuther et al. (2012) Albert Reuther, Peter Michaleas, Andrew Prout, and Jeremy Kepner. 2012. HPC-VMs: Virtual machines in high performance computing systems. (2012). https://doi.org/10.1109/HPEC.2012.6408668
- Schulz (2017) Karl W. Schulz. 2017. LosF: A Linux operating system Framework for managing HPC clusters. (2017). Retrieved March 20, 2018 from https://github.com/hpcsi/losf
- Sefraoui et al. (2018) Omar Sefraoui, Mohammed Aissaoui, and Mohsine Eleuldj. 2018. Openstack. (2018). Retrieved March 20, 2018 from https://www.openstack.org
- Skamarock et al. (2005) William C Skamarock, Joseph B Klemp, Jimy Dudhia, David O Gill, Dale M Barker, Wei Wang, and Jordan G Powers. 2005. A description of the advanced research WRF version 2. Technical Report. National Center For Atmospheric Research Boulder Co Mesoscale and Microscale Meteorology Div.
- Stanzione et al. (2017) Dan Stanzione, Bill Barth, Niall Gaffney, Kelly Gaither, Chris Hempel, Tommy Minyard, S Mehringer, Eric Wernert, H Tufo, D Panda, et al. 2017. Stampede 2: The Evolution of an XSEDE Supercomputer. In Proceedings of the Practice and Experience in Advanced Research Computing 2017 on Sustainability, Success and Impact. ACM, New Orleans, LA, USA, 15.
- Walters et al. (2008) John Paul Walters, Vipin Chaudhary, Minsuk Cha, Salvatore Guercio Jr., and Steve Gallo. 2008. A Comparison of Virtualization Technologies for HPC. In Proceedings of the 22nd International Conference on Advanced Information Networking and Applications (AINA ’08). IEEE Computer Society, Washington, DC, USA, 861–868. https://doi.org/10.1109/AINA.2008.45
- Yoo et al. (2018) Andy B Yoo, Morris A Jette, and Mark Grondona. 2018. Slurm Workload Manager. (2018). Retrieved March 20, 2018 from https://slurm.schedmd.com