HPAT: High Performance Analytics with Scripting Ease-of-Use
Big data analytics requires high programmer productivity and high performance simultaneously on large-scale clusters. However, current big data analytics frameworks (e.g. Apache Spark) have prohibitive runtime overheads since they are library-based. We introduce a novel auto-parallelizing compiler approach that exploits the characteristics of the data analytics domain such as the map/reduce parallel pattern and is robust, unlike previous auto-parallelization methods. Using this approach, we build High Performance Analytics Toolkit (HPAT), which parallelizes high-level scripting (Julia) programs automatically, generates efficient MPI/C++ code, and provides resiliency. Furthermore, it provides automatic optimizations for scripting programs, such as fusion of array operations. Thus, HPAT is 369× to 2033× faster than Spark on the Cori supercomputer and 20× to 256× faster on Amazon AWS.
Ehsan Totoni, Todd A. Anderson, and Tatiana Shpeisman (Intel Labs, USA)
Big data analytics applies advanced analytics and machine learning techniques to gain new insights from large data sets, which are gathered from sources such as sensors, web, log files, and social media. Big data analytics allows users to extract knowledge from this data and make better and faster decisions. However, supporting fast decision making necessitates rapid development of the application by domain expert programmers (i.e., high productivity) and low execution time (i.e., high performance). High productivity in the data analytics domain requires scripting languages such as MATLAB, R, Python, and Julia since they express mathematical operations naturally and are the most productive languages in practice Prechelt ; Chaves et al. . High performance requires efficient execution on large-scale distributed-memory clusters due to extreme dataset sizes.
Currently, there is a significant productivity and performance gap in the big data analytics domain. A typical High Performance Computing (HPC) approach of writing low-level codes (e.g. MPI/C++) is beyond the expertise of most data scientists and is not practical in their interactive workflows. Existing big data analytics frameworks such as Apache Hadoop White  and Apache Spark Zaharia et al.  provide better productivity for big data analytics on clusters using the MapReduce programming paradigm Dean and Ghemawat . However, this productivity comes at the cost of performance as these frameworks are orders of magnitude slower than hand-written MPI/C++ programs Brown et al. ; Reyes-Ortiz et al. ; McSherry et al. . A fundamental issue is that these frameworks are library-based, requiring a runtime system to coordinate parallel execution across all the nodes. This leads to high runtime overheads - for example, the master node managing execution is typically a bottleneck.
We propose a novel auto-parallelizing compiler approach for this domain that does not suffer from the shortcomings of prior methods Wilson et al. ; Blume et al. ; Kennedy and Kremer ; Chatterjee et al. . These efforts failed in practice due to the complexity of the problem, especially for distributed-memory architectures Eigenmann et al. ; Waheed et al. ; Hiranandani et al. ; Kennedy et al. . The main challenge is that the compiler needs to know which data structures and loops of the program should be parallelized, and how they should be mapped across the machine. Previous algorithms for automatic parallelization typically start by analyzing array accesses of the program and finding all possible array and computation distributions for loop-nests. Then, distributions are selected based on heuristics and/or cost models. A fundamental problem is that the approximate cost models and heuristics cannot find the best combination of distributions from a potentially large search space reliably, leading to significantly lower performance compared to manual parallelization. In contrast, we develop a data flow algorithm that exploits domain knowledge as well as high-level semantics of mathematical operations to find the best distributions, but without using approximations such as cost models. Hence, our approach is robust and matches manual parallelization.
In this paper, we use this approach to build High Performance Analytics Toolkit (HPAT), a compiler-based framework that automatically parallelizes high-level Julia analytics programs and generates efficient MPI/C++ code.
2 Logistic Regression Example
We introduce HPAT and compare it to the state-of-the-art using the logistic regression machine learning algorithm. Logistic regression uses the logistic sigmoid function to derive a model for data classification Bishop . This model is updated iteratively using the gradient descent optimization method. Figure 0(a) presents the HPAT code for logistic regression. The @acc hpat macro annotation and the DataSource construct are HPAT-specific, but the rest of the code is high-level sequential Julia code which uses matrix/vector operations (corresponding to “vectorized” mathematical notation). HPAT automatically parallelizes and optimizes this program and generates efficient MPI/C++ code. HPAT’s auto-parallelization algorithm uses domain knowledge to infer distributions for all variables and computations accurately; for example, vector w is replicated on all processors while the columns of matrix points are divided among them in a block manner, which is the same strategy used for manual parallelization of this program. Previous auto-parallelization methods cannot achieve this since there is a large distribution space even for this simple program; for example, even computations on w are data-parallel and could be distributed without violating dependencies, but parallel performance would suffer. HPAT also fuses all computation steps in the iteration loop together to achieve the best data locality.
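For reference, the vectorized kernel that HPAT parallelizes can be sketched in NumPy (an illustrative analogue of the Julia code in the figure; the initialization, hyperparameters, and demo data below are assumptions, not taken from the paper):

```python
import numpy as np

def logistic_regression(labels, points, iterations=10, alpha=0.1):
    """Illustrative NumPy analogue of the HPAT/Julia kernel.

    points : (features, samples), samples are columns, so the distributed
             dimension is the last one (matching HPAT's layout assumption)
    labels : (1, samples)
    Returns the replicated (1, features) parameter vector w.
    """
    w = np.zeros((1, points.shape[0]))  # hypothetical initialization
    for _ in range(iterations):
        # w @ points is data-parallel (w replicated, points distributed);
        # the trailing @ points.T is a reduction across samples, so its
        # result updates the replicated parameter vector w.
        err = (1.0 / (1.0 + np.exp(-labels * (w @ points))) - 1.0) * labels
        w -= alpha * (err @ points.T)
    return w

# tiny deterministic demo (assumed data): feature 0 matches the label sign
demo_w = logistic_regression(np.array([[1.0, -1.0, 1.0, -1.0]]),
                             np.array([[1.0, -1.0, 0.5, -0.5],
                                       [0.5, 0.5, -1.0, 1.0]]))
```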
The current state-of-the-art for big data analytics frameworks is the library approach, which can have significant overheads. We use Apache Spark to demonstrate this approach since it is a widely used big data analytics framework; other frameworks are similar in principle. Figure 0(b) presents the Spark version, which initializes a resilient distributed dataset (RDD) called points. An RDD is essentially a distributed data structure that is partitioned in one dimension. In addition, the computation is written in terms of map and reduce operators. In each iteration, the task scheduler of the Spark runtime library on the master node divides the map/reduce operators into tasks and sends these tasks to the executor (slave) nodes. The operator closures, which include the required data (w in this case), are also serialized and sent. On the executors, each task is initialized and the RDD data is deserialized from the data cache into Python (Numpy) objects for execution. Finally, the result of the reduce operator is sent back to the master node. This sequence is repeated for every iteration since the library has to return the results of reductions to the user context and is not aware of the iteration loop in the Python code. In essence, each iteration is a separate job that is launched as a wave of small tasks; scheduling and serialization overheads are incurred repeatedly. This is a fundamental problem with this library design and cannot be mitigated easily; significant Spark development effort has not closed the performance gap with MPI/C++ codes (Spark has over 1000 contributors spa ).
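The per-iteration map/reduce structure described above can be sketched in plain Python (the helper names and functional map/reduce are hypothetical stand-ins for Spark's RDD API):

```python
import math
from functools import reduce

def gradient_one_point(w, point, label):
    """Map step: the gradient contribution of a single sample."""
    s = sum(wi * xi for wi, xi in zip(w, point))
    coeff = (1.0 / (1.0 + math.exp(-label * s)) - 1.0) * label
    return [coeff * xi for xi in point]

def add_vectors(a, b):
    """Reduce step: elementwise sum of two gradient contributions."""
    return [x + y for x, y in zip(a, b)]

def one_iteration(w, points, labels):
    # In Spark, each such call is a separate job: the closure over w is
    # serialized and shipped to the executors, tasks are scheduled, and
    # the reduced result is returned to the master, every iteration.
    grads = map(lambda s: gradient_one_point(w, s[0], s[1]),
                zip(points, labels))
    return reduce(add_vectors, grads)
```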
Figure 2 demonstrates the performance of the logistic regression algorithm in Spark, MPI/Python, HPAT, and MPI/C++. HPAT is 35× faster than Spark and is close to the handwritten MPI/C++ code. The gap between Spark and the MPI/Python code reveals the magnitude of the overheads in Spark. The gap between the MPI/Python code and MPI/C++ is mostly due to the locality advantages of fused loops as opposed to executing a sequence of Numpy operations. Bridging the performance gap with MPI/C++ codes is crucial for big data analytics; for example, 91% of Spark users chose performance as one of the most important aspects for them in a Spark survey spa . Furthermore, distributed libraries are implemented in languages such as Java and Scala since reflection, serialization, and isolation features of a sandbox like the Java Virtual Machine (JVM) are required, but these features can have significant overheads Maas et al. ; Lion et al. ; Ousterhout et al. . In contrast, HPAT achieves scripting productivity and HPC performance simultaneously using an auto-parallelizing compiler approach.
3 HPAT Overview
In this section, we provide an overview of our target application domain and the HPAT compilation pipeline.
The goal of HPAT is to automatically parallelize common analytics tasks such as data-parallel queries and iterative machine learning algorithms. The main domain characteristic we exploit is that the map/reduce parallel pattern inherently underlies the target application domain. This characteristic is assumed in current frameworks like Spark as well. Hence, distribution of parallel vectors and matrices is mainly one-dimensional (1D) for analytics tasks (i.e. matrices are “tall-and-skinny”). Furthermore, the workload is uniform across the data set (which can be extended by providing load balancing). These assumptions do not necessarily apply to other domains. For example, multi-dimensional distribution is often best in many physical simulation applications. Note that while traditional HPC applications are sometimes developed over many years, analytics programs need to be developed in as little as a few minutes in some cases to support the interactive workflow of data scientists (which makes productivity of scripting languages critical). Hence, analytics programs are often significantly shorter and simpler than many HPC applications.
HPAT Coding Style:
HPAT supports the high-level scripting syntax of the Julia language, which is intuitive to domain experts. However, users need to follow certain coding style guidelines to make sure HPAT can analyze and parallelize their programs automatically:
The analytics task is in functions annotated with @acc hpat.
I/O (e.g. reading input samples) is performed through HPAT (using the DataSource and DataSink constructs).
The data-parallel computations are in the form of high-level matrix/vector computations or comprehensions since HPAT does not parallelize sequential loops.
Julia’s column-major order should be followed for multidimensional arrays since HPAT parallelizes across the last dimension. For example, this means that the features of a sample are in a column of the samples matrix.
HPAT Compiler Pipeline:
HPAT uses the @acc macro provided by Julia’s CompilerTools package to configure HPAT’s compilation pipeline as shown in Figure 3. The solid boxes are HPAT components. The macro pass desugars HPAT extensions (such as DataSource) into function calls and type annotations to enable compilation by Julia. The Julia compiler then performs further desugaring and type inference on the function’s intermediate representation (IR). The Domain-Pass transforms HPAT extensions into a form more conducive to optimization by the subsequent Domain-IR and Parallel-IR passes. The Domain-IR and Parallel-IR passes are provided by Julia’s ParallelAccelerator package paG . The Domain-IR pass identifies operators (e.g. vector-vector element-wise addition) and other constructs in the IR whose semantics are parallelizable. The Parallel-IR pass takes the different kinds of parallel constructs found by Domain-IR and transforms those into a common representation called parfor (that represents a tightly nested for loop whose iterations can all be performed in parallel) and performs fusion and other optimizations on the IR. These two passes are described in more detail in Section 6. Given this parallel representation, the Distributed-Pass partitions the arrays and parfors for distributed-memory execution and generates communication calls. HPAT Code Generation takes advantage of several hooks provided by CGen (part of ParallelAccelerator) to generate C++ code.
4 Automatic Parallelization
Our novel auto-parallelization algorithm exploits domain knowledge in a data flow framework Nielson et al.  to find data and computation distributions. The goal of the algorithm is to find a consistent map/reduce view of the program that does not violate the high-level parallel semantics of any mathematical operation. Since the underlying parallel pattern is map/reduce, each array should be either distributed in 1D block fashion (1D_B) or replicated on all processors (REP). We also support a 2D block-cyclic distribution (2D_BC) for the less common square matrix computations, but this requires a user annotation. To give some intuition, 1D block distribution is typically applied to arrays such as datasets and map operations on those datasets, whereas a replicated distribution is used for an array containing a model and is often associated with a reduction. We also know that data remapping is not needed in this domain so array distributions are global. On the other hand, the distribution of each computation step can be determined based on its high-level semantics (e.g. reductions) and the distribution of its inputs/outputs.
We construct our data flow framework as follows. The meet-semilattice of distributions is defined as:

REP ≺ 2D_BC ≺ 1D_B,   ⊤ = 1D_B,   ⊥ = REP

The properties to infer are defined as

D_A : A → {1D_B, 2D_BC, REP},   D_P : P → {1D_B, 2D_BC, REP}

where A is the set of arrays in the program, P is the set of parfors, D_A specifies distributions for arrays, and D_P specifies distributions for parfors. Other operations such as matrix multiply and library calls also have similar properties, which we omit here for brevity. Next, we define the set of transfer functions for each node type based on its parallel semantics, which are used to update the properties. In essence, this framework provides the equation

(D_A, D_P) = F(D_A, D_P)

which can be solved using a fixed-point iteration algorithm (repeatedly walking over the IR until quiescence). The initial distributions are assigned as 1D_B for all arrays and parfors (the top element in the semi-lattice). We design the transfer functions to be monotone, which is consistent with the semantics of the operations in this domain since data remappings are not common. However, it is possible to remap data by inserting special remapping nodes that copy data to new arrays (left for future work). Monotonicity ensures that the fixed-point iteration algorithm converges to the least solution Nielson et al. , which means that higher values like 1D_B are preserved as much as possible while satisfying all the equations. This ensures maximum parallelization for the program. Program control flow (e.g. branches) can be safely ignored since the distributions do not change across different paths in this domain.
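A minimal sketch of this fixed-point scheme in Python (the lattice encoding, seeding, and the assignment-only transfer function are simplifications of HPAT's actual analysis):

```python
# Lattice of distributions: higher rank means "more parallel";
# meet() picks the lower element, so values only move down the lattice.
RANK = {"REP": 0, "2D_BC": 1, "1D_B": 2}

def meet(a, b):
    return a if RANK[a] <= RANK[b] else b

def infer_distributions(statements, arrays, rep_arrays=()):
    """Fixed-point iteration over (lhs, rhs) array-assignment statements.

    Every array starts at the top element 1D_B; arrays in rep_arrays
    (e.g. returned arrays) are seeded with REP. Since the transfer
    function is monotone, iteration converges to the least solution,
    preserving 1D_B wherever the equations allow it.
    """
    dist = {a: "1D_B" for a in arrays}
    for a in rep_arrays:
        dist[a] = "REP"
    changed = True
    while changed:
        changed = False
        for lhs, rhs in statements:   # array-assignment transfer function
            d = meet(dist[lhs], dist[rhs])
            if dist[lhs] != d or dist[rhs] != d:
                dist[lhs] = dist[rhs] = d
                changed = True
    return dist
```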
Transfer Function: Array Assignment - The left-hand side and right-hand side of an assignment between arrays should have the same distribution, which is the meet (∧) of their previously inferred distributions:

D_A[lhs] = D_A[rhs] = D_A[lhs] ∧ D_A[rhs]
Transfer Function: Calls - When the distribution algorithm encounters a call site, the algorithm determines whether it knows the transfer function for the function being called. If so, the call site is considered to be a known call; otherwise, it is called an unknown call. For unknown calls, such as functions from external modules not compiled through HPAT, the distribution algorithm conservatively assumes that the involved arrays need to be REP. If the function has parallel semantics for arrays, the user needs to provide this information. Conversely, distribution transfer functions are built into an HPAT knownCallProps table for many Julia and HPAT operations. Common examples include Julia’s array operations (e.g. reshape, array set, array get, array length) and HPAT’s data storage functions.
Transfer Function: Returned Array - For return statements of the top-level function, the set of arrays being returned are each flagged as REP since returned arrays need to fit on a single node and this output typically represents a summarization of a much larger data set. If larger output is required, the program should write the output to storage. This is a useful domain-specific heuristic that facilitates compiler analysis.
Transfer Function: Matrix Operations - Matrix/matrix and matrix/vector multiply operations (GEMM/GEMV call nodes) have more complex transfer functions.
Figure 1 illustrates the portion of the GEMM transfer function that is exercised during auto-parallelization of the logistic regression example of Figure 0(a), as well as a 2D_BC case. These formulas are derived based on our matrix distribution and layout assumptions, and the semantics of GEMM operations. Since samples are laid out in columns and we do not split sample features across processors (1D_B), any vector or row of a matrix computed using a reduction across samples is inferred as REP. For example, the result of the inner formula in Logistic Regression’s kernel is multiplied by the transpose of the sample points and used to update the parameters. Using this analysis, the algorithm infers that the output of the operation is REP if both inputs are 1D_B and the second one is transposed. This also means that the inputs can stay 1D_B. In this case, a reduction is also inferred for the node (which eventually turns into MPI_Allreduce in the backend). Furthermore, since the vector w is used in a dot product with matrix columns, it should be REP and the matrix stays 1D_B.
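The two GEMM cases exercised by the logistic regression example can be sketched as follows (a simplified Python model with assumed names; the real transfer function covers more cases, including 2D_BC):

```python
def gemm_transfer(out, a, b, b_transposed, dist, reductions):
    """Sketch of the GEMM transfer-function cases described above.

    dist maps array name -> '1D_B' or 'REP'; reductions collects nodes
    that need a distributed reduction (MPI_Allreduce in the backend).
    """
    if dist[a] == "1D_B" and dist[b] == "1D_B" and b_transposed:
        # e.g. the inner-formula result times the transposed points:
        # a reduction across samples, so the output is replicated and
        # an allreduce is required; both inputs stay 1D_B.
        dist[out] = "REP"
        reductions.append(out)
    elif dist[a] == "REP" and dist[b] == "1D_B":
        # e.g. replicated w dotted with each column of points:
        # purely data-parallel, output stays distributed, no communication.
        dist[out] = "1D_B"
    else:
        # conservative fallback: replicate everything involved
        dist[out] = dist[a] = dist[b] = "REP"
    return dist[out]
```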
Transfer Function: Parfor - As described in Section 3, parfor nodes represent data-parallel computation and require special handling during distribution analysis.
Figure 2 illustrates the transfer function for parfors. For clarity, this figure is simplified and only shows how the common case of 1D_B arrays is handled. As with array distribution, we start by assuming that the parfor is 1D_B until proven otherwise. First, we analyze the body of the parfor and extract all the array access/indexing operations. We then iterate across all the array accesses. Since HPAT parallelizes 1D_B arrays and parfors across the last dimension, the index variable of the last loop of the parfor is used for testing. First, we check if the index expression for the last dimension of the array access (i.e., index_exprN) is identical to the parfor’s index variable allocated to the last dimension of the loop nest. If so and the array being accessed is REP, then the parfor itself becomes REP (the meet of the two distributions is chosen). Second, we check whether the last dimension’s parfor loop index variable is used directly or indirectly (e.g. temp = parfor.LoopsNests[end].index_var; index_expr1 = temp + 1) in any of the array access index expressions for the first N-1 dimensions. If so, then the parfor must be REP. These tests are conservative but do not typically prevent parallelization of common analytics codes. For accesses to 2D_BC arrays, the above algorithm has a straightforward extension that considers not one but the last two parfor index variables and array index expressions.
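The two tests above can be sketched in Python (a simplified model with assumed names; each index expression is abstracted as the set of variables it depends on, which covers both direct and indirect uses):

```python
def parfor_transfer(parfor_index_vars, accesses, dist):
    """Simplified parfor distribution test (1D_B case only).

    parfor_index_vars: loop index variables; the last one is parallelized.
    accesses: list of (array, index_exprs), where each index expression
              is the set of variables it depends on.
    dist: array name -> '1D_B' or 'REP'.
    Returns '1D_B' if the parfor can stay distributed, else 'REP'.
    """
    last = parfor_index_vars[-1]
    for array, index_exprs in accesses:
        # Test 1: last dimension indexed exactly by the last loop index,
        # but the accessed array is replicated -> meet forces REP.
        if index_exprs[-1] == {last} and dist[array] == "REP":
            return "REP"
        # Test 2: the last loop index leaks into a non-last dimension
        # -> iterations are not independent slices; be conservative.
        if any(last in expr for expr in index_exprs[:-1]):
            return "REP"
    return "1D_B"
```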
4.1 Effectiveness on Data Analytics Programs
The main reason HPAT’s heuristics are effective is that data analytics programs typically produce a summary of large data sets. In the case of machine learning algorithms, this summary is the weights of the trained model. More specifically, many large-scale machine learning algorithms use optimization methods such as stochastic gradient descent (SGD) Bottou . Hence, their structure can be represented as in Figure 3. Parameter set w is updated iteratively using gradient updates in order to minimize a form of cost function on the samples. HPAT’s analysis can infer that samples is distributed since it will be accessed in a data-parallel manner. It will also infer that w is replicated since it is updated using a form of reduction. For example, variable w in Figure 0(a) is updated by a matrix/vector multiplication that implies a reduction.
4.2 Domain-specific Optimizations
Before distributed code generation, HPAT applies two novel domain-specific optimization heuristics. HPAT performs these since typical compiler cost models cannot reliably conclude that the complex transformations described below are beneficial; for example, we found ICC and GCC unable to perform these optimizations on any of our benchmarks. Hence, HPAT’s use of domain knowledge is critical. For the first heuristic, we observe that matrix multiplications (GEMM calls) with at least one “tall-and-skinny” input are common in data analytics algorithms (e.g. in Figure 1). Ideally, these GEMM calls should be replaced with equivalent loop-nests and fused with other operations to achieve a single pass through data points and intermediate results:
GEMM operations with one or more 1D_B inputs are replaced with loop-nests and fusion is called on the basic block.
To enable fusion, HPAT arranges the loop-nests so that the long dimension is outer-most. This optimization causes all mathematical operations of the logistic regression algorithm in Figure 1 to be fused. Hence, each data point is loaded just once, which improves performance by maximizing locality. Furthermore, memory consumption is improved since saving intermediate data into memory is avoided.
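The locality effect of this fusion can be illustrated with a small pure-Python sketch (scalar loops standing in for the generated code; both versions compute the same sigmoid-error expression, but the unfused one materializes every intermediate array):

```python
import math

def unfused(labels, z):
    # each step allocates an intermediate array and re-reads memory
    t1 = [math.exp(-l * zi) for l, zi in zip(labels, z)]
    t2 = [1.0 / (1.0 + v) for v in t1]
    t3 = [(v - 1.0) * l for v, l in zip(t2, labels)]
    return t3

def fused(labels, z):
    # one pass, no intermediate arrays: each element is loaded once
    return [(1.0 / (1.0 + math.exp(-l * zi)) - 1.0) * l
            for l, zi in zip(labels, z)]
```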
The second heuristic is based on the observation that loops over data sets and their intermediate results can occur inside other loops. For example, the centroids calculation in Figure 4 is written using nested comprehensions that include passes over points and labels inside. This prevents maximum loop fusion and causes multiple passes over those data sets. Furthermore, extra communication is then generated for each iteration of the outer loop-nest instead of a single communication call for all data exchanges. Thus, we rearrange loop-nests to avoid this problem:
Parfors with REP distribution that have 1D_B parfors inside are interchanged and fusion is called on the basic block.
HPAT performs loop fission on the REP parfor before interchange since the body may have more statements. This optimization maximizes fusion in the kmeans example of Figure 4 and ensures a single pass over the data set.
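The interchange can be sketched with a k-means-style centroid update in plain Python (a simplified, hypothetical rendering of the pattern in Figure 4): the naive nesting makes one pass over the data set per centroid, while the interchanged form makes a single pass and leaves only one reduction.

```python
def centroids_naive(points, labels, k):
    # outer (replicated) loop over centroids, each iteration containing
    # a full pass over the distributed data set: k passes, k reductions
    dims = len(points[0])
    return [[sum(p[d] for p, l in zip(points, labels) if l == c) /
             max(1, sum(1 for l in labels if l == c))
             for d in range(dims)]
            for c in range(k)]

def centroids_interchanged(points, labels, k):
    # single pass over the data set; per-centroid sums and counts are
    # accumulated locally and need only one reduction at the end
    dims = len(points[0])
    sums = [[0.0] * dims for _ in range(k)]
    counts = [0] * k
    for p, l in zip(points, labels):
        counts[l] += 1
        for d in range(dims):
            sums[l][d] += p[d]
    return [[sums[c][d] / max(1, counts[c]) for d in range(dims)]
            for c in range(k)]
```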
4.3 Automatic Parallel I/O
Figure 5 demonstrates how HPAT’s compilation pipeline translates DataSource syntax to parallel I/O code (DataSink is similar). Macro-Pass desugars the syntax into a HPAT placeholder special function call (e.g. HPAT_h5_read) and type-annotates the output array so that Julia’s type inference can work. Domain-Pass generates function calls to get the size of the array and allocations so that Domain-IR and Parallel-IR can optimize the program effectively. Allocations and size variables are critical information that enable fusion and elimination of intermediate arrays.
Distributed-Pass enables parallel I/O by distributing the input array among nodes and adding the start and end indices for each dimension. HPAT Code Generation’s function call replacement mechanism generates the backend HDF5 code (currently MPI/C++) for placeholders such as HPAT_h5_read. HPAT also supports text files using MPI I/O because many big data files are stored as text.
4.4 Distributed-Memory Translation
Distributed-Pass transforms the function for distributed-memory execution and generates communication calls after auto-parallelization provides distributions. The pass also inserts calls into the IR to query the number of processors and to get the node number. Allocations of distributed arrays are divided among nodes by inserting code to calculate the local size based on the number of processors (e.g. size/num_pes). Distributed (1D_B) parfors are translated by dividing the iterations among processors using the node number and the number of processors, and updating array indices accordingly. For example, the parfor index variable i is replaced with i+mystart, where mystart contains the starting index of the parfor on the current processor. Furthermore, for parfors with reductions, communication calls for distributed-memory reductions are generated.
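The block-splitting arithmetic can be sketched as follows (helper names such as node_portion and mystart are illustrative, not HPAT's exact generated code):

```python
def node_portion(total, num_pes, rank):
    """Number of elements owned by this rank under 1D block partitioning;
    the last rank absorbs the remainder."""
    blk = total // num_pes
    return total - blk * (num_pes - 1) if rank == num_pes - 1 else blk

def node_start(total, num_pes, rank):
    """Global index of this rank's first element (mystart)."""
    return (total // num_pes) * rank

# A distributed parfor over n iterations becomes, on each rank, a loop of
# node_portion(n, num_pes, rank) iterations, with every global index i
# recovered as node_start(n, num_pes, rank) + i_local.
```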
Furthermore, matrix/vector and matrix/matrix multiplications (GEMM calls) need special handling. For instance, the multiplication of the replicated vector w with the distributed matrix points in Figure 0(a) does not require communication since the replication of w makes it data-parallel, while the multiplication with the transposed points matrix requires an allreduce operation since both arguments are distributed. The Distributed-Pass makes this distinction by using the parallelization information provided by the previous analyses. In the first case, the first input is replicated while the second input and the output are distributed. In the second case, both inputs of GEMM are distributed but the output is replicated.
4.5 Backend Code Generation
Our approach facilitates using various backends, since Distributed-Pass returns a high-level parallel IR that enables flexible code generation. Our current prototype supports MPI/C++ and MPI/OpenMP/C++, which we found to be suitable for many analytics workloads. However, some cases might require using a backend with an adaptive runtime system, due to scheduling and load balancing requirements. Hence, one might use Charm++ Acun et al.  or HPX Kaiser et al.  as the backend for HPAT. Evaluation of these backends for HPAT is left for future work.
On the other hand, Spark uses a runtime system with DAG scheduling, which is required for implementation of complex operations (such as shuffling) on top of Spark’s map/reduce core. However, these operations do not necessarily need runtime scheduling. In general, Spark surveys show that only a small fraction of users run irregular workloads such as graph workloads on Spark spa [2015, 2016]. Nevertheless, our approach enables the use of low-overhead HPC runtime systems, avoiding Spark’s overheads.
4.6 Automatic Utilization of Distributed-Memory Libraries
Libraries are essential components of productive analytics platforms. For example, Spark provides MLlib mll , which is implemented using its RDD data format. Since HPAT is compiler based, it can take advantage of distributed-memory libraries without requiring changes to their data structures and interfaces. On the other hand, only libraries that implement interoperation with Spark’s RDDs can be used with Spark; this is a fundamental limitation for library-based frameworks.
Figure 6 shows example code that calls a library. This function call goes through the compilation pipeline as a special known function call, i.e., it has entries in HPAT analysis tables. Most importantly, the HPAT parallelization algorithm knows that the input arrays can be 1D_B, while the output array is REP. If parallelization is successful, the Distributed-Pass adds two new arguments to the function call: the first index and the last index of the input array on each node. HPAT’s CGen extension uses MPI/C++ routines for code generation of the call. Currently, HPAT supports ScaLAPACK and Intel® Data Analytics Acceleration Library (Intel® DAAL) daa  as an alternative to MLlib for machine learning algorithms.
Performance evaluation of analytics programs that use libraries (e.g. MLlib or DAAL) is beyond the scope of this paper. Comparing libraries is non-trivial in general; data analytics functions can be implemented using multiple algorithms with different computational complexities. Our objective is to generate efficient code for scripting programs, which are dominantly used in this area.
4.7 2D Parallelization
HPAT requires a simple annotation to assist the automatic parallelization algorithm for the less common cases that require 2D parallelization. Consider the example program in Figure 7 (a real-world case requested by a user). The program reads two (multi-terabyte) matrices, multiplies them, adds some random value to the result, and writes it back to storage. The user specifies that matrix M requires 2D partitioning. HPAT infers that matrices x and y also need 2D partitioning. Furthermore, the related intermediate variables in the AST (such as the random matrix created) and the parfors are also 2D partitioned. In the backend, HPAT generates MPI/C++ code which calls parallel HDF5 for I/O and ScaLAPACK (PBLAS component) for matrix multiplication.
Manually developing efficient parallel code for this program is challenging. ScaLAPACK requires block-cyclic partitioning of input and output data but HDF5 provides a block-based read/write interface for parallel I/O. Hence, the generated code has to set 96 indices: there are three I/O operations, each operation requires four hyperslab selections (including corner cases), and each hyperslab selection requires start, stride, count, and block size indices for two dimensions (3 × 4 × 4 × 2 = 96). In addition, ScaLAPACK’s legacy interface is Fortran-based and has its own intricacies. As a result, the generated MPI/C++ code is 525 lines. Therefore, manually developing this code is highly error-prone for domain experts and HPAT’s automation is significant.
This use case demonstrates a fundamental advantage of our compiler approach. Library-based frameworks are based on fixed distributed data structures with specific partitioning and layout formats. For example, Spark’s RDDs are 1D partitioned and the whole framework (e.g block manager and scheduler) is based on this 1D partitioning. Supporting different partitionings and layouts is easy for a compiler, but is difficult for a library.
5 Checkpointing
Resiliency is essential for long-running iterative machine learning algorithms. The challenge is to provide low-overhead fault tolerance support without significant programmer involvement. Previous work on automatic checkpointing cannot achieve minimal checkpoints since, for example, those systems do not have the high-level knowledge that models are replicated and one copy is enough Schulz et al. ; Plank et al. ; Wang et al. . Spark’s approach is to use lineage to restart the shortest possible tasks, but this is made possible only by a system design with high overheads. HPAT provides automated application-level checkpointing (ALC) based on domain characteristics.
For iterative machine learning applications, only the learning parameters and the loop index need to be checkpointed since the data points are read-only. Moreover, these learning parameters are replicated on all processors, so only one copy needs to be checkpointed. We use these domain characteristics in designing HPAT’s checkpointing capability. Hence, HPAT checkpointing assumes a typical analytics function containing a single outer-loop and having the form: initialization, iteration loop, results. In the initialization phase, the input is loaded and variables are initialized, which establishes the invariants for entry into the loop. The body of the outer-loop can be arbitrarily complex, including containing nested loops. In the results phase, the outputs of the loop are used to compute the final result of the function. This approach is readily extensible to support multiple outer-loops as well.
If enabled, HPAT adds a checkpointing pass to the compilation pipeline after Domain-Pass. The checkpointing pass first locates the outer-loop and analyzes it to determine which variables are live at entry to the loop (including the loop index) and are written in the loop. This set of iteration-dependent variables are saved as part of a checkpoint. The pass creates a new checkpoint function tailored to this set of variables and inserts a call to that function as the first statement of the loop. In this function, MPI rank zero compares the time since the last checkpoint was taken with the next checkpoint time as calculated using Young’s formula Young . If it is time to take a checkpoint, rank zero calls the HPAT runtime to start a checkpointing session, write each variable to the checkpoint, and then end the session. The HPAT runtime records the time to take the checkpoint and uses this information to improve the estimated checkpoint time that is input to Young’s formula. At the exit of the outer-loop, the pass also inserts a call to the HPAT runtime to indicate that the checkpointed region has completed and that any saved checkpoints can be deleted.
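Young's first-order formula for the checkpoint interval can be sketched as follows (the measured-cost refinement mentioned above is omitted; this is the textbook formula, not HPAT's runtime code):

```python
import math

def young_interval(checkpoint_cost, mtbf):
    """Young's optimal checkpoint interval: tau = sqrt(2 * delta * MTBF),
    where delta is the measured time to write one checkpoint and MTBF is
    the mean time between failures. HPAT refines delta at runtime using
    the recorded time of each checkpoint it takes."""
    return math.sqrt(2.0 * checkpoint_cost * mtbf)
```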
To restart a computation, the programmer calls the HPAT restart routine and passes the function to be restarted and the original arguments. The HPAT compiler creates a restart version of the function that is identical to the original but with the addition of checkpoint restore code before the entry to the loop. This checkpoint restore code finds the saved checkpoint file and loads the iteration-dependent variables from the checkpoint. In this way, the initialization code is performed again (restoring read-only variables), and the loop fast-forwards to the last successfully checkpointed iteration. An important consideration is that the iterations should be deterministic since some might be re-executed during restart.
Consider the logistic regression example in Figure 0(a); we store only the loop index i and w in the checkpoint, whereas the full set of live data would include points and labels and would result in checkpoints that are orders of magnitude larger. As far as we are aware, this checkpointing approach, which exploits domain knowledge (for example, by re-executing the initialization phase), is novel. A key insight is that HPAT can perform the analysis required for this minimal checkpointing, while a library approach like Spark is unable to do so.
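The size difference is easy to see with a toy measurement (plain Python, with `pickle` standing in for HPAT's checkpoint format; the data sizes are arbitrary):

```python
import pickle

# Illustrative comparison of checkpoint payloads: the minimal checkpoint
# holds only the iteration-dependent state (loop index i, parameters w);
# a full live-state checkpoint would also carry the read-only points and
# labels, which re-running the initialization phase restores instead.
points = [[0.0] * 10 for _ in range(100_000)]   # read-only input data
labels = [1.0] * 100_000
w, i = [0.0] * 10, 42

minimal_ckpt = pickle.dumps({"i": i, "w": w})
full_ckpt = pickle.dumps({"i": i, "w": w,
                          "points": points, "labels": labels})
```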
6 ParallelAccelerator Infrastructure
HPAT relies on the ParallelAccelerator package paG  for discovering potential parallelism in dense array operations of Julia. ParallelAccelerator consists of three main compiler passes, Domain-IR, Parallel-IR, and CGen. Domain-IR looks for operations and other constructs in Julia’s IR that have different kinds of parallel semantics and then replaces those operations with equivalent Domain-IR nodes that encode those semantics. Some of the most common parallelism patterns in Domain-IR are map, reduce, Cartesian map, and stencil. For example, Domain-IR would identify unary vector operations (such as -, !, log, exp, sin, and cos) and binary, element-wise vector-vector or vector-scalar operations (such as +, -, *, /, ==, !=, <, and >) as having map semantics. Likewise, Julia’s sum() and prod() functions would be identified by Domain-IR as having reduce semantics. Domain-IR identifies comprehensions within the Julia code as having Cartesian map semantics.
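A toy classifier in the spirit of Domain-IR's pattern matching is sketched below; the op tables are illustrative, not ParallelAccelerator's actual rule set:

```python
# Element-wise unary/binary vector operations get map semantics;
# sum/prod get reduce semantics; comprehensions (not modeled here)
# get Cartesian map semantics.
MAP_OPS = {"-", "!", "log", "exp", "sin", "cos",
           "+", "*", "/", "==", "!=", "<", ">"}
REDUCE_OPS = {"sum", "prod"}

def classify(op):
    """Return the parallel pattern an operation encodes, if known."""
    if op in REDUCE_OPS:
        return "reduce"
    if op in MAP_OPS:
        return "map"
    return "unknown"
```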
The Parallel-IR pass lowers Domain-IR nodes to a common representation called parfor. Once in this representation, Parallel-IR performs parfor fusion between parfors coming from potentially dissimilar Domain-IR nodes. This fusion process reduces loop overhead and eliminates many intermediate arrays, helping the program to have better locality. There are three main components of the parfor representation: loop nests, reductions, and body. Every parfor has a loop nest that represents a set of tightly nested for loops. It is typical for the number of such loops to match the number of dimensions of the array on which the parfor operates. The parfor reductions component is only present when the parfor involves a reduction and encodes the variable holding the reduction value along with its initial value and the function used to combine reduction elements. Finally, the body of the parfor contains code to compute the result of the parfor for a single point in the iteration space. After the Parallel-IR pass is finished, CGen converts the IR to OpenMP-annotated C code in the backend.
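The parfor representation and its fusion can be modeled with a small sketch (Python closures standing in for the Julia IR; the reduction component is elided, and each body computes one point of the iteration space):

```python
from dataclasses import dataclass

@dataclass
class Parfor:
    loop_nest: tuple        # extents of the tightly nested loops
    body: object            # callable computing one iteration-space point
    reduction: object = None  # (init, combine) when a reduction is present

def run(parfor, env):
    """Execute a 1-D parfor sequentially over its iteration space."""
    (n,) = parfor.loop_nest
    for i in range(n):
        parfor.body(i, env)

def fuse(a, b):
    """Fuse two parfors with identical loop nests: run both bodies at
    each point, eliminating the intermediate array between them."""
    assert a.loop_nest == b.loop_nest
    def fused_body(i, env):
        a.body(i, env)
        b.body(i, env)
    return Parfor(a.loop_nest, fused_body)
```

For instance, fusing `c = x .+ 1` with `d = c .* 2` yields one loop that reads `x[i]` and writes both `c[i]` and `d[i]`, so `c` never needs to be materialized as a separate pass over memory.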
We compare the performance of Spark, HPAT, and handwritten MPI/C++ programs on the Cori supercomputer at NERSC cor  and the Amazon AWS cloud. Cori is a Cray XC40 supercomputer that includes advanced features for data-intensive applications such as large memory capacity and high I/O bandwidth. Each node (Phase I) has two sockets, each of which is a 16-core Intel® Xeon® E5-2698 v3 2.3GHz (Haswell architecture). The memory capacity of each node is 128GB. On AWS, we use the compute-optimized C4 instances (c4.8xlarge with 36 vCPUs), which feature Intel® Xeon® E5-2666 v3 (Haswell) processors. Each instance has 60GB of memory. We use Placement Groups, which provide low-latency, full-bisection, 10Gbps bandwidth between instances. We use the Intel® C++ Compiler (ICC) 17.0 for backend C++ compilation. We use the default Spark 2.0.0 installation on Cori, which is tuned and supported.
We use the benchmarks listed in Table 1 for evaluation. We believe they are representative of many workloads in data analytics; related studies typically use the same or similar benchmarks Brown et al. ; Islam et al. ; Ousterhout et al. ; Reyes-Ortiz et al. . Benchmark sizes are chosen so that they fit in memory, even with the excessive memory usage of Spark, to make sure Spark’s performance is not degraded by accessing disks repeatedly. HPAT is capable of generating MPI/OpenMP code but currently turns OpenMP off and uses an MPI-only configuration, since OpenMP code generation is not tuned yet. We use one MPI rank per core (without hyperthreading). The Spark versions of the benchmarks are based on the available examples in Spark’s open-source distribution. We developed the MPI/C++ programs by simply dividing the problem domain equally across ranks and ensuring maximum locality by fusing the loops manually. Parallel I/O times are excluded from all results. MPI/C++ codes are about 6× longer in lines of code compared to HPAT codes.
Figure 8 compares the performance of Spark, manual MPI/C++, and HPAT on 64 nodes (2048 cores) of Cori and four nodes of Amazon AWS. HPAT is 369-2033× faster than Spark for the benchmarks on Cori and is only 2-4× slower than manual MPI/C++. In addition, HPAT is 20-256× faster than Spark on Amazon AWS. The lower performance of Spark on Cori is expected since the master node is a bottleneck and prevents scaling. Kernel Density demonstrates the largest performance gap among the benchmarks; HPAT is 2033× faster than Spark on Cori and 256× faster on AWS. The reason is that the computation per element is small, so Spark overheads such as serialization/deserialization and master-executor coordination are amplified.
Automatic parallelization by HPAT matches the manual parallelization perfectly for all of the benchmarks, which demonstrates the robustness of our auto-parallelization algorithm. Furthermore, the loop structures are identical to those of the manual versions, which demonstrates the success of HPAT’s fusion transformations. The performance gap between HPAT and the MPI/C++ codes can be attributed to the generated code being longer and containing extra intermediate variables; this makes the code generation phase of the backend C++ compiler more challenging. The optimization reports of ICC support this hypothesis. We hope to match the performance of manual codes in future work.
We use the Python interface of Spark as the baseline in this paper since scripting languages are preferred by domain experts. One may argue that using Scala, which is statically compiled, might be significantly faster in Spark. We tested this hypothesis and found that Scala is moderately faster for some benchmarks (e.g. K-Means), but HPAT is still several times faster than Spark. For other benchmarks (e.g. Logistic Regression), Python is actually faster since the computation is inside Numpy operations, which have optimized native backends (Anaconda distribution). Furthermore, the flexibility of Numpy allows batched processing, while Scala’s Breeze linear algebra library does not have this capability.
We use an ADMM LASSO solver Wahlberg et al.  to evaluate the effectiveness of our auto-parallelization method on a complex algorithm. The code was originally written in Python and parallelized using mpi4py by a domain expert. However, the manual parallelization method sacrifices accuracy for easier parallelization. On the other hand, HPAT is able to parallelize the code effectively and accurately. Figure 9 demonstrates scaling on up to 64 nodes. The slowdown of the MPI/Python code running in parallel is partially due to accuracy loss, which forces the algorithm to run up to the specified maximum number of iterations. The success of HPAT’s auto-parallelization on this algorithm provides confidence in the effectiveness of our approach, since we do not expect analytics algorithms in practice to be significantly more complex than this case.
Compiler feedback and control:
Previous compiler approaches are hard for users to understand and control since they rely on complex heuristics and cost models. HPAT’s automatic parallelization algorithm inherently facilitates user control since it is deterministic and can be controlled easily. For example, HPAT reports the operations that caused each REP inference. The user is then able to change parallelization behavior by explicitly annotating parallelization for arrays or by providing more information about operations (e.g. parallelization of library calls not previously known to HPAT).
8 Related Work
Previous studies demonstrate that current big data analytics frameworks are orders of magnitude slower than hand-tuned MPI implementations Brown et al. ; Reyes-Ortiz et al. ; McSherry et al. . To improve the performance of data analytics frameworks, some previous studies have focused on more efficient inter-node communication Islam et al. ; Zhang et al.  but they do not address the fundamental performance bottlenecks resulting from the library approach. HPAT solves this problem using a novel automatic parallelization approach.
Automatic parallelization is studied extensively in HPC, but it is known that auto-parallelizing compilers cannot match the performance of hand-written parallel programs for many applications Eigenmann et al. ; Wilson et al. ; Blume et al. ; Waheed et al. ; Hiranandani et al. . For example, previous studies have tried to automate data alignment (TEMPLATE and ALIGN directives) and distribution (DISTRIBUTE directive) steps in High Performance Fortran (HPF) with limited success Kennedy et al. ; Chatterjee et al. ; Kennedy and Kremer . More specifically, our distribution analysis algorithm can be compared with the framework by Kennedy and Kremer Kennedy and Kremer . This framework performs a search of possible alignment and distribution combinations for loop-nests; a performance model helps predict the combination with the best performance. However, this framework cannot find the best combination reliably due to the inaccuracies of the performance model. Conversely, the HPAT algorithm leverages domain knowledge and assigns distributions to arrays and computations based on semantics of different high-level operations. To the best of our knowledge, this is the only algorithm to use a data flow formulation and is therefore novel. In general, previous work targeting scientific computing applications relies on complex analysis, performance models and approximations that are known to have challenges in a practical setting Kennedy et al. ; Waheed et al. . In contrast, our algorithm does not rely on models and approximations, and is therefore accurate (matches manual parallelization).
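The data-flow flavor of this analysis can be illustrated with a toy two-point lattice (1D_BLOCK above REP) iterated to a fixed point; the names, the lattice, and the constraint encoding here are illustrative, not HPAT's exact formulation:

```python
ONE_D, REP = "1D_BLOCK", "REP"

def meet(a, b):
    """Lattice meet: both must be 1D_BLOCK to stay 1D_BLOCK;
    replication (REP) is the more restrictive distribution."""
    return ONE_D if a == b == ONE_D else REP

def analyze(arrays, constraints, forced_rep=()):
    """Iterative data-flow pass. constraints lists pairs of arrays whose
    distributions must agree (e.g. operands of a map); forced_rep lists
    arrays whose operations semantically require replication."""
    dist = {a: (REP if a in forced_rep else ONE_D) for a in arrays}
    changed = True
    while changed:                      # iterate to a fixed point
        changed = False
        for x, y in constraints:
            m = meet(dist[x], dist[y])
            if dist[x] != m or dist[y] != m:
                dist[x] = dist[y] = m
                changed = True
    return dist
```

Starting from the optimistic assignment (everything 1D_BLOCK) and propagating REP through operation constraints makes the result deterministic, with no performance model in the loop.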
Distributed Multiloop Language (DMLL) provides compiler transformations on map-reduce programs on distributed heterogeneous architectures Brown et al. . However, HPAT starts from higher-level scripting code, which might have operations like matrix-multiply that do not have a simple map-reduce translation. More importantly, DMLL relies on user annotations and a simple partitioning analysis pass for data partitioning, while HPAT is fully automatic. We believe that our iterative partitioning analysis algorithm produces more efficient code since, unlike DMLL’s approach, it does not parallelize all potentially parallel operations. This can significantly affect communication costs for distributed-memory operations since DMLL broadcasts local data structures, and it could explain DMLL’s much smaller speedups over Spark on CPU clusters.
Distributed Halide is a domain-specific compiler for image processing that translates high-level stencil pipelines into parallel code for distributed-memory architectures Denniston et al. . Unlike in HPAT, where 1D partitioning and reductions are the norm, 2D partitioning and near-neighbor communication dominate in this domain. Moreover, Halide requires a user-provided “schedule” for optimization and distributed execution, while HPAT is fully automatic.
Several previous efforts such as Delite Sujeeth et al. , Copperhead Catanzaro et al. , and Intel®Array Building Blocks Newburn et al.  provide embedded domain-specific languages (DSLs) that are compiled for parallel hardware. HPAT’s design is similar in many aspects, but HPAT targets data analytics for distributed-memory architectures (with more accurate auto-parallelization). Furthermore, HPAT uses the abstractions of the host language and avoids introducing new abstractions to the programmer as much as possible.
Systems that automate application-level checkpointing to some degree have been proposed before Schulz et al. ; Plank et al. ; Wang et al. . For example, in the method by Plank et al. Plank et al. , the programmer adds checkpoint locations and the system excludes dead variables and read-only data from checkpoints. In contrast, HPAT’s scheme is domain-specific and, for example, uses the knowledge that the learning parameters in HPAT are replicated; checkpointing them therefore does not require an MPI consistency protocol. HPAT also completely automates the checkpointing process, whereas other systems require some degree of programmer effort.
9 Conclusion and Future Work
Library-based big data analytics frameworks such as Spark provide programmer productivity, but they are much slower than hand-tuned MPI/C++ codes due to immense runtime overheads. We introduced an alternative approach based on a novel auto-parallelization algorithm, which is implemented in the High Performance Analytics Toolkit (HPAT). HPAT provides the best of both worlds: the productivity of scripting abstractions and the performance of efficient MPI/C++ codes. Our evaluation demonstrated that HPAT is 369-2033× faster than Spark. We plan to expand HPAT to provide more data analytics features and use cases. For example, support for sparse computations, data frames (heterogeneous tables), and out-of-core execution is under investigation. Most of these features require research at multiple layers, from scripting abstractions to compilation techniques and code generation. We are also building a prototype HPAT system for Python.
- HPAT is available online as open-source at https://github.com/IntelLabs/HPAT.jl.
- 2015. Apache Spark Survey 2015 Report. http://go.databricks.com/2015-spark-survey/. (2015).
- 2016. Apache Spark Survey 2016 Report. http://go.databricks.com/2016-spark-survey/. (2016).
- 2016. Cori Supercomputer at NERSC. http://www.nersc.gov/users/computational-systems/cori/. (2016).
- 2016. Intel Data Analytics Acceleration Library. https://software.intel.com/en-us/daal. (2016).
- 2016. ParallelAccelerator Julia package. https://github.com/IntelLabs/ParallelAccelerator.jl. (2016).
- 2016. Spark Machine Learning Library (MLlib) Guide. http://spark.apache.org/docs/latest/mllib-guide.html. (2016).
- Bilge Acun, Abhishek Gupta, Nikhil Jain, Akhil Langer, Harshitha Menon, Eric Mikida, Xiang Ni, Michael Robson, Yanhua Sun, Ehsan Totoni, Lukasz Wesolowski, and Laxmikant Kale. 2014. Parallel Programming with Migratable Objects: Charm++ in Practice (SC’14).
- Christopher M. Bishop. 2006. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag New York, Inc.
- William Blume, Ramon Doallo, Rudolf Eigenmann, John Grout, Jay Hoeflinger, Thomas Lawrence, Jaejin Lee, David Padua, Yunheung Paek, Bill Pottenger, Lawrence Rauchwerger, and Peng Tu. 1996. Parallel Programming with Polaris. Computer 29, 12 (1996).
- Léon Bottou. 2010. Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT’2010.
- Kevin J. Brown, HyoukJoong Lee, Tiark Rompf, Arvind K. Sujeeth, Christopher De Sa, Christopher Aberger, and Kunle Olukotun. 2016. Have Abstraction and Eat Performance, Too: Optimized Heterogeneous Computing with Parallel Patterns. In CGO.
- Bryan Catanzaro, Michael Garland, and Kurt Keutzer. 2011. Copperhead: Compiling an Embedded Data Parallel Language (PPoPP ’11).
- Siddhartha Chatterjee, John R. Gilbert, Robert Schreiber, and Shang-Hua Teng. 1995. Optimal Evaluation of Array Expressions on Massively Parallel Machines. ACM Trans. Program. Lang. Syst. 17, 1 (1995).
- J. C. Chaves, J. Nehrbass, B. Guilfoos, J. Gardiner, S. Ahalt, A. Krishnamurthy, J. Unpingco, A. Chalker, A. Warnock, and S. Samsi. 2006. Octave and Python: High-Level Scripting Languages Productivity and Performance Evaluation. In HPCMP Users Group Conference, 2006.
- Jeffrey Dean and Sanjay Ghemawat. 2008. MapReduce: simplified data processing on large clusters. Commun. ACM (2008).
- Tyler Denniston, Shoaib Kamil, and Saman Amarasinghe. 2016. Distributed Halide (PPoPP ’16).
- R. Eigenmann, J. Hoeflinger, and D. Padua. 1998. On the automatic parallelization of the Perfect Benchmarks(R). IEEE Transactions on Parallel and Distributed Systems 9, 1 (1998).
- Seema Hiranandani, Ken Kennedy, Chau Wen Tseng, and Scott Warren. 1994. The D Editor: A New Interactive Parallel Programming Tool (Supercomputing ’94).
- N. Islam, W. Rahman, X. Lu, D. Shankar, and D. Panda. 2015. Performance Characterization and Acceleration of In-Memory File Systems for Hadoop and Spark Applications on HPC Clusters. In 2015 IEEE International Conference on Big Data.
- Hartmut Kaiser, Thomas Heller, Bryce Adelstein-Lelbach, Adrian Serio, and Dietmar Fey. 2014. HPX: A Task Based Programming Model in a Global Address Space (PGAS ’14).
- Ken Kennedy, Charles Koelbel, and Hans Zima. 2007. The Rise and Fall of High Performance Fortran: An Historical Object Lesson (HOPL III).
- Ken Kennedy and Ulrich Kremer. 1998. Automatic Data Layout for Distributed-memory Machines. ACM Trans. Program. Lang. Syst. 20, 4 (1998).
- David Lion, Adrian Chiu, Hailong Sun, Xin Zhuang, Nikola Grcevski, and Ding Yuan. 2016. Don’t Get Caught in the Cold, Warm-up Your JVM: Understand and Eliminate JVM Warm-up Overhead in Data-Parallel Systems. In OSDI. USENIX Association.
- Martin Maas, Tim Harris, Krste Asanović, and John Kubiatowicz. 2015. Trash Day: Coordinating Garbage Collection in Distributed Systems. In HotOS.
- Frank McSherry, Michael Isard, and Derek G Murray. 2015. Scalability! but at what COST? (HotOS).
- Chris J. Newburn, Byoungro So, Zhenying Liu, Michael McCool, Anwar Ghuloum, Stefanus Du Toit, Zhi Gang Wang, Zhao Hui Du, Yongjian Chen, Gansha Wu, Peng Guo, Zhanglin Liu, and Dan Zhang. 2011. Intel’s Array Building Blocks: A Retargetable, Dynamic Compiler and Embedded Language (CGO ’11).
- Flemming Nielson, Hanne R Nielson, and Chris Hankin. 2015. Principles of program analysis. Springer.
- Kay Ousterhout, Ryan Rasti, Sylvia Ratnasamy, Scott Shenker, and Byung-Gon Chun. 2015. Making Sense of Performance in Data Analytics Frameworks (NSDI’15).
- J. S. Plank, M. Beck, and G. Kingsley. 1995. Compiler-Assisted Memory Exclusion for Fast Checkpointing. IEEE Technical Committee on Operating Systems and Application Environments 7, 4 (1995).
- Lutz Prechelt. 2000. An Empirical Comparison of Seven Programming Languages. IEEE Computer 33, 10 (Oct. 2000), 23–29.
- Jorge Luis Reyes-Ortiz, Luca Oneto, and Davide Anguita. 2015. Big Data Analytics in the Cloud: Spark on Hadoop vs MPI/OpenMP on Beowulf. In INNS Conference on Big Data 2015.
- Martin Schulz, Greg Bronevetsky, Rohit Fernandes, Daniel Marques, Keshav Pingali, and Paul Stodghill. 2004. Implementation and Evaluation of a Scalable Application-Level Checkpoint-Recovery Scheme for MPI Programs (SC ’04).
- Arvind K. Sujeeth, Kevin J. Brown, Hyoukjoong Lee, Tiark Rompf, Hassan Chafi, Martin Odersky, and Kunle Olukotun. 2014. Delite: A Compiler Architecture for Performance-Oriented Embedded Domain-Specific Languages. ACM Trans. Embed. Comput. Syst. 13, 4s (2014).
- Abdul Waheed, Michael Frumkin, Jerry Yan, Haoqiang Jin, and Michelle Hribar. 1998. A Comparison of Automatic Parallelization Tools/Compilers on the SGI Origin 2000. SC’98 (1998).
- Bo Wahlberg, Stephen Boyd, Mariette Annergren, and Yang Wang. 2012. An ADMM algorithm for a class of total variation regularized estimation problems. IFAC Proceedings Volumes 45, 16 (2012), 83–88.
- Panfeng Wang, Xuejun Yang, Hongyi Fu, Yunfei Du, Zhiyun Wang, and Jia Jia. 2008. Automated Application-level Checkpointing Based on Live-variable Analysis in MPI Programs (PPoPP ’08). 273–274.
- Tom White. 2012. Hadoop: The definitive guide. O’Reilly Media, Inc.
- Robert P. Wilson, Robert S. French, Christopher S. Wilson, Saman P. Amarasinghe, Jennifer M. Anderson, Steve W. K. Tjiang, Shih-Wei Liao, Chau-Wen Tseng, Mary W. Hall, Monica S. Lam, and John L. Hennessy. 1994. SUIF: An Infrastructure for Research on Parallelizing and Optimizing Compilers. SIGPLAN Not. 29, 12 (1994).
- John W. Young. 1974. A First Order Approximation to the Optimum Checkpoint Interval. Commun. ACM 17, 9 (Sept. 1974), 530–531.
- Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, and Ion Stoica. 2010. Spark: Cluster Computing with Working Sets. (2010).
- Bingjing Zhang, Yang Ruan, and Judy Qiu. 2015. Harp: Collective communication on hadoop (IC2E).