A Microbenchmark Characterization of the Emu Chick
The Emu Chick is a prototype system designed around the concept of migratory memory-side processing. Rather than transferring large amounts of data across power-hungry, high-latency interconnects, the Emu Chick moves lightweight thread contexts to near-memory cores before the beginning of each memory read. The current prototype hardware uses FPGAs to implement cache-less “Gossamer” cores for doing computational work and a stationary core to run basic operating system functions and migrate threads between nodes. In this multi-node characterization of the Emu Chick, we extend an earlier single-node investigation Hein et al. (2018) of the memory bandwidth characteristics of the system through benchmarks like STREAM, pointer chasing, and sparse matrix-vector multiplication. We compare the Emu Chick hardware to architectural simulation and an Intel Xeon-based platform. Our results demonstrate that for many basic operations the Emu Chick can use available memory bandwidth more efficiently than a more traditional, cache-based architecture, although bandwidth usage suffers for computationally intensive workloads like SpMV. Moreover, the Emu Chick provides stable, predictable performance with up to 65% of the peak bandwidth utilization on a random-access pointer chasing benchmark with weak locality.
Analysis of data represented as graphs, sparse tensors, and other non-regular structures poses many challenges for traditional computer architectures because the data locality of these applications typically occurs in small bursts. While individual data elements may have multiple associated attributes nearby (e.g. neighbors, weights, semantic attributes, and timestamps for streaming graph edges), analysis algorithms tend to access these small chunks in a more random fashion. Limited spatial locality in traditional analysis kernels leads to underutilizing cache lines, confounding prefetch engines, and thus reducing overall effective memory bandwidth. Furthermore, common analysis kernels may exhibit dynamic parallelism and create many data-dependent memory references. These references can stall architectures that cannot maintain enough contexts and requests in flight. Consequently, today’s “big data” platforms frequently are outperformed by a single thread accessing a large SSD McSherry et al. (2015).
This state of affairs motivates novel architectures like the Emu migratory thread system Dysart et al. (2016), the subject of this study. The Emu is a cache-less system built around “nodelets” that each execute lightweight threads. These threads migrate to data rather than moving data through a traditional cache hierarchy.
This paper expands on the first independent characterization of the Emu Chick prototype Hein et al. (2018) by exploring multiple distributed nodes that consist of those nodelets (see Section 2). Our study uses microbenchmarks and small kernels—namely, STREAM, pointer chasing, and sparse matrix-vector multiplication (SpMV)—as proxies that reflect some of the key characteristics of our motivating computations, which come from sparse and irregular applications Ediger et al. (2012); Li et al. (2016). Indeed, one larger goal of our work beyond this paper is to develop a performance-portable, Emu-compatible API for Georgia Tech’s STINGER open-source streaming graph framework Ediger et al. (2012) and ParTI ParTI (2018) tensor decomposition algorithms (e.g. CP and Tucker decomposition). Mapping such applications to the Emu architecture is difficult because thread migration makes data placement critical: programs must lay out data so as to minimize migrations.
This study’s specific demonstrations include
a detailed characterization of the Emu Chick hardware using custom Cilk kernels derived from optimized OpenMP kernels;
an analysis of memory bandwidth on the Chick system and a comparison to a more traditional cache-based architecture, with Emu results tested on 64 nodelets across eight nodes;
a discussion of memory allocation, data layout, and “smart” thread migration on the Emu architecture with respect to SpMV kernels;
an investigation and validation of the Emu architectural simulator for projecting larger configurations’ performance.
The main high-level finding is that an Emu-style architecture can more efficiently utilize available memory bandwidth while reducing the sensitivity of that bandwidth to the memory access pattern. However, achieving such results still requires careful consideration of the interplay between data layout and its effect on thread migration. Additionally, our current Chick prototype is still compute-bound for some algorithms like SpMV, which limits its utilization of available memory bandwidth when compared to a traditional CPU-based system.
2 The Emu Architecture
The Emu architecture focuses on improved random-access bandwidth scalability by migrating lightweight, Gossamer threads or “threadlets” to data and emphasizing fine-grained memory access. A general Emu system consists of the following processing elements, as illustrated in Figure 1:
A common stationary processor runs the operating system (e.g. Linux) and manages storage and network devices.
Nodelets combine narrowly banked memory with several highly multi-threaded, cache-less Gossamer cores to provide a memory-centric environment for migrating threads.
These elements are combined into nodes that are connected by a RapidIO fabric. The current generation of Emu systems includes one stationary processor for each of the eight nodelets contained within a node. System-level storage is provided by SSDs. We talk more specifically about some of the prototype limitations of our Emu Chick in Section 3. A more detailed description of the Emu architecture is available elsewhere Dysart et al. (2016).
For programmers, the Gossamer cores are transparent accelerators. The compiler infrastructure compiles the parallelized code for the Gossamer ISA, and the runtime infrastructure launches threads on the nodelets. Currently, one programs the Emu platform using Cilk Leiserson (1997), providing a path to running on the Emu for simple OpenMP programs whose translations to Cilk are straightforward. The current compiler supports the expression of task or fork-join parallelism through Cilk’s cilk_spawn and cilk_sync constructs, with a future Cilk Plus software release in progress that would include cilk_for (the nearly direct analogue of OpenMP’s parallel for). Many existing C and C++ OpenMP codes can translate almost directly to Cilk Plus.
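As a concrete illustration of this translation path, the sketch below shows an OpenMP-style parallel loop expressed with cilk_spawn/cilk_sync. It uses Cilk's serial elision (deleting the keywords yields an equivalent serial C program), so the empty macros let it compile with a plain C compiler; on the Emu toolchain the real keywords would be used. The function names and grain-size parameter are our own, not part of the Emu API.

```c
#include <assert.h>

/* Serial elision: removing cilk_spawn/cilk_sync from a Cilk program
 * yields a valid C program computing the same result. These macros
 * stand in for the real keywords so the sketch compiles with gcc. */
#define cilk_spawn
#define cilk_sync

/* Loop body of an OpenMP "parallel for": y[i] = a[i] + b[i] on [lo, hi). */
static void add_range(const long *a, const long *b, long *y,
                      long lo, long hi)
{
    for (long i = lo; i < hi; i++)
        y[i] = a[i] + b[i];
}

/* Fork-join translation: recursively split the range and spawn each
 * half, the pattern the current Emu compiler supports in place of
 * cilk_for. */
static void add_recursive(const long *a, const long *b, long *y,
                          long lo, long hi, long grain)
{
    if (hi - lo <= grain) {
        add_range(a, b, y, lo, hi);
        return;
    }
    long mid = lo + (hi - lo) / 2;
    cilk_spawn add_recursive(a, b, y, lo, mid, grain);
    add_recursive(a, b, y, mid, hi, grain);
    cilk_sync;
}
```

A cilk_for loop over the same range would compile to essentially this spawn tree, which is why the translation from OpenMP's parallel for is nearly direct.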
A launched Gossamer thread only performs local reads. Any remote read triggers a migration, which will transfer the context of the reading thread to a processor local to the memory channel containing the data. Experience on high-latency thread migration systems like Charm++ identifies migration overhead as a critical factor even in highly regular scientific codes Acun et al. (2014). The Emu system keeps thread migration overhead to a minimum by limiting the size of a thread context, implementing the transfer efficiently in hardware, and integrating migration throughout the architecture. In particular, a Gossamer thread consists of 16 general-purpose registers, a program counter, a stack counter, and status information, for a total size of less than 200 bytes. The compiled executable is replicated across the cores to ensure that instruction accesses are always local. Limiting thread context size also reduces the cost of spawning new threads for dynamic data analysis workloads. Any operating system requests are forwarded to the stationary control processors through the service queue.
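For a sense of scale, a context holding the fields listed above fits comfortably under the 200-byte bound. The struct below is purely illustrative (our own field names and packing, not Emu's actual context format): 16 eight-byte registers account for 128 bytes, leaving ample room for the program counter, stack counter, and status word.

```c
#include <stdint.h>
#include <assert.h>

/* Hypothetical sketch of a migratable Gossamer thread context, sized
 * per the text: 16 general-purpose registers plus PC, stack counter,
 * and status. Field names are illustrative, not Emu's real layout. */
struct gossamer_ctx {
    uint64_t gpr[16];   /* 16 general-purpose registers: 128 bytes */
    uint64_t pc;        /* program counter */
    uint64_t sc;        /* stack counter */
    uint64_t status;    /* thread status information */
};
```

At 152 bytes, such a context is small enough that migrating the thread can be cheaper than moving even a modest amount of data back to it.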
The highly multi-threaded Gossamer cores, which read only local memory, need neither caches nor, therefore, cache-coherency traffic. Additionally, “memory-side processors” provide atomic read or write operations that can be used to access small amounts of data without triggering unnecessary thread migrations. A node’s memory size is relatively large (64 GiB) but with multiple, narrow memory channels (8 channels with 8-bit interfaces), in order to extract weak spatial locality from data analysis kernels while maintaining low-latency read and write operations. The high degree of multi-threading also helps to cover the migration latency of the many threadlets. The Emu architecture is designed from the ground up to support high bandwidth utilization and efficiency for demanding data analysis workloads.
3 Experimental Setup
3.1 Emu Chick Prototype
The Emu Chick prototype is still in active development. The current hardware iteration uses an Arria 10 FPGA on each node card to implement the Gossamer cores, the migration engine, and the stationary cores. Several aspects of the system are scaled down in the current prototype with respect to the next-generation system which will use larger and faster FPGAs to implement computation and thread migration. The current Emu Chick prototype has the following features and limitations:
Our system has one Gossamer Core (GC) per nodelet, with at most 64 concurrent threadlets. The next-generation system will have four GCs per nodelet, supporting 256 threadlets per nodelet.
A full Chick system has 64 nodelets across eight nodes, implementing a distributed Partitioned Global Address System (PGAS) architecture that is connected by the RapidIO network.
Our GCs are clocked at 175 MHz rather than the planned 300 MHz for later-generation Emu development systems.
The Emu’s DDR4 DRAM modules are clocked at 1600 MHz rather than the full 2133 MHz.
The current Emu software version provides support for C++ but does not yet include functionality to translate Cilk Plus features like cilk_for or Cilk reducers Frigo et al. (2009). All benchmarks currently use cilk_spawn directly, which also allows more control over spawning strategies.
All experiments are run using Emu’s 18.08.1 compiler and simulator toolchain, and the Emu Chick system is running NCDIMM firmware version 2.3.0, system controller software version 3.1.0, and each stationary core is running the 2.2.3 version of software.
3.2 Emu Simulator
Emu provides a timing simulator implemented using SystemC. Along with the compiler toolchain, the simulator aids in testing and evaluating software before running on the hardware. The simulator counts key performance events such as the number of thread spawns, migrations, and memory operations per nodelet. This work employs a configuration of the simulator that matches the characteristics of a single node (8 nodelets) of the current hardware for validation. Multi-node simulations are also possible, but require much more simulation time. We compare the simulation results with the actual hardware in Section 4.4.
3.3 Common CPU-based Comparison Platform
In order to make an initial comparison of the Emu’s memory bandwidth characteristics with commodity hardware, each benchmark is also run on a four-socket Intel Xeon E7-4850 v3 (Haswell) machine with 2 TiB of DDR4 (referred to as Haswell Xeon in associated results). The CPUs on the Haswell server are each clocked at 2.20GHz and each have a 35 MiB L3 cache, while the memory is clocked at 1333 MHz (although it is rated for 2133 MHz). Each socket has a peak theoretical bandwidth of 85 GB/s.
For each benchmark, Emu-specific intrinsics (e.g. localized mallocs) are swapped out for their x86 equivalents, and the benchmarks are compiled with GCC 5.5.0. The Cilk keywords are left unchanged, allowing GCC’s Cilk runtime to implement the parallel functionality. Intel’s MKL library (version 20180001) is used for some of the SpMV comparisons made in Section 4.3. STREAM is run with default OpenMP settings including OMP_PROC_BIND=false and OMP_SCHEDULE=static.
3.4 Metrics for Comparing the Emu Prototype with Cache-Based Hardware
The architectural design choices that enable the Emu computational model (migrate threads instead of data, narrow memory channels, limited thread context) and the base platforms for the prototype (FPGAs with lower clock frequencies) make it difficult to accurately compare the Emu and CPU- or GPU-based systems in terms of execution or runtime.
Additionally, the Emu platform uses Narrow-Channel DRAM (NCDRAM) which reduces the width of the DRAM bus to 8 bits. Otherwise, the memory uses standard DDR4 chips. An 8-byte word can be transferred in a single burst. The smaller bus means that each channel of NCDRAM has only 2 GB/s of bandwidth, but the system makes up for this by having many more independent channels. Because of this, it can sustain more simultaneous fine-grained accesses than a traditional system with fewer channels and the same peak memory bandwidth specification.
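A quick back-of-envelope check of the per-channel figure, assuming one byte moves per bus transfer on the 8-bit NCDRAM interface (transfer rates in MT/s; the helper function is ours):

```c
#include <assert.h>

/* Back-of-envelope NCDRAM channel bandwidth: one byte per transfer on
 * the 8-bit bus, so GB/s is simply mega-transfers/s divided by 1000. */
static double channel_gbps(double mega_transfers_per_s)
{
    return mega_transfers_per_s * 1.0 /* byte per transfer */ / 1000.0;
}
```

At the full DDR4-2133 rate this gives roughly 2.1 GB/s per channel (the "2 GB/s" quoted above), or about 17 GB/s aggregate over a node's 8 channels; at the prototype's 1600 MT/s clock each channel delivers about 1.6 GB/s.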
Due to difficulties in comparing differently clocked architectures with different memory controller configurations, we focus our initial characterization not on runtime but on memory bandwidth (MB/s) and effective memory bandwidth utilization (% of measured peak memory bandwidth). In a CPU-based system, this might be analogous to effective cache line utilization, while in the Emu it correlates more closely to how much bandwidth can be achieved with respect to other system overheads, such as thread migration and queuing delays.
As discussed in Section 3.1, the Emu Chick toolchain currently lacks support for cilk_for and Cilk reducers. However, we present several benchmarks that use Cilk semantics to characterize the performance of the system, specifically focusing on kernels that expose the system’s memory bandwidth characteristics, including SpMV, which is key for applications like sparse tensor decomposition. For each benchmark result, we present the average memory bandwidth (usually expressed as megabytes per second) over ten trials.
3.5 Benchmarks
STREAM
The STREAM benchmark McCalpin (1995) has been ported and tuned for the Emu hardware to measure raw memory bandwidth. The ADD kernel computes the vector sum of two large arrays of 8-byte integers, storing the result in a third array. On the Emu Chick these arrays are striped across all the nodelets in the system.
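The ADD kernel itself is tiny; the serial sketch below shows the computation and the bandwidth accounting (24 bytes moved per element: two 8-byte reads and one 8-byte write). On the Chick, the arrays would be striped across nodelets and the loop split over spawned threads; the helper names here are ours.

```c
#include <stdint.h>
#include <assert.h>

/* STREAM ADD kernel over 8-byte integers: c[i] = a[i] + b[i].
 * Serial version; on the Chick the index range is split over threads
 * and the arrays are striped across nodelets. */
static void stream_add(const int64_t *a, const int64_t *b, int64_t *c,
                       long n)
{
    for (long i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

/* Reported bandwidth: 24 bytes move per iteration
 * (two 8-byte loads plus one 8-byte store). */
static double stream_add_mbps(long n, double seconds)
{
    return (24.0 * (double)n) / (seconds * 1e6);
}
```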
This benchmark demonstrates that the thread spawning strategy is important for performance: different spawning mechanisms achieve different memory bandwidths. Our tests control thread spawning directly and do not rely on cilk_for. The spawning methods follow spawn trees that are briefly described as follows:
serial_spawn: threads spawn locally on a single nodelet using a for loop,
recursive_spawn: threads are spawned locally using recursive calls,
serial_remote_spawn: threads are spawned on each nodelet, which in turn uses a for loop to spawn threads locally, and
recursive_remote_spawn: threads are spawned recursively across all nodelets, and then each nodelet recursively spawns new threads locally.
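The two-level remote strategies above can be sketched as follows, again using serial-elision macros in place of the Cilk keywords so the sketch compiles as plain C. On real Emu hardware the first-level spawns would carry a nodelet placement hint (noted only in comments here); the counters and function names are our own instrumentation.

```c
#include <assert.h>

#define cilk_spawn
#define cilk_sync
#define NODELETS 8

static int spawned[NODELETS];   /* workers started on each nodelet */

/* Placeholder for local work: on hardware this thread would process
 * its nodelet-resident slice of the striped arrays. */
static void worker(int nodelet) { spawned[nodelet]++; }

/* Second level (shared by both "remote" strategies): the one thread
 * per nodelet uses a for loop to spawn its local workers. */
static void local_spawner(int nodelet, int threads_per_nodelet)
{
    for (int t = 0; t < threads_per_nodelet; t++)
        cilk_spawn worker(nodelet);
    cilk_sync;
}

/* First level of recursive_remote_spawn: a binary spawn tree over the
 * nodelet range [lo, hi). On Emu each spawned thread would be placed
 * on (or migrate to) the first nodelet of its half-range. */
static void remote_spawn_tree(int lo, int hi, int threads_per_nodelet)
{
    if (hi - lo == 1) {
        local_spawner(lo, threads_per_nodelet);
        return;
    }
    int mid = lo + (hi - lo) / 2;
    cilk_spawn remote_spawn_tree(lo, mid, threads_per_nodelet);
    remote_spawn_tree(mid, hi, threads_per_nodelet);
    cilk_sync;
}
```

serial_remote_spawn replaces the tree with a simple loop over nodelets; serial_spawn and recursive_spawn drop the first level entirely and spawn only locally.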
Pointer Chasing
In this benchmark, each thread adds up all the elements in a linked list. Each element consists of an 8-byte payload and an 8-byte pointer to the next element. After the elements of this linked list are grouped into blocks, their ordering is randomized. This permutation may be applied to the ordering of the elements within each block (intra_block_shuffle), or the ordering of the blocks themselves (block_shuffle), or both (full_block_shuffle). The block size is also varied to emulate different levels of spatial locality that may arise in a workload. Figure 2 explains the list initialization further.
The pointer chasing benchmark has three key properties by design.
Data-dependent loads: Memory-level parallelism is severely limited since each thread must wait for one pointer dereference to complete before accessing the next pointer.
Fine-grained accesses: Spatial locality is restricted since all accesses are at a 16B granularity. This is smaller than a 64B cache line on x86 platforms, and much smaller than a typical DRAM page size.
Random access pattern: Since each block of memory is read exactly once in random order, caching and prefetching are mostly ineffective.
The pointer chasing benchmark simulates a worst-case memory fragmentation scenario that can arise in memory-intensive workloads such as streaming graph analytics. When small list elements are dynamically allocated and deallocated from a shared memory pool, the resulting data structure will exhibit all of these characteristics when it is traversed. The pointer chase benchmark is otherwise quite similar to the GUPS/RandomAccess benchmark Luszczek et al. (2006); however, GUPS lacks data-dependent loads, and pointer chase does not modify the list.
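A minimal sketch of the list construction and traversal, here for the block_shuffle variant (blocks permuted, elements within a block kept in order). The 16-byte element matches the description above; the builder and a Fisher-Yates shuffle are our own illustrative code, not the benchmark's actual implementation.

```c
#include <stdlib.h>
#include <stdint.h>
#include <assert.h>

/* One 16-byte list element: 8-byte payload + 8-byte next pointer. */
struct elem { int64_t payload; struct elem *next; };

/* Link n pool elements into a list of blocks of block_size elements,
 * visiting the blocks in a randomized order (block_shuffle). */
static struct elem *build_list(struct elem *pool, long n, long block_size,
                               unsigned seed)
{
    long nblocks = n / block_size;
    long *order = malloc(nblocks * sizeof(long));
    srand(seed);
    for (long b = 0; b < nblocks; b++) order[b] = b;
    for (long b = nblocks - 1; b > 0; b--) {   /* Fisher-Yates shuffle */
        long j = rand() % (b + 1);
        long tmp = order[b]; order[b] = order[j]; order[j] = tmp;
    }
    struct elem *head = NULL, **tail = &head;
    for (long b = 0; b < nblocks; b++)         /* chain shuffled blocks */
        for (long i = 0; i < block_size; i++) {
            struct elem *e = &pool[order[b] * block_size + i];
            e->payload = 1;
            *tail = e; tail = &e->next;
        }
    *tail = NULL;
    free(order);
    return head;
}

/* The chase itself: a serial chain of data-dependent loads, one per
 * 16-byte element, summing the payloads. */
static int64_t chase(const struct elem *e)
{
    int64_t sum = 0;
    for (; e != NULL; e = e->next) sum += e->payload;
    return sum;
}
```

intra_block_shuffle would instead permute the element order inside each block, and full_block_shuffle applies both permutations.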
Sparse Matrix - Vector Multiplication (SpMV)
In addition to being a fundamental kernel for graph analytics and sparse tensor decomposition applications, SpMV provides an opportunity to investigate data layout strategies on the Emu’s global physical address space. Emu provides a “local” malloc similar to a traditional contiguous malloc (mw_localmalloc) as well as a “striped” malloc (mw_malloc1dlong) that places data in a round-robin fashion across nodelets and a 2D malloc (mw_malloc2d) that stripes entire data structures across nodelets.
Figure 3 demonstrates the three layouts that are tested with inputs in Compressed Sparse Row (CSR) format. In the local case, contiguous mallocs are used to place the output matrix, Y, the input CSR matrix, V, the row pointer and column index arrays and the vector, X, all on a single node. For the 1D layout, mw_malloc1dlong is used to stripe the input matrix and row and column arrays across the nodelets (and across nodes in the multi-node case) while the output matrix is on nodelet 0 and X is replicated across all nodelets. For the 2D allocation, we use a two-stage allocation rather than Emu’s 2D malloc to partition V across multiple nodelets. First, the lengths of each row that is assigned to a nodelet are computed and then data for V and the column index array is allocated on each nodelet using mw_malloc1dlong. X is replicated across each nodelet and the output is placed on nodelet 0 in both the 1D and 2D cases.
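For reference, the computation these layouts feed is the standard CSR SpMV loop sketched below (serial C, our own function name). Under the 2D layout, each nodelet would own a band of rows (its own vals/col_idx allocation) plus a replica of x, so the inner loop runs without thread migrations; under the 1D layout, consecutive nonzeros of a row live on different nodelets and each one can trigger a migration.

```c
#include <assert.h>

/* Minimal CSR sparse matrix-vector multiply, y = V*x.
 * row_ptr has nrows+1 entries; col_idx/vals hold the nonzeros of V. */
static void spmv_csr(long nrows, const long *row_ptr, const long *col_idx,
                     const double *vals, const double *x, double *y)
{
    for (long r = 0; r < nrows; r++) {
        double acc = 0.0;
        /* With a 2D (row-band) layout, this inner loop touches only
         * nodelet-local vals/col_idx and the local replica of x. */
        for (long k = row_ptr[r]; k < row_ptr[r + 1]; k++)
            acc += vals[k] * x[col_idx[k]];
        y[r] = acc;
    }
}
```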
This benchmark uses these different layout strategies to test performance for placing all data within a nodelet and striping it in a 1D and 2D fashion across multiple nodelets. In the 2D case no thread migrations occur when accessing elements in the same row, as opposed to a migration for every element within a row in the 1D layout. Synthetic Laplacian matrix inputs are created corresponding to a k-dimensional, (2k+1)-point stencil on a grid of length n in each dimension. For the tested synthetic matrices, k = 2, meaning that a Laplacian size of n specifies a sparse matrix corresponding to a 5-point, 2-D stencil, which is an n^2-by-n^2 matrix with 5 diagonals. CPU tests are run on the Haswell Xeon-based system mentioned in Section 3.3, using SpMV from Intel’s Math Kernel Library (MKL) with MKL_MAX_THREADS set to 56 (the number of physical cores in the system, as opposed to total hardware threads). We include two Cilk SpMV kernels for comparison, labeled cilk_for and cilk_spawn, which are written with the respective Cilk primitives, compiled using GCC 5.5.0, and run with CILK_NWORKERS set to 56. Data is distributed across NUMA regions using numactl --interleave=0-3.
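A generator for the k = 2 case can make the structure concrete: for an n-by-n grid the matrix has n^2 rows, each interior row holding a -4 center entry plus four +1 neighbors, for 5n^2 - 4n nonzeros overall once boundary rows are accounted for. The CSR builder below is an illustrative sketch (our own code, with the conventional -4/+1 stencil weights), not the benchmark's generator.

```c
#include <assert.h>

/* Build the CSR form of the 5-point 2-D Laplacian on an n-by-n grid
 * (rows = n*n, nnz = 5n^2 - 4n). Columns within each row are emitted
 * in ascending order. Returns the number of nonzeros written. */
static long laplacian5(long n, long *row_ptr, long *col_idx, double *vals)
{
    long nnz = 0;
    for (long i = 0; i < n; i++)
        for (long j = 0; j < n; j++) {
            long r = i * n + j;
            row_ptr[r] = nnz;
            if (i > 0)     { col_idx[nnz] = r - n; vals[nnz++] = 1.0; }
            if (j > 0)     { col_idx[nnz] = r - 1; vals[nnz++] = 1.0; }
            col_idx[nnz] = r; vals[nnz++] = -4.0;   /* center point */
            if (j < n - 1) { col_idx[nnz] = r + 1; vals[nnz++] = 1.0; }
            if (i < n - 1) { col_idx[nnz] = r + n; vals[nnz++] = 1.0; }
        }
    row_ptr[n * n] = nnz;
    return nnz;
}
```

The five branches correspond exactly to the matrix's five diagonals (offsets -n, -1, 0, +1, +n).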
Simulation validation results (Section 4.4) demonstrate a need for a more fine-grained microbenchmark to illustrate potential differences between the hardware and simulated Emu platforms. To explore the cause of this discrepancy, we present another small benchmark called ping pong migration. This micro-benchmark measures the bandwidth of thread migrations on the Emu Chick. In each trial, threads simply migrate back and forth between two nodelets several thousand times.
4 Results
The updated characterization of the Emu Chick repeats the STREAM, pointer chasing, and SpMV experiments that were initially investigated in Hein et al. (2018), but these experiments focus on further characterizing the entire Chick “multi-node” system that uses all 64 nodelets (across 8 nodes) within the Chick. Single node results are presented for specific experiments including SpMV layout and simulation, primarily due to limitations in the current firmware (i.e., certain configurations encounter hardware faults) and slowness of the Emu architectural simulator.
4.1 STREAM
Figure 4 shows the results from running the STREAM benchmark on a single Emu nodelet. Performance scales up with thread count through 32 threads and then plateaus. Two methods of thread creation are tested here. In the serial_spawn strategy, a single thread uses a for loop to create each worker thread, while recursive_spawn uses a recursive spawn tree. There is not much difference between the two approaches, indicating that thread creation is not terribly expensive on the Emu platform.
In Figure 4(a), we extend the STREAM benchmark to run on eight nodelets (one node card) of the Emu Chick. Two new thread creation strategies are introduced here, serial_remote_spawn and recursive_remote_spawn. A remote spawn on Emu means that the thread is created on a remote nodelet, rather than being created locally and allowed to migrate to the remote data. The “remote” thread creation strategies first create a thread on each nodelet (either one at a time or with a recursive spawn tree), and then perform a second level of spawning on the local nodelet, as in the single nodelet case. Figure 4(b) extends the analysis to 64 nodelets and up to 4096 threads showing that recursive remote spawn continues to scale for large numbers of threads up to 12 GB/s across 8 nodes. Both sets of results show that remote spawns are essential to achieving maximum bandwidth on Emu.
In comparison to the Emu, our reference Xeon system (Haswell) achieves 100 GB/s on the STREAM benchmark (with an interleaved NUMA layout across four sockets), while the Emu Chick has a maximum STREAM bandwidth of 12 GB/s. The Emu bandwidth is currently limited by CPU speed and thread count rather than DDR bus transfer rates. However, even with this prototype system, we can observe improvements in other benchmarks where the memory access pattern is not as linear and predictable as it is with STREAM.
4.2 Pointer Chasing
Figures 6 and 7 compare the performance of the Emu Chick against our Haswell Xeon server system for the pointer chasing benchmark. These results reveal important characteristics of both systems and highlight the unique advantages of the Emu Chick.
Pointer chasing on the Xeon architecture performs poorly for several reasons. For small block sizes, the memory system bandwidth is used inefficiently. An entire 64-byte cache line must be transferred from memory, but only 16 bytes will be used. The best performance is achieved with a block size between 256 and 4096 elements. This corresponds to a memory chunk of about 8KiB, the size of one DRAM page. Regardless of the size of the access, an entire DRAM row must be activated for each element traversed. Adding more threads at this point increases the number of simultaneous row activations. As the block size grows beyond the size of a DRAM page, performance declines again.
Performance on Emu remains mostly flat regardless of block size. Emu’s memory access granularity is 8 bytes, so it never transfers unused data in this benchmark. As long as a block fits within a single nodelet’s local memory channel, there is no penalty for random access within the block. However, a block size of 1 is an interesting case: here Emu threads are likely to migrate on every access, so performance is greatly reduced. Performance recovers when even as few as four elements are accessed between migrations.
Figure 8 shows the normalized bandwidth usage (i.e., effective bandwidth usage) for the Haswell and Emu systems. The performance of each system has been normalized to the peak measured bandwidth of the system (i.e., the best result on the STREAM benchmark). In the pointer chasing benchmark, the Emu system is much better at using the available system bandwidth, using 65% of available system bandwidth in most cases and 25% in the worst case. The Haswell Xeon uses less than 50% of peak bandwidth in most cases and less than 10% in the worst case, relying on multi-kilobyte levels of locality to efficiently transfer the data. These results bode well both for the targeted streaming graph and tensor decomposition applications which have pointer chasing behavior and rely on random accesses to compute SpMV and SpMM (sparse matrix-matrix product) operations, respectively.
4.3 Sparse Matrix-Vector Multiplication
Figure 9 shows the memory bandwidth achieved by SpMV on a single node (8 nodelets) of the Chick using each of the three data layout strategies. The local layout on the Emu suffers from a limited amount of thread parallelism while the 1D layout suffers from a large number of thread migrations, resulting in maximum bandwidths of 96.13 MB/s and 148.66 MB/s, respectively. Overall, the 2D memory layout provides the only scalable solution for SpMV, scaling up to 561.48 MB/s for n=300. Notably, the local and 1D layouts do not scale to multi-node tests due to large numbers of thread migrations and a confirmed hardware bug related to handling large numbers of thread migrations between nodes.
Figure 9(a) shows results for experiments run on the Haswell Xeon machine with four different code configurations and two NUMA layouts: the default “current” policy (place all data in the current node) and “interleaved”, which interleaves data across four sockets at a page granularity. 1) MKL refers to an implementation of SpMV using Intel’s MKL library, 2) cilk_for is a native Cilk implementation using cilk_for, 3) cilk_spawn is a native Cilk implementation using cilk_spawn, and 4) emu is an implementation of SpMV that is backported from the Emu-optimized version of the code. This last version of the code includes the 2D layout optimization described in Section 3.5 and evaluated in Figure 9.
The Haswell Xeon results in Figure 9(a) show that the MKL implementation of SpMV with the “current” NUMA layout achieves the highest bandwidth, getting close to 175 GB/s across the four sockets. Meanwhile, cilk_for and cilk_spawn show similar scaling up to n=200, with the interleaved versions of both implementations closely mirroring each other’s performance from n=500 to n=8000. Finally, the Emu “backported” code only shows scalable performance with the NUMA interleaved setting, peaking at n=8000 and around 50 GB/s. While it is unclear at the moment why the MKL version scales so well with a non-interleaved data layout (we suspect it may have to do with first-touch layouts being amenable to relevant MKL data structures and computation), the performance of the Emu backported code meshes well with our understanding of the Emu system as a distributed PGAS machine. That is, the optimal 2D layout of SpMV for the Emu keeps data accesses “local” on the Emu, but a normal x86 NUMA allocator does not stripe data like Emu’s allocator does, meaning that most data accesses in the x86 emu “current” setup are remote NUMA accesses. Furthermore, x86 NUMA interleaved layouts have a much larger striping granularity (pages versus array elements on the Emu), which likely also penalizes the “emu (interleaved)” implementation.
Following the Haswell Xeon results, we compare the total percentage of STREAM bandwidth that is achieved for SpMV on the Emu versus on the Haswell machine in Figure 9(b). The Emu results use the peak multi-node STREAM bandwidth of 12 GB/s and are compared to the Haswell STREAM peak without NUMA interleaving, which peaks at 175 GB/s. In the case of the Haswell results, the best case SpMV (MKL non-interleaved from Figure 9(a)) is used as the comparison point. As opposed to the pointer chasing results in Figure 8, we see that both systems scale in terms of bandwidth utilization as the amount of synthetic data is increased, with the Emu peaking at about 50% of peak STREAM bandwidth versus the Haswell system’s 80% of peak STREAM bandwidth. Because it requires roughly twice as many computations per element compared to STREAM and pointer chasing, we speculate that SpMV is actually compute-bound in this case. Between the address calculations and the multiply-and-accumulate, the 175 MHz Gossamer Cores cannot generate loads quickly enough to saturate the available memory bandwidth.
4.4 Emu Simulation Validation and Prediction
We wish to predict the performance of an Emu Chick system operating at full speed, as well as larger configurations, by using the provided Emu simulator. First, we validate the simulator by configuring it to match the specifications of our current hardware system. The results of this evaluation are displayed in Figure 11. While the STREAM benchmark results match well for both single nodelet and multi-nodelet operation, the pointer chase benchmark results do not. Despite the error in magnitude, the shape of the results matches well.
To help explain this difference, Figure 10(d) shows results from the hardware and simulated ping pong benchmark. While the simulator can perform 16 million migrations per second, the hardware is currently limited to only 9 million migrations per second. Since pointer chasing is a migration-heavy benchmark, the performance of the thread migration engine affects its performance to a much greater degree than STREAM. Our experiments indicate that the latency for a single thread migration on the current system is approximately 1 µs. The peak migration bandwidth is 11 million migrations per second.
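These two measurements can be sanity-checked against each other with Little's law (in-flight requests = throughput x latency); the helper below is our own back-of-envelope calculation, not a measurement from the benchmark.

```c
#include <assert.h>

/* Little's law: average number of migrations in flight equals
 * migration throughput multiplied by per-migration latency. */
static double migrations_in_flight(double migrations_per_s,
                                   double latency_s)
{
    return migrations_per_s * latency_s;
}
```

Sustaining the 11 million migrations/s peak at roughly 1 µs per migration implies only about 11 migrations in flight at once, far below the 64-threadlet concurrency of a nodelet, which is consistent with the migration engine, rather than thread count, being the bottleneck in migration-heavy runs.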
Figure 12 further shows the results for SpMV with a 2D layout on 1 node (8 nodelets). We again see that the simulator is slightly more optimistic in terms of predicted performance especially at higher values of n.
5 Discussion
This characterization raises important topics for programming memory-centric architectures like the Emu Chick and also for building realistic comparisons between prototype novel architectures and existing architectures.
5.1 Impacts of Data Location and Thread Migration
While the results from SpMV demonstrate that data layout can have an impact on performance on the Emu, application performance also depends on where threads are spawned and how many migrations occur between nodes and nodelets. In the initial development of our benchmarks, we debated explicitly minimizing thread movement and keeping computation local to a specific node. However, this strategy both goes against the “lightweight, migrating threadlets” model of computation with the Emu and is hard to implement in practice.
For this reason, we have settled on a strategy of “smart thread migration” for future benchmarking and application development with the Emu system. In short, this means 1) using “smart” thread spawn techniques like the two-level recursive remote spawn as in Section 4.1, 2) using replicated allocations for commonly used inputs like the vector X in the SpMV benchmark, and 3) picking the appropriate layout strategy for the application. In this last case, it is likely that good application performance will be most easily achieved through proper data layouts, as with CSR SpMV’s striped allocation across nodelets and per-nodelet secondary allocation for different-length rows. In this sense, we have created our own custom 2D allocator for SpMV, but we expect that higher-level memory allocation constructs will eventually be supported to help use the Emu’s novel global address space layout.
5.2 Performance Models and Comparisons to Existing Architectures
One of the challenges in evaluating a drastically different architecture like the Emu is performing a realistic comparison between a prototype architecture and existing platforms using CPUs or other mainstream accelerators. Many aspects of the prototype Emu Chick present challenges. The Chick is a cacheless architecture and uses thread migration and atomic operations to avoid buffering large chunks of data. Even when compared with accelerators like GPUs, the low-latency access of the Chick, different memory clock speeds and data widths, and the lack of shared memory or caches provide a challenge for modeling how much more “efficient” the Chick is in terms of memory bandwidth. As shown in Section 4, different STREAM numbers for x86 systems based on NUMA interleaving settings also complicate this comparison. Additionally, the Chick is a full-scale prototype built using FPGA devices, which are useful for their flexibility and customization capabilities but naturally are slower than a traditional, hardwired ASIC. Firmware upgrades to the Chick prototype can also affect application performance dramatically by changing the Gossamer cores’ maximum frequency and by adding new functionality.
These comparison challenges are common not only to the Emu Chick but also to other new, experimental hardware like neuromorphic and quantum computing platforms. We may need to define additional metrics to supplement traditional characterization metrics like performance (FLOPS), memory bandwidth balance (FLOPS/B), and power efficiency (FLOPS/W). While we do not yet have enough application experience with the Emu Chick to fully define new metrics, we propose that there is promise in comparison metrics that highlight the differences listed above. For example, a cache-less system like the Emu Chick may not physically move bulk data across the system, but a metric comparable to one for a traditional CPU-based system might be some combination of network traffic (i.e., threads migrated, measured via context size and time, in B/s) and cache misses avoided (B/s). In future work we plan to investigate how to better model and define these differences in order to quantify not just the high-level application benefits of novel architectures like the Chick but also the fundamental qualities that determine which applications are the best fit for these new architectures.
6 Related Work
Advances in memory and integration technologies provide opportunities for profitably moving computation closer to data Siegl et al. (2016). Some proposed architectures return to the older processor-in-memory (PIM) and “intelligent RAM” Patterson et al. (1997) ideas. Simulations of architectures focusing on near-data processing Gao et al. (2015), including in-memory Finkbeiner et al. (2017) and near-memory Farmahini-Farahani et al. (2015) designs, show great promise for increasing performance while also drastically reducing energy usage. Other than our previous study Hein et al. (2018) and related work on characterizing the Emu by other research groups Belviranli et al. (2018); Minutoli et al. (2015), few of these architectures have been implemented in hardware, even on FPGAs, limiting the data scales on which applications can be evaluated.
Other hardware architectures have tackled massive-scale data analysis with differing degrees of success. The Cray XMT Mizell and Maschhoff (2009) could achieve high bandwidth utilization by tolerating long memory latencies in applications that could produce enough threads. Another approach is to push memory-centric aspects to an accelerator, such as the Sparc M7’s data analytics accelerator Aingaran et al. (2015) for database operations or Graphicionado Ham et al. (2016) for graph analysis.
Moving computation to data via software has a successful history in supercomputers and clusters via Charm++ Acun et al. (2014), which manages dynamic load balancing on distributed-memory systems by migrating computational objects. Earlier data analysis systems like Hadoop moved computation to data when the network was a bottleneck, but that no longer appears to be useful Ananthanarayanan et al. (2011).
Finally, algorithms research related to SpMV could prove beneficial to future implementations for Emu-like architectures. State-of-the-art SpMV formats and algorithms such as SparseX, which uses the Compressed Sparse eXtended (CSX) format for storing matrices Elafrou et al. (2018), provide alternative data structures and layouts that could improve the performance of SpMV-based operations on the Emu.
7 Conclusion
Our microbenchmark evaluation of the Emu Chick demonstrates some of the limitations of the existing prototype system as well as some potential benefits for massive data analytics applications like streaming graph analytics and sparse tensor decomposition. We demonstrate multi-nodelet (64 nodelets across 8 nodes) performance for a variety of benchmarks including STREAM, pointer chasing, and SpMV. Initial results show relatively low overall bandwidth for the Emu system, with a peak of 12 GB/s STREAM bandwidth for the current Chick prototype (compared to 80+ GB/s on a Haswell CPU server socket). However, we also show that algorithms implemented on the Emu can achieve a high percentage of effective memory bandwidth even in a worst-case access scenario like pointer chasing: the benchmark in Section 4.2 sustains a stable 60-65% bandwidth utilization across a wide range of locality parameters. These pointer chasing results and our data layout studies show how random accesses such as those in SpMV can be improved. While SpMV performance does not quite match a well-optimized x86 implementation, these optimizations provide a template for future benchmarking and application development and show how application memory layouts and “smart” thread migration can be used to maximize performance on the Emu system.
This work was supported in part by NSF Grant ACI-1339745 (XScala), NSF Grant OAC-1710371 (SuperSTARLU), and IARPA. Thanks also to the Emu Technology team for support and debugging assistance with the Emu Chick prototype.
- Hein et al. (2018) E. Hein, T. Conte, J. Young, S. Eswar, J. Li, P. Lavin, R. Vuduc, J. Riedy, An initial characterization of the emu chick, in: 2018 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), 2018, pp. 579–588. doi:10.1109/IPDPSW.2018.00097.
- McSherry et al. (2015) F. McSherry, M. Isard, D. G. Murray, Scalability! but at what COST?, in: 15th Workshop on Hot Topics in Operating Systems (HotOS XV), USENIX Association, Kartause Ittingen, Switzerland, 2015.
- Dysart et al. (2016) T. Dysart, P. Kogge, M. Deneroff, E. Bovell, P. Briggs, J. Brockman, K. Jacobsen, Y. Juan, S. Kuntz, R. Lethin, Highly scalable near memory processing with migrating threads on the Emu system architecture, in: Irregular Applications: Architecture and Algorithms (IA3), Workshop on, IEEE, 2016, pp. 2–9.
- Ediger et al. (2012) D. Ediger, R. McColl, J. Riedy, D. A. Bader, STINGER: High performance data structure for streaming graphs, in: The IEEE High Performance Extreme Computing Conference (HPEC), Waltham, MA, 2012. doi:10.1109/HPEC.2012.6408680.
- Li et al. (2016) J. Li, Y. Ma, C. Yan, R. Vuduc, Optimizing sparse tensor times matrix on multi-core and many-core architectures, in: 2016 6th Workshop on Irregular Applications: Architecture and Algorithms (IA3), 2016, pp. 26–33. doi:10.1109/IA3.2016.010.
- ParTI (2018) ParTI, ParTI Github, online, 2018. URL: https://github.com/hpcgarage/ParTI.
- Leiserson (1997) C. E. Leiserson, Programming irregular parallel applications in Cilk, in: International Symposium on Solving Irregularly Structured Problems in Parallel, Springer, 1997, pp. 61–71.
- Acun et al. (2014) B. Acun, A. Gupta, N. Jain, A. Langer, H. Menon, E. Mikida, X. Ni, M. Robson, Y. Sun, E. Totoni, L. Wesolowski, L. Kale, Parallel programming with migratable objects: Charm++ in practice, in: SC14: International Conference for High Performance Computing, Networking, Storage and Analysis, 2014, pp. 647–658. doi:10.1109/SC.2014.58.
- Frigo et al. (2009) M. Frigo, P. Halpern, C. E. Leiserson, S. Lewin-Berlin, Reducers and other Cilk++ hyperobjects, in: Proceedings of the Twenty-first Annual Symposium on Parallelism in Algorithms and Architectures, SPAA ’09, ACM, New York, NY, USA, 2009, pp. 79–90. doi:10.1145/1583991.1584017.
- McCalpin (1995) J. D. McCalpin, Memory bandwidth and machine balance in current high performance computers, IEEE Computer Society Technical Committee on Computer Architecture (TCCA) Newsletter (1995) 19–25.
- Luszczek et al. (2006) P. R. Luszczek, D. H. Bailey, J. J. Dongarra, J. Kepner, R. F. Lucas, R. Rabenseifner, D. Takahashi, The HPC Challenge (HPCC) benchmark suite, in: Proceedings of the 2006 ACM/IEEE Conference on Supercomputing, SC '06, 2006. doi:10.1145/1188455.1188677.
- Siegl et al. (2016) P. Siegl, R. Buchty, M. Berekovic, Data-centric computing frontiers: A survey on processing-in-memory, in: Proceedings of the Second International Symposium on Memory Systems, MEMSYS ’16, ACM, New York, NY, USA, 2016, pp. 295–308. doi:10.1145/2989081.2989087.
- Patterson et al. (1997) D. Patterson, T. Anderson, N. Cardwell, R. Fromm, K. Keeton, C. Kozyrakis, R. Thomas, K. Yelick, A case for intelligent RAM, IEEE Micro 17 (1997) 34–44. doi:10.1109/40.592312.
- Gao et al. (2015) M. Gao, G. Ayers, C. Kozyrakis, Practical near-data processing for in-memory analytics frameworks, in: 2015 International Conference on Parallel Architecture and Compilation (PACT), 2015, pp. 113–124. doi:10.1109/PACT.2015.22.
- Finkbeiner et al. (2017) T. Finkbeiner, G. Hush, T. Larsen, P. Lea, J. Leidel, T. Manning, In-memory intelligence, IEEE Micro 37 (2017) 30–38. doi:10.1109/MM.2017.3211117.
- Farmahini-Farahani et al. (2015) A. Farmahini-Farahani, J. H. Ahn, K. Morrow, N. S. Kim, NDA: Near-DRAM acceleration architecture leveraging commodity DRAM devices and standard memory modules, in: 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA), 2015, pp. 283–295. doi:10.1109/HPCA.2015.7056040.
- Belviranli et al. (2018) M. Belviranli, S. Lee, J. S. Vetter, Designing algorithms for the EMU migrating-threads-based architecture, High Performance Extreme Computing Conference 2018 (2018).
- Minutoli et al. (2015) M. Minutoli, S. Kuntz, A. Tumeo, P. Kogge, Implementing radix sort on Emu 1, in: 3rd Workshop on Near-Data Processing (WoNDP), 2015.
- Mizell and Maschhoff (2009) D. Mizell, K. Maschhoff, Early experiences with large-scale Cray XMT systems, in: 2009 IEEE International Symposium on Parallel Distributed Processing, 2009, pp. 1–9. doi:10.1109/IPDPS.2009.5161108.
- Aingaran et al. (2015) K. Aingaran, S. Jairath, G. Konstadinidis, S. Leung, P. Loewenstein, C. McAllister, S. Phillips, Z. Radovic, R. Sivaramakrishnan, D. Smentek, T. Wicki, M7: Oracle’s next-generation SPARC processor, IEEE Micro 35 (2015) 36–45. doi:10.1109/MM.2015.35.
- Ham et al. (2016) T. J. Ham, L. Wu, N. Sundaram, N. Satish, M. Martonosi, Graphicionado: A high-performance and energy-efficient accelerator for graph analytics, in: 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2016, pp. 1–13. doi:10.1109/MICRO.2016.7783759.
- Ananthanarayanan et al. (2011) G. Ananthanarayanan, A. Ghodsi, S. Shenker, I. Stoica, Disk-locality in datacenter computing considered irrelevant, in: Proceedings of the 13th USENIX Conference on Hot Topics in Operating Systems, HotOS’13, USENIX Association, Berkeley, CA, USA, 2011, pp. 12–12.
- Elafrou et al. (2018) A. Elafrou, V. Karakasis, T. Gkountouvas, K. Kourtis, G. Goumas, N. Koziris, SparseX: A library for high-performance sparse matrix-vector multiplication on multicore platforms, ACM Trans. Math. Softw. 44 (2018) 26:1–26:32. doi:10.1145/3134442.