Exploiting Fine-Grain Ordered Parallelism in Dense Matrix Algorithms

Jian Weng, Vidushi Dadu, Tony Nowatzki

University of California, Los Angeles
{jian.weng, vidushi.dadu, tjn}@cs.ucla.edu
Abstract

Dense linear algebra kernels are critical for wireless applications, and the oncoming proliferation of 5G only amplifies their importance. Many such matrix algorithms are inductive, and exhibit ample amounts of fine-grain ordered parallelism: multiple computation flows with fine-grain producer/consumer dependences, over an iteration domain that is not easily tileable. Synchronization overheads make multi-core parallelism ineffective, and the non-tileable iterations make the vector-VLIW approach less effective, especially for the typically modest-sized matrices.

Because CPUs and DSPs lose an order of magnitude in performance and hardware utilization, costly and inflexible ASICs are often employed in signal-processing pipelines. A programmable accelerator with similar performance/power/area would be highly desirable. We find that fine-grain ordered parallelism can be exploited by supporting: 1. fine-grain stream-based communication/synchronization; 2. inductive data-reuse and memory access patterns; 3. implicit vector-masking for partial vectors; 4. hardware specialization for dataflow criticality.

In this work, we propose REVEL as a next-generation DSP architecture. It supports the above features in its ISA and microarchitecture, and further uses a novel vector-stream control paradigm to reduce control overheads. Across a suite of linear algebra kernels, REVEL outperforms equally-provisioned DSPs by 4.6\times-37\times in latency, and achieves 8.3\times the performance per mm{}^{2}. To match the performance of ideal ASICs it requires only 2.2\times the power, at about 55% of their combined area.


Dense linear algebra kernels, like matrix factorization, multiplication, decomposition and FFT, have for decades been the computational workhorses of signal processing across standards, specifications, and device settings. The oncoming proliferation of 5G wireless is only further pushing the computational demands, both in performance and energy efficiency. Driven by the need for higher capacity and by applications like augmented and virtual reality [?], new standards will require signal processing at more than an order-of-magnitude higher throughput and lower latency.

Despite their ubiquity, many important dense matrix operations are far from trivial to parallelize and compute at high hardware efficiency. As evidence, Figure 1 shows the hardware utilization (based on max. vector issue width) of a modern CPU and DSP running common DSP algorithms from native application suites (e.g., Intel MKL and TI DSPLIB). For algorithms without fine-grain dependences (GEMM, FIR, and FFT), a reasonable utilization is achieved, usually between 30-80%. However, for factorization/decomposition (SVD, QR, Cholesky, Solver), the utilization is exceptionally poor, generally between 5%-20%. Even this measure is generous, as we only consider the maximum throughput of a single core, yet there is enough raw parallelism to multithread. Empirically, however, the MKL and TI libraries do not even invoke multiple threads at the commonly-small matrix sizes required, due to synchronization overheads. CPUs and DSPs leave untapped factors of performance/hardware utilization.

Figure 1: Percent peak performance of CPU (Intel Xeon 4116) and DSP (TI C6678) on DSP kernels
Figure 2: FGOP Example: Triangular Linear Solver

The challenge and opportunity comes from the dominant form of parallelism in these workloads, which we call fine-grain ordered parallelism (FGOP). FGOP consists of fine-grain producer/consumer relationships between otherwise parallel computations, where the rate of production-to-consumption, the rate of data reuse, and the memory access relation are affine functions of the induction variables. This results from the iterative and inductive nature of these algorithms, as they operate on relatively small matrix sizes.

To substantiate, consider the triangular solver in Figure 2(a). Its iteration space diagram, Figure 2(b), reveals the many fine-grain dependences that make profitable multithreading between regions impossible. Furthermore, the inner-loop trip count changes inductively, leading to many iterations that are difficult to vectorize. Nevertheless, an architecture can be designed to exploit FGOP; the potential is shown in Figure 2(c,d). If dependences between regions can be enforced at a fine grain with low overhead, then overlap between regions becomes possible, increasing the parallelism. If the inductive memory access pattern (and its relationship to computation) can be expressed efficiently, then vectorization can reduce the total time of the inner-loop region.
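
To make the shape of the problem concrete, the following is a minimal C sketch of a forward-substitution triangular solve; the exact loop structure of the kernel in Figure 2 may differ, so treat this as illustrative only:

    #include <stddef.h>

    /* Illustrative forward substitution for L*x = b (lower-triangular L).
     * The "point" region (one divide per outer iteration) feeds every
     * iteration of the inner "vector" region, and the inner trip count
     * shrinks inductively with j -- the two FGOP properties in the text. */
    void tri_solve(size_t n, const double L[n][n], const double b[n], double x[n]) {
        double acc[n];
        for (size_t i = 0; i < n; ++i)
            acc[i] = b[i];
        for (size_t j = 0; j < n; ++j) {
            x[j] = acc[j] / L[j][j];          /* point region: produces x[j]   */
            for (size_t i = j + 1; i < n; ++i)
                acc[i] -= L[i][j] * x[j];     /* vector region: consumes x[j]  */
        }                                     /* inner trip count: n-j-1       */
    }

Overlapping the divide for row j+1 with the tail of row j's updates is exactly the kind of cross-region overlap that Figure 2(c,d) illustrates.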

ASICs can of course be designed to exploit FGOP – hence why they are so commonly employed for these tasks. Unfortunately, they have significant drawbacks: design time and verification effort, extra on-chip area, lack of flexibility, and lengthened time-to-market; these are especially relevant for our example domain of wireless, where standards are continually changing and infrastructure costs are high. A general and programmable architecture exploiting FGOP could prove to be a worthy, if not essential, replacement of traditional vector-VLIW DSP architectures.

Our goals are twofold: 1. developing abstractions and execution semantics to enable efficient expression of FGOP; and 2. applying these abstractions to create an efficient programmable accelerator instance for DSP algorithms, capable of accelerating both FGOP and non-FGOP workloads in this domain (e.g., GEMM, filters).

Through an in-depth workload analysis, we find four essential architecture abstractions for expressing FGOP efficiently to hardware: 1. parallel dataflows with ordered communication channels; 2. induction-variable-dependent communication, memory access, and data reuse, to reduce control overhead; 3. implicit masking of non-vector-width-divisible iterations, for efficient vectorization; 4. specialization of compute hardware for critical versus non-critical dataflows, for high hardware utilization.

While in principle the above abstractions can be added to a conventional ISA, we choose a stream-dataflow ISA [?], as its dataflow-based computation and communication abstractions are simple to modify, and the resulting accelerator can be performance/power competitive with DSPs. For the hardware implementation, we start with a simple design for one lane: a scratchpad connected to a coarse-grain reconfigurable fabric (e.g., similar to some previous designs [?, ?, ?, ?]). We use multiple such “lanes” to scale up the design.

Our accelerator, REVEL: the Reconfigurable Vector Lane architecture (Figure 3), is constructed by adding support for each of the FGOP-exploiting abstractions: 1. We allow multiple parallel dataflows (similar to threads) which can communicate within/across lanes through FIFOs to support synchronization on fine-grain dependences. To simplify the ordering of commands, we centralize control into one control core which coordinates all lanes. 2. We provide the ability to express inductive memory access, data-reuse and communication patterns by adding suitable state machines to the FIFO communication structures. 3. We implement implicit vector masking by exploiting the relationship between computation-vector width and communication-stream length. 4. For high computation utilization, we develop a novel heterogeneous compute fabric, where different regions are specialized for critical and non-critical dataflows.

Figure 3: REVEL Architecture Model
Our contributions are:
  • Identification and characterization of fine-grain ordered parallelism (FGOP) as the main challenge for accelerating many dense linear algebra kernels.

  • Architecture and execution model for expressing FGOP naturally to hardware.

  • Novel architecture features (vector-stream control, inductive access/reuse, implicit vector-masking, and heterog. fabric), enabling ASIC-like power/area/performance.

A single 1.25GHz REVEL unit can outperform a 2.1GHz OOO core running highly-optimized MKL code on DSP workloads by a mean of 9.6\times, with an area-normalized speedup of 1308\times. Compared to a DSP, REVEL achieves between 4.6\times-37\times lower latency, with an area-normalized speedup of 8.3\times. Compared to a set of ideal ASICs with equivalent performance, it requires about 2.2\times the power and 0.55\times the area.

The rest of the paper briefly motivates the kernels, then analyzes their challenges and potential. We next present the FGOP abstractions and the REVEL ISA instance, followed by the microarchitecture and compiler. We then describe the methodology and results, and finally cover related work and conclude.

We examine these DSP workloads as they represent a coherent and important set, and because exploiting FGOP is critical for their performance. To elaborate, Figure 4 shows the stages of a typical 4G/5G transmitter/receiver.

Figure 4: Typical 4G/5G Transmitter/Receiver Pipeline

Channel coding and modulation involve mostly bit-level arithmetic. RE mapping is a short resource-allocation phase which is not computationally intensive.

The Beamforming stage involves mostly GEMM, coming from spatial signal filtering [?]. Filters and FFT of several varieties are also common [?, ?, ?].

Challenging FGOP workloads are mostly within MIMO equalization and channel estimation. These include Singular Value Decomposition, used for noise reduction [?]; QR decomposition, used for signal detection in MIMO [?]; and Cholesky decomposition, used in channel estimation and equalization [?, ?]. Solver is an instructive example.

The parameters often depend on the number of antennas and beams (in the range of 12-32 would be common) [?].

Here we first define FGOP properties with an example kernel, and explain why each is important either as a challenge or opportunity. Then we characterize their prevalence in our workloads and beyond. Finally, we perform a case study to answer why task-parallelism plus vectorization is insufficient.

We use Cholesky decomposition as a running example in Figure 5. Cholesky contains a point, a vector, and a matrix computation region. In general, regions correspond to computations either across subsequent loops, or from within the same imperfect loop nest but at different nesting levels.

A central characteristic of FGOP is the presence of fine-grain dependences between regions. In Cholesky, the vector and matrix region are dependent on the point region (forward dep.), and the point region is dependent on the first element in the matrix region (loop carried dep.). For a small matrix, the granularity of these dependences is a few hundred instructions or less, and even lower as the algorithm progresses.
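
For reference, below is a plain C sketch of the unblocked, right-looking Cholesky variant with the three regions marked; the paper's dataflow formulation may distribute the divide differently, so the code is only meant to locate the dependences described above:

    #include <math.h>
    #include <stddef.h>

    /* Unblocked right-looking Cholesky (lower-triangular), regions marked. */
    void cholesky(size_t n, double a[n][n]) {
        for (size_t k = 0; k < n; ++k) {
            double sqr  = sqrt(a[k][k]);      /* point region: sqrt ...        */
            double inva = 1.0 / sqr;          /* ... and reciprocal            */
            a[k][k] = sqr;
            for (size_t i = k + 1; i < n; ++i)
                a[i][k] *= inva;              /* vector region: consumes inva  */
            for (size_t j = k + 1; j < n; ++j)
                for (size_t i = j; i < n; ++i)
                    a[i][j] -= a[i][k] * a[j][k];   /* matrix region           */
            /* a[k+1][k+1], written first by the matrix region, is exactly what
             * the next point region reads: the loop-carried dependence. */
        }
    }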

Why is this important: the presence of these fine-grain dependences is the key limiter to performance of multithreading the regions, due to high synchronization overhead.

Dependences are often strictly ordered from the perspective of their producing and consuming instructions. Figure 5(b) shows Cholesky’s iteration space and dependences. In Cholesky, across multiple iterations of the outer k loop, the point region produces values (inva and sqrt) that are consumed by the matrix region. Similarly, the matrix region produces values consumed by subsequent iterations of the point region. An example of a non-ordered dependence is when an array is consumed in the backwards order of how it was produced.

Figure 5: FGOP Example: Cholesky

Why is this important: the structure of ordered dependences makes fine-grain synchronization of these dependences natural, and therefore creates an opportunity to exploit them efficiently in hardware.

An inductive algorithm iteratively builds on previous computations. In array codes, this often manifests as induction-variable dependent trip-counts. This is the case for both of Cholesky’s loops (but would not, for example, be true of a matrix multiply).

This has implications for dependences, in that their reuse pattern (the rate of production to consumption) would be induction-variable dependent. For example, how many times inva is consumed in the matrix region varies with k and j. Another example is that the matrix region only produces a value for the next point region on the first iteration of the inner loop (so depends on k).

Also, inductive loops cause patterns of computation/memory access that are not easy to tile with vectorized loop iterations. Figure 5(b) also shows a vector-tiling pattern for Cholesky, with many leftover iterations. In a traditional vector architecture, these would require scalar iterations or masking/merging overhead.

Why is this important: inductive patterns cause overheads for coordination of vector computation, as well as the wide interface to memory.

Finally, regions often express imbalanced amounts of work. In Cholesky, the matrix region performs much more computation than the others, making it critical for performance. In Figure 5(a), we highlight the critical region in red, and the sub-critical regions in purple. In DSP workloads, sub-critical regions often contain high-latency operations like sqrt and div.

Why is this important: for a high hardware utilization, execution resources should be allocated appropriately to regions. Furthermore, we will show how it is possible to specialize the computation substrate for criticality.

Figure 6: FGOP Examples: QR and SVD

Figure 6 shows that QR and SVD both have fine-grain ordered dependences between scalar/matrix regions (e.g., tau) and between inner loops (e.g., w[j]). Their inner loops are inductive and imbalanced compared to the householder region.

Figure 7: Prevalence of FGOP Properties.

We characterize the prevalence of each FGOP property in our 7 DSP workloads (Cholesky, QR, SVD, Solver, FFT, GEMM, Filter), as well as a more general dense matrix benchmark suite, PolyBench [?]. We use LLVM [?] to instrument programs to track dynamic memory dependences. Figure 7 shows a cumulative density function (CDF) for each property across three different matrix sizes (16, 32, 128; since FFT/Filter do not process matrices, we pick comparable data sizes). In all plots, lines closer to the upper left indicate more FGOP.

  • Fine Granularity: Figure 7(a) characterizes the distance between inter-region dependences in terms of arithmetic instructions. Most dependences (where the steepest part of the CDF curve is) are between about 75 and 1000 instructions, where larger data sizes are on the higher end. Intuitively, this is a range where an out-of-order (OOO) core’s instruction window begins to be insufficient, but where shared-memory-based synchronization still significantly hurts performance (especially considering pipeline serialization during synchronization).

  • Ordered: Figure 7(b) shows the prevalence of ordered dependences as a fraction of total dependences. All workloads contain at least 50% ordered dependences, and more than 80% of DSP workloads are completely ordered; this is quite high and promising for later exploitation.

  • Inductive: DSP workloads have significant amounts of inductive access. 4/7 of the DSP workloads have more than 85% inductive accesses. PolyBench in general has much less inductive access, but still about 1/5 of those workloads are 60% inductive. Nevertheless, the inductive property is critical for our DSP workloads.

  • Imbalanced: 4/7 DSP workloads have imbalanced regions, while 50% of PolyBench have imbalanced regions.

Overall, FGOP properties are common across dense matrix workloads, especially for the relevant DSP workloads.

We know from the data in Figure 1 that exploiting FGOP is non-trivial for vector and VLIW cores. But why is this so, given that DSPs are designed for these workloads?

The traditional method of parallelizing workloads with FGOP is through task parallelism on a block of computations (e.g., a set of iterations). Each dependence, or set of dependences, is simply a condition under which to start a new task or synchronize existing tasks. Intuitively, this works well when dependences are coarse grain (less overhead to start or synchronize), and when blocks perfectly tile the iteration space. As we discussed earlier, this is not true for the DSP workloads we study.

For practical analysis, we analyzed an established Cholesky kernel [?] which uses blocked task-parallel execution. Figure 8 shows the task-parallel speedup over the single-threaded industry-standard MKL for different matrix sizes (see the methodology section for CPU details). First, notice that its performance is similar to MKL for large matrices, which is possible because it calls underlying BLAS routines (dpotf2, dtrsm, dsyrk) at a block level. We suspect MKL’s implementation uses a similar approach, but does not use task parallelism at small matrix sizes for performance reasons.

Figure 8: Case Study: Speedup of task-parallel Cholesky and MKL.

Considering the task-parallel code, the results indicate that exploiting FGOP is only profitable at all for larger matrices. Using more threads simply results in higher synchronization overhead, far outweighing the benefits of parallelization. For the task-parallel version, speedups higher than 2\times are only possible with matrices of size 1024 or larger, larger than we can make use of in our domain. Therefore, our goal is to create an architecture which can exploit FGOP better at all matrix sizes and enable speedup at small sizes.

In this section we propose a set of architecture features which can express fine-grain ordered parallelism efficiently to hardware. The description here is architecture-neutral, and we later develop an architecture instance (an ISA).

Figure 9: Solver’s Dataflows and Ordered Dependences

The essential abstraction is that of concurrent dataflows (threads) with the ability to express ordered dependences between regions. Ordered dependences are distinguished from typical instruction dependences in that they have a non-uniform rate of production to consumption. A consumption rate higher than one indicates reuse of a value along a dependence. This may occur because data is reused multiple times within a subsequent inner loop. A production rate higher than one means that several iterations occur before producing a value; for example, this could be because data is being reduced (accumulated) for several iterations. Figure 9 shows the solver kernel’s dataflows, annotated with the memory access performed within each iteration of the outer j loop. Edges are labeled with their production:consumption rate, unless they are uniform (1:1).

One important consideration is the control-to-computation ratio. Short-vector SIMD is one way to reduce control overhead; one SIMD instruction expresses multiple operations over a fixed number of data items. A generalization (used in a variety of prior architectures [?, ?, ?, ?, ?, ?, ?]) is the concept of streams, where a single control command describes an entire pattern of operations. The following features (2-4) are related to the use of streams to reduce control overhead.

Data-reuse patterns may depend on induction variables, as seen in Figure 9. Here, the output of the division is used multiple times within the inner loop, but the number of uses is reduced by one each outer iteration. In general we find that the pattern changes only by small constant amounts. We specify these as two “stretch” parameters, s_{p} and s_{c}: the rates of change of production and consumption. Figure 11 contains an example encoding as a stream. Including these parameters is not necessary for correct enforcement of dependences, because multiple lower-dimension streams can be generated instead; however, the number of instructions then increases by an order of magnitude (as shown in Figure 11).
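
To see what the stretch parameters buy, the sketch below contrasts issuing one reuse command per outer iteration with a single stretched command. The descriptor fields (n_c, s_c) follow the text, but the encoding itself is hypothetical, not REVEL's actual format:

    /* Hypothetical 2D reuse descriptor: the value is consumed n_c times in the
     * first inner loop, and n_c changes by s_c on every outer iteration. */
    struct reuse_stream { int n_c; int s_c; int outer_trips; };

    /* (a) Without stretch: one command per outer iteration -> O(n) commands.
     * For the solver, the divide result x[j] is reused (n-1-j) times.        */
    void spec_flat(int n, void issue(int n_c)) {
        for (int j = 0; j < n; ++j)
            issue(n - 1 - j);
    }

    /* (b) With stretch: one command covers the entire triangular pattern.    */
    struct reuse_stream spec_stretched(int n) {
        return (struct reuse_stream){ .n_c = n - 1, .s_c = -1, .outer_trips = n };
    }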

Similarly, all prior architectures that we are aware of use rectangular memory access streams: i.e., their iteration domains (without loss of generality) begin at \vec{0} and end at a constant n in each dimension (i.e., a trip count), and their address functions are linear functions of \vec{I}. If we let c_{i}, c_{j}, etc. be the multipliers of \vec{I} in the address function, rectangular streams can then be depicted intuitively as a loop nest – see Figure 10(a).

Inductive streams are more general; their iteration domains may be bounded polyhedra instead of strictly rectangular. Trip counts become a linear function of lexicographically previous iterators. We encode this using stretch multipliers s_{ji}, representing the multiplier of iterator j in the trip count for dimension i. Figure 10(b) shows a 2D inductive stream pattern as a loop nest. Figure 11 shows how to specify the accesses in the solver using either rectangular or inductive streams. Again, inductive access streams require O(n) fewer control instructions.
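
The loop-nest reading of the two stream types, under our interpretation of the notation in Figure 10, is sketched below; c_j/c_i are the address multipliers and s_ji stretches the inner trip count:

    #include <stdint.h>

    /* (a) 2D rectangular stream ("RR"): constant trip counts n_j, n_i. */
    void rect_stream(int64_t base, int64_t c_j, int64_t c_i, int n_j, int n_i,
                     void access(int64_t addr)) {
        for (int j = 0; j < n_j; ++j)
            for (int i = 0; i < n_i; ++i)
                access(base + j * c_j + i * c_i);
    }

    /* (b) 2D inductive stream ("RI"): the inner trip count is a linear
     * function of j.  E.g., n_i = n with s_ji = -1 walks the shrinking rows
     * of a triangular matrix with a single command. */
    void ind_stream(int64_t base, int64_t c_j, int64_t c_i, int n_j, int n_i,
                    int s_ji, void access(int64_t addr)) {
        for (int j = 0; j < n_j; ++j)
            for (int i = 0; i < n_i + j * s_ji; ++i)
                access(base + j * c_j + i * c_i);
    }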

Figure 10: Memory Address Stream Type Comparison
Figure 11: Stream Specification using Different Types
Figure 12: Implicit Vector Masking

Later evaluation uses a simple notation to describe capabilities: Letter “R” denotes a rectangular dimension, and “I” denotes inductive dimension, so “RI” would be a 2D capability with induction in second dimension.

There are two issues with vectorization of FGOP. The first is that the reuse rate may become fractional, as it may be divided by the vector width (see the example in Figure 12(a)). Therefore, the stretch parameters must be able to represent fractional numbers. Second is the problem of non-vector-width-divisible iterations. To address this, we make it implicit that the datapath for the remaining iterations becomes masked or predicated. This can be enforced by dynamically checking the stream iterator for the case when the inner-loop iterator i exceeds the current length n_{i} (see Figure 12(b)).
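
A minimal sketch of the implicit-masking check (a vector width of 4 is assumed for illustration): when fewer than a full vector of inner-loop iterations remain, the trailing lanes are flagged as predicated-off rather than requiring separate scalar cleanup code.

    #include <stdint.h>

    enum { VEC = 4 };                         /* assumed vector width          */

    /* Returns a bitmask of valid lanes for the vector starting at inner
     * iterator i, given the current (possibly inductive) trip count n_i. */
    uint8_t lane_mask(int i, int n_i) {
        int remaining = n_i - i;
        if (remaining >= VEC)
            return (1u << VEC) - 1;           /* all lanes valid               */
        uint8_t m = 0;
        for (int l = 0; l < remaining; ++l)
            m |= (uint8_t)(1u << l);          /* e.g. remaining=3 -> 0b0111    */
        return m;                             /* trailing lanes predicated off */
    }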

Certain regions may be more or less computationally critical than others, as they perform more or less work. In our example in Figure 9, the “divide” dataflow fires roughly n/2 times fewer than the others. In practice, non-critical dataflows should be allocated shared resources, while critical dataflows should be vectorized. We will later demonstrate the effectiveness of hardware specialization for criticality.

Figure 13: Explaining REVEL abstractions using Cholesky as an example.

Using the principles from the previous section, we construct an efficient and scalable ISA and execution model (the REVEL ISA) to exploit FGOP and traditional parallelism in dense matrix algorithms. REVEL is an instance of a stream-dataflow ISA [?], which we chose because it is straightforward to enhance for FGOP, and it enables an efficient programmable accelerator. We later discuss how other architectures (OOO cores and Plasticine [?]) could be enhanced similarly (Table 2). In this section, we discuss the control model, how the architecture incorporates FGOP features, and then its specific ISA instantiation.

Stream-dataflow ISAs express computation as a dataflow graph, where its inputs and outputs are named ports. Communication is performed using streams, where the endpoints of streams are either dataflow-graph ports or memory. A Von Neumann program embeds all stream commands, and streams with the same port number are guaranteed to be executed in program order. Memory requests can be ordered by explicit barriers.

  • Ordered Dependences between Dataflows: Computation is expressed as multiple independently-triggered dataflow graphs, where streams describe their communication and re-use pattern.

  • Inductive Dependence/Access: Stretch parameters (s{}_{p},s{}_{c}) added to relevant streams.

  • Vector Masking: Non-divisible iteration lengths cause predication of the corresponding dataflow.

  • Execution Rate: The implementation is closely tied to the hardware, so we describe it separately in the microarchitecture section.

Figure 13 demonstrates REVEL’s abstractions by showing the transformation from source (a) to the abstract dataflow IR (b), and finally to the dataflow configuration and stream code running on the control core (c,d).

To enable scalability at low overhead, we chose to add multiple lanes of execution. Each lane is independent, in that it can concurrently execute multiple dataflows, each potentially communicating using inductive streams or through a global memory. Also, since each lane can be programmed separately, the architecture is flexible in terms of what computations are being parallelized.

There are two challenges with using multiple lanes: 1. Each lane needs coordination (control overhead), and 2. The dataflow-dependence streams between lanes must somehow be ordered.

Our solution is the vector-stream control model. Here, a single Von Neumann control program coordinates the execution of all lanes. Control commands are sent to all relevant lanes, specified by a bitmask (e.g., load an array from address 0 of local memory to dataflow 1). In addition, a lane’s index can be used to offset the address of a command, so a single command can let each lane read a different portion of an array. This is unique and more powerful than the control amortization offered by either vectorization or streaming alone, as it amortizes both in “space” across lanes and in time through streaming commands. It is inspired by vector-threading [?, ?, ?], but with a stream-based ISA.
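
The sketch below illustrates the broadcast-with-bitmask-and-offset idea; the command fields and the per-lane scaling rule are our own illustration, not REVEL's actual encoding.

    #include <stdint.h>

    enum { NUM_LANES = 8 };

    /* Hypothetical command descriptor: one command is broadcast to every lane
     * named in lane_mask, and each lane offsets the base address by its own
     * index so the lanes read different slices of an array. */
    struct stream_cmd {
        uint8_t  lane_mask;                   /* which lanes execute this      */
        uint64_t base_addr;                   /* starting address              */
        uint64_t lane_stride;                 /* lane k adds k * lane_stride   */
        int      n_j, n_i, s_ji;              /* trip counts and stretch       */
        int      port;                        /* destination dataflow port     */
    };

    void broadcast(const struct stream_cmd *c,
                   void lane_issue(int lane, struct stream_cmd local)) {
        for (int k = 0; k < NUM_LANES; ++k) {
            if (!(c->lane_mask & (1u << k)))
                continue;                     /* lane not selected             */
            struct stream_cmd local = *c;
            local.base_addr += (uint64_t)k * c->lane_stride;  /* per-lane slice */
            lane_issue(k, local);
        }
    }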

In the example of Figure 13, we map all three dataflows (scalar, vector, matrix) to one lane to share its resources, and parallelize the outer k loop across lanes.

Command             | Pattern Params                     | Source Params    | Dest. Params
Shared_ld           | c_{i}, c_{j}, n_{j}, n_{i}         | shared_addr      | local_addr
Shared_st           |                                    | local_addr       | shared_addr
Local_st            | c_{i}, c_{j}, n_{j}, n_{i}, s_{ji} | out_port         | local_addr
Local_ld            |                                    | local_addr       | in_port, n_{c}, s_{c}
Const               | n_{j}, n_{i}, s_{ji}               | val_{1}, val_{2} | in_port, n_{c}, s_{c}
XFER                | n_{p}, s_{p}                       | out_port         | in_port, n_{c}, s_{c}
Configure           |                                    | local_addr       |
Barrier Ld/St, Wait |                                    |                  |
(Every command also takes a lane bitmask.)
Table 1: REVEL’s Vector-Stream Control Commands

Table 1 contains the set of commands within the Von Neumann control program for stream coordination, including their pattern parameters, source, and destination. Shared_ld/st are for transfers between local and shared memory. Local_ld/st transfer between the local memory and the dataflow graph. XFER specifies inter-dataflow communication streams to support fine-grain dependences. Const can stream a pattern of val_{1}, val_{2}, e.g., (0,0,0,1,0,0,1,0,1), which is useful for inductive control flow within the dataflow graph. The Barrier_Ld/St command prevents concurrent scratchpad memory access, and Wait delays until a lane is no longer active. These are used for flexible double buffering. All commands take a lane bitmask as a parameter, to implement vector-stream control.
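
As a concrete (but hypothetical) illustration of how a control program strings these commands together, the fragment below declares intrinsic-style prototypes mirroring Table 1 and issues the streams for a solver-like kernel. The names, argument orders, ports, and strides are ours, not REVEL's actual intrinsics.

    #include <stdint.h>

    /* Hypothetical intrinsic prototypes mirroring the commands in Table 1. */
    enum { ALL_LANES = 0xFF, PORT_L = 1, PORT_DIV_OUT = 2, PORT_UPD_IN = 3 };
    void revel_config  (uint8_t lanes, uint64_t cfg_addr);
    void revel_local_ld(uint8_t lanes, uint64_t addr, int c_i, int c_j,
                        int n_i, int n_j, int s_ji, int port);
    void revel_xfer    (uint8_t lanes, int out_port, int in_port, int n_c, int s_c);
    void revel_wait    (uint8_t lanes);

    /* One possible command sequence for a solver-like kernel (row-major n x n
     * array of doubles): stream the shrinking column below the diagonal, and
     * forward each divide result to the update dataflow with shrinking reuse. */
    void solver_control(int n, uint64_t L_addr, uint64_t cfg_addr) {
        revel_config(ALL_LANES, cfg_addr);
        revel_local_ld(ALL_LANES, L_addr, /*c_i=*/8 * n, /*c_j=*/8 * (n + 1),
                       /*n_i=*/n - 1, /*n_j=*/n, /*s_ji=*/-1, PORT_L);
        revel_xfer(ALL_LANES, PORT_DIV_OUT, PORT_UPD_IN, /*n_c=*/n - 1, /*s_c=*/-1);
        revel_wait(ALL_LANES);                /* block until streams complete  */
    }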

We describe REVEL’s microarchitecture by first giving a broad overview, then explaining the key innovations that enable efficient exploitation of FGOP. We discuss the heterogeneous compute fabric in detail, as it is a key novel component of the design, enabling low overhead execution of unbalanced FGOP regions.

The REVEL processor (Figure 14) is composed of a number of lanes, a low-power Von Neumann control core, and a shared scratchpad. The control core can issue vector-stream commands to each lane. Each lane manages its stream and memory dependences, data access requests, and computation firing. Dataflows on the same or separate lanes can communicate data through the XFER unit or the shared scratchpad.

REVEL’s high-level execution flow is as follows. First, the core issues a config command, and configuration data is broadcast to each relevant lane’s compute fabric and its ports. Asynchronously, the control core begins to compute the parameters of any stream commands. When a command is ready, it is broadcast to the relevant lanes’ command queues. Commands are then issued to either the private or shared scratchpad, provided the resources they depend on are free (e.g., input or output compute-fabric ports). Streams execute locally until completion, at which point they notify the command queue that they are free. Independent dataflows may be mapped onto the compute fabric, where they are executed in pipelined fashion once enough data has arrived for an instance of the computation. Once the control core has completed issuing vector-stream commands, it issues a Wait command. This blocks the control program until all relevant lanes are no longer active, which is determined by the completion of all streams.

Figure 14: REVEL Microarchitecture
  • Command Queue is the lane’s resource manager, and is responsible for maintaining data ordering. It maintains a queue of commands from the control core, and issues them to the scratchpad or XFER unit if no barrier commands or port dependences prevent that. A scoreboard tracks ports in-use.

  • Stream Control maintains the set of concurrent streams, where each stream tracks the state of its iterators (i,j) and length n_{i}, (to support inductive access). It can generate addresses for one stream per cycle, along with a mask for any unused words of the scratchpad line. Streams are prioritized by minimum “cycles-to-stall,” which is the number of cycles before the corresponding port will run out of data (data-in-fifo / port-width).

  • Input/Output Ports contain a set of FIFOs for holding intermediate results while waiting for (or produced from) the compute fabric. Input ports can receive data either from the scratchpad bus, or from the XFER unit if receiving data from a neighboring dataflow. Each port attaches to a unique location within the grid, so it is the compiler’s responsibility to choose optimal ports.

  • Compute Fabric monitors the data ready in each input port FIFO to determine which dataflows can begin. Multiple can be fired in a single cycle. This heterogeneous fabric is divided into regions which specialize for either critical or non-critical computations (described below).

  • XFER Unit is responsible for arbitrating the bus from output ports back to the local or remote input ports, which enables fine-grain dependences between dataflows, both within and across lanes.

Here we describe the essential hardware mechanisms for supporting FGOP within REVEL. Details on the heterogeneous compute fabric are discussed subsequently.

To support multiple dataflows with different firing conditions, the data present in each dataflow’s ports are tracked separately by the data-firing logic, which can manage up to four independently-firing dataflows. The association between ports and dataflows is determined at configuration time.

One other challenge is maintaining data-ordering when there are fine-grain dependences between lanes – ie. a source lane should not transmit until all prior data items (in program order) for the destination lane’s port have arrived. This is accomplished by sending the destination lane a placeholder stream. The destination’s command queue informs the source’s when the placeholder is issued for the destination port, and the source’s command queue informs the destination’s when the placeholder can be removed.

To support inductive re-use streams, the scratchpad controller maintains the current iterator values and the current stream length. When n_{i} addresses are complete, the length is adjusted by the stretch s_{ij}. Note that s_{ij} is a fixed-point number to support vectorization with induction patterns.

Columns: Ordered Dep. | Inductive Dep. | Inductive Mem. | Implicit Vector Mask | Crit. Specialization

OOO Core:
  S/W: Thread-communication-aware OS sched. | (see below) | Streaming memory command interface | Add FIFO interf. b/t streams & vector instrs. | Not applicable: no explicit-dataflow substrate
  ISA: Stream-based producer/consumer instrs. | Add induction parameters to stream instrs. | Implicit mask register indicating predicated lanes
  H/W: Commun. FIFOs b/t neighbor cores | Add inductive control to FIFOs | Add streaming memory request engine | Vector store instruction ignores masked lanes

Plasticine [?]:
  S/W: Already supported | Add inductive param for map & fold patterns | None | None
  ISA: Update stream-control and addr. gen. interf. | Update stream instr. semantics | Temporal fabric ISA
  H/W: Add induction to stream controller | Add induction to addr. gen. | Implement predication within SIMD lanes | Make some PCUs temporally shared

Table 2: Adding FGOP Abstractions to Existing Architectures (S/W: Software, H/W: Hardware)

While a stream without reuse would perform the usual destructive FIFO read on every cycle, a stream with reuse will only pop the data from the port at a longer interval. When the stream is issued from the command queue to the stream control unit, the reuse configuration (n_{r} and s_{r}) is sent to the port (maintained similarly to params for memory access). Besides enabling fine-grain dependences with inductive changes in re-use length, another benefit of the reuse within the configurable port is a large reduction in scratchpad bandwidth.
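
A sketch of such a reuse-configurable port is shown below. The field names, and the exact point at which the stretch is applied, are simplifications of the mechanism the text describes.

    #include <stdint.h>

    /* Port FIFO with stream-configured reuse: the head element is handed to
     * the fabric n_r times before it is destructively popped, and the reuse
     * count then changes by the stretch s_r. */
    struct reuse_port {
        uint64_t fifo[16];
        int head, count;                      /* circular-buffer state         */
        int n_r, s_r;                         /* reuse length and its stretch  */
        int served;                           /* reads served for current head */
    };

    int port_read(struct reuse_port *p, uint64_t *out) {
        if (p->count == 0)
            return 0;                         /* no data ready: dataflow stalls */
        *out = p->fifo[p->head];              /* non-destructive read          */
        if (++p->served >= p->n_r) {          /* reuse exhausted: pop          */
            p->head = (p->head + 1) % 16;
            p->count--;
            p->served = 0;
            p->n_r += p->s_r;                 /* inductive change in reuse     */
            if (p->n_r < 1)
                p->n_r = 1;
        }
        return 1;
    }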

As a stream is executing, the stream control unit compares the remaining iterations with the vector length of the destination port. If the number of iterations left is non-zero and less than the vector length, the stream control unit sends the data padded with zeroes for the unused lanes, along with additional meta-information indicating that those lanes should be predicated off. This information is buffered in a predication FIFO which tracks the data FIFO.

Attaining high utilization in FGOP workloads requires balancing execution resources between critical and non-critical dataflows. This is especially challenging given that they prefer quite-different fabric microarchitectures.

Figure \thefigure: Heterogeneous Computation Fabric and Tiles

There are two key types of fabrics with different tradeoffs. Dedicated fabrics restrict each execution resource (tile) to execute only a single instruction, but are pipelined at full throughput (e.g., FPCA [?], Q100 [?], Tartan [?], PipeRench [?], and DySER [?]). In contrast, temporal fabrics may time-multiplex multiple different static instructions over the same resources (e.g., TRIPS [?], WaveScalar [?], Triggered Insts [?] and RAW [?]). (A Von Neumann core is also temporal in this context, but does not yield enough performance for this use in REVEL.) While dedicated fabrics only need to wait for the arrival of data for dataflow execution, temporal fabrics require token matching, meaning more complex structures (and power/area overhead).

Because critical dataflows are often easily vectorizable by the compiler, they can be scaled to the size of the fabric, and be executed more efficiently on the more power/area efficient dedicated fabric. However, non-critical dataflows often have many instructions, and running them on a dedicated fabric would be a waste of resources (eg. FP units) as they would attain poor utilization. Therefore, non-critical dataflows would be best run on a temporal fabric. It would also be inefficient to run critical dataflows on a (smaller) temporal fabric, as contention for resources would degrade the throughput. An over-provisioned temporal fabric can alleviate this, at the cost of significant power/area overhead.

Given the above, our approach is to make the fabric heterogeneous: provision most of the fabric’s resources as a dedicated fabric to enable fast execution of the critical dataflows, and allocate a smaller portion as a temporal fabric, which can execute non-critical regions efficiently. Figure 15 shows the lower corner of REVEL’s heterogeneous compute fabric, which embeds the temporal fabric’s network and compute into the dedicated fabric.

The physical network for both fabrics is a circuit-switched mesh with no flow control. The dedicated tiles simply select inputs, perform computations, and forward outputs in fully pipelined fashion according to the dataflow configured onto the mesh. The dataflow compiler must equalize delays for each operand to ensure correct execution.

The temporal fabric is embedded within the circuit-switched mesh, using a pattern shown by the blue arrows in Figure 15. This allows temporal units to communicate without interfering with the dedicated region (i.e., no horizontal/vertical links are consumed). The temporal tile microarchitecture is based on Triggered Instructions [?], which performs operations based on the arrival of data to a queue at the input or output. A register file holds the live state of waiting instructions.

Note that dataflows communicate through ports (exiting and re-entering the fabric). The benefit of integration into the same network is that when there are no non-critical dataflows, the temporal region may be reclaimed for use by critical instructions. Compilation concerns are detailed in the compiler discussion below.

Finally, we argue our approach is applicable to other architectures. Table 2 explains how to add FGOP capabilities to out-of-order (OOO) cores and to Plasticine [?], a reconfigurable dataflow fabric programmed using parallel patterns. The table describes the changes necessary in software, the ISA, and hardware.

A program is decomposed into two components: 1. C+intrinsics specifying the Von Neumann control program, and 2. Dataflow specification which is mapped onto the compute fabric. A dataflow compiler (described in the next paragraph) is responsible for producing the hardware configuration bits for the temporal and dedicated portion of the fabric. These are finally compiled together to create the final RISCV binary.

We implemented a spatial architecture compiler (e.g., [?, ?, ?, ?, ?]) which maps the computation and communication of all dataflows together onto the compute fabric. For the dedicated dataflows, all operand timing paths must be equalized, and there is no sharing of resources. For the temporal dataflows, the goal is to minimize resource contention. Usually, instructions from a temporal/dedicated dataflow map to the corresponding region of the compute fabric. However, temporal instructions may map to the dedicated fabric to minimize utilization, and dedicated instructions may be mapped to the temporal fabric to minimize latency or network congestion, provided that there are enough resources either way. To balance these objectives, we take a simulated annealing approach similar to the Pathfinder algorithm [?] and prior stochastic schedulers [?], which allows resource over-provisioning to determine and then constrain heavily-needed network and execution resources.
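
A sketch of such a stochastic scheduling loop is shown below: perturb one placement, accept improving moves (and occasionally worsening ones while the "temperature" is high), and gradually raise the penalty on over-used network/compute resources. The cost terms and constants are placeholders, not the real compiler.

    #include <stdlib.h>

    typedef struct { int placement[1024]; int n_insts; } schedule_t;

    extern double cost(const schedule_t *s, double overuse_penalty);
    extern void   perturb(schedule_t *s);     /* move one instruction or route */

    void anneal(schedule_t *s, int iters) {
        double penalty = 1.0;                 /* over-provisioning allowed early */
        for (int t = 0; t < iters; ++t) {
            schedule_t trial = *s;
            perturb(&trial);
            double cur  = cost(s, penalty);
            double cand = cost(&trial, penalty);
            double temp = 1.0 - (double)t / iters;       /* cooling schedule */
            if (cand < cur || (double)rand() / RAND_MAX < 0.05 * temp)
                *s = trial;                   /* accept the move              */
            penalty *= 1.001;                 /* constrain contended resources */
        }
    }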

REVEL Lane (\times 8):
  CGRA PEs: 14 add, 3 sqrt/div, 9 mult
  Div/Sqrt Unit: Lat.: 12 cyc., Thr.: 5 cyc.
  Subword SIMD: 4-way fixed-point, 2-way FP
  Data Firing: 4 independent dataflows
  Temporal PE: 2x1 (32 insts/FU)
  Vector Ports: Width: 2\times512, 2\times256, 1\times128, 1\times64 bit; Capability: 4-entry FIFO + configurable reuse
  Stream Control: SPAD ctrl.: inductive addr. gen., 8-entry table; XFER ctrl.: 8-entry stream table; Cmd queue: 8 entries
  SPAD: 8KB, single-bank; Bandwidth: 512 bits (1R/1W port)
  Net.: SPAD-ports: 512-bit dedicated bus; XFER-ports: 512-bit dedicated bus; Ports-CGRA: point-to-point 64-bit links

Ctrl Core: RISCV ISA [?], 5-stage, single-issue, 16KB d$; instructions added for stream commands

Shared SPAD: 128KB, single-bank; Bandwidth: 512 bits (1R/1W port)

Network: Inter-lane: 512-bit data bus (8-bit cmd sync); Shared scratchpad bus: 512-bit shared bus

Table 3: REVEL Params (bold features for FGOP)

REVEL hardware parameters are listed in Table 3. For performance, all blocks are modeled at a cycle level within a modified gem5 [?][?]. We synthesized a single lane of REVEL (heterogeneous fabric, stream control, ports, command queue, XFER unit) using Synopsys DC and a 28nm tech library. The design meets timing at 1.25GHz. An open-source Triggered Instructions implementation was our reference for the temporal fabric [?]. Results from synthesis, with Cacti 7.0 [?] for SRAMs, are used to create an event-based power model and an area model.

We also compare against ideal ASIC performance models (Table 4). These highly-optimistic models are based on the optimized algorithms, and are limited only by the algorithmic critical path and throughput constraints, with FUs equivalent to REVEL’s. The ASIC area and power models count only FU and scratchpad area and power.

SVD: 48m+2\text{QR}(n)+\lceil\frac{n^{3}}{4}\rceil
QR: 40n+n^{2}+\sum_{i=1}^{n}(i+i\cdot n)
MM: \lceil\frac{nmp}{8}\rceil
Solver: 2\sum_{i=0}^{n-1}\max(\lceil\frac{i}{4}\rceil,14)
FFT: \frac{n}{8}\log{n}
Cholesky: \sum_{i=1}^{n-1}\max(\lceil\frac{i^{2}}{4}\rceil,24)
Centro-FIR: \lceil\frac{n-m+1}{4}\rceil
Table 4: Ideal ASIC Models (assumes FU lat. from Table 3)
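
As a worked example of reading Table 4, the functions below transcribe the Cholesky and MM entries into code (integer arithmetic for the ceilings); the constant 24 presumably reflects the sqrt/div recurrence, but that interpretation is ours.

    /* Ideal-ASIC cycle counts transcribed from Table 4 (illustrative only). */
    static long cholesky_ideal_cycles(long n) {
        long cyc = 0;
        for (long i = 1; i <= n - 1; ++i) {
            long update = (i * i + 3) / 4;    /* ceil(i^2 / 4)                 */
            cyc += update > 24 ? update : 24; /* max(ceil(i^2/4), 24)          */
        }
        return cyc;
    }

    static long mm_ideal_cycles(long n, long m, long p) {
        return (n * m * p + 7) / 8;           /* ceil(n*m*p / 8)               */
    }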

For fairness we compare designs with similar max. per-cycle throughput:

  • TI C6678 DSP (@1.25GHz): 8-core DSP, where each core has 16 FP adders/multipliers; using DSPLIB_C66x_3.4.0.0.

  • OOO Core: Intel Xeon 4116 (@2.1GHz): a conventional OOO processor using the highly-optimized Intel MKL library (8 cores used).

  • REVEL-No-FGOP: REVEL without FGOP support (8 lanes). To evaluate this design, we also implement highly-optimized non-FGOP versions of each workload.

Workload | Data Size        | Lane | Acc | Dep | Reuse | Het | Vec
SVD      | 12,16,24,32      | 1    | RI  | Y   | Y     | Y   | Y
QR       | 12,16,24,32      | 8    | RI  | Y   | Y     | Y   | Y
Cholesky | 12,16,24,32      | 8    | RI  | Y   | Y     | Y   | Y
Solver   | 12,16,24,32      | 1    | RI  | Y   | Y     | Y   | Y
FFT      | 64,128,...,1024  | 1    | RR  | N   | Y     | N   | N
GEMM     | 12,24,48x16x64   | 8    | RR  | N   | Y     | N   | N
FIR      | 12,16,24,32      | 8    | I   | N   | Y     | N   | N
Table 5: Workload Params. and FGOP Features (Lane: #lanes in latency version; Acc: access pattern; Dep: fine-grain deps; Reuse: stream-reuse; Het: heterogeneous fabric; Vec: implicit vector masking)

We make comparisons in two different settings, high-throughput and low-latency. The throughput setting assumes there exist multiple data items to parallelize over, while the latency setting assumes only one. We implement both throughput- and latency-optimized REVEL workloads, where the latency-optimized versions spread work across multiple lanes, and the throughput versions use each lane in a data-parallel fashion. Note that we could not profitably parallelize any FGOP kernel across multiple DSP/OOO cores, even using native libraries, so their latency-optimized versions only use a single core. Table 5 describes the data sizes, and also how FGOP features were used by each workload.

We broadly answer the question of whether fine-grain ordered parallelism is exploitable in DSP workloads, and if REVEL’s execution model, architecture, and microarchitecture is effective. What we find overall is that REVEL’s ability to exploit FGOP leads to order-of-magnitude speedup and area-normalized performance over traditional DSPs.

We first discuss the applicability of FGOP features and overall latency and throughput potential. We then explain how performance improvements were achieved by analyzing cycle-level bottleneck breakdowns, and incremental performance improvements. We also answer the question of sensitivity to temporal region size and address-generation capability. Finally, we analyze the area and power breakdowns, comparison of normalized performance, and compare to optimistic ASIC models.

Table 5 shows the applicability of FGOP features. Matrix factorization/decomposition workloads (QR, SVD, Cholesky, Solver) use all FGOP features. Even non-FGOP workloads took advantage of streaming reuse to reduce scratchpad bandwidth, and FIR has a short inductive access phase.

Figure 16: Latency-optimized kernel performance
Figure 17: Throughput-optimized kernel performance

The speedups over the DSP for latency-optimized codes are shown in Figure 16 for both small and large matrices. The DSP and CPU have similar mean performance. REVEL attains up to 37\times speedup, with geomeans of 10\times and 17\times for small and large data sizes. Considering just the workloads which exhibit FGOP, the speedup from FGOP specialization is 6.1\times (large size). REVEL’s dataflow/vector-stream model without FGOP provides a 2.8\times speedup over the DSP. The DSP is only competitive on the small FFT, as REVEL there requires multiple configurations.

Performance for throughput-optimized kernels (data parallelism across lanes) is shown in Figure 17. For small and large sizes, REVEL gets speedups of 6.3\times and 8.1\times over the DSP and CPU. Again, considering just the workloads which exhibit FGOP, the speedup from FGOP specialization is 4.4\times (large size); REVEL’s dataflow/vector-stream model provides the other 2.6\times speedup over the DSP. The performance tradeoffs here are similar, except the advantage of parallelizing across lanes is diminished due to data-parallel execution.

The vector-stream control and FGOP-exploitation enables combined order-of-magnitude speedups.

Figure 18: REVEL’s Cycle-level bottlenecks

Figure 18 overviews REVEL’s cycle-level behavior, normalized to non-FGOP hardware. Latency-optimized workloads are labeled “multi”. To explain the categories: issue and multi-issue mean that one or multiple dedicated dataflows fired, and temporal means only a temporal dataflow fired during that cycle. All other categories represent overhead, including drain for draining the dedicated fabric, scr-b/w and scr-barrier for scratchpad bandwidth and synchronization, stream-dpd for waiting on dependences, and ctrl-ovhd for waiting on the control core.

The clearest trend is that exploiting FGOP reduces the control overhead for both small and large matrix sizes. Also, exploiting FGOP enables parallelism between dataflows, which can be seen in the multi-issued category, especially prevalent for the larger matrix sizes of FGOP kernels.

Exploiting FGOP increases parallelism and reduces control overhead, enabling higher hardware utilization.

Figure 19: Performance Impact of Each Mechanism.

Figure 19 shows the incremental speedup from each hardware/software feature (so there are 5 versions of each kernel). Latency-optimized versions have “lat” as a suffix.

The inductive benefit is small alone, as it reduces control but does not increase parallelism. Only FFT benefits greatly by using inductive reuse to reduce SPAD bandwidth. Most workloads were accelerated dramatically from exploiting fine-grain dependences. However, QR and SVD have complex sub-critical regions, so they only see the benefit after adding the heterogeneous fabric. Throughput-optimized QR suffers from local memory access because of the shrinking matrix sizes, but latency-optimized QR converts these to inter-lane data streams. Solver is also accelerated by the heterogeneous fabric because it is latency sensitive, and collapsing sub-critical instructions can reduce latency. Cholesky’s triangular access implies large gains from implicit vector masking.

REVEL’s mechanisms together enable high performance.

As Figure 18 shows, the main remaining overhead on smaller workloads is drain time, often caused by reconfiguration. This is more of an issue for the smaller matrices and especially FFT, where the datapath must be reconfigured for each algorithm phase. REVEL’s reliance on deep pipelines causes a configuration/serialization penalty on extremely short phases.

Table 6 shows the area and power breakdown; the largest contribution (especially to power) comes from the floating point units. At 28nm, REVEL is 1.79mm{}^{2}. Note that the control core is only about one fiftieth of the overall area.

Component                          | Area (mm{}^{2}) | Power (mW)
Compute Fabric: Dedicated Net. (23)| 0.05            | 71.40
Compute Fabric: Temporal Net. (2)  | 0.01            | 14.81
Compute Fabric: Func. Units        | 0.07            | 74.04
Total Fabric                       | 0.13            | 160.25
Control (ports/XFER/stream ctrl)   | 0.03            | 62.92
SPAD (8KB)                         | 0.06            | 4.64
1 Vector Lane                      | 0.22            | 207.90
Control Core                       | 0.04            | 19.91
REVEL                              | 1.79            | 1663.3
Table 6: Area and Power Breakdown (28nm)

REVEL’s high speedup with only small area overhead (over the DSP) for the computation fabric’s networks results in a large performance/mm{}^{2} advantage: 1308\times over the OOO core and 8.3\times over the DSP.

Because temporal tiles cost more than 5\times the area of dedicated tiles (dedicated tile: 2265\mum{}^{2}, temporal tile: 12062\mum{}^{2}), it is important to choose the correct temporal region size. Figure 20 shows REVEL’s performance sensitivity to this size, as well as the area tradeoff. SVD and QR have the largest temporal regions, so they are affected the most, but a 1\times1 temporal region only incurs 13% overhead. We choose this size to minimize the area penalty.

Figure 20: Temporal region sensitivity (width\timesheight)

To create a purely dedicated fabric, we would have had to support 52 additional dedicated tiles for our largest temporal region in SVD, costing about 2.75\times fabric area. Similarly, for the entire design to be temporal, it would have cost around 2.5\times fabric area. A heterogeneous fabric provides the best performance/area ratio.

Our analysis so far has shown that using 2D inductive streams can reduce control overhead and improve performance significantly. An interesting question is whether supporting higher dimension stream-access could have helped further.

To analyze stream capabilities, we implement a static compiler analysis in LLVM [?], using scalar evolution analysis [?] for a closed-form representation of address patterns and loop termination with respect to induction variables. This analysis can determine the length of a given stream for each pattern. Figure 21 shows the average length of a stream: the number of loop iterations the pattern describes. We also calculate the number of effective memory instructions per inner-loop iteration, “Mem. Insts/Iter”, which is a measure of the control overhead (Figure 22). We consider vector (V), 1D streams (R), 2D streams (RR and inductive RI) and 3D streams (RRR and inductive RII).
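
To make the metric concrete, the sketch below counts the control commands needed to cover a triangular access (rows of length n, n-1, ..., 1) under vector, 1D-stream, and inductive-2D capabilities, then divides by the total iterations. The counting rules and the width of 4 are our assumptions.

    enum { VEC_WIDTH = 4 };                   /* assumed vector width          */

    /* cap: 'V' = vector insts, 'R' = one 1D stream per row, 'I' = one RI stream */
    double mem_insts_per_iter(long n, char cap) {
        long iters = n * (n + 1) / 2;         /* total inner-loop iterations   */
        long cmds = 0;
        if (cap == 'V') {
            for (long j = 0; j < n; ++j)
                cmds += (n - j + VEC_WIDTH - 1) / VEC_WIDTH;
        } else if (cap == 'R') {
            cmds = n;                         /* one command per shrinking row */
        } else {
            cmds = 1;                         /* one inductive 2D stream total */
        }
        return (double)cmds / (double)iters;
    }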

Regular workloads like GEMM require only a low-dimension rectangular access pattern to get long streams. However, FGOP workloads show much higher lengths only with inductive access capability (RI or RII). This benefit translates to fewer memory instructions per iteration. A value of less than 1 in Figure 22 means that fewer than one control instruction needs to be issued per cycle. This helps explain why vector instructions alone are insufficient for parallelism – they require too much specification of work. Fortunately, the RI capability always either achieves a control overhead below 1 inst/iter or matches the least-overhead capability.

The ability to reuse stream values as part of the stream definition can also reduce control overhead. The control overhead when this feature is disabled is shown by the stacked bar in Figure 22. This benefit is modest; the more important reason for stream-reuse is to reduce memory bandwidth.

2D Induction streams (RI) are necessary to reduce control overhead, and RII may provide only a small energy advantage, but is also more complex.

Table 7 shows the performance-normalized power and area overhead, as compared to the ASIC analytical models. REVEL requires a mean of 2.2\times the power, mostly due to the control logic (vector ports, buses, etc.) and reconfigurable networks. It is 0.55\times the area of the combined ASICs. Note this comparison is highly optimistic for the ASICs, as the performance model assumes perfect pipelining, and the power model assumes no control.


Workload    | SVD | QR  | Cho. | Sol. | FIR | MM  | FFT | Mean
Power Ovhd. | 3.5 | 2.1 | 2.2  | 2.0  | 2.0 | 1.9 | 1.9 | 2.2
Area Ovhd.  | 3.8 | 2.4 | 2.5  | 2.7  | 2.2 | 2.1 | 2.6 | 2.6/0.55
Table 7: Power/Area overheads to ideal ASIC (iso-performance)

REVEL is competitive with ASICs, and could replace fixed-function accelerators or conventional DSPs in some designs.

Figure 21: Stream-type Access Length Comparison
Figure 22: Control overhead of various capabilities, measured in control instructions per iteration. The stacked bar indicates the additional control overhead if the stream-reuse technique is disabled.

Many application/domain-specific reconfigurable designs have targeted DSP algorithms. Fasthuber et al. [?] outline the basic approaches. One representative example is LAC [?], targeted at matrix factorization.

A conceptually similar work to ours from the GPGPU space is dMT-CGRA [?], which proposes inter-thread communication between SIMT threads [?, ?]. Prabhakar et al. [?] develop “nested-parallelism,” which enables coupling of datapaths with different nestings of parallel patterns. Swarm [?] also targets a form of ordered parallelism by building abstractions on top of a task-parallel model, targeting irregular data-dependent parallelism [?].

An alternative model to ours is task-based parallelism plus some form of acceleration to reduce the synchronization overhead (e.g., TAPAS [?]). Task parallelism has the benefit of dynamic load balancing, but this does not appear to be necessary for our DSP workloads.

Our vector-stream control paradigm is inspired by prior techniques which marshal independent execution lanes to create a vector-like execution when useful. This includes Vector-Threading [?, ?], Vector-Lane Threading [?], and Loop-Task Accelerators [?]. REVEL also marshals lanes to reduce control and increase parallelism, but its lanes are autonomous once programmed with streams.

Some techniques apply vectorization with reconfigurability, e.g., Libra [?] and DySER [?], which can create/reconfigure vector lanes. REVEL also amortizes control through time.

Several prior architectures have stream primitives; the table below lists their address-generation capability compared to REVEL. We believe REVEL is the only one to support inductive patterns.

Name          | Addr. Gen. Capability
Imagine [?]   | R
Q100 [?]      | R
Accel DMA [?] | R
Softbrain [?] | RR
RSVP [?]      | RR
CoRAM++ [?]   | RR
APMC [?]      | RR
REVEL         | RI
FPCA [?]      | RRR

This paper identified fine-grain ordered parallelism as a common property across a variety of linear-algebra and DSP algorithms. It is extremely difficult to exploit using existing Von Neumann vector and multithreading architectures. This work identified a set of abstractions and developed an execution model and hardware implementation (REVEL) which can exploit this form of parallelism.

Our REVEL implementation achieved more than an order of magnitude lower latency (10\times-17\times), and its performance per mm{}^{2} was 6.7\times that of a DSP (up to 16\times). Overall, REVEL’s design offers large advantages over existing architectures for important signal processing workloads, and is a promising alternative to existing DSPs and beyond.
