Finite element numerical integration
for first order approximations
on multicore architectures
Abstract
The paper presents investigations on the implementation and performance of the finite element numerical integration algorithm for first order approximations and three processor architectures, popular in scientific computing, classical CPU, Intel Xeon Phi and NVIDIA Kepler GPU. A unifying programming model and portable OpenCL implementation is considered for all architectures. Variations of the algorithm due to different problems solved and different element types are investigated and several optimizations aimed at proper optimization and mapping of the algorithm to computer architectures are demonstrated. Performance models of execution are developed for different processors and tested in practical experiments. The results show the varying levels of performance for different architectures, but indicate that the algorithm can be effectively ported to all of them. The general conclusion is that the finite element numerical integration can achieve sufficient performance on different multi and manycore architectures and should not become a performance bottleneck for finite element simulation codes.
1 Introduction
Finite element numerical integration forms an indispensable part of practically all complex finite element codes. Apart from special formulations for specific problems (such as e.g. applications in structural mechanics for beams, plates etc.) where it can be avoided using some analytically obtained formulae, it is universally used for creating entries to finite element stiffness matrices and load vectors (matrices and right hand side vectors for systems of linear equations).
The recent progress in hardware for numerical computations caused the widespread use of multicore and massively multicore (often termed manycore) processors with vector processing capabilities. These created the necessity to reconsider the implementations of scientific codes, taking into account the possible options for multithreading and the changing execution environment characteristics.
The aim of the research presented in the current paper is to thoroughly analyse the process of finite element numerical integration for first order approximations, in a possibly broad context, and to propose a set of guidelines for designing efficient integration procedures for multicore processors, taking into account SIMD modes of execution. The paper shows also some variants of the algorithm, that are implemented for different hardware platforms and tested in practice.
The paper is organised as follows. First, finite element numerical integration is described in the form investigated in our work. Then, in Section 3 its implementation, in the context of finite element simulations and for different types of problems, elements and computing platforms is analysed. Several variants of the procedure are developed, and, in the next section, their optimization and mapping to different processor architectures is discussed. Section 5 presents the results of computational experiments that test the performance of procedures with short discussion of results. The last two sections briefly document other works related to the subject presented in the current paper and present the final conclusions.
2 Finite element numerical integration
2.1 Finite element weak statements and systems of linear equations
In the current paper, we consider the finite element method as a tool for finding approximate solutions to partial differential equations, specified in a computational domain with suitable boundary conditions imposed on , based on weak formulations that can be expressed (with some details omitted) as follows [19]:
Find an unknown function belonging to a specified function space , such that
(1)  
for every test function defined in a space .
Above, and , denote the coefficients specific to the weak formulation (with being the number of physical space dimensions), ”” denotes differentiation with respect to space coordinates and summation convention for repeated indices is used. and denote boundary terms, usually associated with integrals over the boundary . The task of calculating boundary terms is usually much less computationally demanding, and we neglect it in the current paper, when discussing implementations for multi and manycore architectures. In the actual computations described later in the paper this task is performed by the part of the code executed by CPU cores, using standard finite element techniques.
There are many possible finite element approximation spaces that can be defined for triangulations of into a set of finite elements , In the current paper we concentrate on scalar problems and the popular case of first order approximations, i.e. spaces spanned by elementwise linear (or multilinear) basis functions, associated with element vertices. Vertices with the associated basis functions form the nodes of the finite element mesh, with their number denoted by . Piecewise linear basis functions are defined for simplex elements (triangles in 2D and tetrahedra in 3D), multilinear basis functions are specified for other types of elements (quadrilateral, prismatic, hexahedral, etc.)
After representing unknown and test functions as linear combinations of basis functions , , the system (1) is transformed to a set of linear equations for the unknown vector of degrees of freedom :
(2) 
with the entries in the global system matrix (the stiffness matrix) computed as (with boundary terms neglected):
(3) 
The entries to the global right hand side vector (the load vector) are calculated according to the formula (again with boundary terms neglected):
(4) 
The solution of a single problem (1) (or equivalently the system (2)) may constitute the only step of the finite element solution process or may form a single stage of more complex solution strategies, e.g. for time dependent and/or nonlinear problems. In the latter case the efficient accomplishment of numerical integration becomes even more important for the overall performance of simulation programs.
2.2 Finite element solution procedures
The solution of a single finite element problem represented by (1) or (2) requires the creation of global stiffness matrix and load vector entries, followed by their use in obtaining the final solution of the problem. A single entry in the global stiffness matrix or load vector consists of an integral over the whole computational domain plus some boundary terms. The integral over the computational domain, on which we concentrate in the current paper, can be expressed as the sum of contributions from individual elements:
(5) 
The entries can be either computed individually, in a double loop over all entries of the global stiffness matrix, or can be computed in a loop over all elements, with all the entries related to a given element computed at once. We consider in the current paper only the latter strategy. Compared to the first strategy where each finite element is visited several times, this saves some element related calculations but leaves as a result element contributions that have to be further processed. Our choice is motivated by the fact that for many types of problems, including nonlinear, the overhead of repeated element calculations may become too high. In our approach we do not decide, however, how element contributions are used further in the solution procedure. Different options, with the assembly of the global stiffness matrix [7] or without assembly, in the so called matrixfree approaches [1], can be applied.
The element contributions to the global stiffness matrix form a small dense matrix , with each entry corresponding to a pair of element shape functions (that are used for creation of specific global basis functions). The entries of the element stiffness matrix (with boundary terms neglected) are given by the formula:
(6) 
where now indices and are local indices, with the range from 1 to – the number of shape functions for a given element.
The element right hand side vector, that forms the element contribution to the global right hand side vector, is computed based on the formula (with boundary terms neglected):
(7) 
2.3 A generic finite element numerical integration algorithm
We consider in our paper the creation of element stiffness matrices and load vectors using numerical integration and transformations to reference elements. Both procedures are related, since quadratures for numerical integration are usually defined for reference elements.
Shape functions are defined for a reference element of a given type of geometry and approximation. Each real element in the finite element mesh is treated as an image of the associated reference element under a suitable geometric transformation. Denoting the physical coordinates of points in the real element by , the transformation (defined separately for each element) from the reference space with coordinates is expressed as (with the inverse transformation given by ).
The parameters of the transformation are used for calculations of global derivatives of shape functions that involve the elements of the Jacobian matrix of the inverse transformation (the elements of Jacobian matrices and with their determinants, are later on, when considering actual computations, termed together as Jacobian terms).
The transformation from the reference element to the real element is used to perform the change of variables in integrals from (6) and (7). Taking an example integral from (6) and applying the change of variables we obtain:
(8) 
where denotes the shape functions defined for the reference element.
The application of numerical integration quadratures (in practical examples we always use Gaussian quadratures) leads to the transformation (for brevity we use global derivatives of shape functions):
(9) 
where denotes the coordinates of the subsequent integration points with the associated weights , ( being the number of integration points related to the type of element, the degree of approximation and the form of coefficients ).
The input for the generic procedure of numerical integration for a single real element, that implements (6) and (7), with (8), (9) and similar expressions for other terms taken into account, is given by:

the type of element, the type and the order of approximation

the set of local coordinates and weights for integration points

possibly the set of values of shape functions and their local derivatives at all integration points for the reference element (they can be also computed on the fly during integration)

the set of geometry degrees of freedom for the real element

the set of problem dependent parameters that are used for computing coefficients and at integration points (for simple problems these may be the values of coefficients at integration points themselves)
The output of the procedure is formed by the element stiffness matrix and the element load vector . Algorithm 1 shows a generic procedure for creating a sequence of element stiffness matrices and load vectors, assuming they have the same type and order of approximation, with the data for the respective reference element (integration data and shape functions data) assumed to be available in local data structures.
The algorithm shows how numerical integration calculations can be performed. As it can be seen, there are three subsequent, separate loops over integration points in the algorithm. They can be fused together, as it is often done in practice. The presented form of Algorithm 1 has been chosen to show that, with the suitable arrangement of calculation stages, the final computations of stiffness matrix entries are performed in five nested loops, with the order of loops being arbitrary. Each order can, however, lead to different execution performance, that additionally depends on the hardware on which the code runs. This opens multiple ways for optimizing the execution characteristics of finite element numerical integration that we investigate in the following sections.
3 Implementation of numerical integration
3.1 The place of numerical integration within the finite element solution procedure
We treat the procedure of numerical integration as one of computational kernels of finite element codes. By this we want to stress its importance for finite element computations and its influence on the performance of codes. To further clarify its place in finite element programs we describe a way in which, in the setting assumed in the paper, it interacts with the rest of the code.
We consider the implementation of finite element numerical integration in codes designed for large scale calculations. The number of elements in such calculations often exceeds million and even can reach several billions. The only way to ensure the scalable execution of programs for that size of problems is to consider distributed memory systems and message passing. The standard way of implementation in this case is some form of domain decomposition approach. In the current paper we assume the design of finite element codes, in which the issues of message passing parallelization and multicore acceleration are separated [2]. In this design multithreading concerns only single node execution and a separate layer of software combines single node components into a whole parallel code.
As a result we consider the algorithm of numerical integration executed for all elements of a given subdomain. We assume the size of the subdomain to exceed one hundred thousand of elements, the size that justifies the use of massively multicore accelerators. We allow for complex mesh and field data structures related to adaptive capabilities, many types of elements, several approximation fields and different approximation spaces in a single simulation. Therefore we assume that the finite element data concerning elements and approximation fields are stored in standard global memory accessible to CPUs. This memory should ensure the sufficient storage capacity and a latency minimizing way of accessing data.
Hence, the input data to numerical integration comes from finite element data structures stored in CPU memory. The output of numerical integration is formed by element stiffness matrices and load vectors. In order to allow for broad range of scenarios of using the computed entries (standard direct, frontal, iterative solvers, with or without matrix assembly) we assume that the output arrays after being computed, can be stored in some memory, available to threads performing subsequent solution steps (this can even include registers if e.g. assembly operation is designed as extension of procedures performing numerical integration). Hence, we discuss the problem of effective storing of element arrays in memory of the processor performing integration, but consider also the case, where the data are not stored at all. In both cases the further use of element stiffness matrices and load vectors depends on implementation details of possible assembly and solver procedures.
3.2 Parallelization strategy
As can be seen in Algorithm 1, numerical integration in the form considered in the current paper, consists of several loops. Each of these loops can be parallelized, leading to different parallel algorithms with different optimization options and performance characteristics.
For the analysis in the current paper we consider parallelization of only one loop – the loop over finite elements. Due to the sufficiently large number of elements in the loop (as assumed for message passing implementation of large scale problems) and the embarrassingly parallel character of the loop this choice seem to be obvious. By analysing it in details, we want to investigate its optimization and performance characteristics, but also to show possible limitations that can lead to different parallelization strategies.
Numerical integration is an embarrassingly parallel algorithm, but the later stages, assembly and linear system solution, are not. To allow for efficient execution of these stages, further assumptions are made. As was described when developing Algorithm 1, elements of a single subdomain are divided into sets of elements of the same type and degree of approximation (to made them use the same reference element shape functions and integration quadratures). In order to allow for parallel assembly, the elements are further divided into subsets with different colors assigned. Elements of a single color do not contribute to the same entries in the global stiffness matrix and their stiffness matrices can be easily assembled into the global stiffness matrix in a fully parallel way, with no data dependencies.
3.3 Variations of the algorithm
Despite the fact that finite element numerical integration can be expressed by two simple formulas (6) and (7), there exist many variations of the algorithm (even when considering only first order approximations), that should be taken into account when designing computer implementations.
First, there are different types of problems solved, scalar and vector, linear and nonlinear. Depending on the weak statement of the corresponding problem, the set of PDE coefficients can be either sent as input to the procedure or the input to the procedure can form just the arguments of a special, problem dependent function, that calculates the final coefficients. In the current paper we consider the first option, with the assumption that the coefficients for element stiffness matrices are constant over the whole finite element.
The coefficient matrices for different problems (matrices in Algorithm 1) can have varying nonzero structure. To make our implementations, to certain extent, independent of these variations, we consider the organization of computations with the final calculations of updates to the stiffness matrix (two loops over space dimensions and – lines 1722 in Algorithm 1) always performed inside all other loops. With such a solution, these final calculations are left as problem dependent and should be optimized separately, for each problem, either manually or using some automatic tools [28].
Another variation is related to different geometrical types of finite elements: linear, multilinear and geometrically higher order (e.g. isoparametric for higher order approximations). The main difference, from the point of view of numerical integration, is between linear simplex elements, for which Jacobian terms are constant over the whole element, and all other types for which Jacobian terms have to be computed separately for each integration point. In our study we compare two types of elements: geometrically linear tetrahedral elements and geometrically multilinear prismatic elements (both are elements without curved boundaries, with geometry degrees of freedom given by vertices coordinates).
For the two different types of geometry of elements we consider different variations of numerical integration for first order approximations. We explicitly take into account the fact that for geometrically linear elements the derivatives of geometric shape functions and solution shape functions are constant over each element and that the corresponding calculations can be moved out of the loop over integration points.
Finally, we consider several options for arranging the order of calculations in Algorithm 1. For the three loops, over integration points () and twice over shape functions ( and ), we consider the different orderings and denote them by QSS, SQS and SSQ (the symbol reflects the order implied by the subscripts of the loop indices). We assume that there are no separate loops for computing the additional quantities, Jacobian terms and global derivatives of shape functions, but that the loops are fused together.
Combining together the options discussed above we present six algorithms (2, 3, 4, 5, 6, 7) for the final calculations of entries to (and as well). The algorithms correspond to:

different loop orderings – QSS, SQS and SSQ versions

different types of elements – for geometrically linear elements and for all other types (it can be used for geometrically linear elements, but is less efficient)

different types of problems – through the delegation of final updates to problem specific calculations.
In the algorithms it is assumed that the values of Jacobian terms and the global derivatives of all shape functions are calculated just after entering the calculations for a given integration point. The global derivatives are stored in some temporary array and later used in calculations. In practical calculations, in order to save the storage for temporary arrays, another strategy can be used with the values of global derivatives for each shape function computed just at the beginning of the body of the loop for this shape function. The drawback of the approach is that the calculations in the innermost loop over shape functions are redundantly repeated for each shape function.
The six variants of numerical integration can be considered as the optimizations of the algorithm, corresponding to classical techniques of loop interchange and loop invariant code motion. We perform them explicitly, to help compilers and to make the output for the same variant from different compilers similar. There are other different optimizations possible, such as common subexpression elimination, loop unrolling or induction variable simplification. In order to preserve the portability of the procedures for different processor architectures, we leave the application of these and other optimizations to the compiler. Instead, aiming particularly at optimization of the execution on graphics processors, we explicitly consider the placement of different arrays appearing in calculation in different levels of memory hierarchy. Before we present the details, we briefly introduce the programming model and example hardware platforms for which we design the implementations.
3.4 Programming model
We want to compare the optimizations and performance of execution for numerical integration on several types of hardware – multicore CPUs, GPUs, Xeon Phi – but do not want to consider a radically different programming models. We choose a programming model that offers the explicit usage of shared memory (necessary for GPUs), but that can be used also for architectures based on standard CPU cores (we include in this category also Xeon Phi architecture). The model has to allow for efficient vectorization, as well as other standard optimizations.
The model that we choose is based on models developed for GPUs (CUDA [33], OpenCL [14]), but tries to simplify them. The goal is to obtain the model simple enough, so that application programmers can use it, but having enough features to allow for high performance implementations on different contemporary hardware. Our final implementations are all done in OpenCL due to the fact of existing software development tools for all processors considered.
We assume that the code is written in the SPMD (single program multiple data) fashion, for threads that execute only scalar operations. Several threads can be grouped together, so that hardware can run them in lockstep, creating vectorised parts of the code. It is up to the compiler and the hardware to perform such vectorization. This vectorization is, in current programming environments, hidden from the programmer, but cannot be neglected when analysing the performance of implementation. The vectorization is considered the standard form of execution for GPUs and an important performance improvement for architectures with standard CPU cores.
Because of the lockstep execution, we will call a group of threads performing vectorised operations a SIMD group (this term directly corresponds to the notions of NVIDIA warps and AMD wavefronts). The number of threads in a SIMD group depends on scheduling and execution details of each processor architecture. For the contemporary GPUs it is usually equal to 32 or 64, while for the current CPU like cores it is related to the width of vector registers and can be equal to 4, 8 or 16^{1}^{1}1In the model, we adopt the point of view on execution on standard CPU cores taken from Intel OpenCL compiler [18]. In that perspective, a traditional CPU thread performing vector operations is seen as a SIMD group of threads, while for scalar operations, it is assumed that they are performed by a single thread in the SIMD group..
One or several SIMD groups form a set of threads, that is called a threadblock in CUDA and a workgroup in OpenCL. We include that notion in the programming model and use the OpenCL word workgroup. Threads in a single workgroup are run on the same processor core (by the core we understand a standard CPU core, an NVIDIA streaming multiprocessor, an OpenCL computing unit). They share access to explicitly available fast memory (that we call further shared memory, adopting the CUDA convention, as opposed to using the OpenCL notion of local memory) and can be explicitly synchronised using the barrier construct.
Threads from different workgroups are not synchronised and form a space of threads, executing concurrently a part of the code constituting a (computational) kernel. We assume that the whole code, in our case a code performing finite element simulations, that contains complex and large data structures, is executed in the traditional way on standard CPUs with only several kernels offloaded to (massively) multicore accelerators.
The act of offloading is not a necessary step in executing codes in our model. For future architectures, having standard memories accessible to massively multicore processors with SIMD capabilities, it can be eliminated, with kernels denoting at that moment only carefully parallelized, vectorised and optimized parts of codes.
A programmer in our model can explicitly specify a storage location for selected kernel variables, as global or shared memory. In that case, specific, hardware dependent, considerations has to be performed to take into account, and possible optimize, different ways of accessing data in different levels of memory hierarchy. More specifically we assume that the accesses from different threads in the same SIMD group executing a single memory operation can be organized in such a way that they concern subsequent memory locations in memory. If two levels of memory are involved, the rule of subsequent accesses is considered more important for the slower memory.
The above, simple, guideline should work well for GPUs, where it should allow for forming fast coalesced memory accesses to global memory [33, 14]. When it cannot be applied, more complex rules for performing coalesced global memory accesses and avoiding bank conflicts for shared memory should be applied. Also for CPU cores, the guideline should allow for fast vectorised memory accesses. If it turns out that the compiler and the hardware cannot perform such operations, more traditional approaches, exploiting spatial and temporal locality for each thread can be exploited.
The variables with no explicit storage location form a set of automatic variables, that are private to each thread. We assume that actual calculations are always performed using a set of registers available to a thread. If the set is large enough, all automatic variables can be stored in registers. This is the optimal situation, that can often happen for simple kernels executed on GPUs. When this is not the case, register spilling occurs, the values are stored in different levels of cache memory and eventually can be transferred, with relatively high cost, to global memory.
3.5 Hardware platforms
Optimal implementation of an algorithm has to consider specific features of hardware platforms for which it is designed. In the paper we try to find, for the particular problem of finite element numerical integration, the most important dependencies between the processor architectures and the performance of codes. We consider three types of computing devices, popular in HPC and having a strong presence on the current Top500 lists [38]. We do not describe the platforms in details, but compare several characteristics, important, in our view, for execution of numerical integration procedures (and definitely other scientific codes as well) that are further exploited when optimizations of these procedures are analysed.
The first platform is a standard contemporary multicore processor, represented by Intel Xeon E52620. It has 6 Sandy Bridge cores, each of which supports twoway multithreading. The second platform, an Intel Xeon Phi 5110P accelerator card has 60 cores, derived from Pentium architecture and supporting 4way multithreading (our OpenCL implementations can use 59 cores). The third platform is an NVIDIA Tesla K20m accelerator card with Kepler (GK110) GPU.
Processor  
Tesla  Xeon  Xeon  
K20m  Phi  E52620  
(GK110)  5110P  (2)  
Number of cores  13  60 (59)  26 
Core characteristics  
Number of SIMD lanes (SP/DP)  192/64  16/8  8/4 
Number of registers  65536x32bit  4x32x512bit  2x16x256bit 
Shared memory (SM) size [KB]  16 or 48  32 (?)  32 (?) 
L1 cache memory size [KB]  64SM size  32  32 
L2 cache memory size [KB]  118  512  256 
Table 1 compares several important characteristics of the considered processors. The data are selected to be important for the performance oriented programmers and reflects purely hardware features and some details of programming environments. By the core, as it was already mentioned, we denote a CPU core or a compute unit (streaming multiprocessor) for GPUs. For certain computing environments (as e.g. OpenCL that we use in numerical tests) the fact of supporting multithreading is reflected by multiplying the number of compute units (Intel OpenCL compiler reports the number of computing units for Xeon Phi equal to 236, and for Xeon E52620 equal to 12). Since hyperthreading does not necessarily lead to proportional performance improvement, we leave in the table the number of physical cores for the Xeon processors.
The notion of SIMD lanes, in the meaning that we use in the paper, reflects the ability of a hardware to perform floating point operations (single or double precision), as it is revealed to programmers through the speed of execution of suitable low level instructions. The notion of SIMD lanes is directly related to the notion of maximal floating point performance, expressed in TFlops, that can be computed by multiplying the maximal number of operations completed by a SIMD lane in a single cycle (usually 2 for FMA or MAD operations) by the number of SIMD lanes and the frequency of core operation in THz. For CPU cores (in Xeon and Xeon Phi processors) SIMD lanes form parts of vector units, for NVIDIA GPUs SIMD lanes correspond to CUDA cores and double precision floating point units.
The number of registers, reveals the number reported in instruction sets and programming manuals. The actual performance of the codes depends on this number but also on the number of physical resources associated (e.g. in standard CPU cores, the number of physical registers is larger than the number available in instruction sets due to the capabilities of outoforder execution).
The picture of fast memory resources for contemporary processors is complex. Not only the details of hardware implementation (the number of banks, bus channels, communication protocols), but also the configurations change from architecture to architecture. For GPUs shared memory forms a separate hardware unit (but in the case of NVIDIA GPUs its size can be selected by the programmer). For standard CPUs, even if the programming model includes the notion of shared memory, it can be neglected in the actual execution (as is the case e.g. for Intel OpenCL compiler that, reportedly, maps it to standard DRAM memory and uses standard for CPU cache hierarchy mechanisms [18] – the reason for placing ”?” in Table 1).
The practical computing power of processors is related to the number of SIMD lanes but there may exist important differences between the different types of processor cores: each one can require a different number of concurrently executing threads (per single SIMD lane) to reach its maximal practical performance (this fact concerns also the performance of memory transfers). For standard CPU cores (as in the Xeon processor), one thread per single lane can suffice to reach the near peak performance. This is related to the complex architecture of cores with outoforder execution and hardware prefetching. For simpler Xeon Phi cores, 4 threads are necessary to reach the peak performance. Even more threads are required for NVIDIA GPUs: usually several threads are sufficient to hide arithmetic operation latencies, while several tens of threads may be needed (in the case when the threads do not issue concurrent memory requests) to allow for hiding the latencies of DRAM memory operations [41, 43].
These facts have important consequences for designing high performance codes for the processors. The speed of execution strongly depends on the speed of providing data from different levels of memory hierarchy (registers, caches, DRAM). It is better to use resources that are faster, but their amount is limited for each core. Table 2 presents the resources available to a single thread when the number of threads per SIMD lane is equal 1 for standard Sandy Bridge core in the Xeon CPU and 4 for Xeon Phi and Kepler GPU. The number of threads per SIMD lane is selected to some extent arbitrarily, but can be understood as the minimal number of threads when high performance of execution is sought.
It can be seen that the processors exhibit different characteristics. Standard Xeon cores (Sandy Bridge in this example) have relatively large caches with less registers. Xeon Phi cores, that are based on previous Intel CPU designs, still have large caches (although smaller than for Sandy Bridge), but, according to AVX standard, possess twice as much registers. Kepler streaming multiprocessors offer relatively large number of registers with relatively small fast memory (especially L2). If software requirements, for register per thread or shared memory per workgroup of threads, are large, then the GPU, either reduces the number of concurrently executed threads (that usually decreases the performance) or even cannot execute the code at all.
All considered processors exhibit complex behaviour with respect to using the memory hierarchy. First generations of GPUs had not caches and, in order to get maximum performance, the programmers had to ensure that the number of requested registers did not exceed the hardware limits and that the DRAM memory accesses strictly adhered to the rules of coalescing, specific to different generations of GPU architectures. Current architectures, due to the use of caches, may not result in excessive penalties for not obeying the two mentioned above principles. This makes the picture of GPU performance less clear: sometimes it is better to allow for register spilling (when due to the too large number of requested registers the data has to be stored in cache), if this decreases the requirements for shared memory, allowing more threads to run concurrently.
Processor  
Tesla  Xeon  Xeon  
K20m  Phi  E52620  
(Kepler GK110)  5110P  (2)  
Number of DP registers  128  32  16 
Number of DP entries in SM  8 or 24  128 (?)  1024 (?) 
Number of DP entries in L1 cache  32SM number  128  1024 
Number of DP entries in L2 cache  59  2048  8192 
4 Performance analysis and optimization of numerical integration kernels
4.1 Memory requirements
Type of problem:  
Poisson  convdiff  
Type of element:  
tetra  prism  tetra  prism  
4  6  4  6  
4  6  4  6  
Integration data (reference element)  16  24  16  24 
Input/output data for each finite element  
Geometry data (input)  12  18  12  18 
PDE coefficients (sent as input)  4  6  20  20 
Stiffness matrix (output)  16  36  16  36 
Load vector (output)  4  6  4  6 
Data for single integration point – QSS algorithm:  
PDE coefficients and  1  1  20  20 
Shape functions and derivatives  16  24  16  24 
Total (including and )  37  67  56  86 
Data for all integration points – QSS algorithm:  
PDE coefficients and  4  6  20  20 
Shape functions and derivatives  28  144  28  144 
Total (including and )  52  192  68  206 
Data for all integration points – SSQ algorithm:  
PDE coefficients and  4  6  20  20 
Shape functions and derivatives  8  8  8  8 
Total (including one entry from and )  14  16  30  30 
We start the performance analysis of numerical integration kernels by specifying the memory requirements that can be compared with the resources offered by different processors. The requirements determine the possible placement of data in different levels of memory hierarchy and, in consequence, the number of accesses to different levels of memory and the number of operations performed during execution.
From now on, in the rest of the paper, we concentrate on four special cases, that present important alternatives related to implementation of numerical integration. We study two types of elements: tetrahedral (geometrically linear) and prismatic (geometrically multilinear) and two types of problems (both scalar): with Laplace operator on the left hand side and, hence, no PDE coefficients corresponding to computing stiffness matrix entries and with the full set of coefficients (as can happen e.g. for different varieties of convectiondiffusionreaction equations). For the first case, denoted later on as ”Poisson” we consider a varying right hand side term, that appears in the numerical integration as a single separate coefficient for each integration point. For the second case, denoted ”convdiff” the coefficients and are assumed to be constant for the whole element.
Table 3 presents the basic parameters for the four considered cases of numerical integration and the sizes (in the number of scalar values) for the arrays that appear in calculations. The data shows a broad range of requirements from several tens of scalars to more than two hundreds (these numbers can grow to several hundreds if e.g. nonconstant PDE coefficients are considered).
The smallest requirements are associated with SSQ versions of algorithms. This can lead to the shortest execution times, however, for prismatic elements the SSQ version is accompanied by redundant calculations of Jacobian terms. On the other hand, the QSS versions, that avoid this redundancy require more resources. The SQS versions are in between, with registers required at most for a single row of stiffness matrix and with Jacobian terms calculations repeated for each row.
When comparing the numbers in Table 3 with the resources offered by the processors for a single thread, specified in Table 2, it can be seen that for some cases and some processors it could be possible to store all data necessary for numerical integration in registers. For other cases more levels of memory hierarchy has to be used. The actual placement of data will depend on the version of the algorithm and optimizations performed.
4.2 Input and output arrays
Because of the importance of coalescing memory accesses to DRAM memory for graphics processors, the organization of input and output arrays can have significant influence on performance of numerical integration kernels.
The arrays passed as arguments to the procedure (i.e. geometrical data and PDE coefficients) can be either used directly in calculations or first retrieved to shared memory or registers and than used many times in actual calculations. The first option can work well for CPU cores, for which the number of registers is relatively small, the compilers perform aggressive optimizations and the data is automatically placed in different levels of cache memory. For GPUs, the level of automation for achieving optimal performance of memory accesses is lower, and usually the programmers are responsible for ensuring the optimal organization of accesses. We adopt this point of view and explicitly design kernels with the minimal number of global memory accesses and, whenever possible, coalesced way of accessing DRAM by different threads. This means that we adopt only the second strategy, with the input data rewritten to local arrays, stored in registers or shared memory. In order to maintain the portability of kernels, we use this strategy for both CPU cores and GPUs, assuming that CPU compilers can perform the proper optimizations of kernels despite this additional step.
The process of rewriting input data still can have several different forms. When threads rewrite subsequent data to registers, each thread refers to subsequent data concerning the element it currently processes (because of the one element–one thread parallelization strategy). In order to ensure coalesced accesses for threads in a single SIMD group, the data for different elements should be placed in subsequent memory locations in DRAM memory, while data for a single element should be spread over distant memory locations. The part of the code responsible for passing input arguments should write properly input arrays, taking into account such parameters as the size of SIMD groups, the number of threads etc. We consider this as too farreaching requirements and assume that the data is passed in a form of arrays in a ”natural” order, i.e. the data for a single element are placed together.
In this case, when rewriting data directly to registers the accesses to global memory are not coalesced. The typical remedy to that, used often for GPUs, is to employ shared memory as a temporary storage with shared access for threads from a single workgroup. The threads read input data in a coalesced way, with subsequent threads from a single workgroup referring to subsequent memory locations, and than write to shared memory, as is illustrated in Fig. 1. The order of writing to and reading from shared memory is assumed not to have the influence on performance and is not optimized (the order of data is left the same as in DRAM memory).
The data placed in shared memory can be left there and later used in calculations, or can be rewritten to registers, with shared memory possibly reused for other data. Both options are taken into account when considering the optimization of kernels.
We adopt a different strategy for output data. Since we leave produced stiffness matrices and load vectors for further use by other kernels or standard CPU code, we assume that the procedures that read them can adapt to the format determined by integration kernels. We also design the versions with coalesced and noncoalesced memory accesses, but now the versions with coalesced memory accesses do not use additional shared memory buffers and write the data directly from data structures related to threads (that can be stored either in registers or in shared memory). As a consequence the data for single element are now spread over distant memory locations and the procedures that read them must take into account the particular formats, that depend on details of the organization of calculations. For noncoalesced writing, a single thread writes the data related to a single element to subsequent memory locations – the situation that can be advantageous for CPU cores with not vectorized memory operations. The situations of coalesced and noncoalesced memory accesses of output data are illustrated in Fig. 2.
4.3 Shared memory usage
The algorithms presented so far can be used to create a proper code for compilers in different software development kits for multicore processors. For architectures based on standard CPU cores, the compilers can produce codes that can reach high performance, thanks to different automatic optimizations. For GPUs, however, there exists one more important optimization, that can have substantial impact on performance and that has to be taken into account explicitly in the source code.
This optimization is the use of shared memory, either as a mean for interthread communication or an explicitly manageable cache. The first option has been already investigated in the previous section, for the purpose of optimal reading of input data. The second option is related to the register requirements of the kernels. If this requirements are too large, the variables designed as local (private to each thread) are spilled to caches and DRAM memory. The last situation, that can happen often due to small cache resources of GPUs, can lead to severe performance penalties. In order to relieve register pressure some of data used in calculations can be placed explicitly in shared memory. Shared memory accesses are usually several times slower than register accesses but, in turn, several times faster then (even coalesced) DRAM accesses.
One thing has to be taken into account for such designs. When one variable is designed to be placed in shared memory instead of a register, the kernel has to allocate the place in shared memory for all threads from a single workgroup. Storing an array with entries for a single thread in shared memory, requires the allocation of times the size of workgroup entries. Since workgroups sizes for GPUs are designed as multiples of SIMD groups sizes (with the minimum recommended both for AMD and NVIDIA GPUs equal to 64), moving data to shared memory involves relatively large resource requirements. Because of that, when designing numerical integration kernels, we consider the placement in shared memory of only one array appearing in final calculations of stiffness matrix and load vector entries, either the one for PDE coefficients or for shape functions and their derivatives or, eventually, for the output and matrices.
It is difficult to investigate all possibilities for placement of data in shared memory (including the options of reusing shared memory after coalesced reading for some other (than input) arrays). The number of options can reach tens (or even hundreds if the assumption of placing only one array in shared memory is abandoned). The factors that influence the performance in these cases include, apart from the number of accesses to shared memory, that can be induced e.g. from the analysis of source code, also the relative sizes and speeds of different levels of memory, as well as compiler strategies for allocating registers, that influence the number of accesses to caches and global DRAM memory. These factors can differ much for different architectures.
Because of that, in order to select the best option, we choose a popular in HPC strategy of parameter based tuning. We create a single code with options concerning the placement of particular arrays in shared memory, that can be switched on or off at compile time, and perform an exhaustive search of the whole space of options by running several tens of variants of the procedure.
4.4 Operation count
The operations performed during kernel executions include global memory accesses, shared memory accesses and the remaining operations. Among the latter we are interested only in floating point operations. We base our execution time estimates, developed always for a single finite element, on the numbers of global memory accesses and floating point operations, due to the difficulties with taking into account shared memory and cache accesses.
When performing analysis at this level, the final optimizations, performed by software (compilers) and hardware should be taken into account. This can be difficult, especially when different architectures are taken into account. We try to develop the estimates, considering such classical optimizations as loop invariant code motion, common subexpression elimination, loop unrolling, induction variable simplification, etc. (that as we mentioned earlier are not directly introduced into the source code, but left for the compilers). In the estimates we accept the minimal number of performed operations and, when calculating performance data for processors, we base them on the estimated numbers.
The number of global memory accesses is relatively easy to estimate. As was already mentioned, in our implementations we accept the principle of explicitly managing the accesses to global memory, i.e. we do not use global arrays in calculations but rewrite the data to shared memory or registers (local, automatic variables) and than use the latter in computations. Hence, the number of global memory accesses can be estimated based on the sizes of input and output arrays for the numerical integration procedures (Table 3). All variants (QSS, SQS and SSQ) perform the same number of global memory accesses reported in the first line of Table 4.
Type of problem:  
Poisson  convdiff  
Type of element:  
tetra  prism  tetra  prism  
The number of global memory accesses:  
all variants  36  66  52  80 
The number of auxiliary shared memory accesses – QSS algorithm  
Geometry data in shared memory  12  108  12  108 
PDE coefficients in shared memory  4  6  80  120 
Shape functions in shared memory  60  210  60  210 
and in shared memory  180  546  180  546 
The number of operations  
QSS algorithm  290  2700  986  4806 
SQS algorithm  290  10416  986  12492 
SSQ algorithm  290  54876  1623  65232 
Arithmetic intensities  
QSS algorithm  8  41  19  60 
SQS algorithm  8  158  19  156 
SSQ algorithm  8  831  31  815 
Processor  
Tesla  Xeon  Xeon  
K20m  Phi  E52620  
(GK110)  5110P  (2)  
Peak DP performance [TFlops]  1.17  1.01 (0.99)  20.096 
Peak memory bandwidth [GB/s]  208  320  242.6 
Limiting arithmetic intensity (DP)  45  25  18 
Benchmark (DGEMM) performance  1.10  0.84  0.18 
Benchmark (STREAM) bandwidth  144  171  67 
Limiting arithmetic intensity (DP)  61  39  21 
The situation is much more complex for shared memory accesses and floating point operations. First, for different variants of kernels there will be different organization of operations. Moreover, for different organizations compilers can perform different sets of optimizations. We present the estimated numbers for shared memory accesses and floating point operations in Table 4. In deriving the estimates we tried to take into account the possible optimizations, related e.g. to the facts that certain values are constant during calculations (e.g. derivatives of shape functions for different integration points within the tetrahedral element, PDE coefficients for different integration points in the convectiondiffusion case) and that the resulting stiffness matrices may be symmetric.
Calculations in all versions of numerical integration algorithm contain several phases, such as:

computing real derivatives of geometrical shape functions (9 operations for tetrahedral elements, 126 operations per integration point for prisms)

computing Jacobian terms (49 operations in all cases, but performed once for tetrahedra and repeated for each integration point for prisms)

computing real derivatives for shape functions (60 operations for tetrahedra and 90 operations per integration point for prismatic elements)

calculating the entries for the stiffness matrix and the load vector
The final calculations are performed in a triple loop over integration points and shape functions, but in many cases the loops are fully unrolled by the compiler and the resulting numbers of operations are the same for all variants of numerical integration. This happens for tetrahedral elements due to many calculations moved outside all loops (since derivatives of shape functions are constant over the element). For prismatic elements the fact that several calculations must be repeated for each integration point results in much larger number of operations for the SQS and especially SSQ versions of the algorithm, as compared to the QSS version.
Based on the estimated numbers of global memory accesses and floating point operations we calculate arithmetic intensities (defined as the number of operations performed per global memory access) for the different variants of integration kernels and different cases considered. The calculated values, reported in Table 4, can be compared with performance characteristics of the processors used in the study. We present such characteristics in Table 5. For each processor we report its peak performance and benchmark performance for the two types of operations – memory accesses and floating point operations (as benchmarks we consider STREAM for the memory bandwidth and dense matrixmatrix multiplication (DGEMM) for floating point performance; moreover from now on we restrict our analysis to only double precision calculations, as more versatile than single precision). Based on these data and assuming the simple roofline model of processor performance [40], the limiting arithmetic intensity, i.e. the value of arithmetic intensity that separates the cases of the maximal performance limited by memory bandwidth from that limited by the speed of performing arithmetic operations (pipeline throughput), is calculated for each processor and theoretical and benchmark performance.
Comparing the data in Table 4 and Table 5 we see that the performance when running different variants of numerical integration kernels can be either memory bandwidth or processor performance (pipeline throughput) limited. For tetrahedral elements the performance for both example problems is memory limited for all architectures (excluding the not optimal SSQ version). For prismatic elements, even when benchmark (i.e. higher) limiting arithmetic intensity is considered, the performance of kernel execution should be pipeline limited for Xeon and Xeon Phi architectures. For the Kepler architecture and the optimal QSS variant the performance will be memory limited for Poisson problem, while is on the border between being memory or pipeline limited for the convectiondiffusion example. For other variants the performance is pipeline limited but the large number of operations in these cases makes them far from optimal.
4.5 Execution time
Based on the data in Tables 4 and 5 we perform final estimates of execution time for different variants of numerical integration and different architectures. We calculate separately the estimated time determined by the number of global memory accesses and the benchmark (STREAM) memory bandwidth of processors and the time determined by the number of floating point operations and the arithmetic throughput in DGEMM. Table 6 contains the longer times from the two, computed as the estimate for each architecture.
The values in Table 6 form lower bounds for the times possible to achieve. There are many factors that can slow down the execution of algorithms. The most important include:

too low occupancy for GPUs (too small number of concurrently executing threads in order to effectively hide instruction or memory latencies) – caused by excessive requirements for registers or shared memory

shared and cache memory accesses for GPUs, not considered in Table 6 and caused by either explicit use of shared memory or register spilling to caches (or even to global memory)

using other than FMA or MAD instructions – peak performance is obtained by all the considered processors for algorithms using almost exclusively FMA or MAD operations that double the performance as compared to using other floating point instructions

employing barriers to ensure the proper usage of shared memory
Type of problem:  
Poisson  convdiff  
Type of element:  
tetra  prism  tetra  prism  
Tesla K20m (GK110 – Kepler)  
QSS algorithm  2.00  3.67  2.89  4.44 
SQS algorithm  2.00  9.47  2.89  11.36 
SSQ algorithm  2.00  49.89  2.89  59.30 
Xeon Phi 5110P  
QSS algorithm  1.68  3.21  2.43  5.72 
SQS algorithm  1.68  12.40  2.43  14.87 
SSQ algorithm  1.68  57.87  2.43  69.40 
Xeon E52620 (two socket configuration)  
QSS algorithm  4.30  15.00  6.21  26.70 
SQS algorithm  4.30  57.87  6.21  69.40 
SSQ algorithm  4.30  304.87  9.02  362.40 
4.6 Transfer from and to host memory
The last issue that we deal with, is the one that is often considered first – the time required for transfer of data between host memory and accelerator memory (when these two memories are separate, i.e. when accelerator is connected with the CPU using PCIe bus, as is the case for Tesla K20m and Xeon Phi).
The reason why we consider it last is that, based on the discussions performed so far, it is evident that if numerical integration is considered alone, without considering other procedures of finite element processing, the time required for transferring input and output data (or even input data alone) strongly dominates the time for performing actual computations. This becomes obvious after realizing that the performance of numerical integration kernels is usually memory bound or close to being memory bound, that the amount of data in global memory accessed by accelerator during calculations is the same as that of hostaccelerator memory transfer and that the bandwidth of PCIe is several times lower than that of either GPU or Xeon Phi global memory.
Because of that we do not discuss this issue here. Apart from the fact that such a discussion should include broader context of finite element simulations, there is another reason for not going into details. The solution with PCIe CPUaccelerator connection is universally considered as one of the most important performance bottlenecks for accelerator computing. Hardware providers are considering many different options that should change this situation. When these solutions become available, the detailed analyses of the influence of host memoryaccelerator memory transfer on performance of codes can be performed (for some initial experience see e.g. [26]).
5 Computational experiments
We test the three versions (QSS, SQS, SSQ) of numerical integration kernels for the four described example cases (Poisson and convectiondifffusion for tetrahedral and prismatic elements) and the three selected processor architectures Intel Xeon E52620, Intel Xeon Phi 5110P and NVIDIA Tesla K20m with Kepler GK110 architecture. For all processors we use 64bit Linux with 2.6.32 kernel. For Xeon processors we use Intel SDK for OpenCL Applications version 4.5 (with OpenCL 1.2 support). For NVIDIA GPU we use CUDA SDK version 5.5 with OpenCL 1.1 support and 331.20 driver. We use the same kernels for all three architectures, changing only the size of workgroups: 8 for Xeon, 16 for Xeon Phi and 64 for Kepler.
For each particular kernel and problem we perform a series of experiments with different combinations of parameters determining the use of shared memory in computations. In Table 7 we present the best times for each kernel, indicating additionally what percentage of the best performance (associated with the times presented in Table 6) was achieved. Figures 3, 4 and 5 illustrates the comparison of performance for each case of problem and approximation and different processors.
Type of problem:  
Poisson  convdiff  
Type of element:  
tetra  prism  tetra  prism  
Tesla K20m (GK110 – Kepler)  
QSS algorithm  2.02 (99%)  12.80 (29%)  4.46 (65%)  15.58 (29%) 
SQS algorithm  2.02 (99%)  38.39 (25%)  6.05 (48%)  64.81 (18%) 
SSQ algorithm  2.02 (99%)  94.99 (53%)  6.13 (47%)  194.93 (30%) 
Xeon Phi 5110P  
QSS algorithm  8.77 (19%)  25.09 (13%)  17.11 (14%)  33.29 (17%) 
SQS algorithm  11.35 (15%)  41.20 (30%)  15.97 (15%)  60.18 (25%) 
SSQ algorithm  11.43 (15%)  101.20 (65%)  16.53 (15%)  132.46 (59%) 
Xeon E52620 (two socket configuration)  
QSS algorithm  19.31 (22%)  49.74 (30%)  26.74 (23%)  64.27 (42%) 
SQS algorithm  18.77 (23%)  128.07 (45%)  24.86 (25%)  133.14 (52%) 
SSQ algorithm  17.76 (24%)  385.33 (79%)  26.11 (35%)  475.41 (76%) 
The detailed results of experiments lead to several remarks belonging to different domains of interest. For practitioners performing finite element simulations on modern hardware the interesting facts include:

the execution time of finite element integration depends strongly not only on the order of approximation (this fact is obvious from the classical complexity analysis), but also on the type of element and the kind of problem solved. The differences are not so significant as in the case of different approximation orders [20, 25, 4], but still can reach an order of magnitude

due to large differences in execution time the problem of proper mapping to hardware and optimization of numerical integration may be more or less important for the time of the whole finite element simulation, depending also e.g. on the time required by the solver of linear equations (if present) [3, 42])
For finite element software developers, especially in the HPC domain, there are the following general observations:

large resource requirements for some variants of numerical integration (e.g. convectiondiffusion problems and prismatic elements), limits the performance achieved by the GPU architecture in our comparison; the reason lies mainly in too small number of threads (SIMD groups) that can run concurrently on a single core (streaming multiprocessor)

the QSS variants of the numerical integration algorithm turned out to be the most versatile and efficient; due to the proper optimizations in the source code and performed by the compilers, the performance of QSS kernels was either the fastest or slower by only several percent than the fastest version

detailed results of experiments showed that the performance on our GPU (as is also the case for other contemporary GPU architectures) is sensitive to the way the data are read and written to memory, coalescing memory accesses turned out to be more important for the Kepler processor than for both Xeons

when the performance of algorithm is memory bound and there is a significant influence of the way the data are accessed in global memory (as is the case e.g. for Kepler architecture, tetrahedral element and Poisson problem), there is a growing importance of data layout in memory (and the use of output data by other components of finite element codes) [7, 5, 11]

the GPU architecture turned out to be also more sensitive to the use of OpenCL shared memory (as suggested by the Intel OpenCL compiler manual [17], this had no significant impact on the performance of Xeons)

together, the changes in the way data are read and written to the global memory and the changes in shared memory usage caused the differences in execution time of the order 23 for Xeons, and more than 5 for Kepler
In general, for some test cases, the performance of obtained portable numerical integration kernels can be considered satisfactory. Still, for each architecture and most of the cases considered, the benchmark results suggest that it should be possible to achieve execution times several times shorter (from 3 times for Kepler architecture, up to almost 8 times for Xeon Phi). This could however usually be obtained by either considering the specialized domain specific optimizing compilers or by loosing the portable character of the kernels and introducing architecture specific optimizations into the source code. Another possibility is to consider different parallelization strategies, the option that we plan to follow in subsequent publications.
Finally there are some observations concerning the hardware itself (these observations are common to many algorithms, but are confirmed for our example case of finite element numerical integration):

our example GPU, Tesla K20m, turned out to be the fastest processor in the comparison; the observations made by several authors [39, 15, 36] show that Xeon Phi processors applied to different scientific computing algorithms, although usually faster than standard Xeon CPUs, are usually slower than the recent HPC targeted GPUs

the detailed results of experiments for particular optimization variants of kernels showed significant resemblance of Xeon and Xeon Phi performance, indicating the similarities between these architectures.
6 Related work
The subject of numerical integration on modern multicore processors, especially graphics processors, has been investigated in several contexts. The first works concerned GPU implementations for specific types of problems, such as higher order FEM approximations in earthquake modelling and wave propagation problems [23], GPU implementations of some variants of discontinuous Galerkin approximation [21] or higher order approximations for electromagnetics problems [8].
The second important context for which finite element numerical integration was considered, is the creation of domain specific languages and compilers. Interesting works for this approach include [30, 31, 22, 27, 28]. The research in the current paper can be understood as laying theoretical and experimental grounds for possible future finite element compilers, able to create optimized code for all types of problems, elements and approximations, as well as different processor architectures.
Finally, numerical integration, in more or less details, have been discussed in the context of the whole simulation procedure by the finite element codes. Some works in this area include [16, 32, 13, 29, 12, 11, 35]. Usually more attention to numerical integration has been given in articles that consider the process of creation of systems of linear equations for finite element approximations, with the important examples such as [10, 37, 6, 9].
The present paper forms a continuation of our earlier works, devoted solely to the subject of finite element numerical integration. The first papers considered higher order approximations, starting with first theoretical and experimental investigations in [34, 24] and culminating in larger articles devoted to the implementation and performance of numerical integration kernels for multicore processors with vector capabilities [25] (especially IBM Power XCell processor) and GPUs [4].
The investigations in the current paper on one hand touch the narrower subject of only first order approximations, but on the other hand concern two important types of finite elements (geometrically linear and nonlinear) and are performed, in a unifying way, for three different processor architectures used in scientific computing. This scope, combined with the depth of investigations concerning the performance of OpenCL kernels, differentiate the article from the other on this subject.
7 Conclusions
We have investigated the optimization and performance on current multi and manycore processors for an important computational kernel in scientific computing, the finite element numerical integration algorithm for first order approximation. We have used a unifying approach of OpenCL for programming model and implementation on three popular architectures in scientific computing: Intel Xeon (Sandy Bridge), Intel Xeon Phi and NVIDIA Kepler. We have demonstrated that this approach allows for exploiting multicore and vector capabilities of processors and for obtaining satisfactory levels of performance, as compared to the theoretical and benchmark maxima, not loosing the portability of the developed code (we have used the same OpenCL kernels for all architectures). The investigations in the paper reveal that, even for the simple problems, elements and approximations as considered in the paper, there is a significant variation of required resources and associated complexities for the algorithm, that can lead to different problems when mapping to architectures of modern processors. Nevertheless, the results of computational experiments show that for all the cases considered in the paper, the numerical integration algorithm can be successfully ported to massively multicore architectures, and hence, when used in finite element codes should not form a performance bottleneck. The presented detailed analyses indicate what conditions must be met in order to obtain the best performance of the kernels and what performance can be expected when numerical integration is used on different processors for different types of problems and finite element approximations.
References
 [1] P. Arbenz, G.H. van Lenthe, U. Mennel, R. Muller, and M. Sala, A scalable multilevel preconditioner for matrixfree finite element analysis of human bone structures, International Journal for Numerical Methods in Engineering, 73 (2008), pp. 927–947.
 [2] K. Banaś, A modular design for parallel adaptive finite element computational kernels, in Computational Science — ICCS 2004, 4th International Conference, Kraków, Poland, June 2004, Proceedings, Part II, M. Bubak, G.D. van Albada, P.M.A. Sloot, and J.J. Dongarra, eds., vol. 3037 of Lecture Notes in Computer Science, Springer, 2004, pp. 155–162.
 [3] K. Banaś, Scalability analysis for a multigrid linear equations solver, in Parallel Processing and Applied Mathematics, Proceedings of VIIth International Conference, PPAM 2007, Gdansk, Poland, 2007, R. Wyrzykowski, J. Dongarra, K. Karczewski, and J. Wa’sniewski, eds., vol. 4967 of Lecture Notes in Computer Science, Springer, 2008, pp. 1265–1274.
 [4] Krzysztof Banaś, Przemysław Płaszewski, and Paweł Macioł, Numerical integration on GPUs for higher order finite elements, Computers and Mathematics with Applications, 67 (2014), pp. 1319 – 1344.
 [5] Yong Cai, Guangyao Li, and Hu Wang, A parallel nodebased solution scheme for implicit finite element method using {GPU}, Procedia Engineering, 61 (2013), pp. 318 – 324. 25th International Conference on Parallel Computational Fluid Dynamics.
 [6] Cris Cecka, Adrian J. Lew, and E. Darve, Application of assembly of finite element methods on graphics processors for realtime elastodynamics, in GPU Computing Gems. Jade edition, Wen mei W. Hwu, ed., Morgan Kaufmann, 2011, pp. 187–205.
 [7] , Assembly of finite element methods on graphics processors, International Journal for Numerical Methods in Engineering, 85 (2011), pp. 640–669.
 [8] A. Dziekonski, P. Sypek, A. Lamecki, and M. Mrozowski, Finite element matrix generation on a gpu, Progress In Electromagnetics Research, 128 (2012), pp. 249–265.
 [9] , Generation of large finiteelement matrices on multiple graphics processors, International Journal for Numerical Methods in Engineering, 94 (2013), pp. 204–220.
 [10] J. Filipovic, J. Fousek, B. Lakomy, and M. Madzin, Automatically optimized gpu acceleration of element subroutines in finite element method, in Application Accelerators in High Performance Computing (SAAHPC), 2012 Symposium on, July 2012, pp. 141–144.
 [11] Zhisong Fu, T. James Lewis, Robert M. Kirby, and Ross T. Whitaker, Architecting the finite element method pipeline for the GPU, Journal of Computational and Applied Mathematics, 257 (2014), pp. 195 – 211.
 [12] Serban Georgescu, Peter Chow, and Hiroshi Okuda, Gpu acceleration for fembased structural analysis, Archives of Computational Methods in Engineering, 20 (2013), pp. 111–121.
 [13] M. Geveler, D. Ribbrock, D. GÃ¶ddeke, P. Zajac, and S. Turek, Towards a complete fembased simulation toolkit on gpus: Unstructured grid finite element geometric multigrid solvers with strong smoothers based on sparse approximate inverses, Computers & Fluids, 80 (2013), pp. 327 – 332. Selected contributions of the 23rd International Conference on Parallel Fluid Dynamics ParCFD2011.
 [14] Khronos OpenCL Working Group, The OpenCL Specification, version 1.1, 2010.
 [15] Nina Hanzlikova and Eduardo Rocha Rodrigues, A novel finite element method assembler for coprocessors and accelerators, in Proceedings of the 3rd Workshop on Irregular Applications: Architectures and Algorithms, New York, NY, USA, 2013, ACM, pp. 1:1–1:8.
 [16] Peter Huthwaite, Accelerated finite element elastodynamic simulations using the {GPU}, Journal of Computational Physics, 257, Part A (2014), pp. 687 – 707.
 [17] Intel, OpenCL Design and Programming Guide for the Intel Xeon Phi Coprocessor. Accessed 28.04.2014.
 [18] , Intel SDK for OpenCL Applications XE 2013 R3. User’s Guide, 2013.
 [19] C. Johnson, Numerical Solution of Partial Differential Equations by the Finite Element Method, Cambridge University Press, 1987.
 [20] Robert C. Kirby, Matthew G. Knepley, Anders Logg, and L. Ridgway Scott, Optimizing the evaluation of finite element matrices, SIAM J. Scientific Computing, 27 (2005), pp. 741–758.
 [21] A. Klöckner, T. Warburton, J. Bridge, and J. S. Hesthaven, Nodal discontinuous galerkin methods on graphics processors, J. Comput. Phys., 228 (2009), pp. 7863–7882.
 [22] Matthew G. Knepley and Andy R. Terrel, Finite element integration on gpus, ACM Trans. Math. Softw., 39 (2013), pp. 10:1–10:13.
 [23] D. Komatitsch, D. MichÃ©a, and G. Erlebacher, Porting a highorder finiteelement earthquake modeling application to NVIDIA graphics cards using CUDA, Journal of Parallel and Distributed Computing, 69 (2009), pp. 451–460.
 [24] Filip Krużel and Krzysztof Banaś, Finite element numerical integration on PowerXCell processors, in PPAM’09: Proceedings of the 8th international conference on Parallel processing and applied mathematics, Berlin, Heidelberg, 2010, SpringerVerlag, pp. 517–524.
 [25] Filip Krużel and Krzysztof Banaś, Vectorized OpenCL implementation of numerical integration for higher order finite elements, Computers and Mathematics with Applications, 66 (2013), pp. 2030–2044.
 [26] Filip Krużel and Krzysztof Banaś, AMD APU systems as a platform for scientific computing, submitted to Computer Methods in Materials Science.
 [27] Anders Logg and Garth N. Wells, Dolfin: Automated finite element computing, ACM Trans. Math. Softw., 37 (2010), pp. 20:1–20:28.
 [28] Fabio Luporini, Ana Lucia Varbanescu, Florian Rathgeber, GheorgheTeodor Bercea, J. Ramanujam, David A. Ham, and Paul H. J. Kelly, Crossloop optimization of arithmetic intensity for finite element local assembly, ACM Trans. Archit. Code Optim., 11 (2015), pp. 57:1–57:25.
 [29] Ramin Mafi and Shahin Sirouspour, Gpubased acceleration of computations in nonlinear finite element deformation analysis, International Journal for Numerical Methods in Biomedical Engineering, 30 (2014), pp. 365–381.
 [30] Graham R. Markall, David A. Ham, and Paul H.J. Kelly, Towards generating optimised finite element solvers for gpus from highlevel specifications, Procedia Computer Science, 1 (2010), pp. 1815 – 1823. ICCS 2010.
 [31] G. R. Markall, A. Slemmer, D. A. Ham, P. H. J. Kelly, C. D. Cantwell, and S. J. Sherwin, Finite element assembly strategies on multicore and manycore architectures, International Journal for Numerical Methods in Fluids, 71 (2013), pp. 80–97.
 [32] F. Mossaiby, R. Rossi, P. Dadvand, and S. Idelsohn, Openclbased implementation of an unstructured edgebased finite element convectiondiffusion solver on graphics hardware, International Journal for Numerical Methods in Engineering, 89 (2012), pp. 1635–1651.
 [33] NVIDIA, NVIDIA CUDA C Programming Guide Version 5.0, 2012.
 [34] Przemysław Płaszewski, Paweł Macioł, and Krzysztof Banaś, Finite element numerical integration on GPUs, in PPAM’09: Proceedings of the 8th international conference on Parallel processing and applied mathematics, Berlin, Heidelberg, 2010, SpringerVerlag, pp. 411–420.
 [35] I.Z. Reguly and M.B. Giles, Finite element algorithms and data structures on graphical processing units, International Journal of Parallel Programming, (2013), pp. 1–37.
 [36] Erik Saule, Kamer Kaya, and Umit V Catalyurek, Performance evaluation of sparse matrix multiplication kernels on intel xeon phi, arXiv preprint arXiv:1302.1078, (2013).
 [37] Li Tang, X.S. Hu, D.Z. Chen, M. Niemier, R.F. Barrett, S.D. Hammond, and G. Hsieh, Gpu acceleration of data assembly in finite element methods and its energy implications, in ApplicationSpecific Systems, Architectures and Processors (ASAP), 2013 IEEE 24th International Conference on, June 2013, pp. 321–328.
 [38] Top500.
 [39] Sandra Wienke, Dieter Mey, and Matthias Muller, Accelerators for technical computing: Is it worth the pain? a tco perspective, in Supercomputing, JulianMartin Kunkel, Thomas Ludwig, and HansWerner Meuer, eds., vol. 7905 of Lecture Notes in Computer Science, Springer Berlin Heidelberg, 2013, pp. 330–342.
 [40] Samuel Williams, Andrew Waterman, and David Patterson, Roofline: An insightful visual performance model for multicore architectures, Commun. ACM, 52 (2009), pp. 65–76.
 [41] H. Wong, M.M. Papadopoulou, M. SadooghiAlvandi, and A. Moshovos, Demystifying gpu microarchitecture through microbenchmarking, in Performance Analysis of Systems Software (ISPASS), 2010 IEEE International Symposium on, March 2010, pp. 235–246.
 [42] M Woźniak, K Kuźnik, M Paszyński, VM Calo, and D Pardo, Computational cost estimates for parallel shared memory isogeometric multifrontal solvers, Computers & Mathematics with Applications, 67 (2014), pp. 1864–1883.
 [43] Yao Zhang and J.D. Owens, A quantitative performance analysis model for gpu architectures, in High Performance Computer Architecture (HPCA), 2011 IEEE 17th International Symposium on, Feb 2011, pp. 382–393.