Automatic Code Generation for High-Performance Discontinuous Galerkin Methods on Modern Architectures
Abstract.
SIMD vectorization has lately become a key challenge in high-performance computing. However, hand-written explicitly vectorized code often poses a threat to the software's sustainability. In this publication we solve this sustainability and performance portability issue by enriching the simulation framework dune-pdelab with a code generation approach. The approach is based on the well-known domain-specific language UFL, but combines it with loopy, a more powerful intermediate representation for the computational kernel. Given this flexible tool, we present and implement a new class of vectorization strategies for the assembly of Discontinuous Galerkin methods on hexahedral meshes exploiting the finite element's tensor product structure. The optimal variant from this class is chosen by the code generator through an autotuning approach. The implementation is done within the open source PDE software framework Dune and the discretization module dune-pdelab. The strength of the proposed approach is illustrated with performance measurements for DG schemes for a scalar diffusion-reaction equation and the Stokes equation. In our measurements, we utilize both the AVX2 and the AVX-512 instruction sets, achieving 40% to 60% of the machine's theoretical peak performance for one matrix-free application of the operator.
1. Introduction
In the last decade, there have been two main approaches to finite element software frameworks in the academic community.
Library approaches like Dune (bastian2008grid1, ) (bastian2008grid2, ), MFEM (mfem, ) or deal.II (bangerth2007deal, ) use object-oriented C++ to provide users with powerful toolboxes for numerical simulation codes.
The biggest advantages of this approach are its flexibility and extensibility.
However, these frameworks have been struggling with a steep learning curve for new users, as well as the necessity for users to have considerable programming expertise.
Code generation based frameworks such as FEniCS (LoggMardalEtAl2012a, ) or firedrake (rathgeber2015firedrake, ) pursued an alternative route focusing on ease of use:
The finite element problem is formulated in a domain-specific language (DSL) and all user code is written through a high-level Python interface.
This provides an excellent framework for rapid prototyping, but often fails once the capabilities of the underlying C(++) framework need to be extended.
A related approach is implemented by projects like FreeFEM++, which embeds a DSL for PDEs into C++ using expression templates.
The necessity of expert-level C++ programming skills has become even more apparent in light of the latest hardware innovations.
Programming techniques that enable high performance computing on modern architectures include SIMD-aware programming and multithreading.
This work focuses on SIMD (Single Instruction Multiple Data) vectorization, which has become a crucial factor in achieving near-peak floating point performance on modern microarchitectures.
This is due to both the increase in bit width of SIMD registers and the ability to perform fused multiplication and addition in a single instruction.
As a consequence, the maximum percentage of peak performance an optimal scalar code can achieve has dropped significantly in the last decade:
While scalar code could achieve 50% of double precision peak performance on a 2008 Intel Nehalem processor (SSE4.2, no FMA), the number is as low as 6.25% on the 2016 Intel Skylake processor family (AVX-512, FMA).
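Such numbers follow from a back-of-envelope count of floating point operations per cycle. A minimal sketch, under the stated assumptions (scalar code issues one add and one multiply per cycle, an FMA counts as two flops; exact port counts vary by processor model):

```python
# Back-of-envelope peak estimates (flops per cycle), assuming one add and one
# multiply pipeline for scalar code and counting an FMA as two flops.
nehalem_vector_peak = 2 * 2      # 2 SSE lanes (128 bit / double) * (add + mul)
nehalem_scalar_peak = 1 + 1      # one add + one mul per cycle, no FMA
skylake_vector_peak = 2 * 8 * 2  # 2 FMA units * 8 AVX-512 lanes * 2 flops/FMA
skylake_scalar_peak = 1 + 1      # scalar add + mul without FMA
print(nehalem_scalar_peak / nehalem_vector_peak)  # 0.5
print(skylake_scalar_peak / skylake_vector_peak)  # 0.0625
```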
The performance optimization literature describes two basic approaches to the issue (see e.g. (franchetti2005simd, )):

Automatic Vectorization through a vectorizing compiler. Developers write scalar code whose data dependencies are analyzed by the compiler in order to identify vectorization opportunities.

Explicit Vectorization, where developers directly write code for the target architecture.
While the former would be the preferable solution in terms of separation of concerns between application developers and performance engineers, it is not feasible in many practical applications.
Many vectorization opportunities cannot be found automatically, as they would change the program semantics, and only domain knowledge allows developers to include them in the optimization search space.
As an example we mention a memory layout change along the lines of Structure of Arrays vs. Array of Structures.
However, explicit vectorization is also suboptimal, as code is written directly for a given microarchitecture through low-level programming constructs such as compiler-specific intrinsics or even assembly routines.
The resulting code is usually hard to read, hard to maintain and hard to optimize further.
This poses a severe threat to the sustainability of the software.
We are targeting this issue by generating explicitly vectorized code from a hardware-independent problem formulation in a domain-specific language.
Not all discretization methods for the solution of PDEs are equally well suited for high performance implementation on modern architectures.
In order to achieve a notable percentage of machine peak, the numerical algorithm needs to exhibit a favorable ratio of floating point operations per byte loaded from main memory.
We target high order Discontinuous Galerkin methods on hexahedral meshes for a variety of reasons beyond their being a powerful and flexible discretization for a large variety of PDE problems.
DG methods allow for globally blocked data structures. These data structures can be accessed directly from local computations, removing a costly data movement step from the algorithm.
With hexahedral meshes, the basis functions exhibit tensor product structure.
This can be exploited through sum factorization (orszag1980spectral, ), which not only greatly reduces the complexity of the algorithm, but also consists solely of the fused multiply-add (FMA) operations needed to make full use of the floating point capabilities of modern CPUs.
The treatment of tensor product spaces in the domain specific language UFL has been described in (mcrae2016tensorelements, ).
Sum factorization has already been adopted into a variety of numerical codes e.g. (muething2018sumfact, ) (kronbichler2012genericinterface, ) (schoeberl2014ngsolve, ) (homoloya2017structure, ).
Using sum factorization and minimizing data movement, we can achieve a finite element assembly algorithm that is limited by the processor's floating point capabilities instead of its memory bandwidth.
Replacing inherently memory-bound matrix-vector products with matrix-free operator applications using the same sum factorization technique as in finite element assembly, we achieve compute-boundedness even in the linear algebra part of our code.
In other words, our algorithms are designed to be capable of recomputing matrix entries faster than loading them from memory.
It is worth noting that iterative solvers using this kind of matrix-free operator evaluation face the additional challenge of implementing preconditioners that do not degrade the algorithmic complexity of the overall algorithm.
Matrix-free preconditioning techniques have been studied by various authors:
In (mueller2018matrixfree, ), Block-Jacobi and Block Gauss-Seidel preconditioners are investigated in a fully matrix-free and in a partially matrix-free setting.
(pazner2018preconditioner, ) uses a Kronecker-product singular value decomposition approach to approximate the Jacobian such that it can be applied matrix-free.
In (diosady2017preconditioner, ), the use of alternating-direction-implicit and fast diagonalization methods is advocated.
We are integrating the results of this publication into the Dune framework (bastian2008grid1, ) (bastian2008grid2, ).
The Dune project emerged as a C++ successor to long-standing C projects such as UG (bastian1997ug, ).
It features a large variety of grid managers available through a generic interface.
The discretization module dune-pdelab (bastian2010pdelab, ) builds on top of the Dune framework and provides abstractions for finite element and finite volume discretizations.
Its strength lies with implicit methods and massive parallelism.
The contribution of this publication is twofold: First, we introduce a class of algorithms to achieve SIMD vectorization of the finite element assembly problem for DG methods with sum factorization.
These algorithms extend ideas presented in (muething2018sumfact, ) to overcome their limitations with respect to the SIMD width and the applicable PDE problems.
Performance measurements for the Intel Haswell and Intel Skylake architectures show the algorithms' capability of achieving a notable percentage of peak performance.
The second contribution is the embedding of the algorithm into a code generation workflow to achieve performance portability across architectures and PDE problems.
This workflow is based on UFL, the state-of-the-art domain-specific language for the description of PDE models, but, in contrast to the current workflow of other code generation based simulation packages, uses a new intermediate representation.
This intermediate representation, loopy (kloeckner2014loopy, ), provides a powerful tool for optimizing the assembly loop nest and transforming otherwise hardtomodify code properties like memory layouts.
The generated code is integrated into the simulation workflow of dune-pdelab through its CMake build system.
The structure of the paper is as follows: The DG method and its implementation using sum factorization are summarized in section 2. Section 3 introduces new vectorization strategies extending ideas from (muething2018sumfact, ). These are implemented through the code generation approach outlined in section 4. The paper concludes with a performance study in section 5.
2. Sum Factorization for Discontinuous Galerkin Methods
Section 2.1 will briefly introduce the notation and function spaces needed for Discontinuous Galerkin methods with a special focus on the tensor product structure of the finite element. This will be done in a very general fashion; for practical examples see section 5. Section 2.2 will then give a quick overview of how the tensor product structure of the finite element can be exploited algorithmically through sum factorization.
2.1. The Discontinuous Galerkin Method with Tensor Product Spaces
Let $\{\mathcal{T}_h\}$ be a family of shape regular triangulations of the domain $\Omega \subset \mathbb{R}^d$ consisting of closed elements $T$, each being the image of a map $\mu_T : \hat T \to T$ with $\hat T$ being the reference cube in $d$ dimensions. The map $\mu_T$ is differentiable, invertible and its gradient is non-singular on $\hat T$.
$F = \partial T^- \cap \partial T^+$ is an interior face if it is the intersection of two elements $T^-, T^+ \in \mathcal{T}_h$ and has nonzero $(d-1)$-dimensional measure.
All interior faces are collected in the set $\mathcal{F}_h^i$.
Likewise, $F = \partial T^- \cap \partial\Omega$ is a boundary face if it is the intersection of some $T^- \in \mathcal{T}_h$ with the domain boundary $\partial\Omega$ and has nonzero $(d-1)$-dimensional measure.
All boundary faces make up the set $\mathcal{F}_h^b$ and we set $\mathcal{F}_h = \mathcal{F}_h^i \cup \mathcal{F}_h^b$.
The diameter of $F$ is $h_F$ and with each $F \in \mathcal{F}_h$ we associate a unit normal vector $n_F$, oriented from $T^-$ to $T^+$ in case of interior faces and coinciding with the unit outer normal of $\partial\Omega$ in case of boundary faces.
Every face is the image of a map $\mu_F : \hat F \to F$ with $\hat F$ being the reference element of the face.
Note that we restrict ourselves to cuboid meshes, where the reference element is the tensor product cell $\hat T = [0,1]^d$. We define the DG finite element space as the tensor product of one dimensional polynomials on the reference interval, allowing functions to be double-valued on $\mathcal{F}_h^i$:

(1) $V_h = \left\{ v \in L^2(\Omega) \,:\, v|_T \circ \mu_T \in \bigotimes_{i=1}^{d} \mathcal{P}_{p_i}([0,1]) \;\; \forall T \in \mathcal{T}_h \right\},$

with $\mathcal{P}_{p_i}$ denoting the polynomials of degree at most $p_i$.
In this definition and throughout the rest of this work, the polynomial degree $p = (p_1, \ldots, p_d)$ is allowed to be anisotropic, but we assume it to be constant throughout the triangulation. This is not a general restriction of the presented methodology; we just have not tackled it yet due to the involved programming effort and the combinatorial increase in code size and generation time. We define the number of local degrees of freedom per direction as $n_i = p_i + 1$ and the number of local degrees of freedom as $n = \prod_{i=1}^{d} n_i$. For the presented work, the choice of the local basis of the space is not important.
We expect the discretized PDE problem to be given in residual form: The solution $u_h \in V_h$ is given as the solution of the discrete variational problem

(2) $r_h(u_h, v) = 0 \quad \forall v \in V_h.$
We assume $r_h$ to be expressible in the following form:

(3) $r_h(u, v) = \sum_{T \in \mathcal{T}_h} \int_T f^{\mathrm{vol}}(u, \nabla u, v, \nabla v) \, dx$

(4) $\phantom{r_h(u, v) =} + \sum_{F \in \mathcal{F}_h^i} \int_F f^{\mathrm{if}}(u, \nabla u, v, \nabla v) \, ds$

(5) $\phantom{r_h(u, v) =} + \sum_{F \in \mathcal{F}_h^b} \int_F f^{\mathrm{bf}}(u, \nabla u, v, \nabla v) \, ds.$
This problem description introduces the notation that will be necessary in the following sections. We do not expect the user to provide the input in this split form, but perform symbolic manipulation in our code generation toolchain to determine the functions $f^{\mathrm{vol}}$, $f^{\mathrm{if}}$ and $f^{\mathrm{bf}}$. To solve problem (2) with Newton's method, we need to be able to evaluate not only the algebraic residual, but also its Jacobian. As mentioned, we would like to only implement the action of the Jacobian on a vector. Although we are omitting the exact formulae for brevity here, we mention that the action of the Jacobian can be stated in a similar fashion as equation (3). For linear problems this will be the same as the residual except for those terms not depending on $u$. For nonlinear problems, the action of the Jacobian will depend on both the linearization point and the vector it is applied to.
All integrals from equation (3) will be computed with appropriate quadrature rules. We construct the quadrature rules on the reference cuboids by building the tensor product of 1D quadrature rules of maximum order. The number of quadrature points per direction is defined as $m_i$ and is allowed to vary in the same way as the polynomial degree: per direction, but not throughout the triangulation. The total number of quadrature points on the reference cuboid is defined as $m = \prod_{i=1}^{d} m_i$.
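The construction of such anisotropic tensor product rules can be sketched as follows; a numpy illustration with example point counts, not the code generator's implementation:

```python
import numpy as np
from itertools import product

# Anisotropic tensor product Gauss-Legendre rule on [0,1]^3 (illustrative
# point counts m = (3, 2, 4)).
def gauss_01(k):
    x, w = np.polynomial.legendre.leggauss(k)  # k-point rule on [-1, 1]
    return (x + 1) / 2, w / 2                  # mapped to the unit interval

m = (3, 2, 4)
rules = [gauss_01(k) for k in m]
points = list(product(*(r[0] for r in rules)))
weights = [float(np.prod(ws)) for ws in product(*(r[1] for r in rules))]

assert len(points) == 3 * 2 * 4         # total number of quadrature points
assert abs(sum(weights) - 1.0) < 1e-12  # weights integrate the constant 1
```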
2.2. Sum Factorized Residual Evaluation
In the following sections we will focus on the evaluation of the residual for the discussion of the different vectorization strategies and performance results.
The described techniques are, however, fully applicable to matrix-free Jacobian applications and Jacobian matrix assembly.
We will discuss the details of these generalizations at the appropriate places.
Note that, while we are formulating the algorithms here with a second order elliptic problem in mind, there are no hard assumptions on the residual for the algorithms to be valid.
In particular, the residual does not have to have tensor product structure.
The sum factorization technique is used in two places:

For the evaluation of finite element functions and their partial derivatives at all quadrature points of the reference element.

To multiply the evaluations of the functions $f$ with $v$ (being the test function or one of its partial derivatives).
The necessary calculations are usually expressed in the literature using Kronecker products (vanloan2000kronecker, ). We introduce the matrices $A^i \in \mathbb{R}^{m_i \times n_i}$, whose entries are the evaluations of the chosen local 1D basis functions at the given 1D quadrature points. Given the coefficient vector as a $d$-way tensor $U$, the evaluation of $u_h$ at all quadrature points reads as follows:
(6) $\mathrm{vec}(\hat u) = \left( A^d \otimes \cdots \otimes A^2 \otimes A^1 \right) \mathrm{vec}(U).$
For the evaluation of the $i$-th partial derivative, the basis evaluation matrix $A^i$ needs to be replaced with the matrix $\tilde A^i$ containing the derivatives of the 1D basis functions at the 1D quadrature points. We use the notation $\hat u^{\partial i}$ for the $d$-way tensor containing the $i$-th component of the gradient of $u_h$ at all quadrature points.
The tensor $\hat u$ from equation (6) is evaluated in the following way (written here for $d = 3$), which is the fundamental idea of sum factorization:

(7) $\hat u_{q_1 q_2 q_3} = \sum_{i_1=1}^{n_1} \sum_{i_2=1}^{n_2} \sum_{i_3=1}^{n_3} A^1_{q_1 i_1} A^2_{q_2 i_2} A^3_{q_3 i_3} U_{i_1 i_2 i_3}$

(8) $\phantom{\hat u_{q_1 q_2 q_3}} = \sum_{i_3=1}^{n_3} \sum_{i_2=1}^{n_2} \sum_{i_1=1}^{n_1} A^3_{q_3 i_3} A^2_{q_2 i_2} A^1_{q_1 i_1} U_{i_1 i_2 i_3}$

(9) $\phantom{\hat u_{q_1 q_2 q_3}} = \sum_{i_3=1}^{n_3} A^3_{q_3 i_3} \left( \sum_{i_2=1}^{n_2} A^2_{q_2 i_2} \left( \sum_{i_1=1}^{n_1} A^1_{q_1 i_1} U_{i_1 i_2 i_3} \right) \right).$
Note how the complexity of the calculation is reduced in equation (9) in comparison to equation (8). Assuming $m_i = n_i = k$ for all directions, the complexity of evaluating all elements of $\hat u$ decreases from $\mathcal{O}(k^{2d})$ to $\mathcal{O}(d \, k^{d+1})$, illustrating well the desirability of the sum factorization approach. On a linear algebra level, a sum factorization kernel boils down to a series of tensor contractions and tensor rotations.
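The contraction sequence can be sketched with numpy's einsum; a small illustration with example sizes, whereas the generated C++ kernels realize the same contractions as explicit loop nests:

```python
import numpy as np

# Sum factorized evaluation at all quadrature points (3D), illustrative sizes:
n, m = 4, 6                                  # 1D basis functions / quadrature points
rng = np.random.default_rng(0)
A = [rng.random((m, n)) for _ in range(3)]   # 1D basis evaluation matrices
U = rng.random((n, n, n))                    # coefficient tensor

# naive evaluation: all three sums carried out jointly per output entry
naive = np.einsum('qi,rj,sk,ijk->qrs', A[0], A[1], A[2], U)

# sum factorization: three successive tensor contractions
t1 = np.einsum('qi,ijk->qjk', A[0], U)
t2 = np.einsum('rj,qjk->qrk', A[1], t1)
fact = np.einsum('sk,qrk->qrs', A[2], t2)

assert np.allclose(naive, fact)
```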
The residual assembly process for one cell of a 3D problem is described in detail in algorithm 1. Part 1 contains the aforementioned evaluation of the finite element solution and its partial derivatives. In part 2, the functions from equation (3) are evaluated at each quadrature point in order to set up an input tensor for part 3, which multiplies with the test function in a sum factorized fashion. To this end, the transposed basis evaluation matrices $(A^i)^T$ are necessary.
Integrals over facets are treated similarly, only that $m_i = 1$ holds for $i$ being the normal direction on the reference cube. It is worth noting that in this case the total number of arithmetic operations of a sum factorization kernel can be greatly reduced by permuting the order of tensor contractions. Such considerations of course make the implementation of a sum factorized algorithm dependent on the embeddings of the facet into the reference elements of $T^-$ and $T^+$. Code generation again helps greatly to write the implementations necessary for different face embeddings and combinations thereof.
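The effect of the single quadrature point in the normal direction can be sketched as follows; a numpy illustration in which the face matrix entries are stand-ins for actual basis evaluations at the face coordinate:

```python
import numpy as np

# Facet integrals: the basis evaluation matrix for the normal direction has a
# single quadrature point, so it collapses to one row (values are stand-ins).
n, m = 4, 4
rng = np.random.default_rng(1)
A = rng.random((m, n))       # full 1D basis evaluation matrix
Aface = rng.random((1, n))   # hypothetical basis trace at the face coordinate
U = rng.random((n, n, n))

trace = np.einsum('qi,rj,sk,ijk->qrs', Aface, A, A, U)
assert trace.shape == (1, m, m)  # quadrature points only on the (d-1)-dim face
```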
3. Explicit SIMD Vectorization
In the following we will develop vectorization strategies for algorithm 1. These strategies follow the ideas previously described in (muething2018sumfact, ), but are extended to gain further flexibility w.r.t. new PDE models and to meet the challenge of vectorizing for new microarchitectures. We will classify our approaches into two categories: loop-fusion-based approaches and loop-splitting-based ones. Loop-fusion-based approaches typically require a drastic change in memory layout to work, while loop-splitting-based ones only work optimally if the mathematical problem leads to loop bounds satisfying suitable divisibility constraints.
3.1. Loop Fusion Based Vectorization Strategies
When identifying loops to fuse for a vectorization strategy in the finite element assembly problem, there are multiple choices for the level of granularity of the workload to be parallelized.
All these different approaches have advantages and disadvantages.
The most widely used approach (e.g. in (kronbichler2012genericinterface, )) is to parallelize over multiple cells in the grid.
The advantage of this strategy lies in its natural adaptation to wider SIMD lanes on future architectures.
However, this also introduces additional costs:
The memory footprint of the integration kernel is increased by a factor of the SIMD width.
Additionally, the data for the degrees of freedom from multiple cells need to be interleaved in a preparation step.
This is an extension to a procedure that is commonly done in finite element assembly for continuous finite elements:
Degrees of freedom associated with a single element are gathered into a data structure for a local integration kernel to work on.
After the integration kernel has run, the data is scattered back into the global data structures.
However, Discontinuous Galerkin methods can also be implemented efficiently such that they operate on the blocks in the global data structure directly.
Doing so removes a large amount of costly, hardtohide memory operations from the algorithm and thus enables higher performance.
We have identified and described this issue in (muething2018sumfact, ).
Sticking to the idea of avoiding the setup of local data structures, we choose a different level of granularity to perform vectorization.
Within an integration kernel on a cell or facet, usually several quantities need to be evaluated through a sum factorized algorithm. Many of these sum factorization algorithms exhibit great structural similarities. We explain the idea using the example of the residual evaluation algorithm 1 for a second order elliptic PDE in 3D with a SIMD width of 256 bits. We make this restriction to illustrate our core ideas and will then describe how the approach generalizes to other models, architectures, space dimensions and Jacobian assembly. The core idea is to use the four necessary tensor quantities of step one in algorithm 1 for vectorization: the finite element solution at all quadrature points $\hat u$, as well as the three components of its gradient $\hat u^{\partial 1}$, $\hat u^{\partial 2}$, $\hat u^{\partial 3}$. We recall the main tensor product formulas from section 2 for these quantities (in 3D):
(10) $\mathrm{vec}(\hat u) = \left( A^3 \otimes A^2 \otimes A^1 \right) \mathrm{vec}(U)$

(11) $\mathrm{vec}(\hat u^{\partial 1}) = \left( A^3 \otimes A^2 \otimes \tilde A^1 \right) \mathrm{vec}(U)$

(12) $\mathrm{vec}(\hat u^{\partial 2}) = \left( A^3 \otimes \tilde A^2 \otimes A^1 \right) \mathrm{vec}(U)$

(13) $\mathrm{vec}(\hat u^{\partial 3}) = \left( \tilde A^3 \otimes A^2 \otimes A^1 \right) \mathrm{vec}(U)$
As the tensor bounds match in all of these computations, so do the loop bounds in the resulting sum factorization kernel implementations. Therefore, the loops in the implementation can be fused to achieve an implementation suitable for SIMD vectorization. In such an implementation, each of the equations (10)–(13) would be carried out in one SIMD lane. We will now introduce mathematical notation to reason about such a fused kernel as a tensor calculation. To this end, we define two operations commonly used in tensor algebra (vanloan2000kronecker, ):

The operator $\mathrm{vec}$ maps a $d$-way tensor to a vector by flattening. This operation imposes an order on the tensor axes. This is completely analogous to selecting strides in multidimensional array computation. We will use this operator to refer to the representation of a tensor in memory.

Given $k$ $d$-way tensors $T_1, \ldots, T_k$ with identical bounds, we define $S(T_1, \ldots, T_k)$ as the $(d+1)$-way tensor constructed by stacking the tensors on top of each other. We assume that the order of axes of the input tensors is preserved in the stacked tensor and that the new axis generated by stacking is the fastest varying. Put in other words, the new axis has stride 1.
With this notation, the fused computation of equations (10)–(13) reads

(14) $S\!\left(\mathrm{vec}(\hat u), \mathrm{vec}(\hat u^{\partial 1}), \mathrm{vec}(\hat u^{\partial 2}), \mathrm{vec}(\hat u^{\partial 3})\right) = \left( S(A^3, A^3, A^3, \tilde A^3) \otimes S(A^2, A^2, \tilde A^2, A^2) \otimes S(A^1, \tilde A^1, A^1, A^1) \right) S\!\left(\mathrm{vec}(U), \mathrm{vec}(U), \mathrm{vec}(U), \mathrm{vec}(U)\right),$

where the Kronecker products are carried out lane-wise. We will now discuss the memory layout implications of implementing equation (14). The memory layout of the input tensor $U$ is prescribed by the underlying discretization framework. Accesses to the stacked tensor $S(U, U, U, U)$ can be implemented easily by accessing an element of $U$ and broadcasting it into a SIMD register. Our code generator will do this whenever it finds a stacking of identical tensors. The layout of the stacked basis evaluation matrices is given by interleaving the individual basis evaluation matrices, such that the stacked axis has stride 1. We do assemble these stacked basis evaluation matrices in memory. This is a trade-off decision between the increased memory traffic of loading redundant data and the necessity of instructions that manipulate single SIMD lanes. The underlying matrices may be stored in column major or row major fashion; the code generation approach allows for flexibility in this regard. Carrying out the sum factorization algorithm with these stacked tensors, the resulting stacked tensor will be given as an interleaved tensor as well. This can be seen as an array of structures layout, where the inner structure is of fixed size 4. The layout is illustrated in figure 1. All data structures are aligned to the vector size to allow aligned loads into SIMD registers. We will now discuss how this array of structures layout affects step 2 of algorithm 1, the quadrature loop.
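The fused evaluation can be modeled in numpy by giving every tensor a leading lane axis; an illustration of the idea rather than the generated code, with one possible lane order assumed:

```python
import numpy as np

# Fused evaluation of u and its three partial derivatives; the leading axis of
# size 4 models the SIMD lanes (illustrative sizes).
m, n = 4, 4
rng = np.random.default_rng(2)
A  = rng.random((m, n))   # 1D basis evaluations at quadrature points
dA = rng.random((m, n))   # 1D basis derivative evaluations
U  = rng.random((n, n, n))

# stacked matrices: lane 0 computes u, lane i the i-th gradient component
S1 = np.stack([A, dA, A, A])   # direction 1
S2 = np.stack([A, A, dA, A])   # direction 2
S3 = np.stack([A, A, A, dA])   # direction 3

# one fused kernel: identical loop structure in every lane
fused = np.einsum('lqi,lrj,lsk,ijk->lqrs', S1, S2, S3, U)

# agrees with separately evaluated kernels
u  = np.einsum('qi,rj,sk,ijk->qrs', A, A, A, U)
d1 = np.einsum('qi,rj,sk,ijk->qrs', dA, A, A, U)
assert np.allclose(fused[0], u) and np.allclose(fused[1], d1)

# the interleaved (array of structures) layout: lane axis fastest varying
aos = np.moveaxis(fused, 0, -1).ravel()
assert aos[0] == u.ravel()[0] and aos[1] == d1.ravel()[0]
```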
Our idea for vectorizing the quadrature loop is to treat four quadrature points at a time.
We have found it beneficial to neglect the tensor product structure of the quadrature loop here and instead use flat indexing.
In order to evaluate the functions from equation (3) with vector arguments, we need to undo the array of structures layout. We do so in the quadrature loop by applying a transposition to four consecutive SIMD vectors of the interleaved data.
The procedure is illustrated in figure 2.
The transposition code is implemented efficiently in C++ intrinsics.
Step three of algorithm 1 again expects an array of structures type layout, which we do not get directly from evaluating the functions in a vectorized way. We achieve it by applying the transposition algorithm from figure 2 again. The overall quadrature loop algorithm is summarized in algorithm 2. It is worth noting that the loop does not need a tail loop although the total number of quadrature points is not necessarily divisible by four: It is sufficient to overallocate the storage of the output tensors to ensure that the last loop iteration cannot write out of bounds. Step three of algorithm 1 is treated in the exact same way as step one, with the notable exception of the necessity to accumulate the results of four sum factorization kernels into the residual tensor. With the chosen memory layout, this requires an intra-register reduction. The implementation of this operation is subject to special care, as it benefits greatly from microarchitecture-dependent optimization.
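The register transposition can be modeled as follows; a numpy sketch of what the C++ intrinsics implement, with a SIMD width of 4 assumed:

```python
import numpy as np

# Undoing the array of structures layout at quadrature points (SIMD width 4):
# four consecutive SIMD vectors are transposed so that each vector afterwards
# holds one quantity at four quadrature points (done with intrinsics in C++).
data = np.arange(16.0)          # 4 quadrature points x 4 interleaved quantities
block = data.reshape(4, 4)      # four consecutive SIMD vectors
soa = block.T                   # the 4x4 register transpose
assert list(soa[0]) == [0.0, 4.0, 8.0, 12.0]   # quantity 0 at points 0..3
assert list(soa[3]) == [3.0, 7.0, 11.0, 15.0]  # quantity 3 at points 0..3
```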
So far, we have studied the explicitly vectorized assembly algorithm under quite a number of simplifications and assumptions. We will now discuss which of these assumptions were made for the sake of presentability in this article and which are actual limitations of the approach.

So far, we have only looked at volume integrals. The extension to boundary and interior facet integrals can be implemented straightforwardly by replacing the basis evaluation matrix for the normal direction of the facet with a special matrix consisting of only one quadrature point. Note that this quadrature point is either $0$ or $1$, depending on the facet being on the lower or upper boundary of the reference cube. Consequently, facet kernel implementations depend on the embedding of the facet into the reference cube.

The approach was described for residual assembly only. The Jacobian application for matrix-free methods works completely analogously for linear problems. For nonlinear problems, not only the solution needs to be evaluated in step 1 of algorithm 1, but also the given linearization point. The evaluation of the linearization point uses the exact same algorithm and is vectorizable with the same techniques. Assembling a Jacobian matrix is implemented in a very similar fashion: steps two and three of algorithm 1 are executed once per basis function in the ansatz space.

We have only looked at the weak formulation of an elliptic problem depending on both $u$ and $\nabla u$. For more complex problems, particularly systems of PDEs, the number of quantities that have to be evaluated via a sum factorization kernel may differ. This gives rise to additional opportunities to fuse sum factorization kernels. We offload the decision on how to vectorize to an algorithm implemented in our code generation workflow. The procedure is described in section 4.2.

We have only investigated the problem in three dimensional space. This is admittedly a best-case scenario, but we emphasize the relevance of such problems for real-world HPC applications. If the number of sum factorization kernels is less than what is needed for loop fusion (e.g. in 2D, or because not all quantities need to be evaluated), it is still possible to perform the above strategy with one SIMD lane left empty at a 25% penalty to the floating point throughput. The following sections will introduce more clever strategies to cope with such situations though.

We have limited ourselves to a SIMD width of 256 bits. Instruction sets with smaller SIMD width like SSE2 can easily be targeted with the same techniques by combining the four kernels into two vectorized kernels instead of one. Instruction sets with wider SIMD width like AVX-512 pose a problem: finding eight kernels in the problem that share the same loop bounds (or even the same input tensor) will rarely be possible. Developing vectorization techniques to overcome this limitation for AVX-512 is the central goal of sections 3.2 and 3.3. If these techniques are applied to another SIMD width $W$, a transposition algorithm analogous to figure 2 has to be implemented for registers of that width.

So far, we only fused sum factorization kernels that share the same input tensor $U$, which resulted in values being broadcast into SIMD registers. However, it is also possible to fuse kernels that have differing input tensors. This comes at the cost of an increased memory footprint of the fused kernel and more expensive load instructions (in the most general case, a gather instruction). In section 4.2 we will consider one special case that we consider a good trade-off between the increased costs and the desirable increase in vectorization opportunities: We fuse kernels with a total of two different input tensors. The necessary load instructions originate from the interoperability between different SIMD widths and fill the lower and upper half of a SIMD register. Use cases for having fusable kernels with exactly two input tensors occur quite naturally in DG discretizations: On a facet, quantities need to be evaluated w.r.t. both adjacent cells $T^-$ and $T^+$, and when the action of the Jacobian for a nonlinear problem is calculated, both the finite element solution and the linearization point need to be evaluated.
3.2. Loop Splitting Based Vectorization Strategies
While the loop fusion vectorization from section 3.1 tries to fuse multiple sum factorization kernels, the idea of this section is to split the workload of one sum factorization kernel such that the execution can make use of SIMD parallelism. This does not suffer from the disadvantages seen above: an increased memory footprint of the kernel and the necessity of memory layout adjustments. On the other hand, these splitting based vectorization techniques come with the disadvantage that maximum efficiency can only be reached if the kernel structure exhibits loop bounds with suitable divisibility constraints. We will now explore these strategies. For the sake of readability, we again limit ourselves to a SIMD width of four lanes and to the evaluation of $u$ via

(15) $\mathrm{vec}(\hat u) = \left( A^3 \otimes A^2 \otimes A^1 \right) \mathrm{vec}(U)$

and discuss possible generalizations later.
We base our strategy on the idea of splitting the set of quadrature points into a number of subsets equal to the SIMD width. We do so by choosing one direction, here the first, and splitting the basis evaluation matrix $A^1$ into $W$ matrices $A^1_0, \ldots, A^1_{W-1}$, where $W$ is the SIMD width. For now, we assume the number of quadrature points $m_1$ to be divisible by $W$ and discuss other cases later on. The index $j$ denotes the index of the slice of the basis evaluation matrix. Note that we did not split in a blocked fashion, but in a circular one, as illustrated in figure 3. Carrying out the sum factorized computation from equation (15) with a slice $A^1_j$ instead of $A^1$ leads to an output tensor $\hat u_j$ only containing the evaluations at every $W$-th quadrature point. Problem (15) can then be reformulated into the following equivalent problem using the notation from section 3.1:

(16) $S\!\left(\mathrm{vec}(\hat u_0), \ldots, \mathrm{vec}(\hat u_{W-1})\right) = \left( S(A^3, \ldots, A^3) \otimes S(A^2, \ldots, A^2) \otimes S(A^1_0, \ldots, A^1_{W-1}) \right) S\!\left(\mathrm{vec}(U), \ldots, \mathrm{vec}(U)\right).$

The fact that we have sliced in a circular fashion leads to the following desirable property that allows us to load data of the resulting tensor without further manipulation of the memory layout:

(17) $\mathrm{vec}\!\left( S(\hat u_0, \ldots, \hat u_{W-1}) \right) = \mathrm{vec}(\hat u).$
We observe that in contrast to section 3.1, the combined basis evaluation matrices do not have to be explicitly set up beforehand, as $S(A^2, \ldots, A^2)$ and $S(A^3, \ldots, A^3)$ can be implemented as broadcasts of elements of $A^2$ and $A^3$.
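The circular splitting and the resulting interleaving property can be checked numerically; a numpy sketch with illustrative sizes:

```python
import numpy as np

# Circular splitting of the quadrature points (SIMD width W = 4, m divisible by W):
W, m, n = 4, 8, 4
rng = np.random.default_rng(3)
A1, A2, A3 = (rng.random((m, n)) for _ in range(3))
U = rng.random((n, n, n))

full = np.einsum('qi,rj,sk,ijk->qrs', A1, A2, A3, U)

# circular slices: slice j contains rows j, j + W, j + 2W, ... of A1
parts = [np.einsum('qi,rj,sk,ijk->qrs', A1[j::W], A2, A3, U) for j in range(W)]

# interleaving the per-lane results reproduces the full tensor without shuffles
assert np.allclose(np.stack(parts, axis=1).reshape(full.shape), full)
```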
The loop splitting based approach described in this section does not depend on the problem structure the same way as the loop fusion based one from section 3.1 does.
As there is no need to group multiple sum factorization kernels, the approach vectorizes equally well in two and three dimensional space, as well as with arbitrary combination of terms present in the problem formulation.
However, applicability of the approach depends on the divisibility of the number of 1D quadrature points.
Having this constraint on the number of quadrature points is not as bad as having it on the number of basis functions:
Artificially increasing the number of quadrature points is equivalent to overintegration, which even yields additional accuracy for problems that are not exactly integrated.
However, one has to be cautious as the increase in floating point operations affects the whole algorithm and not only the sum factorization kernel to be vectorized.
We now study the cost increase of such a procedure and refer to section 4.2 for discussion of the necessary trade off decisions.
We recall that the number of quadrature points per direction is given as a tuple $(m_1, \ldots, m_d)$ and the number of basis functions per direction as a tuple $(n_1, \ldots, n_d)$. The floating point cost of a single sum factorization kernel reads

(18) $C(m, n) = 2 \sum_{i=1}^{d} \left( \prod_{j=1}^{i} m_j \right) \left( \prod_{j=i}^{d} n_j \right).$
We observe that $C$ is linear in $m_1$.
This also holds for other relevant parts of the algorithm such as the quadrature loop.
Consequently, the total cost of the algorithm will be increased by a factor of $\tilde m_1 / m_1$ if the number of quadrature points in the first direction is increased from $m_1$ to $\tilde m_1$.
Setting $\tilde m_1 = \lceil m_1 / W \rceil W$, the next multiple of the SIMD width $W$, we get a cost increase of $\lceil m_1 / W \rceil W / m_1$.
For sufficiently high numbers of quadrature points, this increase becomes negligible.
In the worst case scenario of a discretization with minimal quadrature order, the increase can be substantial though, and careful consideration is necessary.
See section 4.2 for more details about that.
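Since the cost is linear in the number of quadrature points in the padded direction, the overhead discussed above can be sketched in a few lines of Python (the function name is ours, for illustration only):

```python
import math

def padding_overhead(m1: int, width: int) -> float:
    """Relative cost increase when the number of 1D quadrature points
    in the first direction, m1, is rounded up to the next multiple of
    the SIMD width."""
    padded = math.ceil(m1 / width) * width
    return padded / m1

# Worst case: minimal quadrature order with very few points per direction.
print(padding_overhead(2, 4))   # 2 points padded to 4 -> factor 2.0
# For higher quadrature orders the overhead shrinks quickly.
print(padding_overhead(15, 4))  # 15 points padded to 16 -> ~1.07
```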
In this section, we formulated the implementation idea in terms of the fusion based vectorization described in section 3.1.
The same ideas could have been developed from a different perspective, but having them formulated in the same notation will be of great benefit for developing the hybrid strategies in section 3.3 and also for the vectorization heuristics in the code generator described in section 4.2.
3.3. Hybrid Strategies
Neither the strategy described in section 3.1 nor the strategy from section 3.2 is in general a perfect fit for vectorization with wider SIMD widths.
For the loop fusion strategy, the problem description will usually not exhibit enough quantities that can be computed in parallel.
The loop splitting strategy on the other hand leads to prohibitively severe constraints on the number of quadrature points with increasing SIMD width.
We now seek to combine the two strategies into a hybrid vectorization strategy mitigating these disadvantages.
We have formulated the loop fusion approach from section 3.1 and the loop splitting approach from section 3.2 in a common framework: a set of sum factorization kernels is collected into a larger tensor suitable for vectorization, where these kernels are potentially obtained by first splitting the given sum factorization kernels. We now generalize this to arbitrary SIMD widths and combinations of these techniques. We define and , such that denotes the number of sum factorization kernels to be combined through loop fusion and denotes the number of slices these are split into. We only treat those cases where with the number of SIMD lanes. For and , this allows a natural extension of section 3.1 for a SIMD width (AVX512), which calculates and in parallel and introduces a divisibility constraint of on :
(19) 
Again, we will study the memory layout implications of implementing equation (19). The stacked basis evaluation matrices are pre-evaluated and loaded from memory, just like in section 3.1. The input tensor is implemented by broadcasting the values of . The only remaining question is how the memory layout of the output tensor affects the quadrature loop implementation. Independently of the choice of and , the quadrature loop treats quadrature points at a time. However, to get values of the quantities present in the data, we need to shuffle consecutive vectors. This results in the need for non-square matrix shuffles to obtain the correct data layout. In order to reduce the number of such transposition implementations, we fix the intra-register layout such that kernels resulting from splitting are adjacent to each other. In other words, we disallow tensors like . We implement all necessary shuffling operations in C++ intrinsics. The quadrature loop algorithm 3 is further complicated by the fact that an integration kernel might consist of more than one vectorized sum factorization kernel and that the choice of and can differ for each of these.
We have seen that the techniques of sections 3.1 and 3.2 can be combined to mitigate their disadvantages and target wider SIMD widths. However, for a given problem, the number of possible vectorization strategies grows exponentially and it is not known a priori which one is best. This issue is targeted in section 4.2 with a cost model approach.
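For illustration, the admissible hybrid strategies for one SIMD register can be enumerated as factorizations of the number of lanes; this is a sketch in plain Python with names of our own choosing:

```python
def hybrid_strategies(width: int):
    """All (fuse, split) pairs with fuse * split == width: 'fuse'
    sum factorization kernels are combined by loop fusion, and each
    of them is split into 'split' slices."""
    return [(f, width // f) for f in range(1, width + 1)
            if width % f == 0]

print(hybrid_strategies(4))  # AVX2:    [(1, 4), (2, 2), (4, 1)]
print(hybrid_strategies(8))  # AVX-512: [(1, 8), (2, 4), (4, 2), (8, 1)]
```

The pure fusion and pure splitting strategies of sections 3.1 and 3.2 appear as the two extreme pairs; the remaining pairs are the genuinely hybrid ones.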
4. Performance Portability Through Code Generation
After studying vectorization techniques for variable SIMD widths in section 3, we will now look into how the sustainability issue arising from the inherent hardware dependency can be solved. To that end, we will first establish a code generation workflow in section 4.1. Then, we will see how optimal vectorization can be found in this workflow in section 4.2.
4.1. Code Generation Tool Chain
We are targeting dune-pdelab with our code generation workflow. It builds upon the Dune core modules, which provide Dune's generic grid interface (bastian2008grid2, ), linear algebra, local basis functions, quadrature rules and geometry mappings. The discretization framework dune-pdelab conceptually extends this with the key ingredients of finite element programs:

A FiniteElement describes a local finite element on a given reference element. This comprises the local basis and how the local coefficients are attached to subentities of the reference element.

A FiniteElementMap maps grid cells to their associated local finite elements. One of PDELab’s strengths is to exploit this knowledge at compile time and only pay runtime penalties if necessary (e.g. in adaptivity).

A GridFunctionSpace implements a discrete ansatz space by combining a grid with a finite element map. PDELab allows the construction of arbitrary trees of grid function spaces to represent systems of PDEs. Again, PDELab draws its strength from compile time reasoning about these tree structures, enabling many different blocking and ordering techniques for the global data structures.

A ConstraintsAssembler implements how degrees of freedom need to be constrained depending on boundary conditions and parallelism concepts.

The finite element assembly algorithm is driven by a GridOperator that iterates over the grid elements and facets. Implementation of assembly integrals is delegated to a LocalOperator that only operates on single cells/facets. The grid operator also makes sure to apply constraints as needed.
With the intended code generation workflow, we would like to leverage as many features of the underlying framework as possible. We therefore concentrate on generating code for the local operator, which is the performance critical component of finite element assembly. The generation and compilation process is controlled by an extension to Dune's CMake build system. We provide rudimentary automation of the full simulation workflow only for the very important topic of automated testing (kempf2015testtools, ), but advise users to still write their own Dune applications. Otherwise, we would have to deliver Python bindings to all of the framework's features, defeating the purpose of a highly modular C++ framework.
In the following we describe the employed tool chain in detail, which is also illustrated by figure 4.
We aim to reuse existing code generation projects wherever feasible.
The FEniCS project has developed UFL (AlnaesEtAl2012, ), a domain-specific language for the description of the weak formulations arising in finite element discretizations.
It is embedded in Python and also used outside of FEniCS, e.g. in the firedrake project.
Reusing UFL as the input language to our code generation toolchain will not only save us work, but also contribute to a standardization of open source finite element packages.
UFL allows multilinear forms to be expressed in a cell- and facet-local fashion.
The Discontinuous Galerkin methods we are targeting in this work fit nicely into this framework.
UFL delivers an intermediate representation (IR) of the PDE problem in the form of an abstract syntax tree (AST). We apply some customizations and algorithms to the IR of UFL as a preprocessing step:

We enforce that users write the weak formulation in residual form (which is always a rank one form), as dune-pdelab's abstractions are formulated this way. This is quite similar to the way users implement nonlinear problems in FEniCS, though we use the same workflow for linear problems as well. (For readers more familiar with UFL internals: we enforce this by overriding the TrialFunction class from UFL to be a Coefficient with a reserved index instead of an Argument.) From this residual form, we derive the bilinear form in the code generation process by taking the Gâteaux derivative with respect to the trial function, using the automatic differentiation code provided by UFL. For matrix-free calculations, we apply another symbolic manipulation algorithm from UFL to obtain a rank one form implementing the action of the bilinear form on a given vector. These symbolic manipulations add a lot of value to the user experience, as the problem needs to be implemented exactly once and the Jacobians can be derived from it. In handwritten dune-pdelab codes, one can either pay a performance penalty and rely on automatic differentiation of residuals at runtime, or one has to handcode the Jacobians and their action as well. Especially for nonlinear problems, the latter can be quite tedious.

We apply knowledge about geometry transformations for axiparallel grids. These are not built into UFL, as the underlying discretization framework of FEniCS did not support hexahedral meshes for a long time.
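The derivation of the Jacobian from the residual form described above can be illustrated numerically: the Gâteaux derivative of a residual R at u in direction v is exactly the object that UFL computes symbolically via automatic differentiation. A minimal pure-Python sketch, using finite differences instead of UFL's symbolic machinery and a toy residual of our own choosing:

```python
def gateaux(R, u, v, eps=1e-7):
    """Finite-difference approximation of the Gateaux derivative of a
    residual R at the point u in direction v, i.e. the action of the
    Jacobian of R on v."""
    perturbed = [ui + eps * vi for ui, vi in zip(u, v)]
    return [(r1 - r0) / eps for r0, r1 in zip(R(u), R(perturbed))]

# Toy nonlinear residual R(u)_i = u_i**2; its Jacobian action is 2*u_i*v_i.
R = lambda u: [ui ** 2 for ui in u]
Jv = gateaux(R, u=[1.0, 2.0], v=[1.0, 1.0])
print(Jv)  # approximately [2.0, 4.0]
```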
The IR of UFL is not a good choice for employing transformation-based optimization.
This mainly stems from the fact that the only notion of loops in the UFL AST are sum reductions.
However, hardware-dependent transformations need much more insight into the loop structure of the assembly kernel.
We therefore transform the UFL IR into loopy (kloeckner2014loopy, ), adding the necessary loop bound information for quadrature rules, ansatz and test functions.
Loopy is a Python package which provides an IR for computational kernels and a transformation library to operate on that IR.
The IR comprises a polyhedral description of the loop domain and a symbolic representation of the calculation to be carried out in the form of statements.
The package has already been proven to be capable of handling examples as complicated as full PDE applications (kloeckner2016femexample, ).
Loopy provides several code generation backends called targets, such as plain C, CUDA, OpenCL and ISPC.
The object-oriented nature of the code generation target classes allows us to customize a target for C/C++ to instead produce code for the discretization framework dune-pdelab without modifying loopy itself.
We use this to produce C code for our finite element assembly kernels that features enough C++ to call framework functions.
The loopy target for plain C code does not support explicit vectorization, as SIMD vectors are not part of the C language.
We add this functionality by generating code that uses a C++ vectorization library.
We chose the vector class library (fogvcl, ), although there are several other viable options (e.g. VC (kretz2012vc, )).
The library provides C++ classes for a given precision and vector width (like Vec4d for 256 bit double precision) representing a vector register, with suitable operator and function overloads for all basic tasks.
The implementation uses C++ intrinsics and hides most of the hardwarespecific assemblylevel considerations from the user and our code generator.
We consider such a library a key ingredient to a sustainable, performance portable code generation workflow.
Having settled on loopy as the workhorse of our code generation approach, we also gain additional value through loopy's transformation library.
Common performance optimization techniques such as loop tiling, loop fusion or software prefetching are readily available.
We stress that in order to be capable of implementing such transformations, an IR needs to operate on a very specific level of abstraction:
On the one hand it needs to be reasonably general in the sense that it has a full symbolic representation of the computations to be carried out.
On the other hand it needs to have a notion of loop domains without fixing the loop structure too early in the tool chain (like an AST for the C language would).
We believe that loopy’s level of abstraction hits this sweet spot needed for a high performance code generation workflow.
Our approach differs from other approaches to code generation in UFL-based projects:
The FEniCS project generates C code through the form compiler FFC (kirby2006ffc, ), and any performance considerations are left to the underlying C++ framework dolfin and the C compiler.
Also, FEniCS successfully emphasizes usability and productivity over performance, which does not match our intent in using code generation.
The firedrake project uses the form compiler TSFC (homoloya2017tsfc, ):
In a first step, UFL input is transformed into the tensor algebra IR GEM, which does not contain any domainspecific finite element information.
Such a step also happens in our transition from UFL to loopy, as loopy represents tensor algebra through the package pymbolic.
In a second step, TSFC generates C code from the given tensor algebra expression, much like the scheduler of loopy does.
However, this step does not yet take into account any hardware-specific considerations, and performance optimization is left to COFFEE (luporini2015coffee, ), which operates on a C-style IR.
This tool chain strictly follows the idea of separation of concerns, which we believe is not well applicable to performance optimization of finite element applications.
The firedrake project has recently also invested much work into generating code for sum factorized finite element assembly (homoloya2017structure, ).
Our code is published as a Dune module under a BSD license at
4.2. Generating Explicitly Vectorized Sum Factorization Kernels
We will now take a closer look at how the vectorization strategies described in section 3 can be exploited from the code generation tool chain. Foremost, this is about how to choose the best vectorization strategy from a large set of possibilities. All of the AST transformations (UFL to loopy, loopy to C) are implemented via recursive tree traversals with type-based function dispatch. These algorithms work best if the tree transformation is fully local, meaning that the visitor object is completely stateless. As our vectorization strategies, especially those based on loop fusion, are inherently non-local, we instead apply a two-step procedure:

During tree traversal, any quantity that is calculated through a sum factorization kernel is represented by a dedicated AST node \mintinlinepythonSumfactKernel, that stores all the relevant information.

After tree traversal, an algorithm decides on vectorization by providing a mapping of all the \mintinlinepythonSumfactKernel in the AST to \mintinlinepythonVectorizedSumfactKernel nodes with vectorization information attached.

A second tree traversal is done, in which these modified \mintinlinepythonVectorizedSumfactKernel nodes are realized by loopy statements.
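A stripped-down sketch of this two-step procedure, with hypothetical stand-ins for the placeholder nodes (the real classes carry far more information, e.g. input coefficients and cost data), and a toy mapping step that simply fuses kernels sharing loop bounds:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SumfactKernel:
    quantity: str        # which quantity this kernel evaluates
    loop_bounds: tuple   # loop bounds of the sum factorization

@dataclass(frozen=True)
class VectorizedSumfactKernel:
    kernels: tuple       # the fused SumfactKernel nodes
    splits: int          # number of slices per kernel

def decide_vectorization(kernels, width=4):
    """Toy stand-in for the mapping step between the two traversals:
    fuse kernels that share loop bounds, up to the SIMD width."""
    mapping, by_bounds = {}, {}
    for k in kernels:
        by_bounds.setdefault(k.loop_bounds, []).append(k)
    for group in by_bounds.values():
        for i in range(0, len(group), width):
            chunk = tuple(group[i:i + width])
            vec = VectorizedSumfactKernel(kernels=chunk, splits=1)
            for k in chunk:
                mapping[k] = vec
    return mapping

kernels = [SumfactKernel("du_dx", (4, 4, 4)),
           SumfactKernel("du_dy", (4, 4, 4)),
           SumfactKernel("du_dz", (4, 4, 4))]
mapping = decide_vectorization(kernels)
# All three gradient components land in one vectorized kernel.
assert len(set(mapping.values())) == 1
```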
We will now give details about the algorithm used to decide which vectorization strategy should be used. The number of possibilities to combine the vectorization strategies described in section 3 for a given set of sum factorization kernels is vast, and it is not a priori known which strategy delivers optimal performance. We mention two scenarios arising in the examples in section 5 to illustrate that trade-off decisions need to be made:

For a simple Poisson problem in 3D, are needed, but not the evaluation of itself. Given a SIMD width of 256 bits, is it better to fuse three kernels and ignore the fourth SIMD lane, or to apply a splitting based vectorization strategy partially or fully? How does the quadrature order affect this?

For the implementation of the DG scheme for the Stokes equation from section 5.3, the evaluation of pressure cannot be parallelized with any other necessary evaluations. Vectorizing pressure evaluation by splitting may however come at the cost of increasing the number of quadrature points for the whole algorithm. When is it better not to vectorize pressure evaluation?
A cost-model-based approach is required in order to make optimal vectorization decisions. Such an approach consists of two core components: a function that systematically traverses all vectorization opportunities to find a minimum, and an actual cost function. In order to handle the exponential complexity of traversing the vectorization opportunities, we employ a divide and conquer strategy, splitting the optimization problem into several subproblems:

Starting from the quadrature point tuple that was specified by the user or deduced from the problem formulation, we list all possible tuples with an increased number of quadrature points that enable other vectorization strategies. For each of those, we determine an optimal vectorization strategy and take the minimum among these:
(20) Here, again denotes the SIMD width.

When finding an optimal strategy for a given fixed quadrature point tuple, we first divide the given set of sum factorization kernels into smaller subsets which may potentially be subject to a loop fusion based vectorization approach, i.e. they share the same loop bounds. Minimal solutions w.r.t. the defined cost function for these subsets are then combined into a full vectorization strategy.
We define a function such that two sum factorization kernels that yield the same value are potentially vectorizable via loop fusion. Likewise, we define a function such that two kernels yielding the same result operate on the same input tensor. To wrap up the divide and conquer approach, a function that merges the minimal solutions on subsets is used. Algorithm 4 shows the overall optimization algorithm, where algorithm 5 shows the algorithm within one of the subsets.
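The divide-and-conquer skeleton behind algorithms 4 and 5 can be sketched as follows; all callables are hypothetical stand-ins for the fusion key, the per-subset solver and the merge function described above:

```python
from itertools import groupby

def optimize(kernels, fusion_key, solve_subset, merge):
    """Divide-and-conquer skeleton of the vectorization search:
    partition the kernels into fusion-compatible subsets (same key),
    find a minimal strategy per subset, and merge the partial
    solutions into a full vectorization strategy."""
    ordered = sorted(kernels, key=fusion_key)
    partial = [solve_subset(list(group))
               for _, group in groupby(ordered, key=fusion_key)]
    return merge(partial)

# Toy usage: kernels as (quantity, loop bounds) pairs; velocity
# evaluations share loop bounds, pressure does not.
kernels = [("grad_u", (4, 4, 4)), ("u", (4, 4, 4)), ("p", (2, 2, 2))]
groups = optimize(kernels,
                  fusion_key=lambda k: k[1],
                  solve_subset=lambda subset: [k[0] for k in subset],
                  merge=lambda parts: parts)
print(groups)  # [['p'], ['grad_u', 'u']]  (subsets sorted by loop bounds)
```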
Algorithms 4 and 5 establish a framework to explore different cost functions. In our work, we have used two cost functions: an autotuning measurement function and a heuristic cost model function. The autotuning function generates a benchmark program for a given sum factorization kernel, compiles it, and runs it on the target hardware. The measured runtime is used as the cost function. While this approach delivers very good results and we generally use it for our performance measurements, there are also some downsides to it. Depending on how large the set of possibilities is, code generation may take a substantial amount of time. While this is not important for an HPC application to be run at large scale, it is infeasible during a development cycle. There is an additional problem with measuring sum factorization kernels as opposed to measuring whole integration kernels: cost increases in the quadrature loop are not covered by the benchmark programs. To account for this, we leverage the fact that we can count floating point operations on the symbolic representation and introduce a penalty factor for the autotuning scores.
The heuristic cost model function we have worked with is given in equation (23). It reproduces our practical experience quite well, but does not take specific hardware features into account. It can be used as a drop-in replacement if the autotuning approach is infeasible. The function depends on the following quantities:

the number of nonparallel floating point operations carried out.

a heuristic penalty function for the instruction level parallelism potential of a sum factorization kernel, depending on the size of the splitting as used in section 3.2, where we observe that loop fusion based vectorization should always be preferred over vertical vectorization when applicable:
(21) 
a heuristic penalty function for the necessary load instructions. This depends on the number of input coefficients used for loop fusion in the kernel.
(22)
The resulting cost function is the product of these given terms:
(23) 
In practice, we have chosen and and achieved good results. Figure 5 validates this choice of cost model: For a variety of integrals, a number of possible vectorization strategies is realized and performance is measured with the methodology described in section 5. We observe that the cost model minimum captures the performance maximum correctly.
5. Performance Results
Throughout this section, we study the performance of the proposed algorithms on two different architectures. As we strongly believe that, for performance numbers to be meaningful, all details about the hardware used and the measuring process must be given, we thoroughly describe the benchmark setup in section 5.1. We then carry out performance studies for two model problems: section 5.2 studies a diffusion-reaction type model, while section 5.3 studies the Stokes equations.
5.1. Benchmark Methodology
In this article we will use two measures to evaluate the performance of our implementation:

Floating Point Operations per second expressed in GFlops/s and often given as a percentage of the machine’s maximum floating point performance.

Degrees of freedom per second processed during a full operator application. Note that we prefer this measure over its inverse (time per degree of freedom).
Good results on the latter measure are always more important from the application point of view, as it gives an accurate measure of how fast a real problem can be solved.
However, the former is still an interesting measure that allows reasoning about how good a code is suited for a given hardware.
In order to accurately measure the floating point operations per second, the number of floating point operations performed needs to be counted exactly.
We exploit the fact that we are generating code for dune-pdelab, which uses C++ templates to the extent that we can replace the underlying floating point type throughout the whole simulation workflow.
Instead of using double, we use a custom type templated on double, which overloads all arithmetic operations to increase a global counter and forward the operation to the underlying type.
This counting of course introduces a non-negligible performance overhead. We therefore compile different executables from the same source for operation counting and time measurement.
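The counting type can be illustrated by a pure-Python analogue of the templated C++ type described above (class name and reduced operator set are ours; the real implementation wraps double and overloads all arithmetic operations):

```python
class CountedFloat:
    """Floating point wrapper whose arithmetic operations increment a
    global counter and forward to the underlying float."""
    flops = 0

    def __init__(self, value):
        self.value = float(value)

    def _binop(self, other, op):
        CountedFloat.flops += 1
        other = other.value if isinstance(other, CountedFloat) else other
        return CountedFloat(op(self.value, other))

    def __add__(self, other):
        return self._binop(other, lambda a, b: a + b)

    def __mul__(self, other):
        return self._binop(other, lambda a, b: a * b)

# A dot product of length 3 costs 3 multiplications and 2 additions.
x = [CountedFloat(v) for v in (1.0, 2.0, 3.0)]
y = [CountedFloat(v) for v in (4.0, 5.0, 6.0)]
acc = x[0] * y[0]
for xi, yi in zip(x[1:], y[1:]):
    acc = acc + xi * yi
assert CountedFloat.flops == 5 and acc.value == 32.0
```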
Apart from counting operations, accurate time measurements are needed.
We instrument our code with C macros to start and stop timers using the TSC registers.
The performance overhead of starting and stopping a timer is measured at runtime and the measurements are calibrated accordingly.
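The calibration idea can be sketched in Python, substituting perf_counter_ns for the TSC register used by our C macros:

```python
import time

def timer_overhead(samples=1000):
    """Estimate the cost of one start/stop timer pair so that actual
    measurements can be calibrated by subtracting it."""
    deltas = []
    for _ in range(samples):
        start = time.perf_counter_ns()
        stop = time.perf_counter_ns()
        deltas.append(stop - start)
    # The minimum is the least noise-affected estimate of the overhead.
    return min(deltas)

overhead = timer_overhead()
assert overhead >= 0
```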
To gain further insight into the performance bottlenecks of our implementation, we measure time and floprates at different levels of granularity:
Full operator application, celllocal integration kernel and individual steps of algorithm 1.
For all these granularity levels, separate executables are compiled to ensure that no measurement is distorted by additional measurements taken within the measuring interval.
We study the node-level performance of our generated code on two Intel microarchitectures.

Intel Haswell

Intel Xeon processor E52698v3

2x16 cores

base frequency: 2.3 GHz, 1.9 GHz on AVX2-heavy loads (dolbeau2017theoretical, )

Theoretical peak performance: 972.8 GFlops/s


Intel Skylake

Intel Xeon Gold 6148

2x20 cores

base frequency: 2.4 GHz, 2.0 GHz on AVX2-heavy loads, 1.6 GHz on AVX512-heavy loads (dolbeau2017theoretical, )

Theoretical peak performance: 2.04 TFlops/s

We turn off turbo mode entirely on these machines in order to get a good estimate of the maximum floating point performance. In order to study the node-level performance of our implementation, we need to saturate the entire node with our computation; otherwise, some processors may have privileged access to resources such as memory controllers and distort results. We do so by running MPI-parallel computations with one rank per processor and an identical workload size on each rank. Also, we choose the problem size such that one vector of degrees of freedom exceeds the level 3 cache of the machine. While this may not be a realistic setting when doing strong scaling of simulation codes, it gives a good worst case estimate of our code's node-level performance. The time for communication of overlap data via MPI is not included in our measurement, as the task of hiding that communication behind computation is not the subject of this work.
5.2. DiffusionReaction Equation
In this section, we will consider the diffusionreaction equation (24) on , where we assume the triangulation to be axiparallel.
(24)  
For the discretization we choose the symmetric interior penalty Discontinuous Galerkin method, which we recall in equation (25) to disclose all details of our implementation and enable the reader to compare our performance numbers to other codes.
First we need to introduce some notation. For an interior facet with neighboring elements and we define the jump
the average
and the penalty factor
where is the polynomial degree (same for all directions), is the world dimension and a free parameter we set to . On a boundary facet