Firedrake: automating the finite element method by composing abstractions
Abstract
Firedrake is a new tool for automating the numerical solution of partial differential equations. Firedrake adopts the domain-specific language for the finite element method of the FEniCS project, but with a pure Python runtime-only implementation centred on the composition of several existing and new abstractions for particular aspects of scientific computing. The result is a more complete separation of concerns which eases the incorporation of separate contributions from computer scientists, numerical analysts and application specialists. These contributions may add functionality, or improve performance.
Firedrake benefits from automatically applying new optimisations. This includes factorising mixed function spaces, transforming and vectorising inner loops, and intrinsically supporting block matrix operations. Importantly, Firedrake presents a simple public API for escaping the UFL abstraction. This allows users to implement common operations that fall outside pure variational formulations, such as flux limiters.
Categories and Subject Descriptors: G.1.8 [Numerical Analysis]: Partial Differential Equations (Finite element methods); G.4 [Mathematical Software]: Algorithm design and analysis
General Terms: Design, Algorithms, Performance
Florian Rathgeber, David A. Ham, Lawrence Mitchell, Michael Lange, Fabio Luporini, Andrew T. T. McRae, Gheorghe-Teodor Bercea, Graham R. Markall, and Paul H. J. Kelly, 2015. Firedrake: automating the finite element method by composing abstractions.
This work was supported by the Engineering and Physical Sciences Research Council [grant numbers EP/I00677X/1, EP/L000407/1, EP/I012036/1], the Natural Environment Research Council [grant numbers NE/G523512/1, NE/I021098/1, NE/K006789/1, NE/K008951/1] and the Grantham Institute, Imperial College London. Author’s addresses: F. Rathgeber, (Current address) European Centre for Medium-Range Weather Forecasts; D. A. Ham and L. Mitchell, Department of Mathematics and Department of Computing, Imperial College London; M. Lange, Department of Earth Science and Engineering, Imperial College London; F. Luporini, G.-T. Bercea, and P. H. J. Kelly, Department of Computing, Imperial College London; A. T. T. McRae, (Current address) Department of Mathematical Sciences, University of Bath; G. R. Markall, (Current address) Embecosm.
1 Introduction
The numerical solution of partial differential equations (PDEs) is an indispensable tool in much of modern science and engineering. However, the successful development and application of advanced PDE solvers on complex problems requires the combination of diverse skills across mathematics, scientific computing and low-level code optimisation, which is rarely at expert level in a single individual. For the finite element method, which will be the focus of this work, this set of skills includes at least: knowledge of the system being simulated, analysis of the resulting PDEs, numerical analysis to create appropriate discretisations, mesh generation, graph theory to create data structures on those meshes, the analysis and implementation of linear and nonlinear solvers, parallel algorithms, vectorisation, and loop nest optimisation under memory constraints.
The development of such software is therefore increasingly a multidisciplinary effort and its design must enable scientists with different specialisations to collaborate effectively without requiring each one of them to understand every aspect of the system in full detail. The key to achieving this is to abstract, automate and compose the various processes involved in numerically solving PDEs. At some level, this process is a familiar one: few of the people who write C or Fortran code really understand how the compiler works; and they need not do so. Instead, the programmer understands the rules of the language and programs to that model. Similarly, mathematical operations and results are frequently employed without having their derivation or proof immediately at hand. In other words, mathematical and software abstractions such as languages and theorems enable a separation of concerns between developing a technique and employing it.
This paper presents a new contribution to the automation and abstraction of the finite element method. Previous work, most especially the Unified Form Language [Alnæs et al. (2014)] employed by the FEniCS project [Logg et al. (2012), Logg and Wells (2010)], enables scientists to express partial differential equations (PDEs) in a high-productivity interpreted language close to the mathematics. Implementations of the finite element method have traditionally been tightly coupled to the numerics, requiring contributors to have a deep understanding of both. FEniCS creates a separation of concerns between employing the finite element method and implementing it. Firedrake goes beyond this by introducing a new abstraction, PyOP2, to create a separation within the implementation layer between the local discretisation of mathematical operators, and their parallel execution over the mesh. This separation enables numericists to contribute ever more sophisticated finite elements, while computer scientists, expert in parallel execution but not in numerics, contribute more advanced execution strategies.
In addition to admitting uniformly high performance mesh iteration, the introduction of the additional abstraction layer results in a very compact code base. The resulting core Firedrake code has only around 5000 lines of executable code, while the PyOP2 parallel execution layer has fewer than 9000 executable lines. This compactness is evidence of the effectiveness of the abstraction choice made and is of immense benefit to the maintainability and extensibility of the code base.
Section 2 describes the state of the art in abstractions for scientific computing, particularly the finite element method. Section 3 details the abstractions, and their implementations, which are composed to form the Firedrake toolchain. Sections 4 and 5 describe in more detail the Firedrake and PyOP2 abstraction layers which are the core contribution of this paper. Section 6 describes an extensive computational verification of the performance and capability of the Firedrake system.
2 Mathematical and software abstraction of the finite element method
A particular advantage of the finite element method as a class of numerical methods for PDEs is that the entire algorithm can frequently be described in highly abstract mathematical terms. In the simplest cases, the mathematics of the method can be completely specified by a PDE in weak form, along with the desired boundary conditions and the discrete function spaces from which the solution and test functions should be drawn. Of course, a complete mathematical specification of the method is not the same as an efficient, parallel and bugfree software implementation. As a result, countless years of scientists’ time have been spent over the decades implementing finite element methods in low level Fortran and C code.
Whilst hand coding algorithms at a low level can produce efficient code, that approach suffers from a number of serious drawbacks. The key among these is a premature loss of mathematical abstraction: the symbolic structure of differential equations, function spaces and integrals is replaced by loops over arrays of coefficient values and individual floating point operations. Interspersed among these are parallel communication calls, threading and vectorisation directives, and so forth.
The original abstract mathematical expression of the equations embodies a separation of concerns: the equation to be solved is separated from its discretisation, from the linear and/or nonlinear solver techniques to be applied and from the implementation of the assembly and solvers. A low-level implementation loses this separation of concerns. This has a number of deleterious effects. First, choices are committed to far too early: deciding to change discretisation or the equation to be solved requires the implementation to be recoded. Second, the developer must deal with the mixture of equations, discretisation and implementation all at once. Reasoning about the mathematics of the code requires the developer to mentally reinterpret series of primitive instructions as the high level abstract mathematics they represent, and any change made to the desired behaviour must be implemented by manually working out the correct series of primitive operations. Changing and debugging the code in this way also carries the risks of defeating implementation choices which were made to optimise performance, and of introducing bugs.
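To make this loss of abstraction concrete, the following sketch hand-codes the assembly of a one-dimensional Poisson problem (linear elements, forcing f = 1, homogeneous Dirichlet conditions). It is an illustration written for this discussion, not code from any of the packages considered: the weak form, the function space and the operator have all dissolved into loops and index arithmetic.

```python
def assemble_poisson_1d(n):
    """Hand-coded P1 assembly of -u'' = f on [0, 1], f = 1, u(0) = u(1) = 0.

    At this level nothing of the mathematical structure survives:
    only loops over elements and explicit index arithmetic remain.
    """
    h = 1.0 / n
    A = [[0.0] * (n + 1) for _ in range(n + 1)]
    b = [0.0] * (n + 1)
    # Local stiffness matrix of a linear element of width h.
    k_local = [[1.0 / h, -1.0 / h], [-1.0 / h, 1.0 / h]]
    for e in range(n):                        # loop over elements
        dofs = (e, e + 1)                     # element-to-dof indirection
        for i in range(2):
            b[dofs[i]] += h / 2.0             # load vector for f = 1
            for j in range(2):
                A[dofs[i]][dofs[j]] += k_local[i][j]
    # Dirichlet conditions imposed by row/column surgery.
    for d in (0, n):
        for j in range(n + 1):
            A[d][j] = A[j][d] = 0.0
        A[d][d] = 1.0
        b[d] = 0.0
    return A, b

A, b = assemble_poisson_1d(4)
```

Changing the equation, the element or the boundary treatment all require edits to the same tangle of loops, which is precisely the coupling that the abstractions discussed below remove.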
2.1 The benefits and limits of mathematical library interfaces
Given the limitations of hand-writing low-level code, it is unsurprising that much effort has been devoted to the development of finite element and other scientific software which maintains something of the mathematical abstraction of the methods. A core feature of these approaches is that they present a programming environment in which the data objects correspond to the higher-level mathematical objects found in the finite element method. For example, there may be data objects corresponding to sparse matrices, distributed vectors, finite elements and function spaces.
A common and highly successful approach to this is for these mathematical objects to be represented as data objects in object-oriented libraries. High level mathematical operations are then expressed as operations on these objects, resulting in method calls. The actual implementation of the primitive numerical operations on arrays of floating point numbers is hidden in the implementation of those methods. Deal.II [Bangerth et al. (2007), Bangerth et al. (2013)] and DUNE-FEM [Dedner et al. (2010)] are prominent examples of object-oriented finite element packages, and there are many others. The object-oriented library approach has also been very successfully applied by leading sparse linear algebra libraries, notably including PETSc [Balay et al. (2014)] and the Trilinos Epetra and Tpetra packages [Heroux et al. (2005)].
The library approach is most successful where the mathematical operations specified by the application developer have a fairly large granularity: for example in the case of linear algebra packages, the smallest operations (such as scaling vectors or taking a dot product) still involve an iteration over the entire vector, and operations such as a linear solve are much larger. This means that the implementation of tight inner loops and much or all of the parallelism can be hidden from the application developer, thereby achieving the desired separation of algorithm and implementation.
Conversely, a key domain of variability in the field of numerical methods for PDEs lies at the level of the innermost loops: the numerical operations conducted at each mesh entity (cell, face, edge or vertex) depend on the PDE being solved and the numerical method employed. This means that the level of code at which the algorithm is expressed is at or below the level at which factors such as loop order, data layout and function calls become performance critical. Fine grained parallelism, such as threading and vectorisation, may also need to be expressed at this level. In contrast to the case of linear algebra, in the PDE discretisation a critical part of the user algorithm describes very fine-grain operations which must be woven together to form an efficient, parallel implementation. For this reason, library-based finite element packages such as DUNE-FEM and Deal.II require that C++ implementations of integrals expressed as low-level sums over quadrature points be provided by the application developer.
2.2 Domain specific languages for finite elements
The desire to express the integrals at the heart of the finite element method in a high-level mathematical language while still producing efficient low-level code implementing these integrals has led some projects to adopt a different approach. Rather than writing directly executable code utilising library calls to access functionality, the numerics of the finite element method are specified purely symbolically in a special-purpose language. A specialised compiler or interpreter then uses this input to generate low-level, efficient code. Within this category, we can distinguish between standalone languages with their own parser, and embedded languages implemented in an existing general purpose compiled or interpreted language. A prominent example of the former class is FreeFem++ [Hecht (2012)], while the Unified Form Language (UFL, [Alnæs et al. (2014)]) and Sundance [Long et al. (2010)] are examples of finite element domain-specific languages (DSLs) embedded in Python and C++ respectively.
A well designed DSL not only enables the application programmer to express their problem clearly, mathematically and concisely, it also provides the compiler writer with a great deal of freedom to make optimal implementation choices, including those which are too verbose, tedious and error-prone to implement by hand. For example, the FEniCS Form Compiler, which takes UFL as its input language, has been used to develop highly optimised quadrature [Ølgaard and Wells (2010)] and tensor reduction [Kirby et al. (2005)] implementations of finite element assembly.
A further benefit of the DSL approach is that the symbolic mathematical form of the variational problem is available in the program code. This can be exploited to automate reasoning about the mathematical structure of the problem, for example to provide high-level differentiation of the algorithm with respect to any of its inputs. This is employed by Sundance and FEniCS [Logg et al. (2012)] to compute the linearisation of the terms in the equation. It has been further exploited to provide automated adjoint operators, and thereby adaptive error control, functional optimisation and stability analysis [Rognes and Logg (2013), Farrell et al. (2013), Funke and Farrell (2013), Farrell et al. (2014)].
3 Exploiting composable abstractions in Firedrake
The novel contribution of Firedrake as a piece of mathematical software is to take the decomposition of the finite element method into automated abstractions further than previous approaches. In particular, we use a uniform abstraction (PyOP2) for the specification of iterations over the mesh, motivated by the observation that the mathematical statement of finite element problems decouples the local computation from its execution over the whole domain.
Firedrake models finite element problems as the composition of several abstract processes and its software stack is composed of separate packages for each. The core Firedrake package composes these into a largely seamless abstraction for finite element problems. Fig. 1 illustrates the Firedrake software stack, showing the relationships between the various abstractions and software layers. These are described in more detail in the following sections.
An important benefit of welldesigned mathematical abstractions is that they facilitate code reuse. Where possible we have adopted and adapted existing abstractions and existing implementations of those abstractions. This not only saves reinvention of previous work, it means that users and developers of those aspects of Firedrake do not need to learn new interfaces. However, in the case of the tasks of iteration over the mesh graph and the generation of optimal kernel implementations, there was no completely suitable existing solution and so new components were created.
3.1 Specification of finite element problems: the FEniCS Language
The end user of Firedrake wants to specify and solve finite element problems. In some sense the core part of this is the specification of the weak form of the PDE, and the selection of the appropriate finite elements. The Unified Form Language is a particularly elegant and powerful solution to this problem [Alnæs et al. (2014)]. UFL is a purely symbolic language with well-defined, powerful and mathematically consistent semantics embedded in Python. This makes interactive use possible and allows Firedrake to use the original implementation of UFL directly, thereby automatically maintaining compatibility with other users of the language. Firedrake adds several extensions to UFL, some of which have already been merged back into the upstream version.
The specification of the PDE and finite elements is necessary but not sufficient to specify a finite element problem. In addition to the weak form of the PDE, it is necessary to specify the mesh to be employed, set field values for initial and/or boundary conditions and forcing functions, and to specify the sequence in which solves occur. UFL was developed as part of the FEniCS project, which provides a complete finite element problem solving environment in the form of the Python interface to DOLFIN [Logg et al. (2012b)]. We refer to the language for finite element problems defined by DOLFIN and UFL as the FEniCS language. To ensure compatibility, Firedrake implements (a close variant of) that language and presents a user interface which is identical in most respects to the DOLFIN Python interface. Firedrake implements various extensions to the language, and there are a few features of DOLFIN which are not supported.
A Poisson problem and a linear wave equation, specified in the FEniCS language for execution by Firedrake, are shown in listings 1 and 2 (Section 6). In listing 1, line 1 defines a finite element function space V on a given mesh (whose definition is omitted for brevity) using linear Lagrange elements. A Dirichlet boundary condition of value 0 on the regions of the domain identified by the markers 3 and 4 is defined on line 3. Lines 5–10 show the UFL code defining the bilinear and linear forms a and L, with test and trial functions v and u and forcing function f. The resemblance to the mathematical formulation is immediately apparent. In lines 13–15, the forms are assembled into a matrix A and a Function b with the boundary conditions applied. The linear system of equations is solved in line 16 for a Function u defined on line 12.
3.2 Finite element tabulation: Fiat
Firedrake employs the FInite element Automatic Tabulator, FIAT [Kirby (2004)], which implements the classical finite element abstraction of Ciarlet [1978] to support a wide range of finite elements with relatively few element-specific alterations. The process of merging Firedrake’s extensions to FIAT back into the original version is underway.
3.3 Iteration over the mesh graph: PyOP2
In a typical finite element problem, the operations whose cost in data movement or floating point operations is proportional to the size of the mesh will be the dominant cost. These operations typically fall into two categories: iterating over data structures associated with the mesh, and sparse linear algebra. Firedrake’s solution to the former class of operation is PyOP2 [Rathgeber et al. (2012), Markall et al. (2013)].
PyOP2 is a domain-specific language embedded in Python for the parallel execution of computational kernels on unstructured meshes or graphs. Fundamental concepts are shared with OP2 [Giles et al. (2011)]; however, the implementation differs in ways that are crucial for the integration with Firedrake and other projects. PyOP2 dynamically generates code at runtime by leveraging Python to inspect objects and data structures. OP2, by contrast, relies on static analysis of an input program, which is transformed through source-to-source translation at compile time, making it very difficult to embed in another application. Furthermore, PyOP2 provides sparse matrices and other data structures required for finite element computations which are not supported by OP2.
PyOP2 provides an abstract interface for the definition of operations composed of the application of a kernel function for each entry in a fixed-arity graph. By representing the computational mesh as such a graph, it becomes possible to represent all of the mesh-visitor operations in the finite element method as instances of this single abstraction. A particularly clean separation of concerns is thereby achieved between the specification of the local kernel functions, in which the numerics of the method are encoded, and their efficient parallel execution. PyOP2 is the key novel abstraction in the Firedrake system. It is documented in much more detail in section 4.
3.4 Unstructured meshes: DMPlex
PyOP2 has no concept of the topological construction of a mesh: it works with indirection maps between sets of topological entities and sets of degrees of freedom but has no need to know the origin of these maps. Firedrake derives the required indirection maps for input meshes through an intermediate mesh topology object using PETSc’s DMPlex API, a data management abstraction that represents unstructured mesh data as a directed acyclic graph [Knepley and Karpeev (2009), Balay et al. (2014)]. This allows Firedrake to leverage the DMPlex partitioning and data migration interfaces to perform domain decomposition at runtime while supporting multiple mesh file formats. Moreover, Firedrake reorders mesh entities to ensure computational efficiency through communicationcomputation overlap, while also employing mesh renumbering techniques provided by DMPlex to improve cache coherency within the resulting data sets [Lange et al. (2015)].
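The DAG view of a mesh can be made concrete with a toy example (the function names here are illustrative, not the PETSc API): each entity receives an index, a cone function maps an entity to the entities one dimension below it on its boundary, and the transitive closure of the cone recovers everything incident to a cell.

```python
# Two triangles sharing an edge, numbered in the DMPlex style:
# cells 0-1, edges 2-6, vertices 7-10. "cone" points one level down
# the Hasse diagram; leaves (vertices) have an empty cone.
cone = {
    0: [2, 3, 4],              # cell 0 -> its three edges
    1: [4, 5, 6],              # cell 1 -> its three edges (edge 4 shared)
    2: [7, 8], 3: [8, 9], 4: [7, 9],
    5: [9, 10], 6: [7, 10],
}

def closure(entity):
    """Transitive closure of the cone: the entity plus everything
    on its boundary, at every dimension."""
    out, stack = [], [entity]
    while stack:
        e = stack.pop()
        if e not in out:
            out.append(e)
            stack.extend(cone.get(e, []))
    return sorted(out)
```

From such a structure, the cell-to-degree-of-freedom indirection maps that PyOP2 consumes can be derived mechanically, which is precisely the role Firedrake's intermediate mesh topology object plays.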
3.5 Linear and nonlinear solvers: PETSc
As noted above, the encapsulation of solvers for linear and nonlinear systems of equations is one of the most spectacular success stories for abstraction in scientific computing. The creation of efficient solver algorithms and implementations is also a complex and deep research field which it is not profitable to attempt to reinvent. We therefore adopt the widespread practice of passing solver problems on to an established high performance solver library. PETSc is adopted as a particularly wellestablished and fullyfeatured library which provides access to a large range of its own and third party implementations of solver algorithms [Balay et al. (2014)]. The fully featured Python interface to PETSc [Dalcin et al. (2011)] makes its integration with Firedrake particularly straightforward. Employing PETSc for both its solver library and for DMPlex has the additional advantage that the set of library dependencies required by Firedrake is kept small.
4 PyOP2
Many numerical algorithms and scientific computations on unstructured meshes can be viewed as the independent application of a local operation everywhere on the mesh. In the finite element method, this characterisation applies most obviously to the assembly of integrals over the domain, however it also applies to other operations such as time step increments and boundary condition implementation. This local operation is often called a computational kernel and its independent application lends itself naturally to parallel computation.
4.1 Sets
A mesh is modelled in PyOP2 as a graph defined by sets of entities (such as vertices, edges, and cells), and maps between these sets. Sets are used to represent collections of topological entities: vertices, edges, faces and cells. Sets are completely abstract entities: they store no bulk data themselves but only record the number of entities they contain, and their distribution among MPI processes. A set may also represent a set of nodes at which data may be stored; this set of nodes need not correspond to a set of topological entities. This facilitates the support of higher order finite element spaces in which varying numbers of degrees of freedom may be associated with various classes of topological entities. Sets exist only to be the subject of reference of other data objects, most particularly Maps and Dats.
4.2 Maps
A map associates a tuple of entries in a target set with each entry of another source set. For example, the source set might be the set of cells in a mesh, and the target set might be the set of degrees of freedom of a finite element function space. The map could then record, for each cell, the tuple of degrees of freedom of the target function space which are incident to that cell.
It is important to note that PyOP2 itself has no concept of meshes, or function spaces. The semantic meanings of sets and maps are defined and understood only by the Firedrake layer. At the PyOP2 layer these structures are merely objects over which iteration and indirection can occur.
There is a requirement for the map to be of constant arity, that is, each element in the source set must be associated with a constant number of elements in the target set. The constant arity restriction causes the extent of many tight loop bounds to be fixed, which creates opportunities for vectorisation and other optimisations. However, it excludes certain kinds of mappings: a map from vertices to incident edges or cells is only possible on a very regular mesh in which the multiplicity of every vertex is constant. Nevertheless, the full set of maps required to implement the finite element method is supported.
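The following minimal sketch (not PyOP2's actual classes, though the names echo them) shows how little structure Sets and constant-arity Maps need to carry:

```python
class Set:
    """A PyOP2-style Set records only a size: no coordinates, no topology."""
    def __init__(self, size):
        self.size = size

class Map:
    """A constant-arity table of target-set indices, one row per source entity."""
    def __init__(self, source, target, arity, values):
        assert len(values) == source.size
        assert all(len(row) == arity for row in values)  # constant arity
        self.source, self.target, self.arity = source, target, arity
        self.values = values

    def __call__(self, e):
        # The tuple of target-set entries associated with source entry e.
        return self.values[e]

# A unit square split into two triangles: 4 vertices, 2 cells.
vertices = Set(4)
cells = Set(2)
cell_vertices = Map(cells, vertices, 3, [[0, 1, 2], [1, 3, 2]])
```

Because the arity (here 3) is fixed, every loop over a row of the map has compile-time-constant bounds, which is what makes the generated kernel wrappers amenable to unrolling and vectorisation.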
4.3 Data
PyOP2 supports three core arrangements of mutable data: Dats, which are abstracted discretised vectors, Mats, which are sparse matrices, and Globals, which represent data not associated with individual set members. In other words, a Mat is equivalent to a bilinear operator over a pair of Sets, a Dat is equivalent to a linear operator over a Set and a Global is a scalar (a 0-linear operator).
A Dat represents a vector of values, each associated with a particular member of the Set over which that Dat is defined. (Strictly speaking, there is a thin intermediate DataSet between the Set and the Dat to parametrise the size of the data at each set element, but this is an implementation detail on which we will not dwell.) The Dat presents a completely abstracted interface: the data may actually reside on one or more accelerators (GPUs) and be distributed over multiple MPI processes but the user will not usually observe this. In particular, Dats are able to reason about the validity and location of their data so that copies to and from the GPU and halo exchanges over MPI happen automatically and only if required.
A Mat object represents a sparse matrix, that is a linear operator from the data space defined on one Set to that defined on another. The matrix interface is actually a fairly thin layer over PETSc (in the CPU case) or CUSP (in the NVIDIA GPU case) and linear solves are completely outsourced to those libraries. At this stage, PETSc is the far more complete system and the only one considered productionready. The primary role of the Mat object is to match the sparse linear algebra library abstraction to the PyOP2 abstraction so that a PyOP2 kernel can be employed to assemble matrix entries efficiently and in parallel.
A Global represents a single tuple of values not connected with a Set. The reason for including this type, rather than simply employing a native Python numerical type, is to facilitate reasoning about updating data location. This enables the PyOP2 system to ensure that a Global always has the correct, consistent value even when updated in parallel or located on an accelerator.
4.4 Parloops and kernels
PyOP2’s model of execution is one of parallel loop operations which transform the system state, consisting of a set of Dats, Mats and Globals. Each parallel loop operation executes a kernel function k once for each member of a specified iteration set S. In finite element computations, this set is usually the set of a particular class of topological entities, thereby allowing a stencil operation to be executed over the whole mesh. The kernel usually accesses each Dat argument A indirectly through a map M. In other words, when the kernel function k is called for iteration set entry e, a reference to the set of values given by A[M(e)] is passed to k. For a computation over cells where k requires data defined over vertices, M provides the indices into A for each cell e.
For example, if S is the set of cells in a mesh, A is the set of degree of freedom values of a discretised field, and M is the map from cells to the incident degrees of freedom, then A[M(e)] will be a reference to the set of degree of freedom values incident to cell e.
We term the application of a kernel to a particular set of data a Parloop. Specification of a Parloop requires a kernel function k, a set of iteration entities S, and data arguments A_1, ..., A_n, each of which is annotated with an access descriptor a_i and, where access is indirect, an indirection map M_i. A Parloop created with these arguments encodes the mathematical algorithm

for all e in S: k(A_1[M_1(e)], A_2[M_2(e)], ..., A_n[M_n(e)]),

where each element of A_i is accessed according to the descriptor a_i, as detailed in the next subsection.
The kernel only has access to those entries of the Dat arguments which are adjacent to the current iteration set entry under the map provided. It sees the local ordering of the Dat entries to which it has access, but has no information about the global indices.
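These semantics can be sketched in pure Python (a serial illustration, not PyOP2's generated code): the parloop visits each iteration-set entry and hands the kernel the Dat values gathered through each map, so the kernel sees only a local view.

```python
def parloop(kernel, iterset_size, args):
    """args: list of (data, map_fn) pairs, where map_fn(e) gives the
    indices of `data` visible to the kernel at iteration entry e."""
    for e in range(iterset_size):
        # Gather the local view of every argument, then invoke the kernel.
        kernel(e, *[[data[i] for i in map_fn(e)] for data, map_fn in args])

# Example: compute cell midpoints from vertex coordinates on a 1D mesh.
coords = [0.0, 1.0, 2.0, 4.0]            # Dat on vertices
midpoints = [0.0, 0.0, 0.0]              # Dat on cells (written directly here)
cell_to_vertex = lambda e: [e, e + 1]    # constant-arity map: cell -> vertices

def midpoint_kernel(e, verts):
    # verts is the local view: the two vertex coordinates of cell e,
    # in local ordering; the kernel never sees global indices.
    midpoints[e] = 0.5 * (verts[0] + verts[1])

parloop(midpoint_kernel, 3, [(coords, cell_to_vertex)])
```

Note that the kernel addresses verts[0] and verts[1] by local position only, exactly the locality property described above.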
The loop over the iteration set is explicitly unordered and parallel: PyOP2 is licensed to execute it in any order and using as many threads, vector lanes or distributed processes as are available. Indirect access to data creates the possibility that this parallel execution may cause write contention, that is the same piece of data is accessed via more than one entity in the iteration set. PyOP2 must reason to avoid these contentions using colouring, communication and copies of data as appropriate. This is made possible by the specification of access descriptors for all kernel arguments.
The current colouring implementation in PyOP2 is deterministic, which results in bit-reproducible results when run on the same number of processors. Whether this feature remains sustainable as hardware parallelism becomes more fine-grained is yet to be determined.
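The colouring idea can be sketched as a greedy algorithm (an illustration of the principle, not PyOP2's implementation): cells that touch a common vertex receive different colours, so all cells of one colour may execute concurrently without write contention on vertex data, and visiting cells in a fixed order makes the colouring, and hence the summation order, deterministic.

```python
def greedy_colouring(cell_vertices):
    """Assign each cell the smallest colour not already used by any
    cell sharing one of its vertices."""
    colours = []
    vertex_colours = {}                       # vertex -> colours touching it
    for verts in cell_vertices:               # fixed order => deterministic
        used = set()
        for v in verts:
            used |= vertex_colours.get(v, set())
        c = 0
        while c in used:
            c += 1
        colours.append(c)
        for v in verts:
            vertex_colours.setdefault(v, set()).add(c)
    return colours

# A line of four interval cells sharing endpoints alternates colours.
colours = greedy_colouring([(0, 1), (1, 2), (2, 3), (3, 4)])
```

All cells of colour 0 may then increment vertex data in parallel, followed by all cells of colour 1, and so on.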
4.4.1 Access descriptors
Kernel functions modify their data arguments in place. The critical observation in OP2, which is adopted in PyOP2, is that mesh-based simulation kernels modify their arguments in characteristic ways. By explicitly specifying the character of the operations which the kernel will perform on each Dat, automated reasoning about the parallel execution of the Parloop in the presence of indirectly accessed arguments becomes vastly easier. The available access descriptors are as follows:
READ: The kernel may use previous values of this argument but not set them.

WRITE: The kernel may set the values of this argument, but the kernel’s behaviour does not depend on the previous values of the argument.

RW: The kernel may set the values of this argument and may use the previous values of the argument. Note that this still does not imply a particular execution order over the iteration set.

INC: The kernel adds increments to the values of the argument using the equivalent of the += operator in C.
The reader will immediately observe that READ, WRITE, and INC are special cases of RW. However their inclusion enables more sophisticated automated reasoning about data dependencies than would be possible were all arguments labelled RW.
Any data accessed as READ, RW or INC is automatically gathered via the mapping relationship in a staging in phase and the kernel is passed pointers to local data. After the kernel has been invoked, any data accessed as WRITE, RW or INC is scattered back out in a staging out phase. Only data accessed in INC mode could potentially cause conflicting writes and requires thread colouring to prevent any contention.
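This staging can be sketched as follows (again illustrative rather than PyOP2's generated code; as a simplification, INC arguments are staged here as zero-filled buffers whose contents are added back, which realises the "adds increments" semantics):

```python
READ, WRITE, RW, INC = "READ", "WRITE", "RW", "INC"

def parloop_staged(kernel, iterset_size, args):
    """args: list of (data, map_fn, access) triples."""
    for e in range(iterset_size):
        staged = []
        for data, map_fn, access in args:
            if access in (READ, RW):
                local = [data[i] for i in map_fn(e)]   # staging in
            else:                                      # WRITE, INC
                local = [0.0] * len(map_fn(e))         # fresh local buffer
            staged.append(local)
        kernel(e, *staged)
        for (data, map_fn, access), local in zip(args, staged):
            for k, i in enumerate(map_fn(e)):
                if access in (WRITE, RW):
                    data[i] = local[k]                 # staging out
                elif access == INC:
                    data[i] += local[k]                # the contended case

# Example: count, per vertex, the number of incident cells via INC.
cells = [(0, 1), (1, 2), (2, 3)]
degrees = [0.0] * 4

def count_kernel(e, inc_buf):
    inc_buf[0] += 1.0
    inc_buf[1] += 1.0

parloop_staged(count_kernel, 3, [(degrees, lambda e: list(cells[e]), INC)])
```

The final scatter with += is the only step in which two iteration-set entries can touch the same memory, which is why INC arguments alone require colouring.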
4.4.2 Global arguments
Global reductions are important operations in meshbased simulations. Users may wish to calculate globally integrated quantities, such as energy, or execute other reductions such as calculating the maximum Courant number in the simulation domain. Global data does not have an indirection map relationship with the mesh: the same global value is visible from every mesh entity. The kernel is therefore passed an appropriately sized variable into which it can place its contribution to the reduction operation. Globals have their own set of permitted access descriptors which reflect this: READ, SUM, MIN, MAX. PyOP2 is free to create multiple variables in memory corresponding to a single Global to support parallel kernel execution. The access descriptor enables PyOP2 to subsequently reduce these multiple variables to a single value. The addition of further reduction access descriptor operations, or even allowing userspecified reductions, would be straightforward. However at the time of writing there does not appear to be user demand for this feature.
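The role of the access descriptor in a Global reduction can be sketched as follows (illustrative; PyOP2 performs the equivalent with per-thread or per-process partial values):

```python
import math

def reduce_global(values, num_partials, descriptor):
    """Each worker reduces its share of the iteration set into a private
    partial; the partials are then combined according to the descriptor."""
    init = {"SUM": 0.0, "MAX": -math.inf, "MIN": math.inf}[descriptor]
    op = {"SUM": lambda a, b: a + b, "MAX": max, "MIN": min}[descriptor]
    partials = []
    for w in range(num_partials):            # one private copy per "thread"
        acc = init
        for v in values[w::num_partials]:    # this worker's share of the set
            acc = op(acc, v)
        partials.append(acc)
    result = init                            # final reduction of the partials
    for p in partials:
        result = op(result, p)
    return result

# Maximum Courant number over a (toy) set of per-cell values.
courant = [0.1, 0.7, 0.3, 0.9, 0.2]
```

Because SUM, MAX and MIN are associative and commutative, the partials may be combined in any order, which is what licenses the parallel execution.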
4.4.3 Matrix arguments
Mat arguments differ from Dat and Global arguments in a number of important ways. Critically, from PyOP2’s perspective, Mats are write-only data structures. Operations which read matrices, such as matrix-vector multiplication and solving linear systems, are executed by the sparse matrix library (for CPU execution this is PETSc). Consequently, the only access descriptors permitted for Mats are WRITE and INC. A Mat represents a linear relationship between two sets, corresponding to the rows and the columns of the matrix, so two maps (which may be identical) are required to map the kernel contribution to the matrix. In terms which may be more familiar to the reader conversant with the finite element method, the kernel is responsible for the local assembly of the integral of a test function against a trial function, and PyOP2 then uses the Maps to execute the global assembly into a sparse matrix.
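The local-to-global assembly pattern can be made concrete with a small sketch (illustrative only: PyOP2 inserts into PETSc matrices, whereas here the global matrix is a plain dictionary). A kernel produces a local element matrix, and the row and column Maps scatter it into the global matrix with INC semantics.

```python
# Sketch of global assembly: a kernel computes a local element matrix and
# row/column Maps scatter it into a dict-based sparse matrix (schematic).
cells = [(0, 1, 2), (1, 3, 2)]          # cell -> node Map (rows and columns)
A = {}                                  # global sparse matrix as {(i, j): v}

def local_kernel():
    # Stand-in for a generated local assembly kernel: a constant 3x3 matrix.
    return [[1.0 if i == j else 0.5 for j in range(3)] for i in range(3)]

for nodes in cells:
    local = local_kernel()
    for i, gi in enumerate(nodes):      # row Map
        for j, gj in enumerate(nodes):  # column Map
            A[(gi, gj)] = A.get((gi, gj), 0.0) + local[i][j]  # INC semantics

# Entries shared between cells accumulate contributions from both:
print(A[(1, 2)])  # 0.5 from each cell -> 1.0
```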
4.5 Kernel optimisation in COFFEE
Kernels are initialised with either a C code string or an abstract syntax tree (AST), from which C code is generated. The AST representation provides the opportunity for optimisation through the COFFEE AST optimiser [Luporini et al. (2015)], a compiler which specialises in advanced optimisations for short loops enclosing nontrivial mathematical expressions of the kind which typify finite element local assembly kernels.
COFFEE performs platform-specific optimisations on the AST with the goals of minimising the number of floating-point operations and improving instruction-level parallelism through the use of SIMD (Single Instruction Multiple Data) vectorisation. The optimiser can detect invariant subexpressions and hoist them out of the loop nest, permute and unroll loop nests, and vectorise expressions. The last step may require padding of the data and enforcing alignment constraints to match the target SIMD architecture. COFFEE supports both SSE (Streaming SIMD Extensions) and AVX (Advanced Vector Extensions) instruction sets.
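Loop-invariant code motion, the first of these transformations, can be illustrated on a toy assembly-style loop nest (a Python stand-in for exposition; COFFEE operates on C ASTs):

```python
# Illustration of loop-invariant code motion on a toy quadrature loop nest.
import math

q_weights = [0.3, 0.7]
alpha = 2.0

def naive(n):
    out = [0.0] * n
    for i in range(n):
        for w in q_weights:
            # math.cos(alpha) depends on neither loop index: it is invariant.
            out[i] += w * math.cos(alpha)
    return out

def hoisted(n):
    c = math.cos(alpha)          # invariant subexpression hoisted out
    out = [0.0] * n
    for i in range(n):
        for w in q_weights:
            out[i] += w * c
    return out

assert naive(4) == hoisted(4)    # identical result, fewer cos() evaluations
```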
5 The Firedrake layer
The role of the Firedrake layer is to marshal the abstractions provided by UFL, FIAT, FFC, PETSc and PyOP2 to take finite element problems specified in the FEniCS Language and efficiently produce solutions.
5.1 Mapping finite element constructs to data abstractions
The FEniCS Language presents higher-level mathematical objects than PyOP2. Firedrake implements these by composing suitable combinations of PyOP2 and PETSc objects. Fig. 2 illustrates this relationship. The Firedrake implementation of operations in the FEniCS language consists primarily of selecting the relevant PyOP2 objects and composing corresponding parallel loop calls so that the PyOP2 layer can undertake the actual calculation.
5.1.1 The mesh abstraction
The primary functions of the mesh object are to record adjacency between the topological entities (vertices, edges, faces, facets and cells) of the mesh, and to record the mesh geometry. The former of these is encoded in a PETSc DMPlex which provides arbitrary adjacency relationships [Knepley and Karpeev (2009)].
A common approach in PDE toolkits is to treat the coordinates as a special class of data by, for example, storing the coordinates of each vertex in the mesh. Firedrake eschews this approach in favour of treating the coordinates as a first-class vector-valued field represented in a suitable vector-valued function space. An advantage of this approach is that any operation which can be applied to a field may be applied to the coordinates. Solving a finite element problem to determine a new geometry field is therefore straightforward. Representing the coordinates using a fully featured function space also presents a mechanism for supporting curved (isoparametric) elements, support for which is currently available in a branch of Firedrake.
The mesh contains PyOP2 Sets which are proxies for the sets of cells, interior facets and exterior facets of the mesh (a facet is a mesh entity of codimension 1: an edge of a 2D mesh or a face of a 3D mesh). These form the iteration spaces for the PyOP2 parallel loops and, correspondingly, are the “source” sets of Maps which encode function spaces.
5.1.2 Function spaces and functions
A central and distinct feature of the finite element method is its representation of all solution fields as weighted sums of basis functions. The FEniCS language supports this construction with FunctionSpace and Function objects. The Function objects store the coefficient values and a reference to the corresponding FunctionSpace, while the FunctionSpace stores all of the indirections from the Mesh to the degrees of freedom, and the symbolic information required to access the basis functions. As Fig. 2 demonstrates, this maps in a rather natural way onto PyOP2 data types.
The Function object holds a PyOP2 Dat. This reflects the separation of concerns in the Firedrake toolchain: the Firedrake layer reasons about the finite element method and all of the actual data storage and communication is delegated to PyOP2.
FunctionSpace objects contain PyOP2 Map objects which encode the indirections from mesh entities to the collections of degrees of freedom required to implement the finite element method. The cell-node indirection map provides the indices of the nodes incident to each cell. Equivalently, this is the set of nodes whose associated basis functions may be nonzero in that cell. The use of the term node here rather than DOF (degree of freedom) reflects the treatment of vector and tensor function spaces: the same indirection map is employed regardless of the number of degrees of freedom at each mesh location, and the indices of the DOFs are calculated from this.
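The node-to-DOF calculation for vector-valued spaces can be sketched as follows (the function name and layout are illustrative, not Firedrake's internal API; the sketch assumes DOFs for the components at a node are numbered contiguously):

```python
# Sketch of dimension-independent node indirection: the same cell-node Map
# serves scalar and vector spaces, with DOF indices derived from node indices.
cell_nodes = [(0, 1, 2), (1, 3, 2)]   # cell -> node Map, shared by all spaces

def cell_dofs(cell, dim):
    """DOF indices for a cell of a space with `dim` components per node
    (assumes component DOFs at a node are numbered contiguously)."""
    return [node * dim + c for node in cell_nodes[cell] for c in range(dim)]

print(cell_dofs(0, 1))  # scalar space: [0, 1, 2]
print(cell_dofs(0, 2))  # 2-vector space: [0, 1, 2, 3, 4, 5]
```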
5.1.3 Assembling forms
Solving a variational problem requires the assembly of a linear system of equations in the linear case, and of the Jacobian and residual forms in the nonlinear case. In the Poisson problem in listing 1, the bilinear and linear forms a and L are explicitly assembled into the sparse matrix A and vector b respectively. Firedrake hands off assembly computations to PyOP2 Parloops (Section 4.4) with a form-specific list of arguments constructed as follows. The local assembly kernels for the forms are generated by FFC as described in the following section. The iteration set is extracted from the FunctionSpace of the test function v, and the first Parloop argument is the output tensor: for the bilinear form this is a Mat built from a pair of maps extracted from the test and trial spaces; for the linear form it is a Dat obtained by creating a new Function on the test space. The second Parloop argument is the coordinate field. Each coefficient used in the form, such as f in listing 1, translates into an additional argument.
5.2 A modified FFC
The FEniCS project provides the FEniCS Form Compiler (FFC), which takes variational forms specified in UFL and generates optimised C++ kernel functions conforming to the UFC interface [Logg et al. (2012a)]. This approach cuts across the abstraction provided by the PyOP2 interface: in PyOP2 the specification of the kernel is a problem-specific question delegated to the user (in this case, the PyOP2 user is Firedrake). Conversely, the optimisation of the kernel body for a given hardware platform is a matter for which PyOP2 (specifically COFFEE) takes responsibility. To reflect this, the version of FFC employed in the Firedrake toolchain is substantially modified. It still accepts UFL input but produces an unoptimised (and, indeed, unscheduled) abstract syntax tree (AST) for the local assembly of the form. Firedrake employs this AST to create a PyOP2 Kernel, and executes a PyOP2 parallel loop to perform global assembly. The modifications required to FFC are such that the Firedrake version of FFC is effectively a fork and will not be merged back. However, it is hoped that the UFLACS form compiler, currently under development by Martin Alnæs, will provide a basis for a unified compiler infrastructure.
5.3 Escaping the abstraction
It is an inherent feature of software abstractions that they create a division between those algorithms which are expressible in the abstraction and those which are not. In a well-designed abstraction, the former are concise, expressive, and computationally efficient. However, any part of an algorithm not expressible within the abstraction may become impossible to express without completely breaking out of the abstract framework and coding at a much lower level. It will never be possible to represent all algorithms with the same level of elegance in a single abstraction. Instead, the challenge is to ensure that a graceful degradation of abstraction occurs. That is to say, operations which lie a little outside the abstraction should require the user to work at only a slightly lower level, and access to aspects of the abstraction which are still applicable should be preserved.
5.3.1 Custom kernels
The FEniCS language presents an elegant and powerful abstraction for the expression of the core of the finite element method: weak form PDEs and their solution on piecewise polynomial triangulations of domains. However, it is frequently the case that real simulation challenges also incorporate non-finite-element aspects. For example, discontinuous Galerkin discretisations may require shock detectors and slope limiters; parametrisations of unresolved phenomena may require complex pointwise operations; and initial conditions may require access to external data in ways not representable in UFL.
The critical observation is that these operations, and many others besides, are still characterised by visiting mesh entities and accessing only data local to them: the operations supported by PyOP2. Firedrake therefore presents the user with the option of specifying a custom kernel in either C or as an AST. This kernel can then be explicitly executed over the mesh by invoking a parallel loop. If, as is often the case, the data access patterns are equivalent to those of the finite element method, then the user can invoke the Firedrake wrapper of a parallel loop and let Firedrake extract the correct Maps and Dats from the Firedrake Function. Alternatively, the user may directly invoke a PyOP2 parallel loop and extract the PyOP2 data structures manually. In either case, the automated parallelisation provided by PyOP2 remains. Listings 3 and 4 show an example of a randomised initial condition specified with custom Firedrake and PyOP2 kernels respectively.
5.3.2 Direct access to data structures
At a more direct level, the user may also elect to directly access the data in the Firedrake data structures. Since Firedrake is a pure Python library, the user can then deploy the full armoury of Python, NumPy and compatible libraries. PyOP2 employs the introspection capabilities of Python so that even in this case it remains aware of the data which has been accessed and modified. PyOP2 ensures that copies and halo exchanges occur as necessary to ensure that the user’s view of the data is current and correct, and that algorithmic correctness is maintained.
5.3.3 Access to generated code
For debugging purposes, it is sometimes useful for the user to access the C code which PyOP2 generates. This is accessible both in the disk cache and in memory attached to the relevant PyOP2 parallel loop object. In the particular case of C code which fails to compile (most commonly due to a syntax error in userprovided custom kernel code), the error message provides the location of the generated source file and the compiler error log.
5.4 Additional features facilitated by the Firedrake abstraction
5.4.1 Factorisation of mixed function spaces
When solving PDEs with multiple unknown solution fields, the standard finite element approach is to seek a solution in the mixed function space given by concatenating the function spaces of the solution fields. The test function is, naturally, drawn from the same space. UFL represents a form defined over a mixed function space as a single form. FFC then constructs a single mixed kernel which iterates over the combined set of basis functions of the test space and (in the case of a bilinear form) the trial space. In DOLFIN this is then assembled into a single monolithic sparse matrix.
In contrast, Firedrake takes the UFL form, represented as an abstract syntax tree, and employs symbolic manipulation to split it into forms for each combination of constituent test and trial space. This results in separate forms for each block of the mixed system, and FFC then creates kernels for those individual blocks. The resulting kernels have simpler loop structures, which aids COFFEE in producing highly optimised implementations. Bilinear forms are then assembled into a hierarchical matrix structure, comprising a matrix for each block combined using PETSc’s nested matrix facility [Balay et al. (2014), p. 86]. Using PETSc’s compressed sparse row storage, the insertion of entries into submatrices is expected to be faster than into a monolithic matrix due to the smaller number of nonzero columns (which have to be searched) in each row. This furthermore enables more efficient exploitation of block solver techniques such as Schur complements. A simulation employing mixed function spaces is presented in section 6.4. A much more detailed exposition of the mixed form splitting algorithm is presented in \citeN[section 5.2.3]Rathgeber2014.
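The resulting block layout can be pictured with a minimal sketch (a stdlib stand-in for PETSc's nested matrix type; the structure, not the API, is the point): each test/trial block is stored separately, and the mixed operator is applied block-wise without ever forming a monolithic matrix.

```python
# Sketch of a nested ("block") operator: a 2x2 mixed system whose blocks
# are stored separately and applied block-wise (stand-in for PETSc MatNest).
blocks = {                     # each block is itself a small 2x2 matrix
    (0, 0): [[2.0, 0.0], [0.0, 2.0]],
    (0, 1): [[1.0, 0.0], [0.0, 1.0]],
    (1, 0): [[0.0, 0.0], [0.0, 0.0]],
    (1, 1): [[3.0, 0.0], [0.0, 3.0]],
}

def block_matvec(x):
    """Apply the mixed operator to x = [x0, x1] one block at a time."""
    y = [[0.0, 0.0], [0.0, 0.0]]
    for (i, j), B in blocks.items():
        for r in range(2):
            for c in range(2):
                y[i][r] += B[r][c] * x[j][c]
    return y

print(block_matvec([[1.0, 1.0], [1.0, 1.0]]))  # [[3.0, 3.0], [3.0, 3.0]]
```

Because each block is addressed independently, solvers can treat blocks individually, which is what makes fieldsplit-style preconditioning natural in this layout.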
5.4.2 Pointwise operations
Users often need to change the values of fields by means other than solving a variational problem. For example, when employing a Runge-Kutta time-stepping scheme, variational problems are solved for the updates to fields, but the actual updates are linear combinations of fields. Similarly, users frequently choose to calculate forcing functions pointwise in terms of other functions, or may rescale the coordinate field. All of these are achievable by writing custom kernels; however, they are expressed much more naturally as assignments of expressions in which the variables are Function objects. These expressions are then compiled to form a kernel function which is applied pointwise over the mesh. The explicit wave equation code shown in listing 2 illustrates the simplicity of the user code required. The increments of both solution fields employ the pointwise expression compiler.
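The effect of a compiled pointwise update such as an assignment of `u + dt * v` to `u` is equivalent to the following element-wise loop over the DOFs (schematic plain Python, not Firedrake; in Firedrake the loop is generated, compiled, and executed as a parallel loop):

```python
# The pointwise update "u <- u + dt * v", written as the per-DOF kernel
# that the expression compiler effectively generates (schematic).
dt = 0.5
u = [1.0, 2.0, 3.0]   # DOF values of Function u
v = [2.0, 4.0, 6.0]   # DOF values of Function v

for i in range(len(u)):       # applied independently at every DOF
    u[i] = u[i] + dt * v[i]

print(u)  # [2.0, 4.0, 6.0]
```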
5.4.3 Immersed manifolds and extruded meshes
The support for domains which are manifolds immersed in a higher-dimensional space introduced in \citeNRognes2013a extends directly to Firedrake. Furthermore, Firedrake has extended the algebraic representation of finite elements and basis functions in UFL, FFC and FIAT to enable the algorithmic creation of tensor product finite elements on quadrilateral, triangular prism, and hexahedral cells [McRae et al. (2015)]. A particularly important class of meshes for high-aspect-ratio domains, such as the atmosphere and ocean, is composed of layers of triangular prism or hexahedral cells aligned in the vertical direction. The PyOP2 abstraction has been extended to exploit the structure induced by this vertical alignment to create very high speed iteration over such “extruded” meshes, documented in \citeNBercea2016.
6 Experiments
Firedrake is a tool chain capable of solving a wide range of finite element problems, which is demonstrated in this section through experiments chosen to cover different characteristics of its implementation. These include assembling and solving a stationary Poisson problem, the nonlinear time-dependent Cahn-Hilliard equation, and the linear wave equation using an explicit time stepping scheme. Implementation aspects investigated are assembly of left- and right-hand sides for regular and mixed forms, solving linear and nonlinear systems, evaluating expressions, and using fieldsplit preconditioners. All benchmarks represent real-world applications used in fluid dynamics to model diffusion, phase separation of binary fluids, and wave propagation.
The principal contribution of this paper is to describe the composition of abstractions and the consequent separation of concerns achieved in Firedrake. A comprehensive performance evaluation is beyond its scope; indeed, a comprehensive performance evaluation of even a single problem might easily occupy an entire paper. Instead, this section is designed to enable the reader to develop an impression of the broad performance characteristics of Firedrake.
We have chosen to compare against DOLFIN for two reasons. The first is that it is the package which provides the closest analogue to Firedrake: many of the same test cases can be run from nearly the same code. The second reason goes to the heart of the difficulty of conducting fair performance comparisons. When using someone else’s code, it is difficult to avoid the risk that any observed performance deficiency is due to inexpert use rather than an inherent flaw. By employing DOLFIN on ARCHER using the compilation flags recommended by the DOLFIN developers, and using test cases based on DOLFIN examples, we minimise the chance that any performance deficiencies are due to incorrect use of the software.
Source code for all benchmarks and the scripts used to drive them are available as part of the firedrake-bench repository hosted on GitHub. The particular version used in these experiments has been archived on Zenodo [Rathgeber and Mitchell (2016)].
6.1 Experimental setup
Computational experiments were conducted on the UK national supercomputer ARCHER, a Cray XC30 architecture [Andersson (2014)] with an Aries interconnect in Dragonfly topology. Compute nodes contain two 2.7 GHz, 12-core E5-2697 v2 (Ivy Bridge) Intel Xeon processors linked via a Quick Path Interconnect (QPI), and 64 GB of 1833 MHz DDR3 memory accessed via 8 channels and shared between the processors in two 32 GB NUMA regions. Each node is connected to the Aries router via a PCIe 3.0 link. For the reasons given in section 7.1, execution is always one core per MPI process: OpenMP is not employed.
Firedrake and PETSc were compiled with version 4.9.2 of the GNU compilers (due to technical limitations in accessing the licence server, Intel and Cray compilers cannot be used on ARCHER compute nodes and are therefore unavailable to PyOP2’s just-in-time compilation system), and Cray MPICH2 7.1.1 with the asynchronous progress feature enabled was used for parallel runs. The Firedrake component revisions used are archived on Zenodo and are accessible via the DOIs in the relevant citations: Firedrake [Mitchell et al. (2016)], PyOP2 [Rathgeber et al. (2016)], FIAT [McRae et al. (2016)], COFFEE [Luporini et al. (2016)], ffc [Logg et al. (2016)], PETSc [Smith et al. (2016)], PETSc4py [Dalcin et al. (2016)]. The DOLFIN used as a comparator is revision 5ec6384 (July 12 2015) and is linked to the same PETSc version as Firedrake.
Generated code is compiled with -O3 -fno-tree-vectorize in the Firedrake case and -O3 -ffast-math -march=native (as suggested by the FEniCS developers) in the DOLFIN case.
Unless otherwise noted, DOLFIN is configured to use quadrature representation with full FFC optimisations and compiler optimisations enabled, and Firedrake makes use of COFFEE’s loop-invariant code motion, alignment and padding optimisations described in \citeNLuporini2015, using quadrature representation. Meshes are reordered using PETSc’s implementation of reverse Cuthill-McKee in the Firedrake case, and using DOLFIN’s own mesh reordering in the DOLFIN case.
Benchmark runs were executed with exclusive access to compute nodes, and process pinning was used. Each measurement was preceded by a dry run of the same problem to pre-populate the caches for kernels and generated code, ensuring that compilation times do not distort measurements. Reported timings are the minimum of three consecutive runs.
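This measurement protocol, one warm-up run followed by the minimum of three timed runs, can be expressed as a small harness (a simplified sketch of the methodology, not the firedrake-bench driver itself):

```python
# Minimal version of the measurement protocol: one dry run to warm caches,
# then three timed runs, reporting the minimum.
import time

def measure(fn, *args):
    fn(*args)                                   # dry run: populate caches
    times = []
    for _ in range(3):
        t0 = time.perf_counter()
        fn(*args)
        times.append(time.perf_counter() - t0)
    return min(times)                           # least-noisy of three runs

t = measure(sum, range(100000))
```

Taking the minimum rather than the mean filters out transient interference (OS jitter, competing processes), which is why it is the conventional choice for this kind of micro-benchmarking.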
6.2 Poisson
Poisson’s equation is a simple elliptic partial differential equation. A primal Poisson problem on a domain $\Omega$ with boundary $\Gamma = \Gamma_D \cup \Gamma_N$ is defined as:
(1) $-\nabla^2 u = f \quad \text{in } \Omega$
(2) $u = 0 \quad \text{on } \Gamma_D$
(3) $\nabla u \cdot n = 0 \quad \text{on } \Gamma_N$
The weak formulation reads: find $u \in V$ such that
(4) $\int_\Omega \nabla u \cdot \nabla v \,\mathrm{d}x = \int_\Omega f v \,\mathrm{d}x \quad \forall v \in V,$
where $V$ is a suitable function space satisfying the Dirichlet boundary condition $u = 0$ on $\Gamma_D$.
This benchmark demonstrates assembly of a bilinear and linear form into a sparse matrix and vector, and solving a linear system with a preconditioned Krylov method.
6.2.1 Problem Setup
The domain is chosen to be the unit cube, represented as a fully unstructured mesh. The source term $f$ is chosen such that the problem has a known analytical solution.
Since the operator is symmetric positive definite, the problem is solved using a CG solver [Hestenes and Stiefel (1952)] with the HYPRE BoomerAMG algebraic multigrid preconditioner [Falgout et al. (2006)] on a unit cube mesh of varying resolution and for varying polynomial degrees. Listing 1 shows the Firedrake code for this problem.
6.2.2 Results
Strong scaling runtimes for matrix and right-hand side assembly and linear solve, comparing DOLFIN and Firedrake on up to 1536 cores, are shown in Fig. 3 for problems of approximately 0.5M to 14M DOFs at first and third order respectively.
Parallel efficiency for the strong scaling results with respect to a full node (24 cores) is shown in Fig. 4.
Weak scaling run times and efficiencies for P3 basis functions are shown in Fig. 5, separately for the intra-node case of up to 24 cores and the inter-node case of 24 to 1536 cores. Within a node, processes share resources, in particular memory bandwidth, which limits achievable performance for these bandwidth-bound computations. Scaling beyond a node, resources per core remain constant, and the limiting factor for scalability is network communication latency.
6.3 Linear Wave Equation
The strong form of the wave equation, a linear second-order PDE for a scalar field $\phi$, is given as:
(7) $\frac{\partial^2 \phi}{\partial t^2} = \nabla^2 \phi \quad \text{in } \Omega,$
subject to suitable boundary and initial conditions.
To facilitate an explicit time stepping scheme, an auxiliary quantity $p = \partial\phi/\partial t$ is introduced, recasting the problem as a pair of equations which are first order in time:
(10) $\frac{\partial \phi}{\partial t} = p$
(11) $\frac{\partial p}{\partial t} = \nabla^2 \phi$
again subject to the boundary and initial conditions.
The weak form of (11) is formed as: find $p \in V$ such that
(14) $\int_\Omega v \frac{\partial p}{\partial t} \,\mathrm{d}x = -\int_\Omega \nabla v \cdot \nabla \phi \,\mathrm{d}x \quad \forall v \in V$
for a suitable function space $V$. The absence of spatial derivatives in (10) makes the weak form of that equation equivalent to its strong form, so it can be solved pointwise.
An explicit symplectic method is used in time, in which $\phi$ and $p$ are offset by a half time step. Time stepping in (10) is a pointwise operation, whereas stepping forward in (14) involves inverting a mass matrix. However, by lumping the mass, this operation can also be turned into a pointwise one, in which the inversion of the mass matrix is replaced by a pointwise multiplication by the inverse of the lumped mass.
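The lumping step amounts to replacing the consistent mass matrix by the diagonal matrix of its row sums, after which the "solve" is an element-wise division (a stdlib-only sketch with a toy matrix; row-sum lumping is one standard choice):

```python
# Sketch of mass lumping: replace a mass-matrix solve by division with
# the row sums of M (toy numbers, illustrative only).
M = [[2.0, 1.0, 0.0],
     [1.0, 4.0, 1.0],
     [0.0, 1.0, 2.0]]          # toy consistent mass matrix
b = [3.0, 6.0, 3.0]            # right-hand side of the update

lumped = [sum(row) for row in M]                  # lumped (diagonal) mass
update = [b[i] / lumped[i] for i in range(3)]     # pointwise "solve"

print(lumped)   # [3.0, 6.0, 3.0]
print(update)   # [1.0, 1.0, 1.0]
```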
This benchmark demonstrates a numerical scheme in which no linear system is solved and therefore no PETSc solver is invoked. The expression compiler is used for the $\phi$ and $p$ updates, and all aspects of the computation are under the control of Firedrake. The implementation of this problem in Firedrake is given in listing 2.
6.3.1 Results
Strong scaling performance is shown in Fig. 6 for up to 384 cores and is limited by the measured non-parallelisable overhead indicated by the horizontal lines in the graph. Weak scaling runtimes and efficiencies are shown in Fig. 7, separately for the intra-node case of up to 24 cores and the inter-node case of 24 to 384 cores.
6.4 Cahn-Hilliard Equation
The final experiment presented in this section is the fourth-order parabolic time-dependent nonlinear Cahn-Hilliard equation, based on a DOLFIN demo (http://fenicsproject.org/documentation/dolfin/1.6.0/python/demo/documented/cahn-hilliard/python/documentation.html), which involves first-order time derivatives, and second- and fourth-order spatial derivatives. It describes the process of phase separation of the two components of a binary fluid:
(15) $\frac{\partial c}{\partial t} - \nabla \cdot M \nabla\!\left(\frac{\partial f}{\partial c} - \lambda \nabla^2 c\right) = 0 \quad \text{in } \Omega$
(16) $M \nabla\!\left(\frac{\partial f}{\partial c} - \lambda \nabla^2 c\right) \cdot n = 0 \quad \text{on } \partial\Omega$
(17) $M \lambda \nabla c \cdot n = 0 \quad \text{on } \partial\Omega$
with $c$ the unknown fluid concentration, $f$ a non-convex function of $c$, $M$ the diffusion coefficient, and $n$ the outward pointing boundary normal.
Introducing an auxiliary quantity $\mu$ (the chemical potential) allows the equation to be restated as two coupled second-order equations:
(18) $\frac{\partial c}{\partial t} - \nabla \cdot M \nabla \mu = 0$
(19) $\mu - \frac{\partial f}{\partial c} + \lambda \nabla^2 c = 0$
The time-dependent variational form of the problem with unknown fields $c$ and $\mu$ is given as: find $(c, \mu) \in V \times V$ for a suitable function space $V$ such that
(20) $\int_\Omega \frac{\partial c}{\partial t} q \,\mathrm{d}x + \int_\Omega M \nabla\mu \cdot \nabla q \,\mathrm{d}x = 0 \quad \forall q \in V$
(21) $\int_\Omega \mu v \,\mathrm{d}x - \int_\Omega \frac{\partial f}{\partial c} v \,\mathrm{d}x - \int_\Omega \lambda \nabla c \cdot \nabla v \,\mathrm{d}x = 0 \quad \forall v \in V$
Applying the Crank-Nicolson scheme for time discretisation yields:
(22) $\int_\Omega \frac{c^{n+1} - c^n}{\mathrm{d}t} q \,\mathrm{d}x + \int_\Omega M \nabla \mu^{n+\frac{1}{2}} \cdot \nabla q \,\mathrm{d}x = 0 \quad \forall q \in V$
(23) $\int_\Omega \mu^{n+1} v \,\mathrm{d}x - \int_\Omega \frac{\partial f}{\partial c}\!\left(c^{n+1}\right) v \,\mathrm{d}x - \int_\Omega \lambda \nabla c^{n+1} \cdot \nabla v \,\mathrm{d}x = 0 \quad \forall v \in V$
where $\mu^{n+\frac{1}{2}} = \frac{1}{2}\left(\mu^n + \mu^{n+1}\right)$.
6.4.1 Problem setup
The problem is solved on the unit square, represented as a fully unstructured mesh. The function space used is the space of first-order Lagrange basis functions.
Firedrake allows the initial condition to be set by defining a custom Kernel and executing a parallel loop, in which the expression may be written as a C string. The custom Kernel used to set the initial condition is shown in Listing 3. For comparison, an equivalent Kernel using the lower-level PyOP2 interface is provided in Listing 4.
To solve the mixed system, a GMRES solver with a fieldsplit preconditioner using a lower Schur complement factorisation is employed. When solving a mixed system with a block matrix with blocks $A$, $B$, $C$ and $D$, the Schur complement is given by
(24) $S = D - C A^{-1} B,$
and the lower factorisation is an approximation to
(25) $\begin{pmatrix} A & 0 \\ C & S \end{pmatrix}^{-1} = \begin{pmatrix} A^{-1} & 0 \\ -S^{-1} C A^{-1} & S^{-1} \end{pmatrix},$
where $A^{-1}$ and $S^{-1}$ are never explicitly formed.
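For intuition, the role of the Schur complement can be checked numerically on a tiny system whose blocks are scalars (stdlib only; in the actual solver these blocks are large sparse matrices inside PETSc, and the inverses are applied by preconditioned Krylov methods rather than formed):

```python
# Numerical sketch of block elimination via the Schur complement for a
# 2x2 system with scalar blocks A, B, C, D (illustrative only).
A, B, C, D = 4.0, 2.0, 1.0, 3.0
b0, b1 = 10.0, 7.0                 # right-hand side [b0, b1]

S = D - C * (1.0 / A) * B          # Schur complement: S = D - C A^{-1} B
x0_tilde = b0 / A                  # forward-substitution with the A block
x1 = (b1 - C * x0_tilde) / S       # second field via the Schur complement
x0 = (b0 - B * x1) / A             # back-substitution for the first field

# The block solve reproduces the solution of the coupled system:
assert abs(A * x0 + B * x1 - b0) < 1e-12
assert abs(C * x0 + D * x1 - b1) < 1e-12
```

The lower factorisation used as a preconditioner corresponds to the forward-substitution steps only, omitting the final back-substitution through the $B$ coupling.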
An approximation to $A^{-1}$ is computed using a single V-cycle of the HYPRE BoomerAMG algebraic multigrid preconditioner [Falgout et al. (2006)]. The inverse Schur complement $S^{-1}$ is approximated using a custom PETSc shell matrix preconditioner, following the approach of [Bosch et al. (2014)].
6.4.2 Results
Strong scaling runtimes for up to 1536 cores, comparing Firedrake and DOLFIN for solving the nonlinear system, assembling the residual and Jacobian forms, and evaluating the initial condition, on an 8M DOF mesh for ten time steps, are shown in Fig. 8. Weak scaling run times and parallel efficiencies are shown separately for 1-24 cores (intra-node) and 24-1536 cores (inter-node) in Fig. 9.
6.5 Performance discussion
The experiments presented were selected to demonstrate the performance of Firedrake in several different regimes. By drawing together the results, we can make some observations on the impact of the introduction of the PyOP2 abstraction layer and its implementation.
6.5.1 Assembly in comparison with DOLFIN
First, assembly of linear and bilinear forms in Firedrake is consistently much faster than in DOLFIN. There are several features of Firedrake which impact on this. Critically, the PyOP2 interface is an abstract basis for code generation, while the UFC interface imposed by DOLFIN is a C++ abstract interface [Alnæs et al. (2012)]. This means that PyOP2 kernels can be completely inlined while DOLFIN kernels result in multiple virtual function calls per element. The COFFEE optimisations have been found to result in up to a fourfold increase in speed over the quadrature optimisations in FFC [Luporini et al. (2015)]. The speedup is most pronounced in the case of the CahnHilliard equation, which employs mixed function spaces. In this case, a performance increase is expected due to the form splitting optimisation (see Section 5.4.1).
6.5.2 Scaling performance
The weak scaling performance of pure Firedrake code (that is, excluding the PETSc solver) beyond one node is uniformly excellent. Within one node, resource contention results in significantly less than perfect efficiency, but this is expected. In the strong scaling regime, the fixed overhead of some hundreds of microseconds per field (Fig. 8) causes scaling to fall away from optimal at a significantly higher degree-of-freedom count per core than would otherwise be the case. Reducing this fixed overhead therefore remains an important development objective.
7 Current limitations and future extensions
7.1 Accelerators and threads
This paper presents performance results only for MPI-parallel execution on CPUs, with instruction-level vector parallelism facilitated by COFFEE. As Fig. 1 shows, PyOP2 also supports execution using OpenMP threads or OpenCL on CPUs, and OpenCL and CUDA on GPUs. Preliminary performance results on these platforms were published in \citeNMarkall2013. However, the available hybrid-parallel and GPU linear solver libraries are far more limited than PETSc’s MPI-only functionality. The Firedrake developers have therefore given priority to achieving high performance and feature completeness for the CPU backend using MPI and vector parallelism. The other backends are fully functional in the sense that form assembly is supported, and solving is supported to the limits of the relevant solver backends. This demonstrates the utility of the PyOP2 interface in isolating such implementation matters from the specification of the algorithm. However, at this stage only the MPI CPU backend is considered to be of production quality and suitable for full documentation here. Given the increasingly fine-grained parallelism of both CPU and accelerator hardware, hybrid parallel approaches combining message passing with shared memory will be a future direction of development.
7.2 Adaptive finite element methods
Support for adaptively refined finite element methods requires lifting the restriction of PyOP2 maps to fixed arity described in Section 4.2. Permitting variable-arity maps, and the consequent variable trip counts in kernel loops, would impede many of the low-level optimisations applied by COFFEE, so both classes of map would need to be supported independently. The map storage format would also need to record the arity of each source element. A more promising option is to support container maps comprising several maps of different arity, with corresponding kernels to match. This would enable support not just for refinement, but also for mixed-geometry meshes.
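The container-map idea can be sketched as follows (a hypothetical design for exposition, not the PyOP2 API): the iteration is partitioned by arity, and each fixed-arity map is paired with a kernel whose trip count is known at compile time.

```python
# Sketch of a "container map": several fixed-arity maps, each dispatched to
# a kernel with a matching fixed trip count (hypothetical design).
maps_by_arity = {
    3: [(0, 1, 2)],            # triangle cells
    4: [(1, 2, 3, 4)],         # quadrilateral cells in the same mesh
}

def kernel_tri(nodes):
    return ("tri", len(nodes))     # stand-in for a 3-node kernel

def kernel_quad(nodes):
    return ("quad", len(nodes))    # stand-in for a 4-node kernel

kernels = {3: kernel_tri, 4: kernel_quad}

results = []
for arity, m in maps_by_arity.items():
    k = kernels[arity]             # fixed arity -> fixed, optimisable loops
    results.extend(k(nodes) for nodes in m)

print(results)  # [('tri', 3), ('quad', 4)]
```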
7.3 Firedrake-adjoint
\citeNFarrell2012a demonstrated that the mathematical abstraction captured by the FEniCS language can be exploited to automate the generation and highly efficient execution of the tangent linear and adjoint models corresponding to forward models written in that language. Dolfin-adjoint (http://dolfin-adjoint.org), the software implementing \citeNFarrell2012a, operates on objects at the FEniCS Language level. Using only a short Python wrapper module, dolfin-adjoint has been extended to support Firedrake solvers written using unextended versions of the FEniCS Language. The user-defined extension kernels described in section 5.3 are not supported by Firedrake-adjoint, since they cannot be differentiated using UFL’s intrinsic symbolic operations. The extension of Firedrake-adjoint to apply traditional algorithmic differentiation methods to custom kernels is planned for the future.
8 Acknowledgements
The authors would like to thank Patrick E. Farrell for his help with the CahnHilliard preconditioner implementation and helpful comments on the results section. We would further like to thank the other contributors whose code is in Firedrake: Miklós Homolya, George Boutsioukis, Nicolas Loriant, and Kaho Sato. Finally we would like to acknowledge the input and ideas we receive from Colin J. Cotter and from the core FEniCS development team, particularly Martin S. Alnæs and Marie E. Rognes.
References
 Alnæs et al. (2012) Martin S. Alnæs, Anders Logg, and Kent-Andre Mardal. 2012. UFC: a finite element code generation interface. In Automated Solution of Differential Equations by the Finite Element Method, Anders Logg, Kent-Andre Mardal, and Garth Wells (Eds.). Lecture Notes in Computational Science and Engineering, Vol. 84. Springer Berlin Heidelberg, 283–302. DOI:http://dx.doi.org/10.1007/978-3-642-23099-8_16
 Alnæs et al. (2014) Martin S. Alnæs, Anders Logg, Kristian B. Ølgaard, Marie E. Rognes, and Garth N. Wells. 2014. Unified Form Language: a domainspecific language for weak formulations of partial differential equations. ACM Transactions on Mathematical Software (TOMS) 40, 2 (2014), 9.
 Andersson (2014) Stefan Andersson. 2014. Cray XC30 Architecture Overview. (Jan. 2014). http://www.archer.ac.uk/training/courses/craytools/pdf/architectureoverview.pdf
 Balay et al. (2014) Satish Balay, Shrirang Abhyankar, Mark F. Adams, Jed Brown, Peter Brune, Kris Buschelman, Victor Eijkhout, William D. Gropp, Dinesh Kaushik, Matthew G. Knepley, Lois Curfman McInnes, Karl Rupp, Barry F. Smith, and Hong Zhang. 2014. PETSc Users Manual. Technical Report ANL95/11  Revision 3.5. Argonne National Laboratory. http://www.mcs.anl.gov/petsc
 Bangerth et al. (2007) Wolfgang Bangerth, Ralf Hartmann, and Guido Kanschat. 2007. deal.II – a General Purpose Object Oriented Finite Element Library. ACM Trans. Math. Softw. 33, 4 (2007), 24/1–24/27.
 Bangerth et al. (2013) Wolfgang Bangerth, Timo Heister, Luca Heltai, Guido Kanschat, Martin Kronbichler, Matthias Maier, Bruno Turcksin, and Toby D. Young. 2013. The deal.II Library, Version 8.1. arXiv preprint (2013). http://arxiv.org/abs/1312.2266v4
 Bercea et al. (2016) Gheorghe-Teodor Bercea, Andrew T. T. McRae, David A. Ham, Lawrence Mitchell, Florian Rathgeber, Luigi Nardi, Fabio Luporini, and Paul H. J. Kelly. 2016. A numbering algorithm for finite elements on extruded meshes which avoids the unstructured mesh penalty. Geoscientific Model Development Discussions 2016 (2016), 1–26. DOI:http://dx.doi.org/10.5194/gmd-2016-140
 Bosch et al. (2014) Jessica Bosch, David Kay, Martin Stoll, and Andrew J. Wathen. 2014. Fast solvers for Cahn–Hilliard inpainting. SIAM Journal on Imaging Sciences 7, 1 (2014), 67–97. DOI:http://dx.doi.org/10.1137/130921842
 Ciarlet (1978) Philippe G. Ciarlet. 1978. The finite element method for elliptic problems. Elsevier.
 Dalcin et al. (2016) Lisandro Dalcin, Lawrence Mitchell, Jed Brown, Patrick E. Farrell, Michael Lange, Barry Smith, Dmitry Karpeyev, nocollier, Matthew Knepley, David A. Ham, Simon Wolfgang Funke, Aron Ahmadia, Jan Blechta, Thomas Hisch, keceli, Miklós Homolya, Jorge Cañardo Alastuey, Asbjørn Nilsen Riseth, Garth N. Wells, and Jonathan Guyer. 2016. petsc4py: The Python interface to PETSc. (June 2016). DOI:http://dx.doi.org/10.5281/zenodo.56639
 Dalcin et al. (2011) Lisandro D. Dalcin, Rodrigo R. Paz, Pablo A. Kler, and Alejandro Cosimo. 2011. Parallel distributed computing using Python. Advances in Water Resources 34, 9 (Sept. 2011), 1124–1139.
 Dedner et al. (2010) Andreas Dedner, Robert Klöfkorn, Martin Nolte, and Mario Ohlberger. 2010. A generic interface for parallel and adaptive discretization schemes: abstraction principles and the Dune-Fem module. Computing 90, 3–4 (2010), 165–196. DOI:http://dx.doi.org/10.1007/s00607-010-0110-3
 Falgout et al. (2006) Robert D. Falgout, Jim E. Jones, and Ulrike Meier Yang. 2006. The Design and Implementation of hypre, a Library of Parallel High Performance Preconditioners. In Numerical Solution of Partial Differential Equations on Parallel Computers, Are Magnus Bruaset and Aslak Tveito (Eds.). Number 51 in Lecture Notes in Computational Science and Engineering. Springer Berlin Heidelberg, 267–294.
 Farrell et al. (2014) Patrick E. Farrell, Colin J. Cotter, and Simon W. Funke. 2014. A Framework for the Automation of Generalized Stability Theory. SIAM Journal on Scientific Computing 36, 1 (2014), C25–C48.
 Farrell et al. (2013) Patrick E. Farrell, David A. Ham, Simon W. Funke, and Marie E. Rognes. 2013. Automated derivation of the adjoint of highlevel transient finite element programs. SIAM Journal on Scientific Computing 35 (2013), C359–393. DOI:http://dx.doi.org/10.1137/120873558
 Funke and Farrell (2013) Simon W. Funke and Patrick E. Farrell. 2013. A framework for automated PDE-constrained optimisation. arXiv preprint arXiv:1302.3894 (2013).
 Giles et al. (2011) Michael B. Giles, Gihan R. Mudalige, Zohirul Sharif, Graham R. Markall, and Paul H. J. Kelly. 2011. Performance Analysis and Optimization of the OP2 Framework on Many-Core Architectures. Comput. J. (2011).
 Hecht (2012) Frédéric Hecht. 2012. New development in freefem++. Journal of Numerical Mathematics 20, 3–4 (2012), 251–266.
 Heroux et al. (2005) Michael A. Heroux, Roscoe A. Bartlett, Vicki E. Howle, Robert J. Hoekstra, Jonathan J. Hu, Tamara G. Kolda, Richard B. Lehoucq, Kevin R. Long, Roger P. Pawlowski, Eric T. Phipps, Andrew G. Salinger, Heidi K. Thornquist, Ray S. Tuminaro, James M. Willenbring, Alan Williams, and Kendall S. Stanley. 2005. An overview of the Trilinos project. ACM Trans. Math. Softw. 31, 3 (2005), 397–423.
 Hestenes and Stiefel (1952) Magnus R. Hestenes and Eduard Stiefel. 1952. Methods of conjugate gradients for solving linear systems. J. Res. Nat. Bur. Stand 49, 6 (1952), 409–436.
 Kirby (2004) Robert C. Kirby. 2004. Algorithm 839: FIAT, a new paradigm for computing finite element basis functions. ACM Transactions on Mathematical Software (TOMS) 30, 4 (2004), 502–516.
 Kirby et al. (2005) Robert C. Kirby, Matthew Knepley, Anders Logg, and L. Ridgway Scott. 2005. Optimizing the evaluation of finite element matrices. SIAM Journal on Scientific Computing 27, 3 (2005), 741–758.
 Knepley and Karpeev (2009) Matthew G. Knepley and Dmitry A. Karpeev. 2009. Mesh algorithms for PDE with Sieve I: Mesh distribution. Scientific Programming 17, 3 (2009), 215–230.
 Lange et al. (2015) Michael Lange, Lawrence Mitchell, Matthew G. Knepley, and Gerard J. Gorman. 2015. Efficient mesh management in Firedrake using PETSc DMPlex. (2015). http://arxiv.org/abs/1506.07749 To appear in SIAM SISC.
 Logg et al. (2016) Anders Logg, Martin Sandve Alnæs, Marie E. Rognes, Andrew T. T. McRae, Garth N. Wells, Johannes Ring, Lawrence Mitchell, Johan Hake, Miklós Homolya, Florian Rathgeber, Fabio Luporini, Graham Markall, Aslak Bergersen, Lizao Li, David A. Ham, Kent-Andre Mardal, Jan Blechta, Gheorghe-Teodor Bercea, Tuomas Airaksinen, Nico Schlömer, Hans Petter Langtangen, Ola Skavhaug, Colin J. Cotter, Thomas Hisch, mliertzer, Andy R. Terrel, and Joachim B. Haga. 2016. ffc: The FEniCS Form Compiler. (June 2016). DOI:http://dx.doi.org/10.5281/zenodo.56643
 Logg et al. (2012) Anders Logg, Kent-Andre Mardal, and Garth N. Wells (Eds.). 2012. Automated Solution of Differential Equations by the Finite Element Method. Springer.
 Logg et al. (2012a) Anders Logg, Kristian B. Ølgaard, Marie E. Rognes, and Garth N. Wells. 2012a. FFC: the FEniCS Form Compiler. In Automated Solution of Differential Equations by the Finite Element Method, Volume 84 of Lecture Notes in Computational Science and Engineering, Anders Logg, Kent-Andre Mardal, and Garth N. Wells (Eds.). Springer, Chapter 11.
 Logg and Wells (2010) Anders Logg and Garth N. Wells. 2010. DOLFIN: Automated Finite Element Computing. ACM Trans. Math. Softw. 37, 2, Article 20 (April 2010), 28 pages. DOI:http://dx.doi.org/10.1145/1731022.1731030
 Logg et al. (2012b) Anders Logg, Garth N. Wells, and Johan Hake. 2012b. DOLFIN: a C++/Python finite element library. In Automated Solution of Differential Equations by the Finite Element Method, Anders Logg, Kent-Andre Mardal, and Garth Wells (Eds.). Lecture Notes in Computational Science and Engineering, Vol. 84. Springer Berlin Heidelberg, 173–225. DOI:http://dx.doi.org/10.1007/978-3-642-23099-8_10
 Long et al. (2010) Kevin Long, Robert Kirby, and Bart van Bloemen Waanders. 2010. Unified embedded parallel finite element computations via software-based Fréchet differentiation. SIAM Journal on Scientific Computing 32, 6 (2010), 3323–3351.
 Luporini et al. (2016) Fabio Luporini, Lawrence Mitchell, Miklós Homolya, Florian Rathgeber, Andrew T. T. McRae, David A. Ham, Michael Lange, Graham Markall, and Francis Russell. 2016. COFFEE: A Compiler for Fast Expression Evaluation. (June 2016). DOI:http://dx.doi.org/10.5281/zenodo.56636
 Luporini et al. (2015) Fabio Luporini, Ana Lucia Varbanescu, Florian Rathgeber, Gheorghe-Teodor Bercea, J. Ramanujam, David A. Ham, and Paul H. J. Kelly. 2015. Cross-Loop Optimization of Arithmetic Intensity for Finite Element Local Assembly. ACM Transactions on Architecture and Code Optimization (TACO) 11, 4 (2015), 57.
 Markall et al. (2013) Graham R. Markall, Florian Rathgeber, Lawrence Mitchell, Nicolas Loriant, Carlo Bertolli, David A. Ham, and Paul H. J. Kelly. 2013. Performance-Portable Finite Element Assembly Using PyOP2 and FEniCS. In 28th International Supercomputing Conference, ISC, Proceedings (Lecture Notes in Computer Science), Julian Martin Kunkel, Thomas Ludwig, and Hans Werner Meuer (Eds.), Vol. 7905. Springer, 279–289. DOI:http://dx.doi.org/10.1007/978-3-642-38750-0_21
 McRae et al. (2015) Andrew T. T. McRae, Gheorghe-Teodor Bercea, Lawrence Mitchell, David A. Ham, and Colin J. Cotter. 2015. Automated generation and symbolic manipulation of tensor product finite elements. (2015). http://arxiv.org/abs/1411.2940 To appear in SIAM SISC.
 McRae et al. (2016) Andrew T. T. McRae, Marie E. Rognes, Anders Logg, David A. Ham, Miklós Homolya, Jan Blechta, Nico Schlömer, Johannes Ring, Aslak Bergersen, Colin J. Cotter, Lawrence Mitchell, Garth N. Wells, Florian Rathgeber, Robert Kirby, Lizao Li, Martin Sandve Alnæs, mliertzer, and Andy R. Terrel. 2016. fiat: The Finite Element Automated Tabulator. (June 2016). DOI:http://dx.doi.org/10.5281/zenodo.56637
 Mitchell et al. (2016) Lawrence Mitchell, David A. Ham, Florian Rathgeber, Miklós Homolya, Andrew T. T. McRae, Gheorghe-Teodor Bercea, Michael Lange, Colin J. Cotter, Christian T. Jacobs, Fabio Luporini, juan42, Simon Wolfgang Funke, Francis J. Poulin, Henrik Büsing, Tuomas Kärnä, Anna Kalogirou, Stephan Kramer, Hannah Rittich, Eike Hermann Mueller, Geordie McBain, Asbjørn Nilsen Riseth, Patrick E. Farrell, Graham Markall, and Justin Chang. 2016. firedrake: an automated finite element system. (June 2016). DOI:http://dx.doi.org/10.5281/zenodo.56640
 Ølgaard and Wells (2010) Kristian B. Ølgaard and Garth N. Wells. 2010. Optimizations for quadrature representations of finite element tensors through automated code generation. ACM Trans. Math. Softw. 37, 1, Article 8 (Jan. 2010), 23 pages. DOI:http://dx.doi.org/10.1145/1644001.1644009
 Rathgeber (2014) Florian Rathgeber. 2014. Productive and efficient computational science through domain-specific abstractions. Ph.D. Dissertation. Imperial College London.
 Rathgeber et al. (2012) Florian Rathgeber, Graham R. Markall, Lawrence Mitchell, Nicolas Loriant, David A. Ham, Carlo Bertolli, and Paul H. J. Kelly. 2012. PyOP2: A High-Level Framework for Performance-Portable Simulations on Unstructured Meshes. In High Performance Computing, Networking Storage and Analysis, SC Companion. IEEE Computer Society, Los Alamitos, CA, USA, 1116–1123.
 Rathgeber and Mitchell (2016) Florian Rathgeber and Lawrence Mitchell. 2016. firedrakebench: Benchmarks for Firedrake. (June 2016). DOI:http://dx.doi.org/10.5281/zenodo.56646
 Rathgeber et al. (2016) Florian Rathgeber, Lawrence Mitchell, Fabio Luporini, Graham Markall, David A. Ham, Gheorghe-Teodor Bercea, Miklós Homolya, Andrew T. T. McRae, Hector Dearman, Christian T. Jacobs, gbts, Simon Wolfgang Funke, Kaho Sato, and Francis Russell. 2016. PyOP2: Framework for performance-portable parallel computations on unstructured meshes. (June 2016). DOI:http://dx.doi.org/10.5281/zenodo.56635
 Rognes et al. (2013) Marie E. Rognes, David A. Ham, Colin J. Cotter, and Andrew T. T. McRae. 2013. Automating the solution of PDEs on the sphere and other manifolds in FEniCS 1.2. Geoscientific Model Development 6, 6 (2013), 2099–2119.
 Rognes and Logg (2013) Marie E. Rognes and Anders Logg. 2013. Automated goaloriented error control I: Stationary variational problems. SIAM Journal on Scientific Computing 35, 3 (2013), C173–C193.
 Smith et al. (2016) Barry Smith, Satish Balay, Matthew Knepley, Jed Brown, Lois Curfman McInnes, Hong Zhang, Peter Brune, sarich, Lisandro Dalcin, stefanozampini, and Dmitry Karpeyev. 2016. petsc: Portable, Extensible Toolkit for Scientific Computation. (June 2016). DOI:http://dx.doi.org/10.5281/zenodo.56641