High level implementation of geometric multigrid solvers for finite element problems: applications in atmospheric modelling
Abstract
The implementation of efficient multigrid preconditioners for elliptic partial differential equations (PDEs) is a challenge due to the complexity of the resulting algorithms and corresponding computer code. For sophisticated (mixed) finite element discretisations on unstructured grids an efficient implementation can be very time consuming and requires the programmer to have indepth knowledge of the mathematical theory, parallel computing and optimisation techniques on manycore CPUs. In this paper we show how the development of bespoke multigrid preconditioners can be simplified significantly by using a framework which allows the expression of the each component of the algorithm at the correct abstraction level. Our approach (1) allows the expression of the finite element problem in a language which is close to the mathematical formulation of the problem, (2) guarantees the automatic generation and efficient execution of parallel optimised lowlevel computer code and (3) is flexible enough to support different abstraction levels and give the programmer control over details of the preconditioner. We use the composable abstractions of the Firedrake/PyOP2 package to demonstrate the efficiency of this approach for the solution of strongly anisotropic PDEs in atmospheric modelling. The weak formulation of the PDE is expressed in Unified Form Language (UFL) and the lower PyOP2 abstraction layer allows the manual design of computational kernels for a bespoke geometric multigrid preconditioner. We compare the performance of this preconditioner to a singlelevel method and hypre’s BoomerAMG algorithm. The Firedrake/PyOP2 code is inherently parallel and we present a detailed performance analysis for a single node (24 cores) on the ARCHER supercomputer. Our implementation utilises a significant fraction of the available memory bandwidth and shows very good weak scaling on up to 6,144 compute cores.
keywords:
geometric multigrid, atmospheric modelling, preconditioner, (mixed) finite elements, domainspecific compilers
1 Introduction
Efficient solvers for partial differential equations (PDEs) are required in all areas of science and engineering. The design and implementation of these solvers requires a wide set of skills, including, but not limited to: knowledge of the system being simulated; creation and implementation of appropriate numerical schemes; analysis of the resulting linear and nonlinear operators; and efficient lowlevel implementation of the chosen schemes. It is therefore rare that a single individual possesses the full range of skills to successfully deliver both an algorithmically and computationally efficient solver on their own.
To address this multidisciplinary problem requires software frameworks that enable scientists with complementary skills and specialisations to collaborate without each one of them requiring full knowledge of the system. This can be achieved by carefully designing composable interfaces that capture the natural abstraction at each level of the simulation stack.
An important application is the simulation of fluid systems which are described by the Navier Stokes equations. In contrast to other mixed finite element methods such as the TaylorHood element, discretisations based on mimetic finite element methods (see e.g. [1, 2, 3]) exactly preserve certain physical properties of the system under study. In particular, the divergence theorems and hold exactly for the discretised operators; this is not the case for some other mixed methods such as the TaylorHood element. Mimetic methods reproduce the favourable properties of the Cgrid staggering, but allow the use of higher order discretisations and nonorthogonal grids. This is particularly important in global atmospheric modelling applications, where the “pole problem” of standard longitudelatitude grids introduces artificial time step limitations near the pole and restricts the parallel scalability of the code [4]. While mimetic finite element schemes are very popular in applications such as numerical climate and weather prediction, their efficient implementation poses significant challenges. To deliver forecasts under tight operational timescales, solvers have to be algorithmically robust, make optimal use of modern manycore chip architectures and scale to large node counts on distributed memory clusters.
Elliptic PDEs arise in implicit time stepping methods of the atmospheric equations of motion and their solution often forms the bottleneck of the full model code. Since the Earth’s atmosphere is represented by a thin spherical shell, the resulting discrete system is highly anisotropic. This requires the use of bespoke preconditioners, in particular the tensorproduct multigrid approach in [5] has turned out to algorithmically efficient [6, 7, 8]. The implementation of multigrid algorithms for loworder finite difference discretisations on structured grids is straightforward but this is not the case for sophisticated finite element discretisations on more complex geometries for several reasons: Since the HDiv (velocity) mass matrix is not diagonal, the iterative solver algorithm is significantly more involved than the corresponding finite difference version. In the finite element case the innermost compute kernels can not be expressed as simple stencil applications and  to achieve optimal performance on a particular architecture  the loop nest has to be optimised, bearing in mind hardware specific properties such as the cache layout. In a distributed memory setting halo exchanges and overlapping of computation and communication require a careful partitioning of the grid. The Firedrake/PyOP2 [9, 10] framework allows the automatic assembly of finite element operators from their weak formulation and the expression of the algorithm at a high abstraction level. Architecture dependent optimised lowlevel Ccode is automatically generated and executed with justintime compilation techniques.
In this paper, we discuss the extension of the Firedrake framework to support the development of geometric multigrid solvers for finite element problems. We illustrate its use, and investigate the performance of the resulting method, in the development of a geometric multigrid solver for a mimetic finite element discretisation of the atmospheric equations of motion. This is a challenging problem and an excellent test case for the developed abstractions since both the implementation of the underlying finite element scheme and the design of an efficient multigrid method are nontrivial. We show how our approach simplifies the implementation of the solver, by cleanly separating the different aspects of the model. In particular we demonstrate how the chosen abstractions allow the easy implementation of a columnlocal matrix representation which is crucial for the blockJacobi smoother. This particular smoother is algorithmically optimal for strongly anisotropic problems and a key ingredient of the tensorproduct multigrid algorithm in [5]. A careful performance analysis confirms that our implementation is efficient in the sense that it uses a significant fraction of the system’s peak performance for bandwidthbound applications.
An alternative and very successful approach to the implementation of finite element solvers is the use of templated C++ code [11, 12, 13]. This allows the user full control over all components of the algorithm and multigrid solvers for finite element discretisations have been implemented for example in DUNE [14] and deal.II [15]. However, the key computational kernels (such as the local application of the operator in weak form) have to be written by hand and any optimisation is limited by the capabilities of the available compiler. In contrast, frameworks like FEniCS [16] and Firedrake use domainspecific compilers to carry out optimisations that are infeasible for general purpose compilers to perform on a lowlevel representation of the same algorithm [17]. Compared to FEniCS, where expressing nonfinite element operations requires the programmer to explicitly manage all parallelism and mesh iteration, one of the advantages of our approach is the straightforward implementation of any local operations as computational kernels in PyOP2 [9]; this is crucial for the preconditioners considered in this work.
The paper is structured as follows. In section 2, we give a brief overview of the Firedrake software framework. Section 3 lays out the mathematical abstractions of the multigrid method, and how we organise the software abstractions around them. The multigrid method for our model problem, an atmospheric gravity wave, is discussed in section 4. We characterise the performance of the resulting scheme in section 5 and conclude in section 6. Some more technical aspects are discussed in the appendices where we derive the equations for linear gravity wave propagation, give the parameters of the PETSc fieldsplit solver and provide a detailed breakdown of the solver setup times.
Main achievements
We demonstrate the performance and parallel scalability of the solver on the ARCHER supercomputer and compare our geometric multigrid implementation to a matrixexplicit implementation based on the PETSc library [18, 19]. In the latter case we use the BoomerAMG [20] preconditioner from the hypre suite [21] to solve the DGpressure system. We show that the performance of the solver for the low order discretisations treated in the paper is memory bound, and quantify the absolute performance of the most computationally intensive kernels in our solver algorithm by a detailed analysis of memory traffic. We find that the computationally most expensive kernels utilise a significant fraction of the peak memory bandwidth.
2 Firedrake/PyOP2: abstractions for finite element methods
Firedrake [10] is a Python system for the solution of partial differential equations by the finite element method. It builds on the abstractions introduced in the FEniCS project [16, 22] to present a highlevel, automated, problem solving environment. Firedrake enforces a strong separation of concerns between employing the finite element method, the implementation of the local discretisation of the mathematical operators, and their parallel execution over a mesh. The execution of kernels over the mesh is carried out using an iteration abstraction layer, PyOP2 [9]. This layer is explicitly exposed to the model developer, and allows them to write and execute custom kernels over the mesh. The critical observation is that most operations that fall only slightly outside the finite element abstraction may still be formulated as the execution of a local operation over some set of mesh entities, these can be expressed as a PyOP2 parallel loop. This separate abstraction layer allows the user to worry about the local operations: parallelisation is carried out automatically by PyOP2 exactly as it is for finite element kernels in Firedrake itself. More details of the interaction between Firedrake and PyOP2 can be found in [10, Section 4].
3 Multigrid in Firedrake
Multigrid methods [23] are an algorithmically optimal approach to solving many PDEs, especially those involving elliptic operators. They rely on a hierarchy of scales to cheaply and efficiently compute properties of the system at the appropriate scale: modes that vary slowly in space can be accurately represented using only a few degrees of freedom and are thus best solved for on coarse grids, whereas fast modes need high spatial accuracy. Here we briefly provide an overview of the mathematical operations in terms of a twogrid setup, and then describe our implementation in Firedrake.
Let be the approximation space on the “coarse” grid and the space on the fine grid, these need not necessarily be nested, although the implementation is simplified if they are. A complete multigrid cycle is built from a few basic operators. In algorithm 1 we show the form of correction scheme multigrid (so called because on the coarse grid we solve for the correction to the fine grid equation) on two levels, given a linear system .
The extension to multiple levels recursively applies this twolevel cycle to compute . This setup suggests that we need to provide facilities for computing restrictions and prolongations. We also need the ability to compute coarse grid operators, and we need to be able to apply smoothers (effectively some form of linear solver). In listing 1, we show how this abstract framework translates into our implementation in Firedrake for a simple twodimensional example. We are able to exploit the existing facilities for the vast majority of the implementation, we merely need a few extensions to deal with hierarchies of meshes and transferring between them, which we discuss below.
3.1 Grid hierarchies
It is clear that the first object we will need is a hierarchy of grids (or meshes, in Firedrake’s parlance). This will encapsulate the relationship between the refined grids (providing information on the fine grid cells corresponding to coarse grid cells). These are provided by the Firedrake MeshHierarchy object. Similarly, we shall need to represent discrete solution spaces. Firedrake uses FunctionSpace objects for this, and we extend these with a FunctionSpaceHierarchy. As with the MeshHierarchy this encapsulates the relationship between function spaces on related grids (allowing us to determine the degrees of freedom in the fine grid that are related to a coarse grid cell). We note that in this work, we only treat hierarchically refined grids. This does not affect the abstractions we discuss, although it does simplify some of the implementation.
3.2 Restriction and prolongation
Nested finite element spaces admit particularly simple implementation of restriction and prolongation. To compute the restriction operator we use FIAT [24] to evaluate the coarse basis at the node points on the fine grid. This allows us to express the coarse cell basis functions in terms of linear combinations of fine cell basis functions. Since the mesh is regularly refined, we need only do this once and can use the same weighting for all cells. The restriction operator can then be expressed simply by applying this combination kernel to a given residual using a PyOP2 parallel loop (exposed as restrict in the Firedrake interface). In the same way we compute the interpolation from into , which is just the natural embedding, using FIAT, and apply the kernel over the mesh with PyOP2.
3.3 Coarse grid operators
Forming the coarse grid operators is straightforward using the existing facilities of Firedrake and UFL [25]. Rediscretised operators are readily available simply by taking the UFL expression for the fine grid operator and assembling it on the coarse grid (achieved using the coarsen_form operation in listing 1). These rediscretised operators have minimal stencil. In addition, it is also straightforward to provide simpler operators (perhaps throwing away couplings that do not contribute on coarse grids) by explicitly defining the operator symbolically on the appropriate coarse level and using it in the smoother.
3.4 Smoothers
Firedrake uses PETSc to provide solvers for linear systems. As such, for assembled matrices, we can use as a smoother any linear solver that PETSc makes available. In listing 1, for example, we use two Jacobipreconditioned Richardson iterations as a presmoother, solve the coarse problem exactly with LU and then use three preconditioned Richardson iterations as a postsmoother. Naturally, we are free to implement our own smoothers instead, perhaps we do not have an assembled matrix and therefore cannot use “blackbox” smoothers. Indeed, the key ingredient to achieve optimal performance is the use of the correct smoother. This is of particular importance in the tensorproduct multigrid scheme we discuss in the rest of the paper, since an assembled operator is not available and there is structure in the problem (the strong vertical anisotropy) we wish to exploit in the smoothers.
4 Tensorproduct multigrid for (mimetic) mixed finite elements
To illustrate the power of these abstractions we consider an important model system for meteorological applications: the equations for linear gravity wave propagation in the global atmosphere. The corresponding system of PDEs is discretised with a mixed finite element discretisation. The problem is solved in a thin spherical shell which represents the Earth’s atmosphere. The thickness of the atmosphere is several orders smaller than the radius of the earth. This flatness of the domain is typical for applications in atmospheric modelling and introduces a strong gridaligned anisotropy. As we shall see, for a mixed finite element discretisation this problem is significantly more complex than the simple example shown in listing 1. We will therefore use it to both characterise the performance of our multigrid implementation, and validate that we have exposed the correct abstractions. For the linear solver to converge rapidly, the anisotropy has to be treated correctly in the preconditioner and the main challenge is the implementation of an optimal smoother. The PyOP2 abstraction level allows the expression of this smoother in terms of tailored data structures and lowlevel kernels which implement the linerelaxation method that is key to exploit the vertical anisotropy.
Since the PDE we are solving is elliptic, one option is to employ an algebraic multigrid (AMG) preconditioner. On highly anisotropic domains, one must take special care in constructing the coarse grid operators and smoothers to achieve mesh independence. See for example [26], where the authors use smoothedaggregation AMG to precondition the velocity block when solving the nonlinear Stokes equations in the context of icesheet dynamics. To account for the strong anisotropy, the aggregation strategy is adapted to maintain the column structure of the degrees of freedom, and a smoother based on incomplete factorisations is used. There are a number of reasons why an AMG approach might not always be desirable. The AMG preconditioner can only use Galerkin coarse grid operators, and so the coarse grid stencil will be large. More importantly, unless special care is taken, the coarsening operation will not necessarily obey the anisotropy inherent in the problem, perhaps leading to suboptimal algorithmic performance. Finally, we found in [6] that a bespoke preconditioner, based on the tensorproduct multigrid of [5], is superior to AMG based methods if it is applied to a simplified model equation discretised with the finite volume method; indeed the geometric multigrid approach turned out to be about faster than blackbox AMG implementations from the DUNE [27] and hypre [21] libraries.
Moreover the tensorproduct preconditioner can be shown to be optimal for gridaligned anisotropies. In a recent paper [8] we have also demonstrated numerically that the method works well for atmospheric applications under slightly more general circumstances. These encouraging results motivate us to study the geometric multigrid implementation reported in this work.
4.1 Mathematical formulation and mixed finite element discretisation
The following linear gravity wave problem for pressure , velocity and buoyancy can be obtained by linearising the full NavierStokes equations for large scale atmospheric flow(see appendix A for a derivation):
(1) 
For simplicity we assume that both the speed of sound and the buoyancy frequency are constant and enforce the (strong) boundary condition
(2) 
at the upper and lower boundary of the atmosphere. The domain can be expressed as a tensorproduct where is the twodimensional surface of a sphere with radius . As described above, the domain is very flat, i.e. . To construct function spaces for mimetic finite element discretisations, consider the following de Rham complexes in one, two and three dimensions:
(3) 
We seek a solution to Eq. 1 with
(4) 
where is the subspace of whose normal component vanishes on the boundary of the domain. and are respectively the “horizontal” and “vertical” parts of . The remaining spaces are and . This choice of spaces for and is analogous to the CharneyPhillips staggering. We refer the reader to [28] for the implementation and further description of tensorproduct spaces in Firedrake.
It is worth highlighting here that the decomposition of the velocity space into a horizontal () and vertical () component is very important for the construction of the tensorproduct multigrid preconditioner described below.
We discretise in time using an implicit scheme which is a special case of the CrankNicholson method [29], resulting in the following weak system for the increments , and
(5)  
where is the normal vector in the vertical direction and , and are the known fields are the previous time step. In this work we consider two choices for the threedimensional complex in Eq. 3 introduced in [2] and summarised in Tab. 1. In the following they are referred to as “Lowest Order” (LO) and “NexttoLowest Order” (NLO).
Lowest Order (LO)  
NexttoLowest Order (NLO) 
Discretising, we obtain a system of linear equations for the dofvectors , and :
(6) 
Note that since , the velocity mass matrix can be written as the tensorial sum of a horizontal and a vertical mass matrix
(7) 
Similarly the weak derivative can be expressed as .
In the absence of orography, buoyancy can be eliminated point wise from Eq. 6 to obtain a block system of equations for velocity and pressure only^{*}^{*}*Note that the continuous version of the last equation in (6), is true also in the presence of orography, and an alternative approach (leading to a different discrete system) would be to eliminate orography from the continuous equations before discretising. However, a pointwise elimination of buoyancy from the discretised equations is only possibly in the absence of orography when (6) holds strongly and not just weakly.,
(8) 
It is important to note that, in contrast to lowest order finite volume discretisations on staggered grids, the velocity mass matrix is not diagonal^{†}^{†}†It is, however, well conditioned, with a condition number of that is independent of the grid resolution.. This prevents the standard treatment taken in many atmospheric modelling codes based on finite volume or finite difference discretisations (for example the semiimplicit dynamical cores of the Unified Model [31, 32]), which eliminate velocity point wise to obtain a system for the pressure, which is solved iteratively before reconstructing the velocity. In our case we have to use an iterative solver for the full linear system in and .
The condition number of the linear system grows with the (vertical) acoustic Courant number , and hence preconditioning is essential to achieve rapid convergence. Due to the flatness of the grid cells () the dominant terms in are due to fast vertical sound waves. However, wave propagation can be treated approximately independently in each column of the grid (in a given time interval, a wave which propagates over grid cells in the vertical direction will only cover a distance of cells in the horizontal). In the following we describe the construction of a preconditioner which uses this idea to reduce the condition number such that it only depends on the much smaller horizontal acoustic Courant number which is in atmospheric models (larger values of are excluded due to the explicit treatment of other processes such as advection). Under these conditions a tensorproduct multigrid Vcycle with vertical line relaxation and horizontal grid coarsening will then lead to rapid convergence of the linear solver iteration.
4.2 Schurcomplement preconditioner
Formally the Schurcomplement factorisation of the inverse of the block matrix in Eq. 8 is given by
(9) 
with the elliptic and positivedefinite “Helmholtz” operator
(10) 
This operator contains the dense inverse of the velocity mass matrix and is therefore also dense. We obtain a preconditioner with a number of approximation steps. We replace the full inverses in Eq. 9 by a diagonal approximation :
(11) 
Furthermore, since and , the second order term in Eq. 10 can be written as
(12) 
Again, we replace full mass matrix inverses by their diagonal approximations, , . Since the function space is horizontally discontinuous, the matrix has a blockdiagonal structure. It would therefore be possible to use more sophisticated approximations for in each vertical column. For example we tried a sparse approximate inverse [33], but found that this leads to increased setup costs and worse overall performance.
With the diagonally lumped mass matrices we obtain the approximate Helmholtz operator
(13) 
which can be inverted iteratively to approximate in Eq. 9. In contrast to , the operator has a sparse structure but still contains the dominant couplings in the vertical direction. Since the size of the derivatives and is proportional to the inverse grid spacings, the two terms in the brackets are of order and respectively (the pressure mass term is of order ). On the highly anisotropic grids considered here we have , and hence the second term in the bracket is the dominant contribution. If we write for the blockdiagonal part of , the following blockdiagonal operator differs from by terms of :
(14) 
The operator is decoupled in the horizontal direction, and can therefore be inverted independently in each column. With a suitable degree of freedom ordering this can be carried out using a simple banded matrix solve. To invert the operator , one can now use a suitably preconditioned iterative method. One possibility would be a Krylov iteration with blockJacobi preconditioner, namely
(15) 
This is similar to the approach taken in existing operational codes such as the Met Office Unified Model [31, 32]. However, as we demonstrated in [6], a much more efficient preconditioner is a tensorproduct multigrid algorithm in which the grid is only coarsened in the horizontal direction and Eq. 15 is used as a smoother (we refer to this blockdiagonal smoother as “vertical line relaxation” in the following). This is the method we use in this paper, and the numerical results in section 5 confirm the superiority of the method compared to a singlelevel preconditioner: the multigrid algorithm is about twice as fast. This was also observed in [6] and in numerical experiments we carried out on the ENDGame dynamical core. Since the inverse of is only required in the preconditioner and does not have to be computed to very high accuracy, we found it is most efficient to simply apply one multigrid Vcycle with .
4.2.1 Tensorproduct multigrid algorithm
The tensorproduct multigrid algorithm which we use to solve the system was first described and analysed in [5] and proven robust for gridaligned anisotropies. In [8] the analysis is extended to dimensional grids and it is demonstrated numerically that the algorithm is still efficient for approximately grid aligned anisotropies in meteorological applications.
This can be interpreted in the context of the discussion at the end of session 4.2: the performance of the tensorproduct multigrid algorithm is given by the horizontal acoustic Courant number ; the much larger vertical Courant number does not appear at leading order since sound propagation in each column is treated exactly by the blockdiagonal line relaxation smoother.
The tensorproduct multigrid cycle fits naturally into the language of section 3, with the following selections for the operators. Rather than full coarsening in all three dimensions, we semicoarsen in the horizontal direction, leaving columns intact. For the nexttolowestorder function spaces, the first coarsening step is a refinement from into , with the restriction and prolongation defined using the natural embedding. We use the vertical “line relaxation” operator of Eq. 15 for our multigrid smoother. Finally we note that the natural correlation length in units of the horizontal grid spacing on the finest multigrid level is given by . Since other components of the model (such as explicit advection) limit the time step size to , this correlation length is . It is therefore sufficient to coarsen to the grid until the grid spacing is larger then this length scale; the coarse grid problem becomes well conditioned and can be solved with a small number of smoother iterations.
4.3 Implementation
The function spaces and are discontinuous in the horizontal direction. It is natural to order the degrees of freedom such that all unknowns in a vertical column of the three dimensional grid are stored consecutively in memory (see Fig. 1). This storage format, used by Firedrake, has good performance characteristics since the contiguous column data can be directly addressed.
As a consequence, in Eq. 13 and Eq. 14 the matrices , , , and are columnwise blockdiagonal. Each block can be associated with a cell of the twodimensional grid that covers and corresponds to a banded matrix which couples the unknowns in a particular vertical column. To apply the vertical line relaxation smoother of Eq. 15 we explicitly assemble so that it is amenable to inversion using banded matrix solves. Although it is not possible to express the operations we require using UFL – they are not operations on variational forms – it is still possible to write them as PyOP2 kernels that are executed over the grid. A key part of our implementation are therefore linear algebra operations on columnwise banded matrices. We describe this algebra and its implementation in the following section.
4.3.1 Banded matrix algebra
To represent the blockdiagonal matrices we implemented a bespoke banded matrix class which provides the necessary operations. In each vertical column, given a UFL expression for the operator, we can assemble the local matrix block. More generally, the class implements the blockdiagonal matrix representation of a linear operator that maps between two horizontally discontinuous spaces
(16) 
In each vertical column , the local matrix block is a generalised banded matrix:
(17) 
where , and (with ) depend on the function spaces and . When (which occurs when and have the same degreeoffreedom layout) this reduces to the standard banded storage format. In each column, we then store the operator using a generalisation of the LAPACK banded matrix format, see Fig. 1.
The Python class which implements provides the following functionality:

Assemble from a given UFL form to obtain

Apply to a fieldvector:

For two assembled linear operators and , calculate and .

Apply the inverse of to a fieldvector, i.e. solve the linear banded system for .
To solve the equation in each column we use two
different approaches: at nexttolowest order where the matrix bandwidth is
27, we use the LAPACK routine dgbtrf
to precompute and store a
LU decomposition of the matrix and then employ the
corresponding method dgbtrs
to solve the linear system based on
this LU decomposition. However, at lowest order, where the matrix is
tridiagonal, we found that a handwritten direct solve (Thomas
algorithm, see e.g. [34, §2.4]) is much more
efficient.
All operations can be carried out independently in each vertical column. Based on this observation, the hierarchical architecture of PyOP2/Firedrake is crucial for the implementation: we store the local matrix blocks as PyOP2 dats on the horizontal grid cells. The operations above can then be implemented as short PyOP2 kernels, which are executed over the horizontal grid. An example is given in listing 2, which shows the matrixvector product . Note that the user only has to express the local kernel as a short piece of C code, the PyOP2 system is responsible for executing this kernel over the grid.
4.4 PETSc/BoomerAMG based fieldsplit preconditioner
To compare the performance of our bespoke implementation to an existing solver package, we also solved the discretised equations purely algebraically with PETSc [19, 18] and preconditioned with hypre [21]. PETSc’s fieldsplit preconditioner is used to algebraically form the Schur complement operator. We use a diagonal approximation to the velocity mass inverse to form a preconditioning operator for the Schur complement (corresponding exactly with in Eq. 13), everywhere else the velocity mass matrix is inverted using an ILUpreconditioned blockJacobi iteration. The resulting elliptic problem in the pressure space is solved using the BoomerAMG algebraic multigrid preconditioner from the hypre library. To handle the strong vertical anisotropy it was essential to replace the standard pointsmoother by an incomplete LU factorisation (Euclid). Since the vertical couplings are the dominant terms, this is approximately the same as an exact inversion of the blockdiagonal operator which contains the vertical couplings only. Table 8 in appendix B details the full set of PETSc options used.
5 Results
In all cases the linear system in Eq. 8 is solved with a suitably preconditioned, restarted GMRES() iteration [35] to a relative tolerance of , which is typical in atmospheric modelling applications. Note that the system in (6) is not positive definite (after all, it has wavelike solutions), ruling out use of the more efficient Conjugate Gradient method. In the particular case of constant and considered here, the system could be made symmetric by multiplying the first equation in (6) by and scaling and by factors and respectively. Then a MINRES method [36] with reduced memory requirements could be used. However, this is not the case for general, spatially varying values of the speed of sound and buoyancy frequency since . To cover this more general case, we therefore use GMRES to obtain our results. Since the computationally expensive components in both Krylov methods are the operator application and preconditioner solve, the performance should be comparable and the only difference is the higher memory requirement of the GMRES method. For a discussion of the nonsymmetry of the elliptic problems in atmospheric models we also refer the reader to [37] where the authors argue that the GCR(k) algorithm (which is equivalent to GMRES) should be used.
To compare the two preconditioners (Schur complement+geometric multigrid described in section 4.2 vs. PETSc fieldsplit+BoomerAMG described in section 4.4) we first use the setup in Tab. 2. For comparison we also use a singlelevel method (two iterations of BlockJacobi vertical line relaxation) since this is the approach used in many atmospheric forecast models. For all numerical experiments we use five levels in the geometric multigrid preconditioner. On each level one pre and one postsmoothing step with an overrelaxation parameter of is employed. The well conditioned coarse grid problem (see discussion at the end of section 4.2.1) is solved approximately with two smoother iterations.
Instead of using a fixed number of coarse grid iteration, we could have reduced the coarse grid residual to a fixed tolerance. However, then the preconditioner is no longer guaranteed to be stationary and therefore a flexible Krylov method must be used. Empirically, two iterations are good enough.
We choose a typical Courant number of for those runs (see section 5.2 for the dependence on ). Note, however, that the nexttolowest order method resolves variations within one grid cell, so the effective Courant number is larger than . After solving the system Eq. 8 to the desired tolerance, buoyancy is reconstructed pointwise to obtain the solution of the full system Eq. 6. This requires an additional inversion of a well conditioned matrix.
# horizontal  # vertical  number of unknowns  

cells  layers  per cell  total  
Lowest Order (LO)  81,920  km  64  3.5  18.4 mio 
NexttoLowest Order (NLO)  5,120  km  64  24  7.9 mio 
All following results are obtained on the ARCHER supercomputer. Each node consists of two 12 core Intel Xeon Ivybridge (E52697 v2) processors, i.e. 24 cores in total. The spec sheet peak floating point performance of a full node is and the maximum achieved memory bandwidth for the STREAM triad benchmark [38] is . All code was compiled using version 4.9.2 of the Gnu C compiler and suitable flags are used to compile PETSc and generate optimised code. All runs were carried out on a full node (utilising all 24 cores) with MPI parallelism.
Using threaded parallelism could potentially lead to further performance enhancements by avoiding halo exchanges. Since the relative cost of halo exchanges is suppressed by the surfacetovolume ratio of the local domain and most of the work is spent on the finest grid, it is unlikely that threaded parallelism will lead to significant benefits for the grid sizes and loworder discretisations considered in this work. In [39] an advection diffusion equation is solved within the PyOP2 framework with different parallel backends. The authors find that an OpenMP implementation is slightly slower than the MPI equivalent. This superiority of MPI over OpenMP is also seen in other loworder studies, even when significant effort is expended in tuning the OpenMP implementation, for example [40]. While exploring shared memory and hybrid parallelisation strategies would be interesting to improve the performance of our solver in the strong scaling limit, it is beyond the scope of this paper.
To obtain the following timings we do not include any overheads from the justintime compilation of Ckernels in PyOP2. This is legitimate since in a full model the linear equation will be solved a large number of times and those overheads are completely amortised over the model runtime.
5.1 Algorithmic and computational performance
Tab. 3 shows a comparison of the solution times, and the convergence history is plotted in Fig. 2. At lowest order, the iterative solver converges in ten or fewer iterations if a multigrid preconditioner is used; the number of iterations is around 6 times larger for the single level method. As can be seen from the convergence history in Fig. 2, the asymptotic convergence rate of the single level method (filled green triangles) is significantly worse than for the multigrid methods (filled blue circles and red squares).
Preconditioner  ^{‡}^{‡}‡Due to technical reasons, the setup time for the AMG preconditioner includes the time for the first iteration.  

Lowest Order (LO)  geometric multigrid  5.36  1.91  0.265  10 
hypre BoomerAMG  5.98  3.19  0.351  7  
single level  13.91  1.46  0.201  59  
NexttoLowest Order (NLO)  geometric multigrid  11.07  3.85  0.307  21 
hypre BoomerAMG  12.17  4.55  0.353  21  
single level  22.71  3.52  0.230  81 
Although the time per iteration for the single level method is lower, the additional iterations result in a significantly longer time to solution. In total the single level method is more than twice as expensive as a multigrid method, which is similar to the gain observed in [6]. The fastest solution time is achieved with the geometric multigrid preconditioner, which is about faster than the AMG solver both at lowest order and at nexttolowest order. While this shows some gains from using a bespoke preconditioner, this is nowhere near the speedup observed for the simplified setup in [6]. Reasons for this are discussed in section 5.6. An initially surprising result is the large setup time of the geometric multigrid preconditioner which, although smaller than the AMG setup, is still significant. The setup time for the singlelevel method is not much smaller than for the multigrid algorithm, which implies that most of the time is spent on the finest multigrid level. As detailed in section 5.3 a significant proportion of this setup time is taken up by assembly of the operators for the mixed system and the Helmholtz operator on the finest multigrid level, which is required for all preconditioners.
At nexttolowest order, the asymptotic convergence rates of the multigrid methods are again comparable, and both methods require 21 iterations to converge. The asymptotic convergence rate of the single level method is significantly worse and it needs almost 4 times as many iterations as the geometric multigrid preconditioner. Again the total runtime is twice as large for the single level method.
5.2 Robustness
Of particular interest is the robustness of the solver under variations of the time step size and grid resolution. The number of iterations and total solution time as a function of the Courant number is shown in Fig. 3.
Note that the range we explore here is larger than what is typical in meteorological applications where . The number of iterations for the AMG preconditioner is virtually independent of the Courant number up to values of . The total solution time, however, increases sharply for the largest Courant number. The increase in the number of iterations for the geometric multigrid preconditioner can be partially explained by the fact that we use a fixed number of levels and the coarse grid problem becomes increasingly ill conditioned. To demonstrate this we include results in which the coarse grid problem of the geometric multigrid preconditioner is solved with a BoomerAMG Vcycle instead of two smoother iterations. In this configuration the increase in the number of GMRES iterations is very modest for . As a result the total solution time is reduced for large Courant numbers. However, as expected, for small Courant numbers solving the coarse grid problem with a small number of smoother iterations works very well and it is not necessary to use a more powerful coarse grid solver. While the singlelevel method might be competitive for unrealistically small Courant numbers, the number of iterations increases dramatically with .
In addition, it is important to check that the convergence rate of the multigrid solver is independent of the grid resolution. To verify this, we varied the horizontal resolution and recorded the number of iterations, which is shown in Fig. 4 (left). For this test we kept the Courant number constant at , i.e. as the resolution increases, the time step size decreases linearly with the horizontal grid spacing. We conclude that as expected the multigrid solvers show grid independent convergence.
5.3 Breakdown of time per iteration
One possible concern is that the Firedrakebased implementation is significantly slower than a monolithic implementation in a fully compiled language, which would skew the comparison in runtimes reported in Tab. 3. To address this concern, we first remind the reader that the computationally most expensive loops over the grid are implemented by generating optimised Ckernels which are executed over the grid in PyOP2. The performance of these finite element kernels in the Firedrake framework is studied in detail in [41]; the performance of the Firedrake framework as a whole is discussed in [10], where it is shown that the overhead of using the highlevel framework is negligible as long as there are more than unknowns per MPI process (all results in this paper are obtained with (LO) or (NLO) unknowns per core).
To further quantify this in our case, we show a breakdown of the time per iteration for the multigrid and singlelevel solvers in Fig. 4 (right) and report the times spent in the most important grid iterations in the last row of Tab. 4. As can be seen from the figure, in almost all cases (the exception being the lowest order single level method) more than half of the time is spent in the pressure solve, i.e. the multigrid or singlelevel preconditioner. Applying the mixed operator also takes up a significant amount of time, and the same operation is necessary for the PETSc fieldsplit+AMG implementation.
The time spent in the geometric multigrid preconditioner is further broken down by multigrid levels in Tab. 4. As expected from the “total” column, which gives the total time spent on a particular level, this is dominated by the finest level. Naively one would expect a reduction of cost by a factor of four from one level to the next coarser, since the horizontal grid is coarsened by a factor two in each direction. The only exception is the difference between level 4 and 3 in the nexttolowest order case, which corresponds to a refinement step from (6 unknowns per cell) to (1 unknown per cell). The theoretical reduction factor is even larger than 6 since the local stencil size is reduced significantly when going from nexttolowest order to lowest order on the same grid. However the rates that are observed in practice are smaller. This is due to a combination of worse communication to computation ratio on the coarser levels (they have a relatively larger halo) and the nonscalable fixed costs incurred by the Firedrake framework as detailed in [10].
On each level the most expensive components of the multigrid algorithm are the application of the operators and in the residual calculation and the smoother application. Since we split the operator into a horizontal and a vertical component, those operations can be expressed as
(operator application)  (18)  
(smoother application) 
Since before the presmoother application the solution is zero, the total cost of those operations on on the finer multigrid levels is . On the coarsest level, where we apply the smoother times, the cost is . In all numerical experiments we use and . The total cost of those operator applications is listed in the last column. The remaining three columns show the time for one individual application of , and on a particular multigrid level. This confirms that indeed a significant amount of time is spent in the PyOP2 parallel loops and we focus on analysing the performance of those parts of the code. In the following we construct a performance model for the operator application and confirm that the implementation is indeed efficient: the achieved memory bandwidth is close to the peak for the machine.
Lowest Order (LO)  NexttoLowest Order (NLO)  

level  total  all  total  all  
0  0.0076  0.0002  0.0008  0.0009  0.0027  0.0072  0.0001  0.0007  0.0008  0.0025 
1  0.0151  0.0003  0.0009  0.0010  0.0043  0.0140  0.0002  0.0007  0.0009  0.0035 
2  0.0174  0.0006  0.0012  0.0012  0.0060  0.0150  0.0002  0.0008  0.0010  0.0040 
3  0.0315  0.0022  0.0022  0.0020  0.0128  0.0213  0.0023  0.0012  0.0013  0.0096 
4  0.0683  0.0073  0.0053  0.0049  0.0350  0.1461  0.0137  0.0080  0.0413  0.1261 
total  0.1399  0.0105  0.0105  0.0100  0.0609  0.2035  0.0165  0.0115  0.0453  0.1457 
5.4 Performance model
Iterative solvers employing assembled sparse operators are known to be bandwidth, rather than compute bound [42]. Assuming that the vectors in a matrix vector product are cached perfectly, multiplication by each nonzero entry streams one scalar and one column index and performs one floating point multiply and one add. This results of an arithmetic intensity of flops per byte. Modern hardware delivers somewhere between 4 and 10 flops per byte (recall that an ARCHER node provides 518.4 Gflop/s and a bandwidth of 74.1 Gbytes/s, resulting in 7 flops/byte for a balanced code). With a good degree of freedom numbering, the assumption of good vector caching is close to true, and high performance sparse matrix implementations typically achieve an appreciable fraction of STREAM bandwidth. It is therefore clear that if we want to improve over this baseline, we can do little without changing the way we apply matrixvector products.
Rather than assembling the full operator, finite element solvers may be implemented in a matrixfree manner by providing the matrixvector product directly. This, along with the addition of a matrixfree preconditioner is all iterative methods require. Such approaches are typically used for highorder discretisations (polynomial degree 4 and upwards) [43], although recent work shows that significant speedup can already be achieved at degree two on hexahedral meshes [44] when the tensor product structure of the basis is exploited. In contrast to these studies, we are in a loworder regime (degree zero or one) and only have a partial tensor product structure for the basis. Additionally, since our discretisation only couples degrees of freedom through faces, the sparsity of the assembled operator is not too bad.
Finally, as discussed above, we apply the multigrid cycle to the Schur complement operator . Since it contains a velocity mass matrix inverse, its action is not purely expressible in terms of finite element integrals. Instead, we assemble (into a nonsparse banded matrix) a columnwise decoupled approximation and apply this action in the iterative solver. For algorithmic reasons we need to apply exactly, which we do by inverting the columnwise blocks. Our method is therefore not truly matrixfree, but instead exploits structure in the discretisation to provide a more efficient storage format for the assembled operator: we only need to stream matrix entries (rather than entries and column indices) for a bandwidth saving of 33% (50% if using 64 bit integers or single precision scalars). To model performance, we count flops and bytes for the application of the Schur complement. The blockdiagonal massmatrix is inverted once per solve, and we count this as well.
5.4.1 Schur complement matrixvector products
As quantified numerically in section 5.3, the computational bottleneck of the algorithm is application of the Schur complement operator to a vector and the banded (or tridiagonal) matrix solve in the vertical direction. We therefore concentrate our analysis on these components. The three key operations are the applications of the operators , and . While is stored as a standard CSR sparse matrix, is stored as a banded matrix in the format described in section 4.3.1. At lowest order this matrix has rows in each vertical column of the grid and is tridiagonal with bandwidth . The nexttolowest order matrix has rows and bandwidth . Since there are 6 degrees of freedom per cell, this bandwidth could be reduced to if the banded matrix storage format were adapted to account for the blockstructure of the matrix. However, in the current code this would prevent the use of LAPACK routines and should be considered as an additional optimisation which will be explored in the future.
To apply to a vector, the input vector of size has to be read from memory, an output vector of the same size has to be written back and the banded matrix has to be loaded. The total amount of data transferred is
(19) 
At lowest order we use a direct tridiagonal solver to apply
, and the total amount of moved data is exactly the
same as in Eq. 19,
. At nexttolowest order we
calculate by an LU backsubstitution using the LAPACK
routine dgbtrs
. This requires additional storage of size
for the pivot array and for additional
matrix rows in the LU factorisation. Hence the total data volume is
(20) 
Based on those estimates, we quantify the achieved memory bandwidth as follows: Since all data is contiguous in the vertical direction, it is reasonable to assume perfect caching, and we can estimate a “useful” memory bandwidth of
(21) 
where is the time it takes to apply the operator and is the number of vertical columns. Note that any lessthanperfect caching would manifest itself in a useful bandwidth which smaller than the achievable peak memory bandwidth.
For the horizontal operator let denote the total number of rows and the total number of nonzero entries. Following [42] we need to read real numbers for the input vector and write back real numbers to store the output vector. Since the matrix is stored in compressed row storage, we also have to read real numbers as well as integers from memory. The total amount of moved memory is in applying to a vector is therefore
(22) 
In analogy to Eq. 21 the bandwidth can be calculated by dividing this number by the measured time.
Useful bandwidths calculated in this ways are reported in Tab. 5. On the finer levels, which amount for most of the time, the achieved bandwidth corresponds to a significant fraction of the achievable peak as measured using STREAM triad.
Lowest Order (LO)  NexttoLowest Order (NLO)  
multigrid level  
0  7.33  2.52  1.55  2.40  0.70  0.47 
1  32.05  6.65  4.50  8.81  2.30  1.60 
2  48.36  14.50  4.00  31.22  7.23  5.29 
3  62.28  28.09  29.29  14.31  4.69  8.99 
4  73.53  39.30  40.78  68.11  45.92  12.36 
5.5 Breakdown of setup time
Unlike a truly matrixfree geometric multigrid solver, a significant amount of time is spent in the setup of the geometric multigrid solver. The most significant fraction of time (1.094s at lowest order and 1.477s at nexttolowest order) is spent in the assembly of the operator for the mixed system in Eq. 8; this assembly is also required for the PETSc fieldsplit+AMG preconditioner. While total time spent in banded matrix algebra (0.290ss at LO, 1.163s at NLO) is relatively small, the cost of assembling the horizontal derivative (0.254s at LO, 0.187s at NLO) and lumped mass matrices (0.085s at LO, 0.102s at NLO) and (0.204s at LO, 0.322s at NLO) are not negligible, as is the time for multiplication with (0.298s at LO, 0.551s at NLO), which requires irregular memory access on the horizontally unstructured grid. A more detailed breakdown of the setup time can be found in Tab. 9 in appendix C.
5.6 Comments on matrixfree implementations
The relatively modest speedup of the geometric multigrid solver compared to the AMG algorithm raises interesting questions. The main reason for the large speedups of the geometric multigrid solver (compared to an AMG implementation) reported previously in [6] is the fact that the algorithm could be implemented in a matrixfree way for the simple finitevolume discretisation used there (we also used a matrixfree implementation for the slightly more complicated model equation in [8]). This approach avoided storage of the matrix which was reassembled on the fly. In contrast, for an AMG algorithm the matrix elements have to be loaded from memory which is expensive on modern computer architectures. As a consequence, in [6] geometric multigrid leads to a increase in performance compared to AMG. This, however, is only true if the matrixfree implementation for the application of linear operators is indeed faster than applying the assembled matrix. As we will now argue, this is not true anymore for the loworder finite element discretisations considered in this work.
In general, if an operator is to be applied times, the cost of the matrixfree (MF) and matrixexplicit (MX) implementation is
(23) 
where and is the time for one operator application and is the time it takes to assemble the operator into a sparse matrix. For the code in [6] we had and it is always cheaper to use the matrixfree implementation.
In contrast to this, we find that the matrixfree approach does not improve performance for the low order finite element discretisations used in this work, since our implementation has .
For the mixed operator in Eq. 8 the corresponding timings are shown in Tab. 6. In both cases, one matrixfree operator application is about five times as expensive as an application of the corresponding assembled operator. The main reason for this is the much larger number of floating point operations. As can be seen from Tab. 7, one matrixfree operator application requires around more FLOPs than the corresponding matrixexplicit implementation. Although one ARCHER node can execute floating point operations for each byte read from memory and the matrixexplicit implementation requires more memory traffic since the matrix has to be read in addition to the field vectors, overall we expect the matrixfree code to be slower. More specifically, we can estimate the relative runtime of the (FLOPbound) matrixfree and the (bandwidth bound) matrixexplicit code purely from counting memory references in Tab. 7 as
(24) 
Since both implementations run at a sizeable fraction of the respective peak performance (see last two columns in Tab. 7), this estimate ( for Lowest Order and at NexttoLowest Order) is in the same ballpark as the measured ratio of times.
At lowest order the assembly cost is amortised after about 6 operator applications, at nexttolowest order it is amortised after 8 applications. The overhead from the matrix assembly is further mitigated by the fact that parts of the assembled mixed operator are reused at other places in the code. For example, the lumped H(div) mass matrix can be extracted very cheaply once the full mass matrix has been assembled.
Lowest Order (LO)  1.094  0.053  0.235 

NexttoLowest Order (NLO)  1.577  0.054  0.263 
order  #FLOPs  #bytes  arithmetic  FP performance  bandwidth  

moved  intensity  [GFLOPs/s]  [GByte/s]  
matrix free  LO  84.38  203.3  2.2  
NLO  236.12  176.5  0.7  
matrix explicit  LO  0.13  7.0  51.8  
NLO  0.16  12.9  76.8 
Consequently in any tests that we carried out the matrixexplicit implementation of the geometric multigrid solver was much faster than a matrixfree version. As a result the performance of both the AMG and multigrid solver is limited by the speed with which the matrix and field vector can be loaded from memory, and with which the matrix can be assembled in the setup phase (explicit matrix assembly of was also necessary for the LU decomposition that is required at nexttolowest order). Since the matrix size is comparable in both cases, it is not surprising that the time per iteration and absolute solution time is comparable for both methods. This picture is likely to change at higher discretisation orders where sum factorisation techniques can improve performance dramatically: as shown in [43, 45, 46, 44] on modern chip architectures with a large FLOPtobandwidth ratio, sumfactorised matrix free implementations can be faster than memory bound implementations which assemble the operator and apply it in a sparse matrix representation. At high order, storing the assembled matrix can also require significantly more memory and limit the simulated problem size.
5.7 Parallel scaling
The results of a weak scaling study of the multigrid solvers on ARCHER is shown in Fig. 5. As in the previous section we fix the Courant number at , decreasing the timestep linearly with higher grid resolution. The largest system solved at lowest order had unknowns on 1536 cores (64 full nodes); at nexttolowest order we solved problems with up to degrees of freedom on 6144 cores (256 full nodes).
All solvers show excellent weak scaling once the nodes are fully populated. Any increases in runtime on partially occupied nodes can probably be attributed to the increasing saturation of the memory bandwidth as more cores are used.
6 Conclusion
The implementation of bespoke preconditioners for complex finite element problems requires suitable abstractions which allow a separation of concerns between the algorithm developer and the computational scientist. We discussed the implementation of appropriate abstractions in the Firedrake finite element framework. We then used the extended system to build a bespoke preconditioner for the mixed finite element discretisation of a linear gravity wave system in a thin spherical shell – an important model system in atmospheric flow applications. Motivated by earlier work in [6], we constructed a bespoke tensorproduct geometric multigrid preconditioner, which is tailored to the strong grid aligned anisotropy in the vertical direction. This preconditioner results in a solver which is around faster than a purely algebraic approach using PETSc fieldsplit preconditioning and hypre’s algebraic multigrid to solve the resulting elliptic problem. For operationally relevant CFL numbers, the multigrid preconditioners are about twice as fast as the single level methods popular in current operational models. The different abstractions of the Firedrake/PyOP2 framework simplify the implementation significantly: the user can express the algorithm at the correct abstraction level, while still obtaining very good performance. This is demonstrated by our careful analysis of key components of the algorithm, which we show are running at a significant fraction of the theoretical peak memory bandwidth. For simpler finite volume or finite difference discretisations on structured grids we previously demonstrated previously that it is possible to achieve further speedups by using a matrixfree implementation [6, 8]. However, here we show that this is not possible for the loworder finite element discretisations. In fact a matrixbased implementation based on a BoomerAMG preconditioner achieves almost the same performance as our bespoke geometric multigrid solvers. The balance between matrixfree and assembled operators may change with changes in future hardware, and if we are able to exploit some of the structure in the tensorproduct basis (reducing the flop counts of the element kernels); we intend to study this in the future.
6.1 Future work
There are several ways of extending the current work. In realistic atmospheric models orography will lead to a distortion of the grid and the decomposition of the velocity function space into a purely horizontal and purely vertical component is no longer valid, which can have an impact on the performance of the solver. Nevertheless the tensorproduct multigrid algorithm, which assumes a perfect factorisation, can still be used as a good preconditioner, as has been confirmed by preliminary experiments (not reported here). Performance gains are also expected from a decomposition of the operators into tensor products. If, for example, an operator can be written as a tensor product , assembling (and applying) the operators and separately requires operations (and storage) instead of , and, since this will lead to significant speedups. This approach has been explored for a simplified problem in [8]. In the finite element setting used in this work it would require a shallowatmosphere approximation in the preconditioner.
Instead of using a Schurcomplement factorisation and applying a multigrid preconditioner to the positivedefinite system in (10), one could also the multigrid algorithm for the full system in (8) or (6). This requires the construction of suitable smoothers, for example based on “distributive iterations” [47].
Ultimately the linear solver will be used inside a Newton iteration to solve a nonlinear problem in the full atmospheric model, and more careful numerical studies will have to be carried out in this context. While a nonlinear multigrid (FAS) scheme [23, §8] could also be explored, the nonlinearities in the problem considered here might not be large enough to justify this approach which requires computationally expensive nonlinear smoothers. Guided by the work reported here, we are currently working on implementing both the PETSc fieldsplit preconditioner and the geometric multigrid preconditioner in the Fortran 2003 code base that is used to develop the Met Office’s next generation dynamical core (codenamed “GungHo”).
Acknowledgements
This work was funded as part of the NERC project on Next Generation Weather and Climate Prediction (NGWCP), grant numbers NE/K006789/1 and NE/K006754/1. LM additionally acknowledges funding from EPSRC grant EP/M011054/1. We gratefully acknowledge input from discussions with our collaborators in the Met Office Dynamics Research group and the GungHo! project.
This work used the ARCHER UK National Supercomputing Service (http://www.archer.ac.uk).
All numerical experiments in this paper were performed with the following versions of software, which we have archived on Zenodo: Firedrake [48]; PyOP2 [49]; FIAT [50]; FFC [51]; COFFEE [52]; PETSc [53]; petsc4py [54]; UFL [55]. Additionally, the simulation code itself is archived as [56] and the performance data and plotting scripts as [57].
Appendix A Derivation of model equations
Here we derive the equations Eq. 1 from first principles. Large scale atmospheric flow is governed by the Navier Stokes equations and the equation of state:
(25) 
Expand around stationary profiles , and
(26) 
At lowest order in the stationary solution satisfies hydrostatic balance and the equation of state
(27) 
The equation of state also implies
(28) 
At the momentum equation and the conservation of potential temperature are
(29) 
If we introduce the buoyancy and buoyancy (BruntVaisälä) frequency with
(30) 
these can be written (in the momentum equation we used Eq. 27 to replace by ) as
(31) 
The equation for the density is at :
(32) 
Divide by and use Eq. 28 to replace by and
(33) 
and then Eq. 29 to eliminate the time derivative of
(34) 
We now estimate the relative size of the two terms and in this equation. Define the scale height for a profile as
(35) 
We now assume that the height of the domain is much smaller than all scale heights, , i.e. the profiles vary only relatively slowly with height. Then we have
(36) 
With this approximation the equation for the Exner pressure becomes
(37) 
Define
(38) 
and note that the speed of sound is given by
(39) 
When written in terms of the variable the pressure equation and momentum equations become
(40) 
Here we have again used the condition on the scale heights to approximate . Together Eqs. 31 and 40 form the set of equations given in Eq. 1
Appendix B PETSc fieldsplit preconditioner options
The options used for the fieldsplit preconditioner are shown in Tab. 8.
option  value 

pc_type  fieldsplit 
pc_fieldsplit_type  schur 
pc_fieldsplit_schur_fact_type  FULL 
pc_fieldsplit_schur_precondition  selfp 
fieldsplit_0_ksp_type  preonly 
fieldsplit_0_pc_type  bjacobi 
fieldsplit_0_sub_pc_type  ilu 
fieldsplit_1_ksp_type  preonly 
fieldsplit_1_pc_type  hypre 
fieldsplit_1_pc_hypre_type  boomeramg 
fieldsplit_1_pc_hypre_boomeramg_max_iter  1 
fieldsplit_1_pc_hypre_boomeramg_agg_nl  0 
fieldsplit_1_pc_hypre_boomeramg_coarsen_type  Falgout 
fieldsplit_1_pc_hypre_boomeramg_smooth_type  Euclid 
fieldsplit_1_pc_hypre_boomeramg_eu_bj  1 
fieldsplit_1_pc_hypre_boomeramg_interptype  classical 
fieldsplit_1_pc_hypre_boomeramg_P_max  0 
fieldsplit_1_pc_hypre_boomeramg_agg_nl  0 
fieldsplit_1_pc_hypre_boomeramg_strong_threshold  0.25 
fieldsplit_1_pc_hypre_boomeramg_max_levels  25 
fieldsplit_1_pc_hypre_boomeramg_no_CF  False 
Appendix C Breakdown of setup time
A detailed breakdown of the setup time is shown in Tab. 9.
order  

LO  NLO  
assemble  0.254  0.187 
assemble  0.085  0.102 
assemble  0.204  0.322 
assemble operator for mixed system  1.094  1.577 
bandedmatrix: LU factorise ( 