High level implementation of geometric multigrid solvers for finite element problems: applications in atmospheric modelling
The implementation of efficient multigrid preconditioners for elliptic partial differential equations (PDEs) is a challenge due to the complexity of the resulting algorithms and corresponding computer code. For sophisticated (mixed) finite element discretisations on unstructured grids an efficient implementation can be very time consuming and requires the programmer to have in-depth knowledge of the mathematical theory, parallel computing and optimisation techniques on manycore CPUs. In this paper we show how the development of bespoke multigrid preconditioners can be simplified significantly by using a framework which allows the expression of the each component of the algorithm at the correct abstraction level. Our approach (1) allows the expression of the finite element problem in a language which is close to the mathematical formulation of the problem, (2) guarantees the automatic generation and efficient execution of parallel optimised low-level computer code and (3) is flexible enough to support different abstraction levels and give the programmer control over details of the preconditioner. We use the composable abstractions of the Firedrake/PyOP2 package to demonstrate the efficiency of this approach for the solution of strongly anisotropic PDEs in atmospheric modelling. The weak formulation of the PDE is expressed in Unified Form Language (UFL) and the lower PyOP2 abstraction layer allows the manual design of computational kernels for a bespoke geometric multigrid preconditioner. We compare the performance of this preconditioner to a single-level method and hypre’s BoomerAMG algorithm. The Firedrake/PyOP2 code is inherently parallel and we present a detailed performance analysis for a single node (24 cores) on the ARCHER supercomputer. Our implementation utilises a significant fraction of the available memory bandwidth and shows very good weak scaling on up to 6,144 compute cores.
geometric multigrid, atmospheric modelling, preconditioner, (mixed) finite elements, domain-specific compilers
Efficient solvers for partial differential equations (PDEs) are required in all areas of science and engineering. The design and implementation of these solvers requires a wide set of skills, including, but not limited to: knowledge of the system being simulated; creation and implementation of appropriate numerical schemes; analysis of the resulting linear and nonlinear operators; and efficient low-level implementation of the chosen schemes. It is therefore rare that a single individual possesses the full range of skills to successfully deliver both an algorithmically and computationally efficient solver on their own.
To address this multi-disciplinary problem requires software frameworks that enable scientists with complementary skills and specialisations to collaborate without each one of them requiring full knowledge of the system. This can be achieved by carefully designing composable interfaces that capture the natural abstraction at each level of the simulation stack.
An important application is the simulation of fluid systems which are described by the Navier Stokes equations. In contrast to other mixed finite element methods such as the Taylor-Hood element, discretisations based on mimetic finite element methods (see e.g. [1, 2, 3]) exactly preserve certain physical properties of the system under study. In particular, the divergence theorems and hold exactly for the discretised operators; this is not the case for some other mixed methods such as the Taylor-Hood element. Mimetic methods reproduce the favourable properties of the C-grid staggering, but allow the use of higher order discretisations and non-orthogonal grids. This is particularly important in global atmospheric modelling applications, where the “pole problem” of standard longitude-latitude grids introduces artificial time step limitations near the pole and restricts the parallel scalability of the code . While mimetic finite element schemes are very popular in applications such as numerical climate- and weather- prediction, their efficient implementation poses significant challenges. To deliver forecasts under tight operational timescales, solvers have to be algorithmically robust, make optimal use of modern manycore chip architectures and scale to large node counts on distributed memory clusters.
Elliptic PDEs arise in implicit time stepping methods of the atmospheric equations of motion and their solution often forms the bottleneck of the full model code. Since the Earth’s atmosphere is represented by a thin spherical shell, the resulting discrete system is highly anisotropic. This requires the use of bespoke preconditioners, in particular the tensor-product multigrid approach in  has turned out to algorithmically efficient [6, 7, 8]. The implementation of multigrid algorithms for low-order finite difference discretisations on structured grids is straightforward but this is not the case for sophisticated finite element discretisations on more complex geometries for several reasons: Since the HDiv (velocity) mass matrix is not diagonal, the iterative solver algorithm is significantly more involved than the corresponding finite difference version. In the finite element case the innermost compute kernels can not be expressed as simple stencil applications and - to achieve optimal performance on a particular architecture - the loop nest has to be optimised, bearing in mind hardware specific properties such as the cache layout. In a distributed memory setting halo exchanges and overlapping of computation and communication require a careful partitioning of the grid. The Firedrake/PyOP2 [9, 10] framework allows the automatic assembly of finite element operators from their weak formulation and the expression of the algorithm at a high abstraction level. Architecture dependent optimised low-level C-code is automatically generated and executed with just-in-time compilation techniques.
In this paper, we discuss the extension of the Firedrake framework to support the development of geometric multigrid solvers for finite element problems. We illustrate its use, and investigate the performance of the resulting method, in the development of a geometric multigrid solver for a mimetic finite element discretisation of the atmospheric equations of motion. This is a challenging problem and an excellent test case for the developed abstractions since both the implementation of the underlying finite element scheme and the design of an efficient multigrid method are non-trivial. We show how our approach simplifies the implementation of the solver, by cleanly separating the different aspects of the model. In particular we demonstrate how the chosen abstractions allow the easy implementation of a column-local matrix representation which is crucial for the block-Jacobi smoother. This particular smoother is algorithmically optimal for strongly anisotropic problems and a key ingredient of the tensor-product multigrid algorithm in . A careful performance analysis confirms that our implementation is efficient in the sense that it uses a significant fraction of the system’s peak performance for bandwidth-bound applications.
An alternative and very successful approach to the implementation of finite element solvers is the use of templated C++ code [11, 12, 13]. This allows the user full control over all components of the algorithm and multigrid solvers for finite element discretisations have been implemented for example in DUNE  and deal.II . However, the key computational kernels (such as the local application of the operator in weak form) have to be written by hand and any optimisation is limited by the capabilities of the available compiler. In contrast, frameworks like FEniCS  and Firedrake use domain-specific compilers to carry out optimisations that are infeasible for general purpose compilers to perform on a low-level representation of the same algorithm . Compared to FEniCS, where expressing non-finite element operations requires the programmer to explicitly manage all parallelism and mesh iteration, one of the advantages of our approach is the straightforward implementation of any local operations as computational kernels in PyOP2 ; this is crucial for the preconditioners considered in this work.
The paper is structured as follows. In section 2, we give a brief overview of the Firedrake software framework. Section 3 lays out the mathematical abstractions of the multigrid method, and how we organise the software abstractions around them. The multigrid method for our model problem, an atmospheric gravity wave, is discussed in section 4. We characterise the performance of the resulting scheme in section 5 and conclude in section 6. Some more technical aspects are discussed in the appendices where we derive the equations for linear gravity wave propagation, give the parameters of the PETSc fieldsplit solver and provide a detailed breakdown of the solver setup times.
We demonstrate the performance and parallel scalability of the solver on the ARCHER supercomputer and compare our geometric multigrid implementation to a matrix-explicit implementation based on the PETSc library [18, 19]. In the latter case we use the BoomerAMG  preconditioner from the hypre suite  to solve the DG-pressure system. We show that the performance of the solver for the low order discretisations treated in the paper is memory bound, and quantify the absolute performance of the most computationally intensive kernels in our solver algorithm by a detailed analysis of memory traffic. We find that the computationally most expensive kernels utilise a significant fraction of the peak memory bandwidth.
2 Firedrake/PyOP2: abstractions for finite element methods
Firedrake  is a Python system for the solution of partial differential equations by the finite element method. It builds on the abstractions introduced in the FEniCS project [16, 22] to present a high-level, automated, problem solving environment. Firedrake enforces a strong separation of concerns between employing the finite element method, the implementation of the local discretisation of the mathematical operators, and their parallel execution over a mesh. The execution of kernels over the mesh is carried out using an iteration abstraction layer, PyOP2 . This layer is explicitly exposed to the model developer, and allows them to write and execute custom kernels over the mesh. The critical observation is that most operations that fall only slightly outside the finite element abstraction may still be formulated as the execution of a local operation over some set of mesh entities, these can be expressed as a PyOP2 parallel loop. This separate abstraction layer allows the user to worry about the local operations: parallelisation is carried out automatically by PyOP2 exactly as it is for finite element kernels in Firedrake itself. More details of the interaction between Firedrake and PyOP2 can be found in [10, Section 4].
3 Multigrid in Firedrake
Multigrid methods  are an algorithmically optimal approach to solving many PDEs, especially those involving elliptic operators. They rely on a hierarchy of scales to cheaply and efficiently compute properties of the system at the appropriate scale: modes that vary slowly in space can be accurately represented using only a few degrees of freedom and are thus best solved for on coarse grids, whereas fast modes need high spatial accuracy. Here we briefly provide an overview of the mathematical operations in terms of a two-grid setup, and then describe our implementation in Firedrake.
Let be the approximation space on the “coarse” grid and the space on the fine grid, these need not necessarily be nested, although the implementation is simplified if they are. A complete multigrid cycle is built from a few basic operators. In algorithm 1 we show the form of correction scheme multigrid (so called because on the coarse grid we solve for the correction to the fine grid equation) on two levels, given a linear system .
The extension to multiple levels recursively applies this two-level cycle to compute . This setup suggests that we need to provide facilities for computing restrictions and prolongations. We also need the ability to compute coarse grid operators, and we need to be able to apply smoothers (effectively some form of linear solver). In listing 1, we show how this abstract framework translates into our implementation in Firedrake for a simple two-dimensional example. We are able to exploit the existing facilities for the vast majority of the implementation, we merely need a few extensions to deal with hierarchies of meshes and transferring between them, which we discuss below.
3.1 Grid hierarchies
It is clear that the first object we will need is a hierarchy of grids (or meshes, in Firedrake’s parlance). This will encapsulate the relationship between the refined grids (providing information on the fine grid cells corresponding to coarse grid cells). These are provided by the Firedrake MeshHierarchy object. Similarly, we shall need to represent discrete solution spaces. Firedrake uses FunctionSpace objects for this, and we extend these with a FunctionSpaceHierarchy. As with the MeshHierarchy this encapsulates the relationship between function spaces on related grids (allowing us to determine the degrees of freedom in the fine grid that are related to a coarse grid cell). We note that in this work, we only treat hierarchically refined grids. This does not affect the abstractions we discuss, although it does simplify some of the implementation.
3.2 Restriction and prolongation
Nested finite element spaces admit particularly simple implementation of restriction and prolongation. To compute the restriction operator we use FIAT  to evaluate the coarse basis at the node points on the fine grid. This allows us to express the coarse cell basis functions in terms of linear combinations of fine cell basis functions. Since the mesh is regularly refined, we need only do this once and can use the same weighting for all cells. The restriction operator can then be expressed simply by applying this combination kernel to a given residual using a PyOP2 parallel loop (exposed as restrict in the Firedrake interface). In the same way we compute the interpolation from into , which is just the natural embedding, using FIAT, and apply the kernel over the mesh with PyOP2.
3.3 Coarse grid operators
Forming the coarse grid operators is straightforward using the existing facilities of Firedrake and UFL . Rediscretised operators are readily available simply by taking the UFL expression for the fine grid operator and assembling it on the coarse grid (achieved using the coarsen_form operation in listing 1). These rediscretised operators have minimal stencil. In addition, it is also straightforward to provide simpler operators (perhaps throwing away couplings that do not contribute on coarse grids) by explicitly defining the operator symbolically on the appropriate coarse level and using it in the smoother.
Firedrake uses PETSc to provide solvers for linear systems. As such, for assembled matrices, we can use as a smoother any linear solver that PETSc makes available. In listing 1, for example, we use two Jacobi-preconditioned Richardson iterations as a pre-smoother, solve the coarse problem exactly with LU and then use three preconditioned Richardson iterations as a post-smoother. Naturally, we are free to implement our own smoothers instead, perhaps we do not have an assembled matrix and therefore cannot use “black-box” smoothers. Indeed, the key ingredient to achieve optimal performance is the use of the correct smoother. This is of particular importance in the tensor-product multigrid scheme we discuss in the rest of the paper, since an assembled operator is not available and there is structure in the problem (the strong vertical anisotropy) we wish to exploit in the smoothers.
4 Tensor-product multigrid for (mimetic-) mixed finite elements
To illustrate the power of these abstractions we consider an important model system for meteorological applications: the equations for linear gravity wave propagation in the global atmosphere. The corresponding system of PDEs is discretised with a mixed finite element discretisation. The problem is solved in a thin spherical shell which represents the Earth’s atmosphere. The thickness of the atmosphere is several orders smaller than the radius of the earth. This flatness of the domain is typical for applications in atmospheric modelling and introduces a strong grid-aligned anisotropy. As we shall see, for a mixed finite element discretisation this problem is significantly more complex than the simple example shown in listing 1. We will therefore use it to both characterise the performance of our multigrid implementation, and validate that we have exposed the correct abstractions. For the linear solver to converge rapidly, the anisotropy has to be treated correctly in the preconditioner and the main challenge is the implementation of an optimal smoother. The PyOP2 abstraction level allows the expression of this smoother in terms of tailored data structures and low-level kernels which implement the line-relaxation method that is key to exploit the vertical anisotropy.
Since the PDE we are solving is elliptic, one option is to employ an algebraic multigrid (AMG) preconditioner. On highly anisotropic domains, one must take special care in constructing the coarse grid operators and smoothers to achieve mesh independence. See for example , where the authors use smoothed-aggregation AMG to precondition the velocity block when solving the nonlinear Stokes equations in the context of ice-sheet dynamics. To account for the strong anisotropy, the aggregation strategy is adapted to maintain the column structure of the degrees of freedom, and a smoother based on incomplete factorisations is used. There are a number of reasons why an AMG approach might not always be desirable. The AMG preconditioner can only use Galerkin coarse grid operators, and so the coarse grid stencil will be large. More importantly, unless special care is taken, the coarsening operation will not necessarily obey the anisotropy inherent in the problem, perhaps leading to suboptimal algorithmic performance. Finally, we found in  that a bespoke preconditioner, based on the tensor-product multigrid of , is superior to AMG based methods if it is applied to a simplified model equation discretised with the finite volume method; indeed the geometric multigrid approach turned out to be about faster than black-box AMG implementations from the DUNE  and hypre  libraries.
Moreover the tensor-product preconditioner can be shown to be optimal for grid-aligned anisotropies. In a recent paper  we have also demonstrated numerically that the method works well for atmospheric applications under slightly more general circumstances. These encouraging results motivate us to study the geometric multigrid implementation reported in this work.
4.1 Mathematical formulation and mixed finite element discretisation
The following linear gravity wave problem for pressure , velocity and buoyancy can be obtained by linearising the full Navier-Stokes equations for large scale atmospheric flow(see appendix A for a derivation):
For simplicity we assume that both the speed of sound and the buoyancy frequency are constant and enforce the (strong) boundary condition
at the upper and lower boundary of the atmosphere. The domain can be expressed as a tensor-product where is the two-dimensional surface of a sphere with radius . As described above, the domain is very flat, i.e. . To construct function spaces for mimetic finite element discretisations, consider the following de Rham complexes in one-, two- and three dimensions:
We seek a solution to Eq. 1 with
where is the subspace of whose normal component vanishes on the boundary of the domain. and are respectively the “horizontal” and “vertical” parts of . The remaining spaces are and . This choice of spaces for and is analogous to the Charney-Phillips staggering. We refer the reader to  for the implementation and further description of tensor-product spaces in Firedrake.
It is worth highlighting here that the decomposition of the velocity space into a horizontal () and vertical () component is very important for the construction of the tensor-product multigrid preconditioner described below.
We discretise in time using an implicit scheme which is a special case of the Crank-Nicholson method , resulting in the following weak system for the increments , and
where is the normal vector in the vertical direction and , and are the known fields are the previous time step. In this work we consider two choices for the three-dimensional complex in Eq. 3 introduced in  and summarised in Tab. 1. In the following they are referred to as “Lowest Order” (LO) and “Next-to-Lowest Order” (NLO).
|Lowest Order (LO)|
|Next-to-Lowest Order (NLO)|
Discretising, we obtain a system of linear equations for the dof-vectors , and :
Note that since , the velocity mass matrix can be written as the tensorial sum of a horizontal and a vertical mass matrix
Similarly the weak derivative can be expressed as .
In the absence of orography, buoyancy can be eliminated point wise from Eq. 6 to obtain a block- system of equations for velocity and pressure only***Note that the continuous version of the last equation in (6), is true also in the presence of orography, and an alternative approach (leading to a different discrete system) would be to eliminate orography from the continuous equations before discretising. However, a pointwise elimination of buoyancy from the discretised equations is only possibly in the absence of orography when (6) holds strongly and not just weakly.,
It is important to note that, in contrast to lowest order finite volume discretisations on staggered grids, the velocity mass matrix is not diagonal†††It is, however, well conditioned, with a condition number of that is independent of the grid resolution.. This prevents the standard treatment taken in many atmospheric modelling codes based on finite volume or finite difference discretisations (for example the semi-implicit dynamical cores of the Unified Model [31, 32]), which eliminate velocity point wise to obtain a system for the pressure, which is solved iteratively before reconstructing the velocity. In our case we have to use an iterative solver for the full linear system in and .
The condition number of the linear system grows with the (vertical) acoustic Courant number , and hence preconditioning is essential to achieve rapid convergence. Due to the flatness of the grid cells () the dominant terms in are due to fast vertical sound waves. However, wave propagation can be treated approximately independently in each column of the grid (in a given time interval, a wave which propagates over grid cells in the vertical direction will only cover a distance of cells in the horizontal). In the following we describe the construction of a preconditioner which uses this idea to reduce the condition number such that it only depends on the much smaller horizontal acoustic Courant number which is in atmospheric models (larger values of are excluded due to the explicit treatment of other processes such as advection). Under these conditions a tensor-product multigrid V-cycle with vertical line relaxation and horizontal grid coarsening will then lead to rapid convergence of the linear solver iteration.
4.2 Schur-complement preconditioner
Formally the Schur-complement factorisation of the inverse of the block- matrix in Eq. 8 is given by
with the elliptic and positive-definite “Helmholtz” operator
This operator contains the dense inverse of the velocity mass matrix and is therefore also dense. We obtain a preconditioner with a number of approximation steps. We replace the full inverses in Eq. 9 by a diagonal approximation :
Furthermore, since and , the second order term in Eq. 10 can be written as
Again, we replace full mass matrix inverses by their diagonal approximations, , . Since the function space is horizontally discontinuous, the matrix has a block-diagonal structure. It would therefore be possible to use more sophisticated approximations for in each vertical column. For example we tried a sparse approximate inverse , but found that this leads to increased setup costs and worse overall performance.
With the diagonally lumped mass matrices we obtain the approximate Helmholtz operator
which can be inverted iteratively to approximate in Eq. 9. In contrast to , the operator has a sparse structure but still contains the dominant couplings in the vertical direction. Since the size of the derivatives and is proportional to the inverse grid spacings, the two terms in the brackets are of order and respectively (the pressure mass term is of order ). On the highly anisotropic grids considered here we have , and hence the second term in the bracket is the dominant contribution. If we write for the block-diagonal part of , the following block-diagonal operator differs from by terms of :
The operator is decoupled in the horizontal direction, and can therefore be inverted independently in each column. With a suitable degree of freedom ordering this can be carried out using a simple banded matrix solve. To invert the operator , one can now use a suitably preconditioned iterative method. One possibility would be a Krylov iteration with block-Jacobi preconditioner, namely
This is similar to the approach taken in existing operational codes such as the Met Office Unified Model [31, 32]. However, as we demonstrated in , a much more efficient preconditioner is a tensor-product multigrid algorithm in which the grid is only coarsened in the horizontal direction and Eq. 15 is used as a smoother (we refer to this block-diagonal smoother as “vertical line relaxation” in the following). This is the method we use in this paper, and the numerical results in section 5 confirm the superiority of the method compared to a single-level preconditioner: the multigrid algorithm is about twice as fast. This was also observed in  and in numerical experiments we carried out on the ENDGame dynamical core. Since the inverse of is only required in the preconditioner and does not have to be computed to very high accuracy, we found it is most efficient to simply apply one multigrid V-cycle with .
4.2.1 Tensor-product multigrid algorithm
The tensor-product multigrid algorithm which we use to solve the system was first described and analysed in  and proven robust for grid-aligned anisotropies. In  the analysis is extended to dimensional grids and it is demonstrated numerically that the algorithm is still efficient for approximately grid aligned anisotropies in meteorological applications.
This can be interpreted in the context of the discussion at the end of session 4.2: the performance of the tensor-product multigrid algorithm is given by the horizontal acoustic Courant number ; the much larger vertical Courant number does not appear at leading order since sound propagation in each column is treated exactly by the block-diagonal line relaxation smoother.
The tensor-product multigrid cycle fits naturally into the language of section 3, with the following selections for the operators. Rather than full coarsening in all three dimensions, we semi-coarsen in the horizontal direction, leaving columns intact. For the next-to-lowest-order function spaces, the first coarsening step is a -refinement from into , with the restriction and prolongation defined using the natural embedding. We use the vertical “line relaxation” operator of Eq. 15 for our multigrid smoother. Finally we note that the natural correlation length in units of the horizontal grid spacing on the finest multigrid level is given by . Since other components of the model (such as explicit advection) limit the time step size to , this correlation length is . It is therefore sufficient to coarsen to the grid until the grid spacing is larger then this length scale; the coarse grid problem becomes well conditioned and can be solved with a small number of smoother iterations.
The function spaces and are discontinuous in the horizontal direction. It is natural to order the degrees of freedom such that all unknowns in a vertical column of the three dimensional grid are stored consecutively in memory (see Fig. 1). This storage format, used by Firedrake, has good performance characteristics since the contiguous column data can be directly addressed.
As a consequence, in Eq. 13 and Eq. 14 the matrices , , , and are columnwise block-diagonal. Each block can be associated with a cell of the two-dimensional grid that covers and corresponds to a banded matrix which couples the unknowns in a particular vertical column. To apply the vertical line relaxation smoother of Eq. 15 we explicitly assemble so that it is amenable to inversion using banded matrix solves. Although it is not possible to express the operations we require using UFL – they are not operations on variational forms – it is still possible to write them as PyOP2 kernels that are executed over the grid. A key part of our implementation are therefore linear algebra operations on columnwise banded matrices. We describe this algebra and its implementation in the following section.
4.3.1 Banded matrix algebra
To represent the block-diagonal matrices we implemented a bespoke banded matrix class which provides the necessary operations. In each vertical column, given a UFL expression for the operator, we can assemble the local matrix block. More generally, the class implements the block-diagonal matrix representation of a linear operator that maps between two horizontally discontinuous spaces
In each vertical column , the local matrix block is a generalised banded matrix:
where , and (with ) depend on the function spaces and . When (which occurs when and have the same degree-of-freedom layout) this reduces to the standard banded storage format. In each column, we then store the operator using a generalisation of the LAPACK banded matrix format, see Fig. 1.
The Python class which implements provides the following functionality:
Assemble from a given UFL form to obtain
Apply to a field-vector:
For two assembled linear operators and , calculate and .
Apply the inverse of to a field-vector, i.e. solve the linear banded system for .
To solve the equation in each column we use two
different approaches: at next-to-lowest order where the matrix bandwidth is
27, we use the LAPACK routine
dgbtrf to precompute and store a
LU decomposition of the matrix and then employ the
dgbtrs to solve the linear system based on
this LU decomposition. However, at lowest order, where the matrix is
tridiagonal, we found that a handwritten direct solve (Thomas
algorithm, see e.g. [34, §2.4]) is much more
All operations can be carried out independently in each vertical column. Based on this observation, the hierarchical architecture of PyOP2/Firedrake is crucial for the implementation: we store the local matrix blocks as PyOP2 dats on the horizontal grid cells. The operations above can then be implemented as short PyOP2 kernels, which are executed over the horizontal grid. An example is given in listing 2, which shows the matrix-vector product . Note that the user only has to express the local kernel as a short piece of C code, the PyOP2 system is responsible for executing this kernel over the grid.
4.4 PETSc/BoomerAMG based fieldsplit preconditioner
To compare the performance of our bespoke implementation to an existing solver package, we also solved the discretised equations purely algebraically with PETSc [19, 18] and preconditioned with hypre . PETSc’s fieldsplit preconditioner is used to algebraically form the Schur complement operator. We use a diagonal approximation to the velocity mass inverse to form a preconditioning operator for the Schur complement (corresponding exactly with in Eq. 13), everywhere else the velocity mass matrix is inverted using an ILU-preconditioned block-Jacobi iteration. The resulting elliptic problem in the pressure space is solved using the BoomerAMG algebraic multigrid preconditioner from the hypre library. To handle the strong vertical anisotropy it was essential to replace the standard point-smoother by an incomplete LU factorisation (Euclid). Since the vertical couplings are the dominant terms, this is approximately the same as an exact inversion of the block-diagonal operator which contains the vertical couplings only. Table 8 in appendix B details the full set of PETSc options used.
In all cases the linear system in Eq. 8 is solved with a suitably preconditioned, restarted GMRES() iteration  to a relative tolerance of , which is typical in atmospheric modelling applications. Note that the system in (6) is not positive definite (after all, it has wave-like solutions), ruling out use of the more efficient Conjugate Gradient method. In the particular case of constant and considered here, the system could be made symmetric by multiplying the first equation in (6) by and scaling and by factors and respectively. Then a MINRES method  with reduced memory requirements could be used. However, this is not the case for general, spatially varying values of the speed of sound and buoyancy frequency since . To cover this more general case, we therefore use GMRES to obtain our results. Since the computationally expensive components in both Krylov methods are the operator application and preconditioner solve, the performance should be comparable and the only difference is the higher memory requirement of the GMRES method. For a discussion of the non-symmetry of the elliptic problems in atmospheric models we also refer the reader to  where the authors argue that the GCR(k) algorithm (which is equivalent to GMRES) should be used.
To compare the two preconditioners (Schur complement+geometric multigrid described in section 4.2 vs. PETSc fieldsplit+BoomerAMG described in section 4.4) we first use the setup in Tab. 2. For comparison we also use a single-level method (two iterations of Block-Jacobi vertical line relaxation) since this is the approach used in many atmospheric forecast models. For all numerical experiments we use five levels in the geometric multigrid preconditioner. On each level one pre- and one post-smoothing step with an overrelaxation parameter of is employed. The well conditioned coarse grid problem (see discussion at the end of section 4.2.1) is solved approximately with two smoother iterations.
Instead of using a fixed number of coarse grid iteration, we could have reduced the coarse grid residual to a fixed tolerance. However, then the preconditioner is no longer guaranteed to be stationary and therefore a flexible Krylov method must be used. Empirically, two iterations are good enough.
We choose a typical Courant number of for those runs (see section 5.2 for the dependence on ). Note, however, that the next-to-lowest order method resolves variations within one grid cell, so the effective Courant number is larger than . After solving the system Eq. 8 to the desired tolerance, buoyancy is reconstructed pointwise to obtain the solution of the full system Eq. 6. This requires an additional inversion of a well conditioned matrix.
|# horizontal||# vertical||number of unknowns|
|Lowest Order (LO)||81,920||km||64||3.5||18.4 mio|
|Next-to-Lowest Order (NLO)||5,120||km||64||24||7.9 mio|
All following results are obtained on the ARCHER supercomputer. Each node consists of two 12 core Intel Xeon Ivybridge (E5-2697 v2) processors, i.e. 24 cores in total. The spec sheet peak floating point performance of a full node is and the maximum achieved memory bandwidth for the STREAM triad benchmark  is . All code was compiled using version 4.9.2 of the Gnu C compiler and suitable flags are used to compile PETSc and generate optimised code. All runs were carried out on a full node (utilising all 24 cores) with MPI parallelism.
Using threaded parallelism could potentially lead to further performance enhancements by avoiding halo exchanges. Since the relative cost of halo exchanges is suppressed by the surface-to-volume ratio of the local domain and most of the work is spent on the finest grid, it is unlikely that threaded parallelism will lead to significant benefits for the grid sizes and low-order discretisations considered in this work. In  an advection diffusion equation is solved within the PyOP2 framework with different parallel backends. The authors find that an OpenMP implementation is slightly slower than the MPI equivalent. This superiority of MPI over OpenMP is also seen in other low-order studies, even when significant effort is expended in tuning the OpenMP implementation, for example . While exploring shared memory and hybrid parallelisation strategies would be interesting to improve the performance of our solver in the strong scaling limit, it is beyond the scope of this paper.
To obtain the following timings we do not include any overheads from the just-in-time compilation of C-kernels in PyOP2. This is legitimate since in a full model the linear equation will be solved a large number of times and those overheads are completely amortised over the model runtime.
5.1 Algorithmic and computational performance
Tab. 3 shows a comparison of the solution times, and the convergence history is plotted in Fig. 2. At lowest order, the iterative solver converges in ten or fewer iterations if a multigrid preconditioner is used; the number of iterations is around 6 times larger for the single level method. As can be seen from the convergence history in Fig. 2, the asymptotic convergence rate of the single level method (filled green triangles) is significantly worse than for the multigrid methods (filled blue circles and red squares).
|Preconditioner||‡‡‡Due to technical reasons, the setup time for the AMG preconditioner includes the time for the first iteration.|
|Lowest Order (LO)||geometric multigrid||5.36||1.91||0.265||10|
|Next-to-Lowest Order (NLO)||geometric multigrid||11.07||3.85||0.307||21|
Although the time per iteration for the single level method is lower, the additional iterations result in a significantly longer time to solution. In total the single level method is more than twice as expensive as a multigrid method, which is similar to the gain observed in . The fastest solution time is achieved with the geometric multigrid preconditioner, which is about faster than the AMG solver both at lowest order and at next-to-lowest order. While this shows some gains from using a bespoke preconditioner, this is nowhere near the speedup observed for the simplified setup in . Reasons for this are discussed in section 5.6. An initially surprising result is the large setup time of the geometric multigrid preconditioner which, although smaller than the AMG setup, is still significant. The setup time for the single-level method is not much smaller than for the multigrid algorithm, which implies that most of the time is spent on the finest multigrid level. As detailed in section 5.3 a significant proportion of this setup time is taken up by assembly of the operators for the mixed system and the Helmholtz operator on the finest multigrid level, which is required for all preconditioners.
At next-to-lowest order, the asymptotic convergence rates of the multigrid methods are again comparable, and both methods require 21 iterations to converge. The asymptotic convergence rate of the single level method is significantly worse and it needs almost 4 times as many iterations as the geometric multigrid preconditioner. Again the total runtime is twice as large for the single level method.
Of particular interest is the robustness of the solver under variations of the time step size and grid resolution. The number of iterations and total solution time as a function of the Courant number is shown in Fig. 3.
Note that the range we explore here is larger than what is typical in meteorological applications where . The number of iterations for the AMG preconditioner is virtually independent of the Courant number up to values of . The total solution time, however, increases sharply for the largest Courant number. The increase in the number of iterations for the geometric multigrid preconditioner can be partially explained by the fact that we use a fixed number of levels and the coarse grid problem becomes increasingly ill conditioned. To demonstrate this we include results in which the coarse grid problem of the geometric multigrid preconditioner is solved with a BoomerAMG V-cycle instead of two smoother iterations. In this configuration the increase in the number of GMRES iterations is very modest for . As a result the total solution time is reduced for large Courant numbers. However, as expected, for small Courant numbers solving the coarse grid problem with a small number of smoother iterations works very well and it is not necessary to use a more powerful coarse grid solver. While the single-level method might be competitive for unrealistically small Courant numbers, the number of iterations increases dramatically with .
In addition, it is important to check that the convergence rate of the multigrid solver is independent of the grid resolution. To verify this, we varied the horizontal resolution and recorded the number of iterations, which is shown in Fig. 4 (left). For this test we kept the Courant number constant at , i.e. as the resolution increases, the time step size decreases linearly with the horizontal grid spacing. We conclude that as expected the multigrid solvers show grid independent convergence.
5.3 Breakdown of time per iteration
One possible concern is that the Firedrake-based implementation is significantly slower than a monolithic implementation in a fully compiled language, which would skew the comparison in runtimes reported in Tab. 3. To address this concern, we first remind the reader that the computationally most expensive loops over the grid are implemented by generating optimised C-kernels which are executed over the grid in PyOP2. The performance of these finite element kernels in the Firedrake framework is studied in detail in ; the performance of the Firedrake framework as a whole is discussed in , where it is shown that the overhead of using the high-level framework is negligible as long as there are more than unknowns per MPI process (all results in this paper are obtained with (LO) or (NLO) unknowns per core).
To further quantify this in our case, we show a breakdown of the time per iteration for the multigrid- and single-level solvers in Fig. 4 (right) and report the times spent in the most important grid iterations in the last row of Tab. 4. As can be seen from the figure, in almost all cases (the exception being the lowest order single level method) more than half of the time is spent in the pressure solve, i.e. the multigrid or single-level preconditioner. Applying the mixed operator also takes up a significant amount of time, and the same operation is necessary for the PETSc fieldsplit+AMG implementation.
The time spent in the geometric multigrid preconditioner is further broken down by multigrid levels in Tab. 4. As expected from the “total” column, which gives the total time spent on a particular level, this is dominated by the finest level. Naively one would expect a reduction of cost by a factor of four from one level to the next coarser, since the horizontal grid is coarsened by a factor two in each direction. The only exception is the difference between level 4 and 3 in the next-to-lowest order case, which corresponds to a -refinement step from (6 unknowns per cell) to (1 unknown per cell). The theoretical reduction factor is even larger than 6 since the local stencil size is reduced significantly when going from next-to-lowest order to lowest order on the same grid. However the rates that are observed in practice are smaller. This is due to a combination of worse communication to computation ratio on the coarser levels (they have a relatively larger halo) and the non-scalable fixed costs incurred by the Firedrake framework as detailed in .
On each level the most expensive components of the multigrid algorithm are the application of the operators and in the residual calculation and the smoother application. Since we split the operator into a horizontal and a vertical component, those operations can be expressed as
Since before the pre-smoother application the solution is zero, the total cost of those operations on on the finer multigrid levels is . On the coarsest level, where we apply the smoother times, the cost is . In all numerical experiments we use and . The total cost of those operator applications is listed in the last column. The remaining three columns show the time for one individual application of , and on a particular multigrid level. This confirms that indeed a significant amount of time is spent in the PyOP2 parallel loops and we focus on analysing the performance of those parts of the code. In the following we construct a performance model for the operator application and confirm that the implementation is indeed efficient: the achieved memory bandwidth is close to the peak for the machine.
|Lowest Order (LO)||Next-to-Lowest Order (NLO)|
5.4 Performance model
Iterative solvers employing assembled sparse operators are known to be bandwidth-, rather than compute bound . Assuming that the vectors in a matrix vector product are cached perfectly, multiplication by each non-zero entry streams one scalar and one column index and performs one floating point multiply and one add. This results of an arithmetic intensity of flops per byte. Modern hardware delivers somewhere between 4 and 10 flops per byte (recall that an ARCHER node provides 518.4 Gflop/s and a bandwidth of 74.1 Gbytes/s, resulting in 7 flops/byte for a balanced code). With a good degree of freedom numbering, the assumption of good vector caching is close to true, and high performance sparse matrix implementations typically achieve an appreciable fraction of STREAM bandwidth. It is therefore clear that if we want to improve over this baseline, we can do little without changing the way we apply matrix-vector products.
Rather than assembling the full operator, finite element solvers may be implemented in a matrix-free manner by providing the matrix-vector product directly. This, along with the addition of a matrix-free preconditioner is all iterative methods require. Such approaches are typically used for high-order discretisations (polynomial degree 4 and upwards) , although recent work shows that significant speedup can already be achieved at degree two on hexahedral meshes  when the tensor product structure of the basis is exploited. In contrast to these studies, we are in a low-order regime (degree zero or one) and only have a partial tensor product structure for the basis. Additionally, since our discretisation only couples degrees of freedom through faces, the sparsity of the assembled operator is not too bad.
Finally, as discussed above, we apply the multigrid cycle to the Schur complement operator . Since it contains a velocity mass matrix inverse, its action is not purely expressible in terms of finite element integrals. Instead, we assemble (into a non-sparse banded matrix) a column-wise decoupled approximation and apply this action in the iterative solver. For algorithmic reasons we need to apply exactly, which we do by inverting the column-wise blocks. Our method is therefore not truly matrix-free, but instead exploits structure in the discretisation to provide a more efficient storage format for the assembled operator: we only need to stream matrix entries (rather than entries and column indices) for a bandwidth saving of 33% (50% if using 64 bit integers or single precision scalars). To model performance, we count flops and bytes for the application of the Schur complement. The block-diagonal mass-matrix is inverted once per solve, and we count this as well.
5.4.1 Schur complement matrix-vector products
As quantified numerically in section 5.3, the computational bottleneck of the algorithm is application of the Schur complement operator to a vector and the banded (or tridiagonal) matrix solve in the vertical direction. We therefore concentrate our analysis on these components. The three key operations are the applications of the operators , and . While is stored as a standard CSR sparse matrix, is stored as a banded matrix in the format described in section 4.3.1. At lowest order this matrix has rows in each vertical column of the grid and is tridiagonal with bandwidth . The next-to-lowest order matrix has rows and bandwidth . Since there are 6 degrees of freedom per cell, this bandwidth could be reduced to if the banded matrix storage format were adapted to account for the block-structure of the matrix. However, in the current code this would prevent the use of LAPACK routines and should be considered as an additional optimisation which will be explored in the future.
To apply to a vector, the input vector of size has to be read from memory, an output vector of the same size has to be written back and the banded matrix has to be loaded. The total amount of data transferred is
At lowest order we use a direct tridiagonal solver to apply
, and the total amount of moved data is exactly the
same as in Eq. 19,
. At next-to-lowest order we
calculate by an LU backsubstitution using the LAPACK
dgbtrs. This requires additional storage of size
for the pivot array and for additional
matrix rows in the LU factorisation. Hence the total data volume is
Based on those estimates, we quantify the achieved memory bandwidth as follows: Since all data is contiguous in the vertical direction, it is reasonable to assume perfect caching, and we can estimate a “useful” memory bandwidth of
where is the time it takes to apply the operator and is the number of vertical columns. Note that any less-than-perfect caching would manifest itself in a useful bandwidth which smaller than the achievable peak memory bandwidth.
For the horizontal operator let denote the total number of rows and the total number of nonzero entries. Following  we need to read real numbers for the input vector and write back real numbers to store the output vector. Since the matrix is stored in compressed row storage, we also have to read real numbers as well as integers from memory. The total amount of moved memory is in applying to a vector is therefore
In analogy to Eq. 21 the bandwidth can be calculated by dividing this number by the measured time.
Useful bandwidths calculated in this ways are reported in Tab. 5. On the finer levels, which amount for most of the time, the achieved bandwidth corresponds to a significant fraction of the achievable peak as measured using STREAM triad.
|Lowest Order (LO)||Next-to-Lowest Order (NLO)|
5.5 Breakdown of setup time
Unlike a truly matrix-free geometric multigrid solver, a significant amount of time is spent in the setup of the geometric multigrid solver. The most significant fraction of time (1.094s at lowest order and 1.477s at next-to-lowest order) is spent in the assembly of the operator for the mixed system in Eq. 8; this assembly is also required for the PETSc fieldsplit+AMG preconditioner. While total time spent in banded matrix algebra (0.290ss at LO, 1.163s at NLO) is relatively small, the cost of assembling the horizontal derivative (0.254s at LO, 0.187s at NLO) and lumped mass matrices (0.085s at LO, 0.102s at NLO) and (0.204s at LO, 0.322s at NLO) are not negligible, as is the time for multiplication with (0.298s at LO, 0.551s at NLO), which requires irregular memory access on the horizontally unstructured grid. A more detailed breakdown of the setup time can be found in Tab. 9 in appendix C.
5.6 Comments on matrix-free implementations
The relatively modest speedup of the geometric multigrid solver compared to the AMG algorithm raises interesting questions. The main reason for the large speedups of the geometric multigrid solver (compared to an AMG implementation) reported previously in  is the fact that the algorithm could be implemented in a matrix-free way for the simple finite-volume discretisation used there (we also used a matrix-free implementation for the slightly more complicated model equation in ). This approach avoided storage of the matrix which was re-assembled on the fly. In contrast, for an AMG algorithm the matrix elements have to be loaded from memory which is expensive on modern computer architectures. As a consequence, in  geometric multigrid leads to a increase in performance compared to AMG. This, however, is only true if the matrix-free implementation for the application of linear operators is indeed faster than applying the assembled matrix. As we will now argue, this is not true anymore for the low-order finite element discretisations considered in this work.
In general, if an operator is to be applied times, the cost of the matrix-free (MF) and matrix-explicit (MX) implementation is
where and is the time for one operator application and is the time it takes to assemble the operator into a sparse matrix. For the code in  we had and it is always cheaper to use the matrix-free implementation.
In contrast to this, we find that the matrix-free approach does not improve performance for the low order finite element discretisations used in this work, since our implementation has .
For the mixed operator in Eq. 8 the corresponding timings are shown in Tab. 6. In both cases, one matrix-free operator application is about five times as expensive as an application of the corresponding assembled operator. The main reason for this is the much larger number of floating point operations. As can be seen from Tab. 7, one matrix-free operator application requires around more FLOPs than the corresponding matrix-explicit implementation. Although one ARCHER node can execute floating point operations for each byte read from memory and the matrix-explicit implementation requires more memory traffic since the matrix has to be read in addition to the field vectors, overall we expect the matrix-free code to be slower. More specifically, we can estimate the relative runtime of the (FLOP-bound) matrix-free and the (bandwidth bound) matrix-explicit code purely from counting memory references in Tab. 7 as
Since both implementations run at a sizeable fraction of the respective peak performance (see last two columns in Tab. 7), this estimate ( for Lowest Order and at Next-to-Lowest Order) is in the same ballpark as the measured ratio of times.
At lowest order the assembly cost is amortised after about 6 operator applications, at next-to-lowest order it is amortised after 8 applications. The overhead from the matrix assembly is further mitigated by the fact that parts of the assembled mixed operator are reused at other places in the code. For example, the lumped H(div) mass matrix can be extracted very cheaply once the full mass matrix has been assembled.
|Lowest Order (LO)||1.094||0.053||0.235|
|Next-to-Lowest Order (NLO)||1.577||0.054||0.263|
Consequently in any tests that we carried out the matrix-explicit implementation of the geometric multigrid solver was much faster than a matrix-free version. As a result the performance of both the AMG and multigrid solver is limited by the speed with which the matrix and field vector can be loaded from memory, and with which the matrix can be assembled in the setup phase (explicit matrix assembly of was also necessary for the LU decomposition that is required at next-to-lowest order). Since the matrix size is comparable in both cases, it is not surprising that the time per iteration and absolute solution time is comparable for both methods. This picture is likely to change at higher discretisation orders where sum factorisation techniques can improve performance dramatically: as shown in [43, 45, 46, 44] on modern chip architectures with a large FLOP-to-bandwidth ratio, sum-factorised matrix free implementations can be faster than memory bound implementations which assemble the operator and apply it in a sparse matrix representation. At high order, storing the assembled matrix can also require significantly more memory and limit the simulated problem size.
5.7 Parallel scaling
The results of a weak scaling study of the multigrid solvers on ARCHER is shown in Fig. 5. As in the previous section we fix the Courant number at , decreasing the timestep linearly with higher grid resolution. The largest system solved at lowest order had unknowns on 1536 cores (64 full nodes); at next-to-lowest order we solved problems with up to degrees of freedom on 6144 cores (256 full nodes).
All solvers show excellent weak scaling once the nodes are fully populated. Any increases in runtime on partially occupied nodes can probably be attributed to the increasing saturation of the memory bandwidth as more cores are used.
The implementation of bespoke preconditioners for complex finite element problems requires suitable abstractions which allow a separation of concerns between the algorithm developer and the computational scientist. We discussed the implementation of appropriate abstractions in the Firedrake finite element framework. We then used the extended system to build a bespoke preconditioner for the mixed finite element discretisation of a linear gravity wave system in a thin spherical shell – an important model system in atmospheric flow applications. Motivated by earlier work in , we constructed a bespoke tensor-product geometric multigrid preconditioner, which is tailored to the strong grid aligned anisotropy in the vertical direction. This preconditioner results in a solver which is around faster than a purely algebraic approach using PETSc fieldsplit preconditioning and hypre’s algebraic multigrid to solve the resulting elliptic problem. For operationally relevant CFL numbers, the multigrid preconditioners are about twice as fast as the single level methods popular in current operational models. The different abstractions of the Firedrake/PyOP2 framework simplify the implementation significantly: the user can express the algorithm at the correct abstraction level, while still obtaining very good performance. This is demonstrated by our careful analysis of key components of the algorithm, which we show are running at a significant fraction of the theoretical peak memory bandwidth. For simpler finite volume or finite difference discretisations on structured grids we previously demonstrated previously that it is possible to achieve further speedups by using a matrix-free implementation [6, 8]. However, here we show that this is not possible for the low-order finite element discretisations. In fact a matrix-based implementation based on a BoomerAMG preconditioner achieves almost the same performance as our bespoke geometric multigrid solvers. The balance between matrix-free and assembled operators may change with changes in future hardware, and if we are able to exploit some of the structure in the tensor-product basis (reducing the flop counts of the element kernels); we intend to study this in the future.
6.1 Future work
There are several ways of extending the current work. In realistic atmospheric models orography will lead to a distortion of the grid and the decomposition of the velocity function space into a purely horizontal and purely vertical component is no longer valid, which can have an impact on the performance of the solver. Nevertheless the tensor-product multigrid algorithm, which assumes a perfect factorisation, can still be used as a good preconditioner, as has been confirmed by preliminary experiments (not reported here). Performance gains are also expected from a decomposition of the operators into tensor products. If, for example, an operator can be written as a tensor product , assembling (and applying) the operators and separately requires operations (and storage) instead of , and, since this will lead to significant speedups. This approach has been explored for a simplified problem in . In the finite element setting used in this work it would require a shallow-atmosphere approximation in the preconditioner.
Instead of using a Schur-complement factorisation and applying a multigrid preconditioner to the positive-definite system in (10), one could also the multigrid algorithm for the full system in (8) or (6). This requires the construction of suitable smoothers, for example based on “distributive iterations” .
Ultimately the linear solver will be used inside a Newton iteration to solve a non-linear problem in the full atmospheric model, and more careful numerical studies will have to be carried out in this context. While a non-linear multigrid (FAS) scheme [23, §8] could also be explored, the non-linearities in the problem considered here might not be large enough to justify this approach which requires computationally expensive non-linear smoothers. Guided by the work reported here, we are currently working on implementing both the PETSc fieldsplit preconditioner and the geometric multigrid preconditioner in the Fortran 2003 code base that is used to develop the Met Office’s next generation dynamical core (codenamed “GungHo”).
This work was funded as part of the NERC project on Next Generation Weather and Climate Prediction (NGWCP), grant numbers NE/K006789/1 and NE/K006754/1. LM additionally acknowledges funding from EPSRC grant EP/M011054/1. We gratefully acknowledge input from discussions with our collaborators in the Met Office Dynamics Research group and the GungHo! project.
This work used the ARCHER UK National Supercomputing Service (http://www.archer.ac.uk).
All numerical experiments in this paper were performed with the following versions of software, which we have archived on Zenodo: Firedrake ; PyOP2 ; FIAT ; FFC ; COFFEE ; PETSc ; petsc4py ; UFL . Additionally, the simulation code itself is archived as  and the performance data and plotting scripts as .
Appendix A Derivation of model equations
Here we derive the equations Eq. 1 from first principles. Large scale atmospheric flow is governed by the Navier Stokes equations and the equation of state:
Expand around stationary profiles , and
At lowest order in the stationary solution satisfies hydrostatic balance and the equation of state
The equation of state also implies
At the momentum equation and the conservation of potential temperature are
If we introduce the buoyancy and buoyancy (Brunt-Vaisälä) frequency with
these can be written (in the momentum equation we used Eq. 27 to replace by ) as
The equation for the density is at :
Divide by and use Eq. 28 to replace by and
and then Eq. 29 to eliminate the time derivative of
We now estimate the relative size of the two terms and in this equation. Define the scale height for a profile as
We now assume that the height of the domain is much smaller than all scale heights, , i.e. the profiles vary only relatively slowly with height. Then we have
With this approximation the equation for the Exner pressure becomes
and note that the speed of sound is given by
When written in terms of the variable the pressure- equation and momentum equations become
Appendix B PETSc fieldsplit preconditioner options
The options used for the fieldsplit preconditioner are shown in Tab. 8.
Appendix C Breakdown of setup time
A detailed breakdown of the setup time is shown in Tab. 9.
|assemble operator for mixed system||1.094||1.577|
|bandedmatrix: LU factorise (|