Asymptotic convergence of the parallel full approximation scheme in space and time for linear problems

Abstract

For time-dependent partial differential equations, parallel-in-time integration using the “parallel full approximation scheme in space and time” (PFASST) is a promising way to accelerate existing space-parallel approaches beyond their scaling limits. Inspired by the classical Parareal method and by multigrid ideas, PFASST makes it possible to integrate multiple time-steps simultaneously using a space-time hierarchy of spectral deferred correction sweeps. While many use cases and benchmarks exist, a solid and reliable mathematical foundation is still missing. Very recently, however, PFASST for linear problems was identified as a multigrid method, and in this paper we use this multigrid formulation, in particular PFASST’s iteration matrix, to show that in the non-stiff as well as in the stiff limit PFASST is indeed a convergent iterative method. We provide upper bounds for the spectral radius of the iteration matrix and investigate how PFASST performs for increasing numbers of parallel time-steps. Finally, we demonstrate that the results obtained here indeed relate to actual PFASST runs.

1 Introduction

With the advent of supercomputing architectures featuring millions of processing units, classical parallelization techniques used to accelerate the solution of discretized partial differential equations face new challenges. For fixed-size problems, communication eventually starts to dominate, when only small portions of data are left for computation on each unit. This “trap of strong scaling” leads to severe and inevitable upper limits for the speedup obtainable with parallelization in space, leaving large parts of extreme-scale supercomputers unexploited. If weak scaling is the target, this may not be an issue. For time-dependent problems, however, stability considerations often lead to an increase of the number of time-steps as the problem is refined in space. This is not mitigated by spatial parallelization alone, yielding the “trap of weak scaling”. Thus, the challenges arising from the extreme levels of parallelism required by today’s and future high-performance computing systems mandate the development of new numerical methods that feature a maximum degree of concurrency.

For time-dependent problems, in particular for time-dependent partial differential equations, approaches for parallelization along the temporal dimension have become increasingly popular over the last years. In his seminal work from 2015, Gander reviews the numerous approaches to parallelize the seemingly serial process of time integration [1]. In particular, the invention of the Parareal method in 2001 [2] alone sparked a multitude of new developments in this area. This “parallelization across the step” approach makes it possible to integrate many time-steps simultaneously. The idea is to derive a coarser, less expensive time-integration scheme for the problem at hand and use this so-called coarse propagator to quickly and serially propagate information forward in time. The original integrator, in this context often called the fine propagator, is then applied in parallel using the initial values the coarse scheme provided. This cycle of fine and coarse, parallel and serial time-stepping is repeated and, upon convergence, Parareal is as accurate as the fine propagator run in serial. This way, the costs of the expensive fine scheme are distributed, while the serial part is kept small using a cheap propagator. This predictor-corrector approach, being easy to implement and easy to apply, has been analyzed extensively. It has been identified as a multiple shooting method and as a FAS multigrid scheme [3], and convergence has been proven under various conditions, see e.g. [3].

Yet, a key drawback of Parareal is the severe upper bound on its parallel efficiency: if $K$ iterations are required to converge, the efficiency is bounded by $1/K$. Perfect linear speedup cannot be expected due to the serial coarse propagator, but efficiencies of a few percent are not desirable either. Therefore, many researchers started to enhance the Parareal idea with the goal of loosening this harsh bound on parallel efficiency. One idea is to replace the fine and the coarse propagators by iterative solvers and to couple their “inner” iterations with the “outer” Parareal iteration. A first step in this direction was taken in [8], where spectral deferred correction methods (SDC, see [9]) were used within Parareal. This led to the “parallel full approximation scheme in space and time” (PFASST), which augmented this approach with ideas from non-linear multigrid methods [10]. In these original papers from 2012 and 2014, the PFASST algorithm was introduced, its implementation was discussed and it was applied to first test problems. In the following years, PFASST has been applied to more and more problems and coupled to different space-parallel solvers, ranging from a Barnes-Hut tree code to geometric multigrid, see [12]. Together with spatial parallelization, PFASST was demonstrated to run and scale on up to hundreds of thousands of cores of an IBM Blue Gene/Q installation. Yet, while applications, implementation and improvements are discussed frequently, a solid and reliable convergence theory is still missing. While for Parareal many results exist and provide a profound basis for a deep understanding of this algorithm, this is by far not the case for PFASST. Very recently, however, PFASST for linear problems was identified as a multigrid method in [16] and the definition of its iteration matrix yielded a new understanding of the algorithm’s components and their mechanics. Although a careful block Fourier mode analysis already revealed many interesting features and also limitations, a rigorous proof of convergence has not been provided so far.

In this paper, we use the multigrid formulation of PFASST for linear problems, in particular its iteration matrix, to show that in the non-stiff as well as in the stiff limit PFASST is indeed a convergent iterative method. We provide upper bounds for the spectral radius of the iteration matrix and show that under certain assumptions PFASST also satisfies the approximation property of standard multigrid theory. In contrast, the smoothing property does not hold, but we state a modified smoother which allows this property to be satisfied as well. We further investigate how PFASST performs for increasing numbers of parallel time-steps. Finally, we demonstrate that the results obtained here indeed relate to actual PFASST runs. We start with a brief summary of the results found in [16], describing PFASST as a multigrid method.

2 A multigrid perspective on PFASST

We focus on linear, autonomous systems of ordinary differential equations (ODEs) with

$$u_t = \mathsf{A}u, \quad u(0) = u_0,$$

and “spatial” matrix $\mathsf{A} \in \mathbb{R}^{N\times N}$, stemming e.g. from a spatial discretization of a partial differential equation (PDE). Examples include the heat or the advection equation, but also the wave equation and other types of linear PDEs and ODEs. We decompose the time interval $[0,T]$ into $L$ steps $[t_l, t_{l+1}]$, $l = 0, \dots, L-1$, and rewrite the ODE for such a time-step in Picard formulation as

$$u(t) = u_l + \int_{t_l}^{t} \mathsf{A}u(s)\,\mathrm{d}s, \quad t \in [t_l, t_{l+1}],$$

where $u_l$ is the initial condition for this time-step, e.g. coming from a time-stepping scheme. Introducing $M$ quadrature nodes $t_l \le \tau_1 < \dots < \tau_M = t_{l+1}$, we can approximate the integrals from $t_l$ to these nodes using spectral quadrature like Gauß-Radau or Gauß-Lobatto quadrature, such that

$$u_m = u_l + \Delta t \sum_{j=1}^{M} q_{m,j}\,\mathsf{A}u_j, \quad m = 1, \dots, M,$$

where $u_m \approx u(\tau_m)$, $\Delta t = t_{l+1} - t_l$ and the $q_{m,j}$ represent the quadrature weights for the interval $[t_l, \tau_m]$ with

$$q_{m,j} = \frac{1}{\Delta t}\int_{t_l}^{\tau_m} \ell_j(s)\,\mathrm{d}s,$$

where the $\ell_j$ are the Lagrange polynomials to the points $\tau_1, \dots, \tau_M$. Note that for the quadrature rule on each subinterval all nodes are taken into account, even if they do not belong to the interval under consideration. Combining this into one set of linear equations yields

$$\left(\mathsf{I}_{MN} - \Delta t\,(\mathsf{Q}\otimes\mathsf{A})\right)\vec{u} = \vec{u}_0,$$

for vectors $\vec{u} = (u_1, \dots, u_M)^T$, $\vec{u}_0 = (u_l, \dots, u_l)^T \in \mathbb{R}^{MN}$ and quadrature matrix $\mathsf{Q} = (q_{m,j}) \in \mathbb{R}^{M\times M}$. This is the so-called “collocation problem” and it is equivalent to a fully implicit Runge-Kutta method. Before we proceed with describing the solution strategy for this problem, we slightly change the notation: instead of working with the term $\Delta t\,\mathsf{A}$, we introduce the “CFL number” $\mu$ (sometimes called the “discrete dispersion relation number”) to absorb the time-step size $\Delta t$, problem-specific parameters like diffusion coefficients as well as the spatial mesh size $\Delta x$, if applicable. We write

$$\Delta t\,\mathsf{A} = \mu\,\mathsf{B},$$

where the matrix $\mathsf{B}$ is the parameter-free description of the spatial problem or system of ODEs. For example, for the heat and the advection equation, the parameter $\mu$ is defined by

$$\mu = \frac{\nu\,\Delta t}{\Delta x^2} \quad\text{and}\quad \mu = \frac{c\,\Delta t}{\Delta x},$$

with diffusion coefficient $\nu$ and advection speed $c$. Then, the collocation problem reads

$$\left(\mathsf{I}_{MN} - \mu\,(\mathsf{Q}\otimes\mathsf{B})\right)\vec{u} = \vec{u}_0,$$

and we will use this form for the remainder of this paper.
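To make this construction concrete, the following minimal Python sketch assembles the quadrature matrix $\mathsf{Q}$ from the integrated Lagrange polynomials and builds the collocation matrix $\mathsf{I}_{MN} - \mu\,(\mathsf{Q}\otimes\mathsf{B})$. All names are ours and not taken from any PFASST code; for simplicity, Gauß-Legendre instead of Gauß-Radau nodes are used, since the construction of $\mathsf{Q}$ is the same for any choice of nodes.

```python
import numpy as np

def quadrature_matrix(nodes):
    """Q with entries q_{m,j} = int_0^{tau_m} l_j(s) ds on the unit interval,
    where l_j is the Lagrange polynomial to the node tau_j."""
    M = len(nodes)
    Q = np.zeros((M, M))
    for j in range(M):
        others = np.delete(nodes, j)
        # coefficients of l_j from its roots, normalized so that l_j(tau_j) = 1
        coeffs = np.poly(others) / np.prod(nodes[j] - others)
        antideriv = np.polyint(coeffs)
        Q[:, j] = np.polyval(antideriv, nodes) - np.polyval(antideriv, 0.0)
    return Q

M, N = 3, 1                                   # number of nodes, spatial dofs
nodes, _ = np.polynomial.legendre.leggauss(M)
nodes = 0.5 * (nodes + 1.0)                   # map nodes from (-1, 1) to (0, 1)
Q = quadrature_matrix(nodes)

mu = -0.1                                     # CFL number, e.g. lambda*dt for u_t = lambda*u
B = np.eye(N)                                 # parameter-free "spatial" matrix
C_coll = np.eye(M * N) - mu * np.kron(Q, B)   # collocation matrix
```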

This system of equations is dense and a direct solution is not advisable, in particular if the right-hand side of the ODE is non-linear. While the standard way of solving this for $\vec{u}$ is a simplified Newton approach [18], the more recent development of spectral deferred correction methods (SDC, see [9]) provides an interesting and very flexible alternative. In order to present this approach, we follow the idea of preconditioned Picard iteration as found e.g. in [19]. The key idea here is to provide a flexible preconditioner based on a simpler quadrature rule for the integrals. More precisely, the iteration is given by

$$\vec{u}^{k+1} = \vec{u}^k + \left(\mathsf{I}_{MN} - \mu\,(\mathsf{Q}_\Delta\otimes\mathsf{B})\right)^{-1}\left(\vec{u}_0 - \left(\mathsf{I}_{MN} - \mu\,(\mathsf{Q}\otimes\mathsf{B})\right)\vec{u}^k\right),$$

where the matrix $\mathsf{Q}_\Delta$ gathers the weights of this simpler quadrature rule. Examples are the implicit right-hand side rule or the explicit left-hand side rule, both yielding lower triangular matrices, which make the inversion of the preconditioner straightforward using simple forward substitution. More recently, Weiser [20] defined $\mathsf{Q}_\Delta := \mathsf{U}^T$ for the decomposition $\mathsf{Q}^T = \mathsf{L}\mathsf{U}$ and showed superior convergence properties of SDC for stiff problems. This approach has become the de-facto standard for SDC preconditioning and is colloquially known as St. Martin’s or LU trick. Now, for each time-step, SDC can be used to generate an approximate solution of the collocation problem. As soon as SDC has converged (e.g. the residual of the collocation problem is smaller than a prescribed threshold), the solution at $\tau_M = t_{l+1}$ is used as initial condition for the next time-step. In order to parallelize this, the “parallel full approximation scheme in space and time” (PFASST, see [10]) makes use of a space-time hierarchy of SDC iterations (called “sweeps” in this context), using the coarsest level to propagate information quickly forward in time. This way, multiple time-steps can be integrated simultaneously, where on each local time interval SDC sweeps are used to approximate the collocation problem. We follow [16] to describe the PFASST algorithm more formally.
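The following sketch continues the snippet above and applies this preconditioned Picard iteration to the collocation problem. It builds $\mathsf{Q}_\Delta$ with the LU trick; since the trick is defined via the plain LU decomposition of $\mathsf{Q}^T$, the factorization is implemented without pivoting.

```python
def lu_nopivot(A):
    """Doolittle LU factorization A = L @ U without pivoting
    (assumes non-zero pivots, which holds for our small Q matrices)."""
    n = A.shape[0]
    L, U = np.eye(n), A.astype(float).copy()
    for k in range(n - 1):
        for i in range(k + 1, n):
            L[i, k] = U[i, k] / U[k, k]
            U[i, :] -= L[i, k] * U[k, :]
    return L, U

L_lu, U_lu = lu_nopivot(Q.T)
Q_delta = U_lu.T                                   # LU trick: Q_Delta = U^T for Q^T = LU

P_sdc = np.eye(M * N) - mu * np.kron(Q_delta, B)   # SDC preconditioner
u0 = np.ones(M * N)                                # initial value spread to all nodes
u = u0.copy()
for k in range(10):                                # SDC sweeps
    u += np.linalg.solve(P_sdc, u0 - C_coll @ u)
print(np.linalg.norm(u0 - C_coll @ u))             # residual of the collocation problem
```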

The problem PFASST is trying to solve is the so-called “composite collocation problem” for $L$ time-steps with

$$\begin{pmatrix} \mathsf{I}_{MN} - \mu\,\mathsf{Q}\otimes\mathsf{B} & & & \\ -\mathsf{H} & \mathsf{I}_{MN} - \mu\,\mathsf{Q}\otimes\mathsf{B} & & \\ & \ddots & \ddots & \\ & & -\mathsf{H} & \mathsf{I}_{MN} - \mu\,\mathsf{Q}\otimes\mathsf{B} \end{pmatrix} \begin{pmatrix} \vec{u}_1 \\ \vec{u}_2 \\ \vdots \\ \vec{u}_L \end{pmatrix} = \begin{pmatrix} \vec{u}_0 \\ 0 \\ \vdots \\ 0 \end{pmatrix}.$$

The system matrix consists of collocation problems on the block diagonal and a transfer matrix $\mathsf{H}$ on the lower diagonal, which takes the last value of each time-step and makes it available as initial condition for the next one. With the nodes we choose here (the last node coincides with the right interval boundary), $\mathsf{H}$ is simply given by

$$\mathsf{H} = \mathsf{N}\otimes\mathsf{I}_N \quad\text{with}\quad \mathsf{N} = \begin{pmatrix} 0 & \cdots & 0 & 1 \\ \vdots & & \vdots & \vdots \\ 0 & \cdots & 0 & 1 \end{pmatrix} \in \mathbb{R}^{M\times M}.$$

More compactly and more conveniently, the composite collocation problem can be written as

$$\mathsf{C}\,\vec{u} = \vec{b}, \quad \mathsf{C} := \mathsf{I}_L\otimes\left(\mathsf{I}_{MN} - \mu\,\mathsf{Q}\otimes\mathsf{B}\right) - \mathsf{E}\otimes\mathsf{H},$$

with space-time-collocation vectors $\vec{u}, \vec{b} \in \mathbb{R}^{LMN}$ and system matrix $\mathsf{C}$, where the matrix $\mathsf{E} \in \mathbb{R}^{L\times L}$ simply has ones on the first subdiagonal and zeros elsewhere.
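Continuing the sketch, the composite collocation problem can be assembled directly with Kronecker products; $\mathsf{N}$ copies the value at the last node, which by our choice of nodes is the value at the right interval boundary.

```python
L_steps = 4                                   # number of (parallel) time-steps
E = np.diag(np.ones(L_steps - 1), -1)         # ones on the first subdiagonal
N_mat = np.zeros((M, M))
N_mat[:, -1] = 1.0                            # pick the last-node value for all nodes
H = np.kron(N_mat, np.eye(N))                 # transfer matrix between time-steps

C_comp = np.kron(np.eye(L_steps), C_coll) - np.kron(E, H)   # composite system matrix
```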

The key idea of PFASST is to solve this problem using a multigrid scheme. If the right-hand side of the ODE is non-linear, a non-linear FAS multigrid is used. Although our analysis is focused on linear problems, we use FAS terminology to formulate PFASST to remain consistent with the literature. Also, we limit ourselves to a two-level scheme in order to keep the notation as simple as possible. Three components are needed to describe the multigrid scheme used to solve the composite collocation problem: (1) a smoother on the fine level, (2) a solver on the coarse level and (3) level transfer operators. In order to obtain parallelism, the smoother we choose is an approximative block Jacobi smoother. The idea is to use SDC within each time-step (this is why it is an “approximative” Jacobi iteration), but omit the transfer matrices on the lower diagonal. In detail, the smoother is defined by the preconditioner

$$\mathsf{P}_{\mathrm{Jac}} = \mathsf{I}_L\otimes\left(\mathsf{I}_{MN} - \mu\,\mathsf{Q}_\Delta\otimes\mathsf{B}\right),$$

or, more compactly, $\mathsf{P}_{\mathrm{Jac}} = \mathsf{I}_L\otimes\mathsf{P}$ with $\mathsf{P} := \mathsf{I}_{MN} - \mu\,\mathsf{Q}_\Delta\otimes\mathsf{B}$. Inversion of this matrix can be done on all time-steps simultaneously, leading to decoupled SDC sweeps. Note that typically this is done only once or twice on the fine level. In contrast, the solver on the coarse level is given by an approximative block Gauß-Seidel preconditioner. Here, SDC is used for each time-step, but the transfer matrix is included. This yields for the preconditioner

$$\mathsf{P}_{\mathrm{GS}} = \mathsf{I}_L\otimes\left(\mathsf{I}_{MN} - \mu\,\mathsf{Q}_\Delta\otimes\mathsf{B}\right) - \mathsf{E}\otimes\mathsf{H},$$

or $\mathsf{P}_{\mathrm{GS}} = \mathsf{I}_L\otimes\mathsf{P} - \mathsf{E}\otimes\mathsf{H}$. Inversion of this preconditioner is inherently serial, but the goal is to keep this serial part as small as possible by applying it on the coarse level only, just as the Parareal method does. Thus, we need coarsening strategies in place to reduce the costs on the coarser levels [22]. To this end, we introduce block-wise restriction and interpolation operators $\mathsf{T}_F^C$ and $\mathsf{T}_C^F$, which coarsen the problem in space and reduce the number of quadrature nodes but do not coarsen in time, i.e., the number of time-steps is not reduced. We note that the latter is also possible in this formal notation, but so far no PFASST implementation is working with this and the theory presented here indeed makes use of this fact. A first discussion on this topic can be found in [17]. Let $\tilde{N}$ be the number of degrees of freedom on the coarse level and $\tilde{M}$ the number of collocation nodes on the coarse level. The restriction and the interpolation operators are then given by

$$\mathsf{T}_F^C = \mathsf{I}_L\otimes\mathsf{R}_M\otimes\mathsf{R}_N, \quad \mathsf{T}_C^F = \mathsf{I}_L\otimes\mathsf{T}_M\otimes\mathsf{T}_N,$$

where the matrices $\mathsf{R}_M \in \mathbb{R}^{\tilde{M}\times M}$ and $\mathsf{T}_M \in \mathbb{R}^{M\times\tilde{M}}$ represent restriction and interpolation on the quadrature nodes, while $\mathsf{R}_N \in \mathbb{R}^{\tilde{N}\times N}$ and $\mathsf{T}_N \in \mathbb{R}^{N\times\tilde{N}}$ deal with the spatial degrees-of-freedom. Within PFASST, these operators are typically standard Lagrangian-based restriction and interpolation (see the sketch after the list below). We use the tilde symbol to denote matrices on the coarse level, so that the approximative block Gauß-Seidel preconditioner is actually given by

$$\tilde{\mathsf{P}}_{\mathrm{GS}} = \mathsf{I}_L\otimes\tilde{\mathsf{P}} - \mathsf{E}\otimes\tilde{\mathsf{H}},$$

where $\tilde{\mathsf{P}} = \mathsf{I}_{\tilde{M}\tilde{N}} - \mu\,\tilde{\mathsf{Q}}_\Delta\otimes\tilde{\mathsf{B}}$. Note that this is typically applied only once or twice on the coarse level, too. In addition, the composite collocation problem has to be modified on the coarse level. This is done by the τ-correction of the FAS scheme and we refer to [16] for details on this, since the actual formulation does not matter here. We now have all ingredients for one iteration of the two-level version of PFASST using post-smoothing:

  1. restriction to the coarse level including the formulation of the τ-correction,

  2. serial approximative block Gauß-Seidel iteration on the modified composite collocation problem on the coarse level,

  3. coarse-grid correction of the fine-level values,

  4. smoothing of the composite collocation problem on the fine level using parallel approximative block Jacobi iteration.
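As a sketch of such Lagrangian transfer operators on the quadrature nodes (the spatial operators $\mathsf{R}_N$ and $\mathsf{T}_N$ would be built analogously), one can evaluate the coarse-level Lagrange polynomials at the fine-level nodes; taking the transpose as restriction is one common choice, not the only one.

```python
def interpolation_matrix(fine_nodes, coarse_nodes):
    """Lagrange interpolation from coarse_nodes to fine_nodes."""
    P = np.empty((len(fine_nodes), len(coarse_nodes)))
    for j, cj in enumerate(coarse_nodes):
        others = np.delete(coarse_nodes, j)
        P[:, j] = np.prod((fine_nodes[:, None] - others) / (cj - others), axis=1)
    return P

coarse_nodes, _ = np.polynomial.legendre.leggauss(2)
coarse_nodes = 0.5 * (coarse_nodes + 1.0)
T_M = interpolation_matrix(nodes, coarse_nodes)   # interpolation on the nodes
R_M = T_M.T                                       # restriction as its transpose
```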

Thus, one iteration of PFASST can simply be written as

$$\vec{u}^{k+1} = \left(\mathsf{I} - \mathsf{P}_{\mathrm{Jac}}^{-1}\mathsf{C}\right)\left(\vec{u}^k + \mathsf{T}_C^F\,\tilde{\mathsf{P}}_{\mathrm{GS}}^{-1}\,\mathsf{T}_F^C\left(\vec{b} - \mathsf{C}\vec{u}^k\right)\right) + \mathsf{P}_{\mathrm{Jac}}^{-1}\vec{b}.$$

Besides this rather convenient and compact form, this formulation has the great advantage of providing the iteration matrix of PFASST, which paves the way to a comprehensive analysis. In the following, we summarize the results described above and state the iteration matrix of PFASST:

$$\mathsf{T} = \mathsf{T}_{\mathrm{S}}\,\mathsf{T}_{\mathrm{CGC}} := \left(\mathsf{I} - \mathsf{P}_{\mathrm{Jac}}^{-1}\mathsf{C}\right)\left(\mathsf{I} - \mathsf{T}_C^F\,\tilde{\mathsf{P}}_{\mathrm{GS}}^{-1}\,\mathsf{T}_F^C\,\mathsf{C}\right).$$

This is taken from [16] and a much more detailed derivation and discussion can be found there.
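A sketch of this full iteration matrix, continuing the snippets above, under the simplifying assumption that the coarse level coincides with the fine level (identity transfer operators, so there is no actual coarsening and the τ-correction vanishes):

```python
P_jac = np.kron(np.eye(L_steps), np.eye(M * N) - mu * np.kron(Q_delta, B))
P_gs = P_jac - np.kron(E, H)

I_full = np.eye(L_steps * M * N)
T_smoother = I_full - np.linalg.solve(P_jac, C_comp)   # parallel block "Jacobi"
T_cgc = I_full - np.linalg.solve(P_gs, C_comp)         # serial block "Gauss-Seidel"
T_pfasst = T_smoother @ T_cgc                          # PFASST iteration matrix

print(max(abs(np.linalg.eigvals(T_pfasst))))           # spectral radius
```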

Besides the matrix formulation of PFASST, we will make frequent use of these results:

  • Lemma 1.2.1 of [23]: For a matrix $\mathsf{X} \in \mathbb{C}^{n\times n}$ with $\|\mathsf{X}\| < 1$ the inverse of $\mathsf{I} - \mathsf{X}$ exists and

    $$\left\|\left(\mathsf{I} - \mathsf{X}\right)^{-1}\right\| \le \frac{1}{1 - \|\mathsf{X}\|}.$$

  • Theorem 1.2.1 of [23]: For two matrices $\mathsf{X}, \mathsf{Y} \in \mathbb{C}^{n\times n}$ with $\mathsf{X}$ invertible and $\left\|\mathsf{X}^{-1}\left(\mathsf{Y} - \mathsf{X}\right)\right\| < 1$ the inverse of $\mathsf{Y}$ exists and

    $$\left\|\mathsf{Y}^{-1}\right\| \le \frac{\left\|\mathsf{X}^{-1}\right\|}{1 - \left\|\mathsf{X}^{-1}\left(\mathsf{Y} - \mathsf{X}\right)\right\|}.$$

  • Theorem 1.3 (Elsner) of [24]: Let $\mathsf{C} \in \mathbb{C}^{n\times n}$ be a perturbation of a matrix $\mathsf{A} \in \mathbb{C}^{n\times n}$, such that $\mathsf{C} = \mathsf{A} + \mathsf{F}$. Then it follows

    $$\max_i \min_j \left|\lambda_i^{\mathsf{C}} - \lambda_j^{\mathsf{A}}\right| \le \left(\|\mathsf{A}\|_2 + \|\mathsf{C}\|_2\right)^{1 - 1/n}\,\|\mathsf{F}\|_2^{1/n},$$

    where $\lambda_i^{\mathsf{C}}$ corresponds to the $i$th eigenvalue of $\mathsf{C}$ and $\lambda_j^{\mathsf{A}}$ to the $j$th eigenvalue of $\mathsf{A}$.

In what follows, we are interested in the behavior of PFASST’s iteration matrix for the non-stiff as well as for the stiff limit. More precisely, we will look at the cases where the CFL number $\mu$ either goes to zero or to infinity. While the first case represents the analysis for smaller and smaller time-steps, the second one covers scenarios where e.g. the mesh- or element-size goes to zero. Alternatively, problem parameters like the diffusion coefficient or the advection speed could become very small or very large, while $\Delta t$ and $\Delta x$ are fixed.

3 The case $\mu \to 0$

This section is split into three parts. We look at the iteration matrices of the smoother and the coarse-grid correction separately and then analyze the full iteration matrix of PFASST. While for the smoother we introduce the main idea behind the asymptotic convergence analysis, the analysis of the coarse-grid correction is dominated by the restriction and interpolation operators. For PFASST’s full iteration matrix we then combine both results in a straightforward manner.

3.1 The smoother

We first consider the iteration matrix $\mathsf{T}_{\mathrm{S}}$ of the smoother with

$$\mathsf{T}_{\mathrm{S}} = \mathsf{I} - \mathsf{P}_{\mathrm{Jac}}^{-1}\mathsf{C} = \mathsf{P}_{\mathrm{Jac}}^{-1}\left(\mathsf{P}_{\mathrm{Jac}} - \mathsf{C}\right).$$

We write

$$\mathsf{P}_{\mathrm{Jac}} - \mathsf{C} = \mu\,\mathsf{I}_L\otimes\left(\mathsf{Q} - \mathsf{Q}_\Delta\right)\otimes\mathsf{B} + \mathsf{E}\otimes\mathsf{H},$$

so that

$$\mathsf{P}_{\mathrm{Jac}} - \mathsf{C} - \mathsf{P}_{\mathrm{Jac}}\left(\mathsf{E}\otimes\mathsf{H}\right) = \mu\left(\mathsf{I}_L\otimes\left(\mathsf{Q} - \mathsf{Q}_\Delta\right)\otimes\mathsf{B} + \left(\mathsf{I}_L\otimes\mathsf{Q}_\Delta\otimes\mathsf{B}\right)\left(\mathsf{E}\otimes\mathsf{H}\right)\right).$$

Therefore, we have

$$\mathsf{T}_{\mathrm{S}} - \mathsf{E}\otimes\mathsf{H} = \mu\,\mathsf{P}_{\mathrm{Jac}}^{-1}\left(\mathsf{I}_L\otimes\left(\mathsf{Q} - \mathsf{Q}_\Delta\right)\otimes\mathsf{B} + \left(\mathsf{I}_L\otimes\mathsf{Q}_\Delta\otimes\mathsf{B}\right)\left(\mathsf{E}\otimes\mathsf{H}\right)\right).$$

Moreover, the norm of $\mathsf{P}_{\mathrm{Jac}}^{-1}$ can be bounded by

$$\left\|\mathsf{P}_{\mathrm{Jac}}^{-1}\right\| \le \frac{1}{1 - |\mu|\,\left\|\mathsf{Q}_\Delta\otimes\mathsf{B}\right\|},$$

if $|\mu|$ is small enough, i.e. if

$$|\mu| < \frac{1}{\left\|\mathsf{Q}_\Delta\otimes\mathsf{B}\right\|},$$

see Lemma 1.2.1 above. Together with the identity before we obtain

$$\left\|\mathsf{T}_{\mathrm{S}} - \mathsf{E}\otimes\mathsf{H}\right\| \le C\,|\mu|,$$

since the last factor of the right-hand side does not depend on $\mu$.

Therefore, the matrix $\mathsf{T}_{\mathrm{S}}$ converges to $\mathsf{E}\otimes\mathsf{H}$ linearly as $\mu \to 0$, so that we can write

$$\mathsf{T}_{\mathrm{S}} = \mathsf{E}\otimes\mathsf{H} + \mathcal{O}(\mu) \quad\text{as}\quad \mu \to 0.$$

This leads us to the following lemma: for $|\mu|$ small enough, there exists a constant $C > 0$ independent of $\mu$ such that

$$\rho\left(\mathsf{T}_{\mathrm{S}}\right) \le C\,|\mu|^{1/(LMN)}.$$

The eigenvalues of the limit matrix $\mathsf{E}\otimes\mathsf{H}$ are all zero (since it only has entries strictly below the diagonal) and because the eigenvalues of $\mathsf{T}_{\mathrm{S}}$ converge to the eigenvalues of $\mathsf{E}\otimes\mathsf{H}$ as $\mu \to 0$, the spectral radius of $\mathsf{T}_{\mathrm{S}}$ will be smaller than $1$ if $|\mu|$ is small enough. More precisely, following Elsner’s theorem we choose $\mathsf{A} = \mathsf{E}\otimes\mathsf{H}$ and $\mathsf{C} = \mathsf{T}_{\mathrm{S}} = \mathsf{E}\otimes\mathsf{H} + \mathsf{F}$ with some perturbation matrix $\mathsf{F}$ of the order of $\mu$. Then, the eigenvalues $\lambda_j^{\mathsf{A}}$ are all equal to zero, so that the left-hand side of the inequality is actually the spectral radius of $\mathsf{T}_{\mathrm{S}}$ and we obtain

$$\rho\left(\mathsf{T}_{\mathrm{S}}\right) \le \left(\left\|\mathsf{E}\otimes\mathsf{H}\right\|_2 + \left\|\mathsf{T}_{\mathrm{S}}\right\|_2\right)^{1 - 1/(LMN)}\,\mathcal{O}\!\left(|\mu|^{1/(LMN)}\right).$$

In particular, for small $L$, $M$ and $N$, e.g. for few parallel time-steps, the order of convergence behaves like $\mu^{1/(LMN)}$, while for very large $L$, $M$ or $N$, e.g. for many parallel time-steps, the convergence rate deteriorates to nearly $\mu^0 = 1$, at least in the limit. Figure 1 shows for the test problem with right-hand side $\lambda = -1$ that this estimate is severely over-pessimistic if $L$ is small, but becomes rather accurate if $L$ becomes larger. This does not change significantly when choosing an imaginary value $\lambda = i$ for the test equation, as Figure 2 shows. Note, however, that for these large numbers of time-steps the numerical computation of the spectral radius may be prone to rounding errors. We refer to the discussion in [17] for more details.

Figure 1: $\lambda = -1$
Figure 2: $\lambda = i$

Yet, while this lemma gives a proof of the convergence of the smoother, it cannot be used to estimate the speed of convergence. The standard way of providing such an estimate would be to bound the norm of the iteration matrix by something smaller than $1$. However, even in the limit $\mu \to 0$ the norm of $\mathsf{T}_{\mathrm{S}}$ is still larger than or equal to $1$ in all feasible matrix norms. Alternatively, we can look at the $L$th power of the iteration matrix, which corresponds to the $L$-times application of the smoother: since $\mathsf{E}\otimes\mathsf{H}$ is nilpotent, its $L$th power vanishes.
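Both observations are easy to check numerically with the snippets from above: the spectral radius of $\mathsf{T}_{\mathrm{S}}$ decays as $\mu \to 0$, and the $L$th power of the limit matrix $\mathsf{E}\otimes\mathsf{H}$ is exactly zero.

```python
for mu_k in (1e-1, 1e-2, 1e-3):
    C_k = np.kron(np.eye(L_steps), np.eye(M * N) - mu_k * np.kron(Q, B)) \
        - np.kron(E, H)
    P_k = np.kron(np.eye(L_steps), np.eye(M * N) - mu_k * np.kron(Q_delta, B))
    T_k = np.eye(L_steps * M * N) - np.linalg.solve(P_k, C_k)
    print(mu_k, max(abs(np.linalg.eigvals(T_k))))   # spectral radius shrinks with mu

EH = np.kron(E, H)
print(np.linalg.matrix_power(EH, L_steps))          # zero matrix: E (x) H is nilpotent
```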

3.2 The coarse-grid correction

We now consider the iteration matrix $\mathsf{T}_{\mathrm{CGC}}$ of the coarse-grid correction with

$$\mathsf{T}_{\mathrm{CGC}} = \mathsf{I} - \mathsf{T}_C^F\,\tilde{\mathsf{P}}_{\mathrm{GS}}^{-1}\,\mathsf{T}_F^C\,\mathsf{C}$$

and write $\tilde{\mathsf{P}}_{\mathrm{GS}} = \left(\tilde{\mathsf{I}} - \mathsf{E}\otimes\tilde{\mathsf{H}}\right) - \mu\,\mathsf{I}_L\otimes\tilde{\mathsf{Q}}_\Delta\otimes\tilde{\mathsf{B}}$ with $\tilde{\mathsf{I}} := \mathsf{I}_{L\tilde{M}\tilde{N}}$. Then,

$$\left(\tilde{\mathsf{I}} - \mathsf{E}\otimes\tilde{\mathsf{H}}\right)^{-1} = \sum_{l=0}^{L-1}\left(\mathsf{E}\otimes\tilde{\mathsf{H}}\right)^{l}$$

exists, since $\mathsf{E}\otimes\tilde{\mathsf{H}}$ is nilpotent, and for $|\mu|$ small enough the inverse of $\tilde{\mathsf{P}}_{\mathrm{GS}}$ exists as well, according to Theorem 1.2.1 above. Therefore, we get

$$\tilde{\mathsf{P}}_{\mathrm{GS}}^{-1} = \left(\tilde{\mathsf{I}} - \mathsf{E}\otimes\tilde{\mathsf{H}}\right)^{-1} + \mathcal{O}(\mu).$$

Then, with $\mathsf{C} = \left(\mathsf{I} - \mathsf{E}\otimes\mathsf{H}\right) - \mu\,\mathsf{I}_L\otimes\mathsf{Q}\otimes\mathsf{B}$ we have

$$\mathsf{T}_C^F\,\tilde{\mathsf{P}}_{\mathrm{GS}}^{-1}\,\mathsf{T}_F^C\,\mathsf{C} = \mathsf{T}_C^F\left(\tilde{\mathsf{I}} - \mathsf{E}\otimes\tilde{\mathsf{H}}\right)^{-1}\mathsf{T}_F^C\left(\mathsf{I} - \mathsf{E}\otimes\mathsf{H}\right) + \mathcal{O}(\mu)$$

and therefore

$$\mathsf{T}_{\mathrm{CGC}} = \mathsf{I} - \mathsf{T}_C^F\left(\tilde{\mathsf{I}} - \mathsf{E}\otimes\tilde{\mathsf{H}}\right)^{-1}\mathsf{T}_F^C\left(\mathsf{I} - \mathsf{E}\otimes\mathsf{H}\right) + \mathcal{O}(\mu).$$

The inverse of the coarse-level preconditioner can be bounded by

$$\left\|\tilde{\mathsf{P}}_{\mathrm{GS}}^{-1}\right\| \le \frac{\left\|\left(\tilde{\mathsf{I}} - \mathsf{E}\otimes\tilde{\mathsf{H}}\right)^{-1}\right\|}{1 - |\mu|\,\left\|\left(\tilde{\mathsf{I}} - \mathsf{E}\otimes\tilde{\mathsf{H}}\right)^{-1}\right\|\,\left\|\tilde{\mathsf{Q}}_\Delta\otimes\tilde{\mathsf{B}}\right\|},$$

if $|\mu|$ is small enough, which in this context means if

$$|\mu| < \frac{1}{\left\|\left(\tilde{\mathsf{I}} - \mathsf{E}\otimes\tilde{\mathsf{H}}\right)^{-1}\right\|\,\left\|\tilde{\mathsf{Q}}_\Delta\otimes\tilde{\mathsf{B}}\right\|},$$

see Theorem 1.2.1 above. Together with the boundedness of the remaining factors this leads to linear convergence in $\mu$, as for the smoother. This allows us to write the iteration matrix of the coarse-grid correction as

$$\mathsf{T}_{\mathrm{CGC}} = \mathsf{T}_{\mathrm{CGC}}^{0} + \mathcal{O}(\mu) \quad\text{with}\quad \mathsf{T}_{\mathrm{CGC}}^{0} := \mathsf{I} - \mathsf{T}_C^F\left(\tilde{\mathsf{I}} - \mathsf{E}\otimes\tilde{\mathsf{H}}\right)^{-1}\mathsf{T}_F^C\left(\mathsf{I} - \mathsf{E}\otimes\mathsf{H}\right).$$

While the eigenvalues of $\mathsf{T}_{\mathrm{CGC}}$ again converge to the eigenvalues of $\mathsf{T}_{\mathrm{CGC}}^{0}$, the eigenvalues of the latter are not zero anymore. For a partial differential equation in one dimension, half of the eigenvalues of $\mathsf{T}_C^F\left(\tilde{\mathsf{I}} - \mathsf{E}\otimes\tilde{\mathsf{H}}\right)^{-1}\mathsf{T}_F^C\left(\mathsf{I} - \mathsf{E}\otimes\mathsf{H}\right)$ are zero for standard Lagrangian interpolation and restriction, because the dimension of the coarse space is only of half size. Therefore, the limit matrix $\mathsf{T}_{\mathrm{CGC}}^{0}$ has a spectral radius of at least $1$.

3.3 PFASST

We now couple both results to analyze the full iteration matrix of PFASST. Using $\mathsf{T}_{\mathrm{S}} = \mathsf{E}\otimes\mathsf{H} + \mathcal{O}(\mu)$ and $\mathsf{T}_{\mathrm{CGC}} = \mathsf{T}_{\mathrm{CGC}}^{0} + \mathcal{O}(\mu)$, we obtain

$$\mathsf{T} = \mathsf{T}_{\mathrm{S}}\,\mathsf{T}_{\mathrm{CGC}} = \left(\mathsf{E}\otimes\mathsf{H}\right)\mathsf{T}_{\mathrm{CGC}}^{0} + \mathcal{O}(\mu).$$

Again, the eigenvalues of the limit matrix $\left(\mathsf{E}\otimes\mathsf{H}\right)\mathsf{T}_{\mathrm{CGC}}^{0}$ are all zero, because $\mathsf{T}_{\mathrm{CGC}}^{0}$ is block lower triangular while $\mathsf{E}\otimes\mathsf{H}$ is strictly block lower triangular, so that their product is strictly block lower triangular as well. We can therefore extend Lemma ? to cover the full iteration matrix of PFASST.

The estimate for the spectral radius can be shown in an analogous way as the one for the smoother in Lemma ?. For this inequality to hold, we need to satisfy the two smallness conditions on $|\mu|$ stated above. Choosing the $\infty$-norm, we have $\left\|\mathsf{E}\otimes\mathsf{H}\right\|_\infty = 1$, so that the condition for the smoother part is directly satisfied. Furthermore, it is

$$\left\|\left(\tilde{\mathsf{I}} - \mathsf{E}\otimes\tilde{\mathsf{H}}\right)^{-1}\right\| \ge \rho\!\left(\left(\tilde{\mathsf{I}} - \mathsf{E}\otimes\tilde{\mathsf{H}}\right)^{-1}\right) = 1,$$

because each eigenvalue of $\tilde{\mathsf{I}} - \mathsf{E}\otimes\tilde{\mathsf{H}}$ is one and so is the smallest one.

We note that this result indeed proves convergence of PFASST under certain conditions, but it does not provide an indication of the speed of convergence. It rather ensures convergence for the asymptotic case $\mu \to 0$. To get an estimate of the speed of convergence of PFASST, a bound for the norm of the iteration matrix would be necessary. Yet, we could make the same observation as in Remark ?, but this is again not a particularly useful result.

4 The case $\mu \to \infty$

Working with larger and larger CFL numbers requires an additional trick and poses limitations on the spatial problem: in particular, the matrices $\mathsf{B}$ and $\mathsf{Q}_\Delta$ need to be invertible. We write

$$\frac{1}{\mu}\,\mathsf{C} = \frac{1}{\mu}\,\mathsf{I}_{LMN} - \mathsf{I}_L\otimes\mathsf{Q}\otimes\mathsf{B} - \frac{1}{\mu}\,\mathsf{E}\otimes\mathsf{H}$$

in slight abuse of notation (when compared to the case $\mu \to 0$, but those two cases will not be mixed here anyway). Similarly,

$$\frac{1}{\mu}\,\mathsf{P}_{\mathrm{Jac}} = \frac{1}{\mu}\,\mathsf{I}_{LMN} - \mathsf{I}_L\otimes\mathsf{Q}_\Delta\otimes\mathsf{B}$$

with

$$\frac{1}{\mu}\,\tilde{\mathsf{P}}_{\mathrm{GS}} = \frac{1}{\mu}\,\tilde{\mathsf{I}} - \mathsf{I}_L\otimes\tilde{\mathsf{Q}}_\Delta\otimes\tilde{\mathsf{B}} - \frac{1}{\mu}\,\mathsf{E}\otimes\tilde{\mathsf{H}},$$

again redefining the notation from above. Now, the following lemma can be easily shown: if $\mathsf{B}$ and $\mathsf{Q}_\Delta$ as well as their coarse-level counterparts are invertible, then

$$\mathsf{T}_{\mathrm{S}} = \mathsf{T}_{\mathrm{S}}^{\infty} + \mathcal{O}(1/\mu) \quad\text{and}\quad \mathsf{T}_{\mathrm{CGC}} = \mathsf{T}_{\mathrm{CGC}}^{\infty} + \mathcal{O}(1/\mu) \quad\text{as}\quad \mu \to \infty, \quad\text{with}\quad \mathsf{T}_{\mathrm{S}}^{\infty} = \mathsf{I} - \mathsf{I}_L\otimes\left(\mathsf{Q}_\Delta^{-1}\mathsf{Q}\right)\otimes\mathsf{I}_N.$$

The derivation of the limit matrices $\mathsf{T}_{\mathrm{S}}^{\infty}$ and $\mathsf{T}_{\mathrm{CGC}}^{\infty}$ follows more or less the reasoning of Section 3, with $1/\mu$ now playing the role of the small parameter. Likewise, the lower bounds for $\mu$ result from the same arguments as before, using Lemma 1.2.1 and Theorem 1.2.1 above.

4.1 The spectral radius

So far, the arguments have been very similar to the case of $\mu \to 0$, but now the eigenvalues of the limit matrix $\mathsf{T}_{\mathrm{S}}^{\infty} = \mathsf{I} - \mathsf{I}_L\otimes\left(\mathsf{Q}_\Delta^{-1}\mathsf{Q}\right)\otimes\mathsf{I}_N$ are not zero anymore, at least not in the general case. Yet, if we choose to define the preconditioner using the LU trick [20], i.e. $\mathsf{Q}_\Delta = \mathsf{U}^T$ for $\mathsf{Q}^T = \mathsf{L}\mathsf{U}$, we can prove the following analog of Theorem ?.

For $M$ iterations of the smoother we have with Lemma ?

$$\mathsf{T}_{\mathrm{S}}^{M} = \left(\mathsf{I} - \mathsf{I}_L\otimes\left(\mathsf{Q}_\Delta^{-1}\mathsf{Q}\right)\otimes\mathsf{I}_N\right)^{M} + \mathcal{O}(1/\mu) \quad\text{as}\quad \mu \to \infty.$$

For this equation to hold, both conditions of Lemma ? need to be satisfied. We cannot apply the same trick as in the proof of Theorem ?, but clearly

$$\left(\mathsf{I} - \mathsf{I}_L\otimes\left(\mathsf{Q}_\Delta^{-1}\mathsf{Q}\right)\otimes\mathsf{I}_N\right)^{M} = \mathsf{I}_L\otimes\left(\mathsf{I}_M - \mathsf{Q}_\Delta^{-1}\mathsf{Q}\right)^{M}\otimes\mathsf{I}_N.$$

With $\mathsf{Q}^T = \mathsf{L}\mathsf{U}$ and $\mathsf{Q}_\Delta = \mathsf{U}^T$ it is $\mathsf{I}_M - \mathsf{Q}_\Delta^{-1}\mathsf{Q} = \mathsf{I}_M - \mathsf{L}^T$, so that $\mathsf{I}_M - \mathsf{Q}_\Delta^{-1}\mathsf{Q}$ is strictly upper triangular and therefore nilpotent with $\left(\mathsf{I}_M - \mathsf{Q}_\Delta^{-1}\mathsf{Q}\right)^{M} = 0$. Therefore, $\mathsf{T}_{\mathrm{S}}^{M}$ is actually $\mathcal{O}(1/\mu)$ for $\mu \to \infty$ and so are its eigenvalues. The perturbation result then gives us for $L$ steps, $M$ nodes, $N$ degrees-of-freedom, and some perturbation matrix of order $1/\mu$

$$\rho\left(\mathsf{T}_{\mathrm{S}}^{M}\right) \le \frac{C}{\mu}.$$

This result shows that even the spectral radius goes to zero with the same speed as $\mu$ goes to infinity. Yet, the condition of requiring at least $M$ smoothing steps is rather severe, since standard simulations with PFASST typically perform only one or two steps here. Nevertheless, for at least $M$ smoothing steps asymptotic convergence of PFASST is shown even for $\mu \to \infty$.
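The nilpotency argument at the heart of this proof can be verified in a few lines, reusing `Q`, `Q_delta` and `L_lu` from the snippets above:

```python
S_limit = np.eye(M) - np.linalg.solve(Q_delta, Q)   # node part of the stiff limit of T_S
print(np.allclose(S_limit, np.eye(M) - L_lu.T))     # equals I - L^T: strictly upper triangular
print(np.linalg.matrix_power(S_limit, M))           # vanishes (up to round-off) after M steps
```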

Exactly the same result applies to the smoother alone. In order to avoid having to specify the interpolation and restriction matrices as well as the spatial problem, we show in Figures 3-6 the spectral radii of the iteration matrix of the smoother for $\lambda = -1$ and $\lambda = i$ (upper and lower figures) and larger and larger time-step sizes (with $\mu = \lambda\,\Delta t$), using $M = 3$ and $M = 7$ Gauß-Radau nodes (left and right figures). There are a few interesting things to note here: First, for the asymptotics it does not matter which $\lambda$ we choose, the plots look the same. Only for small values of $\mu$ do the spectral radii differ significantly, leading to a severely worse convergence behavior for complex eigenvalues. Second, the result does not depend on the number of time-steps $L$, so choosing $L = 1$ is reasonable here (and the plots do not change for larger $L$, which is a key difference to the results for $\mu \to 0$). Third, the estimate of the spectral radius is rather pessimistic, again. More precisely, already for $M = 3$ the linear slope of the estimate is met (and, actually, surpassed), while for $M = 7$ the convergence is actually faster. This is not reflected in the result above, but it shows that this estimate is only a rather loose bound. However, we note that in Section 6 we see a different outcome, more in line with the theoretical estimate. Note that in Figures 3-6 the constant is not included, so that the estimate depicted there is not necessarily an upper bound, but reflects the asymptotics.

Figure 3: $M = 3$ and $\lambda = -1$
Figure 4: $M = 7$ and $\lambda = -1$
Figure 5: $M = 3$ and $\lambda = i$
Figure 6: $M = 7$ and $\lambda = i$

For fixed $L$ and $M$ and the test problem, Figures 7 and 8 show the spectral radius of the iteration matrix of the smoother for different complex values of $\mu$, choosing the LU trick (Figure 7) and the classical implicit Euler method or right-hand side rule (Figure 8) for $\mathsf{Q}_\Delta$. Following Theorem ?, we choose $M$ smoothing steps. Note that the scale of the colors is logarithmic, too, and all values for the spectral radius are considerably below $1$. We can nicely see how the smoother converges for $\mu \to 0$ as well as for $\mu \to \infty$ in both cases, although the effect is much more pronounced in the LU case. However, we also see that there is a banded area where the spectral radius is significantly higher than in the regions of large or small $|\mu|$. This is most likely due to the properties of the underlying preconditioner and can be observed on the real axis for standard SDC as well, see [20]. Oddly, this band does not describe a circle but rather a box around the origin, hitting the axes with its maximum at moderate values of $|\mu|$. It seems unlikely that a closed description of this area, i.e. filling the gap between $\mu \to 0$ and $\mu \to \infty$, is easily possible.
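This banded structure can be reproduced qualitatively with a short scan over the complex plane, here for a single time-step ($L = 1$) and the scalar test problem ($\mathsf{B} = 1$), again reusing the matrices from the snippets above:

```python
grid = np.linspace(-15.0, 15.0, 121)
rho_map = np.empty((grid.size, grid.size))
I_M = np.eye(M)
for a, x in enumerate(grid):
    for b, y in enumerate(grid):
        mu_c = x + 1j * y                           # complex CFL number
        T_loc = I_M - np.linalg.solve(I_M - mu_c * Q_delta, I_M - mu_c * Q)
        rho_map[b, a] = max(abs(np.linalg.eigvals(T_loc)))
# rho_map can be visualized, e.g. with matplotlib's imshow on a logarithmic color scale
```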

Figure 7: LU trick
Figure 8: implicit Euler

4.2 Smoothing and approximation property

Another consequence of Lemma ? is given in the following, emphasizing the link between PFASST and standard multigrid theory even more [25]. Note that we now interpret $1/\mu$ as the small perturbation parameter, so that in the following results the appearance of $\mu$ in the denominator and numerator is counterintuitive.

We can write

and