A minimal communication approach to parallel time integration
We explore an approach due to Nievergelt of decomposing a time-evolution equation along the time dimension and solving it in parallel with as little communication as possible between the processors. This method computes a map from initial conditions to final conditions locally on slices of the time domain, and then patches these operators together into a global solution using a single communication step. A basic error analysis is given, and some comparisons are made with other parallel in time methods. Based on the assumption that parallel computation is cheap but communication is very expensive, it is shown that this method can be competitive for some problems. We present numerical simulations on graphics chips and on traditional parallel clusters using hundreds of processors for a variety of problems to show the practicality and scalability of the proposed method.
Consider two different computations.
Implemented in the standard, straightforward way, the problems and require about the same number of floating point operations. However, each operation in depends on the one before it, while the operations of can all be performed independently. In particular, relies on the clock speed of a single processor, while depends on the number of processors. Hardware is moving in the direction of increased concurrency and parallelism, and computational science is moving with it, so that processing units are proliferating and becoming very cheap, while clock speed will remain expensive.
The catch in the description of is the word “uncoupled.” For useful problems in scientific computing, problems are rarely completely uncoupled, which means that different parallel processes will have to communicate somehow. That leads us to consider a third class of problem, namely
In addition to the floating point operations, problems of type require a great deal of communication, which is already expensive on current hardware. On future hardware we anticipate that this cost will be the dominant one, and that good algorithms will avoid problems of this type as much as possible.
In this paper we take these trends in hardware seriously, perhaps even taking them to an extreme. In particular, we assume that computations of type are not just very cheap but essentially free. We assume that problems of type are more expensive. Most importantly, we assume that problems of type , namely communication, are very expensive, so expensive that we will do a thousand or a million problems of type in order to avoid doing one problem of type .
In 1964 Nievergelt presented an approach to the solution of time-dependent differential equations that divides the time domain into slices which are assigned to different processors . His approach never attracted much attention because of its high computational cost in terms of floating point operations, but we revisit it here with an eye toward communication cost instead, a factor which was not predicted to be important when Nievergelt was writing. We present three numerical examples for this method (which we believe are the first actual parallel implementations of the method), and we argue that the Nievergelt approach minimizes total communication among all parallel in time algorithms.
The basic idea of the Nievergelt approach is to construct a propagation operator on each slice, that is, a map from initial conditions to final conditions, in a way that is embarrassingly parallel but requires a great deal of computation. These maps are composed with a single reduce operation of type at the very end of the computation. As presented here this method has only limited applicability and in particular is difficult to apply to large systems of nonlinear equations. However, because of its simplicity and its lack of communication, we present it as an instructive example of what kinds of algorithms will minimize communication in future computing environments where parallel computations will become much cheaper in comparison to communication.
Most parallel algorithms for evolution equations exploit parallelism in space, often with very impressive speedup and scalability. Here we focus on parallelism in time, and in particular we will focus on applications where the computational resources are abundant and total time to solution is the primary concern.
One way to exploit parallel computation in the time integration is to use a predictor–corrector framework where the correctors run in parallel one or more steps behind the predictors . The primary purpose of parallelism in this case is to improve the accuracy of the time integration rather than to speed it up, and this method is not suitable for use on a large number of processors.
Some methods based on the matrix exponential are well-suited to parallelism and require minimal communication between processors. One early effort in this direction is from Gallopoulos and Saad , who describe a parallel Krylov subspace method for approximating the matrix exponential. A more recent approach is the paraexp method of Gander and Güttel for linear initial value problems with constant coefficients , which uses a time-domain decomposition together with a “near-optimal exponential propagator” to parallelize the solve, with results that show good speedup on up to eight processors. A similar method is due to Hochbruck and Ostermann , where again a linear initial value problem can be split into several independent problems, where the division is based on quadrature of the variation-of-constants formula. These approaches depend on the matrix exponential, so that they are not suitable for problems with a coefficient that varies in time, while we will see that the Nievergelt approach is suitable for such a problem. Similarly, the contour integral quadrature approach of Sheen et. al.  relies on the Laplace transform and so cannot be used with varying coefficients, and in addition this method is restricted to parabolic problems. The similar Dunford-Cauchy integral approach to approximating the matrix exponential in  has similar restrictions. We also mention that none of the above papers include numerical results on more than eight parallel processors.
The parareal scheme has been applied on a large number of processors, is suitable for non-constant coefficient problems, and has been used very effectively to get speedup for a variety of time-dependent problems . In this approach the time domain is split into subintervals which are each assigned to processors. An iteration involving a fine time-stepping scheme (which determines the desired accuracy) and a coarse scheme (which facilitates fast transfer of information between subdomains) is repeated until the error is small. The parareal methodology has been applied to many types of problems with great success, including for problems in structural and fluid dynamics in  and , for an ocean model in , and for reaction–diffusion problems in , among many other practical and more theoretical studies. The computational cost of the parareal method can be impressively reduced by combining it with the Spectral Deferred Correction method, but this does nothing to reduce the cost of communication .
The approach of Nievergelt that we discuss here is similar to the parareal method in terms of the target applications, namely small systems of differential equations integrated over a very long time interval where spatial parallelization is not easy, but computational resources are abundant. However, the Nievergelt approach requires the same amount of communication as a single iteration of the parareal method, measured in terms of total amount of data communicated (or in terms of number of sends and receives). Although minimal communication is our main goal, we also note that this method will not require a coarse propagator, something that is not always easy to construct. Finally, the parareal algorithm as well as many of the matrix exponential or Laplace transform based approaches work relatively poorly or not at all for hyperbolic problems , while we will see that the Nievergelt approach behaves well for the wave equation.
The price we pay for all of these good properties is that the computational cost of Nievergelt is very high when measured simply in terms of floating point operations. However, almost all of the computations can be done in parallel, so given our (admittedly extreme) assumptions on the computational environment, the Nievergelt approach is competitive for some problems.
In the next section we approach questions of accuracy and scalability in the context of a very simple numerical model problem in order to clarify the properties of the Nievergelt method. Then in Section 4 we present some results on graphics processing units, for which the Nievergelt algorithm is well suited, and discuss the costs and potential parallel speedup of the method in more depth. In Section 5, we take a slightly more theoretical look at the method and provide some error analysis. We apply the method to some larger and slightly more realistic problems on standard parallel hardware in Section 6, and we end with some concluding remarks.
3A simple example
To make the discussion concrete, consider the model nonlinear initial value problem
which has exact solution
We will avoid the singularity at time and consider this problem on the time interval with final time . We assume the quantity of interest is simply and we calculate all our (absolute) errors with respect to that.
Now we subdivide into subparts where . On each part we consider an initial value problem
on . Of course in general is unknown, so we conceptually solve the problem for all possible , constructing not just a scalar value but a mapping that takes initial values to final values.
We assume is known to lie in some range . We call the initial value space and sample points inside to get a discrete set . Then, for each we run the initial value problem . When the initial value becomes known (communicated to us from the previous processor in the final step of the algorithm), we use interpolation to calculate .
To understand how the interpolation process affects accuracy, we represent the initial value space as a Lagrange interpolant on the Chebyshev nodes in this interval. We present some accuracy results in Table ?. Intuitively we wish to balance the time–discretization errors with interpolation errors that are of the same order to get an efficient method. Since Chebyshev interpolation is spectrally accurate, we see in Table ? that we need just 5 or 6 Chebyshev points is enough to get accuracy. Also see Figure 1, which shows the component of error related to interpolation as the time integration proceeds. We remark that the number of subdomains does not significantly affect these results.
For comparison we also solve this problem using a parareal method . The parareal algorithm proceeds by iteratively applying the correction equation
where is the initial value for time slice (or processor) in iteration , is a standard time stepping method with a short timestep, and is a (cheap) coarse time stepping method with larger timestep . To understand the two compared algorithms visually, see Figures Figure 2 and Figure 3, where we present the scheduling of the Nievergelt and parareal algorithms as in the figures in . Here we have made an analogy between the fine propagator of parareal and the construction of the map from initial to final conditions in the Nievergelt approach, but we should note that this latter operation is much more expensive (by a constant factor of ). However, like the fine propagation of parareal, the construction of this map is embarrassingly parallel, requiring no synchronization. Also, we have compared the coarse propagator of parareal to the interpolation in Nievergelt. These are quite different operations, and it is not obvious exactly how their costs compare, but they play similar roles in the algorithm. It is clear from these figures that Nievergelt involves significantly fewer communications, especially as the number of parareal iterations increases.
The results of the comparison are given in Table ?. Here we use backward Euler as both the fine and coarse propagator, with fine discretization size and coarse discretization size , and represents the number of parareal iterations used. More iterations gives better accuracy, but recall that each iteration of parareal requires as much communication as the entire Nievergelt algorithm, and in Table ? we see that even for this simple problem it usually takes a few iterations to get to discretization error.
To end this section we summarize the Nievergelt approach in pseudocode in Figure ?.
4A graphics card implementation and cost model
One possible use case for the Nievergelt method is on general purpose graphical processing units, which are highly parallel and energy efficient computing units that are widely and cheaply available due to their use in gaming hardware. Effectively using this hardware is a challenge because they are most efficient when the computational task can be broken down into very many independent threads, which is precisely what the Nievergelt algorithm does. In Table ? we show the speedup of using the parallel capabilities of the GPU over the same computation performed in serial on the CPU, with times recorded in microseconds, for the nonlinear scalar initial value problem, for various timestep sizes . We partition the time interval into time slices, for each of which we compute the map from initial to final conditions independently. For each of the time slices we propagate initial conditions so that we can use completely independent threads on the GPU. The computation is finished by doing some cheap interpolation on the CPU. We are using backward Euler for the timestepping and Chebyshev interpolation with nodes (chosen to insure that the total error in the Nievergelt method is comparable to time discretization error) and doing all computations in single precision. In each case the error for the Nievergelt implementation is comparable to the discretization error due to timestepping. The timing results include all the time required to allocate memory, build the initial value arrays, transfer data to and from the GPU, and to perform the interpolation that evaluates the map from initial to final conditions. The CPU and GPU tests are run on the same machine, which has a six core AMD Opteron CPU and an nVidia Tesla C2075 GPU. This experiment demonstrates the potential for using the inherent parallelism of the graphics card to solve the problem faster than is possible with the CPU of the same machine, even though many more floating point operations are required.
One possible model for the cost (in terms of time to solution) for the Nievergelt method using the GPU implementation is
where is the time to advance a single timestep using backward Euler for a single initial value problem, is a measure of the cost of applying the map that has been constructed (the final communication and interpolation step), and represents some fixed GPU startup costs. Predicting the time to solution for the experiments in Table ? using parameter values from Table ? shows good agreement, as you can see in the last column of Table ?.
The corresponding serial cost is simpler, given by
The speedup can then be written
where we have ignored the lower order term because we are interested in the extreme case of large and and we have defined and . In order to balance time discretization with interpolation errors, we will assume for . In fact in model problem with Chebyshev interpolation we could choose but we will not assume we have such an exponentially convergent scheme in general. Then if for fixed problem size we choose we get
which implies that for , the larger the problem is, the more speedup we can get.
Similar to the cost model analysis of a parareal–type method in , we see that the traditional parallel efficiency for our method is not as good as usual space–based parallelism. However, the goal is to make the best use of an available GPU that otherwise would not be doing useful work, and we see that it is possible to get speedup in this case.
Nievergelt proves some error bounds on his approach in , using forward Euler and linear interpolation. We want to say something about the error in a slightly more abstract setting, with correspondingly abstract assumptions.
To begin, fix and consider the interval . Let be the exact solution operator that maps an initial condition at to a final condition at , and let be a standard time-discretization method that approximates . We will assume that our problem is well posed in the sense that is Lipschitz as a function of the initial conditions,
We evaluate for many initial conditions in order to construct an interpolating function which approximates the map using some standard interpolation technique. (The map in turn approximates the exact operator , which is of course what we really want.) Denote the true solution at by , but of course we will instead have an approximation to this value communicated from a previous processor, denoted by . We will also use and without superscripts to denote propagators on the whole time interval , and will be the Nievergelt method used over the entire time domain (including several interpolations).
We begin our analysis by considering the error at the end of the fixed time domain :
where the first term is recognized as a standard interpolation error, the second term is a standard time discretization error, and the third term depends on how accurately the solution was computed on the previous time-slice . Beginning with this last term and using , we have
which contains exactly the same type of expression as , so that we can use it recursively to get an expression for the final error,
We will require that our time discretization method satisfies
where is independent of the time interval and is the order of error in the underlying time discretization method,
The standard error bounds for many time discretization methods (including linear multistep methods) have growing exponentially with the length of the time interval that we are integrating over, so that is not hard to satisfy .
The analogous condition for the interpolation is easy to state but perhaps harder to interpret. It is
where again represents the order of error of the underlying interpolation on the initial value space:
Here we are assuming that the interpolating function to the time discretization operator is more accurate if the time integration is over a short interval than if it is over a long interval (cf. ). If we use polynomial interpolation, standard results give us
where is the order of the polynomial. The quantity in absolute value can be interpreted as the above, and in general can be expected to grow very quickly in , so that is also a very plausible assumption for a wide range of problems—in particular it certainly holds for our simple model problem with linear interpolation.
We put these simple results together in the following
For linear differential equations, the solution is also linear as a function of its initial condition, so that the interpolation error can be made zero using only enough points to construct a basis for the initial condition space, as shown in . In the remainder of the paper we will focus on this simpler case.
We close this section with a few comments on the cost required to achieve this error. The Nievergelt algorithm is very computationally costly in conventional terms. In a sense, we trade off everything else against communication costs, assuming those to be dominant. The result is that floating point operation counts and storage costs are very high. For example, when applied to a nonlinear PDE discretized with unknowns, the initial value space will have dimension . If you sample uniformly in this high–dimensional space, with points for each of the unknowns, then the total number of operations to construct the propagation operator (or storage to store the final conditions) will grow as . If parallel operations are truly free we may be willing to pay this cost, or in some cases we may be able to sample much more efficiently than uniformly.
In addition to the computational cost, in many dimensions the choice of how to sample the initial value space and how to perform the interpolation step is not straightforward. If some suitable interpolation can be applied, than the Nievergelt method can in principle be applied and will result in parallel speedup, but in many cases this will not be practical and some other algorithm must be sought.
However, in the remainder of this paper we will consider linear problems, where the storage and cost issues can be greatly reduced and the interpolation issue disappears completely. In particular, if after space discretization we have the system
then the solution is an affine function of the initial condition . With points in the spatial discretization, then for the unit basis vectors , and we can solve once with a zero initial condition and solve a homogeneous version of times, once for each basis vector. This can be done with no knowledge of the true initial condition coefficients . Then we reconstruct the solution
We use the notation (rather than ) because there is no need for interpolation in this linear setting and this technique introduces no additional error beyond rounding error.
6The heat equation and wave equation
Having examined the Nievergelt approach numerically on a simple scalar initial value problem and done some analysis, we now turn to some results for slightly more complicated problems. First we consider the variable coefficient heat equation
in one dimension with homogeneous Dirichlet boundary conditions and with and the initial conditions chosen so that the true solution is given by
After discretization in space using standard centered finite differences and spatial points, can be written in the form . On each parallel time slice we solve this problem times with different initial conditions (and with only once) and reconstruct the solution using and a single communication step.
Some numerical results are shown in Table ?, where we can see a speedup of about 15. Note that we are able to get this speedup partially because we are willing to accept poor spatial resolution. If more points in space are required, we will need many more processors to see speedup, but there is no real barrier to doing this provided the processors are available. Scaling results for a parareal implementation of this problem are shown in Table ?. See also Tables ? and ? which achieve similar speedups for larger problems with more processors.
Comparisons of the communication times in Tables ? and ? show that we do not, in fact, live in a world where parallel floating point operations are free compared to communication time, and the Nievergelt algorithm is competitive with parareal only if we require an unrealistically large number of parareal iterations. In Table ? we simulate additional latency by enforcing a delay of one millisecond for every message that is received from another processor, and it is clear that in this simulated high-latency environment the additional communication steps of the parareal method are a drawback, especially for problems where a large number of iterations might be required. As parallel computations become cheaper relative to communication, the Nievergelt approach becomes more and more promising in comparison to the parareal method.
The parareal algorithm is reported to work relatively poorly for second–order problems and for hyperbolic problems , and many of the other time-parallel methods we considered in the introduction are restricted to parabolic problems . As our final numerical example we consider the Nievergelt method applied to these types of problems. In , a version of the parareal algorithm for second–order and hyperbolic algorithms is presented—however, this method requires the communication of more initial conditions from more distant processors (not just the immediately preceding time slice), and so is not competitive in trying to minimize communication.
We will use an example problem from , Program 19, which is the wave equation
with homogeneous Dirichlet boundary conditions and with the initial conditions such that the solution is a single peaked wave propagating to the left. This problem is discretized in space with a spectral method on Chebyshev points, and time-stepping is with the explicit second-order leapfrog method. We will use a varying number of spatial points but always choose a timestep of to satisfy a CFL condition. See  for details for this model problem.
The time-domain decomposition proceeds much as it did for the heat equation, the only important difference being that we have two initial conditions (data from the previous timestep and the step before that) to deal with and to transfer from subdomain to subdomain. This has the effect of doubling the initial value space, that is, if we have spatial points, we need to propagate 160 initial conditions to get the full map from initial conditions to final conditions.
Numerical results are shown in Tables ? and ?.
We have examined the method of Nievergelt for the parallel solution of time-dependent problems, a method which essentially builds the time propagation operator on each time-slice in parallel with no knowledge of the initial conditions.
It is clear from a viewpoint of counting arithmetic operations that the Nievergelt approach is not efficient, and it may never be suitable for nonlinear problems with a large number of degrees of freedom. However, we are moving to a future where computational cores are essentially free—hundreds of them may be sitting unused on your desktop, and if there is any way to use them profitably it may be worthwhile even if the efficiency is poor. In addition, for some applications the total time to solution is the only concern, and the amount of computational resources spent is less important. It is this kind of hardware environment and these kind of applications that the parareal algorithm was invented to deal with. The Nievergelt method compares well to the parareal algorithm but has the important advantage of always requiring less communication, and it is applicable to more situations than methods based on exponential quadrature or Laplace transforms.
In our work above, we have shown speedup of 3 on 32 processors for a scalar ODE problem, a speedup of 10 on 1024 processors for a variable coefficient heat equation and speedup of 5 on 256 processors for a wave equation. These results are somewhat worse than those in published parareal implementations, but roughly comparable. The original parareal paper  predicted speedups of 8 to 18, and in  the authors get speedup of 5 or 6 on 200 processors. The best speedups in  and  are from 4 to 6. For the different paraexp method, speedups are reported to be from 4 to 6 . In addition to speedup on traditional parallel clusters, we have seen that implementing the Nievergelt algorithm on a GPU provides modest speedup, with this case being of particular interest because many computational scientists will have a GPU sitting idle in their desktop while they wait for a CPU to compute their results.
The assumptions that we are using in this work, namely that parallel computation is so cheap as to be nearly free when compared to communication costs, are not satisfied on traditional parallel machines. However, to some extent this assumption holds on modern graphics hardware, and we have shown that algorithms can be designed for a future world where this kind of pattern is more common. Even if the very high computational and storage costs of the Nievergelt method make it impractical in many cases, we argue that something like it could influence at least part of the design of future parallel methods and that understanding this extreme case may point the way to exploiting future hardware where communication costs dominate computational costs. Aside from the model problem at the beginning, we have focused on linear problems, because in this case it is easier to construct and represent the map from initial conditions to final conditions. In the general nonlinear case, using this map will require approximating it and using some high-dimensional interpolation scheme, which may be prohibitively expensive. However, there is no additional communication necessary for nonlinear problems.
We have divided our time domain into slices and assigned each slice to a single processor on a parallel machine, but it is worth noting that there is a great deal more parallelism inherent in the problem than we have exploited. The map from initial to final conditions is constructed by integrating in time a large number of initial conditions given by some discretization of the initial condition space, and in principle each of these time integrations could be done on a separate processor or processing core in parallel. The lack of need for synchronization in the whole algorithm (except at the final step) suggests the possibility of a distributed computing model for some applications.
There are several possible directions for future research. One is to combine the Nievergelt approach with a parareal method in some way, to get a balance of the communication costs versus the greatly added computation of Nievergelt. Another is to do more realistic nonlinear problems so as to more carefully consider the role of interpolation in the method. Ongoing research is to combine the traditional parallel implementation with the GPU implementation to see the potential of the method on heterogeneous systems.
This work was supported in part by the U.S. National Science Foundation under Grant Number DMS-07-39382. Computational resources were provided by the Louisiana Optical Network Initiative and the Stellar Group at the Center for Computational Technology at Louisiana State University.
- K.E. Atkinson, An introduction to numerical analysis, Wiley (1989).
- A.J. Christlieb, C.D. MacDonald, and B.W. Ong, Parallel high-order integrators, SIAM J. Sci. Comput. 32 (2010), pp. 818–835.
- M. Duarte, M. Massot, and S. Descombes, Parareal operator splitting techniques for multi-scale reaction waves: numerical analysis and strategies, Math. Model. Numer. Anal. 45 (2011), pp. 825–852.
- C. Farhat and M. Chandesris, Time-decomposed parallel time-integrators: theory and feasibility studies for fluid, structure, and fluid–structure applications, Int. J. Numer. Meth. Engng. 58 (2003), pp. 1397–1434.
- C. Farhat, J. Cortial, C. Dastillung, and H. Bavestrello, Time-parallel implicit integrators for the near-real-time prediction of linear structural dynamic responses, Int. J. Numer. Meth. Engng. 67 (2006), pp. 697–724.
- P.F. Fischer, F. Hecht, and Y. Maday, A parareal in time semi-implicit approximation of the Navier-Stokes equations, in Domain Decomposition Methods in Science and Enginnering, e.a. Timothy J. Barth, ed., Lecture Notes in Computational Science and Engineering, vol. 40, Springer, 2005, pp. 443–440.
- E. Gallopoulos and Y. Saad, Efficient solution of parabolic equations by Krylov approximation methods, SIAM J. Sci. Stat. Comput. 13 (1992), pp. 1236–1264.
- M. Gander and M. Petcu, Analysis of a modified parareal algorithm for second-order ordinary differential equations, AIP Conference Proceedings 936 (2007), pp. 233–236.
- ———, Analysis of a Krylov subspace enhanced parareal algorithm for linear problems, ESAIM Proc. 25 (2008), pp. 114–129.
- M.J. Gander, Analysis of the parareal algorithm applied to hyperbolic problems using characteristics, Bol. Soc. Esp. Mat. Apl. 42 (2008), pp. 5–19.
- M.J. Gander and S. Güttel, Paraexp: A parallel integrator for linear initial-value problems, SIAM J. Sci. Comput. (2013), p. to appear.
- M.J. Gander and S. Vandewalle, Analysis of the parareal time-parallel time-integration method, SIAM J. Sci. Comput. 29 (2007), pp. 556–678.
- I.P. Gavrilyuk and L. Makarov, Exponentially convergent algorithms for the operator exponential with applications to inhomogeneous problems in Banach spaces, SIAM J. Numer. Anal. 43 (2005), pp. 2144–2177.
- P. Henrici, Discrete variable methods in ordinary differential equations, Wiley (1961).
- M. Hochbruck and A. Ostermann, Exponential Runge-Kutta methods for parabolic problems, Appl. Numer. Math. 53 (2005), pp. 323–339.
- J.L. Lions, Y. Maday, and G. Turinici, Résolution d’EDP par un schéma en temps pararéel, C.R. Acad. Sci. Paris 332 (2001), pp. 661–668.
- Y. Liu and J. Hu, Modified propagators of parareal in time algorithm and application to Princeton Ocean model, Int. J. Numer. Meth. Fluids 57 (2008), pp. 1793–1804.
- Y. Maday and G. Turinici, The parareal in time iterative solver: a further direction to parallel implementation, in Domain Decomposition Methods in Science and Engineering, e.a. Timothy J. Barth, ed., Lecture Notes in Computational Science and Engineering, vol. 40, Springer, 2005, pp. 441–448.
- M.L. Minion, A hybrid parareal spectral deferred corrections method, Comm. App. Math. and Comp. Sci. 5 (2010), pp. 265–301.
- J. Nievergelt, Parallel methods for integrating ordinary differential equations, Comm. ACM 7 (1964), pp. 731–733.
- D. Sheen, I.H. Sloan, and V. Thomée, A parallel method for time discretization of parabolic equations based on Laplace transformation and quadrature, IMA J. Numer. Anal. 23 (2003), pp. 269–299.
- L.N. Trefethen, Spectral Methods in MATLAB, SIAM (2000).