Resilience for Exascale Enabled Multigrid Methods
Abstract
With the increasing number of components and continued miniaturization, the mean time between faults in supercomputers will decrease. System-level fault tolerance techniques are expensive in both time and energy, since they are often based on redundancy. Classical checkpoint-restart techniques also reach their limits when the time for storing the system state to backup memory becomes excessive. Therefore, algorithm-based fault tolerance mechanisms can become an attractive alternative. This article investigates the solution process for elliptic partial differential equations that are discretized by finite elements. Faults that occur in the parallel geometric multigrid solver are studied in various model scenarios. In a standard domain partitioning approach, the failure of a core or a node will affect one or several subdomains. Different strategies are developed to compensate the effect of such a failure algorithmically. The recovery is achieved by solving a local subproblem with Dirichlet boundary conditions using local multigrid cycling algorithms. Additionally, we propose a superman strategy in which extra compute power is employed to minimize the time of the recovery process.
Institute for Numerical Mathematics (M2), Technische Universität München; Department of Computer Science 10, FAU Erlangen-Nürnberg; Corresponding author: huber@ma.tum.de
Key words: Fault Tolerance, Geometric Multigrid, Exascale, Massive Parallelism
1 Introduction
Future high performance systems will be characterized by millions of compute nodes executing up to a billion parallel threads. This compute power will be expensive due to the associated investment and operational cost, which will include energy consumption as a rapidly increasing factor. Hardware, software, and algorithmic components of such large scale computing systems are interdependent, and thus the reliability of each single component is necessary. Since this increasing complexity results in a higher probability of failures of any kind in the HPC system [8], strategies that mitigate or compensate for such failures are indispensable. A commonly used approach is to copy the data to backup disks on the I/O system and to restart the run with the stored data in case of a failure. These checkpoint/restart strategies need to collect and transfer data from and to all processors and are, in general, too costly to be attractive.
Geometric multigrid methods can deliver asymptotically optimal complexity and can be implemented with excellent efficiency on large scale parallel machines [2, 19, 18, 30]. Typical future runs involving multigrid computations will last from a few hours to weeks and use up to a billion threads. Therefore, an error-resilient methodology for handling failures will be necessary to obtain correct and efficiently computed solutions.
The types of failures and their treatment can be categorized into hardware-based fault tolerance (HBFT) [25, 17, 26, 24], software-based fault tolerance (SBFT) [5, 31, 16, 3], and algorithm-based fault tolerance (ABFT) [22, 7, 11]; for a general overview of fault tolerance techniques we refer to [8, 9, 10].
In this paper, we focus on algorithmic approaches for handling failures. ABFT improves the reliability of the HPC system by detecting failures and correcting the results, implementing the resilience in the algorithm itself. Originally, ABFT was proposed by Huang and Abraham [22] for systolic arrays, where checksums are used to monitor and reconstruct the persistency of the data involved in the algorithm. Later, it was extended to applications in linear algebra such as addition, matrix operations, scalar products, LU decomposition, transposition, and the fast Fourier transform [1, 4, 13, 23]. More recently, Davies and Chen [15] dealt efficiently with fault detection and correction during the calculation of dense matrix operations. For iterative linear and Krylov subspace solvers for sparse matrices, such as SOR, GMRES, and CG iterations, the previously mentioned approaches are not suitable, since they can result in high overhead for sparse linear algebra [28]; they were consequently adapted in [7, 12, 27, 29]. Cui et al. propose in [14] a technique that uses the structure of a parallel subspace correction method such that the subspaces are redundant on different processors and the workload is efficiently balanced. Further, an algebraic multigrid solver was analyzed in [11], where the most vulnerable components are identified and resilience techniques are explored.
Here, we investigate a fault-tolerant parallel geometric multigrid method. Similar to [14], we pursue fault-tolerant strategies that

- converge when a fault occurs, assuming it is detectable,
- minimize the delay in the solution process,
- minimize computational and communication overhead.
In order to achieve these goals, we study the consequences of failures for a geometric multigrid algorithm from the bottom up. By applying local correction methods, we recover a partial solution and use it such that the effect of the fault on the global solution procedure is minimized. The major difference from other approaches in ABFT [11, 14] is that we proceed without checkpointing data of the solution and instead recalculate the faulty part.
Our paper is organized as follows: In Sec. 2, we describe the model equation and briefly discuss the parallel hierarchical hybrid multigrid (HHG) framework that will serve as the basis of the study in this paper. Next, we introduce the failure scenario that is used to study the influence of a fault within a parallel geometric multigrid method and its effect on the convergence. In Sec. 3, we then develop local recovery strategies and demonstrate by numerical experiments how these improve the recovery behavior after the fault.
2 Faulty Solution Process
2.1 Model problem and geometric multigrid
This paper considers, for simplicity of notation, the Laplace equation with Dirichlet boundary conditions

(1)  $-\Delta u = 0$ in $\Omega$,  $u = u_D$ on $\partial\Omega$,

as model problem for the design and analysis of a fault recovery algorithm. Here, $\Omega \subset \mathbb{R}^3$ is a bounded polyhedral domain.
$\Omega$ is triangulated with an unstructured tetrahedral mesh that we denote by $\mathcal{T}_0$. From this initial coarse mesh, a hierarchy of meshes $\mathcal{T}_0, \ldots, \mathcal{T}_L$ is constructed by successive uniform refinement, as illustrated in Fig. 1.
The discretization of (1) uses conforming linear finite elements (FE) on $\mathcal{T}_\ell$, which leads canonically to a nested sequence of finite element spaces and a corresponding family of linear systems

(2)  $A_\ell u_\ell = f_\ell$,  $\ell = 0, \ldots, L$.
The Dirichlet boundary conditions are included in the linear systems (2).
This hierarchy will be used to set up an iterative multigrid solver and to define the error recovery strategy. Multigrid methods can achieve level-independent convergence rates with optimal complexity $\mathcal{O}(N)$, where $N$ is the number of unknowns, cf. [6, 20]. We apply multigrid correction schemes in V-, W-, or F-cycles with standard components to (2). Explicitly, we use linear transfer operators and a hybrid variant of a Gauss-Seidel updating scheme as smoother.
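To make these components concrete, the following self-contained sketch (our own 1D illustration, not HHG code) implements a geometric multigrid V-cycle for the 1D Poisson problem with Gauss-Seidel smoothing, full-weighting restriction, and linear interpolation:

```python
import numpy as np

def gauss_seidel(u, f, h, sweeps):
    # Lexicographic Gauss-Seidel for the 3-point stencil of -u'' = f,
    # keeping the Dirichlet values u[0], u[-1] fixed.
    for _ in range(sweeps):
        for i in range(1, len(u) - 1):
            u[i] = 0.5 * (u[i - 1] + u[i + 1] + h * h * f[i])
    return u

def residual(u, f, h):
    r = np.zeros_like(u)
    r[1:-1] = f[1:-1] - (2.0 * u[1:-1] - u[:-2] - u[2:]) / (h * h)
    return r

def v_cycle(u, f, h, nu1=3, nu2=3):
    # One V-cycle: pre-smoothing, coarse-grid correction by recursion,
    # post-smoothing; full weighting and linear interpolation as transfers.
    n = len(u) - 1
    if n <= 2:  # coarsest grid: one interior unknown, solve exactly
        u[1] = 0.5 * (u[0] + u[2] + h * h * f[1])
        return u
    u = gauss_seidel(u, f, h, nu1)
    r = residual(u, f, h)
    rc = np.zeros(n // 2 + 1)
    rc[1:-1] = 0.25 * r[1:-2:2] + 0.5 * r[2:-1:2] + 0.25 * r[3::2]
    ec = v_cycle(np.zeros(n // 2 + 1), rc, 2.0 * h, nu1, nu2)
    xf = np.linspace(0.0, 1.0, n + 1)
    xc = np.linspace(0.0, 1.0, n // 2 + 1)
    u += np.interp(xf, xc, ec)  # linear prolongation of the correction
    return gauss_seidel(u, f, h, nu2)
```

Applied to $-u'' = \pi^2 \sin(\pi x)$ on the unit interval, a few such cycles reduce the algebraic error below the discretization error, illustrating the level-independent convergence mentioned above.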
2.2 Hierarchical Hybrid Grids
The Hierarchical Hybrid Grids (HHG) framework [2, 19, 18] is designed to combine the flexibility of unstructured FE meshes with the performance advantage of structured grids in a block-structured approach.
The implementation is based on a domain partitioning that splits the mesh into primitives: vertices, edges, faces, and volumes. In the HHG data structure, each primitive is then refined regularly, resulting in a globally block-structured grid. For our later error recovery strategies, the domain partitioning approach is crucial, but the block-structured mesh could be generalized to fully unstructured meshes. The multigrid operations, such as smoothing, prolongation, restriction, and residual calculation, are organized such that they typically operate on a primitive itself and on its neighboring primitives via ghost layers. These operations are inherently local and suited for parallel computations on a distributed memory system using message passing with MPI. Here, the primitives are mapped to processes that execute the local operations. The data dependencies require a systematic exchange of the ghost layers. This functionality is provided in a transparent and highly optimized form in the HHG framework.
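As a schematic illustration of the ghost-layer idea (a serial 1D stand-in with hypothetical names, not the HHG API or its MPI exchange), each subdomain keeps one halo entry per side that mirrors the first interior value of its neighbor, so that smoothing can proceed purely locally once the halos are refreshed:

```python
import numpy as np

def exchange_ghosts(subdomains):
    # Refresh the halo entries of neighboring 1D subdomains: each subdomain
    # stores a ghost copy of its neighbor's nearest interior value.
    for left, right in zip(subdomains[:-1], subdomains[1:]):
        left[-1] = right[1]   # left's right halo <- right's first interior value
        right[0] = left[-2]   # right's left halo <- left's last interior value
    return subdomains

def local_jacobi_sweep(sub, f_sub, h):
    # A purely local smoothing step; its correctness at the interface relies
    # on exchange_ghosts having been called beforehand.
    new = sub.copy()
    new[1:-1] = 0.5 * (sub[:-2] + sub[2:] + h * h * f_sub[1:-1])
    return new
```

In HHG the analogous exchange is performed via MPI messages between the processes owning neighboring primitives.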
2.3 Fault Model
We assume that a failure in the solution process for (2) can occur at any time. For our study, we concentrate on a specific fault model under assumptions similar to [14, 21]. We restrict the analysis, for simplicity, to the case that only one process crashes. All strategies can be extended easily to failures of multiple processes, since they only rely on the locality of the fault.
Furthermore, we concentrate on the case of using V-cycles for the solution of (2). The input tetrahedral mesh $\mathcal{T}_0$ defines the partitioning used for parallelization in HHG. Each tetrahedron in $\mathcal{T}_0$ is mapped to a processor, including all the refined subdomain meshes contained in the coarsest-level element. Consequently, the number of subdomains and the number of processes are equal to the number of tetrahedra in $\mathcal{T}_0$.
If a process experiences a fault, the information of its subdomain is lost. In the context of this article, the faulty subdomain $\Omega_F$ is just a single tetrahedron of $\mathcal{T}_0$. The other tetrahedra constitute the healthy subdomain $\Omega_H$, i.e., those tetrahedra that are not affected by the fault. Healthy and faulty regions are separated by an interface $\Gamma$. In the finite element mesh, as implemented in HHG, the interface region contains the nodes living on faces, edges, and vertices of the input mesh. These data structures are responsible for handling communication in HHG and are thus stored redundantly in the form of ghost instances on several processors. Even if one of the instances is lost due to the fault, a complete recovery of the interface data is therefore always possible, and we thus implicitly assume that the data associated with the interface are unaffected by the fault.
In Fig. 2, the setup is illustrated for a computational domain with 16 million unknowns. The domain consists of 48 tetrahedral subdomains that are distributed to 48 processors. Then each subdomain includes 300 000 unknowns and, thus, the failure of a process causes the loss of information for 300 000 unknowns.
For our strategy, it is necessary that we can detect erroneous processes quickly and then adapt the solution procedure dynamically. Unfortunately, current supercomputer systems and fault-tolerant MPI extensions such as Harness FT-MPI (http://icl.cs.utk.edu/ftmpi/) or ULFM (http://faulttolerance.org/) do not yet support this functionality as ideally needed. For the purposes of this study, we suppose a failure is reported as soon as it occurs during a multigrid cycle. When a process crashes, we assign a new, until then unused, substitute process to take over its job. This assumes that a large scale parallel computation is started with a certain number of substitute processors initially being idle, very much like substitute players in a team sport match. The solution values in the faulty subdomain are set to zero as initial guess. Other initial guesses could also be used, such as data from previous checkpointing or values obtained by interpolation, but this will not be considered here in detail. After the local re-initialization of the problem, we continue with multigrid cycles in the solution process.
In a first experiment, we consider the performance of our multigrid iteration when it is simply continued after a fault has occurred. All numerical experiments are performed within the HHG framework introduced in Subsec. 2.2. We choose the computational domain and boundary data in (1) according to the setup described in Fig. 2. In the solution process, we apply V-cycles with three pre- and post-smoothing steps of the Gauss-Seidel smoother of Subsec. 2.1.
In Fig. 3, the residual is visualized on a cross section through the domain, together with the surface of the tetrahedron where the fault has occurred after 5 global iterations. Right after the failure and re-initialization, the largest residual error is clearly located in this tetrahedron. These large local error components are then transported over the whole domain in the course of the following multigrid iterations. Though each application of the smoother transports information only across a few neighboring mesh cells, multigrid employs coarser grids recursively. The smoothing on these grids leads to the global data exchange that is essential for the level-independent convergence of multigrid iterations. Therefore, though the residual norm is reduced efficiently by such iterations, we observe a pollution of the residual error across the whole domain. This can be seen in Fig. 3 on the right, where the overall residual error has been reduced after the additional V-cycle (note the scaling), but the error now pollutes the whole domain.
The numerical behavior is analyzed quantitatively in Fig. 4. After the fault, the residual norm jumps up by several orders of magnitude. If a complete checkpointing-recovery (CCR) of the lost values could be performed, it would fully restore the residual from before the fault. Note that this recovery, as marked in the diagram with no fault, introduces no additional computational effort in comparison to the situation without failure, but writing and reading checkpoint data would be too expensive for large scale computations. However, the failure introduces error components that can be reduced efficiently by the multigrid method, as can be seen in the residuals marked with fault. In the first cycles after the fault, we observe a pre-asymptotic convergence rate that is better than the asymptotic rate for roughly three cycles. This helps significantly to compensate for the fault. The round-off error limit is reached after a total of 20 V-cycles, as compared to the 16 V-cycles that were necessary in the unperturbed computation.
As expected, these effects are more drastic when the fault occurs at a later step of the iteration process. The situation of a fault after 7 iterations is displayed in Fig. 5 (left) and after 11 iterations in Fig. 5 (right). In those cases, the global residual is already quite small when the fault occurs, and we need 7 and 10 more iterations, respectively, to reach the rounding error limit.
3 Local Recovery Strategy
Avoiding the global pollution observed in Subsec. 2.3 motivates a recovery strategy. For the recovery step, we require here that it is local, i.e., that it can be performed without communication. For the overall efficiency of the recovery strategy, it is essential that it can be performed as quickly as possible. We therefore propose a superman strategy, which simply means that more resources are devoted to the recovery in an attempt to reduce the recovery time. This becomes especially attractive since the recovery procedure is local, i.e., the superman substitute processor can focus its attention on a single subdomain. Note that during the local recovery, the global solution process cannot continue unaffected, since this would require using values from the faulty subdomain. This access to the faulty subdomain must be prevented, since otherwise a global error pollution would occur. The values in the faulty subdomain only become available again when the local recovery has been completed.
Technically, the speedup of the superman strategy can be accomplished by additional parallelization. We propose that, e.g., a full (shared-memory) compute node is assigned to perform the local recovery for a subdomain that was previously handled by a single core. This can be accomplished by a dedicated OpenMP parallelization of the recovery process. Alternatively, a further domain partitioning of the faulty region might be used together with an additional local distributed-memory parallelization.
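A simple back-of-the-envelope model (our own sketch, not a formula from the text) makes the benefit explicit: the healthy processes idle while the substitute performs the local recovery, and that stall shrinks proportionally with the superman speedup factor.

```python
def recovery_delay(n_local_cycles, t_local_cycle, speedup):
    # Time the global iteration stalls during local recovery:
    # n_local_cycles local multigrid cycles, each of duration t_local_cycle,
    # accelerated by the extra resources of the superman strategy.
    assert speedup >= 1.0, "a speedup factor of 1 means no extra resources"
    return n_local_cycles * t_local_cycle / speedup
```

For instance, moving a recovery of two local cycles from one core to a node giving a sixfold speedup reduces a 6-second stall to 1 second under this model.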
We denote the speedup factor that is achieved with a superman strategy for the local recovery by $\eta$; i.e., for $\eta = 1$ there is no speedup, while in the limit $\eta \to \infty$ the recovery would cost no time. For the moment, let us assume such a perfect superman, so that the local recovery step is in this sense free. We define the following local subproblem

(3)  $-\Delta u_F = 0$ in $\Omega_F$

with Dirichlet boundary conditions on $\Gamma$ given by the values of the current iterate in the healthy domain. Under the assumptions of Subsec. 2.3, we set the lost values in the faulty region to zero and solve (3) before continuing with the global problem (1).
The subproblem (3) can in principle be solved by any method, e.g., the relaxation that is used as the multigrid smoother, a direct solver, Krylov subspace iterations, or multigrid cycles. We denote by $\nu_F$ the number of local solver iterations. After solving the subproblem (3) with sufficient accuracy, the solution process for the global problem (1) is resumed. The procedure is presented in Alg. 1.
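The following 1D sketch (our own illustration; a plain Gauss-Seidel loop stands in for the local solver of Alg. 1) shows the recovery step: the lost values are reset to zero and the local Dirichlet problem is solved with interface values frozen from the healthy region.

```python
import numpy as np

def local_recovery(u, f, h, lo, hi, sweeps=200):
    # Recover the faulty index range [lo, hi] of -u'' = f: reset the lost
    # values to zero and solve the local Dirichlet problem whose boundary
    # data u[lo-1], u[hi+1] come from the healthy region.
    u = u.copy()
    u[lo:hi + 1] = 0.0                # lost values: zero initial guess
    for _ in range(sweeps):           # local solver (multigrid cycles in the text)
        for i in range(lo, hi + 1):
            u[i] = 0.5 * (u[i - 1] + u[i + 1] + h * h * f[i])
    return u
```

If the healthy values already carry the converged solution, the local problem reproduces the pre-fault values in the wiped region up to the local solver accuracy; during the iteration, it recovers an approximation consistent with the current interface data.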
To quantify the number of multigrid cycles that are saved by using the local recovery strategy, we introduce a reference parameter, the cycle advantage, which we denote by $\kappa$. We assume a process experiences a fault after $k_F$ cycles. Then, we evaluate the residual after $k$ cycles with $k > k_F$, where $k$ is chosen such that the stopping criterion is fulfilled in the no-fault case. Let $r_{LR}$ and $r_{NR}$ denote the residual after a local recovery with $\nu_F$ iterations and the residual when no local recovery ($\nu_F = 0$) has been performed, respectively.

Then, it holds that

(4)  $\|r_{LR}\| \leq \|r_{NR}\|$,

since the local strategy improves the residual error. In the case of no local recovery, we need to apply $\kappa$ additional cycles such that

(5)  $\rho^{\kappa} \|r_{NR}\| = \|r_{LR}\|$,

where $\rho < 1$ is the convergence rate of one multigrid cycle. Thus, solving (5) for $\kappa$ yields

(6)  $\kappa = \frac{\log(\|r_{LR}\| / \|r_{NR}\|)}{\log \rho}$.

Note that for the calculation of the reference parameter $\kappa$ we need two computation runs, one with a fault and a local recovery strategy and one with a fault and without a local recovery strategy. A local recovery strategy with $\kappa = 0$ does not improve the residual error in comparison to no recovery strategy, whereas a larger $\kappa$ implies an improvement of the residual error by that many multigrid cycles. For example, if $\kappa = 5$, then the computation without a local recovery needs 5 additional cycles to achieve the same residual error as the local recovery strategy.
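As a direct transcription of (6), with r_lr and r_nr denoting the residual norms after k cycles with and without local recovery, and rho < 1 the multigrid convergence rate, the cycle advantage can be evaluated as:

```python
import math

def cycle_advantage(r_lr, r_nr, rho):
    # kappa from (6): the number of extra multigrid cycles the run without
    # local recovery needs to match the residual of the recovered run.
    # Since rho < 1 and r_lr <= r_nr, both logarithms have the same sign,
    # so kappa >= 0.
    return math.log(r_lr / r_nr) / math.log(rho)
```

For example, a recovery that shrinks the residual by a factor of rho**5 relative to no recovery yields a cycle advantage of 5.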
Let us consider again the example of Subsec. 2.3. We study five different local recovery strategies: Gauss-Seidel smoothing (Smth), Jacobi-preconditioned CG (PCG) iterations, local V-cycles, local W-cycles, and local F-cycles.
In Fig. 6, we present three cross sections through the computational domain together with the surface of the faulty tetrahedron. The left plot shows the residual error directly after the failure, the middle plot the residual error after the local recovery with one F-cycle, and the right plot the residual error after an additional global V-cycle following the local recovery. We observe two major advantages of local recovery strategies compared to no local recovery: the local recovery reduces the residual error in the defective tetrahedron (middle plot of Fig. 6), and the error pollution across the computational domain is much smaller (right plot of Fig. 6).
In Fig. 7, we consider the different local recovery strategies and their impact on the solution process. We observe that all local strategies improve the residual error directly after the fault in comparison to using no local recovery. The different cycle strategies, local V- (green), local W- (purple), and local F-cycles (orange), produce almost the same effect on the residual errors. However, using 10 PCG iterations (pink) or 10 Gauss-Seidel smoothing steps (light blue) results in only an almost negligible improvement compared to using no recovery at all. After fault and recovery, all strategies using a local multigrid solver exhibit a favorable pre-asymptotic convergence in the first three global iterations and then align with the convergence of CCR. The residual error of the smoothing and PCG strategies after three further iterations is similar to the approach without a local strategy; these strategies therefore need as many iterations as the no-local-recovery strategy to reach the prescribed accuracy, i.e., three additional iterations in comparison to the CCR strategy. Two local V-cycles improve the situation such that the delay in finding the solution is reduced to one iteration. A local correction by two W-cycles or F-cycles delivers almost the same residual error and reduces the delay compared to the two local V-cycles. An F-cycle is preferred, since its computational cost is lower than that of a W-cycle. Further, in Fig. 8, we vary the number of V-cycles for solving the subproblem on $\Omega_F$. The delay in finding the solution depends significantly on how accurately the subproblem is solved. One local V-cycle shortens the solution process by one iteration in comparison to the no-local-recovery strategy, two V-cycles by three iterations, and three V-cycles completely compensate the fault.
Additionally, we compare the size of $\kappa$ for the different local recovery strategies and for different iteration counts after which a failure occurs. We study faults after 5, 7, and 11 iterations, respectively, and smoothing, PCG iterations, and multigrid V-, W-, and F-cycles as local recovery strategies. The V-cycle convergence rate $\rho$ has been evaluated numerically. The cycle advantage is measured after $k = 15$ iterations and presented in Tab. 1.
Tab. 1: Cycle advantage for the different local recovery strategies and for faults after 5, 7, and 11 iterations.

Strategy      Fault after 5 iter.   Fault after 7 iter.   Fault after 11 iter.
CCR                 4.164                 6.445                 10.999
1 V-cycle           1.637                 1.646                  1.678
2 V-cycles          3.075                 3.111                  3.225
3 V-cycles          4.069                 4.445                  4.643
4 V-cycles          4.165                 5.693                  5.970
5 V-cycles          4.165                 6.397                  7.242
1 W-cycle           1.825                 1.780                  1.747
2 W-cycles          3.425                 3.404                  3.380
3 W-cycles          4.147                 4.967                  4.951
4 W-cycles          4.164                 6.249                  6.458
5 W-cycles          4.165                 6.444                  7.911
1 F-cycle           1.828                 1.781                  1.748
2 F-cycles          3.426                 3.405                  3.381
3 F-cycles          4.147                 4.967                  4.951
4 F-cycles          4.164                 6.249                  6.458
5 F-cycles          4.165                 6.444                  7.911
2 PCG               0.007                 0.012                  0.039
5 PCG               0.026                 0.043                  0.104
10 PCG              0.037                 0.072                  0.183
10 Smth             0.105                 0.147                  0.249
20 Smth             0.172                 0.236                  0.387
30 Smth             0.221                 0.298                  0.482
40 Smth             0.260                 0.346                  0.554
50 Smth             0.291                 0.384                  0.611
The residual error of the CCR strategy after 15 iterations is still above the round-off limit, such that the convergence has not yet saturated due to round-off errors. In the case of no local recovery, we need 4.164, 6.445, or 10.999 additional V-cycles when the failure occurs after 5, 7, or 11 iterations, respectively, in order to achieve the same accuracy as with the CCR strategy. This is because the jump in the residual introduced by the fault is larger the later the fault occurs. These values are also the maximal cycle advantages that can be achieved by a local recovery strategy. For the fault after 5 iterations in Tab. 1, we obtain a similar $\kappa$ for a local recovery with 4 V-cycles, 3 W-cycles, or 3 F-cycles, which is almost as good as the CCR strategy. For smoothing and PCG iterations, only a marginally small improvement can be observed in comparison to using multigrid cycles or the CCR strategy. As expected, for the fault after 7 iterations, 5 W- or F-cycles, and for the fault after 11 iterations, more than 5 W- or F-cycles are necessary to achieve a $\kappa$ similar to the CCR strategy. For the other strategies (smoothing and PCG iterations) no significant improvement is obtained. Again, we observe that W-cycles and F-cycles yield similar results; thus, F-cycles are preferred to W-cycles due to computational cost arguments.
4 Conclusion and Outlook
This paper gives a first insight into constructing a fault-tolerant multigrid solver. It is shown that geometric multigrid solvers are inherently suited to deal with failures of processes. A failure results in the loss of the values of a subdomain. To recover these lost values, local subproblems with Dirichlet boundary conditions are solved by various strategies, ranging from relaxation schemes and Krylov subspace methods to multigrid cycles. Further, the local solves are accelerated by a superman strategy through additional parallelization of the recovery. We introduce the reference parameter cycle advantage, which quantifies to what extent a recovery strategy improves the time-to-solution, measured in cycle applications, in comparison to using no recovery. Only multigrid cycles can treat the local subproblem efficiently and even reach the same accuracy as the complete checkpointing recovery, but with minimal access to backup memory.
In future work, we will extend this basic idea to problems with more unknowns and incorporate the local strategy into a global recovery strategy that balances the waiting time of all processors. We will extend our considerations to problems with non-vanishing right-hand side and analyze their recovery. We will further develop acceleration strategies for the local recovery and include them in the HHG framework together with fault-tolerant MPI variants.
References
[1] J. Anfinson and F. T. Luk. A linear algebraic model of algorithm-based fault tolerance. IEEE Transactions on Computers, 37(12):1599–1604, 1988.
[2] B. K. Bergen and F. Hülsemann. Hierarchical hybrid grids: Data structures and core algorithms for multigrid. Numerical Linear Algebra with Applications, 11(2-3):279–291, 2004.
[3] W. Bland, P. Du, A. Bouteiller, T. Herault, G. Bosilca, and J. J. Dongarra. Extending the scope of the checkpoint-on-failure protocol for forward recovery in standard MPI. Concurrency and Computation: Practice and Experience, 25(17):2381–2393, 2013.
[4] D. L. Boley, R. P. Brent, G. H. Golub, and F. T. Luk. Algorithmic fault tolerance using the Lanczos method. SIAM Journal on Matrix Analysis and Applications, 13(1):312–332, 1992.
[5] G. Bosilca, A. Bouteiller, E. Brunet, F. Cappello, J. Dongarra, A. Guermouche, T. Herault, Y. Robert, F. Vivien, and D. Zaidouni. Unified model for assessing checkpointing protocols at extreme-scale. Concurrency and Computation: Practice and Experience, 26(17):2772–2791, 2014.
[6] A. Brandt and O. E. Livne. Multigrid Techniques: 1984 Guide with Applications to Fluid Dynamics, Revised Edition. Classics in Applied Mathematics. Society for Industrial and Applied Mathematics, 2011.
[7] P. G. Bridges, K. B. Ferreira, M. A. Heroux, and M. Hoemmen. Fault-tolerant linear solvers via selective reliability. ArXiv e-prints, June 2012.
[8] F. Cappello. Fault tolerance in petascale/exascale systems: Current knowledge, challenges and research opportunities. International Journal of High Performance Computing Applications, 23(3):212–226, 2009.
[9] F. Cappello, A. Geist, B. Gropp, L. Kale, B. Kramer, and M. Snir. Toward exascale resilience. International Journal of High Performance Computing Applications, 23(4):374–388, Nov. 2009.
[10] F. Cappello, A. Geist, S. Kale, B. Kramer, and M. Snir. Toward exascale resilience: 2014 update. Supercomputing Frontiers and Innovations, 1:1–28, 2014.
[11] M. Casas, B. R. de Supinski, G. Bronevetsky, and M. Schulz. Fault resilience of the algebraic multigrid solver. In Proceedings of the 26th ACM International Conference on Supercomputing, ICS '12, pages 91–100, New York, NY, USA, 2012. ACM.
[12] Z. Chen. Online-ABFT: An online algorithm based fault tolerance scheme for soft error detection in iterative methods. In Proceedings of the 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP '13, pages 167–176, New York, NY, USA, 2013. ACM.
[13] Z. Chen and J. Dongarra. Algorithm-based fault tolerance for fail-stop failures. IEEE Transactions on Parallel and Distributed Systems, 19(12):1628–1641, 2008.
[14] T. Cui, J. Xu, and C.-S. Zhang. An error-resilient redundant subspace correction method. ArXiv e-prints, Sept. 2013.
[15] T. Davies and Z. Chen. Correcting soft errors online in LU factorization. In Proceedings of the 22nd International Symposium on High-Performance Parallel and Distributed Computing, HPDC '13, pages 167–178, New York, NY, USA, 2013. ACM.
[16] S. Di, M. S. Bouguerra, L. Bautista-Gomez, and F. Cappello. Optimization of multi-level checkpoint model for large scale HPC applications. In Proceedings of the 2014 IEEE 28th International Parallel and Distributed Processing Symposium, IPDPS '14, pages 1181–1190, Washington, DC, USA, 2014. IEEE Computer Society.
[17] P. D. Düben, J. Joven, A. Lingamneni, H. McNamara, G. De Micheli, K. V. Palem, and T. N. Palmer. On the use of inexact, pruned hardware in atmospheric modelling. Philosophical Transactions of the Royal Society of London A: Mathematical, Physical and Engineering Sciences, 372(2018), 2014.
[18] B. Gmeiner, H. Köstler, M. Stürmer, and U. Rüde. Parallel multigrid on hierarchical hybrid grids: a performance study on current high performance computing clusters. Concurrency and Computation: Practice and Experience, 26(1):217–240, 2014.
[19] B. Gmeiner, U. Rüde, H. Stengel, C. Waluga, and B. Wohlmuth. Performance and scalability of hierarchical hybrid multigrid solvers for Stokes systems. SIAM Journal on Scientific Computing, 2014. Accepted.
[20] W. Hackbusch. Multigrid Methods and Applications, volume 4. Springer-Verlag Berlin, 1985.
[21] B. Harding, M. Hegland, J. Larson, and J. Southern. Scalable and fault tolerant computation with the sparse grid combination technique. ArXiv e-prints, Apr. 2014.
[22] K.-H. Huang and J. A. Abraham. Algorithm-based fault tolerance for matrix operations. IEEE Transactions on Computers, 33(6):518–528, June 1984.
[23] F. T. Luk and H. Park. An analysis of algorithm-based fault tolerance techniques. Journal of Parallel and Distributed Computing, 5(2):172–184, Apr. 1988.
[24] K. Malkowski, P. Raghavan, and M. Kandemir. Analyzing the soft error resilience of linear solvers on multicore multiprocessors. In Parallel & Distributed Processing (IPDPS), 2010 IEEE International Symposium on, pages 1–12, April 2010.
[25] M. Maniatakos, P. Kudva, B. Fleischer, and Y. Makris. Low-cost concurrent error detection for floating-point unit (FPU) controllers. IEEE Transactions on Computers, 62(7):1376–1388, July 2013.
[26] S. S. Mukherjee, J. Emer, and S. K. Reinhardt. The soft error problem: An architectural perspective. In High-Performance Computer Architecture, 2005. HPCA-11. 11th International Symposium on, pages 243–247. IEEE, 2005.
[27] A. Roy-Chowdhury and P. Banerjee. A fault-tolerant parallel algorithm for iterative solution of the Laplace equation. In Parallel Processing, 1993. ICPP 1993. International Conference on, volume 3, pages 133–140, Aug 1993.
[28] J. Sloan, R. Kumar, and G. Bronevetsky. Algorithmic approaches to low overhead fault detection for sparse linear algebra. In Proceedings of the 2012 42nd Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), DSN '12, pages 1–12, Washington, DC, USA, 2012. IEEE Computer Society.
[29] M. K. Stoyanov and C. G. Webster. Numerical analysis of fixed point algorithms in the presence of hardware faults. Technical report, Oak Ridge National Laboratory (ORNL), Aug 2013.
[30] H. Sundar, G. Biros, C. Burstedde, J. Rudi, O. Ghattas, and G. Stadler. Parallel geometric-algebraic multigrid on unstructured forests of octrees. In High Performance Computing, Networking, Storage and Analysis (SC), 2012 International Conference for, pages 1–11, Nov 2012.
[31] G. Zheng, C. Huang, and L. V. Kalé. Performance evaluation of automatic checkpoint-based fault tolerance for AMPI and Charm++. ACM SIGOPS Operating Systems Review, 40(2):90–99, 2006.