On the Combination of Silent Error Detection and Checkpointing
In this paper, we revisit traditional checkpointing and rollback recovery strategies, with a focus on silent data corruption errors. Unlike fail-stop failures, such latent errors cannot be detected immediately, and a mechanism to detect them must be provided. We consider two models: (i) errors are detected after a delay that follows a probability distribution (typically, an Exponential distribution); (ii) errors are detected through some verification mechanism. In both cases, we compute the optimal period in order to minimize the waste, i.e., the fraction of time where nodes do not perform useful computations. In practice, only a fixed number of checkpoints can be kept in memory, and the first model may lead to an irrecoverable failure. In this case, we compute the minimum period required for an acceptable risk. For the second model, there is no risk of irrecoverable failure, owing to the verification mechanism, but the corresponding overhead is included in the waste. Finally, both models are instantiated using realistic scenarios and application/architecture parameters.
1 Introduction
For several decades, the High Performance Computing (HPC) community has been aiming at increasing the computational capabilities of parallel and distributed platforms, in order to fulfill expectations arising from many fields of research, such as chemistry, biology, medicine and aerospace. The core problem of delivering more performance through ever larger systems is reliability, because of the number of parallel components. Even if each independent component is quite reliable, the Mean Time Between Failures (MTBF) is expected to drop drastically when considering an exascale system [IESP-Exascale]. Failures become a normal part of application executions.
The de-facto general-purpose error recovery technique in high performance computing is checkpoint and rollback recovery. Such protocols employ checkpoints to periodically save the state of a parallel application, so that when an error strikes some process, the application can be restored into one of its former states. There are several families of checkpointing protocols. We assume in this work that each checkpoint forms a consistent recovery line, i.e., when an error is detected, we can rollback to the last checkpoint and resume execution, after a downtime and a recovery time.
Most studies assume instantaneous error detection, and therefore apply to fail-stop failures, such as the crash of a resource. In this work, we revisit checkpoint protocols in the context of latent errors, also called silent data corruption. In HPC, it has been shown recently that such errors are not unusual, and must also be accounted for [Moody:2010:DME:1884643.1884666]. The cause may be, for instance, soft errors in L1 cache, or double bit flips. The problem is that the detection of a latent error is not immediate, because the error is identified only when the corrupted data is activated. One must then account for the detection interval required to detect the error in the error recovery protocol. Indeed, if the last checkpoint saved an already corrupted state, it may not be possible to recover from the error. Hence the necessity to keep several checkpoints so that one can roll back to the last correct state.
This work is motivated by a recent paper by Lu, Zheng and Chien [LuZhengChien2013], who introduce a multiple checkpointing model to compute the optimal checkpointing period with error detection latency. More precisely, Lu, Zheng and Chien [LuZhengChien2013] deal with the following problem: given errors whose inter-arrival times $X_e$ follow an Exponential probability distribution of parameter $\lambda_e$, and given error detection times $X_d$ that follow an Exponential probability distribution of parameter $\lambda_d$, what is the optimal checkpointing period $T$ in order to minimize the total execution time? The problem is illustrated in Figure 1: the error is detected after a (random) time $X_d$, and one has to roll back up to the last checkpoint that precedes the occurrence of the error. Let $k$ be the number of checkpoints that can be simultaneously kept in memory. Lu, Zheng and Chien [LuZhengChien2013] derive a formula for the optimal checkpointing period in the (simplified) case where $k$ is unbounded ($k = \infty$), and they propose some numerical simulations to explore the case where $k$ is a fixed constant.
The first major contribution of this paper is to correct the formula of [LuZhengChien2013] when $k$ is unbounded, and to provide an analytical approach when $k$ is a fixed constant. The latter approach is a first-order approximation but applies to any probability distribution of errors.
While it is very natural and interesting to consider the latency of error detection, the model of [LuZhengChien2013] suffers from an important limitation: it is not clear how one can determine when the error has indeed occurred, and hence identify the last valid checkpoint, unless some verification system is enforced. Another major contribution of this paper is to introduce a model coupling verification and checkpointing, and to analytically determine the best balance between checkpoints and verifications so as to optimize platform throughput.
The rest of the paper is organized as follows. First we revisit the multiple checkpointing model of [LuZhengChien2013] in Section 2; we tackle both the case where all checkpoints are kept, and the case with at most $k$ checkpoints. In Section 3, we define and analyze a model coupling checkpoints and verifications. Then, we evaluate the various models in Section 4, by instantiating the models with realistic parameters derived from future exascale platforms. Related work is discussed in Section 5. Finally, we conclude and discuss future research directions in Section 6.
2 Revisiting the multiple checkpointing model
In this section, we revisit the approach of [LuZhengChien2013]. We show that their analysis with unbounded memory is incorrect and provide the exact solution (Section 2.1). We also extend their approach to deal with the case where a given (constant) number of checkpoints can be simultaneously kept in memory (Section 2.2).
2.1 Unlimited checkpoint storage
Let $C$ be the time needed for a checkpoint, $R$ the time for recovery, and $D$ the downtime. Although $C$ and $R$ are a function of the size of the memory footprint of the process, $D$ is a constant that represents the unavoidable costs to rejuvenate a process after an error (e.g., stopping the failed process and restoring a new one that will load the checkpoint image). We assume that errors can take place during checkpoint and recovery but not during downtime (otherwise, the downtime could be considered part of the recovery).
Let $\mu$ be the mean time between errors. With no error detection latency and no downtime, well-known formulas for the optimal period (useful work plus checkpointing time that minimizes the execution time) are $T = \sqrt{2\mu C} + C$ (as given by Young [young74]) and $T = \sqrt{2(\mu + R)C} + C$ (as given by Daly [daly04]). These formulas are first-order approximations and are valid only if $C, R \ll \mu$ (in which case they coincide to first order).
With error detection latency, things are more complicated, even with the assumption that one can track the source of the error (and hence identify the last valid checkpoint). Indeed, the amount of rollback will depend upon the sum $X_e + X_d$. For Exponential distributions of $X_e$ and $X_d$, Lu, Zheng and Chien [LuZhengChien2013] derive that $T = \sqrt{2(\mu + \mu_d)C}$, where $\mu_d$ is the mean of error detection times. However, although this result may seem intuitive, it is wrong, and we prove that the correct answer is $T = \sqrt{2\mu C}$, even when accounting for the downtime: this first-order approximation is the same as Young’s formula. We give an intuitive explanation after the proofs provided in Section 2.1.1. Then in Section 2.1.2, we extend this result to arbitrary laws, but under the additional constraint that $\mu_d + D + R \ll \mu$.
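As a rough numerical illustration of the two classical periods discussed above, the following sketch evaluates them side by side. It assumes the usual first-order statements of Young's and Daly's formulas (square-root term plus the checkpoint time); the function names and the sample values (in minutes) are hypothetical and only meant for experimentation.

```python
import math


def young_period(mu: float, C: float) -> float:
    """Young's first-order period (useful work plus checkpoint time)."""
    return math.sqrt(2.0 * mu * C) + C


def daly_period(mu: float, C: float, R: float) -> float:
    """Daly's first-order refinement, which also accounts for the recovery time R."""
    return math.sqrt(2.0 * (mu + R) * C) + C


# Hypothetical values, in minutes: mean time between errors, checkpoint, recovery.
mu, C, R = 525.6, 10.0, 10.0
print(f"Young: {young_period(mu, C):.1f} min, Daly: {daly_period(mu, C, R):.1f} min")
```

When $C$ and $R$ are small compared to $\mu$, the two values differ only marginally, which is the regime in which both approximations are meant to be used.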
2.1.1 Exponential distributions
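In this section, we assume that $X_e$ and $X_d$ follow Exponential distributions of mean $\mu_e$ and $\mu_d$ respectively (with $\mu_e = \mu$ the mean time between errors, and rates $\lambda_e = 1/\mu_e$ and $\lambda_d = 1/\mu_d$).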
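Proposition 1. The expected time needed to successfully execute a work of size $w$ followed by its checkpoint is
$$\mathbb{E}(T(w)) = e^{\lambda_e R} \left(D + \mu_e + \mu_d\right) \left(e^{\lambda_e (w + C)} - 1\right).$$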
Proof. Let $T(w)$ be the time needed for successfully executing a work of duration $w$. There are two cases: (i) if there is no error during execution and checkpointing, then the time needed is exactly $w + C$; (ii) if there is an error before successfully completing the work and its checkpoint, then some additional delays are incurred. These delays come from three sources: the time spent computing by the processors before the error occurs, the time spent before the error is detected, and the time spent for downtime and recovery. Regardless, once a successful recovery has been completed, there still remain $w$ units of work to execute. Thus, we can write the following recursion:
$$\mathbb{E}(T(w)) = e^{-\lambda_e (w+C)} (w + C) + \left(1 - e^{-\lambda_e (w+C)}\right) \left(\mathbb{E}(T_{lost}) + \mathbb{E}(X_d) + \mathbb{E}(T_{rec}) + \mathbb{E}(T(w))\right). \tag{1}$$
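Here, $T_{lost}$ denotes the amount of time spent by the processors before the first error, knowing that this error occurs within the next $w + C$ units of time. In other terms, it is the time that is wasted because computation and checkpoint were not both completed before the error occurred. The random variable $X_d$ represents the time needed for error detection, and its expectation is $\mu_d = \frac{1}{\lambda_d}$. The last variable $T_{rec}$ represents the amount of time needed by the system to perform a recovery. Equation (1) simplifies to:
$$\mathbb{E}(T(w)) = w + C + \left(e^{\lambda_e (w+C)} - 1\right) \left(\mathbb{E}(T_{lost}) + \mu_d + \mathbb{E}(T_{rec})\right). \tag{2}$$
Integrating by parts, we derive that
$$\mathbb{E}(T_{lost}) = \frac{1}{1 - e^{-\lambda_e (w+C)}} \int_0^{w+C} x \lambda_e e^{-\lambda_e x} \, dx = \frac{1}{\lambda_e} - \frac{w + C}{e^{\lambda_e (w+C)} - 1}. \tag{3}$$
Next, to compute $\mathbb{E}(T_{rec})$, we have a recursive equation quite similar to Equation (1) (remember that we assumed that no error can take place during the downtime):
$$\mathbb{E}(T_{rec}) = e^{-\lambda_e R} (D + R) + \left(1 - e^{-\lambda_e R}\right) \left(D + \mathbb{E}(R_{lost}) + \mathbb{E}(X_d) + \mathbb{E}(T_{rec})\right).$$
Here, $\mathbb{E}(R_{lost})$ is the expected amount of time lost to executing the recovery before an error happens, knowing that this error occurs within the next $R$ units of time. Replacing $w + C$ by $R$ in Equation (3), we obtain
$$\mathbb{E}(R_{lost}) = \frac{1}{\lambda_e} - \frac{R}{e^{\lambda_e R} - 1}.$$
The expression for $\mathbb{E}(T_{rec})$ simplifies to
$$\mathbb{E}(T_{rec}) = D e^{\lambda_e R} + \left(e^{\lambda_e R} - 1\right) (\mu_e + \mu_d).$$
Plugging the values of $\mathbb{E}(T_{lost})$ and $\mathbb{E}(T_{rec})$ into Equation (2) leads to the desired value. ∎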
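The derivation above can be checked numerically. The sketch below is an illustration under stated assumptions, not the authors' code: it assumes Exponential errors of rate $1/\mu_e$, a downtime $D$ paid on every recovery attempt, and a closed form $e^{\lambda_e R}(D + \mu_e + \mu_d)(e^{\lambda_e (w+C)} - 1)$ for the expected time. It solves the two linear recursions directly and compares the result with that closed form. All function names and sample values are hypothetical.

```python
import math


def lost_time(a: float, lam: float) -> float:
    """E[X | X < a] for X ~ Exp(lam): time elapsed before an error known to strike within a."""
    return 1.0 / lam - a / math.expm1(lam * a)


def expected_time_closed_form(w, C, R, D, mu_e, mu_d):
    """Assumed closed form for the expected time to complete w units of work plus a checkpoint."""
    lam = 1.0 / mu_e
    return math.exp(lam * R) * (D + mu_e + mu_d) * math.expm1(lam * (w + C))


def expected_time_from_recursions(w, C, R, D, mu_e, mu_d):
    """Solve the two recursions (recovery first, then work plus checkpoint) as linear equations."""
    lam = 1.0 / mu_e
    # Recovery: succeeds with probability p_rec, otherwise pay D, the lost time,
    # the detection latency, and start a new recovery.
    p_rec = math.exp(-lam * R)
    e_rec = (p_rec * (D + R)
             + (1.0 - p_rec) * (D + lost_time(R, lam) + mu_d)) / p_rec
    # Work and checkpoint: succeed with probability p_ok, otherwise pay the lost time,
    # the detection latency, a full recovery, and start over.
    p_ok = math.exp(-lam * (w + C))
    return (p_ok * (w + C)
            + (1.0 - p_ok) * (lost_time(w + C, lam) + mu_d + e_rec)) / p_ok


# Hypothetical values, in minutes.
args = dict(w=90.0, C=10.0, R=10.0, D=1.0, mu_e=525.6, mu_d=17.52)
print(expected_time_closed_form(**args), expected_time_from_recursions(**args))
```

Both functions should agree up to floating-point rounding, which gives a quick sanity check when experimenting with other parameter values.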
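Proposition 2. The optimal strategy to execute a work of size $W$ is to divide it into $n$ equal-size chunks, each followed by a checkpoint, where $n$ is equal either to $\max(1, \lfloor n^* \rfloor)$ or to $\lceil n^* \rceil$. The value of $n^*$ is uniquely derived from $\theta = \frac{\lambda_e W}{n^*} - 1$, where $\theta = L(-e^{-\lambda_e C - 1})$ ($L$, the Lambert function, defined as $L(z) e^{L(z)} = z$). The optimal strategy does not depend on the value of $\mu_d$.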
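Proof. Using $n$ chunks of size $w_i$ (with $\sum_{i=1}^{n} w_i = W$), by linearity of the expectation, we have $\mathbb{E}(T(W)) = \sum_{i=1}^{n} \mathbb{E}(T(w_i)) = K \sum_{i=1}^{n} \left(e^{\lambda_e (w_i + C)} - 1\right)$ where $K = e^{\lambda_e R}(D + \mu_e + \mu_d)$ is a constant. By convexity, the sum is minimum when all the $w_i$s are equal (to $\frac{W}{n}$). Now, $\mathbb{E}(T(W)) = K n \left(e^{\lambda_e (\frac{W}{n} + C)} - 1\right)$ is a convex function of $n$, hence it admits a unique minimum $n^*$ such that the derivative is zero:
$$e^{\lambda_e \left(\frac{W}{n^*} + C\right)} \left(1 - \frac{\lambda_e W}{n^*}\right) = 1.$$
Let $\theta = \frac{\lambda_e W}{n^*} - 1$, we have $\theta e^{\theta} = -e^{-\lambda_e C - 1}$, hence $\theta = L(-e^{-\lambda_e C - 1})$. Then, since we need an integer number of chunks, the optimal strategy is to split $W$ into $\max(1, \lfloor n^* \rfloor)$ or $\lceil n^* \rceil$ same-size chunks, whichever leads to the smaller value. As stated, the value of $\theta$, hence of $n^*$, is independent of $\mu_d$. ∎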
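The chunking rule above can be explored numerically. The sketch below is a minimal illustration, assuming the per-chunk expected time is proportional to $e^{\lambda_e (W/n + C)} - 1$ and that the real-valued optimum satisfies $\theta e^{\theta} = -e^{-\lambda_e C - 1}$ with $\theta = \lambda_e W / n^* - 1$. To stay self-contained it solves this equation by bisection on $(-1, 0)$ instead of calling a Lambert-function library; the names `solve_theta`, `chunked_cost` and `optimal_chunks` are hypothetical.

```python
import math


def solve_theta(lam: float, C: float) -> float:
    """Solve theta * exp(theta) = -exp(-lam*C - 1) for theta in (-1, 0) by bisection.

    The map theta -> theta * exp(theta) is increasing on [-1, 0], so bisection applies.
    """
    target = -math.exp(-lam * C - 1.0)
    lo, hi = -1.0, 0.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if mid * math.exp(mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def chunked_cost(n: int, W: float, C: float, mu_e: float) -> float:
    """Expected time with n equal chunks, up to a constant factor independent of n."""
    lam = 1.0 / mu_e
    return n * math.expm1(lam * (W / n + C))


def optimal_chunks(W: float, C: float, mu_e: float):
    """Return (best integer number of chunks, real-valued optimum n_star)."""
    lam = 1.0 / mu_e
    n_star = lam * W / (1.0 + solve_theta(lam, C))
    candidates = {max(1, math.floor(n_star)), math.ceil(n_star)}
    best = min(candidates, key=lambda n: chunked_cost(n, W, C, mu_e))
    return best, n_star


# Hypothetical values, in minutes: 10 days of work, 10-minute checkpoints.
n, n_star = optimal_chunks(W=14400.0, C=10.0, mu_e=525.6)
print(n, n_star, 14400.0 / n)
```

Note that neither the detection latency nor the downtime appears in the computation: they only scale the cost by a constant factor, which does not move the minimum.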
Theorem 3. A first-order approximation for the optimal checkpointing period (that minimizes total execution time) is $T = \sqrt{2\mu_e C}$. This value is identical to Young’s formula, and does not depend on the value of $\mu_d$.
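Proof. We use Proposition 2 and Taylor expansions when $\lambda_e C$ is small: from $\theta e^{\theta} = -e^{-\lambda_e C - 1}$, we derive $\theta \approx -1 + \sqrt{2\lambda_e C}$. We have $n^* = \frac{\lambda_e W}{1 + \theta}$, and $1 + \theta \approx \sqrt{2\lambda_e C}$, hence $n^* \approx W \sqrt{\frac{\lambda_e}{2C}}$. The period is
$$T = \frac{W}{n^*} + C \approx \sqrt{\frac{2C}{\lambda_e}} = \sqrt{2\mu_e C},$$
where the additive term $C$ is dropped because it is of lower order when $C \ll \mu_e$.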
An intuitive explanation of the result is the following: error detection latency is paid for every error, and can be viewed as an additional downtime, which has no impact on the optimal period.
2.1.2 Arbitrary distributions
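Here we extend the previous result to arbitrary distribution laws for $X_e$ and $X_d$ (of mean $\mu_e$ and $\mu_d$ respectively):
Proposition 4. When $C \ll \mu_e$ and $D + R + \mu_d \ll \mu_e$, a first-order approximation for the optimal checkpointing period is $T = \sqrt{2\mu_e C}$.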
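Proof. Let $T_{base}$ be the base time of the application without any overhead due to resilience techniques. First, assume a fault-free execution of the application: every period of length $T$, only $T - C$ units of work are executed, hence the time $T_{ff}$ for a fault-free execution is $T_{ff} = \frac{T}{T - C} T_{base}$. Now, let $T_{final}$ denote the expectation of the execution time with errors taken into account. On average, errors occur every $\mu_e$ time-units, and for each of them we lose $F$ time-units, so there are $\frac{T_{final}}{\mu_e}$ errors during the execution. Hence we derive that
$$T_{final} = T_{ff} + \frac{T_{final}}{\mu_e} F, \tag{6}$$
which we rewrite as
$$\left(1 - \frac{F}{\mu_e}\right) \left(1 - \frac{C}{T}\right) T_{final} = T_{base}. \tag{7}$$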
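The waste is the fraction of time where nodes do not perform useful computations, i.e., $\mathrm{WASTE} = \frac{T_{final} - T_{base}}{T_{final}}$. Minimizing execution time is equivalent to minimizing the waste. In Equation (7), we identify the two sources of overhead: (i) the term $\mathrm{WASTE}_{ff} = \frac{C}{T}$, which is the waste due to checkpointing in a fault-free execution, by construction of the algorithm; and (ii) the term $\mathrm{WASTE}_{fail} = \frac{F}{\mu_e}$, which is the waste due to errors striking during execution. With these notations, we have
$$\mathrm{WASTE} = \mathrm{WASTE}_{fail} + \mathrm{WASTE}_{ff} - \mathrm{WASTE}_{fail} \, \mathrm{WASTE}_{ff}. \tag{8}$$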
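There remains to determine the (expected) value of $F$. Assuming at most one error per period, we lose $F = \frac{T}{2} + \mu_d + D + R$ per error: $\frac{T}{2}$ for the average work lost before the error occurs, $\mu_d$ for detecting the error, and $D + R$ for downtime and recovery. Note that the assumption is valid only if $\mu_d + D + R \ll \mu_e$ and $T \ll \mu_e$. Plugging back this value into Equation (8), we obtain
$$\mathrm{WASTE}(T) = \frac{C}{T} + \left(1 - \frac{C}{T}\right) \frac{D + R + \mu_d + \frac{T}{2}}{\mu_e}, \tag{9}$$
which is minimal for
$$T = \sqrt{2 \left(\mu_e - (D + R + \mu_d)\right) C}. \tag{10}$$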
We point out that this approach based on the waste leads to a different approximation formula for the optimal period, but the difference is limited to second-order terms when $\mu_e$ is large compared to the other parameters, including $\mu_d$. For example, this approach does not allow us to handle the case $\mu_d \geq \mu_e$; in such a case, the optimal period is known only for Exponential distributions, and is independent of $\mu_d$, as proven by Proposition 2. ∎
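The waste-based argument lends itself to a quick numerical check. The sketch below assumes the first-order model described in the proof: a fault-free waste $C/T$, a failure-induced waste $(T/2 + \mu_d + D + R)/\mu_e$, and a total waste equal to their sum minus their product. It minimizes this function by ternary search and compares the result with the square-root expression $\sqrt{2(\mu_e - (D + R + \mu_d))C}$ that follows from setting the derivative to zero. Function names and sample values are hypothetical.

```python
import math


def waste(T, C, D, R, mu_e, mu_d):
    """First-order waste for period T (assumes at most one error per period)."""
    waste_ff = C / T                                  # checkpointing overhead
    waste_fail = (T / 2.0 + mu_d + D + R) / mu_e      # expected loss per error / MTBF
    return waste_ff + waste_fail - waste_ff * waste_fail


def closed_form_period(C, D, R, mu_e, mu_d):
    """Period that zeroes the derivative of the waste above."""
    return math.sqrt(2.0 * (mu_e - (D + R + mu_d)) * C)


def argmin_convex(f, lo, hi, iters=200):
    """Ternary search for the minimizer of a convex function on [lo, hi]."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if f(m1) < f(m2):
            hi = m2
        else:
            lo = m1
    return 0.5 * (lo + hi)


# Hypothetical values, in minutes.
C, D, R, mu_e, mu_d = 10.0, 0.0, 10.0, 525.6, 17.52
T_num = argmin_convex(lambda T: waste(T, C, D, R, mu_e, mu_d), C + 1.0, mu_e)
print(T_num, closed_form_period(C, D, R, mu_e, mu_d), math.sqrt(2.0 * mu_e * C))
```

Printing Young's value alongside shows how close the two approximations are when the detection latency, downtime and recovery time are small compared to the mean time between errors.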
To summarize, the exact value of the optimal period is only known for Exponential distributions and is provided by Proposition 2, while Young’s formula can be used as a first-order approximation for any other distributions. Indeed, the optimal period is a trade-off between the overhead due to checkpointing ($\frac{C}{T}$) and the expected time lost per error ($\frac{T}{2}$ plus some constant). Up to second-order terms, the waste is minimum when both factors are equal, which leads to Young’s formula, and which remains valid regardless of error detection latencies.
2.2 Saving only $k$ checkpoints
Lu, Zheng and Chien [LuZhengChien2013] propose a set of simulations to assess the overhead induced when keeping only the last $k$ checkpoints (because of storage limitations). In the following, we derive an analytical approach to numerically solve the problem. The main difficulty is that when error detection latency is too large, it is impossible to recover from a valid checkpoint, and one must resume the execution from scratch. We consider this scenario as an irrecoverable failure, and we aim at guaranteeing that the risk of irrecoverable failure remains under a user-given threshold.
Assume that a job of total size $W$ is partitioned into $n$ chunks. What is the risk of irrecoverable failure during the execution of one chunk of size $\frac{W}{n}$ followed by its checkpoint? Let $T = \frac{W}{n} + C$ be the length of the period. Intuitively, the longer the period, the smaller the probability that an error that has just been detected took place more than $k$ periods ago, thereby leading to an irrecoverable failure because the last valid checkpoint is not one of the $k$ most recent ones.
Formally, there is an irrecoverable failure if: (i) there is an error detected during the period (probability $P_{fail}$), and (ii) the sum of $T_{lost}$, the time elapsed since the last checkpoint, and of $X_d$, the error detection latency, exceeds $kT$ (probability $P_{lat}$). The value of $P_{fail}$ is easy to compute from the error distribution law. For instance with an Exponential law, $P_{fail} = 1 - e^{-\lambda_e T}$. As for $P_{lat}$, we use an upper bound: $P_{lat} = P(T_{lost} + X_d \geq kT) \leq P(T + X_d \geq kT) = P(X_d \geq (k-1)T)$. The latter value is easy to compute from the error distribution law. For instance with an Exponential law, $P(X_d \geq (k-1)T) = e^{-\lambda_d (k-1) T}$. Of course, if there is an error and the error detection latency does not exceed $kT$ (probability $P_{fail}(1 - P_{lat})$), we have to restart execution and face the same risk as before. Therefore, the probability of irrecoverable failure $P_{irrec}$ can be recursively evaluated as $P_{irrec} = P_{fail} \left(P_{lat} + (1 - P_{lat}) P_{irrec}\right)$, hence $P_{irrec} = \frac{P_{fail} P_{lat}}{1 - P_{fail}(1 - P_{lat})}$. Now that we have computed $P_{irrec}$, the probability of irrecoverable failure for a single chunk, we can compute the probability of irrecoverable failure for $n$ chunks as $P_{risk} = 1 - (1 - P_{irrec})^n$. In full rigor, these expressions for $P_{irrec}$ and $P_{risk}$ are valid only for Exponential distributions, because of the memoryless property, but they are a good approximation for arbitrary laws. Given a prescribed risk threshold $\varepsilon$, solving numerically the equation $P_{risk} \leq \varepsilon$ leads to a lower bound $T_{min}$ on $T$. Let $T_{opt}$ be the optimal period given in Theorem 3 for an unbounded number of saved checkpoints. The best strategy is then to use the period $\max(T_{min}, T_{opt})$ to minimize the waste while enforcing a risk below threshold.
In case of irrecoverable failure, we have to resume execution from the very beginning. The number of re-executions due to consecutive irrecoverable failures follows a geometric law of parameter $1 - P_{risk}$, so that the expected number of executions until success is $\frac{1}{1 - P_{risk}}$. We refer to Section 4.1 for an example of how to instantiate this model to compute the best period with a fixed number of checkpoints, under a prescribed risk threshold.
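The procedure described in this section can be sketched as follows. This is an illustration under explicit assumptions rather than the authors' implementation: Exponential laws for both errors and detection latency, the upper bound on the latency probability mentioned above, a real-valued number of chunks $W / (T - C)$, and a risk that decreases with the period on the search bracket (which holds when $(k-1)/\mu_d > 1/\mu_e$). All names and sample values are hypothetical.

```python
import math


def chunk_risk(T, k, mu_e, mu_d):
    """Probability of irrecoverable failure while executing one period of length T."""
    p_fail = -math.expm1(-T / mu_e)              # an error strikes during the period
    p_lat = math.exp(-(k - 1) * T / mu_d)        # bound: latency exceeds (k-1) periods
    return p_fail * p_lat / (1.0 - p_fail * (1.0 - p_lat))


def total_risk(T, k, W, C, mu_e, mu_d):
    """Probability of at least one irrecoverable failure over the whole job."""
    n = W / (T - C)                              # number of chunks, kept real-valued
    p = chunk_risk(T, k, mu_e, mu_d)
    if p >= 1.0:
        return 1.0
    return -math.expm1(n * math.log1p(-p))       # 1 - (1 - p)^n, computed stably


def min_period(eps, k, W, C, mu_e, mu_d, T_hi):
    """Smallest period in (C, T_hi] whose risk is at most eps (bisection)."""
    risk = lambda T: total_risk(T, k, W, C, mu_e, mu_d)
    lo, hi = 1.001 * C, T_hi
    if risk(hi) > eps:
        raise ValueError("risk threshold not reachable on this bracket")
    if risk(lo) <= eps:
        return lo
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if risk(mid) > eps:
            lo = mid
        else:
            hi = mid
    return hi                                    # hi always satisfies risk(hi) <= eps


# Hypothetical values, in minutes: 10 days of work, k = 3 checkpoints kept.
k, W, C, mu_e = 3, 14400.0, 10.0, 525.6
mu_d = mu_e / 30.0
T_opt = math.sqrt(2.0 * mu_e * C)
T_min = min_period(1e-4, k, W, C, mu_e, mu_d, T_hi=1000.0)
print(T_opt, T_min, max(T_min, T_opt))
```

The last printed value is the period one would retain: the waste-optimal period when it already meets the risk threshold, and the risk-driven lower bound otherwise.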
3 Coupling verification and checkpointing
In this section, we move to a more realistic model where silent errors are detected only when some verification mechanism (checksum, error correcting code, coherence tests, etc.) is executed. Our approach is agnostic of the nature of this verification mechanism. We aim at solving the following optimization problem: given the cost of checkpointing $C$, downtime $D$, recovery $R$, and verification $V$, what is the optimal strategy to minimize the expected waste as a function of the mean time between errors $\mu$? Depending upon the relative costs of checkpointing and verifying, we may have more checkpoints than verifications, or the other way round. In both cases, we target a periodic pattern that repeats over time.
Consider first the scenario where the cost of a checkpoint is smaller than the cost of a verification: then the periodic pattern will include $k$ checkpoints and one verification, where $k$ is some parameter to determine. Figure 4(a) provides an illustration. We assume that the verification is directly followed by the last checkpoint in the pattern, so as to save results just after they have been verified (and before they get corrupted). In this scenario, the objective is to determine the value of $k$ that leads to the minimum platform waste. This problem is addressed in Section 3.1.
Because our approach is agnostic of the cost of the verification, we also envision scenarios where the cost of a checkpoint is higher than the cost of a verification. In such a framework, the periodic pattern will include $k$ verifications and one checkpoint, where $k$ is some parameter to determine. See Figure 4(b) for an illustration. Again, the objective is to determine the value of $k$ that leads to the minimum platform waste. This problem is addressed in Section 3.2.
We point out that combining verification and checkpointing guarantees that no irrecoverable failure will kill the application: the last checkpoint of any period pattern is always correct, because a verification always takes place right before this checkpoint is taken. If that verification reveals an error, we roll back until reaching a correct verification point, maybe up to the end of the previous pattern, but never further back, and re-execute the work. The amount of roll-back and re-execution depends upon the shape of the pattern, and we show how to compute it in Sections 3.1 and 3.2 below.
3.1 With $k$ checkpoints and one verification
We use the same approach as in the proof of Proposition 4 and compute a first-order approximation of the waste (see Equations (7) and (8)). We compute the two sources of overhead: (i) $\mathrm{WASTE}_{ff}$, the waste incurred in a fault-free execution, by construction of the algorithm, and (ii) $\mathrm{WASTE}_{fail}$, the waste due to errors striking during execution.
Let $T = kw + kC + V$ be the length of the periodic pattern, where each of the $k$ segments contains $w$ units of work. We easily derive that $\mathrm{WASTE}_{ff} = \frac{kC + V}{T}$. As for $\mathrm{WASTE}_{fail}$, we still have $\mathrm{WASTE}_{fail} = \frac{T_{lost}}{\mu}$. However, in this context, the time lost because of the error depends upon the location of this error within the periodic pattern, so we compute averaged values as follows. Recall (see Figure 4(a)) that checkpoint $k$ is the one preceded by a verification. Here is the analysis when an error is detected during the verification that takes place in the pattern:
If the error took place in the (last) segment $k$: we recover from checkpoint $k-1$, and verify it; we get a correct result because the error took place later on. Then we re-execute the last piece of work and redo the verification. The time that has been lost is $T_{lost}(k) = R + V + w + V$. (We assume that there is at most one error per pattern.)
If the error took place in segment $i$, $1 < i < k$: we recover from checkpoint $k-1$, verify it, get a wrong result; we iterate, going back up to checkpoint $i-1$, verify it, and get a correct result because the error took place later on. Then we re-execute $k-i+1$ pieces of work and $k-i$ checkpoints, together with the last verification. We get $T_{lost}(i) = (k-i+1)(R + V + w) + (k-i)C + V$.
If the error took place in (first) segment $1$: this is almost the same as above, except that the first recovery at the beginning of the pattern need not be verified, because the verification was made just before the corresponding checkpoint at the end of the previous pattern. We have the same formula with $i = 1$ but with one fewer verification: $T_{lost}(1) = k(R + V + w) + (k-1)C$.
Therefore, the formula for writes
and (after some manipulation using a computer algebra system) the formula simplifies to
Using $\mathrm{WASTE}_{fail} = \frac{T_{lost}}{\mu}$ and Equation (8), we compute the total waste and derive that $\mathrm{WASTE} = \frac{a}{T} + b + cT$, where $a$, $b$, and $c$ are some constants. The optimal value of $T$ is $T_{opt} = \sqrt{\frac{a}{c}}$, provided that this value is at least $kC + V$. We point out that this formula is only a first-order approximation. We have assumed a single error per pattern. We have also assumed that errors did not occur during checkpoints following verifications. Now, once we have found $\mathrm{WASTE}(T_{opt})$, the value of the waste obtained for the optimal period $T_{opt}$, we can minimize this quantity as a function of $k$, and numerically derive the optimal value $k_{opt}$ that provides the best value (and hence the best platform usage).
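The case analysis of this section translates directly into a small routine. The sketch below follows the three cases as described in the text (rollback through unverified checkpoints, re-execution of the lost segments and checkpoints, final verification). Two points are assumptions made here for illustration only: the downtime is left out of the per-case cost, and the average is taken as if the error were equally likely to strike each segment. The names `time_lost` and `mean_time_lost` are hypothetical.

```python
def time_lost(i: int, k: int, w: float, C: float, R: float, V: float) -> float:
    """Time lost when the error struck segment i (1-based) of a k-checkpoint pattern."""
    if not 1 <= i <= k:
        raise ValueError("segment index must be between 1 and k")
    rollbacks = k - i + 1                 # recoveries from checkpoints k-1 down to i-1
    lost = rollbacks * (R + V + w)        # each: recover, verify, later redo one segment
    lost += (k - i) * C                   # checkpoints i..k-1 are taken again
    lost += V                             # final verification of the pattern
    if i == 1:
        lost -= V                         # checkpoint 0 was verified in the previous pattern
    return lost


def mean_time_lost(k: int, w: float, C: float, R: float, V: float) -> float:
    """Average over segments, assuming (for illustration) a uniform error location."""
    return sum(time_lost(i, k, w, C, R, V) for i in range(1, k + 1)) / k


# Hypothetical values, in seconds: cheap checkpoints, costly verification.
for i in (1, 2, 3):
    print(i, time_lost(i, k=3, w=100.0, C=1.0, R=1.0, V=10.0))
print(mean_time_lost(k=3, w=100.0, C=1.0, R=1.0, V=10.0))
```

The earlier the error strikes within the pattern, the more recoveries and verifications are paid, which is why the verification cost weighs heavily on this pattern.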
Due to lack of space, computational details are available in [webrefmaple], which is a Maple sheet that we used to instantiate the model. This Maple sheet is publicly available for users to experiment with their own parameters. We provide two example scenarios to illustrate the model in Section 4.3.
Finally, note that in order to minimize the waste, one could do a binary search in order to find the last checkpoint before the fault. Then we can upper-bound by , and Equation (12) becomes .
3.2 With $k$ verifications and one checkpoint
We use a similar line of reasoning for this scenario and compute a first-order approximation of the waste for the case with verifications and checkpoint per pattern. The length of the periodic pattern is now . As before, for , let segment denote the period of work before verification , and assume (see Figure 4(b)) that verification is preceded by a checkpoint. The analysis is somewhat simpler here.
If an error takes place in segment , , we detect the error during verification , we recover from the last checkpoint, and redo the first segments and verifications: therefore . The formula for is the same as in Equation (11) and (after some manipulation) we derive
Using $\mathrm{WASTE}_{fail} = \frac{T_{lost}}{\mu}$ and Equation (8), we proceed just as in Section 3.1 to compute the optimal value $T_{opt}$ of the periodic pattern, and then the optimal value $k_{opt}$ that minimizes the waste. Details are available within the Maple sheet [webrefmaple].
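The optimization outlined in this section can also be explored without a computer algebra system. The sketch below is a numerical illustration under several assumptions that are not taken from the paper: an error in segment $i$ costs one recovery plus the re-execution of the first $i$ segments and verifications, the error location is uniform over the $k$ segments, the downtime is ignored, and the total waste combines the fault-free and failure-induced terms as their sum minus their product. For each $k$ it minimizes the waste over the segment work $w$ by ternary search, then keeps the best $k$. All names and values are hypothetical.

```python
def waste(w, k, C, R, V, mu):
    """First-order waste of a pattern with k (work, verification) segments and one checkpoint."""
    T = k * (w + V) + C                          # pattern length
    waste_ff = (k * V + C) / T                   # verifications and checkpoint overhead
    mean_lost = R + (k + 1) / 2.0 * (w + V)      # average of R + i*(w + V) for i = 1..k
    waste_fail = mean_lost / mu
    return waste_ff + waste_fail - waste_ff * waste_fail


def best_work(k, C, R, V, mu, iters=200):
    """Segment work w minimizing the waste for a given k (ternary search, convex in w)."""
    lo, hi = 1e-6, mu
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if waste(m1, k, C, R, V, mu) < waste(m2, k, C, R, V, mu):
            hi = m2
        else:
            lo = m1
    return 0.5 * (lo + hi)


def best_pattern(C, R, V, mu, k_max=50):
    """Return (k, w, waste) for the best pattern with at most k_max verifications."""
    best = None
    for k in range(1, k_max + 1):
        w = best_work(k, C, R, V, mu)
        candidate = (waste(w, k, C, R, V, mu), k, w)
        if best is None or candidate < best:
            best = candidate
    return best[1], best[2], best[0]


# Hypothetical values, in minutes: cheap versus costly verification.
print(best_pattern(C=10.0, R=10.0, V=0.1, mu=525.6))
print(best_pattern(C=10.0, R=10.0, V=20.0, mu=525.6))
```

With these sample values, a cheap verification favors several verifications per checkpoint, while a costly one pushes the best pattern back to a single verification.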
4 Evaluation
This section provides some examples for instantiating the various models. We aimed at choosing realistic parameters in the context of future exascale platforms, but we had to restrict to a limited set of scenarios, which do not intend to cover the whole spectrum of possible parameters. The Maple sheet [webrefmaple] is available to explore other scenarios.
4.1 Best period with $k$ checkpoints under a given risk threshold
We first evaluate $P_{risk}$, the risk of irrecoverable failure, as defined in Section 2.2. Figures 5 and 6 present, for different scenarios, the probability $P_{risk}$ as a function of the checkpointing period $T$ on the left. On the right, the figures present the corresponding waste with $k$ checkpoints and in the absence of irrecoverable failures. This waste can be computed following the same reasoning as in Equation (9). For each figure, the left diagram represents the risk implied by a given period $T$, showing the value $T_{opt}$ of the optimal checkpoint interval (optimal with respect to waste minimization and in the absence of irrecoverable failures, see Equation (10)) as a blue vertical line. The right diagram on the figure represents the corresponding waste, highlighting the trade-off between an increased irrecoverable-failure-free waste and a reduced risk. As stated in Section 2.2, it does not make sense to select a value for $T$ lower than $T_{opt}$, since the waste would be increased, for an increased risk.
Figure 5 considers a machine consisting of components, and a component MTBF of 100 years. This component MTBF corresponds to the optimistic assumption on the reliability of computers made in the literature [IESP-toward, IESP-Exascale]. The platform MTBF is thus hours. The times to checkpoint and recover (10 min) correspond to reasonable mean values for systems at this size [c178, Ferreira2011]. At this scale, process rejuvenation is small, and we set the downtime to 0s. For these average values to have a meaning, we consider a run that is long enough (10 days of work), and in order to illustrate the trade-off, we take a rather low (but reasonable) value of intervals, and a mean time error detection significantly smaller (30 times) than the MTBF itself.
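As a back-of-the-envelope companion to this scenario, the sketch below derives a platform MTBF from a component MTBF and evaluates the first-order period $\sqrt{2\mu C}$. The number of components used here ($10^5$) is a hypothetical value chosen for illustration, not a figure taken from the text; with a 10-minute checkpoint it yields a period on the order of 100 minutes. The function names are hypothetical as well.

```python
import math

HOURS_PER_YEAR = 365 * 24


def platform_mtbf_hours(component_mtbf_years: float, n_components: int) -> float:
    """Platform MTBF when n independent components each fail at the component rate."""
    return component_mtbf_years * HOURS_PER_YEAR / n_components


def first_order_period_minutes(mtbf_hours: float, checkpoint_minutes: float) -> float:
    """First-order optimal period sqrt(2 * mu * C), in minutes."""
    return math.sqrt(2.0 * mtbf_hours * 60.0 * checkpoint_minutes)


# Assumption for illustration: 10**5 components, each with a 100-year MTBF.
mu_hours = platform_mtbf_hours(component_mtbf_years=100.0, n_components=10**5)
period = first_order_period_minutes(mu_hours, checkpoint_minutes=10.0)
print(f"platform MTBF: {mu_hours:.2f} h, first-order period: {period:.1f} min")
```

Dividing the checkpoint time by 10 in the second call shrinks the period by a factor of about $\sqrt{10}$, which is the kind of effect explored in the more optimistic scenario.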
With these parameters, is around 100 minutes, and the risk of irrecoverable failure at this checkpoint interval can be evaluated at , inducing an irrecoverable-failure-free waste of . To reduce the risk to , a of seconds is sufficient, increasing the waste by only . In this case, the benefit of fixing the period to is obvious. Naturally, keeping a bigger amount of checkpoints (increasing ) would also reduce the risk, at constant waste, if memory can be afforded.
We also consider in Figure 6 a more optimistic scenario where the checkpointing technology and availability of resources is increased by a factor 10: the time to checkpoint, recover, and allocate new computing resources is divided by 10 compared to the previous scenario. Other parameters are kept similar. One can observe that is largely reduced (down to less than 35 minutes between checkpoints), as well as the optimal irrecoverable-failure-free waste (). This is unsurprising, and mostly due to the reduction of failure-free waste implied by the reduction of checkpointing time. But because the period between checkpoints becomes smaller, while the latency to detect an error is unchanged ( is still 30 times smaller than ), the risk that an error happens at the interval but is detected after interval is increased. Thus, the risk climbs to , an unacceptable value. To reduce the risk to as previously, it becomes necessary to consider a of seconds, which implies an irrecoverable-failure-free waste of , significantly higher than the optimal one, which is below , but still much lower than the when checkpoint and availability costs are 10 times higher.
4.2 Periodic pattern with $k$ verifications and one checkpoint
We now focus on the waste induced by the different ways of coupling periodic verification and checkpointing. We first consider the case of a periodic pattern with more verifications than checkpoints: every $k$ verifications of the current state of the application, a checkpoint is taken. The duration of the work interval between two verifications is optimized to minimize the waste. We consider two scenarios. For each scenario, we represent two diagrams: the left diagram shows the waste as a function of $k$ for a given verification cost $V$, and the right diagram shows the waste as a function of $k$ and $V$ using a 3D surface representation.
In the first scenario, we consider the same setup as above in Section 4.1. The waste is computed in its general form, so we do not need to define the duration of the work. As represented in Figure 7, for a given verification cost, the waste can be optimized by performing more than one verification. When $k > 1$, there are intermediate verifications that make it possible to detect an error before a periodic pattern (of length $T$) is completed, and hence to reduce the time lost due to an error. However, introducing too many verifications induces an overhead that eventually dominates the waste. The 3D surface shows that the waste reduction is significant when increasing the number of verifications, until the optimal number is reached. Then, the waste starts to increase again slowly. Intuitively, the lower the cost for $V$, the higher the optimal value for $k$.
When considering the second scenario (Figure 8), with an improved checkpointing and availability setup, the same conclusions can be reached, with an absolute value of the waste greatly diminished. Since forced verifications make it possible to detect the occurrence of errors at a controllable rate (depending on $k$ and $T$), the risk of non-recoverable errors is nonexistent in this case, and the waste can be greatly diminished, with very few checkpoints taken and kept during the execution.
4.3 Periodic pattern with $k$ checkpoints and one verification
The last set of experiments considers the opposite case of periodic patterns: checkpoints are taken more often than verifications. Every $k$ checkpoints, a verification of the data consistency is done. Intuitively, this could be useful if the cost of verification is large compared to the cost of checkpointing itself. In that case, when rolling back after an error is discovered, each checkpoint that was not validated before is validated at rollback time, potentially invalidating up to $k-1$ checkpoints.
Because this pattern has potential only when the cost of checkpoint is much lower than the cost of verification, we considered the case of a greatly improved checkpoint / availability setup: the checkpoint and recovery times are only seconds in Figure 9. One can observe that in this extreme case, it can still make sense to consider multiple checkpoints between two verifications (when seconds, a verification is done only every 3 checkpoints optimally); however the 3D surface demonstrates that the waste is still dominated by the cost of the verification, and little improvement can be achieved by taking the optimal value for . The cost of verification must be incurred when rolling back, and this shows on the overall performance if the verification is costly.
This is illustrated even more clearly with Figure 10, where the checkpoint costs and machine availability are set to the second scenario of Sections 4.1 and 4.2. As soon as the checkpoint cost is not negligible compared to the verification cost (only 5 times smaller in this case), it is more efficient to validate every other checkpoint than to validate only after checkpoints. The 3D surface shows that this holds true for rather large values of .
All the rollback / recovery techniques that we have evaluated above, using various parameters for the different costs, and stressing the different approaches to their limits, expose a waste that remains, in the vast majority of the cases, largely below . This is noticeable, because the traditional hardware based technique, which relies on triple modular redundancy and voting [Lyons1962], mechanically presents a waste that is at least equal to (two-thirds of resources are wasted, and neglecting the cost of voting).
5 Related work
As already mentioned, this work is motivated by the recent paper by Lu, Zheng and Chien [LuZhengChien2013], who introduce a multiple checkpointing model to compute the optimal checkpointing period with error detection latency. We start with a brief overview of traditional checkpointing approaches before discussing error detection and recovery mechanisms.
5.1 Checkpointing and rollback recovery
Traditional (coordinated) checkpointing has been studied for many years. The major appeal of the coordinated approach is its simplicity, because a parallel job using $N$ processors of individual MTBF $\mu_{ind}$ can be viewed as a single processor job with MTBF $\mu = \frac{\mu_{ind}}{N}$. Given the value of $\mu$, an approximation of the optimal checkpointing period can be computed as a function of the key parameters (downtime $D$, checkpoint time $C$, and recovery time $R$). The first estimate had been given by Young [young74] and later refined by Daly [daly04]. Both use a first-order approximation for Exponential failure distributions; their derivation is similar to the approach in Equations (6) and (7). More accurate formulas for Weibull failure distributions are provided in [ling2001variational, ozaki2006distribution, doi:10.1007/978-3-642-14390-8_22]. The optimal checkpointing period is known only for Exponential failure distributions [c178]. Dynamic programming heuristics for arbitrary distributions are proposed in [toueg1983optimum, 10.1109/TC.2012.57, c178].
Several works [Plank01processorallocation, SunICPP10, Wang_DSN05, 4367962, 5289177] address the modeling of coordinated checkpointing protocols. In particular, [SunICPP10] and [Plank01processorallocation] focus on the usage of available resources: some may be kept as backups in order to replace failed ones, and others may even be shut down in order to decrease the failure risk or to reduce storage consumption by saving fewer checkpoint snapshots.
The major drawback of coordinated checkpointing protocols is their lack of scalability at extreme scale. These protocols lead to I/O congestion when too many processes checkpoint at the same time. Even worse, transferring the whole memory footprint of an HPC application onto stable storage may well take so much time that a failure is likely to take place during the transfer! A few papers [5289177, j116] propose a scalability study to assess the impact of a small MTBF (i.e., of a large number of processors). The main conclusion is that checkpoint time should be dramatically reduced for platform waste to become acceptable, which motivated the instantiation of optimistic scenarios in Section 4.
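To make the risk of a failure during the transfer concrete, the following minimal sketch assumes Exponential failures and computes the probability that at least one failure strikes during a checkpoint. The function name and the sample values are assumptions chosen for illustration only.

```python
import math

def prob_failure_during_checkpoint(ckpt, mtbf):
    """Probability of at least one failure during a checkpoint of length `ckpt`.

    Assumes Exponentially distributed failures with mean `mtbf`
    (same time unit as `ckpt`).
    """
    return 1.0 - math.exp(-ckpt / mtbf)

# Hypothetical values: 10-minute checkpoint, platform MTBF from 1 day down to 10 minutes.
for mtbf in (86400.0, 3600.0, 600.0):
    p = prob_failure_during_checkpoint(600.0, mtbf)
    print(f"MTBF = {mtbf:>7.0f} s -> P(failure during checkpoint) = {p:.3f}")
```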
All the above approaches maintain a single checkpoint. If the checkpoint file includes errors, the application faces an irrecoverable failure and must restart from scratch. This is because error detection latency is ignored in traditional rollback and recovery schemes. These schemes assume instantaneous error detection (therefore mainly targeting fail-stop failures) and are unable to accommodate silent errors.
5.2 Error detection
Considerable efforts have been directed at error-checking to reveal latent errors. Most techniques combine redundancy at various levels, together with a variety of verification mechanisms. The oldest and most drastic approach is at the hardware level, where all computations are executed in triplicate, and majority voting is enforced in case of different results [Lyons1962]. Error detection approaches include memory scrubbing [Hwang2012], fault-tolerant algorithms [Bronevetsky2008, Heroux2011, Shantharam2011], ABFT techniques [Kuang1984, Bosilca2009] and critical MPI message validation [Fiala2012]. We refer to Lu, Zheng and Chien [LuZhengChien2013] for a comprehensive list of techniques and references. As already mentioned, our work is agnostic of the underlying error-detection technique and takes the cost of verification as an input parameter to the model (see Section 3).
6 Conclusion
In this paper, we revisit traditional checkpointing and rollback recovery strategies. Rather than considering fail-stop failures, we focus on silent data corruption errors. Such latent errors cannot be neglected anymore in High Performance Computing, in particular in sensitive and high precision simulations. The core difference with fail-stop failures is that error detection is not immediate.
We discuss and analyze two models. In the first model, errors are detected after some delay following a probability distribution (typically, an Exponential distribution). We compute the optimal checkpointing period in order to minimize the waste when all checkpoints can be kept in memory, and we show that this period does not depend on the distribution of detection times. In practice, only a few checkpoints can be kept in memory, and hence an error may be detected only after the last correct checkpoint has been removed from storage. We derive a minimum value of the period to guarantee, within a risk threshold, that at least one valid checkpoint remains when a latent error is detected.
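The link between the period and the risk can be sketched with a simplified model, which is an assumption of this example and not the exact derivation of the paper: checkpoints are instantaneous and taken every `period` time units, the last `k` are stored, and the detection latency is Exponential with mean `mean_latency`. The last checkpoint taken before the error is still stored whenever the latency is at most `k - 1` periods, and is certainly lost when the latency reaches `k` periods, which brackets the risk and yields a conservative (sufficient) minimum period.

```python
import math

def risk_bounds(period, k, mean_latency):
    """Bracket the probability that no valid checkpoint remains at detection.

    lower: P(latency >= k * period), where the valid checkpoint is always lost.
    upper: P(latency > (k - 1) * period), beyond which it may be lost.
    """
    lower = math.exp(-k * period / mean_latency)
    upper = math.exp(-(k - 1) * period / mean_latency)
    return lower, upper

def sufficient_period(k, mean_latency, eps):
    """Smallest period for which the upper risk bound is at most `eps`."""
    if k < 2:
        raise ValueError("this conservative bound needs at least 2 stored checkpoints")
    return mean_latency * math.log(1.0 / eps) / (k - 1)

# Hypothetical values: 3 stored checkpoints, 1-hour mean latency, 1% risk.
t_min = sufficient_period(k=3, mean_latency=3600.0, eps=0.01)
print(f"period >= {t_min:.0f} s keeps the risk below 1%")
print(risk_bounds(t_min, 3, 3600.0))
```

Because it relies on the upper bound, this period may be larger than strictly necessary; it only illustrates why a longer period lowers the risk when the number of stored checkpoints is fixed.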
A more realistic model assumes that errors are detected through some verification mechanism. Periodically, one checks whether the current state is meaningful, and thereby possibly detects a latent error. We discuss both the case where the periodic pattern includes several checkpoints for one verification (large cost of verification), and the opposite case with several verifications for one checkpoint (inexpensive verification). We express a formula for the waste in both cases and, from these formulas, we derive the optimal period.
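For intuition, the simplest pattern (one verification immediately followed by one checkpoint in each period) can be sketched as follows. This is a first-order simplification assumed for this example, not the general formula of the paper: recovery time is neglected, and an error is only detected by the verification at the end of the period, so the whole period is re-executed rather than half of it on average.

```python
import math

def verified_waste(period, verif, ckpt, mtbf):
    """First-order waste with one verification and one checkpoint per period.

    Overhead (verif + ckpt) is paid every period; each silent error,
    detected at the end of the period, costs one full period of work.
    """
    return (verif + ckpt) / period + period / mtbf

def verified_period(verif, ckpt, mtbf):
    """Period minimizing verified_waste."""
    return math.sqrt((verif + ckpt) * mtbf)

# Hypothetical costs: 30 s verification, 60 s checkpoint, 1-day MTBF.
t_opt = verified_period(30.0, 60.0, 86400.0)
print(f"period = {t_opt:.0f} s, waste = {verified_waste(t_opt, 30.0, 60.0, 86400.0):.3f}")
```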
The various models are instantiated with realistic parameters, and the evaluation results clearly corroborate the theoretical analysis. For the first model, with detection times, the tradeoff between waste and risk of irrecoverable error clearly appears, showing that a period larger than the one minimizing the irrecoverable-failure-free waste should often be chosen to achieve an acceptable risk. The advantage of the second model is that there are no irrecoverable failures (within each period, there is a verification followed by a checkpoint, hence ensuring a valid checkpoint). We compute the optimal pattern of checkpoints and verifications per period, as a function of their respective costs, to minimize the waste. The pattern with more checkpoints than verifications turns out to be usable only when the cost of a checkpoint is much lower than the cost of a verification, and the conclusion is that it is often more efficient to verify the result every other checkpoint.
Overall, we provide a thorough study of checkpointing models for latent errors, both deriving the models analytically and evaluating them through different scenarios. A future research direction would be to study more general scenarios of multiple checkpointing, for instance by keeping, in the first model, not the last consecutive checkpoints but rather some older checkpoints to decrease the risk. In the second model, more verification/checkpoint combinations could be studied, since we focused on cases with an integer number of checkpoints per verification (or the converse).
Acknowledgments
This work was supported in part by the ANR RESCUE project. A. Benoit and Y. Robert are with the Institut Universitaire de France.