Repairing Multiple Failures with Coordinated and Adaptive Regenerating Codes
† This paper was presented in part at the International Symposium on Network Coding in 2011 (NetCod’2011) in Beijing, China [1]. It also initially appeared (September 2010) as an INRIA Research Report (http://hal.inria.fr/inria00516647) entitled Beyond Regenerating Codes. The main additions in this update (September 2013) are (i) an expanded section on Adaptive Regenerating Codes explaining that they make no sense at the MBR point, and discussing their implementation (Section IV); (ii) a section studying the impact of lazy repairs on both network repair costs and disk-related repair costs (Section V-B); (iii) a discussion of the related work (Section VI).
† The following notice applies to the conference article published at NetCod 2011. ©2011 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
Abstract
Erasure correcting codes are widely used to ensure data persistence in distributed storage systems. This paper addresses the simultaneous repair of multiple failures in such codes. We go beyond existing work (i.e., regenerating codes by Dimakis et al.) by describing (i) coordinated regenerating codes (also known as cooperative regenerating codes) which support the simultaneous repair of multiple devices, and (ii) adaptive regenerating codes which allow adapting the parameters at each repair. Similarly to regenerating codes by Dimakis et al., these codes achieve the optimal tradeoff between storage and repair bandwidth. Based on these extended regenerating codes, we study the impact of lazy repairs applied to regenerating codes and conclude that lazy repairs cannot reduce the costs in terms of network bandwidth but do allow reducing the disk-related costs (disk bandwidth and disk I/O).
I Introduction
Over the last decade, digital information to be stored, be it scientific data, photos, videos, etc., has grown exponentially. Meanwhile, the widespread access to the Internet has changed behaviors: users now expect reliable storage and seamless access to their data. The combination of these factors dramatically increases the demand for large-scale distributed storage systems for backing up or sharing data. This is traditionally achieved by aggregating numerous physical devices to provide large and resilient storage [2, 3, 4, 5]. In such systems, which are prone to disk and network failures, redundancy is the natural solution to prevent permanent data losses. However, as failures occur, the level of redundancy decreases, potentially jeopardizing the ability to recover the original data. This requires the storage system to self-repair to go back to its healthy state (i.e., keep redundancy above a minimum level).
Repairing lost redundancy from the remaining redundancy is paramount for distributed storage systems. Redundancy in storage systems has been extensively implemented using erasure correcting codes [6, 7, 5] because they enable tolerance to failures with low storage overheads. However, these codes come at the price of a large communication overhead, because repairing requires downloading and decoding the whole file. This repair cost has a wide impact on systems since repairs are not limited to restoring data after permanent failures, but are also triggered when doing degraded reads (i.e., accessing data stored on temporarily unavailable or overloaded devices). Dimakis et al. recently showed [8, 9] that the repair cost can be significantly reduced by avoiding decoding using regenerating codes. Yet, they assume a static setting and do not support simultaneous coordinated repairs.
In this paper, we go beyond these works by considering simultaneous repairs in regenerating-like codes. We propose coordinated regenerating codes allowing devices to leverage simultaneous repairs (or simultaneous degraded reads): each of the t devices being repaired contacts d live (i.e., non-failed) devices and then coordinates with the t − 1 others. We also consider a relaxed scheme where d and t can change at each repair to define adaptive regenerating codes. Our contributions regarding these codes are:

We define coordinated regenerating codes (also known as cooperative regenerating codes) and derive closed-form expressions of the optimal quantities of information to transfer when t devices must be repaired simultaneously from d live devices (Section III).

We design adaptive regenerating codes achieving optimal repairs in a dynamic environment where d and t change over time (Section IV).

Based on these constructions, we prove that, when relying on regenerating-like codes (MSR or MBR) [9], deliberately delaying repairs does not bring further savings with respect to repair bandwidth, contrary to what is observed for traditional erasure correcting codes [5, 10, 11], but that it can help reduce disk I/O (Section V).
Our work fills the gap between approaches not supporting simultaneous coordinated repair [9] and approaches repairing by decoding the whole file [6, 5, 7, 10, 11]. Two recent pieces of work focus on similar problems: MCR codes [12] define MSR-like codes that support multiple repairs, and MFR codes [13] turn MSR codes into adaptive codes. Yet, MCR codes only consider the MSR point and assume that all transfers are equal without proving it (i.e., β = β′); MFR codes [13] are not optimal when repairing more than one failure. More recently, concurrent studies have led to the definition of cooperative regenerating codes [14, 15], which are similar to coordinated regenerating codes: they also describe exact code constructions that achieve the bounds given in this paper.
II Background
We consider an n-device system storing a file of M bits split into k blocks of size M/k. To cope with device failures, blocks are stored with some redundancy so that a small number of failures cannot cause permanent data losses. We use a code-based redundancy scheme as it has been acknowledged as more efficient than replication with respect to both storage and repair costs [6]. We focus on self-healing systems as they do not gradually lose their ability to recover the initial file. In the rest of this section, we describe the main code-based approaches for redundancy. For the sake of clarity, we use repairs to designate both repairs following permanent failures and degraded reads following temporary unavailability. Table I gives some values of the storage and repair costs for these approaches, and also includes the codes we propose.
Code  k  d  t  Storage cost  Repair cost
Erasure codes  32  NA  NA  1 MB  32 MB
Erasure codes (delayed repair)  32  NA  4  1 MB  8.8 MB
Dimakis et al.’s MSR  32  36  NA  1 MB  7.2 MB
Dimakis et al.’s MBR  32  36  NA  1.8 MB  1.8 MB
Our MSCR (cf. Sec. III-D2)  32  36  4  1 MB  4.9 MB
Our MBCR (cf. Sec. III-D1)  32  36  4  1.7 MB  1.7 MB
II-A Erasure correcting codes (immediate/eager repairs)
Erasure correcting codes have been widely used to provide redundancy in distributed storage systems [6, 7]. Devices store n encoded blocks of size M/k, which are generated from the k original blocks. The whole file can be recovered, in spite of failures, by decoding from any k encoded blocks. Yet, repairing a single lost encoded block is very expensive since the device must download k encoded blocks and decode the file to regenerate any single lost block (Fig. (a)a).
II-B Erasure correcting codes (delayed/lazy repairs)
A first approach to limiting the repair cost of erasure correcting codes is to delay repairs so as to amortize downloading costs [10, 5, 11]. When a device has downloaded k blocks, it can produce as many new encoded blocks as wanted without any additional download. Hence, instead of immediately repairing every single failure (Figure (a)a), one deliberately waits until t failures are detected; then one of the new devices downloads k blocks, regenerates t blocks and dispatches them to the t − 1 other new devices (Fig. (b)b).
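The amortization behind delayed repairs can be checked with a short calculation. The sketch below assumes a file of `file_mb` megabytes split into `k` blocks, and that a batched repair of `t` failures costs `k` block downloads plus `t - 1` block transfers to the other new devices; the function names are illustrative and do not come from the paper. With the parameters of Table I (32 MB, 32 blocks, 4 failures) it gives 32 MB and 8.75 MB, the latter appearing as 8.8 MB in the table.

```python
from fractions import Fraction

def eager_repair_cost(file_mb, k):
    """Network cost (MB) of repairing one block immediately: download k blocks."""
    block = Fraction(file_mb, k)
    return k * block

def lazy_repair_cost(file_mb, k, t):
    """Average network cost (MB) per repaired block when t failures are batched.

    One new device downloads k blocks, then ships t - 1 regenerated blocks.
    """
    block = Fraction(file_mb, k)
    return (k + t - 1) * block / t

if __name__ == "__main__":
    print(float(eager_repair_cost(32, 32)))
    print(float(lazy_repair_cost(32, 32, 4)))
```

Batching a single failure (`t = 1`) gives back the eager cost, which is a quick sanity check on the model.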
II-C Network coding and regenerating codes
A second approach to increasing the efficiency of repairs relies on network coding [16]. Network coding was initially applied to multicast, for which it has been proven that linear codes achieve the max-flow in a communication graph [17, 18]. Network coding has later been applied to distributed storage and data persistence [19, 20, 21, 22]. A key contribution in this area is regenerating codes [8, 9], introduced by Dimakis et al.
Regenerating codes achieve an optimal tradeoff between the storage α and the repair cost (repair bandwidth) γ, with β bits being downloaded from each of d devices as shown in Figure (c)c. On the tradeoff curve (Figure 9), two specific codes are of interest: MSR (Minimum Storage Regenerating codes), which offer optimal repair costs for minimum storage costs, and MBR (Minimum Bandwidth Regenerating codes), which offer optimal storage costs for minimum repair costs. Regenerating codes can be implemented using linear codes [17, 18, 23, 24, 25, 26, 27, 28, 29]. Related work on the implementation of regenerating codes is discussed in more detail in Section VI.
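The two extreme points can be evaluated numerically. The following sketch uses the standard closed forms for the MSR and MBR points from the regenerating-codes literature, written with M for the file size, k for the number of devices needed to recover and d for the number of devices contacted during a repair; since the expressions are not reproduced in this text, treat them as assumptions. With M = 32 MB, k = 32 and d = 36 they match the MSR and MBR rows of Table I.

```python
from fractions import Fraction

def msr_point(M, k, d):
    """(storage, repair cost) at the minimum storage point (assumed closed form)."""
    alpha = Fraction(M, k)
    gamma = Fraction(M * d, k * (d - k + 1))
    return alpha, gamma

def mbr_point(M, k, d):
    """(storage, repair cost) at the minimum bandwidth point (assumed closed form)."""
    gamma = Fraction(2 * M * d, k * (2 * d - k + 1))
    return gamma, gamma  # at MBR a device stores exactly what it downloads

if __name__ == "__main__":
    for name, point in (("MSR", msr_point), ("MBR", mbr_point)):
        alpha, gamma = point(32, 32, 36)
        print(name, round(float(alpha), 2), round(float(gamma), 2))
```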
III Coordinated regenerating codes
Regenerating codes by Dimakis et al. perform all repairs independently. Hence, the repair cost increases linearly with the number t of failures. In this work, we investigate repairing simultaneous failures through coordination in an attempt to reduce the cost, along the lines of delayed erasure correcting codes. We consider that t devices fail and that repairs are performed simultaneously.
III-A Repair algorithm
Contrary to erasure correcting codes with delayed repairs (Fig. (b)b), our algorithm (Fig. (d)d) is fully distributed: repairing does not require a single device to gather all the information since no decoding is performed. A device being repaired performs the following three tasks, as depicted in Figure 10:


1. Collect. Download a set of subblocks (size ) from each of the live devices. The union of the sets is stored as .

2. Coordinate. Upload a set of subblocks (size ) to each of the other devices being repaired. These sets are generated from . At this stage, subblocks received from the other devices being repaired are stored as .

3. Store. Store a set of subblocks (size ) generated from . and can be erased afterwards.
Interestingly, coordinated regenerating codes evenly balance the load on all devices, thus avoiding the bottleneck that exists in erasure correcting codes with delayed repairs (i.e., the device gathering and decoding all the information (Fig. (b)b)).
In the rest of this section, we take an information-theoretic point of view and focus on the amounts of information exchanged. We define the achievable tradeoffs between the storage cost α and the repair cost γ.
Overall, our main proof (of Theorem 1) follows the same methodology as the seminal article by Dimakis et al. [9]: the system is represented as an information flow graph, we determine inequalities on the amount of information that can flow through the graph and, applying network coding theory, we show that the recovery of a file is possible if and only if some constraints are satisfied. Costs shown in plots are normalized by . The following table summarizes the notations used.
k  Devices to recover  α  Bits stored
t  Devices being repaired  β  Bits transferred (collect)
d  Live devices  β′  Bits transferred (coordinate)
γ = dβ + (t − 1)β′  Total bits transferred per node repaired (i.e., repair cost)
III-B Information flow graphs
Information flow graphs describe the amounts of information transferred, processed and stored. Contrary to the graph defined in [9], ours captures the coordination by adding edges between nodes being repaired. The information flow graph is a directed acyclic graph consisting of a source , intermediary nodes, and data collectors which contact devices to recover the file. A device is represented by nodes of the graph (, and ) corresponding to its repair states ( corresponds to a time step while corresponds to a device introduced at time step ). The capacities of the edges correspond to the amounts of information that can be stored or transferred.
Figure 11 depicts the graph of devices being repaired (assuming divides .). First, devices being repaired perform a collecting step represented by edges () of capacity . Second, devices undergo a coordinating step represented by edges of capacity for . Devices keep everything they obtained during the first step justifying the infinite capacities of edges . Third, they store as shown on edges . Figure 15 depicts the information flow graph of successive repairs.
The graph evolves as repairs are performed. When a repair is performed, a set of nodes is added to the graph and the nodes corresponding to failed devices become inactive (i.e., data collectors and subsequently added nodes cannot connect to these nodes). The rest of the analysis relies on the concept of max-flow, which is the maximum amount of information that can flow from the source to some destination, and which we study through the minimum cut. Network coding [16, 17, 18] allows achieving the maximum flow for multiple destinations.
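To make the max-flow argument concrete, the following sketch builds a toy information flow graph and computes the flow reaching a data collector. Everything specific in it is an assumption made for illustration: the node names (`S`, `in5`, `coor5`, `out5`, `DC`), the tiny setting (two live devices, two devices repaired together, a data collector contacting the two new devices), and the use of a plain Edmonds-Karp routine. Initial devices are collapsed to a single edge from the source with capacity equal to the stored amount.

```python
from collections import defaultdict, deque

INF = float("inf")

def max_flow(capacity, source, sink):
    """Edmonds-Karp: repeatedly augment along shortest residual paths."""
    residual = defaultdict(lambda: defaultdict(int))
    for u, edges in capacity.items():
        for v, c in edges.items():
            residual[u][v] += c
    flow = 0
    while True:
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v, c in list(residual[u].items()):
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return flow  # no augmenting path left
        path, v = [], sink
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        bottleneck = min(residual[u][v] for u, v in path)
        for u, v in path:
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
        flow += bottleneck

def repair_graph(alpha, beta, beta_c):
    """Devices 1 and 2 are live; devices 5 and 6 are repaired together."""
    cap = defaultdict(dict)
    for live in ("out1", "out2"):
        cap["S"][live] = alpha                     # data held by a live device
        for new in ("5", "6"):
            cap[live]["in" + new] = beta           # collect step
    for new, other in (("5", "6"), ("6", "5")):
        cap["in" + new]["coor" + new] = INF        # a device keeps what it collected
        cap["in" + new]["coor" + other] = beta_c   # coordinate step
        cap["coor" + new]["out" + new] = alpha     # store step
        cap["out" + new]["DC"] = INF               # data collector contacts both
    return cap

if __name__ == "__main__":
    print(max_flow(repair_graph(4, 2, 2), "S", "DC"))  # file of size 8 fits
    print(max_flow(repair_graph(4, 1, 2), "S", "DC"))  # too little collected
```

With a file of 8 units, the first setting lets the full file flow to the data collector, while halving the collected amount caps the flow at 4 units, so the file could not be recovered.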
III-C Achievable codes
We define two important properties on codes:

Correctness: A code is correct iff, for any succession of repairs, a data collector can recover the file by connecting to any k devices.

Optimality: A code (α, γ) is optimal iff it is correct and any code (α′, γ′) with (α′, γ′) < (α, γ) is not correct.¹

¹ In this paper, we always consider that (α′, γ′) < (α, γ) means that either α′ < α and γ′ ≤ γ, or α′ ≤ α and γ′ < γ.
The following theorem is an important result of our work.
Theorem 1.
These constraints mean that for any scenario ( is the number of devices contacted in each repair group of size during the recovery and is the number of such groups), the sum of the amounts of information that can be downloaded from each of the devices contacted by a data collector must be greater than the file size. We now give the proof of this theorem. We study all possible graphs given some , and . Finally, it is shown that (2) must be satisfied to allow decoding at any time thus preventing data losses.
Lemma 2.
For any information flow graph , no data collector can recover the initial file if the minimum cut in between and is smaller than the initial file size .
Proof:
Similarly to the proof in [9], since each edge in the information flow graph can be used at most once, and since source to data collector capacity is less than the file size , the recovery of the file is impossible. ∎
Lemma 3.
For any finite information flow graph , if the minimum of the mincuts separating the source and each data collector is larger than or equal to the file size , then there exists a linear network code such that all data collectors can recover the file. We also assume that the finite field size is not an issue.
Proof:
Similarly to the proof in [9], since the reconstruction problem reduces to multicasting on all possible data collectors, the result follows from network coding theory. ∎
Lemma 4.
For any information flow graph consisting of initial devices that obtain bits directly from the source and of additional devices that join the graph in groups of devices obtaining from existing devices and from each of the other joining devices, any data collector that connects to a subset of out-nodes of satisfies:
(3)  
with .
Proof:
Let us consider some graph (see an example in Figure 15) formed by adding devices according to the repair process described above. Consider a recovery scenario in which a data collector connects to a subset of nodes , where is the set of contacted devices.
As all incoming edges of have infinite capacity, we only examine mincuts with and . Moreover some additional cases cannot happen since there is an order between , and (e.g., and need not be considered). Therefore, we only need to examine three cases detailed in the rest of this proof.
Let denote the edges in the cut (i.e., the set of edges going from to ). Every directed acyclic graph has a topological sorting, which is an ordering of its vertices such that the existence of an edge implies . In the rest of the analysis, we group nodes that were repaired simultaneously. Since we contact nodes, we have at least groups and at most groups (i.e., ). Since nodes are sorted, nodes considered at the th step cannot depend on nodes considered at th steps with .
Consider the th group. Let be the set of indexes such that are the topologically th output nodes in corresponding to the th (same) repair. The set contains nodes. Consider a subset of size such that and . can take any value between and .
First, consider the nodes . For each node, . We consider the two cases.

If , then . The contribution to the cut is .

If , then . The contribution to the cut is .
Second, consider the other nodes (third and last case: , and all belong to ). For each node, the contribution comes from multiple sources.

The cut contains at least edges carrying : since are the topologically th output nodes in , at most edges come from output nodes in , other edges come from .

The cut contains edges carrying thanks to the coordination step. The node has incoming edges . However, since , the cut contains only such edges.
Therefore, the total contribution of these nodes is
Since the function is concave for taking values in the interval , the contribution can be bounded thanks to Jensen’s inequality.
Summing these contributions for all , and considering the worst case for (i.e., the scenario that minimizes the sum) leads to (4). ∎
Proof:
From Lemmas 3 and 4, a code is correct if it satisfies (1) and (2). From Lemma 2, a code is correct only if . Moreover, for any set of parameters and any scenario , we can find a graph such that
The graph is built using the following process (for the graph of Figure (c)c is built):

The data collector gets all bits from a set of devices.

The contacted devices repaired simultaneously are grouped in subsets of size such that . Since we contact nodes, we have at least groups and at most groups (i.e., ).

Each device gets bits from all devices in , from devices taking part in the reconstruction, from devices not in , from devices not taking part in the reconstruction.
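The scenarios used in this construction can be enumerated mechanically. The sketch below assumes that a scenario is a tuple u = (u_0, …, u_{g−1}) of group sizes with 1 ≤ u_i ≤ t and a total of k contacted devices, and that the condition to check for each scenario is Σ_i u_i · min{α, (d − Σ_{j<i} u_j)β + (t − u_i)β′} ≥ M. Both the symbols and this form of the condition are assumptions, since the displayed constraints are not reproduced in this text; the function names are illustrative.

```python
from fractions import Fraction

def scenarios(k, t):
    """Yield every tuple of group sizes (each between 1 and t) summing to k."""
    if k == 0:
        yield ()
        return
    for first in range(1, min(t, k) + 1):
        for rest in scenarios(k - first, t):
            yield (first,) + rest

def flow_bound(u, d, t, alpha, beta, beta_c):
    """Assumed lower bound on the information reaching a data collector."""
    total, seen = 0, 0
    for ui in u:
        total += ui * min(alpha, (d - seen) * beta + (t - ui) * beta_c)
        seen += ui
    return total

def is_correct(M, k, d, t, alpha, beta, beta_c):
    """True when every scenario lets at least M units reach the collector."""
    return all(flow_bound(u, d, t, alpha, beta, beta_c) >= M
               for u in scenarios(k, t))

if __name__ == "__main__":
    print(sorted(scenarios(4, 2)))
    print(is_correct(12, 4, 5, 2, 3, 1, 1))
    print(is_correct(12, 4, 5, 2, 3, Fraction(9, 10), Fraction(9, 10)))
```

For k = 4 and t = 2 the enumeration produces five scenarios with between 2 and 4 groups. The number of scenarios grows quickly with k, so this brute-force check is only practical for small parameters.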
III-D Optimal tradeoffs
Determining the optimal tradeoffs boils down to minimizing storage cost and repair cost , under constraints of (1) and (2). , and are parameters to be optimized. Again, we assume that divides .
III-D1 MBCR codes
Minimum Bandwidth Coordinated Regenerating Codes correspond to optimal codes that provide the lowest possible repair cost (bandwidth consumption) while minimizing the storage cost . Figure 16 compares MBCR codes to both Dimakis et al.’s MBR [9] and erasure correcting codes with delayed repairs (ECC).
We determine these values in two steps. We study two particular cuts to find the minimum values required to ensure that the max flow is at least equal to the file size, thus proving the optimality of the solution if correct. We then prove that these quantities are sufficient for all possible cuts.
Proof:
Let us consider two specific successions of repairs ( (Fig. (b)b) and (Fig. (a)a)). The corresponding repairs are described in the Proof of Theorem 1. As we want to minimize before , we assume .
When , it is required that
which is equivalent to
When , it is required that
which is equivalent to
Consider the smallest value , the associated repair cost is . This implies that the repair cost grows linearly with , we therefore seek to minimize . The minimum value for is . ∎
Proof:
We have proved that the aforementioned values are required for two specific scenarios. We now prove that such values ensure that enough information flows through every cut for any scenario thus proving correctness. According to Theorem 1, the following condition is sufficient for correctness. We show that the values of , and for MBCR codes satisfy this condition:
since (the stored part) is always larger than or equal to the transmitted data,
replacing , and by their values,
which is equivalent to
As , it simplifies to which is always true. Hence, MBCR codes are correct.∎
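The MBCR point can be evaluated numerically to cross-check Table I. The closed forms below (β = 2M / (k(2d − k + t)), β′ = M / (k(2d − k + t)), α = γ = dβ + (t − 1)β′) are the expressions commonly given for this point in the cooperative regenerating-codes literature; they are stated here as assumptions because the displayed values are not reproduced in this text, and the function name is illustrative.

```python
from fractions import Fraction

def mbcr_point(M, k, d, t):
    """Return (alpha, beta, beta_c, gamma) at the MBCR point (assumed closed form)."""
    unit = Fraction(M, k * (2 * d - k + t))
    beta, beta_c = 2 * unit, unit           # collect and coordinate transfers
    gamma = d * beta + (t - 1) * beta_c     # total download per repaired device
    return gamma, beta, beta_c, gamma       # storage equals the repair cost here

if __name__ == "__main__":
    alpha, beta, beta_c, gamma = mbcr_point(32, 32, 36, 4)
    print(round(float(alpha), 2), round(float(gamma), 2))
```

With M = 32 MB, k = 32, d = 36 and t = 4 this gives about 1.70 MB for both storage and repair cost, in line with the MBCR row of Table I, and with t = 1 it reduces to the MBR value of that table.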
III-D2 MSCR codes
Minimum Storage Coordinated Regenerating Codes correspond to optimal codes that provide the lowest possible storage cost while minimizing the repair cost . This point has been independently characterized by Hu et al. in [12]; however, they assume that β = β′ without proving it. We present a simple derivation from Theorem 1 allowing us to characterize this point. Figure 17 compares MSCR codes to both Dimakis et al.’s MSR [9] and erasure correcting codes with delayed repairs (ECC). Note that for d = k, our MSCR codes share the same repair cost as erasure correcting codes with delayed repairs. Yet, in this case, our codes still have the advantage that they balance the load evenly, thus avoiding bottlenecks.
Proof:
Let us consider two particular successions of repairs ( and ) leading to the graphs shown on Figure 15. The repairs corresponding to such graphs are described in the Proof of Theorem 1.
We minimize α first. It is clear that α = M/k is minimal since α < M/k makes it impossible to reconstruct a file of size M using only k blocks. Hence, what matters now is that each element of the sum is at least equal to M/k.
Consider the smallest value , the associated repair cost is . This implies that the repair cost grows linearly with , we therefore seek to minimize . The minimum value for is . ∎
Proof:
The proof of correctness is quite similar to the previous one. It consists in proving that
is always verified when , and take the aforementioned values.
Since each element of the sum is at most , each element of the sum must satisfy the following constraint.
Applying values for MSCR codes,
which is satisfied if
which is true since and . Therefore, MSCR codes are correct.
∎
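The MSCR point can likewise be evaluated numerically. The sketch assumes the closed forms α = M/k and β = β′ = M / (k(d − k + t)), hence γ = M(d + t − 1) / (k(d − k + t)); these are the expressions usually given for this point and are labeled as assumptions because the displayed values are not reproduced in this text. The helper for delayed erasure-code repairs uses the cost model of Section II-B (k downloads plus t − 1 transfers per batch).

```python
from fractions import Fraction

def mscr_point(M, k, d, t):
    """Return (alpha, beta, beta_c, gamma) at the MSCR point (assumed closed form)."""
    alpha = Fraction(M, k)
    beta = Fraction(M, k * (d - k + t))     # same amount for collect and coordinate
    gamma = (d + t - 1) * beta
    return alpha, beta, beta, gamma

def delayed_erasure_cost(M, k, t):
    """Average per-device cost of a batched erasure-code repair."""
    return Fraction(M * (k + t - 1), k * t)

if __name__ == "__main__":
    print(float(mscr_point(32, 32, 36, 4)[3]))  # 4.875, shown as 4.9 MB in Table I
    print(mscr_point(32, 32, 32, 4)[3] == delayed_erasure_cost(32, 32, 4))
```

Setting d = k makes the two costs coincide, which is the observation made at the start of this subsection, and setting t = 1 gives back the MSR cost.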
III-D3 General CR codes
The general case corresponds to all possible tradeoffs in between MSCR and MBCR. Valid points can be determined by performing a numerical minimization of the repair cost for various storage costs under the constraints of (1) and (2). Figure 18 shows the optimal tradeoffs: coordinated regenerating codes (t > 1) can go beyond the optimal tradeoffs for independent repairs (t = 1) defined by regenerating codes by Dimakis et al. [9].
IV Adaptive Regenerating Codes
So far we assumed d and t to remain constant across repairs, similarly to [9] where d is assumed to remain constant. This may not be realistic in real systems, which are dynamic.
At the Minimum Storage Point (), such strong assumptions are not needed as repairs are independent (i.e., each term of the sum in (2) can be treated independently). We propose to adapt the quantities to transfer and to the system state, which is defined by the number of devices being repaired and the number of live devices. The resulting adaptive regenerating codes simplify the system design as only the parameter needs to be decided during the conception: adaptive regenerating codes decide, at runtime for each repair, the best to offer the lowest repair cost .
IV-A Adaptive codes at the Minimum Storage point
Theorem 5.
Adaptive regenerating codes are both correct and optimal. is a function that maps a particular repair setting to the amounts of information to be transferred during a repair.
(4) 
In this subsection, we prove they are correct and Pareto optimal.
Lemma 6.
For any information flow graph consisting of initial devices that obtain bits directly from the source and of additional devices that join the graph in groups of devices obtaining from existing devices and from each of the other joining devices, any data collector that connects to a subset of out-nodes of satisfies:
(5)  
with .
Proof:
The proof is similar to the proof of Lemma 4.∎
Proof:
Using Lemmas 2, 3 and 6, we can define the following sufficient condition for the code to be correct. The condition is satisfied when and take the values defined in (4).
The condition must be satisfied for every . For any , since each element of the sum is at most , each element of the sum must satisfy the following constraint.
Applying formulas of (4),
which is satisfied if
which is true since and . Therefore, adaptive regenerating codes are correct.∎
Proof:
We prove by contradiction that the adaptive regenerating codes are optimal. Let us assume that there exists a correct code such that (i.e., for some , ).
Consider a set of failures such that all repairs are performed by groups of devices downloading data from devices. Consider the corresponding information flow graph. Assuming repairs are performed with a correct code , the information flow graph also corresponds to a correct code .
Moreover, according to the previous section, these failures can be repaired optimally using the MSCR code . Therefore, there is a contradiction since the code cannot be correct if the code is optimal. A correct code cannot exist, and the adaptive regenerating code defined in this section is optimal. ∎
Building on results from coordinated regenerating codes (especially MSCR), we have defined adaptive regenerating codes and proved that they are both correct and optimal. These codes are of particular interest for dynamic systems where failures may occur randomly and simultaneously.
IV-B Adaptive codes at the Minimum Bandwidth point
We have built adaptive regenerating codes from Minimum Storage codes () by observing that initial assumptions of fixed value and can be relaxed. In this subsection, we study whether adaptive codes can be built from MBR codes. We determine lower bounds on the storage and repair cost . These lower bounds allow concluding that an adaptive scheme at the Minimum Bandwidth point costs as much as classical erasure correcting codes.
Let us consider that can take any value between and , and that can take any value between and . Since we cannot predict the future, when choosing and , we must assume any value for and with . More specifically, the current repair can be the first of a sequence since we do not know which devices will fail, how they will be repaired and how data will be collected. We need to consider the worst case that can occur in the future: and .
In a first scenario when (e.g., Fig. (a)a), it is required that
which expands to
replacing and by their minimum admissible values since we cannot make any assumption on the future, and by the number of groups when all groups but the first are of size (i.e., ), we get
which simplifies to
which simplifies to
Let us consider the case of and , and determine the minimal value for .
The first scenario we considered allowed determining that . We now consider another possible scenario to obtain a lower bound on the per failed node repair cost . In this second scenario, when (e.g., Fig. (b)b), it is required that
which expands to
replacing and by their minimum admissible values ( and ) since we cannot make any assumption on the future, and by the number of groups when all groups but the first are of size (i.e., ), we get
which simplifies to
Hence, we obtain the following lower bound for an adaptive regenerating code operating at the MBR point.
The cost of a static scheme assuming that we contact as few nodes as possible and repair as few nodes as possible is as explained in Section IIID1. Hence,
As a consequence, an adaptive scheme at the Minimum Bandwidth point is meaningless since it would be more expensive than a simpler static MBCR code set up for the worst case (i.e., and ).
IV-C Performance
We compare our Adaptive Regenerating Codes at the MSR point to MFR codes defined in [13]. This approach is built upon MSR codes defined by Dimakis et al. in [9]. The coding scheme can be described as where is a function . The repairs needed are performed independently.
(6) 
Let us consider the particular case where . The average cost per repair of our codes remains constant . In the MFR approach, which requires repairs to be performed independently, the average repair cost increases with . Therefore, the performance of our adaptive regenerating codes does not degrade as the number of failures increases, as opposed to the MFR constructed upon Dimakis et al.’s codes. This is also shown in Figure 19.
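The contrast between the two schemes can be illustrated numerically. The sketch below rests on several assumptions that are not spelled out in this text: the particular case is taken to be the one where every live device is contacted (d = n − t), the adaptive per-device cost is taken to be M(d + t − 1) / (k(d − k + t)), and the MFR per-device cost is taken to be that of an independent minimum-storage repair, Md / (k(d − k + 1)). Function names and the value n = 40 are illustrative.

```python
from fractions import Fraction

def adaptive_cost(M, k, d, t):
    """Per-device cost of a coordinated repair at minimum storage (assumed form)."""
    return Fraction(M * (d + t - 1), k * (d - k + t))

def mfr_cost(M, k, d):
    """Per-device cost of an independent repair from d live devices (assumed form)."""
    return Fraction(M * d, k * (d - k + 1))

if __name__ == "__main__":
    M, k, n = 32, 32, 40
    for t in range(1, n - k + 1):
        d = n - t  # assumption: all live devices take part in the repair
        print(t, float(adaptive_cost(M, k, d, t)), float(mfr_cost(M, k, d)))
```

Under these assumptions the first column of costs stays at M(n − 1) / (k(n − k)) for every t, while the second grows as more devices fail.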
IV-D Adaptive Coding Schemes
Our approach also has significant advantages over the MFR approach with respect to the actual coding scheme being implemented when is constant. The coding schemes are similar in principle to the one described in Subsection IIIA and Figure 10. The only difference is that the values and may differ from one repair to the other. Each device stores subblocks of data and combines them to send the appropriate quantities of information. To be able to send , each device must store subblocks. To be able to send or , each device must store subblocks. Hence, the length of any random linear code used to implement such a system is where is the number of subblocks stored by each device. We now consider a system of constant size and compare both implementations.
The implementation of the MFR approach implies that to support , each device must be able to send all quantities . Hence, . It is known that . Hence, the length of the codes required to implement such codes grows exponentially with .
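The subblock counting argument can be illustrated with a small calculation. The sketch assumes that each quantity a device may have to send is the stored data divided by an integer, so that the number of subblocks must be a common multiple of all those integers; taking the integers 1, …, m as the set of divisors is an assumption made for illustration, and the function names are hypothetical.

```python
from functools import reduce
from math import gcd

def lcm(a, b):
    """Least common multiple of two positive integers."""
    return a * b // gcd(a, b)

def subblocks_needed(divisors):
    """Smallest subblock count that every divisor splits evenly."""
    return reduce(lcm, divisors, 1)

if __name__ == "__main__":
    for m in (2, 4, 6, 8, 10, 12):
        print(m, subblocks_needed(range(1, m + 1)))
```

The least common multiple of 1, …, m grows quickly with m (840 for m = 8, 2520 for m = 10), which is the growth the paragraph above refers to.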
The implementation of our approach implies that to support