Repairing Multiple Failures with Coordinated and Adaptive Regenerating Codes This paper was presented in part at the International Symposium on Network Coding in 2011 (NetCod’2011) in Beijing, China [1]. It also initially appeared (September 2010) as an INRIA Research Report (http://hal.inria.fr/inria-00516647) entitled Beyond Regenerating Codes. The main additions in this update (September 2013) are (i) an expanded section on Adaptive Regenerating Codes explaining that they make no sense at the MBR point, and discussing their implementation (Section IV); (ii) a section studying the impact of lazy repairs on both the network repair cost and the disk-related repair costs (Section V-B); (iii) a discussion of the related work (Section VI). The following notice applies to the conference article published at NetCod 2011. ©2011 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.


Anne-Marie Kermarrec1, Nicolas Le Scouarnec2 and Gilles Straub2 1 INRIA Rennes - Bretagne-Atlantique, Rennes, France
2 Technicolor, Rennes, France
Anne-Marie.Kermarrec@inria.fr, Nicolas.Le-Scouarnec@technicolor.com, Gilles.Straub@technicolor.com
Abstract

Erasure correcting codes are widely used to ensure data persistence in distributed storage systems. This paper addresses the simultaneous repair of multiple failures in such codes. We go beyond existing work (i.e., regenerating codes by Dimakis et al.) by describing (i) coordinated regenerating codes (also known as cooperative regenerating codes), which support the simultaneous repair of multiple devices, and (ii) adaptive regenerating codes, which allow adapting the parameters at each repair. Similarly to regenerating codes by Dimakis et al., these codes achieve the optimal tradeoff between storage and repair bandwidth. Based on these extended regenerating codes, we study the impact of lazy repairs applied to regenerating codes and conclude that lazy repairs cannot reduce the costs in terms of network bandwidth but do allow reducing the disk-related costs (disk bandwidth and disk I/O).

erasure correcting codes, regenerating codes, network coding, distributed storage, repair, multiple failures

I Introduction

Over the last decade, digital information to be stored, be it scientific data, photos, videos, etc., has grown exponentially. Meanwhile, the widespread access to the Internet has changed behaviors: users now expect reliable storage and seamless access to their data. The combination of these factors dramatically increases the demand for large-scale distributed storage systems for backing up or sharing data. This is traditionally achieved by aggregating numerous physical devices to provide large and resilient storage [2, 3, 4, 5]. In such systems, which are prone to disk and network failures, redundancy is the natural solution to prevent permanent data losses. However, as failures occur, the level of redundancy decreases, potentially jeopardizing the ability to recover the original data. This requires the storage system to self-repair to go back to its healthy state (i.e., keep redundancy above a minimum level).

Repairing lost redundancy from the remaining redundancy is paramount for distributed storage systems. Redundancy in storage systems has been extensively implemented using erasure correcting codes [6, 7, 5], for they enable tolerance to failures with low storage overheads. However, such codes come at the price of a large communication overhead, because repairing requires downloading and decoding the whole file. This repair cost has a wide impact on systems since repairs are not limited to restoring data after permanent failures, but are also triggered when doing degraded reads (i.e., accessing data stored on temporarily unavailable or overloaded devices). Dimakis et al. recently showed [8, 9] that the repair cost can be significantly reduced by avoiding decoding, using regenerating codes. Yet, they assume a static setting and do not support simultaneous coordinated repairs.

In this paper, we go beyond these works by considering simultaneous repairs in regenerating-like codes. We propose coordinated regenerating codes allowing devices to leverage simultaneous repairs (or simultaneous degraded reads): each of the t devices being repaired contacts d live (i.e., non-failed) devices and then coordinates with the t − 1 others. We also consider a relaxed scheme where t and d can change at each repair to define adaptive regenerating codes. Our contributions regarding these codes are:

  • We define coordinated regenerating codes (also known as cooperative regenerating codes) and derive closed-form expressions of the optimal quantities of information to transfer when t devices must be repaired simultaneously from d live devices (Section III).

  • We design adaptive regenerating codes achieving optimal repairs in a dynamic environment where t and d change over time (Section IV).

  • Based on these constructions, we prove that, when relying on regenerating-like codes (MSR or MBR) [9], deliberately delaying repairs does not bring further savings with respect to repair bandwidth, contrary to what is observed for traditional erasure correcting codes [5, 10, 11], but that it can help when looking at disk I/O (Section V).

(a) Erasure correcting codes
(b) Erasure codes (delayed repair)
(c) Regenerating codes
(d) Coordinated reg. codes
Fig. 5: Repairing failures with codes. In an n-device network, t failed devices are replaced by new ones. The new devices fetch a given amount of data from live devices to repair the redundancy.

Our work fills the gap between approaches not supporting simultaneous coordinated repair [9] and approaches repairing by decoding the whole file [6, 5, 7, 10, 11]. Two recent pieces of work focus on similar problems: MCR codes [12] define MSR-like codes that support multiple repairs, and MFR codes [13] turn MSR codes into adaptive codes. Yet, MCR codes only consider the MSR point and assume that all transfers are equal without proving it (i.e., that β = β′); MFR codes [13] are not optimal when repairing more than one failure. More recently, concurrent studies have led to the definition of cooperative regenerating codes [14, 15], which are similar to coordinated regenerating codes: they also describe exact code constructions that achieve the bounds given in this paper.

II Background

We consider an n-device system storing a file of M bits split into k blocks of size M/k. To cope with device failures, blocks are stored with some redundancy so that a small number of failures cannot cause permanent data losses. We use a code-based redundancy scheme as it has been acknowledged as more efficient than replication with respect to both storage and repair costs [6]. We focus on self-healing systems as they do not gradually lose their ability to recover the initial file. In the rest of this section, we describe the main code-based approaches for redundancy. For the sake of clarity, we will use repairs to designate both repairs following permanent failures and degraded reads following temporary unavailability. Table I gives some values of the storage and repair costs for these approaches, and also includes the codes we propose.

Code                             k    d    t    Stored per device   Repair cost
Erasure codes                    32   NA   NA   1 MB                32 MB
Erasure codes (delayed repair)   32   NA   4    1 MB                8.8 MB
Dimakis et al.’s MSR             32   36   NA   1 MB                7.2 MB
Dimakis et al.’s MBR             32   36   NA   1.8 MB              1.8 MB
Our MSCR (cf. Sec. III-D2)       32   36   4    1 MB                4.9 MB
Our MBCR (cf. Sec. III-D1)       32   36   4    1.7 MB              1.7 MB
TABLE I: Some examples of repair costs for a file of 32 MB (k = 32 blocks of 1 MB)
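The values of Table I can be reproduced from closed-form per-repair costs: the MSR and MBR expressions are the classical ones from [9], the lazy erasure-code cost amortizes one decoding over t repairs, and the MSCR/MBCR expressions are the coordinated forms consistent with Section III-D. A sketch (the helper names are illustrative, not from the paper):

```python
def ec_eager(M, k):
    """Erasure code, one repair per failure: download k blocks of size M/k."""
    return k * (M / k)

def ec_lazy(M, k, t):
    """Delayed repair: one newcomer downloads k blocks, decodes, then ships
    one regenerated block to each of the t-1 others; cost amortized over t."""
    return (k + t - 1) * (M / k) / t

def msr(M, k, d):
    """Dimakis et al.'s MSR repair cost [9]."""
    return d * M / (k * (d - k + 1))

def mbr(M, k, d):
    """Dimakis et al.'s MBR repair cost (equal to its storage cost) [9]."""
    return 2 * d * M / (k * (2 * d - k + 1))

def mscr(M, k, d, t):
    """MSCR repair cost (closed form assumed from Section III-D2)."""
    return (d + t - 1) * M / (k * (d + t - k))

def mbcr(M, k, d, t):
    """MBCR repair cost, equal to its storage (form assumed from Section III-D1)."""
    return (2 * d + t - 1) * M / (k * (2 * d + t - k))

M, k, d, t = 32.0, 32, 36, 4          # parameters of Table I (file of 32 MB)
print(ec_eager(M, k))                 # 32.0 MB
print(round(ec_lazy(M, k, t), 2))     # 8.75 MB (8.8 in Table I)
print(round(msr(M, k, d), 2))         # 7.2 MB
print(round(mbr(M, k, d), 2))         # 1.76 MB (1.8 in Table I)
print(round(mscr(M, k, d, t), 2))     # 4.88 MB (4.9 in Table I)
print(round(mbcr(M, k, d, t), 2))     # 1.7 MB
```

Every printed value matches the corresponding entry of Table I up to rounding.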

II-A Erasure correcting codes (immediate/eager repairs)

Erasure correcting codes have been widely used to provide redundancy in distributed storage systems [6, 7]. Devices store encoded blocks of size M/k, which are generated from the k original blocks. The whole file can be recovered, in spite of failures, by decoding from any k encoded blocks. Yet, repairing a single lost encoded block is very expensive since the device must download k encoded blocks and decode the whole file to regenerate any single lost block (Fig. 5a).

II-B Erasure correcting codes (delayed/lazy repairs)

A first approach to limiting the repair cost of erasure correcting codes is to delay repairs so as to factor downloading costs [10, 5, 11]. When a device has downloaded k blocks, it can produce as many new encoded blocks as wanted without any additional cost. Hence, instead of immediately repairing every single failure (Fig. 8a), one deliberately waits until t failures are detected, then one of the new devices downloads k blocks, regenerates t blocks and dispatches them to the t − 1 other devices (Fig. 8b).

(a) Immediate repairs
(b) Delayed repairs
Fig. 8: Delaying repairs allows performing multiple repairs at once.

II-C Network coding and regenerating codes

A second approach to increasing the efficiency of repairs relies on network coding [16]. Network coding was initially applied to multicast, for which it has been proven that linear codes achieve the maxflow in a communication graph [17, 18]. Network coding has later been applied to distributed storage and data persistence [19, 20, 21, 22]. A key contribution in this area is regenerating codes [8, 9], introduced by Dimakis et al.

Regenerating codes achieve an optimal trade-off between the storage and the repair cost (repair bandwidth), with β bits being downloaded from each of d devices, as shown on Figure 5c. On the tradeoff curve (Figure 9), two specific codes are of interest: MSR (Minimum Storage Regenerating) codes, which offer optimal repair costs for minimum storage costs, and MBR (Minimum Bandwidth Regenerating) codes, which offer optimal storage costs for minimum repair costs. Regenerating codes can be implemented using linear codes [17, 18, 23, 24, 25, 26, 27, 28, 29]. Related work on the implementation of regenerating codes is discussed in more detail in Section VI.

Fig. 9: Regenerating codes (MSR or MBR) offer improved performance when compared to erasure correcting codes (EC)

III Coordinated regenerating codes

Regenerating codes by Dimakis et al. perform all repairs independently. Hence, the repair cost increases linearly with t. In this work, we investigate repairing simultaneous failures through coordination in an attempt to reduce the cost, along the lines of delayed erasure correcting codes. We consider that t devices fail and that the t repairs are performed simultaneously.

III-A Repair algorithm

Contrary to erasure correcting codes with delayed repair (Fig. 5b), our algorithm (Fig. 5d) is fully distributed: repairing does not require a single device to gather all the information since no decoding is performed. A device being repaired performs the three following tasks as depicted on Figure 10:


1. Collect. Download a set of sub-blocks (of size β) from each of the d live devices. The union of these d sets is stored temporarily.

2. Coordinate. Upload a set of sub-blocks (of size β′) to each of the t − 1 other devices being repaired. These sets are generated from the data collected during the first step. At this stage, the sub-blocks received from the other devices being repaired are also stored temporarily.

3. Store. Store a set of sub-blocks (of size α) generated from all the data gathered during the two previous steps. The temporary data can be erased afterwards.

Interestingly, coordinated regenerating codes evenly balance the load on all devices, thus avoiding the bottleneck existing in erasure correcting codes with delayed repairs (i.e., the single device gathering and decoding all the information (Fig. 5b)).

Fig. 10: Coordinated regenerating codes based on linear codes. The system stores a file and is composed of 5 devices, each storing 3 sub-blocks. Devices 4 and 5 fail and are replaced by devices 6 and 7. The figure depicts the blocks computed; new blocks are obtained as linear combinations of existing blocks.

In the rest of this section, we take an information theoretic point of view and focus on the amounts of information exchanged. We define the achievable tradeoffs between the storage cost α and the repair cost γ.

Overall, our main proof (of Theorem 1) follows the same methodology as the seminal article by Dimakis et al. [9]: the system is represented as an information flow graph, we determine inequalities on the amount of information that can flow through the graph and, applying network coding theory, we show that the recovery of a file is possible if and only if some constraints are satisfied. Costs shown in plots are normalized by M/k. The following table summarizes the notations used.

k   Devices to contact to recover the file    α    Bits stored per device
t   Devices being repaired                    β    Bits transferred per live device (collect)
d   Live devices (k ≤ d ≤ n − t)              β′   Bits transferred per repaired device (coordinate)
γ = dβ + (t − 1)β′   Total bits transferred per node repaired (i.e., repair cost)


III-B Information flow graphs

Information flow graphs describe the amounts of information transferred, processed and stored. Contrary to the graph defined in [9], ours captures the coordination by adding edges between nodes being repaired. The information flow graph is a directed acyclic graph consisting of a source S, intermediary nodes, and data collectors which contact k devices to recover the file. A device is represented by several nodes of the graph corresponding to its repair states (one group of nodes for each time step at which the device is introduced or repaired). The capacities of the edges correspond to the amounts of information that can be stored or transferred.

Figure 11 depicts the graph of t devices being repaired (assuming t divides k). First, devices being repaired perform a collecting step, represented by edges of capacity β from the live devices. Second, devices undergo a coordinating step, represented by edges of capacity β′ between devices being repaired. Devices keep everything they obtained during the first step, justifying the infinite capacities of the internal edges. Third, they store α bits, as shown on the outgoing edges. Figure 15 depicts the information flow graph of successive repairs.

Fig. 11: Information flow graph of a repair of devices. The internal nodes represent intermediary steps in the repair. Plain edges correspond to network communication and dashed edges correspond to local communication.

The graph evolves as repairs are performed. When a repair is performed, a set of nodes is added to the graph and the nodes corresponding to failed devices become inactive (i.e., data collectors and subsequently added nodes cannot connect to them). The rest of the analysis relies on the concept of maxflow, which is the maximum amount of information that can flow from the source S to some destination, through the study of the minimum cut. Network coding [16, 17, 18] allows achieving the maximum flow for multiple destinations.

III-C Achievable codes

We define two important properties on codes:


Correctness

A code is correct iff, for any succession of repairs, a data collector can recover the file by connecting to any k devices.

Optimality

A code (α, γ) is optimal iff it is correct and any code with (α′, γ′) < (α, γ) is not correct. (In this paper, (α′, γ′) < (α, γ) means that either α′ < α and γ′ ≤ γ, or α′ ≤ α and γ′ < γ.)

The following theorem is an important result of our work.

Theorem 1.

A coordinated regenerating code is correct (we assume that t divides k; no result is known if t does not divide k) if and only if there exist β and β′ such that the constraints of (1) and (2) are satisfied. A code minimizing the repair cost γ of (1) under the constraints of (2) is optimal.

γ = dβ + (t − 1)β′ (1)
∀(s_1, …, s_g) such that ∑_{i=1}^{g} s_i = k and 1 ≤ s_i ≤ t:  ∑_{i=1}^{g} s_i · min(α, (d − ∑_{j=1}^{i−1} s_j) β + (t − s_i) β′) ≥ M (2)

These constraints mean that for any scenario (s_i is the number of devices contacted in each repair group of size t during the recovery, and g is the number of such groups), the sum of the amounts of information that can be downloaded from the k devices contacted by a data collector must be at least the file size M. We now give the proof of this theorem. We study all possible graphs given some k, d and t. Finally, it is shown that (2) must be satisfied to allow decoding at any time, thus preventing data losses.
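For small systems, the correctness condition of Theorem 1 can be checked exhaustively. The sketch below assumes the cut-based form ∑ᵢ sᵢ·min(α, (d − ∑_{j<i} s_j)β + (t − sᵢ)β′) ≥ M for constraint (2), and uses the MSCR-point values α = M/k, β = β′ = M/(k(d + t − k)) reported in the cooperative-regenerating-codes literature; the function names are illustrative, not from the paper:

```python
def scenarios(k, t):
    """All ordered scenarios (s_1, ..., s_g) with sum k and 1 <= s_i <= t."""
    if k == 0:
        return [()]
    return [(s,) + tail
            for s in range(1, min(t, k) + 1)
            for tail in scenarios(k - s, t)]

def min_cut_bound(s_list, d, t, alpha, beta, beta2):
    """Assumed bound of (2): sum_i s_i * min(alpha, (d - prior)*beta + (t - s_i)*beta2)."""
    total, prior = 0.0, 0
    for s in s_list:
        total += s * min(alpha, (d - prior) * beta + (t - s) * beta2)
        prior += s
    return total

def correct(M, k, d, t, alpha, beta, beta2, eps=1e-9):
    """A code is correct iff every scenario lets at least M flow to the collector."""
    return all(min_cut_bound(s, d, t, alpha, beta, beta2) >= M - eps
               for s in scenarios(k, t))

# MSCR-point values for k=4, d=4, t=2, M=1 (closed form assumed)
M, k, d, t = 1.0, 4, 4, 2
alpha, beta = M / k, M / (k * (d + t - k))
print(correct(M, k, d, t, alpha, beta, beta))        # True: the bound is met
print(correct(M, k, d, t, alpha, 0.9 * beta, beta))  # False: slightly less traffic fails
```

At the MSCR point every scenario is tight, matching the intuition that an optimal code wastes no bandwidth.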

Fig. 15: Information flow graphs (a)–(c) for which the bounds in (2) are matched with equality for some scenarios.
Lemma 2.

For any information flow graph, no data collector can recover the initial file if the minimum cut between the source S and the data collector is smaller than the initial file size M.

Proof:

Similarly to the proof in [9], since each edge in the information flow graph can be used at most once, and since the source-to-data-collector capacity is less than the file size M, the recovery of the file is impossible. ∎

Lemma 3.

For any finite information flow graph, if the minimum of the min-cuts separating the source and each data collector is larger than or equal to the file size M, then there exists a linear network code such that all data collectors can recover the file. We also assume that the finite field size is not an issue.

Proof:

Similarly to the proof in [9], since the reconstruction problem reduces to multicasting on all possible data collectors, the result follows from network coding theory. ∎

Lemma 4.

For any information flow graph consisting of initial devices that obtain α bits directly from the source and of additional devices that join the graph in groups of t devices, each obtaining β bits from d existing devices and β′ bits from each of the t − 1 other joining devices, any data collector that connects to a subset of k out-nodes satisfies:

mincut(S, DC) ≥ ∑_{i=1}^{g} s_i · min(α, (d − ∑_{j=1}^{i−1} s_j) β + (t − s_i) β′) (3)

with ∑_{i=1}^{g} s_i = k.

Proof:

Let us consider some graph (see an example in Figure 15) formed by adding devices according to the repair process described above. Consider a recovery scenario in which a data collector connects to a subset of k nodes corresponding to the set of contacted devices.

As all incoming edges of have infinite capacity, we only examine min-cuts with and . Moreover some additional cases cannot happen since there is an order between , and (e.g., and need not be considered). Therefore, we only need to examine three cases detailed in the rest of this proof.

Let C denote the set of edges in the cut (i.e., the edges going from the source side to the data-collector side). Every directed acyclic graph has a topological sorting, which is an ordering of its vertices such that the existence of an edge from u to v implies that u precedes v. In the rest of the analysis, we group nodes that were repaired simultaneously. Since we contact k nodes, we have at least ⌈k/t⌉ groups and at most k groups (i.e., ⌈k/t⌉ ≤ g ≤ k). Since nodes are sorted, nodes considered at the i-th step cannot depend on nodes considered at later steps.

Consider the i-th group, i.e., the output nodes that are topologically i-th and correspond to the same repair. This group contains t nodes. Consider the subset of s_i contacted nodes within this group; s_i can take any value between 1 and t.

First, consider the nodes . For each node, . We consider the two cases.

  • If , then . The contribution to the cut is .

  • If , then . The contribution to the cut is .

Second, consider the other nodes (third and last case: , and all belong to ). For each node, the contribution comes from multiple sources.

  • The cut contains at least edges carrying : since are the topologically -th output nodes in , at most edges come from output nodes in , other edges come from .

  • The cut contains t − s_i edges carrying β′ thanks to the coordination step: each node has t − 1 incoming coordination edges, but s_i − 1 of them originate from the other contacted nodes of the same group, so only t − s_i such edges cross the cut.

Therefore, the total contribution of these s_i nodes to the cut is at least s_i · ((d − ∑_{j=1}^{i−1} s_j) β + (t − s_i) β′).

Since the function involved is concave for s_i taking values in the interval [1, t], the contribution can be bounded thanks to Jensen’s inequality.

Summing these contributions for all i, and considering the worst case (i.e., the scenario that minimizes the sum), leads to (3). ∎

Proof:

From Lemmas 3 and 4, a code is correct if it satisfies (1) and (2). From Lemma 2, a code is correct only if the min-cut is at least M. Moreover, for any set of parameters and any scenario, we can find a graph such that the bound of (3) is matched with equality.

The graph is built using the following process (for one particular scenario, the graph of Figure 15c is built):

  • The data collector gets all stored bits from a set of k devices.

  • The contacted devices repaired simultaneously are grouped in subsets of size at most t. Since we contact k nodes, we have at least ⌈k/t⌉ groups and at most k groups (i.e., ⌈k/t⌉ ≤ g ≤ k).

  • Each repaired device gets β bits from each of the d devices it collects from (whether or not they take part in the reconstruction), and β′ bits from each of the t − 1 other devices repaired in its group.

Hence, a code is correct if and only if (1) and (2) are satisfied. A code minimizing γ under the constraints of (1) and (2) is optimal, as any code with (α′, γ′) < (α, γ) would violate at least one constraint and hence would not be correct. ∎

III-D Optimal tradeoffs

Determining the optimal tradeoffs boils down to minimizing the storage cost α and the repair cost γ under the constraints of (1) and (2); α, β and β′ are the parameters to be optimized. Again, we assume that t divides k.

III-D1 MBCR codes

Minimum Bandwidth Coordinated Regenerating Codes correspond to optimal codes that provide the lowest possible repair cost γ (bandwidth consumption) while minimizing the storage cost α. Figure 16 compares MBCR codes to both Dimakis et al.’s MBR [9] and erasure correcting codes with delayed repairs (ECC).
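For reference, the closed-form values at the MBCR point, as reported in the literature on coordinated/cooperative regenerating codes (e.g., [14, 15]) and consistent with γ = dβ + (t − 1)β′, are:

```latex
\alpha_{\mathrm{MBCR}} = \gamma_{\mathrm{MBCR}}
  = \frac{M}{k}\cdot\frac{2d + t - 1}{2d + t - k},
\qquad
\beta_{\mathrm{MBCR}} = \frac{M}{k}\cdot\frac{2}{2d + t - k},
\qquad
\beta'_{\mathrm{MBCR}} = \frac{M}{k}\cdot\frac{1}{2d + t - k}
```

One checks that dβ + (t − 1)β′ = (2d + t − 1)M/(k(2d + t − k)) = α: at this point each device stores exactly as much as a repair transfers.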

We determine these values in two steps. We study two particular cuts to find the minimum values required to ensure that the max flow is at least equal to the file size, thus proving the optimality of the solution if correct. We then prove that these quantities are sufficient for all possible cuts.

Proof:

Let us consider two specific successions of repairs (Fig. 15b and Fig. 15a). The corresponding repairs are described in the Proof of Theorem 1. As we want to minimize the repair cost γ before the storage cost α, we assume that devices store everything they receive (α = γ).

When , it is required that

which is equivalent to

When , it is required that

which is equivalent to

Consider the smallest value , the associated repair cost is . This implies that the repair cost grows linearly with , we therefore seek to minimize . The minimum value for is . ∎

Proof:

We have proved that the aforementioned values are required for two specific scenarios. We now prove that such values ensure that enough information flows through every cut for any scenario, thus proving correctness. According to Theorem 1, the following condition is sufficient for correctness. We show that the values of α, β and β′ for MBCR codes satisfy this condition:

since α (the stored part) is always larger than or equal to the transmitted data,

replacing α, β and β′ by their values,

which is equivalent to

As d ≥ k, it simplifies to an inequality that always holds. Hence, MBCR codes are correct.∎

Fig. 16: Total repair cost. MBCR codes consistently outperform both erasure correcting codes and regenerating codes

III-D2 MSCR codes

Minimum Storage Coordinated Regenerating Codes correspond to optimal codes that provide the lowest possible storage cost α while minimizing the repair cost γ. This point has been independently characterized by Hu et al. in [12]; however, they assume that β = β′ without proving it. We present a simple derivation from Theorem 1 allowing us to characterize this point. Figure 17 compares MSCR codes to both Dimakis et al.’s MSR [9] and erasure correcting codes with delayed repairs (ECC). Note that for d = k, our MSCR codes share the same repair cost as erasure correcting codes with delayed repair. Yet, in this case, our codes still have the advantage that they balance the load evenly, thus avoiding bottlenecks.
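For reference, the closed-form values at the MSCR point, as reported in the literature on cooperative regenerating codes (e.g., [12, 14, 15]), are:

```latex
\alpha_{\mathrm{MSCR}} = \frac{M}{k},
\qquad
\beta_{\mathrm{MSCR}} = \beta'_{\mathrm{MSCR}} = \frac{M}{k}\cdot\frac{1}{d + t - k},
\qquad
\gamma_{\mathrm{MSCR}} = \frac{M}{k}\cdot\frac{d + t - 1}{d + t - k}
```

For t = 1, γ reduces to the MSR repair cost dM/(k(d − k + 1)) of [9], as expected.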

Proof:

Let us consider two particular successions of repairs leading to the graphs shown on Figure 15. The repairs corresponding to such graphs are described in the Proof of Theorem 1.

We minimize α first. It is clear that α = M/k is minimal, since α < M/k makes it impossible to reconstruct a file of size M using only k blocks. Hence, what matters now is that each min term of the sum is at least equal to α.

In the first scenario (Fig. 15a), it is required that

which is equivalent to

In the second scenario (Fig. 15b), it is required that

which is equivalent to

Consider the smallest value , the associated repair cost is . This implies that the repair cost grows linearly with , we therefore seek to minimize . The minimum value for is . ∎

Proof:

The proof of correctness is quite similar to the previous one. It consists in proving that

is always verified when , and take the aforementioned values.

Since each element of the sum is at most s_i α and the elements must sum to at least M = kα, each element of the sum must satisfy the following constraint.

Applying values for MSCR codes,

which is satisfied if

which is true since and . Therefore, MSCR codes are correct.

Fig. 17: Total repair cost. MSCR codes consistently outperform both erasure correcting codes and regenerating codes

III-D3 General CR codes

The general case corresponds to all possible trade-offs in between MSCR and MBCR. Valid points can be determined by performing a numerical minimization of the repair cost γ for various storage costs α under the constraints of (2) and (1). Figure 18 shows the optimal tradeoffs: coordinated regenerating codes (t > 1) can go beyond the optimal tradeoffs for independent repairs (t = 1) defined by regenerating codes by Dimakis et al. [9].

Fig. 18: Optimal tradeoffs between storage and repair costs. Regenerating codes (RC) [9] correspond to t = 1. For each t, both the MSCR and MBCR points are shown. Costs are normalized by M/k.
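Such a numerical minimization can be sketched as a brute-force grid search. The sketch assumes the cut-based form ∑ᵢ sᵢ·min(α, (d − ∑_{j<i} s_j)β + (t − sᵢ)β′) ≥ M for constraint (2); the helpers are illustrative, not the authors' code:

```python
def scenarios(k, t):
    """All ordered scenarios (s_1, ..., s_g) with sum k and 1 <= s_i <= t."""
    if k == 0:
        return [()]
    return [(s,) + tail
            for s in range(1, min(t, k) + 1)
            for tail in scenarios(k - s, t)]

def feasible(M, k, d, t, alpha, beta, beta2, eps=1e-9):
    """Check the assumed constraint (2) for every scenario."""
    for s_list in scenarios(k, t):
        total, prior = 0.0, 0
        for s in s_list:
            total += s * min(alpha, (d - prior) * beta + (t - s) * beta2)
            prior += s
        if total < M - eps:
            return False
    return True

def min_repair_cost(M, k, d, t, alpha, steps=200):
    """Grid search for the smallest gamma = d*beta + (t-1)*beta2 at fixed alpha."""
    best = None
    hi = M / k  # beta and beta2 never need to exceed one block
    for i in range(steps + 1):
        for j in range(steps + 1):
            beta, beta2 = hi * i / steps, hi * j / steps
            if feasible(M, k, d, t, alpha, beta, beta2):
                gamma = d * beta + (t - 1) * beta2
                if best is None or gamma < best:
                    best = gamma
    return best

# At alpha = M/k, the search should reach the MSCR repair cost
M, k, d, t = 1.0, 4, 4, 2
gamma = min_repair_cost(M, k, d, t, alpha=M / k)
print(round(gamma, 3))  # 0.625 = (d+t-1)/(k*(d+t-k))
```

Sweeping α between M/k and the MBCR storage cost yields the full tradeoff curve of Figure 18.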

IV Adaptive Regenerating Codes

So far, we assumed t and d to remain constant across repairs, similarly to [9] where d is assumed to remain constant. This may not be realistic in real systems, which are dynamic.

At the Minimum Storage point (α = M/k), such strong assumptions are not needed, as repairs are independent (i.e., each term of the sum in (2) can be treated independently). We propose to adapt the quantities to transfer, β and β′, to the system state, which is defined by the number t of devices being repaired and the number d of live devices. The resulting adaptive regenerating codes simplify the system design as only the parameter k needs to be decided at design time: adaptive regenerating codes decide, at runtime for each repair, the best values to offer the lowest repair cost γ.

IV-A Adaptive codes at the Minimum Storage point

Theorem 5.

Adaptive regenerating codes, defined by (4), are both correct and optimal; (4) maps a particular repair setting (d, t) to the amounts of information to be transferred during a repair.

α = M/k,  β(d, t) = β′(d, t) = (M/k) · 1/(d + t − k) (4)

In this subsection, we prove they are correct and Pareto optimal.

Lemma 6.

For any information flow graph composed of initial devices that obtain α bits directly from the source and of additional devices that join the graph in groups of t devices, obtaining β(d, t) bits from d existing devices and β′(d, t) bits from each of the other joining devices (where d and t may change from group to group), any data collector that connects to a subset of k out-nodes satisfies:

mincut(S, DC) ≥ ∑_{i=1}^{g} s_i · min(α, (d_i − ∑_{j=1}^{i−1} s_j) β(d_i, t_i) + (t_i − s_i) β′(d_i, t_i)) (5)

with ∑_{i=1}^{g} s_i = k, where d_i and t_i are the parameters of the repair that introduced the i-th group.

Proof:

The proof is similar to the proof of Lemma 4.∎

Proof:

Using Lemmas 2, 3 and 6, we can define the following sufficient condition for the code to be correct. The condition is satisfied when β and β′ take the values defined in (4).

The condition must be satisfied for every scenario. For any scenario, since each element of the sum is at most s_i α, each element of the sum must satisfy the following constraint.

Applying formulas of (4),

which is satisfied if

which is true since and . Therefore, adaptive regenerating codes are correct.∎

Proof:

We prove by contradiction that adaptive regenerating codes are optimal. Let us assume that there exists a correct code transferring strictly less than the amounts defined in (4) for some repair setting (d, t).

Consider a set of failures such that all repairs are performed by groups of t devices downloading data from d devices, and consider the corresponding information flow graph. Assuming repairs are performed with the hypothesized correct code, the information flow graph also corresponds to a correct code with constant t and d.

Moreover, according to the previous section, these failures can be repaired optimally using the MSCR code for these parameters. This yields a contradiction: the hypothesized code would transfer less than an optimal MSCR code, so it cannot be correct. Such a correct code cannot exist, and the adaptive regenerating code defined in this section is optimal. ∎

Building on results from coordinated regenerating codes (especially MSCR), we have defined adaptive regenerating codes and proved that they are both correct and optimal. These codes are of particular interest for dynamic systems where failures may occur randomly and simultaneously.

IV-B Adaptive codes at the Minimum Bandwidth point

We have built adaptive regenerating codes from Minimum Storage codes by observing that the initial assumptions of fixed values of t and d can be relaxed. In this subsection, we study whether adaptive codes can be built from MBR codes. We determine lower bounds on the storage cost α and the repair cost γ. These lower bounds allow concluding that an adaptive scheme at the Minimum Bandwidth point costs as much as classical erasure correcting codes.

Let us consider that t and d can each take any value within given bounds. Since we cannot predict the future when choosing β and β′, we must account for every admissible value of t and d in subsequent repairs. More specifically, the current repair can be the first of any sequence, since we do not know which devices will fail, how they will be repaired and how data will be collected. We need to consider the worst case that can occur in the future.

In a first scenario (e.g., Fig. 15a), it is required that

which expands to

replacing and by their minimum admissible values since we cannot make any assumption on the future, and by the number of groups when all groups but the first are of size (i.e., ), we get

which simplifies to

which simplifies to

Let us consider the case of and , and determine the minimal value for .

The first scenario we considered allowed determining a lower bound on the storage cost. We now consider another possible scenario to obtain a lower bound on the per-failed-node repair cost γ. In this second scenario (e.g., Fig. 15b), it is required that

which expands to

replacing and by their minimum admissible values ( and ) since we cannot make any assumption on the future, and by the number of groups when all groups but the first are of size (i.e., ), we get

which simplifies to

Hence, we obtain the following lower bound for an adaptive regenerating code operating at the MBR point.

The cost of a static scheme, assuming that we contact as few nodes as possible and repair as few nodes as possible, is given in Section III-D1. Hence,

As a consequence, an adaptive scheme at the Minimum Bandwidth point is meaningless since it would be more expensive than a simpler static MBCR code set up for the worst case (i.e., the minimum admissible values of d and t).

IV-C Performance

We compare our adaptive regenerating codes at the MSR point to the MFR codes defined in [13]. This approach is built upon the MSR codes defined by Dimakis et al. in [9]. The coding scheme can be described by a function mapping the repair setting to the quantity of information transferred; the t repairs needed are performed independently.

γ_MFR(d) = (M/k) · d/(d − k + 1) per repaired device, where d = n − t is the number of live devices (6)

Let us consider the particular case where d = n − t (each repair contacts all live devices). The average cost per repair of our codes remains constant, equal to (n − 1)M/(k(n − k)). In the MFR approach, which requires the t repairs to be performed independently, the average repair cost increases with t. Therefore, the performance of our adaptive regenerating codes does not degrade as the number of failures increases, as opposed to the MFR codes constructed upon Dimakis et al.’s codes. This is also shown on Figure 19.

Fig. 19: Average repair cost. Adaptive Regenerating Codes (ARC) consistently outperform both erasure correcting codes (ECC) and the MFR codes.
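The contrast can be checked numerically, using the per-repair cost γ = (d + t − 1)M/(k(d + t − k)) for the adaptive codes (the closed form assumed from the MSCR point, with d = n − t) and the MSR cost dM/(k(d − k + 1)) applied to each of the t independent repairs for MFR; the helper names are illustrative:

```python
def adaptive_cost_per_repair(M, k, n, t):
    """Adaptive regenerating code at the MSR point: one coordinated repair
    of t devices contacting the d = n - t live devices."""
    d = n - t
    return (d + t - 1) * M / (k * (d + t - k))  # = (n-1)*M/(k*(n-k)), constant in t

def mfr_cost_per_repair(M, k, n, t):
    """MFR: t independent MSR repairs, each contacting d = n - t live devices."""
    d = n - t
    return d * M / (k * (d - k + 1))  # grows as t increases (d shrinks)

M, k, n = 1.0, 4, 10
for t in (1, 2, 3):
    print(t, round(adaptive_cost_per_repair(M, k, n, t), 3),
          round(mfr_cost_per_repair(M, k, n, t), 3))
# 1 0.375 0.375
# 2 0.375 0.4
# 3 0.375 0.438
```

With d = n − t, the adaptive cost simplifies to (n − 1)M/(k(n − k)) regardless of t, while the MFR cost grows because each independent repair has fewer helpers to draw from.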

IV-D Adaptive Coding Schemes

Our approach also has significant advantages over the MFR approach with respect to the actual coding scheme being implemented when the system size remains constant. The coding schemes are similar in principle to the one described in Subsection III-A and Figure 10. The only difference is that the values of β and β′ may differ from one repair to the other. Each device stores sub-blocks of data and combines them to send the appropriate quantities of information: to be able to send a given quantity, each device must store a number of sub-blocks compatible with it. Hence, the length of any random linear code used to implement such a system is determined by the number of sub-blocks stored by each device. We now consider a system of constant size and compare both implementations.

The implementation of the MFR approach implies that, to support every possible repair setting, each device must be able to send all the corresponding quantities. Hence, the number of sub-blocks stored must be a common multiple of all the possible denominators. It is known that such least common multiples grow exponentially. Hence, the length of the codes required to implement such codes grows exponentially with