A Family of Erasure Correcting Codes with Low Repair Bandwidth and Low Repair Complexity
We present the construction of a new family of erasure correcting codes for distributed storage that yield low repair bandwidth and low repair complexity. The construction is based on two classes of parity symbols. The primary goal of the first class of symbols is to provide good fault tolerance, while the second class facilitates node repair, reducing the repair bandwidth and the repair complexity. We compare the proposed codes with other codes proposed in the literature.
Distributed storage (DS) uses a network of interconnected inexpensive storage devices (referred to as storage nodes or simply nodes) to store data reliably over long periods of time. Reliability against node failures (commonly referred to as fault tolerance) is achieved by means of erasure correcting coding. Furthermore, when a node fails, a new node needs to be added to the DS network and populated with data to maintain the initial state of reliability. The problem of repairing a failed node is known as the repair problem.
Classical maximum distance separable (MDS) codes are optimal in terms of the fault tolerance/storage overhead tradeoff. However, the repair of a failed node requires the retrieval of large amounts of data from a large subset of nodes. Therefore, in recent years, the design of erasure correcting codes that reduce the cost of repair has attracted a great deal of attention. Pyramid codes [1] were one of the first code constructions to address this problem. In particular, they aim at reducing the number of nodes that need to be contacted to repair a failed node, known as the repair access. Other codes that reduce the repair access are the local reconstruction codes (LRCs) [2] and the locally repairable codes [3, 4].
Other code constructions aim at reducing the repair bandwidth, defined as the amount of information that needs to be read from the DS network to repair a failed node. Among them, we can mention minimum disk I/O repairable (MDR) codes [5], Zigzag codes [6], and piggyback codes [7]. Piggybacking consists of adding carefully chosen linear combinations of data symbols (called piggybacks) to the parity symbols of a given erasure correcting code. This results in a lower repair bandwidth at the expense of a lower erasure correcting capability with respect to the original code.
In this paper, we propose a family of erasure correcting codes that achieve low repair bandwidth and low repair complexity. In particular, we propose a systematic code construction based on two classes of parity symbols. Correspondingly, there are two classes of parity nodes. The first class of parity nodes, whose primary goal is to provide erasure correcting capability, is constructed using an MDS code modified by applying specially designed piggybacks to some of its code symbols. The second class of parity nodes is constructed using a block code whose parity symbols are obtained with simple additions. This class of parity nodes is not intended to bring any additional erasure correcting capability, but to facilitate node repair at low repair bandwidth and low repair complexity. We compare the proposed codes with MDR codes [5], Zigzag codes [6], piggyback codes [7], and LRCs [2] in terms of repair bandwidth and repair complexity.
Notation: We define the operator . The Galois field of order is denoted by .
II Code Construction
We consider the distributed storage system depicted in Fig. 1. There are data nodes, each containing a very large number of data symbols over . As we shall see in the sequel, the proposed code construction works with blocks of data symbols per node. Thus, without loss of generality, we assume that each node contains data symbols. We denote by , , the th data symbol in the th data node. We say that the data symbols form a data array , where . For later use, we also define the set of data symbols . Further, there are parity nodes each storing parity symbols. We denote by , , , the th parity symbol in the th parity node, and define the set as the set of parity symbols in the th parity node. The set of all parity symbols is denoted by . We say that the data and parity symbols form a code array , where . Note that for and for , .
Our main goal is to construct codes that yield low repair bandwidth and low repair complexity for the repair of a single failed systematic node. To this end, we construct a family of systematic codes consisting of two different classes of parity symbols. Correspondingly, there are two classes of parity nodes, referred to as Class A and Class B parity nodes, as shown in Fig. 1. Class A and Class B parity nodes are built using an code and an code, respectively, such that . In other words, the parity nodes of the code (with some abuse of language, we refer to the nodes storing the parity symbols of a code as the parity nodes of the code) correspond to the parity nodes of the Class A and Class B codes. The primary goal of Class A parity nodes is to achieve good erasure correcting capability, while the purpose of Class B nodes is to yield low repair bandwidth and low repair complexity. In particular, we focus on the repair of data nodes. The repair bandwidth (in bits) per node, denoted by , is proportional to the average number of symbols (data and parity) that need to be read to repair a data symbol, denoted by . More precisely, let be the number of symbols per node (for our code construction, , but this is not the case in general). Then,
where is the size (in bits) of a symbol. can be interpreted as the repair bandwidth normalized by the size (in bits) of a node. Therefore, in the rest of the paper we will use to refer to the normalized repair bandwidth.
The main principle behind our code construction is the following. The repair is performed one symbol at a time. After the repair of a data symbol is accomplished, the symbols read to repair that symbol are cached in the memory. Therefore, they can be used to repair the remaining data symbols at no additional read cost. The proposed codes are constructed in such a way that the repair of a new data symbol requires a low additional read cost (defined as the number of additional symbols that need to be read to repair the data symbol), so that (hence ) is reduced. Since we will often use the concepts of read cost and additional read cost in the remainder of the paper, we define them in the following.
The read cost of a symbol is the number of symbols that need to be read to repair the symbol. The additional read cost of a symbol is the additional number of symbols that need to be read to repair the symbol, considering that other symbols are already cached in the memory (i.e., have been read to recover some other data symbols previously).
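The caching behavior behind these two definitions can be sketched in a few lines of Python (a toy model, not the paper's construction): symbols read for one repair stay in memory, so each later repair pays only for symbols not yet cached. The symbol names and the repair schedule below are invented for illustration.

```python
# Illustrative sketch: counting additional read cost when symbols read
# for earlier repairs are cached in memory.

def additional_read_cost(needed, cache):
    """Number of symbols that must actually be read, given cached symbols."""
    return len(set(needed) - cache)

cache = set()
total_reads = 0
# Hypothetical repair schedule: each entry lists the symbols needed to
# repair one lost data symbol.
schedule = [["d0", "d1", "p0"], ["d1", "p1"], ["d0", "p2"]]
for needed in schedule:
    cost = additional_read_cost(needed, cache)  # only uncached symbols are read
    total_reads += cost
    cache.update(needed)                        # cache everything just read

print(total_reads)  # 5: the first repair reads 3 symbols, the next two add 1 each
```

The first repair pays its full read cost; the two subsequent repairs each have an additional read cost of one, which is exactly the effect the code construction aims for.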
III Class A Parity Nodes
Class A parity nodes are constructed using a modified MDS code, , over . In particular, we start from an MDS code and apply piggybacks [7] to some of the parity symbols. The construction of Class A parity nodes is performed in two steps as follows.
Encode each row of the data array using an MDS code (the same for each row). The parity symbol is (we use the superscript to indicate that the parity symbol is stored in a Class A parity node)
where denotes a coefficient in . Store the parity symbols in the corresponding row of the code array. Overall, parity symbols are generated.
Modify some of the parity symbols by adding piggybacks. Let , , be the number of piggybacks introduced per row. The parity symbol is obtained as
where and the second term in the summation is the piggyback.
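As a rough illustration of the two construction steps, the following Python toy works over GF(2), so additions are XORs and all MDS coefficients degenerate to 1; the actual construction uses an MDS code over a larger field with nontrivial coefficients, and the piggyback placement shown here is an arbitrary choice, not the one prescribed above.

```python
# Toy sketch of the Class A construction over GF(2): a row-wise parity
# (step 1) with a piggyback from another row folded in (step 2).
import numpy as np

rng = np.random.default_rng(0)
k = 4                                                    # rows of the toy data array
data = rng.integers(0, 2, size=(k, k), dtype=np.uint8)   # data array, one stripe per row

# Step 1: row-wise parity; here simply the XOR of all data symbols in the row.
parity = np.bitwise_xor.reduce(data, axis=1)

# Step 2: add piggybacks: fold one data symbol from another row into each parity.
piggybacked = parity.copy()
for i in range(k):
    piggybacked[i] ^= data[(i + 1) % k, i]   # piggyback choice is illustrative

# The piggyback can be peeled off once the base parity's inputs are known:
recovered = piggybacked[0] ^ np.bitwise_xor.reduce(data[0])
assert recovered == data[1, 0]
```

Peeling the piggyback off a parity symbol whose other inputs are already known is what later lets a piggybacked symbol be repaired at the cost of a single additional read.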
The fault tolerance (i.e., the number of node failures that can be tolerated) of Class A codes is given in the following theorem.
A Class A code with piggybacks per row can correct a minimum of node failures.
The proof is given in Appendix A.
When a failure of a data node occurs, Class A parity nodes are used to repair of the failed symbols. The Class A parity symbols are constructed in such a way that, when node is erased, data symbols in this node can be repaired by reading the (non-failed) data symbols in the th row of the data array and parity symbols in the th row of the Class A nodes (see also Section IV-C). For later use, we define the set as follows.
For , define the set as .
The set is the set of data symbols that are read from row to recover data symbols of node using Class A parity nodes.
An example of a Class A code is shown in Fig. 2. One can verify that the code can correct any 2 node failures. For each row , the set is indicated in red. For instance, .
The main purpose of Class A parity nodes is to provide good erasure correcting capability. However, the use of piggybacks also helps in reducing the number of symbols that need to be read to repair the symbols of a failed node that are repaired using the Class A code, as compared to MDS codes. The remaining data symbols of the failed node can also be recovered from Class A parity nodes, but at a high symbol read cost. Hence, the idea is to add another class of parity nodes, namely Class B parity nodes, such that these symbols can be recovered at a lower read cost.
IV Class B Parity Nodes
Class B parity nodes are obtained using an linear block code over to encode the data symbols of the data array, i.e., we use the code times. This generates Class B parity symbols, , , .
For , define the set as
Assume that data node fails. It is easy to see that the set is the set of data symbols that are not recovered using Class A parity nodes.
For the example in Fig. 2, the set is indicated by hatched symbols for each column , . For instance, .
For later use, we also define the following set.
For , define the set as
Note that .
For the example in Fig. 2, the set is indicated by hatched symbols for each row . For instance, .
The purpose of Class B parity nodes is to allow the recovery of the data symbols in , , at a low additional read cost. Note that after recovering symbols using Class A parity nodes, the data symbols in are already stored in the decoder memory; therefore, they are accessible for the recovery of the remaining data symbols using Class B parity nodes without the need to read them again. The main idea is based on the following proposition.
If a Class B parity symbol is the sum of one data symbol and a number of data symbols in , then the recovery of comes at the cost of one additional read (one should read parity symbol ).
This observation is used in the construction of Class B parity nodes (see Section IV-A below) to reduce the normalized repair bandwidth, . In particular, we add Class B parity nodes which allow reducing the additional read cost of all data symbols in all 's to . (The addition of a single Class B parity node allows recovering one new data symbol in each , , at the cost of one additional read.)
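The proposition above can be illustrated with a small XOR toy (symbol names and values are invented): since every summand other than the lost symbol is already cached from the first decoding phase, reading the single parity symbol suffices.

```python
# Sketch of the one-additional-read property: a Class B-style parity symbol
# is the XOR of a lost symbol x and symbols that are already cached.

cached = {"d_12": 5, "d_13": 9}   # symbols read during the first decoding phase (toy values)
x = 7                             # the lost data symbol
parity = x ^ cached["d_12"] ^ cached["d_13"]   # parity built with simple additions

reads = 1                         # only the parity symbol must be read
recovered = parity
for value in cached.values():     # cached symbols cost no extra reads
    recovered ^= value

assert recovered == x and reads == 1
```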
In order to describe the code construction, we define the function as follows.
Consider a Class B parity node and let denote the set of parity symbols in this node. Also, let for some and be , where , i.e., the parity symbol is the sum of and a subset of other data symbols. Then,
For a given data symbol , the function gives the additional number of symbols that need to be read to recover (considering the fact that some symbols are already cached in the memory).
IV-A Construction Example
In the following, we propose a recursive algorithm for the construction of Class B parity nodes. To ease understanding, we introduce the algorithm through an example.
We construct a code starting from the Class A code in Fig. 2. In particular, we construct Class B parity nodes, so that the additional number of reads required to repair each of the remaining failed symbols (after recovering symbols using the Class A parity nodes) is . With some abuse of notation, we denote these parity nodes by , , and .
Denote by , , a temporary matrix of read values for the respective data symbols . After Class A decoding,
where . For our example, after Class A decoding is given in Fig. 3(a). Our algorithm operates on the s, whose initial value is , and aims to obtain the lowest possible values for these s under the given number of Class B parity nodes. This is done in a recursive manner as follows.
Construct the first parity node, .
For each symbol define the set .
Start with the elements in . Pick an element such that , and . For instance, we take .
and update the respective and ,
The resulting matrix is shown in Fig. 3(b). There are still entries that need to be handled.
where and after step 1b. Update accordingly (see Fig. 3(c)). Note that the read values have not worsened. This comes from the fact that the newly added data symbol belongs to the corresponding set and is already cached in the memory. Thus, the additional read cost is . On the other hand, the values increase.
Construct the second parity node, .
Pick an element such that the corresponding is maximal. In our example, this is because .
For , do the following. Pick an element such that for all , , where is set to . For our example, we choose . Note that the only other option, , is not a good choice as the new additional read cost would increase from 1 to 2. If such does not exist, set .
Update . The updated matrix is shown in Fig. 3(d).
Pick an element such that the corresponding is maximal. In our example, this is .
For , do the following. . Update . The resulting has value for all diagonal elements and elsewhere (Fig. 3(e)).
The Class B parity nodes , , and are shown in Fig. 4.
A general version of the algorithm to construct Class B parity nodes is given in Appendix B.
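The net effect of the construction, abstracting away the actual choice of sets in Algorithm 1, can be caricatured as follows: each added Class B parity node lets one more of the most expensive remaining symbols per failed node be repaired with a single additional read. This is only a sketch of the bookkeeping, not the algorithm itself; the cost values are invented.

```python
# Simplified bookkeeping sketch: per failed node, each Class B parity node
# reduces the additional read cost of one more (most expensive) symbol to 1.

def residual_read_costs(initial_costs, num_class_b_nodes):
    """Per-symbol additional read costs after adding Class B parity nodes.

    initial_costs: per-column lists of read costs after the first decoding
    phase; the most expensive symbols are covered first.
    """
    result = []
    for column in initial_costs:
        ranked = sorted(column, reverse=True)
        covered = min(num_class_b_nodes, len(ranked))
        # covered symbols now cost a single read; the rest keep their old cost
        result.append([1] * covered + ranked[covered:])
    return result

costs = residual_read_costs([[4, 3, 2], [4, 3, 2]], num_class_b_nodes=2)
print(costs)  # [[1, 1, 2], [1, 1, 2]]
```

With enough Class B nodes every entry drops to 1, mirroring the final matrix of Fig. 3(e); with fewer nodes (puncturing), the most expensive symbols are covered first.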
IV-B Discussion of the Construction Example
The construction of Class B parity nodes starts by selecting an element of a given such that and (for simplicity, as in the example, we can start with ). The first parity symbol of after step 1c is therefore , and the remaining parity symbols are obtained as in (8). By Proposition 1, the additional read cost of (after step 1c) is . The reason for selecting is that, again by Proposition 1, its additional read cost is also . We remark that for each it is not always possible to select and set . This is the case when . If does not exist, then we select (see Appendix B). In this case, the additional read cost of (after step 1c) is .
In general, step 1d has to be performed times, corresponding to the number of entries per column of .
Adding Class B nodes allows reducing the additional read cost for all data symbols in all to (see Fig. 3(e)). However, this comes at the expense of a reduction in the code rate, i.e., an increased storage overhead. In the example, Class B parity nodes need to be introduced, which reduces the code rate from to . If a lower storage overhead is required, Class B parity nodes can be punctured, starting from the last parity node (for the example, nodes , , and are punctured in this order), at the expense of an increased repair bandwidth. If all Class B parity nodes are punctured, only the Class A parity nodes remain, and the repair bandwidth corresponds to that of the Class A code. Thus, our code construction gives a family of rate-compatible codes that trades off repair bandwidth against storage overhead: adding more Class B nodes reduces the repair bandwidth but increases the storage overhead.
IV-C Repair of a Single Node Failure: Decoding Schedule
The repair of a failed systematic node proceeds as follows. First, symbols are repaired using Class A parity nodes. Then, the remaining symbols are repaired using Class B parity nodes. With a slight abuse of language, we refer to the repair of symbols using Class A and Class B parity nodes as the decoding of the Class A and Class B codes, respectively. Suppose that node fails. Decoding is as follows.
Decoding of the Class A code. To reconstruct the failed data symbol in the th row of the code array, symbols ( data symbols and ) in the th row are read. These symbols are now cached in the memory. We then read the piggybacked symbols in the th row. By construction (see (3)), this allows repairing failed symbols, at the cost of one additional read each.
Decoding of the Class B code. Each remaining failed data symbol is obtained by reading a Class B parity symbol whose corresponding set (see Definition 5) contains . In particular, if several Class B parity symbols contain , we read the parity symbol with the largest index . This yields the lowest additional read cost.
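The two-phase schedule can be mimicked end to end with the same caching idea (node layout, symbol names, and counts are all illustrative, not the paper's parameters):

```python
# Toy end-to-end schedule for repairing one failed node, mirroring the
# two decoding phases.

cache = set()
reads = 0

def read(symbols):
    """Read only the symbols not already cached; cache everything."""
    global reads
    reads += len(set(symbols) - cache)
    cache.update(symbols)

# Phase 1 (first-class decoding): read a full row, then piggybacked parities.
read(["d_01", "d_02", "d_03", "pA_0"])   # row read repairs the first symbol
read(["pA_1"])                           # each piggybacked parity: +1 read

# Phase 2 (second-class decoding): one parity read per remaining symbol;
# everything else those parities need is already cached.
read(["pB_0"])
read(["pB_1"])

print(reads)  # 7 symbols read in total for this toy schedule
```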
V Code Characteristics and Comparison
TABLE I: Comparison of the codes in terms of fault tolerance, normalized repair bandwidth, normalized repair complexity, and encoding complexity.
V-A Fault Tolerance
The fault tolerance of the Class A code depends on the MDS code used in its construction and on , as stated in Theorem 1. Hence, our proposed code also has fault tolerance . Since , our codes have a fault tolerance of at least .
V-B Normalized Repair Bandwidth
According to Section IV-C, repairing the first symbols in a failed node requires reading data symbols plus Class A parity symbols. The remaining data symbols in the failed node are repaired by reading Class B parity symbols. As seen in Section IV, the parity symbols in the first Class B parity node are constructed from sets of data symbols of cardinality . Therefore, repairing each of the data symbols in such a set requires reading at most symbols. The remaining Class B parity nodes are constructed from fewer than symbols. An upper bound on the normalized repair bandwidth is therefore . We observe that as increases, the fault tolerance is reduced while improves.
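The accounting in this subsection can be replayed with concrete (made-up) parameters. The helper below assumes the first phase reads one row of surviving data plus the piggybacked parities, and the second phase reads one parity symbol per remaining failed symbol; the parameter values are assumptions for illustration only.

```python
# Hedged sketch of the repair-bandwidth accounting with toy parameters.

def symbols_read(k, piggybacks, remaining):
    """Total symbols read to repair one node (toy accounting)."""
    phase_a = (k - 1) + 1 + piggybacks  # surviving row + one parity + piggybacked parities
    phase_b = remaining                 # one second-class parity read per remaining symbol
    return phase_a + phase_b

total = symbols_read(k=5, piggybacks=2, remaining=2)
avg_per_symbol = total / 5              # normalize by symbols per node (alpha = k here)
print(total, avg_per_symbol)  # 9 1.8
```

Under this toy accounting, repairing a node touches far fewer symbols than the MDS baseline of reading k symbols per lost symbol.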
V-C Repair Complexity of a Failed Node
We first consider the complexity of elementary arithmetic operations on elements of size in . An addition requires and a multiplication requires . The term inside denotes the number of elementary binary additions. Repairing the first symbol requires multiplications and additions. Repairing the following symbols requires an additional multiplications and additions. The final symbols require at most additions, since Class B parity symbols are constructed as the sum of at most data symbols. The repair complexity of one failed node is therefore
The first two terms correspond to the Class A code, while the last term corresponds to the Class B code.
V-D Encoding Complexity
The encoding complexity of the code, , is the sum of the encoding complexities of the two component codes. The generation of each of the Class A parity symbols in one row of the code array, in (2), requires multiplications and additions. Adding data symbols to of these parity symbols according to (3) requires an additional additions. The encoding complexity of the Class A code is therefore
According to Section IV, the parity symbols in the first Class B parity node are constructed as the sum of data symbols, and each parity symbol in the subsequent parity nodes uses one data symbol less. Therefore, the encoding complexity of the Class B code is
V-E Code Comparison
Table I provides a summary of the characteristics of different codes proposed in the literature as well as of the codes constructed in this paper. (The variables and in Table I are defined in and , respectively. The definition of comes directly from that defined in .) In the table, column 2 reports the value of (see (1)) for each code construction. For our code, , unlike for MDR and Zigzag codes, for which grows exponentially with . This implies that our codes require less memory to cache data symbols during repair. The fault tolerance , the normalized repair bandwidth , the normalized repair complexity, and the encoding complexity, discussed in the previous subsections, are reported in columns 3, 4, 5, and 6, respectively.
In Fig. 5, we compare our codes with other codes in the literature. In particular, the figure plots the normalized repair complexity of codes over () versus their normalized repair bandwidth . In contrast to the bounds for the repair bandwidth and complexity reported in Table I, Fig. 5 contains the exact number of integer additions.
The best codes for a DS system are those that achieve both the lowest repair bandwidth and the lowest repair complexity. As seen in Fig. 5, MDS codes have both high repair complexity and high repair bandwidth, but they are optimal in terms of fault tolerance for given and . Zigzag codes achieve the same fault tolerance and the same high repair complexity as MDS codes, but at the lowest repair bandwidth. At the other end, LRCs yield the lowest repair complexity, but a higher repair bandwidth and worse fault tolerance than Zigzag codes. Piggyback codes have a repair bandwidth between that of Zigzag and MDS codes, but with a higher repair complexity and worse fault tolerance. For a given storage overhead, our proposed codes have lower repair bandwidth than MDS codes, piggyback codes, and LRCs, and equal or similar repair bandwidth to that of Zigzag codes. Furthermore, they yield lower repair complexity compared to MDS, piggyback, and Zigzag codes. However, the benefits in terms of repair bandwidth and/or repair complexity with respect to MDS and Zigzag codes come at the price of a lower fault tolerance.
In this paper, we constructed a new class of codes that achieve low repair bandwidth and low repair complexity for a single node failure. The codes are constructed from two smaller codes, Class A and Class B, where the former provides the fault tolerance of the code and the latter reduces the repair bandwidth and complexity. Our proposed codes achieve better repair complexity than Zigzag codes and piggyback codes and better repair bandwidth than LRCs, but at the cost of a slightly lower fault tolerance. A side effect of the construction is that the number of symbols per node that needs to be encoded grows only linearly with the code dimension. This makes our codes more suitable for memory-constrained DS systems than Zigzag and MDR codes, for which the number of symbols per node grows exponentially with the code dimension.
Appendix A Proof of Theorem 1
Each row in the code array contains parity symbols based on the MDS construction (i.e., parity symbols without piggybacks). Using these symbols, one can recover data symbols in that row and, thus, failures of systematic nodes. In order to prove the theorem, we need to show that by using piggybacked parity symbols , , in some parity node, , it is possible to correct one arbitrary systematic node failure. To do this, let us consider the system of linear equations , representing the set of parity equations to compute s where . In other words, , , and is given by
where , is a vector of length with one at position and zeros elsewhere, and is the all-zero vector of size . Now, assume a systematic node has failed. In order to repair it, we need to solve the following subsystem of linear equations , in which and is a submatrix of such that: a) its diagonal elements are all ; b) it has 1 at row and column ; c) all other entries are 0. Note that is full rank. Therefore, one arbitrary data symbol can be corrected and, hence, the erasure correcting capability of Class code is at least , which completes the proof.
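The full-rank argument can be spot-checked numerically for a small instance: a matrix with an all-ones diagonal and a single extra off-diagonal 1, as described above, is invertible over GF(2). The size 4 and the position of the extra entry are arbitrary choices for the check.

```python
# Numeric spot-check of the full-rank claim for a small toy instance.
import numpy as np

A = np.eye(4, dtype=np.uint8)
A[2, 0] = 1                     # one off-diagonal piggyback entry
# over GF(2), invertibility is equivalent to an odd determinant
assert round(np.linalg.det(A.astype(float))) % 2 == 1
```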
Appendix B Algorithm to Construct Class B Parity Nodes
We give an algorithm to construct the Class B parity nodes in the order . This results in the construction of the parity symbols . The algorithm is given in Algorithm 1. Consider the construction of the parity symbols of parity node . The algorithm first constructs the parity symbol as the sum of an element and elements in . Then, the other parity symbols , , are constructed as the sum of an element and elements in , i.e., following a specific pattern. The remaining parity nodes are constructed in a similar way, the only difference being that the number of elements added from the sets , , varies for each parity node. The construction of the parity symbols depends on the choice of the symbols in the sets and . Assume that a parity symbol is constructed. The data symbols involved in are picked as follows.
Choice of a data symbol in : Select a symbol such that the corresponding is maximum and there exists (lines 2 and 3 in the algorithm). If the latter does not exist, then select such that is maximum. Such a always exists.
Choice of data symbols in : Select symbols such that and its additional read cost does not increase (line 20 in the algorithm). If such a condition is not met, then the symbol is not used in the construction of the parity symbol.
After the construction of each parity symbol, the corresponding entry of matrix is updated.
[1] C. Huang, M. Chen, and J. Li, “Pyramid codes: Flexible schemes to trade space for access efficiency in reliable data storage systems,” in Proc. IEEE Int. Symp. Network Computing and Applications, Jul. 2007.
[2] C. Huang, H. Simitci, Y. Xu, A. Ogus, B. Calder, P. Gopalan, J. Li, and S. Yekhanin, “Erasure coding in Windows Azure storage,” in Proc. USENIX Annual Technical Conference, Jun. 2012.
[3] M. Sathiamoorthy et al., “XORing elephants: Novel erasure codes for big data,” Proc. Very Large Data Bases Endowment, vol. 6, no. 5, Mar. 2013.
[4] D. Papailiopoulos and A. Dimakis, “Locally repairable codes,” in Proc. IEEE Int. Symp. Inf. Theory, Jul. 2012.
[5] Y. Wang, X. Yin, and X. Wang, “MDR codes: A new class of RAID-6 codes with optimal rebuilding and encoding,” IEEE J. Sel. Areas Commun., vol. 32, no. 5, pp. 1008–1018, May 2014.
[6] I. Tamo, Z. Wang, and J. Bruck, “Zigzag codes: MDS array codes with optimal rebuilding,” IEEE Trans. Inf. Theory, vol. 59, no. 3, pp. 1597–1616, Mar. 2013.
[7] K. Rashmi, N. Shah, and K. Ramchandran, “A piggybacking design framework for read- and download-efficient distributed storage codes,” in Proc. IEEE Int. Symp. Inf. Theory, Jul. 2013.