LDPC Code Design for Distributed Storage: Balancing Repair Bandwidth, Reliability and Storage Overhead
Abstract
Distributed storage systems suffer from significant repair traffic generated due to frequent storage node failures. This paper shows that properly designed low-density parity-check (LDPC) codes can substantially reduce the amount of required block downloads for repair thanks to the sparse nature of their factor graph representation. In particular, with a careful construction of the factor graph, both low repair-bandwidth and high reliability can be achieved for a given code rate. First, a formula for the average repair bandwidth of LDPC codes is developed. This formula is then used to establish that the minimum repair bandwidth can be achieved by forcing a regular check node degree in the factor graph. Moreover, it is shown that given a fixed code rate, the variable node degree should also be regular to yield minimum repair bandwidth, under some reasonable minimum variable node degree constraint. It is also shown that for a given repair-bandwidth requirement, LDPC codes can yield substantially higher reliability than currently utilized Reed-Solomon (RS) codes. Our reliability analysis is based on a formulation of the general equation for the mean-time-to-data-loss (MTTDL) associated with LDPC codes. The formulation reveals that the stopping number is closely related to the MTTDL. It is further shown that LDPC codes can be designed such that a small loss of repair-bandwidth optimality may be traded for a large improvement in erasure-correction capability and thus the MTTDL.
Hyegyeong Park, Student Member, IEEE, Dongwon Lee, and Jaekyun Moon, Fellow, IEEE
This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible. This work was supported by the National Research Foundation of Korea under grant no. NRF2016R1A2B4011298. This paper was presented in part at the IEEE International Conference on Communications (ICC), 2016. The authors are with the School of Electrical Engineering, Korea Advanced Institute of Science and Technology (KAIST), Daejeon, 34141 South Korea (email: parkh@kaist.ac.kr; leedw1020@kaist.ac.kr; jmoon@kaist.edu).
Index Terms
Distributed storage, repair bandwidth, mean-time-to-data-loss (MTTDL), low-density parity-check (LDPC) codes, factor graph.
I Introduction
Distributed storage has been deployed as a solution to storing and retrieving massive amounts of data. By using the MapReduce architecture [1], the distributed feature of recent storage systems enables data centers to store big data sets reliably while allowing scalability and offering high bandwidth efficiency. However, since distributed storage systems consist of commodity disks, failure events occur frequently. As a case in point, in the Google File System (GFS) “component failures are the norm rather than the exception” [2]. Simply replicating data multiple times prevents data loss against node failure events in GFS [2] and the Hadoop Distributed File System (HDFS) [3], but the associated costs in terms of storage overhead are rather high.
In order to reduce the large storage overhead of replication schemes, erasure codes have been introduced as alternatives [4]. Reed-Solomon (RS) codes [5] are typical erasure codes having the maximum distance separable (MDS) property, which can tolerate a certain maximum number of erasures given a number of parity blocks. Typically, an $(n, k)$ RS code splits a file to be stored into $k$ blocks and encodes them into $n$ blocks including $n - k$ parity blocks [6]. These $n$ blocks of a code are referred to as a stripe in distributed storage. Any $k$ out of the $n$ blocks can be used to reconstruct the original file, which is exactly how the MDS property is defined. In practice, a (14, 10) RS code is implemented on the Facebook clusters [6] whereas a (9, 6) RS code is used in the GFS [7]. Both of these codes have high storage efficiency as well as orders of magnitude higher reliability compared to 3-replication [4, 8]. Hence, erasure coding schemes based on RS codes have become popular choices, especially for archival storage systems where maintaining an optimal tradeoff between data reliability and storage overhead is the priority.
However, the point at issue is that MDS codes such as RS codes require high bandwidth overhead for the repair process. If a node failure event happens, the erased blocks need to be reconstructed in order to retain the same level of reliability; the number of blocks to be downloaded for this repair task is defined as the repair bandwidth. Since the repair bandwidth represents a limited and expensive resource for data centers, the bandwidth overhead associated with the repair job should be carefully managed. For a typical $(n, k)$ RS code, $k$ blocks are required to reconstruct a failed block whereas replication schemes need only one block. For instance, the (14, 10) RS code has a 10x repair bandwidth overhead relative to a replication scheme, consuming a significant amount of bandwidth during repair as confirmed by real measurements in Facebook’s clusters [6].
A number of recent publications have dealt with the repair bandwidth issues. Dimakis et al. [9] showed repair models of MDS codes for functional repair and exact repair. Whereas exact repair restores the failed blocks by generating blocks having exact copies of the data, functional repair generates blocks that can be different from the failed blocks as long as the MDS property is maintained. They established the optimal storage-bandwidth tradeoff for functional repair and coined the term regenerating codes for the codes that achieve optimality in this sense. Many researchers have since designed regenerating codes for exact repair that operate in some specific environments [10, 11].
In contrast to existing works, this paper focuses on coding schemes that offer significant reliability advantages while achieving a highly competitive tradeoff between repair bandwidth and storage overhead. Local reconstruction codes/locally repairable codes (LRCs) and piggybacked RS codes are known methods that share the same key idea of reducing repair bandwidth. LRCs are non-MDS codes that add local parity symbols to existing RS codes to reduce repair bandwidth at the expense of an increased parity overhead. Windows Azure Storage (WAS) by Microsoft [7] and HDFS-Xorbas by Facebook [12, 13] are practical applications of LRCs. (Footnote 1: The Azure-LRC and the Xorbas-LRC are each specified by a triple of parameters; the quantities involved include the number of local groups, the number of global parities and the block locality.) Rashmi et al. [6] suggested piggybacked RS codes, which can reduce the repair bandwidth of RS codes without using extra storage but at the expense of code complexity.
This paper specifically explores the design of low-density parity-check (LDPC) codes [14] for distributed storage applications, exploiting tradeoffs among key performance metrics such as repair bandwidth, reliability and storage overhead. LDPC codes have been considered as an alternative to conventional distributed storage coding schemes. However, most known works in this area have been about reducing the coding overhead factor of LDPC codes rather than the repair bandwidth [15, 16]. Whereas Wei et al. [17, 18, 19] showed a low latency of LDPC codes based on simulation and suggested that LDPC codes may have advantages in repair bandwidth and reliability, there has been no rigorous analysis of repair bandwidth and reliability except in [20].
In [20], the present authors have demonstrated that LDPC codes provide benefits in terms of both repair bandwidth and reliability given the same storage overhead. Since a variable node of an LDPC code is connected to a relatively small number of nodes, LDPC codes have an inherent local repair property, as do LRCs. The repair bandwidth of an LDPC code does not depend on the length of the code, while the reliability typically gets better with increasing code length. Thus, in the case of LDPC codes, the code length can be allowed to grow to achieve excellent reliability without the repair bandwidth expanding as it does in RS codes. The only limiting factor in growing the code length of LDPC codes is the computation and buffer requirements, but compared to RS codes, the implementation complexity/buffer requirements of LDPC codes grow considerably more slowly with code length.
This paper refines and adds to the analysis of [20]. Optimality associated with the regularity of the LDPC codes and the dependency of LDPC codes’ reliability on the stopping set are made precise in the form of theorems with proofs. In addition, this paper also finds LDPC codes that allow control of the repair bandwidth while targeting high reliability. It is shown that properly designed LDPC codes can achieve a very high mean-time-to-data-loss (MTTDL) at a slight expense of repair bandwidth overhead.
Overall, the key contributions of this paper are as follows. The average repair bandwidth of LDPC codes is formulated, which leads to the observation that a regular check node degree achieves the minimum repair bandwidth given a fixed total number of edges in the factor graph. Moreover, given a fixed code rate, the variable node degree is also forced to be regular for repair bandwidth minimality, under some reasonable minimum variable node degree constraint. For reliability analysis, a general formula for the MTTDL of LDPC codes is derived. The formula shows how the stopping number of a code directly affects reliability. It is confirmed that increasing the stopping number of the factor graph greatly enhances reliability. Regular quasi-cyclic (QC) progressive-edge-growth (PEG) LDPC codes with different code rates have been designed and compared against representative RS codes and their variants. The results show that with LDPC codes a slight relaxation of the repair bandwidth minimality may allow a meaningful improvement in reliability. In summary, LDPC codes could be a powerful choice for distributed storage systems, enjoying both reasonably low repair-bandwidth and very high MTTDL.
The rest of this paper is organized as follows. In Section II, we give preliminary information on LDPC codes. Section III provides repair bandwidth analysis of LDPC codes. In Section IV, a design of LDPC codes for high reliability and reasonable repair bandwidth efficiency is discussed. In Section V, reliability analysis of LDPC codes is given. Approaches to increase reliability are introduced as well. In Section VI, some specific examples of LDPC codes are discussed which show great performance on distributed storage. Simulation results that compare LDPC codes with other schemes are also given in this section. Finally, the paper draws conclusions in Section VII.
II Preliminaries
II-A LDPC Codes
An LDPC code is a linear block code defined as the null space of an $m \times n$ sparse parity-check matrix $\mathbf{H}$ (i.e., $\mathbf{H}\mathbf{c}^{T} = \mathbf{0}$ if and only if $\mathbf{c}$ is a codeword), where $m$ is the number of parity blocks and $n$ is the length of the codeword. The LDPC codes we are concerned with in this paper are binary, i.e., defined over GF(2). $\mathbf{H}$ can be illustrated by a factor graph as in Fig. 1, which consists of check nodes (squares), variable nodes (circles), and edges (lines between squares and circles) [21]. Since each variable node stores a coded data block, data node and variable node are used synonymously in this paper. The factor graph describing the LDPC code is called a bipartite graph since it consists of two kinds of nodes: variable nodes (VNs) and check nodes (CNs). In the bipartite graph of an LDPC code, there are $m$ CNs representing parity-check equations and $n$ VNs indicating coded blocks. CN $i$ is connected to VN $j$ (i.e., CN $i$ involves VN $j$) if $h_{ij}$, the $(i, j)$-th element of $\mathbf{H}$, is 1.
The ensemble of LDPC codes is specified with a variable degree distribution polynomial $\lambda(x)$ and a check degree distribution polynomial $\rho(x)$ [21],
$$\lambda(x) = \sum_{i} \lambda_i x^{i-1}, \qquad \rho(x) = \sum_{i} \rho_i x^{i-1},$$
where $\lambda_i$ (resp. $\rho_i$) is the fraction of edges connected to VNs (resp. CNs) with degree $i$ and $\lambda(1) = \rho(1) = 1$ (i.e., the sum of coefficients is equal to one). This definition of the degree distribution pair is based on the “edge perspective”. If every VN has the same number of edges and every CN has the same number of edges, which means that the number of nonzero elements in each row and in each column of $\mathbf{H}$ are both constant, the corresponding LDPC code is termed a regular LDPC code. Otherwise, it is designated an irregular LDPC code.
Assuming that the parity-check matrix is full rank, the code rate $R$ of an LDPC code can be represented by its degree distribution pair [22],
$$R = 1 - \frac{\int_0^1 \rho(x)\,dx}{\int_0^1 \lambda(x)\,dx}. \tag{1}$$
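The degree distributions from a node perspective can be expressed in terms of the edge perspective description:
$$\tilde{\lambda}_i = \frac{\lambda_i / i}{\sum_{j} \lambda_j / j}, \qquad \tilde{\rho}_i = \frac{\rho_i / i}{\sum_{j} \rho_j / j}, \tag{2}$$
where $\tilde{\lambda}_i$ and $\tilde{\rho}_i$ are the fractions of VNs and CNs, respectively, with degree $i$.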
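As a small illustrative sketch (not part of the original analysis), the rate formula (1) and the edge-to-node conversion (2) can be evaluated numerically. The helper names `edge_integral`, `code_rate` and `node_perspective` are hypothetical, and a degree distribution is assumed to be stored as a dictionary mapping a degree $i$ to the edge fraction of that degree.

```python
from fractions import Fraction

def edge_integral(dist):
    """Integral over [0, 1] of sum_i c_i x^(i-1), which equals sum_i c_i / i."""
    return sum(Fraction(c) / i for i, c in dist.items())

def code_rate(lam, rho):
    """Design rate from (1): R = 1 - (integral of rho) / (integral of lambda)."""
    return 1 - edge_integral(rho) / edge_integral(lam)

def node_perspective(dist):
    """Fraction of nodes of each degree, following (2)."""
    total = edge_integral(dist)
    return {i: (Fraction(c) / i) / total for i, c in dist.items()}

# A (2, 4)-regular ensemble: every VN has degree 2, every CN has degree 4.
regular_rate = code_rate({2: 1}, {4: 1})

# A VN-irregular example: half of the edges touch degree-2 VNs, half degree-3 VNs.
irregular_nodes = node_perspective({2: Fraction(1, 2), 3: Fraction(1, 2)})

print(regular_rate)      # design rate of the (2, 4)-regular ensemble
print(irregular_nodes)   # node-perspective fractions of the irregular example
```

Exact rational arithmetic is used here only so that the results can be compared against hand calculations without rounding concerns.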
II-B Density Evolution
Density evolution is a deterministic numerical tool which tracks the fraction of erased variable nodes as iterative decoding proceeds. For the binary LDPC codes we are concerned with, the failure of a data block can be translated into an erased variable node over a binary erasure channel (BEC). In this case, the expected fraction of erased data nodes at the $\ell$-th iteration, $x_\ell$, as the block length goes to infinity is given by the recursion [21]:
$$x_\ell = \epsilon\,\lambda\bigl(1 - \rho(1 - x_{\ell-1})\bigr).$$
Here, for the channel erasure probability $\epsilon$ of the BEC, $x_0 = \epsilon$. Decoding with an LDPC code constructed from a degree distribution pair $(\lambda, \rho)$ and an initial erasure probability $\epsilon$ is successful if and only if $\lim_{\ell \to \infty} x_\ell = 0$. This condition for successful decoding is equivalent to
$$\epsilon\,\lambda\bigl(1 - \rho(1 - x)\bigr) < x, \qquad \forall x \in (0, \epsilon].$$
As the block length grows to infinity, every code in an ensemble tends to behave in the same way. Assuming that the code is sufficiently long, the performance of an individual code can thus be captured by the ensemble average. The decoding threshold of an LDPC ensemble is the largest channel erasure probability for which the above inequality condition is satisfied, and it can be evaluated with the density evolution technique as the block length tends to infinity. The decoding threshold divides the channel erasure probabilities into a region where data can be reliably stored and a region where it cannot. Density evolution therefore provides the maximum channel erasure probability that can be corrected by message-passing decoding, averaged over all LDPC codes in a particular ensemble.
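The threshold computation described above can be sketched numerically as follows. This is a minimal illustration under stated assumptions: the standard BEC condition $\epsilon\lambda(1-\rho(1-x)) < x$ on $(0, \epsilon]$ is checked on a finite grid and the threshold is located by bisection, so the result is only approximate. The function names and the grid size are hypothetical choices, not taken from the paper.

```python
def poly(dist, x):
    """Evaluate sum_i c_i x^(i-1) for a degree distribution {degree: fraction}."""
    return sum(c * x ** (i - 1) for i, c in dist.items())

def decodes(eps, lam, rho, grid=2000):
    """Check eps * lam(1 - rho(1 - x)) < x on a uniform grid over (0, eps]."""
    for j in range(1, grid + 1):
        x = eps * j / grid
        if eps * poly(lam, 1 - poly(rho, 1 - x)) >= x:
            return False
    return True

def threshold(lam, rho, tol=1e-6):
    """Bisection for the largest eps that passes the grid check."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if decodes(mid, lam, rho):
            lo = mid
        else:
            hi = mid
    return lo

eps_36 = threshold({3: 1.0}, {6: 1.0})  # (3, 6)-regular ensemble
eps_24 = threshold({2: 1.0}, {4: 1.0})  # (2, 4)-regular ensemble
print(round(eps_36, 3), round(eps_24, 3))
```

For degree-2 VNs the binding constraint sits near $x = 0$, so the grid resolution directly limits the accuracy of the estimate; a finer grid or an analytical stability check would be natural refinements.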
III Repair Bandwidth Analysis of LDPC Codes
In this section, the repair bandwidth of LDPC codes is described in the average sense for all nodes. LDPC codes are similar to LRCs regarding the repair process since the parity blocks of both codes are made locally from a small portion of data blocks.
As shown in Fig. 2, if the block represented by node VN1 is erased, the repair job can be done by downloading the adjacent blocks VN2 and VN3 connected to the same check node CN1. This simple example demonstrates that LDPC codes can reconstruct erased data by using a relatively small number of blocks.
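The single-check repair just described can be sketched for a binary code, where a parity check forces the XOR of its blocks to be zero, so an erased block equals the XOR of the remaining blocks in that check. The block contents below are made up for illustration, and `repair_block` is a hypothetical helper, not an interface from the paper.

```python
def repair_block(blocks, check, erased):
    """Rebuild blocks[erased] as the XOR of the other blocks in one check.

    blocks: dict mapping a VN name to its bytes (the erased VN is absent)
    check:  list of VN names participating in the chosen check node
    Returns the rebuilt bytes and the number of blocks downloaded.
    """
    helpers = [v for v in check if v != erased]
    rebuilt = bytes(len(blocks[helpers[0]]))  # all-zero block of the right size
    for v in helpers:
        rebuilt = bytes(a ^ b for a, b in zip(rebuilt, blocks[v]))
    return rebuilt, len(helpers)

# Illustrative data: VN1 is chosen so that VN1 xor VN2 xor VN3 = 0 (check CN1).
vn2 = b"\x10\x20\x30\x40"
vn3 = b"\x0f\x0e\x0d\x0c"
vn1 = bytes(a ^ b for a, b in zip(vn2, vn3))

surviving = {"VN2": vn2, "VN3": vn3}          # VN1 has been erased
rebuilt, downloads = repair_block(surviving, ["VN1", "VN2", "VN3"], "VN1")
print(rebuilt == vn1, downloads)
```

The number of downloads is the check degree minus one, which is the quantity that the repair bandwidth analysis in this section averages.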
When an erased block is connected to multiple CNs, as is usually the case, we can choose a specific CN for repair. If the LDPC code is regular, any choice is equally good statistically. For an irregular LDPC code, however, the choice of a CN affects the amount of repair bandwidth since each check may have a different degree. We thus define the bandwidth in the average sense. If a VN is erased, the repair bandwidth for that VN is the number of blocks downloaded, averaged over all choices of CNs the VN is connected to. Note that all other VNs are assumed intact in this definition. This value is then averaged over all VN erasure positions. This final average repair bandwidth is obtained by first considering all VNs connected to each CN. For CN $i$ with degree $d_{c,i}$, there are $d_{c,i}$ VNs attached to it, each of which will have a repair bandwidth of $d_{c,i} - 1$, assuming the other VNs attached to CN $i$ are downloaded for repair. The total repair bandwidth associated with CN $i$ can be said to be equal to $d_{c,i}(d_{c,i} - 1)$. Summing over all CNs, we get $\sum_{i=1}^{m} d_{c,i}(d_{c,i} - 1)$. To get to the per-VN repair bandwidth, we recognize that each VN is counted as many times as its node degree in the computation of $\sum_{i=1}^{m} d_{c,i}(d_{c,i} - 1)$ since each VN is connected to multiple CNs in general. Thus, this sum should be divided by $n\bar{d}_v$, where $\bar{d}_v$ is the average VN degree, to arrive at the per-VN average repair bandwidth we are looking for. But $n\bar{d}_v = m\bar{d}_c$, where $\bar{d}_c$ is the average CN degree. Note that $m\bar{d}_c$ also represents the total number of edges, $E$, in the factor graph. We establish a definition:
Definition 1.
The average repair bandwidth, or simply repair bandwidth, of an LDPC code is defined as
$$\gamma = \frac{1}{E} \sum_{i=1}^{m} d_{c,i}\,(d_{c,i} - 1), \tag{3}$$
where $m$ represents the number of parity blocks of an LDPC code, $d_{c,i}$ denotes the degree of check node $i$ and $E$ indicates the total number of edges in the factor graph.
The following lemma subsequently tells us how the CN degrees should be distributed to minimize the average repair bandwidth of (3).
Lemma 1.
Given a fixed number of edges on the factor graph, a regular check node degree minimizes the repair bandwidth of LDPC codes to $d_c - 1$, where $d_c$ denotes the check node degree of the corresponding LDPC codes.
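Proof: The repair bandwidth in (3) can be rewritten as
$$\gamma = \frac{1}{E} \sum_{i=1}^{m} d_{c,i}^{2} - 1.$$
By using the Cauchy-Schwarz inequality and the constraint $\sum_{i=1}^{m} d_{c,i} = E$, the choice
$$d_{c,1} = d_{c,2} = \cdots = d_{c,m} = \frac{E}{m}$$
minimizes the average bandwidth. Thus, a regular CN degree minimizes the repair bandwidth and the corresponding minimum value is $d_c - 1$, one less than the CN degree.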
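A quick numerical check of Definition 1 and Lemma 1 can be sketched as below. The function name `repair_bandwidth` and the example degree lists are illustrative assumptions; the only input is the list of CN degrees (the row weights of the parity-check matrix).

```python
from fractions import Fraction

def repair_bandwidth(cn_degrees):
    """Average repair bandwidth: sum_i d_i (d_i - 1) divided by the edge count."""
    edges = sum(cn_degrees)
    return Fraction(sum(d * (d - 1) for d in cn_degrees), edges)

# Two hypothetical CN degree profiles with the same number of edges (12).
bw_regular = repair_bandwidth([4, 4, 4])
bw_irregular = repair_bandwidth([3, 4, 5])

print(bw_regular, bw_irregular)
```

With the edge count held fixed, the regular profile attains the value "CN degree minus one", while the irregular profile is strictly larger, in line with the Cauchy-Schwarz argument of the proof.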
Lemma 1 indicates that an LDPC code must be CN-regular in order to minimize the repair bandwidth. How about the VN degrees? Before discussing desirable VN degree characteristics in light of the repair bandwidth issue, it is natural to impose a minimum VN degree constraint such that any VN in a factor graph has a degree at least equal to some positive integer $d_{v,\min}$. This is due to practical reasons having to do with decodability. For example, we obviously need $d_{v,\min} \geq 1$ so that each VN is attached to at least one CN, in order to reconstruct any single VN erasure. In practical applications where a node may fail before the current failure can be repaired, we actually need a more stringent condition: LDPC codes with even a few degree-1 VNs have been deemed impractical [23, 24, 25], suggesting that we should set $d_{v,\min} = 2$. This type of minimum VN degree requirement calls for VN-regularity as well. We summarize the desired CN and VN degree characteristics in the following combined statement.
Theorem 1.
Among the factor graphs having no VNs with degree less than $d_{v,\min}$, a chosen graph yields an LDPC code of rate $R$ with minimum repair bandwidth if and only if it is both CN- and VN-regular with $d_v = d_{v,\min}$ and $d_c = d_{v,\min}/(1 - R)$.
Proof: Lemma 1 states that a minimum-repair-bandwidth LDPC code is CN-regular with the uniform CN degree of $d_c$. The proof follows directly from this lemma combined with the fact that the ratio of $\bar{d}_v$, the average VN degree, to $d_c$ gets fixed once the code rate is given, as seen in the relation $R = 1 - m/n = 1 - \bar{d}_v/d_c$, where $n\bar{d}_v = m d_c$. For a given $R$ value, the average VN degree $\bar{d}_v$ must be made as small as possible to minimize $d_c$ so that
$$\gamma_{\min} = d_c - 1 = \frac{\bar{d}_v}{1 - R} - 1 \tag{4}$$
is in turn minimized. But since each VN degree is greater than or equal to $d_{v,\min}$ by assumption, so is the average VN degree $\bar{d}_v$. The minimum average VN degree of $d_{v,\min}$ is achieved only when all VNs have a fixed degree of $d_{v,\min}$, i.e., when the factor graph is VN-regular. As for the CN degree, we need $d_c = d_{v,\min}/(1 - R)$ for minimum repair bandwidth.
It is clear that the repair bandwidth of an LDPC code does not depend on the code length, but on $d_c$. This property makes LDPC codes a powerful option for distributed storage. Moreover, for a fixed $d_v$, (4) also reveals that the repair bandwidth increases with increasing $R$, which is due to the fact that for a fixed $d_v$, increasing $R$ must also mean increasing $d_c$.
The regularity of minimum repair bandwidth LDPC codes automatically results in a condition on the code rate, as stated in the following corollary.
Corollary 1.
An LDPC code of rate $R$ allows minimum repair bandwidth only if $d_{v,\min}/(1 - R)$ is an integer greater than $d_{v,\min}$.
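The rate condition can be checked with a short sketch. It assumes the reading used in Theorem 1, namely that the regular CN degree equals the minimum VN degree divided by one minus the code rate, and that this value must be an integer exceeding the minimum VN degree. The function name `min_repair_bandwidth` is hypothetical.

```python
from fractions import Fraction

def min_repair_bandwidth(rate, dv_min=2):
    """Return d_c - 1 for the regular code of the given rate, or None if the
    implied CN degree dv_min / (1 - rate) is not an integer above dv_min."""
    rate = Fraction(rate)
    if rate >= 1:
        return None
    dc = Fraction(dv_min) / (1 - rate)
    if dc.denominator != 1 or dc <= dv_min:
        return None
    return int(dc) - 1

for r in ("1/2", "2/3", "3/4", "4/7"):
    print(r, min_repair_bandwidth(r))
```

Rates 1/2, 2/3 and 3/4 with a minimum VN degree of 2 give CN degrees 4, 6 and 8, whereas a rate such as 4/7 does not correspond to any regular degree pair and is reported as `None`.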
Theorem 1 indicates that under the practical constraint of $d_{v,\min} = 2$, regular LDPC codes with $d_v = 2$ give the best repair bandwidth efficiency. At this point, a useful question arises: if we are allowed to increase $\bar{d}_v$ beyond 2, in hopes of improving reliability for certain applications, how rapidly do we lose repair-bandwidth efficiency? In other words, we are interested in investigating the possibility of relaxing the repair bandwidth minimality in an effort to improve reliability. Specifically, we shall compare the reliability-bandwidth tradeoffs of the regular LDPC codes having a fixed $d_v = 2$ with VN-irregular LDPC codes with average VN degree beyond 2. In the process, we provide a new VN-irregular LDPC code design that allows a good erasure correction capability at a slight expense of repair bandwidth efficiency. In comparing different coding schemes we consider three code rates: 1/2, 2/3 and 3/4. Theorem 1 provides the reference point for minimal repair bandwidth.
Remark 1 (Nonbinary LDPC Codes).
As can be seen in Fig. 3, suppose we have an $(n, k)$ coded system consisting of words (or symbols), each of which has $q$ data bits in it (e.g., nonbinary LDPC codes over GF($2^q$) or RS codes over GF($2^q$)). As an example, for nonbinary LDPC codes consisting of $q$-bit symbols, the repair bandwidth is given by $(d_c - 1)B$ bits, where $B$ is the block size in bits and we assume for simplicity that $B$ is a multiple of $q$. Likewise, for RS codes made up of $q$-bit words, $kB$ bits are required for repairing a failed block. From the two examples above, the repair bandwidth of codes whose symbols consist of multiple bits does not change with the symbol or word size $q$. Hence we shall consider only binary LDPC codes in this paper without loss of generality. We will stick to the normalized value $d_c - 1$ for the repair bandwidth instead of the more general expression $(d_c - 1)B$ for simplicity.
IV A Design for the Efficiency in Both Repair and Protection
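In this section, we suggest a degree distribution design criterion to guarantee high erasure-correction capability while enjoying reasonable efficiency in repair-bandwidth. Recall that a regular CN degree minimizes the repair bandwidth for a given number of edges in the factor graph and the minimum repair bandwidth is given by $d_c - 1$ in Lemma 1. While maintaining CN-regularity, we will relax the regularity condition on the VNs to find the appropriate VN degree distribution. The average VN degree can be computed as
$$\bar{d}_v = \sum_{i} i\,\tilde{\lambda}_i = \frac{1}{\sum_{i} \lambda_i / i}, \tag{5}$$
where $\tilde{\lambda}_i = (\lambda_i / i) / \sum_{j} (\lambda_j / j)$ from (2) and $\sum_{i} \lambda_i = 1$.
Using (5), $d_c - 1$ can be rewritten as
$$d_c - 1 = \frac{\bar{d}_v}{1 - R} - 1 = \frac{1}{(1 - R) \sum_{i} \lambda_i / i} - 1.$$
Therefore, we need to maximize $\sum_{i} \lambda_i / i$ in order to minimize $d_c - 1$ for a fixed code rate constraint of $R$.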
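We propose a design of LDPC codes that balances the repair bandwidth overhead and the system reliability. To do this, we consider the following optimization problem with the VN degree distribution parameters $\lambda_i$ as optimization variables.
$$\underset{\lambda_i}{\text{maximize}} \quad \sum_{i} \frac{\lambda_i}{i} \tag{6}$$
$$\text{subject to} \quad \epsilon\,\lambda\bigl(1 - \rho(1 - x)\bigr) < x, \quad \forall x \in (0, \epsilon], \tag{7}$$
$$\frac{\int_0^1 \rho(x)\,dx}{\int_0^1 \lambda(x)\,dx} = 1 - R, \tag{8}$$
where $\epsilon$ represents the minimum level of reliability imposed. The constraint (7) ensures successful decoding as discussed in Section II-B, and (8) is the rate constraint from (1). In addition, obvious extra constraints exist on any VN degree distribution polynomial: $\lambda_i \geq 0$ and $\sum_{i} \lambda_i = 1$.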
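We employ a CN degree distribution $\rho(x) = x^{d_c - 1}$, forcing CN-regularity. Hence, (7) and (8) reduce to the following constraints:
$$\epsilon\,\lambda\bigl(1 - (1 - x)^{d_c - 1}\bigr) < x, \quad \forall x \in (0, \epsilon], \tag{9}$$
$$\sum_{i} \frac{\lambda_i}{i} = \frac{1}{d_c\,(1 - R)}. \tag{10}$$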
Our optimization problem can then be stated as: for a given value of $\epsilon$, find the distribution $\lambda(x)$ that will minimize $d_c$ while satisfying (9) and (10). A small $d_c$ value would be great for maintaining repair efficiency, but to tolerate a higher value of $\epsilon$ in ensuring reliability, a compromise would have to be made on how small $d_c$ could get. Noticing that the values of $d_c$ are integers forming a relatively small search space, a clear picture of this tradeoff can be obtained conveniently by fixing $d_c$ and then iteratively finding the maximum value of $\epsilon$ and the corresponding $\lambda(x)$ that satisfy (9) and (10) for each fixed value of $d_c$. Fig. 4 shows the relationships obtained for the minimum repair bandwidth versus the decoding threshold for different code rates. Fig. 4 clearly reveals the maximum level of reliability that can be achieved for a given repair bandwidth or, equivalently, the minimum repair bandwidth attainable for a given level of reliability, for some fixed code rate.
We also present several examples of designed degree distributions for different target code rates of $R$ = 1/2, 2/3 and 3/4 in Table I. The scaled maximum decoding threshold, i.e., the decoding threshold divided by the BEC capacity limit $1 - R$, indicates how close the designed decoding threshold is to the BEC capacity. The repair bandwidth and the average VN degree are also shown.
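Table I. Design examples for target code rates 1/2, 2/3 and 3/4.

| Code rate | Repair bandwidth | Scaled maximum decoding threshold | Average VN degree |
|---|---|---|---|
| 1/2 | 3 | 0.6680 | 2 |
| 1/2 | 4 | 0.9100 | 2.4997 |
| 1/2 | 5 | 0.9640 | 2.9990 |
| 1/2 | 6 | 0.9840 | 3.4987 |
| 2/3 | 5 | 0.6667 | 2 |
| 2/3 | 6 | 0.8260 | 2.3328 |
| 2/3 | 7 | 0.9010 | 2.6662 |
| 2/3 | 8 | 0.9430 | 2.9977 |
| 2/3 | 9 | 0.9640 | 3.3321 |
| 3/4 | 7 | 0.5720 | 2 |
| 3/4 | 8 | 0.7480 | 2.2500 |
| 3/4 | 9 | 0.8480 | 2.4997 |
| 3/4 | 10 | 0.8960 | 2.7500 |
| 3/4 | 11 | 0.9320 | 2.9973 |
| 3/4 | 12 | 0.9520 | 3.2468 |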
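As a consistency sketch, the tabulated rows can be compared against the relation between the repair bandwidth of a CN-regular code and its average VN degree at a fixed rate, i.e., average VN degree divided by one minus the rate, minus one. The row values are copied from Table I; the function name and the tolerance are illustrative choices.

```python
# (code rate, repair bandwidth, average VN degree) taken from Table I.
TABLE_I = [
    (1 / 2, 3, 2.0), (1 / 2, 4, 2.4997), (1 / 2, 5, 2.9990), (1 / 2, 6, 3.4987),
    (2 / 3, 5, 2.0), (2 / 3, 6, 2.3328), (2 / 3, 7, 2.6662), (2 / 3, 8, 2.9977),
    (2 / 3, 9, 3.3321),
    (3 / 4, 7, 2.0), (3 / 4, 8, 2.2500), (3 / 4, 9, 2.4997), (3 / 4, 10, 2.7500),
    (3 / 4, 11, 2.9973), (3 / 4, 12, 3.2468),
]

def implied_repair_bandwidth(rate, avg_vn_degree):
    """Repair bandwidth d_c - 1 implied by the rate and the average VN degree."""
    return avg_vn_degree / (1 - rate) - 1

# Largest gap between the tabulated and the implied repair bandwidth.
max_gap = max(abs(implied_repair_bandwidth(r, dv) - bw) for r, bw, dv in TABLE_I)
print(max_gap)
```

The small residual gaps come from the rounding of the tabulated average VN degrees; every row rounds to the listed integer repair bandwidth.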
V Reliability Analysis of LDPC Codes
V-A The Mean Time to Data Loss
We now provide reliability analysis for LDPC codes. In particular, we show that increasing the stopping number of the factor graph directly influences reliability. A Markov model is introduced to estimate the system reliability of coding schemes. Continuous-time Markov models have been commonly used to compare the reliability of storage systems in terms of the MTTDL (e.g., see [26, 8, 7, 12, 27]). Unlike the bit-error-rate (BER) or the word-error-rate (WER) performance metric, the MTTDL metric based on the Markov model considers the repair speed, which is our main interest in this paper.
Fig. 5 shows a Markov model example of the (14, 10) RS code [12]. The MTTDL is mainly influenced by the number of failures which can be tolerated before data loss as well as by the repair rate. Here, $\lambda$ indicates the failure rate of a node and $\mu$ represents the repair rate of the nodes. Typically, $\mu \gg \lambda$ for storage applications. We can assume that each node fails independently at rate $\lambda$ if the blocks are stored in different racks (physically separated storage units in data centers). Then, it is reasonable to ignore the possibility of burst failures. Also, the adoption of a continuous-time Markov model presupposes that only a single node failure is allowed at a given instant. Each state of the Markov model represents the number of erased blocks in a stripe. For the (14, 10) RS code, state 5 is the data loss (DL) state since five erasures in a stripe cannot be decoded. Whereas the failure rates depend on the state, the repair rates are all the same since the number of blocks to be downloaded for repair is always 10. The MTTDL can be obtained from this Markov model by calculating the mean arrival time to the DL state. The MTTDL of MDS codes is well established [8, 28]. The MTTDL analysis for MDS codes can be modified and extended for LDPC codes, as discussed next.
V-B MTTDL of LDPC Codes
In this section, details of calculating the MTTDL for non-MDS codes are described. While the Markov model is already discussed for LDPC codes in [29], a general formula for the MTTDL of LDPC codes has not been given. We provide such a formula here. We also develop insights into how the MTTDL of LDPC codes is affected by the stopping number. Before presenting the Markov model of LDPC codes, some key terms are clarified. On factor graphs, the girth is the length of the shortest cycle. A stopping set [30] is a subset of variable nodes such that every check node connected to it is connected to it by at least two edges, and the stopping number is the size of the smallest stopping set.
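The definitions of a stopping set and the stopping number can be made concrete with a brute-force sketch. It is practical only for very short codes, since it enumerates subsets of VNs. The example matrix is a hypothetical degree-2 construction in which VNs are the six edges of the complete graph on four vertices and CNs are its vertices, giving a factor graph of girth 6; this matrix is not full rank and is used purely for illustration.

```python
from itertools import combinations

def is_stopping_set(H, subset):
    """True if no check node sees exactly one VN of the (non-empty) subset."""
    return all(sum(row[v] for v in subset) != 1 for row in H)

def stopping_number(H):
    """Size of the smallest non-empty stopping set, found by exhaustive search."""
    n = len(H[0])
    for size in range(1, n + 1):
        for subset in combinations(range(n), size):
            if is_stopping_set(H, subset):
                return size
    return None

# VNs = edges of K4, CNs = vertices of K4; each column has weight 2.
edges = list(combinations(range(4), 2))
H = [[1 if c in e else 0 for e in edges] for c in range(4)]

s_min = stopping_number(H)
print(s_min)
```

In this example the smallest stopping sets are the triangles of the underlying graph, so the stopping number equals half of the girth, consistent with the known relation for VN degree 2 [30].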
The derivation process is similar to that for MDS codes. However, as shown in Fig. 6, LDPC codes can directly go to the data loss state with only a small number of erasures. For instance in Fig. 2, if VN5 and VN7 fail, it is impossible to repair those nodes, unlike in MDS codes. To model this behavior, probability parameters $p_i$ are introduced to the Markov model. Probability $p_i$ is the conditional probability that a stripe of a given code can tolerate an additional node failure given state $i$. This means that the code has already survived $i$ failures and can tolerate one more failure with probability $p_i$. In general, LDPC codes are designed to guarantee $p_0 = 1$ and $p_1 = 1$ since length-4 cycles are prohibited; however, other probabilities depend on the parity-check matrix of the code. If the parity-check matrix of the LDPC code is given, the $p_i$ values can be obtained by the relationship $p_i = q_{i+1}/q_i$, where $q_i$ denotes the unconditional probability that a given code can tolerate $i$ failures [29]. Such unconditional probabilities can be estimated by decoding simulation of LDPC codes on the erasure channel. Exploiting the estimators of $q_{i+1}$ and $q_i$, we can obtain an asymptotically unbiased estimator of $p_i$ given a large number of samples. This can be justified as follows.
Note.
Given two random variables $X$ and $Y$, assume that we cannot directly measure the ratio of their means, $\mathbb{E}[X]/\mathbb{E}[Y]$. From the measured realizations of $X$ and $Y$, we wish to estimate this ratio. Suppose that $\bar{x}$ and $\bar{y}$ are sample means over $N$ samples. Given the samples of $X$ and $Y$, a possible estimator for the ratio is the sample ratio $\bar{x}/\bar{y}$. The bias of this estimator goes as $O(1/N)$:
(11) 
which indicates that the sample ratio $\bar{x}/\bar{y}$ becomes an unbiased estimator as $N$ tends to infinity.
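The estimation of the tolerance probabilities can be sketched with a peeling (iterative erasure) decoder. For a very short code the unconditional probabilities can be computed exactly by enumerating all erasure patterns, which is what this sketch does; for realistic code lengths the enumeration would be replaced by random sampling, which is where the ratio-estimator argument above applies. The example matrix (VNs as edges of the complete graph on four vertices) and all function names are illustrative assumptions.

```python
from fractions import Fraction
from itertools import combinations

def peel(H, erased):
    """Peeling decoder on the erasure channel: True if all erasures are resolved."""
    erased = set(erased)
    progress = True
    while erased and progress:
        progress = False
        for row in H:
            hit = [v for v in erased if row[v]]
            if len(hit) == 1:          # this check can recover its single erasure
                erased.discard(hit[0])
                progress = True
    return not erased

def tolerate_prob(H, i):
    """Exact fraction of i-erasure patterns that the peeling decoder recovers."""
    n = len(H[0])
    patterns = list(combinations(range(n), i))
    return Fraction(sum(peel(H, p) for p in patterns), len(patterns))

def conditional_probs(H, max_state):
    """Ratios of consecutive tolerance probabilities (assumes nonzero denominators)."""
    q = [tolerate_prob(H, i) for i in range(max_state + 2)]
    return [q[i + 1] / q[i] for i in range(max_state + 1)]

edges = list(combinations(range(4), 2))
H = [[1 if c in e else 0 for e in edges] for c in range(4)]

p = conditional_probs(H, 2)
print(p)
```

For this toy graph any two erasures are recoverable, while a third erasure causes a decoding failure exactly when the three erased VNs form a triangle.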
For $m$ parity blocks (see Fig. 6), the MTTDL equation is given by the following lemma.
Lemma 2.
For an arbitrary number $m$ of parity blocks, the MTTDL of LDPC codes can be represented by
(12) 
where $\lambda$ denotes the failure rate of a node, $\mu$ indicates the repair rate of the nodes, $p_i$ represents the conditional probability that a stripe of a given code can tolerate an additional node failure given state $i$, and the remaining auxiliary quantity is defined by
(13) 
Proof.
See Appendix A. ∎
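As a numerical companion to the closed-form expression, the mean time to absorption of the Markov chain can also be obtained by solving the first-step equations directly. The sketch below assumes the following model, which is one reading of the chain described in this section and not a transcription of the lemma: in state $i$ a failure occurs at rate $(n - i)\lambda$ and is tolerated with probability $p_i$ (otherwise data is lost), and repair returns the chain from state $i$ to state $i - 1$ at rate $\mu$. The function name `mttdl` and the example numbers are hypothetical.

```python
from fractions import Fraction

def mttdl(n, lam, mu, p):
    """Expected time to data loss from state 0.

    p[i] is the probability that a failure in state i is tolerated; any failure
    in the last listed state is treated as data loss.
    """
    lam, mu = Fraction(lam), Fraction(mu)
    S = len(p)
    # First-step equations: (f_i + r_i) T_i - f_i p_i T_{i+1} - r_i T_{i-1} = 1.
    A = [[Fraction(0)] * (S + 1) for _ in range(S)]
    for i in range(S):
        fail = (n - i) * lam
        rep = mu if i > 0 else Fraction(0)
        A[i][i] = fail + rep
        if i + 1 < S:
            A[i][i + 1] = -fail * Fraction(p[i])
        if i > 0:
            A[i][i - 1] = -rep
        A[i][S] = Fraction(1)
    # Gauss-Jordan elimination with exact rational arithmetic.
    for col in range(S):
        pivot = next(r for r in range(col, S) if A[r][col] != 0)
        A[col], A[pivot] = A[pivot], A[col]
        A[col] = [a / A[col][col] for a in A[col]]
        for r in range(S):
            if r != col and A[r][col] != 0:
                factor = A[r][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return A[0][S]

# Two-node mirror (an MDS special case): tolerate one failure, never a second.
mirror = mttdl(2, 1, 10, [1, 0])
# Three nodes: lowering the tolerance probability in state 1 shortens the MTTDL.
full = mttdl(3, 1, 10, [1, 1, 0])
weak = mttdl(3, 1, 10, [1, Fraction(1, 2), 0])
print(mirror, float(full), float(weak))
```

The mirror case reproduces the textbook value $(3\lambda + \mu)/(2\lambda^2)$, and the comparison of the last two cases illustrates why keeping the early tolerance probabilities at 1 matters for reliability.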
From Lemma 2 it is seen that with all other parameters fixed, making the values large increases the MTTDL by examining what each term in the MTTDL is doing in the limit. In order to see the behavior in the limit, divide the numerator and denominator of the MTTDL by and write:
(14) 
Decreasing (14) increases the MTTDL for a given $\lambda$. On the right-hand side of (14), the power of $\lambda/\mu$ in the $i$-th term drops quickly with increasing $i$ for a small value of $\lambda/\mu$. Note that the $0$-th term disappears as $p_0$ is forced to 1 in any practical LDPC code. It is easy to see that if $p_i$ in the $i$-th term is set to 1, then this term reduces to zero. Since the power of $\lambda/\mu$ is larger for a smaller value of $i$, forcing as many $p_i$’s for small $i$ as possible to 1 is crucial to minimize (14) or, equivalently, maximize the MTTDL. This property is the key to designing factor graphs that enhance reliability. Since the stopping number is the smallest number of erasures that cannot be corrected, it is clear that increasing the stopping number is equivalent to driving more $p_i$’s to 1. Therefore, a large stopping number of the factor graph would mean an enhanced MTTDL. Theorem 2 makes this relationship between the MTTDL and the stopping number more precise.
Theorem 2.
The MTTDL for LDPC codes is a monotonically increasing function of the stopping number of the given factor graph as and assuming .
Proof: See Appendix B.
In particular, for a VN degree of 2, the stopping number is equal to $g/2$, where $g$ is the girth of the graph [30]. As a result, to increase the reliability of regular LDPC codes with $d_v = 2$, the girth should be increased. This observation motivates LDPC code design by PEG, which is an effective search method for factor graphs with good girth properties.
Remark 2.
As can be seen in (B.3) derived in the proof of Theorem 2, only a single probability really matters in computing the MTTDL. Empirical results also show that the simplified expression (B.3), which is reproduced below, yields virtually the same MTTDL values as the full expression (12).
(15) 
Since the MTTDL is governed essentially by a single probability, computing the MTTDL of an LDPC code does not require estimating all of the $p_i$ probabilities through a very extensive error pattern search.
VI Quantitative Results
From the repair bandwidth analysis in Section III, it is shown that a regular CN degree minimizes the average repair bandwidth of LDPC codes. It is also shown that regular LDPC codes with $d_v = 2$ can minimize repair bandwidth for a given code rate, provided degree-1 VNs are prohibited. In addition, from the MTTDL analysis in Section V, it is verified that LDPC codes should have a large stopping number, which helps to improve reliability. With regard to regular LDPC codes with $d_v = 2$, the size of the girth plays the same role as the stopping number. We shall focus on PEG-LDPC codes in this section. PEG is a well-known algorithm which can construct factor graphs having a large girth [31]. However, a concern that may arise for setting $d_v = 2$ is a potentially poor decoding capability in practical scenarios where multiple erasures may occasionally occur within a single codeword, since each VN is protected by only two checks with $d_v = 2$. We plot the data (codeword) loss probability of two regular LDPC codes in Fig. 7 in environments where each symbol erasure occurs independently within each codeword. The results indicate that even a short LDPC code with $d_v = 2$ shows erasure correction behavior similar to 3-replication at low erasure probabilities. Note that decoding capability improves when a larger LDPC code is adopted, showing a data loss probability comparable to that of the (15, 10) RS code. In the case of irregular LDPC codes, even though the direct correlation between the girth and the stopping number is unknown, PEG is still a reasonable approach.
Having ensured a good decoding capability, we consider storage overhead (the inverse of the code rate), repair bandwidth and MTTDL as the metrics for comparison. For the MTTDL simulation, the following normalized equation is used for a fair comparison among codes of different lengths:
Here, the MTTDL given in Section V for a single stripe is normalized by the number of stripes in the storage system. The parameters used for the MTTDL simulation are given in Table II. These values are chosen to be consistent with the existing literature [7, 12]. Note that the repair rate includes both the triggering time and the downloading time; the downloading time depends on the repair bandwidth (BW) overhead of the coding scheme.
Parameter  Value  Description 

40 PB  Total amount of data
256 MB  Block size  
2000  Number of disk nodes  
20 TB  Storage capacity of a disk  
1 Gbps  Network bandwidth on each node  
1 year  MTTF (meantimetofailure) of a node  
Repair rate  
15 min  Detection and triggering time for repair  
Downloading time of blocks  
Repair BW overhead of the given code  
Number of total coded blocks in a stripe  
Number of data blocks in a stripe  
Number of parity blocks in a stripe 
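A minimal sketch of how the Table II parameters might enter the normalization and the repair rate. The number of stripes is taken here as the total amount of data divided by the size of one stripe of n blocks, and the downloading time as the repair BW overhead times the transfer time of one block. Both forms, the decimal unit conversions and all function names are assumptions of this sketch rather than definitions taken from the text.

```python
# Table II values, converted with decimal units (an assumption).
TOTAL_DATA_BYTES = 40e15      # 40 PB
BLOCK_BYTES = 256e6           # 256 MB
NODE_BANDWIDTH_BPS = 1e9      # 1 Gbps
TRIGGER_SECONDS = 15 * 60     # 15 min detection and triggering time


def num_stripes(n, total_bytes=TOTAL_DATA_BYTES, block_bytes=BLOCK_BYTES):
    """Assumed stripe count: total data divided by the n blocks of a stripe."""
    return total_bytes / (n * block_bytes)


def normalized_mttdl(mttdl_stripe, n):
    """MTTDL of one stripe normalized by the number of stripes."""
    return mttdl_stripe / num_stripes(n)


def repair_time_seconds(repair_bw_overhead):
    """Assumed repair time: triggering time plus time to download the blocks."""
    download = repair_bw_overhead * BLOCK_BYTES * 8 / NODE_BANDWIDTH_BPS
    return TRIGGER_SECONDS + download


print(num_stripes(15))
print(repair_time_seconds(5), repair_time_seconds(10))
```

Under these assumptions the downloading time is small compared with the 15 min triggering time, so halving the repair BW overhead changes the repair time only slightly.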
Coding scheme  Storage overhead  Repair BW overhead  MTTDL (days)
3-replication  3x  1x  1.20E+3
(15, 10) RS  1.5x  10x  2.13E+10
(10, 6, 5) Xorbas LRC  1.6x  5x  7.38E+7
(15, 10, 6) Binary LRC  1.5x  6x  3.00E+4
(60, 40) LDPC  1.5x  5x  1.40E+7
(150, 100) LDPC  1.5x  5x  1.42E+8
(210, 140) LDPC  1.5x  5x  2.91E+11
For the LDPC code simulations, the probabilities are first obtained from decoding simulations with specific QC-PEG parity-check matrices, and the MTTDL values are then calculated from (12) or (15).^{3} Table III shows the performance of the QC-PEG LDPC codes. Here, the (15, 10) RS code is chosen for comparison, along with simple replication and existing LRC methods.
^{3} Note that the MTTDL value shown here for 3-replication differs from those in [12, 7] because the repair rate is defined differently (in [7], it is defined separately for repair from a single failure and from multiple failures).
For a given storage overhead, the LDPC codes in Table III have a 5x repair bandwidth overhead relative to replication, whereas the RS code has a 10x overhead. Thus, compared to the RS code, these LDPC codes require only one half of the repair bandwidth at the same storage overhead. Moreover, the LDPC codes maintain the same repair bandwidth even as the code length is increased. This indicates that LDPC codes can achieve better MTTDLs than the (15, 10) RS code and the (10, 6, 5) Xorbas LRC [12] when longer codes are used. Specifically, the table shows that the (210, 140) LDPC code outperforms the (15, 10) RS code in terms of both repair bandwidth and MTTDL, while the (150, 100) LDPC code halves the repair bandwidth of the RS code at a lower MTTDL. Relative to the (10, 6, 5) Xorbas LRC, the (150, 100) and (210, 140) LDPC codes provide a higher MTTDL. This comes at the expense of a longer code length. In general, the price of increasing the code length is expected to be higher complexity. However, the encoding/decoding complexity of LDPC codes over erasure channels is quite reasonable for the code lengths discussed here. The computational complexity of the LDPC code is discussed below.
Remark 3 (Computational Complexity).
The computational complexity that needs to be considered in the context of distributed storage includes the encoding and decoding complexity. Note that LDPC encoding/decoding is based on simple XOR operations, while the RS code and LRC require expensive Galois field operations. The encoding complexity of both RS codes and LRCs increases quadratically with the code length; in contrast, encoding of the LDPC code requires linear (or near-linear) complexity. Decoding complexity is directly related to the computational burden of reading data or repairing a failed block, which are the most frequent events in operating distributed storage. From this point of view, decoding complexity is also referred to as repairing complexity in distributed storage. The decoding/repairing traffic per node of the LDPC code depends on the check node degree. Since the check node degree is independent of the code length, as presented in Section III, the overall decoding complexity of the LDPC code is only linear in the code length, whereas decoding the LRC and RS code requires complexity quadratic in the code length. Specifically, decoding/repairing the rate-2/3 LDPC codes that we employed requires on average four additions and zero multiplications, regardless of the code length. Decoding the RS code requires nine additions and ten multiplications, a cost that can increase tremendously with increasing code length. The binary LRC [32, 27] is a modification of the Xorbas LRC that reduces computational complexity at the expense of repair bandwidth and MTTDL. For example, considering single-node failures, decoding/repairing a (10, 5, 6) binary LRC (see Table III for its repair bandwidth overhead and MTTDL), which is constructed based on a (10, 6, 5) Xorbas LRC, requires five additions and zero multiplications. For the corresponding Xorbas LRC, four additions and 4.75 multiplications in a binary extension field are needed on average.
We thus observe that the LDPC code is competitive in terms of decoding complexity as well, thanks to its low repair bandwidth and XOR-only operation. Note also that the difference in decoding complexity grows further as the code length increases.
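As a rough illustration of the XOR-only repair described in this remark, the sketch below rebuilds one lost block from the other blocks attached to a check node and counts the block additions. The function name and the toy 4-byte blocks are hypothetical; a check of degree 6 is assumed so that the repair downloads five blocks, in line with the 5x repair bandwidth overhead in Table III.

```python
import os


def xor_repair(surviving_blocks):
    """Rebuild the missing block of a parity check by XORing the survivors.

    Returns the recovered block and the number of block additions (XORs).
    """
    recovered = bytearray(surviving_blocks[0])
    additions = 0
    for block in surviving_blocks[1:]:
        for i, byte in enumerate(block):
            recovered[i] ^= byte
        additions += 1
    return bytes(recovered), additions


# Toy check of degree 6: five random blocks plus their XOR parity.
blocks = [os.urandom(4) for _ in range(5)]
parity, _ = xor_repair(blocks)
codeword = blocks + [parity]

lost = 2  # index of the failed block (arbitrary choice)
survivors = codeword[:lost] + codeword[lost + 1:]
recovered, additions = xor_repair(survivors)

print(recovered == codeword[lost], additions)  # True 4
```

Five downloaded blocks are combined with four additions and no multiplications, independent of how many other checks the code contains.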
Scheme  Rate 3/4  Rate 2/3  Rate 1/2
RS  (20, 15)  (12, 8)  (8, 4)
Piggybacked RS  (20, 15)  (12, 8)  (8, 4)
Azure LRC  (18, 3, 3)  (12, 3, 3)  (6, 3, 3)
LDPC  (240, 180)  (120, 80)  (56, 28)
For rates 3/4, 2/3 and 1/2, various coding schemes are compared in Fig. 8. Here we consider only codes that have higher MTTDLs than the (14, 10) RS code used in the Facebook cluster, whose MTTDL is 1.61E+7. Note that our comparisons with all other codes are made by averaging over systematic and parity nodes. For the three storage overhead factors (code rate inverses), LDPC codes show consistently better repair-bandwidth/storage-space tradeoffs than the other codes. As the storage overhead is forced to decrease, LDPC codes enjoy a bigger performance gap relative to other codes, with the exception of the LRCs, which perform similarly to the LDPC codes.
Scheme  Rate 3/4  Rate 2/3  Rate 1/2
RS  (10, 7)  (8, 5)  (6, 3)
LDPC1  (80, 60)  (60, 40)  (44, 22)
LDPC2  (200, 150)  (150, 100)  (72, 36)
LDPC3  (320, 240)  (240, 160)  (100, 50)
For given storage and repair bandwidth overheads, LDPC codes can achieve a better MTTDL than the LRC and other codes by increasing the code length. Fig. 9 shows such an MTTDL comparison between the RS and LDPC codes, where, for a given storage overhead, the MTTDL advantage of the LDPC codes is evident. Since the MTTDL of the LRC is known to be similar to that of the RS codes [7], LDPC codes will also have definite reliability advantages over the LRCs.
MTTDLs of irregular LDPC codes designed to enhance the system reliability are shown in Figs. 10 and 11. The irregular LDPC codes are designed according to the VN degree distributions given in Table I. The code lengths are set to be identical to those of LDPC3, and the girth is guaranteed to be strictly larger than 4.
Fig. 10 shows the MTTDLs of the designed irregular LDPC codes, whose repair bandwidths are increased by one relative to the regular LDPC codes. The MTTDLs of the RS and regular LDPC codes are also shown for comparison. The parameters of the RS and LDPC codes are given in Table VI. The LDPC codes have the same code parameters as LDPC3 in Table V, and the parameters of the RS codes are set to give the same repair bandwidth as the irregular LDPC codes being compared. As can be seen, the designed irregular LDPC codes outperform the RS and regular LDPC codes in terms of the MTTDL, at the cost of a repair bandwidth increased by one.
Scheme  Rate 3/4  Rate 2/3  Rate 1/2
RS  (11, 8)  (9, 6)  (8, 4)
LDPC  (320, 240)  (240, 160)  (100, 50)
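The RS parameters in Table VI can be cross-checked with a short calculation. The sketch assumes a regular LDPC code with VN degree d_v = 2 and a full-rank parity-check matrix, so that the CN degree is d_v / (1 - R) for code rate R. It takes the single-failure repair bandwidth of the LDPC code as one less than the CN degree, and that of an (n, k) RS code as k blocks. The function names are illustrative.

```python
from fractions import Fraction


def regular_cn_degree(rate, dv=2):
    """CN degree of a regular LDPC code: n*dv edges shared by n*(1-R) checks."""
    dc = Fraction(dv) / (1 - Fraction(rate))
    if dc.denominator != 1:
        raise ValueError("rate and dv do not give an integer CN degree")
    return int(dc)


def ldpc_repair_bw(rate, dv=2):
    """Blocks downloaded to repair one failure: one less than the CN degree."""
    return regular_cn_degree(rate, dv) - 1


# RS codes of Table VI, keyed by code rate; an (n, k) RS code downloads k blocks.
rs_table_vi = {
    Fraction(3, 4): (11, 8),
    Fraction(2, 3): (9, 6),
    Fraction(1, 2): (8, 4),
}

for rate, (n, k) in rs_table_vi.items():
    regular_bw = ldpc_repair_bw(rate)
    # The irregular designs in Fig. 10 use one more block than the regular code.
    print(rate, regular_bw, regular_bw + 1, k)
```

For rate 2/3 this gives a CN degree of 6 and a repair bandwidth of 5 blocks, in line with the 5x overhead listed for the LDPC codes in Table III.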
Fig. 11 shows the behavior of the MTTDL versus the repair bandwidth for LDPC codes. The code parameters are the same as those used in Fig. 10. As mentioned above, regular LDPC codes with VN degree 2 have the minimum repair bandwidth for given code parameters. We see that the MTTDLs improve substantially when the repair bandwidths are allowed to grow from the minimum value. The tradeoff is more dramatic for smaller code rates.
VII Concluding Remarks
VII-A Conclusion
For distributed storage applications, this paper shows that LDPC codes can be a highly viable option in terms of the tradeoffs among storage overhead, repair bandwidth and reliability. Unlike that of the RS code, the repair bandwidth of LDPC codes does not increase with the code length. As a result, LDPC codes can be designed to enjoy both low repair bandwidth and high reliability compared to the RS code and its known variants. It has been specifically shown that, for a given number of edges in the factor graph, CN-regular LDPC codes minimize the repair bandwidth. In addition to CN-regularity, VN-regularity with VN degree 2 minimizes the repair bandwidth for a given code rate, barring all VNs with degree 1. A code design that takes advantage of the improved reliability of LDPC codes has been given, yielding useful tradeoff options between MTTDL and repair bandwidth. An MTTDL analysis for LDPC codes has also been provided that relates the code's stopping set size to its MTTDL.
VII-B Future Work
Interesting future work includes LDPC code design aimed at reducing both repair bandwidth and latency. For the reliability analysis in this paper, we assumed that only one node failure occurs at a time, since this is the most frequent failure pattern. When multiple erasures are considered, it can be shown that the repair bandwidth of LDPC codes is still one less than the CN degree. However, the number of decoding iterations required to repair multiple erasures may differ from one specific code design to the next. Since the decoding latency of LDPC codes is proportional to the number of decoding iterations [33], LDPC code degree distributions need to be designed to minimize the number of decoding iterations. It would be meaningful to study LDPC code structures that maximize the number of single-step recoverable nodes in order to combat latency.
The update complexity (defined as the maximum number of coded symbols updated for one changed symbol in the message [34]) is an important measure, especially in applications to highly dynamic distributed storage in which data updates are frequent. The existence and construction of update-efficient codes is an active area of research (e.g., see [34, 35, 36, 37]). Investigating the relationships between update complexity and the other performance metrics considered in this paper, such as the MTTDL, would be a good direction as well.
A Proof of Lemma 2
Proof: As can be seen in Fig. 6, the Markov model of LDPC codes with parity blocks consists of states regarding the state as a DL state. Let denote the state probability of the state at time and the set of all states. Then we have a constraint , since the process must be in one of the states at any given time . For an arbitrary number of parity blocks, we build the sets of equations describing the Markov model which are followed by the MTTDL equation of LDPC codes with parity blocks.
Assume that state 0 is the initial state of the Markov chain, so that
First we construct a set of differential equations from the Markov model in Fig. 6.

(A.1) 
(A.2) 
(A.3) 
(Data loss state)
(A.4)
Taking the Laplace transforms of eqs. A.1 to A.4 yields the following string of equations, where denotes the Laplace transform of .

(A.5) 
(A.6) 
(A.7) 
(Data loss state)
(A.8)
Solving eqs. A.5 to A.8 for , is presented as follows:
(A.9) 
where for is recursively defined by
(A.10) 
Equation (A.10) is determined by the following set of equations:
From the moment generating property of Laplace transforms, the MTTDL is given by
(A.11) 
Combining (A.9) and (A.11) we arrive at the final result:
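\begin{align}
\text{MTTDL}
&= \frac{\cdots\,\Big[\sum_{j=1}^{m-1}\Big[\Big\{\prod_{i=0}^{j}(n-i)\cdot\lambda^{j+1}\Big\}\cdot\Big\{\prod_{i=0}^{j-1}p_i\cdot(1-p_j)\cdot G'_{j+1}(0)\Big\}\Big]\Big]\cdot G_0(0)}{G_0^{2}(0)} \tag{A.12}\\
&= \cdots \Big/ \Big\{ n\lambda(1-p_0)\cdot\mu^{m} + \sum_{j=1}^{m-1}\Big[\Big\{\prod_{i=0}^{j}(n-i)\cdot\lambda^{j+1}\Big\}\cdot\Big\{\prod_{i=0}^{j-1}p_i\cdot(1-p_j)\cdot\mu^{m-j}\Big\}\Big] \notag\\
&\qquad\quad + \Big\{\prod_{i=0}^{m}(n-i)\cdot\lambda^{m+1}\Big\}\cdot\prod_{i=0}^{m-1}p_i \Big\} \tag{A.13}\\
&\approx \cdots \Big/ \Big\{ \cdots + \Big\{\prod_{i=0}^{m}(n-i)\cdot\lambda^{m+1}\Big\}\cdot\prod_{i=0}^{m-1}p_i \Big\}
\quad \text{as } \lambda/\mu \to 0, \tag{A.14}
\end{align}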
where (A.12) follows from the final-value theorem of Laplace transforms; as , (A.9) implies (A.15) considering the final-value theorem in (A.16).
(A.15) 
(A.16) 
Equation (A.13) is due to (A.15) and from the fact that , , , , and in (A.15). Equation (A.14) is because the numerator of (A.13) approaches as .
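For numerical work, the denominator of (A.13) can be evaluated directly. The sketch below is a hedged reading of that expression: it treats the first and last terms as the j = 0 and j = m cases of the same sum (with the probability of tolerating an erasure beyond m set to 0), and it assumes that the numerator in the limit of small λ/μ is μ^m. The function name, argument names and example values are illustrative assumptions.

```python
from math import prod


def mttdl_small_ratio(n, m, lam, mu, p):
    """Approximate MTTDL of one stripe for lam/mu -> 0.

    n: code length, m: number of parity blocks,
    lam: node failure rate, mu: repair rate,
    p: sequence p[0..m-1], p[j] = probability that one more erasure
       is tolerated given j erasures.
    """
    if len(p) != m:
        raise ValueError("expected m conditional probabilities")
    p_ext = list(p) + [0.0]  # an (m+1)-th erasure is never tolerated
    denominator = 0.0
    for j in range(m + 1):
        paths = prod(n - i for i in range(j + 1)) * lam ** (j + 1)
        survive = prod(p_ext[:j]) * (1.0 - p_ext[j])
        denominator += paths * survive * mu ** (m - j)
    return mu ** m / denominator  # numerator mu**m is an assumption


# MDS-like toy case: every pattern of up to m erasures is recoverable.
print(mttdl_small_ratio(n=3, m=2, lam=1.0, mu=10.0, p=[1.0, 1.0]))
```

With all probabilities equal to 1, only the last term of the denominator survives, which is the familiar form for a code that tolerates any m erasures.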
B Proof of Theorem 2
Proof: Consider the right hand side of (14). Recall that the stopping number is the smallest number of erasures that cannot be corrected and that is the conditional probability that an additional erasure will be tolerated given erasures. This leads to the property that for any , and the first probability that is not equal to 1 as increases from 0 is . Then, the first nonzero term in the right hand side of (14) is
(B.1) 
We now show that this term dominates as . In fact, the ratio of the next term and this term is given by
(B.2) 
which approaches zero for any finite as . Using the same argument the similar ratio of any two successive terms reduces to zero in the limit. This means that the MTTDL of an LDPC code simplifies to
(B.3) 
for . Now, for any reasonably large , we have . The MTTDL in the limit of can now be rewritten as
(B.4) 
which is a monotonically and very rapidly increasing function of , as long as . This completes the proof.
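The dominance argument can be checked numerically. The sketch below lists the terms of the denominator in (A.13) for a hypothetical code whose stopping number is 3, so that the first two conditional probabilities equal 1 and the third is below 1, and prints the ratio of the second nonzero term to the first as λ/μ shrinks. The boundary convention for the last term and all parameter values are assumptions made up for illustration.

```python
from math import prod


def denominator_terms(n, m, lam, mu, p):
    """Terms of the (A.13) denominator, indexed by the number of erasures j."""
    p_ext = list(p) + [0.0]  # an (m+1)-th erasure is never tolerated
    return [
        prod(n - i for i in range(j + 1)) * lam ** (j + 1)
        * prod(p_ext[:j]) * (1.0 - p_ext[j]) * mu ** (m - j)
        for j in range(m + 1)
    ]


n, m = 60, 20
# Hypothetical probabilities: stopping number 3, so p[0] = p[1] = 1.
p = [1.0, 1.0, 0.99] + [0.9] * (m - 3)

for ratio in (1e-3, 1e-5, 1e-7):
    terms = denominator_terms(n, m, lam=ratio, mu=1.0, p=p)
    nonzero = [t for t in terms if t > 0.0]
    # Second/first nonzero term, and share of the first term in the total.
    print(ratio, nonzero[1] / nonzero[0], nonzero[0] / sum(terms))
```

The ratio of successive terms scales linearly with λ/μ (and with the code length), so the first nonzero term accounts for nearly the whole sum once λ/μ is small enough.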
References
 [1] J. Dean and S. Ghemawat, “MapReduce: simplified data processing on large clusters,” Communications of the ACM, vol. 51, no. 1, pp. 107–113, 2008.
 [2] S. Ghemawat, H. Gobioff, and S.-T. Leung, “The Google File System,” in ACM SIGOPS Operating Systems Review, vol. 37, no. 5. ACM, 2003, pp. 29–43.
 [3] K. Shvachko, H. Kuang, S. Radia, and R. Chansler, “The Hadoop Distributed File System,” in 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST). IEEE, 2010, pp. 1–10.
 [4] H. Weatherspoon and J. D. Kubiatowicz, “Erasure coding vs. replication: a quantitative comparison,” in Peer-to-Peer Systems. Springer, 2002, pp. 328–337.
 [5] I. S. Reed and G. Solomon, “Polynomial codes over certain finite fields,” Journal of the Society for Industrial and Applied Mathematics, vol. 8, no. 2, pp. 300–304, 1960.
 [6] K. Rashmi, N. B. Shah, D. Gu, H. Kuang, D. Borthakur, and K. Ramchandran, “A solution to the network challenges of data recovery in erasure-coded distributed storage systems: A study on the Facebook warehouse cluster,” in 5th USENIX Workshop on Hot Topics in Storage and File Systems (HotStorage), 2013.
 [7] C. Huang, H. Simitci, Y. Xu, A. Ogus, B. Calder, P. Gopalan, J. Li, S. Yekhanin et al., “Erasure coding in Windows Azure Storage,” in USENIX Annual Technical Conference. Boston, MA, 2012, pp. 15–26.
 [8] D. Ford, F. Labelle, F. I. Popovici, M. Stokely, V.-A. Truong, L. Barroso, C. Grimes, and S. Quinlan, “Availability in globally distributed storage systems,” in 2010 USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2010, pp. 61–74.
 [9] A. G. Dimakis, P. B. Godfrey, Y. Wu, M. J. Wainwright, and K. Ramchandran, “Network coding for distributed storage systems,” IEEE Transactions on Information Theory, vol. 56, no. 9, pp. 4539–4551, Sept 2010.
 [10] K. V. Rashmi, N. B. Shah, and P. V. Kumar, “Optimal exact-regenerating codes for distributed storage at the MSR and MBR points via a product-matrix construction,” IEEE Transactions on Information Theory, vol. 57, no. 8, pp. 5227–5239, Aug 2011.
 [11] I. Tamo, Z. Wang, and J. Bruck, “Zigzag codes: MDS array codes with optimal rebuilding,” IEEE Transactions on Information Theory, vol. 59, no. 3, pp. 1597–1616, Mar 2013.
 [12] M. Sathiamoorthy, M. Asteris, D. Papailiopoulos, A. G. Dimakis, R. Vadali, S. Chen, and D. Borthakur, “XORing elephants: novel erasure codes for big data,” in Proceedings of the VLDB Endowment, vol. 6, no. 5. VLDB Endowment, 2013, pp. 325–336.
 [13] D. S. Papailiopoulos and A. G. Dimakis, “Locally repairable codes,” IEEE Transactions on Information Theory, vol. 60, no. 10, pp. 5843–5855, Oct 2014.
 [14] R. Gallager, “Low-density parity-check codes,” IRE Transactions on Information Theory, vol. 8, no. 1, pp. 21–28, Jan 1962.
 [15] J. S. Plank and M. G. Thomason, “A practical analysis of low-density parity-check erasure codes for wide-area storage applications,” in 2004 International Conference on Dependable Systems and Networks (DSN), June 2004, pp. 115–124.
 [16] J. S. Plank, A. L. Buchsbaum, R. L. Collins, and M. G. Thomason, “Small parity-check erasure codes - exploration and observations,” in 2005 International Conference on Dependable Systems and Networks (DSN), June 2005, pp. 326–335.
 [17] Y. Wei, Y. W. Foo, K. C. Lim, and F. Chen, “The auto-configurable LDPC codes for distributed storage,” in 2014 IEEE 17th International Conference on Computational Science and Engineering, Dec 2014, pp. 1332–1338.
 [18] Y. Wei, F. Chen, and K. C. Lim, “Large LDPC codes for big data storage,” in Proceedings of the ASE BigData & SocialInformatics 2015. ACM, 2015, p. 1.
 [19] Y. Wei and F. Chen, “expanCodes: Tailored LDPC codes for big data storage,” in 2016 IEEE 14th Intl Conf on Dependable, Autonomic and Secure Computing, 14th Intl Conf on Pervasive Intelligence and Computing, 2nd Intl Conf on Big Data Intelligence and Computing and Cyber Science and Technology Congress (DASC/PiCom/DataCom/CyberSciTech), Aug 2016, pp. 620–625.
 [20] D. Lee, H. Park, and J. Moon, “Reducing repair-bandwidth using codes based on factor graphs,” in 2016 IEEE International Conference on Communications (ICC), May 2016, pp. 1–6.
 [21] M. G. Luby, M. Mitzenmacher, M. A. Shokrollahi, D. A. Spielman, and V. Stemann, “Practical loss-resilient codes,” in Proceedings of the 29th Annual ACM Symposium on Theory of Computing. ACM, 1997, pp. 150–159.
 [22] T. J. Richardson, M. A. Shokrollahi, and R. L. Urbanke, “Design of capacity-approaching irregular low-density parity-check codes,” IEEE Transactions on Information Theory, vol. 47, no. 2, pp. 619–637, Feb 2001.
 [23] D. Divsalar, S. Dolinar, C. R. Jones, and K. Andrews, “Capacity-approaching protograph codes,” IEEE Journal on Selected Areas in Communications, vol. 27, no. 6, pp. 876–888, Aug 2009.
 [24] T. V. Nguyen, A. Nosratinia, and D. Divsalar, “The design of rate-compatible protograph LDPC codes,” IEEE Transactions on Communications, vol. 60, no. 10, pp. 2841–2850, Oct 2012.
 [25] J. Garcia-Frias and W. Zhong, “Approaching Shannon performance by iterative decoding of linear codes with low-density generator matrix,” IEEE Communications Letters, vol. 7, no. 6, pp. 266–268, June 2003.
 [26] S. Ramabhadran and J. Pasquale, “Analysis of long-running replicated systems,” in Proceedings IEEE INFOCOM 2006. 25th IEEE International Conference on Computer Communications, Apr 2006, pp. 1–9.
 [27] M. Shahabinejad, M. Khabbazian, and M. Ardakani, “A class of binary locally repairable codes,” IEEE Transactions on Communications, vol. 64, no. 8, pp. 3182–3193, Aug 2016.
 [28] K. S. Trivedi, Probability & statistics with reliability, queuing and computer science applications. John Wiley & Sons, 2008.
 [29] J. L. Hafner and K. Rao, “Notes on reliability models for non-MDS erasure codes,” IBM Res. rep. RJ10391, 2006.
 [30] A. Orlitsky, R. Urbanke, K. Viswanathan, and J. Zhang, “Stopping sets and the girth of Tanner graphs,” in Proceedings IEEE International Symposium on Information Theory (ISIT), 2002, p. 2.
 [31] X.-Y. Hu, E. Eleftheriou, and D. M. Arnold, “Progressive edge-growth Tanner graphs,” in 2001 IEEE Global Telecommunications Conference (GLOBECOM), vol. 2, 2001, pp. 995–1001.
 [32] M. Shahabinejad, M. Khabbazian, and M. Ardakani, “An efficient binary locally repairable code for Hadoop Distributed File System,” IEEE Communications Letters, vol. 18, no. 8, pp. 1287–1290, Aug 2014.
 [33] B. Smith, M. Ardakani, W. Yu, and F. R. Kschischang, “Design of irregular LDPC codes with optimized performance-complexity tradeoff,” IEEE Transactions on Communications, vol. 58, no. 2, pp. 489–499, Feb 2010.
 [34] N. P. Anthapadmanabhan, E. Soljanin, and S. Vishwanath, “Update-efficient codes for erasure correction,” in 2010 48th Annual Allerton Conference on Communication, Control, and Computing (Allerton), Sept 2010, pp. 376–382.
 [35] A. Mazumdar, V. Chandar, and G. W. Wornell, “Update-efficiency and local repairability limits for capacity approaching codes,” IEEE Journal on Selected Areas in Communications, vol. 32, no. 5, pp. 976–988, May 2014.
 [36] A. Jule and I. Andriyanova, “Some results on update complexity of a linear code ensemble,” in 2011 International Symposium on Network Coding (NetCod), July 2011, pp. 1–5.
 [37] K. Kralevska, D. Gligoroski, and H. Øverby, “Balanced locally repairable codes,” in 2016 9th International Symposium on Turbo Codes and Iterative Information Processing (ISTC), Sept 2016, pp. 280–284.