Update-Efficient Error-Correcting Product-Matrix Codes
Abstract
Regenerating codes provide an efficient way to recover data at failed nodes in distributed storage systems. It has been shown that regenerating codes can be designed to minimize the per-node storage (called MSR) or minimize the communication overhead for regeneration (called MBR). In this work, we propose new encoding schemes for error-correcting MSR and MBR codes that generalize our earlier work on error-correcting regenerating codes. We show that by choosing a suitable diagonal matrix, any generator matrix of the Reed-Solomon (RS) code can be integrated into the encoding matrix. Hence, MSR codes with the least update complexity can be found. By using the coefficients of generator polynomials of and RS codes, we present a least-update-complexity encoding scheme for MBR codes. A decoding scheme is proposed that utilizes the RS code to perform data reconstruction for MSR codes. The proposed decoding scheme has better error correction capability and incurs the least number of node accesses when errors are present. A new decoding scheme is also proposed for MBR codes that can correct more error patterns.
Distributed storage, Regenerating codes, Reed-Solomon codes, Decoding, Product-Matrix codes
I Introduction
Cloud storage is gaining popularity as an alternative to enterprise storage, where data is stored in virtualized pools of storage typically hosted by third-party data centers. Reliability is a key challenge in the design of distributed storage systems that provide cloud storage. Both crash-stop and Byzantine failures (as a result of software bugs and malicious attacks) are likely to be present during data retrieval. A crash-stop failure makes a storage node unresponsive to access requests. In contrast, a node with a Byzantine failure responds to access requests with erroneous data. To achieve better reliability, one common approach is to replicate data files on multiple storage nodes in a network. There are two kinds of approaches: duplication (Google) [1] and erasure coding [2, 3]. Duplication makes an exact copy of each data item and needs a large amount of storage space. The advantage of this approach is that only one storage node needs to be accessed to obtain the original data. In contrast, in the second approach, erasure coding is employed to encode the original data, and the encoded data is then distributed to storage nodes. Typically, multiple storage nodes need to be accessed to recover the original data. One popular class of erasure codes is the maximum-distance-separable (MDS) codes. With MDS codes such as Reed-Solomon (RS) codes, $k$ data items are encoded and then distributed to and stored at $n$ storage nodes. A user or a data collector can retrieve the original data by accessing any $k$ of the storage nodes, a process referred to as data reconstruction.
Any storage node can fail due to hardware or software damage. Data stored at the failed nodes needs to be recovered (regenerated) so that the system remains able to perform data reconstruction. The process of recovering the stored (encoded) data at a storage node is called data regeneration. A simple way to regenerate data is to first reconstruct the original data and then recover the data stored at the failed node. However, it is not efficient to retrieve all the symbols of the original file to recover the much smaller fraction of data stored at the failed node. Regenerating codes, first introduced in the pioneering work of Dimakis et al. [4, 5], allow efficient data regeneration. To facilitate data regeneration, each storage node stores $\alpha$ symbols and a total of $d$ surviving nodes are accessed to retrieve $\beta \le \alpha$ symbols from each node. A tradeoff exists between the storage overhead and the regeneration (repair) bandwidth needed for data regeneration. Minimum Storage Regenerating (MSR) codes first minimize the amount of data stored per node, and then the repair bandwidth, while Minimum Bandwidth Regenerating (MBR) codes carry out the minimization in the reverse order. There have been many works that focus on the design of regenerating codes [6, 7, 8, 9, 10, 11, 12, 13]. There are two categories of approaches to regenerating data at a failed node. If the replacement data is exactly the same as that previously stored at the failed node, we call it exact regeneration. Otherwise, if the replacement data only guarantees the correctness of the data reconstruction and regeneration properties, it is called functional regeneration. In practice, exact regeneration is more desirable since there is no need to inform each node in the network regarding the replacement. Furthermore, it is easy to keep the codes systematic via exact regeneration, where partial data can be retrieved without accessing all nodes.
It has been proved that no linear code performing exact regeneration can achieve the MSR point for any when is normalized to 1 [14]. However, when approaches infinity, this is achievable for any [15]. In this work, we only consider exact regeneration.
There are several existing code constructions of regenerating codes for exact regeneration [9, 15, 16, 13]. In [9], Wu and Dimakis apply ideas from interference alignment [17, 18] to construct the codes for and . The idea was extended to the more general case of in [16]. In [13], Rashmi et al. used the product-matrix construction to design optimal MSR codes and MBR codes for exact regeneration. These constructions of exact-regenerating codes are the first for which the code length can be chosen independently of other parameters. However, only crash-stop failures of storage nodes are considered in [13].
The problem of the security of regenerating codes was considered in [11] and in [12, 19, 20]. In [11], the security problem against eavesdropping and adversarial attack during the data reconstruction and regeneration processes was considered. Upper bounds on the maximum amount of information that can be stored safely were derived. Pawar et al. also gave an explicit code construction for in the bandwidth-limited regime. The problem of Byzantine fault tolerance for regenerating codes was considered in [12]. Oggier and Datta investigated the resilience of regenerating codes when supporting multi-repairs. By collaboration among newcomers, they derived upper bounds on the resilience capability of regenerating codes. Our work deals with Byzantine failures for product-matrix regenerating codes and it does not need to have multiple newcomers to recover the failures.
Based on the same code construction as given in [13], Han et al. extended Rashmi's work to provide decoding algorithms that can handle Byzantine failures [19]. In [19], decoding algorithms for both MSR and MBR error-correcting product-matrix codes were provided. In particular, the decoding of an MBR code given in [19] can decode errors up to the error correction capability of since is even. In [20], the code capability and resilience were discussed for error-correcting regenerating codes. Rashmi et al. proved that it is possible to decode an MBR code up to errors. The authors also claimed that any MSR code can be decoded up to errors. However, no explicit decoding (data reconstruction) procedure was provided, so these codes cannot be used in practice as they stand. Thus, one contribution of this paper is to present a decoding algorithm for MSR codes.
In addition to bandwidth efficiency and error correction capability, another desirable feature for regenerating codes is update complexity [21], defined as the number of nonzero elements in the row of the encoding matrix with the maximum Hamming weight. (Footnote 1: The update complexity adopted from [21] is not equivalent to the maximum number of encoded symbols that must be updated while a single data symbol is modified.) The smaller the number, the lower the update complexity. Low update complexity is desirable in scenarios where updates are frequent.
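The definition above can be illustrated with a minimal sketch. The function name `update_complexity` and the two small matrices are hypothetical and serve only to show the row-weight computation; they are not encoding matrices from this paper.

```python
def update_complexity(encoding_matrix):
    """Largest Hamming weight (number of nonzero entries) over all rows."""
    return max(sum(1 for entry in row if entry != 0) for row in encoding_matrix)


# Hypothetical 2 x 4 encoding matrices, for illustration only.
dense = [[1, 1, 1, 1],
         [1, 2, 4, 8]]
sparse = [[1, 0, 3, 5],
          [0, 1, 2, 6]]

print(update_complexity(dense))   # 4
print(update_complexity(sparse))  # 3
```

A matrix with an identity block, like `sparse` here, has lighter rows than a fully dense one, which is the effect exploited later by the systematic constructions.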
One drawback of the decoding algorithms for MSR codes given in [19] is that, when one or more storage nodes have erroneous data, the decoder needs to access extra data from many storage nodes (at least more nodes) for data reconstruction. Furthermore, when one symbol in the original data is updated, all storage nodes need to update their respective data. Thus, the MSR and MBR codes in [19] have the maximum possible update complexity. Both of these deficiencies are addressed in this paper. First, we propose a general encoding scheme for MSR codes. As a special case, least-update-complexity codes are designed. We also design a least-update-complexity encoding matrix for the MBR codes by using the coefficients of generator polynomials of the and RS codes. The proposed codes not only have the least update complexity but also require the smallest number of updated symbols when a single data symbol is modified. This is in contrast to the existing product-matrix codes. Second, a new decoding algorithm is presented for MSR codes. It not only exhibits better error correction capability but also incurs low communication overhead when errors occur in the accessed data. Third, we devise a decoding scheme for the MBR codes that can correct more error patterns than the one in [19].
The main contributions of this paper beyond the existing literature are as follows:

The general encoding schemes of product-matrix MSR and MBR codes are derived. The encoder based on RS codes is no longer limited to the Vandermonde matrix proposed in [13] and [19]. Any generator matrix of the corresponding RS codes can be employed for the MSR and MBR codes. As a result, this highlights the connection between product-matrix MSR and MBR codes and well-known RS codes in coding theory.

The MSR and MBR codes with systematic generator matrices of the RS codes are provided. These codes have the least update complexity compared to existing codes such as the systematic MSR and MBR codes proposed by Rashmi et al. [13]. This approach also makes product-matrix MSR and MBR codes more practical due to higher update efficiency.

The detailed decoding algorithm for data reconstruction of MSR codes is provided. It is nontrivial to extend the decoding procedure given in [13] to handle errors. The difficulty arises from the fact that an error in will propagate into many places in and . Due to the operations involved in the decoding process, many rows cannot be decoded successfully or correctly. No decoding algorithm was provided in [20] that can decode up to errors even though the error-correction capability was analyzed in [20].

The decoding algorithm of MBR codes that can decode beyond the error-correction capability for some error patterns is also presented. This decoding algorithm can correct errors up to
even though not all error patterns up to such number of errors can be corrected.
The rest of this paper is organized as follows. Section II gives an overview of error-correcting regenerating codes. Section III presents the least-update-complexity encoding and decoding schemes for error-correcting MSR regenerating codes. Section IV demonstrates the least-update-complexity encoding of MBR codes and the corresponding decoding scheme. Section V details evaluation results for the proposed decoding schemes. Section VI concludes the paper with a list of future work. Since only error-correcting regenerating codes are considered in this work, unless stated otherwise, we refer to error-correcting MSR and MBR codes as MSR and MBR codes in the rest of the paper.
II Error-Correcting Product-Matrix Regenerating Codes
In this section, we give a brief overview of regenerating codes, and the MSR and MBR product-matrix code constructions in [13].
II-A Regenerating Codes
Let $\alpha$ be the number of symbols stored at each storage node and $\beta$ the number of symbols downloaded from each storage node during regeneration. To repair the stored data at the failed node, a helper node accesses $d$ surviving nodes. The design of regenerating codes ensures that the total regenerating bandwidth is much less than the size of the original data, $B$. A regenerating code must be capable of reconstructing the original data symbols and regenerating coded data at a failed node. An $[n, k, d]$ regenerating code requires at least $k$ nodes to ensure successful data reconstruction, and $d$ surviving nodes to perform regeneration [13], where $n$ is the number of storage nodes and $k \le d \le n-1$.
The cut-set bound given in [6, 5] provides a constraint on the repair bandwidth. By this bound, any regenerating code must satisfy the following inequality:
(1) $B \le \sum_{i=0}^{k-1} \min\{\alpha, (d-i)\beta\}$
From (1), $\alpha$ or $\beta$ can be minimized, achieving either the minimum storage requirement or the minimum repair bandwidth requirement, but not both. The two extreme points in (1) are referred to as the minimum storage regeneration (MSR) and minimum bandwidth regeneration (MBR) points, respectively. The values of $\alpha$ and $B$ for the MSR point can be obtained by first minimizing $\alpha$ and then minimizing $\beta$:
(2) $\alpha = d-k+1, \quad B = k(d-k+1) = k\alpha$
where we normalize $\beta$ and set it equal to $1$. (Footnote 2: It has been proved that when designing MSR codes for . it suffices to consider those with [13].) Reversing the order of minimization, we have for MBR
(3) $\alpha = d, \quad B = kd - \frac{k(k-1)}{2}$
while $\beta = 1$.
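As a numerical check of the two extreme points, the following sketch evaluates the standard cut-set expression at the MSR and MBR parameter choices with the per-node download normalized to one symbol. The function names and the sample values `k = 3`, `d = 4` are assumptions made for illustration, and the symbols `alpha`, `beta`, `B`, `k`, `d` follow the usual regenerating-code notation.

```python
def cutset_capacity(k, d, alpha, beta):
    """Right-hand side of the cut-set bound: sum_{i=0}^{k-1} min(alpha, (d-i)*beta)."""
    return sum(min(alpha, (d - i) * beta) for i in range(k))


def msr_point(k, d, beta=1):
    """(alpha, B) at the minimum-storage point for a given beta."""
    alpha = (d - k + 1) * beta
    return alpha, k * alpha


def mbr_point(k, d, beta=1):
    """(alpha, B) at the minimum-bandwidth point for a given beta."""
    alpha = d * beta
    file_size = (k * d - k * (k - 1) // 2) * beta  # k*(k-1) is always even
    return alpha, file_size


k, d = 3, 4  # hypothetical parameters
for name, (alpha, B) in (("MSR", msr_point(k, d)), ("MBR", mbr_point(k, d))):
    # Both points meet the bound with equality.
    print(name, alpha, B, cutset_capacity(k, d, alpha, 1) == B)
```

For these sample parameters the MSR point stores 2 symbols per node for a 6-symbol file, while the MBR point stores 4 symbols per node for a 9-symbol file.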
II-B Product-Matrix MSR Codes With Error Correction Capability
Next, we describe the MSR code construction originally given in [13] and adapted later in [19]. Here, we assume . (Footnote 3: An elegant method to extend the construction of based on the construction of has been given in [13]. Since the same technique can be applied to the code constructions proposed in this work, it is omitted here.) The information sequence can be arranged into an information vector with size such that and are symmetric matrices with dimension . An RS code is adopted to construct the MSR code [13]. Let be a generator of . In the encoding of the MSR code, we have
(4) 
where
and is the codeword vector with dimension .
It is possible to rewrite the generator matrix of the RS code as
(5)  
(6) 
where contains the first rows in , and is a diagonal matrix with as diagonal elements, namely,
(7) 
Note that if the RS code is over for , then it can be shown that are all distinct. According to the encoding procedure, the symbols stored in storage node are given by,
where is the th column in .
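The encoding step can be sketched as follows, under loudly stated assumptions: a prime field GF(17) replaces the field used in the paper, the generator matrix is a Vandermonde matrix on powers of a field generator, the information matrix is the concatenation of two symmetric blocks, and the parameters are small hypothetical values. All names (`P`, `A`, `encode_msr`, `node_column_matches`) are illustrative. The check confirms that, when the lower half of the generator matrix equals the upper half times a diagonal matrix, each node's stored column splits into a contribution from each symmetric block.

```python
# Sketch only: GF(17) stands in for the paper's field; all names are hypothetical.
P = 17  # prime field size (assumption)
A = 3   # a generator of the multiplicative group of GF(17)


def mat_mul(X, Y, p=P):
    """Matrix product of X and Y with entries reduced mod p."""
    return [[sum(x * y for x, y in zip(row, col)) % p for col in zip(*Y)]
            for row in X]


def encode_msr(Z1, Z2, n, p=P, a=A):
    """Return (C, G, points): codeword matrix, Vandermonde generator, points."""
    alpha = len(Z1)
    d = 2 * alpha
    points = [pow(a, j, p) for j in range(n)]
    G = [[pow(x, i, p) for x in points] for i in range(d)]  # row i holds x^i
    U = [row1 + row2 for row1, row2 in zip(Z1, Z2)]         # U = [Z1 Z2]
    return mat_mul(U, G, p), G, points


def node_column_matches(Z1, Z2, C, G, points, j, p=P):
    """Check column j of C equals Z1*g + delta_j * Z2*g, with delta_j = x_j^alpha."""
    alpha = len(Z1)
    g_bar = [[G[i][j]] for i in range(alpha)]  # top alpha entries of column j
    delta = pow(points[j], alpha, p)
    t1, t2 = mat_mul(Z1, g_bar, p), mat_mul(Z2, g_bar, p)
    expected = [(t1[i][0] + delta * t2[i][0]) % p for i in range(alpha)]
    return [C[i][j] for i in range(alpha)] == expected


Z1 = [[1, 2], [2, 3]]  # symmetric, hypothetical data
Z2 = [[4, 5], [5, 6]]  # symmetric, hypothetical data
C, G, points = encode_msr(Z1, Z2, n=6)
all_match = all(node_column_matches(Z1, Z2, C, G, points, j) for j in range(6))
print(all_match)  # True
```

With these sample values the diagonal entries `x_j^alpha` are 1, 9, 13, 15, 16, 8, which are distinct as the construction requires.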
II-C Product-Matrix MBR Codes With Error Correction Capability
In this section, we describe the MBR code constructed in [13] and reformatted later in [19]. Note that at the MBR point, . Let the information sequence be arranged into an information vector with size , where
(8) 
is a symmetric matrix, a matrix, is the zero matrix. Note that both and are symmetric. It is clear that has a dimension (or ). An RS code is chosen to encode each row of . The generator matrix of the RS code is given as
(9) 
where is a generator of . Let be the codeword vector with dimension . It can be obtained as
From (9), can be divided into two submatrices as
(10) 
where
(11) 
and
It can be shown that is a generator matrix of the RS code and it will be used in the decoding for data reconstruction.
III Encoding and Decoding Schemes for Product-Matrix MSR Codes
In this section, we propose a new encoding scheme for error-correcting MSR codes. With a feasible matrix , in (6) can be any generator matrix of the RS code. The code construction in [13, 19] is thus a special case of our proposed scheme. We can also select a suitable generator matrix such that the update complexity of the resulting code is minimized. A decoding scheme is then proposed that uses the subcode of the RS code, the RS code generated by , to perform the data reconstruction.
III-A Encoding Schemes for Error-Correcting MSR Codes
RS codes are known to have very fast decoding algorithms and exhibit good error correction capability. From (6) in Section II-B, a generator matrix for product-matrix MSR codes needs to satisfy:

where contains the first rows in and is a diagonal matrix with distinct elements in the diagonal.

is a generator matrix of the RS code and is a generator matrix of the RS code.
Next, we present a sufficient condition for and such that is a generator matrix of an RS code. We first introduce some notation. Let and the RS code generated by be . Similarly, let and the RS code generated by be . Clearly, are roots of , and are roots of . Thus, and are equivalent RS codes.
Theorem 1
Let be a generator matrix of the RS code . Let the diagonal elements of be such that for all , and is a codeword in but not . In other words, . Then, is a generator matrix of the RS code .
We need to prove that each row of is a codeword of and all rows in are linearly independent. Let be the dual code of . It is well-known that is an RS code [22, 23]. Similarly, let be the dual code of and its generator matrix be . Note that is a parity-check matrix of . Let and . Then, the roots of and are and , respectively. Since an RS code is also a cyclic code, the generator polynomials of and are and , respectively, where and . Clearly, the roots of are that are equivalent to . Similarly, the roots of are . Since has roots of , we can choose
(12) 
as the generator matrix of . To prove that each row of is a codeword of the RS code generated by , it is sufficient to show that . From the symmetry of , we have
Thus, we only need to prove that each row of is a codeword in . Let the diagonal elements of be . The th row of is thus in the polynomial representation. Let be a codeword in . Then, we have
(13) 
Substituting , for , into , it becomes
(14) 
Let . Since and , . By (13), for and . Hence, each row of is a codeword in .
The s need to make all rows in linearly independent. Since all rows in or those in are linearly independent, it is sufficient to prove that , where is the code generated by . Let be a codeword in . for some . It can be shown that, by the Mattson-Solomon polynomial [24], we can choose
(15) 
as the generator matrix of . Then
for some . Evaluating at and putting them into a matrix form, we have
(16) 
where
and is an dimensional vector. If , then ; otherwise, . Taking the transpose of both sides of (16), we obtain
(17)  
Since ,
(18) 
Substituting (18) into (17) and taking out rows with all zeros, we have
(19)  
If , i.e., is a root of , then due to the fact that makes in (19). Thus, we need to exclude the codewords in that have as a root. These codewords turn out to be in . If , then it is clear that the only making in (19) is the all-zero vector. Hence, any does not make zero except .
Corollary 1
Under the condition that the RS code is over for and , the diagonal elements of , , can be
where .
Note that one valid generator matrix of is
(20) 
can be represented as , where . Now choose to be the all-zero codeword. Under the condition that the RS code is over for and , is equivalent to . If is a generator of , then all elements of are distinct. It is well-known that is a generator if .
It is clear that by setting in Corollary 1, we obtain the generator matrix given in (6) first proposed in [13, 19] as a special case. (Footnote 4: Even though the roots in given in (6) are different from those for the proposed generator matrix, they generate equivalent RS codes.)
One advantage of the proposed scheme is that it can now operate on a smaller finite field than that of the scheme in [13, 19]. Another advantage is that one can choose (and accordingly) freely as long as is the generator matrix of an RS code. In particular, as discussed in Section I, to minimize the update complexity, it is desirable to choose a generator matrix that has the least row-wise maximum Hamming weight. Next, we present a least-update-complexity generator matrix that satisfies (6).
Corollary 2
Suppose is chosen according to Corollary 1. Let be the generator matrix associated with a systematic RS code. That is,
(21) 
where
and
Then, is a least-update-complexity generator matrix.
The result holds since each row of is a nonzero codeword with the minimum Hamming weight .
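The minimum-weight property behind this result can be checked numerically. The sketch below is an illustration under assumptions: it works over the prime field GF(17) rather than the field used in the paper, starts from a Vandermonde generator matrix of an RS code of length `n` and dimension `d`, and row-reduces it to a systematic form with the identity block placed on the left (the placement in (21) may differ). The function names are hypothetical.

```python
def systematic_rs_generator(n, d, p, a):
    """Row-reduce a d x n Vandermonde matrix over GF(p) to the form [I | R]."""
    points = [pow(a, j, p) for j in range(n)]
    G = [[pow(x, i, p) for x in points] for i in range(d)]
    for col in range(d):
        # Find a row with a nonzero entry in this column and move it up.
        pivot = next(r for r in range(col, d) if G[r][col] % p != 0)
        G[col], G[pivot] = G[pivot], G[col]
        inv = pow(G[col][col], p - 2, p)  # inverse via Fermat's little theorem
        G[col] = [(v * inv) % p for v in G[col]]
        for r in range(d):
            if r != col and G[r][col]:
                factor = G[r][col]
                G[r] = [(vr - factor * vc) % p for vr, vc in zip(G[r], G[col])]
    return G


def row_weights(matrix):
    """Hamming weight of each row."""
    return [sum(1 for v in row if v) for row in matrix]


n, d, p, a = 6, 4, 17, 3  # hypothetical parameters; 3 generates GF(17)*
G_sys = systematic_rs_generator(n, d, p, a)
print(row_weights(G_sys))  # [3, 3, 3, 3], i.e. n - d + 1 for every row
```

Each row of the reduced matrix has `d - 1` zeros inside the identity block, so its weight is at most `n - d + 1`, and the minimum distance of the MDS code forces equality. The unreduced Vandermonde rows, by contrast, all have weight `n`.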
The update complexity adopted from [21] is not equivalent to the maximum number of encoded symbols that must be updated when a single data symbol is modified. If the modified data symbol is located in the diagonal of or , encoded symbols need to be updated; otherwise, there are two corresponding encoding symbols in modified such that encoded symbols need to be updated.
III-B Decoding Scheme for MSR Codes
Unlike the decoding scheme in [19] that uses RS code, we propose to use the subcode of the RS code, i.e., the RS code generated by , to perform data reconstruction. The advantage of using the RS code is twofold. First, its error correction capability is higher. Specifically, it can tolerate instead of errors. Second, it only requires the access of two additional storage nodes (as opposed to nodes) for each extra error.
Without loss of generality, we assume that the data collector retrieves encoded symbols from () storage nodes, . We also assume that there are storage nodes whose received symbols are erroneous. The stored information on the storage nodes is collected as the columns in . The columns of corresponding to storage nodes are denoted as the columns of . First, we discuss data reconstruction when . The decoding procedure is similar to that in [13].
No Error
In this case, and there is no error in . Then,
(22)  
Multiplying to both sides of (22), we have [13],
(23)  
Since and are symmetric, and are symmetric as well. The th element of , and , is
(24) 
and the th element is given by
(25) 
Since for all , , and , combining (24) and (25), the values of and can be obtained. Note that we only obtain values for each row of and since no elements in the diagonal of or are obtained.
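The elimination step can be sketched for a single index pair. This is a minimal illustration that assumes the form used in the product-matrix construction of [13]: the $(i,j)$ entry of the product equals $p_{ij} + q_{ij}\delta_j$ and the $(j,i)$ entry equals $p_{ij} + q_{ij}\delta_i$, where $p_{ij}$ and $q_{ij}$ are entries of the two symmetric matrices and $\delta_i \ne \delta_j$ are diagonal entries. The symbol names, the function `solve_offdiagonal`, and the prime field GF(17) are assumptions made for the example.

```python
def solve_offdiagonal(a_ij, a_ji, delta_i, delta_j, p):
    """Solve a_ij = p_ij + q_ij*delta_j and a_ji = p_ij + q_ij*delta_i over GF(p)."""
    diff = (delta_j - delta_i) % p
    if diff == 0:
        raise ValueError("diagonal entries must be distinct")
    q_ij = (a_ij - a_ji) * pow(diff, p - 2, p) % p  # divide by (delta_j - delta_i)
    p_ij = (a_ij - q_ij * delta_j) % p
    return p_ij, q_ij


# Hypothetical values over GF(17): p_ij = 5, q_ij = 7, delta_i = 9, delta_j = 13.
prime = 17
a_ij = (5 + 7 * 13) % prime
a_ji = (5 + 7 * 9) % prime
print(solve_offdiagonal(a_ij, a_ji, 9, 13, prime))  # (5, 7)
```

The solve fails only when the two diagonal entries coincide, which is why the construction requires them to be distinct, and it gives no information about the diagonal entries, where $i = j$.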
To decode , recall that . can be treated as a portion of the codeword vector, . By the construction of , it is easy to see that is a generator matrix of the RS code. Hence, each row in the matrix is a codeword. Since we know components in each row of , it is possible to decode by the error-and-erasure decoder of the RS code. (Footnote 5: The error-and-erasure decoder of an RS code can successfully decode a received vector if , where is the number of erasure (no symbol) positions, is the number of errors in the received portion of the received vector, and is the minimum Hamming distance of the RS code.)
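The decodability condition for an error-and-erasure decoder can be stated as a one-line check. The sketch below uses the standard bounded-distance condition, namely that the number of erasures plus twice the number of errors must be strictly less than the minimum distance; the parameter names and the sample code parameters are assumptions for illustration.

```python
def can_decode(num_erasures, num_errors, d_min):
    """Standard error-and-erasure condition: erasures + 2 * errors < d_min."""
    return num_erasures + 2 * num_errors < d_min


# Hypothetical RS code of length 10 and dimension 4, so d_min = 10 - 4 + 1 = 7.
d_min = 10 - 4 + 1
print(can_decode(num_erasures=4, num_errors=1, d_min=d_min))  # True  (4 + 2 = 6 < 7)
print(can_decode(num_erasures=4, num_errors=2, d_min=d_min))  # False (4 + 4 = 8 >= 7)
```

Each erasure costs one unit of the distance budget and each error costs two, which is why unknown diagonal positions treated as erasures reduce the number of correctable errors.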
Since one cannot locate any erroneous position from the decoded rows of , the decoded codewords are accepted as . By collecting the last columns of as to find its inverse (here it is an identity matrix), one can recover from . Since any rows in are independent and thus invertible, we can pick any of them to recover . can be obtained similarly by .
It is not trivial to extend the above decoding procedure to the case of errors. The difficulty arises from the fact that any error in will propagate into many places in and , due to the operations involved in (23), (24), and (25), such that many of their rows cannot be decoded successfully or correctly (see Lemma 1). In the following, we present how to locate erroneous columns in based on the RS decoder.
Single Error: In this case, and only one column of is erroneous. Without loss of generality, we assume the erroneous column is the first column in . That is, the symbols received from storage node contain errors. Let be the error matrix, where and is the all-zero matrix with dimension . Then
(26)  
Multiplying to both sides of (26), we have
(27)  
It is easy to see that the errors only affect the first column of since the nonzero elements are all in the first column of