Cost Analysis of Redundancy Schemes for Distributed Storage Systems

Lluis Pamies-Juarez (Universitat Rovira i Virgili, Tarragona, Spain)
Ernst Biersack (Eurecom, Sophia-Antipolis, France)

April 15, 2011
Abstract

Distributed storage infrastructures require the use of data redundancy to achieve high data reliability. Unfortunately, the use of redundancy introduces storage and communication overheads, which can either reduce the overall storage capacity of the system or increase its costs. To mitigate the storage and communication overhead, different redundancy schemes have been proposed. However, due to the great variety of underlying storage infrastructures and the different application needs, optimizing these redundancy schemes for each storage infrastructure is cumbersome. The lack of rules to determine the optimal level of redundancy for each storage configuration leads developers in industry to often choose simpler redundancy schemes, which are usually not the optimal ones. In this paper we analyze the cost of different redundancy schemes and derive a set of rules to determine which redundancy scheme minimizes the storage and the communication costs for a given system configuration. Additionally, we use simulation to show that theoretically-optimal schemes may not be viable in a realistic setting where nodes can go off-line and repairs may be delayed. In these cases, we identify the trade-offs between the storage and communication overheads of the redundancy scheme and its data reliability.

Keywords: erasure correction codes, data reliability, data redundancy, distributed storage systems, redundancy costs, regenerating codes, storage systems design.

Categories: E.4 [Coding and Information Theory]: Error Control Codes; E.5 [Files]: Backup/Recovery; C.2.4 [Computer-Communication Networks]: Distributed Systems; H.3.2 [Information Storage and Retrieval]: Information Storage

General Terms: Reliability, Performance

Authors' addresses: L. Pamies-Juarez (lluis.pamies@urv.cat) and E. Biersack (ernst.biersack@eurecom.fr).

1 Introduction

Distributed storage systems are widely used today for reasons of scalability and performance. There are distributed file-systems such as Google FS [Ghemawat et al. (2003)], HDFS [Borthakur (2007)], GPFS [Schmuck and Haskin (2002)] or Dynamo [Hastorun et al. (2007)] and peer-to-peer (P2P) storage applications like Wuala [WuaLa (2010)] or OceanStore [Kubiatowicz et al. (2000)].

To achieve high reliability in distributed storage systems, a certain level of data redundancy is required. Unfortunately, the use of redundancy increases the storage and communication costs of the system: (i) the space required to store each file is increased, and (ii) additional communication bandwidth is required to repair lost data. It is important to optimize redundancy schemes in order to minimize these storage and communication costs. For example, in data centers, where the energy cost associated with the storage sub-system represents about 40% of the energy consumption of all the IT components [Guerra et al. (2010)], minimizing storage cost can significantly reduce the per-byte cost of the storage system. In less reliable infrastructures such as P2P systems, where the storage capacity is mainly constrained by the cross-system communication bandwidth [Blake and Rodrigues (2003)], minimizing communication costs can increase the overall storage capacity of the system.

Different redundancy schemes have been proposed to minimize the storage and communication costs associated with redundancy. Redundancy schemes based on coding techniques such as Reed-Solomon codes [Reed and Solomon (1960)] or LDPC [Plank and Thomason (2004)] achieve significant storage savings compared to simple replication [Rodrigues and Liskov (2005), Lin et al. (2004), Weatherspoon and Kubiatowicz (2002), Dimakis et al. (2007)]. Moreover, recent advances in network coding have led to the design of Regenerating Codes [Dimakis et al. (2007)], which reduce both storage and communication costs compared to replication. While coding schemes can provide cost efficient redundancy in production environments [Zhang et al. (2010), Ford et al. (2010), Fan et al. (2009)], distributed storage designers are still slow to adopt advanced coding schemes for their systems. In our opinion, one reason for this reluctance is that coding schemes present too many configuration trade-offs, which makes it difficult to determine the optimal configuration for a given storage infrastructure.

Besides coding or replication, one can also combine these two techniques into a hybrid redundancy scheme. In some circumstances these hybrid redundancy schemes can reduce the costs of coding schemes [Wu et al. (2005), Haeberlen et al. (2005)]. Besides reducing costs, there are other reasons why maintaining whole file replicas in conjunction with encoded copies is advantageous: (i) production systems using replication may want to reduce their costs without migrating their whole infrastructure, (ii) peer-assisted cloud storage systems [Toka et al. (2010)], like Wuala [WuaLa (2010)], aim to reduce the outgoing cloud bandwidth by combining cloud storage with P2P storage, and (iii) storage systems needing efficient file retrieval cannot afford the computational costs inherent in coding schemes. Unfortunately, there are no studies analyzing under which conditions, i.e., node dynamics and network parameters, hybrid schemes can reduce the storage and communication costs as compared to simple replication.

Due to the great variety of redundancy schemes, it is complex to determine which redundancy scheme is the best for a given infrastructure that is defined by properties like size (number of storage nodes), amount of stored data, node dynamics, and cross-system bandwidth. The aim of this paper is to analyze the impact of different properties on the storage and communication costs of the redundancy scheme. We focus our analysis on Regenerating Codes [Dimakis et al. (2007)]. As we will see in Section 4, Regenerating Codes provide a generic framework that also allows us to analyze replication schemes and maximum-distance separable (MDS) codes such as Reed-Solomon codes as specific instances of Regenerating Codes.

The main contributions of our paper are as follows:

  • This paper is the first to completely evaluate the storage and communication costs of Regenerating Codes under different system conditions.

  • For storage systems that need to maintain whole replicas of the stored files, we identify the conditions where a hybrid scheme (replication+coding) can reduce the storage and communication costs of a simple replication scheme.

  • Finally, we evaluate through simulation the effects that different redundancy scheme configurations have on the scalability of the storage system. We show that some theoretically-optimal schemes cannot guarantee data reliability in realistic storage environments.

The rest of the paper is organized as follows. In Section 2 we present the related work. In Sections 3 and 4 we describe our storage model and Regenerating Codes. In Section 5, we analytically evaluate the storage and communication costs of Regenerating Codes. In Section 6 we analyze a hybrid redundancy scheme that combines Regenerating Codes and replication. Finally, in Section 7 we validate and extend our analytical results using simulations, and in Section 8, we state our conclusions.

2 Related Work

Tolerating node failures is a key requirement to achieve data reliability in distributed storage systems. Existing distributed storage systems use different strategies to cope with these node failures depending on whether these failures are transient —nodes reconnect without losing any data— or permanent —nodes disconnect and lose their data. In this section we present the existing techniques used to alleviate the costs caused by these transient and permanent node failures.

Transient node failures cause temporary data unavailability that may prevent users from retrieving their stored files. To tolerate transient node failures and guarantee high data availability, storage systems need to introduce data redundancy. Redundancy ensures (with high probability) that files can be retrieved even when some storage nodes are temporarily off-line. The simplest way to introduce redundancy is by replicating each stored file. However, redundancy schemes based on coding techniques can significantly reduce the amount of redundancy (and storage space) required while achieving the same data reliability [Weatherspoon and Kubiatowicz (2002), Bhagwan et al. (2002)]. Lin et al. [Lin et al. (2004)] showed that such a reduction in redundancy is only possible when node on-line availabilities are high. For example, nodes must be on-line more than 50% of the time when files are stored occupying twice their original size, or more than 33% of the time when files occupy three times their original size.

To cope with permanent node failures, storage systems need to repair the lost redundancy. Unfortunately, repairing such lost redundancy introduces communication overheads, since it requires moving large amounts of data between nodes. Blake and Rodrigues [Blake and Rodrigues (2003)] demonstrated that the communication bandwidth used by these repairs can limit the scalability of the system in three main situations: (i) when the node failure rate is high, (ii) when the cross-system bandwidth is low, or (iii) when the system stores too much data. Additionally, Rodrigues and Liskov [Rodrigues and Liskov (2005)] compared replication and erasure codes in terms of communication overheads and concluded that when node on-line availabilities are high, replication requires less communication than erasure codes. These results pose a dilemma for storage designers: when node on-line availabilities are high, erasure codes minimize storage overheads [Lin et al. (2004)] while replication minimizes communication overheads [Rodrigues and Liskov (2005)].

In order to reduce the communication overheads of erasure codes, Wu et al. [Wu et al. (2005)] proposed the use of a hybrid scheme combining erasure codes and replication. Although this technique slightly increases the storage overhead, it can significantly reduce the communication overhead of erasure codes when node on-line availabilities are high. Another technique used to minimize the communication overhead is lazy redundancy maintenance [Kiran et al. (2004), Datta and Aberer (2006)], which amortizes the cost of several consecutive repairs. However, deferring repairs can reduce the amount of available redundancy, requiring extra redundancy to guarantee the same data reliability. Furthermore, lazy repairs lead to spikes in network resource usage [Duminuco et al. (2007), Sit et al. (2006)].

New coding schemes such as Hierarchical Codes or tree-structured data regeneration have also been proposed to reduce the communication overhead as compared to classical erasure codes [Li et al. (2010), Duminuco and Biersack (2008)]. These solutions propose storage optimizations that exploit heterogeneities in node bandwidth and node availabilities. Finally, Dimakis et al. presented Regenerating Codes [Dimakis et al. (2007)] as a flexible redundancy scheme for distributed storage systems. Regenerating Codes use ideas from network coding to define a new family of erasure codes that can achieve different trade-offs in the optimization of storage and communication costs. This flexibility makes it possible to adjust the code to the underlying storage infrastructure. However, there are no studies on how Regenerating Codes should be adapted to these infrastructures, or on how Regenerating Codes should be configured when combined with file replication in hybrid schemes. In this paper we will use Regenerating Codes [Dimakis et al. (2007), Dimakis et al. (2010)] as the basis of our analysis on how to adapt and optimize redundancy schemes for different underlying storage infrastructures and different application needs.

3 Modeling a Generic Distributed Storage System

We consider a storage system where nodes dynamically join and leave the system [Duminuco et al. (2007), Utard and Vernois (2004)]. We assume that node lifetimes are random and follow some specific distribution L. Because of these dynamics, the number of on-line nodes at time t, N(t), is a random process that fluctuates over time. Once stationarity is reached, we can replace N(t) by its limiting version N. Assuming that node arrivals follow a Poisson process with a constant rate λ, the average number of nodes in the system is N = λ·E[L] [Pamies-Juarez and García-López (2010)]. Additionally, it has been observed in real traces that during their lifetime in the system, nodes exhibit several off-line periods caused by transient failures [Steiner et al. (2007), Guha et al. (2006)]. To model these transient failures, we model each node as an alternating process between on-line and off-line states. The sojourn times in these states are random and follow two different distributions, T_on and T_off respectively. Using these distributions we can measure the node on-line availability a in the stationary state as [Yao et al. (2006)]:

    a = E[T_on] / (E[T_on] + E[T_off])
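To make this model concrete, the following Python sketch (ours, not part of the original system; the arrival rate, lifetime and sojourn-time values are arbitrary examples) numerically checks the two stationary quantities, N = λ·E[L] and the availability a:

```python
import random

def estimate_availability(mean_on, mean_off, n_cycles=100000, seed=1):
    """Estimate a = E[T_on] / (E[T_on] + E[T_off]) by sampling
    exponential on-line and off-line sojourn times."""
    rng = random.Random(seed)
    t_on = sum(rng.expovariate(1.0 / mean_on) for _ in range(n_cycles))
    t_off = sum(rng.expovariate(1.0 / mean_off) for _ in range(n_cycles))
    return t_on / (t_on + t_off)

# Little's law: with Poisson arrivals at rate lam and mean lifetime E[L],
# the stationary number of nodes is N = lam * E[L] (example values assumed).
lam, mean_lifetime = 0.01, 100000.0
print("N =", lam * mean_lifetime)
print("a =", estimate_availability(mean_on=18.0, mean_off=6.0))  # close to 0.75
```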

All the nodes in the system are responsible for storing a constant amount of data that is uniformly distributed among the nodes. To model different data granularities, we consider that this total amount of stored data corresponds to F different data files of size M bytes each. However, since each of these files is stored with redundancy, the total disk space required to store each file is rM, where r is the redundancy factor (or stretch factor). The value of r is set to guarantee that files are always retrievable with a probability, p_r, that is very close to one.

When a node reaches the end of its life, it abandons the system, losing all the data stored on it. A repair process is responsible for recreating the lost redundancy and ensuring that the retrieve probability, p_r, is not compromised. There are three main approaches used to recreate redundancy when nodes fail:

  1. Eager repairs: Lost redundancy is repaired on demand immediately after a node failure is detected.

  2. Lazy repairs: The system waits until a certain number of nodes have failed and repairs them all at once.

  3. Proactive repairs: The system schedules the insertion of new redundancy at a constant rate, which is set according to the average node failure rate.

In our storage model we will assume the use of proactive repairs. Compared to eager repairs, proactive repairs simplify the analysis of the communication costs. Furthermore, while lazy repair can reduce the maintenance costs by amortizing the communication costs across several repairs [Datta and Aberer (2006)], it presents some important drawbacks: (i) delaying repairs leads to periods with low redundancy that make the system vulnerable; (ii) lazy repairs cause network resource usage to occur in bursts, creating spikes of system utilization [Duminuco et al. (2007)]. By adopting a proactive repair strategy, communication overheads are evenly distributed in time and we can analyze the storage system in its steady state [Duminuco et al. (2007)].

4 Regenerating Codes

Regenerating Codes [Dimakis et al. (2007)] are a family of erasure codes that allow communication cost to be traded off against storage cost and vice versa. To store a file of size M bytes, Regenerating Codes generate n blocks, each to be stored on a different storage node. Each of these storage blocks has a size of α bytes, which makes the file stretch factor r = nα/M. When a storage node leaves the system or when a failure occurs, a new node can repair the lost block by downloading a repair block of size β bytes, β ≤ α, from any set of d out of the n−1 remaining alive nodes (k ≤ d ≤ n−1). We will refer to d as the repair degree. The total amount of data received by the repairing node, γ = dβ, is called the repair bandwidth. Finally, a node can reconstruct the file by downloading α bytes (the entire storage block) from each of k different nodes. In Figure 1 we depict the basic operations of file retrieval and block repair for a Regenerating Code (n, k, d). The labels at the edges indicate the amount of data transmitted between nodes during each operation.

Figure 1: Scheme for the repair and retrieve operations of Regenerating Codes.

Dimakis et al. [Dimakis et al. (2007)] gave the conditions that the set of parameters (n, k, d, α, β) must satisfy to construct a valid Regenerating Code. Basically, once the subset of parameters (n, k, d) is fixed, Dimakis et al. obtained an analytical expression for the relationship between the values of α and γ. This α-γ relationship presents a trade-off curve: the larger α, the smaller γ, and vice versa. This means that it is impossible to simultaneously minimize both the communication cost and the storage cost. Since the maximum storage capacity of the system can be constrained either by bandwidth bottlenecks or by disk storage bottlenecks, there are two extreme (α, γ)-points of this trade-off curve that are of special interest w.r.t. maximizing the storage capacity. The first is the point where the storage block size per node is minimized, which is referred to as the Minimum Storage Regenerating (MSR) code. The second is the point where the repair bandwidth is minimized, which is referred to as the Minimum Bandwidth Regenerating (MBR) code. According to [Dimakis et al. (2010)], the storage block size (α) and the repair bandwidth (γ) for MSR and MBR codes are:

    α_MSR = M/k,    γ_MSR = (M/k) · d/(d−k+1)    (1)
    α_MBR = γ_MBR = (M/k) · 2d/(2d−k+1)    (2)

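These two operating points translate directly into code. The sketch below (our illustration; the file size and the k and d values are arbitrary examples) evaluates eqs. (1) and (2):

```python
def msr(M, k, d):
    """Minimum Storage Regenerating point, eq. (1)."""
    assert 1 <= k <= d
    alpha = M / k                      # storage block size
    gamma = (M / k) * d / (d - k + 1)  # repair bandwidth, gamma = d * beta
    return alpha, gamma

def mbr(M, k, d):
    """Minimum Bandwidth Regenerating point, eq. (2)."""
    assert 1 <= k <= d
    alpha = (M / k) * (2 * d) / (2 * d - k + 1)
    gamma = alpha                      # at the MBR point alpha equals gamma
    return alpha, gamma

# Example: a 120 MB file, k = 8, d = 15.
M, k, d = 120e6, 8, 15
print("MSR:", msr(M, k, d))
print("MBR:", mbr(M, k, d))
```

Setting d = k in msr() yields γ = M, and setting k = 1 yields α = γ = M, matching the two special cases discussed below.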
There are two particular MSR configurations of special interest:

  • Maximum-distance separable (MDS) codes: In MSR codes, when d = k, we obtain β = α and the Regenerating Code behaves exactly like a traditional MDS code such as a Reed-Solomon code [Reed and Solomon (1960)]. In this case, the repair bandwidth, γ_MSR, is identical to the size of the original file: γ_MSR = k·(M/k) = M.

  • File replication: In MSR codes, when k = 1, the code becomes a simple replication scheme where the n storage nodes each store a complete copy of the original file. For k = 1, the storage block size, α_MSR, and the repair bandwidth, γ_MSR, are both equal to the size of the original file, M.

Table 1. Symbols used.

    N      Average number of storage nodes.
    λ      Node arrival/departure rate (nodes/sec.).
    L      Distribution of the node lifetime (sec.).
    a      Node on-line availability; also the probability of detecting a given block on-line.
    F      Number of stored files.
    M      Size of the stored files (bytes).
    b      Service bandwidth of each node (KBps).
    p_r    Data availability (target retrieve probability).
    n      Number of storage blocks.
    k      Retrieval degree: number of blocks required for retrieval of the original data.
    d      Repair degree: number of blocks required for repair.
    α      Storage block size.
    β      Repair block size.
    γ      Repair bandwidth (γ = dβ).

In Table 1 we summarize the symbols used throughout the paper.

5 Cost Analysis

5.1 Redundancy Cost

In Section 4 we defined data redundancy as r = nα/M. In this section we aim to determine the minimum r required to guarantee a desired file retrieve probability p_r. Since in Regenerating Codes the retrieval process needs to download k different blocks out of the total n blocks, the retrieve probability is measured as [Lin et al. (2004)],

    P_r(n, k, a) = Σ_{i=k}^{n} C(n, i) · a^i · (1−a)^{n−i}    (3)

Given the values of k, a, and a target retrieve probability p_r, we can use eq. (3) to determine the minimum number of storage blocks required to guarantee that retrieve probability:

    n_{k,a} = min { n : P_r(n, k, a) ≥ p_r }    (4)

Note that the number of storage blocks required to achieve p_r is a function of the retrieval degree, k, the node on-line availability, a, and p_r. In the rest of this paper we will use the notation n_{k,a} to refer to the number of storage blocks required to achieve a retrieve probability p_r for the specific k and a values.
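Equations (3) and (4) are straightforward to evaluate numerically. The following sketch (ours; the example parameters are arbitrary) computes the retrieve probability and the minimum number of storage blocks:

```python
from math import comb

def retrieve_prob(n, k, a):
    """Eq. (3): probability that at least k of n blocks are on-line,
    each block being on-line independently with probability a."""
    return sum(comb(n, i) * a**i * (1 - a)**(n - i) for i in range(k, n + 1))

def min_blocks(k, a, p_target, n_max=500):
    """Eq. (4): smallest n such that retrieve_prob(n, k, a) >= p_target."""
    for n in range(k, n_max + 1):
        if retrieve_prob(n, k, a) >= p_target:
            return n
    raise ValueError("n_max too small for the requested p_target")

print(min_blocks(k=8, a=0.5, p_target=0.9999))
```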

Since data redundancy is r = nα/M, we can obtain the redundancy required by MSR and MBR codes, r_MSR and r_MBR respectively, by substituting α with the expressions given in eq. (1) and eq. (2):

    r_MSR = n_{k,a}/k    (5)
    r_MBR = (n_{k,a}/k) · 2d/(2d−k+1)    (6)

Using these expressions we can state the following lemma:

Lemma 1. For k, a and p_r fixed, the redundancy r_MSR required by MSR codes is always smaller than or equal to the redundancy r_MBR required by MBR codes.

Proof. We can state the lemma as r_MSR ≤ r_MBR. Using equations (5) and (6) we obtain:

    n_{k,a}/k ≤ (n_{k,a}/k) · 2d/(2d−k+1)  ⟺  2d−k+1 ≤ 2d  ⟺  k ≥ 1,

which is true by the definition of MSR codes and MBR codes [Dimakis et al. (2007)].
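As a quick numerical sanity check of Lemma 1 (our sketch, with hypothetical parameter values; min_blocks re-implements eq. (4)):

```python
from math import comb

def min_blocks(k, a, p):
    # Eq. (4): smallest n with P_r(n, k, a) >= p.
    n = k
    while sum(comb(n, i) * a**i * (1 - a)**(n - i) for i in range(k, n + 1)) < p:
        n += 1
    return n

def r_msr(k, a, p):                 # eq. (5)
    return min_blocks(k, a, p) / k

def r_mbr(k, a, p, d):              # eq. (6)
    return r_msr(k, a, p) * (2 * d) / (2 * d - k + 1)

for k in (4, 8, 16):
    d = k                            # d = k maximizes the MBR redundancy
    print(k, r_msr(k, 0.5, 0.9999), r_mbr(k, 0.5, 0.9999, d))
# r_msr <= r_mbr in every case, since 2d / (2d - k + 1) >= 1 for k >= 1.
```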

Figure 2: Redundancy required to achieve a retrieve probability p_r for MSR and MBR codes as a function of the retrieval degree k. (a) Redundancy for MSR codes (includes MDS codes). (b) Redundancy for MBR codes. (c) Number of storage blocks n_{k,a}. Each curve in (a) and (b) depicts the redundancy evaluated using eq. (5) and eq. (6) for different values of d and different values of the node on-line availability a. In (c) we plot the number of storage blocks required to achieve the retrieve probability p_r for each case.

In Figures 2(a) and 2(b) we plot the redundancy required to achieve a retrieve probability p_r for MSR and MBR codes. We plot the values of r as a function of the retrieval degree, k, and for different node availabilities, a. Additionally, for MBR codes we also depict the values of r for the two extreme repair degree values: d = k and d = n−1. We do not evaluate r_MSR for different d values because r_MSR is independent of d (see eq. (5)). In Figure 2(c) we use eq. (4) to plot the number of blocks, n_{k,a}, used in Figures 2(a) and 2(b) for the retrieve probability p_r.

In Figure 2 we can see that for both MSR and MBR, increasing k reduces r, and therefore reduces storage costs. Additionally, comparing Figures 2(a) and 2(b) we can appreciate the consequences of Lemma 1: for a given node availability, a, and a retrieval degree, k, the redundancy required by MSR codes is always smaller than the redundancy required by MBR codes. Finally, we can see that r first quickly decreases with increasing k before it reaches its asymptotic value. There is no point in choosing a very large k to minimize the storage costs of MSR and MBR codes, since large k values induce a very high computational cost for coding and decoding [Duminuco and Biersack (2009)]. We therefore recommend choosing, for each node availability a, the smallest k at which the redundancy starts approaching its asymptote in Figure 2. In Table 2 we provide the redundancy savings achieved by using these k values.

Table 2. Storage space savings from adopting a Regenerating Code instead of replication. We use the recommended k values for each on-line node availability and a fixed target retrieve probability.

                     a = 0.5    a = 0.75    a = 0.99
    MSR              47%        77%         84%
    MBR (d = k)      69%        55%         11%
    MBR (d = n−1)    81%        70%         25%

5.2 Communication Costs

When a node fails, the system must repair all the data blocks stored on the failed node. Repairing each of these blocks requires transferring data between nodes, which entails a communication cost. In this section we measure the minimum per-node bandwidth required to sustain the overall repair traffic of the storage system. We will first compute the total amount of data, T(τ), that is transferred within the system during a period of time τ:

    T(τ) = (nodes failing during τ) × (blocks stored per node) × (traffic to repair a block)    (7)

Assuming that there are N nodes with an average lifetime E[L], the average number of nodes that fail during a period τ is Nτ/E[L] [Duminuco et al. (2007)]. Additionally, assuming that data blocks are uniformly distributed among all storage nodes, the average number of blocks stored per node is Fn/N. Finally, since the traffic required to repair one failed block is γ, we can rewrite eq. (7) as:

    T(τ) = (Nτ/E[L]) · (Fn/N) · γ = τFnγ/E[L]    (8)

Then, the minimum per-node bandwidth, b_min, required to ensure that all stored data can be successfully repaired is the ratio between the amount of data transmitted per unit of time (in seconds) and the average number of on-line nodes, aN:

    b_min = T(τ)/(τ · aN) = Fnγ/(aN·E[L])    (9)

Assuming that the repair bandwidth, γ, is given in KB and the node lifetime, E[L], in seconds, the minimum per-node bandwidth is expressed in KBps. Assuming that the upload bandwidth of each node is always smaller than or equal to its download bandwidth, this minimum per-node bandwidth, b_min, represents the minimum upload bandwidth required of each node.

If we use the values of the repair bandwidth given in equations (1) and (2), we obtain the minimum per-node bandwidth for each Regenerating Code configuration:

    b_min^MSR = (F·n_{k,a}/(aN·E[L])) · (M/k) · d/(d−k+1)    (10)
    b_min^MBR = (F·n_{k,a}/(aN·E[L])) · (M/k) · 2d/(2d−k+1)    (11)

Taking these two expressions we can state the following lemma:

Lemma 2. For the same k, d and p_r parameters, the per-node bandwidth required by MBR codes, b_min^MBR, is always smaller than or equal to the per-node bandwidth required by MSR codes, b_min^MSR.

Proof. We can state the lemma as b_min^MBR ≤ b_min^MSR. Using equations (10) and (11) we obtain:

    2d/(2d−k+1) ≤ d/(d−k+1)  ⟺  2d(d−k+1) ≤ d(2d−k+1)  ⟺  k ≥ 1,

which is true by the definition of MSR codes and MBR codes from [Dimakis et al. (2007)].
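Equation (9) can also be evaluated directly; the following sketch (ours, with assumed example values for all inputs) computes the minimum per-node bandwidth:

```python
def b_min(F, n, gamma, a, N, mean_lifetime):
    """Eq. (9): minimum per-node upload bandwidth (bytes/sec) needed to
    sustain the repair traffic, given F files of n blocks each, repair
    bandwidth gamma (bytes) per block, availability a, N nodes, and mean
    lifetime E[L] in seconds."""
    return F * n * gamma / (a * N * mean_lifetime)

# Example with assumed values: 1000 files, n = 20 blocks, gamma = 15 MB,
# a = 0.75, N = 500 nodes, E[L] = 30 days.
print(b_min(1000, 20, 15e6, 0.75, 500, 30 * 86400) / 1e3, "KB/s")
```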

In the rest of this section we analyze the per-node bandwidth requirements, b_min, for MSR and MBR codes. Since in eq. (10) and eq. (11) the term FM/(aN·E[L]) does not depend on the Regenerating Code parameters (n, k, d), we will assume that FM/(aN·E[L]) = 1 and plot the normalized per-node bandwidth b̂. To obtain the real minimum per-node bandwidth, we simply have to multiply b̂ by FM/(aN·E[L]).

Figure 3: Per-node bandwidth b̂ required to achieve p_r for MSR codes, evaluated using eq. (10). (a) MSR when d = k (MDS codes). (b) MSR when d = n−1.

Communication Cost for MSR Codes

In Figure 3 we use eq. (10) to analyze the per-node bandwidth requirements of MSR codes when the required retrieve probability is p_r. We plot the results for d = k and d = n−1 and for three different on-line node availabilities:

  • For d = k, we can see in Figure 3(a) how the per-node bandwidth of an MDS code, such as a Reed-Solomon code, is linear in k. In this case, the lowest per-node bandwidth is achieved when k = 1, which corresponds to a simple replication scheme.

  • For d = n−1, however, we can see in Figure 3(b) that the per-node bandwidth is asymptotically decreasing in k. As already said, k should not be chosen beyond the point where the curves flatten out. Finally, we can see that for the highest node availability, b̂ is not an asymptotically decreasing function: as a tends to one, the number of required blocks, n_{k,a}, tends to k (see eq. (4)) and the case d = n−1 becomes identical to the case d = k, which is depicted in Figure 3(a).

In Figure 3(a) we saw that MDS codes (d = k) do not reduce the per-node bandwidth as compared to replication (k = 1), while in Figure 3(b) we saw that for d = n−1, an MSR code can reduce the bandwidth as compared to replication except for high node on-line availabilities. We now want to determine the maximum node on-line availability for which an MSR code can reduce the per-node bandwidth requirement as compared to replication. Let us denote by b̂_rep the per-node bandwidth required by replication and by b̂_MSR the per-node bandwidth required by an MSR code. Then, an MSR code reduces the bandwidth required by replication when the following inequality holds:

    b̂_MSR = (n_{k,a}/k) · d/(d−k+1) < n_{1,a} = b̂_rep    (12)
Table 3. Minimum d values for constructing MSR codes that require less repair bandwidth than simple file replication, for different node availabilities a and different retrieval degrees k: the minimum repair degree d satisfying eq. (12) and the corresponding value of n_{k,a}.

Table 3 shows the minimum d that satisfies the inequality defined in eq. (12) for different on-line node availabilities, a, and different retrieval degrees, k. We additionally provide the number of storage blocks, n_{k,a}, required to achieve p_r. We can see that for low node availabilities, small values of d, slightly larger than k, are sufficient to reduce the per-node bandwidth below that of replication. However, for high on-line node availabilities, the minimum value of d satisfying eq. (12) becomes larger than n−1, which is not a valid Regenerating Code configuration. The maximum availability for which a valid d exists is higher for low k values. We can generally state that for high on-line node availabilities, replication becomes more bandwidth efficient than any MSR code, which confirms the result obtained by Rodrigues and Liskov in [Rodrigues and Liskov (2005)].
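The search for the minimum repair degree satisfying eq. (12) is easy to automate. The following sketch (ours; k and the availability values are examples) reproduces the qualitative behavior described above, returning None when no valid d ≤ n−1 exists:

```python
from math import comb

def min_blocks(k, a, p):
    # Eq. (4): smallest n with P_r(n, k, a) >= p.
    n = k
    while sum(comb(n, i) * a**i * (1 - a)**(n - i) for i in range(k, n + 1)) < p:
        n += 1
    return n

def min_repair_degree(k, a, p):
    """Smallest valid d (k <= d <= n-1) for which an MSR code needs less
    per-node repair bandwidth than replication, eq. (12); None otherwise."""
    n = min_blocks(k, a, p)
    n_rep = min_blocks(1, a, p)      # replicas needed for the same target
    for d in range(k, n):            # d must not exceed n - 1
        if (n / k) * d / (d - k + 1) < n_rep:
            return d
    return None

for a in (0.5, 0.75, 0.99):
    print(a, min_repair_degree(k=8, a=a, p=0.9999))
# For a = 0.99 no valid d exists: replication is more bandwidth efficient.
```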

Figure 4: Per-node bandwidth b̂ required to achieve p_r for MBR codes, evaluated using eq. (9). (a) MBR when d = k. (b) MBR when d = n−1.

Communication Cost for MBR Codes

In Figure 4 we plot the required per-node bandwidth of MBR codes for d = k and d = n−1. For MBR codes, in contrast to MSR codes, we can see that for both d values the required per-node bandwidth asymptotically decreases with increasing k, and we can state:

Remark 1. For MBR codes, b̂_MBR is largest when d = k.

From Lemma 2 we know that for the same configuration, MBR codes are more bandwidth efficient than MSR codes. Using Remark 1 we can now state that all MBR codes are also more bandwidth efficient than simple replication, which is a special case of MSR:

Lemma 3. The per-node bandwidth requirements of MBR codes are lower than or equal to the per-node bandwidth requirements of simple replication: b̂_MBR ≤ b̂_rep.

Proof. If this lemma is true, then the per-node bandwidth of the MBR configuration that consumes the most bandwidth must be lower than or equal to the per-node bandwidth of replication. Since b̂_MBR is largest for d = k (see Remark 1), we can rewrite this lemma as b̂_MBR(d = k) ≤ b̂_rep. To prove it by contradiction, we assume that b̂_MBR(d = k) > b̂_rep. Substituting d = k into eq. (11) we obtain:

    2·n_{k,a}/(k+1) > n_{1,a},

which is a contradiction.

In Figure 5 we plot the communication savings a storage system achieves when using an MBR code instead of replication. The savings have the same asymptotic behavior as the bandwidth requirements, b̂, depicted in Figure 4. Since for MBR codes α = γ, i.e., the storage block size is the same as the repair bandwidth, the communication savings for MBR are the same as the storage savings listed in Table 2.

Figure 5: Reduction of the communication cost obtained by adopting an MBR code instead of replication, as a function of k, for a target retrieve probability p_r. (a) Savings when d = k. (b) Savings when d = n−1.

6 Hybrid Repositories

In Section 5 we saw that except for one particular case (MSR codes and high node on-line availabilities), MSR and MBR codes offer both lower storage costs and lower communication costs than simple file replication. However, there are some scenarios where the storage system needs to ensure that files can be accessed without decoding operations. For example, storage infrastructures using replication [Ghemawat et al. (2003), Borthakur (2007)] may not be able to afford a migration of their whole infrastructure from replication to erasure codes. Other examples are on-line streaming services or content distribution networks (CDNs) that need efficient access to stored files without complex decoding operations.

As we saw in Section 5, maintaining whole file replicas (MSR codes with k = 1) has a higher storage cost than using coding schemes. However, when whole file replicas are required, storage systems can reduce this high cost by using a hybrid redundancy scheme that combines replication and erasure codes. The replicas can also help to reduce the communication cost of repairing lost data, since new redundant blocks can be generated directly from an on-line file replica: generating a redundant block from a file replica requires transmitting only α bytes instead of the γ bytes required by the normal repair process, and from eqs. (1) and (2) it is easy to see that α ≤ γ. While some papers have studied hybrid redundancy schemes [Rodrigues and Liskov (2005), Haeberlen et al. (2005), Dimakis et al. (2007)], their aim was to reduce communication costs and not to guarantee permanent access to replicated objects. Therefore, these papers assumed that only one replica of each file was maintained in the system, ignoring the two problems that arise when this replica goes temporarily off-line: (i) it is not possible to access the file without decoding operations, and (ii) repairs using the replica are not possible.

In this section we evaluate a different hybrid scenario, where the storage system may maintain more than one replica of the whole file in order to ensure, with high probability, that there is always one replica on-line. However, it is not clear whether the overall communication costs of our hybrid scheme will be lower than the communication costs of a pure replication scheme. Further, even if communication costs are reduced, the use of a double redundancy scheme (replication and coding) may increase storage costs. To the best of our knowledge, there is no prior work analyzing these aspects. In our analysis we differentiate between the probability of having a file replica on-line, p_a, and the retrieve probability, p_r, of being able to retrieve a file using encoded blocks, which requires that k out of a total of n storage blocks are on-line. We assume that p_a ≤ p_r, which is motivated by the fact that while users are likely to tolerate higher access times in the rare cases when no replica is found on-line and the file must first be reconstructed from encoded blocks, they require very strong guarantees that data is never lost.

Adapting Communication Cost to the Hybrid Scheme

In a hybrid scheme we need to consider two types of repair traffic, namely (i) traffic to repair lost replicas and (ii) traffic to repair encoded blocks. Since in the hybrid scheme blocks are repaired directly from a replicated copy, repairing an encoded block requires transmitting only one new storage block of α bytes. We therefore replace the term "traffic to repair a block" in eq. (7) by α. Rearranging the terms and normalizing as in Section 5.2, we obtain the following two per-node bandwidth expressions for hybrid schemes based on MSR and MBR codes:

    b̂_hyb^MSR = R + n/k    (13)
    b̂_hyb^MBR = R + (n/k) · 2d/(2d−k+1)    (14)

where R is the minimum number of replicas such that P_r(R, 1, a) ≥ p_a, and n is the minimum number of blocks such that P_r(n, k, a) ≥ p_r (cf. eq. (4)).

Note that these expressions assume that all lost blocks are repaired from replicas. Since we adopt a proactive repair scheme, the system can delay individual repairs when no replicas are available. However, since a replica is available most of the time, such delays will rarely happen.

Comparing b̂_hyb^MSR and b̂_hyb^MBR we can state the following lemma:

Lemma 4. For the same k, d and p_r parameters, a hybrid scheme using an MBR code has a communication cost that is at least as high as the communication cost of a hybrid scheme using an MSR code.

Proof. We can state the lemma as b̂_hyb^MSR ≤ b̂_hyb^MBR. Using equations (13) and (14) we obtain:

    n/k ≤ (n/k) · 2d/(2d−k+1),

which is true by the definition of Regenerating Codes.

Lemma 4 implies that MSR codes, when used in hybrid schemes, are both more storage-efficient and more bandwidth-efficient than MBR codes. For this reason we will not consider the use of MBR codes in hybrid schemes.

Let us assume that the required retrieve probability for the whole hybrid system is p_r and that the availability target for replicated objects is p_a, with p_a ≤ p_r. A hybrid scheme reduces the storage cost compared to replication when the following condition is satisfied:

    R + n/k < n_rep    (15)

where n_rep is the minimum number of replicas a pure replication scheme needs to reach the retrieve probability p_r.

And analogously, a hybrid scheme reduces communication costs when:

    b̂_hyb^MSR < b̂_rep    (16)
Table 4. Number of replicas required to achieve a given replica availability target p_a for different node availabilities a (one column per availability target, ordered from the most to the least demanding).

    a = 0.5:    7, 6, 5
    a = 0.75:   4, 3, 3
    a = 0.99:   1, 1, 1
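The storage condition of eq. (15) can be checked numerically. The sketch below (ours; the parameter values are hypothetical) compares the per-file storage cost of a hybrid scheme against pure replication, in units of the file size M:

```python
from math import comb

def min_blocks(k, a, p):
    # Eq. (4): smallest n with P_r(n, k, a) >= p.
    n = k
    while sum(comb(n, i) * a**i * (1 - a)**(n - i) for i in range(k, n + 1)) < p:
        n += 1
    return n

def hybrid_beats_replication(k, a, p_a, p_r):
    """Eq. (15): replicas for target p_a plus MSR blocks for target p_r
    must cost less than the replicas pure replication needs for p_r."""
    replicas = min_blocks(1, a, p_a)         # R
    msr_cost = min_blocks(k, a, p_r) / k     # n * alpha / M for MSR
    rep_cost = min_blocks(1, a, p_r)         # n_rep
    return replicas + msr_cost < rep_cost

print(hybrid_beats_replication(k=8, a=0.5, p_a=0.99, p_r=0.999999))  # True
```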

Figure 6: The (p_r, p_a)-pairs under each of the lines represent the scenarios where a hybrid scheme (replication + MSR codes) reduces the costs of a pure replication scheme. The lines are the maximum p_a values that satisfy eq. (15) for (a), storage-efficient hybrid schemes, and eq. (16) for (b) and (c), bandwidth-efficient hybrid schemes.

In Figure 6(a) we plot the maximum replica availability target p_a that satisfies eq. (15) as a function of the overall retrieve probability p_r, for different on-line node availabilities a. The parameter k is set, for each availability, to the value recommended in Section 5. The (p_r, p_a)-pairs below each of the lines correspond to the hybrid instances that satisfy eq. (15), i.e., where a hybrid scheme reduces the storage costs. Similarly, in Figures 6(b) and 6(c), we plot the (p_r, p_a)-pairs that satisfy eq. (16), i.e., where a hybrid scheme reduces the communication costs.

As an example, let us assume a storage system that wants 99% availability for its replicated files (p_a = 0.99). Looking at Figure 6 we see that a hybrid scheme (replication + MSR codes) can reduce the storage costs compared to replication only when the target retrieve probability p_r is high enough, with the exact threshold depending on the node availability a. Since in general we always want strong guarantees that files are never lost, i.e., p_r very close to one, we can conclude that hybrid schemes reduce storage and communication costs in almost all practical scenarios.

It is interesting to note that in Figure 6 all three sub-figures look very much alike. The reason is that the cost contribution of replication is significantly higher than the cost contribution of the coding (see Section 5). Since we have demonstrated the cost efficiency of a hybrid scheme for p_a = 0.99, which requires a larger number of replicas than configurations with lower availability targets (see Table 4), a hybrid scheme will also reduce storage and communication costs for any system requiring fewer replicas, i.e., p_a < 0.99.

7 Experimental Evaluation

In the previous sections we presented our generic storage model, introduced Regenerating Codes, and analytically evaluated the storage and communication costs of MSR and MBR codes, as well as the efficiency of using these codes in hybrid redundancy schemes. In this section, we evaluate how the network traffic caused by repair processes can affect the performance and scalability of the redundancy scheme. For that, we assume a distributed storage system constrained by its network bandwidth: a system where storage nodes have low upload bandwidth and low on-line availabilities. For such a storage system we evaluate two measures that are difficult to obtain analytically: (i) the real bandwidth used by the repair process, i.e., the bandwidth utilization, and (ii) the repair time, i.e., the time required to download the blocks involved in a repair. In this way we can evaluate the impact of the repair degree d on bandwidth utilization and system scalability.

Bandwidth utilization

Given a node upload bandwidth, b_up, and the required per-node bandwidth, b_min, we can theoretically state that a feasible storage system must satisfy b_min ≤ b_up, and that the storage system reaches its maximum capacity when b_min = b_up. However, practical storage systems may not reach this maximum capacity because of system inefficiencies due to failed repairs or block retransmissions. To measure these inefficiencies, we will compare the real bandwidth utilization with the theoretical bandwidth utilization u = b_min/b_up.

Repair time

The repair time is proportional to the repair bandwidth, γ, the repair degree, d, and the probability of finding a node on-line [Pamies-Juarez et al. (2010)]. We showed in Section 5 that increasing d reduces the repair bandwidth γ (see eqs. (1) and (2)), which should intuitively reduce repair times. However, since the system only guarantees that k blocks are on-line with high probability, contacting d > k nodes may require waiting for nodes to come back on-line, which causes longer repair times. In the previous sections we only considered the two extreme repair degrees, namely d = k and d = n−1. In this section we analyze how intermediate d values affect repair times and bandwidth utilization.

7.1 Simulator Set-Up

We implemented an event-based simulator that simulates a dynamic storage infrastructure. Initially, the simulator starts with N storage nodes. New node arrivals follow a Poisson process with average inter-arrival time 1/λ; node departures follow a Poisson process with the same rate. Once a node joins the system, it draws its lifetime from an exponential distribution with expected value E[L]. During their lifetime in the system, nodes alternate between on-line and off-line sessions. For each session, each node draws its on-line and off-line durations from the distributions T_on and T_off respectively. In this paper, T_on and T_off are exponential variates with means a·t_b and (1−a)·t_b respectively, where t_b is the base time and a the node on-line availability. Using the mean value of the exponential distribution we can compute the average durations of the on-line and off-line periods (in hours) as:

    E[T_on] = a · t_b    (17)
    E[T_off] = (1−a) · t_b    (18)
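For illustration, the following sketch (ours) draws on-line/off-line cycles according to eqs. (17) and (18) and verifies that the measured availability matches the parameter a:

```python
import random

def sample_session(a, t_b, rng):
    """Draw one on-line/off-line cycle. With exponential sojourn times of
    mean a*t_b on-line and (1-a)*t_b off-line (eqs. (17)-(18)), the long-run
    availability is a and a full cycle lasts t_b hours on average."""
    t_on = rng.expovariate(1.0 / (a * t_b))
    t_off = rng.expovariate(1.0 / ((1 - a) * t_b))
    return t_on, t_off

rng = random.Random(42)
cycles = [sample_session(a=0.5, t_b=24.0, rng=rng) for _ in range(100000)]
on = sum(c[0] for c in cycles)
off = sum(c[1] for c in cycles)
print("measured availability:", on / (on + off))   # close to 0.5
```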

The simulator implements a parameterized Regenerating Code. To cope with node failures, redundant blocks are repaired in a proactive manner following the algorithm defined in [Duminuco et al. (2007)]: for each stored object, the simulator generates new redundant blocks at a constant rate, set according to the average node failure rate. To balance the amount of data assigned to each node, each repair is assigned to the on-line node that is least loaded in terms of the number of stored blocks and the number of ongoing repairs.

If the repair node disconnects during a repair process, the repair is aborted and restarted at another on-line node. Similarly, when a node uploading data disconnects, the partially uploaded data is discarded and the repair node starts a block retrieval from another on-line node.

The number of objects stored in the system is set in all simulations to achieve a desired system utilization u. Given u, the number of stored objects, F, is obtained using the two following expressions:

    u = b_min / b_up    (19)
    F = u · b_up · a · N · E[L] / (n · γ)    (20)

These formulas are obtained by taking the definition of utilization, eq. (19), replacing b_min by u·b_up in eq. (9) and solving the equation for F.
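Equation (20) is what the simulator uses to size the workload; the sketch below (ours, with assumed example values) computes the number of objects for a target utilization:

```python
def objects_for_utilization(u, b_up, a, N, mean_lifetime, n, gamma):
    """Eq. (20): number of stored objects F that drives the per-node
    repair bandwidth of eq. (9) to u * b_up."""
    return int(u * b_up * a * N * mean_lifetime / (n * gamma))

# Example with assumed values loosely matching the simulation set-up:
# 20 KB/s upload, a = 0.5, N = 200 nodes, E[L] = 30 days,
# n = 20 blocks and gamma = 15 MB per repaired block.
print(objects_for_utilization(0.5, 20e3, 0.5, 200, 30 * 86400, 20, 15e6))
```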

For each experiment we fix the on-line node availability a and the retrieval degree k, and with these values we use eq. (4) to compute the minimum number of redundant blocks, n, required to achieve the target retrieve probability p_r.

Finally, the node upload bandwidth is set to 20 KB/s, allowing only one concurrent upload per node. To simulate asymmetric network bandwidth, we allow up to 3 concurrent downloads per node, which gives a maximum download bandwidth of 60 KB/s.

7.2 Impact of the Repair Degree

Figure 7: Bandwidth utilization and repair times for MSR and MBR codes and different repair degrees d, when the object size is 120 MB, the base time is t_b = 24 hours, and the number of objects is set to achieve half bandwidth utilization, u = 0.5. (a) Bandwidth utilization. (b) Repair times. (c) Number of objects. (d) Overall disk utilization.

In Figure 7 we measure the effect of the repair degree d on the system utilization and on the repair times. In this experiment, we set the size of the object to 120 MB and the base time to t_b = 24 hours, i.e., on average nodes connect and disconnect once per day. The number of stored objects is set to achieve a bandwidth utilization of u = 0.5. Figure 7(c) shows the number of objects, F, and Figure 7(d) the storage space required. Figures 7(a) and 7(b) show that small d values (values close to k) keep the bandwidth utilization on target and assure low repair times. However, for larger repair degrees the repair times start to increase exponentially.

It is interesting to see that when the repair times are long, nodes executing repairs may not finish their repairs before disconnecting, since repair times become longer than on-line sessions. In this case, failed repairs are reallocated and restarted on other on-line nodes. These unsuccessful repairs cause useless traffic that increases the real bandwidth utilization. In Figure 7(a) we can see how, for large repair degrees, repair times start to exceed on-line sessions, increasing the utilization beyond the target u = 0.5. It is important to note that these large repair times can jeopardize the reliability of the system: large d values can cause most repairs to fail, reducing the number of available blocks and the probability of successfully accessing stored files.

Figure 8: Bandwidth utilization and repair times for MSR and MBR codes and different base times t_b, when the object size is 120 MB and the number of objects is set to achieve a bandwidth utilization of u = 0.5. The repair degree d is fixed to the point where repair times begin to increase in Figure 7. (a) Bandwidth utilization. (b) Repair times.

To investigate the increase in bandwidth utilization in detail, we analyze in Figure 8 the performance of the storage system at the repair degree where repair times begin to increase. At this point we evaluate repair times and bandwidth utilization for different base times, t_b. As t_b increases, on-line sessions become longer and fewer repairs need to be restarted, which should reduce the bandwidth utilization. We can see this effect in Figure 8(a): larger base times reduce the bandwidth utilization of the system. Due to this utilization reduction, repair times are also slightly reduced, as we can see in Figure 8(b).

7.3 Scalability

Figure 9: Bandwidth utilization and repair times for MSR and MBR codes and different target utilizations u, when the object size is 120 MB, the base time is t_b = 24 hours, the repair degree is small (close to k), and the number of objects is set to achieve the target u. (a) Bandwidth utilization. (b) Repair times. (c) Number of objects.

Besides the impact of the repair degree and the base time, we also analyze the behavior of the storage system under different target bandwidth utilizations. In Figure 9 we plot the measured utilization and repair times for a wide range of target utilizations u. We set the size of the stored objects to 120 MB and increase the number of stored objects, F, to achieve different utilizations, using a small repair degree. In Figure 9(a) we see that the measured utilization is nearly the same as the target utilization. This is because a small repair degree leads to short repair times, and repairs typically finish before nodes go off-line. However, in Figure 9(b) we can see how, for high bandwidth utilizations, the saturation of the node upload queues increases repair times significantly.

Figure 10: Bandwidth utilization and repair times for MSR and MBR codes and different target utilizations u, as in Figure 9 but for a larger repair degree. (a) Bandwidth utilization. (b) Repair times. (c) Number of objects.

In Figure 10 we plot the same metrics as in Figure 9, but for a larger repair degree. Increasing the repair degree causes longer repair times; however, as we saw in Figure 7, moderate repair degrees keep repairs short enough to guarantee that the utilization is not affected. Moreover, by increasing the repair degree we can store on the same system configuration one order of magnitude more objects, namely 6452 instead of 683 for MSR.

Figure 11: Bandwidth utilization and repair times for MSR and MBR codes and different object sizes, when the number of objects is set to achieve a bandwidth utilization of u = 0.5. (a) Bandwidth utilization. (b) Repair times. (c) Number of objects.

Finally, in Figure 11 we analyze the impact of the object size on bandwidth utilization and repair times. For each object size we set the number of stored objects to achieve a target bandwidth utilization of u = 0.5. Since the utilization is the same for all object sizes, the number of stored objects, F, decreases as the object size increases (Figure 11(c)). Independently of the object size, the total amount of stored data remains constant: 774 GB for MSR codes and 1206 GB for MBR codes. We can also see in Figure 11(a) that the measured bandwidth utilization is independent of the object size. However, as expected, we can see in Figure 11(b) that larger objects take longer to repair.

8 Conclusions

In this paper we evaluated redundancy schemes for distributed storage systems in order to gain a clearer understanding of their cost trade-offs. Specifically, we analyzed the performance of the generic family of erasure codes called Regenerating Codes [Dimakis et al. (2007)], and the use of Regenerating Codes in hybrid redundancy schemes. For each parameter combination we analytically derived the storage and communication costs of Regenerating Codes. Our cost analysis is novel in that it takes into account the effects of on-line node availabilities and node lifetimes. Additionally, we used an event-based simulator to evaluate the effects of network utilization on the scalability of different redundancy configurations. Our main results are as follows:

  • Compared to simple replication, the use of Regenerating Codes can reduce the costs of a storage system (storage and communication costs) by 20% up to 80%.

  • The optimal value of the retrieval degree k depends on the on-line node availability: small k values suffice when nodes have 99% availability, while larger k values are needed when nodes have 50% availability. Once k is fixed, storage systems with limited storage capacity can maximize their storage capacity by adopting MSR codes. On the other hand, systems with limited communication bandwidth can maximize their storage capacity by adopting MBR codes.

  • High repair degrees d reduce the overall communication costs but may increase repair times significantly, which can lead to data loss. We experimentally found that the repair degree should be small enough to ensure that repair times are shorter than the on-line session durations of the nodes.

  • Finally, in storage systems where the access to whole file replicas is required, we showed that hybrid schemes combining replication and MSR codes are more cost efficient than simple replication.

Acknowledgments

This work was done during a visit of the first author at Eurecom in Spring 2010. We would like to thank Matteo Dell’Amico and Zhen Huang for their helpful comments and their support of this research.

References

  • Bhagwan et al. (2002) Bhagwan, R., Moore, D., Savage, S., and Voelker, G. 2002. Replication strategies for highly available peer-to-peer storage systems. In Proceedings of FuDiCo: Future directions in Distributed Computing.
  • Blake and Rodrigues (2003) Blake, C. and Rodrigues, R. 2003. High availability, scalable storage, dynamic peer networks: Pick two. In Proceedings the 9th Workshop on Hot Topics in Operating Systems (HOTOS).
  • Borthakur (2007) Borthakur, D. 2007. The hadoop distributed file system: Architecture and design.
  • Datta and Aberer (2006) Datta, A. and Aberer, K. 2006. Internet-scale storage systems under churn. a study of the steady-state using markov models. In Proceedings of the 6th Intl. Conference on Peer-to-Peer Computing (P2P).
  • Dimakis et al. (2007) Dimakis, A., Godfrey, P., Wainwright, M., and Ramchandran, K. 2007. Network coding for distributed storage systems. In Proceedings of the 26th IEEE Intl. Conference on Computer Communications (INFOCOM).
  • Dimakis et al. (2010) Dimakis, A. G., Ramchandran, K., Wu, Y., and Suh, C. 2010. A survey on network codes for distributed storage. arXiv:1004.4438v1 [cs.IT].
  • Duminuco and Biersack (2009) Duminuco, A. and Biersack, E. 2009. A practical study of regenerating codes for peer-to-peer backup systems. In Proceedings of the 29th IEEE Intl. Conference on Distributed Computing Systems (ICDCS).
  • Duminuco and Biersack (2008) Duminuco, A. and Biersack, E. W. 2008. Hierarchical codes: How to make erasure codes attractive for peer-to-peer storage systems. In Proceedings of the 8th Intl. Conference on Peer-to-Peer Computing (P2P).
  • Duminuco et al. (2007) Duminuco, A., Biersack, E. W., and En-Najjary, T. 2007. Proactive replication in distributed storage systems using machine availability estimation. In Proceedings of the 3rd CoNEXT conference (CONEXT).
  • Fan et al. (2009) Fan, B., Tantisiriroj, W., Xiao, L., and Gibson, G. 2009. Diskreduce: Raid for data-intensive scalable computing. In Proceedings of the 4th Annual Workshop on Petascale Data Storage (PDSW).
  • Ford et al. (2010) Ford, D., Labelle, F., Popovici, F. I., Stokely, M., Truong, V.-A., Barroso, L., Grimes, C., and Quinlan, S. 2010. Availability in globally distributed storage systems. In Proceedings of the 9th USENIX conference on Operating systems design and implementation (OSDI).
  • Ghemawat et al. (2003) Ghemawat, S., Gobioff, H., and Leung, S. 2003. The google file system. In Proceedings of the ACM Symposium on Operating Systems Principles (SOSP).
  • Guerra et al. (2010) Guerra, J., Belluomini, W., Glider, J., Gupta, K., and Pucha, H. 2010. Energy proportionality for storage: Impact and feasibility. ACM SIGOPS Operating Systems Review 44, 1.
  • Guha et al. (2006) Guha, S., Daswani, N., and Jain, R. 2006. An experimental study of the skype peer-to-peer voip system. In Proceedings of the 5th Intl. Workshop on Peer-to-Peer Systems (IPTPS).
  • Haeberlen et al. (2005) Haeberlen, A., Mislove, A., and Druschel, P. 2005. Glacier: highly durable, decentralized storage despite massive correlated failures. In Proceedings of the 2nd conference on Symposium on Networked Systems Design & Implementation (NSDI). USENIX Association, Berkeley, CA, USA, 143–158.
  • Hastorun et al. (2007) Hastorun, D., Jampani, M., Kakulapati, G., Pilchin, A., Sivasubramanian, S., Vosshall, P., and Vogels, W. 2007. Dynamo: Amazon’s highly available key-value store. In Proceedings of Symposium on Operating Systems Principles (SOSP).
  • Kiran et al. (2004) Kiran, R. B., Tati, K., Cheng, Y.-c., Savage, S., and Voelker, G. M. 2004. Total recall: System support for automated availability management. In Symposium on Networked Systems Design and Implementation (NSDI).
  • Kubiatowicz et al. (2000) Kubiatowicz, J., Bindel, D., Chen, Y., Czerwinski, S., Eaton, P., Geels, D., Gummadi, R., Rhea, S., Weatherspoon, H., Weimer, W., Wells, C., and Zhao, B. 2000. Oceanstore: An architecture for global-scale persistent storage. In Proceedings of the 9th Intl. Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS).
  • Li et al. (2010) Li, J., Yang, S., Wang, X., Xue, X., and Li, B. 2010. Tree-structured Data Regeneration in Distributed Storage Systems with Regenerating Codes. In Proceedings of the 29th IEEE Intl. Conference on Computer Communications (INFOCOM).
  • Lin et al. (2004) Lin, W. K., Chiu, D. M., and Lee, Y. B. 2004. Erasure code replication revisited. In Proceedings of the 4th Intl. Conference on Peer-to-Peer Computing (P2P).
  • Pamies-Juarez and García-López (2010) Pamies-Juarez, L. and García-López, P. 2010. Maintaining data reliability without availability in p2p storage systems. In Proceedings. of the 25th Symposium On Applied Computing (SAC).
  • Pamies-Juarez et al. (2010) Pamies-Juarez, L., García-López, P., and Sánchez-Artigas, M. 2010. Availability and redundancy in harmony: Measuring retrieval times in p2p storage systems. In Proceedings of the 10th IEEE Intl. Conference on Peer-to-Peer Computing (P2P).
  • Plank and Thomason (2004) Plank, J. and Thomason, M. 2004. A practical analysis of low-density parity-check erasure codes for wide-area storage applications. In Proceedings of the International Conference on Dependable Systems and Networks (DSN).
  • Reed and Solomon (1960) Reed, I. and Solomon, G. 1960. Polynomial codes over certain finite fields. Journal of the Society for Industrial and Applied Mathematics 8, 2, 300–304.
  • Rodrigues and Liskov (2005) Rodrigues, R. and Liskov, B. 2005. High availability in dhts: Erasure coding vs. replication. In Proceedings of the 4th Intl. Workshop on Peer-To-Peer Systems (IPTPS).
  • Schmuck and Haskin (2002) Schmuck, F. and Haskin, R. 2002. Gpfs, a shared-disk file system for large computing clusters. In Proceedings of the 1th USENIX conference on File and Storage Technologies (FAST).
  • Sit et al. (2006) Sit, E., Haeberlen, A., Dabek, F., Chun, B., Weatherspoon, H., Morris, R., Kaashoek, M. F., and Kubiatowicz, J. 2006. Proactive replication for data durability. In Proceedings of the 5th Intl. Workshop on Peer-to-Peer Systems (IPTPS).
  • Steiner et al. (2007) Steiner, M., En-Najjary, T., and Biersack, E. 2007. A global view of kad. In Proceedings of the 7th ACM SIGCOMM conference on Internet Measurement (IMC).
  • Toka et al. (2010) Toka, L., Dell’Amico, M., and Michiardi, P. 2010. Online data backup: A peer-assisted approach. In Proceedings of the 10th IEEE Intl. Conference on Peer-to-Peer Computing (P2P).
  • Utard and Vernois (2004) Utard, G. and Vernois, A. 2004. Data durability in peer to peer storage systems. In Proceedings of the 4th IEEE Intl. Symposium on Cluster Computing and the Grid (CCGRID).
  • Weatherspoon and Kubiatowicz (2002) Weatherspoon, H. and Kubiatowicz, J. D. 2002. Erasure coding vs. replication: A quantitative comparison. In Proceedings of the 1st Intl. Workshop on Peer-To-Peer Systems (IPTPS).
  • Wu et al. (2005) Wu, F., Qiu, T., Chen, Y., and Chen, G. 2005. Redundancy schemes for high availability in dhts. In Proceedings of the 3rd Intl. Symposium on Parallel and Distributed Processing and Applications (ISPA).
  • WuaLa (2010) WuaLa. 2010. Wuala. http://www.wuala.com.
  • Yao et al. (2006) Yao, Z., Leonard, D., Derek, X., Wang, X., and Loguinov, D. 2006. Modeling Heterogeneous User Churn and Local Resilience of Unstructured P2P Networks. In Proceedings of the IEEE Intl. Conference on Network Protocols (ICNP).
  • Zhang et al. (2010) Zhang, Z., Deshpande, A., Ma, X., Thereska, E., and Narayanan, D. 2010. Does erasure coding have a role to play in my data center? Tech. Rep. MSR-TR-2010-52, Microsoft Research.