Cost Analysis of Redundancy Schemes for Distributed Storage Systems
Abstract
Distributed storage infrastructures require the use of data redundancy to achieve high data reliability. Unfortunately, the use of redundancy introduces storage and communication overheads, which can either reduce the overall storage capacity of the system or increase its costs. To mitigate these overheads, different redundancy schemes have been proposed. However, due to the great variety of underlying storage infrastructures and the different application needs, optimizing these redundancy schemes for each storage infrastructure is cumbersome. The lack of rules to determine the optimal level of redundancy for each storage configuration leads developers in industry to often choose simpler redundancy schemes, which are usually not the optimal ones. In this paper we analyze the cost of different redundancy schemes and derive a set of rules to determine which redundancy scheme minimizes the storage and communication costs for a given system configuration. Additionally, we use simulation to show that theoretically optimal schemes may not be viable in a realistic setting where nodes can go offline and repairs may be delayed. In these cases, we identify the tradeoffs between the storage and communication overheads of a redundancy scheme and its data reliability.
Categories: E.4 [Coding and Information Theory]: Error Control Codes; E.5 [Files]: Backup/Recovery; C.2.4 [Computer-Communication Networks]: Distributed Systems; H.3.2 [Information Storage and Retrieval]: Information Storage
General Terms: Reliability, Performance
Authors’ addresses: L. Pamies-Juarez (lluis.pamies@urv.cat) and E. Biersack (ernst.biersack@eurecom.fr).
1 Introduction
Distributed storage systems are widely used today for reasons of scalability and performance. There are distributed filesystems such as Google FS [Ghemawat et al. (2003)], HDFS [Borthakur (2007)], GPFS [Schmuck and Haskin (2002)] or Dynamo [Hastorun et al. (2007)] and peer-to-peer (P2P) storage applications like Wuala [WuaLa (2010)] or OceanStore [Kubiatowicz et al. (2000)].
To achieve high reliability in distributed storage systems, a certain level of data redundancy is required. Unfortunately, the use of redundancy increases the storage and communication costs of the system: (i) the space required to store each file is increased, and (ii) additional communication bandwidth is required to repair lost data. It is important to optimize redundancy schemes in order to minimize these storage and communication costs. For example, in data centers, where the energy cost associated with the storage subsystem represents about 40% of the energy consumption of all the IT components [Guerra et al. (2010)], minimizing storage cost can significantly reduce the per-byte cost of the storage system. In less reliable infrastructures, e.g., P2P systems, where the storage capacity is mainly constrained by the cross-system communication bandwidth [Blake and Rodrigues (2003)], minimizing communication costs can increase the overall storage capacity of the system.
Different redundancy schemes have been proposed to minimize the storage and communication costs associated with redundancy. Redundancy schemes based on coding techniques such as Reed-Solomon codes [Reed and Solomon (1960)] or LDPC codes [Plank and Thomason (2004)] achieve significant storage savings as compared to simple replication [Rodrigues and Liskov (2005), Lin et al. (2004), Weatherspoon and Kubiatowicz (2002), Dimakis et al. (2007)]. Moreover, recent advances in network coding have led to the design of Regenerating Codes [Dimakis et al. (2007)], which reduce both storage and communication costs as compared to replication. While coding schemes can provide cost-efficient redundancy in production environments [Zhang et al. (2010), Ford et al. (2010), Fan et al. (2009)], distributed storage designers are still slow to adopt advanced coding schemes for their systems. In our opinion, one reason for this reluctance is that coding schemes present too many configuration tradeoffs, which makes it difficult to determine the optimal configuration for a given storage infrastructure.
Besides coding or replication, one can also combine these two techniques into a hybrid redundancy scheme. In some circumstances these hybrid redundancy schemes can reduce the costs of coding schemes [Wu et al. (2005), Haeberlen et al. (2005)]. Besides reducing costs, there are other reasons why maintaining whole file replicas in conjunction with encoded copies is advantageous: (i) production systems using replication that want to reduce their costs without migrating their whole infrastructure, (ii) peer-assisted cloud storage systems [Toka et al. (2010)], like Wuala [WuaLa (2010)], that aim to reduce the outgoing cloud bandwidth by combining cloud storage with P2P storage, and (iii) storage systems needing efficient file retrievals that cannot afford the computational costs inherent in coding schemes. Unfortunately, there are no studies analyzing under which conditions, i.e., node dynamics and network parameters, hybrid schemes can reduce the storage and communication costs as compared to simple replication.
Due to the great variety of redundancy schemes, it is complex to determine which redundancy scheme is the best for a given infrastructure that is defined by properties like size (number of storage nodes), amount of stored data, node dynamics, and cross-system bandwidth. The aim of this paper is to analyze the impact of these different properties on the storage and communication costs of the redundancy scheme. We focus our analysis on Regenerating Codes [Dimakis et al. (2007)]. As we will see in Section 4, Regenerating Codes provide a generic framework that also allows us to analyze replication schemes and maximum-distance separable (MDS) codes such as Reed-Solomon codes as specific instances of Regenerating Codes.
The main contributions of our paper are as follows:

This paper is the first to completely evaluate the storage and communication costs of Regenerating Codes under different system conditions.

For storage systems that need to maintain whole replicas of the stored files, we identify the conditions under which a hybrid scheme (replication+coding) can reduce the storage and communication costs compared to a simple replication scheme.

Finally, we evaluate through simulation the effects that different redundancy scheme configurations have on the scalability of the storage system. We show that some theoretically optimal schemes cannot guarantee data reliability in realistic storage environments.
The rest of the paper is organized as follows. In Section 2 we present the related work. In Sections 3 and 4 we describe our storage model and Regenerating Codes. In Section 5, we analytically evaluate the storage and communication costs of Regenerating Codes. In Section 6 we analyze a hybrid redundancy scheme that combines Regenerating Codes and replication. Finally, in Section 7 we validate and extend our analytical results using simulations, and in Section 8, we state our conclusions.
2 Related Work
Tolerating node failures is a key requirement to achieve data reliability in distributed storage systems. Existing distributed storage systems use different strategies to cope with these node failures depending on whether the failures are transient (nodes reconnect without losing any data) or permanent (nodes disconnect and lose their data). In this section we present the existing techniques used to alleviate the costs caused by these transient and permanent node failures.
Transient node failures cause temporary data unavailability that may prevent users from retrieving their stored files. To tolerate transient node failures and guarantee high data availability, storage systems need to introduce data redundancy. Redundancy ensures (with high probability) that files can be retrieved even when some storage nodes are temporarily offline. The simplest way to introduce redundancy is to replicate each stored file. However, redundancy schemes based on coding techniques can significantly reduce the amount of redundancy (and storage space) required while achieving the same data reliability [Weatherspoon and Kubiatowicz (2002), Bhagwan et al. (2002)]. Lin et al. [Lin et al. (2004)] showed that such a reduction in redundancy is only possible when node online availabilities are high. For example, nodes must be online more than 50% of the time when files are stored occupying twice their original size, or more than 33% of the time when files occupy three times their original size.
To cope with permanent node failures, storage systems need to repair the lost redundancy. Unfortunately, repairing such lost redundancy introduces communication overheads since it requires moving large amounts of data between nodes. Blake and Rodrigues [Blake and Rodrigues (2003)] demonstrated that the communication bandwidth used by these repairs can limit the scalability of the system in three main situations: (i) when the node failure rate is high, (ii) when the cross-system bandwidth is low, or (iii) when the system stores too much data. Additionally, Rodrigues and Liskov [Rodrigues and Liskov (2005)] compared replication and erasure codes in terms of communication overheads and concluded that when online node availabilities are high, replication requires less communication than erasure codes. These results pose a dilemma for storage designers: when node online availabilities are high, erasure codes minimize storage overheads [Lin et al. (2004)] while replication minimizes communication overheads [Rodrigues and Liskov (2005)].
In order to reduce communication overheads for erasure codes, Wu et al. [Wu et al. (2005)] proposed the use of a hybrid scheme combining erasure codes and replication. Although this technique slightly increases the storage overhead, it can significantly reduce the communication overhead of erasure codes when node online availabilities are high. Another technique used to minimize the communication overhead is lazy redundancy maintenance [Kiran et al. (2004), Datta and Aberer (2006)], which amortizes the costs of several consecutive repairs. However, deferring repairs can reduce the amount of available redundancy, requiring extra redundancy to guarantee the same data reliability. Furthermore, lazy repairs lead to spikes in network resource usage [Duminuco et al. (2007), Sit et al. (2006)].
New coding schemes such as Hierarchical Codes or tree-structured data regeneration have also been proposed to reduce the communication overhead as compared to classical erasure codes [Li et al. (2010), Duminuco and Biersack (2008)]. These solutions propose storage optimizations that exploit heterogeneities in node bandwidth and node availabilities. Finally, Dimakis et al. presented Regenerating Codes [Dimakis et al. (2007)] as a flexible redundancy scheme for distributed storage systems. Regenerating Codes use ideas from network coding to define a new family of erasure codes that can achieve different tradeoffs in the optimization of storage and communication costs. This flexibility makes it possible to adjust the code to the underlying storage infrastructure. However, there are no studies on how Regenerating Codes should be adapted to these infrastructures, or how Regenerating Codes should be configured when combined with file replication in hybrid schemes. In this paper we will use Regenerating Codes [Dimakis et al. (2007), Dimakis et al. (2010)] as the base of our analysis on how to adapt and optimize redundancy schemes for different underlying storage infrastructures and different application needs.
3 Modeling a Generic Distributed Storage System
We consider a storage system where nodes dynamically join and leave the system [Duminuco et al. (2007), Utard and Vernois (2004)]. We assume that node lifetimes are random and follow some specific distribution with mean E[T_life]. Because of these dynamics, the number of online nodes at time t, N(t), is a random process that fluctuates over time. Once stationarity is reached, we can replace N(t) by its limiting version N. Assuming that node arrivals follow a Poisson process with a constant rate λ, then the average number of nodes in the system is N = λ · E[T_life] [Pamies-Juarez and García-López (2010)]. Additionally, it has been observed in real traces that during their lifetime in the system, nodes present several offline periods caused by transient failures [Steiner et al. (2007), Guha et al. (2006)]. To model these transient failures, we model each node as an alternating process between online and offline states. The sojourn times in these states are random and follow two different distributions with means E[T_on] and E[T_off] respectively. Using these distributions we can measure the node online availability, a, in the stationary state as [Yao et al. (2006)]:

a = E[T_on] / (E[T_on] + E[T_off]).
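As a quick numerical check of this availability model, the stationary availability can be estimated from sampled online/offline sojourn times. A minimal Python sketch, assuming (hypothetically) exponentially distributed sojourn times with means of 8 hours online and 16 hours offline:

```python
import random

random.seed(1)

# Hypothetical sojourn times: exponential with mean 8 h online and
# 16 h offline, so the stationary availability should be 8/24 = 1/3.
on = [random.expovariate(1 / 8.0) for _ in range(100_000)]
off = [random.expovariate(1 / 16.0) for _ in range(100_000)]

# Estimate of a = E[T_on] / (E[T_on] + E[T_off]) from the samples.
a = sum(on) / (sum(on) + sum(off))
print(round(a, 3))
```

The estimate converges to the ratio of the mean sojourn times regardless of the sojourn-time distribution, which is why only the means appear in the formula.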
All the nodes in the system are responsible for storing a constant amount of data that is uniformly distributed among the nodes. To model different data granularity, we will consider that this total amount of stored data corresponds to F different data files of size M bytes each. However, since each of these files is stored with redundancy, the total disk space required to store each file is rM, where r is the redundancy factor (or stretch factor). The value of r is set to guarantee that files are always available with a probability, P_r, that is very close to one.
When a node reaches the end of its life, it abandons the system, losing all the data stored on it. A repair process is responsible for recreating the lost redundancy and ensuring that the retrieve probability, P_r, is not compromised. There are three main approaches used to recreate redundancy when nodes fail:

Eager repairs: Lost redundancy is repaired on demand immediately after a node failure is detected.

Lazy repairs: The system waits until a certain number of nodes have failed and repairs them all at once.

Proactive repairs: The system schedules the insertion of new redundancy at a constant rate, which is set according to the average node failure rate.
In our storage model we will assume the use of proactive repairs. Compared to eager repairs, proactive repairs simplify the analysis of the communication costs. Furthermore, while lazy repair can reduce the maintenance costs by amortizing the communication costs across several repairs [Datta and Aberer (2006)], it presents some important drawbacks: (i) delaying repairs leads to periods with low redundancy that make the system vulnerable; (ii) lazy repairs cause network resource usage to occur in bursts, creating spikes of system utilization [Duminuco et al. (2007)]. By adopting a proactive repair strategy, communication overheads are evenly distributed in time and we can analyze the storage system in its steady state [Duminuco et al. (2007)].
4 Regenerating Codes
Regenerating Codes [Dimakis et al. (2007)] are a family of erasure codes that make it possible to trade communication cost for storage cost and vice versa. To store a file of size M bytes, Regenerating Codes generate n blocks, each to be stored on a different storage node. Each of these storage blocks has a size of α bytes, which makes the file stretch factor r = nα/M. When a storage node leaves the system or when a failure occurs, a new node can repair the lost block by downloading a repair block of size β bytes, β ≤ α, from any set of d out of the alive nodes (k ≤ d < n). We will refer to d as the repair degree. The total amount of data received by the repairing node, γ = d · β, γ ≥ α, is called the repair bandwidth. Finally, a node can reconstruct the file by downloading α bytes (the entire storage block) from k different nodes. In Figure 1 we depict the basic operations of file retrieval and block repair for a Regenerating Code (n, k, d). The labels at the edges indicate the amount of data transmitted between nodes during each operation.
Dimakis et al. [Dimakis et al. (2007)] gave the conditions that the set of parameters (n, k, d, α, β) must satisfy to construct a valid Regenerating Code. Basically, once the subset of parameters (n, k, d) is fixed, Dimakis et al. obtained an analytical expression for the relationship between the values of α and γ. This α-γ relationship presents a tradeoff curve: the larger α, the smaller γ, and vice versa. This means that it is impossible to simultaneously minimize both communication cost and storage cost. Since the maximum storage capacity of the system can be constrained either by bandwidth bottlenecks or by disk storage bottlenecks, there are two extreme (α, γ) points of this tradeoff curve that are of special interest w.r.t. maximizing the storage capacity. The first is the point where the storage block size per node, α, is minimized, which is referred to as the Minimum Storage Regenerating (MSR) code. The second is the point where the repair bandwidth, γ, is minimized, which is referred to as the Minimum Bandwidth Regenerating (MBR) code. According to [Dimakis et al. (2010)], the storage block size (α) and the repair bandwidth (γ) for MSR and MBR codes are:
(1)  α_MSR = M / k ,   γ_MSR = (M / k) · d / (d − k + 1)
(2)  α_MBR = γ_MBR = (M / k) · 2d / (2d − k + 1)
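Using the MSR and MBR expressions from [Dimakis et al. (2010)], with M the file size, k the retrieve degree and d the repair degree, the two extreme points can be sketched as follows (the function names are ours):

```python
def msr_point(M, k, d):
    # Minimum Storage Regenerating point: minimal storage block size.
    alpha = M / k                          # storage block size
    gamma = (M / k) * d / (d - k + 1)      # repair bandwidth
    return alpha, gamma

def mbr_point(M, k, d):
    # Minimum Bandwidth Regenerating point: alpha equals gamma.
    alpha = (M / k) * 2 * d / (2 * d - k + 1)
    return alpha, alpha

# Example: a 1 MB file, k = 4, d = 7.
print(msr_point(1.0, 4, 7))  # alpha = 0.25 MB, gamma = 0.4375 MB
print(mbr_point(1.0, 4, 7))
```

Note how the two points trade off against each other: MSR stores less per node but transfers more per repair, while MBR makes each repair exactly one block's worth of traffic.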
There are two particular MSR configurations of special interest:

Maximum-distance separable (MDS) codes: In MSR codes, when d = k, we obtain γ = M and the Regenerating Code behaves exactly like a traditional MDS code such as a Reed-Solomon code [Reed and Solomon (1960)]. In this case, the repair bandwidth, γ, is identical to the size of the original file, M.

File replication: In MSR codes, when k = 1, the code becomes a simple replication scheme where the n storage nodes each store a complete copy of the original file. For k = 1, the storage block size, α, and the repair bandwidth, γ, are equal to the size of the original file, M.
In Table 4 we summarize the symbols used throughout the paper.
5 Cost Analysis
5.1 Redundancy Cost
In Section 4 we defined data redundancy as r = nα/M. In this section we aim to measure the minimum r required to guarantee a desired file retrieve probability P_r. Since in Regenerating Codes the retrieval process needs to download k different blocks out of the total n blocks, the retrieve probability is measured as [Lin et al. (2004)],
(3)  P_r = Σ_{i=k}^{n} (n choose i) · a^i · (1 − a)^{n−i}
Given the values of k, a and P_r, we can use eq. (3) to determine the minimum number of redundant blocks, n, required to guarantee a certain retrieve probability using the following function:
(4)  n_{k,a} = min { n : Σ_{i=k}^{n} (n choose i) · a^i · (1 − a)^{n−i} ≥ P_r }
Note that the number of redundant blocks required to achieve P_r is a function of the retrieve degree, k, the node online availability, a, and P_r. In the rest of this paper we will use the notation n_{k,a} to refer to the number of storage blocks required to achieve a retrieve probability P_r for the specific k and a values.
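The minimum block count amounts to a linear search over n using the binomial expression for the retrieve probability; a minimal sketch (function names are ours):

```python
from math import comb

def retrieve_prob(n, k, a):
    # Probability that at least k of the n blocks are online,
    # each block independently online with probability a.
    return sum(comb(n, i) * a**i * (1 - a)**(n - i) for i in range(k, n + 1))

def min_blocks(k, a, p_target, n_max=1000):
    # Smallest n whose retrieve probability reaches p_target.
    for n in range(k, n_max + 1):
        if retrieve_prob(n, k, a) >= p_target:
            return n
    raise ValueError("p_target not reachable within n_max")

# With a = 0.5 and k = 1 (replication), 99% availability needs 7 replicas.
print(min_blocks(1, 0.5, 0.99))  # -> 7
```

The same search yields the block counts for any (k, a) pair, which is all that is needed to evaluate the redundancy expressions below.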
Since data redundancy is r = nα/M, we can obtain the redundancy required by MSR and MBR codes, r_MSR and r_MBR respectively, by substituting α with the expressions given in eq. (1) and eq. (2):
(5)  r_MSR = n_{k,a} · α_MSR / M = n_{k,a} / k
(6)  r_MBR = n_{k,a} · α_MBR / M = (n_{k,a} / k) · 2d / (2d − k + 1)
Using these expressions we can state the following lemma: {lemma} For k, d and a fixed, the redundancy r_MSR required by MSR codes is always smaller than or equal to the redundancy r_MBR required by MBR codes. {proof} We can state the lemma as r_MSR ≤ r_MBR. Using equations (5) and (6) we obtain:

n_{k,a} / k ≤ (n_{k,a} / k) · 2d / (2d − k + 1)  ⟺  2d − k + 1 ≤ 2d  ⟺  k ≥ 1,

which is true by the definition of MSR codes and MBR codes [Dimakis et al. (2007)].
In Figures 2(a) and 2(b) we plot the redundancy r required to achieve the retrieve probability P_r for MSR and MBR codes. We plot the values of r as a function of the retrieve degree, k, and for different node availabilities, a. Additionally, for MBR codes we also depict the values of r for the two extreme repair degree values: d = k and d = n − 1. We do not evaluate r_MSR for different d values because r_MSR is independent of d (see eq. (5)). In Figure 2(c) we use eq. (4) to plot the number of blocks, n_{k,a}, used in Figures 2(a) and 2(b) for the retrieve probability P_r.
In Figure 2 we can see that for both MSR and MBR, increasing k reduces r, and therefore reduces storage costs. Additionally, comparing Figures 2(a) and 2(b) we can appreciate the consequences of Lemma 5.1: for a given node availability, a, and a retrieve degree k, the redundancy required for MSR codes is always smaller than the redundancy required for MBR codes. Finally, we can see that r first quickly decreases with increasing k before it reaches its asymptotic value. There is no point in choosing a very large k to minimize the storage costs of MSR and MBR codes, since large k values induce a very high computational cost for coding and decoding [Duminuco and Biersack (2009)]. We therefore recommend using values of k where the redundancy starts approaching the asymptote, which depend on the node availability a. In Table 5.1 we provide the redundancy savings achieved by using these values.
5.2 Communication Costs
When a node fails, the system must repair all the data blocks stored on the failed node. Repairing each of these blocks requires transferring data between nodes, which entails a communication cost. In this section we measure the minimum per-node bandwidth required to sustain the overall repair traffic of the storage system. We will first compute the total amount of data, T(τ), that is transferred within the system during a period of time τ:
(7)  T(τ) = (avg. nodes failing in τ) · (avg. blocks stored per node) · (traffic to repair one block)
Assuming that there are N nodes with an average lifetime E[T_life], the average number of nodes that fail during a period τ is Nτ/E[T_life] [Duminuco et al. (2007)]. Additionally, assuming that data blocks are uniformly distributed between all storage nodes, the average number of blocks stored per node is Fn/N. Finally, since the traffic required to repair one failed block is γ, we can rewrite eq. (7) as:
(8)  T(τ) = (Nτ / E[T_life]) · (Fn / N) · γ = F · n · γ · τ / E[T_life]
Then, the minimum per-node bandwidth, b_min, required to ensure that all stored data can be successfully repaired is the ratio between the amount of data transmitted per unit of time (in seconds) and the average number of online nodes, N · a:
(9)  b_min = T(τ) / (τ · N · a) = F · n · γ / (E[T_life] · N · a)
Assuming that the repair bandwidth, γ, is given in KB and the node lifetime, E[T_life], in seconds, the minimum per-node bandwidth is expressed in KBps. Assuming that the upload bandwidth of each node is always smaller than or equal to the download bandwidth, this minimum per-node bandwidth, b_min, represents the minimum upload bandwidth required by each node.
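The minimum per-node bandwidth of eq. (9) is straightforward to evaluate; a sketch with illustrative (made-up) parameter values:

```python
def min_per_node_bw(F, n, gamma_kb, lifetime_s, N, a):
    # Eq. (9): b_min = F * n * gamma / (E[T_life] * N * a), in KBps
    # when gamma is given in KB and the average lifetime in seconds.
    return F * n * gamma_kb / (lifetime_s * N * a)

# Illustrative system: 10,000 files split into n = 20 blocks each,
# 512 KB repair bandwidth per block, 30-day mean node lifetime,
# 1,000 nodes with 50% online availability.
b = min_per_node_bw(F=10_000, n=20, gamma_kb=512.0,
                    lifetime_s=30 * 24 * 3600, N=1_000, a=0.5)
print(round(b, 3))  # KBps of upload bandwidth needed per node
```

Note how the result scales linearly with the amount of stored data and inversely with the node lifetime, which is the scalability constraint identified by Blake and Rodrigues.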
If we use the values of the repair bandwidth, γ, given in equations (1) and (2), we obtain the minimum per-node bandwidth for each Regenerating Code configuration:
(10)  b_MSR = (F · n_{k,a} · M) / (E[T_life] · N · a · k) · d / (d − k + 1)
(11)  b_MBR = (F · n_{k,a} · M) / (E[T_life] · N · a · k) · 2d / (2d − k + 1)
Taking these two expressions we can state the following lemma: {lemma} For the same k, d and a parameters, the per-node bandwidth required by MBR codes, b_MBR, is always smaller than or equal to the per-node bandwidth required by MSR codes, b_MSR. {proof} We can state the lemma as b_MBR ≤ b_MSR. Using equations (10) and (11) we obtain:

2d / (2d − k + 1) ≤ d / (d − k + 1)  ⟺  2(d − k + 1) ≤ 2d − k + 1  ⟺  1 ≤ k,

which is true by the definition of MSR codes and MBR codes from [Dimakis et al. (2007)].
In the rest of this section we analyze the per-node bandwidth requirements, b_min, for MSR and MBR codes. Since in eq. (10) and eq. (11) the term FM/(E[T_life] · N · a) does not depend on the Regenerating Code parameters (n, k, d), we will assume that FM/(E[T_life] · N · a) = 1. To obtain the actual minimum per-node bandwidth, we simply have to multiply the plotted values by FM/(E[T_life] · N · a).
Communication Cost for MSR Codes
In Figure 3 we use eq. (10) to analyze the per-node bandwidth requirements of MSR codes for the required retrieve probability P_r. We plot the results for d = k and d = n − 1, and for three different online node availabilities:

For d = k we can see in Figure 3(a) how the per-node bandwidth of an MDS code, such as a Reed-Solomon code, is linear in k. In this case, the lowest per-node bandwidth is achieved when k = 1, which corresponds to a simple replication scheme.

For d = n − 1, however, we can see in Figure 3(b) that the per-node bandwidth is asymptotically decreasing in k. However, as already said, we recommend choosing moderate k values to keep the coding and decoding costs low. Finally, we can see that for high node availabilities, b_MSR is not an asymptotically decreasing function: as a tends to one, the number of required blocks, n_{k,a}, tends to k (see eq. (4)) and the case d = n − 1 becomes identical to the case d = k, which is depicted in Figure 3(a).
In Figure 3(a) we saw that MDS codes (d = k) do not reduce the per-node bandwidth as compared to replication (k = 1), while in Figure 3(b) we saw that for d = n − 1, an MSR code can reduce the bandwidth as compared to replication except for high node online availabilities. We now want to determine the maximum node online availability, a, for which an MSR code can reduce the per-node bandwidth requirement as compared to replication. Let us denote by b_rep the per-node bandwidth required by replication and by b_MSR the per-node bandwidth required by an MSR code. Then, an MSR code reduces the bandwidth required by replication when the following inequality holds:
(12)  b_MSR < b_rep  ⟺  (n_{k,a} / k) · d / (d − k + 1) < n_{1,a}
Table 5.2 shows the minimum d that satisfies the inequality defined in eq. (12) for different online node availabilities, a, and different retrieve degrees k. We additionally provide the number of storage blocks, n_{k,a}, required to achieve P_r. We can see that for low node availabilities, small values of d, slightly larger than k, are sufficient to reduce the per-node bandwidth below that required by replication. However, for high online node availabilities, the minimum value of d satisfying eq. (12) becomes larger than n − 1, which is not a valid Regenerating Code configuration. The maximum online availability for which eq. (12) can be satisfied becomes higher for low k values. We can generally state that for high online node availabilities, replication becomes more bandwidth efficient than any MSR code, which confirms the result obtained by Rodrigues and Liskov in [Rodrigues and Liskov (2005)].
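The search for the smallest repair degree d satisfying eq. (12) can be sketched as follows, reusing a helper for the minimum block count n_{k,a} (both function names are ours):

```python
from math import comb

def min_blocks(k, a, p_target, n_max=1000):
    # Smallest n such that >= k of n blocks are online with prob. p_target.
    for n in range(k, n_max + 1):
        p = sum(comb(n, i) * a**i * (1 - a)**(n - i) for i in range(k, n + 1))
        if p >= p_target:
            return n
    raise ValueError("p_target not reachable within n_max")

def min_repair_degree(k, a, p_target):
    # Smallest valid d (k <= d <= n - 1) for which an MSR code needs
    # less per-node repair bandwidth than replication, i.e. eq. (12):
    #   (n_{k,a} / k) * d / (d - k + 1) < n_{1,a}
    n1 = min_blocks(1, a, p_target)   # replicas used by pure replication
    nk = min_blocks(k, a, p_target)   # blocks used by the MSR code
    for d in range(k, nk):
        if (nk / k) * d / (d - k + 1) < n1:
            return d
    return None  # replication is more bandwidth efficient here
```

For low availabilities the search finds a small d, while for high availabilities it returns None, mirroring the behavior reported in Table 5.2.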
Communication Cost for MBR Codes
In Figure 4 we plot the required per-node bandwidth of MBR codes for d = k and d = n − 1. For MBR codes, in contrast to MSR codes, we can see that for both d values the required per-node bandwidth asymptotically decreases with increasing k, and we can state: {remark} For MBR codes, the per-node bandwidth b_MBR is largest when d = k. From Lemma 5.2 we know that for the same configuration, MBR codes are more bandwidth efficient than MSR codes. Using Remark 5.2 we can now state that all MBR codes are also more bandwidth efficient than simple replication, which is a special case of MSR: {lemma} The per-node bandwidth requirements of MBR codes are lower than or equal to the per-node bandwidth requirements of simple replication: b_MBR ≤ b_rep. {proof} If this lemma is true, then the per-node bandwidth of the MBR configuration that consumes the most bandwidth must be lower than or equal to the per-node bandwidth of replication. Since b_MBR is largest for d = k (see Remark 5.2), we can rewrite this lemma as b_MBR(d = k) ≤ b_rep. To prove it by contradiction we assume that b_MBR(d = k) > b_rep. Using equations (10) and (11) we obtain:
which is a contradiction.
In Figure 5 we plot the communication savings a storage system achieves when using an MBR code instead of replication. The savings have the same asymptotic behavior as the bandwidth requirements, b_MBR, depicted in Figure 4. Since for MBR codes α = γ, i.e., the storage block size is the same as the repair bandwidth, the communication savings for MBR are the same as the storage savings listed in Table 5.1.
6 Hybrid Repositories
In Section 5 we saw that except for one particular case (MSR codes and high node online availabilities), MSR and MBR codes offer both lower storage costs and lower communication costs than simple file replication. However, there are some scenarios where the storage system needs to ensure that files can be accessed without the need for decoding operations. For example, storage infrastructures using replication [Ghemawat et al. (2003), Borthakur (2007)] may not be able to afford a migration of their infrastructures from replication to erasure codes. Other examples are online streaming services or content distribution networks (CDNs) that need efficient access to stored files without requiring complex decoding operations.
As we saw in Section 5, maintaining whole file replicas (MSR codes with k = 1) has a higher storage cost than using coding schemes. However, when whole file replicas are required, storage systems can reduce this high cost by using a hybrid redundancy scheme that combines replication and erasure codes. The replicas can also help reduce the communication cost of repairing lost data, since new redundant blocks can be generated from the online file replicas: generating a redundant block from a file replica requires transmitting α bytes instead of the γ bytes required by the normal repair process. From eqs. (1) and (2) it is easy to see that α ≤ γ. While some papers have studied hybrid redundancy schemes [Rodrigues and Liskov (2005), Haeberlen et al. (2005), Dimakis et al. (2007)], their aim was to reduce communication costs and not to guarantee permanent access to replicated objects. Therefore, these papers assumed that only one replica of each file was maintained in the system, ignoring the two problems that arise when this replica goes temporarily offline: (i) it is not possible to access the file without decoding operations, and (ii) repairs using the replica are not possible.
In this section we evaluate a different hybrid scenario, where the storage system may maintain more than one replica of the whole file in order to ensure with high probability that there is always one replica online. However, it is not clear if the overall communication costs of our hybrid scheme will be lower than the communication costs of a pure replication scheme. Further, even if communication costs are reduced, the use of a double redundancy scheme (replication and coding) may increase storage costs. To the best of our knowledge, there is no prior work analyzing these aspects. In our analysis we differentiate between the probability, P_a, of having a file replica online, and the retrieve probability, P_r, of being able to retrieve a file using encoded blocks, which requires that k out of a total of n storage blocks are online. We assume that P_a ≤ P_r, which is motivated by the fact that while users are likely to tolerate higher access times to a file, which will need to be reconstructed first in the rare cases when no replica is found online, they require very strong guarantees that data is never lost.
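Given node availability a, the number of whole-file replicas needed so that at least one is online with probability P_a follows from a simple geometric argument; a minimal sketch (function name ours):

```python
from math import ceil, log

def replicas_needed(a, p_online):
    # Smallest R such that 1 - (1 - a)**R >= p_online, i.e. at least
    # one of the R independent replicas is online with prob. p_online.
    return ceil(log(1 - p_online) / log(1 - a))

# With 50% node availability, 99% replica availability needs 7 replicas.
print(replicas_needed(0.5, 0.99))  # -> 7
```

This replica count determines the replication component of the hybrid scheme, on top of which the n_{k,a} encoded blocks guarantee the stronger probability P_r.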
Adapting Communication Cost to the Hybrid Scheme
In a hybrid scheme we need to consider two types of repair traffic, namely (i) traffic to repair lost replicas and (ii) traffic to repair encoded blocks. Since in the hybrid scheme blocks are repaired directly from a replicated copy, repairing an encoded block requires transmitting only one new storage block of α bytes. We obtain the latter by replacing in eq. (7) the traffic to repair a block, γ, by α. Rearranging the terms we obtain the following two expressions:
(13)  
(14) 
Note that these expressions assume that all lost blocks are repaired from replicas. Since we are adopting a proactive repair scheme, the system can delay individual repairs when no replicas are available. However, since replicas are available most of the time, these delays will rarely happen.
Comparing eqs. (13) and (14) we can state the following lemma: {lemma} For the same n, k and d parameters, a hybrid scheme using an MBR code has a communication cost that is at least as high as the communication cost of a hybrid scheme using an MSR code. {proof} Since in a hybrid scheme each encoded block is repaired by transmitting α bytes, the lemma reduces to α_MSR ≤ α_MBR. Using equations (13) and (14) we obtain:
which is true by the definition of Regenerating Codes.
Lemma 6 implies that MSR codes, when used in hybrid schemes, are both more storage-efficient and more bandwidth-efficient than MBR codes. For this reason we will not consider the use of MBR codes in hybrid schemes.
Let us assume that the required retrieve probability for the whole hybrid system is P_r and that the availability required for replicated objects is P_a, with P_a ≤ P_r. A hybrid scheme reduces the storage cost compared to replication when the following condition is satisfied:
(15) 
And analogously, a hybrid scheme reduces communication costs when:
(16) 
In Figure 6(a) we plot the maximum value of k that satisfies eq. (15) as a function of the overall retrieve probability P_r for different online node availabilities a. The remaining parameters are set according to the node availability a. The (k, P_r) pairs below each of the lines correspond to the hybrid instances that satisfy eq. (15), i.e., where a hybrid scheme reduces the storage costs. Similarly, in Figures 6(b) and 6(c), we plot the (k, P_r) pairs that satisfy eq. (16), i.e., where a hybrid scheme reduces the communication costs.
As an example, let us assume a storage system that wants 99% availability for its replicated files (P_a = 0.99). Looking at Figure 6 we see that a hybrid scheme (replication + MSR codes) reduces the storage costs compared to replication only when the overall retrieve probability P_r is high enough, with the exact threshold depending on the node availability a. Since in general we always want strong guarantees that files are never lost, i.e., P_r very close to one, we can conclude that hybrid schemes reduce storage and communication costs for almost all practical scenarios.
It is interesting to note that all three subfigures in Figure 6 look very much alike. The reason is that the cost contribution of replication is significantly higher than the cost contribution of the coding (see Section 5). Since we have demonstrated the cost efficiency of a hybrid scheme for the configuration that requires the largest number of replicas (see Table 6), a hybrid scheme will also reduce storage and communication costs for any system requiring fewer replicas.
7 Experimental Evaluation
In the previous sections we presented our generic storage model based on Regenerating Codes and analyzed analytically the storage and communication costs of MSR and MBR codes, as well as the efficiency of using these codes in hybrid redundancy schemes. In this section, we evaluate how the network traffic caused by repair processes affects the performance and scalability of the redundancy scheme. To that end, we assume a distributed storage system constrained by its network bandwidth: a system where storage nodes have low upload bandwidth and low online availabilities. For such a storage system we evaluate two measures that are difficult to obtain analytically: (i) the real bandwidth used by the repair process, i.e., the bandwidth utilization, and (ii) the repair time, i.e., the time required to download the fragments involved in a repair. In this way we can evaluate the impact of the repair degree on bandwidth utilization and system scalability.
Bandwidth utilization
Given the node upload bandwidth and the per-node required bandwidth, we can theoretically state that a feasible storage system must have a required bandwidth no larger than the upload bandwidth, and that the storage system reaches its maximum capacity when the two are equal. However, practical storage systems may not reach this maximum capacity because of system inefficiencies due to failed repairs or fragment retransmissions. To measure these inefficiencies, we compare the real bandwidth utilization with the theoretical bandwidth utilization.
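The feasibility condition above amounts to a simple ratio, sketched below (the function and variable names are ours; the paper's own symbols were lost in extraction):

```python
def theoretical_utilization(required_bw, upload_bw):
    """Theoretical bandwidth utilization: the ratio of the per-node
    required bandwidth to the node upload bandwidth. A feasible system
    needs a ratio <= 1; the system is at maximum capacity at exactly 1."""
    if upload_bw <= 0:
        raise ValueError("upload bandwidth must be positive")
    return required_bw / upload_bw

# A node pushing 18 KB/s of repair traffic over a 20 KB/s uplink runs
# at 90% theoretical utilization and is therefore feasible.
u = theoretical_utilization(18, 20)
print(u, u <= 1.0)
```

The measured (real) utilization of Section 7.2 can exceed this theoretical value because failed repairs and retransmissions add traffic that the ratio does not account for.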
Repair time
The repair time depends on the repair bandwidth, the repair degree, and the probability of finding a node online [Pamies-Juarez et al. (2010)]. We showed in Section 5 that increasing the repair degree reduces the repair bandwidth (see eqs. (1) and (2)), which should intuitively reduce repair times. However, since the system only guarantees a limited number of online nodes, contacting many nodes may require waiting for nodes to come back online, which causes longer repair times. In previous sections we considered only two repair degrees; in this section we analyze how different repair degrees affect repair times and bandwidth utilization.
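The waiting effect can be made concrete under an independent-availability assumption (ours, not the paper's exact model): the probability that a repair of degree d finds fewer than d of the n fragment holders online, and therefore has to wait, is a binomial tail:

```python
from math import comb

def prob_repair_waits(n, d, a):
    """Probability that fewer than d of the n fragment holders are online,
    forcing a degree-d repair to wait for nodes to come back online.
    Assumes each node is online independently with probability a."""
    return sum(comb(n, i) * a**i * (1 - a)**(n - i) for i in range(d))

# With n = 24 fragments and 50% node availability, a repair that must
# contact 20 nodes waits far more often than one that contacts only 12.
print(prob_repair_waits(24, 12, 0.5), prob_repair_waits(24, 20, 0.5))
```

This tail grows sharply as d approaches n, which is consistent with the exponential growth of repair times observed for large repair degrees in Section 7.2.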
7.1 Simulator Set-Up
We implemented an event-based simulator of a dynamic storage infrastructure. Initially, the simulator starts with an initial set of storage nodes. New node arrivals follow a Poisson process with a given average inter-arrival time; node departures follow a Poisson process with the same average inter-departure time, so the expected system size remains stable. Once a node joins the system, it draws its lifetime from an exponential distribution. During its lifetime in the system, a node alternates between online and offline sessions, drawing the online and offline durations from two exponential distributions parameterized by the base time and the node online availability. Using the mean of the exponential distribution, we can compute the average duration of the online and offline periods (in hours):
(17)  
(18) 
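Since the exact session parameters above were lost, the sketch below assumes mean online duration a*b and mean offline duration (1-a)*b for availability a and base time b (our reading of the model); under this parameterization the long-run fraction of time a node spends online is exactly a, independent of b:

```python
import random

def empirical_availability(a, b_hours, sessions=200_000, seed=7):
    """Alternate exponential online/offline sessions and return the
    empirical fraction of time spent online. The means a*b and (1-a)*b
    are our assumed parameterization of the churn model."""
    rng = random.Random(seed)
    online = offline = 0.0
    for _ in range(sessions):
        online += rng.expovariate(1.0 / (a * b_hours))       # online session
        offline += rng.expovariate(1.0 / ((1 - a) * b_hours))  # offline session
    return online / (online + offline)

# The measured availability matches the configured one for any base time.
print(round(empirical_availability(0.5, 24), 2))
```

The base time b therefore controls only how often nodes switch between online and offline, not the overall availability, which matches how the two parameters are used independently in Section 7.2.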
The simulator implements a parameterized Regenerating Code. To cope with node failures, redundant blocks are repaired in a proactive manner following the algorithm defined in [Duminuco et al. (2007)]: the simulator proactively generates new redundant blocks at a constant rate, creating, for each stored object, a new redundant block every fixed period. To balance the amount of data assigned to each node, each repair is assigned to the online node that is least loaded in terms of the number of stored blocks and the number of ongoing repairs.
If the repair node disconnects during a repair process, the repair is aborted and restarted at another online node. Similarly, when a node uploading data disconnects, the partially uploaded data is discarded and the repair node starts a block retrieval from another online node.
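The load-balancing rule described above can be sketched as follows (the node representation and key names are ours, not the simulator's):

```python
def pick_repair_node(online_nodes):
    """Assign the next repair to the online node that is least loaded,
    counting both stored blocks and repairs already in progress."""
    if not online_nodes:
        raise ValueError("no online node available; the repair is delayed")
    return min(online_nodes,
               key=lambda n: n["stored_blocks"] + n["active_repairs"])

nodes = [
    {"id": "n1", "stored_blocks": 10, "active_repairs": 1},
    {"id": "n2", "stored_blocks": 8, "active_repairs": 0},
    {"id": "n3", "stored_blocks": 8, "active_repairs": 2},
]
print(pick_repair_node(nodes)["id"])  # n2 carries the lowest total load
```

If the chosen node disconnects mid-repair, the simulator simply re-runs this selection over the nodes that are still online, as described above.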
In all simulations, the number of objects stored in the system is set to achieve a desired system utilization. Given the target utilization, the number of stored objects is obtained using the following two expressions:
(19)  
(20) 
These formulas are obtained by taking the definition of utilization, substituting it into eq. (9), and solving for the number of stored objects.
We set the online node availability and the retrieval degree, and use eq. (4) to compute the minimum number of redundant blocks required to achieve the target retrieve probability.
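This computation can be sketched with the standard k-of-n availability model (our assumption for the symbols of eq. (4), which were lost in extraction): the minimum n for which at least k of the n blocks are online with probability at least the target retrieve probability:

```python
from math import comb

def min_blocks(k, a, delta):
    """Smallest n such that P(at least k of n blocks online) >= delta,
    with independent per-node availability a (standard k-of-n model)."""
    n = k
    while sum(comb(n, i) * a**i * (1 - a)**(n - i)
              for i in range(k, n + 1)) < delta:
        n += 1
    return n

# For k = 2 and 90% node availability, 4 blocks reach 99% retrievability.
print(min_blocks(2, 0.9, 0.99))
```

Lower node availabilities inflate n quickly: at 50% availability the same k = 2 and a 90% target already require 7 blocks, which is why the simulated low-availability system needs substantial redundancy.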
Finally, the node upload bandwidth is set to 20KB/sec, allowing only one concurrent upload per node. To simulate asymmetric network bandwidth, we allow up to 3 concurrent downloads per node, giving a maximum download bandwidth of 60KB/sec.
7.2 Impact of the Repair Degree
In Figure 7 we measure the effect of the repair degree on the system utilization and on the repair times. In this experiment, we set the object size and the base time so that on average nodes connect and disconnect once per day. The number of stored objects is set to achieve the target bandwidth utilization. Figure 7(c) shows the number of stored objects and Figure 7(d) the storage space required. Figures 7(a) and 7(b) show that small repair degrees allow the system to keep the bandwidth utilization on target and ensure low repair times. However, beyond a certain repair degree, repair times start to increase exponentially.
It is interesting to see that when repair times are long, nodes executing repairs may not finish them before disconnecting, since the repair times become longer than the online sessions. In this case, failed repairs are reallocated and restarted on other online nodes. These unsuccessful repairs cause useless traffic that in turn increases the real bandwidth utilization. In Figure 7(a) we can see that once repair times grow larger than online sessions, the utilization increases beyond its target. It is important to note that these longer repair times can jeopardize the reliability of the system: large repair degrees can cause most repairs to fail, reducing the number of available blocks and hence the probability of successfully accessing stored files.
To investigate the increase of bandwidth utilization in detail, we analyze in Figure 8 the performance of the storage system at the point where repair times begin to increase. At this point we evaluate repair times and bandwidth utilization for different base times. As the base time increases, online sessions become longer and fewer repairs need to be restarted, which should reduce the bandwidth utilization. We can see this effect in Figure 8(a): larger base times reduce the bandwidth utilization of the system. Due to this reduction in utilization, repair times also decrease slightly, as shown in Figure 8(b).
7.3 Scalability
Beyond the impact of the repair degree and the base time, we also analyze the behavior of the storage system under different target bandwidth utilizations. In Figure 9 we plot the measured utilization and repair times for a wide range of target utilizations. We set the size of the stored objects to 120MB and increase the number of stored objects to achieve different utilizations, using a small repair degree. In Figure 9a we see that the measured utilization is nearly the same as the target utilization. This is because the small repair degree yields short repair times, and repairs typically finish before nodes go offline. However, in Figure 9b we can see that at high bandwidth utilization the saturation of the node upload queues increases repair times significantly.
In Figure 10 we plot the same metrics as in Figure 9, but for a larger repair degree. Increasing the repair degree causes longer retrieval times; however, as we saw in Figure 7, repairs remain short enough to guarantee that the utilization is not affected. Moreover, by increasing the repair degree we can store one order of magnitude more objects on the same system configuration, namely 6452 (MSR) instead of 683 (MSR).
Finally, in Figure 11 we analyze the impact of the object size on bandwidth utilization and repair times. For each object size we set the number of stored objects to achieve a fixed target bandwidth utilization. Since the utilization is the same for all object sizes, the number of stored objects decreases as the object size increases (Figure 11c). Independently of the object size, the total amount of stored data remains constant: 774GB for MSR codes and 1206GB for MBR codes. We can also see in Figure 11a that the measured bandwidth utilization is independent of the object size. However, as expected, Figure 11b shows that larger objects take longer to repair.
8 Conclusions
In this paper we evaluated redundancy schemes for distributed storage systems in order to gain a clearer understanding of their cost tradeoffs. Specifically, we analyzed the performance of the generic family of erasure codes called Regenerating Codes [Dimakis et al. (2007)], and the use of Regenerating Codes in hybrid redundancy schemes. For each parameter combination we analytically derived the storage and communication costs of Regenerating Codes. Our cost analysis is novel in that it takes into account the effects of online node availabilities and node lifetimes. Additionally, we used an event-based simulator to evaluate the effects of network utilization on the scalability of different redundancy configurations. Our main results are as follows:

- Compared to simple replication, the use of Regenerating Codes can reduce the costs of a storage system (storage and communication costs) by 20% up to 80%.

- The optimal value of the retrieval degree depends on the online node availability, and differs between systems with 99% node availability and systems with 50% node availability. Once the retrieval degree is fixed, storage systems with limited storage capacity can maximize their storage capacity by adopting MSR codes. On the other hand, systems with limited communication bandwidth can maximize their storage capacity by adopting MBR codes.

- High repair degrees reduce the overall communication costs but may increase repair times significantly, which can lead to data loss. We experimentally found that the repair degree should be small enough to ensure that repair times remain shorter than the online session durations of the nodes.

- Finally, in storage systems where access to whole file replicas is required, we showed that hybrid schemes combining replication and MSR codes are more cost-efficient than simple replication.
This work was done during a visit of the first author at Eurecom in Spring 2010. We would like to thank Matteo Dell’Amico and Zhen Huang for their helpful comments and their support of this research.
References
 Bhagwan et al. (2002) Bhagwan, R., Moore, D., Savage, S., and Voelker, G. 2002. Replication strategies for highly available peer-to-peer storage systems. In Proceedings of FuDiCo: Future Directions in Distributed Computing.
 Blake and Rodrigues (2003) Blake, C. and Rodrigues, R. 2003. High availability, scalable storage, dynamic peer networks: Pick two. In Proceedings of the 9th Workshop on Hot Topics in Operating Systems (HotOS).
 Borthakur (2007) Borthakur, D. 2007. The Hadoop distributed file system: Architecture and design.
 Datta and Aberer (2006) Datta, A. and Aberer, K. 2006. Internet-scale storage systems under churn: A study of the steady-state using Markov models. In Proceedings of the 6th Intl. Conference on Peer-to-Peer Computing (P2P).
 Dimakis et al. (2007) Dimakis, A., Godfrey, P., Wainwright, M., and Ramchandran, K. 2007. Network coding for distributed storage systems. In Proceedings of the 26th IEEE Intl. Conference on Computer Communications (INFOCOM).
 Dimakis et al. (2010) Dimakis, A. G., Ramchandran, K., Wu, Y., and Suh, C. 2010. A survey on network codes for distributed storage. arXiv:1004.4438v1 [cs.IT].
 Duminuco and Biersack (2009) Duminuco, A. and Biersack, E. 2009. A practical study of regenerating codes for peer-to-peer backup systems. In Proceedings of the 29th IEEE Intl. Conference on Distributed Computing Systems (ICDCS).
 Duminuco and Biersack (2008) Duminuco, A. and Biersack, E. W. 2008. Hierarchical codes: How to make erasure codes attractive for peer-to-peer storage systems. In Proceedings of the 8th Intl. Conference on Peer-to-Peer Computing (P2P).
 Duminuco et al. (2007) Duminuco, A., Biersack, E. W., and En-Najjary, T. 2007. Proactive replication in distributed storage systems using machine availability estimation. In Proceedings of the 3rd CoNEXT Conference (CoNEXT).
 Fan et al. (2009) Fan, B., Tantisiriroj, W., Xiao, L., and Gibson, G. 2009. DiskReduce: RAID for data-intensive scalable computing. In Proceedings of the 4th Annual Workshop on Petascale Data Storage (PDSW).
 Ford et al. (2010) Ford, D., Labelle, F., Popovici, F. I., Stokely, M., Truong, V.-A., Barroso, L., Grimes, C., and Quinlan, S. 2010. Availability in globally distributed storage systems. In Proceedings of the 9th USENIX Conference on Operating Systems Design and Implementation (OSDI).
 Ghemawat et al. (2003) Ghemawat, S., Gobioff, H., and Leung, S. 2003. The Google file system. In Proceedings of the ACM Symposium on Operating Systems Principles (SOSP).
 Guerra et al. (2010) Guerra, J., Belluomini, W., Glider, J., Gupta, K., and Pucha, H. 2010. Energy proportionality for storage: Impact and feasibility. ACM SIGOPS Operating Systems Review 44, 1.
 Guha et al. (2006) Guha, S., Daswani, N., and Jain, R. 2006. An experimental study of the Skype peer-to-peer VoIP system. In Proceedings of the 5th Intl. Workshop on Peer-to-Peer Systems (IPTPS).
 Haeberlen et al. (2005) Haeberlen, A., Mislove, A., and Druschel, P. 2005. Glacier: Highly durable, decentralized storage despite massive correlated failures. In Proceedings of the 2nd Symposium on Networked Systems Design and Implementation (NSDI). USENIX Association, Berkeley, CA, USA, 143–158.
 Hastorun et al. (2007) Hastorun, D., Jampani, M., Kakulapati, G., Pilchin, A., Sivasubramanian, S., Vosshall, P., and Vogels, W. 2007. Dynamo: Amazon's highly available key-value store. In Proceedings of the Symposium on Operating Systems Principles (SOSP).
 Kiran et al. (2004) Kiran, R. B., Tati, K., Cheng, Y.-C., Savage, S., and Voelker, G. M. 2004. Total Recall: System support for automated availability management. In Proceedings of the Symposium on Networked Systems Design and Implementation (NSDI).
 Kubiatowicz et al. (2000) Kubiatowicz, J., Bindel, D., Chen, Y., Czerwinski, S., Eaton, P., Geels, D., Gummadi, R., Rhea, S., Weatherspoon, H., Weimer, W., Wells, C., and Zhao, B. 2000. OceanStore: An architecture for global-scale persistent storage. In Proceedings of the 9th Intl. Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS).
 Li et al. (2010) Li, J., Yang, S., Wang, X., Xue, X., and Li, B. 2010. Tree-structured data regeneration in distributed storage systems with regenerating codes. In Proceedings of the 29th IEEE Intl. Conference on Computer Communications (INFOCOM).
 Lin et al. (2004) Lin, W. K., Chiu, D. M., and Lee, Y. B. 2004. Erasure code replication revisited. In Proceedings of the 4th Intl. Conference on Peer-to-Peer Computing (P2P).
 Pamies-Juarez and García-López (2010) Pamies-Juarez, L. and García-López, P. 2010. Maintaining data reliability without availability in P2P storage systems. In Proceedings of the 25th Symposium on Applied Computing (SAC).
 Pamies-Juarez et al. (2010) Pamies-Juarez, L., García-López, P., and Sánchez-Artigas, M. 2010. Availability and redundancy in harmony: Measuring retrieval times in P2P storage systems. In Proceedings of the 10th IEEE Intl. Conference on Peer-to-Peer Computing (P2P).
 Plank and Thomason (2004) Plank, J. and Thomason, M. 2004. A practical analysis of low-density parity-check erasure codes for wide-area storage applications. In Proceedings of the Intl. Conference on Dependable Systems and Networks (DSN).
 Reed and Solomon (1960) Reed, I. and Solomon, G. 1960. Polynomial codes over certain finite fields. Journal of the Society for Industrial and Applied Mathematics 8, 2, 300–304.
 Rodrigues and Liskov (2005) Rodrigues, R. and Liskov, B. 2005. High availability in DHTs: Erasure coding vs. replication. In Proceedings of the 4th Intl. Workshop on Peer-to-Peer Systems (IPTPS).
 Schmuck and Haskin (2002) Schmuck, F. and Haskin, R. 2002. GPFS: A shared-disk file system for large computing clusters. In Proceedings of the 1st USENIX Conference on File and Storage Technologies (FAST).
 Sit et al. (2006) Sit, E., Haeberlen, A., Dabek, F., Chun, B., Weatherspoon, H., Morris, R., Kaashoek, M. F., and Kubiatowicz, J. 2006. Proactive replication for data durability. In Proceedings of the 5th Intl. Workshop on Peer-to-Peer Systems (IPTPS).
 Steiner et al. (2007) Steiner, M., En-Najjary, T., and Biersack, E. 2007. A global view of KAD. In Proceedings of the 7th ACM SIGCOMM Conference on Internet Measurement (IMC).
 Toka et al. (2010) Toka, L., Dell'Amico, M., and Michiardi, P. 2010. Online data backup: A peer-assisted approach. In Proceedings of the 10th IEEE Intl. Conference on Peer-to-Peer Computing (P2P).
 Utard and Vernois (2004) Utard, G. and Vernois, A. 2004. Data durability in peer-to-peer storage systems. In Proceedings of the 4th IEEE Intl. Symposium on Cluster Computing and the Grid (CCGrid).
 Weatherspoon and Kubiatowicz (2002) Weatherspoon, H. and Kubiatowicz, J. D. 2002. Erasure coding vs. replication: A quantitative comparison. In Proceedings of the 1st Intl. Workshop on Peer-to-Peer Systems (IPTPS).
 Wu et al. (2005) Wu, F., Qiu, T., Chen, Y., and Chen, G. 2005. Redundancy schemes for high availability in DHTs. In Proceedings of the 3rd Intl. Symposium on Parallel and Distributed Processing and Applications (ISPA).
 WuaLa (2010) WuaLa. 2010. Wuala. http://www.wuala.com.
 Yao et al. (2006) Yao, Z., Leonard, D., Derek, X., Wang, X., and Loguinov, D. 2006. Modeling heterogeneous user churn and local resilience of unstructured P2P networks. In Proceedings of the IEEE Intl. Conference on Network Protocols (ICNP).
 Zhang et al. (2010) Zhang, Z., Deshpande, A., Ma, X., Thereska, E., and Narayanan, D. 2010. Does erasure coding have a role to play in my data center? Tech. Rep. MSR-TR-2010-52, Microsoft Research.