
# On Distributed Storage Allocations of Large Files for Maximum Service Rate

Pei Peng and Emina Soljanin are with the Department of Electrical and Computer Engineering, Rutgers University, Piscataway, NJ 08854 USA. pei.peng@rutgers.edu; emina.soljanin@rutgers.edu
###### Abstract

Allocation of (redundant) file chunks throughout a distributed storage system affects important performance metrics such as the probability of file recovery, data download time, or the service rate of the system under a given data access model. This paper is concerned with the service rate under the assumption that the stored data is large and its download time is not negligible. We focus on quasi-uniform storage allocations and provide a service rate analysis for two common data access models. We find that the optimal allocation varies in accordance with different system parameters. This was not the case under the assumption that the download time does not scale with the size of data, where the minimal spreading allocation was previously found to be universally optimal.

## I Introduction

Distributed storage systems (DSSs) are, in various guises, an integral part of different computing and content-providing environments such as cloud data centers, caching edge networks, and more recently, fog systems. Their purpose is to ensure reliable storage and/or quick access to data by end users or computing processes. Today, both goals are being increasingly addressed by storing data redundantly, either by replication or erasure coding. This paper is concerned with allocations of redundant data chunks throughout a DSS that ensure maximum data access service rate.

Most of the work on data access in DSSs is concerned with the download latency (see e.g., [1, 2, 3, 4] and references therein). It has recently been recognized that another important metric that measures the availability of the stored data is the service rate [5, 6]. Maximizing the service rate (or the throughput) of a distributed system helps support a large number of simultaneous system users. Rate-optimal strategies are also latency-optimal in high traffic. Thus, maximizing the service rate also reduces the latency experienced by users, particularly in highly contending scenarios.

This paper adopts a DSS model originally proposed in [7]. In this model, a file is split into multiple chunks, and (replication or coded) redundancy is introduced at some fixed level determined by the storage budget that the DSS has for the file. This total storage is the only constraint, and there is no limit on how many chunks a particular node can store as long as it stays within the budget. Data retrieval is attempted according to certain limited access models.

Several studies have looked into how to allocate redundant chunks of data over the storage nodes, focusing mostly on optimizing two DSS performance metrics [7, 8, 9, 10]. One of them is the probability of successful data recovery when only a subset of (possibly failed) nodes are accessed, and the other is the average download time when a set of nodes from which the file can certainly be recovered is accessed. Finding these quantities has proven to be quite challenging, and optimal allocations are known only in some special cases. Some versions of this problem are related to a long-standing conjecture by Erdős on the maximum number of edges in a uniform hypergraph [11].

In general, both measures are of interest and should be simultaneously taken into account. Increasing the chance of successfully downloading a file, while desirable, should not come at the cost of intolerable delivery delay. Moreover, in practice, we may often want to partially sacrifice a successful but tardy data delivery to some users in order to ensure that other users who can receive the data are indeed served fast.

Note that, depending on the allocation, some subsets of nodes may not contain enough file chunks between them to ensure data recovery, and accessing them will result in a service rate of zero for the system. On the other hand, again depending on the allocation, some subsets of nodes will contain redundant file chunks, and that redundancy (superfluous for file recovery) can be exploited to increase the service rate. These issues were first addressed in [5], where a non-intuitive conclusion was reached: the allocation that maximizes the probability of successful data recovery is often not the one that maximizes the average service rate.

Depending on the number of storage nodes and the allocated redundancy budget, it may be beneficial for recovery to maximally spread the redundant file chunks over the nodes, whereas concentrating the redundant chunks (minimum spreading) may increase the expected service rate. The work of [5] assumes that the download time is random because of independent workload fluctuations inherent to the system, and does not depend on the size of the data being downloaded. We here assume that the stored data is large, its download time is not negligible and scales with the size of the data. We find that the optimal allocation varies in accordance with different system parameters, which was not the case in [5], where the minimal spreading allocation was found to be universally optimal.

The paper is organized as follows. A DSS model and problem formulation are given in Sec. II. Service rate analysis considering the effect of access model and the success of serving a request is presented in Sec. III. Some numerical examples and further discussion are provided in Sec. IV.

## II System Model

### II-A Storage Model

A file consisting of $k$ blocks is to be redundantly stored over a DSS with $N$ storage nodes. To protect the data against node failures or unavailability, the file is encoded by an MDS code into $n$ ($n \ge k$) encoded blocks so that any $k$ of them are sufficient to recover the original file. The encoded blocks are partitioned into subsets $S_i$ for $i = 1, \dots, N$, where $|S_i| = s_i$, and thus $\sum_{i=1}^{N} s_i = n$. We refer to such a partitioning as an allocation. The blocks in $S_i$ are stored at storage node $i$.

We are concerned with quasi-symmetric allocations [9], where a node either stores a constant number of blocks ($k/\alpha$, for an integer parameter $\alpha$) or no blocks at all. (Dealing with a general storage allocation optimization problem is computationally difficult for a general setup, see e.g., [8].) With a storage budget of $n = mk$ blocks, a quasi-symmetric allocation with parameter $\alpha$ thus places data on $\alpha m$ of the $N$ nodes. Fig. 1 depicts an example of a quasi-symmetric allocation.

We refer to a quasi-symmetric allocation with $\alpha = 1$ as minimal spreading [8]. Note that for a minimal spreading allocation, the file blocks are simply replicated over $m$ storage nodes. Similarly, an allocation with $\alpha m = N$ will be referred to as a maximal spreading allocation, since the file chunks are spread over all nodes in the system.
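For concreteness, a quasi-symmetric allocation can be generated in a few lines of Python. This is a sketch under our reading of the model (the function name and parameter order are ours); it assumes $\alpha$ divides $k$ and $\alpha m \le N$:

```python
def quasi_symmetric_allocation(N, k, m, alpha):
    """Build the storage vector (s_1, ..., s_N) of a quasi-symmetric
    allocation: alpha*m nodes each store k/alpha coded blocks and the
    remaining nodes store nothing, so the total equals the budget m*k."""
    assert k % alpha == 0 and alpha * m <= N
    per_node = k // alpha
    s = [per_node] * (alpha * m) + [0] * (N - alpha * m)
    assert sum(s) == m * k  # stays exactly within the storage budget
    return s
```

Setting `alpha = 1` yields minimal spreading ($m$ replicas of the whole file), while `alpha = N // m` yields maximal spreading.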

### II-B Data Access and Delivery Models

Fixed-size Access: In this model, the download request is forwarded to a random $r$-node subset $A$ of the $N$ storage nodes [8, 9]. Therefore, the access to a given $r$-subset $A$ results in the successful recovery of the data iff the nodes in $A$ jointly contain at least $k$ coded blocks:

$$\sum_{i\in A} s_i \ge k. \tag{1}$$

Note that for $r < \alpha$, it is impossible to recover the data. Thus, we only consider the case $r \ge \alpha$.
Probabilistic Access: In this model, the download request is forwarded to all nodes that store the data. However, the request to a node fails with probability $p$. Assuming that $A$ represents the set of nodes that are successfully accessed, the condition for data recovery is also (1). In this case, the accessed set $A$ is random. In this access model, $|A|$ is Binomially distributed: each of the $\alpha m$ data-holding nodes is reached independently with probability $1-p$.
Regardless of the access model, for an accessed subset of nodes $A$, we denote the number of nodes containing data by $\varphi(A)$. For instance, in Fig. 1, three nodes are accessed while only $\varphi(A)$ of them have data. For a quasi-symmetric allocation with parameter $\alpha$, data recovery from this subset is successful iff $\varphi(A) \ge \alpha$. The probability of successful file recovery under an allocation is, therefore, given by

$$P_s(\alpha) = \sum_{A:\,\sum_{i\in A} s_i \ge k} P(A), \tag{2}$$

where $P(A)$ is the probability of accessing $A$. Note that the sum goes over all sets $A$ that satisfy condition (1).
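Under the fixed-size access model, the sum in (2) reduces to a hypergeometric tail over the $\alpha m$ data-holding nodes. As a quick numerical sketch (the function name is ours):

```python
from math import comb

def p_success_fixed(N, m, alpha, r):
    """Probability (2) of successful recovery under fixed-size access:
    a uniformly random r-subset of the N nodes must contain at least
    alpha of the alpha*m data-holding nodes."""
    data_nodes = alpha * m
    hits = range(alpha, min(r, data_nodes) + 1)
    return sum(comb(data_nodes, f) * comb(N - data_nodes, r - f)
               for f in hits) / comb(N, r)
```

For example, with $N=4$, $m=2$, $\alpha=1$, and $r=1$, the request succeeds iff the single accessed node is one of the two replicas, i.e., with probability $1/2$.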

### II-C Service Models

We assume a request is simultaneously served by all nodes in the accessed set that contain data, where each node takes some i.i.d. random time to deliver its blocks. In the fixed-size access model $|A| = r$, while in the probabilistic access model $|A|$ is a Binomial random variable, as described above. Note that the file can be reconstructed when the accessed nodes jointly deliver $k$ encoded blocks. We here limit our study to the case where a node has to deliver all its blocks for the download to count.

For a quasi-symmetric allocation with parameter $\alpha$, the download request can be served iff $\varphi(A) \ge \alpha$, and it completes as soon as all blocks are downloaded from any $\alpha$ out of the $\varphi(A)$ nodes with data. Therefore, the average download time is the expected value of the $\alpha$-th order statistic of the waiting times at the storage nodes.
Scaled Exponential Service: In this model [1], a node delivers the first file block in some exponential random time, and each subsequent block in the same time. We assume that a node storing the whole file delivers all of its blocks in a random time exponentially distributed with mean $1/\mu$. It is easy to see that, equivalently, a node storing a fraction $1/\alpha$ of the file delivers all of its blocks in a random time exponentially distributed with mean $1/(\alpha\mu)$. (Recall that in [5], the download times are assumed to be exponential i.i.d. and independent of the size of the data being downloaded.) For this model, we have

$$T_s(\alpha \mid \varphi(A)) = \frac{1}{\alpha\mu}\left(H_{\varphi(A)} - H_{\varphi(A)-\alpha}\right), \tag{3}$$

where $H_j$ denotes the $j$-th harmonic number, and the factor $\alpha\mu$ comes from the service rate scaling discussed above. The corresponding service rate from set $A$ (with $\varphi(A)$ nodes containing data) is

$$\mu_\alpha(A) = \frac{1}{T_s(\alpha\mid\varphi(A))} = \frac{\alpha\mu}{H_{\varphi(A)} - H_{\varphi(A)-\alpha}}. \tag{4}$$

It is not hard to see that

$$\mu\,\varphi(A) \;\ge\; \mu_s(\alpha\mid\varphi(A)) \;\ge\; \mu\left(\varphi(A)-\alpha+1\right). \tag{5}$$
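The rate (4) and the bounds (5) are easy to check numerically. A small sketch (helper names are ours):

```python
from fractions import Fraction

def harmonic(n):
    """n-th harmonic number H_n, computed exactly."""
    return sum(Fraction(1, i) for i in range(1, n + 1))

def mu_scaled(mu, alpha, phi):
    """Service rate (4): alpha*mu / (H_phi - H_{phi-alpha})."""
    return alpha * mu / float(harmonic(phi) - harmonic(phi - alpha))
```

For $\mu = 1$, $\alpha = 2$, $\varphi(A) = 3$, this gives $2/(1/2 + 1/3) = 2.4$, which indeed lies between $\mu(\varphi(A)-\alpha+1) = 2$ and $\mu\,\varphi(A) = 3$, as (5) asserts.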

Shifted Exponential Service: In this model [1], delivery consists of two steps: first, the node takes an exponential random time to process the request; second, the node takes a constant time proportional to the number of blocks to deliver them to the user. Therefore, the two-step delivery time for a node storing a fraction $1/\alpha$ of the file can be modeled by the shifted exponential distribution with rate $\mu$ and shift parameter $\Delta/\alpha$. For this model, we have

$$T_s(\alpha\mid\varphi(A)) = \frac{\Delta}{\alpha} + \frac{1}{\mu}\left(H_{\varphi(A)} - H_{\varphi(A)-\alpha}\right), \tag{6}$$

where $\Delta/\alpha$ comes from the service time shift discussed above. The corresponding service rate from set $A$ is

$$\mu_\alpha(A) = \frac{\alpha\mu}{\Delta\mu + \alpha\left(H_{\varphi(A)} - H_{\varphi(A)-\alpha}\right)}. \tag{7}$$

As $\alpha \le \varphi(A) \le \alpha m$, it is not hard to see that

$$\frac{\mu\,\varphi(A)}{\Delta\mu+\alpha} \;\ge\; \mu_s(\alpha\mid\varphi(A)) \;\ge\; \frac{\alpha\mu\left(\varphi(A)-\alpha+1\right)}{\Delta\mu(\alpha m-\alpha+1)+\alpha^2}. \tag{8}$$

DSS Service Rate: Under an $\alpha$-allocation, the DSS service rate is given by

$$\mu_s(\alpha) = \sum_{A:\,\sum_{i\in A} s_i \ge k} P(A)\,\mu_\alpha(A), \tag{9}$$

where $\mu_\alpha(A)$ is the service rate when the set of accessed nodes is $A$, given by (4) or (7), and $P(A)$ is the probability of accessing set $A$, as in (2).

### II-D Preview of the Results and Future Work

We argue that finding the $\alpha$ that maximizes (9) is hard. We prove that $\mu_s(\alpha)$ is not always maximal for $\alpha = 1$, and we specify two system parameter regions where 1) $\mu_s(\alpha) < \mu_s(1)$ and 2) $\mu_s(\alpha) > \mu_s(1)$. We numerically analyze the optimal storage allocation. We find that the performance metrics $\mu_s(\alpha)$ (the service rate) and $P_s(\alpha)$ (the probability of successful recovery) may exhibit different trends with changing allocations. We make conjectures on how the optimal storage allocation changes with the parameters $r$, $p$, and $\Delta$, which we will try to prove in future work.

## III System Performance Analysis

### III-A Fixed-size Access and Scaled Exponential Service

###### Claim 1

For the fixed-size access model under the scaled exponential distribution, the DSS service rate (9) becomes

$$\mu_s(\alpha) = \frac{\alpha\mu}{\binom{N}{r}} \sum_{\varphi=\alpha}^{\min(r,\alpha m)} \frac{1}{H_\varphi - H_{\varphi-\alpha}} \binom{\alpha m}{\varphi}\binom{N-\alpha m}{r-\varphi}.$$

The claim follows from the assertions in Sec. II. We see that finding the $\alpha$ that maximizes $\mu_s(\alpha)$ is hard. Instead, we prove below that $\alpha = 1$ is not always optimal.
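Although a closed-form maximizer is elusive, the expression in Claim 1 is straightforward to evaluate. The following sketch (function names are ours) sums the per-subset rates weighted by the hypergeometric access probabilities:

```python
from math import comb
from fractions import Fraction

def harmonic(n):
    return sum(Fraction(1, i) for i in range(1, n + 1))

def mu_s_fixed_scaled(N, m, alpha, r, mu=1.0):
    """DSS service rate of Claim 1: fixed-size access to r of N nodes,
    scaled exponential service, quasi-symmetric allocation alpha."""
    dn = alpha * m                       # number of nodes holding data
    total = comb(N, r)
    acc = 0.0
    for phi in range(alpha, min(r, dn) + 1):
        weight = comb(dn, phi) * comb(N - dn, r - phi) / total
        acc += weight * alpha * mu / float(harmonic(phi) - harmonic(phi - alpha))
    return acc
```

A sanity check: for $\alpha = 1$ the formula collapses to $\mu m r / N$, i.e., $\mu$ times the mean of the hypergeometric number of data nodes hit.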

###### Theorem 1

For the fixed-size access model, when the waiting time of each node follows the scaled exponential distribution, the optimal $\mu_s(\alpha)$ isn't always reached at $\alpha = 1$.

Proof:

To prove that $\alpha = 1$ is not the optimal choice for all parameters, we will show that 1) there is a region of $r$ values s.t. $\mu_s(\alpha) < \mu_s(1)$ and 2) there is a region of $r$ values s.t. $\mu_s(\alpha) > \mu_s(1)$, for $\alpha > 1$.
1) We bound $\mu_s(\alpha)$ from above as follows. According to (5),

$$\begin{aligned}
\mu_s(\alpha) &< \frac{\mu}{\binom{N}{r}} \sum_{\varphi=\alpha}^{\min(r,\alpha m)} \varphi \binom{\alpha m}{\varphi}\binom{N-\alpha m}{r-\varphi} \\
&= \frac{\mu\alpha m}{\binom{N}{r}} \sum_{\varphi=\alpha}^{\min(r,\alpha m)} \binom{\alpha m-1}{\varphi-1}\binom{N-\alpha m}{r-\varphi} \\
&= \frac{\mu\alpha m}{\binom{N}{r}} \sum_{\varphi=\alpha}^{\min(r,\alpha m)} \prod_{i=0}^{\alpha-2}\frac{\alpha m-1-i}{\varphi-1-i}\binom{\alpha m-\alpha}{\varphi-\alpha}\binom{N-\alpha m}{r-\varphi}
\end{aligned}$$

Since $\varphi$ goes from $\alpha$ to $\min(r, \alpha m)$, each factor satisfies $\frac{\alpha m-1-i}{\varphi-1-i} \le \frac{\alpha m-1-i}{\alpha-1-i}$, and we further have

$$\begin{aligned}
\mu_s(\alpha) &< \frac{\mu\alpha m}{\binom{N}{r}} \sum_{\varphi=\alpha}^{\min(r,\alpha m)} \left(\prod_{i=0}^{\alpha-2}\frac{\alpha m-1-i}{\alpha-1-i}\right)\binom{\alpha m-\alpha}{\varphi-\alpha}\binom{N-\alpha m}{r-\varphi} \\
&= \frac{\mu\alpha m\binom{\alpha m-1}{\alpha-1}}{\binom{N}{r}} \sum_{\varphi=\alpha}^{\min(r,\alpha m)} \binom{\alpha m-\alpha}{\varphi-\alpha}\binom{N-\alpha m}{r-\varphi} \\
&= \frac{\mu\alpha m\binom{\alpha m-1}{\alpha-1}\binom{N-\alpha}{r-\alpha}}{\binom{N}{r}} \quad \text{(by Vandermonde's convolution)}
\end{aligned}$$

As we know,

$$\mu_s(1) = \sum_{\varphi=1}^{m} \mu_s(1\mid\varphi)\frac{\binom{m}{\varphi}\binom{N-m}{r-\varphi}}{\binom{N}{r}} = \frac{\mu}{\binom{N}{r}}\sum_{\varphi=1}^{m} \varphi\binom{m}{\varphi}\binom{N-m}{r-\varphi} = \frac{\mu m\binom{N-1}{r-1}}{\binom{N}{r}}.$$

Then, to satisfy $\mu_s(\alpha) < \mu_s(1)$, we need

$$\frac{\mu\alpha m\binom{\alpha m-1}{\alpha-1}\binom{N-\alpha}{r-\alpha}}{\binom{N}{r}} < \frac{\mu m\binom{N-1}{r-1}}{\binom{N}{r}} \;\Leftrightarrow\; \alpha\binom{\alpha m-1}{\alpha-1} < \prod_{i=0}^{\alpha-2}\frac{N-1-i}{r-1-i} \tag{10}$$

Since $N \ge r$, we have $\frac{N-1-i}{r-1-i} \ge \frac{N-1}{r-1}$ for $i \ge 0$. Inequality (10) is true when

$$\alpha\binom{\alpha m-1}{\alpha-1} < \left(\frac{N-1}{r-1}\right)^{\alpha-1} \;\Leftrightarrow\; r < 1 + \frac{N-1}{\sqrt[\alpha-1]{\alpha\binom{\alpha m-1}{\alpha-1}}}$$

Thus $\mu_s(\alpha) < \mu_s(1)$ for $r < 1 + (N-1)\Big/\sqrt[\alpha-1]{\alpha\binom{\alpha m-1}{\alpha-1}}$.
2) We bound $\mu_s(\alpha)$ from below as follows. According to (5),

$$\begin{aligned}
\mu_s(\alpha) &> \frac{\mu}{\binom{N}{r}} \sum_{\varphi=\alpha}^{\min(r,\alpha m)} (\varphi-\alpha+1)\binom{\alpha m}{\varphi}\binom{N-\alpha m}{r-\varphi} \\
&= \frac{\mu}{\binom{N}{r}} \sum_{\varphi=\alpha}^{\min(r,\alpha m)} (\varphi-\alpha+1)\prod_{i=0}^{\alpha-2}\frac{\alpha m-i}{\varphi-i}\binom{\alpha m-\alpha+1}{\varphi-\alpha+1}\binom{N-\alpha m}{r-\varphi}
\end{aligned}$$

Since $\varphi$ goes from $\alpha$ to $\min(r, \alpha m)$, each factor satisfies $\frac{\alpha m-i}{\varphi-i} \ge 1$, and we further have

$$\begin{aligned}
\mu_s(\alpha) &> \frac{\mu}{\binom{N}{r}} \sum_{\varphi=\alpha}^{\min(r,\alpha m)} (\varphi-\alpha+1)\binom{\alpha m-\alpha+1}{\varphi-\alpha+1}\binom{N-\alpha m}{r-\varphi} \\
&= \frac{\mu(\alpha m-\alpha+1)\binom{N-\alpha}{r-\alpha}}{\binom{N}{r}} \quad \text{(by Vandermonde's convolution)}
\end{aligned}$$

As we know, $\mu_s(1) = \mu m\binom{N-1}{r-1}\Big/\binom{N}{r}$. Then, to satisfy $\mu_s(\alpha) > \mu_s(1)$, we need

$$\frac{\mu(\alpha m-\alpha+1)\binom{N-\alpha}{r-\alpha}}{\binom{N}{r}} > \frac{\mu m\binom{N-1}{r-1}}{\binom{N}{r}} \;\Leftrightarrow\; \frac{\alpha m-\alpha+1}{m} > \prod_{i=0}^{\alpha-2}\frac{N-1-i}{r-1-i} \tag{11}$$

Since each factor satisfies $\frac{N-1-i}{r-1-i} \le \frac{N-\alpha+1}{r-\alpha+1}$, inequality (11) is true when

$$\frac{\alpha m-\alpha+1}{m} > \left(\frac{N-\alpha+1}{r-\alpha+1}\right)^{\alpha-1} \;\Leftrightarrow\; r > \sqrt[\alpha-1]{\frac{m}{\alpha m-\alpha+1}}\,(N-\alpha+1)+\alpha-1$$

Thus $\mu_s(\alpha) > \mu_s(1)$ when $r > \sqrt[\alpha-1]{\frac{m}{\alpha m-\alpha+1}}(N-\alpha+1)+\alpha-1$.

In the proof above, we found two regions of $r$ that show when $\mu_s(1)$ is or is not the maximum. Here we give some examples to analyze these two regions, with the parameter set given as $(N, m, \mu, \alpha)$. When the parameter set is (30,2,1,4), we get the two regions [4,6.5] for $\mu_s(\alpha) < \mu_s(1)$ and [22.9,30] for $\mu_s(\alpha) > \mu_s(1)$, and both regions exist. But there is a gap between the two regions, which means that when $r$ is in (6.5,22.9), we cannot decide whether $\mu_s(1)$ is the maximum or not. The large gap appears because the bounds used in the proof are not tight enough. When the parameter set is (30,3,1,5), we get another two regions, [4,4.4] for $\mu_s(\alpha) < \mu_s(1)$ and [22.7,30] for $\mu_s(\alpha) > \mu_s(1)$, and the first region does not exist for $r \ge \alpha$.
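The gap between the two proven regions can be explored numerically. The sketch below (function names are ours) evaluates the rate of Claim 1 and scans $r$ for the parameter set $(N,m,\mu,\alpha)=(30,2,1,4)$:

```python
from math import comb
from fractions import Fraction

def harmonic(n):
    return sum(Fraction(1, i) for i in range(1, n + 1))

def mu_s_fixed_scaled(N, m, alpha, r, mu=1.0):
    """Claim 1 rate: fixed-size access, scaled exponential service."""
    dn = alpha * m
    return sum(comb(dn, phi) * comb(N - dn, r - phi) / comb(N, r)
               * alpha * mu / float(harmonic(phi) - harmonic(phi - alpha))
               for phi in range(alpha, min(r, dn) + 1))

# Scan r for (N, m, mu, alpha) = (30, 2, 1, 4): for which r does the
# alpha = 4 allocation beat minimal spreading (alpha = 1)?
N, m, alpha = 30, 2, 4
better = [r for r in range(alpha, N + 1)
          if mu_s_fixed_scaled(N, m, alpha, r) > mu_s_fixed_scaled(N, m, 1, r)]
```

Consistent with the proof, `better` excludes all of $r \in [4, 6.5]$ and includes all of $r \in [22.9, 30]$; the scan also resolves the undecided gap for this particular parameter set.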

Although these two regions cannot settle the question for every parameter set, they still tell us something more about the system model, which we state in Conjecture 1.

###### Conjecture 1

When $r$ is small, $\alpha = 1$ is more likely to be the maximizer; when $r$ is large, the maximum is not at $\alpha = 1$, and the optimal $\alpha$ is increasing with $r$.

### III-B Probabilistic Access and Scaled Exponential Service

###### Claim 2

Under the probabilistic access model,

$$\mu_s(\alpha) = \sum_{\varphi=\alpha}^{\alpha m} \frac{\alpha\mu}{H_\varphi - H_{\varphi-\alpha}}\binom{\alpha m}{\varphi}(1-p)^{\varphi}p^{\alpha m-\varphi}.$$

The claim also follows from the assertions in Sec. II. We see that finding the $\alpha$ that maximizes $\mu_s(\alpha)$ is still hard. Therefore, we prove below that $\alpha = 1$ is not always optimal.
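As with Claim 1, the expression in Claim 2 is easy to evaluate numerically; relative to the fixed-size case, the hypergeometric access weights are simply replaced by Binomial ones. A sketch (function names are ours):

```python
from math import comb
from fractions import Fraction

def harmonic(n):
    return sum(Fraction(1, i) for i in range(1, n + 1))

def mu_s_prob_scaled(m, alpha, p, mu=1.0):
    """Claim 2 rate: each of the alpha*m data-holding nodes is reached
    independently with probability 1 - p; scaled exponential service."""
    dn = alpha * m
    return sum(comb(dn, phi) * (1 - p) ** phi * p ** (dn - phi)
               * alpha * mu / float(harmonic(phi) - harmonic(phi - alpha))
               for phi in range(alpha, dn + 1))
```

A sanity check: for $\alpha = 1$ the formula collapses to $\mu m (1-p)$, the mean of the Binomial number of reachable replicas times $\mu$.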

###### Theorem 2

For the probabilistic access model, when the waiting time of each node follows the scaled exponential distribution, the optimal $\mu_s(\alpha)$ isn't always reached at $\alpha = 1$.

Proof:

To prove that $\alpha = 1$ is not the optimal choice for all parameters, we will show that 1) there is a region of $p$ values s.t. $\mu_s(\alpha) < \mu_s(1)$ and 2) there is a region of $p$ values s.t. $\mu_s(\alpha) > \mu_s(1)$, for $\alpha > 1$.
1) We bound $\mu_s(\alpha)$ from above as follows. According to (5),

$$\begin{aligned}
\mu_s(\alpha) &< \mu\sum_{\varphi=\alpha}^{\alpha m}\varphi\binom{\alpha m}{\varphi}(1-p)^{\varphi}p^{\alpha m-\varphi} \\
&= \mu\alpha m\sum_{\varphi=\alpha}^{\alpha m}\binom{\alpha m-1}{\varphi-1}(1-p)^{\varphi}p^{\alpha m-\varphi} \\
&= \mu\alpha m\sum_{\varphi=\alpha}^{\alpha m}\left(\prod_{i=0}^{\alpha-2}\frac{\alpha m-1-i}{\varphi-1-i}\right)\binom{\alpha m-\alpha}{\varphi-\alpha}(1-p)^{\varphi}p^{\alpha m-\varphi}
\end{aligned}$$

Since $\varphi$ goes from $\alpha$ to $\alpha m$, each factor satisfies $\frac{\alpha m-1-i}{\varphi-1-i} \le \frac{\alpha m-1-i}{\alpha-1-i}$, and we have

$$\begin{aligned}
\mu_s(\alpha) &< \mu\alpha m\binom{\alpha m-1}{\alpha-1}(1-p)^{\alpha}\sum_{\varphi=0}^{\alpha m-\alpha}\binom{\alpha m-\alpha}{\varphi}(1-p)^{\varphi}p^{\alpha m-\alpha-\varphi} \\
&= \mu\alpha m\binom{\alpha m-1}{\alpha-1}(1-p)^{\alpha} \quad \text{(by the binomial expansion)}
\end{aligned}$$

As we know,

$$\mu_s(1) = \sum_{\varphi=1}^{m}\mu_s(1\mid\varphi)\binom{m}{\varphi}(1-p)^{\varphi}p^{m-\varphi} = \mu\sum_{\varphi=1}^{m}\varphi\binom{m}{\varphi}(1-p)^{\varphi}p^{m-\varphi} = \mu m(1-p).$$

Then, to satisfy $\mu_s(\alpha) < \mu_s(1)$, we need

$$\mu\alpha m\binom{\alpha m-1}{\alpha-1}(1-p)^{\alpha} < \mu m(1-p) \;\Leftrightarrow\; (1-p)^{\alpha-1} < \frac{1}{\alpha\binom{\alpha m-1}{\alpha-1}} \;\Leftrightarrow\; p > 1 - \frac{1}{\sqrt[\alpha-1]{\alpha\binom{\alpha m-1}{\alpha-1}}}$$

Thus $\mu_s(\alpha) < \mu_s(1)$ for $p > 1 - 1\Big/\sqrt[\alpha-1]{\alpha\binom{\alpha m-1}{\alpha-1}}$.
2) We bound $\mu_s(\alpha)$ from below as follows. According to (5),

$$\begin{aligned}
\mu_s(\alpha) &> \mu\sum_{\varphi=\alpha}^{\alpha m}(\varphi-\alpha+1)\binom{\alpha m}{\varphi}(1-p)^{\varphi}p^{\alpha m-\varphi} \\
&= \mu\sum_{\varphi=\alpha}^{\alpha m}(\varphi-\alpha+1)\left(\prod_{i=0}^{\alpha-2}\frac{\alpha m-i}{\varphi-i}\right)\binom{\alpha m-\alpha+1}{\varphi-\alpha+1}(1-p)^{\varphi}p^{\alpha m-\varphi}
\end{aligned}$$

Since $\varphi$ goes from $\alpha$ to $\alpha m$, each factor satisfies $\frac{\alpha m-i}{\varphi-i} \ge 1$, and we have

$$\begin{aligned}
\mu_s(\alpha) &> \mu\sum_{\varphi=\alpha}^{\alpha m}(\varphi-\alpha+1)\binom{\alpha m-\alpha+1}{\varphi-\alpha+1}(1-p)^{\varphi}p^{\alpha m-\varphi} \\
&= \mu(\alpha m-\alpha+1)(1-p)^{\alpha}\sum_{\varphi=0}^{\alpha m-\alpha}\binom{\alpha m-\alpha}{\varphi}(1-p)^{\varphi}p^{\alpha m-\alpha-\varphi} = \mu(\alpha m-\alpha+1)(1-p)^{\alpha}
\end{aligned}$$

As we know, $\mu_s(1) = \mu m(1-p)$. Then, to satisfy $\mu_s(\alpha) > \mu_s(1)$, we need

$$\mu(\alpha m-\alpha+1)(1-p)^{\alpha} > \mu m(1-p) \;\Leftrightarrow\; (1-p)^{\alpha-1} > \frac{m}{\alpha m-\alpha+1} \;\Leftrightarrow\; p < 1 - \sqrt[\alpha-1]{\frac{m}{\alpha m-\alpha+1}}$$

Thus $\mu_s(\alpha) > \mu_s(1)$ for $p < 1 - \sqrt[\alpha-1]{\frac{m}{\alpha m-\alpha+1}}$.

In the proof above, we found two regions of $p$ that show when $\mu_s(1)$ is or is not the maximum. It is easy to see that both regions exist for any parameter set. Here we also give a parameter set $(m, \mu, \alpha)$ to show an example of the two regions in the proof. When the parameter set is (2,1,4), we get the two regions [0.81,1] for $\mu_s(\alpha) < \mu_s(1)$ and [0,0.26] for $\mu_s(\alpha) > \mu_s(1)$. There is also a gap between the two regions, which is the undecided region.
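These regions of $p$ can be checked numerically. The sketch below (function names are ours) evaluates the rate of Claim 2 for $(m,\mu,\alpha)=(2,1,4)$ and compares it against minimal spreading:

```python
from math import comb
from fractions import Fraction

def harmonic(n):
    return sum(Fraction(1, i) for i in range(1, n + 1))

def mu_s_prob_scaled(m, alpha, p, mu=1.0):
    """Claim 2 rate: probabilistic access, scaled exponential service."""
    dn = alpha * m
    return sum(comb(dn, phi) * (1 - p) ** phi * p ** (dn - phi)
               * alpha * mu / float(harmonic(phi) - harmonic(phi - alpha))
               for phi in range(alpha, dn + 1))

def alpha4_beats_alpha1(p):
    # Note: mu_s(1) equals mu * m * (1 - p) for minimal spreading.
    return mu_s_prob_scaled(2, 4, p) > mu_s_prob_scaled(2, 1, p)
```

Consistent with the proof, $\alpha = 4$ beats $\alpha = 1$ for $p$ in the lower region $[0, 0.26]$ and loses in the upper region $[0.81, 1]$.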

We state Conjecture 2 according to these regions of $p$.

###### Conjecture 2

When $p$ is large, $\alpha = 1$ is more likely to be the maximizer; when $p$ is small, the maximum is not at $\alpha = 1$, and the optimal $\alpha$ is decreasing with $p$.

### III-C Fixed-size Access and Shifted Exponential Service

###### Claim 3

For the fixed-size access model under the shifted exponential distribution, the DSS service rate is given by

$$\mu_s(\alpha) = \frac{\alpha\mu}{\binom{N}{r}}\sum_{\varphi=\alpha}^{\min(r,\alpha m)}\frac{1}{\Delta\mu+\alpha\left(H_\varphi-H_{\varphi-\alpha}\right)}\binom{\alpha m}{\varphi}\binom{N-\alpha m}{r-\varphi}.$$

The claim follows from the assertions in Sec. II. Similarly, finding the $\alpha$ that maximizes $\mu_s(\alpha)$ is hard. Instead, we prove below that $\alpha = 1$ is not always optimal.
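A numerical sketch of Claim 3 (function names are ours); note that the only change relative to Claim 1 is the $\Delta\mu$ term added to the denominator of each per-subset rate:

```python
from math import comb
from fractions import Fraction

def harmonic(n):
    return sum(Fraction(1, i) for i in range(1, n + 1))

def mu_s_fixed_shifted(N, m, alpha, r, mu=1.0, Delta=1.0):
    """Claim 3 rate: fixed-size access, shifted exponential service."""
    dn = alpha * m
    acc = 0.0
    for phi in range(alpha, min(r, dn) + 1):
        weight = comb(dn, phi) * comb(N - dn, r - phi) / comb(N, r)
        hsum = float(harmonic(phi) - harmonic(phi - alpha))
        acc += weight * alpha * mu / (Delta * mu + alpha * hsum)
    return acc
```

A sanity check: for $\Delta = 0$ the shift disappears, and for $\alpha = 1$ the rate again collapses to $\mu m r / N$; any $\Delta > 0$ strictly lowers the rate.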

###### Theorem 3

For the fixed-size access model, when the waiting time of each node follows the shifted exponential distribution, the optimal $\mu_s(\alpha)$ isn't always reached at $\alpha = 1$.

Proof:

This proof is similar to that of Theorem 1; therefore, we only keep the key steps.

1) We bound $\mu_s(\alpha)$ from above as follows. According to (8),

$$\mu_s(\alpha) < \frac{\mu}{(\Delta\mu+\alpha)\binom{N}{r}}\sum_{\varphi=\alpha}^{\min(r,\alpha m)}\varphi\binom{\alpha m}{\varphi}\binom{N-\alpha m}{r-\varphi} < \frac{\mu\alpha m\binom{\alpha m-1}{\alpha-1}\binom{N-\alpha}{r-\alpha}}{(\Delta\mu+\alpha)\binom{N}{r}}$$

As we know,

$$\mu_s(1) > \frac{\mu}{(\Delta\mu m+1)\binom{N}{r}}\sum_{\varphi=1}^{m}\varphi\binom{m}{\varphi}\binom{N-m}{r-\varphi} = \frac{\mu m\binom{N-1}{r-1}}{(\Delta\mu m+1)\binom{N}{r}}$$

Then, to satisfy $\mu_s(\alpha) < \mu_s(1)$, we need

$$\frac{\mu\alpha m\binom{\alpha m-1}{\alpha-1}\binom{N-\alpha}{r-\alpha}}{(\Delta\mu+\alpha)\binom{N}{r}} < \frac{\mu m\binom{N-1}{r-1}}{(\Delta\mu m+1)\binom{N}{r}} \;\Leftrightarrow\; \frac{\alpha(\Delta\mu m+1)\binom{\alpha m-1}{\alpha-1}}{\Delta\mu+\alpha} < \prod_{i=0}^{\alpha-2}\frac{N-1-i}{r-1-i} \tag{12}$$

Since $\frac{N-1-i}{r-1-i} \ge \frac{N-1}{r-1}$ for $i \ge 0$, inequality (12) is true when

$$\frac{\alpha(\Delta\mu m+1)\binom{\alpha m-1}{\alpha-1}}{\Delta\mu+\alpha} < \left(\frac{N-1}{r-1}\right)^{\alpha-1} \;\Leftrightarrow\; r < 1 + \sqrt[\alpha-1]{\frac{\Delta\mu+\alpha}{\alpha(\Delta\mu m+1)\binom{\alpha m-1}{\alpha-1}}}\,(N-1)$$

Therefore, $\mu_s(\alpha) < \mu_s(1)$ holds for $r$ in this region.
2) We bound $\mu_s(\alpha)$ from below as follows. According to (8),

$$\mu_s(\alpha) > \frac{\alpha\mu}{\left(\Delta\mu(\alpha m-\alpha+1)+\alpha^2\right)\binom{N}{r}}\sum_{\varphi=\alpha}^{\min(r,\alpha m)}(\varphi-\alpha+1)\binom{\alpha m}{\varphi}\binom{N-\alpha m}{r-\varphi} > \frac{\alpha\mu(\alpha m-\alpha+1)\binom{N-\alpha}{r-\alpha}}{\left(\Delta\mu(\alpha m-\alpha+1)+\alpha^2\right)\binom{N}{r}}$$

As we know,

$$\mu_s(1) < \frac{\mu}{(\Delta\mu+1)\binom{N}{r}}\sum_{\varphi=1}^{m}\varphi\binom{m}{\varphi}\binom{N-m}{r-\varphi} = \frac{\mu m\binom{N-1}{r-1}}{(\Delta\mu+1)\binom{N}{r}}$$

Then, to satisfy $\mu_s(\alpha) > \mu_s(1)$, we need

$$\frac{\alpha\mu(\alpha m-\alpha+1)\binom{N-\alpha}{r-\alpha}}{\left(\Delta\mu(\alpha m-\alpha+1)+\alpha^2\right)\binom{N}{r}} > \frac{\mu m\binom{N-1}{r-1}}{(\Delta\mu+1)\binom{N}{r}} \;\Leftrightarrow\; \frac{\alpha(\Delta\mu+1)(\alpha m-\alpha+1)}{m\left(\Delta\mu(\alpha m-\alpha+1)+\alpha^2\right)} > \prod_{i=0}^{\alpha-2}\frac{N-1-i}{r-1-i} \tag{13}$$

Since $\frac{N-1-i}{r-1-i} \le \frac{N-\alpha+1}{r-\alpha+1}$, inequality (13) is true when

$$\frac{\alpha(\Delta\mu+1)(\alpha m-\alpha+1)}{m\left(\Delta\mu(\alpha m-\alpha+1)+\alpha^2\right)} > \left(\frac{N-\alpha+1}{r-\alpha+1}\right)^{\alpha-1} \;\Leftrightarrow\; r > \sqrt[\alpha-1]{\frac{m\left(\Delta\mu(\alpha m-\alpha+1)+\alpha^2\right)}{\alpha(\Delta\mu+1)(\alpha m-\alpha+1)}}\,(N-\alpha+1)+\alpha-1$$

Therefore, $\mu_s(\alpha) > \mu_s(1)$ holds for $r$ in this region.

In the proof above, we found two regions of $r$, similar to those found in Theorem 1, that show when $\mu_s(1)$ is or is not the maximum. Here we give a parameter set $(N, m, \mu, \Delta, \alpha)$ to show examples of the two regions in the proof. When the parameter set is (30,2,1,10,4), we get the two regions [4,5.8] for $\mu_s(\alpha) < \mu_s(1)$ and [25.7,30] for $\mu_s(\alpha) > \mu_s(1)$, and both regions exist. When the parameter set is (30,3,1,10,6), we get another two regions, [6,4.1] for $\mu_s(\alpha) < \mu_s(1)$ and [27.4,30] for $\mu_s(\alpha) > \mu_s(1)$, and the first region does not exist. When the parameter set is (30,2,1,1,4), we get another two regions, [4,7.6] for $\mu_s(\alpha) < \mu_s(1)$ and [30.4,30] for $\mu_s(\alpha) > \mu_s(1)$, and the second region does not exist. Here Conjecture 1 holds.

### III-D Probabilistic Access and Shifted Exponential Service

###### Claim 4

Under the probabilistic access model,

$$\mu_s(\alpha) = \sum_{\varphi=\alpha}^{\alpha m}\frac{\alpha\mu}{\Delta\mu+\alpha\left(H_\varphi-H_{\varphi-\alpha}\right)}\binom{\alpha m}{\varphi}(1-p)^{\varphi}p^{\alpha m-\varphi}.$$

The claim follows from the assertions in Sec. II, and finding the $\alpha$ that maximizes $\mu_s(\alpha)$ remains hard. Therefore, we prove below that $\alpha = 1$ is not always optimal.
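A numerical sketch of Claim 4 (function names are ours), combining the Binomial access weights of Claim 2 with the shifted exponential per-subset rate of Claim 3:

```python
from math import comb
from fractions import Fraction

def harmonic(n):
    return sum(Fraction(1, i) for i in range(1, n + 1))

def mu_s_prob_shifted(m, alpha, p, mu=1.0, Delta=1.0):
    """Claim 4 rate: probabilistic access, shifted exponential service."""
    dn = alpha * m
    acc = 0.0
    for phi in range(alpha, dn + 1):
        weight = comb(dn, phi) * (1 - p) ** phi * p ** (dn - phi)
        hsum = float(harmonic(phi) - harmonic(phi - alpha))
        acc += weight * alpha * mu / (Delta * mu + alpha * hsum)
    return acc
```

A sanity check: with $\Delta = 0$ and $\alpha = 1$ the rate collapses to $\mu m (1-p)$, as in Claim 2, and any $\Delta > 0$ strictly lowers it.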

###### Theorem 4

For the probabilistic access model, when the waiting time of each node follows the shifted exponential distribution, the optimal $\mu_s(\alpha)$ isn't always reached at $\alpha = 1$.

Proof:

This proof is similar to that of Theorem 2; therefore, we only keep the key steps.

1) We bound $\mu_s(\alpha)$ from above as follows. According to (8),

$$\mu_s(\alpha) < \frac{\mu}{\Delta\mu+\alpha}\sum_{\varphi=\alpha}^{\alpha m}\varphi\binom{\alpha m}{\varphi}(1-p)^{\varphi}p^{\alpha m-\varphi} < \frac{\mu\alpha m\binom{\alpha m-1}{\alpha-1}(1-p)^{\alpha}}{\Delta\mu+\alpha}$$

As we know,

$$\mu_s(1) > \frac{\mu}{\Delta\mu m+1}\sum_{\varphi=1}^{m}\varphi\binom{m}{\varphi}(1-p)^{\varphi}p^{m-\varphi} = \frac{\mu m(1-p)}{\Delta\mu m+1}$$

Then, to satisfy $\mu_s(\alpha) < \mu_s(1)$, we need

$$\frac{\mu\alpha m\binom{\alpha m-1}{\alpha-1}(1-p)^{\alpha}}{\Delta\mu+\alpha} < \frac{\mu m(1-p)}{\Delta\mu m+1} \;\Leftrightarrow\; p > 1 - \sqrt[\alpha-1]{\frac{\Delta\mu+\alpha}{\alpha(\Delta\mu m+1)\binom{\alpha m-1}{\alpha-1}}}$$

Therefore, $\mu_s(\alpha) < \mu_s(1)$ holds for $p$ in this region.
2) We bound $\mu_s(\alpha)$ from below as follows. According to (8),

$$\mu_s(\alpha) > \frac{\alpha\mu}{\Delta\mu(\alpha m-\alpha+1)+\alpha^2}\sum_{\varphi=\alpha}^{\alpha m}(\varphi-\alpha+1)\binom{\alpha m}{\varphi}(1-p)^{\varphi}p^{\alpha m-\varphi} > \frac{\alpha\mu(\alpha m-\alpha+1)(1-p)^{\alpha}}{\Delta\mu(\alpha m-\alpha+1)+\alpha^2}$$

As we know,

$$\mu_s(1) < \frac{\mu}{\Delta\mu+1}\sum_{\varphi=1}^{m}\varphi\binom{m}{\varphi}(1-p)^{\varphi}p^{m-\varphi} = \frac{\mu m(1-p)}{\Delta\mu+1}$$

Then, to satisfy $\mu_s(\alpha) > \mu_s(1)$, we need

$$\frac{\alpha\mu(\alpha m-\alpha+1)(1-p)^{\alpha}}{\Delta\mu(\alpha m-\alpha+1)+\alpha^2} > \frac{\mu m(1-p)}{\Delta\mu+1} \;\Leftrightarrow\; p < 1 - \sqrt[\alpha-1]{\frac{m\left(\Delta\mu(\alpha m-\alpha+1)+\alpha^2\right)}{\alpha(\Delta\mu+1)(\alpha m-\alpha+1)}}$$

Therefore, $\mu_s(\alpha) > \mu_s(1)$ holds for $p$ in this region.

From the proof above, we find two regions of $p$, similar to those found in Theorem 2, that show when $\mu_s(1)$ is or is not the maximum. We again give a parameter set $(m, \mu, \Delta, \alpha)$ to show examples of the two regions in the proof. When the parameter set is (2,1,10,4), we get the two regions [0.83,1] for $\mu_s(\alpha) < \mu_s(1)$ and [0,0.15] for $\mu_s(\alpha) > \mu_s(1)$, and both regions exist. When the parameter set is (2,1,10,4), we get another two regions, [0,-0.016] for $\mu_s(\alpha) > \mu_s(1)$ and [0.77,1] for $\mu_s(\alpha) < \mu_s(1)$, and the first region does not exist. Here Conjecture 2 holds.

From Theorems 1, 2, 3, and 4, we see that the optimal allocation varies in accordance with different system parameters. Recall that $\alpha = 1$ was found to be universally optimal in [5], where it was assumed that the download time does not scale with the size of the data.

## IV Optimal Storage Allocation Analysis

We next numerically analyze the optimal storage allocation. We compute the service rate $\mu_s(\alpha)$ and the probability of successful recovery $P_s(\alpha)$ as functions of the allocation parameter $\alpha$. Since the number of accessed nodes $r$, the coded file size ratio $m$, and the failure probability $p$ are the key parameters of the storage system, we also vary these values to see how the optimal allocation changes.

According to the formulas in Claims 1 and 2, the rate parameter $\mu$ of the scaled exponential distribution does not affect the numerical analysis results. But from Claims 3 and 4, we can see that both the rate parameter $\mu$ and the shift parameter $\Delta$ affect the numerical analysis results. When $\Delta = 0$, the shifted exponential distribution is equivalent to an exponential distribution, and then the minimal spreading allocation is universally optimal; when $\mu \to \infty$, the shifted exponential delivery time is equivalent to a constant, and then $\mu_s(\alpha)$ changes with the probability of successful recovery. Therefore, we select appropriate values of $\mu$ and $\Delta$ in the simulations below.

### IV-A Fixed-size Access

For the fixed-size access model, we present two figures to analyze the optimal storage allocation over the feasible range of $\alpha$. In Fig. 2, we have three subfigures: the left shows the average service rate $\mu_s(\alpha)$ for the scaled exponential distribution, the middle shows it for the shifted exponential distribution, and the right shows the probability of successful recovery $P_s(\alpha)$. First, let us analyze the left and right subfigures. For small $r$, both $\mu_s(\alpha)$ and $P_s(\alpha)$ are decreasing with $\alpha$, and the optimal allocation is $\alpha = 1$. For moderate $r$, the largest $P_s(\alpha)$ is reached at a larger $\alpha$, but $\mu_s(\alpha)$ is decreasing, so it is better to select $\alpha$ between the two maximizers based on the relative weight given to $\mu_s(\alpha)$ and $P_s(\alpha)$. For large $r$, $\mu_s(\alpha)$ is increasing and reaches its maximum at the largest feasible $\alpha$, so the optimal allocation is maximal spreading. Second, the middle subfigure has a pattern similar to the left one; the only difference is the value of $r$ at which the largest $\mu_s(\alpha)$ is reached.

Fig. 3 shows similar results. First, let us analyze the left and right subfigures. For small $r$, both $\mu_s(\alpha)$ and $P_s(\alpha)$ are decreasing, and the optimal allocation is $\alpha = 1$. For moderate $r$, $\mu_s(\alpha)$ is still decreasing, but $P_s(\alpha)$ reaches its maximum at a larger $\alpha$, so the optimal allocation is a compromise between the two maximizers. When