Pros & Cons of Model-based Bandwidth Control for Client-assisted Content Delivery
A key challenge in client-assisted content delivery is determining how to allocate limited server bandwidth across a large number of files being concurrently served so as to optimize global performance and cost objectives. In this paper, we present a comprehensive experimental evaluation of strategies to control server bandwidth allocation. As part of this effort, we introduce a new model-based control approach that relies on an accurate yet concise “cheat sheet” based on a priori offline measurement to predict swarm performance as a function of the server bandwidth and other swarm parameters. Our evaluation using a prototype system, SwarmServer, instantiating static, dynamic, and model-based controllers shows that static and dynamic controllers can both be suboptimal due to different reasons. In comparison, a model-based approach consistently outperforms both static and dynamic approaches provided it has access to detailed measurements in the regime of interest. Nevertheless, the broad applicability of a model-based approach may be limited in practice because of the overhead of developing and maintaining a comprehensive measurement-based model of swarm performance in each regime of interest.
Faced with the challenge of ever-increasing demand for content, content distributors have turned to client-assisted content delivery in recent times. A client-assisted content delivery architecture enables content distributors to provide performance in a scalable and cost-effective manner by opportunistically leveraging client resources, especially their uplink bandwidth, to augment their managed infrastructure resources. Although client-assisted content delivery systems have their roots in peer-to-peer file sharing systems [1, 2], commercial CDNs such as Akamai, Velocix, and Octoshape [3, 4, 5] as well as live streaming services such as PPLive and Sopcast [6, 7] have warmed up to using them for mainstream enterprise content delivery in recent times.
A key problem in client-assisted content delivery is bandwidth management, i.e., determining how to allocate limited server bandwidth across a large number of files being concurrently served to clients so as to balance the performance and cost objectives of the content distributor. Unlike purely client-server systems or purely peer-to-peer systems, this problem is particular to client-assisted content delivery systems that attempt to combine the predictable performance and ease of management of the former with the scalability and cost-effectiveness of the latter. The server bandwidth allocated to a swarm, or a set of clients concurrently downloading the same file, is critical in determining the effectiveness of client-to-client exchanges and by consequence client-perceived performance. Furthermore, the appropriate allocation may be counter-intuitive, e.g., a popular file requires less server bandwidth compared to an unpopular file, all else being equal, in order to ensure similar client-perceived performance.
Our primary contribution is a measurement-driven comparative analysis of several existing and new strategies for allocating server bandwidth in client-assisted content delivery systems. To this end, we classify these bandwidth allocation strategies, or controllers, into three categories. The first is static, a class of controllers that use simplistic strategies such as allocating bandwidth uniformly, on a best-effort basis, or proportional to the demand across files. The second is dynamic, a class of controllers that constantly adjust the allocation in response to fine-grained client-perceived performance so as to optimize the performance or cost objectives of the content distributor [8, 9].
In this paper, we present a third, new class of controllers called model-based controllers that allocate server bandwidth based on a predictive model of client-perceived performance as a function of the server bandwidth and other swarm parameters such as the request arrival rate, file size, and client upload capacities. Unlike dynamic controllers that can be suboptimal due to long convergence delays while searching for an optimal allocation in situ, model-based controllers can jump to the optimal allocation in a single step by solving the underlying optimization problem “on paper”.
We have implemented a prototype system, SwarmServer, to facilitate our comparative analysis of controllers. In addition to several simple static and dynamic controllers, SwarmServer supports a model-based controller called CheatSheet for three bandwidth allocation objectives: minimizing the average download time, maximum download time, or the server bandwidth consumed so as to achieve a target performance objective. CheatSheet uses extensive a priori measurement to develop an accurate and concise model of performance as a function of the server bandwidth and a number of swarm parameters. To our knowledge, CheatSheet is the first attempt at developing a detailed empirical model of swarm performance.
Our extensive experiments with SwarmServer in conjunction with BitTorrent swarms running over 350 PlanetLab nodes reveal several insights. First, simple static controllers are hit-or-miss; while they perform well for some performance objectives and workloads, even outperforming dynamic controllers, they fall severely short on others. The suboptimal performance of static controllers is unsurprising and consistent with previous findings for one of our three objectives of interest. Second, model-based control is feasible and promising: CheatSheet consistently outperforms both static and dynamic controllers provided its model is based on detailed a priori measurements in an environment similar to the operational environment. CheatSheet performs up to 4× better than static schemes and up to 1.7× better than dynamic controllers.
Nevertheless, having gone through the experience of building a model-based controller, our conclusions about its practicality are somewhat mixed for several reasons. First, it is hard. To appreciate this, consider that CheatSheet’s model used in the experiments in this paper alone required over 12 days of measurement data on PlanetLab so as to account for a number of parameters such as the server bandwidth, request arrival rate, distribution of client upload capacities, file size, etc. Second, while a measurement-driven model is robust to small variations in the operational environment, significant changes require recalibrating the model. For example, we find that the model developed over PlanetLab is inaccurate when deployed on a public cloud such as Amazon EC2 or a local cluster in our department. Similarly, significant changes in the client population or behavior, such as participation in multiple swarms, introduce further uncertainties into the model. Thus, model-based control may be appropriate primarily for relatively predictable environments (e.g., distributing TV shows and movies to FIOS customers).
The rest of the paper quantifies these nuanced pros and cons of the three classes of controllers. We begin with a background on client-assisted content delivery.
A client-assisted content delivery system consists of a server that acts as the primary source for all content. All clients concurrently downloading the same file are referred to as a swarm. Clients follow a common peer-to-peer protocol for downloading (uploading) the file from (to) other clients in the swarm. The server participates by contributing bandwidth to all swarms. In this paper, we focus on the BitTorrent protocol because of its open nature and wide deployment; however, our findings are qualitatively applicable to other comparable plugins offered by content distributors [3, 5].
A key goal of a client-assisted content delivery system is to optimize a system-wide objective, e.g., minimize the average download time of all clients, by judiciously allocating limited server bandwidth across all swarms. To this end, a controller at the server collects information from all swarms and uses this information to compute and effect an allocation of server bandwidth so as to optimize the system-wide objective.
II-A Classification of controllers
We classify existing controllers as static or dynamic, and introduce a new class called model-based controllers.
Static: A static controller allocates server bandwidth using a simple heuristic while being agnostic to the system-wide performance objective and unresponsive to actual client-perceived performance. For example, one static controller that we analyze uses BitTorrent as-is, repurposing a common seeder across all swarms as the server.
Dynamic: A dynamic controller continuously monitors fine-grained information about client-perceived performance for all clients in each swarm (see Figure 2, left), and accordingly adjusts the bandwidth allocation in each monitoring epoch. An example of a dynamic controller is AntFarm, which monitors the number of blocks uploaded and downloaded by each client in each epoch and uses a strategy based on perturbation and gradient ascent to optimize the aggregate download rate across all clients across all swarms.
Model-based: A model-based controller relies on a predictive model of swarm performance as a function of the supplied server bandwidth and other swarm parameters such as the file size, the peer arrival rate, and the upload capacity distribution of peers. Unlike dynamic controllers, a predictive model obviates explicit measurement of client-perceived performance, requiring only parameters that are already available or easily inferred at the server (see Figure 2, right). More importantly, it obviates in situ perturbation and gradual adjustment of the allocation enabling the controller to jump to the optimal allocation in a single step by using the model to solve the underlying optimization problem “on paper”. Thus, a model-based controller can quickly adapt to sudden changes in request arrival rates.
II-B Limitations of dynamic control
Our motivation for investigating model-based control stems from the limitations of dynamic controllers in realistic environments. Unlike static controllers that are but naive baseline strategies, the limitations of dynamic control are less obvious and are described next.
Convergence time: Dynamic control works in a feedback-driven manner by perturbing the current allocation, monitoring the performance impact of the perturbation, and accordingly determining the next perturbation. This approach is prone to prohibitively long convergence delays, primarily because the effect of a perturbed allocation can take several minutes to propagate through the swarm so as to be observable by the controller. As an example, AntFarm updates its server bandwidth once every 300 seconds by 5 KBps, so an adjustment of 50 KBps takes ten updates, i.e., nearly an hour, to take effect.
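The convergence arithmetic in the AntFarm example can be checked with a short sketch. The function name and its parameters are hypothetical; only the 300-second update interval and the 5 KBps step come from the text.

```python
def convergence_time_s(adjustment_kbps, step_kbps=5, epoch_s=300):
    """Seconds needed to shift an allocation by adjustment_kbps in fixed-size steps."""
    steps = -(-adjustment_kbps // step_kbps)  # ceiling division: number of updates needed
    return steps * epoch_s


# A 50 KBps adjustment needs 10 updates of 5 KBps, one every 300 s.
minutes = convergence_time_s(50) / 60
print(minutes)
```

Larger adjustments, or smaller steps, lengthen the delay proportionally under this simple fixed-step view.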
Measurement overhead: Dynamic controllers utilize server resources to monitor every client’s performance in a swarm; this overhead can be significant for a swarming system with tens of thousands of clients.
Measurement error: The performance of any controller in steady state depends on how accurately it can estimate the relation between server bandwidth and swarm performance. Dynamic controllers can inaccurately estimate this relation because they measure swarm performance for the current bandwidth allocation only for a single measurement interval of a few hundred seconds. The statistical variations in the number of peers joining the swarm and in their upload capacities introduce error in measuring swarm performance.
These limitations of dynamic controllers compel us to explore model-based controllers. We hope that the measurement overhead could be relegated to an a priori offline phase to develop an accurate model of swarm performance in exchange for increased responsiveness in the operational phase. The challenge, of course, is to develop an accurate model of swarm performance with a tractable measurement overhead and small representation size, a challenge we address next.
III A Measurement-based Model
In this section, we develop a measurement-based model of swarm performance, the key building block for a model-based controller. Unlike prior theoretical models [11, 12, 13] that over-simplify swarm behavior, our work, to our knowledge, is the first effort at developing a measurement-based model of swarm performance. Despite our progress, the proposed model falls short, both because it requires very extensive measurements lasting several days and, more fundamentally, because of the large number of factors that affect swarm performance, even with several simplifying assumptions.
III-A Goal and model assumptions
We start with the following question: what is the average download time of peers in a BitTorrent swarm when given a certain amount of server bandwidth? The answer to this question of course depends on several characteristics of the swarm such as the arrival and departure patterns of peers, their upload and download capacities, the size of the file being distributed, etc. The answer also depends on design parameters of BitTorrent clients such as the number of active peers to which a peer concurrently uploads and how it splits its upload capacity across them, the length of an optimistic unchoke round, the size of chunks, etc. Finally, network conditions and artifacts of the transport protocol (TCP or custom transport protocols such as µTP for non-interfering downloads) will also impact swarm performance. Clearly, a model attempting to account for all of the factors affecting a swarm’s performance quickly becomes intractable.
To derive a simple yet useful model, we consider a swarm distributing a file of size $S$ to peers arriving at a rate $\lambda$. The upload capacities of arriving peers are drawn from a distribution with mean $\mu$. The download capacity of peers is unlimited. Peers depart immediately after finishing their download (so the departure rate of peers is equal to the arrival rate in steady state). Let $x$ denote the (fixed) bandwidth supplied by the server. Our model postulates that the average download time of peers, $T$, can be determined as a function of $x$, $\lambda$, $S$, and $\mu$. We state this dependence as
$$\frac{S}{T} = f(x, \lambda, S, \mu) \qquad (1)$$
We call $S/T$, the average download rate of peers, the swarm performance. As the average download time of peers ($T$) reduces, swarm performance improves.
By assuming that $T$ is determined by the above four parameters alone, the model implicitly makes a few assumptions. The model assumes that network loss rates and round-trip times are not so high that they reduce the effective average peer upload capacity (or equivalently that $\mu$ already incorporates these effects). It also implicitly assumes that all peers use a standard BitTorrent client and that implementation variations across operating systems are minor. It further assumes that $\mu$ already incorporates the effect of user-specific configurations that limit their upload contribution. Finally, the model assumes that despite all these heterogeneous factors affecting the distribution of peer upload capacities in practice, this distribution is stationary, so the average upload capacity (in conjunction with the other three parameters) is sufficient to determine the average download time.
III-B Measurement-based model
We take an empirical, measurement-driven approach to capture the relationship in Equation (1). A naive approach to this end would be to “measure” the relationship posed in Equation (1) for all foreseeable values of the four underlying dimensions (server bandwidth $x$, arrival rate $\lambda$, file size $S$, and mean upload capacity $\mu$), which is impractical. Instead, our approach is to summarize the relationship using a small number of measured scenarios and use simple interpolation to estimate the unmeasured scenarios. We begin with a description of our measurement setup.
III-B1 Measurement setup
Our measurement testbed consists of 350 PlanetLab nodes installed with an instrumented BitTorrent client, and two (non-PlanetLab) servers hosted at our university that act as the seeder and the tracker respectively for all swarms. In each swarm run, peers arrive over time at a PlanetLab node to download the file and depart immediately after completing the download. Each swarm is run long enough so that the average download time of peers stabilizes, and the server records the average download time of peers that have completed downloads at the end of the experiment. Each swarm run is repeated five times with a fixed set of parameters, and different runs vary these parameters.
We use the upload capacity distribution of BitTorrent peers reported in prior work, which was scaled and truncated to remove very high capacity peers so as to accommodate the daily limit on the maximum data transfer imposed on PlanetLab nodes. The resulting average upload capacity ($\mu$) is 100 KBps with upload bandwidths in the range of 40 to 200 KBps for individual peers. No restrictions are imposed on the maximum download rate of any client. The file size is fixed at $S$ = 10 MB. Peer inter-arrival times are exponentially distributed with mean $1/\lambda$, where $\lambda$ is the peer arrival rate.
Figure 2 shows the aggregate results of our measurement experiments. Each line corresponds to a fixed arrival rate as shown, and plots the swarm performance for different values of the server bandwidth that is varied from 10 to 100 KBps (also the average peer upload capacity) in 10 KBps increments. With these parameters, a swarm run takes between 2000 and 5000 seconds, so the total running time to generate this figure is over 12 days (5 runs per point × 60 points × roughly an hour per run = 300 hours).
III-B2 Swarm performance vs. server bandwidth
Figure 2 presents several insights about how the swarm performance depends on server bandwidth and peer arrival rate. First, as expected, swarm performance increases with server bandwidth, keeping all else fixed. Second, swarm performance is concave with respect to server bandwidth. This is because, when the server bandwidth is very low, it becomes the bottleneck preventing peers from efficiently utilizing their upload capacity for exchanging blocks. In this regime, increasing server bandwidth slightly improves the efficiency of P2P exchanges, which improves swarm performance significantly. At high values of server bandwidth, there is less room for improving the efficiency of P2P exchanges, so the server’s bandwidth improves performance similar to traditional client-server systems, i.e., the bandwidth is divided across extant peers. When the server bandwidth equals the average peer upload capacity, we find that a swarm’s utilization of P2P bandwidth is about as efficient as it can be, and any additional server bandwidth is simply used as in a client-server system. As a result, the swarm performance in the regime where the server bandwidth $x$ exceeds the average peer upload capacity $\mu$ (not shown in Figure 2) can be easily derived analytically, obviating time-consuming measurements.
Third, in the regime shown in the figure, where the server bandwidth $x$ is at most the average peer upload capacity $\mu$, swarm performance improves with the arrival rate (keeping all else fixed). At very low arrival rates, e.g., $\lambda$ = 1/100/s, the swarm behaves like a client-server system as there is at most one peer most of the time, so the corresponding curve resembles the line on which swarm performance equals the server bandwidth $x$. At higher arrival rates, the swarm remains efficient (i.e., it maintains a healthy download rate of over 80 KBps) for values of $x$ much smaller than $\mu$. This is because large swarms are mostly self-sustaining and need only a tiny amount of server bandwidth to supply missing blocks in the unlikely event that none of the extant peers possess those blocks.
III-B3 Model representation
To concisely represent the swarm-performance model, we carefully select a small number of values of each parameter for measurements. We maintain a table, referred to as the “cheat sheet”, that records the swarm performance for all combinations of these parameters. This cheat sheet is used to approximately estimate by simple linear interpolation the swarm performance for values of parameters that are not explicitly measured. Next, we describe how we select the values of the model parameters for measurements.
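As an illustration of how such a table lookup might work, the following is a minimal sketch of linear interpolation over a two-dimensional cheat sheet indexed by server bandwidth and arrival rate. The function names, the table layout, and the toy values are all assumptions made for this example and are not measurements from the paper.

```python
from bisect import bisect_right


def interp1(xs, ys, x):
    """Piecewise-linear interpolation of (xs, ys) at x, clamped to the measured range."""
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    j = bisect_right(xs, x)          # xs[j - 1] <= x < xs[j]
    x0, x1 = xs[j - 1], xs[j]
    w = (x - x0) / (x1 - x0)
    return ys[j - 1] * (1 - w) + ys[j] * w


def lookup(sheet, bandwidths, rates, x, lam):
    """Interpolate along server bandwidth within each row, then across arrival rates."""
    per_rate = [interp1(bandwidths, row, x) for row in sheet]
    return interp1(rates, per_rate, lam)


# Toy cheat sheet (made-up values in KBps): rows are arrival rates, columns are bandwidths.
bandwidths = [10, 50, 100]
rates = [0.01, 0.1]
sheet = [[10, 50, 100],
         [60, 85, 100]]
estimate = lookup(sheet, bandwidths, rates, 30, 0.055)
```

Clamping outside the measured range is a choice of this sketch; the text only describes interpolation between measured values.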
Server bandwidth & peer arrival rate
The dependence of swarm performance on server bandwidth and peer arrival rate for a given upload capacity distribution and file size (as in Figure 2) is captured using 100 values. We take measurements for ten values of the server bandwidth $x$ ranging from $\mu/10$ to $\mu$, where $\mu$ is the mean upload capacity, and for ten values of the arrival rate $\lambda$ in a range determined by a metric we refer to as the “healthy swarm size”. The healthy swarm size is the number of peers when the efficiency of P2P exchanges is maximum. The intuition for healthy swarm size comes from Little’s law: for a file of size $S$, the healthy swarm size is $\lambda S/\mu$, as $S/\mu$ is the average download time of peers in this case. When the healthy swarm size is one or less, the swarm essentially behaves like a client-server system. We empirically observe that when the healthy swarm size is 50 or more, the swarm is essentially self-sustaining, i.e., even with a server bandwidth of just a small fraction of $\mu$, the swarm is efficient. So we take measurements for values of $\lambda$ selected such that the healthy swarm size increases from 1 to 50 in 10 equal increments. The total number of combinations of $x$ and $\lambda$ is therefore 100.
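The selection of arrival rates from the healthy swarm size can be sketched as follows, assuming the relation healthy swarm size = (arrival rate × file size) / mean upload capacity from Little's law. The function name is hypothetical, 10 MB is taken as 10,000 KB, and the ten evenly spaced sizes from 1 to 50 are one reading of "10 equal increments".

```python
def arrival_rates(mean_upload_kbps, file_size_kb, sizes):
    """Arrival rate (peers/s) that yields each healthy swarm size n = lam * S / mu."""
    return [n * mean_upload_kbps / file_size_kb for n in sizes]


mu, S = 100, 10_000                          # KBps and KB, as in the measurement setup
sizes = [1 + i * 49 / 9 for i in range(10)]  # assumed spacing: 10 values from 1 to 50
lams = arrival_rates(mu, S, sizes)
print(lams[0], lams[-1])
```

Under these assumptions the smallest rate corresponds to one arrival per 100 seconds, which matches the client-server-like curve discussed for Figure 2.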
We address file size diversity using an interpolation approach similar to the one used for arrival rates and server bandwidth. A separate cheat sheet is stored for a small number of file sizes spanning the regime of interest, e.g., 10 file sizes in geometric progression from 1MB to 10GB. The swarm performance for file sizes in between is estimated via interpolation.
At the outset of this work, we expected that a larger file size could be treated as equivalent to a larger arrival rate, i.e., that swarm performance could be approximated as a function of the server bandwidth, the product $\lambda S$ of arrival rate and file size, and the mean upload capacity alone, thereby obviating the need to maintain separate cheat sheets for different file sizes. Our intuition was that $\lambda S$ (bits/sec) represents the aggregate demand arriving into the system, so the response curve should not change significantly if the demand remains unchanged. Unfortunately, this turns out not to be the case as shown by the experiment in Figure 4. The figure plots the swarm performance as a function of the server bandwidth, and the different lines scale the file size $S$ and the arrival rate $\lambda$ in opposite directions by the same factor, i.e., $\lambda S$ is the same for all points in the graph. The lines clearly show a slight uptrend suggesting that larger file sizes boost swarm performance more than larger arrival rates or, equivalently, a swarm distributing a larger file performs better than a swarm distributing a smaller file even though both have the same aggregate demand, client upload capacities, and server bandwidth.
Upload capacity distribution
There are two kinds of variations that occur in peer upload capacities. First, the upload capacity distribution of any sample of peers currently participating in a swarm may differ from the overall distribution. Our model implicitly accounts for this statistical variation because peer upload capacities during measurements are chosen by randomly sampling the distribution. Second, the overall upload capacity distribution of peers visiting the site can change. However, we expect that upload capacity distribution is unlikely to change at short time scales, as it depends on technology trends and the population of users who visit the site, which is likely to remain stable over the course of several months.
The changes in the upload capacity distribution at time scales of several months can be addressed by updating the cheat sheet with new measurements. In additional experiments (included in our tech report due to lack of space), we find that significant changes in the mean or even the variance of upload capacity distribution indeed necessitate a new set of measurements. For example, for the same mean upload capacity, we find that increasing the variance of upload capacities reduces the swarm performance.
III-B4 Effect of measurement testbed
The measurement-based model requires network conditions to remain relatively similar to the environment in which the model’s measurements were obtained. We repeated the experiment shown in Figure 2 on two other testbeds: Amazon EC2 and a local cluster. For the EC2 experiment, we select an equal number of machines from five geographic locations to differentiate the EC2 testbed from the local cluster, which has microsecond round-trip latencies. In Figure 4, we compare the swarm performance on the three testbeds for a peer arrival rate of $\lambda$ = 1/5/s. Swarm performance on EC2 is up to 30 KBps higher than on PlanetLab. Experiments on the local cluster show even better swarm performance than on EC2.
Swarm performance differs on the three testbeds as their effective upload capacities are different. The round-trip times in the local cluster are much smaller than in PlanetLab, which is reflected in higher effective upload capacities and better performance. EC2 only has a small extent of geographic diversity (five different locations), so neighbor relationships between peers in the same data center tend to dominate (a clustering effect that has also been alluded to by prior work). This clustering effect again results in EC2 nodes having higher effective upload capacities.
III-B5 Summary and limitations
Although the measurement-based model can capture the dependence on four key swarm parameters, it still has several limitations. The most critical limitation is the extensive measurement needed to build a cheat sheet. For a single file size, our measurements take a few hundred hours on PlanetLab. A content distributor maintaining a few such tables for common file sizes may require a few thousand hours of measurements or even more.
The second limitation is the difficulty in estimating two of the model parameters – upload capacity distribution and peer arrival rates. Upload capacity distribution is difficult to estimate for several reasons – peers may download files from multiple swarms simultaneously or otherwise limit their upload capacity, and network conditions can significantly change the effective upload capacity as shown in Section III-B4. Estimating peer arrival rates is challenging primarily because users may abort the download before completion and return later to resume a download as shown in prior work [20, 21]. Therefore, the model also needs to account for peer arrivals and departures in the middle of a download. In combination, the difficulty of estimating all the model parameters can make the measurement-based model ineffective in practice.
IV SwarmServer system
In this section, we present an implemented prototype of our system, SwarmServer, to compare different controller strategies. We begin with a brief description of our implementation and the content distribution objectives that we use for our comparison. Then, we discuss the design of model-based, dynamic, and static controllers implemented in SwarmServer.
Implementation: The SwarmServer system is implemented in Python and consists of nearly 5000 lines of code. The system does not require any modification to the BitTorrent protocol for either the peers or the tracker. Our implementation uses the instrumented BitTorrent client developed by Legout et al., which we modified to enable us to change the maximum upload bandwidth of the client without restarting it.
Content distribution objectives: We compare controller strategies on three content distribution objectives.
MIN_AVG: Minimize the average download time across all peers in all swarms for a given total server bandwidth.
MIN_MAX: Minimize the maximum value of the average download time across swarms for a given total server bandwidth.
MIN_COST: Minimize the total server bandwidth while achieving a set of specified target download times for each swarm.
IV-A Model-based controller
The model-based controller, CheatSheet, allocates server bandwidth by solving an optimization problem using the measurement-based model developed in Section III. Next, we describe the optimization formulations used by CheatSheet to calculate bandwidth allocation for each of the objectives introduced above. We assume that there are a total of $k$ swarms and that the average upload capacity, arrival rate, and file size of the $i$’th swarm are given by $\mu_i$, $\lambda_i$, and $S_i$ respectively. The goal is to determine server bandwidth allocations $x_1, \ldots, x_k$ so as to optimize the desired objective.
Optimization formulation for MIN_AVG:
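One way to write this formulation that is consistent with the description of its constraints is sketched here. The notation is assumed for illustration: $k$ is the number of swarms, $x_i$ the server bandwidth given to swarm $i$, $T_i$ its average download time, $\lambda_i$, $S_i$, and $\mu_i$ its arrival rate, file size, and mean upload capacity, $f$ the modeled swarm performance (average download rate), and $X$ the total server bandwidth. The equation numbers are chosen to line up with the references in the surrounding text, and weighting by $\lambda_i$ averages over peers rather than over swarms.

```latex
\begin{align}
\min_{x_1,\dots,x_k} \quad & \frac{\sum_{i=1}^{k} \lambda_i T_i}{\sum_{i=1}^{k} \lambda_i} \tag{2} \\
\text{subject to} \quad & T_i = \frac{S_i}{f(x_i, \lambda_i, S_i, \mu_i)}, \quad 1 \le i \le k \tag{3} \\
& \sum_{i=1}^{k} x_i \le X \tag{4}
\end{align}
```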
The first constraint (3) above simply rephrases Equation (1) relating the average download time to the server bandwidth and other swarm parameters. The second constraint above limits the total bandwidth the server can allocate to all swarms.
CheatSheet uses its measured knowledge of the model function $f$ to solve this optimization problem. If $f$ is known to be smooth and concave in the server bandwidth $x$, MIN_AVG can be solved using a greedy gradient-ascent strategy that computes a unique, optimal solution as follows: (1) Start with $x_i = \delta$ for every swarm $i$, for a small $\delta$; (2) Allocate the next $\delta$ units of capacity (divided equally) to the swarm(s) with the steepest gradient(s) of the objective with respect to their bandwidth $x_i$; (3) If not all units of capacity have been allocated, go to (2); else terminate.
If the model function $f$ is piecewise linear and concave, the above strategy still works, but the resulting solution may not be unique. In order to compute a unique optimal solution, CheatSheet cleans the measured $f$ by fitting smooth and concave polynomial curves for each line in Figure 2. We assume that this data cleaning has been already performed while describing the solutions to the next two objectives as well.
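The greedy strategy above can be sketched generically for a separable objective, where each swarm contributes a concave gain function of its own bandwidth. How the per-swarm gain is derived from the model for MIN_AVG is left as an input here; the function name, the toy square-root gains, and the tie-breaking (first swarm wins instead of splitting equally) are assumptions of this sketch.

```python
import math


def greedy_allocate(gains, total, step):
    """Hand out `total` bandwidth in `step` units, each to the swarm with the largest marginal gain."""
    alloc = [step] * len(gains)              # step (1): small initial allocation per swarm
    remaining = total - step * len(gains)
    while remaining >= step:                 # steps (2)-(3): repeat until the budget is used
        marginal = [g(a + step) - g(a) for g, a in zip(gains, alloc)]
        best = max(range(len(gains)), key=marginal.__getitem__)
        alloc[best] += step
        remaining -= step
    return alloc


# Toy concave gains: the first swarm benefits four times as much per unit of bandwidth.
gains = [lambda x: 4 * math.sqrt(x), lambda x: math.sqrt(x)]
alloc = greedy_allocate(gains, total=100, step=1)
```

With concave gains the marginal benefit of each swarm only shrinks as it receives more bandwidth, which is why a one-pass greedy allocation suffices.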
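If the model function $f$ monotonically increases with the server bandwidth $x$, MIN_MAX can be solved optimally using a simple greedy heuristic. For a target rate $r$, let $x_i(r)$ denote the server bandwidth that swarm $i$ requires to achieve the corresponding average download time. The heuristic is as follows: (1) Initialize the target rate $r = \delta$ for a small $\delta$; (2) Set $x_i = x_i(r)$ for $1 \le i \le k$; (3) If the bandwidth allocation required to achieve the target is feasible, i.e., the sum of the $x_i$ does not exceed the total server bandwidth, increment the target rate to $r + \delta$ and go to (2); else terminate.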
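A minimal sketch of this heuristic follows. The `required` callables stand in for inverting the model per swarm (bandwidth needed to reach a given target rate) and are hypothetical, as are the function name and the toy numbers; the sketch returns the last feasible target and its allocation.

```python
def min_max_allocate(required, total, step):
    """Raise a common target rate in `step` units while the summed bandwidth need fits in `total`."""
    rate, alloc = 0, None
    while True:
        trial = [need(rate + step) for need in required]
        if sum(trial) > total:       # next target is infeasible: keep the last feasible one
            return rate, alloc
        rate, alloc = rate + step, trial


# Toy inverse models: the first swarm needs half as much server bandwidth per unit of rate.
rate, alloc = min_max_allocate([lambda r: r / 2, lambda r: r], total=150, step=10)
```

The loop terminates only if the required bandwidth grows without bound as the target rises, which holds for the toy functions and is assumed for a monotonic model.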
Optimization formulation for MIN_COST:
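If the model function $f$ is invertible in the server bandwidth, then MIN_COST can be solved by setting each $x_i$ to the inverse of $f$ evaluated at the target performance of swarm $i$.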
IV-B Dynamic controller
We implement three dynamic controllers: AIAD, Leveler, and AntFarm.
AIAD optimizes the MIN_COST objective and works as follows. Suppose the target average download time of the swarm is $T$ and the file size is $S$. AIAD starts from an initial server bandwidth $x$. Once every epoch, it measures the average download rate, $r$, of peers in the swarm. If $r < S/T$, it increases the server bandwidth by $\delta$. Otherwise it decreases $x$ by $\delta$, except when the decrement would cause $x$ to dip below a minimum bandwidth threshold. Our implementation sets the epoch length to 200 s, $\delta$ to 10 KBps, and the minimum bandwidth threshold to 5 KBps.
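One epoch of this additive-increase, additive-decrease rule can be sketched as below. The function and parameter names are hypothetical; the defaults of 10 KBps and 5 KBps come from the text. Holding the bandwidth unchanged when a decrement would cross the threshold is an assumption, since the text does not say whether the controller holds or clamps.

```python
def aiad_step(bandwidth, measured_rate, target_rate, delta=10, floor=5):
    """One AIAD epoch: add delta if below target, else subtract delta unless that dips below floor."""
    if measured_rate < target_rate:
        return bandwidth + delta
    if bandwidth - delta < floor:
        return bandwidth             # assumed behavior: skip the decrement at the threshold
    return bandwidth - delta


# Target rate is file size over target download time, e.g. 10,000 KB in 100 s.
target = 10_000 / 100
print(aiad_step(50, 80, target), aiad_step(50, 120, target), aiad_step(10, 120, target))
```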
Leveler optimizes the MIN_MAX objective. At the start, Leveler assigns equal bandwidth to all swarms. Once every epoch, Leveler measures the average download rate of all swarms. The server bandwidth is increased by a small, fixed $\delta$ for swarms whose download rate is lower than the median of average download rates. Similarly, Leveler reduces the server bandwidth by $\delta$ for each swarm with average download rate higher than the median value. Similar to AIAD, Leveler never reduces the server bandwidth allocated to a swarm below a minimum threshold. Epoch length, $\delta$, and the minimum bandwidth are the same as in AIAD.
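A single Leveler epoch, as described, might look like the following sketch. The function name and signature are hypothetical, and skipping a decrement that would cross the minimum threshold is an assumption of this sketch.

```python
from statistics import median


def leveler_step(alloc, rates, delta=10, floor=5):
    """One Leveler epoch: add delta to swarms below the median rate, subtract it from those above."""
    mid = median(rates)
    new = []
    for x, r in zip(alloc, rates):
        if r < mid:
            x += delta
        elif r > mid and x - delta >= floor:   # never drop below the minimum threshold
            x -= delta
        new.append(x)
    return new


print(leveler_step([50, 50, 50], [40, 80, 120]))
```

Note that when a decrement is skipped at the threshold, the total allocated bandwidth in this sketch can grow by one step; how the real system reconciles this with the total budget is not stated in the text.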
AntFarm optimizes the MIN_AVG objective and is based on the algorithm in the AntFarm paper. At the start, AntFarm allocates a small initial bandwidth to every swarm, and then assigns the server bandwidth to swarms in small increments until all server bandwidth is used up.
In steady state, AntFarm computes the bandwidth allocation using response curves for each swarm that predict the swarm performance as a function of server bandwidth. AntFarm measures download rates of peers periodically to obtain a set of sample data points of the form (server bandwidth, swarm performance). The response curve for a swarm is computed by fitting a concave, piecewise-linear curve to this set of data points. Given the response curves for all swarms, their bandwidth allocation is determined using a gradient-ascent algorithm similar to that used by CheatSheet to optimize the MIN_AVG objective. We refer the reader to our tech report for a detailed description of our implementation.
IV-C Static controller
We implement three static controllers. (1) BitTorrent sets an upload limit at the server for a set of swarms but does not set a per-swarm limit. The bandwidth that the server gives to each swarm is determined by the number of peers from that swarm connected to the server. (2) EqualSplit splits the available server bandwidth equally among all swarms. (3) PropSplit splits the total server bandwidth among swarms in proportion to each swarm's peer arrival rate.
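EqualSplit and PropSplit reduce to a few lines each. The sketch below uses hypothetical function names and assumes arrival rates are already known; the BitTorrent controller is not shown because its per-swarm share emerges from peer connections rather than from an explicit formula.

```python
def equal_split(total_bw, swarms):
    """EqualSplit: every swarm gets the same share of server bandwidth."""
    share = total_bw / len(swarms)
    return {swarm: share for swarm in swarms}


def prop_split(total_bw, arrival_rates):
    """PropSplit: shares are proportional to each swarm's peer arrival rate.

    arrival_rates: dict swarm -> mean peer arrival rate (peers per second)
    """
    total_rate = sum(arrival_rates.values())
    return {swarm: total_bw * rate / total_rate
            for swarm, rate in arrival_rates.items()}


# Example with 200 KBps of server bandwidth and two swarms.
equal = equal_split(200.0, ["popular", "unpopular"])
prop = prop_split(200.0, {"popular": 0.75, "unpopular": 0.25})
```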
Our comparison of controller strategies, presented in this section, answers two main questions: (1) Do dynamic and model-based controllers improve performance over static controllers, and if so, by how much? (2) Which type of controller, dynamic or model-based, performs better for the objectives in Section IV? Our experiments show that the model-based controller outperforms dynamic controllers on all three objectives we compared. Static controllers cannot equal a model-based controller either; they perform well on some workloads and objectives but fare poorly on others.
Experimental setup: We performed our evaluation on 350 PlanetLab nodes using an instrumented BitTorrent client. Upload and download capacities of peers and peer inter-arrival times follow the same distributions as in our measurements.
Due to limited upload capacities and the daily data transfer limit on PlanetLab, we focus on experiments with small file sizes. File download time increases in proportion to the file size, as we cannot increase upload capacity significantly. Thus, an experiment with a 1 GB file could take 100× longer to finish than an experiment with a 10 MB file if the same number of file downloads occur in both experiments (assuming both swarms have the same aggregate demand).
V-A Average download time
First, we compare controllers on the MIN_AVG objective. We select a workload consisting of 20 swarms whose mean arrival rates are chosen according to a Zipf distribution with parameter 1.5. The mean arrival rates of the most popular swarm and the least popular swarm are 0.5/s and 0.0055/s respectively. Each swarm distributes a file of size 10 MB. The total server bandwidth is set to 200 KBps.
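The quoted endpoint rates are consistent with a rank-based Zipf rule in which the swarm of rank i has arrival rate proportional to 1/i^1.5. The sketch below generates such a workload; the exact construction used in the experiments is not given in the text, so anchoring the rule at the most popular swarm's rate is an assumption, and the function name is hypothetical.

```python
def zipf_arrival_rates(num_swarms=20, top_rate=0.5, exponent=1.5):
    """Arrival rate (peers per second) for swarms ranked 1..num_swarms,
    assuming rate_i = top_rate / i**exponent."""
    return [top_rate / rank ** exponent for rank in range(1, num_swarms + 1)]


rates = zipf_arrival_rates()
most_popular = rates[0]    # 0.5 peers/s by construction
least_popular = rates[-1]  # 0.5 / 20**1.5, roughly 0.0056 peers/s
```

The least popular swarm's rate under this rule lands close to the 0.0055/s quoted above, which suggests the workload was built this way or in a closely related manner.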
Figure 7 shows how the average download time changes over time for the different compared schemes. The average is computed using the download time of peers that completed their download within the previous 2000 sec interval as well as the resident time, i.e., the time since arrival, for peers whose downloads are still in progress.
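The windowed metric described above can be sketched as follows. The record format (arrival time, departure time or `None`) and the function name are assumptions made for illustration; the averaging rule itself follows the description in the text.

```python
def windowed_avg_download_time(peers, now, window=2000.0):
    """Average download time at time `now`.

    peers: iterable of (arrival_time, departure_time) pairs in seconds;
           departure_time is None while the download is in progress.
    Counts completed downloads that finished within the last `window`
    seconds, plus the resident time of peers still downloading.
    """
    samples = []
    for arrival, departure in peers:
        if arrival > now:
            continue  # peer has not arrived yet
        if departure is None or departure > now:
            samples.append(now - arrival)        # resident time so far
        elif departure > now - window:
            samples.append(departure - arrival)  # recent completed download
    return sum(samples) / len(samples) if samples else None


# Example: one old completion (ignored), one recent completion,
# and two peers still downloading at t = 3000 s.
peers = [(0.0, 100.0), (2500.0, 2800.0), (2900.0, None), (2000.0, 3500.0)]
avg = windowed_avg_download_time(peers, now=3000.0)
```

Including resident times is what makes the curve rise while peers are stuck in a swarm, even before any of them finishes.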
There are two main observations from the experiment in Figure 7. First, in the initial phase, EqualSplit, BitTorrent and AntFarm incur much higher average download times than PropSplit and CheatSheet, and their average download times take considerably longer to stabilize. Second, even after all controllers have reached steady state, CheatSheet achieves a download time that continues to be lower (by at least 25%) compared to all other schemes (that perform roughly similarly in steady state in this experiment).
The explanation for these observations is as follows. EqualSplit, BitTorrent and AntFarm have a very high download time at the start of the experiment because they assign a small server bandwidth to large and small swarms alike. If the initial server bandwidth is small, a huge number of peers build up in highly popular swarms, which is reflected in the corresponding download time curves that rise rapidly. For example, the download times in EqualSplit increase rapidly until about 1500 sec as no peers have departed until then. At this point, the download time drops sharply as a result of a horde of peer departures that occur when the last block in a swarm has been uploaded by the server. In contrast, both CheatSheet and PropSplit assign higher bandwidth to popular swarms from the start, so peer departures start much sooner in popular swarms, considerably reducing their average download times. We note here that CheatSheet is implemented so as to begin with an allocation identical to PropSplit until it has a stable estimate of peer arrival rates, at which point it switches to the model-based optimal allocation.
EqualSplit, BitTorrent and AntFarm take considerably longer to reach steady state because the number of peers in highly popular swarms goes through multiple rounds of ramp-ups followed by bulk departures before stabilizing. In this experiment, AntFarm takes the longest time to converge to a steady state because after assigning 5 KBps to each swarm at the beginning, it allocates remaining bandwidth in small chunks of 5 KBps once every 200 sec. AntFarm requires many such 200 sec epochs in order to build a stable response curve for all swarms, resulting in higher download times during this convergence phase. We have observed (not shown for brevity) that reducing the epoch length does not help and sometimes hurts performance as it increases the measurement error in learned response curves.
Why does CheatSheet outperform other schemes even in steady state? Figure 7 shows the steady-state allocations of server bandwidth achieved by different schemes that explain this observation. Swarms are ordered from left to right in decreasing order of popularity. We show only the top 10 most popular swarms for clarity of presentation. CheatSheet uses the model to predict that the most popular swarm is mostly self-sustaining and therefore needs only a small bandwidth to achieve healthy download times. Compared to other controllers, CheatSheet assigns higher bandwidth values to the next four popular swarms that belong to a regime where a small amount of server bandwidth disproportionately improves performance, which considerably reduces the average download time. PropSplit and AntFarm by design assign the most bandwidth to the most popular swarm, but the extra server bandwidth hardly benefits that swarm. BitTorrent is biased more towards the popular swarms (as it receives more peer connections from these swarms compared to singleton swarms), but its allocation is nevertheless sub-optimal. EqualSplit clearly makes a sub-optimal decision by allocating equal bandwidth to all swarms in the light of the above reasons.
V-B Min-max average download time
Next, we compare controllers on the MIN_MAX objective, i.e., minimizing the average download time of the swarm that has the worst average download time. We evaluate on the Zipf workload in the previous subsection and set the total server bandwidth to 500 KBps in these experiments.
Figure 7 shows the average download time of the swarm with the maximum average download time (referred to as MAD time in this discussion). We observe that, even though the workload is the same, the relative performance of controllers is different compared to the experiment in the previous subsection. The MAD time achieved by PropSplit is about twice that of the other controllers, which have relatively small differences between them. EqualSplit and CheatSheet both achieve the lowest MAD time. BitTorrent incurs a higher MAD time than EqualSplit. The performance of Leveler varies with time because it changes the server bandwidth to each swarm periodically and struggles to converge to a steady bandwidth allocation as it shuffles bandwidth across 20 swarms. The reason (not visible in the figure) is that different swarms take different times to manifest the effect of the most recent change. Leveler sometimes “panics” and allocates more bandwidth to the currently worst swarms too quickly, and at other times is too slow to move bandwidth away from swarms that could do without it. The fluctuating performance of Leveler reveals that it is nontrivial to design a robust dynamic controller.
Unpopular swarms, i.e., swarms with a small peer arrival rate, impact the MAD time significantly in this experiment. Unpopular swarms require higher bandwidth than popular swarms to achieve the same download time (Figure 2). Due to the Zipf popularity distribution, a majority of swarms for this workload are unpopular. PropSplit incurs the highest MAD times because it assigns the least bandwidth to the most unpopular swarm, which significantly increases the download time of that swarm. EqualSplit, unlike PropSplit, assigns equal bandwidth to all the swarms and hence has a much smaller MAD time. CheatSheet performs the same as EqualSplit because the unpopular swarms in the workload have nearly the same performance in both cases. Due to a large number of unpopular swarms, CheatSheet only assigns 5 KBps more bandwidth to each unpopular swarm than EqualSplit, which does not sufficiently impact the MAD times.
Does EqualSplit achieve the least MAD time for all workloads? The answer is no. On a workload dominated by popular swarms, EqualSplit results in 50% higher MAD time than CheatSheet (refer to the tech report [18] for this experiment). This experiment illustrates that the performance of static controllers such as EqualSplit can vary depending on the workload.
V-C Target download time
Next, we compare CheatSheet and AIAD against the MIN_COST objective. We do not compare against the simplistic static schemes as they are designed to always use all available capacity (and can therefore be made to appear arbitrarily worse by choosing a sufficiently low target download time in an experiment). Our workload for this experiment consisted of six swarms with peer arrival rates of 0.5/s, 0.14/s, 0.12/s, 0.1/s, 0.08/s, and 0.01/s. All swarms distributed a file of size 10 MB. The target download time for all the swarms is set to 150 sec. We only present detailed results for arrival rates 0.5/s, 0.12/s, and 0.01/s here. Results for other arrival rates are qualitatively consistent and are omitted due to lack of space.
Figure 8 shows the average download time achieved by each strategy over the duration of the experiment (figures on the left column) and shows the corresponding server capacity set by the controllers over the same duration (figures on the right column). The actual bandwidth consumed at the server is very close to the configured capacity shown in the figure.
CheatSheet meets the target download time well in all cases, but AIAD sometimes significantly exceeds the target download time, as in the later part of Figure 8(c). This is because AIAD is not always able to accurately estimate the relation between server bandwidth and the download time. To illustrate this point, Figure 9 shows the measured download rate of the swarm and the server bandwidth limit set by AIAD during this experiment. At t = 2200 s, the measured download rate of the swarm is above the target download rate (10 MB / 150 s ≈ 67 KBps). Hence AIAD decreases the server bandwidth to 40 KBps at t = 2200 s and then to 30 KBps at t = 2400 s. This causes the measured download rate to drop sharply, which is reflected in the increased download time of peers in Figure 8(c). The download time curve shows the increase somewhat later because it is calculated as an average over a window of 2000 s.
We also experimented by changing the interval after which AIAD updates bandwidth to 300 sec, but it continues to fluctuate above the target download time. Of course, if the bandwidth update interval is increased to a sufficiently high value and the bandwidth increments/decrements made small, the AIAD controller will converge to the target download rate. However, it will take longer to converge and will be less responsive if peer arrival rates change.
CheatSheet consumes much less bandwidth than AIAD for the swarm with arrival rate 0.5/s, especially in the first 2000 seconds of the experiment. While AIAD takes several cycles of measurement and perturbation to reach the bandwidth allocation, CheatSheet jumps directly to the minimal required bandwidth using its model.
V-D Summary and discussion
In summary, our evaluation shows that bandwidth allocation done by static controllers is hit-or-miss. A static controller that works well for one objective and workload combination may perform poorly for others. This is intuitively unsurprising and is also consistent with the findings in prior work analyzing a specific optimization objective. For a fixed performance objective, however, the simplicity of static controllers may outweigh their sub-optimality (e.g., EqualSplit for the MIN_MAX metric or PropSplit for the MIN_AVG metric).
Our evaluation also shows that designing a dynamic controller for scenarios involving peer arrivals and departures is nontrivial. Although dynamic controllers are generally superior to any given simplistic static scheme when evaluated over a range of objectives and workloads, we find that they are far from optimal. Indeed, in some scenarios, simple schemes like EqualSplit or PropSplit outperform dynamic control. The reason is that measuring the relationship between swarm performance and allocated bandwidth in an online manner is nontrivial. As a result of measurement errors, a dynamic control scheme is vulnerable to prolonged convergence delays or persistent fluctuations.
The experiments in this paper suggest that a model-based approach is feasible and promising. We find that when a model-based controller is given a cheat sheet based on prior measurements in the regime of interest, it consistently outperforms both static and dynamic controllers for different objectives and workloads.
Nevertheless, having gone through the experience of making a model-based controller work, our conclusion is that, in its current form, the complexity of the model-based approach outweighs its advantages. The extensive set of measurements required to build the model reduces the viability of this approach. Further, the challenges in estimating model parameters such as peer arrival rates and the upload capacity distribution of peers (see Section III-B5) can reduce the effectiveness of the model-based approach.
VI Related work
Our primary contribution is a comparative analysis of different categories of bandwidth controllers for client-assisted content delivery systems, together with the design and implementation of a model-based control approach that, to our knowledge, has not been attempted before. Our work builds upon a large body of prior work that can be grouped into dynamic controllers, models of swarm behavior, and incentive strategies.
Dynamic controllers: AntFarm [8] and V-Formation [9] are closely related to our work. However, both adopt a dynamic controller approach. While AntFarm monitors the average download rate of each peer, V-Formation uses more detailed measurements by monitoring the propagation of each block through the swarm. Our comparison of controller strategies does not include V-Formation because its implementation is proprietary and not publicly available. Dynamic controllers have also been studied for live-streaming P2P systems [23].
Models of swarm behavior: Qiu [11], Fan [12], and Liao [13] analytically model BitTorrent to derive expressions for average download time and other swarm metrics. However, their models make assumptions that over-simplify swarm behavior, e.g., homogeneous upload capacities, a fixed number of peers, and seeds contributing their full upload capacity [11, 12]. To address these concerns, we model swarm performance based on actual swarm measurements.
Incentive strategies: Several BitTorrent clients that improve BitTorrent’s incentive strategies have been proposed, such as BitTyrant [16], Levin et al.’s client [24], and FairTorrent [25]. Other swarming systems (incompatible with BitTorrent) incentivize peers to contribute bandwidth through virtual currencies, e.g., Dandelion [26], or tokens, e.g., AntFarm [8]. Our position is that a large majority of users use BitTorrent clients as-is or use unmodifiable closed-source clients, e.g., Akamai’s NetSession [3], so incentive issues are less important.
In this paper, we performed a comparative evaluation of strategies to control server bandwidth in client-assisted content delivery systems. As part of this effort, we introduced a new approach referred to as model-based control and presented the design and implementation of a model-based controller, CheatSheet, that uses a concise model based on a priori offline measurement of swarm performance as a function of the server bandwidth and other swarm parameters. Our experiments show that simple static strategies are unreliable as they perform well on some workloads and objectives but fare poorly on others. Dynamic control can also lead to a sub-optimal performance as it is prone to prolonged convergence delays and persistent fluctuations. In comparison, a model-based approach consistently outperforms both static and dynamic approaches provided it has access to detailed measurements in the regime of interest. Nevertheless, the broad applicability of a model-based approach may be limited in practice because of the overhead of developing and maintaining a comprehensive measurement-based model of swarm performance in each regime of interest.
[1] BitTorrent, “Bittorrent, inc.” 2008, www.bittorrent.com.
[2] Pando, http://www.pando.com/.
[3] “Akamai netsession,” http://www.akamai.com/client/.
[4] Velocix, http://www.velocix.com/.
[5] “Octoshape,” http://www.octoshape.com.
[6] “PP Live,” http://www.pplive.com/.
[7] SopCast, http://www.sopcast.com/.
[8] R. S. Peterson and E. G. Sirer, “Antfarm: efficient content distribution with managed swarms,” in NSDI, 2009.
[9] R. S. Peterson, B. Wong, and E. G. Sirer, “A Content Propagation Metric for Efficient Content Distribution,” 2011.
[10] Verizon, “Fios,” http://www.verizon.com/fios.
[11] D. Qiu and R. Srikant, “Modeling and performance analysis of bittorrent-like peer-to-peer networks,” in SIGCOMM, 2004.
[12] B. Fan, D.-M. Chiu, and J. Lui, “Stochastic differential equation approach to model peer-to-peer,” in ICC, 2006.
[13] W. Liao et al., “Performance analysis of bittorrent-like systems with heterogeneous users,” Performance Evaluation, 2007.
[14] “Micro Transport Protocol, Wikipedia,” http://bit.ly/MpFSiH.
[15] A. Legout, N. Liogkas, E. Kohler, and L. Zhang, “Clustering and sharing incentives in Bittorrent systems,” in SIGMETRICS, 2007.
[16] M. Piatek, T. Isdal, T. Anderson, A. Krishnamurthy, and A. Venkataramani, “Do incentives build robustness in bittorrent?” in NSDI, 2007.
[17] “Little’s Law,” http://en.wikipedia.org/wiki/Little’s_law.
[18] A. Sharma, A. Venkataramani, and A. Rocha, “Tech Report: A Comparison of Server Bandwidth Controllers for Client-assisted Content Delivery,” URL: http://bit.ly/SvhSAT.
[19] “Amazon EC2,” http://aws.amazon.com/ec2/.
[20] D. Stutzbach and R. Rejaie, “Understanding churn in peer-to-peer networks,” in IMC, 2006.
[21] A. Finamore et al., “Youtube everywhere: impact of device and infrastructure synergies on user experience,” in IMC, 2011.
[22] A. Legout, N. Liogkas, and E. Kohler, “Rarest first and choke algorithms are enough,” in IMC, 2006.
[23] C. Wu, B. Li, and S. Zhao, “Multi-channel live p2p streaming: Refocusing on servers,” in INFOCOM, 2008.
[24] D. Levin et al., “Bittorrent is an auction: Analyzing and improving Bittorrent incentives,” in SIGCOMM, 2008.
[25] A. Sherman, J. Nieh, and C. Stein, “Fairtorrent: bringing fairness to peer-to-peer systems,” in CoNEXT, 2009.
[26] M. Sirivianos, X. Yang, and S. Jarecki, “Robust and efficient incentives for cooperative content distribution,” IEEE/ACM Trans. Netw., 2009.