mvfst-rl: An Asynchronous RL Framework for Congestion Control with Delayed Actions
Effective network congestion control strategies are key to keeping the Internet (or any large computer network) operational. Network congestion control has been dominated by hand-crafted heuristics for decades. Recently, Reinforcement Learning (RL) has emerged as an alternative to automatically optimize such control strategies. Research so far has primarily considered RL interfaces which block the sender while an agent considers its next action. This is largely an artifact of building on top of frameworks designed for RL in games (e.g. OpenAI Gym). However, this does not translate to real-world networking environments, where a network sender waiting on a policy without sending data is costly for throughput. We instead propose to formulate congestion control with an asynchronous RL agent that handles delayed actions. We present mvfst-rl, a scalable framework for congestion control in the QUIC transport protocol that leverages the state of the art in asynchronous RL training with off-policy correction. We analyze modeling improvements to mitigate the deviation from Markovian dynamics, and evaluate our method on emulated networks from the Pantheon benchmark platform. The source code is publicly available at https://github.com/facebookresearch/mvfst-rl.
Congestion control is one of the key components of any large computer network, most notably the Internet, and is crucial to enable operation at scale. (It is estimated that 150,000 PB of data were sent per month over the Internet in 2018; see https://www.statista.com/statistics/499431/global-ip-data-traffic-forecast/.) The goal of a congestion control algorithm is to dynamically regulate the rate of data being sent at each node to maximize total throughput and minimize queuing delay and packet loss. The vast majority of network strategies still rely on hand-crafted heuristics that are reactive rather than predictive. An early method, Remy, demonstrated that offline-learned congestion control can be competitive with conventional methods. More recently, RL-based approaches have been proposed and show promise in simulated environments [6; 13].
Despite the above advances, to our knowledge, no RL method has been transferred to real-world production systems. One of the biggest drawbacks of RL congestion control from a deployment perspective is that policy lookup takes orders of magnitude longer than hand-crafted methods. Moreover, networking servers often have limited resources to run a machine learning model, thus requiring inference to be offloaded to dedicated servers and further delaying action updates. Current RL congestion control environments follow the synchronous RL paradigm (e.g. using the OpenAI Gym interface), where model execution blocks the environment (the network sender). This makes them infeasible for deployment, where a sender waiting on a synchronous RL agent for congestion control, even for a few milliseconds, negatively impacts throughput (Figure 1).
In this paper we introduce mvfst-rl, a training framework that addresses these issues with a non-blocking RL agent for congestion control. For training in the presence of asynchronous interaction between the environment and the learner (owing to the inability of the environment to wait for gradient updates), we leverage IMPALA, which uses importance sampling for off-policy correction. mvfst-rl is built on mvfst (https://github.com/facebookincubator/mvfst), a C++ implementation of the QUIC transport protocol used in Facebook production networks, allowing seamless transfer to deployment. To emulate real-world network traffic in RL environments for training, we build upon Pantheon, which obviates the need for hand-written traffic patterns and network topologies. We evaluate training with delayed actions and present our results on the Pantheon test-bed.
2 Related Work
A few different studies have applied RL to congestion control in varying ways. Iroko takes the approach of a centralized policy to regulate the sending rates of all nodes in a network topology. While this is applicable to small and medium sized networks, it is intractable for large or Internet-scale networks. Iroko also requires manually specifying network topologies and traffic generators, making it difficult to emulate real-world network conditions. PCC-RL implements congestion control as an environment in OpenAI Gym, with a blocking RL agent. Park's congestion control environment based on CCP is closest to mvfst-rl in design, with Remote Procedure Calls (RPC) for environment–agent communication, but it effectively takes a synchronous approach with its short step-time of 10ms, constraining it to very small models.
3 Congestion Control as MDP with Delayed Actions
Consider a Markov Decision Process (MDP) formulated as $(\mathcal{S}, \mathcal{A}, P, R)$, where $\mathcal{S}$ is a set of states, $\mathcal{A}$ is a set of actions, $R$ is the reward obtained, and $P$ is the state-transition probability function. Unrolling over time, we have the trajectory $(s_0, a_0, r_0, s_1, a_1, r_1, \ldots)$. Casting congestion control as an MDP, the state space includes network statistics gathered from acknowledgement packets (ACKs) from the receiver, such as round-trip time (RTT), queuing delay, and packets sent, acknowledged and lost. The action is an update to the congestion window size, cwnd, which is the number of unacknowledged packets that are allowed to be in flight. A large cwnd leads to high throughput, but also to increased delay due to queuing at intermediate buffers and network congestion. Our choice of action space is a discrete set of updates to cwnd (Appendix A.2). The reward is generally a function of measured throughput and delay, aiming to strike a trade-off between the two.
When an agent acts asynchronously on the environment at fixed time-intervals $T$, the action $a_t$ is applied after a delay $\epsilon_t$ such that $0 < \epsilon_t < T$, where $\epsilon_t$ includes the policy lookup time. The environment meanwhile would have transmitted further data based on the old action $a_{t-1}$ during the interval $\epsilon_t$. The next state $s_{t+1}$ therefore depends not just on $s_t$ and $a_t$, but also on $a_{t-1}$. To incorporate this into the MDP, the state space can be augmented with the last action taken. In addition to this, given that our action space is relative to the previous actions taken, we find that a longer history of actions in the environment state helps learning. Therefore, we augment our state space with an action history of generic length $k$, which we treat as a hyperparameter, and obtain our augmented state space $\tilde{\mathcal{S}} = \mathcal{S} \times \mathcal{A}^k$. At time $t$, the augmented state is $\tilde{s}_t = (s_t, a_{t-1}, \ldots, a_{t-k})$. With this formalism, we define our environment state and reward function below.
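As a concrete illustration, the state augmentation above can be sketched as a fixed-length action history maintained alongside the raw network state. This is a simplified sketch, not the mvfst-rl implementation; the class and method names are hypothetical.

```python
from collections import deque

import numpy as np

NUM_ACTIONS = 5   # |A|: the discrete cwnd updates (Appendix A.2)
HISTORY_LEN = 20  # k: the action-history length, treated as a hyperparameter


class AugmentedState:
    """Keeps the last k actions so that s~_t = (s_t, a_{t-1}, ..., a_{t-k})."""

    def __init__(self, k=HISTORY_LEN):
        # Before any action is taken, fill the history with a no-op action (index 0).
        self.history = deque([0] * k, maxlen=k)

    def observe(self, raw_state, last_action):
        # Record the most recent (possibly delayed) action; the oldest one drops out.
        self.history.appendleft(last_action)
        # Concatenate one-hot encodings of the action history to the raw state.
        one_hots = np.eye(NUM_ACTIONS)[list(self.history)].ravel()
        return np.concatenate([raw_state, one_hots])
```

Because the delayed action $a_t$ only lands $\epsilon_t$ after the state update, the history fed to the policy always reflects the actions actually applied so far, not the ones still in flight.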
Environment State: Our choice of state vector contains a summary of network statistics gathered over a 100ms window, along with a history of actions taken. From each ACK within the window, we gather 20 features such as the latest and minimum RTT, queuing delay, bytes sent, bytes acknowledged, and bytes lost and re-transmitted. The complete set of features is listed in Appendix A.1. To normalize the features, the time fields are measured in milliseconds and scaled by a constant factor, and the byte fields are in KB. We follow Copa's strategy of computing RTTStanding to reduce noise in delay measurements. For each feature, we calculate the sum, mean, standard deviation, min and max within a window, and concatenate them with one-hot vectors of past actions along with the cwnd on applying those actions. This results in a state vector of size $100 + k(|\mathcal{A}|+1)$, where $\mathcal{A}$ is the action space defined earlier and $k$ is the action-history length.
Reward: Our reward is simply $r = \tau - \delta \cdot d$, where $\tau$ and $d$ are the average throughput in MB/sec and the maximum delay in milliseconds, respectively, during the 100ms window, and the parameter $\delta$ trades off between the two. We experimented with logarithmic scales, such as Copa's optimization function $\log(\tau) - \delta \log(d)$, but we found it difficult to attain higher throughput such as 100 Mbps on faster networks when trained jointly with poor network scenarios such as cellular connections with 0.5 Mbps throughput.
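A minimal sketch of this linear reward for one 100ms window (the function name is hypothetical; the default trade-off of 0.2 is the value chosen by the hyperparameter sweeps described later):

```python
def window_reward(avg_throughput_mb_s, max_delay_ms, delta=0.2):
    """Linear throughput/delay trade-off computed once per 100ms window."""
    return avg_throughput_mb_s - delta * max_delay_ms
```

Unlike a logarithmic objective, the linear form keeps the gradient with respect to throughput constant, so a 100 Mbps network and a 0.5 Mbps network both pull the policy toward their respective maxima, at the cost of very different reward scales across actors.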
In the following sections, we describe a framework for asynchronous RL training of congestion control with delayed actions and provide experimental analysis of the MDP with the augmented state space.
4 Introducing mvfst-rl
Framework: mvfst-rl follows an environment-driven design with an asynchronous RL agent. It integrates three components: (1) mvfst, Facebook's implementation of the QUIC transport protocol used in production servers, (2) TorchBeast (https://github.com/facebookresearch/torchbeast), a PyTorch implementation of IMPALA for fast, asynchronous, parallel RL training, and (3) Pantheon network emulators calibrated with Bayesian Optimization to match real-world network conditions. The RL congestion controller accumulates network statistics from ACKs over a fixed time-window, say 100ms, and sends a state update asynchronously to an RL agent in a separate thread. Once the agent executes the policy, the action to update cwnd is applied asynchronously to the transport layer (Figure 2(a)).
IMPALA was originally designed to overcome scaling limitations in the training of batched actor-critic methods, where $n$ parallel actors act for a fixed number of steps each to gather trajectories (sequences of $(s_t, a_t, r_t)$ tuples), and then wait for a gradient update on the policy by the learner. IMPALA relaxes this restriction: with an off-policy correction algorithm called V-trace, it allows actors to continue gathering trajectories while the learner asynchronously updates the policy. Although the original motivation was higher training throughput, the decoupled actor-learner algorithm and off-policy correction are well-suited for mvfst-rl, allowing the actors corresponding to network senders to not block on gradient updates, as doing so would change the dynamics of the underlying network.
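As a rough illustration of the off-policy correction, the V-trace targets from the IMPALA paper can be sketched for a single trajectory in a few lines of NumPy. This is a simplified scalar version with discount $\gamma$ and clipping thresholds $\bar{\rho}$ and $\bar{c}$, not the TorchBeast implementation:

```python
import numpy as np


def vtrace_targets(values, rewards, ratios, gamma=0.99, rho_bar=1.0, c_bar=1.0):
    """Compute V-trace targets v_s for one trajectory of length T.

    values:  V(x_0), ..., V(x_T)        shape (T+1,)
    rewards: r_0, ..., r_{T-1}          shape (T,)
    ratios:  pi(a|x) / mu(a|x)          shape (T,)  (target over behavior policy)
    """
    T = len(rewards)
    rhos = np.minimum(rho_bar, ratios)       # clipped importance weights rho_t
    cs = np.minimum(c_bar, ratios)           # clipped trace cutoffs c_t
    targets = np.array(values, dtype=float)  # initializes v_T = V(x_T)
    # Backward recursion: v_s = V_s + delta_s + gamma * c_s * (v_{s+1} - V_{s+1}),
    # with delta_s = rho_s * (r_s + gamma * V_{s+1} - V_s).
    for t in reversed(range(T)):
        delta = rhos[t] * (rewards[t] + gamma * values[t + 1] - values[t])
        targets[t] = values[t] + delta + gamma * cs[t] * (targets[t + 1] - values[t + 1])
    return targets
```

When the actor's behavior policy equals the learner's policy (all ratios equal to 1), the recursion reduces to the standard n-step Bellman target, so the correction only kicks in for the stale trajectories that asynchronous acting produces.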
Figure 2(b) illustrates the training architecture. Each actor corresponds to a separate Pantheon instance in a sender–receiver setup with a randomly chosen emulated network scenario. The senders run the QUIC protocol with the RL congestion controller and communicate state updates via RPC to the actors. The actors execute the model for the next action, and once a full trajectory is obtained, they communicate it to the learner for a gradient update. All communications are asynchronous and do not block the network sender. After training, the model can be exported with TorchScript and executed in C++ on the network host, without the need for RPC.
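The non-blocking interaction between sender and agent can be sketched with two threads and a pair of queues standing in for the RPC channel. This is a toy illustration, not mvfst-rl code; the names and the fixed "+10" stand-in policy are hypothetical.

```python
import queue
import threading
import time


def run(steps=5):
    """Simulate one sender/agent pair with asynchronous, non-blocking actions."""
    state_q = queue.Queue()   # sender -> agent: state updates (RPC in mvfst-rl)
    action_q = queue.Queue()  # agent -> sender: delayed cwnd updates
    cwnd_log = []

    def sender_loop():
        # Network sender: never blocks on the policy; keeps using the last cwnd.
        cwnd = 10
        for step in range(steps):
            state_q.put({"step": step, "cwnd": cwnd})  # non-blocking state update
            try:
                cwnd = action_q.get_nowait()           # apply a delayed action if ready
            except queue.Empty:
                pass                                   # otherwise keep transmitting as-is
            cwnd_log.append(cwnd)
            time.sleep(0.01)                           # stand-in for the 100ms window

    def agent_loop():
        # RL agent: policy lookup runs concurrently with transmission.
        for _ in range(steps):
            state = state_q.get()
            action_q.put(state["cwnd"] + 10)           # stand-in for a policy lookup

    sender = threading.Thread(target=sender_loop)
    agent = threading.Thread(target=agent_loop)
    sender.start(); agent.start()
    sender.join(); agent.join()
    return cwnd_log
```

The key property is that `sender_loop` only polls with `get_nowait`: if the policy has not answered yet, transmission continues under the old action, exactly the delayed-action setting formalized in Section 3.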
Model: Our model trunk is a two-layer fully-connected neural network with 512 units each and ReLU non-linearity. The extracted features and the reward are fed into a single-layer LSTM. For correctness during training, along with each trajectory, an actor also communicates the LSTM hidden state at the beginning of the trajectory to the learner. Finally, the LSTM output is fed into two heads: a policy head with $|\mathcal{A}|$ outputs, and another head for the policy-gradient baseline.
5 Experiments
Training: Our training setup consists of 40 parallel actors, each corresponding to a randomly chosen calibrated network emulator. We train episodically, with each Pantheon sender–receiver instance running for 30 seconds and the episodic trajectory used as a rollout for training. Episodic training provides the opportunity to learn dynamics similar to TCP Slow-Start, where the behavior of the algorithm during startup is sufficiently different from steady state. Following hyperparameter sweeps, we tuned the initial learning rate for RMSProp and set the trade-off parameter in the reward to 0.2. All experiments are run for the same total number of steps.
Results: Figure 3(a) shows the test performance on a Pantheon emulator calibrated against a Nepal-to-AWS-India link. mvfst-rl achieves competitive performance, with higher throughput and similar delays compared to traditional methods like Cubic, Copa and NewReno, as well as better performance than learned methods like TaoVA. Compared to rl-random, a baseline agent that picks a random action from the same action space at each step, our model consistently performs better. We also test against a network emulator unseen during training, where it outperforms most other methods (Figure 3(b)).
One challenge we faced is generalization when training jointly over networks with widely different ranges of throughput and delay, causing rewards to be scaled very differently among the actors. We experimented with reward-clipping strategies as well as reward functions with logarithmic components as in Copa, but they result in policies that struggle to achieve high throughput when trained jointly with low-throughput networks. This suggests that further research into reward shaping and multi-task learning strategies might be warranted.
Ablations: We examine the impact of the length of the action history in the augmented state space. Figure 3(c) plots the total reward obtained per episode, over the course of training, for varying values of the action-history length $k$. While models with little or no action history struggle to learn, performance is best around $k=20$, with diminishing returns beyond that. We also run the same experiment without the LSTM and find the performance to be significantly worse, especially for larger values of $k$ (Figures 3(d) and 3(e)). The results show that the MDP with the augmented state space performs better with delayed actions, and validate the need for a recurrent policy in such a setting.
6 Conclusion and Future Work
RL is a promising direction for improving real-world systems. Yet, it is challenging to deploy RL in datacenters due to performance and resource constraints, especially when evaluated against hand-written heuristics. We take the first steps towards moving away from blocking RL agents for systems problems and introduce a framework for learning asynchronous RL policies in the face of delayed actions. When applied to congestion control, our initial results show promise in the MDP with augmented state space. One challenge we face is learning a joint policy for a range of network scenarios when reward scales vary wildly due to differences inherent in the environment. We believe this is an interesting direction for further research on reward normalization strategies across environments. We hope to evaluate our trained congestion control agents in production networks in the future by leveraging mvfst-rl’s efficient implementation and tight integration with the QUIC transport protocol. mvfst-rl has been open-sourced and is available at https://github.com/facebookresearch/mvfst-rl.
We would like to thank Jakob Foerster for fruitful discussions on applying RL for networks, Udip Pant, Ranjeeth Dasineni and Subodh Iyengar for their support related to mvfst, and Francis Cangialosi for insightful discussions around CCP.
References
- (2018) Copa: Practical delay-based congestion control for the internet. In 15th USENIX Symposium on Networked Systems Design and Implementation (NSDI 18), Renton, WA, pp. 329–342.
- (2016) OpenAI Gym. CoRR abs/1606.01540.
- (2018) IMPALA: Scalable distributed deep-RL with importance weighted actor-learner architectures. In Proceedings of the 35th International Conference on Machine Learning, PMLR 80, Stockholm, Sweden, pp. 1407–1416.
- (2012) Overview of mini-batch gradient descent. Lecture notes.
- (1997) Long short-term memory. Neural Computation 9, pp. 1735–1780.
- (2019) A deep reinforcement learning perspective on internet congestion control. In Proceedings of the 36th International Conference on Machine Learning, PMLR 97, Long Beach, CA, pp. 3050–3059.
- (2019) TorchBeast: A PyTorch platform for distributed RL.
- (2019) Park: An open platform for learning augmented computer systems. In ICML Reinforcement Learning for Real Life Workshop.
- (1975) On Bayesian methods for seeking the extremum. In Optimization Techniques IFIP Technical Conference, Novosibirsk, pp. 400–404.
- (2018) Restructuring endpoint congestion control. In Proceedings of the 2018 Conference of the ACM Special Interest Group on Data Communication (SIGCOMM '18), New York, NY, pp. 30–43.
- (2017) Automatic differentiation in PyTorch. In NIPS Autodiff Workshop.
- (1994) Markov Decision Processes: Discrete Stochastic Dynamic Programming. 1st edition, John Wiley & Sons, New York, NY.
- (2018) Iroko: A framework to prototype reinforcement learning for data center traffic control. CoRR abs/1812.09975.
- (2013) TCP ex machina: Computer-generated congestion control. In Proceedings of the ACM SIGCOMM 2013 Conference (SIGCOMM '13), New York, NY, pp. 123–134.
- (2018) Pantheon: The training ground for internet congestion-control research. In 2018 USENIX Annual Technical Conference (USENIX ATC 18), Boston, MA, pp. 731–743.
Appendix A
A.1 State Space
We design our state space to contain aggregate network statistics obtained from ACKs within each 100ms window (the environment step-time) and an action history of length $k$. The optimal value of $k$ is chosen by hyperparameter sweeps to be 20. When an ACK or a packet-loss notification arrives, we gather the following 20 network statistics:
1. lrtt: Latest round-trip time (RTT) in milliseconds (ms)
2. rtt_min: Minimum RTT (in ms) over a 10-second window
3. srtt: Smoothed RTT (in ms)
4. rtt_standing: Minimum RTT (in ms) over a recent time window, following Copa
5. rtt_var: Variance in RTT (in ms) as observed by the network protocol
6. delay: Queuing delay (in ms)
7. cwnd_bytes: Congestion window in bytes
8. inflight_bytes: Number of bytes sent but unacknowledged (cannot exceed cwnd_bytes)
9. writable_bytes: Number of writable bytes (cwnd_bytes minus inflight_bytes)
10. sent_bytes: Number of bytes sent since the last ACK
11. received_bytes: Number of bytes received since the last ACK
12. rtx_bytes: Number of bytes re-transmitted since the last ACK
13. acked_bytes: Number of bytes acknowledged in this ACK
14. lost_bytes: Number of bytes lost in this loss notification
15. throughput: Instantaneous throughput
16. rtx_count: Number of packets re-transmitted since the last ACK
17. timeout_based_rtx_count: Number of re-transmissions due to Probe Timeout (PTO) since the last ACK
18. pto_count: Number of times the packet-loss timer fired before receiving an ACK
19. total_pto_count: Total number of times the packet-loss timer fired since the last ACK
20. persistent_congestion: Flag indicating whether persistent congestion is detected by the protocol
As a way of normalization, the time-based features (features 1–6) and the byte-based features (features 7–14) are each scaled by a constant factor. Since there can be a varying number of ACKs within any 100ms time window corresponding to an environment step, we obtain a fixed-size state vector by computing aggregate statistics over all ACKs within a window. This is achieved by computing the sum, mean, standard deviation, min and max of each of the 20 features and flattening the resulting 100 features into a single vector. For features 1–9, sums are not meaningful, so we set those entries to zero in the state vector.
The history at time $t$ is encoded as a one-hot vector of the action $a_t$ along with the congestion window resulting from that action. To the aggregate features, we concatenate a history of length $k$, resulting in our final state vector of length $100 + k(|\mathcal{A}|+1)$, where $|\mathcal{A}|$ is the size of the action space. In our experiments, $k=20$ and $|\mathcal{A}|=5$, resulting in a state vector of length 220.
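Putting the pieces together, the construction of the 220-dimensional state vector can be sketched in NumPy. This is a simplified illustration rather than the actual mvfst-rl code; `build_state` and its argument names are hypothetical.

```python
import numpy as np

NUM_FEATURES = 20  # raw features gathered per ACK
NUM_ACTIONS = 5    # |A|
K = 20             # action-history length


def build_state(acks, action_history, cwnd_history):
    """acks: (num_acks, 20) array of raw features from one 100ms window.
    action_history: last K action indices.
    cwnd_history: the cwnd value resulting from each of those actions."""
    # 5 aggregate statistics per feature: 5 x 20 = 100 entries.
    agg = np.concatenate([
        acks.sum(axis=0), acks.mean(axis=0), acks.std(axis=0),
        acks.min(axis=0), acks.max(axis=0),
    ])
    # One-hot action plus resulting cwnd: K x (|A| + 1) = 120 entries.
    one_hots = np.eye(NUM_ACTIONS)[action_history]                  # (K, |A|)
    hist = np.concatenate(
        [one_hots, np.asarray(cwnd_history, dtype=float)[:, None]], axis=1
    )
    return np.concatenate([agg, hist.ravel()])  # 100 + K * (|A| + 1) = 220
```

A zeroing step for the sum entries of features 1–9, as described above, would be applied to `agg` before concatenation; it is omitted here for brevity.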
A.2 Action Space
While there are various possible choices for the action space $\mathcal{A}$, we chose the simplest set of discrete updates common in conventional congestion control methods:
$\mathcal{A} = \{\text{cwnd}, \; \text{cwnd}/2, \; \text{cwnd}-10, \; \text{cwnd}+10, \; 2\cdot\text{cwnd}\}$
where cwnd is the congestion window in units of Maximum Segment Size (MSS). The network environment starts with an initial cwnd of 10, and bounded updates are applied according to the policy as follows:
$\text{cwnd} \leftarrow \text{bound}(\text{update}(\text{cwnd}, a_t))$
where $a_t$ is the action according to the policy at time $t$ and is an index into the action space $\mathcal{A}$, $\text{update}$ is a function that updates the current congestion window according to $a_t$, and the function $\text{bound}$ clamps the congestion window to reasonable limits.
mvfst-rl supports configuring the action space to any discrete set of updates via a simple string format. The action space above can be configured as "0,/2,-10,+10,*2".
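One possible way to parse such a configuration string and apply a bounded update is sketched below. The parsing logic and the specific bounds are illustrative assumptions, not mvfst-rl's actual implementation.

```python
# Hypothetical clamp limits on cwnd, in units of MSS; mvfst-rl's real bounds may differ.
CWND_MIN, CWND_MAX = 2, 10_000


def parse_action_space(spec="0,/2,-10,+10,*2"):
    """Turn the config string into a list of cwnd-update functions."""
    ops = []
    for token in spec.split(","):
        if token == "0":
            ops.append(lambda c: c)                        # no-op
        elif token.startswith("/"):
            ops.append(lambda c, d=int(token[1:]): c // d)  # e.g. "/2": halve
        elif token.startswith("*"):
            ops.append(lambda c, m=int(token[1:]): c * m)   # e.g. "*2": double
        else:
            ops.append(lambda c, d=int(token): c + d)       # "+10" / "-10": additive
    return ops


def apply_action(cwnd, action_idx, ops):
    """Bounded update: apply the chosen op, then clamp cwnd to sane limits."""
    return max(CWND_MIN, min(CWND_MAX, ops[action_idx](cwnd)))
```

Note the default-argument trick (`d=int(token[1:])`) that freezes each token's value at definition time, since a plain closure over `token` would see only the last token of the loop.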