mvfst-rl: An Asynchronous RL Framework for Congestion Control with Delayed Actions

Viswanath Sivakumar (Facebook AI Research), Tim Rocktäschel, Alexander H. Miller (Facebook AI Research), Heinrich Küttler (Facebook AI Research), Nantas Nardelli, Mike Rabbat (Facebook AI Research), Joelle Pineau, Sebastian Riedel

Effective network congestion control strategies are key to keeping the Internet (or any large computer network) operational. Network congestion control has been dominated by hand-crafted heuristics for decades. Recently, Reinforcement Learning (RL) has emerged as an alternative for automatically optimizing such control strategies. Research so far has primarily considered RL interfaces that block the sender while an agent considers its next action, largely an artifact of building on top of frameworks designed for RL in games (e.g. OpenAI Gym). This does not translate to real-world networking environments, where a network sender waiting on a policy without sending data is costly for throughput. We instead propose to formulate congestion control with an asynchronous RL agent that handles delayed actions. We present mvfst-rl, a scalable framework for congestion control in the QUIC transport protocol that leverages state-of-the-art asynchronous RL training with off-policy correction. We analyze modeling improvements to mitigate the deviation from Markovian dynamics, and evaluate our method on emulated networks from the Pantheon benchmark platform. The source code is publicly available.

1 Introduction

Congestion control is one of the key components of any large computer network, most notably the Internet, and is crucial to enable operation at scale (an estimated 150,000 PB of data were sent per month over the Internet in 2018). The goal of a congestion control algorithm is to dynamically regulate the rate of data being sent at each node so as to maximize total throughput while minimizing queuing delay and packet loss. The vast majority of network strategies still rely on hand-crafted heuristics that are reactive rather than predictive. An early method, called Remy [14], demonstrated that offline-learned congestion control can be competitive with conventional methods. More recently, RL-based approaches have been proposed and show promise in simulated environments [6; 13].

Despite the above advances, to our knowledge, no RL method has been transferred to real-world production systems. One of the biggest drawbacks of RL congestion control from a deployment perspective is that policy lookup takes orders of magnitude longer compared to hand-crafted methods. Moreover, networking servers often have limited resources to run a machine learning model, thus requiring inference to be offloaded to dedicated servers and further delaying action updates. Current RL congestion control environments follow the synchronous RL paradigm (e.g. using the OpenAI Gym [2] interface), where model execution blocks the environment (network sender). This makes it infeasible for deployment where a sender waiting on a synchronous RL agent for congestion control, even for a few milliseconds, negatively impacts throughput (Figure 1).

Figure 1: Cumulative bytes sent during 60 seconds over an emulated AWS California-to-Mexico link. RL agents that block the sender for 25ms and 50ms while waiting on the policy transmit 1.1% and 11.4% fewer bytes, respectively, than the non-blocking agent. All agents take actions at 100ms intervals.

In this paper we introduce mvfst-rl, a training framework that addresses these issues with a non-blocking RL agent for congestion control. For training in the presence of asynchronous interaction between the environment and the learner (owing to the inability of the environment to wait for gradient updates), we leverage IMPALA [3], which uses importance sampling for off-policy correction. mvfst-rl is built on mvfst, a C++ implementation of the QUIC transport protocol used in Facebook production networks, allowing seamless transfer to deployment. To emulate real-world network traffic in RL environments for training, we build upon Pantheon [15], which obviates the need for hand-written traffic patterns and network topologies. We evaluate training with delayed actions and present our results on the Pantheon test-bed.

2 Related Work

A few studies have applied RL to congestion control in varying ways. Iroko [13] takes the approach of a centralized policy that regulates the sending rates of all nodes in a network topology. While this is applicable to small and medium-sized networks, it is intractable at large or Internet scale. Iroko also requires manually specifying network topologies and traffic generators, making it difficult to emulate real-world network conditions. PCC-RL [6] implements congestion control as an environment in OpenAI Gym [2], with a blocking RL agent. Park's [8] congestion control environment based on CCP [10] is closest to mvfst-rl in design, using Remote Procedure Calls (RPC) for environment–agent communication, but it effectively takes a synchronous approach with its short step-time of 10ms, which constrains it to very small models.

3 Congestion Control as MDP with Delayed Actions

Consider a Markov Decision Process (MDP) [12] formulated as a tuple $(\mathcal{S}, \mathcal{A}, r, P)$, where $\mathcal{S}$ is a set of states, $\mathcal{A}$ is a set of actions, $r$ is the reward obtained, and $P$ is the state-transition probability function. Unrolling over time, we have the trajectory $s_0, a_0, r_0, s_1, a_1, r_1, \ldots$. Casting congestion control as an MDP, the state space $\mathcal{S}$ includes network statistics gathered from acknowledgement packets (ACKs) from the receiver, such as round-trip time (RTT), queuing delay, and packets sent, acknowledged and lost. The action is an update to the congestion window size, cwnd, which is the number of unacknowledged packets that are allowed to be in flight. A large cwnd leads to high throughput, but also to increased delay due to queuing at intermediate buffers and network congestion. Our choice of action space is a discrete set of updates to cwnd: $\mathcal{A} = \{\mathrm{cwnd},\ \mathrm{cwnd}/2,\ \mathrm{cwnd}-10,\ \mathrm{cwnd}+10,\ 2 \cdot \mathrm{cwnd}\}$ (Appendix A.2). The reward is generally a function of measured throughput and delay, aiming to strike a trade-off between the two.

When an agent acts asynchronously on the environment at fixed time-intervals of length $t$, the action $a_k$ is applied after a delay $\delta$ such that $0 < \delta < t$, where $\delta$ is the policy lookup time. The environment meanwhile would have transmitted further data based on the old action $a_{k-1}$ during the interval $\delta$. The next state $s_{k+1}$ therefore depends not just on $s_k$ and $a_k$, but also on $a_{k-1}$. To incorporate this into the MDP, the state space can be augmented with the last action taken. In addition, given that our action space is relative to the previous actions taken, we find that a longer history of actions in the environment state helps learning. Therefore, we augment our state space with an action history of generic length $n$, which we treat as a hyperparameter, and obtain our augmented state space $\tilde{\mathcal{S}} = \mathcal{S} \times \mathcal{A}^n$. At time $k$, the augmented state is $\tilde{s}_k = (s_k, a_{k-1}, \ldots, a_{k-n})$. With this formalism, we define our environment state and reward function below.
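The state augmentation above can be sketched in a few lines of Python (illustrative names, not the mvfst-rl API): the environment keeps a fixed-length deque of past actions, which is attached to each raw observation before the policy sees it.

```python
from collections import deque


class DelayedActionEnvState:
    """Track the augmented state (s_k, a_{k-1}, ..., a_{k-n})
    for an environment where actions take effect with a delay."""

    def __init__(self, n, noop_action=0):
        # Pad the history with a no-op action before any real action is taken.
        self.past_actions = deque([noop_action] * n, maxlen=n)

    def step(self, raw_state, action_taken):
        # The augmented state pairs the raw observation with the actions
        # taken *before* this step; then record the new action for next time.
        augmented = (raw_state, tuple(self.past_actions))
        self.past_actions.appendleft(action_taken)
        return augmented
```

With `n = 3`, the first step sees an all-no-op history, and each subsequent step sees the most recent actions first.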

Environment State: Our choice of state vector contains a summary of network statistics gathered over a 100ms window along with a history of actions taken. From each ACK within the window, we gather 20 features such as the latest and minimum RTT, queuing delay, bytes sent, bytes acknowledged, and bytes lost and re-transmitted. The complete set of features is listed in Appendix A.1. To normalize the features, the time fields are measured in milliseconds and the byte fields in KB. We follow Copa's [1] strategy of computing RTTStanding to reduce noise in delay measurements. For each feature, we calculate the sum, mean, standard deviation, min and max within a window, and concatenate them with one-hot vectors of past actions along with the cwnd on applying those actions. This results in a state vector of size $100 + n \cdot (|\mathcal{A}| + 1)$, where $\mathcal{A}$ is the action space defined earlier.
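Concretely, the assembly of the state vector can be sketched as follows (a minimal sketch with illustrative names; the dimensions follow the text, with $|\mathcal{A}| = 5$ and $n = 20$ as in the experiments):

```python
NUM_ACTIONS = 5   # |A|, size of the discrete action space
HISTORY_LEN = 20  # n, length of the action history


def build_state(agg_features, action_history, cwnd_history):
    """agg_features: the 100 aggregate window statistics.
    action_history / cwnd_history: the n past actions and the cwnd
    that resulted from each. Returns the flat state vector."""
    assert len(agg_features) == 100
    state = list(agg_features)
    for a, cwnd in zip(action_history, cwnd_history):
        one_hot = [0.0] * NUM_ACTIONS
        one_hot[a] = 1.0           # one-hot encoding of the past action
        state.extend(one_hot + [cwnd])  # |A| + 1 entries per history item
    return state
```

With these values the state vector has length 100 + 20 × (5 + 1) = 220, matching Appendix A.1.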

Reward: Our reward is simply $r = \mathrm{throughput} - \beta \cdot \mathrm{delay}$, where throughput and delay are the average throughput in MB/sec and the maximum delay in milliseconds, respectively, during the 100ms window, and the parameter $\beta$ trades off between the two. We experimented with logarithmic scales, such as Copa's [1] objective of $\log(\mathrm{throughput}) - \delta \cdot \log(\mathrm{delay})$, but we found it difficult to attain high throughput (e.g. 100 Mbps on faster networks) when training jointly with poor network scenarios such as cellular connections with 0.5 Mbps throughput.
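As a one-line sketch, the linear reward reads (the default trade-off weight of 0.2 follows the experiments in Section 5):

```python
def reward(avg_throughput_mb_s, max_delay_ms, beta=0.2):
    """Linear throughput/delay trade-off over one 100ms window;
    beta weights the delay penalty against throughput."""
    return avg_throughput_mb_s - beta * max_delay_ms
```

A higher `beta` makes the agent more delay-averse; `beta = 0` optimizes throughput alone.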

In the following sections, we describe a framework for asynchronous RL training of congestion control with delayed actions and provide experimental analysis of the MDP with the augmented state space.

4 Introducing mvfst-rl

Framework: mvfst-rl follows an environment-driven design with an asynchronous RL agent. It integrates three components: (1) mvfst, Facebook's implementation of the QUIC transport protocol used in production servers, (2) TorchBeast [7], a PyTorch [11] implementation of IMPALA [3] for fast, asynchronous, parallel RL training, and (3) Pantheon [15] network emulators calibrated with Bayesian Optimization [9] to match real-world network conditions. The RL congestion controller accumulates network statistics from ACKs over a fixed time-window, say 100ms, and sends a state update asynchronously to an RL agent in a separate thread. Once the agent executes the policy, the action to update cwnd is applied asynchronously to the transport layer (Figure 2(a)).
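The non-blocking interaction can be schematized with two threads and a pair of queues (an illustrative Python sketch, not the actual mvfst C++ implementation; `policy` stands in for the model forward pass):

```python
import queue
import threading

state_q, action_q = queue.Queue(), queue.Queue()


def agent_loop(policy, num_steps):
    # Agent thread: the (slow) policy lookup happens here,
    # off the sender's critical path.
    for _ in range(num_steps):
        state = state_q.get()
        action_q.put(policy(state))


def sender_step(cwnd):
    # Sender thread: never blocks on the policy. Apply an update only
    # if one is already available; otherwise keep the current cwnd.
    try:
        return action_q.get_nowait()
    except queue.Empty:
        return cwnd
```

The sender keeps transmitting with its current cwnd while the policy runs, which is exactly the behavior Figure 1 shows to matter for throughput.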

IMPALA was originally designed to overcome scaling limitations in the training of batched Actor-Critic methods, where parallel actors each act for a fixed number of steps to gather trajectories (sequences of $(s, a, r)$ transitions) and then wait for a gradient update on the policy from the learner. IMPALA relaxes this restriction: with an off-policy correction algorithm called V-trace, it allows actors to continue gathering trajectories while the learner asynchronously updates the policy. Although the original motivation was higher training throughput, the decoupled actor-learner algorithm and off-policy correction are well-suited for mvfst-rl, allowing the actors corresponding to network senders not to block on gradient updates, as doing so would change the dynamics of the underlying network.
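For concreteness, the V-trace target of [3] can be sketched in plain Python (the actual training uses TorchBeast's batched PyTorch implementation; the clipping thresholds default to $\bar{\rho} = \bar{c} = 1$ as in IMPALA):

```python
def vtrace_targets(rewards, values, bootstrap_value, rhos,
                   gamma=0.99, clip_rho=1.0, clip_c=1.0):
    """Compute V-trace value targets for one trajectory.
    rhos[t] = pi(a_t|x_t) / mu(a_t|x_t), the importance ratios between
    the learner policy pi and the (stale) behavior policy mu."""
    T = len(rewards)
    values_tp1 = values[1:] + [bootstrap_value]  # V(x_{t+1}) for each t
    vs = [0.0] * T
    next_diff = 0.0  # v_{s+1} - V(x_{s+1}), zero past the trajectory end
    for t in reversed(range(T)):
        rho = min(clip_rho, rhos[t])
        c = min(clip_c, rhos[t])
        # Clipped temporal-difference term and backward recursion:
        # v_s = V(x_s) + delta_s + gamma * c_s * (v_{s+1} - V(x_{s+1}))
        delta = rho * (rewards[t] + gamma * values_tp1[t] - values[t])
        diff = delta + gamma * c * next_diff
        vs[t] = values[t] + diff
        next_diff = diff
    return vs
```

When the trajectory is on-policy (all ratios equal to 1), the targets reduce to ordinary n-step returns, which is a useful sanity check.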

(a) Asynchronous RL agent–sender interaction
(b) Training architecture
Figure 2: (a) States are sent to the agent every 100ms and policy lookup takes 30ms. In the synchronous paradigm, no packets could be sent during the 30ms stretches while the sender waits on the policy. (b) TorchBeast maintains one actor per Pantheon environment, with RPC for state and action updates. The key difference from IMPALA is that, in addition to the actor–learner interaction, the environment–actor interaction is also asynchronous, and environment steps are not blocked by the forward pass.

Figure 2(b) illustrates the training architecture. Each actor corresponds to a separate Pantheon instance in a sender–receiver setup with a randomly chosen emulated network scenario. The senders run the QUIC protocol with the RL congestion controller and communicate state updates via RPC to the actors. The actors execute the model for the next action and, once a full trajectory has been obtained, communicate it to the learner for a gradient update. All communication is asynchronous and does not block the network sender. After training, the model can be exported with TorchScript and executed in C++ on the network host without the need for RPC.

Model: Our model trunk is a two-layer fully-connected neural network with 512 units per layer and ReLU non-linearities. The extracted features and the reward are fed into a single-layer LSTM [5]. For correctness during training, along with each trajectory, an actor also communicates the LSTM hidden state at the beginning of that trajectory to the learner. Finally, the LSTM output is fed into two heads: a policy head with $|\mathcal{A}|$ outputs, and a head for the policy-gradient baseline.

5 Experiments

Training: Our training setup consists of 40 parallel actors, each corresponding to a randomly chosen calibrated network emulator. We train episodically, with each Pantheon sender–receiver instance running for 30 seconds and the episodic trajectory used as a rollout for training. Episodic training provides the opportunity to learn dynamics similar to TCP Slow-Start, where the behavior of the algorithm during startup differs substantially from steady state. Following hyperparameter sweeps, we set the initial learning rate for RMSProp [4] and set the trade-off parameter in the reward to 0.2. All experiments are run for the same total number of steps.

(a) Nepal to AWS India (calibrated)
(b) Token-bucket policer (unseen environment)
(c) With LSTM
(d) Without LSTM
(e) Action history = 10
Figure 3: Experimental Results. (a) and (b): Average throughput and 95th percentile delay when tested against calibrated Pantheon emulators. rl-random refers to a baseline agent that picks a random action over the same action space as mvfst-rl at each step. Mean of three 30-second runs. (c) – (e): Training rewards with and without LSTM for varying values of action history length .

Results: Figure 3(a) shows the test performance on a Pantheon emulator calibrated against a Nepal-to-AWS-India link. mvfst-rl achieves competitive performance, with higher throughput and similar delays compared to traditional methods like Cubic, Copa and NewReno, as well as better performance than learned methods like TaoVA. Compared to rl-random, a baseline agent that picks a random action over the same action space at each step, our model consistently performs better. We also test against a network emulator unseen during training, where it outperforms most other methods (Figure 3(b)).

One challenge we faced is generalization when training jointly over networks with widely different ranges of throughput and delay, which causes rewards to be scaled very differently across actors. We experimented with reward-clipping strategies as well as reward functions with logarithmic components as in Copa [1], but they result in policies that struggle to achieve high throughput when trained jointly with low-throughput networks. This suggests that further research into reward shaping and multi-task learning strategies may be warranted.

Ablations: We examine the impact of the length of the action history in the augmented state space. Figure 3(c) plots the total reward obtained per episode, over the course of training, for varying values of the action history length $n$. While models with little or no action history struggle to learn, performance improves markedly with longer histories, with diminishing returns beyond a point. We also run the same experiment without the LSTM and find performance to be significantly worse, especially for larger values of $n$ (Figures 3(d) and 3(e)). These results show that the MDP with the augmented state space copes better with delayed actions, and they validate the need for a recurrent policy in this setting.

6 Conclusion and Future Work

RL is a promising direction for improving real-world systems. Yet it is challenging to deploy RL in datacenters due to performance and resource constraints, especially when evaluated against hand-written heuristics. We take first steps towards moving away from blocking RL agents for systems problems and introduce a framework for learning asynchronous RL policies in the face of delayed actions. Applied to congestion control, our initial results show promise for the MDP with augmented state space. One challenge we face is learning a joint policy across a range of network scenarios whose reward scales vary wildly due to differences inherent in the environments. We believe this is an interesting direction for further research on reward normalization strategies across environments. We hope to evaluate our trained congestion control agents in production networks in the future by leveraging mvfst-rl's efficient implementation and tight integration with the QUIC transport protocol. mvfst-rl has been open-sourced.


Acknowledgements

We would like to thank Jakob Foerster for fruitful discussions on applying RL for networks, Udip Pant, Ranjeeth Dasineni and Subodh Iyengar for their support related to mvfst, and Francis Cangialosi for insightful discussions around CCP.


References

  • [1] V. Arun and H. Balakrishnan (2018-04) Copa: practical delay-based congestion control for the internet. In 15th USENIX Symposium on Networked Systems Design and Implementation (NSDI 18), Renton, WA, pp. 329–342. External Links: ISBN 978-1-939133-01-4, Link Cited by: §A.1, §3, §3, §5.
  • [2] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba (2016) OpenAI gym. CoRR abs/1606.01540. External Links: Link, 1606.01540 Cited by: §1, §2.
  • [3] L. Espeholt, H. Soyer, R. Munos, K. Simonyan, V. Mnih, T. Ward, Y. Doron, V. Firoiu, T. Harley, I. Dunning, S. Legg, and K. Kavukcuoglu (2018-10–15 Jul) IMPALA: scalable distributed deep-RL with importance weighted actor-learner architectures. In Proceedings of the 35th International Conference on Machine Learning, J. Dy and A. Krause (Eds.), Proceedings of Machine Learning Research, Vol. 80, Stockholmsmässan, Stockholm Sweden, pp. 1407–1416. External Links: Link Cited by: §1, §4.
  • [4] G. Hinton, N. Srivastava, and K. Swersky, (2012) Overview of mini-batch gradient descent. External Links: Link Cited by: §5.
  • [5] S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural Computation 9, pp. 1735–1780. Cited by: §4.
  • [6] N. Jay, N. Rotman, B. Godfrey, M. Schapira, and A. Tamar (2019-09–15 Jun) A deep reinforcement learning perspective on internet congestion control. In Proceedings of the 36th International Conference on Machine Learning, K. Chaudhuri and R. Salakhutdinov (Eds.), Proceedings of Machine Learning Research, Vol. 97, Long Beach, California, USA, pp. 3050–3059. External Links: Link Cited by: §1, §2.
  • [7] H. Küttler, N. Nardelli, T. Lavril, M. Selvatici, V. Sivakumar, T. Rocktäschel, and E. Grefenstette (2019) TorchBeast: a pytorch platform for distributed rl. External Links: 1910.03552 Cited by: §4.
  • [8] H. Mao, P. Negi, A. Narayan, H. Wang, J. Yang, H. Wang, R. Marcus, R. Addanki, M. Khani, S. He, V. Nathan, F. Cangialosi, S. B. Venkatakrishnan, W. Weng, S. Han, T. Kraska, and M. Alizadeh (2019) Park: an open platform for learning augmented computer systems. In ICML Reinforcement Learning for Real Life Workshop, Cited by: §2.
  • [9] J. Močkus (1975) On bayesian methods for seeking the extremum. In Optimization Techniques IFIP Technical Conference Novosibirsk, July 1–7, 1974, G. I. Marchuk (Ed.), Berlin, Heidelberg, pp. 400–404. External Links: ISBN 978-3-540-37497-8 Cited by: §4.
  • [10] A. Narayan, F. Cangialosi, D. Raghavan, P. Goyal, S. Narayana, R. Mittal, M. Alizadeh, and H. Balakrishnan (2018) Restructuring endpoint congestion control. In Proceedings of the 2018 Conference of the ACM Special Interest Group on Data Communication, SIGCOMM ’18, New York, NY, USA, pp. 30–43. External Links: ISBN 978-1-4503-5567-4, Link, Document Cited by: §2.
  • [11] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer (2017) Automatic differentiation in PyTorch. In NIPS Autodiff Workshop, Cited by: §4.
  • [12] M. L. Puterman (1994) Markov decision processes: discrete stochastic dynamic programming. 1st edition, John Wiley & Sons, Inc., New York, NY, USA. External Links: ISBN 0471619779 Cited by: §3.
  • [13] F. Ruffy, M. Przystupa, and I. Beschastnikh (2018) Iroko: A framework to prototype reinforcement learning for data center traffic control. CoRR abs/1812.09975. External Links: Link, 1812.09975 Cited by: §1, §2.
  • [14] K. Winstein and H. Balakrishnan (2013) TCP ex machina: computer-generated congestion control. In Proceedings of the ACM SIGCOMM 2013 Conference on SIGCOMM, SIGCOMM ’13, New York, NY, USA, pp. 123–134. External Links: ISBN 978-1-4503-2056-6, Link, Document Cited by: §1.
  • [15] F. Y. Yan, J. Ma, G. D. Hill, D. Raghavan, R. S. Wahby, P. Levis, and K. Winstein (2018-07) Pantheon: the training ground for internet congestion-control research. In 2018 USENIX Annual Technical Conference (USENIX ATC 18), Boston, MA, pp. 731–743. External Links: ISBN 978-1-939133-01-4, Link Cited by: §1, §4.

Appendix A

A.1 State Space

We design our state space to contain aggregate network statistics obtained from ACKs within each 100ms window (the environment step-time) and an action history of length $n$. The optimal value $n = 20$ was chosen by hyperparameter sweeps. When an ACK or a packet loss notification arrives, we gather the following 20 network statistics:

Feature Description
1 lrtt Latest round-trip time (RTT) in milliseconds (ms)
2 rtt_min Minimum RTT (in ms) over a 10 second window
3 srtt Smoothed RTT (in ms)
4 rtt_standing Min RTT (in ms) over a window of size srtt/2 [1]
5 rtt_var Variance in RTT (in ms) as observed by the network protocol
6 delay Queuing delay, measured as rtt_standing − rtt_min
7 cwnd_bytes Congestion window in bytes, calculated as cwnd × MSS
8 inflight_bytes Number of bytes sent but unacknowledged (cannot exceed cwnd_bytes)
9 writable_bytes Number of writable bytes, measured as cwnd_bytes − inflight_bytes
10 sent_bytes Number of bytes sent since last ACK
11 received_bytes Number of bytes received since last ACK
12 rtx_bytes Number of bytes re-transmitted since last ACK
13 acked_bytes Number of bytes acknowledged in this ACK
14 lost_bytes Number of bytes lost in this loss notification
15 throughput Instantaneous throughput, measured as acked_bytes over the time elapsed since the last ACK
16 rtx_count Number of packets re-transmitted since last ACK
17 timeout_based_rtx_count Number of re-transmissions due to Probe Timeout (PTO) since last ACK
18 pto_count Number of times packet loss timer fired before receiving an ACK
19 total_pto_count Total number of times packet loss timer fired since last ACK
20 persistent_congestion Flag indicating whether persistent congestion is detected by the protocol

As a way of normalization, the time-based features (features 1–6) are measured in milliseconds and the byte-based features (7–14) are converted to KB. Since there can be a varying number of ACKs within any 100ms time window corresponding to an environment step, we obtain a fixed-size state vector by computing aggregate statistics over all ACKs within the window. This is achieved by computing the sum, mean, standard deviation, min and max of each of the 20 features and flattening the resulting 100 features into a single vector. For features 1–9, sums are not meaningful, and we set those entries to zero in the state vector.
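The aggregation above can be sketched as follows (a minimal Python sketch; `window_state` is a hypothetical helper, and the set of sum-less features follows the text):

```python
import statistics

SUMLESS = set(range(9))  # features 1-9 (0-indexed 0-8): sums not meaningful


def window_state(ack_features):
    """ack_features: one fixed-length feature vector per ACK in the window.
    Returns [sum, mean, std, min, max] per feature, flattened, with the
    sum entry zeroed for features where a sum is not meaningful."""
    out = []
    for j, column in enumerate(zip(*ack_features)):
        s = 0.0 if j in SUMLESS else sum(column)
        out.extend([s,
                    statistics.mean(column),
                    statistics.pstdev(column),
                    min(column),
                    max(column)])
    return out
```

With 20 features per ACK this yields the 100 aggregate entries of the state vector, regardless of how many ACKs arrived in the window.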

The history at time $t$, $h_t$, is encoded as a one-hot vector of the action $a_t$ along with the congestion window $\mathrm{cwnd}_t$ that resulted from that action. To the aggregate features, we concatenate a history of length $n$, resulting in our final state vector of length $100 + n \cdot (|\mathcal{A}| + 1)$, where $|\mathcal{A}|$ is the size of the action space. In our experiments, $|\mathcal{A}| = 5$ and $n = 20$, resulting in a state vector of length 220.

A.2 Action Space

While there are various possible choices for the action space $\mathcal{A}$, we chose the simplest set of discrete updates common in conventional congestion control methods:

$$\mathcal{A} = \{\mathrm{cwnd},\ \mathrm{cwnd}/2,\ \mathrm{cwnd}-10,\ \mathrm{cwnd}+10,\ 2 \cdot \mathrm{cwnd}\}$$

where cwnd is the congestion window in units of Maximum Segment Size (MSS). The network environment starts with an initial cwnd of 10, and bounded updates are applied according to the policy as follows:

$$\mathrm{cwnd}_{t+1} = \mathrm{Bound}(\mathrm{Update}(\mathrm{cwnd}_t, a_t))$$

where $a_t$ is the action chosen by the policy at time $t$ and is an index into the action space $\mathcal{A}$, $\mathrm{Update}$ is a function that updates the current congestion window according to $a_t$, and the function $\mathrm{Bound}$ bounds the congestion window to reasonable limits.

mvfst-rl supports configuring the action space to any discrete set of updates via a simple string format. The action space above can be specified as "0,/2,-10,+10,*2".
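A minimal sketch of applying such an action string to the congestion window (the bounds here are illustrative placeholders, not values from mvfst-rl):

```python
MIN_CWND, MAX_CWND = 2, 10_000  # illustrative bounds, in units of MSS

ACTIONS = ["0", "/2", "-10", "+10", "*2"]  # the action space above


def update_cwnd(cwnd, action_index):
    """Apply one discrete update from the action space, then bound
    the congestion window to [MIN_CWND, MAX_CWND]."""
    op = ACTIONS[action_index]
    if op == "0":
        new = cwnd          # keep cwnd unchanged
    elif op == "/2":
        new = cwnd // 2     # multiplicative decrease
    elif op == "-10":
        new = cwnd - 10     # additive decrease
    elif op == "+10":
        new = cwnd + 10     # additive increase
    else:                   # "*2"
        new = cwnd * 2      # multiplicative increase
    return max(MIN_CWND, min(MAX_CWND, new))
```

This mirrors the $\mathrm{Bound}(\mathrm{Update}(\cdot))$ composition: the update is applied first, then clamped to the allowed range.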
