Quantized Reinforcement Learning (QuaRL)
Abstract
Quantization can help reduce the memory, compute, and energy demands of deep neural networks without significantly harming their quality. However, whether these prior techniques, traditionally applied to image-based models, work with the same efficacy on the sequential decision making process of reinforcement learning remains an unanswered question. To address this void, we conduct the first comprehensive empirical study that quantifies the effects of quantization on various deep reinforcement learning policies with the intent of reducing their computational resource demands. We apply techniques such as post-training quantization and quantization aware training to a spectrum of reinforcement learning tasks (such as Pong, Breakout, BeamRider and more) and training algorithms (such as PPO, A2C, DDPG, and DQN). Across this spectrum of tasks and learning algorithms, we show that policies can be quantized to 6-8 bits of precision without loss of accuracy. We also show that certain tasks and reinforcement learning algorithms yield policies that are more difficult to quantize because they widen the models' distribution of weights, and that quantization aware training consistently improves results over post-training quantization and oftentimes even over the full precision baseline. Finally, we demonstrate real-world applications of quantization for reinforcement learning. We use mixed/half-precision training to train a Pong model 50% faster, and deploy a quantized reinforcement learning based navigation policy onto an embedded system, achieving an 18× speedup and a 4× reduction in memory usage over an unquantized policy.
1 Introduction
Deep reinforcement learning has promise in many applications, ranging from game playing (alphago; alphagozero; vizdoom) to robotics (motor_control1; motor_control2) to locomotion and transportation (rl_overview; rlcar). However, the training and deployment of reinforcement learning models remain challenging. Training is expensive because it requires repeatedly performing the computationally demanding forward and backward propagation of neural network training. Achieving state of the art results in the game DOTA 2 required around 128,000 CPU cores and 256 P100 GPUs, with total infrastructure costs in the tens of millions of US dollars (dota5). Deploying deep reinforcement learning (DRL) models is prohibitively expensive, if not impossible, due to the resource constraints of the embedded computing systems typically used in applications such as robotics and drone navigation. Quantization may substantially reduce the memory, compute, and energy usage of deep learning models without significantly harming their quality (quantization1; quantization4; quantizationenergy). However, it is unknown whether the same techniques carry over to reinforcement learning. Unlike models in supervised learning, the quality of a reinforcement learning policy depends on how effective it is in sequential decision making. Specifically, an agent's current input and decision heavily affect its future state and future actions; it is unclear how quantization affects the long-term decision making capability of reinforcement learning policies. Also, there are many different algorithms for training a reinforcement learning policy. Algorithms like advantage actor-critic (A2C), deep Q-networks (DQN), proximal policy optimization (PPO), and deep deterministic policy gradients (DDPG) differ significantly in their optimization goals and implementation details, and it is unclear whether quantization would be similarly effective across these algorithms.
Finally, reinforcement learning policies are trained on and applied to a wide range of environments, and it is unclear how quantization affects performance in tasks of differing complexity. Here, we aim to understand the effects of quantization on deep reinforcement learning policies with the goal of reducing memory and compute to enable faster and cheaper training and deployment. To that end, we comprehensively benchmark the effects of quantization on policies trained by various reinforcement learning algorithms on different tasks, conducting in excess of 350 experiments to present a representative and conclusive analysis. We perform experiments over three major axes: (1) environments (Atari Arcade, PyBullet, OpenAI Gym), (2) reinforcement learning training algorithms (Deep Q-Networks, Advantage Actor-Critic, Deep Deterministic Policy Gradients, Proximal Policy Optimization), and (3) quantization methods (post-training quantization, quantization aware training). We show that deep reinforcement learning models can be quantized to 6-8 bits of precision without loss in quality. Furthermore, we analyze how each axis affects the final performance of the quantized model to develop insights into how to achieve better model quantization. Our results show that some tasks and training algorithms yield models to which post-training quantization is more difficult to apply, as they widen the spread of the models' weight distribution and thereby increase quantization error. This motivates the use of quantization aware training, which we show improves performance over post-training quantization and oftentimes even over the full precision baseline.
To demonstrate the usefulness of quantization for deep reinforcement learning in real-world applications, we 1) use half-precision operations to train a Pong model 50% faster than full precision training and 2) deploy a quantized reinforcement learning based navigation policy onto an embedded system, achieving an 18× speedup and a 4× reduction in memory usage over an unquantized policy.
2 Related Work
Reducing neural network resource requirements is an active research topic. Techniques include quantization (quantization1; quantization2; quantization3; fakequantization; quant_extra_1; quant_extra_2; quant_extra_3), deep compression (quantization2), knowledge distillation (knowledge_distillation; knowledge_distillation_efficient), sparsification (quantizationenergy; sparse_prune; sparse_extra_1; sparse_extra_2; sparse_extra_3), and pruning (sparse_prune; prune_efficient; prune_extra_1). These methods are employed because they reduce storage and memory requirements and enable fast, efficient inference and training through specialized operations. We provide background for these motivations, describe the specific techniques that fall under these categories, and motivate why quantization for reinforcement learning needs study. Compression for Memory and Storage: Techniques such as quantization, pruning, sparsification, and distillation reduce the amount of storage and memory required by deep neural networks. These techniques are motivated by the need to train and deploy neural networks in memory-constrained environments (e.g., IoT or mobile). Broadly, quantization reduces the precision of network weights (quantization1; quantization2; quantization3), pruning removes various layers and filters of a network (sparse_prune; prune_efficient), sparsification zeros out selected network values (prune_efficient; sparse_prune), and distillation compresses an ensemble of networks into one (knowledge_distillation; knowledge_distillation_efficient). Various algorithms combining these core techniques have been proposed. For example, Deep Compression (quantization1) demonstrated that a combination of weight-sharing, pruning, and quantization can reduce storage requirements by 35-49×. Importantly, these methods achieve high compression rates with small losses in accuracy by exploiting the redundancy inherent in neural networks.
Fast and Efficient Inference/Training: Methods like quantization, pruning, and sparsification may also be employed to improve the runtime and energy consumption of network inference and training. Quantization reduces the precision of network weights and allows more efficient quantized operations to be used during training and deployment, for example a "binary" GEMM (general matrix multiply) operation (xnornet; bnn). Pruning speeds up neural networks by removing layers or filters to reduce the overall amount of computation necessary to make predictions (prune_efficient). Finally, sparsification zeros out network weights and enables faster computation via specialized primitives like block-sparse matrix multiply (blocksparse). These techniques not only speed up neural networks but also decrease energy consumption by requiring fewer floating-point operations. Quantization for Reinforcement Learning: Prior work on quantization focuses mostly on image-based supervised models. However, there are several key differences between these models and reinforcement learning policies: an agent's current input and decision affect its future state and actions, there are many complex training algorithms (e.g., DQN, PPO, A2C, DDPG), and there are many diverse tasks. To the best of our knowledge, this is the first work to ask how quantization affects deep reinforcement learning. To this end, we apply and analyze the performance of quantization across a broad range of reinforcement learning tasks and training algorithms.
3 Quantized Reinforcement Learning (QuaRL)
We develop QuaRL, an open-source software framework that allows us to systematically apply traditional quantization methods to a broad spectrum of deep reinforcement learning models (source code: https://github.com/harvardedge/quarl). We use the QuaRL framework to 1) evaluate how effective quantization is at compressing reinforcement learning policies, 2) analyze how quantization affects and is affected by the various environments and training algorithms in reinforcement learning, and 3) establish a standard for the performance of quantization techniques across various training algorithms and environments. Environments: We evaluate quantized models on three different types of environments: OpenAI Gym, Atari Arcade Learning, and PyBullet (an open-source implementation of MuJoCo). These environments consist of a variety of tasks, including CartPole, MountainCar, LunarLander, Atari games, Humanoid, etc. The complete list of environments used in the QuaRL framework is given in Table 1. Evaluations across this spectrum of tasks provide a robust benchmark of the performance of quantization applied to different reinforcement learning tasks. Training Algorithms: We study quantization on four popular reinforcement learning algorithms, namely Advantage Actor-Critic (A2C) (a2c), Deep Q-Network (DQN) (dqn), Deep Deterministic Policy Gradients (DDPG) (ddpg), and Proximal Policy Optimization (PPO) (ppo). Evaluating these standard reinforcement learning algorithms, which are well established in the community, allows us to explore whether quantization is similarly effective across different reinforcement learning algorithms. Quantization Methods: We apply standard quantization techniques to deep reinforcement learning models. Our main approaches are post-training quantization and quantization aware training.
We apply these methods to models trained in different environments by different reinforcement learning algorithms to broadly understand their performance. We describe how these methods are applied in the context of reinforcement learning below.
Table 1: Quantization methods evaluated per algorithm and environment. The first two environment columns are OpenAI Gym, the next seven are Atari, and the last three are PyBullet.

Algorithm  CartPole  MountainCar  BeamRider  Breakout  MsPacman  Pong  Qbert  Seaquest  SpaceInvaders  BipedalWalker  HalfCheetah  Walker2D
DQN  PTQ  n/a  PTQ  PTQ  PTQ  PTQ  PTQ  PTQ  PTQ  n/a  n/a  n/a
A2C  PTQ  n/a  PTQ  PTQ  PTQ  PTQ  PTQ  PTQ  PTQ  n/a  n/a  n/a
PPO  PTQ  n/a  PTQ  PTQ  PTQ  PTQ  PTQ  PTQ  PTQ  n/a  n/a  n/a
DDPG  n/a  PTQ  n/a  n/a  n/a  n/a  n/a  n/a  n/a  PTQ  PTQ  PTQ
3.1 Post-training Quantization
Post-training quantization takes a trained full precision model (32-bit floating point) and quantizes its weights to lower precision values. We quantize weights down to fp16 (16-bit floating point) and int8 (8-bit integer) values. fp16 quantization is based on IEEE 754 floating point rounding, and int8 quantization uses uniform affine quantization. Fp16 Quantization: Fp16 quantization involves taking full precision (32-bit) values and mapping them to the nearest representable 16-bit float. The IEEE 754 standard specifies 16-bit floats with one sign bit, five exponent bits, and ten fraction bits. The bits specify the value of the sign ($s$), fraction ($f$), and exponent ($e$), which are combined with the following formula to yield the effective value of the float:

$(-1)^{s} \times 2^{e-15} \times \left(1 + \frac{f}{2^{10}}\right)$
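As a sanity check on this formula, the snippet below (our own illustration, not part of QuaRL; `decode_fp16` is a hypothetical helper name) decodes the bit fields of an IEEE 754 half-precision pattern and compares the result against NumPy's float16:

```python
import numpy as np

def decode_fp16(bits: int) -> float:
    """Decode a 16-bit IEEE 754 half-precision pattern:
    1 sign bit, 5 exponent bits (bias 15), 10 fraction bits."""
    s = (bits >> 15) & 0x1
    e = (bits >> 10) & 0x1F
    f = bits & 0x3FF
    if e == 0x1F:                       # special values: infinity and NaN
        return float("nan") if f else (-1.0) ** s * float("inf")
    if e == 0:                          # subnormal: no implicit leading 1
        return (-1.0) ** s * 2.0 ** -14 * (f / 1024.0)
    return (-1.0) ** s * 2.0 ** (e - 15) * (1.0 + f / 1024.0)

# Decoding the raw bit pattern reproduces NumPy's float16 rounding exactly.
for x in [1.0, -2.5, 0.1, 65504.0]:
    bits = int(np.array([x], dtype=np.float16).view(np.uint16)[0])
    assert decode_fp16(bits) == float(np.float16(x))
```

Mapping an fp32 value to fp16 is then simply rounding to the nearest value expressible in this format, which is what `np.float16` (and hardware fp16 conversion) performs.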
Uniform Affine Quantization: Uniform affine quantization (tffakequant) is applied to a full precision weight matrix $W$ and is performed by 1) calculating the minimum and maximum values of the matrix and 2) dividing this range equally into $2^n$ representable values (where $n$ is the number of bits being quantized to). As each representable value is equally spaced across this range, the quantized value can be represented by an integer. More specifically, quantization from full precision to $n$-bit integers is given by:

$\delta = \frac{\max(W, 0) - \min(W, 0)}{2^{n} - 1}, \qquad z = \operatorname{round}\!\left(\frac{-\min(W, 0)}{\delta}\right), \qquad Q_n(W) = \operatorname{round}\!\left(\frac{W}{\delta}\right) + z$

Note that $\delta$ is the gap between representable numbers and $z$ is an offset so that 0 is exactly representable. Further note that we use $\min(W, 0)$ and $\max(W, 0)$ to ensure that 0 is always represented. To dequantize we perform:

$D(q) = \delta \, (q - z)$
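A minimal NumPy sketch of this quantize/dequantize pair (an illustration of the equations above, not QuaRL's implementation, which uses TensorFlow's built-in quantization ops; the helper names are our own):

```python
import numpy as np

def quant_params(w: np.ndarray, n_bits: int = 8):
    # Expand the range to include 0 so that 0 is exactly representable.
    w_min = min(float(w.min()), 0.0)
    w_max = max(float(w.max()), 0.0)
    delta = (w_max - w_min) / (2 ** n_bits - 1)  # gap between representable values
    z = int(round(-w_min / delta))               # zero-point offset
    return delta, z

def quantize(w, delta, z):
    return np.round(w / delta).astype(np.int64) + z

def dequantize(q, delta, z):
    return delta * (q - z)

w = np.array([-1.0, -0.25, 0.0, 0.5, 1.5])
delta, z = quant_params(w, n_bits=8)
w_hat = dequantize(quantize(w, delta, z), delta, z)
assert w_hat[2] == 0.0                             # zero maps back to exactly zero
assert np.abs(w_hat - w).max() <= delta / 2 + 1e-12  # round-off bounded by delta/2
```

The round-trip error of any value inside the range is at most $\delta/2$, which is why a wider weight range (larger $\delta$) directly translates into larger quantization error, a point the results section returns to.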
In the context of QuaRL, int8 and fp16 quantization are applied after training a full precision model on an environment, as per Algorithm 1. In post-training quantization, uniform quantization is applied to each fully connected layer of the model (per-tensor quantization) and to each channel of convolution weights (per-axis quantization); activations are not quantized. We use post-training quantization to quantize to fp16 and int8 values.

Algorithm 1: Post-training Quantization
Input: T, task or environment; L, reinforcement learning algorithm; A, model architecture; n, quantize bits (8 or 16)
Output: Reward
1: P = Train(T, L, A)
2: return Eval(Quantize_n(P), T)

Algorithm 2: Quantization Aware Training
Input: T, task or environment; L, reinforcement learning algorithm; A, model architecture; n, quantize bits; Qd, quantization delay
Output: Reward
1: A_q = InsertQuantOpsAfterWeightsAndActivations(A)
2: P, ranges = TrainNoQuantMonitorWeightsActivationsRanges(T, L, A_q, Qd)
3: P_q = TrainWithQuantization(T, L, P, n, ranges)
4: return Eval(P_q, T)
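The per-tensor versus per-axis distinction matters when channels have very different scales. The self-contained toy below (our own construction, not QuaRL code; `affine_qdq` is a hypothetical helper) shows that giving each output channel its own range serves small-magnitude channels far better than one shared per-tensor range:

```python
import numpy as np

def affine_qdq(w, n_bits=8):
    """Quantize-then-dequantize with one (delta, offset) for the whole array."""
    w_min, w_max = min(float(w.min()), 0.0), max(float(w.max()), 0.0)
    delta = (w_max - w_min) / (2 ** n_bits - 1)
    q = np.round((w - w_min) / delta)
    return delta * q + w_min

rng = np.random.default_rng(0)
# Toy conv kernel [kh, kw, in_ch, out_ch] whose channels span scales 0.01x to 1x.
w = rng.standard_normal((3, 3, 8, 4)) * np.array([0.01, 0.1, 0.5, 1.0])

per_tensor = affine_qdq(w)                                  # one range for everything
per_axis = np.stack([affine_qdq(w[..., c])                  # one range per out channel
                     for c in range(w.shape[-1])], axis=-1)

err_tensor = np.abs(per_tensor - w).max(axis=(0, 1, 2))     # max error per channel
err_axis = np.abs(per_axis - w).max(axis=(0, 1, 2))
# The small-scale channels are quantized far more accurately per-axis.
assert err_axis[0] < err_tensor[0]
```

Per-tensor quantization sizes its single $\delta$ to the widest channel, so the 0.01-scale channel is rounded with a step roughly 100× coarser than it needs; per-axis quantization avoids this, which is why it is the standard choice for convolution weights.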
3.2 Quantization Aware Training
Quantization aware training involves retraining the reinforcement learning policies with weights and activations uniformly quantized to $n$-bit values. Importantly, weights are maintained in full fp32 precision, except that they are passed through the uniform quantization function before being used in the forward pass. Because of this, the technique is also known as "fake quantization" (tffakequant). Additionally, to improve training there is an additional parameter, the quantization delay (quant_delay), which specifies the number of full precision training steps before quantization is enabled. While the number of steps is less than the quantization delay parameter, the minimum and maximum values of weights and activations are actively monitored. Afterwards, the previously captured minimum and maximum values are used to quantize the tensors (these values remain static from then on). Specifically:
$Q(t) = \delta \left( \operatorname{clip}\!\left( \operatorname{round}\!\left(\frac{t}{\delta}\right) + z,\; 0,\; 2^{n} - 1 \right) - z \right), \qquad \delta = \frac{\max_t - \min_t}{2^{n} - 1}, \qquad z = \operatorname{round}\!\left(\frac{-\min_t}{\delta}\right)$

where $\min_t$ and $\max_t$ are the monitored minimum and maximum values of the tensor (expanded to include 0 if necessary). Intuitively, the expectation is that the training process eventually learns to account for the quantization error, yielding a higher performing quantized model. Note that uniform quantization is applied to fully connected weights in the model (per-tensor quantization) and to each channel of convolution weights (per-axis quantization). $n$-bit quantization is applied to each layer's weights $W$ and activations $a$:

$W_q = Q(W), \qquad a_q = Q(a)$
During backward propagation, the gradient is passed through the quantization function unchanged (also known as the straight-through estimator (hintonstraight)), and the full precision weight matrix $W$ is optimized as follows:

$W \leftarrow W - \eta \, \frac{\partial L}{\partial Q(W)}$
In the context of the QuaRL framework, the policy neural network is retrained from scratch after the quantization functions are inserted on weights and activations (all else being equal). At evaluation time, full precision weights are passed through the uniform affine quantizer to simulate quantization error during inference. Algorithm 2 describes how quantization aware training is applied in QuaRL.
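The whole scheme — fp32 master weights, a quantization delay during which ranges are monitored, a fake-quantized forward pass, and a straight-through gradient — can be sketched for a single linear layer on a toy regression. This is our own minimal construction, not the stable-baselines/TensorFlow implementation the paper uses:

```python
import numpy as np

def fake_quant(t, t_min, t_max, n_bits=8):
    """Quantize-dequantize against a fixed monitored range
    (expanded to include 0 so that 0 stays exactly representable)."""
    t_min, t_max = min(t_min, 0.0), max(t_max, 0.0)
    delta = (t_max - t_min) / (2 ** n_bits - 1)
    z = round(-t_min / delta)
    q = np.clip(np.round(t / delta) + z, 0, 2 ** n_bits - 1)
    return delta * (q - z)

rng = np.random.default_rng(1)
W = rng.standard_normal((4, 8)) * 0.5          # fp32 master weights
W_true = rng.standard_normal((4, 8)) * 0.1     # target weights of a toy regression
x = rng.standard_normal((8, 32))
y = W_true @ x
quant_delay, lr = 100, 0.05
w_min = w_max = 0.0

for step in range(500):
    if step < quant_delay:                     # phase 1: train fp32, monitor ranges
        w_min, w_max = min(w_min, W.min()), max(w_max, W.max())
        W_used = W
    else:                                      # phase 2: freeze range, fake-quantize
        W_used = fake_quant(W, w_min, w_max)
    err = W_used @ x - y
    # Straight-through estimator: the gradient w.r.t. Q(W) updates the fp32 copy.
    W -= lr * (err @ x.T) / x.shape[1]
```

Even though the forward pass only ever sees 8-bit-grid weights after the delay, the fp32 master copy keeps accumulating small gradient updates, so training settles to within quantization noise of the target.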
4 Results
We perform evaluations across the three principal axes of QuaRL: environments, training algorithms, and quantization methods. Table 1 lists the space of evaluations explored. We analyze the results along the following three lines: Effectiveness of Quantization: To evaluate the overall effectiveness of quantization for deep reinforcement learning, we apply post-training quantization and quantization aware training to a spectrum of tasks and record their performance. We present the reward results for post-training quantization in Table 2. We also compute the percentage error of the performance of the quantized policy relative to that of the corresponding full precision baseline (E_fp16 and E_int8). Additionally, we report the mean of the errors across tasks for each training algorithm. The absolute mean of the 8-bit and 16-bit relative errors ranges between 2% and 5% (with the exception of DQN), which indicates that models may be quantized to 8/16-bit precision without much loss in quality. Interestingly, the overall performance difference between 8-bit and 16-bit post-training quantization is minimal (with the exception of the DQN algorithm, for reasons we explain in Section 4). We believe this is because the policies' weight distributions are narrow enough that 8 bits can capture the distribution of weights without much error. In a few cases, post-training quantization yields better scores than the full precision policy. We believe that quantization injected an amount of noise that was small enough to maintain a good policy and large enough to regularize model behavior; this supports some of the results seen by quant_reg_1; quant_reg_2; quant_reg_3; see Appendix E for plots showing that there is a sweet spot for post-training quantization.
Table 2: Post-training quantization rewards. For each algorithm (A2C, DQN, PPO, DDPG), each environment row lists the fp32 reward, the fp16 reward with its relative error E_fp16 (%), and the int8 reward with its relative error E_int8 (%). The CartPole and Atari rows report A2C, DQN, and PPO in that order; the Walker2D through MountainCar rows report DDPG only.
Breakout  379  371  2.11  350  7.65  214  217  1.40  78  63.55  400  400  0.00  368  8.00  
SpaceInvaders  717  667  6.97  634  11.56  586  625  6.66  509  13.14  698  662  5.16  684  2.01  
BeamRider  3087  3060  0.87  2793  9.52  925  823  11.03  721  22.05  1655  1820  9.97  1697  2.54  
MsPacman  1915  1915  0.00  2045  6.79  1433  1429  0.28  2024  41.24  1735  1735  0.00  1845  6.34  
Qbert  5002  5002  0.00  5611  12.18  641  641  0.00  616  3.90  15010  15010  0.00  14425  3.90  
Seaquest  782  756  3.32  753  3.71  1709  1885  10.30  1582  7.43  1782  1784  0.11  1795  0.73  
CartPole  500  500  0.00  500  0.00  500  500  0.00  500  0.00  500  500  0.00  500  0.00  
Pong  20  20  0.00  19  5.00  21  21  0.00  21  0.00  20  20  0.00  20  0.00  
Walker2D  1890  1929  2.06  1866  1.27  
HalfCheetah  2553  2551  0.08  2473  3.13  
BipedalWalker  98  90  8.16  83  15.31  
MountainCar  92  92  0.00  92  0.00  
Mean  1.66  2.31  0.88  8.60  0.62  0.54  1.54  4.93 
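The paper does not spell out the error formula; the definition below (our inference: absolute percentage deviation from the fp32 reward) reproduces the tabulated values:

```python
def rel_err(fp32_reward: float, quant_reward: float) -> float:
    """Percentage deviation of the quantized policy's reward from the fp32 baseline."""
    return 100.0 * abs(fp32_reward - quant_reward) / abs(fp32_reward)

# Spot-checks against Table 2
assert round(rel_err(379, 350), 2) == 7.65    # A2C Breakout, int8
assert round(rel_err(214, 78), 2) == 63.55    # DQN Breakout, int8
assert round(rel_err(1915, 2045), 2) == 6.79  # A2C MsPacman, int8
```

Note that under this definition a quantized policy that outperforms its baseline (as in the MsPacman row) still shows a positive "error", so the table's error columns measure deviation, not degradation.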
For quantization aware training, we train the policy with fake-quantization operations while maintaining the same model and hyperparameters (see Appendix B). Figure 1 shows the results of quantization aware training on multiple environments and training algorithms, compressing the policies from 8 bits down to 2 bits. Generally, performance relative to the full precision baseline is maintained down to 5/6-bit quantization, after which there is a drop in performance. Broadly, at 8 bits, we see no degradation in performance. From the data, we see that quantization aware training achieves higher rewards than post-training quantization and sometimes even outperforms the full precision baseline.
Environment  E_int8 (%)
Breakout  63.55
BeamRider  22.05
Pong  0.00
Effect of Environment on Quantization Quality: To analyze the task's effect on quantization quality, we plot the distribution of weights of full precision models trained in three environments (Breakout, BeamRider, and Pong) and their error after applying 8-bit post-training quantization. Each model uses the same network architecture and is trained using the same algorithm (DQN) with the same hyperparameters (see Appendix B). Figure 2 shows that the task with the highest error (Breakout) has the widest weight distribution, the task with the second-highest error (BeamRider) has a narrower weight distribution, and the task with the lowest error (Pong) has the narrowest distribution. With an affine quantizer, quantizing a narrower distribution yields less error because the distribution can be captured at a finer granularity; conversely, a wider distribution requires larger gaps between representable numbers and thus increases quantization error. These trends indicate that the environment affects the spread of the models' weight distribution, which in turn affects quantization performance: specifically, environments that yield a wider distribution of model weights are more difficult to quantize. This observation suggests that regularizing the training process may yield better quantization performance.

Algorithm  Environment  fp32 Reward  E_int8 (%)  E_fp16 (%)
DQN  Breakout  214  63.55  1.40
PPO  Breakout  400  8.00  0.00
A2C  Breakout  379  7.65  2.11
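The link between weight spread and quantization error can be demonstrated directly on synthetic data (a toy illustration, not the paper's DQN weights): widening a distribution by 10× widens the quantization step $\delta$, and the worst-case round-off error grows with it:

```python
import numpy as np

def max_qerr(w, n_bits=8):
    """Max |dequantized - original| under n-bit uniform affine quantization."""
    w_min, w_max = min(float(w.min()), 0.0), max(float(w.max()), 0.0)
    delta = (w_max - w_min) / (2 ** n_bits - 1)
    q = np.round((w - w_min) / delta)
    return float(np.abs(delta * q + w_min - w).max())

rng = np.random.default_rng(0)
base = rng.standard_normal(100_000)
narrow, wide = 0.05 * base, 0.5 * base   # same shape, 10x the spread

assert max_qerr(wide) > max_qerr(narrow)  # wider distribution, larger error
```

Since the error bound is $\delta/2$ and $\delta$ is proportional to the weight range, the 10×-wider distribution incurs roughly 10× the maximum error at the same bit width, matching the Breakout/BeamRider/Pong ordering observed above.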
Effect of Training Algorithm on Quantization Quality: To determine the effects of the reinforcement learning training algorithm on the performance of quantized models, we compare the performance of post-training quantized models trained by various algorithms. Table 3 shows the error of different reinforcement learning algorithms and their corresponding 8-bit post-training quantization error for the Atari Breakout game. The results indicate that the A2C training algorithm is most conducive to int8 post-training quantization, followed by PPO2 and DQN. Interestingly, we see a sharp performance drop relative to the corresponding full precision baseline when applying 8-bit post-training quantization to models trained by DQN. At 8 bits, models trained by PPO2 and A2C have relative errors of 8% and 7.65%, whereas the model trained by DQN has an error of 64%. To understand this phenomenon, we plot the distribution of model weights trained by each algorithm in Figure 4. The plot shows that the weight distribution of the model trained by DQN is significantly wider than those trained by PPO2 and A2C. A wider distribution of weights indicates a higher quantization error, which explains the large error of the 8-bit quantized DQN model. This also explains why using more bits (fp16) is more effective for the model trained by DQN, reducing the error relative to the full precision baseline from 64% down to 1.4%.
5 Case Studies
To show the real-world applications of our results, we use quantization to optimize the training and deployment of reinforcement learning policies. We 1) train a Pong model 1.5× faster by using mixed precision optimization and 2) deploy a quantized robot navigation model onto a resource-constrained embedded system (RasPi 3b), demonstrating a 4× reduction in memory and an 18× speedup in inference. Faster training time means running more experiments in the same time. Achieving speedups on resource-constrained devices enables deployment of the policies on real robots. Mixed/Half-Precision Training: Motivated by the observation that reinforcement learning training is robust to quantization error, we train three policies of increasing model complexity (Policy A, Policy B, and Policy C) using mixed precision training and compare their performance to that of full precision training (see Appendix for details). In mixed precision training, the policy weights, activations, and gradients are represented in fp16. A master copy of the weights is stored in full precision (fp32) and updated during the backward pass (mptraining). We measure the runtime and convergence rate of both full precision and mixed precision training (see Appendix C).

Algorithm  Network  fp32 Runtime (min)  MP Runtime (min)  Speedup
DQN-Pong  Policy A  127  156  0.87×
DQN-Pong  Policy B  179  172  1.04×
DQN-Pong  Policy C  391  242  1.61×

Figure 5 shows that all three policies converge under both full precision and mixed precision training. Interestingly, for Policy B, training with mixed precision yields faster convergence; we believe that some amount of quantization error speeds up the training process. Table 5 shows the computational speedup to the training loop from using mixed precision training. While using mixed precision training on smaller networks (Policy A) may slow down training iterations (as the overhead of fp32-to-fp16 conversions outweighs the speedup of low precision operations), larger networks (Policy C) show up to a 60% speedup.
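The mechanics of the scheme in (mptraining) can be sketched in a few lines (a schematic toy regression, not the actual DQN training loop): compute the forward pass and gradients in fp16, scale the loss to avoid fp16 underflow, and apply the unscaled update to an fp32 master copy of the weights:

```python
import numpy as np

rng = np.random.default_rng(0)
W = (rng.standard_normal((4, 8)) * 0.1).astype(np.float32)    # fp32 master weights
W_true = (rng.standard_normal((4, 8)) * 0.3).astype(np.float32)
x = rng.standard_normal((8, 32)).astype(np.float16)
y = (W_true @ x.astype(np.float32)).astype(np.float16)        # toy regression target
lr, loss_scale = np.float32(0.05), np.float32(128.0)

for _ in range(300):
    W16 = W.astype(np.float16)                    # fp16 copy for the forward pass
    err16 = W16 @ x - y                           # forward pass and loss grad in fp16
    g16 = (err16 * np.float16(loss_scale)) @ x.T  # scaled up to dodge fp16 underflow
    g32 = g16.astype(np.float32) / (loss_scale * np.float32(x.shape[1]))
    W -= lr * g32                                 # unscaled update on the fp32 master
```

The fp32 master copy is what makes the scheme work: individual fp16 updates can be too small to register against fp16 weights, but they accumulate correctly in fp32 while all of the expensive matrix arithmetic stays in half precision.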
Generally, our results show that mixed precision may speed up the training process by up to 1.6× without harming convergence. Quantized Policy for Deployment: To show the benefits of quantization in deploying reinforcement learning policies, we train multiple point-to-point navigation models (Policy I, II, and III) for aerial robots using Air Learning (airlearning) and deploy them onto a RasPi 3b, a cost-effective, general-purpose embedded processor. The RasPi 3b is used as a proxy for the compute platform on the aerial robot; other platforms on aerial robots have similar characteristics. For each of these policies, we report the accuracies and inference speedups attained by the int8 and fp32 policies. Table 5 shows the accuracies and inference speedups attained for each corresponding quantized policy. We see that quantizing smaller policies (Policy I) yields moderate inference speedups (1.18× for Policy I), while quantizing larger models (Policies II, III) can speed up inference by up to 18×. This speedup in Policy III execution time speeds up the generation of hardware actuation commands from 5 Hz (201.115 ms for fp32) to 90 Hz (11.036 ms for int8). A deeper investigation shows that Policies II and III take more memory than the total RAM capacity of the RasPi 3b, causing numerous accesses to swap memory (refer to Appendix D) during inference, which is extremely slow. Quantizing these policies allows them to fit into the RasPi's RAM, eliminating accesses to swap and boosting performance by an order of magnitude. Figure 5 shows the memory usage while executing the quantized and unquantized versions of Policy III, and shows how, without quantization, memory usage climbs above the total RAM capacity of the board.
Policy Name  Network  fp32 Time (ms)  fp32 Success Rate (%)  int8 Time (ms)  int8 Success Rate (%)  Speedup
Policy I  3L MLP, 64 Nodes  0.147  60  0.124  45  1.18×
Policy II  3L MLP, 256 Nodes  133.49  74  9.53  60  14×
Policy III  3L MLP (4096, 512, 1024)  208.115  86  11.036  75  18.85×

In the context of real-world deployment of an aerial (or any other type of) robot, a speedup in policy execution potentially translates to faster actuation commands to the aerial robot, which in turn implies faster and better responsiveness in a highly dynamic environment (responsiveness). Our case study demonstrates how quantization can facilitate the deployment of accurate policies trained using reinforcement learning onto resource-constrained platforms.
6 Conclusion
We perform the first study of quantization effects on deep reinforcement learning using QuaRL, a software framework to benchmark and analyze the effects of quantization on various reinforcement learning tasks and algorithms. We analyze performance in terms of rewards for post-training quantization and quantization aware training as applied to multiple reinforcement learning tasks and algorithms, with the high level goal of reducing policies' resource requirements for efficient training and deployment. We broadly demonstrate that reinforcement learning models may be quantized down to 8/16 bits without loss of performance. Additionally, we link quantization performance to the distribution of the models' weights, demonstrating that some reinforcement learning algorithms and tasks are more difficult to quantize because they widen the models' weight distribution. Finally, we apply our results to optimize the training and inference of reinforcement learning models, demonstrating a 50% training speedup for Pong using mixed precision optimization and up to an 18× inference speedup on a RasPi by quantizing a navigation policy. In summary, our findings indicate that there is much potential in quantization of deep reinforcement learning policies.
Acknowledgments
We thank Cody Coleman (from Stanford), Itay Hubara (from Habana), Paulius Micikevicius (from Nvidia), and Thierry Tambe (from Harvard) for their constructive feedback on improving the paper. The research was sponsored by NSF IIS award #1724197 along with research credits to run experiments at scale from the Google Cloud Platform.
References
Appendix
Here, we list several details that are omitted from the first 8 pages due to the limited page count. To the best of our ability, we provide sufficient details to reproduce our results and address common clarification questions.
A Post Training Quantization Results
Here, we expand the post training quantization results listed in Table 2 into four separate tables for clarity. Each table corresponds to the post training quantization results for a specific algorithm. Table 5 tabulates the post training quantization results for the A2C algorithm. Likewise, Table 6 tabulates the post training quantization results for DQN. Table 7 and Table 8 list the post training quantization results for the PPO and DDPG algorithms, respectively.
Environment  fp32  fp16  E_fp16  int8  E_int8 
Breakout  379  371  2.11%  350  7.65% 
SpaceInvaders  717  667  6.97%  634  11.58% 
BeamRider  3087  3060  0.87%  2793  9.52% 
MsPacman  1915  1915  0.00%  2045  6.79% 
Qbert  5002  5002  0.00%  5611  12.18% 
Seaquest  782  756  3.32%  753  3.71% 
CartPole  500  500  0.00%  500  0.00% 
Pong  20  20  0.00%  19  5.00% 
Mean  1.66 %  2.31 % 
Environment  fp32  fp16  E_fp16  int8  E_int8 

Breakout  214  217  1.40%  78  63.55% 
SpaceInvaders  586  625  6.66%  509  13.14% 
BeamRider  925  823  11.03%  721  22.05% 
MsPacman  1433  1429  0.28%  2024  41.24% 
Qbert  641  641  0.00%  616  3.90% 
Seaquest  1709  1885  10.30%  1582  7.43% 
CartPole  500  500  0.00%  500  0.00% 
Pong  21  21  0.00%  21  0.00% 
Mean  0.88%  8.60% 
Environment  fp32  fp16  E_fp16  int8  E_int8 

Breakout  400  400  0.00%  368  8.00% 
SpaceInvaders  698  662  5.16%  684  2.01% 
BeamRider  1655  1820  9.97%  1697  2.54% 
MsPacman  1735  1735  0.00%  1845  6.34% 
Qbert  15010  15010  0.00%  14425  3.90% 
Seaquest  1782  1784  0.11%  1795  0.73% 
CartPole  500  500  0.00%  500  0.00% 
Pong  20  20  0.00%  20  0.00% 
Mean  0.62%  0.54% 
Environment  fp32  fp16  E_fp16  int8  E_int8 

Walker2D  1890  1929  2.06%  1866  1.27% 
HalfCheetah  2553  2551  0.08%  2473  3.13% 
BipedalWalker  98  90  8.16%  83  15.31% 
MountainCarContinuous  92  92  0.00%  92  0.00% 
Mean  1.54%  4.93% 
B DQN Hyperparameters for Atari
For all Atari games in the results section, we use a standard 3 Layer Conv (128) + 128 FC network. Hyperparameters are listed in Table 9. We use stable-baselines (stablebaselines) for all reinforcement learning experiments and TensorFlow version 1.14 as the machine learning backend.
Hyperparameter  Value
n_timesteps  1 Million Steps  
buffer_size  10000  
learning_rate  0.0001  
warm_up  10000  
quant_delay  500000  
target_network_update_frequency  1000  
exploration_final_eps  0.01  
exploration_fraction  0.1  
prioritized_replay_alpha  0.6  
prioritized_replay  True 
C Mixed Precision Hyperparameters
In mixed precision training, we use three policies, namely Policy A, Policy B, and Policy C. The architectures of these policies are tabulated in Table 10.
Policy  Architecture
Policy A  3 Layer Conv (128 Filters) + FC (128)  
Policy B  3 Layer Conv (512 Filters) + FC (512)  
Policy C  3 Layer Conv (1024 Filters) + FC (2048) 
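To give a sense of how the three policies scale, their parameter counts can be estimated as follows. This sketch assumes Nature-DQN kernels and strides (8x8/4, 4x4/2, 3x3/1) and an 84x84x4 input, details the paper does not state; the function names are ours.

```python
def conv_out(size, kernel, stride):
    """Output spatial size of a valid-padding convolution."""
    return (size - kernel) // stride + 1

def policy_params(n_filters, fc_units, in_ch=4, in_size=84):
    """Rough parameter count for a 3-conv + 1-FC Atari policy.

    Assumes Nature-DQN kernels/strides (8x8/4, 4x4/2, 3x3/1) and an
    84x84x(in_ch) input; these details are our assumption.
    """
    params, ch, size = 0, in_ch, in_size
    for k, s in [(8, 4), (4, 2), (3, 1)]:
        params += k * k * ch * n_filters + n_filters  # weights + biases
        size = conv_out(size, k, s)
        ch = n_filters
    params += size * size * ch * fc_units + fc_units  # flatten -> FC
    return params

for name, (f, fc) in {"A": (128, 128), "B": (512, 512), "C": (1024, 2048)}.items():
    print(name, policy_params(f, fc))
```

Under these assumptions, Policy A has roughly 1.2M parameters, and B and C are substantially larger, which is what makes them useful for stressing mixed-precision throughput.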
To measure the runtimes for fp32 and fp16 training, we use the Linux time command for each run and add the user and sys times, for both mixed-precision training and fp32 training. The hyperparameters used for training the DQN Pong agent are listed in Table 9.
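Summing the user and sys fields reported by POSIX `time -p` can be done with a small parser like the following (our own sketch; the sample output string is illustrative, not a measurement from the paper):

```python
def total_cpu_seconds(time_p_output):
    """Sum the user and sys fields from POSIX `time -p` output (seconds)."""
    total = 0.0
    for line in time_p_output.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0] in ("user", "sys"):
            total += float(parts[1])
    return total

sample = "real 120.00\nuser 95.50\nsys 4.50"
print(total_cpu_seconds(sample))  # 100.0
```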
D Quantized Policy Deployment
Here, we describe the methodology used to train a point-to-point navigation policy in Air Learning and deploy it on an embedded compute platform such as the RasPi 3b+. Air Learning is an AI research platform that provides infrastructure components and tools to train fully functional reinforcement learning policies for aerial robots. In simple environments such as OpenAI Gym and Atari, training and inference happen in the same environment without any randomization. In contrast to these environments, Air Learning allows us to randomize various environmental parameters, such as arena size, number of obstacles, and goal position.
In this study, we fix the arena size to 25 m × 25 m × 20 m. The maximum number of obstacles at any time is between one and five, chosen randomly on an episode-to-episode basis. The positions of these obstacles and of the end point (goal) are also changed every episode. We train the aerial robot to reach the end point using the DQN algorithm. The input to the policy is data from a sensor mounted on the drone along with IMU measurements. The output of the policy is one of 25 actions with different velocity and yaw rates. The reward function we use in this study is defined by the following equation:
r = 1000 * α − 100 * β − D_g − D_c    (1) 
Here, α is a binary variable whose value is '1' if the agent reaches the goal and '0' otherwise. β is a binary variable which is set to '1' if the aerial robot collides with any obstacle or runs out of the maximum allocated steps for an episode (we set the maximum allowed steps in an episode to 750, to make sure the agent finds the end point within a finite number of steps); otherwise, β is '0', effectively penalizing the agent for hitting an obstacle or not reaching the end point in time. D_g is the distance to the end point from the agent's current location, motivating the agent to move closer to the goal. D_c is the distance correction, which is applied to penalize the agent if it chooses actions that speed it away from the goal. The distance correction term is defined as follows:
D_c = (V_max − V_now) * t_max    (2) 
V_max is the maximum velocity possible for the agent, which for DQN is fixed at 2.5 m/s. V_now is the current velocity of the agent, and t_max is the duration of the actuation.
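Equations (1) and (2) can be written directly as code. This is a sketch: the function names are ours, and the 1000/100 goal and collision coefficients reflect our reading of the reward shaping described above.

```python
def distance_correction(v_max, v_now, t_max):
    """Eq. (2): larger when the agent moves below its maximum velocity."""
    return (v_max - v_now) * t_max

def reward(alpha, beta, d_goal, v_max, v_now, t_max):
    """Eq. (1): sparse goal/collision terms plus dense distance shaping.

    alpha:  1 if the agent reached the goal, else 0
    beta:   1 on collision or step-budget exhaustion, else 0
    d_goal: distance from the agent's current location to the goal
    The 1000/100 coefficients are our assumption about the shaping scale.
    """
    d_c = distance_correction(v_max, v_now, t_max)
    return 1000 * alpha - 100 * beta - d_goal - d_c

# An agent 10 m from the goal, flying at full speed, pays only the distance term:
print(reward(0, 0, d_goal=10.0, v_max=2.5, v_now=2.5, t_max=0.1))  # -10.0
```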
We train three policies, namely Policy I, Policy II, and Policy III. Each policy is learned through curriculum learning, where we move the end goal farther away as training progresses. We terminate training once the agent has finished 1 million steps. We evaluate all three policies in fp32 and quantized int8 data types over 100 evaluations in Air Learning and report the success rate.
We also take these policies and characterize the system performance on the RasPi 3b platform. The RasPi 3b is a proxy for the compute platform available on the aerial robot. The hardware specification for the RasPi 3b is shown in Table 11.
Embedded System  RasPi 3b 
CPU Cores  4 Cores (ARM A53) 
CPU Frequency  1.2 GHz 
GPU  None 
Power  <1W 
Cost  $35 
We allocate a region of storage space as swap memory, i.e., a region of disk that is used when a process fully utilizes system memory. On the RasPi 3b, the swap memory is allocated in Flash storage.
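On Raspbian, one way to enlarge the swap file is through the dphys-swapfile configuration (a sketch assuming a standard Raspbian install; the 1024 MB value is illustrative, not necessarily the size we used):

```
# /etc/dphys-swapfile -- swap file size in MB
CONF_SWAPSIZE=1024
```

After editing, re-initialize and re-enable swap with `sudo dphys-swapfile setup && sudo dphys-swapfile swapon`.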
E Post-Training Quantization Sweet Spot
Here we demonstrate that post-training quantization can regularize the policy. To that end, we take a pretrained policy for three different Atari environments and quantize it from 32 bits (fp32) down to 2 bits using uniform affine quantization. Figure 6 shows that there is a sweet spot for post-training quantization: sometimes, quantizing to fewer bits outperforms higher-precision quantization. Each plot was generated by applying post-training quantization to the full-precision baselines and evaluating over 10 runs.
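Uniform affine quantization of a weight tensor to n bits can be sketched as follows. This is our own minimal implementation for illustration, not the exact code used in the experiments:

```python
import numpy as np

def affine_quantize(w, n_bits):
    """Fake-quantize an array to n_bits via a uniform affine (scale + zero-point) map."""
    qmin, qmax = 0, 2 ** n_bits - 1
    scale = (w.max() - w.min()) / (qmax - qmin)
    zero_point = qmin - w.min() / scale
    q = np.clip(np.round(w / scale + zero_point), qmin, qmax)
    return scale * (q - zero_point)  # dequantized ("fake-quantized") weights

rng = np.random.default_rng(0)
w = rng.normal(size=1000).astype(np.float32)
for n in (8, 4, 2):
    err = np.abs(affine_quantize(w, n) - w).max()
    print(n, float(err))
```

Evaluating the policy with these fake-quantized weights at each bit width is what produces the sweet-spot curves: the reconstruction error grows as the bit width shrinks, but the induced noise can act as a regularizer.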