Soft Q-Network
Abstract
When DeepMind announced DQN in 2013, the simplicity of the method and its promising results came as a surprise, but its low efficiency and stability make it hard to apply to many problems. In the years since, increasingly complicated improvements have been proposed, many of which rely on distributed deep RL and need a very large number of cores to run the simulators. However, the basic idea behind these techniques is often just a modified DQN. So we ask a simple question: is there a more elegant way to improve the DQN model? Instead of adding more and more small fixes to it, we redesign the problem setting under the popular entropy regularization framework, which leads to better performance and a theoretical guarantee. Finally, we propose SQN, a new off-policy algorithm with better performance and stability.
Keywords: RL, entropy, DQN, efficiency
1 Introduction
The many recent successes in scaling reinforcement learning (RL) to complex sequential decision-making problems were kick-started by the Deep Q-Networks algorithm [DQN]. Its combination of Q-learning with convolutional neural networks and experience replay enabled it to learn, from raw pixels, how to play many Atari games at human-level performance. Since then, many extensions have been proposed that enhance its speed or stability. Double DQN [DDQN] addresses an overestimation bias of Q-learning by decoupling selection and evaluation of the bootstrap action. Prioritized experience replay [prioritized] improves data efficiency by replaying more often the transitions from which there is more to learn. The dueling network architecture [dueling] helps to generalize across actions by separately representing state values and action advantages. Learning from multi-step bootstrap targets, as used in A3C [A3C], shifts the bias-variance trade-off and helps to propagate newly observed rewards faster to earlier visited states. Distributional Q-learning [distributionalQ] learns a categorical distribution of discounted returns instead of estimating the mean. Noisy DQN [noisydqn] uses stochastic network layers for exploration.
However, most current methods easily get stuck in local optima: they find a solution that looks good in the short term but often sacrifices long-run performance. For example, the agent may learn to stay still to avoid death, which is not the behavior we want. Second, model-free methods are notorious for their low sample efficiency: even simple tasks can need millions of interactions with the environment, and for complex decision-making problems the total number of interaction steps can grow far larger [R2D2], which is out of reach for most domains unless a very fast simulator is available.
Many methods are also extremely brittle with respect to their hyperparameters, and they often have a large number of hyperparameters that must be tuned carefully. Most importantly, they often get stuck in local optima. In many cases they fail to find a reward signal at all; even when the reward signal is relatively dense, they can still fail to find the optimal solution, so some researchers design a complex reward function for each environment they want to solve [design].
In this paper, we propose a different approach to complex tasks in deep reinforcement learning and investigate entropy regularization as a way to learn a good policy under the SQN framework. Extensive experiments on Atari tasks demonstrate the effectiveness and advantages of the proposed approach, which performs best among a set of previous state-of-the-art methods.
2 Background
Reinforcement learning addresses the problem of an agent learning to act in an environment to maximize a scalar reward signal. No direct supervision is provided to the agent. We first introduce notation and summarize the standard Soft Actor-Critic framework.
2.1 MDP
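Our problem is to search for an optimal policy that maximizes the accumulated future reward in a Markov decision process (MDP) defined by the tuple $(\mathcal{S}, \mathcal{A}, P, R)$ [planet].

$\mathcal{S}$ represents a set of states,

$\mathcal{A}$ represents a set of actions,

$P : \mathcal{S} \times \mathcal{A} \to \mathcal{P}(\mathcal{S})$ stands for the transition function, which maps state-action pairs to probability distributions over next states $P(s_{t+1} \mid s_t, a_t)$,

$R : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ corresponds to the reward function, with $r_t = R(s_t, a_t)$.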
Within this framework, the agent acts in the environment according to $a_t \sim \pi(\cdot \mid s_t)$, and the environment changes to a new state following $s_{t+1} \sim P(\cdot \mid s_t, a_t)$. Next, a state and a reward are received by the agent. Although many approaches are suitable for MDPs, we focus on using the policy gradient method with an entropy bonus. The Deep Q-Network agent (DQN) [DQN] learns to play games from the Atari-57 benchmark by using frame-stacking of 4 consecutive frames as observations and training a convolutional network to represent a value function with Q-learning, from data continuously collected in a replay buffer. Other algorithms, like A3C, use an LSTM and are trained directly on the online stream of experience without a replay buffer. RDPG [RDPG] combined DDPG with an LSTM by storing sequences in the replay buffer and initializing the recurrent state to zero during training.
2.2 Soft Actor-Critic
Some of the most successful RL algorithms in recent years, such as Trust Region Policy Optimization [trpo], Proximal Policy Optimization [ppo], and Asynchronous Advantage Actor-Critic [A3C], suffer from sample inefficiency. This is because they learn in an "on-policy" manner. In contrast, Q-learning based "off-policy" methods such as Deep Deterministic Policy Gradient [ddpg] and Twin Delayed Deep Deterministic Policy Gradient [TD3] can learn efficiently from past samples using experience replay buffers. However, the problem with these methods is that they are very sensitive to hyperparameters and require a lot of tuning to get them to converge. Soft Actor-Critic follows in the tradition of the latter type of algorithm and adds methods to combat the convergence brittleness.
The Theory Of SAC
SAC is defined for RL tasks involving continuous actions. The biggest feature of SAC is that it uses a modified RL objective function. Instead of only seeking to maximize lifetime rewards, SAC also seeks to maximize the entropy of the policy. Entropy is a quantity which, roughly speaking, says how random a random variable is. If a coin is weighted so that it almost always comes up heads, it has low entropy; if it is evenly weighted and has a half chance of either outcome, it has high entropy.
Let $x$ be a random variable with probability mass or density function $P$. The entropy $H$ of $x$ is computed from its distribution $P$ according to:
$$H(P) = \mathbb{E}_{x \sim P}\left[-\log P(x)\right] \tag{1}$$
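The coin example can be checked numerically. The following is a minimal sketch that assumes the Shannon entropy of a discrete distribution, measured in nats; the 0.99/0.01 weighting is a hypothetical value chosen only for illustration.

```python
import math


def entropy(probs):
    """Shannon entropy -sum_x P(x) log P(x) of a discrete distribution, in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


fair_coin = [0.5, 0.5]
weighted_coin = [0.99, 0.01]  # hypothetical weighting, for illustration only

h_fair = entropy(fair_coin)
h_weighted = entropy(weighted_coin)

print(f"fair coin:     {h_fair:.4f} nats")
print(f"weighted coin: {h_weighted:.4f} nats")
```

The fair coin attains the maximum entropy for two outcomes, `log 2`, while the weighted coin is close to deterministic and so has entropy near zero.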
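The standard reinforcement learning objective is to find a policy that maximizes the expected future return, which we can express as in Equation 3, where $\tau = (s_0, a_0, s_1, \dots)$ is a trajectory generated by following $\pi$ and $\gamma \in [0, 1)$ is the discount factor:
$$J(\pi) = \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{\infty} \gamma^t R(s_t, a_t)\right] \tag{2}$$
$$\pi^* = \arg\max_{\pi} J(\pi) \tag{3}$$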
In addition to encouraging the policy to converge toward a set of probabilities over actions that lead to a high long-term reward, we also add an "entropy bonus" to the loss function. This bonus encourages the agent to act more unpredictably. Entropy bonuses are used because without them an agent can too quickly converge on a policy that is locally optimal but not necessarily globally optimal. Anyone who has worked on RL problems empirically can attest to how often an agent may get stuck learning a policy that only runs into walls, or only turns in a single direction, or any number of clearly suboptimal but low-entropy behaviors. In the case where the globally optimal behavior is difficult to learn due to sparse rewards or other factors, an agent can be forgiven for settling on something simpler but less optimal. The entropy bonus attempts to counteract this tendency by adding an entropy-increasing term to the loss function, and it works well in most cases. This changes the RL problem to:
$$\pi^* = \arg\max_{\pi} \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{\infty} \gamma^t \left(R(s_t, a_t) + \alpha H\left(\pi(\cdot \mid s_t)\right)\right)\right] \tag{4}$$
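To make the effect of the entropy bonus concrete, the sketch below evaluates a discounted return with a per-step bonus `alpha * H(pi(.|s_t))` for a short trajectory. The rewards, the action distributions, and the values of `gamma` and `alpha` are hypothetical and chosen only for illustration.

```python
import math


def entropy(probs):
    """Shannon entropy of a discrete action distribution, in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def soft_return(rewards, action_dists, gamma, alpha):
    """Discounted sum of (reward + alpha * policy entropy) along one trajectory."""
    total = 0.0
    for t, (reward, dist) in enumerate(zip(rewards, action_dists)):
        total += (gamma ** t) * (reward + alpha * entropy(dist))
    return total


rewards = [1.0, 0.0, 2.0]            # hypothetical rewards r_0, r_1, r_2
deterministic = [[1.0, 0.0]] * 3     # policy that always picks action 0
uniform = [[0.5, 0.5]] * 3           # policy that picks either action equally

plain = soft_return(rewards, deterministic, gamma=0.9, alpha=0.2)
bonus = soft_return(rewards, uniform, gamma=0.9, alpha=0.2)
print(plain, bonus)
```

For the same rewards, the uniform policy collects a strictly larger regularized return than the deterministic one, which is exactly the pressure toward stochastic behavior described above.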
Rather than optimizing for the reward at every timestep, agents are trained to optimize for the long-term sum of future rewards. We can apply this same principle to the entropy of the agent's policy and optimize for the long-term sum of entropy. Value functions in this setting should therefore include entropy bonuses at each timestep, which leads to a different definition than before. Now $V^\pi$, which includes the entropy bonuses, is:
$$V^\pi(s) = \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{\infty} \gamma^t \left(R(s_t, a_t) + \alpha H\left(\pi(\cdot \mid s_t)\right)\right) \,\middle|\, s_0 = s\right] \tag{5}$$
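The temperature parameter $\alpha$ trades off the importance of the entropy term against the environment's reward. When $\alpha$ is large, the entropy bonuses play an important role in the reward, so the policy will tend to have larger entropy, which means the policy will be more stochastic. Conversely, as $\alpha$ becomes smaller, the policy becomes more deterministic. $Q^\pi$ has to be modified to contain the entropy bonuses as well:
$$Q^\pi(s, a) = \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{\infty} \gamma^t R(s_t, a_t) + \alpha \sum_{t=1}^{\infty} \gamma^t H\left(\pi(\cdot \mid s_t)\right) \,\middle|\, s_0 = s, a_0 = a\right] \tag{6}$$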
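The original Bellman operator is augmented with an entropy regularization term. From Equations (5) and (6), the connection between $V^\pi$ and $Q^\pi$ can be easily derived as:
$$V^\pi(s) = \mathbb{E}_{a \sim \pi}\left[Q^\pi(s, a)\right] + \alpha H\left(\pi(\cdot \mid s)\right) \tag{7}$$
Given the equations above, the Bellman equation for $Q^\pi$ becomes:
$$Q^\pi(s, a) = \mathbb{E}_{s' \sim P,\, a' \sim \pi}\left[R(s, a) + \gamma\left(Q^\pi(s', a') + \alpha H\left(\pi(\cdot \mid s')\right)\right)\right] \tag{8}$$
$$Q^\pi(s, a) = \mathbb{E}_{s' \sim P}\left[R(s, a) + \gamma V^\pi(s')\right] \tag{9}$$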
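The soft Bellman equation can be illustrated with tabular soft policy evaluation. The sketch below is not the paper's implementation: it assumes a hypothetical two-state, two-action deterministic MDP and a fixed stochastic policy, and repeatedly applies the backup `Q(s,a) = R(s,a) + gamma * V(s')` with `V(s) = E_{a~pi}[Q(s,a) - alpha * log pi(a|s)]` until it reaches a fixed point.

```python
import math

GAMMA, ALPHA = 0.9, 0.1

# Hypothetical 2-state, 2-action deterministic MDP (illustration only).
NEXT_STATE = [[0, 1], [0, 1]]       # NEXT_STATE[s][a]
REWARD = [[0.0, 1.0], [2.0, 0.0]]   # REWARD[s][a]
POLICY = [[0.5, 0.5], [0.8, 0.2]]   # fixed stochastic policy pi(a|s)


def soft_value(q_row, pi_row, alpha):
    """V(s) = E_{a~pi}[Q(s,a) - alpha * log pi(a|s)]."""
    return sum(p * (q - alpha * math.log(p))
               for q, p in zip(q_row, pi_row) if p > 0.0)


def soft_policy_evaluation(num_sweeps=500):
    """Iterate the soft Bellman backup for the fixed policy."""
    q = [[0.0, 0.0], [0.0, 0.0]]
    for _ in range(num_sweeps):
        v = [soft_value(q[s], POLICY[s], ALPHA) for s in range(2)]
        q = [[REWARD[s][a] + GAMMA * v[NEXT_STATE[s][a]] for a in range(2)]
             for s in range(2)]
    return q


q = soft_policy_evaluation()
v = [soft_value(q[s], POLICY[s], ALPHA) for s in range(2)]
print(q, v)
```

Because the backup is a contraction with factor `gamma`, the returned `q` satisfies the soft Bellman equation up to floating-point error.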
3 Method
In particular, SAC makes use of two soft Q-functions to mitigate the positive bias in the policy improvement step that is known to degrade the performance of value-based methods. Function approximators are used for both the soft Q-function and the policy. Instead of running evaluation and improvement to convergence, we alternate between optimizing both networks with stochastic gradient descent. We will consider two parameterized soft Q-functions $Q_{\phi_1}(s, a)$ and $Q_{\phi_2}(s, a)$ and a tractable policy $\pi_\theta(a \mid s)$. The parameters of these networks are $\phi_1, \phi_2$ and $\theta$.
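LEARNING Q-FUNCTIONS: The Q-functions are learned by mean-squared Bellman error (MSBE) minimization, using a target value network to form the Bellman backups. They both use the same target, as in TD3, and have loss functions:
$$L(\phi_i, \mathcal{D}) = \mathbb{E}_{(s, a, r, s', d) \sim \mathcal{D}}\left[\left(Q_{\phi_i}(s, a) - \left(r + \gamma (1 - d) V_{\psi_{\text{targ}}}(s')\right)\right)^2\right] \tag{10}$$
Here $\mathcal{D}$ is the replay buffer, $d$ indicates whether $s'$ is terminal, and $V_{\psi_{\text{targ}}}$ is the target value network.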
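As for the target value network, we can obtain it by Polyak averaging the value network parameters over the course of training. It is not hard to rewrite the connection between the value function and the Q-function as follows:
$$V^\pi(s) = \mathbb{E}_{a \sim \pi}\left[Q^\pi(s, a)\right] + \alpha H\left(\pi(\cdot \mid s)\right) \tag{11}$$
$$V^\pi(s) = \mathbb{E}_{a \sim \pi}\left[Q^\pi(s, a) - \alpha \log \pi(a \mid s)\right] \tag{12}$$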
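The value function is implicitly parameterized through the soft Q-function parameters via Equation 12. We use clipped double-Q learning, as in TD3 [TD3] and SAC [SAC], to express the TD target, taking the minimum Q-value between the two approximators. So the loss for the Q-function parameters is:
$$L(\phi_i, \mathcal{D}) = \mathbb{E}_{(s, a, r, s', d) \sim \mathcal{D}}\left[\left(Q_{\phi_i}(s, a) - \left(r + \gamma (1 - d)\left(\min_{j=1,2} Q_{\phi_{\text{targ},j}}(s', \tilde{a}') - \alpha \log \pi_\theta(\tilde{a}' \mid s')\right)\right)\right)^2\right], \quad \tilde{a}' \sim \pi_\theta(\cdot \mid s') \tag{13}$$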
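As a worked instance of the clipped double-Q target, the sketch below computes the entropy-regularized TD target and the squared Bellman error for a single transition. It assumes the usual form `r + gamma * (1 - done) * (min(Q1', Q2') - alpha * log pi(a'|s'))`; all numeric inputs are hypothetical stand-ins for network outputs.

```python
def soft_td_target(reward, done, gamma, alpha, q1_next, q2_next, logp_next):
    """Clipped double-Q target with an entropy bonus for one transition.

    q1_next, q2_next: target-network Q-values at (s', a'), with a' freshly
    sampled from the current policy; logp_next: log pi(a'|s').
    """
    min_q = min(q1_next, q2_next)  # pessimistic estimate to curb overestimation
    return reward + gamma * (1.0 - done) * (min_q - alpha * logp_next)


def msbe_loss(q_pred, target):
    """Squared Bellman error for one sample."""
    return (q_pred - target) ** 2


# Hypothetical numbers standing in for network outputs.
target = soft_td_target(reward=1.0, done=0.0, gamma=0.99, alpha=0.2,
                        q1_next=3.0, q2_next=2.5, logp_next=-1.0)
terminal_target = soft_td_target(reward=1.0, done=1.0, gamma=0.99, alpha=0.2,
                                 q1_next=3.0, q2_next=2.5, logp_next=-1.0)
loss = msbe_loss(q_pred=3.0, target=target)
print(target, terminal_target, loss)
```

Both Q-functions regress toward the same `target`, and on terminal transitions the bootstrap term vanishes so the target reduces to the reward.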
The update makes use of a target soft Q-function that is obtained as an exponential moving average of the soft Q-function weights, which has been shown to stabilize training. Importantly, we do not use next-state actions from the replay buffer here: these actions are sampled fresh from the current version of the policy.
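The exponential moving average of the weights can be sketched in a few lines. This is a minimal illustration over flat lists of floats; the function name `polyak_update` and the coefficient name `rho` are assumptions rather than names from the paper, and in practice the same rule is applied to every tensor of the network.

```python
def polyak_update(target_params, online_params, rho=0.995):
    """Return new target parameters: rho * target + (1 - rho) * online."""
    return [rho * t + (1.0 - rho) * o
            for t, o in zip(target_params, online_params)]


# Hypothetical parameter vectors, for illustration only.
target = [0.0, 1.0]
online = [1.0, 1.0]
target = polyak_update(target, online, rho=0.9)
print(target)
```

A value of `rho` close to 1 makes the target network change slowly, which keeps the bootstrap target nearly stationary between updates.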
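LEARNING THE POLICY: The policy should, in each state, act to maximize the expected future return plus expected future entropy. That is, it should maximize $V^\pi(s)$, which we expand out (as before) into
$$V^\pi(s) = \mathbb{E}_{a \sim \pi}\left[Q^\pi(s, a) - \alpha \log \pi(a \mid s)\right] \tag{14}$$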
The target density is the Q-function, which is represented by a neural network and can be differentiated, so it is convenient to apply the reparameterization trick, which results in a lower-variance estimate. A sample from $\pi_\theta(\cdot \mid s)$ is drawn by computing a deterministic function of the state, the policy parameters, and independent noise. Following the authors of the SAC paper [SAC1], we use a squashed Gaussian policy, which means that samples are obtained according to
$$\tilde{a}_\theta(s, \xi) = \tanh\left(\mu_\theta(s) + \sigma_\theta(s) \odot \xi\right), \quad \xi \sim \mathcal{N}(0, I) \tag{15}$$
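A one-dimensional sketch of squashed Gaussian sampling follows. It assumes the policy outputs a mean and a log standard deviation (here passed in as plain floats instead of network outputs) and applies the standard change-of-variables correction for `tanh` to obtain the log-probability; the small constant `1e-6` is an assumed numerical guard.

```python
import math
import random


def sample_squashed_gaussian(mean, log_std, rng):
    """Draw a = tanh(mean + std * xi), xi ~ N(0, 1), and return (a, log pi(a))."""
    std = math.exp(log_std)
    xi = rng.gauss(0.0, 1.0)           # parameter-free noise
    pre_tanh = mean + std * xi         # deterministic in (mean, std) given xi
    action = math.tanh(pre_tanh)
    # Log-density of the Gaussian sample before squashing.
    log_prob = -0.5 * xi ** 2 - log_std - 0.5 * math.log(2.0 * math.pi)
    # Change of variables: subtract log |d tanh(u) / du| = log(1 - tanh(u)^2).
    log_prob -= math.log(1.0 - action ** 2 + 1e-6)
    return action, log_prob


rng = random.Random(0)
action, log_prob = sample_squashed_gaussian(mean=0.0, log_std=-1.0, rng=rng)
print(action, log_prob)
```

Since the randomness enters only through `xi`, the action is a differentiable function of `mean` and `log_std`, which is what lets gradients flow from the Q-function back into the policy parameters.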
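The reparameterization trick allows us to rewrite the expectation over actions (which contains a pain point: the distribution depends on the policy parameters) into an expectation over noise (which removes the pain point: the distribution now has no dependence on parameters):
$$\mathbb{E}_{a \sim \pi_\theta}\left[Q^{\pi_\theta}(s, a) - \alpha \log \pi_\theta(a \mid s)\right] = \mathbb{E}_{\xi \sim \mathcal{N}}\left[Q^{\pi_\theta}\left(s, \tilde{a}_\theta(s, \xi)\right) - \alpha \log \pi_\theta\left(\tilde{a}_\theta(s, \xi) \mid s\right)\right] \tag{16}$$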
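To get the policy loss, the final step is to substitute $Q^{\pi_\theta}$ with one of our function approximators. As in TD3, we use $Q_{\phi_1}$. The policy is thus optimized according to
$$\max_\theta \; \mathbb{E}_{s \sim \mathcal{D},\, \xi \sim \mathcal{N}}\left[Q_{\phi_1}\left(s, \tilde{a}_\theta(s, \xi)\right) - \alpha \log \pi_\theta\left(\tilde{a}_\theta(s, \xi) \mid s\right)\right] \tag{17}$$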
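LEARNING $\alpha$: As proposed in [SAC1], to improve performance we also learn the temperature parameter $\alpha$ by minimizing the dual objective, where $\bar{H}$ is a target entropy:
$$\alpha_t^* = \arg\min_{\alpha_t} \mathbb{E}_{a_t \sim \pi_t^*}\left[-\alpha_t \log \pi_t^*(a_t \mid s_t; \alpha_t) - \alpha_t \bar{H}\right] \tag{18}$$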
Prior work gives us tools to achieve this: as shown in [opt_convex], approximate dual gradient descent is one way to do so. Because we use a function approximator and it is impractical to optimize with respect to the primal variables fully, we compute gradients for $\alpha$ with the following objective, with target entropy $\bar{H}$:
$$J(\alpha) = \mathbb{E}_{a_t \sim \pi_t}\left[-\alpha \log \pi_t(a_t \mid s_t) - \alpha \bar{H}\right] \tag{19}$$
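One gradient step on the temperature can be sketched as follows. This assumes the objective `J(alpha) = E[-alpha * (log pi(a|s) + target_entropy)]` from the SAC paper and, as an implementation choice not stated in the source, optimizes `log_alpha` so that `alpha` stays positive. The log-probabilities and the target entropy are hypothetical values.

```python
import math


def alpha_loss(alpha, log_probs, target_entropy):
    """Monte Carlo estimate of J(alpha) = E[-alpha * (log pi + target_entropy)]."""
    return sum(-alpha * (lp + target_entropy) for lp in log_probs) / len(log_probs)


def update_log_alpha(log_alpha, log_probs, target_entropy, lr=1e-2):
    """One gradient-descent step on J with respect to log(alpha)."""
    alpha = math.exp(log_alpha)
    mean_logp = sum(log_probs) / len(log_probs)
    grad_alpha = -(mean_logp + target_entropy)  # dJ / d alpha
    grad_log_alpha = grad_alpha * alpha         # chain rule: d alpha / d log_alpha = alpha
    return log_alpha - lr * grad_log_alpha


# Policy entropy estimate 0.15 is below the target 0.5, so alpha should grow.
raised = update_log_alpha(0.0, log_probs=[-0.1, -0.2], target_entropy=0.5)
# Policy entropy estimate 1.1 is above the target 0.5, so alpha should shrink.
lowered = update_log_alpha(0.0, log_probs=[-1.0, -1.2], target_entropy=0.5)
print(math.exp(raised), math.exp(lowered))
```

The update raises the temperature when the policy is less random than the target entropy and lowers it otherwise, which is the balancing behavior the dual formulation is meant to provide.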
Inspired by SAC, we derive our SQN method. SAC is aimed at continuous action spaces, so a policy network is necessary. DQN, however, gives a good example of how to select an action directly from the Q-function. Combining these two ideas means sampling an action with an entropy bonus, which leads to:
(20) 
(21) 
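The idea of sampling an action directly from the Q-function with an entropy bonus can be sketched as below. The exact form of the source equations is not recoverable here, so this block assumes the standard closed form from soft Q-learning for discrete actions: `V(s) = alpha * log sum_a exp(Q(s,a) / alpha)` and `pi(a|s) = exp((Q(s,a) - V(s)) / alpha)`. The Q-values are hypothetical.

```python
import math
import random


def soft_value(q_values, alpha):
    """V(s) = alpha * log sum_a exp(Q(s,a) / alpha), computed stably."""
    m = max(q_values)
    return m + alpha * math.log(sum(math.exp((q - m) / alpha) for q in q_values))


def soft_policy(q_values, alpha):
    """Boltzmann policy pi(a|s) = exp((Q(s,a) - V(s)) / alpha)."""
    v = soft_value(q_values, alpha)
    return [math.exp((q - v) / alpha) for q in q_values]


def sample_action(q_values, alpha, rng):
    """Sample an action index from the Boltzmann policy; no policy network needed."""
    probs = soft_policy(q_values, alpha)
    return rng.choices(range(len(q_values)), weights=probs, k=1)[0]


q_values = [1.0, 2.0, 0.5]  # hypothetical Q(s, .) for three discrete actions
sharp = soft_policy(q_values, alpha=0.01)   # nearly greedy
flat = soft_policy(q_values, alpha=100.0)   # nearly uniform
action = sample_action(q_values, alpha=0.5, rng=random.Random(0))
print(sharp, flat, action)
```

Under this assumption the policy is fully determined by the Q-values and the temperature, so a small `alpha` recovers nearly greedy DQN-style action selection while a large `alpha` approaches uniform exploration.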
This shows that the policy parameter update step does not exist at all. The overall architecture of our algorithm is shown in Figure 1. The value function update is the same as in SAC, as described in Section 3. The final algorithm is listed in Algorithm 1. The method alternates between collecting experience from the environment with the current policy and updating the function approximators using stochastic gradients from batches sampled from a replay pool. Using off-policy data from a replay pool is feasible because both the value estimators and the policy can be trained entirely on off-policy data. The algorithm is agnostic to the parameterization of the policy, as long as it can be evaluated for any arbitrary state-action tuple.
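The alternation between collecting experience and updating from a replay pool can be sketched with a tabular stand-in. This is not the paper's Algorithm 1: it assumes a hypothetical five-state chain environment, a Q-table in place of the neural network, a single Q-function instead of two, a fixed temperature, and the Boltzmann policy `pi(a|s) = exp((Q(s,a) - V(s)) / alpha)` from soft Q-learning.

```python
import math
import random
from collections import deque


def soft_value(q_row, alpha):
    """V(s) = alpha * log sum_a exp(Q(s,a) / alpha), computed stably."""
    m = max(q_row)
    return m + alpha * math.log(sum(math.exp((q - m) / alpha) for q in q_row))


def sample_action(q_row, alpha, rng):
    """Sample from the Boltzmann policy induced by one row of Q-values."""
    v = soft_value(q_row, alpha)
    probs = [math.exp((q - v) / alpha) for q in q_row]
    return rng.choices(range(len(q_row)), weights=probs, k=1)[0]


def chain_step(state, action, n_states):
    """Hypothetical chain: action 1 moves right, 0 moves left; reward 1 at the right end."""
    next_state = min(state + 1, n_states - 1) if action == 1 else max(state - 1, 0)
    done = next_state == n_states - 1
    return next_state, (1.0 if done else 0.0), done


def train(n_states=5, episodes=300, alpha=0.05, gamma=0.95, lr=0.1,
          rho=0.99, batch_size=16, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(n_states)]
    q_targ = [row[:] for row in q]
    replay = deque(maxlen=5000)
    for _ in range(episodes):
        state, done, steps = 0, False, 0
        while not done and steps < 50:
            # 1) Collect experience with the current soft policy.
            action = sample_action(q[state], alpha, rng)
            next_state, reward, done = chain_step(state, action, n_states)
            replay.append((state, action, reward, next_state, done))
            state, steps = next_state, steps + 1
            if len(replay) < batch_size:
                continue
            # 2) Update Q toward the soft TD target on a replayed batch.
            for s, a, r, s2, d in rng.sample(list(replay), batch_size):
                target = r + gamma * (0.0 if d else soft_value(q_targ[s2], alpha))
                q[s][a] += lr * (target - q[s][a])
            # 3) Polyak-average the target table toward the online table.
            for s in range(n_states):
                for a in range(2):
                    q_targ[s][a] = rho * q_targ[s][a] + (1.0 - rho) * q[s][a]
    return q


q = train()
print(q[3])
```

There is no separate policy update anywhere in the loop: acting and bootstrapping both read the policy off the current Q-values, which is the simplification the text describes.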
4 Experiment
To test our agent, we designed our experiments to answer the following questions:

(1) Can SQN be used to solve challenging Atari problems? How does our agent compare with other methods when applied to these problems, with respect to final performance, computation time, and sample complexity?

(2) What is the impact of different reward scales, and how do different hyperparameters influence the stability of our agent?

(3) We add an entropy bonus to our agent; does this part give us a more powerful agent?
To answer (1), we compare the performance of our agent with other methods in Section 4.1. With regard to (2) and (3), we conduct an ablation study of our algorithm in Section 4.2, testing how different settings and network designs influence performance.
The results show that, overall, our agent outperforms the baseline by a large margin, both in terms of learning speed and final performance. The quantitative results attained by our agent in our experiments also compare very favorably with results reported for other methods in prior work, indicating that both the sample efficiency and the final performance of our agent on these benchmark tasks exceed the state of the art.
4.1 Atari
The Arcade Learning Environment (ALE) has been the testing ground of most recent deep reinforcement learning agents. It poses challenging reinforcement learning problems including exploration, planning, reactive play, and complex visual input. Most games feature very different visuals and game mechanics, which makes this domain particularly challenging. The goal of this experimental evaluation is to understand how the sample complexity and stability of our method compare with prior deep reinforcement learning algorithms. We compare our method to prior techniques on a range of challenging Atari tasks from the OpenAI Gym benchmark suite. Although the easier tasks can be solved by a wide range of different algorithms, the more complex benchmarks are exceptionally difficult to solve with off-policy algorithms. The stability of the algorithm also plays a large role in performance: easier tasks make it more practical to tune hyperparameters to achieve good results, while the already narrow basins of effective hyperparameters become prohibitively small for the more sensitive algorithms on the hardest benchmarks, leading to poor performance [QProp].
To allow for a reproducible and fair comparison, we evaluate all the algorithms with a similar network structure. For the off-policy algorithms, we use a two-layer feedforward neural network with 400 and 300 hidden nodes respectively, with rectified linear units (ReLU) between each layer, for both the actor and the critic. For the baselines we compare against, we use the parameters shown to be superior in prior work [that_matters]. Both network parameters are updated using Adam [adam] with a learning rate of , with no modifications to the environment or reward.
Figure 2 compares three individual runs of both variants, initialized with different random seeds. SQN performs much better: our agent significantly outperforms the baseline, indicating substantially better stability. As is evident from the figure, with joint training and internal reward we can achieve stable training. This becomes especially important with harder tasks, where tuning hyperparameters is challenging.
This shows that our agent outperforms the other baseline methods by a large margin, indicating that both the efficiency and the stability of the method are superior.
4.2 Ablation Study
We have at most three different kinds of update scheme. Figure 1(d) shows how learning performance changes when the update scheme is changed. For large $\alpha$, the policy becomes nearly random and consequently fails to exploit the reward signal, resulting in substantial degradation of performance. For a small temperature coefficient, the value function is enhanced by exploration, so the model learns quickly at first, but the policy then becomes nearly deterministic, leading to poor local minima due to the lack of adequate exploration and a worse state representation. Even when the scale is right, the reward becomes larger toward the end of training, so the entropy bonus contributes almost nothing to the algorithm and it still cannot achieve strong performance. With a learned $\alpha$, the model balances exploration and exploitation, so our agent can take advantage of joint optimization and achieve stable training.
5 Conclusion
Empirical results show that SQN outperforms DQN by a large margin. Note that this is the "plain" version of SQN, so it could become an important cornerstone for many modern algorithms such as IMPALA [espeholt2018impala], APEX [APEX], R2D2 [R2D2], and so on. We illustrate that SQN has the potential to be combined with all kinds of improvements made to DQN, improving performance and efficiency from the starting point. More experiments are needed to test whether SQN could replace DQN and become a standard algorithm. But, as we proposed, promising results already illustrate that SQN is simple yet powerful enough to solve many interesting problems that DQN cannot solve. In the future, we will focus on some challenging tasks (like the new Google football environment) to examine the limits of our algorithm.
References
Footnotes
Thanks to Microsoft for providing cloud services.