Soft Q-network


Abstract


When DQN was announced by DeepMind in 2013, the whole world was surprised by its simplicity and promising results, but due to the low efficiency and stability of the method, it is hard to solve many problems with it. Over the years, people have proposed more and more complicated ideas for improving it; many of them use distributed deep RL, which needs tons of cores to run the simulators. However, the basic idea behind all of these techniques is often just a modified DQN. So we asked a simple question: is there a more elegant way to improve the DQN model? Instead of adding more and more small fixes to it, we redesign the problem setting under the popular entropy regularization framework, which leads to better performance and theoretical guarantees. Finally, we propose SQN, a new off-policy algorithm with better performance and stability.


Keywords: RL, entropy, DQN, efficiency

1 Introduction

The many recent successes in scaling reinforcement learning (RL) to complex sequential decision-making problems were kick-started by the Deep Q-Networks algorithm (DQN) [DQN]. Its combination of Q-learning with convolutional neural networks and experience replay enabled it to learn, from raw pixels, how to play many Atari games at human-level performance. Since then, many extensions have been proposed that enhance its speed or stability. Double DQN [DDQN] addresses an overestimation bias of Q-learning by decoupling selection and evaluation of the bootstrap action. Prioritized experience replay [prioritized] improves data efficiency by replaying more often transitions from which there is more to learn. The dueling network architecture [dueling] helps to generalize across actions by separately representing state values and action advantages. Learning from multi-step bootstrap targets, as used in A3C [A3C], shifts the bias-variance tradeoff and helps to propagate newly observed rewards faster to earlier visited states. Distributional Q-learning [distributionalQ] learns a categorical distribution of discounted returns, instead of estimating the mean. NoisyNet [noisydqn] uses stochastic network layers for exploration.

However, most current methods easily get stuck in locally optimal solutions: they find a solution that looks good in the short term but often sacrifices performance in the long run. For example, the agent may learn to stay still to avoid death, which is not something we want. Second, model-free methods have a bad name for their low sample efficiency; even simple tasks need millions of interactions with the environment, and when it comes to complex decision-making problems the total number of interaction steps can become prohibitively large [R2D2], which is inaccessible for most domains except in some very fast simulators.

Many methods are extremely brittle with respect to their hyper-parameters, and they often have so many hyper-parameters that need to be tuned. This means we need to tune the parameters carefully and, most importantly, these methods often get stuck in local optima. In many cases they fail to find a reward signal, and even when the reward signal is relatively dense, they still fail to find the optimal solution; some researchers therefore design a complex reward function for each environment they want to solve [design].

In this paper, we propose a different way to tackle complex tasks with deep reinforcement learning and investigate an entropy regularization approach to learning a good policy under the SQN framework. Extensive experiments on Atari tasks demonstrate the effectiveness and advantages of the proposed approach, which performs the best among a set of previous state-of-the-art methods.

2 Background

Reinforcement learning addresses the problem of an agent learning to act in an environment to maximize a scalar reward signal. No direct supervision is provided to the agent. We first introduce notation and summarize the standard Soft Actor-Critic framework.

2.1 MDP

Our problem is to search for an optimal policy that maximizes the accumulated future reward in a Markov Decision Process (MDP) defined by the tuple $(\mathcal{S}, \mathcal{A}, p, r)$ [planet], where:

  • $\mathcal{S}$ represents a set of states,

  • $\mathcal{A}$ represents a set of actions,

  • $p(s_{t+1} \mid s_t, a_t)$ stands for the transition function, which maps state-action pairs to probability distributions over next states,

  • $r(s_t, a_t)$ corresponds to the reward function, with $r : \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}$.

Within this framework, the agent acts in the environment according to its policy $\pi(a_t \mid s_t)$; the environment then changes to a new state following $p(s_{t+1} \mid s_t, a_t)$. Next, the new state $s_{t+1}$ and reward $r(s_t, a_t)$ are received by the agent. Although there are many approaches suitable for the MDP setting, we focus on using the policy gradient method with an entropy bonus. The Deep Q-Network agent (DQN) [DQN] learns to play games from the Atari-57 benchmark by using frame-stacking of 4 consecutive frames as observations, and training a convolutional network to represent a value function with Q-learning, from data continuously collected in a replay buffer. Other algorithms, like A3C, use an LSTM and are trained directly on the online stream of experience without using a replay buffer. The authors of [RDPG] combined DDPG with an LSTM by storing sequences in the replay buffer and initializing the recurrent state to zero during training.
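For concreteness, the following is a minimal sketch of the interaction loop and replay buffer just described, assuming the classic OpenAI Gym API (gym < 0.26); the environment id, buffer size, and the random select_action placeholder are our own illustrative choices, not details from the paper.

import random
from collections import deque

import gym

# CartPole is used only to keep the sketch self-contained;
# the paper's experiments are on Atari environments.
env = gym.make("CartPole-v1")
replay_buffer = deque(maxlen=100_000)          # experience replay, as in DQN

def select_action(state):
    return env.action_space.sample()           # placeholder behaviour policy

state = env.reset()
for t in range(500):
    action = select_action(state)
    next_state, reward, done, info = env.step(action)   # (s', r) drawn from p and r
    replay_buffer.append((state, action, reward, next_state, done))
    state = env.reset() if done else next_state

# Training later samples mini-batches of (s, a, r, s', done) transitions:
batch = random.sample(list(replay_buffer), k=32)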

2.2 Soft Actor Critic

Some of the most successful RL algorithms in recent years such as Trust Region Policy Optimization [trpo], Proximal Policy Optimization [ppo], and Asynchronous Actor-Critic Agents [A3C] suffer from sample inefficiency. This is because they learn in an “on-policy” manner. In contrast, Q-learning based “off-policy” methods such as Deep Deterministic Policy Gradient [ddpg] and Twin Delayed Deep Deterministic Policy Gradient [TD3] can learn efficiently from past samples using experience replay buffers. However, the problem with these methods is that they are very sensitive to hyper-parameters and require a lot of tuning to get them to converge. Soft Actor-Critic follows in the tradition of the latter type of algorithms and adds methods to combat the convergence brittleness.

The Theory Of SAC

SAC is designed for RL tasks involving continuous actions. The biggest feature of SAC is that it uses a modified RL objective function. Instead of only seeking to maximize lifetime rewards, SAC seeks to also maximize the entropy of the policy. Entropy is a quantity which, roughly speaking, says how random a random variable is. If a coin is weighted so that it almost always comes up heads, it has low entropy; if it’s evenly weighted and has a half chance of either outcome, it has high entropy.

Let $x$ be a random variable with probability mass or density function $P$. The entropy $\mathcal{H}$ of $x$ is computed from its distribution according to:

$\mathcal{H}(P) = \mathbb{E}_{x \sim P}\left[-\log P(x)\right]$   (1)
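As a small numerical illustration of Equation 1 (our own, not from the paper), the following Python snippet computes the entropy of the two coins mentioned above:

import numpy as np

def entropy(p):
    """H(P) = E_{x~P}[-log P(x)] for a discrete probability vector p."""
    p = np.asarray(p, dtype=np.float64)
    return -np.sum(p * np.log(p + 1e-12))      # small epsilon guards log(0)

print(entropy([0.5, 0.5]))      # fair coin, ~0.693 nats (maximal for two outcomes)
print(entropy([0.99, 0.01]))    # heavily weighted coin, ~0.056 nats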

The standard reinforcement learning objective is to find a policy that maximizes the expected future return, which we can express as in Equation 3:

$\pi^{*} = \arg\max_{\pi} J(\pi)$   (2)
$J(\pi) = \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t)\right]$   (3)

In addition to encouraging the policy to converge toward a set of probabilities over actions that leads to a high long-term reward, we also add an “entropy bonus” to the loss function. This bonus encourages the agent to act more unpredictably. Entropy bonuses are used because, without them, an agent can too quickly converge on a policy that is locally optimal but not necessarily globally optimal. Anyone who has worked on RL problems empirically can attest to how often an agent gets stuck learning a policy that only runs into walls, or only turns in a single direction, or any number of clearly sub-optimal but low-entropy behaviors. In cases where the globally optimal behavior is difficult to learn due to sparse rewards or other factors, an agent can be forgiven for settling on something simpler but less optimal. The entropy bonus attempts to counteract this tendency by adding an entropy-increasing term to the loss function, and it works well in most cases. This changes the RL problem to:

$\pi^{*} = \arg\max_{\pi}\, \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{\infty} \gamma^{t}\left(r(s_t, a_t) + \alpha\, \mathcal{H}(\pi(\cdot \mid s_t))\right)\right]$   (4)

Rather than optimizing for the reward at every timestep, agents are trained to optimize for the long-term sum of future rewards. We can apply this same principle to the entropy of the agent’s policy and optimize for the long-term sum of entropy. So value functions in this setting should include entropy bonuses at each timestep, which leads to a different definition than before. Now $V^{\pi}$, which includes the entropy bonuses, is:

$V^{\pi}(s) = \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{\infty} \gamma^{t}\left(r(s_t, a_t) + \alpha\, \mathcal{H}(\pi(\cdot \mid s_t))\right) \,\middle|\, s_0 = s\right]$   (5)
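As a toy numerical check of Equation 5 (our own illustration; the trajectory values are arbitrary), the entropy-augmented discounted return of a short sampled trajectory can be computed as:

gamma, alpha = 0.99, 0.2
rewards   = [1.0, 0.0, 2.0]       # r(s_t, a_t) along a sampled trajectory
entropies = [0.69, 0.55, 0.60]    # H(pi(.|s_t)) at each visited state

soft_return = sum(gamma ** t * (r + alpha * h)
                  for t, (r, h) in enumerate(zip(rewards, entropies)))
print(soft_return)                # discounted sum of reward plus entropy bonus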

The temperature parameter $\alpha$ trades off the importance of the entropy term against the environment’s reward. When $\alpha$ is large, the entropy bonus plays an important role in the reward, so the policy will tend to have higher entropy, which means the policy will be more stochastic; on the contrary, if $\alpha$ becomes smaller, the policy will become more deterministic. $Q^{\pi}$ has to be modified to contain the entropy bonuses as well:

$Q^{\pi}(s, a) = \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t) + \alpha \sum_{t=1}^{\infty} \gamma^{t}\, \mathcal{H}(\pi(\cdot \mid s_t)) \,\middle|\, s_0 = s,\; a_0 = a\right]$   (6)

The original Bellman operator is augmented with an entropy regularization term. With Equations (5) and (6), the connection between $V^{\pi}$ and $Q^{\pi}$ can easily be derived as:

$V^{\pi}(s) = \mathbb{E}_{a \sim \pi}\left[Q^{\pi}(s, a)\right] + \alpha\, \mathcal{H}(\pi(\cdot \mid s))$   (7)

Given the equations above, the Bellman equation for $Q^{\pi}$ becomes:

$Q^{\pi}(s, a) = \mathbb{E}_{s' \sim p,\, a' \sim \pi}\left[r(s, a) + \gamma\left(Q^{\pi}(s', a') + \alpha\, \mathcal{H}(\pi(\cdot \mid s'))\right)\right]$   (8)
$\phantom{Q^{\pi}(s, a)} = \mathbb{E}_{s' \sim p}\left[r(s, a) + \gamma\, V^{\pi}(s')\right]$   (9)

3 Method

In particular, SAC makes use of two soft Q-functions to mitigate positive bias in the policy improvement step that is known to degrade the performance of value-based methods. Function approximators are used for both the soft Q-function and the policy. Instead of running evaluation and improvement to convergence, we alternate between optimizing both networks with stochastic gradient descent. We consider two parameterized soft Q-functions $Q_{\theta_1}, Q_{\theta_2}$ and a tractable policy $\pi_{\phi}$. The parameters of these networks are $\theta_i$ and $\phi$.


LEARNING Q-FUNCTIONS: The Q-functions are learned by minimizing the mean-squared Bellman error (MSBE), using a target value network to form the Bellman backups. They both use the same target, as in TD3, and have the loss function:

$J_Q(\theta) = \mathbb{E}_{(s_t, a_t) \sim \mathcal{D}}\left[\tfrac{1}{2}\left(Q_{\theta}(s_t, a_t) - \left(r(s_t, a_t) + \gamma\, \mathbb{E}_{s_{t+1} \sim p}\left[V_{\bar{\theta}}(s_{t+1})\right]\right)\right)^{2}\right]$   (10)

As for the target value network, we obtain it by Polyak-averaging the value-network parameters over the course of training. It is not hard to rewrite the connection between the value function and the Q-function as follows:

$\mathcal{H}(\pi(\cdot \mid s_t)) = \mathbb{E}_{a_t \sim \pi}\left[-\log \pi(a_t \mid s_t)\right]$   (11)
$V(s_t) = \mathbb{E}_{a_t \sim \pi}\left[Q(s_t, a_t) - \alpha \log \pi(a_t \mid s_t)\right]$   (12)
Figure 1: Overall architecture of our SQN algorithm. There are two main parts: the Q-networks and the policy. The Q-network infers the current state-action values from the current observation, which are then multiplied with a one-hot action encoding. The action is simply sampled from the output; to simplify notation, we call this the ”policy”.

The value function is implicitly parameterized through the soft Q-function parameters via Equation 12. We use the clipped double-Q trick, like TD3 [TD3] and SAC [SAC], to express the TD target, taking the minimum Q-value between the two approximators, so the loss for the Q-function parameters is:

$J_Q(\theta_i) = \mathbb{E}_{(s_t, a_t, r_t, s_{t+1}) \sim \mathcal{D}}\left[\tfrac{1}{2}\left(Q_{\theta_i}(s_t, a_t) - \left(r_t + \gamma\, \mathbb{E}_{a_{t+1} \sim \pi}\left[\min_{j=1,2} Q_{\bar{\theta}_j}(s_{t+1}, a_{t+1}) - \alpha \log \pi(a_{t+1} \mid s_{t+1})\right]\right)\right)^{2}\right]$   (13)

The update makes use of a target soft Q-function that is obtained as an exponential moving average of the soft Q-function weights, which has been shown to stabilize training. Importantly, we do not use actions from the replay buffer here: these actions are sampled fresh from the current version of the policy.
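A minimal sketch (ours, with assumed names) of the exponential moving average used for the target soft Q-function weights; the smoothing coefficient tau is an illustrative value, not one taken from the paper:

import torch

@torch.no_grad()
def polyak_update(q_net, target_q_net, tau=0.005):
    # target <- tau * online + (1 - tau) * target, applied parameter-wise
    for p, p_targ in zip(q_net.parameters(), target_q_net.parameters()):
        p_targ.mul_(1.0 - tau)
        p_targ.add_(tau * p)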


LEARNING THE POLICY: The policy should, in each state, act to maximize the expected future return plus the expected future entropy. That is, it should maximize $V^{\pi}(s)$, which we expand out (as before) into

$V^{\pi}(s) = \mathbb{E}_{a \sim \pi}\left[Q^{\pi}(s, a)\right] + \alpha\, \mathcal{H}(\pi(\cdot \mid s)) = \mathbb{E}_{a \sim \pi}\left[Q^{\pi}(s, a) - \alpha \log \pi(a \mid s)\right]$   (14)

The target density is the Q-function, which is represented by a neural network and can be differentiated, so it is convenient to apply the reparameterization trick, resulting in a lower-variance estimate, in which a sample from $\pi_{\phi}(\cdot \mid s)$ is drawn by computing a deterministic function of the state, the policy parameters, and independent noise. Following the authors of the SAC paper [SAC1], we use a squashed Gaussian policy, which means that samples are obtained according to

$\tilde{a}_{\phi}(s, \xi) = \tanh\left(\mu_{\phi}(s) + \sigma_{\phi}(s) \odot \xi\right), \quad \xi \sim \mathcal{N}(0, I)$   (15)

The reparameterization trick allows us to rewrite the expectation over actions (which contains a pain point: the distribution depends on the policy parameters) into an expectation over noise (which removes the pain point: the distribution now has no dependence on parameters):

$\mathbb{E}_{a \sim \pi_{\phi}}\left[Q^{\pi_{\phi}}(s, a) - \alpha \log \pi_{\phi}(a \mid s)\right] = \mathbb{E}_{\xi \sim \mathcal{N}}\left[Q^{\pi_{\phi}}(s, \tilde{a}_{\phi}(s, \xi)) - \alpha \log \pi_{\phi}(\tilde{a}_{\phi}(s, \xi) \mid s)\right]$   (16)

To get the policy loss, the final step is to substitute $Q^{\pi_{\phi}}$ with one of our function approximators. As in TD3, we use $\min_{j=1,2} Q_{\theta_j}$. The policy is thus optimized according to

$\max_{\phi}\, \mathbb{E}_{s \sim \mathcal{D},\, \xi \sim \mathcal{N}}\left[\min_{j=1,2} Q_{\theta_j}(s, \tilde{a}_{\phi}(s, \xi)) - \alpha \log \pi_{\phi}(\tilde{a}_{\phi}(s, \xi) \mid s)\right]$   (17)
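The following PyTorch-style sketch illustrates Equations 15-17 for the continuous-action SAC case described above; it is our own reconstruction (mu, log_std, q1, q2 are assumed network outputs and critics), not the authors’ code:

import torch
from torch.distributions import Normal

def policy_loss(mu, log_std, state, q1, q2, alpha):
    dist = Normal(mu, log_std.exp())
    u = dist.rsample()                     # reparameterized sample: mu + sigma * xi
    a = torch.tanh(u)                      # squash into bounded actions (Eq. 15)
    # log-probability with the tanh change-of-variables correction
    logp = dist.log_prob(u).sum(-1) - torch.log(1 - a.pow(2) + 1e-6).sum(-1)
    q_min = torch.min(q1(state, a), q2(state, a))   # clipped double-Q
    return (alpha * logp - q_min).mean()   # minimizing this maximizes Eq. 17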
Input: Q-function parameters $\theta_1, \theta_2$; temperature $\alpha$; empty replay buffer $\mathcal{D}$
Set target parameters equal to main parameters: $\bar{\theta}_i \leftarrow \theta_i$
while not converged do
    for each environment step do
        $a_t \sim \pi(\cdot \mid s_t)$   // Sample action from the policy
        $s_{t+1} \sim p(\cdot \mid s_t, a_t)$   // Sample transition from the environment
        $\mathcal{D} \leftarrow \mathcal{D} \cup \{(s_t, a_t, r(s_t, a_t), s_{t+1})\}$   // Store the transition in the replay buffer
    end for
    for each update step do
        Randomly sample a batch of transitions from $\mathcal{D}$
        Compute the TD targets for the Q-functions   // Calculate the TD target
        Update the Q-functions by one step of gradient descent   // Equation 13
        Update the temperature $\alpha$ by one step of gradient descent ($\bar{\mathcal{H}}$ is the target entropy)   // Equation 19; update temperature parameter
        Update the target networks: $\bar{\theta}_i \leftarrow \tau \theta_i + (1 - \tau) \bar{\theta}_i$   // Update target network weights
    end for
end while
Algorithm 1: SQN

LEARNING $\alpha$: As proposed in [SAC1], to improve performance we also learn the temperature parameter $\alpha$ by minimizing the dual objective:

$\alpha^{*} = \arg\min_{\alpha}\, \mathbb{E}_{a_t \sim \pi^{*}}\left[-\alpha \log \pi^{*}(a_t \mid s_t) - \alpha \bar{\mathcal{H}}\right]$   (18)

Prior work gives us tools to achieve this: as shown in [opt_convex], approximate dual gradient descent is one way. Because we use a function approximator, and it is impractical to fully optimize with respect to the primal variables, we compute gradients for $\alpha$ with the following objective:

$J(\alpha) = \mathbb{E}_{a_t \sim \pi_t}\left[-\alpha \log \pi_t(a_t \mid s_t) - \alpha \bar{\mathcal{H}}\right]$   (19)
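A common way to implement Equation 19 is to optimize $\log \alpha$ with a separate optimizer; the sketch below is our own illustration under assumed names (log_alpha, target_entropy, and the learning rate are placeholders), not the authors’ released code:

import torch

log_alpha = torch.zeros(1, requires_grad=True)   # optimize log(alpha) so alpha stays positive
alpha_optimizer = torch.optim.Adam([log_alpha], lr=3e-4)

def update_alpha(log_pi, target_entropy):
    # Gradient step on J(alpha) = E[-alpha * (log pi(a|s) + target_entropy)]
    alpha_loss = -(log_alpha * (log_pi + target_entropy).detach()).mean()
    alpha_optimizer.zero_grad()
    alpha_loss.backward()
    alpha_optimizer.step()
    return log_alpha.exp().item()                # current temperature alpha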

Inspired by SAC, we derive our SQN method. SAC is aimed at continuous action spaces, so the policy network is necessary there; here, however, DQN gives a great example of how to sample an action directly from the Q-function. Combining these two ideas means sampling an action with an entropy bonus, which leads to:

$\pi(\cdot \mid s_t) = \mathrm{softmax}\left(\tfrac{1}{\alpha} Q_{\theta}(s_t, \cdot)\right)$   (20)
$a_t \sim \pi(\cdot \mid s_t)$   (21)

This clearly shows that a policy parameter update step no longer exists at all. The overall architecture of our algorithm is shown in Figure 1. As for the value function update, it is the same as in SAC, as described in Section 3. The final algorithm is listed in Algorithm 1. The method alternates between collecting experience from the environment with the current policy and updating the function approximators using stochastic gradients computed from batches sampled from a replay pool. Using off-policy data from a replay pool is feasible because both the value estimators and the policy can be trained entirely on off-policy data. The algorithm is agnostic to the parameterization of the policy, as long as it can be evaluated for any arbitrary state-action tuple.
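To make the discrete-action case concrete, the following PyTorch-style sketch shows how action sampling (Equations 20-21) and one Q-update step of Algorithm 1 might be implemented; it is our own reconstruction under assumed names (q1, q2, q1_targ, q2_targ map observations to per-action Q-values, and the optimizer covers both Q-networks), not the released code:

import torch
import torch.nn.functional as F
from torch.distributions import Categorical

@torch.no_grad()
def sample_action(q_net, obs, alpha):
    q_values = q_net(obs)                                          # shape: (num_actions,)
    return Categorical(logits=q_values / alpha).sample().item()    # softmax(Q / alpha)

def sqn_q_update(batch, q1, q2, q1_targ, q2_targ, optimizer, alpha, gamma=0.99):
    obs, act, rew, next_obs, done = batch              # tensors sampled from the replay buffer

    with torch.no_grad():
        next_q = torch.min(q1_targ(next_obs), q2_targ(next_obs))   # clipped double-Q
        pi = F.softmax(next_q / alpha, dim=-1)
        log_pi = F.log_softmax(next_q / alpha, dim=-1)
        # soft state value: V(s') = E_a[ Q(s', a) - alpha * log pi(a | s') ]
        next_v = (pi * (next_q - alpha * log_pi)).sum(dim=-1)
        target = rew + gamma * (1.0 - done) * next_v               # soft TD target

    q1_pred = q1(obs).gather(1, act.long().unsqueeze(1)).squeeze(1)  # Q(s, a) of taken actions
    q2_pred = q2(obs).gather(1, act.long().unsqueeze(1)).squeeze(1)
    loss = F.mse_loss(q1_pred, target) + F.mse_loss(q2_pred, target)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()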

4 Experiment

In order to test our agent, we designed our experiments to answer the following questions:

  1. Can SQN be used to solve challenging Atari problems? How does our agent compare with other methods when applied to these problems, concerning the final performance, computation time, and sample complexity?

  2. What is the impact of different reward scales, and how do different hyper-parameters influence the stability of our agent?

  3. We add an entropy bonus to our agent; does this component give us a more powerful agent?

To answer (1), we compare the performance of our agent with other methods in Section 4.1. With regard to (2) and (3), we address an ablation study of our algorithm in Section 4.2, testing how different settings and network designs influence performance.

The results show that, overall, our agent outperforms the baselines by a large margin, both in terms of learning speed and final performance. The quantitative results attained by our agent in our experiments also compare very favorably to results reported by other methods in prior work, indicating that both the sample efficiency and the final performance of our agent on these benchmark tasks exceed the state of the art.

4.1 Atari

The Atari Learning Environment (ALE) has been the testing ground of most recent deep reinforcement learning agents. It poses challenging reinforcement learning problems including exploration, planning, reactive play, and complex visual input. Most games feature very different visuals and game mechanics, which makes this domain particularly challenging. The goal of this experimental evaluation is to understand how the sample complexity and stability of our method compare with prior deep reinforcement learning algorithms. We compare our method to prior techniques on a range of challenging Atari tasks from the OpenAI Gym benchmark suite. Although the easier tasks can be solved by a wide range of different algorithms, the more complex benchmarks are exceptionally difficult to solve with off-policy algorithms. The stability of the algorithm also plays a large role in performance: easier tasks make it more practical to tune hyper-parameters to achieve good results, while the already narrow basins of effective hyper-parameters become prohibitively small for the more sensitive algorithms on the hardest benchmarks, leading to poor performance [QProp].

To allow for a reproducible and fair comparison, we evaluate all algorithms with a similar network structure. For the off-policy algorithms, we use a two-layer feed-forward neural network of 400 and 300 hidden units respectively, with rectified linear units (ReLU) between each layer, for both the actor and the critic, and we use the parameters shown to be superior in prior work [that_matters] for the baselines we compare against. Both networks’ parameters are updated using Adam [adam] with a fixed learning rate, with no modifications to the environment or reward.
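For reference, a sketch of such a Q-network and optimizer in PyTorch might look as follows; the input and output sizes and the learning rate are placeholders of our own, since the exact values are not restated here:

import torch
import torch.nn as nn

def make_q_network(obs_dim, num_actions):
    return nn.Sequential(
        nn.Linear(obs_dim, 400), nn.ReLU(),
        nn.Linear(400, 300), nn.ReLU(),
        nn.Linear(300, num_actions),        # one Q-value per discrete action
    )

q_net = make_q_network(obs_dim=128, num_actions=6)            # placeholder sizes
optimizer = torch.optim.Adam(q_net.parameters(), lr=3e-4)     # assumed learning rate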

Figure 2: (a) Pong, (b) Breakout, (c) Qbert, (d) different update schemes. Panels (a) to (d) show training curves on the Atari benchmarks; the SQN agent performs consistently across all tasks and outperforms both on-policy and off-policy methods on the most challenging ones. Panel (d) shows the training curves for different values of $\alpha$ and different update methods.

Figure 2 compares three individual runs of each variant, initialized with different random seeds. SQN performs much better, showing that our agent significantly outperforms the baseline and indicating substantially better efficiency and stability. As is evident from the figure, with a jointly trained $\alpha$ and the internal reward, we achieve stable training. This becomes especially important on harder tasks, where tuning hyper-parameters is challenging.

This shows that our agent outperforms the other baseline methods by a large margin, indicating that both the efficiency and the stability of the method are superior.

4.2 Ablation Study

We have three different kinds of update scheme. Figure 2(d) shows how learning performance changes when the update scheme is changed. For a large $\alpha$, the policy becomes nearly random and consequently fails to exploit the reward signal, resulting in substantial degradation of performance. For a small temperature coefficient, the value function is boosted by exploration, so the model learns quickly at first, but the policy then becomes nearly deterministic, leading to poor local minima due to the lack of adequate exploration and a worse state representation. As for a fixed scale that seems “just right”: because the reward grows larger towards the end of training, the entropy bonus eventually contributes almost nothing to the algorithm, so it still cannot achieve strong performance. With a learned $\alpha$, the model balances exploration and exploitation, our agent can take advantage of the joint optimization, and training remains stable.

5 Conclusion

Empirical results show that SQN outperforms DQN by a large margin. Note that this is the ”plain” version of SQN, so it could become an important cornerstone for many modern algorithms such as IMPALA [espeholt2018impala], APEX [APEX], R2D2 [R2D2], and so on. We illustrate that SQN has the potential to be combined with all kinds of improvements made to DQN, improving performance and efficiency from this starting point. More experiments need to be done to test whether SQN could replace DQN and become a standard algorithm. But as we proposed, the promising results already illustrate that SQN is simple yet powerful enough to solve many interesting problems DQN cannot. In the future, we will focus on more challenging tasks (such as the new Google Research Football environment), examining the limits of our algorithm.

References

Footnotes

  1. Thanks to Microsoft for providing cloud services.