# Reannealing of Decaying Exploration Based On Heuristic Measure in Deep Q-Network

## Abstract

Existing exploration strategies in reinforcement learning (RL) often either ignore the history or feedback of the search, or are complicated to implement. There is also very limited literature showing their effectiveness over diverse domains. We propose an algorithm based on the idea of reannealing that aims at encouraging exploration only when it is needed, for example, when the algorithm detects that the agent is stuck in a local optimum. The approach is simple to implement. We perform an illustrative case study showing that it has the potential both to accelerate training and to obtain a better policy.

## 1 Introduction

The goal of a reinforcement learning agent is to make the best decisions based on the information it gathers along the way. Unlike in supervised learning tasks, however, the agent can access the environment only through its own actions. It needs to explicitly explore its environment and gather information for decision making. The simultaneous tasks of exploitation (making the best decisions) and exploration (gathering information) create a dilemma, and balancing the two is one of the core challenges in reinforcement learning. An early survey [1] of exploration strategies distinguished two categories: undirected and directed. The key idea behind the former is to add randomness, in the hope that a random action might lead towards better actions than the suboptimal policy that is viewed as the best given current information. Directed strategies take a “global” view: they measure statistics of past experiences and use these measures to guide efficient exploration, mainly by adding an exploration bonus to the reward function, so that the less visited state-action pairs (in terms of pseudo-count [2] using a fitted density model, or hash-count [3] with locality-sensitive hashing), or those with larger information gain ([4], VIME [5]) or prediction error [6], are favored. Such strategies often allow for theoretical analysis, usually based on multi-armed bandit (MAB) theory [7]. In spite of their appealing mathematical formalism and theoretical guarantees in the finite case, directed exploration strategies have not shown effectiveness across diverse domains and thus have not played an important role in the recent success of reinforcement learning [3].

In many real-life learning applications, RL agents are trained to achieve optimal performance in certain specific tasks. Hence, it is often simple to distinguish when the learned behavior of the agent is acceptable. It is well documented in the literature that if insufficient focus is placed on exploration, the agent can learn to stay in a “comfort zone” of a local optimum. In this case, one needs to force the agent to leave the “comfort zone” and try new actions that take it to states that have not yet been well learned, so that it can gather more information about the environment, in the hope of finding a better policy. In this paper, we propose a heuristic to overcome this problem and, more generally, to speed up the learning procedure.

Our main contribution is an easy yet efficient and scalable method that encourages exploration in complex reinforcement learning domains when it is needed. To be more specific, we emphasize the model dynamics and the agent’s behavior rather than uncertainty estimates while measuring the need for exploration. This can be accomplished by training a supervised model to make predictions based on existing experiences, which would require extensive computation as well as the effective representation of the supervised model. In this paper, we focus on using a simple heuristic measure and an annealing-based method to redirect the agent. Our approach could be extended to serve as a general framework for interactively training in complex RL domains, to aid the agent in finding better policies.

Another contribution is that we view the learning procedure as general optimization/search and apply a generic method that attempts to improve search algorithms on hard problems, specifically a modified version of simulated annealing; our method is thus referred to as exploration reannealing. A number of other metaheuristic approaches can be applied in a similar fashion to better balance the exploration-exploitation tradeoff in reinforcement learning.

It is worth noting that exploration reannealing itself is not a complete learning algorithm, and in fact needs to be combined with other reinforcement learning tools. In this paper, we emphasize and evaluate its application to deep Q-learning, but the combination of exploration reannealing with other methods can be considered.

The paper is organized as follows. Section 2 provides some background on reinforcement learning in the context of Markov Decision Process (MDP) models, in particular the Q-learning algorithm, deep Q-networks (DQN) and some recent techniques we exploit to train DQN. In Section 3, we present the motivation for our reannealing method and discuss the appropriate ways to use it, as well as the specific algorithm. In Section 4, we perform empirical studies and showcase the improvement obtained with our reannealing strategy on a large-scale, challenging domain, the Lunar Lander model. Finally, we outline conclusions in Section 5.

## 2 Background and Preliminaries

### 2.1 Q-learning

A natural abstraction for many sequential decision-making problems is to model the system as a Markov Decision Process (MDP) [8], in which the agent interacts with the environment over a sequence of discrete time steps. It is often represented as a 5-tuple $\langle S, A, P, R, \gamma \rangle$, where $S$ is a set of states; $A$ is a set of actions that can be taken; $P$ is the transition function such that $P(s' \mid s, a) = \Pr(s_{t+1} = s' \mid s_t = s, a_t = a)$, which denotes the (stationary) probability distribution over $S$ of reaching a new state $s'$ after taking action $a$ in state $s$; $R$ is the reward function, which can take the form of either $R(s)$, $R(s, a)$, or $R(s, a, s')$; and $\gamma \in [0, 1)$ is the discount factor.

A policy $\pi$ defines the conditional probability distribution $\pi(a \mid s)$ of choosing each action $a$ while in state $s$. For an MDP, once a stationary policy is fixed, the distribution of the reward sequence is determined. Thus, to evaluate a policy $\pi$, it is natural to define the action value function under $\pi$ as the expected cumulative discounted reward obtained by taking action $a$ in state $s$ and following $\pi$ thereafter:

$$Q^{\pi}(s, a) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^{t} r_{t} \,\middle|\, s_0 = s,\ a_0 = a\right] \tag{1}$$

The goal of solving an MDP is to find an optimal policy $\pi^{*}$ that maximizes the expected cumulative discounted reward in all states. The corresponding optimal action values satisfy $Q^{*}(s, a) = \max_{\pi} Q^{\pi}(s, a)$, and Banach’s fixed-point theorem ensures the existence and uniqueness of the fixed-point solution of the Bellman optimality equations [8]:

$$Q^{*}(s, a) = \mathbb{E}_{s'}\left[r + \gamma \max_{a'} Q^{*}(s', a') \,\middle|\, s, a\right] \tag{2}$$

from which we can derive a deterministic optimal policy by being greedy with respect to $Q^{*}$, i.e., $\pi^{*}(s) = \arg\max_{a} Q^{*}(s, a)$.

In reinforcement learning problems, the agent must interact with the environment to learn about the transition and reward functions while trying to produce an optimal policy. At each time step $t$, the agent senses some representation of the current state $s_t$, selects an action $a_t$, then receives an immediate reward $r_t$ from the environment and finds itself in a new state $s_{t+1}$. The experience tuple $(s_t, a_t, r_t, s_{t+1})$ summarizes the observed transition for a single step. Based on these experiences, the agent can either first learn the MDP model by approximating the transition probabilities and reward function, and then plan in the MDP to obtain an optimal policy (the model-based approach in reinforcement learning); or, without learning the model, directly learn the optimal value function, from which the optimal policy is derived (the model-free approach).

We use Q-learning with function approximation in this paper. As a model-free approach, Q-learning [9] updates one-step bootstrapped estimates of the Q-values from experience samples over time steps. The update rule upon observing $(s_t, a_t, r_t, s_{t+1})$ is

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[r_t + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t)\right] \tag{3}$$

in which $\alpha$ is the learning rate, $r_t + \gamma \max_{a'} Q(s_{t+1}, a')$ serves as the update target of the Q-value, which can be seen as a sample of the expected one-step look-ahead estimate for the state-action pair $(s_t, a_t)$ based on the maximum estimated value over the next state $s_{t+1}$, and the last term is simply the current estimate. The difference between the target and the current estimate is referred to as the temporal difference (TD) error, or Bellman error. Note that one can bootstrap more than one step when estimating the target, often by using eligibility traces as in [10]. Q-learning is guaranteed to converge to the optimal values in probability as long as each action is executed in each state infinitely often, $s'$ is sampled following the distribution $P(\cdot \mid s, a)$, $r$ is sampled with mean $R(s, a)$ and bounded variance, and the learning rate $\alpha$ decays appropriately.
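As a minimal sketch of the tabular Q-learning update described above, the following Python function applies one step to a dictionary-based table. The function name, the `done` flag for terminal transitions, and the default hyperparameter values are illustrative assumptions, not details taken from the paper.

```python
from collections import defaultdict


def q_learning_update(q, state, action, reward, next_state, actions,
                      alpha=0.1, gamma=0.99, done=False):
    """Apply one tabular Q-learning step in place and return the TD error."""
    # Bootstrapped target: r + gamma * max_a' Q(s', a').
    # Assumption: terminal transitions do not bootstrap from the next state.
    best_next = 0.0 if done else max(q[(next_state, a)] for a in actions)
    target = reward + gamma * best_next
    td_error = target - q[(state, action)]
    # Move the current estimate a fraction alpha towards the target.
    q[(state, action)] += alpha * td_error
    return td_error


# Hypothetical two-action example; unseen state-action pairs start at 0.
q_table = defaultdict(float)
td = q_learning_update(q_table, "s0", 0, 1.0, "s1",
                       actions=[0, 1], alpha=0.5, gamma=0.9)
```

The returned TD error is the quantity that shrinks as the estimates approach a fixed point of the update.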

For environments with large state spaces, the Q-values are often represented by a function of state-action pairs rather than in tabular form, i.e., $Q(s, a) \approx Q(s, a; \theta)$, where $\theta$ is a parameter vector. To update $\theta$, first-order gradient methods are usually applied to minimize the mean squared error (MSE) loss:

$$L(\theta) = \mathbb{E}\left[\left(r + \gamma \max_{a'} Q(s', a'; \theta) - Q(s, a; \theta)\right)^{2}\right].$$

However, with function approximation, the convergence guarantee can no longer be established in general. Neural networks, while attractive as powerful function approximators, were well known to be unstable and even to diverge when applied to reinforcement learning until the deep Q-network (DQN) [11] was introduced with great success, in which several important modifications were made. Experience replay [12] was used to address the non-stationary data problem, by storing and mixing the samples (i.e., experiences) in a replay memory for the updates. During training, a batch of experiences is randomly sampled each time and gradient descent is performed on the sampled batch. In this way the temporal correlations are alleviated. In addition, a separate target network, which is a copy of the learned network parameters ($\theta$), is employed. This copy is frozen for a period of time and only updated periodically (denoted as $\theta^{-}$), and is used to calculate the TD error, with the aim of improving stability.

A variety of extensions and generalizations have been proposed and shown to be successful in the literature. Overestimation due to the max operator in Q-learning may significantly hurt performance. To reduce the overestimation error, double DQN (DDQN) [13] decouples the action selection from the estimation of the target, that is, it chooses the maximizing action according to the original network ($\theta$) and evaluates its value using the other one ($\theta^{-}$ from the target network), i.e.,

$$y = r + \gamma\, Q\left(s', \arg\max_{a'} Q(s', a'; \theta);\ \theta^{-}\right).$$

Interested readers are referred to [14] for further reading on more DQN architectures. In this paper, unless stated explicitly, we use the DDQN update for all DQN architectures.
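The decoupling of action selection and action evaluation in double DQN can be sketched as follows. This is an illustrative pure-Python version operating on precomputed next-state Q-values; the function name, the list-based batch layout, and the handling of terminal transitions are assumptions rather than details from the paper.

```python
def ddqn_targets(rewards, dones, q_next_online, q_next_target, gamma=0.99):
    """Compute double-DQN targets for a batch of transitions.

    q_next_online[i] and q_next_target[i] hold the Q-values of every action
    in the next state of transition i, as estimated by the online network
    and by the frozen target network, respectively.
    """
    targets = []
    for r, done, q_on, q_tg in zip(rewards, dones, q_next_online, q_next_target):
        if done:
            # Assumption: no bootstrapping from terminal states.
            targets.append(r)
            continue
        # Select the maximizing action with the online network ...
        best_action = max(range(len(q_on)), key=q_on.__getitem__)
        # ... but evaluate that action with the target network.
        targets.append(r + gamma * q_tg[best_action])
    return targets


# Hypothetical batch of two transitions with two actions each.
example = ddqn_targets(
    rewards=[1.0, 2.0],
    dones=[False, True],
    q_next_online=[[0.1, 0.9], [0.5, 0.2]],
    q_next_target=[[2.0, 1.0], [3.0, 4.0]],
    gamma=0.5,
)
```

In the first transition the online network prefers action 1, so the target uses the target network's value for action 1 (1.0) rather than its own maximum (2.0), which is how the overestimation is reduced.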

### 2.2 Exploration Strategies

#### $\epsilon$-Greedy Exploration.

The most commonly used strategy for exploration is the $\epsilon$-greedy method, in which the agent selects the action it believes to be the best according to the current $Q$ values most of the time, and occasionally acts randomly. That is, it takes the greedy action with probability $1 - \epsilon$, and selects uniformly at random among all actions with probability $\epsilon$. Then, after infinitely many steps, every state-action pair will be visited infinitely often, and thus all $Q$ values converge to the true action values almost surely [15]. However, the deficiencies of $\epsilon$-greedy are also widely discussed and have motivated new RL algorithms. For example, the time complexity of $\epsilon$-greedy learning is exponential with respect to the size of the state space, which leads to PAC-learning ideas for RL [16]. Moreover, $\epsilon$-greedy selects exploratory actions with equal probability. Intuitively, we would expect the agent to pay more attention to more “promising” actions, i.e., those with perhaps slightly lower $Q$-values than the current greedy action, rather than those with very low $Q$-values, which have less potential to be optimal. It might also be a waste to explore actions with low $Q$-values that have already been selected many times, since we may be confident that these are “bad” actions. Despite its deficiencies, due to its simplicity, practical effectiveness, and the ease with which it can be embedded into Q-learning, the $\epsilon$-greedy strategy has been prevalent in most value-based algorithms in reinforcement learning, including DQN and its variants.
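The action-selection rule described above can be sketched in a few lines of Python. The function name and the `rng` parameter (any object exposing `random()` and `randrange()`, such as the standard `random` module) are illustrative choices.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Return an action index: random with probability epsilon, else greedy."""
    if rng.random() < epsilon:
        # Explore: every action is equally likely, regardless of its Q-value.
        return rng.randrange(len(q_values))
    # Exploit: pick the action with the largest estimated Q-value.
    return max(range(len(q_values)), key=q_values.__getitem__)


greedy_action = epsilon_greedy([0.1, 0.7, 0.3], epsilon=0.0)
random_action = epsilon_greedy([0.1, 0.7, 0.3], epsilon=1.0)
```

Note that the exploratory branch ignores the Q-values entirely, which is the equal-probability deficiency discussed in the text.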

#### Softmax (or Boltzmann) Exploration.

The (variational) free energy for an RL agent can be defined as

$$F(\pi) = -\sum_{a} \pi(a \mid s)\, Q(s, a) + \tau \sum_{a} \pi(a \mid s) \log \pi(a \mid s) \tag{4}$$

in which the first term represents the energy of the agent, and the second term is the standard form of negative entropy. The coefficient $\tau$ of the negative entropy is referred to as the temperature. The free energy principle claims that a self-organizing agent acts on the environment by minimizing its free energy, by which it reaches an equilibrium with the environment (or more accurately, a sampling of sensory data) [17]. Minimization of the free energy gives us

$$\pi(a \mid s) = \frac{\exp\left(Q(s, a)/\tau\right)}{\sum_{a'} \exp\left(Q(s, a')/\tau\right)} \tag{5}$$

which is called the softmax policy or Boltzmann policy. Note that the temperature hyperparameter $\tau$ controls the exploration [18]. If the temperature is high, the action selection according to $\pi$ approaches the uniform distribution, which yields more randomness and thus encourages exploration. On the other hand, a low temperature reduces the randomness and enhances exploitation. As an extreme case, if the temperature is zero, the negative entropy term in Eq. (4) vanishes, and the corresponding policy becomes deterministic, taking the greedy action given the current estimate of $Q$. With the softmax probability, the probability of each action being selected is weighted according to its estimated $Q$-value, instead of being equal for all actions as in the $\epsilon$-greedy approach.
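To illustrate how the temperature shapes the Boltzmann policy, here is a small sketch that computes the action probabilities from a list of Q-values. The function name is hypothetical, and subtracting the maximum Q-value before exponentiating is a standard numerical-stability step that does not change the resulting distribution.

```python
import math


def boltzmann_probs(q_values, temperature):
    """Softmax action probabilities proportional to exp(Q / temperature)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; use argmax for the greedy limit")
    # Shift by the maximum so the largest exponent is 0 (avoids overflow).
    q_max = max(q_values)
    weights = [math.exp((q - q_max) / temperature) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]


q = [1.0, 2.0, 3.0]
hot = boltzmann_probs(q, temperature=1e6)    # nearly uniform
cold = boltzmann_probs(q, temperature=0.01)  # nearly greedy
```

A high temperature flattens the distribution towards uniform, while a low temperature concentrates almost all probability on the action with the largest Q-value.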

### 2.3 Exploration Decay

Theoretical analysis of exploration strategies is usually performed through the multi-armed bandit (MAB) model [7]. The $\epsilon$-greedy strategy has been well studied through regret analysis in MAB, in which the regret is often defined as the difference in value between the optimal action $a^{*}$ and the action $a_t$ taken at time $t$, i.e., $l_t = \mathbb{E}\left[Q(a^{*}) - Q(a_t)\right]$, that is, the opportunity loss of taking $a_t$ for one step. The total regret is then the overall opportunity loss up to time $T$, i.e., $L_T = \sum_{t=1}^{T} l_t$. As shown in [7], if we set $\epsilon = 0$, that is, always choose the action with the largest $Q$-value greedily without any exploration, then the greedy agent could lock onto a suboptimal action forever, in which case a linear bound ($O(T)$) on total regret is obtained. On the other hand, if we take $\epsilon$-greedy actions with constant $\epsilon$, the agent keeps exploring with probability $\epsilon$ even after the optimal policy is found, which also results in linear total regret.
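The linear growth of regret for a policy that locks onto a suboptimal arm can be checked with a short sketch. It assumes a bandit whose true action values are known to the evaluator; the function name and the example values are made up for illustration.

```python
def total_regret(true_values, chosen_actions):
    """Sum of per-step opportunity losses relative to the best action."""
    best = max(true_values)
    return sum(best - true_values[a] for a in chosen_actions)


# Hypothetical two-armed bandit: arm 0 is optimal, arm 1 loses 0.5 per pull.
values = [1.0, 0.5]
stuck_10 = total_regret(values, [1] * 10)   # always pulls the worse arm
stuck_20 = total_regret(values, [1] * 20)
optimal = total_regret(values, [0] * 20)
```

Doubling the horizon doubles the regret of the stuck policy, while the optimal policy accumulates none.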

A natural approach, then, is to encourage exploration early and exploitation later, which is achieved by decaying $\epsilon$ over time. Decaying $\epsilon$-greedy can achieve an asymptotically logarithmic bound on total regret by defining $\epsilon_t = \min\left\{1, \frac{c|A|}{d^{2} t}\right\}$, where $c$ is a constant and $d$ is the gap between the best and second-best action values; both are unknown, however. Thus it is often hard to derive an efficient decay schedule. Nevertheless, it is important to emphasize the decay strategy for exploration. We note that the stochasticity of the softmax strategy can also be annealed during training by changing the temperature over time.
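Two decay schedules can be sketched side by side: the bandit-theoretic schedule that depends on the unknown gap between the best and second-best action values, and a simple multiplicative schedule of the kind commonly used in practice. The multiplicative form, the per-episode granularity, and the floor value are assumptions made for illustration; the function names are hypothetical.

```python
def epsilon_theoretical(t, num_actions, c, d):
    """Decay schedule min{1, c * |A| / (d^2 * t)} for time step t >= 1."""
    return min(1.0, c * num_actions / (d * d * t))


def epsilon_multiplicative(episode, rate, eps_start=1.0, eps_min=0.01):
    """Practical schedule: multiply by `rate` once per episode, with a floor."""
    return max(eps_min, eps_start * rate ** episode)


early = epsilon_theoretical(t=1, num_actions=4, c=1.0, d=0.5)
late = epsilon_theoretical(t=1600, num_actions=4, c=1.0, d=0.5)
start = epsilon_multiplicative(episode=0, rate=0.985)
floor = epsilon_multiplicative(episode=1000, rate=0.985)
```

The first schedule requires the gap `d`, which is rarely available, whereas the second needs only a rate and a floor, which is why such heuristics are convenient despite lacking the same guarantee.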

## 3 Exploration Reannealing

### 3.1 Local Optima in DQN

In theory, $Q$-learning converges to the optimal policy if all state-action pairs are visited infinitely often. However, this condition cannot be met in practice if the state space is too large or continuous. A neural network in DQN approximates a large or continuous state space and thus suffers from this problem. A deep neural network is in general of very high dimensionality, and popular practical optimization techniques, such as stochastic gradient descent, only consider first-order gradient information of the loss function. Such optimization algorithms may get stuck at local optima or saddle points. In practice, for regular neural networks, saddle points of the loss function can be escaped by applying special optimization techniques [19], and it is often the case that a local optimum is good enough for many supervised learning problems. However, this might not be the case in DQN. A local optimum in DQN arises from both the complicated structure of the loss function itself and the limited representation and inference ability of a neural network over the search space. The latter is especially pervasive for segments of the state space that are not well explored. As a result, the learning agent cannot make progress for a long time and might waste learning resources by updating information for irrelevant parts of the state space. Therefore, it is more common in practice for a DQN learning agent to get stuck in poor local optima, due to the difficulty of handling the exploration-exploitation trade-off well.

### 3.2 Exploration Reannealing

Simulated annealing (SA) is a classic heuristic optimization approach used to escape local optima. At its heart is an analogy with thermodynamics. The Boltzmann probability distribution is again used to represent, by analogy, the (variational) free energy, with a control hyperparameter referred to as the temperature $\tau$. The free energy determines the stochasticity of the search direction, which helps the local search escape from local optima. These concepts are based on the same principle as the exploration strategies mentioned above: decaying the exploration is fundamentally the same as lowering the temperature in SA. In some variations of simulated annealing search, reannealing [20] is a quite common idea for the annealing schedule, that is, the temperature is periodically set to a high value in order to encourage exploration.

A similar idea can naturally be employed in our problem to enable exploration in RL. Note that the act of reannealing itself can be implemented in a straightforward way for both the $\epsilon$-greedy and softmax strategies introduced above. In $\epsilon$-greedy, we can simply reset the exploration rate $\epsilon$ to a large value (note that $\epsilon \in [0, 1]$ and decays over time) if a poor local optimum is encountered. Similarly, for softmax action selection, we can directly reset the temperature to a high value (e.g., close to the initial temperature) whenever necessary. We note here that with finitely many reannealing events, the theoretical guarantee of an asymptotically logarithmic bound on total regret still holds, as for decaying $\epsilon$-greedy. A more significant challenge relates to the timing of reannealing events. We discuss this question below, but first we discuss additional reasons why we believe reannealing can benefit learning in DQN.

The key advantage of reannealing exploration is that it could substantially improve sample efficiency. Collecting data by interacting with the environment is usually expensive for RL systems. While the agent is stuck in a local optimum, the TD errors being backpropagated are usually small, so the agent learns little and makes little progress. With reannealing, the agent tends to take random actions in this case and is more likely to experience unfamiliar states thereafter. The corresponding state-action pairs are usually visited much less often than those obtained by following the greedy policy, and thus generally have larger TD errors, from which the agent can learn more.

Another advantage is that reannealing exploration could substantially alleviate the data imbalance problem. Without reannealing, large numbers of samples are collected around local optima, resulting in a data distribution biased in favor of samples that may not be relevant. As a result, a notable portion of the model parameters is dedicated to describing states around (poor) local optima, and much of the training work is hence in vain. By reannealing the exploration instead of exploiting around the local optima, random actions are taken with much higher probability, and the agent is more likely to jump out of the local optima, experience unfamiliar states, and gather significantly more useful information about the entire environment, which benefits the training overall.

Finally, a training episode is often designed to have a finite horizon for computational simulation purposes. Each episode finishes when either certain criteria are met (in which case a success or a failure on the task is defined and the final reward is given), or the time step exceeds a fixed limit. When a local optimum is encountered, the agent tends to wander around until it exceeds the time limit of the episode. It is important to note that using a time limit makes the environment non-stationary, since in this case the final reward is never actually assigned, and hence the agent may not be able to recognize a suboptimal policy. Exploration reannealing can enable the agent to actually reach either success or failure, ensuring that the appropriate reward is assigned. Consequently, more episodes finish with more concrete information gain.

### 3.3 Defining Poor Local Optima

Given the intuitive advantages of applying reannealing to DQN, we next describe our proposed algorithm. As described above, the mechanism of reannealing is straightforward for both the $\epsilon$-greedy and softmax strategies. On the other hand, determining the appropriate times for reannealing is more difficult. Clearly, we must reanneal when the agent is stuck in a poor local optimum. Unfortunately, in high-dimensional spaces, formally identifying a local optimum is challenging. In practice, a commonly used empirical approach is to track the variation of the loss function across iterations. When the loss stops improving, it is often the case that the search has reached a local optimum. However, simply looking at the change in loss does not tell us whether the local optimum is acceptable. It might be the case that a near-global optimum has already been achieved, and hence there is no need to escape from it.

On the other hand, a poor local optimum can sometimes be easily observed and distinguished by a human from an outside perspective, in which case the observer uses a priori knowledge that has not been integrated into the reward function. In RL, the agent’s learned policy and its behavior are determined by optimizing the discounted cumulative rewards; thus an ideal reward function should capture the goal and measure the performance exactly, which requires perfect knowledge of all states and transitions in the environment. Except for some human-designed games in which the rules are entirely understood, it often takes considerable effort to tweak the rewards until the desired behavior is learned. This means that in many applications the reward function is already overloaded so as to produce favorable agent behavior, and hence attempting to also use the same reward function to distinguish the quality of a local optimum may be either impossible or very complicated, with unexpected side effects (see also inverse reinforcement learning [21]).

Alternatively, we propose to consider a separate criterion for initiating reannealing. This criterion can be viewed as a supervised model that makes predictions based on existing experiences. The training labels could be as simple as a categorical signal denoting the need to explore, or as complicated as a representation of the next state, which would require extensive computation as well as an effective representation for the supervised model. In this paper, we use a simplified version of this supervision idea. We explicitly measure an easily distinguished feature as an a priori defined heuristic, reflecting the fact that the agent’s bottleneck behavior due to a suboptimal policy can often be described by some undesirable characteristics from an outside observer’s perspective. We can then explicitly extract such a feature as a useful heuristic independent of the reward function. Once defined, the heuristic can be tracked along the learning process and used to control the learning behavior. See Section 4 for an example based on the Lunar Lander problem.

### 3.4 Algorithm

The objective of reannealing exploration is to use a heuristic measure to explicitly inform the learning agent that it should be exploring rather than exploiting. We set up a heuristic variable called stuck to represent whether the agent has been stuck in a poor local optimum. The variable stuck should be a global statistic for some aspect of the agent’s performance. If some threshold of “stuck” has been reached, we reanneal the exploration. In $\epsilon$-greedy learning, we reset $\epsilon$ to 1 and force the agent to do pure exploration. The exploration rate is then decayed over time. The pseudo-code of our proposed procedure for DQN is shown in Algorithm 1. A similar reannealing strategy applies to softmax, in which we reset the “temperature” to its initial value and anneal it again to a smaller value over time.
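The reset-then-decay mechanism for $\epsilon$-greedy can be sketched as a small scheduler. This is only an illustration under stated assumptions, not a reproduction of the paper's Algorithm 1: the class and method names are hypothetical, the decay is assumed to be multiplicative and applied once per episode, and the stuck measure and threshold are supplied by the caller.

```python
class ReannealingEpsilon:
    """Decaying epsilon that is reset to its maximum when the agent is stuck."""

    def __init__(self, decay_rate=0.985, eps_max=1.0, eps_min=0.01):
        self.decay_rate = decay_rate
        self.eps_max = eps_max
        self.eps_min = eps_min
        self.epsilon = eps_max

    def end_episode(self, stuck, threshold):
        """Update epsilon after an episode; return True if a reanneal occurred."""
        if stuck >= threshold:
            # Reanneal: force pure exploration again.
            self.epsilon = self.eps_max
            return True
        # Otherwise keep decaying towards the floor (assumed multiplicative).
        self.epsilon = max(self.eps_min, self.epsilon * self.decay_rate)
        return False


schedule = ReannealingEpsilon(decay_rate=0.5)
schedule.end_episode(stuck=0, threshold=10)   # epsilon decays to 0.5
schedule.end_episode(stuck=3, threshold=10)   # epsilon decays to 0.25
reannealed = schedule.end_episode(stuck=10, threshold=10)  # epsilon back to 1.0
```

The same structure would carry over to softmax exploration by replacing `epsilon` with a temperature and resetting it to its initial value.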

As argued above, candidates for the stuck variable should be performance measures that have not been integrated (or not integrated well) into the reward function. The feature chosen as the explicit heuristic should represent a bottleneck for learning. We expect that, by applying the reannealing strategy, the RL agent can jump out of the local optimum and learn a better policy than the one it had obtained when it got stuck. As a result, acceptable behavior and a good policy could be learned faster. We also expect that with reannealing exploration, we could worry less about poor local optima and spend less time tuning hyperparameters (such as the annealing schedule, learning rate, etc.) during training.

## 4 Experimental Results

### 4.1 Testbed Setup

We conducted an experiment by implementing a reinforcement learning agent to solve the Lunar Lander task in Box2D [22], interfaced through the OpenAI Gym environment [23]. In each step, the agent is provided with the current state of the lander in $\mathbb{R}^{8}$, in which 6 of the dimensions are continuous whereas the other 2 are dummy variables in discrete space, and the agent is allowed to take one of 4 possible actions (i.e., the action space is discrete). At the end of each step, the agent receives a reward $r$ and moves to a new state $s'$. An episode finishes if the lander rests on the ground at zero speed (receiving an additional reward of $+100$), hits the ground and crashes (receiving an additional reward of $-100$), flies outside the screen, or reaches the maximum of 1000 time steps per episode. The agent aims for a successful landing, which is defined as reaching the landing pad (between two flags) centered on the ground at zero speed, and receives an additional reward in the range $[100, 140]$, while landing outside the pad incurs a penalty. Figure 1 provides a snapshot of the task environment.

We use a neural network with two fully connected hidden layers (consisting of 200 and 60 neurons, respectively) as our function approximator. The ReLU nonlinearity is used as the activation function for each hidden neuron. The network takes the 8-dimensional vector describing the state as input, and outputs the approximated $Q$-values for the 4 possible actions. We train the neural network with a FIFO memory of size for experience replay. A target neural network for double learning is updated every 20 episodes, so that the original network has enough time to converge. The adaptive moment estimation (Adam) optimizer with initial learning rate 0.01 is used to train the network, since it is in general less sensitive to the choice of learning rate than other stochastic gradient descent algorithms [24]. We apply the pseudo-Huber loss instead of MSE as the loss function, as it is less sensitive to outliers and is more commonly used in DQN [11]. The discount factor $\gamma$ is set to 0.99, and the $\epsilon$-greedy policy is used for choosing actions throughout the interaction with the environment. For comparison purposes, we used two different exploration decay rates, and 0.985. These hyperparameters were tuned empirically with the aim of achieving better performance.
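For reference, the pseudo-Huber loss mentioned above can be sketched as a function of the TD error. The standard form is $\delta^{2}\left(\sqrt{1 + (x/\delta)^{2}} - 1\right)$; the value of $\delta$ used in the experiments is not stated in the text, so the default of 1.0 below is an assumption, as is the function name.

```python
import math


def pseudo_huber(td_error, delta=1.0):
    """Pseudo-Huber loss: quadratic near zero, roughly linear for large errors."""
    return delta ** 2 * (math.sqrt(1.0 + (td_error / delta) ** 2) - 1.0)


small = pseudo_huber(0.01)   # close to 0.5 * 0.01**2
large = pseudo_huber(100.0)  # close to 100, far below 0.5 * 100**2
```

Because the loss grows only linearly for large TD errors, a few outlier transitions contribute bounded gradients instead of dominating the update as they would under MSE.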

### 4.2 Implementation of Exploration Reannealing

As in Q-learning, a simple $\epsilon$-greedy policy is applied when choosing actions to interact with the environment during training. With a large exploration rate $\epsilon$, the agent fails to exploit and refine its policy, while with a small $\epsilon$, the agent has trouble exploring. For example, if we simply fix $\epsilon$ at a small constant value, the agent soon learns to hover above the ground forever but hesitates to land. We also tried an annealing strategy for the exploration rate, in which $\epsilon$ gradually decreases from 1 to 0.01 during, say, half of the training episodes, and stays at 0.01 for the rest of the training time. However, this cannot solve the hovering problem. This annealing strategy adds randomness at the early stage of training, but the pretraining step (which fills the memory for experience replay) has already provided the agent with enough exploration stored in the memory at the beginning. Even if the agent learns to land occasionally, it prefers hovering most of the time. This is probably because the neural network is dealing with a continuous state space, so learning from bad decisions in some unknown states also affects the values of well-learned states. As a result, the agent again learns to hover forever.

In order to escape from such hovering local optima, we carefully engineered a reannealing strategy for the exploration rate. The idea is to encourage the agent to explore while it is hovering. We define a variable hover, which starts from 0, to count the hovering number. Whenever an episode finishes by exceeding the time limit (i.e., the maximum of 1000 steps per episode), we increase the hovering number by 1. If the next episode finishes within 1000 steps, we halve the hovering number (using integer division). We reset $\epsilon$ back to 1 (for full exploration) and restart the count if the hovering number reaches 10 (in this case, the agent tends to hover forever). Otherwise, $\epsilon$ anneals to 0.01 as described above.
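The hover-counting rule above can be sketched as a small helper that is called once per finished episode. The class name is hypothetical, and two details are assumptions: the counter is reset to 0 after a reanneal (one reading of "recount"), and the caller passes a boolean indicating whether the episode ended by hitting the time limit.

```python
class HoverCounter:
    """Tracks how often recent episodes timed out (i.e., the agent hovered)."""

    def __init__(self, threshold=10):
        self.threshold = threshold
        self.hover = 0

    def update(self, timed_out):
        """Update the count for one episode; return True if epsilon should reset."""
        if timed_out:
            self.hover += 1
        else:
            # Episodes that end by landing or crashing halve the count.
            self.hover //= 2
        if self.hover >= self.threshold:
            self.hover = 0  # assumption: start counting afresh after reannealing
            return True
        return False


counter = HoverCounter()
# Hypothetical run: ten consecutive time-outs trigger one reanneal at the end.
signals = [counter.update(timed_out=True) for _ in range(10)]
```

Halving rather than zeroing the count on a non-hovering episode means that an occasional landing does not immediately erase the evidence of a hovering habit.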

### 4.3 Results

The network was trained over 10,000 episodes. Figures 1(a) and 1(b) illustrate the results for the ordinary DQN without the exploration reannealing strategy, using the two different exploration decay rates. We plot the cumulative reward of each episode in grey, and the moving average over the last 100 episodes in blue. Notice that a higher rate means slower decay, which results in more exploration at the beginning. When the reannealing strategy is not applied, the agent with the lower exploration decay rate explores less at the beginning than the one with the higher rate, and performs worse: its average episodic total rewards are significantly lower, and its learning process is slower. For instance, with the lower rate, the agent barely learns to avoid crashing (i.e., with episodic rewards above zero) within 3000 episodes, while with the higher rate, the agent reaches the same level in about 2000 episodes. Also, to achieve average episodic rewards above 100, it takes fewer than 4000 episodes with the higher rate, and more than 7000 episodes with the lower rate. This coincides with our intuition, and emphasizes the importance of sufficient exploration.
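The smoothed curve described above is a trailing moving average of episodic rewards. A minimal sketch follows; the function name is hypothetical, and the choice to average over fewer values during the first `window - 1` episodes is an assumption about how the start of the curve is handled.

```python
from collections import deque


def moving_average(values, window=100):
    """Trailing mean over the last `window` values (fewer at the start)."""
    averages = []
    recent = deque(maxlen=window)
    running_sum = 0.0
    for value in values:
        if len(recent) == window:
            # The oldest value is about to be evicted by append().
            running_sum -= recent[0]
        recent.append(value)
        running_sum += value
        averages.append(running_sum / len(recent))
    return averages


smoothed = moving_average([1, 2, 3, 4], window=2)
```

With a window of 100 episodes, a sudden burst of exploratory episodes shows up as a visible dip in the averaged curve rather than being smoothed away.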

As shown in Figures 1(c) and 1(d), applying the reannealing strategy improves our results significantly. We could achieve an average episodic total reward as high as 200 (a reward of 200 means that the agent lands smoothly at the right position on the ground). Without reannealing, however, the agent never achieves such a level in either case (see Figures 1(a) and 1(b)). An interesting observation is the steep falls of the moving average during training when reannealing is applied; clearly these are the moments when $\epsilon$ is reset to 1. Note that at those times, the fall in episodic total reward does not mean the agent is doing worse in general. Q-learning is an off-policy algorithm, which means the learned target policy is not the same as the behavior policy ($\epsilon$-greedy) it uses while interacting with the environment and accumulating samples. The induced greedy policy has not changed much in the short period since the recent reset, so the agent can still do as well as before the fall if it acts greedily. At the same time, the target policy keeps learning while exploring. We can see from Figures 1(c) and 1(d) that most of the time the agent soon gets back to its previous best performance, and its new peaks are often higher, which indicates that it has jumped out of the previous local optimum.

We deliberately choose a sliding window that is not too large, and we do not average over multiple training runs, so that the curves are not over-smoothed and the occurrences of reannealing remain discernible. Over-smoothed curves would give the illusion that learning is slower with the reannealing strategy, especially in the early stages of training. We claim this is not the case, by the same argument that Q-learning is off-policy. We cannot compare the derived policies early on, since the exploration rates differ a lot; however, once $\epsilon$ reaches its minimum value, we can compare the performance of all “near-greedy” behavior policies. With reannealing, the agent reaches higher values much faster, and we therefore claim that reannealing accelerates training.

We also plot the varying $\epsilon$ values during training with the reannealing strategy in Figures 1(e) and 1(f), from which the moments when reannealing was initiated can be observed directly. There is no need to plot these for the cases without reannealing, since $\epsilon$ decays to 0.01 within a few hundred episodes. Figures 1(e) and 1(f) show frequent reannealing early on, since the agent generally learns to hover very quickly and frequently. Note that reannealing occurs more frequently with the lower decay rate than with the higher one. We can therefore surmise that our reannealing strategy serves as a remedy for poor hyperparameter tuning, specifically of the exploration decay rate, as long as the reannealing criterion (i.e., the heuristic measure) is chosen appropriately. With insufficient exploration at the beginning, learning gets stuck in poor local optima more often, but the reannealing strategy can force the agent to explore later on when necessary, and helps it find a policy as good as the one obtained with a better hyperparameter. Also notice the long flat tail in Figure 1(e) after episode 4500. During this period of training, the agent did not reanneal, and the total reward stays at that level with smaller variance than in the graphs without reannealing in Figures 1(a) and 1(b). In fact, with reannealing the variance is smaller whenever a near-greedy policy is applied, i.e., when $\epsilon$ stays at its minimum for a while. On this basis, we can expect the (greedy) policy learned with the reannealing strategy to be superior in terms of both higher total reward and smaller variance.
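The remark that the exploration rate reaches 0.01 within a few hundred episodes can be checked with a short calculation. The following assumes a multiplicative per-episode decay with rate $d$, starting from 1 and floored at $\epsilon_{\min}$; both this form of the schedule and the value $d = 0.99$ used below are illustrative assumptions, not values taken from the experiments.

$$
\epsilon_n = \max\left(\epsilon_{\min},\ d^{\,n}\right),
\qquad
n^{*} = \left\lceil \frac{\ln \epsilon_{\min}}{\ln d} \right\rceil .
$$

With $\epsilon_{\min} = 0.01$ and the hypothetical $d = 0.99$, this gives $n^{*} = \lceil 4.605 / 0.01005 \rceil = 459$ episodes, after which the schedule stays flat unless reannealing resets it.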

## 5 Conclusions

In this paper we present a method to organize exploration in RL algorithms. In particular, we focus on its application to model-free value-based approaches, such as DQN. Our method is particularly suited to problems that suffer from poor local optima and that have sparse rewards as well as long horizons, which can trigger the termination criterion earlier. Poor local optima can often be easily distinguished from an outside perspective, yet it may be hard to encode this additional information into the reward function or state variables because of the complexity of the underlying system. Instead, we propose to use a separate heuristic measure, independent of the agent's reward and state, aimed at detecting the local optima that need to be avoided. With such a measure, we can then organize the learning process using a reannealing framework, previously used to solve hard optimization problems.

We highlight some intuitive benefits of applying exploration reannealing, and demonstrate its performance on a standard RL task. In our experiments, the reannealing method indeed helps the agent avoid poor local optima and gather more useful information. The sample efficiency of reinforcement learning is improved, and the data imbalance problem is alleviated. As a result, the training procedure is accelerated, and the derived policies have superior performance. In addition, we hypothesize that it can serve as a remedy for imperfect hyperparameter tuning.

It is worth noting that the simple framework presented here can be extended to use more sophisticated supervised learning-based heuristic measures for reannealing initiation. If trained properly, such strategies can result in even better performance, due to improved timing of reannealing.

### References

- Sebastian B Thrun. Efficient exploration in reinforcement learning. 1992.
- Marc Bellemare, Sriram Srinivasan, Georg Ostrovski, Tom Schaul, David Saxton, and Remi Munos. Unifying count-based exploration and intrinsic motivation. In Advances in Neural Information Processing Systems, pages 1471–1479, 2016.
- Haoran Tang, Rein Houthooft, Davis Foote, Adam Stooke, Xi Chen, Yan Duan, John Schulman, Filip De Turck, and Pieter Abbeel. #Exploration: A study of count-based exploration for deep reinforcement learning. In Advances in Neural Information Processing Systems, pages 2753–2762, 2017.
- Richard Y Chen, John Schulman, Pieter Abbeel, and Szymon Sidor. UCB and infogain exploration via Q-ensembles. arXiv preprint arXiv:1706.01502, 2017.
- Rein Houthooft, Xi Chen, Yan Duan, John Schulman, Filip De Turck, and Pieter Abbeel. Vime: Variational information maximizing exploration. In Advances in Neural Information Processing Systems, pages 1109–1117, 2016.
- Bradly C Stadie, Sergey Levine, and Pieter Abbeel. Incentivizing exploration in reinforcement learning with deep predictive models. arXiv preprint arXiv:1507.00814, 2015.
- Giuseppe Burtini, Jason Loeppky, and Ramon Lawrence. A survey of online experiment design with the stochastic multi-armed bandit. arXiv preprint arXiv:1510.00757, 2015.
- ML Puterman. Markov decision processes. John Wiley & Sons, New Jersey, 1994.
- Christopher JCH Watkins and Peter Dayan. Q-learning. Machine learning, 8(3-4):279–292, 1992.
- Richard S Sutton. Learning to predict by the methods of temporal differences. Machine learning, 3(1):9–44, 1988.
- Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
- Long-Ji Lin. Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning, 8(3-4):293–321, 1992.
- Hado Van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double q-learning. In AAAI, pages 2094–2100, 2016.
- Xing Wang. On advances in deep learning with applications in financial market modeling. 2020.
- Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. Number 1. MIT press Cambridge, 1998.
- Alexander L Strehl, Lihong Li, and Michael L Littman. Reinforcement learning in finite mdps: Pac analysis. Journal of Machine Learning Research, 10(Nov):2413–2444, 2009.
- Karl Friston. The free-energy principle: a unified brain theory? Nature Reviews Neuroscience, 11(2):127, 2010.
- Shin Ishii, Wako Yoshida, and Junichiro Yoshimoto. Control of exploitation–exploration meta-parameter in reinforcement learning. Neural networks, 15(4-6):665–687, 2002.
- Animashree Anandkumar and Rong Ge. Efficient approaches for escaping higher order saddle points in non-convex optimization. In Conference on Learning Theory, pages 81–102, 2016.
- Lester Ingber. Very fast simulated re-annealing. Mathematical and computer modelling, 12(8):967–973, 1989.
- Andrew Y Ng, Stuart J Russell, et al. Algorithms for inverse reinforcement learning. In Icml, pages 663–670, 2000.
- Erin Catto. Box2d: A 2d physics engine for games, 2011.
- Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. Openai gym, 2016.
- Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.