# Tsallis Reinforcement Learning: A Unified Framework for Maximum Entropy Reinforcement Learning

Kyungjae Lee, Sungyub Kim, Sungbin Lim, Sungjoon Choi, and Songhwai Oh
Dep. of Electrical and Computer Engineering, Seoul National University
School of Computing, KAIST
Kakao Brain
kyungjae.lee@rllab.snu.ac.kr, sungyub.kim@kaist.ac.kr,
{sungbin.lim, sam.choi}@kakaobrain.com,
songhwai@snu.ac.kr
###### Abstract

In this paper, we present a new class of Markov decision processes (MDPs), called Tsallis MDPs, with Tsallis entropy maximization, which generalizes existing maximum entropy reinforcement learning (RL). A Tsallis MDP provides a unified framework for the original RL problem and RL with various types of entropy, including the well-known standard Shannon-Gibbs (SG) entropy, using an additional real-valued parameter, called an entropic index. By controlling the entropic index, we can generate various types of entropy, including the SG entropy, and a different entropy results in a different class of optimal policies in Tsallis MDPs. We also provide a full mathematical analysis of Tsallis MDPs, including the optimality condition, performance error bounds, and convergence. Our theoretical result enables us to use any positive entropic index in RL. To handle complex and large-scale problems, we propose a model-free actor-critic RL method using Tsallis entropy maximization. We evaluate the regularization effect of the Tsallis entropy with various values of the entropic index and show that the entropic index controls the exploration tendency of the proposed method. For different types of RL problems, we find that different values of the entropic index are desirable. The proposed method is evaluated using the MuJoCo simulator and achieves state-of-the-art performance.

Preprint. Work in progress.

## 1 Introduction

Reinforcement learning (RL) combined with a powerful function approximation technique like a neural network has shown noticeable successes on challenging sequential decision making problems, such as playing a video game volodymyr15humanlevel (), learning complex control duan2016benchmarking (); gu2016modelacc (), and realistic motion generation peng2017deeploco (). A model-free RL algorithm aims to learn a policy that effectively performs a given task by autonomously exploring an unknown environment where the goal of RL is to find an optimal policy which maximizes the expected sum of rewards. Since the prior knowledge about the environment is generally not given, an RL algorithm should consider not only gathering information about the environment to find out which state or action leads to high rewards but also improving its policy based on the collected information. Such trade-offs should be carefully handled to reduce the sample complexity of a learning algorithm.

For the sake of efficient exploration of an environment, many RL algorithms employ maximization of the Shannon-Gibbs (SG) entropy of a policy distribution. It has been empirically shown that maximizing the SG entropy of a policy along with reward maximization encourages exploration since the entropy maximization penalizes a greedy behavior mnih2016asyncrl (). Eysenbach et al. eysenbach2018diversity () also demonstrated that maximizing the SG entropy helps to learn diverse and useful behaviors. This penalty from the SG entropy also helps to capture the multi-modal behavior where the resulting policy is robust against unexpected changes in the environment haarnoja2017energy (). Theoretically, it has been shown that the optimal solution of maximum entropy RL has a softmax distribution of state-action value, not a greedy policy schulman17equivalence (); nachum2017bridging (); haarnoja2017energy (); haarnoja2018sac (). Furthermore, maximum SG entropy in RL provides the connection between policy gradient and value-based learning schulman17equivalence (); odonoghue2017pgq (). In dai2018sbeed (), it has also been shown that maximum entropy induces a smoothed Bellman operator, which helps stable convergence of value function estimation.

While the SG entropy in RL provides better exploration, numerical stability, and the ability to capture multiple optimal actions, it is known that the maximum SG entropy causes a performance loss since it hinders exploiting the best action to maximize the reward lee2018sparse (); chow2018tsallispcl (). Such a drawback is often handled by scheduling a coefficient of the SG entropy to progressively vanish cesa2017boltzmann (). However, designing a proper decaying schedule is still a demanding task in that it often requires an additional validation step in practice. In grau-moya2018soft (), the same issue was handled by automatically manipulating the importance of actions using the mutual information. On the other hand, an alternative way has been proposed to handle the exploitation issue of the SG entropy using a sparse Tsallis (ST) entropy lee2018sparse (); chow2018tsallispcl (), which is a special case of the Tsallis entropy tsallis1988possible (). The ST entropy encourages exploration while penalizing a greedy policy less than the SG entropy does. However, unlike the SG entropy, the ST entropy may discover a suboptimal policy since it forces the algorithm to explore the environment less lee2018sparse (); chow2018tsallispcl ().

In this paper, we present a unified framework for the original RL problem and maximum entropy RL problems with various types of entropy. The proposed framework is formulated as a new class of Markov decision processes with Tsallis entropy maximization, which is called Tsallis MDPs. The Tsallis entropy generalizes the standard SG entropy and can represent various types of entropy, including the SG and ST entropies by controlling a parameter, called an entropic index. A Tsallis MDP presents a unifying view on the use of various entropies in RL. We provide a comprehensive analysis of how a different value of the entropic index can provide a different type of optimal policies and different Bellman optimality equations. Our theoretical result allows us to interpret the effects of various entropies for an RL problem.

We also derive two dynamic programming algorithms: Tsallis policy iteration and Tsallis value iteration for all positive entropic indices with optimality and convergence guarantees. We further extend Tsallis policy iteration to a Tsallis actor-critic method for model-free large-scale problems. The entropic index controls the exploration-exploitation trade-off in the proposed Tsallis MDP since a different index induces a different bonus reward for a stochastic policy. Similar to the proposed method, Chen et al. chen2018tsallisensembles () also proposed an ensemble network of action values combining multiple Tsallis entropies. While an ensemble network with multiple Tsallis entropies shows efficient exploration in discrete action problems, it is not suitable for a continuous action space.

As mentioned above, it has been observed that the SG and ST entropies show distinct exploration tendencies. The former makes the policy always assign non-zero probability to all possible actions. On the other hand, the latter can assign zero probability to some actions. The proposed Tsallis RL framework contains both SG and ST entropy maximization as special cases and allows a diverse range of exploration-exploitation trade-off behaviors for a learning agent, which is a highly desirable feature since the problem complexity is different for each task at hand. We validate our claim in MuJoCo simulations and demonstrate that the proposed method with a proper entropic index outperforms existing actor-critic methods.

## 2 Background

In this section, we review a Markov decision process and define the Tsallis entropy using q-exponential and q-logarithm functions.

### 2.1 Markov Decision Processes

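A Markov decision process (MDP) is defined as a tuple $\mathcal{M} = \{\mathcal{S}, \mathcal{F}, \mathcal{A}, d, P, \gamma, r\}$, where $\mathcal{S}$ is the state space, $\mathcal{F}$ is the corresponding feature space, $\mathcal{A}$ is the action space, $d(s)$ is the distribution of an initial state, $P(s'|s,a)$ is the transition probability from $s \in \mathcal{S}$ to $s' \in \mathcal{S}$ by taking $a \in \mathcal{A}$, $\gamma \in (0,1)$ is a discount factor, and $r$ is the reward function defined as $r(s,a,s') \triangleq \mathbb{E}[R \mid s,a,s']$ with a random reward $R$. In our paper, we assume that $r$ is bounded. Then, the MDP problem can be formulated as follows:

$$\underset{\pi \in \Pi}{\text{maximize}} \;\; \mathbb{E}_{\tau \sim P, \pi}\left[\sum_{t=0}^{\infty} \gamma^{t} R_{t}\right], \tag{1}$$

where $\sum_{t=0}^{\infty} \gamma^{t} R_{t}$ is a discounted sum of rewards, also called a return, $\Pi$ is a set of policies, $\{\pi \mid \forall s, a:\ \pi(a|s) \ge 0,\ \sum_{a'} \pi(a'|s) = 1\}$, and $\tau$ is an infinite sequence of state-action pairs sampled from the transition probability and policy, i.e., $s_{t+1} \sim P(\cdot|s_t, a_t)$ and $a_t \sim \pi(\cdot|s_t)$ for $t \ge 0$ and $s_0 \sim d$. For a given $\pi$, we can define the state value and state-action (or action) value as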

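$$V^{\pi}(s) \triangleq \mathbb{E}_{\tau \sim P, \pi}\left[\sum_{t=0}^{\infty} \gamma^{t} R_{t} \,\middle|\, s_0 = s\right], \qquad Q^{\pi}(s,a) \triangleq \mathbb{E}_{\tau \sim P, \pi}\left[\sum_{t=0}^{\infty} \gamma^{t} R_{t} \,\middle|\, s_0 = s, a_0 = a\right]. \tag{2}$$

The solution of (1) is called the optimal policy $\pi^\star$. The optimal value $V^\star = V^{\pi^\star}$ and action value $Q^\star = Q^{\pi^\star}$ satisfy the Bellman optimality equation as follows: for all $s \in \mathcal{S}$ and $a \in \mathcal{A}$,

$$\begin{aligned} Q^\star(s,a) &= \mathbb{E}_{s' \sim P}\left[r(s,a,s') + \gamma V^\star(s')\right], \\ V^\star(s) &= \max_{a'} Q^\star(s,a'), \\ \pi^\star &\in \arg\max_{a'} Q^\star(s,a'), \end{aligned} \tag{3}$$

where $\arg\max_{a'} Q^\star(s,a')$ indicates a set of policies $\pi$ satisfying $\mathbb{E}_{a \sim \pi}[Q^\star(s,a)] = \max_{a'} Q^\star(s,a')$ and $s' \sim P$ indicates $s' \sim P(\cdot|s,a)$. Note that there may exist multiple optimal policies if the optimal action value has multiple maxima with respect to actions.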

### 2.2 q-Exponential, q-Logarithm, and Tsallis Entropy

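Before defining the Tsallis entropy, let us first introduce variants of exponential and logarithm functions, which are called q-exponential and q-logarithm, respectively. They are used to define the Tsallis entropy and are defined as follows (footnote 1: the definitions of $\exp_q$, $\ln_q$, and the Tsallis entropy are slightly different from the original ones amari2011geometry (), but those settings are recovered by setting $q = 2 - q'$, where $q'$ denotes the entropic index of the original definition):

$$\exp_q(x) \triangleq \begin{cases} \exp(x), & \text{if } q = 1, \\ \left[1 + (q-1)x\right]_+^{\frac{1}{q-1}}, & \text{if } q \neq 1, \end{cases} \tag{4}$$

$$\ln_q(x) \triangleq \begin{cases} \log(x), & \text{if } q = 1 \text{ and } x > 0, \\ \dfrac{x^{q-1} - 1}{q-1}, & \text{if } q \neq 1 \text{ and } x > 0, \end{cases} \tag{5}$$

where $[x]_+ = \max(x, 0)$ and $q$ is a real number.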
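As an illustration, the two definitions can be written directly in Python. This is a minimal sketch, assuming the forms $[1 + (q-1)x]_+^{1/(q-1)}$ and $(x^{q-1} - 1)/(q-1)$ for (4) and (5); the function names `exp_q` and `ln_q` are ours, and the handling of the clipped case for $q < 1$ (returning infinity) is a convention chosen for this example.

```python
import math


def exp_q(x: float, q: float) -> float:
    """q-exponential: exp(x) if q == 1, else [1 + (q - 1) x]_+ ** (1 / (q - 1))."""
    if q == 1.0:
        return math.exp(x)
    base = max(1.0 + (q - 1.0) * x, 0.0)
    if base == 0.0:
        # The exponent 1 / (q - 1) is positive for q > 1 (value 0)
        # and negative for q < 1 (value diverges).
        return 0.0 if q > 1.0 else math.inf
    return base ** (1.0 / (q - 1.0))


def ln_q(x: float, q: float) -> float:
    """q-logarithm: log(x) if q == 1, else (x ** (q - 1) - 1) / (q - 1), for x > 0."""
    if x <= 0.0:
        raise ValueError("ln_q is only defined for x > 0")
    if q == 1.0:
        return math.log(x)
    return (x ** (q - 1.0) - 1.0) / (q - 1.0)


if __name__ == "__main__":
    for q in (0.5, 1.0, 1.5, 2.0, 3.0):
        y = exp_q(0.3, q)
        print(q, y, ln_q(y, q))  # ln_q inverts exp_q on its range
```

Under these definitions, `ln_q(x, 2.0)` reduces to the linear function `x - 1`, and `exp_q(x, 2.0)` is clipped to zero for `x <= -1`.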

The properties of the q-exponential and q-logarithm depend on the value of $q$. We would like to note that, for all $q$, $\ln_q(x)$ is a monotonically increasing function with respect to $x$, since its gradient $x^{q-2}$ is always positive. In particular, $\ln_q(x)$ becomes $\log(x)$ when $q = 1$ and a linear function when $q = 2$. However, the tendencies of their gradients are different. For $q > 2$, $d\ln_q(x)/dx$ is an increasing function. On the contrary, for $q < 2$, $d\ln_q(x)/dx$ is a decreasing function. In particular, when $q = 2$, its gradient becomes a constant. This property will play an important role in controlling the exploration-exploitation trade-off when we model a policy using parameterized function approximation. Now, we define the Tsallis entropy using $\ln_q(x)$.

###### Definition 1.

The Tsallis entropy of a random variable $X$ with the distribution $P$ is defined as follows amari2011geometry ():

$$S_q(P) \triangleq \mathbb{E}_{X \sim P}\left[-\ln_q(P(X))\right]. \tag{6}$$

Here, $q$ is called an entropic index.

The Tsallis entropy can represent various types of entropy by varying the entropic index. For example, when $q \to 1$, $S_q(P)$ becomes the Shannon-Gibbs entropy and when $q = 2$, $S_q(P)$ becomes the sparse Tsallis entropy lee2018sparse (). Furthermore, when $q \to \infty$, $S_q(P)$ converges to zero. Examples of the q-exponential, q-logarithm, and $S_q(P)$ are shown in Figure 1. We would like to emphasize that, for $q > 0$, the Tsallis entropy is a concave function with respect to the density function, but, for $q \le 0$, the Tsallis entropy is a convex function. Detailed proofs are included in the supplementary material. In this paper, we only consider the case when $q > 0$ since our purpose of using the Tsallis entropy is to give a bonus reward to a stochastic policy.
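The special cases above can be checked numerically for a discrete distribution. The following is a small sketch, assuming the definition $S_q(P) = \mathbb{E}_{X \sim P}[-\ln_q(P(X))]$ with $\ln_q(x) = (x^{q-1} - 1)/(q-1)$ for $q \neq 1$; the helper name `tsallis_entropy` and the example distribution are ours.

```python
import math


def ln_q(x: float, q: float) -> float:
    """q-logarithm for x > 0."""
    if q == 1.0:
        return math.log(x)
    return (x ** (q - 1.0) - 1.0) / (q - 1.0)


def tsallis_entropy(p, q: float) -> float:
    """S_q(P) = sum_x P(x) * (-ln_q(P(x))), skipping zero-probability outcomes."""
    return sum(-px * ln_q(px, q) for px in p if px > 0.0)


if __name__ == "__main__":
    p = [0.5, 0.25, 0.25]
    print("q=1  (Shannon-Gibbs):", tsallis_entropy(p, 1.0))
    print("q=2  (sum p(1-p))   :", tsallis_entropy(p, 2.0))
    print("q=50 (near zero)    :", tsallis_entropy(p, 50.0))
```

With $q = 2$ the expression reduces to $\sum_x P(x)(1 - P(x))$, and a deterministic distribution has zero entropy for every $q > 0$.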

## 3 Bandit with Maximum Tsallis Entropy

Before applying the Tsallis entropy to an MDP problem, let us consider a simpler case, which is a stochastic multi-armed bandit (MAB) problem defined by only a reward function and action space, i.e., $\{\mathcal{A}, r\}$, where the reward only depends on an action, i.e., $r(a) = \mathbb{E}[R \mid a]$. While an MAB is simpler than an MDP, many properties of an MAB are analogous to those of an MDP.

In this section, we discuss an MAB with Tsallis entropy maximization defined as

$$\max_{\pi \in \Delta}\left[\mathbb{E}_{a \sim \pi}[R] + \alpha S_q(\pi)\right], \tag{7}$$

where $\Delta$ is a probability simplex whose element is a probability distribution and $\alpha$ is a coefficient. We assume that an action is a discrete random variable in this analysis, but an extension to a continuous action space is discussed in the supplementary material. Furthermore, we assume that $\alpha = 1$; for $\alpha \neq 1$, the following analysis holds after replacing $r$ with $r/\alpha$. The Tsallis entropy leads to a stochastic optimal policy and the problem (7) has the following solution

$$\pi^\star_q(a) = \exp_q\left(r(a)/q - \psi_q(r/q)\right), \tag{8}$$

where $\psi_q$ is called a q-potential function amari2011geometry (), which is uniquely determined by the normalization condition:

$$\sum_a \pi^\star_q(a) = \sum_a \exp_q\left(r(a)/q - \psi_q(r/q)\right) = 1. \tag{9}$$

A detailed derivation can be found in the supplementary material. Note that $\psi_q$ is a mapping from $\mathbb{R}^{|\mathcal{A}|}$ to a real value. The optimal solution (8) assigns a probability $q$-exponentially proportional to the reward $r(a)$ and normalizes the probability. Since $\exp_q(x)$ is an increasing function when $q > 0$, the optimal solution assigns high probability to an action with a high reward. Furthermore, using $\pi^\star_q$, the optimal value can be written as

$$\mathbb{E}_{a \sim \pi^\star_q}[R] + S_q(\pi^\star_q) = (q-1)\sum_a \frac{r(a)}{q}\exp_q\left(\frac{r(a)}{q} - \psi_q\left(\frac{r}{q}\right)\right) + \psi_q\left(\frac{r}{q}\right). \tag{10}$$

Now, we can analyze how the optimal policy varies with respect to $q$. As mentioned in the previous section, we only consider $q > 0$.

When $q = 1$, the optimal policy becomes a softmax distribution and $\psi_1$ becomes a log-sum-exp operator, i.e., $\pi^\star_1(a) = \exp(r(a) - \psi_1(r))$ and $\psi_1(r) = \log\sum_a \exp(r(a))$, and the optimal value becomes the same as $\psi_1(r)$. The softmax distribution is the well-known solution of the MDP with the Shannon-Gibbs entropy ziebart2010modeling (); schulman17equivalence (); haarnoja2017energy (); nachum2017bridging (). When $q = 2$, the problem (7) becomes the probability simplex projection problem, where $r/2$ is projected onto $\Delta$. It leads to $\pi^\star_2(a) = \left[1 + r(a)/2 - \psi_2(r/2)\right]_+$ and $\psi_2(r/2) = 1 + \frac{\sum_{a \in S} r(a)/2 - 1}{|S|}$, where $S$ is a supporting set, i.e., $S = \{a \mid 1 + r(a)/2 - \psi_2(r/2) > 0\}$. $\pi^\star_2$ is the optimal solution of the MDP with the ST entropy lee2018sparse (); chow2018tsallispcl (). Compared to $\pi^\star_1$, $\pi^\star_2$ allows zero probability to the action whose $r(a)/2$ is below $\psi_2(r/2) - 1$, whereas $\pi^\star_1$ cannot. Furthermore, when $q \to \infty$, the problem becomes the original MAB problem since $S_q$ becomes zero as $q$ goes to infinity. In this case, the optimal policy only assigns positive probability to optimal actions. If $r(a)$ has a single maximum, then the optimal policy becomes greedy.

Unfortunately, finding a closed form of $\psi_q$ for a general value of $q$ is intractable, except for the special cases discussed above, since the normalization condition (9) is a sum of radical expressions whose index depends on $q$. Thus, for other values of $q$, the solution can be obtained using a numerical optimization method. This intractability of (9) hampers the viability of the Tsallis entropy. Chen et al. chen2018tsallisensembles () handled this issue by obtaining an approximated closed form using a first-order Taylor expansion. However, we propose an alternative way to avoid numerical computation, which will be discussed in Section 6.
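To make the numerical route concrete, the normalization condition (9) can be solved for the q-potential by bisection, because the sum is monotonically decreasing in $\psi$. The sketch below is one possible approach under our own assumptions, not the authors' implementation: the names `q_potential` and `tsallis_policy` are hypothetical, $\alpha$ is fixed to one, and the bracket $[\max_a z_a,\ \max_a z_a - \ln_q(1/|\mathcal{A}|)]$ is derived by us from the monotonicity of $\exp_q$. For large $q$ the root becomes poorly conditioned in floating point, so a more careful solver would be needed there.

```python
import math


def exp_q(x: float, q: float) -> float:
    """q-exponential, assuming the form [1 + (q - 1) x]_+ ** (1 / (q - 1))."""
    if q == 1.0:
        return math.exp(x)
    base = max(1.0 + (q - 1.0) * x, 0.0)
    if base == 0.0:
        return 0.0 if q > 1.0 else math.inf
    return base ** (1.0 / (q - 1.0))


def ln_q(x: float, q: float) -> float:
    """q-logarithm for x > 0."""
    if q == 1.0:
        return math.log(x)
    return (x ** (q - 1.0) - 1.0) / (q - 1.0)


def q_potential(z, q: float, iters: int = 100) -> float:
    """Solve sum_a exp_q(z[a] - psi) = 1 for psi by bisection (q > 0)."""
    lo = max(z)                          # the sum is >= 1 here
    hi = max(z) - ln_q(1.0 / len(z), q)  # the sum is <= 1 here
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if sum(exp_q(v - mid, q) for v in z) > 1.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def tsallis_policy(r, q: float):
    """Policy exp_q(r(a)/q - psi_q(r/q)) for a bandit with alpha = 1."""
    z = [v / q for v in r]
    psi = q_potential(z, q)
    return [exp_q(v - psi, q) for v in z]


if __name__ == "__main__":
    rewards = [1.0, 0.5, -1.0]
    for q in (0.5, 1.0, 1.5, 2.0):
        print(q, [round(p, 4) for p in tsallis_policy(rewards, q)])
```

For these rewards, $q = 1$ reproduces the softmax of the rewards, while $q = 2$ yields approximately $(0.625, 0.375, 0)$, with the lowest-reward action receiving zero probability.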

From the observations above, we can see that, as $q$ increases from zero to infinity, the optimal policy becomes more sparse and finally converges to a greedy policy. The optimal policy with different $q$ is shown in Figure 2. The effect of different $q$ and $\alpha$ can be found in the supplementary material. This tendency of the optimal policy also appears in the MDP with the Tsallis entropy. Many existing methods employ the SG entropy to encourage exploration. However, the Tsallis entropy allows us to select the proper entropy according to the property of the environment.

### 3.1 q-Maximum

Before extending from MAB to MDP, we define the problem (7) as an operator, which is called q-maximum. A q-maximum operator is a bounded approximation of the maximum operator. For a function $f(x)$, q-maximum is defined as follows:

$$\underset{x}{q\text{-}\max}\,(f(x)) \triangleq \max_{P \in \Delta}\left[\mathbb{E}_{X \sim P}[f(X)] + S_q(P)\right], \tag{11}$$

where $\Delta$ is a probability simplex whose element is a probability. The reason why this operator (11) is called a q-maximum is that it has the following bounds.

###### Theorem 1.

For any function $f(x)$ defined on a finite input space $\mathcal{X}$, the q-maximum satisfies the following inequalities.

$$\underset{x}{q\text{-}\max}\,(f(x)) + \ln_q\left(1/|\mathcal{X}|\right) \le \max_x f(x) \le \underset{x}{q\text{-}\max}\,(f(x)), \tag{12}$$

where $|\mathcal{X}|$ is the cardinality of $\mathcal{X}$.

The proof can be found in the supplementary material. The proof of Theorem 1 utilizes the definition of q-maximum. This bounded property will be used to analyze the performance bound of an MDP with the maximum Tsallis entropy. Furthermore, $q$-maximum plays an important role in the optimality condition of Tsallis MDPs.
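The bounded property can be checked numerically in the simplest setting of two inputs, where the simplex is a single interval. The sketch below maximizes the objective in (11) directly with a ternary search, which is valid because the objective is concave in $p$ for $q > 0$. It assumes the bounds take the form $q\text{-}\max_x f(x) + \ln_q(1/|\mathcal{X}|) \le \max_x f(x) \le q\text{-}\max_x f(x)$; the function names are ours and the restriction to two inputs is purely for illustration.

```python
import math


def ln_q(x: float, q: float) -> float:
    """q-logarithm for x > 0."""
    if q == 1.0:
        return math.log(x)
    return (x ** (q - 1.0) - 1.0) / (q - 1.0)


def neg_p_ln_q(p: float, q: float) -> float:
    """-p * ln_q(p), using the convention 0 * ln_q(0) = 0 (valid for q > 0)."""
    return 0.0 if p <= 0.0 else -p * ln_q(p, q)


def q_max_two(f1: float, f2: float, q: float, iters: int = 200) -> float:
    """q-maximum over two inputs: max_p [p f1 + (1 - p) f2 + S_q((p, 1 - p))]."""

    def objective(p: float) -> float:
        return p * f1 + (1.0 - p) * f2 + neg_p_ln_q(p, q) + neg_p_ln_q(1.0 - p, q)

    lo, hi = 0.0, 1.0
    for _ in range(iters):  # ternary search on a concave function of p
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if objective(m1) < objective(m2):
            lo = m1
        else:
            hi = m2
    # Also evaluate the endpoints so a boundary optimum is captured exactly.
    return max(objective(0.0), objective(1.0), objective(0.5 * (lo + hi)))


if __name__ == "__main__":
    f1, f2 = 1.0, 0.2
    for q in (0.5, 1.0, 2.0, 5.0):
        value = q_max_two(f1, f2, q)
        print(q, value + ln_q(0.5, q), "<=", max(f1, f2), "<=", value)
```

For $q = 1$ the computed value coincides with the log-sum-exp of the two inputs, and the gap between the two bounds shrinks as $q$ grows.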

## 4 Maximum Tsallis Entropy in MDPs

In this section, we formulate MDPs with Tsallis entropy maximization, which will be named Tsallis MDPs, by extending the SG entropy to the Tsallis entropy. We mainly focus on deriving the optimality conditions and algorithms generalized for the entropic index so that a wide range of $q$ values can be used for a learning agent. First, we extend the definition of the Tsallis entropy so that it is applicable to a policy distribution in MDPs. The Tsallis entropy of a policy distribution is defined by

$$S^\infty_q(\pi) \triangleq \mathbb{E}_{\tau \sim P, \pi}\left[\sum_{t=0}^{\infty} \gamma^{t} S_q(\pi(\cdot|s_t))\right].$$

Using $S^\infty_q$, the original MDPs can be converted into Tsallis MDPs by adding $S^\infty_q(\pi)$ to the objective function as follows:

$$\underset{\pi \in \Pi}{\text{maximize}} \;\; \mathbb{E}_{\tau \sim P, \pi}\left[\sum_{t=0}^{\infty} \gamma^{t} R_{t}\right] + \alpha S^\infty_q(\pi), \tag{13}$$

where $\alpha > 0$ is a coefficient of the Tsallis entropy. A state value and state-action value are redefined for Tsallis MDPs as follows:

$$\begin{aligned} V^\pi_q(s) &\triangleq \mathbb{E}_{\tau \sim P, \pi}\left[\sum_{t=0}^{\infty} \gamma^{t}\left(R_t + \alpha S_q(\pi(\cdot|s_t))\right) \,\middle|\, s_0 = s\right], \\ Q^\pi_q(s,a) &\triangleq \mathbb{E}_{\tau \sim P, \pi}\left[R_0 + \gamma V^\pi_q(s_1) \,\middle|\, s_0 = s, a_0 = a\right], \end{aligned} \tag{14}$$

where $q$ is the entropic index. The goal of a Tsallis MDP is to find an optimal policy distribution which maximizes both the sum of rewards and the Tsallis entropy, whose importance is determined by $\alpha$. The solution of the problem (13) is denoted as $\pi^\star_q$ and its value functions are denoted as $V^\star_q$ and $Q^\star_q$, respectively. In our analysis, $\alpha$ is set to one; however, one can easily generalize to the case of $\alpha \neq 1$ by replacing $r$, $V$, and $Q$ with $r/\alpha$, $V/\alpha$, and $Q/\alpha$.

In the following sections, we first derive the optimality condition of (13), which will be called the Tsallis-Bellman optimality (TBO) equation. Second, dynamic programming to solve Tsallis MDPs is proposed with convergence and optimality guarantees. Finally, we provide the performance error bound of the optimal policy of the Tsallis MDP, where the error is caused by the Tsallis entropy term. The theoretical results derived in this section are extended to a viable actor-critic algorithm in Section 6.

### 4.1 Tsallis Bellman Optimality Equation

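Using the $q$-maximum operator, the optimality condition of a Tsallis MDP can be obtained as follows.

###### Theorem 2.

For $q > 0$, an optimal policy $\pi^\star_q$ and optimal value $V^\star_q$ sufficiently and necessarily satisfy the following Tsallis-Bellman optimality (TBO) equations:

$$\begin{aligned} Q^\star_q(s,a) &= \mathbb{E}_{s' \sim P}\left[r(s,a,s') + \gamma V^\star_q(s') \,\middle|\, s, a\right], \\ V^\star_q(s) &= \underset{a}{q\text{-}\max}\,(Q^\star_q(s,a)), \\ \pi^\star_q(a|s) &= \exp_q\left(Q^\star_q(s,a)/q - \psi_q\left(Q^\star_q(s,\cdot)/q\right)\right), \end{aligned} \tag{15}$$

where $\psi_q$ is a q-potential function.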

###### Proof Sketch.

Unlike the original Bellman equation, we derive Theorem 2 from Karush-Kuhn-Tucker (KKT) conditions instead of Bellman's principle of optimality puterman2014markov (). The proof consists of three steps. First, the optimization variable in (13) is converted to a state-action visitation $\rho$ based on syed2008allinearprogram (); puterman2014markov () (footnote 2: $\rho_\pi(s,a) \triangleq \mathbb{E}_{\tau \sim P, \pi}\left[\sum_{t=0}^{\infty} \gamma^{t}\, \mathbb{1}(s_t = s, a_t = a)\right]$, where $\mathbb{1}(\cdot)$ is an indicator function). Second, after changing variables, (13) becomes a concave problem with respect to $\rho$. Thus, we can apply KKT conditions since strong duality holds. Finally, TBO equations are obtained by solving the KKT conditions. The entire proof of Theorem 2 is included in the supplementary material. ∎

The TBO equation differs from the original Bellman equation in that the maximum operator is replaced by the $q$-maximum operator. The optimal state value $V^\star_q$ is the $q$-maximum of the optimal state-action value $Q^\star_q$ and the optimal policy $\pi^\star_q$ is the solution of $q$-maximum (11). Thus, as $q$ changes, $\pi^\star_q$ can represent various types of $q$-exponential distributions. We would like to emphasize that the TBO equation becomes the original Bellman equation as $q$ diverges to infinity. This is a reasonable tendency since, as $q \to \infty$, $S^\infty_q$ tends to zero and the Tsallis MDP becomes the original MDP. Furthermore, when $q \to 1$, $q$-maximum becomes the log-sum-exponential operator and the Bellman equation of maximum SG entropy RL (a.k.a. the soft Bellman equation) haarnoja2017energy () is recovered. When $q = 2$, the Bellman equation of maximum ST entropy RL (a.k.a. the sparse Bellman equation) lee2018sparse () is also recovered. Moreover, our result guarantees that the TBO equation holds for every real value $q > 0$.

## 5 Dynamic Programming for Tsallis MDPs

In this section, we develop dynamic programming algorithms for a Tsallis MDP: Tsallis policy iteration (TPI) and Tsallis value iteration (TVI). These algorithms can compute an optimal value and policy and their convergence can be shown. TPI is a policy iteration method which consists of policy evaluation and policy improvement. In TPI, first, a value function of a fixed policy is computed and, then, the policy is updated using the value function. TVI is a value iteration method which computes the optimal value directly. In dynamic programming of the original MDPs, the convergence is derived from the maximum operator. Similarly, in the MDP with the SG entropy, log-sum-exponential plays a crucial role for the convergence. In TPI and TVI, we generalize such maximum or log-sum-exponential operators by the $q$-maximum operator, which is a more abstract notion and is available for all $q > 0$.

### 5.1 Tsallis Policy Iteration

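We first discuss the policy evaluation method in a Tsallis MDP, which computes $V^\pi_q$ and $Q^\pi_q$ for a fixed policy $\pi$. Similar to the original MDP, a value function of a Tsallis MDP can be computed using the expectation equation defined by

$$\begin{aligned} Q^\pi_q(s,a) &= \mathbb{E}_{s' \sim P}\left[r(s,a,s') + \gamma V^\pi_q(s') \,\middle|\, s, a\right], \\ V^\pi_q(s) &= \mathbb{E}_{a \sim \pi}\left[Q^\pi_q(s,a) - \ln_q(\pi(a|s))\right], \end{aligned} \tag{16}$$

where $s' \sim P$ indicates $s' \sim P(\cdot|s,a)$ and $a \sim \pi$ indicates $a \sim \pi(\cdot|s)$. Equation (16) will be called the Tsallis Bellman expectation (TBE) equation and it is derived from the definition of $V^\pi_q$ and $Q^\pi_q$. Based on the TBE equation, we can define the operator for an arbitrary function $F(s,a)$ over $\mathcal{S} \times \mathcal{A}$, which is called the TBE operator,

$$\begin{aligned} \left[\mathcal{T}^\pi_q F\right](s,a) &\triangleq \mathbb{E}_{s' \sim P}\left[r(s,a,s') + \gamma V_F(s') \,\middle|\, s, a\right], \\ V_F(s) &\triangleq \mathbb{E}_{a \sim \pi}\left[F(s,a) - \ln_q(\pi(a|s))\right]. \end{aligned} \tag{17}$$

Then, the policy evaluation method for a Tsallis MDP can be simply defined as repeatedly applying the TBE operator to an initial function $F_0$: $F_{k+1} = \mathcal{T}^\pi_q F_k$.

###### Theorem 3 (Tsallis Policy Evaluation).

For fixed $\pi$ and $q > 0$, consider the TBE operator $\mathcal{T}^\pi_q$, and define Tsallis policy evaluation as $F_{k+1} = \mathcal{T}^\pi_q F_k$ for an arbitrary initial function $F_0$ over $\mathcal{S} \times \mathcal{A}$. Then, $F_k$ converges to $Q^\pi_q$ and satisfies the TBE equation (16).

The proof of Theorem 3 relies on the contraction property of $\mathcal{T}^\pi_q$ and can be found in the supplementary material. The contraction property guarantees that the sequence $F_k$ converges to a fixed point $F_*$ of $\mathcal{T}^\pi_q$, i.e., $F_* = \mathcal{T}^\pi_q F_*$, and the fixed point is the same as $Q^\pi_q$.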
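As a concrete illustration, Tsallis policy evaluation can be run on a small tabular MDP by repeatedly applying the TBE operator. This is a minimal sketch under our own assumptions: the reward depends only on $(s, a)$, $\alpha = 1$, and the transition model is stored as nested lists `P[s][a][s_next]`; the function names and the single-state example are hypothetical.

```python
import math


def ln_q(x: float, q: float) -> float:
    """q-logarithm for x > 0."""
    if q == 1.0:
        return math.log(x)
    return (x ** (q - 1.0) - 1.0) / (q - 1.0)


def tbe_operator(F, P, R, pi, q, gamma):
    """One application of the TBE operator to a tabular function F[s][a]."""
    n_s, n_a = len(R), len(R[0])
    # V_F(s) = E_{a ~ pi}[F(s, a) - ln_q(pi(a|s))]
    V = [
        sum(pi[s][a] * (F[s][a] - ln_q(pi[s][a], q))
            for a in range(n_a) if pi[s][a] > 0.0)
        for s in range(n_s)
    ]
    # [T F](s, a) = r(s, a) + gamma * E_{s'}[V_F(s')]
    return [
        [R[s][a] + gamma * sum(P[s][a][s2] * V[s2] for s2 in range(n_s))
         for a in range(n_a)]
        for s in range(n_s)
    ]


def tsallis_policy_evaluation(P, R, pi, q, gamma, iters=500):
    """Iterate F_{k+1} = T F_k starting from F_0 = 0."""
    F = [[0.0] * len(R[0]) for _ in R]
    for _ in range(iters):
        F = tbe_operator(F, P, R, pi, q, gamma)
    return F


if __name__ == "__main__":
    # Single state with a self-loop, two actions, uniform policy, q = 2.
    P = [[[1.0], [1.0]]]
    R = [[1.0, 0.0]]
    pi = [[0.5, 0.5]]
    print(tsallis_policy_evaluation(P, R, pi, q=2.0, gamma=0.5))
```

In this example the fixed point can be found by hand: $S_2(\pi) = 0.5$ for the uniform policy, so $V = 0.5 + 0.5 V + 0.5$ gives $V = 2$ and the action values are $(2, 1)$.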

The value function evaluated from Tsallis policy evaluation can be employed to update the policy distribution. In the policy improvement step, the policy is updated by solving the following maximization:

$$\forall s, \quad \pi_{k+1}(\cdot|s) = \arg\max_{\pi(\cdot|s)} \mathbb{E}_{a \sim \pi}\left[Q^{\pi_k}_q(s,a) - \ln_q(\pi(a|s)) \,\middle|\, s\right]. \tag{18}$$

###### Theorem 4 (Tsallis Policy Improvement).

For $q > 0$, let $\pi_{k+1}$ be the updated policy from (18) using $Q^{\pi_k}_q$. For all $(s,a)$, $Q^{\pi_{k+1}}_q(s,a)$ is greater than or equal to $Q^{\pi_k}_q(s,a)$.

Theorem 4 tells us that the policy obtained by the maximization (18) has performance no worse than the previous policy. From Theorems 3 and 4, it is guaranteed that the Tsallis policy iteration gradually improves its policy as the number of iterations increases and it converges to the optimal solution.

###### Theorem 5 (Optimality of TPI).

When $q > 0$, define the Tsallis policy iteration as alternately applying (17) and (18); then $\pi_k$ converges to the optimal policy.

The proof is done by checking that the converged policy satisfies the TBO equation. In Section 6, Tsallis policy iteration is extended to a Tsallis actor-critic method, which is a practical algorithm to handle continuous state and action spaces and complex environments.

### 5.2 Tsallis Value Iteration

Tsallis value iteration is derived from the optimality condition. From the TBO equation, the TBO operator is defined by

$$\begin{aligned} \left[\mathcal{T}_q F\right](s,a) &\triangleq \mathbb{E}_{s' \sim P}\left[r(s,a,s') + \gamma V_F(s') \,\middle|\, s, a\right], \\ V_F(s) &\triangleq \underset{a'}{q\text{-}\max}\,(F(s,a')). \end{aligned} \tag{19}$$

Then, Tsallis value iteration (TVI) is defined by repeatedly applying the TBO operator: $F_{k+1} = \mathcal{T}_q F_k$.

###### Theorem 6.

For $q > 0$, consider the TBO operator $\mathcal{T}_q$, and define Tsallis value iteration as $F_{k+1} = \mathcal{T}_q F_k$ for an arbitrary initial function $F_0$ over $\mathcal{S} \times \mathcal{A}$. Then, $F_k$ converges to $Q^\star_q$.

Similar to Tsallis policy evaluation, the convergence of Tsallis value iteration depends on the contraction property of $\mathcal{T}_q$, which makes $F_k$ converge to a fixed point of $\mathcal{T}_q$. Then, the fixed point can be shown to satisfy the TBO equation.
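The structure of Tsallis value iteration can be sketched for a tabular MDP with the $q$-maximum passed in as a function. To keep the example short we only instantiate the two cases with a simple closed form mentioned in the text: log-sum-exp ($q = 1$) and the ordinary maximum (the $q \to \infty$ limit); a general $q$ would require a numerical $q$-maximum. The function names, the assumption that the reward depends only on $(s, a)$, and the two-state MDP are ours.

```python
import math


def log_sum_exp(values):
    """q-maximum for q = 1 (computed stably)."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def tbo_operator(F, P, R, gamma, q_max):
    """One application of the TBO operator with a pluggable q-maximum."""
    n_s, n_a = len(R), len(R[0])
    V = [q_max(F[s]) for s in range(n_s)]  # V_F(s) = q-max_a F(s, a)
    return [
        [R[s][a] + gamma * sum(P[s][a][s2] * V[s2] for s2 in range(n_s))
         for a in range(n_a)]
        for s in range(n_s)
    ]


def tsallis_value_iteration(P, R, gamma, q_max, iters=1000):
    """Iterate F_{k+1} = T_q F_k starting from F_0 = 0."""
    F = [[0.0] * len(R[0]) for _ in R]
    for _ in range(iters):
        F = tbo_operator(F, P, R, gamma, q_max)
    return F


if __name__ == "__main__":
    # Two states, two actions: action 0 leads to state 0, action 1 to state 1.
    P = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]]]
    R = [[0.0, 1.0], [2.0, 0.0]]
    gamma = 0.9
    print("q = 1      :", tsallis_value_iteration(P, R, gamma, log_sum_exp))
    print("q -> inf   :", tsallis_value_iteration(P, R, gamma, max))
```

Because log-sum-exp lies between the maximum and the maximum plus $\log|\mathcal{A}|$, the $q = 1$ values sit above the standard optimal values by at most $\gamma \log|\mathcal{A}| / (1 - \gamma)$.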

### 5.3 Performance Error Bounds

We provide the performance error bounds of the optimal policy of a Tsallis MDP which can be obtained by TPI or TVI. The error is caused by the regularization term used in Tsallis entropy maximization. We compare the performance between the optimal policy of a Tsallis MDP and that of the original MDP. The performance error bounds are derived as follows.

###### Theorem 7.

Let $J(\pi)$ be the expected sum of rewards of a given policy $\pi$, $\pi^\star$ be the optimal policy of an original MDP, and $\pi^\star_q$ be the optimal policy of a Tsallis MDP with an entropic index $q$. Then, the following inequality holds:

$$J(\pi^\star) + (1-\gamma)^{-1}\ln_q\left(1/|\mathcal{A}|\right) \le J(\pi^\star_q) \le J(\pi^\star), \tag{20}$$

where $|\mathcal{A}|$ is the cardinality of $\mathcal{A}$ and $q > 0$.

The proof of Theorem 7 is included in the supplementary material. Here, we can observe that the performance gap shows a property similar to that of the TBO equation. We further verify Theorem 7 on a simple grid world problem. We compute the expected sum of rewards of $\pi^\star_q$ obtained from TVI by varying $q$ values and compare them to the bounds in Theorem 7, as shown in Figure 3. Notice that $\ln_q(1/|\mathcal{A}|)$ converges to zero as $q \to \infty$. This fact supports that $\pi^\star_q$ converges to the greedy optimal policy in the original Bellman equation when $q \to \infty$.

To summarize this section, we derive dynamic programming methods for Tsallis MDPs with proofs of convergence, optimality, and performance error bounds. One important result is that all theorems derived in this section hold for every $q > 0$, for which $S_q$ is concave. Furthermore, the previous algorithms ziebart2010modeling (); lee2018sparse () can be recovered by setting $q$ to a specific value. This generalization makes it possible to apply the Tsallis entropy to any sequential decision making problem by choosing the proper entropic index depending on the nature of the problem.

TPI and TVI methods require the transition probability to update the value function. Furthermore, due to the intractability of $\psi_q$, they also require an additional numerical computation to evaluate $q$-maximum. In this regard, we extend TPI to an actor-critic method which can avoid these issues and handle large-scale model-free RL problems.

## 6 Tsallis Actor Critic for Model-Free RL

In this section, we propose a Tsallis actor-critic (TAC), which can be applied to a complex environment with continuous state and action spaces without knowing the transition probabilities. To address a large-scale problem with continuous state and action spaces, we approximate Tsallis policy iteration (TPI) using a neural network to estimate both value and policy functions. In the dynamic programming setting, Tsallis policy improvement (18) and Tsallis value iteration (TVI) (19) require the same numerical computation since a closed form solution of $q$-maximum (and (18)) is generally not known. However, in TAC, the Tsallis policy improvement step is replaced by updating a policy network.

Our algorithm maintains five networks to model a policy $\pi_\phi$, a state value $V_\psi$, a target value $V_{\psi^-}$, and two action values $Q_{\theta_1}$ and $Q_{\theta_2}$. We also utilize a replay buffer $\mathcal{D}$, in which every interaction is stored and from which samples are drawn when updating the networks. The value networks $V_\psi$, $Q_{\theta_1}$, and $Q_{\theta_2}$ are updated using (17) and $\pi_\phi$ is updated using (18). Since (18) has the expectation over $\pi_\phi$, which is intractable, a stochastic gradient of (18) is required to update $\pi_\phi$. We employ the reparameterization trick to approximate the stochastic gradient. In our implementation, we model a policy function as a hyperbolic tangent of a Gaussian random variable, which was first introduced in haarnoja2018sac (), i.e., $a = \tanh(\mu_\phi(s) + \epsilon\,\sigma_\phi(s))$, where $\epsilon \sim \mathcal{N}(0, I)$, and $\mu_\phi(s)$ and $\sigma_\phi(s)$ are the outputs of $\pi_\phi$. Then, the gradient of (18) becomes $\mathbb{E}_{s \sim \mathcal{D}}\left[\mathbb{E}_{\epsilon}\left[\nabla_\phi Q_\theta(s,a) - \alpha\nabla_\phi \ln_q(\pi_\phi(a|s))\right]\right]$, where $a = \tanh(\mu_\phi(s) + \epsilon\,\sigma_\phi(s))$, $\mathcal{D}$ indicates a replay buffer, and $\alpha$ is a coefficient of the Tsallis entropy. Thus, the gradient of $\ln_q$ plays an important role in exploring the environment. Finally, $\psi^-$ is updated towards $\psi$ using an exponential moving average method. Algorithmic details are similar to the soft actor-critic (SAC) algorithm, which is known to be the state of the art. Since we generalize the fundamental Bellman equation to the maximum Tsallis entropy case, the Tsallis entropy can be applied to existing RL methods with the SG entropy by replacing the entropy term. Due to the space limitation, more detailed settings are explained in the supplementary material, where the implementation of TAC is also included; it is publicly available.
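The role of the entropy term in the policy update can be illustrated without a neural network. The sketch below draws a one-dimensional action from a tanh-squashed Gaussian using an explicit noise value, computes its log-density with the standard change-of-variables correction used for tanh-squashed policies, and evaluates the per-sample Tsallis bonus $-\alpha \ln_q(\pi(a|s))$. It is not the authors' implementation: the function names, the scalar setting, and the small constant `1e-6` (a numerical guard against saturation) are our assumptions.

```python
import math


def ln_q(x: float, q: float) -> float:
    """q-logarithm for x > 0."""
    if q == 1.0:
        return math.log(x)
    return (x ** (q - 1.0) - 1.0) / (q - 1.0)


def tanh_gaussian_sample(mu: float, sigma: float, eps: float):
    """Reparameterized action a = tanh(mu + sigma * eps) and its log-density."""
    u = mu + sigma * eps
    a = math.tanh(u)
    log_normal = (-0.5 * ((u - mu) / sigma) ** 2
                  - math.log(sigma) - 0.5 * math.log(2.0 * math.pi))
    # Change of variables: pi(a) = N(u) / (1 - tanh(u)^2); 1e-6 avoids log(0).
    log_pi = log_normal - math.log(1.0 - a * a + 1e-6)
    return a, log_pi


def tsallis_bonus(log_pi: float, q: float, alpha: float) -> float:
    """Per-sample entropy bonus -alpha * ln_q(pi(a|s))."""
    return -alpha * ln_q(math.exp(log_pi), q)


if __name__ == "__main__":
    action, log_pi = tanh_gaussian_sample(mu=0.0, sigma=1.0, eps=0.0)
    for q in (1.0, 1.5, 2.0):
        print(q, action, tsallis_bonus(log_pi, q, alpha=1.0))
```

With $q = 1$ the bonus is the familiar $-\alpha \log \pi(a|s)$ used in SAC; for other $q$ only the entropy term changes, which matches the statement that existing SG-entropy methods can be adapted by replacing that term.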

## 7 Experiment

In the experiments, we verify the effect of the entropic index on exploration and compare our algorithm to the existing state-of-the-art actor-critic methods on continuous control problems using the MuJoCo simulator: HalfCheetah-v2, Ant-v2, Pusher-v2, Humanoid-v2, Hopper-v2, and Swimmer-v2. Note that results for Hopper-v2 and Swimmer-v2 are included in the supplementary material. We first evaluate how a different entropic index influences the exploration of TAC. As the entropic index changes the structure of the Tsallis entropy, different entropic indices cause different types of exploration. We also compare our method to various on-policy and off-policy actor-critic methods. For on-policy methods, we compare against trust region policy optimization (TRPO) schulman2015trpo (), which slowly updates a policy network within the trust region to obtain numerical stability, and proximal policy optimization (PPO) schulman2017ppo (), which utilizes importance ratio clipping for stable learning; in both, a value network is employed for generalized advantage estimation schulman2015gae (). For off-policy methods, we compare against deep deterministic policy gradient (DDPG) lillicrap2015ddpg (), whose policy is modeled as a deterministic function instead of a stochastic policy and is updated using the deterministic policy gradient, and twin delayed deep deterministic policy gradient (TD3) fujimoto2018td3 (), which modifies the DDPG method by applying two Q networks to increase stability and prevent overestimation. We also compare the soft actor-critic (SAC) method haarnoja2018sac (), which employs the Shannon-Gibbs entropy for exploration. Since TAC can be reduced to SAC with $q = 1$ and algorithmic details are the same, we denote TAC with $q = 1$ as SAC. For other algorithms, we utilize OpenAI's implementations and extend the SAC algorithm to TAC by replacing the SG entropy with the Tsallis entropy with the entropic index $q$. We exclude chen2018tsallisensembles (), which can also utilize the Tsallis entropy in the Q learning method, since chen2018tsallisensembles () is only applicable to discrete action spaces.

### 7.1 Effect of Entropic Index

To verify the effect of the entropic index, we conduct experiments with a wide range of $q$ values and measure the total average returns during the training phase. We only change the entropic index and fix the entropy coefficient $\alpha$, using one value for Humanoid-v2 and another for the other problems. We run all algorithms with ten different random seeds and the results are shown in Figure 6. We observe that the proposed method performs better when $1 \le q < 2$ than when $0 < q < 1$ and $q \ge 2$, in terms of stable convergence and final total average returns. Using $0 < q < 1$ generally shows poor performance since it hinders exploitation more strongly than the SG entropy. For $1 < q < 2$, the Tsallis entropy penalizes the greediness of a policy less than the SG entropy (or $q = 1$) does since, for the same probability distribution, the value of $S_q$ is always less than $S_1$. However, the approximated stochastic gradient follows the gradient of $\ln_q$, which decreases as the policy probability increases, similar to the SG entropy. Thus, the Tsallis entropy within $1 \le q < 2$ not only encourages exploration but also allows the policy to converge to a greedy policy. However, when $q \ge 2$, the value of the Tsallis entropy is smaller than that of the SG entropy for a given distribution and the gradient of the $q$-logarithm does not decrease with the policy probability: it is constant for $q = 2$ and grows with the probability for $q > 2$. Then, the action with smaller probability is less encouraged to be explored since its gradient is no larger than that of the action with larger probability. Consequently, the Tsallis MDP shows an early convergence. In this regard, we can see that TAC with $1 \le q < 2$ outperforms TAC with $q \ge 2$. Furthermore, the best-performing entropic index within $1 \le q < 2$ differs across environments: HalfCheetah-v2 and Ant-v2 favor one value, while Humanoid-v2 favors another (see Figure 6). In Pusher-v2, the final total average returns of all settings are similar, but one setting shows slightly faster convergence. We believe that these results empirically show that there exists a most appropriate value of $q$ between one and two depending on the environment, while $q \ge 2$ has a negative effect on exploration.

### 7.2 Comparative Evaluation

Figure 7 shows the total average returns of TAC and the other compared methods. We use the best $q$ value from the previous experiments. For SAC, the best hyperparameter reported in haarnoja2018sac is used. To verify that there exists a more suitable entropic index than $q=1$, we set all hyperparameters in TAC to be the same as those of SAC and change only the $q$ value. SAC, TAC, and DDPG use the same architectures for the actor and critic networks, which are single-layered fully connected neural networks. However, TRPO and PPO employ a smaller architecture with fewer units, since they show poor performance when a large network is used. The full experimental settings are explained in the supplementary material. To obtain consistent results, we run all algorithms with ten different random seeds. TAC with a proper $q$ value outperforms all existing methods in all environments. While SAC generally outperforms DDPG, PPO, and TRPO and shows similar or better performance than TD3, except in Ant-v2, TAC achieves better performance with a smaller number of samples than SAC in all problems. In particular, in Ant-v2, TAC achieves the best performance, while SAC is the third best. Furthermore, in Humanoid-v2, which has the largest action space among all problems, TAC outperforms all the other methods dramatically. Although the hyperparameters for TAC are not optimized, simply changing the $q$ value achieves a significant performance improvement. These results demonstrate that, by properly setting the $q$ value, TAC can achieve state-of-the-art performance.

## 8 Conclusion

We have proposed a unified framework for maximum entropy RL problems. The proposed maximum Tsallis entropy MDP generalizes an MDP with Shannon-Gibbs entropy maximization and allows a diverse range of entropies. We have shown the optimality condition of a Tsallis MDP and derived convergence and optimality guarantees for the proposed dynamic programming for Tsallis MDPs. We have also presented the Tsallis actor-critic (TAC) method, which can handle a continuous state-action space for model-free RL problems. We have observed that a suitable entropic index exists for each RL problem and that TAC with a specific entropic index outperforms all compared actor-critic methods. One valuable extension of this work is to learn a proper entropic index for a given task, which we leave as future work.

## Appendix A Tsallis Entropy

We show that the Tsallis entropy is a concave function over the distribution and has its maximum at a uniform distribution. Note that this is a well-known fact, but we restate it to make the manuscript self-contained.

###### Proposition 1.

Assume that $\mathcal{X}$ is a finite space. Let $P$ be a probability distribution over $\mathcal{X}$. If $q > 0$, then $S_q(P)$ is concave with respect to $P$.

###### Proof.

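Let us consider the function $f(x) = -x\ln_q(x)$ defined over $x > 0$. The second derivative of $f$ is computed as

$$\frac{d^2 f(x)}{dx^2} = -q x^{q-2} < 0 \quad (x > 0,\; q > 0).$$

Thus, $f$ is a concave function. Using this fact, we show that the following inequality holds. For $\lambda_1, \lambda_2 \geq 0$ such that $\lambda_1 + \lambda_2 = 1$, and probabilities $P_1$ and $P_2$, applying the concavity of $f$ term by term gives

$$\begin{aligned}
S_q(\lambda_1 P_1 + \lambda_2 P_2) &= \sum_x -\left(\lambda_1 P_1(x) + \lambda_2 P_2(x)\right)\ln_q\left(\lambda_1 P_1(x) + \lambda_2 P_2(x)\right) \\
&\geq \sum_x -\lambda_1 P_1(x)\ln_q(P_1(x)) - \lambda_2 P_2(x)\ln_q(P_2(x)) \\
&= \lambda_1 S_q(P_1) + \lambda_2 S_q(P_2).
\end{aligned}$$

Consequently, $S_q(P)$ is concave with respect to $P$. ∎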
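As a numerical sanity check of Proposition 1, the following Python sketch samples random pairs of distributions and evaluates the concavity gap $S_q(\lambda P_1 + (1-\lambda)P_2) - \lambda S_q(P_1) - (1-\lambda)S_q(P_2)$, which should be nonnegative. It assumes $S_q(P) = -\sum_x P(x)\ln_q(P(x))$ with $\ln_q(x) = (x^{q-1}-1)/(q-1)$; the helper names are hypothetical and serve only this illustration.

```python
import math
import random

def q_log(x, q):
    """q-logarithm (x^(q-1) - 1) / (q - 1); reduces to log(x) at q = 1."""
    if abs(q - 1.0) < 1e-12:
        return math.log(x)
    return (x ** (q - 1.0) - 1.0) / (q - 1.0)

def tsallis_entropy(p, q):
    """S_q(P) = -sum_x P(x) ln_q(P(x)); zero-probability terms contribute 0."""
    return -sum(px * q_log(px, q) for px in p if px > 0.0)

def random_distribution(n, rng):
    """Draw a strictly positive probability vector of length n."""
    weights = [rng.random() + 1e-9 for _ in range(n)]
    total = sum(weights)
    return [w / total for w in weights]

def concavity_gap(p1, p2, lam, q):
    """Entropy of the mixture minus the mixture of entropies."""
    mix = [lam * a + (1.0 - lam) * b for a, b in zip(p1, p2)]
    mixed_entropy = lam * tsallis_entropy(p1, q) + (1.0 - lam) * tsallis_entropy(p2, q)
    return tsallis_entropy(mix, q) - mixed_entropy

if __name__ == "__main__":
    rng = random.Random(0)
    p1, p2 = random_distribution(4, rng), random_distribution(4, rng)
    for q in (0.5, 1.0, 1.5, 2.0):
        print(q, concavity_gap(p1, p2, 0.3, q))
```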

###### Proposition 2.

Assume that $\mathcal{X}$ is a finite space. Then, $S_q(P)$ is maximized when $P$ is a uniform distribution, i.e., $P(x) = 1/|\mathcal{X}|$, where $|\mathcal{X}|$ is the number of elements in $\mathcal{X}$.

###### Proof.

We would like to employ the KKT condition on the following optimization problem.

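$$\max_{P \in \Delta} S_q(P) \tag{21}$$

where $\Delta$ is a probability simplex. Since $\mathcal{X}$ is finite, the optimization variables are the probability masses defined over each element. The KKT conditions of (21) are

$$\begin{aligned}
&\forall x \in \mathcal{X},\quad \left.\frac{\partial\left(S_q(P) - \sum_x \lambda^\star(x)P(x) - \mu^\star\left(1 - \sum_x P(x)\right)\right)}{\partial P(x)}\right|_{P(x) = P^\star(x)} \\
&\qquad\qquad = -\ln_q(P^\star(x)) - (P^\star(x))^{q-1} - \lambda^\star(x) + \mu^\star \\
&\qquad\qquad = -q\ln_q(P^\star(x)) - 1 - \lambda^\star(x) + \mu^\star = 0, \\
&\forall x \in \mathcal{X},\quad 0 = 1 - \sum_x P^\star(x),\quad P^\star(x) \geq 0, \\
&\forall x \in \mathcal{X},\quad \lambda^\star(x) \leq 0, \\
&\forall x \in \mathcal{X},\quad \lambda^\star(x)P^\star(x) = 0,
\end{aligned}$$

where $\lambda^\star$ and $\mu^\star$ are the Lagrangian multipliers for the constraints in $\Delta$. First, let us consider $P^\star(x) > 0$. Then, $\lambda^\star(x) = 0$ from the last condition (complementary slackness). The first condition implies

$$P^\star(x) = \exp_q\left(\frac{\mu^\star - 1}{q}\right).$$

Hence, $P^\star(x)$ has constant probability mass, which means $P^\star(x) = 1/|S|$ where $S = \{x \mid P^\star(x) > 0\}$. The optimal value is $S_q(P^\star) = -\ln_q(1/|S|)$. Since $-\ln_q(x)$ is a monotonically decreasing function, $|S|$ should be as large as possible. Hence, $S = \mathcal{X}$ and $P^\star(x) = 1/|\mathcal{X}|$. ∎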
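Proposition 2 can be checked numerically with a short Python sketch: the entropy of the uniform distribution should equal $-\ln_q(1/n)$ and should upper-bound the entropy of any other distribution on $n$ elements. The sketch assumes the same definitions of $\ln_q$ and $S_q$ as in the proof; the function names are illustrative only.

```python
import math
import random

def q_log(x, q):
    """q-logarithm (x^(q-1) - 1) / (q - 1); reduces to log(x) at q = 1."""
    if abs(q - 1.0) < 1e-12:
        return math.log(x)
    return (x ** (q - 1.0) - 1.0) / (q - 1.0)

def tsallis_entropy(p, q):
    """S_q(P) = -sum_x P(x) ln_q(P(x)); zero-probability terms contribute 0."""
    return -sum(px * q_log(px, q) for px in p if px > 0.0)

def uniform_entropy(n, q):
    """Closed form for the uniform distribution on n elements: -ln_q(1/n)."""
    return -q_log(1.0 / n, q)

def random_distribution(n, rng):
    """Draw a strictly positive probability vector of length n."""
    weights = [rng.random() + 1e-9 for _ in range(n)]
    total = sum(weights)
    return [w / total for w in weights]

if __name__ == "__main__":
    rng = random.Random(0)
    n, q = 4, 1.5
    # Best entropy found among random candidates, next to the uniform value.
    best = max(tsallis_entropy(random_distribution(n, rng), q) for _ in range(1000))
    print(best, uniform_entropy(n, q))
```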

## Appendix B Bandit with Maximum Tsallis Entropy

We first analyze a bandit setting with maximum Tsallis entropy, which is defined as

$$\max_{\pi \in \Delta}\left[\mathbb{E}_{a \sim \pi}[R] + S_q(\pi)\right] \tag{22}$$

The following propositions have already been established in several works. However, we restate them to introduce the concept of maximum Tsallis entropy in an expectation maximization problem.

###### Proposition 3.

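The optimal solution of (22) is

$$\pi^\star_q(a) = \exp_q\left(\frac{r(a)}{q} - \psi_q\left(\frac{r}{q}\right)\right), \tag{23}$$

where the q-potential function $\psi_q$ is a functional of the reward $r$. Its value $\psi_q(r/q)$ is determined uniquely for a given $r$ by the following normalization condition:

$$\sum_a \exp_q\left(\frac{r(a)}{q} - \psi_q\left(\frac{r}{q}\right)\right) = 1. \tag{24}$$

Furthermore, using $\pi^\star_q$, the optimal value can be written as

$$(q-1)\sum_a \frac{r(a)}{q}\,\pi^\star_q(a) + \psi_q\left(\frac{r}{q}\right). \tag{25}$$
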
###### Proof.

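It is easy to check that $\psi_q(r/q)$ exists uniquely for a given $r$. Indeed, because $\exp_q$ is a continuous monotonic function, for any $r$, the sum $\sum_a \exp_q(r(a)/q - c)$ converges to $0$ and $\infty$ as $c$ goes to $\infty$ and $-\infty$, respectively. Therefore, by the intermediate value theorem, there exists a unique constant $c^\star$ such that $\sum_a \exp_q(r(a)/q - c^\star) = 1$. Hence it is sufficient to take $\psi_q(r/q) = c^\star$.

To show the remaining claims, we mainly employ convex optimization techniques. Since $S_q(\pi)$ is concave and the expectation and constraints of (22) are linear with respect to $\pi$, the problem is concave. Thus, strong duality holds and we can use the KKT conditions to obtain an optimal solution.

Problem (22) has two constraints: sum-to-one and nonnegativity. Let $\mu$ be a dual variable for $1 - \sum_a \pi(a) = 0$ and $\lambda(a)$ be a dual variable for $\pi(a) \geq 0$. Then, the KKT conditions are as follows:

$$\begin{aligned}
& 1 - \sum_a \pi^\star_q(a) = 0, \\
& \forall a,\quad \pi^\star_q(a) \geq 0, \\
& \forall a,\quad \lambda^\star(a) \leq 0, \\
& \forall a,\quad \lambda^\star(a)\,\pi^\star_q(a) = 0, \\
& \forall a,\quad r(a) - \mu^\star - \ln_q(\pi^\star_q(a)) - (\pi^\star_q(a))^{q-1} + \lambda^\star(a) = 0,
\end{aligned} \tag{26}$$

where $\star$ indicates an optimal solution. We focus on the last condition, which is converted into

$$\begin{aligned}
0 &= r(a) - \mu^\star - \ln_q(\pi^\star_q(a)) - (\pi^\star_q(a))^{q-1} + \lambda^\star(a) \\
0 &= r(a) - \mu^\star - \ln_q(\pi^\star_q(a)) - (q-1)\frac{(\pi^\star_q(a))^{q-1} - 1}{q-1} - 1 + \lambda^\star(a) \\
0 &= r(a) - \mu^\star - q\ln_q(\pi^\star_q(a)) - 1 + \lambda^\star(a)
\end{aligned} \tag{27}$$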

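First, let us consider the positive measure ($\pi^\star_q(a) > 0$). Then, $\lambda^\star(a) = 0$ by complementary slackness and, from equation (27),

$$\exp_q\left(\frac{r(a)}{q} - \frac{\mu^\star + 1}{q}\right) = \pi^\star_q(a) \tag{28}$$

and $\mu^\star$ can be found by solving the following equation:

$$\sum_a \exp_q\left(\frac{r(a)}{q} - \frac{\mu^\star + 1}{q}\right) = 1. \tag{29}$$

Since equation (29) is exactly the same as the normalization equation (24), $\mu^\star$ can be found using the q-potential function $\psi_q$:

$$\mu^\star = q\,\psi_q\left(\frac{r}{q}\right) - 1 \tag{30}$$

Then,

$$\pi^\star_q(a) = \exp_q\left(\frac{r(a)}{q} - \psi_q\left(\frac{r}{q}\right)\right). \tag{31}$$

The optimal value of the primal problem is

$$\begin{aligned}
\mathbb{E}_{a \sim \pi^\star_q}[r(a)] + S_q(\pi^\star_q) &= \sum_a \pi^\star_q(a)\left[r(a) - \ln_q(\pi^\star_q(a))\right] \\
&= \sum_a \pi^\star_q(a)\left[r(a) - \frac{r(a)}{q} + \psi_q\left(\frac{r}{q}\right)\right] \\
&= (q-1)\sum_a \frac{r(a)}{q}\,\pi^\star_q(a) + \psi_q\left(\frac{r}{q}\right).
\end{aligned} \tag{32}$$

Finally, let us check the supporting set. For $\pi^\star_q(a) > 0$, the following condition should be satisfied:

$$1 + (q-1)\left(\frac{r(a)}{q} - \psi_q\left(\frac{r}{q}\right)\right) > 0, \tag{33}$$

where this condition comes from the definition of $\exp_q$. ∎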
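The proof suggests a direct numerical procedure: because the normalization sum is monotone in the candidate potential, $\psi_q(r/q)$ can be located by bisection and the policy then follows from (31). The Python sketch below is one such procedure, not the authors' implementation. It assumes $\exp_q(x) = [1 + (q-1)x]_+^{1/(q-1)}$, restricts itself to $q \geq 1$ for simplicity, and uses hypothetical helper names (`q_potential`, `optimal_policy`, `objective`).

```python
import math

def q_log(x, q):
    """q-logarithm (x^(q-1) - 1) / (q - 1); reduces to log(x) at q = 1."""
    if abs(q - 1.0) < 1e-12:
        return math.log(x)
    return (x ** (q - 1.0) - 1.0) / (q - 1.0)

def q_exp(x, q):
    """q-exponential [1 + (q-1)x]_+^(1/(q-1)); exp(x) at q = 1. Assumes q >= 1."""
    if abs(q - 1.0) < 1e-12:
        return math.exp(x)
    base = 1.0 + (q - 1.0) * x
    return base ** (1.0 / (q - 1.0)) if base > 0.0 else 0.0

def normalizer(z, psi, q):
    """Left-hand side of the normalization condition for a candidate psi."""
    return sum(q_exp(zi - psi, q) for zi in z)

def q_potential(r, q, iters=200):
    """Solve sum_a exp_q(r(a)/q - psi) = 1 for psi by bisection."""
    z = [ra / q for ra in r]
    lo = max(z)                      # the largest term alone gives exp_q(0) = 1
    step = 1.0
    while normalizer(z, lo + step, q) > 1.0:
        step *= 2.0                  # grow the bracket until the sum drops to <= 1
    hi = lo + step
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if normalizer(z, mid, q) > 1.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def optimal_policy(r, q):
    """Return (pi, psi) with pi(a) = exp_q(r(a)/q - psi)."""
    psi = q_potential(r, q)
    return [q_exp(ra / q - psi, q) for ra in r], psi

def objective(pi, r, q):
    """Bandit objective E_{a~pi}[r(a)] + S_q(pi)."""
    expected_reward = sum(p * ra for p, ra in zip(pi, r))
    entropy = -sum(p * q_log(p, q) for p in pi if p > 0.0)
    return expected_reward + entropy

if __name__ == "__main__":
    rewards = [1.0, 0.2, -0.5]
    for q in (1.0, 1.5, 2.0):
        pi, psi = optimal_policy(rewards, q)
        print(q, [round(p, 4) for p in pi], round(objective(pi, rewards, q), 4))
```

Evaluating `objective` at the returned policy should agree with the closed-form value $(q-1)\sum_a \frac{r(a)}{q}\pi^\star_q(a) + \psi_q(r/q)$ up to numerical tolerance.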

###### Proposition 4.

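When $q = 1$ and $q = 2$, $\pi^\star_q$ and $\psi_q$ are computed as follows:

$$\pi^\star_1(a) = \exp\left(r(a) - \psi_1(r)\right), \qquad \psi_1(r) = \log\sum_a \exp(r(a))$$

and

$$\pi^\star_2(a) = \left[1 + \frac{r(a)}{2} - \psi_2\left(\frac{r}{2}\right)\right]_+, \qquad \psi_2\left(\frac{r}{2}\right) = 1 + \frac{\sum_{a \in S} r(a)/2 - 1}{|S|},$$

where $S$ is a supporting set, i.e., $S = \{a \mid 1 + r(a)/2 - \psi_2(r/2) > 0\}$.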

###### Proof.

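Let us start from the KKT conditions in the proof of Proposition 3. When $q = 1$, equation (29) becomes

$$\sum_a \exp\left(r(a) - \mu^\star - 1\right) = 1.$$

Then,

$$\begin{aligned}
\sum_a \exp(r(a)) &= \exp(\mu^\star + 1) \\
\log\left(\sum_a \exp(r(a))\right) &= \mu^\star + 1 \\
\mu^\star + 1 &= \psi_1(r)
\end{aligned}$$

Finally,

$$\pi^\star_1(a) = \exp\left(r(a) - \psi_1(r)\right), \qquad \psi_1(r) = \log\sum_a \exp(r(a)).$$

When $q = 2$, let us consider the supporting set $S$ of $\pi^\star_2$. Then, since $\exp_2(x) = [1 + x]_+$, the normalization condition (24) restricted to $S$ gives

$$\sum_{a \in S}\left(1 + \frac{r(a)}{2} - \psi_2\left(\frac{r}{2}\right)\right) = 1.$$

Thus,

$$\psi_2\left(\frac{r}{2}\right) = 1 + \frac{\sum_{a \in S} r(a)/2 - 1}{|S|}$$

∎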
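The two closed forms can be sketched in Python as follows. For $q = 1$ the policy is the usual softmax. For $q = 2$ the supporting set is not known in advance, so the sketch finds it by sorting, which is a standard way to compute such sparse projections and is an assumption of this example rather than a step stated in the proof. The function names are illustrative.

```python
import math

def softmax_policy(r):
    """q = 1: pi(a) = exp(r(a) - psi_1(r)) with psi_1(r) = log sum_a exp(r(a))."""
    m = max(r)  # subtract the maximum for numerical stability
    psi = m + math.log(sum(math.exp(ra - m) for ra in r))
    return [math.exp(ra - psi) for ra in r], psi

def sparse_policy(r):
    """q = 2: pi(a) = max(0, 1 + r(a)/2 - psi_2(r/2))."""
    z = sorted((ra / 2.0 for ra in r), reverse=True)
    cumsum, psi = 0.0, None
    for k, zk in enumerate(z, start=1):
        cumsum += zk
        if 1.0 + k * zk > cumsum:
            # The k largest actions form a valid supporting set S with |S| = k,
            # so psi_2 = 1 + (sum_{a in S} r(a)/2 - 1) / |S|.
            psi = 1.0 + (cumsum - 1.0) / k
    return [max(0.0, 1.0 + ra / 2.0 - psi) for ra in r], psi

if __name__ == "__main__":
    for rewards in ([1.0, 0.0], [4.0, 0.0], [1.0, 0.2, -0.5]):
        print(rewards, softmax_policy(rewards)[0], sparse_policy(rewards)[0])
```

With a large reward gap, the $q = 2$ policy assigns exactly zero probability to the worse action, while the $q = 1$ policy always keeps every action in its support.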

## Appendix C q-Maximum: Bounded Approximation of Maximum

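Now, we prove the property of the q-maximum, which is defined by

$$\underset{x}{\text{q-max}}\,(f(x)) = \max_{P \in \Delta}\left[\mathbb{E}_{X \sim P}[f(X)] + S_q(P)\right].$$

###### Theorem 8.

For any function $f$ defined on a finite input space $\mathcal{X}$, the q-maximum satisfies the following inequalities.

$$\underset{x}{\text{q-max}}\,(f(x)) + \ln_q\left(\frac{1}{|\mathcal{X}|}\right) \leq \max_x f(x) \leq \underset{x}{\text{q-max}}\,(f(x)) \tag{34}$$

where $|\mathcal{X}|$ is the cardinality of $\mathcal{X}$.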

###### Proof.

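First, consider the lower bound. Let $\Delta$ be a probability simplex. Then,

$$\begin{aligned}
\underset{x}{\text{q-max}}\,(f(x)) &= \max_{P \in \Delta}\left[\mathbb{E}_{X \sim P}[f(X)] + S_q(P)\right] \\
&\leq \max_{P \in \Delta}\mathbb{E}_{X \sim P}[f(X)] + \max_{P \in \Delta} S_q(P) = \max_x f(x) - \ln_q\left(\frac{1}{|\mathcal{X}|}\right)
\end{aligned} \tag{35}$$

where $S_q$ has its maximum at a uniform distribution.

The upper bound can be proven using a similar technique. Let $P'$ be the distribution whose probability is concentrated at a maximum element, which means that if $x = \arg\max_{x'} f(x')$, then $P'(x) = 1$ and, otherwise, $P'(x) = 0$. If there are multiple maxima of $f$, one of them can be chosen arbitrarily. Then, the Tsallis entropy of $P'$ becomes zero since all probability mass is concentrated at a single instance, i.e., $S_q(P') = 0$. Then, the upper bound can be computed as follows:

$$\begin{aligned}
\underset{x}{\text{q-max}}\,(f(x)) &= \max_{P \in \Delta}\left[\mathbb{E}_{X \sim P}[f(X)] + S_q(P)\right] \\
&\geq \mathbb{E}_{X \sim P'}[f(X)] + S_q(P') = f\left(\arg\max_{x'} f(x')\right) = \max_x f(x).
\end{aligned} \tag{36}$$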
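The bounds can be illustrated numerically for the two cases where the q-maximum has a simple closed form. The Python sketch below assumes that for $q = 1$ the q-maximum equals log-sum-exp, and that for $q = 2$ it can be evaluated at the sparse maximizer $P(x) = [1 + f(x)/2 - \psi_2]_+$ with $S_2(P) = 1 - \sum_x P(x)^2$. The function names are hypothetical.

```python
import math

def q_max_shannon(f):
    """q = 1: the q-maximum is log-sum-exp of f."""
    m = max(f)
    return m + math.log(sum(math.exp(v - m) for v in f))

def q_max_sparse(f):
    """q = 2: evaluate E_P[f] + S_2(P) at P(x) = max(0, 1 + f(x)/2 - psi)."""
    z = sorted((v / 2.0 for v in f), reverse=True)
    cumsum, psi = 0.0, None
    for k, zk in enumerate(z, start=1):
        cumsum += zk
        if 1.0 + k * zk > cumsum:      # the k largest elements are in the support
            psi = 1.0 + (cumsum - 1.0) / k
    p = [max(0.0, 1.0 + v / 2.0 - psi) for v in f]
    entropy = 1.0 - sum(px * px for px in p)   # S_2(P) = sum_x P(x)(1 - P(x))
    return sum(px * v for px, v in zip(p, f)) + entropy

def entropy_gap(n, q):
    """-ln_q(1/n): the largest possible gap between q-max and max."""
    if abs(q - 1.0) < 1e-12:
        return math.log(n)
    return -((1.0 / n) ** (q - 1.0) - 1.0) / (q - 1.0)

if __name__ == "__main__":
    f = [5.0, -2.0, 0.1, 4.9]
    print(max(f), q_max_shannon(f), max(f) + entropy_gap(len(f), 1.0))
    print(max(f), q_max_sparse(f), max(f) + entropy_gap(len(f), 2.0))
```

In both cases the printed q-maximum should lie between $\max_x f(x)$ and $\max_x f(x) - \ln_q(1/|\mathcal{X}|)$, with the upper end attained when $f$ is constant.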

## Appendix D Tsallis Bellman Optimality Equation

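A Markov decision process with Tsallis entropy maximization is formulated as follows.

$$\underset{\pi \in \Pi}{\text{maximize}}\;\; \mathbb{E}_{\tau \sim P, \pi}\left[\sum_t^\infty \gamma^t R_t\right] + \alpha S^\infty_q(\pi) \tag{37}$$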

In this section, we analyze the optimality condition of a Tsallis MDP.

###### Theorem 9.

An optimal policy and optimal value sufficiently and necessarily satisfy the following Tsallis-Bellman optimality (TBO) equations:

 (38)

where $\psi_q$ is a q-potential function.

Before starting the proof, we first recall two propositions and prove one lemma. They are mainly employed to convert the optimization variable from the policy $\pi$ to the state-action visitation $\rho_\pi$.

###### Proposition 5.

Let a state visitation be $\rho_\pi(s)$ and a state-action visitation be $\rho_\pi(s,a)$. The following relationships hold.

$$\rho_\pi(s) = \sum_a \rho_\pi(s,a), \qquad \rho_\pi(s,a) = \rho_\pi(s)\,\pi(a|s) \tag{39}$$
 ∑aρπ(s,