# Maximum Causal Tsallis Entropy Imitation Learning

Kyungjae Lee
Seoul National University
kyungjae.lee@cpslab.snu.ac.kr
Sungjoon Choi
Kakao Brain
sam.choi@kakaobrain.com
Songhwai Oh
Seoul National University
songhwai.oh@cpslab.snu.ac.kr
###### Abstract

In this paper, we propose a novel maximum causal Tsallis entropy (MCTE) framework for imitation learning which can efficiently learn a sparse multi-modal policy distribution from demonstrations. We provide the full mathematical analysis of the proposed framework. First, the optimal solution of an MCTE problem is shown to be a sparsemax distribution, whose supporting set can be adjusted. The proposed method has advantages over a softmax distribution in that it can exclude unnecessary actions by assigning zero probability. Second, we prove that an MCTE problem is equivalent to robust Bayes estimation in the sense of the Brier score. Third, we propose a maximum causal Tsallis entropy imitation learning (MCTEIL) algorithm with a sparse mixture density network (sparse MDN) by modeling mixture weights using a sparsemax distribution. In particular, we show that the causal Tsallis entropy of an MDN encourages exploration and efficient mixture utilization while Boltzmann Gibbs entropy is less effective. We validate the proposed method in two simulation studies and MCTEIL outperforms existing imitation learning methods in terms of average returns and learning multi-modal policies.


32nd Conference on Neural Information Processing Systems (NIPS 2018), Montréal, Canada.

## 1 Introduction

In this paper, we focus on the problem of imitating demonstrations of an expert who behaves non-deterministically depending on the situation. In imitation learning, it is often assumed that the expert’s policy is deterministic. However, there are instances, especially for complex tasks, where multiple action sequences perform the same task equally well. We can model such nondeterministic behavior of an expert using a stochastic policy. For example, expert drivers normally show consistent behaviors such as keeping lane or keeping the distance from a frontal car, but sometimes they show different actions for the same situation, such as overtaking a car and turning left or right at an intersection, as suggested in ziebart2008maximum (). Furthermore, learning multiple optimal action sequences to perform a task is desirable in terms of robustness since an agent can easily recover from failure due to unexpected events Haarnoja2017 (); lee2018sparse (). In addition, a stochastic policy promotes exploration and stability during learning Heess2012 (); Haarnoja2017 (); vamplew2017softmax (). Hence, modeling experts’ stochasticity can be a key factor in imitation learning.

To this end, we propose a novel maximum causal Tsallis entropy (MCTE) framework for imitation learning, which can learn from a uni-modal to multi-modal policy distribution by adjusting its supporting set. We first show that the optimal policy under the MCTE framework follows a sparsemax distribution martins2016softmax (), which has an adaptable supporting set in a discrete action space. Traditionally, the maximum causal entropy (MCE) framework ziebart2008maximum (); bloem2014infinite () has been proposed to model stochastic behavior in demonstrations, where the optimal policy follows a softmax distribution. However, it often assigns non-negligible probability mass to non-expert actions when the number of actions increases lee2018sparse (); nachum2018path (). On the contrary, as the optimal policy of the proposed method can adjust its supporting set, it can model various expert’s behavior from a uni-modal distribution to a multi-modal distribution.

To apply the MCTE framework to a complex and model-free problem, we propose a maximum causal Tsallis entropy imitation learning (MCTEIL) with a sparse mixture density network (sparse MDN) whose mixture weights are modeled as a sparsemax distribution. By modeling expert’s behavior using a sparse MDN, MCTEIL can learn varying stochasticity depending on the state in a continuous action space. Furthermore, we show that the MCTEIL algorithm can be obtained by extending the MCTE framework to the generative adversarial setting, similarly to generative adversarial imitation learning (GAIL) by Ho and Ermon ho2016generative (), which is based on the MCE framework. The main benefit of the generative adversarial setting is that the resulting policy distribution is more robust than that of a supervised learning method since it can learn recovery behaviors from less demonstrated regions to demonstrated regions by exploring the state-action space during training. Interestingly, we also show that the Tsallis entropy of a sparse MDN has an analytic form and is proportional to the distance between mixture means. Hence, maximizing the Tsallis entropy of a sparse MDN encourages exploration by providing bonus rewards to wide-spread mixture means and penalizing collapsed mixture means, while the causal entropy ziebart2008maximum () of an MDN is less effective in terms of preventing the collapse of mixture means since there is no analytical form and its approximation is used in practice instead. Consequently, maximizing the Tsallis entropy of a sparse MDN has a clear benefit over the causal entropy in terms of exploration and mixture utilization.

To validate the effectiveness of the proposed method, we conduct two simulation studies. In the first simulation study, we verify that MCTEIL with a sparse MDN can successfully learn multi-modal behaviors from expert’s demonstrations. A sparse MDN efficiently learns a multi-modal policy without performance loss, while a single Gaussian and a softmax-based MDN suffer from performance loss. The second simulation study is conducted using four continuous control problems in MuJoCo todorov2012mujoco (). MCTEIL outperforms existing methods in terms of the average cumulative return. In particular, MCTEIL shows the best performance for the reacher problem with a smaller number of demonstrations while GAIL often fails to learn the task.

## 2 Background

#### Markov Decision Processes

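Markov decision processes (MDPs) are a well-known mathematical framework for sequential decision making problems. A general MDP is defined as a tuple $\{\mathcal{S}, \mathcal{F}, \mathcal{A}, \phi, \Pi, d, T, \gamma, r\}$, where $\mathcal{S}$ is the state space, $\mathcal{F}$ is the corresponding feature space, $\mathcal{A}$ is the action space, $\phi$ is a feature map from $\mathcal{S} \times \mathcal{A}$ to $\mathcal{F}$, $\Pi$ is a set of stochastic policies, i.e., $\Pi = \{\pi \mid \forall s \in \mathcal{S}, a \in \mathcal{A},\ \pi(a|s) \geq 0 \text{ and } \sum_{a'} \pi(a'|s) = 1\}$, $d(s)$ is the initial state distribution, $T(s'|s,a)$ is the transition probability from $s$ to $s'$ by taking $a$, $\gamma \in (0,1)$ is a discount factor, and $r$ is the reward function from a state-action pair to a real value. In general, the goal of an MDP is to find the optimal policy distribution $\pi^{*} \in \Pi$ which maximizes the expected discounted sum of rewards, i.e., $\mathbb{E}_{\pi}[r(s,a)] \triangleq \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^{t} r(s_t, a_t) \mid \pi, d, T\right]$. Note that, for any function $f(s,a)$, $\mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^{t} f(s_t, a_t) \mid \pi, d, T\right]$ will be denoted as $\mathbb{E}_{\pi}[f(s,a)]$.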

#### Maximum Causal Entropy Inverse Reinforcement Learning

Ziebart et al. ziebart2008maximum () proposed the maximum causal entropy framework, which is also known as maximum entropy inverse reinforcement learning (MaxEnt IRL). MaxEnt IRL maximizes the causal entropy of a policy distribution while the feature expectation of the optimized policy distribution is matched with that of the expert’s policy. The maximum causal entropy framework is defined as follows:

$$
\begin{aligned}
\underset{\pi \in \Pi}{\text{maximize}} \quad & \alpha H(\pi) \\
\text{subject to} \quad & \mathbb{E}_{\pi}[\phi(s,a)] = \mathbb{E}_{\pi_E}[\phi(s,a)],
\end{aligned}
\tag{1}
$$

where $H(\pi) \triangleq \mathbb{E}_{\pi}[-\log \pi(a|s)]$ is the causal entropy of policy $\pi$, $\alpha$ is a scale parameter, and $\pi_E$ is the policy distribution of the expert. Maximum causal entropy estimation finds the most uniformly distributed policy satisfying the feature matching constraints. The feature expectation of the expert policy is used as a statistic to represent the behavior of an expert and is approximated from the expert’s demonstrations $\mathcal{D} = \{\zeta_0, \ldots, \zeta_N\}$, where $N$ is the number of demonstrations and $\zeta_i$ is a sequence of state and action pairs whose length is $T$, i.e., $\zeta_i = \{(s_0, a_0), \ldots, (s_T, a_T)\}$. In ziebart2010MPAs (), it is shown that the optimal solution of (1) is a softmax distribution.

In ho2016generative (), Ho and Ermon extended (1) to a unified framework for IRL by adding a reward regularization as follows:

$$
\max_{c} \min_{\pi \in \Pi} \; -\alpha H(\pi) + \mathbb{E}_{\pi}[c(s,a)] - \mathbb{E}_{\pi_E}[c(s,a)] - \psi(c), \tag{2}
$$

where $c$ is a cost function and $\psi$ is a convex regularization for cost $c$. As shown in ho2016generative (), many existing IRL methods can be interpreted with this framework, such as MaxEnt IRL ziebart2008maximum (), apprenticeship learning abbeel2004apprenticeship (), and multiplicative weights apprenticeship learning syed2008game (). Existing IRL methods based on (2) often require solving the inner minimization over $\pi$ for fixed $c$ in order to compute the gradient of $c$. In ziebart2010MPAs (), Ziebart showed that the inner minimization is equivalent to a soft Markov decision process (soft MDP) under the reward $-c$ and proposed soft value iteration to solve the soft MDP. However, solving a soft MDP at every iteration is often intractable for problems with large state and action spaces and also requires the transition probability, which is not accessible in many cases. To address this issue, the generative adversarial imitation learning (GAIL) framework is proposed in ho2016generative () to avoid solving the soft MDP problem directly. The unified imitation learning problem (2) can be converted into the GAIL framework as follows:

$$
\min_{\pi \in \Pi} \max_{D} \; \mathbb{E}_{\pi}[\log(D(s,a))] + \mathbb{E}_{\pi_E}[\log(1 - D(s,a))] - \alpha H(\pi), \tag{3}
$$

where $D$ indicates a discriminator, which returns the probability that a given demonstration is from a learner, i.e., $1$ for learner’s demonstrations and $0$ for expert’s demonstrations. Notice that we can interpret $\log(D)$ as cost $c$ (or reward of $-c$).

Since existing IRL methods, including GAIL, are often based on the maximum causal entropy, they model the expert’s policy using a softmax distribution, which can assign non-zero probability to non-expert actions in a discrete action space. Furthermore, in a continuous action space, expert’s behavior is often modeled using a uni-modal Gaussian distribution, which is not proper to model multi-modal behaviors. To handle these issues, we propose a sparsemax distribution as the policy of an expert and provide a natural extension to handle a continuous action space using a mixture density network with sparsemax weight selection.

#### Sparse Markov Decision Processes

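In lee2018sparse (), a sparse Markov decision process (sparse MDP) is proposed by adding the causal sparse Tsallis entropy $W(\pi) \triangleq \frac{1}{2}\mathbb{E}_{\pi}[1 - \pi(a|s)]$ to the expected discounted rewards sum, i.e., $\mathbb{E}_{\pi}[r(s,a)] + \alpha W(\pi)$. Note that $W(\pi)$ is an extension of a special case of the generalized Tsallis entropy, i.e., $S_{k,q}(p) = \frac{k}{q-1}\left(1 - \sum_i p_i^{q}\right)$, for $k = \frac{1}{2}$, $q = 2$, to sequential random variables. It is shown that the optimal policy of a sparse MDP is a sparse and multi-modal policy distribution lee2018sparse (). Furthermore, sparse Bellman optimality conditions were derived as follows:

$$
\begin{aligned}
Q(s,a) &\triangleq r(s,a) + \gamma \sum_{s'} V(s')\, T(s'|s,a), \\
\pi(a|s) &= \max\left(\frac{Q(s,a)}{\alpha} - \tau\left(\frac{Q(s,\cdot)}{\alpha}\right), 0\right), \\
V(s) &= \alpha\left[\frac{1}{2}\sum_{a \in S(s)}\left(\left(\frac{Q(s,a)}{\alpha}\right)^{2} - \tau\left(\frac{Q(s,\cdot)}{\alpha}\right)^{2}\right) + \frac{1}{2}\right],
\end{aligned}
\tag{4}
$$

where $\tau\left(\frac{Q(s,\cdot)}{\alpha}\right) = \frac{\sum_{a \in S(s)} \frac{Q(s,a)}{\alpha} - 1}{K}$, $S(s)$ is a set of actions satisfying $1 + i\,\frac{Q(s,a_{(i)})}{\alpha} > \sum_{j=1}^{i} \frac{Q(s,a_{(j)})}{\alpha}$ with $a_{(i)}$ indicating the action with the $i$th largest state-action value $Q(s,a)$, and $K$ is the cardinality of $S(s)$. In lee2018sparse (), a sparsemax policy shows better performance compared to a softmax policy since it assigns zero probability to non-optimal actions whose state-action value is below the threshold $\tau$. In this paper, we utilize this property in imitation learning by modeling the expert’s behavior using a sparsemax distribution. In Section 3, we show that the optimal solution of an MCTE problem also has a sparsemax distribution and, hence, the optimality condition of sparse MDPs is closely related to that of MCTE problems.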
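The sparsemax policy and the sparse value in (4) can be illustrated for a single state with a minimal sketch. The function names `sparsemax` and `sparse_value` are hypothetical, chosen here for illustration; the code follows the sorting-based threshold rule described above and is not taken from the authors' implementation.

```python
def sparsemax(z):
    """Return (probabilities, threshold tau) for a list of scores z = Q(s, .) / alpha."""
    cumsum, tau = 0.0, 0.0
    for i, z_i in enumerate(sorted(z, reverse=True), start=1):
        cumsum += z_i
        # The action with the i-th largest score is in the support S(s)
        # if 1 + i * z_(i) > sum_{j <= i} z_(j); keep the threshold of the last such i.
        if 1.0 + i * z_i > cumsum:
            tau = (cumsum - 1.0) / i
    probs = [max(z_i - tau, 0.0) for z_i in z]
    return probs, tau


def sparse_value(q, alpha):
    """V(s) = alpha * [0.5 * sum_{a in S(s)} ((Q/alpha)^2 - tau^2) + 0.5]."""
    z = [q_a / alpha for q_a in q]
    probs, tau = sparsemax(z)
    support_sum = sum(z_a ** 2 - tau ** 2 for z_a, p in zip(z, probs) if p > 0.0)
    return alpha * (0.5 * support_sum + 0.5)


probs, tau = sparsemax([2.0, 1.8, 0.1])
value = sparse_value([2.0, 1.8, 0.1], alpha=1.0)
```

For the scores (2.0, 1.8, 0.1) with $\alpha = 1$, the third action falls below the threshold and receives exactly zero probability, whereas a softmax would still assign it positive mass.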

## 3 Principle of Maximum Causal Tsallis Entropy

In this section, we formulate maximum causal Tsallis entropy imitation learning (MCTEIL) and show that MCTE induces a sparse and multi-modal distribution which has an adaptable supporting set. The problem of maximizing the causal Tsallis entropy can be formulated as follows:

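$$
\begin{aligned}
\underset{\pi \in \Pi}{\text{maximize}} \quad & \alpha W(\pi) \\
\text{subject to} \quad & \mathbb{E}_{\pi}[\phi(s,a)] = \mathbb{E}_{\pi_E}[\phi(s,a)].
\end{aligned}
\tag{5}
$$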

In order to derive optimality conditions, we will first change the optimization variable from a policy distribution to a state-action visitation measure. Then, we prove that the MCTE problem is concave with respect to the visitation measure. The necessary and sufficient conditions for an optimal solution are derived from the Karush-Kuhn-Tucker (KKT) conditions using the strong duality and the optimal policy is shown to be a sparsemax distribution. Furthermore, we also provide an interesting interpretation of the MCTE framework as robust Bayes estimation in terms of the Brier score. Hence, the proposed method can be viewed as maximization of the worst case performance in the sense of the Brier score brier1950verification ().

We can change the optimization variable from a policy distribution to a state-action visitation measure based on the following theorem.

###### Theorem 1 (Theorem 2 of Syed et al. syed2008apprenticeship ()).

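Let $\mathcal{M}$ be a set of state-action visitation measures, i.e., $\mathcal{M} \triangleq \{\rho \mid \forall s, a,\ \rho(s,a) \geq 0,\ \sum_{a} \rho(s,a) = d(s) + \gamma \sum_{s',a'} T(s|s',a')\rho(s',a')\}$. If $\rho \in \mathcal{M}$, then it is a state-action visitation measure for $\pi_{\rho}(a|s) \triangleq \frac{\rho(s,a)}{\sum_{a'}\rho(s,a')}$, and $\pi_{\rho}$ is the unique policy whose state-action visitation measure is $\rho$.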

###### Proof.

The proof can be found in syed2008apprenticeship (). ∎

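Theorem 1 guarantees the one-to-one correspondence between a policy distribution and a state-action visitation measure. Then, the objective function $W(\pi)$ is converted into a function of $\rho$ as follows.

###### Theorem 2.

Let $\bar{W}(\rho) = \frac{1}{2}\sum_{s,a}\rho(s,a)\left(1 - \frac{\rho(s,a)}{\sum_{a'}\rho(s,a')}\right)$. Then, for any stationary policy $\pi \in \Pi$ and any state-action visitation measure $\rho \in \mathcal{M}$, $W(\pi) = \bar{W}(\rho_{\pi})$ and $\bar{W}(\rho) = W(\pi_{\rho})$ hold.

The proof is provided in the supplementary material. Theorem 2 tells us that if $\bar{W}(\rho)$ has the maximum at $\rho^{*}$, then $W(\pi)$ also has the maximum at $\pi_{\rho^{*}}$. Based on Theorems 1 and 2, we can freely convert the problem (5) into

$$
\begin{aligned}
\underset{\rho \in \mathcal{M}}{\text{maximize}} \quad & \alpha \bar{W}(\rho) \\
\text{subject to} \quad & \sum_{s,a}\rho(s,a)\phi(s,a) = \sum_{s,a}\rho_E(s,a)\phi(s,a),
\end{aligned}
\tag{6}
$$

where $\rho_E$ is the state-action visitation measure corresponding to $\pi_E$.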

### 3.1 Optimality Condition of Maximum Causal Tsallis Entropy

We show that the optimal policy of the problem (6) is a sparsemax distribution using the KKT conditions. In order to use the KKT conditions, we first show that the MCTE problem is concave.

###### Theorem 3.

$\bar{W}(\rho)$ is strictly concave with respect to $\rho \in \mathcal{M}$.

The proof of Theorem 3 is provided in the supplementary material. Since all constraints are linear and the objective function is concave, (6) is a concave problem and, hence, strong duality holds. The dual problem is defined as follows:

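$$
\begin{aligned}
\max_{\theta, c, \lambda} \min_{\rho} \quad & L_W(\theta, c, \lambda, \rho) \\
\text{subject to} \quad & \forall s, a \quad \lambda_{sa} \geq 0,
\end{aligned}
\tag{7}
$$

where $L_W(\theta, c, \lambda, \rho) = -\alpha\bar{W}(\rho) - \sum_{s,a}\rho(s,a)\theta^{\intercal}\phi(s,a) + \sum_{s,a}\rho_E(s,a)\theta^{\intercal}\phi(s,a) - \sum_{s,a}\lambda_{sa}\rho(s,a) + \sum_{s} c_s\left(\sum_{a'}\rho(s,a') - d(s) - \gamma\sum_{s',a'}T(s|s',a')\rho(s',a')\right)$, and $\theta$, $c$, and $\lambda$ are Lagrangian multipliers, and the constraints come from $\mathcal{M}$. Then, the optimal primal and dual variables necessarily and sufficiently satisfy the KKT conditions.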

###### Theorem 4.

The optimal solution of (6) sufficiently and necessarily satisfies the following conditions:

$$
\begin{aligned}
q_{sa} &\triangleq \theta^{\intercal}\phi(s,a) + \gamma\sum_{s'} c_{s'}\, T(s'|s,a), \\
c_s &= \alpha\left[\frac{1}{2}\sum_{a \in S(s)}\left(\left(\frac{q_{sa}}{\alpha}\right)^{2} - \tau\left(\frac{q_s}{\alpha}\right)^{2}\right) + \frac{1}{2}\right], \\
\text{and} \quad \pi_{\rho}(a|s) &= \max\left(\frac{q_{sa}}{\alpha} - \tau\left(\frac{q_s}{\alpha}\right), 0\right),
\end{aligned}
$$

where $\tau\left(\frac{q_s}{\alpha}\right) = \frac{\sum_{a \in S(s)}\frac{q_{sa}}{\alpha} - 1}{K}$, $q_{sa}$ is an auxiliary variable, and $q_s = [q_{sa_1}, \ldots, q_{sa_{|\mathcal{A}|}}]^{\intercal}$.

The optimality conditions of the problem (6) tell us that the optimal policy is a sparsemax distribution which assigns zero probability to an action whose auxiliary variable $q_{sa}$ is below the threshold $\tau$, which determines a supporting set. If the expert’s policy is multi-modal at state $s$, the resulting $\pi_{\rho}(\cdot|s)$ becomes multi-modal and induces a multi-modal distribution with a large supporting set. Otherwise, the resulting policy has a sparse and smaller supporting set. Therefore, a sparsemax policy has advantages over a softmax policy for modeling sparse and multi-modal behaviors of an expert whose supporting set varies according to the state.

Furthermore, we also discover an interesting connection between the optimality condition of an MCTE problem and the sparse Bellman optimality condition (4). Since the optimality condition is equivalent to the sparse Bellman optimality equation lee2018sparse (), we can compute the optimal policy and Lagrangian multiplier $c_s$ by solving a sparse MDP under the reward function $r(s,a) = \theta^{*\intercal}\phi(s,a)$, where $\theta^{*}$ is the optimal dual variable. In addition, $c_s$ and $q_{sa}$ can be viewed as a state value and state-action value for the reward $\theta^{*\intercal}\phi(s,a)$, respectively.
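As a rough illustration of this connection, the sketch below iterates the sparse Bellman conditions on a tiny tabular MDP, treating the reward table as a stand-in for $\theta^{*\intercal}\phi(s,a)$. The two-state example, the function names, and the fixed iteration count are assumptions made for illustration; this is a plain fixed-point iteration, not the authors' solver.

```python
def sparsemax_threshold(z):
    """Threshold tau of the sparsemax distribution for scores z."""
    cumsum, tau = 0.0, 0.0
    for i, z_i in enumerate(sorted(z, reverse=True), start=1):
        cumsum += z_i
        if 1.0 + i * z_i > cumsum:
            tau = (cumsum - 1.0) / i
    return tau


def scaled_q(s, c, reward, trans, gamma, alpha):
    """q_sa / alpha, with q_sa = r(s,a) + gamma * sum_s' c_s' T(s'|s,a)."""
    n_states = len(c)
    return [
        (reward[s][a] + gamma * sum(trans[s][a][s2] * c[s2] for s2 in range(n_states))) / alpha
        for a in range(len(reward[s]))
    ]


def sparse_value_iteration(reward, trans, gamma, alpha, n_iters=500):
    """reward[s][a] is a float; trans[s][a][s2] is T(s2 | s, a)."""
    c = [0.0] * len(reward)  # c_s plays the role of the state value
    for _ in range(n_iters):
        new_c = []
        for s in range(len(reward)):
            z = scaled_q(s, c, reward, trans, gamma, alpha)
            tau = sparsemax_threshold(z)
            support = [z_a for z_a in z if z_a > tau]
            new_c.append(alpha * (0.5 * sum(z_a ** 2 - tau ** 2 for z_a in support) + 0.5))
        c = new_c
    policy = []
    for s in range(len(reward)):
        z = scaled_q(s, c, reward, trans, gamma, alpha)
        tau = sparsemax_threshold(z)
        policy.append([max(z_a - tau, 0.0) for z_a in z])
    return c, policy


# Hypothetical two-state MDP: action 0 stays in place, action 1 switches state.
reward = [[1.0, 0.0], [0.0, 0.5]]
trans = [[[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]]]
c, policy = sparse_value_iteration(reward, trans, gamma=0.9, alpha=0.1)
```

With a small $\alpha$ the value gaps dominate the entropy bonus, so the resulting policy places all mass on a single action in each state; larger $\alpha$ widens the supporting set.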

### 3.2 Interpretation as Robust Bayes

In this section, we provide an interesting interpretation about the MCTE framework. In general, maximum entropy estimation can be viewed as a minimax game between two players. One player is called a decision maker and the other player is called the nature, where the nature assigns a distribution to maximize the decision maker’s misprediction while the decision maker tries to minimize it grunwald2004game (). The same interpretation can be applied to the MCTE framework. We show that the proposed MCTE problem is equivalent to a minimax game with the Brier score brier1950verification ().

###### Theorem 5.

The maximum causal Tsallis entropy distribution minimizes the worst case prediction Brier score,

$$
\begin{aligned}
\min_{\pi \in \Pi} \max_{\tilde{\pi} \in \Pi} \quad & \mathbb{E}_{\tilde{\pi}}\left[\sum_{a'} \frac{1}{2}\left(\mathbb{1}\{a' = a\} - \pi(a'|s)\right)^{2}\right] \\
\text{subject to} \quad & \mathbb{E}_{\pi}[\phi(s,a)] = \mathbb{E}_{\pi_E}[\phi(s,a)],
\end{aligned}
\tag{8}
$$

where $\sum_{a'} \frac{1}{2}\left(\mathbb{1}\{a' = a\} - \pi(a'|s)\right)^{2}$ is the Brier score.

Note that, although it is called a score, a lower Brier score means fewer mispredictions, so it is minimized. Theorem 5 is a straightforward extension of the robust Bayes results in grunwald2004game () to sequential decision problems. This theorem tells us that the MCTE problem can be viewed as a minimax game between a sequential decision maker and the nature based on the Brier score. In this regard, the resulting estimator can be interpreted as the best decision maker against the worst that the nature can offer.
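The link between the Brier score and the Tsallis entropy can be checked numerically for a single state. In the sketch below (function names are hypothetical), the expected Brier score of a prediction that matches nature's distribution $p$ equals $\frac{1}{2}(1 - \sum_a p_a^2)$, and any other prediction scores worse.

```python
def brier_score(pred, action):
    """B(s, a) = sum_{a'} 0.5 * (1{a' = a} - pred[a'])^2 for one state."""
    return sum(
        0.5 * ((1.0 if a2 == action else 0.0) - p) ** 2 for a2, p in enumerate(pred)
    )


def expected_brier(nature, pred):
    """Expected Brier score of `pred` when actions are drawn from `nature`."""
    return sum(q * brier_score(pred, a) for a, q in enumerate(nature))


def tsallis_entropy(p):
    """Single-state Tsallis entropy 0.5 * (1 - sum_a p_a^2)."""
    return 0.5 * (1.0 - sum(x * x for x in p))


nature = [0.6, 0.4, 0.0]
matched = expected_brier(nature, nature)             # equals tsallis_entropy(nature)
mismatched = expected_brier(nature, [0.5, 0.3, 0.2])  # strictly larger
```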

## 4 Maximum Causal Tsallis Entropy Imitation Learning

In this section, we propose a maximum causal Tsallis entropy imitation learning (MCTEIL) algorithm to solve a model-free IL problem in a continuous action space. In many real-world problems, state and action spaces are often continuous and transition probability of a world cannot be accessed. To apply the MCTE framework for a continuous space and model-free case, we follow the extension of GAIL ho2016generative (), which trains a policy and reward alternatively, instead of solving RL at every iteration. We extend the MCTE framework to a more general case with reward regularization and it is formulated by replacing the causal entropy in the problem (2) with the causal Tsallis entropy as follows:

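$$
\max_{\theta} \min_{\pi \in \Pi} \; -\alpha W(\pi) - \mathbb{E}_{\pi}[\theta^{\intercal}\phi(s,a)] + \mathbb{E}_{\pi_E}[\theta^{\intercal}\phi(s,a)] - \psi(\theta). \tag{9}
$$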

Similarly to ho2016generative (), we convert the problem (9) into the generative adversarial setting as follows.

###### Theorem 6.

The maximum causal sparse Tsallis entropy problem (9) is equivalent to the following problem:

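$$
\min_{\pi \in \Pi} \; \psi^{*}\left(\mathbb{E}_{\pi}[\phi(s,a)] - \mathbb{E}_{\pi_E}[\phi(s,a)]\right) - \alpha W(\pi),
$$

where $\psi^{*}(x) = \sup_{y}\{y^{\intercal}x - \psi(y)\}$.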

The proof is detailed in the supplementary material. The proof of Theorem 6 depends on the fact that the objective function of (9) is convex with respect to $\rho$ and concave with respect to $\theta$. Hence, we first switch the optimization variable from $\pi$ to $\rho$ and, using the minimax theorem millar1983minimax (), the maximization and minimization are interchangeable and the generative adversarial setting is derived. Similarly to ho2016generative (), Theorem 6 says that an MCTE problem can be interpreted as minimization of the distance between the expert’s feature expectation and the training policy’s feature expectation, where $\psi^{*}$ is a proper distance function since it is a convex function. Let $e_{sa} \in \mathbb{R}^{|\mathcal{S}||\mathcal{A}|}$ be a feature indicator vector, such that the $sa$th element is one and zero elsewhere. If we set $\psi$ to $\psi_{GA}(\theta) \triangleq \mathbb{E}_{\pi_E}[g(\theta^{\intercal}e_{sa})]$, where $g(x) = -x - \log(1 - e^{x})$ for $x < 0$ and $g(x) = \infty$ for $x \geq 0$, we can convert the MCTE problem into the following generative adversarial setting:

$$
\min_{\pi \in \Pi} \max_{D} \; \mathbb{E}_{\pi}[\log(D(s,a))] + \mathbb{E}_{\pi_E}[\log(1 - D(s,a))] - \alpha W(\pi), \tag{10}
$$

where $D$ is a discriminator. The problem (10) can be solved by MCTEIL, which consists of three steps. First, trajectories are sampled from the training policy $\pi$. Second, the discriminator $D$ is updated to distinguish whether the trajectories are generated by $\pi$ or $\pi_E$. Finally, the training policy is updated with a policy optimization method under the sum of rewards $\mathbb{E}_{\pi}[-\log(D(s,a))]$ with a causal Tsallis entropy bonus $\alpha W(\pi)$. The algorithm is summarized in Algorithm 1.

#### Sparse Mixture Density Network

We further employ a novel mixture density network (MDN) with sparsemax weight selection, called a sparse MDN, which can model sparse and multi-modal behavior of an expert. In many imitation learning algorithms, a Gaussian network is often employed to model the expert’s policy in a continuous action space. However, a Gaussian distribution is inappropriate for modeling the multi-modality of an expert since it has a single mode. An MDN is more suitable for modeling a multi-modal distribution. In particular, a sparse MDN is a proper extension of a sparsemax distribution for a continuous action space. The input of a sparse MDN is state $s$ and the output of a sparse MDN is the components of $K$ mixtures of Gaussians: mixture weights $\{w_i\}$, means $\{\mu_i\}$, and covariance matrices $\{\Sigma_i\}$. A sparse MDN policy is defined as

$$
\pi(a|s) = \sum_{i=1}^{K} w_i(s)\, \mathcal{N}(a; \mu_i(s), \Sigma_i(s)),
$$

where $\mathcal{N}(a; \mu, \Sigma)$ indicates a multivariate Gaussian density at point $a$ with mean $\mu$ and covariance $\Sigma$. In our implementation, $w(s)$ is computed as a sparsemax distribution, while most existing MDN implementations utilize a softmax distribution. Modeling the expert’s policy using an MDN with $K$ mixtures can be interpreted as separating the continuous action space into $K$ representative actions. Since we model mixture weights using a sparsemax distribution, the number of mixtures used to model the expert’s policy can vary depending on the state. In this regard, the sparsemax weight selection has an advantage over the soft weight selection since the former utilizes mixture components more efficiently, as unnecessary components are assigned zero weight.

#### Tsallis Entropy of Mixture Density Network

An interesting fact is that the causal Tsallis entropy of an MDN has an analytic form while the Gibbs-Shannon entropy of an MDN is intractable.

###### Theorem 7.

Let $\pi(a|s) = \sum_{i=1}^{K} w_i(s)\, \mathcal{N}(a; \mu_i(s), \Sigma_i(s))$ and $\rho_{\pi}(s) = \sum_{a}\rho_{\pi}(s,a)$. Then,

$$
W(\pi) = \frac{1}{2}\sum_{s}\rho_{\pi}(s)\left(1 - \sum_{i=1}^{K}\sum_{j=1}^{K} w_i(s)\, w_j(s)\, \mathcal{N}\left(\mu_i(s); \mu_j(s), \Sigma_i(s) + \Sigma_j(s)\right)\right). \tag{11}
$$

The proof is included in the supplementary material. The analytic form shows that the Tsallis entropy grows as the mixture means move apart, since each pairwise Gaussian term in (11) shrinks with the distance between the corresponding means. Hence, maximizing the Tsallis entropy of a sparse MDN encourages exploration of diverse directions during the policy optimization step of MCTEIL. In imitation learning, the main benefit of the generative adversarial setting is that the resulting policy is more robust than that of supervised learning since it can learn how to recover from a less demonstrated region to a demonstrated region by exploring the state-action space during training. Maximizing the Tsallis entropy of a sparse MDN encourages efficient exploration by giving bonus rewards when mixture means are spread out. Equation (11) also has the effect of utilizing mixtures more efficiently by penalizing the modeling of a single mode with several mixtures. Consequently, the Tsallis entropy has clear benefits in terms of both exploration and mixture utilization.
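The per-state term of (11) is simple to evaluate. The sketch below assumes scalar actions, so each covariance reduces to a variance, and uses hypothetical function names; it compares a mixture whose two means coincide with one whose means are far apart.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a scalar Gaussian N(x; mean, var)."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def mdn_tsallis_term(weights, means, variances):
    """Per-state term 0.5 * (1 - sum_i sum_j w_i w_j N(mu_i; mu_j, var_i + var_j))."""
    overlap = 0.0
    for w_i, mu_i, var_i in zip(weights, means, variances):
        for w_j, mu_j, var_j in zip(weights, means, variances):
            overlap += w_i * w_j * gaussian_pdf(mu_i, mu_j, var_i + var_j)
    return 0.5 * (1.0 - overlap)


# Two components with equal weights and unit variances (illustrative values).
collapsed = mdn_tsallis_term([0.5, 0.5], [0.0, 0.0], [1.0, 1.0])
spread = mdn_tsallis_term([0.5, 0.5], [-3.0, 3.0], [1.0, 1.0])
single = mdn_tsallis_term([1.0], [0.0], [1.0])
```

Here `collapsed` equals `single`, so two components stacked on one mode earn no extra bonus, while `spread` is larger. Components with zero sparsemax weight drop out of the double sum entirely.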

## 5 Experiments

To verify the effectiveness of the proposed method, we compare MCTEIL with several other imitation learning methods. First, we use behavior cloning (BC) as a baseline. Second, generative adversarial imitation learning (GAIL) with a single Gaussian distribution is compared. While several variants of GAIL exist baram2017end (); li2017infogail (), they are all based on the maximum causal entropy framework and utilize a single Gaussian distribution as a policy function. Hence, we choose GAIL as the representative method. We also compare a straightforward extension of GAIL for a multi-modal policy by using a softmax weighted mixture density network (soft MDN) in order to validate the efficiency of the proposed sparsemax weighted MDN. In soft GAIL, due to the intractability of the causal entropy of a mixture of Gaussians, we approximate the entropy term by adding $-\alpha\log(\pi(a_t|s_t))$ to $-\log(D(s_t,a_t))$ since $\mathbb{E}_{\pi}[-\log(\pi(a|s))] = H(\pi)$. The other related imitation learning methods for multi-modal task learning, such as hausman2017multi (); wang2017robust (), are excluded from the comparison since they focus on task-level multi-modality, where the multi-modality of demonstrations comes from multiple different tasks. In comparison, the proposed method captures the multi-modality of the optimal policy for a single task. We would like to note that our method can be extended to multi-modal task learning as well.

### 5.1 Multi-Goal Environment

To validate that the proposed method can learn multi-modal behavior of an expert, we design a simple multi-goal environment with four attractors and four repulsors, where an agent tries to reach one of the attractors while avoiding all repulsors as shown in Figure 1(a). The agent follows the point-mass dynamics and gets a positive reward (resp., a negative reward) when getting closer to an attractor (resp., repulsor). Intuitively, this problem has multi-modal optimal actions at the center. We first train the optimal policy using lee2018sparse () and generate demonstrations from the expert’s policy. For both soft GAIL and MCTEIL, episodes are sampled at each iteration. At regular iteration intervals, we measure the average return using the underlying rewards and the reachability, which is measured by counting how many goals are reached. If the algorithm captures the multi-modality of the expert’s demonstrations, then the resulting policy will show high reachability.

The results are shown in Figure 1(b) and 1(c). Since the rewards are multi-modal, it is easy to get a high return even if the algorithm learns only uni-modal behavior. Hence, the average returns of soft GAIL and MCTEIL increase similarly. However, when it comes to the reachability, MCTEIL outperforms soft GAIL when they use the same number of mixtures. In particular, MCTEIL can learn all modes in the demonstrations at the end of learning while soft GAIL suffers from collapsing mixture means. This advantage clearly comes from the maximum Tsallis entropy of a sparse MDN since the analytic form of the Tsallis entropy directly penalizes collapsed mixture means, while the approximate entropy bonus in soft GAIL only indirectly prevents mode collapse. Consequently, MCTEIL efficiently utilizes each mixture for wide-spread exploration.

### 5.2 Continuous Control Environment

We test MCTEIL with a sparse MDN on MuJoCo todorov2012mujoco (), which is a physics-based simulator, using Halfcheetah, Walker2d, Reacher, and Ant. We train the expert policy distribution using trust region policy optimization (TRPO) schulman2015trust () under the true reward function and generate demonstrations from the expert policy. We run algorithms with varying numbers of demonstrations, and all experiments have been repeated three times with different random seeds. To evaluate the performance of each algorithm, we sample episodes from the trained policy and measure the average return value using the underlying rewards. For methods using an MDN, we select the best number of mixtures using a brute force search.

The results are shown in Figure 2. For three problems, all except Walker2d, MCTEIL outperforms the other methods with respect to the average return as the number of demonstrations increases. For Walker2d, MCTEIL and soft GAIL show similar performance. In the reacher problem in particular, we obtain results similar to those reported in ho2016generative (), where BC works better than GAIL. However, our method shows the best performance for all demonstration counts. The MDN policy tends to show high performance consistently, since MCTEIL and soft GAIL are consistently ranked within the top two performing algorithms. From these results, we can conclude that an MDN policy explores better than a single Gaussian policy since an MDN can keep searching multiple directions during training. In particular, since the maximum Tsallis entropy makes each mixture mean explore in different directions and a sparsemax distribution assigns zero weight to unnecessary mixture components, MCTEIL efficiently explores and shows better performance compared to soft GAIL with a soft MDN. Consequently, we can conclude that MCTEIL outperforms other imitation learning methods and the causal Tsallis entropy has benefits over the causal Gibbs-Shannon entropy as it encourages exploration more efficiently.

## 6 Conclusion

In this paper, we have proposed a novel maximum causal Tsallis entropy (MCTE) framework, which induces a sparsemax distribution as the optimal solution. We have also provided the full mathematical analysis of the proposed framework, including the concavity of the problem, the optimality condition, and the interpretation as robust Bayes. We have also developed the maximum causal Tsallis entropy imitation learning (MCTEIL) algorithm, which can efficiently solve an MCTE problem in a continuous action space since the Tsallis entropy of a mixture of Gaussians encourages exploration and efficient mixture utilization. In experiments, we have verified that the proposed method has advantages over existing methods for learning the multi-modal behavior of an expert since a sparse MDN can search in diverse directions efficiently. Furthermore, the proposed method has outperformed BC, GAIL, and GAIL with a soft MDN on the standard IL problems in the MuJoCo environment. From the analysis and experiments, we have shown that the proposed MCTEIL method is an efficient and principled way to learn the multi-modal behavior of an expert.

## Appendix A Analysis

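###### Proof of Theorem 2.

The proof is simply done by checking two equalities. First,

$$
W(\pi) = \frac{1}{2}\mathbb{E}_{\pi}[1 - \pi(a|s)] = \frac{1}{2}\sum_{s,a}\rho_{\pi}(s,a)\left(1 - \pi(a|s)\right) = \frac{1}{2}\sum_{s,a}\rho_{\pi}(s,a)\left(1 - \frac{\rho_{\pi}(s,a)}{\sum_{a'}\rho_{\pi}(s,a')}\right) = \bar{W}(\rho_{\pi})
$$

and, second,

$$
\bar{W}(\rho) = \frac{1}{2}\sum_{s,a}\rho(s,a)\left(1 - \frac{\rho(s,a)}{\sum_{a'}\rho(s,a')}\right) = \frac{1}{2}\sum_{s,a}\rho_{\pi_{\rho}}(s,a)\left(1 - \pi_{\rho}(a|s)\right) = W(\pi_{\rho}).
$$

∎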

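### A.1 Concavity of Maximum Causal Tsallis Entropy

###### Proof of Theorem 3.

Proving the concavity of $\bar{W}(\rho)$ is equivalent to showing that the following inequality is satisfied for all state and action pairs:

$$
\left(\lambda_1\rho_1(s,a) + \lambda_2\rho_2(s,a)\right)\left(1 - \frac{\lambda_1\rho_1(s,a) + \lambda_2\rho_2(s,a)}{\lambda_1\sum_{a'}\rho_1(s,a') + \lambda_2\sum_{a'}\rho_2(s,a')}\right) \geq \lambda_1\rho_1(s,a)\left(1 - \frac{\rho_1(s,a)}{\sum_{a'}\rho_1(s,a')}\right) + \lambda_2\rho_2(s,a)\left(1 - \frac{\rho_2(s,a)}{\sum_{a'}\rho_2(s,a')}\right),
$$

where $\lambda_1 \geq 0$, $\lambda_2 \geq 0$, and $\lambda_1 + \lambda_2 = 1$. For notational simplicity, $\rho_i(s,a)$ and $\sum_{a'}\rho_i(s,a')$ are replaced with $a_i$ and $b_i$, respectively. Then, the right-hand side is

$$
\sum_{i=1,2}\lambda_i a_i\left(1 - \frac{a_i}{b_i}\right) = \sum_{i=1,2}\lambda_i a_i\left(1 - \frac{\lambda_i a_i}{\lambda_i b_i}\right) = \left(\sum_{j=1,2}\lambda_j b_j\right)\sum_{i=1,2}\left[\frac{\lambda_i b_i}{\sum_{j=1,2}\lambda_j b_j}\,\frac{\lambda_i a_i}{\lambda_i b_i}\left(1 - \frac{\lambda_i a_i}{\lambda_i b_i}\right)\right].
$$

Let $F(x) = x(1 - x)$, which is a concave function. Then the above equation can be expressed as follows:

$$
\sum_{i=1,2}\lambda_i a_i\left(1 - \frac{a_i}{b_i}\right) = \left(\sum_{j=1,2}\lambda_j b_j\right)\sum_{i=1,2}\left[\frac{\lambda_i b_i}{\sum_{j=1,2}\lambda_j b_j}\,F\left(\frac{\lambda_i a_i}{\lambda_i b_i}\right)\right].
$$

By using the property of concave functions, i.e., $\sum_i \mu_i F(x_i) \leq F\left(\sum_i \mu_i x_i\right)$ for $\mu_i \geq 0$ such that $\sum_i \mu_i = 1$, we obtain the following inequality:

$$
\begin{aligned}
\left(\sum_{j=1,2}\lambda_j b_j\right)\sum_{i=1,2}\left[\frac{\lambda_i b_i}{\sum_{j=1,2}\lambda_j b_j}\,F\left(\frac{\lambda_i a_i}{\lambda_i b_i}\right)\right]
&\leq \left(\sum_{j=1,2}\lambda_j b_j\right)F\left(\sum_{i=1,2}\left[\frac{\lambda_i b_i}{\sum_{j=1,2}\lambda_j b_j}\,\frac{\lambda_i a_i}{\lambda_i b_i}\right]\right) \\
&= \left(\sum_{j=1,2}\lambda_j b_j\right)F\left(\frac{\sum_{i=1,2}\lambda_i a_i}{\sum_{j=1,2}\lambda_j b_j}\right) \\
&= \left(\sum_{j=1,2}\lambda_j b_j\right)\frac{\sum_{i=1,2}\lambda_i a_i}{\sum_{j=1,2}\lambda_j b_j}\left(1 - \frac{\sum_{i=1,2}\lambda_i a_i}{\sum_{j=1,2}\lambda_j b_j}\right) \\
&= \sum_{i=1,2}\lambda_i a_i\left(1 - \frac{\sum_{k=1,2}\lambda_k a_k}{\sum_{j=1,2}\lambda_j b_j}\right).
\end{aligned}
$$

Finally, we have the following inequality for every state and action pair,

$$
\left(\lambda_1\rho_1(s,a) + \lambda_2\rho_2(s,a)\right)\left(1 - \frac{\lambda_1\rho_1(s,a) + \lambda_2\rho_2(s,a)}{\lambda_1\sum_{a'}\rho_1(s,a') + \lambda_2\sum_{a'}\rho_2(s,a')}\right) \geq \lambda_1\rho_1(s,a)\left(1 - \frac{\rho_1(s,a)}{\sum_{a'}\rho_1(s,a')}\right) + \lambda_2\rho_2(s,a)\left(1 - \frac{\rho_2(s,a)}{\sum_{a'}\rho_2(s,a')}\right),
$$

and, by summing up with respect to $s$ and $a$, we get

$$
\bar{W}(\lambda_1\rho_1 + \lambda_2\rho_2) \geq \lambda_1\bar{W}(\rho_1) + \lambda_2\bar{W}(\rho_2).
$$

Therefore, $\bar{W}(\rho)$ is a concave function. ∎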
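The concavity inequality can also be spot-checked numerically. The sketch below draws random positive tables $\rho_1$, $\rho_2$ and evaluates the gap $\bar{W}(\lambda\rho_1 + (1-\lambda)\rho_2) - \lambda\bar{W}(\rho_1) - (1-\lambda)\bar{W}(\rho_2)$. The random tables are an assumption for illustration and do not satisfy the Bellman flow constraints; the per-pair argument above does not rely on them. Function names are hypothetical.

```python
import random


def w_bar(rho):
    """W_bar(rho) = 0.5 * sum_{s,a} rho(s,a) * (1 - rho(s,a) / sum_a' rho(s,a'))."""
    total = 0.0
    for row in rho:  # one row of action masses per state
        state_mass = sum(row)
        total += sum(r * (1.0 - r / state_mass) for r in row)
    return 0.5 * total


def mix(rho1, rho2, lam):
    """Convex combination lam * rho1 + (1 - lam) * rho2."""
    return [[lam * x + (1.0 - lam) * y for x, y in zip(r1, r2)] for r1, r2 in zip(rho1, rho2)]


def random_table(rng, n_states, n_actions):
    """Random strictly positive table standing in for a visitation measure."""
    return [[rng.uniform(0.01, 1.0) for _ in range(n_actions)] for _ in range(n_states)]


def concavity_gap(rho1, rho2, lam):
    """Non-negative (up to rounding) if W_bar is concave along this segment."""
    return w_bar(mix(rho1, rho2, lam)) - (lam * w_bar(rho1) + (1.0 - lam) * w_bar(rho2))


rng = random.Random(0)
gaps = [
    concavity_gap(random_table(rng, 3, 4), random_table(rng, 3, 4), rng.random())
    for _ in range(200)
]
```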

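### A.2 Optimality Condition from Karush–Kuhn–Tucker (KKT) Conditions

The following proof explains the optimality condition of the maximum causal Tsallis entropy problem and also tells us that the optimal policy distribution has a sparse and multi-modal distribution.

###### Proof of Theorem 4.

These conditions are derived from the stationary condition of KKT, where the derivative of $L_W$ is equal to zero,

$$
\frac{\partial L_W}{\partial \rho(s,a)} = 0.
$$

We first compute the derivative of $\bar{W}$ as follows:

$$
\frac{\partial \bar{W}}{\partial \rho(s,a)} = \frac{1}{2} - \frac{\rho(s,a)}{\sum_{a'}\rho(s,a')} + \frac{1}{2}\sum_{a'}\left(\frac{\rho(s,a')}{\sum_{a''}\rho(s,a'')}\right)^{2}.
$$

We also check the derivative of the Bellman flow constraints as follows:

$$
\frac{\partial \sum_{s} c_s\left(\sum_{a'}\rho(s,a') - d(s) - \gamma\sum_{s',a'}T(s|s',a')\rho(s',a')\right)}{\partial \rho(s'',a'')} = c_{s''} - \gamma\sum_{s} c_s\, T(s|s'',a'').
$$

Hence, the stationary condition can be obtained as

$$
\frac{\partial L_W}{\partial \rho(s,a)} = \alpha\left[-\frac{1}{2} + \frac{\rho(s,a)}{\sum_{a'}\rho(s,a')} - \frac{1}{2}\sum_{a'}\left(\frac{\rho(s,a')}{\sum_{a''}\rho(s,a'')}\right)^{2}\right] - \theta^{\intercal}\phi(s,a) + c_s - \gamma\sum_{s'} c_{s'}\, T(s'|s,a) - \lambda_{sa} = 0. \tag{12}
$$

First, let us consider a positive $\rho(s,a)$. From the complementary slackness, the corresponding $\lambda_{sa}$ is zero. By replacing $\frac{\rho(s,a)}{\sum_{a'}\rho(s,a')}$ with $\pi(a|s)$ and using the definition of $q_{sa}$, the following equation is obtained from the stationary condition (12):

$$
\pi(a|s) - \frac{q_{sa}}{\alpha} = \frac{1}{2} + \frac{1}{2}\sum_{a'}\left(\pi(a'|s)\right)^{2} - \frac{c_s}{\alpha}. \tag{13}
$$

It can be observed that the right-hand side of the equation depends only on the state $s$ and is constant with respect to the action $a$. In this regard, by summing up with respect to the actions with positive $\pi(a|s)$, $c_s$ is obtained as follows:

$$
\begin{aligned}
1 - \sum_{a \in S(s)}\frac{q_{sa}}{\alpha} &= K\left(\frac{1}{2} + \frac{1}{2}\sum_{a'}\left(\pi(a'|s)\right)^{2} - \frac{c_s}{\alpha}\right), \\
\frac{c_s}{\alpha} &= \frac{1}{2} + \frac{1}{2}\sum_{a'}\left(\pi(a'|s)\right)^{2} + \frac{\sum_{a \in S(s)}\frac{q_{sa}}{\alpha} - 1}{K},
\end{aligned}
$$

where $K$ is the number of actions with positive $\pi(a|s)$. By plugging $\frac{c_s}{\alpha}$ into (13), we obtain the policy as follows:

$$
\pi(a|s) = \frac{q_{sa}}{\alpha} - \frac{\sum_{a \in S(s)}\frac{q_{sa}}{\alpha} - 1}{K}.
$$

Now, we define $\tau\left(\frac{q_{s\cdot}}{\alpha}\right) \triangleq \frac{\sum_{a \in S(s)}\frac{q_{sa}}{\alpha} - 1}{K}$, and, interestingly, $\tau$ is the same as the threshold of a sparsemax distribution martins2016softmax (). Then, we can obtain the optimality condition for the policy distribution as follows:

$$
\forall s, a \quad \pi(a|s) = \max\left(\frac{q_{sa}}{\alpha} - \tau(s), 0\right),
$$

where $\tau(s)$ indicates $\tau\left(\frac{q_{s\cdot}}{\alpha}\right)$.

The Lagrangian multiplier $c_s$ can be found from the expression for $\frac{c_s}{\alpha}$ above as follows:

$$
c_s = \alpha\left[\frac{1}{2} + \frac{1}{2}\sum_{a \in S(s)}\pi(a|s)^{2} + \tau(s)\right] = \alpha\left[\frac{1}{2}\sum_{a \in S(s)}\left(\left(\frac{q_{sa}}{\alpha}\right)^{2} - \tau\left(\frac{q_{s\cdot}}{\alpha}\right)^{2}\right) + \frac{1}{2}\right],
$$

where the second equality uses $\pi(a|s) = \frac{q_{sa}}{\alpha} - \tau(s)$ on $S(s)$ and $\sum_{a \in S(s)}\frac{q_{sa}}{\alpha} = K\tau(s) + 1$.

To summarize, we obtain the optimality condition of (6) as follows:

$$
\begin{aligned}
q_{sa} &\triangleq \theta^{\intercal}\phi(s,a) + \gamma\sum_{s'} c_{s'}\, T(s'|s,a), \\
c_s &= \alpha\left[\frac{1}{2}\sum_{a \in S(s)}\left(\left(\frac{q_{sa}}{\alpha}\right)^{2} - \tau\left(\frac{q_{s\cdot}}{\alpha}\right)^{2}\right) + \frac{1}{2}\right], \\
\pi(a|s) &= \max\left(\frac{q_{sa}}{\alpha} - \tau\left(\frac{q_{s\cdot}}{\alpha}\right), 0\right).
\end{aligned}
$$

∎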

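### A.3 Interpretation as Robust Bayes

In this section, the connection between MCTE estimation and a minimax game between a decision maker and the nature is explained. We prove that the proposed MCTE problem is equivalent to a minimax game with the Brier score.

###### Proof of Theorem 5.

The objective function can be reformulated as

$$
\mathbb{E}_{\tilde{\pi}}\left[\sum_{a'}\frac{1}{2}\left(\mathbb{1}\{a' = a\} - \pi(a'|s)\right)^{2}\right] = \mathbb{E}_{\tilde{\pi}}[B(s,a)] = \sum_{s,a}\rho_{\tilde{\pi}}(s,a)B(s,a) = \frac{1}{2}\sum_{s,a}\rho_{\tilde{\pi}}(s,a)\left(1 - 2\pi(a|s) + \sum_{a'}\pi(a'|s)^{2}\right).
$$

Hence, the objective function is quadratic with respect to $\pi(a|s)$ and is linear with respect to $\rho_{\tilde{\pi}}(s,a)$. By using the one-to-one correspondence between $\tilde{\pi}$ and $\rho_{\tilde{\pi}}$, we change the variable of the inner maximization into the state-action visitation measure $\tilde{\rho}$. After changing the optimization variable, by using the minimax theorem millar1983minimax (), the minimization and maximization of the problem (8) are interchangeable as follows:

$$
\min_{\pi}\max_{\tilde{\rho}}\ \frac{1}{2}\sum_{s,a}\tilde{\rho}(s,a)\left(1 - 2\pi(a|s) + \sum_{a'}\pi(a'|s)^{2}\right) = \max_{\tilde{\rho}}\min_{\pi}\ \frac{1}{2}\sum_{s,a}\tilde{\rho}(s,a)\left(1 - 2\pi(a|s) + \sum_{a'}\pi(a'|s)^{2}\right),
$$

where sum-to-one, positivity, and Bellman flow constraints are omitted here. After converting the problem, the optimal solution of the inner minimization with respect to $\pi$ is easily computed as $\pi(a|s) = \frac{\tilde{\rho}(s,a)}{\sum_{a'}\tilde{\rho}(s,a')}$ by setting the derivative with respect to $\pi(a|s)$ to zero. After applying $\pi(a|s) = \tilde{\pi}(a|s)$ and recovering the variables from $\tilde{\rho}$ to $\tilde{\pi}$, the problem (8) is converted into

$$
\max_{\tilde{\pi} \in \Pi}\ \frac{1}{2}\sum_{s}\rho_{\tilde{\pi}}(s)\left(1 - \sum_{a}\tilde{\pi}(a|s)^{2}\right) = \max_{\tilde{\pi} \in \Pi} W(\tilde{\pi}),
$$

which equals the causal Tsallis entropy. Hence, the problem (8) is equivalent to the maximum causal Tsallis entropy problem. ∎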

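### A.4 Generative Adversarial Setting with Maximum Causal Tsallis Entropy

###### Proof of Theorem 6.

We first change the variable from $\pi$ to $\rho$ as follows:

$$
\begin{aligned}
\max_{\theta}\min_{\rho} \quad & -\alpha\bar{W}(\rho) - \theta^{\intercal}\sum_{s,a}\rho(s,a)\phi(s,a) + \theta^{\intercal}\sum_{s,a}\rho_E(s,a)\phi(s,a) - \psi(\theta) \\
\text{subject to} \quad & \forall s, a, \quad \sum_{s,a}\rho(s,a)\phi(s,a) = \sum_{s,a}\rho_E(s,a)\phi(s,a), \\
& \rho(s,a) \geq 0, \quad \sum_{a}\rho(s,a) = d(s) + \gamma\sum_{s',a'}T(s|s',a')\rho(s',a'),
\end{aligned}
\tag{14}
$$

where $\rho_E$ is $\rho_{\pi_E}$. Let

$$
\bar{L}(\rho, \theta) \triangleq -\alpha\bar{W}(\rho) - \psi(\theta) - \theta^{\intercal}\sum_{s,a}\rho(s,a)\phi(s,a) + \theta^{\intercal}\sum_{s,a}\rho_E(s,a)\phi(s,a). \tag{15}
$$

From Theorem 3, $\bar{W}(\rho)$ is a concave function with respect to $\rho$. Hence, $-\bar{L}(\rho, \theta)$ is also a concave function with respect to $\rho$ for a fixed $\theta$, i.e., $\bar{L}$ is convex in $\rho$. From the convexity of $\psi$, $-\bar{L}(\rho, \theta)$ is a convex function with respect to $\theta$ for a fixed $\rho$, i.e., $\bar{L}$ is concave in $\theta$. Furthermore, the domain of $\rho$ is compact and convex and the domain of $\theta$ is convex. Based on this property of $\bar{L}$, we can use minimax duality millar1983minimax ():

$$
\max_{\theta}\min_{\rho}\bar{L}(\rho, \theta) = \min_{\rho}\max_{\theta}\bar{L}(\rho, \theta).
$$

Hence, the maximization and minimization are interchangeable. By using this fact, we have:

$$
\begin{aligned}
\max_{\theta}\min_{\rho}\bar{L}(\rho, \theta) &= \min_{\rho}\max_{\theta}\bar{L}(\rho, \theta) \\
&= \min_{\rho}\ -\alpha\bar{W}(\rho) + \max_{\theta}\left(-\psi(\theta) + \theta^{\intercal}\sum_{s,a}\left(\rho(s,a) - \rho_E(s,a)\right)\phi(s,a)\right) \\
&= \min_{\rho}\ -\alpha\bar{W}(\rho) + \psi^{*}\left(\sum_{s,a}\left(\rho(s,a) - \rho_E(s,a)\right)\phi(s,a)\right) \\
&= \min_{\pi}\ \psi^{*}\left(\mathbb{E}_{\pi}[\phi(s,a)] - \mathbb{E}_{\pi_E}[\phi(s,a)]\right) - \alpha W(\pi).
\end{aligned}
$$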

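### A.5 Tsallis Entropy of a Mixture of Gaussians

###### Proof of Theorem 7.

The causal Tsallis entropy of a mixture of Gaussians can be obtained by expanding $\pi(a|s)^{2}$ and using the Gaussian product identity $\int_{\mathcal{A}}\mathcal{N}(a; \mu_i, \Sigma_i)\,\mathcal{N}(a; \mu_j, \Sigma_j)\,da = \mathcal{N}(\mu_i; \mu_j, \Sigma_i + \Sigma_j)$ as follows:

$$
\begin{aligned}
W(\pi) &= \frac{1}{2}\sum_{s}\rho_{\pi}(s)\left(1 - \int_{\mathcal{A}}\pi(a|s)^{2}\,da\right) \\
&= \frac{1}{2}\sum_{s}\rho_{\pi}(s)\left(1 - \sum_{i=1}^{K}\sum_{j=1}^{K} w_i(s)\, w_j(s)\int_{\mathcal{A}}\mathcal{N}(a; \mu_i(s), \Sigma_i(s))\,\mathcal{N}(a; \mu_j(s), \Sigma_j(s))\,da\right) \\
&= \frac{1}{2}\sum_{s}\rho_{\pi}(s)\left(1 - \sum_{i=1}^{K}\sum_{j=1}^{K} w_i(s)\, w_j(s)\,\mathcal{N}\left(\mu_i(s); \mu_j(s), \Sigma_i(s) + \Sigma_j(s)\right)\right).
\end{aligned}
\tag{16}
$$

∎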

## Appendix B Causal Entropy Approximation

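In our implementation of maximum causal Tsallis entropy imitation learning (MCTEIL), we approximate $W(\pi)$ using sampled trajectories as follows:

$$
W(\pi) = \mathbb{E}_{\pi}\left[\frac{1}{2}\left(1 - \pi(a|s)\right)\right] \approx \frac{1}{N}\sum_{i=0}^{N}\sum_{t=0}^{T_i}\frac{\gamma^{t}}{2}\left(1 - \int_{\mathcal{A}}\pi(a|s_{i,t})^{2}\,da\right), \tag{17}
$$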

where