Hierarchical Modular Reinforcement Learning Method and Knowledge Acquisition of State-Action Rule for Multi-target Problem

©2013 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.


Takumi Ichimura Faculty of Management and Information Systems,
Prefectural University of Hiroshima
1-1-71, Ujina-Higashi, Minami-ku,
Hiroshima, 734-8559, Japan
Email: ichimura@pu-hiroshima.ac.jp
Daisuke Igaue, Graduate School of Comprehensive Scientific Research,
Prefectural University of Hiroshima
(He graduated from Prefectural Univ. of Hiroshima and is now working at Iyo Bank, Ltd., Japan)
Email: punch20@gmail.com
Abstract

Hierarchical Modular Reinforcement Learning (HMRL) consists of 2-layered learning, where Profit Sharing plans a prey position in the higher layer and Q-learning trains the state-action rules toward the target in the lower layer. In this paper, we expand HMRL to the multi-target problem by taking the distance between targets into consideration. The function, called ‘AT field’, can estimate the interest of an agent according to the distance between 2 agents and the advantage/disadvantage of the other agent. Moreover, the knowledge related to state-action rules is extracted by C4.5, and the action under a given situation is decided by using the acquired knowledge. To verify the effectiveness of the proposed method, some experimental results are reported.

Reinforcement Learning, Profit Sharing, Q-learning, Hierarchical Modular Reinforcement Learning, Multi-target, C4.5, Knowledge Acquisition

I Introduction

A Multi-Agent System (MAS), in which a number of autonomous agents interact with each other and each affects the actions of the other agents, is a complex system. Learning enables a MAS to be more flexible and robust, and makes agents better able to handle uncertain and changing circumstances. Thus, a learning method for coordinating the behaviors of different agents is required. Reinforcement learning is an area of machine learning in computer intelligent systems [3], [1], [2]. One problem in applying reinforcement learning to actual-sized problems is the “curse of dimensionality”: a high-dimensional input leads to a huge number of rules in the reinforcement learning application.

In order to solve these problems, several types of hierarchical reinforcement learning have been proposed for actual applications [6], [7]. Hierarchical Modular Reinforcement Learning (HMRL) consists of 2-layered learning, where Profit Sharing plans a prey position in the higher layer and Q-learning trains the state-action rules toward the target in the lower layer. In this paper, we expand HMRL to the multi-target problem under consideration of the distance between targets. The function, called ‘AT field’, can estimate the interest of an agent according to the distance between 2 agents and the advantage/disadvantage of the other agent. Moreover, the knowledge related to state-action rules is extracted by C4.5, and the action under a given situation is decided by using the acquired knowledge. To verify the effectiveness of the proposed method, some experimental results are reported.

The remainder of this paper is organized as follows. Section II describes reinforcement learning methods. The hierarchical modular reinforcement learning method is explained in Section III, where we also describe the multi-agent pursuit problem and consider how to treat the value of a target according to the distance between the 2 prey agents. Section IV presents the knowledge discovery of the learning agents in the format of If-Then rules. In Section V, we give some discussions to conclude this paper.

II Reinforcement Learning

Profit Sharing and Q-Learning are very popular reinforcement learning methods. This section briefly describes the algorithms of these two methods.

II-A Profit Sharing

Multi-agent systems have been developed in the field of Artificial Intelligence. Each agent is designed to work with some schemes based on many rules which represent knowledge of the agent world or relationships among the agents. However, the knowledge or relationships are not always effective for survival in the environment, because an agent will discard part of its knowledge if the environment changes dynamically. Reinforcement Learning [3] is known to be able to realize cooperative behavior among agents even if little knowledge is provided in the initial condition. The multi-agent system works to share a given reward among all agents.

Especially, the PS method [1], [2] is an effective exploitation-oriented reinforcement learning method for adapting to a given environment. In PS, an agent learns a policy based on the reward that is received from the environment when it reaches a goal state. It is important to design a reinforcement function that distributes the received reward to each action rule in the policy. In PS, a rule maps a given sensory input x to a possible action a. The rule “If x then a.” is also written as (x, a). PS does not estimate the value function; it computes the weights S_{r_i} of the rules. An episode is the sequence of rules from the start state to the terminal state at which the agent achieves the goal at time W, and then a reward R is provided. PS gives a partial reward f_i of R to the i-th fired rule in an episode (i = 0, 1, ⋯, W−1), where W is the maximum length of the episode. The partial reward is determined by the reinforcement function f. Each rule is reinforced by the sum of its current weight and the slanted reward. That is,

S_{r_i} = S_{r_i} + f_i,  i = 0, 1, ⋯, W−1,  (1)

where S_{r_i} is the weight of the i-th rule of an episode, f is the reinforcement function, and f_i is the reinforcement value at the i-th step counted from obtaining the reward R.

A detour, as shown in Fig.1, is a sub-sequence of rules that occurs when different rules are selected for the same sensory input. The rules on a detour may be ineffective rules: an ineffective rule is always on a detour of the episode, and the other rules are called effective rules. If competition between ineffective rules and effective rules exists, the ineffective rules should not be reinforced. If the reinforcement function satisfies the ineffective rule suppression theorem, it distributes more reward to effective rules than to ineffective ones. In order to suppress such ineffective rules, the forgettable PS method was proposed.

f_{i−1} > L ∑_{j=i}^{W} f_j,  (2)

where f is the reinforcement function and L is the maximum number of effective rules. The reinforcement function decreases as a geometric series in the following.

f_i = (1/M) f_{i−1},  i = 1, 2, ⋯, W−1,  (3)

where M is a discount rate. Eq.(3) reinforces the rules from the terminal state back to the start state of an episode. The reinforcement function of Eq.(3) follows the curve shown in Fig.3.

The algorithm of PS is as follows.

1. Initialize S(x, a) arbitrarily.
2. Repeat (for each episode):
   (a) Initialize x and set i = 0.
   (b) Repeat (for each step of the episode):
      i. Choose an action a for state x according to the policy given by S.
      ii. Take action a; observe the reward R and the next state x'.
      iii. Store the fired rule (x, a), set i = i + 1, and set x = x'.
      iv. If R ≠ 0, set W = i and calculate the following:

         S_{r_i} = S_{r_i} + f_i,  i = 0, 1, ⋯, W−1,  (4)

      where f_i is given by Eq.(3).
   (c) until x is terminal.
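The loop above can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the environment interface (`reset`/`step` returning a goal flag), the rule-weight-proportional action selection, and the discount rate `M` are all assumptions for the sketch; the backward geometric credit assignment follows Eq.(1) and Eq.(3).

```python
import random
from collections import defaultdict

def profit_sharing_episode(env, S, actions, M=5.0, reward_R=100.0):
    """Run one episode and reinforce every fired rule (x, a) with a
    geometrically decreasing share of the final reward (Eq.(1), Eq.(3))."""
    episode = []                          # fired rules, in firing order
    x = env.reset()
    while True:
        # choose an action roughly in proportion to current rule weights
        weights = [S[(x, a)] + 1e-6 for a in actions]
        a = random.choices(actions, weights=weights)[0]
        x_next, reached_goal = env.step(a)
        episode.append((x, a))
        if reached_goal:
            break
        x = x_next
    # distribute the reward backward: f_0 = R, f_i = f_{i-1} / M
    f = reward_R
    for rule in reversed(episode):
        S[rule] += f                      # Eq.(1): S_ri = S_ri + f_i
        f /= M                            # Eq.(3): geometric decrease
    return len(episode)
```

With M larger than the number of competing effective rules L, this decay satisfies the suppression condition of Eq.(2), so rules near the goal dominate rules on detours.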

II-B Q-Learning

The Temporal Difference (TD) method can learn directly from raw experience without a model of the environment’s dynamics [3]. The TD method uses experience to solve the prediction problem. If a non-terminal state s_t is visited at time t, the TD method updates its estimate V(s_t) based on the events after that visit. The TD method waits only until the next time step: it forms a target at time t+1 and makes an appropriate update using the observed reward r_{t+1} and the estimate V(s_{t+1}). The simplest expression of the TD method can be written as follows.

V(s_t) ← V(s_t) + α[r_{t+1} + γ V(s_{t+1}) − V(s_t)]  (5)

The TD method can base its estimates in part on other estimates, and is used for evaluation or prediction by applying generalized policy iteration. We use Q-learning as an off-policy TD method in this paper, because the learned action-value function Q directly approximates the optimal action-value function Q* with no dependence on the policy being followed. The simplest Q-learning update can be written as follows.

Q(s_t, a_t) ← Q(s_t, a_t) + α[r_{t+1} + γ max_a Q(s_{t+1}, a) − Q(s_t, a_t)]  (6)
The algorithm of Q-learning is as follows.

1. Initialize Q(s, a) arbitrarily.
2. Repeat (for each episode):
   (a) Initialize s.
   (b) Repeat (for each step of the episode):
      i. Choose a from s by using the policy derived from Q.
      ii. Take action a; observe r and the next state s'.
      iii. The Q-learning update above is executed.
      iv. s ← s'.
   (c) until s is terminal.
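The Q-learning loop above can be sketched as tabular code. This is a generic sketch, not the paper's implementation: the environment interface (`reset`/`step` returning state, reward, done) and the epsilon-greedy behavior policy are assumptions; the inner update line is Eq.(6).

```python
import random
from collections import defaultdict

def q_learning(env, actions, episodes=200, alpha=0.5, gamma=0.9, eps=0.1):
    """Tabular Q-learning: off-policy TD control with an
    epsilon-greedy behavior policy over a discrete action set."""
    Q = defaultdict(float)
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            if random.random() < eps:
                a = random.choice(actions)                    # explore
            else:
                a = max(actions, key=lambda u: Q[(s, u)])     # exploit
            s2, r, done = env.step(a)
            # Eq.(6): Q(s,a) <- Q(s,a) + alpha [r + gamma max_a' Q(s',a') - Q(s,a)]
            best_next = 0.0 if done else max(Q[(s2, u)] for u in actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s2
    return Q
```

Because the max over the next state's actions is used regardless of which action the behavior policy actually takes, the learned Q approximates Q* even though exploration is epsilon-greedy.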

III Hierarchical Modular Reinforcement Learning Method

This section defines the Multi-Agent Pursuit Problem to explain the simulation environment in which the Hierarchical Modular Reinforcement Learning (HMRL) method [7] works. Moreover, we extend the HMRL method to a Multi-Agent Pursuit Problem in which two or more kinds of prey agents work in the same environment.

III-A Multi-Agent Pursuit Problem

The pursuit problem is well known to be an appropriate example of a cooperative multi-agent system (MAS) [5]. In this study, the pursuit problem is considered in a grid world, where two prey agents and four hunter agents are placed at random positions in the environment, as shown in Fig.5. Hunters are learning agents and try to capture the randomly moving prey. In this paper, the prey agents do not learn state-action rules through experience, and the prey agents do not cooperate with each other. At each time step, the agents synchronously select and perform one out of five actions without communicating with each other: staying at the current position or moving north, south, west, or east. Prey and hunters cannot share a cell, and an agent is not allowed to move off the environment. A prey is captured when all of its neighbor cells are occupied by hunters, as shown in Fig.5.

III-B Hierarchical Modular Reinforcement Learning Method

For the pursuit problem, huge memory consumption is required to express the internal knowledge of the agents. Moreover, because the surrounding environment is complex, the agents cannot easily express collaboration. [6], [7] proposed hierarchical modular reinforcement learning to solve the above problems, although it is difficult to decide into how many kinds of sub-tasks the task should be decomposed.

In [7], Prof. Watanabe conceived the idea of decomposing the surrounding (capturing) task into “decision of the target move position” for surrounding according to the currently monitored state, and “selection of an appropriate action” to move to the target position for each agent. The task is thus decomposed into a “surrounding” task synchronized with the other hunter agents and an “exploring the environment” task. Moreover, the upper task corresponds only to the collaborative surrounding strategy.

In the upper layer, the target position of the agent is decided based on observed state such as the current positions of the prey agent and the other hunter agents. The rules in the upper layer express the goodness of the target position corresponding to the current state, excluding actual actions. Constructing rules over all current state combinations would need a huge amount of memory. To avoid such a requirement, the authors applied a modular structure for the rule expression [7] in the upper layer, as shown in Fig.6. In Fig.6, the state space is divided into 4 sub-spaces satisfying the following equation.

(g, s_1, s_2, s_3, s_4) = ∪_e (e, g, s_e, s_ϵ),  (e, ϵ ∈ E, e ≠ ϵ)  (7)

The weights of rules in the upper layer are updated by Profit Sharing as follows.

u(e, g(i), h_e(i), h_ϵ(i)) = u(e, g(i), h_e(i), h_ϵ(i)) + k(e, g(i), h_e(i), h_ϵ(i)),
k(e, g(i−1), h_e(i−1), h_ϵ(i−1)) = ρ k(e, g(i), h_e(i), h_ϵ(i)),
(i = 0, −1, ⋯, −m, ϵ ≠ e),  (8)

where u is the estimate function for the target position and k is a reinforcement function as shown in Fig.3. e is the hunter agent and ϵ is the other hunter agent. g(i) is the position of the prey and h_e(i) is the position of hunter e at time i, respectively. Time i = 0 is when the hunter agent receives the reward. ρ is the discount parameter.

The target position is decided as a sub-goal for the surrounding task, instead of the final goal corresponding to the current state of the prey agent, according to the weights of the rules. The target position of the agent is determined by the following equation.

θ_e = argmax_v ∑_ϵ u(e, g, v, h_ϵ) / μ^{|h_e − v|},  (ϵ ≠ e, μ ≥ 1),

where v is a candidate target position. According to the selected state, the information for the target position is sent to the lower layer.

In the lower layer, the selection of the action to walk to the target position decided at the upper layer is implemented by a reinforcement learning process, namely Q-learning:

Q(s_e(t), a_e(t), θ_e) = Q(s_e(t), a_e(t), θ_e) + k(r_t + γ max_η Q(s_e(t+1), η, θ_e) − Q(s_e(t), a_e(t), θ_e)),

where Q is the Q-value, and s_e(t) and a_e(t) are the state vector and the action of agent e at the t-th step, respectively. θ_e is the target position of agent e. r_t is the reward, max_η takes the maximum over actions, and k is the step size parameter.

III-C 2 Prey Agent Based Hierarchical Reinforcement Learning

A Multi-Agent Pursuit Problem with 2 prey agents in the environment is discussed in this paper. For this problem, we consider the division of the state space shown in Fig.8. Since there are 2 prey agents, the environment has 2 goals. Therefore, the relation among the sub-spaces in Fig.8 is defined as follows:

(g_1, g_2, s_1, s_2, s_3, s_4) = ∪_e ∪_l (e, g_l, s_e, s_ϵ),  (e, ϵ ∈ E, l ∈ L, e ≠ ϵ),  (9)

where g_l is the goal position for each prey agent.

Each module has 2 target positions, but only one target position should be sent to the lower layer. Therefore, the judgment rule for deciding the appropriate position is defined as follows:

θ_e = argmax_v ∑_ϵ u(e, g_0, v, h_ϵ) / μ^{|h_e − v|}   if |h_e − g_0| < |h_e − g_1|,
θ_e = argmax_v ∑_ϵ u(e, g_1, v, h_ϵ) / μ^{|h_e − v|}   if |h_e − g_1| < |h_e − g_0|,
(ϵ ≠ e, μ ≥ 1)  (10)
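The selection rule of Eq.(10) can be sketched as follows. This is an illustrative sketch only: the weight table `u` keyed by (goal, candidate, other-hunter position), the Manhattan distance, and all parameter values are assumptions, not the paper's implementation.

```python
def select_target(h_e, goals, candidates, u, others, mu=2.0):
    """Eq.(10) sketch: pick the nearer prey's goal, then choose the
    candidate position v maximizing sum_eps u(...) / mu**|h_e - v|."""
    def dist(p, q):
        # Manhattan distance on the grid (assumed)
        return abs(p[0] - q[0]) + abs(p[1] - q[1])
    g = min(goals, key=lambda goal: dist(h_e, goal))  # the nearer goal wins
    def score(v):
        return sum(u.get((g, v, h), 0.0) for h in others) / mu ** dist(h_e, v)
    return g, max(candidates, key=score)
```

Dividing by mu**distance (mu >= 1) discounts far-away candidates, so among equally weighted positions the hunter prefers the one it can reach sooner.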

If the targets induce quite different behaviors among the prey agents, e.g. one target carries positive reinforcement while the other carries punishment, it is difficult to consider the values of the targets simultaneously. In this paper, Eq.(11) is defined based on the idea that, when there are 2 kinds of targets with different values, positive and negative, the value of the reinforcement k is changed according to the degree to which the agents affect each other.

ATF(g_d):  Φ = 0.0  (if g_d ≤ n_1),   Φ = 1.0  (if n_1 < g_d)  (11)

where g_d is the distance between the two prey agents. n_1 is the parameter for judging whether the distance between the agent and the other agent is a close distance, and n_2 is the parameter for judging whether the distance is a long distance. In this paper, n_1 and n_2 are set to fixed values. The estimate value is updated by using Eq.(12).

u(e, g_l(i), h_e(i), h_ϵ(i)) = u(e, g_l(i), h_e(i), h_ϵ(i)) + k(e, g_l(i), h_e(i), h_ϵ(i)),
k(e, g_l(i−1), h_e(i−1), h_ϵ(i−1)) = ρ · ATF(g_d) · k(e, g_l(i), h_e(i), h_ϵ(i)),
(e, ϵ ∈ E, l ∈ L, i = 0, −1, ⋯, −m),  (12)

where ATF is the AT-Field function given by Eq.(11). The output of the reinforcement function can be reduced by the discount factor ATF(g_d) according to the degree to which the corresponding agent is affected by the other agent. e and l are the indices of the hunter and prey agents, respectively. E and L denote the set of all hunter agents and the set of prey agents, respectively. h_e and g_l are the positions of the hunter agent and the prey agent, respectively. The ATF function does not affect the division of the state space in Profit Sharing.
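The ATF gating of Eq.(11) and Eq.(12) can be sketched as below. The threshold values `n1`/`n2` and the discount rate `rho` are hypothetical placeholders (the paper fixes its own values, which were lost in extraction); the gate shape follows Eq.(11): credit is suppressed when the two prey are close, and passed through once they are far apart.

```python
def at_field(g_d, n1=2, n2=6):
    """Eq.(11) sketch: 0.0 while the prey-prey distance g_d is within the
    close threshold n1, 1.0 once it exceeds n1 (n2 marks 'long distance')."""
    return 0.0 if g_d <= n1 else 1.0

def atf_discount(k_value, g_d, rho=0.8, n1=2, n2=6):
    """Eq.(12) sketch: the backed-up reinforcement k is scaled by rho * ATF(g_d),
    so no credit propagates between entangled (close) targets."""
    return rho * at_field(g_d, n1, n2) * k_value
```

In the Profit Sharing back-up, `atf_discount` replaces the plain `rho * k` of Eq.(8), leaving the rest of the upper-layer update unchanged.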

III-D Simulation Results

This section describes the simulation results with 2 prey agents and 4 hunter agents. The positions of all agents are randomly assigned in the grid. A trial starts from the initial situation and lasts until the hunter agents capture the 2 prey agents, as shown in Fig.5. After each trial, the environment and the Q-values are initialized; a simulation set consists of 20,000 trials. The reward is 100 if the prey agent is a positive target, and 0 otherwise. In the lower layer, when an agent reaches the target position sent from the upper layer, the agent receives a reward of 100. The prey agents behave randomly, and the hunter agents move according to the acquired state-action rules. Of course, each agent does not know the behaviors of the other agents.

In order to evaluate the effectiveness of the proposed model, we define 3 ratios of capturing targets: “Within Safety”, “Within Dangerous”, and “Positive Ratio”.

1. (Within Safety): When a hunter agent captures a prey agent with positive reward, if the distance between the two prey agents is larger than n_2, the captured prey agent belongs to the set far.

P(safety_distance) = #(safety_target ∩ far) / #(safety_target)  (13)
2. (Within Dangerous): When a hunter agent captures a prey agent with positive reward, if the distance between the two prey agents is smaller than n_2, the captured prey agent belongs to the set near.

P(dangerous_distance) = #(safety_target ∩ near) / #(safety_target)  (14)
3. (Positive Ratio): the ratio of “Within Safety” captures over all simulations.

P(positive) = #(safety_target ∩ far) / #(iteration)  (15)
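The three ratios of Eqs.(13)-(15) can be computed as below. This is an illustrative sketch: the capture log format (pairs of a safety-target flag and the inter-target distance) and the single threshold `n2` are assumptions, since the original threshold symbols were lost in extraction.

```python
def capture_ratios(captures, n2, iterations):
    """Sketch of Eqs.(13)-(15).  `captures` is an assumed log of
    (is_safety_target, distance_between_targets) pairs, one per capture."""
    safety = [d for ok, d in captures if ok]          # safety-target captures
    far = sum(1 for d in safety if d > n2)            # captured at long distance
    near = sum(1 for d in safety if d <= n2)          # captured at short distance
    p_safety = far / len(safety) if safety else 0.0   # Eq.(13): Within Safety
    p_danger = near / len(safety) if safety else 0.0  # Eq.(14): Within Dangerous
    p_positive = far / iterations                     # Eq.(15): Positive Ratio
    return p_safety, p_danger, p_positive
```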

Table I and Table II show the number of steps and actions until the prey target is captured, without and with the ATField model, respectively. The numbers of steps and actions are almost the same in both cases, although the computation time with the ATField model is longer than that without it.

Table III and Table IV show the capture ratios of (Safety Target), (Within Safety), (Within Dangerous), and (Positive Ratio), and the distance, without and with the ATField model, respectively. The distance means the distance between targets. From these tables, the performance of the model with ATField is better than that of the model without ATField.

IV Knowledge Acquisition

The state-action rules in the lower layer are extracted by C4.5. The decision tree below shows part of the extracted results. The rules are extracted while training the module; the result shown is from trials 19,900-20,000. ‘theta_x’ and ‘theta_y’ mean the difference along the x axis and the y axis of the move, respectively. The output (teaching signal) is the target position sent from the upper layer. The training data set contains 72,327 instances.


theta_Y > -1
|   theta_X <= -1
|   |   theta_Y <= 0: left (12519.0/1907.0)
|   |   theta_Y > 0
|   |   |   theta_Y <= 1: left (2172.0/1096.0)
|   |   |   theta_Y > 1
|   |   |   |   theta_X <= -2
|   |   |   |   |   theta_Y <= 2: down (270.0/128.0)
|   |   |   |   |   theta_Y > 2
|   |   |   |   |   |   theta_X <= -3
|   |   |   |   |   |   |   theta_X <= -5
|   |   |   |   |   |   |   |   theta_Y <= 3: left (4.0/1.0)
|   |   |   |   |   |   |   |   theta_Y > 3: down (2.0/1.0)
|   |   |   |   |   |   |   theta_X > -5
|   |   |   |   |   |   |   |   theta_Y <= 3: down (20.0/8.0)
|   |   |   |   |   |   |   |   theta_Y > 3: stay (5.0/2.0)
|   |   |   |   |   |   theta_X > -3: left (56.0/29.0)
|   |   |   |   theta_X > -2: down (648.0/328.0)
|   theta_X > -1
|   |   theta_Y <= 0
|   |   |   theta_X <= 0: stay (8056.0)
|   |   |   theta_X > 0
|   |   |   |   theta_X <= 2: right (12260.0/1733.0)
|   |   |   |   theta_X > 2
|   |   |   |   |   theta_X <= 3: right (320.0/179.0)
|   |   |   |   |   theta_X > 3
|   |   |   |   |   |   theta_X <= 4: stay (442.0/104.0)
|   |   |   |   |   |   theta_X > 4: right (47.0/33.0)
|   |   theta_Y > 0
|   |   |   theta_X <= 0
|   |   |   |   theta_Y <= 1: down (11541.0/1451.0)
|   |   |   |   theta_Y > 1
|   |   |   |   |   theta_Y <= 2: down (959.0/352.0)
|   |   |   |   |   theta_Y > 2
|   |   |   |   |   |   theta_Y <= 3: down (328.0/199.0)
|   |   |   |   |   |   theta_Y > 3
|   |   |   |   |   |   |   theta_Y <= 5: stay (173.0/83.0)
|   |   |   |   |   |   |   theta_Y > 5: left (11.0/4.0)


For easy comprehension, the extracted knowledge is also shown below in If-Then rule format. In the simulation, we obtained 47 state-action rules; only 10 sample rules are shown.


No.1
If theta_X <= 4 theta_X > 2 theta_Y <= -6 Then up
with CF=1.0
No.2
If theta_X <= 0 theta_X > -1 theta_Y <= 0 theta_Y > -1 Then stay
with CF=1.0
No.3
If theta_X <= 0 theta_X > -1 theta_Y <= 1 theta_Y > 0 Then down
with CF=0.8742743263148774
No.4
If theta_X <= 2 theta_X > 0 theta_Y <= 0 theta_Y > -1 Then right
with CF=0.8586460032626427
No.5
If theta_X <= 0 theta_X > -1 theta_Y <= -1 Then up
with CF=0.8478816513050886
No.6
If theta_X <= -1 theta_Y <= 0 theta_Y > -1 Then left
with CF=0.8476715392603243
No.7
If theta_X <= 4 theta_X > 3 theta_Y <= 0 theta_Y > -1 Then stay
with CF=0.7647058823529411
No.8
If theta_X <= -5 theta_Y <= 3 theta_Y > 2 Then left
with CF=0.75
No.9
If theta_X <= 1 theta_X > 0 theta_Y <= 5 theta_Y > 4 Then stay
with CF=0.7272727272727273
No.10
If theta_X <= -6 theta_Y <= -1 theta_Y > -2 Then left
with CF=0.7142857142857143
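Applying such extracted rules at run time amounts to interval matching on (theta_X, theta_Y). The sketch below is an assumed decision procedure, not the paper's code: rules are transcribed as (bounds, action, CF) triples with open lower and closed upper bounds, mirroring the `theta_X <= u theta_X > l` form of the rules above, and an unmatched state falls back to 'stay'.

```python
def match_rule(theta_x, theta_y, rules):
    """Return (action, CF) of the first rule whose bounds contain
    (theta_x, theta_y); falls back to 'stay' with CF 0.0.
    Each rule is ((x_lo, x_hi), (y_lo, y_hi)), action, CF with
    lo < value <= hi semantics, matching the C4.5 output."""
    for conds, action, cf in rules:
        if all(lo < v <= hi for v, (lo, hi) in zip((theta_x, theta_y), conds)):
            return action, cf
    return "stay", 0.0
```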


By using the acquired rules, the resulting number of steps and actions and the ratio of captured prey agents are as shown in Table V and Table VI. The performance with the rules is better than that without the rules.

V Conclusive Discussion

Hierarchical Modular Reinforcement Learning (HMRL) [7] consists of 2-layered learning, where Profit Sharing plans a prey position in the higher layer and Q-learning trains the state-action rules toward the target in the lower layer. If the multi-agent pursuit problem has 2 or more prey agents, in many cases the rewards for them are set toward the same purpose, that is, the rewards have the same value. In this paper, we expanded HMRL to the multi-target problem under consideration of the distance between targets. The function, called ‘AT field’, can estimate the interest of an agent according to the distance between 2 agents and the advantage/disadvantage of the other agent. Moreover, the knowledge related to state-action rules is extracted by C4.5. In the simulation results, the AT field function is effective in measuring the difference between the rewards of the prey agents. We will verify the method on real-world problems in future work.

References

• [1] J.J.Grefenstette, Credit Assignment in Rule Discovery Systems Based on Genetic Algorithms, Machine Learning, Vol.3, pp.225-245, 1988.
• [2] K.Miyazaki, S.Arai, and S.Kobayashi, A Theory of Profit Sharing in Multi-agent Reinforcement Learning, Journal of Japanese Society for Artificial Intelligence, Vol.14, No.6, pp.1156-1164, 1999 (Japanese).
• [3] R.S.Sutton and G.B.Andrew, Reinforcement Learning: An Introduction, MIT Press, 1998.
• [4] Y.Ishiwaka, T.Sato, Y.Kakazu, An approach to the pursuit problem on a heterogeneous multiagent system using reinforcement learning, Robotics and Autonomous Systems, Vol.43, No.4, pp.245-256, 2003.
• [5] Z.Pu-Cheng, H.Bing-Rong, H.Qing-Cheng, and J.Khurshid, Hybrid Multiagent reinforcement Learning Approach:The Pursuit Problem, Information Technology Journal, Vol.5, No.6, pp.1006-1011, 2006.
• [6] T.Wada, T.Okawa, T.Watanabe, A study on hierarchical modular reinforcement learning for multi-agent pursuit problem based on relative coordinate states, Proc. of the 8th IEEE international conference on Computational intelligence in robotics and automation, pp.302-308, 2009.
• [7] T.Watanabe and T.Wada, A Study on Hierarchical Modular Reinforcement Learning Algorithm for Multi-Agent Pursuit Problem Based on Relative Coordinate States, Journal of Bio-medical Fuzzy System Association, Vol.12, No.2, pp.65-74, 2010 (Japanese).