Reinforcement Learning with Internal Reward for Multi-Agent Cooperation: A Theoretical Approach

This paper focuses on multi-agent cooperation, which is generally difficult to achieve without sufficient information about other agents, and proposes a reinforcement learning method that introduces an internal reward for multi-agent cooperation without sufficient information. To guarantee such cooperation, this paper theoretically derives the condition for selecting appropriate actions by changing the internal rewards given to the agents, and extends reinforcement learning methods (Q-learning and Profit Sharing) so that the agents acquire appropriate Q-values updated according to the derived condition. Concretely, the internal rewards change only when the agents find a better solution than the current one. Intensive simulations on maze problems as a testbed have revealed the following implications: (1) the proposed method successfully enables the agents to select their own appropriate cooperative actions, which contribute to acquiring the minimum steps toward their goals, while the conventional methods (i.e., Q-learning and Profit Sharing) cannot always acquire the minimum steps; and (2) the proposed method based on Profit Sharing provides the same good performance as the proposed method based on Q-learning.


INTRODUCTION
Multi-agent reinforcement learning is suitable for tackling many problems such as multi-robot cooperation and car navigation. However, it is generally difficult to obtain good performance because the agents have to cooperate with each other. To address this issue, previous works have proposed, for example, swarm reinforcement learning [1] and fast adaptive learning in stochastic games [2]. Concretely, swarm reinforcement learning enables agents to choose appropriate actions from the agents' various actions to generate a formation of agents. Fast adaptive learning, on the other hand, enables agents to choose the optimal action for stochastic games by observing the actions of other agents, which encourages other agents to choose the optimal actions by showing them the action of the agent. However, swarm reinforcement learning is heuristic, meaning that agent cooperation cannot be guaranteed due to insufficient information about the other agents. Fast adaptive learning is theoretical, but it needs complete information, which is generally difficult to acquire. Even if we assume that complete information can be acquired, a huge amount of communication would be needed, and we generally cannot guarantee the absence of communication failures.
To tackle the above issues, this paper proposes a theoretical method that requires only a (very) small amount of information. Concretely, this paper theoretically derives the condition for selecting appropriate actions by changing the internal rewards given to the agents, and extends reinforcement learning methods (i.e., Q-learning and Profit Sharing) so that the agents acquire appropriate Q-values updated according to the derived condition. Note that Q-learning agents are employed because of the mathematical proof available for Q-learning (i.e., the convergence of the Q-value is proved in the single-agent environment), and Profit Sharing agents are also employed for comparison with the Q-learning ones. As the first step toward our purpose, we start by investigating the proposed method in a simple maze problem where two agents learn the actions that minimize the steps toward their goals through cooperation with each other using a small amount of communication.
This paper is organized as follows. Section 2 explains reinforcement learning and Section 3 describes the multi-agent cooperation task addressed in this paper. Our method is proposed in Section 4. Section 5 conducts the experiments and analyzes the obtained results. Finally, our conclusion is given in Section 6.

REINFORCEMENT LEARNING
Before adding a mechanism based on the internal reward to two reinforcement learning techniques, Q-learning and Profit Sharing, this section briefly describes both.

Q-learning
Q-learning [3] is a very popular reinforcement learning technique which was originally designed for single-agent tasks. In the general framework of reinforcement learning, an agent interacts with an environment: the agent observes a state from the environment, takes an action, and then receives a reward from it.
In Q-learning, the agent calculates a state-action value (called the Q-value, $Q(s, a)$) for each possible state-action pair in the environment, which estimates the future reward the agent will eventually receive when action $a$ is executed in state $s$. The agent acquires a policy $\pi(s, a)$ to decide which action should be executed to maximize the received reward. This results in finding the minimum steps to a proper goal that returns the maximum reward. Technically, the policy $\pi$ can be a probability of selecting each action in state $s$, calculated on the basis of the Q-values $Q(s, a)$, $a \in A$, where $A$ is the action space that defines the possible actions the agent can take in state $s$.
An agent powered by Q-learning aims at estimating all possible Q-values accurately, through interaction with the environment, in order to find the minimum steps.

$Q(s, a)$ is updated as follows:
$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right] \qquad (1)$$
where $\max_{a'} Q(s', a')$ is the largest Q-value in state $s'$ after the transition from state $s$ to $s'$ by executing action $a$; $r$ is the reward received from the environment; $\alpha$ is the learning rate; and $\gamma$ is the discount factor. $\alpha$ is a real number from 0 to 1 and expresses what fraction of the new value (reward, etc.) is incorporated into the Q-value. $\gamma$ is a real number from 0 to 1 and expresses how much of the previously calculated Q-value of the next state is incorporated into the new Q-value.
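As an illustration of the update described above, the following is a minimal sketch of one tabular Q-learning step. The function name `q_update`, the dictionary-based table, the state and action labels, and the default values of `alpha` and `gamma` are assumptions made for this example and are not taken from the paper.

```python
from collections import defaultdict


def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Apply one tabular Q-learning update to Q[(s, a)] and return the new value."""
    # Largest Q-value available in the next state s_next.
    best_next = max(Q[(s_next, a_next)] for a_next in actions)
    # Move Q(s, a) toward the target r + gamma * max Q(s', a') by a fraction alpha.
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
    return Q[(s, a)]


# Hypothetical usage: unseen state-action pairs default to a Q-value of 0.
Q = defaultdict(float)
actions = ["up", "down", "left", "right"]
Q[("s1", "right")] = 1.0
new_value = q_update(Q, "s0", "right", r=0.0, s_next="s1", actions=actions)
```

With a zero immediate reward, the update only propagates a discounted fraction of the next state's best Q-value back to the current pair.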

Profit Sharing
Profit Sharing [4] is also a popular reinforcement learning technique. In its update rule (Equation 2), $(s_t, a_t)$ is the state-action pair of the agent at time $t$; $r(t)$ is a reward function that decides the value of the reward assigned to $Q(s_t, a_t)$; and $C_{bid}$ is a coefficient. The reward function $r(t)$ can be defined in several ways. Here, we employ the reward function shown in Figure 1. In this study, a Profit Sharing agent assigns Q-values to the actions on the path it constructs, as on the left of Figure 2.
In Figure 2, the agent reaches the goal by passing through the path on the right of Figure 2. Then, the agent constructs the path on the left of Figure 2 from the one on the right, and updates the actions on the path on the left. The path is constructed according to the following three rules.
1. The agent follows a path, and stores the path it has already passed.
2. When the agent returns to a state it has already visited, the part of the stored path between the two visits to that state is discarded.
3. When the agent reaches the goal, the path construction is finished.
The reason the agent constructs this path is that the Q-values of all actions then become like the Q-values of common Q-learning, and the situation becomes like that of the proposed method.
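The three path-construction rules can be sketched as a loop-removal routine. This is only one possible reading of the rules: the function name `loop_free_path`, the representation of the history as a list of `(state, action)` pairs, and the example trajectory are assumptions for illustration.

```python
def loop_free_path(trajectory):
    """Return the stored path with every loop removed.

    trajectory: list of (state, action) pairs in the order they were
    experienced, ending with the pair that led to the goal (rule 3).
    """
    path = []       # rule 1: the path stored so far
    index_of = {}   # state -> position of that state in `path`
    for state, action in trajectory:
        if state in index_of:
            # Rule 2: the state was visited before, so discard the loop,
            # i.e. everything stored from the first visit onward.
            cut = index_of[state]
            for old_state, _ in path[cut:]:
                del index_of[old_state]
            del path[cut:]
        index_of[state] = len(path)
        path.append((state, action))
    return path


# Hypothetical trajectory: the agent wanders B -> C -> B before moving on.
trajectory = [("A", "right"), ("B", "up"), ("C", "down"), ("B", "right"), ("D", "right")]
path = loop_free_path(trajectory)
```

Only the actions that remain on the returned path would then receive a Q-value update.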

MULTI-AGENT COOPERATION TASK
This section introduces a multi-agent cooperation task using a 3x8 grid maze problem. Figure 3 shows an example of a 3x8 grid maze. In the maze used in this paper, as shown in the figure, there are two possible start states (A, B) where the agents are initially placed before learning, and two goal states (S, L) which the agents attempt to reach.
A difficulty of the multi-agent cooperation task on the maze is

PROPOSED METHOD
In a dilemma maze problem like that of Figure 3, the proposed method achieves the cooperation of agents mainly through the following two steps: step 1 is a process of goal determination with communication between the agents, and step 2 is a process of internal reward shaping. The remainder of this section first explains the overview of the proposed method as shown in Figure 4, and then the two main steps, which correspond to processes 8 and 9 in the figure.

Overview of procedure
As shown in Figure 4, the agents observe a state and then choose an action following the framework of reinforcement learning (processes 1 and 2), which results in the state transition of process 3. Then, the agent updates the Q-value of the executed action (process 4). These processes are the standard procedure of reinforcement learning, and the cycle from process 1 to process 4 is often called a "step".
After process 4, the proposed method determines whether each agent has reached the goal (process 5). If the agent has reached the goal, it receives a reward (process 6); otherwise, it jumps to process 11. In process 7, the agent updates the minimum steps; specifically, if the number of steps from the start position to the goal is smaller than the stored minimum steps the agent has acquired before, the agent replaces the stored minimum steps with the newly found ones.
Next, in process 8, the agent determines the optimal goal from the minimum steps (details can be found in subsection 4.2). Then, in process 9, the agent estimates an internal reward by using the minimum steps (details can be found in subsection 4.3). After that, the agent updates the Q-value using the internal reward in process 10. In process 11, the system determines whether the step count is greater than a threshold; if so, the system goes to process 12; otherwise, the agent goes back to process 1. In process 12, the system counts the learning iterations and determines whether the iteration count is greater than a threshold. The whole process ends when the system meets this condition; otherwise, the system returns to process 1.
When reaching a goal in fewer steps than before, agents memorize these steps and send them to the other agents, i.e., each agent shares the Q-value table the other agents learn. This sharing can continue throughout the agents' learning process as they explore the maze. Unlike standard reinforcement learning, where an agent explores the shortest steps to a goal in order to maximize the reward it gets, the proposed method makes the agents find the goals where all agents can receive the maximum rewards per unit step. For example, Figure 5 shows these processes in the maze of Figure 3. Each agent has memorized the minimum steps as $t_{AS}$, $t_{AL}$, $t_{BS}$ and $t_{BL}$, shown in the balloon of the agent.
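The bookkeeping of process 7 and the sharing of minimum steps between agents could be sketched as follows. The function name `record_episode`, the use of one dictionary per agent keyed by `(agent, goal)`, and the step counts in the usage lines are assumptions for illustration; the paper does not specify the message format.

```python
def record_episode(own_table, peer_tables, agent, goal, steps):
    """Store `steps` if it beats the known minimum and send it to the peers.

    Returns True when a new minimum was found (and therefore shared).
    """
    key = (agent, goal)
    if steps < own_table.get(key, float("inf")):
        own_table[key] = steps
        # Only an improved minimum is communicated, which keeps the
        # amount of exchanged information small.
        for table in peer_tables:
            table[key] = steps
        return True
    return False


# Hypothetical usage with two agents A and B and goal S.
table_a, table_b = {}, {}
first = record_episode(table_a, [table_b], "A", "S", 12)   # new minimum
second = record_episode(table_a, [table_b], "A", "S", 15)  # not shorter, ignored
third = record_episode(table_a, [table_b], "A", "S", 9)    # improved minimum
```

After these calls both tables hold the same minimum for the pair (A, S), which is the information the goal determination step relies on.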

Goal determination
In

Internal Reward Shaping
In this process, each agent shapes its reward function to reach the goal chosen in the first step (Figure 6). In the proposed method, the internal reward is added to reshape the reward of agent B so that it reaches goal L.
At the turning point in Figure 6, the Q-value of the action toward goal S eventually converges to a value $r$, and thus the Q-value of the action toward goal L is $r\gamma^{2}$. If agent B uses the internal rewards $r_S$ and $r_L$, the Q-value of the action toward goal S is $r_S$ and the Q-value of the action toward goal L is $r_L\gamma^{2}$. Since $r_L\gamma^{2} > r_S$ is satisfied if $r_L$ is $\frac{r}{\gamma^{2}} + 1$ and $r_S$ is $r$, agent B will finally reach goal L and be able to cooperate. We explain the general way to shape the internal reward in the following.
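The condition for the Figure 6 example can be verified by direct substitution. The short derivation below only restates the values given in the text and assumes $0 < \gamma \le 1$.

```latex
\begin{align*}
r_S &= r, \qquad r_L = \frac{r}{\gamma^{2}} + 1, \\
r_L \gamma^{2} &= \left( \frac{r}{\gamma^{2}} + 1 \right) \gamma^{2}
               = r + \gamma^{2} > r = r_S .
\end{align*}
```

The margin between the two Q-values at the turning point is therefore $\gamma^{2}$, which stays positive for any admissible discount factor.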

Mathematical Analysis
This section gives the mathematical description of the internal reward shaping explained in the preceding section. We generalize the proposed method from the case of the maze in Figure 6.
Agents estimate $r_{AS}$, $r_{AL}$, $r_{BS}$ and $r_{BL}$ so that they acquire cooperative actions. $r_{AS}$ is agent A's internal reward for goal S, $r_{AL}$ is agent A's internal reward for goal L, $r_{BS}$ is agent B's internal reward for goal S, and $r_{BL}$ is agent B's internal reward for goal L. Whether an agent reaches goal S or not is determined by the Q-values at the turning point. Because agent A must reach goal S, its Q-value of the action toward goal S must exceed that of the action toward goal L, which gives:
$$r_{AS} > \gamma^{t_{AL} - t_{AS}} \, r_{AL} \qquad (6)$$
In the same manner for agent B, the following expression must be satisfied:
$$r_{BL} > \frac{r_{BS}}{\gamma^{t_{BL} - t_{BS}}} \qquad (7)$$
In Figure 6, equation 7 is equal to $r_L\gamma^{2} > r_S$ (the exponent 2 in $r_L\gamma^{2} > r_S$ is $t_{BL} - t_{BS}$). As for agent A, it is not necessary to set an internal reward, since $r_{AS} = r > \gamma^{2} r = \gamma^{2} r_{AL}$ holds when $r_{AS} = r$ and $r_{AL} = r$. Therefore, the generalization from Figure 6 succeeds. In an implementation, the system must consider the size of the difference needed to meet equations 6 and 7. The proposed method therefore introduces a parameter $\delta$; $r_{AS}$ and $r_{BL}$ are set using $\delta$ as in equations 8 and 9.
Furthermore, the expressions obtained by modifying the above equations are shown below.
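The role of the margin $\delta$ can be illustrated with a small numerical sketch. It assumes that $\delta$ is simply added to the right-hand sides of inequalities 6 and 7, which is consistent with the Figure 6 example ($r_L = r/\gamma^{2} + 1$, i.e. $\delta = 1$) but is not confirmed to be the exact form of the paper's equations 8 and 9. The function name `internal_rewards`, the default parameter values, and the step counts in the usage lines are hypothetical.

```python
def internal_rewards(r, t_as, t_al, t_bs, t_bl, gamma=0.9, delta=1.0):
    """Return (r_AS, r_AL, r_BS, r_BL) that satisfy inequalities 6 and 7.

    Assumption: r_AL and r_BS keep the external reward r, while r_AS and
    r_BL are set to the bound of the inequality plus the margin delta.
    """
    r_al = r
    r_bs = r
    r_as = gamma ** (t_al - t_as) * r_al + delta   # inequality 6 with margin
    r_bl = r_bs / gamma ** (t_bl - t_bs) + delta   # inequality 7 with margin
    return r_as, r_al, r_bs, r_bl


# Hypothetical step counts: each agent is two steps closer to goal S.
gamma = 0.9
r_as, r_al, r_bs, r_bl = internal_rewards(10.0, t_as=3, t_al=5, t_bs=3, t_bl=5,
                                          gamma=gamma, delta=1.0)
holds_6 = r_as > gamma ** (5 - 3) * r_al
holds_7 = r_bl > r_bs / gamma ** (5 - 3)
```

A larger `delta` widens the gap between the competing Q-values at the turning point, which is the quantity varied in the reward-difference experiment.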

EXPERIMENT

Experimental setting
Here, we test our mechanism on two reinforcement learning techniques, Q-learning and Profit Sharing, i.e., both are extended with the proposed method to be applicable to the multi-agent cooperation task on the maze problem. Specifically, we apply both extended techniques to 100 different types of 3x8 grid mazes, in which the two start states and the two goal states are placed differently. Note that this paper deals with a two-agent cooperation task. We consider the following four cases as possible configurations of the multi-agent cooperation task:

• case 1: ideal and easy case
In this case, an agent cannot reach a goal another agent has already reached, and all agents know the minimum steps between every start and every goal from the beginning. In addition, if an agent observes the same state three times and the Q-value decreases by more than 0.1, the Q-value is not updated in that iteration.

• case 2: ideal and difficult case
In this case, each agent can reach a goal even if another agent has reached the same goal, and all agents know the minimum steps between every start and every goal from the beginning.

• case 3: practical and easy case
In this case, an agent cannot reach a goal another agent has already reached, and the agents do not know the minimum steps between every start and every goal at the beginning.
In addition, if an agent observes the same state three times and the Q-value decreases by more than 0.1, the Q-value is not updated in that iteration, and when the minimum steps are updated, the agents initialize all Q-values.

• case 4: practical and difficult case
In this case, each agent can reach a goal even if another agent has reached the same goal, and the agents do not know the minimum steps between every start and every goal at the beginning. In addition, when the minimum steps are updated, the agents initialize all Q-values.
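The four cases differ only in a handful of switches, which can be summarized as a configuration table. The class `CaseConfig` and its field names are hypothetical labels chosen for this sketch; the values are read off the case descriptions above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CaseConfig:
    name: str
    goal_exclusive: bool        # a goal already reached by another agent is blocked
    knows_min_steps: bool       # minimum steps are known from the beginning
    skip_large_decrease: bool   # skip the update after three visits and a drop over 0.1
    reset_q_on_new_min: bool    # re-initialize all Q-values when minimum steps change


CASES = {
    1: CaseConfig("ideal and easy", True, True, True, False),
    2: CaseConfig("ideal and difficult", False, True, False, False),
    3: CaseConfig("practical and easy", True, False, True, True),
    4: CaseConfig("practical and difficult", False, False, False, True),
}

# Cases in which the agents must discover the minimum steps themselves.
practical_cases = [k for k, c in CASES.items() if not c.knows_min_steps]
```

Grouping the settings this way makes explicit that "easy" versus "difficult" is about goal exclusivity, while "ideal" versus "practical" is about prior knowledge of the minimum steps.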

Common Q-learning vs Proposed Method
The comparison between common Q-learning and the proposed method is shown in Figure 11. The success rate of the proposed method is 100 percent, and the proposed method performs better than common Q-learning.

Q-learning vs Profit Sharing
The comparison between Q-learning with the proposed method and Profit Sharing with the proposed method is shown in Figure 16. The performance of Profit Sharing is good. However, the performance of Q-learning is better than that of Profit Sharing.

The results of each reward difference
The results for different internal reward differences are shown in Figure 23. When the reward difference is large, the probability of success is high in each case.

Discussion
The results of cases 1 and 2 support the mathematical analysis in Section 4, since our analysis argues that the internal reward can be theoretically determined if the agents know the minimum steps to the possible goal states. However, as shown in the results of cases 3 and 4, our method fails to enable the agents to cooperate with each other in some mazes. Let us discuss why the agents sometimes fail to cooperate.
In cases 3 and 4, the agents using Q-learning with the proposed method fail to cooperate with each other in the same two mazes. Thus, there are mazes in which the agents cannot cooperate easily. The reason is that the agents cannot reach every goal a sufficient number of times.
In addition, the difference between cases 1, 2 and cases 3, 4 is whether the agents know the minimum steps or not. Therefore, the failures occur because the agents cannot find the minimum steps. If we increase the number of learning iterations, the failures decrease. For example, Figure 25 shows the result of 30000 iterations in case 4. However, relying on many learning iterations is not practical. From the above, the search ability of the agents decides whether an agent succeeds in cooperating with another agent. We set epsilon to 0.7 in the experiments of this paper in order to improve the search ability, so that the agents explore large areas of the maze many times. This value of epsilon is larger than usual.
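To make the role of epsilon concrete, the following sketch shows a common epsilon-greedy action selection. It assumes that the paper's epsilon is the probability of taking a uniformly random action, which the text does not state explicitly; the function name `epsilon_greedy` and the example Q-values are hypothetical.

```python
import random


def epsilon_greedy(q_values, epsilon=0.7, rng=random):
    """Pick an action from a dict mapping action -> Q-value.

    With probability epsilon a uniformly random action is explored;
    otherwise the action with the largest Q-value is exploited.
    """
    if rng.random() < epsilon:
        return rng.choice(list(q_values))
    return max(q_values, key=q_values.get)


# Hypothetical Q-values at one state of the maze.
q_values = {"up": 0.2, "down": 0.1, "left": 0.0, "right": 0.9}
greedy_action = epsilon_greedy(q_values, epsilon=0.0)
explored_action = epsilon_greedy(q_values, epsilon=1.0, rng=random.Random(0))
```

Under this reading, an epsilon of 0.7 means that most moves are exploratory, which matches the remark that the agents search large areas of the maze many times.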

Gap between internal rewards
According to the analysis of Section 4, agents can cooperate when there is even a small gap between the internal rewards. However, the experimental results show that the success probability falls short of 100 percent when the gap is under 1. Therefore, in the practical case, the outcome depends on the search ability of the agents.
In contrast, the success probability is 100 percent when the gap is over 1. From the above, a gap in the internal rewards can cover the difference between the case analyzed in Section 4 and the practical case.
For example, Figure 25 presents the distribution of the Q-values.
There are four mazes in Figure 24. Each square in the mazes is a cell: an orange square is a goal position, a blue square is a start position, and a green square is a common position. Some agents. However, if agents choose the wrong goal, in other words, if agents learn incorrectly without noticing it, they can hardly correct their learning, for the same reason.

CONCLUSION
This paper focused on multi-agent cooperation, which is generally difficult to achieve without sufficient information about other agents, and proposed a reinforcement learning method that introduces an internal reward for multi-agent cooperation without sufficient information. To guarantee such cooperation, this paper theoretically derived the condition for selecting appropriate actions by changing the internal rewards given to the agents, and extended reinforcement learning methods (i.e., Q-learning and Profit Sharing) so that the agents acquire appropriate Q-values updated according to the derived condition. Through intensive simulations on 3x8 grid maze problems where two agents are required to cooperate with each other, the following implications were revealed: (1) the proposed method successfully enables the agents to select their own appropriate cooperative actions, which contribute to acquiring the minimum steps toward their goals, while the conventional methods (i.e., Q-learning and Profit Sharing) cannot always acquire the minimum steps (note that the same tendency is obtained even with different parameter settings, such as the epsilon and the gap of the internal reward); and (2) the proposed method based on Profit Sharing provides the same good performance as the proposed method based on Q-learning.
What should be noticed here is that the results have only been obtained from one simple grid maze problem with two agents. Therefore, further careful qualifications and justifications, such as an analysis of results using other maze problems or an increase in the number of agents, are needed to generalize our results. Such important directions must be pursued in the near future, in addition to the following future research: (1) reducing the number of times information is sent to other agents; and (2) applying the proposed method to other tasks.
Similar to Q-learning, an agent in the framework of Profit Sharing interacts with the environment and calculates the state-action values. Unlike Q-learning, which calculates Q-values after executing every action, Profit Sharing is designed to calculate Q-values when the agent receives the reward. The agent stores a history of the state-action pairs it sensed and executed. Then, the set of Q-values for all stored state-action pairs is calculated according to Equation (2).

Figure 1: Reward function for Profit Sharing employed in this paper

Figure 4: Flow chart of proposed method

Figure 5: Goal determination. In the proposed method, agents have memorized the minimum steps between every start and every goal (because of the Q-value table).

Figures 9-15 show the success rate. The vertical axis shows the number of successful cooperations and the horizontal axis presents 100

Figure 24: Q-value in maze 14 (upper left: agent A and internal reward gap is 0.1, lower left: agent B and internal reward gap is 0.1, upper right: agent A and internal reward gap is 10, lower right: agent B and internal reward gap is 10)
In Figure 5, agent A has reached goal S and agent B has reached goal L by passing through the paths indicated by the directional arrows. At this time, if the newly discovered $t_{AS}$ is shorter than the $t_{AS}$ agent A already holds, $t_{AS}$ is replaced with the new value and sent to agent B. Then, agent B updates $t_{BL}$ in the same way as agent A. As a result, each agent memorizes the minimum steps between all agents and all goals. Then, each agent determines the goal where all agents can receive the maximum rewards. In this figure, goal S is the goal agent A must reach and goal L is the goal agent B must reach, because the step count is shortest when agent A reaches goal S and agent B reaches goal L.
Figure 6 presents a way for two agents to cooperate with each other. $t_{BS}$ is the minimum step count from start B to goal S and $t_{BL}$ is the minimum step count from start B to goal L. Note that each agent knows its goal through the process introduced in subsection 4.2. The agent estimates the Q-value needed to reach the optimal goal determined in subsection 4.2 by means of the internal reward described above. In Figure 6, agent A should reach goal S and agent B should reach goal L. Note that it is not necessary for agent A to set an internal reward, since it reaches goal S normally (maximizing the reward for agent A). However, agent B should set the internal reward: under standard reinforcement learning agent B would reach goal S, while agent B needs to reach goal L to maximize the reward for all agents.
In the same manner for agent B, its Q-value of the action toward goal L must be the largest at the turning point. If $t_{AS}$ represents the minimum step count when agent A reaches goal S, $t_{AL}$ the minimum step count when agent A reaches goal L, $t_{BS}$ the minimum step count when agent B reaches goal S, and $t_{BL}$ the minimum step count when agent B reaches goal L, each Q-value of the action toward goal S is as follows: