Vol. 33, No. 6    ACTA AUTOMATICA SINICA    June, 2007
Role-based Context-specific Multiagent Q-learning
JIANG Da-Wei1    WANG Shi-Yuan1    DONG Yi-Sheng1
Abstract  One of the main problems in cooperative multiagent learning is that the joint action space grows exponentially with the number of agents. In this paper, we investigate a sparse representation of the coordination dependencies between agents, employing roles and context-specific coordination graphs to reduce the joint action space. In our framework, the global joint Q-function is decomposed into a number of local Q-functions. Each local Q-function is shared among a small group of agents and is composed of a set of value rules. We propose a novel multiagent Q-learning algorithm which learns the weights in each value rule automatically. We give empirical evidence to show that our learning algorithm converges to the same optimal policy as traditional multiagent learning techniques, with a significantly faster speed.
Key words  Multiagent Q-learning, multiagent coordination, role, context-specific coordination graph
1 Introduction
A multiagent system (MAS) is a group of agents co-existing in an environment that interact with each other to optimize a performance measure [1,2]. Research in MASs focuses on agent behavior management issues. In this paper, we concentrate on fully cooperative MASs in which all the agents share a common goal. Example scenarios are a team of robots that play football against another team, or a group of robots that plan to build a house. A key aspect of such systems is coordination: the process of ensuring that the individual actions of the agents produce optimal joint decisions for the whole group.
Reinforcement learning (RL) [3,4] techniques have been widely used in many single-agent domains to learn the optimal policy of an agent in uncertain environments. However, directly porting such techniques to multiagent settings is not straightforward [5,6]. In principle, it is possible to treat a MAS as a single 'big' agent and learn the optimal joint policy using standard reinforcement learning techniques such as Q-learning [6,7]. Unfortunately, the joint action space grows exponentially with the number of agents, so even in small settings these techniques are infeasible. On the other hand, we can let each agent learn its policy independently of the other agents [6]. Using this approach, the learning agent only considers its own actions without knowledge of the actions of the other agents. However, the convergence condition of RL no longer holds, since the transition model of the learning agent depends on the policies of the other learning agents [8], and the learning process will oscillate.
A recent approach to decreasing the size of the joint action space uses a context-specific coordination graph (CG) [9∼12]. The idea behind CG is that in many situations only a small number of agents need to coordinate their actions while the rest can act individually. For example, in robotic soccer, only the ball owner and his surrounding players need to coordinate their actions to perform a pass, while the other robots can take their individual decisions. Therefore the global joint Q-function, the representation of the global joint coordination dependencies between all agents, can be approximated as a linear combination of local terms, each representing the local coordination dependencies between a small subgroup of the agents [9]. Unfortunately, it is difficult to apply the original rule-based CG to dynamic domains. Furthermore, until now, only a planning algorithm has been proposed for determining the weights in each value rule of CG [9,10]. Such an algorithm assumes the complete model of the environment as prior knowledge, which is often unavailable in practical MASs.
Received May 18, 2006; in revised form August 11, 2006
Supported by the Natural Science Foundation of Jiangsu Province of P. R. China (BG2004034)
1. School of Computer Science and Engineering, Southeast University, Nanjing 210096, P. R. China
DOI: 10.1360/aas-007-0583
In this work, we propose a formal framework and a novel multiagent Q-learning algorithm to scale multiagent learning up to large, complex systems by use of problem-specific coordination structure. Our framework is built on CG. To overcome the drawback of the original CG in dynamic domains, we use roles [13] to discretize the continuous states. In the off-line design stage, instead of defining value rules for each agent, we define value rules for each role. In the on-line assignment stage, we use a role assignment algorithm to assign a role to each agent, and each agent gets the corresponding value rules associated with that role. Thereafter, the context-specific coordination graph is established. To learn the weights in CG, we propose a novel multiagent Q-learning algorithm. The algorithm approximates the global joint Q-function with a sum of local Q-functions. Each local Q-function is shared among a small number of agents and is the sum of the value rules that describe the coordination dependencies among them. We then update the weight in each value rule by an updating rule derived from standard Q-learning. We call our approach role-based context-specific Q-learning (RQ). Our approach only assumes that the coordination structure of the system is known beforehand. It does not need the complete model of the environment and can easily be applied in both discrete and dynamic domains. We give empirical evidence to show that our learning algorithm converges to the same optimal policy produced by traditional multiagent learning techniques, with a significantly faster learning speed.
We make the following contributions:
• To our knowledge, we are the first to combine roles and context-specific CG to represent the coordination dependencies between the agents in MASs. The original CG can be regarded as a special case of our framework with only one role in the system and that unique role holding all value rules. Therefore, our framework is feasible in both discrete and dynamic domains.
• We propose a novel multiagent Q-learning algorithm to learn the weights in each value rule. The algorithm does not assume the complete model of the environment as prior knowledge. Thus, it is more suitable for real-world MASs than the planning algorithm.
• Experiments show that our algorithm can learn the same optimal policy at a significantly faster speed than traditional multiagent RL.
The paper is organized as follows. In Section 2, we use the collaborative multiagent Markov decision process (CMMDP) framework to formalize the multiagent cooperative learning problem and review different solutions. We then describe our proposed framework and learning algorithm in Section 3. Section 4 experimentally validates the correctness and efficiency of our algorithm. Section 5 concludes.
2 Collaborative multiagent MDPs and reinforcement learning
This section will discuss several multiagent RL methods. First, we formalize the multiagent cooperative learning problem using the CMMDP [12] framework, which extends the traditional single-agent Markov decision process (MDP) to multiagent cooperative settings. In this paper, we use upper case letters (e.g., X) to denote random variables and lower case (e.g., x) to denote their values. We also use boldface to denote vectors of variables (e.g., X) or their values (x).
Definition 1. (Guestrin) A CMMDP Γ is a five-tuple ⟨n, S, A, R, T⟩, where n is the number of agents; S = {S_1, ..., S_n} is a finite joint state space; A = ×_{i=1}^{n} A_i is the joint action space of the n agents; R = Σ_{i=1}^{n} R_i(s, a), where R_i : S × A → R is a reward function which returns the reward R_i(s, a) for agent i when the agents take the joint action a in state s; and T : S × A × S → [0, 1] is a Markovian transition function which describes the probability p(s'|s, a) that the system will move from state s to state s' after the agents perform the joint action a ∈ A. The objective of the agents is to find a joint policy π = {π_{i=1...n}} (where π : S → A and π_i : S → A_i) that optimizes the sum of expected discounted rewards

    Q*(s, a) = max_π Q^π(s, a) = max_π E[ Σ_{t=0}^{∞} γ^t R(s_t, π(s_t)) | s_0 = s, a_0 = a ]    (1)

for each state s, where γ ∈ [0, 1] is a discount factor.
RL [3,4] is successfully used in single-agent domains to estimate Q*(s, a). One of the most important breakthroughs in RL is the development of Q-learning [3,14]. Q-learning begins with an initial estimate Q(s, a) and repeatedly updates it using the following rule

    Q(s, a) ← Q(s, a) + α [R(s, a) + γ max_{a'} Q(s', a') − Q(s, a)]    (2)
where α ∈ (0, 1) is the learning rate. Q-learning converges to the optimal Q*(s, a) under certain conditions [14]. However, extending single-agent RL to multiagent systems is not straightforward [5,7]. We review the existing multiagent RL techniques in the following subsections.
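For concreteness, the following is a minimal C++ sketch of the tabular update in (2) on a toy single-agent chain problem; the environment, reward, and hyper-parameter values are illustrative assumptions and not part of the paper.

// Minimal sketch of the tabular Q-learning update in (2).
// The toy chain environment and the hyper-parameters are illustrative assumptions.
#include <algorithm>
#include <array>
#include <cstdio>
#include <random>

constexpr int kStates  = 5;   // states 0..4, state 4 is terminal
constexpr int kActions = 2;   // 0 = left, 1 = right
constexpr double kAlpha = 0.3, kGamma = 0.9, kEps = 0.1;

int main() {
    std::array<std::array<double, kActions>, kStates> Q{};  // Q(s,a), zero-initialized
    std::mt19937 rng(0);
    std::uniform_real_distribution<double> unif(0.0, 1.0);
    std::uniform_int_distribution<int> randAction(0, kActions - 1);

    for (int episode = 0; episode < 2000; ++episode) {
        int s = 0;
        while (s != kStates - 1) {
            // epsilon-greedy action selection
            int a = (unif(rng) < kEps) ? randAction(rng)
                                       : (Q[s][1] >= Q[s][0] ? 1 : 0);
            // toy deterministic transition: right moves forward, left moves back
            int s2 = (a == 1) ? s + 1 : (s > 0 ? s - 1 : 0);
            double r = (s2 == kStates - 1) ? 1.0 : 0.0;  // reward only at the goal
            // update (2): Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
            double maxNext = (s2 == kStates - 1) ? 0.0 : std::max(Q[s2][0], Q[s2][1]);
            Q[s][a] += kAlpha * (r + kGamma * maxNext - Q[s][a]);
            s = s2;
        }
    }
    for (int s = 0; s < kStates; ++s)
        std::printf("Q(%d, left)=%.3f  Q(%d, right)=%.3f\n", s, Q[s][0], s, Q[s][1]);
    return 0;
}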
2.1 Joint action learners
This approach treats the whole MAS as a single 'big' agent. The state and action sets of this agent are the joint state space and the joint action space of the original MAS, respectively [6]. The 'big' agent uses Q-learning to learn the optimal policy. This technique is called joint action learning (JAL). The drawback of JAL is that both the state set and the action set of the learner grow exponentially with the number of agents. Therefore, JAL is infeasible even in small settings [6].
2.2 Independent learners
On the other hand, we can let each agent learn its policy independently, without knowledge of the actions and rewards of the other agents [6,7]. This technique is called independent learners (IL). Although an IL learner does not need to exhaust the exponential joint action space, the convergence proof of Q-learning no longer holds since the transition model depends on the actions of the other learning agents [8]. Despite this drawback, IL has been used in some situations [6].
3 Role-based context-specific Q-learning
This section describes our proposed multiagent learning method. We treat the whole MAS as a single 'big' agent, like JAL, to guarantee the RL convergence condition. The problem is that the joint action space grows exponentially with the number of agents. In general, this problem is intractable [12]. However, many complex systems have a "nearly decomposable, hierarchical structure", with the subsystems interacting only weakly between themselves [16]. This is the type of structure that a human decision maker exploits to solve large-scale problems. We may make the best of such problem-specific structure to scale the multiagent learning algorithm up to large, complex systems. To achieve this goal, we first build a formal framework to represent the coordination dependencies (the problem-specific structure) between the agents. Then we propose a novel multiagent Q-learning algorithm to learn the optimal policy.
3.1 Context-specific coordination graphs and roles

According to [10], the coordination dependencies can be represented by a context-specific coordination graph (CG). First, we define a set of value rules. Each rule defines a context in which the involved agents coordinate their joint action. The rules consistent with a given state are transformed into a CG G = (V, E), with each agent mapped to a node and each coordination dependency to an edge. Only interconnected agents have to coordinate their actions. The global joint Q-function is approximated as the sum of a set of local Q-functions: Q(s, a) = Σ_{i=1}^{n} Q_i(s, a). Each local Q-function is shared by its involved agents Agents[Q_i] = {A_i ∈ A | A_i ∈ Dom[Q_i]} and is the sum of value rules.

Definition 2. (Guestrin) A value rule ρ = ⟨s ∧ a : v⟩ is a function ρ : S × A → R such that ρ(s, a) = v when the current world state is consistent with s and the agents perform the joint action a, and 0 otherwise.
Definition 3. A local Q-function Q_i : S × A → R is shared among its involved agents Agents[Q_i] = {A_i ∈ A | A_i ∈ Dom[Q_i]} and is composed of a set of value rules {ρ_1, ..., ρ_n} such that

    Q_i(s, a) = Σ_{j=1}^{n} ρ_j(s, a)    (3)

where Agents[ρ_j] ∩ Agents[Q_i] ≠ ∅ and ρ_j is consistent with the current state s.
If the weights in all value rules are known, the optimal joint action of the agents a* = argmax_a Q(s, a) can be computed efficiently by the variable elimination (VE) algorithm [10,11]. The drawback of the original rule-based CG is that it is difficult to use in dynamic environments. To tackle that problem, we introduce roles. Our role definition is similar to [13] but adds the associated value rules.

Definition 4. A role is a tuple ⟨m, P_m, r_{i,m}⟩, where m ∈ M is the role's identity; P_m is the set of value rules associated with the role m; and r_{i,m} : r(i, m) → [0, 1] is a potential function which determines how appropriate agent i is for the role m in the current world state.
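To make Definitions 2 and 3 concrete, the sketch below stores a value rule as a partial state context, a partial joint-action context, and a weight, and evaluates a local Q-function as the sum of the rules consistent with the current state-action pair, cf. (3). The integer-map encoding of states and actions and the toy rule weights are our own illustrative assumptions.

// Sketch of value rules (Definition 2) and a local Q-function (Definition 3).
// Representing states and joint actions as integer-valued maps is an assumption.
#include <cstdio>
#include <map>
#include <vector>

using Assignment = std::map<int, int>;  // variable index -> value (state vars or agent actions)

struct ValueRule {
    Assignment stateContext;   // the s-part of <s ^ a : v>
    Assignment actionContext;  // the a-part: agent index -> required action
    double v;                  // the rule's weight (to be learned)
};

// A partial assignment is consistent with a full one if they agree on all shared keys.
bool consistent(const Assignment& partial, const Assignment& full) {
    for (const auto& kv : partial) {
        auto it = full.find(kv.first);
        if (it == full.end() || it->second != kv.second) return false;
    }
    return true;
}

// Local Q-function: sum of the rules consistent with (s, a), cf. (3).
double localQ(const std::vector<ValueRule>& rules,
              const Assignment& s, const Assignment& a) {
    double q = 0.0;
    for (const auto& rho : rules)
        if (consistent(rho.stateContext, s) && consistent(rho.actionContext, a))
            q += rho.v;
    return q;
}

int main() {
    // Two toy rules: rho1 fires whenever agent 0 plays action 1;
    // rho2 additionally requires state variable 0 = 1 and agent 1 playing action 0.
    std::vector<ValueRule> rules = {
        { {},       {{0, 1}},         2.0 },
        { {{0, 1}}, {{0, 1}, {1, 0}}, 5.0 },
    };
    Assignment s = {{0, 1}};          // state variable 0 has value 1
    Assignment a = {{0, 1}, {1, 0}};  // agent 0 plays 1, agent 1 plays 0
    std::printf("Q_local(s,a) = %.1f\n", localQ(rules, s, a));  // prints 7.0
    return 0;
}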
We incorporate roles into CG in two stages: an off-line design stage and an on-line assignment stage. In the off-line design stage, instead of defining value rules for each agent, we define value rules for each role. In the on-line assignment stage, we use a role assignment algorithm to assign a role to each agent. Thereafter, the agent gets the corresponding value rules from that role.

The role assignment algorithm works as follows. Suppose we have n agents. The role assignment algorithm defines a sequence M of roles where |M| ≥ n. The sequence is ordered with respect to the importance of the roles: the most 'important' role is assigned to an agent first, followed by the second most important role, etc. The same role can be assigned to more than one agent, but each agent is assigned only one role. The algorithm is performed by the agents in parallel: each agent computes the potential r_{i,m} for agent i and role m ∈ M. The role m is then assigned to the agent that has the highest potential. The role assignment algorithm runs in time polynomial in the numbers of agents and roles. Each agent calculates O(|M| · n) potentials.
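A minimal sketch of this greedy assignment: roles are processed in order of importance and each goes to the not-yet-assigned agent with the highest potential. The potential table below is a placeholder for the role potential functions r_{i,m}.

// Sketch of the greedy role assignment: roles in order of importance,
// each role goes to the unassigned agent with the highest potential.
// The potential table is an illustrative placeholder for r_{i,m}.
#include <cstdio>
#include <vector>

int main() {
    const int nAgents = 2;
    // potential[m][i]: how appropriate agent i is for role m (role 0 is most important)
    std::vector<std::vector<double>> potential = {
        {0.9, 0.4},   // role 0, e.g. "capturer"
        {0.5, 0.5},   // role 1, e.g. "supporter"
    };
    std::vector<int> roleOf(nAgents, -1);   // assigned role per agent
    std::vector<bool> taken(nAgents, false);

    int assigned = 0;
    for (int m = 0; m < (int)potential.size() && assigned < nAgents; ++m) {
        int best = -1;
        for (int i = 0; i < nAgents; ++i)
            if (!taken[i] && (best < 0 || potential[m][i] > potential[m][best]))
                best = i;
        if (best >= 0) { roleOf[best] = m; taken[best] = true; ++assigned; }
    }
    for (int i = 0; i < nAgents; ++i)
        std::printf("agent %d -> role %d\n", i, roleOf[i]);
    return 0;
}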
3.2 Q-learning in context-specific coordination graphs
Thus far, we have built a formal framework to represent the coordination dependencies between agents. To determine the weights, we propose a multiagent Q-learning algorithm. First, we introduce the local Q-value, which denotes the individual contribution of a single agent to the system.

Definition 5. Q_i(s, a) is the local Q-value for agent i:

    Q_i(s, a) = Σ_j ρ_j^i(s, a) / n_j    (4)

where ρ_j^i(s, a) is a value rule involving agent i and n_j is the number of agents involved in that rule (including agent i).
To derive the updating rule, we regard the weights in the value rules as the parameters of the global Q-function and follow the principle of weight adjustment for function approximation in standard RL [3]. The only difference is that our Q-function is not differentiable.
Lemma 1. The global joint Q-value is the sum of all local Q-values:

    Q(s, a) = Σ_i Q_i(s, a)    (5)

Proof.

    Σ_i Q_i(s, a) = Σ_i Σ_j ρ_j^i(s, a) / n_j = Σ_j Σ_i ρ_j^i(s, a) / n_j = Q(s, a)

The last equality holds because each value rule ρ_j is counted once for each of the n_j agents it involves, so the inner sum over i recovers ρ_j(s, a).
Lemma 2. In each updating step, the local Q-value can be updated according to the following rule:

    Q_i(s, a) ← Q_i(s, a) + α [R_i(s, a) + γ Q_i(s', a*) − Q_i(s, a)]    (6)

where a* = argmax_{a'} Q(s', a').
Proof. We treat the MAS as a single agent. The Q-learning updating rule (2) is rewritten as

    Q(s, a) ← Q(s, a) + α [ Σ_{i=1}^{n} R_i(s, a) + γ Q(s', a*) − Q(s, a) ]    (7)

According to Lemma 1, we replace Q(s, a) with Σ_{i=1}^{n} Q_i(s, a):

    Σ_{i=1}^{n} Q_i(s, a) ← Σ_{i=1}^{n} Q_i(s, a) + α [ Σ_{i=1}^{n} R_i(s, a) + γ Σ_{i=1}^{n} Q_i(s', a*) − Σ_{i=1}^{n} Q_i(s, a) ]    (8)

Decomposing (8) per agent gives

    Q_i(s, a) ← Q_i(s, a) + α [ R_i(s, a) + γ Q_i(s', a*) − Q_i(s, a) ]    (9)

We use ∆Q_i(s, a) = R_i(s, a) + γ Q_i(s', a*) − Q_i(s, a) to denote the increment in (6).
Theorem 1. The value rule ρ_j(s, a) can be updated according to the following rule:

    ρ_j(s, a) ← ρ_j(s, a) + α Σ_{i=1}^{n_j} ∆Q_i(s, a) / n_i    (10)

where n_j is the number of agents involved in ρ_j and n_i is the number of occurrences of agent i in the instantiated value rules which are consistent with state s and joint action a.

Proof. According to the updating rule for the global Q-function, we have

    Σ_j ρ_j(s, a) ← Σ_j ρ_j(s, a) + α Σ_{i=1}^{n} ∆Q_i(s, a)

Since

    ∆Q_i(s, a) = n_i · (∆Q_i(s, a) / n_i) = Σ_{k=1}^{n_i} ∆Q_i(s, a) / n_i

we get

    Σ_j ρ_j(s, a) ← Σ_j ρ_j(s, a) + α Σ_{i=1}^{n} Σ_{k=1}^{n_i} ∆Q_i(s, a) / n_i    (11)

Note that n_i is the number of occurrences of agent i in the value rules that are consistent with the state-action pair (s, a). Thus,

    Σ_{i=1}^{n} Σ_{k=1}^{n_i} ∆Q_i(s, a) / n_i = Σ_j Σ_{i=1}^{n_j} ∆Q_i(s, a) / n_i    (12)

where n_j is the number of agents involved in value rule ρ_j(s, a). Now (11) reads

    Σ_j ρ_j(s, a) ← Σ_j ρ_j(s, a) + α Σ_j Σ_{i=1}^{n_j} ∆Q_i(s, a) / n_i    (13)

and hence

    ρ_j(s, a) ← ρ_j(s, a) + α Σ_{i=1}^{n_j} ∆Q_i(s, a) / n_i    (14)
An example will make things clear. Suppose we have the following rules for states s and s':

    ⟨ρ1; s ∧ a1 : v1⟩          ⟨ρ5; s' ∧ ā1 : v5⟩
    ⟨ρ2; s ∧ a1 ∧ a2 : v2⟩     ⟨ρ6; s' ∧ a1 ∧ ā2 : v6⟩
    ⟨ρ3; s ∧ a2 ∧ ā3 : v3⟩     ⟨ρ7; s' ∧ a2 ∧ ā3 : v7⟩
    ⟨ρ4; s ∧ a2 ∧ a3 : v4⟩     ⟨ρ8; s' ∧ ā2 ∧ ā3 : v8⟩

Furthermore, assume the agents perform the joint action a = {a1, a2, a3} in state s and move to state s'. The optimal joint action produced by VE in state s' is a* = {ā1, a2, ā3}. So the rules ρ1, ρ2, and ρ4 apply in state s and the rules ρ5 and ρ7 apply in state s'.

We update ρ1, ρ2, and ρ4 as follows:

    ∆Q1(s, a) = R1(s, a) + γ v5 − [v1 + v2/2]
    ∆Q2(s, a) = R2(s, a) + γ v7/2 − [v2/2 + v4/2]
    ∆Q3(s, a) = R3(s, a) + γ v7/2 − v4/2

    ρ1(s, a) ← v1 + α ∆Q1(s, a)/2
    ρ2(s, a) ← v2 + α [∆Q1(s, a)/2 + ∆Q2(s, a)/2]
    ρ4(s, a) ← v4 + α [∆Q2(s, a)/2 + ∆Q3(s, a)/1]
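As a sanity check of the update above, the sketch below plugs arbitrary illustrative numbers (not values from the paper) into this example and computes ∆Q1, ∆Q2, ∆Q3 and the updated weights of ρ1, ρ2, and ρ4 according to Theorem 1.

// Numeric check of the worked example: local increments Delta Q_i(s,a)
// and the Theorem 1 updates of rho1, rho2, rho4.
// All numeric values below are arbitrary illustrative choices.
#include <cstdio>

int main() {
    const double alpha = 0.3, gamma = 0.9;
    // rule weights (only v1, v2, v4, v5, v7 enter this particular update)
    const double v1 = 1.0, v2 = 2.0, v4 = 4.0, v5 = 5.0, v7 = 7.0;
    // individual rewards received in (s, a)
    const double R1 = 0.0, R2 = 0.0, R3 = 0.0;

    // Local Q-values (Definition 5): each rule's weight is split among its agents.
    const double Q1 = v1 + v2 / 2;       // agent 1 is in rho1 (1 agent) and rho2 (2 agents)
    const double Q2 = v2 / 2 + v4 / 2;   // agent 2 is in rho2 and rho4 (2 agents each)
    const double Q3 = v4 / 2;            // agent 3 is only in rho4
    const double Q1n = v5;               // in (s', a*): agent 1 is in rho5 (1 agent)
    const double Q2n = v7 / 2;           // agent 2 is in rho7 (2 agents)
    const double Q3n = v7 / 2;           // agent 3 is in rho7

    const double dQ1 = R1 + gamma * Q1n - Q1;
    const double dQ2 = R2 + gamma * Q2n - Q2;
    const double dQ3 = R3 + gamma * Q3n - Q3;

    // Occurrence counts n_i in the rules consistent with (s, a):
    // agent 1 in rho1, rho2 (n_1 = 2); agent 2 in rho2, rho4 (n_2 = 2); agent 3 in rho4 (n_3 = 1).
    const double rho1 = v1 + alpha * (dQ1 / 2);
    const double rho2 = v2 + alpha * (dQ1 / 2 + dQ2 / 2);
    const double rho4 = v4 + alpha * (dQ2 / 2 + dQ3 / 1);

    std::printf("dQ1=%.3f dQ2=%.3f dQ3=%.3f\n", dQ1, dQ2, dQ3);
    std::printf("rho1=%.3f rho2=%.3f rho4=%.3f\n", rho1, rho2, rho4);
    return 0;
}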
The complete learning algorithm is described in Algorithm 1. The algorithm uses the role assignment algorithm to assign roles to the agents and VE to determine the optimal joint action.
Algorithm 1. Role-based context-specific Q-learning (RQ)
Define: ρ = {ρ1, ..., ρn} is the set of rules
Initialize each ρi(s, a) ∈ ρ arbitrarily
T ← 0
repeat {for each episode}
    Observe the current state s
    P ← ∅
    repeat {for each step of the episode}
        for each agent i do
            Assign role m to agent i
            Obtain value rules P_m from role m
            P ← P ∪ P_m
        end for
        Select joint action a following the ε-greedy method
        Perform joint action a and observe the next state s'
        for each ρj ∈ P do
            ρj(s, a) ← ρj(s, a) + α Σ_{i=1}^{n_j} ∆Q_i(s, a) / n_i
        end for
        s ← s'
    until s is terminal
    T ← T + 1
until T > T_max
Using our approach, the convergence condition of Q-learning holds [6]. Unfortunately, it is difficult to provide a formal proof that the learned policy is optimal [3]. However, if the coordination dependencies are sufficiently captured, the differences will be very small.
4 Experiments

In this section, we compare our cooperative learning algorithm with other multiagent RL techniques, in particular the JAL algorithm and the IL algorithm. We apply the three algorithms to the well-known pursuit domain [17], where the goal of the predators is to capture the prey as fast as possible. We set up a problem in which two predators try to capture one prey in a 10×10 grid world. Fig. 1 shows an example. At the beginning of each episode, the predators and the prey are randomly placed at the corners of the world. They can move to their adjacent cells or hold their current position, i.e., dir ∈ D = {center, n, w, s, e}, where moving center stands for holding the current position. The prey is captured when both predators stand on its adjacent cells and only one of them moves to the prey's location. The policy of the prey is fixed: it stays in its current position with probability 0.2 and otherwise moves randomly to one of its free adjacent cells.
Fig. 1  Pursuit domain example: the circles are the predators and the triangle is the prey
For all three algorithms, the learning rate α = 0.3, the discount factor γ = 0.9, and the exploration rate ε = 0.1. We generate 99 × 98 × 5² = 242,550 and 99 × 98 × 5 = 48,510 state-action pairs for JAL and IL, respectively. All the programs are implemented in C++ on an IBM notebook computer.
When applying the RQ algorithm, we introduce two roles: a capturer role, which tries to capture the prey, and a supporter role, which holds its current position when the capturer tries to move to the prey's location and otherwise moves to support the capturer. The role assignment sequence is M = {capturer, supporter}. The potential for the role capturer is based on the Manhattan distance d_{i,p} between predator i and the prey:

    r_{i,capturer} = 1 / d_{i,p}    (15)

The potential r_{i,supporter} is a constant such that the remaining predator is assigned to that role.
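A small sketch of these role potentials, assuming grid coordinates and a Manhattan-distance helper; the handling of a zero distance and the constant value of the supporter potential are illustrative assumptions.

// Sketch of the role potentials in the pursuit experiment:
// capturer potential = 1 / (Manhattan distance to the prey), cf. (15);
// the supporter potential is a constant so the remaining predator gets that role.
#include <cstdio>
#include <cstdlib>

struct Cell { int x, y; };

int manhattan(Cell a, Cell b) { return std::abs(a.x - b.x) + std::abs(a.y - b.y); }

double capturerPotential(Cell predator, Cell prey) {
    int d = manhattan(predator, prey);
    return d > 0 ? 1.0 / d : 1.0;        // zero-distance handling is an assumption
}

const double kSupporterPotential = 0.0;  // constant; its value is an illustrative assumption

int main() {
    Cell prey{5, 5}, p1{2, 5}, p2{9, 9};
    std::printf("r(1,capturer)=%.3f  r(2,capturer)=%.3f\n",
                capturerPotential(p1, prey), capturerPotential(p2, prey));
    // Predator 1 is closer, so it gets the capturer role; predator 2 becomes the supporter.
    return 0;
}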
The following is an example of our generated value rules:

    ⟨ρ1^capturer; a_i = moveTo(dir) : 100⟩
    ⟨ρ2^capturer; has-role-supporter(j) ∧ is-adjacent-to-prey(j) ∧ a_i = moveToPrey() ∧ a_j = moveTo(center) : 100⟩

The value rule ρ1 denotes that the capturer role should try to capture the prey even without the support of the other predator. The rule ρ2 denotes a coordinated context in which the capturer moves to the location of the prey and the supporter holds its current position. In total, we create 124 value rules, which instantiate 5,483 value rules.
The reward that each predator i receives is defined as follows:

    R_i(s, a) =
        50.0     if i captures the prey with the support of the other predator
        −50.0    if i collides with the other predator
        −10.0    if i moves to the prey without support
        −1.0     otherwise                                                      (16)
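The reward in (16) maps directly to a small function; how the simulator detects the three events is not specified here, so the boolean flags below are assumptions.

// Sketch of the individual reward in (16). How the simulator detects the three
// events (capture with support, collision, unsupported move to the prey) is an
// assumption; here they arrive as boolean flags.
#include <cstdio>

double predatorReward(bool capturedWithSupport, bool collided, bool movedToPreyUnsupported) {
    if (capturedWithSupport)     return  50.0;
    if (collided)                return -50.0;
    if (movedToPreyUnsupported)  return -10.0;
    return -1.0;                 // small step penalty otherwise
}

int main() {
    std::printf("%.1f %.1f %.1f %.1f\n",
                predatorReward(true, false, false),
                predatorReward(false, true, false),
                predatorReward(false, false, true),
                predatorReward(false, false, false));
    return 0;
}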
Fig. 2 shows the capture times of the learned policy during the first 400,000 episodes for the three algorithms. From the plot, we can clearly see that both RQ and IL learn very quickly at the beginning compared with JAL, since RQ and IL have fewer state-action pairs than JAL. However, IL cannot converge and keeps oscillating around an average capture time of 17.02. After about 50,000 episodes, our algorithm converges to a stable policy with an average capture time of about 12.92, which is very close to the average capture time of 12.78 learned by JAL. To reach such a result, JAL needs to learn from more than 200,000 episodes. The reason for the small difference is that we have not enumerated all possible coordinated value rules for RQ.
Fig. 2  Learning results of the learned policy for the three methods
5 Conclusion

In this paper, we have proposed a role-based context-specific multiagent Q-learning technique. First, we use roles and context-specific coordination graphs to build a formal framework for a sparse representation of the coordination dependencies between the agents. Then we propose a novel Q-learning algorithm to learn the weights in each value rule automatically. The experiments show that our approach converges to the same policy learned by traditional multiagent RL techniques, with a significantly faster learning speed.
References
1  Weiss G. Multiagent Systems: A Modern Approach to Distributed Artificial Intelligence. USA: MIT Press, 1999. 26∼38
2  Wooldridge M J. Introduction to Multiagent Systems. USA: John Wiley & Sons, 2001. 10∼13
3  Sutton R S, Barto A G. Reinforcement Learning: An Introduction. USA: MIT Press, 1998. 120∼250
4  Kaelbling L P, Littman M, Moore A. Reinforcement learning: a survey. Journal of Artificial Intelligence Research, 1996, 4: 237∼285
5  Claus C, Boutilier C. The dynamics of reinforcement learning in cooperative multiagent systems. In: Proceedings of the Fifteenth National Conference on Artificial Intelligence. Wisconsin, USA, AAAI Press, 1998. 746∼752
6  Tan M. Multi-agent reinforcement learning: independent vs. cooperative agents. Readings in Agents. USA: Morgan Kaufmann, 1997. 487∼494
7  Boutilier C. Planning, learning and coordination in multiagent decision processes. In: Proceedings of the 6th Conference on Theoretical Aspects of Rationality and Knowledge. San Francisco, USA, Morgan Kaufmann, 1996. 195∼210
8  Watkins C J C H, Dayan P. Technical note: Q-learning. Machine Learning, 1992, 8(3): 279∼292
9  Guestrin C, Koller D, Parr R. Multiagent planning with factored MDPs. In: Proceedings of the 14th Neural Information Processing Systems. Cambridge, USA, MIT Press, 2001. 1073∼1080
10 Guestrin C, Venkataraman S, Koller D. Context specific multiagent coordination and planning with factored MDPs. In: Proceedings of the Eighteenth National Conference on Artificial Intelligence (AAAI-2002). Edmonton, Canada, AAAI Press, 2002. 253∼259
11 Guestrin C, Koller D, Parr R, Venkataraman S. Efficient solution algorithms for factored MDPs. Journal of Artificial Intelligence Research, 2003, 19: 399∼468
12 Guestrin C. Planning under Uncertainty in Complex Structured Environments [Ph.D. dissertation], Stanford University, 2004
13 Castelpietra C, Iocchi L, Nardi D, Piaggio M, Scalzo A, Sgorbissa A. Communication and coordination among heterogeneous mid-size players: ART99. Lecture Notes in Computer Science, 2001, 2019: 86∼95
14 Watkins C J C H. Learning from Delayed Rewards [Ph.D. dissertation], Cambridge University, 1989
15 Sen S, Sekaran M, Hale J. Learning to coordinate without sharing information. In: Proceedings of the 12th National Conference on Artificial Intelligence. Seattle, USA, AAAI Press, 1994. 426∼431
16 Simon H. The Architecture of Complexity. USA: MIT Press, 1981. 78∼80
17 Kok J R, Vlassis N. The pursuit domain package [Online], available: staff.science.uva.nl/∼jellekok/software/indexen.html, May 21, 2003
JIANG Da-Wei  Ph.D. candidate in the Department of Computer Science and Technology at Southeast University. His research interests include grid computing, distributed artificial intelligence, and multiagent systems. Corresponding author of this paper. E-mail: davidjiang2005@gmail

WANG Shi-Yuan  Master student in the Department of Computer Science and Technology at Southeast University. Her research interests include P2P computing and multiagent systems. E-mail: desiree wsy@yahoo

DONG Yi-Sheng  Professor at Southeast University. His research interests include grid computing, mobile agent, and multiagent systems. E-mail: ysdong@seu.edu
