Apiumhub

Posted on Dec 8, 2021 • Originally published at apiumhub.com on Dec 3, 2021

Cooperative Multi-Agent Reinforcement Learning and QMIX at Neurips 2021

#datascience

Authors: Gema Parreño, David Suarez (Apiumhub).

Thanks to: Alberto Hernandez (BBVA Innovation Labs)

This blog post introduces cooperative MARL and walks through innovations from S. Whiteson's lab: QMIX (2018) and their current contribution to #Neurips2021. Following this article assumes some familiarity with the fundamentals of Reinforcement Learning.

A multi-agent system describes multiple distributed entities, so-called agents, which take decisions autonomously and interact within a shared environment (Weiss 1999). MARL (Multi-Agent Reinforcement Learning) can be understood as a field related to RL in which a system of agents interacts within an environment to achieve a goal. The goal of each of these agents, or learnable units, is to learn a policy that maximizes its long-term reward: each agent discovers a strategy alongside other entities in a common environment and adapts its policy in response to the behavioural changes of the others.

        <h2>Taxonomy</h2> 














                <p>The properties of a MARL system are key to how it is modeled; depending on these properties, research branches into specific areas.<br></p>
















        <table>
            <thead>
                <tr>
                    <th>TRAINING (Control of training and execution)</th>
                    <th>INFORMATION AVAILABLE (What the agents know about other agents)</th>
                    <th>ENVIRONMENT (Conditions of the environment)</th>
                    <th>REWARD STRUCTURE (System behaviour)</th>
                </tr>
            </thead>
            <tbody>
                <tr>
                    <td><strong>Centralized</strong><br>A central unit takes the decision for each agent at each time step. Policies are updated based on the exchange of information during training.</td>
                    <td><strong>Independent Learner</strong><br>Ignores the existence of other agents. No reward or action information from others.</td>
                    <td><strong>Fully observable</strong><br>The agents can access the complete state information, including all sensory information.</td>
                    <td><strong>Cooperative</strong><br>Same reward for all agents. They cooperate to achieve a common goal and avoid individual failure.</td>
                </tr>
                <tr>
                    <td><strong>Decentralized</strong><br>Agents make decisions for themselves. Each agent performs updates on its own and develops an individual policy without using foreign information.</td>
                    <td><strong>Joint-Action Learner</strong><br>Observes the actions of other agents a posteriori.</td>
                    <td><strong>Partially observable</strong><br>The agent can only observe its local information.</td>
                    <td><strong>Competitive</strong><br>The rewards sum to zero. The agents compete against each other, maximizing their own reward and minimizing that of the others.</td>
                </tr>
            </tbody>
        </table>

                <p>Table 1. This taxonomic schema (Weiss 1999) frames the MARL setting explored in this article. In cooperative MARL, <em>agents cooperate to achieve a common goal.</em></p>















        <h2>Challenges</h2>

From the environment perspective, several challenges stand out:

Non-stationarity: Each agent faces a moving-target problem, because the transition probability function it experiences changes as the other agents update their policies.
Credit assignment problem: An agent cannot tell how much its own action contributed to the team's success.
Partial observability: Most real-world use cases and applications are partially observable environments, modelled as a Partially Observable Markov Decision Process (POMDP).

When we branch from MARL into cooperative MARL, we reformulate the problem as a system of agents that interact within an environment to achieve a common goal. How much each challenge matters depends on the type of behaviour and environment. From the perspective of agent interaction and performance under cooperation, further conceptual challenges arise:

Coordination: Accomplishing a joint goal in cooperative settings requires agents to reach a consensus.
Communication: Agents must learn meaningful communication protocols for cooperative tasks.
Commitment: Agents must construct cooperative commitments that overcome incentives to abandon a cooperative arrangement.
Scalability: MARL algorithms are hard to train: the joint action space grows exponentially with the number of agents, and a potentially high number of agents with heterogeneous action spaces drives up the computational effort accordingly.
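To give a feel for the scalability challenge, here is a minimal sketch that counts joint actions. It assumes every agent has the same number of individual actions; the function name and the figures are illustrative, not taken from any paper.

```python
def joint_action_space_size(actions_per_agent: int, n_agents: int) -> int:
    """Count the joint actions when each of n_agents picks one of actions_per_agent actions."""
    if actions_per_agent < 1 or n_agents < 1:
        raise ValueError("both arguments must be positive")
    # One independent choice per agent, so the counts multiply.
    return actions_per_agent ** n_agents


# Hypothetical setting: 10 individual actions per agent.
for n_agents in (1, 2, 4, 8):
    print(n_agents, "agents ->", joint_action_space_size(10, n_agents), "joint actions")
```

Any method that enumerates joint actions directly quickly becomes impractical, which is one motivation for factorising the joint value function into per-agent components.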

From now on, we will focus on centralized cooperative MARL and on the definition, notation and description of QMIX.

    <img src="https://apiumhub.com/wp-content/uploads/2021/12/Neurips2021_Fig1_1.png" alt="Neurips 1.1" title="Cooperative Multi-Agent Reinforcement Learning and QMIX at Neurips 2021 7">
    <img src="https://apiumhub.com/wp-content/uploads/2021/12/Neurips2021_Fig1_2.png" alt="Neurips 1.2" title="Cooperative Multi-Agent Reinforcement Learning and QMIX at Neurips 2021 8">















                <p>Fig 1. Visual representation of MARL properties, with some of the challenges attached to the taxonomy. The zoomed area covers the topics addressed in the <strong>Open Problems in Cooperative AI</strong> and <strong>QMIX</strong> papers.</p>















        <h2>Centralized Cooperative Multi-Agent</h2> 














        <h4>Centralized Cooperative Multi-Agent RL: Notation and Formulation for the Coordination Problem</h4>

Compared with single-agent RL, the notation adds three elements to the tuple: N, the number of agents; O = {O₁, …, O_N}, the set of observations of all agents (if different agents have different observation sets, all of them are represented in this set); and U = {U₁, …, U_N}, the joint action set of all agents. A joint action is taken cooperatively, even though each agent may choose a different individual action. The problem is therefore described by the tuple < N, S, U, R, P, O, γ >:

N = {1, …, N} denotes the set of N > 1 interacting agents.
S is the state space shared by all agents.
U = {U₁, …, U_N} is the joint action set, i.e. the collection of the individual action spaces of the N agents.
R is the reward.
P : S × U → P(S) is the state transition probability function.
O = {O₁, …, O_N} is the set of observations of all agents.
γ ∈ [0, 1) is the discount factor.

                <p><span>Notation 1</span>. Colored letters mark the differences with respect to single-agent Reinforcement Learning. <span>Notation for a fully cooperative setup.</span></p>
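As a rough illustration of the tuple above, the sketch below stores its elements in a small Python container. Everything here is an assumption made for the example: the class name, the field names, the integer encoding of states and actions, and the toy coordination game. P is read as the state transition function, as is usual for Dec-POMDPs.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Sequence, Tuple

JointAction = Tuple[int, ...]


@dataclass(frozen=True)
class CooperativeMarlProblem:
    n_agents: int                                                # N: number of agents (> 1)
    states: Sequence[int]                                        # S: global states
    action_spaces: Sequence[Sequence[int]]                       # U = {U_1, ..., U_N}
    reward: Callable[[int, JointAction], float]                  # R: one reward shared by the team
    transition: Callable[[int, JointAction], Dict[int, float]]   # P: next-state distribution
    observe: Callable[[int, int], int]                           # O: observation of agent i in state s
    gamma: float                                                 # discount factor in [0, 1)

    def __post_init__(self) -> None:
        if self.n_agents < 2:
            raise ValueError("a multi-agent problem needs N > 1 agents")
        if len(self.action_spaces) != self.n_agents:
            raise ValueError("one action space per agent is required")
        if not 0.0 <= self.gamma < 1.0:
            raise ValueError("gamma must lie in [0, 1)")


# Hypothetical two-agent game: the team is rewarded when both agents pick the same action.
toy = CooperativeMarlProblem(
    n_agents=2,
    states=[0, 1],
    action_spaces=[[0, 1], [0, 1]],
    reward=lambda s, u: 1.0 if u[0] == u[1] else 0.0,
    transition=lambda s, u: {0: 0.5, 1: 0.5},
    observe=lambda s, i: s,  # toy choice: every agent sees the full state
    gamma=0.99,
)
print(toy.reward(0, (1, 1)), toy.reward(0, (0, 1)))
```

In a partially observable problem, `observe` would return only the local view of agent i instead of the full state.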
















    <img src="https://apiumhub.com/wp-content/uploads/2021/12/Neurips2021_Fig2_2.png" alt="Neurips 2.1" title="Cooperative Multi-Agent Reinforcement Learning and QMIX at Neurips 2021 9">
    <img src="https://apiumhub.com/wp-content/uploads/2021/12/Neurips2021_Fig2_1.png" alt="Neurips 2.2" title="Cooperative Multi-Agent Reinforcement Learning and QMIX at Neurips 2021 10">















                Fig 2. Visual representation of a fully cooperative, partially observable multi-agent environment (Dec-POMDP). The example uses the <a href="https://arxiv.org/pdf/1902.04043.pdf" rel="nofollow noopener"><strong>SMAC</strong></a> map 2c_vs_64zg: at each time step t, the environment sends each agent (2 Colossi) observations about enemy positions and about the actions of both the enemies and the other agent, and each agent (Colossus) produces an action based on the Qtot value function. All agents share the same reward.















        <h2>QMIX: Monotonic Value Function Factorisation for Deep Multi-Agent Reinforcement Learning</h2> 














                Under <span>fully cooperative behaviour with centralized learning and decentralized execution</span>, the joint <span>action-value</span> function <b>Q<sub>tot</sub></b> can be decomposed into <b>N Q-functions for N agents</b>, in which each <b>Q-function Q<sub>i</sub></b> measures how good each action is, given a state, for an agent following a policy:















                <p><b>Q<sub>tot</sub>(τ, u)</b> = Σ<sub>i=1…N</sub> <b>Q<sub>i</sub></b>(τ<sub>i</sub>, u<sub>i</sub>)</p>















                <ul>
<li>Q<sub>tot</sub> -&gt; global action-value function</li>
<li>Q<sub>i</sub> -&gt; action-value function of each agent</li>
<li>τ -&gt; joint action-observation history</li>
<li>u -&gt; joint action</li>
</ul>

                <p><strong>Notation 2.</strong> <span>Global action-value function as a sum of individual action-value functions, one for each agent.</span></p>
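The additive decomposition above can be sketched in a few lines. The Q-values below are made-up numbers for a single time step, and the helper names are illustrative. The sketch checks that, for a sum, letting each agent pick its own best action gives the same result as searching over all joint actions.

```python
from itertools import product

# Hypothetical per-agent action values: q_values[i][u_i] stands for Q_i(tau_i, u_i).
q_values = [
    [1.0, 3.0, 0.5],   # agent 0
    [2.0, -1.0, 0.0],  # agent 1
]


def q_tot_sum(q_values, joint_action):
    """Q_tot as the plain sum of the chosen individual action values."""
    return sum(q[u] for q, u in zip(q_values, joint_action))


def greedy_per_agent(q_values):
    """Each agent maximizes its own Q_i independently (decentralized)."""
    return tuple(max(range(len(q)), key=q.__getitem__) for q in q_values)


def greedy_joint(q_values):
    """Brute-force search over all joint actions (centralized)."""
    joint_actions = product(*(range(len(q)) for q in q_values))
    return max(joint_actions, key=lambda u: q_tot_sum(q_values, u))


print(greedy_per_agent(q_values), greedy_joint(q_values))
```

The brute-force search is only there for comparison; it is exactly the step that becomes infeasible as the number of agents grows.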

The QMIX paper, published in 2018 by T. Rashid et al., explores a hybrid value-based multi-agent reinforcement learning method that adds a constraint and a mixing network structure in order to make learning more stable, faster and ultimately better in a controlled setup.

The key conceptual idea behind QMIX is the centralized training (Qtot) with decentralized execution (Qi) paradigm, also known as CTDE: agents are trained in a centralized way, with access to the overall action-observation history (τ) and the global state, but during execution each agent has access only to its own local action-observation history (τi).

One of the first main ideas is to enforce a constraint that makes the relationship between the global action-value function Qtot and the action-value function Qi of each agent monotonic. Instead of requiring Qtot to be a plain sum of the Qi, QMIX only requires that Qtot never decreases when any individual Qi increases:

                <p>∂<b>Q<sub>tot</sub></b> / ∂<b>Q<sub>i</sub></b> ≥ 0, ∀i ∈ N</p>

                <p><strong>Notation 3. </strong><span>The partial derivative of the global action-value function with respect to the action-value function of each agent is 0 or higher, for every agent.</span></p>

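This constraint guarantees that the joint action maximizing Qtot is made up of the actions that maximize each individual Qi, which allows each agent to take part in decentralized execution by choosing greedy actions with respect to its own action-value function.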
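The effect of the monotonicity constraint can be checked numerically. The sketch below uses a toy mixer of our own choosing (fixed non-negative weights followed by an ELU), not the learned mixing network of the paper, and made-up Q-values. It compares per-agent greedy actions with a brute-force search over joint actions.

```python
import math
from itertools import product


def elu(x: float) -> float:
    """Strictly increasing non-linearity."""
    return x if x > 0 else math.exp(x) - 1.0


def monotonic_mix(agent_qs, weights=(0.7, 2.0), bias=-1.5):
    """Toy mixer: positive weights and an increasing activation keep dQ_tot/dQ_i >= 0."""
    return elu(sum(w * q for w, q in zip(weights, agent_qs)) + bias)


# Hypothetical per-agent action values for one time step.
q_values = [
    [0.2, 1.1, -0.4],  # agent 0
    [0.9, 0.3, 1.4],   # agent 1
]


def local_greedy(q_values):
    """Decentralized choice: every agent maximizes its own Q_i."""
    return tuple(max(range(len(q)), key=q.__getitem__) for q in q_values)


def joint_greedy(q_values):
    """Centralized choice: maximize the mixed Q_tot over all joint actions."""
    joint_actions = product(*(range(len(q)) for q in q_values))
    return max(
        joint_actions,
        key=lambda u: monotonic_mix([q[a] for q, a in zip(q_values, u)]),
    )


print(local_greedy(q_values), joint_greedy(q_values))
```

With a negative weight in `weights`, the two results could differ, which is why the mixing weights must not be negative.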

The overall QMIX architecture has two clearly differentiated parts:

Agent Networks: for each agent Ai, there is an agent network that represents its action-value function. At each time step it receives the current observation and the last action as input and returns the action-value function Qi. The network topology belongs to the DRQN family and uses a GRU, as it facilitates learning over longer timescales and probably converges faster. This means that in an environment with, for example, two Colossi agents, we would have two agent networks.

Mixing Network: a feedforward network that takes the agents' outputs (Qi for every agent) and outputs the total action-value function Qtot. This is the creative and innovative part of the architecture: the weights of the mixing network are produced by separate hypernetworks, that is, neural networks that generate the weights of another network. The output of each weight-generating hypernetwork is a vector forced to be non-negative, which is what enforces the monotonicity constraint.

    <img src="https://apiumhub.com/wp-content/uploads/2021/12/Neurips2021_Fig3_1.png" alt="Neurips 3.1" title="Cooperative Multi-Agent Reinforcement Learning and QMIX at Neurips 2021 11">
    <img src="https://apiumhub.com/wp-content/uploads/2021/12/Neurips2021_Fig3_2.png" alt="Neurips 3.2" title="Cooperative Multi-Agent Reinforcement Learning and QMIX at Neurips 2021 12">















                <p>Fig 3. Overall QMIX architecture as proposed in the QMIX paper, with its main components: the <strong>mixing network</strong> with the <strong>hypernetwork</strong> that enforces monotonicity, and the agent networks.</p>
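A rough, untrained sketch of the mixing network idea in plain Python follows. It is a simplification under stated assumptions: the class name `ToyQMixer`, the random initialisation, the embedding size and the single linear layer used for every hypernetwork are illustrative choices, not the authors' implementation. What it keeps from the description above is the mechanism: the state produces the mixing weights, and an absolute value makes them non-negative.

```python
import math
import random


def elu(x):
    return x if x > 0 else math.exp(x) - 1.0


def linear(in_dim, out_dim, rng):
    """Random parameters for y = W x + b (hypothetical initialisation)."""
    w = [[rng.uniform(-1.0, 1.0) for _ in range(in_dim)] for _ in range(out_dim)]
    b = [rng.uniform(-1.0, 1.0) for _ in range(out_dim)]
    return w, b


def apply_linear(layer, x):
    w, b = layer
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(w, b)]


class ToyQMixer:
    def __init__(self, n_agents, state_dim, embed_dim=4, seed=0):
        rng = random.Random(seed)
        self.n_agents, self.embed_dim = n_agents, embed_dim
        # Hypernetworks: each maps the global state to parameters of the mixer.
        self.hyper_w1 = linear(state_dim, n_agents * embed_dim, rng)
        self.hyper_b1 = linear(state_dim, embed_dim, rng)
        self.hyper_w2 = linear(state_dim, embed_dim, rng)
        self.hyper_b2 = linear(state_dim, 1, rng)

    def q_tot(self, agent_qs, state):
        # abs() forces the mixing weights to be non-negative; biases stay unconstrained.
        w1 = [abs(v) for v in apply_linear(self.hyper_w1, state)]
        b1 = apply_linear(self.hyper_b1, state)
        w2 = [abs(v) for v in apply_linear(self.hyper_w2, state)]
        b2 = apply_linear(self.hyper_b2, state)[0]
        hidden = []
        for k in range(self.embed_dim):
            pre = sum(agent_qs[i] * w1[i * self.embed_dim + k] for i in range(self.n_agents))
            hidden.append(elu(pre + b1[k]))
        return sum(h * w for h, w in zip(hidden, w2)) + b2


mixer = ToyQMixer(n_agents=2, state_dim=3)
state = [0.5, -1.0, 2.0]
print(mixer.q_tot([0.1, 0.2], state), mixer.q_tot([1.1, 0.2], state))
```

Raising any single agent's Q-value can only keep Qtot equal or push it up, whatever the state, because every weight that multiplies an agent's value is non-negative and the activation is increasing.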

Key ideas of the QMIX algorithm:

Qtot satisfies a monotonicity condition, so each agent can act greedily with respect to its own action-value function.
Each agent has an agent network that computes its action-value function.
A mixing network, whose weights are generated from the state and forced to be non-negative, computes the joint action-value function Q_tot.
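To make the agent network idea more concrete, here is a minimal, untrained recurrent agent in plain Python. It is a sketch under assumptions: the class name, the sizes, the random weights and the omission of bias terms and of any extra feedforward layers are ours. It only shows the data flow described above: observation plus last action go in, a GRU-style hidden state is updated, and one Q-value per action comes out.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]


class ToyRecurrentAgent:
    """GRU cell over [observation, one-hot last action], followed by a linear Q head."""

    def __init__(self, obs_dim, n_actions, hidden_dim=8, seed=0):
        rng = random.Random(seed)
        in_dim = obs_dim + n_actions
        self.n_actions, self.hidden_dim = n_actions, hidden_dim
        # One input matrix and one recurrent matrix per gate:
        # update (z), reset (r) and candidate state (n).
        self.w = {g: rand_matrix(hidden_dim, in_dim, rng) for g in "zrn"}
        self.u = {g: rand_matrix(hidden_dim, hidden_dim, rng) for g in "zrn"}
        self.q_head = rand_matrix(n_actions, hidden_dim, rng)

    def initial_hidden(self):
        return [0.0] * self.hidden_dim

    def step(self, obs, last_action, hidden):
        one_hot = [1.0 if a == last_action else 0.0 for a in range(self.n_actions)]
        x = list(obs) + one_hot
        z = [sigmoid(a + b) for a, b in zip(matvec(self.w["z"], x), matvec(self.u["z"], hidden))]
        r = [sigmoid(a + b) for a, b in zip(matvec(self.w["r"], x), matvec(self.u["r"], hidden))]
        cand = [
            math.tanh(a + ri * b)
            for a, ri, b in zip(matvec(self.w["n"], x), r, matvec(self.u["n"], hidden))
        ]
        new_hidden = [(1.0 - zi) * ci + zi * hi for zi, ci, hi in zip(z, cand, hidden)]
        # Q_i(tau_i, .): one value per individual action.
        return matvec(self.q_head, new_hidden), new_hidden


agent = ToyRecurrentAgent(obs_dim=3, n_actions=4)
hidden = agent.initial_hidden()
q_i, hidden = agent.step([0.1, -0.2, 0.3], last_action=0, hidden=hidden)
greedy_action = max(range(len(q_i)), key=q_i.__getitem__)
print(greedy_action, q_i)
```

The hidden state is what lets the agent condition on its action-observation history τi rather than on the current observation alone.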

        <h2>Regularized Softmax Deep Multi-Agent Q-learning at Neurips 2021</h2> 














        <h4>Neurips 2021: Regularized Softmax Deep Multi-Agent Q-learning</h4>

At #Neurips2021, the lab presents the severe overestimation that QMIX suffers from in practice, and proposes a regularization-based update scheme that penalizes large Qtot values to stabilize learning, together with a softmax operator that reduces the overestimation bias.

Overestimation is an important challenge because it can accumulate and hurt the performance of value-based algorithms. Besides, with multiple agents in a MARL scenario the joint action space grows exponentially with the number of agents, which makes the problem worse. In the case of QMIX, overestimation can come not only from the computation of each Qi but also from the mixing network.

The paper first presents key experimental results for some initial ideas that did not give the desired outcome: a gradient regularization of the mixing network, and a baseline for Qtot, introduced by adding the regularization term λ (Qtot(s,u) − b(s,u))² to the mean squared error loss, where λ is the regularization coefficient.
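The regularization term can be written out as a small function. This is a sketch of the formula only: the function name, the batch layout and the numbers are assumptions, and how the TD targets and the baseline b(s,u) are obtained is left outside the example.

```python
def regularized_loss(q_tot, td_targets, baselines, lam):
    """Mean squared TD error plus lam * mean((Q_tot - b)^2) over a batch of transitions."""
    if not (len(q_tot) == len(td_targets) == len(baselines)) or not q_tot:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(q_tot)
    td_term = sum((q - y) ** 2 for q, y in zip(q_tot, td_targets)) / n
    penalty = sum((q - b) ** 2 for q, b in zip(q_tot, baselines)) / n
    return td_term + lam * penalty


# Hypothetical batch of two transitions.
loss = regularized_loss(
    q_tot=[5.0, 1.0],
    td_targets=[3.0, 1.0],
    baselines=[2.0, 1.0],
    lam=0.5,
)
print(loss)
```

With `lam=0` the function reduces to the usual mean squared TD error; larger values of `lam` pull Qtot more strongly towards the baseline.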

As the final proposal, which showed better empirical results, they apply a softmax operator to the joint action-value function, softmax(Qtot(s,u)), following principles from Deep Q-Learning and conditioning on the state s rather than on the action-observation history τ used in the QMIX and Value Decomposition Networks approaches.
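To show why a softmax can reduce overestimation, the sketch below implements a generic softmax operator over a list of action values. It is not the exact operator of the paper: the inverse temperature `beta` and the example values are assumptions. The operator returns a softmax-weighted average of the values, which never exceeds the maximum used in standard Q-learning targets.

```python
import math


def softmax_value(q_values, beta=1.0):
    """Softmax-weighted average of q_values; beta is an assumed inverse temperature."""
    top = max(q_values)
    # Subtracting the maximum keeps the exponentials numerically stable.
    weights = [math.exp(beta * (q - top)) for q in q_values]
    total = sum(weights)
    return sum(w * q for w, q in zip(weights, q_values)) / total


# Hypothetical joint action values for one state.
q_joint = [1.0, 2.0, 4.0]
for beta in (0.0, 1.0, 50.0):
    print(beta, softmax_value(q_joint, beta), "max:", max(q_joint))
```

With `beta = 0` the operator is the plain mean, and as `beta` grows it approaches the max, so `beta` trades off between a pessimistic and an optimistic target.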

To learn more about this contribution, don't hesitate to read their paper, Regularized Softmax Deep Multi-Agent Q-Learning (see References).

        <h2>References</h2> 














                <ol>
<li>Multiagent Systems: A Modern Approach to Distributed Artificial Intelligence, Weiss (1999)</li>
<li>A Review of Cooperative Multi-Agent Deep Reinforcement Learning, A. Oroojlooy and D. Hajinezhad (2020)</li>
<li>Open Problems in Cooperative AI, A. Dafoe et al. (2020)</li>
<li>QMIX: Monotonic Value Function Factorisation for Deep Multi-Agent Reinforcement Learning, T. Rashid et al. (2018)</li>
<li>SMAC: The StarCraft Multi-Agent Challenge, Mikayel Samvelyan et al. (2019)</li>
<li>Regularized Softmax Deep Multi-Agent Q-Learning (2021)</li>
</ol>
