Multi-Agent Reinforcement Learning: Understand MADDPG with One Chart

Introduction
This article aims to demystify the Multi-Agent Deep Deterministic Policy Gradient (MADDPG) method [1], an extension of DDPG that facilitates learning in complex, multi-agent environments where agents must cooperate or compete.
Because of the complexity of its workflow, in particular the loss calculation, where the actor and critic networks share information across agents, it is not obvious why and how MADDPG's centralized training works. This article therefore aims to simplify the understanding of MADDPG through a flowchart with detailed annotations, plus code for building MADDPG from scratch.
MADDPG extends the DDPG method to solve complex environments where multiple agents need to learn to cooperate and/or compete against each other.


Unique Features of MADDPG
MADDPG is characterized by:
- Determinism: For a given input, the MADDPG actor network always produces the same output, ideal for precise control of continuous actions.
- Decentralized Policy with Centralized Training
- Temporal Difference (TD) Learning
- Applicability to Both Cooperative and Competitive Environments
The Networks
In MADDPG, each agent consists of four networks: an Actor and a Critic, along with their corresponding Target Actor and Target Critic. Thus, if there are N agents, the total number of networks is 4N.
The Actor Network processes the raw state of an agent at a specific time and outputs actions. The Critic Network evaluates these actions along with the global states and produces a Q-value, which is a numerical estimate predicting the expected return from a particular action taken in a given state. This evaluation helps agents choose the best action combinations for any given state.
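As a rough sketch, the two network types could look like the following in PyTorch (the layer sizes and class names are illustrative assumptions, not the exact architecture from the project repository):
import torch
import torch.nn as nn

class Actor(nn.Module):
    # maps one agent's local observation to a deterministic continuous action
    def __init__(self, obs_dim, action_dim, hidden_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, action_dim), nn.Tanh())

    def forward(self, obs):
        return self.net(obs)

class Critic(nn.Module):
    # evaluates the joint action of all agents given the global state, outputs one Q-value
    def __init__(self, global_state_dim, joint_action_dim, hidden_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(global_state_dim + joint_action_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 1))

    def forward(self, global_state, joint_actions):
        return self.net(torch.cat([global_state, joint_actions], dim=1))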
Why Target Networks?
The role of both the Target Actor and Target Critic networks is to stabilize training updates by smoothing the learning curve and offering consistent targets during training. Using the Target Critic to estimate the value of the next state helps reduce training variance. Meanwhile, the slightly outdated policy maintained by the Target Actor provides a stable, consistent target, preventing the feedback loops that can occur when the live, constantly updating Actor is used for action generation.
The Replay Buffers

In MADDPG, the replay buffer is centralized, meaning that it stores the experiences (states, actions, rewards, and next states) of all agents together. This centralized approach allows each agent’s Critic to access not only its own actions and rewards but also the actions and outcomes of other agents in the environment. This is particularly useful in multi-agent settings where the actions of one agent can significantly impact the state of the environment and, consequently, the outcomes for other agents.
While the experiences are stored centrally, the training of the agents’ policies (Actors) is decentralized. Each agent trains its own Actor independently to decide on actions based on local observations, but it uses the global information stored in the replay buffer for Critic updates. This method helps in learning more robust strategies that consider the potential impact of other agents’ actions.
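For intuition, a centralized buffer might be sketched as below (the class name, fields, and shapes are illustrative assumptions rather than the project's exact implementation):
import numpy as np

class MultiAgentReplayBuffer:
    def __init__(self, capacity, n_agents, obs_dims, action_dims, global_state_dim):
        self.capacity, self.counter = capacity, 0
        # joint quantities shared by every agent's Critic
        self.global_states = np.zeros((capacity, global_state_dim))
        self.next_global_states = np.zeros((capacity, global_state_dim))
        self.rewards = np.zeros((capacity, n_agents))
        self.dones = np.zeros((capacity, n_agents), dtype=bool)
        # per-agent local observations and actions, used by each agent's Actor
        self.obs = [np.zeros((capacity, obs_dims[i])) for i in range(n_agents)]
        self.actions = [np.zeros((capacity, action_dims[i])) for i in range(n_agents)]

    def store(self, obs_list, global_state, actions, rewards, next_global_state, dones):
        idx = self.counter % self.capacity
        for i, (o, a) in enumerate(zip(obs_list, actions)):
            self.obs[i][idx] = o
            self.actions[i][idx] = a
        self.global_states[idx] = global_state
        self.next_global_states[idx] = next_global_state
        self.rewards[idx] = rewards
        self.dones[idx] = dones
        self.counter += 1

    def sample(self, batch_size):
        idx = np.random.choice(min(self.counter, self.capacity), batch_size, replace=False)
        return ([o[idx] for o in self.obs], [a[idx] for a in self.actions],
                self.global_states[idx], self.rewards[idx],
                self.next_global_states[idx], self.dones[idx])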
The One Chart
The workflow of MADDPG can be simplified as:
actors act on their current states (positions, directions, angles, etc.) ➡️ the replay buffer stores the resulting transitions ➡️ the actor/critic networks are trained and updated.

Chart Notations
[1]. Old/New Global State: Represents a comprehensive view of all agents’ states at a given time, aiding the Critic in centralized training. In code, the global state is the concatenation of all agents’ observations at the given timestamp.
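Code representation (a minimal sketch; agent_observations is assumed to be a list of per-agent observation arrays):
import numpy as np

# the global state is every agent's local observation stacked into one flat vector
global_state = np.concatenate(agent_observations)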
[2]. Done Flags: Track the termination points of the environment to ensure accurate experience recording and to zero the bootstrapped part of the expected total return at terminal states.
[3]. Expected Total Return:
y = r + γ · Q′(s′, a′),  with a′ = μ′(s′)
where:
- r is the instant reward,
- s′ is the next state,
- a′ = μ′(s′) is the next action predicted by the Target Actor,
- γ is the discount factor adjusting the future reward’s present value, and
- Q′(s′, a′) is the Target Critic’s estimation used for future state valuation.
By using Q′(s′, a′), the algorithm can estimate the value of future states, incorporating not just the immediate rewards but also the potential future returns. This application of temporal difference learning allows the algorithm to update its evaluation of future states based on current estimates.
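A minimal sketch of how this target could be computed for a sampled batch (function and tensor names such as target_actors, target_critic, and dones are illustrative assumptions, not the repository's API):
import torch

def td_target(rewards, dones, next_obs_list, next_global_state,
              target_actors, target_critic, gamma=0.99):
    with torch.no_grad():
        # Target Actors predict the next joint action from the next observations
        next_actions = torch.cat([ta(o) for ta, o in zip(target_actors, next_obs_list)], dim=1)
        # Target Critic evaluates the next global state with those actions
        next_q = target_critic(next_global_state, next_actions).squeeze(-1)
        # zero the bootstrap term at terminal transitions (see the done flags in [2])
        return rewards + gamma * next_q * (1.0 - dones.float())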
[4]. Actor Loss: The goal of the actor in MADDPG is to find a policy that maximizes the expected return from the current state. When computing the loss for the actor, using the mean of the Q-values (obtained from the critic evaluations of the selected actions in the given states) provides an average gradient signal for the entire batch of experiences.
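A possible sketch of that loss for a single agent (names such as sampled_actions, local_obs, and global_state are illustrative assumptions):
import torch

def actor_loss_for_agent(agent_idx, actor, critic, local_obs, sampled_actions, global_state):
    # replace this agent's stored action with one from its live Actor,
    # keeping the other agents' actions fixed at their sampled values
    actions = list(sampled_actions)
    actions[agent_idx] = actor(local_obs)
    joint_actions = torch.cat(actions, dim=1)
    # maximizing Q is implemented as minimizing the negative mean Q over the batch
    return -critic(global_state, joint_actions).mean()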
[5]. Critic Loss: The Critic loss is the Mean Squared Error (MSE) between the expected total return (the target y computed above with the target networks) and the Q-value predicted by the Critic network. The Critic loss:
L = (1/N) · Σᵢ (yᵢ − Q(sᵢ, aᵢ))²
where:
- L is the loss for the Critic,
- N is the number of samples in the batch,
- yᵢ is the target Q-value for the i-th sample, and
- Q(sᵢ, aᵢ) is the Q-value predicted by the Critic network for the action aᵢ taken in state sᵢ.
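A corresponding sketch of the Critic loss (y is the TD target from [3]; the names are illustrative assumptions):
import torch.nn.functional as F

def critic_loss(critic, global_state, joint_actions, y):
    # detach the target so gradients only flow through the Critic
    q_pred = critic(global_state, joint_actions).squeeze(-1)
    return F.mse_loss(q_pred, y.detach())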
[6]. Update Target/Critic Networks: MADDPG employs a soft update technique that incrementally blends the weights of the target networks with those of the main networks. This method, which involves a small coefficient (typically between 0.001 and 0.01), ensures that the target networks update at a controlled rate, preventing rapid changes that could destabilize learning.
θ′ ← τ · θ + (1 − τ) · θ′
Where:
- θ′ represents the parameters of the target network,
- θ represents the parameters of the main network, and
- τ is a small coefficient (e.g., 0.001 to 0.01) that determines the rate at which the target network parameters are updated. This is known as the mixing factor.
Code Representation:
def update_networks(self, tau=None):
    if tau is None:
        tau = self.tau
    # update actor networks
    actor_params = self.actor.named_parameters()
    target_actor_params = self.target_actor.named_parameters()
    actor_dic = dict(actor_params)
    target_actor_dic = dict(target_actor_params)
    for name in actor_dic:
        actor_dic[name] = tau * actor_dic[name].clone() + \
            (1 - tau) * target_actor_dic[name].clone()
    self.target_actor.load_state_dict(actor_dic)
    # update critic networks
    critic_params = self.critic.named_parameters()
    target_critic_params = self.target_critic.named_parameters()
    critic_dic = dict(critic_params)
    target_critic_dic = dict(target_critic_params)
    for name in critic_dic:
        critic_dic[name] = tau * critic_dic[name].clone() + \
            (1 - tau) * target_critic_dic[name].clone()
    self.target_critic.load_state_dict(critic_dic)
Project Repository
👉 Github
Source
[1]. R. Lowe, Y. Wu, A. Tamar, J. Harb, P. Abbeel, and I. Mordatch, “Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments,” NeurIPS 2017.