An In-Depth Look at the MADDPG Algorithm and How It Improves on DDPG

最编程 2024-07-29 16:17:22


Introduction

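Deep reinforcement learning algorithms fall into two groups: single-agent and multi-agent. The single-agent family starts with DQN and includes policy gradient, actor-critic, DPG, PPO, DDPG, SAC and so on. These algorithms handle environments that contain one agent (or several agents whose decisions can be reduced to those of a single decision-maker). In some environments, such as a football match or a pursuit game, single-agent algorithms fall short. If every agent still runs its own single-agent algorithm, the following happens: when agent $i$ takes action $a_i$, the actions $a_j, j\ne i$ of the other agents are unknown to it, so the reward it receives for the same state and action keeps changing. From the point of view of a single agent, the environment is unstable (non-stationary).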

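Look at it from another angle. Most DRL algorithms inherit the replay buffer mechanism of their ancestor DQN: at suitable moments they sample the buffer to obtain decorrelated training data for the network. In an unstable environment, the buffer can end up holding transitions with the same state and the same action but different rewards, which directly causes training to oscillate or even collapse.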
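The buffer problem described above can be reproduced in a few lines. The sketch below is only an illustration under made-up assumptions: a two-agent game with a single state, a hypothetical payoff function `reward_for_agent0`, and a second agent that acts at random. None of these names come from the original post.

```python
import random

def reward_for_agent0(a0, a1):
    # Hypothetical payoff: agent 0 is rewarded only when both agents pick the same action.
    return 1.0 if a0 == a1 else 0.0

random.seed(0)
state = "s0"          # a single fixed state keeps the example minimal
joint_buffer = []     # each entry: (state, a0, a1, reward of agent 0)
for _ in range(1000):
    a0 = 0                        # agent 0 always takes the same action
    a1 = random.choice([0, 1])    # agent 1's action is invisible to agent 0
    joint_buffer.append((state, a0, a1, reward_for_agent0(a0, a1)))

local_view = {}   # what agent 0 sees: key is (state, own action)
joint_view = {}   # key additionally contains the other agent's action
for s, a0, a1, r in joint_buffer:
    local_view.setdefault((s, a0), set()).add(r)
    joint_view.setdefault((s, a0, a1), set()).add(r)

# True: the same (state, action) pair maps to several different rewards.
ambiguous_local = any(len(rs) > 1 for rs in local_view.values())
# False: once the other agent's action is part of the key, the reward is unique.
ambiguous_joint = any(len(rs) > 1 for rs in joint_view.values())
print(ambiguous_local, ambiguous_joint)
```

In this toy setting the conflict disappears as soon as the other agent's action is recorded alongside the transition, which is the kind of extra information a multi-agent method can exploit.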

This is where multi-agent deep reinforcement learning comes in. This post introduces the multi-agent version of DDPG, namely MADDPG (Multi-Agent Deep Deterministic Policy Gradient).

Core idea

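I will skip the filler. If you know DDPG, you know it has four networks: actor, target_actor, critic and target_critic. The target networks are refreshed periodically by copying parameters from their counterparts (either a full copy or a soft update). The actor, also called the policy network, makes decisions for the agent, and its input is the agent's state. The critic evaluates the value of the decision the actor made. To borrow a line from Morvan, the critic is like a conductor sitting above the actor who, thanks to a longer-term view, coaches the actor into making better and better decisions. The critic's input is therefore the state plus the action (state + action); in standard RL terminology this critic is the familiar $Q$ function. Note that the critic in single-agent DDPG only receives the state and action of its own agent and nothing about the other agents. Applying a single-agent algorithm directly to a multi-agent environment therefore trains poorly because the critic does not get enough information.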
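The two ways of refreshing a target network mentioned above can be sketched as follows. This is a minimal illustration on plain Python lists rather than real network parameters; the function names and the mixing coefficient `tau` are assumptions chosen for the example (the post itself does not name them).

```python
def hard_update(source_params):
    """Full copy: the target becomes an exact copy of the source."""
    return list(source_params)

def soft_update(target_params, source_params, tau):
    """Soft update: target <- tau * source + (1 - tau) * target."""
    return [tau * s + (1.0 - tau) * t
            for s, t in zip(source_params, target_params)]

actor = [1.0, -2.0, 0.5]         # stand-in for the actor's weights
target_actor = [0.0, 0.0, 0.0]   # stand-in for the target actor's weights

# With a small tau the target only drifts slowly toward the actor.
target_actor = soft_update(target_actor, actor, tau=0.01)
print(target_actor)

# A hard update jumps straight to the actor's current weights.
print(hard_update(actor))
```

A small `tau` makes the target change slowly, which keeps the TD target from moving too quickly between updates.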

The biggest difference between MADDPG and DDPG lies in the critic. Not enough information? Fine, give the critic more. Here is the critic update from the original paper:
$\mathcal{L}\left(\theta_{i}\right)=\mathbb{E}_{\mathbf{x}, a, r, \mathbf{x}^{\prime}}\left[\left(Q_{i}^{\mu}\left(\mathbf{x}, a_{1}, \ldots, a_{N}\right)-y\right)^{2}\right], \quad y=r_{i}+\left.\gamma Q_{i}^{\boldsymbol{\mu}^{\prime}}\left(\mathbf{x}^{\prime}, a_{1}^{\prime}, \ldots, a_{N}^{\prime}\right)\right|_{a_{j}^{\prime}=\boldsymbol{\mu}_{j}^{\prime}\left(o_{j}\right)}$
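Look at the inputs of the $Q$ function: $\mathbf{x}, a_{1}, \ldots, a_{N}$. The terms $a_{1}, \ldots, a_{N}$ are easy to understand: they are the actions of all $N$ agents, the agent's own action included. The paper describes $\mathbf{x}$ as follows: "In the simplest case, x could consist of the observations of all agents, x = (o1, …, oN), however we could also include additional state information if available." In practice $\mathbf{x}$ is usually taken to be the states (observations) of all agents.

At this point MADDPG turns out to be simpler than it looks: the critic goes from looking only at its own agent's experience to looking at global experience. In one sentence: decentralized actors and centralized critics. A decentralized actor takes only its own agent's state as input and nothing from the other agents, so at test time each agent only needs its own actor to choose actions. A centralized critic gathers all the information in the environment and uses it to guide its actor. A figure from the original paper illustrates the idea: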
[Figure: overview of the MADDPG approach, from the original paper (image not available)]
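To make "decentralized actor, centralized critic" concrete, the sketch below shows what each network would receive as input. The observation and action sizes are hypothetical numbers picked for illustration, and plain lists stand in for tensors.

```python
# Hypothetical per-agent sizes for a 3-agent task (not from the original post).
obs_dims = [4, 4, 6]   # length of each agent's observation o_i
act_dims = [2, 2, 2]   # length of each agent's action a_i

def actor_input_dim(i):
    # Decentralized actor: agent i only sees its own observation.
    return obs_dims[i]

def critic_input_dim():
    # Centralized critic: x = (o_1, ..., o_N) followed by a_1, ..., a_N.
    return sum(obs_dims) + sum(act_dims)

def joint_critic_input(observations, actions):
    # Flatten all observations, then all actions, into one input vector.
    x = [v for o in observations for v in o]
    a = [v for act in actions for v in act]
    return x + a

observations = [[0.0] * d for d in obs_dims]
actions = [[0.0] * d for d in act_dims]
critic_in = joint_critic_input(observations, actions)

print([actor_input_dim(i) for i in range(3)])   # one size per actor
print(critic_input_dim(), len(critic_in))       # both are the joint size
```

Every agent has its own critic $Q_i$, but each of those critics is fed the same joint input; only the reward $r_i$ used in its target differs.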

Update rules

The formula for the critic update was given above. Put simply, it is a TD update, exactly as in single-agent DDPG.
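As a small worked example of that TD update, the sketch below computes the target $y$ and the squared-error loss for a batch of three transitions. All numbers are invented for illustration, and the `(1 - done)` mask is not part of the paper's formula: it follows the code listing later in this post, where it zeroes the bootstrap term on terminal transitions.

```python
def td_targets(rewards, next_q, dones, gamma):
    # y = r + gamma * (1 - done) * Q_target(x', a_1', ..., a_N')
    return [r + gamma * (1.0 - d) * q
            for r, q, d in zip(rewards, next_q, dones)]

def mse(pred, target):
    # Mean squared TD error over the batch.
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

rewards = [1.0, 0.0, -1.0]   # r_i for three sampled transitions
next_q = [2.0, 4.0, 10.0]    # target critic's value at the next state
dones = [0.0, 0.0, 1.0]      # the last transition is terminal
gamma = 0.5

y = td_targets(rewards, next_q, dones, gamma)
q = [1.0, 2.0, 0.0]          # current critic's predictions
loss_c = mse(q, y)
print(y, loss_c)
```

For the terminal transition the target reduces to the reward alone, however large the target critic's estimate is.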
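Since the critic acts as the actor's "judge", i.e. its output is a score for the actor's decision, the actor should naturally be updated in the direction that makes the critic score it higher: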
$\nabla_{\theta_{i}} J\left(\boldsymbol{\mu}_{i}\right)=\mathbb{E}_{\mathbf{x}, a \sim \mathcal{D}}\left[\left.\nabla_{\theta_{i}} \boldsymbol{\mu}_{i}\left(a_{i} \mid o_{i}\right) \nabla_{a_{i}} Q_{i}^{\mu}\left(\mathbf{x}, a_{1}, \ldots, a_{N}\right)\right|_{a_{i}=\boldsymbol{\mu}_{i}\left(o_{i}\right)}\right],$
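Personally I am not fond of this kind of dense notation. Put plainly, the actor's loss is the $Q$ value output by its critic with a minus sign in front (deep learning frameworks minimize a loss, so minimizing $-Q$ maximizes $Q$, i.e. the critic's score).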
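The chain rule in the actor gradient can be checked by hand on a toy problem. The sketch below assumes a one-parameter policy $\mu(o) = \theta o$ and a made-up critic that peaks at a fixed "best" action; both are illustrative stand-ins, not the networks used in MADDPG.

```python
def policy(theta, o):
    # Toy deterministic policy: mu(o) = theta * o, so d mu / d theta = o.
    return theta * o

def critic(a, best_action):
    # Toy Q function with its maximum at a == best_action.
    return -(a - best_action) ** 2

def dq_da(a, best_action):
    # Derivative of the toy critic with respect to the action.
    return -2.0 * (a - best_action)

theta, o, best_action, lr = 0.0, 2.0, 1.0, 0.05
for _ in range(200):
    a = policy(theta, o)
    grad_theta = o * dq_da(a, best_action)   # d mu/d theta * dQ/da
    theta += lr * grad_theta                 # ascent on Q == descent on -Q

print(theta, policy(theta, o), critic(policy(theta, o), best_action))
```

The actor's parameter moves until its action sits at the critic's maximum, which is what "minimize $-Q$" does in the full algorithm with automatic differentiation handling the chain rule.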

Here is the core PyTorch code for updating the MADDPG networks (it may be a little hard to follow without the surrounding context):

 for agent_idx, (actor_c, actor_t, critic_c, critic_t, opt_a, opt_c) in \
     enumerate(zip(actors_cur, actors_tar, critics_cur, critics_tar, optimizers_a, optimizers_c)):

     _obs_n_o, _action_n, _rew_n, _obs_n_n, _done_n = memory.sample(
         arglist.batch_size, agent_idx)
     rew = torch.tensor(_rew_n, device=arglist.device, dtype=torch.float)
     done_n = torch.tensor(_done_n, device=arglist.device, dtype=torch.float)
     action_cur_o = torch.from_numpy(_action_n).to(arglist.device, torch.float)
     obs_n_o = torch.from_numpy(_obs_n_o).to(arglist.device, torch.float)
     obs_n_n = torch.from_numpy(_obs_n_n).to(arglist.device, torch.float)
     q = critic_c(obs_n_o, action_cur_o).reshape(-1)  # Q(x, a_1..a_N) from the current critic
     with torch.no_grad():
         # next actions of all agents come from the target actors, each fed its own observation slice
         action_tar = torch.cat([a_t(obs_n_n[:, obs_size[idx][0]:obs_size[idx][1]])
                                 for idx, a_t in enumerate(actors_tar)], dim=1)
         q_ = critic_t(obs_n_n, action_tar).reshape(-1)  # target critic at the next state
         tar_value = rew + arglist.gamma * q_ * (1 - done_n)  # TD target: r + gamma * q_ * (1 - done)
     loss_c = torch.nn.MSELoss()(q, tar_value)  # squared TD error (Bellman equation)
     opt_c.zero_grad()
     loss_c.backward()
     opt_c.step()

     # -- use the same batch to update the ACTOR
     # other agents' actions are taken from the buffer; only this agent's action is recomputed
     policy_c_new = actor_c(
         obs_n_o[:, obs_size[agent_idx][0]:obs_size[agent_idx][1]])
     # splice this agent's fresh action into a copy of the joint action
     action_new = action_cur_o.clone()
     action_new[:, action_size[agent_idx][0]:action_size[agent_idx][1]] = policy_c_new
     loss_a = -critic_c(obs_n_o, action_new).mean()  # minimizing -Q maximizes the critic's score

     opt_a.zero_grad()
     loss_a.backward()
     opt_a.step()
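The listing indexes the joint tensors with `obs_size[idx][0]:obs_size[idx][1]` and `action_size[...]`, but the post does not show how those tables are built. A plausible reading is that each entry holds the (start, end) columns of one agent inside the concatenated vector. The helper `build_slices` below is a hypothetical reconstruction under that assumption, with made-up dimensions.

```python
def build_slices(dims):
    # Turn per-agent sizes into (start, end) column ranges of the joint vector.
    slices, start = [], 0
    for d in dims:
        slices.append((start, start + d))
        start += d
    return slices

# Hypothetical per-agent sizes; the real ones depend on the environment.
obs_size = build_slices([4, 4, 6])
action_size = build_slices([2, 2, 2])

joint_obs_row = list(range(14))   # stand-in for one row of obs_n_o
agent_idx = 1
start, end = obs_size[agent_idx]
own_obs = joint_obs_row[start:end]   # the part agent 1's actor would see

print(obs_size, action_size, own_obs)
```

With tables like these, the critics consume the whole row while each actor only receives its own slice, matching the centralized-critic and decentralized-actor split.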

Pseudocode

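[Figure: MADDPG pseudocode from the original paper (image not available)]
Implementations found online differ somewhat in the details, but following the overall framework of the pseudocode figure above will not lead you astray.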

Strengths and weaknesses

The original paper reports some experimental results:
