Coordination and Communication in Deep Multi-Agent Reinforcement Learning


A growing number of real-world control problems require teams of software agents to solve a joint task through cooperation. Such tasks arise naturally wherever machines replace human workers, such as robot arms in manufacturing or autonomous cars in transportation. At the same time, new technologies have given rise to novel cooperative control problems that are beyond human reach, such as package routing. Whether due to physical constraints such as partial observability, robustness requirements, or the need to manage large joint action spaces, cooperative agents are often required to function in a fully decentralised fashion. This means that each agent has access only to its own local sensory input during task execution and has no explicit communication channels to other agents. Deep multi-agent reinforcement learning (DMARL) is a natural framework for learning control policies in such settings. When trained in simulation or in a laboratory, learning algorithms often have access to additional information that will not be available at execution time. Such centralised training with decentralised execution (CTDE) poses a number of technical challenges to DMARL algorithms that try to exploit the centralised setting in order to facilitate the training of decentralised policies. These difficulties arise primarily from the apparent incongruency between joint policy learning, which can represent arbitrary policies but is not naively decentralisable and scales poorly with the number of agents, and independent learning, which is readily decentralisable and scalable but provably less expressive and prone to environment non-stationarity due to the presence of other learning agents. The first part of this thesis develops algorithms that use the technique of value decomposition to exploit the centralised training of decentralised policies. In Monotonic Value Factorisation for Deep Multi-Agent Reinforcement Learning, we introduce the novel Q-learning algorithm QMIX.
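The appeal of value decomposition can be illustrated with a minimal additive factorisation (in the spirit of value decomposition networks); the following sketch uses hypothetical per-agent utility tables for two agents with three discrete actions each, and shows why the factored joint value is trivially decentralisable:

```python
import numpy as np

# Hypothetical setup: 2 agents, 3 discrete actions each.
# The joint value factorises additively over per-agent utilities:
#   Q_tot(a_1, a_2) = Q_1(a_1) + Q_2(a_2)
rng = np.random.default_rng(0)
q1 = rng.normal(size=3)  # agent 1's utilities Q_1(a_1)
q2 = rng.normal(size=3)  # agent 2's utilities Q_2(a_2)

# Centralised greedy joint action over the full joint action space
q_tot = q1[:, None] + q2[None, :]          # shape (3, 3)
joint_greedy = np.unravel_index(q_tot.argmax(), q_tot.shape)

# Decentralised execution: each agent maximises its own utility locally
decentralised = (q1.argmax(), q2.argmax())

# Additivity guarantees the two coincide, so the learned joint value
# can be executed without any communication between agents.
assert joint_greedy == decentralised
```

The price of this simple factorisation is expressiveness: an additive model cannot represent joint values in which the best action for one agent depends on what the others do, which is what motivates richer decompositions such as QMIX below.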
QMIX uses a centralised monotonic mixing network in order to model joint team action-value functions that are nevertheless decomposable into decentralised agent policies over discrete action spaces. To evaluate the performance of QMIX, we develop a novel benchmark suite, the StarCraft Multi-Agent Challenge (SMAC), which features a variety of discrete-action cooperative control tasks in StarCraft II unit micromanagement. Unlike pre-existing toy environments, SMAC scenarios feature diverse dynamics owing to a large number of different unit types and sophisticated in-built enemy heuristics. Many robotic control tasks feature continuous action spaces. To extend value decomposition to those settings, in FACMAC: Factored Multi-Agent Centralised Policy Gradients, we focus on actor-critic approaches to multi-agent learning in CTDE settings. The resulting learning algorithm, FACMAC, achieves state-of-the-art performance on SMAC and opens the door toward using nonmonotonic critic factorisations. Just as for QMIX, we introduce a novel benchmark suite for cooperative continuous control tasks, Multi-Agent Mujoco (MAMujoco). MAMujoco decomposes robots from the popular Mujoco benchmark suite into multiple agents with configurable partial observability constraints. The second part of this thesis explores the value of common knowledge as a resource for both coordinating and communicating through actions. Common knowledge between groups of agents arises in a large class of tasks of practical interest, for example, if agents can recognize each other in overlapping fields of view. In Multi-Agent Common Knowledge Reinforcement Learning, we introduce a novel actor-critic method, MACKRL, which constructs a hierarchy of controllers over common knowledge across agent groups of varying sizes.
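The key structural idea behind the monotonic mixing network can be sketched in a few lines. This is a simplified illustration, not QMIX itself: in the actual architecture the mixer weights are produced by hypernetworks conditioned on the global state, and the per-agent inputs come from recurrent utility networks. Here fixed random non-negative weights for two agents stand in for that machinery:

```python
import numpy as np

# Simplified sketch of a QMIX-style monotonic mixer (assumptions:
# 2 agents, one hidden layer, fixed weights; real QMIX generates these
# weights with state-conditioned hypernetworks).
rng = np.random.default_rng(1)
W1 = np.abs(rng.normal(size=(4, 2)))  # non-negative weights -> monotonic
b1 = rng.normal(size=4)
W2 = np.abs(rng.normal(size=(1, 4)))
b2 = rng.normal(size=1)

def mix(q):
    """Map per-agent utilities [Q_1, Q_2] to a joint value Q_tot."""
    h = np.maximum(W1 @ q + b1, 0.0)  # ReLU here; the paper uses ELU
    return float(W2 @ h + b2)

# Monotonicity: raising any single agent's utility never lowers Q_tot,
# so each agent can greedily maximise its own Q_i at execution time and
# the joint greedy action decomposes across agents.
q = np.array([0.3, -0.5])
assert mix(q + np.array([1.0, 0.0])) >= mix(q)
assert mix(q + np.array([0.0, 1.0])) >= mix(q)
```

Because the mixer is a nonlinear (if monotone) function of the per-agent utilities, it can represent a strictly larger class of joint values than an additive factorisation, while preserving the decentralisability of the greedy policy.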
This hierarchy gives rise to a decentralised policy structure that realises a joint-independent hybrid: agent groups execute joint policies when their common knowledge is sufficiently informative for action coordination, and otherwise fall back to independent policies. In this way, MACKRL enjoys the coordination benefits of joint policy training while remaining fully decentralised. The third part of this thesis investigates how to learn efficient implicit communication protocols for collaborative tasks. In Communicating via Markov Decision Processes, we explore how a sender agent can execute a task optimally while simultaneously communicating information to a receiver agent solely through its actions. In this novel implicit referential game, the sender's policy and trajectory are common knowledge between sender and receiver. By splitting the sender's task into a single-agent maximum-entropy reinforcement learning problem and a separate message-encoding step based on minimum-entropy coupling, we show that our method GME establishes communication channels of significantly higher bandwidth than those trained end-to-end. In summary, this thesis presents a number of significant contributions to deep multi-agent reinforcement learning for cooperative control within the framework of centralised training with decentralised execution, along with two associated novel benchmark suites. Within this setting, we make contributions to value decomposition, to the use of common knowledge in multi-agent learning, and to the efficient learning of implicit communication protocols.
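The minimum-entropy-coupling step mentioned above can be illustrated with the standard greedy approximation: given a distribution over messages and the sender policy's action distribution, it repeatedly matches the largest remaining probability masses, yielding a valid coupling (a joint distribution with the two given marginals) of low joint entropy. This is a hedged sketch of that generic greedy procedure, not the thesis's implementation; the example marginals are hypothetical:

```python
import numpy as np

def greedy_mec(p, q, tol=1e-12):
    """Greedy approximate minimum-entropy coupling of marginals p and q.

    Repeatedly assigns min(max remaining mass of p, max remaining mass
    of q) to the corresponding cell of the joint distribution.
    """
    p, q = p.astype(float).copy(), q.astype(float).copy()
    M = np.zeros((len(p), len(q)))
    while p.sum() > tol and q.sum() > tol:
        i, j = p.argmax(), q.argmax()
        m = min(p[i], q[j])
        M[i, j] += m
        p[i] -= m
        q[j] -= m
    return M

p = np.array([0.5, 0.5])        # e.g. a distribution over messages
q = np.array([0.6, 0.3, 0.1])   # e.g. a policy's action distribution
M = greedy_mec(p, q)

# M is a valid coupling: its marginals recover p and q exactly.
assert np.allclose(M.sum(axis=1), p)
assert np.allclose(M.sum(axis=0), q)
```

Coupling the message distribution with the action distribution in this way concentrates the joint mass, which is what lets the receiver, who knows the sender's policy, decode messages from observed trajectories with few errors.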

DPhil thesis, University of Oxford. Awarded the EPSRC AAI Doctoral Impact Fund Award.
Christian Schroeder de Witt
AI & Security Research | Strategy
