Federated Control with Hierarchical Multi-Agent Deep Reinforcement Learning

Saurabh Kumar*
Georgia Tech
Pararth Shah*
Dilek Hakkani-Tür
Larry Heck

*Equal contribution. Work done while Saurabh Kumar interned at Google Research.

We present a framework combining hierarchical and multi-agent deep reinforcement learning approaches to solve coordination problems among a multitude of agents using a semi-decentralized model. The framework extends the multi-agent learning setup by introducing a meta-controller that guides the communication between agent pairs, enabling agents to focus on communicating with only one other agent at any step. This hierarchical decomposition of the task allows for efficient exploration to learn policies that identify globally optimal solutions even as the number of collaborating agents increases. We show promising initial experimental results on a simulated distributed scheduling problem.

1 Introduction

Multi-agent reinforcement learning foerster2017counterfactual can be applied to many real-world coordination problems, e.g. network packet routing or urban traffic control, to find decentralized policies that jointly optimize the private value functions of participating agents. However, multi-agent RL algorithms scale poorly with problem size. Since communication possibilities increase quadratically with the number of agents, the agents must explore a larger combined action space before receiving feedback from the environment. In a separate line of research, hierarchical reinforcement learning (HRL) kulkarni2016hierarchical has enabled learning goal-directed behavior from sparse feedback in complex environments. HRL divides the overall task into independent goals and trains a meta-controller to pick the next goal, while a controller learns to reach individual goals. Consequently, HRL requires the task to be divisible into independent subtasks that can be solved sequentially. Multi-agent problems do not directly fit this criterion, as a subtask would entail coordination between multiple agents, each having partial observability of the global state. Recently, lewis2017deal trained a pair of agents to negotiate and agree upon a joint action, but they do not explore scaling to settings with many agents.

We propose Federated Control with Reinforcement Learning (FCRL), a framework for combining hierarchical and multi-agent deep RL to solve multi-agent coordination problems with a semi-decentralized model. Similar to HRL, the model consists of a meta-controller and controllers, which are hierarchically organized deep reinforcement learning modules that operate at separate time scales. In contrast to HRL, we modify the notion of a controller to characterize a decentralized agent which receives a partial view of the state and chooses actions that maximize its private value function. The model supports a variable number of controllers, where each controller is intrinsically motivated to negotiate with another controller and agree upon a joint action under some constraints, e.g. a division of available resources or a consistent schedule. The meta-controller chooses a sequence of pairs of controllers that must negotiate with each other, as well as a constraint provided to each pair, and it is rewarded by the environment for efficiently surfacing a globally consistent set of controller actions. Since a controller needs to communicate with only a single other controller at any step, the controller’s policy can be trained separately via self-play to maximize expected future intrinsic reward with gradient descent. As the details of individual negotiations are abstracted away from the meta-controller, it can efficiently explore the space of choices of controller pairs even as the number of controllers increases, and it is trained to maximize expected future extrinsic reward with gradient descent.

FCRL can be applied to a variety of real-world coordination problems where privacy of agents’ data is paramount. An example is multi-task dialogue with an automated assistant, where the assistant must help a user to complete multiple interdependent tasks, for example making a plan to take a train to the city, watch a movie and then get dinner. Each task requires querying a separate third-party Web service which has a private database of availabilities, e.g. a train ticket purchase, a movie ticket purchase or a restaurant table reservation service. Each Web service is a decentralized controller which aims to maximize its utilization, while the assistant is a meta-controller which aims to obtain a globally viable schedule for the user. Another example is urban traffic control, where each vehicle is a controller having a destination location that it desires to keep private. A meta-controller guides the traffic flow through a grid of intersections, aiming to maintain a normal level of traffic on all roads in the grid. The meta-controller iteratively picks a pair of controllers and road segments, and the controllers must negotiate with each other and assign different road segments among themselves.

In the next section, we formally describe the Federated RL model, and in Section 3 we mention related work. In Section 4 we present preliminary experiments with a simulated multi-task dialogue problem. We conclude with a discussion and present directions for future work in Section 5.

2 Model

Figure 1: Federated control model

Reinforcement Learning (RL) problems are characterized by an agent interacting with a dynamic environment with the objective of maximizing a long term reward sutton1998reinforcement . The basic RL model casts the task as a Markov Decision Process (MDP) defined by the tuple (S, A, T, R, γ) of states, actions, transition function, reward function, and discount factor.

Agents As in the h-DQN setting kulkarni2016hierarchical , we construct a two-stage hierarchy with a meta-controller and a controller that operate at different temporal scales. However, rather than utilizing a single controller which learns to complete multiple subtasks, FCRL employs multiple controllers, each of which learns to communicate with another controller to collectively complete a prescribed subtask.

Temporal Abstractions As shown in Figure 1, the meta-controller receives a state s from the environment and selects a subtask from the set of all subtasks and a constraint from the set of all constraints. Constraints ensure that individual subtasks focus on disjoint parts of the overall problem, allowing each problem to be solved independently by a subset of the controllers. The meta-controller’s goal is to pick a sequence of subtasks and associated constraints to maximize the cumulative discounted extrinsic reward provided by the environment. A subtask is associated with two controllers, c_i and c_j, who must communicate with each other to complete the task. c_i and c_j receive separate partial views of the environment state through states s_i and s_j. For K time steps, the subtask and constraint remain fixed while c_i and c_j negotiate to decide on a pair of output actions, a_i and a_j, which are outputted at the K-th time step. The controllers are trained with an intrinsic reward provided by an internal critic. The reward is shared between the controller pairs that communicated with each other, and they are rewarded for choosing actions that satisfy the constraints provided by the environment as well as the meta-controller.

3 Related Work

Multi-agent communication with Deep RL Multi-agent RL involves multiple agents that must either cooperate or compete in order to successfully complete a task. A number of recent works have demonstrated the success of this approach and have applied it to tasks in which agents communicate with one another foerster2016learntocomm ; mordatch2017complang ; sukhbaatar2016learning . Two agents learned to communicate in natural language with one another in order to complete a negotiation task in lewis2017deal and an image guessing task in das2017visdial . The demonstrated success of multi-agent communication is promising, but it may be difficult to scale to greater numbers of agents. In our work, we combine multi-agent RL with a meta-controller that selects subsets of agents to communicate so that the overall task is optimally completed.

Hierarchical Deep RL The h-DQN algorithm kulkarni2016hierarchical splits an agent into two components: a meta-controller that selects subtasks to complete and a controller that selects primitive actions given a subtask as input from the meta-controller. FCRL uses multiple controllers and extends the notion of a subtask to involve a subset of the controllers that must collectively communicate to satisfy certain constraints. Therefore, each controller can be pre-trained to complete a distinct goal and may receive information only relevant for that particular goal. This is in contrast to the work by kulkarni2016hierarchical , which trains a single controller to complete multiple subtasks.

Hierarchical Deep RL for dialogue In task-oriented dialogues, a dialogue agent assists a user in completing a particular task, such as booking movie tickets or making a restaurant reservation. Composite tasks are those in which multiple tasks must be completed by the dialogue agent. Recent work by peng2017composite applies the h-DQN technique to composite task completion. While this work successfully trains agents on composite task dialogues, a drawback with the straightforward application of h-DQN to dialogue is that only one goal is in focus at any given time, which must be completed prior to another goal being addressed. This prevents the dialogue agent from handling cross-goal constraints. The FCRL algorithm addresses cross-goal constraints by modeling a subtask as a negotiation between two controllers that must simultaneously complete their individual goals.

4 Experiments

We present a preliminary experiment applying our method to a simulated distributed scheduling problem. The goal is to validate our approach against baseline approaches in a controlled setup. We plan to run further experiments on realistic scenarios in future work.

4.1 Environment

We consider a distributed scheduling problem which is inspired by the multi-domain dialogue management setup described in the introduction. Formally, this consists of a set of agents, each having a private database of available time entries, who must each pick a time such that the relative order of times chosen by all agents is consistent with the order specified by the environment. At the start of an episode, the environment randomly chooses m agents and provides an ordering of agents c_1, …, c_m (we model this directly as a sequence of agent IDs, but an extension is to generate or sample crowd-sourced utterances for the constraints (“I want to watch a movie and get dinner. Also I’ll need a cab to get there.”) and train the agents to parse the natural language into an agent order) and agent databases D_1, …, D_m, where D_i is a bit vector of size T specifying the times that are available to agent c_i. The agents can communicate amongst each other for K rounds, after which each agent must output an action a_i ∈ {1, …, T}. The environment emits a positive reward if the actions are such that D_i[a_i] = 1 for all i and a_1 < a_2 < … < a_m; otherwise it emits a negative reward. In our setup we fixed the database size T and the number of communication rounds K, and experimented with m ∈ {2, 4, 6}.
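
The environment's success condition can be sketched in a few lines. This is an illustrative reconstruction, not the paper's code; the ±1 reward values and function name are our assumptions.

```python
# Minimal sketch of the scheduling environment's reward check.
# Each agent i has a bit-vector database databases[i] of length T and
# outputs a chosen time times[i]; the episode succeeds when every chosen
# time is available in that agent's database and the chosen times are
# strictly increasing, matching the requested agent order.

def episode_reward(databases, times):
    """Return +1 on a valid, correctly ordered schedule, else -1
    (the exact reward magnitudes are an assumption)."""
    # (i) each agent's chosen time must be available in its own database
    valid = all(db[t] == 1 for db, t in zip(databases, times))
    # (ii) the relative order of chosen times must match the requested order
    ordered = all(a < b for a, b in zip(times, times[1:]))
    return 1 if valid and ordered else -1
```

For example, with two agents whose databases overlap only partially, `episode_reward([[0, 1, 1, 0], [0, 0, 1, 1]], [1, 3])` succeeds, while reversing the chosen times fails the ordering check.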

4.2 Agents

We evaluate three approaches for solving the distributed scheduling problem. Our proposed approach (FCRL) consists of a meta-controller that picks a pair of controllers and a constraint vector, and controller agents which communicate in pairs and output times that satisfy their private databases as well as the meta-controller’s constraint. The two baselines, Multi-agent RL (MARL) and Hierarchical RL (HRL), compare our approach with the settings without a meta-controller or multi-agent communication, respectively.

Federated Control (FCRL) We use an FCRL agent with K communication steps between the controllers. Note that this means they communicate for K steps and then produce the output actions at the final time step. The controller and meta-controller Q-networks have the same structure, with two hidden layers, each followed by a nonlinearity, and a final fully-connected layer outputting Q-values for each action. The controller has T actions, one for each possible time value, and the meta-controller’s actions correspond to constraint windows of sizes T, T/2, T/4, …. (We assume that T is a power of 2.) The meta-controller iterates through agent pairs in the order expected by the environment, and for the pair selected at time t, it chooses a constraint vector g_t. This constraint is applied to the two agents’ databases, and the controllers then communicate with each other for K rounds. If the controllers are able to come up with a valid order, they are rewarded by the internal critic, and the meta-controller moves on to the next pair. Otherwise, the meta-controller retries the same pair of controllers. The meta-controller is given a maximum of 10 total invocations of controller pairs, after which the episode is terminated with the failure reward.
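
One plausible scheme for the meta-controller's constraint windows is sketched below, assuming aligned windows of sizes T, T/2, T/4, … down to single slots, each represented as a bit mask over the T time slots. The exact window layout (alignment, smallest size) is not specified in the text, so treat this as a hypothetical illustration.

```python
# Hypothetical enumeration of constraint windows over T time slots
# (T a power of 2). Each window is a bit mask that, when intersected with
# an agent's database, restricts the agent to a contiguous slice of times.

def constraint_windows(T):
    windows = []
    size = T
    while size >= 1:
        # aligned windows of the current size tile the T slots
        for start in range(0, T, size):
            mask = [1 if start <= i < start + size else 0 for i in range(T)]
            windows.append(mask)
        size //= 2
    return windows
```

For T = 4 this yields 7 windows (one of size 4, two of size 2, four of size 1), so under this scheme the meta-controller's action space grows roughly linearly in T.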

For a controller c_i communicating with another controller c_j, c_i’s state is a concatenation of the database vector D_i from the environment, a one-hot vector of size 2 denoting the position of that agent in the relative order between c_i and c_j, and a communication vector of size T which is a one-hot vector denoting c_j’s output in the previous round. (In the first round the communication vector is zeroed out.) The meta-controller’s state is a concatenation of a one-hot vector of size T denoting the latest time entry that has been selected so far, and a multi-hot vector denoting the constraints that have been tried for the current controller pair.
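
The controller state described above can be assembled as follows. This is a sketch of the concatenation only; the function name and argument names are ours, and it assumes the one-hot sizes described in the text (2 for the pair position, T for the partner's previous output).

```python
# Illustrative construction of a controller's input state: the agent's
# (constrained) database of length T, a size-2 one-hot for its position
# within the communicating pair, and a size-T one-hot carrying the
# partner's previous output (all zeros in the first round).

def controller_state(database, position, partner_prev_output, T):
    pos_onehot = [1 if i == position else 0 for i in range(2)]
    comm = [0] * T
    if partner_prev_output is not None:  # None signals the first round
        comm[partner_prev_output] = 1
    return list(database) + pos_onehot + comm
```

The resulting vector has length T + 2 + T, which fixes the input dimension of the shared controller Q-network.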

Multi-agent RL (MARL) As a baseline, we consider a setup without a meta-controller, which is the standard multi-agent setting where agents communicate with each other and emit individual actions. The controller agent is the same as that in FCRL, except that the position vector is a one-hot vector of size m, denoting the agent’s position in the overall order, and the communication vector is an average of the outputs of all other agents in the previous round, similar to CommNet described in sukhbaatar2016learning .

Hierarchical RL (HRL) We also consider a Hierarchical RL baseline as described in kulkarni2016hierarchical . The controllers do not communicate with each other but instead independently achieve the task of emitting a time value that is consistent with their database and the meta-controller’s constraint. The meta-controller is the same as in FCRL, except that it picks one agent at a time and assigns it a constraint vector.

4.3 FCRL Training

Below, we present the pseudocode for training the FCRL agent.

1:Initialize experience replay buffers D_mc for the meta-controller and D_c for the controllers
2:Initialize meta-controller’s Q-network, Q_mc, and controllers’ Q-network, Q_c, with random weights
3:for episode = 1:N do
4:     Environment selects controllers c_1, c_2, …, c_m with databases D_1, D_2, …, D_m
5:     Meta-controller state s comprises: completed pairs, times chosen so far, and constraints tried for the current pair
6:     while s is not terminal do
7:         Meta-controller selects the next controller pair (c_i, c_j) and a constraint vector g
8:         Apply g to databases D_i and D_j to obtain controller states s_i and s_j
9:         for k = 1:K do
10:              Controllers c_i and c_j exchange messages and choose actions a_i and a_j
11:              Internal critic provides a shared intrinsic reward r_int for the pair
12:              Store transition (s_i, a_i, r_int, s′_i) in D_c
13:              Store transition (s_j, a_j, r_int, s′_j) in D_c
14:         end for
15:         Sample minibatch of transitions from D_c and update Q_c weights
16:         if r_int > 0 then ▷ Controller pair found a valid schedule
17:              Meta-controller moves on to the next pair
18:         else
19:              Meta-controller retries the same pair
20:         end if
21:         r ← extrinsic reward from environment
22:         Store transition (s, (i, j, g), r, s′) in D_mc
23:         Sample minibatch of transitions from D_mc and update Q_mc weights
24:     end while
25:end for
Algorithm 1 Learning algorithm for FCRL agent

All controllers share the same replay buffer and the same weights. Additionally, by randomly sampling meta-controller constraints, the controllers can be pre-trained to communicate and complete subtasks prior to the joint training described above. In this case, Q_c will start with these pre-trained weights rather than being randomly initialized in the above algorithm.

For the distributed scheduling task as described in the experiments, the internal critic provides an intrinsic reward only if (i) the controllers’ actions are valid according to their constrained databases (D_i[a_i] = 1 and D_j[a_j] = 1), and (ii) the actions are in the correct order (a_i < a_j). For selecting controller pairs, we use the heuristic of emitting controller pairs in the order expected by the environment. Alternatively, a separate Q-network could be trained to select controller pairs, which would be useful in domains where a sequencing of controllers for pairwise communication is not manifest from the task description.
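
The internal critic's two conditions can be made concrete with a short sketch. The function name is ours, and the 0 reward on failure is an assumption (the text only specifies when a reward is given, not the failure value).

```python
# Sketch of the internal critic's reward for one controller pair, where
# controller i is scheduled before controller j. db_i and db_j are the
# pair's constrained databases (bit vectors); a_i and a_j are the chosen
# time slots. The reward is shared by both controllers in the pair.

def intrinsic_reward(db_i, db_j, a_i, a_j):
    valid = db_i[a_i] == 1 and db_j[a_j] == 1   # (i) both times available
    ordered = a_i < a_j                          # (ii) correct relative order
    return 1 if valid and ordered else 0
```

Note that both conditions must hold simultaneously: an available but out-of-order pair of times earns no intrinsic reward, which is what pushes the pair to negotiate rather than greedily pick their individually best slots.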

4.4 Results

We alternate training and evaluation for 1000 episodes each. Figure 2 plots the average reward on the evaluation episodes over the course of training. Each curve is the average of 5 independent runs with the same configuration. We ran three experiments by varying m, i.e. the number of agents that are part of the requested schedule and must communicate to come up with a valid plan.

For m = 2, all three approaches are able to find the optimal policy, as the task requires only two agents to communicate. For m = 4, HRL performs poorly since there is no inter-agent communication and the meta-controller must do all the work of picking the right sequence of constraints to surface a valid schedule. MARL does better, as agents can communicate their preferences and get a chance to update their choices based on what other agents picked. FCRL does better than both baselines: the meta-controller learns to guide the communications by constraining each pair of agents to focus on disjoint slices of the database, while each controller has to communicate with only one other controller, making it easy to agree upon a good pair of actions.

For m = 6, both HRL and MARL are unable to see a positive reward, as finding a valid schedule requires significantly more exploration for the meta-controller and controllers, respectively. FCRL is able to do better by dividing the problem into disjoint subtasks and leveraging temporal abstractions. However, the meta-controller’s optimal policy is more intricate in this case, as it needs to learn to start with smaller constraint windows and try larger ones if the smaller ones fail, so that the earlier agent pairs do not choose times that are farther apart when closer ones are possible.

Figure 2: Comparing FCRL with baselines MARL and HRL, on three environments: (a) easy (m=2), (b) medium (m=4), and (c) hard (m=6).

5 Discussion

We presented a framework for combining hierarchical and multi-agent RL to benefit from temporal abstractions to reduce the communication complexity for finding globally consistent solutions with distributed policies. Our experimental results show that this approach scales better than baseline approaches as the number of communicating agents increases.

Future work The effect of increasing the size of the database or the number of communication rounds will be interesting to study. Multi-agent training creates a non-stationary environment for each agent, as other agents’ policies change over the course of training. While we employ the standard DQN algorithm mnih2013dqn to train the meta-controller, the controllers can be trained using recent policy-gradient-based methods (e.g., counterfactual gradients foerster2017counterfactual ) which address this problem.

