Q-Networks for Binary Vector Actions†
†This paper was accepted at the Deep Reinforcement Learning Workshop, NIPS 2015.
Abstract
In this paper, reinforcement learning with binary vector actions is investigated. We suggest an effective neural network architecture for approximating an action-value function with binary vector actions. The proposed architecture approximates the action-value function by a linear function with respect to the action vector, but remains nonlinear with respect to the state input. We show that this approximation method enables efficient calculation of both greedy action selection and softmax action selection. Using this architecture, we suggest an online algorithm based on Q-learning. Empirical results in the grid world and the blocker task suggest that our approximation architecture is effective for RL problems with large discrete action sets.
Naoto Yoshida, Tohoku University, Aramaki Aza Aoba 6-6-01, Sendai 980-8579, Miyagi, Japan. naotoyoshida@pfsl.mech.tohoku.ac.jp
1 Introduction
One of the big challenges in reinforcement learning (RL) is learning in high-dimensional state-action spaces. Recent advances in deep learning have made it possible to treat RL problems with high-dimensional state spaces, achieving impressive results in general game playing tasks (e.g., ATARI games) [1].
Even though several approaches have been suggested for RL with continuous actions [2][3], RL with a large action space is still problematic, especially when we treat binary vectors as representations of the actions. The difficulty is that the number of actions grows exponentially with the length of the binary vector. Recently, several approaches have been used to tackle this problem. Sallans & Hinton suggested an energy-based approach in which restricted Boltzmann machines (RBMs) [4] were adopted and their free energy was used as the function approximator [5]. Heess et al. followed this energy-based approach and investigated natural actor-critic algorithms with energy-based policies given by RBMs [6]. Although energy-based approaches are known to be effective in large discrete domains, exact action sampling is intractable due to the nonlinearity of the approximation architecture. Hence, an energy-based approach samples actions by Gibbs sampling.
However, Gibbs-sampling-based action selection is computationally expensive and requires careful tuning of the parameters. Also, because exact sampling of greedy actions is intractable, no Q-learning-based online off-policy RL algorithm has so far been proposed for the large discrete action domain. Against this background, we treat this issue and suggest a novel architecture for off-policy RL with a large discrete action set.
2 Preliminaries
2.1 Markov Decision Process and Reinforcement Learning
Value-based reinforcement learning algorithms utilize the Markov decision process (MDP) assumption. An MDP is defined by a tuple $(\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R})$. $\mathcal{S}$ is the state set, $\mathcal{A}$ is the action set, and $\mathcal{P}$ is the transition probability $p(s' \mid s, a)$, where $s'$ is the next state given a state-action pair $(s, a)$. Finally, $\mathcal{R}(s, a) = \mathbb{E}[r \mid s, a]$ is the average reward function, where $r$ is the reward sample.
In value-based RL, the action-value function is defined by

$$Q^\pi(s, a) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r_t \,\middle|\, s_0 = s,\, a_0 = a,\, \pi\right] \qquad (1)$$

where $\gamma \in [0, 1)$ is the discount factor. In value-based RL, we look for the optimal policy $\pi^*$ that maximizes the action-values for every state-action pair. Q-learning is an algorithm for finding the optimal policy in an MDP [7], and the advantage of Q-learning is its off-policy property: the agent can directly approximate the action-value function of an optimal policy while following another behavior policy $\pi$.
Although Q-learning is guaranteed to converge to the optimal action-values when we use tabular functions in a discrete state-action environment [8], tabular approaches quickly become inefficient for RL with large state-action spaces. Function approximation then becomes necessary in such domains.
2.2 Qlearning with Function Approximation
In the Q-learning algorithm with function approximation, we approximate the optimal action-value function $Q^*(s, a)$ by a function $Q(s, a; \theta)$, where $\theta$ is the parameter of the function.
The gradient-based update of the function calculates the gradient of the error function

$$E(\theta) = \frac{1}{2}\bigl(y - Q(s, a; \theta)\bigr)^2 \qquad (2)$$

where $y$ is the target signal. The gradient of the error is then obtained by

$$\nabla_\theta E(\theta) = -\bigl(y - Q(s, a; \theta)\bigr)\,\nabla_\theta Q(s, a; \theta). \qquad (3)$$

The target signal in Q-learning is given by a transition sample $(s, a, r, s')$ as $y = r + \gamma \max_{a'} Q(s', a'; \theta)$. Then the direction of the parameter update is given by

$$\Delta\theta \propto \bigl(y - Q(s, a; \theta)\bigr)\,\nabla_\theta Q(s, a; \theta) \qquad (4)$$
$$= \bigl(r + \gamma \max_{a'} Q(s', a'; \theta) - Q(s, a; \theta)\bigr)\,\nabla_\theta Q(s, a; \theta). \qquad (5)$$

The first term of the product in the second equality is called the TD error. Using this gradient, stochastic gradient descent or more sophisticated gradient-based algorithms are used for approximating the optimal action-value function [9][10].
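As a concrete illustration, the update (4)-(5) can be sketched for a simple linear approximator. This is a minimal sketch; the feature dimensions, step size, and transition below are illustrative assumptions, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_actions = 8, 4

# One weight vector per discrete action: Q(s, a; theta) = theta[a] . s
theta = np.zeros((n_actions, n_features))

def q_values(s, theta):
    """Q(s, a; theta) for all actions at once."""
    return theta @ s

def q_learning_step(theta, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """One SGD step on the squared error (2) with the Q-learning target."""
    target = r + gamma * np.max(q_values(s_next, theta))
    td_error = target - q_values(s, theta)[a]   # TD error in (5)
    theta[a] += alpha * td_error * s            # grad of Q w.r.t. theta[a] is s
    return theta

s, s_next = rng.normal(size=n_features), rng.normal(size=n_features)
theta = q_learning_step(theta, s, a=2, r=1.0, s_next=s_next)
```

Note how the target itself requires a max over actions; this is the operation that the next subsection identifies as the bottleneck for large action sets.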
The Q-learning-based gradient requires a max operation over $Q(s, a; \theta)$ given a state. In previous research with small discrete action sets, this max operation was tractable. However, if the actions are composed of binary vectors or factored representations [5][11], the total number of actions grows exponentially and the operation quickly becomes intractable.
3 Proposed Method
In this study, we assume that function approximation is done by multilayer perceptrons (MLPs) parameterized by $\theta$. To efficiently calculate the max operations in Q-learning with a large discrete action space, we propose the network architecture of MLPs shown in Figure 1. In this architecture, the outputs of the network are composed of a continuous scalar variable $\psi(s; \theta)$ and a continuous vector variable $\phi(s; \theta)$. We approximate the action-value function by a function that is linear with respect to the action vector:

$$Q(s, a; \theta) = \psi(s; \theta) + \phi(s; \theta)^\top a \qquad (6)$$
$$= \psi(s; \theta) + \sum_{k=1}^{K} \phi_k(s; \theta)\, a_k. \qquad (7)$$

Here, $a \in \{0, 1\}^K$ is the action represented by the binary vector, and $a_k$ is the $k$th component of the action.
The gradient of the function is given by

$$\nabla_\theta Q(s, a; \theta) = \nabla_\theta \psi(s; \theta) + \sum_{k=1}^{K} a_k\, \nabla_\theta \phi_k(s; \theta), \qquad (8)$$

which is efficiently obtained by the backpropagation algorithm.
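A minimal NumPy sketch of this two-headed network follows. The layer sizes, the initialization scale, and the class name `QNet` are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(1)

class QNet:
    """Three-layer MLP with two output heads: a scalar psi(s; theta) and a
    vector phi(s; theta), so that Q(s, a; theta) = psi + phi . a as in (6)-(7)."""

    def __init__(self, n_state, n_hidden, n_bits):
        self.W1 = rng.normal(0.0, 0.1, (n_hidden, n_state))
        self.b1 = np.zeros(n_hidden)
        self.w_psi = rng.normal(0.0, 0.1, n_hidden)
        self.W_phi = rng.normal(0.0, 0.1, (n_bits, n_hidden))

    def heads(self, s):
        h = np.maximum(0.0, self.W1 @ s + self.b1)  # ReLU hidden layer
        return self.w_psi @ h, self.W_phi @ h       # psi (scalar), phi (vector)

    def q(self, s, a):
        psi, phi = self.heads(s)
        return psi + phi @ a                        # linear in the action bits

net = QNet(n_state=6, n_hidden=16, n_bits=4)
s = rng.normal(size=6)
```

Because $Q$ is linear in $a$, the contribution of each bit is exactly $\phi_k(s; \theta)$; this is what the action-sampling procedures of section 3.1 exploit.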
3.1 Sampling of the Actions
The proposed approximation architecture provides an efficient calculation of the greedy action. For actions with a one-hot representation, the greedy policy is obvious:

$$k^* = \arg\max_k \phi_k(s; \theta) \qquad (9)$$
$$a^{\text{greedy}} = e_{k^*}, \qquad (10)$$

where $\phi_k$ is the $k$th element of the outputs $\phi(s; \theta)$ and $e_{k^*}$ is the one-hot vector whose $k^*$th bit is one.
For $K$-bit binary vector actions, sampling of the greedy actions with respect to the function (7) is still tractable. The $k$th element of the greedy action vector is given by

$$a_k^{\text{greedy}} = \arg\max_{a_k \in \{0, 1\}} \phi_k(s; \theta)\, a_k = \begin{cases} 1 & \text{if } \phi_k(s; \theta) > 0 \\ 0 & \text{otherwise.} \end{cases} \qquad (11)$$
Because we can efficiently sample the greedy action, ε-greedy action selection is tractable in our case. In the experiment section, we tested some variants of ε-greedy action selection.
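The bitwise maximization in (11) can be checked against brute-force enumeration on a small example (the numbers here are illustrative):

```python
import numpy as np
from itertools import product

def greedy_action(phi):
    """Equation (11): because Q = psi + sum_k phi_k a_k is separable over bits,
    the k-th greedy bit is 1 if phi_k > 0 and 0 otherwise."""
    return (phi > 0.0).astype(int)

def brute_force_greedy(psi, phi):
    """Argmax of Q over all 2^K actions -- the computation that becomes
    intractable for large K, but is exact on this small example."""
    actions = [np.array(bits) for bits in product([0, 1], repeat=len(phi))]
    return max(actions, key=lambda a: psi + phi @ a)

phi = np.array([0.7, -1.2, 0.05, -0.3])
print(greedy_action(phi))  # -> [1 0 1 0]
```

The thresholding rule costs $O(K)$, while the enumeration costs $O(2^K)$; both return the same action (up to ties at $\phi_k = 0$).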
Exact sampling from the softmax action selection for binary vector actions is also tractable. Substituting equation (7) into the conventional softmax policy with inverse temperature $\beta$ gives the equality

$$\pi(a \mid s) = \frac{\exp\bigl(\beta Q(s, a; \theta)\bigr)}{\sum_{a'} \exp\bigl(\beta Q(s, a'; \theta)\bigr)} \qquad (12)$$
$$= \frac{\exp\bigl(\beta \psi(s;\theta) + \beta \sum_k \phi_k(s;\theta)\, a_k\bigr)}{\sum_{a'} \exp\bigl(\beta \psi(s;\theta) + \beta \sum_k \phi_k(s;\theta)\, a'_k\bigr)} \qquad (13)$$
$$= \prod_{k=1}^{K} \frac{\exp\bigl(\beta \phi_k(s;\theta)\, a_k\bigr)}{\sum_{a'_k \in \{0,1\}} \exp\bigl(\beta \phi_k(s;\theta)\, a'_k\bigr)} \qquad (14)$$
$$= \prod_{k=1}^{K} \pi_k(a_k \mid s), \qquad (15)$$

where $\pi_k$ is the Bernoulli distribution for the $k$th element of the action. The firing probability of the $k$th bit of the action is given by the logistic function

$$\pi_k(a_k = 1 \mid s) = \frac{1}{1 + \exp\bigl(-\beta \phi_k(s; \theta)\bigr)}. \qquad (16)$$
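The factorization (12)-(15) can be verified numerically: the product of the per-bit Bernoulli probabilities (16) reproduces the exact softmax distribution over all $2^K$ actions. The values of $\beta$, $\psi$, and $\phi$ below are illustrative:

```python
import numpy as np

def bit_probs(phi, beta):
    """Equation (16): firing probability of each bit under the softmax policy."""
    return 1.0 / (1.0 + np.exp(-beta * phi))

def sample_action(phi, beta, rng):
    """Exact softmax sample via K independent Bernoulli draws -- no Gibbs
    sampling or normalization over 2^K actions is needed."""
    return (rng.random(len(phi)) < bit_probs(phi, beta)).astype(int)

beta, psi = 2.0, 0.3               # psi cancels out of the softmax ratio
phi = np.array([0.5, -0.8, 1.1])

rng = np.random.default_rng(4)
a = sample_action(phi, beta, rng)
```

Sampling thus costs $O(K)$ per action instead of the $O(2^K)$ normalization that a generic softmax over all actions would require.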
When the environment is represented by a factored MDP [5][11], the action may be represented by a binary vector composed of a concatenation of one-hot representation vectors. For example, the agent may have to decide one of 2 options and one of 3 options simultaneously; if the agent takes the first option and the third option, the action is represented as the 5-bit vector $(1, 0, 0, 0, 1)$. The greedy action for the factored environment is given by

$$a^{\text{greedy},\, i} = e_{j_i^*}, \quad j_i^* = \arg\max_{1 \le j \le M_i} \phi^i_j(s; \theta), \qquad (17)$$

where $i$ is the index of the factored action sets, $M_i$ is the size of the $i$th action set, and $\phi^i_j$ is the output element corresponding to the $j$th option of the $i$th set. Following a transformation similar to equation (15), the softmax policy for the factored action is given as

$$\pi(a \mid s) = \prod_i \pi^i(a^i \mid s), \qquad (18)$$

where $\pi^i$ is the softmax function with respect to the $i$th factored action set:

$$\pi^i(a^i = e_j \mid s) = \frac{\exp\bigl(\beta \phi^i_j(s; \theta)\bigr)}{\sum_{j'=1}^{M_i} \exp\bigl(\beta \phi^i_{j'}(s; \theta)\bigr)}. \qquad (19)$$
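The groupwise computations (17)-(19) can be sketched as follows; the group sizes mirror the 2-option/3-option example above, and the $\phi$ values are illustrative:

```python
import numpy as np

def factored_greedy(phi, sizes):
    """Equation (17): within each one-hot group, set the bit with the largest phi."""
    a, start = np.zeros(sum(sizes), dtype=int), 0
    for m in sizes:
        a[start + np.argmax(phi[start:start + m])] = 1
        start += m
    return a

def factored_softmax(phi, sizes, beta=1.0):
    """Equation (19): an independent softmax over each factored action set."""
    probs, start = [], 0
    for m in sizes:
        z = np.exp(beta * phi[start:start + m])
        probs.append(z / z.sum())
        start += m
    return probs

# one choice of 2 options and one choice of 3 options, as in the text
phi = np.array([0.4, -0.1, 0.2, 0.9, -0.5])
print(factored_greedy(phi, sizes=[2, 3]))  # -> [1 0 0 1 0]
```

Each group is handled independently, so the cost is linear in the total number of options rather than in the product of the group sizes.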
4 Experiment
In the experiments, we tested our proposed architecture in several domains. In all of the experiments, we used the three-layer MLPs described in Figure 1, with rectified linear units (ReLU) as the activation function of the hidden units. All weights connecting to the output units were sampled from a uniform distribution, and all weights between the input units and the hidden units were sampled from a uniform distribution whose range is scaled by the number of units in the adjacent layers. The parameters were updated by stochastic gradient descent with a constant step size $\alpha$, and the same discount rate $\gamma$ was used in all of the experiments.

4.1 Grid World with Onehot Representation
First, we tested our algorithm in the conventional grid world with a one-hot representation. This task is the shortest-path problem in the grid world suggested by Sutton & Barto [12]. The state space is composed of 47 discrete states, given in a one-hot representation. The agent has 4 discrete actions corresponding to the 4 directional moves (North, South, East, West), and the action is represented by a one-hot vector (for example, the "North" action corresponds to the vector $(1, 0, 0, 0)$). The agent receives a zero reward when it reaches the goal and a negative reward otherwise. The agent was trained in an episodic manner; a single episode was terminated when the agent reached the goal or after 800 time steps. The agents were implemented by MLPs with 50 hidden units, and the ε-greedy policy was used as the behavior policy.
The left panel of Figure 3 shows the result of the experiment. The horizontal axis represents the number of episodes, and the vertical axis is the number of steps per episode. The black line is the mean performance over 10 runs and the bars are standard deviations. The broken line is the optimal number of steps. As expected, the agent successfully obtained the optimal policy.
4.2 Grid World with 4bit Binary Vector Actions
Table 1: Action encoding in the 4-bit grid world.

Action | Binary vector
North  | (1, 1, 0, 0)
South  | (0, 0, 1, 1)
East   | (1, 0, 1, 0)
West   | (0, 1, 0, 1)
Stay   | all other patterns
In this environment, the task is also the shortest-path problem in the same grid world, and the state is again given as a one-hot representation. The agent receives a zero reward when it reaches the goal and a negative reward otherwise. The training is episodic, and the termination rule of a single episode is the same as in the previous experiment. In this experiment, actions are represented by 4-bit binary vectors as shown in Table 1. Only 4 of the $2^4 = 16$ patterns move the agent in the corresponding direction; the agent stays in the same state if any other pattern is selected. The agents were again implemented by MLPs with 50 hidden units, and the ε-greedy policy was used as the behavior policy.
The right panel of Figure 3 shows the result of the experiment. The horizontal axis represents the number of episodes, and the vertical axis is the number of steps per episode. The black line is the mean performance over 10 runs and the bars are standard deviations. The broken line is the optimal number of steps. Again, the agent successfully obtained the optimal policy, even in the binary vector action domain. This result shows that the proposed method successfully improved the behavior of the agent without any Monte-Carlo-based sampling of the actions, even when the representation of actions is not a one-hot representation.
4.3 Grid World with Population Coding
Again, the task is the shortest-path problem in the same grid world. The state representation, the reward function, and the termination rules of a single episode are the same as in the previous experiments. In this experiment, the action is represented by a 40-bit binary vector, and the moves of the agent are driven by a form of population coding: the 40 bits are divided into five groups of 8 bits, each group corresponding to one of the four directional moves (1: North; 2: South; 3: East; 4: West) or the stay behavior (5: Stay). When the environment receives a 40-bit vector, outcome $i$ occurs according to the probability

$$P(i \mid a) = \frac{n_i}{\sum_{j=1}^{5} n_j}, \qquad (20)$$

where the $n_i$ are given by the action following

$$n_i = \sum_{k=1}^{8} a_{8(i-1)+k}, \quad i = 1, \ldots, 5. \qquad (21, 22)$$
In this experiment, because the discrete action space grows exponentially with the length of the binary action vector, the size of the corresponding action space is huge ($2^{40} \approx 10^{12}$). Therefore, efficient sampling of the action is also required in this domain.
In this experiment, we used MLPs with 50 hidden units and tested three types of behavior policies. The first policy is the conventional ε-greedy policy. The second policy is the bitwise ε-greedy policy, in which each bit of the action element undertakes ε-greedy exploration: the $k$th element of the action vector takes a random value (0 or 1, each with probability 0.5) with probability ε, and the greedy value otherwise. Because we can sample the greedy actions with ease, we can explicitly use this behavior policy. The third policy is the softmax policy explained in section 3.1, with inverse temperature $\beta$.
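The bitwise ε-greedy policy described above can be sketched as follows (the ε value and the use of a 40-bit vector here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)

def bitwise_eps_greedy(phi, eps):
    """Each bit independently explores: with probability eps it takes a random
    value (0 or 1, each with probability 0.5), otherwise the greedy bit (11)."""
    greedy = (phi > 0.0).astype(int)
    explore = rng.random(len(phi)) < eps
    random_bits = rng.integers(0, 2, size=len(phi))
    return np.where(explore, random_bits, greedy)

phi = rng.normal(size=40)          # a 40-bit action vector, as in this task
a = bitwise_eps_greedy(phi, eps=0.1)
```

With ε = 0 this reduces exactly to the greedy rule (11); larger ε perturbs more bits per step, giving finer-grained exploration than flipping the whole action at once.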
Figure 4 shows the result of the experiment. The horizontal axis represents the number of episodes, and the vertical axis is the number of steps per episode. The solid lines are the mean performance over 10 runs and the bars are standard deviations. The broken lines are the optimal number of steps. The results show that all three behavior policies successfully improved the performance of the agent in the high-dimensional action space. The bitwise ε-greedy policy (center) and the softmax policy (right) show better performance than the conventional ε-greedy policy (left). This is probably because of the large exploration rate used in the ε-greedy policy; however, running with a smaller exploration rate sometimes resulted in divergence of the parameters during learning.
4.4 Blocker
The blocker is a multi-agent task suggested by Sallans and Hinton [5][11]. This environment consists of a grid of 28 cells, three agents, and two pre-programmed blockers. Agents and blockers never overlap each other in the grid. To obtain a positive reward, the agents need to cooperate. The "team" of agents obtains a positive reward when any one of the three agents enters the end zone, and a negative reward otherwise. The state vector is given as a 141-bit binary vector, composed of the positions (grid cells) of all the agents (28 bits × 3 agents), the east-most position of each blocker (28 bits × 2 blockers), and a bias bit that is always one (1 bit). Each agent can move in any of the four directions, hence the size of the action space is $4^3 = 64$. The representation of the action is given as a 12-bit binary vector in which three one-hot representations are concatenated (for example, the (North, North, North) action corresponds to the vector $(1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0)$). In each episode, the agents start at random positions in the bottom row of the grid. When one of the agents enters the end zone or 40 time steps have passed, the episode terminates and the next episode starts after the initialization of the environment.
In this task, we used MLPs with 100 hidden units, and the ε-greedy policy was used as the behavior policy. We also tested the agent-wise ε-greedy policy as the behavior policy. This policy is a modified version of the ε-greedy policy for actions with a factored representation, in which each agent follows an ε-greedy policy independently.
Figure 5 shows the results of the experiment. The horizontal axis represents the time steps, and the vertical axis represents the average reward over the last 1000 steps. The left panel is the result of the conventional ε-greedy action selection, and the right panel is that of the agent-wise ε-greedy action selection. Both results are competitive, but in this experiment, the agent-wise ε-greedy agents tend to escape from local optima.
5 Discussion
In an environment with one-hot representation actions, a linear function approximation of the action-value corresponds to a bilinear function with respect to the action vector and the state vector:

$$Q(s, a; W) = s^\top W a. \qquad (23)$$

In this case, the parameter $W$ is given as a matrix. If the state is given by a one-hot representation, this approximation is identical to the table representation. As suggested in our method, the linear architecture with respect to the action enables efficient sampling of the greedy action. More recently, Mnih et al. proposed the DQN architecture [10]. In this case, we evaluate the action-values corresponding to all the discrete actions by a single forward propagation, and the training of the approximator is done only on the output that corresponds to the selected action. This architecture can be interpreted as a linear function approximation with respect to the actions:

$$Q(s, a; \theta) = \phi(s; \theta)^\top a. \qquad (24)$$

If we construct $\phi(s; \theta)$ from a nonlinear function with high representational power, such as a deep neural network, this approximation is sufficient for approximating the Q-values when actions are given by one-hot representation vectors.
The goal of our architecture (equation (7)) is to adapt these ideas to RL with binary vector actions. Although our function approximator is strongly restricted by the linear architecture with respect to the action, it is sufficient to represent an arbitrary deterministic policy even when we treat binary vector actions, as long as we represent $\phi(s; \theta)$ by a universal function approximator.
6 Conclusion
In this paper, we suggest a novel architecture of multilayer perceptrons for RL with a large discrete action set. In our architecture, the action-value function is approximated by a linear function with respect to the vector actions. This approximation method enables us to efficiently sample from the greedy policy and the softmax policy. A Q-learning-based off-policy algorithm is therefore tractable in our architecture without any Monte Carlo approximations. We empirically tested our method in several discrete action domains, and the results supported its effectiveness. Based on these promising results, we expect to extend our approach to deep architectures in future work.
References
 [1] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
 [2] Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
 [3] Evangelos Theodorou, Jonas Buchli, and Stefan Schaal. Reinforcement learning of motor skills in high dimensions: A path integral approach. In Robotics and Automation (ICRA), 2010 IEEE International Conference on, pages 2397–2403. IEEE, 2010.
 [4] P Smolensky. Information processing in dynamical systems: foundations of harmony theory. In Parallel distributed processing: explorations in the microstructure of cognition, vol. 1, pages 194–281. MIT Press, 1986.
 [5] Brian Sallans and Geoffrey E Hinton. Using free energies to represent Q-values in a multiagent reinforcement learning task. In NIPS, pages 1075–1081, 2000.
 [6] Nicolas Heess, David Silver, and Yee Whye Teh. Actor-critic reinforcement learning with energy-based policies. In EWRL, pages 43–58. Citeseer, 2012.
 [7] Christopher John Cornish Hellaby Watkins. Learning from delayed rewards. PhD thesis, University of Cambridge, 1989.
 [8] Christopher JCH Watkins and Peter Dayan. Q-learning. Machine Learning, 8(3-4):279–292, 1992.
 [9] Long-Ji Lin. Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine Learning, 8(3-4):293–321, 1992.
 [10] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
 [11] Brian Sallans and Geoffrey E Hinton. Reinforcement learning with factored states and actions. The Journal of Machine Learning Research, 5:1063–1088, 2004.
 [12] Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT press, 1998.