Macro Action Reinforcement Learning with Sequence Disentanglement using Variational Autoencoder
One problem in applying reinforcement learning to real-world problems is the curse of dimensionality on the action space. Macro actions, which are sequences of primitive actions, have been studied as a way to diminish the dimensionality of the action space along the time axis. However, previous studies relied on humans to define macro actions or assumed macro actions to be repetitions of the same primitive action. We present Factorized Macro Action Reinforcement Learning (FaMARL), which autonomously learns a disentangled factor representation of a sequence of actions to generate macro actions that can be directly applied to general reinforcement learning algorithms. FaMARL exhibits higher scores than other reinforcement learning algorithms on environments that require an extensive amount of search.
Heecheol Kim*, Masanori Yamada*, Kosuke Miyoshi and Hiroshi Yamakawa
* Both authors contributed equally to this paper.
Dwango Artificial Intelligence Laboratory
NTT Secure Platform Laboratories
1 Introduction
Reinforcement learning has recently gained significant attention in both the robotics and machine-learning communities because of its potential for wide application to different domains. Recent studies have achieved above-human-level play in Go [?; ?] and video games [?; ?]. Application of reinforcement learning to real-world robots has also been widely studied [?; ?].
Reinforcement learning involves learning the relationship between a state and an action on the basis of rewards. Reinforcement learning fails when the dimensionality of the state or action increases. This is why reinforcement learning is often considered data inefficient, i.e., it requires a large number of trials. The curse of dimensionality on the state space is partially solved using a convolutional neural network (CNN) [?; ?]; training a policy from raw image input has become possible by applying a CNN to the input states. However, reducing the dimensionality on the action side is still challenging. The search space can become exponentially large with a longer sequence and a higher action dimension.
Application of macro actions to reinforcement learning has been studied to reduce the dimensionality of actions. By compressing the sequence of primitive actions, macro actions diminish the search space. Previous studies defined macro actions as repetitions of the same primitive action [?] or required humans to define them manually [?]. However, more sophisticated macro actions should contain different primitive actions in one sequence without humans having to define these actions manually.
We propose Factorized Macro Action Reinforcement Learning (FaMARL), a novel algorithm that abstracts sequences of primitive actions into macro actions by learning a disentangled representation [?] of a given sequence of actions, reducing the dimensionality of the action search space. Our algorithm uses the Factorized Action Variational Autoencoder (FAVAE) [?], a variation of the VAE [?], to learn macro actions from given expert demonstrations. Using the acquired disentangled latent variables as macro actions, FaMARL matches the state with the latent variables of FAVAE instead of matching it with primitive actions directly. The matched latent variables are then decoded into a sequence of primitive actions, and this process is applied repeatedly in the environment. FaMARL is not limited to repeating the same primitive action multiple times, because FAVAE can compress any kind of action sequence. We experimentally show that FaMARL can learn environments with a high-dimensional search space.
2 Related work
Applying a sequence of actions to reinforcement learning has been studied [?; ?; ?; ?]. Fine Grained Action Repetition (FiGAR) successfully adopts macro actions into deep reinforcement learning [?]. It shows that Asynchronous Advantage Actor-Critic (A3C) [?], an asynchronous variant of a deep reinforcement learning algorithm, scores higher in Atari 2600 games when it learns the time scale for repeating an action as well as the action itself than when it uses primitive actions.
There are two main differences between FaMARL and FiGAR. First, FiGAR can only generate macro actions that are repetitions of the same primitive action. In contrast, macro actions generated with FaMARL can be a combination of different primitive actions because FaMARL finds a disentangled representation of a sequence of continuous actions and uses the decoded sequence as a macro action. Second, FaMARL learns how to generate macro actions and optimizes the policy for the target task independently, while FiGAR learns both simultaneously. Although FaMARL cannot learn macro actions end-to-end, it can easily reuse acquired macro actions in new target tasks, because the macro actions are acquired independently of the target tasks.
Hausknecht and Stone proposed using a parameterized continuous action space in the reinforcement learning framework [?]. This approach, however, is limited in that an action has to be selected at every time step and humans need to parameterize the action. FaMARL can be viewed as an extension of this model to time series.
3 Sequence-Disentanglement Representation Learning by Factorized Action Variational AutoEncoder
A VAE [?] is a generative model that learns probabilistic latent variables by learning the probability distribution of a dataset. A VAE encodes data $x$ to a latent variable $z$ and reconstructs $x$ from $z$.
The $\beta$-VAE [?] and CCI-VAE [?], which is an improved $\beta$-VAE, are models for learning disentangled representations. These models disentangle by adding to the VAE a constraint that reduces the total correlation. FAVAE [?] is an extended $\beta$-VAE model that learns disentangled representations from sequential data. FAVAE has a ladder network structure and an information-bottleneck-type loss function. The loss function of FAVAE is defined as
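One form of this objective that is consistent with the description in this section is sketched here. It assumes the capacity-controlled $\beta$-VAE objective applied to each ladder separately; the symbols $x_{1:T}$ (input sequence), $z_l$ (latent variables of ladder $l$), $q_\phi$ (encoder), $p_\theta$ (decoder), and $C_l$ (capacity of ladder $l$) are notational assumptions and may differ from the authors' exact notation.

```latex
\mathcal{L} = -\mathbb{E}_{q_\phi(z \mid x_{1:T})}\left[\log p_\theta(x_{1:T} \mid z)\right]
  + \beta \sum_{l} \left| D_{\mathrm{KL}}\left(q_\phi(z_l \mid x_{1:T}) \,\|\, p(z_l)\right) - C_l \right|
```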
where $l$ is the index of the ladder, $\beta$ is a constant greater than zero that encourages disentangled representation learning by weighting the Kullback-Leibler divergence term, and $C$ is the information capacity, which supports the reconstruction. In the learning phase, $C$ increases linearly with the epochs up to a final value. This final value is determined by first training FAVAE with a small $\beta$; the last value obtained in that preliminary run is then used as the final capacity. Each ladder requires its own $C$. For example, a 3-ladder network requires three values of $C$.
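The linear capacity schedule and the per-ladder capacity term can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the `c_start` parameter, and the example values are hypothetical, and the penalty assumes the absolute-difference form used by capacity-controlled $\beta$-VAE objectives.

```python
def capacity_at(epoch, total_epochs, c_start, c_end):
    """Linearly interpolate the capacity C between c_start and c_end.

    epoch runs from 0 to total_epochs - 1; values outside that range are clamped.
    """
    if total_epochs <= 1:
        return c_end
    frac = min(max(epoch / (total_epochs - 1), 0.0), 1.0)
    return c_start + frac * (c_end - c_start)


def capacity_penalty(kl_per_ladder, c_per_ladder, beta):
    """Assumed capacity term: beta * sum over ladders of |KL_l - C_l|."""
    return beta * sum(abs(kl - c) for kl, c in zip(kl_per_ladder, c_per_ladder))


# Hypothetical final capacities for a 3-ladder network (one C per ladder).
c_end_per_ladder = [2.0, 1.0, 0.5]
capacities_mid_training = [capacity_at(5, 11, 0.0, c) for c in c_end_per_ladder]
```

Keeping one capacity value per ladder mirrors the statement that a 3-ladder network requires three values of $C$.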
4 Proposed algorithm
Our objective is to find factorized macro actions from a given time series of expert demonstrations and to search for the optimal policy of a target task based on these macro actions instead of primitive actions. The target task can differ from the task for which the expert demonstrations were generated. We use FAVAE [?] to find factorized macro actions. The details of FaMARL are given in Sections 4.1 and 4.2.
One might wonder why we do not apply the expert demonstrations, or their segmentations, directly to the reinforcement learning agent to learn a new task. There are two reasons for learning disentangled factors of (segmented) expert demonstrations. First, if the agent explores only these expert demonstrations, it can only mimic them to solve the task, which results in serious deficiencies in generalizing macro actions. Consider a set of demonstrations that contains actions of turning right by a few specific angles. If the environment requires the agent to turn right by an angle that is not among the demonstrated ones, the agent cannot complete the task. On the other hand, latent variables trained with the expert demonstrations can generate a macro action that turns right by the required angle. Thus, the agent can easily adapt to the target task. Second, without latent variables, the action space is composed only by listing all expert demonstrations, forming a discrete action space. This causes the curse of dimensionality, deterring fast convergence on the task.
4.1 Unsupervised segmentation of macro actions
An episode of an expert demonstration is composed of a series of macro actions. For example, when a human demonstrates moving an object by hand, the demonstration is composed of 1) extending a hand to the object, 2) grasping the object, 3) moving the hand to the target position, and 4) releasing the object.
Therefore, expert demonstrations first need to be segmented into each macro action. One significant challenge is that there are usually no ground-truth labels for macro actions. One possible solution is to ask experts to label their actions. However, this is another burden and incurs additional cost.
Lee et al. proposed a simple method that uses an autoencoder (AE) [?; ?] to segment signal data [?]. Put simply, this method trains an AE on sliding windows of signal data so that it acquires the temporal characteristics of the sliding windows. Then, the distance between the encoded features of each pair of adjacent sliding windows is calculated. All the peaks of the distance curve are selected as segmentation points. One advantage of this method is that it is not domain-specific: it can easily be applied to expert demonstration data because it assumes no specific data characteristics.
In our implementation of this segmentation method, the distance is computed between the encoded features of adjacent sliding windows within each trajectory. We used a fixed sliding-window size. Any distance point that is the highest among its adjacent points within a fixed margin is selected as a peak.
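The distance-and-peak step can be sketched as below. This is an illustration under stated assumptions: the encoded features are taken as already computed by the AE, the Euclidean norm is assumed as the distance, and the function names and the `margin` parameter are hypothetical.

```python
import numpy as np


def adjacent_distances(features):
    """Euclidean distance between encoded features of adjacent windows.

    features: array of shape (n_windows, feature_dim) for one trajectory.
    Returns an array of length n_windows - 1.
    """
    features = np.asarray(features, dtype=float)
    return np.linalg.norm(features[1:] - features[:-1], axis=1)


def find_peaks_with_margin(distances, margin):
    """Indices whose distance is strictly the highest within +/- margin points."""
    distances = np.asarray(distances, dtype=float)
    peaks = []
    for i in range(len(distances)):
        lo, hi = max(0, i - margin), min(len(distances), i + margin + 1)
        neighbours = np.delete(distances[lo:hi], i - lo)
        if neighbours.size > 0 and np.all(distances[i] > neighbours):
            peaks.append(i)
    return peaks


# Toy encoded features with two abrupt changes (hypothetical data).
toy_features = [[0.0], [0.0], [0.0], [5.0], [5.0], [5.0], [9.0], [9.0]]
toy_peaks = find_peaks_with_margin(adjacent_distances(toy_features), margin=2)
```

A peak at index `i` marks a segmentation point between window `i` and window `i + 1`.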
4.2 Learning disentangled latent variables with FAVAE
Once the expert demonstrations are segmented, FAVAE learns the factors that compose macro actions. However, FAVAE cannot directly take segmented macro actions as input. This is because segmented macro actions may have different lengths, while FAVAE cannot process data with different lengths: it uses a combination of 1D convolution and a multilayer perceptron, which requires a uniform data size across the whole dataset. To address this issue, macro actions are padded with trailing zeros to match the input size of FAVAE. Also, two additional dimensions are added to each macro action to indicate whether the action at a given timestep is a real action or a zero-padded one: one dimension marks the timesteps within the length of the macro action, and the other marks the padded timesteps up to the input size of FAVAE. The cutting point between real actions and zero-padding is the first timestep at which the padding dimension is selected by the softmax over these two dimensions. We used the mean squared error for the reconstruction loss. FAVAE used three ladders, and CCI [?] was applied.
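The padding scheme and the cutting rule described above can be sketched as follows. The one-hot encoding of the two indicator dimensions and the function names are assumptions made for illustration; the source only states that two dimensions distinguish real from zero-padded timesteps.

```python
import numpy as np


def pad_macro_action(actions, input_len):
    """Zero-pad a (length, action_dim) macro action to (input_len, action_dim + 2).

    The two extra columns are assumed one-hot flags: column -2 marks real
    timesteps and column -1 marks zero-padded timesteps.
    """
    actions = np.asarray(actions, dtype=float)
    length, action_dim = actions.shape
    if length > input_len:
        raise ValueError("macro action is longer than the FAVAE input size")
    padded = np.zeros((input_len, action_dim + 2))
    padded[:length, :action_dim] = actions
    padded[:length, -2] = 1.0
    padded[length:, -1] = 1.0
    return padded


def trim_macro_action(decoded):
    """Cut a decoded sequence at the first timestep where the padding flag wins.

    Softmax preserves ordering, so comparing the two flag outputs directly
    selects the same class as the argmax of their softmax.
    """
    decoded = np.asarray(decoded, dtype=float)
    is_pad = decoded[:, -1] > decoded[:, -2]
    cut = int(np.argmax(is_pad)) if is_pad.any() else len(decoded)
    return decoded[:cut, :-2]
```

Padding a macro action and then trimming it returns the original actions, which is the round trip the decoder output is expected to approximate.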
4.3 Learning policy with proximal policy optimization (PPO)
Our key idea for diminishing the search space is to search the latent space of the macro actions instead of searching primitive actions directly. We used proximal policy optimization (PPO) [?] as the reinforcement learning algorithm, although any kind of reinforcement learning algorithm can be used. (Our implementation of PPO is based on https://github.com/Anjum48/rl-examples.)
PPO optimizes the following loss function:
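The clipped surrogate objective from the PPO paper is given here as a reference point. It is assumed, not confirmed, that this is the term intended; any value-function or entropy terms used in the implementation are omitted. $\hat{A}_t$ is the advantage estimate and $\epsilon$ is the clipping range.

```latex
L^{\mathrm{CLIP}}(\theta) = \hat{\mathbb{E}}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \mathrm{clip}\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right)\right]
```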
Here, $r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}$ denotes the probability ratio.
Integrating PPO with macro actions generated by FAVAE simply amounts to replacing the primitive action at every time step with a macro action whose step interval is the length of that macro action. Therefore, the model of the environment with respect to a macro action is the transition model of the environment applied successively to each primitive action in the macro action.
The PPO agent maps the input state to a latent variable of FAVAE.
The decoder of FAVAE then decodes this latent variable into a series of actions whose length equals the output length of the decoder. The actions are then trimmed using the softmax over the two padding-indicator dimensions, which are also produced by the decoder.
The macro action is cropped at the first timestep at which the padding indicator is selected. This macro action is applied to the environment without feedback. The rewards received while the macro action is executed are summed and regarded as the reward for that macro action.
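Thus, the objective function of PPO is modified so that it is evaluated at the time steps counted from the perspective of the macro action, with the latent variable in place of the primitive action and the summed reward in place of the per-step reward.
If a macro-action time step and an environment time step indicate the same moment, the environment time step equals the sum of the lengths of all macro actions executed before it.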
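The open-loop execution of one macro action and the summation of its rewards can be sketched as follows. `PointEnv` is a hypothetical toy environment introduced only to make the example runnable, and the trimmed action sequence is assumed to be given (for instance, from the FAVAE decoder).

```python
class PointEnv:
    """Hypothetical 1D environment: the action is a displacement, the reward is -|x|."""

    def __init__(self, x0=1.0):
        self.x = x0

    def step(self, action):
        self.x += float(action[0])
        return self.x, -abs(self.x)


def run_macro_action(env, actions):
    """Apply primitive actions without feedback; return the last state and the summed reward.

    If `actions` is empty, the returned state is None and the reward is 0.0.
    """
    total_reward, state = 0.0, None
    for action in actions:
        state, reward = env.step(action)
        total_reward += reward
    return state, total_reward


# One macro action made of two primitive actions (hypothetical values).
env = PointEnv(x0=1.0)
final_state, macro_reward = run_macro_action(env, [[-0.5], [-0.5]])
```

From the agent's perspective, the whole loop is a single step: it observes the state before the macro action, emits one latent variable, and receives `macro_reward` together with the state after the last primitive action.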
5 Experiments
FaMARL was tested in two environments: ContinuousWorld, a simple 2D environment with continuous action and state spaces, and RobotHand, a 2D environment with a simulated robot hand built with Box2D, a 2D physics engine. (The dataset and other supplementary results are available at https://github.com/FaMARLSupplement/FaMARLSupplement.)
5.1 ContinuousWorld
The objective in ContinuousWorld is to find the optimal trajectory from the starting position (blue dot in Figure 2) to the goal position (red dot in Figure 2). The reward is defined in terms of the position of the agent and the position of the goal. The action space is defined by the acceleration along the x axis and the acceleration along the y axis.
There are two tasks in ContinuousWorld: Base and Maze. In Base, the agent and the goal are randomly placed at the corners, at either the top or the bottom. To acquire factors of macro actions regardless of scale, the size of the map is selected randomly within a fixed range. In Maze, the agent and the goal are always placed at the same positions. However, the entrances in the four walls are set randomly for each episode, so the agent has to find an optimal policy for different entrance positions. This makes the environment difficult because the walls act like strong local optima of the reward; the agent has to make a long detour with lower rewards to finally find the optimal policy.
Our purpose was to find disentangled macro actions from expert demonstrations in Base and to apply these macro actions to complete the target tasks. We generated 100 episodes of expert demonstrations in Base using programmed scripts. We compared four different scripts: DownOnly, Down&Up, PushedDownOnly, and PushedDown&Up. All scripts are illustrated in Figure 3. For DownOnly, the goal is only initialized at the bottom of the aisle; therefore, the macro actions do not include upward movements. On the other hand, Down&Up does not limit the position of the goal; thus, upward and downward movements are both included in the macro actions. For PushedDownOnly and PushedDown&Up, the agent always accelerates upward or downward, according to the goal position.
With the expert demonstrations generated in Base, we used FaMARL in Maze. Among FaMARL with macro actions acquired from the expert demonstrations of PushedDownOnly, PPO with primitive actions, and FiGAR, FaMARL performed best on this task, and the other two algorithms failed to converge (Figure 4). The choice of macro actions is also critical. While PushedDownOnly outperformed primitive actions, the other macro actions could not complete the task. Because PushedDownOnly does not contain any demonstrated actions of moving upward, it dramatically diminishes the action space to search. On the other hand, Down&Up is similar to just repeatedly moving in one direction, which was not sufficient for completing the task.
Figure 5 shows visualized example trajectories of latent traversal for Down&Up. Latent traversal is a technique that shifts only one latent variable while fixing the others in order to observe the output decoded from the modified latent variables. If a disentangled factor representation has been acquired, the output shows meaningful changes. Otherwise, the changes are not distinguishable. Also, if the number of latent variables exceeds the number of factors that form the sequence of actions, only some of the latent variables acquire factors, and the others show no changes when traversed. In Figure 5, traversing one latent variable of one ladder changed the direction of the agent's trajectory, while traversing another variable produced no change. This result indicates that FAVAE learns a disentangled representation of a given sequence of actions.
Comparisons among different values of $\beta$ in equation 1 and among different numbers of expert trajectories are shown in Figure 6, using PushedDownOnly. Figure 6 illustrates the experiment with different values of $\beta$. FAVAE did not learn factors in a disentangled manner when $\beta$ was low. Entangled latent variables of macro actions severely deter matching the state space with the macro action space for an optimal policy because the latent space, which is what is actually matched with the state space, is distorted. On ContinuousWorld, we found a value of $\beta$ that is enough to complete Maze. Figure 6 also illustrates the experiment with different numbers of expert trajectories. Even though we used 100 expert trajectories across all experiments, the number of trajectories did not impact the performance of FaMARL.
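Latent traversal itself is simple to express in code. The sketch below is illustrative only: `toy_decode` is a hypothetical stand-in for the FAVAE decoder in which the first latent variable controls the action and the second encodes no factor.

```python
import numpy as np


def latent_traversal(decode, z, index, values):
    """Decode copies of z in which only z[index] is replaced by each value."""
    outputs = []
    for value in values:
        z_shifted = np.array(z, dtype=float)  # copy so that z itself is unchanged
        z_shifted[index] = value
        outputs.append(decode(z_shifted))
    return outputs


def toy_decode(z, steps=4):
    """Hypothetical decoder: z[0] sets a constant action; z[1] is unused."""
    return np.full(steps, z[0])


base_z = [0.0, 0.0]
informative = latent_traversal(toy_decode, base_z, index=0, values=[-1.0, 1.0])
uninformative = latent_traversal(toy_decode, base_z, index=1, values=[-1.0, 1.0])
```

Traversing index 0 changes the decoded sequence, whereas traversing index 1 leaves it unchanged, which mirrors the behaviour of a latent variable that has acquired no factor.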
5.2 RobotHand
RobotHand has four degrees of freedom (DOFs): movement along the x axis, movement along the y axis, rotation, and grasping. The entire environment was built with Box2D (https://box2d.org/) and rendered with OpenGL (https://www.opengl.org/). As with Base in ContinuousWorld, Base in RobotHand, which is a pegging task, provides 100 expert demonstrations from which disentangled macro actions are learned. The target tasks, Reaching and BallPlacing, are then completed with the acquired macro actions.
Base (Figure 7) is a pegging task. In Base, the robot moves a rod from a blue basket to a red one. We chose this task because the pegging task is complex enough to contain all macro actions that might be used in target tasks.
Reaching (Figure 7) is a simple task. The robot hand has to reach for a randomly generated goal position (red) as fast as possible. To make this task sufficiently difficult, we used a sparse reward setting in which the robot hand only receives a positive reward of +100 for reaching the goal position within a distance of 0.5 m; otherwise there is a time penalty of -1.
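The sparse reward of Reaching can be written in a few lines. The function name and the treatment of the boundary case (a distance of exactly 0.5 m counts as reaching) are assumptions made for illustration.

```python
import math


def reaching_reward(hand_xy, goal_xy, radius=0.5):
    """Sparse reward: +100 within `radius` metres of the goal, otherwise a time penalty of -1."""
    distance = math.dist(hand_xy, goal_xy)
    return 100.0 if distance <= radius else -1.0
```

Because every step away from the goal returns the same penalty, the reward carries no information about direction, which is what makes exploration with primitive actions difficult in this task.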
In BallPlacing (Figure 7), the robot hand has to carry the ball (blue) to the goal position (red). The ball is initialized at a random position within a certain range, and the goal position is fixed. The reward is defined in terms of the position of the ball and the position of the goal. An episode ends when the ball hits the edges or reaches the goal position within a distance of 0.5 m. An additional reward of +200 is given when the ball reaches the goal.
Figure 8 compares FaMARL, PPO with primitive actions, and FiGAR on both Reaching and BallPlacing. PPO with primitive actions failed to learn Reaching, and FiGAR failed to learn BallPlacing, while FaMARL successfully learned both tasks. Because the reward of Reaching is sparse, an agent using primitive actions fails to find rewards. On the other hand, even though the reward of BallPlacing is not sparse, the task requires precisely controlling the ball toward the goal. FiGAR, which repeats the same primitive action a number of times, could not control the ball precisely. FaMARL is the only algorithm that completed both tasks.
It should be noted that in the RobotHand experiments FaMARL optimized its behavior by shortening macro actions while still fully using the advantages of exploring with macro actions. In Reaching, the average length of macro actions gradually diminished (Figure 9). However, when the time penalty was eliminated (in Reaching, a time penalty of -1 was added to the reward at every time step), the length of the macro actions did not diminish (Figure 9). This is because the agent did not need to optimize its policy for speed. Macro actions can be less efficient than primitive actions for optimizing a policy because the optimal policy for the task may not match the macro actions, whereas a suboptimal policy will. That is why FaMARL gradually comes to use primitive-like actions (macro actions with lengths of 1 to 3) instead of keeping macro actions with dozens of primitive actions.
6 Limitations of FaMARL
FaMARL exhibits generally better scores than using primitive actions. However, there are limitations with FaMARL.
6.1 Lack of feedback control
Searching on macro actions instead of primitive actions facilitates searching the action space at the cost of fast response to unexpected changes in state. We failed to train BipedalWalker-v2 (https://gym.openai.com/envs/BipedalWalker-v2/) with FaMARL based on expert demonstrations from BipedalWalker-v2. A bipedal-locomotion task requires highly precise control for balancing because of the instability of the environment; thus, diminishing the search space with macro actions at the cost of fast response was not adequate.
6.2 Compatibility of macro actions with task
Figure 4 shows that the type of macro actions is critical. If the targeted task does not require the macro actions that are abstracted from expert demonstrations, FaMARL will easily fail because the actions an optimal policy requires are not present in the acquired macro actions. Thus, choosing appropriate expert demonstrations for a targeted task is essential for transferring macro actions to target tasks.
7 Conclusion
We proposed FaMARL, an algorithm that uses expert demonstrations to learn disentangled latent variables of macro actions and, for efficient search, searches these latent spaces instead of primitive actions directly. FaMARL exhibited higher scores than other reinforcement learning algorithms in tasks that require extensive iterations of search when proper expert demonstrations are provided. This is because FaMARL diminishes the search space based on the acquired macro actions. We consider this a promising first step toward the practical application of macro actions in reinforcement learning in a continuous action space. However, FaMARL could not complete a task that requires actions outside of the macro actions; tasks that need actions outside of the restricted search space cannot be solved. Possible solutions include searching for an optimal policy with both macro actions and primitive actions.
References
- [Bengio, 2013] Yoshua Bengio. Deep learning of representations: Looking forward. In International Conference on Statistical Language and Speech Processing, pages 1–37. Springer, 2013.
- [Burgess et al., 2018] Christopher P Burgess, Irina Higgins, Arka Pal, Loic Matthey, Nick Watters, Guillaume Desjardins, and Alexander Lerchner. Understanding disentangling in β-VAE. arXiv preprint arXiv:1804.03599, 2018.
- [Durugkar et al., 2016] Ishan P Durugkar, Clemens Rosenbaum, Stefan Dernbach, and Sridhar Mahadevan. Deep reinforcement learning with macro-actions. arXiv preprint arXiv:1606.04615, 2016.
- [Haarnoja et al., 2018] Tuomas Haarnoja, Aurick Zhou, Sehoon Ha, Jie Tan, George Tucker, and Sergey Levine. Learning to walk via deep reinforcement learning. arXiv preprint arXiv:1812.11103, 2018.
- [Hausknecht and Stone, 2015] Matthew Hausknecht and Peter Stone. Deep reinforcement learning in parameterized action space. arXiv preprint arXiv:1511.04143, 2015.
- [Higgins et al., 2017] Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-VAE: Learning basic visual concepts with a constrained variational framework. In International Conference on Learning Representations, 2017.
- [Hinton and Salakhutdinov, 2006] Geoffrey E Hinton and Ruslan R Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.
- [Kingma and Welling, 2013] Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
- [Krizhevsky et al., 2012] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
- [Lakshminarayanan et al., 2017] Aravind S Lakshminarayanan, Sahil Sharma, and Balaraman Ravindran. Dynamic action repetition for deep reinforcement learning. In AAAI, pages 2133–2139, 2017.
- [Lee et al., 2018] Wei-Han Lee, Jorge Ortiz, Bongjun Ko, and Ruby Lee. Time series segmentation through automatic feature learning. arXiv preprint arXiv:1801.05394, 2018.
- [Levine et al., 2016] Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-end training of deep visuomotor policies. The Journal of Machine Learning Research, 17(1):1334–1373, 2016.
- [Mnih et al., 2015] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.
- [Mnih et al., 2016] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pages 1928–1937, 2016.
- [OpenAI, 2018] OpenAI. OpenAI Five. https://blog.openai.com/openai-five/, 2018.
- [Schulman et al., 2017] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
- [Sharma et al., 2017] Sahil Sharma, Aravind S Lakshminarayanan, and Balaraman Ravindran. Learning to repeat: Fine grained action repetition for deep reinforcement learning. arXiv preprint arXiv:1702.06054, 2017.
- [Silver et al., 2016] David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484, 2016.
- [Silver et al., 2017] David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of Go without human knowledge. Nature, 550(7676):354, 2017.
- [Vezhnevets et al., 2016] Alexander Vezhnevets, Volodymyr Mnih, Simon Osindero, Alex Graves, Oriol Vinyals, John Agapiou, et al. Strategic attentive writer for learning macro-actions. In Advances in Neural Information Processing Systems, pages 3486–3494, 2016.
- [Vincent et al., 2008] Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th International Conference on Machine Learning, pages 1096–1103. ACM, 2008.
- [Yamada et al., 2019] Masanori Yamada, Kim Heecheol, Kosuke Miyoshi, and Hiroshi Yamakawa. FAVAE: Sequence disentanglement using information bottleneck principle. arXiv preprint arXiv:1902.08341, 2019.