Natural Option Critic


Saket Tiwari
College of Information and Computer Sciences
University of Massachusetts Amherst
Amherst, MA 01003
sakettiwari@umass.edu
&Philip S. Thomas
College of Information and Computer Sciences
University of Massachusetts Amherst
Amherst, MA 01003
pthomas@cs.umass.edu
Abstract

The recently proposed option-critic architecture (Bacon, Harb, and Precup 2017) provides a stochastic policy gradient approach to hierarchical reinforcement learning. Specifically, it provides a way to estimate the gradient of the expected discounted return with respect to parameters that define a finite number of temporally extended actions, called options. In this paper we show how the option-critic architecture can be extended to estimate the natural gradient (Amari 1998) of the expected discounted return. To this end, the central questions that we consider in this paper are: 1) what is the definition of the natural gradient in this context, 2) what is the Fisher information matrix associated with an option's parameterized policy, 3) what is the Fisher information matrix associated with an option's parameterized termination function, and 4) how can a compatible function approximation approach be leveraged to obtain natural gradient estimates for both the parameterized policy and parameterized termination functions of an option with per-time-step time and space complexity linear in the total number of parameters. Based on answers to these questions we introduce the natural option critic algorithm. Experimental results showcase improvement over the vanilla gradient approach.


Copyright © 2019, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Introduction

Hierarchical reinforcement learning methods enable agents to tackle challenging problems by identifying reusable skills (temporally extended actions) that simplify the task. For example, a robot agent that tries to learn to play chess by reasoning solely at the level of how much current to give to its actuators every 20ms will struggle to correlate obtained rewards with their true underlying cause. However, if this same agent first learns skills to move its arm, grasp a chess piece, and move a chess piece, then the task of learning to play chess (leveraging these skills) becomes tractable. Several mathematical frameworks for hierarchical reinforcement learning have been proposed, including hierarchies of machines (Parr and Russell 1998), MAXQ (Dietterich 2000), and the options framework (Sutton, Precup, and Singh 1999). However, none of these frameworks provides a practical mechanism for skill discovery: determining what skills will be useful for an agent to learn. Although skill discovery methods have been proposed, they tend to be heuristic in that they find skills that have a property that intuitively might make for good skills for some problems, but which do not follow directly from the primary objective of optimizing the expected discounted return (Thrun and Schwartz 1995; Simsek and Barto 2008; Konidaris and Barto 2009; Machado, Bellemare, and Bowling 2017).

The option-critic architecture (Bacon, Harb, and Precup 2017) stands out from other attempts at developing a general framework for skill discovery in that it searches for the skills that directly optimize the expected discounted return. Specifically, the option critic uses the aforementioned options framework, wherein a skill is called an option, and it proposes parameterizing all aspects of an option and then performing stochastic gradient descent on the expected discounted return with respect to these parameters. The key insight that enables the option-critic architecture is a set of theorems that give expressions for the gradient of the expected discounted return with respect to the different parameters of an option.

One limitation of the option critic is that it uses ordinary (stochastic) gradient descent. In this paper we show how the option critic can be extended to use natural gradient descent (Amari 1998), which exploits the underlying structure of the option-parameter space to produce a more informed update direction. The primary contributions of this work are theoretical: we define the natural gradients associated with the option critic, derive the Fisher information matrices associated with an option's parameterized policy and termination function, and show how the natural gradients can be estimated with per-time-step time and space complexity linear in the total number of parameters. This is achieved by means of compatible function approximations. We also analyze the performance of the natural gradient descent based approach on various learning tasks.

Preliminaries and Notation

A reinforcement learning (RL) agent interacts with an environment, modeled as a Markov decision process (MDP), over a sequence of time steps $t \in \{0, 1, 2, \dots\}$. A finite MDP is a tuple $(\mathcal{S}, \mathcal{A}, P, R, d_0, \gamma)$. $\mathcal{S}$ is the finite set of possible states of the environment, and $S_t$ is the state of the environment at time $t$. $\mathcal{A}$ is the finite set of possible actions the agent can take, and $A_t$ is the action taken by the agent at time $t$. $P$ is the transition function: $P(s,a,s') := \Pr(S_{t+1} = s' \mid S_t = s, A_t = a)$, for all $s, s' \in \mathcal{S}$ and $a \in \mathcal{A}$. Meaning, it is the probability of transitioning to state $s'$ given that the agent takes action $a$ in state $s$. $R_t$ denotes the reward at time $t$. $R$ is the reward function, $R(s,a) := \mathbb{E}[R_t \mid S_t = s, A_t = a]$, i.e., the expected reward the agent receives given that it took action $a$ in state $s$. We say that a process has ended when the environment enters a terminal state, meaning that for a terminal state $s_\infty$, $P(s_\infty, a, s_\infty) = 1$ and $R(s_\infty, a) = 0$ for all $a \in \mathcal{A}$. The process ends after $T$ steps and we call $T$ the horizon. We say the process is infinite horizon when there does not exist a finite horizon $T$. $d_0$ is the initial state distribution, i.e., $d_0(s) := \Pr(S_0 = s)$. The parameter $\gamma \in [0,1]$ scales how the rewards are discounted over time. When a terminal state is reached, time is reset to $t = 0$ and consequently a new initial state is sampled using $d_0$.

A policy, $\pi$, represents the agent's decision making system: $\pi(s,a) := \Pr(A_t = a \mid S_t = s)$. Given a policy, $\pi$, and an MDP, an episode is a sequence of states of the environment, actions taken by the agent, and the rewards observed, from the initial state, $S_0$, to the terminal state, i.e., $(S_0, A_0, R_0, S_1, A_1, R_1, \dots)$. We also define the path that an agent takes to be a sequence of states and actions, i.e., a history without rewards, $X := (S_0, A_0, S_1, A_1, \dots)$. The path $X$ is a random variable from the set of all possible paths, $\mathcal{X}$. The return of an episode is the discounted sum of all rewards, $G := \sum_{t=0}^{\infty} \gamma^t R_t$. We call $V^\pi$ the value function for the policy $\pi$, where $V^\pi(s) := \mathbb{E}\big[\sum_{t=0}^{\infty} \gamma^t R_t \mid S_0 = s, \pi\big]$. We call $Q^\pi$ the action-value function associated with policy $\pi$, where $Q^\pi(s,a) := \mathbb{E}\big[\sum_{t=0}^{\infty} \gamma^t R_t \mid S_0 = s, A_0 = a, \pi\big]$.
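As a concrete illustration of these definitions (not taken from the paper; the MDP, policy, and all numeric values below are assumptions), the following sketch builds a tiny tabular MDP, fixes a stochastic policy, and estimates the expected discounted return by Monte Carlo rollouts.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative 2-state, 2-action MDP (all numbers are assumptions).
n_states, n_actions = 2, 2
P = np.zeros((n_states, n_actions, n_states))   # P[s, a, s'] = Pr(s' | s, a)
P[0, 0, 0], P[0, 1, 1] = 1.0, 1.0               # action 0: stay, action 1: switch
P[1, 0, 1], P[1, 1, 0] = 1.0, 1.0
R = np.array([[1.0, 0.0],                        # R[s, a] = expected reward
              [2.0, 0.0]])
d0 = np.array([0.5, 0.5])                        # initial state distribution
gamma = 0.9

# A fixed stochastic policy pi[s, a] = Pr(a | s).
pi = np.array([[0.5, 0.5],
               [0.5, 0.5]])

def rollout_return(horizon=200):
    """Sample one episode and return its discounted return G = sum_t gamma^t R_t."""
    s = rng.choice(n_states, p=d0)
    G, discount = 0.0, 1.0
    for _ in range(horizon):
        a = rng.choice(n_actions, p=pi[s])
        G += discount * R[s, a]
        discount *= gamma
        s = rng.choice(n_states, p=P[s, a])
    return G

# Monte Carlo estimate of the expected discounted return under pi.
print(np.mean([rollout_return() for _ in range(1000)]))
```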

Policy Gradient Framework

The policy gradient framework (Sutton et al. 1999; Konda and Tsitsiklis 2000) assumes the policy $\pi_\theta$, parametrized by $\theta$, is differentiable with respect to $\theta$. The objective function, $J(\theta)$, is defined with respect to a start state $s_0$: $J(\theta) := \mathbb{E}\big[\sum_{t=0}^{\infty} \gamma^t R_t \mid S_0 = s_0, \theta\big]$. The agent learns by updating the parameters in a direction approximately proportional to the gradient $\nabla_\theta J(\theta)$, i.e., $\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$, where $\alpha$ is the learning rate (LR): a scalar hyperparameter.
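A minimal sketch of the update $\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$ on a one-step problem with a softmax policy, using a REINFORCE-style stochastic estimate of the gradient; the reward values and step size are illustrative assumptions, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(1)
q = np.array([1.0, 2.0, 0.5])      # assumed per-action expected rewards (one-step problem)
theta = np.zeros(3)                 # softmax policy parameters, one logit per action
alpha = 0.1                         # learning rate

def pi(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

for _ in range(2000):
    p = pi(theta)
    a = rng.choice(3, p=p)
    r = q[a] + rng.normal(0.0, 0.1)            # noisy sampled reward
    grad_log = -p.copy(); grad_log[a] += 1.0   # d log pi(a)/d theta for a softmax policy
    theta += alpha * r * grad_log              # stochastic estimate of the policy gradient step

print(pi(theta))   # most probability mass should end up on the best action (index 1)
```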

Option Critic framework

The options framework (Sutton, Precup, and Singh 1999) formalizes the notion of temporal abstraction by introducing options. An option, $\omega$, from a set of options, $\Omega$, is a generalization of a primitive action. The intra-option policy $\pi_\omega$ represents the agent's decision making while executing option $\omega$: $\pi_\omega(s,a) := \Pr(A_t = a \mid S_t = s, \omega)$. Like a primitive action, an option is initiated by the agent in a state $S_t$ and terminates in another state $S_{t+k}$, where $k$ is the duration for which the agent executes the option. While executing option $\omega$, from state $S_t$ to $S_{t+k}$, the agent follows the policy $\pi_\omega$. Option $\omega$ terminates stochastically in a state $s$ according to a termination function $\beta_\omega: \mathcal{S} \to [0,1]$. The framework puts restrictions on where an option can be initiated by defining an initiation state set, $I_\omega \subseteq \mathcal{S}$, for option $\omega$. An option is initiated in state $s$ based on $\pi_\Omega$, which is a policy over options defined as $\pi_\Omega(s,\omega) := \Pr(\omega_t = \omega \mid S_t = s)$. An initiation state set $I_\omega$, an intra-option policy $\pi_\omega$, and a termination function $\beta_\omega$ together comprise an option $\omega$. It is commonly assumed that all options are available everywhere, and we thereby dispense with the notion of an initiation set.

The option-critic framework makes all options available everywhere, and introduces policy-gradient theorems within the options framework. The option active at time step $t$ is $\omega_t$. The intra-option policies ($\pi_{\omega,\theta}$) and termination functions ($\beta_{\omega,\vartheta}$) are represented using differentiable functions parametrized by $\theta$ and $\vartheta$, respectively. The goal is to optimize the expected discounted return starting at state $s_0$ and option $\omega_0$. We re-define the objective function for the option-critic setting: $\rho(\Omega, \theta, \vartheta, s_0, \omega_0) := \mathbb{E}\big[\sum_{t=0}^{\infty} \gamma^t R_t \mid S_0 = s_0, \omega_0, \theta, \vartheta\big]$.
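For concreteness, here is one possible parameterization consistent with the experiments described later (linear-softmax intra-option policies and sigmoid terminations); the feature dimension, numbers of options and actions, and the feature vector are assumptions.

```python
import numpy as np

n_features, n_actions, n_options = 4, 3, 2

# theta: linear-softmax intra-option policy weights; vartheta: sigmoid termination weights.
theta = np.zeros((n_options, n_features, n_actions))
vartheta = np.zeros((n_options, n_features))

def intra_option_policy(phi, omega):
    """pi_{omega,theta}(a | s) for state features phi."""
    logits = phi @ theta[omega]
    z = np.exp(logits - logits.max())
    return z / z.sum()

def termination(phi, omega):
    """beta_{omega,vartheta}(s): probability that option omega terminates in s."""
    return 1.0 / (1.0 + np.exp(-phi @ vartheta[omega]))

phi = np.array([1.0, 0.5, -0.2, 0.0])   # illustrative state features
print(intra_option_policy(phi, 0), termination(phi, 0))
```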

Equations similar to those in the policy gradient framework (Sutton et al. 1999) are manipulated to derive gradients of the objective with respect to $\theta$ and $\vartheta$ in the option-critic framework. The analogous state value function is $V_\Omega(s) := \mathbb{E}\big[\sum_{t=0}^{\infty} \gamma^t R_t \mid S_0 = s\big]$; it is the value of a state $s$, within the options framework, under the option set $\Omega$ and the policy over options $\pi_\Omega$. The option-value function is $Q_\Omega(s,\omega) := \mathbb{E}\big[\sum_{t=0}^{\infty} \gamma^t R_t \mid S_0 = s, \omega_0 = \omega\big]$; it is the value of state $s$ when option $\omega$ is active, with the option set $\Omega$. The state-option-action value function is $Q_U(s,\omega,a) := \mathbb{E}\big[\sum_{t=0}^{\infty} \gamma^t R_t \mid S_0 = s, \omega_0 = \omega, A_0 = a\big]$; it is the value of executing action $a$ in the context of the state-option pair $(s,\omega)$. The option-value function upon arrival, $U(\omega, s')$, is the value of option $\omega$ being active upon the agent entering state $s'$. Bacon, Harb, and Precup (2017) observe a consequence of these definitions:

$U(\omega, s') = \big(1 - \beta_{\omega,\vartheta}(s')\big)\, Q_\Omega(s', \omega) + \beta_{\omega,\vartheta}(s')\, V_\Omega(s')$   (1)

The main results presented by Bacon, Harb, and Precup (2017) are the intra-option policy gradient theorem and the termination gradient theorem. The gradient of the expected discounted return with respect to $\theta$ and initial condition $(s_0, \omega_0)$ is:

$\dfrac{\partial Q_\Omega(s_0,\omega_0)}{\partial \theta} = \sum_{s,\omega} \mu_\Omega(s,\omega \mid s_0,\omega_0) \sum_{a} \dfrac{\partial \pi_{\omega,\theta}(a \mid s)}{\partial \theta}\, Q_U(s,\omega,a),$   (2)

where $\mu_\Omega(s,\omega \mid s_0,\omega_0)$ is the discounted weighting of state-option pairs along trajectories starting from $(s_0,\omega_0)$, defined by $\mu_\Omega(s,\omega \mid s_0,\omega_0) := \sum_{t=0}^{\infty} \gamma^t \Pr(S_t = s, \omega_t = \omega \mid s_0, \omega_0)$. The gradient of the expected discounted return with respect to $\vartheta$ and initial condition $(s_1, \omega_0)$ is:

$\dfrac{\partial U(\omega_0, s_1)}{\partial \vartheta} = -\sum_{s',\omega} \mu_\Omega(s',\omega \mid s_1,\omega_0)\, \dfrac{\partial \beta_{\omega,\vartheta}(s')}{\partial \vartheta}\, A_\Omega(s',\omega),$   (3)

where $A_\Omega$ is the advantage function over options, such that $A_\Omega(s',\omega) := Q_\Omega(s',\omega) - V_\Omega(s')$. Here, $\mu_\Omega(s',\omega \mid s_1,\omega_0)$ is the discounted weighting of state-option pairs from $(s_1,\omega_0)$, i.e., according to a Markov chain shifted by one time step, defined by $\mu_\Omega(s',\omega \mid s_1,\omega_0) := \sum_{t=0}^{\infty} \gamma^t \Pr(S_{t+1} = s', \omega_t = \omega \mid s_1, \omega_0)$. The agent learns by updating the parameters $\theta$ and $\vartheta$ in directions approximately proportional to (2) and (3), respectively. That is, it updates $\theta \leftarrow \theta + \alpha_\theta \frac{\partial Q_\Omega(s_0,\omega_0)}{\partial \theta}$ and $\vartheta \leftarrow \vartheta + \alpha_\vartheta \frac{\partial U(\omega_0,s_1)}{\partial \vartheta}$, where $\alpha_\theta$ and $\alpha_\vartheta$ are the learning rates for $\theta$ and $\vartheta$, respectively.
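The sampled form of these two updates can be sketched as below, assuming critic estimates of $Q_U$ and $A_\Omega$ are available; the function name, learning rates, and the sampled quantities are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def option_critic_updates(theta, vartheta, grad_log_pi, q_u, grad_beta, a_omega,
                          alpha_theta=0.01, alpha_vartheta=0.01):
    """One sampled step of the two vanilla (non-natural) updates implied by the
    intra-option policy gradient and termination gradient theorems.

    grad_log_pi : d log pi_{omega,theta}(a|s) / d theta, sampled at (s, omega, a)
    q_u         : a critic estimate of Q_U(s, omega, a)
    grad_beta   : d beta_{omega,vartheta}(s') / d vartheta, sampled at (s', omega)
    a_omega     : a critic estimate of A_Omega(s', omega)
    """
    theta = theta + alpha_theta * grad_log_pi * q_u              # ascend along eq. (2)
    vartheta = vartheta - alpha_vartheta * grad_beta * a_omega   # ascend along eq. (3) (note the minus sign)
    return theta, vartheta

# Illustrative call with made-up sampled quantities.
theta, vartheta = np.zeros(5), np.zeros(5)
theta, vartheta = option_critic_updates(theta, vartheta,
                                        grad_log_pi=np.array([0.1, -0.1, 0.0, 0.2, 0.0]),
                                        q_u=1.3,
                                        grad_beta=np.array([0.05, 0.0, -0.05, 0.0, 0.1]),
                                        a_omega=-0.2)
print(theta, vartheta)
```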

Natural Actor Critic

Natural gradient descent (Amari 1998) exploits the underlying structure of the parameter space when defining the direction of steepest descent. It does so by defining the inner product in the parameter space as:

$\langle \Delta\theta, \Delta\theta \rangle_\theta = \Delta\theta^\top G(\theta)\, \Delta\theta,$   (4)

where $G(\theta)$ is called the metric tensor. Although the choice of $G(\theta)$ remains open under certain conditions (?), we choose the Fisher information matrix, as is common practice. The Fisher information matrix of the distribution over a random variable $X$, parametrized by the policy parameters $\theta$, which lie on a Riemannian manifold, is (? ?; ? ?):

$G(\theta) = \mathbb{E}\!\left[ \dfrac{\partial \log p(X;\theta)}{\partial \theta_i}\, \dfrac{\partial \log p(X;\theta)}{\partial \theta_j} \right]_{i,j},$   (5)

where the expectation is over the distribution $p(\cdot;\theta)$ and $[\,\cdot\,]_{i,j}$ represents a matrix whose $(i,j)^{\text{th}}$ element is the expression on the right hand side; we use this notation to represent a matrix throughout the paper. Kakade (2001) makes the assumption that every policy, $\pi_\theta$, is ergodic and irreducible, and therefore has a well-defined stationary distribution over states. Under this assumption, Kakade (2001) introduces the use of the natural gradient for optimizing the expected reward over the parameters of the policy $\pi_\theta$. The natural gradient of the objective function, $J(\theta)$, is defined as:

$\widetilde{\nabla}_\theta J(\theta) := G(\theta)^{-1}\, \nabla_\theta J(\theta).$   (6)
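A small numeric sketch of equation (6) for a softmax distribution over three outcomes: the Fisher matrix is formed from the score (log-likelihood gradient) vectors and the natural gradient is obtained by (pseudo-)inverting it; all values are assumptions for illustration.

```python
import numpy as np

theta = np.array([0.2, -0.1, 0.0])       # parameters of a softmax distribution (illustrative)
q = np.array([1.0, 2.0, 0.5])            # assumed per-action returns

def softmax(t):
    z = np.exp(t - t.max()); return z / z.sum()

def score(t, a):
    """d log p(a; theta) / d theta for the softmax distribution."""
    g = -softmax(t); g[a] += 1.0; return g

# Exact vanilla gradient of J(theta) = sum_a p(a) q(a), and exact Fisher matrix G(theta).
p = softmax(theta)
grad_J = sum(p[a] * score(theta, a) * q[a] for a in range(3))
G = sum(p[a] * np.outer(score(theta, a), score(theta, a)) for a in range(3))

# Natural gradient: G^{-1} grad_J (pseudo-inverse, since the softmax Fisher is rank-deficient).
nat_grad = np.linalg.pinv(G) @ grad_J
print(grad_J, nat_grad)
```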

The derivation of a closed-form expression for $G(\theta)$ for the parameter space of a policy $\pi_\theta$, parametrized by $\theta$, is non-trivial, as demonstrated for the limiting matrix of the infinite horizon problem in reinforcement learning (?). For a weight vector $w$, let $f_w$ be an approximation of the state-action value function $Q^{\pi_\theta}$, which has the compatible form $f_w(s,a) := w^\top \frac{\partial \log \pi_\theta(a \mid s)}{\partial \theta}$.

The mean squared error $\epsilon(w,\theta)$, for a weight vector $w$ and a given policy $\pi_\theta$ parametrized by $\theta$, is defined as:

$\epsilon(w,\theta) := \sum_{s} d^{\pi_\theta}(s) \sum_{a} \pi_\theta(a \mid s)\, \big( f_w(s,a) - Q^{\pi_\theta}(s,a) \big)^2,$

where $d^{\pi_\theta}(s)$ is the discounted weighting of state $s$ in the infinite horizon problem. These weights normalize to the stationary distribution over states under the policy $\pi_\theta$ in the undiscounted setting, in which the MDP terminates at every time step with probability $1-\gamma$. Theorem 1 of Kakade (2001) states that the $w$ which minimizes the mean squared error, $\epsilon(w,\theta)$, is equal to the natural gradient as defined in (6).

Kakade (2001) also demonstrates how the natural policy gradient behaves under a re-scaling of the parameters, and demonstrates how the natural gradient weights the components of the gradient uniformly, instead of weighting them by the stationary distribution $d^{\pi_\theta}$. We also point out that the natural gradient is invariant to local re-parametrization of the model (?) and can be used in online learning (?). Natural gradients for reinforcement learning (? ?; ? ?; ? ?), as well as more recent work on deep neural networks (? ?; ? ?; ? ?; ? ?), have been shown to be effective for learning.

The option-critic architecture uses the vanilla gradient to learn temporal abstractions and intra-option policies, which can be less data efficient than the natural gradient (?). The natural gradient also overcomes the difficulty posed by the plateau phenomenon (?). We derive the metric tensors for the parameters in the option-critic architecture. Computing the complete Fisher information matrix, or its inverse, is expensive. We use a block-diagonal estimate of the Fisher information matrix, as has been done in the past for reinforcement learning (?) and for neural networks (? ?; ? ?; ? ?; ? ?; ? ?). Specifically, we estimate $G(\theta)$ and $G(\vartheta)$ separately, where $\theta$ and $\vartheta$ are the parameters of the intra-option policies and the option termination functions, respectively. These are then combined into a $(|\theta| + |\vartheta|) \times (|\theta| + |\vartheta|)$ sized estimate of the complete Fisher information matrix of the parameter space, where $|\cdot|$ represents the size of a vector.
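The block-diagonal approximation described above can be assembled as in the following sketch, where the two blocks stand in for $G(\theta)$ and $G(\vartheta)$; the example matrices are placeholders.

```python
import numpy as np

def block_diag_fisher(G_theta, G_vartheta):
    """Assemble the (|theta|+|vartheta|) x (|theta|+|vartheta|) block-diagonal
    approximation of the full Fisher matrix, ignoring the cross terms."""
    n, m = G_theta.shape[0], G_vartheta.shape[0]
    G = np.zeros((n + m, n + m))
    G[:n, :n] = G_theta
    G[n:, n:] = G_vartheta
    return G

# Illustrative positive semi-definite blocks.
A = np.array([[1.0, 0.2], [0.2, 0.5]])
B = np.array([[0.3]])
print(block_diag_fisher(A, B))
```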

We also provide theoretical justification for the resulting algorithm, which is inspired by the incremental natural actor-critic (INAC) algorithm (?) and its extension to include eligibility traces (? ?; ? ?).

Start State Fisher Information Matrix Over Intra-Option Path Manifold

We define a path in the options framework for the infinite horizon problem as the sequence of state-option-action tuples: $X := (S_0, \omega_0, A_0, S_1, \omega_1, A_1, \dots)$. We use $\mathcal{X}$ to denote the set of all paths. We introduce the function $\bar{R}: \mathcal{X} \to \mathbb{R}$, the expected return over a path, where $\bar{R}(x)$ is the expected return given the path $x$. The goal in a reinforcement learning problem, in the context of the option-critic architecture, is to maximize the expected discounted return, $\rho(\Omega, \theta, \vartheta, s_0, \omega_0)$. The goal can be re-written as maximizing $J(\theta) := \sum_{x \in \mathcal{X}} \Pr(x;\theta)\, \bar{R}(x)$, where the summation is over all paths $x$ starting from $(s_0, \omega_0)$ and the intra-option policies are parametrized by $\theta$. To optimize the objective $J(\theta)$, we define it over a Riemannian space with metric tensor $G(\theta)$. In the Riemannian space the inner product is defined as in (4). The direction of steepest ascent of $J(\theta)$ in the Riemannian space, $G(\theta)^{-1} \nabla_\theta J(\theta)$, is given by the natural gradient (?) (see equation (6)).

In this section we use $p(x;\theta)$ to denote $\Pr(X = x; \theta)$ and use $\mathbb{E}_\theta[\cdot]$ to indicate the expected value with respect to the distribution $p(\cdot;\theta)$. We obtain an alternative form of the Fisher information matrix, which is a well-known result (?) (for details see the appendix):

$G(\theta) = -\mathbb{E}_\theta\!\left[ \dfrac{\partial^2 \log p(X;\theta)}{\partial \theta_i\, \partial \theta_j} \right]_{i,j}.$   (7)
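As a quick sanity check of equation (7) (an illustration, not from the paper), the score form and the curvature form of the Fisher information coincide for a one-parameter Bernoulli family with a sigmoid link:

```python
import numpy as np

theta = 0.3
sigma = 1.0 / (1.0 + np.exp(-theta))     # Bernoulli(p = sigmoid(theta)) as a one-parameter family

# Score form: G = E[(d log p(X)/d theta)^2], with d log p(x)/d theta = x - sigma.
G_score = sigma * (1 - sigma) ** 2 + (1 - sigma) * sigma ** 2      # = sigma * (1 - sigma)

# Curvature form: G = -E[d^2 log p(X)/d theta^2] = sigma * (1 - sigma), independent of x.
G_curvature = sigma * (1 - sigma)

print(G_score, G_curvature)   # the two forms agree
```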

Fisher Information Matrix Over Intra-Option Path Manifold

In Theorem 1 we show that the Fisher information matrix over the paths in $\mathcal{X}$, truncated to terminate at time step $n$, converges as $n \to \infty$ to the Fisher information matrix over the intra-option policies. This gives an expression for the Fisher information matrix over the set of paths, $\mathcal{X}$, and simplifies computation of the natural gradient when maximizing the objective $J(\theta)$. We use $G_n(\theta)$ to indicate the $n$-step finite horizon Fisher information matrix, meaning the Fisher information matrix if the problem were reduced to terminate at step $n$. We normalize the metric by the total length of the path (?) to get a convergent metric.

Theorem 1 (Infinite Horizon Intra-Option Matrix).

Let $G_n(\theta)$ be the $n$-step finite horizon Fisher information matrix and let $G_\pi(\theta)$ be the Fisher information matrix of the intra-option policies under the stationary distribution $d_\Omega(s,\omega)$ of states and options: $G_\pi(\theta) := \mathbb{E}\!\left[ \frac{\partial \log \pi_{\omega,\theta}(A \mid S)}{\partial \theta_i}\, \frac{\partial \log \pi_{\omega,\theta}(A \mid S)}{\partial \theta_j} \right]_{i,j}$, where $(S,\omega) \sim d_\Omega$ and $A \sim \pi_{\omega,\theta}(\cdot \mid S)$. Then: $\lim_{n \to \infty} \tfrac{1}{n}\, G_n(\theta) = G_\pi(\theta)$.

Proof.

See the appendix (supplementary materials). ∎

Compatible Function Approximation For Intra-Option Path Manifold

We subtract the option-value function, $Q_\Omega(s,\omega)$, from the state-option-action value function, $Q_U(s,\omega,a)$, and treat it as a baseline to reduce the variance of the gradient estimate of the expected discounted return. The baseline can be a function of both state and action in special circumstances, but none of those apply here (?). So, we define the state-option-action advantage function $A_U(s,\omega,a) := Q_U(s,\omega,a) - Q_\Omega(s,\omega)$, where $A_U(s,\omega,a)$ is the advantage of the agent taking action $a$ in state $s$ in the context of option $\omega$. Here, $A_U$ is approximated by a compatible function approximator $f_w$. For a vector $w$ and parameters $\theta$ we define:

$f_w(s,\omega,a) := w^\top \dfrac{\partial \log \pi_{\omega,\theta}(a \mid s)}{\partial \theta}.$   (8)

The $w^*$ that is a local minimum of the squared error $\epsilon(w,\theta)$, defined as the expected squared difference between $f_w(s,\omega,a)$ and $A_U(s,\omega,a)$ under the discounted weighting of state-option pairs, is equal to the natural gradient of the objective, $J(\theta)$, with respect to $\theta$ (the complete derivation is in the appendix):

$w^* = G(\theta)^{-1}\, \nabla_\theta J(\theta).$

Thus, for a sensible (?) function approximation, as in (8), the natural gradient of the expected discounted return in the option-critic framework is given by the weights of the linear function approximation.
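The claim that the minimizer of the compatible-approximation squared error equals the natural gradient can be checked numerically. The sketch below does so for a single state-option pair with a softmax intra-option policy over four actions; the action values and the use of a pseudo-inverse (the softmax Fisher matrix is rank-deficient) are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
theta = rng.normal(size=4)                 # softmax logits standing in for pi_{omega,theta} at one (s, omega)
q = rng.normal(size=4)                     # assumed action values Q_U(s, omega, .)

def softmax(t):
    z = np.exp(t - t.max()); return z / z.sum()

p = softmax(theta)
scores = np.eye(4) - p                     # row a: d log pi(a)/d theta for the softmax
adv = q - p @ q                            # advantages A_U = Q_U - Q_Omega (mean value as baseline)

# w* minimizing the weighted squared error  sum_a p(a) (w^T score_a - adv_a)^2 .
F = sum(p[a] * np.outer(scores[a], scores[a]) for a in range(4))   # Fisher matrix of the policy
b = sum(p[a] * scores[a] * adv[a] for a in range(4))
w_star = np.linalg.pinv(F) @ b

# The vanilla gradient of J(theta) = sum_a p(a) q(a), and the natural gradient F^+ grad_J.
grad_J = sum(p[a] * scores[a] * q[a] for a in range(4))
print(np.allclose(w_star, np.linalg.pinv(F) @ grad_J))   # True: w* coincides with the natural gradient
```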

Start State Fisher Information Matrix Over State-Option Transition Path Manifold

We derive the Fisher information matrix for the parameters $\vartheta$ over the state-option transition path manifold. We define $Z$ as a path of state-option transitions in the option-critic architecture. More specifically, we define $Z := \big((S_1, \omega_0), (S_2, \omega_1), \dots\big)$ to be the path of state-option pairs shifted by one time step. We define $\mathcal{Z}$ to be the set of all state-option transition paths. Similar to the previous section, we define the expected return over state-option transitions, $\bar{R}(z)$, where $\bar{R}(z)$ is the expected return given the state-option transition path $z$. The goal can be re-written as maximizing $J(\vartheta) := \sum_{z \in \mathcal{Z}} \Pr(z;\vartheta)\, \bar{R}(z)$, where the summation is over all paths $z$ starting from $(s_1, \omega_0)$ and the terminations are parametrized by $\vartheta$. To optimize $J(\vartheta)$ we define it over a Riemannian space with metric tensor $G(\vartheta)$ and the inner product defined as in (4), similar to the previous section. The direction of steepest ascent in the Riemannian space, $G(\vartheta)^{-1} \nabla_\vartheta J(\vartheta)$, is the natural gradient.

In this section, we use $p(z;\vartheta)$ to denote $\Pr(Z = z; \vartheta)$ and use $\mathbb{E}_\vartheta[\cdot]$ to indicate the expected value with respect to the distribution $p(\cdot;\vartheta)$. Equation (7) implies that the Fisher information matrix can be written as:

$G(\vartheta) = -\mathbb{E}_\vartheta\!\left[ \dfrac{\partial^2 \log p(Z;\vartheta)}{\partial \vartheta_i\, \partial \vartheta_j} \right]_{i,j}.$

Fisher Information Matrix Over State-Option Transition Path Manifold

In Theorem 2 we show that the Fisher information matrix over the paths in $\mathcal{Z}$, truncated to terminate at time step $n$, converges as $n \to \infty$ to an expression in terms of the terminations and the policy over options under the stationary distribution of states and options. This gives an expression for the Fisher information matrix over the set of paths, $\mathcal{Z}$, and simplifies computation of the natural gradient when maximizing the objective $J(\vartheta)$.

Theorem 2 (Infinite Horizon State-Option Transition Matrix).

Let $G_n(\vartheta)$ be the $n$-step finite horizon Fisher information matrix and let $d_\Omega(s',\omega)$ be the stationary distribution of state-option pairs. Then $\lim_{n \to \infty} \tfrac{1}{n}\, G_n(\vartheta)$ equals the Fisher information matrix of the one-step state-option transition probabilities, which are determined by the terminations $\beta_{\omega,\vartheta}$ and the policy over options $\pi_\Omega$, taken under $d_\Omega$.

Proof.

See appendix (supplementary materials). ∎

Compatible Function Approximation For State-Option Transition Path Manifold

We define the advantage function of the continued option, $A_c(s',\omega)$: the advantage of option $\omega$ remaining active while the agent exits state $s'$, given that option $\omega$ is active when the agent enters $s'$. We consider termination improvement when $A_c$ is approximated by a compatible function approximator $g_v$. For a vector $v$ and parameters $\vartheta$ we define:

(9)

We define the squared error, $\epsilon(v,\vartheta)$, associated with the vector $v$ as the expected squared difference between $g_v$ and the advantage of the continued option, weighted by the likelihood ratio of option $\omega$ being active while exiting $s'$ given that option $\omega$ is active when the agent enters $s'$. We assume, throughout the paper, that the denominator of this likelihood ratio is not $0$. The $v^*$ that is a local minimum of $\epsilon(v,\vartheta)$ satisfies (the complete derivation is in the appendix):

$-v^* = G(\vartheta)^{-1}\, \nabla_\vartheta J(\vartheta).$

Therefore, for an approximation of the advantage of the continued option, as in (9), the natural gradient of the expected discounted return is the negative of the weights of the linear function approximation.

Incremental Natural Option Critic Algorithm

We introduce an algorithm inspired by the incremental natural actor-critic (INAC) algorithm introduced by ? (?), who in turn built on the theoretical work of ? (?). The algorithm learns the parameters of the approximations of the state-option-action advantage function, $f_w$, and the advantage function of the continued option, $g_v$, incrementally, by taking steps in the direction that reduces the errors $\epsilon(w,\theta)$ and $\epsilon(v,\vartheta)$. It performs stochastic gradient descent using the gradients $\frac{\partial \epsilon(w,\theta)}{\partial w}$ and $\frac{\partial \epsilon(v,\vartheta)}{\partial v}$. Learning the parameters $w$ and $v$ leads to natural gradient based updates for $\theta$ and $\vartheta$. We introduce the hyperparameters $\alpha_w$, $\alpha_v$, and $\lambda$, which are the learning rate for $w$, the learning rate for $v$, and the eligibility trace parameter of both $w$ and $v$, respectively. The algorithm learns the policy over options, $\pi_\Omega$, using intra-option Q-learning (?) as in previous work (Bacon, Harb, and Precup 2017).

The algorithm uses TD-error style updates to learn $w$ and $v$. Analogous to the consistent estimates used by ? (?), we say that a consistent estimate of the option-value function, $\hat{Q}_\Omega$, satisfies $\mathbb{E}[\hat{Q}_\Omega(s,\omega)] = Q_\Omega(s,\omega)$. Similarly, a consistent estimate of the option-value function upon arrival, $\hat{U}$, satisfies $\mathbb{E}[\hat{U}(\omega, s')] = U(\omega, s')$. We define the TD error for the intra-option policies at time step $t$ to be $\delta_t := R_t + \gamma \hat{U}(\omega_t, S_{t+1}) - \hat{Q}_\Omega(S_t, \omega_t)$.

A consistent estimate of the state value function, $\hat{V}_\Omega$, satisfies $\mathbb{E}[\hat{V}_\Omega(s)] = V_\Omega(s)$. We define the TD error at time step $t$ for the terminations, $\bar{\delta}_t$, analogously, in terms of $\hat{V}_\Omega$ and $\hat{Q}_\Omega$. We provide Lemmas 1 and 2 to show that $\delta_t$ and $\bar{\delta}_t$ are consistent estimates of $A_U$ and $A_c$, respectively.
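The intra-option TD error defined above can be computed from tabular critic estimates as in the following sketch; the critic values, termination probabilities, and the crude stand-in for $V_\Omega$ are assumptions for illustration.

```python
import numpy as np

def intra_option_td_error(r, s, omega, s_next, q_omega, v_omega, beta, gamma=0.99):
    """One common form of the intra-option TD error, using the value upon arrival
    U(omega, s') = (1 - beta(s', omega)) * Q_Omega(s', omega) + beta(s', omega) * V_Omega(s').

    q_omega[s, omega] and v_omega[s] are critic estimates; beta[s, omega] are termination probs.
    """
    u = (1.0 - beta[s_next, omega]) * q_omega[s_next, omega] + beta[s_next, omega] * v_omega[s_next]
    return r + gamma * u - q_omega[s, omega]

# Illustrative tabular critic estimates (all numbers are assumptions).
q_omega = np.array([[0.5, 0.2], [1.0, 0.8]])
v_omega = q_omega.max(axis=1)                 # a crude stand-in for V_Omega
beta = np.array([[0.1, 0.9], [0.5, 0.5]])
print(intra_option_td_error(r=1.0, s=0, omega=0, s_next=1,
                            q_omega=q_omega, v_omega=v_omega, beta=beta))
```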

Lemma 1.

Given intra-option policies $\pi_{\omega,\theta}$ for all $\omega \in \Omega$, a policy over options $\pi_\Omega$, and terminations $\beta_{\omega,\vartheta}$ for all $\omega \in \Omega$, the TD error $\delta_t$ is a consistent estimate of the state-option-action advantage $A_U(S_t, \omega_t, A_t)$.

Lemma 2.

Under the precondition that the active option is unchanged, and given intra-option policies $\pi_{\omega,\theta}$ for all $\omega \in \Omega$, a policy over options $\pi_\Omega$, and terminations $\beta_{\omega,\vartheta}$ for all $\omega \in \Omega$, the TD error $\bar{\delta}_t$ is a consistent estimate of the advantage of the continued option, $A_c$.

The proofs are in the appendix (supplementary materials). Using these lemmas and theorems we introduce Algorithm 1 (INOC). We provide details on how we arrive at the updates to the parameters $\theta$ and $\vartheta$ in the appendix. The precondition in Lemma 2 might lead to fewer updates to the parameters of the terminations. The option evaluation part of the algorithm is the same as in previous work (?).

1:  Initialize $s$, and choose $\omega$ using $\pi_\Omega(s)$.
2:  while Not in terminal state do
3:     Select action $a$ according to $\pi_{\omega,\theta}(\cdot \mid s)$
4:     Take action $a$, observe $s'$ and $r$
5:     
6:     
7:     
8:     
9:     
10:     if $\omega$ is the same as the option active at the previous time step then
11:        
12:        
13:        
14:        
15:        
16:     end if
17:     if $\omega$ should terminate in $s'$ according to $\beta_{\omega,\vartheta}(s')$ then
18:        Choose a new $\omega$ according to $\pi_\Omega(s')$ and reset the traces
19:     end if
20:  end while
Algorithm 1 Incremental Natural Option-Critic Algorithm (INOC)
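The following is a heavily hedged sketch of what one INAC-style incremental step for the intra-option parameters can look like: the compatible weight vector $w$ is nudged toward the TD-error estimate of the advantage using an eligibility trace, and $\theta$ then moves in the direction of $w$, which serves as the natural gradient estimate. The exact update equations of INOC are those of Algorithm 1; the form below follows the general incremental natural actor-critic pattern and is an assumption, not a transcription of the algorithm.

```python
import numpy as np

def inac_style_step(theta, w, e, grad_log_pi, delta,
                    alpha_theta=0.01, alpha_w=0.1, gamma=0.99, lam=0.9):
    """One INAC-style incremental step for the intra-option policy parameters.

    e            : eligibility trace over the compatible features
    grad_log_pi  : d log pi_{omega,theta}(a|s) / d theta at the sampled (s, omega, a)
    delta        : a TD-error estimate of the advantage A_U(s, omega, a)
    """
    e = gamma * lam * e + grad_log_pi                    # accumulate the trace
    w = w + alpha_w * (delta - grad_log_pi @ w) * e      # move f_w toward the advantage estimate
    theta = theta + alpha_theta * w                      # natural-gradient step: the direction is w itself
    return theta, w, e

theta, w, e = np.zeros(6), np.zeros(6), np.zeros(6)
theta, w, e = inac_style_step(theta, w, e,
                              grad_log_pi=np.array([0.2, -0.1, 0.0, 0.1, 0.0, -0.2]),
                              delta=0.7)
print(theta, w)
```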

Experiments

We examine the performance of the natural option critic, and compare it to the option critic, in three types of domains: a simple 2-state MDP, a domain with linear state representations, and a domain with neural network state representations. In all cases we use sigmoid terminations and linear-softmax intra-option policies, as in previous work (?).

Figure 1: Simple deterministic MDP of two states and two actions
Figure 2: Average reward for INOC reaches the maximum, while that of OC is stuck in a plateau. Results averaged over 200 runs of 2000 episodes.

MDP Setup: We design an MDP to demonstrate the uniform weighting of the components of the natural termination gradient, as opposed to weighting by $\mu_\Omega$. Note that the effectiveness of the natural policy gradient has been demonstrated sufficiently in past work (? ?; ? ?; ? ?). We define a simple 2-state MDP as in Figure 1; the initial state distribution and the deterministic transitions are shown there. The rewards for the self-loops in $s_1$ and $s_2$ are 1 and 2, respectively. The episode terminates after 30 steps. We use an $\epsilon$-greedy policy over options, $\pi_\Omega$.

We consider a scenario with two options, $\omega_1$ and $\omega_2$, which assign probability 0.9 to actions $a_1$ and $a_2$, respectively, regardless of the state. This gives us options as abstractions over the individual actions. We initialize the terminations, $\beta$, and the option-value function, $Q_\Omega$, such that they are biased towards the greedy self-loop action, $a_1$, in state $s_1$ via the selection of option $\omega_1$. This presents the agent with the challenge of learning the better behavior of transitioning to state $s_2$, despite the higher initial probability of, and the self-loop reward for, staying in $s_1$. We set the learning rate for the intra-option policies, $\alpha_\theta$, to be negligible, as our goal is to demonstrate the efficacy of the natural termination gradient.
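A sketch of this kind of two-state setup is given below; the exact initial distribution, $\epsilon$, termination initialization, and biasing constants used in the paper are not reproduced here, so the values in the sketch are placeholders.

```python
import numpy as np

rng = np.random.default_rng(4)
gamma = 0.99

# Two states, two actions; action 0 is the self-loop, action 1 switches state (placeholder construction).
P = np.zeros((2, 2, 2))
P[0, 0, 0] = P[1, 0, 1] = 1.0     # self-loops
P[0, 1, 1] = P[1, 1, 0] = 1.0     # switch
R = np.array([[1.0, 0.0],          # self-loop in state 0 pays 1
              [2.0, 0.0]])         # self-loop in state 1 pays 2

# Two options acting as abstractions over the primitive actions.
pi_options = np.array([[0.9, 0.1],   # option 0 prefers action 0
                       [0.1, 0.9]])  # option 1 prefers action 1

def episode_return(beta, horizon=30, eps=0.1, q_over_options=None):
    """Roll out one 30-step episode with an eps-greedy policy over options."""
    q_over_options = np.zeros((2, 2)) if q_over_options is None else q_over_options
    s = rng.integers(2)
    omega = rng.integers(2)
    G = 0.0
    for t in range(horizon):
        a = rng.choice(2, p=pi_options[omega])
        G += (gamma ** t) * R[s, a]
        s = rng.choice(2, p=P[s, a])
        if rng.random() < beta[s, omega]:                       # option terminates in the new state
            omega = (rng.integers(2) if rng.random() < eps
                     else int(np.argmax(q_over_options[s])))    # eps-greedy over options
    return G

beta = np.full((2, 2), 0.5)   # placeholder termination probabilities
print(episode_return(beta))
```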

As can be seen from Figure 2, the natural option critic overcomes the plateau and converges to the optimal average reward much faster than the option critic. The option critic initially remains stuck on the greedy self-loop action; this is due to the weighting by $\mu_\Omega$. The natural option critic, in contrast, begins learning early on and achieves the optimal average reward.

Four Rooms: The four rooms domain (?) is a particularly favorable case for demonstrating the use of options. We use the same number of options, 4, as in previous work (?). The result (Figure 3) indicates that the natural option critic converges faster.

Arcade Learning Environment:

Figure 3: Four rooms, with critic LR 0.5, averaged over 350 runs
(a) Asterisk (b) Seaquest (c) Zaxxon
Figure 4: Moving average of 10 returns for a single trial in the Arcade Learning Environment

We compare the natural option critic with the option-critic framework on the Arcade Learning Environment (?). To showcase the improvement over the option-critic architecture, we use the same configuration for all layers as in previous work (?), which in turn uses the same configuration as the first 3 convolutional layers of the network introduced by ? (?). The critic network was trained, as in previous work (?), using experience replay (?) and RMSProp.

As in previous work (?), we apply the regularizer prescribed by ? (?) to penalize low-entropy policies. We use an on-policy estimate of the policy over options, $\pi_\Omega$, which is used in the computation of the natural gradient with respect to the termination parameters.

We compare the two approaches, option critic and natural option critic, by evaluating them on the games Asterisk, Seaquest, and Zaxxon (?). For comparison we train over the same number of frames per epoch as ? (?), run the same number of trials, and use the same number of options: 8. The results are shown in Figure 4. More importantly, we use the same hyperparameters for the learning rates and entropy regularization as in previous work, to merit a fair comparison. We obtain improvements over the option-critic architecture (OC) for Asterisk and Zaxxon. We also note that we were unable to reproduce the option-critic results for Seaquest, but given the same set of hyperparameters we observe that the option critic performs better there. We explain the issue with the termination updates, and its effect on the return, for Seaquest in the appendix.

For Zaxxon and Asterisk we see that NOC breaks out of the plateau much earlier than the option critic. Note that the value network, which approximates the option values, is learned using the vanilla gradient.

Discussion

We have introduced a natural gradient based approach for learning intra-option policies and terminations within the option-critic framework, with per-time-step complexity linear in the number of parameters. More importantly, we have furnished instructive proofs deriving the Fisher information matrices over path manifolds, along with a corresponding compatible function approximation approach based on minimizing mean squared errors. We have also introduced an algorithm that uses consistent estimates of the advantage functions and learns the natural gradients by learning the coefficients of the corresponding linear function approximators. The results showcase performance improvements over previous work. The proofs for the finite horizon metrics are very similar to the ones provided by ? (?). We also demonstrate the effectiveness of the natural option critic in three distinct domains.

As discussed by ? (?), we could obtain a truly unbiased estimate for our updates, but doing so may not be practical (?). The limitations that apply to the option-critic framework, other than the use of the vanilla gradient, also apply here. We use a block-diagonal estimate of the Fisher information matrix. The complete Fisher information matrix for the option-critic framework over the path manifolds is:

$G(\theta,\vartheta) = \begin{pmatrix} G(\theta) & \mathbb{E}\!\left[\dfrac{\partial \log p(X)}{\partial \theta}\, \dfrac{\partial \log p(X)}{\partial \vartheta^\top}\right] \\[2ex] \mathbb{E}\!\left[\dfrac{\partial \log p(X)}{\partial \vartheta}\, \dfrac{\partial \log p(X)}{\partial \theta^\top}\right] & G(\vartheta) \end{pmatrix},$

where $G(\theta)$ and $G(\vartheta)$ are the Fisher information matrices for the intra-option path manifold and the state-option transition manifold, respectively, and the random variable $X$ is the path variable over state-option-action tuples. Computing the complete Fisher information matrix and its inverse is expensive, and a compatible function approximation based approach would be needed to obtain a natural gradient estimate with space complexity linear in the number of parameters.

Although our approach has added benefits, it is limited by fewer updates to the termination parameters. Further work is required to develop better estimates of the advantage functions. More experimental work, e.g., applications to other domains, can further help us understand the efficacy of natural gradients in the context of the option-critic framework.

References

  • [Amari 1967] Amari, S. 1967. A theory of adaptive pattern classifiers. IEEE Trans. Electronic Computers 16:299–307.
  • [Amari 1985] Amari, S. 1985. Differential-geometrical methods in statistics. In Lecture Notes in Statistics 28. Springer-Verlag.
  • [Amari 1998] Amari, S.-I. 1998. Natural gradient works efficiently in learning. Neural Comput. 10(2):251–276.
  • [Amari 2016] Amari, S.-i. 2016. Information Geometry and Its Applications. Springer.
  • [Bacon, Harb, and Precup 2017] Bacon, P.-L.; Harb, J.; and Precup, D. 2017. The option-critic architecture. In AAAI.
  • [Bagnell and Schneider 2003] Bagnell, J. A., and Schneider, J. 2003. Covariant policy search. IJCAI.
  • [Bellemare et al. 2013] Bellemare, M. G.; Naddaf, Y.; Veness, J.; and Bowling, M. H. 2013. The arcade learning environment: An evaluation platform for general agents. J. Artif. Intell. Res. 47:253–279.
  • [Bhatnagar et al. 2007] Bhatnagar, S.; Sutton, R. S.; Ghavamzadeh, M.; and Lee, M. 2007. Incremental natural actor-critic algorithms. In Proceedings of the 20th International Conference on Neural Information Processing Systems, NIPS’07, 105–112. USA: Curran Associates Inc.
  • [Bhatnagar et al. 2009] Bhatnagar, S.; Sutton, R. S.; Ghavamzadeh, M.; and Lee, M. 2009. Natural actor-critic algorithms. Automatica 45(11):2471–2482.
  • [Degris, Pilarski, and Sutton 2012] Degris, T.; Pilarski, P. M.; and Sutton, R. S. 2012. Model-free reinforcement learning with continuous action in practice.
  • [DeGroot 1970] DeGroot, M. 1970. Optimal Statistical Decisions. Wiley Classics Library. Wiley.
  • [Desjardins et al. 2015] Desjardins, G.; Simonyan, K.; Pascanu, R.; et al. 2015. Natural neural networks. In Advances in Neural Information Processing Systems, 2071–2079.
  • [Dietterich 2000] Dietterich, T. G. 2000. Hierarchical reinforcement learning with the maxq value function decomposition. J. Artif. Intell. Res.(JAIR) 13(1):227–303.
  • [Kakade 2001] Kakade, S. 2001. A natural policy gradient. In Dietterich, T. G.; Becker, S.; and Ghahramani, Z., eds., Advances in Neural Information Processing Systems 14 (NIPS 2001), 1531–1538. MIT Press.
  • [Konda and Tsitsiklis 2000] Konda, V. R., and Tsitsiklis, J. N. 2000. Actor-critic algorithms. NIPS’2000, 1008–1014.
  • [Konidaris and Barto 2009] Konidaris, G., and Barto, A. G. 2009. Skill discovery in continuous reinforcement learning domains using skill chaining. In Bengio, Y.; Schuurmans, D.; Lafferty, J. D.; Williams, C. K. I.; and Culotta, A., eds., Advances in Neural Information Processing Systems 22. Curran Associates, Inc. 1015–1023.
  • [Kurita 1992] Kurita, T. 1992. Iterative weighted least squares algorithms for neural networks classifiers. New Generation Computing 12:375–394.
  • [Machado, Bellemare, and Bowling 2017] Machado, M. C.; Bellemare, M. G.; and Bowling, M. 2017. A Laplacian framework for option discovery in reinforcement learning. In Precup, D., and Teh, Y. W., eds., Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, 2295–2304. International Convention Centre, Sydney, Australia: PMLR.
  • [Martens and Grosse 2015] Martens, J., and Grosse, R. B. 2015. Optimizing neural networks with kronecker-factored approximate curvature. In ICML.
  • [Martens 2010] Martens, J. 2010. Deep learning via hessian-free optimization. In Proceedings of the 27th International Conference on International Conference on Machine Learning, ICML’10, 735–742. USA: Omnipress.
  • [Mnih et al. 2013] Mnih, V.; Kavukcuoglu, K.; Silver, D.; Graves, A.; Antonoglou, I.; Wierstra, D.; and Riedmiller, M. A. 2013. Playing atari with deep reinforcement learning. CoRR abs/1312.5602.
  • [Mnih et al. 2016] Mnih, V.; Badia, A. P.; Mirza, M.; Graves, A.; Lillicrap, T. P.; Harley, T.; Silver, D.; and Kavukcuoglu, K. 2016. Asynchronous methods for deep reinforcement learning. In ICML.
  • [Morimura, Uchibe, and Kenji 2005] Morimura, T.; Uchibe, E.; and Kenji, D. 2005. Utilizing the natural gradient in temporal difference reinforcement learning with eligibility traces. 0–0.
  • [Parr and Russell 1998] Parr, R., and Russell, S. J. 1998. Reinforcement learning with hierarchies of machines. In Advances in neural information processing systems, 1043–1049.
  • [Pascanu and Bengio 2013] Pascanu, R., and Bengio, Y. 2013. Revisiting natural gradient for deep networks.
  • [Peters and Schaal 2008] Peters, J., and Schaal, S. 2008. Natural actor-critic. Neurocomputing 71:1180–1190.
  • [Rao 1945] Rao, C. R. 1945. Information and accuracy attainable in the estimation of statistical parameters. In Bulletin of the Calcutta Mathematical Society. 81–91.
  • [Roux, Manzagol, and Bengio 2008] Roux, N. L.; Manzagol, P.; and Bengio, Y. 2008. Topmoumoute online natural gradient algorithm. In Platt, J. C.; Koller, D.; Singer, Y.; and Roweis, S. T., eds., Advances in Neural Information Processing Systems 20. Curran Associates, Inc. 849–856.
  • [Simsek and Barto 2008] Simsek, Ö., and Barto, A. G. 2008. Skill characterization based on betweenness. In NIPS.
  • [Stein and Shakarchi 2009] Stein, E., and Shakarchi, R. 2009. Real Analysis: Measure Theory, Integration, and Hilbert Spaces. Princeton University Press.
  • [Sun and Nielsen 2017] Sun, K., and Nielsen, F. 2017. Relative fisher information and natural gradient for learning large modular models. In ICML.
  • [Sutton et al. 1999] Sutton, R. S.; McAllester, D.; Singh, S.; and Mansour, Y. 1999. Policy gradient methods for reinforcement learning with function approximation. In Proceedings of the 12th International Conference on Neural Information Processing Systems, NIPS’99, 1057–1063.
  • [Sutton, Precup, and Singh 1999] Sutton, R. S.; Precup, D.; and Singh, S. P. 1999. Between mdps and semi-mdps: A framework for temporal abstraction in reinforcement learning. Artif. Intell. 112:181–211.
  • [Thomas and Brunskill 2017] Thomas, P. S., and Brunskill, E. 2017. Policy gradient methods for reinforcement learning with function approximation and action-dependent baselines. CoRR abs/1706.06643.
  • [Thomas et al. 2016] Thomas, P.; Silva, B. C.; Dann, C.; and Brunskill, E. 2016. Energetic natural gradient descent. In Balcan, M. F., and Weinberger, K. Q., eds., Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, 2887–2895. New York, New York, USA: PMLR.
  • [Thomas, Dann, and Brunskill 2018] Thomas, P. S.; Dann, C.; and Brunskill, E. 2018. Decoupling learning rules from representations. In ICML.
  • [Thomas 2011] Thomas, P. S. 2011. Policy gradient coagent networks. In Shawe-Taylor, J.; Zemel, R. S.; Bartlett, P. L.; Pereira, F.; and Weinberger, K. Q., eds., Advances in Neural Information Processing Systems 24. Curran Associates, Inc. 1944–1952.
  • [Thomas 2014] Thomas, P. 2014. Bias in natural actor-critic algorithms. In ICML.
  • [Thrun and Schwartz 1995] Thrun, S., and Schwartz, A. 1995. Finding structure in reinforcement learning. In Tesauro, G.; Touretzky, D. S.; and Leen, T. K., eds., Advances in Neural Information Processing Systems 7. MIT Press. 385–392.

Appendix

Here we provide proofs for the theorems and lemmas presented in the body of the paper, and derivations of the estimates of the natural gradient. Although these proofs appear in the appendix due to space constraints, they are among our major contributions.

Alternate Form Of The Fisher Information Matrix

We derive the following result, which is the same as that of ? (?) with the meanings of the symbols changed, for the Fisher information matrix under appropriate regularity conditions on $p(x;\theta)$:

(10)
(11)
(12)
(13)
(14)
(15)

The first equality follows from the definition of the Fisher information matrix. The third equality follows from integration by parts. The last equality is a result of the sum of probabilities being constant, i.e., $\sum_{x \in \mathcal{X}} p(x;\theta) = 1$. The matrix is positive semi-definite (?) and the derivations resulting from this expression inherit this property.

Proof Of Infinite Horizon Intra-Option Matrix

Theorem (Infinite Horizon Intra-Option Matrix).

Let $G_n(\theta)$ be the $n$-step finite horizon Fisher information matrix and let $G_\pi(\theta)$ be the Fisher information matrix of the intra-option policies under the stationary distribution $d_\Omega(s,\omega)$ of states and options: $G_\pi(\theta) := \mathbb{E}\!\left[ \frac{\partial \log \pi_{\omega,\theta}(A \mid S)}{\partial \theta_i}\, \frac{\partial \log \pi_{\omega,\theta}(A \mid S)}{\partial \theta_j} \right]_{i,j}$, where $(S,\omega) \sim d_\Omega$ and $A \sim \pi_{\omega,\theta}(\cdot \mid S)$. Then: $\lim_{n \to \infty} \tfrac{1}{n}\, G_n(\theta) = G_\pi(\theta)$.

Proof.

$G_n(\theta)$ is the $n$-step finite horizon Fisher information matrix.

(16)
(17)

The process represented by the path $X$ is Markovian, meaning that the distribution of the state-option-action tuple at time $t+1$ depends only on the tuple at time $t$. This leads to the following result for the likelihood of a path, similar to the simple form of the path probability metric presented by ? (?):

(18)
(19)
(20)
(21)

A reinforcement learning problem discounted with a discount factor $\gamma$ is equivalent to an undiscounted problem in which the MDP terminates with probability $1-\gamma$ at each time step. We use this formulation of the problem to derive the results that follow. Applying the chain rule to (17), and using $d$ to denote the stationary start state distribution, we obtain:

(22)
(23)
(24)
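As a side note on the equivalence invoked above (an illustration, not part of the derivation): the discounted value of a constant unit reward stream, $1/(1-\gamma)$, matches the expected undiscounted return of a process that instead terminates with probability $1-\gamma$ at each step.

```python
import numpy as np

rng = np.random.default_rng(5)
gamma, r = 0.9, 1.0

discounted_value = r / (1.0 - gamma)   # sum_t gamma^t * r

def undiscounted_until_termination():
    total, alive = 0.0, True
    while alive:
        total += r
        alive = rng.random() < gamma   # continue with probability gamma, terminate with 1 - gamma
    return total

estimate = np.mean([undiscounted_until_termination() for _ in range(20000)])
print(discounted_value, estimate)      # the two values should be close
```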