A Function Approximation Method for Model-based High-Dimensional Inverse Reinforcement Learning
Abstract
This work addresses the inverse reinforcement learning problem in high-dimensional state spaces, which relies on an efficient solution of model-based high-dimensional reinforcement learning problems. To solve the computationally expensive reinforcement learning problems, we propose a function approximation method that ensures the Bellman Optimality Equation always holds, and then estimate a function based on the observed human actions for inverse reinforcement learning problems. The time complexity of the proposed method is linearly proportional to the cardinality of the action set, so it can handle high-dimensional and even continuous state spaces efficiently. We test the proposed method in a simulated environment to show its accuracy, and on three clinical tasks to show how it can be used to evaluate a doctor’s proficiency.
I Introduction
Recently, surgical robots, like the Da Vinci Surgical System, have been applied to many tasks due to their reliability and accuracy. In these systems, a doctor operates the robot manipulator remotely and receives visual feedback during a surgery. With a sophisticated control system and high-resolution images, the surgery can be done with higher precision and fewer accidents. However, this requires the doctor to concentrate on robot operations and visual feedback during the whole surgery, which may lead to fatigue and errors.
To solve the problem, some level of automation can be introduced, considering that many surgeries contain repeating atomic operations. For example, knot tying is a typical procedure after many surgeries, as shown in Figure 1, and it can be decomposed into a sequence of pre-trained standard operations for the robot. The automation can also be used to avoid possible mistakes committed by an inexperienced doctor during a surgery: an alarm signal can be triggered when the doctor takes an unusual action, and the number of alarm signals can be used to evaluate the doctor as well.
The core of the automation system is a control policy, which predicts the action to take in each state for typical surgical robots. The control policy can be defined manually, but this is difficult because of the large number of states that may occur during a surgery. Another solution is to estimate the policy by solving a Markov decision process, but this needs an accurate reward function, which depends on too many factors to be defined manually.
An alternative solution is learning the control policy from experts’ demonstrations through imitation learning. Many algorithms try to learn the policy from the state-action pairs directly in a supervised way, but the learned policy usually does not indicate how good a state-action pair is, which is useful for online evaluation of a doctor’s actions. This problem can be solved by inverse reinforcement learning algorithms, which learn a reward function from the observed demonstrations, so that the optimality of a control policy can be estimated based on the reward function.
Existing solutions of the inverse reinforcement learning problem mainly work on small-scale problems, by collecting a set of observations for reward estimation and using the estimated reward afterwards. For example, the methods in [2, 3, 4] estimate the agent’s policy from a set of observations, and estimate a reward function that leads to the policy. The method in [5] collects a set of trajectories of the agent, and estimates a reward function that maximizes the likelihood of the trajectories. This strategy works for applications in small state spaces. However, the state space of sensory feedback is huge for surgical evaluation, and these methods cannot handle it well because a reinforcement learning problem must be solved in each iteration of reward estimation.
Some existing methods can be scaled to high-dimensional state spaces and solve the problem without learning the transition model. While they improve the learning efficiency, they cannot utilize unsupervised data or data from the demonstrations of non-experts. Such data cannot be used to learn the reward function, but they provide information about the environment dynamics.
In this work, we find that inverse reinforcement learning in high-dimensional spaces can be simplified under the condition that the transition model and the set of actions remain unchanged for the subject, in which case each reward function leads to a unique optimal value function. Based on this assumption, we propose a function approximation method that learns the reward function and the optimal value function without the computationally expensive reinforcement learning steps, so it can be scaled to high-dimensional state spaces. This method can also solve model-based high-dimensional reinforcement learning problems, although that is not our main focus.
The paper is organized as follows. We review existing work on inverse reinforcement learning in Section II, and formulate the function approximation inverse reinforcement learning method for high-dimensional problems in Section III. A simulated experiment and a clinical experiment are shown in Section IV, with conclusions in Section V.
II Related Work
Approximate dynamic programming for reinforcement learning is a well-researched topic in Markov decision processes. A good introduction is given in [6]. Some model-free methods have produced many promising results in recent years, like deep Q network [7], double Q learning [8], advantage learning [9], etc. But in many robotic applications, reward values are not available for all robot actions, and the data without rewards are wasted in model-free learning. Common model-based approximation methods use a function to approximate the value function or the Q function, and the performance depends on the selected features.
The inverse reinforcement learning problem was first formulated in [2], where the agent observes the states resulting from a presumably optimal policy, and tries to learn a reward function that makes the policy better than all alternatives. Since the goal can be achieved by multiple reward functions, that paper tries to find one that maximizes the difference between the observed policy and the second-best policy. This idea is extended by [10], under the name of max-margin learning for inverse optimal control. Another extension is proposed in [3], where the purpose is not to recover the real reward function, but to find a reward function that leads to a policy equivalent to the observed one, measured by the amount of reward collected by following that policy.
Since a motion policy may be difficult to estimate from observations, a behavior-based method is proposed in [5], which models the distribution of behaviors as a maximum-entropy model on the amount of reward collected from each behavior. This model has many applications and extensions. For example, [11] considers a sequence of changing reward functions instead of a single reward function. [12] and [13] consider complex reward functions, instead of linear ones, and use Gaussian processes and neural networks, respectively, to model the reward function. [14] considers complex environments, instead of a well-observed Markov Decision Process, and combines a partially observed Markov Decision Process with reward learning. [15] models the behaviors based on the local optimality of a behavior, instead of the summation of rewards. [16] uses a multi-layer neural network to represent nonlinear reward functions.
Another method is proposed in [17], which models the probability of a behavior as the product of the probabilities of its state-action pairs, and learns the reward function via maximum a posteriori estimation. However, due to the complex relation between the reward function and the behavior distribution, the authors use computationally expensive Monte Carlo methods to sample the distribution. This work is extended by [4], which uses sub-gradient methods to simplify the problem. Another extension is shown in [18], which tries to find a reward function that matches the observed behavior. For motions involving multiple tasks and varying reward functions, methods are developed in [19] and [20], which try to learn multiple reward functions.
Most of these methods need to solve a reinforcement learning problem in each step of reward learning, so practical large-scale application is computationally infeasible. Several methods are applicable to large-scale applications. The method in [2] uses a linear approximation of the value function, but it requires a set of manually defined basis functions. The methods in [13, 21] update the reward function parameter by minimizing the relative entropy between the observed trajectories and a set of sampled trajectories based on the reward function, but they require a set of manually segmented trajectories of human motion, where the choice of trajectory length affects the result. The method in [22] only learns an optimal value function, instead of the reward function.
III High-dimensional Inverse Reinforcement Learning
III-A Markov Decision Process
A Markov Decision Process is described with the following variables:

$S = \{s\}$, a set of states

$A = \{a\}$, a set of actions

$P_{ss'}^{a}$, a state transition function that defines the probability that state $s$ becomes $s'$ after action $a$.

$R = \{R(s)\}$, a reward function that defines the immediate reward of state $s$.

$\gamma$, a discount factor that ensures the convergence of the MDP over an infinite horizon.
An agent’s motion can be represented as a sequence of state-action pairs:
$$\zeta = \{(s_t, a_t) \mid t = 0, \ldots, N_{\zeta}\}$$
where $N_{\zeta}$ denotes the length of the motion, varying in different observations. Given the observed sequence, inverse reinforcement learning algorithms try to recover a reward function that explains the motion.
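One key problem is how to model the action in each state, or the policy, $\pi(s) \in A$, a mapping from states to actions. This problem can be handled by reinforcement learning algorithms, by introducing the value function $V(s)$ and the Q-function $Q(s,a)$, described by the Bellman Equation [23]:
$$V^{\pi}(s) = \sum_{s'} P_{ss'}^{\pi(s)} \left[ R(s') + \gamma V^{\pi}(s') \right] \qquad (1)$$
$$Q^{\pi}(s,a) = \sum_{s'} P_{ss'}^{a} \left[ R(s') + \gamma V^{\pi}(s') \right] \qquad (2)$$
where $V^{\pi}$ and $Q^{\pi}$ define the value function and the Q-function under a policy $\pi$.
For an optimal policy $\pi^{*}$, the value function and the Q-function should be maximized on every state. This is described by the Bellman Optimality Equation [23]:
$$V^{*}(s) = \max_{a \in A} \sum_{s'} P_{ss'}^{a} \left[ R(s') + \gamma V^{*}(s') \right] \qquad (3)$$
$$Q^{*}(s,a) = \sum_{s'} P_{ss'}^{a} \left[ R(s') + \gamma V^{*}(s') \right] \qquad (4)$$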
In typical inverse reinforcement learning algorithms, the Bellman Optimality Equation needs to be solved once for each update of the reward function parameter, which is computationally infeasible in high-dimensional state spaces. While several existing approaches solve the problem at the expense of optimality, we propose an approximation method to avoid the problem.
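To make the cost of this inner step concrete, the following is a minimal sketch of tabular value iteration for the Bellman Optimality Equation, assuming a state-based reward collected on arrival. The array layout `P[a, s, s2]`, the function name `value_iteration`, and the toy problem sizes are illustrative assumptions, not part of the original method.

```python
import numpy as np

def value_iteration(P, R, gamma, tol=1e-8, max_iter=10_000):
    """Solve the Bellman Optimality Equation for a tabular MDP.

    P: array of shape (A, S, S), P[a, s, s2] = probability of s -> s2 under a.
    R: array of shape (S,), reward of each state (collected on arrival).
    Returns the optimal value function V (S,) and Q function Q (S, A).
    """
    V = np.zeros(P.shape[1])
    for _ in range(max_iter):
        # Q[s, a] = sum_{s2} P[a, s, s2] * (R[s2] + gamma * V[s2])
        Q = (P @ (R + gamma * V)).T
        V_new = Q.max(axis=1)
        converged = np.max(np.abs(V_new - V)) < tol
        V = V_new
        if converged:
            break
    Q = (P @ (R + gamma * V)).T
    return V, Q

# Hypothetical toy problem: 5 states, 3 actions, random dynamics.
rng = np.random.default_rng(0)
P = rng.random((3, 5, 5))
P /= P.sum(axis=2, keepdims=True)   # normalize each row into probabilities
R = rng.normal(size=5)
V, Q = value_iteration(P, R, gamma=0.9)
```

Each sweep touches every state, action, and successor state, and a conventional inverse reinforcement learning loop repeats the whole procedure after every reward update, which is the expense discussed above.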
III-B Function Approximation Framework
Given the set of actions and the transition probability, a reward function leads to a unique optimal value function. To learn the reward function from the observed motion, instead of directly learning the reward function, we use a parameterized function $f(s, \theta)$, named the VR function, to represent the sum of the reward function and the discounted optimal value function:
$$f(s, \theta) = R(s) + \gamma V^{*}(s) \qquad (5)$$
The function value of a state is named the VR value.
Substituting Equation (5) into the Bellman Optimality Equation, the optimal Q function is given as:
$$Q^{*}(s,a) = \sum_{s'} P_{ss'}^{a} f(s', \theta) \qquad (6)$$
the optimal value function is given as:
$$V^{*}(s) = \max_{a \in A} \sum_{s'} P_{ss'}^{a} f(s', \theta) \qquad (7)$$
and the reward function can be computed as:
$$R(s) = f(s, \theta) - \gamma \max_{a \in A} \sum_{s'} P_{ss'}^{a} f(s', \theta) \qquad (8)$$
Note that this formulation can be generalized to other extensions of the Bellman Optimality Equation by replacing the $\max$ operator with other types of Bellman backup operators. For example, $\log \sum_{a \in A} \exp(\cdot)$ is used in the maximum-entropy method [5]; $\frac{1}{k} \log \sum_{a \in A} \exp(k\,\cdot)$ is used in Bellman Gradient Iteration [24].
For any VR function $f$ and any parameter $\theta$, the optimal Q function $Q^{*}(s,a)$, optimal value function $V^{*}(s)$, and reward function $R(s)$ constructed with Equations (6), (7), and (8) always satisfy the Bellman Optimality Equation. Under this condition, we try to recover a parameterized function $f(s, \theta)$ that best explains the observed rewards for reinforcement learning problems, and expert demonstrations for inverse reinforcement learning problems.
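The consistency claim can be checked numerically. The sketch below assumes a small tabular MDP with random dynamics and an arbitrary, untrained vector of VR values; the helper name `from_vr` and the array layout `P[a, s, s2]` are illustrative assumptions. It builds the Q function, value function, and reward from the VR values and then measures the residual of the Bellman Optimality Equation.

```python
import numpy as np

def from_vr(f, P, gamma):
    """Build Q*, V*, and R from VR values, following Equations (6)-(8).

    f: array (S,), VR value of every state, f[s] = R[s] + gamma * V*[s].
    P: array (A, S, S), P[a, s, s2] = transition probability.
    """
    Q = (P @ f).T          # Q*[s, a] = sum_{s2} P[a, s, s2] * f[s2]
    V = Q.max(axis=1)      # V*[s] = max_a Q*[s, a]
    R = f - gamma * V      # R[s] = f[s] - gamma * V*[s]
    return Q, V, R

rng = np.random.default_rng(1)
n_states, n_actions, gamma = 6, 4, 0.95
P = rng.random((n_actions, n_states, n_states))
P /= P.sum(axis=2, keepdims=True)
f = rng.normal(size=n_states)      # arbitrary VR values, no training involved

Q, V, R = from_vr(f, P, gamma)

# Residual of the Bellman Optimality Equation for the constructed triple.
bellman_rhs = (P @ (R + gamma * V)).T.max(axis=1)
residual = np.max(np.abs(V - bellman_rhs))
```

The residual is zero up to floating-point error because `R + gamma * V` reproduces `f` exactly, so no iterative solver is needed to keep the three quantities consistent.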
For reinforcement learning problems, the Bellman backup operator should be a differentiable one, so that the function parameter $\theta$ can be updated based on the observed rewards.
For inverse reinforcement learning problems, combined with different Bellman backup operators, this formulation can extend many existing methods to high-dimensional spaces, such as the motion models in [25], [5], and [17]. The main limitation is the assumption of a known transition model $P_{ss'}^{a}$, but it only requires a partial model on the visited states rather than a full environment model, and the model can be learned independently in an unsupervised way.
III-C High-dimensional Reinforcement Learning
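Although it is not our main focus, we briefly show how the proposed method solves high-dimensional reinforcement learning problems. Assuming the approximation function is a neural network, the parameter $\theta$ (weights and biases) in Equation (5) can be estimated from the observed sequence of rewards $\hat{r}(s)$ via least-squares estimation, where the objective function is:
$$SE(\theta) = \sum_{s} \left( \hat{r}(s) - f(s, \theta) + \gamma \max_{a \in A} \sum_{s'} P_{ss'}^{a} f(s', \theta) \right)^{2}$$
The reward function in Equation (8) is non-differentiable with the max function as the Bellman backup operator. By approximating it with the generalized softmax function [24], the gradient of the objective function is:
$$\nabla_{\theta} SE(\theta) = -2 \sum_{s} \left( \hat{r}(s) - f(s, \theta) + \gamma V_{k}(s) \right) \left( \nabla_{\theta} f(s, \theta) - \gamma \nabla_{\theta} V_{k}(s) \right)$$
where
$$V_{k}(s) = \frac{1}{k} \log \sum_{a \in A} \exp\left( k\, Q^{*}(s,a) \right), \qquad \nabla_{\theta} V_{k}(s) = \sum_{a \in A} \frac{\exp\left( k\, Q^{*}(s,a) \right)}{\sum_{\tilde{a} \in A} \exp\left( k\, Q^{*}(s,\tilde{a}) \right)} \sum_{s'} P_{ss'}^{a} \nabla_{\theta} f(s', \theta)$$
and $k$ is the approximation level.
The parameter $\theta$ can be learned with gradient methods. The algorithm is shown in Algorithm 1. With the learned parameter, the optimal value function and a control policy can be estimated.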
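As an illustration of this least-squares fit, the sketch below replaces the neural network with a linear VR function `f = Phi @ theta` on a small random tabular MDP, smooths the max with the generalized softmax `log(sum(exp(k * x))) / k`, and runs gradient descent with a simple step-halving safeguard. The linear model, the feature matrix `Phi`, the problem sizes, and the step-size rule are all assumptions made for the example; Algorithm 1 itself is not reproduced here.

```python
import numpy as np

def soft_backup(f, P, k):
    """Generalized-softmax backup: V_k[s] = log(sum_a exp(k * Q[s, a])) / k."""
    Q = (P @ f).T                                   # Q[s, a], shape (S, A)
    z = k * Q
    m = z.max(axis=1, keepdims=True)                # shift for numerical stability
    e = np.exp(z - m)
    V = (m[:, 0] + np.log(e.sum(axis=1))) / k
    weights = e / e.sum(axis=1, keepdims=True)      # dV_k[s] / dQ[s, a]
    return V, weights

def loss_and_grad(theta, Phi, P, r_obs, gamma, k):
    """Squared error between observed rewards and f - gamma * V_k, with gradient."""
    f = Phi @ theta                                 # linear VR function (assumption)
    V, w = soft_backup(f, P, k)
    err = f - gamma * V - r_obs                     # reconstructed minus observed reward
    dQ = np.transpose(P @ Phi, (1, 0, 2))           # dQ[s, a] / dtheta, shape (S, A, D)
    dV = np.einsum("sa,sad->sd", w, dQ)             # dV_k[s] / dtheta, shape (S, D)
    grad = 2.0 * (err[:, None] * (Phi - gamma * dV)).sum(axis=0)
    return float((err ** 2).sum()), grad

# Hypothetical toy problem.
rng = np.random.default_rng(0)
S, A, D, gamma, k = 8, 3, 4, 0.9, 5.0
P = rng.random((A, S, S))
P /= P.sum(axis=2, keepdims=True)
Phi = rng.normal(size=(S, D))
r_obs = rng.normal(size=S)

theta = np.zeros(D)
loss, grad = loss_and_grad(theta, Phi, P, r_obs, gamma, k)
losses = [loss]
step = 0.1
for _ in range(300):
    candidate = theta - step * grad
    new_loss, new_grad = loss_and_grad(candidate, Phi, P, r_obs, gamma, k)
    if new_loss < loss:                             # accept only improving steps
        theta, loss, grad = candidate, new_loss, new_grad
    else:
        step *= 0.5                                 # step too large, shrink it
    losses.append(loss)
```

A larger `k` makes the backup closer to the hard max at the cost of a less smooth objective, so the approximation level trades accuracy against ease of optimization.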
III-D High-dimensional Inverse Reinforcement Learning
For IRL problems, this work chooses $\max$ as the Bellman backup operator and a motion model based on the optimal Q function [17]:
$$P(a \mid s) = \frac{\exp\left( b\, Q^{*}(s,a) \right)}{\sum_{\tilde{a} \in A} \exp\left( b\, Q^{*}(s,\tilde{a}) \right)} \qquad (9)$$
where $b$ is a parameter controlling the degree of confidence in the agent’s ability to choose actions based on Q values. In the remaining sections, we use $Q(s,a)$ to denote the optimal Q values to simplify the notation.
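The role of the confidence parameter can be seen in a short sketch of this kind of motion model, where action probabilities are proportional to the exponential of the scaled Q values. The function name `action_probabilities`, the symbol `b`, and the example Q table are illustrative assumptions.

```python
import numpy as np

def action_probabilities(Q, b):
    """Motion model: P(a|s) proportional to exp(b * Q[s, a])."""
    z = b * Q
    z = z - z.max(axis=1, keepdims=True)   # stabilize the exponentials
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# Hypothetical Q values for 2 states and 3 actions.
Q = np.array([[1.0, 2.0, 0.5],
              [0.0, 0.0, 3.0]])

near_uniform = action_probabilities(Q, b=0.01)   # low confidence in the agent
near_greedy = action_probabilities(Q, b=50.0)    # high confidence in the agent
```

With a small confidence value the model treats the agent as nearly random, while a large value concentrates almost all probability on the action with the highest Q value.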
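Assuming the approximation function is a neural network, the parameter $\theta$ (weights and biases) in Equation (5) can be estimated from the observed sequence of state-action pairs $\zeta$ via maximum-likelihood estimation:
$$\theta = \arg\max_{\theta} \log P(\zeta \mid \theta) \qquad (10)$$
where the log-likelihood of $P(\zeta \mid \theta)$ is given by:
$$L_{\theta} = \sum_{(s,a) \in \zeta} \left( b\, Q(s,a) - \log \sum_{\tilde{a} \in A} \exp\left( b\, Q(s,\tilde{a}) \right) \right) \qquad (11)$$
and the gradient of the log-likelihood is given by:
$$\nabla_{\theta} L_{\theta} = \sum_{(s,a) \in \zeta} \left( b\, \nabla_{\theta} Q(s,a) - b \sum_{\tilde{a} \in A} P(\tilde{a} \mid s)\, \nabla_{\theta} Q(s,\tilde{a}) \right) \qquad (12)$$
With a differentiable approximation function,
$$\nabla_{\theta} Q(s,a) = \sum_{s'} P_{ss'}^{a} \nabla_{\theta} f(s', \theta)$$
and
$$\nabla_{\theta} L_{\theta} = \sum_{(s,a) \in \zeta} \left( b \sum_{s'} P_{ss'}^{a} \nabla_{\theta} f(s', \theta) - b \sum_{\tilde{a} \in A} P(\tilde{a} \mid s) \sum_{s'} P_{ss'}^{\tilde{a}} \nabla_{\theta} f(s', \theta) \right) \qquad (13)$$
where $\nabla_{\theta} f(s', \theta)$ denotes the gradient of the neural network output with respect to the neural network parameter $\theta$.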
If the VR function is linear, the objective function in Equation (11) is concave, and a global optimum exists. However, a multi-layer neural network is better suited to handle the nonlinearity in the approximation and the high-dimensional state space data.
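A gradient ascent method is used to learn the parameter $\theta$:
$$\theta \leftarrow \theta + \alpha \nabla_{\theta} L_{\theta} \qquad (14)$$
where $\alpha$ is the learning rate and $\nabla_{\theta} L_{\theta}$ is the gradient of the log-likelihood in Equation (12).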
When the method converges, we can compute the optimal Q function, the optimal value function, and the reward function based on Equations (5), (6), (7), and (8). The algorithm under a neural-network-based approximation function is shown in Algorithm 2.
This method does not involve solving the MDP problem for each updated parameter $\theta$, and large-scale state spaces can be easily handled by an approximation function based on a multi-layer neural network.
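The learning loop can be sketched end to end on a toy problem. The example below is a simplified stand-in, not Algorithm 2: it assumes a linear VR function `f = Phi @ theta` instead of a neural network, a small random tabular MDP, synthetic demonstrations sampled from a known parameter `theta_true`, and a fixed learning rate. All names and sizes are illustrative assumptions.

```python
import numpy as np

def boltzmann_policy(theta, Phi, P, b):
    """P(a|s) proportional to exp(b * Q[s, a]) with Q[s, a] = sum_s2 P[a, s, s2] f[s2]."""
    Q = (P @ (Phi @ theta)).T                        # shape (S, A)
    z = b * Q
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def log_likelihood_and_grad(theta, Phi, P, b, states, actions):
    """Log-likelihood of the observed (state, action) pairs and its gradient."""
    pi = boltzmann_policy(theta, Phi, P, b)
    dQ = np.transpose(P @ Phi, (1, 0, 2))            # dQ[s, a] / dtheta, shape (S, A, D)
    ll = np.log(pi[states, actions]).sum()
    expected = np.einsum("na,nad->nd", pi[states], dQ[states])
    grad = b * (dQ[states, actions] - expected).sum(axis=0)
    return float(ll), grad

# Hypothetical toy problem and synthetic demonstrations.
rng = np.random.default_rng(0)
S, A, D, b, gamma = 20, 4, 3, 2.0, 0.9
P = rng.random((A, S, S))
P /= P.sum(axis=2, keepdims=True)
Phi = rng.normal(size=(S, D))
theta_true = np.array([1.0, -2.0, 0.5])
pi_true = boltzmann_policy(theta_true, Phi, P, b)
states = rng.integers(0, S, size=2000)
actions = np.array([rng.choice(A, p=pi_true[s]) for s in states])

# Gradient ascent on the log-likelihood; no MDP is solved inside the loop.
theta, alpha = np.zeros(D), 2e-4
history = []
for _ in range(1000):
    ll, grad = log_likelihood_and_grad(theta, Phi, P, b, states, actions)
    history.append(ll)
    theta = theta + alpha * grad

# Recover Q, V, and the reward from the learned VR function.
f = Phi @ theta
Q = (P @ f).T
V = Q.max(axis=1)
R = f - gamma * V
```

Each iteration costs one pass over the demonstrations and the action set, which is where the linear dependence on the number of actions comes from.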
Obviously, the approximation function is not unique, but all such functions will generate the same optimal values and rewards for the observed state-action pairs after convergence. By choosing a neural network with higher capacity, we may overfit the observed state-action distribution and fail to generalize well. Therefore, the choice of the approximation function depends on how well the observed motion matches the ground-truth motion.
IV Experiments
We first test the proposed method in a simulated environment, to compare its accuracy under different approximation functions, and then apply the proposed method to surgical data in the JIGSAWS dataset [1].
IV-A Simulated Environment
We create a four-dimensional grid with 10 cells in each dimension, so 10000 states are generated. Several reward-emitting objects are placed randomly in the grid, and each of them generates an exponentially decaying negative or positive reward value for every cell based on its distance to that cell. The true reward value of each cell is the sum of the rewards generated in that cell. An agent moves in the grid, and it can choose to move up, move down, or stay still in each dimension; these choices define the action set. The observable feature of a cell is its distances to the reward-generating objects.
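A sketch of how such an environment could be generated is given below. The number of objects, the reward magnitudes, the decay rate, and the use of Euclidean distance are assumptions chosen for illustration; the original values are not stated in the text.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
side, dims = 10, 4
n_objects = 5        # assumed number of reward-emitting objects
decay = 0.5          # assumed decay rate of the emitted reward

# All 10^4 grid cells as integer coordinates, shape (10000, 4).
cells = np.array(list(itertools.product(range(side), repeat=dims)))

# Reward-emitting objects at random cells, with random signed magnitudes.
objects = cells[rng.choice(len(cells), size=n_objects, replace=False)]
magnitudes = rng.uniform(-1.0, 1.0, size=n_objects)

# Observable feature of a cell: its distance to every object, shape (10000, 5).
features = np.linalg.norm(cells[:, None, :] - objects[None, :, :], axis=2)

# True reward: sum of exponentially decaying contributions from all objects.
reward = (magnitudes * np.exp(-decay * features)).sum(axis=1)
```

The feature matrix is what the approximation function would see as input, while the reward vector serves as ground truth for evaluating the recovered values.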
To test the application of the proposed method to reinforcement learning problems, we assume that the reward value of each state is available for the robot, and it has to learn an optimal value function from it. We compare the ground truth value function, computed through value iteration, and the value function recovered by the robot based on the mean error of the optimal Q values.
We choose neural networks as the approximation function, and compare the errors under different neural network configurations. We choose the configurations by first fixing the number of nodes in each hidden layer and increasing the number of layers, and then fixing the number of hidden layers and increasing the number of nodes in each layer. Stochastic gradient descent is used in optimization, with batch size 50 and learning rate 0.00001. The results are shown in Figures 2 and 3.
To test the application of the proposed method to inverse reinforcement learning problems, we generate trajectories with random initial positions and lengths based on the true reward function, and try to recover a reward function from the trajectories. We compute the accuracy as the correlation coefficient between the ground-truth reward function and the recovered reward function. Similarly, we compare the accuracy under different neural network configurations. The results are shown in Figures 4 and 5.
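The accuracy measure can be written in a few lines. The sketch below assumes the Pearson correlation coefficient over per-state reward vectors; the function name `reward_accuracy` and the example vectors are hypothetical.

```python
import numpy as np

def reward_accuracy(true_reward, recovered_reward):
    """Pearson correlation coefficient between two per-state reward vectors."""
    return float(np.corrcoef(true_reward, recovered_reward)[0, 1])

rng = np.random.default_rng(0)
true_reward = rng.normal(size=100)

# A positively rescaled and shifted copy of the true reward scores 1.
rescaled = 3.0 * true_reward + 2.0
score_rescaled = reward_accuracy(true_reward, rescaled)

# A sign-flipped copy scores -1.
score_flipped = reward_accuracy(true_reward, -true_reward)
```

Because the coefficient is unchanged by positive scaling and constant shifts, it compares the shape of the two reward functions rather than their absolute values.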
The results show that the accuracies of learned value function and reward function improve as the capacity of network increases, and increasing network width works better.
IV-B Surgical Robot Operator
We apply the proposed method to surgical robot operators in the JIGSAWS data set [1]. This data set describes three tasks: knot tying, needle passing, and suturing. An illustration of the tasks is shown in Figure 6. Each task is conducted by multiple robot operators, whose skill levels range from novice through intermediate to expert.
The data include videos from two stereo cameras and robot states synchronized to the images. We assume the operator’s actions change the linear and angular acceleration of the robot, and we use k-means clustering to identify 10000 actions from the dataset. The state set includes the robot manipulator’s positions and velocities, represented by a length-38 vector with continuous values. The transition probability is computed based on physical laws.
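The action discretization step can be illustrated with a plain Lloyd-style k-means in NumPy. This is only a sketch: the data below are synthetic stand-ins for acceleration vectors, and the vector dimension, sample count, and cluster count are small assumed values rather than the 10000 clusters used on the real recordings.

```python
import numpy as np

def kmeans(X, n_clusters, n_iter=50, seed=0):
    """Lloyd's algorithm: returns cluster centers and the label of every sample."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=n_clusters, replace=False)].copy()
    for _ in range(n_iter):
        # Squared distance from every sample to every center, shape (N, K).
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(n_clusters):
            members = X[labels == j]
            if len(members) > 0:        # keep the old center if a cluster empties
                centers[j] = members.mean(axis=0)
    # Final assignment against the updated centers.
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return centers, d2.argmin(axis=1)

# Synthetic stand-in for per-frame acceleration vectors (assumed 6-dimensional).
rng = np.random.default_rng(0)
accelerations = rng.normal(size=(600, 6))

action_centers, action_ids = kmeans(accelerations, n_clusters=8)
```

Each cluster center then plays the role of one discrete action, and every observed frame is labeled with the index of its nearest center.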
We apply the model to surgical operator evaluation on the three tasks by training on all experts and testing on novice and intermediate operators. The results are shown in Figures 7, 8, and 9.
The results show that the proposed method successfully identifies the difference between inexperienced operators and experienced operators, thus it can be used in evaluation tasks.
V Conclusions
This work deals with the problem of high-dimensional inverse reinforcement learning, where the state space is usually too large for many existing solutions. We solve the problem with a function approximation framework by approximating the reinforcement learning solution. The method is first tested in a simulated environment, and then applied to the evaluation of surgical robot operators in three clinical tasks.
In current settings, each task has one reward function, associated with an optimal value function. In future work, we will extend this method for a robot to learn multiple reward functions. Besides, we will try to integrate transition model learning into the framework.
References
 [1] Y. Gao, S. S. Vedula, C. E. Reiley, N. Ahmidi, B. Varadarajan, H. C. Lin, L. Tao, L. Zappella, B. Béjar, D. D. Yuh et al., “JHU-ISI gesture and skill assessment working set (JIGSAWS): A surgical activity dataset for human motion modeling,” in MICCAI Workshop: M2CAI, vol. 3, 2014.
 [2] A. Y. Ng and S. Russell, “Algorithms for inverse reinforcement learning,” in Proc. 17th International Conf. on Machine Learning, 2000.
 [3] P. Abbeel and A. Y. Ng, “Apprenticeship learning via inverse reinforcement learning,” in Proceedings of the twenty-first international conference on Machine learning. ACM, 2004, p. 1.
 [4] G. Neu and C. Szepesvári, “Apprenticeship learning using inverse reinforcement learning and gradient methods,” arXiv preprint arXiv:1206.5264, 2012.
 [5] B. D. Ziebart, A. Maas, J. A. Bagnell, and A. K. Dey, “Maximum entropy inverse reinforcement learning,” in Proc. AAAI, 2008, pp. 1433–1438.
 [6] W. B. Powell, Approximate Dynamic Programming: Solving the curses of dimensionality. John Wiley & Sons, 2007, vol. 703.
 [7] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, “Playing atari with deep reinforcement learning,” arXiv preprint arXiv:1312.5602, 2013.
 [8] H. v. Hasselt, A. Guez, and D. Silver, “Deep reinforcement learning with double Q-learning,” in Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence. AAAI Press, 2016, pp. 2094–2100.
 [9] R. S. Sutton, D. A. McAllester, S. P. Singh, and Y. Mansour, “Policy gradient methods for reinforcement learning with function approximation,” in Advances in neural information processing systems, 2000, pp. 1057–1063.
 [10] N. D. Ratliff, J. A. Bagnell, and M. A. Zinkevich, “Maximum margin planning,” in Proceedings of the 23rd international conference on Machine learning. ACM, 2006, pp. 729–736.
 [11] Q. P. Nguyen, B. K. H. Low, and P. Jaillet, “Inverse reinforcement learning with locally consistent reward functions,” in Advances in Neural Information Processing Systems, 2015, pp. 1747–1755.
 [12] S. Levine, Z. Popovic, and V. Koltun, “Nonlinear inverse reinforcement learning with Gaussian processes,” in Advances in Neural Information Processing Systems 24, J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. Pereira, and K. Q. Weinberger, Eds. Curran Associates, Inc., 2011, pp. 19–27.
 [13] C. Finn, S. Levine, and P. Abbeel, “Guided cost learning: Deep inverse optimal control via policy optimization,” arXiv preprint arXiv:1603.00448, 2016.
 [14] J. Choi and K.-E. Kim, “Inverse reinforcement learning in partially observable environments,” Journal of Machine Learning Research, vol. 12, no. Mar, pp. 691–730, 2011.
 [15] S. Levine and V. Koltun, “Continuous inverse optimal control with locally optimal examples,” arXiv preprint arXiv:1206.4617, 2012.
 [16] M. Wulfmeier, P. Ondruska, and I. Posner, “Deep inverse reinforcement learning,” arXiv preprint arXiv:1507.04888, 2015.
 [17] D. Ramachandran and E. Amir, “Bayesian inverse reinforcement learning,” in Proceedings of the 20th International Joint Conference on Artificial Intelligence, ser. IJCAI’07. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 2007, pp. 2586–2591.
 [18] K. Mombaur, A. Truong, and J.-P. Laumond, “From human to humanoid locomotion - an inverse optimal control approach,” Autonomous robots, vol. 28, no. 3, pp. 369–383, 2010.
 [19] C. Dimitrakakis and C. A. Rothkopf, “Bayesian multitask inverse reinforcement learning,” in European Workshop on Reinforcement Learning. Springer, 2011, pp. 273–284.
 [20] J. Choi and K.-E. Kim, “Nonparametric Bayesian inverse reinforcement learning for multiple reward functions,” in Advances in Neural Information Processing Systems, 2012, pp. 305–313.
 [21] A. Boularias, J. Kober, and J. R. Peters, “Relative entropy inverse reinforcement learning,” in International Conference on Artificial Intelligence and Statistics, 2011, pp. 182–189.
 [22] E. Todorov, “Linearly-solvable Markov decision problems,” in Proceedings of the 19th International Conference on Neural Information Processing Systems. MIT Press, 2006, pp. 1369–1376.
 [23] R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction. MIT press Cambridge, 1998, vol. 1, no. 1.
 [24] K. Li and J. W. Burdick, “Bellman Gradient Iteration for Inverse Reinforcement Learning,” arXiv e-prints, Jul. 2017.
 [25] E. Todorov, “Linearly-solvable Markov decision problems,” in Advances in neural information processing systems, 2007, pp. 1369–1376.