Vector Autoregressive POMDP Model Learning and Planning for Human-Robot Collaboration
Abstract
Human-robot collaboration (HRC) has emerged as an active research area at the intersection of control, robotics, and psychology in recent years. It is of critical importance to obtain an expressive yet tractable model of human beings in HRC. In this paper, we propose a model called the Vector Autoregressive POMDP (VAR-POMDP) model, which extends the traditional POMDP model by considering the correlation among observations. The VAR-POMDP model is more expressive than the traditional continuous-observation POMDP, which is a special case of the VAR-POMDP model. Meanwhile, the proposed VAR-POMDP model remains tractable: we show that it can be learned effectively from data and that point-based value iteration (PBVI) can be extended to VAR-POMDP planning. In particular, we propose to use Bayesian nonparametric learning to identify potential human states and to learn a VAR-POMDP model from data collected through human demonstrations. We then consider planning with respect to PCTL, which is widely used to express safety and reachability requirements in robotics. Finally, the advantage of the proposed model for HRC is validated by experimental results using data collected from a driver-assistance testbed.
I Introduction
Human-robot collaboration (HRC) studies how to achieve effective collaboration between humans and robots so as to combine the strengths of both. While robots have advantages in handling repetitive tasks with high precision and long endurance, human beings are much more flexible in adapting to changing factors or uncertain environments that are difficult for robots to handle. Establishing an efficient collaboration between humans and robots is therefore the core problem in the design of an HRC system.
To achieve effective HRC, it is of critical importance to obtain an expressive yet tractable model of the collaboration. Among the several types of models in the HRC literature, such as the ACT-R/E model [1], the IDDM model [2] and the TLP model [3], the POMDP model has emerged as a popular choice in recent years [4][5]. As a general probabilistic system model that captures uncertainties from actuation errors, sensing noise and human behaviors, the POMDP provides a comprehensive framework for system modeling and sequential decision making. Most existing results assume that the POMDP model is given [6][7] or that the number of states is known [8][9] before learning the model from data. However, predefining the state space can be tedious, and the number of hidden states can be case dependent, especially when humans are involved. In our previous work [10], we dropped these assumptions and proposed a Bayesian nonparametric learning approach to infer the structure of the POMDP model, such as the number of states, from data. However, the hidden states of the POMDP model learned there can only capture static properties of the observed data. For example, in the driving scenario, each state is a cluster of positions of the human hand; the human intention, say turning right, can then be inferred if the observed position belongs to the cluster. Dynamic properties such as turning speed and acceleration could not be modeled or distinguished, because that POMDP model did not consider the correlations among observations.
To fill this gap, we propose a new type of model for HRC, called the Vector Autoregressive POMDP (VAR-POMDP) model, which takes the observation correlation into consideration and hence extends the existing POMDP model. Our main objective in this paper is to show that the proposed VAR-POMDP model achieves a good trade-off between model expressiveness and tractability.
The expressiveness of the proposed model is clear, as the POMDP model becomes a special case of it. To illustrate the tractability of the proposed model, we investigate both the model learning and the planning problems. First, in the model learning process, we do not assume that the state space or a bound on the number of states is given; our basic idea is to use a Bayesian nonparametric learning method to automatically identify the number of hidden states. Second, in the planning process, we consider probabilistic computation tree logic (PCTL) as the formal specification, since it is widely used to express safety and reachability requirements in robotic applications [11]. The PCTL bounded-until model checking problem can be converted into a finite-horizon dynamic programming problem, and we show that the value function can be approximated by a piecewise linear function by extending PBVI to the VAR-POMDP model. The effectiveness of the proposed learning and planning algorithms is illustrated in real experiments.
The main contribution of this paper is twofold. First, the VAR-POMDP model, which considers the correlation of observations, is proposed to model the HRC process, and a corresponding framework is proposed to learn the VAR-POMDP model from demonstrations using Bayesian nonparametric learning. Second, the PBVI algorithm is extended to the VAR-POMDP model to solve a finite-step dynamic programming problem and therefore the bounded-until model checking problem.
The rest of the paper is organized as follows. Section II presents the formal definition of the VAR-POMDP model and formulates the problem. The learning framework with corresponding experimental results is presented in Section III. Section IV shows how to extend the PBVI algorithm to the VAR-POMDP model, and Section V concludes the paper.
II VAR-POMDP Model
The VAR-POMDP model is inspired by the autoregressive hidden Markov model (AR-HMM), an extension of the HMM. Distinct from the HMM, which assumes the independence of observations, the autoregressive model specifies that the observed variable depends linearly on its previous values with certain uncertainties. This correlation property is exhibited in human motion behavior [12]. We therefore incorporate the correlation of observations into the POMDP model and propose the VAR-POMDP model.
Definition 1
The VAR-POMDP model is defined as a tuple M = (S, A, Z, T, O, R, L) where

S is a finite set of states;

A is a finite set of decision actions;

Z ⊆ R^d is a set of continuous observations;

T(s' | s, a) is a transition function which defines the probability over the next state s' after taking an action a from the state s;

O(z_t | s, z_{1:t-1}) is an observation function which defines the distribution over the observation z_t that may occur in state s conditional on the observation history z_{1:t-1};

R is a reward function;

L : S → 2^{AP} is a labelling function that assigns a subset of atomic propositions AP to each state s ∈ S.
The difference between the VAR-POMDP model and the traditional POMDP model lies in the observation function. In the traditional POMDP model, the observation function depends only on the hidden state and the current observation, while in the VAR-POMDP model it also depends on the observation history, namely,
z_t = A_1^{(s)} z_{t-1} + A_2^{(s)} z_{t-2} + ... + A_r^{(s)} z_{t-r} + e(s),   (1)

where e(s) ~ N(0, Σ_s) is Gaussian noise modeling the uncertainty and the matrices A_1^{(s)}, ..., A_r^{(s)} are lag matrices under mode s. Note that the continuous-observation POMDP with Gaussian emissions is a special case of this model, obtained when the lag matrices are zero and the Gaussian mean is replaced with a constant vector. Instead of using a constant value to characterize a motion feature, the VAR-POMDP uses the lag matrices and the covariance matrix to characterize it. Thus, dynamic features can be expressed and identified by the VAR-POMDP.
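For concreteness, the following sketch samples a single observation from the emission model (1) for a given mode; the dimensions, lag matrices and covariance here are illustrative assumptions, not parameters from the paper.

```python
import numpy as np

def sample_var_observation(A, Sigma, history, rng):
    """Sample z_t from the VAR emission model of eq. (1):
    z_t = sum_i A[i-1] @ z_{t-i} + e,  with e ~ N(0, Sigma)."""
    r = len(A)  # autoregressive order
    # history[-1] is z_{t-1}, history[-2] is z_{t-2}, ...
    mean = sum(A[i] @ history[-(i + 1)] for i in range(r))
    return rng.multivariate_normal(mean, Sigma)

rng = np.random.default_rng(0)
d = 2
A = [0.5 * np.eye(d), 0.2 * np.eye(d)]   # illustrative lag matrices (r = 2)
Sigma = 0.01 * np.eye(d)                 # illustrative noise covariance
history = [np.zeros(d), np.ones(d)]      # z_{t-2}, z_{t-1}
z = sample_var_observation(A, Sigma, history, rng)
```

Setting the lag matrices to zero and fixing the mean recovers the ordinary Gaussian-emission POMDP, matching the special-case remark above.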
II-A Probabilistic Computation Tree Logic
To achieve a performance-guaranteed model learning and planning framework, we use PCTL as the formal specification to guide the design process. PCTL is a probabilistic extension of computation tree logic which allows for probabilistic quantification of the described properties [13]. The syntax of PCTL is as follows:
φ ::= True | a | ¬φ | φ ∧ φ | P_{⋈p}[φ U^{≤k} φ],   (2)

where a is an atomic proposition, ⋈ ∈ {<, ≤, >, ≥}, p ∈ [0, 1], and U^{≤k} stands for bounded until. The soft-deadline property makes PCTL a widely used specification language for probabilistic model checkers.
In this paper, we consider the problem of VAR-POMDP model learning and planning for the bounded-until specification in PCTL.
Problem 1
Given training data collected from an HRC process, learn a VAR-POMDP model M. Based on the learned model M, together with a given initial belief b_0 over the states, a finite horizon k and a PCTL bounded-until specification

φ = P_{≤p}[φ_1 U^{≤k} φ_2],   (3)

check whether or not the specification is satisfied.
The PCTL specification (3) bounds the probability that, within k steps, a path reaches a state where φ_2 holds while φ_1 holds in all states prior to that state. For example, if φ_1 = True and φ_2 labels the failure states, the specification bounds the probability of the system reaching states that cause system failure. The specification in equation (3) is satisfied if and only if V*(b_0) ≤ p, where V*(b_0) is the maximum satisfaction probability with respect to the belief b_0. The model checking problem is thus converted into a finite-step optimization problem.
On the one hand, the PCTL specification gives a performance requirement for the system, which guides the model learning and controller design process. On the other hand, using PCTL as the specification avoids having to further define a reward function for the VAR-POMDP model. Although algorithms such as inverse reinforcement learning can be used to recover a reward function [14], it is hard to explain the physical meaning of the recovered reward. For a PCTL specification, the reward can be clearly interpreted as a satisfaction probability.
III VAR-POMDP Model Learning
The proposed framework to learn the VAR-POMDP model is shown in Figure 1. The action set is assumed to be given, since it represents the capability of the robot. Using the training data collected from the HRC system, the state space of the model is identified using the Bayesian nonparametric learning method. The whole state space can be the product of the state spaces of the human, the robot and the environment. Based on the identified state space, the transition probabilities and observation distributions can be learned from data.
III-A Motion Feature Extraction
Instead of assuming the state space is given, we use the Bayesian nonparametric learning method to infer the state space directly from data. The training data consist of several multi-dimensional time series, such as human motion trajectories collected from demonstrations. Taking advantage of the BP-AR-HMM framework proposed in [15], we learn not only the features of the data but also the number of features in a fully Bayesian manner.
In the Bayesian nonparametric learning method, an AR-HMM is used as a generative model to relate the hidden features to the observations. Each observed time series y^{(i)} = (y_1^{(i)}, ..., y_{T_i}^{(i)}) is assumed to be generated from the following model:

y_t^{(i)} = A_{1, z_t} y_{t-1}^{(i)} + ... + A_{r, z_t} y_{t-r}^{(i)} + e_t(z_t),   (4)

where e_t(z_t) is zero-mean Gaussian noise with covariance matrix Σ_{z_t} capturing the uncertainty, and r is the order of the autoregressive process. The variable z_t is the hidden state, and the transition distribution π_k specifies the transition property for state k. Each hidden state k is characterized by a set of parameters θ_k = {A_{1,k}, ..., A_{r,k}, Σ_k}, which characterize the corresponding features of the data. For example, when human motion trajectories are used as training data, the parameters describe the dependence of the current position on historical positions; the physical meaning of the features is motion patterns.
Compared with the HMM assumption, which ignores the dependence among observations [10], the AR-HMM assumption enables the model to extract dynamic properties of the observed data, since it treats the observed data as a dynamical system. Using the AR-HMM as the generative model is therefore necessary when one cares about dynamic rather than static properties of the observed data.
The traditional approach usually assumes that the number of hidden states, or an upper bound on it, is given. However, obtaining this prior knowledge is nontrivial, especially when humans are involved. In Bayesian nonparametric learning, a prior distribution is used so that the number of hidden states can be inferred from data automatically. The BP-AR-HMM uses a Beta process (BP) to generate a collection of infinitely many points and assigns each point a weight, which serves as a coin-flip probability. A Bernoulli process (BeP) then selects the points that appear in each training series. These points are bound to the hidden states and therefore to the feature parameters θ_k. Together, the Beta-Bernoulli process models the correlation among time series. This process is summarized as follows:
B ~ BP(1, B_0),
X_i | B ~ BeP(B),  i = 1, ..., N,
π_k^{(i)} | f_i ~ Dir([γ, ..., γ + κ, ..., γ] ⊗ f_i),   (5)

where Dir stands for the Dirichlet distribution and B is a draw from the Beta process, providing a set of weights for the potentially infinite number of hidden states. For each time series i, an X_i is drawn from a Bernoulli process parameterized by B. Each X_i can be used to construct a binary vector f_i indicating which of the global hidden states are selected in that time series. The transition probabilities π_k^{(i)} of the AR-HMM are then drawn from a Dirichlet distribution with self-transition bias κ for each state k.
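As an illustration only, the following sketch draws one time series from a finite (truncated) approximation of this generative process. The truncation level K, the AR order, and all hyperparameters are illustrative choices for this sketch, not the infinite-feature construction or the settings used for inference in [15].

```python
import numpy as np

def sample_bp_arhmm_sequence(T=100, K=5, d=2, r=1, kappa=5.0, gamma=1.0, seed=0):
    """Draw one series from a finite (K-feature) approximation of the
    Beta-Bernoulli AR-HMM generative model sketched above."""
    rng = np.random.default_rng(seed)
    # Beta-Bernoulli feature selection: weights ~ Beta, features ~ Bernoulli.
    weights = rng.beta(1.0, 1.0, size=K)
    f = rng.random(K) < weights
    f[0] = True                      # ensure at least one active feature
    active = np.flatnonzero(f)
    # Per-feature AR parameters and sticky (self-transition-biased) rows.
    A = {k: [rng.normal(0, 0.3, (d, d)) for _ in range(r)] for k in active}
    Sigma = {k: 0.05 * np.eye(d) for k in active}
    pi = {}
    for idx, k in enumerate(active):
        alpha = np.full(len(active), gamma)
        alpha[idx] += kappa          # self-transition bias kappa
        pi[k] = rng.dirichlet(alpha)
    # Hidden state sequence, then AR observations conditioned on it.
    z = [rng.choice(active)]
    for _ in range(T - 1):
        z.append(rng.choice(active, p=pi[z[-1]]))
    y = [rng.normal(0, 1, d) for _ in range(r)]   # initial padding values
    for t in range(T):
        k = z[t]
        mean = sum(A[k][i] @ y[-(i + 1)] for i in range(r))
        y.append(rng.multivariate_normal(mean, Sigma[k]))
    return np.array(z), np.array(y[r:])

states, series = sample_bp_arhmm_sequence()
```

Inference inverts this process: given only `series`, the MCMC sampler of [16] recovers the feature assignments and parameters.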
The generative model is fully Bayesian, which implies that the model can be inferred from data according to Bayes' rule. The parameters, such as the hidden state sequences and the feature parameters θ_k, can be learned from data using the Markov chain Monte Carlo (MCMC) method [16].
Example 1
A driver-in-the-loop, hardware-in-the-loop simulation system is used as an example to validate the proposed approach. Markers are placed on the left and right hands of the driver and on the steering wheel, and time series of the positions of these markers are collected using the OptiTrack system. An example of the raw data is shown in Figure 2, which consists of driving motions such as turning left and right. Using the Bayesian nonparametric learning method, motion features can be identified automatically. Figure 3 compares the learning results using the HMM and the AR-HMM as generative models on the same training data, with different motion features labeled by different colors. The results show that considerably fewer motion features are detected under the HMM generative model than under the AR-HMM generative model. For a more detailed comparison, a segment of the data is zoomed in on in Figure 4. As the results show, some dynamic motions are not identified by the HMM model but are detected using the AR-HMM. The reason behind this phenomenon is that the AR-HMM uses a dynamical system to model the observed data and considers the correlation among observations, while the HMM assumes observation independence.
III-B Constructing the VAR-POMDP Model
Based on the features identified in Section III-A, the VAR-POMDP model can be constructed directly. First, the state space of the VAR-POMDP model is defined as the product of the state spaces of the human, the robot and the environment, where the state space of the human is the set of motion features identified in Section III-A. Each state can be labeled manually with a physical meaning, and the labeling function is defined according to that meaning; for example, in a driving scenario, a label could indicate an intention such as turning right. The observation space is the continuous vector space of sensor readings. The observation function is defined as the multivariate Gaussian distribution

O(z_t | s, h_t) = N(z_t; A_1^{(s)} z_{t-1} + ... + A_r^{(s)} z_{t-r}, Σ_s),

where h_t = (z_{t-1}, ..., z_{t-r}) is the observation history.
After identifying the state space and the observation distributions, the next step is to learn the transition probabilities. Learning the exact transition probabilities is difficult for reasons such as limited data. Modeling the uncertainty makes the learned transition probabilities subject to a certain confidence level, which motivates us to apply the Chernoff bound to reason about the accuracy of the transition probabilities for the VAR-POMDP. Details of the transition probability learning can be found in [10]. By the Chernoff bound, the estimation error of a transition probability can be made sufficiently small with high confidence as long as the training data is sufficiently large.
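As a rough illustration of this sample-complexity reasoning, the sketch below uses a standard Hoeffding-type form of the Chernoff bound to compute how many observed transitions suffice for a given accuracy and confidence; the exact bound used in [10] may differ in its constants.

```python
import math

def chernoff_sample_size(eps, delta):
    """Number of observed transitions N such that the empirical estimate
    p_hat of a transition probability p satisfies P(|p_hat - p| >= eps)
    <= delta, via the two-sided bound P(...) <= 2*exp(-2*N*eps**2)."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * eps ** 2))

# e.g. estimate each transition probability within 0.05, with 95% confidence
n = chernoff_sample_size(eps=0.05, delta=0.05)
```

Halving the tolerance eps quadruples the required number of samples, which is why the confidence-level view of the learned transition probabilities matters under limited demonstration data.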
Example 2
In this example, parts of the model are shown in Figure 5 to illustrate the model construction process. For the driver-assistance system, actions are designed to increase car and road safety by providing real-time instructions and warnings or by directly controlling the vehicle. For example, one action can be designed as an instruction and another can be designed to increase the steering torque. Different actions have different influences on the human driver and therefore induce different transition probabilities.
Each state is a composition of an environment status and a human intention, where the intention component can be any of the identified human intentions. Based on the observations, a belief over the hidden states is maintained according to Bayes' rule, and an action can be selected to maximize the safety of the system.
IV VAR-POMDP Planning
In the previous section, we showed that the proposed VAR-POMDP model can be learned effectively from data. This section aims to show that planning based on the proposed VAR-POMDP model is also tractable. In particular, we consider a PCTL specification as the safety and reachability requirement and apply PBVI to solve the PCTL model checking problem.
IV-A Converting PCTL Model Checking into a Dynamic Programming Problem
We first show that the bounded-until model checking problem described in Problem 1 can be converted into a dynamic programming problem. The PCTL model checking problem is well studied for the MDP model [17]; we generalize the result to the VAR-POMDP model. The state space of the model can be divided into three disjoint subsets S_1, S_0 and S_?. All states in S_1 satisfy φ_2, all states in S_0 satisfy neither φ_1 nor φ_2, and the states in S_? satisfy φ_1 but not φ_2. Once the system enters a state in S_0, the satisfaction probability is zero no matter where it goes in the future; and once the system enters a state in S_1, the satisfaction probability depends only on the prefix of the run. Thus, making the states in S_0 and S_1 absorbing and assigning reward 1 to states in S_1 and 0 to all other states does not change the satisfaction probability. The maximum satisfaction probability can then be computed recursively by value iteration,
V_n(b, h) = max_{a ∈ A} ∫_Z Pr(z | b, a, h) V_{n-1}(b^{a,z}, hz) dz,   (6)

where b(s) stands for the belief on state s, T(s' | s, a) and O(z | s', h) are the transition probability and observation distribution, b^{a,z} is the posterior belief after taking action a and observing z, and h represents the observation history. The value V_n(b, h) is the maximum probability of satisfying the specification within n steps when the belief is b and the observation history is h. The initial condition is V_0(b, h) = Σ_s b(s) R(s), where R(s) = 1 if s ∈ S_1 and R(s) = 0 otherwise.
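Before turning to beliefs, it is helpful to see the same bounded-until recursion in the fully observable MDP case [17], which the VAR-POMDP recursion generalizes. The sketch below makes the absorbing-state construction and the value iteration concrete; the toy transition matrices and state labels are illustrative, not from the paper.

```python
import numpy as np

def bounded_until_mdp(T, S1, S0, k):
    """Max probability of satisfying (phi1 U^{<=k} phi2) on an MDP.
    T[a] is an |S|x|S| transition matrix for action a; S1 holds the states
    satisfying phi2, S0 the states satisfying neither phi1 nor phi2.
    States in S0 and S1 are treated as absorbing, with reward 1 on S1."""
    n = T[0].shape[0]
    v = np.zeros(n)
    v[list(S1)] = 1.0                 # initial condition: reward on S1
    for _ in range(k):
        nxt = np.max([Ta @ v for Ta in T], axis=0)  # best action per state
        nxt[list(S1)] = 1.0           # absorbing: already satisfied
        nxt[list(S0)] = 0.0           # absorbing: can never satisfy
        v = nxt
    return v

# toy 3-state model: state 2 satisfies phi2, state 1 violates phi1 and phi2
T = [np.array([[0.5, 0.3, 0.2], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]),
     np.array([[0.2, 0.2, 0.6], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])]
v = bounded_until_mdp(T, S1={2}, S0={1}, k=3)   # v[0] -> 0.744
```

In the partially observable setting of (6), the same recursion runs over beliefs and observation histories instead of states, which is what makes the continuous-observation case challenging.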
IV-B Point-based Value Iteration for VAR-POMDP
The main challenges arising from the VAR-POMDP model are the curse of dimensionality and the curse of history. In equation (6), it is impossible to enumerate the observations since z lies in a continuous space, and the value function depends not only on the belief b but also on the observation history h. Thus the exact dynamic programming approach cannot be applied to solve the problem [18]. Inspired by the dynamic discretization approach [19], we propose to use PBVI to solve the dynamic programming problem.
In PBVI, a set of belief points B is selected and the value function is updated only at these belief points. PBVI thus gives an approximate solution whose approximation error depends on the belief point selection.
Theorem 1
The optimization problem defined recursively by equation (6) can be solved using the PBVI algorithm on a predefined belief set B. At these belief points, the value function can be approximated by a piecewise linear and convex function, written as

V_n(b) = max_{α ∈ Γ_n} Σ_{s ∈ S} α(s) b(s),   (7)

for a set of vectors Γ_n, where α(s) denotes the element of the vector α corresponding to state s for the n-step-to-go value function.
The theorem is proved by induction. Assume that the value function can be expressed in the form of (7). When n = 0, the initial vector is defined according to the reward of the reached state: by the 0-step-to-go value function defined in Section IV-A, the initial vector is α_0(s) = R(s), a constant vector. When n = 1, substitute the updated belief

b^{a,z}(s') = O(z | s', h) Σ_s T(s' | s, a) b(s) / Pr(z | b, a, h)   (8)

and the value function (7) into the right-hand side of equation (6):

V_1(b, h) = max_{a ∈ A} ∫_Z max_{α ∈ Γ_0} Σ_{s'} α(s') O(z | s', h) Σ_s T(s' | s, a) b(s) dz.   (9)
For each point in the belief set B, we select the action that maximizes the expected reward, and a set of vectors Γ_1 is thus obtained.
Assume that for all m < n, the value function can be expressed in the form of (7). Then the n-step-to-go value function can be expressed as

V_n(b, h) = max_{a ∈ A} ∫_Z max_{α ∈ Γ_{n-1}} Σ_{s'} α(s') O(z | s', h) Σ_s T(s' | s, a) b(s) dz.   (10)
Because of the max operator inside the integral, the integration cannot be calculated directly. Inspired by the work of [19], we break the observation space into subspaces and use a sampling-based approach to approximate the integration.
Let Z_{b,α}^a denote the subspace of observations for which the vector α attains the maximum expected reward for a specific belief point and action. For a given belief b and vector α,

Z_{b,α}^a = { z ∈ Z : α = argmax_{α' ∈ Γ_{n-1}} Σ_{s'} α'(s') O(z | s', h) Σ_s T(s' | s, a) b(s) }.   (11)
Then the value iteration becomes

V_n(b, h) = max_{a ∈ A} Σ_{α ∈ Γ_{n-1}} Σ_{s'} α(s') Σ_s T(s' | s, a) b(s) ∫_{Z_{b,α}^a} O(z | s', h) dz,   (12)

where the integral ∫_{Z_{b,α}^a} O(z | s', h) dz is the probability that the observation falls into the subspace Z_{b,α}^a given state s'. Calculating this integral directly is not tractable, so a sampling approach is used to approximate it.
The subspace Z_{b,α}^a is a function of the observation history h, which would make sampling from it intractable. However, it can be shown that the probability ∫_{Z_{b,α}^a} O(z | s', h) dz does not depend on the observation history h.
Let w = z − (A_1^{(s')} z_{t-1} + ... + A_r^{(s')} z_{t-r}) be a new variable, whose distribution is N(0, Σ_{s'}). Then

∫_{Z_{b,α}^a} O(z | s', h) dz = ∫_{W_{b,α}^a} N(w; 0, Σ_{s'}) dw,   (13)

where W_{b,α}^a is the corresponding history-free subregion. The probability is no longer a function of the observation history h and can be approximated using the Monte Carlo method. For each state s', we sample N observations from N(0, Σ_{s'}) and approximate the integral by

∫_{W_{b,α}^a} N(w; 0, Σ_{s'}) dw ≈ N_{b,α}^{s'} / N,   (14)

where N_{b,α}^{s'} is the number of samples that fall into the subregion W_{b,α}^a.
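A minimal sketch of this Monte Carlo step: draw the history-free Gaussian noise and count the fraction of samples falling in a subregion. The region predicate and covariance below are illustrative stand-ins for W_{b,α}^a and Σ_{s'}.

```python
import numpy as np

def region_probability(region, Sigma, n_samples=10_000, seed=0):
    """Monte Carlo estimate of eq. (14): sample w ~ N(0, Sigma) and
    return the fraction of samples for which the region predicate holds."""
    rng = np.random.default_rng(seed)
    w = rng.multivariate_normal(np.zeros(Sigma.shape[0]), Sigma,
                                size=n_samples)
    hits = sum(region(wi) for wi in w)   # count samples inside the region
    return hits / n_samples

# illustrative half-space region: first noise coordinate positive
# (true probability 0.5 under a zero-mean symmetric Gaussian)
p = region_probability(lambda w: w[0] > 0, Sigma=np.eye(2))
```

Because the samples are drawn from the history-free noise distribution, one batch of samples per state can be reused across all belief points and α-vectors at a given backup step.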
Returning to the value iteration equation (12), the resulting n-step-to-go vector is not a function of h. Thus V_n(b) = max_{a ∈ A} Σ_s b(s) α_b^a(s), where

α_b^a(s) = Σ_{s'} T(s' | s, a) Σ_{α ∈ Γ_{n-1}} (N_{b,α}^{s'} / N) α(s').   (15)

The function in equation (15) is evaluated on the belief set B. By induction, we conclude that the value function can be approximated by a piecewise linear and convex function for all n.
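The piecewise linear and convex representation established above is cheap to evaluate: the value at any belief is a max of inner products with the stored vectors, as in equation (7). The α-vectors and belief below are illustrative values, not outputs of the algorithm.

```python
import numpy as np

def value_at(belief, alphas):
    """Evaluate the piecewise-linear convex value function of eq. (7):
    V(b) = max over alpha-vectors of the inner product alpha . b."""
    return max(float(np.dot(a, belief)) for a in alphas)

alphas = [np.array([1.0, 0.0]),      # illustrative 2-state alpha-vectors
          np.array([0.2, 0.8])]
v = value_at(np.array([0.3, 0.7]), alphas)   # max(0.3, 0.62) = 0.62
```

Since each entry of an α-vector is a satisfaction probability here, the evaluated value is itself a probability in [0, 1], which is what the model checking step compares against the PCTL bound p.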
The proof shows that, although the correlation of observations is considered, the piecewise linear function used to approximate the value function is not a function of the observation history h. At each time step, many vectors are created, but the number of final vectors stored is limited to the number of belief points in B, so the whole value iteration takes polynomial time and the number of stored vectors remains constant. PBVI is an approximation algorithm: it sacrifices solution accuracy for efficiency. The approximation error depends on how densely the belief set B samples the belief simplex. Following the proof of the corresponding lemma in [20], it can be shown that the approximation error is bounded by the density of the belief set.
Assertion 1
The error induced by the point-based value iteration algorithm for the VAR-POMDP model is bounded by the density δ_B of the belief set B, defined as δ_B = max_{b'} min_{b ∈ B} ||b − b'||_1, where b' ranges over the reachable beliefs.
Let b' be the belief where the point-based value iteration makes the worst pruning error, and let b ∈ B be the belief point closest to b'. Let α' be the vector that is maximal at b' and α be the vector that is maximal at b, so that α · b ≥ α' · b. The pruning error for the n-step-to-go value function is bounded by

ε ≤ α' · b' − α · b'
  ≤ α' · b' − α · b' + (α · b − α' · b)
  = (α' − α) · (b' − b)
  ≤ ||α' − α||_∞ ||b' − b||_1
  ≤ δ_B.   (16)

The last inequality holds because the entries of the vectors represent achievable rewards, which are bounded by 1 in our case since they are satisfaction probabilities. Intuitively, the proof shows that the more densely the belief set is selected, the smaller the approximation error will be.
The value iteration procedure is summarized in the accompanying algorithm. Since it is a finite-step value iteration, the algorithm always terminates.
Example 3
A three-state VAR-POMDP model, shown in Figure 6, is used to validate the PBVI algorithm. The transition probabilities of the two actions are shown in the figure. Each state represents a motion feature of a three-dimensional motion trajectory, and one state is labeled as the target of the specification. A set of belief points is selected, and a PCTL bounded-until specification with a finite horizon and a probability upper bound is given. Using PBVI, five α-vectors are obtained. Given the initial belief, the maximum satisfaction probability is then evaluated; since it exceeds the specified upper bound, the specification is not satisfied.
V Conclusion
In this paper, we proposed the VAR-POMDP model for HRC, an extension of the traditional POMDP model that considers the correlation among observations. We demonstrated the tractability of the proposed model by providing a learning framework and a planning algorithm. In the learning framework, we proposed to use Bayesian nonparametric methods to learn the VAR-POMDP model from demonstrations effectively. We proved that the PBVI algorithm can be extended to the VAR-POMDP model to solve the model checking problem for the bounded-until specification in PCTL. In both the learning and planning processes, approximations were used to estimate the parameters of the model, including the transition probabilities, the observation distributions and the potential belief points. Evaluating the influence of these approximations on system performance will be future work.
References
 [1] J. Gregory Trafton, Laura M. Hiatt, Anthony M. Harrison, Franklin P. Tamborello II, Sangeet S. Khemlani, and Alan C. Schultz. ACT-R/E: An embodied cognitive architecture for human-robot interaction. Journal of Human-Robot Interaction, 2(1):30–55, 2013.
 [2] Zhikun Wang, Katharina Mülling, Marc Peter Deisenroth, Heni Ben Amor, David Vogt, Bernhard Schölkopf, and Jan Peters. Probabilistic movement modeling for intention inference in human-robot interaction. The International Journal of Robotics Research, 32(7):841–858, 2013.
 [3] Weitian Wang, Rui Li, Yi Chen, and Yunyi Jia. Human intention prediction in human-robot collaborative tasks. In Companion of the 2018 ACM/IEEE International Conference on Human-Robot Interaction, pages 279–280. ACM, 2018.
 [4] Stefanos Nikolaidis, Ramya Ramakrishnan, Keren Gu, and Julie Shah. Efficient model learning from joint-action demonstrations for human-robot collaborative tasks. In Proceedings of the Tenth Annual ACM/IEEE International Conference on Human-Robot Interaction, pages 189–196. ACM, 2015.
 [5] Xiaobin Zhang and Hai Lin. Performance guaranteed human-robot collaboration with POMDP supervisory control. Robotics and Computer-Integrated Manufacturing, 57:59–72, 2019.
 [6] Frank Broz, Illah Nourbakhsh, and Reid Simmons. Designing POMDP models of socially situated tasks. In RO-MAN, 2011 IEEE, pages 39–46. IEEE, 2011.
 [7] Nakul Gopalan and Stefanie Tellex. Modeling and solving human-robot collaborative tasks using POMDPs. In Proceedings of Robotics: Science and Systems, 2015.
 [8] Robin Jaulmes, Joelle Pineau, and Doina Precup. Active learning in partially observable Markov decision processes. In European Conference on Machine Learning, pages 601–608. Springer, 2005.
 [9] Stephane Ross, Brahim Chaib-draa, and Joelle Pineau. Bayes-adaptive POMDPs. In Advances in Neural Information Processing Systems, pages 1225–1232, 2008.
 [10] Wei Zheng, Bo Wu, and Hai Lin. POMDP model learning for human robot collaboration. In 2018 IEEE Conference on Decision and Control (CDC), pages 1156–1161. IEEE, 2018.
 [11] Xiaobin Zhang, Bo Wu, and Hai Lin. Learning based supervisor synthesis of POMDP for PCTL specifications. In 2015 54th IEEE Conference on Decision and Control (CDC), pages 7470–7475. IEEE, 2015.
 [12] Konomu Abe, Hideki Miyatake, and Koji Oguri. A study on switching AR-HMM driving behavior model depending on driver's states. In Intelligent Transportation Systems Conference (ITSC), 2007, pages 806–811. IEEE, 2007.
 [13] Hans Hansson and Bengt Jonsson. A logic for reasoning about time and reliability. Formal Aspects of Computing, 6(5):512–535, 1994.
 [14] Jaedeug Choi and Kee-Eung Kim. Inverse reinforcement learning in partially observable environments. Journal of Machine Learning Research, 12:691–730, 2011.
 [15] Emily Fox, Michael I. Jordan, Erik B. Sudderth, and Alan S. Willsky. Sharing features among dynamical systems with beta processes. In Advances in Neural Information Processing Systems, pages 549–557, 2009.
 [16] Emily B. Fox, Michael C. Hughes, Erik B. Sudderth, and Michael I. Jordan. Joint modeling of multiple time series via the beta process with application to motion capture segmentation. The Annals of Applied Statistics, pages 1281–1313, 2014.
 [17] Jan J. M. M. Rutten. Mathematical Techniques for Analyzing Concurrent and Probabilistic Systems. Number 23. American Mathematical Society, 2004.
 [18] Edward Jay Sondik. The optimal control of partially observable Markov processes. Technical report, Stanford University, Stanford Electronics Labs, 1971.
 [19] Jesse Hoey and Pascal Poupart. Solving POMDPs with continuous or large discrete observation spaces. In IJCAI, pages 1332–1338, 2005.
 [20] Joelle Pineau, Geoff Gordon, and Sebastian Thrun. Point-based value iteration: An anytime algorithm for POMDPs. In IJCAI, volume 3, pages 1025–1032, 2003.