# POMCPOW: An online algorithm for POMDPs with continuous state, action, and observation spaces

###### Abstract

Online solvers for partially observable Markov decision processes have been applied to problems with large discrete state spaces, but continuous state, action, and observation spaces remain a challenge. This paper begins by investigating double progressive widening (DPW) as a solution to this challenge. However, we prove that this modification alone is not sufficient because the belief representations in the search tree collapse to a single particle, causing the algorithm to converge to a policy that is suboptimal regardless of the computation time. The main contribution of the paper is to propose a new algorithm, POMCPOW, that incorporates DPW and weighted particle filtering to overcome this deficiency and handle continuous problems. Simulation results show that these modifications allow the algorithm to succeed where previous approaches fail.


Zachary N. Sunberg and Mykel J. Kochenderfer ({zsunberg, mykel}@stanford.edu)
Aeronautics and Astronautics Dept.
Stanford University
Stanford, CA 94305

## 1 Introduction

The partially observable Markov decision process (POMDP) is a flexible mathematical framework for representing sequential decision problems (?). Once a problem has been formalized as a POMDP, a wide range of solution techniques can be used to solve it. In a POMDP, at each step in time, an agent selects an action, and the state of the world changes stochastically based only on the current state and action. The agent seeks to maximize the expectation of the reward, which is a function of the state and action. However, the agent cannot directly observe the state, and may only make decisions based on observations that are stochastically generated by the state.

Many offline methods have been developed to solve small and moderately sized POMDPs (?). Solving larger POMDPs generally requires the use of online methods (?; ?; ?). One widely used online algorithm is partially observable Monte Carlo planning (POMCP) (?), which is an extension to Monte Carlo tree search that implicitly uses an unweighted particle filter to represent beliefs in the search tree.

POMCP and other online methods can accommodate continuous state spaces, and there has been recent work on solving problems with continuous action spaces (?). However, there has been less progress on problems with continuous observation spaces. This paper presents a new algorithm based on POMCP, which addresses the challenge of solving POMDPs with continuous state, action, and observation spaces. We call the algorithm partially observable Monte Carlo planning with observation widening (POMCPOW).

There are two challenges that make tree search difficult in continuous spaces. The first is that, since the probability of sampling the same real number twice from a continuous random variable is zero, the width of the planning trees explodes on the first step, causing them to be too shallow to be useful (see Fig. 1). POMCPOW resolves this issue with a technique called double progressive widening (DPW) (?). The second issue is that, when DPW is applied, the belief representations used by current solvers collapse to a single state particle, resulting in overconfidence. As a consequence, the solutions obtained are equivalent to QMDP policies, and there is no incentive for information gathering behavior. POMCPOW overcomes this issue by using weighted particle mixtures as belief representations.

This paper proceeds as follows: Section 2 provides an overview of previous online POMDP approaches. Section 3 provides a brief introduction to POMDPs and Monte Carlo tree search. Section 4 presents several algorithms for solving POMDPs on continuous spaces, comments on their behavior, and finally presents the POMCPOW algorithm as a solution. Section 5 then gives experimental validation of the new algorithm.

## 2 Prior Work

Considerable progress has been made in solving large POMDPs. Initially, exact offline solutions to problems with only a few discrete states, actions, and observations were sought by using value iteration and taking advantage of the convexity of the value function (?), although solutions to larger problems were also explored using Monte Carlo simulation and interpolation between belief states (?). Many effective offline planners for discrete problems use point-based value iteration, where a selection of points in the belief space is used for value function approximation (?). Offline solutions for problems with continuous state and observation spaces have also been proposed (?; ?).

There are also various solution approaches that are applicable to specific classes of POMDPs, including continuous problems. For example, ? (?) simplify planning in large domains by assuming that the most likely observation will always be received, which can provide an acceptable approximation in some problems with unimodal observation distributions. ? (?) solve a monitoring problem with continuous spaces with a Gaussian process belief update. ? (?) propose a method for partitioning large observation spaces without information loss, but demonstrate the method only on small state and action spaces that have a modest number of conditional plans. Other methods involve motion-planning techniques (?; ?; ?). In particular, ? (?) present a method to take advantage of the existence of a stabilizing controller in belief space planning. ? (?) perform local optimization with respect to uncertainty on a pre-computed path, and ? (?) devise a hierarchical approach that handles uncertainty in both the robot’s state and the surrounding environment.

General purpose online algorithms for POMDPs have also been proposed. Many early online algorithms focused on point-based belief tree search with heuristics for expanding the trees (?). The introduction of POMCP (?) caused a pivot toward the simple and fast technique of using the same simulations for decision-making and using beliefs implicitly represented as unweighted collections of particles. Determinized sparse partially observable tree (DESPOT) is a similar approach that attempts to achieve better performance by analyzing only a small number of random outcomes in the tree (?). Abstract belief tree (ABT) was designed specifically to accommodate changes in the environment without having to replan from scratch (?).

These methods can all easily handle continuous state spaces (?), but they must be modified to extend to domains with continuous action or observation spaces. Though DESPOT has demonstrated effectiveness on some large problems, we conjecture that since it uses unweighted particle beliefs in its search tree, it will struggle with continuous information gathering problems in a way similar to the failure described later in Theorem 1. ABT has been extended to use generalized pattern search for selecting locally optimal continuous actions, an approach which is especially effective in problems where high precision is important (?). Continuous observation Monte Carlo tree search (COMCTS) constructs observation classification trees to automatically partition the observation space in a POMCP-like approach (?).

Although research has yielded effective solution techniques for many classes of problems, there remains a need for a simple, general purpose online POMDP solver that can handle continuous spaces, especially continuous observation spaces.

## 3 Background

This section reviews mathematical formulations for sequential decision problems and some existing solution approaches. The discussion assumes familiarity with Markov decision processes (?), particle filtering (?), and Monte Carlo tree search (?), but reviews some details for clarity.

### 3.1 POMDPs

The Markov decision process (MDP) and partially observable Markov decision process (POMDP) can represent a wide range of sequential decision making problems. In a Markov decision process, an agent takes actions that affect the state of the system and seeks to maximize the expected value of the rewards it collects (?). Formally, an MDP is defined by the 5-tuple $(\mathcal{S}, \mathcal{A}, \mathcal{T}, \mathcal{R}, \gamma)$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $\mathcal{T}$ is the transition model, $\mathcal{R}$ is the reward function, and $\gamma$ is the discount factor. The transition model can be encoded as a set of probabilities; specifically, $\mathcal{T}(s' \mid s, a)$ denotes the probability that the system will transition to state $s'$ given that action $a$ is taken in state $s$. In continuous problems, $\mathcal{T}$ is defined by probability density functions.

In a POMDP, the agent cannot directly observe the state. Instead, the agent only has access to observations that are generated probabilistically based on the actions and latent true states. A POMDP is defined by the 7-tuple $(\mathcal{S}, \mathcal{A}, \mathcal{T}, \mathcal{R}, \Omega, \mathcal{Z}, \gamma)$, where $\mathcal{S}$, $\mathcal{A}$, $\mathcal{T}$, $\mathcal{R}$, and $\gamma$ have the same meaning as in an MDP. Additionally, $\Omega$ is the observation space, and $\mathcal{Z}$ is the observation model: $\mathcal{Z}(o \mid s, a, s')$ is the probability or probability density of receiving observation $o$ in state $s'$ given that the previous state and action were $s$ and $a$.

Information about the state may be inferred from the entire history of previous actions and observations together with the agent’s initial information about the state, $b_0$. Thus, in a POMDP, the agent’s policy is a function mapping each possible history, $h_t = (b_0, a_0, o_1, \ldots, a_{t-1}, o_t)$, to an action. In some cases, the probability of each state can be calculated based on the history. This distribution is known as a belief, with $b_t(s)$ denoting the probability, or probability density, of state $s$.

The belief is a sufficient statistic for constructing an optimal policy $\pi$: when $a_t = \pi(b_t)$, the cumulative reward, or “value function,” is maximized for the POMDP. Given the POMDP model, each subsequent belief can be calculated using Bayes’ rule (?; ?). However, the exact update is computationally intensive, so approximate approaches such as particle filtering are usually used in practice (?).
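The approximate belief update mentioned above can be sketched as a basic importance-resampling particle filter. This is a minimal illustration, not the paper's implementation; `gen` and `obs_density` are hypothetical stand-ins for a problem's transition sampler and observation model $\mathcal{Z}$.

```python
import random

def particle_filter_update(particles, a, o, gen, obs_density, n=None):
    """Approximate Bayesian belief update by importance resampling.
    `gen(s, a)` samples a next state; `obs_density(o, s, a, sp)` evaluates
    the observation likelihood Z(o | s, a, s')."""
    n = n or len(particles)
    propagated, weights = [], []
    for _ in range(n):
        s = random.choice(particles)              # sample from the prior belief
        sp = gen(s, a)                            # propagate through the dynamics
        propagated.append(sp)
        weights.append(obs_density(o, s, a, sp))  # weight by observation likelihood
    # resample proportionally to the weights to approximate the posterior
    return random.choices(propagated, weights=weights, k=n)
```

States consistent with the received observation receive high weight and dominate the resampled posterior, which is the behavior the exact Bayes update would produce.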

#### Generative Models

For many problems, it can be difficult to explicitly determine or represent the probability distributions $\mathcal{T}$ or $\mathcal{Z}$. Some solution approaches, however, only require samples of the state transitions and observations. A generative model, $G$, stochastically generates a new state, reward, and, in the partially observable case, observation, given the current state and action: $(s', r) = G(s, a)$ for an MDP, or $(s', o, r) = G(s, a)$ for a POMDP. A generative model implicitly defines $\mathcal{T}$ and $\mathcal{Z}$, even when they cannot be explicitly represented.
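A generative model for a toy problem might look like the following sketch. The dynamics, noise levels, and reward here are invented for illustration only; the point is the interface, which returns a sampled next state, observation, and reward without ever writing down $\mathcal{T}$ or $\mathcal{Z}$ explicitly.

```python
import random

def G(s, a):
    """Toy generative model for a 1-D POMDP: the state drifts by the action
    plus noise, the observation is the state corrupted by noise, and the
    reward penalizes distance from the origin (all specifics illustrative)."""
    sp = s + a + random.gauss(0.0, 0.1)   # next state, implicitly ~ T(s' | s, a)
    o = sp + random.gauss(0.0, 0.5)       # observation, implicitly ~ Z(o | s, a, s')
    r = -abs(sp)                          # reward R(s, a)
    return sp, o, r
```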

#### Belief MDPs

Every POMDP is equivalent to an MDP where the state space of the MDP is the space of possible beliefs. The reward function is the expectation with respect to the belief distribution of the state-action reward function. This MDP will be referred to as a “belief MDP”. The Bayesian update of the belief serves as a generative model for the belief space MDP.

### 3.2 MCTS with Double Progressive Widening

Monte Carlo tree search (MCTS) is an effective and widely studied algorithm for online decision-making. It works by incrementally creating a policy tree consisting of alternating layers of state nodes and action nodes using a generative model and estimating the state-action value function, $Q(s, a)$, at each of the action nodes. The upper confidence tree (UCT) version expands the tree by selecting nodes that maximize the upper confidence bound

$$Q(s, a) + c \sqrt{\frac{\log N(s)}{N(s, a)}} \qquad (1)$$

where $N(s, a)$ is the number of times the action node has been visited, $N(s) = \sum_{a} N(s, a)$, and $c$ is a problem-specific parameter that governs the amount of exploration in the tree (?).
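The UCT selection rule can be sketched as follows. This is a minimal illustration assuming `Q` and `N` are dictionaries keyed by (node, action) pairs; an untried action receives an infinite bound and is therefore always selected first.

```python
import math

def ucb_select(h, actions, Q, N, c):
    """Select the action maximizing the UCT upper confidence bound."""
    def ucb(a):
        n_ha = N.get((h, a), 0)
        if n_ha == 0:
            return math.inf                        # untried actions are always favored
        n_h = sum(N.get((h, b), 0) for b in actions)
        return Q[(h, a)] + c * math.sqrt(math.log(n_h) / n_ha)
    return max(actions, key=ucb)
```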

#### Double Progressive Widening

In cases where the action and state spaces are large or continuous, the MCTS algorithm will produce trees that are very shallow. In fact, if the action space is continuous, the UCT algorithm will never try the same action twice: the bound in Eq. (1) is infinite for any untried action, so untried actions are always favored. Moreover, if the state space is continuous and the transition probability density is finite, the probability of sampling the same state twice from $\mathcal{T}$ is zero. Because of this, simulations will never pass through the same state node twice, and a tree below the first layer of state nodes will never be constructed.

In progressive widening, the number of children of a node is artificially limited to $k N^{\alpha}$, where $N$ is the number of times the node has been visited and $k$ and $\alpha$ are hyperparameters (?). Originally, progressive widening was applied to the action space and was found to be especially effective when a set of preferred actions was tried first (?). The term double progressive widening refers to progressive widening in both the state and action space. When the number of state nodes exceeds the limit, instead of simulating a new state transition, one of the previously generated states is chosen with probability proportional to the number of times it has been previously generated.
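The widening criterion can be sketched in a few lines. This is a simplified illustration (the revisit here is uniform rather than weighted by generation counts, and `sample_new` is a hypothetical sampler for a fresh child node):

```python
import random

def widened_child(children, visits, k, alpha, sample_new):
    """Progressive widening (sketch): create a new child only while the
    number of children is at most k * visits**alpha; otherwise revisit
    an existing child."""
    if len(children) <= k * visits ** alpha:
        children.append(sample_new())
        return children[-1]
    return random.choice(children)   # simplified uniform revisit
```

Because the limit grows only as $N^{\alpha}$ with $\alpha < 1$, the branching factor stays sublinear in the visit count, so simulations revisit existing children often enough to build a deep tree.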

### 3.3 POMCP

A conceptually straightforward way to solve a POMDP using MCTS is to apply it to the corresponding belief MDP. Indeed, many tree search techniques have been applied to POMDP problems in this way (?). However, when the Bayesian belief update is used, this approach is computationally expensive. POMCP and its successors, DESPOT and ABT, can tackle problems many times larger than their predecessors because they use simulations of state trajectories, rather than full belief trajectories, to build the belief tree.

Each of the nodes in a POMCP tree corresponds to a history proceeding from the root belief and terminating with an action or observation. In the search phase of POMCP tree construction, state trajectories are simulated through this tree. At each action node, the rewards from the simulations that pass through the node are used to estimate the $Q$ function. This simple approach has been shown to work well for large discrete problems (?). However, when the action or observation space is continuous, the tree degenerates and does not extend beyond a single layer of nodes because each new simulation produces a new branch.

## 4 Algorithms

This section presents several variations of the POMCP algorithm, comments on their behavior, and concludes by introducing the POMCPOW algorithm.

The three algorithms in this section share a common structure. For all algorithms, the entry point for the decision making process is the Plan procedure, which takes the current belief, , as an input. The algorithms also share the same ActionProgWiden function to control progressive widening of the action space. These components are listed in listing 1. The difference between the algorithms is in the Simulate function.

The following variables are used in the listings and text: $h$ represents a history $(b_0, a_1, o_1, \ldots, a_k, o_k)$, and $ha$ and $hao$ are shorthand for histories with $a$ and $(a, o)$ appended to the end, respectively; $d$ is the depth to explore, with $d_{\max}$ the maximum depth; $C$ is a list of the children of a node; $N$ is a count of the number of visits; and $M$ is a count of the number of times that a history has been generated by the model. The set of states associated with a node is denoted $B$, and $W$ is a set of weights corresponding to those states. Finally, $Q(ha)$ is an estimate of the value of taking action $a$ after observing history $h$. $C$, $N$, $M$, $B$, $W$, and $Q$ are all implicitly initialized to $0$ or $\emptyset$. The Rollout procedure runs a simulation with a default rollout policy, which can be based on the history or fully observed state, for $d$ steps and returns the discounted reward.

### 4.1 POMCP-DPW

The first new algorithm that we consider is POMCP with double progressive widening (POMCP-DPW). In this algorithm, listed in Algorithm 2, the number of new children sampled from any node in the tree is limited by DPW using the parameters $k_a$, $\alpha_a$, $k_o$, and $\alpha_o$. In the case where the simulated observation is rejected (line 13), the tree search is continued with an observation selected in proportion to the number of times, $M$, it has been previously simulated (line 14), and a state is sampled from the associated belief (line 15).
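The observation-branching step just described can be sketched as below. This is an illustrative fragment, not the full Algorithm 2: `obs_counts` plays the role of $M$, mapping each observation branch to the number of times it has been generated, and `gen_obs` is a hypothetical sampler that simulates a fresh observation from the generative model.

```python
import random

def next_observation(obs_counts, visits, gen_obs, k_o, alpha_o):
    """Observation-space widening in the style of POMCP-DPW (sketch)."""
    if len(obs_counts) <= k_o * visits ** alpha_o:
        o = gen_obs()                     # widen: simulate a new observation branch
    else:
        # limit reached: reuse an old branch with probability proportional to M
        branches = list(obs_counts)
        o = random.choices(branches, weights=[obs_counts[b] for b in branches])[0]
    obs_counts[o] = obs_counts.get(o, 0) + 1
    return o
```

With a continuous observation space, every call to `gen_obs` returns a fresh value, so the widening limit is the only thing keeping the number of branches finite.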

This algorithm obtained remarkably good solutions for a very large autonomous freeway driving POMDP with multiple vehicles (up to 40 continuous fully observable states and 72 continuous correlated partially observable states) (?). To our knowledge, this is the first work applying progressive widening to POMCP.

We conjecture that this algorithm asymptotically converges to the optimal solution for POMDPs with discrete observation spaces. However, on continuous observation spaces, POMCP-DPW is clearly suboptimal. In particular, it finds a QMDP policy, that is, the solution under the assumption that the problem becomes fully observable after one time step (?; ?). This is expressed formally in Theorem 1.

###### Definition 1 (QMDP value).

Let $Q_{MDP}(s, a)$ be the optimal state-action value function assuming full observability, starting by taking action $a$ in state $s$. The QMDP value at belief $b$, $Q_{MDP}(b, a)$, is the expected value of $Q_{MDP}(s, a)$ when $s$ is distributed according to $b$.
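Definition 1 amounts to a simple expectation, which can be written directly. In this sketch the belief is a finite dictionary mapping states to probabilities, and `q_mdp` is any fully observable state-action value function:

```python
def qmdp_value(belief, a, q_mdp):
    """QMDP value (Definition 1): the expectation of the fully observable
    state-action value under the belief {state: probability}."""
    return sum(p * q_mdp(s, a) for s, p in belief.items())
```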

###### Theorem 1 (POMCP-DPW convergence to QMDP).

If a POMDP meets the following conditions: 1) the observation space is continuous with a finite observation probability density function, and 2) the regularity hypothesis for decision nodes is met and the exploration coefficients are appropriately chosen, then POMCP-DPW with polynomial exploration will produce a value function estimate, $\hat{Q}$, that converges to the QMDP value for the problem. Specifically, there exists a constant $C > 0$ such that after $n$ iterations,

$$\left| \hat{Q}(b, a) - Q_{MDP}(b, a) \right| \leq \frac{C}{n^{\beta}}$$

for some rate $\beta > 0$ determined by the exploration parameters, exponentially surely in $n$, for every action $a$ (see ? (?) for the definitions of the regularity hypothesis, exploration coefficients, polynomial exploration, and exponential surety).

###### Proof.

(Sketch) Because the observation space is continuous with a finite density function, the generative model will (with probability one) produce a unique observation each time it is queried. Thus, for every generated history $hao$, only one state will ever be inserted into $B(hao)$ (line 10, Algorithm 2), and therefore $B(hao)$ is merely an alias for that single state. Since each history node is indistinguishable from a state node and the transition models for a POMDP and the corresponding fully observable MDP are the same, aside from the root node and its immediate children, the tree created by POMCP-DPW is identical to a tree created by applying MCTS-DPW to the fully observable MDP.

POMCP-DPW is thus equivalent to MCTS-DPW applied to a new MDP with an additional special state at the root node representing the current belief. The transition model from this special state to the fully observable states is equivalent to sampling a state from the belief and then applying the MDP transition model. The optimal state-action value for this new problem at the special belief state is $Q_{MDP}(b, a)$ for the original POMDP at the current belief, so the results given by ? (?) can be applied directly. ∎

As a result of Theorem 1, the action chosen by POMCP-DPW will match a QMDP policy (a policy of actions that maximize the QMDP value) with high precision exponentially surely (see Corollary 1 of ? (?)).
For many problems this is a very useful solution; indeed, a useful online QMDP tree search algorithm could be created by deliberately constructing a tree with a single root belief node and fully observable state nodes below it. However, because it neglects the value of information, a QMDP policy is suboptimal for problems where information gathering is important (?; ?).
This phenomenon is demonstrated in simulation in Section 5.1.

### 4.2 Particle Filter Tree

Another algorithm that one might consider for solving continuous POMDPs online is MCTS-DPW on the equivalent belief-MDP. Since the Bayesian belief update is usually computationally intractable, a particle filter is used. This approach will be referred to as the particle filter tree approach.

This algorithm has two shortcomings. First, though the belief representation has many particles instead of the single particle in POMCP-DPW, it is static after the first time the node is reached and does not grow with more simulations. Since no new particles are added to the beliefs, there is a finite probability that a valuable state will never be included in the tree, so any universal guarantees of asymptotic optimality are precluded. Second, if computational power is limited, the user must explicitly decide how much computation to devote to belief updates (the number of particles) versus decision making (the number of tree search simulations).

### 4.3 POMCPOW

In order to address the suboptimality of POMCP-DPW and the shortcomings of particle filter trees, we now propose a new algorithm, POMCPOW, shown in Algorithm 3. In this algorithm, the belief updates are weighted, but they also expand gradually as more simulations are added. Furthermore, since the richness of the belief representation is related to the number of times the node is visited, beliefs that are more likely to be reached by the optimal policy have more particles. At each step, the simulated state is inserted into the weighted particle collection that represents the belief (line 13), and a new state is sampled from that belief (line 15). A simple illustration of the tree is shown in Figure 2 to contrast with a POMCP-DPW tree. The resampling can be implemented efficiently using binary search over cumulative weights, so its cost grows only logarithmically with the number of particles.
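The growing weighted belief and its binary-search sampler can be sketched as follows. This is an illustrative data structure, not the paper's implementation; it stores a running cumulative weight so each draw costs $O(\log n)$ in the number of particles.

```python
import bisect
import random

class WeightedBelief:
    """Growing weighted particle mixture in the style of POMCPOW's
    in-tree beliefs (sketch)."""

    def __init__(self):
        self.states = []
        self.cumw = []          # running cumulative weights

    def insert(self, s, w):
        """Insert a particle with weight w (line 13 of the algorithm)."""
        total = self.cumw[-1] if self.cumw else 0.0
        self.states.append(s)
        self.cumw.append(total + w)

    def sample(self):
        """Draw a state with probability proportional to its weight
        via binary search over the cumulative weights (line 15)."""
        u = random.random() * self.cumw[-1]
        return self.states[bisect.bisect_right(self.cumw, u)]
```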

We conjecture that, unlike particle filter trees, since the beliefs are constantly being improved, the POMCPOW algorithm has similar convergence properties to MCTS-DPW (?).

It should be noted that, while POMCP and POMCP-DPW only require a generative model of the problem, both POMCPOW and particle filter belief trees require explicit knowledge of the observation model $\mathcal{Z}$ because they use weighted particle filters. This extra requirement is often satisfied in practice, and the algorithm can be used wherever an importance resampling particle filter can be.

## 5 Experiments

Numerical simulation experiments were conducted to evaluate POMCPOW’s performance on POMDPs with continuous spaces. The first set of experiments illustrates that the POMCP-DPW algorithm will not choose to take exploratory actions. The second set demonstrates the superior simulation efficiency of POMCPOW compared to the Particle Filter Tree algorithm.

#### Software

Open source Julia implementations of the solvers are available as part of the POMDPs.jl suite (https://github.com/JuliaPOMDP/POMDPs.jl), and the source code for the experiments is available at https://github.com/zsunberg/ContinuousPOMDPTreeSearchExperiments.jl.

### 5.1 Light-Dark Target

The first experimental domain is a light-dark problem similar to the one studied by ? (?), illustrated in Fig. 3. An agent must move through a 2D space until it reaches a target consisting of all points within a fixed radius of the origin. The dynamics are deterministic, and there is a negative reward of one unit for every step. The problem terminates immediately when the agent reaches the target, and a discount factor bounds the reward. The agent’s initial position is normally distributed, and it receives a noisy measurement of its position at each step. Measurements are much more accurate when the agent is in the light; specifically, the standard deviation of the Gaussian observation noise depends on the agent’s position.

The results of 500 simulation runs are shown in Table 1. All algorithms use the same particle filter for outer-loop belief updates. The heuristic policy moves toward the light until the standard deviation of the belief falls below a threshold and then shoots for the target. The greedy policy assumes that its position is the mean of the particles and immediately shoots for the target.

For this problem, the best performing algorithm is the particle filter belief tree. There are two reasons that it outperforms POMCPOW. First, the generative model is computationally simple, which means that a significant portion of time in POMCPOW is spent calculating the upper confidence bounds and progressive widening limits. Hence, the particle filter tree algorithm can use nearly an order of magnitude more simulations than POMCPOW while still using less computational time. Second, this problem is entirely focused on information gathering, and the solution to the fully observable version of the problem is trivial: move directly to the target. POMCPOW only simulates a single particle to its leaf nodes, so it can only use a fully observable rollout, which is uninformative in this case. The particle filter tree simulates a full particle belief to its nodes, allowing it to perform a belief rollout. Accordingly, the particle filter tree succeeds with only one tenth of the iterations of POMCPOW.

The most important result of these simulations is that POMCP-DPW does not outperform the greedy algorithm, confirming Theorem 1 for this case: POMCP-DPW cannot choose information-gathering actions in domains with continuous observations, even with a large number of iterations.

| Algorithm | Reward | Time (s) |
|---|---|---|
| PFT | | |
| POMCPOW | | |
| Heuristic | | – |
| POMCPOW | | |
| POMCP-DPW | | |
| Greedy | | – |

### 5.2 Van Der Pol Tag

The second experimental problem is called Van der Pol tag. In this problem an agent moves through 2D space to try to tag a target that has a random unknown initial position. The agent always travels at the same speed, but chooses a direction of travel and whether to take an observation. If the agent chooses to measure, it receives an observation of the bearing to the target with small Gaussian noise added, but pays a measurement cost; if it chooses not to measure, it receives a random angle observation. There is a small cost for each step and a large reward for tagging the target (being within a given distance of it).

The target moves following a two-dimensional form of the Van der Pol oscillation, moving according to the differential equations

$$\dot{x} = \mu \left( x - \frac{x^3}{3} - y \right), \qquad \dot{y} = \frac{x}{\mu},$$

where $\mu$ is a fixed stiffness parameter. Gaussian noise is added to the position at the end of each step. Runge-Kutta fourth order integration is used to propagate the state.
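The deterministic part of one propagation step can be sketched with a standard fourth-order Runge-Kutta integrator over the Van der Pol dynamics. The value `mu = 2.0` here is an illustrative assumption, not a constant reproduced from the paper, and the process noise is omitted for clarity.

```python
def vdp_rk4_step(x, y, dt, mu=2.0):
    """One RK4 step of the Van der Pol target dynamics (sketch;
    mu is an assumed illustrative value)."""
    def f(x, y):
        # Lienard-form Van der Pol vector field
        return mu * (x - x**3 / 3.0 - y), x / mu

    k1x, k1y = f(x, y)
    k2x, k2y = f(x + dt / 2 * k1x, y + dt / 2 * k1y)
    k3x, k3y = f(x + dt / 2 * k2x, y + dt / 2 * k2y)
    k4x, k4y = f(x + dt * k3x, y + dt * k3y)
    return (x + dt / 6 * (k1x + 2 * k2x + 2 * k3x + k4x),
            y + dt / 6 * (k1y + 2 * k2y + 2 * k3y + k4y))
```

Repeated steps trace the oscillator's bounded limit cycle, which is what makes the target's motion predictable in principle but nonlinear enough that numerical integration dominates the simulation cost.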

Compared to the light-dark problem, the fully observable task of choosing a constant-speed course to tag the target is more difficult, so POMCPOW’s ability to construct a larger search tree becomes an advantage. Numerical integration in the state transition compounds this advantage because these calculations take a larger portion of the processing time.

In this domain, we limited the computation time per step. Figure 4 shows the performance of POMCPOW compared to particle filter tree solvers at a range of computation times. POMCPOW clearly outperforms the alternative at all time limits and exhibits good performance even with very limited computation time.

Discretization is another widely used approach for solving continuous POMDPs. Since current online solvers such as POMCP can easily handle continuous state spaces, only the action and observation spaces need to be discretized. Figure 5 shows the performance of POMCP, which discretization makes applicable, alongside POMCPOW at several discretization granularities. Clearly, continuous POMCPOW outperforms coarse discretization. Additionally, even when a discrete approximation is used, POMCPOW performs better than POMCP and is less sensitive to the coarseness of the discretization.

## 6 Conclusion

In this paper, we have proposed a new general-purpose online POMDP algorithm that is able to solve problems with continuous state, action, and observation spaces. This is a qualitative advance in capability over previous solution techniques, with the only major new requirement being explicit knowledge of the observation distribution, a condition that is commonly satisfied in practice.

This study has yielded several insights into the behavior of tree search algorithms for POMDPs. We explained why POMCP-DPW is unable to choose information-gathering actions in continuous spaces and noted that, although the particle filter tree approach excels at information gathering, it can struggle when the system dynamics are more complex. POMCPOW is able to balance both of these tasks, and outperformed the discretization approach in our tests. The computational experiments carried out for this work used only small toy problems (though they are quite large compared to many of the POMDPs previously studied in the literature), but other recent research (?) shows that POMCP-DPW is effective in very large and complex realistic domains, which provides clear evidence that POMCPOW will also be successful there.

Further work will study the theoretical properties of the algorithm. In addition, we will study better ways for choosing continuous actions, such as generalized pattern search (?) and hierarchical optimistic optimization (?).

## References

- [Agha-Mohammadi, Chakravorty, and Amato 2011] Agha-Mohammadi, A.-A.; Chakravorty, S.; and Amato, N. M. 2011. FIRM: Feedback controller-based information-state roadmap - a framework for motion planning under uncertainty. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).
- [Auger, Couetoux, and Teytaud 2013] Auger, D.; Couetoux, A.; and Teytaud, O. 2013. Continuous upper confidence trees with polynomial exploration–consistency. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, 194–209. Springer.
- [Bai, Hsu, and Lee 2014] Bai, H.; Hsu, D.; and Lee, W. S. 2014. Integrated perception and planning in the continuous space: A POMDP approach. International Journal of Robotics Research 33(9):1288–1302.
- [Brechtel, Gindele, and Dillmann 2013] Brechtel, S.; Gindele, T.; and Dillmann, R. 2013. Solving continuous POMDPs: Value iteration with incremental learning of an efficient space representation. In International Conference on Machine Learning (ICML), 370–378.
- [Browne et al. 2012] Browne, C. B.; Powley, E.; Whitehouse, D.; Lucas, S. M.; Cowling, P. I.; Rohlfshagen, P.; Tavener, S.; Perez, D.; Samothrakis, S.; and Colton, S. 2012. A survey of Monte Carlo tree search methods. IEEE Transactions on Computational Intelligence and AI in games 4(1):1–43.
- [Bry and Roy 2011] Bry, A., and Roy, N. 2011. Rapidly-exploring random belief trees for motion planning under uncertainty. In IEEE International Conference on Robotics and Automation (ICRA), 723–730.
- [Couëtoux et al. 2011] Couëtoux, A.; Hoock, J.-B.; Sokolovska, N.; Teytaud, O.; and Bonnard, N. 2011. Continuous upper confidence trees. In Learning and Intelligent Optimization.
- [Goldhoorn et al. 2014] Goldhoorn, A.; Garrell, A.; Alquézar, R.; and Sanfeliu, A. 2014. Continuous real time POMCP to find-and-follow people by a humanoid service robot. In IEEE-RAS International Conference on Humanoid Robots.
- [Hoey and Poupart 2005] Hoey, J., and Poupart, P. 2005. Solving POMDPs with continuous or large discrete observation spaces. In International Joint Conference on Artificial Intelligence (IJCAI), 1332–1338.
- [Indelman, Carlone, and Dellaert 2015] Indelman, V.; Carlone, L.; and Dellaert, F. 2015. Planning in the continuous domain: A generalized belief space approach for autonomous navigation in unknown environments. International Journal of Robotics Research 34(7):849–882.
- [Kaelbling, Littman, and Cassandra 1998] Kaelbling, L. P.; Littman, M. L.; and Cassandra, A. 1998. Planning and acting in partially observable stochastic domains. Artificial Intelligence 101:99–134.
- [Kochenderfer 2015] Kochenderfer, M. J. 2015. Decision Making Under Uncertainty: Theory and Application. MIT Press.
- [Kurniawati and Yadav 2016] Kurniawati, H., and Yadav, V. 2016. An online POMDP solver for uncertainty planning in dynamic environment. In Robotics Research. Springer. 611–629.
- [Kurniawati, Hsu, and Lee 2008] Kurniawati, H.; Hsu, D.; and Lee, W. S. 2008. SARSOP: Efficient point-based POMDP planning by approximating optimally reachable belief spaces. In Robotics: Science and Systems.
- [Littman, Cassandra, and Kaelbling 1995] Littman, M. L.; Cassandra, A. R.; and Kaelbling, L. P. 1995. Learning policies for partially observable environments: Scaling up. In International Conference on Machine Learning (ICML).
- [Mansley, Weinstein, and Littman 2011] Mansley, C. R.; Weinstein, A.; and Littman, M. L. 2011. Sample-based planning for continuous action Markov decision processes. In International Conference on Automated Planning and Scheduling (ICAPS).
- [Melchior and Simmons 2007] Melchior, N. A., and Simmons, R. 2007. Particle RRT for path planning with uncertainty. In IEEE International Conference on Robotics and Automation (ICRA).
- [Morere, Marchant, and Ramos 2016] Morere, P.; Marchant, R.; and Ramos, F. 2016. Bayesian optimisation for solving continuous state-action-observation POMDPs. In Advances in Neural Information Processing Systems (NIPS).
- [Pas 2012] Pas, A. 2012. Simulation based planning for partially observable Markov decision processes with continuous observation spaces. Master’s thesis, Maastricht University.
- [Platt et al. 2010] Platt, Jr., R.; Tedrake, R.; Kaelbling, L.; and Lozano-Perez, T. 2010. Belief space planning assuming maximum likelihood observations. In Robotics: Science and Systems.
- [Prentice and Roy 2009] Prentice, S., and Roy, N. 2009. The belief roadmap: Efficient planning in belief space by factoring the covariance. International Journal of Robotics Research 28(11-12):1448–1465.
- [Ross et al. 2008] Ross, S.; Pineau, J.; Paquet, S.; and Chaib-Draa, B. 2008. Online planning algorithms for POMDPs. Journal of Artificial Intelligence Research 32:663–704.
- [Seiler, Kurniawati, and Singh 2015] Seiler, K. M.; Kurniawati, H.; and Singh, S. P. N. 2015. An online and approximate solver for POMDPs with continuous action space. In IEEE International Conference on Robotics and Automation (ICRA), 2290–2297.
- [Silver and Veness 2010] Silver, D., and Veness, J. 2010. Monte-Carlo planning in large POMDPs. In Advances in Neural Information Processing Systems (NIPS).
- [Somani et al. 2013] Somani, A.; Ye, N.; Hsu, D.; and Lee, W. S. 2013. DESPOT: Online POMDP planning with regularization. In Advances in Neural Information Processing Systems (NIPS), 1772–1780.
- [Sunberg, Ho, and Kochenderfer 2017] Sunberg, Z. N.; Ho, C. J.; and Kochenderfer, M. J. 2017. The value of inferring the internal state of traffic participants for autonomous freeway driving. In American Control Conference (ACC).
- [Thrun, Burgard, and Fox 2005] Thrun, S.; Burgard, W.; and Fox, D. 2005. Probabilistic Robotics. MIT Press.
- [Thrun 1999] Thrun, S. 1999. Monte Carlo POMDPs. In Advances in Neural Information Processing Systems (NIPS), volume 12, 1064–1070.
- [Van Den Berg, Patil, and Alterovitz 2012] Van Den Berg, J.; Patil, S.; and Alterovitz, R. 2012. Motion planning under uncertainty using iterative local optimization in belief space. International Journal of Robotics Research 31(11):1263–1278.