# Reinforcement Learning for POMDP: Partitioned Rollout and Policy Iteration with Application to Autonomous Sequential Repair Problems

## Abstract

In this paper we consider infinite horizon discounted dynamic programming problems with finite state and control spaces, and partial state observations. We discuss an algorithm that uses multistep lookahead, truncated rollout with a known base policy, and a terminal cost function approximation. This algorithm is also used for policy improvement in an approximate policy iteration scheme, where successive policies are approximated by using a neural network classifier. A novel feature of our approach is that it is well suited for distributed computation through an extended belief space formulation and the use of a partitioned architecture, which is trained with multiple neural networks. We apply our methods in simulation to a class of sequential repair problems where a robot inspects and repairs a pipeline with potentially several rupture sites under partial information about the state of the pipeline.

## I Introduction

We consider the classical partial observation Markovian decision problem (POMDP) with a finite number of states and controls, and discounted additive cost over an infinite horizon. The optimal solution is typically intractable, and several suboptimal solution/reinforcement learning approaches have been proposed. Amongst these, are point-based value iteration (see e.g., [PGT06, SpV05, SPK13]), approximate policy iteration methods based on the use of finite state controllers (see e.g., [KLC98, Han98, MPK99]), the use of Monte Carlo tree search and adaptive sampling methods for multistep lookahead with terminal cost function approximation (see e.g., [RPP08, SiV10]), and discretization/aggregation methods based on the solution of a related perfectly observable Markovian decision problem (see e.g., [ZhH01, YuB04, Ber19]). There have also been proposals of policy gradient and related actor-critic methods (see [BaB01, Yu05, ELP12]) that are largely unrelated to methodology proposed here.

In this paper we focus on methods of policy iteration (PI) that are based on rollout, and approximations in policy and value space. They update a policy by using truncated rollout with that policy, and a terminal cost function approximation. Several earlier works include elements of our algorithmic approach. In particular, related methods have been applied with some variations to perfect state information problems, notably in the backgammon algorithm of Tesauro and Galperin [TeG96], and in the AlphaGo program [SHM16], as well as in the classification-based PI algorithm of Lagoudakis and Parr [LaP03] (see also [DiL08, LGM10, LiW14]). The AlphaZero program also involves a similar approximation architecture, but in its published description [SHS17], it does not use rollout, likely because the use of long multistep lookahead in conjunction with Monte Carlo tree search makes the use of rollout unnecessary. A further novel feature of our algorithms is the use of a partitioned architecture, involving multiple policy and value neural networks, which is well-suited for distributed implementation. Partitioning in conjunction with asynchronous PI was originally proposed by Bertsekas and Yu [BeY10], and further developed in the papers [BeY12] and [YuB13] [see also the books [Ber12] (Section 2.6) and [Ber18] (Section 2.6) for descriptions, analysis, and extensions]. However, most of this research was focused on the case of perfect state information, and lookup table representations of policies and cost functions. Thus, while most of the principal elements of our approach have individually appeared in various forms in earlier perfect state information algorithmic frameworks, their combination has not been brought together into a single algorithm, and moreover they have not been adapted to the special challenges posed by POMDP.

Because of its simulation-based rollout character, the methodology of this paper depends critically on the finiteness of the control space, but it does not rely on the piecewise linear structure of the finite horizon optimal cost function of POMDP. It can be extended to POMDP with infinite state space but finite control space, although we have not considered this possibility in this paper. We describe error bounds to guide the implementation of our algorithms, and we provide results of computational experimentation showing that our methods are viable and well-suited to the POMDP structure.

We apply our methods to a class of problems in robotics, involving sequential repairs, and search and rescue, where the POMDP model is particularly well-suited to deal with partial state information. Autonomous robots in search and rescue have been viewed as one of the robotics applications where POMDP approaches need further development and where these approaches can have great impact [SARuavs, Cassandra2003, decisionMakingAutonomousVeh, controlRusUnderwater, serviceRobotGaurav, decisionMakingRus]. Indeed, exploration and learning in unknown environments has been identified as one of the grand challenges of Science Robotics [grandChallengesWood]. These problems are very complex, characterized by large state spaces and constrained communication. Many tools in machine learning and artificial intelligence have recently been developed for tackling difficult problems in robotics [AIHowDeepRL, RLWolfram, deepLearningAIAbeel, AIDeliberation]. Of particular relevance to the current paper are those works that involve decision making under uncertainty and POMDP frameworks [POMDPAmatoDecentralized, POMDPSycara].

As an application of our methodology, we consider a robot that must decide on the sequence of linearly arranged pipeline locations to explore and/or repair using prior information and observations made in situ. The actual damage of each location is initially unknown and can become worse if not repaired. This process is modeled by a Markov chain with known transition probabilities. The problem has a very large number of states ( in our largest implementation). Our experiments demonstrate that our methodology is well suited to robotics applications that involve: i) large state spaces and long planning horizons, which exacerbate both the curse of dimensionality and the curse of history, and ii) decaying environments as in sequential repair problems, where it is important to use a policy that can identify and execute critical actions in minimum time.

We compare the performance of our proposed method with the POMCP and DESPOT methods from [SiV10, despot], respectively, and we showcase the generality of our method by applying it to more complex versions of the pipeline problem: a two-dimensional grid pipeline, and a multi-robot variant of the problem. The use of distributed computation within our framework also suggests future applicability to multi-robot problems with asynchronous communication and/or bandwidth constraints, which are of great importance in robotics [gilMultiRobot, gilISRR2019, gilMultiRobotCov, controlRusUnderwater, POMDPAmatoDecentralized].

We also provide theoretical support for our methodology in the form of a performance bound (Prop. 1), which shows improvement of the rollout policy over the base policy (approximately). Alternative methodologies, such as POMCP and DESPOT, do not enjoy a comparable level of theoretical support.

In summary, the contributions of the paper include: 1) the development of an algorithmic framework for finding approximately optimal policies for large state space POMDP, including the development of distributed PI methods with a partitioned architecture, 2) performance bounds to support the architectural structure, and 3) implementation and validation in simulation of the proposed methods on a pipeline repair problem, whose character is shared by broad classes of problems in robotics.

## Ii Extended Belief Space Problem Formulation

The starting point of our paper is the classical belief space formulation of a POMDP. However, we extend this formulation to make it compatible with a distributed multiprocessor implementation of our algorithm. In particular, we propose to use as state a sufficient statistic that subsumes the belief state in the sense that its value determines the value of the belief state. In this section, we describe this extended belief space problem formulation.

We assume that there are states denoted by and a finite set of controls at each state. We denote by the transition probability from to under , and by the corresponding transition cost. The cost is accumulated over an infinite horizon and is discounted by . At each new generated state , an observation from a finite set is obtained with known probability that depends on and the control that was applied prior to the generation of . The objective is to select each control optimally as a function of the prior history of observations and controls.

A classical approach to this problem is to convert it to a perfect state information problem whose state is the current belief , where is the conditional distribution of the state given the prior history. It is well-known that is a sufficient statistic, which can serve as a substitute for the set of available observations, in the sense that optimal control can be achieved with knowledge of just . In this paper, we consider a more general form of sufficient statistic, which we call the feature state and we denote by . We require that the feature state subsumes the belief state . By this we mean that can be computed exactly knowing . One possibility is for to be the union of and some identifiable characteristics of the belief state, or some compact representation of the measurement history up to the current time (such as a number of most recent measurements, or the state of a finite-state controller). We also make the additional assumption that can be sequentially generated using a known feature estimator . By this we mean that given that the current feature state is , control is applied, and observation is obtained, the next feature can be exactly predicted as .

Clearly, since is a sufficient statistic, the same is true for . Thus the optimal costs achievable by the policies that depend on and on are the same. However, specific suboptimal schemes may become more effective with the use of the feature state instead of just the belief state . Moreover, the use of can facilitate the use of distributed computation through partitioning of the space of features , as we will explain later.

The optimal cost , as a function of the sufficient statistic/feature state , is the unique solution of the corresponding Bellman equation for all , where is the Bellman operator

with being the set of all stationary policies [functions that map to a control ], and being the Bellman operator corresponding to :

Here denotes the belief state that corresponds to feature state , is the expected cost per stage

is the conditional probability that the next observation will be given the current belief state and control , and is the feature state estimator.

The feature space reformulation of the problem can serve as the basis for approximation in value space, whereby is replaced in Bellman’s equation by some function after one-step or multistep lookahead. For example a one-step lookahead scheme yields the suboptimal policy given by

In -step lookahead schemes, is used as terminal cost function in an -step horizon version of the original infinite horizon problem (see e.g., [Ber19]). In the standard form of a rollout algorithm, is the cost function of some base policy. Here we adopt a rollout scheme with -step lookahead, which involves rollout truncation and terminal cost approximation. This scheme has been formalized and discussed in the book [Ber19] (Section 5.1), and is described in the next section.

## Iii Truncated Rollout with Terminal Cost Function Approximation

In the pure form of the rollout algorithm, the cost function approximation is the cost function of a known policy , called the base policy, and its value at any is obtained by first extracting from , and then running a simulator starting from , and using the system model, the feature generator, and (see Fig. 1). In the truncated form of rollout, is obtained by running the simulator of for a given number of steps , and then adding a terminal cost approximation for each terminal feature state that is obtained at the end of the steps of the simulation with (see Fig. 2 for the case where ).

Thus the rollout policy is defined by the base policy , the terminal cost function approximation , the number of steps after which the simulated trajectory with is truncated, as well as the lookahead size ; see Fig. 2. The choices of and are typically made by trial and error, based on computational tractability among other considerations, while may be chosen on the basis of problem-dependent insight or through the use of some off-line approximation method. In variants of the method, the multistep lookahead may be implemented approximately using a Monte Carlo tree search or adaptive sampling scheme. In our experiments, we used a variable and state-dependent value of . Monte-Carlo tree search did not work well in our initial experiments, and we subsequently abandoned it.

Using -step rollout between the -step lookahead and the terminal cost approximation gives the method the character of a single PI. We have used repeated truncated rollout as the basis for constructing a PI algorithm, which we will discuss shortly (see also [Ber19], Section 5.1.2).

There are known error bounds that can be applied to the preceding truncated rollout scheme. These bounds were given in [Ber19], Section 5.1, Prop. 5.1.3, but they were derived for the case where the number of states is finite, so they do not apply to our feature state representation of a POMDP. However, the bounds can be extended to our infinite feature space case, since the proof arguments of [Ber19] do not depend on the finiteness of the state space. We thus give the bounds without proof.

###### Proposition 1

(Error Bounds) Consider a truncated rollout scheme consisting of -step lookahead, followed by rollout with a policy for steps, and a terminal cost function approximation at the end of steps. Let be the policy generated by this scheme. Then: (a) We have

where is the result of applying times to the Bellman operator for policy , and is the sup norm on the space of bounded functions of the feature state . (b) We have for all

The first bound implies that as the size of lookahead increases, the bound on the performance of the rollout policy improves. The second bound suggests that if is close to , the performance of the rollout policy is approximately improved (to within an error bound), relative to the performance of the base policy . This is typical of the practically observed cost improvement property of rollout schemes. In particular, when we obtain , which is the theoretical policy improvement property of rollout (see [Ber19], Section 5.1.2).

## Iv Supervised Learning of Rollout Policies and Cost Functions - Policy Iteration

The rollout algorithm uses multistep lookahead and on-line simulation of the base policy to generate the rollout control at any feature state of interest. To avoid the cost of on-line simulation, we can approximate the rollout policy off-line by using some approximation architecture that may involve a neural network. This is policy approximation built on top of the rollout scheme (see [Ber19], Sections 2.1.5 and 5.7.2). To this end, we introduce a parametric family/architecture of policies of the form , where is a parameter vector. We then construct a training set that consists of a large number of sample feature state-control pairs , , such that for each , is the rollout control at feature state . We use this data set to obtain a parameter by solving a corresponding classification problem, which associates each feature state with a control . The parameter defines a classifier, which given a feature state , classifies as requiring control (this idea was proposed in the context of PI in the paper [LaP03], and is also described in the book [Ber19], Section 3.5). The classification problem is often solved with the use of neural networks, and this has been our approach in our experimentation.### Iv-A Approximate Policy Iteration (API)

We can also apply the rollout policy approximation to the context of PI. The idea is to view rollout as a one-step policy improvement, and to view the PI algorithm as a perpetual rollout process, which performs multiple policy improvements, using at each iteration the current policy as the base policy, and the next policy as the corresponding rollout policy. In particular, we consider a PI algorithm where at the typical iteration we have a policy , which we use as the base policy to generate many feature state-control sample pairs , , where is the rollout control corresponding to feature state . We then obtain an “improved” policy with an approximation architecture and a classification algorithm, as described above. The “improved” policy is then used as a base policy to generate samples of the corresponding rollout policy, which is approximated in policy space, etc; see Fig. 3. To use truncated rollout in this PI scheme, we must also provide a terminal cost approximation, which may take a variety of forms. Using zero is a simple possibility, which may work well if either the size of multistep lookahead or the length of the rollout is relatively large. Another possibility, employed in our pipeline repair problem, is to use as terminal cost in the truncated rollout an approximation of the cost function of some base policy, which is obtained with a neural network-based approximation architecture. In particular, at any policy iteration with a given base policy, once the rollout data is collected, one or two neural networks are constructed: A policy network that approximates the rollout policy, and (in the case of rollout with truncation) a value network that constructs a cost function approximation for that rollout policy (the essentially synonymous terms “actor network” and “critic network” are also common in the literature). We consider two methods: Approximate rollout and PI with truncation, where each generated policy as well as its cost function are approximated by a policy and a value network, respectively. The cost function approximation of the current policy is used to truncate the rollout trajectories that are used to train the next policy. Approximate rollout and PI with no truncation, where each generated policy is approximated using a policy network, but the rollout trajectories are continued up to a large maximum number of stages (enough to make the cost of the remaining stages insignificant due to discounting) or upon reaching a termination state. Here only a policy network is used; a value network is unnecessary since there is no rollout truncation. Note that as in all approximate PI schemes, the sampling of feature states used for training is subject to exploration concerns. In particular, for each policy approximation, it is important to include in the sample set , a subset of feature states that are “favored” by the rollout trajectories; e.g., start from some initial subset of feature states and selectively add to this subset feature states that are encountered along the rollout trajectories. This is a challenging issue, which must be approached with care; see [LaP03, DiL08].## V Policy Iteration With a Partitioned Architecture

We will now discuss our partitioned architecture. It is based on a partition of the set of feature states into disjoint sets , so that . We train in parallel a separate (local) policy network and a (local) value network (in the case of a scheme with truncation) using feature state data from each of the sets . Thus at each policy iteration, we use policy networks and as many as additional value networks (in some specially structured problems, including the pipeline repair problem, some of the value networks may not be needed, as we will discuss later).

The local policy networks, each defined over a subset , are combined into a global policy, defined over the entire feature space . For schemes with truncation, each local value network is trained using the global policy, but with starting feature states from the corresponding subset of the partition; see Fig. 4.

This partitioned architecture is well-suited for distributed computation, with the local networks sharing and updating piecemeal the current global policy and (in the case of truncation) the current global terminal cost approximation. Moreover, we speculate that our methodology requires smaller training sets, which cover more evenly the feature space, thereby addressing in part the issue of adequate feature space exploration. While it is hard to quantify this potential advantage, our computational experience indicates that it is substantial. Regarding the distributed training of our partitioned architecture, we may conceptually assign one virtual processor for each set of the feature space partition (of course multiple virtual processors can coexist within the same physical processor). The empirical work described in the following sections provides, among others, guidelines for future research directions on the performance and implementations of synchronous and asynchronous distributed PI with partitioned architectures (in the spirit of [BeY10]), particularly in the context of POMDP as well as multi-agent robotics.

## Vi Sequential Repair and the Pipeline Problem

Various elements of the approximate PI methodology just described have been applied to a pipeline repair problem. We will describe the implementation, computational results, and comparisons with other methodologies in detail. The problem involves autonomous repair of a number of pipeline locations where damage may have occurred, and the objective is to find a policy to repair the pipeline with minimum cost, based on available information. Our results show that 1) approximate PI can be successfully applied in the context of POMDP, 2) a partitioned architecture results in substantial computation time savings, thus allowing applicability of our methodology to large state space problems, and 3) our pipeline repair model and solution methods can be extended to complex two-dimensional and multi-robot contexts, where a solution can be facilitated by using partitioning.

### Vi-A Pipeline Repair Problem Description

In this problem, an autonomous robot moves along a pipeline consisting of a sequence of locations, denoted . Each location of the pipeline can be in one of progressively worse damage levels indicated by , where indicates a no damage condition. The damage level of each location changes stochastically over time according to a known Markov chain with the states (see Fig. 5). As the figure indicates, we assume that a location that is not damaged (state ), cannot become damaged. However, for a damaged location, the level of damage can stochastically become worse. We assume that the robot has a sensing radius of one location within which it can verify the damage level of the location. Thus with each visit to a location, an observation is obtained, namely, the exact damage level of the location.

The robot knows its current location, and maintains for each location , a belief state of damages consisting of the conditional probabilities of the damage level, given the current observation history. The initial belief state could be obtained from information gathered a priori, for example, via noisy images of the pipeline. Upon reaching a location , the robot determines its damage level, and decides upon one of three actions: Stay in for one time period and repair the damage, bringing it to level , or move to one of the two adjacent locations or without repairing the damage (if or , only two of these actions are available). There is a known cost per unit time for each location, depending on its damage level, and the objective is to minimize the discounted sum of costs of all the locations of the pipeline over an infinite horizon. This is a POMDP with states and three actions per state.

It is straightforward to implement the belief state estimator, given the initial belief distributions of the damage levels of all the locations. It is also straightforward to program a simulator of the type illustrated in Fig. 1, and to generate rollout trajectories with a given base policy starting from a given feature state.

### Vi-B Partitioned Architecture for the Pipeline Repair Problem

For our partitioning scheme, we use two characteristics of the belief vector. These are (see Fig. 6a):

Percentage of disrepair: If less than one half of the pipeline locations are believed to be in some damaged level (), we say that we are in an endgame state, and otherwise we say that we are in a startgame state. Density of possible damage: We distinguish three cases, based on two variables, and (Left and Right Damage), which will be defined shortly. The three cases are a) damage to the left is greater than to the right: , b) damage to the left is approximately equal to damage to the right: , and c) damage to the right is greater than to the left: .

Based on the six possible pairs of the characteristics above, the belief space is partitioned into six corresponding subsets, and the index of the current subset is added to the belief state to form the feature state (hence subsumes and its evolution can be sequentially simulated, in conjunction with , consistent with our assumptions). Since we assume that a repaired location cannot fall into disrepair, endgame states cannot lead to startgame states. This allows a sequential training strategy whereby endgame cost and policy approximations can be used for training startgame policy and value networks.

### Vi-C Algorithmic Implementation

We employ two algorithms that are based on the partitioning scheme just described: Partitioned Approximate Policy Iteration with No Truncation (pAPI-NT) and Partitioned Approximate Policy Iteration with Truncation (pAPI-T) (see Fig. 7). These algorithms are well-suited for distributed computation, with a separate processor being in charge of the simulation and training within its own subset of the partition, and communicating its local policy and cost function to the other processors for use in their truncated rollout computations. Also, the approximate cost functions for the endgame subsets are shared as terminal cost approximations with other processors that are learning startgame policies. We find that these schemes save computation time while performing almost as well as a nonpartitioned architecture in terms of minimizing cost (see the next section).

### Vi-D Experimental Results

We implemented our proposed methodology on various problems of different sizes: 1) A linear pipeline problem of 20 locations and states, 2) a two-dimensional grid pipeline of size 66 and states, and 3) a two-robot linear pipeline of states. We describe our experimentation for the linear pipeline first.

For our experiments we have made the following choices. We used neural networks for both policy and value approximations. The policy networks used 2 hidden layers of size 256 and 64 ReLU units respectively, while the value networks used 3 hidden layers of 256, 128, and 64 ReLU units. The output layer of the policy network used a softmax layer over 3 actions (repair, go left, go right) in a linear single-robot pipeline (5 actions for a single robot 2D grid, 9 actions for two-robot linear pipeline). We used the RMSprop optimizer to minimize the L2 loss with a learning rate of 0.001.

To obtain the rollout policy, we used one-step lookahead and 10 Monte-Carlo simulated trajectories with the base policy from each leaf of the lookahead tree. We have not tried multistep lookahead in order to keep the size of the lookahead tree manageable. We used a greedy policy as the starting base policy, which repairs the current location if it is damaged, moves left if out of all damaged locations, the nearest one is on the left, or moves right otherwise. The density of damage to the left of the robot at location is defined as , where is the vector of costs for the different damage levels. The definition of is similar. For the 20-location linear pipeline cases, 200,000 training samples were used for each subset of the feature space partition (and samples for training using the methods that do not use partitioning). For the 66 grid and two-robot cases, 500,000 training samples were used for each subset of the partition. Training samples were obtained by: 1) random generation of feature states from each partition set (Fig. 6(b)), and 2) augmenting the pool of samples with representative states over a memory buffer where we randomly retained around 10% of states generated while evaluating the previous policies as described in [Ber19], Ch. 5.

We have generally observed that the partitioned architecture results in significant improvement in running time thanks to parallelization, at the expense of a modest degradation in performance. We expect that this improvement will become more significant as we address larger problems that require a richer feature space partition.

We will now present an ablation study of our proposed method that demonstrates the performance of the major pieces of our algorithm. In particular, we compare: 1) pAPI-T and pAPI-NT, which are the full versions of our algorithm with and without truncation respectively, 2) R-T and R-NT, which are rollout algorithms with and without truncation, which are the basis for data collection and training for the improved policies used in pAPI-T and pAPI-NT, and finally 3) the greedy policy, which is used as the starting policy for rollout and for PI. The plot in Fig. 9(a) shows how each component of our method adds to improved performance. Notice that pAPI-T uses a terminal cost approximation and results in computational savings over pAPI-NT, but incurs a slight approximation penalty.

#### Comparison to Existing Methods

We have compared our method with two published POMDP methods: DESPOT and POMCP. Our simulations were conducted using the code for these methods, which is freely available at [despot]. These methods do not use rollout with a base policy, but instead rely on long lookahead supplemented by Monte Carlo tree search and/or heuristic pruning of the lookahead tree. Results are shown in Fig. 9(b) for an evaluation set of 1000 random startgame states. We tried many DESPOT parameter combinations, including , , and . Fig. 9(b) shows the best results that we obtained with DESPOT using , , , and with POMCP using the default parameters. We used a closed form of belief update equation based on the Markov chain of Fig. 5, by modifying the POMCP and DESPOT code to use a single particle to represent the belief with a weight of 1.

Our results show the cost of two implementations of POMCP and DESPOT; the first implementation denoted is “straight out of the box,” using the code provided in [despot]. We obtained the other implementation, , by modifying the first to disallow “repair” actions at already repaired locations. We note that without this modification the DESPOT and POMCP algorithms will sometimes choose a repair action on an already repaired location, leading to higher costs. Our results indicate that pAPI-NT and pAPI-T outperform DESPOT and POMCP (original and modified versions) for our pipeline repair problem. The greedy policy, somewhat unexpectedly, also performs better than DESPOT and POMCP. The reason may be that the pipeline repair problem has a long effective planning horizon. Intuitively for such problems, DESPOT and POMCP are handicapped by their reliance on a tradeoff between lookahead length and the sparsity of the lookahead tree. Since both DESPOT and POMCP estimate Q-values approximately, with a larger and sparser lookahead, they can perform worse than the greedy base policy. By contrast, our rollout policy has the cost improvement property described in Prop. 1, and improves over the base policy (approximately, within a bound).

Indeed, we have observed that for states requiring a relatively long horizon to repair the pipeline, POMCP and DESPOT often make poor decisions. However, we could find no reliable method to detect whether a particular state would lead to relatively good or bad performance for DESPOT of POMCP, except in hindsight. Nonetheless for investigation purposes, even after hand-selecting for evaluation a favorable set of states, we found that DESPOT or POMCP were still outperformed by our method. In fact, based on examination of sample generated trajectories, we have concluded that the advantage that our method holds is principally due to the long horizon exploration that is characteristic of the use of rollout with terminal cost approximation.

#### Extensions of the Pipeline Repair Problem

We present in Table I results for two extensions of our problem to demonstrate the generality of our methodology. Specifically we show results for: 1) a two-dimensional grid pipeline and 2) a two-robot implementation of our methodology. Generally, our results show that both pAPI-T and pAPI-NT can be applied to problems with larger state space and multiple robots thanks to the use of partitioning and the attendant computational savings.

Extension | Greedy | Iter 1 | Iter 2 |
---|---|---|---|

pAPI-NT on 5x4 grid | 5476.79 | 5255.64 | 4952.26 |

pAPI-NT on 6x6 grid | 18799.5 | 18037.1 | 17341.9 |

pAPI-NT 2-robot, linear () | 4142.84 | 3321.95 | 3133.08 |

## Vii Concluding Remarks

This paper develops rollout algorithms and PI methods that are well-suited to deal with the challenges of POMDP, including large state spaces, incomplete information, and long planning horizons. Thus these methods are of high relevance to robotic tasks in uncertain environments such as search and rescue. While several of the components of our methodology have been suggested for perfect state information problems, they have not been combined and adapted to POMDP. Our methods are based on partitioning the feature space and training local policies that are specialized, easier to train, can be combined to provide a global policy over the entire space, and are amenable to a highly distributed implementation.

We have applied our methods in simulation to a class of sequential repair problems where a robot inspects and repairs a linear pipeline with potentially several rupture sites under partial information about the state of the pipeline (acquired from a priori obtained knowledge and in situ observations). Our method’s partitioning of the state space has important implications for robotics problems – namely, it allows for massive parallelization over several potentially independent processors (or robots). Importantly, this framework lends itself to distributed learning about the environment where different partitions can correspond to states in spatially different areas of the world, thus suggesting a new basis from which to solve future multi-robot POMDP problems.

We finally note that there are several possible extensions to our sequential repair problem, such as for example allowing for stochastic repair times, two-dimensional problems with obstacles, and multiple robots.

Acknowledgement: We acknowledge Calvin Norman and Siva Kailas for their help with numerical studies.