# Online Active Perception for Partially Observable Markov Decision Processes with Limited Budget

###### Abstract

Active perception strategies enable an agent to selectively gather information in a way that improves its performance. In applications in which the agent does not have prior knowledge about the available information sources, it is crucial to synthesize active perception strategies at runtime. We consider a setting in which, at runtime, an agent is capable of gathering information under a limited budget. We pose the problem in the context of partially observable Markov decision processes. We propose a generalized greedy strategy that selects a subset of information sources with near-optimality guarantees on uncertainty reduction. Our theoretical analysis establishes that the proposed active perception strategy achieves near-optimal performance in terms of expected cumulative reward. We demonstrate the resulting strategies in simulations on a robotic navigation problem.

## I Introduction

An intelligent system should be able to exploit the information available in its surroundings to better accomplish its task. However, in many applications in robotics and control, a decision-maker (called an agent) is not necessarily aware of the available information sources during a priori planning. For instance, consider an environment in which multiple agents, each with individual plans for their specific tasks, operate together. An agent may have no access, or only limited access, to the behavioral models of the other agents, and hence may not know what they can observe of the environment or whether they are within communication range. Nevertheless, at runtime, the agents may decide to exchange their information in order to enhance their performance.

In practical settings, the ability of an agent to gather information is subject to budget constraints originating from power, communication, or computational limitations. If an agent decides to employ a sensor, it incurs a cost associated with the required power; if it decides to communicate with another agent, it incurs a communication cost. Such budget constraints accentuate the need for actively selecting the subset of available information sources that is most beneficial to the agent. We call this decision-making problem budget-constrained online active perception.

We formulate budget-constrained online active perception for partially observable Markov decision processes (POMDPs). Computing an optimal policy for a POMDP, i.e., one that maximizes the expected cumulative reward, is generally PSPACE-complete [papadimitriou1987complexity]. This complexity result has led to the design of numerous approximation algorithms. A well-known family of these approximate methods relies on point-based value iteration solvers [cheng1988algorithms, kurniawati2008sarsop, smith2012point]. Point-based solvers exploit the piecewise linearity and convexity [sondik1978optimal] of the value function to approximate it as the maximum of a set of hyperplanes, each associated with a sampled belief point. The error due to this approximation is provably bounded by a factor depending on the density of the sampled belief points [pineau2006anytime].

The combinatorial nature of selecting a subset of available information subject to budget constraints renders the task of finding an optimal solution NP-hard. We propose an efficient yet near-optimal online active perception strategy for POMDPs that aims to minimize the agent’s uncertainty about the state while respecting the constraint. We prove the near-optimality of the proposed algorithm. Further, we evaluate the efficacy of the proposed solution for a robotic navigation task where the robot can communicate with unmanned aerial vehicles (UAVs) to better localize itself.

### I-A Related Work

Active perception has been studied in many applications including robotics [elfes1990occupancy, stone2006pixels, charrow2015active, best2016decentralised] and image processing [darrell1996active, vogel2007non]. A body of literature formalizes active perception as a reward-based task of a POMDP, enabling non-myopic decision-making. The reward-based treatment of perception has been employed for active classification [guo2003decision] and cooperative active perception [spaan2008cooperative, spaan2009decision, natarajan2015multi]. Araya et al. [araya2010pomdp] introduce the $\rho$POMDP model, in which the reward is the entropy of the belief, and Spaan et al. [spaan2015decision] propose POMDP-IR, in which the reward depends on the accuracy of state prediction. In [satsangi2018exploiting], the authors exploit the submodularity of the value function for $\rho$POMDP and POMDP-IR to design a greedy maximization technique for finding a near-optimal active perception policy. Our setting differs from the existing work in two aspects. First, we consider both planning and perception, where perception serves the planning objective. Second, we consider settings in which the perception model is only partially known during a priori planning.

An instance of active perception, considered in this paper, is that of dynamically selecting a subset of available information sources. The existing work on subset selection quantifies the usefulness of an information source by information-theoretic utility functions such as scalarizations of the error covariance matrix of the estimated parameter [shamaiah2010greedy, hashemi2018randomized], the mutual information between the measurements and the parameter of interest, or the entropy of the selected measurements [krause2007near, krause2014submodular]. Given a specific utility function, selecting an optimal subset of information sources under a constraint is a combinatorial problem [williamson2011design]. However, if the utility function has properties such as monotonicity or (weak) submodularity, greedy algorithms can achieve near-optimal solutions with only a polynomial number of function evaluations [nemhauser1978analysis, wang2016approximation, qian2017subset]. We use the mutual information between the current state and the observations as the utility function. We obtain a theoretical guarantee for the performance of the proposed generalized greedy maximization algorithm by exploiting the monotonicity and submodularity of mutual information as well as the linearity of the cost constraint.

## II Preliminaries and Problem Statement

In this section, we provide an outline of the related concepts and definitions in order to formally state the problem.

### II-A Preliminaries

We first overview the necessary background on partially observable Markov decision processes (POMDPs), point-based value iteration solvers, and properties of set functions.

#### II-A1 POMDP

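A POMDP is a tuple $(S, A, T, \Omega, O, R, \gamma)$, where $S$ is the finite set of states, $A$ is the finite set of actions, $T: S \times A \times S \to [0, 1]$ is the probabilistic transition function, $\Omega$ is the set of observations, $O: S \times A \times \Omega \to [0, 1]$ is the probabilistic observation function, $R: S \times A \to \mathbb{R}$ is the reward function, and $\gamma \in [0, 1)$ is the discount factor. At each time step, the environment is in some state $s \in S$. The agent takes an action $a \in A$ that causes a transition to a state $s' \in S$ with probability $T(s, a, s')$. Then it receives an observation $\omega \in \Omega$ with probability $O(s', a, \omega)$, and a scalar reward $R(s, a)$.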

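The belief of the agent at each time step, denoted by $b_t$, is the posterior probability distribution over states given the history of previous actions and observations, i.e., $b_t(s) = \Pr(s_t = s \mid b_0, a_0, \omega_1, \ldots, a_{t-1}, \omega_t)$. A well-known fact is that, due to the Markov property, the belief is a sufficient statistic for the history of actions and observations [aastrom1965optimal, smallwood1973optimal]. Given the initial belief $b_0$, the following update equation holds between the previous belief $b$ and the belief $b^{a,\omega}$ after taking action $a$ and receiving observation $\omega$:

$$b^{a,\omega}(s') = \frac{O(s', a, \omega) \sum_{s \in S} T(s, a, s')\, b(s)}{\sum_{s'' \in S} O(s'', a, \omega) \sum_{s \in S} T(s, a, s'')\, b(s)} \tag{1}$$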
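The belief update can be illustrated with a short sketch. The nested-list layouts `T[s][a][s2]` and `O[s2][a][o]`, the function name `belief_update`, and the two-state toy model are assumptions made for this example only; they are not taken from the paper's code.

```python
def belief_update(b, a, o, T, O):
    """Bayes update of belief b after action a and observation o.

    T[s][a][s2] is a transition probability and O[s2][a][o] an
    observation probability (both layouts are assumed for this sketch).
    """
    n = len(b)
    # Prediction: push the belief through the transition model.
    predicted = [sum(T[s][a][s2] * b[s] for s in range(n)) for s2 in range(n)]
    # Correction: weight each state by the likelihood of the observation.
    unnorm = [O[s2][a][o] * predicted[s2] for s2 in range(n)]
    z = sum(unnorm)
    if z == 0.0:
        raise ValueError("observation has zero probability under this belief")
    return [u / z for u in unnorm]


# Hypothetical two-state model with one action and two observations.
T = [[[0.9, 0.1]], [[0.2, 0.8]]]
O = [[[0.8, 0.2]], [[0.3, 0.7]]]
b1 = belief_update([0.5, 0.5], 0, 0, T, O)
```

The normalizer `z` is the probability of the observation given the previous belief and the action, which is the same quantity that weights future values in value iteration.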

The agent's objective is to find a pure policy that maximizes its expected discounted cumulative reward, denoted by $\mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r_t\right]$. A pure policy is a mapping from beliefs to actions, $\pi: B \to A$, where $B$ is the set of belief states. Note that $B$ forms an $(|S|-1)$-dimensional probability simplex, which we denote by $\Delta_B$.

#### II-A2 Point-Based Value Iteration

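POMDP solvers apply value iteration [sondik1978optimal], a dynamic programming technique, to find the optimal policy. Let $V_t$ be a value function that maps beliefs to values in $\mathbb{R}$ that represent the expected discounted cumulative reward for a given belief. The following recursive expression holds for $V_t$:

$$\begin{aligned} V_t(b) = \max_{a \in A} \Big[ &\sum_{s \in S} b(s)\, R(s, a) \\ &+ \gamma \sum_{\omega \in \Omega} \Pr(\omega \mid b, a)\, V_{t-1}(b^{a,\omega}) \Big] \end{aligned} \tag{2}$$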

The value iteration process converges to the optimal value function $V^*$, which satisfies Bellman's optimality equation [bellman1957markovian]. An optimal policy can then be derived from the optimal value function. An important outcome of (2) is that, at any horizon, the value function is piecewise linear and convex [smallwood1973optimal] and hence can be represented by a finite set of hyperplanes. Each hyperplane is associated with an action. Let the $\alpha$'s denote the corresponding vectors of hyperplane parameters and let $\Gamma_t$ be the set of vectors at horizon $t$. Then,

$$V_t(b) = \max_{\alpha \in \Gamma_t} \alpha \cdot b \tag{3}$$

where $\cdot$ indicates the dot product of the two vectors. Additionally, the action corresponding to the optimal $\alpha$ in (3) determines the optimal action at $b$. This representation of the value function has motivated approximate point-based solvers, which approximate the value function by updating the hyperplanes over a finite set of sampled belief points.
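As a small illustration of the hyperplane representation, the sketch below evaluates a value function stored as a list of vectors and reads off the action attached to the maximizing vector. Storing each vector as an `(action, alpha)` pair, and the names `value_and_action` and `gamma_set`, are assumptions of this example.

```python
def value_and_action(b, alpha_vectors):
    """Return (max_alpha alpha . b, action of the maximizing vector).

    alpha_vectors is a list of (action, alpha) pairs; the pairing is an
    assumed storage layout for this sketch.
    """
    best_value, best_action = float("-inf"), None
    for action, alpha in alpha_vectors:
        value = sum(a_s * b_s for a_s, b_s in zip(alpha, b))
        if value > best_value:
            best_value, best_action = value, action
    return best_value, best_action


# Hypothetical set of three vectors over a two-state belief space.
gamma_set = [("left", [1.0, 0.0]), ("right", [0.0, 1.0]), ("wait", [0.6, 0.6])]
v_mid, act_mid = value_and_action([0.5, 0.5], gamma_set)
v_edge, act_edge = value_and_action([0.9, 0.1], gamma_set)
```

In this toy set the uncertain belief `[0.5, 0.5]` is assigned the hedging action, while the more certain belief `[0.9, 0.1]` gets a higher value, which reflects the convexity of the value function.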

Generic point-based solvers consist of three main steps, namely sampling, backup, and pruning. These steps are applied repeatedly until a desired convergence criterion for the value function is met. For the sampling step, different approaches exist, including discretization of the belief simplex and adaptive sampling techniques [pineau2006anytime, kurniawati2008sarsop, smith2012point]. The backup step follows the standard Bellman backup operation. More specifically, one can rewrite (2) using (3) to obtain:

$$V_t(b) = \max_{a \in A} \Big[ \sum_{s \in S} b(s)\, R(s, a) + \gamma \sum_{\omega \in \Omega} \max_{\alpha \in \Gamma_{t-1}} \sum_{s \in S} \sum_{s' \in S} b(s)\, T(s, a, s')\, O(s', a, \omega)\, \alpha(s') \Big]$$

where $\Gamma_{t-1}$ is the set of vectors from the previous iteration. Let $\tilde{B}$ denote the current set of sampled belief points. The Bellman backup operator on $\tilde{B}$ is performed through the following procedure [pineau2006anytime]:

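Step 1: $\Gamma^{a,*} \leftarrow \alpha^{a,*}(s) = R(s, a)$

Step 2: $\Gamma^{a,\omega} \leftarrow \alpha_i^{a,\omega}(s) = \gamma \sum_{s' \in S} T(s, a, s')\, O(s', a, \omega)\, \alpha_i(s'), \quad \forall \alpha_i \in \Gamma_{t-1}$

Step 3: $\Gamma_b^{a} = \Gamma^{a,*} + \sum_{\omega \in \Omega} \arg\max_{\alpha \in \Gamma^{a,\omega}} (\alpha \cdot b), \quad \forall b \in \tilde{B}$

Step 4: $\alpha_b = \arg\max_{\Gamma_b^{a},\, a \in A} (\Gamma_b^{a} \cdot b), \quad \forall b \in \tilde{B}$

Step 5: $\Gamma_t = \bigcup_{b \in \tilde{B}} \alpha_b$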

where $\Gamma_t$ is the new set of vectors. Lastly, in the pruning step, the vectors that are dominated by other vectors are removed to simplify the next round of computation [araya2010pomdp].
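The backup step can be sketched as follows. This is a minimal version of the standard point-based backup over a fixed set of belief points, written under assumed model layouts `T[s][a][s2]`, `O[s2][a][o]`, and `R[s][a]`; it omits sampling and pruning and may differ in detail from the procedure used by the authors.

```python
def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def backup(belief_points, prev_alphas, T, O, R, gamma):
    """One point-based Bellman backup over a fixed set of belief points.

    Returns one (action, alpha) pair per belief point. All data layouts
    are assumptions of this sketch.
    """
    n_s, n_a, n_o = len(T), len(T[0]), len(O[0][0])
    new_alphas = []
    for b in belief_points:
        best = None
        for a in range(n_a):
            # Start from the immediate reward vector of action a.
            alpha_ab = [R[s][a] for s in range(n_s)]
            for o in range(n_o):
                # Project every previous vector through (a, o).
                projected = [
                    [gamma * sum(T[s][a][s2] * O[s2][a][o] * alpha[s2]
                                 for s2 in range(n_s))
                     for s in range(n_s)]
                    for alpha in prev_alphas
                ]
                # Keep the projection that is best at this belief point.
                g = max(projected, key=lambda vec: dot(vec, b))
                alpha_ab = [x + y for x, y in zip(alpha_ab, g)]
            if best is None or dot(alpha_ab, b) > dot(best[1], b):
                best = (a, alpha_ab)
        new_alphas.append(best)
    return new_alphas


# Hypothetical two-state, two-action, two-observation model.
T = [[[0.9, 0.1], [0.9, 0.1]], [[0.2, 0.8], [0.2, 0.8]]]
O = [[[0.8, 0.2], [0.8, 0.2]], [[0.3, 0.7], [0.3, 0.7]]]
R = [[1.0, 0.0], [0.0, 1.0]]
points = [[0.9, 0.1], [0.1, 0.9]]
first = backup(points, [[0.0, 0.0]], T, O, R, gamma=0.95)
second = backup(points, [alpha for _, alpha in first], T, O, R, gamma=0.95)
```

Starting from a zero vector, the first backup returns the immediate reward vector of the best action at each point; repeating the call propagates values backward one step at a time.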

#### II-A3 Properties of Set Functions

Since the proposed active perception algorithm is founded upon theoretical results from the field of submodular optimization for set functions, we overview the necessary definitions here. Let $\mathcal{X}$ denote a ground set and $f: 2^{\mathcal{X}} \to \mathbb{R}$ a set function that maps an input set to a real number.

###### Definition 1.

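A set function $f$ is monotone nondecreasing if $f(\mathcal{T}_1) \leq f(\mathcal{T}_2)$ for all $\mathcal{T}_1 \subseteq \mathcal{T}_2 \subseteq \mathcal{X}$.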

###### Definition 2.

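A set function $f$ is submodular if

$$f(\mathcal{T}_1 \cup \{i\}) - f(\mathcal{T}_1) \geq f(\mathcal{T}_2 \cup \{i\}) - f(\mathcal{T}_2)$$

for all subsets $\mathcal{T}_1 \subseteq \mathcal{T}_2 \subset \mathcal{X}$ and $i \in \mathcal{X} \setminus \mathcal{T}_2$. The term $f_i(\mathcal{T}) = f(\mathcal{T} \cup \{i\}) - f(\mathcal{T})$ is the marginal value of adding element $i$ to set $\mathcal{T}$.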

Monotonicity states that adding elements to a set does not decrease the function value, while submodularity captures the diminishing-returns property: the marginal value of an element does not grow as the set it is added to grows.
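Both definitions can be checked by brute force on a small ground set. The sketch below does so for a hypothetical coverage function (the number of grid cells seen by a set of sensors) and for a squared-cardinality function that is monotone but not submodular. The sensor names and cell sets are invented for illustration.

```python
from itertools import combinations


def subsets(items):
    items = list(items)
    for k in range(len(items) + 1):
        for combo in combinations(items, k):
            yield frozenset(combo)


def is_monotone(f, ground):
    all_sets = list(subsets(ground))
    return all(f(s1) <= f(s2)
               for s1 in all_sets for s2 in all_sets if s1 <= s2)


def is_submodular(f, ground):
    all_sets = list(subsets(ground))
    for s1 in all_sets:
        for s2 in all_sets:
            if not s1 <= s2:
                continue
            for i in set(ground) - s2:
                # Marginal gain must not grow when the base set grows.
                if f(s1 | {i}) - f(s1) < f(s2 | {i}) - f(s2):
                    return False
    return True


# Hypothetical example: each sensor observes a set of grid cells.
coverage = {"u1": {1, 2, 3}, "u2": {3, 4}, "u3": {4, 5, 6}}


def covered(selected):
    cells = set()
    for name in selected:
        cells |= coverage[name]
    return len(cells)


def squared(selected):
    return len(selected) ** 2


checks = (is_monotone(covered, coverage),
          is_submodular(covered, coverage),
          is_submodular(squared, coverage))
```

The exhaustive check is exponential in the size of the ground set, so it is only meant as a sanity test on toy instances.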

### II-B Problem Statement

In this paper, we consider an agent whose interaction with the environment, i.e., stochastic transitions and observations, is captured by a POMDP. In addition to a priori known observations captured by the POMDP, during runtime, the agent can further collect auxiliary observations, e.g., by means of communicating with other nearby agents. However, there is a budget constraint, such as limited communication bandwidth or limited communication power, on the auxiliary information gathering. Therefore, the agent must pick (or activate) a subset of auxiliary information sources that maximally increase its expected reward in the future while respecting the constraint. We formally state the problem next.

###### Problem 1.

Consider a POMDP with initial belief . Let set to denote auxiliary observations available at time step , with associated costs of , and an upper bound on the cost. Also, let to represent the power set obtained from . In a priori planning, we aim to compute a pure belief-based policy that maximizes the expected discounted cumulative reward, i.e,

Furthermore, at runtime, we aim to compute an active perception policy that given current belief , maximizes the expected discounted cumulative reward in the future while respecting the cost constraint, i.e.,

## III Online Active Perception with Limited Budget

Problem 1 consists of two stages. The first stage is a priori planning based on the POMDP model. We resort to point-based value iteration (see Section II) to compute a near-optimal policy for this planning problem. As discussed earlier, various heuristics for adaptive sampling of belief points have been developed. The core idea of these methods is to guide the sampling toward the reachable subspace of the belief simplex. Nevertheless, since the reachable belief points depend on the possible observations and the agent is not aware of the auxiliary observations a priori, we propose uniform sampling of the belief simplex. While uniform sampling is not as efficient as adaptive sampling for large POMDPs, it ensures coverage of the whole belief space. The second stage of the problem is an online computation of an optimal subset of information sources with respect to the expected future reward while complying with the cost constraint. To that end, we design a generalized greedy strategy, applied at each time step, that is computationally efficient and achieves near-optimality guarantees. Before introducing the algorithm, we state the following assumption regarding the dependency of observations from the auxiliary information sources.

###### Assumption 1.

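We assume that the observations from the information sources are mutually independent given the current state and the previous action, i.e.,

$$\Pr(\omega^{1}, \ldots, \omega^{n} \mid s, a) = \prod_{i=1}^{n} \Pr(\omega^{i} \mid s, a),$$

where $\omega^{i}$ denotes the observation from the $i$-th information source and $n$ is the number of sources.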

Let to denote the updated belief after taking action and receiving observation . Assume the agent then picks a perception action corresponding to and receives an auxiliary observation . Then, if Assumption 1 holds, according to Bayes’ theorem, the agent’s belief will be further updated by the following rule:

(4) |

where .
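Under the conditional independence assumption, the second belief update reduces to multiplying the belief by one likelihood vector per selected source and renormalizing. The sketch below illustrates this; the function name `aux_update` and the layout `likelihoods[i][s]` are assumptions of the example.

```python
def aux_update(b, likelihoods):
    """Refine belief b with the reports of the selected sources.

    likelihoods[i][s] is the probability that source i produces the
    report it actually produced if the true state is s (assumed layout).
    Conditional independence makes the joint likelihood a product.
    """
    post = list(b)
    for lik in likelihoods:
        post = [p * l for p, l in zip(post, lik)]
    z = sum(post)
    if z == 0.0:
        raise ValueError("observations are inconsistent with the belief")
    return [p / z for p in post]


# Two hypothetical sources that both favor state 0.
refined = aux_update([0.5, 0.5], [[0.9, 0.1], [0.8, 0.3]])
unchanged = aux_update([0.5, 0.5], [])
```

Because the update is a product, the order in which the selected sources are processed does not affect the result, and selecting no source leaves the belief unchanged.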

### III-A Proposed Generalized Greedy Algorithm

To quantify the utility of information sources, we use the mutual information between the state and the auxiliary information. The mutual information between two random variables is a nonnegative and symmetric measure of their dependence and is defined as:

$$I(x; y) = \sum_{x, y} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)}$$

Mutual information, due to its monotonicity and submodularity, has inspired many subset selection algorithms [krause2014submodular]. The mutual information between the state and the auxiliary information is closely related to the change in the entropy of the state after receiving the additional observations, as expressed by the following equation:

(5) |

For a discrete random variable $x$, the entropy is defined as $H(x) = -\sum_{i} p(x_i) \log p(x_i)$ and captures the amount of uncertainty. Therefore, intuitively, maximizing the mutual information is equivalent to minimizing the uncertainty in the state. Minimizing the state uncertainty is the goal of perception actions, as it leads to higher expected reward in the future. Notice that the entropy is strictly concave on the belief simplex [cover2012elements]. Hence, minimizing the entropy pushes the belief toward the boundary of the simplex, which, due to the convexity of the value function, possesses higher value. That being the case, in order to select the optimal perception action, we define the objective function as the following set function:

(6) |

and aim to compute by solving the following discrete optimization problem:

(7) |

Note that is constant and does not affect the selection procedure. Furthermore, yields the expected value of entropy over all possible realizations of observations and can be computed via:

(8) | ||||
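The objective can be evaluated directly on small examples by enumerating the joint outcomes of the selected sources: the mutual information is the entropy of the current belief minus the expected entropy of the posterior belief. The sketch below assumes conditionally independent sources stored as `sources[i][s][k] = Pr(source i reports symbol k | state s)`; this layout and the function names are illustrative assumptions.

```python
from itertools import product
from math import log


def entropy(p):
    return -sum(x * log(x) for x in p if x > 0.0)


def mutual_information(b, sources, selected):
    """I(state; reports of the selected sources) for belief b.

    sources[i][s][k] = Pr(source i reports k | state s); the layout and
    the conditional-independence structure are assumptions.
    """
    expected_posterior_entropy = 0.0
    outcome_ranges = [range(len(sources[i][0])) for i in selected]
    for outcome in product(*outcome_ranges):
        # Joint probability of each state together with this outcome.
        joint = list(b)
        for i, k in zip(selected, outcome):
            joint = [j * sources[i][s][k] for s, j in enumerate(joint)]
        p_outcome = sum(joint)
        if p_outcome > 0.0:
            posterior = [j / p_outcome for j in joint]
            expected_posterior_entropy += p_outcome * entropy(posterior)
    return entropy(b) - expected_posterior_entropy


# Hypothetical sources: perfect, noisy, and uninformative.
sources = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.8, 0.2], [0.3, 0.7]],
    [[0.5, 0.5], [0.5, 0.5]],
]
b = [0.5, 0.5]
mi_perfect = mutual_information(b, sources, [0])
mi_noisy = mutual_information(b, sources, [1])
mi_useless = mutual_information(b, sources, [2])
```

The enumeration grows exponentially with the number of selected sources, so this form is only practical when few sources are selected or each source has few possible reports.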

At each time step, there are $2^{n}$ possible perception actions (one for each subset of the $n$ available information sources) with their associated costs. Finding an optimal subset of information sources with respect to (7) is a combinatorial optimization problem and is NP-hard [williamson2011design]. Hence, we propose an approximate solution based on greedy maximization schemes. The proposed greedy algorithm, outlined in Algorithm 1, is founded upon the generalized greedy algorithm in [lin2010multi].

The algorithm takes as input the agent's belief and action along with the current set of available auxiliary information sources. It then iteratively adds the element from the ground set (the set of all information sources) whose marginal gain with respect to the objective function, scaled by the added cost, is maximal, and terminates when no more elements can be added due to the constraint. The algorithm has a scaling parameter for the cost, which can be adjusted to calibrate the effect of cost for a particular problem. The output set is the better of the constructed subset and the best singleton subset.
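The selection loop described above can be sketched as follows, in the style of the generalized greedy scheme of [lin2010multi]. The function name, the exponent name `r`, the nonnegative-gain check, and the deterministic tie-breaking by sorted order are assumptions of this sketch rather than details confirmed by the text; `f` stands for any monotone submodular set function, such as the mutual information objective.

```python
def generalized_greedy(ground, f, cost, budget, r=1.0):
    """Cost-scaled greedy selection with a best-singleton safeguard.

    ground: collection of source ids; f: set function on frozensets;
    cost: dict of positive costs; r: cost-scaling exponent (assumed name).
    """
    chosen, spent = frozenset(), 0.0
    remaining = set(ground)
    while remaining:
        base = f(chosen)
        # Element with the largest marginal gain per scaled unit of cost.
        best = max(sorted(remaining),
                   key=lambda i: (f(chosen | {i}) - base) / cost[i] ** r)
        if spent + cost[best] <= budget and f(chosen | {best}) - base >= 0.0:
            chosen, spent = chosen | {best}, spent + cost[best]
        remaining.discard(best)
    # Compare against the best single affordable element.
    affordable = [i for i in ground if cost[i] <= budget]
    if affordable:
        single = frozenset({max(affordable, key=lambda i: f(frozenset({i})))})
        if f(single) > f(chosen):
            return single
    return chosen


# Hypothetical coverage utility with unit costs and a budget of two.
coverage = {"u1": {1, 2, 3}, "u2": {3, 4}, "u3": {4, 5, 6}}


def covered(selected):
    return len(set().union(*(coverage[i] for i in selected)))


picked = generalized_greedy(list(coverage), covered,
                            {"u1": 1, "u2": 1, "u3": 1}, budget=2)

# Hypothetical case in which the singleton safeguard matters.
values = {"a": 1.0, "b": 9.0}
fallback = generalized_greedy(["a", "b"],
                              lambda s: sum(values[i] for i in s),
                              {"a": 1, "b": 10}, budget=10)
```

In the second example the cost-scaled rule first picks the cheap element and can then no longer afford the valuable one, so the comparison with the best singleton is what recovers the better answer.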

### III-B Theoretical Analysis

Next, we theoretically analyze the performance of the proposed online active perception algorithm. The following lemma states the required properties of the objective function to prove near-optimality result.

###### Lemma 1.

The proof of the lemma follows from submodularity of conditional entropy [ko1995exact] and its monotonicity. The above lemma enables us to establish the approximation factor using the analysis in [lin2010multi].

###### Theorem 1.

Theorem 1 proves that the mutual information obtained by the generalized greedy algorithm is close to that of the optimal solution in (7). Nevertheless, we need to analyze the near-optimality of the proposed online active perception policy compared to the optimal policy in Problem 1. To that end, we show that the expected distance between the two belief points resulting from the greedy and optimal perception actions is bounded. Using this fact, we prove that the value loss is bounded as well.

###### Theorem 2.

Let to denote the agent’s current belief and to denote its last action. Further, let and to be the greedy perception action and the optimal action, respectively. Then, it holds that

where and are the updated beliefs according to (4).

## IV Simulation Results

We evaluate the proposed online active perception algorithm in a robotic navigation task. To that end, we implement a simple point-based value iteration solver that uses a fixed set of belief points. The belief points are uniformly distributed over the belief simplex and their associated vectors are initialized following [shani2013survey]. We run the solver until the norm of the difference between the value functions in two consecutive iterations falls below a predefined threshold of 0.001 or a maximum of 1000 iterations is reached.
We implement the proposed generalized greedy selection algorithm as well as a random selection algorithm that selects a subset of information sources uniformly at random.
After learning the policy from the solver, we apply the online active perception policies for 50 Monte Carlo simulation runs. The code is available at https://github.com/MahsaGhasemi/greedy-perception-POMDP

The robotic navigation scenario models a robot in a grid map whose objective is to reach a goal state while avoiding the obstacles in the environment, see Fig. 1. The goal state has a reward of 10, obstacle cells have a reward of -5, and other cells have a reward of -1. The navigation actions of the robot are . The robot’s transitions are probabilistic due to possible actuation errors with 0.7 probability of taking the correct action. The robot has an inaccurate sensor as well that can localize it correctly with probability 0.5. In addition to the robot, there are 12 UAVs that are patrolling the area in periodic motions. The field of view of each UAV is a area. At each time step, the robot can select some of the UAVs and ask them to send their information regarding the state of the robot. However, note that the observation model of UAVs is time-varying and changes based on their location. Besides, the robot does not know the policies of UAVs during planning time. We assume that the cost of communicating with each UAV is the same. At each time step, the cost constraint allows communication with at most 2 UAVs.

We first find a planning policy via the implemented point-based solver. Next, we let the robot run for a horizon of 40 steps with no auxiliary information, with random selection of information sources, and with the proposed generalized greedy selection based on mutual information. We terminate the simulations once the robot reaches the goal. Fig. 2 illustrates the normalized frequency of visiting each state for each perception algorithm. Using no auxiliary information leads to the worst performance, as the robot visits the obstacle cells frequently. Random addition of auxiliary information sources improves the performance since it results in better obstacle avoidance. However, the proposed generalized greedy algorithm achieves the best obstacle avoidance and shows more concentration around the optimal path. Fig. 3 shows the discounted cumulative reward, averaged over 50 Monte Carlo runs, for all three policies, i.e., no auxiliary information, random selection of 1 and 2 information sources, and greedy selection of 1 and 2 information sources. It can be seen that the generalized greedy selection scheme obtains the highest reward.

## V Conclusion

We studied online active perception for POMDPs where, at each time step, the agent can pick a subset of available information sources, under a budget constraint, to enhance its belief. We defined a utility function based on the mutual information between the state and the information sources. We developed an efficient generalized greedy scheme that iteratively picks the observation sources with the highest marginal gain, scaled by the added cost. We theoretically established the near-optimality of the proposed scheme and further evaluated it on a robotic navigation task. As future work, we aim to employ PAC greedy maximization [satsangi2016pac] to accelerate the information selection process, since it only requires bounds on the utility function rather than its exact computation.

## Appendix A Proof of Lemma 1

It is clear that .

Let . To prove monotonicity, consider and . Then,

where and are due to Bayes’ rule for entropy, follows from the conditional independence assumption and joint entropy definition, is due to the conditional independence assumption, and stems from the fact that conditioning does not increase entropy.

Furthermore, from the third line of above proof, we can derive the marginal gain as:

To prove submodularity, let and . Then,

where is based on the fact that conditioning does not increase entropy, and results from .

## Appendix B Proof of Theorem 2

Let to be the updated belief (see (1)) after taking action and receiving observation . Also, let and to denote the updated beliefs (see (4)) after receiving auxiliary observations corresponding to the proposed generalized greedy scheme and the optimal selection, respectively. First, by leveraging the relation between mutual information and Kullback-Leibler (KL-) divergence, we establish the followings:

(10a) | |||

(10b) |

In other words, the mutual information between the state and a set of information sources is equivalent to the expected KL-divergence from the current belief to the posterior belief. Therefore, using (10) along with the result of Theorem 1 yields:

(11) | ||||

Next, we use the Pythagorean theorem for KL-divergence [csiszar1975divergence] and take expectation over all realizations of the observations to obtain:

(12) | ||||

We combine (11) and (12), and rearrange the terms to establish the following:

(13) |

where the right-hand side is a constant. Lastly, we exploit Pinsker's inequality, which relates the total variation distance to the KL-divergence, and apply Jensen's inequality for the square-root function (a concave function) to derive the desired result:

## Appendix C Proof of Theorem 3

Let and to represent the gradient of value function at and , respectively. Let and . Therefore, we can show that

where follows from the fact that is the gradient of optimal value function, is due to Hölder’s inequality, and is the result of Theorem 2 and the fact that for every vector.