# The Robot Routing Problem for Collecting Aggregate Stochastic Rewards

## Abstract

We propose a new model for reward collection problems on graphs with dynamically generated rewards that may appear and disappear according to a stochastic model. The *robot routing problem* is modeled as a graph whose nodes are stochastic processes generating potential rewards over discrete time. New rewards are generated according to these processes, but at each step an existing reward disappears with a given probability. The edges in the graph encode the (unit-distance) paths between the rewards' locations. On visiting a node, the robot collects the reward accumulated at the node at that time, but traveling between the nodes takes time. The optimization question asks to compute an optimal (or ε-optimal) path that maximizes the expected collected rewards.

We consider the finite and infinite-horizon robot routing problems. For the finite horizon, the goal is to maximize the total expected reward, while for the infinite horizon we consider limit-average objectives. We study the computational and strategy complexity of these problems, establish NP lower bounds, and show that optimal strategies require memory in general. We also provide an algorithm for computing ε-optimal infinite paths for arbitrary ε > 0.

## 1 Introduction

Reward collecting problems on metric spaces are at the core of many applications and are studied classically in combinatorial optimization under many well-known monikers: the traveling salesman problem, the knapsack problem, the vehicle routing problem, the orienteering problem, and so on. Typically, these problems model the metric space as a discrete graph whose nodes or edges carry rewards, either deterministic or stochastic, and ask how to traverse the graph to maximize the collected rewards. In most versions of the problem, rewards are either fixed or cumulative. In particular, once a reward appears, it stays there until collection. However, in many applications, existing rewards may disappear (e.g., a customer changing her mind) or have more “value” if they are collected fast.

We introduce the *Robot Routing problem*, which combines the spatial aspects of traveling salesman and other reward collecting problems on graphs with stochastic reward generation and with the possibility that uncollected rewards may disappear at each stage. The robot routing problem consists of a finite graph and a reward process for each node of the graph. The reward process models dynamic requests which appear and disappear. At each (discrete) time point, a new reward is generated at each node v according to a stochastic process with expectation r(v). However, at each point, a previously generated reward disappears with a fixed probability 1 − λ. When the node is visited, the entire accumulated reward is collected. The optimization problem for robot routing asks, given a graph and a reward process, what is the optimal (or ε-optimal) path a robot should traverse in this graph to maximize the expected reward?

As an illustrative example for our setting, consider a vendor planning her path through a city. At each street corner, and at each time step, a new customer arrives with some expectation, and each existing customer leaves with a fixed probability. When the vendor arrives at a corner, she serves all the existing requests at once. We ignore other possible real-world features and behaviors, e.g., customers leaving queues when the queue is long. How should the vendor plan her path? Similar problems can be formulated for traffic pooling [25], for robot control [13], for patrolling [15], and many other scenarios.
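To make the vendor model concrete, the following Python sketch compares the closed-form expected accumulated reward at a single corner with a Monte Carlo simulation. All names are ours; for the simulation we assume Bernoulli arrivals with mean r ≤ 1 and per-step survival probability λ, which is one instance of the general reward process.

```python
import random

def expected_uncollected(r, lam, steps):
    """Closed-form expected reward accumulated over `steps` time steps
    at a node with per-step expected reward r and survival probability lam."""
    if lam == 1.0:
        return r * steps
    return r * (1 - lam ** steps) / (1 - lam)

def simulate_uncollected(r, lam, steps, trials=20000, seed=0):
    """Monte Carlo estimate: each step one Bernoulli(r) customer arrives
    (so this sketch assumes r <= 1), and every waiting customer
    independently stays for the next step with probability lam."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        waiting = 0
        for _ in range(steps):
            # each waiting customer survives this step with probability lam
            waiting = sum(1 for _ in range(waiting) if rng.random() < lam)
            # a new customer arrives with probability r
            if rng.random() < r:
                waiting += 1
        total += waiting
    return total / trials
```

The closed form is the geometric sum r(1 − λ^ℓ)/(1 − λ) over the ℓ steps since the corner was last served; the simulation should agree with it up to sampling error.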

Despite the usefulness of robot routing in many scenarios involving dynamic appearance and disappearance of rewards, algorithms for its solution have not, to the best of our knowledge, been studied before. In this paper, we study two optimization problems: the *value computation problem*, which asks for the maximal expected reward over a *finite* or *infinite* horizon, and the *path computation problem*, which asks for a path realizing the optimal (or ε-optimal) reward. The key observation to solving these problems is that the reward collection can be formulated as discounted sum problems over an extended graph, using the correspondence between stopping processes and discounted sum games.

For finite horizon robot routing we show that the value *decision* problem (deciding if the maximal expected reward is at least a certain amount) is NP-complete when the horizon bound is given in unary, and the value and optimal path can be computed in exponential time using dynamic programming.

For the infinite horizon problem, where the accumulated reward is defined as the long run average, we show that the value decision problem is NP-hard if the probability of a reward disappearing is more than a threshold dependent on the number of nodes. We show that computing the optimal long run average reward can be reduced to a 1-player mean-payoff game on an *infinite graph*. By solving the mean payoff game on a finite truncation of this graph, we can approximate the solution up to an arbitrary precision. This gives us an algorithm that, for any given ε > 0, computes an ε-optimal path in time exponential in the size of the original graph and logarithmic in 1/ε. Unlike finite mean-payoff 2-player games, strategies which generate optimal paths for robot routing even in the 1-player setting can require memory. For the *non-discounted* infinite horizon problem (that is, when rewards do not disappear) we show that the optimal path and value problems are solvable in polynomial time.

###### Related work

The robot routing problem is similar in nature to a number of other problems studied in robot navigation, vehicle routing, patrolling, and queueing network control, but to the best of our knowledge has not been studied so far.

There exists a plethora of versions of the famous traveling salesman problem (TSP) which explore the trade-off between the cost of the constructed path and its reward. Notable examples include the orienteering problem [23], in which the number of locations visited in a limited amount of time is to be maximized, vehicle routing with time-windows [17] and deadlines-TSP [2], which impose restrictions or deadlines on when locations should be visited, as well as discounted-reward-TSP [4] in which soft deadlines are implemented by means of discounting. Unlike in our setting, in all these problems, rewards are static, and there is no generation and accumulation of rewards, which is a key feature of our model.

In the dynamic version of vehicle routing [7] and the dynamic traveling repairman problem [3], tasks are dynamically introduced and the objective is to minimize the expected task waiting time. In contrast, we focus on limit-average objectives, which are a classical way to combine rewards over infinite system runs. Patrolling [6] is another graph optimization problem, motivated by operational security. The typical goal in patrolling is to synthesize a strategy for a defender against a single attack at an undetermined time and location, and is thus incomparable to ours. A single-robot multiple-intruders patrolling setting that is close to ours is described in [15], but there again the objective is to merely detect whether there is a visitor/intruder at a given room. Thus, the patrolling environment in [15] is described by the probability of detecting a visitor for each location. On the contrary, our model can capture *counting patrolling problems*, where the robot is required not only to detect the presence of visitors but to register/count as many of them as possible. Another related problem is the information gathering problem [20]. The key difference between the information gathering setting and ours is that [20] assumes that making an observation earlier has bigger value than if a lot of observations have already been made. This restriction on the reward function is *not* present in our model, since the reward value collected when visiting node v at time i (making observation v, in their terms) only depends on the last time when v was previously visited, and not on the rest of the path (the other observations made, in their terms).

Average-energy games [8] are a class of games on finite graphs in which the limit-average objective is defined by a double summation. The setting discussed in [8] considers static edge weights and no discounting. Moreover, the inner sum in an average-energy objective is over the whole prefix so far, while in our setting the inner sum spans from the last to the current visit of the current node, which is a crucial difference between these two settings.

Finally, there is a rich body of work on multi-robot routing [24] which is closely related to our setting. However, the approaches developed there are limited to static tasks with fixed or linearly decreasing rewards. The main focus in the multi-robot setting is the task allocation and coordination between robots, which is a dimension orthogonal to the aggregate reward collection problem which we study.

Markov decision processes (MDP) [19] seem superficially close to our model. In an MDP, the rewards are determined statically as a function of the state and action. In contrast, the dynamic generation and accumulation of rewards in our model, especially the individual discounting of each generated reward, leads to algorithmic differences: for example, while MDPs admit memoryless optimal strategies for long-run average objectives, strategies require memory in our setting, and there is no obvious reduction to, e.g., an exponentially larger MDP.

We employed the reward structure of this article in [13] with the goal of synthesizing controllers for reward-collecting Markov processes in continuous space. The work [13] mainly addresses the continuous dynamics of the underlying Markov process: the authors use abstraction techniques [12] to provide approximately optimal controllers with formal guarantees on performance while maintaining the probabilistic nature of the process. In contrast, here we take a deterministic graph as the underlying dynamical model of the robot and thoroughly study the computational complexity of the proposed algorithms.

###### Contributions

We define a novel optimization problem for formalizing and solving reward collection in a metric space where stochastic rewards appear as well as disappear over time. Our contributions are as follows.

- We consider reward-collection problems in a novel model with *dynamic generation* and *accumulation* of rewards, where each reward *can disappear with a given probability*.
- We study the value decision problem, the value computation problem, and the path computation problem over a finite horizon. We show that the value decision problem is NP-complete when the horizon is given in unary. We describe a dynamic programming approach for computing the optimal value and an optimal path in exponential time.
- We study the value decision problem, the value computation problem, and the path computation problem over an infinite horizon. We show that for sufficiently large values of the disappearing factor the value decision problem is NP-hard. We provide an algorithm which, for any given ε > 0, computes an ε-optimal path in time exponential in the size of the original graph and logarithmic in 1/ε. We demonstrate that strategies (in the 1-player robot routing games) which generate infinite-horizon optimal paths can require memory.

## 2 Problem Formulation

###### Preliminaries and notation

A finite directed graph G = (V, E) consists of a finite set of nodes V and a set of edges E ⊆ V × V. A path in G is a finite or infinite sequence π = π₀π₁π₂… of nodes in V, such that (πᵢ, πᵢ₊₁) ∈ E for each i. We denote with |π| the length (number of edges) of a finite path π and write first(π) = π₀ and last(π) = π_{|π|}. For an infinite path π, we define |π| = ∞. We also denote the cardinality of a finite set S by |S|. We denote by ℕ and ℕ₊ the sets of non-negative and positive integers, respectively. We define [n] = {1, …, n} for any n ∈ ℕ₊. We denote with 𝟙(·) the indicator function, which takes a Boolean-valued expression as its argument and returns 1 if this expression evaluates to true and 0 otherwise.

###### Problem setting

Fix a graph G = (V, E). We consider a discrete-time setting where at each time step, at each node a reward process generates rewards according to some probability distribution. Once generated, each reward at a node decays according to a decaying function. A *reward-collecting* robot starts out at some node at time 0 and traverses one edge of G at each time step. Every time the robot arrives at a node v, it collects the reward accumulated at v since the last visit to v. Our goal is to compute the maximum expected reward that the robot can possibly collect, and to construct an optimal path for the robot in the graph, i.e., a path whose expected total reward is maximal.

To formalize reward accumulation, we define a function ℓ which, for a path π, maps an index i and a node v to the length of the segment of π starting at the previous occurrence of v and ending at position i, and to ∞ if v does not occur in π before time i:

ℓ_π(i, v) = i − max{ j ∈ {0, …, i−1} : π_j = v }, and ℓ_π(i, v) = ∞ if no such j exists.
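This last-visit function transcribes directly into Python (the name `gap` is ours):

```python
def gap(path, i, v):
    """Number of steps from the previous occurrence of node v before
    position i in `path` up to position i; infinity if v does not occur
    in path[0:i]."""
    for j in range(i - 1, -1, -1):  # scan backwards from position i-1
        if path[j] == v:
            return i - j
    return float("inf")
```

For instance, on the path 0, 1, 0 the gap of node 0 at position 2 is 2, while node 1 has no earlier occurrence at position 1, so its gap there is infinite.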

###### Reward functions

Let {R_t(v) : v ∈ V, t ∈ ℕ} be a set of random variables defined on a sample space Ω and indexed by the set of nodes V and the time steps. Each R_t(v) is a measurable function from Ω to ℝ≥0 that generates a random reward at node v at time step t. Let π be the path in G traversed by the robot. At time i, the position of the robot is the node πᵢ, and the robot collects the uncollected decayed reward generated at node πᵢ (since its last visit to πᵢ) up till and including time i. Then, the robot traverses the edge (πᵢ, πᵢ₊₁), and at time i+1 it collects the rewards at node πᵢ₊₁.

The uncollected reward at time i at a node v, given a path π traversed by the robot, is defined by the random variable

R_π(i, v) = Σ_{k=0}^{min(ℓ_π(i,v), i+1) − 1} λ(v)^k · R_{i−k}(v).

The value λ(v) ∈ (0, 1] in the above definition is a *discounting factor* that models the probability that a reward at node v survives for one more round; that is, the probability that a given reward instance at node v disappears at any step is 1 − λ(v).

Note that the previous time a reward was collected at node v = πᵢ was at time i − ℓ_π(i, v), the time node v was last visited before i. Thus R_π(i, v) corresponds to the rewards generated at node v at times i − ℓ_π(i, v) + 1, …, i, which have decayed by factors of λ(v)^{ℓ_π(i,v)−1}, …, λ(v)⁰, respectively. When traversing a path π, the robot collects the accumulated reward R_π(i, πᵢ) at time i at node πᵢ.

We define the *expected finite T-horizon sum reward* for a path π as:

R_T(π) = 𝔼[ Σ_{i=0}^{T} R_π(i, πᵢ) ].

Let r : V → ℝ≥0 be a function that maps each node v to the *expected value of the reward generated at node v* in each time step, r(v) = 𝔼[R_t(v)]. We assume that the rewards generated at each node are independent of the agent’s moves. Thus, the function r is sufficient for our study, since we have

R_T(π) = Σ_{i=0}^{T} r(πᵢ) · Σ_{k=0}^{min(ℓ_π(i,πᵢ), i+1) − 1} λ(πᵢ)^k.

For an infinite path π, the *limit-average* expected reward is defined as

R(π) = liminf_{T→∞} (1/(T+1)) · R_T(π).

The finite and infinite-horizon *reward values* for a node v are defined as the best rewards over all paths originating in v: V_T(v) = sup_{π₀=v} R_T(π) and V(v) = sup_{π₀=v} R(π), respectively. The choice of limit-average in R(π) is due to the unbounded sum reward when T goes to infinity. For a given path π, the sequence of averages R_T(π)/(T+1) may not converge. Thus we opt for the worst-case limiting behavior (lim inf) of the sequence. Alternatively, one may select the best-case limiting behavior (lim sup) with no substantial change in the results of this paper.

###### Node-invariant functions r and λ, and definition of cost functions

In the case when the functions r and λ are constant, we write r and λ for the respective constants. In this case, the expressions for R_T(π) and R(π) can be simplified using the identity Σ_{k=0}^{ℓ−1} λ^k = (1 − λ^ℓ)/(1 − λ) for λ < 1. Then we have

R_T(π) = (r/(1 − λ)) · Σ_{i=0}^{T} (1 − λ^{min(ℓ_π(i,πᵢ), i+1)}).

The expression for R(π) can be simplified as:

R(π) = (r/(1 − λ)) · (1 − limsup_{T→∞} (1/(T+1)) · Σ_{i=0}^{T} λ^{min(ℓ_π(i,πᵢ), i+1)}).

For the special case λ = 1 (i.e., when the rewards are not discounted), the expression for the finite-horizon reward is R_T(π) = r · Σ_{i=0}^{T} min(ℓ_π(i, πᵢ), i+1).

We define *cost functions* that map a path to a real-valued finite- or infinite-horizon cost:

C_T(π) = Σ_{i=0}^{T} λ^{min(ℓ_π(i,πᵢ), i+1)} and C(π) = limsup_{T→∞} (1/(T+1)) · C_T(π).

From the equations above, the computation of optimal paths for the reward functions R_T and R corresponds to computing paths that minimize the cost functions C_T and C, respectively. Analogously to V_T(v) and V(v), the infima of the cost functions over paths originating in v are denoted by C*_T(v) and C*(v), respectively.

As an illustration, consider the finite path π = v₁v₂v₁ in a graph with nodes v₁ and v₂. For the occurrences of node v₁ in π we have ℓ_π(0, v₁) = ∞ and ℓ_π(2, v₁) = 2, and for node v₂ we have ℓ_π(1, v₂) = ∞. With node-invariant r and λ < 1, the robot collects (r/(1−λ))(1 − λ) at time 0 and (r/(1−λ))(1 − λ²) at each of times 1 and 2, so R₂(π) = (r/(1−λ))(3 − λ − 2λ²). For the infinite path π′ = (v₁v₂)^ω we have min(ℓ_{π′}(i, π′ᵢ), i+1) = 2 for all i ≥ 1, and thus C(π′) = λ² and R(π′) = (r/(1−λ))(1 − λ²) = r(1 + λ).
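The arithmetic in such examples can be checked mechanically. Below is a small Python sketch (helper names are ours) that evaluates the expected finite-horizon reward of a concrete path under node-invariant r and λ, using the convention that a first visit at time i collects the i + 1 rewards generated so far.

```python
def last_gap(path, i):
    """Steps since the previous occurrence of path[i] before position i;
    None if this is the first visit."""
    for j in range(i - 1, -1, -1):
        if path[j] == path[i]:
            return i - j
    return None

def finite_horizon_reward(path, r, lam):
    """Expected reward R_T collected along `path` (T = len(path) - 1),
    assuming node-invariant expected reward r and survival probability lam."""
    total = 0.0
    for i in range(len(path)):
        g = last_gap(path, i)
        if g is None:
            g = i + 1  # first visit: i + 1 rewards are pending
        if lam == 1.0:
            total += r * g
        else:
            total += r * (1 - lam ** g) / (1 - lam)
    return total
```

With r = 1 and λ = 1/2, the path 0, 1, 0 (an encoding of v₁v₂v₁) yields 1 + 1.5 + 1.5 = 4.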

###### Problem statements

We investigate optimization and decision problems for finite and infinite-horizon robot routing. The *value computation problems* ask for the computation of V_T(v) and V(v). The corresponding *decision problems* ask to check whether the respective one of these two quantities is greater than or equal to a given threshold ϑ.

For a finite directed graph G, expected reward and discounting functions r and λ, and an initial node v, a finite path π is said to be an *optimal path* for time horizon T if (a) π₀ = v and |π| = T, and (b) for every path π′ in G with π′₀ = v and |π′| = T it holds that R_T(π′) ≤ R_T(π). Similarly, an infinite path π is said to be *optimal for the infinite horizon* if π₀ = v and for every infinite path π′ with π′₀ = v in G it holds that R(π′) ≤ R(π). We can also define corresponding *threshold paths*: given a value ϑ, a path π is said to be threshold-ϑ optimal if R_T(π) ≥ ϑ or R(π) ≥ ϑ, respectively. An *ε-optimal* path is one which is threshold (V_T(v) − ε)- or (V(v) − ε)-optimal (for finite or infinite horizon, respectively).

###### Paths as strategies

We often refer to infinite paths as resulting from strategies (of the collecting agent). A *strategy* in G is a partial function σ that maps finite paths to nodes such that if σ(π) = v is defined then (last(π), v) ∈ E. Given an initial node v, the strategy σ generates a unique infinite path, denoted π^σ_v. Thus, every infinite path π defines a unique strategy σ_π where σ_π(π₀…πᵢ) = πᵢ₊₁, and σ_π is undefined otherwise. Clearly, π^{σ_π}_{π₀} = π. We say a strategy σ is optimal for a path problem if the path π^σ_v is optimal. A strategy σ is *memoryless* if for every two paths π and π′ for which last(π) = last(π′), it holds that σ(π) = σ(π′). We say that memoryless strategies suffice for the optimal path problem if there always exists a memoryless strategy σ such that π^σ_v is an optimal path.

## 3 Finite Horizon Rewards: Computing V_T(v)

In this section we consider the finite-horizon problems associated with our model. The following theorem summarizes the main results.

Analogous results hold for the related reward problem where, in addition to the initial node v, we are also given a destination node, and the objective is to go from the initial node to the destination in at most T steps while maximizing the reward.

The finite-horizon value problem is NP-hard by reduction from the Hamiltonian path problem (the proof is in the appendix), even in the case of node-invariant r and λ. Membership in NP in the case where T is given in unary follows from the fact that we can guess a path of length T and check that the reward for that path is at least the desired threshold value.

To prove the second part of the theorem, we construct a finite *augmented weighted graph*.

For simplicity, we give the proof for node-invariant r and λ, working with the cost functions C_T and C. The augmented graph construction in the general case is a straightforward generalization obtained by changing the weights of the nodes, and the dynamic programming algorithm used for computing the optimal cost values is easily modified to compute the corresponding reward values instead. For λ < 1 the objective is to minimize C_T(π), and for λ = 1 the objective is to maximize Σ_{i=0}^{T} min(ℓ_π(i, πᵢ), i+1) over paths π.

###### Augmented weighted graph

Given a finite directed graph G = (V, E), we define the *augmented weighted graph* Ĝ, which “encodes” the values ℓ_π(i, v) for the paths π in G explicitly in the augmented graph nodes. We can assume w.l.o.g. that V = {v₁, …, v_n}.

Let π be a path in G. In the graph Ĝ there exists a unique path π̂ that corresponds to π: the path starting from the node (π₀, (∞, …, ∞)) such that, for all i, the first component of π̂ᵢ is πᵢ and the augmentation of π̂ᵢ records, for every node v ∈ V, the value ℓ_π(i, v). Dually, for each path π̂ in Ĝ starting from such an initial node, there exists a unique corresponding path π in G, obtained by projecting each π̂ᵢ onto its first component.

For a path π̂ in Ĝ, let

ĉ_T(π̂) = Σ_{i=0}^{T} w(π̂ᵢ) and ĉ(π̂) = limsup_{T→∞} (1/(T+1)) · ĉ_T(π̂),

where the weight w(π̂ᵢ) of an augmented node is determined by its augmentation (for λ < 1 it equals the cost term λ^{min(ℓ_π(i,πᵢ), i+1)}).

Thus, ĉ_T and ĉ define the classical total finite-sum (shortest path) and limit-average objectives on weighted (infinite) graphs [26]. Additionally, C_T(π) = ĉ_T(π̂) and C(π) = ĉ(π̂), where π̂ is the path in Ĝ corresponding to the path π.

Now, define ĉ*_T(v̂) as the infimum of ĉ_T(π̂) over all paths π̂ with π̂₀ = v̂, and similarly ĉ*(v̂) for ĉ. Then it is easy to see that C*_T(v) = ĉ*_T(v̂) and C*(v) = ĉ*(v̂), where v̂ is the augmented node corresponding to v. Thus, we can reduce the optimal path and value problems for G to standard objectives in Ĝ. The major difficulty is that Ĝ is infinite. However, note that only the first T + 1 nodes of a path in Ĝ are relevant for the computation of ĉ*_T. Thus, the value of ĉ*_T(v̂) can be computed on a *finite* subgraph of Ĝ, obtained by considering only the augmented nodes whose components do not exceed T + 1.

For λ < 1, we obtain the value by a standard dynamic programming algorithm which computes the shortest path of length T on this finite subgraph starting from the node v̂ (keeping track of the number of steps). For λ = 1, where the objective is to maximize the accumulated weight over paths of length T, we proceed analogously. Note that the subgraph used for the dynamic programming computation is of exponential size in terms of the size of G and the description of T. This gives the desired result in Theorem ?.
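For intuition, and for validating implementations of the dynamic program on tiny instances, the finite-horizon value can also be obtained by plain path enumeration. The sketch below (all names ours, node-invariant r and λ assumed) enumerates every path of length T from the initial node; it is exponential by design and stands in for the augmented-graph computation only on small examples.

```python
def finite_horizon_value(edges, v0, T, r, lam):
    """Exhaustive search for the optimal T-horizon reward from v0 on a
    small graph given as an adjacency dict `edges`."""
    def reward_at(path, i):
        # steps since the last visit of path[i]; i + 1 on a first visit
        g = next((i - j for j in range(i - 1, -1, -1) if path[j] == path[i]),
                 i + 1)
        return r * g if lam == 1.0 else r * (1 - lam ** g) / (1 - lam)

    best = float("-inf")
    stack = [[v0]]
    while stack:
        path = stack.pop()
        if len(path) == T + 1:
            best = max(best, sum(reward_at(path, i) for i in range(len(path))))
        else:
            for u in edges[path[-1]]:
                stack.append(path + [u])
    return best
```

On the two-node cycle with r = 1 and λ = 1/2, the value for T = 2 is 4, realized by alternating between the nodes.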

## 4 Infinite Horizon Rewards: Computing V(v)

Since we consider finite graphs, every infinite path eventually stays in some strongly connected component (SCC). Furthermore, the value of the reward function R(π) does not change if we alter or remove a finite prefix of the path π. Thus, it suffices to restrict our attention to the SCCs of the graph: the problem of finding an optimal path from a node v reduces to finding the SCC that gives the highest reward among the SCCs reachable from v. Therefore, in what follows we assume that the graph is strongly connected.

### 4.1 Hardness of Exact Value Computation

Since it suffices for the hardness results, we consider node-invariant r and λ.

###### Insufficiency of memoryless strategies.

Before we turn to the computational hardness of the value decision problem, we look at the *strategy complexity* of the optimal path problem and show that optimal strategies need memory.

Consider Example ?. A memoryless strategy results in paths which cycle exclusively either in the left cycle or in the right cycle (from the shared node, it prescribes a move to only one of the two successors). As shown in Example ?, the optimal path needs to visit both cycles. Thus, memoryless strategies do not suffice for this example.

For ω-regular objectives, strategies based on *latest visitation records* [14], which depend only on the *order* of the last node visits (*i.e.*, for all node pairs, whether the last visit of one was before that of the other or vice versa), are sufficient. However, we can show that such strategies do not suffice here either. To see this, recall the graph in Figure 1 and its optimal path. Upon visiting the shared node, the optimal strategy chooses its successor depending on the *number* of visits made to one cycle since the last occurrence of the other. On the other hand, every strategy based *only on the order of last visits* is not able to count these visits and thus results in a path that eventually settles into one of the simple cycles, which is not optimal for this graph. The proof is given in the appendix. It is open whether finite-memory strategies are sufficient for the infinite-horizon optimal path problem.

###### NP-Hardness of the value decision problem

To show NP-hardness of the infinite-horizon value decision problem, we first give bounds on V(v). The following lemma, proven in the appendix, establishes these bounds.

The following lemma establishes a relationship between the value of optimal paths and the existence of a Hamiltonian cycle in the graph, and is useful for providing a lower bound on the computational complexity of the value decision problem.

The proof is by contradiction. Suppose the path π does not traverse any Hamiltonian cycle infinitely often. Then it traverses each such cycle at most a finite number of times. Without loss of generality we can assume that the path does not traverse any such cycle at all, since the total number of Hamiltonian cycles is finite.

Now consider any finite sub-path of π of length n. Since the graph has n distinct nodes, at least one node is repeated in this sub-path. Moreover, if the only repetition closed a Hamiltonian cycle, this would contradict our assumption, so there must be another repetition. In either case, there is a repeated node whose two occurrences are at most n − 1 steps apart.

This yields a bound on the collected reward which contradicts the assumption on the value of the path. Hence, the existence of a Hamiltonian cycle is a necessary condition for achieving a value above the stated threshold.

*Remark.* Following the same reasoning as in the above proof, it is possible to improve the upper bound of the lemma for small values of λ, in terms of the length of the longest simple cycle of the graph.

We reduce the Hamiltonian cycle problem to the infinite-horizon optimal path problem. Given a graph G with n nodes, we fix a suitable λ and threshold. We show that G is Hamiltonian iff V(v) reaches the threshold. If G has a Hamiltonian cycle h, then the infinite path h^ω attains the required reward, for any choice of the initial node on h. For the other direction, applying Lemma ? implies that G is Hamiltonian.

###### Non-discounted rewards (λ = 1) and node-invariant function r

Contrary to the finite-horizon non-discounted case, the infinite-horizon optimal path and value problems for λ = 1 can be solved in polynomial time. To see this, note that for λ = 1 the reward collected at a visit to a node is r times the number of steps since the last visit to that node (capped by the elapsed time). Then, we can bound the limit-average reward by

R(π) ≤ r · |Inf(π)|,

where Inf(π) is the set of nodes visited in the path π infinitely often. This indicates that the maximum reward is bounded by r times the maximal size of a reachable SCC in the graph G. This upper bound is also achievable: we can construct an optimal path by finding a maximal-size SCC reachable from the initial node and a (not necessarily simple) cycle c that visits all the nodes in this SCC. Then, a subset of optimal paths consists of paths of the form π₀c^ω, where π₀ is any finite path that reaches c. This procedure can be carried out in time polynomial in the size of G.
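The construction above can be sketched in a few lines of Python: compute the SCCs, keep those reachable from the start node, and multiply the largest size by r. This is a sketch under our own naming (Kosaraju's algorithm), and it assumes every reachable SCC contains a cycle.

```python
from collections import defaultdict

def max_reachable_scc_value(edges, v0, r):
    """For lam = 1, the optimal limit-average reward from v0 equals
    r times the size of the largest SCC reachable from v0."""
    nodes = set(edges) | {u for vs in edges.values() for u in vs}
    rev = defaultdict(list)
    for v, vs in edges.items():
        for u in vs:
            rev[u].append(v)

    order, seen = [], set()
    def dfs1(v):                      # first pass: record finish order
        seen.add(v)
        for u in edges.get(v, []):
            if u not in seen:
                dfs1(u)
        order.append(v)
    for v in nodes:
        if v not in seen:
            dfs1(v)

    comp = {}
    def dfs2(v, c):                   # second pass on the reversed graph
        comp[v] = c
        for u in rev[v]:
            if u not in comp:
                dfs2(u, c)
    for v in reversed(order):
        if v not in comp:
            dfs2(v, v)

    reach, stack = {v0}, [v0]         # forward reachability from v0
    while stack:
        v = stack.pop()
        for u in edges.get(v, []):
            if u not in reach:
                reach.add(u)
                stack.append(u)

    # the reachable set is closed under edges, so any touched SCC
    # is contained in it entirely
    sizes = defaultdict(int)
    for v in reach:
        sizes[comp[v]] += 1
    return r * max(sizes.values())
```

On a graph with a reachable 2-node SCC and a reachable 3-node SCC, the value is 3r.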

Note that in the case λ = 1 there always exists an ultimately periodic optimal path; such a path is generated by a finite-memory strategy.

### 4.2 Approximate Computation of V(v)

In the previous section we discussed how to solve the infinite-horizon value and path computation problems for the non-discounted case. Now we show how the infinite-horizon path and value computation problems for λ < 1 can be effectively approximated. We first define functions that over- and under-approximate the cost C (and thus also the reward R) and establish bounds on the error of these approximations. Given an integer N, approximately optimal paths and an associated interval approximating V(v) can be computed using a finite augmented graph Ĝ_N based on the augmented graph of Section 3. Intuitively, Ĝ_N is obtained from Ĝ by pruning nodes that have a component greater than N in their augmentation. By increasing the value of N, the approximation error can be made arbitrarily small.

We describe the approximation algorithm for node-invariant r and λ. The results generalize trivially to the case when r and λ are not node-invariant by choosing N large enough to satisfy the condition that bounds the approximation error for each r(v) and λ(v).

###### Approximate cost functions

Consider the following functions from paths and indices to ℝ≥0, where g denotes min(ℓ_π(i, πᵢ), i+1):

c⁺_N(π, i) = λ^g if g ≤ N, and λ^N otherwise; c⁻_N(π, i) = λ^g if g ≤ N, and 0 otherwise.

Informally, for c⁺_N, if the last visit to node πᵢ occurred more than N time units before time i, the cost is λ^N, rather than the original smaller amount λ^g. For c⁻_N, if the last visit to πᵢ occurred more than N time steps before time i, then the cost is 0. For both, if the last visit to the node occurred at most N steps before, we pay the actual cost λ^g. The above definition implies that c⁻_N(π, i) ≤ λ^g ≤ c⁺_N(π, i) for every i. Then we have C⁻_N(π) ≤ C(π) ≤ C⁺_N(π), where we define

C^±_N(π) = limsup_{T→∞} (1/(T+1)) · Σ_{i=0}^{T} c^±_N(π, i).

The difference between the upper and lower bounds can be tuned by selecting N: for every path π,

C⁺_N(π) − C⁻_N(π) ≤ λ^N.

Therefore C(π) belongs to the interval [C⁻_N(π), C⁺_N(π)], and the length of the interval is at most λ^N. To guarantee a total error of ε for the actual reward, N must be chosen accordingly, since the reward scales the cost by the factor r/(1 − λ).
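The footnote elaborating the choice of N is cut off in our copy; the following short derivation is our hedged reconstruction of the bound it presumably stated.

```latex
% Costs and rewards are related by the factor r/(1-\lambda), so an
% interval of length \lambda^N for the cost translates into a reward
% error of at most
\[
  \frac{r}{1-\lambda}\,\lambda^{N}.
\]
% Hence, to guarantee a total reward error of \varepsilon it suffices
% to choose
\[
  N \;\ge\; \log_{\lambda}\frac{\varepsilon\,(1-\lambda)}{r},
\]
% which grows logarithmically in 1/\varepsilon, matching the stated
% overall complexity.
```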

###### Truncated augmented weighted graph

Recall the infinite augmented weighted graph Ĝ from Section 3. We define a truncated version Ĝ_N of Ĝ where we only keep track of last-visit values less than or equal to N. For Ĝ_N we define two weight functions, w⁺_N and w⁻_N, for c⁺_N and c⁻_N respectively.

Similarly to the infinite augmented graph, the truncated costs of a path π in G coincide with the limit-average weights of the corresponding path in Ĝ_N with respect to w⁺_N and w⁻_N, respectively.

It is easy to see that the infimum of C⁺_N over paths from v is the least possible limit-average cost with respect to w⁺_N in Ĝ_N starting from the corresponding augmented node. The same holds for C⁻_N with w⁻_N. Below we show how to compute the former value; the latter case is analogous, and thus omitted.

###### Algorithm for computing the optimal truncated cost

We now describe a method to compute the optimal value of C⁺_N as the least possible limit-average cost in Ĝ_N with respect to w⁺_N. It is well known that this can be reduced to the computation of the minimum cycle mean in the weighted graph [26], which in turn can be done using the algorithm from [16] that we now describe.

As before, we first assume that the graph is strongly connected. Fix a start node s and let m be the number of nodes of the truncated augmented graph. For every node u and every k ∈ {0, …, m}, we define D_k(u) as the minimum weight of a path of length k from s to u; if no such path exists, then D_k(u) = ∞. The values can be computed by the recurrence

D_k(u) = min_{(u′, u) ∈ E} ( D_{k−1}(u′) + w(u′, u) )

with the initial conditions D₀(s) = 0 and D₀(u) = ∞ for any u ≠ s. Then, we can compute the minimum cycle mean as

μ* = min_u max_{0 ≤ k ≤ m−1, D_k(u) < ∞} ( D_m(u) − D_k(u) ) / (m − k).

A cycle with the computed minimum mean can be extracted by fixing the node u which achieves the minimum in the above value and the respective path length k, then finding a minimum-weight path of length m from s to u and a cycle of length m − k within this path. Thus, the path in Ĝ_N obtained by repeating this cycle infinitely often realizes this value. A path from v in G with the same cost is obtained from it by projection on V.
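The recurrence and the minimum-cycle-mean formula above translate directly into code. The following Python sketch implements the minimum cycle mean computation (Karp's algorithm) on an explicit weighted edge list, a stand-in for the truncated augmented graph; all names are ours.

```python
def min_cycle_mean(n, edges):
    """Minimum mean weight over all cycles of a strongly connected
    directed graph with nodes 0..n-1 and `edges` given as (u, v, w)."""
    INF = float("inf")
    # D[k][v]: minimum weight of a path with exactly k edges from node 0 to v
    D = [[INF] * n for _ in range(n + 1)]
    D[0][0] = 0.0
    for k in range(1, n + 1):
        for (u, v, w) in edges:
            if D[k - 1][u] < INF:
                D[k][v] = min(D[k][v], D[k - 1][u] + w)
    best = INF
    for v in range(n):
        if D[n][v] == INF:
            continue
        # Karp's characterization: mu* = min_v max_k (D_n(v) - D_k(v)) / (n - k)
        worst = max((D[n][v] - D[k][v]) / (n - k)
                    for k in range(n) if D[k][v] < INF)
        best = min(best, worst)
    return best
```

For the two-node graph with edge weights 1 and 3 on its single cycle, the minimum cycle mean is (1 + 3)/2 = 2.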

In the general case, when Ĝ_N is not strongly connected, we have to consider each of its SCCs reachable from the initial node, and determine the one with the least minimum cycle mean.

For each SCC with e edges and m nodes, the computation of the quantities D_k(u) requires O(m·e) operations. The computation of the minimum cycle mean for this component requires a further O(m²) operations. Since m ≤ e because of the strong connectivity, the overall computation time for the SCC is O(m·e). Finally, the SCCs of Ĝ_N can be computed in time linear in its size [21]. This gives us the following result.

The same result can be established for the under-approximation C⁻_N.

*Remark.* The number of nodes of Ĝ_N is exponential in the number of nodes of G. For the approximation procedure described above, it suffices to augment the graph with the information about which nodes were visited in the last N steps and in what order. Thus, we can alternatively consider a smaller augmented graph in the case when the chosen N is smaller than the number of nodes of G.

Theorem ?, a consequence of Lemma ?, states the approximate computation result.

### 4.3 Approximation via Bounded Memory

The algorithm presented earlier is based on an augmentation of the graph with a specific structure that is updated deterministically and whose size depends on the desired quality of approximation. Furthermore, in this augmented graph there exists a memoryless strategy with approximately optimal reward. We show that this allows us to quantify how far from the optimal reward value a strategy can be when it is optimal among those with bounded memory of a fixed size.

First, we give the definition of memory structures. A *memory structure* for a graph G = (V, E) consists of a finite set M, an initial memory value m₀ ∈ M, and a memory update function δ : M × V → M. The memory update function can be extended to finite paths by defining δ*(ε) = m₀ and δ*(π·v) = δ(δ*(π), v). A memory structure together with a next-move function ν : M × V → V, such that (v, ν(m, v)) ∈ E for all m ∈ M and v ∈ V, and an initial node v₀ define a strategy σ with σ(π) = ν(δ*(π), last(π)). In this case we say that the strategy σ has memory M. Given a bound κ on the memory size, we define the finite graph G × M, where the set of nodes is V × M with |M| = κ, and the edges are all pairs ((v, m), (v′, m′)) such that (v, v′) ∈ E and m′ = δ(m, v′).

Memoryless strategies in this product graph precisely correspond to strategies that have memory of size |M|. More precisely, for each strategy σ in G that has memory M, there exists a memoryless strategy σ′ in G × M such that the projection onto V of the path generated by σ′ is the path generated by σ. Conversely, for each memoryless strategy σ′ in G × M there exist a memory structure with memory set M and a strategy σ with memory M in G such that the projection onto V of the path generated by σ′ is the path generated by σ.
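Constructing the product of a graph with a memory structure is mechanical. A minimal Python sketch follows (names are ours; the memory is updated on the target node of each edge, consistent with the definition above, and play would start at the pair (v₀, m₀)):

```python
def product_graph(edges, memory, delta):
    """Product of a graph (adjacency dict `edges`) with a memory
    structure (finite set `memory`, update function delta(m, v)).
    Nodes are pairs (v, m); an edge ((v, m), (u, m')) exists iff
    (v, u) is a graph edge and m' = delta(m, u)."""
    prod = {}
    for v, succs in edges.items():
        for m in memory:
            prod[(v, m)] = [(u, delta(m, u)) for u in succs]
    return prod
```

For example, a two-element memory that toggles on every step turns the two-node cycle into a four-node product in which the memory bit tracks the parity of the number of moves made.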

An optimal strategy among those with memory of a given size can be computed by inspecting the memoryless strategies in the product graph and selecting one with maximal reward (there are finitely, though exponentially, many such strategies).

A strategy returned by the approximation algorithm presented earlier uses a memory structure of size