On the Adaptivity Gap of Stochastic Orienteering
The input to the stochastic orienteering problem consists of a budget and a metric where each vertex has a job with a deterministic reward and a random processing time (drawn from a known distribution). The processing times are independent across vertices. The goal is to obtain a non-anticipatory policy (originating from a given root vertex) to run jobs at different vertices that maximizes expected reward, subject to the total distance traveled plus processing times being at most the budget. An adaptive policy can choose the next vertex to visit based on observed random instantiations, whereas a non-adaptive policy is given by a fixed ordering of vertices. The adaptivity gap is the worst-case ratio of the expected rewards of the optimal adaptive and non-adaptive policies.
We prove an lower bound on the adaptivity gap of stochastic orienteering. This provides a negative answer to the -adaptivity gap conjectured in , and comes close to the upper bound proved there. This result holds even on a line metric.
We also show an upper bound on the adaptivity gap for the correlated stochastic orienteering problem, where the reward of each job is random and possibly correlated to its processing time. Using this, we obtain an improved quasi-polynomial time -approximation algorithm for correlated stochastic orienteering.
In the orienteering problem , we are given a metric with a starting vertex and a budget on length. The objective is to compute a path originating from having length at most , that maximizes the number of vertices visited. This is a basic vehicle routing problem (VRP) that arises as a subroutine in algorithms for a number of more complex variants, such as VRP with time-windows, discounted reward TSP and distance constrained VRP.
The stochastic variants of orienteering and related problems such as traveling salesperson and vehicle routing have also been extensively studied. In particular, several dozen variants have been considered depending on which parameters are stochastic, the choice of the objective function, the probability distributions, and optimization models such as a priori optimization, stochastic optimization with recourse, probabilistic settings and so on. For more details we refer to a recent survey  and references therein.
Here, we consider the following stochastic version of the orienteering problem defined by . Each vertex contains a job with a deterministic reward and random processing time (also referred to as size); these processing times are independent across vertices. The processing times model the random delays encountered at the node, say due to long queues or activities such as filling out a form, before the reward can be collected. The distances in the metric correspond to travel times between vertices, which are deterministic. The goal is to compute a policy, which describes a path originating from the root that visits vertices and runs the respective jobs, so as to maximize the total expected reward subject to the total time (for travel plus processing) being at most . Stochastic orienteering also generalizes the well-studied stochastic knapsack problem [8, 4, 3] (when all distances are zero). We also consider a further generalization, where the reward at each vertex is also random and possibly correlated to its processing time.
A feasible solution (policy) for the stochastic orienteering problem is represented by a decision tree, where nodes encode the “state” of the solution (previously visited vertices and the residual budget), and branches denote random instantiations. Such solutions are called adaptive policies, to emphasize the fact that their actions may depend on previously observed random outcomes. Often, adaptive policies can be very complex and hard to reason about. For example, even for the stochastic knapsack problem an optimal adaptive strategy may have exponential size (and several related problems are PSPACE-hard) .
Thus a natural approach for designing algorithms in the stochastic setting is to: (i) restrict the solution space to the simpler class of non-adaptive policies (e.g., in our stochastic orienteering setting, such a policy is described by a fixed permutation in which to visit vertices, until the budget is exhausted), and (ii) design an efficient algorithm to find a (close to) optimal non-adaptive policy.
While non-adaptive policies are often easier to optimize over, the drawback is that they could be much worse than the optimum adaptive policy. Thus, a key issue is to bound the adaptivity gap, introduced by  in their seminal paper, which is the worst-case ratio (over all problem instances) of the optimal adaptive value to the optimal non-adaptive value.
In recent years, increasingly sophisticated techniques have been developed for designing good non-adaptive policies and for proving small adaptivity gaps [8, 11, 7, 2, 12, 13]. For stochastic orienteering,  gave an bound on the adaptivity gap, using an elegant probabilistic argument (previous approaches only gave a bound). More precisely, they considered certain correlated probabilistic events and used martingale tail bounds on suitably defined stopping times to bound the probability that none of these events happens. In fact,  conjectured that the adaptivity gap for stochastic orienteering was , suggesting that the factor was an artifact of their analysis.
1.1 Our Results and Techniques
Adaptivity gap for stochastic orienteering: Our main result is the following lower bound.
Theorem 1.1. The adaptivity gap of stochastic orienteering is , even on a line metric.
This answers negatively the -adaptivity gap conjectured in , and comes close to the upper bound proved there. To the best of our knowledge, this gives the first non-trivial adaptivity gap for a natural problem.
The lower bound proceeds in three steps and is based on a somewhat intricate construction. We begin with a basic instance described by a directed binary tree of height that essentially represents the optimal adaptive policy. Each processing time is a Bernoulli random variable: it is either zero, in which case the optimal policy goes to its left child, or a carefully set positive value, in which case the optimal policy goes to its right child. The edge distances and processing times are chosen so that when a non-zero size instantiates, it is always possible to take a right edge, while the left edges can only be taken a few times. On the other hand, if the non-adaptive policy chooses a path with mostly right edges, then it cannot collect too much reward.
In the first step of the proof, we show that this directed tree instance has an adaptivity gap. The main technical difficulty here is to show that every fixed path (which may possibly skip vertices, and gain advantage over the adaptive policy) either runs out of budget or collects low expected reward. In the second step, we drop the directions on the edges and show that the adaptivity gap continues to hold (up to constant factors). The optimum adaptive policy that we compare against remains the same as in the directed case, and the key issue here is to show that the non-adaptive policy cannot gain too much by backtracking along the edges. To this end, we use some properties of the distances on edges in our instance. In the final step, we embed the undirected tree onto a line at the expense of losing another factor in the adaptivity gap. The problem here is that pairs of nodes that are far apart on the tree may be very close on the line. To get around this, we exploit the asymmetry of the tree distances and some other structural properties to show that this has limited effect.
Correlated Stochastic Orienteering: Next, we consider the correlated stochastic orienteering problem, where the reward at each vertex is also random and possibly correlated with its processing time (the distributions are still independent across vertices). In this setting, we prove the following.
Theorem 1.2. The adaptivity gap of correlated stochastic orienteering is .
This improves upon the -factor adaptivity gap that is implicit in , and matches the adaptivity gap upper bound known for uncorrelated stochastic orienteering. The proof makes use of a martingale concentration inequality  (as  did for the uncorrelated problem), but dealing with the reward-size correlations requires a different definition of the stopping time. For the uncorrelated case, the stopping time  used a single “truncation threshold” (equal to minus the travel time) to compare the instantiated sizes and their expectation. In the correlated setting, we use different truncation thresholds (all powers of ), irrespective of the travel time, to determine the stopping criteria.
Algorithm for Correlated Stochastic Orienteering: Using some structural properties in the proof of the adaptivity gap upper bound above, we obtain an improved quasi-polynomial time algorithm for correlated stochastic orienteering. (A quasi-polynomial time algorithm runs in time $2^{\log^{c} N}$ on inputs of size $N$, where $c$ is some constant.)
Theorem 1.3. There is an -approximation algorithm for correlated stochastic orienteering, running in time . Here denotes the best approximation ratio for the orienteering with deadlines problem.
The orienteering with deadlines problem is defined formally in Section 1.3. Previously,  gave a polynomial time -approximation algorithm for correlated stochastic orienteering. They also showed that this problem is at least as hard to approximate as the deadline orienteering problem, i.e. an -hardness of approximation (this result also holds for quasi-polynomial time algorithms). Our algorithm improves the approximation ratio to , but at the expense of quasi-polynomial running time. We note that the running time in Theorem 1.3 is quasi-polynomial for general inputs where probability distributions are described explicitly, since the input size is . If probability distributions are specified implicitly, the runtime is quasi-polynomial only for .
The algorithm in Theorem 1.3 is based on finding an approximate non-adaptive policy, and losing an -factor on top by Theorem 1.2. There are three main steps in the algorithm: (i) we enumerate over many “portal” vertices (suitably defined) on the optimal policy; (ii) using these portal vertices, we solve (approximately) a configuration LP relaxation for paths between portal vertices; (iii) we randomly round the LP solution. The quasi-polynomial running time is only due to the enumeration. In formulating and solving the configuration LP relaxation, we also use some ideas from the earlier -approximation algorithm . Solving the configuration LP requires an algorithm for deadline orienteering (as the dual separation oracle), and incurs an -factor loss in the approximation ratio. This configuration LP is a “packing linear program”, for which we can use fast combinatorial algorithms [15, 9]. The final rounding step involves randomized rounding with alteration, and loses an extra factor.
1.2 Related Work
The deterministic orienteering problem was introduced by Golden et al. . It has several applications, and many exact approaches and heuristics have been applied to this problem; see, e.g., the survey . The first constant-factor approximation algorithm was due to Blum et al. . The approximation ratio has been improved [1, 6] to the current best .
Dean et al.  were the first to consider stochastic packing problems in this adaptive optimization framework: they introduced the stochastic knapsack problem (where items have random sizes), and obtained a constant-factor approximation algorithm and adaptivity gap. The approximation ratio has subsequently been improved to , due to [4, 3]. The stochastic orienteering problem  is a common generalization of both deterministic orienteering and stochastic knapsack.
Gupta et al.  studied a generalization of the stochastic knapsack problem, to the setting where the reward and size of each item may be correlated, and gave an -approximation algorithm and adaptivity gap for this problem. Recently, Ma  improved the approximation ratio to .
The correlated stochastic orienteering problem was studied in , where the authors obtained an -approximation algorithm and an adaptivity gap. They also showed the problem to be at least as hard to approximate as the deadline orienteering problem, for which the best approximation ratio known is .
A related problem to stochastic orienteering was considered by Guha and Munagala  in the context of the multi-armed bandit problem. As observed in , the approach in  yields an -approximation algorithm (and adaptivity gap) for the variant of stochastic orienteering with two separate budgets for the travel and processing times. In contrast, our result shows that stochastic orienteering (with a single budget) has super-constant adaptivity gap.
1.3 Problem Definition
An instance of stochastic orienteering () consists of a metric space with vertex-set and symmetric integer distances (satisfying the triangle inequality) that represent travel times. Each vertex is associated with a stochastic job, with a deterministic reward and a random processing time (also called size) distributed according to a known probability distribution. The processing times are independent across vertices. We are also given a starting “root” vertex , and a budget on the total time available. A solution (policy) must start from , and visit a sequence of vertices (possibly adaptively). Each job is executed non-preemptively, and the solution knows the precise processing time only upon completion of the job. The objective is to maximize the expected reward from jobs that are completed before the horizon ; note that there is no reward for partially completing a job. The approximation ratio of an algorithm is the ratio of the expected reward of an optimal policy to that of the algorithm’s policy.
We assume that all times (travel and processing) are integer valued and lie in . In the correlated stochastic orienteering problem (), the job sizes and rewards are both random, and correlated with each other. The distributions across different vertices are still independent. For each vertex , we use and to denote its random size and reward, respectively. We assume an explicit representation of the distribution of each job : for each , job has size and reward with probability . Note that the input size is .
An adaptive policy is a decision tree where each node is labeled by a job/vertex of , with the outgoing arcs from a node labeled by corresponding to the possible sizes in the support of . A non-adaptive policy is simply given by a path starting at : we just traverse this path, processing the jobs that we encounter, until the total (random) size of the jobs plus the distance traveled reaches . A randomized non-adaptive policy may pick a path at random from some distribution before it knows any of the size instantiations, and then follows this path as above. Note that in a non-adaptive policy, the order in which jobs are processed is independent of their processing time instantiations.
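To make the definition of a non-adaptive policy concrete, the following minimal sketch evaluates the exact expected reward of a fixed visiting order by enumerating all size outcomes. It is only practical for tiny instances and is not an algorithm from this paper. The function name `expected_reward_nonadaptive`, the nested-list distance matrix, and the `(reward, [(size, probability), ...])` job encoding are assumptions introduced here for illustration.

```python
from itertools import product

def expected_reward_nonadaptive(path, dist, jobs, root, budget):
    """Exact expected reward of visiting `path` in order, starting at `root`.

    dist[u][v] is the (symmetric) travel time between u and v.
    jobs[v] = (reward, [(size, probability), ...]) is an explicit distribution.
    A job earns its reward only if it completes within the budget.
    """
    total = 0.0
    supports = [jobs[v][1] for v in path]
    # Sizes are independent across vertices, so joint probabilities multiply.
    for outcome in product(*supports):
        prob = 1.0
        for _, p in outcome:
            prob *= p
        elapsed, here, reward = 0, root, 0.0
        for v, (size, _) in zip(path, outcome):
            elapsed += dist[here][v] + size
            if elapsed > budget:
                break  # the job at v does not finish in time: no reward, stop
            reward += jobs[v][0]
            here = v
        total += prob * reward
    return total

# Hypothetical instance: root 0 and vertices 1, 2 on a line, budget 4.
dist = [[0, 1, 2], [1, 0, 1], [2, 1, 0]]
jobs = {1: (1.0, [(0, 0.5), (3, 0.5)]), 2: (2.0, [(1, 1.0)])}
print(expected_reward_nonadaptive([1, 2], dist, jobs, 0, 4))  # 2.0
print(expected_reward_nonadaptive([2, 1], dist, jobs, 0, 4))  # 2.5
```

In this toy instance the order matters: visiting the deterministic job first protects its reward from the random delay at vertex 1.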
In our algorithm for , we use the deadline orienteering problem as a subroutine. The input to this problem is a metric denoting travel times, a reward and deadline at each vertex, start () and end () vertices, and length bound . The objective is to compute an path of length at most that maximizes the reward from vertices visited before their deadlines. The best approximation ratio for this problem is due to [1, 6].
The adaptivity gap lower bound appears in Section 2, where we prove Theorem 1.1. In Section 3, we consider the correlated stochastic orienteering problem and prove the upper bound on its adaptivity gap (Theorem 1.2). Finally, the improved quasi-polynomial time algorithm (Theorem 1.3) for correlated stochastic orienteering appears in Section 4.
2 Lower Bound on the Adaptivity Gap
Here we describe our lower bound instance which shows that the adaptivity gap is even for an undirected line metric. The proof and the description of the instance are divided into three steps. First we describe an instance where the underlying graph is a directed complete binary tree, and prove the lower bound for it. The directedness ensures that all policies follow a path from root to a leaf (possibly with some nodes skipped) without any backtracking. Second, we show that the directedness assumption can be removed at the expense of an additional factor in the adaptivity gap. In particular this means that the nodes on the tree can be visited in any order starting from the root. Finally, we “embed” the undirected tree into a line metric, and show that the adaptivity gap stays the same up to a constant factor.
2.1 Directed Binary Tree
Let be an integer and . We define a complete binary tree of height with root . All the edges are directed from the root towards the leaves. The level of any node is the number of nodes on the shortest path from to any leaf. So all the leaves are at level one and the root is at level . We refer to the two children of each internal node as the left and right child, respectively. Each node of the tree has a job with some deterministic reward and a random size . Each random variable is Bernoulli, taking value zero with probability and some positive value with the remaining probability . The budget for the instance is .
To complete the description of the instance, we need to define the values of the rewards , the job sizes , and the distances on edges .
Defining rewards. For any node , let denote the number of right-branches taken on the path from the root to . We define the reward of each node to be .
Defining sizes. Let for any . The size at the root, . The rest of the sizes are defined recursively. For any non-root node at level with denoting its parent, the size is:
In other words, for a node at level , consider the path from to . Let where if is the left child of its parent , and otherwise (we assume ). Then .
Observe that for a node , each node in its left (resp. right) subtree has (resp. ).
It remains to define distances on the edges. This will be done in an indirect way, and it is instructive to first consider the adaptive policy that we will work with. In particular, the distances will be defined in such a way that the adaptive policy can always continue till it reaches a leaf node.
Adaptive policy . Consider the policy that goes left at node whenever it observes size zero at , and goes right otherwise.
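The branching rule of this policy can be sketched as a short walk on the complete binary tree. This is an illustration only: it tracks which nodes are visited and how many right branches precede each of them, and it ignores sizes, distances, and budgets. The name `adaptive_walk`, the encoding of nodes as strings of `L`/`R` branches, and the `is_positive` callback are assumptions introduced here.

```python
def adaptive_walk(height, is_positive):
    """Follow the rule 'go left on size zero, go right otherwise'.

    Nodes are named by their branch string from the root ('' is the root).
    `is_positive(node)` returns True if the size at `node` instantiates to
    its positive value. The root is at level `height`, leaves at level 1.
    Returns a list of (node, number of right branches taken to reach node).
    """
    node, rights, visited = "", 0, []
    for level in range(height, 0, -1):
        visited.append((node, rights))
        if level == 1:
            break  # reached a leaf
        if is_positive(node):
            node, rights = node + "R", rights + 1
        else:
            node = node + "L"
    return visited

# Hypothetical outcome: positive sizes at the root and at node 'RL'.
positives = {"", "RL"}
print(adaptive_walk(4, lambda v: v in positives))
```

The right-branch count returned for each node is the quantity on which the reward of that node depends in the construction above.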
Clearly, the residual budget at node under will satisfy the following: , and
Defining distances. We will define the distances so that the residual budgets under satisfy the following: , and for any node with parent ,
In particular, this implies the following lengths on edges. For any node with parent ,
In Claim 2.3 below we will show that the distances are non-negative, and hence well-defined.
Figure 1 gives a pictorial view of the instance.
Basic properties of the instance
Let denote the distance traveled by the adaptive strategy A to reach , and let denote the total size instantiation before reaching . By the definition of the budgets, and as A takes the right branch at iff the size at instantiates, we have the following.
Claim 2.1. For any node , the budget satisfies .
Claim 2.2. If a node is a left child of its parent, then .
Let be the parent of . By definition of sizes, . As by the definition of residual budgets, the claim follows. ∎
Claim 2.3. For any node , we have . This implies that all the residual budgets and distances are non-negative.
Let denote the lowest level node on the path from to that is the left child of its parent (if is the left child of its parent, then ); if there is no such node, set . Note that by Claim 2.2 and the definition of and , in either case it holds that .
Let denote the path from to (including but not ; so if ). Since contains only right-branches, and hence . Thus to prove it suffices to show . For brevity, let and . Using the definition of sizes,
as desired. Here the right hand side of the first inequality is simply the total size of nodes in the to leaf path using all right branches. The inequality in the second line follows as for all .
Thus we always have .
As if is the right child of , or otherwise, this implies that all the residual-budgets are non-negative.
Similarly, as is either or (and hence at least ), this implies that all edge lengths are non-negative. ∎
This claim shows that the above instance is well defined, and that is a feasible adaptive policy that always continues for steps until it reaches a leaf. Next, we show that obtains large expected reward.
Lemma 2.4. The expected reward of policy is .
Notice that accrues reward as follows: it keeps getting reward (and going left) until the first positive size instantiation, then it goes right for a single step and keeps going left and getting reward till the next positive size instantiation and so on. This continues for a total of steps. In particular, at any time it collects reward , if exactly nodes have positive sizes among the nodes seen.
Let denote the Bernoulli random variable that is if the node in has a positive size instantiation, and otherwise. So , and . By Markov’s inequality, the probability that more than nodes in have positive sizes is at most half. Hence, with probability at least the reward collected in the last node of is at least . That is, the total expected reward of is at least . ∎
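The Markov step above can be checked numerically. The sketch below assumes that the threshold in the argument is twice the expected number of positive sizes, which is the usual way Markov's inequality yields a bound of one half; the exact threshold and probability used in the paper are not visible in this text, so `n` and `p` are free, hypothetical parameters and `tail_above_twice_mean` is a name introduced here.

```python
from math import comb

def tail_above_twice_mean(n, p):
    """Exact P[X > 2*n*p] for X ~ Binomial(n, p).

    X models the number of positive size instantiations among n independent
    nodes, each positive with probability p.
    """
    threshold = 2 * n * p
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(n + 1)
        if k > threshold
    )

# Markov's inequality guarantees each of these tails is at most 1/2.
for n, p in [(16, 0.25), (25, 0.2), (100, 0.1)]:
    print(n, p, tail_above_twice_mean(n, p))
```

The exact tails are typically far below one half; Markov's inequality is used in the proof only because a constant probability suffices.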
2.2 Bounding Directed Non-adaptive Policies
We will first show that any non-adaptive policy that is constrained to visit vertices according to the partial order given by the tree gets reward at most . Notice that these correspond precisely to non-adaptive policies on the directed tree .
The key property we need from the size construction is the following.
Lemma 2.5. For any node , the total size instantiation observed under the adaptive policy before is strictly less than .
Consider the path from the root to , and let denote the levels at which “turns left”. That is, for each , the node at level in path satisfies (a) is the right child of its parent, and (b) contains the left child of if it goes below level . (If is the right child of its parent then and .) Let denote the size of , the level node in . Also, set corresponding to the root. Below we use .
We first bound the size instantiation between levels and in terms of . Observe that a positive size instantiation is seen in only along right branches. So for any , the total size instantiation seen in between levels and is at most:
Now, note that for any , the sizes and are related as follows:
The first inequality uses the fact that the path from to is a sequence of (at least one) left-branches followed by a sequence of (at least one) right-branches. As the size decreases along left-branches and increases along right branches, it follows that conditional on the values of and , the ratio is maximized for the path with a sequence of left branches followed by a single right branch (at level ).
Using (2), we obtain inductively that:
We now show that any non-adaptive policy on the directed tree achieves reward . Note that any such solution is just a root-leaf path in that skips some subset of vertices. A node in is an L-branching node if the path goes left after . R-branching nodes are defined similarly.
Claim 2.6. The total reward from R-branching nodes is at most .
As the reward of a node decreases by a factor of upon taking a right branch, the total reward of such nodes is at most . ∎
Claim 2.7. The non-adaptive policy cannot get any reward after two L-branching nodes instantiate to positive sizes.
For any node in tree , let (resp. ) denote the distance traveled (resp. size instantiated) in the adaptive policy until ; here does not include the size of . Observe that Lemma 2.5 implies that for all nodes .
In the non-adaptive solution , let and be any two L-branching nodes that instantiate to positive sizes and ; say appears before . Under this outcome, we will show that exhausts its budget after . Note that the distance traveled to node in is exactly , the same as that under . So the total distance plus size instantiated in is at least , which (as we show next) is more than the budget .
By Claim 2.1, . Moreover, the residual budget at the left child of equals . Since the residual budgets are non-increasing down the tree , we have , i.e. . Hence, the total distance plus size in is at least
where the last inequality follows from Lemma 2.5. So can not obtain reward from any node after . ∎
Combining the above two claims, we obtain:
Claim 2.8. The expected reward of any directed non-adaptive policy is at most .
This proves an adaptivity gap for stochastic orienteering on directed metrics. We remark that the upper bound in  also holds for directed metrics.
2.3 Adaptivity Gap for Undirected Tree
We now show that the adaptivity gap does not change much even if we make the edges of the tree undirected. In particular, this has the effect of allowing the non-adaptive policy to backtrack along the (previously directed) edges, and visit any collection of nodes in the tree. Recall that in the directed instance of the previous subsection, the non-adaptive policy could not try too many -branching nodes (Claim 2.7) and hence was forced to choose mostly -branching nodes, in which case the rewards decreased rapidly. However, in the undirected case, the non-adaptive policy can move along some right edges to collect rewards and then backtrack to high-reward nodes.
The adaptive policy we compare against is the same as in the directed case. Let denote some fixed non-adaptive policy. Using the definition of edge-lengths,
Claim 2.9. The non-adaptive policy cannot backtrack over any left-branching edge.
As in the proof of Claim 2.7, for any , let (resp. ) denote the distance traveled (resp. size instantiated) in the adaptive policy until node ; recall does not include the size of . If backtracks over the left-edge out of some node then the distance traveled is at least:
The first equality follows as and the second equality follows as by Claim 2.1. The first inequality follows as , by Lemma 2.5. The second inequality follows as by the definition of sizes. Finally, the last inequality follows by Claim 2.3. Since the distance traveled by ’ is more than , the claim follows. ∎
We now focus on bounding the contribution due to backtracking on the right edges.
Let be the left-edges traversed by ; we denote where is the left child of . We now partition the nodes visited in as follows. For each , group consists of nodes visited after traversing and before traversing ; and is the set of nodes visited after . Note that the nodes in are visited contiguously using only right edges (they need not be visited in the order given by tree , as the algorithm may backtrack). See Figure 3 for a pictorial view.
For each , let denote the nodes at level more than (the parent node of left-edge ); and let . We also set .
By using exactly the argument in Claim 2.6, the total reward in is at most .
Let us modify by dropping all nodes in . Each remaining node of is either (i) an L-branching node, where goes left after (these are the end-points s of left-edges), or (ii) a “backtrack node” where backtracks on the edge from to its parent (these are nodes in s). By Claim 2.7, the expected reward from L-branching nodes is at most . In order to bound the total expected reward, it now suffices to bound the reward from the backtrack nodes.
The expected reward of from the backtrack nodes is at most
Consider the partition (defined above) of backtrack nodes of into groups . Recall that visits each group contiguously (perhaps not in the order given by ) and then traverses left-edge to go to the next group . Moreover, (the parent end-point of left-edge ) is an ancestor of all -nodes. See also Figure 3.
Note also that the walk visiting each group consists only of right-edges: so the total reward in any single group is at most (see Claim 2.6). Define for each .
Let denote the (random) index of the first group where a positive size is instantiated. We now show that can not visit any group indexed more than . Let and denote the end-points of the left-edge . Note that must traverse the left edge out of to reach groups . If is the node with positive size instantiation and its level, then (since is an ancestor of all -nodes). The distance traveled by ’ till is
where the last inequality follows by Lemma 2.5. Thus the total distance plus size seen in (till ) is at least , which is at least and hence . Thus can not visit any higher indexed group.
Using the above observation, the expected reward from backtrack nodes is at most:
Above we used the fact that . ∎
Altogether, it follows that any non-adaptive policy has expected reward at most . Finally, using Lemma 2.4, we obtain an adaptivity gap.
2.4 Adaptivity Gap on Line Metric
We now show that the previous instance on a tree metric can also be embedded into a line metric such that the adaptivity gap does not change much. This gives an adaptivity gap for stochastic orienteering even on line metrics.
The line metric is defined as follows. Each node of the tree is mapped (on the real line) to the coordinate which is the distance in from the root to . Since all distances in our construction are integers, each node lies at a non-negative integer coordinate. Note that multiple nodes may be at the same coordinate (for example, as all right-edges in have zero length). Below, will denote distances in the tree metric , and denotes distances in the line metric .
Note that for all nodes . Moreover, the distance between two nodes and in the line metric is , which is at most the distance in the tree metric. Thus the adaptive policy for the tree is also valid for the line, which (by Lemma 2.4) has expected reward . However, the distances on the line could be arbitrarily smaller than , and thus the key issue is to show that non-adaptive policies cannot do much better. To this end, we begin by observing some more properties of the distances and the embedding on the line.
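The embedding and the fact that line distances never exceed tree distances can be illustrated on a toy tree. The sketch below is not the instance from the construction: the edge lengths are hypothetical (only the zero length of right edges follows the text), and the helper names `line_embedding` and `tree_distance` are introduced here.

```python
def line_embedding(parent, length):
    """Map each tree node to its distance from the root.

    parent[v] is the parent of v (None for the root);
    length[v] is the length of the edge from parent[v] to v.
    """
    coord = {}

    def depth(v):
        if v not in coord:
            coord[v] = 0 if parent[v] is None else depth(parent[v]) + length[v]
        return coord[v]

    for v in parent:
        depth(v)
    return coord

def tree_distance(u, v, parent, coord):
    """Distance between u and v in the tree, via their lowest common ancestor."""
    ancestors = set()
    x = u
    while x is not None:
        ancestors.add(x)
        x = parent[x]
    x = v
    while x not in ancestors:
        x = parent[x]
    return coord[u] + coord[v] - 2 * coord[x]

# Hypothetical tree: right edges have zero length, left edges are positive.
parent = {"": None, "L": "", "R": "", "LL": "L", "LR": "L"}
length = {"L": 5, "R": 0, "LL": 3, "LR": 0}
coord = line_embedding(parent, length)
for u in parent:
    for v in parent:
        assert abs(coord[u] - coord[v]) <= tree_distance(u, v, parent, coord)
print(coord)
```

Here the nodes `L` and `LR` share a coordinate, which mirrors the remark that several nodes may be co-located on the line.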
Lemma 2.11. For any internal node , let (resp. ) denote the subtree rooted at the left (resp. right) child of . Then, for any node , and for any node ,
For any node , recall that its residual budget , where is the total size instantiated in the adaptive policy before node . Suppose , and let be the left child of . Then
where we use that , and the last inequality follows from Lemma 2.5.
Now consider . We have , as lies in the right subtree under and so must have instantiated to a positive size before reaching . By Claim 2.3, which is at least since for each . Thus . ∎
This implies the following useful fact.
Corollary 2.12. In the line embedding, for any node , all nodes in the left-subtree appear after all nodes in the right-subtree .
We will now show that any non-adaptive policy has reward at most . This requires more work than in the tree metric case, but the high-level idea is quite similar: we use the properties of distances to restrict what any non-adaptive policy can look like, and show that such policies cannot obtain too much profit. Observe that a non-adaptive policy is just a walk on , originating from and visiting a subset of vertices.
Lemma 2.13. Any non-adaptive policy on must visit vertices ordered by non-decreasing distance from .
We will show that if vertex is visited before and then the walk to has length more than ; this would prove the lemma.
Let denote the least common ancestor of and . There are two cases depending on whether or ; note that the ancestor cannot be as .
If , since , it must be that and by Corollary 2.12. Moreover, the total distance traveled by the path is at least
where the second inequality is by Lemma 2.11.
If , since , there must be at least one left edge on the path from to in the tree (as the length of the right edges is 0). Then, the distance traveled by the path is at least . As by Lemma 2.5, and as (by definition of distances on left edges), we have
where the last inequality follows from Claim 2.3. ∎
By Lemma 2.13, any non-adaptive policy visits vertices in non-decreasing coordinate order. For vertices at the same coordinate, we can break ties and assume that these nodes are visited in decreasing order of their level in . This does not decrease the expected reward due to the following exchange argument.
Claim 2.14. If visits two vertices consecutively that have the same coordinate in and have levels , then must be visited before .
Since and have the same coordinate in , by Lemma 2.11 it must be that one is an ancestor of the other, and the path in consists only of right-edges. Since , node is an ancestor of in . Suppose that chooses to visit before . We will show that the alternate solution that visits before has larger expected reward. This is intuitively clear since stochastically dominates in our setting: the probabilities are identical, size of is less than , and reward of is more than . The formal proof also requires independence of and , and is by a case analysis.
Let us condition on all instantiations other than and : we will show that has larger conditional expected reward than . This would also show that the (unconditional) expected reward of is more than . Let denote the total distance plus size in (resp. ) when it reaches (resp. ). Irrespective of the outcomes at and , the residual budgets in and before/after visiting will be identical. So the only difference in (conditional) expected reward is at and . The following table lists the different possibilities for rewards from and , as varies (recall that is the budget).
In each case, gets at least as much reward as since . This completes the proof. ∎
For any node in , let denote the set of nodes satisfying (i) appears before in , and (ii) is not an ancestor of in tree . We refer to as the “blocking set” for node . We first prove a useful property of the sets.
Claim 2.15. For any and , we must have right-subtree and left-subtree at the lowest common ancestor of and . Moreover can not get reward from if any vertex in its blocking set instantiates to a positive size.
Observe that and are incomparable in because:
is not an ancestor of by definition of .
is not a descendant of . Suppose (for a contradiction) that is a descendant of . Note that and are not co-located in : if they were, then by Claim 2.14 and the fact that visits before , must be an ancestor of , which contradicts the definition of . So the only remaining case is that is located further from than : but this contradicts Lemma 2.13 as visits before .
So the lowest common ancestor of and is distinct from both . Since , we must have and . This proves the first part of the claim.
Since , its size . As , by Lemma 2.11 and hence if has non-zero size, the total distance plus size until is more than , i.e. can not get reward from . ∎
The next key claim shows that the sets are increasing along .
Claim 2.16. If node appears before in then .
Consider nodes and as in the claim, and suppose (for contradiction) that there is some . Since , by Claim 2.15, right-subtree and left-subtree, where is the lowest common ancestor of and . Clearly appears before in ( is before which is before ). And since , must be an ancestor of . Hence is also in the right-subtree, and ; recall that left-subtree. This contradicts Lemma 2.13 since is visited before . Thus . ∎
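The definition of blocking sets and the nesting property just proved can be sketched in a few lines. The sketch computes, for a given visiting order, the blocking set of every node and reports whether the sets grow along the order. Nodes are encoded as strings of `L`/`R` branches from the root, and the function names are assumptions introduced here. The nesting check is expected to succeed only for orders that respect the structural lemmas above (for example, right subtree before left subtree); for arbitrary orders it may fail.

```python
def is_ancestor(a, v):
    """True if node a is a proper ancestor of node v (proper-prefix test)."""
    return len(a) < len(v) and v.startswith(a)

def blocking_sets(order):
    """Blocking set of v: nodes visited before v that are not ancestors of v."""
    return {
        v: {u for u in order[:i] if not is_ancestor(u, v)}
        for i, v in enumerate(order)
    }

def is_nested(order):
    """Check that blocking sets are non-decreasing along the visiting order."""
    sets = blocking_sets(order)
    return all(sets[a] <= sets[b] for a, b in zip(order, order[1:]))

# Right subtree first, then left subtree: blocking sets are nested.
order = ["", "R", "RR", "L", "LR"]
print(blocking_sets(order))
print(is_nested(order))                 # True
print(is_nested(["L", "R", "LL"]))      # False: this order violates nesting
```

In the first order the blocking set stays empty while the walk follows a single root-to-leaf path and grows only when the walk jumps to a different subtree, which is what the segment decomposition below exploits.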
Based on Claim 2.16, the blocking sets in form an increasing sequence. So we can partition into contiguous segments with (resp. ) denoting the first (resp. last) vertex of , so that the following hold for each .
The first vertex of has , and
the increase in the blocking set .
Defining directed non-adaptive policies from . For each consider the non-adaptive policy that traverses segment and visits only vertices in ; note that since . Notice that the blocking set is always empty in : this means that nodes in are visited in the order of some root-leaf path in tree , i.e. is a directed non-adaptive policy (as considered in Section 2.2). So by Claim 2.8, the expected reward in each is at most . That is,