Relay Selection with Partial Information in Wireless Sensor Networks
Abstract
Our work is motivated by geographical forwarding of sporadic alarm packets to a base station in a wireless sensor network (WSN), where the nodes are sleep-wake cycling periodically and asynchronously. When a node (referred to as the source) gets a packet to forward, either by detecting an event or from an upstream node, it has to wait for its neighbors in a forwarding set (referred to as relays) to wakeup. Each of the relays is associated with a random reward (e.g., the progress made towards the sink) that is independent and identically distributed (iid). To begin with, the source is uncertain about the number of relays, their wakeup times and the reward values, but knows their distributions. At each relay wakeup instant, when a relay reveals its reward value, the source’s problem is to forward the packet or to wait for further relays to wakeup. In this setting, we seek to minimize the expected waiting time at the source subject to a lower bound on the average reward. In terms of the operations research literature, our work can be considered as a variant of the asset selling problem. We formulate the relay selection problem as a partially observable Markov decision process (POMDP), where the unknown state is the number of relays. We begin by considering the case where the source knows the number of relays. For the general case, where the source only knows a probability mass function (pmf) on the number of relays, it has to maintain a posterior pmf on the number of relays and forward the packet iff the pmf is in an optimum stopping set. We show that the optimum stopping set is convex and obtain an inner bound to this set. We prove a monotonicity result which yields an outer bound. The computational complexity of the above policies motivates us to formulate an alternative simplified model, the optimal policy for which is a simple threshold rule.
We provide simulation results to compare the performance of the inner and outer bound policies against the simple policy, and against the optimal policy when the source knows the exact number of relays. Observing the simplicity and the good performance of the simple policy, we heuristically employ it for end-to-end packet forwarding at each hop in a multihop WSN of sleep-wake cycling nodes.
Relay selection, wireless sensor networks, sleep-wake cycling, partially observable Markov decision process (POMDP), asset selling problem.
I Introduction
We are interested in the problem of packet forwarding in a class of wireless sensor networks (WSNs) in which local inferences based on sensor measurements could result in the generation of occasional “alarm” packets that need to be routed to a base station, where some sort of action could be taken [1, 2, 3]. Such a situation could arise, for example, in a WSN for human intrusion detection or fire detection in a large region. Such WSNs often need to run on batteries or on harvested energy and, hence, must be energy conscious in all their operations. The nodes of such a WSN would be sleep-wake cycling, waking up periodically to perform their tasks. One approach for the forwarding problem is to use a distributed algorithm to schedule the sleep-wake cycles of the nodes such that the delay of a packet from its source to the sink on a multihop path is minimized [2, 4]. An organizational phase is required for such algorithms, which increases the protocol overhead; moreover, the scheduling algorithm has to be rerun periodically since the clocks at different nodes drift at different rates (so that a previously computed schedule becomes stale after a long operation time). For a survey of routing techniques in wireless sensor and ad hoc networks and their classification, see [5, 6].
In this paper we are concerned with the sleep-wake cycling approach that permits the nodes to wakeup independently of each other even though each node is waking up periodically, i.e., asynchronous periodic sleep-wake cycling [7, 1]. In fact, given the need for a long network lifetime, nodes are more likely to be sleeping than awake. In such a situation, when a node has a packet to forward, it has to wait for its neighbors to wake up. When a neighbor node wakes up, the forwarding node can evaluate it for its use as a relay, e.g., in terms of the progress it makes towards the destination node, the quality of the channel to the relay, the energy level of the relay, etc. (see [8, 9] for different routing metrics based on the above mentioned quantities). We think of this as a reward offered by the potential relay. The end-to-end network objective is to minimize the average total delay subject to a lower bound on some measure of total reward along the end-to-end path. In this paper we address this end-to-end objective by considering optimal strategies at each hop. When a node gets a packet to forward, it has to make decisions based only on the activities in its neighborhood. Waiting for all potential relays to wakeup and choosing the one with the best reward maximizes the reward at each hop, but increases the forwarding delay. On the other hand, forwarding to the first relay to wakeup may result in the loss of the opportunity of choosing a node with a better reward. Hence, at each hop, there is a tradeoff between the one-hop delay and the one-hop reward. By solving the one-hop problem of minimizing the average delay subject to a constraint on the average reward, we expect to capture the tradeoff between the end-to-end metrics.
For instance, suppose the end-to-end objective is to minimize the expected end-to-end delivery delay subject to an upper bound on the expected number of hops in the path, the motivation for this constraint being that traversing more hops entails a greater expenditure of energy in the network. In our approach, we would heuristically address this problem by considering at each hop the problem of minimizing the mean forwarding delay subject to a lower bound on the progress made towards the sink. Greater progress at each hop entails greater delay per hop, while reducing the number of hops it takes a packet to reach the sink.
The local problem setting is the following. Somewhere in the network a node has just received a packet to forward; for the local problem we refer to this forwarding node as the source and think of the time at which it gets the packet as . There is an unknown number of relays in the forwarding set of the source. In the geographical forwarding context, this lack of information on the number of relays could model the fact that the neighborhood of a forwarding node could vary over time due, for example, to node failures, variation in channel conditions, or (in a mobile network) the entry or exit of mobile relays. However, we assume that the number of relays is bounded by a known number , and the source has an initial probability mass function (pmf), over , on the number of potential relays. The source desires to forward the packet within the interval , while knowing that the relays wakeup independently and uniformly over and the rewards they offer are independently and identically distributed (iid). We will formally introduce our model in Section II. Next we discuss related work and highlight our contributions.
IA Related Work
Here we provide a summary of related literature in the context of geographical forwarding and channel selection. Since our problem also belongs to the class of asset selling problems studied in operations research literature, we survey related work from there as well.
Geographical forwarding problems: In our prior work [7] we have considered a simple model where the number of relays is a constant which is known to the source. There the reward is simply the progress made by a relay node towards the sink. In the current work we have generalized our earlier model by allowing the number of relays to be unknown to the source. Also, here we allow a general reward structure.
There has been other work in the context of geographical forwarding and anycast routing, where the problem of choosing one among several neighboring nodes arises. Zorzi and Rao [10] consider a scenario of geographical forwarding in a wireless mesh network in which the nodes know their locations, and are sleep-wake cycling. They propose GeRaF (Geographical Random Forwarding), a distributed relaying algorithm, whose objective is to carry a packet to its destination in as few hops as possible, by making as large progress as possible at each relaying stage. For their algorithm, the authors obtain the average number of hops (for given source-sink distance) as a function of the node density. These authors do not consider the tradeoff between the relay selection delay and the reward gained by selecting a relay, which is a major contribution of our work.
Liu et al. [11] propose a relay selection approach as a part of CMAC, a protocol for geographical packet forwarding. With respect to the fixed sink, a node has a forwarding set consisting of all nodes that make progress greater than (an algorithm parameter). If represents the delay until the first wakeup instant of a node in the forwarding set, and is the corresponding progress made, then, under CMAC, node chooses an that minimizes the expected normalized latency . The Random Asynchronous Wakeup (RAW) protocol [12] also considers transmitting to the first node to wakeup that makes progress greater than a threshold. Interestingly, this is the structure of the optimal policy for our simplified model in [7]. For the sake of completeness we have described the simplified model in this paper as well (see Section VI). Thus we have provided analytical support for using such a threshold policy.
Kim et al. [1] consider a dense WSN. Just like the motivation for our model, an occasional alarm packet needs to be sent, from wherever in the network it is generated, to the sink. The authors develop an optimal anycast scheme to minimize average end-to-end delay from any node to the sink when each node wakes up asynchronously with rate . They show that periodic wakeup patterns obtain minimum delay among all sleep-wake patterns with the same rate. They propose an algorithm called LOCALOPT [13] which yields, for each node , a threshold for each of its neighbors . If the time at which neighbor wakes up is less than , then will transmit to . Otherwise will go back to sleep and will continue waiting for further neighbors. A key drawback is that a configuration phase is required to run the LOCALOPT algorithm.
Rossi et al. [14] consider the problem where a node , with a packet to forward and which is hops away from the sink, has to choose between two of its shortlisted neighbors. The first shortlisted neighbor is the one with the least cost among all others with hop count (one less than node ). The second one is the least cost node among all its neighbors with hop count (same as that of node ). Though the first node is on the shortest path, sometimes when its cost is high, it may not be the best option. It turns out that it is optimal to choose one node over the other by comparing the cost difference with a threshold. The threshold depends on the cost distribution of the nodes which are two hops away from node . Here there is no notion of sleep-wake cycling so that all the neighbor costs are known when node gets a packet to forward. The problem is that of one shot decision making. In our problem a neighbor’s cost will become available only after it wakes up, at which instant node has to take decision regarding forwarding. Hence, ours is a sequential decision problem.
Channel selection problems: Akin to the relay selection problem is the problem of channel selection. The authors in [15, 16] consider a model where there are several channels available to choose from. The transmitter has to probe the channels to learn their quality. Probing many channels yields one with a good gain but reduces the effective time for transmission within the channel coherence period. The problem is to obtain optimal strategies to decide when to stop probing and to transmit. Here the number of channels is known and all the channels are available at the very beginning of the decision process. In our problem the number of relays is not known, and the relays become available at random times.
Asset selling problems: The basic asset selling problem [17, 18] comprises offers that arrive sequentially over discrete time slots. The offers are iid. As the offers arrive, the seller has to decide whether to take an offer or wait for future offers. The seller has to pay a cost to observe the next offer. Previous offers cannot be recalled. The decision process ends with the seller choosing an offer. Over the years, several variants of the basic problem have been studied, both with and without recalling the previous offers. Recently Kang [19] has considered a model where a cost has to be paid to recall the previous best offer. Further, the previous best offer can be lost at the next time instant with some probability. See [19] for further references to literature on models with uncertain recall. In [20], the authors consider a model in which the offers arrive at the points of a renewal process. Additional literature on such work can be found in [20]. In these models, the number of potential offers is either known or infinite. In [21], a variant is studied in which the asset selling process can reach a deadline in the next slot with some fixed probability, provided that the process has proceeded up to the present slot.
In our work the number of offers (i.e., relays) is not known. Also the successive instants at which the offers arrive are the order statistics of an unknown number of iid uniform random variables over an interval . After observing a relay, the probability that there are no more relays to go (which is the probability that the present stage is the last one) is not fixed. This probability has to be updated depending on the previous such probabilities and the inter-wakeup times between the successive relays. Although our problem falls in the class of asset selling problems, to the best of our knowledge the particular setting we have considered in this paper has not been studied before.
IB Our Contributions
With the number of relays being unknown, the natural approach is to formulate the problem as a partially observed Markov decision process (POMDP). A POMDP is a generalization of an MDP, where at each stage the actual internal state of the system is not available to the controller. Instead, the controller can observe a value from an observation space. The observation probabilistically depends on the current actual state and the previous action. In some cases, a POMDP can be converted to an equivalent MDP by regarding a belief (i.e., a probability distribution) on the state space as the state of the equivalent MDP. For a survey of POMDPs see [22]. It is clear that, even if the actual state space is finite, the belief space is uncountable. There are several algorithms available to obtain the optimal policy when the actual state space is finite [23], starting from the seminal work by Smallwood and Sondik [24]. When the number of states is large, these algorithms are computationally intensive. In general, it is not easy to obtain an optimal policy for a POMDP. In the current work, we have characterized the optimal policy in terms of an optimum stopping set. We have made use of the convexity results in [25] and some properties specific to our problem to obtain an inner bound on the optimum stopping set. We prove a simple monotonicity result to obtain an outer bound. In summary, the following are the main contributions of our work:

We formulate the problem of relay selection with partial information as a finite horizon partially observable Markov decision process (POMDP), with the unknown state being the actual number of relays (Section III). The posterior pmf on the number of relays is shown to be a sufficient decision statistic.

We first consider the completely observable MDP (COMDP) version of the problem where the source knows the number of relays with probability one (wp1) (Section IV). The optimal policy is characterized by a sequence of threshold functions.

For the POMDP, at each stage the optimum stopping set is the set of all pmfs on the number of relays where it is optimal to stop (Section V). We prove that this set is convex (Section VA), and provide an inner bound (subset) for it (Section VB). We prove a monotonicity result and obtain an outer bound (superset, Section VC). The threshold functions obtained in the COMDP version are used in the design of the bounds. These threshold functions need to be obtained recursively, which is, in general, computationally intensive.

The complexity of the above policies motivates us to consider a simplified model (Section VI). We prove that the optimal policy for this simplified model is a simple threshold rule.

Through simulations (Section VIIA) we compare the performance of the various policies with that of the optimal COMDP policy. The inner bound policy performs slightly better than the outer bound policy. The simple policy obtained from the simplified model performs very close to the inner bound. Also, we show the poor performance of a naive policy that assumes the actual number of relays to be simply the expected number.

Finally, as a heuristic for the end-to-end problem in the geographical forwarding context, we apply the simple policy at each hop and study the end-to-end performance by simulation (Section VIIB). We find that it is possible to trade off the expected end-to-end delay against the expected number of hops by tuning a parameter.
For the ease of presentation, in the main sections we only provide an outline of the proof for most of the lemmas, followed by a brief description. Formal proofs are available in Appendices A, B and C. Appendix D contains additional simulation results.
II System Model
We consider the one stage problem in which a node in the network receives a packet to forward. We call this node the “source” and the nodes that it could potentially forward the packet to are called “relays”. The local problem is taken to start at time . Thus at time , the source node has a packet to forward to a sink but needs a relay node to accomplish this task. There is a nonempty set of relay nodes, labeled by the indices . is a random variable bounded above by , a system parameter that is known to the source node, i.e., the support of is . The source does not know , but knows the bound , and a pmf on , which is the initial pmf of . A relay node , , becomes available to the source at the instant . The source knows that the instants are iid uniformly distributed on . Observe that this would be the case if the wakeup instants of all the nodes in the network are periodic with period , if these (periodic) renewal processes are stationary and independent, and if the forwarding node’s decision instants are stopping times w.r.t. these wakeup time processes [26].
We call the wakeup instant of relay . If the source forwards the packet to the relay , then a reward of is accrued. The rewards , are iid random variables with pdf . The support of is . The source knows this statistical characterisation of the rewards, and also that the are independent of the wakeup instants . When a relay wakes up at and reveals its reward , the source has to decide whether to transmit to relay or to wait for further relays. If the source decides to wait, then it instructs the relay with the best reward to stay awake, while letting the rest go back to sleep. This way the source can always forward to a relay with the best reward among those that have woken up so far.
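The one-hop model described above can be explored numerically. The following is a minimal sketch under stated assumptions, not the policy analyzed in this paper: the number of relays `n` is fixed, the rewards are taken to be uniform on [0, 1] (the text only requires some pdf with bounded support), the best reward before any relay wakes up is taken to be 0, and the source uses a hypothetical fixed-threshold rule. The names `simulate_hop` and `average_performance` are illustrative only.

```python
import random


def simulate_hop(n, T, threshold, rng):
    """Simulate one forwarding decision at the source.

    n relays wake up at iid uniform instants on (0, T). Each reveals an iid
    reward (assumed uniform on [0, 1]). The source keeps track of the best
    reward seen so far and forwards as soon as it reaches `threshold`.
    If that never happens, the source waits until T and forwards to the
    best relay seen. Returns (delay, reward).
    """
    wakeups = sorted(rng.uniform(0.0, T) for _ in range(n))
    best = 0.0  # assumed initial best reward before any relay wakes up
    for w in wakeups:
        best = max(best, rng.random())  # reward revealed at this wakeup
        if best >= threshold:
            return w, best
    return T, best


def average_performance(n, T, threshold, trials=10_000, seed=0):
    """Monte Carlo estimate of (mean delay, mean reward) for a threshold."""
    rng = random.Random(seed)
    total_delay = 0.0
    total_reward = 0.0
    for _ in range(trials):
        delay, reward = simulate_hop(n, T, threshold, rng)
        total_delay += delay
        total_reward += reward
    return total_delay / trials, total_reward / trials


if __name__ == "__main__":
    for thr in (0.0, 0.5, 0.9):
        d, r = average_performance(n=5, T=1.0, threshold=thr)
        print(f"threshold={thr:.1f}  mean delay={d:.3f}  mean reward={r:.3f}")
```

Raising the threshold in this sketch increases both the mean delay and the mean reward, which is the one-hop tradeoff the formulation is designed to capture.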
Given that (throughout this discussion we will focus on the event ), let represent the order statistics of , i.e., the sequence is the sequence sorted in the increasing order. The pdf of the th () order statistic [27, Chapter 2] is, for ,
(1) 
Also the joint pdf of the th and the th order statistic (for ) is, for ,
(2) 
Using the above expressions, we can write down the conditional pdf (for ) as, for and ,
(3) 
Comparing (3) with (1), as expected, we observe that, given , the pdf of the wakeup instant of the th node, conditioned on the wakeup instant of the th node, is that of the th order statistic of iid random variables that are uniform on the remaining time . Let and define for . are the inter-wakeup times between the consecutive nodes (see Fig. 1). Later we will be interested in the conditional pdf for which is given by, for and ,
(4)  
The conditional expectation is given by,
(5) 
which is simply the expected value of the minimum of random variables ( is the remaining number of relays), each of which are iid uniform on the interval ( is the remaining time).
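The conditional expectation described above can be checked numerically. The sketch below assumes the following notation, which is ours and not taken verbatim from the text: `n` is the total number of relays, `k` relays have already woken up, the `k`-th wakeup occurred at time `w`, and `T` is the period. Since the minimum of `m` iid uniform variables on an interval of length `L` has mean `L / (m + 1)`, the expected wait for the next relay is `(T - w) / (n - k + 1)`. The function names are illustrative.

```python
import random


def expected_gap(n, k, w, T):
    """Mean time until the next wakeup, given n relays in total, k of them
    already awake, and the k-th wakeup at time w (use w = 0 when k = 0)."""
    remaining = n - k  # relays that are still asleep
    if remaining <= 0:
        raise ValueError("no relays left to wake up")
    return (T - w) / (remaining + 1)


def simulated_gap(n, k, w, T, trials=20_000, seed=0):
    """Monte Carlo estimate of the same quantity: the remaining relays wake
    up at iid uniform instants on the remaining time (w, T)."""
    rng = random.Random(seed)
    remaining = n - k
    total = 0.0
    for _ in range(trials):
        total += min(rng.uniform(0.0, T - w) for _ in range(remaining))
    return total / trials


if __name__ == "__main__":
    print(expected_gap(5, 2, 0.4, 1.0))
    print(simulated_gap(5, 2, 0.4, 1.0))
```

The two printed values should agree up to Monte Carlo error.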
Definition 1
For notational simplicity we define,
Note that depends on and through the difference and depends on through .
Since the reward sequence is iid and independent of the wakeup instants , we write as the pairs of ordered wakeup instants and the corresponding rewards. Evidently, for . Further we define (when ) , and . Also . All these variables are depicted in Fig. 1. We end this section by listing out, in Table I, most of the symbols that appear in the paper with a brief description for each.
Symbol  Description 

Inner product of vectors and  
Thresholds lying on the line joining and of the simplex ; Used in the construction of the inner and outer bounds, respectively  
Best reward so far, i.e.,  
Average cost of continuing at stage when the state is  
Optimum stopping set at stage when  
Inner bound for the stopping set  
Outer bound for the stopping set  
One-step-stopping set for the simplified model  
Expectation conditioned on  
pdf of conditioned on  
pdf of the iid rewards  
Optimal cost-to-go function at stage when the state is  
Bound on the number of relays  
Number of relays; random variable taking values from  
Number of relays in the simplified model; a constant  
Probability of an event  
Set of all pmfs on the set  
Represents a typical state at stage where is the belief state and  
A corner point in , i.e.,  
Reward of the th relay  
Inter wakeup time between the and th relay, i.e.,  
Wakeup instant of the th relay  
Quantities, analogous to the ones in the exact model, for the simplified model  
Threshold obtained from the simplified model  
Reward constraint for the problem in (11)  
When is such that then it is optimal to stop iff  
Lagrange multiplier, see (12)  
Average cost of stopping at stage when  
Belief transition function; is a pmf in for a given , and  
Threshold obtained from the COMDP version of the problem; If the source knows wp1 that , then at some stage with it is optimal to stop iff 
III The Sequential Decision Problem
For the model set up in Section II, we now consider the following sequential decision problem. At each instant that a relay wakes up, i.e., , the source has to make the decision to forward the packet, or to hold the packet until the next wakeup instant. Since the number of available relays, , is unknown, we have a decision problem with partial information. We will show how the problem can be set up in the framework of a partially observable Markov decision process (POMDP) [22] [28, Chapter 5].
IIIA Actions, State Space, and State Transition
Actions: We assume that the time instants at which the relays wakeup, i.e., , constitute the decision instants or stages. (Footnote 1: A better choice for the decision instants may be to allow the source to take a decision at any time . When is known to the source it can be argued that it is optimal to take decisions only at relay wakeup instants. However this may not hold for our case where is unknown. In this paper we proceed with our restriction on the decision instants and consider the general case as a topic for future work.) At each decision instant, there are two actions possible at the source, denoted and , where

represents the action to continue waiting for more relays to wakeup, and

represents the action to stop and forward the packet to the relay that provides the best reward among those that have woken up to the current decision epoch.
Since there can be at most relays, the total number of decision instants is . The decision process technically ends at the first instant , at which the source chooses action , in which case we assume that all the subsequent decision instants, , occur at . In cases where the source ends up waiting until time (referring to Fig. 1, this is possible if, even at the source decides to continue, not realizing that it has seen all the relays there are in its forwarding set), all the subsequent decision instants are assumed to occur at .
State Space: At stage the state space is simply and the only action possible is , where in the superscript is to signify that is the set of actual internal states of the system. The state space at stage is,
and for stages is,
Thus the state space at stage is written as the union of three sets. The physical meanings of these sets are as follows:

: in the state triple represents the actual number of relays. The states in this set correspond to the case where there are more than or equal to relays, i.e., satisfies, . In the pair , is the wakeup instant () of the th relay, and is the best reward among the relays seen so far. Same remark holds for the states in . Stage begins at time with reward. Hence the states in are of the form .

: Suppose there were relays and, at stage the source decides to continue. Note that it is possible for the source to take such a decision, since it does not know the number of relays. In such a case, the source ends up waiting until time and enters stage . Hence the states in this set are of the form where represents the best reward among all the relays ().

: is the terminating state. The state at stage will be , if the source has already forwarded the packet at an earlier stage.
State Transition: If the state at stage is (i.e., the source has already forwarded the packet) then the next state is always . Suppose is the state at some stage , , and represents the action taken. If then the decision process stops and we regard that the system enters the termination state so that the state at all the subsequent stages, , is . The source will also terminate the decision process, knowing that the relays wakeup within the interval , if it has waited for a duration of . This means that , i.e., and .
On the other hand if and , the source waits for a random duration of and encounters a relay with a random reward of so that the next state is . Note that if , i.e., the current relay is the last one, then since we have defined and , the next state will be of the form . Thus the state at stage can be written down as,
(7) 
IIIB Belief State and Belief State Transition
Since the source does not know the actual number of relays , the state is only partially observable. The source takes decisions based on the entire history of the wakeup instants and the best rewards. If the source has not forwarded the packet until stage then define, to be the information vector available at the source when the th relay wakes up. represents the wakeup instants of relays waking up at stages and are the corresponding best rewards. Define to be the belief state about at stage given the information vector , i.e., for (note that is the probability that the th relay is the last one). Thus, is a pmf in the dimensional probability simplex. Let us denote this simplex as .
Definition 2
For , let := set of all pmfs on the set . is the dimensional probability simplex in .
The “observation” at stage is a part of the actual state . For a general POMDP problem the observation can belong to a completely different space than the actual state space. Moreover the distribution of the observation at any stage can in general depend on all the previous states, observations, actions and disturbances. If this distribution depends only on the state, action and disturbance of the immediately preceding stage, then a belief on the actual state given the entire history turns out to be sufficient for taking decisions [28, Chapter 5]. For our case, this condition is met and hence at stage , is a sufficient statistic for taking a decision. Therefore we modify the state space as, and for ,
(8) 
After seeing relays, suppose the source chooses not to forward the packet, then upon the next relay waking up (if any), the source needs to update its belief about the number of relays. Formally, if is the state at stage and is the wakeup instant of the next relay then, using Bayes rule, the next belief state can be obtained via the following belief state transition function which yields a pmf in ,
(9) 
for . Note that this function does not depend on . Thus, if at stage , the state is , then the next state is
(10) 
where is the random delay until the next relay wakes up and is the random reward offered by that relay. The explanation for the above belief state transition expression remains the same as that of the actual state transition in (7), except that if the action is to continue, then the source needs to update the belief about the number of relays. If, at stage , the actual number of relays happens to be and the action is to continue, which is possible since the source does not know the actual number, then the source will end up waiting until time and then transmit to the relay with the best reward.
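The Bayes update of the belief on the number of relays can be sketched in code. This is an illustration under assumed notation rather than a transcription of (9): `belief` maps each candidate number of relays `n` to its current probability after `k` relays have been seen, `w` is the wakeup instant of the `k`-th relay, `u` is the observed gap until the next wakeup, and `T` is the period. The likelihood used is the density of the minimum of `n - k` iid uniform variables on the remaining time `T - w`, which follows from the order-statistics discussion in Section II. The function name `update_belief` is hypothetical.

```python
def update_belief(belief, k, w, u, T):
    """Posterior pmf on the number of relays after one more relay wakes up.

    belief: dict {n: P(N = n)} over n = k, ..., K (k relays seen so far).
    w: wakeup instant of the k-th relay (0 if k = 0).
    u: observed time between the k-th and (k+1)-th wakeups.
    Returns a dict over n = k+1, ..., K that sums to one.
    """
    if not 0.0 < u < T - w:
        raise ValueError("gap must lie strictly inside the remaining time")
    weights = {}
    for n, p in belief.items():
        m = n - k  # relays still asleep under the hypothesis N = n
        if m <= 0:
            continue  # N = k is ruled out once another relay wakes up
        # Density at u of the minimum of m iid uniforms on (0, T - w).
        likelihood = m * (T - w - u) ** (m - 1) / (T - w) ** m
        weights[n] = p * likelihood
    total = sum(weights.values())
    if total == 0.0:
        raise ValueError("observation has zero probability under the belief")
    return {n: v / total for n, v in weights.items()}


if __name__ == "__main__":
    prior = {1: 1 / 3, 2: 1 / 3, 3: 1 / 3}
    print(update_belief(prior, k=0, w=0.0, u=0.1, T=1.0))
```

In this sketch the update uses only the latest gap and the previous belief, not the best reward seen so far, in line with the remark that the transition function does not depend on the reward component of the state. A short gap shifts probability mass towards larger numbers of relays.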
IIIC Stopping Rules and the Optimization Problem
As the relays wakeup, the source’s problem is to decide to stop or continue waiting for further relays. A stopping rule or a policy is a sequence of mappings where . Let represent the set of all policies. The delay incurred using policy is the instant at which the source forwards the packet. It could be either one of the , or the instant . The reward is the reward associated with the relay to which the packet is forwarded. The problem we are interested in is the following,
Subject to  (11) 
To solve the above problem, we consider the following unconstrained problem,
(12) 
where .
Lemma 1
For any policy satisfying the constraint we can write,
where the first inequality is by the optimality of for (12), the equality is by the hypothesis on , and the last inequality is due to the restriction of to .
Hence we focus on solving the unconstrained problem in (12).
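The chain of relations described in the argument above can be written out as follows. This is a sketch in assumed notation, not a reproduction of the original display: $\pi^*$ denotes a policy that is optimal for the unconstrained problem (12) with multiplier $\eta \ge 0$ and that meets the reward constraint $\gamma$ with equality, $\pi$ is any policy satisfying the constraint, and $D_\pi$ and $R_\pi$ are the delay and reward under $\pi$. We also assume that the objective in (12) has the form $\mathbb{E}[D_\pi] - \eta\,\mathbb{E}[R_\pi]$.

```latex
% Assumed notation: \pi^* optimal for (12) with E[R_{\pi^*}] = \gamma,
% \pi any policy with E[R_\pi] \ge \gamma, and \eta \ge 0.
\begin{align*}
\mathbb{E}[D_{\pi^*}]
  &\le \mathbb{E}[D_{\pi}]
       - \eta \bigl( \mathbb{E}[R_{\pi}] - \mathbb{E}[R_{\pi^*}] \bigr)
       && \text{(optimality of $\pi^*$ for (12))} \\
  &=   \mathbb{E}[D_{\pi}]
       - \eta \bigl( \mathbb{E}[R_{\pi}] - \gamma \bigr)
       && \text{(hypothesis on $\pi^*$)} \\
  &\le \mathbb{E}[D_{\pi}]
       && \text{(since $\mathbb{E}[R_{\pi}] \ge \gamma$ and $\eta \ge 0$).}
\end{align*}
```

Under these assumptions, no policy that satisfies the reward constraint achieves a smaller expected delay than $\pi^*$.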
IIID One-Step Costs
The objective in (12) can be seen as accumulating additively over each step. If the decision at a stage is to continue then the delay until the next relay wakes up (or until ) gets added to the cost. On the other hand if the decision is to stop then the source collects the reward offered by the relay to which it forwards the packet and the decision process enters the state . The cost in state is . Suppose is the state at stage . Then the one-step cost function is, for ,
(13) 
The cost of termination is . Also note that for , the possible states are of the form and the only possible action is , so that .
IIIE Optimal Cost-to-go Functions
For , let represent the optimal cost-to-go function at stage . For any state , can be written as,
(14) 
where stopping cost (continuing cost) represents the average cost incurred, if the source, at the current stage decides to stop (continue), and takes optimal action at the subsequent stages. For the termination state, since the one step cost is zero and since the system remains in in all the subsequent stages, we have . For a state , we next evaluate the two costs in the above expression.
First let us obtain the stopping cost. Suppose that there were relay nodes and the source has seen them all. In such a case if (note that will just be a point mass on ) is the state at stage then the optimal cost is simply the cost of termination, i.e., . For , if the action is to stop then the one step cost is and the next state is so that the further cost is . Therefore, the stopping cost at any stage is simply .
On the other hand the cost for continuing, when the state at stage is , using the total expectation law, can be written as,
(15)  
Each expectation term in the summation in (15) is the average cost to continue conditioned on the event . is the (random) time until the next relay wakes up ( is the one-step cost) and is the optimal cost-to-go from the next stage onwards ( constitutes the future cost). The next state is obtained via the state transition equation (10). The term in (15) associated with is the cost of continuing when the number of relays happens to be , i.e., and there are no more relays to go. Recall that we defined (in Section II) and when the actual number of relays is . Therefore is the one-step cost when . Also and so that at the next stage (which occurs at ) the process will terminate (enter ) with a cost of (see (10) and (13)), which represents the future cost.
Thus the optimal cost-to-go function (14) at stage can be written as,
(16) 
From the above expression it is clear that at stage when the state is , the source has to compare the stopping cost, , with the cost of continuing, , and stop iff . Later, in Section V, we will use this condition () to define the optimum stopping set. We will prove that the continuing cost, , is concave in , leading to the result that the optimum stopping set is convex. Equations (15) and (16) are used extensively in the subsequent development.
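The stop-or-continue comparison described above can be sketched in a few lines. This is only an illustration under simplifying assumptions: the conditional costs of continuing (one entry per possible number of relays) are treated as precomputed numbers, whereas in (15) they involve the optimal cost-to-go at the next stage; ties are broken in favor of stopping; and the names `belief`, `cost_given_n`, and `stopping_cost`, as well as the numbers, are hypothetical.

```python
# Sketch of the decision at one stage: average the conditional costs of
# continuing over the current belief (total expectation), then compare
# with the stopping cost. All inputs here are hypothetical numbers.

def continuing_cost(belief, cost_given_n):
    """Belief-weighted average of the conditional costs of continuing."""
    return sum(p * c for p, c in zip(belief, cost_given_n))

def should_stop(stopping_cost, belief, cost_given_n):
    """Stop iff stopping is no more costly than continuing (tie: stop)."""
    return stopping_cost <= continuing_cost(belief, cost_given_n)

belief = [0.2, 0.5, 0.3]          # assumed posterior pmf on the number of relays
cost_given_n = [-0.4, -0.6, -0.9]  # assumed conditional costs of continuing

print(continuing_cost(belief, cost_given_n))
print(should_stop(-0.7, belief, cost_given_n))
print(should_stop(-0.6, belief, cost_given_n))
```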
IV Relationship with the Case Where is Known (the COMDP Version)
In the previous section (Section III) we detailed our problem formulation as a POMDP. The state is partially observable because the source does not know the exact number of relays. It is interesting to first consider the simpler case where this number is known, which is the contribution of our earlier work in [7]. Hence, in this section, we will consider the case when the initial pmf, , has all its mass on some , i.e., . We call this the COMDP version of the problem.
First we define a sequence of threshold functions which will be useful in the subsequent proofs. These are the same threshold functions that characterize the optimal policy for our model in [7].
Definition 3
We will need the following simple property of the threshold functions in a later section.
Lemma 2
For , .
See Appendix A-A.
Next we state the main lemma of this section. We call this the One-point Lemma, because it gives the optimal cost, , at stage when the belief state is such that it has all the mass on some .
Lemma 3 (One-point)
Fix some and . For any , if is such that then,
The proof is by induction. We make use of the fact that if at some stage the belief state is such that then the next belief state , obtained by using the belief transition equation (9), is also of the form . We complete the proof by using Definition 3 and the induction hypothesis. For a complete proof, see Appendix A-B.
Discussion of Lemma 3: At stage if the state is , where is such that for some , then from the One-point Lemma it follows that the optimal policy is to stop and transmit iff . The subscript of the function signifies the number of relays still to go. For instance, if we know that there are exactly 4 more relays to go, then the threshold to be used is . Suppose that at stage it is optimal to continue; then from (9) it follows that the next belief state also has mass only on and hence at this stage it is optimal to use the threshold function . Therefore, if we begin with an initial belief such that for some , then the optimal policy is to stop at the first stage such that where is the wakeup instant of the th relay and . Note that, since at stage the threshold to be used is (see Definition 3), we invariably have to stop at stage if we have not terminated earlier. This is exactly the same as our optimal policy in [7], where the number of relays is known to the source (instead of knowing the number with probability 1, as in our One-point Lemma here).
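The structure of the known-number policy discussed above can be sketched as a simple loop. This is a minimal sketch under stated assumptions, not the policy of [7] itself: the threshold function `phi(relays_to_go, w, b)` and the function `stopping_cost(b)` are hypothetical callables supplied by the user, the stopping test is assumed to compare the stopping cost against the threshold indexed by the number of relays still to go, and the forced stop at the last relay is coded explicitly.

```python
# Sketch of a threshold policy when the number of relays is known.
# `phi` and `stopping_cost` are hypothetical stand-ins for the quantities
# in Definition 3 and Lemma 3; the toy forms below are assumptions.

def run_known_n_policy(wake_times, rewards, phi, stopping_cost):
    """Return (stage, time, best reward) at which the source forwards."""
    n = len(wake_times)
    best = float("-inf")
    for k, (w, r) in enumerate(zip(wake_times, rewards), start=1):
        best = max(best, r)  # best reward seen so far
        # Forced stop once every relay has woken up; otherwise use the
        # threshold indexed by the number of relays still to go (n - k).
        if k == n or stopping_cost(best) <= phi(n - k, w, best):
            return k, w, best
    raise ValueError("at least one relay is required")

# Toy choices: more relays to go makes continuing look cheaper.
toy_stopping_cost = lambda b: -b
toy_phi = lambda relays_to_go, w, b: -(0.5 + 0.1 * relays_to_go)

result = run_known_n_policy([1.0, 2.0, 3.0, 4.0], [0.3, 0.75, 0.6, 0.9],
                            toy_phi, toy_stopping_cost)
print(result)
```

In this toy run the source forwards at the second wakeup, because the best reward seen by then already beats the assumed threshold for two relays to go.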
V Unknown : Bounds on the Optimum Stopping Set
In this section we will consider the general case where the number of relays is not known to the source. The sequential decision problem developed in Section III was for this unknown case. The problem was formulated as a POMDP for which the source’s decision to stop and forward the packet is based on the belief state which takes values in after the source has observed relays waking up. We begin this section by defining the optimum stopping set. We show that this set is convex. Characterizing the exact optimum stopping set is computationally intensive. Therefore, our aim is to derive inner and outer bounds (a subset and a superset, respectively) for the optimum stopping set.
Definition 4 (Optimum stopping set)
For , let . Referring to (16) it follows that, for a given , represents the set of all beliefs at stage at which it is optimal to stop. We call the optimum stopping set at stage when the delay () and best reward () values are and , respectively.
V-A Convexity of the Optimum Stopping Sets
We will prove (in Lemma 4) that the continuing cost, , in (15) is concave in . From the form of the stopping set , a simple consequence of this lemma will be that the optimum stopping set is convex. We further extend the concavity result of for , where is the affine set containing (to be defined shortly in this section).
Lemma 4
For and any given , the cost of continuing (defined in (15)), , is concave on .
The essence of the proof is the same as that in [25, Lemma 1]. From (15) we easily see that is an affine function of , and hence , in (16), being the minimum of an affine function and a constant, is concave. The proof then follows by induction. The induction hypothesis is that for some stage , is concave. Hence it can be expressed as an infimum over some collection of affine functions. The inductive step then shows that can also be similarly expressed as an infimum over some collection of affine functions. Hence and (using (16)) are concave. A formal proof is available in Appendix B-A.
The following corollary is a straightforward application of the above lemma.
Corollary 1
For and any given , is a convex set.
From Lemma 4 we know that is a concave function of . Hence (see Definition 4), being a superlevel set of a concave function, is convex [29].
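The step from concavity to convexity of the stopping set can be checked numerically on a toy function. The sketch below assumes a continuing cost given as the minimum of two hypothetical affine functions of the belief (the coefficients in `planes` are made up for illustration); it verifies the midpoint concavity inequality and that the midpoint of two beliefs in the stopping set is again in the set.

```python
# Toy check: a minimum of affine functions is concave, so the set of
# beliefs where it is at least the stopping cost is convex.
# The affine pieces below are hypothetical, not derived from (15).

def min_of_affine(p, planes):
    """Value of min over (coeffs, offset) of <coeffs, p> + offset."""
    return min(sum(a * x for a, x in zip(coeffs, p)) + offset
               for coeffs, offset in planes)

def mix(p, q, t):
    """Convex combination t*p + (1-t)*q of two belief vectors."""
    return [t * x + (1 - t) * y for x, y in zip(p, q)]

def in_stopping_set(p, planes, stopping_cost):
    """Belief p is in the stopping set iff stopping_cost <= continuing cost."""
    return stopping_cost <= min_of_affine(p, planes)

planes = [([-1.0, -2.0, -3.0], 0.0), ([-2.5, -2.0, -1.0], 0.25)]
p, r, q = [1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]
mid = mix(p, r, 0.5)

concave_at_mid = (min_of_affine(mid, planes)
                  >= 0.5 * min_of_affine(p, planes) + 0.5 * min_of_affine(r, planes))
print(concave_at_mid, in_stopping_set(mid, planes, -2.5))
```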
In the next section while proving an inner bound for the stopping set , we will identify a set of points that could lie outside the probability simplex . We can obtain a better inner bound if we extend the concavity result to the affine set,
where , i.e., in the vectors sum to one, but we do not require nonnegativity of the vectors. This can be done as follows. Define using (9) for every . Then as a function of , is the extension of from to . Similarly, for every , define and using (15) and (16). These are the extensions of and respectively. Then, using the same proof technique as in Lemma 4, we obtain the following corollary.
Corollary 2
For , and any given , is concave on the affine set .
Using the above corollary, can be written as,
(18) 
V-B Inner Bound on the Optimum Stopping Set
We have shown that the optimum stopping set is convex. In this section, we will identify points that lie along certain edges of the simplex . The convex hull of these points will yield an inner bound to the optimum stopping set. This first requires us to prove the following lemma, referred to as the Two-points Lemma, which is a generalization of the One-point Lemma (Lemma 3). It gives the optimal cost, , at stage when is such that it places all its mass on and on some , i.e., . Throughout this and the next section (on an outer bound) is fixed and hence, for ease of presentation (and readability), we drop from the notations , and (to appear in these sections later). However, it is understood that these thresholds are, in general, functions of .
Lemma 5 (Two-points)
For if is such that , where then,
Using (15) we can write,
For given as in the hypothesis, the belief in the next state is such that . Using this observation, Lemma 3 (One-point), and the definition of in (17), we obtain the desired result.
Discussion of Lemma 5: The Two-points Lemma (Lemma 5) can be used to obtain certain threshold points in the following way. When has mass only on and on some , , then using Lemma 5, the continuing cost can be written as a function of as,
(19) 
From Lemma 2, it follows that in (19) is a decreasing function of . Let and be pmfs in with mass only on and respectively. These are two of the corner points of the simplex (as an example, Fig. 2 illustrates the simplex and the corner points for stage . With at most two more nodes to go, is a two-dimensional simplex in . , and are the corner points of this simplex).
At stage as we move along the line joining the points and (Fig. 3 illustrates this as going from to ), the cost of continuing in (19) decreases and there is a threshold below which it is optimal to transmit and beyond which it is optimal to continue. The value of this threshold is that value of in (19) at which the continuing cost becomes equal to . Let denote this threshold value, then
The cost of continuing in (19) as a function of along with the stopping cost, , is shown in Fig. 3. The threshold is the point of intersection of these two cost functions. The value of the continuing cost at is . Note that in the case when the threshold will be greater than in which case it is optimal to stop for any on the line joining and .
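The intersection point described above can also be located numerically. The following is a minimal sketch, assuming only that the continuing cost is a decreasing function of a scalar `q` that parameterizes position along the segment between the two corner points; the function name `crossing_point`, the toy cost `1.0 - 2.0 * q`, and the clamping convention at the ends of the segment are assumptions made for illustration and do not reproduce the closed-form threshold derived from (19).

```python
# Bisection for the point where a decreasing continuing cost meets a
# constant stopping cost. The cost function used below is a toy stand-in.

def crossing_point(continuing_cost, stopping_cost, lo=0.0, hi=1.0, tol=1e-9):
    """Return q in [lo, hi] where continuing_cost(q) crosses stopping_cost.

    Assumes continuing_cost is decreasing in q. If stopping is at least as
    good everywhere on the segment, hi is returned; if continuing is
    strictly better everywhere, lo is returned.
    """
    if continuing_cost(hi) >= stopping_cost:
        return hi  # stopping is optimal on the whole segment
    if continuing_cost(lo) < stopping_cost:
        return lo  # continuing is better on the whole segment
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if continuing_cost(mid) >= stopping_cost:
            lo = mid  # still in the stopping region, move right
        else:
            hi = mid  # already in the continuing region, move left
    return 0.5 * (lo + hi)

toy_cost = lambda q: 1.0 - 2.0 * q  # hypothetical decreasing continuing cost
threshold = crossing_point(toy_cost, stopping_cost=0.5)
print(round(threshold, 6))
```

Bisection is used here only because it needs nothing beyond monotonicity; when the continuing cost is available in closed form, solving for the intersection directly is simpler.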