Relay Selection with Partial Information in Wireless Sensor Networks

K. P. Naveen and Anurag Kumar

Both authors are with the Dept. of Electrical Communication Engineering, Indian Institute of Science, Bangalore 560 012, India. Email: {naveenkp, anurag}@ece.iisc.ernet.in. This research was supported in part by a project on Wireless Sensor Networks for Intrusion Detection, funded by DRDO, Government of India, and in part by IFCPAR (Indo-French Center for the Promotion of Advanced Research) (Project 4000-IT-1).

Abstract

Our work is motivated by geographical forwarding of sporadic alarm packets to a base station in a wireless sensor network (WSN), where the nodes are sleep-wake cycling periodically and asynchronously. When a node (referred to as the source) gets a packet to forward, either by detecting an event or from an upstream node, it has to wait for its neighbors in a forwarding set (referred to as relays) to wake-up. Each of the relays is associated with a random reward (e.g., the progress made towards the sink) that is independent and identically distributed (iid). To begin with, the source is uncertain about the number of relays, their wake-up times and the reward values, but knows their distributions. At each relay wake-up instant, when a relay reveals its reward value, the source’s problem is to forward the packet or to wait for further relays to wake-up. In this setting, we seek to minimize the expected waiting time at the source subject to a lower bound on the average reward. In terms of the operations research literature, our work can be considered as a variant of the asset selling problem. We formulate the relay selection problem as a partially observable Markov decision process (POMDP), where the unknown state is the number of relays. We begin by considering the case where the source knows the number of relays. For the general case, where the source only knows a probability mass function (pmf) on the number of relays, it has to maintain a posterior pmf on the number of relays and forward the packet iff the pmf is in an optimum stopping set. We show that the optimum stopping set is convex and obtain an inner bound to this set. We prove a monotonicity result which yields an outer bound. The computational complexity of the above policies motivates us to formulate an alternative simplified model, the optimal policy for which is a simple threshold rule. 
We provide simulation results to compare the performance of the inner and outer bound policies against the simple policy, and against the optimal policy when the source knows the exact number of relays. Observing the simplicity and the good performance of the simple policy, we heuristically employ it for end-to-end packet forwarding at each hop in a multihop WSN of sleep-wake cycling nodes.

Keywords

Relay selection, wireless sensor networks, sleep-wake cycling, partially observable Markov decision process (POMDP), asset selling problem.

I Introduction

We are interested in the problem of packet forwarding in a class of wireless sensor networks (WSNs) in which local inferences based on sensor measurements could result in the generation of occasional “alarm” packets that need to be routed to a base-station, where some sort of action could be taken [1, 2, 3]. Such a situation could arise, for example, in a WSN for human intrusion detection or fire detection in a large region. Such WSNs often need to run on batteries or on harvested energy and, hence, must be energy conscious in all their operations. The nodes of such a WSN would be sleep-wake cycling, waking up periodically to perform their tasks. One approach for the forwarding problem is to use a distributed algorithm to schedule the sleep-wake cycles of the nodes such that the delay of a packet from its source to the sink on a multihop path is minimized [2, 4]. An organizational phase is required for such algorithms, which increases the protocol overhead and moreover the scheduling algorithm has to be rerun periodically since the clocks at different nodes drift at different rates (so that the previously computed schedule would have become stale after long operation time). For a survey of routing techniques in wireless sensor and ad hoc networks and their classification, see [5, 6].

In this paper we are concerned with the sleep-wake cycling approach that permits the nodes to wake-up independently of each other even though each node is waking up periodically, i.e., asynchronous periodic sleep-wake cycling [7, 1]. In fact, given the need for a long network life-time, nodes are more likely to be sleeping than awake. In such a situation, when a node has a packet to forward, it has to wait for its neighbors to wake up. When a neighbor node wakes up, the forwarding node can evaluate it for its use as a relay, e.g., in terms of the progress it makes towards the destination node, the quality of the channel to the relay, the energy level of the relay, etc., (see [8, 9] for different routing metrics based on the above mentioned quantities). We think of this as a reward offered by the potential relay. The end-to-end network objective is to minimize the average total delay subject to a lower bound on some measure of total reward along the end-to-end path. In this paper we address this end-to-end objective by considering optimal strategies at each hop. When a node gets a packet to forward, it has to make decisions based only on the activities in its neighborhood. Waiting for all potential relays to wake-up and choosing the one with the best reward maximizes the reward at each hop, but increases the forwarding delay. On the other hand, forwarding to the first relay to wake-up may result in the loss of the opportunity of choosing a node with a better reward. Hence, at each hop, there is a trade-off between the one-hop delay and the one-hop reward. By solving the one-hop problem of minimizing the average delay subject to a constraint on the average reward, we expect to capture the trade-off between the end-to-end metrics. 
For instance, suppose the end-to-end objective is to minimize the expected end-to-end delivery delay subject to an upper bound on the expected number of hops in the path, the motivation for this constraint being that more hops traversed entails a greater expenditure of energy in the network. In our approach, we would heuristically address this problem by considering at each hop the problem of minimizing the mean forwarding delay subject to a lower bound on the progress made towards the sink. Greater progress at each hop entails greater delay per hop, while reducing the number of hops it takes a packet to reach the sink.

The local problem setting is the following. Somewhere in the network a node has just received a packet to forward; for the local problem we refer to this forwarding node as the source and think of the time at which it gets the packet as $0$. There is an unknown number of relays in the forwarding set of the source. In the geographical forwarding context, this lack of information on the number of relays could model the fact that the neighborhood of a forwarding node could vary over time due, for example, to node failures, variation in channel conditions, or (in a mobile network) the entry or exit of mobile relays. However, we assume that the number of relays is bounded by a known number $K$, and the source has an initial probability mass function (pmf), over $\{1, 2, \dots, K\}$, on the number of potential relays. The source desires to forward the packet within the interval $(0, T)$, while knowing that the relays wake-up independently and uniformly over $(0, T)$ and the rewards they offer are independently and identically distributed (iid). We will formally introduce our model in Section II. Next we discuss related work and highlight our contributions.

I-A Related Work

Here we provide a summary of related literature in the context of geographical forwarding and channel selection. Since our problem also belongs to the class of asset selling problems studied in operations research literature, we survey related work from there as well.

Geographical forwarding problems: In our prior work [7] we have considered a simple model where the number of relays is a constant which is known to the source. There the reward is simply the progress made by a relay node towards the sink. In the current work we have generalized our earlier model by allowing the number of relays to be not known to the source. Also, here we allow a general reward structure.

There has been other work in the context of geographical forwarding and anycast routing, where the problem of choosing one among several neighboring nodes arises. Zorzi and Rao [10] consider a scenario of geographical forwarding in a wireless mesh network in which the nodes know their locations, and are sleep-wake cycling. They propose GeRaF (Geographical Random Forwarding), a distributed relaying algorithm, whose objective is to carry a packet to its destination in as few hops as possible, by making as large progress as possible at each relaying stage. For their algorithm, the authors obtain the average number of hops (for given source-sink distance) as a function of the node density. These authors do not consider the trade-off between the relay selection delay and the reward gained by selecting a relay, which is a major contribution of our work.

Liu et al. [11] propose a relay selection approach as a part of CMAC, a protocol for geographical packet forwarding. With respect to the fixed sink, a node has a forwarding set consisting of all nodes that make progress greater than a threshold (an algorithm parameter). Under CMAC, the node chooses the threshold that minimizes the expected normalized latency, i.e., the delay until the first wake-up instant of a node in the forwarding set, normalized by the corresponding progress made. The Random Asynchronous Wakeup (RAW) protocol [12] also considers transmitting to the first node to wake-up that makes a progress of greater than a threshold. Interestingly, this is the structure of the optimal policy for our simplified model in [7]. For the sake of completeness we have described the simplified model in this paper as well (see Section VI). Thus we have provided analytical support for using such a threshold policy.

Kim et al. [1] consider a dense WSN. Just like the motivation for our model, an occasional alarm packet needs to be sent, from wherever in the network it is generated, to the sink. The authors develop an optimal anycast scheme to minimize average end-to-end delay from any node to the sink when each node wakes up asynchronously with a given rate. They show that periodic wake-up patterns obtain minimum delay among all sleep-wake patterns with the same rate. They propose an algorithm called LOCAL-OPT [13] which yields, for each node $i$, a threshold for each of its neighbors $j$. If the time at which neighbor $j$ wakes up is less than this threshold, then $i$ will transmit to $j$. Otherwise $j$ will go back to sleep and $i$ will continue waiting for further neighbors. A key drawback is that a configuration phase is required to run the LOCAL-OPT algorithm.

Rossi et al. [14] consider the problem where a node $i$, with a packet to forward and which is $n$ hops away from the sink, has to choose between two of its shortlisted neighbors. The first shortlisted neighbor is the one with the least cost among all others with hop count $n-1$ (one less than node $i$). The second one is the least cost node among all its neighbors with hop count $n$ (same as that of node $i$). Though the first node is on the shortest path, sometimes when its cost is high, it may not be the best option. It turns out that it is optimal to choose one node over the other by comparing the cost difference with a threshold. The threshold depends on the cost distribution of the nodes which are two hops away from node $i$. Here there is no notion of sleep-wake cycling so that all the neighbor costs are known when node $i$ gets a packet to forward. The problem is that of one shot decision making. In our problem a neighbor’s cost will become available only after it wakes up, at which instant node $i$ has to take a decision regarding forwarding. Hence, ours is a sequential decision problem.

Channel selection problems: Akin to the relay selection problem is the problem of channel selection. The authors in [15, 16] consider a model where there are several channels available to choose from. The transmitter has to probe the channels to learn their quality. Probing many channels yields one with a good gain but reduces the effective time for transmission within the channel coherence period. The problem is to obtain optimal strategies to decide when to stop probing and to transmit. Here the number of channels is known and all the channels are available at the very beginning of the decision process. In our problem the number of relays is not known, and the relays become available at random times.

Asset selling problems: The basic asset selling problem [17, 18] comprises offers that arrive sequentially over discrete time slots. The offers are iid. As the offers arrive, the seller has to decide whether to take an offer or wait for future offers. The seller has to pay a cost to observe the next offer. Previous offers cannot be recalled. The decision process ends with the seller choosing an offer. Over the years, several variants of the basic problem have been studied, both with and without recalling the previous offers. Recently Kang [19] has considered a model where a cost has to be paid to recall the previous best offer. Further, the previous best offer can be lost at the next time instant with some probability. See [19] for further references to literature on models with uncertain recall. In [20], the authors consider a model in which the offers arrive at the points of a renewal process. Additional literature on such work can be found in [20]. In these models, the number of potential offers is either known or infinite. In [21], a variant is studied in which the asset selling process can reach a deadline in the next slot with some fixed probability, provided that the process has proceeded up to the present slot.

In our work the number of offers (i.e., relays) is not known. Also the successive instants at which the offers arrive are the order statistics of an unknown number of iid uniform random variables over an interval $(0, T)$. After observing a relay, the probability that there are no more relays to go (which is the probability that the present stage is the last one) is not fixed. This probability has to be updated depending on the previous such probabilities and the inter wake-up times between the successive relays. Although our problem falls in the class of asset selling problems, to the best of our knowledge the particular setting we have considered in this paper has not been studied before.

I-B Our Contributions

With the number of relays being unknown, the natural approach is to formulate the problem as a partially observed Markov decision process (POMDP). A POMDP is a generalization of an MDP, where at each stage the actual internal state of the system is not available to the controller. Instead, the controller can observe a value from an observation space. The observation probabilistically depends on the current actual state and the previous action. In some cases, a POMDP can be converted to an equivalent MDP by regarding a belief (i.e., a probability distribution) on the state space as the state of the equivalent MDP. For a survey of POMDPs see [22]. It is clear that, even if the actual state space is finite, the belief space is uncountable. There are several algorithms available to obtain the optimal policy when the actual state space is finite [23], starting from the seminal work by Smallwood and Sondik [24]. When the number of states is large, these algorithms are computationally intensive. In general, it is not easy to obtain an optimal policy for a POMDP. In the current work, we have characterized the optimal policy in terms of an optimum stopping set. We have made use of the convexity results in [25] and some properties specific to our problem to obtain an inner bound on the optimum stopping set. We prove a simple monotonicity result to obtain an outer bound. In summary, the following are the main contributions of our work:

  • We formulate the problem of relay selection with partial information as a finite horizon partially observable Markov decision process (POMDP), with the unknown state being the actual number of relays (Section III). The posterior pmf on the number of relays is shown to be a sufficient decision statistic.

  • We first consider the completely observable MDP (COMDP) version of the problem where the source knows the number of relays with probability one (wp1) (Section IV). The optimal policy is characterized by a sequence of threshold functions.

  • For the POMDP, at each stage the optimum stopping set is the set of all pmfs on the number of relays where it is optimal to stop (Section V). We prove that this set is convex (Section V-A), and provide an inner bound (subset) for it (Section V-B). We prove a monotonicity result and obtain an outer bound (superset, Section V-C). The threshold functions obtained in COMDP version are used in the design of the bounds. These threshold functions need to be obtained recursively which is in general, computationally intensive.

  • The complexity of the above policies motivates us to consider a simplified model (Section VI). We prove that the optimal policy for this simplified model is a simple threshold rule.

  • Through simulations (Section VII-A) we compare the performance of the various policies with that of the optimal COMDP policy. The inner bound policy performs slightly better than the outer bound policy. The simple policy obtained from the simplified model performs very close to the inner bound. Also, we show the poor performance of a naive policy that assumes the actual number of relays to be simply the expected number.

  • Finally as a heuristic for the end-to-end problem in the geographical forwarding context, we apply the simple policy at each hop and study the end-to-end performance by simulation (Section VII-B). We find that it is possible to trade off between the expected end-to-end delay and the expected number of hops by tuning a parameter.

For the ease of presentation, in the main sections we only provide an outline of the proof for most of the lemmas, followed by a brief description. Formal proofs are available in Appendices A, B and C. Appendix D contains additional simulation results.

II System Model

We consider the one stage problem in which a node in the network receives a packet to forward. We call this node the “source” and the nodes that it could potentially forward the packet to are called “relays”. The local problem is taken to start at time $0$. Thus at time $0$, the source node has a packet to forward to a sink but needs a relay node to accomplish this task. There is a nonempty set of $N$ relay nodes, labeled by the indices $1, 2, \dots, N$. $N$ is a random variable bounded above by $K$, a system parameter that is known to the source node, i.e., the support of $N$ is $\{1, 2, \dots, K\}$. The source does not know $N$, but knows the bound $K$, and a pmf $p_0$ on $\{1, 2, \dots, K\}$, which is the initial pmf of $N$. A relay node $i$, $1 \le i \le N$, becomes available to the source at the instant $T_i$. The source knows that the instants $\{T_i\}$ are iid uniformly distributed on $(0, T)$. Observe that this would be the case if the wake-up instants of all the nodes in the network are periodic with period $T$, if these (periodic) renewal processes are stationary and independent, and if the forwarding node’s decision instants are stopping times w.r.t. these wake-up time processes [26].

We call $T_i$ the wake-up instant of relay $i$. If the source forwards the packet to the relay $i$, then a reward of $R_i$ is accrued. The rewards $R_i$, $i = 1, 2, \dots, N$, are iid random variables with pdf $f_R$. The support of $f_R$ is $[0, \bar{R}]$. The source knows this statistical characterisation of the rewards, and also that the $\{R_i\}$ are independent of the wake-up instants $\{T_i\}$. When a relay wakes up at $T_i$ and reveals its reward $R_i$, the source has to decide whether to transmit to relay $i$ or to wait for further relays. If the source decides to wait, then it instructs the relay with the best reward to stay awake, while letting the rest go back to sleep. This way the source can always forward to a relay with the best reward among those that have woken up so far.
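To make the one-hop model concrete, the following is a minimal simulation sketch, not the authors' code. It assumes rewards that are uniform on [0, 1] (the paper only requires iid rewards with a known pdf) and, purely for illustration, a rule that forwards as soon as the best reward seen so far reaches a fixed threshold. The names `simulate_one_hop`, `average_performance`, `period` and `threshold` are hypothetical.

```python
import random


def simulate_one_hop(n_relays, period, threshold, rng):
    """Return (delay, reward) for one packet at the source.

    Assumptions (illustrative only): wake-up instants are iid uniform on
    (0, period) and rewards are iid uniform on [0, 1].
    """
    # Sort relays by wake-up instant so that they are seen in order.
    relays = sorted((rng.uniform(0.0, period), rng.random())
                    for _ in range(n_relays))
    best = 0.0
    for wake_time, reward in relays:
        # The best relay so far is kept awake, so its reward is retained.
        best = max(best, reward)
        if best >= threshold:
            return wake_time, best
    # No relay met the threshold: wait until the end of the period and
    # forward to the best relay seen.
    return period, best


def average_performance(pmf, period, threshold, runs=20000, seed=0):
    """Estimate (mean delay, mean reward) when N is drawn from pmf."""
    rng = random.Random(seed)
    counts, weights = list(pmf.keys()), list(pmf.values())
    total_delay = total_reward = 0.0
    for _ in range(runs):
        n = rng.choices(counts, weights)[0]
        delay, reward = simulate_one_hop(n, period, threshold, rng)
        total_delay += delay
        total_reward += reward
    return total_delay / runs, total_reward / runs
```

Sweeping `threshold` with a fixed seed traces out the delay versus reward trade-off for this illustrative rule: a larger threshold can only delay the forwarding instant and can only increase the reward collected.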

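Given that $N = n$ (throughout this discussion we will focus on the event $(N = n)$), let $W_1, W_2, \dots, W_n$ represent the order statistics of $T_1, T_2, \dots, T_n$, i.e., the $\{W_k\}$ sequence is the $\{T_i\}$ sequence sorted in the increasing order. The pdf of the $k$th ($1 \le k \le n$) order statistic [27, Chapter 2] is, for $0 < w < T$,

$$ f_{W_k \mid N}(w \mid n) = \frac{n!}{(k-1)!\,(n-k)!}\, \frac{w^{k-1}\,(T-w)^{n-k}}{T^{n}}. \tag{1} $$

Also the joint pdf of the $k$th and the $\ell$th order statistic (for $1 \le k < \ell \le n$) is, for $0 < w < v < T$,

$$ f_{W_k, W_\ell \mid N}(w, v \mid n) = \frac{n!}{(k-1)!\,(\ell-k-1)!\,(n-\ell)!}\, \frac{w^{k-1}\,(v-w)^{\ell-k-1}\,(T-v)^{n-\ell}}{T^{n}}. \tag{2} $$

Using the above expressions, we can write down the conditional pdf (for $1 \le k < \ell \le n$) as, for $0 < w < T$ and $w < v < T$,

$$ f_{W_\ell \mid W_k, N}(v \mid w, n) = \frac{(n-k)!}{(\ell-k-1)!\,(n-\ell)!}\, \frac{(v-w)^{\ell-k-1}\,(T-v)^{n-\ell}}{(T-w)^{n-k}}. \tag{3} $$

Comparing (3) with (1), as expected, we observe that, given $N = n$, the pdf of the wake-up instant of the $\ell$th node, conditioned on the wake-up instant of the $k$th node, is that of the $(\ell-k)$th order statistic of $n-k$ iid random variables that are uniform on the remaining time $T-w$. Let $W_0 = 0$ and define $U_k = W_k - W_{k-1}$ for $k = 1, 2, \dots, n$. The $U_k$ are the inter-wake-up times between the consecutive nodes (see Fig. 1). Later we will be interested in the conditional pdf $f_{U_{k+1} \mid W_k, N}$ for $k = 0, 1, \dots, n-1$, which is given by, for $0 \le w < T$ and $0 < u < T-w$,

$$ f_{U_{k+1} \mid W_k, N}(u \mid w, n) = \frac{(n-k)\,(T-w-u)^{n-k-1}}{(T-w)^{n-k}}. \tag{4} $$

The conditional expectation is given by,

$$ E\left[ U_{k+1} \mid W_k = w, N = n \right] = \frac{T-w}{n-k+1}, \tag{5} $$

which is simply the expected value of the minimum of $n-k$ random variables ($n-k$ is the remaining number of relays), which are iid uniform on the interval $(0, T-w)$ ($T-w$ is the remaining time).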
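The order-statistics expressions above are standard results for iid uniform random variables and can be checked numerically. The sketch below is illustrative only; the function names and the argument `period` (standing for the wake-up period) are assumptions introduced here, not notation from the paper.

```python
from math import comb


def order_stat_pdf(w, k, n, period):
    """pdf of the k-th smallest of n iid Uniform(0, period) wake-up instants."""
    if not 0.0 < w < period:
        return 0.0
    # k * C(n, k) equals n! / ((k-1)! (n-k)!).
    return k * comb(n, k) * w ** (k - 1) * (period - w) ** (n - k) / period ** n


def next_gap_pdf(u, w, k, n, period):
    """pdf of the gap until relay k+1, given relay k woke at w and N = n."""
    remaining_time = period - w
    remaining = n - k  # relays that have not yet woken up
    if remaining <= 0 or not 0.0 < u < remaining_time:
        return 0.0
    return (remaining * (remaining_time - u) ** (remaining - 1)
            / remaining_time ** remaining)


def expected_next_gap(w, k, n, period):
    """Mean of the minimum of n-k iid uniforms on (0, period - w)."""
    return (period - w) / (n - k + 1)
```

For example, with `period = 1.0`, `n = 5`, `k = 2` and `w = 0.4`, three relays remain in a window of length 0.6, and `expected_next_gap` returns one quarter of that window.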

Definition 1

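For notational simplicity we define,

$$ f_k(u \mid w, n) := f_{U_{k+1} \mid W_k, N}(u \mid w, n). \tag{6} $$

Note that $f_k(\cdot \mid w, n)$ depends on $n$ and $k$ through the difference $n-k$ and depends on $w$ through $T-w$.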

Since the reward sequence is iid and independent of the wake-up instants $\{T_i\}$, we write $(W_k, R_k)$ as the pairs of ordered wake-up instants and the corresponding rewards. Evidently, $f_{R_k \mid N}(r \mid n) = f_R(r)$ for $k = 1, 2, \dots, n$. Further we define (when $N = n$) $W_{n+1} := T$, $U_{n+1} := T - W_n$ and $R_{n+1} := 0$. All these variables are depicted in Fig. 1. We end this section by listing out, in Table I, most of the symbols that appear in the paper with a brief description for each.

Fig. 1: There are $N$ relays. $(W_k, R_k)$ represents the wake-up instant and reward, respectively, of the $k$th relay. These are shown as points in . $U_k$ are the inter-wake-up times. Note that , and .
Symbol Description
Inner product of vectors and
   Thresholds lying on the line joining and of the simplex ; Used in the construction of the inner and outer bounds, respectively
Best reward so far, i.e.,
Average cost of continuing at stage when the state is
Optimum stopping set at stage when
Inner bound for the stopping set
Outer bound for the stopping set
One-step-stopping set for the simplified model
Expectation conditioned on
pdf of conditioned on
pdf of the iid rewards
Optimal cost-to-go function at stage when the state is
Bound on the number of relays
Number of relays; random variable taking values from
Number of relays in the simplified model; a constant
Probability of an event
Set of all pmfs on the set
Represents a typical state at stage where is the belief state and
A corner point in , i.e.,
Reward of the th relay
Inter wake-up time between the and th relay, i.e.,
Wake-up instant of the th relay
Quantities, analogous to the ones in the exact model, for the simplified model
Threshold obtained from the simplified model
Reward constraint for the problem in (11)
When is such that then it is optimal to stop iff
Lagrange multiplier, see (12)
Average cost of stopping at stage when
Belief transition function; is a pmf in for a given , and
Threshold obtained from the COMDP version of the problem; If the source knows wp1 that , then at some stage with it is optimal to stop iff
TABLE I: List of mathematical notation.

III The Sequential Decision Problem

For the model set up in Section II, we now consider the following sequential decision problem. At each instant that a relay wakes up, i.e., $W_1, W_2, \dots$, the source has to make the decision to forward the packet, or to hold the packet until the next wake-up instant. Since the number of available relays, $N$, is unknown, we have a decision problem with partial information. We will show how the problem can be set up in the framework of a partially observable Markov decision process (POMDP) [22] [28, Chapter 5].

III-A Actions, State Space, and State Transition

Actions: We assume that the time instants at which the relays wake-up, i.e., $W_1, W_2, \dots$, constitute the decision instants or stages (see Footnote 1). At each decision instant, there are two actions possible at the source, denoted $0$ and $1$, where

  • $0$ represents the action to continue waiting for more relays to wake-up, and

  • $1$ represents the action to stop and forward the packet to the relay that provides the best reward among those that have woken up to the current decision epoch.

Footnote 1: A better choice for the decision instants may be to allow the source to take a decision at any time in $(0, T)$. When $N$ is known to the source it can be argued that it is optimal to take decisions only at relay wake-up instants. However this may not hold for our case where $N$ is unknown. In this paper we proceed with our restriction on the decision instants and consider the general case as a topic for future work.

Since there can be at most relays, the total number of decision instants is . The decision process technically ends at the first instant , at which the source chooses action , in which case we assume that all the subsequent decision instants, , occur at . In cases where the source ends up waiting until time (referring to Fig. 1, this is possible if, even at the source decides to continue, not realizing that it has seen all the relays there are in its forwarding set), all the subsequent decision instants are assumed to occur at .

State Space: At stage the state space is simply and the only action possible is , where in the superscript is to signify that is the set of actual internal states of the system. The state space at stage is,

and for stages is,

Thus the state space at stage is written as the union of three sets. The physical meanings of these sets are as follows:

  • : in the state triple represents the actual number of relays. The states in this set correspond to the case where there are more than or equal to relays, i.e., satisfies, . In the pair , is the wake-up instant () of the th relay, and is the best reward among the relays seen so far. Same remark holds for the states in . Stage begins at time with reward. Hence the states in are of the form .

  • : Suppose there were relays and, at stage the source decides to continue. Note that it is possible for the source to take such a decision, since it does not know the number of relays. In such a case, the source ends up waiting until time and enters stage . Hence the states in this set are of the form where represents the best reward among all the relays ().

  • : is the terminating state. The state at stage will be , if the source has already forwarded the packet at an earlier stage.

State Transition: If the state at stage is (i.e., the source has already forwarded the packet) then the next state is always . Suppose is the state at some stage , , and represents the action taken. If then the decision process stops and we regard that the system enters the termination state so that the state at all the subsequent stages, , is . The source will also terminate the decision process, knowing that the relays wake-up within the interval , if it has waited for a duration of . This means that , i.e., and .

On the other hand if and , the source waits for a random duration of and encounters a relay with a random reward of so that the next state is . Note that if , i.e., the current relay is the last one, then since we have defined and , the next state will be of the form . Thus the state at stage can be written down as,

(7)

III-B Belief State and Belief State Transition

Since the source does not know the actual number of relays $N$, the state is only partially observable. The source takes decisions based on the entire history of the wake-up instants and the best rewards. If the source has not forwarded the packet until stage $k-1$ then define $I_k := (w_1, \dots, w_k, b_1, \dots, b_k)$ to be the information vector available at the source when the $k$th relay wakes up. Here $w_1, \dots, w_k$ represents the wake-up instants of relays waking up at stages $1, \dots, k$ and $b_1, \dots, b_k$ are the corresponding best rewards. Define $p_k$ to be the belief state about $N$ at stage $k$ given the information vector $I_k$, i.e., $p_k(n) = P(N = n \mid I_k)$ for $n = k, k+1, \dots, K$ (note that $p_k(k)$ is the probability that the $k$th relay is the last one). Thus, $p_k$ is a pmf in the $(K-k)$ dimensional probability simplex. Let us denote this simplex as $\mathcal{P}_k$.

Definition 2

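For $k = 1, 2, \dots, K$, let $\mathcal{P}_k$ := set of all pmfs on the set $\{k, k+1, \dots, K\}$. $\mathcal{P}_k$ is the $(K-k)$ dimensional probability simplex in $\mathbb{R}^{K-k+1}$.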

The “observation” at stage is a part of the actual state . For a general POMDP problem the observation can belong to a completely different space than the actual state space. Moreover the distribution of the observation at any stage can in general depend on all the previous states, observations, actions and disturbances. Suppose this distribution depends only on the state, action and disturbance of the immediately preceding stage, then a belief on the actual state given the entire history turns out to be sufficient for taking decisions [28, Chapter 5]. For our case, this condition is met and hence at stage , is a sufficient statistic to take decision. Therefore we modify the state space as, and for ,

(8)

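After seeing $k$ relays, suppose the source chooses not to forward the packet, then upon the next relay waking up (if any), the source needs to update its belief about the number of relays. Formally, if $(p, w, b)$ is the state at stage $k$ and $w + u$ is the wake-up instant of the next relay then, using Bayes rule, the next belief state can be obtained via the following belief state transition function which yields a pmf in $\mathcal{P}_{k+1}$,

$$ \tau_{k+1}(p, w, u)(n) = \frac{p(n)\, f_k(u \mid w, n)}{\sum_{\ell = k+1}^{K} p(\ell)\, f_k(u \mid w, \ell)}, \tag{9} $$

for $n = k+1, \dots, K$, where $f_k(u \mid w, n)$ denotes the conditional pdf in (4). Note that this function does not depend on $b$. Thus, if at stage $k$, the state is $(p, w, b)$, then the next state is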

(10)

where is the random delay until the next relay wakes up and is the random reward offered by that relay. The explanation for the above belief state transition expression remains same as that of the actual state transition in (7), except that if the action is to continue, then the source needs to update the belief about the number of relays. Suppose at stage , the actual number of relays happens to be and the action is to continue, which is possible since the source does not know the actual number, then the source will end up waiting until time and then transmit to the relay with the best reward.
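The Bayes-rule belief update described above can be sketched in a few lines. This is a minimal illustration under the stated model (wake-up instants iid uniform over the period), using as likelihood the conditional density of the next inter-wake-up time given that there are n relays, namely (n-k)(T-w-u)^(n-k-1)/(T-w)^(n-k). The function name `update_belief` and the dictionary representation of the pmf are assumptions made here for illustration.

```python
def update_belief(belief, k, w, u, period):
    """Posterior pmf on the number of relays after relay k+1 wakes up.

    belief : dict mapping n (k <= n <= K) to P(N = n | history) at stage k
    w      : wake-up instant of the k-th relay
    u      : observed gap until relay k+1, with 0 < u < period - w
    """
    remaining_time = period - w
    if not 0.0 < u < remaining_time:
        raise ValueError("the observed gap must lie inside the remaining time")
    weights = {}
    for n, prob in belief.items():
        m = n - k  # relays still asleep if N = n
        if m <= 0:
            continue  # N = k is ruled out once another relay shows up
        likelihood = m * (remaining_time - u) ** (m - 1) / remaining_time ** m
        weights[n] = prob * likelihood
    total = sum(weights.values())
    if total == 0.0:
        raise ValueError("the observation has zero probability under the belief")
    return {n: weight / total for n, weight in weights.items()}
```

As in the text, the update uses only the current belief, the current wake-up instant and the observed gap; the best reward seen so far plays no role. A short gap shifts probability mass towards larger numbers of relays.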

III-C Stopping Rules and the Optimization Problem

As the relays wake-up, the source’s problem is to decide to stop or continue waiting for further relays. A stopping rule or a policy $\pi$ is a sequence of mappings $(\mu_1, \dots, \mu_K)$, where $\mu_k$ maps the state at stage $k$ to an action. Let $\Pi$ represent the set of all policies. The delay $D_\pi$ incurred using policy $\pi$ is the instant at which the source forwards the packet. It could be either one of the $W_k$, or the instant $T$. The reward $R_\pi$ is the reward associated with the relay to which the packet is forwarded. The problem we are interested in is the following,

$$ \min_{\pi \in \Pi} \; E[D_\pi] \quad \text{subject to} \quad E[R_\pi] \ge \gamma. \tag{11} $$

To solve the above problem, we consider the following unconstrained problem,

$$ \min_{\pi \in \Pi} \; \big( E[D_\pi] - \eta\, E[R_\pi] \big), \tag{12} $$

where $\eta > 0$.

Lemma 1

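Let $\pi^*$ be an optimal policy for the unconstrained problem in (12). Suppose that $\eta$ is such that $E[R_{\pi^*}] = \gamma$, then $\pi^*$ is optimal for the main problem in (11) as well.

Proof:

For any policy $\pi$ satisfying the constraint $E[R_\pi] \ge \gamma$ we can write,

$$ E[D_{\pi^*}] - \eta \gamma = E[D_{\pi^*}] - \eta\, E[R_{\pi^*}] \le E[D_\pi] - \eta\, E[R_\pi] \le E[D_\pi] - \eta \gamma, $$

where the first inequality is by the optimality of $\pi^*$ for (12), the equality is by the hypothesis on $\eta$, and the last inequality is due to the restriction of $\pi$ to policies satisfying $E[R_\pi] \ge \gamma$. Hence $E[D_{\pi^*}] \le E[D_\pi]$.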

Hence we focus on solving the unconstrained problem in (12).
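The Lagrangian argument of Lemma 1 can be illustrated on a toy example. The sketch below assumes a small, made-up list of candidate policies, each summarised only by its (expected delay, expected reward) pair; the numbers and the function names are hypothetical and serve only to show that an unconstrained minimizer which meets the reward constraint with equality also solves the constrained problem.

```python
def best_unconstrained(policies, eta):
    """Policy minimizing expected delay minus eta times expected reward."""
    return min(policies, key=lambda p: p[0] - eta * p[1])


def best_constrained(policies, gamma):
    """Policy minimizing expected delay among those with reward >= gamma."""
    feasible = [p for p in policies if p[1] >= gamma]
    return min(feasible, key=lambda p: p[0])


# Hypothetical (expected delay, expected reward) pairs, for illustration only.
candidate_policies = [(0.10, 0.30), (0.25, 0.60), (0.50, 0.80), (0.90, 0.90)]

unconstrained = best_unconstrained(candidate_policies, eta=1.0)
constrained = best_constrained(candidate_policies, gamma=unconstrained[1])
```

Here the multiplier 1.0 selects the policy with reward 0.60; taking that value as the reward constraint, the constrained search returns the same policy. In practice the multiplier would be tuned until the reward constraint is met with equality.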

III-D One-Step Costs

The objective in (12) can be seen as accumulating additively over each step. If the decision at a stage is to continue then the delay until the next relay wakes up (or until ) gets added to the cost. On the other hand if the decision is to stop then the source collects the reward offered by the relay to which it forwards the packet and the decision process enters the state . The cost in state is . Suppose is the state at stage . Then the one-step-cost function is, for ,

(13)

The cost of termination is . Also note that for , the possible states are of the form and the only possible action is , so that .

III-E Optimal Cost-to-go Functions

For , let represent the optimal cost-to-go function at stage . For any state , can be written as,

(14)

where stopping cost (continuing cost) represents the average cost incurred, if the source, at the current stage decides to stop (continue), and takes optimal action at the subsequent stages. For the termination state, since the one step cost is zero and since the system remains in in all the subsequent stages, we have . For a state , we next evaluate the two costs in the above expression.

First let us obtain the stopping cost. Suppose that there were relay nodes and the source has seen them all. In such a case if (note that will just be a point mass on ) is the state at stage then the optimal cost is simply the cost of termination, i.e., . For , if the action is to stop then the one step cost is and the next state is so that the further cost is . Therefore, the stopping cost at any stage is simply .

On the other hand the cost for continuing, when the state at stage is , using the total expectation law, can be written as,

(15)

Each expectation term in the summation in (15) is the average cost to continue conditioned on the event . is the (random) time until the next relay wakes up ( is the one step cost) and is the optimal cost-to-go from the next stage onwards ( constitutes the future cost). The next state is obtained via the state transition equation (10). The term in (15) associated with is the cost of continuing when the number of relays happens to be , i.e., and there are no more relays to go. Recall that we defined (in Section II) and when the actual number of relays is . Therefore is the one step cost when . Also and so that at the next stage (which occurs at ) the process will terminate (enter ) with a cost of (see (10) and (13)), which represents the future cost.

Thus the optimal cost-to-go function (14) at stage can be written as,

(16)

From the above expression it is clear that at stage when the state is , the source has to compare the stopping cost, , with the cost of continuing, , and stop iff . Later, in Section V, we will use this condition () to define the optimum stopping set. We will prove that the continuing cost, , is concave in , leading to the result that the optimum stopping set is convex. Equations (15) and (16) are used extensively in the subsequent development.

IV Relationship with the Case Where is Known (the COMDP Version)

In the previous section (Section III) we detailed our problem formulation as a POMDP. The state is partially observable because the source does not know the exact number of relays. It is instructive to first consider the simpler case where this number is known, which is the contribution of our earlier work in [7]. Hence, in this section, we will consider the case when the initial pmf, , places all its mass on some , i.e., . We call this the COMDP version of the problem.

First we define a sequence of threshold functions which will be useful in the subsequent proofs. These are the same threshold functions that characterize the optimal policy for our model in [7].

Definition 3

For , define inductively as follows: for all , and for (recall Definition 1),

(17)

In the above expression we have suppressed the subscript for and for simplicity. The pdf used to take the expectation in the above expression is (again recall Definition 1).

We will need the following simple property of the threshold functions in a later section.

Lemma 2

For , .

Proof:

See Appendix A-A.

Next we state the main lemma of this section. We call this the One-point Lemma, because it gives the optimal cost, , at stage when the belief state is such that it has all the mass on some .

Lemma 3 (One-point)

Fix some and . For any , if is such that then,

Proof:

The proof is by induction. We make use of the fact that if at some stage the belief state is such that then the next belief state , obtained by using the belief transition equation (9), is also of the form . We complete the proof by using Definition 3 and the induction hypothesis. For a complete proof, see Appendix A-B.

Discussion of Lemma 3: At stage if the state is , where is such that for some , then from the One-point Lemma it follows that the optimal policy is to stop and transmit iff . The subscript of the function signifies the number of relays still to go. For instance, if we know that there are exactly 4 more relays to go then the threshold to be used is . If at stage it is optimal to continue, then from (9) it follows that the next belief state also has mass only on and hence at this stage it is optimal to use the threshold function . Therefore, if we begin with an initial belief such that for some , then the optimal policy is to stop at the first stage such that where is the wake-up instant of the th relay and . Note that, since at stage the threshold to be used is (see Definition 3), we invariably have to stop at stage if we have not terminated earlier. This is exactly the same as our optimal policy in [7], where the number of relays is known to the source (instead of knowing the number with probability 1, as in our One-point Lemma here).
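The policy described in this discussion can be sketched as a loop over relay wake-ups. This is a hedged illustration, not the paper's implementation. The callable `threshold(remaining, w, b)` is a hypothetical stand-in for the threshold functions of Definition 3, indexed by the number of relays still to go. The stopping cost is assumed to be `-eta * b`, and `toy_threshold` is a made-up rule used only to exercise the loop.

```python
from typing import Callable, Sequence, Tuple

Threshold = Callable[[int, float, float], float]


def run_threshold_policy(wake_times: Sequence[float],
                         rewards: Sequence[float],
                         horizon: float,
                         eta: float,
                         threshold: Threshold) -> Tuple[int, float, float]:
    """Return (stage, forwarding_time, reward) when the number of relays is known.

    At each wake-up the source stops iff the assumed stopping cost,
    -eta * best_reward, is no larger than the threshold for the number
    of relays still to go.
    """
    n = len(wake_times)
    if n == 0 or n != len(rewards):
        raise ValueError("need equally many (and at least one) wake times and rewards")
    best = float("-inf")
    for k, (w, r) in enumerate(zip(wake_times, rewards), start=1):
        best = max(best, r)
        remaining = n - k  # relays still to go after this wake-up
        if -eta * best <= threshold(remaining, w, best):
            return k, w, best
    # Fallback for thresholds that do not force a stop at the last relay:
    # wait until the end of the period and forward to the best relay.
    return n, horizon, best


def toy_threshold(remaining: int, w: float, b: float,
                  horizon: float = 1.0, eta: float = 1.0,
                  cutoff: float = 0.7) -> float:
    """Hypothetical thresholds, NOT those of Definition 3."""
    if remaining == 0:
        # Waiting out the period costs (horizon - w) before collecting b,
        # so stopping is never worse at the last relay.
        return (horizon - w) - eta * b
    return -eta * cutoff  # stop early only if the best reward reaches the cutoff
```

With wake-up instants `[0.1, 0.4, 0.9]` and rewards `[0.2, 0.8, 0.5]`, the toy rule stops at the second relay. With rewards `[0.2, 0.3, 0.5]` it runs to the last relay and stops there, mirroring the remark that the source invariably stops at the final stage.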

V Unknown : Bounds on the optimum stopping set

In this section we will consider the general case where the number of relays is not known to the source. The sequential decision problem developed in Section III was for this unknown case. The problem was formulated as a POMDP for which the source’s decision to stop and forward the packet is based on the belief state which takes values in after the source has observed relays waking up. We begin this section by defining the optimum stopping set. We show that this set is convex. Characterizing the exact optimum stopping set is computationally intensive. Therefore, our aim is to derive inner and outer bounds (a subset and a superset, respectively) for the optimum stopping set.

Definition 4 (Optimum stopping set)

For , let . Referring to (16) it follows that, for a given , represents the set of all beliefs at stage at which it is optimal to stop. We call the optimum stopping set at stage when the delay () and best reward () values are and , respectively.

V-A Convexity of the Optimum Stopping Sets

We will prove (in Lemma 4) that the continuing cost, , in (15) is concave in . From the form of the stopping set , a simple consequence of this lemma will be that the optimum stopping set is convex. We further extend the concavity result of for , where is the affine set containing (to be defined shortly in this section).

Lemma 4

For and any given , the cost of continuing (defined in (15)), , is concave on .

Proof:

The essence of the proof is the same as that in [25, Lemma 1]. From (15) we easily see that is an affine function of , and hence , in (16), being the minimum of an affine function and a constant, is concave. The proof then follows by induction. The induction hypothesis is that for some stage , is concave. Hence it can be expressed as an infimum over some collection of affine functions. The inductive step then shows that can also be similarly expressed as an infimum over some collection of affine functions. Hence and (using (16)) are concave. A formal proof is given in Appendix B-A.

The following corollary is a straightforward application of the above lemma.

Corollary 1

For and any given , is a convex set.

Proof:

From Lemma 4 we know that is a concave function of . Hence (see Definition 4), being a super level set of a concave function, is convex [29].
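The convexity argument can be checked numerically on a toy instance. The sketch below is illustrative only: it models a concave continuing cost as a minimum of affine functions of the belief (the representation used in the proof of Lemma 4), with made-up coefficients, and checks that mixtures of beliefs in the super-level set stay in the set. All function names are hypothetical.

```python
import random
from typing import List, Sequence, Tuple

AffinePiece = Tuple[Sequence[float], float]  # (slope vector, offset)


def continuing_cost(p: Sequence[float], pieces: Sequence[AffinePiece]) -> float:
    """Concave function of the belief p: a minimum of affine functions."""
    return min(sum(a_i * p_i for a_i, p_i in zip(a, p)) + d for a, d in pieces)


def in_stopping_set(p: Sequence[float], stop_cost: float,
                    pieces: Sequence[AffinePiece]) -> bool:
    """Stop iff the stopping cost does not exceed the continuing cost."""
    return stop_cost <= continuing_cost(p, pieces)


def random_pmf(dim: int, rng: random.Random) -> List[float]:
    """Draw a random point of the probability simplex."""
    x = [rng.random() for _ in range(dim)]
    total = sum(x)
    return [v / total for v in x]


def check_convexity(stop_cost: float, pieces: Sequence[AffinePiece],
                    dim: int, trials: int = 2000, seed: int = 0) -> bool:
    """Return False if some mixture of two stopping beliefs leaves the set."""
    rng = random.Random(seed)
    for _ in range(trials):
        p, q = random_pmf(dim, rng), random_pmf(dim, rng)
        if in_stopping_set(p, stop_cost, pieces) and in_stopping_set(q, stop_cost, pieces):
            lam = rng.random()
            mix = [lam * a + (1.0 - lam) * b for a, b in zip(p, q)]
            # Small tolerance guards against floating-point round-off.
            if continuing_cost(mix, pieces) < stop_cost - 1e-12:
                return False
    return True


# Made-up affine pieces on a three-point belief space.
EXAMPLE_PIECES = [([1.0, 0.2, -0.5], 0.0),
                  ([0.3, 0.3, 0.3], 0.1),
                  ([-0.4, 0.8, 0.1], 0.2)]
```

Calling `check_convexity(0.0, EXAMPLE_PIECES, dim=3)` finds no violating mixture, as expected of a super-level set of a concave function.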

In the next section while proving an inner bound for the stopping set , we will identify a set of points that could lie outside the probability simplex . We can obtain a better inner bound if we extend the concavity result to the affine set,

where , i.e., in the vectors sum to one, but we do not require non-negativity of the vectors. This can be done as follows. Define using (9) for every . Then as a function of , is the extension of from to . Similarly, for every , define and using (15) and (16). These are the extensions of and respectively. Then, using the same proof technique as in Lemma 4, we obtain the following corollary.

Corollary 2

For , and any given , is concave on the affine set .

Using the above corollary, can be written as,

(18)

V-B Inner Bound on the Optimum Stopping Set

We have shown that the optimum stopping set is convex. In this section, we will identify points that lie along certain edges of the simplex . The convex hull of these points will yield an inner bound to the optimum stopping set. This will first require us to prove the following lemma, referred to as the Two-points Lemma, which is a generalization of the One-point Lemma (Lemma 3). It gives the optimal cost, , at stage when is such that it places all its mass on and on some , i.e., . Throughout this and the next section (on an outer bound) is fixed and hence, for ease of presentation, we drop from the notations , and (to appear in these sections later). However, it is understood that these thresholds are, in general, functions of .

Lemma 5 (Two-points)

For if is such that , where then,

Proof:

Using (15) we can write,

For given as in the hypothesis, the belief in the next state is such that . Using this observation, Lemma 3 (One-point), and the definition of in (17), we obtain the desired result.

Discussion of Lemma 5: The Two-points Lemma (Lemma 5) can be used to obtain certain threshold points in the following way. When has mass only on and on some , , then using Lemma 5, the continuing cost can be written as a function of as,

(19)

From Lemma 2, it follows that in (19) is a decreasing function of . Let and be pmfs in with mass only on and respectively. These are two of the corner points of the simplex (as an example, Fig. 2 illustrates the simplex and the corner points for stage . With at most two more nodes to go, is a two dimensional simplex in . , and are the corner points of this simplex).

Fig. 2: Probability simplex, , at stage . A belief state at stage is a pmf on the points , and (i.e., no-more, one-more and two-more relays to go, respectively). Thus is a two dimensional simplex in .

At stage as we move along the line joining the points and (Fig. 3 illustrates this as going from to ), the cost of continuing in (19) decreases and there is a threshold below which it is optimal to transmit and beyond which it is optimal to continue. The value of this threshold is that value of in (19) at which the continuing cost becomes equal to . Let denote this threshold value; then

The cost of continuing in (19) as a function of , along with the stopping cost, , is shown in Fig. 3. The threshold is the point of intersection of these two cost functions. The value of the continuing cost at is . Note that in the case when , the threshold will be greater than , in which case it is optimal to stop for any on the line joining and .
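The intersection described above can be computed directly once the two endpoint costs are known. The following sketch assumes, as stated for (19), that the continuing cost is linear and decreasing in the mass `q` placed on the corner with more relays to go. The argument names `cost_at_zero`, `cost_at_one` and `stop_cost` are hypothetical labels for the continuing cost at the two corners and for the stopping cost.

```python
def edge_threshold(cost_at_zero: float, cost_at_one: float,
                   stop_cost: float) -> float:
    """Mass q at which the linear continuing cost equals the stopping cost.

    The continuing cost along the edge is modelled as
        (1 - q) * cost_at_zero + q * cost_at_one,
    which decreases in q when cost_at_zero > cost_at_one.
    A returned value above 1 means stopping is optimal on the whole edge.
    """
    if not cost_at_zero > cost_at_one:
        raise ValueError("continuing cost must decrease along the edge")
    return (cost_at_zero - stop_cost) / (cost_at_zero - cost_at_one)


def optimal_to_stop(q: float, cost_at_zero: float, cost_at_one: float,
                    stop_cost: float) -> bool:
    """Stop iff the stopping cost does not exceed the continuing cost at q."""
    continuing = (1.0 - q) * cost_at_zero + q * cost_at_one
    return stop_cost <= continuing
```

For example, with endpoint costs 1.0 and -1.0 and stopping cost 0.0 the threshold is 0.5: beliefs with `q` at or below 0.5 lead to stopping, larger ones to continuing. With endpoint costs 1.0 and 0.5 the threshold is 2.0, so stopping is optimal for every `q` in [0, 1].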

Fig. 3: Depiction of the thresholds . in Equation (19) is plotted as a function of . Also shown is the constant function