Value of Information Aware Opportunistic Duty Cycling in Solar Harvesting Sensor Networks
Energy-harvesting Wireless Sensor Networks (WSNs) may operate perpetually with the extra energy supplied by ambient natural sources, such as solar energy. Nevertheless, the harvested energy is still limited, so it cannot support perpetual network operation with a full duty cycle. To achieve perpetual operation and to process the data of high importance, measured by Value of Information (VoI), the network has to operate under a partial duty cycle and use the harvested energy more efficiently. The challenge is how to deal with the stochastic nature of natural energy and the variable VoI of the data. We consider the energy consumed in storing energy, as well as the diversity of data processing, including sampling, transmitting and receiving, which consume different power levels. The problem is then mapped to a budget-dynamic Multi-Arm Bandit (MAB) problem by treating the energy as the budget and the data processing actions as arm pulls. This paper proposes an Opportunistic Duty Cycling (ODC) scheme that improves energy efficiency while sustaining perpetual network operation. ODC chooses the proper opportunities to store the harvested energy or to spend it on data processing, based on the historical information of energy harvesting and the VoI of the processed data. With this scheme, each sensor node needs only to estimate the ambient natural energy over the short term, which reduces the computation and storage needed for historical information. Each node can also adjust its own duty cycle in a distributed manner according to its local historical information. This paper also conducts an extensive analysis of the performance of ODC, and the theoretical results bound the regret, i.e., the difference between the optimal scheme and ours. Our experimental results also demonstrate the promising performance of ODC.
Wireless Sensor Networks (WSNs) have achieved great success in a variety of critical applications in recent years. One common constraint that impedes their wider application is the limited energy supply. To extend the network lifetime or even to support perpetual network operation, two major techniques have been applied separately to WSNs: energy harvesting and duty cycling. Energy harvesting supplies the sensor node with extra energy from ambient energy resources, while duty cycling saves energy so as to extend the network lifetime. However, the tiny energy-harvesting module in solar sensor networks usually cannot harvest enough energy to support the network with a full duty cycle. Some existing works combine energy harvesting and duty cycling to achieve permanent network operation, i.e., to meet energy neutral operation. These works estimate the amount of active time for a period in advance, for example at the initialization phase of the period, or the average amount of active time for several periods over a long duration, such as a season. However, several facts are ignored by the existing works.
1) Imperfect charge efficiency. In practice, the charge efficiency of the battery in a solar-powered sensor node is often less than 75%, which means that at least 25% of the energy is indirectly wasted if the stored energy is used. The alternative storage element, the capacitor, suffers from high leakage.
2) Variable data importance. Data redundancy is a common phenomenon in WSNs. If the Value of Information (VoI) is introduced for the processed data (in this paper, data processing, or to process data, means sampling, receiving and transmitting data), more important data has higher VoI. Processing the more important data yields higher energy efficiency.
3) Random natural energy. Some natural energy sources, such as solar or wind energy, have been shown to be random, so it is hard to accurately predict the future energy profile over the long term because of unpredictable weather and other disturbances.
Section II provides detailed technical evidence and examples for the above observations. We find that improving the efficiency of exploiting ambient energy is still an open problem.
Notice that the energy lost to imperfect charge efficiency can be reduced if the harvested energy is used directly rather than stored in the battery. Considering data importance, the sensor node can choose the right moments to process data and to harvest energy so as to improve the energy efficiency, which this paper defines as the average VoI obtained per unit of energy consumed. To do this, we propose the Opportunistic Duty Cycling (ODC) scheme, which captures three features: the dynamic profile of energy harvesting, the variable VoI of the data, and the ease of estimating the harvested energy over the short term. ODC also considers the diversity of data processing, which includes three actions: data sampling, transmitting and receiving. These actions consume different power levels and therefore strongly affect energy efficiency. This paper then maps opportunistic duty cycling to a gambling game, the Multi-Arm Bandit (MAB). In this game, the sensor node is treated as the gambler. The gambler decides its next action (sampling, receiving, transmitting or storing energy) step by step, based on its estimates of the harvested energy and of the VoI of the data to be processed in the following time slots.
In real applications of energy-harvesting WSNs, data processing and energy harvesting are highly dynamic. Under the MAB game, each sensor node determines its next state from its short-term historical information, which lets it cope with this dynamic behavior. The goal of the game is to maximize the energy efficiency of each sensor node. To achieve this goal, each sensor node should carefully decide its next action while adhering to its energy constraint. Notice that meeting energy neutral operation and improving energy efficiency usually conflict when adjusting the duty cycle: the former requires each sensor node to shorten its duty cycle, while the latter requires a longer one to collect as much overall VoI as possible. To achieve this bi-criteria objective, this paper adjusts a VoI threshold according to historical information.
Contributions. The contributions of this paper include:
1) This paper adjusts the duty cycle by considering the imperfect charge efficiency and the VoI of the data while meeting energy neutral operation. We map this new duty cycling problem to the budget-dynamic MAB problem. To the best of our knowledge, this is the first work to formulate and study this problem.
2) This paper designs the ODC scheme to achieve the bi-criteria objective. With ODC, each sensor node can determine, in a distributed manner, the action to take in the next time slot by running the MAB with the previous rewards and harvested energy. An algorithm, also called ODC, is designed to implement the scheme. We theoretically analyze the performance of ODC by measuring its regret, i.e., the difference between the optimal scheme and ODC.
3) Extensive experiments are conducted to evaluate the performance of our scheme. Because the optimal scheme is hard to find, we use two baseline approaches in the experiments: a Centralized and Off-line duty cycling Algorithm (COA) and a Simple Duty Cycling (SDC) scheme. COA has complete knowledge of the natural energy and the data VoI in advance. SDC predicts the energy to be harvested and calculates the duty cycle in advance, following the algorithm given in the cited reference. The experimental results show that the average energy efficiency achieved by our scheme is only 16.02% lower than that of COA and 69.09% higher than that of SDC.
Road map. The rest of the paper is organized as follows. Section II describes the motivation based on our preliminary experiments, and Section III formulates the opportunistic duty cycling problem. Section IV maps the problem to the budget-dynamic MAB problem and presents ODC, Section V analyzes its performance, and Section VI discusses the experimental results. Section VII reviews related work on energy harvesting modules and duty cycling schemes for WSNs, and Section VIII concludes the paper.
II Preliminary Experiments and Motivation
This work is motivated by the following observations. First, the inherent hardware properties of the energy harvesting module lead to a time-varying charge efficiency. In practice, the average charge efficiency of the battery in a solar-powered sensor node is often less than 75%. Second, random environmental factors, such as the shadow of clouds, can also decrease the charge efficiency. Third, the data VoI varies over time and differs among nodes. These observations render existing duty cycling schemes unsuitable and motivate us to design a new duty cycling scheme.
II-A Dynamic Energy Harvesting and Storage
Unpredictable environmental factors make the energy profiles differ among sensor nodes, as illustrated in Figure 1. The experimental results in Figure 1(a) indicate that the same sensor node usually has different energy profiles on different days, even under similar weather conditions. Moreover, the energy profiles of different sensor nodes vary greatly within one day because of their different locations, as shown in Figure 1(b). Similar phenomena were also observed in previous works. Some works model solar energy harvesting as a first-order Markov random process.
The time at which harvested energy is consumed or stored has a great impact on energy efficiency. Because of the imperfect charge efficiency, denoted by $\eta$, the actual stored energy $E_s$ relates to the harvested energy $E_h$ as $E_s = \eta E_h$ for some charge efficiency $0 < \eta < 1$. The solar panels in most existing solar modules, such as SolarMote, Prometheus and AmbiMax, have a rated current of about 20 mA. Meanwhile, the working current of a sensor node such as TelosB is about 20 mA for receiving and about 19 mA or more for transmission. If the sensor node powers its antenna directly with the harvested energy (20 mA), the antenna works normally. If instead the node stores the harvested energy at 20 mA, the actual stored energy is $0.75 \times 20 = 15$ mA given $\eta = 0.75$ (a supply voltage is required for normal operation, such as 3 V for TelosB and MICA nodes; this paper ignores the voltage for simplicity and therefore expresses power in mA), which means that 5 mA of harvested energy is wasted. The power of the stored energy is thus too low to support the normal operation of the sensor node.
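The arithmetic in this paragraph can be reproduced in a few lines. This is a minimal sketch that, like the text, ignores voltage and works in mA; the 20 mA panel current, the 20 mA and 19 mA TelosB currents and the charge efficiency of 0.75 are the figures quoted above, while the function names are hypothetical.

```python
def stored_power(harvested_ma: float, eta: float) -> float:
    """Power that actually ends up in the battery after charging losses."""
    return eta * harvested_ma


def can_operate(supply_ma: float, required_ma: float) -> bool:
    """A node (or its antenna) works normally only if the supply meets the demand."""
    return supply_ma >= required_ma


HARVESTED_MA = 20.0          # rated current of the solar panel
ETA = 0.75                   # charge efficiency used in the example
RX_MA, TX_MA = 20.0, 19.0    # TelosB receive / transmit currents quoted above

stored = stored_power(HARVESTED_MA, ETA)
wasted = HARVESTED_MA - stored
print(f"stored={stored} mA, wasted={wasted} mA")
print("direct use supports receiving:", can_operate(HARVESTED_MA, RX_MA))
print("stored energy supports transmitting:", can_operate(stored, TX_MA))
```

Using the harvested current directly keeps the node above its working current, while routing the same current through the battery drops it below even the transmit requirement.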
II-B VoI of Data
The limited harvested energy compels each sensor node to preferentially process data with high VoI. Following information theory, the importance of data can be indicated by its VoI, denoted by $V$. The Kullback-Leibler (KL) divergence can be used to calculate the VoI by quantifying the difference between two probability distributions $P$ and $Q$ as follows:

$$V = D_{KL}(P \,\|\, Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)} \qquad (1)$$
With the concept of VoI, the sensor node chooses the important data (i.e., data with high VoI) to process. The number of data processing operations can then be reduced, so much energy can be saved while the overall VoI is preserved. For example, when the sampling of luminous intensity is reduced from that in Figure 1(b) to that in Figure 3, about 92% of the energy is saved while the loss of overall VoI is kept under 5%.
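The two ideas in this subsection can be sketched together: computing VoI as a discrete KL divergence, and processing only the samples whose VoI passes a threshold. This is a minimal sketch under stated assumptions: the two distributions are taken to be probability vectors over the same outcomes, every processed sample is assumed to cost the same energy, and the distributions, trace and threshold below are hypothetical. It does not reproduce the 92% and 5% figures above.

```python
import math
from typing import List, Sequence, Tuple


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """Discrete D_KL(P || Q) in nats; terms with p(x) == 0 contribute nothing."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same support")
    total = 0.0
    for px, qx in zip(p, q):
        if px == 0.0:
            continue
        if qx == 0.0:
            return math.inf  # P puts mass where Q has none
        total += px * math.log(px / qx)
    return total


def filter_by_voi(vois: List[float], threshold: float) -> Tuple[float, float]:
    """Process only samples with VoI >= threshold.

    Returns (energy_saved_fraction, voi_retained_fraction), assuming each
    processed sample costs one unit of energy.
    """
    if not vois:
        raise ValueError("need at least one sample")
    kept = [v for v in vois if v >= threshold]
    energy_saved = 1.0 - len(kept) / len(vois)
    total_voi = sum(vois)
    voi_retained = sum(kept) / total_voi if total_voi > 0 else 1.0
    return energy_saved, voi_retained


prior = [0.25, 0.25, 0.25, 0.25]      # hypothetical belief before a sample
posterior = [0.70, 0.10, 0.10, 0.10]  # hypothetical belief after it
print(kl_divergence(posterior, prior))

trace = [0.01] * 18 + [2.0, 3.0]      # hypothetical: mostly redundant samples
saved, retained = filter_by_voi(trace, threshold=0.5)
print(f"energy saved: {saved:.0%}, VoI retained: {retained:.1%}")
```

A larger divergence means the sample shifts the belief more, so it is worth more of the limited energy budget.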
II-C Call for Online Energy Allocation
Since both data processing and energy harvesting are random processes, each sensor node should decide online how to allocate the harvested energy. The example in Figure 3 illustrates why online energy allocation, which carefully schedules energy consumption, is needed to maximize the overall VoI. In this example, the sensor node can harvest 20 mA of energy in the time slots marked with a "white" solar status and cannot harvest energy in the "black" time slots. Suppose that the node requires at least 20 mA to support its normal operation in each time slot, and that the charge efficiency is $\eta = 0.75$. At time slot $t_1$, the node uses the harvested 20 mA directly to process the first data, with 20 units of VoI. At $t_2$, the node has two choices. Under the first choice, the node uses the energy harvested at $t_2$ to process the second data and obtains 10 units of VoI. At $t_3$, it stores the harvested 20 mA and obtains 15 mA of stored energy because $\eta = 0.75$. At $t_4$, it cannot process data since the stored energy is insufficient. Under the second choice, the node stores the 40 mA harvested at $t_2$ and $t_3$ and obtains 30 mA of stored energy. It then processes the second data at $t_4$ with the stored energy. Obviously, the second choice results in higher energy efficiency, i.e., more VoI per unit energy, than the first one.
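The example can be replayed with a small slot-by-slot simulator. It is a sketch under the assumptions of the example (four slots, 20 mA harvested in the first three and none in the fourth, a 20 mA demand per active slot, charge efficiency 0.75); the action names are hypothetical, and since the VoI of the second data when it is processed in the fourth slot is not stated in the text, it is left as a parameter (`v4` below is a made-up value).

```python
from typing import List, Tuple

DEMAND_MA = 20.0  # energy needed to process data in one slot
ETA = 0.75        # charge efficiency used in the example


def simulate(harvest: List[float], actions: List[str], vois: List[float],
             eta: float = ETA, demand: float = DEMAND_MA) -> Tuple[float, float]:
    """Replay a schedule slot by slot and return (total VoI, energy consumed).

    "process": use the harvested energy if it meets the demand, otherwise fall
    back to the battery; "store": charge the battery with eta * harvested energy.
    """
    battery, total_voi, consumed = 0.0, 0.0, 0.0
    for h, act, v in zip(harvest, actions, vois):
        if act == "store":
            battery += eta * h
        elif act == "process":
            if h >= demand:            # run directly on harvested energy
                total_voi += v
                consumed += demand
            elif battery >= demand:    # run on stored energy
                battery -= demand
                total_voi += v
                consumed += demand
            # otherwise there is not enough energy and the data is not processed
        else:
            raise ValueError(f"unknown action {act!r}")
    return total_voi, consumed


harvest = [20.0, 20.0, 20.0, 0.0]   # white, white, white, black
v4 = 15.0                           # hypothetical VoI of the second data at slot 4
choice1 = simulate(harvest, ["process", "process", "store", "process"],
                   [20.0, 10.0, 0.0, v4])
choice2 = simulate(harvest, ["process", "store", "store", "process"],
                   [20.0, 0.0, 0.0, v4])
print(choice1, choice2)
```

Under this accounting both choices consume 40 mA. The first collects 30 units of VoI, and the second collects 20 + `v4`, so storing first pays off whenever the delayed data is worth more than the 10 units it would have yielded in the second slot.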
II-D Opportunistic Duty Cycling
From the above facts, we find that both data processing and energy harvesting are highly dynamic. Energy efficiency can be greatly improved by waking the sensor node to process data, and letting it hibernate to store energy, at the proper moments. These facts motivate us to propose a novel opportunistic duty cycling scheme, under which sensor nodes catch the right opportunities to process data or to store the harvested energy. Existing works on duty cycling adjust only the duty cycle, i.e., roughly the ratio of the active time to the period, as shown in Figure 5. Opportunistic duty cycling also considers which slots are active, as in the example in Figure 5, where the period consists of 8 slots. The two cases in Figure 5 have the same duty cycle but different sets of active slots. The reason for adjusting the duty cycle in this way is that being active in different slots may result in different energy efficiency. The goal of opportunistic duty cycling is to adjust both the duty cycle and the moments of activity so that energy efficiency improves under the constraint of energy neutral operation.
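To make the distinction concrete, the sketch below encodes a schedule over an 8-slot period as 0/1 flags and compares two schedules with the same duty cycle. The per-slot VoI values and the unit energy cost per active slot are hypothetical; the point is only that the choice of active slots, not just their number, changes the VoI per unit energy.

```python
from typing import List

SLOT_COST = 1.0  # hypothetical energy cost of one active slot


def duty_cycle(schedule: List[int]) -> float:
    """Fraction of slots in the period during which the node is active."""
    return sum(schedule) / len(schedule)


def efficiency(schedule: List[int], slot_voi: List[float]) -> float:
    """VoI collected per unit energy consumed by the active slots."""
    active = [v for on, v in zip(schedule, slot_voi) if on]
    if not active:
        return 0.0
    return sum(active) / (len(active) * SLOT_COST)


slot_voi = [1.0, 1.0, 6.0, 1.0, 1.0, 5.0, 1.0, 1.0]  # hypothetical VoI per slot
case_a = [1, 1, 0, 0, 0, 0, 0, 0]  # active in the first two slots
case_b = [0, 0, 1, 0, 0, 1, 0, 0]  # active in the two high-VoI slots
print(duty_cycle(case_a), efficiency(case_a, slot_voi))
print(duty_cycle(case_b), efficiency(case_b, slot_voi))
```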
Most symbols used in this paper are summarized in Table I.
III System Model and Problem Formulation
III-A Network and Energy Model
Consider a network with a sink and a set of sensor nodes, in which each node is assumed to have at least one stable route to the sink. A period consists of a fixed number of time slots. Each node is equipped with a micro-scale energy-harvesting module, and its antenna works in half-duplex mode, so it cannot receive and transmit data at the same time. Each node is also equipped with one battery to store energy, which holds some initial energy. Because of hardware limitations, the battery cannot support the operation of the sensor node while it is being charged by the energy-harvesting module. Meanwhile, the power of the micro-solar panel is too low to support the normal operation of the sensor node and battery charging simultaneously most of the time, as the experimental results in Figure 1 show. We thus assume that the limited harvested power cannot support the normal operation of the sensor and the antenna simultaneously.
For each sensor node, different power levels are required for data sampling, receiving, transmitting and storing harvested energy (see Table I for the symbols). Two of these power levels are constant and identical over all sensor nodes. The VoI of the data is measured by Equation (1). The energy harvested by a single sensor node in each time slot over a period can be modelled as a first-order stationary Markov process, and the data process is modelled in the same way. Each solar panel can support its node's normal operation, or charge its node's battery, if and only if its harvested power exceeds a threshold. An indicator equals 1 if the harvested power exceeds this threshold, and 0 otherwise.
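The energy model can be sketched as a discrete first-order Markov chain over a few harvesting levels, together with the threshold indicator. The three levels, the transition matrix and the threshold value below are hypothetical and are not taken from the paper's measurements.

```python
import numpy as np

LEVELS_MA = np.array([0.0, 10.0, 20.0])   # hypothetical harvested power levels
TRANSITIONS = np.array([                  # row i: P(next level | current level i)
    [0.80, 0.15, 0.05],
    [0.20, 0.60, 0.20],
    [0.05, 0.15, 0.80],
])
THRESHOLD_MA = 15.0                       # hypothetical usable-power threshold


def simulate_harvest(n_slots: int, start: int, rng: np.random.Generator) -> np.ndarray:
    """Sample a harvested-power trace from the first-order Markov chain."""
    states = np.empty(n_slots, dtype=int)
    states[0] = start
    for t in range(1, n_slots):
        # The next level depends only on the current level (Markov property).
        states[t] = rng.choice(len(LEVELS_MA), p=TRANSITIONS[states[t - 1]])
    return LEVELS_MA[states]


def usable(trace: np.ndarray, threshold: float = THRESHOLD_MA) -> np.ndarray:
    """Indicator per slot: 1 if the harvested power exceeds the threshold, else 0."""
    return (trace > threshold).astype(int)


rng = np.random.default_rng(0)
trace = simulate_harvest(48, start=2, rng=rng)
print(trace[:8], usable(trace[:8]))
```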
III-B Opportunistic Duty Cycling Problem
Opportunistic duty cycling can be formalized as an optimization problem. The goal of ODC is to maximize the overall VoI collected at the sink, as given in Equation (2), while satisfying energy neutral operation under the randomness of energy harvesting, as given in Equation (3).
where the objective sums the VoI received by the sink in each time slot. In the time slots of the sampling, receiving and transmitting sets, the sensor node samples, receives and transmits data, respectively. In the time slots of the storing set, the node stores the harvested energy into its battery, and thus the harvesting indicator equals 1 in every slot of this set. To maintain perpetual operation, the consumed energy should be less than the harvested energy.
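A minimal sketch of how a candidate schedule could be checked against the constraints of this subsection: the four action sets must not share slots (the disjointness condition stated in the next paragraph), storing is only possible in slots whose harvested power exceeds the threshold, and the energy consumed must not exceed the energy available. Because the exact equations are not reproduced here, the accounting rule (harvest used directly in active slots, 0.75 times the harvest in storing slots, plus an initial reserve), the 5 mA sampling power and all names are assumptions; the 20 mA and 19 mA figures are the TelosB values quoted in Section II.

```python
from typing import Dict, List, Set

# Hypothetical per-action power draw in mA (voltage ignored, as in the text).
POWER = {"sample": 5.0, "receive": 20.0, "transmit": 19.0}
ETA = 0.75  # charge efficiency


def check_schedule(sets: Dict[str, Set[int]], harvest: List[float],
                   threshold: float, initial: float = 0.0) -> List[str]:
    """Return the violated constraints (an empty list means feasible).

    sets maps "sample", "receive", "transmit" and "store" to slot indices.
    """
    problems: List[str] = []
    names = list(sets)
    for i, a in enumerate(names):                 # the sets must be disjoint
        for b in names[i + 1:]:
            if sets[a] & sets[b]:
                problems.append(f"{a} and {b} overlap")
    stored_slots = sets.get("store", set())
    for t in sorted(stored_slots):                # charging needs usable power
        if harvest[t] <= threshold:
            problems.append(f"cannot store in slot {t}")
    active = {a: s for a, s in sets.items() if a in POWER}
    consumed = sum(POWER[a] * len(s) for a, s in active.items())
    # Assumed accounting: harvest is used directly in active slots and
    # stored with efficiency ETA in storing slots.
    available = initial + sum(ETA * harvest[t] for t in stored_slots)
    available += sum(harvest[t] for s in active.values() for t in s)
    if consumed > available:
        problems.append("energy-neutral condition violated")
    return problems


harvest = [20.0, 20.0, 20.0, 0.0]
print(check_schedule({"sample": {0}, "store": {1, 2}, "transmit": {3}}, harvest, 15.0))
```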
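According to the assumptions in Subsection III-A, the antenna is half-duplex, so the receiving and transmitting sets have no common element. Moreover, the four sets (sampling, receiving, transmitting and storing) have no common element because of the limited hardware and harvested energy. Denoting them by $\mathcal{S}_s$, $\mathcal{S}_r$, $\mathcal{S}_t$ and $\mathcal{S}_c$, the four sets thus satisfy the following condition:

$$\mathcal{S}_a \cap \mathcal{S}_b = \emptyset, \quad \forall\, a \neq b,\ a, b \in \{s, r, t, c\} \qquad (4)$$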
IV Opportunistic Duty Cycling
This section formulates opportunistic duty cycling as a budget-dynamic MAB problem and then presents our duty cycling scheme, ODC.
IV-A Budget-dynamic MAB Problem
Let us look into the detailed process of opportunistic duty cycling in energy-harvesting WSNs. Each node has two ways to deal with the harvested energy: consuming it or storing it. Storing the energy incurs some loss because the charge efficiency is imperfect (less than 1). Otherwise, the node spends the harvested energy on data processing. When there is no energy to harvest, the node must either spend the energy in its battery on data processing or sleep and thus lose the chance to process data. Hence, at each time slot, each node has to choose one of four actions, namely sampling, receiving or transmitting data, or storing energy (i.e., sleeping), as shown in Figure 6(a), by consuming the harvested or stored energy. To maximize energy efficiency, the node needs to choose the best action by learning from the historical information of energy harvesting and data processing. Since energy harvesting and data processing are Markov processes, the conditional probability (given the historical information) that the harvested energy and the VoI of the data are at certain levels at the beginning of a slot is a sufficient statistic for designing the optimal action in that slot. Each node thus need not record long historical information, and can estimate the VoI for the next time slot by counting how often the harvested power and the data VoI were at certain levels during the previous few time slots.
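A minimal sketch of the short-term estimation described above: a node keeps only the last few discretized observations (of harvested power or VoI level) and estimates the conditional probability of the next level from transition counts within that window. The window length, the level discretization, the add-one (Laplace) smoothing and the class name are assumptions, not details given in the paper.

```python
from collections import deque
from typing import Deque, List


class ShortTermMarkovEstimator:
    """Estimate P(next level | current level) from a sliding window of levels."""

    def __init__(self, n_levels: int, window: int):
        self.n_levels = n_levels
        self.history: Deque[int] = deque(maxlen=window)  # bounded memory

    def observe(self, level: int) -> None:
        if not 0 <= level < self.n_levels:
            raise ValueError("level out of range")
        self.history.append(level)  # the oldest observation drops out automatically

    def next_distribution(self) -> List[float]:
        """Smoothed transition probabilities out of the most recent level."""
        if not self.history:
            return [1.0 / self.n_levels] * self.n_levels
        counts = [1.0] * self.n_levels  # add-one smoothing
        current = self.history[-1]
        seq = list(self.history)
        for prev, nxt in zip(seq, seq[1:]):
            if prev == current:  # only transitions out of the current level matter
                counts[nxt] += 1.0
        total = sum(counts)
        return [c / total for c in counts]


est = ShortTermMarkovEstimator(n_levels=3, window=6)
for level in [2, 2, 1, 2, 2, 2]:   # hypothetical discretized power levels
    est.observe(level)
print(est.next_distribution())     # smoothed P(next level | level 2)
```

Because the deque is bounded, both memory and computation per slot stay constant, which matches the short-term estimation the scheme relies on.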
If the sensor node is treated as the gambler, the harvested energy as the gambler's budget, and the four actions as the four arms of the bandit machine, as shown in Figure 6, opportunistic duty cycling can be formulated as a budget-dynamic MAB problem. Pulling the four arms corresponds to the four actions: data receiving, sampling, transmitting and energy storing. In the MAB problem, the gambler pulls one of the bandit machine's arms at some cost to its budget, and the machine returns some reward each time. For simplicity, we take the VoI of the processed data as the reward. For example, if the node receives a data item whose VoI is $v$, the reward returned to the node is $v$. The goal of the gambler is to maximize the overall reward under its budget constraint through a series of arm pulls. In this paper, the harvested energy, i.e., the budget, is dynamic, so the problem is a new variation of the classical stochastic MAB problem, which we call the budget-dynamic MAB problem. By mapping opportunistic duty cycling to the MAB problem, the goal of maximizing energy efficiency becomes equivalent to maximizing the reward for a given budget.
Since each sensor node is treated as one gambler in the MAB problem, the solution is implemented in a distributed manner. The challenge is to prove that the distributed scheme guarantees global maximization of the overall VoI. Recall that the goal is to maximize the overall VoI of the processed data, as given in Equation (2). Thus, the straightforward idea is to maximize the VoI of the data processed by each node through data sampling, receiving and transmitting, and we denote the VoI obtained by each of these three actions separately. Meanwhile, the overall VoI of the data received by the sink can be maximized only if each node transmits as much of its sampled or received data as possible to its next-hop neighbors. In what follows, we state this idea for a more general case than that given in Equation (4); the case in Equation (4) is covered by the following statements. Consider the overall VoI of the data remaining in a node's memory at the end of each time slot. Recall that each node cannot receive and transmit data simultaneously, as required by Equation (4). When the node takes the transmitting action in a time slot, its remaining VoI at the end of that slot satisfies the following balance, which also accounts for the VoI of the data sampled in that slot:
Similarly, we have the following equation when the node takes the receiving action.
where the additional term is the VoI of the data received in that time slot. The sampled and received VoI terms may be zero, since the node may instead take the data transmitting or energy storing action. In the special case where only one of the four per-slot terms can be greater than zero, Equations (5) and (6) satisfy the constraints in Equation (4).
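One plausible reading of the balance in Equations (5) and (6), sketched in Python: the VoI held in memory at the end of a slot equals the VoI held at the end of the previous slot, plus the VoI sampled or received in the slot, minus the VoI transmitted in it. The exact form of the equations is not reproduced in this text, so this bookkeeping is an assumption; the rule that transmitted data must come from what was already stored follows the proof of Theorem 1, and the function name is hypothetical.

```python
def update_memory(prev_voi: float, sampled: float = 0.0, received: float = 0.0,
                  transmitted: float = 0.0) -> float:
    """VoI remaining in memory at the end of a slot (assumed balance).

    Half-duplex: a slot cannot both receive and transmit. Data sampled or
    received in this slot can only be sent in a later slot, so the VoI
    transmitted now must come from what was already stored (prev_voi).
    """
    if received > 0 and transmitted > 0:
        raise ValueError("half-duplex antenna: cannot receive and transmit in one slot")
    if transmitted > prev_voi:
        raise ValueError("cannot transmit more VoI than is already stored")
    return prev_voi + sampled + received - transmitted


r = 0.0
r = update_memory(r, sampled=0.5)                    # sampling slot
r = update_memory(r, received=0.25)                  # receiving slot
r = update_memory(r, sampled=0.125, transmitted=0.5) # general case: sample and send
print(r)
```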
Recall from Section III-A that each node has at least one route to the sink. Consider the layers of nodes that are k hops away from the sink, for k ≥ 1. The overall reward of the whole network over a period is the sum, over all nodes and all time slots, of the VoI of the data each node transmits in each slot. The following theorem shows that the overall VoI received by the sink can be maximized by maximizing the overall reward of each single node; in this way, the overall reward of the whole network is decomposed into the rewards of the individual nodes.
Theorem 1. Assume that each node has at least one route to the sink. Then the total reward of all nodes accumulated over the period equals the total reward received by the sink over the same period.
Proof. The intuition is that all data received by the sink must be sent or relayed by intermediate nodes in the network. Suppose that the network starts at the first time slot. Before the network begins to run, no node has received or sampled any data, so the VoI remaining in every node's memory is zero. In an arbitrary time slot, the VoI of the data received by the sink equals the VoI that the one-hop relay nodes transmit in the same slot. That is,
Thus, maximizing the VoI received by the sink in a time slot is equivalent to maximizing the data traffic of each node one hop away from the sink in that slot. According to Equation (5), the right-hand side of the above equation can be rewritten as follows:
Notice that any data sampled or received in a time slot can only be transmitted after that slot, so the transmitted data must come from the data remaining in memory at the end of the previous slot. The last two terms of the above expression therefore do not contribute to the VoI received by the sink in the current slot. Before the current slot, each one-hop node must receive or sample the data and record it in its memory; otherwise, it has no data to transmit. The data that a sensor node chooses to transmit in a slot must have been received or sampled in some earlier slot, and the exact slot in which it was received or sampled does not affect its transmission. To simplify the proof, we can therefore assume that the data a sensor node transmits in a slot was received or sampled in the immediately preceding slot. Meanwhile, the data received by the sensor nodes in one layer must be transmitted by those in the next layer farther from the sink, so we have the following equation:
In the last equality of the above equation, the first term is the sum of the traffic of the sensor nodes in the next layer, which contributes to the VoI of the data received by the sink in the current time slot. In other words, the VoI of each sensor node must be maximized in the preceding slot before the overall VoI can be maximized in the current slot, since the last two terms do not contribute to the VoI received by the sink, according to the statement below Equation (7).
Similarly, the corresponding term in the above equation can be traced back to the sum of the VoI of the data transmitted by the sensor nodes in the layer beyond during the preceding time slot. Therefore, the overall VoI received by the sink over the period can be maximized by maximizing the VoI of the data transmitted by each sensor node in each layer over the series of time slots in the period. ∎
This subsection presents the detailed design of our scheme, ODC. To achieve energy neutral operation, a parameter called the VoI threshold is introduced to control the amount of energy that each sensor node can consume in each time slot. Because the harvested energy is random, the threshold must be updated continuously. The Adaptive VoI Adjustment (AVA) algorithm is designed to update it.
IV-B1 ODC algorithm
Recall that the goal of ODC is to maximize the VoI of each sensor node, i.e., to solve the budget-dynamic MAB problem, so that the overall VoI can be maximized according to Theorem 1. Imagine that taking an action corresponds to placing an item into a knapsack. The expected reward of taking the action equals the item's value, and the energy consumed by the action is the item's weight. The total energy harvested up to a time slot is then the weight capacity of the knapsack in that slot. Therefore, the budget-dynamic MAB problem can be reduced to an unbounded knapsack problem at each time slot. We borrow the idea of the density-ordered greedy algorithm to solve it.
When solving the budget-dynamic MAB problem with the density-ordered greedy algorithm, the key step is to estimate the VoI that each action will obtain in the next time slot, so that the sensor node can take the actions with the highest energy efficiency. Auer introduced the Upper Confidence Bound (UCB) to calculate the estimated VoI of each action. The most popular UCB policy, UCB-1, relies on the upper-bound VoI of taking action $i$, $\hat{\mu}_i(t) + c_i(t)$, where $c_i(t)$ is a padding function. A standard expression of this function is $c_i(t) = B\sqrt{\xi \ln n(t) / n_i(t)}$, where $B$ is an upper bound on the reward (VoI), $\xi$ is an appropriate constant, $n_i(t)$ is the number of times action $i$ has been taken up to slot $t$, $n(t)$ is the overall number of actions the sensor node has taken up to slot $t$, and $\hat{\mu}_i(t)$ is the estimate, made at the end of slot $t$, of action $i$'s expected reward for the next slot. To improve energy efficiency, the cost is taken into account by computing the upper-bound VoI per unit cost, $(\hat{\mu}_i(t) + c_i(t))/e_i$, where $e_i$ is the energy consumed by taking action $i$ once. Notice that the energy available in a time slot consists of the energy remaining in the battery plus any energy harvested in that slot. Thus, the unbounded knapsack problem can be formulated as the following problem with a time-dependent energy bound.
where the decision variable is a Boolean indicator that equals 1 if the action is taken in the time slot and 0 otherwise, and each arm has a fixed energy consumption per pull. The constraint in Equation (11) means that the energy consumed in the time slot is bounded by the available energy. The expected reward of each arm can be calculated as the average reward received from pulling that arm so far.
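The cost-normalized UCB index can be sketched as follows, assuming the padding function takes the form B·sqrt(ξ·ln n / n_i) (B: reward upper bound, ξ: exploration constant, n_i: pulls of arm i, n: total pulls); with B = 1 and ξ = 2 this reduces to the familiar UCB1 padding sqrt(2·ln n / n_i). The default values, the rule of giving untried arms an infinite index and the class name are assumptions for illustration.

```python
import math
from typing import List


class CostAwareUCB:
    """Per-arm UCB index divided by the arm's energy cost (a sketch)."""

    def __init__(self, costs: List[float], reward_bound: float = 1.0, xi: float = 2.0):
        if any(c <= 0 for c in costs):
            raise ValueError("every arm needs a positive energy cost")
        self.costs = costs
        self.B = reward_bound
        self.xi = xi
        self.pulls = [0] * len(costs)
        self.means = [0.0] * len(costs)

    def update(self, arm: int, reward: float) -> None:
        """Incrementally update the empirical mean reward (VoI) of an arm."""
        self.pulls[arm] += 1
        self.means[arm] += (reward - self.means[arm]) / self.pulls[arm]

    def index_per_cost(self, arm: int) -> float:
        """(empirical mean + padding) / cost for one arm."""
        n_i = self.pulls[arm]
        if n_i == 0:
            return math.inf  # try every arm at least once
        n = sum(self.pulls)
        padding = self.B * math.sqrt(self.xi * math.log(n) / n_i)
        return (self.means[arm] + padding) / self.costs[arm]


ucb = CostAwareUCB(costs=[1.0, 2.0])   # hypothetical energy costs of two arms
ucb.update(0, 0.5)
ucb.update(1, 0.5)
print([ucb.index_per_cost(i) for i in range(2)])
```

With equal observed VoI, the cheaper arm gets a proportionally higher index, which is exactly the per-unit-cost preference the scheme needs.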
Since the problem defined in Equation (10) is NP-hard, this paper uses the density-ordered greedy method to find a near-optimal selection of the sampling, receiving and transmitting sets, i.e., to find the integer values that maximize Equation (10) (see step 12 in Algorithm 1).
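A minimal sketch of a density-ordered greedy heuristic for the per-slot unbounded knapsack: actions are sorted by value per unit energy (for example, the cost-normalized UCB index) and each is taken as many times as the remaining energy allows before moving to the next. The values, costs and budget are hypothetical, and this generic version may differ in detail from step 12 of Algorithm 1.

```python
from typing import List


def density_greedy(values: List[float], costs: List[float], budget: float) -> List[int]:
    """Return how many times to take each action under an energy budget."""
    if any(c <= 0 for c in costs):
        raise ValueError("costs must be positive")
    counts = [0] * len(values)
    # Highest value per unit energy first.
    order = sorted(range(len(values)), key=lambda i: values[i] / costs[i], reverse=True)
    remaining = budget
    for i in order:
        if values[i] <= 0:
            continue  # actions without positive expected value are never chosen
        take = int(remaining // costs[i])  # as many copies as still fit
        counts[i] = take
        remaining -= take * costs[i]
    return counts


print(density_greedy([3.0, 5.0, 1.0], [2.0, 4.0, 1.0], budget=9.0))
```

Here the densest action (value 3, cost 2) fills most of the budget, and the leftover unit of energy goes to the cheapest action.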
The capacity of the memory is limited, so each sensor node should keep a long-term balance between its output (the transmitted data) and its input (the received and sampled data). In other words, the number of times the node takes the transmitting action, i.e., pulls the transmitting arm, is expected to equal the total number of times it takes the sampling and receiving actions, i.e., pulls the sampling and receiving arms. To do this, we assign a probability to each action. Given the solution to the problem in Equation (10) obtained by the density-ordered greedy method at the current time slot, ODC takes the next action with a probability determined by the following equation (see step 13 in Algorithm 1).
where $K$ is the number of arms of the bandit machine. Notice that an arm with a higher upper-bound VoI has a higher probability in Equation (13), since the greedy solution pulls it more times than the others. ODC is presented in Algorithm 1, and its regret bound is analyzed theoretically in the next section. In this algorithm, the energy consumed in each time slot is tracked; for example, if an arm is pulled in a slot, the energy consumed in that slot equals the cost of pulling that arm once.
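Equation (13) is not reproduced in this text. The sketch below assumes one natural form consistent with the surrounding description: each arm is chosen with probability proportional to the number of times the greedy solution pulls it, normalized over the K arms. This proportional rule, the error raised when the greedy solution selects nothing, and the function names are assumptions.

```python
import random
from typing import List, Optional


def action_probabilities(counts: List[int]) -> List[float]:
    """Assumed form of Eq. (13): p_i = x_i / sum_j x_j over the K arms."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("the greedy solution selects no arm in this slot")
    return [c / total for c in counts]


def choose_action(counts: List[int], rng: Optional[random.Random] = None) -> int:
    """Draw the index of the next arm to pull according to action_probabilities."""
    rng = rng or random.Random()
    probs = action_probabilities(counts)
    return rng.choices(range(len(counts)), weights=probs, k=1)[0]


greedy_counts = [4, 0, 1, 0]   # hypothetical output of the density-ordered greedy step
print(action_probabilities(greedy_counts), choose_action(greedy_counts, random.Random(7)))
```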
The intuitive idea behind AVA is that each sensor node dynamically estimates the VoI threshold for the next time slot according to the energy harvested and consumed in previous time slots. Energy neutral operation requires each sensor node to consume no more energy than it has remaining, while the node should consume as much energy as possible to maximize the total reward over the period. The best choice is therefore to keep the remaining and consumed energy balanced over the period. We define the following function as the metric for finding this balance point.
Consider the VoI threshold updated at each time slot. A proper threshold lets the sensor node minimize the average squared deviation of the harvested energy from the consumed energy, as measured by Equation (14). To find such a threshold, Algorithm AVA adopts adaptive control theory and transforms the threshold-determination problem into a linear-quadratic tracking problem. More formally, this paper argues that a first-order, discrete-time, linear dynamical system with colored noise is suitable for modelling the problem. This system can be described by the following equation:
In this system, the variables denote the output of the system, the control input and zero-mean input noise, and the remaining parameters are real-valued coefficients. The desired behavior of the system is to keep the metric in Equation (14) as small as possible over the period. The optimal control law that minimizes the tracking error is:
The coefficients are not known in advance and can be estimated online using standard gradient descent techniques. First, we define a parameter vector and a feature vector. With these two vectors, the optimal control law in Equation (16) can be expressed compactly as their inner product. The estimated parameter vector for the next time slot is then given by the gradient descent update rule
where is a positive constant step-size parameter.
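The adaptive loop of AVA can be sketched with a certainty-equivalence controller and a normalized gradient-descent (LMS) parameter update. Because Equations (14) to (17) are not reproduced in this text, the sketch assumes a common form of such a system, y(t+1) = a·y(t) + b·u(t) + c·w(t) + w(t+1), whose minimum-tracking-error law satisfies b·u(t) + (a+c)·y(t) - c·y*(t) = y*(t+1). This gives the parameter vector θ = [b, a+c, c] and the feature vector φ(t) = [u(t), y(t), -y*(t)]. The normalization by 1 + ‖φ‖², the initial values, the class name and the squared-deviation helper are all assumptions.

```python
import numpy as np


class AdaptiveVoIThreshold:
    """Sketch of an AVA-style controller; the control u is the VoI threshold."""

    def __init__(self, b0: float = -1.0, step: float = 0.5):
        # b0 < 0 following the text: raising the threshold lowers consumption.
        self.theta = np.array([b0, 0.0, 0.0])  # estimates of [b, a + c, c]
        self.step = step

    def control(self, y: float, y_ref: float, y_ref_next: float) -> float:
        """Certainty-equivalence law: solve theta . [u, y, -y_ref] = y_ref_next for u."""
        b, a_plus_c, c = self.theta
        return (y_ref_next - a_plus_c * y + c * y_ref) / b

    def update(self, u: float, y: float, y_ref: float, y_next: float) -> float:
        """Normalized gradient step on the one-step prediction error; returns the error."""
        phi = np.array([u, y, -y_ref])
        error = y_next - phi @ self.theta
        self.theta = self.theta + self.step * error * phi / (1.0 + phi @ phi)
        return float(error)


def squared_deviation(harvested, consumed) -> float:
    """Average squared gap between harvested and consumed energy over a period."""
    h = np.asarray(harvested, dtype=float)
    c = np.asarray(consumed, dtype=float)
    return float(np.mean((h - c) ** 2))


ctrl = AdaptiveVoIThreshold()
u = ctrl.control(y=2.0, y_ref=1.0, y_ref_next=1.0)
err = ctrl.update(u, y=2.0, y_ref=1.0, y_next=1.4)   # hypothetical observed output
print(u, err, ctrl.theta)
```

Each slot, the node computes the threshold from its current estimates, observes the resulting output, and nudges the estimates toward values that would have predicted it; the initial negative estimate of b encodes the sign argument in the next paragraph.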
Because each sensor node needs to store harvested energy in its battery, the initial energy level should preferably be about half of the battery's full capacity. The initial value of the parameter vector greatly affects the convergence speed of the parameter estimation in Equation (17), so it can be set in advance according to preliminary experimental results. Examining the system in Equation (15), increasing the control results in less data being received or sampled, and hence less energy consumption, so the coefficient of the control should be negative, and its initial estimate is set to a negative value.
Consider a special case in which each sensor node can harvest enough solar energy, so that the harvested energy can support its operation in every time slot. Usually, however, a sensor node cannot harvest sufficient energy, so the VoI threshold prevents it from working in every time slot, i.e., it reserves energy in some time slots. The harvested energy is thus stored and is not consumed completely in every time slot.
IV-C Common Activity
A natural concern is the common active time among neighboring nodes under ODC, which is implemented in a distributed manner. Under Algorithm 1, each node chooses the transmitting and receiving arms with some probability, so in each time slot each node shares active time with its neighbors (i.e., wakes up simultaneously with them) with some probability. This section examines this probability both theoretically and experimentally. If a node can communicate with at least one of its neighbors, we say that it has common active time with its neighbors. Figure 8 illustrates the theoretical probability that neighboring nodes have common active time: given each node's probability of waking up (its active probability), the common active probability can be easily computed, and it is plotted on the y-axis. As Figure 8 shows, the more neighbors a node has, or the higher its probability of waking up, the more likely it is to communicate with a neighbor. Figure 8 also illustrates the experimental results when a node has two neighbors; the experimental setting is given in Section VI. In the experiment, the common active probability tends to 0.22, and the average data VoI obtained by each action tends to about 0.57. Thus, in each time slot, the node retains a certain probability of communicating with its neighbors. Although this probability is not high, the obtained VoI is not low, because the node catches the most important moments to communicate. The next section analyzes the difference between the VoI of the data processed by the optimal solution and by ODC.
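Under the simplifying assumption that nodes wake up independently, a node with active probability p and k neighbors, each active with probability q, shares at least one active slot with a neighbor with probability p·(1 - (1 - q)^k). This closed form and the Monte Carlo check below are a hypothetical model for intuition only; they ignore the transmit and receive roles that Algorithm 1 assigns and are not the computation behind Figure 8.

```python
import random


def common_active_probability(p: float, q: float, k: int) -> float:
    """P(node awake and at least one of k independent neighbors awake)."""
    return p * (1.0 - (1.0 - q) ** k)


def simulate_common_active(p: float, q: float, k: int, trials: int, seed: int = 0) -> float:
    """Monte Carlo estimate of the same probability."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        node_awake = rng.random() < p
        neighbor_awake = any(rng.random() < q for _ in range(k))
        hits += node_awake and neighbor_awake  # bool counts as 0 or 1
    return hits / trials


print(common_active_probability(0.5, 0.5, 2))
print(simulate_common_active(0.5, 0.5, 2, trials=100_000))
```

The closed form grows with both k and q, matching the qualitative trend the paragraph describes.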
V Performance Analysis
This section analyzes the theoretical performance of ODC using regret as the metric. Let be the total VoI returned by a given algorithm under the constraint of the variable harvested energy over a fixed period . The expectation of is denoted by . Throughout this paper, a superscript “*” marks any quantity that is optimal. Suppose that is the optimal algorithm for our problem, i.e.
Thus, the regret of the algorithm can be formally defined as:
where, because the optimal scheme is hard to find, is represented by the expected reward of the arm with the maximal reward, i.e., .
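Because the optimum is approximated by the expected reward of the best arm, regret can be accumulated pull by pull: every pull of an arm i forgoes, in expectation, the gap Delta_i between the best expected reward and that of arm i. Summing gives the standard fixed-horizon decomposition regret = sum_i Delta_i * E[n_i], which is why the analysis below bounds the expected number of pulls of each arm. The Python sketch below is a minimal illustration under that assumption; it ignores per-arm energy costs and the difference between the slot sets available to ODC and to the optimum, and the function name is hypothetical.

```python
from typing import Sequence


def pseudo_regret(mu: Sequence[float], pulls: Sequence[float]) -> float:
    """Regret as sum_i Delta_i * E[n_i].

    mu[i]: expected reward (VoI) of arm i; the best arm's value is used as
    the proxy for the optimal algorithm's per-pull reward.
    pulls[i]: expected number of times arm i is pulled.
    """
    if len(mu) != len(pulls):
        raise ValueError("mu and pulls must have the same length")
    mu_star = max(mu)
    # Each pull of arm i loses Delta_i = mu_star - mu[i] in expectation.
    return sum((mu_star - m) * n for m, n in zip(mu, pulls))


if __name__ == "__main__":
    # Hypothetical arms with expected VoI 0.2, 0.5 and 0.9.
    print(pseudo_regret([0.2, 0.5, 0.9], [3, 2, 5]))
```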
In what follows, we analyze the regret of our scheme . The analysis relies on the Hoeffding inequality, stated below:
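The Hoeffding inequality: Let $X_1, \dots, X_n$ be random variables with common range $[0,1]$ such that $\mathbb{E}[X_t \mid X_1, \dots, X_{t-1}] = \mu$. Let $S_n = X_1 + \dots + X_n$. Then for every constant $a \ge 0$, $\Pr\{S_n \ge n\mu + a\} \le e^{-2a^2/n}$ and $\Pr\{S_n \le n\mu - a\} \le e^{-2a^2/n}$.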
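As a quick numerical illustration of the upper-tail bound, the Python sketch below draws i.i.d. Bernoulli(mu) variables (one particular case of variables with common range [0,1]), estimates Pr{S_n >= n*mu + a} by simulation, and compares it with e^{-2a^2/n}. The function names and parameter values are hypothetical choices for illustration only.

```python
import math
import random


def hoeffding_bound(n: int, a: float) -> float:
    """Right-hand side e^{-2 a^2 / n} for a sum of n variables in [0, 1]."""
    return math.exp(-2.0 * a * a / n)


def empirical_upper_tail(mu: float, n: int, a: float, trials: int,
                         seed: int = 1) -> float:
    """Estimate Pr{S_n >= n*mu + a} for S_n a sum of n Bernoulli(mu) draws."""
    rng = random.Random(seed)
    exceed = 0
    for _ in range(trials):
        s = sum(1 for _ in range(n) if rng.random() < mu)
        if s >= n * mu + a:
            exceed += 1
    return exceed / trials


if __name__ == "__main__":
    n, mu = 100, 0.3
    for a in (2.0, 5.0, 10.0):
        emp = empirical_upper_tail(mu, n, a, 2_000)
        print(f"a={a:4.1f} empirical={emp:.4f} bound={hoeffding_bound(n, a):.4f}")
```

The bound is loose for small a but decays exponentially in a^2/n, which is what the regret analysis exploits.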
Recall that the harvested power must exceed the threshold to support the normal operation of the sensor node. Denote by the set of time slots in which the harvested power exceeds the threshold . This set is determined by the energy harvesting process, and its expectation can be computed easily if the state transition probabilities of that process are known in advance. Under Algorithm 2, the VoI threshold is continuously adjusted, so the sensor node may choose to sleep (i.e., to store the harvested energy) in some slots even when the harvested power exceeds the threshold . Because of the charging efficiency , the number of time slots, denoted by , in which the harvested energy can support the normal operation of the sensor node under Algorithm 2 cannot exceed . Thus, we have .
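The claim that the expected size of this slot set is easy to compute can be illustrated under an assumed model: the harvesting process is a finite-state Markov chain with known transition matrix, and each state is labeled as above or below the threshold. By linearity of expectation, the expected number of qualifying slots over T slots is the sum over slots of the probability that the chain is in an above-threshold state. The Python sketch below implements this; the two-state cloudy/sunny example and all names are hypothetical.

```python
from typing import Sequence


def expected_slots_above_threshold(P: Sequence[Sequence[float]],
                                   above: Sequence[bool],
                                   init: Sequence[float],
                                   T: int) -> float:
    """E[number of slots t = 1..T whose harvesting state is above threshold].

    P[i][j]: transition probability from state i to state j (rows sum to 1);
    above[i]: True if harvested power in state i exceeds the threshold;
    init[i]: probability of being in state i in slot 1.
    """
    n = len(init)
    dist = list(init)
    expected = 0.0
    for _ in range(T):
        # Linearity of expectation: add P(this slot is above threshold).
        expected += sum(prob for prob, ok in zip(dist, above) if ok)
        # One Markov step: dist'[j] = sum_i dist[i] * P[i][j].
        dist = [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]
    return expected


if __name__ == "__main__":
    # Hypothetical two-state model: state 0 = cloudy (below), 1 = sunny (above).
    P = [[0.7, 0.3],
         [0.4, 0.6]]
    print(expected_slots_above_threshold(P, [False, True], [1.0, 0.0], T=24))
```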
First, we analyze the expected number of times that the arm , is pulled. The arm (storing energy) is excluded since it returns no reward. The result is given in the following lemma, whose proof follows the idea of the reference  while also accounting for the cost of each arm , .
For an arbitrary arm , , the expected number of times it is pulled in the period , is:
where is the difference in expected reward between the optimal algorithm and the arm , and .
Recall that Step 9 of Algorithm 1 pulls each arm , , once in the first slots. Thus, the number of times arm is pulled is , where . Since Algorithm 1 is greedy, in each slot the selected arm has a higher upper-bound VoI per unit cost than every other arm, including the optimal one. So we have the following condition: , i.e., . For this condition to hold, at least one of the following inequalities must be satisfied.
where and are the expected rewards of the optimal algorithm and of the arm under our algorithm, respectively, both unknown to the sensor node. and . Applying the Hoeffding inequality, the probabilities that the inequalities in Equations (21) and (22) are satisfied are bounded as follows:
Recall that and , so the inequality in Equation (23) implies:
Similarly, we can obtain the expected number of times the arm is pulled.
The expected number of times the arm is pulled in the period , is:
Applying the Hoeffding inequality, the probability that the inequality in Equation (28) is satisfied is bounded as follows:
The inequality in Equation (29) implies:
According to Step 17 of Algorithm 1, the conditions given in Equations (28) and (29) must be satisfied simultaneously for all arms , . Therefore, by Equations (30) and (V), the expected number of times the arm is pulled is:
where , and . ∎
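The greedy step used in the proof above (each reward-bearing arm pulled once at initialization, then the arm with the highest upper-bound VoI per unit cost selected in every slot) can be sketched as follows. This is a minimal illustration, not Algorithm 1 itself: the exploration bonus sqrt(2 ln t / n_i) is the standard UCB1 form, assumed here because the paper's exact index is not reproduced in this section, and the storing arm, which returns no reward, is left out. All names are hypothetical.

```python
import math
from typing import Sequence


def select_arm(mean_voi: Sequence[float], counts: Sequence[int],
               costs: Sequence[float], t: int) -> int:
    """Greedy choice by upper-bound VoI per unit cost (UCB1-style bonus assumed).

    mean_voi[i]: empirical mean VoI of arm i; counts[i]: times arm i was pulled;
    costs[i]: energy cost of one pull of arm i (> 0); t: current slot (>= 1).
    """
    # Initialization: pull every arm once before trusting its estimate.
    for i, n in enumerate(counts):
        if n == 0:
            return i
    best, best_index = 0, -math.inf
    for i, (m, n, c) in enumerate(zip(mean_voi, counts, costs)):
        # Upper bound on the expected VoI, normalised by the arm's energy cost.
        index = (m + math.sqrt(2.0 * math.log(t) / n)) / c
        if index > best_index:
            best, best_index = i, index
    return best


if __name__ == "__main__":
    # Hypothetical arms: cheap low-VoI arm vs. expensive high-VoI arm.
    print(select_arm([0.5, 0.9], [10, 10], [1.0, 3.0], t=20))
```

Normalising by cost is what lets a cheap arm with modest VoI win over an expensive one; with equal costs the rule reduces to plain UCB1.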
For the dynamic energy budget , the expectation of ODC’s regret is at most:
where is a constant, and is the expected reward of the optimal algorithm.
Algorithm 1 can operate in the time slots of the set , where . Suppose that . In the period , the arms , , are pulled, and in the period , the arm is pulled. Suppose that the optimal algorithm operates in the time slots of the set . We have and because of the charging efficiency . The reward regret of ODC is: