Real-Time Fine-Grained Air Quality Sensing Networks in Smart City: Design, Implementation and Optimization
Abstract
Driven by the increasingly serious air pollution problem, the monitoring of air quality has gained much attention in both theoretical studies and practical implementations. In this paper, we present the architecture, implementation and optimization of our own air quality sensing system, which provides a real-time and fine-grained air quality map of the monitored area. As the major component, the optimization problem of our system is studied in detail. Our objective is to minimize the average joint error of the established real-time air quality map, which involves data inference for the unmeasured data values. A deep Q-learning solution is proposed for the power control problem to reasonably plan the sensing tasks of the power-limited sensing devices online. A genetic algorithm is designed for the location selection problem to efficiently find suitable locations to deploy a limited number of sensing devices. The performance of the proposed solutions is evaluated by simulations, showing a significant performance gain when adopting both strategies.
I Introduction
Based on a recent report of the World Health Organization [1], air pollution has proved to be one of the greatest threats to human health, being responsible for one in eight deaths each year. In addition to the exhaust emissions from industrial production, the daily activities of residents also contribute to the accumulation of air pollutants, such as driving fuel-powered automobiles or incinerating garbage [2]. The degree of air pollution is usually quantitatively described by the air quality index (AQI), which is defined according to the concentrations of some typical air pollutants, including fine particulate matter (e.g., PM2.5 and PM10) and other basic chemical substances [3]. The AQI value is larger when the concentrations of the air pollutants are higher, indicating a higher risk of harmful health effects.
To measure the concentration of a specific air pollutant, the available approaches are either large professional instruments with high precision or tiny commercial sensors with low cost [4]. For the sake of accuracy, the government-owned official meteorological bureaus have deployed authoritative monitoring systems across the country at high cost. Despite the high precision they can achieve, these official systems have only a limited number of observation stations over a large area and provide measurement results with significant latency [5]. However, recent studies show that the concentrations of air pollutants can vary from meter to meter, especially for particulate matter in urban areas whose complicated terrain results from densely distributed tall buildings [6, 7]. This indicates that the data provided by official measurements cannot accurately represent the air quality at locations far from the observation stations.
Therefore, it is preferable to deploy a large number of low-cost tiny sensing devices to provide air quality sensing for regions with complicated terrain [8, 9]. Since the deployment of tiny sensing devices can be dense and the data collection can be frequent, the air quality distribution can be updated with low latency and high resolution [10, 11]. Such a solution creates a promising application of the Internet-of-Things (IoT) in the smart city [12], where massive data can be collected and analyzed [13]. Citizens can benefit from the valuable information provided by the air quality sensing system, for example by keeping away from highly polluted areas or deciding on the best ventilation system for a building [14].
In this paper, we propose the architecture, implementation and optimization of our own air quality sensing system, which provides a real-time and fine-grained air quality map of the monitored area. For the system design, a four-layer architecture is established, including the energy-efficient sensing layer, the highly reliable transmission layer, the full-featured processing layer, and the user-friendly presentation layer. For the implementation, we have deployed this system in Peking University (PKU) for six months and have collected over thousand data values from devices. The terrain of our campus is considered complex enough to represent a typical urban terrain of a large smart city, since green areas, tall buildings and vehicle lanes are all included. For the system optimization, we aim to minimize the error of the real-time and fine-grained air quality map, where the limited number of available sensing devices and the limited capacity of their batteries are the challenges.
As the major part of this paper, the optimization of the IoT air quality sensing network is studied in detail, which is rarely taken into account in related works [4, 8, 15]. Specifically, optimization is necessary because the IoT air quality sensing devices are deployed without external power supply [16, 17] in order to adapt to the complicated measurement area. Therefore, a sensing device can only perform a limited number of power-consuming actions, such as detecting the concentration of an air pollutant or uploading data back to the server. To recover a real-time and fine-grained air quality map from the sparse data, a procedure of inference and estimation is required, which can be realized by approaches such as machine learning [18, 19]. The accuracy of inferring the data at unmeasured locations and unmeasured times depends on the spatial-temporal structure of the collected data. For instance, inferring the current air quality based on a value measured long ago would be questionable [20]. In addition, inferring the air quality at a certain location based on the data from a hardly correlated location is also inaccurate [21, 22]. In order to guarantee the accuracy of the established air quality map, it is necessary to consider the problems of where to deploy the limited number of sensing devices (location selection problem) and when to perform sensing actions (power control problem). These two problems are interdependent, e.g., the location selection influences the correlation of the sensing data values and therefore influences the optimal power control.
In our work, we model the measurement error and the inference error based on the statistical data from our own system. Our objective is to minimize the joint error of the real-time and fine-grained air quality map by properly designing the power control and location selection strategies. To be specific, the power control problem is solved by the proposed solution based on deep Q-learning, which treats the system as a Markov Decision Process (MDP) and can be deployed online to deal with unexpected weather conditions. The location selection problem is solved by the proposed genetic algorithm, which takes the result of k-means clustering as the initial genetic population and iteratively improves the location selection by widely searching the solution space. Both solutions achieve satisfactory suboptimal outcomes, and the combination of our power control and location selection strategies significantly reduces the average joint error. In addition, these solutions are scalable and can therefore be implemented in a large city-wide IoT air quality sensing network.
The main contributions of our work are listed below:

We present our energy-efficient real-time and fine-grained air quality sensing system, which has been deployed in PKU for six months by Sept. 2018.

We model the measurement error and inference error in the air quality sensing system based on the collected data.

We provide a deep Q-learning solution for the power control problem to reasonably plan the sensing tasks of the power-limited sensing devices online.

We design a genetic algorithm for the location selection problem to efficiently find suitable locations to deploy a limited number of sensing devices.

The performance of the proposed solutions is evaluated by simulations, showing a significant performance gain when deploying both strategies.
The rest of our paper is organized as follows. Section II provides an overview of the design and implementation of our air quality sensing system. Section III formulates the problem of minimizing the joint error. Section IV discusses the parameters that influence the inference error. Section V presents the deep Q-learning solution for power control. Section VI presents the genetic solution for location selection. Section VII shows the simulation results of the proposed solutions. Finally, we conclude our paper in Section VIII.
II System Overview
In this section, we first provide a brief overview of the design of our air quality sensing system, and then present some of the representative implementation results, and finally describe the collected data set.
II-A System Design
As shown in Fig. 1, our air quality sensing system consists of four layers, namely, the sensing layer, the transmission layer, the processing layer and the presentation layer. The sensing layer collects the data of real-time air quality, which is carried out by the sensing devices installed near the ground. The transmission layer enables bidirectional communication between the sensing layer and the processing layer, which is supported by the infrastructure of the current wireless communication networks. The processing layer is implemented in the cloud server, which is responsible for receiving, recording and processing the data from the sensing layer, and for controlling the behaviour of the sensing layer. The presentation layer provides valuable information to the users through our official website and our official WeChat subscription account.
II-B System Implementation
Fig. 2 shows the implementation of our system, which has been deployed in PKU for months. Most sensing devices are fixed near the ground and powered by batteries. As the data are transmitted back to the server, users can query the real-time air quality data on our website [23] or through our WeChat official account. The backend of the server also monitors the status of the devices and manages their sensing behaviours to balance accuracy against battery life. Spatial inference and short-term prediction are also supported to keep the air quality map real-time and fine-grained. More details can be found in [24] and are not presented here, as we focus on the optimization of the air quality sensing network.
II-C Data Set Description
During the deployment, we have collected over thousand effective values, mostly for the concentrations of PM2.5. Here we provide the data set collected by on-ground sensing devices [25]. Specifically, it contains the PM2.5 values from two time periods: from March 1st 2018 to May 15th 2018, and from June 5th 2018 to August 25th 2018. The provided data set is used to extract some important statistical properties of the monitored area, as given in Section III, which allows us to design the corresponding power control and location selection strategies. If the proposed sensing system is expanded to the whole smart city, then a data set covering the whole city will be necessary.
III Optimization Problem Formulation
In this section, we present the optimization problem in our air quality sensing system. First, we provide the overview of the optimization problem in Section III-A. We then model the measurement error and inference error in Section III-B based on the statistics of our collected data. Finally, we formulate the optimization problem for the air quality sensing system, including power control and location selection.
III-A Problem Overview
The air quality sensor and wireless transmission module of each sensing device account for most of its power consumption. Therefore, these devices stay in sleep mode most of the time to save the limited energy supplied by their own batteries. The control server is responsible for planning the sensing tasks for all the devices (i.e., when each device should wake up and collect data), as well as receiving and recording the transmitted data. Since the air quality data from nearby spatial locations and temporal points are not independent, the control server can utilize limited data to establish a real-time air quality map by spatial and temporal inference.
Assume that there are totally suitable locations for sensing deployment in the concerned area, and only sensing devices are available to be deployed, where . We denote the set of locations with sensing devices as , and the set of locations without sensing devices as . Here, we have and .
The sensing system is divided into equal-length time slots, and we should decide whether each device is woken up to collect data at each time slot. We provide the 0-1 matrix to represent the power control strategy, where is the expected number of time slots that the whole system should sustain without recharging. As the element of the matrix , indicates that the device in the location is turned on to sense data at the time slot, and indicates that this device still keeps asleep or there is no device at the location. The missing data values are inferred by the server, based on the current and previously collected data, according to their spatial-temporal relation.
III-B Measurement Error and Inference Error
In this subsection, we model the measurement error and inference error based on our statistical data [25]. The inference error here is modeled independently of any advanced inference algorithms that are based on massive historical data (such as neural networks), so that we can depict the most general situation. Regardless of whether the data is being directly measured or being inferred from other data, we denote the air quality value at the location at the time slot as a random variable , where and . In the following, the deviation of the mean value of and the uncertainty (variance) of are considered as the major indicators to represent the error of the measurement or the inference.
Measurement:
The measurements of the sensing devices are not perfect. The distribution of the measured value (e.g., PM2.5) at a certain location and a certain time approximately follows a Gaussian distribution
(1) 
where is the precise value of the location at the time slot
Temporal inference: With a measured value at location time , we can infer the possible value at time for the same location. As time goes on, the new value of this location deviates from the original one randomly. Such deviation can be seen as an additive random noise applied on the original measured value. As long as the length of the time slot is fixed, the deviation between two adjacent time slots has a fixed distribution, given as
(2) 
where is the constant showing the average change rate of the air quality based on the given length of time slot. We call as the temporal deviation variance. Therefore the distribution of is given by
(3) 
which implies that the longer the time span, the less accurate the inference will be.
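The growth of uncertainty with the time span can be illustrated with a minimal Python sketch. It assumes that the per-slot deviations are independent, zero-mean Gaussian noise, so that the variances add up linearly; the names `Gaussian`, `temporal_inference` and `temporal_var` are hypothetical and chosen only for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gaussian:
    """A Gaussian estimate of an air quality value (hypothetical helper type)."""
    mean: float
    var: float


def temporal_inference(measured: Gaussian, slots_elapsed: int,
                       temporal_var: float) -> Gaussian:
    """Propagate a measured value forward by `slots_elapsed` time slots.

    Assumption: each slot adds independent zero-mean Gaussian noise with
    variance `temporal_var`, so the mean is kept and the variances add up.
    """
    if slots_elapsed < 0:
        raise ValueError("slots_elapsed must be non-negative")
    return Gaussian(measured.mean, measured.var + slots_elapsed * temporal_var)


# A value measured three slots ago is less certain than a fresh one.
fresh = Gaussian(mean=50.0, var=4.0)
stale = temporal_inference(fresh, slots_elapsed=3, temporal_var=2.0)
```

Under these assumptions the inferred mean stays at the measured value, while the variance increases by one temporal deviation variance per elapsed slot.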
Spatial inference by single source: Based on , whether it is a directly measured value or the result of a temporal inference, we are able to infer the value at another location at the same time slot, , as shown in Fig. 3. To achieve this, we exploit the relevance among different locations from historical data and find that the deviations among different locations can also be modeled as additive. Specifically, the additive random deviation from location to location is denoted as , obeying the following distribution:
(4) 
where is the average value at time , is the constant describing the normalized average deviation from location to , and is the constant describing the normalized increased variance when using to infer . Also note that , . Now we have the distribution of the inferred as:
(5) 
Note that as in (4) gets larger (indicating worse air quality), the additional inference variance in (5) gets larger.
Spatial inference by multiple sources: We can further utilize values from multiple locations to infer an unknown value at a different location at the same time slot . The utilized values can either be the directly measured values or the values inferred from earlier measurements based on (3). For each of these values, we use (5) to perform a single-source inference for the target location. The inference result for the target location is denoted as , where . Then we can multiply all the probability density functions (PDF) of these inference results together to get the PDF of the target location. For simplicity, we assume the distributions of different are independent (since they can be traced back to different sensors). Therefore the final inference also has a Gaussian distribution, given as:
(6) 
where the inference result has a weighted mean based on the mean of these random variables and has a smaller variance compared with each one of these random variables.
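Multiplying independent Gaussian PDFs and renormalizing gives the standard precision-weighted combination, which matches the description above: a weighted mean and a variance smaller than that of any single source. The following Python sketch illustrates this; the function name `fuse_gaussians` is hypothetical and the snippet is not taken from the deployed system.

```python
from typing import Sequence, Tuple


def fuse_gaussians(estimates: Sequence[Tuple[float, float]]) -> Tuple[float, float]:
    """Fuse independent Gaussian estimates given as (mean, variance) pairs.

    The product of Gaussian PDFs is proportional to a Gaussian whose
    precision (1 / variance) is the sum of the individual precisions and
    whose mean is the precision-weighted average of the individual means.
    """
    if not estimates:
        raise ValueError("at least one estimate is required")
    if any(var <= 0 for _, var in estimates):
        raise ValueError("variances must be positive")
    precision = sum(1.0 / var for _, var in estimates)
    fused_var = 1.0 / precision
    fused_mean = fused_var * sum(mean / var for mean, var in estimates)
    return fused_mean, fused_var


# Two equally reliable sources: the mean is averaged, the variance is halved.
mean, var = fuse_gaussians([(10.0, 1.0), (20.0, 1.0)])
```

With unequal variances, the fused mean moves toward the more reliable source, e.g. fusing (10, 1) with (20, 4) gives a mean close to 12.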
Rule of inference: For a measured value with , no inference is performed. For an unmeasured value with , we consider a three-step inference. The first step is to execute up to times of temporal inferences for all the selected locations based on their previous measured values, in which way we have intermediate results for the current time, according to Eqn. (3). The second step is to utilize these intermediate results to perform times of “single-source” spatial inference for the target location, according to Eqn. (5). And the final step is to combine these inferences to form a “multi-source” spatial inference, according to Eqn. (6). Fig. 3 provides a simple illustration of the above inference steps.
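The three steps can be sketched end to end as follows. This is only an illustration under stated assumptions: temporal variance is assumed to grow linearly with the number of elapsed slots, and the spatial deviation is assumed to have a mean and an additional variance that both scale linearly with the area-wide average value. The type `Source`, its fields `norm_dev` and `norm_var`, and the function `infer_target` are hypothetical names.

```python
from typing import NamedTuple, Sequence, Tuple


class Source(NamedTuple):
    last_value: float   # most recent measured value at the source location
    meas_var: float     # measurement variance of that value
    slots_since: int    # time slots elapsed since that measurement
    norm_dev: float     # assumed normalized mean deviation, source -> target
    norm_var: float     # assumed normalized extra variance, source -> target


def infer_target(sources: Sequence[Source], temporal_var: float,
                 area_avg: float) -> Tuple[float, float]:
    """Return (mean, variance) of the inferred value at an unmeasured location."""
    single_source = []
    for s in sources:
        # Step 1: temporal inference (variance grows with the elapsed slots).
        mean = s.last_value
        var = s.meas_var + s.slots_since * temporal_var
        # Step 2: single-source spatial inference (assumed linear scaling
        # of both the deviation and the extra variance with area_avg).
        mean += s.norm_dev * area_avg
        var += s.norm_var * area_avg
        single_source.append((mean, var))
    # Step 3: multi-source fusion by precision weighting.
    precision = sum(1.0 / v for _, v in single_source)
    fused_var = 1.0 / precision
    fused_mean = fused_var * sum(m / v for m, v in single_source)
    return fused_mean, fused_var


# A fresh source and a three-slot-old source; the fresh one dominates.
sources = [Source(40.0, 1.0, 0, 0.0, 0.0), Source(60.0, 1.0, 3, 0.0, 0.0)]
mean, var = infer_target(sources, temporal_var=1.0, area_avg=50.0)
```

In this example the stale source enters the fusion with variance 4 instead of 1, so the inferred mean lies closer to 40 than to 60.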
III-C Environment Model
In the last subsection, we have mentioned as the average result of the time slot. This value can be seen as the air quality for the whole area from a coarse-grained perspective. Without loss of accuracy, we consider this value to be the same as the true average air quality for the whole area, and we aim to establish a statistical model for the change of .
From our collected data, we find that there is an approximately fixed statistical pattern of . Specifically, we can calculate how often a certain level of polluted weather occurs, given by
(7) 
where is the value space of the possible air quality. The values of air quality (such as PM2.5) are usually integers, thus we consider as a finite discrete value space. In addition, for a fixed length of time slot (such as minutes), the probability of air quality transition between adjacent time slots can also be calculated, given by
(8) 
where the current coarse-grained air quality has a relation with .
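Both statistics can be estimated from a recorded series of area-wide average values by simple counting. The Python sketch below assumes that the series is already discretized into integer levels and sampled once per time slot; the function name `estimate_environment` is hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, Sequence, Tuple


def estimate_environment(
    series: Sequence[int],
) -> Tuple[Dict[int, float], Dict[int, Dict[int, float]]]:
    """Estimate occurrence and transition probabilities from a value series.

    Returns (occurrence, transition), where occurrence[v] is the empirical
    frequency of level v and transition[v][w] is the empirical probability
    of moving from level v to level w between adjacent time slots.
    """
    if len(series) < 2:
        raise ValueError("at least two samples are required")
    counts = Counter(series)
    occurrence = {v: c / len(series) for v, c in counts.items()}

    pair_counts: Dict[int, Counter] = defaultdict(Counter)
    for cur, nxt in zip(series, series[1:]):
        pair_counts[cur][nxt] += 1
    transition = {
        cur: {nxt: c / sum(nxts.values()) for nxt, c in nxts.items()}
        for cur, nxts in pair_counts.items()
    }
    return occurrence, transition


occurrence, transition = estimate_environment([1, 1, 2, 1, 2, 2])
```

Each row of the estimated transition table sums to one, so it can be used directly to sample the next coarse-grained value in a simulation.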
It is assumed that can be roughly known when it comes to the time slot. The corresponding approaches could be neural networks [19], or checking the official weather report (which is not our focus in this paper). We focus on how to improve the accuracy of the fine-grained air quality map by power control and location selection, as presented in the next subsection.
III-D Problem of Power Control and Location Selection
The limited battery capacity of each sensing device restricts the number of data values it can collect. For simplicity, we assume that the sensing devices have the same battery capacity and each one of them can only perform times of sensing tasks (including data sensing and an immediate data uploading) before its battery dies, where . Therefore, we have , showing the energy budget of the devices. In addition, we expect that each device should not be silent for too long. The maximum number of consecutive time slots that a device can keep asleep is , which provides . We should guarantee that to avoid contradiction.
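The two power control constraints described above (a per-device energy budget and a cap on consecutive sleeping slots) can be checked on a candidate 0-1 schedule as in the Python sketch below. The names `is_feasible`, `budget` and `max_sleep` are hypothetical, and the schedule is assumed to contain one row per deployed device, starting right after the initial sensing.

```python
from typing import Sequence


def is_feasible(schedule: Sequence[Sequence[int]], budget: int,
                max_sleep: int) -> bool:
    """Check a 0-1 power control schedule against the two constraints.

    schedule[i][t] is 1 if device i senses at time slot t and 0 otherwise.
    `budget` is the maximum number of sensing tasks per device and
    `max_sleep` is the maximum number of consecutive sleeping slots.
    """
    for row in schedule:
        # Energy budget: the device cannot sense more often than allowed.
        if sum(row) > budget:
            return False
        # Silence constraint: count consecutive sleeping slots.
        asleep = 0
        for on in row:
            asleep = 0 if on else asleep + 1
            if asleep > max_sleep:
                return False
    return True


ok = is_feasible([[0, 1, 0, 0, 1]], budget=2, max_sleep=2)
too_silent = is_feasible([[0, 0, 0, 1]], budget=2, max_sleep=2)
```

Here `ok` is true, while `too_silent` is false because the device sleeps for three consecutive slots.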
Since the server needs to provide a real-time distribution of the air quality, the incomplete data at the unmeasured locations should be inferred from the collected data according to the spatial and the temporal inference mentioned in Section III-B. For a given time slot, when the current air quality map is established with the help of inference, we can investigate the accuracy of this map. For , we define its joint error, , as the indicator to quantitatively show the reliability of the data, which is given as below:
(9) 
which jointly considers the variance of the value and the deviation from the current average value. Specifically, a larger variance or a larger deviation could increase the joint error of the data, i.e., is less reliable as gets larger. Note that if is a measured value, then and . We consider in this case for simplicity. Otherwise, and should be calculated according to Eqn. (2)-(6) based on the inference rule. Either way, the joint error of each value at the current time slot can be calculated if we have determined the subset of sensing devices being turned on.
At the time slot, the average joint error of the currently generated real-time air quality map is given by . And for the whole period including time slots, the average joint error is calculated as
(10) 
where we assume all the sensors should perform a sensing task at for a good initialization and the situation at is not counted. The objective function of minimizing the average joint error of the real-time air quality map is
(11)  
(12)  
(13)  
(14)  
(15)  
(16)  
(17) 
where Eqn. (12)-(15) show the constraints of power control and Eqn. (16)-(17) show the constraints of location selection.
IV Theoretical Analysis
In this section, we first take a deeper look into the three-step inference rule and obtain some basic properties of the joint inference in Section IV-A. Then we study the influence of the system parameters on the system performance in Section IV-B. Finally, we discuss some intuitions for the optimization problem in Section IV-C, which leads to the solutions in Section V and Section VI.
IV-A The Mean and The Variance of The Joint Inference
From the three-step inference rule introduced in Section III-B, we know that each unmeasured value is inferred by values, which are the most recent data collected by each one of the sensing devices. We provide Fig. 4 as an example to illustrate such a procedure. The final inference is a multi-source spatial inference based on single-source spatial inferences. And each single-source spatial inference is based on a temporal inference if this location has no current measured value.
Now we focus on the inference for a certain location at a certain time slot . We denote the time span that the device has not sensed any data until as . Therefore the intermediate inference result after the temporal and the single-source spatial inference for the target location is given by
(18) 
where is the measurement variance, is the additional variance of temporal inference, and is the variance of spatial inference based on the relation of and . Since if the variable , the above expression is compatible for all situations, such as in Fig. 4.
To combine these results using a multi-source spatial inference, we use Eqn. (6) to calculate the mean value and the variance of the final result. For the convenience of reading, we rewrite the expression of and as below:
(19)  
(20) 
where and are short for the mean value and the variance of , respectively, to facilitate reading in the rest of this section.
Remark 1.
From Eqn. (19), we can see that is the weighted sum of . The corresponding weight for the component is , meaning that a more accurate single-source spatial inference has more influence on the final result of the multi-source spatial inference. In addition, we have , since
(21) 
where and .
Remark 2.
From Eqn. (20), we can guarantee that , since the following condition holds:
(22) 
Remark 3.
The final inference variance is more sensitive to the minimal value of , since we have the following partial derivative:
(23) 
which means that the same amount of decrease of a smaller will lead to a larger reduction of the final variance.
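The sensitivity stated in Remark 3 can be checked with a short derivation, assuming the standard precision-sum form for the fusion of independent Gaussian estimates. The symbols below are illustrative notation introduced only for this sketch: \sigma^2 denotes the final variance and \sigma_i^2 the variance of the i-th single-source inference.

```latex
\frac{1}{\sigma^{2}} = \sum_{j} \frac{1}{\sigma_{j}^{2}}
\quad\Longrightarrow\quad
\frac{\partial \sigma^{2}}{\partial \sigma_{i}^{2}}
= \Bigl(\sum_{j} \frac{1}{\sigma_{j}^{2}}\Bigr)^{-2} \frac{1}{\sigma_{i}^{4}}
= \frac{\sigma^{4}}{\sigma_{i}^{4}} .
```

Under this assumed form, the derivative is positive and largest for the smallest \sigma_i^2, which is consistent with the statement that reducing the most accurate single-source variance has the greatest effect on the final variance.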
IV-B Influence of The System Parameters on The Joint Error
From the expression of the joint error in Eqn. (9), we can see that both the variance of the inference result and the deviation from the coarse-grained air quality contribute to . The increase of and could decrease the inference accuracy and lower the confidence level of the established real-time air quality map.
From Eqn. (18) and (20), we can see that the variance of the joint inference depends on the current air quality , the time span since the most recent sensing, and the air quality when performing the most recent sensing. This means that the temporal inference from data collected long ago (especially when the value was high back then) is questionable, and the spatial inference under bad weather conditions (high values of air quality) is also inaccurate.
From Eqn. (19) and (20), we can see that the mean of the joint inference is the weighted mean of the corresponding values from all the sensing locations. Since is actually the air quality of the time slot, its difference with the current value could be large if the air quality changes rapidly in the recent time slots. From our statistical data mentioned in Section III-C, the air quality transition in adjacent time slots presents a greater probability for similar air quality values, i.e., is larger if is small. This means that the air quality values in recent time slots are more reliable compared with the values from earlier time slots. Thus is expected to be smaller if the sensing devices can turn on more frequently.
Lemma 1.
Adding a measured value in the existing spatial-temporal graph of the air quality sensing system can, on average, decrease the joint error.
Proof:
We assume that the added measurement is at the location at the time slot, given by . We denote the nearest measurements of location as being at the and the time slots, with . As illustrated in Fig. 5, the influenced values are within , where the earlier unmeasured values are inferred based on and the later unmeasured values are inferred based on . For each of these influenced unmeasured values, provides a lower variance in the single-source spatial inference compared with the original value according to Eqn. (18). This is because the value of is smaller and the probability distribution of is the same as in the long-term average observations. In addition, also has a smaller expectation than for all owing to the aforementioned property of statistical air quality transition. ∎
Note that the conclusion of Lemma 1 shows the average outcome of the situations. Based on Lemma 1, we can directly obtain the following propositions:
Proposition 1.
Given a fixed time period , a fixed number of sensing devices , two different settings of energy budget , the corresponding average joint errors satisfy in the optimal power control strategy.
Proof:
We assume that the best power control strategy of is , where , . As we raise the energy budget from to , more values of can be changed from to . Based on Lemma 1, adding a new measured value can averagely reduce the average joint error. Even in the worst case where no newly added measurement increases inference accuracy due to extreme weather condition, we can keep as it is and do not deteriorate the original result. ∎
Proposition 2.
Given a fixed time period , a fixed energy budget , two different settings of the number of available sensing devices , the corresponding average joint errors satisfy in the optimal power control and location selection strategy.
Proof:
We assume the optimal power control and location selection for devices are and , respectively. Assume that we add one more device at location , then its collected data can be used to infer the values at the unselected locations for , and the values at its own location only for . From Eqn. (20) we know that the variance of the inference decreases since an additional value participates in the multi-source inference. The remaining problem is to figure out how of each inferred value changes. A basic idea is to let the newly added device copy the power scheduling of one of the existing devices. According to Remark 1, this is equivalent to increasing the weight of the copied device when calculating Eqn. (19). It is expected that some of the will increase and some will decrease. Finding the best existing device to copy its power scheduling can achieve a positive effect on average, which will generally reduce . Even in the worst case where the newly added device results in a worse due to some extreme settings, we can eliminate the newly added device and keep the original location selection plan, resulting in the same . ∎
IV-C Discussions on The Formulated Optimization Problem
For the location selection, intuitively, the devices need to be deployed in those less correlated locations (with high values of between each other), acquiring “more diversified” data to help reestablish the fine-grained air quality map.
For the power control, the turning-on frequency of the sensing devices should be properly adjusted. A low-frequency sensing plan could reduce the accuracy of the real-time air quality map, and a too frequent sensing plan may lead to the depletion of their batteries long before the last hour .
It should be noted that both the measurement and inference errors depend on the average air quality (). This means that we need to know the air quality in advance to make the perfect strategy, which is not an acceptable assumption. We aim to create a more generalized power control strategy which can dynamically deal with the encountered weather condition as long as the statistics of the air quality ( and ) are fixed. Therefore, we only assume the current and the previous air quality (, ) are known as the system is establishing the air quality map at .
In fact, the joint optimization of power control and location selection is highly intractable even with the help of the statistics of historical data. Therefore, in the following part of this paper, we separate the problem into the power control problem and the location selection problem. Specifically, we first study the problem of power control in a stochastic environment based on a fixed location selection in Section V. Next, in Section VI, we study the problem of location selection based on a fixed power control strategy in a given environment. By combining the solutions for these two individual problems together, it is expected that a satisfactory outcome can be acquired.
V Power Control Strategy
In this section, we provide the power control strategy with a fixed location selection . With the knowledge of the environment statistics (as and ), we aim to provide the best power control strategy that is able to deal with an unknown environment having the same statistics. In our context, the power control strategy is learnt by means of reinforcement learning.
However, before formally studying the problem of multiple devices, we first take a look at a simpler situation where only one device is included. Analyzing and solving this simpler problem can help us deal with the case of multiple devices. Specifically, the problem of power control for a single device can be transformed into a Markov Decision Process (MDP), and solved optimally by a dynamic programming algorithm, as provided in Section V-A. Since the complexity of the optimal dynamic programming algorithm increases exponentially with the number of devices, we provide a deep Q-learning solution with approximated value functions for the problem of multiple devices in Section V-B.
V-A Power Control for Single Device
In this subsection, we assume that the number of available devices is one, i.e., . This means that all the efforts of the power control are concentrated on this single device. In the following, we establish an MDP model with a discrete and finite state space, which describes the state transition during the power control procedure.
An MDP consists of five components, namely, the set of states , the set of available actions , the state transition probability matrix , the reward function , and the discount factor . To be specific, the states in should obey the Markov property, where each next state only depends on the current state and the adopted action. Assume that the current state is , one can choose an action from the action set to make the system change. There could be multiple consequent states after performing on , and the corresponding transition probability is given by , where and represents the state and the action in the whole history. In addition, there is a reward of performing on , representing the immediate utility/gain. The discount factor indicates the fading utility of the future rewards from the viewpoint of the current state.
Definition 1.
In the power control problem with a single sensing device, the system state in the whole state transition history is defined in the following form:
(24) $s = (t, p, d, u, e)$
which has five components. The integer represents the time of the system (“t” for “time”). The integer indicates the remaining power of the sensing device (“p” for “power”). The integer shows the number of time slots since the last time of measurement (“d” for “delay”). The integer records the average air quality value during the last time of measurement (“u” for “record”). And the integer shows the current average air quality (“e” for “environment”).
Initial state: The initial state is given by , which means that it is the time slot, and there are available chances of sensing. Note that since we assume all the devices perform a sensing task as soon as they are deployed at the time slot (which is not counted in the energy budget), the time span since the last sensing is , and the recorded air quality is .
Action set: From any given intermediate state, , , two actions can be performed.
Specifically, the action set is given by , where is to keep the sensing device asleep and is to turn on the sensing device.
If and , then only is available since is the maximum time to keep a device asleep.
If , then only is available since the power has been depleted.
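The availability rules above can be sketched as a small Python helper. This is only an illustration: the names `OFF`, `ON`, `available_actions` and the parameter `d_max` (the maximum number of time slots a device may stay asleep) are assumptions introduced here, not notation taken from the paper.

```python
OFF, ON = 0, 1  # hypothetical labels for "keep asleep" and "turn on"


def available_actions(p, d, d_max):
    """Return the actions allowed for remaining power p and delay d.

    p     -- remaining sensing chances of the device
    d     -- time slots since the last measurement
    d_max -- assumed maximum number of slots the device may stay asleep
    """
    if p == 0:
        return [OFF]      # power depleted: the device can only stay asleep
    if d >= d_max:
        return [ON]       # asleep for the maximum time with power left: must sense
    return [OFF, ON]      # otherwise both actions are allowed


print(available_actions(p=2, d=1, d_max=3))
print(available_actions(p=2, d=3, d_max=3))
print(available_actions(p=0, d=3, d_max=3))
```

The depleted-power check comes first, so a device with no power left is never forced to sense.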
Performing “off” action: If we perform on state , it means that we execute no sensing task at time . The overall joint error at the time slot can be calculated according to Eqn. (18) by setting , and . We define the reward of taking action on state as the negative of , written as
(25) 
The following state will be . Note that the first four components are determined, and the last component is generated randomly according to the air quality transition probability. We denote the probability of changing to state by taking action as .
Performing “on” action: If we perform on state , it means that we perform a sensing task at time . The overall joint error at the time slot can be calculated according to Eqn. (18) by setting , and . The corresponding reward is
(26) 
The following state will be , where because it has been one time slot since the last sensing, indicates the air quality recorded when performing the sensing, and also complies with the air quality transition probability based on . We denote the probability of changing into by taking action as .
Termination condition: Regardless of whether we take action or , the component increases by one at each state transition. When it reaches , we take the last action and the subsequent state will be , which marks the termination of the state transitions.
State-value function: For each state, there is a value function representing the utility of this state, denoted by . Specifically, the termination state has zero utility, given by . In each intermediate step, if with reward (with or ), then we have
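The two transitions and the termination condition can be simulated with a minimal Python sketch. Several parts are assumptions made only for illustration: `toy_error` is a placeholder for the joint error of Eqn. (18) and is not the paper's formula, air quality is assumed to take integer levels indexed from 0, `trans[e]` is assumed to hold the transition probabilities out of level `e`, and the value recorded when sensing is assumed to equal the current level.

```python
import random

OFF, ON = 0, 1  # hypothetical action labels


def toy_error(delay, recorded, current):
    """Placeholder for the joint error of Eqn. (18); NOT the paper's formula."""
    return delay + abs(recorded - current)


def step(state, action, trans, error_fn=toy_error, rng=random):
    """Apply one action to state (t, p, d, u, e); return (reward, next_state)."""
    t, p, d, u, e = state
    if action == ON:
        reward = -error_fn(0, e, e)   # fresh measurement (assumed error inputs)
        p, d, u = p - 1, 1, e         # spend one chance, reset delay, record e
    else:
        reward = -error_fn(d, u, e)   # rely on the stale record
        d = d + 1                     # one more slot without sensing
    # The last component is drawn from the air quality transition probability.
    e_next = rng.choices(range(len(trans[e])), weights=trans[e])[0]
    return reward, (t + 1, p, d, u, e_next)


def rollout(state, policy, horizon, trans):
    """Follow `policy` until the time component reaches `horizon`."""
    total = 0
    while state[0] < horizon:
        reward, state = step(state, policy(state), trans)
        total += reward
    return total, state
```

With a transition matrix that never changes the air quality, such as `[[1, 0], [0, 1]]`, the rollout is deterministic, which makes the bookkeeping of the first four components easy to check by hand.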
(27) 
where the discount factor is set to in our calculation. It can be seen that the value of is the sum of the rewards along the path of the experienced states, given by
(28) 
where we can see that maximizing is the same as minimizing the average joint error as the objective function describes.
Action strategy: The problem of maximizing is to find a best path in the state space, which has a size of . Since the state transition is not fixed due to the random change of air quality, the problem can be interpreted as how to decide the action for each possible state, given by
(29) 
As proved in [26], there exists an optimal deterministic action strategy for an MDP. That is to say, the optimal action strategy for any given state does not need to be a probabilistic one (e.g. with probability choosing and with probability choosing ).
Dynamic programming algorithm: The MDP of the single device problem is highly structured. Each state with can only change to another state with , indicating a unidirectional dependence among the states. Since all the termination states with have zero value, we can iteratively use the values of the states with to calculate the values of the states with . Specifically, we have
(30)  
(31) 
where we should calculate for all possible before calculating . Since each value of considers all the possible subsequent states, can be maximized and the corresponding is the optimal choice for the state . At the end of the iteration procedure, we acquire the final optimal strategy for all the possible states. Therefore, we can use to deal with the single-device power control problem in an online mode, where the actions dynamically adapt to the randomly changing environment ().
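The backward recursion can be illustrated with the following Python sketch. It is written top-down with memoisation instead of an explicit backward loop; both evaluate each reachable state once and rely on the same dependence of a state on the states one time slot later. The error model `toy_error`, the discount factor default of 1, the integer air quality levels and all function names are assumptions for illustration only and do not come from the paper.

```python
from functools import lru_cache

OFF, ON = 0, 1  # hypothetical action labels


def toy_error(delay, recorded, current):
    """Placeholder for the joint error of Eqn. (18); NOT the paper's formula."""
    return delay + abs(recorded - current)


def make_solver(horizon, d_max, trans, error_fn=toy_error, gamma=1.0):
    """Return best(t, p, d, u, e) -> (optimal value, optimal action)."""
    levels = range(len(trans))

    def actions(p, d):
        if p == 0:
            return (OFF,)          # no power left
        if d >= d_max:
            return (ON,)           # maximum sleep time reached
        return (OFF, ON)

    @lru_cache(maxsize=None)
    def best(t, p, d, u, e):
        if t == horizon:           # termination states have zero value
            return 0.0, None
        candidates = []
        for a in actions(p, d):
            if a == ON:
                reward, nxt = -error_fn(0, e, e), (t + 1, p - 1, 1, e)
            else:
                reward, nxt = -error_fn(d, u, e), (t + 1, p, d + 1, u)
            # Expectation over the next air quality level e2.
            future = sum(trans[e][e2] * best(*nxt, e2)[0] for e2 in levels)
            candidates.append((reward + gamma * future, a))
        # Keep the action with the largest value; ties prefer the first (OFF).
        return max(candidates, key=lambda c: c[0])

    return best


# Toy instance: air quality never changes, one sensing chance, three slots.
best = make_solver(horizon=3, d_max=5, trans=[[1.0, 0.0], [0.0, 1.0]])
value, action = best(0, 1, 1, 0, 0)
print(value, action)
```

In this toy instance the single sensing chance is best spent in the middle slot, so the recommended first action is to stay asleep.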
Computational complexity: The value of each state is calculated once, and calculating the value of each state considers no more than subsequent states. Therefore, the final computational complexity is . If the value space of the air quality is approximated by a number of discrete segments, the complexity can be greatly reduced. An overview of this solution is presented in Algorithm 1.
V-B Power Control for Multiple Devices
For the problem of devices, the intuition is to define the MDP states by extending the one in (24). Specifically, we have
(32) 
where , , and are the row vectors with length , representing for all the devices. The possible action for each state is also a length vector, given by , where or , . It is easy to see that the number of states is , and the number of actions is . Therefore, the optimal dynamic programming algorithm is no longer suitable for solving the multi-device power control problem.
Since both the extremely large state space and the extremely large action space pose challenges for solving the problem, we first aim to transfer the complexity of the action space to the complexity of the state space. This is done by arranging the sensing devices to take actions in a predefined order. In this way, there are only two possible actions ( and ) for each state, and the number of states is multiplied by after such an arrangement.
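The effect of ordering the devices can be seen in a short Python sketch. It is illustrative only: the function names are hypothetical, actions are encoded as 0 (asleep) and 1 (on), and devices are assumed to be numbered from 1.

```python
from itertools import product


def joint_actions(n_devices):
    """All joint on/off vectors when every device decides at once."""
    return list(product((0, 1), repeat=n_devices))


def sequential_decisions(joint_action):
    """Unroll one joint action into per-device binary decisions.

    Each pair is (turn, action): the device whose turn it is picks 0 or 1.
    """
    return [(turn, a) for turn, a in enumerate(joint_action, start=1)]


n = 3
print(len(joint_actions(n)))             # joint choices per time slot
print(sequential_decisions((1, 0, 1)))   # the same choice as binary steps
```

Every joint action corresponds to exactly one sequence of binary decisions, so nothing is lost by the reordering. The branching factor per state drops to two, at the cost of one extra state component that tracks whose turn it is.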
Definition 2.
In the power control problem with sensing devices, the system state in the whole state transition history is defined in the following form:
(33) 
which has six components. The integer represents the time of the system. The length integer vector indicates the remaining power of each sensing device, with , . The length integer vector shows the number of time slots since the last time of measurement for each device, with , . The length integer vector records the average air quality value during the last time of measurement for each device, , . The integer shows the current average air quality . And the integer implies whose turn it is to take the action at this state.
Initial state: The initial state is given by , where , , , . Note that the last component of is , indicating that it is the turn of the device to take action.
Alternation rule: The first component and the last component of each state obey the following rule in the state transition process, regardless of the exact actions being performed. If we have , then and , meaning that it is the turn of the next device to decide whether to sense in the same time slot. Otherwise, and , indicating that all the devices have finished making decisions in the current time slot and time moves on.
Action set: For each intermediate state , two actions can be performed, given by . If and , meaning that the device has been asleep long enough and still has power to perform sensing, then only action can be executed. If , meaning that the device has no power, then only can be executed. In all other cases, both and can be chosen for the device.
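The alternation rule and the per-turn action set can be sketched in Python as follows. The sketch assumes that devices take turns numbered from 1 up to the number of devices; the names `next_time_and_turn`, `turn_actions` and `d_max` (the maximum sleep time) are hypothetical and used only for illustration.

```python
OFF, ON = 0, 1  # hypothetical action labels


def next_time_and_turn(t, turn, n_devices):
    """Advance the time and turn components, whatever action was taken."""
    if turn < n_devices:
        return t, turn + 1      # next device decides within the same slot
    return t + 1, 1             # all devices decided: move to the next slot


def turn_actions(power, delay, turn, d_max):
    """Actions allowed for the device whose turn it is.

    power, delay -- per-device sequences (remaining power, slots since sensing)
    turn         -- index of the deciding device, assumed to start at 1
    """
    p_n, d_n = power[turn - 1], delay[turn - 1]
    if p_n == 0:
        return [OFF]            # this device has no power left
    if d_n >= d_max:
        return [ON]             # asleep long enough and power remains
    return [OFF, ON]


t, turn = 0, 1
for _ in range(4):              # walk through four decisions with 3 devices
    print(t, turn, turn_actions((1, 0, 2), (1, 2, 3), turn, d_max=3))
    t, turn = next_time_and_turn(t, turn, n_devices=3)
```

Only the device selected by the turn component is constrained in a given state; the entries of the other devices are carried along unchanged until their own turn.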
State transition for : Assume that the current state is . If is performed, then