Decentralized Delay Optimal Control for Interference Networks with Limited Renewable Energy Storage
Abstract
In this paper, we consider delay minimization for interference networks with renewable energy sources, where the transmission power of a node comes from both the conventional utility power (AC power) and the renewable energy source. We assume the transmission power of each node is a function of the local channel state, local data queue state and local energy queue state only. We then consider two delay optimization formulations, namely the decentralized partially observable Markov decision process (DEC-POMDP) and the noncooperative partially observable stochastic game (POSG). In the DEC-POMDP formulation, we derive a decentralized online learning algorithm to determine the control actions and Lagrangian multipliers (LMs) simultaneously, based on the policy gradient approach. Under some mild technical conditions, the proposed decentralized policy gradient algorithm converges almost surely to a local optimal solution. On the other hand, in the noncooperative POSG formulation, the transmitter nodes are noncooperative. We extend the decentralized policy gradient solution and establish the technical proof for almost-sure convergence of the learning algorithms. In both cases, the solutions are very robust to model variations. Finally, the delay performance of the proposed solutions is compared with that of conventional baseline schemes for interference networks, and it is illustrated that substantial delay performance gain and energy savings can be achieved.
I Introduction
Recently, there have been intense research interests in studying interference channels. In [1, 2], the authors show that interference alignment (using infinite-dimension symbol extension in time- or frequency-selective fading channels) achieves the optimal degrees-of-freedom (DoF) of the $K$-user interference channel. In [3, 4], the authors consider joint beamforming to minimize the weighted sum MMSE or maximize the SINR of $K$-pair MIMO interference channels using optimization approaches. In [5, 6], the authors consider decentralized beamforming design for MIMO interference networks using noncooperative games and study sufficient conditions for the existence and convergence of the Nash Equilibrium (NE). However, all of these works have assumed infinite backlogs at the transmitters, and focused on the maximization of physical layer throughput. In practice, applications are delay-sensitive, and it is critical to optimize the delay performance in the interference network.
A design framework that takes both queueing delay and physical layer performance into consideration is not trivial, as it involves both queueing theory (to model the queue dynamics) and information theory (to model the physical layer dynamics) [7]. The simplest approach is to convert the delay constraints into an equivalent average rate constraint using the tail probability (large deviation theory), and to solve the optimization problem using a purely information-theoretic formulation based on the equivalent rate constraint [8]. However, the control policy thus derived is a function of the channel state information (CSI) only, and it fails to exploit the data queue state information (DQSI) in the adaptation process. The Lyapunov drift approach is also widely used in the literature [9] to study the queue stability region of different wireless systems and to establish throughput-optimal control policies (in the stability sense). A systematic approach to delay-optimal resource control in the general delay regime is based on the Markov decision process (MDP) technique [10, 7, 11]. However, brute-force solution of the MDP is usually very complex (owing to the curse of dimensionality), and extension to multi-flow problems in interference networks is highly nontrivial.
Another interesting dimension that has been ignored by most of the above works is the inclusion of a renewable energy source at the transmit nodes. There are intense research interests in exploiting renewable energy in communication network designs [12, 13, 14, 15]. In [12, 13], the authors presented an optimal energy management policy for a solar-powered device that uses a sleep-and-wake-up strategy for energy conservation in wireless sensor networks. In [14], the authors developed a solar energy prediction algorithm to estimate the amount of energy harvested by solar panels, so as to deploy power-efficient task management methods on solar-energy-harvesting wireless sensor nodes. In [15], the author proposed a power management scheme under the assumption that the harvested energy satisfies performance constraints at the application layer. However, in all these works, the delay requirements of applications have been completely ignored. Furthermore, the renewable energy source can act as a low-cost supplement to the conventional utility power source in communication networks. Yet, there are various technical challenges regarding delay-optimal design for interference networks with a renewable energy source.

Randomness of Renewable Energy Source: Recent developments in hardware design have made energy harvesting possible in wireless communication networks [16, 17]. For example, solar-powered base stations are available from various telecommunication vendors [17]. While the renewable energy source may appear to be completely free, there are various challenges involved in fully capturing its advantage. For instance, renewable energy sources are random in nature, and energy storage is needed to buffer the unstable supply of renewable energy. Yet, the cost of energy storage depends heavily on the associated capacity [18]. For limited-capacity energy storage, the transmission power allocation should be adaptive to the CSI, the DQSI as well as the energy queue state information (EQSI). The CSI, DQSI and EQSI provide information regarding the transmission opportunity, the urgency of the data flows, and the available renewable energy, respectively. It is highly nontrivial to strike a balance among these factors in the optimization.

Decentralized Delay Minimization: The existing works on throughput or DoF optimization in interference networks [16] require global knowledge of the CSI, which leads to heavy backhaul signaling overhead and high computational complexity at the central controller. For delay minimization with a renewable energy source, the entire system state is characterized by the global CSI (CSI from any transmitter to any receiver), the global DQSI (data queue lengths of all users), and the global EQSI (energy queue lengths of all users). Therefore, a centralized solution (which requires global CSI, DQSI and EQSI) would also induce substantial signaling overhead, which is not practical. It is desirable to have decentralized control based on local observations only. However, due to the partial observation of the system state in decentralized designs, existing solutions based on the MDP approach cannot be applied to our problem.

Algorithm Convergence Issue: In conventional iterative solutions for deterministic network utility maximization (NUM) problems, the updates in the iterative algorithms (such as subgradient search) are performed within the coherence time of the CSI (i.e., the CSI remains quasi-static during the iteration updates) [5, 6]. When we consider delay minimization, the problem is stochastic and the control actions are defined over ergodic realizations of the system states (CSI, DQSI and EQSI). Furthermore, the restriction to partial observation of the system states in decentralized control further complicates the problem. As a result, the convergence proof of the decentralized stochastic algorithm is highly nontrivial.
In this paper, we consider delay minimization for interference networks with a renewable energy source. The transmitters are capable of harvesting energy from the environment, and the transmission power of a node comes from both the conventional utility power (AC power) and the renewable energy source. For decentralized control, we assume the transmission power of each node is adaptive to the local system states only, namely the local CSI (LCSI), the local DQSI (LDQSI) and the local EQSI (LEQSI). We consider two delay optimization formulations, namely the decentralized partially observable MDP (DEC-POMDP), which corresponds to a cooperative stochastic game setup (where the users cooperatively share a common system utility), and the noncooperative partially observable stochastic game (POSG), which corresponds to a noncooperative stochastic game setup (where each user has a different, selfish utility). In the DEC-POMDP formulation, the transmitters are fully cooperative, and we derive a decentralized online learning algorithm to determine the control actions and the Lagrangian multipliers (LMs) simultaneously based on the policy gradient approach [11, 19]. Under some mild technical conditions, the proposed decentralized policy gradient algorithm converges almost surely to a local optimal solution. On the other hand, in the noncooperative POSG formulation, the transmitters are noncooperative^{1} (i.e., each transmitter optimizes its own utility in a selfish manner), and we extend the decentralized policy gradient algorithm and establish the technical proof for almost-sure convergence of the learning algorithms. In both cases, the solutions do not require explicit knowledge of the CSI statistics, the random data source statistics or the renewable energy statistics. Therefore, the solutions are very robust to model variations.
Finally, the delay performance of the proposed solutions is compared with that of conventional baseline schemes for interference networks, and it is illustrated that substantial delay performance gain and energy savings can be achieved by incorporating the CSI, DQSI and EQSI in the power control design.
II System Model
We consider $K$ pair interference channels sharing a common spectrum with bandwidth $W$ Hz as illustrated in Fig. 1. Specifically, each transmitter maintains a data queue for the random traffic flow towards the desired receiver in the system. Furthermore, the transmitters are fixed base stations but the receivers can be mobile. The time dimension is partitioned into scheduling frames (each lasting $\tau$ seconds). In the following subsections, we shall elaborate on the physical layer model, the random data source model as well as the renewable energy source model.
II-A Physical Layer Model
The signal received at the $k$-th receiver is given by:

$y_k = \sum_{j=1}^{K} \sqrt{L_{kj}}\, H_{kj}\, x_j + z_k$ (1)

where $L_{kj}$ and $H_{kj}$ are the long-term path loss and the microscopic channel fading gain, respectively, from the $j$-th transmitter to the $k$-th receiver. $P_k = \mathbb{E}[|x_k|^2]$ is the total transmission power of the $k$-th transmitter, $x_j$ is the information symbol sent by the $j$-th transmitter, and $z_k$ is the additive white Gaussian noise with power spectral density $N_0$. For notational convenience, we define the global CSI as $\mathbf{H} = \{H_{kj} : \forall k, j\}$. Furthermore, the assumption on the channel model is given as follows.
Assumption 1 (Channel Model)
We assume that the global CSI $\mathbf{H}$ is quasi-static in each frame. Furthermore, $H_{kj}$ is i.i.d. over scheduling frames according to a general distribution with unit mean power, and is independent across the link indices $(k, j)$. The path loss $L_{kj}$ remains constant for the duration of the communication session. ∎
Given transmission powers $\{P_j\}$, the transmit data rate of the $k$-th pair (treating interference as noise) is given by:

$R_k = W \log_2\left( 1 + \dfrac{P_k L_{kk} |H_{kk}|^2}{\Gamma \left( N_0 W + \sum_{j \neq k} P_j L_{kj} |H_{kj}|^2 \right)} \right)$ (2)

where $\Gamma \geq 1$ is a constant SNR gap. Note that (2) can be used to model both uncoded and coded systems [20]. For example, $\Gamma = -\ln(5\,\mathrm{BER})/1.5$ for QAM constellation at a target BER, and $\Gamma = 1$ for capacity-achieving coding (in which case (2) corresponds to the instantaneous mutual information).
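As a numerical sanity check, the rate model (2) can be sketched as follows. All parameter values below are illustrative assumptions, not taken from the paper:

```python
import math

def achievable_rate(bw_hz, p_tx_w, path_gain, noise_psd, interference_w, snr_gap):
    """Per-link rate of eq. (2): interference treated as noise, SNR gap >= 1."""
    sinr = (p_tx_w * path_gain) / (noise_psd * bw_hz + interference_w)
    return bw_hz * math.log2(1.0 + sinr / snr_gap)

def qam_snr_gap(target_ber):
    """Standard SNR-gap approximation for uncoded QAM at a target BER."""
    return -math.log(5.0 * target_ber) / 1.5

# snr_gap = 1 corresponds to capacity-achieving coding (mutual information).
r_coded = achievable_rate(1e6, 1.0, 1e-2, 4e-21, 1e-9, 1.0)
r_qam = achievable_rate(1e6, 1.0, 1e-2, 4e-21, 1e-9, qam_snr_gap(1e-3))
```

Since the gap exceeds 1 for practical BER targets, the uncoded QAM rate is strictly below the coded rate at the same SINR.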
II-B Random Data Source Model and Data Queue Dynamics
Let $A_k(t)$ be the random new arrival (number of bits) at transmitter $k$ at the end of the $t$-th scheduling frame.
Assumption 2 (Random Data Source Model)
The arrival process $A_k(t)$ is i.i.d. over scheduling frames and is distributed according to a general distribution with average arrival rate $\mathbb{E}[A_k] = \lambda_k$. Furthermore, the random arrival processes are independent w.r.t. $k$. ∎
Let $\mathbf{Q}(t) = (Q_1(t), \dots, Q_K(t))$ denote the global DQSI in the system, where $Q_k(t)$ represents the number of bits in the queue of transmitter $k$ at the beginning of frame $t$. $N_Q$ denotes the maximal buffer size (number of bits) of user $k$. When the buffer is full, i.e., $Q_k(t) = N_Q$, new bit arrivals will be dropped. The cardinality of the global DQSI space is $(N_Q + 1)^K$. Given a new arrival $A_k(t)$ at the end of frame $t$, the queue dynamics of transmitter $k$ are given by:

$Q_k(t+1) = \min\left\{ \left[ Q_k(t) - R_k(t)\tau \right]^+ + A_k(t),\; N_Q \right\}$ (3)

where $R_k(t)$ is the achievable data rate for receiver $k$ at frame $t$ given in (2), and $[x]^+ \triangleq \max\{x, 0\}$.
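The queue recursion (3) is a one-line update. A minimal sketch (function name and units are illustrative):

```python
def data_queue_update(q_bits, rate_bps, tau_s, arrival_bits, buf_bits):
    """One step of (3): serve for tau seconds, then admit new arrivals,
    truncating at the finite buffer size (excess bits are dropped)."""
    backlog = max(q_bits - rate_bps * tau_s, 0.0)   # [Q - R*tau]^+
    return min(backlog + arrival_bits, buf_bits)    # min{ . , N_Q }
```

For example, a queue of 100 bits served at 1 kbit/s for a 50 ms frame and then receiving 30 new bits ends the frame at 80 bits.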
II-C Power Consumption Model with Renewable Energy Source
The transmission power of each node comes from both the AC power source and the renewable energy source. Specifically, each transmitter is assumed to be capable of harvesting energy from the environment, e.g., using solar panels [17, 21]. However, the amount of harvestable energy in a frame is random. Let $X_k(t)$ be the harvestable energy (Joule) at transmitter $k$ during the $t$-th scheduling frame. Note that the harvestable energy can be interpreted as the energy arrival at the $t$-th frame.
Assumption 3 (Random Renewable Energy Model)
The random process $X_k(t)$ is i.i.d. over scheduling frames and is distributed according to a general distribution with mean renewable energy $\mathbb{E}[X_k] = \bar{X}_k$. Furthermore, the random processes are independent w.r.t. $k$.
∎
Let $\mathbf{E}(t) = (E_1(t), \dots, E_K(t))$ denote the global EQSI in the system, where $E_k(t)$ represents the renewable energy level at the energy storage of the $k$-th transmitter at the beginning of frame $t$. Let $N_E$ denote the maximum energy queue buffer size (i.e., energy storage capacity in Joule) of user $k$. When the energy buffer is full, i.e., $E_k(t) = N_E$, additional energy cannot be harvested. Given an energy arrival of $X_k(t)$ at the end of frame $t$, the energy queue dynamics of transmitter $k$ are given by:

$E_k(t+1) = \min\left\{ E_k(t) - \tilde{P}_k(t)\tau + X_k(t),\; N_E \right\}$ (4)

where $\tilde{P}_k(t)$ is the renewable power consumption, which must satisfy the following energy-availability constraint^{2} (we consider a discrete-time system with fixed time step $\tau$; hence $E_k(t)$ represents the energy level at the renewable energy storage of the $k$-th transmitter at the beginning of frame $t$, and $\tilde{P}_k(t)\tau$ is the renewable energy consumption; as a result, the energy consumed from the renewable energy storage cannot be larger than the total energy available in it):

$\tilde{P}_k(t)\tau \leq E_k(t)$ (5)
The power consumption is contributed not only by the transmission power of the power amplifier (PA) but also by the circuit power of the RF chains (such as the mixers, synthesizers and digital-to-analog converters). Furthermore, the circuit power $P_{cct}$ is constant irrespective of the transmission data rate. Therefore, the total power consumption of user $k$ at the $t$-th frame is given by

$\Lambda_k(t) = P_k(t) + P_{cct}$ (6)

Note that in practice, due to the random nature of the renewable energy and the limited renewable energy storage capacity, the renewable source can be used only as a supplementary form of power rather than completely replacing the AC utility power. To support a total power consumption of $\Lambda_k(t)$, we can have power circuitry [12, 13] to control the contributions from the AC utility as well as the renewable energy storage, as illustrated in Fig. 1. This is similar in concept to hybrid cars, where the power is contributed by both the gas engine and the battery. As a result, the total power consumption is given by $\Lambda_k(t) = \hat{P}_k(t) + \tilde{P}_k(t)$, where $\hat{P}_k(t)$ is the AC power consumption. Given $\hat{P}_k(t)$ and $\tilde{P}_k(t)$, the transmission power is given by:

$P_k(t) = \hat{P}_k(t) + \tilde{P}_k(t) - P_{cct}$ (7)
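The interplay between the energy-availability constraint (5), the energy queue recursion (4), and the hybrid AC/renewable supply can be sketched as below. The renewable-first split rule here is an illustrative assumption; in the paper, the split is an optimization variable of the control policy:

```python
def split_power(total_w, e_joule, tau_s):
    """Cover as much of the total consumption as constraint (5) allows
    from the renewable store; the AC utility supplies the remainder."""
    p_renew = min(total_w, e_joule / tau_s)   # enforce P~ * tau <= E
    p_ac = total_w - p_renew
    return p_ac, p_renew

def energy_queue_update(e_joule, p_renew_w, tau_s, harvest_j, cap_j):
    """One step of (4): consume renewable energy, then add the harvested
    energy, truncating at the storage capacity."""
    return min(e_joule - p_renew_w * tau_s + harvest_j, cap_j)
```

For example, with 0.05 J stored and a 50 ms frame, at most 1 W can come from the renewable store, and the rest of the demand is drawn from the AC supply.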
III Delay Optimal Power Control
III-A Control Policy and Resource Constraints
We define $\chi(t) = (\mathbf{H}(t), \mathbf{Q}(t), \mathbf{E}(t))$ as the global system state, and $\chi_k(t) = (\mathbf{H}_k(t), Q_k(t), E_k(t))$ as the local system state of the $k$-th transmit node, where $\mathbf{H}_k$ is the LCSI^{3} (we denote the local CSI at the $k$-th transmit node as $\mathbf{H}_k$; however, in practice, the $k$-th transmit node only needs to observe the direct link gain and the total interference), $Q_k$ is the LDQSI and $E_k$ is the LEQSI. Based on the local system state $\chi_k(t)$, transmitter $k$ determines the power consumption using a control policy defined below, where $\mathcal{A}_{AC}$ and $\mathcal{A}_{RE}$ are the AC power allocation space and the renewable power allocation space (both with finite cardinality), respectively.
Definition 1 (Stationary Randomized Decentralized Power Control Policy)
A stationary randomized power control policy for user $k$, $\Omega_k$, is a mapping from the local system state $\chi_k$ to a probability distribution over the power allocation space $\mathcal{A}_{AC} \times \mathcal{A}_{RE}$, i.e., $\Omega_k(\chi_k) \in \mathcal{P}(\mathcal{A}_{AC} \times \mathcal{A}_{RE})$, where $\mathcal{P}(\cdot)$ denotes the space of joint probability distributions over the power allocations, and $\Pr[(\hat{P}_k, \tilde{P}_k) \,|\, \chi_k]$ denotes the probability of transmission powers $(\hat{P}_k, \tilde{P}_k)$. ∎
For simplicity, denote the joint control policy as $\Omega = (\Omega_1, \dots, \Omega_K)$. Note that the power allocation policy should satisfy the energy-availability constraint given in (5), i.e., given $\chi_k$, the probabilities of the transmission powers satisfy

$\Pr\left[ (\hat{P}_k, \tilde{P}_k) \,\middle|\, \chi_k \right] = 0 \quad \text{whenever} \quad \tilde{P}_k \tau > E_k$ (8)

Furthermore, $\Omega_k$ should meet the requirement of the circuit power consumption, i.e.,

$\hat{P}_k + \tilde{P}_k \geq P_{cct}$ whenever user $k$ transmits (9)

Finally, $\Omega_k$ should also satisfy the per-user average AC power consumption constraint:

$\mathbb{E}^{\Omega}\left[ \hat{P}_k \right] \leq \bar{P}_k$ (10)

where the expectation in (10) is taken w.r.t. the probability measure induced by the policy $\Omega$.
Remark 1 (Formulation with Two Optimization Variables)
While the "reward" of the system dynamics (the transmission rate in (2)) depends on the total transmission power only, this does not mean the problem can be formulated with just one variable (the total transmission power). We also have to look at the "cost" side. Although they add up to the total power consumption, the AC power $\hat{P}_k$ and the renewable power $\tilde{P}_k$ have different cost structures (and different constraints), as in (10) and (5), respectively. Hence, the problem with $\hat{P}_k$ and $\tilde{P}_k$ as variables cannot be transformed or reduced into a problem with the total power as the only variable (due to the constraints). ∎
III-B Parametrization of Control Policy and Dynamics of System State
In this paper, we consider parameterized stationary randomized policies, which are widely used in the literature [22, 19, 23, 24]. Specifically, the randomized policy can be parameterized by a vector $\theta_k$. For example, when a local system state realization $\chi_k$ is observed, the power consumption of transmit node $k$ is $(\hat{P}_k, \tilde{P}_k)$ with probability given by [23]:

$\Pr\left[ (\hat{P}_k, \tilde{P}_k) \,\middle|\, \chi_k; \theta_k \right] = \dfrac{\exp\left( \theta_k(\chi_k, \hat{P}_k, \tilde{P}_k) \right)}{\sum_{(\hat{P}', \tilde{P}')} \exp\left( \theta_k(\chi_k, \hat{P}', \tilde{P}') \right)}$ (11)

where $\theta_k(\chi_k, \hat{P}_k, \tilde{P}_k)$ is the parameter component selected (via the indicator function) by the state-action pair. As a result, the control policy is parameterized by $\theta_k$ and is denoted by $\Omega_k(\theta_k)$. Another possible parameterization is to use a neural network [22, 19], where the probability is given by:

$\Pr\left[ (\hat{P}_k, \tilde{P}_k) \,\middle|\, \chi_k; \theta_k \right] = \dfrac{\exp\left( \theta_k^T \phi(\chi_k, \hat{P}_k, \tilde{P}_k) \right)}{\sum_{(\hat{P}', \tilde{P}')} \exp\left( \theta_k^T \phi(\chi_k, \hat{P}', \tilde{P}') \right)}$ (12)

where $\theta_k$ is the parameter vector and $\phi(\cdot)$ is the prior basis function. Note that the dimension of the parameter vector is substantially reduced in this case.
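A softmax parameterization of the form (11)/(12) can be sketched as follows. The table-lookup `theta` keyed by (state, action) pairs is one illustrative realization, not the paper's exact implementation:

```python
import math
import random

def action_probs(theta, state, actions):
    """Randomized policy: softmax over one parameter per (state, action)
    pair; unseen pairs default to 0, giving a uniform distribution."""
    logits = [theta.get((state, a), 0.0) for a in actions]
    m = max(logits)                        # stabilize the exponentials
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def sample_action(theta, state, actions, rng):
    """Draw a power-allocation action from the randomized policy."""
    r, acc = rng.random(), 0.0
    for a, p in zip(actions, action_probs(theta, state, actions)):
        acc += p
        if r <= acc:
            return a
    return actions[-1]
```

Raising the parameter of one state-action pair increases the probability that this action is drawn in that state, which is what the gradient updates of Section IV exploit.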
For a given stationary parameterized control policy $\Omega(\theta)$, the induced random process $\{\chi(t)\}$ is a controlled Markov chain with transition probability

$\Pr\left[ \chi(t+1) \,\middle|\, \chi(t), \Omega(\theta) \right] = \Pr\left[ \mathbf{H}(t+1) \right] \Pr\left[ \mathbf{Q}(t+1), \mathbf{E}(t+1) \,\middle|\, \chi(t), \Omega(\theta) \right]$ (13)

where the joint data and energy queue transition probability (14) is induced by the queue dynamics in (3) and (4), with $R_k(t)$ the achievable data rate of receiver $k$ given in (2) under the realized power allocation. Note that it is not sufficient to specify the evolution of the joint process $\{\chi(t)\}$ by just describing the measures of the individual local processes $\{\chi_k(t)\}$. This is because the individual state processes are not independent and there is mutual coupling.
Given a unichain policy $\Omega(\theta)$, the induced Markov chain is ergodic and there exists a unique steady-state distribution $\pi_{\theta}$ [11]. The average delay utility of user $k$ under a unichain policy $\Omega(\theta)$ is given by:

$\bar{D}_k(\theta) = \mathbb{E}^{\pi_{\theta}}\left[ f(Q_k) \right]$ (15)

where $f(\cdot)$ is a monotonically increasing utility function of $Q_k$. For example, when $f(Q_k) = Q_k / \lambda_k$ (with $\lambda_k$ the average arrival rate), using Little's Law [11], $\bar{D}_k$ is the average delay^{4} of user $k$ (since the buffer size is finite, $\bar{D}_k$ is the average delay when the packet drop rate due to buffer overflow is negligible; in practice our target drop rate is small, so $\bar{D}_k$ is a good approximation of the average delay; furthermore, this approximation is asymptotically tight as the data buffer size increases). When $f(Q_k) = \mathbf{1}[Q_k \geq Q^o]$, $\bar{D}_k$ is the queue outage probability^{5} (the probability that the queue state exceeds a threshold $Q^o$). Since $\lambda_k$ is a constant, the average delay is proportional to the average queue length.
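The Little's law relation behind using the scaled queue length as the delay utility can be checked on a toy finite-buffer queue. The Bernoulli arrival/service model below is an illustrative assumption:

```python
import random

def avg_delay_frames(lam, mu, buf, frames, seed=0):
    """Estimate average delay (frames) of a finite-buffer queue as
    mean queue length / admitted arrival rate (Little's law).
    Arrivals are Bernoulli(lam) packets/frame, service Bernoulli(mu)."""
    rng = random.Random(seed)
    q, q_sum, admitted = 0, 0, 0
    for _ in range(frames):
        if q > 0 and rng.random() < mu:
            q -= 1                         # serve one packet
        if rng.random() < lam and q < buf:
            q += 1                         # admit one packet
            admitted += 1
        q_sum += q
    return (q_sum / frames) / (admitted / frames)
```

As expected, pushing the load closer to the service rate inflates the average queue length, and hence the average delay.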
III-C Problem Formulation
Note that the stochastic dynamics of the data queues and the energy queues are coupled together via the control policy $\Omega$. In this paper, we consider two different decentralized control problems:
III-C1 DEC-POMDP Problem
In this case, all the transmitter nodes are cooperative, and we seek an optimal stationary control policy to minimize a common weighted sum of the delay utilities in (15). Since each control policy $\Omega_k$ is a function of the local system state $\chi_k$ only, the problem is a partially observed MDP, which is summarized below:
Problem 1 (Delay Optimal DEC-POMDP)
For some positive constants $\beta = (\beta_1, \dots, \beta_K)$, find a stationary control policy $\Omega$ that minimizes:

$\min_{\Omega} \; \sum_{k=1}^{K} \beta_k \bar{D}_k(\Omega) \quad \text{subject to the average AC power constraint (10)}$ (16)

where $\sum_k \beta_k f(Q_k)$ is the joint per-stage utility. The positive constants $\beta_k$ indicate the relative importance of the users, and for the given $\beta$, the solution to (16) corresponds to a Pareto optimal point of the multi-objective optimization problem $\min_{\Omega} \left( \bar{D}_1(\Omega), \dots, \bar{D}_K(\Omega) \right)$. ∎
Note that the average AC power constraint is commonly used in many existing studies [7, 10] and is very relevant in practice (because the electricity bill is charged based on the average AC power consumption multiplied by the time of usage). The motivation of Problem 1 is to optimize the delay performance under an average cost constraint (AC power) by fully utilizing the free renewable energy. Problem 1 is also equivalent to minimizing the average AC power consumption subject to an average delay constraint, because the two problems have the same Lagrangian function.
III-C2 Non-Cooperative POSG Problem
In this case, the transmitter nodes are noncooperative, and we formulate the delay utility minimization problem as a noncooperative partially observable stochastic game (POSG), in which user $k$ competes against the others by choosing his power allocation policy $\Omega_k$ to minimize his own average delay utility selfishly. Specifically, the noncooperative POSG is formulated as Problem 2.
Problem 2 (Delay Optimal Non-Cooperative POSG)
For each transmitter $k$, find a stationary control policy $\Omega_k$ that minimizes:

$\min_{\Omega_k} \; \bar{D}_k(\Omega_k, \Omega_{-k}) \quad \text{subject to the average AC power constraint (10)}$ (17)

where $\Omega_{-k} = (\Omega_1, \dots, \Omega_{k-1}, \Omega_{k+1}, \dots, \Omega_K)$ is the set of all the users' policies except that of the $k$-th user. ∎
The local equilibrium solutions of the noncooperative POSG (17) are formally defined as follows.
Definition 2 (Local Equilibrium of Non-Cooperative POSG)
A profile of power allocation policies $(\Omega_1(\theta_1^*), \dots, \Omega_K(\theta_K^*))$ is a local equilibrium of the game (17) if it satisfies the following fixed-point equations for some LMs $\gamma^* = (\gamma_1^*, \dots, \gamma_K^*)$:

$\nabla_{\theta_k} L_k(\theta_k^*, \theta_{-k}^*; \gamma_k^*) = 0, \quad \forall k$

where $L_k$ is the per-user Lagrangian function of user $k$'s constrained problem (17). ∎
Remark 2 (Interpretation of the Local Equilibrium)
Remark 3 (Comparison between the DECPOMDP and NonCooperative POSG Problems)
In Problem 1 (DEC-POMDP), the controller is decentralized at the transmitters, and they have access to the local system state only. Yet, the controllers are fully cooperative in the sense that they are designed to optimize a common objective function, where the per-stage utility is assumed to be known globally through message passing. As a result, they interact in a decentralized cooperative manner. On the other hand, in the noncooperative POSG formulation, the controllers are noncooperative in the sense that each controller is interested in optimizing its own delay utility function. Hence, they interact in a decentralized noncooperative manner. ∎
Note that the policies are reactive or memoryless, in that their choice of action is based only upon the current local observation. Furthermore, both the DEC-POMDP and the noncooperative POSG problem are NP-hard [26]. Instead of targeting global optimal solutions, we shall derive low-complexity iterative algorithms for local optimal solutions in the following sections.
IV Decentralized Solution for DEC-POMDP
In this section, we shall propose a decentralized online policy gradient update algorithm to find a local optimal solution of problem (16). The proposed solution has low complexity and does not require explicit knowledge of the CSI statistics, the random data source statistics or the renewable energy statistics.
IV-A Decentralized Stochastic Policy Gradient Update
We first define the Lagrangian function of problem (16) as

$L(\theta, \gamma) = \sum_{k=1}^{K} \left[ \beta_k \bar{D}_k(\theta) + \gamma_k \left( \mathbb{E}^{\Omega(\theta)}[\hat{P}_k] - \bar{P}_k \right) \right]$ (18)

where $\gamma = (\gamma_1, \dots, \gamma_K)$ is the LM vector w.r.t. the average AC power constraints of all the users. A local optimal solution of problem (16) should satisfy the following first-order necessary conditions [25]:

$\nabla_{\theta} L(\theta^*, \gamma^*) = 0, \quad \mathbb{E}^{\Omega(\theta^*)}[\hat{P}_k] \leq \bar{P}_k, \quad \gamma_k^* \left( \mathbb{E}^{\Omega(\theta^*)}[\hat{P}_k] - \bar{P}_k \right) = 0, \quad \forall k$ (19)
Define a reference state $\chi^o$^{6} (for example, we can fix $\chi^o$ to be any particular recurrent state without loss of optimality) and, using perturbation analysis [11, 22], the gradient $\nabla_{\theta_k} L(\theta, \gamma)$^{7} (note that a change of $\theta$ affects the Lagrangian function via the probability measure behind the expectation in (18); hence, deriving the gradient is highly nontrivial) is given in the following lemma.
Lemma 1 (Gradient of the Lagrangian Function)
The gradient of the Lagrangian function is given by

$\nabla_{\theta_k} L(\theta, \gamma) = \sum_{\chi} \pi_{\theta}(\chi) \sum_{a} \nabla_{\theta_k} \Pr[a \,|\, \chi; \theta] \, B(\chi, a)$ (20)

where $\pi_{\theta}(\chi)$ is the steady-state probability of state $\chi$ under the policy $\Omega(\theta)$, $\Pr[a \,|\, \chi; \theta]$ is the probability that the joint action $a$ is taken, and $B(\chi, a)$ is the expected differential accumulated per-stage Lagrangian utility starting from $(\chi, a)$:

$B(\chi, a) = \mathbb{E}\left[ \sum_{t=0}^{T^o - 1} \left( g(\chi(t), a(t)) - L(\theta, \gamma) \right) \,\middle|\, \chi(0) = \chi,\, a(0) = a \right]$ (21)

where $g(\cdot)$ denotes the per-stage Lagrangian utility and $T^o$ is the first future time that the reference state $\chi^o$ is visited. ∎
Please refer to Appendix A.
Note that brute-force solution of (19) requires huge complexity and knowledge of the CSI statistics, the random data source statistics and the renewable energy statistics. Based on Lemma 1, we shall propose a low-complexity decentralized online policy gradient update algorithm to obtain a solution of (19). Specifically, the key steps for decentralized online learning are given below.

Step 1, Initialization: Each transmitter $k$ initializes its local parameter $\theta_k$.

Step 2, Per-user Power Allocation: At the beginning of the $t$-th frame, each transmitter determines the transmission power allocation according to the policy $\Omega_k(\theta_k)$ based on the local system state $\chi_k(t)$, and transmits at the associated achievable data rate given in (2).

Step 3, Message Passing among the Transmitters^{8}: At the end of the $t$-th frame, each transmitter shares the per-user per-stage utility and the reference state indication, which equals 1 if the global state visits the reference state $\chi^o$ and 0 otherwise. (Note that the per-user per-stage utility includes not only the packet buffer state but also the control action. As a result, just broadcasting the nodes' buffer states is not enough to replace the per-user per-stage utility. Furthermore, if each user were to acquire the complete state information, the users would need to share both the buffer states and the CSI states, which would cause much more information exchange than per-user per-stage utility sharing. Table I summarizes the communication overhead of exchanging the per-stage utility versus sharing the buffer states and the CSI states.)

Step 4, Peruser Parameter Update: Based on the current local observation, each of the transmitters updates the local parameter according to Algorithm 1.

Step 5, Peruser LM Update: Based on the current local observation, each of the transmitters updates the local LMs according to Algorithm 1.
Fig. 2 illustrates the above procedure by a flowchart. The detailed algorithm for the local parameter and LM updates in Step 4 and Step 5 is given below:
Algorithm 1 (Online Learning Algorithm for Per-user Parameter and LM)
Let $\chi_k(t)$ be the current local system state, $(\hat{P}_k(t), \tilde{P}_k(t))$ be the current realization of the power allocation, $g_k(t)$ be the current realization of the per-stage utility, and $\mathbf{1}[\chi(t) = \chi^o]$ be the current realization of the reference state indication. The online learning algorithm at the $k$-th transmitter is given by

$\theta_k(t+1) = \theta_k(t) - \epsilon_t^{\theta} \, \widehat{\nabla}_{\theta_k} L(t)$ (22)

where $\widehat{\nabla}_{\theta_k} L(t)$ is the stochastic estimate of the gradient in (20) constructed from the local observations and the exchanged per-stage utilities, and

$\gamma_k(t+1) = \left[ \gamma_k(t) + \epsilon_t^{\gamma} \left( \hat{P}_k(t) - \bar{P}_k \right) \right]^+$ (23)

The stepsizes $\{\epsilon_t^{\theta}\}, \{\epsilon_t^{\gamma}\}$ are nonincreasing positive scalars satisfying $\sum_t \epsilon_t^{\theta} = \sum_t \epsilon_t^{\gamma} = \infty$, $\sum_t \left( (\epsilon_t^{\theta})^2 + (\epsilon_t^{\gamma})^2 \right) < \infty$, and $\epsilon_t^{\gamma} / \epsilon_t^{\theta} \to 0$. ∎
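The two-timescale structure of Algorithm 1 (fast parameter update, slow LM update with nonnegative projection, as in (22) and (23)) can be illustrated on a toy constrained problem: minimize (θ − 3)² subject to θ ≤ 1, with noisy gradient observations. The problem, stepsizes and noise level below are illustrative assumptions, not from the paper:

```python
import random

def two_timescale(iters=200_000, seed=0):
    """Primal-dual stochastic approximation: theta moves on the fast
    timescale t**(-0.6), gamma on the slow timescale 1/t, projected >= 0.
    Lagrangian: L = (theta - 3)**2 + gamma * (theta - 1)."""
    rng = random.Random(seed)
    theta, gamma = 0.0, 0.0
    for t in range(1, iters + 1):
        noise = rng.gauss(0.0, 0.1)                 # noisy gradient sample
        grad = 2.0 * (theta - 3.0) + gamma + noise  # dL/dtheta
        theta -= grad / t ** 0.6                    # fast primal descent
        gamma = max(gamma + (theta - 1.0) / t, 0.0) # slow projected dual ascent
    return theta, gamma

# At the KKT point the constraint is active: theta* = 1, gamma* = 4.
```

On the fast timescale, θ tracks the minimizer of the Lagrangian for the quasi-static γ; on the slow timescale, γ grows until the constraint is met, mirroring the convergence argument of the next subsection.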
Remark 4 (Features of the Learning Algorithm 1)
The learning algorithm requires only local observations, i.e., the local system state $\chi_k(t)$ at each transmit node, together with limited message passing, whose overhead is quite mild [27]. Both the per-user parameters and the LMs are updated simultaneously and distributively at each transmitter. Furthermore, the iteration is online and proceeds on the same timescale as the CSI and QSI variations. Finally, the solution does not require knowledge of the CSI distribution or the statistics of the arrival process or the renewable energy process, i.e., it is robust to model variations. ∎
IV-B Convergence Analysis
In this section, we shall establish the convergence proof of the proposed decentralized learning Algorithm 1. Since we have two different stepsize sequences $\{\epsilon_t^{\theta}\}$ and $\{\epsilon_t^{\gamma}\}$ with $\epsilon_t^{\gamma} / \epsilon_t^{\theta} \to 0$, the per-user parameter updates and the LM updates are done simultaneously but over two different timescales. During the per-user parameter update (timescale I), the LMs appear to be quasi-static [28], and the convergence analysis can be established over the two timescales separately. We first have the following lemma.
Lemma 2 (Convergence of Per-user Parameter Learning (Timescale I))
For given LMs $\gamma$, the iterations of the per-user parameters in the proposed learning Algorithm 1 converge almost surely to a stationary point $\theta^*(\gamma)$, which satisfies

$\nabla_{\theta} L(\theta^*(\gamma), \gamma) = 0$ (24)
Please refer to Appendix B.
On the other hand, during the LM update (timescale II), we have $\theta(t) \to \theta^*(\gamma)$ almost surely. Hence, during the LM update in (23), the per-user parameters are seen as almost equilibrated. The convergence of the LMs is summarized below.
Lemma 3 (Convergence of LM over Timescale II)
The iterations of the LMs converge almost surely to $\gamma^*$, where $\gamma^*$ satisfies the average AC power constraints of all the users in (10). ∎
Please refer to Appendix C.
Based on the above lemmas, we can summarize the convergence performance of the proposed learning algorithm in the following theorem.
Theorem 1 (Convergence of Online Learning Algorithm 1)
The iterations of the per-user parameters and LMs $(\theta(t), \gamma(t))$ in Algorithm 1 converge almost surely to a local optimal solution $(\theta^*, \gamma^*)$ satisfying the first-order necessary conditions in (19). Note that the required technical condition is a very mild one that is usually satisfied [28].
V Decentralized Solution for Non-Cooperative POSG Problem
In this section, we shall propose a decentralized online policy gradient update algorithm to find a local equilibrium of the noncooperative POSG problem. The proposed solution also has low complexity and does not require explicit knowledge of the CSI statistics, the random data source statistics or the renewable energy statistics.
V-A Decentralized Stochastic Policy Gradient Update
From (17), the Lagrangian function of user $k$ is given by

$L_k(\theta_k, \theta_{-k}; \gamma_k) = \bar{D}_k(\theta) + \gamma_k \left( \mathbb{E}^{\Omega(\theta)}[\hat{P}_k] - \bar{P}_k \right)$ (26)

where $\gamma_k$ is the LM w.r.t. the average AC power constraint of user $k$. Following a similar perturbation analysis as in Lemma 1, the gradient $\nabla_{\theta_k} L_k$ is given in the following lemma.
Lemma 4 (Gradient of the Per-user Lagrangian Function)
Based on Lemma 4, we shall propose a low-complexity decentralized online policy gradient update algorithm to obtain a local equilibrium. Specifically, the key steps for decentralized online learning are given below.

Step 1, Initialization: Each transmitter $k$ initializes its local parameter $\theta_k$.

Step 2, Per-user Power Allocation: At the beginning of the $t$-th frame, each transmitter determines the transmission power allocation according to the policy $\Omega_k(\theta_k)$ based on the local system state $\chi_k(t)$, and transmits at the associated achievable data rate given in (2).

Step 3, Message Passing among the Transmitters: At the end of the $t$-th frame, each transmitter shares the one-bit reference state indication, which equals 1 if the global state visits the reference state $\chi^o$ and 0 otherwise.

Step 4, Peruser Parameter Update: Based on the current local observation, each of the transmitters updates the local parameter according to Algorithm 2.

Step 5, Peruser LM Update: Based on the current local observation, each of the transmitters updates the local LMs according to Algorithm 2.
Fig. 3 illustrates the above procedure by a flowchart. The detailed algorithm for the local parameter and LM updates in Step 4 and Step 5 is given below:
Algorithm 2 (Online Learning Algorithm for Per-user Parameter and LM)
Let $\chi_k(t)$ be the current local system state, $(\hat{P}_k(t), \tilde{P}_k(t))$ be the current realization of the power allocation, and $\mathbf{1}[\chi(t) = \chi^o]$ be the current realization of the reference state indication. The online learning algorithm at the $k$-th transmitter is given by

$\theta_k(t+1) = \theta_k(t) - \epsilon_t^{\theta} \, \widehat{\nabla}_{\theta_k} L_k(t)$ (29)

where $\widehat{\nabla}_{\theta_k} L_k(t)$ is the stochastic estimate of the per-user gradient in Lemma 4 constructed from local observations, and

$\gamma_k(t+1) = \left[ \gamma_k(t) + \epsilon_t^{\gamma} \left( \hat{P}_k(t) - \bar{P}_k \right) \right]^+$ (30)
∎
Remark 5 (Features of the Learning Algorithm 2)
The learning algorithm requires only local observations, i.e., the local system state $\chi_k(t)$ at each transmit node, and one-bit message passing of the reference state indication. Both the per-user parameters and the LMs are updated simultaneously and distributively at each transmitter. Furthermore, the iteration is online and proceeds on the same timescale as the CSI and QSI variations. Finally, the solution does not require knowledge of the CSI distribution or the statistics of the arrival process or the renewable energy process, i.e., it is robust to model variations. ∎
V-B Convergence Analysis
In this section, we shall establish the convergence proof of the proposed decentralized learning Algorithm 2. Specifically, let $\theta = (\theta_1, \dots, \theta_K)$, and let $\Theta^*$ be the set of local equilibria of the game (17), i.e., each $\theta^* \in \Theta^*$ satisfies the fixed-point equations in Definition 2. The convergence performance of the proposed learning algorithm is given in the following theorem.
Theorem 2 (Convergence of Online Learning Algorithm 2)
Suppose is not empty. The iterations of the per-user parameter in the proposed learning algorithm (Algorithm 2) will converge almost surely to an invariant set given by
(31) 
as , for some positive constant and some . ∎
Please refer to Appendix D for the proof.
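For context, almost-sure convergence results of this type typically rest on standard diminishing step-size conditions from stochastic approximation theory; a generic statement of the usual assumptions is given below (the paper's exact technical conditions are stated in Appendix D and may differ):

```latex
% Standard step-size conditions for almost-sure convergence of
% stochastic-approximation iterations with step sizes \epsilon_t:
\sum_{t=0}^{\infty} \epsilon_t = \infty,
\qquad
\sum_{t=0}^{\infty} \epsilon_t^{2} < \infty .
```

Intuitively, the first condition lets the iterates travel arbitrarily far if needed, while the second suppresses the accumulated noise variance.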
VI Simulations
In this section, we shall compare the performance of the proposed decentralized solutions against various existing decentralized baseline schemes.

Baseline 1, Orthogonal Transmission: The transmissions between the pairs are coordinated using TDMA so that there is no interference among the users. Both the AC and renewable power consumption are adaptive to LCSI and LEQSI only by optimizing the sum throughput as in [16].

Baseline 2, LCSI and LEQSI Only Strategy: The transmitters send data to their desired receivers simultaneously, sharing the same spectrum. Both the AC and renewable power consumption are adaptive to LCSI and LEQSI only by optimizing the sum throughput as in [16].

Baseline 3, Greedy Strategy: The transmitters send data to their desired receivers simultaneously, sharing the same spectrum. The transmitters consume all the available renewable energy at each frame (emptying the renewable energy buffer at each frame), and the AC power consumption is adaptive to LCSI only by optimizing the sum throughput.
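Baseline 3's per-frame allocation rule can be sketched in a few lines; the helper name and its interface are hypothetical, and the CSI-adaptive AC power is taken as a given input rather than computed:

```python
def greedy_allocation(energy_buffer, ac_power_from_csi):
    """Baseline 3 sketch: drain the renewable energy buffer every frame and
    add CSI-adaptive AC power. Returns (total transmit power, remaining buffer).
    """
    renewable_power = energy_buffer      # empty the buffer this frame
    total_power = renewable_power + ac_power_from_csi
    return total_power, 0.0              # buffer is left empty
```

This makes the contrast with the proposed schemes explicit: the greedy strategy ignores the data-queue state and never saves renewable energy for future frames.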
In the simulation, we consider a symmetric system where as in [6]. The long-term path loss for the desired link is 15dB, which corresponds to a cell size of 5.6km [29]. The static circuit power is (Watt) [30]. We assume Poisson packet arrival (note that the proposed algorithm works for generic packet and renewable energy arrival models as depicted in Definition 2 and Definition 3; the Poisson model is used for simulation illustration only) with average arrival rate (packet/s) and exponentially distributed random packet size with mean 2Mbits. The scheduling frame duration is 50ms, and the total BW is 1MHz. The maximum data queue buffer size is 5 (packets). Furthermore, we consider Poisson energy arrival with average arrival rate (Watt) as in [16], and the renewable energy is stored in a 1.2V 20Ah lithium-ion battery. The AC power allocation space and the renewable power allocation space are given by (Watt). The average delay is considered as our utility (), and the randomized policy is parameterized in the form given by (11).
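The coupled data-queue and energy-queue dynamics underlying the simulation can be sketched as follows, assuming Poisson packet and energy arrivals and finite buffers. The buffer sizes, rates, and the one-packet-per-frame service rule are illustrative assumptions, not the paper's exact parameters:

```python
import math
import random

def simulate_queues(frames=1000, arrival_rate=0.3, energy_rate=0.2,
                    data_buf=5, energy_buf=10, seed=0):
    """Sketch of the data-queue / energy-queue evolution: Poisson packet and
    energy arrivals, finite buffers, and a crude serve-one-packet-if-energized
    service model. Returns the time-averaged queue length, which relates to
    average delay via Little's law."""
    rng = random.Random(seed)

    def poisson(lam):
        # Knuth's method for sampling a Poisson variate with small rate
        limit, k, p = math.exp(-lam), 0, 1.0
        while True:
            p *= rng.random()
            if p <= limit:
                return k
            k += 1

    q, e, total_q = 0, 0, 0
    for _ in range(frames):
        served = 1 if (q > 0 and e > 0) else 0   # crude service model
        q = max(0, q - served)
        e = max(0, e - served)
        q = min(data_buf, q + poisson(arrival_rate))   # finite data buffer
        e = min(energy_buf, e + poisson(energy_rate))  # finite energy buffer
        total_q += q
    return total_q / frames
```

Packets (and energy) arriving at a full buffer are dropped, which is why the finite buffer sizes matter for the delay results in the following subsections.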
VI-A Delay Performance w.r.t. the AC Power Consumption
Fig. 4 illustrates the average delay per user versus the AC power consumption . The average data arrival rate is , and the energy arrival rate is . The average delay of all the schemes decreases as the AC power consumption increases, and the proposed schemes achieve significant performance gain over all the baselines. This gain is contributed by the DQSI- and EQSI-aware dynamic power allocation. Furthermore, it can also be observed that the solution to the noncooperative POSG problem has similar performance as the solution to the DEC-POMDP problem.
VI-B Delay Performance w.r.t. Number of Power Control Levels
Fig. 5 illustrates the average delay per user versus the number of power control levels that lie between 0 and 1.5kW. The average data arrival rate is , the energy arrival rate is , and the average AC power consumption is . The average delay of the proposed schemes decreases as the number of power control levels increases, yet the performance improvement is marginal. It can also be observed that there is significant performance gain with the proposed schemes compared with all the baselines, and the solution to the noncooperative POSG problem has similar performance as the solution to the DEC-POMDP problem.
VI-C Delay Performance w.r.t. Renewable Energy Buffer Size
Fig. 6 illustrates the average delay per user versus the renewable energy buffer size . Specifically, we consider lithium-ion battery capacities ranging from 1.2V 10Ah to 40Ah. The average data arrival rate is , the energy arrival rate is , and the average AC power consumption is . It can be observed that the proposed schemes achieve significant performance gain over all the baselines at any given renewable energy buffer size.
VI-D Convergence Performance
Fig. 7 illustrates the convergence property of the proposed schemes. We plot the randomized power control policy versus the scheduling frame index for the DEC-POMDP and noncooperative POSG problems, respectively. The average data arrival rate is , the energy arrival rate is , and the average AC power consumption is . It can be observed that the convergence rate of the online algorithm is quite fast. For example, the delay performance of the proposed schemes already outperforms that of all the baselines at the 2500th scheduling frame. Furthermore, the delay performance at the 2500th scheduling frame is already quite close to the converged average delay.
VII Conclusion
In this paper, we consider the decentralized delay minimization for interference networks with limited renewable energy storage. Specifically, the transmitters are capable of harvesting energy from the environment, and the transmission power of a node comes from both the conventional utility power (AC power) and the renewable energy source. We consider two decentralized delay optimization formulations, namely the DEC-POMDP and the noncooperative POSG, where the control policy is adaptive to the local system states (LCSI, LDQSI and LEQSI) only. In the DEC-POMDP formulation, the controllers interact in a cooperative manner, and the proposed decentralized policy gradient solution converges almost surely to a local optimal point under some mild technical conditions. In the noncooperative POSG formulation, the transmitter nodes are noncooperative. We extend the decentralized policy gradient solution and establish the technical proof for almost-sure convergence of the learning algorithms. In both cases, the solutions are very robust to model variations. Finally, the delay performance of the proposed solutions is compared with conventional baseline schemes for interference networks, and it is illustrated that substantial delay performance gain and energy savings can be achieved by incorporating the CSI, DQSI and EQSI in the power control design.
Appendix A Proof of Lemma 1
From the perturbation analysis [11, 22] in MDP, the gradient (the notation of is ignored in this section for simplicity) is given by
(32) 
where satisfies the following Bellman (Poisson) equation
(33) 
Since for every , we have
(34) 
Taking the conditional expectation (conditioned on ) on both sides of (33), we have the following equivalent Bellman equation