# Decentralized Delay Optimal Control for Interference Networks with Limited Renewable Energy Storage

Huang Huang, Member, IEEE, Vincent K. N. Lau, Fellow, IEEE. The authors are with the Department of Electronic and Computer Engineering (ECE), Hong Kong University of Science and Technology (HKUST), Hong Kong.

###### Abstract

In this paper, we consider delay minimization for interference networks with renewable energy sources, where the transmission power of a node comes from both the conventional utility power (AC power) and the renewable energy source. We assume the transmission power of each node is a function of the local channel state, local data queue state and local energy queue state only. We consider two delay optimization formulations, namely the decentralized partially observable Markov decision process (DEC-POMDP) and the non-cooperative partially observable stochastic game (POSG). In the DEC-POMDP formulation, we derive a decentralized online learning algorithm to determine the control actions and Lagrangian multipliers (LMs) simultaneously, based on the policy gradient approach. Under some mild technical conditions, the proposed decentralized policy gradient algorithm converges almost surely to a local optimal solution. On the other hand, in the non-cooperative POSG formulation, the transmitter nodes are non-cooperative. We extend the decentralized policy gradient solution and establish the technical proof for almost-sure convergence of the learning algorithms. In both cases, the solutions are very robust to model variations. Finally, the delay performance of the proposed solutions is compared with conventional baseline schemes for interference networks, and it is illustrated that substantial delay performance gain and energy savings can be achieved.

## I Introduction

Interference channels have recently attracted intense research interest. In [1, 2], the authors show that interference alignment (using infinite-dimension symbol extension in time or frequency selective fading channels) can achieve the optimal degrees of freedom (DoF), and the total capacity of the $K$-user interference channel is given by $\frac{K}{2}\log(\mathrm{SNR}) + o(\log(\mathrm{SNR}))$. In [3, 4], the authors consider joint beamforming to minimize the weighted sum MMSE or maximize the SINR of $K$-pair MIMO interference channels using optimization approaches. In [5, 6], the authors considered decentralized beamforming design for MIMO interference networks using non-cooperative games and studied the sufficient conditions for the existence and convergence of the Nash Equilibrium (NE). However, all of these works assume that there are infinite backlogs at the transmitters, and focus on the maximization of physical layer throughput. In practice, applications are delay sensitive, and it is critical to optimize the delay performance in the interference network.

A design framework that accounts for both queueing delay and physical layer performance is not trivial, as it involves both queueing theory (to model the queueing dynamics) and information theory (to model the physical layer dynamics) [7]. The simplest approach is to convert the delay constraints into an equivalent average rate constraint using tail probability (large deviation theory), and solve the optimization problem using a purely information theoretical formulation based on the equivalent rate constraint [8]. However, the control policy thus derived is a function of the channel state information (CSI) only, and it fails to exploit data queue state information (DQSI) in the adaptation process. The Lyapunov drift approach is also widely used in the literature [9] to study the queue stability region of different wireless systems and to establish the throughput optimal control policy (in the stability sense). A systematic approach to delay-optimal resource control in the general delay regime is based on the Markov decision process (MDP) technique [10, 7, 11]. However, the brute-force solution of an MDP is usually very complex (owing to the curse of dimensionality), and extension to multi-flow problems in interference networks is highly non-trivial.

Another interesting dimension that has been ignored by most of the above works is the inclusion of a renewable energy source at the transmit nodes. There has been intense research interest in exploiting renewable energy in communication network designs [12, 13, 14, 15]. In [12, 13], the authors presented an optimal energy management policy for a solar-powered device that uses a sleep and wake up strategy for energy conservation in wireless sensor networks. In [14], the authors developed a solar energy prediction algorithm to estimate the amount of energy harvested by solar panels, in order to deploy power-efficient task management methods on solar energy-harvesting wireless sensor nodes. In [15], the author proposed a power management scheme under the assumption that the harvested energy satisfies performance constraints at the application layer. However, in all these works, the delay requirements of applications have been completely ignored. Furthermore, the renewable energy source can act as a low cost supplement to the conventional utility power source in communication networks. Yet, there are various technical challenges regarding delay optimal design for interference networks with a renewable energy source.

• Randomness of Renewable Energy Source: Recent developments in hardware design have made energy harvesting possible in wireless communication networks [16, 17]. For example, we have solar-powered base stations available from various telecommunication vendors [17]. While the renewable energy source may appear to be completely free, there are various challenges involved to fully capture its advantage. For instance, the renewable energy sources are random in nature and energy storage is needed to buffer the unstable supply of renewable energy. Yet, the cost of energy storage depends heavily on the associated capacity [18]. For limited capacity energy storage, the transmission power allocation should be adaptive to the CSI, the DQSI as well as the energy queue state information (EQSI). The CSI, DQSI and EQSI provide information regarding the transmission opportunity, the urgency of the data flows, and the available renewable energy, respectively. It is highly non-trivial to strike a balance among these factors in the optimization.

• Decentralized Delay Minimization: The existing works on throughput or DoF optimization in the interference network [1-6] require global knowledge of CSI, which leads to heavy backhaul signaling overhead and high computational complexity for the central controller. For delay minimization with a renewable energy source, the entire system state is characterized by the global CSI (CSI from any transmitter to any receiver), the global DQSI (data queue length of all users), and the global EQSI (energy queue length of all users). Therefore, the centralized solution (which requires global CSI, DQSI and EQSI) will also induce substantial signaling overhead, which is not practical. It is desirable to have decentralized control based on local observations only. However, due to the partial observation of the system state in decentralized designs, existing solutions of the MDP approach cannot be applied to our problem.

• Algorithm Convergence Issue: In conventional iterative solutions for deterministic network utility maximization (NUM) problems, the updates in the iterative algorithms (such as subgradient search) are performed within the coherence time of the CSI (i.e., the CSI remains quasi-static during the iteration updates) [5, 6]. When we consider delay minimization, the problem is stochastic and the control actions are defined over ergodic realizations of the system states (CSI, DQSI and EQSI). Furthermore, the restriction of partial observation of system states in decentralized control further complicates the problem. As a result, the convergence proof of the decentralized stochastic algorithm is highly non-trivial.

In this paper, we consider delay minimization for interference networks with a renewable energy source. The transmitters are capable of harvesting energy from the environment, and the transmission power of a node comes from both the conventional utility power (AC power) and the renewable energy source. For decentralized control, we assume the transmission power of each node is adaptive to the local system states only, namely the local CSI (LCSI), the local DQSI (LDQSI) and the local EQSI (LEQSI). We consider two delay optimization formulations: the decentralized partially observable MDP (DEC-POMDP), which corresponds to a cooperative stochastic game setup (where the users cooperatively share a common system utility), and the non-cooperative partially observable stochastic game (POSG), which corresponds to a non-cooperative stochastic game setup (where each user has a different and selfish utility). In the DEC-POMDP formulation, the transmitters are fully cooperative and we derive a decentralized online learning algorithm to determine the control actions and the Lagrangian multipliers (LMs) simultaneously based on the policy gradient approach [11, 19]. Under some mild technical conditions, the proposed decentralized policy gradient algorithm converges almost surely to a local optimal solution. On the other hand, in the non-cooperative POSG formulation, the transmitters are non-cooperative (that is, each transmitter optimizes its own utility in a selfish manner), and we extend the decentralized policy gradient algorithm and establish the technical proof for almost-sure convergence of the learning algorithms. In both cases, the solutions do not require explicit knowledge of the CSI statistics, the random data source statistics or the renewable energy statistics. Therefore, the solutions are very robust to model variations. Finally, the delay performance of the proposed solutions is compared with conventional baseline schemes for interference networks, and it is illustrated that substantial delay performance gain and energy savings can be achieved by incorporating the CSI, DQSI and EQSI in the power control design.

## II System Model

We consider $K$-pair interference channels sharing a common spectrum with bandwidth $W$ Hz as illustrated in Fig. 1. Specifically, each transmitter maintains a data queue for the random traffic flow towards the desired receiver in the system. Furthermore, the transmitters are fixed base stations but the receivers can be mobile. The time dimension is partitioned into scheduling frames (each lasting $\tau$ seconds). In the following subsections, we shall elaborate the physical layer model, the random data source model as well as the renewable energy source model.

### II-A Physical Layer Model

The signal received at the $k$-th receiver is given by:

$$y_k = \underbrace{\sqrt{P_k L_{kk} H_{kk}}\,x_k}_{\text{desired signal}} + \underbrace{\sum_{n \neq k}\sqrt{P_n L_{kn} H_{kn}}\,x_n}_{\text{interference}} + z_k, \tag{1}$$

where $L_{kn}$ and $H_{kn}$ are the long term path loss and the microscopic channel fading gain, respectively, from the $n$-th transmitter to the $k$-th receiver. $P_k$ is the total transmission power of the $k$-th transmitter. $x_k$ is the information symbol sent by the $k$-th transmitter, and $z_k$ is the additive white Gaussian noise with variance $N_0 W$. For notational convenience, we define the global CSI as $\mathbf{H} = \{H_{kn}: \forall k, n\}$. Furthermore, the assumption on the channel model is given as follows.

###### Assumption 1 (Channel Model)

We assume that the global CSI is quasi-static in each frame. Furthermore, is i.i.d. over the scheduling frame according to a general distribution with and is independent w.r.t. . The path loss remains constant for the duration of the communication session.   ∎

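Given the transmission powers of all the transmitters, the transmit data rate $R_k$ is given by:

$$R_k \le W \log_2\left(1 + \frac{\xi P_k^{tx} L_{kk} H_{kk}}{\sum_{n \neq k} P_n L_{kn} H_{kn} + N_0 W}\right), \tag{2}$$

where $\xi$ is a constant. Note that (2) can be used to model both uncoded and coded systems [20]. For example, $\xi = -\frac{1.5}{\ln(5\epsilon)}$ for QAM constellation at BER $= \epsilon$, and $\xi = 1$ for capacity achieving coding (in which case (2) corresponds to the instantaneous mutual information).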
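As a minimal sketch of how the right-hand side of (2) can be evaluated for one link, consider the following Python function. The function name `rate_bound`, the list-based argument layout and all numerical values are illustrative assumptions and are not part of the system model.

```python
import math

def rate_bound(k, p, L, H, W, N0, xi=1.0):
    """Right-hand side of (2) for link k, in bits per second.

    p[n]    : transmission power of transmitter n
    L[k][n] : path loss from transmitter n to receiver k
    H[k][n] : fading gain from transmitter n to receiver k
    xi      : the constant in (2); 1.0 is assumed here (capacity achieving coding)
    """
    signal = xi * p[k] * L[k][k] * H[k][k]
    interference = sum(p[n] * L[k][n] * H[k][n]
                       for n in range(len(p)) if n != k)
    return W * math.log2(1.0 + signal / (interference + N0 * W))

# Hypothetical two-link example (numbers chosen only for illustration).
p = [1.0, 1.0]
L = [[1.0, 0.1], [0.1, 1.0]]
H = [[1.0, 1.0], [1.0, 1.0]]
r0 = rate_bound(0, p, L, H, W=1.0, N0=0.1)
```

The interference term couples the links: raising the power of the other transmitter lowers the bound for link 0, which is why the per-user power decisions cannot be analyzed in isolation.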

### II-B Random Data Source Model and Data Queue Dynamics

Let $\mathbf{A}(t) = \{A_k(t)\}$ be the random new arrivals (number of bits) at the transmitters at the end of the $t$-th scheduling frame.

###### Assumption 2 (Random Data Source Model)

The arrival process $A_k(t)$ is i.i.d. over the scheduling frames and is distributed according to a general distribution $\Pr\{A_k\}$ with average arrival rate $\mathbb{E}[A_k] = \lambda_k$. Furthermore, the random arrival process is independent w.r.t. $k$.   ∎

Let $\mathbf{Q}(t) = \{Q_k(t)\}$ denote the global DQSI in the system, where $Q_k(t)$ represents the number of bits in the queue of transmitter $k$ at the beginning of frame $t$. $N_k^Q$ denotes the maximal buffer size (number of bits) of user $k$. When the buffer is full, i.e., $Q_k(t) = N_k^Q$, new bit arrivals will be dropped. The cardinality of the global QSI is $\prod_k (N_k^Q + 1)$. Given a new arrival $A_k(t)$ at the end of frame $t$, the queue dynamics of transmitter $k$ is given by:

$$Q_k(t+1) = \min\left\{\left[Q_k(t) - R_k(t)\tau\right]^+ + A_k(t),\ N_k^Q\right\}, \tag{3}$$

where $R_k(t)$ is the achievable data rate for receiver $k$ at frame $t$ given in (2), and $[x]^+ = \max\{x, 0\}$.
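The queue update in (3) can be sketched in a few lines of Python. The function name `next_data_queue` and its argument names are hypothetical; the logic is simply "serve, then add arrivals, then cap at the buffer size".

```python
def next_data_queue(q, rate, tau, arrivals, n_q):
    """One step of the data queue dynamics in (3).

    q        : bits in the queue at the beginning of the frame
    rate     : achievable data rate in the frame (bits per second)
    tau      : frame duration (seconds)
    arrivals : bits arriving at the end of the frame
    n_q      : maximal buffer size (bits)
    """
    remaining = max(q - rate * tau, 0.0)   # departures cannot exceed the backlog
    return min(remaining + arrivals, n_q)  # bits beyond the buffer size are dropped
```

For instance, a queue of 18 bits with no service and 5 arriving bits is capped at a buffer size of 20, so 3 bits are dropped in that frame.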

### II-C Power Consumption Model with Renewable Energy Source

The transmission power of each node comes from both the AC power source and the renewable energy source. Specifically, the transmitter is assumed to be capable of harvesting energy from the environment, e.g., using solar panels [17, 21]. However, the amount of harvestable energy in a frame is random. Let $\mathbf{X}(t) = \{X_k(t)\}$ be the harvestable energy (Joule) at the transmitters during the $t$-th scheduling frame. Note that the harvestable energy $X_k(t)$ can be interpreted as the energy arrival at the $t$-th frame.

###### Assumption 3 (Random Renewable Energy Model)

The random process $X_k(t)$ is i.i.d. over the scheduling frames and is distributed according to a general distribution $\Pr\{X_k\}$ with mean renewable energy $\bar{X}_k = \mathbb{E}[X_k]$. Furthermore, the random process is independent w.r.t. $k$.

Let $\mathbf{E}(t) = \{E_k(t)\}$ denote the global EQSI in the system, where $E_k(t)$ represents the renewable energy level at the energy storage of the $k$-th transmitter at the beginning of frame $t$. Let $N_k^E$ denote the maximum energy queue buffer size (i.e., energy storage capacity in Joule) of user $k$. When the energy buffer is full, i.e., $E_k(t) = N_k^E$, additional energy cannot be harvested. Given an energy arrival of $X_k(t)$ at the end of frame $t$, the energy queue dynamics of transmitter $k$ is given by:

$$E_k(t+1) = \min\left\{E_k(t) - P_{k,e}(t)\tau + X_k(t),\ N_k^E\right\}, \tag{4}$$

where $P_{k,e}(t)$ is the renewable power consumption that must satisfy the following energy-availability constraint:

$$P_{k,e}(t)\tau \le E_k(t), \quad \forall k. \tag{5}$$

We consider a discrete time system with fixed time step $\tau$. Hence, $E_k(t)$ represents the energy level at the renewable energy storage of the $k$-th transmitter at the beginning of frame $t$, and $P_{k,e}(t)\tau$ is the renewable energy consumption. As a result, $P_{k,e}(t)\tau$ (energy consumed from the renewable energy storage) cannot be larger than $E_k(t)$ (total energy available from the renewable energy storage).

The power consumption is contributed by not only the transmission power of the power amplifier (PA) but also the circuit power of the RF chains (such as the mixers, synthesizers and digital-to-analog converters). Furthermore, the circuit power $P_{cct}$ is constant irrespective of the transmission data rate. Therefore, the total power consumption of user $k$ at the $t$-th frame is given by

$$P_k(t) = P_k^{tx}(t) + P_{cct} \cdot \mathbf{1}(P_k^{tx} > 0). \tag{6}$$

Note that in practice, due to the random nature of the renewable energy and the limited renewable energy storage capacity, it can be used only as a supplementary form of power rather than completely replacing the AC utility power. To support a total power consumption of $P_k(t)$, we can have power circuitry [12, 13] to control the contributions from the AC utility as well as the renewable energy storage as illustrated in Fig. 1. This is similar in concept to hybrid cars, where the power is contributed by both the gas engine and the battery. As a result, the total power consumption is given by: $P_k(t) = P_{k,ac}(t) + P_{k,e}(t)$. Given $P_{k,ac}(t)$ and $P_{k,e}(t)$, the transmission power is given by:

$$P_k^{tx}(t) = \left(P_{k,ac}(t) + P_{k,e}(t) - P_{cct} \cdot \mathbf{1}(P_k^{tx} > 0)\right)^+. \tag{7}$$
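The power split and the energy storage bookkeeping described in this subsection can be sketched as follows. This is an illustration under stated assumptions: the function names are hypothetical, and the energy queue update assumes the form "subtract the consumed renewable energy, add the harvested energy, cap at the storage capacity", which is inferred from the surrounding description.

```python
def transmission_power(p_ac, p_e, p_cct):
    """Transmission power left for the PA, following (6) and (7).

    If the total supplied power does not exceed the circuit power,
    nothing is left for transmission.
    """
    return max(p_ac + p_e - p_cct, 0.0)

def next_energy_queue(e, p_e, tau, harvested, n_e):
    """One step of the energy queue (assumed form, see lead-in).

    e         : stored renewable energy at the beginning of the frame (Joule)
    p_e       : renewable power consumption in the frame (Watt)
    tau       : frame duration (seconds)
    harvested : energy arriving at the end of the frame (Joule)
    n_e       : storage capacity (Joule)
    """
    if p_e * tau > e:
        # Energy-availability constraint (5): cannot spend more than is stored.
        raise ValueError("renewable consumption exceeds stored energy")
    return min(e - p_e * tau + harvested, n_e)
```

Raising an error on a violated constraint is a design choice of this sketch; a controller would instead exclude such actions in advance, which is what the policy constraints in the next section do.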

## III Delay Optimal Power Control

### III-A Control Policy and Resource Constraints

We define $\chi = \{\mathbf{H}, \mathbf{Q}, \mathbf{E}\}$ as the global system state, and $\chi_k = \{\mathbf{H}_k, Q_k, E_k\}$ as the local system state for the $k$-th transmit node, where $\mathbf{H}_k$ is the LCSI, $Q_k$ is the LDQSI and $E_k$ is the LEQSI. (We denote the local CSI at the $k$-th transmit node as $\mathbf{H}_k$. However, in practice, the $k$-th transmit node only needs to observe $H_{kk}$ and the total interference $\sum_{n \neq k} P_n L_{kn} H_{kn}$.) Based on the local system state $\chi_k$, transmitter $k$ determines the power consumption $\{P_{k,ac}, P_{k,e}\}$ using a control policy defined below, where the AC power allocation space and the renewable power allocation space are finite sets of the same cardinality.

###### Definition 1 (Stationary Randomized Decentralized Power Control Policy)

A stationary randomized power control policy for user $k$, $\Omega_k$, is a mapping from the local system state $\chi_k$ to a joint probability distribution over the AC and renewable power allocation spaces, where $p_{i,j}$ denotes the probability of transmission powers $\{P_{k,ac} = a_{ac}^i, P_{k,e} = a_e^j\}$.   ∎

For simplicity, denote the joint control policy as $\Omega = \{\Omega_k\}$. Note that the power allocation policy $\Omega_k$ should satisfy the energy-availability constraint given in (5), i.e., given $E_k$, the probabilities of the transmission powers satisfy

$$p_{i,j} = 0, \quad \text{if } a_e^j > E_k/\tau. \tag{8}$$

Furthermore, $\Omega_k$ should meet the requirement of circuit power consumption, i.e.,

$$p_{i,j} = 0, \quad \text{if } 0 < a_{ac}^i + a_e^j < P_{cct}. \tag{9}$$

Finally, $\Omega_k$ should also satisfy the per-user average AC power consumption constraint:

$$\bar{P}_k(\Omega) = \limsup_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}^{\Omega}\left[P_{k,ac}(t)\right] \le P_k^0, \tag{10}$$

where the expectation in (10) is taken w.r.t. the probability measure induced by the policy $\Omega$.

###### Remark 1 (Formulation with two optimization variables $\{P_{ac}, P_e\}$)

While the "reward" of the system dynamics (the transmission rate in (2)) depends on the total transmission power only, this does not mean the problem can be formulated with just one variable (total transmission power). We also have to look at the "cost" side. While the total power consumption is $P_k = P_{k,ac} + P_{k,e}$, $P_{k,ac}$ and $P_{k,e}$ have different cost structures (and different constraints) as in (10) and (5), respectively. Hence, the problem with $P_{k,ac}$ and $P_{k,e}$ as variables cannot be transformed or reduced into a problem with $P_k$ as the only variable (due to the constraints).   ∎

### III-B Parametrization of Control Policy and Dynamics of System State

In this paper, we consider the parameterized stationary randomized policy, which is widely used in the literature [22, 19, 23, 24]. Specifically, the randomized policy $\Omega_k$ can be parameterized by $\Theta_k$. For example, when a local system state realization $\chi_k$ is observed, the power consumption of transmit node $k$ is $\mathbf{P}_k = \{P_{k,ac}, P_{k,e}\}$ with probability $\mu_{\chi_k}(\Theta_k, \mathbf{P}_k)$ given by [23]:

 (11)

where is the indicator function, and . As a result, the control policy is now parameterized by and is denoted by . Another possible parameterization is to use neural network [22, 19] where the probability is given by:

 (12)

where is the parameter and is the prior basis function. Note that the dimension of the parameter is reduced to in this case.
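To make the idea of a parameterized randomized policy concrete, the sketch below uses a softmax (Gibbs) distribution over the feasible power pairs of one observed local state. This is an assumed parameterization chosen for illustration and is not claimed to be the exact form of (11) or (12). The names `feasible`, `action_probabilities` and `sample_action` are hypothetical, the strict inequalities in the circuit power check are an assumption, and the power level lists are assumed to contain the zero level so that at least one action is always feasible.

```python
import math
import random

def feasible(a_ac, a_e, energy, tau, p_cct):
    """Feasibility of a power pair under constraints (8) and (9)."""
    if a_e * tau > energy:                # (8): not enough stored energy
        return False
    if 0.0 < a_ac + a_e < p_cct:          # (9): cannot even cover circuit power
        return False
    return True

def action_probabilities(theta, ac_levels, e_levels, energy, tau, p_cct):
    """Softmax over feasible pairs; theta[i][j] is the preference of pair (i, j)."""
    pairs = [(i, j)
             for i, a_ac in enumerate(ac_levels)
             for j, a_e in enumerate(e_levels)
             if feasible(a_ac, a_e, energy, tau, p_cct)]
    top = max(theta[i][j] for i, j in pairs)          # for numerical stability
    weights = {(i, j): math.exp(theta[i][j] - top) for i, j in pairs}
    total = sum(weights.values())
    return {ij: w / total for ij, w in weights.items()}

def sample_action(probs, rng=random):
    """Draw one (i, j) index pair according to probs."""
    r, acc = rng.random(), 0.0
    for ij, p in probs.items():
        acc += p
        if r < acc:
            return ij
    return ij  # guard against floating point round-off
```

Infeasible pairs receive probability zero by construction, so the sampled action always respects the energy-availability and circuit power requirements regardless of the parameter values.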

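For a given stationary parameterized control policy $\Omega^\Theta$, the induced random process $\{\chi(t)\}$ is a controlled Markov chain with transition probability

$$\Pr\{\chi(t+1) \mid \chi(t), \Omega^\Theta\} = \Pr\{\mathbf{H}(t+1)\} \Pr\{\mathbf{Q}(t+1), \mathbf{E}(t+1) \mid \chi(t), \Omega^\Theta(\chi(t))\}, \tag{13}$$

where the joint data and energy queue transition probability is given by

$$\Pr\{\mathbf{Q}(t+1), \mathbf{E}(t+1) \mid \chi(t), \Omega^\Theta(\chi(t))\} = \begin{cases} \prod_k \Pr\{A_k(t)\} \Pr\{X_k(t)\}\, \mu_{\chi_k}(\Theta_k, \mathbf{P}_k(t)) & \text{if } Q_k(t+1) = \hat{Q}_k,\ E_k(t+1) = \hat{E}_k\ \forall k \\ 0 & \text{otherwise,} \end{cases} \tag{14}$$

where $\hat{Q}_k = \min\{[Q_k(t) - R_k(t)\tau]^+ + A_k(t), N_k^Q\}$, $\hat{E}_k = \min\{E_k(t) - P_{k,e}(t)\tau + X_k(t), N_k^E\}$, and $R_k(t)$ is the achievable data rate of receiver $k$ given in (2) under the power allocation $\mathbf{P}(t)$. Note that it is not sufficient to specify the evolution of the joint process $\{\chi(t)\}$ by just describing the measure of the individual local processes $\{\chi_k(t)\}$. This is because the individual state processes are not independent; they are mutually coupled.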

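Given a unichain policy $\Omega^\Theta$, the induced Markov chain $\{\chi(t)\}$ is ergodic and there exists a unique steady state distribution $\pi(\chi; \Theta)$, where $\pi(\chi; \Theta) = \lim_{t \to \infty} \Pr\{\chi(t) = \chi\}$ [11]. The average delay utility of user $k$, under a unichain policy $\Omega^\Theta$, is given by:

$$\bar{T}_k(\Theta) = \limsup_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}^{\Omega^\Theta}\left[f(Q_k(t))\right], \tag{15}$$

where $f(Q_k)$ is a monotonic increasing utility function of $Q_k$. For example, when $f(Q_k) = Q_k/\lambda_k$, using Little's Law [11], $\bar{T}_k(\Theta)$ is the average delay of user $k$. (Since the buffer size is finite, $\bar{T}_k(\Theta)$ is exactly the average delay when $f(Q_k) = \frac{Q_k}{(1 - P_d)\lambda_k}$, where $P_d$ is the packet drop rate due to buffer overflow. However, in practice the target is $P_d \ll 1$, and hence $f(Q_k) = Q_k/\lambda_k$ is a good approximation for the average delay. Furthermore, this approximation is asymptotically tight as the data buffer size increases. In practice, the approximation error will not be significant since the system will operate at a reasonable $P_d$.) When $f(Q_k) = \mathbf{1}(Q_k \ge Q_k^o)$, $\bar{T}_k(\Theta)$ is the queue outage probability, i.e., the probability $\Pr\{Q_k \ge Q_k^o\}$ that the queue state exceeds a threshold $Q_k^o$. Since $\lambda_k$ is a constant, the average delay is proportional to the average queue length.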
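The two example utilities can be estimated from a simulated queue trajectory as shown in the sketch below. It assumes that the delay utility is the queue length divided by the average arrival rate (Little's Law) and that the outage utility is an indicator of the queue reaching a threshold; the function names and the sample trace are hypothetical.

```python
def average_utility(queue_trace, f):
    """Empirical time average of f(Q_k(t)) over a sample path, mirroring (15)."""
    return sum(f(q) for q in queue_trace) / len(queue_trace)

def delay_utility(arrival_rate):
    """f(Q) = Q / lambda: average delay in frames, by Little's Law (assumed form)."""
    return lambda q: q / arrival_rate

def outage_utility(threshold):
    """f(Q) = 1 if Q >= threshold else 0 (the non-strict inequality is assumed)."""
    return lambda q: 1.0 if q >= threshold else 0.0

trace = [0, 2, 4]                                      # hypothetical queue lengths (bits)
avg_delay = average_utility(trace, delay_utility(2.0))
outage = average_utility(trace, outage_utility(2))
```

Both utilities are monotonically increasing in the queue length, so the same learning machinery applies to either choice by swapping the function `f`.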

### III-C Problem Formulation

Note that the stochastic dynamics of the data queues and energy queues are coupled together via the control policy $\Omega^\Theta$. In this paper, we consider two different decentralized control problems:

#### III-C1 DEC-POMDP Problem

In this case, all the transmitter nodes are cooperative and we seek an optimal stationary control policy that minimizes a common weighted sum of the delay utilities in (15). Since the control policy of each transmitter is only a function of the local system state $\chi_k$, the problem is a partially observed MDP, which is summarized below:

###### Problem 1 (Delay Optimal DEC-POMDP)

For some positive constants $\beta = \{\beta_k\}$, find a stationary control policy $\Omega^\Theta$ that minimizes:

$$\begin{aligned} \min_{\Theta}\ & \bar{T}_\beta^\Theta = \sum_k \beta_k \bar{T}_k(\Theta) = \limsup_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}^{\Omega^\Theta}\left[g(\chi(t), \Omega^\Theta(\chi(t)))\right] \\ \text{subject to}\ & \bar{P}_k(\Omega^\Theta) = \bar{P}_k(\Theta) \le P_k^0, \quad \forall k, \\ & E_k \le N_k^E, \quad \forall k, \end{aligned} \tag{16}$$

where $g(\chi, \Omega^\Theta(\chi)) = \sum_k \beta_k f(Q_k)$ is the joint per-stage utility. The positive constants $\beta$ indicate the relative importance of the users, and for the given $\beta$, the solution to (16) corresponds to a Pareto optimal point of the multi-objective optimization problem: $\min_\Theta \bar{T}_k(\Theta), \forall k$.   ∎

Note that the average AC power constraint is commonly used in many existing studies [7, 10] and is very relevant in practice (because the electricity bill is charged according to the average AC power consumption times the time of usage). The motivation of Problem 1 is to optimize the delay performance under an average cost constraint (AC power) by fully utilizing the free renewable energy. Problem 1 is also equivalent to minimizing the average AC power consumption subject to an average delay constraint, because they have the same Lagrangian function.

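#### III-C2 Non-Cooperative POSG Problem

In this case, the transmitter nodes are non-cooperative and we formulate the delay utility minimization problem as a non-cooperative partially observable stochastic game (POSG), in which user $k$ competes against the others by choosing his power allocation policy $\Omega_k^{\Theta_k}$ to minimize his average delay utility selfishly. Specifically, the non-cooperative POSG is formulated as Problem 2.

###### Problem 2 (Delay Optimal Non-Cooperative POSG)

For transmitter $k$, we try to find a stationary control policy $\Omega_k^{\Theta_k}$ that minimizes:

$$\begin{aligned} \min_{\Theta_k}\ & \bar{T}_k(\Theta_k, \Theta_{-k}) = \limsup_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}^{\Omega_k^{\Theta_k}, \Omega_{-k}^{\Theta_{-k}}}\left[f(Q_k(t))\right] \\ \text{subject to}\ & \bar{P}_k(\Theta_k, \Theta_{-k}) \le P_k^0, \quad \forall k, \\ & E_k \le N_k^E, \quad \forall k, \end{aligned} \tag{17}$$

where $\Theta_{-k} = \{\Theta_n: n \neq k\}$, and $\Omega_{-k}^{\Theta_{-k}}$ is the set of all the users' policies except that of the $k$-th user.   ∎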

The local equilibrium solutions of the non-cooperative POSG (17) are formally defined as follows.

###### Definition 2 (Local Equilibrium of Non-Cooperative POSG)

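A profile of the power allocation policy $\Theta^* = \{\Theta_k^*\}$ is a local equilibrium of the game (17) if it satisfies the following fixed point equations for some $\gamma^* = \{\gamma_k^*\}$,

$$\nabla_{\Theta_k} \psi_k(\Theta_k^*, \Theta_{-k}^*, \gamma_k^*) = 0, \quad \nabla^2_{\Theta_k \Theta_k} \psi_k(\Theta_k^*, \Theta_{-k}^*, \gamma_k^*) \succ 0 \quad \text{and} \quad \bar{P}_k(\Theta_k^*, \Theta_{-k}^*) - P_k^0 \le 0, \quad \gamma_k^*\left(\bar{P}_k(\Theta_k^*, \Theta_{-k}^*) - P_k^0\right) = 0 \quad \forall k,$$

where $\psi_k$ is the Lagrangian function of user $k$ given in (26).   ∎

###### Remark 2 (Interpretation of the Local Equilibrium)

$\psi_k$ can be regarded as the Lagrange function for user $k$ (given the policies of the other users $\Theta_{-k}$) in the non-cooperative POSG problem (17). From Lagrangian theory [25], a local equilibrium $\Theta^*$ means that given $\Theta_{-k}^*$, $\Theta_k^*$ is a local optimal solution for the non-cooperative POSG problem (17).   ∎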

###### Remark 3 (Comparison between the DEC-POMDP and Non-Cooperative POSG Problems)

In Problem 1 (DEC-POMDP), the controller is decentralized at the transmitters and they have access to the local system state only. Yet, the controllers are fully cooperative in the sense that they are designed to optimize a common objective function where the per-stage utility is assumed to be known globally through message passing. As a result, they interact in a decentralized cooperative manner. On the other hand, in the non-cooperative POSG formulation, the controllers are non-cooperative in the sense that each controller is interested in optimizing its own delay utility function. Hence, they interact in a decentralized non-cooperative manner.   ∎

Note that the policies are reactive or memoryless in that their choice of action is based only upon the current local observation. Furthermore, the DEC-POMDP and the non-cooperative POSG problem are NP-hard [26]. Instead of targeting at global optimal solutions, we shall derive low complexity iterative algorithms for local optimal solutions in the following sections.

## IV Decentralized Solution for DEC-POMDP

In this section, we shall propose a decentralized online policy gradient update algorithm to find a local optimal solution for problem (16). The proposed solution has low complexity and does not require explicit knowledge of the CSI statistics, the random data source statistics or the renewable energy statistics.

### IV-A Decentralized Stochastic Policy Gradient Update

We first define the Lagrangian function of problem (16) as

$$\psi(\Theta, \gamma) = \sum_k \left(\beta_k \bar{T}_k(\Theta) + \gamma_k\left(\bar{P}_k(\Theta) - P_k^0\right)\right), \tag{18}$$

where $\gamma = \{\gamma_k\}$ is the LM vector w.r.t. the average power constraint for all the users. The local optimal solution for problem (16) should satisfy the following first-order necessary conditions given by [25]

$$\nabla_\Theta \psi(\Theta^*, \gamma^*) = 0, \qquad \gamma_k^*\left(\bar{P}_k(\Theta^*) - P_k^0\right) = 0, \quad \forall k. \tag{19}$$

Define a reference state666For example, we can set without loss of optimality. and using perturbation analysis [11, 22], the gradient777Note that a change of will affect the function via the probability measure behind the expectation in and hence, deriving the gradient is highly non-trivial. is given in the following lemma.

###### Lemma 1 (Gradient of the Lagrangian Function)

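The gradient of the Lagrangian function is given by

$$\nabla_{\Theta_k} \psi(\Theta, \gamma) = \sum_{\chi} \sum_{\mathbf{P}} \pi(\chi; \Theta)\, \mu_\chi(\Theta, \mathbf{P})\, \frac{\nabla_{\Theta_k} \mu_{\chi_k}(\Theta_k, \mathbf{P}_k)}{\mu_{\chi_k}(\Theta_k, \mathbf{P}_k)}\, q(\chi, \mathbf{P}; \gamma, \Theta), \tag{20}$$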

where is the steady state probability of state under the policy , is the probability that joint action is taken, and , if ,

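$$q(\chi, \mathbf{P}; \gamma, \Theta) = \mathbb{E}^{\Omega^\Theta}\left[\sum_{t=0}^{T_I - 1} \left(g_\psi(\chi(t), \mathbf{P}(t)) - \psi(\Theta, \gamma)\right) \,\Big|\, \chi(0) = \chi, \mathbf{P}(0) = \mathbf{P}\right], \tag{21}$$

where $g_\psi(\chi, \mathbf{P}) = \sum_k \left(\beta_k f(Q_k) + \gamma_k (P_{k,ac} - P_k^0)\right)$ is the per-stage value of the Lagrangian in (18), and $T_I$ is the first future time that the reference state is visited.   ∎

Note that the brute force solution of (19) requires huge complexity as well as knowledge of the CSI statistics, the random data source statistics and the renewable energy statistics. Based on Lemma 1, we shall propose a low complexity decentralized online policy gradient update algorithm to obtain a solution of (19). Specifically, the key steps for decentralized online learning are given below.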

• Step 1, Initialization: Each transmitter initializes its local parameter $\Theta_k$.

• Step 2, Per-user Power Allocation: At the beginning of the $t$-th frame, each transmitter determines the transmission power allocation according to its policy based on the local system state $\chi_k(t)$, and transmits at the associated achievable data rate given in (2).

• Step 3, Message Passing among the Transmitters: At the end of the $t$-th frame, each transmitter shares its per-user per-stage utility and its reference state indication, which equals 1 if the local system state is at the reference state and 0 otherwise. (Note that the per-user per-stage utility includes not only the packet buffer states but also the control action. As a result, just broadcasting the nodes' buffer states is not enough to replace the per-user per-stage utility. Furthermore, if each user wants to have complete state information, the users need to share both the buffer states and the CSI states, which causes much more information exchange than sharing the per-user per-stage utility. Table I summarizes the communication overhead of exchanging the per-stage utility versus sharing the buffer states and the CSI states.)

• Step 4, Per-user Parameter Update: Based on the current local observation, each of the transmitters updates its local parameter according to Algorithm 1.

• Step 5, Per-user LM Update: Based on the current local observation, each of the transmitters updates its local LM according to Algorithm 1.

Fig. 2 illustrates the above procedure with a flowchart. The detailed algorithm for the local parameter and LM updates in Step 4 and Step 5 is given below:

###### Algorithm 1 (Online Learning Algorithm for Per-user Parameter and LM)

Let be the current local system state, be the current realization of power allocation, be the current realization of the per-stage utility and be the current realization of the reference state indication. The online learning algorithm at the -th transmitter is given by

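$$\Theta_k^{t+1} = \Theta_k^t - a(t)\left(g_L - \tilde{L}^t\right) z_k^t, \qquad \gamma_k^{t+1} = \left[\gamma_k^t + b(t)\left(P_{k,ac} - P_k^0\right)\right]^+, \tag{22}$$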

where , and

 (23)

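The stepsizes $\{a(t)\}$ and $\{b(t)\}$ are non-increasing positive scalars satisfying $\sum_t a(t) = \sum_t b(t) = \infty$, $\sum_t \left(a(t)^2 + b(t)^2\right) < \infty$, and $b(t)/a(t) \to 0$.   ∎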

###### Remark 4 (Feature of the Learning Algorithm 1)

The learning algorithm requires only local observations, i.e., the local system state at each transmit node, and limited message passing of the per-stage utility and the reference state indication, where the overhead is quite mild [27]. Both the per-user parameter and the LMs are updated simultaneously and distributively at each transmitter. Furthermore, the iteration is online and proceeds on the same timescale as the CSI and QSI variations. Finally, the solution does not require knowledge of the CSI distribution or the statistics of the arrival process or renewable energy process, i.e., it is robust to model variations.   ∎
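One iteration of the per-user update in (22) can be sketched as below for a single observed local state with a softmax policy. Several parts are assumptions made only for illustration: the softmax parameterization, the auxiliary updates of the running estimate and of the trace (a running average and a trace that restarts when the reference state is visited, a common choice in simulation-based policy gradient methods), and the stepsize sequences `a(t)` and `b(t)`. All function and variable names are hypothetical.

```python
import math

def a(t):
    """Assumed faster stepsize (t >= 1): sum diverges, sum of squares converges."""
    return t ** -0.6

def b(t):
    """Assumed slower stepsize (t >= 1): b(t) / a(t) tends to 0."""
    return 1.0 / t

def score(prefs, action):
    """Gradient of the log softmax probability of `action` w.r.t. prefs."""
    top = max(prefs)
    w = [math.exp(x - top) for x in prefs]
    s = sum(w)
    return [(1.0 if i == action else 0.0) - wi / s for i, wi in enumerate(w)]

def learning_step(theta, gamma, z, l_est, action, g_l, p_ac, p0,
                  at_reference, a_t, b_t):
    """One iteration at one transmitter.

    theta  : preferences of the actions in the observed local state
    gamma  : Lagrangian multiplier of the AC power constraint
    z      : trace vector with the same shape as theta
    l_est  : running estimate of the average Lagrangian cost
    g_l    : current per-stage Lagrangian cost (shared via message passing)
    """
    # Parameter update in (22): step against (g_L - estimate) along the trace.
    theta_new = [th - a_t * (g_l - l_est) * zi for th, zi in zip(theta, z)]
    # LM update in (22), projected onto the nonnegative reals.
    gamma_new = max(gamma + b_t * (p_ac - p0), 0.0)
    # ASSUMED auxiliary updates (not taken from the text).
    l_new = l_est + a_t * (g_l - l_est)
    grad = score(theta, action)
    z_new = grad if at_reference else [zi + gi for zi, gi in zip(z, grad)]
    return theta_new, gamma_new, z_new, l_new
```

Because `b(t)` decays faster than `a(t)`, the multiplier changes slowly relative to the policy parameter, which is the separation that the two-timescale convergence argument relies on.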

### IV-B Convergence Analysis

In this section, we shall establish the convergence proof of the proposed decentralized learning Algorithm 1. Since we have two different stepsize sequences $\{a(t)\}$ and $\{b(t)\}$ with $b(t) = o(a(t))$, the per-user parameter updates and the LM updates are done simultaneously but over two different timescales. During the per-user parameter update (timescale I), we have $\gamma_k^{t+1} - \gamma_k^t = O(b(t)) = o(a(t))$. Therefore, the LMs appear to be quasi-static [28] during the per-user parameter update in (22), and the convergence analysis can be established over the two timescales separately. We first have the following lemma.

###### Lemma 2 (Convergence of Per-user Parameter Learning (Timescale I))

The iterations of the per-user parameter in the proposed learning Algorithm 1 will converge almost surely to a stationary point, i.e., $\lim_{t \to \infty} \Theta^t = \Theta^\infty(\gamma)$, and $\Theta^\infty(\gamma)$ satisfies

$$\nabla_\Theta \psi(\Theta^\infty(\gamma), \gamma) = 0. \tag{24}$$

On the other hand, during the LM update (timescale II), we have $\lim_{t \to \infty} \|\Theta^t - \Theta^\infty(\gamma^t)\| = 0$ almost surely. Hence, during the LM update in (22), the per-user parameter is seen as almost equilibrated. The convergence of the LMs is summarized below.

###### Lemma 3 (Convergence of LM over Timescale II)

The iterations of the LMs converge, i.e., $\lim_{t \to \infty} \gamma^t = \gamma^\infty$ almost surely, where $\gamma^\infty$ satisfies the power constraints of all the users in (10).   ∎

Based on the above lemmas, we can summarize the convergence performance of the proposed learning algorithm in the following theorem.

###### Theorem 1 (Convergence of Online Learning Algorithm 1)

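In the learning Algorithm 1, we have $(\Theta^t, \gamma^t) \to (\Theta^\infty, \gamma^\infty)$ almost surely, where $\Theta^\infty$ and $\gamma^\infty$ satisfy the KKT condition given in (19), i.e.,

$$\nabla_\Theta \psi(\Theta^\infty, \gamma^\infty) = 0, \qquad \gamma_k^\infty\left(\bar{P}_k(\Theta^\infty) - P_k^0\right) = 0 \tag{25}$$

and the power constraints of all the users in (10). Furthermore, if $\nabla^2_{\Theta\Theta} \psi(\Theta^\infty, \gamma^\infty) \succ 0$ (positive definite matrix), then $\Theta^\infty$ is a local optimal solution for the constrained DEC-POMDP problem in (16).   ∎

Note that $\nabla^2_{\Theta\Theta} \psi(\Theta^\infty, \gamma^\infty) \succ 0$ is a very mild condition that is usually satisfied [28].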

## V Decentralized Solution for Non-Cooperative POSG Problem

In this section, we shall propose a decentralized online policy gradient update algorithm to find a local equilibrium of the non-cooperative POSG problem. The proposed solution also has low complexity and does not require explicit knowledge of the CSI statistics, random data source statistics as well as the renewable energy statistics.

### V-A Decentralized Stochastic Policy Gradient Update

From Problem 2, the Lagrangian function for user $k$ is given by

$$\psi_k(\Theta_k, \Theta_{-k}, \gamma) = \beta_k \bar{T}_k(\Theta_k, \Theta_{-k}) + \gamma_k\left(\bar{P}_k(\Theta_k, \Theta_{-k}) - P_k^0\right), \tag{26}$$

where $\gamma_k$ is the LM w.r.t. the average power constraint for user $k$. Following similar perturbation analysis as in Lemma 1, the gradient $\nabla_{\Theta_k} \psi_k$ is given in the following lemma.

###### Lemma 4 (Gradient of the Lagrangian Function)

The gradient of the Lagrangian function in (26) is given by

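$$\nabla_{\Theta_k}\psi_k(\Theta_k,\Theta_{-k},\gamma_k)=\sum_{\chi}\sum_{P}\pi(\chi;\Theta)\,\mu_{\chi}(\Theta,P)\,\frac{\nabla_{\Theta_k}\mu_{\chi_k}(\Theta_k,P_k)}{\mu_{\chi_k}(\Theta_k,P_k)}\,q_k(\chi,P;\gamma_k,\Theta), \tag{27}$$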

where

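$$q_k(\chi,P;\gamma_k,\Theta)=\mathbb{E}^{\Omega_{\Theta}}\Bigg[\sum_{t=0}^{T_I-1}\Big(f(Q_k(t))+\gamma_k\big(P_{k,ac}(t)-P_k^{0}\big)-\psi_k(\Theta_k,\Theta_{-k},\gamma_k)\Big)\,\Bigg|\,\chi(0)=\chi,\,P(0)=P\Bigg]. \tag{28}$$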

Based on Lemma 4, we shall propose a low-complexity decentralized online policy gradient update algorithm to obtain a local equilibrium. Specifically, the key steps of the decentralized online learning are given below.

• Step 1, Initialization: Each transmitter initializes the local parameter .

• Step 2, Per-user Power Allocation: At the beginning of the -th frame, each transmitter determines the transmission power allocation according to the policy based on the local system state , and transmits at the associated achievable data rate given in (2).

• Step 3, Message Passing among the Transmitters: At the end of the -th frame, each transmitter shares the one-bit reference state indication , where if , and otherwise.

• Step 4, Per-user Parameter Update: Based on the current local observation, each of the transmitters updates the local parameter according to Algorithm 2.

• Step 5, Per-user LM Update: Based on the current local observation, each of the transmitters updates the local LMs according to Algorithm 2.

Fig. 3 illustrates the above procedure by a flowchart. The detailed algorithm for the local parameters and LMs update in Step 4 and Step 5 is given below:

###### Algorithm 2 (Online Learning Algorithm for Per-user Parameter and LM)

Let be the current local system state, be the current realization of power allocation, be the current realization of the reference state indication. The online learning algorithm at the -th transmitter is given by

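$$\begin{aligned}\Theta_k^{t+1}&=\Theta_k^{t}-a(t)\Big(f_k(Q_k)+\gamma_k^{t}\big(P_{k,ac}-P_k^{0}\big)-\tilde{L}_k^{t}\Big)z_k^{t},\\ \gamma_k^{t+1}&=\Big[\gamma_k^{t}+b(t)\big(P_{k,ac}-P_k^{0}\big)\Big]^{+},\end{aligned} \tag{29}$$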

where , and

 (30)

###### Remark 5 (Features of the Learning Algorithm 2)

The learning algorithm only requires local observations, i.e., local system state at each transmit node, and one-bit message passing of . Both the per-user parameter and the LMs are updated simultaneously and distributively at each transmitter. Furthermore, the iterations are online and proceed on the same timescale as the CSI and QSI variations. Finally, the solution does not require knowledge of the CSI distribution or the statistics of the arrival process or renewable energy process, i.e., it is robust to model variations.   ∎
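To make the local update in (29) concrete, the following Python sketch shows one iteration at a single transmitter. It is a sketch under stated assumptions rather than the authors' implementation: the policy is assumed to be a softmax over a discrete set of power levels for the current local state, and because the auxiliary recursion in (30) is not recoverable from the text here, the updates of the trace `z` and the running cost estimate `L_avg` are replaced by a common policy-gradient form (trace reset when the one-bit reference state indication is set). All function and argument names are hypothetical.

```python
import math


def softmax(prefs):
    """Numerically stable softmax over a list of preferences."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def score(prefs, action):
    """Gradient of log softmax(prefs)[action] with respect to prefs."""
    probs = softmax(prefs)
    return [(1.0 if i == action else 0.0) - p for i, p in enumerate(probs)]


def local_step(theta, gamma, z, L_avg, action, delay_cost,
               p_ac, p_budget, a_t, b_t, at_reference):
    """One local iteration at a transmitter (hypothetical helper)."""
    # Per-frame Lagrangian cost: f_k(Q_k) + gamma_k * (P_{k,ac} - P_k^0).
    g = delay_cost + gamma * (p_ac - p_budget)
    # First line of (29): stochastic gradient step on the local parameter.
    new_theta = [th - a_t * (g - L_avg) * zi for th, zi in zip(theta, z)]
    # Second line of (29): projected update of the Lagrange multiplier.
    new_gamma = max(0.0, gamma + b_t * (p_ac - p_budget))
    # ASSUMPTION standing in for (30): accumulate the score into the trace,
    # resetting it when the shared reference state indication is set.
    s = score(theta, action)
    new_z = s if at_reference else [zi + si for zi, si in zip(z, s)]
    # ASSUMPTION standing in for (30): running estimate of the average cost.
    new_L = L_avg + a_t * (g - L_avg)
    return new_theta, new_gamma, new_z, new_L
```

Only locally observed quantities and the shared one-bit flag `at_reference` enter the step, which is the property highlighted in Remark 5.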

### V-B Convergence Analysis

In this section, we shall establish the convergence proof of the proposed decentralized learning algorithm 2. Specifically, let , and let be the set of the local equilibrium of the game (17), i.e., satisfies the fixed point equations in (2). The convergence performance of the proposed learning algorithm is given in the following theorem.

###### Theorem 2 (Convergence of Online Learning Algorithm 2)

Suppose is not empty. The iterations of the per-user parameter in the proposed learning algorithm 2 will converge almost surely to an invariant set given by

$$\mathcal{S}_{\theta}\triangleq\big\{\Theta:\ \|\Theta-\Theta^{*}\|-\delta\le 0\big\} \tag{31}$$

as , for some positive constant and some .   ∎


###### Remark 6 (Interpretation of Theorem 2)

From (31), the error between the converged solution and the local equilibrium of the POSG decreases in the order of where represents the degree of coupling among the transmitters.   ∎

## VI Simulations

In this section, we shall compare the performances of the proposed decentralized solutions against various existing decentralized baseline schemes.

• Baseline 1, Orthogonal Transmission: The transmissions between the pairs are coordinated using TDMA so that there is no interference among the users. Both the AC and renewable power consumption are adaptive to LCSI and LEQSI only by optimizing the sum throughput as in [16].

• Baseline 2, LCSI and LEQSI Only Strategy: The transmitters send data to their desired receiver simultaneously sharing the same spectrum. Both the AC and renewable power consumption are adaptive to LCSI and LEQSI only by optimizing the sum throughput as in [16].

• Baseline 3, Greedy Strategy: The transmitters send data to their desired receiver simultaneously sharing the same spectrum. The transmitters will consume all the available renewable energy source at each frame (emptying the renewable energy buffer at each frame), and the AC power consumption is adaptive to LCSI only by optimizing the sum throughput.

In the simulation, we consider a symmetric system where as in [6]. The long term path loss for the desired link is 15 dB, which corresponds to a cell size of 5.6 km [29]. The static circuit power is (Watt) [30]. We assume Poisson packet arrival with average arrival rate (packet/s) and exponentially distributed random packet size with mean = 2 Mbits. (Footnote 9: The proposed algorithm works for generic packet and renewable energy arrival models as depicted in Definition 2 and Definition 3. The Poisson model is used for simulation illustration only.) The scheduling frame duration is 50 ms, and the total BW is = 1 MHz. The maximum data queue buffer size is 5 (packets). Furthermore, we consider Poisson energy arrival with average arrival rate (Watt) as in [16], and the renewable energy is stored in a 1.2 V 20 Ah lithium-ion battery. The AC power allocation space and the renewable power allocation space are given by (Watt). The average delay is considered as our utility (), and the randomized policy is parameterized in the form given by (11).
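The traffic model described above can be sketched as a small frame-based queue simulation. This is a hedged illustration only: the function name, the fixed service rate, and the arrival rate passed in are hypothetical (the arrival rate used in the paper is not given in this text), packets that arrive to a full buffer are assumed to be dropped, and departures are assumed to be served before new arrivals within a frame. The frame duration (50 ms), mean packet size (2 Mbits), and buffer size (5 packets) follow the setup above.

```python
import random


def simulate_data_queue(arrival_rate, service_rate_bps, num_frames,
                        frame_s=0.05, mean_packet_bits=2e6,
                        buffer_packets=5, seed=0):
    """Simulate a finite data queue with Poisson arrivals (hypothetical helper).

    arrival_rate     : packets per second (must be > 0)
    service_rate_bps : bits per second, assumed constant for simplicity
    Returns the per-frame queue occupancy (in packets) and the drop count.
    """
    rng = random.Random(seed)
    queue = []        # remaining bits of each buffered packet, head first
    occupancy = []
    dropped = 0
    for _ in range(num_frames):
        # Departures: serve up to service_rate_bps * frame_s bits this frame.
        budget = service_rate_bps * frame_s
        while queue and budget > 0:
            served = min(queue[0], budget)
            queue[0] -= served
            budget -= served
            if queue[0] <= 0:
                queue.pop(0)
        # Arrivals: exponential inter-arrival times give a Poisson process.
        t = rng.expovariate(arrival_rate)
        while t < frame_s:
            if len(queue) < buffer_packets:
                # Exponentially distributed packet size with the given mean.
                queue.append(rng.expovariate(1.0 / mean_packet_bits))
            else:
                dropped += 1
            t += rng.expovariate(arrival_rate)
        occupancy.append(len(queue))
    return occupancy, dropped
```

For example, `simulate_data_queue(1.0, 3e6, 2000)` returns 2000 occupancy samples, each between 0 and 5 packets; by Little's law, the time-averaged occupancy divided by the effective arrival rate gives an estimate of the average delay.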

### VI-A Delay Performance w.r.t. the AC power consumption

Fig. 4 illustrates the average delay per user versus the AC power consumption . The average data arrival rate is , and the energy arrival rate is . The average delay of all the schemes decreases as the AC power consumption increases, and the proposed schemes achieve significant performance gain over all the baselines. This gain comes from the DQSI- and EQSI-aware dynamic power allocation. Furthermore, it can also be observed that the solution to the non-cooperative POSG problem performs similarly to the solution to the DEC-POMDP problem.

### VI-B Delay Performance w.r.t. Number of Power Control Levels

Fig. 5 illustrates the average delay per user versus the number of power control levels that lie between 0 and 1.5kW. The average data arrival rate is , the energy arrival rate is , and the average AC power consumption is . The average delay of the proposed schemes decreases as the number of power control levels increases, yet the performance improvement is marginal. It can also be observed that there is significant performance gain with the proposed schemes compared with all the baselines, and the solution to the non-cooperative POSG problem has similar performance as the solution to the DEC-POMDP problem.

### VI-C Delay Performance w.r.t. Renewable Energy Buffer Size

Fig. 6 illustrates the average delay per user versus the renewable energy buffer size . Specifically, we vary the capacity of the 1.2 V lithium-ion battery from 10 Ah to 40 Ah. The average data arrival rate is , the energy arrival rate is , and the average AC power consumption is . It can be observed that the proposed schemes achieve significant performance gain over all the baselines at any given renewable energy buffer size.
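For reference, the battery ratings above can be converted to stored energy with the usual relation (energy equals voltage times charge capacity). This is a simple unit conversion and assumes the nominal voltage and full usable capacity; how the paper maps these values to the energy buffer size is not stated in this text.

$$E = V\,C:\qquad 1.2\,\mathrm{V}\times 10\,\mathrm{Ah}=12\,\mathrm{Wh}=43.2\,\mathrm{kJ},\qquad 1.2\,\mathrm{V}\times 20\,\mathrm{Ah}=24\,\mathrm{Wh}=86.4\,\mathrm{kJ},\qquad 1.2\,\mathrm{V}\times 40\,\mathrm{Ah}=48\,\mathrm{Wh}=172.8\,\mathrm{kJ}.$$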

### VI-D Convergence Performance

Fig. 7 illustrates the convergence property of the proposed schemes. We plot the randomized power control policy versus the scheduling frame index for the DEC-POMDP and non-cooperative POSG problems, respectively. The average data arrival rate is , the energy arrival rate is , and the average AC power consumption is . It can be observed that the online algorithm converges quickly. For example, the delay performance of the proposed schemes already outperforms all the baselines at the 2500-th scheduling frame, and is already quite close to the converged average delay.

## VII Conclusion

In this paper, we consider decentralized delay minimization for interference networks with limited renewable energy storage. Specifically, the transmitters are capable of harvesting energy from the environment, and the transmission power of a node comes from both the conventional utility power (AC power) and the renewable energy source. We consider two decentralized delay optimization formulations, namely the DEC-POMDP and the non-cooperative POSG, where the control policy is adaptive to local system states (LCSI, LDQSI and LEQSI) only. In the DEC-POMDP formulation, the controllers interact in a cooperative manner and the proposed decentralized policy gradient solution converges almost surely to a local optimal point under some mild technical conditions. In the non-cooperative POSG formulation, the transmitter nodes are non-cooperative. We extend the decentralized policy gradient solution and establish the technical proof for almost-sure convergence of the learning algorithms. In both cases, the solutions are very robust to model variations. Finally, the delay performance of the proposed solutions is compared with conventional baseline schemes for interference networks, and it is illustrated that substantial delay performance gain and energy savings can be achieved by incorporating the CSI, DQSI and EQSI in the power control design.

## Appendix A Proof of Lemma 1

From the perturbation analysis [11, 22] in MDP, the gradient (footnote 10: the notation of is ignored in this section for simplicity) is given by

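$$\nabla_{\Theta}\psi(\Theta)=\sum_{\chi}\pi(\chi;\Theta)\Big\{\nabla_{\Theta}g_{\psi}(\chi,\Theta)+\sum_{\chi'}\big(\nabla_{\Theta}\Pr\{\chi'|\chi,\Theta\}\big)V(\chi')\Big\}, \tag{32}$$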

where satisfies the following Bellman (Poisson) equation

$$V(\chi)+\psi(\Theta)=g_{\psi}(\chi,\Theta)+\sum_{\chi'}\Pr\{\chi'|\chi,\Theta\}V(\chi'). \tag{33}$$

Since for every , we have

 (34)

Substituting (34) into (32), we have

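$$\nabla_{\Theta_k}\psi(\Theta)=\sum_{\chi}\sum_{P}\pi(\chi;\Theta)\,\mu_{\chi}(\Theta,P)\,\frac{\nabla_{\Theta_k}\mu_{\chi_k}(\Theta_k,P_k)}{\mu_{\chi_k}(\Theta_k,P_k)}\,q(\chi,P;\Theta), \tag{35}$$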

where

$$q(\chi,P;\gamma,\Theta)=g_{\psi}(\chi,P)-\psi(\Theta)+\sum_{\chi'}\Pr\{\chi'|\chi,P\}V(\chi'). \tag{36}$$

Taking the conditional expectation (conditioned on ) on both sides of (33), we have the following equivalent Bellman equation

 ˜V(Q