Delay-Optimal User Scheduling and Inter-Cell Interference Management in Cellular Networks via Distributive Stochastic Learning
Abstract
In this paper, we propose a distributive queue-aware intra-cell user scheduling and inter-cell interference (ICI) management control design for a delay-optimal cellular downlink system with multiple base stations (BSs) and multiple users in each cell. Each BS maintains a downlink queue for each of its users, with heterogeneous arrivals and delay requirements. The ICI management control is adaptive to the joint queue state information (QSI) over a slow time scale, while the user scheduling control is adaptive to both the joint QSI and the joint channel state information (CSI) over a faster time scale. We show that the problem can be modeled as an infinite-horizon average-cost Partially Observed Markov Decision Problem (POMDP), which is NP-hard in general. By exploiting the special structure of the problem, we derive an equivalent Bellman equation to solve the POMDP. To address the distributive requirement and the issues of dimensionality and computational complexity, we derive a distributive online stochastic learning algorithm, which requires only local QSI and local CSI at each of the BSs. We show that the proposed learning algorithm converges almost surely (with probability 1) and has significant gain compared with various baselines. The proposed solution has only linear complexity order.
Index Terms: multi-cell systems, delay-optimal control, partially observed Markov decision problem (POMDP), interference management, stochastic learning.
I. Introduction
It is well known that cellular systems are interference limited, and there is a large body of work on handling the inter-cell interference (ICI) in cellular systems. Specifically, the optimal binary power control (BPC) for sum-rate maximization has been studied in [1], which showed that BPC provides reasonable performance compared with multi-level power control in multi-link systems. In [2], the authors studied a joint adaptive multi-pattern reuse and intra-cell user scheduling scheme to maximize the long-term network-wide utility, where the ICI management runs at a slower time scale than the user selection strategy to reduce the communication overhead. In [3] and the references therein, cooperation or coordination is also shown to be a useful tool to manage ICI and improve the performance of the cellular network.
However, all of these works have assumed infinite backlogs at the transmitter, so the control policy is only a function of the channel state information (CSI). In practice, applications are delay sensitive, and it is critical to optimize the delay performance of the cellular network. A systematic approach to delay-optimal resource control in the general delay regime is the Markov Decision Process (MDP) technique. In [4, 5], the authors applied it to obtain the delay-optimal cross-layer control policy for the broadcast channel and the point-to-point link respectively. However, there are very limited works that study the delay-optimal control problem in the cellular network. Most existing works simply propose heuristic control schemes with partial consideration of the queueing delay [6]. As we shall illustrate, there are various technical challenges involved in the delay-optimal cellular network.

Curse of Dimensionality: Although the MDP technique is the systematic approach to solving the delay-optimal control problem, a primary difficulty is the curse of dimensionality [7]. For example, a huge state space (exponential in the number of users and the number of cells) is involved in the MDP, and brute-force value or policy iteration cannot lead to any implementable solution [8, 9]. (For instance, for a cellular system with 5 BSs, 5 users served by each BS, a buffer size of 5 per user and 5 CSI states for each link between a user and a BS, the system state space is already unmanageable.) Furthermore, brute-force solutions require explicit knowledge of the transition probabilities of the system states, which is difficult to obtain in complex systems.

Complexity of Interference Management: Jointly optimal ICI management and user scheduling requires heavy computational overhead even for the throughput optimization problem [2]. Although grouping clusters of cells [1] and considering only neighboring BSs [10] have been proposed to reduce the complexity, complex operations on a slot-by-slot basis are still required, which is not suitable for practical implementation.

Decentralized Solution: For delay-optimal multi-cell control, the entire system state is characterized by the global CSI (CSI from any BS to any MS) and the global QSI (queue lengths of all users). Such system state information is distributed locally at each BS, and a centralized solution (which requires global knowledge of the CSI and QSI) would induce substantial signaling overhead between the BSs and the Base Station Controller (BSC).
In this paper, we consider delay-optimal inter-cell ICI management control and intra-cell user scheduling for the cellular system. For implementation reasons, the ICI management control is computed at the BSC on a longer time scale and is adaptive to the QSI only. On the other hand, the intra-cell user scheduling control is computed distributively at the BSs on a smaller time scale and hence is adaptive to both the CSI and the QSI. Due to this two-time-scale control structure, the delay-optimal control is formulated as an infinite-horizon average-cost Partially Observed Markov Decision Process (POMDP). Exploiting the special structure, we propose an equivalent Bellman equation to solve the POMDP. Based on the equivalent Bellman equation, we propose a distributive online learning algorithm to estimate a per-user value function as well as a per-user Q-factor. (The Q-factor is a function of the system state and the control action, representing the potential cost of applying a control action at the current state and applying the policy for all system states in the future [11].) Only local CSI and QSI information is required in the learning process at each BS. We also establish a technical proof of the almost-sure convergence of the proposed distributive learning algorithm. The proposed algorithm is quite different from the iterative update algorithms for solving the deterministic NUM [12], where the CSI is always assumed to be quasi-static during the iterative updates; the delay-optimal problem we consider is stochastic in nature, and during the iterative updates the system state is no longer quasi-static. In addition, the proposed learning algorithm is also quite different from conventional stochastic learning [11, 13]. For instance, conventional stochastic learning requires centralized updates and global system state knowledge, and the convergence proof follows from standard contraction mapping arguments [7].
However, due to the distributive learning requirement and the simultaneous learning of the per-user value function and Q-factor, it is not trivial to establish the contraction mapping property and the associated convergence proof. We also illustrate the performance gain of the proposed solution against various baselines via numerical simulations. Furthermore, the solution has linear complexity order and is quite suitable for practical implementation.
II. System Model
In this section, we elaborate on the system model as well as the control policies. We consider the downlink of a wireless cellular network consisting of BSs, with mobile users in each cell served by one BS. Specifically, let and denote the set of BSs and the set of users served by the BS respectively; denotes the th user served by BS . The time dimension is partitioned into scheduling slots (each slot lasts seconds). The system model is illustrated in Fig. 1.
II-A. Source Model
In each BS, there are independent application streams dedicated to the users respectively. Let and , where represents the new arrivals (number of bits) for the user at the end of the slot .
Assumption 1 (Assumption on Source Model)
We assume that the arrival process is i.i.d. over the scheduling slots according to a general distribution with average arrival rate , and that the arrival processes of all the users are independent of each other, i.e., if or . ∎
Let denote the global QSI in the system, where is the state space of the global QSI. denotes the QSI at BS , where represents the number of bits for user at the beginning of slot , and denotes the maximal buffer size (bits). When the buffer is full, i.e., , new bit arrivals are dropped. The cardinality of the global QSI space is .
II-B. Channel Model and Physical-Layer Model
Let and denote the small scale channel fading gain and the path loss from the th BS to the user respectively, and is the local CSI states for user . denotes the local CSI states for BS , and the global CSI is denoted as , where is the state space for the global CSI.
Assumption 2 (Assumption on Channel Model)
We assume that the global CSI is quasi-static in each slot. Furthermore, it is i.i.d. over the scheduling slots according to a general distribution, and the small-scale channel fading gains of all users are independent of each other. The path loss remains constant for the duration of the communication session. ∎
The cellular system shares a single common channel with bandwidth Hz (all the BSs use the same channel). At the beginning of each slot, each BS is either turned on (with transmit power ) or off (with transmit power 0), according to an ICI management control policy defined later. (Note that on-off BS control is shown to be close to optimal in [1, 2]. Moreover, the solution framework can easily be extended to deal with discrete BS power control.) In each slot, a BS can select only one user for its data transmission. Specifically, let denote an ICI management control pattern, where denotes that BS is active, and otherwise; denotes the set of all possible control patterns. Furthermore, let be the set of BSs activated by the pattern and be the set of patterns that activate BS . The signal received by user at slot , when pattern is selected, is given by
(1) 
where is the transmit signal from the th BS to the th user at slot , and is the i.i.d. noise. The achievable data rate of user can be expressed as
(2) 
where is an indicator variable with when user is scheduled, and is a constant that can be used to model both coded and uncoded systems [5].
II-C. ICI Management and User Scheduling Control Policy
At the beginning of each slot, the BSC decides which BSs are allowed to transmit according to a stationary ICI management control policy defined below.
Definition 1 (Stationary ICI Management Control Policy)
A stationary ICI management control policy is defined as the mapping from current global QSI to an ICI management pattern . ∎
Let denote the global system state at the beginning of slot . The active user in each cell is selected according to a user scheduling policy defined below.
Definition 2 (Stationary User Scheduling Policy)
A stationary user scheduling policy is defined as the mapping from the current global system state to the current user scheduling action . The scheduling action is the set of all users' scheduling indicator variables, i.e., ; it specifies which users are scheduled and which are not in any given slot. denotes the set of all user scheduling actions. ∎
For notational convenience, let denote the joint control policy, and the control action under state .
III. Problem Formulation
In this section, we first elaborate on the dynamics of the system state under a control policy . Based on that, we formally formulate the delay-optimal control problem.
III-A. Dynamics of System State
Given the new arrivals at the end of slot , the current system state and the control action , the queue evolution for user is given by:
(3) 
where is the number of bits delivered to user at slot , and , given by (2), is the achievable data rate under the control action . denotes the floor of , , and . Let , and , for the user , and . Therefore, given a control policy , the random process is a controlled Markov chain with transition probability
(4) 
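To make the queue dynamics concrete, the following sketch implements the one-slot queue update described above (serve, then admit arrivals, then truncate at the finite buffer); the function name and explicit arguments are illustrative, not the paper's notation.

```python
def queue_update(q, delivered, arrivals, buffer_size):
    """One-slot queue update for a single user: bits are served first,
    new arrivals are then admitted, and any overflow beyond the finite
    buffer is dropped (as in the queue evolution of Section III-A)."""
    remaining = max(q - delivered, 0)              # bits left after transmission
    return min(remaining + arrivals, buffer_size)  # overflow bits are dropped
```

Note that the update is nonlinear in the queue length (through the truncations at 0 and at the buffer size), which is one reason the induced Markov chain must be handled with MDP machinery rather than simple linear analysis.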
III-B. Delay-Optimal Control Problem Formulation
Given a stationary control policy , the average cost of the user is given by:
(5) 
where is a monotonically increasing cost function of . For example, when , using Little's law [4, 14], is an approximation of the average delay of user . (Strictly speaking, the average delay is given by , where is the bit-dropping probability conditioned on a bit arrival; since our target bit-dropping probability is small, the approximation holds.) When and follows a Bernoulli process, is the bit-dropping probability (conditioned on a bit arrival). Note that the queues in the cellular system are coupled together via the control policy . In this paper, we seek an optimal stationary control policy that minimizes the average cost in (5). Specifically, we have:
Problem 1 (Delay Optimal Multicell Control Problem)
(In fact, the proposed solution framework can easily be extended to deal with a more general QoS-based optimization. For example, suppose we minimize the average delay subject to constraints on the average data rate: . The Lagrangian of such a constrained optimization is , where is the Lagrange multiplier corresponding to the QoS constraint . Note that it has the same form as (1), so the proposed solution framework applies to the QoS-constrained problem as well.) For some positive constants , find a stationary control policy that minimizes:
where is the per-slot cost, and denotes the expectation w.r.t. the measure induced by the control policy and the transition kernel in (4). The positive constants indicate the relative importance of the users, and for a given , the solution to (1) corresponds to a Pareto-optimal point of the multi-objective optimization problem given by . Moreover, a control policy is called Pareto optimal if, for any control policy such that , it holds that . In other words, at a Pareto-optimal control we cannot reduce one component without increasing another (say ) [15].
IV. General Solution to the Delay-Optimal Problem
In this section, we show that the delay-optimal Problem 1 can be modeled as an infinite-horizon average-cost POMDP, which is a very difficult problem. By exploiting its special structure, we derive an equivalent Bellman equation to solve the POMDP.
IV-A. Preliminaries on MDP and POMDP
An infinite-horizon average-cost MDP can be characterized by a tuple of four objects: , where is a finite set of states and is the action space; is the transition probability from state to , given that action is taken; and is the per-slot cost function. The objective is to find the optimal policy that minimizes the average per-slot cost:
(7) 
If the policy space consists of unichain policies and the associated induced Markov chain is irreducible, it is well known that there exists a unique optimal average cost for each starting state [11, 7]. Furthermore, the optimal control policy can be obtained from the following Bellman equation:
(8) 
where is called the value function. Standard offline solutions, such as value or policy iteration, can be used to find the value function iteratively, as well as the optimal policy [7].
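As an illustration of how value iteration recovers the value function of an average-cost MDP, here is a minimal sketch of relative value iteration on an explicitly tabulated model; representing the model as per-action transition matrices and cost vectors is an assumption for the example, not the paper's setup.

```python
import numpy as np

def relative_value_iteration(P, c, n_iter=500):
    """Relative value iteration for an average-cost MDP.
    P[a]: |S|x|S| transition matrix under action a; c[a]: per-slot cost vector.
    Returns the estimated optimal average cost and a relative value function."""
    n_states = P[0].shape[0]
    V = np.zeros(n_states)
    theta = 0.0
    for _ in range(n_iter):
        # Bellman backup: per-slot cost plus expected future value, minimized over actions.
        Q = np.stack([c[a] + P[a] @ V for a in range(len(P))])
        V_new = Q.min(axis=0)
        theta = V_new[0]       # at convergence, equals the optimal average cost
        V = V_new - V_new[0]   # subtract a reference state to keep V bounded
    return theta, V
```

Subtracting the value of a fixed reference state each iteration is what keeps the iterates bounded; without it the backups grow linearly in the average cost per slot.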
A POMDP is an extension of an MDP in which the control agent does not have direct observation of the entire system state (hence "partially observed MDP"). Specifically, an infinite-horizon average-cost POMDP can be characterized by a tuple [16, 17]: , where characterizes an MDP and is a finite set of observations. is the observation function, which gives the probability (or stochastic relationship) between the partial observation , the actual system state and the control action . Specifically, is the probability of getting a partial observation "" given that the current system state is and the action was taken in the previous slot. In a POMDP, the current system state cannot be observed directly, and the actions are based on the observation . The objective is to find the optimal policy that minimizes the average per-slot cost in (7). In general, however, this is an NP-hard problem, and various approximate solutions have been proposed based on the special structure of the studied problems [18].
IV-B. Equivalent Bellman Equation and Optimal Control Policy
In this subsection, we first illustrate that the optimization Problem 1 is an infinite-horizon average-cost POMDP. We then exploit the special problem structure to reduce the complexity and derive an equivalent Bellman equation to solve the problem. Note that in the delay-optimal Problem 1, the ICI management control policy is adaptive to the QSI only, while the user scheduling policy is adaptive to the complete system state . Therefore, the optimal control policy cannot be obtained by solving a standard Bellman equation from conventional MDP theory (by solving a standard Bellman equation, the policy would be a function of the complete system state). In fact, Problem 1 is a POMDP with the following specification.

State Space: The system state is the global QSI and CSI .

Action Space: The action is ICI management pattern and user scheduling .

Transition Kernel: The transition probability is given in (4).

PerSlot Cost Function: The perslot cost function is .

Observation: The observation for the ICI management control policy is the global QSI, i.e., , while the observation for the user scheduling policy is the complete system state, i.e., .

Observation Function: The observation function for the ICI management control policy is if , and 0 otherwise. Furthermore, the observation function for the user scheduling policy is if , and 0 otherwise.
While a POMDP is very difficult to solve in general, we utilize the notion of action partitioning to substantially simplify our problem. We first define partitioned actions below.
Definition 3 (Partitioned Actions)
Given a control policy , we define as the collection of actions under a given for all possible . The complete policy is therefore equal to the union of all partitioned actions, i.e., . ∎
Based on the action partitioning, we can transform the POMDP into a regular infinite-horizon average-cost MDP. Furthermore, the optimal control policy can be obtained by solving an equivalent Bellman equation, as summarized in the theorem below.
Theorem 1 (Equivalent Bellman Equation)
The optimal control policy in problem 1 can be obtained by solving the equivalent Bellman equation given by:
(9) 
where is the perslot cost function, and the transition kernel is given by , where is given by
(10) 
where , and , and for . Suppose solves the Bellman equation in (9); then the optimal control policy for the original Problem 1 is given by and . The value function that solves (9) is componentwise monotonically increasing. ∎
Please refer to Appendix A.
Note that solving (9) yields an ICI management policy that is a function of the QSI and a user scheduling policy that is a function of both the QSI and the CSI . We illustrate this with a simple example below.
Example 1
Suppose there are two BSs with equal transmit power (), and there are three ICI management control patterns in , given by (BS 1 is active), (BS 2 is active) and (both BSs are active). Assume deterministic arrivals where one bit always arrives in each slot, i.e., . The number of users served by each BS is . The path loss is for all , and the small-scale fading gain is chosen from two values with equal probability. As a result, the global CSI state space is . (For ease of discussion, we consider a discrete state space in this example; the proposed algorithms and convergence results also hold for general continuous state spaces.) Note that the cardinality of the CSI state space is . Given a realization of the global QSI , the partitioned actions (following Definition 3) are given by:
(11) 
Using Theorem 1, the optimal partitioned action is given by solving the right hand side (RHS) of (9):
(12) 
where
(13) 
and is the number of departed bits. For a given ICI management control , the optimal user scheduling policy is
(14) 
Observe that the RHS of (14) is a decoupled objective function w.r.t. the variables, and hence, applying standard decomposition theory,
(15) 
As a result, the optimal ICI management control policy is given by:
(16) 
where , given in (15), is the optimal user scheduling policy under the ICI management control . Using Theorem 1, the optimal ICI management control and user selection control of the original Problem 1 for a CSI realization and a QSI realization are given by and respectively. ∎
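The two-stage structure of Example 1 (decoupled per-BS user selection inside an outer search over ICI patterns) can be sketched as follows; the `metric` callback standing in for the bracketed per-user term in (15)-(16) is a placeholder, not the paper's exact expression.

```python
def two_stage_control(patterns, n_users, metric):
    """Stage 1: each active BS picks its locally best user for a given pattern.
    Stage 2: the controller picks the ICI pattern with the smallest total metric.
    metric(b, k, pattern) returns the per-user cost if BS b serves user k."""
    best = (None, None, float("inf"))  # (pattern, schedule, total cost)
    for pattern in patterns:           # pattern: tuple of per-BS on/off flags
        total, sched = 0.0, {}
        for b, active in enumerate(pattern):
            if not active:
                continue
            # Decoupled per-BS minimization over the users it serves.
            k_star = min(range(n_users), key=lambda k: metric(b, k, pattern))
            sched[b] = k_star
            total += metric(b, k_star, pattern)
        if total < best[2]:
            best = (pattern, sched, total)
    return best
```

The decoupling keeps the inner loop linear in the number of users per BS, so the dominant cost is the outer enumeration of ICI patterns.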
V. Distributive Value Function and Q-factor Online Learning
The solution in Theorem 1 requires knowledge of the value function . However, obtaining the value function is nontrivial, since solving the Bellman equation (9) involves solving a very large system of nonlinear fixed-point equations (one for each realization of in (9)). A brute-force solution requires huge complexity, centralized implementation, and knowledge of the global CSI and QSI at the BSC. It would also induce huge signaling overhead, because the QSI of all the users is maintained locally at the BSs. In this section, we propose a decentralized solution via distributive stochastic learning following the structure illustrated in Fig. 2. Moreover, we prove that the proposed distributive stochastic learning algorithm converges almost surely.
V-A. Post-Decision State Framework
In this subsection, we first introduce the post-decision state framework, also used in [19] and the references therein, to lay the ground for developing the online learning algorithm. The post-decision state is defined as the virtual system state immediately after an action is taken but before the new bits arrive. For example, if is the state at the beginning of some time slot (also called the pre-decision state) and action is taken, the post-decision state immediately after the action is , where the transition is given by . If new arrivals occur in the post-decision state and the CSI changes to , the system reaches the next actual state, i.e., the next pre-decision state .
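The pre/post-decision split factorizes the one-slot queue update into a deterministic service step and a stochastic arrival step, which is what lets the learning algorithm average over arrivals. A minimal sketch, with illustrative names and scalar per-user queues:

```python
def to_post_decision(q, delivered):
    """Deterministic step: the virtual queue state right after the action,
    before new arrivals (the post-decision state)."""
    return max(q - delivered, 0)

def to_next_pre_decision(q_post, arrivals, buffer_size):
    """Stochastic step: arrivals are admitted on top of the post-decision
    state, yielding the next slot's pre-decision state."""
    return min(q_post + arrivals, buffer_size)
```

Composing the two functions reproduces the full one-slot queue evolution of (3), but only the second step depends on the random arrivals.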
Using the action partitioning and defining the value function on the post-decision state (with corresponding pre-decision state ), satisfies the post-decision state Bellman equation [19]
(17) 
where , , and is the next post-decision state reached from . As in Theorem 1, is also componentwise monotonically increasing. The optimal policy is obtained by solving the RHS of the Bellman equation (17).
V-B. Distributive User Scheduling Policy on the CSI Time Scale
To reduce the size of the state space and to decentralize the user scheduling, we approximate in (17) by the sum of per-user post-decision state value functions (using the linear approximation in (18), we can address the curse of dimensionality as well as facilitate distributive implementation, where each BS can solve for based on local CSI and QSI only), i.e.,
(18) 
where is defined as the fixed point of the following per-user fixed-point equation:
(19) 
where is the pre-decision state, means that user is scheduled to transmit at BS , is a reference state and is a reference ICI management pattern (with BS active). The per-user value function is obtained by the proposed distributive online learning algorithm (explained in Section V-D). Note that the state space for the value function is substantially reduced from (exponential growth w.r.t. the number of mobile users ) to (linear growth w.r.t. the number of mobile users).
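A sketch of the linear approximation (18): the global post-decision value is replaced by a sum of per-user values, so each BS only needs the tables of its own users. The dictionary-based table layout here is an illustrative data structure, not the paper's.

```python
def approx_global_value(per_user_V, post_state):
    """Linear value-function approximation: the global post-decision value
    is approximated by the sum of per-user post-decision values.
    per_user_V: {user_id: {local_post_qsi: value}}
    post_state: {user_id: local_post_qsi}"""
    return sum(table[post_state[u]] for u, table in per_user_V.items())
```

Because the sum decomposes over users, the minimization on the RHS of (17) decomposes over BSs as well, which is exactly what Corollary 1 exploits.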
Corollary 1 (Decentralized User Scheduling Actions)
Using the linear approximation in (18), the user scheduling action of BS under any given ICI management pattern (obtained by solving the RHS of Bellman equation (17)) is given by:
(20) 
where , and . (Note that , and hence users with an empty buffer will not be scheduled; the activated BS serves the users with non-empty buffers, and the chance of all users' buffers being empty in a given slot is very small.) Here , where is the sum power of interference and noise, and is the signal power. ∎
Please refer to Appendix B.
Remark 1 (Structure of the User Scheduling Actions)
The user scheduling action in (20) is a function of both local CSI and local QSI. Specifically, the number of bits to be delivered is controlled by the local CSI , while the local QSI determines . Each user estimates and in the preamble phase and sends them to the associated BS, following the process indicated in Fig. 2. ∎
V-C. ICI Management Control Policy on the QSI Time Scale
To determine the ICI management control policy, we define the Q-factor as follows [11]:
(21) 
where is the transition probability from the current QSI to , given the current action , and is a constant. Note that the Q-factor represents the potential cost of applying a control action at the current QSI and applying the policy for all system states in the future. Similar to (18), we approximate the Q-factor in (21) by a sum of per-user Q-factors, i.e.,
(22) 
where is defined as the fixed point of the following per-user fixed-point equation:
(23) 
where ; is a reference state and is a reference ICI management control pattern. The per-user Q-factor is obtained by the proposed distributive online learning algorithm (explained in Section V-D). The BSC collects the per-BS information at the beginning of slot , and the ICI management control policy is given by:
(24) 
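Under the approximation (22), the QSI-time-scale decision (24) reduces to the BSC summing the reported per-user Q-factors and choosing the pattern with the smallest total. A sketch with an illustrative table layout:

```python
def select_ici_pattern(patterns, per_user_Q, local_qsi):
    """Pick the ICI management pattern minimizing the sum of per-user
    Q-factors, as in (24) combined with the linear approximation (22).
    per_user_Q: {user_id: {(qsi, pattern): q_value}}
    local_qsi: {user_id: current local QSI}"""
    def total_cost(pattern):
        return sum(t[(local_qsi[u], pattern)] for u, t in per_user_Q.items())
    return min(patterns, key=total_cost)
```

The BSC never needs the full joint QSI: the per-user tables are maintained at the BSs, and only their values at the current local QSI are reported.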
In order to reduce the communication overhead between the BSs and the BSC, we can further partition the local QSI space into regions (for example, one possible criterion is to partition the local QSI space so that the probability of belonging to any region is the same, i.e., uniform probability partitioning) () as illustrated in Fig. 3. At the beginning of the th slot, the th BS updates the BSC with the per-BS information only if its QSI state belongs to a new region. Hence, the per-BS information at the BSC is updated according to the following dynamics:
(25) 
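The region-based reporting rule can be sketched as an event trigger: a BS contacts the BSC only when its local QSI crosses into a different region. The region boundaries below are arbitrary illustrative values.

```python
import bisect

def region_index(q, boundaries):
    """Map a local QSI value to the index of the region it falls in;
    `boundaries` are the sorted region edges partitioning the QSI space."""
    return bisect.bisect_right(boundaries, q)

def should_report(q_prev, q_now, boundaries):
    """Event-triggered update: report to the BSC only on a region change."""
    return region_index(q_prev, boundaries) != region_index(q_now, boundaries)
```

Between region crossings the BSC simply reuses the last reported value, which is what the dynamics in (25) express.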
Remark 2 (Communication Overhead)
The communication overhead between the BSs and the BSC is reduced from (exponential growth w.r.t. the number of users ) to for some constant (O(1) w.r.t. ), where is the cardinality of the CSI state space for one link. ∎
V-D. Online Per-User Value Function and Per-User Q-factor Learning Algorithm
The system procedure for distributive online learning is given below:

Initialization: Each BS initializes the per-user value function and Q-factor of its users, denoted as and , where .

User Scheduling: If , BS is selected to transmit. The user scheduling policy is determined according to (20).

Local Per-User Value Function and Per-User Q-factor Update: Based on the current observations, each BS updates the per-user value function and the per-user Q-factor according to Algorithm 1.
Fig. 2 illustrates the above procedure with a flowchart. The algorithm for the per-user value function and per-user Q-factor update is given below:
Algorithm 1 (Online Learning Algorithm)
Let and be the current observations of the post-decision and pre-decision states respectively, the current observation of new arrivals, the current observation of the local CSI, and the realization of the ICI management control pattern. The online learning algorithm for user is given by
(26) 
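The flavor of the per-user update in (26) is that of a stochastic-approximation step: each estimate moves toward a one-slot sample target built from the observed cost and the current estimate at the next observed state. A generic sketch of one such step (the target construction in the paper differs in its details):

```python
def stochastic_learning_step(table, state, sample_target, step_size):
    """One stochastic-approximation update of a tabulated estimate:
    new = old + step_size * (sample_target - old). With a suitably
    decaying step-size sequence, such iterates converge almost surely
    under standard stochastic-approximation conditions."""
    table[state] += step_size * (sample_target - table[state])
    return table[state]
```

Because each BS only forms targets from its own local CSI and QSI observations, the same primitive serves both the per-user value-function and the per-user Q-factor updates.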