Delay-Optimal User Scheduling and Inter-Cell Interference Management in Cellular Network via Distributive Stochastic Learning
In this paper, we propose a distributive queue-aware intra-cell user scheduling and inter-cell interference (ICI) management control design for a delay-optimal cellular downlink system with multiple base stations (BSs) and multiple users in each cell. Each BS maintains a downlink queue for each of its users, with heterogeneous arrivals and delay requirements. The ICI management control is adaptive to the joint queue state information (QSI) over a slow time scale, while the user scheduling control is adaptive to both the joint QSI and the joint channel state information (CSI) over a faster time scale. We show that the problem can be modeled as an infinite horizon average cost Partially Observed Markov Decision Problem (POMDP), which is NP-hard in general. By exploiting the special structure of the problem, we derive an equivalent Bellman equation to solve the POMDP. To address the distributive requirement and the issues of dimensionality and computational complexity, we derive a distributive online stochastic learning algorithm, which requires only local QSI and local CSI at each BS. We show that the proposed learning algorithm converges almost surely (with probability 1) and has significant gain compared with various baselines. The proposed solution has only linear complexity order.
multi-cell systems, delay optimal control, partially observed Markov decision problem (POMDP), interference management, stochastic learning.
It is well-known that cellular systems are interference limited, and there is a large body of work on handling inter-cell interference (ICI) in cellular systems. Specifically, the optimal binary power control (BPC) for sum rate maximization has been studied previously, where it was shown that BPC provides reasonable performance compared with multi-level power control in multi-link systems. Other authors studied a joint adaptive multi-pattern reuse and intra-cell user scheduling scheme to maximize the long-term network-wide utility, with the ICI management running at a slower time scale than the user selection strategy to reduce the communication overhead. Cooperation or coordination among cells has also been shown to be a useful tool to manage ICI and improve the performance of cellular networks.
However, all of these works have assumed infinite backlogs at the transmitter, so that the control policy is a function of the channel state information (CSI) only. In practice, applications are delay sensitive, and it is critical to optimize the delay performance of the cellular network. A systematic approach to delay-optimal resource control in the general delay regime is the Markov Decision Process (MDP) technique. In [4, 5], the authors applied it to obtain delay-optimal cross-layer control policies for the broadcast channel and the point-to-point link respectively. However, very few works have studied the delay-optimal control problem in cellular networks; most existing works simply propose heuristic control schemes with only partial consideration of the queuing delay. As we shall illustrate, delay-optimal control of a cellular network involves several technical challenges.
Curse of Dimensionality: Although the MDP technique is the systematic approach to the delay-optimal control problem, a primary difficulty is the curse of dimensionality. A huge state space (exponential in the number of users and the number of cells) is involved, and brute-force value or policy iteration cannot lead to any implementable solution (for a cellular system with 5 BSs, 5 users served by each BS, a buffer size of 5 per user, and 5 CSI states for each link between a user and a BS, the system state space is already unmanageable) [8, 9]. Furthermore, brute-force solutions require explicit knowledge of the transition probabilities of the system states, which is difficult to obtain in complex systems.
Complexity of the Interference Management: Jointly optimal ICI management and user scheduling incurs heavy computational overhead even for throughput optimization. Although grouping clusters of cells and considering only neighboring BSs have been proposed to reduce the complexity, complex operations are still required on a slot-by-slot basis, which is not suitable for practical implementation.
Decentralized Solution: For delay-optimal multi-cell control, the entire system state is characterized by the global CSI (CSI from any BS to any MS) and the global QSI (queue lengths of all users). This system state information is distributed locally at the BSs, and a centralized solution (which requires global knowledge of the CSI and QSI) would induce substantial signaling overhead between the BSs and the Base Station Controller (BSC).
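To make the curse of dimensionality concrete, the example quoted above (5 BSs, 5 users per BS, buffer size 5 bits per user, 5 CSI states per BS-user link) admits a back-of-envelope count; the count below is our own sketch of that example, not a formula from the paper:

```python
# Back-of-envelope count of the joint system state space for the
# example: 5 BSs, 5 users per BS, buffer size 5 bits per user,
# and 5 CSI states for each (BS, user) link.
n_bs = 5
users_per_bs = 5
buffer_size = 5            # queue length takes values 0, 1, ..., 5
csi_states_per_link = 5

n_users = n_bs * users_per_bs            # 25 user queues
n_links = n_users * n_bs                 # each user sees every BS: 125 links

qsi_states = (buffer_size + 1) ** n_users        # 6^25 joint queue states
csi_states = csi_states_per_link ** n_links      # 5^125 joint channel states
total_states = qsi_states * csi_states

print(f"QSI states:   {qsi_states:.3e}")
print(f"CSI states:   {csi_states:.3e}")
print(f"joint states: {total_states:.3e}")       # far beyond brute force
```

Even this small system has more joint states than could ever be enumerated, which is why the paper turns to structured, distributive learning.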
In this paper, we consider delay-optimal ICI management control and intra-cell user scheduling for the cellular system. For implementation reasons, the ICI management control is computed at the BSC on a longer time scale and is adaptive to the QSI only. On the other hand, the intra-cell user scheduling control is computed distributively at each BS on a shorter time scale and hence is adaptive to both the CSI and the QSI. Due to this two-time-scale control structure, the delay-optimal control is formulated as an infinite-horizon average cost Partially Observed Markov Decision Process (POMDP). Exploiting the special structure, we propose an equivalent Bellman equation to solve the POMDP. Based on the equivalent Bellman equation, we propose a distributive online learning algorithm to estimate a per-user value function as well as a per-user Q-factor (the Q-factor is a function of the system state and the control action, representing the potential cost of applying a control action at the current state and then applying the policy for all future system states). Only local CSI and QSI is required in the learning process at each BS. We also establish a technical proof of the almost-sure convergence of the proposed distributive learning algorithm. The proposed algorithm is quite different from iterative update algorithms for solving deterministic NUM problems, where the CSI is always assumed to be quasi-static during the iterative updates; the delay-optimal problem we consider is stochastic in nature, and during the iterative updates the system state is no longer quasi-static. In addition, the proposed learning algorithm also differs from conventional stochastic learning [11, 13], which requires centralized updates and global system state knowledge, and whose convergence proof follows from standard contraction mapping arguments.
However, due to the distributive learning requirement and the simultaneous learning of the per-user value function and Q-factor, it is not trivial to establish the contraction mapping property and the associated convergence proof. We also illustrate the performance gain of the proposed solution against various baselines via numerical simulations. Furthermore, the solution has linear complexity order and is well suited to practical implementation.
II System Model
In this section, we shall elaborate on the system model as well as the control policies. We consider the downlink of a wireless cellular network consisting of multiple BSs, with multiple mobile users in each cell served by one BS. Specifically, we denote the set of BSs, the set of users served by each BS, and the individual user served by a given BS accordingly. The time dimension is partitioned into scheduling slots, each lasting a fixed number of seconds. The system model is illustrated in Fig. 1.
II-A Source Model
In each BS, there are independent application streams dedicated to its users, one per user. The arrival process of each user represents the new arrivals (in number of bits) for that user at the end of each slot.
Assumption 1 (Assumption on Source Model)
We assume that each arrival process is i.i.d. over the scheduling slots according to a general distribution with a given average arrival rate, and that the arrival processes of all users are mutually independent. ∎
The global QSI collects the queue states of all users in the system. The QSI at each BS represents the number of bits buffered for each of its users at the beginning of the slot, up to a maximal buffer size (in bits). When a buffer is full, new bit arrivals are dropped. The cardinality of the global QSI space is exponential in the total number of users.
II-B Channel Model and Physical Layer Model
The local CSI state of a user consists of the small scale channel fading gains and the path losses from each BS to that user. The local CSI states of all users served by a BS constitute the per-BS CSI, and the collection over all BSs constitutes the global CSI.
Assumption 2 (Assumption on Channel Model)
We assume that the global CSI is quasi-static within each slot. Furthermore, the small scale fading is i.i.d. across scheduling slots according to a general distribution, and the small scale channel fading gains of all users are mutually independent. The path loss remains constant for the duration of the communication session. ∎
The cellular system shares a single common channel of fixed bandwidth (in Hz); all the BSs use the same channel. At the beginning of each slot, each BS is either turned on (with a fixed transmit power) or off (with transmit power 0) according to an ICI management control policy, which is defined later. (Note that on-off BS control is shown to be close to optimal in [1, 2]; moreover, the solution framework can be easily extended to deal with discrete BS power control.) In each slot, a BS can select only one user for data transmission. Specifically, an ICI management control pattern indicates, for each BS, whether it is active, and the set of all possible control patterns is finite. Furthermore, each pattern determines the set of BSs it activates, and each BS is associated with the set of patterns that activate it. The signal received by a user in a slot, when a given pattern is selected, is given by
where the received signal comprises the transmit signals from the active BSs and i.i.d. noise. The achievable data rate of the user can be expressed by
where the indicator variable equals one when the user is scheduled, and a constant is used to model both coded and uncoded systems.
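The rate expression can be sketched numerically as follows, treating interference from other active BSs as noise; the names (`W` for bandwidth, `gamma` for the coded/uncoded gap constant) are our own placeholders, not notation fixed by the paper:

```python
import math

def achievable_rate(W, p_tx, h_own, L_own, interferers, noise_power, gamma=1.0):
    """Per-slot achievable rate of a scheduled user (bits/s), treating
    interference as noise. `interferers` is a list of (power, fading, path
    loss) tuples for the other *active* BSs in the current ICI pattern;
    `gamma` models the gap of coded/uncoded systems to Shannon capacity."""
    signal = p_tx * h_own * L_own
    interference = sum(p * h * L for (p, h, L) in interferers)
    sinr = signal / (noise_power + interference)
    return W * math.log2(1.0 + sinr / gamma)

# Example: 10 MHz bandwidth, serving-link SNR of 4, no interferer active.
r_alone = achievable_rate(W=10e6, p_tx=1.0, h_own=4.0, L_own=1.0,
                          interferers=[], noise_power=1.0)
# Turning on a second BS adds interference and lowers the rate.
r_shared = achievable_rate(W=10e6, p_tx=1.0, h_own=4.0, L_own=1.0,
                           interferers=[(1.0, 1.0, 1.0)], noise_power=1.0)
assert r_shared < r_alone
```

This rate coupling across cells is exactly why the ICI pattern (which BSs are on) and the per-cell user choice must be optimized jointly.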
II-C ICI Management and User Scheduling Control Policy
At the beginning of each slot, the BSC decides which BSs are allowed to transmit according to a stationary ICI management control policy, defined below.
Definition 1 (Stationary ICI Management Control Policy)
A stationary ICI management control policy is defined as a mapping from the current global QSI to an ICI management pattern. ∎
Let the global system state at the beginning of each slot comprise the global QSI and the global CSI. The active user in each cell is selected according to a user scheduling policy, defined below.
Definition 2 (Stationary User Scheduling Policy)
A stationary user scheduling policy is defined as a mapping from the current global system state to the current user scheduling action. The scheduling action is the set of all users' scheduling indicator variables; it represents which users are scheduled and which are not in any given slot. The collection of such actions forms the set of all user scheduling actions. ∎
For notational convenience, we refer to the pair of ICI management and user scheduling policies as the joint control policy, and to the corresponding pattern and scheduling action as the control action under a given state.
III Problem Formulation
In this section, we first elaborate on the dynamics of the system state under a control policy. Based on that, we formally formulate the delay-optimal control problem.
III-A Dynamics of System State
Given the new arrivals at the end of the slot, the current system state, and the control action, the queue evolution for each user is given by:
where the number of bits delivered to a user in a slot is determined (via the floor operation) by the achievable data rate in (2) under the control action and the slot duration. Therefore, given a control policy, the system state process is a controlled Markov chain with transition probability
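The bit accounting in the queue recursion can be sketched as below; the variable names (`slot_s` for the slot duration, etc.) are our own placeholders:

```python
import math

def queue_update(q, rate_bps, slot_s, arrivals_bits, buffer_bits):
    """One-slot queue evolution: serve floor(rate * slot) bits (never more
    than are buffered), then admit new arrivals, dropping any overflow."""
    served = min(q, math.floor(rate_bps * slot_s))
    return min(q - served + arrivals_bits, buffer_bits)

q = 7
q = queue_update(q, rate_bps=3.0, slot_s=1.0, arrivals_bits=4, buffer_bits=10)
assert q == 8   # 7 - 3 + 4
q = queue_update(q, rate_bps=0.0, slot_s=1.0, arrivals_bits=5, buffer_bits=10)
assert q == 10  # 8 + 5 = 13 exceeds the buffer, so 3 bits are dropped
```

Note that the next queue state depends on the action only through the delivered bits, which is what makes the transition kernel factor per user given the control policy.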
III-B Delay Optimal Control Problem Formulation
Given a stationary control policy , the average cost of the user is given by:
where the per-user cost is a monotonically increasing function of the queue length. For example, when the cost is the queue length itself, by Little's law [4, 14] the normalized average queue length is an approximation of the average delay of the user. (Strictly speaking, the average delay also involves the bit dropping probability conditioned on bit arrival; since our target bit dropping probability is small, the approximation is accurate.) When the cost is the buffer-overflow indicator and the arrivals follow a Bernoulli process, the average cost is the bit dropping probability (conditioned on bit arrival). Note that the queues in the cellular system are coupled together via the control policy. In this paper, we seek an optimal stationary control policy that minimizes the average cost in (5). Specifically, we have:
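The Little's-law reading of the cost can be checked with a minimal single-queue simulation (Bernoulli bit arrivals, one bit served per backlogged slot; all parameters are illustrative and not from the paper):

```python
import random

random.seed(0)
arrival_prob = 0.3   # Bernoulli arrivals: one bit w.p. 0.3 per slot
n_slots = 200_000
q, q_sum, arrived = 0, 0, 0

for _ in range(n_slots):
    q_sum += q                      # sample the queue length each slot
    if q > 0:
        q -= 1                      # serve one bit per slot when backlogged
    if random.random() < arrival_prob:
        q += 1
        arrived += 1

avg_queue = q_sum / n_slots
avg_rate = arrived / n_slots
avg_delay_slots = avg_queue / avg_rate   # Little's law: T = Q / lambda
print(f"estimated average delay: {avg_delay_slots:.2f} slots")
```

Minimizing the time-average queue length (the cost above) is thus equivalent, up to the arrival rate, to minimizing the average delay, which motivates the cost function choice.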
Problem 1 (Delay Optimal Multi-cell Control Problem). (In fact, the proposed solution framework can be easily extended to deal with a more general QoS-based optimization, for example, minimizing the average delay subject to constraints on the average data rates. The Lagrangian of such a constrained optimization, with one Lagrange multiplier per QoS constraint, has the same form as (1), so the proposed solution framework applies to the QoS-constrained problem as well.)
For given positive weight constants, find a stationary control policy that minimizes:
where the per-slot cost is the weighted sum of the per-user costs, and the expectation is taken w.r.t. the measure induced by the control policy and the transition kernel in (4). The positive weights indicate the relative importance of the users, and for any given choice of weights, the solution to (1) corresponds to a Pareto optimal point of the underlying multi-objective optimization problem. A control policy is Pareto optimal if no other policy can reduce the average cost of one user without increasing that of another.
IV General Solution to the Delay Optimal Problem
In this section, we show that the delay optimal Problem 1 can be modeled as an infinite horizon average cost POMDP, which is a very difficult problem. By exploiting its special structure, we derive an equivalent Bellman equation to solve the POMDP.
IV-A Preliminaries on MDP and POMDP
An infinite horizon average cost MDP can be characterized by a tuple of four objects: a finite state space, an action space, a transition kernel giving the probability of moving from one state to another under each action, and a per-slot cost function. The objective is to find the optimal policy that minimizes the average per-slot cost:
If the policy space consists of unichain policies and the associated induced Markov chain is irreducible, it is well known that there exists a unique optimal average cost for each starting state [11, 7]. Furthermore, the optimal control policy can be obtained from the following Bellman equation:
where the unknown is called the value function. Standard offline solutions, value or policy iteration, can be used to find the value function iteratively, as well as the optimal policy.
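On small toy problems, the value function of an average-cost Bellman equation can be found by relative value iteration; below is a generic sketch on a hypothetical two-state, two-action chain (the transition probabilities and costs are made up for illustration and have no connection to the cellular model):

```python
def relative_value_iteration(P, c, n_iter=5000):
    """P[a][s][s2]: transition probability; c[s][a]: per-slot cost.
    Returns (average cost theta, relative value function V), with
    state 0 as the reference state (V[0] pinned to 0)."""
    n_s, n_a = len(c), len(c[0])
    V = [0.0] * n_s
    theta = 0.0
    for _ in range(n_iter):
        TV = [min(c[s][a] + sum(P[a][s][s2] * V[s2] for s2 in range(n_s))
                  for a in range(n_a)) for s in range(n_s)]
        theta = TV[0]                      # average-cost estimate
        V = [tv - theta for tv in TV]      # subtract reference to stay bounded
    return theta, V

# Hypothetical chain: action 0 is "idle" (cheap, drifts to state 1),
# action 1 is "serve" (costly, drifts back to state 0).
P = [
    [[0.2, 0.8], [0.1, 0.9]],   # action 0
    [[0.9, 0.1], [0.8, 0.2]],   # action 1
]
c = [[0.0, 1.0], [2.0, 2.5]]    # c[s][a]
theta, V = relative_value_iteration(P, c)
print(f"average cost ~= {theta:.4f}, V = {[round(v, 4) for v in V]}")
```

This brute-force sweep over all states is exactly what becomes infeasible at cellular scale, motivating the distributive learning solution developed later.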
A POMDP is an extension of an MDP in which the control agent does not have direct observation of the entire system state (hence "partially observed MDP"). Specifically, an infinite horizon average cost POMDP is characterized by the tuple of an underlying MDP together with a finite set of observations and an observation function [16, 17]. The observation function gives the probability (or stochastic relationship) between the partial observation, the actual system state, and the control action; specifically, it is the probability of obtaining a given partial observation when the current system state is given and a particular action was taken in the previous slot. In a POMDP, the actions must be based on the observations rather than on the full state. The objective is to find the optimal policy that minimizes the average per-slot cost in (7). In general, however, this is an NP-hard problem, and various approximate solutions have been proposed based on the special structure of the studied problems.
IV-B Equivalent Bellman Equation and Optimal Control Policy
In this subsection, we first illustrate that the optimization Problem 1 is an infinite horizon average cost POMDP. We then exploit the special problem structure to reduce the complexity and derive an equivalent Bellman equation to solve the problem. Note that in Problem 1 the ICI management control policy is adaptive to the QSI only, while the user scheduling policy is adaptive to the complete system state. Therefore, the optimal control policy cannot be obtained by solving a standard Bellman equation from a conventional MDP (a standard Bellman equation would yield a policy that is a function of the complete system state). In fact, Problem 1 is a POMDP with the following specification.
State Space: The system state consists of the global QSI and the global CSI.
Action Space: The action consists of the ICI management pattern and the user scheduling action.
Transition Kernel: The transition probability is given in (4).
Per-Slot Cost Function: The per-slot cost function is the weighted sum of the per-user costs.
Observation: The observation for the ICI management control policy is the global QSI, while the observation for the user scheduling policy is the complete system state.
Observation Function: The observation function for the ICI management control policy equals 1 if the observation matches the global QSI of the current state, and 0 otherwise. Similarly, the observation function for the user scheduling policy equals 1 if the observation matches the complete system state, and 0 otherwise.
While a POMDP is a very difficult problem in general, we utilize the notion of action partitioning in our problem to substantially simplify it. We first define partitioned actions below.
Definition 3 (Partitioned Actions)
Given a control policy, we define the partitioned actions as the collection of actions under a given QSI realization for all possible CSI realizations. The complete policy is therefore equal to the union of all the partitioned actions. ∎
Based on the action partitioning, we can transform the POMDP problem into a regular infinite-horizon average cost MDP. Furthermore, the optimal control policy can be obtained by solving an equivalent Bellman equation which is summarized in the theorem below.
Theorem 1 (Equivalent Bellman Equation)
The optimal control policy in Problem 1 can be obtained by solving the equivalent Bellman equation given by:
where the per-slot cost function is as defined above, and the conditional transition kernel of the QSI is given by
Suppose a value function solves the Bellman equation in (9); then the optimal control policy for the original Problem 1 is obtained by minimizing the right-hand side of (9) over the ICI management pattern and the user scheduling action. The value function that solves (9) is a component-wise monotonically increasing function. ∎
Please refer to Appendix A.
Note that solving (9) yields an ICI management policy that is a function of the QSI and a user scheduling policy that is a function of both the QSI and the CSI. We illustrate this with a simple example below.
Suppose there are two BSs with equal transmit power, and there are three ICI management control patterns: BS 1 active only, BS 2 active only, and both BSs active. Assume deterministic arrivals where one bit always arrives in each slot. Each BS serves a fixed number of users, the path loss is identical for all links, and the small scale fading gain is chosen from two values with equal probability; as a result, the global CSI state space is finite. (For ease of discussion, we consider a discrete state space in this example; the proposed algorithms and convergence results also hold for general continuous state spaces.) Given a realization of the global QSI, the partitioned actions (following Definition 3) are given by:
where the last term denotes the number of departed bits. For a given ICI management control pattern, the optimal user scheduling policy is
Observe that the RHS of (14) is a decoupled objective function w.r.t. the per-user scheduling variables, and hence, applying standard decomposition theory,
As a result, the optimal ICI management control policy is given by:
where the user scheduling policy given in (15) is the optimal one under the given ICI management control pattern. Using Theorem 1, the optimal ICI management control and user selection control of the original Problem 1 for any CSI and QSI realization are obtained accordingly. ∎
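The two-BS example can be sketched end to end: for each pattern, each active BS schedules its better user, and the pattern with the smallest resulting cost wins. All queue and rate numbers below are invented for illustration; the residual-backlog cost is a stand-in for the value-function terms in (14):

```python
# Hypothetical two-BS, two-users-per-BS instance of the example above.
# rates[b][mode][k]: bits deliverable to user k of BS b this slot;
# "shared" rates are lower because of mutual interference.
patterns = [(1, 0), (0, 1), (1, 1)]
queues = {0: [3, 0], 1: [2, 4]}          # queues[b][k], in bits
rates = {
    0: {"alone": [3, 2], "shared": [1, 1]},
    1: {"alone": [2, 4], "shared": [1, 2]},
}

def bs_cost(b, mode):
    """An active BS schedules the one user minimizing its residual backlog
    (served user's leftover bits plus the untouched bits of the others)."""
    return min(max(q - r, 0) + sum(queues[b]) - q
               for q, r in zip(queues[b], rates[b][mode]))

def pattern_cost(pat):
    """Sum of per-BS residual backlogs: inactive BSs keep their full backlog."""
    mode = "shared" if sum(pat) == 2 else "alone"
    return sum(bs_cost(b, mode) if active else sum(queues[b])
               for b, active in enumerate(pat))

best = min(patterns, key=pattern_cost)
print(best, pattern_cost(best))   # with these numbers, activating BS 2 alone wins
```

Note the two-stage structure mirroring (15) and (16): the inner minimization (user choice) is solved per BS, and the outer minimization runs only over the three patterns.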
V Distributive Value Function and Q-factor Online Learning
The solution in Theorem 1 requires knowledge of the value function. However, obtaining the value function is not trivial, as solving the Bellman equation (9) involves a very large system of nonlinear fixed-point equations (one per realization of the QSI in (9)). A brute-force solution requires huge complexity, centralized implementation, and knowledge of the global CSI and QSI at the BSC; it would also induce huge signaling overhead, because the QSI of the users is maintained locally at the BSs. In this section, we propose a decentralized solution via distributive stochastic learning, following the structure illustrated in Fig. 2. Moreover, we prove that the proposed distributive stochastic learning algorithm converges almost surely.
V-A Post-Decision State Framework
In this section, we first introduce the post-decision state framework, also used in the literature and the references therein, to lay the groundwork for developing the online learning algorithm. The post-decision state is defined as the virtual system state immediately after an action is taken but before the new bits arrive. Starting from the state at the beginning of a slot (the pre-decision state) and applying an action, the post-decision state is obtained by removing the delivered bits from the queues. If new arrivals then occur in the post-decision state and the CSI changes, the system reaches the next actual state, i.e., the next pre-decision state.
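The pre-decision / post-decision bookkeeping can be sketched as follows (the function and variable names are our own placeholders):

```python
def post_decision(queues, served_bits):
    """Virtual state right after the scheduling action: departures have
    left the queues, but new arrivals have not yet been added."""
    return [max(q - d, 0) for q, d in zip(queues, served_bits)]

def next_pre_decision(post_queues, arrivals, buffer_bits):
    """Arrivals (and the fresh CSI, not modeled here) move the system
    from the post-decision state to the next pre-decision state."""
    return [min(q + a, buffer_bits) for q, a in zip(post_queues, arrivals)]

pre = [5, 2, 0]                       # pre-decision QSI of three users
post = post_decision(pre, [3, 2, 0])  # the action delivers 3, 2, 0 bits
assert post == [2, 0, 0]
nxt = next_pre_decision(post, arrivals=[1, 4, 2], buffer_bits=6)
assert nxt == [3, 4, 2]
```

The appeal of the post-decision state is that the action's effect is applied deterministically first, so all remaining randomness (arrivals, CSI) enters through a single expectation, which is what the learning algorithm later samples.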
Using the action partitioning and defining the value function on the post-decision states, the post-decision value function satisfies the post-decision state Bellman equation
where the next post-decision state is reached from the current one through the arrivals, the CSI transition, and the next action. As in Theorem 1, the post-decision value function is also a component-wise monotonically increasing function. The optimal policy is obtained by solving the RHS of the Bellman equation (17).
V-B Distributive User Scheduling Policy on the CSI Time Scale
To reduce the size of the state space and to decentralize the user scheduling, we approximate the post-decision value function in (17) by a sum of per-user post-decision value functions. (Using the linear approximation in (18), we address the curse of dimensionality as well as facilitate distributive implementation, where each BS solves for its per-user value functions based on local CSI and QSI only.) That is,
where is defined as the fixed point of the following per-user fixed point equation:
where the equation is written on the pre-decision state, the scheduling indicator means that the user is scheduled to transmit at its BS, and a reference state and a reference ICI management pattern (with the corresponding BS active) are fixed. The per-user value function is obtained by the proposed distributive online learning algorithm (explained in Section V-D). Note that the state space for the per-user value functions is substantially reduced from exponential growth w.r.t. the total number of mobile users to linear growth.
Corollary 1 (Decentralized User Scheduling Actions)
where, at each activated BS, the scheduled user is the one with the largest local metric, which depends on the power sum of interference and noise and on the signal power. (Note that users with an empty buffer will not be scheduled, and the activated BS serves the users with non-empty buffers; the chance that the buffers of all users are empty in a given slot is very small.) ∎
Please refer to Appendix B.
Remark 1 (Structure of the User Scheduling Actions)
The user scheduling action in (20) is a function of both local CSI and local QSI. Specifically, the number of bits to be delivered is controlled by the local CSI, while the local QSI determines the scheduling weight. Each user estimates its channel quantities in the preamble phase and sends them to the associated BS according to the process indicated in Fig. 2. ∎
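The decentralized rule can be sketched as below. The score (weighted number of bits actually drained) is our own simplification standing in for the per-user value-function terms of (20), but it preserves the structural properties of the corollary: only local CSI and QSI are used, and empty-buffer users are never scheduled:

```python
def schedule_user(queues, rates, weights):
    """Pick the local user with the largest weighted backlog reduction.
    `rates` come from local CSI, `queues` from local QSI; users with an
    empty buffer score 0 and are never preferred over a backlogged one."""
    scores = [w * (q - max(q - r, 0))      # w * min(q, r): bits drained
              for q, r, w in zip(queues, rates, weights)]
    best = max(range(len(queues)), key=lambda k: scores[k])
    return best if scores[best] > 0 else None

# User 1 has the best channel but an empty buffer, so user 0 is served.
assert schedule_user(queues=[4, 0, 2], rates=[2, 9, 1], weights=[1, 1, 1]) == 0
# With nothing buffered anywhere, the BS schedules no one.
assert schedule_user(queues=[0, 0], rates=[5, 5], weights=[1, 1]) is None
```

Because the rule ranks only the BS's own users with locally known quantities, no message exchange between BSs is needed on the fast CSI time scale.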
V-C ICI Management Control Policy on the QSI Time Scale
To determine the ICI management control policy, we define the Q-factor as follows:
where the kernel is the transition probability from the current QSI to the next QSI given the current action. Note that the Q-factor represents the potential cost of applying a control action at the current QSI and then following the policy for all future system states. Similar to (18), we approximate the Q-factor in (21) by a sum of per-user Q-factors, i.e.,
where is defined as the fixed point of the following per-user fixed point equation:
where a reference state and a reference ICI management control pattern are fixed. The per-user Q-factor is obtained by the proposed distributive online learning algorithm (explained in Section V-D). The BSC collects the per-BS Q-information at the beginning of each slot, and the ICI management control policy is given by:
In order to reduce the communication overhead between the BSs and the BSC, we can further partition the local QSI space into regions (one possible criterion is to partition the local QSI space so that the probability of belonging to any region is the same, i.e., uniform probability partitioning), as illustrated in Fig. 3. At the beginning of each slot, a BS updates the BSC with its per-BS Q-information only if its QSI state belongs to a new region. Hence, the per-BS Q-information at the BSC is updated according to the following dynamics:
Remark 2 (Communication Overhead)
The communication overhead between the BS and the BSC is reduced from exponential growth w.r.t. the number of users per BS to a constant, i.e., O(1) w.r.t. the number of users. ∎
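The region-triggered reporting can be sketched as follows. The equal-width partition and the cached per-BS Q-values are our own illustrative placeholders (the paper suggests uniform *probability* partitioning; equal-width regions are used here only to keep the sketch simple):

```python
def region_index(q, buffer_bits, n_regions):
    """Map a local queue length in {0, ..., buffer_bits} to one of
    n_regions equal-width regions."""
    return min(q * n_regions // (buffer_bits + 1), n_regions - 1)

class BscCache:
    """BSC-side cache: a BS re-reports its Q-information only when its
    QSI moves to a new region, keeping the signaling overhead O(1)."""
    def __init__(self, n_bs):
        self.region = [None] * n_bs
        self.q_info = [0.0] * n_bs
        self.updates = 0

    def report(self, bs, q, buffer_bits, n_regions, q_value):
        r = region_index(q, buffer_bits, n_regions)
        if r != self.region[bs]:          # only region crossings trigger reports
            self.region[bs] = r
            self.q_info[bs] = q_value
            self.updates += 1

cache = BscCache(n_bs=2)
cache.report(0, q=1, buffer_bits=9, n_regions=2, q_value=0.3)  # first report
cache.report(0, q=3, buffer_bits=9, n_regions=2, q_value=0.9)  # same region: no update
cache.report(0, q=7, buffer_bits=9, n_regions=2, q_value=1.5)  # crossed: update
assert cache.updates == 2 and cache.q_info[0] == 1.5
```

The BSC then evaluates each ICI pattern by summing the cached per-BS Q-information and activates the pattern with the smallest sum, as in (24).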
V-D Online Per-User Value Function and Per-User Q-factor Learning Algorithm
The system procedure for distributive online learning is given below:
Initialization: Each BS initializes the per-user value function and the per-user Q-factor for each of its users.
User Scheduling: If a BS is activated by the current ICI management pattern, it transmits, and the user scheduling is determined according to (20).
Local Per-user Value Function and Per-user Q-factor Update: Based on the current observations, each BS updates its per-user value functions and per-user Q-factors according to Algorithm 1.
Fig. 2 illustrates the above procedure with a flowchart. The algorithm for the per-user value function and per-user Q-factor update is given below:
Algorithm 1 (Online Learning Algorithm)
Let the current observations consist of the post-decision and pre-decision states, the new arrivals, the local CSI, and the realization of the ICI management control pattern. The online learning algorithm for each user is given by
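A minimal single-queue sketch of this kind of stochastic update (a TD(0)-style iteration on a per-user relative value function with a decreasing step size, anchored at a reference state as in (19)) might look like the following; the queue/arrival model and all constants are illustrative, not the paper's Algorithm 1:

```python
import random

random.seed(1)
BUF = 5
V = [0.0] * (BUF + 1)        # per-user value-function estimate V[q]
REF = 0                      # reference state, as in the fixed-point equation

q = 0
for t in range(1, 50_001):
    eps = 1.0 / t ** 0.6     # decreasing step size: sum diverges, sum of squares converges
    a = random.randint(0, 2)                       # random bit arrivals
    q_next = min(max(q - 1, 0) + a, BUF)           # serve one bit, then admit
    cost = q                                       # per-slot cost: queue length
    # Relative-value stochastic update: subtracting V[REF] keeps the
    # iterates bounded, playing the role of the average-cost offset.
    V[q] += eps * (cost + V[q_next] - V[REF] - V[q])
    q = q_next

print([round(v, 2) for v in V])
assert V[BUF] > V[0]   # larger backlog => larger estimated cost-to-go
```

Each update touches only the state just visited and uses only locally observed quantities (current backlog, arrivals, departures), which is the property that lets every BS run the learning for its own users without any global state exchange.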