# Delay-Optimal User Scheduling and Inter-Cell Interference Management in Cellular Network via Distributive Stochastic Learning

## Abstract

In this paper, we propose a distributive queue-aware intra-cell user scheduling and inter-cell interference (ICI) management control design for a delay-optimal cellular downlink system with multiple base stations (BSs) and multiple users in each cell. Each BS maintains a downlink queue for each of its users, with heterogeneous arrivals and delay requirements. The ICI management control is adaptive to the joint queue state information (QSI) over a slow time scale, while the user scheduling control is adaptive to both the joint QSI and the joint channel state information (CSI) over a faster time scale. We show that the problem can be modeled as an infinite-horizon average-cost Partially Observed Markov Decision Process (POMDP), which is NP-hard in general. By exploiting the special structure of the problem, we derive an equivalent Bellman equation to solve the POMDP. To address the distributive requirement and the issues of dimensionality and computational complexity, we derive a distributive online stochastic learning algorithm, which requires only local QSI and local CSI at each BS. We show that the proposed learning algorithm converges almost surely (with probability 1) and achieves significant gain compared with various baselines. The proposed solution has only linear complexity order with respect to the total number of users.

## 1 Introduction

It is well known that cellular systems are *interference limited*, and many works address *inter-cell interference* (ICI) in cellular systems. Specifically, the optimal binary power control (BPC) for sum rate maximization was studied in [1], where it was shown that BPC provides reasonable performance compared with multi-level power control in the multi-link system. In [2], the authors studied a joint adaptive multi-pattern reuse and intra-cell user scheduling scheme to maximize the long-term network-wide utility. There, the ICI management runs on a slower time scale than the user selection strategy in order to reduce the communication overhead. In [3] and the references therein, cooperation or coordination is also shown to be a useful tool to manage ICI and improve the performance of the cellular network.

However, all of these works assume infinite backlogs at the transmitter, and the control policy is a function of the channel state information (CSI) only. In practice, applications are delay sensitive, and it is critical to optimize the delay performance of the cellular network. A systematic approach to delay-optimal resource control in the general delay regime is the Markov Decision Process (MDP) technique. In [4], the authors applied it to obtain the delay-optimal cross-layer control policy for the broadcast channel and the point-to-point link, respectively. However, very few works have studied the delay-optimal control problem in the cellular network. Most existing works simply propose heuristic control schemes with partial consideration of the queuing delay [6]. As we shall illustrate, delay-optimal control of the cellular network involves several technical challenges.

**Curse of Dimensionality:** Although the MDP technique is the systematic approach to solving the delay-optimal control problem, a primary difficulty is the curse of dimensionality [7]. For example, a huge state space (exponential in the number of users and the number of cells) is involved in the MDP, and brute-force value or policy iteration cannot lead to any implementable solution [8]. Furthermore, brute-force solutions require explicit knowledge of the transition probabilities of the system states, which is difficult to obtain in complex systems.

**Complexity of the Interference Management:** Jointly optimal ICI management and user scheduling requires heavy computation overhead even for the throughput optimization problem [2]. Although grouping clusters of cells [1] and considering only neighboring BSs [10] were proposed to reduce the complexity, complex operations on a slot-by-slot basis are still required, which is not suitable for practical implementation.

**Decentralized Solution:** For delay-optimal multi-cell control, the entire system state is characterized by the global CSI (CSI from any BS to any MS) and the global QSI (queue lengths of all users). Such system state information is distributed locally at each BS, and a centralized solution (which requires global knowledge of the CSI and QSI) will induce substantial signaling overhead between the BSs and the Base Station Controller (BSC).

In this paper, we consider delay-optimal ICI management control and intra-cell user scheduling for the cellular system. For implementation considerations, the ICI management control is computed at the BSC on a longer time scale and is adaptive to the QSI only. On the other hand, the intra-cell user scheduling control is computed distributively at the BSs on a shorter time scale and is therefore adaptive to both the CSI and the QSI. Due to the *two time-scale* control structure, the delay-optimal control is formulated as an infinite-horizon average-cost Partially Observed Markov Decision Process (POMDP). Exploiting the special structure, we propose an *equivalent Bellman equation* to solve the POMDP. Based on the equivalent Bellman equation, we propose a distributive online learning algorithm to estimate a per-user value function as well as a per-user Q-factor. Convergence proofs for such learning algorithms conventionally rely on *contraction mapping* arguments [7]. However, due to the distributive learning requirement and the simultaneous learning of the per-user value function and Q-factor, it is not trivial to establish the contraction mapping property and the associated convergence proof. We also illustrate the performance gain of the proposed solution against various baselines via numerical simulations. Furthermore, the solution has linear complexity order and is well suited to practical implementation.

## 2 System Model

In this section, we elaborate on the system model as well as the control policies. We consider the downlink of a wireless cellular network consisting of BSs, and there are mobile users in each cell served by one BS. Specifically, let and denote the set of BSs and the set of users served by the BS respectively. denotes the -th user served by BS . The time dimension is partitioned into *scheduling slots* (every slot lasts for seconds). The system model is illustrated in Fig. ?.

### 2.1 Source Model

In each BS, there are independent application streams dedicated to the users respectively. Let and , where represents the new arrivals (number of bits) for the user at the end of the slot .

Let denote the global QSI in the system, where is the state space for the global QSI. denotes the QSI in the BS , where represents the number of bits for user at the beginning of the slot , and denotes the maximal buffer size (bits). When the buffer is full, i.e., , newly arriving bits are dropped. The cardinality of the global QSI is .

### 2.2 Channel Model and Physical Layer Model

Let and denote the small-scale channel fading gain and the path loss from the -th BS to the user respectively, and is the local CSI state for user . denotes the local CSI state for BS , and the global CSI is denoted as , where is the state space for the global CSI.

The cellular system shares a single common channel with bandwidth Hz (all the BSs use the same channel). At the beginning of each slot, the BS is either turned on (with transmit power ) or off (with transmit power 0), according to an *ICI management control policy*, which is defined later. At each slot, a BS can select only one user for its data transmission. Specifically, let denote an ICI management control pattern, where denotes that BS is active, otherwise, and denotes the set of all possible control patterns. Furthermore, let be the set of BSs activated by the pattern and be the set of patterns that activate the BS . The signal received by the user at slot , when pattern is selected, is given by

where is the transmit signal from the -th BS to the -th user at slot , and is the i.i.d. noise. The achievable data rate of user can be expressed by

where , is an indicator variable with when the user is scheduled. is a constant that can be used to model both coded and uncoded systems [5].
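As a rough illustration of how an ICI management pattern and a scheduling indicator jointly determine a user's rate, the following minimal sketch computes a Shannon-type rate from a signal-to-interference-plus-noise ratio, treating interference from the other active BSs as noise. The function name, the argument names, the combined gain values, and the exact form of the expression are assumptions made for illustration; they are not taken from the rate expression of this paper.

```python
import math

def achievable_rate(bs, scheduled, pattern, gains, tx_power,
                    bandwidth_hz, noise_psd, gap=1.0):
    """Rate (bits/s) of one user attached to base station `bs`.

    pattern[m] is 1 if BS m is switched on by the ICI pattern, else 0.
    gains[m] is an assumed combined power gain (path loss times squared
    fading magnitude) from BS m to this user.
    `gap` is an assumed SINR scaling constant (1.0 means no gap).
    """
    # A user gets zero rate if it is not scheduled or its BS is switched off.
    if not scheduled or not pattern[bs]:
        return 0.0
    signal = tx_power * gains[bs]
    # Only the other BSs that are active under the pattern interfere.
    interference = sum(tx_power * gains[m]
                       for m in range(len(pattern))
                       if m != bs and pattern[m])
    noise = noise_psd * bandwidth_hz
    sinr = signal / (noise + interference)
    return bandwidth_hz * math.log2(1.0 + gap * sinr)
```

Switching a neighboring BS off in the pattern removes its term from the interference sum, which is how the pattern trades the neighbor's transmission opportunity for a higher rate at the active BS.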

### 2.3 ICI Management and User Scheduling Control Policy

At the beginning of the slot, the BSC will decide which BSs are allowed to transmit according to a stationary ICI management control policy defined below.

Let be the global system state at the beginning of slot . The active user at each cell is selected according to a user scheduling policy defined below.

For notational convenience, let be the joint control policy, and be the control action under state .

## 3 Problem Formulation

In this section, we will first elaborate the dynamics of system state under a control policy . Based on that, we shall formally formulate the delay-optimal control problem.

### 3.1 Dynamics of System State

Given the new arrival at the end of the slot , the current system state and the control action , the queue evolution for user is given by:

where is the number of bits delivered to user at slot , and , given by (Equation 2), is the achievable data rate under the control action . denotes the floor of , , and . Let , and , for the user , and . Therefore, given a control policy , the random process is a controlled Markov chain with transition probability

### 3.2 Delay Optimal Control Problem Formulation

Given a stationary control policy , the average cost of the user is given by:

where is a monotonic increasing cost function of . For example, when , using Little's Law [4], is an approximation of the average delay of the user; another choice of the cost function yields the *bit dropping probability* (conditioned on bit arrival). Note that the queues in the cellular system are coupled together via the control policy . In this paper, we seek an optimal stationary control policy that minimizes the average cost in (Equation 4). Specifically, we have:

## 4 General Solution to the Delay Optimal Problem

In this section, we will show that the delay optimal problem ? can be modeled as an infinite horizon average cost POMDP, which is a very difficult problem. By exploiting the special structure, we shall derive an *equivalent Bellman equation* to solve the POMDP problem.

### 4.1 Preliminary on MDP and POMDP

An infinite horizon average cost MDP can be characterized by a tuple of four objects: , where is a finite set of states and is the action space. is the transition probability from state to , given that the action is taken. is the per-slot cost function. The objective is to find the optimal policy so as to minimize the average per-slot cost as:

If the policy space consists of *unichain policies* and the associated induced Markov chain is irreducible, it is well known that there exists a unique for each starting state [11]. Furthermore, the optimal control policy can be obtained from the following Bellman equation.

where is called the value function. General offline solutions, namely *value iteration* or *policy iteration*, can be used to find the value function iteratively, as well as the optimal policy [7].
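To make the offline value iteration mentioned above concrete, here is a minimal sketch of relative value iteration for a small finite average-cost MDP. The two-state, two-action example (an "idle" and a "serve" action on a toy queue) and all names are hypothetical; the sketch assumes a unichain, aperiodic model so that the iteration converges.

```python
def relative_value_iteration(P, g, ref=0, tol=1e-10, max_iter=10000):
    """Solve theta + V(s) = min_a [ g[s][a] + sum_s2 P[a][s][s2] * V(s2) ].

    P[a][s][s2] is the transition probability and g[s][a] the per-slot cost.
    Returns (theta, V, policy) with V[ref] pinned to zero.
    """
    n_states, n_actions = len(g), len(g[0])
    V = [0.0] * n_states
    theta, policy = 0.0, [0] * n_states
    for _ in range(max_iter):
        backed_up = []
        for s in range(n_states):
            q = [g[s][a] + sum(P[a][s][s2] * V[s2] for s2 in range(n_states))
                 for a in range(n_actions)]
            policy[s] = min(range(n_actions), key=q.__getitem__)
            backed_up.append(q[policy[s]])
        # Subtracting the value at the reference state keeps V bounded.
        theta = backed_up[ref]
        new_V = [x - theta for x in backed_up]
        converged = max(abs(x - y) for x, y in zip(new_V, V)) < tol
        V = new_V
        if converged:
            break
    return theta, V, policy

# Hypothetical toy model: state = queue empty (0) or backlogged (1).
P = [
    [[0.5, 0.5], [0.0, 1.0]],  # action 0 (idle)
    [[0.9, 0.1], [0.8, 0.2]],  # action 1 (serve)
]
g = [[0.0, 0.2], [1.0, 1.2]]   # holding cost plus 0.2 for serving
theta, V, policy = relative_value_iteration(P, g)
```

The number of states enumerated in the inner loop is what grows exponentially with the number of queues in the multi-user setting, which is why this brute-force approach does not scale.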

A POMDP is an extension of an MDP in which the control agent does not have direct observation of the entire system state (hence the name "partially observed MDP"). Specifically, an infinite horizon average cost POMDP can be characterized by a tuple [16]: , where characterize an MDP and is a finite set of observations. is the observation function, which gives the probability (or stochastic relationship) between the partial observation , the actual system state and the control action . Specifically, is the probability of getting a partial observation “” given that the current system state is and the action was taken in the previous slot. A POMDP is an MDP in which the current system state is not directly observed and the actions are based on the observation . The objective is to find the optimal policy so as to minimize the average per-slot cost in (Equation 5). However, this is in general an *NP-hard* problem, and various approximate solutions have been proposed based on the special structure of the studied problems [18].

### 4.2 Equivalent Bellman Equation and Optimal Control Policy

In this subsection, we shall first illustrate that the optimization problem ? is an infinite horizon average cost POMDP. We shall then exploit some special problem structure to simplify the complexity and derive an *equivalent Bellman equation* to solve the problem. For instance, in the delay optimal problem ?, the ICI management control policy is adaptive to the QSI , while the user scheduling policy is adaptive to the complete system state . Therefore, the optimal control policy cannot be obtained by solving a standard Bellman equation from conventional MDP.

The POMDP is specified as follows:

- **State Space:** The system state is the global QSI and CSI .
- **Action Space:** The action is ICI management pattern and user scheduling .
- **Transition Kernel:** The transition probability is given in (Equation 3).
- **Per-Slot Cost Function:** The per-slot cost function is .
- **Observation:** The observation for the ICI management control policy is the global QSI, i.e., , while the observation for the user scheduling policy is the complete system state, i.e., .
- **Observation Function:** The observation function for the ICI management control policy is , if , otherwise 0. Furthermore, the observation function for the user scheduling policy is , if , otherwise 0.

While POMDP is a very difficult problem in general, we shall utilize the notion of *action partitioning* in our problem to substantially simplify the problem. We first define *partitioned actions* below.

Based on the action partitioning, we can transform the POMDP problem into a regular infinite-horizon average cost MDP. Furthermore, the optimal control policy can be obtained by solving an *equivalent Bellman equation* which is summarized in the theorem below.

Please refer to Appendix A.

Note that solving ( ?) yields an ICI management policy that is a function of the QSI and a user scheduling policy that is a function of the QSI and CSI . We shall illustrate this with a simple example below.

## 5 Distributive Value Function and Q-factor Online Learning

The solution in Theorem ? requires knowledge of the value function . However, obtaining the value function is not trivial, as solving the Bellman equation ( ?) involves solving a very large system of nonlinear fixed point equations (corresponding to each realization of in ( ?)). A brute-force solution of requires huge complexity, centralized implementation, and knowledge of the global CSI and QSI at the BSC. This also induces huge signaling overhead because the QSI of all the users is maintained locally at the BSs. In this section, we propose a decentralized solution via distributive stochastic learning following the structure illustrated in Fig. ?. Moreover, we prove that the proposed distributive stochastic learning algorithm converges almost surely.

### 5.1 Post-Decision State Framework

In this section, we first introduce the post-decision state framework, also used in [19] and the references therein, to lay the groundwork for developing the online learning algorithm. The post-decision state is defined to be the virtual system state immediately after making an action but before the new bits arrive. For example, if is the state at the beginning of some time slot (also called the *pre-decision state*) and an action is taken, the post-decision state immediately after the action is , where the transition to is given by . If new arrivals occur in the post-decision state, and the CSI changes to , then the system reaches the next actual state, i.e., pre-decision state, .
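The relation between the pre-decision queue, the post-decision queue, and the next pre-decision queue can be sketched for a single user as follows. This is a minimal illustration consistent with the description above (bits are delivered first, then arrivals are added, and bits exceeding the buffer are dropped); the function and argument names are hypothetical, and arrivals are assumed to be whole numbers of bits.

```python
import math

def queue_step(q_bits, rate_bps, slot_sec, arrivals_bits, max_buffer_bits):
    """One slot of a single user's queue.

    Returns (post_decision_q, next_q, dropped_bits).
    """
    # Bits delivered in the slot: floor of rate times slot duration,
    # capped by what is actually in the queue.
    delivered = min(q_bits, math.floor(rate_bps * slot_sec))
    # Post-decision state: after the action, before new arrivals.
    post_decision_q = q_bits - delivered
    # Next pre-decision state: arrivals join, overflow is dropped.
    next_q = min(post_decision_q + arrivals_bits, max_buffer_bits)
    dropped = post_decision_q + arrivals_bits - next_q
    return post_decision_q, next_q, dropped
```

For instance, `queue_step(1000, 1000.0, 0.5, 300, 1200)` delivers 500 bits, leaving a post-decision queue of 500 bits and a next pre-decision queue of 800 bits with nothing dropped.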

Using the action partitioning and defining the value function on post-decision state (where pre-decision state is ), will satisfy the post-decision state Bellman equation [19]

where , , and is the next post-decision state transited from . As in Theorem ?, is also a component-wise monotonic increasing function. The optimal policy is obtained by solving the RHS of the Bellman equation (Equation 7).

### 5.2 Distributive User Scheduling Policy on the CSI Time Scale

To reduce the size of the state space and to decentralize the user scheduling, we approximate in (Equation 7) by the sum of per-user post-decision state value functions:

where is defined as the *fixed point* of the following per-user fixed point equation:

where is the pre-decision state, means that the user is scheduled to transmit at BS , is a reference state and is a reference ICI management pattern (with the BS active). The per-user value function is obtained by the proposed distributive online learning algorithm (explained in Section 5.4). Note that the state space for the value function is substantially reduced from (exponential growth w.r.t. the number of all mobile users ) to (linear growth w.r.t. the number of all mobile users).

Please refer to Appendix B.

### 5.3 ICI Management Control Policy on the QSI Time Scale

To determine the ICI management control policy, we define the Q-factor as follows [11]:

where is the transition probability from the current QSI to , given the current action , and is a constant. Note that the Q-factor represents the potential cost of applying a control action at the current QSI and applying the action for any system state in the future. Similar to (Equation 8), we approximate the Q-factor in (Equation 10) with a sum of per-user Q-factors, i.e.,

where is defined as the *fixed point* of the following per-user fixed point equation:

where . is a reference state and is a reference ICI management control pattern. The per-user Q-factor is obtained by the proposed distributive online learning algorithm (explained in Section 5.4). The BSC collects the per-BS Q-information at the beginning of slot , and the ICI management control policy is given by:

In order to reduce the communication overhead between the BSs and the BSC, we could further partition the local QSI space into regions.
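The decision made at the BSC can be pictured with the following minimal sketch, which assumes that each BS reports, for every candidate ICI pattern, the sum of its users' per-user Q-factors at its current local QSI, and that the BSC picks the pattern with the smallest total. The function name and data layout are hypothetical, and the exact form of the policy in the paper may differ.

```python
def select_ici_pattern(patterns, per_bs_q_info):
    """Pick the ICI pattern with the smallest aggregated Q-information.

    patterns: list of tuples of 0/1, one entry per BS (1 = active).
    per_bs_q_info[n][i]: value reported by BS n for pattern index i.
    Returns (best_pattern, best_total).
    """
    # The BSC only adds up one number per BS and per pattern.
    totals = [sum(bs_info[i] for bs_info in per_bs_q_info)
              for i in range(len(patterns))]
    best = min(range(len(patterns)), key=totals.__getitem__)
    return patterns[best], totals[best]

# Hypothetical two-BS example with three candidate patterns.
patterns = [(1, 0), (0, 1), (1, 1)]
per_bs_q_info = [
    [3.0, 5.0, 4.0],  # reported by BS 0
    [6.0, 2.0, 3.5],  # reported by BS 1
]
chosen, total = select_ici_pattern(patterns, per_bs_q_info)
```

Because the aggregate is a plain sum, the signaling per BS grows with the number of patterns only, and quantizing the local QSI into regions reduces how often the reported values change.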

### 5.4 Online Per-User Value Function and Per-User Q-factor Learning Algorithm

The system procedure for distributive online learning is given below:

1. **Initialization**: Each BS initializes the per-user value function and Q-factor for its users, denoted as and , where .
2. **ICI Management Control**: At the beginning of the -th slot, the BSC updates the Q-information as (Equation 14) and determines the ICI management pattern as (Equation 13).
3. **User Scheduling**: If , BS is selected to transmit. The user scheduling policy is determined according to ( ?).
4. **Local Per-user Value Function and Per-user Q-factor Update**: Based on the current observations, each of the BSs updates the per-user value function and the per-user Q-factor according to Algorithm ?.

Fig. ? illustrates the above procedure by a flowchart. The algorithm for the per-user value function and per-user Q-factor update is given below:

### 5.5 Convergence Analysis

In this section we will establish the convergence proof of the proposed per-user learning algorithm ?. We first define a mapping on the post-decision state as

where is the pre-decision state, , and . The vector form of the mapping is given by:

where is the transition matrix for the post-decision state queue of the user . and are vectors. Specifically, we have the following lemma for the per-user value function learning in ( ?).

Please refer to Appendix C.

Note that ( ?) is equivalent to the per-user fixed point equation in (Equation 9). This result illustrates that the proposed online distributive learning in ( ?) can converge to the target per-user fixed point solution in (Equation 9). We define a mapping for the per-user Q-factor as

Specifically, we have the following lemma for the Q-factor online learning in ( ?).

Note that ( ?) is equivalent to the per-user fixed point equation for in (Equation 12). This result illustrates that the proposed online distributive learning in ( ?) can converge to the target per-user fixed point solution in (Equation 12).

Lemmas ? and ? only establish the convergence of the proposed online learning algorithm. Strictly speaking, the converged result is not optimal due to the linear approximation of the value function and the Q-factor in (Equation 8) and (Equation 11) respectively. The linear approximation is needed for distributive implementation. As illustrated in Fig. ?, the proposed distributive solution has close-to-optimal performance compared with the brute-force centralized solution of the Bellman equation in ( ?).
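To give a feel for the kind of iteration whose convergence is analyzed here, the following is a minimal, generic sketch of a stochastic-approximation update of a relative value function with a reference state, run on a hypothetical two-state chain under a fixed policy. It is not the per-user update of Algorithm ?: the chain, the cost, the step size (one over the visit count, which is not summable but is square summable), and all names are assumptions chosen for illustration.

```python
import random

def learn_relative_value(n_states, step, cost, ref=0,
                         n_slots=200_000, seed=0):
    """Online estimate of V solving V(s) = cost(s) + E[V(s')] - V(ref)."""
    rng = random.Random(seed)
    V = [0.0] * n_states
    visits = [0] * n_states
    s = ref
    for _ in range(n_slots):
        s_next = step(s, rng)
        visits[s] += 1
        eps = 1.0 / visits[s]  # diminishing step size per state
        # Move V(s) toward the sampled right-hand side; subtracting
        # V(ref) keeps the iterates from drifting.
        V[s] += eps * (cost(s) + V[s_next] - V[ref] - V[s])
        s = s_next
    return V

def toy_step(s, rng):
    # Hypothetical chain: next state is uniform on {0, 1}.
    return rng.randrange(2)

def toy_cost(s):
    # Hypothetical per-slot cost equal to the state index.
    return float(s)

V = learn_relative_value(2, toy_step, toy_cost)
```

For this toy chain the fixed point is V(0) = 0.5 and V(1) = 1.5, where V(0) equals the average cost; only observed transitions are used, so no transition probabilities need to be known in advance.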

## 6 Simulation and Discussion

In this section, we compare the proposed distributive queue-aware intra-cell user scheduling and ICI management control scheme with three baselines. Baseline 1 refers to the *CSIT only* scheme, where the user scheduling is adaptive to the CSIT only so as to optimize the achievable data rate. Baseline 2 refers to a throughput optimal policy (in the stability sense) for the user scheduling, namely the *Dynamic Backpressure* scheme [20]. In both baselines 1 and 2, the traditional frequency reuse scheme (frequency reuse factor equal to 3) is used for inter-cell interference management. Baseline 3 refers to the *time-scale decomposition* scheme proposed in [2], where the set of possible ICI management patterns is the same as in the proposed scheme.

In the simulation, we consider a two-tier cellular network composed of 19 BSs as in [2], each with a coverage of 500 m. Channel models are implemented according to the Urban Macrocell Model in 3GPP and Jakes' Rayleigh fading model. Specifically, the path loss model is given by , where (in m) is the distance from the transmitter to the receiver. The total bandwidth is 10 MHz. We consider Poisson packet arrivals with average arrival rate (packets/slot) and exponentially distributed random packet size with Mbits. The scheduling slot duration is 5 ms. The maximum buffer size is 9 (in packets), where each user's QSI is partitioned into 4 regions, given by . The cost function is given by for all the users in the simulations.
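The traffic model and the QSI quantization described above can be sketched as follows. This is a minimal illustration using only the standard library: the Poisson sampler uses Knuth's multiplication method, and the region boundaries `(3, 6, 8)` are hypothetical placeholders, since the actual four regions are not reproduced here.

```python
import bisect
import math
import random

def poisson_sample(rate, rng):
    """Draw a Poisson(rate) count with Knuth's multiplication method."""
    threshold = math.exp(-rate)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count

def arrival_bits(rate_pkts_per_slot, mean_pkt_bits, rng):
    """Total bits arriving in one slot: Poisson packets, exponential sizes."""
    n_packets = poisson_sample(rate_pkts_per_slot, rng)
    return sum(rng.expovariate(1.0 / mean_pkt_bits) for _ in range(n_packets))

def qsi_region(queue_pkts, boundaries=(3, 6, 8)):
    """Map a queue length (packets) to a region index.

    `boundaries` holds hypothetical lower edges of regions 1, 2 and 3.
    """
    return bisect.bisect_right(boundaries, queue_pkts)
```

With these placeholder boundaries, queue lengths 0 to 2 fall in region 0, 3 to 5 in region 1, 6 and 7 in region 2, and 8 and 9 in region 3.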

### 6.1 Performance w.r.t. Transmit Power

Fig. ? and Fig. ? illustrate the average delay and the packet dropping probability (conditioned on packet arrival) per user versus transmit power, respectively. The number of users per BS , and the average arrival rate . Note that the average delay and packet dropping probability of all the schemes decrease as the transmit power increases, and there is significant performance gain of the proposed scheme compared to all baselines. This gain is contributed by the QSI-aware user scheduling as well as the ICI management control.

### 6.2 Performance w.r.t. Loading

Fig. ? illustrates the average delay versus per-user loading (average arrival rate ) at transmit power of dBm and the number of users per BS . It can also be observed that the proposed scheme achieves significant gain over all the baselines across a wide range of input loading.

### 6.3 Cumulative Distribution Function (CDF) of the Queue Length

Fig. ? illustrates the Cumulative Distribution Function (CDF) of the queue length per user with transmit power dBm. The number of users per BS is and the average arrival rate . It can also be verified that the proposed scheme achieves not only a smaller average delay but also a smaller delay percentile compared with the other baselines.

### 6.4 Convergence Performance

Fig. ? illustrates the average delay per user versus the scheduling slot index with transmit power dBm. The number of users per BS is and the average arrival rate . It can be observed that the convergence rate of the online algorithm is quite fast. For example, the delay performance of the proposed scheme already outperforms all the baselines at the -th slot. Furthermore, the delay performance at the -th slot is already quite close to the converged average delay. Finally, unlike the conventional iterative NUM approach, where the iterations are done offline within the coherence time of the CSI, the proposed iterative algorithm is updated over the same time scale as the CSI and QSI updates. Moreover, the iterative algorithm is online, meaning that useful payload is transmitted during the iterations.

## 7 Summary

In this paper, we study a distributive queue-aware intra-cell user scheduling and inter-cell interference management control design for a delay-optimal cellular downlink system. We first model the problem as an infinite horizon average cost POMDP, which is NP-hard in general. By exploiting the special problem structure, we derive an equivalent Bellman equation to solve the POMDP problem. To address the distributive requirement and the issues of dimensionality and computational complexity, we derive a distributive online stochastic learning algorithm, which requires only local QSI and local CSI at each of the BSs. We show that the proposed learning algorithm converges almost surely and achieves significant gain compared with various baselines. The proposed algorithm has only linear complexity order with respect to the total number of users.

## Appendix A: Proof of Theorem

Based on the action partitioning, we can associate the MDP formulation in our delay-optimal control problem as follows:

- **State Space:** The system state of the MDP is global QSI .
- **Action Space:** The action on the system state is the partitioned action given in Definition ?, and the action space is .
- **Transition Kernel:** The transition kernel is , where is given by (Equation 3).
- **Per-Slot Cost:** The per-slot cost function is .

Therefore, the optimal partitioned action can be determined from the equivalent Bellman equation in ( ?).

Next, we shall prove that is a monotonic increasing function w.r.t. each of its components. Given that is the result of the -th iteration, is given by:

where , and is a reference state. Because [7], it is sufficient to prove is component-wise monotonic increasing. Using the induction method, we start from . In the induction step, we assume that , we get

where is the delivered bits under the conditional action for all users. Specifically, .

## Appendix B: Proof of Corollary

Using the linear approximation in (Equation 8), and the given ICI management pattern , the optimal user scheduling action (obtained by solving the RHS of Bellman equation (Equation 7)) is:

where is the set of all the possible user scheduling policies for BS . As a result, Corollary ? follows directly from the above equation.

## Appendix C: Proof of Lemma

From the definition of the mapping in ( ?), the convergence property of the per-user value function update algorithm in ( ?) is equivalent to that of the following update equation [21]:

where , and . is determined by the ICI management control pattern and local CSI . Let be the σ-algebra generated by . It can be verified that , and for a suitable constant . Therefore, the learning algorithm in (Equation 18) is a standard stochastic learning algorithm with the martingale difference noise . We use the ordinary differential equation (ODE) approach to analyze its convergence. Specifically, the limiting ODE that (Equation 18) tracks asymptotically is given by:

Note that there is a unique fixed point that satisfies the Bellman equation

and it is proved in [22] that is the globally asymptotically stable equilibrium for (Equation 19). Furthermore, define and . The origin is the globally asymptotically stable equilibrium point of the ODE (this is merely a special case obtained by setting in the ). By Theorem 2.2 of [23], the iterates remain bounded almost surely. By the ODE approach [21], we can conclude that the iterates of the update converge almost surely to the globally asymptotically stable equilibrium of the associated ODE.

Finally, the proof that is a monotonic increasing function can be derived in the same way as for Theorem ?.

## Appendix D: Proof of Lemma

From the definition of mapping