Queue-Aware Dynamic Clustering and Power Allocation for Network MIMO Systems via Distributive Stochastic Learning


Ying Cui, Qingqing Huang, Vincent K. N. Lau
ECE Department, Hong Kong University of Science and Technology, Hong Kong
Email: cuiying@ust.hk, tmhqxaa@stu.ust.hk, eeknlau@ust.hk
Abstract

In this paper, we propose a two-timescale delay-optimal dynamic clustering and power allocation design for downlink network MIMO systems. The dynamic clustering control is adaptive to the global queue state information (GQSI) only and computed at the base station controller (BSC) over a longer time scale. On the other hand, the power allocations of all the BSs in one cluster are adaptive to both intra-cluster channel state information (CCSI) and intra-cluster queue state information (CQSI), and computed at the cluster manager (CM) over a shorter time scale. We show that the two-timescale delay-optimal control can be formulated as an infinite-horizon average cost Constrained Partially Observed Markov Decision Process (CPOMDP). By exploiting the special problem structure, we shall derive an equivalent Bellman equation in terms of the Pattern Selection Q-factor to solve the CPOMDP. To address the distributive requirement and the issue of exponential memory and computational complexity, we approximate the Pattern Selection Q-factor by the sum of Per-cluster Potential functions and propose a novel distributive online learning algorithm to estimate the Per-cluster Potential functions (at each CM) as well as the Lagrange multipliers (LM) (at each BS). We show that the proposed distributive online learning algorithm converges almost surely (with probability 1). By exploiting the birth-death structure of the queue dynamics, we further decompose the Per-cluster Potential function into a sum of Per-cluster Per-user Potential functions and formulate the instantaneous power allocation as a Per-stage QSI-aware Interference Game played among all the CMs. We also propose a QSI-aware Simultaneous Iterative Water-filling Algorithm (QSIWFA) and show that it can achieve the Nash Equilibrium (NE).

I Introduction

The network MIMO/cooperative MIMO system has been proposed as an effective solution to address the inter-cell interference (ICI) bottleneck in multicell systems by exploiting data cooperation and joint processing among multiple base stations (BSs). Exchange of channel state information (CSI) and user data among BSs through the backhaul is required to support network MIMO, and this overhead depends on the number of BSs involved in the cooperation and joint processing. In practice, it is not possible to support such full-scale cooperation, and BSs are usually grouped into disjoint clusters with a limited number of BSs in each cluster to reduce the processing complexity as well as the backhaul loading. The BSs within each cluster cooperatively serve the users associated with them, which lowers the system complexity and completely eliminates the intra-cluster interference.

Clustering methods can be classified into two categories: the static clustering approach and the dynamic clustering approach. For static clustering, the clusters are pre-determined and do not change over time. For example, in [1],[2], the authors proposed BS coordination strategies for fixed clusters to eliminate intra-cluster interference. For dynamic clustering, the cooperation clusters change over time. For example, in [3], given the GCSI, a central unit jointly forms the clusters, selects the users, and calculates the beamforming coefficients and power allocations to maximize the weighted sum rate by brute-force exhaustive search. In [4], the authors proposed a greedy dynamic clustering algorithm to improve the sum rate under the assumption that the CSI of the neighboring BSs is available at each BS. In general, compared with static clustering, the dynamic clustering approach usually has better performance due to the larger optimization domain, but it also leads to larger signaling overhead to obtain more CSI and higher computational complexity for intelligent clustering.

However, all of these works assume that there are infinite backlogs of packets at the transmitter and that the information flow is delay-insensitive. The derived control policy (e.g., the clustering and power allocation policy) is, explicitly or implicitly, a function of the CSI only. In practice, many applications are delay-sensitive, and it is critical to optimize the delay performance of network MIMO systems. In particular, we are interested in investigating delay-optimal clustering and power control in network MIMO systems, which also adapts to the queue state information (QSI). This is motivated by the example in Fig. 1. The CSI-based clustering will always pick Pattern 1, creating a cooperation and interference profile in favor of MS 2 and MS 4 regardless of the queue states of these mobiles. However, the QSI-based clustering will dynamically pick the clustering patterns according to the queue states of all the mobiles.

A design framework that takes both queueing delay and physical layer performance into consideration is non-trivial, as it involves both queueing theory (to model the queueing dynamics) and information theory (to model the physical layer dynamics). The simplest approach is to convert the delay constraints into an equivalent average rate constraint using the tail probability (large deviation theory) and solve the optimization problem using a purely information-theoretic formulation based on the rate constraint [Hui:2007]. However, the control policy derived is a function of the CSI only, and it fails to exploit the QSI in the adaptation process. The Lyapunov drift approach is also widely used in the literature to study the queue stability region of different wireless systems and establish throughput-optimal control policies (in the stability sense). However, the average delay bound derived in terms of the Lyapunov drift is tight only for heavy traffic loading [Neelybook:2006]. A systematic approach to delay-optimal resource control in the general delay regime is the Markov Decision Process (MDP) technique [5]. However, there are various technical challenges involved regarding dynamic clustering and power allocation for delay-optimal network MIMO systems.

  • The Curse of Dimensionality: Although the MDP technique is a systematic approach to solve the delay-optimal control problem, a first-order challenge is the curse of dimensionality [5]. For example, a huge state space (exponential in the total number of users in the network) will be involved in the MDP, and brute-force value or policy iterations cannot lead to any implementable solution [6]. (For a multi-cell system with 7 BSs, 2 users served by each BS, a buffer size of 10 per user, and 50 CSI states for each link between one user and one BS, the system state space is already unmanageable.)

  • Signaling Overhead and Computational Complexity for Dynamic Clustering: Optimal dynamic clustering in [3] and greedy dynamic clustering in [4] (both in the throughput sense) require the GCSI or the CSI of all neighboring BSs, which leads to heavy signaling overhead on the backhaul and high computational complexity at the central controller. For delay-optimal network MIMO control, the entire system state is characterized by the GCSI and the global QSI (GQSI). Therefore, a centralized solution (which requires the GCSI and GQSI) would induce substantial signaling overhead between the BSs and the base station controller (BSC).

  • Issues of Convergence in Stochastic Optimization Problems: In conventional iterative solutions for deterministic network utility maximization (NUM) problems, the updates in the iterative algorithms (such as subgradient search) are performed within the coherence time of the CSI, i.e., the CSI remains quasi-static during the iteration updates [7]. (This poses a serious limitation on the practicality of distributive iterative solutions, because the convergence and the optimality of the iterative solutions are not guaranteed if the CSI changes significantly during the updates.) When we consider the delay-optimal problem, the problem is stochastic and the control actions are defined over the ergodic realizations of the system states (CSI, QSI). Therefore, the convergence proof is also quite challenging.

In this paper, we consider two-timescale delay-optimal dynamic clustering and power allocation for a downlink network MIMO system consisting of cells with one BS and MSs in each cell. For implementation considerations, the dynamic clustering control is adaptive to the GQSI only and computed at the BSC over a longer time scale. On the other hand, the power allocations of all the BSs in one cluster are adaptive to both the CCSI and the intra-cluster QSI (CQSI), and computed at the CM over a shorter time scale. Due to the two-timescale control structure, the delay-optimal control is formulated as an infinite-horizon average cost Constrained Partially Observed Markov Decision Process (CPOMDP). We propose an equivalent Bellman equation in terms of the Pattern Selection Q-factor to solve the CPOMDP. We approximate the Pattern Selection Q-factor by the sum of Per-cluster Potential functions and propose a novel distributive online learning algorithm to estimate the Per-cluster Potential functions (at each CM) as well as the Lagrange multipliers (LM) (at each BS). This update algorithm requires the CCSI and CQSI only and therefore facilitates distributive implementation. Using separation of time scales, we shall establish the almost-sure convergence of the proposed distributive online learning algorithm. By exploiting the birth-death structure of the queue dynamics, we further decompose the Per-cluster Potential function into a sum of Per-cluster Per-user Potential functions. Based on these distributive potential functions and the birth-death structure, the instantaneous power allocation control is formulated as a Per-stage QSI-aware Interference Game and determined by a QSI-aware Simultaneous Iterative Water-filling Algorithm (QSIWFA). We show that QSIWFA can achieve the NE of the QSI-aware interference game. Unlike conventional iterative water-filling solutions [17], the water level of our solution is adaptive to the QSI via the potential functions.

We first list the acronyms used in this paper in Table I:

BSC: base station controller
CM: cluster manager
ICI: inter-cell interference
LM: Lagrange multiplier
L/C/G CSI (QSI): local/intra-cluster/global channel state information (queue state information)
CPOMDP: constrained partially observed Markov decision process
QSIWFA: QSI-aware simultaneous iterative water-filling algorithm
TABLE I: List of Acronyms

II System Models

In this section, we shall elaborate on the network MIMO system topology, the physical layer model, the bursty source model, and the control policy.

II-A System Topology

We consider a wireless cellular network consisting of cells with one BS and MSs in each cell, as illustrated in Fig. 2. We assume each BS is equipped with transmit antennas and each MS has a receive antenna. (When , there will be a user selection control to select at most active users from the users, and the proposed solution framework can be extended easily to accommodate this user selection control as well.) Denote the set of BSs as and the set of MSs in each cell as , respectively. We consider a clustered network MIMO system with maximum cluster size . Let denote a feasible cluster , which is a collection of neighboring BSs. We define a clustering pattern to be a partition of as follows

(1)

where is the collection of all clustering patterns, with cardinality .
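For illustration, the following sketch enumerates all partitions of a small BS set into clusters of bounded size. It ignores the neighboring-BS (adjacency) constraint on feasible clusters, and the BS labels and maximum cluster size are placeholders, not values from the paper.

```python
from itertools import combinations

def clustering_patterns(bs_set, max_cluster_size):
    """Enumerate all partitions of bs_set into clusters of size <= max_cluster_size.

    A pattern is returned as a list of tuples (clusters) covering every BS exactly once.
    The adjacency constraint between BSs is ignored in this illustrative sketch.
    """
    bs_set = sorted(bs_set)
    if not bs_set:
        yield []
        return
    first, rest = bs_set[0], bs_set[1:]
    # The first BS must belong to exactly one cluster of the partition.
    for k in range(0, min(max_cluster_size, len(bs_set))):
        for others in combinations(rest, k):
            cluster = (first,) + others
            remaining = [b for b in rest if b not in others]
            for sub in clustering_patterns(remaining, max_cluster_size):
                yield [cluster] + sub

# Toy example: 4 BSs, clusters of at most 2 BSs each.
patterns = list(clustering_patterns([0, 1, 2, 3], max_cluster_size=2))
print(len(patterns))  # 10 patterns for this toy case
```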

As illustrated in Fig. 2, the overall multicell network is organized into a three-layer hierarchical architecture, i.e. the base station controller (BSC), the cluster managers (CMs) and the BSs. There are user queues at each BS, which buffer packets for the MSs in each cell. Both the local CSI (LCSI) and local QSI (LQSI) are measured locally at each BS. The BSC obtains the global QSI (GQSI) from the LQSI distributed at each BS, determines the clustering pattern according to the GQSI, and informs the CMs of the concerned clusters of their intra-cluster QSI (CQSI). During each scheduling slot, the CM of each cluster determines the precoding vectors as well as the transmit powers of the BSs in the cluster.

II-B Physical Layer Model

Denote MS in cell as a BS-MS index pair . The channel from the transmit antennas in BS to the MS is denoted as the vector (), with its -th element () a discrete random variable distributed according to a general distribution with mean 0 and variance , where denotes the per-user discrete CSI state space with cardinality and denotes the path gain between BS and MS . For a given clustering pattern , let (), () and denote the LCSI at BS in cluster , the CCSI at the CM , and the GCSI, respectively, where denotes the GCSI state space. In this paper, the time dimension is partitioned into scheduling slots indexed by with slot duration .

Assumption 1

The GCSI is quasi-static in each scheduling slot and i.i.d. over scheduling slots. Furthermore, is independent w.r.t. and . The path gain remains constant for the duration of the communication session.   \QED

Let and () denote the information symbols and the received power of MS , respectively. Denote () as the precoding vector for MS at the BS . Therefore, the received signal of MS in cluster () is given by

where is noise. Based on the CCSI at the CM, we adopt zero-forcing (ZF) precoding within each cluster to eliminate the intra-cluster interference [1, 3]. (We consider ZF precoding as an example, but the solution framework in this paper can be applied to other SDMA processing techniques as well. Our zero-forcing precoder design can also be extended to multi-antenna MSs with block zero-forcing similar to that in [spencer2004zero].) The ZF precoder of cluster () satisfies () and (). The transmit power of BS is therefore given by

(2)

For simplicity, we assume perfect CSI at the transmitter and receiver, and the maximum achievable data rate (bit/s/Hz) of MS in cluster is given by the mutual information between the channel inputs and channel outputs as:

(3)

where () is the inter-cell interference power.
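As a purely numerical illustration of the per-cluster ZF precoding behind (2) and the rate expression (3), the snippet below builds zero-forcing precoders from the pseudo-inverse of the intra-cluster channel matrix and evaluates per-user rates with a hypothetical inter-cluster interference power. The cluster dimensions, power allocation, and noise/interference values are assumptions for the sketch, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy cluster: 2 BSs with 2 transmit antennas each jointly serve 4 single-antenna MSs.
n_bs, n_tx, n_ms = 2, 2, 4
H = (rng.standard_normal((n_ms, n_bs * n_tx)) +
     1j * rng.standard_normal((n_ms, n_bs * n_tx))) / np.sqrt(2)

# Zero-forcing: the pseudo-inverse makes H @ W diagonal, so intra-cluster
# interference is nulled and each MS only sees its own stream.
W = np.linalg.pinv(H)                      # shape (n_bs*n_tx, n_ms)
W = W / np.linalg.norm(W, axis=0)          # unit-norm precoding directions

p = np.array([1.0, 0.5, 2.0, 1.0])         # per-user power allocation (assumed)
noise, inter_cluster_I = 1.0, 0.3          # noise and inter-cluster interference power (assumed)

# Per-user achievable rate, cf. (3): log2(1 + p_k |h_k w_k|^2 / (noise + I_k)).
gains = np.abs(np.einsum('ki,ik->k', H, W)) ** 2
rates = np.log2(1.0 + p * gains / (noise + inter_cluster_I))

# Per-BS transmit power, cf. (2): sum over users of p_k * ||part of w_k at that BS||^2.
W_per_bs = W.reshape(n_bs, n_tx, n_ms)
bs_power = np.einsum('k,bik->b', p, np.abs(W_per_bs) ** 2)
print(rates, bs_power)
```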

II-C Bursty Source Model

Let be the random new arrivals (number of bits) for the users in the multicell network at the end of the -th scheduling slot.

Assumption 2

The arrival process is distributed according to general distributions and is i.i.d. over scheduling slots and independent w.r.t. .   \QED

Let be the GQSI matrix of the multicell network, where is the -element of , which denotes the number of bits in the queue for MS at the beginning of the -th slot. The per-user QSI state space and the GQSI state space are given by and , respectively. denotes the buffer size (maximum number of bits) of the queues for the MSs. Thus, the cardinality of the GQSI state space is , which grows exponentially with . Let be the scheduled data rate matrix of the MSs, where the -element can be calculated using (3). We assume the controller is causal, so that new arrivals are observed only after the controller's actions at the -th slot. Hence, the queue dynamics are given by the following equation:

(4)

where and is the duration of a scheduling slot. For notational convenience, we denote as the global system state at the -th slot.
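The queue recursion in (4) — serve within the slot, then admit new arrivals, truncated to the buffer size — can be checked with a few lines. The buffer size, slot duration, network dimensions, and arrival/rate statistics below are placeholders for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)

N_Q, tau = 10_000, 5e-3          # buffer size in bits and slot duration (assumed values)
Q = np.zeros((7, 2))             # GQSI: 7 cells x 2 MSs per cell, bits in each queue

def queue_update(Q, R, A, tau=tau, N_Q=N_Q):
    """One step of the queue dynamics in (4): serve first, then add the new arrivals
    (the controller is causal, so arrivals are seen after the control action),
    and truncate to the buffer size N_Q."""
    served = np.maximum(Q - R * tau, 0.0)
    return np.minimum(served + A, N_Q)

for t in range(3):
    R = rng.uniform(0, 2e5, size=Q.shape)    # scheduled rates from (3), bits/s (placeholder)
    A = rng.poisson(500, size=Q.shape)       # random new arrivals in bits (placeholder)
    Q = queue_update(Q, R, A)
```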

II-D Clustering Pattern Selection and Power Control Policy

At the beginning of the -th slot, given the observed GQSI realization , the BSC determines the clustering pattern defined in (1), and the CMs of the active clusters () perform power allocation based on the GCSI and GQSI according to a pattern selection and power allocation policy defined below.

Definition 1 (Stationary Pattern Selection and Power Allocation Policy)

A stationary pattern selection and power allocation policy is a mapping from the system state to the pattern selection and power allocation actions, where and . A policy is called feasible if the associated actions satisfy the per-BS average transmit power constraint given by

(5)

where is given by (2) and is the average total power of BS .   \QED

Remark 1 (Two Time-Scale Control Policy)

The pattern selection policy is defined as a function of the GQSI only, i.e. , for the following reasons. The QSI changes on a slower time scale while the CSI changes on a faster (slot-by-slot) time scale. The dynamic clustering is enforced at the BSC and hence, a longer time scale is desirable from the implementation perspective, considering the computational complexity at the BSC and the signaling overhead for collecting the GCSI from all the BSs. On the other hand, the low-complexity, decentralized power allocation policy (obtained later in Sec. IV) is a function of the CQSI and CCSI only and is executed at the CM level distributively, and hence it can operate at the slot time scale with acceptable signaling overhead and complexity. (According to Definition 1, the power control policy is defined as a function of the GQSI and GCSI. Yet, in Sec. IV, we shall derive a decentralized power allocation policy which is adaptive to the CCSI and CQSI only.)   \QED

III Problem Formulation

In this section, we shall first elaborate on the dynamics of the system state under a control policy . Based on that, we shall then formally formulate the delay-optimal control problem for network MIMO systems.

III-A Dynamics of System State

A stationary control policy induces a joint distribution for the random process . Under Assumptions 1 and 2, the arrivals and departures are memoryless. Therefore, the induced random process for a given control policy is Markovian with the following transition probability:

(6)

Note that the queues are coupled together via the control policy .

III-B Delay-Optimal Problem Formulation

Given a unichain policy , the induced Markov chain is ergodic and there exists a unique steady-state distribution where . (A unichain policy is defined as a policy under which the resulting Markov chain is ergodic [8]. Similar to other literature on MDPs [5],[13], we restrict our consideration to unichain policies in this paper. This assumption usually does not incur any loss of performance. For example, in Section V, any non-degenerate control policy satisfies , i.e. , and hence the induced Markov chain is an ergodic birth-death process.) The average cost of MS under a unichain policy is given by:

(7)

where is a monotonically increasing utility function of and denotes expectation w.r.t. the underlying measure . For example, when , is the average delay of MS (by Little's Law). Another interesting example is the queue outage probability , in which , where is the reference outage queue state. Similarly, the average transmit power constraint in (5) can be written as

(8)

where is given by (2).
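As a small sanity check of (7) and (8), the time averages below estimate the per-user cost and per-BS power from a simulated trajectory. The choice f(Q) = Q/arrival_rate turns the cost into average delay via Little's law, and the outage cost counts slots with the queue above a reference level; the traces and reference state are assumed inputs, not from the paper.

```python
import numpy as np

def empirical_cost_and_power(Q_trace, P_trace, arrival_rate, Q_ref):
    """Monte-Carlo estimates of the average cost in (7) and the average power in (8).

    Q_trace: (T, K) queue lengths per user over T slots.
    P_trace: (T, B) per-BS transmit powers over T slots.
    """
    avg_delay = (Q_trace / arrival_rate).mean(axis=0)   # f(Q) = Q/lambda -> delay (Little's law)
    outage = (Q_trace >= Q_ref).mean(axis=0)            # f(Q) = 1{Q >= Q_ref} -> outage probability
    avg_power = P_trace.mean(axis=0)                     # per-BS average power, to compare with the constraint
    return avg_delay, outage, avg_power
```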

In this paper, we seek to find an optimal stationary unichain control policy to minimize the average cost in (7). Specifically, we have

Problem 1 (Delay-Optimal Control Problem for Network MIMO)

For some positive constants (the positive weighting factors in (9) indicate the relative importance of buffer delay among the data streams, and for each given , the solution to (9) corresponds to a point on the Pareto-optimal delay tradeoff boundary of a multi-objective optimization problem), the delay-optimal problem is formulated as

(9)

where .   \QED

III-C Constrained POMDP

Next, we shall illustrate that Problem 1 is an infinite-horizon average cost constrained POMDP. In Problem 1, the pattern selection policy is defined on the partial system state , while the power allocation policy is defined on the complete system state , where . Therefore, Problem 1 is a constrained POMDP (CPOMDP) with the following specification:

  • State Space: The state space is given by , where is a realization of the global system state.

  • Action Space: The action space is given by , where is a unichain feasible policy as defined in Definition 1.

  • Transition Kernel: The transition kernel is given by (6).

  • Per-stage Cost Function: The per-stage cost function is given by .

  • Observation: The observation for the pattern selection policy is GQSI, i.e., , while the observation for the power allocation policy is the complete system state, i.e. .

  • Observation Function: The observation function for the pattern selection policy is , if , otherwise 0. Similarly, the observation function for the power allocation policy is , if , otherwise 0.

For any Lagrange multiplier (LM) vector (), define the Lagrangian as

where . Therefore, the corresponding unconstrained POMDP for a particular LM vector (i.e. the Lagrange dual function) is given by

(10)

The dual problem of the primal problem in Problem 1 is given by . It is shown in [15] that there exists a Lagrange multiplier such that minimizes and the saddle point condition holds, i.e. is a saddle point of the Lagrangian . Using standard optimization theory[10], is the primal optimal (i.e. the optimal solution of the original Problem 1), is the dual optimal (i.e. the optimal solution of the dual problem), and the duality gap (i.e. the gap between the primal objective at and the dual objective at ) is zero. Therefore, by solving the dual problem, we can obtain the primal optimal .

III-D Equivalent Bellman Equation

While a POMDP is a very difficult problem in general, we shall exploit some special structure in our problem to simplify it substantially. We first define the conditional power allocation action sets below:

Definition 2 (Conditional Power Allocation Action Sets)

Given a power allocation policy , we define a conditional power allocation action set as the collection of actions for all possible CSI conditioned on a given QSI . The complete control policy is therefore equal to the union of all the conditional power allocation action sets, i.e. .   \QED

Based on Definition 2, we can transform the POMDP problem into a regular infinite-horizon average cost MDP. Furthermore, for a given , the optimal control policy can be obtained by solving an equivalent Bellman equation which is summarized in the lemma below.

Lemma 1 (Equivalent Bellman Equation and Pattern Selection Q-factor)

For a given LM , the optimal control policy for the unconstrained optimization problem in Problem 1 can be obtained by solving the following equivalent Bellman equation: ()

(11)

where is the Pattern Selection Q-factor, is the conditional per-stage cost, and is the conditional expectation of the transition kernel. If there is a that satisfies the fixed-point equations in (11), then is the optimal average cost in Problem 1. Furthermore, the optimal control policy is given by with attaining the minimum of the R.H.S. of (11) () and .   \QED

Proof: Please refer to Appendix A.

Remark 2

The equivalent Bellman equation in (11) is defined on the GQSI with cardinality only. Nevertheless, the optimal power allocation policy obtained by solving (11) is still adaptive to the GCSI and GQSI , where are the conditional power allocation action sets given by Definition 2. In other words, the derived policies of the equivalent Bellman equation in (11) solve the CPOMDP in Problem 1. We shall illustrate this with a simple example below.   \QED

Example 1

As a simple example, suppose there are two MSs with CSI state space . As a result, the global CSI state space is with cardinality . Given GQSI , the optimal conditional power allocation action set (by Definition 2) for any given pattern is obtained by solving the R.H.S. of (11).

Observe that the R.H.S. of the above equation is a decoupled objective function w.r.t. the variables and hence

Hence, using Lemma 1, the optimal power control policy is given by , which are functions of both the GQSI and the GCSI. The optimal clustering pattern selection is given by , which is a function of the GQSI only.   \QED
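The decoupling in Example 1 can be mirrored in code: because the objective on the R.H.S. of (11) separates across CSI realizations for a fixed GQSI and pattern, the conditional action set is obtained by an independent minimization per CSI state. The cost function below is a stand-in for the actual conditional per-stage cost plus future-cost terms of (11); the CSI labels and power grid are hypothetical.

```python
import numpy as np

def conditional_action_set(csi_states, power_grid, per_csi_cost):
    """For a fixed GQSI and pattern, pick the best power action independently for each
    CSI realization -- the minimization on the R.H.S. of (11) decouples across CSI states.

    per_csi_cost(csi, p) is a placeholder for the conditional cost terms of (11)."""
    return {csi: min(power_grid, key=lambda p: per_csi_cost(csi, p))
            for csi in csi_states}

# Toy usage: 2 CSI states, a coarse power grid, and a made-up cost surface.
csi_states = ['good', 'bad']
power_grid = np.linspace(0.0, 2.0, 21)
cost = lambda csi, p: (p - (0.5 if csi == 'good' else 1.5)) ** 2   # hypothetical
actions = conditional_action_set(csi_states, power_grid, cost)
```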

IV General Low-Complexity Decentralized Solution

The key steps in obtaining the optimal control policies from the R.H.S. of the Bellman equation in (11) rely on knowledge of the pattern selection Q-factor (which involves solving the system of non-linear fixed-point equations in (11) with unknowns () for given LMs) and of the LMs themselves, which leads to enormous complexity. A brute-force solution has exponential complexity and requires centralized implementation and knowledge of the GCSI and GQSI (which leads to huge signaling overhead). In this section, we shall approximate the pattern selection Q-factor by a sum of Per-cluster Potential functions. Based on this approximation structure, we propose a novel distributive online learning algorithm to estimate the Per-cluster Potential functions (performed at each CM) as well as the LMs (performed at each BS).

IV-A Linear Approximation of the Pattern Selection Q-Factor

Let denote the CQSI state space of cluster with cardinality . To reduce the cardinality of the state space and to decentralize the resource allocation, we approximate by the sum of per-cluster potential functions (), i.e.

(12)

where is the CQSI state of cluster () and are per-cluster potential functions which satisfy the following per-cluster potential fixed point equation:

(13)

where

(14)
(15)

, given by (2), .

In the literature, there are mainly three types of compact representations that can be used to approximate the potential functions [tsitsiklis1996feature],[12]: artificial neural networks, feature extraction, and parametric forms. The first two approaches still require (GCSI, GQSI), have exponential complexity with respect to and , and do not facilitate distributed implementation. Therefore, we adopt a parametric form with linear approximation. Due to the cluster-based structure and the relationship between the GQSI and the CQSI, we can extract meaningful features and use the summation form for approximation, which naturally leads to distributed implementation. Using the above linear approximation of the pattern selection Q-factor by the sum of per-cluster potential functions in (12), the BSC determines the optimal clustering pattern based on the currently observed GQSI according to

(16)

Based on the CQSI and CCSI observation , each CM () determines , which attains the minimum of the R.H.S. of (13) (). Hence, the overall power allocation control policy is given by .
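Under the additive approximation (12), the BSC's pattern selection in (16) reduces to summing the per-cluster potentials reported by the CMs for each candidate pattern and taking the minimizing pattern. The sketch below assumes, for illustration, that each per-cluster potential is available as a lookup table keyed by the cluster and its CQSI; this keying is a convenience of the sketch, not the paper's data structure.

```python
def select_pattern(patterns, gqsi, per_cluster_potential):
    """Pattern selection (16) under the linear approximation (12).

    patterns: iterable of patterns, each a list of clusters (tuples of BS indices).
    gqsi: dict mapping BS index -> local QSI, so the CQSI of a cluster is its restriction.
    per_cluster_potential: dict mapping (cluster, cqsi) -> learned per-cluster potential.
    """
    def pattern_score(pattern):
        score = 0.0
        for cluster in pattern:
            cqsi = tuple(gqsi[b] for b in cluster)       # intra-cluster QSI of this cluster
            score += per_cluster_potential[(cluster, cqsi)]
        return score
    return min(patterns, key=pattern_score)              # pattern with the smallest summed potential
```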

IV-B Online Primal-Dual Distributive Learning Algorithm via Stochastic Approximation

Since the derived policy is a function of the per-cluster potential functions (), we need to obtain by solving (13) and determine the LMs such that the per-BS average power constraints in (5) are satisfied, neither of which is trivial. In this section, we shall apply stochastic learning [13, 14] to estimate the per-cluster potential functions () distributively at each CM based on real-time observations of the CCSI and CQSI, and the LMs at each BS based on the real-time power allocation actions. The convergence proof of the online learning algorithm will be established in the next section.

Fig. 3 illustrates the top-level structure of the online learning algorithm. The Online Primal-Dual Distributive Learning Algorithm via Stochastic Approximation, which requires knowledge of the CCSI and CQSI only, can be described as follows:

Algorithm 1

(Online Primal-Dual Distributive Learning Algorithm via Stochastic Approximation)

  • Step 1 [Initialization]: Set . Each cluster initializes the per-cluster potential functions (). Each BS initializes the LM ().

  • Step 2 [Clustering Pattern Selection]: At the beginning of the -th slot, the BSC determines the clustering pattern based on GQSI obtained from each BS according to (16), and broadcasts to the active CMs of the clusters .

  • Step 3 [Per-cluster Power Allocation]: Each CM () of the active cluster obtains CCSI , CQSI and LMs () from the BSs in its cluster, based on which, each CM () performs power allocation according to .

  • Step 4 [Potential Functions Update]: Each CM updates the per-cluster potential based on CQSI according to (17) and reports the updated potential functions to the BSC.

  • Step 5 [LMs Update]: Each BS () calculates the total power based on from its CM according to (2) and updates its LM according to (18).

The per-cluster potential update in Step 4 and the LM update in Step 5, based on the CCSI observation and CQSI observation at the current time slot, are given as follows:

(17)
(18)

where , is the number of updates of up to , , is the reference state (without loss of generality, we set the reference state (), i.e. empty buffers for all MSs in cluster , and initialize the ), and is the last time slot at which the reference state was updated. is the projection onto an interval for some . and are the step size sequences satisfying the following conditions:

(19)
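A minimal single-cluster sketch of the updates (17)-(18) under step-size conditions of the Robbins-Monro type in (19): the per-cluster potential is updated on the faster timescale and the per-BS LM on the slower timescale, with the reference state pinned and the LM projected onto a bounded interval. The exact increment terms, the step-size exponents, and the projection interval below are placeholders standing in for the bracketed quantities of (17)-(18), not the paper's exact expressions.

```python
import numpy as np

def step_sizes(t):
    """Two-timescale step sizes in the spirit of (19): both sequences sum to infinity with
    summable squares, and gamma_t/eps_t -> 0, so the LM looks quasi-static to the potential."""
    eps = 1.0 / (t + 1) ** 0.6        # timescale I: per-cluster potential update
    gamma = 1.0 / (t + 1)             # timescale II: Lagrange multiplier update
    return eps, gamma

def online_update(V, lm, cqsi, per_stage_cost, next_value, bs_power, P_max,
                  t, ref_state=0, lm_box=(0.0, 50.0)):
    """One slot of Algorithm 1, Steps 4-5, for a single cluster (illustrative only).

    V: dict mapping CQSI -> per-cluster potential estimate; lm: per-BS LM array.
    per_stage_cost and next_value stand in for the bracketed terms of (17)."""
    eps, gamma = step_sizes(t)
    # (17): stochastic-approximation update of the potential at the visited CQSI,
    # measured relative to the reference state (e.g. the empty-buffer state).
    target = per_stage_cost + next_value - V[ref_state]
    V[cqsi] = (1 - eps) * V[cqsi] + eps * target
    # (18): per-BS LM update -- a projected subgradient step on the power constraint (5).
    lm = np.clip(lm + gamma * (bs_power - P_max), *lm_box)
    return V, lm
```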

IV-C Convergence Analysis for Distributive Primal-Dual Online Learning

In this section, we shall establish the technical conditions for the almost-sure convergence of the online distributive learning algorithm in Algorithm 1. Let denote the -dimensional vector form of . For any per-cluster LM vector , define a vector mapping of cluster with the -th () component mapping as:

(20)

Define and , where is a average transition probability matrix for the queue of cluster with as its -element and is a identity matrix.

Since we have two different step size sequences and with , the per-cluster potential updates and the LM updates are done simultaneously but over two different timescales. During the per-cluster potential update (timescale I), we have , where . Therefore, the LMs appear to be quasi-static [15] during the per-cluster potential update in (17). We first have the following lemma.

Lemma 2

(Convergence of Per-cluster Potential Learning (Timescale I)) Assume that for every set of feasible control policies , , in the policy space, there exists a and some positive integer such that

(21)

where denotes the -element of the corresponding matrix. For step size sequences satisfying the conditions in (19), we have a.s. () for any initial potential vector and per-cluster LM vector , where the steady-state per-cluster potential