Queue-Aware Dynamic Clustering and Power Allocation for Network MIMO Systems via Distributive Stochastic Learning
Abstract
In this paper, we propose a two-timescale delay-optimal dynamic clustering and power allocation design for downlink network MIMO systems. The dynamic clustering control is adaptive to the global queue state information (GQSI) only and is computed at the base station controller (BSC) over a longer timescale. On the other hand, the power allocations of all the BSs in one cluster are adaptive to both the intra-cluster channel state information (CCSI) and the intra-cluster queue state information (CQSI), and are computed at the cluster manager (CM) over a shorter timescale. We show that the two-timescale delay-optimal control can be formulated as an infinite-horizon average-cost Constrained Partially Observed Markov Decision Process (CPOMDP). By exploiting the special problem structure, we shall derive an equivalent Bellman equation in terms of the Pattern Selection Q-factor to solve the CPOMDP. To address the distributive requirement and the issue of exponential memory requirement and computational complexity, we approximate the Pattern Selection Q-factor by the sum of Per-cluster Potential functions and propose a novel distributive online learning algorithm to estimate the Per-cluster Potential functions (at each CM) as well as the Lagrange multipliers (LMs) (at each BS). We show that the proposed distributive online learning algorithm converges almost surely (with probability 1). By exploiting the birth-death structure of the queue dynamics, we further decompose the Per-cluster Potential function into the sum of Per-cluster Per-user Potential functions and formulate the instantaneous power allocation as a Per-stage QSI-aware Interference Game played among all the CMs. We also propose a QSI-aware Simultaneous Iterative Waterfilling Algorithm (QSIWFA) and show that it can achieve the Nash Equilibrium (NE).
I Introduction
The network MIMO/Cooperative MIMO system has been proposed as an effective solution to the intercell interference (ICI) bottleneck in multicell systems, exploiting data cooperation and joint processing among multiple base stations (BSs). Channel state information (CSI) and user data exchange among BSs through the backhaul are required to support network MIMO, and this overhead depends on the number of BSs involved in the cooperation and joint processing. In practice, it is not possible to support such full-scale cooperation, and BSs are usually grouped into disjoint clusters with a limited number of BSs in each cluster to reduce the processing complexity as well as the backhaul loading. The BSs within each cluster cooperatively serve the users associated with them, which lowers the system complexity and completely eliminates the intra-cluster interference.
The clustering methods can be classified into two categories: the static clustering approach and the dynamic clustering approach. For static clustering, the clusters are predetermined and do not change over time. For example, in [1],[2], the authors proposed BS coordination strategies for fixed clusters to eliminate intra-cluster interference. For dynamic clustering, the cooperation clusters change over time. For example, in [3], given GCSI, a central unit jointly forms the clusters, selects the users, and calculates the beamforming coefficients and the power allocations to maximize the weighted sum rate by a brute-force exhaustive search. In [4], the authors proposed a greedy dynamic clustering algorithm to improve the sum rate under the assumption that the CSI of the neighboring BSs is available at each BS. In general, compared with static clustering, the dynamic clustering approach usually has better performance due to the larger optimization domain, but it also incurs larger signaling overhead to obtain more CSI and higher computational complexity for intelligent clustering.
However, all of these works have assumed that there are infinite backlogs of packets at the transmitter and that the information flow is delay-insensitive. The control policy derived (e.g., the clustering and power allocation policy) is a function of the CSI only, explicitly or implicitly. In practice, many applications are delay-sensitive, and it is critical to optimize the delay performance of network MIMO systems. In particular, we are interested in investigating delay-optimal clustering and power control in network MIMO systems, which also adapts to the queue state information (QSI). This is motivated by an example in Fig. 1. CSI-based clustering will always pick Pattern 1, creating a cooperation and interference profile in favor of MS 2 and MS 4 regardless of the queue states of these mobiles. By contrast, QSI-based clustering will dynamically pick the clustering patterns according to the queue states of all the mobiles.
A design framework taking both queueing delay and physical layer performance into consideration is not trivial, as it involves queueing theory (to model the queueing dynamics) and information theory (to model the physical layer dynamics). The simplest approach is to convert the delay constraints into equivalent average rate constraints using the tail probability (large deviation theory) and solve the optimization problem using a purely information-theoretic formulation based on the rate constraints [Hui:2007]. However, the control policy derived is a function of the CSI only, and it fails to exploit the QSI in the adaptation process. The Lyapunov drift approach is also widely used in the literature to study the queue stability region of different wireless systems and to establish throughput-optimal control policies (in the stability sense). However, the average delay bound derived in terms of the Lyapunov drift is tight only for heavy traffic loading [Neelybook:2006]. A systematic approach to delay-optimal resource control in the general delay regime is the Markov Decision Process (MDP) technique [5]. However, there are various technical challenges involved regarding dynamic clustering and power allocation for delay-optimal network MIMO systems.

The Curse of Dimensionality: Although the MDP technique is a systematic approach to the delay-optimal control problem, a first-order challenge is the curse of dimensionality [5]. For example, a huge state space (exponential in the total number of users in the network) will be involved in the MDP, and brute-force value or policy iteration cannot lead to any implementable solution [6]. For instance, for a multicell system with 7 BSs, 2 users served by each BS, a buffer size of 10 per user and 50 CSI states for each link between one user and one BS, the system state space is already unmanageable.
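To make the scale of the curse of dimensionality concrete, the following sketch computes the state-space sizes for the example above (7 BSs, 2 users per BS, buffer size 10, 50 CSI states per BS-to-user link). The counting convention (one CSI state per BS-to-user link, 11 queue levels per user) is our assumption:

```python
# Rough count of the global system state space for the example in the text:
# 7 BSs, 2 users per BS, buffer size 10 (i.e. 11 QSI levels per user),
# and 50 CSI states for each BS-to-user link.
n_bs = 7
users_per_bs = 2
n_users = n_bs * users_per_bs      # 14 users in total
qsi_levels = 11                    # queue length in {0, 1, ..., 10}
csi_states = 50
n_links = n_bs * n_users           # every BS-to-user link has its own CSI state

qsi_space = qsi_levels ** n_users  # global QSI state space cardinality
csi_space = csi_states ** n_links  # global CSI state space cardinality
total_states = qsi_space * csi_space

print(f"GQSI states : {qsi_space:.3e}")
print(f"GCSI states : {csi_space:.3e}")
print(f"total       : {total_states:.3e}")
```

Even the GQSI component alone exceeds 10^14 states, which is why the rest of the paper pursues per-cluster decompositions instead of brute-force iteration.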

Signaling Overhead and Computational Complexity for Dynamic Clustering: Optimal dynamic clustering in [3] and greedy dynamic clustering in [4] (both in the throughput sense) require GCSI or the CSI of all neighboring BSs, which leads to heavy signaling overhead on the backhaul and high computational complexity at the central controller. For delay-optimal network MIMO control, the entire system state is characterized by the GCSI and the global QSI (GQSI). Therefore, a centralized solution (which requires GCSI and GQSI) would induce substantial signaling overhead between the BSs and the base station controller (BSC).

Issues of Convergence in Stochastic Optimization: In conventional iterative solutions for deterministic network utility maximization (NUM) problems, the updates in the iterative algorithms (such as subgradient search) are performed within the coherence time of the CSI, i.e., the CSI remains quasi-static during the iteration updates [7]. This poses a serious limitation on the practicality of distributive iterative solutions, because the convergence and the optimality of the iterates are not guaranteed if the CSI changes significantly during the updates. When we consider the delay-optimal problem, the problem is stochastic and the control actions are defined over the ergodic realizations of the system states (CSI, QSI). Therefore, the convergence proof is also quite challenging.
In this paper, we consider two-timescale delay-optimal dynamic clustering and power allocation for a downlink network MIMO system consisting of cells with one BS and MSs in each cell. For implementation considerations, the dynamic clustering control is adaptive to the GQSI only and is computed at the BSC over a longer timescale. On the other hand, the power allocations of all the BSs in one cluster are adaptive to both the CCSI and the intra-cluster QSI (CQSI), and are computed at the CM over a shorter timescale. Due to the two-timescale control structure, the delay-optimal control is formulated as an infinite-horizon average-cost Constrained Partially Observed Markov Decision Process (CPOMDP). We propose an equivalent Bellman equation in terms of the Pattern Selection Q-factor to solve the CPOMDP. We approximate the Pattern Selection Q-factor by the sum of Per-cluster Potential functions and propose a novel distributive online learning algorithm to estimate the Per-cluster Potential functions (at each CM) as well as the Lagrange multipliers (LMs) (at each BS). This update algorithm requires CCSI and CQSI only, and therefore facilitates distributive implementation. Using separation of timescales, we shall establish the almost-sure convergence of the proposed distributive online learning algorithm. By exploiting the birth-death structure of the queue dynamics, we further decompose the Per-cluster Potential function into the sum of Per-cluster Per-user Potential functions. Based on these distributive potential functions and the birth-death structure, the instantaneous power allocation control is formulated as a Per-stage QSI-aware Interference Game and solved by a QSI-aware Simultaneous Iterative Waterfilling Algorithm (QSIWFA). We show that QSIWFA can achieve the NE of the QSI-aware interference game. Unlike conventional iterative waterfilling solutions [17], the water-level of our solution is adaptive to the QSI via the potential functions.
We first list the acronyms used in this paper in Table I:
BSC — base station controller
CM — cluster manager
ICI — intercell interference
LM — Lagrange multiplier
L/C/G CSI (QSI) — local/intra-cluster/global channel state information (queue state information)
CPOMDP — constrained partially observed Markov decision process
QSIWFA — QSI-aware simultaneous iterative waterfilling algorithm
II System Models
In this section, we shall elaborate on the network MIMO system topology, the physical layer model, the bursty source model and the control policy.
II-A System Topology
We consider a wireless cellular network consisting of cells with one BS and MSs in each cell, as illustrated in Fig. 2. We assume each BS is equipped with transmit antennas and each MS has a single receive antenna. (When there are more users than can be served simultaneously, a user selection control selects the active users; the proposed solution framework can easily be extended to accommodate this user selection control as well.) Denote the set of BSs as and the set of MSs in each cell as , respectively. We consider a clustered network MIMO system with maximum cluster size . Let denote a feasible cluster , i.e., a collection of neighboring BSs. We define a clustering pattern to be a partition of the set of BSs as follows
(1) 
where is the collection of all clustering patterns, with cardinality .
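As an illustration of how the set of clustering patterns can be enumerated, the following sketch lists all partitions of a small BS set into clusters of bounded size. The adjacency constraint that clusters contain only neighboring BSs is omitted for brevity, and the function name is ours:

```python
from itertools import combinations

def clustering_patterns(bs_set, max_size):
    """Enumerate all partitions of bs_set into clusters of at most max_size BSs.

    Each pattern is a frozenset of clusters (themselves frozensets), matching
    the definition of a clustering pattern as a partition of the BS set.
    """
    bs_set = sorted(bs_set)
    if not bs_set:
        yield frozenset()
        return
    first, rest = bs_set[0], bs_set[1:]
    # The cluster containing `first` has between 0 and max_size - 1 other BSs,
    # so every partition is generated exactly once.
    for k in range(min(max_size - 1, len(rest)) + 1):
        for others in combinations(rest, k):
            cluster = frozenset((first,) + others)
            remaining = [b for b in rest if b not in others]
            for sub in clustering_patterns(remaining, max_size):
                yield frozenset({cluster}) | sub

patterns = list(clustering_patterns(range(4), max_size=2))
print(len(patterns))  # feasible patterns for 4 BSs with clusters of size <= 2
```

The number of patterns grows very quickly with the number of BSs, which is one reason the pattern selection is performed only on the slower timescale at the BSC.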
As illustrated in Fig. 2, the overall multicell network is specified by a three-layer hierarchical architecture, i.e., the base station controller (BSC), the cluster managers (CMs) and the BSs. There are user queues at each BS, which buffer the packets for the MSs in each cell. Both the local CSI (LCSI) and the local QSI (LQSI) are measured locally at each BS. The BSC obtains the global QSI (GQSI) from the LQSI distributed at each BS, determines the clustering pattern according to the GQSI, and informs the CMs of the concerned clusters with their intra-cluster QSI (CQSI). During each scheduling slot, the CM of each cluster determines the precoding vectors as well as the transmit powers of the BSs in the cluster.
II-B Physical Layer Model
Denote MS in cell as a BS-MS index pair . The channel from the transmit antennas of BS to the MS is denoted as the vector (), with its th element () a discrete random variable distributed according to a general distribution with mean 0 and variance , where denotes the per-user discrete CSI state space with cardinality and denotes the path gain between BS and MS . For a given clustering pattern , let (), () and denote the LCSI at BS in cluster , the CCSI at the CM , and the GCSI, respectively, where denotes the GCSI state space. In this paper, the time dimension is partitioned into scheduling slots indexed by , with slot duration .
Assumption 1
The GCSI is quasi-static within each scheduling slot and i.i.d. over scheduling slots. Furthermore, is independent w.r.t. and . The path gain remains constant for the duration of the communication session. \QED
Let and () denote the information symbols and the received power of MS , respectively. Denote () as the precoding vector for MS at the BS . Therefore, the received signal of MS in cluster () is given by
where is the noise. Based on the CCSI at the CM, we adopt zero-forcing (ZF) within each cluster to eliminate the intra-cluster interference [1, 3]. (We consider ZF precoding as an example, but the solution framework in this paper can be applied to other SDMA processing techniques as well; our zero-forcing precoder design can also be extended to multi-antenna MSs with block zero-forcing similar to that in [spencer2004zero].) The ZF precoder of cluster () satisfies () and (). The transmit power of BS is therefore given by
(2) 
For simplicity, we assume perfect CSI at the transmitter and receiver, and the maximum achievable data rate (bit/s/Hz) of MS in cluster is given by the mutual information between the channel inputs and channel outputs as:
(3) 
where () is the intercell interference power.
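The zero-forcing construction in this subsection can be sketched numerically: the precoder below is the column-normalized right pseudo-inverse of the intra-cluster channel matrix, so the effective channel is diagonal and the intra-cluster interference vanishes. The dimensions and variable names are illustrative, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(2)
Nt, K = 4, 3  # assumed: 4 transmit antennas jointly serving 3 single-antenna MSs
# Rayleigh-like complex channel matrix: row k is the channel to MS k.
H = (rng.standard_normal((K, Nt)) + 1j * rng.standard_normal((K, Nt))) / np.sqrt(2)

# Zero-forcing precoder: right pseudo-inverse of H, columns normalized to unit power.
W = H.conj().T @ np.linalg.inv(H @ H.conj().T)
W /= np.linalg.norm(W, axis=0, keepdims=True)

# The effective channel H @ W is diagonal: zero intra-cluster interference.
eff = H @ W
off_diag = eff - np.diag(np.diag(eff))
print(np.max(np.abs(off_diag)))  # numerically zero
```

The per-BS transmit power in (2) then follows from the rows of W, since each BS contributes a subset of the cluster's antennas.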
II-C Bursty Source Model
Let be the random new arrivals (number of bits) for the users in the multicell network at the end of the th scheduling slot.
Assumption 2
The arrival process is distributed according to general distributions and is i.i.d. over scheduling slots and independent w.r.t. . \QED
Let be the GQSI matrix of the multicell network, where is the element of , which denotes the number of bits in the queue of MS at the beginning of the th slot. The per-user QSI state space and the GQSI state space are given by and , respectively. denotes the buffer size (maximum number of bits) of the queues of the MSs. Thus, the cardinality of the GQSI state space is , which grows exponentially with . Let be the matrix of scheduled data rates of the MSs, where the element can be calculated using (3). We assume the controller is causal, so that new arrivals are observed only after the controller's actions at the th slot. Hence, the queue dynamics are given by the following equation:
(4) 
where and is the duration of a scheduling slot. For notation convenience, we denote as the global system state at the th slot.
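The queue dynamics in (4), with departures applied before the newly observed arrivals and truncation at the buffer size, can be sketched as follows (array shapes and values are illustrative):

```python
import numpy as np

def queue_update(Q, R, A, tau, NQ):
    """One slot of the queue dynamics in (4): serve up to R*tau bits, then add
    the new arrivals, truncating at the buffer size NQ.
    Q, R, A are (num_BS x num_MS) arrays; tau is the slot duration."""
    served = np.maximum(Q - R * tau, 0.0)  # departures first (controller is causal)
    return np.minimum(served + A, NQ)      # arrivals observed after the action

Q = np.array([[5.0, 2.0], [8.0, 0.0]])    # QSI at the start of the slot
R = np.array([[3.0, 1.0], [2.0, 4.0]])    # scheduled rates from (3)
A = np.array([[1.0, 6.0], [4.0, 2.0]])    # new arrivals at the end of the slot
print(queue_update(Q, R, A, tau=1.0, NQ=10.0))
```

Note that the min/max structure makes each queue a controlled birth-death-like process, which is exactly the structure exploited later for the per-user potential decomposition.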
II-D Clustering Pattern Selection and Power Control Policy
At the beginning of the th slot, given the observed GQSI realization , the BSC determines the clustering pattern defined in (1), and the CMs of the active clusters () perform power allocation based on the GCSI and GQSI according to a pattern selection and power allocation policy defined below.
Definition 1 (Stationary Pattern Selection and Power Allocation Policy)
A stationary pattern selection and power allocation policy is a mapping from the system state to the pattern selection and power allocation actions, where and . A policy is called feasible if the associated actions satisfy the per-BS average transmit power constraint given by
(5) 
where is given by (2) and is the average total power of BS . \QED
Remark 1 (Two-Timescale Control Policy)
The pattern selection policy is defined as a function of the GQSI only, i.e. , for the following reasons. The QSI changes on a slower timescale, while the CSI changes on a faster (slot-by-slot) timescale. The dynamic clustering is enforced at the BSC and hence a longer timescale is desirable from the implementation perspective, considering the computational complexity at the BSC and the signaling overhead for collecting GCSI from all the BSs. On the other hand, the low-complexity and decentralized power allocation policy (obtained later in Sec. IV) is a function of the CQSI and CCSI only and is executed distributively at the CM level (according to Definition 1, the power control policy is defined as a function of the GQSI and GCSI; yet, in Sec. IV, we shall derive a decentralized power allocation policy which is adaptive to the CCSI and CQSI only), and hence it can operate at the slot timescale with acceptable signaling overhead and complexity. \QED
III Problem Formulation
In this section, we shall first elaborate on the dynamics of the system state under a control policy . Based on that, we shall formally formulate the delay-optimal control problem for network MIMO systems.
III-A Dynamics of System State
A stationary control policy induces a joint distribution for the random process . Under Assumptions 1 and 2, the arrivals and departures are memoryless. Therefore, the random process induced by a given control policy is Markovian with the following transition probability:
(6) 
Note that the queues are coupled together via the control policy .
III-B Delay-Optimal Problem Formulation
Given a unichain policy , the induced Markov chain is ergodic (a unichain policy is defined as a policy under which the resulting Markov chain is ergodic [8]; similar to other literature on MDPs [5],[13], we restrict our consideration to unichain policies in this paper. Such an assumption usually entails no loss of performance. For example, in Section V, any non-degenerate control policy satisfies , i.e. , and hence the induced Markov chain is an ergodic birth-death process.) and there exists a unique steady-state distribution , where . The average cost of MS under a unichain policy is given by:
(7) 
where is a monotonically increasing utility function of and denotes expectation w.r.t. the underlying measure . For example, when , is the average delay of MS (by Little's Law). Another interesting example is the queue outage probability , in which , where is the reference outage queue state. Similarly, the average transmit power constraint in (5) can be written as
(8) 
where is given by (2).
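As a quick illustration of the two per-stage utility examples given earlier in this subsection, the sketch below estimates the average delay (via Little's Law, f(Q) = Q/λ) and the queue outage probability (f(Q) = 1{Q ≥ Q_out}) from a queue-length trace. The trace, arrival rate and outage threshold are placeholder values, not derived from any policy in the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 50_000
# Hypothetical stationary queue-length trace for one MS, standing in for the
# trajectory induced by a unichain policy.
Q = rng.integers(0, 11, size=T).astype(float)

lam = 3.0   # mean arrival rate (bits per slot), an assumed value
Q_out = 8   # reference outage queue state, an assumed value

avg_delay = np.mean(Q / lam)       # f(Q) = Q/lambda  ->  average delay (Little's Law)
outage_prob = np.mean(Q >= Q_out)  # f(Q) = 1{Q >= Q_out}  ->  queue outage probability

print(avg_delay, outage_prob)
```

Both quantities are empirical averages of a per-stage utility along the trajectory, which is exactly the form of the average cost in (7).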
In this paper, we seek to find an optimal stationary unichain control policy to minimize the average cost in (7). Specifically, we have
Problem 1 (Delay-Optimal Control Problem for Network MIMO)
For some positive constants (the positive weighting factors in (9) indicate the relative importance of buffer delay among the data streams, and for each given , the solution to (9) corresponds to a point on the Pareto-optimal delay tradeoff boundary of a multi-objective optimization problem), the delay-optimal problem is formulated as
(9)  
where . \QED
III-C Constrained POMDP
Next, we shall illustrate that Problem 1 is an infinite-horizon average-cost constrained POMDP. In Problem 1, the pattern selection policy is defined on the partial system state , while the power allocation policy is defined on the complete system state , where . Therefore, Problem 1 is a constrained POMDP (CPOMDP) with the following specification:

State Space: The state space is given by , where is a realization of the global system state.

Action Space: The action space is given by , where is a unichain feasible policy as defined in Definition 1.

Transition Kernel: The transition kernel is given by (6).

Per-stage Cost Function: The per-stage cost function is given by .

Observation: The observation for the pattern selection policy is GQSI, i.e., , while the observation for the power allocation policy is the complete system state, i.e. .

Observation Function: The observation function for the pattern selection policy is
, if , otherwise 0. Similarly, the observation function for the power allocation policy is , if , otherwise 0.
For any Lagrange multiplier (LM) vector (), define the Lagrangian as
where . Therefore, the corresponding unconstrained POMDP for a particular LM vector (i.e. the Lagrange dual function) is given by
(10) 
The dual problem of the primal problem in Problem 1 is given by . It is shown in [15] that there exists a Lagrange multiplier such that minimizes and the saddle point condition holds, i.e. is a saddle point of the Lagrangian . Using standard optimization theory[10], is the primal optimal (i.e. the optimal solution of the original Problem 1), is the dual optimal (i.e. the optimal solution of the dual problem), and the duality gap (i.e. the gap between the primal objective at and the dual objective at ) is zero. Therefore, by solving the dual problem, we can obtain the primal optimal .
III-D Equivalent Bellman Equation
While POMDPs are very difficult problems in general, we shall exploit some special structure in our problem to substantially simplify it. We first define the conditional power allocation action sets below:
Definition 2 (Conditional Power Allocation Action Sets)
Given a power allocation policy , we define a conditional power allocation action set as the collection of actions for all possible CSI realizations conditioned on a given QSI . The complete control policy is therefore equal to the union of all the conditional power allocation action sets, i.e. . \QED
Based on Definition 2, we can transform the POMDP into a regular infinite-horizon average-cost MDP. Furthermore, for a given , the optimal control policy can be obtained by solving an equivalent Bellman equation, which is summarized in the lemma below.
Lemma 1 (Equivalent Bellman Equation and Pattern Selection Q-factor)
For a given LM , the optimal control policy for the unconstrained optimization problem in Problem 1 can be obtained by solving the following equivalent Bellman equation: ()
(11) 
where is the Pattern Selection Q-factor, is the conditional per-stage cost, and is the conditional expectation of the transition kernel. If there is a that satisfies the fixed-point equations in (11), then is the optimal average cost in Problem 1. Furthermore, the optimal control policy is given by , with attaining the minimum of the R.H.S. of (11) () and . \QED
Proof: Please refer to Appendix A.
Remark 2
The equivalent Bellman equation in (11) is defined on the GQSI only, with cardinality . Nevertheless, the optimal power allocation policy obtained by solving (11) is still adaptive to the GCSI and GQSI , where are the conditional power allocation action sets given by Definition 2. In other words, the derived policies of the equivalent Bellman equation in (11) solve the CPOMDP in Problem 1. We shall illustrate this with a simple example below. \QED
Example 1
As a simple example, suppose there are two MSs with CSI state space . As a result, the global CSI state space is , with cardinality . Given GQSI , the optimal conditional power allocation action set (by Definition 2) for any given pattern is obtained by solving the R.H.S. of (11).
Observe that the R.H.S. of the above equation is a decoupled objective function w.r.t. the variables and hence
Hence, using Lemma 1, the optimal power control policy is given by , which are functions of both the GQSI and the GCSI. The optimal clustering pattern selection is given by , which is a function of the GQSI only. \QED
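For a toy state space, the fixed-point equations of Lemma 1 can be solved by relative value iteration on the pattern-selection Q-factor. The sketch below assumes the conditional minimization over power allocations has already been folded into the per-stage cost g, and uses randomly generated costs and kernels purely for illustration:

```python
import numpy as np

def relative_q_iteration(g, P, ref=(0, 0), iters=2000):
    """Relative value iteration for an average-cost MDP, sketching how the
    pattern-selection Q-factor in (11) could be computed on a toy state space.
    g[s, a] is the (already conditionally minimized) per-stage cost,
    P[s, a, s'] the transition kernel; the reference (state, action) pair
    `ref` is pinned to zero to remove the additive constant."""
    S, A = g.shape
    Q = np.zeros((S, A))
    theta = 0.0
    for _ in range(iters):
        V = Q.min(axis=1)        # greedy pattern per QSI state
        Q_new = g + P @ V        # Bellman backup
        theta = Q_new[ref]       # running estimate of the optimal average cost
        Q = Q_new - theta        # relative normalization
    return Q, theta

# Toy instance: 3 GQSI states, 2 clustering patterns.
rng = np.random.default_rng(1)
g = rng.random((3, 2))
P = rng.random((3, 2, 3))
P /= P.sum(axis=2, keepdims=True)  # make each row a probability distribution
Q, theta = relative_q_iteration(g, P)
print(theta)
```

At the fixed point, V(s) = min over patterns of Q(s, pattern) satisfies the average-cost Bellman equation, and theta approximates the optimal average cost; in the real problem the state is the GQSI and the actions are the clustering patterns.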
IV General Low-Complexity Decentralized Solution
The key steps in obtaining the optimal control policies from the R.H.S. of the Bellman equation in (11) rely on the knowledge of the pattern selection Q-factor (which involves solving the system of nonlinear Bellman equations in (11), with unknowns, for given LMs) and of the LMs themselves, which leads to enormous complexity. A brute-force solution has exponential complexity, requires centralized implementation, and needs knowledge of the GCSI and GQSI (which leads to huge signaling overheads). In this section, we shall approximate the pattern selection Q-factor by the sum of Per-cluster Potential functions. Based on this approximation structure, we propose a novel distributive online learning algorithm to estimate the Per-cluster Potential functions (performed at each CM) as well as the LMs (performed at each BS).
IV-A Linear Approximation of the Pattern Selection Q-Factor
Let denote the CQSI state space of cluster with cardinality . To reduce the cardinality of the state space and to decentralize the resource allocation, we approximate by the sum of per-cluster potentials (), i.e.
(12) 
where is the CQSI state of cluster () and the are per-cluster potential functions which satisfy the following per-cluster potential fixed-point equation:
(13) 
where
(14) 
(15) 
, given by (2), .
In the literature, there are mainly three types of compact representations that can be used to approximate the potential functions [tsitsiklis1996feature],[12]: artificial neural networks, feature extraction, and parametric forms. The first two approaches still need the (GCSI, GQSI), have exponential complexity with respect to and , and do not facilitate distributed implementation. Therefore, we adopt the parametric form with linear approximation. Due to the cluster-based structure and the relationship between the GQSI and the CQSI, we can extract meaningful features and use the summation form for approximation, which naturally leads to a distributed implementation. Using the above linear approximation of the pattern selection Q-factor by the sum of per-cluster potential functions in (12), the BSC determines the optimal clustering pattern based on the currently observed GQSI according to
(16) 
Based on the CQSI and CCSI observation , each CM () determines , which attains the minimum of the R.H.S. of (13) (). Hence, the overall power allocation control policy is given by .
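The pattern selection rule in (16) reduces to an argmin over patterns of the sum of learned per-cluster potentials evaluated at the intra-cluster QSI. A minimal sketch (the data structures and names are ours, not the paper's):

```python
def select_pattern(patterns, potentials, gqsi):
    """Pattern selection at the BSC per (16): pick the pattern minimizing the
    sum of per-cluster potential functions evaluated at the current CQSI.
    `potentials[cluster]` maps a CQSI tuple to its learned potential value;
    `gqsi[bs]` is the QSI reported by each BS."""
    def pattern_cost(pattern):
        total = 0.0
        for cluster in pattern:
            cqsi = tuple(gqsi[bs] for bs in sorted(cluster))  # intra-cluster QSI
            total += potentials[cluster][cqsi]
        return total
    return min(patterns, key=pattern_cost)

# Toy usage: 2 BSs, either one joint cluster {0,1} or two singleton clusters.
c01, c0, c1 = frozenset({0, 1}), frozenset({0}), frozenset({1})
patterns = [frozenset({c01}), frozenset({c0, c1})]
potentials = {c01: {(2, 5): 1.0}, c0: {(2,): 0.8}, c1: {(5,): 0.9}}
print(select_pattern(patterns, potentials, gqsi={0: 2, 1: 5}))  # joint cluster wins
```

Note that the BSC only needs the per-cluster potential tables reported by the CMs and the GQSI, never the GCSI, which is the point of the decomposition.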
IV-B Online Primal-Dual Distributive Learning Algorithm via Stochastic Approximation
Since the derived policy is a function of the per-cluster potential functions (), we need to obtain by solving (13) and to determine the LMs such that the per-BS average power constraints in (5) are satisfied, neither of which is trivial. In this section, we shall apply stochastic learning [13, 14] to estimate the per-cluster potential functions () distributively at each CM, based on real-time observations of the CCSI and CQSI, and the LMs at each BS, based on the real-time power allocation actions. The convergence proof of the online learning algorithm will be established in the next section.
Fig. 3 illustrates the top level structure of the online learning algorithms. The Online PrimalDual Distributive Learning Algorithm via Stochastic Approximation, which requires knowledge on CCSI and CQSI only, can be described as follows:
Algorithm 1
(Online PrimalDual Distributive Learning Algorithm via Stochastic Approximation)

Step 1 [Initialization]: Set . Each cluster initializes the per-cluster potential functions (). Each BS initializes the LM ().

Step 2 [Clustering Pattern Selection]: At the beginning of the th slot, the BSC determines the clustering pattern based on the GQSI obtained from each BS according to (16), and broadcasts it to the CMs of the active clusters .

Step 3 [Per-cluster Power Allocation]: Each CM () of an active cluster obtains the CCSI , the CQSI and the LMs () from the BSs in its cluster, based on which it performs power allocation according to .

Step 4 [Potential Functions Update]: Each CM updates the per-cluster potential based on the CQSI according to (17) and reports the updated potential functions to the BSC.
Step 5 [Lagrange Multipliers Update]: Each BS updates its LM according to (18). The per-cluster potential update in Step 4 and the LM update in Step 5, based on the CCSI observation and CQSI observation at the current time slot, are given as follows:
(17)  
(18) 
where is the number of updates of up to , is the reference state (without loss of generality, we set the reference state (), i.e. empty buffers for all MSs in cluster , and initialize ), and is the last time slot at which the reference state was updated. is the projection onto an interval for some . and are the step-size sequences satisfying the following conditions:
(19) 
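One concrete choice of step-size sequences satisfying (19) is polynomial decay, with the LM step size decaying faster so that the two-timescale separation holds. The exponents below are illustrative choices, not taken from the paper:

```python
# Example step-size sequences for the two-timescale updates: each satisfies
# the Robbins-Monro conditions (sum = infinity, sum of squares < infinity),
# and gamma_t / eps_t -> 0 so the LM update (18) runs on the slower timescale.
def eps(t):
    """Potential-function step size (faster timescale I)."""
    return 1.0 / (t + 1) ** 0.6

def gamma(t):
    """Lagrange-multiplier step size (slower timescale II)."""
    return 1.0 / (t + 1) ** 0.9

ratios = [gamma(t) / eps(t) for t in (1, 10, 100, 1000, 10_000)]
print(ratios)  # monotonically decreasing toward 0
```

Here the ratio gamma(t)/eps(t) = (t+1)^(-0.3) vanishes, which is what makes the LMs appear quasi-static during the potential update, as used in the convergence analysis below.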
IV-C Convergence Analysis for Distributive Primal-Dual Online Learning
In this section, we shall establish the technical conditions for the almostsure convergence of the online distributive learning algorithm in Algorithm 1. Let denote the dimensional vector form of . For any percluster LM vector , define a vector mapping of cluster with the th () component mapping as:
(20) 
Define and , where is an average transition probability matrix for the queues of cluster with as its element and is an identity matrix.
Since we have two different step-size sequences and with , the per-cluster potential updates and the LM updates are performed simultaneously but over two different timescales. During the per-cluster potential update (timescale I), we have , where . Therefore, the LMs appear to be quasi-static [15] during the per-cluster potential update in (17). We first have the following lemma.
Lemma 2
(Convergence of Per-cluster Potential Learning (Timescale I)) Assume that for every set of feasible control policies , , in the policy space, there exist a and some positive integer such that
(21) 
where denotes the element of the corresponding matrix. For step-size sequences satisfying the conditions in (19), we have a.s. () for any initial potential vector and per-cluster LM vector , where the steady-state per-cluster potential