Dynamic Nested Clustering for Parallel PHY-Layer Processing in Cloud-RANs
Abstract
Featured by centralized processing and cloud-based infrastructure, Cloud Radio Access Network (CRAN) is a promising solution to achieve an unprecedented system capacity in future wireless cellular networks. The huge capacity gain mainly comes from the centralized and coordinated signal processing at the cloud server. However, full-scale coordination in a large-scale CRAN requires the processing of very large channel matrices, leading to high computational complexity and channel estimation overhead. To resolve this challenge, we exploit the near-sparsity of large CRAN channel matrices and derive a unified theoretical framework for clustering and parallel processing. Based on the framework, we propose a dynamic nested clustering (DNC) algorithm that not only greatly improves the system scalability in terms of baseband-processing and channel-estimation complexity, but also is amenable to various parallel processing strategies for different data center architectures. With the proposed algorithm, we show that the computation time for the optimal linear detector is greatly reduced from $O(N^3)$ to no higher than $O(N^{42/23})$, where $N$ is the number of RRHs in the CRAN.
Cloud-RAN; dynamic clustering; parallel processing
I Introduction
The explosive growth in mobile data traffic threatens to outpace the infrastructure it relies on. To sustain the mobile data explosion with low bit-cost and high spectrum/energy efficiency, a revolutionary wireless cellular architecture, termed Cloud Radio Access Network (CRAN), has emerged as a promising solution [1]. In contrast to traditional base stations (BSs), the radio function units and baseband processing units (BBUs) in CRANs are separated, and the latter are migrated to a centralized data center connected through an optical transport network with high bandwidth and low latency. This keeps the radio function units (also referred to as remote radio heads (RRHs)) lightweight, thereby allowing them to be deployed in a large number of small cells at low cost. Meanwhile, the centralized baseband allows RRHs to seamlessly cooperate with each other for flexible interference management, coordinated signal processing, etc. As such, CRAN has been recognized as a “future proof” architecture that enables various key technologies, including fibre-connected distributed antenna systems (DASs) and coordinated multi-point (CoMP) transmission. In this way, CRAN holds great promise for significant system-capacity enhancement and cost reduction.
The exciting opportunities come hand in hand with new technical challenges. Theoretically speaking, the highest system capacity is achieved when all RRHs cooperatively form a large-scale virtual antenna array that jointly detects the users’ signals. Full-scale coordination, however, requires the processing of a very large channel matrix consisting of the channel coefficients from all mobile users to all RRHs. For example, the complexity of the optimal linear receiver grows cubically with the number of users/RRHs [3]. In other words, the normalized baseband processing complexity (normalized by the number of users/RRHs) grows quadratically as the system becomes large. This fundamentally limits the scalability of the system. In addition, full-scale joint RRH processing requires estimating a large-scale channel matrix, causing significant channel estimation overhead. In [4], it is shown that the benefit of cooperation is fundamentally limited by the overhead of pilot-assisted channel estimation.
Distributed BS/antenna coordination has been extensively studied in CoMP and DAS systems [5, 6, 7, 8, 9, 10, 11, 12, 14, 13]. Most of the work has focused on throughput maximization [5, 6, 7, 8] and inter-cell interference management [9, 10, 11, 12, 14, 13] by forming cooperative clusters among neighboring BSs/antennas. Little, if any, of this work has considered the scalability of the baseband-processing and channel-estimation complexity when the system becomes extremely large. In reality, preliminary CRAN technology can already support around 10 km separation between the BBU pool and the RRHs, covering 10–1000 RRH sites [1]. At such a scale of coordination, current DAS and CoMP schemes become prohibitively expensive to implement. To solve the coordinated beamforming problem in CRAN, a recent work by Shi et al. [15] proposes a low-complexity optimization algorithm. Even though the simulation results show that the proposed algorithm can significantly reduce the computation time, [15] does not discuss how the complexity scales with the network size. Moreover, perfect knowledge of the entire channel matrix is required for beamforming, which is impractical for large-scale CRAN.
In this paper, we endeavour to design a CRAN baseband processing solution that can achieve the advantage of fullscale RRH coordination with a complexity that does not explode with the network size. In particular, our work is divided into the following steps.

Through rigorous analysis, we show that the signals can be detected by processing a sparsified channel matrix instead of the full channel matrix, without causing noticeable performance loss. In particular, we propose a threshold-based channel matrix sparsification method, which discards matrix entries whose corresponding link length (or large-scale fading in general) is larger than a certain threshold. A closed-form expression is derived to relate the threshold to the signal-to-interference-and-noise ratio (SINR) loss due to matrix sparsification. The result shows that for reasonably large networks, the vast majority of the channel coefficients can be ignored if we can tolerate a very small percentage of SINR loss. This not only opens up the possibility of significant complexity reduction, but also greatly reduces the channel estimation overhead. In practice, channel estimation overhead is mainly due to the estimation of small-scale fading, because large-scale fading varies at a much slower time scale. With the proposed channel matrix sparsification, we only need to estimate the small-scale fading coefficients corresponding to the small percentage of matrix entries that have not been discarded.

We show that by skillfully indexing the RRHs, the sparsified channel matrix can be turned into a (nested) doubly bordered block diagonal (DBBD) structure, as shown in Fig. 1. Interestingly, we find that the DBBD matrix naturally leads to a dynamic nested clustering (DNC) algorithm that greatly improves the scalability of the system complexity. Specifically, the diagonal blocks (see Fig. 1) can be interpreted as clusters (or subnetworks) that are processed separately in parallel. Different clusters are coordinated by the cut-node block and the border blocks, which capture the interference among clusters. As such, the baseband-processing complexity is dominated by the size of the clusters instead of that of the entire CRAN network.

Thanks to the centralized BBU pool of CRAN, the DNC algorithm is amenable to parallel processing at the CRAN data center. We design a parallel processing strategy that allows a flexible tradeoff among the computation time, the number of parallel processors required, the allocation of computational power among BBUs, etc. In this way, the DNC algorithm is adaptive to various architectures of the BBU pool.
The rest of this paper is organized as follows. In Section II, we describe the system model and outline the steps of the DNC algorithm. The first step of the DNC algorithm, threshold-based channel sparsification, is proposed and analysed in Section III. In Section IV, a single-layer DNC algorithm is proposed, and the detailed parallel implementation of this algorithm is discussed. The multi-layer DNC algorithm is introduced in Section V. Conclusions and discussions are given in Section VI.
II System Model
II-A System Setup
We consider the uplink transmission of a CRAN with $N$ single-antenna RRHs and $K$ single-antenna mobile users randomly located over the entire coverage area. The received signal vector at the RRHs is
$$\mathbf{y} = \mathbf{G}\mathbf{P}^{1/2}\mathbf{x} + \mathbf{n}, \qquad (1)$$
where $\mathbf{G} \in \mathbb{C}^{N \times K}$ denotes the channel matrix, with the $(n,k)$th entry $g_{nk}$ being the channel coefficient between the $k$th user and the $n$th RRH; $\mathbf{P}$ is a $K \times K$ diagonal matrix with the $k$th diagonal entry $p_k$ being the transmit power allocated to user $k$; $\mathbf{x}$ is a $K \times 1$ vector of the transmitted signals from the users, and $\mathbf{n}$ is an $N \times 1$ vector of the noise received by the RRHs. The transmitted signals are assumed to follow an independent complex Gaussian distribution with unit variance, i.e., $\mathbf{x} \sim \mathcal{CN}(\mathbf{0}, \mathbf{I}_K)$. Further, the $(n,k)$th entry of $\mathbf{G}$ is given by $g_{nk} = \alpha_{nk} d_{nk}^{-\gamma/2}$, where $\alpha_{nk}$ is the i.i.d. Rayleigh fading coefficient with zero mean and unit variance, $d_{nk}$ is the distance between the $n$th RRH and the $k$th user, and $\gamma$ is the path loss exponent. Then, $d_{nk}^{-\gamma}$ is the path loss from the $k$th user to the $n$th RRH.
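The model above can be made concrete with a small numerical sketch. All sizes and parameter values below ($N$, $K$, the 1 km × 1 km area, the noise power) are illustrative assumptions, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 8, 4          # toy numbers of RRHs and users (assumptions)
gamma = 3.7          # path-loss exponent used later in the paper

# Random RRH and user positions in a 1 km x 1 km area.
rrh = rng.uniform(0, 1000, size=(N, 2))
usr = rng.uniform(0, 1000, size=(K, 2))
d = np.linalg.norm(rrh[:, None, :] - usr[None, :, :], axis=2)  # N x K distances
d = np.maximum(d, 1.0)  # enforce a minimum RRH-user distance

# Channel: Rayleigh small-scale fading scaled by distance-based path loss.
fading = (rng.standard_normal((N, K)) + 1j * rng.standard_normal((N, K))) / np.sqrt(2)
G = fading * d ** (-gamma / 2)

P = np.diag(np.full(K, 1.0))            # per-user transmit powers (assumed equal)
x = (rng.standard_normal(K) + 1j * rng.standard_normal(K)) / np.sqrt(2)
sigma2 = 1e-13                          # assumed noise power
n = np.sqrt(sigma2 / 2) * (rng.standard_normal(N) + 1j * rng.standard_normal(N))

y = G @ np.sqrt(P) @ x + n              # received vector, eq. (1)
```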
II-B MMSE Detection for CRAN
With centralized baseband processing, a CRAN system can jointly detect all users’ signals through a fullscale RRH cooperation. Suppose that the optimal linear detection, i.e., MMSE detection, is employed. The receive beamforming matrix is
$$\mathbf{V} = \left(\mathbf{G}\mathbf{P}\mathbf{G}^{H} + \sigma^{2}\mathbf{I}_{N}\right)^{-1}\mathbf{G}\mathbf{P}^{1/2}, \qquad (2)$$
where $\sigma^{2}$ is the noise power.
The decision statistic of the transmitted signal vector $\mathbf{x}$ is a linear function of the received signal vector $\mathbf{y}$, i.e.,
$$\hat{\mathbf{x}} = \mathbf{V}^{H}\mathbf{y}. \qquad (3)$$
Then, the decision statistic for user $k$ is
$$\hat{x}_{k} = \mathbf{v}_{k}^{H}\mathbf{y}, \qquad (4)$$
where $\mathbf{v}_{k}$ is the $k$th column of the detection matrix $\mathbf{V}$ and $\mathbf{g}_{k}$ is the $k$th column of the channel matrix $\mathbf{G}$. The SINR of user $k$ is
$$\mathrm{SINR}_{k} = \frac{p_{k}\left|\mathbf{v}_{k}^{H}\mathbf{g}_{k}\right|^{2}}{\sum_{j \neq k} p_{j}\left|\mathbf{v}_{k}^{H}\mathbf{g}_{j}\right|^{2} + \sigma^{2}\left\|\mathbf{v}_{k}\right\|^{2}}. \qquad (5)$$
Notice that to calculate the detection matrix $\mathbf{V}$, the full channel matrix $\mathbf{G}$ needs to be acquired and processed. That is, full channel state information (CSI) needs to be estimated at the RRHs’ side. In addition, it takes $O(N^{3})$ operations to calculate $\mathbf{V}$, because the calculation involves the inverse of an $N \times N$ matrix. As mentioned in Section I, a CRAN generally covers a large area with a huge number of RRHs. As a result, the dimension of the channel matrix $\mathbf{G}$ is extremely large, and the cost of acquiring and processing $\mathbf{G}$ becomes prohibitively high. The key question is then how to obtain the best system performance enabled by full-scale RRH cooperation without incurring high channel estimation overhead and computational complexity.
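A minimal sketch of the full-scale MMSE detector in (2)-(3), assuming the notation $\mathbf{G}$, $p_k$, $\sigma^2$ introduced above (the function and variable names are ours). The solve against the $N \times N$ covariance is exactly the $O(N^3)$ bottleneck discussed in the text:

```python
import numpy as np

def mmse_detect(G, p, sigma2, y):
    """Centralized MMSE detection: V = (G P G^H + sigma^2 I)^{-1} G P^{1/2},
    then x_hat = V^H y.  The N x N solve costs O(N^3)."""
    N, K = G.shape
    C = G @ np.diag(p) @ G.conj().T + sigma2 * np.eye(N)   # receive covariance
    V = np.linalg.solve(C, G @ np.diag(np.sqrt(p)))        # N x K detection matrix
    return V.conj().T @ y                                  # decision statistics
```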
II-C Sketch of the Proposed Approach
As described in the Introduction section, the work in this paper consists of the following steps:

Channel sparsification based on a linkdistance threshold.

Transforming the MMSE detection into a system of linear equations defined by a (nested) DBBD matrix.

A parallel detection algorithm based on dynamic nested clustering.
We first introduce the channel-matrix sparsification approach in Section III. The transformation of the MMSE detection and the parallel detection algorithms are discussed in Sections IV and V.
III Threshold-based Channel Sparsification
In this section, we discuss the first step of the DNC algorithm, i.e., threshold-based channel sparsification. We first present the detailed sparsification approach in Subsection A. Then, a closed-form expression of the matrix sparsity as a function of the tolerable SINR loss is derived in Subsection B. Finally, verifications and discussions are given in Subsection C.
III-A Sparsification Approach
Since the RRHs and users are distributed over a large area, an RRH can only receive reasonably strong signals from a small number of nearby users, and vice versa. Thus, ignoring the small entries in $\mathbf{G}$ would significantly sparsify the matrix, hopefully with a negligible loss in system performance. In this paper, we propose to ignore the entries of $\mathbf{G}$ based on the distance between RRHs and users. In other words, the entry $g_{nk}$ is set to $0$ when the link distance $d_{nk}$ is larger than a threshold $d_{0}$:
$$\hat{g}_{nk} = \begin{cases} g_{nk}, & d_{nk} \le d_{0}, \\ 0, & d_{nk} > d_{0}. \end{cases} \qquad (6)$$
Note that we propose to sparsify the channel matrix based on the link distances instead of the actual channel coefficients, which are affected by both the link distances and the fast channel fading. In practice, link distances vary much more slowly than fast channel fading. As such, the overhead of estimating the link distances is negligible compared with the overhead of estimating the fast channel fading coefficients. With the distance-based channel sparsification, instantaneous channel estimation (i.e., estimation of the fast channel fading coefficients) is needed only on short links. Otherwise, if we were to sparsify based on the absolute values of the entries, the channel fading would need to be estimated on every link.
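A minimal sketch of the distance-based sparsification rule in (6), storing the result in a compressed sparse format (the function name and the use of SciPy are our choices):

```python
import numpy as np
from scipy import sparse

def sparsify_channel(G, d, d0):
    """Distance-based sparsification, eq. (6): zero every entry whose
    RRH-user distance exceeds d0, and store the result in CSR format so
    that only the surviving entries need to be estimated and processed."""
    mask = d <= d0
    G_hat = sparse.csr_matrix(np.where(mask, G, 0))
    return G_hat, mask
```

Only the links flagged by `mask` require instantaneous (small-scale fading) channel estimation; the rest of the matrix is never measured.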
The received signal can now be represented as
$$\mathbf{y} = \hat{\mathbf{G}}\mathbf{P}^{1/2}\mathbf{x} + \Delta\mathbf{G}\,\mathbf{P}^{1/2}\mathbf{x} + \mathbf{n}, \qquad (7)$$
where $\Delta\mathbf{G} = \mathbf{G} - \hat{\mathbf{G}}$ consists of the entries that have been ignored. Treating the first term on the RHS of (7) as signal, and the remaining two terms as interference plus noise, the detection matrix becomes
$$\hat{\mathbf{V}} = \hat{\mathbf{C}}^{-1}\hat{\mathbf{G}}\mathbf{P}^{1/2}, \qquad (8)$$
where $\hat{\mathbf{C}} = \hat{\mathbf{G}}\mathbf{P}\hat{\mathbf{G}}^{H} + \boldsymbol{\Sigma}_{\Delta} + \sigma^{2}\mathbf{I}_{N}$, and
$$\boldsymbol{\Sigma}_{\Delta} = \sum_{k=1}^{K} p_{k}\,\mathbb{E}\!\left[\Delta\mathbf{g}_{k}\Delta\mathbf{g}_{k}^{H}\right], \qquad (9)$$
with $\hat{\mathbf{g}}_{k}$ and $\Delta\mathbf{g}_{k}$ being the $k$th columns of $\hat{\mathbf{G}}$ and $\Delta\mathbf{G}$, respectively.
With this, the SINR becomes
$$\widehat{\mathrm{SINR}}_{k} = \frac{p_{k}\left|\hat{\mathbf{v}}_{k}^{H}\hat{\mathbf{g}}_{k}\right|^{2}}{\sum_{j \neq k} p_{j}\left|\hat{\mathbf{v}}_{k}^{H}\hat{\mathbf{g}}_{j}\right|^{2} + \hat{\mathbf{v}}_{k}^{H}\boldsymbol{\Sigma}_{\Delta}\hat{\mathbf{v}}_{k} + \sigma^{2}\left\|\hat{\mathbf{v}}_{k}\right\|^{2}}, \qquad (10)$$
where $\hat{\mathbf{v}}_{k}$ is the $k$th column of $\hat{\mathbf{V}}$.
Notice that when the distance threshold $d_{0}$ is small, the matrix $\hat{\mathbf{G}}$ can be very sparse, leading to a significant reduction in channel estimation overhead and processing complexity. The key question is how small $d_{0}$ can be, or how sparse the channel matrix can be, without significantly affecting the system performance. In other words, how should we set $d_{0}$ so that $\widehat{\mathrm{SINR}}_{k}$ in (10) is not much lower than $\mathrm{SINR}_{k}$ in (5)? This question will be answered in the next subsection.
III-B Distance Threshold Analysis
In this subsection, we show how to set the distance threshold if a high percentage of the full SINR is to be achieved. Specifically, we wish to set $d_{0}$ such that the SINR ratio, defined as
$$\rho(d_{0}) = \frac{\mathbb{E}\left[\widehat{\mathrm{SINR}}_{k}\right]}{\mathbb{E}\left[\mathrm{SINR}_{k}\right]}, \qquad (11)$$
is larger than a prescribed level $\eta$, where the expectation is taken over the channel realizations, with randomness including both path loss and Rayleigh fading.
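In simulations, the expectations in (11) can be estimated by Monte Carlo: compute the per-user MMSE SINR under the full and the sparsified channel matrices and average over channel draws. Below is a sketch of the per-user SINR computation of (5); it omits the residual-interference term of (10), so applying it to a sparsified matrix gives only an approximation (all names are ours):

```python
import numpy as np

def avg_sinr(G, p, sigma2):
    """Average per-user SINR of the MMSE detector, following eq. (5)."""
    N, K = G.shape
    C = G @ np.diag(p) @ G.conj().T + sigma2 * np.eye(N)
    V = np.linalg.solve(C, G @ np.diag(np.sqrt(p)))
    sinrs = []
    for k in range(K):
        v = V[:, k]
        sig = p[k] * abs(v.conj() @ G[:, k]) ** 2
        intf = sum(p[j] * abs(v.conj() @ G[:, j]) ** 2 for j in range(K) if j != k)
        sinrs.append(sig / (intf + sigma2 * np.linalg.norm(v) ** 2))
    return np.mean(sinrs)
```

The SINR ratio for one channel draw is then `avg_sinr(G_sparsified, ...) / avg_sinr(G, ...)`, averaged over many draws.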
In the following, we endeavour to derive a closed-form expression of $d_{0}$ as a function of the target SINR ratio $\eta$. Let us first introduce two approximations that make our analysis tractable.
Approximation 1: The distances $d_{nk}$, for all $n$ and $k$, are mutually independent.
As shown in Fig. 2, we plot the SINR ratio for systems with and without Approximation 1. The system area is assumed to be a circle of fixed radius. The figure shows that the gap between the two SINR ratios is very small, which validates the independence approximation.
Approximation 2: Conditioned on the distance threshold $d_{0}$, the matrices $\hat{\mathbf{G}}$ and $\Delta\mathbf{G}$ are mutually independent.
Note that $\mathbb{E}\left[\hat{\mathbf{G}}\Delta\mathbf{G}^{H}\right] = \mathbf{0}$, which means that $\hat{\mathbf{G}}$ and $\Delta\mathbf{G}$ are uncorrelated. With the independence approximation, the equality
(12) 
holds. This approximation will be verified by the numerical results in Fig. 3, which show that the gap between the simulated SINR ratio and the lower bound of $\rho(d_{0})$ derived based on this approximation is small.
Based on these two approximations, we can now derive a lower bound of $\rho(d_{0})$, as shown in the following Theorem 1.
Theorem 1.
Given a distance threshold $d_{0}$, a lower bound of the SINR ratio is given by
(13) 
where $f(d)$ is the probability density function of the distance between RRHs and users.
When each user transmits at the same power level, the lower bound simplifies to
(14) 
Proof.
See Appendix. ∎
We notice that the lower bound depends on the probability distribution of the distances between mobile users and RRHs. In [17], distance distributions are derived for different network area shapes, such as circles, squares and rectangles. Take, for example, a circular network area with radius $R$. In this case, the distance distribution between two random points is [17]
(15) 
where $d_{\min}$ is the minimum distance between RRHs and users.
When the network radius $R$ becomes very large, (15) can be approximated as
(16) 
Substituting (15) or (16) into (14), we obtain the relation between $d_{0}$ and the SINR requirement $\eta$:
Theorem 2.
When the network size is very large, the solution to (17) can be approximated as
(18) 
Particularly, when the network size goes to infinity (i.e., $R \to \infty$), $d_{0}$ can be further simplified as
(19) 
III-C Verifications and Discussions
In this subsection, we first verify our analysis through numerical simulations. We then illustrate the effect of the SINR ratio requirement on the choice of the distance threshold. Finally, we illustrate the sparsity of the channel matrix and discuss the possibility of reducing the estimation overhead based on the sparsified matrix. Unless stated otherwise, we assume that the minimum distance between RRHs and users is fixed, the path loss exponent is 3.7, and the average transmit SNR at the user side is fixed in dB.
Verification of Theorems 1 and 2
Fig. 3 plots the SINR ratio against the distance threshold. The simulated SINR ratios with different numbers of RRHs $N$ are plotted as the blue curves, and the lower bounds derived based on the distributions in (15) and (16) are plotted as the red and black curves, respectively. We can see that the gap between the lower bound based on (15) and that based on (16) is negligible, which means that the approximation of the distance distribution is reasonable. Moreover, we notice that even though the simulated ratios vary with $N$, the lower bounds derived based on Theorem 1 and (15), (16) remain unchanged for different $N$.
In Fig. 4, we show that the distance threshold converges to a constant when the network radius becomes large, as predicted in (19). Here, the user density is fixed, and two SINR ratio requirements are considered. As expected, the distance threshold converges quickly to a constant as the network radius increases. Indeed, the convergence is observed even for fairly small network radii under both SINR ratio requirements.
SINR Loss versus Distance Threshold
We then discuss the effect of the SINR requirement on the distance threshold. In Fig. 5, we plot the distance threshold against $\eta$ for several user densities. The network radius is assumed to be very large. We can see that the distance threshold remains very small for a wide range of $\eta$, i.e., when $\eta$ is smaller than 0.95. There is a sharp increase in $d_{0}$ when $\eta$ approaches 1. This implies an interesting tradeoff: if the full SINR is to be achieved, we do need to process the full channel matrix at the cost of high complexity when the network size is large. On the other hand, if a small percentage of SINR degradation can be tolerated, the channel matrix can be significantly sparsified, leading to low-complexity detection algorithms. We emphasize that the SINR degradation may not imply a loss in system capacity. This is because the overhead of estimating the full channel matrix can easily outweigh the SINR gain. A small compromise in SINR may eventually yield a higher system capacity.
[Table I: distance threshold $d_{0}$ (meter) and percentage of nonzero entries in $\hat{\mathbf{G}}$ for various network radii (km)]
[Table II: distance threshold $d_{0}$ (meter) and percentage of nonzero entries in $\hat{\mathbf{G}}$ for various SINR ratio requirements]
Sparsity of $\hat{\mathbf{G}}$
As seen from (18) and (19), for a given $\eta$, the distance threshold converges to a constant when the network radius goes to infinity. Since the average number of nonzero channel coefficients at each RRH is approximately proportional to the service area $\pi d_{0}^{2}$, the convergence of $d_{0}$ implies that the number of nonzero entries per row or per column of $\hat{\mathbf{G}}$ does not scale with the network radius in a large CRAN. Moreover, the percentage of nonzero entries in $\hat{\mathbf{G}}$ is approximately $(d_{0}/R)^{2}$, which can be very small when $R$ is large. In Table I, we list both $d_{0}$ and the corresponding percentage of nonzero entries in $\hat{\mathbf{G}}$ for various network sizes. It can be seen that, when $R$ is large, $d_{0}$ does not change much with the network radius $R$. Moreover, only a small percentage of the entries in $\hat{\mathbf{G}}$ are nonzero for all values of $R$ considered in Table I. In other words, each RRH only needs to estimate the CSI of a small number of its closest users, so the channel estimation overhead can be significantly reduced. If a larger SINR loss can be tolerated, the amount of CSI needed can be further reduced, as shown in Table II, which lists the percentages of nonzero entries in $\hat{\mathbf{G}}$ for different SINR ratio requirements. We see that the percentage of nonzero entries drops substantially when the required SINR ratio is slightly relaxed.
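The area-ratio argument above can be sketched numerically. The $(d_{0}/R)^{2}$ approximation of the nonzero fraction is our reading of the text, and the numbers below are illustrative, not the paper's:

```python
def nonzero_fraction(d0, R):
    """Approximate fraction of nonzero entries in the sparsified channel
    matrix for a circular network of radius R: each RRH keeps only users
    within distance d0, i.e. an area ratio of roughly (d0 / R)^2."""
    return (d0 / R) ** 2

# With a threshold of a few hundred meters in a network tens of km wide,
# only a tiny fraction of the channel matrix must be estimated.
frac = nonzero_fraction(d0=300.0, R=20_000.0)
```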
Remark 1.
As we can see from the figures, a close-to-full SINR is achievable when the channel matrix is reasonably sparsified. Notice that sparsifying the channel matrix leads to a significant reduction in channel estimation overhead, because we only need to estimate the small-scale fading coefficients of the matrix entries that have not been discarded. Therefore, matrix sparsification may effectively lead to a higher system capacity due to the reduction in channel estimation overhead, despite a small decrease in SINR.
IV Single-layer Dynamic Nested Clustering
With the sparsified channel matrix, we now proceed to present the single-layer DNC algorithm in this section. As shown in (8), signal detection amounts to calculating $\hat{\mathbf{V}}^{H}\mathbf{y} = \mathbf{P}^{1/2}\hat{\mathbf{G}}^{H}\hat{\mathbf{C}}^{-1}\mathbf{y}$. The computational complexity is dominated by the $O(N^{3})$ cost of the computations involving $\hat{\mathbf{C}}^{-1}$. This is because the sparse matrix $\hat{\mathbf{G}}$ contains only a constant number of nonzero entries per column (and so does $\hat{\mathbf{G}}^{H}$) as the network area goes to infinity. Suppose the average number of nonzero entries in each column of $\hat{\mathbf{G}}$ is $c$. The computational complexity of multiplying $\hat{\mathbf{G}}^{H}$ by $\hat{\mathbf{C}}^{-1}\mathbf{y}$ is then only $O(cN)$, much smaller than the cost of inverting $\hat{\mathbf{C}}$. Therefore, we focus on reducing the computational complexity of calculating $\hat{\mathbf{C}}^{-1}\mathbf{y}$, which is equivalent to solving for $\mathbf{u}$ in the equation:
$$\hat{\mathbf{C}}\mathbf{u} = \mathbf{y}. \qquad (20)$$
It is obvious that $\hat{\mathbf{C}}$ is a sparse matrix. Sparse linear equations such as (20) have been well studied in multiple areas, such as numerical linear algebra and graph theory. However, most existing works propose iterative algorithms, whose accuracy and convergence cannot be guaranteed [18]. Here, we propose a direct algorithm to obtain an accurate solution of (20). Before going into the details of the algorithm, we first explain the physical meaning of the entries in $\hat{\mathbf{C}}$. According to the threshold-based channel matrix sparsification approach, the $(n,k)$th entry of the channel matrix is nonzero only when the $k$th user is in the service area of RRH $n$, i.e., a circular area with radius $d_{0}$ centered at RRH $n$. Consequently, from the definition of $\hat{\mathbf{C}}$ in (8), the $(m,n)$th entry of $\hat{\mathbf{C}}$ is nonzero only when the service areas of RRHs $m$ and $n$ overlap and there is at least one user in the overlapping area.
Consider an ideal case where the whole set of RRHs can be divided into disjoint clusters: while the RRHs within one cluster have overlapping service areas, those from different clusters do not serve the same user(s). In this case, the matrix $\hat{\mathbf{C}}$ becomes block diagonal, with each block corresponding to one cluster. Then, the complexity of solving (20) reduces from $O(N^{3})$ to $O(N\bar{N}^{2})$, where $\bar{N}$ is the number of RRHs in a cluster. Note that $\bar{N}$ is typically much smaller than $N$, the total number of RRHs in the CRAN.
In reality, however, adjacent clusters interact and interfere with each other. In particular, the service areas of the RRHs in adjacent clusters are likely to overlap. Traditional clustering algorithms [16, 14] usually ignore such overlapping, resulting in a noticeable performance degradation. In what follows, we show that by properly labelling the RRHs, the matrix $\hat{\mathbf{C}}$ can be transformed into a DBBD matrix, where the borders capture the overlaps between clusters. Then, later in Subsection IV-B, the DNC algorithm that enables parallel computation is presented.
IV-A RRH Labelling Algorithm
To start with, we give the definition of a Hermitian DBBD matrix as follows:
Definition 1.
A matrix $\mathbf{A}$ is said to be a Hermitian DBBD matrix if it is of the form
$$\mathbf{A} = \begin{bmatrix} \mathbf{A}_{1} & & & \mathbf{B}_{1} \\ & \ddots & & \vdots \\ & & \mathbf{A}_{M} & \mathbf{B}_{M} \\ \mathbf{B}_{1}^{H} & \cdots & \mathbf{B}_{M}^{H} & \mathbf{A}_{c} \end{bmatrix}, \qquad (21)$$
where the diagonal blocks $\mathbf{A}_{i}$ are Hermitian matrices, the border blocks $\mathbf{B}_{i}$ are (in general rectangular) matrices, and the cut-node block $\mathbf{A}_{c}$ is a Hermitian matrix.
We divide the entire CRAN area into disjoint subareas, as illustrated in Fig. 6. We then separate each subarea into a $d_{0}$-width boundary and a subarea center, colored light green and dark green, respectively. Here, $d_{0}$ is the distance threshold used in the channel sparsification. We see that RRHs in a subarea center do not have overlapping service regions with RRHs in other subareas; only RRHs in the $d_{0}$-width boundary may have overlapping service regions with RRHs in adjacent subareas. This implies that $\hat{\mathbf{C}}$ can be transformed into a DBBD matrix, with each diagonal block corresponding to the RRHs in a subarea center and the cut-node block corresponding to the RRHs in the $d_{0}$-width boundaries. The border blocks of $\hat{\mathbf{C}}$ capture the interaction between different clusters due to interference.
Denote the coordinates of an arbitrary RRH $n$ by
$$(x_{n}, y_{n}), \qquad (22)$$
where $0 \le x_{n} \le L_{x}$ and $0 \le y_{n} \le L_{y}$, with $L_{x}$ and $L_{y}$ the side lengths of the whole network. The RRH labelling algorithm is given in Algorithm 1, where $l_{n}$ is the label of RRH $n$. We first divide the overall network into disjoint squares of a given side length, and group the RRHs into center clusters or the boundary cluster according to their locations (steps 2 to 9). Then, the RRHs are numbered based on the cluster they belong to (steps 10 to 18). After numbering all the RRHs, we organize the matrix $\hat{\mathbf{C}}$ and the signal vector $\mathbf{y}$ in ascending order of the RRHs’ labels. For example, the first row of $\hat{\mathbf{C}}$ corresponds to the RRH with label 1; that is, the first row captures the interaction between the RRH with label 1 and all RRHs in the network. The matrix $\hat{\mathbf{C}}$ now becomes a DBBD matrix.
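A compact sketch of the labelling idea (our simplified rendering; Algorithm 1 itself is not reproduced here): RRHs deep inside a square form that square's center cluster, RRHs within $d_{0}$ of a square edge go to the shared boundary cluster, and the boundary cluster is numbered last so that the permuted matrix is DBBD.

```python
def label_rrhs(coords, side, d0):
    """Split the area into squares of side `side`; assign each RRH either
    to the center cluster of its square or, if it lies within d0 of a
    square edge, to the shared boundary cluster.  Return the ordering
    (center clusters first, boundary cluster last) that permutes the
    matrix into DBBD form."""
    cluster = {}
    for i, (x, y) in enumerate(coords):
        cx, cy = int(x // side), int(y // side)
        # distance from the RRH to the nearest edge of its enclosing square
        edge = min(x - cx * side, (cx + 1) * side - x,
                   y - cy * side, (cy + 1) * side - y)
        cluster[i] = (cx, cy) if edge > d0 else 'boundary'

    order = [i for i in cluster if cluster[i] != 'boundary']
    order.sort(key=lambda i: cluster[i])          # group by center cluster
    order += [i for i in cluster if cluster[i] == 'boundary']
    return order
```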
Remark 2.
In this paper, we assume that the computational complexity of MMSE detection grows cubically with the number of RRHs $N$. As such, RRH labelling in Algorithm 1 is based only on the locations of the RRHs, not on those of the users. Notice, however, that the actual computational complexity of MMSE detection is $O\!\left(\min(N,K)^{3}\right)$ instead of $O(N^{3})$, where $K$ is the number of users. This is because if $K < N$, the estimate can be obtained by inverting a $K \times K$ matrix instead, so the complexity of the matrix inversion becomes $O(K^{3})$. In this paper, we take the case $N \le K$ as an example. The algorithm can be adapted to the case $N > K$ by applying the labelling algorithm to users instead of RRHs.
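The $K \times K$ shortcut mentioned in the remark follows from the push-through identity $(\mathbf{H}\mathbf{H}^{H} + \sigma^{2}\mathbf{I})^{-1}\mathbf{H} = \mathbf{H}(\mathbf{H}^{H}\mathbf{H} + \sigma^{2}\mathbf{I})^{-1}$ with $\mathbf{H} = \mathbf{G}\mathbf{P}^{1/2}$. A quick numerical check (the sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
N, K, sigma2 = 6, 3, 0.1
H = rng.standard_normal((N, K)) + 1j * rng.standard_normal((N, K))  # G P^{1/2}
y = rng.standard_normal(N) + 1j * rng.standard_normal(N)

# N x N inversion, as in (2)-(3):
x_big = (np.linalg.solve(H @ H.conj().T + sigma2 * np.eye(N), H)).conj().T @ y
# Equivalent K x K inversion via the push-through identity:
x_small = np.linalg.solve(H.conj().T @ H + sigma2 * np.eye(K), H.conj().T @ y)

assert np.allclose(x_big, x_small)  # identical decision statistics
```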
IV-B Single-Layer DNC with Parallel Computing
With $\hat{\mathbf{C}}$ converted into a DBBD matrix, we are now ready to present the DNC algorithm. In particular, the diagonal blocks of $\hat{\mathbf{C}}$ can be processed in parallel, leading to a significant reduction in computation time.
Suppose that the DBBD matrix $\hat{\mathbf{C}}$ has $M$ diagonal blocks $\mathbf{A}_{1}, \dots, \mathbf{A}_{M}$. Then, the linear equation (20) becomes
$$\begin{bmatrix} \mathbf{A}_{1} & & & \mathbf{B}_{1} \\ & \ddots & & \vdots \\ & & \mathbf{A}_{M} & \mathbf{B}_{M} \\ \mathbf{B}_{1}^{H} & \cdots & \mathbf{B}_{M}^{H} & \mathbf{A}_{c} \end{bmatrix} \begin{bmatrix} \mathbf{u}_{1} \\ \vdots \\ \mathbf{u}_{M} \\ \mathbf{u}_{c} \end{bmatrix} = \begin{bmatrix} \mathbf{y}_{1} \\ \vdots \\ \mathbf{y}_{M} \\ \mathbf{y}_{c} \end{bmatrix}, \qquad (23)$$
where the vectors $\mathbf{u}_{i}$ and $\mathbf{y}_{i}$ are the subvectors of $\mathbf{u}$ and $\mathbf{y}$ corresponding to the $i$th diagonal block, and $\mathbf{u}_{c}$ and $\mathbf{y}_{c}$ are the subvectors corresponding to the cut-node block.
The solution to the above equation is given by
$$\mathbf{u}_{c} = \left(\mathbf{A}_{c} - \sum_{i=1}^{M}\mathbf{B}_{i}^{H}\mathbf{A}_{i}^{-1}\mathbf{B}_{i}\right)^{-1}\left(\mathbf{y}_{c} - \sum_{i=1}^{M}\mathbf{B}_{i}^{H}\mathbf{A}_{i}^{-1}\mathbf{y}_{i}\right) \qquad (24)$$
and
$$\mathbf{u}_{i} = \mathbf{A}_{i}^{-1}\left(\mathbf{y}_{i} - \mathbf{B}_{i}\mathbf{u}_{c}\right) \qquad (25)$$
for all $i = 1, \dots, M$.
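The two update rules amount to a Schur-complement solve in which all per-cluster computations are independent. A sketch (the names and the sequential loop structure are ours; in the BBU pool the per-cluster steps would run on separate parallel units):

```python
import numpy as np

def solve_dbbd(A_blocks, B_blocks, A_c, y_blocks, y_c):
    """Solve the DBBD system: A_i u_i + B_i u_c = y_i for each cluster i,
    and sum_i B_i^H u_i + A_c u_c = y_c.  The per-cluster solves below
    are independent across i and therefore parallelizable."""
    # Per cluster: A_i^{-1} B_i and A_i^{-1} y_i.
    AinvB = [np.linalg.solve(A, B) for A, B in zip(A_blocks, B_blocks)]
    Ainvy = [np.linalg.solve(A, y) for A, y in zip(A_blocks, y_blocks)]
    # Central unit: Schur complement and cut-node solution u_c.
    S = A_c - sum(B.conj().T @ AB for B, AB in zip(B_blocks, AinvB))
    rhs = y_c - sum(B.conj().T @ Ay for B, Ay in zip(B_blocks, Ainvy))
    u_c = np.linalg.solve(S, rhs)
    # Per cluster: back-substitution u_i = A_i^{-1}(y_i - B_i u_c).
    u_blocks = [Ay - AB @ u_c for Ay, AB in zip(Ainvy, AinvB)]
    return u_blocks, u_c
```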
[Table III: step, operation, complexity/operation, total number of operations]
From equations (24) and (25), we draw the following conclusions. First, $\mathbf{u}_{c}$, the subvector corresponding to the cut-node block, can be calculated independently of the other subvectors $\mathbf{u}_{i}$. Second, with $\mathbf{u}_{c}$ obtained from (24), each $\mathbf{u}_{i}$ can be calculated using (25) independently. The calculation of $\mathbf{u}_{i}$ only involves the diagonal block $\mathbf{A}_{i}$ and the corresponding border block $\mathbf{B}_{i}$. In other words, if we treat each diagonal block as a cluster, then the signals received by each cluster can be processed in parallel, while the interactions between different clusters are captured by $\mathbf{u}_{c}$ and the border blocks. Based on the above discussion, Fig. 7 shows the architecture of the CRAN BBU pool, where parallel signal processing is carried out. The arrows in Fig. 7 indicate the data flows between the processing units. As the figure shows, to expedite the calculation of $\mathbf{u}_{c}$, the matrices $\mathbf{B}_{i}^{H}\mathbf{A}_{i}^{-1}\mathbf{B}_{i}$ and vectors $\mathbf{B}_{i}^{H}\mathbf{A}_{i}^{-1}\mathbf{y}_{i}$ are calculated simultaneously by a number of parallel processing units and then fed into the central processing unit. Then, $\mathbf{u}_{c}$ is calculated in the central processing unit, and the result is fed back to the parallel processing units. Each parallel processing unit is responsible for processing one cluster and calculating $\mathbf{u}_{i}$. Specifically, we divide all the signal processing operations of the proposed clustering algorithm into six steps, as listed in Table III. Steps 1 and 2 are first carried out in the parallel processing units. After receiving the results of steps 1 and 2, the central processing unit performs steps 3 and 4. At last, steps 5 and 6 are carried out in the parallel processing units.
IV-C Optimizing the Computation Time
Before optimizing the computation time, we make some assumptions on the CRAN BBU pool. We notice that the size of a diagonal block should be determined by the processing power of the corresponding parallel processing unit. This implies that, in general, the subareas should have different side lengths. To simplify later discussions, we assume that all the parallel processing units in the BBU pool have equal processing power. We also notice that the central processing unit should be more powerful than the parallel ones; otherwise, the central processing unit is unnecessary, since the corresponding operations (steps 3 and 4) could be carried out at one of the parallel processing units instead, reducing the total computation time. We then define an unbalanced processing power ratio to represent the relative processing power of the central processing unit and the parallel ones; that is, to perform the same operation, the computation time of the central processing unit is this ratio times shorter than that of a parallel one. For notational brevity, we refer to the logarithm of this ratio to base $N$ as the logN ratio. Without loss of generality, the processing power of a parallel processing unit is normalized to 1. As the operations in steps 1, 2, 5 and 6 can be performed in parallel, the total computation time is determined by the sizes of the diagonal blocks and of the cut-node block.
The computation time is an increasing function of the block sizes. To achieve a short computation time, the diagonal-block size and the cut-node block size should be as small as possible. However, the block sizes cannot be adjusted arbitrarily. In fact, for a given network, there is a fixed ratio between the diagonal-block size and the cut-node block size. Specifically, since these sizes equal the average numbers of RRHs in a subarea center and in the boundaries, respectively, the relationship between the square side length and this ratio is
(26) 
where $\rho_{r}$ is the RRH density. By adjusting the side length of the squares, the block-size ratio can be tuned over a wide range. Based on (26), we obtain the following approximations of the diagonal-block size, the cut-node block size, and the number of blocks:
Lemma 1.
In large CRANs, given the block-size ratio, the diagonal-block size, the cut-node block size, and the number of blocks can be approximated as
(27) 
(28) 
(29) 
Proof.
After ignoring the lower-order terms, we obtain the optimal computation time with parallel computing below.
Lemma 2.
When the logN ratio , the minimum computation time with parallel computing is , with the optimal . When , the minimum computation time is , with the optimal .
Remark 3.
We notice that when the central processing unit is much more powerful than the other units in the data center, performing all the operations in the central processor can achieve a shorter computation time than parallel computing. Based on Table III and Lemma 1, the computation time of serial computing at the central processing unit can be derived accordingly.
Proposition 1.
In a CRAN with parallel processing units and a central processing unit, when the logN ratio is below a threshold, the optimal computation time is achieved by parallel computing; when the logN ratio is above the threshold, the optimal computation time is achieved by performing all the operations at the central processing unit in serial.
We also show the effect of the logN ratio on the total computation time in Fig. 8, where the order of the computation time is the maximum exponent in the computation-time function. We see that the order of the computation time decreases as the logN ratio increases. Obviously, however, the price of the central processing unit also increases with this ratio. Fig. 8 thus illustrates a tradeoff between the computation time and the economic cost, which can serve as a guideline for the deployment of the BBU pool. For example, when economic cost is a major concern in the CRAN system, or when a long computation time can be tolerated, processing units with low processing power (and hence low price) should be deployed. When the computation time is more important than cost, more powerful processing units should be selected.
Remark 4.
So far, we have assumed that there are always enough parallel processing units. In this case, we only need to optimize the sizes of the diagonal blocks and the cut-node block, and we can ignore the number of blocks. When the number of parallel processing units is limited, however, the number of blocks, $M$, also affects the total computation time. Based on Lemma 1, $M$ can also be adjusted through the block sizes. In this way, we can balance the computation time against the limited availability of processing units. A more detailed analysis, however, is beyond the scope of this paper.
In this section, we have shown that the simple RRH labelling algorithm allows us to easily optimize the size of the diagonal blocks, which can be directly interpreted as the size of the clusters in large CRANs. The result further provides an important guideline for the architecture design of the CRAN BBU pool, including the number of processing units, the choice between parallel and serial processing, the allocation of processing power among BBUs, etc.
V Multi-layer DNC Algorithm
In the preceding section, we proposed a single-layer DNC algorithm, which greatly reduces the total computation time compared with the $O(N^{3})$ cost of direct matrix inversion. In this section, we propose a multi-layer DNC algorithm to further reduce the computation time.
We notice that the computation time of the parallel processing units in the single-layer DNC algorithm is dominated by the computations involving $\mathbf{A}_{i}^{-1}$. Interestingly, a closer study indicates that the diagonal blocks $\mathbf{A}_{i}$ are themselves sparse matrices. This is because the RRHs in the same cluster only interact with their neighboring RRHs, not with the whole cluster. This implies that each $\mathbf{A}_{i}$ can be further permuted into a DBBD form, and the computation time of the per-cluster solves can be reduced. As such, the matrix $\hat{\mathbf{C}}$ becomes a two-layer nested DBBD matrix. An example of such a matrix is shown in Fig. 9.
Fig. 10 illustrates the RRH labelling strategy that turns $\mathbf{A}_{i}$ into a DBBD form. In particular, the RRHs in a cluster are grouped into subclusters. For example, for the top-left square, the RRHs in the dark-green boundary area are assigned to the sub-border, and the RRHs in the center area (the white area) are assigned to the diagonal blocks. Intuitively, one can minimize the computation time by balancing the sizes of the different blocks. This is the focus of our study in the remainder of this section.
Note that by repeating this process, $\hat{\mathbf{C}}$ can be further permuted into a multi-layer nested DBBD matrix. For simplicity, we focus on the two-layer DNC algorithm in this section. The results, however, can be easily extended to the multi-layer case, as briefly discussed at the end of the section.
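The multi-layer idea can be sketched as a recursion: solve each diagonal block either densely or, if it is itself DBBD, by the same bordered elimination one layer down. This is our illustrative rendering; the paper's mapping of the steps onto level-1/2/3 processing units is abstracted away.

```python
import numpy as np

def solve_nested(A, b, layout=None):
    """Recursive sketch of a nested-DBBD solve.  `layout` is None for a
    plain dense solve, or (sizes, sub_layouts): the sizes of the diagonal
    blocks (the remainder of A is the cut-node block) and one layout per
    diagonal block (None, or another (sizes, sub_layouts) one layer down)."""
    if layout is None:
        return np.linalg.solve(A, b)
    sizes, subs = layout
    offs = np.cumsum([0] + list(sizes))
    n = offs[-1]                      # cut-node block occupies rows/cols n:
    Ai = [A[offs[i]:offs[i+1], offs[i]:offs[i+1]] for i in range(len(sizes))]
    Bi = [A[offs[i]:offs[i+1], n:] for i in range(len(sizes))]
    bi = [b[offs[i]:offs[i+1]] for i in range(len(sizes))]
    # Per-cluster solves: recursive, and independent across i (parallelizable).
    AinvB = [solve_nested(Ai[i], Bi[i], subs[i]) for i in range(len(sizes))]
    Ainvb = [solve_nested(Ai[i], bi[i], subs[i]) for i in range(len(sizes))]
    # Cut-node solve via the Schur complement.
    S = A[n:, n:] - sum(B.conj().T @ AB for B, AB in zip(Bi, AinvB))
    rhs = b[n:] - sum(B.conj().T @ Ab for B, Ab in zip(Bi, Ainvb))
    x_c = np.linalg.solve(S, rhs)
    # Back-substitution per cluster.
    return np.concatenate([Ab - AB @ x_c for Ab, AB in zip(Ainvb, AinvB)] + [x_c])
```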
V-A Multi-Layer DNC Algorithm with Parallel Computing
As discussed in the previous section, each parallel processor in Fig. 7 needs to carry out computations involving $\mathbf{A}_{i}^{-1}$. In the following, we show how these can be computed in parallel if $\hat{\mathbf{C}}$ is already a two-layer nested DBBD matrix whose diagonal blocks are DBBD as well. For notational brevity, we denote a diagonal block by $\mathbf{A}$. Then, inverting $\mathbf{A}$ is equivalent to solving the following system:
(34) 
where and , with , , and , .
The columns of the solution matrix are given below:
(35) 
(36) 
Similar to equations (24) and (25), parallel computing can be adopted in calculating (35) and (36). Combining the parallel computation in the second layer with that of solving the first-layer system, a nested parallel computing architecture is illustrated in Fig. 11. Since the diagonal blocks are themselves in DBBD form, the per-block calculations can be split and allocated to a number of parallel processing units. As listed in Table IV, the calculation in step 1 of Table III is divided into six sub-steps. Steps 1.1 and 1.2 are first carried out in the level-3 processing units. The results are fed to the level-2 processing units, which are responsible for performing steps 1.3 and 1.4. Then, steps 1.5 and 1.6 are carried out in the level-3 processing units. Next, similar to the single-layer DNC algorithm, the level-2 processing units calculate the aggregate matrices and vectors, and the results are fed into the level-1 processing unit. Finally, the cut-node subvector is calculated by the level-1 processing unit, and the per-cluster subvectors are calculated by the level-2 processing units.
V-B Optimizing the Computation Time
[Table IV: step, operation, complexity/operation, total number of operations]