Higher aggregation of gNodeBs in Cloud-RAN architectures via parallel computing
In this paper, we address the virtualization and the centralization of real-time network functions, notably in the framework of Cloud RAN (C-RAN). We thoroughly analyze the required fronthaul capacity for the deployment of the proposed C-RAN architecture. We are specifically interested in the performance of the software-based channel coding function. We develop a dynamic multi-threading approach to achieve parallel computing on a multi-core platform. Measurements from an OAI-based testbed show significant latency gains; this makes it possible to increase the distance between the radio elements and the virtualized RAN functions and thus to aggregate more gNodeBs in edge data centers, referred to as Central Offices (COs).
Keywords: NFV, Cloud-RAN, gNodeB, BBU, channel coding, OAI, multi-core, scheduling.
The next generation of mobile networks promises not only broadband communications and very high data rates but also customized and optimized network services for specific vertical markets (e.g., Health, Automotive, Media and Entertainment) [verticals]. 5G mobile networks consider heterogeneous Radio Access Network (RAN) architectures for targeting different types of mobile access (WiFi, cellular femto, small, and macro cells) and for fulfilling service requirements, especially in terms of latency, resilience, coverage, and bandwidth.
With a view to achieving specific end-to-end service performance, the virtualization of network functions is highly desirable to flexibly deploy network services in cloud infrastructures according to customer needs. For example, new RAN architectures aim at virtualizing and centralizing higher-RAN functions in the network while keeping lower-RAN functions in distributed units (close to the antennas). These two nodes, respectively referred to as Central Unit (CU) and Distributed Unit (DU) by the 3GPP, enable flexible and scalable functional splits, which can be adapted to the required network performance. In addition, the collocation of the CU with Mobile/Multi-access Edge Computing facilities opens the door to the realization of low latency services, thus meeting the strict requirements of URLLC (Ultra Reliable Low Latency Communications) identified by the 3GPP [3GPP_uRLLC].
Centralizing RAN functions higher in the network however raises two main issues: low latency processing of radio signals (namely, base-band processing) and high capacity fiber-links in the fronthaul network. These two issues are addressed in the present work.
Given that channel coding processing (i.e., a physical-layer function) is the most demanding in terms of computing resources and also the most sensitive with regard to performance, in particular the robustness of the selected codes against interference, it seems essential to keep this function in the CU. Via resource pooling, it is possible to achieve statistical multiplexing in the utilization of the cores of a multi-core platform and thus to gain economies of scale while guaranteeing deadline compliance in the execution of encoding/decoding functions. In addition, a global view of channel coding for several gNBs enables better radio resource management via the adaptation of coding to predictable interference and also Coordinated Multi-point (CoMP) technologies, i.e., interference reduction and better throughput.
In order to reduce fiber bandwidth requirements, we address in this work a bi-directional intra-PHY functional split which transmits encoded data in the downlink (DL) direction and decoded data in the uplink (UL) direction over Ethernet. In addition, we increase the fronthaul transmission time budget by reducing the execution time of RAN functions.
The basic principle to accelerate RAN functions, i.e., to reduce latency, consists of parallelizing the coding and decoding functions, either on the basis of User Equipments (UEs) or Code Blocks (CBs). The CB is the smallest coding unit that can be individually handled by the coding/decoding function of the RAN. The parallelization principles of channel coding are described in [rodriguez2017towards, rodriguez2017performance, jsac]. In this paper, we go one step further and focus on the implementation of the proposed thread-based models on a Commercial off-the-shelf (COTS) multi-core server by modifying the Open Air Interface (OAI) eNB (an open source solution) [oaiWebSite]. We report performance measurements from this platform by connecting the eNB to a second server supporting an OAI-based core network and by observing the traffic generated by UEs (commercial smartphones), which are connected to the eNB via a USRP card.
We furthermore provide the required fronthaul bandwidth supporting the proposed C-RAN architecture and evaluate the various intra-PHY functional splits currently envisaged by 3GPP [3GPP38_801] and studied by eCPRI [eCPRI] and IEEE [IEEEP1914.3].
The organization of this paper is as follows: In Section II, we review the various functional splits considered in the literature and different solutions for reducing the time necessary to execute RAN functions. In Section III, we evaluate the bandwidth requirements for various functional splits and formulate a recommendation for the best option in our understanding. In Section IV, we describe the implementation of the multi-threading approach for the channel coding function in an OAI open source eNB. Performance results are reported in Section V. Concluding remarks are presented in Section VI.
II Related work
Forthcoming 5G standards consider the coexistence of several functional splits of C-RAN architectures. For instance, the 3GPP proposes eight options for splitting the E-UTRAN protocol at different levels. Each functional split meets the requirements of specific services. The most ambitious one (namely, the PHY-RF split which corresponds to option of 3GPP TR 38.801 Standard [3GPP38_801]) aims at a high level of centralization and coordination and enables efficient resource management of both radio (e.g., pooling of physical resources, CoMP technologies) and cloud resources (e.g., statistical multiplexing of the computing capacity). However, this configuration (here referred to as FS-I) brings some deployment issues, notably tight latency and high bandwidth requirements on fronthaul links.
In fact, there is an open debate concerning the adoption of the most appropriate fronthaul transmission protocol over fiber. The problem lies not only in the constant bit rate of the currently used Common Public Radio Interface (CPRI) protocol [duan2016performance] but also in the high redundancy present in the transmitted I/Q signals. Many efforts are currently being devoted to reducing optical-fiber resource consumption, such as I/Q compression [guo2013lte], non-linear quantization, and sampling rate reduction, among others. Upcoming CPRI variants, notably those proposed by Ericsson et al. [eCPRI], perform CPRI packetization via IP or Ethernet. A similar approach, Radio over Ethernet (RoE), is being defined by the IEEE Next Generation fronthaul Interface (1914) working group [IEEEP1914.3]. It specifies the encapsulation of digitized radio I/Q payload for both control and user data. The xRAN Forum, which gathers industry players and network operators, is also producing an open specification for the fronthaul interface [xRAN]. It considers intra-PHY splitting as defined by 3GPP in TR 38.801 [3GPP38_801]. A detailed fronthaul capacity analysis is presented in Section III.
While numerous fronthaul solutions are being standardized, less attention is paid by industry and academia to the runtime latency of virtualized RAN functions. First studies concerning the computing performance are presented in [nikaein2015processing]. The authors compare the processing time of different virtualized environments, namely Linux Containers (LXC), Docker and Kernel-based Virtual Machine (KVM); however, the number of concurrent threads/cores per eNB is limited to since parallel processing of intra-sub-frame is not performed, i.e., a single-core is dedicated to the whole processing of an LTE sub-frame.
On the contrary, the multi-threading model presented in [rodriguez2017performance] performs data parallelism at a finer granularity, which enables an important latency reduction. The authors carry out an in-depth analysis of the workload and data structures handled during the base-band processing of the radio signals for both the DL [rodriguez2017vnf] and UL [rodriguez2017towards] directions. Two multi-threading solutions are then proposed for the channel coding function, which is the most resource consuming. First, the sub-frame data is decomposed into smaller data structures, so-called Transport Blocks (TBs), which can be executed in parallel. A TB corresponds to the data of a single UE scheduled within one millisecond. It turns out that the runtime of the channel coding function (i.e., encoding in the DL and decoding in the UL) is directly proportional to the Transport Block Size (TBS). A finer breaking of sub-frames is also presented; it considers the execution of CBs in parallel. A CB is the smallest data unit that can be individually treated by the channel coding function. The behavior of both parallelism by UEs and parallelism by CBs is evaluated by simulation and presents a gain up to . In Section IV, we describe an implementation of the proposed schemes and give the performance results in Section V.
III Fronthaul capacity
III-A Problem formulation
One of the main issues of Cloud-RAN (C-RAN) is the fiber bandwidth required to transmit base band signals between the BBU-pool (namely, CU) and each antenna (namely, DU or RRH). The fronthaul capacity is determined by the number of base band units (one per gNB) hosted in the data center at the edge of the network (referred to as CO). The protocol currently most widely used for data transmission between antennas and BBUs is CPRI, which transmits I/Q signals. The transmission rate is constant since CPRI is a serial Constant Bit Rate (CBR) interface. It is therefore independent of the mobile network load [duan2016performance]. Several functional splits of the physical layer can then be analyzed in order to save fiber bandwidth [wubben2014benefits, duan2016performance]. The required fronthaul capacity for the various functional splits is presented below.
III-B Required capacity per functional split
We have illustrated in Figure 1 the various functions executed in a classical RAN. For the downlink direction, IP data packets are first segmented by the PDCP and RLC layers. Then, the MAC layer determines the structure of the subframes (of 1 ms in LTE) forming frames of 10 ms to be transmitted to UEs. Once the MAC layer has fixed the allocation of Physical Resource Blocks (PRBs) for the UEs, information is coded in the form of CBs. Then, the remaining L1 functions are executed on the encoded data for their transmission (modulation, Fourier transform, giving rise to I/Q signals). In the uplink direction, the functions are executed in reverse order.
The functional split actually defines the centralization level of RAN functions in the cloud-platform, i.e., it determines which functions are processed in dedicated hardware close to the antennas (DU) and which are moved higher in the network to be executed in centralized data centers (CU).
The required fronthaul capacity significantly decreases when the functional split is shifted after the PHY layer or even after the MAC layer [wubben2014benefits]. It is worth noting that new RAN implementations consider the coexistence of configurable functional splits where each of them is tailored to the requirements of a specific service or to a network slice. For instance, URLLC expects a one-millisecond round-trip latency between the UE and the gNB while eMBB requires only milliseconds.
In the following, we shall pay special attention to C-RAN supporting full centralization. The required fronthaul capacity (given in Mbps) for all intra-PHY functional splits, (denoted, for short, by , ranging from 1 to 8) is presented in Table I as a function of the cell bandwidth (given in MHz).
Functional Split I
The fully centralized architecture (Option according to 3GPP), referred to in this paper as FS-I, only keeps in the DU the down-converter, filters and the Analog-to-Digital Converter (ADC). I/Q signals are transmitted from and to the CU by using the CPRI standard. The problem of FS-I is that the required fronthaul capacity does not depend on the traffic in a cell but on the cell bandwidth. The required data rate per radio element is given by
where the various variables are defined in Table II.
|Parameter|Symbol|Value|
|useful bandwidth||MHz (MHz)|
|nominal chip rate||MHz|
|sampling frequency||e.g., MHz (MHz)|
|code rate||e.g., [lopez2011optimization]|
|number of bits per sample|||
|number of antennas for MIMO||e.g.,|
|number of FFT samples per OFDM symbol||e.g., (MHz)|
|total number of resource blocks per subframe||e.g., (MHz)|
|total number of sub-carriers per subframe||e.g., (MHz)|
|number of sub-carriers per resource block|||
|number of symbols per time slot||(normal CP)|
|number of symbols per subframe||(normal CP)|
|RBs utilization (mean cell-load)|||
|data rate when using the x-th functional split|||
|average duration of a cyclic prefix||s (normal CP)|
|symbol duration||s (normal CP)|
|useful data duration per time slot||s (normal CP)|
In Table I, we have computed the fronthaul capacity as a function of the cell bandwidth. As the cell bandwidth increases, the required fronthaul capacity per sector can reach GBit/s. Since each site is generally equipped with three sectors, the required bandwidth reaches prohibitive values for this functional split.
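As a rough numerical check of the FS-I order of magnitude, the sketch below evaluates the commonly quoted CPRI line-rate expression: one I/Q pair per sample, times the sampling frequency, the sample width and the number of antennas, plus the control-word (16/15) and 8b/10b line-coding (10/8) overheads. The function name, the 15-bit sample width and the two-antenna default are assumptions made for illustration; they are not values recovered from Table II, and the expression is not claimed to be term-by-term identical to the one used in this paper.

```python
def cpri_rate_mbps(fs_mhz, bits_per_sample=15, n_antennas=2,
                   control_overhead=16 / 15, line_coding=10 / 8):
    """Constant CPRI bit rate in Mbps for one sector (assumed expression)."""
    iq_components = 2  # one in-phase and one quadrature value per sample
    return (iq_components * fs_mhz * bits_per_sample * n_antennas
            * control_overhead * line_coding)


# Standard LTE sampling frequencies (MHz), indexed by cell bandwidth (MHz)
LTE_SAMPLING_MHZ = {1.4: 1.92, 3: 3.84, 5: 7.68, 10: 15.36, 15: 23.04, 20: 30.72}

for bw, fs in LTE_SAMPLING_MHZ.items():
    # The rate depends only on the cell bandwidth, never on the traffic load
    print(f"{bw:>4} MHz cell -> {cpri_rate_mbps(fs):7.1f} Mbps per sector")
```

Under these assumptions a 20 MHz cell with two antennas needs about 2.46 Gbps per sector, which illustrates why a three-sector site quickly becomes expensive to connect.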
Functional Split II
When implementing the Cyclic Prefix (CP) removal in the DU, the fronthaul capacity can be reduced. This solution may suffer from correlation problems due to the appearance of Inter-Symbol Interference (ISI). The required data rate for this functional split is given by
where is the average duration of a CP in a radio symbol. microseconds. . The useful data duration in a radio slot ( microseconds) is given by microseconds. The resulting fronthaul capacity is given in Table I. The reduction in bandwidth requirement is rather small when compared with FS-I.
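The small gain of CP removal can be illustrated with the standard LTE normal-CP numerology: a 0.5 ms slot carries 7 symbols, whose prefixes last 160 samples (first symbol) and 144 samples (other symbols) at 30.72 MHz. The sketch below is only an illustration under these assumptions; the constant and function names are hypothetical, and scaling the FS-I rate by the resulting ratio is an assumed reading of the FS-II expression.

```python
SLOT_US = 500.0            # LTE slot duration in microseconds
SYMBOLS_PER_SLOT = 7       # normal cyclic prefix
FIRST_CP_US = 160 / 30.72  # first symbol of a slot, about 5.21 us
OTHER_CP_US = 144 / 30.72  # remaining six symbols, about 4.69 us


def cp_removal_factor():
    """Return (mean CP duration, useful time per slot, useful fraction)."""
    total_cp = FIRST_CP_US + (SYMBOLS_PER_SLOT - 1) * OTHER_CP_US
    mean_cp = total_cp / SYMBOLS_PER_SLOT
    useful = SLOT_US - total_cp
    return mean_cp, useful, useful / SLOT_US


mean_cp, useful, fraction = cp_removal_factor()
print(f"mean CP = {mean_cp:.2f} us, useful = {useful:.2f} us, "
      f"fraction kept = {fraction:.3f}")
```

With these values only about 7% of the time-domain samples are prefix samples, which is consistent with the modest reduction reported for this split.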
Functional Split III
By keeping the Fast Fourier Transform (FFT) function close to the antennas, the required fronthaul capacity can be considerably reduced. In this case, radio signals are transmitted in the frequency domain from radio elements to the CU for the uplink and vice versa for the downlink. This solution avoids the overhead introduced when sampling the time domain signal. The oversampling factor is given by , e.g., for an LTE bandwidth of MHz. The corresponding fronthaul bit rate is then given by
As illustrated in Table I, the fronthaul capacity is halved when compared with the initial CPRI solution. In the following, we show that the fronthaul capacity can still be reduced by a factor .
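To make the oversampling argument concrete, the sketch below computes the ratio between the FFT size and the number of occupied sub-carriers for the standard LTE bandwidths. The numerology table uses the usual LTE values (FFT size and number of resource blocks per bandwidth, 12 sub-carriers per resource block); the dictionary and function names are hypothetical and the code is an illustration, not the expression used in this paper.

```python
# Cell bandwidth (MHz) -> (FFT size, number of resource blocks), standard LTE
LTE_NUMEROLOGY = {
    1.4: (128, 6), 3: (256, 15), 5: (512, 25),
    10: (1024, 50), 15: (1536, 75), 20: (2048, 100),
}
SUBCARRIERS_PER_RB = 12


def oversampling_factor(bw_mhz):
    """Ratio of time-domain samples to occupied sub-carriers per OFDM symbol."""
    n_fft, n_rb = LTE_NUMEROLOGY[bw_mhz]
    return n_fft / (n_rb * SUBCARRIERS_PER_RB)


for bw in LTE_NUMEROLOGY:
    print(f"{bw:>4} MHz: oversampling factor = {oversampling_factor(bw):.3f}")
```

For a 20 MHz cell, 2048 samples carry 1200 sub-carriers, i.e., a factor of roughly 1.7; combined with CP removal this is in line with the capacity being approximately halved.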
Functional Split IV
When including the de-mapping process in the DU, it is possible to adapt the bandwidth as a function of the traffic load in the cell; the required fronthaul capacity is then directly given by the fraction of utilized radio resources [wubben2014benefits].
Here, only the Resource Blocks (RBs) that carry information are transmitted. When considering the behavior of currently deployed eNBs of the Orange mobile network serving a high-density zone (e.g., a train station), an eNB presents on average an RB utilization of and in the uplink and downlink directions, respectively. The highest utilization values observed in currently deployed Orange eNBs do not exceed in the downlink direction. Thus, if we take as the mean cell-load (worst-case), this yields a fronthaul bit rate equal to
Numerical values given in Table I show that the gain with respect to the previous solution is however rather small.
Functional Split V
This configuration presents a gain in the fronthaul load when Multiple Input Multiple Output (MIMO) schemes are used. The equalization function combines signals coming from multiple antennas; as a consequence, the required fronthaul capacity is divided by . The required fronthaul capacity is then given by
Table I shows a drop by a factor with respect to the previous solution.
Functional Split VI
By keeping the demodulation/modulation function near to antennas, the required data rate is given by
where is the number of symbols per subframe (i.e., when using normal cyclic prefix) and is the modulation order, i.e., the number of bits per symbol. Taking the highest modulation order currently supported in the deployed networks, i.e., , the required fronthaul capacity is reduced to Mbps. This represents a significant gain when compared to the initial CPRI (FS-I) solution. It is also worth noting that this solution preserves the gain achievable by C-RAN.
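The order of magnitude of the FS-VI rate can be checked by counting hard bits per subframe: occupied sub-carriers times symbols per subframe times bits per modulation symbol, scaled by the cell load. The sketch below is a minimal illustration under assumed defaults (20 MHz cell with 1200 sub-carriers, 14 symbols per subframe, 64-QAM, a single spatial layer, no control overhead); the function and parameter names are hypothetical and the expression may differ from the one used in this paper.

```python
def fs6_rate_mbps(n_subcarriers=1200, symbols_per_subframe=14,
                  bits_per_symbol=6, load=1.0, subframe_ms=1.0):
    """Bit rate in Mbps when only modulated (hard) bits cross the fronthaul."""
    bits_per_subframe = (n_subcarriers * symbols_per_subframe
                         * bits_per_symbol * load)
    # bits per millisecond divided by 1e3 gives Mbps
    return bits_per_subframe / (subframe_ms * 1e3)


print(f"full load: {fs6_rate_mbps():.1f} Mbps")
print(f"half load: {fs6_rate_mbps(load=0.5):.1f} Mbps")
```

Even at full load the result is around 100 Mbps under these assumptions, more than an order of magnitude below a multi-Gbps CPRI link for the same cell.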
Functional Split VII
Just for the sake of completeness, we now consider the case where the channel coding function is kept close to the antennas, so that redundancy bits are not transmitted. Nevertheless, this configuration reduces the advantages of C-RAN: DUs become more complex and expensive. The required fronthaul capacity is
where is the code rate, i.e., the ratio between the useful information and the transmitted information including redundancy. In LTE, the code rate commonly ranges from to [lopez2011optimization]. In Table II, we use as the worst-case.
III-C Functional split selection
In view of the analysis carried out in the previous section, functional split VI seems to be the most appropriate. It is then necessary to encapsulate the fronthaul payload within Ethernet, i.e., distributed units are connected to the centralized ones through an Ethernet network. RoE is considered by IEEE Next Generation fronthaul Interface (1914) Working Group as well as by the xRAN fronthaul Working Group of the xRAN Forum.
The main issue of an Ethernet-based fronthaul is the latency fluctuation [chih2015rethink]. Transport jitter can be isolated by a buffer; however, the maximum transmission time is constrained by the processing time of centralized functions. The sum of both transmission and processing time must meet RAN requirements (i.e., ms for DL and ms for UL).
The transmission time can quickly rise due to the distance and the added latency at each hop (e.g., switches) in the network. The transmission time can be roughly obtained from the speed of light in the optical fiber (e.g., x m/s), and a latency of s per hop [chih2015rethink]. For instance, the required transmission time for a gNB located km from the CO rises ss. Hence, the remaining time budget for BBU processing is barely s in the downlink direction. The proof-of-concept of C-RAN acceleration for supporting FS-VI is described below.
IV Testbed and parallelization of coding functions
IV-A Testbed description
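The time-budget reasoning above can be sketched as a simple calculation: propagation delay over the fiber plus a fixed latency per traversed hop, subtracted from the deadline. All numerical values in the example call (distance, hop count, per-hop latency, deadline) are hypothetical placeholders and not the figures of this paper; the fiber speed of about 2 x 10^8 m/s is the usual approximation and is also an assumption here, as is the choice to count a single one-way transfer.

```python
FIBER_SPEED_M_PER_S = 2.0e8  # approximate speed of light in fiber (assumption)


def one_way_transmission_us(distance_km, hops, per_hop_latency_us):
    """Propagation delay plus per-hop latency, in microseconds."""
    propagation_us = distance_km * 1e3 / FIBER_SPEED_M_PER_S * 1e6
    return propagation_us + hops * per_hop_latency_us


def processing_budget_us(deadline_us, distance_km, hops, per_hop_latency_us):
    """Time left for centralized processing once transmission is accounted for."""
    return deadline_us - one_way_transmission_us(distance_km, hops,
                                                 per_hop_latency_us)


# Hypothetical example: 15 km, 3 switches at 10 us each, 1000 us deadline
print(one_way_transmission_us(15, 3, 10.0))
print(processing_budget_us(1000.0, 15, 3, 10.0))
```

Every microsecond saved in the coding functions can be traded for about 200 m of additional fiber under this model, which is the link between processing acceleration and a higher aggregation of gNBs.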
To evaluate the proposed C-RAN acceleration method and notably the gain in terms of latency when parallelizing the channel coding, we have set up a testbed basically composed of servers (COTS PCs), one supporting the OAI EPC (MME, HSS, SPGW) and another equipped with a USRP card and implementing a modified version of the OAI eNB software. This testbed is illustrated in Figure 2.
The platform implements a pool of threads, which perform the parallel processing of both encoding (downlink) and decoding (uplink) functions on a multi-core server. The workload of threads is managed by a global non-preemptive scheduler (so-called thread manager); a thread is assigned to a dedicated single core with real-time OS priority and is executed until completion without interruption. The isolation of threads is provided by a specific configuration performed in the OS, which prevents the use of channel coding computing resources for any other job.
The goal of our modification of OAI code is to perform massive parallelization of channel encoding and decoding processes. These functions are detailed below, before presenting the multi-threading mechanism and the scheduling algorithm.
The encoder (see Figure 3 for an illustration) consists of 2 Recursive Systematic Convolutional (RSC) codes separated by an interleaver. Before encoding, data (i.e., a subframe) are conditioned and segmented in code blocks of size , which can be encoded in parallel. When the multi-threading model is not implemented, CBs are executed in series under a First In First Out (FIFO) discipline. Thus, an incoming data block is encoded twice, where the second encoder is preceded by the permutation procedure (interleaver). The encoded block of size constitutes the information to be transmitted in the downlink direction. Hence, for each information bit two parity bits are added, i.e., the resulting code rate is given by . With the aim of reducing the channel coding overhead, a puncturing procedure may be activated for periodically deleting bits. A multiplexer is finally employed to form the encoded block to be transmitted. The multiplexer is nothing but a parallel-to-serial converter which concatenates the systematic output , and both recursive convolutional encoded output sequences, , and .
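The encoder structure described above (two RSC encoders, an interleaver in front of the second one, and a multiplexer) can be sketched in a few lines. This is a simplified illustration and not the OAI implementation: it assumes the LTE constituent code (feedback polynomial 1 + D^2 + D^3, feedforward polynomial 1 + D + D^3), uses a quadratic permutation interleaver whose default parameters (f1 = 3, f2 = 10) are only valid for 40-bit blocks, and omits trellis termination and puncturing. All function names are hypothetical.

```python
def rsc_parity(bits):
    """Parity stream of one RSC encoder (feedback 1+D^2+D^3, output 1+D+D^3)."""
    s1 = s2 = s3 = 0  # shift register, s1 holds the most recent value
    parity = []
    for u in bits:
        a = u ^ s2 ^ s3             # recursive (feedback) bit
        parity.append(a ^ s1 ^ s3)  # feedforward taps
        s1, s2, s3 = a, s1, s2      # shift the register
    return parity


def qpp_interleave(bits, f1=3, f2=10):
    """Quadratic permutation interleaver; defaults assume len(bits) == 40."""
    k = len(bits)
    return [bits[(f1 * i + f2 * i * i) % k] for i in range(k)]


def turbo_encode(bits):
    """Rate-1/3 encoding: systematic bit followed by two parity bits."""
    systematic = list(bits)
    parity1 = rsc_parity(systematic)
    parity2 = rsc_parity(qpp_interleave(systematic))
    encoded = []
    # Multiplexer: parallel-to-serial conversion of the three streams
    for x, z1, z2 in zip(systematic, parity1, parity2):
        encoded += [x, z1, z2]
    return encoded


block = [1, 0, 1, 1, 0] * 8   # one 40-bit code block
coded = turbo_encode(block)
print(len(block), "->", len(coded), "bits")
```

Since each code block is encoded independently of the others, calls to `turbo_encode` on different blocks share no state, which is what makes per-CB parallelism possible.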
Unlike encoding, the decoding function is iterative and works with soft bits (real and not binary values). Real values represent the Log-Likelihood Ratio (LLR), i.e., the ratio of the probability that a particular bit was 1 to the probability that the same bit was 0 (the logarithm is used for better precision).
The decoding function runs as follows: received data is first de-multiplexed in , , and , which correspond to the systematic information bits of -th code block and to the received parity bits and , respectively.
and feed the first decoder, which calculates the LLR (namely, extrinsic information) and passes it to the second decoder. The second decoder uses that value to calculate its LLR and feeds it back to the first decoder after a de-interleaving process. Hence, the second decoder has three inputs: the extrinsic information (reliability value) from the first decoder, the interleaved received systematic information , and the received parity bit values . See Figure 4 for an illustration.
The decoding procedure iterates until either the final solution is obtained or the allowed maximum number of iterations is reached. At termination, the final decision (i.e., or decision) is taken to obtain the decoded data block . The data block is either successfully decoded or not. The stopping criterion corresponds to the average mutual information of the LLR; if it converges, the decoding process may terminate earlier. Note that there is a trade-off between the runtime (i.e., number of iterations) and the successful decoding of a data block.
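As a small illustration of the soft-bit representation, the sketch below computes an LLR from the probability that a bit equals 1, following the sign convention of the text (positive values favor 1), and derives the corresponding hard decision. The function names are hypothetical.

```python
import math


def llr(p_one):
    """Log-likelihood ratio log(P(bit = 1) / P(bit = 0)), for 0 < p_one < 1."""
    return math.log(p_one / (1.0 - p_one))


def hard_decision(llr_value):
    """Map a soft value back to a bit: positive favors 1, otherwise 0."""
    return 1 if llr_value > 0 else 0


for p in (0.1, 0.5, 0.9):
    print(f"P(1) = {p}: LLR = {llr(p):+.3f}, decision = {hard_decision(llr(p))}")
```

The magnitude of the LLR expresses the reliability of the decision, which is the quantity the two decoders exchange and refine at each iteration.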
On the basis of massive parallel programming, we propose splitting the channel encoding and decoding functions into multiple parallel runnable jobs. The main goal is to improve their performance in terms of latency.
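The number of parallel jobs obtained from one transport block can be estimated with the LTE code block segmentation rule of 3GPP TS 36.212: a block larger than 6144 bits is split into code blocks, each carrying an additional 24-bit CRC. The sketch below only counts the resulting code blocks (it does not compute their exact sizes or filler bits); the function name is hypothetical.

```python
import math

MAX_CB_BITS = 6144   # largest LTE turbo code block
CB_CRC_BITS = 24     # CRC attached to each CB when segmentation occurs


def count_code_blocks(tb_bits_with_crc):
    """Number of code blocks, i.e., of jobs available for per-CB parallelism."""
    if tb_bits_with_crc <= MAX_CB_BITS:
        return 1  # no segmentation, no extra CRC
    return math.ceil(tb_bits_with_crc / (MAX_CB_BITS - CB_CRC_BITS))


for tb in (1000, 6144, 6145, 75400):
    print(f"{tb:>6} bits -> {count_code_blocks(tb)} code block(s)")
```

Large transport blocks therefore yield up to about a dozen independent jobs, whereas small ones cannot be split and only benefit from parallelism across UEs.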
In order to deal with the various parallel runnable jobs, we implement a thread-pool, i.e., a multi-threading environment. A dedicated core is affected to each thread during the channel coding processing. When the number of runnable jobs exceeds the number of free threads, jobs are queued.
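The thread-pool principle can be sketched with Python's standard `concurrent.futures` module: a fixed number of worker threads is created once, jobs are submitted to the pool, and jobs in excess of the free threads wait in the pool's internal queue. This only illustrates the structure. The testbed relies on POSIX threads pinned to dedicated cores, whereas CPython threads do not run CPU-bound code in parallel. The per-block routine `encode_code_block` is a hypothetical placeholder (a bit-repetition code), not the real channel coder.

```python
from concurrent.futures import ThreadPoolExecutor


def encode_code_block(cb):
    """Placeholder per-CB job (hypothetical): repeats every bit three times."""
    return [b for bit in cb for b in (bit, bit, bit)]


def encode_subframe(code_blocks, n_threads=4):
    """Encode all code blocks of a subframe with a pool of n_threads workers."""
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        # map() keeps the input order; surplus jobs wait until a thread is free
        return list(pool.map(encode_code_block, code_blocks))


code_blocks = [[i % 2, 1, 0] for i in range(10)]  # 10 jobs for 4 threads
encoded = encode_subframe(code_blocks)
print(len(encoded), "code blocks encoded")
```

Creating the pool once and reusing its threads avoids paying thread creation costs inside the one-millisecond processing window.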
To achieve low latency, we implement multi-threading within a single process instead of multitasking across different processes (namely, multi-programming). In a real-time system, creating a new process on the fly becomes extremely expensive because all data structures must be allocated and initialized. In addition, in a multi-programming \glsplIPC go through the \glsOS, which produces system calls and context switching overhead.
When a multi-threaded (namely, POSIX [butenhof1997programming]) process runs the encoding and decoding functions, other processes cannot access the resources (namely, data space, heap space, program instructions) that are reserved for channel coding processing.
The memory space is shared among all threads belonging to the channel coding process, which reduces latency. Each thread performs the whole encoding or decoding flow of a single CCDU. We define a CCDU as the sequence of bits that corresponds to a radio sub-frame (no parallelism), a TB, or even a CB. When performing parallelism, CCDUs arrive in batches every millisecond. These data units are appended to a single queue (see Algorithm 1), which is managed by a global scheduler. We use non-preemptive scheduling, i.e., a thread (CCDU) is assigned to a single dedicated core with real-time OS priority and is executed until completion without interruption.
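The three CCDU granularities can be illustrated with the following sketch. It is not taken from the testbed code: the function name, the dictionary layout, and the assumption of one TB per UE and per sub-frame are ours.

```python
from typing import Dict, List


def build_ccdus(subframe: Dict[str, List[bytes]], granularity: str) -> List[List[bytes]]:
    """Split one sub-frame into channel coding data units (CCDUs).

    `subframe` maps a UE identifier to the code blocks of its transport block
    (assumption: one TB per UE and per sub-frame). Each returned CCDU is the
    list of code blocks that a single thread would process.
    """
    if granularity == "subframe":   # no parallelism: one job for everything
        return [[cb for cbs in subframe.values() for cb in cbs]]
    if granularity == "tb":         # one job per transport block
        return [list(cbs) for cbs in subframe.values()]
    if granularity == "cb":         # finest granularity: one job per code block
        return [[cb] for cbs in subframe.values() for cb in cbs]
    raise ValueError(f"unknown granularity: {granularity}")
```

With two UEs carrying two and one code blocks respectively, the three settings yield one, two, and three jobs per sub-frame.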
Isolation of threads is not provided by the POSIX API; hence, a specific configuration has been set up in the OS to prevent channel coding computing resources from being used by any other job. The global scheduler (i.e., the thread manager) itself runs within a dedicated thread and applies a FIFO discipline when allocating cores to the CC jobs waiting in the queue. Figure 5 illustrates the cores dedicated to channel coding processing; the remaining cores are shared among all processes running in the system, including those belonging to the upper layers of the gNB.
IV-C Queuing principles
The CCDU queue is a linked list containing pointers to the first and last elements, the current number of CCDUs in the queue, and the mutex (namely, mutual exclusion) signals for managing shared memory. The mutex mechanism synchronizes access to the memory space when more than one thread needs to write at the same time. In order to reduce waiting times, we perform data context isolation per channel coding operation, i.e., dedicated CC threads do not access any global variable of the gNB (referred to as ‘soft-modem’ in OAI).
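A queue with these four ingredients (first pointer, last pointer, counter, mutex) can be sketched as follows. The class and attribute names are hypothetical and the sketch is not the OAI data structure; it only illustrates how the mutex serializes concurrent writers.

```python
import threading


class _Node:
    def __init__(self, ccdu):
        self.ccdu = ccdu
        self.next = None


class CcduQueue:
    """Singly linked FIFO with first/last pointers, a counter and a mutex."""

    def __init__(self):
        self.first = None
        self.last = None
        self.count = 0
        self._mutex = threading.Lock()

    def push(self, ccdu):
        node = _Node(ccdu)
        with self._mutex:            # only one writer at a time
            if self.last is None:
                self.first = node
            else:
                self.last.next = node
            self.last = node
            self.count += 1

    def pop(self):
        """Return the oldest CCDU, or None if the queue is empty."""
        with self._mutex:
            node = self.first
            if node is None:
                return None
            self.first = node.next
            if self.first is None:   # queue became empty
                self.last = None
            self.count -= 1
            return node.ccdu
```

Keeping both end pointers makes `push` and `pop` constant-time, so the mutex is held only briefly.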
The scheduler takes the next CCDU to be processed from the queue and updates the job counter (i.e., decrements the number of CCDUs remaining to be processed). The next free core executes the first job in the queue.
In the case of a decoding failure, the scheduler purges all CCDUs belonging to the same UE (TB). Indeed, a TB can be successfully decoded only when all of its CBs have been individually decoded (see Algorithm 2).
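The purge step can be sketched as below. This is an illustration rather than a transcription of Algorithm 2: the job layout (a dictionary with a `tb_id` key) and the function name are hypothetical, and the caller is assumed to hold the queue mutex while purging.

```python
from collections import deque


def purge_transport_block(pending: deque, failed_tb_id) -> int:
    """Drop every queued code block of the transport block that failed.

    Returns the number of jobs removed. The FIFO order of the remaining
    jobs is preserved.
    """
    kept = [job for job in pending if job["tb_id"] != failed_tb_id]
    removed = len(pending) - len(kept)
    pending.clear()
    pending.extend(kept)
    return removed
```

Purging avoids spending core time on code blocks whose transport block will have to be retransmitted anyway.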
Channel coding variables are embedded in a permanent data structure to create an isolated context per channel coding operation; in this way, CC threads do not access any memory variable of the main soft-modem (gNB). The data context is passed between threads by pointer.
IV-D Performance captor
In order to evaluate the performance of multi-threading, we have implemented a performance captor that records key timestamps during channel coding processing in the uplink and downlink directions. To minimize measurement overhead, data is collected by a separate process, the so-called measurements collector, which runs outside the real-time domain.
The data transfer between the two processes, i.e., the performance captor and the measurements collector, is performed via an OS-based pipe (also referred to as a named pipe or FIFO pipe, because the order of bytes coming in is the same as the order of bytes going out [bovet2005understanding]).
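The named-pipe transfer can be illustrated with the sketch below, which assumes a POSIX system. The record layout (a job identifier plus a timestamp), the file name, and the function names are hypothetical; for self-containment the collector runs in a thread here, whereas in the setup described above it is a separate process.

```python
import os
import struct
import tempfile
import threading

# Hypothetical record: job id (uint32) and timestamp in seconds (float64).
RECORD = struct.Struct("<Id")


def collect(path, out):
    """Measurements collector: read fixed-size records until the writer closes."""
    with open(path, "rb") as pipe:
        while True:
            chunk = pipe.read(RECORD.size)
            if len(chunk) < RECORD.size:   # end of stream
                break
            out.append(RECORD.unpack(chunk))


def demo(samples):
    """Send samples through a named pipe and return what the collector read."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "cc_measurements.fifo")
        os.mkfifo(path)                    # create the POSIX named pipe
        received = []
        reader = threading.Thread(target=collect, args=(path, received))
        reader.start()
        with open(path, "wb") as pipe:     # blocks until the reader has opened
            for job_id, timestamp in samples:
                pipe.write(RECORD.pack(job_id, timestamp))
        reader.join()
        return received
```

Because the pipe preserves byte order, fixed-size records can be parsed on the reading side without any framing beyond the record length.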
Timestamps are taken at several instants in order to obtain the following KPIs:
Pre-processing delay, which includes data conditioning, i.e., code block creation, before triggering the channel coding itself.
Channel coding delay, which measures the runtime of the encoder (decoder) process in the downlink (uplink).
Post-processing delay, including the combination of CBs.
Collected traces contain various performance indicators, such as the number of iterations carried out by the decoder per CB and the identifiers of the cores assigned to the encoding and decoding processes. A decoding failure is detected when the registered number of iterations exceeds the maximum allowed number of iterations. As a consequence, the loss rate of channel coding processes as well as the individual workload of each core can be easily obtained.
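As an illustration of how these KPIs follow from the collected traces, consider the sketch below. The trace layout and all field names are hypothetical; only the definitions of the three delays and of a decoding failure are taken from the text above.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Trace:
    """Timestamps (in microseconds) and decoder statistics of one job."""
    t_start: float     # job handed to the channel coding function
    t_cc_begin: float  # encoder/decoder starts
    t_cc_end: float    # encoder/decoder returns
    t_end: float       # post-processing finished
    iterations: int    # decoder iterations registered for this job
    core: int          # identifier of the core that ran the job


def delays(trace: Trace) -> Dict[str, float]:
    """Pre-processing, channel coding and post-processing delays of one job."""
    return {
        "pre": trace.t_cc_begin - trace.t_start,
        "cc": trace.t_cc_end - trace.t_cc_begin,
        "post": trace.t_end - trace.t_cc_end,
    }


def loss_rate(traces: List[Trace], max_iterations: int) -> float:
    """Fraction of jobs whose iteration count exceeds the allowed maximum."""
    if not traces:
        return 0.0
    failures = sum(1 for t in traces if t.iterations > max_iterations)
    return failures / len(traces)


def busy_time_per_core(traces: List[Trace]) -> Dict[int, float]:
    """Total channel coding time accumulated on each core."""
    load: Dict[int, float] = {}
    for t in traces:
        load[t.core] = load.get(t.core, 0.0) + (t.t_cc_end - t.t_cc_begin)
    return load
```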
V Performance results
In order to evaluate the performance of the proposed multi-threading scheme, we use the test-bed described above, which contains a multi-core server hosting the gNB. The various UEs perform file transfers in both the uplink and downlink directions.
The test scenario is configured as follows: number of cells: gNB; transmission mode: FDD; maximum number of RB: ; available physical cores: ; channel coding dedicated cores: ; number of UEs: .
The performance captor takes multiple timestamps in order to evaluate the runtime of the encoder/decoder itself, as well as the whole execution time of the encoding/decoding function, which includes pre- and post-processing delays, e.g., code block creation, segmentation, assembling, and decoder-bit conditioning (log-likelihood). When a given data unit cannot be decoded, i.e., when the maximum number of iterations is reached without success, the data is lost and needs to be retransmitted. This issue is quantified by the KPI referred to as loss rate.
Runtime results are presented in Figures 6 and 7 for the uplink and downlink directions, respectively. The decoding function shows a performance gain of when executing CBs in parallel, i.e., when scheduling jobs at the finest granularity. Beyond the important latency reduction, runtime values present less dispersion when performing parallelism, i.e., runtime values are concentrated around the mean, especially when executing CBs in parallel. This fact is crucial when dimensioning cloud-computing infrastructures, notably data centers hosting virtual network functions with real-time requirements. Given the gap between the maximum runtime values with CB-parallelism and without parallelism, the C-RAN system (also referred to as BBU-pool) may be moved several tens of kilometers higher in the network.
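The link between a runtime saving and the additional fronthaul distance can be made explicit with a back-of-the-envelope calculation. The sketch below assumes a propagation delay of about 5 microseconds per kilometer of optical fiber and assumes that the saved processing time is spent on a round trip over the fronthaul; the function name is hypothetical and the numerical value in the usage note is purely illustrative, not a measurement from this work.

```python
def extra_fronthaul_km(saved_us: float, us_per_km: float = 5.0, round_trip: bool = True) -> float:
    """Additional fronthaul length allowed by a processing-time saving.

    saved_us:   runtime saved by parallel processing, in microseconds
    us_per_km:  one-way propagation delay per kilometer of fiber (assumption)
    round_trip: whether the saving must cover both fronthaul directions
    """
    legs = 2 if round_trip else 1
    return saved_us / (us_per_km * legs)
```

For instance, a hypothetical saving of 400 microseconds gives `extra_fronthaul_km(400.0)`, i.e., 40 km, which is of the order of the several tens of kilometers mentioned above.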
In this work, we have addressed the two main issues of C-RAN architectures, i.e., the high fronthaul capacity required for transmitting radio signals between the radio elements and the Central Office, and the tight latency constraints on the execution of virtualized RAN functions in general-purpose servers.
With the aim of taking advantage of the benefits of C-RAN systems (i.e., spectral efficiency, interference reduction, data rate improvement), we focus on the study of fully centralized RAN architectures, which notably include the channel coding function in the CU. We have thus proposed a bi-directional intra-PHY functional split (namely, FS-VI) for transmitting encoded and decoded data over Ethernet.
To meet low-latency requirements, we have accelerated C-RAN by means of parallel processing of the channel coding function. Concretely, building on various open-source solutions, we implemented an end-to-end virtualized mobile network that notably comprises a virtualized RAN. In particular, the platform implements a thread pool and two scheduling strategies, namely, parallelism by UEs and parallelism by CBs. The parallel processing of both the encoding (downlink) and decoding (uplink) functions is carried out in a multi-core server within a single OS process in order to avoid multi-tasking overhead. Results show important gains in terms of latency, which opens the door to deploying fully centralized cloud-native RAN architectures.