Coded Network Function Virtualization: Fault Tolerance via In-Network Coding
Network Function Virtualization (NFV) prescribes the instantiation of network functions on general-purpose network devices, such as servers and switches. While yielding a more flexible and cost-effective network architecture, NFV is potentially limited by the fact that commercial off-the-shelf hardware is less reliable than the dedicated network elements used in conventional cellular deployments. The typical solution for this problem is to duplicate network functions across geographically distributed hardware in order to ensure diversity. In contrast, this letter proposes to leverage channel coding in order to enhance the robustness of NFV to hardware failure. The proposed approach targets the network function of uplink channel decoding, and builds on the algebraic structure of the encoded data frames in order to perform in-network coding on the signals to be processed at different servers. The key principles underlying the proposed coded NFV approach are presented for a simple embodiment and extensions are discussed. Numerical results demonstrate the potential gains obtained with the proposed scheme as compared to the conventional diversity-based fault-tolerant scheme in terms of error probability.
I Introduction
Network Function Virtualization (NFV) is a novel architectural paradigm for cellular wireless networks that has been put forth within the European Telecommunications Standards Institute (ETSI) with the goal of simplifying network management, update and operation [1]. NFV decouples the Network Functions (NFs), such as baseband processing at the base stations and firewalling or routing at the core network, from the physical network equipment on which they run. This is done by leveraging virtualization technology in order to map NFs into Virtual Network Functions (VNFs) that are instantiated on Commercial Off-The-Shelf (COTS) hardware resources, such as servers, storage devices and switches [2, 3]. NFV enables an adaptive “slicing” of the available network physical resources so as to accommodate different network services, e.g., mobile broadband, machine-type or ultra-reliable communications [4].
A simplified view of the NFV architecture is illustrated in Fig. 1 [5, 2, 4, 3]. The top layer in this architecture describes the logical functionality of the given network service as a so-called forwarding graph, which characterizes the functional relationship among the VNFs that implement the network service. The bottom layer contains the general-purpose hardware appliances that provide storage, computation and networking capabilities. Finally, the intermediate virtualization layer is responsible for mapping VNFs to physical resources. In the example of Fig. 1, VNF1 and VNF2 are instantiated on Virtual Machines (VMs) running on Server 1, while VNF3 and VNF4 are instantiated on VMs running on Server 2 and Server 3, respectively. The combination of the last two layers constitutes the Network Function Virtualization Infrastructure (NFVI), and the three layers are under the control of a Network Functions Virtualization Management and Orchestration (NFV-MANO) block (see [2, 3, 1] for details). Most research activity on NFV focuses on the design of mapping rules between VNFs and hardware resources via the solution of mixed integer problems (see, e.g., [6]).
One of the key challenges for the adoption and deployment of NFV is the fact that COTS hardware is significantly less reliable than the dedicated network devices used in conventional network deployments [3, Sec. VI]. Hardware outages may in fact be caused by random failures, intentional attacks, software malfunction or disasters. This problem is motivating an emerging line of work, also within ETSI, on developing fault-tolerant virtualization strategies for NFV [5, 7, 8]. The typical solution is to adapt to NFV the well-established policies introduced in the context of virtualization for data centers. These strategies are based on overprovisioning and diversity: NFs are split into multiple constituent VNFs, which are then mapped on VMs instantiated on multiple distributed servers in order to minimize the probability of a disruptive failure as well as the mean time to recovery from a failure.
In this work, we propose a novel principle for the design of fault-tolerant NFV that moves from diversity-based solutions to coded solutions. The proposed approach addresses the NF of uplink data decoding in a Cloud Radio Access Network (C-RAN) architecture, in which the baseband processing operations of the base station are carried out remotely at the “cloud” [9]. The focus on uplink channel decoding is dictated by the fact that the latter is known to be among the most demanding baseband functions in terms of computational complexity (see, e.g., [10, 11]).
The proposed coded NFV solution leverages the algebraic structure of the transmitted coded data frames in order to enhance the robustness of channel decoding. To elaborate, assume that there are a number of servers on which VMs carrying out channel decoding can be instantiated, as illustrated in Fig. 2. A conventional diversity-based technique would duplicate the decoding task at multiple servers. In this letter, instead, we propose a coded approach, whereby received data frames are encoded prior to being processed by the VMs that implement decoding at the distributed servers. This letter elaborates on a simple embodiment of this idea, which is illustrated in Fig. 3 and introduced in Sec. III after a description of the system model in Sec. II. Numerical examples are provided in Sec. IV. Extensions and more general applications of the principle of coded NFV are presented in Sec. V, while the concluding remarks are given in Sec. VI. We finally observe that the idea presented here is related to the concepts of coded computations put forth in [12, 13], but, to the best of the authors’ knowledge, the idea of performing coding to robustify the operation of NFV is first proposed here.
II System Model
We consider a C-RAN system implemented by means of NFV, and focus on the implementation of the NF of uplink channel decoding. In this system, as illustrated in Fig. 2, a Remote Radio Head (RRH) is connected to the cloud by means of a fronthaul link. The RRH forwards the received baseband packets to the cloud on the fronthaul link in order to enable channel decoding. We assume the overprovisioning of hardware resources, such that $N$ servers are available in the cloud, each of which can run a VM performing the decoding of a single received frame in the allotted time. More specifically, due to latency constraints, the decoding of $K < N$ received data frames should be carried out on the servers by allocating at most one frame to decode to each of the $N$ servers. As in conventional implementations, we further assume that the VMs implementing channel decoding on Servers $1, \ldots, N$ are managed by a controller VM, which is characterized by lower computational requirements and is instantiated at a server, marked as Server 0 in Fig. 2, that is connected with bidirectional links to Servers $1, \ldots, N$.
Servers $1, \ldots, N$ are distributed strategically across multiple locations throughout the service provider’s network, and are hence assumed to have independent availabilities. In particular, we assume that each one of Servers $1, \ldots, N$ fails independently with probability $q$. It is emphasized that a failure here means that a server is not available to perform the given task within an acceptable deadline due to software or hardware issues (see Sec. I).
The transmitted frames are encoded with the same linear $(n, k)$ code with a given rate $r = k/n$, such as a convolutional, turbo or LDPC code. Furthermore, in order to present the key ideas, we consider first a Binary Symmetric Channel (BSC) model between the user under study and the RRH. As a result, for each of the $K$ transmitted frames, indexed by $j = 1, \ldots, K$, the transmitted signal can be written as $\mathbf{x}_j = \mathbf{u}_j \mathbf{G}$, where the $1 \times k$ binary vector $\mathbf{u}_j$ represents the data encoded in the $j$th frame and $\mathbf{G}$ is the $k \times n$ generator matrix of the code. Furthermore, the signal received for the $j$th frame is given as $\mathbf{y}_j = \mathbf{x}_j \oplus \mathbf{z}_j$, where $\oplus$ denotes the bitwise XOR and $\mathbf{z}_j$ is a vector of independent Bernoulli variables with probability $p$ of being equal to 1. Generalizations of the system model are discussed in Sec. V.
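As a minimal sketch of this model, the following Python snippet encodes two frames with a linear code and passes them through a BSC. The systematic (7,4) Hamming generator matrix, the value of the BSC parameter and all variable names are illustrative assumptions, not choices made in this letter.

```python
import numpy as np

# Assumed toy code: systematic (7,4) Hamming generator matrix (rate 4/7).
G = np.array([
    [1, 0, 0, 0, 1, 1, 0],
    [0, 1, 0, 0, 1, 0, 1],
    [0, 0, 1, 0, 0, 1, 1],
    [0, 0, 0, 1, 1, 1, 1],
], dtype=np.uint8)

def encode(u, G):
    """Encode the rows of u (one message per row) as u G over GF(2)."""
    return (u @ G) % 2

def bsc(x, p, rng):
    """Flip each bit of x independently with probability p; also return the noise."""
    z = (rng.random(x.shape) < p).astype(np.uint8)
    return x ^ z, z

rng = np.random.default_rng(0)
K = 2                                                    # number of frames (assumed)
u = rng.integers(0, 2, size=(K, G.shape[0]), dtype=np.uint8)
x = encode(u, G)                                         # transmitted frames
y, z = bsc(x, p=0.05, rng=rng)                           # received frames and noise
```

Each row of `y` plays the role of one received frame forwarded by the RRH to the cloud.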
Throughout, we take as the performance metric of interest the probability of error, that is, the complement of the probability that decoding of all the frames is carried out successfully by the cloud. Note that, according to the introduced model, a failure may occur due to either errors on the communication channel between user and RRH or due to a failure of the servers.
III Fault Tolerance via Coded NFV
In this section, we first review the conventional diversity-based fault-tolerant approach as applied to the problem at hand of uplink channel decoding in a C-RAN via NFV. We then present the proposed coded NFV approach. For both schemes, we focus on the case of $N = 3$ servers and $K = 2$ frames, and present a simple analysis of the probability of error. The problem statement in the general case is treated in Sec. V.
III-A Fault Tolerance via Diversity
A conventional solution based on diversity is illustrated in Fig. 3(a) for $N = 3$ servers and $K = 2$ frames. In this scheme, the controller VM instantiated at Server 0 duplicates one of the received frames, namely the second frame $\mathbf{y}_2$ in the figure, at the input of both Server 2 and Server 3. Server 1, Server 2 and Server 3 each run a VM that performs channel decoding as well as error detection (via a Cyclic Redundancy Check test) on the input frame. The outcome of the decoders is fed back to the VM in Server 0. We note that, for general values of $N$ and $K$ with $N > K$, the scheme would just duplicate one or more frames at the input of multiple servers.
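The conventional diversity-based system succeeds in decoding both packets as long as: (i) Server 1 decodes correctly the data $\mathbf{u}_1$ of the first frame and is available; and (ii) Server 2 and/or Server 3 decode correctly the data $\mathbf{u}_2$ of the second frame and are available. As a consequence, in terms of the server failure probability $q$, the error probability can be written as
$$P_e = 1 - \sum_{\mathcal{A} \ni 1,\; |\mathcal{A}| \geq 2} P(\mathcal{A})\,(1-q)\left(1 - q^{|\mathcal{A}|-1}\right), \tag{1}$$
where $|\mathcal{A}|$ is the cardinality of set $\mathcal{A}$, which is a subset of the Servers 1, 2 and 3; and $P(\mathcal{A})$ is the probability that only the decoders in $\mathcal{A}$ successfully decode the input frame, while the rest of the servers decode incorrectly.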
III-B Fault Tolerance via Coded NFV
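In the proposed coded NFV scheme, as illustrated in Fig. 3(b), Server 0 pre-processes the received frames by computing the linear combination $\mathbf{y}_3 = \mathbf{y}_1 \oplus \mathbf{y}_2$ of the received frames $\mathbf{y}_1$ and $\mathbf{y}_2$. Note that this operation is of much lower complexity as compared to channel decoding. Server 0 then assigns frame $\mathbf{y}_1$ for decoding at Server 1, frame $\mathbf{y}_2$ to Server 2 and frame $\mathbf{y}_3$ to Server 3. A key observation is that Server 3 can decode over the same linear code as Servers 1 and 2, since we have
$$\mathbf{y}_3 = \mathbf{y}_1 \oplus \mathbf{y}_2 = (\mathbf{u}_1 \oplus \mathbf{u}_2)\,\mathbf{G} \oplus (\mathbf{z}_1 \oplus \mathbf{z}_2), \tag{2}$$
with the data $\mathbf{u}_j$ and channel noise $\mathbf{z}_j$ of frame $j$ and the generator matrix $\mathbf{G}$ defined as in Sec. II. Hence, Server 3 can decode $\mathbf{u}_1 \oplus \mathbf{u}_2$ over a BSC with parameter $2p(1-p)$, where $p$ is the BSC parameter of each frame; this is the probability that any given bit of the effective noise $\mathbf{z}_1 \oplus \mathbf{z}_2$ equals 1. As a result, as long as any two servers decode successfully and are available, Server 0, which receives the outputs of all other servers as in the diversity-based scheme, can decode both data messages $\mathbf{u}_1$ and $\mathbf{u}_2$.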
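The observation above can be checked numerically. The sketch below, under assumed sizes, an arbitrary random binary generator matrix and an assumed BSC parameter, verifies that the XOR of two received frames is a noisy codeword for the XOR of the two messages, estimates the effective noise level, and shows how the controller could recover both messages from any two decoder outputs. The helper `recover` and all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
k, n, p = 4, 7, 0.1                          # assumed code size and BSC parameter
G = rng.integers(0, 2, size=(k, n), dtype=np.uint8)   # any binary generator matrix

u1, u2 = rng.integers(0, 2, size=(2, k), dtype=np.uint8)
z1, z2 = (rng.random((2, n)) < p).astype(np.uint8)
y1 = ((u1 @ G) % 2) ^ z1
y2 = ((u2 @ G) % 2) ^ z2

# Input of Server 3: XOR of the two received frames.
y3 = y1 ^ y2
rhs = (((u1 ^ u2) @ G) % 2) ^ (z1 ^ z2)
same = np.array_equal(y3, rhs)               # linearity of the code over GF(2)

# Effective noise: XOR of two independent Bernoulli(p) bits.
trials = 200_000
a = rng.random(trials) < p
b = rng.random(trials) < p
empirical = np.mean(a ^ b)
p_eff = 2 * p * (1 - p)

def recover(d1=None, d2=None, d3=None):
    """Return (u1, u2) from any two decoder outputs; d3 is the decoded u1 XOR u2."""
    if d1 is not None and d2 is not None:
        return d1, d2
    if d1 is not None and d3 is not None:
        return d1, d1 ^ d3
    if d2 is not None and d3 is not None:
        return d2 ^ d3, d2
    return None                              # fewer than two outputs: failure
```

Here `recover` assumes that the outputs passed to it have already been validated by error detection, as done via the CRC test in the scheme described above.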
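Based on the description above, the proposed coded NFV scheme can be interpreted as a form of concatenated code in which the outer linear code encodes each frame, while the inner NFV code is applied on the noisy received signals in order to obtain robustness with respect to infrastructure failures. In terms of the server failure probability $q$, the probability of error for this scheme is given by
$$P_e = 1 - \sum_{\mathcal{A}:\, |\mathcal{A}| \geq 2} P(\mathcal{A}) \sum_{i=2}^{|\mathcal{A}|} \binom{|\mathcal{A}|}{i} (1-q)^{i}\, q^{|\mathcal{A}|-i}, \tag{3}$$
where $P(\mathcal{A})$, the probability that only the decoders in the subset $\mathcal{A}$ of Servers 1, 2 and 3 decode successfully, is defined as in (1), with the key difference that Server 3 decodes based on (2).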
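As a hedged illustration of how the two error probabilities can be evaluated, the sketch below enumerates all server availability patterns for the three-server example. It assumes that the probabilities of the decoding outcomes (a dictionary `P` mapping each set of correctly decoding servers to its probability) are supplied by the user, e.g., from simulation, and that servers fail independently with an assumed probability `q`. The function names are hypothetical.

```python
from itertools import product

SERVERS = (1, 2, 3)

def error_probability(P, q, success):
    """Error probability given decoding-outcome probabilities P and failure probability q.

    P maps each set A of correctly decoding servers to its probability;
    success(good) tells whether the set of usable servers recovers both frames.
    """
    total = 0.0
    for A, prob_A in P.items():
        for pattern in product([True, False], repeat=len(SERVERS)):
            up = {s for s, avail in zip(SERVERS, pattern) if avail}
            prob_up = 1.0
            for avail in pattern:
                prob_up *= (1 - q) if avail else q
            # A server is usable if it is available and decodes correctly.
            if success(frozenset(A) & up):
                total += prob_A * prob_up
    return 1.0 - total

def diversity_success(good):
    """Server 1 decodes frame 1; Servers 2 and 3 both decode frame 2."""
    return 1 in good and (2 in good or 3 in good)

def coded_success(good):
    """Any two usable servers suffice in the coded scheme."""
    return len(good) >= 2

# Example: noiseless channel (all servers decode correctly), q = 0.1.
P = {frozenset({1, 2, 3}): 1.0}
pe_div = error_probability(P, 0.1, diversity_success)   # 1 - 0.9 * 0.99
pe_cod = error_probability(P, 0.1, coded_success)       # 1 - (0.729 + 0.243)
```

In this noiseless example only server failures matter, which isolates the availability gain of the coded scheme from the higher noise level seen by Server 3.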
We conclude this section by emphasizing that, besides the advantages in terms of the error probability, which will be further discussed in Sec. IV, the proposed coded scheme increases the Minimum Failure Removal (MFR). The MFR is the minimum number of servers whose removal leads to failure. In particular, with the conventional diversity-based scheme, even the unavailability of a single server, namely Server 1, causes a failure, while the proposed scheme has an MFR of two.
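The MFR of both schemes can be verified by brute force. In the sketch below, each scheme is described by a binary matrix with one row per frame and one column per server, where a one indicates that the frame contributes to the input of that server; this matrix representation of the diversity scheme and the function names are assumptions made for illustration. Recovery of all frames is deemed possible when the columns of the remaining servers have full rank over GF(2), assuming that all remaining servers decode correctly.

```python
from itertools import combinations
import numpy as np

def gf2_rank(M):
    """Rank of a binary matrix over GF(2) via Gaussian elimination."""
    M = np.array(M, dtype=np.uint8) % 2
    rows, cols = M.shape
    rank = 0
    for c in range(cols):
        pivots = [r for r in range(rank, rows) if M[r, c]]
        if not pivots:
            continue
        M[[rank, pivots[0]]] = M[[pivots[0], rank]]    # move pivot row up
        for r in range(rows):
            if r != rank and M[r, c]:
                M[r] ^= M[rank]                        # eliminate column c elsewhere
        rank += 1
        if rank == rows:
            break
    return rank

def min_failure_removal(G):
    """Smallest number of servers (columns) whose removal prevents recovering all frames."""
    G = np.array(G, dtype=np.uint8)
    K, N = G.shape
    for m in range(N + 1):
        for removed in combinations(range(N), m):
            kept = [i for i in range(N) if i not in removed]
            if gf2_rank(G[:, kept]) < K:
                return m
    return None

G_diversity = [[1, 0, 0],
               [0, 1, 1]]     # Server 1: frame 1; Servers 2 and 3: frame 2
G_coded = [[1, 0, 1],
           [0, 1, 1]]         # Server 3: XOR of frames 1 and 2

mfr_diversity = min_failure_removal(G_diversity)
mfr_coded = min_failure_removal(G_coded)
```

The exhaustive search is exponential in the number of servers, so it is only meant for small examples such as the one discussed here.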
IV Numerical Results
In this section, we present numerical experiments to compare the performance of the conventional diversity-based scheme and the proposed coded NFV for the presented example with $N = 3$ servers and $K = 2$ frames. To this end, we consider a feedforward convolutional code with constraint length 7, and Viterbi decoders are implemented at Servers 1, 2 and 3. We evaluate the probabilities in (1) and (3) via Monte Carlo simulations.
The error probability as a function of the servers’ failure probability is plotted in Fig. 4 for the indicated values of the BSC parameter for both schemes. It is seen that, in the regime in which hardware failures have similar or smaller probability as compared to channel errors, coded NFV can provide significant gains. For instance, for one of the operating points shown in Fig. 4, the server failure probability that coded NFV can tolerate while meeting the target error probability is an order of magnitude larger than that required by the conventional diversity-based method.
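A toy version of such a Monte Carlo experiment can be sketched as follows. This is not the setup used for Fig. 4: as stated assumptions, the convolutional code and Viterbi decoder are replaced by a (7,4) Hamming code with syndrome decoding, the CRC test is replaced by genie-aided error detection (the decoder output is compared with the true message), and the function names and parameter values are hypothetical.

```python
import numpy as np

# Assumed stand-in for the convolutional code: systematic (7,4) Hamming code.
P = np.array([[1, 1, 0], [1, 0, 1], [0, 1, 1], [1, 1, 1]], dtype=np.uint8)
G = np.hstack([np.eye(4, dtype=np.uint8), P])      # generator matrix [I | P]
H = np.hstack([P.T, np.eye(3, dtype=np.uint8)])    # parity-check matrix [P^T | I]

# Map each nonzero syndrome (read as a 3-bit integer) to the bit position to flip.
WEIGHTS = np.array([4, 2, 1])
FLIP_POS = np.full(8, -1)
FLIP_POS[H.T @ WEIGHTS] = np.arange(7)

def decode(y):
    """Syndrome-decode the rows of y (shape (T, 7)); return the 4 message bits per row."""
    syndrome = (y @ H.T) % 2
    pos = FLIP_POS[syndrome @ WEIGHTS]
    corrected = y.copy()
    rows = np.nonzero(pos >= 0)
    corrected[rows + (pos[rows],)] ^= 1            # flip the single suspected bit
    return corrected[:, :4]

def simulate(p, q, trials=100_000, seed=0):
    """Estimate (diversity, coded) error probabilities for BSC parameter p and failure probability q."""
    rng = np.random.default_rng(seed)
    u = rng.integers(0, 2, size=(2, trials, 4), dtype=np.uint8)
    z = (rng.random((2, trials, 7)) < p).astype(np.uint8)
    y = ((u @ G) % 2) ^ z
    up = rng.random((3, trials)) >= q              # availability of Servers 1, 2, 3
    # Genie-aided detection: compare decoder outputs with the true messages.
    ok1 = np.all(decode(y[0]) == u[0], axis=1)
    ok2 = np.all(decode(y[1]) == u[1], axis=1)
    ok3 = np.all(decode(y[0] ^ y[1]) == (u[0] ^ u[1]), axis=1)
    # Diversity: Server 1 gets frame 1; Servers 2 and 3 both get frame 2.
    div_ok = (ok1 & up[0]) & (ok2 & (up[1] | up[2]))
    # Coded: Servers 1, 2, 3 get frame 1, frame 2 and their XOR; any two good outputs suffice.
    good = np.stack([ok1 & up[0], ok2 & up[1], ok3 & up[2]])
    cod_ok = good.sum(axis=0) >= 2
    return 1 - div_ok.mean(), 1 - cod_ok.mean()

pe_diversity, pe_coded = simulate(p=0.01, q=0.01)
```

Sweeping `q` for a fixed `p` gives curves of the same kind as those described above, although the numerical values depend on the assumed code and decoder.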
V Extensions
The robust coded NFV scheme was presented in Sec. III-B for $N = 3$ servers and $K = 2$ frames, and for a BSC between user and RRH. In this section, we briefly discuss extensions.
1) Frames encoded with different rates: Different rates can be accommodated by using rate-compatible codes obtained from the same master linear code for each frame.
2) Additive Gaussian noise or fading channels: For such channels between user and RRH, lattice codes can be used to encode the frames, instead of linear binary codes. The input to Server 3 (see Fig. 3(b)) is computed as the sum of the received signals on the real field, or complex field for complex Gaussian or fading channels. Server 3 then decodes the XOR of the two messages, namely $\mathbf{u}_1 \oplus \mathbf{u}_2$, by decoding over the lattice code using the technique of computation over a multiple access channel (see [14] and references therein for an introduction).
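3) Generalization to any values of the parameters $N$ and $K$: For any number $N$ of servers and any number $K$ of frames, each one of the bits input to Server $i$, with $i = 1, \ldots, N$, is obtained as a binary-field linear combination of the corresponding bits of the received frames $\mathbf{y}_j$, with $j = 1, \ldots, K$. The resulting NFV code can then be described by a $K \times N$ generator matrix $\mathbf{G}_{\mathrm{NFV}}$ such that the input bits to the servers can be computed as $\mathbf{b}\,\mathbf{G}_{\mathrm{NFV}}$, with the $1 \times K$ vector $\mathbf{b}$ collecting one bit from every frame. Note that $\mathbf{G}_{\mathrm{NFV}} = \begin{bmatrix} 1 & 0 & 1 \\ 0 & 1 & 1 \end{bmatrix}$ in the discussed $N = 3$, $K = 2$ example.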
Regarding the design of the generator matrix $\mathbf{G}_{\mathrm{NFV}}$ of the NFV code, we note that an NFV code benefits from a sparse structure of $\mathbf{G}_{\mathrm{NFV}}$, as well as from a large minimum distance, which are two conflicting requirements. For the former requirement, as seen in Sec. III-B, summing more received signals at the input of a server increases the noise level. More formally, the NFV code operates over an erasure channel in which the probability of error associated to each Server $i$ equals $q + (1-q) P_{d_i}$, where $q$ is the server failure probability, $d_i$ is the number of ones in the $i$th column of matrix $\mathbf{G}_{\mathrm{NFV}}$ and $P_{d}$ represents the probability of incorrect decoding when $d$ received signals are summed (which is an increasing function of $d$). This novel property of NFV codes sets an interesting research challenge for code design.
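The general encoding rule and the trade-off on sparsity can be sketched as follows. The matrix name `G_nfv`, the function names and the example frames are illustrative assumptions. The expression for the effective BSC parameter is the standard closed form for the XOR of independent Bernoulli bits, and the per-server erasure probability is written under the assumption that a server output is unusable either when the server fails or when it is available but decodes incorrectly, with the incorrect decoding being detected.

```python
import numpy as np

def server_inputs(Y, G_nfv):
    """Row i of the result is the XOR of the frames (rows of Y) selected by column i of G_nfv."""
    return (G_nfv.T @ Y) % 2

def effective_bsc(p, d):
    """Crossover probability of the XOR of d independent Bernoulli(p) noise bits."""
    return 0.5 * (1.0 - (1.0 - 2.0 * p) ** d)

def server_erasure_prob(q, p_dec_err):
    """Probability that a server output is unusable: failure, or detected decoding error."""
    return q + (1.0 - q) * p_dec_err

# The three-server, two-frame example: Server 3 receives the XOR of both frames.
G_nfv = np.array([[1, 0, 1],
                  [0, 1, 1]], dtype=np.uint8)
Y = np.array([[1, 0, 1, 1],
              [0, 0, 1, 0]], dtype=np.uint8)      # two toy received frames
S = server_inputs(Y, G_nfv)                       # one row per server

# Noise level seen by each server grows with the column weight of G_nfv.
col_weights = G_nfv.sum(axis=0)
noise_levels = [effective_bsc(0.1, int(d)) for d in col_weights]
```

For a column of weight two, `effective_bsc` reduces to the value $2p(1-p)$ obtained in Sec. III-B, which illustrates why denser columns make decoding at the corresponding server less reliable.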
VI Concluding Remarks
Software-based virtual network functions enabled by NFV are less reliable than those provided via the traditional hardware-based platforms. To alleviate this shortcoming, this letter proposed to enhance traditional diversity-based solutions by means of channel coding. The proposed solution addresses the important network function of uplink channel decoding at the base station and leverages the algebraic structure of the received encoded data frames. Open questions for future work encompass the design of NFV codes and the application of the principle of coded NFV to other network functions, such as routing.
[1] European Telecommunications Standards Institute, “Network functions virtualisation (NFV); terminology for main concepts in NFV,” RGS/NFV, Dec. 2014.
[2] B. Han, V. Gopalakrishnan, L. Ji, and S. Lee, “Network function virtualization: Challenges and opportunities for innovations,” IEEE Commun. Magazine, vol. 53, no. 2, pp. 90–97, Feb. 2015.
[3] R. Mijumbi, J. Serrat, J.-L. Gorricho, N. Bouten, F. De Turck, and R. Boutaba, “Network function virtualization: State-of-the-art and research challenges,” IEEE Commun. Surveys & Tuts., vol. 18, no. 1, pp. 236–262, Sep. 2015.
[4] P. Rost, A. Banchs, I. Berberana, M. Breitbach, M. Doll, H. Droste, C. Mannweiler, M. A. Puente, K. Samdanis, and B. Sayadi, “Mobile network architecture evolution toward 5G,” IEEE Commun. Magazine, vol. 54, no. 5, pp. 84–91, May 2016.
[5] European Telecommunications Standards Institute, “Network functions virtualisation (NFV); reliability; report on models and features for end-to-end reliability,” GS NFV-REL 003 V1.1.1, Apr. 2016.
[6] R. Cohen, L. Lewin-Eytan, J. S. Naor, and D. Raz, “Near optimal placement of virtual network functions,” in Proc. IEEE Conference on Computer Communications (INFOCOM), Kowloon, Hong Kong, pp. 1346–1354, Apr. 2015.
[7] European Telecommunications Standards Institute, “Network functions virtualisation (NFV); reliability; report on scalable architectures for reliability management,” GS NFV-REL 002 V1.1.1, Oct. 2014.
[8] J. Liu, Z. Jiang, N. Kato, O. Akashi, and A. Takahara, “Reliability evaluation for NFV deployment of future mobile broadband networks,” IEEE Wireless Commun., vol. 23, no. 3, pp. 90–96, Jun. 2016.
[9] O. Simeone, A. Maeder, M. Peng, O. Sahin, and W. Yu, “Cloud radio access network: Virtualizing wireless access for dense heterogeneous systems,” Journal of Commun. and Networks, vol. 18, no. 2, pp. 135–149, Apr. 2016.
[10] P. Rost, S. Talarico, and M. C. Valenti, “The complexity-rate tradeoff of centralized radio access networks,” IEEE Trans. on Wireless Commun., vol. 14, no. 11, pp. 6164–6176, Nov. 2015.
[11] A. Gatherer, “Revisiting cloud RAN from a computer architecture point of view,” IEEE ComSoc Technology News (CTN), Jul. 2016.
[12] S. Li, M. A. Maddah-Ali, and A. S. Avestimehr, “Coded mapreduce,” in Proc. IEEE Conference on Communication, Control, and Computing (Allerton), pp. 964–971, Sep. 2015.
[13] K. Lee, M. Lam, R. Pedarsani, D. Papailiopoulos, and K. Ramchandran, “Speeding up distributed machine learning using codes,” arXiv preprint:1512.02673, Dec. 2015.
[14] S. H. Lim, C. Feng, A. Pastore, B. Nazer, and M. Gastpar, “A joint typicality approach to algebraic network information theory,” arXiv preprint:1606.09548, Jun. 2016.