Towards Disaggregating the SDN Control Plane
Current SDN controllers have been designed based on a monolithic approach that integrates all services and applications into one single, huge program. The monolithic design of SDN controllers restricts programmers who build management applications to the specific programming interfaces and services that a given SDN controller provides, making application development dependent on the controller, and thereby restricting portability of management applications across controllers. Furthermore, the monolithic approach means an SDN controller must be recompiled whenever a change is made, and does not provide an easy way to add new functionality or scale to handle large networks. To overcome the weaknesses inherent in the monolithic approach, the next generation of SDN controllers must use a distributed, microservice architecture that disaggregates the control plane by dividing the monolithic controller into a set of cooperative microservices. In this paper, we explain the steps that are required to migrate from a monolithic to a microservice architecture and propose two potential architectures to achieve the goal. Finally, the paper reports the results of testbed measurements that we use to evaluate the proposed disaggregated architecture from multiple perspectives, including functionality and performance.
Software defined networking (SDN) is an emerging trend for the design of Internet management systems that decouples vertical integration of the control plane and data plane and provides flexibility that allows software to program the data plane hardware directly according to a set of network policies. Thus, the control functions used to configure a device no longer need to be integrated with the functions that perform data forwarding. SDN offers flexibility to program and monitor computer networks directly using a controller that runs software known as a Network Operating System (NOS) to provide control logic. In the current SDN paradigm, management functionality crosses three key layers: data plane forwarding mechanisms, control plane functions, and management applications. A centralized control plane is implemented by an SDN controller. A controller uses two types of Application Program Interfaces (APIs) to connect to other entities: a Northbound (NB) API and a Southbound (SB) API. A NB API defines communication between an external management application and the SDN controller. A SB API defines communication between the controller and underlying network devices. One of the most widely used SB APIs, known as the OpenFlow protocol, allows a controller to insert, modify, and update flow table rules, and to specify associated actions to be performed for each of the flows that pass through a given network device. The architecture of current software defined management systems exhibits several weaknesses, as follows:
Monolithic and Proprietary: Current SDN controllers have been designed based on a monolithic architecture that aggregates all control plane subsystems into a single, huge monolithic program. In the aggregated control plane model, each controller defines its own set of programming interfaces and services; a programmer can only use the controller's set when creating management applications. The approach makes application development dependent on a particular SDN controller and the specific programming language that has been used to implement the controller, consequently restricting portability of applications across controllers. Using a monolithic approach to the design of an SDN controller allows a vendor to create a controller and ensure the cohesion of all pieces. However, the approach does not provide an easy way for users to incorporate new services or adapt the controller quickly.
Lack of a Uniform Set of NB APIs: Even if they use the same general form of interaction (e.g., a RESTful interface), SDN controllers, such as ONOS and OpenDayLight, each offer a NB API that differs from the APIs offered by other controllers in terms of syntax, naming conventions, and resources. The lack of uniformity among NB APIs makes each external management application dependent on the NB API of a specific SDN controller.
Lack of Reusability of Software Modules: In the current SDN architecture, the dependency between a management application and a specific type of controller limits reuse of SDN software. In many cases, when porting an application from one controller to another, a programmer must completely recode even the basic modules that collect topology information, generate and install flow rules, monitor topology changes, and collect flow rule statistics.
Lack of External Reactive SDN Applications: In the current SDN architecture, NB APIs can be used to program network devices proactively by building applications that install flow rules to handle all possible cases before data traffic arrives. To exploit the advantages of SDN fully, a network management system must also support a reactive approach in which external management applications can be informed of changes in the network or data traffic, and can then react to change forwarding rules. Unfortunately, current SDN controllers do not provide any notification mechanisms that can inform external applications about changes in network conditions.
To overcome the above weaknesses, we propose a new architecture for the design and implementation of next generation SDN control plane systems that splits the current monolithic controller software into a set of cooperating microservices. The following summarizes the main contributions of this article. The article:
Explains the concept of SDN Control Plane Disaggregation and introduces a distributed architecture for the next generation of SDN controllers.
Describes steps taken towards disaggregating the SDN control plane, considers potential ways to achieve the goal, and discusses the advantages and disadvantages of each.
Evaluates the approaches and consequent tradeoffs from multiple perspectives, such as performance and implementation difficulty.
The rest of the paper is organized as follows: The following section explains SDN control plane disaggregation. Section III considers two event distribution systems that can potentially be used to externalize event processing. Section IV compares the two event distribution systems by assessing implementation tradeoffs. Section V explains an experimental setup used to assess performance, and summarizes the results of measurements. Section VI briefly presents related work, and section VII concludes the paper.
II SDN Control Plane Disaggregation
As Figure 1 illustrates, a monolithic SDN controller can be disaggregated into a suite of cooperating microservices, such that a controller core provides the minimum required functionality, and microservices, which exist outside of the SDN controller core, provide all other services. We note that the approach is analogous to a microkernel operating system design. In the disaggregated model, each of the microservices can run in a container, and the containers can be orchestrated using orchestrator technologies such as Kubernetes or Docker Swarm. Migrating from a monolithic controller to a microservice architecture for control planes offers the following benefits:
Flexibility to scale: One of the main advantages of the microservice architecture arises from its ability to scale a given service horizontally, independent of other subsystems and services.
Freedom in choosing the programming language: In the disaggregated control plane model, programmers have the opportunity to choose an arbitrary programming language, programming technology, and third-party library function when building an SDN management application. The approach makes the application development process more flexible, and allows a programmer to choose a language that is appropriate to a given app.
Fault isolation: In current SDN controllers, if one of the subsystems fails, it can affect the entire controller. A disaggregated control plane means that the failure of a given microservice will not affect other microservices. Moreover, a microservice can be repaired and restarted without recompiling the controller and without restarting other microservices.
A controller core with minimal components: In the monolithic approach, every instance of a controller includes all services and apps, even if the instance only needs a small subset. In the disaggregated model, the controller core contains the minimum viable set of components and functions, and only the required apps and services need to be deployed outside of the controller core.
A Disaggregated Code Base: In the monolithic architecture, a single, large code base includes all services and apps. Consequently, changes to even a small, seldom-used service require changing the controller code base. In the disaggregated model, each service and each application can be in a separate code base, isolating changes.
As Figure 1 illustrates, the first step towards SDN control plane disaggregation consists of identifying a minimal set of viable controller core components and equipping them with an event distribution mechanism that can externalize event processing. External apps will use the event mechanism to learn about link and device changes, as well as to learn about packets that cause exceptions (e.g., a packet for which no forwarding rule exists). A notification system allows a programmer to develop management applications and services that respond to such events reactively. Thus, an external application can perform topology discovery and use subsequent events to update the topology. The disaggregated approach allows such applications to use any programming language and to run outside of the SDN controller core. Furthermore, externalizing link and device events means management apps can react to network changes dynamically and quickly. In the following section, we explain two event distribution mechanisms that can be used to externalize event processing in a disaggregated SDN control plane.
III An Overview of Event Distribution Mechanisms
We define an event as a change in the network. As an example, consider two types of network events that occur commonly:
Packet Event: A switch is configured to send any packet to the controller core if there is no other forwarding rule for the packet. A packet event occurs when a packet arrives at the controller core.
Topology Event: Whenever the network topology changes (e.g., a link fails or a link is placed back in service), the controller core receives a topology event. Topology events can be categorized according to type: link, device, port, and so on.
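The two event categories above can be modeled with a small set of types. The sketch below is a hypothetical Python rendering (the class and field names are our own, not part of any controller's API), showing how each event can carry a topic string that external subscribers later use for filtering.

```python
from dataclasses import dataclass
from enum import Enum

class TopologyEventType(Enum):
    """Sub-categories of topology events, per the taxonomy above."""
    LINK = "link"
    DEVICE = "device"
    PORT = "port"

@dataclass
class PacketEvent:
    switch_id: str          # switch that had no matching forwarding rule
    in_port: int            # port on which the packet arrived
    payload: bytes          # raw packet bytes to be processed externally
    topic: str = "packet"   # topic that external apps subscribe to

@dataclass
class TopologyEvent:
    kind: TopologyEventType  # link, device, or port change
    subject_id: str          # identifier of the affected network element
    up: bool                 # True if the element came up, False if it failed
    topic: str = "topology"

# Example: a link failure becomes a topology event on the "topology" topic.
ev = TopologyEvent(kind=TopologyEventType.LINK, subject_id="link:s1-s2", up=False)
```

Tagging events with topics in this way lets either distribution mechanism (Kafka or gRPC streaming) deliver only the event types an application has subscribed to.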
As the previous section explains, disaggregating the control plane requires external event processing. This section examines two potential event distribution mechanisms: gRPC streaming and Kafka.
III-A An Event Distribution System Using Kafka plus gRPC
Figure (a) illustrates an architecture that uses Kafka to externalize event processing. The key components are:
Kafka Event Distribution Application: The architecture follows the producer-consumer model for event distribution. The system is based on Apache Kafka, an open-source stream-processing software platform developed by the Apache Software Foundation. A Kafka event distribution application listens for events that occur in the controller core, and publishes the events on a Kafka cluster; the events are consumed by external processes and applications. In other words, the event distribution application acts as a producer by pushing data to brokers on the Kafka cluster, and external applications and services act as consumers that receive the data.
Applications and Services: As Figure (a) illustrates, whenever an application needs to receive incoming events from the controller, the application sends an HTTP request to the event distribution application to subscribe to a specific type of event (e.g., a packet event or topology event). The Kafka event distribution application checks the request, subscribes the requesting app to the specified type of event, and then replies to the requesting app. In our implementation, the Kafka event distribution application encodes each packet in a protobuf message and publishes the message as an array of bytes to the Kafka cluster. Whenever an application receives an event by consuming it from the Kafka cluster, the application must decode and parse the event. In some cases, such as packet events, an application or service may need to return the packet that caused the event to the data pipeline. As the figure shows, we use a gRPC API to return the packet to the controller core, which will send the packet to a switch. If an application or service needs to install flow rules on network devices, it has a choice of using a REST API or gRPC.
Kafka Cluster: A set of Kafka brokers, each running a Kafka server, forms a cluster. The event distribution system in the controller core acts as a producer that publishes events into Kafka topics within the brokers. A consumer of a given topic consumes all messages on that topic.
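The producer and consumer roles described above can be sketched as follows. To keep the example self-contained, we use a minimal in-memory stand-in for the Kafka cluster rather than an actual Kafka client library; the class and method names are our own, and a real deployment would publish protobuf-encoded byte arrays through a Kafka client instead.

```python
from collections import defaultdict, deque

class InMemoryBroker:
    """Minimal stand-in for a Kafka cluster: each topic maps to a
    queue of byte messages that consumers drain in publication order."""
    def __init__(self):
        self.topics = defaultdict(deque)

    def publish(self, topic: str, message: bytes):
        # The controller core's event distribution app acts as producer.
        self.topics[topic].append(message)

    def consume(self, topic: str):
        # An external app acts as consumer of a given topic; it receives
        # all pending messages and must decode each one itself.
        msgs = list(self.topics[topic])
        self.topics[topic].clear()
        return msgs

# Producer side: the event distribution app publishes encoded packet events.
broker = InMemoryBroker()
broker.publish("packet", b"\x08\x01")  # e.g., a protobuf-encoded packet event
broker.publish("packet", b"\x08\x02")

# Consumer side: a subscribed external application drains the topic.
events = broker.consume("packet")
```

The key property the sketch illustrates is the decoupling: the producer never knows which applications consume a topic, which is what allows external apps and services to be added or scaled independently of the controller core.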
III-B An Event Distribution System Using gRPC
The second event distribution mechanism uses gRPC server-side streaming. As Figure (b) illustrates, the SDN controller core is equipped with an application that uses gRPC server-side streaming. The application implements a push-notification system that provides incoming events to external processes. As Figure (b) shows, an external application or service sends a registration request to the gRPC server to register as a receiver. Then the external application subscribes to a topic (in this case, a specific type of event), which causes the server to start streaming occurrences of the specified type of event. Similar to the architecture used with Kafka, an external application uses gRPC to return packets to the pipeline, and uses either gRPC or a REST API to install flow rules in switches. The following section compares the event distribution mechanisms.
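The register-then-stream interaction above can be sketched with a small simulation. In a real implementation the server would expose a server-streaming RPC defined in protobuf, and the generated stubs would carry events over HTTP/2; here we model the same push-notification pattern with plain Python generators (all names are ours, not a gRPC API).

```python
from queue import Queue, Empty

class EventStreamServer:
    """Sketch of the push-notification pattern behind gRPC server-side
    streaming: a client registers for a topic, then receives a stream
    of matching events pushed by the controller core."""
    def __init__(self):
        self.subscribers = {}  # (client_id, topic) -> Queue of events

    def subscribe(self, client_id: str, topic: str):
        # Registration request: create a per-client queue for the topic.
        self.subscribers[(client_id, topic)] = Queue()

    def push(self, topic: str, event):
        # The controller core pushes each event to every subscriber
        # of the matching topic.
        for (cid, t), q in self.subscribers.items():
            if t == topic:
                q.put(event)

    def stream(self, client_id: str, topic: str):
        # In real gRPC, each yield becomes one streamed response message.
        q = self.subscribers[(client_id, topic)]
        while True:
            try:
                yield q.get_nowait()
            except Empty:
                return  # no more pending events in this sketch

server = EventStreamServer()
server.subscribe("app1", "topology")
server.push("topology", "link s1-s2 down")
received = list(server.stream("app1", "topology"))
```

Unlike the Kafka design, there is no intermediate broker here: the server pushes directly to registered clients, which is why no separate cluster needs to be deployed or tuned.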
IV A Comparison of Kafka and gRPC Event Distribution Mechanisms
The advantages and disadvantages of the two event distribution mechanisms can be summarized briefly:
From the client side perspective, gRPC makes it easier than Kafka to expand a distribution system to include apps written in new programming languages. To use gRPC with a new language, a programmer only needs to compile the set of protobuf messages used for communication between the event distribution system and the application. The gRPC technology automatically generates the gRPC stubs that external SDN applications and services need. In contrast, to use Kafka with a new programming language, a programmer needs to implement a set of high-level abstractions that are required before applications can create and initialize a consumer that can receive Kafka events.
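To make the compilation step concrete, a hypothetical protobuf definition for the event distribution service might look like the fragment below; the message, field, and service names are illustrative, not taken from the paper's implementation. Compiling this one file with the protobuf compiler for a new target language yields the stubs an external application needs.

```proto
syntax = "proto3";

package eventdist;

// A packet event delivered to external applications.
message PacketEvent {
  string switch_id = 1;  // switch that raised the event
  uint32 in_port = 2;    // ingress port of the packet
  bytes payload = 3;     // raw packet bytes
}

// One subscription request yields a stream of events
// (gRPC server-side streaming).
message SubscribeRequest {
  string topic = 1;  // e.g., "packet" or "topology"
}

service EventDistribution {
  rpc Subscribe (SubscribeRequest) returns (stream PacketEvent);
}
```

The `stream` keyword on the response is what gives the server-side streaming semantics described in Section III-B: the server keeps the call open and pushes each matching event as a separate response message.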
In the Kafka implementation, a programmer must tune parameters in Kafka brokers, producers, and consumers to optimize performance. Using gRPC allows applications to access a remote procedure call mechanism that has high performance potential, but it is not always clear how to optimize performance. In the implementation of gRPC client and server code described in this paper, we attempt to follow best practices.
In both the Kafka and gRPC event distribution systems, external applications and services (i.e. microservices) may need to communicate with one another. In either system, they can choose to use gRPC or a REST API.
V Experimental Results
This section presents an evaluation of the Kafka and gRPC distribution mechanisms using various performance metrics, such as response time and throughput.
V-A Experimental Setup
We implemented an early version of the proposed event distribution systems for the ONOS SDN controller. To measure the two distribution mechanisms, we used an SDN testbed that consists of 10 OpenFlow switches that logically define 5 interconnected sites. We use the virtualization feature of the network switches to divide each physical switch into 10 independent smaller switches. Each site implements a Fat-tree network topology.
V-B Experimental Scenarios
V-B1 Response Time
This experiment measures response time to evaluate the overhead of external event processing. Our goal is to compare the amount of time that an external app or service needs to process a packet event with the time it takes to process the same packet event inside the monolithic version of ONOS, and to understand the effect on overall response time. In the internal ONOS packet processing, whenever a host sends a ping request for which no forwarding rules have been established, each switch along the path sends the incoming packet to the controller core, which processes the packet internally and returns the packet to the switch to be forwarded. In the disaggregated architecture, whenever a host sends a ping request for which no forwarding rules have been established, each switch along the path sends the incoming packet to the controller core, and the event distribution system in the core (using either Kafka or gRPC) distributes the incoming packet to the set of external apps and services that have subscribed to receive packet events. The external process then uses gRPC to return the packet to the switch for forwarding. To focus measurements on the control plane overhead, we did not install flow rules. Thus, each packet causes a packet event at each switch along the path. We ran the experiment 500 times and measured the ping response time between two end hosts in our SDN testbed that are 5 hops apart. As the graph in Figure 4 shows, externalization of packet processing introduces overhead that increases the overall response time. As a baseline, we measured the average response time for internal processing in ONOS as 24 ms. The average response time for the gRPC system is 29 ms, and the average time for the Kafka system is 35 ms. We observe that using gRPC introduces less overhead than using the Kafka-based event distribution system.
To assess the impact of externalized packet processing and the use of a REST API for flow rule installation on throughput, we compare external reactive forwarding applications that use the gRPC and Kafka event distribution systems with an ONOS reactive forwarding application. In all three cases, we use a hard time-out of 10 seconds to remove flow rules (i.e., the flow rules installed in a switch disappear every 10 seconds, causing packet events and re-installation of the rules). We use the iperf3 tool to generate TCP traffic, varying the number of concurrent TCP connections, and running each measurement for 150 seconds. As the results in Figure 5 show, the effect of externalized packet processing on throughput is negligible. Furthermore, the overhead of using a REST API to install flow rules is larger than the overhead introduced by externalization.
VI Related Work
Umbrella is a unified software defined network programming framework that provides a new set of APIs for the implementation of SDN applications independent of the NB APIs used by specific SDN controllers. Umbrella uses OFtee as a tool to provide OpenFlow PACKET_IN messages to external applications, supporting external reactive SDN applications as well as proactive applications. In , we presented the idea of externalizing packet processing in SDN, which is one of the first steps towards control plane disaggregation. This article extends our previous article by focusing on the steps required to migrate from a monolithic SDN control plane architecture to a disaggregated control plane.
A monolithic architecture for an SDN controller aggregates all control plane subsystems into a single, gigantic program. An SDN controller that adopts the monolithic approach restricts programmers who write management applications to use the programming interfaces and services that the controller provides. To overcome the limitations inherent in the monolithic architecture, we propose a distributed architecture that disaggregates controller software into a small controller core and a set of cooperative microservices. A programmer can choose a programming language that is appropriate for each microservice. To migrate from a monolithic approach to a disaggregated microservice architecture, a mechanism must be devised to distribute events to external processes. In this paper, we evaluate two candidate distribution mechanisms: Kafka and gRPC. Our experimental results show that externalizing event processing introduces some overhead, and that the overhead resulting from gRPC is lower than the overhead resulting from Kafka. Externalizing packet processing has a negligible effect on throughput, and the cost is considered small when compared with the advantages of portability, flexibility, and support for multi-language management applications.
-  C. Trois, M. D. D. Fabro, L. C. E. de Bona, and M. Martinello, “A survey on sdn programming languages: Toward a taxonomy,” IEEE Communications Surveys Tutorials, vol. 18, no. 4, pp. 2687–2712, 2016.
-  J. H. Cox, J. Chung, S. Donovan, J. Ivey, R. J. Clark, G. Riley, and H. L. Owen, “Advancing software-defined networks: A survey,” IEEE Access, vol. 5, pp. 25 487–25 526, 2017.
-  N. McKeown, T. Anderson, H. Balakrishnan, G. Parulkar, L. Peterson, J. Rexford, S. Shenker, and J. Turner, “Openflow: Enabling innovation in campus networks,” SIGCOMM Comput. Commun. Rev., vol. 38, no. 2, pp. 69–74, Mar. 2008. [Online]. Available: http://doi.acm.org/10.1145/1355734.1355746
-  P. Berde, M. Gerola, J. Hart, Y. Higuchi, M. Kobayashi, T. Koide, B. Lantz, B. O’Connor, P. Radoslavov, W. Snow, and G. Parulkar, “Onos: Towards an open, distributed sdn os,” in Proceedings of the Third Workshop on Hot Topics in Software Defined Networking, ser. HotSDN ’14. New York, NY, USA: ACM, 2014, pp. 1–6. [Online]. Available: http://doi.acm.org/10.1145/2620728.2620744
-  “OpenDayLight,” https://www.opendaylight.org/, Accessed on 2019.
-  “Kubernetes: Production-Grade Container Orchestration,” https://www.kubernetes.io, Accessed on 2019.
-  “Docker,” https://www.docker.com/, Accessed on 2019.
-  “gRPC: A high performance, open-source universal RPC framework,” https://www.grpc.io/, Accessed on 2019.
-  “Apache Kafka: A Distributed Streaming Platform,” https://kafka.apache.org/, Accessed on 2019.
-  D. Comer, R. H. Karandikar, and A. Rastegarnia, “Umbrella: A unified software defined development framework,” in Proceedings of the 2018 Symposium on Architectures for Networking and Communications Systems, ser. ANCS ’18. New York, NY, USA: ACM, 2018, pp. 148–150. [Online]. Available: http://doi.acm.org/10.1145/3230718.3233546
-  “OFtee: An OpenFlow Proxy,” https://github.com/ciena/oftee/, Accessed on 2019.
-  D. Comer and A. Rastegarnia, “Externalization of packet processing in software defined networking,” 2019. [Online]. Available: https://arxiv.org/abs/1901.02585
Douglas Comer is an internationally recognized expert on computer networking and the TCP/IP protocols. He has been working with TCP/IP and the Internet since the late 1970s. Comer established his reputation as a principal investigator on several early Internet research projects. He served as chairman of the CSNET technical committee, chairman of the DARPA Distributed Systems Architecture Board, and was a member of the Internet Activities Board (the group of researchers who built the Internet). Professor Comer is well-known for his series of ground breaking textbooks on computer networks, the Internet, computer operating systems, and computer architecture. His books have been translated into sixteen languages, and are widely used in both industry and academia. Comer’s three-volume series Internetworking With TCP/IP is often cited as an authoritative reference for the Internet protocols.
Adib Rastegarnia is a PhD candidate at the Computer Science Department of Purdue University under the supervision of Prof. Douglas Comer. He earned his Master's degree in Computer Science from Purdue University in 2018. His current research interests span the areas of computer networks, operating systems, Software Defined Networking (SDN), and the Internet of Things (IoT), with a focus on designing and implementing working prototypes of large, complex software systems and conducting real world measurements. He is also a member of the Systems Research Group of the Computer Science Department at Purdue University.