Efficient, Dynamic Multi-tenant Edge Computation in EdgeOS

Yuxin Ren The George Washington University ryx@gwu.edu Vlad Nitu EPFL Lausanne vlad.nitu@epfl.ch Guyue Liu The George Washington University guyue@gwu.edu Gabriel Parmer The George Washington University gparmer@gwu.edu Timothy Wood The George Washington University timwood@gwu.edu Alain Tchana Toulouse University Alain.Tchana@enseeiht.fr  and  Riley Kennedy The George Washington University rskennedy@gwu.edu

Abstract

In the future, computing will be immersed in the world around us – from augmented reality to autonomous vehicles to the Internet of Things. Many of these smart devices will offer services that respond in real time to their physical surroundings, requiring complex processing with strict performance guarantees. Edge clouds promise a pervasive computational infrastructure a short network hop away from end devices, but today’s operating systems are a poor fit to meet the goals of scalable isolation, dense multi-tenancy, and predictable performance required by these emerging applications. In this paper we present EdgeOS, a micro-kernel based operating system that meets these goals by blending recent advances in real-time systems and network function virtualization. EdgeOS introduces a Featherweight Process model that offers lightweight isolation and supports extreme scalability even under high churn. Our architecture provides efficient communication mechanisms, and low-overhead per-client isolation. To achieve high performance networking, EdgeOS employs kernel bypass paired with the isolation properties of Featherweight Processes. We have evaluated our EdgeOS prototype for running high scale network middleboxes using the Click software router and endpoint applications using memcached. EdgeOS reduces startup latency by 170X compared to Linux processes and over five orders of magnitude compared to containers, while providing three orders of magnitude latency improvement when running 300 to 1000 edge-cloud memcached instances on one server.

1. Introduction

There is a growing desire to deploy software services closer to users. Cellular providers must run mobility management services near customers to properly maintain connectivity for cell phone users. The Internet of Things foretells the deployment of billions of devices producing data streams, which often require processing close to the data source to avoid excess bandwidth consumption in the network core. Latency sensitive cyber physical systems desire communication and processing at millisecond scale, preventing the use of centralized cloud infrastructures. These applications and many more demand an efficient and scalable “edge cloud” infrastructure, where computational resources are available on demand, as close to users as possible.

Unfortunately, edge clouds pose major challenges for traditional operating system and virtualization architectures. First, an edge cloud must support dense multi-tenancy—each edge cloud site is expected to be a tiny fraction of the size of a centralized cloud, yet it may need to host many carefully isolated services for the users connected to it. Thus rather than run thousands of servers each supporting a dozen services in virtual machines (VMs), as is common in today’s centralized cloud data centers, an edge cloud site might only run a dozen servers, each supporting thousands of diverse services. Even lightweight virtualization platforms such as Linux Containers have trouble scaling to these extremes (Manco et al., 2017).

Second, the combination of limited resources and mobile users means that edge cloud workloads are likely to see extremely high churn. Maintaining a large number of long running yet infrequently accessed services will not be efficient in such an environment, so services will instead need to be instantiated and terminated frequently on demand. In the extreme case, this may require dynamically starting a new service for each incoming user connection.

The overarching concerns of dense multi-tenancy and high churn are compounded by the latency sensitivity and network intensive nature of many edge cloud services. This is particularly challenging since virtualization adds overhead for I/O tasks (Gupta et al., 2006; Hu et al., 2017). Recent support for HW virtualization, such as SR-IOV capable NICs, reduces virtualization layer costs, but comes at the expense of scalability (e.g., only a few dozen virtual devices per port). Thus current OS and HW virtualization techniques lack scalability and often suffer from performance unpredictability which can be a major concern for latency sensitive applications utilizing the edge.

Prior work has investigated portions of these problems in contexts such as cloud computing, network function virtualization (NFV), or real-time systems. Lightweight virtualization techniques based on unikernels (Manco et al., 2017) and hypervisor optimizations (Nitu et al., 2017) have been proposed to reduce boot times, but don’t address providing many isolated clients high throughput. Recent NFV platforms achieve high throughput with the use of kernel-bypass networking, but they often trade isolation for performance (Han et al., 2015; Zhang et al., 2016). Similarly, predictable performance is the hallmark of real-time systems, but these systems generally rely on conservative resource overprovisioning which is counter to the goals of an efficient edge cloud.

In this paper we explore how a clean-slate OS can provide a flexible infrastructure that can securely and efficiently support a large number of isolated services, while offering strict performance guarantees. By using a µ-kernel based design, EdgeOS provides a customizable architecture tuned for network intensive workloads, while providing stronger isolation and latency guarantees than existing approaches. EdgeOS uses a “Feather-Weight Process” (FWP) abstraction to provide fine grained isolation at low cost, with support for FWP caching to assist with fast startup under high churn.

Despite its radical design, EdgeOS is able to run several common applications, including middleboxes from the Click software router (Kohler et al., 2000) and the memcached key-value store. These network functions and endpoint servers can be flexibly combined to build complex services, while still providing strong isolation for both application and network data.

EdgeOS makes the following contributions:

  • A Feather-Weight Process abstraction built on a µ-kernel-based OS that supports orders of magnitude greater density than prior approaches.

  • Efficient mechanisms for securely communicating message data through service chains.

  • FWP chain caching to support microsecond speed initialization of complex services in high churn environments.

We have implemented EdgeOS by extending the Composite component-based operating system (Wang et al., 2015). EdgeOS integrates with the Data Plane Development Kit (DPDK) to provide high performance network I/O. Our evaluation illustrates how EdgeOS can offer dramatically better scale, density, and performance predictability than traditional approaches. We execute 1000s of FWPs per host, instantiate them 170X faster than a Linux process, and maintain a memcached latency under 1 millisecond even when running 600 isolated instances on a single host. EdgeOS provides performance on par with state-of-the-art NFV platforms, while offering stronger isolation and greater agility.

2. Motivation

In EdgeOS we consider edge clouds in the context of 5G networks, which will allow large numbers of mobile devices to connect with low latency and high bandwidth to nearby access points (Taleb et al., 2017). An access point (or perhaps a nearby telco central office (Peterson, 2015)) can contain an edge cloud site, i.e., a tiny data center offering compute and storage capabilities to connected devices. Edge clouds enable requests to be serviced, filtered, and transformed before they traverse the WAN, thus avoiding computation in a centralized cloud and/or reducing core bandwidth usage. However, given the large number of edge cloud sites, each is expected to only have a small number of servers due to space, power, and cost constraints. Since edge clouds are likely to be deployed first by telco operators, it is expected that early use cases will focus on NFV middleboxes, such as cellular mobility management, DDoS detection, etc. Here we discuss how the scale, churn, and performance requirements of edge clouds pose major challenges to existing platforms, motivating the need for a redesign of the underlying OS primitives and communication mechanisms.

2.1. Multi-tenancy and Churn

Given the increasing number of stakeholders that can benefit from edge cloud execution, the support for multi-tenant execution is critical. However, today’s common infrastructures built on containers or virtual machines may add prohibitive cost for edge workloads. Though past research has increased the agility of such infrastructures by optimizing the startup/shutdown costs (Lagar-Cavilla et al., 2009; Manco et al., 2017; Nitu et al., 2017), the overhead of creating and deleting isolated execution environments can still be significant.

The costs of inter-tenant isolation, especially with high churn – the rate of client arrival and exit – can severely limit the workloads a system can handle. A number of factors are increasing the importance of systems that can handle increased churn at the network edge. (1) serverless computing has been made popular by platforms like Amazon Lambda and the open-source OpenWhisk, and leverages transient computations without local permanent state to increase agility and consolidation, (2) middleboxes focus on doing efficient and low-latency network computation, and benefit from per-flow isolation, (3) the number of clients accessing the infrastructure is both increasing and becoming more transient with mobile computing (Manco et al., 2017), and (4) large volumes of sporadically network-connected embedded (IoT) devices are projected to generate a majority of the world’s network traffic.

Churn and isolation overheads. When new clients require isolated computation in the edge cloud, namespace, memory, and CPU isolation provide the requisite separation between tenants. Unfortunately, even relatively efficient mechanisms such as containers rely on layers of abstraction such as the Linux Virtual File System (VFS), and management of a large number of namespaces (including those for processes, network, and shared memory) that impose significant overhead. As the churn of a system increases, the overheads of container creation are amplified.

Percentile   Docker   fork()   EdgeOS
50th         521      0.26     0.048
90th         574      5.8      0.054

The table above depicts the cost in milliseconds of leveraging various isolation facilities; we measure the time to start a minimal service and then fault in 8 pages of memory to show the unpredictability of Linux’s Copy on Write (full details in Section 5.2). Using docker start can take hundreds of milliseconds due to the cost of initializing namespaces and setting up Docker metadata. Linux fork() has a much lower cost than Docker, but it still exhibits high variance, with the 90th percentile being over 20 times slower than the median. In contrast, EdgeOS improves median start time by 5X compared to Linux, and has minimal variability. As discussed in the remainder of the paper, we can improve EdgeOS by another order of magnitude by maintaining a cache of services that can be started near instantaneously. We achieve this through lightweight, yet strong isolation mechanisms, and a clean separation of the control and data paths.

2.2. Latency and Throughput at Scale

Lightweight isolation mechanisms such as containers facilitate running large numbers of applications (e.g., hundreds of Docker containers per server), but they cannot provide performance predictability as the scale rises. This leads to the second key challenge in edge infrastructures: predictable performance, particularly latency, at large scale.

Scaling isolation facilities. Unfortunately, current infrastructures suffer poor performance not only under churn, but also at high scale. Both VMs and containers see overheads due to the expense of traversing the host’s software switch to determine the appropriate destination to deliver incoming data to. This is exacerbated with new convenient, yet expensive, networking abstractions such as overlay networking provided by Docker. While an approach such as SR-IOV can provide high performance networking to VMs or containers, it does so by dedicating virtual hardware functions that are a limited resource, preventing high scalability.

Figure 1. Round-trip latency of netperf or memcached instances. Compared with the 1ms round-trip of 5G networks, netperf latencies represent a 2x/8x latency increase using one/sixteen cores, while memcached exhibits a 1000x latency increase.

To evaluate the latency behavior of today’s infrastructure, we adjust the number of netperf servers sharing a single core (netperf-SC) or spread across multiple cores (netperf-MC), and the number of memcached instances spread across multiple cores. A second, well provisioned host transmits traffic to the test server over a 10 Gbps link. The overhead, even in a prevalent and widespread system such as Linux, can be significant. Using multiple cores still cannot achieve ideal latency due to poor scalability as shown in Figure 1. Real applications such as memcached are quickly overwhelmed and can only support a hundred or fewer instances (full details in Section 5.4). This illustrates the inability of existing OS isolation mechanisms to provide fine grained performance isolation at high scale. EdgeOS is designed to support isolation with both high scalability and predictability.

3. Design

As shown in Figure 2, EdgeOS is designed around: 1) DPDK-based IO gateways that efficiently receive and send packets with kernel-bypass, 2) a Feather-Weight Process (FWP) abstraction that provides fine grained isolation at low cost, 3) a Memory Movement Accelerator (MMA) that securely copies messages between FWPs arranged in chains, and 4) a control plane that manages the FWP-based data plane by providing the high level policies, and offering management functions like FWP template caching for fast startup.

Figure 2. EdgeOS Control and Data Plane Architecture

3.1. FWPs for Lightweight Isolation

Traditional UNIX processes maintain not only memory protection using virtual address space page-tables, but also additional abstractions including file system (FS) hierarchy visibility, file descriptor namespaces, and signal status. Further, mechanisms optimized for fork performance such as copy-on-write, and for exec performance such as demand loading, add unnecessary and unpredictable overheads.

In contrast, Feather-Weight Processes (FWPs) in EdgeOS are a minimal abstraction wrapping only memory and a small set of simple kernel resources. This is partially motivated by the growing usage of stateless computation and the adoption of middlebox network functions into cloud infrastructures, which signal a growing prevalence of services that depend on external databases to store persistent state; EdgeOS optimizes around this trend. The result is a very tightly constrained execution environment that focuses mainly on the communication of messages (e.g., network packets) between many, possibly untrusting FWPs. As shown in Figure 3, FWPs have access only to their own memory (including stack and heap), memory for storing messages, and a number of communication end-points used to ask the EdgeOS system for services. Notably absent are default access to a file system, dynamic linking facilities, and high-level networking layers such as a TCP/IP stack. The relative simplicity of the FWP abstraction enables the efficient start-up and tear-down of computation in response to client demands.

Figure 3. FWPs can be middleboxes (e.g. Firewalls or Intrusion Detection Systems) or endpoints (e.g. memcached), and can be composed into chains or even replicated for every new client.
Function                                     Description
eos_postinit(fn_t callback, void *data)      Provides the function that is triggered after FWP initialization has completed
eos_receive_fn(fn_t callback, void *data)    Sets the callback function invoked on each message reception; that function is passed both the message and its source end-point
msg_t eos_recv(rcv_ep_t)                     A lower-level API for retrieving a message from an end-point (ep)
eos_send(send_ep_t, msg_t)                   Send a message to an egress end-point
msg_t eos_msg_alloc(size_t)                  Allocate a new message in message memory
eos_msg_free(msg_t)                          Free a message in message memory (eos_send is much more common)
eos_sbrk(size_t)                             Allocate local memory into the heap
Table 1. FWP Programming Interface.

Resource access control. Access to all of an FWP’s resources relies on capability-based access control (Dennis and Horn, 1983) using kernel-mediated references, removing any ambient authority (Miller et al., 2003). These resources include the message pool that is used to receive and send data, communication end-points used to trigger the message communication, and synchronous communication end-points to request operations from system-level services. These capabilities restrict messages to be only between FWPs in defined chains, which can be shared by many clients, or instantiated on demand for each new connection (see Figure 3). FWPs are provided minimal sets of resources consistent with the principle of least privilege (Saltzer and Schroeder, 1975), which, paired with strict resource management, enables the scalable execution of isolated computations for many tenants.

Programming API. Since FWPs focus primarily on processing data streams (e.g., network packets), their programming API centers on event notification and the reception and transmission of messages, summarized in Table 1. Each FWP provides a callback at initialization that is triggered upon message reception. Memory allocation functions distinguish between standard local memory (following a malloc-based interface) and message memory which is integrated with the communication system. While our current implementation uses a single thread per FWP, the underlying Composite system supports hierarchical scheduling (Parmer and West, 2011), which could be adapted for multi-threaded FWPs.
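
To make the API concrete, the sketch below shows how a trivial forwarding FWP might be written against Table 1. Only the function names and rough signatures come from the table; the fn_t callback signature, the msg_t handle, the end-point types, and the EGRESS_EP constant are illustrative assumptions.

```c
/* Sketch of a minimal forwarding FWP against the Table 1 API.
 * The typedefs below are assumptions; only the function names and
 * rough signatures come from Table 1. */
#include <stddef.h>

typedef struct msg     *msg_t;      /* opaque message handle (assumed) */
typedef unsigned int    rcv_ep_t;   /* ingress end-point id (assumed)  */
typedef unsigned int    send_ep_t;  /* egress end-point id (assumed)   */
typedef void (*fn_t)(msg_t m, rcv_ep_t src, void *data);

/* EdgeOS-provided calls (Table 1). */
void  eos_postinit(fn_t callback, void *data);
void  eos_receive_fn(fn_t callback, void *data);
void  eos_send(send_ep_t ep, msg_t m);
msg_t eos_msg_alloc(size_t sz);
void  eos_msg_free(msg_t m);
void *eos_sbrk(size_t sz);

#define EGRESS_EP 0  /* assumed: first egress end-point of this FWP */

struct fwp_state { unsigned long nmsgs; };

/* Invoked on every message delivered to this FWP's ingress end-point. */
static void on_message(msg_t m, rcv_ep_t src, void *data)
{
    struct fwp_state *st = data;
    (void)src;
    st->nmsgs++;
    eos_send(EGRESS_EP, m);   /* pass the message down the chain */
}

/* Triggered once FWP initialization has completed; per the design above,
 * the chain checkpoint captures the FWP at this point. */
static void on_ready(msg_t unused, rcv_ep_t src, void *data)
{
    (void)unused; (void)src; (void)data;
}

void fwp_main(void)
{
    /* One-time initialization: everything before eos_postinit() is
     * captured in the checkpoint and never re-executed on activation. */
    struct fwp_state *st = eos_sbrk(sizeof(*st));
    st->nmsgs = 0;

    eos_receive_fn(on_message, st);
    eos_postinit(on_ready, st);
}
```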

Rethinking processes for scalable isolation. It is important to contrast the isolation properties and programming model of FWPs with those of existing abstractions such as containers (Price and Tucker, 2004; Docker, 2018) and virtual machines (Dragovic et al., 2003). While containers rely on process abstractions for memory isolation, they add namespace partitioning and resource rate-consumption limitations (Banga et al., 1999). They rely on the system call layer and Linux’s monolithic kernel, and thus have a large Trusted Computing Base (Saltzer and Schroeder, 1975) whereby a single bug in the large kernel can compromise isolation. In contrast, virtual machine hypervisors expose an interface to virtual machines that mimics the native hardware, or is extended to include paravirtualization extensions (Dragovic et al., 2003). The hypervisor is often smaller and has a smaller attack surface compared to the extended POSIX interface of a system like Linux. Virtual machines are often scheduled by the hypervisor as a collective abstraction of their applications using a virtual CPU (VCPU), thus focusing on inter-VM isolation.

In contrast to approaches that support a standard API (e.g. POSIX or x86), the FWP abstraction focuses on minimizing the FWP API down to the bare necessities required for network intensive edge computations. The API is focused on enabling different FWPs to coordinate and compose for complex functionality – similar in concept to UNIX pipelines. In this way, EdgeOS shares the design philosophy of µ-kernels that “a concept is tolerated inside the µ-kernel only if moving it outside the kernel…would prevent the implementation of the system’s required functionality” (Liedtke, 1995), but extends it to the core edge computing primitives. EdgeOS’s system services focus on simplicity of implementation and are limited to scheduling, inter-core coordination, low-level network interfaces, FWPs, and the capability-based access control to scope access to each. The obvious downside of this approach is decreased legacy support. However, we have successfully ported the Click software router and memcached key value store to EdgeOS. Further, we have prototype implementations of POSIX unikernels (Madhavapeddy et al., 2013) (based on NetBSD rumpkernels), but a discussion of these is beyond this paper’s scope.

EdgeOS’ design departs from heavyweight VM or container abstractions to enable scale and minimize the width of the system API to increase security. Though process abstractions have often been cast aside in favor of VMs (Martins et al., 2014), containers (Zhang et al., 2016), or language-based techniques (Panda et al., 2016), EdgeOS demonstrates that simplified process abstractions, with tailored minimal APIs and focused optimizations for churn and communication, can scale to a large number of tenants and clients while maintaining strong isolation for edge computation.

3.2. Data-Plane and Communication

Receiving and transmitting packets with the NIC has traditionally required kernel intervention to manipulate the hardware. EdgeOS embraces the recent trend towards kernel-bypass to reduce this overhead by allowing user-space management of message buffers and network card DMA rings. Though FWPs have isolated local memory, the memory used for message passing between FWPs exposes a trade-off between performance and isolation. Existing high throughput systems often eschew isolation and use shared memory to pass data by reference. This is the design chosen by high-throughput networking stacks and software middleboxes (Zhang et al., 2016; Palkar et al., 2015; Panda et al., 2016). In EdgeOS, we instead copy data between separate FWPs to maintain strong mutual isolation. Data copying can be a very expensive operation as it can dirty caches. Thus, EdgeOS pairs strong isolation with Memory Movement Accelerators (MMAs) that decouple copying from the FWP fast-path.

Network Gateways. EdgeOS’s microkernel design is a natural fit for user-space packet processing frameworks such as DPDK. In and Out gateway services run on dedicated cores and pull packets into message pools with no kernel interactions. Input packet processing maintains rules dictated by the control plane to match packets to a destination FWP service. Depending on the rule specification, it may be necessary to instantiate a new FWP chain in order to handle the incoming request. FWP chains are a core abstraction in EdgeOS as the entire chain can be created to service a new client.

FWP Memory and Isolation. An FWP’s memory is separated into message memory that is used for message passing between FWPs, and local memory that backs each FWP’s data-structures. This separation enables memory allocations to be optimized for the purpose and use of the memory. Though future optimizations might relax isolation, EdgeOS focuses on strong protection between FWPs, and employs copying to safely transfer data between them. When messages are passed between FWPs, a trusted system component must be involved as neither FWP has the access rights to copy into, or from, the other FWP’s memory.

Efficient message passing with the MMA. A key EdgeOS design decision is to move message copying off the fast-path of FWP message processing, as we have observed that even a single in-line copy can prevent line-rate processing in many cases. Toward this, EdgeOS employs a Memory Movement Accelerator (MMA) whose focus is on efficiently copying messages between FWPs. The MMA retrieves messages from an upstream FWP’s ring buffers, copies them and adds them into a downstream FWP’s ring buffers, and alerts the scheduler that the destination FWP needs to be activated to receive them. The MMA acts as a software DMA engine to move message data between FWPs, and runs on one or more dedicated cores in order to perform out-of-band data movement. In contrast to long-standing networking subsystem guidance that dictates that zero-copy is necessary (von Eicken et al., 1995; Han et al., 2012) – often at the price of isolation – EdgeOS optimizes the MMA and treats it as a specialized processor that can push data significantly faster than line-rate, while maintaining strong isolation.

3.3. Control Plane and FWP Lifecycle

Similar to the approach taken in Software Defined Networks (SDN) and split-OS designs such as Arrakis (Peter et al., 2015), EdgeOS separates the data plane processing (implemented with FWPs, MMAs, and network gateways) from control functions that determine request routing, security policies, and resource management (implemented as user space components extending the Composite kernel). As shown in Figure 2, EdgeOS’s control plane is composed of three major components: (1) the EOS Controller that maps incoming flows to FWP chains, (2) the FWP Manager that controls the lifecycle of FWPs and optimizes their startup, and (3) the Scheduler that determines which FWP to run on each core and activates them in response to incoming messages.

Flow matching with the EdgeOS Controller. When new requests arrive from connected client devices, they need to be routed to the appropriate FWP chain. The EdgeOS Controller allows administrators to define FWP chains and the packet filtering rules that specify what traffic should be routed to them. These rules are pushed to the Net-In data plane component. Net-In applies rules similar to SDN match-action rules: packets are split into flows based on the header n-tuple (e.g. src/dest IP and protocol) and a rule is found that matches the flow. The rules indicate the FWP chain that will process that flow. (Our implementation currently assumes flow rules are statically preconfigured, but this could be extended to support on-demand flow lookups similar to SDN controllers, with a northbound interface to application logic that would assign a rule dynamically to each flow.) Since our focus is on fine-grained isolation and high scale, a rule can indicate whether all flows that match the rule should be handled by a single chain, or if each flow should be given a dynamically started instance of the chain.
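
A minimal sketch of how such match-action classification could look, assuming a simple masked n-tuple rule table pushed down by the Controller; the structure and helper names (flow_rule, netin_classify) are hypothetical and not the EdgeOS implementation.

```c
/* Hypothetical sketch of Net-In match-action rules: a rule maps a masked
 * header n-tuple to an FWP-chain template, plus a flag saying whether every
 * matching flow gets its own chain instance. */
#include <stdbool.h>
#include <stdint.h>
#include <stddef.h>

struct flow_key {
    uint32_t src_ip, dst_ip;
    uint16_t src_port, dst_port;
    uint8_t  proto;
};

struct flow_rule {
    struct flow_key match;      /* values to compare against        */
    struct flow_key mask;       /* zero bits are wildcarded         */
    int             chain_tmpl; /* FWP-chain template id            */
    bool            per_flow;   /* true: new chain instance per flow */
};

static bool rule_matches(const struct flow_rule *r, const struct flow_key *k)
{
    return (k->src_ip   & r->mask.src_ip)   == r->match.src_ip   &&
           (k->dst_ip   & r->mask.dst_ip)   == r->match.dst_ip   &&
           (k->src_port & r->mask.src_port) == r->match.src_port &&
           (k->dst_port & r->mask.dst_port) == r->match.dst_port &&
           (k->proto    & r->mask.proto)    == r->match.proto;
}

/* First-match lookup over rules pushed down by the EdgeOS Controller.
 * Returns the chain template id, or -1 if no rule matches. */
static int netin_classify(const struct flow_rule *rules, size_t n,
                          const struct flow_key *k, bool *per_flow)
{
    for (size_t i = 0; i < n; i++) {
        if (rule_matches(&rules[i], k)) {
            *per_flow = rules[i].per_flow;
            return rules[i].chain_tmpl;
        }
    }
    return -1;
}
```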

FWP Lifecycle and Caching. The creation of FWPs on the fly in response to the arrival of a new flow requires a cascade of activity: the instantiation of a set of new FWPs (including memory initialization, kernel data-structure management, and thread creation), connecting the FWPs together with communication channels (ring buffers, kernel end-points, and MMA integration), and finally, the creation of the message memory regions for the FWPs.

The FWP Manager orchestrates the lifecycle of FWP chains, which is illustrated in Figure 4. Similar to a Linux process, an FWP starts as an object file, which must be loaded into memory. Once execution begins, FWPs typically perform some initialization routines (e.g., parsing configuration files and allocating initial data structures). Rather than repeat such computation every time a new FWP of the same type must be instantiated, EdgeOS optimizes startup with an FWP checkpoint cache. Thus, we utilize the eos_postinit() API to allow FWPs to first initialize, then to take a checkpoint that defines the state of an FWP ready to process new data.

Since we anticipate many complex services will require multiple FWPs arranged in a chain, the Manager employs a FWP-chain cache that caches entire chains of FWPs, their interconnections, and their message memory. As new flows arrive, they are paired with corresponding FWP-chains from the cache. The selected FWPs will be Activated, allowing them to process messages or transition to the Blocked state, before eventually Terminating when they are no longer needed.

When a FWP chain terminates, the Manager reuses the chain by Restoring it back into the FWP-chain cache. In doing so, EdgeOS must guarantee that the memory of the cached computation represents the checkpointed, post-initialization state. As this places data-structures into a known and safe state, it ensures the integrity of future FWP-chain instances. EdgeOS avoids control operations in the data-path, thus the Manager’s checkpoint and restore operations run in parallel to FWP message passing. If memory pressure exists in the system, cached FWP templates and chains are Reclaimed.
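
The lifecycle can be summarized as a small state machine. The sketch below encodes the states and transitions named in Figure 4 and the text; the enum encoding and validity check are only illustrative assumptions.

```c
/* Sketch of the FWP-chain lifecycle in Figure 4 as a small state machine.
 * State names follow the prose; the encoding is an assumption. */
#include <stdbool.h>

typedef enum {
    CHAIN_OBJ_FILE,    /* object files loaded by the FWP Manager           */
    CHAIN_CACHED,      /* checkpointed after eos_postinit(), ready to run  */
    CHAIN_ACTIVE,      /* assigned to a flow, processing messages          */
    CHAIN_BLOCKED,     /* every FWP in the chain awaits messages           */
    CHAIN_TERMINATED,  /* flow done; memory must be restored               */
    CHAIN_RECLAIMED    /* dropped from the cache under memory pressure     */
} chain_state_t;

static bool chain_transition_ok(chain_state_t from, chain_state_t to)
{
    switch (from) {
    case CHAIN_OBJ_FILE:   return to == CHAIN_CACHED;     /* load + checkpoint, once */
    case CHAIN_CACHED:     return to == CHAIN_ACTIVE ||   /* data-path activation    */
                                  to == CHAIN_RECLAIMED;  /* memory pressure         */
    case CHAIN_ACTIVE:     return to == CHAIN_BLOCKED || to == CHAIN_TERMINATED;
    case CHAIN_BLOCKED:    return to == CHAIN_ACTIVE  || to == CHAIN_TERMINATED;
    case CHAIN_TERMINATED: return to == CHAIN_CACHED;     /* restore, off the data path */
    default:               return false;
    }
}
```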

Figure 4. Lifecycle of a FWP-chain: Dotted lines indicate FWP manager operations conducted once to load and then checkpoint a FWP-chain, or to reclaim the FWP’s resources when memory pressure exists. Dashed lines indicate operations to re-initialize terminated FWP-chains for future use. Solid lines are data-path operations performed on the critical path of FWP execution.

Scheduling and inter-FWP coordination. Once a set of FWPs is activated, they are distributed across cores, and partitioned scheduling (i.e. without task migrations) multiplexes each core’s processing time. Each scheduler must track which FWPs are assigned to its core, which are runnable, and which are blocked awaiting messages.

Traditional systems often use direct coordination between cores via shared data-structures and explicit notification using Inter-Processor Interrupts (IPIs). For example, Linux notifies and activates threads (via futexes or pipes) by accessing that thread’s data-structure directly to see if it is already awake, and if not, an IPI is sent. The resulting cache-coherency traffic for access to shared data-structures, followed by the IPI overheads, can be significant, especially if used for message notifications arriving over a network at line rate. FWP-chains can be spread across cores, only increasing the cost. Motivated by these overheads, NFV platforms based on DPDK such as OpenNetVM (Zhang et al., 2016) use active polling for communication between threads on different cores, thus entirely avoiding blocking. However, as the number of processes (“network functions” in OpenNetVM) grows beyond the number of cores, spin-based event notification is inefficient.

To avoid the large overheads of shared resources, all inter-scheduler coordination in EdgeOS is via message passing. When a FWP-chain is activated, a message is sent to the scheduler controlling the core hosting the FWP. Additionally, when a message is sent to a FWP and its ring buffer is empty, a message is sent to the corresponding scheduler. On the other hand, when an FWP has processed all of its pending messages, instead of spinning awaiting more, the eos_recv operation will invoke the scheduler (via IPC to the scheduler component) and ask to block.
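
A minimal sketch of this message-based coordination, assuming per-core scheduler event rings and a synchronous-IPC block call; all helper names are hypothetical.

```c
/* Sketch of message-based scheduler coordination: the MMA never touches
 * scheduler data-structures directly; it only enqueues a wake-up event on
 * the per-core scheduler's event ring. */
#include <stdbool.h>
#include <stddef.h>

typedef struct msg *msg_t;           /* assumed, as in the earlier sketch */
typedef unsigned int rcv_ep_t;

struct sched_event { int fwp_id; };  /* "make this FWP runnable" */

/* Assumed helpers: the per-core scheduler event ring written by the MMA,
 * the FWP's receive ring, and synchronous IPC into the scheduler. */
bool  sched_ring_enqueue(int core, struct sched_event ev);
msg_t recv_ring_dequeue(rcv_ep_t ep);
void  sched_block_ipc(void);

/* MMA side: after copying a message into an FWP whose receive ring was
 * previously empty, send a wake-up event to that FWP's core scheduler. */
void mma_notify(int fwp_id, int core, bool ring_was_empty)
{
    if (ring_was_empty)
        sched_ring_enqueue(core, (struct sched_event){ .fwp_id = fwp_id });
}

/* FWP side: eos_recv() returns a pending message if one exists; otherwise
 * it asks the scheduler (via IPC) to block instead of spinning. */
msg_t eos_recv(rcv_ep_t ep)
{
    msg_t m;
    while ((m = recv_ring_dequeue(ep)) == NULL)
        sched_block_ipc();
    return m;
}
```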

EdgeOS Timeline Summary. Figure 5 shows the complete timeline for receiving and processing a packet. 1) A packet reception at the Net-In gateway causes a flow lookup to decide which FWP chain should process the packet. 2) If there is a miss and no FWP is currently allocated, the FWP Manager spawns one from its cache. 3) A message is sent to the MMA causing it to copy the packet into the destination FWP’s pool. 4) A message is added to the FWP’s ring and 5) the MMA messages the scheduler on the FWP’s core to activate it. 6) The FWP processes the packet and 7) asks the output gateway to DMA the packet out of the NIC.

Figure 5. EdgeOS Timeline

4. Implementation

In this section we describe how our EdgeOS design is implemented and the key optimizations we make to achieve predictable, high performance. We plan to release our source code and experiment templates for repeatable research.

4.1. EdgeOS Implementation in Composite

Composite (composite.seas.gwu.edu) is an open source µ-kernel that externalizes traditionally core kernel features into user-level components that define the resource management and isolation policies (Wang et al., 2015). Components interact through highly-optimized Inter-Process Communication (IPC) to leverage system logic and resources. Similar to Eros (Shapiro et al., 1999) and seL4 (Elphinstone and Heiser, 2013), Composite is based on a capability-based protection model that controls component access to kernel resources. These resources include threads, communication end-points (synchronous and asynchronous), page-tables, capability-tables, temporal capabilities (Gadepalli et al., 2017), and memory frames. The kernel includes no scheduling policies, instead implementing schedulers at user-level (Parmer and West, 2008). The Composite kernel scales well across multiple cores as it has no locks and is designed entirely around store-free common-paths, wait-free data-structures, and quiescence for data-structure consistency (Wang et al., 2015).

EdgeOS builds on these underlying facilities to provide: (1) FWP management and caching capabilities, (2) a DPDK-compatible userspace networking module, (3) new communication mechanisms built around the MMA, and (4) a scheduler that is integrated with the communication and DPDK modules. EdgeOS is implemented as a component consisting of these main system modules. Co-location of these in a component is convenient and simplifies their communication, but is not necessary. Together, they provide the abstractions to execute FWPs as isolated components with only a limited number of synchronous communication channels to EdgeOS corresponding to the functions in Table 1. Thus, the attack surface of any given FWP is restricted and small.

The current Composite implementation is for 32 bit x86. Though this limits the scale of the system due to memory limitations, our prototype demonstrates the core functionality of EdgeOS. Ports to other platforms such as ARM and x86-64 are in progress by the Composite developers, and we expect EdgeOS would exhibit similar behavior on them.

4.2. Feather-Weight Process Management

Optimized FWP checkpointing. EdgeOS caches the images of chains of FWP binaries so they are ready for prompt activation. These ready-to-execute images are asynchronously prepared, thus moving the overhead for FWP preparation off the fast-path. The cache contains full FWP chains so that complete services can be quickly deployed. The cached FWPs represent their execution immediately following the eos_postinit function, thus capturing the initialized state of a ready-to-execute FWP. This avoids redundant initialization computation. For example, our Click network functions trigger the checkpoint only after loading and parsing their configuration file from disk.

However, the mechanisms to prepare FWP-chains (in the FWP Manager) still must be efficient to maintain a high churn rate. Thus, we utilize a few optimizations: (1) the post-initialization checkpoint of the FWP-chain is laid out contiguously in memory so that re-initializing a chain is bounded mainly by memcpy and memset overheads (for which we use unoptimized musl libc versions), (2) we do not reclaim – and thus later re-allocate – heap memory from terminated FWPs, instead only zeroing it out and using it to satisfy future eos_sbrk calls, and (3) we reuse the threads active in each FWP, only resetting their instruction pointer to the appropriate post-initialization execution point, which avoids thread allocation and scheduling overheads beyond suspending the thread. These optimizations culminate in a system that can handle exceedingly high churn at scale – FWP chain initialization converges on memcpy overheads, and chain activation in response to a new client takes low tens of microseconds.
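
A sketch of what the restore path described above might look like, assuming a contiguous checkpoint image, a persistent eos_sbrk() heap, and a helper that rewinds a thread’s instruction pointer; the structure and helper names are assumptions.

```c
/* Sketch of the cached-chain restore path: copy the contiguous post-init
 * checkpoint back over the chain's memory, zero the heap handed out via
 * eos_sbrk(), and rewind each FWP thread to its post-initialization entry
 * point. */
#include <string.h>
#include <stddef.h>

struct fwp_chain {
    void   *mem;            /* contiguous live image of the whole chain    */
    void   *checkpoint;     /* contiguous copy taken after eos_postinit()  */
    size_t  img_sz;
    void   *heap;           /* eos_sbrk() region, kept across activations  */
    size_t  heap_used;
    int     nthreads;
    int     thd_ids[8];
};

/* Assumed kernel/scheduler helpers: park a thread with its instruction
 * pointer at the checkpointed post-initialization address. */
void  thd_reset_to(int thd_id, void *entry_ip);
void *chain_entry_ip(const struct fwp_chain *c, int i);

void chain_restore(struct fwp_chain *c)
{
    /* Re-initialization cost converges on memcpy/memset of the image. */
    memcpy(c->mem, c->checkpoint, c->img_sz);
    memset(c->heap, 0, c->heap_used);         /* heap reused, never freed */
    c->heap_used = 0;

    for (int i = 0; i < c->nthreads; i++)     /* threads reused, not recreated */
        thd_reset_to(c->thd_ids[i], chain_entry_ip(c, i));
}
```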

FWP scheduling. We specialize the user-level scheduling policies within EdgeOS to manage untrusted FWPs that require low-latency computations. The scheduling policy aims to prevent any FWP from monopolizing the CPU, and from interfering with the progress of other FWPs. Additionally, as all scheduling operations represent overhead that can impact system throughput, they must be as rare and efficient as possible, while maintaining inter-FWP isolation. Given these goals, in the current work we focus on simplicity in the scheduling policy, and the careful usage of timer interrupts to balance each FWP’s progress with scheduler overhead.

Each core separately schedules the FWPs assigned to it using a fixed-priority, round-robin scheduling policy. The quantum chosen to preempt an executing FWP is specifically calibrated to enable the average FWP to complete its execution cooperatively (thus avoiding timer overheads), and round-robin prevents starvation. To implement this, user-level schedulers use the kernel’s facilities to dispatch to a thread and pass the time that the next timer interrupt should fire. We use modern x86 processor local-APIC support for specifying one-shot timer interrupts with cycle-accuracy (called “TSC Deadline Timers” in Intel documents). Each scheduler receives messages from the MMA to activate its FWPs.
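
A simplified sketch of the per-core scheduling loop, assuming a kernel dispatch call that also arms a one-shot TSC-deadline timer; the quantum value and helper names are placeholders, and the fixed-priority dimension is omitted for brevity.

```c
/* Sketch of the per-core scheduler loop: round-robin over the FWPs assigned
 * to this core, dispatching each with a one-shot timer set one quantum in
 * the future (TSC-deadline style). */
#include <stdint.h>

#define QUANTUM_CYCLES (40 * 1000ULL)  /* placeholder: calibrated so the
                                          average FWP finishes cooperatively */

uint64_t rdtsc_now(void);                                  /* assumed TSC read */
void     kern_dispatch(int thd_id, uint64_t timeout_tsc);  /* dispatch + arm
                                                              one-shot timer   */
int      runqueue_next(void);          /* next runnable FWP thread, -1 if none */
void     drain_sched_events(void);     /* consume MMA wake-up events           */

void sched_loop(void)
{
    for (;;) {
        drain_sched_events();          /* mark newly-activated FWPs runnable */
        int thd = runqueue_next();
        if (thd < 0)
            continue;                  /* nothing runnable on this core */
        /* Switch to the FWP; the kernel returns to us when the FWP blocks
         * via eos_recv() or the one-shot timer preempts it. */
        kern_dispatch(thd, rdtsc_now() + QUANTUM_CYCLES);
    }
}
```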

The simplicity of the scheduling policy and our optimized use of timer interrupts together enable the necessary efficiency for line-rate computations, while guaranteeing progress and performance predictability in spite of the large-scale, multi-tenant environment.

4.3. Message Pool Management

To support multi-tenancy, FWPs provide isolation for local memory, CPU processing, and access to system resources. However, message pool management provides both inter-FWP isolation and coordination.

Ring-buffers for both coordination and memory management. Each FWP’s message pool is associated with two ring buffers that track message transmission and reception as well as the allocation and deallocation of messages. These ring buffers are similar to NIC DMA ring buffers. However, unlike traditional driver ring buffers, EdgeOS observes that (1) general purpose memory allocation facilities (malloc/free) can have significant overhead at high message arrival rates and complicate the coordinated memory management between the MMA and FWPs; thus (2) the ring buffers are organized to track not only incoming and outgoing messages, but also free memory.

A reception ring buffer contains a set of references to message slots into which incoming data can be copied, and the transmission ring buffer contains references to messages to move downstream in the FWP chain. The MMA dequeues messages from an FWP’s transmit ring, copies the data, and enqueues a message in the recipient’s ring. In this way, the MMA acts directly as a software DMA accelerator between FWPs. Each ring buffer entry has a set of bits that tracks the state of the entry: transmit – ready to send the message, receive – an empty message to copy incoming data into, ready – a populated message ready for processing, free – ready to be reallocated by the FWP, or unused – an unused ring buffer entry (with an ignored pointer). Thus, the MMA transitions entries in transmit rings from the transmit to the free state after copying the message, thus signaling that the message can be reused; and it transitions receive ring entries from receive to ready after copying data into the message.

Message pools are managed by FWPs as a span of MTU-sized message slots, and unlike traditional NIC DMA ring buffers, the ring buffers include an entry for each message slot. When a message arrives in a message pool, the FWP dequeues it from its receive ring – transitioning the ring entry from ready to unused – processes it, and later adds it to the transmission ring buffer – transitioning the entry from unused to transmit. FWPs must maintain a sufficient number of messages in reception rings in the receive state to compensate for the scheduling latencies due to multiplexing the CPU among many FWPs. Thus, after an FWP finishes processing pending messages, it will move freed messages from the transmit ring (free → unused) into the reception ring (unused → receive). In this way, message liveness is managed indirectly through the ring buffers.
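
A sketch of the per-entry state bits and the FWP-side loop over its two rings, following the transitions described above; the concrete ring layout, cursor scheme, and flow control are simplifying assumptions.

```c
/* Sketch of ring entries with per-slot state bits, and the FWP-side polling
 * loop that consumes ready messages and recycles freed ones. */
#include <stdint.h>

enum slot_state {
    SLOT_UNUSED,   /* entry not in use (message pointer ignored)     */
    SLOT_RECEIVE,  /* empty message the MMA may copy into            */
    SLOT_READY,    /* populated message awaiting FWP processing      */
    SLOT_TRANSMIT, /* message ready for the MMA to copy downstream   */
    SLOT_FREE      /* copied out by the MMA; recyclable by the FWP   */
};

#define RING_SLOTS 256u                     /* one entry per pool message */

struct ring_entry { void *msg; volatile enum slot_state state; };
struct ring       { struct ring_entry e[RING_SLOTS]; uint32_t cons, prod; };

void process_msg(void *msg);                /* FWP-specific handler */

static void ring_push(struct ring *r, void *m, enum slot_state st)
{
    struct ring_entry *e = &r->e[r->prod++ % RING_SLOTS];
    e->msg   = m;                           /* flow control omitted */
    e->state = st;
}

void fwp_poll(struct ring *rx, struct ring *tx)
{
    struct ring_entry *e;

    /* Consume populated messages (ready -> unused), then hand each
     * processed message to the MMA (unused -> transmit). */
    while ((e = &rx->e[rx->cons % RING_SLOTS])->state == SLOT_READY) {
        process_msg(e->msg);
        e->state = SLOT_UNUSED;
        rx->cons++;
        ring_push(tx, e->msg, SLOT_TRANSMIT);
    }

    /* Recycle slots the MMA has copied out (free -> unused) back into the
     * receive ring (unused -> receive), so the MMA always finds empty
     * messages despite FWP scheduling latency. */
    while ((e = &tx->e[tx->cons % RING_SLOTS])->state == SLOT_FREE) {
        e->state = SLOT_UNUSED;
        tx->cons++;
        ring_push(rx, e->msg, SLOT_RECEIVE);
    }
}
```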

Message pools and isolation. The ring buffer design decouples the message memory from the meta-data that coordinates data movement and liveness between FWPs and the MMA. In doing so, EdgeOS avoids lock-based protection of the rings, instead relying on wait-free mechanisms that guarantee execution progress of both FWPs and the MMA. This minimizes coherency overheads in ring coordination, and avoids critical sections that could starve the MMA. Additionally, it enables FWPs to have more restrictive access rights to the pool than the ring buffer, for example, providing integrity by mapping the pool read-only.

4.4. Memory Movement Accelerator

Our initial experiments showed that naively copying packets between stages in a DPDK-based NFV pipeline decreased throughput by more than 50%. However, we also found that a core devoted to data movement has a throughput of around 30 Gb/s, which is sufficient for line-rate. By using the parallelism of the underlying processor and specializing cores to run the MMA, we achieve both isolation and high throughput by taking message movement out of the critical path.

The MMA has read-write access to all message pools. It maintains the mapping between pairs of transmit and receive ring buffers and their associated pools, and continuously iterates through all such pairs, transferring messages whenever it finds a pending transmission. The MMA provides two essential services: data-movement by copying transmitted messages, and event notification of the receiving FWPs. The MMA’s FWP event notification is efficient as it simply sends a message to the scheduler controlling the FWP’s core. Though the current system uses only a single MMA, more cores can be devoted to this task should more memory movement throughput be required in the future.
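
A sketch of the MMA’s main loop, reusing the ring and slot-state definitions from the Section 4.3 sketch; the pair table layout and helper names are assumptions.

```c
/* Sketch of the MMA main loop: iterate over all (upstream transmit ring,
 * downstream receive ring) pairs, copy any transmitted messages into the
 * downstream pool, and notify the downstream scheduler. */
#include <string.h>
#include <stddef.h>

struct mma_pair {
    struct ring *up_tx;      /* upstream FWP's transmit ring            */
    struct ring *down_rx;    /* downstream FWP's receive ring           */
    int          down_fwp;   /* downstream FWP id                       */
    int          down_core;  /* core whose scheduler must be notified   */
};

/* Assumed helpers over the rings and the scheduler event rings. */
struct ring_entry *ring_next_transmit(struct ring *tx);  /* TRANSMIT entry or NULL */
struct ring_entry *ring_next_receive(struct ring *rx);   /* RECEIVE entry or NULL  */
size_t             msg_len(const void *msg);
void               sched_wakeup(int core, int fwp);

void mma_loop(struct mma_pair *pairs, size_t npairs)
{
    for (;;) {                                  /* dedicated core: poll forever   */
        for (size_t i = 0; i < npairs; i++) {   /* array layout aids prefetching  */
            struct mma_pair   *p = &pairs[i];
            struct ring_entry *in, *out;
            int delivered = 0;

            while ((in  = ring_next_transmit(p->up_tx)) &&
                   (out = ring_next_receive(p->down_rx))) {
                /* The copy is the isolation boundary: neither FWP can map
                 * the other's pool; only the MMA can. */
                memcpy(out->msg, in->msg, msg_len(in->msg));
                out->state = SLOT_READY;        /* receive -> ready           */
                in->state  = SLOT_FREE;         /* transmit -> free (recycle) */
                delivered++;
            }
            if (delivered)
                sched_wakeup(p->down_core, p->down_fwp);  /* event, not an IPI */
        }
    }
}
```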

MMA optimizations. The MMA is on the data-path of all FWP interactions, including message reception, thus it must be able to move messages at faster than line rate. The MMA iterates through all FWP transmit rings, and (1) copies data between message pools while updating rings, and (2) activates the downstream FWP by sending an event (through a ring buffer) to the scheduler on that FWP’s core. The data-structures linking transmit and reception rings are laid out in an array to leverage the processor’s prefetcher as the MMA iterates over them. The initial implementation of the operations on the ring buffers was straight-forward, but cache-coherency overheads, possibly incurred for each ring entry, hurt throughput. To address this, we added the following optimizations:

  • Double-cache-line (128B) caches are added to both the enqueue and dequeue operations (a sketch of this batching follows this list). These caches are in local memory outside of the ring, thus their modifications are free of coherency traffic. Transmitting a message adds it to the transmit queue cache, and only when it is full is it flushed to the ring buffer. This batches what would be eight separate ring updates into essentially a single memcpy of 128 bytes. To prevent cached entries that have not yet been transferred into the ring from having delayed (or starved) processing, when an FWP has completed processing and is going to block, it flushes its cache to its transmit ring buffer. Similarly, when the ring buffers are dequeued, entries are copied out into a double-cache-line cache, and subsequent accesses first check the cache. The caches are 128B to match the Intel policy of fetching double-cache-lines at a time.

  • These caches enable messages to be viewed in batches. This enables a second optimization to use explicit software prefetch instructions to load all referenced messages into the core’s cache. This optimization is particularly effective as the processing of the messages is temporally proximate.

  • Naming of different messages uses direct virtual addresses. Though the MMA is isolated from FWPs, they share a single virtual address space (Chase et al., 1992; Druschel and Peterson, 1993). To maintain protection, all local memory for both the MMA and each FWP is isolated and uses overlapping addresses, and when the MMA and FWPs pass a message, they validate that it lies within the message pool’s boundaries.
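
A sketch of the double-cache-line transmit cache described in the first bullet, assuming 16-byte ring entries (eight per 128 B batch) and a ring_reserve() helper; both are illustrative assumptions rather than the actual EdgeOS layout.

```c
/* Sketch of the 128 B transmit cache: ring updates are staged in FWP-local
 * memory and flushed to the shared ring in batches, so coherency traffic is
 * paid once per batch rather than once per entry. */
#include <string.h>
#include <stdint.h>

struct ring;                                          /* shared ring (opaque here) */

struct tx_entry { void *msg; uint64_t state; };       /* assumed 16 B entries */
#define TX_BATCH (128 / sizeof(struct tx_entry))      /* 8 entries = 128 B    */

struct tx_cache {
    struct tx_entry ent[TX_BATCH];   /* local memory: no coherency traffic */
    unsigned        n;
};

/* Assumed helper: reserve n contiguous slots in the shared transmit ring. */
struct tx_entry *ring_reserve(struct ring *tx, unsigned n);

static void tx_cache_flush(struct tx_cache *c, struct ring *tx)
{
    if (!c->n)
        return;
    /* One 128 B memcpy replaces up to eight individual ring writes. */
    memcpy(ring_reserve(tx, c->n), c->ent, c->n * sizeof(struct tx_entry));
    c->n = 0;
}

void tx_cache_send(struct tx_cache *c, struct ring *tx, void *msg)
{
    c->ent[c->n++] = (struct tx_entry){ .msg = msg, .state = SLOT_TRANSMIT };
    if (c->n == TX_BATCH)
        tx_cache_flush(c, tx);
}

/* Called when the FWP has drained its work and is about to block, so cached
 * entries are never delayed or starved. */
void tx_cache_on_block(struct tx_cache *c, struct ring *tx)
{
    tx_cache_flush(c, tx);
}
```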

These optimizations contribute to EdgeOS’ high message throughput. However, should they be insufficient due to too many FWPs or long chains, the MMA can trivially partition the ring buffers and thus scale across multiple cores.

4.5. Network Interface Integration

EdgeOS uses DPDK for direct access to the NIC via kernel-bypass. Our port of DPDK to EdgeOS is conducted mainly as a new Environment Abstraction Layer (EAL), thus minimizing the impact on the DPDK code-base. DPDK transmits and receives packets via the MMA, but, unlike other FWPs, it has a number of heightened privileges. DPDK is used in poll-mode: we devote a core to polling for and receiving packets, and another to transmitting them.

Packet reception. Incoming packets are demultiplexed to their corresponding FWPs via the flow mapping facilities in DPDK. In this way, EdgeOS has mechanisms to maintain the mappings of IPs and ports to specific FWP-chains, but we leave the policy of creating those mappings to a cluster manager such as Kubernetes or a Software-Defined Networking (SDN) controller. If a flow maps to an FWP-chain that is not yet active, a chain is retrieved from the FWP-chain cache, and activated. The FWP-chain cache is populated with FWP chains by the FWP manager.

DPDK packet pools are treated as EdgeOS message pools, and the MMA copies packets into downstream FWPs. Intelligent hardware with flow direction built in might enable zero-copy here (Sharma et al., 2017), and EdgeOS could be modified to use this support in the future. Flows that map to an active FWP-chain are placed in a message pool transmit ring buffer, and the MMA copies the data accordingly.
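
A sketch of the Net-In polling path under these assumptions: the DPDK call (rte_eth_rx_burst()) is real, while flow_lookup(), chain_cache_activate(), and chain_enqueue() are hypothetical EdgeOS helpers standing in for the flow table, the FWP-chain cache, and the MMA hand-off.

```c
/* Sketch of the Net-In polling path: pull packets with kernel-bypass,
 * classify each to an FWP chain, activate a cached chain on a miss, and
 * queue the packet for the MMA to copy into that chain's pool. */
#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define RX_BURST 32

/* Assumed EdgeOS helpers. */
int  flow_lookup(const struct rte_mbuf *m);          /* active chain id, or -1 */
int  chain_cache_activate(const struct rte_mbuf *m); /* pull chain from cache  */
void chain_enqueue(int chain, struct rte_mbuf *m);   /* hand to MMA for copy   */

void netin_poll(uint16_t port, uint16_t queue)
{
    struct rte_mbuf *pkts[RX_BURST];

    for (;;) {                                        /* dedicated polling core */
        uint16_t n = rte_eth_rx_burst(port, queue, pkts, RX_BURST);
        for (uint16_t i = 0; i < n; i++) {
            int chain = flow_lookup(pkts[i]);
            if (chain < 0)                            /* miss: activate a cached chain */
                chain = chain_cache_activate(pkts[i]);
            chain_enqueue(chain, pkts[i]);            /* MMA copies, then mbuf is freed */
        }
    }
}
```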

Packet transmission. A final optimization avoids a packet copy on the transmit path. When the last FWP in the chain transmits to DPDK, the MMA omits the copy, and instead enables DPDK to add a direct reference to the packet to its own DMA ring buffers. Later, when the NIC signals the successful transmission of the packet, DPDK signals the transmission to the message pool so that the packet can be reclaimed.

Figure 6. (a) EdgeOS provides substantially better latency, and reduced jitter compared to Linux processes and NFV platforms like OpenNetVM and ClickOS. (b) Throughput of each system with different packet sizes. (c) EdgeOS provides isolation and adds negligible overhead compared to OpenNetVM (no isolation) for different chain lengths with messages of size 64 and 1024 bytes.

5. Evaluation

All experiments are run on CloudLab Wisconsin c220g1 series nodes. These are two-socket nodes with 8-core Intel Xeon E5-2630 v3 processors @ 2.40GHz and 128GB ECC Memory (8x 16 GB DDR4 1866 MHz dual rank RDIMMs). Systems are connected via Dual-port Intel X520-DA2 10Gb NIC (PCIe v3.0, 8 lanes) networking cards. For EdgeOS we use less than 1GB of the system’s memory due to the underlying µ-kernel’s 32-bit address space limitations.

5.1. Latency and Throughput

We first evaluate the latency and performance predictability of EdgeOS compared to other high performance networking platforms. Figure 6(a) shows the response time distribution (in microseconds) for an ICMP ping response Click (Kohler et al., 2000) element implemented as either: a DPDK process, an OpenNetVM NF (ONVM), a standard Linux process with kernel-based IO, a ClickOS NF in a Xen VM, or an FWP in EdgeOS. The results show that EdgeOS significantly outperforms all of these techniques (by up to 3.8X in average latency), except for DPDK. DPDK is slightly better because it can run only a single service at a time and thus does not need to copy packets from the initial receive DMA ring to a separate pool. In contrast, EdgeOS provides a platform to potentially run thousands of distinct services, and thus needs to offer stronger isolation via copying.

Figure 6(b) shows the maximum throughput of different approaches when forwarding traffic from pktgen, a high speed packet generator. EdgeOS again provides better performance than ClickOS, while offering stronger isolation than DPDK and ONVM, which rely on globally shared memory pools for zero-copy IO.

Next we evaluate the performance of EdgeOS communication by comparing with ONVM. We run a chain of NFs on the same core, each forwarding small (64B) or large (1024B) packets, so both systems incur context switch overhead when passing a packet to the next NF. In addition, EdgeOS has copying overhead from the MMA to enforce isolation. The results in Figure 6(c) show that as the chain length increases, 64B packet throughput drops for both EdgeOS and ONVM, though due to different overheads. The main overhead in EdgeOS is data copying, while Linux context switch and scheduling overheads dominate in ONVM. When the chain length is smaller than 3, the overhead of copying is less than 8%, and EdgeOS outperforms ONVM for longer chains as the Linux system overheads increase. The throughput with 1024B packets maintains line rate for both systems when the chain length is smaller than 6, but EdgeOS sees a throughput decrease for longer chains as one MMA is not able to handle copies for all FWPs.

Figure 7. EdgeOS provides orders of magnitude better startup time than other approaches and does not suffer from scalability problems when starting larger numbers of FWPs.

5.2. Startup Time

FWP Initialization and Activation. In Linux, initializing a process involves calling fork (and possibly execve). For Docker containers, a docker run command is similar, but includes additional system calls to configure namespaces and maintain container metadata. In order to optimize the fast path of readying a cached FWP, EdgeOS separates creation from activation. For EdgeOS, creation involves transitioning from the Object File to Cached state in Figure 4, including setting up page tables, capability tables, and thread creation. We record the start time for 10,000 iterations of starting a container, process, or FWP and report the median in Figure 7(a). Note the log scale; we use median time values since, as described below, container creation becomes slower over time, so the average is skewed by these outliers. We compare against two variants of Linux processes: ”fork + exec” loads a different binary whereas ”fork + faults” mimics loading the service’s working set by issuing writes to 8 different pages to trigger page faults. These approaches are 5-20X slower than the comparable ”EOS create” approach (dashed lines in Figure 4).

Once an FWP has been created, EdgeOS keeps copies of it in a cache which can be quickly activated on demand (solid lines in Figure 4). Cached activation improves EdgeOS performance by another order of magnitude, allowing new processing entities to be instantiated in 6.2 microseconds. Figure 7(b) presents a CDF of these approaches, including the activation cost for starting a full chain of 10 FWPs, which remains an order of magnitude faster than fork+exec.

FWP Scalability. Further, we have found that containers suffer from poor scalability – as the number of containers rises, the start time worsens. Similar behavior has been shown previously for virtual machines (Manco et al., 2017). In Figure 7(c) we show the time to start a new container, exec a process, or activate an FWP, when up to 2200 are started incrementally. The Container case gradually drifts upward before hitting a step after 2000 containers. The cost of starting the last container is 1.368 seconds versus 0.467 seconds for the first. The standard deviation for containers is 236 ms versus only 0.08 ms for FWP activation. As long as sufficient FWPs are available in the template cache, EdgeOS provides nearly constant start time regardless of scale; if additional templates are needed, the FWP manager can create them in parallel with the data path. The EdgeOS line has a few outlier points (11 out of 15K measurements are at 2ms), which we believe to be caused by Non-Maskable Interrupts, or a bug in our scheduling logic.

5.3. Isolation

Just in Time Service Instantiation. To evaluate the impact of client churn in edge environments, we mimic an experiment from the LightVM paper (Manco et al., 2017). Clients send requests to an EdgeOS based service at a configurable interval, and we assume that each new client request requires its own FWP to be instantiated. The new FWP receives the incoming packet, produces a reply, and then terminates, representing a worst case churn scenario. Figure 8 shows a response time CDF for EdgeOS under different client arrival patterns. The results show that even when a new client arrives every millisecond, 90% of requests are serviced within 50 microseconds. Although we have not been able to successfully run the LightVM software on our testbed, we note that their paper reports a 90th percentile response time of 20 milliseconds (more than 400X worse) with clients arriving 10 times less frequently (10ms interval). The EdgeOS performance advantage comes from our extremely lightweight FWP abstraction and our template cache that allows nearly instant instantiation.

Figure 8. EdgeOS just in time service instantiation for mobile clients with varying client inter-arrival rates.
Figure 9. Routing and processing latency for routing netperf traffic for an increasing number of clients.
Figure 10. Single memcached instance on one core.
Figure 11. Multiple memcached instances (1 per client) on 16 cores.

Multi-Tenancy and Customer Isolation. An important job of edge-cloud systems is acting as a middlebox to route a subset of requests to the cloud. Figure 9 depicts the latency of processing and routing requests between netperf client and server machines for an increasing number of concurrent clients. We use three nodes, two running netperf clients and servers, and the third running EdgeOS or ONVM in the middle to act as the middlebox. The systems run either a single firewall to filter flows or a 2 FWP chain of firewall plus monitor, all implemented in Click, to further maintain statistics about flows. Each customer is serviced by its own separate firewall or chain, and is thus isolated from the others. We measure the middlebox latency overhead (i.e., the added cost versus direct client/server connections from Figure 1) as we increase the number of clients, and thus the number of FWPs (EdgeOS) and Network Functions (NFs in ONVM).

Though ONVM is a highly optimized middlebox infrastructure, it relies on containers and expensive coordination mechanisms between NFs and the management layer. Because of this, ONVM cannot scale past around 820 containers or 410 chains, and the added latency rises quickly with each new client. FWPs enable the system to scale past 2000 customers with an average latency increase of only around 0.3 µs per additional client. Chaining in EdgeOS adds negligible latency overhead thanks to our efficient scheduler notification and context switch, while ONVM sees an increasing gap since it relies on Linux’s more heavyweight futexes and its underlying scheduling.

5.4. Memcached

Finally, we evaluate how EdgeOS can provide a platform for low latency endpoint applications. We implement an FWP capable of parsing memcached UDP requests and use it to replace the standard socket interface in the memcached server. The EOS controller can then be used to map incoming requests either to a single memcached FWP (e.g., representing an edge cloud data cache) or one FWP per client (e.g., representing private data stores for edge-connected IoT devices). We compare EdgeOS against Linux, either using a single memcached server or multiple. Our workload, inspired by (Nishtala et al., 2013), uses 135 byte value sizes and a 95% get, 5% set request mix generated by the mcblaster client.

Single instance. Figure 10 shows the throughput and latency when all clients connect to a single memcached server instance pinned to one core. We use a 16-core server as the client, running one mcblaster process per core to ensure the client will not be the bottleneck. Each client process sends requests at a configurable rate and we report the aggregate throughput and average latency of successful requests (i.e., dropped requests do not impact latency). From Figure 10 we see that EdgeOS can support a throughput of up to 1.4 million requests per second, a nearly 5X increase compared to Linux. The response time of EdgeOS is also substantially lower than Linux, and it can handle 8X the client request rate before seeing an increase in latency. With very low client request rates, both systems perform similarly because EdgeOS cannot take advantage of batching. Since Linux is not able to keep up, it drops a large number of requests, e.g., 5.2% at a 320K req/sec client rate. In contrast, EdgeOS does not see any request drops at a 1.2M req/sec client rate.

Multiple instances. We next run a scalability test in which each client is paired with its own memcached instance, running either as a Linux process or as an FWP. The Linux processes are started in advance, whereas each FWP must be activated from the cache upon the first request from its client. Each client sends at a fixed rate of 10 Mbit/s, and the clients are distributed across four hosts to prevent them from becoming the bottleneck. We distribute the memcached server instances evenly across the available cores of the host server: for Linux all 16 cores are available, whereas for EdgeOS 12 cores run FWPs and 4 are reserved for system services. As we increase the number of clients, the aggregate request rate rises, with Linux hitting its peak throughput at 300 memcached clients and servers. EdgeOS scales substantially further, reaching a maximum throughput of over 4M req/sec with 800 clients. The Linux server, overwhelmed by the number of memcached processes, exhibits response latencies of nearly 1 second, whereas EdgeOS maintains an average latency below 1 millisecond for up to 600 memcached instances. The latency CDF shows that even with only 100 memcached instances Linux has much higher tail latency than EdgeOS, and with 800 instances Linux's tail latency is more than three orders of magnitude worse. Keep in mind that these latency metrics ignore dropped requests: with 800 instances, EdgeOS drops 13% of requests whereas Linux drops 66%.

TCP vs UDP. Our memcached implementation is based on UDP since EdgeOS does not provide a TCP stack. Adding a high performance DPDK-based TCP stack (Jeong et al., 2014) would be straightforward, and we expect the performance difference between Linux and EdgeOS to grow even larger in this case. In the current UDP implementation, the high drop rate seen in Linux has no impact on its throughput or latency, but with TCP this would trigger congestion control and retransmissions, leading to even worse performance.

6. Related Work

Scalable multi-tenant isolation. Significant research addresses the increasing churn seen in serverless computing (Lagar-Cavilla et al., 2009; Madhavapeddy et al., 2015; Nitu et al., 2017; Manco et al., 2017) by decreasing the startup and teardown costs of virtual machines. Lightweight systems such as unikernels (Manco et al., 2017; Madhavapeddy et al., 2013) further increase the agility of these approaches. In contrast, EdgeOS is motivated by the potentially enormous churn and large-scale isolation requirements of the edge cloud, which provides service to transient mobile and IoT devices. The FWP abstraction and activation from the FWP-chain cache provide low-overhead isolation, while message pools enable efficient communication, together handling this unprecedented churn. Denali (Whitaker et al., 2002) separates the protection provided by a VMM from the abstractions within the VM, enabling lightweight VM contexts that scale from tens to low hundreds of VMs. EdgeOS focuses on extremely fast FWP activation times for on-the-fly instantiation, and on MMA-coordinated communication through chains of FWPs to enable service composition across multiple tenants. Several projects have increased the efficiency of containers by specializing the environment for faster boot-up. Cntr (Thalheim et al., 2018) includes only the application-specific context in a container, while SOCK (Oakes et al., 2018) specializes the container to use efficient kernel operations and pairs a Zygote mechanism with a cache to accelerate container creation for stateless computations. For instantiating isolated edge computations, EdgeOS compares favorably even to forking minimal Linux processes (two orders of magnitude faster start time), which is a lower bound for many of these techniques: their startup latencies are in the milliseconds, versus tens of microseconds for FWPs. Additionally, due to FWP optimizations, EdgeOS maintains significantly lower edge-application latencies at scale than Linux (100s of microseconds versus roughly a second for memcached).

Lightweight isolation. Wedge (Bittau et al., 2008), Light-weight Contexts (Litton et al., 2016), and SpaceJMP (El Hajj et al., 2016) expand the UNIX interface with lightweight facilities for controlling and changing protection domains. Similarly, Dune (Belay et al., 2012) uses hardware virtualization support to provide user-level control over page tables, and dIPC (Vilanova et al., 2017) uses hardware support to bypass the kernel during inter-protection-domain communication. EdgeOS instead relies on a highly optimized μ-kernel's core support for secure and efficient control-flow management, protection domains, and capability-based access control. We target abstractions that support immense churn rates and efficient communication with complete isolation via the MMA. To efficiently use the limited resources in the edge cloud, EdgeOS leverages this support to scale to more than two thousand FWPs in less than 1GB of RAM while maintaining line-rate communication.

User-level, high-performance networking. User-level network processing has long been proposed (von Eicken et al., 1995) to better utilize hardware and reach line-rate throughput. Isolation is provided when it is paired with early demultiplexing of network packets (Tennenhouse, 1989; Engler et al., 1995). Shared memory for zero-copy communication and batched processing have pushed these techniques to gigabit-rate networking (Belay et al., 2014; Rizzo, 2012; Han et al., 2012). DPDK and other kernel-bypass techniques have also pushed middlebox network function processing effectively into VMs (Martins et al., 2014) and containers (Zhang et al., 2016). EdgeOS expands on these techniques by integrating them with large-scale multi-tenancy via the MMA and with the strong isolation of FWPs. NetBricks (Panda et al., 2016) implements network processing functions in a memory-safe language (Rust), relying on software isolation within a single thread. EdgeOS uses the parallelism of the underlying hardware and the MMA to maintain memory safety, and additionally provides temporal isolation by executing all FWPs in separate threads that are explicitly scheduled by the runtime.

Hardware NIC demultiplexing. Hardware-based early demultiplexing of network packets has enabled isolated, high-performance library-based system services (Peter et al., 2015). While this avoids shared memory pools, it relies on NIC support for multiple queues to isolate principals. Such support is limited; for example, the common Intel 82599 chipset for 10Gbps NICs supports only up to 128 queues (intel, [n. d.]). Intelligent NICs take this idea further by supporting demultiplexing at higher fidelity (Sharma et al., 2017). EdgeOS must support the scalability required by the multi-tenant edge cloud, and thus uses software techniques to safely demultiplex packets by devoting cores to act as MMAs, without relying on specialized hardware. Our results show that the system maintains line rate despite using these software accelerators.

7. Conclusions

The increasing prevalence of mobile computation and the Internet of Things requires both scalable isolation facilities for multi-tenancy at the edge and the agility to handle high churn. This paper has described EdgeOS, an OS for edge-cloud computation that introduces the Featherweight Process abstraction for low-overhead isolation, paired with a cache of post-initialization checkpointed FWP chains that provides the microsecond-scale activation times necessary to handle high churn. Communication is accelerated by a specialized core devoted to moving messages between FWPs while maintaining isolation.

We show that, for middlebox computations, EdgeOS provides more than a 3.8X reduction in ping latency and more than a 2X throughput increase compared to ClickOS, a system that also provides isolated computation. More importantly, EdgeOS can create FWPs for client computation in 25-50 microseconds, even when they are created every millisecond, and it scales to over 2000 FWPs while maintaining low latency, even with a very limited amount of memory. For edge applications such as memcached, EdgeOS reduces latency by more than three orders of magnitude when running over 300 server instances simultaneously. We believe that EdgeOS paves the way for closely integrating the edge cloud into, and augmenting the capabilities of, the growing population of mobile and embedded devices.

References

  • Banga et al. (1999) Gaurav Banga, Peter Druschel, and Jeffrey C. Mogul. 1999. Resource containers: a new facility for resource management in server systems. In OSDI ’99: Proceedings of the third symposium on Operating systems design and implementation. USENIX Association, Berkeley, CA, USA, 45–58.
  • Belay et al. (2012) Adam Belay, Andrea Bittau, Ali Mashtizadeh, David Terei, David Mazières, and Christos Kozyrakis. 2012. Dune: Safe User-level Access to Privileged CPU Features. In Proceedings of the 10th USENIX Symposium on Operating Systems Design and Implementation (OSDI’12), Hollywood, CA, USA, October 8-10.
  • Belay et al. (2014) Adam Belay, George Prekas, Ana Klimovic, Samuel Grossman, Christos Kozyrakis, and Edouard Bugnion. 2014. IX: A Protected Dataplane Operating System for High Throughput and Low Latency. In Proceedings of the 11th USENIX Conference on Operating Systems Design and Implementation (OSDI).
  • Bittau et al. (2008) Andrea Bittau, Petr Marchenko, Mark Handley, and Brad Karp. 2008. Wedge: Splitting Applications into Reduced-privilege Compartments. In Proceedings of the 5th USENIX Symposium on Networked Systems Design and Implementation (NSDI).
  • Chase et al. (1992) Jeffrey S. Chase, Miche Baker-Harvey, Henry M. Levy, and Edward D. Lazowska. 1992. Opal: A Single Address Space System for 64-Bit Architectures. Operating Systems Review 26, 2 (1992), 9. citeseer.ist.psu.edu/58003.html
  • Dennis and Horn (1983) Jack B. Dennis and Earl C. Van Horn. 1983. Programming semantics for multiprogrammed computations. Commun. ACM 26, 1 (1983), 29–35. https://doi.org/10.1145/357980.357993
  • Docker (2018) Docker 2018. Docker: https://www.docker.com/.
  • Dragovic et al. (2003) B. Dragovic, K. Fraser, S. Hand, T. Harris, A. Ho, I. Pratt, A. Warfield, P. Barham, and R. Neugebauer. 2003. Xen and the Art of Virtualization. In Proceedings of the ACM Symposium on Operating Systems Principles (SOSP).
  • Druschel and Peterson (1993) Peter Druschel and Larry L. Peterson. 1993. Fbufs: A High-Bandwidth Cross-Domain Transfer Facility. In Symposium on Operating Systems Principles. 189–202.
  • El Hajj et al. (2016) Izzat El Hajj, Alexander Merritt, Gerd Zellweger, Dejan Milojicic, Reto Achermann, Paolo Faraboschi, Wen-mei Hwu, Timothy Roscoe, and Karsten Schwan. 2016. SpaceJMP: Programming with Multiple Virtual Address Spaces. In Proceedings of the Twenty-First International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS).
  • Elphinstone and Heiser (2013) Kevin Elphinstone and Gernot Heiser. 2013. From L3 to seL4 what have we learnt in 20 years of L4 microkernels?. In Proceedings of the 24th ACM Symposium on Operating Systems Principles (SOSP).
  • Engler et al. (1995) Dawson R. Engler, Frans Kaashoek, and James O’Toole. 1995. Exokernel: An Operating System Architecture for Application-Level Resource Management. In Proceedings of the 15th ACM Symposium on Operating System Principles. ACM, Copper Mountain Resort, Colorado, USA, 251–266.
  • Gadepalli et al. (2017) Phani Kishore Gadepalli, Robert Gifford, Lucas Baier, Michael Kelly, and Gabriel Parmer. 2017. Temporal Capabilities: Access Control for Time. In Proceedings of the 38th IEEE Real-Time Systems Symposium.
  • Gupta et al. (2006) Diwaker Gupta, Ludmila Cherkasova, Rob Gardner, and Amin Vahdat. 2006. Enforcing Performance Isolation Across Virtual Machines in Xen. In Proceedings of the ACM/IFIP/USENIX 2006 International Conference on Middleware (Middleware ’06). Springer-Verlag New York, Inc., New York, NY, USA, 342–362. http://dl.acm.org/citation.cfm?id=1515984.1516011
  • Han et al. (2015) Sangjin Han, Keon Jang, Aurojit Panda, Shoumik Palkar, Dongsu Han, and Sylvia Ratnasamy. 2015. SoftNIC: A Software NIC to Augment Hardware. Technical Report UCB/EECS-2015-155. EECS Department, University of California, Berkeley. http://www.eecs.berkeley.edu/Pubs/TechRpts/2015/EECS-2015-155.html
  • Han et al. (2012) Sangjin Han, Scott Marshall, Byung-Gon Chun, and Sylvia Ratnasamy. 2012. MegaPipe: A New Programming Interface for Scalable Network I/O. In Proceedings of the 10th USENIX Conference on Operating Systems Design and Implementation.
  • Hu et al. (2017) Yang Hu, Mingcong Song, and Tao Li. 2017. Towards ”Full Containerization” in Containerized Network Function Virtualization. In Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS ’17). ACM, New York, NY, USA, 467–481. https://doi.org/10.1145/3037697.3037713
  • intel ([n. d.]) intel [n. d.]. Intel 82599 10 gbe controller brief. https://www.intel.com/content/dam/www/public/us/en/documents/product-briefs/82599-10-gbe-controller-brief.pdf.
  • Jeong et al. (2014) EunYoung Jeong, Shinae Wood, Muhammad Jamshed, Haewon Jeong, Sunghwan Ihm, Dongsu Han, and KyoungSoo Park. 2014. mTCP: a Highly Scalable User-level TCP Stack for Multicore Systems. In Proceedings of the 11th USENIX Symposium on Networked Systems Design and Implementation (NSDI 14). USENIX, Seattle, WA, 489–502. https://www.usenix.org/conference/nsdi14/technical-sessions/presentation/jeong
  • Kohler et al. (2000) Eddie Kohler, Robert Morris, Benjie Chen, John Jannotti, and M. Frans Kaashoek. 2000. The Click modular router. ACM Transactions on Computer Systems 18, 3 (August 2000), 263–297.
  • Lagar-Cavilla et al. (2009) Horacio Andrés Lagar-Cavilla, Joseph Andrew Whitney, Adin Matthew Scannell, Philip Patchin, Stephen M. Rumble, Eyal de Lara, Michael Brudno, and Mahadev Satyanarayanan. 2009. SnowFlock: Rapid Virtual Machine Cloning for Cloud Computing. In Proceedings of the 4th ACM European Conference on Computer Systems (Eurosys).
  • Liedtke (1995) J. Liedtke. 1995. On Micro-Kernel Construction. In Proceedings of the 15th ACM Symposium on Operating System Principles (SOSP’95), Copper Mountain Resort, Colorado, USA, December 3-6.
  • Litton et al. (2016) James Litton, Anjo Vahldiek-Oberwagner, Eslam Elnikety, Deepak Garg, Bobby Bhattacharjee, and Peter Druschel. 2016. Light-weight Contexts: An OS Abstraction for Safety and Performance. In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation (OSDI).
  • Madhavapeddy et al. (2015) Anil Madhavapeddy, Thomas Leonard, Magnus Skjegstad, Thomas Gazagnaire, David Sheets, Dave Scott, Richard Mortier, Amir Chaudhry, Balraj Singh, Jon Ludlam, Jon Crowcroft, and Ian Leslie. 2015. Jitsu: Just-in-time Summoning of Unikernels. In Proceedings of the 12th USENIX Conference on Networked Systems Design and Implementation (NSDI’15). USENIX Association, Berkeley, CA, USA, 559–573. http://dl.acm.org/citation.cfm?id=2789770.2789809
  • Madhavapeddy et al. (2013) Anil Madhavapeddy, Richard Mortier, Charalampos Rotsos, David Scott, Balraj Singh, Thomas Gazagnaire, Steven Smith, Steven Hand, and Jon Crowcroft. 2013. Unikernels: Library Operating Systems for the Cloud. In Proceedings of the Eighteenth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS ’13). ACM, New York, NY, USA, 461–472. https://doi.org/10.1145/2451116.2451167
  • Manco et al. (2017) Filipe Manco, Costin Lupu, Florian Schmidt, Jose Mendes, Simon Kuenzer, Sumit Sati, Kenichi Yasukata, Costin Raiciu, and Felipe Huici. 2017. My VM is Lighter (and Safer) Than Your Container. In Proceedings of the 26th Symposium on Operating Systems Principles (SOSP).
  • Martins et al. (2014) Joao Martins, Mohamed Ahmed, Costin Raiciu, Vladimir Olteanu, Michio Honda, Roberto Bifulco, and Felipe Huici. 2014. ClickOS and the Art of Network Function Virtualization. In Proceedings of the 11th USENIX Conference on Networked Systems Design and Implementation (NSDI).
  • Miller et al. (2003) Mark S. Miller, Ka-Ping Yee, and Jonathan Shapiro. 2003. Capability myths demolished. Technical Report SRL2003-02. Johns Hopkins University Systems Research Laboratory, Mountain View CA (USA). http://www.erights.org/elib/capability/duals/
  • Nishtala et al. (2013) Rajesh Nishtala, Hans Fugal, Steven Grimm, Marc Kwiatkowski, Herman Lee, Harry C. Li, Ryan McElroy, Mike Paleczny, Daniel Peek, Paul Saab, David Stafford, Tony Tung, and Venkateshwaran Venkataramani. 2013. Scaling Memcache at Facebook. In Presented as part of the 10th USENIX Symposium on Networked Systems Design and Implementation (NSDI 13). USENIX, Lombard, IL, 385–398. https://www.usenix.org/conference/nsdi13/technical-sessions/presentation/nishtala
  • Nitu et al. (2017) Vlad Nitu, Pierre Olivier, Alain Tchana, Daniel Chiba, Antonio Barbalace, Daniel Hagimont, and Binoy Ravindran. 2017. Swift Birth and Quick Death: Enabling Fast Parallel Guest Boot and Destruction in the Xen Hypervisor. In Proceedings of the 13th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments (VEE ’17). ACM, New York, NY, USA, 1–14. https://doi.org/10.1145/3050748.3050758
  • Oakes et al. (2018) Edward Oakes, Leon Yang, Dennis Zhou, Kevin Houck, Tyler Harter, Andrea Arpaci-Dusseau, and Remzi Arpaci-Dusseau. 2018. SOCK: Rapid Task Provisioning with Serverless-Optimized Containers. In 2018 USENIX Annual Technical Conference (USENIX ATC 18).
  • Palkar et al. (2015) Shoumik Palkar, Chang Lan, Sangjin Han, Keon Jang, Aurojit Panda, Sylvia Ratnasamy, Luigi Rizzo, and Scott Shenker. 2015. E2: A Framework for NFV Applications. In Proceedings of the 25th Symposium on Operating Systems Principles (SOSP).
  • Panda et al. (2016) Aurojit Panda, Sangjin Han, Keon Jang, Melvin Walls, Sylvia Ratnasamy, and Scott Shenker. 2016. NetBricks: Taking the V out of NFV. In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation (OSDI).
  • Parmer and West (2008) Gabriel Parmer and Richard West. 2008. Predictable Interrupt Management and Scheduling in the Composite Component-based System. In Proceedings of the 29th IEEE Real-Time Systems Symposium (RTSS’08), Barcelona, Spain, November 30 - December 3.
  • Parmer and West (2011) Gabriel Parmer and Richard West. 2011. HiRes: A System for Predictable Hierarchical Resource Management. In Proceedings of the 17th IEEE Real-Time and Embedded Technology and Applications Symposium (RTAS).
  • Peter et al. (2015) Simon Peter, Jialin Li, Irene Zhang, Dan R. K. Ports, Doug Woos, Arvind Krishnamurthy, Thomas Anderson, and Timothy Roscoe. 2015. Arrakis: The Operating System Is the Control Plane. ACM Trans. Comput. Syst. 33, 4 (Nov. 2015).
  • Peterson (2015) Larry Peterson. 2015. Cord: Central office re-architected as a datacenter. Open Networking Lab white paper (2015).
  • Price and Tucker (2004) Daniel Price and Andrew Tucker. 2004. Solaris Zones: Operating System Support for Consolidating Commercial Workloads. In Proceedings of the 18th USENIX Conference on System Administration (LISA).
  • Rizzo (2012) Luigi Rizzo. 2012. Netmap: A Novel Framework for Fast Packet I/O. In Proceedings of the 2012 USENIX Conference on Annual Technical Conference (USENIX ATC).
  • Saltzer and Schroeder (1975) J. Saltzer and M. Schroeder. 1975. The protection of information in computer systems. Proceedings of the IEEE 63, 9 (1975).
  • Shapiro et al. (1999) Jonathan S. Shapiro, Jonathan M. Smith, and David J. Farber. 1999. EROS: a fast capability system. In Proceedings of the 17th ACM Symposium on Operating System Principles (SOSP’99), Kiawah Island Resort, South Carolina, USA, December 12-15.
  • Sharma et al. (2017) Naveen Kr. Sharma, Antoine Kaufmann, Thomas Anderson, Changhoon Kim, Arvind Krishnamurthy, Jacob Nelson, and Simon Peter. 2017. Evaluating the Power of Flexible Packet Processing for Network Resource Allocation. In Proceedings of the 14th USENIX Conference on Networked Systems Design and Implementation (NSDI).
  • Taleb et al. (2017) T. Taleb, K. Samdanis, B. Mada, H. Flinck, S. Dutta, and D. Sabella. 2017. On Multi-Access Edge Computing: A Survey of the Emerging 5G Network Edge Cloud Architecture and Orchestration. IEEE Communications Surveys Tutorials 19, 3 (2017), 1657–1681. https://doi.org/10.1109/COMST.2017.2705720
  • Tennenhouse (1989) David Tennenhouse. 1989. Layered Multiplexing Considered Harmful. In Protocols for High-Speed Networks. North Holland, Amsterdam, 143–148.
  • Thalheim et al. (2018) Jörg Thalheim, Pramod Bhatotia, Pedro Fonseca, and Baris Kasikci. 2018. Cntr: Lightweight OS Containers. In 2018 USENIX Annual Technical Conference (USENIX ATC 18).
  • Vilanova et al. (2017) Lluís Vilanova, Marc Jordà, Nacho Navarro, Yoav Etsion, and Mateo Valero. 2017. Direct Inter-Process Communication (dIPC): Repurposing the CODOMs Architecture to Accelerate IPC. In Proceedings of the Twelfth European Conference on Computer Systems (Eurosys).
  • von Eicken et al. (1995) Thorsten von Eicken, Anindya Basu, Vineet Buch, and Werner Vogels. 1995. U-Net: A User-Level Network Interface for Parallel and Distributed Computing. In Proceedings of the 14th ACM Symposium on Operating Systems Principles. ACM, 40–53.
  • Wang et al. (2015) Qi Wang, Yuxin Ren, Matt Scaperoth, and Gabriel Parmer. 2015. Speck: A Kernel for Scalable Predictability. In Proceedings of the 21st IEEE Real-Time and Embedded Technology and Applications Symposium (RTAS’15), Seattle, WA, USA, April 13-16.
  • Whitaker et al. (2002) A. Whitaker, M. Shaw, and S. Gribble. 2002. Denali: Lightweight virtual machines for distributed and networked applications. citeseer.ist.psu.edu/whitaker02denali.html
  • Zhang et al. (2016) Wei Zhang, Guyue Liu, Wenhui Zhang, Neel Shah, Phillip Lopreiato, Gregoire Todeschi, K.K. Ramakrishnan, and Timothy Wood. 2016. OpenNetVM: A Platform for High Performance Network Service Chains. In Proceedings of the 2016 ACM SIGCOMM Workshop on Hot Topics in Middleboxes and Network Function Virtualization. ACM. http://faculty.cs.gwu.edu/timwood/papers/16-HotMiddlebox-onvm.pdf