INFaaS: Managed & Model-less Inference Serving

Francisco Romero* Stanford University Qian Li* Stanford University Neeraja J. Yadwadkar Stanford University  and  Christos Kozyrakis Stanford University, Google

The number of applications relying on inference from machine learning models is already large and expected to keep growing. For instance, Facebook applications issue tens-of-trillions of inference queries per day with varying performance, accuracy, and cost constraints. Unfortunately, existing inference serving systems are neither easy to use nor cost effective. Developers must manually match the performance, accuracy, and cost constraints of their applications to a large design space that includes decisions such as selecting the right model and model optimizations, selecting the right hardware architecture, selecting the right scale-out factor, and avoiding cold-start effects. These interacting decisions are difficult to make, especially when the application load varies over time, applications evolve over time, and the available resources vary over time.

We present INFaaS, an inference-as-a-service system that abstracts resource management and model selection. Users simply specify their inference task along with any performance and accuracy requirements for queries. Given the currently available resources, INFaaS automatically selects and serves inference queries using a specific model that satisfies these requirements. INFaaS autoscales resources as model load changes both within and across inference workers. It also shares workers across users and models to increase utilization. We evaluate INFaaS using 44 model architectures and their 270 model variants against serving systems that rely on users for model selection and pre-load models, fix the scale policy, or use dedicated hardware resources. Our evaluation on realistic workloads shows that INFaaS achieves 2× higher throughput and violates latency SLO goals 3× less frequently, while maintaining high utilization and having overheads that are less than 12% of millisecond-scale queries.

SOSP ’19: ACM Symposium on Operating Systems Principles, October 27–30, 2019, Huntsville, Ontario, Canada

1. Introduction

Machine learning (ML) is proliferating across a variety of disciplines and applications such as video analytics (Poms et al., 2018; Jiang et al., 2018), sentiment analysis (Attia et al., 2018; Velikovich et al., 2018), advertisement recommendation, and scientific computing (Jonas et al., 2017). Most research and engineering effort from both the ML and distributed systems communities (e.g., TensorFlow (ten, 2018b), PyTorch (Facebook, 2018), MXNet (MXN, 2017)) has focused on the model training phase by optimizing the convergence time of algorithms and improving resource utilization. The training phase is usually characterized by long-running hyperparameter searches, dedicated hardware resource usage, and no completion deadlines. In contrast, inference is user-facing. It requires cost-effective systems that render predictions under strict latency constraints while handling unpredictable and bursty request arrival rates. The number of applications relying on inference is already large and expected to keep growing. For example, Facebook serves tens-of-trillions of inference queries per day (Hazelwood et al., 2018).

Figure 1. Variety in application requirements, model/variants, and heterogeneous resources. Grayed-out boxes in the last layer show resources with models already loaded on them.

Figure 1 summarizes challenges of inference serving, which we detail further in Section 2: (1) Query rates to a particular model can be unpredictable and vary over time, which makes it non-trivial to design scaling and resource management policies. (2) Applications issue queries that differ in latency, cost, and accuracy requirements. Some applications can tolerate a lower accuracy in exchange for low prediction latency while others cannot. Some queries are latency-sensitive (online), while others are large batch jobs (offline). Applications often target the same model for both online and offline queries. (3) Methods such as knowledge distillation (Koratana et al., 2018), or compiler optimizations such as TVM, TensorRT, and SageMaker Neo produce versions of the same model, model/variants, that may differ in inference cost and latency, memory footprint, and accuracy. These techniques further increase the decision space for which model to choose based on a user’s, potentially varying, performance requirements.

To address these challenges, an inference serving system needs to have the following desirable properties: First, the system must use dynamic scaling and resource management policies to account for query rate variability. Second, the system should concurrently support queries with a wide range of latency, throughput, and accuracy requirements without requiring significant user effort to manage or configure the system. And finally, a query’s performance requirements should govern which model/variant to select, and the system needs to make this decision in a time-efficient manner to avoid violating performance requirements. The user should not be required to know or even select which model/variant is most suitable to meet their application’s requirements.

While recent work has improved the performance of ML inference systems, ease-of-use and resource efficiency remain challenges. Frameworks such as TensorRT (ten, 2018a), TVM (Chen et al., 2018), and AWS SageMaker Neo (Amazon, 2018d) optimize pre-trained models for hardware acceleration by fusing layers and using lower precision arithmetic when appropriate. However, users are responsible for configuring the target hardware and for choosing the precision and batch size their models should be optimized for. General model serving systems, such as Clipper (Crankshaw et al., 2017), TensorFlow Serving (Ten, 2018), and the TensorRT Inference Server (TRT, 2018) give users the ability to deploy ML models on their own infrastructure, while cloud offerings such as AWS SageMaker (Amazon, 2018c, a), Google Cloud ML (Google, 2018), and Azure ML (Azu, 2018) manage the infrastructure for the users. However, these systems require users to make critical deployment decisions such as the instance type and hardware to use, the model/variant to query, and how to configure autoscaling. This composition forces users to fix their model/variants to a particular hardware platform, as well as to particular resource management and scaling configurations.

The manual resource and configuration management by the user has both performance and cost implications. For example, GPUs usually have much lower latencies for large batch queries but with high loading overhead, while CPUs generally have lower load latencies and perform better with small batch sizes. GPUs also cost more than CPUs: at least 6× higher on AWS (AWS, 2018a). Such decision complexity is further exacerbated when a model’s query pattern changes over time. The tight coupling of models to the underlying infrastructure and resource management techniques also forces service providers to use dedicated resources per-user. Models are normally kept loaded and persisted to meet the stringent performance requirements of users, especially for unpredictable loads. This results in resource under-utilization, limited hardware and resource configurations, and scalability limitations. Additional hardware options, such as FPGAs (Xil, 2018), Google’s TPU (Jouppi et al., 2017), and AWS Inferentia (AWS, 2018b), make the problem of manual configuration even more challenging.

This paper presents a managed and model-less INFerence-as-a-Service system (INFaaS). INFaaS decouples application needs from the underlying models and hardware resources, thus allowing the applications, hardware, scaling policies, and resource management techniques to evolve independently. INFaaS allows users to query any registered model that captures latency, cost, and accuracy requirements through a simple API. INFaaS selects a model/variant along with the hardware to run it on based on the specified performance requirements, hence the term model-less. To improve resource utilization and reduce cost, INFaaS shares model/variants across user queries on heterogeneous hardware and avoids persisting models that are idling. INFaaS manages when model/variants should be scaled by adding/removing replicas and/or by upgrading/downgrading to a differently optimized model/variant. Using 44 model architectures and 270 model/variants, and comparing to state-of-the-art inference serving baselines under a realistic workload, INFaaS demonstrates 2× higher throughput and 3× fewer SLO violations while having similar CPU utilization and over 6× higher GPU utilization.

Our key contributions include:

  • The first managed and model-less inference serving system that rids the users of optimizing their models on available hardware so as to meet performance and cost requirements of their inference queries.

  • A light-weight selection policy that navigates the large space of model-variants and leverages them to automatically meet various application constraints.

  • A mechanism that allows sharing of heterogeneous hardware resources and models across user applications to improve utilization and reduce user costs.

  • An autoscaling algorithm that dynamically scales models in multiple ways to respond to the changes in application load and requirements.

2. Motivation

We begin by describing the challenges of existing inference systems and the insights that led to the INFaaS system.

2.1. Selecting the right model/variant

A model/variant is a version of a model architecture that runs on a single hardware platform. Within a model architecture, variants achieve the same accuracy but differ in the target hardware platform, the resource usage, and the achieved throughput and latency. Across model architectures, variants also differ in the achieved accuracy. The number of model/variants for a specific task such as image classification can be large as we have multiple model architectures to begin with (e.g., ResNet50 and VGG16), multiple programming frameworks (e.g., TensorFlow and PyTorch), multiple compilers (e.g., TensorRT and TVM), and multiple optimization goals (e.g., optimize for batch 1 or batch 32). For example, ResNet50 can have a PyTorch variant that runs on CPU and a TensorRT variant optimized for batch-8 and FP16 that runs on an NVIDIA V100 GPU. Each hardware architecture is unique in terms of its performance potential and optimization requirements. For instance, CPUs are currently a cost-effective choice for inference queries with relaxed latency requirements and low batch sizes (Hazelwood et al., 2018), while GPUs provide more than 10x higher throughput, especially for large batch sizes (Tes, 2017). FPGAs allow for optimizations for batch-1 inference with very narrow datatypes. As new inference accelerators are introduced, such as Google’s TPU (Jouppi et al., 2017) and Amazon’s Inferentia (AWS, 2018b), and new optimization techniques emerge, the number of model/variants will only grow.

Existing systems require that users identify the model/variant that will meet their performance, accuracy, and cost requirements. Even if a user selects a model architecture, differences in memory footprint, startup latency, supported batch size, and the multiple types of hardware options in cloud platforms give rise to a large and complex search space. Figure 2(a) demonstrates that for an image classification task, model architectures and their corresponding model/variants differ greatly in terms of accuracy, inference latency, and peak memory utilization. Even when we focus on variants with inference latencies less than 50 ms in Figure 2(b), the search space remains large and tedious to parse. The “ensemble method” employed by Clipper and Rafiki (Crankshaw et al., 2017; Wang et al., 2018) can be thought of as a way to get around the need for users to select models. These systems send each inference request to multiple candidate variants and select the best result. This approach leads to increased cost, and does not clearly define how candidate model/variants should be chosen from a large space. We argue that inference systems should instead automate the selection of a single model/variant that meets the user’s performance, accuracy, and cost constraints.

(a) All 44 model architectures and 270 model/variants
(b) Model/variants with latencies lower than 50 ms
Figure 2. Inference latency, memory usage, and accuracy for model/variants for image classification generated with TensorFlow, Caffe2, PyTorch, and TensorRT. Variants of the same model architecture have the same color and marker. For (b), the variants in the square box are from ZFNet512, while the circled arrows are VGG19 variants.

Insight 1: The inherent diversity of model/variants across and within hardware platforms can be leveraged in order to meet diverse user requirements for performance, accuracy, and cost.

Insight 2: An abstraction that maps user requirements to the underlying model/variants is necessary in order to have (1) a simple high-level API for inference, and (2) fast and automatic model/variant selection.

2.2. Varying usage patterns and SLO requirements

Query patterns for services such as real-time language translation and video analytics can vary unpredictably (Hazelwood et al., 2018; Kang et al., 2017). An inference serving system can either provision resources for the peak load, or scale automatically in response to load variations. Peak provisioning leads to underutilized resources, while autoscaling can introduce significant start-up latency for loading a model/variant on a particular hardware platform. The startup-latency can vary depending on the frameworks used for the models and the state of the system. Most existing inference serving systems end up underutilizing resources as they overprovision by pre-loading and persisting all the models indefinitely or for long periods of time in anticipation of serving requests when needed (Crankshaw et al., 2017; TRT, 2018; Ten, 2018; Google, 2018; Azu, 2018). Hence, an effective autoscaler that responds to changes in user and model load patterns is needed.

There are three options for scaling in response to changes in load. First, a model/variant can be horizontally scaled across additional machines. Since inference serving is embarrassingly parallel, increasing the number of workers will result in proportional increases in throughput and cost. The latency of horizontal autoscaling can also be significant as new VMs or containers must be spawned. Second, if the variant underutilizes hardware resources, we can replicate it on the existing machine(s), which is a form of vertical autoscaling. For example, latency-sensitive or online inference jobs use small batch sizes (1 to 8), which limits parallelism and resource utilization on hardware platforms. However, throughput may not increase linearly with the number of replicas. Finally, we can choose a different variant that is better optimized for the increased load (e.g., increase batching to gain throughput potentially at the cost of latency) or a variant that runs on different hardware within the same machine (e.g., move inference from the CPU to a GPU or an inference accelerator).

Since the query load can change unpredictably over time and accuracy, latency, or cost constraints can also be adjusted, it is not obvious which autoscaling option(s) should be used and in what order. To illustrate the tradeoffs, Figure 3 shows that adaptive batching on GPU can achieve higher throughput than a single replica while incurring minimal latency degradation. In contrast, replication improves throughput by at most 45% with Inception-ResNetV2 (Figure 3-left), while making both latency and throughput worse for MobileNetV1 (Figure 3-right). On CPUs (shown in Figure 4), adding replicas doubles the throughput without sacrificing latency. We believe adaptive batching leads to larger matrix multiplications (the predominant operation in inference processing), which, unlike on GPUs, lead to higher latency and lower throughput on CPUs.

Figure 3. Impact of replication and adaptive batching for different model-variants on a V100 GPU. Left graph shows average latencies and throughput across 16 threads sending one request at a time for Inception-ResNetV2 model. The right graph shows the same for MobileNetV1 using 32 threads. We doubled the load every 20 seconds. Both variants were optimized for FP16, batch-8 using TensorRT.
Figure 4. Impact of replication and adaptive batching for two model-variants on 8-vCPUs. This experiment is similar to the one in Fig. 3, but runs on CPUs. Both model-variants use the TensorFlow Serving system.

Insight 3: The system must automatically decide whether to horizontally scale, replicate, or select a different variant as load changes without sacrificing latency.

2.3. Sharing in the face of multi-tenancy

While effective autoscaling can rightsize the amount of resources allocated for each inference job as load varies, each job typically utilizes a fraction of the resources on the underlying machine. As is the case with all other workloads in cloud platforms, multi-tenancy is also needed in order to best utilize hardware resources. Specifically, there is an opportunity to share both resources and models across users to improve the overall cost, utilization, and even performance. Popular model architectures, such as ResNet50, tend to be commonly queried across several users and applications.

Virtual machines and container frameworks allow for effective multiplexing of user workloads on shared CPU and memory resources. In contrast, multi-tenancy of accelerators is still challenging. Hence, most existing inference systems allocate dedicated full accelerators, GPUs or FPGAs, to each user and model, leading to low resource utilization and ultimately increased total cost of ownership (TCO). Recent work (Xiao et al., 2018; Gu et al., 2019) has shown the benefit of sharing GPUs for deep-learning training jobs. ML inference is less demanding for compute and memory resources than training, thus making it an ideal candidate for GPU sharing (Yu and Chowdhury, 2019; Jain et al., 2018).

Figure 5. The impact of co-locating two models, the large Inception-ResNetV2 and the small MobileNetV1, on one V100 GPU. Each graph shows the latency-load for each model running alone versus sharing the GPU. When sharing, we send the same load to both models. Both variants are optimized for batch-1 using TensorRT.

Figure 5 shows the result of co-locating one large and one small model on a GPU. At low load, GPU sharing does not affect the performance of either model. At higher load, sharing heavily impacts the performance of the small model, while the large model remains unaffected. Thus, we conclude that GPU sharing is promising, but must be carefully managed.

An additional opportunity is to multiplex resources for online and offline inference jobs. Large offline jobs, such as historical data analysis (Polyzotis et al., 2017) and image labeling at Pinterest (Jing et al., 2015), tend to process large amounts of data in a batch and are typically long-running (i.e., minutes to hours). Most existing systems provide separate services for online and offline inference (Google, 2018; Amazon, 2018c), which leads to resource fragmentation. Since offline jobs are not latency sensitive, they can run alongside online inference tasks during the latter’s periods of low or medium load. The tradeoff is in maximizing the resources used by offline jobs while minimizing the impact on online jobs (Lo et al., 2015).

Insight 4: Sharing CPU and accelerator resources across both users and models can reduce costs for both service providers and users.

Insight 5: Offline queries can execute as best-effort jobs to absorb slack resources from online queries, but interference must be managed.

Insight 6: The inference serving system must make resource management decisions to maintain high resource utilization without violating any performance and cost constraints.

3. INFaaS Overview

Figure 6. INFaaS System Architecture. Numbered circles correspond to the life cycle of typical queries.

Figure 6 presents an overview of INFaaS’s architecture. There are five main components. The Master receives user requests and dispatches them to the Worker machines for execution. The Metadata Store saves registered model architectures and associated variant metadata, along with system state. The Model Repository is a high-capacity, persistent storage medium for model/variants. Finally, the Model Profiler and Optimizer generates new variants for a model architecture using optimization frameworks such as TVM and TensorRT, and profiles variants on supported hardware platforms. INFaaS supports both online (low-batch, latency-sensitive) and offline (batch processing) jobs. Details of each component are presented in Section 4. Users interact with INFaaS using the workflow and API outlined in Table 1.

3.1. Model registration workflow

Users register models using the register_model API. This API takes a model binary (e.g., a GraphDef or SavedModel for TensorFlow or a NetDef for Caffe2) along with metadata about the model, including its architecture, framework, accuracy, task, and the dataset it was trained on. Users specify whether a model is public or private; access to private models is restricted to the users specified by the owner. As a security precaution, INFaaS verifies the accuracy of a public model on the submitted validation dataset before registering the model. The register_model API notifies users of the status of model registration. Unlike existing systems, public models are common across all users, and only need to be registered once for all users to interact with them.

3.2. Query submission workflow

Users can list the registered model architectures and variants using the model_info API. This provides users with a standard naming scheme to interact with INFaaS.

(a) Abstraction for classification
(b) Abstraction for translation
Figure 7. Examples of the model-less abstraction for classification and translation. The solid blue boxes are the task-dataset, the dashed red boxes are the model architecture, and the dotted green boxes are the model/variants.

Model-less Abstraction

Figure 7 demonstrates INFaaS’s model-less abstraction that is guided by Insight 2. This abstraction allows users to express their requirements in three different ways, from the most generic to the most specific:

Specify use-case: Users who do not know which variant or model architecture is most suitable for their performance requirements can simply specify the task and dataset of their query. They also define latency and accuracy requirements to guide INFaaS in selecting variants.

Specify model architecture: Users can specify a model architecture and a latency requirement, allowing INFaaS to select the variant that works best for the specific model load and system state.

Specify model/variant: Users who know which variant they want to query can specify this to INFaaS. This is the only option in existing inference systems.

INFaaS provides three different online_query and offline_query API functions that map user requirements to model/variants using the model-less abstraction. Users can submit a batch of inputs to increase throughput. Prior to servicing online jobs, INFaaS verifies the input’s dimensions are valid for the particular query. offline_query calls require the user to provide the input and output object storage paths (e.g., an AWS S3 bucket (Amazon, 2018b)). Both paths are validated prior to job initiation. Since offline queries pertain to large batches of inputs, they are processed as best-effort jobs. Hence, INFaaS does not provide a latency option for the offline_query calls.

API             Parameters
register_model  modelBinary, modArch, framework, accuracy, task, dataset, submitter, isPrivate
model_info      submitter, task, dataset, accuracy
online_query    submitter, input(s), modVar
online_query    submitter, input(s), modArch, latency
online_query    submitter, input(s), task, dataset, accuracy, latency
offline_query   submitter, inputPath, outputPath, modVar
offline_query   submitter, inputPath, outputPath, modArch
offline_query   submitter, inputPath, outputPath, task, dataset, accuracy
Table 1. INFaaS User API
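
The three forms of online_query in Table 1 differ only in how much the user pins down. As a minimal sketch of that idea, the hypothetical helper below classifies a request into one of the three levels. The function name, keyword arguments, the "level" field, and the returned dictionary are assumptions made for illustration and are not part of the INFaaS API.

```python
from typing import Any, Dict, Optional, Sequence


def build_online_query(
    submitter: str,
    inputs: Sequence[Any],
    mod_var: Optional[str] = None,
    mod_arch: Optional[str] = None,
    task: Optional[str] = None,
    dataset: Optional[str] = None,
    accuracy: Optional[float] = None,
    latency_ms: Optional[float] = None,
) -> Dict[str, Any]:
    """Map the supplied arguments onto one of the three Table 1 query forms.

    Hypothetical helper: the real system's client interface may differ.
    """
    base = {"submitter": submitter, "inputs": list(inputs)}
    if mod_var is not None:
        # Most specific form: the user names the exact model-variant.
        return {**base, "level": "variant", "modVar": mod_var}
    if mod_arch is not None:
        # Middle form: architecture plus a latency target.
        if latency_ms is None:
            raise ValueError("a model-architecture query needs a latency target")
        return {**base, "level": "architecture",
                "modArch": mod_arch, "latency": latency_ms}
    if task is not None and dataset is not None:
        # Most generic form: task and dataset plus accuracy and latency targets.
        if accuracy is None or latency_ms is None:
            raise ValueError("a use-case query needs accuracy and latency targets")
        return {**base, "level": "use-case", "task": task, "dataset": dataset,
                "accuracy": accuracy, "latency": latency_ms}
    raise ValueError("specify modVar, modArch, or both task and dataset")
```

Checking the most specific argument first mirrors the ordering of the abstraction: a named variant leaves the system no choice, while a use-case query leaves both the architecture and the variant open.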

3.3. Life cycle of typical queries to INFaaS

Figure 6 depicts the steps an inference query goes through.

For an online query to a registered and accessible model:

  1. The user submits a query using the API from Table 1.

  2. The master selects a model/variant, then selects a worker to process the query.

  3. The query proceeds to run on the variant’s target hardware platform.

  4. Upon completion, the result is returned to the user.

For offline queries, INFaaS immediately acknowledges the request and schedules them asynchronously. The results are stored in the user-specified output object store.

We describe further details of INFaaS’s key components, model-variant selection process, and the autoscaling mechanism in Sections 4 to 6.

4. System Design

As depicted in Figure 6, INFaaS is a multi-tenant system with a hierarchical master-worker style architecture. The master collaborates with workers to select model/variants and manage heterogeneous resources.

Master: The master is a logically centralized coordinator that receives model registration requests and inference queries using the INFaaS Front-end. The Dispatcher & Load Balancer component then selects a model/variant based on (1) the query’s performance requirements, and (2) the current system state (e.g., which models are running or overloaded). INFaaS caches the model/variant selection decisions made for recent queries to expedite incoming queries with similar objectives. Details of these selection policies are discussed in Section 5. The Load Balancer helps the dispatcher balance work across workers by monitoring CPU and GPU resource utilization, as well as the overall current load on the workers (measured in QPS). The load balancer communicates with the monitoring daemons running on each of the workers to get the resource utilization statistics. The master is also responsible for scaling up and down the number of workers based on current load. The master’s Autoscaler runs in the background and is off the critical path of serving queries. For fault-tolerance, the master is replicated following commonly used existing techniques (Gujarati et al., 2017; Chairunnanda et al., 2014).

Workers: Worker machines execute the inference queries on loaded model/variants. Hardware-specific Executor daemons manage the deployment and execution of model/variants. Figure 6 depicts GPU and CPU executors; INFaaS’s modular design allows for other executors to be plugged in as new hardware platforms emerge. For each incoming query, the worker-level Dispatcher forwards requests to a specific model instance running on the proper executor. The Autoscaler component works with the Monitoring Daemon to scale model/variants as needed within the Worker. Section 6 discusses both of these scaling algorithms in detail.

Model Repository: The Model Repository stores the model/variants to make them accessible to workers when needed to serve queries.

Model Profiler and Optimizer: A model registration request goes through the Model Profiler, which generates different feasible variants and profiles them for statistics such as their inference latencies, loading latencies, and memory footprints. The one-time profiling is necessary for model-variant selection and autoscaling and occurs on a dedicated set of machines. We measure the load and inference latencies for each model/variant for a set of batch sizes of 1, 4, and 8. We also note the peak memory utilized by each model/variant. We predict expected inference latencies for other batch sizes using linear regression on the profiled memory footprints for these model-variants as follows. First, we observe from Figure 8 that inference latency tends to linearly increase with batch size. INFaaS stores $m$, the linear model slope, and $b$, the intercept. For a given batch size, $x$, we then estimate a model/variant’s inference latency, $y$, as $y = mx + b$. These parameters, along with a model/variant’s task, dataset, framework, accuracy, and maximum supported batch size are recorded in the metadata store.
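
The slope-and-intercept estimate described above can be illustrated with an ordinary least-squares fit over the three profiled batch sizes. This is a minimal sketch: the function names are hypothetical, the latency numbers are made up for illustration, and the actual profiler may fit the line differently.

```python
from typing import Sequence, Tuple


def fit_latency_model(batch_sizes: Sequence[float],
                      latencies_ms: Sequence[float]) -> Tuple[float, float]:
    """Least-squares fit of latency = slope * batch_size + intercept."""
    n = len(batch_sizes)
    if n < 2 or n != len(latencies_ms):
        raise ValueError("need at least two (batch size, latency) pairs")
    mean_x = sum(batch_sizes) / n
    mean_y = sum(latencies_ms) / n
    sxx = sum((x - mean_x) ** 2 for x in batch_sizes)
    if sxx == 0:
        raise ValueError("batch sizes must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(batch_sizes, latencies_ms))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def estimate_latency(slope: float, intercept: float, batch_size: float) -> float:
    """Predict inference latency for a batch size that was not profiled."""
    return slope * batch_size + intercept


# Made-up profile for batch sizes 1, 4, and 8 (latencies in milliseconds).
slope, intercept = fit_latency_model([1, 4, 8], [5.0, 11.0, 19.0])
print(estimate_latency(slope, intercept, 16))
```

Only the two fitted parameters need to be kept per model-variant, which keeps the metadata small while still covering batch sizes that were never measured.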

For compatible frameworks and intermediate representations, INFaaS generates optimized versions of variants for use on hardware accelerators. For instance, INFaaS uses TensorRT to generate mixed-precision optimized variants for batches 1, 4, 8, 16, 32, and 64, consuming lowest to highest GPU memory, respectively. As we will discuss in Section 6, these variants are used for autoscaling on GPU. These generated variants are also profiled in the same manner as models submitted by users. Currently, INFaaS supports this optimization step for the TensorRT framework (ten, 2018a), but this can be extended to similar frameworks (Chen et al., 2018; Oh et al., 2018).

Figure 8. Inference latency as batch size increases, and the corresponding linear fitting, which provides an accurate approximation.

Metadata Store: The master and workers rely on the Metadata Store for data needed to make the decisions described above. This metadata mainly consists of (1) the information associated with available model architectures and their variants, such as accuracy, expected inference latency, loading latency, and other profiled values, and (2) the resource usage statistics of the worker machines, the currently loaded model/variants, and the QPS of the loaded variants. The stored model information is organized per the model-less abstraction described in Section 3.2. Resource utilization statistics are updated by the respective executors and monitors. Metadata stored this way enables fast access to the global state of resources and available models without needing direct communication between the master and workers (which could quickly become a bottleneck). The majority of queries to the metadata store are reads, since it functions as a decision-making medium. One-time updates (e.g., whether a variant is running on a worker) are immediately added to the metadata store, while updates such as hardware utilization occur every few seconds (typically, 1-2). We discuss the implementation of the metadata store in Section 7.

5. Selecting a Model-Variant

1: function SelectModelVariant(modelArch, batch, latency)
2:     if inDecisionCache(modelArch, batch, latency) then
3:         return cachedVariant
4:     end if
5:     for v ∈ allVariants(modelArch) do
6:         if isRunning(v) and meetsRequirements(v, batch, latency) then
7:             if notOverloaded(v) then
8:                 return v
9:             end if
10:        end if
11:    end for
12:    return lowestLoadInf(modelArch, batch, latency)
13: end function
Algorithm 1 Model-Variant Selection

Automatic model/variant selection is a key feature for INFaaS, as we pointed out in Insights 1 and 2. To enable the model-less abstraction, there are two scenarios when we need to select a model/variant: when a user just specifies the use-case and when they specify the model architecture.

Algorithm 1 describes how INFaaS selects a model/variant for a query where the user specifies a model architecture and a latency target. The algorithm can have three outcomes. In the first case, INFaaS finds the inputted batch and latency requirement in its decision cache, indicating that it was recently processed. If the variant is running and not overloaded, it is selected and sent to the worker that reports the lowest QPS (i.e., is the least loaded) for the model/variant. A variant is labeled as overloaded if its current QPS and average latency exceed the profiled values. The worker monitoring daemons update the metadata store with this information. In the second case, the cached variant is no longer running, or the decision cache returns no variant. In this case, INFaaS proceeds to search through all variants under a model architecture. If a variant is running that meets the batch and latency requirements, and is not overloaded, it is selected and again sent to the worker that reports the lowest QPS for it. In the third case, INFaaS finds no running variants, and proceeds to pick and load a variant with the lowest combined loading-inference latency that matches the submitted query’s requirements. Here, INFaaS sends the query to the worker with the lowest utilization on the variant’s target hardware while also load balancing to avoid hotspots.
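
The three outcomes described above can be sketched in Python as follows. This is an illustrative sketch, not the INFaaS implementation: the Variant fields, the meets check (which compares a single profiled inference latency against the target), and the cache layout are assumptions, and the choice of worker is omitted.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Variant:
    name: str
    max_batch: int            # largest batch size the variant supports
    infer_latency_ms: float   # profiled inference latency
    load_latency_ms: float    # profiled loading latency
    running: bool = False
    overloaded: bool = False

    def meets(self, batch: int, latency_ms: float) -> bool:
        # Assumed requirement check: batch fits and inference is fast enough.
        return self.max_batch >= batch and self.infer_latency_ms <= latency_ms


DecisionCache = Dict[Tuple[str, int, float], Variant]


def select_model_variant(arch: str, batch: int, latency_ms: float,
                         variants: List[Variant],
                         cache: DecisionCache) -> Optional[Variant]:
    key = (arch, batch, latency_ms)
    cached = cache.get(key)
    # Case 1: reuse a recent decision if that variant is still usable.
    if cached is not None and cached.running and not cached.overloaded:
        return cached
    # Case 2: look for a running, non-overloaded variant that fits.
    for v in variants:
        if v.running and v.meets(batch, latency_ms) and not v.overloaded:
            cache[key] = v
            return v
    # Case 3: nothing suitable is running, so pick the variant with the
    # lowest combined loading and inference latency; the caller loads it.
    candidates = [v for v in variants if v.meets(batch, latency_ms)]
    if not candidates:
        return None
    best = min(candidates, key=lambda v: v.load_latency_ms + v.infer_latency_ms)
    cache[key] = best
    return best
```

Ordering the cases from cheapest to most expensive keeps the common path short: a cache hit avoids the scan, and a scan hit avoids paying the loading latency.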

For brevity, Algorithm 1 does not show the decision algorithm when a use-case is specified. The main difference is that Line 5 is a query to the metadata store for the top-ranked model/variants that meet the user’s accuracy requirement. Although the number of model/variants returned is configurable, we set it to 7, the average number of model/variants per model architecture, to get a range of variants from different frameworks that target different hardware platforms without having to do an exhaustive search over the large search space.

INFaaS makes these decisions on the order of hundreds of μs to ms. We assess these latencies further in Section 8.6.

6. Autoscaling

Automatically scaling the underlying resources is critical to achieving INFaaS’s vision of managed and model-less inference serving. INFaaS supports autoscaling by scaling up or down (1) the number of worker machines, (2) the number of model/variant replicas, and (3) the types of model/variants on the worker machines. These tasks are divided amongst INFaaS’s master and workers as follows: (a) the master ensures there is a sufficient number of worker machines at each time by monitoring each worker’s resource utilization, and (b) the workers ensure there is a sufficient number and type of model/variant replicas running based on changes in load.

6.1. Master autoscaler

The master autoscaler is responsible for monitoring each worker’s resource utilization to decide if a new worker should be brought up/down. Worker monitoring daemons update their respective hardware resource utilization and the average latency of each running model/variant every 2 seconds to the metadata store. If INFaaS detects latency spikes, or that a worker’s resource utilization has exceeded a threshold (set at 80%), the master’s autoscaler temporarily “blacklists” the worker to avoid transiently overloading it. The load-balancer diverts requests to other workers in this case. If the latency spike is due to GPU sharing contention, the autoscaler starts a new worker with a GPU if all available GPUs are heavily utilized. When scaling up, we add a CPU-only instance only if the CPU utilization is exceeded and no GPU model is experiencing contention. This decision maintains low costs by avoiding idling of GPU resources in the event that CPU models are predominantly running. We add a new worker if the resource utilization of all the workers exceeds a threshold; we empirically set this threshold to be 65% to compensate for the start-up latencies. Similarly, idling or underutilized worker machines are brought down.
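One way to read the master's policy is as a periodic decision step over the workers' reported status. The sketch below is illustrative only: the `WorkerStatus` fields and action strings are hypothetical, and it assumes that "heavily utilized" GPUs are those above the same 65% threshold used for scale-out, which the text does not state explicitly.

```python
from dataclasses import dataclass
from typing import List

BLACKLIST_UTIL = 0.80  # per-worker blacklist threshold (from the text)
SCALE_OUT_UTIL = 0.65  # cluster-wide scale-out threshold (from the text)


@dataclass
class WorkerStatus:
    """Hypothetical snapshot a monitoring daemon might publish."""
    name: str
    has_gpu: bool
    cpu_util: float           # fraction in [0, 1]
    gpu_util: float = 0.0     # fraction in [0, 1]; unused on CPU-only workers
    latency_spike: bool = False
    gpu_contention: bool = False

    @property
    def util(self) -> float:
        return max(self.cpu_util, self.gpu_util if self.has_gpu else 0.0)


def master_step(workers: List[WorkerStatus]) -> List[str]:
    actions = []
    # Temporarily blacklist workers that spike or exceed the 80% threshold.
    for w in workers:
        if w.latency_spike or w.util > BLACKLIST_UTIL:
            actions.append(f"blacklist:{w.name}")

    gpu_workers = [w for w in workers if w.has_gpu]
    contention = any(w.gpu_contention for w in gpu_workers)
    gpus_saturated = bool(gpu_workers) and all(
        w.gpu_util > SCALE_OUT_UTIL for w in gpu_workers)
    cpus_saturated = bool(workers) and all(
        w.cpu_util > SCALE_OUT_UTIL for w in workers)

    if contention and gpus_saturated:
        # GPU sharing contention with no spare GPU: add a GPU worker.
        actions.append("add:gpu-worker")
    elif cpus_saturated and not contention:
        # Only CPU pressure: a cheaper CPU-only worker suffices.
        actions.append("add:cpu-worker")
    return actions
```

Keeping the scale-out threshold (65%) below the blacklist threshold (80%) means a new worker is requested before existing ones become overloaded, which is how the text motivates compensating for start-up latency.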

6.2. Worker autoscaler

Each worker runs an autoscaler process that responds to changes in load for all running model/variants. It performs either variant replication to fully saturate CPU and GPU resources or variant upgrading by switching to a model that uses more efficient batching or a different hardware feature.

The autoscaler takes into account model loading latencies and acts conservatively, allocating the resources necessary to absorb small spikes above the current load without the need for autoscaling actions.

Scaling Up: We compare the current load of a model/variant to the maximum load it can serve with the currently allocated resources. We define the current load as the batch size-weighted request rate. The maximum load is proportional to the variant’s supported batch size and current number of replicas, and inversely proportional to its inference latency. If the delta between the two drops to what is necessary to serve 5% load spikes, we trigger autoscaling. If the model/variant is running on a CPU, the autoscaler either performs (1) variant replication (i.e., adds more replicas) or (2) variant upgrading (i.e., uses more cores or upgrades to a GPU variant). For instance, for a CPU variant, the autoscaler estimates the number of CPU replicas that must be added to serve the load, and leverages the profiling data to calculate the total loading latency, peak memory, and hardware cost of this choice. This is compared against upgrading to, say, a TensorRT-optimized variant. The autoscaler chooses the option with lower cost and lower resource consumption. If the decision requires upgrading to a GPU variant while on a CPU-only worker, the worker coordinates with the master to load the variant on a worker with a GPU. Based on our analysis in Section 2.2, INFaaS does not allow variant replication on the same GPU. If the variant is running on a GPU, the autoscaler upgrades to a variant with a higher batch size for improved adaptive batching at the cost of higher GPU memory consumption.
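The scale-up trigger can be illustrated with a small headroom calculation. This is a sketch under stated assumptions: the function names are hypothetical, the proportionality constant for the maximum load is taken to be 1, and "what is necessary to serve 5% load spikes" is interpreted as 5% of the current load.

```python
import math
from typing import List, Tuple


def current_load(requests: List[Tuple[float, int]]) -> float:
    """Batch size-weighted request rate from (requests/second, batch size) pairs."""
    return sum(rate * batch for rate, batch in requests)


def max_load(supported_batch: int, replicas: int, infer_latency_s: float) -> float:
    """Maximum servable load: grows with batch size and replicas, shrinks with latency."""
    return supported_batch * replicas / infer_latency_s


def needs_scale_up(cur: float, mx: float, spike: float = 0.05) -> bool:
    """Trigger when the remaining headroom can no longer absorb a 5% spike."""
    return (mx - cur) <= spike * cur


def cpu_replicas_needed(cur: float, per_replica_max: float, spike: float = 0.05) -> int:
    """Replica count needed to serve the current load plus the spike margin."""
    return math.ceil(cur * (1.0 + spike) / per_replica_max)


# Two CPU replicas of a batch-1 variant with 125 ms inference latency.
capacity = max_load(supported_batch=1, replicas=2, infer_latency_s=0.125)
load = current_load([(15.5, 1)])
if needs_scale_up(load, capacity):
    print("replicas needed:", cpu_replicas_needed(load, capacity / 2))
```

The replica estimate would then be weighed against the cost of upgrading to a different variant, as the paragraph above describes; that comparison depends on profiling data not modeled here.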

Scaling Down: Scaling down entails checking if the current load can be supported by removing one replica in the case of CPUs, or downgrading to a lower batch variant for GPUs. For the latter, if the running variant has a batch size of 1, the autoscaler considers downgrading to a CPU variant. To avoid scaling down too quickly, the autoscaler keeps a count of consecutive time slots where the load can be supported by removing a replica or downgrading, and scales down only once this count reaches a threshold. In our experiments, we set separate thresholds for CPU variants and for GPU variants.
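The scale-down counter acts as a simple hysteresis mechanism. The following sketch shows the idea; the class name is hypothetical and the threshold of 3 time slots is an arbitrary illustrative value, not the setting used in the experiments.

```python
class ScaleDownCounter:
    """Counts consecutive time slots in which a scale-down would be safe."""

    def __init__(self, threshold: int) -> None:
        self.threshold = threshold
        self.count = 0

    def observe(self, can_shrink: bool) -> bool:
        """Record one time slot; return True when a scale-down should fire."""
        if can_shrink:
            self.count += 1
        else:
            # Any slot that needs the current capacity resets the streak.
            self.count = 0
        if self.count >= self.threshold:
            self.count = 0
            return True
        return False


counter = ScaleDownCounter(threshold=3)  # illustrative value only
slots = [True, True, False, True, True, True]
decisions = [counter.observe(s) for s in slots]
```

Because a single busy slot resets the streak, brief dips in load do not remove a replica that would have to be reloaded moments later.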

7. Implementation

INFaaS is implemented in about 6,800 lines of C++ code. INFaaS API and communication between master and workers are implemented using gRPC in C++ (grp, 2018). Users can interact with INFaaS by issuing gRPC requests in any language. INFaaS uses AWS S3 for its model repository (Amazon, 2018b).

On the master machine, the front-end, dispatcher & load balancer, and model registration are different threads within the same process for fast query dispatch. The autoscaler runs as a separate process, polling system status every 2 seconds.

On the worker machines, the dispatcher and monitoring daemons run as separate processes. The monitoring daemon updates the resource usage in the metadata store every 2 seconds. We built the GPU executor using the TensorRT Inference Server-19.03 (TRT, 2018), which supports running TensorRT, Caffe2, and TensorFlow. We deploy a custom Docker container for PyTorch models. We use the TensorFlow Serving container for TensorFlow models on the CPU (Ten, 2018). The autoscaler main thread monitors load for model/variants every second and makes scaling decisions; it also manages a thread pool for loading and unloading model/variants. We run all monitoring and autoscaling threads with low priority (nice value 10) to reduce interference to inference threads.

INFaaS’s metadata store is implemented as a key-value store that replies to master and worker queries within hundreds of microseconds. Specifically, we use Redis (Redis, 2018) and make queries using the Redox C++ library (Red, 2018). We currently run the Redis server on the same machine as the master to reduce model/variant selection latencies, but it can run on a separate machine as needed. Data structures and the underlying storage are optimized for reads, which constitute the majority of queries. In the event of a key-value store failure, we recover the static information about model/variants from the most recent key-value store snapshot. The monitoring daemons update the dynamic state of workers including the model/variants running on them, the QPS supported by them, and their load and inference latencies.

8. Evaluation

8.1. Experimental Setup

We deployed INFaaS on AWS EC2, using m5.2xlarge instances (8 vCPUs, 32GiB DRAM) for masters. Workers are deployed on p3.2xlarge and m5.2xlarge instances. The former has 8 vCPUs, 61GiB DRAM, and one NVIDIA V100 GPU. All CPUs are Intel Xeon Platinum 8175M operating at 2.50GHz. All instances run Ubuntu 16.04 with the 4.4.0 kernel and have up to 10Gbps networking.

Baselines: To the best of our knowledge, there is no existing system that provides a model-less and fully-managed interface like INFaaS. State-of-the-art serving systems require users to specify the model/variant and hardware per-query. Hence, we use INFaaS configurations that approximate the resource management policies, autoscaling techniques, and APIs of existing systems such as TensorFlow Serving (Ten, 2018) (TFS), TensorRT Inference Server (TRTIS) (TRT, 2018), Clipper (Crankshaw et al., 2017), AWS SageMaker (Amazon, 2018c), and Google CloudML (Google, 2018) as baselines. This allows us to compare model and resource management without the variability caused by differences in the execution environments between these systems, such as the RPC libraries and container technologies.

We compare INFaaS to the following baseline configurations for online query execution:

STATIC: Pre-load all model/variants and set a pre-defined number of running replicas. This strategy is used by TRTIS and TensorFlow Serving. We consider two static cases, one where only GPUs are used (GPU-S), and one where only CPUs are used (CPU-S).

INDV: Individually horizontally scale each model/variant by adding or removing replicas within the same worker or across multiple workers, but without variant upgrading. This approach approximates Clipper, SageMaker, and CloudML.

Model/variants used in our experiments: Table 2 shows all model architectures and the number of model/variants associated with each. As discussed in Section 2.1, the number of variants depends on the framework, supported hardware platforms, and operations supported by compiler frameworks. We used 270 image classification model/variants spanning 44 widely used model architectures. Model/variants are pre-trained on ImageNet (Deng et al., 2009) using either Caffe2, TensorFlow, or PyTorch. For the 26 model architectures that can be optimized by TensorRT, INFaaS generates 6 optimized variants from batch 1 to 64. We used TensorRT version 5.1.2.

Model Arch      # Vars   Model Arch          # Vars   Model Arch         # Vars
inceptionv3     11       resnet101           11       resnext            3
squeezenet1.1   2        zfnet512            7        mobilenet0p5160    6
vgg13           2        resnet50            18       squeezenet         7
inceptionv2     13       densenet201         5        resnet101v2_2      6
mobilenet       10       inceptionv1         13       reference          7
resnet152       11       resnet50v2          3        mobilenet0p25128   6
vgg19           12       nasnetlarge         3        vgg19_bn           2
nasnetmobile    3        vgg16               18       densenet169        5
alexnet         9        resnet34            2        resnet101v2        3
googlenet       7        inceptionv4         6        xception           3
densenet121     12       resnet152v2_2       6        resnet18           2
squeezenet1.0   2        vgg16_bn            2        mobilenetv2        3
densenet161     2        inceptionresnetv2   9        resnet50v2_2       6
resnet152v2     3        vgg13_bn            3        vgg11              2
resnext50       3        vgg11_bn            2
Table 2. Model architectures and associated model/variants.

8.2. Does INFaaS improve ease-of-use?

The key goal for INFaaS’s managed and model-less approach is to simplify the use of serving systems. In the absence of a direct user study, we can draw conclusions about the user-friendliness of inference interfaces by comparing the knobs users need to configure to use them in Table 3. Compared to existing systems, INFaaS’s users do not need to configure any knobs for model/variant selection, hardware selection, autoscaling strategy, or mixing online and offline queries. INFaaS users specify latency and accuracy requirements, and INFaaS automatically manages the serving system. With minimal configuration, users can access INFaaS by specifying their high-level performance goals. Nevertheless, INFaaS supports expert users that want to exert direct control over the settings (e.g., specifying a model/variant).

Configuration      TFS / TRTIS   Clipper   CloudML   SageMaker   INFaaS
Model-variant      Yes           Yes       Yes       Yes         No
Hardware           Yes           Yes       Yes       Yes         No
Scaling strategy   N/A           Yes       Yes       Yes         No
Online / Offline   N/A           N/A       Yes       Yes         No
Table 3. Comparison of required configuration parameters from users. N/A means the system does not support this feature.

8.3. Does INFaaS share resources effectively?

Figure 9. Performance of co-locating GPU model/variants when varying load. Panels: (a) Inception-ResNetV2 average latency; (b) MobileNetV1 average latency; (c) Inception-ResNetV2 throughput; (d) MobileNetV1 throughput.

We evaluate INFaaS’s ability to manage and share resources across multiple models by sending an increasing number of concurrent requests to two GPU model/variants, Inception-ResNetV2 and MobileNetV1 (shown in Figure 9). Both models have been optimized with TensorRT for batch-1, but have very different resource needs and latencies. We send 16 and 18 concurrent non-batch requests to Inception-ResNetV2 and MobileNetV1, respectively. We measure the average latency and throughput over a window of 4 seconds. For Inception-ResNetV2, the initial load is 32 images/second and gradually increases to 180 images/second at 140 seconds. For MobileNetV1, the load starts from 36 images/second and reaches 200 images/second at 140 seconds. We choose the final load based on the saturation throughput of the Inception-ResNetV2 variant on the GPU. We set the latency SLO based on the average inference latency of each variant when running alone on the GPU (30 ms for Inception-ResNetV2 and 20 ms for MobileNetV1).

We compare INFaaS to 2 simple resource management strategies: running each model exclusively on a separate GPU (Exclusive) and co-locating the 2 models on a single GPU without scaling to a new machine (Sharing). As shown in Figure 9, INFaaS scales the number of GPU workers between 1 and 3. Note that the number of workers in each figure depicts how many workers are used to serve a variant: from 88 to 128 seconds, the second GPU worker only serves MobileNetV1. When only considering throughput, a single GPU is sufficient to serve the load placed on both models. INFaaS uses multiple GPUs to reduce latency and meet the SLO.

In Figure 9, INFaaS starts by co-locating the two model/variants on one worker. As the load increases, INFaaS detects an SLO violation for MobileNetV1 around 88 seconds and starts a second GPU worker. Around 140 seconds, the load increase causes INFaaS to start a third GPU worker for serving both models. The sharp spikes and drops in INFaaS’s latency and throughput are caused by a new worker being added, which incurs a temporary GPU warm-up penalty and GPU model loading latency. We can address this issue by reserving a pool of standby GPU worker machines during high load periods that proactively load frequently queried GPU variants. In contrast, both the exclusive and sharing alternatives can have long-term latency issues. The exclusive strategy suffers when the GPU resources allocated to one model are not sufficient, as it cannot use any other underutilized resources. The sharing strategy suffers due to interference and lack of resource fairness between the two models. INFaaS can share one GPU across multiple models and scale to more GPU workers at high load to mitigate interference between different models.

Figure 10. Performance of different co-location strategies with ResNet50. The y-axes use log scale. Alone means only running online/offline jobs. Panels: (a) Median latency and throughput for Online; (b) Throughput for Offline.

Next, we evaluate the efficiency of CPU sharing by co-locating online and offline queries on a single worker. We use ResNet50 and pre-load two TensorFlow CPU replicas and one TensorRT-8 instance on GPU. The online requests start from a 500 ms SLO and 2 images/second rate, gradually increasing to 10 images/second. At 40 seconds, the SLO switches to 20 ms, and at 60 seconds, the load increases to 300 images/second before decreasing symmetrically after 80 seconds. The SLO switches back to 500 ms after 100 seconds. For the offline job, we send one offline request at the beginning of the experiment that has a workload of 1000 input images and specifies the ResNet50 model architecture.

Figure 10 contrasts the performance for online and offline queries when running alone and when co-located. INFaaS maintains similar latency and throughput for online requests in both cases by limiting the offline query processing when it detects SLO violations or high resource utilization for online queries. There are three troughs in Figure 10(b): those at 20-40 seconds and 100-120 seconds are caused by high load on CPU variants, while the one at 60-80 seconds is due to high load on the GPU. At 110 seconds, INFaaS stops processing offline jobs due to an online SLO violation. The online query latency returns to meet the SLO requirement shortly after. INFaaS can effectively share resources across models, as well as online and offline requests, without penalizing performance.

8.4. How well does INFaaS scale with load changes?

Figure 11. Performance of different autoscaling strategies, with ResNet50 and non-batch requests. The y-axis for (a) uses log scale. P99 latencies closely match the median latency. The SLO is 500 ms before 30 seconds and after 80 seconds, and is set to 20 ms from 30 to 80 seconds. Panels: (a) Median latency; (b) Throughput; (c) Decomposition of QPS in INFaaS.

We focus the autoscaling evaluation on a single worker, as INFaaS differs from existing systems by more efficiently using a worker’s resources. Horizontal autoscaling in INFaaS behaves the same as existing serving systems. We use a single model architecture (ResNet50) on a single GPU worker and vary the load starting at a relaxed SLO (500 ms) with 5 images/second, and gradually increase it to 50 images/second. We switch to a strict SLO (20 ms) at 30 seconds and further increase the load to 800 images/second, then decrease symmetrically back to 50 images/second and switch back to 500 ms SLO at 80 seconds. The final load is 5 images/second.

We compare INFaaS with the GPU-S, CPU-S, and INDV methods explained in Section 8.1. For GPU-S, we keep one instance of a TensorRT variant optimized for batch-8 since it is sized to serve the provided peak load. For CPU-S, we maintain 2 TensorFlow CPU containers. For INDV, we only replicate the TensorFlow CPU container and allow at most one running instance of a TensorRT variant optimized for batch-1. As shown in Figures 11(a) and 11(b), GPU-S achieves the highest throughput and lowest latency. However, it comes at the highest cost, since it exclusively occupies the GPU even during low load. CPU-S has the lowest cost, but violates SLOs when the load is higher. It also cannot maintain throughput for high load. The INDV strategy, which only uses model replication for CPU and no adaptive batching on GPU, has limited throughput (shown in Figures 3 and 4).

Figure 11(c) shows the benefit of using both model/variant replication and variant upgrading in INFaaS. At lower load, INFaaS scales to two CPU replicas. It then upgrades to a TensorRT variant optimized for batch-1. As the load increases, INFaaS gradually upgrades to variants that are optimized for higher batch sizes to enable adaptive batching, which maintains low latency while achieving similar throughput as GPU-S. When the load decreases, INFaaS detects the change and steadily downgrades to a lower batch variant and eventually back to a CPU variant, equalling CPU-S’s cost.

We now quantify the cost savings using Figure 11. An AWS GPU instance is currently at least 6× more expensive than its equivalent CPU instance (AWS, 2018a). For simplicity, we only consider the cost for hardware used at a given timestep (e.g., charge for a GPU instance if a GPU is used and for a CPU instance if only CPU is used). INFaaS saves 38% compared to the GPU-S approach, as it only uses the GPU when needed. It is 4× more expensive than CPU-S, but still offers 100× lower latency and 65× higher throughput. INFaaS scales and adapts to changes in load and query patterns, and maintains low cost by better resource allocation.
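The per-timestep cost accounting described above can be sketched as follows. The 6× price ratio comes from the text; the usage trace is invented purely for illustration and does not reproduce the measured experiment.

```python
from typing import Sequence

GPU_TO_CPU_PRICE = 6.0  # GPU instance price relative to a CPU instance (from the text)


def relative_cost(gpu_in_use: Sequence[bool]) -> float:
    """Total cost in CPU-instance units: charge GPU price only when a GPU is used."""
    return sum(GPU_TO_CPU_PRICE if used else 1.0 for used in gpu_in_use)


def savings_vs_gpu_static(gpu_in_use: Sequence[bool]) -> float:
    """Fraction saved compared with holding a GPU instance for every timestep."""
    gpu_static = GPU_TO_CPU_PRICE * len(gpu_in_use)
    return 1.0 - relative_cost(gpu_in_use) / gpu_static


def cost_vs_cpu_static(gpu_in_use: Sequence[bool]) -> float:
    """Cost multiple compared with holding a CPU instance for every timestep."""
    return relative_cost(gpu_in_use) / len(gpu_in_use)


# Hypothetical trace: the GPU is needed in 6 of 10 timesteps.
trace = [False, False, True, True, True, True, True, True, False, False]
print(savings_vs_gpu_static(trace), cost_vs_cpu_static(trace))
```

Under this accounting, the savings relative to GPU-S and the cost multiple relative to CPU-S are both determined by the fraction of timesteps in which the GPU is in use.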

8.5. Putting it all together

Figure 12. Realistic workload overview.
Figure 13. Throughput and SLO violation rate of online requests for the realistic workload. The results report the 60-second steady state of each interval, excluding its first 20 seconds. The error bars denote the minimum and maximum values observed across three runs.

We evaluate the end-to-end performance of INFaaS on a realistic workload with all 44 image classification model architectures. We expect INFaaS to meet the majority of user requirements when serving models with diverse query patterns, and right-size resources as load changes.

We designed a load generator that submits user requests following the Poisson distribution commonly used to simulate cloud workloads (Kannan et al., 2019; Atikoglu et al., 2012; Meisner and Wenisch, 2012; Yang et al., 2017). Since model popularity tends to follow the Zipf distribution (Lee et al., 2018), the workload designates 20% of the models to be popular and share 80% of the total load, while the rest are cold. We selected the top 20% (9) of model architectures as popular, based on the number of variants in Table 2. Among these popular models, we assigned ResNet50 and VGG16 to be the two most popular so that they represent 20% and 15% of the total load, respectively. We generate requests using 79 client threads, with one thread per cold model, four threads for popular models, and eight threads for the two most popular models. Figure 12 shows the offered load. It starts at 50 requests/second and gradually increases to 500. Each level is maintained for 80 seconds.
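The workload parameters above can be checked and simulated with a short sketch. The thread counts and load percentages come from the text; the function names, the placeholder model names, and the even split of the remaining load within the popular and cold groups are assumptions made for illustration.

```python
import random
from typing import Dict, List


def client_threads(num_archs: int = 44, num_popular: int = 9, num_top: int = 2) -> int:
    """One thread per cold model, four per popular model, eight per top model."""
    cold = num_archs - num_popular
    return cold * 1 + (num_popular - num_top) * 4 + num_top * 8


def load_shares(num_archs: int = 44, num_popular: int = 9) -> Dict[str, float]:
    """Per-architecture share of total load; popular models get 80% overall."""
    shares = {"resnet50": 0.20, "vgg16": 0.15}
    other_popular = num_popular - 2
    cold = num_archs - num_popular
    for i in range(other_popular):
        # Assumed even split of the remaining popular load (80% - 35%).
        shares[f"popular_{i}"] = (0.80 - 0.35) / other_popular
    for i in range(cold):
        # Assumed even split of the 20% cold load.
        shares[f"cold_{i}"] = 0.20 / cold
    return shares


def poisson_arrivals(rate_per_s: float, duration_s: float, seed: int = 0) -> List[float]:
    """Arrival timestamps of a Poisson process (exponential inter-arrival gaps)."""
    rng = random.Random(seed)
    t, arrivals = 0.0, []
    while True:
        t += rng.expovariate(rate_per_s)
        if t >= duration_s:
            return arrivals
        arrivals.append(t)


print(client_threads())  # 35 cold + 7 * 4 + 2 * 8 threads
first_level = poisson_arrivals(rate_per_s=50.0, duration_s=80.0)
```

The thread count works out to 35 + 28 + 16 = 79, matching the number stated in the text.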

We compare with STATIC and INDV (described in Section 8.1), which persist 16 CPU-only workers and 8 GPU workers. INFaaS starts with 5 GPU workers and scales to 8 at high load, as shown in Figure 12. Since existing systems require the user to select a variant, we specify one as follows: If a model architecture has both CPU and GPU variants, we select the CPU variant with the lowest inference latency. Otherwise, we pick the fastest GPU variant that supports the smallest batch size. For INFaaS, we specify the SLO for each model architecture based on the average inference latency of the chosen variant when running alone, but provide no model/variant.

Figure 13 shows that INFaaS can achieve 2× higher throughput than STATIC and violates 3× fewer SLOs. This is attributed to leveraging both variant scaling and variant upgrading, where INFaaS can upgrade to a GPU variant while the baselines can only replicate variants. INDV has lower throughput and violates more SLOs due to frequently incurring a load latency penalty and the absence of variant upgrading. As depicted in Figure 14, INFaaS maintains a high CPU and GPU resource utilization while keeping SLO violations at about 10%. Utilization is around 50% for CPU since INFaaS avoids overloading CPU models that have a lower QPS limit. For GPU, INFaaS has over 6× higher GPU utilization than both baselines at high load, since it leverages GPU sharing.

We also add 8 concurrent offline requests to evaluate the efficiency of resource management and resource utilization. Each offline request has a workload of 500 input images and specifies the ResNet50 model architecture. As shown in Figure 13, INFaaS w/offline maintains similar throughput and SLO violations as INFaaS running only online jobs. Across 3 runs, an average of 3,275 of the 4,000 images are processed by the offline requests, which run as best-effort jobs. Moreover, Figure 14 shows that adding offline requests to INFaaS further improves the resource utilization. INFaaS achieves higher performance and resource utilization than the baselines. It also reduces cost at low load by spinning down worker machines.

Figure 14. Average worker GPU and CPU utilization for the realistic workload. CPU utilization corresponds to core usage while GPU utilization corresponds to GPU DRAM usage. INFaaS maintains high resource utilization without creating contention between running models.

8.6. What is the overhead of INFaaS’s decisions?

Figure 15 shows the fraction of query processing time spent on making decisions about which model/variants and workers to use. Each colored bar corresponds to the same TensorRT batch-1 variant being selected in 3 scenarios: (1) the user explicitly specifies it (ModVar), (2) the user specifies ResNet50 and a latency constraint of 10 ms (ModArch), and (3) the user specifies the classification task and the ImageNet dataset, along with a latency constraint of 10 ms and an accuracy of 75.3% (He et al., 2016) (Use-Case). Under a load of 2 requests/second, we evaluate the cases where the selected model/variant (a) is already loaded (L), and (b) is not loaded (NL).

When a model/variant is explicitly defined by the user, INFaaS has low overheads, as it only selects a worker. When a model architecture is provided, INFaaS quickly finds a model/variant that meets the latency SLO if it is loaded, or spends more time selecting a variant if none are loaded. Similarly, when a use-case is provided, INFaaS leverages its decision cache to select a model/variant if it is loaded, or searches a subset of the large search space. ResNet50 has the highest NL decision latency as it has the most model/variants (18). INFaaS maintains low overheads (1.6 ms when using the decision cache and less than 12% of serving time for TensorRT models), and keeps SLO violations low by using loaded variants when possible.

9. Limitations and Future Directions

White box inference serving: INFaaS currently treats ML models as black boxes. Opening the black box of models offers additional opportunities to optimize inference serving (Lee et al., 2018). For instance, intermediate computations could be reused across “similar” model/variants. We leave model-less inference serving with white box models to future work.

Offline queries with performance SLOs: INFaaS currently supports best-effort execution for offline requests with no support for performance SLOs. Understanding how to efficiently schedule and process offline requests in a multi-tenant environment given the users’ inputs, deadlines, and cost requirements needs further exploration. INFaaS’s modular design allows it to be extended to work with existing and new deadline-driven scheduling techniques (Ferguson et al., 2012; Venkataraman et al., 2016).

Query preprocessing: INFaaS currently assumes that the queries are pre-processed (i.e., video decoding, image cropping, and scaling). However, many machine learning applications have complex and compute-intensive pre-processing pipelines that are difficult to deploy. We plan to support input query pre-processing by adopting high performance data processing libraries such as NVIDIA DALI (nvi, 2018) and Weld (Palkar et al., 2017).

10. Related Work

Figure 15. Fraction of query time spent on making variant and worker selection, and decision latency. The whiskers show the minimum and maximum values observed over 3 runs.

Serving Systems and APIs: Clipper (Crankshaw et al., 2017) lets users specify latency constraints and uses adaptive batching to increase throughput without violating SLOs. Amazon SageMaker (Amazon, 2018c), Google Cloud ML (Google, 2018), and Microsoft Azure ML (Azu, 2018) are enterprise cloud offerings with separate online and offline services. All three services autoscale models based on usage load, but cannot scale fast enough to serve bursty query patterns. SageMaker features Elastic Inference (Amazon, 2018a), which allows users to only use a portion of a GPU to reduce cost. TensorRT Inference Server (TRT, 2018) lets users deploy CPU and GPU models, statically configure the maximum number of model replicas, and leverage adaptive batching. TensorFlow Serving (Ten, 2018) supports TensorFlow models with GPU acceleration and employs static batching. Halpern et al. proposed Tolerance Tiers for ML-as-a-Service, where users programmatically trade off accuracy and latency (Halpern et al., 2019). INFaaS leverages user accuracy and latency requirements to select a suitable model/variant.

Scaling: Swayam (Gujarati et al., 2017) is a model-based CPU autoscaler that accounts for SLOs to achieve high resource utilization. Unlike Swayam, INFaaS shares models across different services and SLO boundaries. Autoscale (Gandhi et al., 2012) reviews scaling techniques and argues for a simple approach that includes slack resources and not scaling down recklessly. INFaaS’s worker autoscalers use slack resources for headroom, and both master and worker autoscalers use scaledown counters.

GPU Sharing: NVIDIA MPS (NVIDIA, 2018) enables efficient sharing of GPUs, which Tiresias (Gu et al., 2019) and Gandiva (Xiao et al., 2018) exploit for deep-learning training. TrIMS (Dakkak et al., 2018) is an ML caching layer that manages models for CPUs, GPUs, and cloud storage. TensorRT Inference Server, TrIMS, Salus (Yu and Chowdhury, 2019), and Space-Time GPU Scheduling (Jain et al., 2018) allow users to share GPUs either spatially, temporally, or both. INFaaS uses the TensorRT Inference Server for GPU sharing, and can leverage one or more of these techniques in the future.

11. Conclusion

We presented INFaaS: a managed and model-less inference serving system. INFaaS allows users to define inference tasks and performance/accuracy requirements for queries, leaving it to the system to determine the model/variant, hardware, and scaling configuration. We quantitatively demonstrate that INFaaS’s policies for model selection and resource management and sharing lead to better throughput, fewer latency SLO violations, and better resource utilization compared to existing approaches for managing inference serving systems.


  • MXN (2017) 2017. Apache MXNet (Incubating) - A flexible and efficient library for deep learning.
  • Tes (2017) 2017. NVIDIA Tesla V100 Tensor Core GPU.
  • Xil (2018) 2018. Accelerating DNNs with Xilinx Alveo Accelerator Cards.
  • Azu (2018) 2018. Azure Machine Learning.
  • grp (2018) 2018. gRPC.
  • nvi (2018) 2018. NVIDIA DALI.
  • TRT (2018) 2018. NVIDIA TensorRT Inference Server.
  • ten (2018a) 2018a. NVIDIA TensorRT: Programmable Inference Accelerator.
  • Red (2018) 2018. Redox.
  • ten (2018b) 2018b. TensorFlow - An open source machine learning framework for everyone.
  • Ten (2018) 2018. TensorFlow Serving for model deployment in production.
  • Amazon (2018a) Amazon 2018a. Amazon Elastic Inference. (2018).
  • Amazon (2018b) Amazon 2018b. Amazon S3. (2018).
  • Amazon (2018c) Amazon 2018c. Amazon SageMaker. (2018).
  • Amazon (2018d) Amazon 2018d. Amazon SageMaker Neo. (2018).
  • Atikoglu et al. (2012) Berk Atikoglu, Yuehai Xu, Eitan Frachtenberg, Song Jiang, and Mike Paleczny. 2012. Workload analysis of a large-scale key-value store. In ACM SIGMETRICS Performance Evaluation Review, Vol. 40. ACM, 53–64.
  • Attia et al. (2018) Mohammed Attia, Younes Samih, Ali Elkahky, and Laura Kallmeyer. 2018. Multilingual Multi-class Sentiment Classification Using Convolutional Neural Networks. Miyazaki, Japan, 635–640.
  • AWS (2018a) AWS 2018a. AWS EC2 Pricing. (2018).
  • AWS (2018b) AWS 2018b. AWS Inferentia. (2018).
  • Chairunnanda et al. (2014) Prima Chairunnanda, Khuzaima Daudjee, and M. Tamer Özsu. 2014. ConfluxDB: Multi-Master Replication for Partitioned Snapshot Isolation Databases. PVLDB 7 (2014), 947–958.
  • Chen et al. (2018) Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Haichen Shen, Meghan Cowan, Leyuan Wang, Yuwei Hu, Luis Ceze, Carlos Guestrin, and Arvind Krishnamurthy. 2018. TVM: An Automated End-to-End Optimizing Compiler for Deep Learning. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18). USENIX Association, Carlsbad, CA, 578–594.
  • Crankshaw et al. (2017) Daniel Crankshaw, Xin Wang, Giulio Zhou, Michael J. Franklin, Joseph E. Gonzalez, and Ion Stoica. 2017. Clipper: A Low-Latency Online Prediction Serving System. In 14th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2017, Boston, MA, USA, March 27-29, 2017. 613–627.
  • Dakkak et al. (2018) Abdul Dakkak, Cheng Li, Simon Garcia De Gonzalo, Jinjun Xiong, and Wen-Mei W. Hwu. 2018. TrIMS: Transparent and Isolated Model Sharing for Low Latency Deep Learning Inference in Function as a Service Environments. CoRR abs/1811.09732 (2018). arXiv:1811.09732
  • Deng et al. (2009) Jia Deng, Wei Dong, Richard Socher, Li jia Li, Kai Li, and Li Fei-fei. 2009. Imagenet: A large-scale hierarchical image database. In CVPR.
  • Facebook (2018) Facebook 2018. PyTorch. (2018).
  • Ferguson et al. (2012) Andrew D. Ferguson, Peter Bodik, Srikanth Kandula, Eric Boutin, and Rodrigo Fonseca. 2012. Jockey: Guaranteed Job Latency in Data Parallel Clusters. In Proceedings of the 7th ACM European Conference on Computer Systems (EuroSys ’12). ACM, New York, NY, USA, 99–112.
  • Gandhi et al. (2012) Anshul Gandhi, Mor Harchol-Balter, Ram Raghunathan, and Michael A Kozuch. 2012. Autoscale: Dynamic, robust capacity management for multi-tier data centers. ACM Transactions on Computer Systems (TOCS) 30, 4 (2012), 14.
  • Google (2018) Google 2018. Google Cloud Machine Learning Engine. (2018).
  • Gu et al. (2019) Juncheng Gu, Mosharaf Chowdhury, Kang G. Shin, Yibo Zhu, Myeongjae Jeon, Junjie Qian, Hongqiang Liu, and Chuanxiong Guo. 2019. Tiresias: A GPU Cluster Manager for Distributed Deep Learning. In 16th USENIX Symposium on Networked Systems Design and Implementation (NSDI 19). USENIX Association, Boston, MA, 485–500.
  • Gujarati et al. (2017) Arpan Gujarati, Sameh Elnikety, Yuxiong He, Kathryn S McKinley, and Björn B Brandenburg. 2017. Swayam: distributed autoscaling to meet SLAs of machine learning inference services with resource efficiency. In Proceedings of the 18th ACM/IFIP/USENIX Middleware Conference. ACM, 109–120.
  • Halpern et al. (2019) M. Halpern, B. Boroujerdian, T. Mummert, E. Duesterwald, and V. Reddi. 2019. One Size Does Not Fit All: Quantifying and Exposing the Accuracy-Latency Trade-off in Machine Learning Cloud Service APIs via Tolerance Tiers. In Proceedings of the 19th International Symposium on Performance Analysis of Systems and Software (ISPASS).
  • Hazelwood et al. (2018) Kim Hazelwood, Sarah Bird, David Brooks, Soumith Chintala, Utku Diril, Dmytro Dzhulgakov, Mohamed Fawzy, Bill Jia, Yangqing Jia, Aditya Kalro, James Law, Kevin Lee, Jason Lu, Pieter Noordhuis, Misha Smelyanskiy, Liang Xiong, and Xiaodong Wang. 2018. Applied Machine Learning at Facebook: A Datacenter Infrastructure Perspective. In Proceedings of the 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA ’18). IEEE.
  • He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition. 770–778.
  • Jain et al. (2018) Paras Jain, Xiangxi Mo, Ajay Jain, Harikaran Subbaraj, Rehan Durrani, Alexey Tumanov, Joseph Gonzalez, and Ion Stoica. 2018. Dynamic Space-Time Scheduling for GPU Inference. In LearningSys Workshop at Neural Information Processing Systems 2018.
  • Jiang et al. (2018) Junchen Jiang, Ganesh Ananthanarayanan, Peter Bodik, Siddhartha Sen, and Ion Stoica. 2018. Chameleon: Scalable Adaptation of Video Analytics. In Proceedings of the 2018 Conference of the ACM Special Interest Group on Data Communication (SIGCOMM ’18). ACM, New York, NY, USA, 253–266.
  • Jing et al. (2015) Yushi Jing, David Liu, Dmitry Kislyuk, Andrew Zhai, Jiajing Xu, Jeff Donahue, and Sarah Tavel. 2015. Visual search at pinterest. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 1889–1898.
  • Jonas et al. (2017) Eric Jonas, Qifan Pu, Shivaram Venkataraman, Ion Stoica, and Benjamin Recht. 2017. Occupy the Cloud: Distributed Computing for the 99%. In Proceedings of the 2017 Symposium on Cloud Computing (SoCC ’17). ACM, New York, NY, USA, 445–451.
  • Jouppi et al. (2017) Norman P. Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, Rick Boyle, Pierre-luc Cantin, Clifford Chao, Chris Clark, Jeremy Coriell, Mike Daley, Matt Dau, Jeffrey Dean, Ben Gelb, Tara Vazir Ghaemmaghami, Rajendra Gottipati, William Gulland, Robert Hagmann, C. Richard Ho, Doug Hogberg, John Hu, Robert Hundt, Dan Hurt, Julian Ibarz, Aaron Jaffey, Alek Jaworski, Alexander Kaplan, Harshit Khaitan, Daniel Killebrew, Andy Koch, Naveen Kumar, Steve Lacy, James Laudon, James Law, Diemthu Le, Chris Leary, Zhuyuan Liu, Kyle Lucke, Alan Lundin, Gordon MacKean, Adriana Maggiore, Maire Mahony, Kieran Miller, Rahul Nagarajan, Ravi Narayanaswami, Ray Ni, Kathy Nix, Thomas Norrie, Mark Omernick, Narayana Penukonda, Andy Phelps, Jonathan Ross, Matt Ross, Amir Salek, Emad Samadiani, Chris Severn, Gregory Sizikov, Matthew Snelham, Jed Souter, Dan Steinberg, Andy Swing, Mercedes Tan, Gregory Thorson, Bo Tian, Horia Toma, Erick Tuttle, Vijay Vasudevan, Richard Walter, Walter Wang, Eric Wilcox, and Doe Hyun Yoon. 2017. In-Datacenter Performance Analysis of a Tensor Processing Unit. In Proceedings of the 44th Annual International Symposium on Computer Architecture (ISCA ’17). ACM, New York, NY, USA, 1–12.
  • Kang et al. (2017) Daniel Kang, John Emmons, Firas Abuzaid, Peter Bailis, and Matei Zaharia. 2017. NoScope: Optimizing Neural Network Queries over Video at Scale. Proc. VLDB Endow. 10, 11 (Aug. 2017), 1586–1597.
  • Kannan et al. (2019) Ram Srivatsa Kannan, Lavanya Subramanian, Ashwin Raju, Jeongseob Ahn, Jason Mars, and Lingjia Tang. 2019. GrandSLAm: Guaranteeing SLAs for Jobs in Microservices Execution Frameworks. In Proceedings of the Fourteenth EuroSys Conference 2019 (EuroSys ’19). ACM, New York, NY, USA, Article 34, 16 pages.
  • Koratana et al. (2018) Animesh Koratana, Daniel Kang, Peter Bailis, and Matei Zaharia. 2018. LIT: Block-wise Intermediate Representation Training for Model Compression. CoRR abs/1810.01937 (2018). arXiv:1810.01937
  • Lee et al. (2018) Yunseong Lee, Alberto Scolari, Byung-Gon Chun, Marco Domenico Santambrogio, Markus Weimer, and Matteo Interlandi. 2018. PRETZEL: Opening the Black Box of Machine Learning Prediction Serving Systems. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18). USENIX Association, Carlsbad, CA, 611–626.
  • Lo et al. (2015) David Lo, Liqun Cheng, Rama Govindaraju, Parthasarathy Ranganathan, and Christos Kozyrakis. 2015. Heracles: Improving Resource Efficiency at Scale. In Proceedings of the 42nd Annual International Symposium on Computer Architecture (ISCA ’15). ACM, New York, NY, USA, 450–462.
  • Meisner and Wenisch (2012) David Meisner and Thomas F. Wenisch. 2012. DreamWeaver: architectural support for deep sleep. In ASPLOS.
  • NVIDIA (2018) NVIDIA 2018. NVIDIA. (2018).
  • Oh et al. (2018) Young H. Oh, Quan Quan, Daeyeon Kim, Seonghak Kim, Jun Heo, Sungjun Jung, Jaeyoung Jang, and Jae W. Lee. 2018. A Portable, Automatic Data Quantizer for Deep Neural Networks. In Proceedings of the 27th International Conference on Parallel Architectures and Compilation Techniques (PACT ’18). ACM, New York, NY, USA, Article 17, 14 pages.
  • Palkar et al. (2017) Shoumik Palkar, James J Thomas, Anil Shanbhag, Deepak Narayanan, Holger Pirk, Malte Schwarzkopf, Saman Amarasinghe, Matei Zaharia, and Stanford InfoLab. 2017. Weld: A common runtime for high performance data analytics. In Conference on Innovative Data Systems Research (CIDR).
  • Polyzotis et al. (2017) Neoklis Polyzotis, Sudip Roy, Steven Euijong Whang, and Martin Zinkevich. 2017. Data management challenges in production machine learning. In Proceedings of the 2017 ACM International Conference on Management of Data. ACM, 1723–1726.
  • Poms et al. (2018) Alex Poms, Will Crichton, Pat Hanrahan, and Kayvon Fatahalian. 2018. Scanner: Efficient Video Analysis at Scale. CoRR abs/1805.07339 (2018). arXiv:1805.07339
  • Redis (2018) Redis 2018. Redis. (2018).
  • Velikovich et al. (2018) Leonid Velikovich, Ian Williams, Justin Scheiner, Petar S. Aleksic, Pedro J. Moreno, and Michael Riley. 2018. Semantic Lattice Processing in Contextual Automatic Speech Recognition for Google Assistant. In Interspeech 2018, 19th Annual Conference of the International Speech Communication Association, Hyderabad, India, 2-6 September 2018. 2222–2226.
  • Venkataraman et al. (2016) Shivaram Venkataraman, Zongheng Yang, Michael Franklin, Benjamin Recht, and Ion Stoica. 2016. Ernest: Efficient Performance Prediction for Large-Scale Advanced Analytics. In 13th USENIX Symposium on Networked Systems Design and Implementation (NSDI 16). USENIX Association, Santa Clara, CA, 363–378.
  • Wang et al. (2018) Wei Wang, Jinyang Gao, Meihui Zhang, Sheng Wang, Gang Chen, Teck Khim Ng, Beng Chin Ooi, Jie Shao, and Moaz Reyad. 2018. Rafiki: machine learning as an analytics service system. Proceedings of the VLDB Endowment 12, 2 (2018), 128–140.
  • Xiao et al. (2018) Wencong Xiao, Romil Bhardwaj, Ramachandran Ramjee, Muthian Sivathanu, Nipun Kwatra, Zhenhua Han, Pratyush Patel, Xuan Peng, Hanyu Zhao, Quanlu Zhang, Fan Yang, and Lidong Zhou. 2018. Gandiva: Introspective Cluster Scheduling for Deep Learning. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18). USENIX Association, Carlsbad, CA, 595–610.
  • Yang et al. (2017) Hailong Yang, Quan Chen, Moeiz Riaz, Zhongzhi Luan, Lingjia Tang, and Jason Mars. 2017. PowerChief: Intelligent power allocation for multi-stage applications to improve responsiveness on power constrained CMP. 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA) (2017), 133–146.
  • Yu and Chowdhury (2019) Peifeng Yu and Mosharaf Chowdhury. 2019. Salus: Fine-Grained GPU Sharing Primitives for Deep Learning Applications. CoRR abs/1902.04610 (2019). arXiv:1902.04610