INFaaS: A Model-less Inference Serving System

Francisco Romero, Qian Li*, Neeraja J. Yadwadkar, Christos Kozyrakis
Stanford University, Google
*Equal contribution

Despite existing work in machine learning inference serving, ease-of-use and cost efficiency remain key challenges. Developers must manually match the performance, accuracy, and cost constraints of their applications to decisions about selecting the right model and model optimizations, suitable hardware architectures, and auto-scaling configurations. These interacting decisions are difficult for users to make, especially when the application load varies, applications evolve, and the available resources vary over time. Thus, users often end up making decisions that overprovision resources. This paper introduces INFaaS, a model-less inference-as-a-service system that relieves users of making these decisions. INFaaS provides a simple interface allowing users to specify their inference task, and performance and accuracy requirements. To implement this interface, INFaaS generates and leverages model-variants, versions of a model that differ in resource footprints, latencies, costs, and accuracies. Based on the characteristics of the model/variants, INFaaS automatically navigates the decision space on behalf of users to meet user-specified objectives: (a) it selects a model, hardware architecture, and any compiler optimizations, and (b) it makes scaling and resource allocation decisions. By sharing models across users and hardware resources across models, INFaaS achieves up to 150× cost savings, 1.5× higher throughput, and violates latency objectives 1.5× less frequently, compared to Clipper and TensorFlow Serving.


1 Introduction

The number of applications relying on inference from Machine Learning (ML) models is already large [36, 47, 34, 14, 51] and expected to keep growing. Facebook, for instance, serves tens-of-trillions of inference queries per day [32]. Inference serving is user-facing. It requires cost-effective systems that render predictions with strict latency constraints while handling unpredictable and bursty request arrivals.

Figure 1: Variety in application requirements, model/variants, and heterogeneous resources. Colored boxes in the last layer show resources with models already loaded on them.

Specifically, inference serving is challenging due to the following reasons [55] (see Figure 1): (a) Diverse application requirements: Applications issue queries that differ in latency, cost, and accuracy requirements. Some applications, such as intruder detection, can tolerate lower accuracy in exchange for low prediction latency while others, such as manufacturing defect detection, cannot. Some queries are latency-sensitive (online), while others are latency-tolerant (offline). (b) Diverse model/variants: Methods such as knowledge distillation [39], or compiler optimizations [18, 7] produce versions of the same model, model/variants, that may differ in inference cost and latency, memory footprint, and accuracy. This increases the number of candidate models to choose from. (c) Dynamic and heterogeneous execution environments: Use of heterogeneous resources, such as TPUs, GPUs, and CPUs, in the face of dynamic changes in application load makes it non-trivial to design scaling and resource allocation policies. Together, these challenges increase the decision space and make it challenging for users wishing to select a model.

Despite existing work in inference serving [21, 9, 6], ease-of-use and resource efficiency remain key challenges. Existing model serving systems [21, 20, 9, 6] give users the ability to deploy ML models on their own infrastructure, while cloud offerings [13, 11, 28, 3] manage the infrastructure for the users. However, these systems still require users to make various decisions: selecting a model/variant, instance type, hardware resources, and autoscaling configurations. Users thus need to navigate the large search space of trade-offs between performance, cost, and accuracies offered by the models, hardware resources, compilers, and other software optimizations. For example, GPUs usually serve large batches of queries with low latencies, but incur high model loading overhead, while CPUs load models faster and perform better with small batch sizes. GPUs cost more than CPUs: almost 8× higher on AWS [15]. This decision complexity is further exacerbated when a model’s query pattern changes over time.

Additional hardware options, such as FPGA [2], Google’s TPU [37], and AWS Inferentia [16], make the problem of manual configurations even more tedious. To circumvent the complexity of navigating this decision space, an alternative is to tightly couple a model to a hardware resource, and use statically-defined resource management policies. However, this results in the use of dedicated, and thus underutilized, resources per-user.

An easy-to-use and cost-effective inference serving system needs to have the following desirable properties [55]: First, it should support queries with a wide range of latency, throughput, and accuracy requirements without requiring significant user efforts to manage or configure the system. Second, based on a query’s requirements, the system should automatically and efficiently select a model/variant, without requiring user intervention. And finally, the system must dynamically react to the changing application requirements and request patterns by deciding when and by how much to increase the number of resources and model instances, and whether to switch to a differently optimized model/variant.

To this end, we built INFaaS, a model-less INFerence-as-a-Service system. INFaaS’ interface allows users to focus on requesting inference for their prediction tasks without needing to think of models, and the trade-offs offered by model-variants, thereby providing ease-of-use. We term this interface model-less. Behind this interface, INFaaS (a) generates various model/variants and their performance-cost profiles on different hardware platforms, (b) generates dynamic profiles indicating availability of hardware resources and state of models (e.g., loaded, but busy), and (c) uses simple, yet effective algorithms to select the right variant, and scale with changes in application load.

We evaluate INFaaS using 158 model/variants generated from 21 model architectures, and compare it to state-of-the-art inference serving systems under query submission patterns derived from real-world user request submissions. INFaaS’ ability to share models across users and hardware resources across models enables it to achieve up to 150× lower cost, 1.5× higher throughput, and violate latency objectives 1.5× less frequently. Our key contributions include:

  • The first model-less inference serving system that rids the users of selecting models to meet the performance and cost requirements of their inference queries.

  • A light-weight selection policy that navigates and leverages the large space of model-variants to automatically meet various application constraints.

  • A mechanism that shares heterogeneous hardware resources and models across user applications to improve utilization and user-costs.

  • An autoscaling algorithm that dynamically decides whether to scale models via replication or upgrade to a differently optimized variant.

2 Challenges and Insights

(a) All 21 model architectures and 158 model/variants
(b) Model/variants with latencies lower than 50 ms
Figure 2: Inference latency, memory usage, and accuracy for image classification model/variants generated with TensorFlow, Caffe2, PyTorch, and TensorRT. Variants of the same model architecture have the same color and marker. For (b), the variants in the blue circle are VGG19 variants.

2.1 Selecting the right model/variant

A model/variant is a version of a model defined by its architecture, the underlying hardware platform, the programming framework, and any compiler optimization used. For a specific model architecture, say ResNet50, a version trained using TensorFlow and running on GPU is an example of its model/variant. Variants for a given model architecture achieve the same accuracy, but may differ in resource usage and performance (throughput and latency), depending on the target hardware platform and programming framework used.

Accuracies may be different for variants of different model architectures trained for the same prediction task (e.g., ResNet50 and VGG16). The number of such model/variants can be large, depending on: (a) model architectures (e.g., ResNet50 and VGG16), (b) programming frameworks, (e.g., TensorFlow and PyTorch), (c) compilers (e.g., TensorRT [7] and TVM [18]), (d) optimization goals (e.g., optimize for batch size of 1 or 32), and (e) hardware platforms (e.g., CPUs and GPUs).

Each hardware platform is unique in terms of its performance, cost, and optimal use cases. For instance, CPUs are currently a cost-effective choice for inference queries with relaxed latency requirements and low batch sizes [32], while GPUs provide more than 10× higher throughput, especially for large batch sizes [1]. FPGAs allow for optimizations for batch-1 inference with narrow datatypes [26]. As new inference accelerators are introduced, such as Google’s TPU [37] and Amazon’s Inferentia [16], and new optimization techniques emerge, the number of model/variants will only grow.

Existing systems require users to identify the model/variant that will meet their performance, accuracy, and cost targets; however, making this decision is hard. Even if a user selects a model architecture, differences in memory footprint, start-up latency, supported batch size, and multiple types of hardware options lead to a large and complex search space. Figure 2(a) demonstrates that, for an image classification task, model architectures and their corresponding model/variants differ greatly in terms of accuracy, inference latency, and peak memory utilization. Even when we focus on variants with inference latencies less than 50 ms in Figure 2(b), the search space remains large and tedious to parse. The ensemble method adopted by Clipper and Rafiki [21, 53] partially solves the problem by sending each inference request to multiple candidate variants and returning an aggregated best result. However, this approach leads to increased cost and still requires users to choose candidate model/variants. We argue that inference systems should instead automate the selection of a model/variant that meets the user’s performance, accuracy, and cost constraints.

Insight 1: The inherent diversity of model/variants across and within hardware platforms can be leveraged to meet diverse user requirements for performance, accuracy, and cost.

Insight 2: To enable ease-of-use to the users, the complexity of parsing this diverse space of model/variants needs to be hidden behind a simple high-level interface. The implementation behind this interface needs to efficiently make choices on users’ behalf for their inference queries.

2.2 Varying usage patterns and objectives

Query patterns and service level objectives (SLOs) for applications, such as real-time language translation and video analytics, can vary unpredictably [32, 38]. Provisioning for peak demand often leads to underutilized resources, and hence, inference serving systems need an autoscaler that dynamically responds to changes in query patterns and SLOs. However, traditional autoscaling mechanisms are agnostic to models and their characteristics, such as sizes and resource footprints, and thus cannot directly be applied to inference serving.

We identify three desirable aspects of autoscaling in the context of model serving: (a) Add/remove worker machines: We can increase the amount of compute and memory resources available to the system by launching additional worker machines. Since inference serving is usually embarrassingly parallel, increasing the number of workers results in proportional increases in throughput and cost. This kind of scaling may incur significant latency, as new machines must be spawned. (b) Add/remove model/variants: We can also increase the number of model instances by replicating selected model/variants on the same or different machines. Replicating on the same machine helps improve utilization of underlying hardware resources. For example, latency-sensitive inference jobs use small batch sizes (1 to 8), which limits parallelism and thus, the utilization of hardware resources. (c) Upgrade/downgrade model/variants: We can upgrade to a variant that is better optimized for the increased load (e.g., one with adaptive batching, to gain throughput potentially at the cost of higher resource usage) or a variant that runs on different hardware platform (e.g., move from CPU to an accelerator).

However, it is not obvious which autoscaling option is the best, especially for different hardware platforms, and models. To illustrate this tradeoff, Figures 3 and 4 compare the latency and throughput of adaptive batching (i.e., increasing batch size) to adding another single-batch model instance on a GPU and CPU, respectively. Figure 3 shows that adaptive batching on GPU can achieve up to higher throughput while lowering the latency by at least 20% compared to the latency observed using 2 model instances. For Inception-ResNetV2 (Figure 3-left), 2 model instances improves throughput by at most 45%, while for MobileNetV1 (Figure 3-right) both latency and throughput get worse. Thus, adaptive batching is better for GPUs than adding model instances. On CPUs (shown in Figure 4), use of 2 model instances doubles the throughput without sacrificing latency. Adaptive batching leads to larger matrix multiplication — the predominant operation in inference processing — that unlike GPUs, leads to higher latency and lower throughput on CPUs. Thus, for CPUs, adding model instances is better than adaptive batching.

Figure 3: Impact of adding model instances versus adaptive batching for two variants on a V100 GPU. Left graph shows average latency and total throughput across 16 threads sending batch-1 requests for Inception-ResNetV2. Right graph is the same for MobileNetV1, 32 threads. Both variants are TensorRT, batch-8, FP16.
Figure 4: Impact of adding model instances versus adaptive batching for two variants on 8-vCPUs. Setup was similar to the one described in Figure 3. Both variants are TensorFlow.

Insight 3: The system must automatically and dynamically react to changes in query submission patterns and state of resources using a scaling strategy: Add/remove machines or model/variants, or upgrade/downgrade model/variants.

2.3 Sharing model/variants and resources

Deploying all model/variants for each user is tedious and cost-inefficient. Instead, we note that there is an opportunity to share both resources and models across users to improve the overall cost, utilization, and even performance. Popular model architectures, such as ResNet50, tend to be commonly queried across several users and applications. Recent work [54, 29] has shown the benefit of sharing GPUs for deep-learning training jobs. ML inference is less demanding for compute and memory resources than training, thus making it an ideal candidate for GPU sharing [56, 33].

Figure 5: Impact of co-locating two models, Inception-ResNetV2 (large) and MobileNetV1 (small), on a V100 GPU. Graphs show average latency and throughput for each model running alone versus sharing. When sharing, same QPS sent to both models. Both variants are TensorRT, batch-1, FP16.

However, how to share accelerators while maintaining predictable performance is unclear. Figure 5 shows the result of co-locating one large and one small model on a GPU. At low load, GPU sharing does not affect the performance of either model. At higher load, sharing heavily impacts the performance of the small model, while the large model remains unaffected. The point when sharing starts negatively affecting the performance varies across models and depends on the load.

An additional opportunity to improve resource utilization is to multiplex resources for online and offline inference jobs. Offline jobs, such as historical data analysis [46] and image labeling at Pinterest [35], tend to process large amounts of data in a batch and are typically latency tolerant (i.e., minutes to hours). Most existing systems provide separate services for online and offline serving [28, 13], leading to resource fragmentation. Since offline jobs are not latency-sensitive, they can run along with online inference tasks during their periods of low or medium load. The tradeoff is in maximizing the resources used by offline jobs while minimizing the interference to online jobs [41].

Insight 4: To improve utilization without violating any performance-cost constraints, an inference serving system should: (a) Share hardware resources across models, and models across users, and (b) harvest spare resources for running offline queries.

3 INFaaS

In this section, we first describe how the insights described in Section 2 led to the design of INFaaS, and then detail the interface (Section 3.1) and the architecture (Section 3.2).

To leverage model/variants, guided by Insight 1, INFaaS generates new variants from the models registered by users, and stores them in a repository. These variants are optimized along different dimensions using compilers such as TVM and TensorRT. To enable a simple model-less interface, guided by Insight 2, INFaaS automatically selects a model/variant for a query to satisfy the user’s performance, cost, and accuracy objectives (detailed in Section 4). To do so, INFaaS profiles the model/variants and underlying resources, and stores their characteristics, static and dynamic, in a metadata store. Static metadata includes the details provided by users at model registration, such as architecture, framework, accuracy, task, and the name of the training dataset. The dynamic state of a model/variant includes its compute and memory footprint, load (queries per second) served by a model/variant, and average inference latency. The dynamic state of an underlying worker machine includes the compute and memory utilization, sampled every few seconds.

Based on Insight 3, INFaaS reacts to changes in the state of resources and user query patterns by automatically scaling resources, as well as model/variants (detailed in Section 5). INFaaS decides whether to add/remove resources, or model instances, or upgrade or downgrade to variants that differ in performance and cost, to satisfy the users’ requirements.

Finally, guided by Insight 4, INFaaS’ autoscaling mechanisms share models across users, and underlying resources across model/variants. INFaaS’ static and dynamic metadata assists in ensuring that its scaling and sharing of resources and variants does not impact performance negatively (detailed in Section 5). INFaaS ensures that this metadata is captured and organized in a way that incurs low access latencies (detailed in Sections 3.2 and 6).

Figure 6: INFaaS system architecture. Numbered circles correspond to the typical life-cycle of queries.
INFaaS’ Workflow (see Figure 6).

Users interact with the Front-End, logically hosted at the Controller, and submit requests for model registration and inference. Controller dispatches inference queries to Worker machines as per the variant selection algorithm (detailed in Section 4). The Variant-Generator generates new variants optimized across different dimensions from existing variants using compilers, such as TVM and TensorRT. The Variant-Profiler profiles these variants on supported hardware platforms to collect various metadata and usage statistics. The static and dynamic metadata about model/variants and the resource utilization statistics about worker machines are stored in the Metadata Store. Worker machines further dispatch inference queries to the appropriate hardware-specific Executors according to the selected model/variant. A typical life-cycle of a query follows the steps marked in Figure 6. Note that variant generation and profiling are one-time tasks, and do not lie on the critical path of serving a query.

3.1 Interface

Table 1 lists INFaaS’ model-less API.

Model registration.

The register_model API takes a serialized model (e.g., a TensorFlow SavedModel or model in ONNX format) along with model metadata, such as its architecture, framework, accuracy, task, and name of the publicly available training dataset. INFaaS verifies the accuracy of a public model on the submitted validation set before registering the model. Users specify whether a model is public or private: access to a private model is restricted to owner-specified ACLs (access-control lists) while public models are accessible to all users.

Query submission and Model-less abstraction.

INFaaS provides three different online_query and offline_query API functions that map user requirements to model/variants using the model-less abstraction, shown in Figure 7. These API functions allow users to express requirements in three ways, from the most generic to the most specific:

  • Specify use-case: With this highest-level abstraction, users specify the prediction task (e.g., classification) and dataset (e.g., ImageNet) their query resembles, along with any latency and accuracy requirements.

  • Specify model architecture: Users specify a model architecture (e.g., ResNet50) and performance requirements, guiding INFaaS’ search for a variant.

  • Specify model/variant: This abstraction allows users to specify a particular model/variant (e.g., ResNet50 trained using Caffe2 on GPU) for their queries. This is the only option offered by existing inference systems.

API             Parameters
register_model  modelBinary, modArch, framework, accuracy, task, dataset, validationSet, isPrivate
model_info      task, dataset, accuracy
online_query    input(s), task, dataset, accuracy, latency
online_query    input(s), modArch, latency
online_query    input(s), modVar
offline_query   inputPath, outputPath, task, dataset, accuracy
offline_query   inputPath, outputPath, modArch
offline_query   inputPath, outputPath, modVar
Table 1: INFaaS user API
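
To make the three query granularities concrete, the following is a minimal Python sketch of a request object that mirrors the online_query rows of Table 1. The class name, the field names, and the rule that the most specific field wins are assumptions made for illustration; they are not part of the INFaaS API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class OnlineQuery:
    """Hypothetical container mirroring the online_query parameters in Table 1."""
    inputs: list
    task: Optional[str] = None
    dataset: Optional[str] = None
    accuracy: Optional[float] = None
    latency_ms: Optional[float] = None
    mod_arch: Optional[str] = None
    mod_var: Optional[str] = None

    def granularity(self) -> str:
        # Assumed rule: an explicit variant overrides an architecture,
        # which in turn overrides a use-case description.
        if self.mod_var is not None:
            return "model-variant"
        if self.mod_arch is not None:
            return "model-architecture"
        if self.task is not None and self.dataset is not None:
            return "use-case"
        raise ValueError("query must name a use-case, an architecture, or a variant")


if __name__ == "__main__":
    q = OnlineQuery(inputs=[b"image-bytes"], task="classification",
                    dataset="ImageNet", accuracy=70.0, latency_ms=50.0)
    print(q.granularity())
```

Under this sketch, a caller who only knows the task and dataset never has to name a model, which is the point of the model-less abstraction.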

3.2 Architecture

We now describe INFaaS’ components, shown in Figure 6. We discuss how INFaaS’ Autoscaler and Model-Autoscaler, Variant-Generator and Variant-Profiler are uniquely designed for supporting INFaaS’ model-less interface.

Controller. The Front-End of the logically-centralized Controller receives model registration and inference requests. The Dispatcher module then selects a model/variant based on (a) the query’s requirements, and (b) the current system state (e.g., which models are running or overloaded). Details of the selection policies are discussed in Section 4. The Autoscaler module is responsible for scaling the number of Workers up and down based on the current load and resource utilization. For fault-tolerance, the Controller is replicated using existing techniques [30, 17].

Workers. Worker machines serve inference queries using instances of model/variants loaded on them. Hardware-specific Executor daemons (e.g., CPU and GPU Executors, in Figure 6) manage the deployment and execution of variants. The Monitoring Daemon tracks variants’ resource utilization and load, and decides when to process offline requests and when to pause them to avoid interference with online serving. The Dispatcher forwards each query to a specific model instance through the corresponding Executor. The Dispatcher and the Monitoring Daemon together manage resources shared by multiple models while avoiding SLO violations, and notify the Controller’s Dispatcher if models need to be migrated. Model-Autoscaler collaborates with the Monitoring Daemon to scale variants as needed within the Worker. The algorithm for resource sharing and scaling is detailed in Section 5.

Model Repository. The Model Repository is a high-capacity, persistent storage medium that stores serialized variants that are accessible to Workers when needed to serve queries.

(a) Abstraction for classification
(b) Abstraction for translation
Figure 7: Examples of the model-less abstraction. Solid blue boxes denote use-case, dashed red boxes indicate model architecture, and dotted green boxes are model/variants.

Variant-Generator and Variant-Profiler. The key objective of this component is to assist the model/variant selection process by extracting static metadata and dynamic statistics about all of the registered models and their variants. The first step is to generate feasible variants for a registered model. Depending on the compatibility of frameworks and intermediate representations, the Variant-Generator generates optimized variants of a model for use on hardware accelerators. For instance, INFaaS uses TensorRT to generate mixed-precision optimized variants for batch sizes from 1 to 64 (powers of two only) that consume lowest to highest GPU memory, respectively. For reduced-precision variants (e.g., INT8), INFaaS uses the validation set submitted by the user to check for changes in accuracy, and also records this information in the Metadata Store. As we discuss in Section 5, all variants within a model architecture are considered for autoscaling by the Model-Autoscaler module.

To help model-variant selection (Section 4) and autoscaling (Section 5), INFaaS conducts a one-time profiling for each model/variant through the Variant-Profiler component. The Variant-Profiler measures statistics, such as the loading and inference latencies, and peak memory utilization. These parameters, along with a model/variant’s task, dataset, framework, accuracy, and maximum supported batch size are recorded in the metadata store. Details of how INFaaS stores inference latencies for different batch sizes are discussed in Section 6.

Metadata Store: The Metadata Store fuels the model selection and autoscaling mechanisms by facilitating efficient access to the static and dynamic data about Workers and model/variants. This data consists of (a) the information about available model architectures and their variants (e.g., accuracy and profiled inference latency), and (b) the resource usage and load statistics of variants and Worker machines. The Metadata Store organizes the model metadata per the model-less abstraction described in Section 3, and strategically uses data structures to access decision-making metadata in (detailed in Section 6). It also enables fast access to the global state of resources and models without needing explicit communication between the Controller and Workers.

Decision Cache. INFaaS needs to select model/variants and Workers for user queries. To accelerate this decision-making, INFaaS maintains a Decision Cache: when queried with the latency requirement as the key, it returns, on a cache hit, the model/variant chosen by a previous decision. We use a version of the LRU (least-recently-used) eviction policy that prefers keeping the decisions for queries with stringent (order of ms) latency requirements. An entry is invalidated when the Controller’s Dispatcher finds a cached variant that is no longer running, and subsequently removes it upon the next entry lookup. Section 6 discusses the implementation details.
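
One way such a cache could behave is sketched below in Python. This is not the INFaaS implementation (Section 6 covers that); the class name, the (architecture, latency) key, and the 10 ms cut-off for a "stringent" requirement are assumptions chosen only to illustrate an LRU policy that prefers to retain stringent-latency decisions.

```python
from collections import OrderedDict


class DecisionCache:
    """LRU-style cache that tries to retain entries with stringent latency targets."""

    def __init__(self, capacity: int, stringent_ms: float = 10.0):
        self.capacity = capacity
        self.stringent_ms = stringent_ms  # assumed cut-off, not taken from the paper
        self._entries = OrderedDict()     # (model_arch, latency_ms) -> variant name

    def get(self, model_arch: str, latency_ms: float):
        key = (model_arch, latency_ms)
        if key not in self._entries:
            return None
        self._entries.move_to_end(key)    # mark as most recently used
        return self._entries[key]

    def put(self, model_arch: str, latency_ms: float, variant: str) -> None:
        key = (model_arch, latency_ms)
        self._entries[key] = variant
        self._entries.move_to_end(key)
        if len(self._entries) > self.capacity:
            self._evict()

    def invalidate(self, model_arch: str, latency_ms: float) -> None:
        # Called when the cached variant is found to be no longer running.
        self._entries.pop((model_arch, latency_ms), None)

    def _evict(self) -> None:
        # Prefer evicting the least recently used relaxed-latency entry.
        for key in self._entries:          # iteration order is oldest first
            if key[1] > self.stringent_ms:
                del self._entries[key]
                return
        self._entries.popitem(last=False)  # every entry is stringent: plain LRU
```

With this policy a recently used decision for a 200 ms target can still be evicted before an older decision for a 5 ms target, since a cache miss costs the latter proportionally more.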

4 Selecting a Model-Variant

1: function SelectModelVariant(modelArch, latency)
2:     if inDecisionCache(modelArch, latency) then
5:     for each variant v of modelArch do
6:         if v is running and not overloaded then
8:             return v        ▷ Add to Decision Cache
9:     return searchAndLoad(modelArch, latency)
Algorithm 1 Model-Variant Selection.

Automatic model/variant selection is key to INFaaS’ model-less interface, as we pointed out in Insights 1 and 2. We need model/variant selection in two scenarios, when users specify: (a) only the use-case, and (b) the model architecture. Algorithm 1 describes INFaaS’ model selection process where a user specifies a model architecture and a latency target.

In Lines 2-4, INFaaS first checks whether a decision matching the specified latency requirement was cached. If the corresponding cache entry is found, INFaaS queries the metadata store to get a list of workers running the model/variant. If this list is non-empty, INFaaS dispatches the query to the least-loaded worker machine. INFaaS also ensures that the variant instance is not overloaded by comparing its current QPS and average latency with its profiled values.

If we get a miss in the decision cache, or if the cached variant is not running on any worker (Lines 5-8), INFaaS queries the metadata store to search through all variants under a model architecture. For efficiency, this search is not conducted linearly: as we describe in Section 6, the metadata store organization enables the search to begin with variants that are closest to meeting the latency constraint. If INFaaS finds a variant that is running and not overloaded, it again gets a list of workers running the model/variant. The query is then dispatched to the least-loaded worker.

Finally, if we find no running variant (Line 9), INFaaS selects and loads the cheapest variant with the lowest combined loading and inference latency that matches the query’s requirement. INFaaS sends the query to the worker with the lowest utilization of the variant’s target hardware, while load balancing to avoid hot-spots.

For brevity, Algorithm 1 omits the code when only the use-case is specified. The main difference is that Line 4 queries the metadata store for the top model/variants that meet the user’s requirements. automatically sets based on the latency constraint (e.g., for a 20 ms deadline), and begins with variants that are closest to meeting the deadline.

INFaaS makes these decisions on the order of hundreds of μs to ms. We assess these latencies further in Section 7.5.
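
The three stages of selection described in this section (cached decision, search over running variants, cold load) can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the INFaaS code: the Variant fields, the helper names, the overload test (QPS only, whereas the text also compares average latency), and the reading of "cheapest variant with the lowest combined loading and inference latency" as cost first with latency as the tie-breaker are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Variant:
    name: str
    inference_ms: float          # profiled inference latency
    load_ms: float               # profiled loading latency
    cost: float                  # relative hardware cost (assumed unit)
    profiled_qps: float          # peak QPS observed during profiling
    current_qps: float = 0.0
    workers: Dict[str, float] = field(default_factory=dict)  # worker -> load

    def usable(self) -> bool:
        # Running on at least one worker and below its profiled peak load.
        return bool(self.workers) and self.current_qps < self.profiled_qps


def least_loaded(v: Variant) -> str:
    return min(v.workers, key=v.workers.get)


def select_variant(variants: List[Variant], latency_ms: float,
                   cache: Dict[float, Variant]) -> Tuple[Optional[Variant], Optional[str]]:
    # Stage 1: reuse a cached decision if that variant is still usable.
    cached = cache.get(latency_ms)
    if cached is not None and cached.usable():
        return cached, least_loaded(cached)
    # Stage 2: scan variants that meet the target, closest to the target first.
    feasible = [v for v in variants if v.inference_ms <= latency_ms]
    for v in sorted(feasible, key=lambda v: latency_ms - v.inference_ms):
        if v.usable():
            cache[latency_ms] = v
            return v, least_loaded(v)
    # Stage 3: nothing suitable is running, so pick a variant to load.
    # The target worker is chosen separately, hence None here.
    if not feasible:
        return None, None
    best = min(feasible, key=lambda v: (v.cost, v.load_ms + v.inference_ms))
    return best, None
```

Sorting the candidates here stands in for the metadata store organization of Section 6, which lets the real search start near the latency target without a linear scan.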

5 Autoscaling

Automatically scaling resources in response to changing load of user queries is critical to implementing INFaaS’ model-less interface. As described by Insights 3 and 4, INFaaS must decide how to scale (a) the number of worker machines, (b) the number of model/variant replicas, and (c) the types of model/variants on the workers.

INFaaS’ autoscaling is a joint effort between the controller and workers. The Autoscaler on the controller (shown in Figure 6) scales the number of workers, and replicates variants across machines. The Autoscaler has access to the utilization of all the workers; this data is captured and maintained by the worker-specific monitoring daemons in the metadata store. The Model-Autoscaler on each worker either replicates or upgrades variants on the same machine. Without this division of responsibility between controller and workers, the controller would need to monitor variants running on each worker, adding significant overhead.

5.1 Controller’s Autoscaler

The Autoscaler on the controller decides if and when a new worker should be brought up/down. To do so, it uses the utilization and load statistics of workers and variants, stored in the metadata store. The monitoring daemon on each worker updates the metadata store with utilization, queries served per second (QPS), and average latency of each running model/variant every 2 seconds. Based on this profiled metadata, the Autoscaler starts a worker under 3 conditions.

First, if CPU utilization exceeds a pre-defined threshold on all the workers, the Autoscaler adds a new CPU worker. We set the threshold to 80% considering the time VMs take to instantiate (20-30 seconds) and the longest loading latency for variants (7 seconds). A lower threshold triggers scaling too quickly and adds workers unnecessarily, while a higher value may not meet the scaling need in time given the VM start-up latency and the time taken to load new models.

Second, similar to CPUs, if a GPU’s utilization exceeds 80%, a new worker with GPU is started. A new worker with GPU is also added if all existing GPU workers are found to cause contention to the variants running on them. The monitoring daemon keeps track of utilization statistics and flags such contentions when the performance (latencies and throughputs) of variants sharing a GPU degrades compared to their profiled values.

Third, if INFaaS detects that at least two variants on a worker have latencies higher than their profiled values for one second, the affected worker is “blacklisted” for the next two seconds to avoid continuously overloading it. The load-balancer then diverts requests to other workers, causing variant replication across workers. If more than 80% of workers are blacklisted at a time, a new worker is started. INFaaS schedules requests to workers using an online bin packing algorithm [49] to improve utilization.
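
The three conditions above can be summarized as a small decision function. The Python sketch below is illustrative only: the WorkerStats fields and the function name are hypothetical, and because the text does not say which kind of worker the third condition starts, the sketch assumes a CPU worker and skips it when another condition has already fired.

```python
from dataclasses import dataclass
from typing import List

UTIL_THRESHOLD = 0.80       # utilization threshold stated in the text
BLACKLIST_FRACTION = 0.80   # share of blacklisted workers that triggers scaling


@dataclass
class WorkerStats:
    cpu_util: float
    has_gpu: bool = False
    gpu_util: float = 0.0
    gpu_contended: bool = False   # variants sharing the GPU run below profile
    blacklisted: bool = False


def workers_to_add(workers: List[WorkerStats]) -> List[str]:
    """Return the kinds of worker to start ("cpu" or "gpu"); may be empty."""
    if not workers:
        return ["cpu"]            # assumed bootstrap behaviour
    to_add = []
    # Condition 1: CPU utilization exceeds the threshold on all workers.
    if all(w.cpu_util > UTIL_THRESHOLD for w in workers):
        to_add.append("cpu")
    # Condition 2: a GPU exceeds the threshold, or all GPU workers show contention.
    gpus = [w for w in workers if w.has_gpu]
    if gpus and (any(w.gpu_util > UTIL_THRESHOLD for w in gpus)
                 or all(w.gpu_contended for w in gpus)):
        to_add.append("gpu")
    # Condition 3: more than 80% of workers are blacklisted.
    blacklisted = sum(w.blacklisted for w in workers)
    if blacklisted / len(workers) > BLACKLIST_FRACTION and not to_add:
        to_add.append("cpu")      # worker type is an assumption
    return to_add
```

In a running system this function would be evaluated against the statistics that the monitoring daemons write to the metadata store every 2 seconds.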

1: function ScaleUp(modelArch)
2:     for … do
3:         …    ▷ Remaining request load headroom
4:         if … then
           …
8: function ScaleDown(modelVar, ts)    ▷ ts is a counter (Section 5.2)
9:     if isCpuVariant(modelVar) then    ▷ CPU variant
10:        if can serve after removing 1 instance then
11:            Increment ts
           …
13:    else if isGpuVariant(modelVar) then    ▷ GPU variant
14:        if can serve after downgrading this variant then
15:            Increment ts
           …
Algorithm 2 Model-Autoscaling

5.2 Model-Autoscaler at each worker

The controller adds/removes workers, and dispatches queries to them as described in the previous section. Based on the requested load, each worker’s autoscaler, the Model-Autoscaler, decides whether to replicate variants on the same machine, or upgrade to a differently optimized variant.

Scaling Up: The ScaleUp routine in Algorithm 2 describes how workers react to increases in requested load. The current load of a model/variant is compared to the maximum load it can serve with the currently allocated resources. We define the current load as the query rate weighted by the average query batch size. The maximum load is a function of the variant’s inference latency, supported batch size, and current number of instances. If the delta (the difference between the maximum and current load, Line 3) drops below what is necessary to serve load spikes (loadSpikeSlack in Line 4, set to 5%), the next step is to decide the most cost-effective scaling strategy given available resources (Lines 5-7).

For CPU variants, the algorithm computes the cost of adding replicas or upgrading, e.g., switching to a TensorRT variant, on the same machine. For GPU variants, the algorithm computes the cost of upgrading to a higher-batch variant. The strategy with the lowest cost — a function of model load latency, resource consumption, and hardware cost — is selected and deployed. If the upgrading strategy is chosen on a CPU-only worker, the worker coordinates with the controller to load the GPU variant on a capable worker. For GPU variants, the Model-Autoscaler selects the upgrade strategy and switches to a variant with a higher batch size for improved adaptive batching, at the cost of higher GPU memory consumption. From our analysis in Section 2.2, adaptive batching improves GPU throughput at a lower latency compared to replicating. Hence, we do not replicate model/variants on the same GPU.

Scaling Down: The ScaleDown routine in Algorithm 2 checks if the current load can be supported by removing an instance running on a CPU (Lines 9-12), or downgrading a GPU variant to a lower-batch or a CPU variant (Lines 13-16). To avoid scaling down too quickly, the Model-Autoscaler waits for a fixed number of time slots, tracked by the counter ts, before executing the chosen strategy. This number is set according to the largest loading latency of a variant on a hardware platform: in our experiments, we set it to 10 for CPU variants and 20 for GPU variants (the scale-down counter maximums listed in Section 6).

Though we only describe strategies for CPU and GPU variants, the scaling routines are extensible to other hardware.
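The scale-up trigger and the scale-down counter described in this section can be sketched as follows. This is an illustration under stated assumptions rather than the routine from Algorithm 2: the function names are hypothetical, the 5% slack is interpreted as a fraction of the maximum load, and the counter is assumed to reset whenever the reduced configuration could not serve the load. The slack and the counter maximums (10 for CPU, 20 for GPU) are the values reported in Section 6.

```python
LOAD_SPIKE_SLACK = 0.05                          # load spike slack (5%)
MAX_SCALE_DOWN_TICKS = {"cpu": 10, "gpu": 20}    # scale-down counter maximums


def needs_scale_up(current_load, max_load, slack=LOAD_SPIKE_SLACK):
    """True when the remaining headroom falls below the load-spike slack.

    ASSUMPTION: the slack is a fraction of max_load.
    """
    headroom = max_load - current_load
    return headroom < slack * max_load


def cheapest_strategy(costs):
    """Pick the strategy name with the lowest estimated cost.

    `costs` maps a strategy name (e.g. "replicate", "upgrade") to a cost value;
    how that cost is computed is outside this sketch.
    """
    return min(costs, key=costs.get)


def scale_down_tick(hardware, can_serve_with_less, ts):
    """Advance the scale-down counter and return (new_ts, act_now).

    ASSUMPTION: the counter resets to 0 when the reduced configuration could
    not serve the current load, and after a scale-down is executed.
    """
    if not can_serve_with_less:
        return 0, False
    ts += 1
    if ts >= MAX_SCALE_DOWN_TICKS[hardware]:
        return 0, True
    return ts, False
```

Under these assumptions, a CPU variant is scaled down only after ten consecutive checks in which one fewer instance would suffice, which keeps short dips in load from triggering an unload followed by a costly reload.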

6 Implementation

We implemented INFaaS in about 18.6K lines of C++ code (INFaaS is open-sourced). INFaaS’ API and communication logic between Controller and Workers are implemented using gRPC in C++ [4]. Users can interact with INFaaS by issuing gRPC requests in any language. INFaaS uses AWS S3 for its Model Repository [12].

On the Controller machine, the Front-End, Dispatcher, and Model Registrar are threads of the same process for fast query dispatch. The Dispatcher collaborates with Monitoring Daemons at Workers to avoid creating hotspots. To do so, it tracks (a) queuing delays and current load in QPS, and (b) resource utilization, on each Worker. The Autoscaler runs as a separate process, polling system status every 2 seconds. The Decision Cache is implemented as a key-value store.

On Worker machines, the Dispatcher and Monitoring Daemon run as separate processes. The Monitoring Daemon writes compute and memory utilization, along with the load and average inference latency of each variant running on that worker, to the Metadata Store every 2 seconds. We run all monitoring and autoscaling threads with low priority (nice value 10) to reduce interference with the threads serving user queries. We built the GPU Executor using the TensorRT Inference Server-19.03 [6], which supports TensorRT, Caffe2, and TensorFlow variants. We deployed a custom Docker container for PyTorch models. We used the TensorFlow Serving container for TensorFlow models on CPU [9]. The Model-Autoscaler’s main thread monitors query load and average latencies for model/variants every second, and makes scaling decisions according to Algorithm 2. It also manages a thread pool for asynchronously loading and unloading model/variants.

Figure 8: Inference latency as batch size increases for CPU (left) and GPU (right) variants. Batch sizes up to 16 can be linearly fitted for both CPU and GPU variants.

We built the Variant-Generator using TensorRT [7]; it can be extended to similar frameworks [18, 44]. Storing profiling data for every model/variant and every batch size would make lookups inefficient when the Controller or the Workers need this data to make decisions. We reduce the amount of data stored for each model/variant as follows. As observed from Figure 8, although inference latency does not increase linearly with batch size, it follows a piece-wise linear trend up to a batch size of 16. We therefore only measure the inference latencies for batch sizes of 1, 4, and 8, and use linear regression to predict expected latencies for other batch sizes.
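As a small illustration of this profiling shortcut, the sketch below fits an ordinary least-squares line through three measured points and extrapolates to another batch size. The latency values are made up for the example, and the helper names `fit_line` and `predict_latency` are hypothetical; the paper does not specify how the regression is implemented.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict_latency(batch_size, slope, intercept):
    """Predicted inference latency (ms) for an unprofiled batch size."""
    return slope * batch_size + intercept


# Hypothetical profiled latencies (ms) for batch sizes 1, 4, and 8.
profiled = {1: 5.0, 4: 11.0, 8: 19.0}
slope, intercept = fit_line(list(profiled), list(profiled.values()))
print(predict_latency(16, slope, intercept))
```

Only the slope and intercept need to be stored per model/variant, instead of one latency entry per batch size.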

INFaaS’ Metadata Store is implemented as a key-value store that replies to Controller and Worker queries within hundreds of microseconds. Specifically, we use Redis [48] and the Redox C++ library [8]. We run the Redis server on the same machine as the Controller to reduce variant selection latencies. The Metadata Store uses hash maps, lists, and sorted sets for making fast metadata lookups, which constitute the majority of its queries. One-time updates (e.g., whether a variant is running on a Worker) are immediately made to the Metadata Store, while periodic updates (e.g., hardware utilization) occur every 1-2 seconds. We back up the Metadata Store in AWS S3 periodically for fault tolerance.

Thresholds Configurability. Finally, we note that INFaaS’ thresholds are configurable. We used the following values: (a) Decision Cache size (20 entries), (b) Offline job resource utilization (40%), (c) Autoscaler scale up resource utilization (80%), (d) Model-Autoscaler load spike slack (5%), (e) Worker blacklist threshold (1 second), (f) Worker blacklist length (2 seconds), (g) Worker scale-down counter maximums (10 for CPU, 20 for GPU), and (h) Monitoring Daemon utilization recording frequency (2 seconds).

7 Evaluation

To demonstrate the effectiveness of INFaaS’ design decisions and optimizations, we first evaluated its individual aspects: ease-of-use (Section 7.1), scalability (Section 7.2), and improvement in resource utilization and cost savings (Section 7.3). We then compared INFaaS, with all of its optimizations and features, to existing systems (Section 7.4). We begin by describing the experimental setup common across all our experiments, the baselines, and the workloads.

Model Arch      # Vars    Model Arch     # Vars    Model Arch            # Vars
alexnet         9         resnet101      11        resnext50             3
densenet121     12        resnet101v2    3         vgg16                 18
densenet169     5         resnet152      11        vgg19                 12
densenet201     5         resnet152v2    3         inception-resnetv2    9
mobilenetv1     10        resnet50       18        inceptionv3           11
mobilenetv2     3         resnet50v2     3         xception              3
nasnetmobile    3         resnext101     3         nasnetlarge           3
Table 2: Model architectures and associated model/variants.

Experimental Setup. We deployed INFaaS on AWS EC2 [10]. The controller ran on an m5.2xlarge instance (8 vCPUs, 32GiB DRAM), and workers ran on p3.2xlarge (8 vCPUs, 61GiB DRAM, one NVIDIA V100 GPU) and m5.2xlarge instances. All instances feature Intel Xeon Platinum 8175M CPUs operating at 2.50GHz, Ubuntu 16.04 with 4.4.0 kernel, and up to 10Gbps networking speed.

Baselines. To the best of our knowledge, no existing system provides a model-less interface like INFaaS. State-of-the-art serving systems require users to specify the variant and hardware for each query. For fair comparison with these systems, we configured INFaaS to closely resemble the resource management policies, autoscaling techniques, and APIs of existing systems, including TensorFlow Serving [9] (TFS), TensorRT Inference Server (TRTIS) [6], Clipper [21], InferLine [20], AWS SageMaker [13], and Google CloudML [28]. Specifically, we compared INFaaS to the following baseline configurations for online query execution:

  • Static baseline: Derived from TFS and TRTIS, this baseline pre-loads all model/variants and sets a pre-defined number of instances. To show the performance and cost difference between hardware platforms, we considered two cases: only GPUs are used (static GPU) and only CPUs are used (static CPU).

  • Individually scaled baseline: Derived from Clipper, InferLine, SageMaker, and CloudML, this baseline individually scales each model/variant horizontally by adding/removing instances within or across multiple workers, but cannot upgrade/downgrade variants. We considered two cases: only GPUs (individually scaled GPU) and only CPUs (individually scaled CPU).

Configuring the baselines with INFaaS (a) allowed for a fair comparison by removing variabilities in execution environments (e.g., RPC libraries and container technologies), and (b) enabled us to evaluate each design decision individually by giving the baselines access to INFaaS’ optimizations (e.g., support for various frameworks and hardware resources). For example, the GPU baselines benefited from having TensorRT optimizations, and from INFaaS’ detection and mitigation of worker and variant performance degradation.

Model/variants. Table 2 shows 21 model architectures and the number of model/variants associated with each: 158 in total. As discussed in Section 2.1, the number of variants depends on the frameworks (e.g., TensorFlow, Caffe2), hardware platforms (e.g., CPUs, GPUs), and compilers (e.g., TensorRT, TVM). Our model/variants are classification models pre-trained on ImageNet [24] using Caffe2, TensorFlow, or PyTorch. For the 10 model architectures capable of being optimized by TensorRT, INFaaS generated 6 optimized variants for batch sizes between 1 and 64 using TensorRT version 5.1.2.

Workloads. We used common patterns [23] indicating flat and fluctuating loads. Additionally, since there are no publicly available datasets that indicate inference serving’s query patterns, we used a real-world Twitter trace from 2018 collected over a month, with a Poisson inter-arrival rate for queries. As noted in prior work on inference serving [57, 42], this trace resembles inference workloads, as there are both diurnal patterns and unexpected spikes. We randomly selected one day out of the month for each experiment from the Twitter trace.
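One way to turn a trace into such a workload is sketched below. It is an assumption-laden illustration: the paper says the trace is mapped to a QPS range and that queries have Poisson inter-arrivals, but the linear min-max rescaling, the per-second granularity, and the function names `map_to_qps_range` and `poisson_arrivals` are choices made here for the example.

```python
import random


def map_to_qps_range(trace, low_qps, high_qps):
    """Linearly rescale per-second trace counts into [low_qps, high_qps].

    ASSUMPTION: min-max scaling; the paper does not state the exact mapping.
    """
    lo, hi = min(trace), max(trace)
    if hi == lo:
        return [float(low_qps)] * len(trace)
    scale = (high_qps - low_qps) / (hi - lo)
    return [low_qps + (v - lo) * scale for v in trace]


def poisson_arrivals(qps_per_second, seed=0):
    """Generate query timestamps (seconds) with exponential inter-arrival gaps.

    Each one-second slot uses its own rate; a slot with rate 0 yields no queries.
    """
    rng = random.Random(seed)
    arrivals = []
    for second, rate in enumerate(qps_per_second):
        if rate <= 0:
            continue
        t = second + rng.expovariate(rate)
        while t < second + 1:
            arrivals.append(t)
            t += rng.expovariate(rate)
    return arrivals
```

For example, `poisson_arrivals(map_to_qps_range(trace, 100, 700))` would produce a query schedule whose rate follows the shape of `trace` between 100 and 700 QPS.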

1   # Model registration parameters
2   model_params = ('', 'PT-mod', ...)
3
4   # === Clipper model registration and query ===
5   def predict(model, inputs):
6     ... # Prediction function defined here
7   clipper.register_application(name="PT-app", slo=200ms)
8   deploy_pytorch_model(model_params, func=predict)
9   clipper.link_model_to_app("PT-app", "PT-mod")
10  clipper.set_num_replicas("PT-mod", 2)
11  q_addr = clipper.get_query_addr()
12  …"/PT-app/predict", headers, img1)
13
14  # === INFaaS model registration and query ===
15  infaas.register_model(model_params)
16  infaas.online_query('Sally', img1, 'classification',
17                      'imagenet', 70%, 200ms)
Figure: Python code for registering and querying models with Clipper (Lines 4-12) and INFaaS (Lines 14-17). Code simplified for display.

7.1 Does INFaaS improve ease-of-use?

INFaaS’ key goal is to simplify the use of serving systems. Existing systems, including SageMaker, CloudML, and Clipper, require users to explicitly decide the variant, hardware, and scaling policy. The code listing above (Lines 4-12) shows how a Clipper user would create a prediction function, register an SLO per application, and manually configure the number of instances. When querying the model, users need to specify a variant tied to a hardware platform, SLO, and scaling policy. Other systems require similar or even more complex configurations (e.g., setting thresholds for scaling per model).

In contrast, INFaaS simplifies inference for users by automatically generating model/variants, selecting a variant for each query, and managing and scaling hardware resources to support its model-less interface. Users can query the same model with different latency and accuracy requirements using the model-less API (Table 1). Finally, users only need to specify a task and SLO requirements with their query. Nevertheless, INFaaS also supports expert users who want to exert direct control over the settings. Thus, with minimal configuration, users can specify prediction tasks and any high-level performance goals to INFaaS.

7.2 How well does INFaaS scale with load?

We now demonstrate the efficiency of INFaaS’ autoscaling in reacting to changes in query patterns. INFaaS’ autoscaling is a combined effort by the controller’s autoscaler and the model-autoscaler. The controller’s autoscaler (detailed in Section 5.1) adds CPU/GPU workers when (a) resource utilization exceeds a threshold (80%), or (b) contention for existing GPUs is detected. The model-autoscaler (detailed in Section 5.2) runs on each worker: it replicates/upgrades variants when the load increases, and removes/downgrades model/variants when the load decreases, as described in Algorithm 2.

Experimental Setup. We compared INFaaS with the four baseline configurations: static CPU, static GPU, individually scaled CPU, and individually scaled GPU. The static CPU baseline pre-loaded and persisted 2 TensorFlow CPU instances. The static GPU baseline persisted one batch-8 optimized TensorRT variant, sized to serve the provided peak load. The individually scaled CPU baseline dynamically added/removed instances of the TensorFlow CPU variant. The individually scaled GPU baseline dynamically replicated a batch-1 optimized TensorRT variant (the cheapest GPU variant). We used one model architecture, ResNet50, and one worker. We measured throughput and P99 latency every 2 seconds, and calculated the total cost. Cost for a running model/variant is estimated according to its memory footprint based on AWS EC2 pricing [15]. We normalize cost to 1 for 1 GB/sec on CPU, and 7.97 for 1 GB/sec on GPU.
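The cost normalization can be written out as a short calculation. This sketch assumes that "1 GB/sec" means one gigabyte of memory footprint held for one second, so cost grows with both footprint and running time; the function name `variant_cost` is hypothetical, and only the rates 1 and 7.97 come from the text.

```python
# Normalized cost per GB of memory footprint per second of running time.
COST_PER_GB_SEC = {"cpu": 1.0, "gpu": 7.97}


def variant_cost(hardware, memory_gb, seconds):
    """Estimated normalized cost of keeping one variant loaded.

    ASSUMPTION: cost = rate * memory footprint (GB) * time loaded (s).
    """
    return COST_PER_GB_SEC[hardware] * memory_gb * seconds
```

Under this reading, a GPU variant costs 7.97 times as much as a CPU variant with the same footprint and running time, which is why serving low loads from CPU variants reduces total cost.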

Different load patterns. To evaluate scalability, we used three load patterns that are commonly observed in real-world setups [23]: (a) a flat, low load (4 QPS), (b) a steady, high load (slowly increase from 650 to 700 QPS), and (c) a fluctuating load (ranging between 4 and 100 QPS).

Figures 9(a) and 9(d) show the throughput and total cost, respectively, for INFaaS and the baselines when serving a flat, low load. The two GPU-only baselines met the throughput demand, but incurred high costs since they only use GPU variants. INFaaS automatically selected CPU variants when they could meet the demand, thus reducing cost by 150× and 127× compared to the two GPU-only baselines. For a steady, high load (Figures 9(b) and 9(e)), the two CPU-only baselines served only 10 QPS (even with multiple instances). INFaaS automatically selected the batch-8 GPU variant, and both INFaaS and the static GPU baseline met the throughput demand. While the individually scaled GPU baseline replicated to 2 GPU variants to meet the load, it was 1.7× more expensive than INFaaS and the static GPU baseline, and served 15% fewer QPS. Finally, for a fluctuating load (Figures 9(c) and 9(f)), INFaaS and the two GPU-only baselines met the throughput demand, while both CPU-only baselines served only 10 QPS. During low load periods (0-60 seconds, 90-150 seconds, and 180-240 seconds), INFaaS used a CPU variant. At load spikes (60-90 seconds and 150-180 seconds), INFaaS upgraded to a TensorRT batch-1 variant. Hence, INFaaS was 1.45× and 1.54× cheaper than the two GPU-only baselines.

Figure 9: Performance of different autoscaling strategies, with ResNet50 and batch-1 requests. Panels: (a)-(c) throughput and (d)-(f) total cost for the flat low, steady high, and fluctuating loads; (g) P99 latency and cost, Twitter load; (h) INFaaS variant breakdown.

Twitter dataset. We then used a real-world dataset to show that INFaaS reduces cost while maintaining low P99 latencies. We mapped the Twitter trace to a range between 100 and 700 QPS for a total of 49,000 batch-1 queries. Figure 9(g) shows that INFaaS maintained comparable P99 latencies to the two GPU-only baselines, but was 1.11× and 1.22× cheaper. Figure 9(h) demonstrates how INFaaS’ model selection and model-autoscaling algorithms leveraged GPU variants optimized for different batch sizes (lower batch is cheaper) to enable low latency and reduced cost. As the load increased, INFaaS gradually upgraded from TRT-1, through TRT-4, to TRT-8, which enabled adaptive batching and kept latency low. As the load decreased, INFaaS downgraded back to TRT-4, then TRT-1. INFaaS matched the static GPU baseline’s throughput, and had 15% higher throughput than the individually scaled GPU baseline.

Thus, INFaaS scales and adapts to changes in load and query patterns, and improves cost by up to 150×.

7.3 Does INFaaS share resources effectively?

7.3.1 Sharing hardware resources

We first show how INFaaS manages and shares GPU resources across models without affecting performance. We compared INFaaS to the static GPU baseline, which persisted one model per GPU. Since this baseline requires a pre-defined number of workers, we specified 2 GPU workers. For fairness, INFaaS was also configured to scale up to 2 GPU workers. We measured throughput and P99 latency every 30 seconds, and expected INFaaS to (a) detect when model latencies exceeded their profiled values, and (b) either migrate the model to a different GPU, or scale to a new GPU worker if all GPUs were serving variants near their profiled peak throughput.

To demonstrate how resource sharing differs with model popularity, we evaluated the scenario where one popular model served 80% QPS, and the other model served 20%. As noted in Section 2.3, the load at which GPU sharing starts degrading performance is different across models. We selected two model/variants that diverge in inference latency, throughput, and peak memory: Inception-ResNetV2 (large model) and MobileNetV1 (small model). Both variants are TensorRT-optimized for batch-1. We have observed similar results with other popularity distributions, and with different models. We mapped the Twitter trace to a range between 50 and 500 QPS for a total of 75,000 batch-1 queries.

Figure 10 shows P99 latency and throughput for both models when Inception-ResNetV2 is popular. INFaaS’ autoscaler detected that Inception-ResNetV2 and MobileNetV1 exceeded their profiled latencies around 30 and 50 seconds, respectively. INFaaS started a new GPU worker (30 second start-up latency), created an instance of each model on it, and spread the load for both models across the GPUs. The allocated resources for Inception-ResNetV2 with the static GPU baseline were insufficient, and led to a significant latency increase and throughput degradation. Unlike the static GPU baseline, INFaaS could further mitigate the latency increase by adding more GPU workers (limited to two in this experiment). Similarly, when MobileNetV1 was deemed popular, INFaaS started a new worker after 30 seconds, and after 60 seconds, only replicated MobileNetV1 to the second GPU (not shown for brevity). This allocation was sufficient to maintain low latencies and high throughput for both models.

Even with a high load of up to 500 QPS, INFaaS saved about 10% on cost compared to the static GPU baseline by (a) sharing a GPU across multiple models, and (b) only adding GPUs when latency increases were detected.

(a) Inception-ResNetV2
(b) MobileNetV1
(c) Inception-ResNetV2
(d) MobileNetV1
Figure 10: Performance of co-locating GPU model/variants when 80% of queries are to Inception-ResNetV2.

7.3.2 Co-locating online and offline jobs

(a) Tail latency for online
(b) Throughput for offline
(c) Throughput for online
(d) Worker CPU utilization
Figure 11: Performance and utilization of online-offline queries with ResNet50. Alone: Serving either online or offline queries, but not both; INFaaS: Serving both.

Using spare resources from online queries for offline jobs allows INFaaS to improve utilization. To maintain performance for online queries, INFaaS throttles offline queries when utilization for the underlying worker exceeds a threshold (set to 40%), or the observed latency for online queries exceeds the model/variant’s profiled latency. Lower thresholds would starve offline queries, while higher thresholds would incur severe interference. We measured the throughput of both online and offline queries, and P99 latency for online queries.

To demonstrate INFaaS’ performance when it co-locates online and offline jobs, we used one model architecture (ResNet50), one CPU worker, and pre-loaded 2 TensorFlow ResNet50 instances on CPU. Each CPU instance supports 4 requests per second while maintaining its profiled latency. Online requests had a 500 ms latency SLO, and load varied between 3 and 8 QPS. For offline, we submitted one offline request to ResNet50 at the beginning of the experiment, containing 1,000 input images.

Figures 11(a) to 11(c) contrast the performance of online and offline queries when running alone and when co-located by INFaaS. Figure 11(d) shows the resource utilization change for INFaaS; the 40% threshold is marked. INFaaS maintained performance for online requests in both cases by limiting offline query processing when it detected that (a) resource utilization exceeded 40%, or (b) online latency was higher than profiled. There were two long periods when INFaaS throttled offline processing (see Figure 11(b)): 20-40 and 60-80 seconds, both due to high online resource utilization (60%-70%).
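The throttling rule used in this experiment amounts to a two-part check, sketched below. The function name and parameters are hypothetical; the 40% threshold and the comparison against the profiled latency are taken from the text.

```python
OFFLINE_UTIL_THRESHOLD = 0.40   # offline job resource utilization threshold


def should_throttle_offline(worker_util, online_latency_ms, profiled_latency_ms,
                            threshold=OFFLINE_UTIL_THRESHOLD):
    """Throttle offline work when either online-protection condition holds.

    (a) worker utilization exceeds the threshold, or
    (b) observed online latency exceeds the variant's profiled latency.
    """
    return worker_util > threshold or online_latency_ms > profiled_latency_ms
```

With the 60%-70% utilization reported for the 20-40 and 60-80 second windows, condition (a) alone is enough to pause offline processing.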

7.4 Putting it all together

We now evaluate INFaaS’ automated model selection, resource allocation, and autoscaling mechanisms together.

Experimental Setup. We mapped the Twitter trace to a range between 10 and 1K QPS for a total of 113,420 batch-1 queries. We used all the model architectures listed in Table 2. Similar to prior work, we used a Zipfian distribution for model popularity [40]. We designated 4 model architectures (DenseNet121, ResNet50, VGG16, and InceptionV3) to be popular with 50 ms SLOs and share 80% of the load. The rest are cold models with SLO set to 1.5× the profiled latency of each model’s fastest CPU variant. Requests were sent using 66 client threads, with 2 threads per cold model and 8 threads per popular model. The static baseline persisted 5 CPU and 7 GPU workers. INFaaS and the individually scaled baseline started with 5 CPU and 5 GPU workers, and scaled up to 7 GPU workers. Baselines only used GPU variants for popular models. We evaluated using the following metrics: throughput, latency, and SLO violation ratio. SLO violation ratio is the number of SLO violations versus the total number of queries.
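The setup numbers above can be checked with a few lines of arithmetic. The helper names below are hypothetical; the inputs (21 architectures, 4 popular, 8 and 2 threads, the 1.5 factor) come from the text, and the SLO violation ratio follows the definition given in this section.

```python
def client_threads(total_archs, popular_archs, threads_per_popular, threads_per_cold):
    """Total client threads given per-model thread counts."""
    cold_archs = total_archs - popular_archs
    return popular_archs * threads_per_popular + cold_archs * threads_per_cold


def cold_model_slo(fastest_cpu_latency_ms, factor=1.5):
    """SLO for a cold model: a multiple of its fastest CPU variant's profiled latency."""
    return factor * fastest_cpu_latency_ms


def slo_violation_ratio(latencies_ms, slo_ms):
    """Number of queries exceeding the SLO divided by the total number of queries."""
    if not latencies_ms:
        return 0.0
    violations = sum(1 for latency in latencies_ms if latency > slo_ms)
    return violations / len(latencies_ms)
```

With 21 architectures, 4 of them popular, `client_threads(21, 4, 8, 2)` gives 4 × 8 + 17 × 2 = 66, matching the 66 client threads reported.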

Figure 12: Throughput and SLO violation ratio, measured every 4 seconds. Each box shows the median and the 25% and 75% quartiles; whiskers extend to 1.5× the interquartile range. Circles show the outliers.

Figure 12 shows that INFaaS achieved 1.5× higher throughput than the baselines and violated 50% fewer SLOs on average. This is attributed to both variant replication and upgrading: INFaaS can upgrade to GPU (higher batch) variants while the baselines can only replicate variants. Reacting to increased load, INFaaS added a 6th GPU worker at 44 seconds, and a 7th at 77 seconds. Although the individually scaled baseline also added a 6th and 7th GPU worker, it achieved lower throughput and violated more SLOs due to frequently incurring variant loading penalties and being unable to upgrade variants. INFaaS maintained higher CPU and GPU resource utilization while keeping SLO violations under 10% on average. INFaaS load balances requests and avoids overloading CPU models that have lower QPS limits. This resulted in an average worker utilization of about 55%. For GPU, INFaaS achieved up to 5× and 3× higher GPU DRAM memory utilization than the two baselines.

We also added 4 concurrent offline requests to evaluate the efficiency of resource management. Each offline request contained 500 input images and specified the ResNet50 model architecture. As shown in Figure 12, INFaaS w/offline maintained similar throughput and SLO violations compared to only serving online requests. Across 3 runs, an average of 688 images were processed by offline queries. We observed that INFaaS w/offline maintained CPU core utilization around 60% by harvesting spare resources for offline processing. INFaaS achieves higher performance (1.5× higher throughput), resource utilization (5× higher GPU utilization), and lower SLO violations (50% lower) compared to the baselines.

7.5 What is INFaaS’ decision overhead?

INFaaS makes the following decisions that are on the critical path of serving a query: (a) selecting a model/variant, and (b) selecting a worker. Table 3 shows the fraction of query latency spent on making decisions. Each row corresponds to a query specifying (1) a variant, (2) a model architecture, and (3,4) a use-case. Rows 3 and 4 demonstrate how INFaaS adjusted the number of valid options based on user SLOs (Section 4). For each query, we show how the selected variant being (a) loaded, and (b) not loaded affected the decision latency.

When a model/variant was explicitly specified by the user, INFaaS incurred low overheads (1 ms), as it only selected a worker. When a model architecture was provided, INFaaS leveraged its decision cache to search for a variant that met the SLO. For an already-loaded model, INFaaS quickly selected it along with the least-loaded worker (1.7 ms). Otherwise, INFaaS spent 10.7 ms choosing a variant and a worker. Similarly, when a use-case was provided, INFaaS again searched its decision cache for a variant. For an already-loaded model, INFaaS made the variant and worker selection in 2 ms. Otherwise, INFaaS searched a subset of the large model search space to find a variant. The size of the search space was dictated by the SLO. INFaaS maintains low overheads across its different query submission modes: about 2 ms when using the decision cache, which is less than 12% of the serving time.

Query                         Variant Picked          Latency in ms (% Serving Time)
                              (Valid Options)         Not Loaded      Loaded
resnet50-trt                  resnet50-trt (1)        1.0 (0.01%)     0.9 (4.9%)
resnet50, 300ms               resnet50-tf (15)        10.6 (0.4%)     1.6 (0.7%)
classification, 72%, 20ms     inceptionv3-trt (5)     3.5 (0.06%)     2.2 (11.2%)
classification, 72%, 200ms    nasnetmobile-tf (50)    28.1 (4.9%)     2.0 (1.5%)
Table 3: Median decision latency and fraction of serving time spent on making variant and worker selection across 3 runs.

8 Limitations and Future Directions

White box inference serving: INFaaS currently treats ML models as black boxes. Understanding the internals of models offers additional opportunities to optimize inference serving [40]. For instance, intermediate computations could be reused across “similar” model/variants. We leave model-less inference serving with white box models to future work.

Offline queries with performance SLOs: INFaaS currently supports best-effort execution for offline requests with no support for deadlines or other SLOs. Understanding how to efficiently schedule and process offline requests in a multi-tenant environment given user inputs, deadlines, and cost requirements needs further exploration. INFaaS’ modular design allows it to be extended to work with existing [25, 52] and new deadline-driven scheduling techniques.

Query pre-processing: INFaaS currently assumes that the query inputs are pre-processed (e.g., cropped and scaled images). However, many ML applications have complex pre-processing pipelines that are challenging to deploy [50, 19]. We plan to extend INFaaS’ implementation to support input query pre-processing by adopting high performance data processing libraries, such as DALI [5] and Weld [45].

9 Related Work

Serving Systems and Interfaces: TensorFlow Serving [9] provided one of the first production environments for models trained using the TensorFlow framework. Clipper [21] generalized it to enable the use of different frameworks and application-level SLOs. Other approaches [40, 20] built upon Clipper for optimizing the pipelines of inference serving. SageMaker [13], Cloud ML [28], and Azure ML [3] offer users separate online and offline services that autoscale models based on usage load. SageMaker also introduced Elastic Inference [11] that allows users to rent part of a GPU. TensorRT Inference Server [6] optimizes GPU inference serving while still supporting CPU models, but requires static model replica configuration. For ML-as-a-Service, Tolerance Tiers are a way for users to programmatically choose a tradeoff between accuracy and latency [31].

Unlike INFaaS, none of these existing systems offer a simple model-less interface, or leverage model/variants to meet user requests with accuracy and latency requirements.

Scaling: Swayam [30] focused on improving CPU utilization while meeting user-specified SLOs. Unlike Swayam, INFaaS shares models across different services (further improving resource utilization), and is not restricted to one SLO per application or service. MArk [57] proposed SLO-aware model scheduling and scaling by selecting between AWS EC2 and AWS Lambda to absorb unpredictable load bursts. Autoscale [27] reviewed scaling techniques and argued for a simple approach that maintains slack resources and does not scale down recklessly. Similarly, INFaaS’ autoscalers, at the controller and workers, maintain headroom using scale-down counters to cautiously scale resources down. Existing systems only use model replication, while INFaaS additionally upgrades/downgrades within the same model architecture.

GPU Sharing: NVIDIA MPS [43] enabled efficient sharing of GPUs, which facilitated some of the first exploration into sharing for deep-learning. Tiresias [29] and Gandiva [54] leveraged MPS for deep-learning training. TensorRT Inference Server, TrIMS [22], Salus [56], and Space-Time GPU Scheduling [33] allow GPUs to be shared either spatially, temporally, or both. INFaaS’ current implementation builds on TensorRT Inference Server, and provides SLO-aware GPU sharing. INFaaS can also be extended to leverage other mechanisms for sharing GPUs and other hardware resources.

10 Conclusion

We presented INFaaS: a model-less inference serving system. INFaaS allows users to define inference tasks and performance/accuracy requirements for queries, leaving it to the system to determine the model/variant, hardware, and scaling configuration. We quantitatively demonstrated that INFaaS’ policies for model selection, resource management, and resource sharing lead to reduced costs, better throughput, and fewer SLO violations compared to existing model serving systems.


  • [1] NVIDIA Tesla V100 Tensor Core GPU, 2017.
  • [2] Accelerating DNNs with Xilinx Alveo Accelerator Cards, 2018.
  • [3] Azure Machine Learning, 2018.
  • [4] gRPC, 2018.
  • [5] NVIDIA DALI, 2018.
  • [6] NVIDIA TensorRT Inference Server, 2018.
  • [7] NVIDIA TensorRT: Programmable Inference Accelerator, 2018.
  • [8] Redox, 2018.
  • [9] TensorFlow Serving for model deployment in production, 2018.
  • [10] Amazon EC2, 2018.
  • [11] Amazon Elastic Inference, 2018.
  • [12] Amazon S3, 2018.
  • [13] Amazon SageMaker, 2018.
  • [14] Mohammed Attia, Younes Samih, Ali Elkahky, and Laura Kallmeyer. Multilingual multi-class sentiment classification using convolutional neural networks. pages 635–640, Miyazaki, Japan, 2018.
  • [15] AWS EC2 Pricing, 2018.
  • [16] AWS Inferentia, 2018.
  • [17] Prima Chairunnanda, Khuzaima Daudjee, and M. Tamer Özsu. Confluxdb: Multi-master replication for partitioned snapshot isolation databases. PVLDB, 7:947–958, 2014.
  • [18] Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Haichen Shen, Meghan Cowan, Leyuan Wang, Yuwei Hu, Luis Ceze, Carlos Guestrin, and Arvind Krishnamurthy. TVM: An automated end-to-end optimizing compiler for deep learning. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), pages 578–594, Carlsbad, CA, 2018. USENIX Association.
  • [19] Yang Cheng, Dan Li, Zhiyuan Guo, Binyao Jiang, Jiaxin Lin, Xi Fan, Jinkun Geng, Xinyi Yu, Wei Bai, Lei Qu, Ran Shu, Peng Cheng, Yongqiang Xiong, and Jianping Wu. Dlbooster: Boosting end-to-end deep learning workflows with offloading data preprocessing pipelines. In Proceedings of the 48th International Conference on Parallel Processing, ICPP 2019, pages 88:1–88:11, New York, NY, USA, 2019. ACM.
  • [20] Daniel Crankshaw, Gur-Eyal Sela, Corey Zumar, Xiangxi Mo, Joseph E. Gonzalez, Ion Stoica, and Alexey Tumanov. Inferline: ML inference pipeline composition framework. CoRR, abs/1812.01776, 2018.
  • [21] Daniel Crankshaw, Xin Wang, Giulio Zhou, Michael J. Franklin, Joseph E. Gonzalez, and Ion Stoica. Clipper: A low-latency online prediction serving system. In 14th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2017, Boston, MA, USA, March 27-29, 2017, pages 613–627, 2017.
  • [22] Abdul Dakkak, Cheng Li, Simon Garcia De Gonzalo, Jinjun Xiong, and Wen-Mei W. Hwu. TrIMS: Transparent and isolated model sharing for low latency deep learning inference in function as a service environments. CoRR, abs/1811.09732, 2018.
  • [23] Christina Delimitrou and Christos Kozyrakis. Quasar: Resource-efficient and QoS-aware cluster management. In Proceedings of the 19th International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS ’14, pages 127–144, New York, NY, USA, 2014. ACM.
  • [24] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
  • [25] Andrew D. Ferguson, Peter Bodik, Srikanth Kandula, Eric Boutin, and Rodrigo Fonseca. Jockey: Guaranteed job latency in data parallel clusters. In Proceedings of the 7th ACM European Conference on Computer Systems, EuroSys ’12, pages 99–112, New York, NY, USA, 2012. ACM.
  • [26] Jeremy Fowers, Kalin Ovtcharov, Michael Papamichael, Todd Massengill, Ming Liu, Daniel Lo, Shlomi Alkalay, Michael Haselman, Logan Adams, Mahdi Ghandi, Stephen Heil, Prerak Patel, Adam Sapek, Gabriel Weisz, Lisa Woods, Sitaram Lanka, Steven K. Reinhardt, Adrian M. Caulfield, Eric S. Chung, and Doug Burger. A configurable cloud-scale DNN processor for real-time AI. In Proceedings of the 45th Annual International Symposium on Computer Architecture, ISCA ’18, pages 1–14, Piscataway, NJ, USA, 2018. IEEE Press.
  • [27] Anshul Gandhi, Mor Harchol-Balter, Ram Raghunathan, and Michael A Kozuch. Autoscale: Dynamic, robust capacity management for multi-tier data centers. ACM Transactions on Computer Systems (TOCS), 30(4):14, 2012.
  • [28] Google Cloud Machine Learning Engine, 2018.
  • [29] Juncheng Gu, Mosharaf Chowdhury, Kang G. Shin, Yibo Zhu, Myeongjae Jeon, Junjie Qian, Hongqiang Liu, and Chuanxiong Guo. Tiresias: A GPU cluster manager for distributed deep learning. In 16th USENIX Symposium on Networked Systems Design and Implementation (NSDI 19), pages 485–500, Boston, MA, 2019. USENIX Association.
  • [30] Arpan Gujarati, Sameh Elnikety, Yuxiong He, Kathryn S McKinley, and Björn B Brandenburg. Swayam: distributed autoscaling to meet SLAs of machine learning inference services with resource efficiency. In Proceedings of the 18th ACM/IFIP/USENIX Middleware Conference, pages 109–120. ACM, 2017.
  • [31] M. Halpern, B. Boroujerdian, T. Mummert, E. Duesterwald, and V. Reddi. One size does not fit all: Quantifying and exposing the accuracy-latency trade-off in machine learning cloud service APIs via tolerance tiers. In Proceedings of the 19th International Symposium on Performance Analysis of Systems and Software (ISPASS), 2019.
  • [32] Kim Hazelwood, Sarah Bird, David Brooks, Soumith Chintala, Utku Diril, Dmytro Dzhulgakov, Mohamed Fawzy, Bill Jia, Yangqing Jia, Aditya Kalro, James Law, Kevin Lee, Jason Lu, Pieter Noordhuis, Misha Smelyanskiy, Liang Xiong, and Xiaodong Wang. Applied machine learning at Facebook: A datacenter infrastructure perspective. In Proceedings of the 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA), HPCA ’18. IEEE, 2018.
  • [33] Paras Jain, Xiangxi Mo, Ajay Jain, Harikaran Subbaraj, Rehan Durrani, Alexey Tumanov, Joseph Gonzalez, and Ion Stoica. Dynamic space-time scheduling for GPU inference. In LearningSys Workshop at Neural Information Processing Systems 2018, 2018.
  • [34] Junchen Jiang, Ganesh Ananthanarayanan, Peter Bodik, Siddhartha Sen, and Ion Stoica. Chameleon: Scalable adaptation of video analytics. In Proceedings of the 2018 Conference of the ACM Special Interest Group on Data Communication, SIGCOMM ’18, pages 253–266, New York, NY, USA, 2018. ACM.
  • [35] Yushi Jing, David Liu, Dmitry Kislyuk, Andrew Zhai, Jiajing Xu, Jeff Donahue, and Sarah Tavel. Visual search at Pinterest. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1889–1898. ACM, 2015.
  • [36] Eric Jonas, Qifan Pu, Shivaram Venkataraman, Ion Stoica, and Benjamin Recht. Occupy the cloud: Distributed computing for the 99%. In Proceedings of the 2017 Symposium on Cloud Computing, SoCC ’17, pages 445–451, New York, NY, USA, 2017. ACM.
  • [37] Norman P. Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, Rick Boyle, Pierre-luc Cantin, Clifford Chao, Chris Clark, Jeremy Coriell, Mike Daley, Matt Dau, Jeffrey Dean, Ben Gelb, Tara Vazir Ghaemmaghami, Rajendra Gottipati, William Gulland, Robert Hagmann, C. Richard Ho, Doug Hogberg, John Hu, Robert Hundt, Dan Hurt, Julian Ibarz, Aaron Jaffey, Alek Jaworski, Alexander Kaplan, Harshit Khaitan, Daniel Killebrew, Andy Koch, Naveen Kumar, Steve Lacy, James Laudon, James Law, Diemthu Le, Chris Leary, Zhuyuan Liu, Kyle Lucke, Alan Lundin, Gordon MacKean, Adriana Maggiore, Maire Mahony, Kieran Miller, Rahul Nagarajan, Ravi Narayanaswami, Ray Ni, Kathy Nix, Thomas Norrie, Mark Omernick, Narayana Penukonda, Andy Phelps, Jonathan Ross, Matt Ross, Amir Salek, Emad Samadiani, Chris Severn, Gregory Sizikov, Matthew Snelham, Jed Souter, Dan Steinberg, Andy Swing, Mercedes Tan, Gregory Thorson, Bo Tian, Horia Toma, Erick Tuttle, Vijay Vasudevan, Richard Walter, Walter Wang, Eric Wilcox, and Doe Hyun Yoon. In-datacenter performance analysis of a tensor processing unit. In Proceedings of the 44th Annual International Symposium on Computer Architecture, ISCA ’17, pages 1–12, New York, NY, USA, 2017. ACM.
  • [38] Daniel Kang, John Emmons, Firas Abuzaid, Peter Bailis, and Matei Zaharia. NoScope: Optimizing neural network queries over video at scale. Proc. VLDB Endow., 10(11):1586–1597, August 2017.
  • [39] Animesh Koratana, Daniel Kang, Peter Bailis, and Matei Zaharia. LIT: block-wise intermediate representation training for model compression. CoRR, abs/1810.01937, 2018.
  • [40] Yunseong Lee, Alberto Scolari, Byung-Gon Chun, Marco Domenico Santambrogio, Markus Weimer, and Matteo Interlandi. PRETZEL: Opening the black box of machine learning prediction serving systems. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), pages 611–626, Carlsbad, CA, 2018. USENIX Association.
  • [41] David Lo, Liqun Cheng, Rama Govindaraju, Parthasarathy Ranganathan, and Christos Kozyrakis. Heracles: Improving Resource Efficiency at Scale. In Proceedings of the 42nd Annual International Symposium on Computer Architecture, ISCA ’15, pages 450–462, New York, NY, USA, 2015. ACM.
  • [42] MLPerf Benchmark, 2019.
  • [43] NVIDIA, 2018.
  • [44] Young H. Oh, Quan Quan, Daeyeon Kim, Seonghak Kim, Jun Heo, Sungjun Jung, Jaeyoung Jang, and Jae W. Lee. A portable, automatic data quantizer for deep neural networks. In Proceedings of the 27th International Conference on Parallel Architectures and Compilation Techniques, PACT ’18, pages 17:1–17:14, New York, NY, USA, 2018. ACM.
  • [45] Shoumik Palkar, James J Thomas, Anil Shanbhag, Deepak Narayanan, Holger Pirk, Malte Schwarzkopf, Saman Amarasinghe, and Matei Zaharia. Weld: A common runtime for high performance data analytics. In Conference on Innovative Data Systems Research (CIDR), 2017.
  • [46] Neoklis Polyzotis, Sudip Roy, Steven Euijong Whang, and Martin Zinkevich. Data management challenges in production machine learning. In Proceedings of the 2017 ACM International Conference on Management of Data, pages 1723–1726. ACM, 2017.
  • [47] Alex Poms, Will Crichton, Pat Hanrahan, and Kayvon Fatahalian. Scanner: Efficient video analysis at scale. CoRR, abs/1805.07339, 2018.
  • [48] Redis, 2018.
  • [49] Steven S. Seiden. On the online bin packing problem. J. ACM, 49(5):640–671, September 2002.
  • [50] Evan R. Sparks, Shivaram Venkataraman, Tomer Kaftan, Michael J. Franklin, and Benjamin Recht. KeystoneML: Optimizing pipelines for large-scale advanced analytics. In 33rd IEEE International Conference on Data Engineering, ICDE 2017, San Diego, CA, USA, April 19-22, 2017, pages 535–546, 2017.
  • [51] Leonid Velikovich, Ian Williams, Justin Scheiner, Petar S. Aleksic, Pedro J. Moreno, and Michael Riley. Semantic lattice processing in contextual automatic speech recognition for Google Assistant. In Interspeech 2018, 19th Annual Conference of the International Speech Communication Association, Hyderabad, India, 2-6 September 2018, pages 2222–2226, 2018.
  • [52] Shivaram Venkataraman, Zongheng Yang, Michael Franklin, Benjamin Recht, and Ion Stoica. Ernest: Efficient performance prediction for large-scale advanced analytics. In 13th USENIX Symposium on Networked Systems Design and Implementation (NSDI 16), pages 363–378, Santa Clara, CA, 2016. USENIX Association.
  • [53] Wei Wang, Jinyang Gao, Meihui Zhang, Sheng Wang, Gang Chen, Teck Khim Ng, Beng Chin Ooi, Jie Shao, and Moaz Reyad. Rafiki: machine learning as an analytics service system. Proceedings of the VLDB Endowment, 12(2):128–140, 2018.
  • [54] Wencong Xiao, Romil Bhardwaj, Ramachandran Ramjee, Muthian Sivathanu, Nipun Kwatra, Zhenhua Han, Pratyush Patel, Xuan Peng, Hanyu Zhao, Quanlu Zhang, Fan Yang, and Lidong Zhou. Gandiva: Introspective cluster scheduling for deep learning. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), pages 595–610, Carlsbad, CA, 2018. USENIX Association.
  • [55] Neeraja J. Yadwadkar, Francisco Romero, Qian Li, and Christos Kozyrakis. A Case for Managed and Model-less Inference Serving. In Proceedings of the Workshop on Hot Topics in Operating Systems, pages 184–191. ACM, 2019.
  • [56] Peifeng Yu and Mosharaf Chowdhury. Salus: Fine-grained GPU sharing primitives for deep learning applications. CoRR, abs/1902.04610, 2019.
  • [57] Chengliang Zhang, Minchen Yu, Wei Wang, and Feng Yan. MArk: Exploiting cloud services for cost-effective, SLO-aware machine learning inference serving. In 2019 USENIX Annual Technical Conference (USENIX ATC 19), pages 1049–1062, Renton, WA, July 2019. USENIX Association.