Edge Intelligence: On-Demand Deep Learning Model Co-Inference with Device-Edge Synergy
As the backbone technology of machine learning, deep neural networks (DNNs) have quickly ascended to the spotlight. Running DNNs on resource-constrained mobile devices is, however, by no means trivial, since it incurs high performance and energy overhead. Offloading DNNs to the cloud for execution, on the other hand, suffers from unpredictable performance due to the uncontrolled long wide-area network latency. To address these challenges, in this paper, we propose Edgent, a collaborative and on-demand DNN co-inference framework with device-edge synergy. Edgent pursues two design knobs: (1) DNN partitioning, which adaptively partitions DNN computation between device and edge in order to leverage hybrid computation resources in proximity for real-time DNN inference; and (2) DNN right-sizing, which accelerates DNN inference through early exit at a proper intermediate DNN layer to further reduce the computation latency. The prototype implementation and extensive evaluations based on Raspberry Pi demonstrate Edgent’s effectiveness in enabling on-demand low-latency edge intelligence.
I Introduction & Related Work
As the backbone technology supporting modern intelligent mobile applications, Deep Neural Networks (DNNs) represent the most commonly adopted machine learning technique and have become increasingly popular. Owing to their ability to perform highly accurate and reliable inference, DNNs have been applied successfully across a broad spectrum of domains, from computer vision to speech recognition and natural language processing. However, because DNN-based applications typically require a tremendous amount of computation, today’s mobile devices cannot support them with reasonable latency and energy consumption.
In response to the excessive resource demand of DNNs, the conventional wisdom resorts to the powerful cloud datacenter for training and evaluating DNNs. Input data generated on mobile devices is sent to the cloud for processing, and the results are sent back to the mobile devices after inference. However, with such a cloud-centric approach, large amounts of data (e.g., images and videos) are uploaded to the remote cloud over a long wide-area network, resulting in high end-to-end latency and high energy consumption on the mobile devices. To alleviate the latency and energy bottlenecks of the cloud-centric approach, a better solution is to exploit the emerging edge computing paradigm. Specifically, by pushing cloud capabilities from the network core to the network edge (e.g., base stations and WiFi access points) in close proximity to devices, edge computing enables low-latency and energy-efficient DNN inference.
While recognizing the benefits of edge-based DNN inference, our empirical study reveals that its performance is highly sensitive to the available bandwidth between the edge server and the mobile device. Specifically, as the bandwidth drops from 1Mbps to 50Kbps, the latency of edge-based DNN inference climbs from 0.123s to 2.317s and becomes on par with the latency of local processing on the device. Considering the vulnerable and volatile network bandwidth in realistic environments (e.g., due to user mobility and bandwidth contention among various apps), a natural question is: can we further improve the performance (i.e., latency) of edge-based DNN execution, especially for mission-critical applications such as VR/AR games and robotics?
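To answer the above question in the positive, in this paper we propose Edgent, a deep learning model co-inference framework with device-edge synergy. Towards low-latency edge intelligence, Edgent pursues two design knobs: DNN partitioning, which adaptively partitions DNN computation between the device and the edge, and DNN right-sizing, which accelerates DNN inference through early exit at a proper intermediate DNN layer.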
While the topic of edge intelligence has begun to garner much attention recently, our study is different from and complementary to existing pilot efforts. On the one hand, for fast and low-power DNN inference on the mobile device side, various approaches, exemplified by DNN compression and DNN architecture optimization, have been proposed [5, 6, 7, 8, 9]. Different from these works, we take a scale-out approach to unleash the benefits of collaborative edge intelligence between the edge and mobile devices, and thus mitigate the performance and energy bottlenecks of the end devices. On the other hand, though the idea of DNN partitioning between cloud and end device is not new, realistic measurements show that DNN partitioning alone is not enough to satisfy the stringent timeliness requirements of mission-critical applications. Therefore, we further apply DNN right-sizing to speed up DNN inference.
II Background & Motivation
In this section, we first give a primer on DNN, then analyse the inefficiency of edge- or device-based DNN execution, and finally illustrate the benefits of DNN partitioning and right-sizing with device-edge synergy towards low-latency edge intelligence.
II-A A Primer on DNN
DNN represents the core machine learning technique for a broad spectrum of intelligent applications spanning computer vision, automatic speech recognition and natural language processing. As illustrated in Fig. 1, computer vision applications use DNNs to extract features from an input image and classify the image into one of the pre-defined categories. A typical DNN model is organized as a directed graph comprising a series of interconnected layers, each of which contains some number of nodes. Each node is a neuron that applies certain operations to its input and generates an output. The input layer of nodes is set by the raw data, while the output layer determines the category of the data. The process of passing forward from the input layer to the output layer is called model inference. For a typical DNN containing tens of layers and hundreds of nodes per layer, the number of parameters can easily reach the scale of millions. Thus, DNN inference is computationally intensive. Note that in this paper we focus on DNN inference, since DNN training is generally delay tolerant and is typically conducted offline using powerful cloud resources.
II-B Inefficiency of Device- or Edge-based DNN Inference
Currently, the status quo of mobile DNN inference is either direct execution on the mobile device or offloading to the cloud/edge server for execution. Unfortunately, both approaches may suffer from poor performance (i.e., high end-to-end latency), which makes it hard to satisfy real-time intelligent mobile applications (e.g., AR/VR mobile gaming and intelligent robots). As an illustration, we take a Raspberry Pi tiny computer and a desktop PC to emulate the mobile device and the edge server respectively, running the classical AlexNet DNN model for image recognition over the Cifar-10 dataset. Fig. 2 plots the breakdown of the end-to-end latency of different approaches under varying bandwidth between the edge and the mobile device. It clearly shows that it takes more than 2s to execute the model on the resource-limited Raspberry Pi. Moreover, the performance of the edge-based execution approach is dominated by the input data transmission time (the edge server computation time stays at 10ms) and is thus highly sensitive to the available bandwidth. Specifically, as the available bandwidth drops from 1Mbps to 50Kbps, the end-to-end latency climbs from 0.123s to 2.317s. Considering the scarcity of network bandwidth in practice (e.g., due to network resource contention among users and apps) and the limited computing resources of mobile devices, both the device-based and the edge-based approaches struggle to support many emerging real-time intelligent mobile applications with stringent latency requirements.
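The two edge-based measurements quoted above are consistent with a simple additive model, latency = edge compute time + upload size / bandwidth. The following sketch backs out the upload size implied by each measurement. The additive model, the function names, and the reading of Kbps/Mbps as 10^3 and 10^6 bits per second are assumptions made for illustration and are not stated in the text.

```python
EDGE_COMPUTE_S = 0.010  # edge server computation time reported in the text (10 ms)


def implied_payload_bits(latency_s, bandwidth_bps, compute_s=EDGE_COMPUTE_S):
    """Upload size implied by latency = compute + payload / bandwidth (assumed model)."""
    return (latency_s - compute_s) * bandwidth_bps


def edge_latency_s(payload_bits, bandwidth_bps, compute_s=EDGE_COMPUTE_S):
    """End-to-end latency of edge-only inference under the same assumed model."""
    return compute_s + payload_bits / bandwidth_bps


# (measured latency in seconds, bandwidth in bits per second), taken from the text
measurements = [(0.123, 1_000_000), (2.317, 50_000)]
payloads = [implied_payload_bits(lat, bw) for lat, bw in measurements]

for (lat, bw), bits in zip(measurements, payloads):
    print(f"{bw / 1000:.0f} kbps, {lat:.3f} s -> about {bits / 8 / 1000:.1f} kB uploaded")
```

Both measurements imply an upload of roughly 14 kB, which supports the observation that transmission time, not edge computation, dominates the edge-based latency.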
II-C Enabling Edge Intelligence with DNN Partitioning and Right-Sizing
DNN Partitioning: For a better understanding of the performance bottleneck of DNN execution, we further break down the runtime (on Raspberry Pi) and the output data size of each layer in Fig. 3. Interestingly, the runtime and output data size of different layers are highly heterogeneous, and layers with a long runtime do not necessarily have a large output data size. An intuitive idea is therefore DNN partitioning, i.e., partitioning the DNN into two parts and offloading the computationally intensive one to the server at a low transmission overhead, thereby reducing the end-to-end latency. For illustration, we choose the second local response normalization layer (i.e., lrn_2) in Fig. 3 as the partition point and offload the layers before the partition point to the edge server while running the remaining layers on the device. By partitioning the DNN between device and edge, we are able to combine hybrid computation resources in proximity for low-latency DNN inference.
DNN Right-Sizing: While DNN partitioning greatly reduces the latency by blending the computing power of the edge server and the mobile device, we should note that the optimal DNN partitioning is still constrained by the runtime of the layers running on the device. For a further reduction of latency, DNN right-sizing can be combined with DNN partitioning. DNN right-sizing promises to accelerate model inference through an early-exit mechanism. That is, by training a DNN model with multiple exit points, each corresponding to a sub-model of a different size, we can choose a small DNN tailored to the application demand, which also alleviates the computing burden of the partitioned model and thus reduces the total latency. Fig. 4 illustrates a branchy AlexNet with five exit points. Currently, the early-exit mechanism is supported by the open source framework BranchyNet. Intuitively, DNN right-sizing further reduces the amount of computation required by the DNN inference task.
Problem Description: DNN right-sizing clearly introduces a latency-accuracy tradeoff: while early exit reduces the computing time on the device side, it also degrades the accuracy of DNN inference. Considering that some applications (e.g., VR/AR games) have stringent deadline requirements but can tolerate a moderate accuracy loss, we strike a balance between latency and accuracy in an on-demand manner. In particular, given a predefined and stringent latency goal, we maximize the accuracy without violating the deadline. More specifically, the problem addressed in this paper can be summarized as follows: given a predefined latency requirement, how should the decisions of DNN partitioning and right-sizing be jointly optimized in order to maximize DNN inference accuracy?
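To make the early-exit idea concrete, here is a minimal sketch in plain Python. It does not use BranchyNet's actual API: the `BranchyModel` class, the toy scalar "layers" and the exit positions are all hypothetical and only illustrate that choosing an earlier exit means executing fewer layers.

```python
from typing import Callable, Sequence, Tuple

Layer = Callable[[float], float]


class BranchyModel:
    """Toy backbone with several exit points (hypothetical, not BranchyNet's API)."""

    def __init__(self, backbone: Sequence[Layer], exit_after: Sequence[int],
                 exit_heads: Sequence[Layer]):
        assert len(exit_after) == len(exit_heads)
        self.backbone = list(backbone)
        self.exit_after = list(exit_after)  # backbone layers executed before each exit
        self.exit_heads = list(exit_heads)  # one small output head per exit

    def infer(self, x: float, exit_index: int) -> Tuple[float, int]:
        """Run only the layers needed for the chosen exit; return (output, layers run)."""
        depth = self.exit_after[exit_index]
        for layer in self.backbone[:depth]:
            x = layer(x)
        return self.exit_heads[exit_index](x), depth + 1


# Six identical toy backbone layers and three exits after layers 2, 4 and 6.
backbone = [lambda v: v + 1.0] * 6
model = BranchyModel(backbone, exit_after=[2, 4, 6],
                     exit_heads=[lambda v: v * 10.0] * 3)

early, n_early = model.infer(0.0, exit_index=0)  # smallest sub-model
full, n_full = model.infer(0.0, exit_index=2)    # full model
print(n_early, n_full)
```

The earliest exit executes fewer layers than the full model, which is the source of the latency saving; the accompanying accuracy loss is not modeled in this toy example.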
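Stated compactly, the problem can be written as a constrained maximization. The notation here is introduced only for illustration: $i$ indexes the exit point, $p$ the partition point, $\mathrm{Acc}_i$ is the accuracy of the sub-model ending at exit $i$, $A_{i,p}$ is the end-to-end latency of that sub-model when partitioned at $p$, and $T$ is the predefined latency requirement.

```latex
\max_{i,\,p} \; \mathrm{Acc}_i
\quad \text{subject to} \quad A_{i,p} \le T
```

Since the partition point does not appear in the objective, $p$ only needs to be chosen so that the latency constraint is as easy to satisfy as possible for each candidate exit point.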
We now outline the initial design of Edgent, a framework that automatically and intelligently selects the best partition point and exit point of a DNN model to maximize the accuracy while satisfying the requirement on the execution latency.
III-A System Overview
Fig. 5 shows the overview of Edgent. Edgent consists of three stages: an offline training stage, an online optimization stage and a co-inference stage.
At the offline training stage, Edgent performs two initializations: (1) profiling the mobile device and the edge server to generate regression-based performance prediction models (Sec. III-B) for different types of DNN layers (e.g., Convolution, Pooling, etc.); and (2) using BranchyNet to train DNN models with various exit points, thereby enabling early exit. Note that the performance profiling is infrastructure-dependent, while the DNN training is application-dependent. Thus, given the sets of infrastructures (i.e., mobile devices and edge servers) and applications, the two initializations only need to be done once, offline.
At the online optimization stage, the DNN optimizer selects the best partition point and early-exit point of the DNN to maximize the accuracy while guaranteeing the end-to-end latency, based on three inputs: (1) the profiled layer latency prediction models and the BranchyNet-trained DNN models of various sizes; (2) the observed available bandwidth between the mobile device and the edge server; and (3) the pre-defined latency requirement. The optimization algorithm is detailed in Sec. III-C.
At the co-inference stage, according to the partition and early-exit plan, the edge server executes the layers before the partition point and the remaining layers run on the mobile device.
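The co-inference stage can be sketched as splitting an ordered list of layers at the partition point. The sketch below is a single-process illustration with hypothetical helper names (`split_model`, `run`, `co_inference`) and toy layers; in a real deployment the two halves would run on different machines and the intermediate result would be sent over the network.

```python
from typing import Callable, List, Tuple

Layer = Callable[[List[float]], List[float]]


def split_model(layers: List[Layer], partition: int) -> Tuple[List[Layer], List[Layer]]:
    """Layers [0, partition) go to the edge server, the rest stay on the device."""
    if not 0 <= partition <= len(layers):
        raise ValueError("partition point out of range")
    return layers[:partition], layers[partition:]


def run(layers: List[Layer], data: List[float]) -> List[float]:
    """Apply the given layers in order."""
    for layer in layers:
        data = layer(data)
    return data


def co_inference(layers: List[Layer], partition: int, data: List[float]) -> List[float]:
    edge_part, device_part = split_model(layers, partition)
    intermediate = run(edge_part, data)    # would execute on the edge server
    # A real system would transmit `intermediate` to the device at this point.
    return run(device_part, intermediate)  # would execute on the mobile device


# Toy three-layer model: scale, shift, ReLU.
layers = [
    lambda xs: [2.0 * x for x in xs],
    lambda xs: [x - 3.0 for x in xs],
    lambda xs: [max(0.0, x) for x in xs],
]
sample = [1.0, 2.0, 0.5]
results = [co_inference(layers, p, sample) for p in range(len(layers) + 1)]
print(results[0])
```

Every partition point yields the same output as running the whole model in one place, which reflects the point made later in the text that partitioning changes latency but not inference accuracy.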
III-B Layer Latency Prediction
Table I: Independent variables of the layer latency prediction models
|Layer Type||Independent Variable|
|Relu||input data size|
|Local Response Normalization||input data size|
|Dropout||input data size|
|Model Loading||model size|
When estimating the runtime of a DNN, Edgent models the latency per layer rather than at the granularity of a whole DNN. This greatly reduces the profiling overhead, since there are only a few classes of layers. Through experiments, we observe that the latency of different layers is determined by various independent variables (e.g., input data size, output data size), which are summarized in Table I. We also observe that the DNN model loading time has an obvious impact on the overall runtime. Therefore, we further take the DNN model size as an input parameter to predict the model loading time. Based on these inputs, we establish a regression model that predicts the latency of each layer from its profiles. The final regression models of some typical layers are shown in Table II (size is in bytes and latency is in ms).
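As an illustration of how such a per-layer regression model could be obtained, the sketch below fits y = a·x + b by ordinary least squares for a layer with a single independent variable. The profiling samples are hypothetical numbers invented for the example, and the text does not specify which regression procedure Edgent actually uses.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a * x + b with one independent variable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


# Hypothetical profiling samples for one layer type on one machine:
# input data size in bytes and measured latency in ms.
sizes = [1_000, 5_000, 20_000, 50_000, 100_000]
latencies = [0.07, 0.09, 0.17, 0.34, 0.62]

a, b = fit_line(sizes, latencies)


def predict_ms(size_bytes):
    """Predicted layer latency in ms for a given input size."""
    return a * size_bytes + b


print(f"y = {a:.3e} x + {b:.3e}")
```

Layers with two independent variables would need a multivariate least-squares fit instead, which follows the same idea with one more coefficient.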
Table II: Regression models of typical layers (size in bytes, latency in ms)
|Layer||Mobile Device model||Edge Server model|
|Convolution||y = 6.03e-5 x1 + 1.24e-4 x2 + 1.89e-1||y = 6.13e-3 x1 + 2.67e-2 x2 - 9.909|
|Relu||y = 5.6e-6 x + 5.69e-2||y = 1.5e-5 x + 4.88e-1|
|Pooling||y = 1.63e-5 x1 + 4.07e-6 x2 + 2.11e-1||y = 1.33e-4 x1 + 3.31e-5 x2 + 1.657|
|Local Response Normalization||y = 6.59e-5 x + 7.80e-2||y = 5.19e-4 x + 5.89e-1|
|Dropout||y = 5.23e-6 x + 4.64e-3||y = 2.34e-6 x + 0.0525|
|Fully-Connected||y = 1.07e-4 x1 - 1.83e-4 x2 + 0.164||y = 9.18e-4 x1 + 3.99e-3 x2 + 1.169|
|Model Loading||y = 1.33e-6 x + 2.182||y = 4.49e-6 x + 842.136|
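The single-variable models in Table II can be evaluated directly. The sketch below copies those coefficients as printed in the table and keeps the column labels as given there; the dictionary keys (`"relu"`, `"lrn"`, `"device"`, `"edge"` and so on) and the function name are hypothetical labels chosen for this example.

```python
# Coefficients (a, b) of latency_ms = a * x + b, copied from Table II as printed.
# x is the independent variable from Table I: input data size in bytes for
# Relu, Local Response Normalization and Dropout; model size for Model Loading.
SINGLE_VAR_MODELS = {
    ("relu", "device"): (5.6e-6, 5.69e-2),
    ("relu", "edge"): (1.5e-5, 4.88e-1),
    ("lrn", "device"): (6.59e-5, 7.80e-2),
    ("lrn", "edge"): (5.19e-4, 5.89e-1),
    ("dropout", "device"): (5.23e-6, 4.64e-3),
    ("dropout", "edge"): (2.34e-6, 0.0525),
    ("model_loading", "device"): (1.33e-6, 2.182),
    ("model_loading", "edge"): (4.49e-6, 842.136),
}


def predict_latency_ms(layer, side, x):
    """Predicted latency in ms for one layer type on one side."""
    a, b = SINGLE_VAR_MODELS[(layer, side)]
    return a * x + b


# Example: a Relu layer with a 100,000-byte input, using the first table column.
relu_ms = predict_latency_ms("relu", "device", 100_000)
print(f"{relu_ms:.4f} ms")
```

Summing such per-layer predictions over the layers assigned to each side gives the computation part of the end-to-end latency estimate.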
III-C Joint Optimization on DNN Partition and DNN Right-Sizing
At the online optimization stage, the DNN optimizer receives the latency requirement from the mobile device and then searches for the optimal exit point and partition point of the trained BranchyNet model. The whole process is given in Algorithm 1. For a branchy model with $M$ exit points, we denote by $N_i$ the number of layers of the $i$-th exit point. Here a larger index $i$ corresponds to a more accurate inference model of larger size. We use the above-mentioned regression models to predict $ED_j$, the runtime of the $j$-th layer when it runs on the device, and $ES_j$, the runtime of the $j$-th layer when it runs on the server. $D_p$ is the output of the $p$-th layer. Under a specific bandwidth $B$, with the input data $Input$, we then calculate $A_{i,p}$, the whole runtime when the $p$-th layer is the partition point of the selected model of the $i$-th exit point: $A_{i,p} = \sum_{j=1}^{p-1} ES_j + \sum_{k=p}^{N_i} ED_k + Input/B + D_{p-1}/B$. When $p = 1$, the model runs only on the device, so $ES_p = 0$, $D_{p-1}/B = 0$ and $Input/B = 0$; when $p = N_i$, the model runs only on the server, so $ED_p = 0$ and $D_{p-1}/B = 0$. In this way, we can find the best partition point, i.e., the one with the smallest latency, for the model of the $i$-th exit point. Since the model partition does not affect the inference accuracy, we can then sequentially try the DNN inference models with different exit points (i.e., with different accuracy) and find the one that has the largest size while still satisfying the latency requirement. Note that since the regression models for layer latency prediction are trained beforehand, Algorithm 1 mainly involves linear search operations and runs very fast (no more than 1ms in our experiments).
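The search described above can be sketched as follows. This is an illustrative reading of the procedure, not a transcription of Algorithm 1: the `SubModel` type, the function names and the example numbers are hypothetical, and the sketch uses its own 0-based convention in which partition `p` means that the first `p` layers run on the edge server and the remaining layers run on the device.

```python
from typing import List, NamedTuple, Optional, Sequence, Tuple


class SubModel(NamedTuple):
    """Sub-model of one exit point; values would come from the regression models."""
    device_ms: List[float]  # predicted per-layer latency on the device
    edge_ms: List[float]    # predicted per-layer latency on the edge server
    out_bits: List[float]   # output size of each layer, in bits


def partition_latency_ms(m: SubModel, p: int, input_bits: float, bw_bps: float) -> float:
    """Latency when the first p layers run on the edge and the rest on the device."""
    n = len(m.device_ms)
    compute = sum(m.edge_ms[:p]) + sum(m.device_ms[p:])
    if p == 0:      # device only: nothing is transmitted
        transfer_bits = 0.0
    elif p == n:    # edge only: upload the input, result size neglected
        transfer_bits = input_bits
    else:           # upload the input, send back the intermediate output
        transfer_bits = input_bits + m.out_bits[p - 1]
    return compute + 1000.0 * transfer_bits / bw_bps


def best_partition(m: SubModel, input_bits: float, bw_bps: float) -> Tuple[int, float]:
    """Linear search for the partition point with the smallest latency."""
    candidates = ((p, partition_latency_ms(m, p, input_bits, bw_bps))
                  for p in range(len(m.device_ms) + 1))
    return min(candidates, key=lambda c: c[1])


def select_plan(sub_models: Sequence[SubModel], input_bits: float, bw_bps: float,
                deadline_ms: float) -> Optional[Tuple[int, int, float]]:
    """Try exits from most to least accurate; return (exit, partition, latency)."""
    for i in range(len(sub_models) - 1, -1, -1):
        p, latency = best_partition(sub_models[i], input_bits, bw_bps)
        if latency <= deadline_ms:
            return i, p, latency
    return None  # no exit point can meet the deadline


# Hypothetical two-exit model, ordered from least to most accurate.
small = SubModel(device_ms=[100, 100], edge_ms=[10, 10], out_bits=[8000, 80])
large = SubModel(device_ms=[100, 100, 300], edge_ms=[10, 10, 30], out_bits=[8000, 80, 80])
plan_relaxed = select_plan([small, large], 100_000, 1_000_000, deadline_ms=200)
plan_tight = select_plan([small, large], 100_000, 1_000_000, deadline_ms=130)
plan_impossible = select_plan([small, large], 100_000, 1_000_000, deadline_ms=100)
print(plan_relaxed, plan_tight, plan_impossible)
```

With the relaxed deadline the larger sub-model is selected, with the tighter one the search falls back to the smaller sub-model, and with the tightest one no plan exists. Because each candidate is evaluated with a few additions, the whole search is linear in the number of layers per exit point.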
We now present our preliminary implementation and evaluation results.
We have implemented a simple prototype of Edgent to verify the feasibility and efficacy of our idea. To this end, we use a desktop PC to emulate the edge server; it is equipped with a quad-core Intel processor at 3.4 GHz and 8 GB of RAM, and runs Ubuntu. We further use a Raspberry Pi 3 tiny computer as the mobile device. The Raspberry Pi 3 has a quad-core ARM processor at 1.2 GHz and 1 GB of RAM. The available bandwidth between the edge server and the mobile device is controlled by the WonderShaper tool. As the deep learning framework, we choose Chainer, which supports branchy DNN structures well.
For the BranchyNet model, based on the standard AlexNet model, we train a branchy AlexNet for image recognition over the large-scale Cifar-10 dataset. The branchy AlexNet has five exit points, as shown in Fig. 4 (Sec. II), and each exit point corresponds to a sub-model of the branchy AlexNet. Note that in Fig. 4 we draw only the convolution layers and the fully-connected layers and omit the other layers for ease of illustration. The five sub-models have 12, 16, 19, 20 and 22 layers, respectively.
We deploy the branchynet model on the edge server and the mobile device to evaluate the performance of Edgent. Specifically, since both the pre-defined latency requirement and the available bandwidth play vital roles in Edgent’s optimization logic, we evaluate the performance of Edgent under various latency requirements and available bandwidth.
We first investigate the effect of the bandwidth by fixing the latency requirement at 1000ms and varying the bandwidth from 50kbps to 1.5Mbps. Fig. 6(a) shows the best partition point and exit point under different bandwidths. While the best partition point may fluctuate, the best exit point gets higher as the bandwidth increases, meaning that a higher bandwidth leads to a higher accuracy. Fig. 6(b) shows that as the bandwidth increases, the model runtime first drops substantially and then rises suddenly. This is reasonable, however, since the accuracy improves while the latency remains within the latency requirement when the bandwidth increases from 1.2Mbps to 2Mbps. Fig. 6(b) also shows that our proposed regression-based latency prediction approach estimates the actual DNN model runtime well. We further fix the bandwidth at 500kbps and vary the latency requirement from 100ms to 1000ms. Fig. 6(c) shows the best partition point and exit point under different latency requirements. As expected, the best exit point gets higher as the latency requirement increases, meaning that a larger latency goal gives more room for accuracy improvement.
Fig. 7 shows the model accuracy of different inference methods under different latency requirements, with the network bandwidth set to 400kbps. The accuracy is shown as negative if the inference cannot satisfy the latency requirement. As seen in Fig. 7, at a very low latency requirement (100ms), none of the four methods can satisfy the requirement. As the latency requirement increases, Edgent starts to work earlier than the other methods: at the 200ms to 300ms requirements, it meets the requirement by using a small model with a moderate inference accuracy. The accuracy of the model selected by Edgent gets higher as the latency requirement is relaxed further.
In this work, we present Edgent, a collaborative and on-demand DNN co-inference framework with device-edge synergy. Towards low-latency edge intelligence, Edgent introduces two design knobs to tune the latency of a DNN model: DNN partitioning which enables the collaboration between edge and mobile device, and DNN right-sizing which shapes the computation requirement of a DNN. Preliminary implementation and evaluations on Raspberry Pi demonstrate the effectiveness of Edgent in enabling low-latency edge intelligence. Going forward, we hope more discussion and effort can be stimulated in the community to fully accomplish the vision of intelligent edge service.
- As an initial investigation, in this paper we focus on the execution latency issue. We will also consider the energy efficiency issue in a future study.
[1] C. Szegedy et al., “Going deeper with convolutions,” 2014.
[2] A. van den Oord et al., “WaveNet: A generative model for raw audio,” arXiv:1609.03499, 2016.
[3] D. Wang and E. Nyberg, “A long short-term memory model for answer sentence selection in question answering,” in ACL and IJCNLP, 2015.
[4] Qualcomm, “Augmented and virtual reality: The first wave of 5G killer apps,” 2017.
[5] S. Han, H. Mao, and W. Dally, “Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding,” arXiv:1510.00149, 2015.
[6] Y. Kim, E. Park, S. Yoo, T. Choi, L. Yang, and D. Shin, “Compression of deep convolutional neural networks for fast and low power mobile applications,” in ICLR, 2016.
[7] J. Wu et al., “Quantized convolutional neural networks for mobile devices,” 2016.
[8] N. Lane, S. Bhattacharya, A. Mathur, C. Forlivesi, and F. Kawsar, “DXTK: Enabling resource-efficient deep learning on mobile and embedded devices with the DeepX toolkit,” 2016.
[9] F. N. Iandola et al., “SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size,” arXiv:1602.07360, 2016.
[10] Y. Kang, J. Hauswald, C. Gao, A. Rovinski, T. Mudge, J. Mars, and L. Tang, “Neurosurgeon: Collaborative intelligence between the cloud and mobile edge,” in ASPLOS, 2017.
[11] C. T. et al., “5G mobile/multi-access edge computing,” 2017.
[12] A. Krizhevsky, I. Sutskever, and G. Hinton, “ImageNet classification with deep convolutional neural networks,” in NIPS, 2012.
[13] A. Krizhevsky, “Learning multiple layers of features from tiny images,” 2009.
[14] S. Teerapittayanon, B. McDanel, and H. T. Kung, “BranchyNet: Fast inference via early exiting from deep neural networks,” in 23rd ICPR, 2016.
[15] V. Mulhollon, “Wondershaper,” http://manpages.ubuntu.com/manpages/trusty/man8/wondershaper.8.html, 2004.
[16] Preferred Networks, “Chainer,” https://github.com/chainer/chainer/tree/v1, 2017.