Neural Network Inference on Mobile SoCs
The ever-increasing demand from mobile Machine Learning (ML) applications calls for ever more powerful on-chip computing resources. Mobile devices are empowered with Heterogeneous Multi-Processor Systems on Chips (HMPSoCs) to process ML workloads such as Convolutional Neural Network (CNN) inference. HMPSoCs house several different types of ML-capable components on-die, such as CPUs, GPUs, and accelerators. These components can each perform inference independently, but with very different power-performance characteristics. In this article, we provide a quantitative evaluation of the inference capabilities of the different components on HMPSoCs. We also present insights behind their respective power-performance behaviour. Finally, we explore the performance limit of the HMPSoCs by synergistically engaging all the components concurrently.
The tremendous popularity of neural-network (NN) based machine learning applications in recent years has been fuelled partly by the increased capability of the compute engines, in particular, the GPUs. Traditionally, both the network training and inference were performed on the cloud with mobile devices only acting as user interfaces. However, enriched user experience now demands inference to be performed on the mobile devices themselves with high accuracy and throughput.
In this article, we look at NN-enabled vision applications on mobile devices. These applications extract high-level semantic information from real-time video streams and predominately use Convolutional Neural Networks (CNNs). They are important in many domains such as Advanced Driver-Assistance Systems (ADAS), Virtual Reality (VR), and Augmented Reality (AR). Enabling these applications in the power-constrained mobile devices is challenging due to the enormous computational and memory requirements.
Mobile devices are powered by Heterogeneous Multi-Processor Systems on Chips (HMPSoCs). However, the mobile SoC market is fragmented across multiple vendors. Although accelerators including GPUs, FPGAs and dedicated neural accelerators demonstrate great performance for inference, only a small fraction of mobile SoCs are equipped with these high-performance components. Moreover, due to the market fragmentation, it is impossible to develop an accelerator-based mobile application that can run across multiple devices. Instead, CPUs remain the common denominator among mobile SoCs and are the favoured choice for inference [wu2019machine].
We embark on an exploration to quantitatively characterize and understand the inferencing capabilities of mobile SoCs given this diverse landscape. We portray the power-performance gap between the ubiquitous CPUs and the high-performance GPUs and neural accelerators in high-end devices, and uncover the reasons behind the gap through roofline models. Finally, we propose simultaneous engagement of all the SoC components to greatly expand the promise of functional deployment of vision applications on mobile devices.
II Inference on Mobile SoCs
II-A Heterogeneous Multi-processor SoCs
There are over two thousand unique HMPSoCs in the mobile devices market. The diversity comes from the choice of different CPUs, GPUs, caches, memory controllers and other application-specific accelerators. This fragmentation of the SoC market makes standard optimizations impossible. However, the similarity among these SoCs lies in the choice of one or more CPU core clusters.
II-A1 ARM big.LITTLE
State-of-the-art HMPSoCs are usually equipped with multi-core CPUs. 99.9% of the Android devices in the market in 2019 have multiple cores [wu2019machine]. Among these, about half of the SoCs implement performance heterogeneity with at least two CPU clusters: a high-performance and an energy-efficient core cluster. The ARM big.LITTLE architecture is one of the most popular architectures implementing this heterogeneity and is present in the HiSilicon Kirin, Samsung Exynos and Qualcomm Snapdragon series SoCs. The heterogeneous cores differ in power-performance-area characteristics but share the same Instruction Set Architecture (ISA). Figure 1 shows an abstract block diagram of this architecture. The general availability and programmability of CPUs make them the favourable choice for mobile inference and make device-agnostic optimizations feasible.
Existing architectures including GPUs and FPGAs have proven advantageous for ML workloads and are thus commonly used for deployment on certain devices. Dedicated accelerators, both academic and commercial (Google Edge TPU, Intel Nervana NNP, Huawei NPU, Apple Neural Engine), offer exceptional runtime and energy-efficiency. However, there are no standard neural accelerators for mobile SoCs, which makes horizontal application integration difficult. For inference on mobile devices in general, CPUs, as the common denominator, are most commonly used. The use of accelerators, including GPUs, is constrained by their availability on a limited set of devices. In addition, we show in later analysis that the performance gap between mobile CPUs and GPUs is only about two to three times. This makes the mobile CPU a competitive candidate for inference.
II-B Mobile ML Framework and Optimizations
TensorFlow, PyTorch and MXNet are some of the common ML development frameworks for all scenarios. Frameworks such as TensorFlow Lite facilitate the compression of huge models to fit into resource-constrained mobile devices. Efficient libraries and APIs bridge the gap between the aforementioned frameworks and the underlying hardware; examples are Nvidia cuDNN for GPUs, ARM NN powered by the Compute Library (ARM-CL) for ARM CPUs and GPUs, and Facebook NNPACK and QNNPACK for mobile CPUs. These libraries usually exploit detailed architectural information for optimization. ARM-CL supports acceleration through ARM NEON vectorization and provides NEON assembly implementations of the most computationally intensive convolution kernels. Algorithmic optimizations, including the Winograd transform, the Fast Fourier Transform and the exploitation of sparsity, lower the computational complexity of convolutions. Another line of effort is end-to-end compiler frameworks. Frameworks such as TVM and Glow can directly compile ML models to platform-specific object code. In addition, quantization and network pruning are common techniques that bring down the processing requirement at the cost of accuracy.
Despite the fact that most mobile inference workloads run on CPUs, most of the attention has been focused on optimizations of ML workloads with accelerators. There is a lot of room for optimizations on mobile CPUs to enable ML applications across different mobile platforms.
III Characterizing Inferencing on Mobile SoC
We perform experiments across different technology nodes using two commonly used HMPSoCs. We make use of 28 nm Exynos 5422 HMPSoC within Odroid XU3 development platform, and 10 nm Kirin 970 HMPSoC within Hikey 970 development platform. Exynos 5422 and Kirin 970, released in 2014 and 2017 respectively, show us the progress of the mobile SoCs development over the years. In addition, these two HMPSoCs roughly approximate the mid and high end mobile SoCs today.
In the experiments, both SoCs use ARM-CL 18.05v. The Kirin 970 NPU is supported by HiAI DDK (v100) for network deployment. For Exynos 5422, built-in power sensors, running at 200 Hz, measure the power of each individual component; for Kirin 970, which lacks integrated on-chip power sensors, we approximate the power consumption by measuring the socket power with a power measurement unit [pmu] running at 100 Hz.
III-A Experimental Platforms
Both SoCs include an ARM big.LITTLE based asymmetric multi-core CPU. The Kirin 970 CPU adopts the ARMv8-A architecture. It consists of a high-performance, high-power, out-of-order four-core Cortex-A73 cluster (2.36 GHz) and a low-performance, low-power, in-order four-core Cortex-A53 cluster (1.8 GHz). A 128-bit Cache Coherent Interconnect (CCI) bus keeps the two clusters coherent. Exynos 5422 has a similar design but uses the older ARMv7-A architecture with Cortex-A15 (2 GHz) and Cortex-A7 (1.4 GHz) cores. All CPU cores support NEON advanced Single Instruction Multiple Data (SIMD) operations, which allow four 32-bit floating-point operations per cycle.
Kirin 970 adopts a new-generation ARM Mali G72 MP12 GPU (850 MHz), implementing the second-generation Bifrost architecture. It has twelve shader cores with three execution engines each. Each engine is capable of eight FP32 operations per cycle, giving a total peak compute capability of 244.8 GFLOPS for G72. Exynos 5422 includes an ARM Mali T628 MP6 GPU (600 MHz). It adopts the older Midgard architecture with six shader cores implementing the Tripipe design with two arithmetic pipelines each. Each pipeline is capable of eight FP32 operations per cycle, providing a total peak compute capability of 57.6 GFLOPS for T628.
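The two peak figures quoted above follow from multiplying the number of parallel units by the operations per cycle and the clock frequency. The short sketch below reproduces that arithmetic; the function and parameter names are illustrative and not taken from any vendor tool.

```python
def peak_gflops(shader_cores: int, units_per_core: int,
                fp32_ops_per_cycle: int, freq_ghz: float) -> float:
    """Peak FP32 rate in GFLOPS: parallel units x ops per cycle x clock in GHz."""
    return shader_cores * units_per_core * fp32_ops_per_cycle * freq_ghz


# Mali G72 MP12: 12 shader cores, 3 execution engines each, 8 FP32 ops/cycle, 850 MHz
g72 = peak_gflops(12, 3, 8, 0.85)
# Mali T628 MP6: 6 shader cores, 2 arithmetic pipelines each, 8 FP32 ops/cycle, 600 MHz
t628 = peak_gflops(6, 2, 8, 0.60)

print(f"G72 peak:  {g72:.1f} GFLOPS")
print(f"T628 peak: {t628:.1f} GFLOPS")
```

These are upper bounds from the specifications; sustained throughput on real networks is lower.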
Kirin 970 includes a Huawei NPU purpose-built for ML. It has a peak performance of 1.92 TFLOPS with FP16. Further details of the NPU are, however, not disclosed. The accompanying HiAI DDK API enables deployment of networks on the NPU but only works with Android. Exynos 5422 does not have any ML accelerator.
III-B Network Structure
Active research is on-going to design new networks catering to different problems. Researchers have created new network structures, such as MobileNet, that improve prediction accuracy while reducing the computational resource requirements. In general, MobileNet [mobilenet] and SqueezeNet [squeezenet] are more suitable for mobile devices. In this article, we explore several popular network structures from recent years, as summarized in Table I.
|Network|Major Layers / Modules|
|AlexNet [alexnet]|5 Conv + 3 FC|
|MobileNet [mobilenet]|14 Conv + 13 Conv DW + 1 FC|
Conv: Convolutional; FC: Fully-connected; Conv DW: Depthwise Convolutional
III-C Individual Heterogeneous Components
We first study each component in isolation by running inference on a stream of multiple images on a single component. Both the Big and Small clusters are self-sufficient for inferencing. The GPU and NPU require the support of the Small cluster for inferencing.
Table II shows the throughput of each component on both our HMPSoCs. All components in Kirin 970 outperform their respective counterparts in the older Exynos 5422. The Big A73 cluster, Small A53 cluster, and G72 GPU outperform the Big A15 cluster, Small A7 cluster, and T628 GPU on average by a factor of 4.4x, 2.6x and 4.2x, respectively. We can see that, compared to the big cluster, both the small cluster and the GPU have improved significantly over the years. The performance gap between the big and small clusters has reduced from 4x to about 2.5x. In addition, the performance gap between the GPU and the CPU clusters is only about 2x to 3x for both SoCs.
For NPU, we were unable to deploy MobileNet due to incompatible operators. On average, NPU is only 1.6x better than the high-end G72 GPU even though it is designed to be a dedicated accelerator. On the other hand, the portability of applications across different platforms remains a challenge for dedicated accelerators. The proprietary development kit in addition makes the general optimization a difficult endeavour.
III-C2 Energy Efficiency
Table III shows the average active power consumption of inferencing on different components. We calculate active power by subtracting the idle power (measured when no workload is running) from the power measured during inferencing. The Big cluster consumes considerably more power than the Small cluster on both SoCs. The power consumption of the GPU lies between that of the Big and Small CPU clusters on Exynos 5422, and is comparable to that of the Big CPU cluster on Kirin 970. The power consumption of the NPU is comparable to that of the Small CPU cluster on Kirin 970.
Figure 2 shows the energy-efficiency of each component measured in Images/J. The NPU is the most energy-efficient among all components, which we expect given its custom design for inference. GPUs are the second-most energy-efficient component. Small clusters also show good energy-efficiency. However, Table II shows that their absolute throughput is too low to be useful on their own.
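The two metrics used in this subsection can be written down compactly: active power is the measured power minus the idle power, and energy-efficiency in Images/J is throughput divided by active power. The sketch below uses hypothetical readings chosen only for illustration; they are not values from Table II or Table III.

```python
def active_power_w(measured_w: float, idle_w: float) -> float:
    """Active power: power measured during inference minus idle power."""
    return measured_w - idle_w


def images_per_joule(throughput_img_s: float, active_w: float) -> float:
    """Energy-efficiency: (images/s) / (J/s) = images/J."""
    if active_w <= 0:
        raise ValueError("active power must be positive")
    return throughput_img_s / active_w


# Hypothetical readings (assumptions, not measurements from the article).
measured_w, idle_w = 3.5, 1.0      # watts
throughput = 10.0                  # images per second

p_active = active_power_w(measured_w, idle_w)
efficiency = images_per_joule(throughput, p_active)
print(f"active power: {p_active:.2f} W, efficiency: {efficiency:.2f} images/J")
```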
We observe that the NPU provides unmatched energy-efficiency for inference. It is the optimal choice of component to perform network inference on platforms with such dedicated accelerators. However, a developer needs to put in substantial effort to port an application to the proprietary API in order to execute on the NPU, and this effort bears no fruit on mobile devices lacking this specific NPU. The NPU, as a black box, also causes inflexibility in development and optimization. In addition, the NPU supports only a limited set of network operators, so certain networks, for example MobileNet, fail to deploy on it. These extra design requirements could make it quickly obsolete for future networks.
On the other hand, high-end GPUs can provide performance comparable to NPU at relatively good energy-efficiency. GPUs are capable of running General-Purpose (GPGPU) applications written in OpenCL, which is easily portable to a large variety of GPUs and even CPUs supporting OpenCL.
CPUs provide both the worst energy-efficiency and the worst throughput among all components. Still, they are critical for inferencing because they are present across all mobile devices. Low-end HMPSoCs lack accelerators like the NPU. They may contain a low-end GPU, but a low-end GPU may be missing OpenCL support and thereby lack any inferencing capability. Network inference on CPUs is therefore inevitable and demands optimization considerations.
Our analysis shows that any component alone on both platforms can barely support the increasing performance requirement for network inferencing. Section V-A presents the co-execution methodology that can mitigate the performance issue to some extent. Still, we must continue to look into the networks themselves in search for further optimization opportunities.
IV Roofline Analysis
To understand the execution behaviour of the networks on each HMPSoC component, we perform a roofline analysis. Roofline analysis [roofline] is a widely applied methodology that classifies an application as memory- or compute-bound on given hardware. It gives developers insights for improving their application design to suit the computation and memory capabilities of the underlying processing devices. The horizontal “Ceiling” and the slanted “Roof” together construct a “Roofline” that bounds the maximum performance of an application (measured in GOPS/s) under a hardware-determined compute or memory bound, respectively. The Operational Intensity (OI) of an application (measured in FLOPs/byte) determines whether its peak performance is bounded by the memory bandwidth (measured in GB/s) or the compute capability (measured in GOPS/s) of the hardware. Both Exynos 5422 and Kirin 970 show similar behaviour for the CPU core clusters and GPU, thus we only present the analysis for Exynos 5422 here.
IV-A Construction of a Roofline Model
Peak pure compute performance is obtained from hardware specifications. The peak (sustainable) memory bandwidth is obtained through micro-benchmarking [tinymembench]. Specifications claim peak memory bandwidth of the memory bus to be 14.9 GB/s. However, we observe the actual component-wise peak bandwidth to be 3.44 GB/s, 0.49 GB/s, and 6.15 GB/s for A15 cluster, A7 cluster and T628 GPU, respectively.
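As a minimal sketch of how these two quantities define the bound, the attainable performance at a given operational intensity is the smaller of the compute peak and the product of bandwidth and intensity. The example uses the T628 figures quoted in this article (57.6 GFLOPS peak, 6.15 GB/s measured bandwidth); the sample intensities are hypothetical and do not correspond to any network in the study.

```python
def attainable_gflops(oi: float, peak_gflops: float, bandwidth_gb_s: float) -> float:
    """Roofline bound: min(compute ceiling, memory bandwidth x operational intensity)."""
    return min(peak_gflops, bandwidth_gb_s * oi)


def ridge_point(peak_gflops: float, bandwidth_gb_s: float) -> float:
    """Operational intensity at which the memory roof meets the compute ceiling."""
    return peak_gflops / bandwidth_gb_s


T628_PEAK_GFLOPS = 57.6      # from the hardware specification quoted in the text
T628_BANDWIDTH_GB_S = 6.15   # measured sustainable bandwidth quoted in the text

ridge = ridge_point(T628_PEAK_GFLOPS, T628_BANDWIDTH_GB_S)
print(f"ridge point: {ridge:.2f} FLOPs/byte")

# Hypothetical operational intensities, for illustration only.
for oi in (1.0, 5.0, 20.0):
    bound = attainable_gflops(oi, T628_PEAK_GFLOPS, T628_BANDWIDTH_GB_S)
    region = "memory-bound" if oi < ridge else "compute-bound"
    print(f"OI = {oi:5.1f} FLOPs/byte -> at most {bound:5.2f} GFLOPS ({region})")
```

Any workload whose intensity falls to the left of the ridge point cannot reach the compute ceiling, no matter how well its kernels are parallelized.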
Many variations of the roofline model have been constructed to adapt to different use-cases. In this analysis, we define two operational intensities, namely theoretical OI ($OI_t$) and empirical OI ($OI_e$), defined in Eqn (1) and (2).
$$OI_t = \frac{\text{operations}}{\text{memory accesses (from code analysis)}} \qquad (1)$$
$$OI_e = \frac{\text{operations}}{\text{DRAM accesses}} \qquad (2)$$
We calculate $OI_t$ by analysing the code. The memory accesses include all the data required in the computation. During actual execution, the multi-level caches within the components improve memory access performance. The caches make it difficult for $OI_t$ to correlate with the actual performance on the components. Therefore, we introduce the empirical operational intensity $OI_e$. We calculate $OI_e$ using the actual DRAM accesses on the bus, which models the presence of a multi-level memory hierarchy. It is more informative and correlates better with the actual performance on the component than $OI_t$. We use application-specific performance counters obtained from ARM Streamline DS5 at run-time to calculate $OI_e$. Fig. 3(a) shows the roofline points of the major layers in AlexNet on the A15 cluster for both $OI_t$ and $OI_e$.
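The difference between the theoretical and the empirical operational intensity can be illustrated with a short calculation. Both divide the same operation count by a byte count; only the denominator changes, from all data touched according to the code to the bytes that actually reach DRAM. The layer figures below are hypothetical and serve only to show the direction of the effect.

```python
def operational_intensity(operations: float, bytes_moved: float) -> float:
    """Operations performed per byte of memory traffic."""
    if bytes_moved <= 0:
        raise ValueError("bytes_moved must be positive")
    return operations / bytes_moved


# Hypothetical layer (assumed numbers, not measurements from the article).
ops = 2.0e9          # operations, counted by analysing the code
code_bytes = 1.0e9   # all data the computation touches, per code analysis
dram_bytes = 0.25e9  # bytes observed on the DRAM bus (e.g. via performance counters)

oi_theoretical = operational_intensity(ops, code_bytes)
oi_empirical = operational_intensity(ops, dram_bytes)

print(f"theoretical OI: {oi_theoretical:.1f} ops/byte")
print(f"empirical OI:   {oi_empirical:.1f} ops/byte")
```

Because caches absorb part of the traffic, the DRAM byte count is smaller than the code-level count and the empirical intensity comes out higher.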
IV-B Theoretical and Empirical OI
Figure 3(a) plots the theoretical OI ($OI_t$, squares) and empirical OI ($OI_e$, diamonds) values of several major AlexNet layers, marked with different colours. The whole-network $OI_t$ and $OI_e$ of AlexNet are marked in black. The intersection points of the $OI_t$ values with the “Roofline” represent the theoretical maximum performance for the code-based operational intensities, which fall in the memory-bound region on the “Roof”. The corresponding points for $OI_e$ are the actual achieved performance in GOPS/s, which are always below the “Roofline”.
On actual components, the presence of caches reduces the memory accesses going to DRAM during execution, and thus increases the operational intensity. Therefore, for all layers, the $OI_e$ points lie to the right of the $OI_t$ points, giving better performance. For layers with low $OI_t$ (fully connected, FC), the $OI_e$ points move along the “Roofline”, achieving the theoretical maximum performance. For layers with higher $OI_t$ (convolutional, CONV), the $OI_e$ points cross the memory-bound boundary and become compute-bound. The performance gain is not as significant, which we attribute to underutilization caused by insufficient or imperfect parallelization. Overall, $OI_e$ is a better indicator of real-world performance, thus we only plot values of $OI_e$ going forward.
IV-C Across Different Components
Figure 3(b) shows the performance of different networks on different components on Exynos 5422. The colour of the points corresponds to the respective component. We observe that memory severely bottlenecks the performance of both the A7 cluster and the T628 GPU. The performance of the A15 cluster falls in both the compute- and memory-bound regions, depending on the network. In addition, we observe that although theoretical OI ($OI_t$) values are application-specific and remain the same for a given application, the empirical OI ($OI_e$) values differ across components because of their different memory hierarchies. The big core cluster with its larger cache (L2: 2 MB) therefore derives greater benefit from the memory hierarchy than the GPU (L2: 128 KB). However, for AlexNet, which is notorious for its huge parameter sizes, caches get flushed regardless of their size, resulting in a smaller benefit from the memory hierarchy. On the other hand, small filter sizes lead to sub-optimal parallelization and thus under-utilization. This observation holds more starkly for newer networks with smaller filter sizes than for older networks. It explains the significant deviation of the empirical performance of networks on the components from the “Roofline”.
IV-D Major Layers in Inference
We perform a deeper layer-level analysis to explain the behaviour of the networks. Convolutional and fully-connected layers together dominate the total execution time of networks. Therefore, we consider both types of layers as major layers worthy of examination. We limit our analysis to the Big cluster because networks there show both memory- and compute-bound behaviour. Figure 3(c) shows that different layers in AlexNet (and also in other networks, to a lesser extent) exhibit different empirical OIs. Convolutional layers at the start of AlexNet perform compute-intensive convolutions on large inputs and thereby have relatively higher OIs. On the other hand, fully-connected layers perform memory-intensive operations on large parameters and thereby have relatively lower OIs. Convolutional and fully-connected layers of AlexNet fall in the compute- and memory-bound regions of the roofline model, respectively. Overall, AlexNet falls somewhere in between.
In general, we observe that the layers of a network are scattered across both the compute- and memory-bound regions. This difference comes from the choice of the sizes of the input tensors and filters. The vast differences in empirical OI ($OI_e$) between layers within a network motivate layer-level optimizations such as per-layer Dynamic Voltage and Frequency Scaling (DVFS) for power management. In addition, the variation within a network motivates fine-grained layer-level co-execution, which improves the overall chip utilization [pipe-it].
IV-E Effect of Quantization
Quantization is a commonly applied technique that reduces the memory and computation requirement of a network while reducing accuracy. However, the quality of its implementation primarily determines the benefits it provides. We observe that in ARM-CL (18.05v) quantization fails to improve the performance of a network. Deeper analysis reveals that quantization reduces the execution time of convolutional layers. However, the overheads from extensive de-quantization and re-quantization overshadow any benefit.
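To make the de-quantization and re-quantization overhead concrete, the sketch below shows a generic affine 8-bit quantization round trip in plain Python. It is an illustration of the general technique under our own assumptions, not ARM-CL's implementation; all function names are hypothetical. Each conversion is an extra pass over the whole tensor, which is the kind of work that can offset the savings in the convolution itself.

```python
def quant_params(xmin: float, xmax: float, qmin: int = 0, qmax: int = 255):
    """Scale and zero point of an affine (asymmetric) 8-bit quantizer."""
    xmin, xmax = min(xmin, 0.0), max(xmax, 0.0)   # the range must contain zero
    scale = (xmax - xmin) / (qmax - qmin)
    if scale == 0.0:                              # degenerate all-zero tensor
        scale = 1.0
    zero_point = round(qmin - xmin / scale)
    return scale, zero_point


def quantize(values, scale, zero_point, qmin: int = 0, qmax: int = 255):
    """Float -> integer: one full pass over the tensor."""
    return [max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values]


def dequantize(qvalues, scale, zero_point):
    """Integer -> float: another full pass over the tensor."""
    return [(q - zero_point) * scale for q in qvalues]


# Hypothetical activations, for illustration only.
activations = [-1.0, 0.0, 0.5, 3.0]
scale, zero_point = quant_params(min(activations), max(activations))
q = quantize(activations, scale, zero_point)
restored = dequantize(q, scale, zero_point)

print("quantized:", q)
print("max round-trip error:", max(abs(a - r) for a, r in zip(activations, restored)))
```

The round-trip error stays within one quantization step, which is the accuracy cost mentioned above.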
Quantization reduces the total operations and memory accesses required near-proportionally. The reduction in memory accesses results in a slightly higher empirical operational intensity ($OI_e$). Therefore, the roofline analysis of a quantized network nearly overlaps with that of its non-quantized counterpart, and quantization thereby does not improve the memory behaviour of the layers. Lower operation requirements under quantization predominantly contribute to the reduction in execution time of the convolutional layers.
IV-F Glimpse of NPU
The NPU garners a lot of attention due to its novelty and its dedicated machine learning processing design. However, most of its details are kept confidential. We are unaware of its architectural details and its SoC integration. Therefore, we can only attempt to reverse engineer its behaviour with the limited information we have to gain some insights.
We implement a kernel module that enables counting of traffic on the CCI bus. We attribute the traffic on the CCI bus that goes to DRAM during the engagement of NPU to the main memory activity of NPU. The maximum observed memory bandwidth of executing several networks and the peak performance of 1.92 TOPS from the specification construct the “Roof” and “Ceiling” of the NPU roofline. Figure 4 shows the performance of different networks on NPU within the roofline model. We observe that the performance of NPU is significantly bounded by the memory for the networks tested. This observation shows significant scope for optimization to achieve the full processing potential of NPU.
V Improving the performance
V-A Co-Execution of Multiple Components
Stream processing requires a throughput of 10 to 40 images/sec depending on the application. Some applications even require multiple inferences to run at the same time. Table II shows that the high-end Kirin 970 HMPSoC can barely sustain such requirements, while the mid-end Exynos 5422 cannot. We previously observed that the peak bandwidth consumed by any individual component is far below the total bandwidth supported by the bus. This observation supports the claim that inferencing through multiple components together will not make individual components more memory-constrained than in isolated inferencing. Therefore, we use ARM-CL to create an infrastructure wherein multiple components process images from a single unified stream in parallel. Figure 5 shows an abstract diagram of our proposed infrastructure. Co-execution obtains significantly higher throughput than the highest-throughput component in isolated execution.
Table IV shows the peak co-execution throughput on both HMPSoCs with the ARM big.LITTLE CPU core clusters and the GPU. We include the best individual component executions, which are on the GPU for both platforms, for comparison. On average, co-execution gives a 50% throughput improvement over GPU-only execution.
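The idea of several components draining one unified image stream can be sketched with a shared queue and one worker thread per component. This is a simplified illustration under our own assumptions, not the ARM-CL based infrastructure itself: the component names and per-image latencies are hypothetical, and `time.sleep` stands in for running inference on a component.

```python
import queue
import threading
import time


def worker(name, latency_s, frames, results, lock):
    """Pull frames from the shared stream until it is drained."""
    while True:
        try:
            frame = frames.get_nowait()
        except queue.Empty:
            return
        time.sleep(latency_s)            # stand-in for inference on this component
        with lock:
            results.append((frame, name))


def co_execute(num_frames, components):
    """Process num_frames from one unified stream using all components in parallel."""
    frames = queue.Queue()
    for i in range(num_frames):
        frames.put(i)
    results, lock = [], threading.Lock()
    threads = [
        threading.Thread(target=worker, args=(name, latency, frames, results, lock))
        for name, latency in components.items()
    ]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results, time.perf_counter() - start


# Hypothetical per-image latencies in seconds (assumptions, not measurements).
components = {"gpu": 0.002, "big_cluster": 0.004, "small_cluster": 0.010}
results, elapsed = co_execute(60, components)

per_component = {name: sum(1 for _, n in results if n == name) for name in components}
print(f"{len(results)} frames in {elapsed:.3f} s, split: {per_component}")
```

Because every worker takes the next frame as soon as it is free, faster components naturally process a larger share of the stream without any static partitioning.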
The performance gap between combined CPU clusters and GPU is small for some networks as shown in Table II. For example, for MobileNet on both platforms, the performance ratio of CPU throughput (direct summation of throughput for two CPU clusters) versus GPU throughput is about 0.8. The co-execution of multiple components thus gives as high as 77% performance gain. However, for AlexNet, the co-execution does not give high performance gain as it is more memory-bound compared to other networks.
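A quick back-of-the-envelope check relates the two numbers in this paragraph. If the CPU clusters together deliver 0.8 times the GPU throughput and throughputs simply added up, co-execution would reach at most 1.8 times the GPU-only throughput, that is, an 80% gain. The sketch below compares this ideal with the reported 77%, under the assumption that the 77% figure refers to the same MobileNet case; the function names are illustrative.

```python
def ideal_speedup(cpu_over_gpu: float) -> float:
    """Upper bound on speedup over GPU-only execution if throughputs add perfectly."""
    return 1.0 + cpu_over_gpu


def coexecution_efficiency(observed_gain: float, cpu_over_gpu: float) -> float:
    """Fraction of the ideal combined throughput that co-execution achieves."""
    return (1.0 + observed_gain) / ideal_speedup(cpu_over_gpu)


ratio = 0.8            # CPU (both clusters) vs GPU throughput, from the text
observed_gain = 0.77   # reported co-execution gain, assumed to be the same case

ideal = ideal_speedup(ratio)
eff = coexecution_efficiency(observed_gain, ratio)
print(f"ideal speedup: {ideal:.2f}x, achieved fraction of ideal: {eff:.1%}")
```

Under that assumption, co-execution recovers nearly all of the combined throughput, which is consistent with the earlier observation that individual components leave most of the bus bandwidth unused.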
In addition, Table IV shows Exynos 5422’s obsolescence. Even with the co-execution, Exynos 5422 shows very low absolute throughput.
V-B Temperature Consideration
Mobile CPUs are susceptible to thermal effects, especially the high-performance, high-power big core cluster [optic]. The frequency of the CPU core cluster gets severely throttled at high chip temperatures to prevent thermal failures. Accelerators, like GPUs, suffer less because of their energy-efficient design and relatively bigger size. The previous experiments were carried out with a 5 V USB fan attached to reduce thermal effects. We run the experiments again without the fan to evaluate the effect of thermal throttling.
The accumulation of heat gets worse with longer execution time. For a continuous inference of five seconds, the overall performance loss due to thermal effect is 5% on average, compared to the best co-execution performance reported in Table IV. We also observe an increase in average chip temperature of C. For a longer inference of one minute, the performance loss rises to 20%, and the average chip temperature reaches C (chip throttling threshold).
The co-execution of multiple components brings significant benefits in throughput. However, the engagement of high-performance CPU cores causes performance instability, depending on the chip thermal condition. The use of accelerators in this case will provide a stable behaviour for time critical applications for a guaranteed performance.
V-C Co-execution with NPU
The performance of the NPU is unbeatable. Table V shows that Kirin 970 with co-execution of all on-chip components gives exceptionally high throughput. In practice, the NPU and GPU can be executed in parallel for applications that demand very high performance, as well as to perform multiple inferences with stable performance.
Mobile inferencing is now ubiquitous. In this article, we examine the power-performance characteristics of inferencing through several prominent neural networks on different components available within a mobile SoC. We also perform roofline analysis of networks on components to unveil the further optimization scope. We show that network throughput can increase by up to 2x using co-execution that engages all the components in inferencing simultaneously.