Neural Network Inference on Mobile SoCs
The ever-increasing demand from mobile Machine Learning (ML) applications calls for ever more powerful on-chip computing resources. Mobile devices are empowered with heterogeneous multi-processor Systems-on-Chips (SoCs) to process ML workloads such as Convolutional Neural Network (CNN) inference. Mobile SoCs house several different types of ML-capable components on-die, such as CPU, GPU, and accelerators. These different components are capable of independently performing inference but with very different power-performance characteristics. In this article, we provide a quantitative evaluation of the inference capabilities of the different components on mobile SoCs. We also present insights behind their respective power-performance behavior. Finally, we explore the performance limit of the mobile SoCs by synergistically engaging all the components concurrently. We observe that a mobile SoC provides up to 2x improvement with parallel inference when all its components are engaged, as opposed to engaging only one component.
The tremendous popularity of Neural-Network (NN) based machine learning applications in recent years has been fuelled partly by the increased capability of the compute engines, in particular, the GPUs. Traditionally, both the network training and inference were performed on the cloud with mobile devices only acting as user interfaces. However, enriched user experience and privacy concerns now demand inference to be performed on the mobile devices themselves with high accuracy and throughput.
In this article, we look at NN-enabled vision applications on mobile devices. These applications extract high-level semantic information from real-time video streams and predominantly use Convolutional Neural Networks (CNNs). They are important in many domains, such as Advanced Driver-Assistance Systems (ADAS), Virtual Reality (VR), and Augmented Reality (AR). Enabling these applications on power-constrained mobile devices is challenging due to the enormous computational and memory requirements.
Heterogeneous multi-processor SoCs enable the current state-of-the-art mobile devices. However, the presence of multiple vendors fragments the mobile SoC market. Accelerators (including GPU, FPGA, and dedicated neural accelerators) demonstrate great performance for inference. However, these high-performance components are present in only a small fraction of the mobile devices. Moreover, due to market fragmentation, it is impossible to develop a mobile application with accelerators that can run across multiple devices. Instead, CPUs remain the common denominator among mobile SoCs and are the favored choice for inference.
We embark on an exploration to quantitatively characterize and understand the inferencing capabilities of the mobile SoCs given the diverse landscape. We portray the power-performance gap between the ubiquitous CPUs and the high-performance accelerators in high-end devices and uncover the reasons behind the gap through the roofline models. Finally, we propose simultaneous engagement of all the SoC components to greatly expand the promise of functional deployment of vision applications on mobile devices.
II Inference on Mobile SoCs
II-A Heterogeneous Multi-processor SoCs
There are over two thousand unique mobile SoCs in the mobile devices market. The diversity comes from the choice of different CPUs, GPUs, caches, memory controllers, and other application-specific accelerators. This fragmentation of the SoC market makes standard optimizations impossible. However, the similarity among these SoCs lies in the choice of one or more CPU core clusters.
Multi-cores enable the state-of-the-art mobile SoCs. 99.9% of the Android devices in the market in 2019 have multiple cores. Among these, about half of the SoCs implement performance heterogeneity with at least two CPU clusters: a high-performance and an energy-efficient core cluster. ARM big.LITTLE architecture, one of the most popular architectures implementing this heterogeneity, is present in HiSilicon Kirin, Samsung Exynos, and Qualcomm Snapdragon series SoCs. The heterogeneous cores differ in power-performance-area characteristics but share the same Instruction Set Architecture (ISA). Figure 1 shows an abstract block diagram of this architecture. The general availability of CPUs makes them a favorable choice for mobile inference and makes device-agnostic optimizations feasible.
Existing architectures, including GPU and FPGA, have proven to be advantageous for ML workloads and are thus commonly used for deployment on certain devices. Both academic and commercial dedicated accelerators (Google Edge TPU, Intel Nervana NNP, Huawei NPU, Apple Neural Engine) offer exceptional runtime and energy-efficiency. There are no standard neural accelerators for mobile SoCs, making horizontal application integration difficult. Limited availability constrains even the use of GPUs.
II-B Mobile ML Framework and Optimizations
TensorFlow, PyTorch, and MXNet are some of the common ML development frameworks for all scenarios. Frameworks such as TensorFlow Lite facilitate the compression of huge models to fit into resource-constrained mobile devices. Efficient libraries and APIs bridge the gap between the frameworks and the underlying hardware, examples of which are Nvidia cuDNN for GPUs, ARM NN powered by Compute Library (ARM-CL) for ARM CPUs and GPUs, and Facebook NNPACK and QNNPACK for mobile CPUs. These libraries usually optimize with detailed architectural information. ARM-CL supports acceleration through ARM NEON vectorization and provides NEON assembly implementations for the most computationally intensive convolution kernels. Algorithmic optimizations (Winograd transform, FFT, sparsity exploration) lower the computational complexity of convolution computations. Furthermore, quantization and network pruning are common techniques that bring down the processing requirement at the cost of some accuracy.
Even though most mobile inference workloads run on CPUs, optimizations of ML workloads on accelerators garner most of the attention. There is a lot of room for optimizations on mobile CPUs to enable ML applications across different mobile platforms.
III Characterizing Inferencing on Mobile SoC
We perform experiments across different technology nodes using two commonly used mobile SoCs: the 28 nm Exynos 5422 within the Odroid XU3 development platform and the 10 nm Kirin 970 within the Hikey 970 development platform. Released in 2014 and 2017 respectively, these two SoCs show us the progress of mobile SoC development over the years. Furthermore, these two SoCs roughly approximate the mid- and high-end mobile SoCs today.
In the experiments, both SoCs use ARM-CL v18.05. The Kirin 970 NPU is supported by HiAI DDK (v100) for network deployment. For Exynos 5422, built-in power sensors, running at 200 Hz, measure the power of each component. For Kirin 970, because of the absence of any integrated on-chip power sensors, we approximate the power consumption by measuring the socket power with the help of a power measurement unit running at 100 Hz.
III-A Experimental Set-up
Both SoCs include an ARM big.LITTLE-based asymmetric multi-core CPU. The Kirin 970 CPU adopts the ARMv8-A architecture. It consists of a high-performance, high-power, out-of-order four-core Cortex-A73 cluster (2.36 GHz) and a low-performance, low-power, in-order four-core Cortex-A53 cluster (1.8 GHz). Exynos 5422 has a similar design but uses the older ARMv7-A architecture with Cortex-A15 (2 GHz) and Cortex-A7 (1.4 GHz) cores. All CPU cores support NEON advanced Single Instruction Multiple Data (SIMD) operations, which allow for four 32-bit floating-point operations per cycle.
Kirin 970 adopts an ARM Mali G72 MP12 GPU (850 MHz), implementing the second-generation Bifrost architecture. It has twelve shader cores with three execution engines each. Each engine is capable of eight FP32 operations per cycle, giving a total peak compute capability of 244.8 GFLOPS for G72. Exynos 5422 includes an ARM Mali T628 MP6 GPU (600 MHz). It adopts the older Midgard architecture with six shader cores implementing the Tripipe design with two arithmetic pipelines. Each pipeline is capable of eight FP32 operations per cycle, providing a total peak compute capability of 57.6 GFLOPS for T628.
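The peak figures quoted above follow directly from the core counts, per-cycle widths, and clock frequencies given in the text. The short sketch below reproduces that arithmetic; the function and variable names are ours and purely illustrative.

```python
def peak_gflops(cores: int, units_per_core: int,
                ops_per_unit_per_cycle: int, freq_ghz: float) -> float:
    """Peak FP32 throughput in GFLOPS: parallel lanes times clock in GHz."""
    return cores * units_per_core * ops_per_unit_per_cycle * freq_ghz


# Mali G72 MP12: 12 shader cores, 3 execution engines each,
# 8 FP32 operations per engine per cycle, 850 MHz.
g72 = peak_gflops(12, 3, 8, 0.85)

# Mali T628 MP6: 6 shader cores, 2 arithmetic pipelines each,
# 8 FP32 operations per pipeline per cycle, 600 MHz.
t628 = peak_gflops(6, 2, 8, 0.60)

print(f"G72  peak: {g72:.1f} GFLOPS")   # 244.8
print(f"T628 peak: {t628:.1f} GFLOPS")  # 57.6
```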
Kirin 970 includes a Huawei NPU purpose-built for ML. It has a peak performance of 1.92 TFLOPS with FP16. The accompanying HiAI DDK API enables the deployment of networks on the NPU but only works with Android. Exynos 5422 does not have any ML accelerator.
III-B Individual Heterogeneous Components
We first study each component in isolation by running inferencing of multiple images in a stream on a single component. Both Big and Small clusters are self-sufficient for inferencing. GPU and NPU require the support of a Small cluster for inferencing.
Table I shows the throughput of each component on both our SoCs. All components in Kirin 970 outperform their respective counterparts in the older Exynos 5422. The Big A73 cluster, Small A53 cluster, and G72 GPU outperform the Big A15 cluster, Small A7 cluster, and T628 GPU on average by a factor of 4.4x, 2.6x, and 4.2x, respectively. The performance gap between the Big and Small clusters has reduced from 4x to 2.5x, with a decrease in the Big-to-Small power consumption ratio from 10x to 4x. Furthermore, the performance gap between the GPU and the CPU clusters is only about 2x to 3x for both SoCs.
For NPU, we were unable to deploy MobileNet due to incompatible operators. On average, NPU is only 1.6x better than the high-end G72 GPU. On the other hand, the portability of applications across different platforms remains a challenge for dedicated accelerators. The proprietary development kit makes the general optimization a difficult endeavor.
We measure the average active power consumption of inferencing on different components and calculate the energy efficiency, as shown in Figure 2. For Exynos 5422, power sensors for individual components measure the power consumption of each component separately. For Kirin 970, we calculate active power values by subtracting the idle power (measured when no workload is running) from socket power measurement taken during inferencing. Therefore, the power measurements for Kirin are slightly higher, as memory power cannot be separated.
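The active-power calculation for Kirin 970 and the resulting energy efficiency can be sketched as follows. The sample values are hypothetical and not measurements from this article, and the choice of images per joule as the efficiency unit is our assumption for illustration.

```python
from statistics import mean


def active_power_w(socket_samples_w, idle_power_w):
    """Average active power: mean socket power during inference minus idle power."""
    return mean(socket_samples_w) - idle_power_w


def energy_efficiency(throughput_img_per_s, power_w):
    """Images per joule, i.e. (images/s) divided by (joules/s)."""
    return throughput_img_per_s / power_w


# Hypothetical numbers for illustration only.
samples = [6.0, 6.5, 6.25, 6.25]  # socket power (W) sampled during inference
idle = 2.25                       # socket power (W) with no workload running

p_active = active_power_w(samples, idle)   # 6.25 - 2.25 = 4.0 W
eff = energy_efficiency(10.0, p_active)    # 10 images/s at 4 W = 2.5 images/J

print(f"active power: {p_active:.2f} W, efficiency: {eff:.2f} images/J")
```

Because the idle subtraction cannot separate out memory power, the value obtained this way is an upper estimate of the component's own consumption, as noted in the text.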
NPU is the most energy-efficient among all components, which we expect, given its custom design for inference. GPUs are the second-most energy-efficient component. Small clusters also show good energy-efficiency. However, Table I shows that their absolute throughput is too low for them ever to be useful alone.
Comparing across two platforms, the energy efficiency of each component has improved for the newer SoC. However, the improvement is minimal and even negative for the Small CPU cluster. Compared to its predecessor A7, A53 is more complex and area hungry with 64-bit, complex branch prediction, and larger TLB. It achieves greater performance but at the cost of even greater power consumption.
Impact of Technology Scaling Versus Architectural Innovations
Exynos 5422 and Kirin 970 use the 28 nm and 10 nm technology nodes, respectively. In moving from the 28 nm Exynos 5422 to the 10 nm Kirin 970, the maximum frequency of the Big cluster has only changed from 2 GHz (A15) to 2.36 GHz (A73), while that of the Small cluster has changed from 1.4 GHz (A7) to 1.8 GHz (A53). So the frequency scaling is 1.18x for the Big cluster and 1.29x for the Small cluster for these two platforms. On the other hand, we get 4.4x and 2.6x throughput improvement across technology generations (Table I) for the Big cluster and Small cluster, respectively. This improvement in performance is achieved through smart designs such as micro-architectural improvements (improved branch predictor, cache data prefetchers, etc.), larger caches, and 64-bit support leading to improved NEON processing, among others.
However, in the case of the Small cluster, with an increased area, the micro-architectural changes give an increase in power that cannot be offset by technology scaling. Indeed, the Small A53 cluster consumes roughly twice the power of the Small A7 cluster. Thus, the energy-efficiency improvement is limited for the Small cluster for some networks as we move from A7 to A53. In contrast, between the two Big clusters, A73 is more power-efficient than A15; the energy-efficiency improves from the A15 to the A73 cluster. As mentioned earlier, the power measurements for A7 and A15 are quite accurate, while the measured power for A53 and A73 is higher as it includes the memory power that could not be separated.
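To make the comparison between frequency scaling and overall speedup concrete, the sketch below splits each throughput gain into a frequency factor and a residual factor. Treating the two as multiplicative assumes that throughput scales linearly with clock frequency, which is a simplification of ours (it does not hold for memory-bound workloads); the residual is therefore only a rough indication of the architectural contribution.

```python
def split_speedup(total_speedup: float, f_old_ghz: float, f_new_ghz: float):
    """Return (frequency gain, residual gain) assuming speedup = freq * residual."""
    freq_gain = f_new_ghz / f_old_ghz
    return freq_gain, total_speedup / freq_gain


# Big cluster: A15 at 2.0 GHz to A73 at 2.36 GHz, 4.4x throughput (Table I).
big_freq, big_rest = split_speedup(4.4, 2.0, 2.36)

# Small cluster: A7 at 1.4 GHz to A53 at 1.8 GHz, 2.6x throughput (Table I).
small_freq, small_rest = split_speedup(2.6, 1.4, 1.8)

print(f"Big:   {big_freq:.2f}x from frequency, {big_rest:.2f}x residual")
print(f"Small: {small_freq:.2f}x from frequency, {small_rest:.2f}x residual")
```

Under this assumption, most of the gain in both clusters comes from sources other than clock frequency, which is consistent with the attribution to micro-architectural improvements above.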
We observe that NPU provides unmatched energy-efficiency for inferences. It is the optimal choice to perform network inferences on the platforms with such dedicated accelerators. However, a developer needs to put in substantial effort to port their application with a proprietary API to execute on the NPU, and the effort would not bear fruit on mobile devices lacking this very specific NPU. NPU, as a black box, also causes inflexibility in development and optimizations. Furthermore, NPU is compatible with only a limited set of network designs. These extra requirements could make it quickly obsolete for future networks.
On the other hand, high-end GPUs can provide performance comparable to NPU at satisfactory energy-efficiency. GPUs are capable of running General-Purpose (GPGPU) applications written in OpenCL, which is easily portable to a large variety of GPUs and even CPUs supporting OpenCL. This generality makes it a good candidate to use when high performance is a major consideration.
CPUs provide both the worst energy-efficiency and the worst throughput among all components. Still, they are critical for inferencing because they are commonly present across all mobile devices. Low-end mobile SoCs would lack accelerators like NPU. They may contain a low-end GPU, but it may be missing OpenCL support and thereby lack any inferencing capability. Network inference on CPU is inevitable and demands optimization considerations.
Our analysis shows that any component alone on both platforms can barely support the increasing performance requirement for network inferencing. Section V-A presents the co-execution methodology that can mitigate the performance issue to some extent. Still, we must continue to look into the networks themselves in search of further optimization opportunities.
IV Roofline Analysis
To understand the execution behavior of the networks on each SoC component, we perform a roofline analysis. Roofline analysis is a widely applied methodology that can classify an application as memory- or compute-bound on given hardware. It gives insights to developers for improving their application design to cater to the computation and memory capability of the underlying processing devices. The horizontal “Ceiling” and the slanted “Roof” together construct a “Roofline” that bounds the maximum performance of an application (measured in GOPS) under a hardware-determined compute or memory bound, respectively. The Operational Intensity (OI) of an application (measured in operations per byte) determines whether its peak performance is bounded by the memory bandwidth (measured in GB/s) or the compute capability (measured in GOPS) of the hardware. Both Exynos 5422 and Kirin 970 show similar behavior for the CPU core clusters and GPU. Therefore, we only present here the analysis for Exynos 5422.
IV-A Construction of a Roofline Model
Hardware specifications provide the peak pure compute performance. Micro-benchmarking provides the peak (sustainable) memory bandwidth. The specifications claim a peak memory-bus bandwidth of 14.9 GB/s. However, we observe the actual component-wise peak bandwidth to be 3.44 GB/s, 0.49 GB/s, and 6.15 GB/s for the A15 cluster, A7 cluster, and T628 GPU, respectively.
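Given a peak compute figure and a sustained bandwidth, the roofline bound is the minimum of the two limits. The sketch below applies this standard formulation to the T628 GPU, using the 57.6 GFLOPS peak and the 6.15 GB/s measured bandwidth reported in this article; the function names and the two sample operational intensities are illustrative choices of ours.

```python
def attainable_gops(oi: float, peak_gops: float, bandwidth_gbs: float) -> float:
    """Roofline bound: limited by compute (ceiling) or by bandwidth * OI (roof)."""
    return min(peak_gops, oi * bandwidth_gbs)


def ridge_point(peak_gops: float, bandwidth_gbs: float) -> float:
    """Operational intensity at which the roof meets the ceiling."""
    return peak_gops / bandwidth_gbs


T628_PEAK_GOPS = 57.6  # from the hardware specification
T628_BW_GBS = 6.15     # sustained bandwidth from micro-benchmarking

ridge = ridge_point(T628_PEAK_GOPS, T628_BW_GBS)
low = attainable_gops(1.0, T628_PEAK_GOPS, T628_BW_GBS)    # memory-bound
high = attainable_gops(20.0, T628_PEAK_GOPS, T628_BW_GBS)  # compute-bound

print(f"ridge point: {ridge:.2f} ops/byte")
print(f"bound at OI=1:  {low:.2f} GOPS")
print(f"bound at OI=20: {high:.2f} GOPS")
```

A kernel whose operational intensity lies to the left of the ridge point is memory-bound on this component; one to the right is compute-bound.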
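Many variations of the roofline model have been constructed to adapt to different use cases. In this analysis, we define two operational intensities, namely the theoretical OI ($OI_t$) and the empirical OI ($OI_e$), given in Eqn (1) and (2).

$$OI_t = \frac{\text{Operations}}{\text{Memory accesses}} \qquad (1)$$

$$OI_e = \frac{\text{Operations}}{\text{DRAM accesses}} \qquad (2)$$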
We calculate the theoretical operational intensity $OI_t$ by analyzing the code. The memory accesses include all the data required in the computation. During actual executions, multiple levels of caches within components improve the memory access performance. The caches make it difficult for $OI_t$ to correlate with the actual performance on the components. Therefore, we introduce the empirical operational intensity $OI_e$. We calculate $OI_e$ using the actual DRAM accesses on the bus, which accounts for the presence of the multi-level memory hierarchy. It is more informative and correlates better with the actual performance on the component than $OI_t$. We use application-specific performance counters obtained from ARM Streamline DS5 at run-time to calculate $OI_e$ (CPU: L2_data_refill, GPU: Mali L2 cache external read/write bytes). Fig. 3(a) shows the roofline points of major layers in AlexNet on the A15 cluster for both $OI_t$ and $OI_e$.
IV-B Theoretical and Empirical OI
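As a rough illustration of the two intensities, the sketch below estimates the theoretical value for a fully connected layer from its dimensions and the empirical value from a cache-refill counter. Several points are assumptions of ours rather than details from the article: FP32 data, batch size one, bias ignored, every L2 refill fetching one 64-byte line from DRAM, and the counter value itself, which is hypothetical. The layer size (9216 inputs, 4096 outputs) is that of the first fully connected layer of AlexNet.

```python
BYTES_FP32 = 4


def fc_theoretical_oi(n_in: int, n_out: int) -> float:
    """Operations per byte for a fully connected layer, counting all data once."""
    ops = 2 * n_in * n_out                                    # multiply + add per weight
    bytes_moved = BYTES_FP32 * (n_in * n_out + n_in + n_out)  # weights, input, output
    return ops / bytes_moved


def empirical_oi(ops: int, l2_refills: int, line_bytes: int = 64) -> float:
    """Operations per DRAM byte, assuming one cache line per L2 refill."""
    return ops / (l2_refills * line_bytes)


fc6_ops = 2 * 9216 * 4096
oi_t = fc_theoretical_oi(9216, 4096)

# Hypothetical counter reading, chosen only to show the calculation.
oi_e = empirical_oi(fc6_ops, l2_refills=1_179_648)

print(f"theoretical OI: {oi_t:.3f} ops/byte")
print(f"empirical OI:   {oi_e:.3f} ops/byte")
```

The theoretical value of roughly 0.5 operations per byte reflects that each weight is read once and used for a single multiply-accumulate, which is why fully connected layers sit deep in the memory-bound region.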
Figure 3(a) plots the $OI_t$ (squares) and $OI_e$ (diamonds) values of several major AlexNet layers, marked with different colors. Black marks the whole-network $OI_t$ and $OI_e$ of AlexNet. The intersection points of the $OI_t$ values with the “Roofline” represent the theoretical maximum performance for the code-based theoretical operational intensities, which fall in the memory-bound region on the “Roof”. The corresponding points for $OI_e$ are the actual achieved performance in GOPS, which are always below the “Roofline”.
The presence of caches reduces the memory accesses going to DRAM during execution and thus increases the operational intensity. Therefore, for all layers, the $OI_e$ points lie to the right of the $OI_t$ points, indicating higher performance. For layers with low $OI_t$ (fully connected, FC), the $OI_e$ points move along the “Roofline”, achieving the theoretical maximum performance. For layers with higher $OI_t$ (convolutional, CONV), the $OI_e$ points cross the boundary of the memory-bound region and become compute-bound. The performance gain is not as significant, which we attribute to underutilization caused by insufficient or imperfect parallelization. Overall, $OI_e$ is a better indicator of real-world performance. Therefore, we only plot $OI_e$ values going forward.
IV-C Across Different Components
Figure 3(b) shows the performance of different networks on different components on Exynos 5422. The color of the points corresponds to the respective component. We can observe that memory severely bottlenecks the performance of both A7 cluster and T628 GPU. Performance of A15 cluster falls in both compute- and memory-bound regions depending upon the network.
The $OI_e$ values differ across components because of their different memory hierarchies. The Big core cluster, with a larger cache (L2: 2MB), derives higher benefits from the memory hierarchy than the GPU (L2: 128KB). However, for AlexNet, which is notorious for its huge parameter size, the caches get flushed regardless of their size, resulting in a smaller benefit from the memory hierarchy. On the other hand, small filter sizes lead to sub-optimal parallelization (under-utilization). This observation holds more starkly for newer networks with smaller filter sizes than for older networks. It explains the significant deviation of the empirical performance of networks on the components from the “Roofline”.
IV-D Major Layers in Inference
We perform a deeper layer-level analysis to explain the behavior of the networks. Convolutional and fully-connected layers dominate the total execution time of networks, and thus both are considered major layers worthy of examination. We limit our analysis to the Big cluster because networks there show both memory- and compute-bound behavior. Figure 3(c) shows that different layers in AlexNet (and also other networks to a lesser extent) exhibit different empirical OIs. Convolutional layers at the start of AlexNet perform compute-intensive convolution on large inputs and thereby have relatively higher OIs. On the other hand, fully-connected layers perform memory-intensive operations on large parameters and thereby have relatively lower OIs. Convolutional and fully-connected layers of AlexNet fall in the compute- and memory-bound regions of the roofline model, respectively. Overall, AlexNet falls somewhere in the middle of both.
In general, we observe that the layers of a network are scattered across both the compute- and memory-bound regions. This difference comes from the choice of the size of the input tensors and filters. The vast differences in $OI_e$ for different layers within a network motivate layer-level optimizations such as per-layer Dynamic Voltage and Frequency Scaling (DVFS) for power management. Furthermore, the variation within a network motivates fine-grained layer-level co-execution, which improves the overall chip utilization.
IV-E Effect of Quantization
Quantization is a commonly applied technique that reduces the memory and computation requirements of a network at the cost of reduced accuracy. However, the quality of its implementation primarily determines the benefits it provides. In the implementation of quantized MobileNet in ARM-CL (v18.05), the QASYMM8 model with 8-bit weights is used. This implementation fails to improve the overall performance of the network. Deeper analysis reveals that the latencies of convolutional layers are indeed reduced, but the overheads from extensive de-quantization and re-quantization overshadow any benefit.
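To show what the de-quantization and re-quantization steps involve, the sketch below implements a generic 8-bit asymmetric (affine) quantization round trip in plain Python. We assume the usual scheme in which a real value is represented as scale * (q - zero_point); this is an illustration of the general technique and not the ARM-CL implementation, and all names are ours.

```python
def choose_qparams(x_min: float, x_max: float, qmin: int = 0, qmax: int = 255):
    """Pick scale and zero point so that [x_min, x_max] maps onto [qmin, qmax]."""
    x_min, x_max = min(x_min, 0.0), max(x_max, 0.0)  # range must contain zero
    scale = (x_max - x_min) / (qmax - qmin)
    zero_point = int(round(qmin - x_min / scale))
    return scale, max(qmin, min(qmax, zero_point))


def quantize(x: float, scale: float, zero_point: int,
             qmin: int = 0, qmax: int = 255) -> int:
    """Map a real value to an 8-bit code, saturating at the range limits."""
    q = int(round(x / scale)) + zero_point
    return max(qmin, min(qmax, q))


def dequantize(q: int, scale: float, zero_point: int) -> float:
    """Map an 8-bit code back to its real-valued approximation."""
    return scale * (q - zero_point)


scale, zp = choose_qparams(-1.0, 3.0)
values = [-1.0, -0.3, 0.0, 0.7, 1.5, 3.0]
round_trip = [dequantize(quantize(v, scale, zp), scale, zp) for v in values]

for v, r in zip(values, round_trip):
    print(f"{v:+.3f} -> {r:+.4f}")
```

Every layer boundary that switches between the integer and floating-point domains pays for conversions of this kind on each element, which is the overhead that the analysis above identifies.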
Quantization reduces the total operations and the memory accesses required near-proportionally. The reduction in memory accesses results in a slightly higher empirical operational intensity $OI_e$. Therefore, the roofline analysis of a quantized network nearly overlaps with that of its non-quantized counterpart, and quantization does not improve the memory behavior of the layers. Lower operation requirements under quantization predominantly contribute to the reduction in execution time of the convolutional layers.
IV-F Glimpse of NPU
NPU, due to its novelty and dedicated machine learning processing design, garners a lot of attention. However, most of the details are kept confidential. We are unaware of its architectural and integration details. Therefore, we can only attempt to reverse engineer its behavior to gain some insights.
We implement a kernel module that enables counting of traffic on the CCI bus. We attribute the traffic on the CCI bus that goes to DRAM during the engagement of NPU to the main memory activity of NPU. The maximum observed memory bandwidth of executing several networks and the peak performance of 1.92 TOPS from the specification construct the “Roof” and “Ceiling” of the NPU roofline. We observe that the performance of NPU is significantly bounded by the memory for the networks tested. This observation shows a significant scope for optimization to achieve the full processing potential of NPU.
V Improving the Performance
V-A Co-Execution of Multiple Components
Stream processing, depending on the application, requires a throughput of 10 to 40 images/second. Some applications even require multiple inferences to run at the same time. Table I shows that the high-end Kirin 970 SoC can barely sustain such a requirement, while the mid-end Exynos 5422 cannot. We previously observed that the peak bandwidth consumed by any individual component is far below the total bandwidth supported by the bus. This observation supports the claim that inferencing through multiple components together will not make individual components more memory-constrained compared to their isolated inferencing. Therefore, we use ARM-CL to create an infrastructure wherein multiple components process images from a single unified stream in parallel using a work-stealing mechanism. The infrastructure uses a buffer to reorder the out-of-sync output from different components. Co-execution obtains significantly higher throughput than the highest-throughput component in isolated execution.
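The structure of such an infrastructure can be sketched in a few lines. The version below is a deliberate simplification and not the ARM-CL-based implementation: instead of per-component queues with stealing, idle workers pull the next frame from one shared queue, which gives a similar load-balancing effect. Inference is replaced by a sleep, and the component names and latencies are hypothetical.

```python
import heapq
import queue
import threading
import time


def run_coexecution(num_frames: int, workers: dict) -> list:
    """workers maps a component name to a per-frame latency in seconds."""
    frames = queue.Queue()
    for i in range(num_frames):
        frames.put(i)
    done = queue.Queue()

    def worker(name: str, latency: float) -> None:
        while True:
            try:
                idx = frames.get_nowait()  # faster components fetch more often
            except queue.Empty:
                return
            time.sleep(latency)            # placeholder for running the network
            done.put((idx, name))

    threads = [threading.Thread(target=worker, args=item) for item in workers.items()]
    for t in threads:
        t.start()

    # Reorder buffer: hold early results until all earlier frames have arrived.
    ordered, pending, next_idx = [], [], 0
    while next_idx < num_frames:
        heapq.heappush(pending, done.get())
        while pending and pending[0][0] == next_idx:
            ordered.append(heapq.heappop(pending))
            next_idx += 1

    for t in threads:
        t.join()
    return ordered


# Hypothetical latencies: the GPU is fastest, the Small cluster slowest.
result = run_coexecution(12, {"gpu": 0.01, "big": 0.02, "small": 0.04})
for idx, name in result:
    print(f"frame {idx:2d} processed by {name}")
```

Frames leave the reorder buffer in stream order even though the slower components finish theirs late, and the faster components naturally end up processing a larger share of the stream.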
Table II shows the peak co-execution throughput on both mobile SoCs with the ARM big.LITTLE CPU core cluster and GPU. We include the best individual component executions, which are GPU for both platforms, for comparison. On average, the co-execution gives 50% throughput improvement over GPU only execution. Furthermore, Table II shows Exynos 5422’s obsolescence. Even with the co-execution, Exynos 5422 shows very low absolute throughput.
V-B Co-Execution with NPU
The performance of NPU is unbeatable. Table III shows that Kirin 970, with co-execution of all on-chip components, gives exceptionally high throughput. In practice, we can execute NPU and GPU in parallel towards one application that demands very high performance or to perform multiple inferences simultaneously with multiple applications.
V-C Co-Execution Energy Efficiency
Synergistic co-execution engages multiple components simultaneously to improve performance at the cost of higher power consumption. Therefore, the energy efficiency of the co-execution is the average energy efficiency of the engaged components. Figure 4 shows the energy efficiency of the execution that engages all the components on Exynos 5422, the CPU clusters and GPU on Kirin 970 (excluding NPU), and all the components on Kirin 970 (including NPU).
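One way to see why the co-execution efficiency falls between the efficiencies of the individual components is to compute it as total throughput over total power. The sketch below does this under the assumption that throughputs and powers simply add when components run together, ignoring contention; the numbers are hypothetical and not taken from Figure 4.

```python
def efficiency(throughput_img_per_s: float, power_w: float) -> float:
    """Images per joule for one component."""
    return throughput_img_per_s / power_w


def coexecution_efficiency(components: list) -> float:
    """Combined images per joule, assuming throughputs and powers are additive."""
    total_throughput = sum(t for t, _ in components)
    total_power = sum(p for _, p in components)
    return total_throughput / total_power


# Hypothetical (images/s, W) pairs for a Big cluster, a Small cluster, and a GPU.
parts = [(2.0, 4.0), (1.0, 1.0), (6.0, 3.0)]

individual = [efficiency(t, p) for t, p in parts]  # 0.5, 1.0, 2.0 images/J
combined = coexecution_efficiency(parts)           # 9 / 8 = 1.125 images/J

print("individual:", individual)
print("combined:  ", combined)
```

Under these assumptions the combined value is a power-weighted average of the individual efficiencies, so it always lies between the least and the most efficient component engaged.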
Overall, the co-execution energy efficiency is always better than the Big CPU cluster. In Kirin 970 SoC, as GPU is much more energy-efficient than the CPU clusters, the co-execution provides better energy efficiency than the power-efficient Small CPU cluster.
Mobile inferencing is now ubiquitous. In this work, we examine the power-performance characteristics of inferencing through several prominent neural networks on different components available within a mobile SoC. We also perform roofline analysis of networks on components to unveil the further optimization scope. We show that network throughput can increase by up to 2x using co-execution that engages all the components in inferencing simultaneously.
Siqi Wang is currently a research assistant and is working toward the Ph.D. degree at School of Computing, National University of Singapore. Her current research interests include performance optimization, task scheduling, general purpose GPUs and deep learning on heterogeneous multi-processor systems.
Anuj Pathania is currently working as a research fellow at School of Computing, National University of Singapore. He received his Ph.D. degree from Karlsruhe Institute of Technology (KIT), Germany in 2018. His research focuses on resource management algorithms with emphasis on performance-, power- and thermal-efficiency in embedded systems.
Tulika Mitra is a Professor of Computer Science at School of Computing, National University of Singapore. She received her Ph.D. degree in computer science from the State University of New York Stony Brook in 2000. Her research interests span various aspects of the design automation of embedded real-time systems, cyber-physical systems, and Internet of Things.
- C.-J. Wu, D. Brooks, K. Chen, D. Chen, S. Choudhury, M. Dukhan, K. Hazelwood, E. Isaac, Y. Jia, B. Jia et al., “Machine learning at facebook: Understanding inference at the edge,” in 2019 IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, 2019, pp. 331–344.
- M. Wess, S. M. P. Dinakarrao, and A. Jantsch, “Weighted quantization-regularization in dnns for weight memory minimization toward hw implementation,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 37, no. 11, pp. 2929–2939, 2018.
- “Keysight Technologies B2900 Series Precision Source/Measure Unit,” https://goo.gl/U4HMbu.
- A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, 2012, pp. 1097–1105.
- C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 1–9.
- A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, “Mobilenets: Efficient convolutional neural networks for mobile vision applications,” arXiv preprint arXiv:1704.04861, 2017.
- K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
- F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, and K. Keutzer, “SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and 0.5 MB model size,” arXiv preprint arXiv:1602.07360, 2016.
- S. Williams, A. Waterman, and D. Patterson, “Roofline: An insightful visual performance model for floating-point programs and multicore architectures,” Lawrence Berkeley National Lab.(LBNL), Berkeley, CA (United States), Tech. Rep., 2009.
- S. Siamashka, “Tinymembench,” https://github.com/ssvb/tinymembench.
- S. Wang, G. Ananthanarayanan, Y. Zeng, N. Goel, A. Pathania, and T. Mitra, “High-throughput cnn inference on embedded arm big.little multi-core processors,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2019. [Online]. Available: http://dx.doi.org/10.1109/TCAD.2019.2944584