MobiSR: Efficient On-Device Super-Resolution through Heterogeneous Mobile Processors

Royson Lee*, Stylianos I. Venieris*
Łukasz Dudziak, Sourav Bhattacharya, Nicholas D. Lane
Samsung AI Center, Cambridge & University of Oxford
* Indicates equal contribution.
Abstract.

In recent years, convolutional networks have demonstrated unprecedented performance in the image restoration task of super-resolution (SR). SR entails the upscaling of a single low-resolution image in order to meet application-specific image quality demands and plays a key role in mobile devices. To comply with privacy regulations and reduce the overhead of cloud computing, executing SR models locally on-device constitutes a key alternative approach. Nevertheless, the excessive compute and memory requirements of SR workloads pose a challenge in mapping SR networks on resource-constrained mobile platforms. This work presents MobiSR, a novel framework for performing efficient super-resolution on-device. Given a target mobile platform, the proposed framework considers popular model compression techniques and traverses the design space to reach the highest performing trade-off between image quality and processing speed. At run time, a novel scheduler dispatches incoming image patches to the appropriate model-engine pair based on the patch’s estimated upscaling difficulty in order to meet the required image quality with minimum processing latency. Quantitative evaluation shows that the proposed framework yields on-device SR designs that achieve an average speedup over both highly-optimized parallel difficulty-unaware mappings and highly-optimized single-compute-engine implementations.

Super-resolution, deep neural networks, mobile computing, heterogeneous computing, scheduling
CCS Concepts: Human-centered computing → Ubiquitous and mobile computing; Computing methodologies → Computer vision tasks.
MobiCom '19: The 25th Annual International Conference on Mobile Computing and Networking, October 21–25, 2019, Los Cabos, Mexico.
DOI: 10.1145/3300061.3345455    ISBN: 978-1-4503-6169-9/19/10


1. Introduction

The rapid progress of convolutional neural networks (CNNs) has led to substantial performance improvements in the computer vision task of super-resolution (SR). SR networks are capable of processing a low-resolution image and producing an output with a significant increase in resolution (SRCNN). This property has made CNN-powered SR an enabling technology for building novel applications on mobile and home devices, including mobile phones, electronic photograph frames and televisions.

Despite their unparalleled performance, state-of-the-art SR networks (EDSR; RCAN; RDN; IDN) pose significant deployment challenges. To upscale low-resolution images, SR models often propagate feature maps of large spatial dimensions across their layers, leading to an excessive number of operations and run-time storage requirements.

At the moment, to alleviate this computational barrier, service providers commonly employ cloud-computing solutions. Under this setup, an application collects frames and transmits them to a base server where powerful server-grade machines perform SR. However, in latency- and privacy-sensitive applications, the high response time and security risks of cloud computing may not be tolerable. Furthermore, the need for constant Internet connectivity and the power consumption overhead of exchanging data with the cloud, together with the cost of hosting a data center, often prohibit the offloading of computations. As a result, there is an emerging need to develop methods and systems that alleviate the limitations of cloud-based computing by executing SR networks using local on-device processing (mcdnn_2016; nic2017embedded_dl; venieris2018deploying).

However, as SR networks are computationally expensive, achieving 30 fps using on-device resources is impractical for upscaling to large image resolutions. For instance, given that mobile digital cameras, such as the Pixel 3's, are able to capture and stream in extremely high image resolutions, achieving such resolutions in real-time by running SR networks locally is currently unrealistic. Therefore, common realistic applications of SR on mobile, such as zoom, are image-centric rather than video-focused. Another practical application of mobile SR involves saving data. Popular social networks such as Facebook, Instagram and Reddit, and messaging applications such as Snapchat, are image-heavy and constantly consume data as the user scrolls their feed or sends a message. Given the popularity of data-saving alternatives such as Facebook Lite, features that enable devices to download low-resolution images of a user’s feed and/or messages and upscale them locally would be not only feasible, but also well-received. Moreover, minimizing the network bandwidth needed to load an image feed would allow the app to work more responsively under harsh network conditions and operate in areas with poor cloud connectivity.

In this paper, we propose MobiSR, a novel automated framework that pushes the performance of on-device SR networks. Drawing from the fact that not all inputs have the same upscaling difficulty, MobiSR introduces model compression as a design dimension for the local processing of SR models and introduces a hardware-aware scheduling scheme for allocating inputs to model-compute engine pairs. To explore the model space, the proposed framework starts from a user-supplied SR network and employs a set of compression techniques in order to generate multiple SR networks with varying accuracy-workload characteristics. Upon deployment, a difficulty evaluation unit estimates the upscaling difficulty of incoming samples. Based on the observation that some image patches are shown to be more difficult to upscale for both large and compact models, while some patches are handled better by larger models, the framework schedules inputs accordingly to strike an optimal balance between image quality and speed. Specifically, the inputs that are classified as difficult are computed using a less accurate, but compact model to obtain a rapid upscaling, while easier inputs are assigned to a larger, but more accurate model. Overall, MobiSR considers the error tolerance of the target application in order to perform model selection and tailors its scheduling policy to both the selected SR models and the available compute engines. The key contributions of this paper are the following:

  • The introduction of a two-model super-resolution system that exploits the upscaling difficulty of incoming patches to boost the performance of on-device SR. A novel tunable difficulty evaluation unit is presented that estimates the upscaling difficulty of incoming image patches and schedules them across different model-compute engine pairs at run time.

  • A design space exploration methodology that considers the user-supplied SR model and the target mobile platform together with a user-specified error tolerance and generates an optimized SR system. By treating model compression as a design dimension and employing a hardware-aware scheduling policy, the proposed methodology explores candidate designs at both the model and scheduling level and generates an SR system tailored to meet the user-specified error tolerance at the minimum latency.

2. Background

Since CNNs were first applied to the SR task in (SRCNN), there has been a surge in SR models that utilize popular techniques such as attention (Bahdanau_2014), residual blocks (He_2016), and generative adversarial networks (Goodfellow_2014). These models aim to either map low-resolution images closer to their high-resolution ground truth or make SR images look more naturally pleasing. The former, which are usually trained on either the L1 or Mean Square Error (MSE) loss, favour pixel-to-pixel comparisons and are evaluated on image distortion metrics such as MSE, Peak Signal-to-Noise Ratio (PSNR), and Structural Similarity Index (SSIM) (SSIM). The latter, on the other hand, are usually trained using a combination of different loss functions, including perceptual (Johnson_2016) and adversarial losses (Goodfellow_2014). These models focus on the perceptual quality of the image and are evaluated on no-reference metrics such as Natural Image Quality Evaluator (NIQE) and perceptual score (Ma_2017). In this work, we focus on the former, i.e. mapping low-resolution images closer to their high-resolution ground truth.

Unlike models that are optimized for discriminative tasks, SR models are resource-intensive networks as each layer needs to maintain or upscale the spatial dimensions of its feature maps. As a result, the number of multiply-add operations is typically counted in the billions, as opposed to millions in discriminative networks (embench_2019). Although the research community has made a few steps towards constructing efficient SR models that are optimized for mobile platforms, (1) running these models on-device is still costly and (2) popular compression techniques have not yet been utilized to derive lightweight, mobile-friendly variants. For instance, in the experiments presented in Section 4, the winning model (Vu2018FastAE) of the recent 2018 PIRM Challenge on perceptual SR on mobile (Ignatov_2018) requires more than 1.4 s to 4× upscale an image to 720p on the Hexagon DSP of the Qualcomm Snapdragon 845.

Figure 1. MobiSR’s system architecture.

Challenges of on-device SR. As the size of the image increases, running SR models on a single compute engine is difficult to scale or even impractical; upscaling a large image may lead to a memory overload. Cloud-based solutions (Caulfield2016; Fowers_2018; Hazelwood_2018) can be deployed to offload the expensive computation. However, such solutions rely on a fast and stable communication channel and raise privacy concerns, which makes them difficult to apply in practice. Another solution would be to load-balance the computation of upscaling across the available on-device compute engines of the target mobile System-on-Chip (SoC). However, naively load-balancing SR models on multiple compute engines fails to utilize hardware-specific optimizations; different compute engines are optimized for different types of layers of a network. Furthermore, using reduced-precision compute engines in an uninformed manner can substantially affect the application-level quality of result (QoR). Therefore, there is a need to better utilize on-device resources to improve both the efficiency and scalability of running SR models locally.

SR model compression. So far, substantial effort has been invested on developing network compression techniques, such as pruning (Han_2015; Yang_2017_CVPR), quantization (Han_2016), and knowledge distillation (Hinton_2015), for building efficient neural networks. In particular, a number of convolution approximations, such as low-rank tensor decomposition (Sifre_2014), have been successfully employed as a primary component in building fast and accurate discriminative vision models (mobilenetv1; shufflenetv1; clcnet; effnet). These techniques typically aim to express a convolution as a sequence of simpler tensor operations, reducing in this manner the storage and computation cost of the network. With current SR models being excessively large, exploiting the potential of existing compression techniques can lead to significant gains in efficiency. Nevertheless, each technique provides varying gains depending on the target hardware optimizations, but also on the model and the level of quantization involved. Therefore, a key challenge in accelerating SR models is selecting appropriate convolution-approximation techniques based on both their impact on the accuracy of the given model and their efficient mapping on the available compute engines.

Figure 2. MobiSR’s processing flow.

3. MobiSR

In this section, we present the high-level flow of MobiSR followed by a detailed description of its internal components.

3.1. Overview

Given a particular SR task, MobiSR searches the space of candidate on-device designs and generates a two-model system optimized for the target mobile platform. Upon deployment, the generated super-resolution system (Fig. 1) consists of:

  • A compact network mapped on a high-performance compute engine that trades-off QoR with low processing latency; this can be an aggressively compressed model running on the DSP of the target mobile SoC.

  • A large network which guarantees the user-specified QoR at the cost of a larger workload; this can be a lightly compressed model or a user-defined reference model running on the CPU and GPU of the target SoC.

  • A tunable difficulty-aware scheduler that parallelizes the incoming low-resolution image by dispatching each image patch to the appropriate model-compute engine pair based on its estimated upscaling difficulty.

The key idea behind the proposed approach is that, instead of processing the full set of patches using the large and expensive network, inputs that are classified as hard-to-upscale for both networks are rapidly processed by the compact network, with only a fraction of the inputs processed by the expensive large network, reducing in this way the overall latency of the system. Furthermore, the distortion that is induced due to the network compression of the compact model is compensated for by tuning the portion of images processed by each network based on the user-specified error threshold.

A high-level overview of MobiSR’s flow is presented in Fig. 2. The framework is supplied with a high-level description of an SR network (i.e. a PyTorch model, https://pytorch.org/), the specifications of the target mobile platform and an error tolerance in an image reconstruction quality metric (e.g. PSNR). As a first step, the Compression module applies a set of transformations over the supplied network in order to modify its topology and generate a number of compressed variants. To characterize their latency-QoR trade-off, each model is evaluated with respect to both its SR performance and on-device processing latency by the Image Quality and On-device Evaluator respectively. The On-device Evaluator performs a number of runs on the compute engines of the target mobile platform and measures the average latency of each (model, compute engine) pair.

Given the latency measurements, an analytical performance model is populated which enables the rapid estimation of the attainable latency for different scheduling schemes across the available devices. Next, the Pruning module takes as input the (PSNR, latency) of each (model, compute engine) pair as generated by the Image Quality and On-device Evaluators. By examining the PSNR-latency space of each compute engine, only the models that lie on the Pareto front are kept, with the rest of the dominated models discarded as inefficient, reducing in this manner the space of candidate models. After the pruning step, the Total-variation Analysis module is responsible for both tuning the difficulty-aware scheduler and selecting the models to be mapped on the target platform. Overall, given the user-specified error tolerance, MobiSR generates a two-model system together with an associated scheduler tailored for the target mobile platform.

3.2. Model Space

In MobiSR, the user-supplied SR model comprises the starting point for model selection. In this setting, the space of candidate models is determined by the techniques employed by our framework in order to modify the topology of the reference network. The complete model space is formed by defining a set of model transformations to change the complexity-QoR characteristics of the reference model. Given the computation cost of a standard $k \times k$ convolution

(1)   $\mathrm{Cost}_{\mathrm{conv}} = h \cdot w \cdot k^{2} \cdot c_{in} \cdot c_{out}$

where $c_{in}$ is the number of input channels, $c_{out}$ is the number of output channels and $h \times w$ is the feature map size, MobiSR employs the following set of transformations:

Residual Bottleneck Block: First introduced in the ResNet model (He_2016), the residual bottleneck design substitutes a conventional convolutional layer with a 1×1 convolutional layer, used to compress the number of channels by a reduction factor $r$, followed by a $k \times k$ convolutional layer. Then, another 1×1 convolutional layer along with a skip connection are employed to recover the number of output channels. The reduction in computation cost over a standard convolutional layer is therefore

(2)   $\frac{\mathrm{Cost}_{\mathrm{bneck}}}{\mathrm{Cost}_{\mathrm{conv}}} = \frac{c_{in} \cdot \frac{c_{in}}{r} + k^{2} \cdot \frac{c_{in}}{r} \cdot \frac{c_{out}}{r} + \frac{c_{out}}{r} \cdot c_{out}}{k^{2} \cdot c_{in} \cdot c_{out}} = \frac{c_{in}}{r\,k^{2}\,c_{out}} + \frac{1}{r^{2}} + \frac{c_{out}}{r\,k^{2}\,c_{in}}$

Group Convolutions: The use of group convolutions (Krizhevsky_2012) was introduced as a method of reducing the number of both parameters and operations with minimal impact on task-level performance (Xie_2017). This is achieved by splitting the convolutions channel-wise and computing them separately. In other words, the input feature maps are grouped into $g$ groups and convolution is performed independently in each group. This leads to a computation cost reduction of $g\times$ as compared to a standard convolutional layer.

Depthwise Separable Convolutions: Depthwise convolutions are group convolutions in which the number of groups is equal to the number of input channels, $g = c_{in}$. In order for information to flow among groups, depthwise convolutions are usually paired with a 1×1 convolution and the combination is known as depthwise separable convolution, which was first introduced in (Sifre_2014) and termed in (mobilenetv1). From a workload perspective, depthwise separable convolutions yield a computation cost reduction of

(3)   $\frac{\mathrm{Cost}_{\mathrm{dwsep}}}{\mathrm{Cost}_{\mathrm{conv}}} = \frac{h \cdot w \cdot (k^{2} \cdot c_{in} + c_{in} \cdot c_{out})}{h \cdot w \cdot k^{2} \cdot c_{in} \cdot c_{out}} = \frac{1}{c_{out}} + \frac{1}{k^{2}}$
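To make the transformation concrete, the sketch below shows how a standard 3×3 convolution could be replaced by a depthwise separable equivalent in PyTorch. The module name and channel counts are illustrative placeholders and not taken from the MobiSR codebase.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Illustrative drop-in replacement for nn.Conv2d(c_in, c_out, 3, padding=1)."""
    def __init__(self, c_in, c_out, kernel_size=3):
        super().__init__()
        # Depthwise: one k x k filter per input channel (groups == c_in).
        self.depthwise = nn.Conv2d(c_in, c_in, kernel_size,
                                   padding=kernel_size // 2, groups=c_in)
        # Pointwise: 1 x 1 convolution mixes information across channels.
        self.pointwise = nn.Conv2d(c_in, c_out, kernel_size=1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

# Cost ratio vs. a standard convolution: 1/c_out + 1/k^2, as in Eq. (3).
block = DepthwiseSeparableConv(16, 16)
y = block(torch.randn(1, 16, 96, 96))  # e.g. a 96x96 training patch
```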

Separable Convolutions: This technique substitutes each $k \times k$ convolutional layer with a $k \times 1$ followed by a $1 \times k$ convolution, separating the convolution dimension-wise and resulting in a computation cost reduction of

(4)   $\frac{\mathrm{Cost}_{\mathrm{sep}}}{\mathrm{Cost}_{\mathrm{conv}}} = \frac{k \cdot c_{in} \cdot c_{out} + k \cdot c_{out} \cdot c_{out}}{k^{2} \cdot c_{in} \cdot c_{out}} = \frac{2}{k} \quad \text{(for } c_{in} = c_{out}\text{)}$

Inverted Residual Blocks: Inverted residual blocks expand the number of channels by an expansion factor $t$ by means of a 1×1 convolution, followed by a $k \times k$ convolution and another 1×1 convolution to recover the initial number of channels. This technique enables the use of skip connections directly on the bottleneck layers, resulting in an increase in computation cost, which is equal to that of Eq. (2) with $r = 1/t$, but also in performance. Due to the increase in workload, inverted residual blocks were used together with depthwise convolutions when first introduced in (mobilenetv2).

Channel Shuffle: Channel shuffling was introduced in (shufflenetv1) to improve representational capability by changing the order of the channels, allowing information flow among channel groups. Specifically, an output of a grouped convolutional layer, which has $g$ groups of $c$ channels each, is reshaped into $(g, c)$, transposed into $(c, g)$, and flattened back to the number of output channels, $g \cdot c$.
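As a concrete illustration, the reshape-transpose-flatten sequence described above can be written in a few lines of PyTorch; the function below is a generic sketch rather than MobiSR's implementation.

```python
import torch

def channel_shuffle(x: torch.Tensor, groups: int) -> torch.Tensor:
    """Reorder channels so that information can flow across channel groups."""
    n, c, h, w = x.shape
    assert c % groups == 0
    # (n, g, c/g, h, w) -> swap group and channel axes -> flatten back to c channels
    x = x.view(n, groups, c // groups, h, w)
    x = x.transpose(1, 2).contiguous()
    return x.view(n, c, h, w)

shuffled = channel_shuffle(torch.randn(1, 8, 4, 4), groups=2)
```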

Channel Split : The splitting of feature channels into branches is termed as a "channel split" in (shufflenetv2) and was introduced to improve processing speed. For instance, (shufflenetv2) uses channel splitting to split the number of channels into two branches. Convolutions are performed only on a single branch before both branches are concatenated, resulting in a reduction in workload.

Given these compression methods, we define the transformations set as follows:

(5)   $\mathcal{T} = \{\, t_{\mathrm{bneck}},\; t_{\mathrm{group}},\; t_{\mathrm{dwsep}},\; t_{\mathrm{sep}},\; t_{\mathrm{invres}},\; t_{\mathrm{shuffle}},\; t_{\mathrm{split}} \,\}$

To generate a new candidate model, we apply one transformation from the transformations set over the reference model:

(6)   $m_{\mathrm{new}} = t(m_{\mathrm{ref}}), \quad t \in \mathcal{T}$

Formally, we capture the configuration of a model by defining a tuple representation of $m$ and the overall model space by means of a model set (Eq. (7)) that contains all reachable candidate models.

(7)   $\mathcal{M} = \left\{\, m = \langle\, \mathrm{arch}_{\mathrm{ref}},\; \mathcal{T}_{m},\; W_{m} \,\rangle \;\middle|\; \mathcal{T}_{m} \subseteq \mathcal{T} \,\right\}$

where $\mathrm{arch}_{\mathrm{ref}}$ is the topology of the reference model, $\mathcal{T}_{m}$ is the subset of transformations that are applied on $\mathrm{arch}_{\mathrm{ref}}$ to obtain $m$, and $W_{m}$ are the learned parameters of $m$ after the training process.

Figure 3. Low-resolution images along with their TV and the PSNR achieved after 4× upscaling using our reference model, for images in the DIV2K training and validation dataset.
Figure 4. PSNR difference of 4× upscaling between our reference model and a more compact model.

3.3. Difficulty Evaluation Unit

To sustain the QoR within the tolerance bounds of the user while achieving higher processing speed, MobiSR exploits the fact that not all image patches have the same upscaling difficulty. To this end, the Difficulty Evaluation Unit (DEU) is responsible for examining each patch and determining its complexity. To estimate upscaling difficulty, we employ the total variation (TV) metric (total_variation_1992). Total variation captures the complexity of an image by examining its spatial variation, with its anisotropic version for a patch $p$ defined as:

(8)   $\mathrm{TV}(p) = \sum_{i,j} \left( \left| p_{i+1,j} - p_{i,j} \right| + \left| p_{i,j+1} - p_{i,j} \right| \right)$

where $p_{i,j}$ denotes the pixel intensity of the patch at position $(i, j)$.

Fig. 5 presents a visual comparison between two images with low (Fig. 5(a)) and high (Fig. 5(b)) TV values together with the associated PSNR achieved after 4× upscaling using our reference model. As illustrated in the figures, an image consisting of unstructured fine details and texture (Fig. 5(b)) has a higher TV and is harder to upscale compared to a highly structured or smoother image (Fig. 5(a)).

(a) 0301.png: PSNR: 45.38 / TV: 0.2e7
(b) 0063.png: PSNR: 20.62 / TV: 2.9e7
Figure 5. TV of low-resolution images from the DIV2K dataset and the PSNR achieved after 4× upscaling using our reference model.

To investigate the relationship between upscaling difficulty and TV of a given image in super-resolution settings, we examined the TV of each image in the DIV2K training and validation sets, together with the achieved PSNR obtained by our reference model, which is described in Section 4. As depicted in Fig. 3, images with higher TV values tend to yield lower PSNR and hence are harder to upscale, while lower-TV images tend to reach higher PSNR and thus are upscaled with higher quality.

Following these observations, we define an image patch $p$ as hard-to-upscale based on the following criterion:

(9)   $\mathrm{TV}(p) > \theta_{TV}$

where the TV threshold $\theta_{TV}$ is a tunable parameter whose value is automatically configured by MobiSR as discussed in Section 3.5.
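For illustration, the anisotropic total variation of Eq. (8) and the hard-to-upscale test of Eq. (9) can be computed with a few tensor operations. The function names and the example threshold below are placeholders, not values used by the framework.

```python
import torch

def calc_tv(patch: torch.Tensor) -> float:
    """Anisotropic total variation of a (C, H, W) low-resolution patch (Eq. (8))."""
    dh = (patch[:, 1:, :] - patch[:, :-1, :]).abs().sum()   # vertical differences
    dw = (patch[:, :, 1:] - patch[:, :, :-1]).abs().sum()   # horizontal differences
    return (dh + dw).item()

def is_hard_to_upscale(patch: torch.Tensor, theta_tv: float) -> bool:
    """Eq. (9): a patch is hard to upscale if its TV exceeds the tuned threshold."""
    return calc_tv(patch) > theta_tv

patch = torch.rand(3, 96, 96)          # example RGB patch
print(is_hard_to_upscale(patch, 1e3))  # theta_tv here is a placeholder value
```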

Upscaling-Difficulty-aware Scheduling. After computing the TV of an incoming patch, the DEU is responsible for dispatching it to the suitable model between the large and the compact model. The goal is to employ a scheduling strategy that will not exceed the user-specified error tolerance and will yield the lowest latency. To this end, we explore the behavior of the selected model pair on patches with varying TV values. Fig. 4 shows the PSNR difference between the large and the compact model as a function of the value of TV for the DIV2K training and validation set. As observed from the figure, patches that are harder to upscale based on the TV criterion (i.e. towards the right in Fig. 4) are almost equally hard for the two models. On the other hand, on easier-to-upscale patches, the larger model is able to achieve significantly higher PSNR. To exploit this property, an upscaling-difficulty-aware scheduling policy is proposed which directs easy-to-upscale patches to the larger model and hard-to-upscale patches to the more compact model. In this manner, higher-TV patches that are almost equally hard for both models are processed rapidly using the more compact model, with easier patches processed by the larger model to sustain the PSNR within the specified bounds.

Algorithm 1 presents the overall scheduling scheme. The large model is mapped on the CPU and GPU engines, with the compact model mapped on the available DSP. Instead of solely using the per-patch upscaling difficulty as a scheduling criterion, load balancing is also employed to keep the utilization of the available compute engines high. In this setting, Algorithm 1 takes as inputs the SR model pair, the estimated execution time for processing a patch with each model on each compute engine and the selected TV threshold $\theta_{TV}$. For each patch, the DEU first computes the associated TV value (line 3) and then dispatches the patch to the appropriate model-compute engine pair based on the total-variation criterion (line 4). In the case of an easy-to-upscale patch, the patch is allowed to be processed only by the large model and thus the DEU dispatches the patch to either the CPU or the GPU, aiming to balance the load of the two engines (lines 5-7). In the case of hard-to-upscale patches, the DEU allows the patch to be directed to the compact model, but also includes the large model as a candidate in order to avoid oversubscription of the compact model's compute engine. Since processing a patch with the large model does not degrade the resulting PSNR, hard patches are also allowed to be processed by the large model in case the DSP is overloaded (lines 8-10). On the other hand, easy patches are restricted to run using the large model in order to avoid a significant quality loss due to the compact model's compression.

The range of values of total variation tends to vary between different domains. To estimate the dynamic range of TV on the target domain, MobiSR employs a user-supplied calibration set consisting of a small number of input samples. Given a few patches, the dynamic range of TV for a given dataset is estimated in order to tune the domain-specific total-variation threshold, $\theta_{TV}$.

Input: Image $x$
                SR model pair (large model, compact model)
                Estimated per-patch execution time of each model-compute engine pair
                Total-variation threshold $\theta_{TV}$
1:  initialise the estimated end time of every compute engine to zero
2:  foreach patch $p$ in $x$ do
3:      tv ← CalcTV($p$)
4:      if tv ≤ $\theta_{TV}$ then                 // easy-to-upscale patch
5:          restrict the candidate engines to {CPU, GPU} (large model only)
6:          pick the candidate engine with the earliest estimated finish time for $p$
7:          dispatch $p$ to it and update that engine's end time
8:      else                                       // hard-to-upscale patch
9:          consider all engines {CPU, GPU, DSP} as candidates
10:         dispatch $p$ to the candidate engine with the earliest estimated finish time and update its end time
11:     end if
12: end foreach
Algorithm 1 Upscaling-difficulty-aware scheduling for parallel load-balanced on-device super-resolution

3.4. Performance Model

To efficiently explore different candidate designs without the need for implementations, a performance model is constructed that rapidly estimates a design’s latency. To formally capture the processing resources of the target mobile platform, we define a compute engine set, $CE$, which includes the compute engines that are available on the target chipset. In general, $CE$ can represent a diversity of mobile SoCs hosting heterogeneous compute engines, ranging from the ubiquitous mobile CPUs and GPUs to the newer emerging NPUs (ai_benchmark_2018). For instance, the Qualcomm Snapdragon 845 SoC (SDM845) is represented as $CE = \{\mathrm{CPU}, \mathrm{GPU}, \mathrm{DSP}\}$. With this formulation, given an SR model $m$ and a single compute engine $ce \in CE$, the execution time of upscaling an image $x$ using the pair $\langle m, ce \rangle$ is estimated as:

(10)   $L(x, m, ce) = |\mathrm{patches}(x)| \cdot l_{\langle m, ce \rangle}$

where $|\mathrm{patches}(x)|$ is the number of patches of image $x$ and $l_{\langle m, ce \rangle}$ is the execution time for a single patch when model $m$ is mapped on compute engine $ce$. The per-patch execution time is measured by the On-device Evaluator by means of a number of benchmark runs.

Following our difficulty-aware scheduling presented in Section 3.3, each model-compute engine pair processes only the samples that lie within its total-variation range, as determined by $\theta_{TV}$. To capture this strategy the execution time model is modified as follows:

$L(x, m, ce, \theta_{TV}) = \sum_{p \,\in\, \mathrm{patches}(x)} \mathbb{1}\left\{ p \text{ is dispatched to } \langle m, ce \rangle \text{ given } \theta_{TV} \right\} \cdot l_{\langle m, ce \rangle}$

where $\mathbb{1}\{\cdot\}$ is the unity function that evaluates to 1 when its bracketed condition is true. MobiSR distributes patches across the available engines in order to maximize the utilization of the on-chip compute resources and exploit the inherent parallelism across independent patches. Under this scheme, the overall latency of upscaling image $x$ using the selected models on the target SoC is estimated as in Eq. (11).

(11)   $L_{\mathrm{total}}(x) = \max_{ce \,\in\, CE} L\left(x, m_{ce}, ce, \theta_{TV}\right) + T_{\mathrm{assemble}}$

where the first term captures the parallel execution of patches across engines, $m_{ce}$ denotes the model mapped on compute engine $ce$, and $T_{\mathrm{assemble}}$ represents the overhead of assembling together the partial results of all patches to form the final high-resolution image.
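A minimal sketch of this analytical performance model is given below, assuming the per-patch latencies have already been measured by the On-device Evaluator. The model names, latency values and helper names are illustrative placeholders, not measurements or identifiers from the paper.

```python
# Measured per-patch latency (seconds) for each (model, compute engine) pair;
# the numbers below are placeholders, not real measurements.
per_patch_latency = {
    ("m_large", "cpu"): 0.045,
    ("m_large", "gpu"): 0.027,
    ("m_compact", "dsp"): 0.011,
}

def estimate_latency(patch_assignment, t_assemble=0.005):
    """Eq. (11): engines run in parallel, so the image latency is the slowest
    engine's accumulated time plus the cost of stitching the patches together.

    patch_assignment maps each (model, engine) pair to the number of patches
    it receives under the chosen TV threshold (the indicator sum of the
    modified per-pair execution time model)."""
    per_engine = {}
    for (model, engine), n_patches in patch_assignment.items():
        per_engine[engine] = per_engine.get(engine, 0.0) \
            + n_patches * per_patch_latency[(model, engine)]
    return max(per_engine.values()) + t_assemble

# Example: 60 easy patches split over CPU/GPU, 30 hard patches on the DSP.
print(estimate_latency({("m_large", "cpu"): 25,
                        ("m_large", "gpu"): 35,
                        ("m_compact", "dsp"): 30}))
```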

3.5. System Optimization

The developed framework aims to determine a pair of models together with a total-variation threshold that minimize the processing latency of performing on-device SR on the target mobile platform, given a user-supplied error tolerance. In this context, we pose the following optimization problem:

(12)   $\min_{\langle m_1, m_2, \theta_{TV} \rangle} \; L_{\mathrm{total}}\left(m_1, m_2, \theta_{TV}\right) \quad \text{s.t.} \quad \mathrm{PSNR}_{\mathrm{drop}}\left(m_1, m_2, \theta_{TV}\right) \leq \epsilon$

where $L_{\mathrm{total}}$, $\theta_{TV}$ and $\epsilon$ are the latency in s/input, the total-variation threshold and the user-specified error tolerance respectively. Under this formulation, the objective function aims to find the tuple $\langle m_1, m_2, \theta_{TV} \rangle$ that minimizes latency with a constraint on the degradation of QoR as captured by PSNR.

Given a reference SR model $m_{\mathrm{ref}}$, the optimization problem in Eq. (12) is defined over all candidate model pairs in the model space $\mathcal{M}$ presented in Section 3.2. Formally, we express this as the product $\mathcal{M} \times \mathcal{M}$. Furthermore, each pair can be deployed with a different TV threshold and therefore, given $N_{\theta}$ discrete candidate total-variation thresholds, the total number of alternative designs to be explored can be calculated as follows:

(13)   $|\Sigma| = |\mathcal{M}| \cdot |\mathcal{M}| \cdot N_{\theta}$

In this setup, the objective function can be evaluated for all candidate designs $\sigma \in \Sigma$ by means of the performance model of Section 3.4. In theory, the optimal design could be obtained by means of an exhaustive search with complete enumeration of all possible designs.

With latency and PSNR being a function of the TV of each patch of the processed image, evaluating latency and PSNR is data-dependent and hence requires running each possible design over a task-specific dataset to assess its attainable PSNR and latency. To avoid the overhead of an excessive number of evaluation runs, MobiSR employs two strategies for pruning the design space: 1) for each compute engine, we keep only the models that lie on the Pareto front of the PSNR-latency space. In this manner, models that are dominated with respect to their PSNR-latency balance on a given compute engine are discarded as inefficient; and 2) we impose the constraint that $m_2$ is equally or more compact than $m_1$. In this manner, we guide MobiSR to select two models with different PSNR-latency characteristics, in order to combine the high PSNR of $m_1$ with the fast processing of $m_2$.
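The Pareto-front pruning of step 1 can be expressed compactly, as in the sketch below, which keeps, per compute engine, only models for which no other model is both faster and of equal or higher PSNR. This is an illustrative implementation with made-up numbers, not the framework's actual code.

```python
def pareto_front(models):
    """models: list of (name, psnr, latency); keep only non-dominated entries."""
    kept = []
    for name, psnr, lat in models:
        dominated = any(p >= psnr and l <= lat and (p > psnr or l < lat)
                        for _, p, l in models)
        if not dominated:
            kept.append((name, psnr, lat))
    return kept

# Example with placeholder numbers: the middle model is dominated and discarded.
print(pareto_front([("a", 25.3, 1.2), ("b", 25.1, 1.3), ("c", 24.8, 0.9)]))
```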

After the pruning stage, MobiSR searches the remaining design space to determine the highest performing configuration of the tuple $\langle m_1, m_2, \theta_{TV} \rangle$. To enable fast and exhaustive exploration, the developed performance model of Section 3.4 is employed. For each $(m_1, m_2)$ pair, an analysis is initially performed over the user-supplied calibration set, yielding the achieved PSNR and latency for different values of $\theta_{TV}$. As a final step, MobiSR selects the fastest design that lies within the tolerated error of the target application.
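Putting the pieces together, the exhaustive search of Eq. (12) over the pruned space can be sketched as a loop over candidate pairs and discretized thresholds. The helper evaluate_design stands in for the calibration-set analysis and is an assumed placeholder that returns (psnr_drop, latency).

```python
def search_design_space(pareto_models, thresholds, epsilon, evaluate_design):
    """Return the fastest <m1, m2, theta_tv> whose PSNR drop is at most epsilon.

    pareto_models: Pareto-optimal candidates, ordered from largest to most compact.
    evaluate_design(m1, m2, theta): assumed helper returning (psnr_drop, latency)
    measured over the user-supplied calibration set."""
    best = None
    for i, m1 in enumerate(pareto_models):           # large model
        for m2 in pareto_models[i:]:                 # equally or more compact model
            for theta in thresholds:
                psnr_drop, latency = evaluate_design(m1, m2, theta)
                if psnr_drop <= epsilon and (best is None or latency < best[0]):
                    best = (latency, m1, m2, theta)
    return best
```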

Model | Params (K) | Latency (ms): CPU / GPU / DSP | PSNR/SSIM: Set5 / Set14 / B100 / Urban100
SRCNN (SRCNN) | 57 | 9742.97 / 584.83 / 656.44 | 30.47/0.8610, 27.57/0.7528, 26.89/0.7108, 24.51/0.7232
VDSR (VDSR) | 665 | 198027.52 / 7164.60 / 2623.61 | 31.53/0.8840, 28.42/0.7830, 27.29/0.7262, 25.18/0.7534
FEQE-P (Vu2018FastAE) | 96 | 2996.92 / 911.61 / 1475.45 | 31.53/0.8824, 28.21/0.7714, 27.32/0.7273, 25.32/0.7583
Reference model (ours) | 152 | 4570.08 / 2792.43 / 1220.00 | 31.73/0.8873, 28.24/0.7729, 27.33/0.7283, 25.34/0.761
Calculated using full 32-bit floating-point precision (FP32).
Table 1. Comparison of the reference model with state-of-the-art efficient SR models (4× upscaling).

4. Evaluation

This section presents the effectiveness of MobiSR in significantly improving the performance of on-device super-resolution by examining its core components and comparing it with current standard implementations and highly optimized difficulty-unaware designs.

4.1. Experimental Setup

In our experiments, we target Intrinsyc’s Open-Q 845 board mounting the Qualcomm Snapdragon 845 SoC (SDM845). SDM845 integrates an octa-core Kryo 385 CPU alongside an Adreno 630 mobile GPU and a Hexagon 685 DSP on the same chip. MobiSR can also target modern mobile chipsets equipped with CPU, GPU and NPU/DSP engines such as the Samsung Exynos 9820, Qualcomm Snapdragon 855 and Huawei Kirin 810 SoCs. All SR models were developed and trained using PyTorch (v1.0) and run on the Open-Q 845 board using the Snapdragon Neural Processing Engine (SNPE) SDK (v1.21, https://developer.qualcomm.com/software/qualcomm-neural-processing-sdk). SNPE allows targeting all three CPU, GPU and DSP engines of the SDM845 platform with highly optimized execution of CNN layers. The three compute engines employ different precision for data representation; namely the CPU, GPU and DSP use single-precision floating-point (FP32), half-precision floating-point (FP16) and 8-bit fixed-point (INT8) respectively for both storage and computation. All models that were run on the Hexagon DSP were first quantized offline to INT8 using linear quantization, with the per-layer scaling factors tuned based on the dynamic range of weights and activations on the DIV2K validation set.
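The offline INT8 quantization can be illustrated as follows, assuming simple asymmetric linear quantization with a per-layer scale and offset derived from the observed dynamic range. This is a generic sketch of the technique, not SNPE's internal procedure.

```python
import numpy as np

def linear_quantize(tensor: np.ndarray, num_bits: int = 8):
    """Map FP32 values to INT8 using the tensor's dynamic range (per layer)."""
    qmin, qmax = 0, 2 ** num_bits - 1
    t_min, t_max = float(tensor.min()), float(tensor.max())
    scale = (t_max - t_min) / (qmax - qmin) or 1.0  # avoid a zero scale
    q = np.clip(np.round((tensor - t_min) / scale), qmin, qmax).astype(np.uint8)
    return q, scale, t_min  # dequantize with: q * scale + t_min

weights = np.random.randn(64, 64, 3, 3).astype(np.float32)  # placeholder layer
q_weights, scale, zero_point = linear_quantize(weights)
```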

Datasets and Training Scheme. Following the common practice of the super-resolution community (EDSR; SRMDNF; RDN), all SR models were trained on the training set of the DIV2K dataset (DIV2K) and validated on its validation set, comprising 800 and 100 images of 2K resolution with diverse contents respectively. For the evaluation, four benchmark datasets were used which constitute the standard for assessing SR models in the super-resolution literature (DIV2K): Set5 (Set5) and Set14 (Set14) comprising five and fourteen images respectively that are commonly used across the image processing community, B100 (B100) with 100 images of real-life scenes and Urban100 (Urban100) consisting of 100 images depicting urban environments.

For the training of the SR models, we employ a similar scheme to the one used by (RCAN) and (EDSR). First, data augmentation was applied on the DIV2K training set by randomly flipping horizontally and rotating by 90°, and all images were normalized by subtracting the training set’s mean. Next, training was performed in 300 epochs using an Adam optimizer (Kingma_2014) with L1 loss. Each mini-batch consists of 16 RGB patches with input size of 96×96 for both 2× and 4× upscaling. The learning rate was halved after 200 epochs. Lastly, we train 2× models from scratch and use them as pre-trained models to train 4× models, confirming the findings of (EDSR) that using the weights of the 2× models as initial weight values for 4× models leads to faster training convergence.
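A compressed view of this training recipe is sketched below in PyTorch. The model, dataloader and starting learning rate are placeholders; only the elements stated above (Adam, L1 loss, 300 epochs, learning rate halved after 200 epochs) are taken from the text.

```python
import torch
import torch.nn as nn

def train_sr(model, train_loader, epochs=300, lr=1e-4, device="cuda"):
    """Illustrative training loop; lr=1e-4 is an assumed placeholder value."""
    model = model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    # Halve the learning rate after 200 epochs.
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=200, gamma=0.5)
    criterion = nn.L1Loss()
    for _ in range(epochs):
        for lr_patch, hr_patch in train_loader:  # 16 RGB 96x96 patches per batch
            lr_patch, hr_patch = lr_patch.to(device), hr_patch.to(device)
            optimizer.zero_grad()
            loss = criterion(model(lr_patch), hr_patch)
            loss.backward()
            optimizer.step()
        scheduler.step()
    return model
```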

Implementation Details. All SR designs presented in this section were run on SDM845 using the high-performance profile of the SNPE SDK that configures the hardware for maximum processing speed. All inputs to the SR models during on-device inference are partitioned into overlapping patches, with partial results stitched together at the end. The reported latency in all experiments is based on the average across 100 runs, with the latency measurements conducted using the SNPE’s timing utilities. The images in the aforementioned SR datasets have different sizes and therefore we report the average latency taken to upscale an image in the given dataset. For all other experiments, we assume a target high-resolution image with 720p resolution (1280×720).
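The patch partitioning step can be illustrated as below, assuming a simple grid of overlapping patches whose upscaled results are averaged where they overlap when stitched back together. Patch size, overlap and scale factor are placeholders, not the values used in the experiments.

```python
import torch

def extract_patches(img: torch.Tensor, patch: int = 64, overlap: int = 8):
    """Split a (C, H, W) image into overlapping patches; assumes H, W >= patch."""
    _, h, w = img.shape
    stride = patch - overlap
    tops = sorted(set(list(range(0, h - patch + 1, stride)) + [h - patch]))
    lefts = sorted(set(list(range(0, w - patch + 1, stride)) + [w - patch]))
    return [(img[:, t:t + patch, l:l + patch], (t, l)) for t in tops for l in lefts]

def stitch(upscaled, out_hw, patch: int = 64, scale: int = 4):
    """Paste upscaled patches back into place; overlapping pixels are averaged."""
    out = torch.zeros(3, *out_hw)
    weight = torch.zeros(1, *out_hw)
    for sr_patch, (t, l) in upscaled:
        ts, ls, ps = t * scale, l * scale, patch * scale
        out[:, ts:ts + ps, ls:ls + ps] += sr_patch
        weight[:, ts:ts + ps, ls:ls + ps] += 1
    return out / weight
```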

(a) RCAN 4× upsampling module
(b) Our 4× upsampling module
Figure 6. We reduce the number of feature maps in RCAN's upsampling module to improve processing speed.
Model | Params (K) | Latency (ms): CPU, GPU, DSP | Speedup: CPU, GPU, DSP | Error: CPU/GPU, DSP
152 4570.08 2792.43 1220.0 1.00 1.00 1.00 0.00% 0.00%
58 2694.94 2626.78 1508.56 1.69 1.06 0.80 2.84% 14.27%
22 1434.42 2657.49 1561.97 3.18 1.05 0.78 3.94% 22.39%
30 1850.82 10692.13 2508.63 2.46 0.26 0.48 3.70% 19.27%
30 1969.74 2723.66 1080.59 2.32 1.02 1.12 2.48% 4.15%
24 1284.21 2700.24 1398.407 3.55 1.03 0.87 3.19% 5.71%
88 4045.80 2846.70 1327.65 1.12 0.98 0.91 2.32% 2.11%
30 1910.74 2722.86 1061.38 2.39 1.02 1.14 1.93% 0.87%
13 1263.90 12367.24 3060.82 3.61 0.22 0.39 4.69% 26.1%
17 1023.26 2595.59 973.07 4.46 1.07 1.25 3.03% 3.07%
We obtained similar results on CPU with FP32 and GPU with FP16.
Table 2. Performance of our explored model space for 4× upscaling. Error drop is based on PSNR on Urban100.

4.2. Evaluation of Model Transformations

In MobiSR, the user supplies a reference model and the framework applies a series of model transformations to generate a set of compressed models. This set is then automatically pruned to remove suboptimal models, resulting in a list of Pareto-optimal candidate models to select from in order to produce the resulting two-model SR system. In our experiments, we exemplify this process by selecting a reference model, , that is comparable to the state-of-the-art models in the existing literature for mobile SR and then pass it through MobiSR.

Reference Model Selection. We adopted the residual channel attention network (RCAN) (RCAN) as our reference model, as RCAN yields the state-of-the-art performance based on PSNR/SSIM among large-scale SR models. In order for RCAN to be comparable to existing state-of-the-art mobile SR models, its architecture was modified by reducing the number of residual groups, the number of residual channel attention blocks, and the number of feature maps. Additionally, to further reduce the computational cost of the reference model, the number of feature maps in the upscaling module was reduced and the last convolutional layer was removed. Fig. 6 shows the slight change in the upsampling module between RCAN and our reference model.

Figure 7. PSNR vs CPU latency of MobiSR-generated models on SDM845 (4× upscaling on Urban100).

As shown on Table 1, by constructing a shallower variant of RCAN, we are able to achieve comparable results with state-of-the-art SR models that are hand-optimized for increased efficiency. Notably, our reference model manages to outperform the winning model of the 2018 PIRM Challenge (Ignatov_2018) on perceptual SR on mobile, FEQE, when run on the Hexagon DSP and achieves higher PSNR across all four SR datasets. Furthermore, it yields an average speedup of 16.01× (6.2× geo. mean) over VDSR with 4.3× fewer parameters and achieves an average PSNR improvement of 0.8 dB over the lightweight SRCNN. Regardless, MobiSR accepts any starting reference model and searches for a set of model transformations that will work best for that reference model on the given compute engines.

Figure 8. PSNR vs DSP latency of MobiSR-generated models on SDM845 (4× upscaling on Urban100).
Figure 9. Achieved PSNR and measured performance as a function of TV on SDM845.

Explored Model Space. Based on the findings of recently proposed high- and low-level vision models, we examined specific transformation combinations from the transformation set (detailed in Section 3.2) on our reference model by focusing on the ones that have demonstrated the highest effectiveness in the deep learning literature. Table 4 details the topologies of the MobiSR-generated compressed models together with the associated transformations that were applied over our reference model. Specifically, given the reference model, MobiSR replaces all 3×3 convolutional layers that lie in the core of the network, excluding the first layer and those within the upsampling block, using a subset of transformations. After this step, the obtained compressed models are retrained from scratch following the training scheme of Section 4.1.
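One way such a replacement could be applied programmatically is to walk the module tree and swap the core 3×3 convolutions for the chosen building block, leaving the first layer and the upsampling block untouched. The sketch below assumes the DepthwiseSeparableConv module shown in Section 3.2 and uses illustrative attribute names; it is not the framework's actual implementation.

```python
import torch.nn as nn

def apply_transformation(model: nn.Module, make_block, skip_names=("head", "upsample")):
    """Replace every core 3x3 nn.Conv2d with a compressed block built by make_block.
    skip_names lists (assumed) attribute names of modules to leave untouched."""
    for name, module in model.named_children():
        if name in skip_names:
            continue
        if isinstance(module, nn.Conv2d) and module.kernel_size == (3, 3):
            setattr(model, name, make_block(module.in_channels, module.out_channels))
        else:
            apply_transformation(module, make_block, skip_names)
    return model

# Example: compressed = apply_transformation(reference_model, DepthwiseSeparableConv)
```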

Table 2 lists the attainable latency of each model, the speedup over the reference model and the error with respect to PSNR, as obtained by MobiSR’s On-device and Image Quality Evaluator modules. Apart from the latency-PSNR trade-off of the generated compressed models, Table 2 also highlights the compatibility of each compute engine with the various transformations that have been applied. Different compute engines are more highly optimized for a different subset of transformations. For instance, the sole use of bottleneck residual blocks obtains a greater speedup on the CPU, but lower gains on the DSP, as compared to the sole use of depthwise separable convolutions. Furthermore, models that employ group convolutions yield worse latency when executed on the GPU due to suboptimal mapping. Additionally, the 8-bit quantization of the DSP had a severe impact on the representational capacity of compressed models that utilized the residual bottleneck blocks. As a result, selecting the highest performing set of compressed models is dependent on both the provided reference model and the available compute engines.

Overall, Fig. 7 and 8 depict the PSNR-latency space of the generated models on the CPU and DSP of the Qualcomm SDM845 respectively. In this case, the framework picked the same three models that lie on the Pareto fronts of all three compute engines.

Model Pair | Set5: Speedup range, Avg/Geo. Mean | Set14: Speedup range, Avg/Geo. Mean | B100: Speedup range, Avg/Geo. Mean | Urban100: Speedup range, Avg/Geo. Mean
Running on the CPU
(, ) 1.74-4.31 2.90/2.78 1.96-4.65 3.05/2.90 1.76-4.31 2.63/2.47 2.35-5.91 3.53/3.28
(, ) 1.74-4.91 3.12/2.94 1.97-5.30 3.26/3.05 1.76-4.91 2.79/2.58 2.35-6.18 3.62/3.34
(, ) 2.64-4.91 3.72/3.64 3.12-5.34 4.09/4.01 2.68-4.91 3.51/3.42 3.51-7.13 4.79/4.59
Running on the GPU
(, ) 1.06-2.63 1.78/1.70 1.20-2.84 1.86/1.77 1.07-2.63 1.61/1.51 1.44-3.61 2.16/2.01
(, ) 1.06-3.00 1.91/1.80 1.20-3.24 1.99/1.87 1.07-3.00 1.71/1.59 1.44-3.78 2.21/2.04
(, ) 1.61-3.00 2.27/2.23 1.91-3.26 2.50/2.45 1.64-3.00 2.15/2.09 2.14-4.36 2.93/2.80
Running on the CPU & GPU
(, ) 1.28-2.47 1.80/1.75 1.11-2.36 1.66/1.59 1.02-2.45 1.59/1.51 1.01-2.51 1.60/1.49
(, ) 1.30-2.82 1.95/1.87 1.11-2.69 1.79/1.69 1.02-2.79 1.71/1.59 1.01-2.62 1.64/1.52
(, ) 1.52-2.82 2.13/2.09 1.58-2.71 2.08/2.04 1.52-2.79 2.09/2.04 1.49-3.03 2.04/1.95
Table 3. Performance comparison of Pareto-optimal model pairs with the faithful reference model.

4.3. MobiSR PSNR and Performance vs TV

In this section, the PSNR and performance of the MobiSR-generated designs are evaluated as a function of the total-variation threshold. Given the three Pareto-optimal models from Section 4.2 and the pruning strategy that dictates that the compact model $m_2$ is equally or more compact than $m_1$, three model pairs were selected in the valid design space.

Fig. 9 shows the measured latency on SDM845 and the achieved PSNR across different TV thresholds for the four SR datasets. When $\theta_{TV}$ has substantially high values (towards the left hand side of the plots), the majority of incoming samples is processed by the large model on the CPU and GPU of SDM845. In this manner, PSNR remains high, but at the cost of increased latency due to the underutilization of the DSP. As $\theta_{TV}$ decreases from left to right, the three model pairs trade off a decreased PSNR for substantially reduced processing latency. Eventually, as $\theta_{TV}$ reaches very low values, the DEU relaxes the constraints and its scheduling policy reduces to a load balancing of the incoming samples across the three compute engines, without the need to exploit the upscaling difficulty of each sample. In this manner, the highest speedup is achieved for low $\theta_{TV}$ at the expense of a significant drop in the achieved PSNR.

Figure 10. MobiSR’s speedup as a function of error degradation.

Table 3 lists the speedup gains of the three model pairs over the execution of the reference model on the CPU, GPU, and both CPU and GPU. Since the reference model is mapped only on the CPU and/or GPU, no degradation of PSNR is induced due to the 8-bit operations of the DSP. Overall, the parametrization of the DEU based on $\theta_{TV}$ allows the tuning of the system at a fine granularity so that even a small increase in the application-level error tolerance can be capitalized as reduced processing latency.

Table 4. All 3×3 convolutional layers in the reference model are replaced with the corresponding building blocks, inspired by ResNet (He_2016), ResNeXt (Xie_2017), MobileNet (mobilenetv1), EffNet (effnet), MobileNetV2 (mobilenetv2), ClcNet (clcnet), ShuffleNet (shufflenetv1) and ShuffleNet V2 (shufflenetv2).

4.4. Evaluation of MobiSR Performance

This section presents the performance gains of MobiSR with respect to processing speed. This is investigated by comparing the generated two-model design for different PSNR drop values with a baseline single-model network. For each interval of PSNR drop, each MobiSR instance is compared with the fastest baseline single-model architecture that achieves the same or higher PSNR as the MobiSR system (shown on top of each plot in Fig. 10). The single-model baselines do not employ MobiSR’s TV-based scheduling; instead each model is allowed to run in one of two modes: i) either with load balancing across the CPU and GPU and no PSNR degradation or ii) with load balancing across CPU, GPU, and DSP with PSNR drop due to the DSP’s reduced precision. In this respect, the fastest single model that satisfies the PSNR drop constraint is selected at each PSNR drop interval. The overall measured runtime includes the DEU, processing all patches and the overhead of combining the partial results to construct the final high-resolution image.

Fig. 10 presents the achieved speedup across a wide range of PSNR tolerance values on the SDM845 platform when targeting the four benchmark datasets. When minimal to no PSNR drop is allowed (towards the left of Fig. 10), MobiSR selects a strict scheduling policy for the DEU with high TV values. In this manner, the large majority of patches are processed by on the CPU and GPU and the DSP remains underutilized, leading to minimal speedup. As more error is allowed, the proposed system outperforms the baseline by up to 47%, 78%, 94% and 29% for the same PSNR drop budget in Set5, Set14, B100 and Urban100 respectively.

Finally, in the case of high error tolerance, the speedup becomes less significant as uninformed load balancing using the fastest compressed model across the CPU, GPU, and DSP becomes the fastest design.

5. Related Work

The emergence of mobile image-centric applications has attracted the attention of the computer vision community, with efforts for alleviating the large compute demands of large-scale SR models. Addressing on-device SR from a model perspective, SRCNN (SRCNN) and the second-generation FSRCNN (FSRCNN) were first proposed as efficient neural architectures for SR, consisting of only three convolutional layers. By aiming to improve the image quality, the VDSR (VDSR) network employed a deeper design consisting of twenty layers. With a direction towards mobile settings, CARN-M (CARN) was proposed as a lightweight variant of the CARN architecture. Inspired by MobileNet (mobilenetv1), CARN-M employs recursive blocks and group convolutions to reduce the storage and compute requirements at inference time. In 2018, FEQE (Vu2018FastAE) introduced the desubpixel block, enabling a lossless downsampling at the start of the network, while reducing the computation cost throughout the rest of the network. The aforementioned works are primarily hand-tuned architectures with manually selected architectural choices aiming to reach a balance between image quality and computation cost. In this paper, we focus on the largely unexplored space of applying model transformations in a hardware-aware manner with the goal to tailor the generated system to both the application-level image quality requirements and the target mobile platform characteristics.

From a systems perspective, apart from task-agnostic frameworks for executing deep neural networks on mobile platforms (Huynh_2016; Oskouei_2016; deepx_2016; caffepresso_2017; Song_2018), research efforts have mainly focused in the direction of 1) cascade systems (mcdnn_2016; videostorm_2017; noscope_2017; focus_2018; shen_2017; cascadecnn_2018), 2) early-exit classifiers (branchynet_2016; msdnet_2018; overthink_2019) and 3) specialized accelerators for CNNs (Qiu_2016; Venieris_2018) and SR models (He_2018; fpga_sr_2018).

Cascade systems. Cascade systems base their operation on conditionally passing input samples through a pipeline of classifiers based on information obtained at each classification stage. VideoStorm (videostorm_2017), NoScope (noscope_2017), Focus (focus_2018) and Shen et al. (shen_2017) focus on the task of issuing queries on video databases. A common element between these systems and MobiSR is the use of multiple networks. However, a key differentiating factor in the model generation approach is that, by exploiting video-specific optimization opportunities, these systems train class-specialized models based on the object classes that appear more often in a given video stream. In contrast, the generative nature of the super-resolution task is not amenable to such an approach. Furthermore, at the run-time model selection stage, each stage of the cascade determines whether a particular object class is present or not, and if not, the input sample is propagated to the next classification stage. Contrary to this approach, MobiSR’s DEU determines which model to use based on the input image complexity and the current load of the available compute engines, without requiring information to be passed between models.

In a similar manner to MobiSR, MCDNN (mcdnn_2016) employs a form of model selection. However, while MCDNN focuses on run-time model selection and the partitioning of computation between cloud and device, MobiSR employs a difficulty-aware mechanism to exploit the heterogeneous compute engines that are available on-device and parallelizes both within an image (i.e. parallel processing of patches) and across images (i.e. pipelined execution as long as images are available). Moreover, while MCDNN aims to maximize the average accuracy of classification tasks, MobiSR sets a constraint on the PSNR drop and guarantees that the average PSNR will not be compromised below user-specified bounds.

From a target platform perspective, the aforementioned systems are optimized for cloud setups that have substantially different characteristics compared to MobiSR’s fully on-device system. In our case, the available compute engines share the same main memory, and in turn the same storage and bandwidth. This poses a significant additional challenge and calls for the mobile-specific methodology of MobiSR to develop and implement high-performing mobile designs.

Finally, the cascading approach of CascadeCNN (cascadecnn_2018) involves a two-model cascade with each classifier quantized at a different precision level. In this case, input samples are first processed rapidly by an aggressively quantized model. If the prediction confidence of a sample is below a tunable threshold, the input sample is passed to a higher-precision model for recomputation. Despite the fact that variable precision quantization could be integrated in the transformations set of MobiSR, CascadeCNN has so far been evaluated on FPGA-based platforms targeting image recognition tasks.

Early-exit classifiers. Designs such as BranchyNet (branchynet_2016), MSDNet (msdnet_2018) and Shallow-Deep Networks (overthink_2019) approach inference acceleration from an architectural aspect. First, they focus on classification rather than generative tasks. Secondly, they explicitly introduce early-exit outputs on a single model in order to adapt its workload-quality trade-off. Nevertheless, by exploiting the fact that different samples require different amounts of computation to yield a correct classification, such designs share a similar philosophy to our upscaling-difficulty-aware scheme. However, the criterion to capture an input sample’s difficulty is the prediction confidence at each early exit, which is relevant to classification tasks and differs from the upscaling-difficulty metric used by MobiSR for run-time model selection.

Hardware acceleration. Several works have explored the design of custom hardware architectures for the efficient execution of CNN workloads in resource- and power-constrained settings (Qiu_2016; Venieris_2018). With a focus on SR, He et al. (He_2018) proposed a highly optimized FPGA-based hardware accelerator tailored to the FSRCNN (FSRCNN) network. Furthermore, by adopting a hardware-software codesign methodology, Kim et al. (fpga_sr_2018) derived a CNN-based SR model and implemented it on an FPGA-based platform. Our work focuses on programmable mobile platforms which are more flexible and enable the efficient execution of SR models in a network-agnostic manner.

6. Conclusion

The MobiSR framework described in this paper uses several techniques to achieve high performance for fully on-device super-resolution. Through the generation of a two-model processing system tailored to the available compute engines of the target mobile platform, the proposed framework demonstrates significant speedup compared to single-model designs without penalizing the achieved image quality. By considering the user-specified error tolerance in the design space exploration phase and exploiting the heterogeneous compute engines of commodity mobile platforms, MobiSR is able to deliver high-speed SR on-device while meeting the application-level image quality requirements.

Furthermore, as the proposed methodology is parametrized to target any arbitrary mobile SoC with heterogeneous compute engines, using MobiSR to take advantage of newer emerging platforms that consist of neural accelerators can be a key enabler for efficient mobile super-resolution with potentially larger room for performance gains.

References
