Fast Sparse ConvNets
Historically, the pursuit of efficient inference has been one of the driving forces behind research into new deep learning architectures and building blocks. Some recent examples include the squeeze-and-excitation module, depthwise separable convolutions in Xception, and the inverted bottleneck in MobileNet v2. Notably, in all of these cases, the resulting building blocks enabled not only higher efficiency, but also higher accuracy, and found wide adoption in the field. In this work, we further expand the arsenal of efficient building blocks for neural network architectures; but instead of combining standard primitives (such as convolution), we advocate for the replacement of these dense primitives with their sparse counterparts. While the idea of using sparsity to decrease the parameter count is not new, the conventional wisdom is that this reduction in theoretical FLOPs does not translate into real-world efficiency gains. We aim to correct this misconception by introducing a family of efficient sparse kernels for ARM and WebAssembly, which we open-source for the benefit of the community as part of the XNNPACK library. Equipped with our efficient implementation of sparse primitives, we show that sparse versions of MobileNet v1, MobileNet v2 and EfficientNet architectures substantially outperform strong dense baselines on the efficiency-accuracy curve. On Snapdragon 835 our sparse networks outperform their dense equivalents by a margin equivalent to approximately one entire generation of MobileNet-family improvement. We hope that our findings will facilitate wider adoption of sparsity as a tool for creating efficient and accurate deep learning architectures.
Convolutional neural networks (CNNs) have proven to be excellent at solving a diverse range of tasks. Standard network architectures are used in classification, segmentation, object detection and generation tasks [32, 28, 48]. Given their wide utility, there has been significant effort to design efficient architectures that are capable of being run on mobile and other low-power devices while still achieving high classification accuracy on benchmarks such as ImageNet. For example, MobileNets [16, 36] employ the depthwise separable convolutions introduced by Sifre to significantly reduce resource requirements over previous architectures. Inference time, FLOPs and parameter counts in these architectures are dominated by the 1×1 convolutions, which directly map to matrix-matrix multiplications.
Weight sparsity is generally known to lead to theoretically smaller and more computationally efficient (in terms of number of floating-point operations) models, but it is often disregarded as a practical means of accelerating models because of the misconception that sparse operations cannot be fast enough to achieve actual speedups during inference. To address this misconception and firmly establish sparsity as a tool in the deep learning practitioner’s arsenal, we introduce fast kernels for sparse matrix-dense matrix multiplication (SpMM) specifically targeted at the acceleration of sparse neural networks. The main distinction of our SpMM kernel from prior art [31, 46] is that we focus on a different point in the design space. While prior work focused on extremely sparse problems (typically >99%, found in scientific and graph problems), we target the sparsity range of 70-95%, more common when inducing weight sparsity in neural networks. As a result, our kernels significantly outperform the kernels generated by the TACO compiler and the Intel MKL.
Using these kernels, we demonstrate the effectiveness of weight sparsity across three generations of MobileNet [16, 36, 41] architectures. Sparsity leads to an improvement of approximately one whole generation in each architecture, with sparse EfficientNets being significantly more efficient than all previous models. These models represent a new generation of efficient CNNs, which substantially reduce inference times, parameter counts, and floating-point operation (FLOP) counts relative to the previous generations.
2 Related Work
Improvements in convolutional network architectures [25, 38, 13, 18], as measured by increased classification accuracy on benchmark tasks such as ImageNet , have generally been concomitant with increases in model parameter counts, FLOPs and memory requirements. Recently this evolution has led to networks found through neural architecture search [51, 34] which can achieve over 82% top-1 accuracy, but require nearly 25 GFLOPs for one inference.
Given these prohibitive inference costs, there have been many lines of work attempting to improve CNN efficiency, which is often defined as one of three metrics:
Inference speedup on real hardware
Theoretical speedup through FLOPs reduction
Model size reduction
These axes are neither parallel nor orthogonal. The effect of (3) and (2) on (1) in particular can be quite complicated and highly varied depending on the hardware in question.
The MobileNet family of architectures [16, 36] has focused on improving efficiency by taking advantage of depthwise separable convolutions, which can be thought of as a hand-crafted sparsification of full convolutions with a predefined sparse topology, and which are responsible for the parameter efficiency of these architectures. MobileNet v1 (MBv1) used layers of depthwise convolutions followed by 1×1 convolutions. MobileNet v2 (MBv2) introduced the inverted residual block, which consists of a 1×1 convolution expanding the channel count, a depthwise convolution on the expanded channel count, and then a 1×1 convolution reducing the channel count. Across MobileNet architectures, the depthwise convolutions account for only a small fraction of the total FLOPs, parameters, and inference time of these models: in MBv1 they account for less than 2% of the total FLOPs, and in MBv2 less than 3%.
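This imbalance is easy to verify with a back-of-the-envelope FLOP count. The layer shape below is hypothetical but representative of a mid-network MobileNet block:

```python
# Illustrative FLOP count for one depthwise-separable block.
# (Hypothetical layer shape, not taken from the paper's tables.)
def conv_flops(h, w, kh, kw, c_in, c_out):
    """Multiply-accumulate count for a dense convolution."""
    return h * w * kh * kw * c_in * c_out

h = w = 14   # spatial resolution of the feature map
c = 512      # channel count

depthwise = conv_flops(h, w, 3, 3, 1, c)   # 3x3 depthwise: one filter per channel
pointwise = conv_flops(h, w, 1, 1, c, c)   # 1x1 (pointwise) convolution

frac = depthwise / (depthwise + pointwise)
print(f"depthwise share of FLOPs: {frac:.1%}")  # under 2%, as in MBv1
```

The pointwise term grows quadratically in the channel count while the depthwise term grows only linearly, which is why the 1×1 convolutions dominate.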
A different line of work attempted to make CNNs more efficient by directly pruning the weights of full convolutional filters, accompanied by the necessary inference kernels [33, 26]. The former was not able to accelerate convolutions, and the latter did not attempt it; the latter also required generating a new set of kernels for each instance of a model, which is often impractical for deployment. Due to the difficulty of accelerating sparse computation, channel pruning approaches have been preferred [11, 5, 30, 29, 42, 14]. These approaches prune away entire filters, leaving the final model dense, and function more as an architecture search over channel counts.
Alternatively, factorizations of the 1×1 convolutions have been considered in ShuffleNet and Learnable Butterfly Factorizations. ShuffleNet factorizes the weight matrix into a product of a permutation matrix and a block-diagonal matrix. Butterfly Factorizations factorize the weight matrix into a sequence of permutation matrices and weight matrices with special structure that can represent many common transforms, such as Fast Fourier Transforms.
Work in Text-to-Speech (TTS) demonstrated that increasing sparsity, with a concomitant increase in state size, leads to increased model quality in RNNs for a given non-zero parameter count. That work additionally demonstrated the fast block-sparse matrix-vector (SpMV) multiplication routines necessary for RNN inference.
To understand how to design the most efficient convolutional models, we investigate both how to construct and train sparse MBv1, MBv2 and EfficientNet models and the performance of our SpMM kernels.
3.1 Sparsifying Networks
We train on the ImageNet dataset with standard augmentation and report top-1 accuracies on the provided 50k-example validation set. To make the networks sparse we use gradual magnitude pruning.
We do not prune the first full convolution at the beginning of all three networks. Its overall contribution to the parameter count, FLOP count, and runtime is small and does not warrant introducing a new sparse operation. Instead, we implement a dense convolutional kernel which takes as input the image in the standard HWC layout and outputs the CHW layout consumed by the sparse operations in the rest of the network. In HWC layout, the values for different channels corresponding to one spatial location are adjacent in memory. In CHW layout, the values of all the spatial locations for one channel are adjacent in memory.
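The two layouts can be illustrated with a small NumPy sketch (shapes are arbitrary; this is not the kernel's actual packing code):

```python
import numpy as np

# HWC vs. CHW, as described above: the first dense convolution consumes
# the image in HWC order and emits CHW for the sparse layers.
h, w, c = 4, 4, 3
hwc = np.arange(h * w * c, dtype=np.float32).reshape(h, w, c)

# In HWC, the c channel values for one spatial location are adjacent in memory.
# Transposing to CHW makes all spatial locations of one channel contiguous.
chw = hwc.transpose(2, 0, 1)

# All spatial values of channel 0 form one contiguous plane in CHW:
assert chw[0].flatten().tolist() == hwc[:, :, 0].flatten().tolist()
```

Contiguous per-channel planes are what allow the sparse kernels to issue vector loads over spatial locations.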
We also do not prune the squeeze-excitation blocks in EfficientNet, as they contribute less than 1% of the dense model's total FLOPs. The last fully-connected layer in all models also contributes insignificantly (<1%) to the total FLOP count, but does contribute a significant fraction (20–50%) of total parameters, especially after the rest of the model is pruned. As we are concerned with maximizing top-1 accuracy for a given runtime, we do not prune the final layer in MobileNet v1 and v2, as doing so leads to a small decrease in top-1 accuracy. Standard EfficientNets do not scale the number of filters in the last convolution by the width of the model; however, we find that when introducing sparsity it is beneficial to do so, and in all sparse EfficientNet models we double the units from 1280 to 2560. We also find that the fully-connected layer in EfficientNet can be made sparse without loss of accuracy, so we do so.
3.2 Kernel Implementation
A diagram of the convolution as an SpMM is shown in figure 2. Our scheme requires activation tensors to be stored in CHW format, in contrast to dense mobile inference libraries [21, 7, 22], which favor HWC.
There are three key insights enabling the high performance of our kernels:
While the weight matrix is sparse, the activation matrix is dense. This means that we can perform vector loads from the activation matrix and process multiple spatial locations simultaneously.
By processing the matrix in the right order we can keep values that will be randomly accessed in the L1 cache, from which random access is fast and constant time.
When the number of input channels is small enough, prefetching from the activations can further reduce cache misses.
Figure 3 shows the memory read and write patterns of a few steps of the kernel. The figure shows 8 elements being processed together for visualization but 16 is more natural for the actual implementation as it corresponds to one cache line. The outer loop is over columns and the inner loop is over rows; this allows each strip of 16 spatial locations in the activations to remain in the L1 cache until it is no longer needed. In figure 3 steps 1 and 2 prime the cache, while subsequent steps 3 and 4 load all right hand side values from the L1 cache.
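The loop structure above can be sketched in NumPy. This is an illustrative model of the memory access pattern, not the actual ARM kernel; the CSR arrays `indptr`/`indices`/`values` stand in for the paper's BCSR-like format with block size 1:

```python
import numpy as np

STRIP = 16  # spatial locations processed together (64 bytes of float32: one cache line)

def spmm_csr(indptr, indices, values, acts, out):
    """Sparse weights (CSR) times dense activations (CHW: channels x pixels).

    Outer loop over columns (strips of pixels), inner loop over rows
    (output channels): each strip of activation rows stays hot in L1
    while every output channel that needs it is computed.
    """
    n_out = len(indptr) - 1
    for p0 in range(0, acts.shape[1], STRIP):   # outer loop: column strips
        p1 = min(p0 + STRIP, acts.shape[1])
        for row in range(n_out):                # inner loop: output channels
            acc = np.zeros(p1 - p0, dtype=np.float32)
            for j in range(indptr[row], indptr[row + 1]):
                # one scalar weight broadcast against a vector of pixels;
                # acts[indices[j]] is a random access into the cached strip
                acc += values[j] * acts[indices[j], p0:p1]
            out[row, p0:p1] = acc
```

The innermost line is the kernel's key property: the sparse (random) access hits the weight indices, while the activation access is a contiguous vector load.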
*This model is 80% sparse in all layers (except final fully-connected) and uses a block size of 1 everywhere.
This is the cache-aware MBv2 architecture described in section 4.4. It uses a block size of 1 throughout.
We also tried WebDNN, but it does not support some operations necessary to run MobileNet and EfficientNet models.
In addition to the vectorization in the spatial dimension, taking advantage of small amounts of structure in the weight matrix can offer significant performance boosts by increasing data reuse after values are loaded into registers. Constraining the sparsity pattern so that multiple output or input channels all share the same zero/non-zero pattern creates ‘blocks’ in the weight matrix (see figure 3, right). Blocks in the output channel dimension allow for more data reuse than blocks in the input channel dimension. Experiments (see figure 5) show that either choice has the same effect on accuracy, so we implement output channel blocking with sizes of 2 and 4. Our nomenclature for kernels gives the spatial vectorization width followed by the output channel block size: a 6×2 kernel processes 6 pixels and 2 output channels in the inner loop.
We will provide a library that can run sparse models trained with the model pruning library in TensorFlow. This includes conversion from a dense representation to a Block Compressed Sparse Row (BCSR)-like representation suitable for inference. In addition to the high performance convolutions, we also provide all supporting CHW kernels – depthwise convolutions, global average pooling and a stride-2 dense convolution that takes the standard HWC format as input and outputs CHW – necessary for running all three generations of models. While we provide high performance versions of these kernels, we do not detail them here; they are also available as part of XNNPACK. Times for these kernels are included in end-to-end measurements.
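To make the BCSR-like representation concrete, here is a minimal conversion sketch. This is a hypothetical helper, not XNNPACK's actual packing routine; it assumes each group of `block` consecutive output channels already shares one zero pattern, as produced by blocked pruning:

```python
import numpy as np

def dense_to_bcsr(w, block=2):
    """Convert a dense weight matrix whose rows come in groups of `block`
    output channels sharing one zero/non-zero pattern into a BCSR-like form.
    Assumes w.shape[0] % block == 0. (Sketch, not XNNPACK's packing code.)"""
    n_out, _ = w.shape
    indptr, indices, values = [0], [], []
    for r0 in range(0, n_out, block):
        # columns where any row in this output-channel block is non-zero
        cols = np.flatnonzero(np.any(w[r0:r0 + block] != 0, axis=0))
        indices.extend(cols.tolist())
        for c in cols:
            # the `block` values of one column are stored contiguously,
            # so one index serves `block` output channels
            values.extend(w[r0:r0 + block, c].tolist())
        indptr.append(len(indices))
    return indptr, indices, np.asarray(values, dtype=w.dtype)
```

Storing one column index per block is what gives blocked kernels their register-level data reuse: each loaded activation value is multiplied against `block` weights.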
In some settings it is useful to run inference (and even training) of deep neural networks directly in web browsers. This is now supported by multiple frameworks, including WebDNN, TensorFlow.js and WebML-polyfill. Frameworks generally support WebAssembly (Wasm) to run on CPUs or WebGL to run on GPUs. In this work we target WebAssembly backends and show that sparse vision networks significantly outperform their dense counterparts in this setting. Our kernel decomposition strategy is similar; however, due to the current lack of vectorization support in Wasm, the kernels are converted to scalar instructions. Due to the smaller register file, we find that unrolling by 8 instead of 16 is optimal. Scalar versions of all kernels are also provided as part of XNNPACK.
Due to space limitations, the main text includes plots only for MBv1 and MBv2. EfficientNets generally follow the same trends as MBv2 models; plots for EfficientNet can be found in the supplementary material. Table 1 contains the main results, comparing inference times for models achieving approximately the same top-1 accuracy.
To break down the end-to-end results, we first present performance results for our SpMM kernels, then show how the networks respond to sparsity, and finally how we combine this information to find the models with the lowest inference time.
4.1 ARM Kernel Performance
We use Ruy, the current TensorFlow Lite ARM64 backend written largely in hand-coded assembly, as the dense baseline. For a sparse baseline we use the kernel generated by the TACO compiler. We present results by plotting the FLOPs achieved at each layer in the model, with increasing depth to the right, in figure 4. For MBv1 we use a width multiplier of 1.4 and 90% sparsity, and for MBv2 a width multiplier of 1.4 and 85% sparsity, as these configurations approximately match the top-1 accuracy of the width-1.0 dense models. The kernel variants that process 16 spatial locations at a time are the highest performing, and all reported numbers are from these kernel variants; TACO results should be compared with the unblocked variants.
The raw performance of the sparse kernels falls in the range of 40–90% of the dense kernels. Because they must do much less work, however, the effective throughput, once the sparsity of the layer is taken into account, is 2–7× that of the dense kernels. In MBv1, performance falls significantly in the last two layers of the model, where the number of channels (1024) causes the size of one “strip” of spatial locations to exceed the size of the L1 cache. In MBv2, the sawtooth pattern is caused by the alternating expand and contract operations. Performance is higher for the expand kernels due to greater data reuse of each “strip” that is brought into the L1 cache. The performance of the contract operations drops significantly once the number of expanded channels in each residual block exceeds the size of the L1 cache; a smaller drop is observed when it exceeds half of the L1 cache, as this prevents effective prefetching.
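The relationship between raw and effective throughput is simple arithmetic, worth making explicit (the 0.45/90% example below is hypothetical but within the ranges quoted above):

```python
# Effective throughput of a sparse kernel in dense-equivalent FLOPs:
# it runs at some fraction of the dense kernel's raw GFLOP/s, but only
# performs (1 - sparsity) of the dense kernel's work.
def effective_speedup(raw_fraction, sparsity):
    """raw_fraction: sparse GFLOP/s divided by dense GFLOP/s."""
    return raw_fraction / (1.0 - sparsity)

# e.g. a 90%-sparse layer running at 45% of dense raw throughput
print(round(effective_speedup(0.45, 0.90), 2))  # prints 4.5
```

At the quoted 40–90% raw performance and 70–95% sparsity range, this ratio lands in the 2–7× band reported for the kernels.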
4.2 Model Performance
The hyper-parameters used to train MBv1 and MBv2 are listed in table 2. They were found with a grid search on dense models with a width multiplier of 1.0 to reproduce the original results, replacing the original RMSProp optimizer with SGD with momentum; the same hyper-parameters are used to train the sparse models. This change allows us to match or exceed the reported accuracies with only 38,000 iterations of training.
The hyper-parameters used to train EfficientNet are largely unmodified from the official code release, with the exception of extending training from 350 to 650 epochs and increasing the learning rate decay exponent from .97 to .985 so that the learning rate decays more slowly. These changes do not improve the dense baseline.
We induce sparsity in MBv1 and MBv2 by extending training (along with the learning rate anchor points) by a factor of four; we find this increases the performance of the sparse models, but not the baselines. Sparsification starts and stops at fixed iterations with a pruning frequency of 2,000. For EfficientNet we start at iteration 23,000 and end at iteration 105,000, also with a pruning frequency of 2,000. See the cited pruning work for the meaning of these hyper-parameters.
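As a sketch of what such a schedule computes, the cubic gradual-pruning schedule of Zhu & Gupta can be written as follows. The exact functional form is our assumption here; the paper defers the details to its citation:

```python
def sparsity_at_step(t, s_final, t_begin, t_end, freq, s_init=0.0):
    """Cubic gradual-pruning schedule in the style of Zhu & Gupta (2017).
    Sparsity ramps from s_init at t_begin to s_final at t_end, and is
    only updated every `freq` steps. (Sketch; form assumed, not verbatim.)"""
    if t < t_begin:
        return 0.0
    t_clip = min(t, t_end)
    # quantize to the most recent pruning step
    t_quant = t_begin + ((t_clip - t_begin) // freq) * freq
    frac = (t_quant - t_begin) / (t_end - t_begin)
    return s_final + (s_init - s_final) * (1.0 - frac) ** 3
```

With the EfficientNet settings above (begin 23,000, end 105,000, frequency 2,000), the schedule ramps smoothly from 0 to the target sparsity, pruning aggressively early and gently near the end.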
We train on the ImageNet dataset with standard data augmentation. Top-1 accuracies are reported on the validation set with single center crops.
To understand the effect of block size, in figure 5 we plot accuracy against FLOPs for different block sizes. In these plots, every sparse tensor in the network uses the same output channel block size. The trade-off for block sparsity appears to depend only on how many elements are in each block, not on their configuration: in MBv1, the curves for block shapes with the same number of elements all lie on top of one another. The loss in accuracy due to blocking seems to decrease slightly for larger-width models.
To understand how the sparsity level affects the efficiency of the models, we train models at 70%, 80% and 90% unstructured sparsity, held constant throughout the model. The results are plotted in figure 6. MBv1 and MBv2 become more efficient the sparser they are, confirming that the earlier results for RNNs hold for convolutional models as well.
In figure 1 we plot Top-1 accuracy vs. FLOPs for all three generations of sparse and dense models. MobileNet v1 is 90% sparse, the other models are 80% sparse. A sparse MBv1 exceeds MBv2 in terms of FLOP and parameter efficiency; a sparse MBv2 matches EfficientNet in terms of FLOP and parameter efficiency; and a sparse EfficientNet exceeds all other models in both categories.
4.3 Model Design for Block Size
To design the models with the best top-1 accuracy vs. inference time frontiers we make the following assumptions to reduce the search space:
We leave the models themselves unchanged.
We induce the same level of sparsity in all convolutions.
Then, for a model with N residual blocks, we do a search at width multiplier 1.4 over the models in which the first n residual blocks are unstructured and the last N − n residual blocks have an output channel block size of 4. We train each model, note its top-1 accuracy, and then measure its inference time. From this we calculate, relative to a fully unstructured model, the ratio of inference time reduction to top-1 accuracy lost; these ratios are plotted in figure 7. We choose the model with the highest ratio and train models at all widths with this choice. This amounts to making layers 6 and deeper blocked in MBv1 models and layers 11 and deeper blocked in MBv2.
4.4 Cache Aware Model Design
The sawtooth pattern in figure 4 is due to the large number of channels during the contract phase causing the size of a strip to exceed the size of the L1 cache. One possible solution, common with dense kernels, would be to split the matrix into N pieces such that each piece only accesses a number of channels that fits within the cache. However, this introduces additional complexity in the software – in the sparse case it would require repacking the matrix, which isn’t required in the dense case. As another possible solution, we examine a simple modification to the MBv2 architecture to make it cache aware.
The inverted residual block introduced by MBv2 expands the number of channels before the depthwise convolution and then reduces them afterwards. The expansion factor is fixed at 6 for all layers. Once there are more than 512 channels after expansion, the size of the 32 KB L1 cache is exceeded during contraction. Additionally, once there are more than 256 channels after expansion, the data exceeds half of the cache, precluding effective use of prefetching during contraction. To design an architecture that is aware of these constraints, we take MBv2 and increase the expansion factor of early layers while reducing the expansion factor as the number of channels increases, so that the number of expanded channels still fits within approximately half of the cache. To compensate for the decrease in capacity this would otherwise cause, we also increase the depth of the model by adding one more layer with 32 channels, two more with 64 channels, and three more each with 96 and 160 channels. The final architecture is in table 3.
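The 256- and 512-channel thresholds quoted above can be checked with a little cache arithmetic, assuming float32 activations and the 16-pixel strips described in section 3.2:

```python
# A "strip" is 16 spatial locations of float32 activations per channel.
L1_BYTES = 32 * 1024   # 32 KB L1 data cache
STRIP = 16             # spatial locations per strip
FLOAT = 4              # bytes per float32

def strip_bytes(channels):
    """Memory footprint of one strip across all channels."""
    return channels * STRIP * FLOAT

print(strip_bytes(512))  # 32768 bytes: exactly fills the L1 cache
print(strip_bytes(256))  # 16384 bytes: exactly half, the prefetch limit
```

This is why the cache-aware design caps the expanded channel count so a strip stays within roughly half of L1.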
4.5 Wallclock Times
Table 1 contains the timings for running our sparse models on a single big core of two different processors, a Snapdragon 835 and a Snapdragon 670. We compare them with MBv1 and MBv2 models from their official repositories [9, 10] run on the dense-inference TF Lite framework with the standard Ruy backend. (The dense EfficientNet b0 model we trained ourselves using the official code, with a slight modification to make exporting a TF-Lite model easier.) Model files for all sparse models in table 1 are available here.
We demonstrate that for a constant computational budget, sparse convolutional networks are more accurate than dense ones; this corroborates prior findings that for a set number of floating-point operations, sparse RNNs are more accurate than dense RNNs. We enable the use of weight sparsity to accelerate state-of-the-art convolutional networks by providing fast SpMM kernels along with all necessary supporting kernels for ARM processors. On a Snapdragon 835, the sparse networks we present in this paper outperform their dense equivalents in terms of wall-clock time for a given top-1 accuracy while needing only a fraction of the parameters – equivalent to approximately one entire generation of improvement. By overturning the misconception that “sparsity is slow”, we hope to open new avenues of research that would previously not be considered.
-  (2015) TensorFlow: large-scale machine learning on heterogeneous systems. Note: Software available from tensorflow.org External Links: Cited by: §3.3.
-  (2016) Applications of convolutional neural networks. International Journal of Computer Science and Information Technologies 7 (5), pp. 2206–2215. Cited by: §1.
-  (2017) A survey of model compression and acceleration for deep neural networks. arXiv preprint arXiv:1710.09282. Cited by: §1.
-  (2017-07) Xception: deep learning with depthwise separable convolutions. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Vol. , pp. 1800–1807. External Links: Cited by: Fast Sparse ConvNets.
-  (2018-10–15 Jul) Compressing neural networks using the variational information bottleneck. In Proceedings of the 35th International Conference on Machine Learning, J. Dy and A. Krause (Eds.), Proceedings of Machine Learning Research, Vol. 80, Stockholmsmässan, Stockholm Sweden, pp. 1135–1144. External Links: Cited by: §2.
-  (2019) Learning fast algorithms for linear transforms using butterfly factorizations. In Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, pp. 1517–1527. External Links: Cited by: §2.
-  (2019) QNNPACK: open source library for optimized mobile deep learning. External Links: Cited by: §3.2, §3.2.
-  (2019) XNNPACK. Google. External Links: Cited by: Fast Sparse ConvNets, §3.2, §3.3, §3.4.
-  (2018) External Links: Cited by: §4.5.
-  (2018) External Links: Cited by: §4.5.
-  (2018-06) MorphNet: fast simple resource-constrained structure learning of deep networks. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vol. , pp. 1586–1595. External Links: Cited by: §2.
-  (2017) Bringing the web up to speed with webassembly. In Proceedings of the 38th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI 2017, New York, NY, USA, pp. 185–200. External Links: Cited by: §3.4.
-  (2016) Deep Residual Learning for Image Recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pp. 770–778. Cited by: §2.
-  (2018) AMC: automl for model compression and acceleration on mobile devices. In Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part VII, pp. 815–832. Cited by: §2.
-  (2017) WebDNN: fastest dnn execution framework on web browser. In Proceedings of the 25th ACM international conference on Multimedia, pp. 1213–1216. Cited by: §3.4, Table 1.
-  (2017) MobileNets: efficient convolutional neural networks for mobile vision applications. CoRR abs/1704.04861. External Links: Cited by: §1, §1, §2.
-  (2018-06) Squeeze-and-excitation networks. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vol. , pp. 7132–7141. External Links: Cited by: Fast Sparse ConvNets, §3.1.
-  (2017-07) Densely connected convolutional networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Vol. , pp. 2261–2269. External Links: Cited by: §2.
-  (2009) Intel math kernel library. reference manual. Intel Corporation, Santa Clara, USA. Note: ISBN 630813-054US Cited by: §1.
-  (2019) WebML-polyfill. External Links: Cited by: §3.4, Table 1.
-  (2017) Gemmlowp: a small self-contained low-precision gemm library. External Links: Cited by: §3.2, §3.2.
-  (2019) RUY. External Links: Cited by: §3.2, §3.2, §4.1.
-  (2018) Efficient Neural Audio Synthesis. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, pp. 2415–2424. Cited by: §2, §4.2, §5.
-  (2017) The tensor algebra compiler. Proceedings of the ACM on Programming Languages 1 (OOPSLA), pp. 77. Cited by: §1, §4.1.
-  (2012) ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger (Eds.), pp. 1097–1105. External Links: Cited by: §2.
-  (2015) Sparse convolutional neural networks. 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 806–814. Cited by: §2.
-  (2019) DARTS: differentiable architecture search. In International Conference on Learning Representations, External Links: Cited by: §4.5.
-  (2015) Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3431–3440. Cited by: §1.
-  (2018) Learning sparse neural networks through l0 regularization. In International Conference on Learning Representations, External Links: Cited by: §2.
-  (2017) ThiNet: A Filter Level Pruning Method for Deep Neural Network Compression. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, pp. 5068–5076. Cited by: §2.
-  (2018) High-performance sparse matrix-matrix products on intel knl and multicore architectures. In Proceedings of the 47th International Conference on Parallel Processing Companion, ICPP ’18, New York, NY, USA, pp. 34:1–34:10. External Links: Cited by: §1.
-  (2019) Recent progress on generative adversarial networks (gans): a survey. IEEE Access 7 (), pp. 36322–36333. External Links: Cited by: §1.
-  (2016) Holistic sparsecnn: forging the trident of accuracy, speed, and size. CoRR abs/1608.01409. External Links: Cited by: §2.
-  (2019) Regularized evolution for image classifier architecture search. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 4780–4789. Cited by: §2.
-  (2015-12) ImageNet large scale visual recognition challenge. Int. J. Comput. Vision 115 (3), pp. 211–252. External Links: Cited by: §1, §2, §3.1, §4.2.
-  (2018-06) MobileNetV2: inverted residuals and linear bottlenecks. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vol. , pp. 4510–4520. External Links: Cited by: Fast Sparse ConvNets, §1, §1, §2.
-  (2014) Ecole polytechnique, CMAP phd thesis rigid-motion scattering for image classification. Cited by: §1, §2.
-  (2015) Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations, Cited by: §2.
-  (2019) TensorFlow.js: machine learning for the web and beyond. In SysML Conference, Palo Alto, CA, USA. External Links: Cited by: §3.4.
-  (2019) MnasNet: platform-aware neural architecture search for mobile. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019, pp. 2820–2828. External Links: Cited by: §2.
-  (2019) EfficientNet: rethinking model scaling for convolutional neural networks. In Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, pp. 6105–6114. External Links: Cited by: §1, §2.
-  (2018) Faster gaze prediction with dense networks and Fisher pruning. CoRR abs/1801.05787. External Links: Cited by: §2.
-  (1995) Evaluating pruning methods. In National Chiao-Tung University, pp. 2. Cited by: Fast Sparse ConvNets.
-  (2012) Lecture 6.5—RmsProp: Divide the gradient by a running average of its recent magnitude. Note: COURSERA: Neural Networks for Machine Learning Cited by: §4.2.
-  (2019-06) FBNet: hardware-aware efficient convnet design via differentiable neural architecture search. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
-  (2018-08) Design principles for sparse matrix multiplication on the gpu. International European Conference on Parallel and Distributed Computing - EURO-PAR. Cited by: §1.
-  (2017) ShuffleNet: an extremely efficient convolutional neural network for mobile devices. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6848–6856. Cited by: §2.
-  (2019) Object detection with deep learning: a review. IEEE transactions on neural networks and learning systems. Cited by: §1.
-  (2018) To prune, or not to prune: exploring the efficacy of pruning for model compression. In International Conference on Learning Representations, Cited by: §3.1, §4.2.
-  (2017) Neural architecture search with reinforcement learning. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings, External Links: Cited by: §4.5.
-  (2017) Learning transferable architectures for scalable image recognition. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8697–8710. Cited by: §2.
Appendix A Comparison with Intel MKL
We also implement our scheme with AVX512 intrinsics to compare with the Intel MKL. With minimal tuning we find that it achieves a geometric mean speedup of 1.2× across all layers of both MBv1 and MBv2 compared with the MKL. Results can be found in figure 9.
Appendix B Non-Uniform Layerwise Sparsity with Variational Dropout
Variational Dropout (VD; see Molchanov et al. 2017) is a Bayesian technique for inducing weight (or activation) sparsity in neural networks. It does not outperform magnitude pruning (see Gale et al. 2019), but it prunes globally, so it can induce non-uniform distributions of sparsity across layers without manual intervention. Parameter count and FLOP count do not have a 1:1 relationship in convolutional models: removing parameters from early layers, where the spatial dimensions are large, reduces the FLOP count by significantly more than removing parameters from later layers, where they are small. Combined with the fact that later layers tend to have many more parameters than early ones, this means that global approaches such as VD, which operate under a parameter constraint, find models with many more FLOPs than techniques such as magnitude pruning, which is generally used to induce near-uniform sparsity, even when both models have the same number of parameters.
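The disparity is easy to quantify: a convolution's FLOP count is its parameter count multiplied by the output resolution, so each parameter pruned early is "worth" far more FLOPs than one pruned late. The layer shapes below are hypothetical:

```python
# FLOPs-per-parameter for a 1x1 convolution at two depths in a network.
def conv_params(k, c_in, c_out):
    return k * k * c_in * c_out

def conv_flops(h, w, k, c_in, c_out):
    # each parameter participates once per output spatial location
    return h * w * conv_params(k, c_in, c_out)

# hypothetical early layer (high resolution, few channels)
early = conv_flops(112, 112, 1, 32, 64) / conv_params(1, 32, 64)
# hypothetical late layer (low resolution, many channels)
late = conv_flops(7, 7, 1, 512, 1024) / conv_params(1, 512, 1024)

print(early, late)  # prints 12544.0 49.0
```

The ratio of FLOPs per parameter is just the ratio of output resolutions (here 112² vs. 7², a factor of 256), which is why a parameter-constrained global pruner drifts toward high-FLOP models.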
An additional practical difficulty with using VD to prune models is that one cannot specify the desired final sparsity directly. Instead, one must tune the weight of the KL penalty term so that, after training and thresholding, the resulting model has the desired sparsity. This makes finding a model with a given sparsity a time-consuming process.
Despite these limitations, we nonetheless find it insightful to examine the pattern of layer-wise sparsity found by using VD to prune MBv1, MBv2 and EfficientNet, with the hope of applying some of this knowledge when using magnitude pruning to find even more efficient models.
Results for MBv1 can be found in figure 8 and results for MBv2 and EfficientNet in figure 10. As expected under a global parameter constraint, the layer-wise sparsity increases with depth as the number of parameters per layer grows. Interestingly, there is a trend, especially evident in MBv2 and EfficientNet models, that layers containing a spatial downsampling and concomitant channel increase should be less sparse than those before or after. We were unable to exploit this observation through hand tuning to find architectures more efficient than those with uniform sparsity, but it may provide a useful prior for a full architecture search.
Appendix C EfficientNet Plots
Figure 11 contains plots showing how EfficientNet behaves in the presence of blocks and for varying levels of sparsity. It generally follows the same trends as the MobileNet models in the main text; the difference is that sparsities above 80% do not appear to lead to more efficient models.