EBPC: Extended Bit-Plane Compression for Deep Neural Network Inference and Training Accelerators
In the wake of the success of convolutional neural networks in image classification, object recognition, speech recognition, etc., the demand for deploying these compute-intensive ML models on embedded and mobile systems with tight power and energy constraints at low cost, as well as for boosting throughput in data centers, is growing rapidly. This has sparked a surge of research into specialized hardware accelerators. Their performance is typically limited by I/O bandwidth, power consumption is dominated by I/O transfers to off-chip memory, and on-chip memories occupy a large part of the silicon area.
We introduce and evaluate a novel, hardware-friendly and lossless compression scheme for the feature maps present within convolutional neural networks. Its hardware implementation fits into 2.8 kGE and 1.7 kGE of silicon area for the compressor and decompressor, respectively. We show that an average compression ratio of 5.1× for AlexNet, 4× for VGG-16, 2.4× for ResNet-34 and 2.2× for MobileNetV2 can be achieved—a gain of 45–70% over existing methods. Our approach also works effectively for various number formats, has a low frame-to-frame variance on the compression ratio, and achieves compression factors for gradient map compression during training that are even better than for inference.
Computer vision has evolved into a key component for automating data analysis over a wide range of field applications: medical diagnostics, industrial quality assurance, video surveillance, advanced driver assistance systems and many more. A large number of these applications have only emerged recently due to the tremendous accuracy improvements—even beyond human capabilities—associated with the advent of deep learning and in particular convolutional neural networks (CNNs, ConvNets).
Even though CNN-based solutions often require considerable compute resources, many of these applications have to run in real-time and on embedded and mobile systems. As a result, purpose-built platforms, application-specific hardware accelerators, and optimized algorithms have been engineered to reduce the amount of arithmetic operations and their precision requirements [6, 7, 8, 9, 10, 11, 12, 13, 14, 15].
Examining these hardware platforms, the amount of energy required to load and store intermediate results/feature maps (and gradient maps during training) in the off-chip memory is not only significant, but typically dominates the energy consumed by computation and on-chip data buffering. This energy bottleneck is even more remarkable when considering networks that are engineered to reduce compute energy by quantizing weights to one or two bits or power-of-two values, dispensing with the need for high-precision multiplications and significantly reducing weight storage requirements [16, 17, 18, 19, 20].
Many compression methods for CNNs have been proposed over the last few years. However, many of them rely on very complex methods requiring large dictionaries, or are otherwise not suitable for a small, energy-efficient hardware implementation—often because they target efficient distribution and storage of trained models on mobile devices, or the transmission of intermediate feature maps from/to mobile devices over a costly communication link.
In contrast to these, the focus of this paper is on reducing the energy consumption of hardware accelerators for CNN inference and training by cutting down on the dominant power contributor—I/O transfers. These data transfers to and from off-chip memory consist of the network parameters (read-only) and the feature maps (read/write).
| Network | Accuracy | Dataset | Resolution¹ | #MACs | #params | #FM values² | I/O ratio FM/params³ (TWN inf. / FP inf. / training) |
|---|---|---|---|---|---|---|---|
| ResNet-50 | 77.2 / 93.3 | ILSVRC12 | 224×224 | 4.1 G | 25.6 M | 11.1 M | 9.1 / 0.9 / 27.8 |
| DenseNet-121 | 76.4 / 93.3 | ILSVRC12 | 224×224 | 2.9 G | 8.0 M | 6.9 M | 17.2 / 1.7 / 55.3 |
| SqueezeNet | 57.5 / 80.3 | ILSVRC12 | 224×224 | 355.9 M | 1.2 M | 2.6 M | 42.4 / 4.2 / 134.1 |
| ShuffleNetV2 | 69.4 / — | ILSVRC12 | 224×224 | 150.6 M | 2.3 M | 2.0 M | 17.2 / 1.7 / 54.8 |
| MobileNetV2 | 72.0 / — | ILSVRC12 | 224×224 | 320.2 M | 3.5 M | 6.7 M | 38.4 / 3.8 / 121.9 |
| MnasNet | 75.6 / 92.7 | ILSVRC12 | 224×224 | 330.2 M | 4.4 M | 5.5 M | 25.2 / 2.5 / 79.7 |
| YOLOv3 | 57.9% AP | COCO-det. | 480×640 | 9.5 G | 61.6 M | 68.4 M | 22.2 / 2.2 / 71.1 |
| YOLOv3-tiny | 33.1% AP | COCO-det. | 480×640 | 800.0 M | 8.7 M | 10.7 M | 24.2 / 2.4 / 78.4 |
| OpenPose | 65.3% mAP | COCO-keyp. | 480×640 | 50.4 G | 52.3 M | 132.5 M | 51.5 / 5.1 / 162.1 |
| MultiPoseNet-50 | 64.3% mAP | COCO-keyp. | 480×640 | 13.3 G | 36.7 M | 96.0 M | 52.5 / 5.2 / 167.4 |
| MultiPoseNet-101 | 62.3% mAP | COCO-keyp. | 480×640 | 16.8 G | 55.6 M | 119.9 M | 43.4 / 4.3 / 138.0 |
¹ This resolution is used to determine the number of multiply-accumulate operations and feature map values. For the detection CNNs, this differs from the one used during training.
² We count the number of feature map values wherever they are activated (e.g. by a ReLU layer).
³ Feature map values are assumed to be 16 bit and counted twice since they are written and read (most HW accelerators require multiple reads, though). Modes — TWN inference: batch size 1, ternary weights; full-precision (FP) inference: batch size 1, 16 bit weights; training: batch size 32, 16 bit weights.
The latter is the larger contributor to the overall transfers, as highlighted in Table I, and previous work has already shown that the parameters can be compressed and/or quantized even to ternary representations (1.58 bit) with negligible accuracy loss. This further highlights the need for feature map compression: the feature maps now outweigh the parameters by 20–50× (16 bit vs. 1.58 bit and 2–5× more values), and the energy share spent on feature map I/O and buffering grows even more dominant with the simpler arithmetic operations [17, 20, 16, 37] and further work on model compression.
In this paper, we propose and evaluate a simple compression scheme for intermediate feature maps that exploits sparsity as well as the distribution of the remaining values. It is suitable for a very small and energy-efficient hardware implementation (<300 bit of registers) and could be inserted as a stream (de-)compressor before/after a DMA controller, compressing the data by 4.4× for 8 bit AlexNet.
We extend our previous work and make the following main contributions:
A comparison of state-of-the-art DNNs regarding bandwidth and/or memory size requirements for parameters and feature maps, showing the relevance of feature map compression;
An in-depth analysis of the feature and gradient maps’ properties indicating compressibility;
The proposal of a novel, state-of-the-art and hardware-friendly compression scheme;
A thorough evaluation of its capabilities for inference as well as training (compressing gradient maps);
A hardware architecture for the compressor and decompressor and a detailed analysis of its implementation in 65 nm CMOS technology.
II Related Work
Several hardware accelerators exploit feature map sparsity to reduce computation: Cnvlutin, SCNN, Cambricon-X, NullHop, Eyeriss, EIE. Their focus is on power-gating or skipping some of the operations and memory accesses. This entails defining a scheme to feed the data into the system. They all use one of three methods:
Zero-RLE (used in SCNN): A simple run-length encoding for the zero values, i.e. a single prefix bit followed by either the number of zero values or the non-zero value.
Zero-free neuron array format (ZFNAf) (used in Cnvlutin): Similarly to the widely-used compressed sparse row (CSR) format, non-zero elements are encoded with an offset and their value.
Compressed column storage (CCS) format (e.g. used in EIE, and similar to NullHop): Similar to ZFNAf, but the offsets are stored in relative form, thus requiring fewer bits to store them. Few bits are sufficient, and in case the offset range is exhausted, a zero-value can be encoded as if it were non-zero.
Other compression methods focus on minimizing the model size and are very complex (in silicon area) to implement in hardware. One such method, deep compression, combines pruning, trained clustering-based quantization, and Huffman coding. Most of these steps cannot be applied to the intermediate feature maps, which change for every inference, as opposed to the weights, which are static and can be optimized off-line. Furthermore, applying Huffman coding—while being optimal in terms of compression rate given a specification of input symbols and their statistics—implies storing a very large dictionary: encoding a 16 bit word requires a table with 2^16 (64k) entries, but effectively multiple values would have to be encoded jointly in order to exploit their joint distribution (e.g. the smoothness), immediately increasing the dictionary size to 2^32 (4G) entries even for just two values. Similar issues arise when using Lempel-Ziv-Welch (LZW) coding [39, 40] as present in e.g. the ZIP compression scheme, where the dictionary is encoded in the compressed data stream. This makes these methods unsuitable for a lightweight and energy-efficient VLSI implementation [41, 42].
A few more methods compress the weights [21, 22] or the feature maps [26, 27, 43] by changing the CNN's structure. However, they require altering the CNN's model and/or retraining, and they introduce some accuracy loss. Furthermore, they can only be used to compress a few feature maps at specific points within the network and introduce additional compute effort, such as applying a Fourier transform to the feature maps.
The most directly comparable approach, cDMA, describes a hardware-friendly compression scheme to reduce the data size of intermediate feature maps. Their target application differs in that their main goal is to allow faster temporary offloading of the feature maps from GPU to CPU memory through the PCIe bandwidth bottleneck during training, thereby enabling larger batch sizes and deeper and wider networks without sacrificing performance. They propose zero-value compression (ZVC), which takes a block of 32 activation values and generates a 32-bit mask where only the bits corresponding to the non-zero values are set. The non-zero values are transferred after the mask. The main advantage over Zero-RLE is that the resulting data volume is independent of how the values of the feature maps are serialized, while also providing small compression ratio advantages. Note that this is a special case of Zero-RLE with a maximum zero burst length of 1.
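The mask-plus-values format of ZVC is simple enough to sketch in a few lines. The following Python model is purely illustrative; the function names are ours, not from the cDMA paper:

```python
# Hedged sketch of zero-value compression (ZVC): a block of 32
# activation values becomes a 32-bit mask plus the non-zero values.

def zvc_encode(block):
    """Encode a block of 32 values as (mask, nonzeros)."""
    assert len(block) == 32
    mask = 0
    nonzeros = []
    for i, v in enumerate(block):
        if v != 0:
            mask |= 1 << i          # set the bit of each non-zero position
            nonzeros.append(v)
    return mask, nonzeros

def zvc_decode(mask, nonzeros):
    """Invert zvc_encode."""
    it = iter(nonzeros)
    return [next(it) if (mask >> i) & 1 else 0 for i in range(32)]

block = [0, 5, 0, 0, 7] + [0] * 27
mask, nz = zvc_encode(block)
assert zvc_decode(mask, nz) == block
```

Note how the mask cost is fixed at 1 bit per value regardless of how the tensor is serialized, which is exactly the layout-independence property mentioned above.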
For this work, we build on a method known from texture compression for GPUs, bit-plane compression (BPC), fuse it with sparsity-focused compression methods, and evaluate the resulting compression algorithm on intermediate feature maps and gradient maps, showing compression ratios of 5.1× (8 bit AlexNet), 4× (VGG-16), 2.4× (ResNet-34), 2.8× (SqueezeNet), and 2.2× (MobileNetV2).
III Compression Algorithm

The proposed scheme exploits two properties of CNN feature maps:
Sparsity: The value stream is decomposed into a zero/non-zero stream on which we apply run-length encoding to compress the zero bursts commonly occurring in the data.
Smoothness: Spatially neighboring values are typically highly correlated. We thus compress the non-zero values using bit-plane compression. The latter compresses a fixed number of words jointly, and the resulting compressed bit-stream is injected into the output stream once a full block of non-zero values has been compressed.
The resulting algorithm can be viewed as an extension to bit-plane compression to better exploit the sparsity present in most feature maps.
III-A Zero/Non-Zero Encoding with RLE
The run-length encoder simply compresses bursts of 0s with a single 0 followed by a fixed number of bits which encode the burst length. Non-zero values, at this point 1-bits, are not run-length encoded, i.e. for each of them a 1 is emitted. If the length of a zero burst exceeds the maximum encodable burst length, the maximum is encoded and the remaining zeros are encoded independently, i.e. in the next code symbol.
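The encoding above can be sketched as follows. This is an illustrative Python model, not the hardware implementation; `BURST_BITS` and the choice of encoding a burst of length L as the field value L−1 are our assumptions:

```python
# Sketch of the zero/non-zero run-length encoder described above.
# A zero burst of length L (1 <= L <= MAX_BURST) is encoded as a '0'
# prefix followed by BURST_BITS bits holding L - 1 (assumed encoding).

BURST_BITS = 4
MAX_BURST = 2 ** BURST_BITS

def zero_rle_encode(values):
    """Return the zero/non-zero stream as a string of bits."""
    bits = []
    run = 0
    def flush():
        nonlocal run
        while run > 0:                      # split over-long bursts
            chunk = min(run, MAX_BURST)
            bits.append('0' + format(chunk - 1, f'0{BURST_BITS}b'))
            run -= chunk
    for v in values:
        if v == 0:
            run += 1
        else:
            flush()
            bits.append('1')                # one '1' per non-zero value
    flush()
    return ''.join(bits)

def zero_rle_decode(bits, nonzero_values):
    """Invert zero_rle_encode, re-inserting the non-zero values."""
    out, i, it = [], 0, iter(nonzero_values)
    while i < len(bits):
        if bits[i] == '1':
            out.append(next(it)); i += 1
        else:
            n = int(bits[i + 1:i + 1 + BURST_BITS], 2) + 1
            out.extend([0] * n); i += 1 + BURST_BITS
    return out
```

For example, `[0, 0, 0, 5]` encodes to `'0' + '0010' + '1'`: one 5-bit symbol for the burst of three zeros, then a single `1` for the non-zero value.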
III-B Bit-Plane Compression
An overview of the bit-plane compressor (BPC) used to compress the non-zero values is shown in Fig. 1. For BPC, a data block, i.e. a fixed number of consecutive words, is compressed by first taking the difference between each two consecutive words and storing the first word as the base. This exploits that neighboring values are often similar, which concentrates the distribution of the difference values around zero.
The data items storing these differences are then viewed transposed, as bit-planes (delta bit-planes, DBPs), each collecting one bit position across all difference words. Neighboring DBPs are XOR-ed, the results are called DBX, and the DBP of the most significant bit is kept as the base-DBP. The results are fed into bit-plane encoders, which compress the DBX and DBP values to a bit-stream following Table 2(a). Most of these encodings are applied independently per DBX symbol. However, the first can be used to jointly encode multiple consecutive bit-planes at once, if they are all zero. This is where the correlation of neighboring values is best exploited. Note also the importance of the XOR-ing step: it maps two's complement negative values close to zero to words consisting mostly of zero-bits as well.
As 1) the bit-plane encoding is a prefix code, 2) both the block size and word width are fixed, and 3) the representation of the data block as (base, base-DBP, DBX symbols) is invertible, the resulting bit-stream of the base (word 0) followed by all the encoded symbols can be decompressed into the original data block.
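The delta / bit-plane / XOR steps and their invertibility can be modeled compactly. The sketch below assumes a block size of 8 and 8 bit two's-complement words (so deltas occupy 9 bits); the exact plane ordering is our assumption, and the prefix-code stage of Table 2(a) is omitted:

```python
# Illustrative model of the BPC transform: base + deltas, delta
# bit-planes (DBPs), and XOR of neighboring planes (DBX symbols).

N, W = 8, 8          # block size, word width
DW = W + 1           # delta width: one extra bit for the sign

def dbx_transform(block):
    assert len(block) == N
    base = block[0]
    # consecutive differences, masked to DW-bit two's complement
    deltas = [(block[i] - block[i - 1]) & ((1 << DW) - 1) for i in range(1, N)]
    # bit-plane p collects bit p of every delta (N-1 bits per plane)
    dbp = [sum(((d >> p) & 1) << i for i, d in enumerate(deltas))
           for p in range(DW)]
    # keep the MSB plane as base-DBP, XOR neighboring planes otherwise
    dbx = [dbp[DW - 1]] + [dbp[p] ^ dbp[p + 1] for p in range(DW - 2, -1, -1)]
    return base, dbx

def dbx_inverse(base, dbx):
    # undo the XOR chain to recover the bit-planes
    dbp = [dbx[0]]
    for x in dbx[1:]:
        dbp.append(dbp[-1] ^ x)
    dbp = dbp[::-1]                  # back to LSB-first plane order
    deltas = [sum(((dbp[p] >> i) & 1) << p for p in range(DW))
              for i in range(N - 1)]
    out = [base]
    for d in deltas:
        if d >= 1 << (DW - 1):       # sign-extend the DW-bit delta
            d -= 1 << DW
        out.append(out[-1] + d)
    return out

blk = [10, 12, 12, 11, 0, 0, 0, 5]
base, dbx = dbx_transform(blk)
assert dbx_inverse(base, dbx) == blk   # transform is invertible
```

The round-trip assertion at the end is the point 3) of the argument above: given the base, the transform loses no information.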
We have analyzed the code symbol distribution in Fig. 2(b) across all ResNet-34 feature maps with 8 bit fixed-point quantization. Similar histograms are obtained for 16 bit fixed-point and/or other networks. The 1.25 M blocks result in 11.25 M symbols, of which 5.1 M are uncompressed bit-planes, 1.2 M are multi-all-0 DBX symbols encoding 5.4 M all-zero bit-planes, 0.5 M are single-1 symbols, and 0.2 M are symbols for bit-planes with two consecutive one-bits.
As we are processing a stream of data, transmitting the base can be omitted in favor of re-using the last word of the previous block. As the compression is lossless, the last decoded word of the previous block is then used as the base for decoding the next block. When starting to transmit a new stream of data, either the base of the first block can be transmitted, or the base can be initialized to zero.
III-C Data Types
The proposed compression method can be applied to integers of various word widths and for various block sizes. It also works with floating-point words, in which case the deltas do not need an additional bit and correspondingly there is one less DBP and DBX symbol (cf. Fig. 1). The floating-point subtraction is not exactly (bit-true) invertible, hence a minimal and in practice negligible precision loss can be expected. Floating-point numbers are known to be notoriously hard to compress: while the DBX symbols corresponding to the fraction bits are almost equiprobably '1' or '0', those for the exponent and sign bits are often all-zero and thus compressible.
Notably, this compression method handles variable-precision input data types very well. For example, 10 bit values can be represented as 16 bit values and fed into a bit-plane compressor for 16 bit values. First, all the benefits coming from sparsity remain. Then, once a data block of such values is converted into the DBX representation, there will be 6 additional all-zero DBX symbols. These are in the worst case encoded together into a single multi-all-0 DBX symbol, adding a mere 7 bit to the overall block's code stream. In the best case, the additional all-zero DBX symbols can be encoded into an existing adjacent symbol. Similarly, not only reduced bit-widths, but generally reduced value ranges will have a positive impact on the length of the compressed bit stream. This can be used to alter the trade-off between accuracy and energy efficiency on-the-fly.
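The claim about the 6 additional all-zero DBX symbols can be checked numerically. The following self-contained sketch (our own toy data, 17-bit deltas as for 16 bit fixed-point words) shows that the sign extension makes the upper delta bit-planes identical, so their pairwise XOR vanishes:

```python
# Feeding 10-bit values through a 16-bit delta/bit-plane transform:
# every delta fits in 11 bits, so bit-planes 11..16 of the 17-bit
# deltas are pure sign extension, identical to plane 10, and the
# XOR of identical neighboring planes is zero.

W = 16
DW = W + 1                                   # delta width incl. sign bit
values = [17, 900, 123, 40, 1000, 512, 3, 700]   # all fit in 10 bits
deltas = [(values[i] - values[i - 1]) & ((1 << DW) - 1) for i in range(1, 8)]
planes = [sum(((d >> p) & 1) << i for i, d in enumerate(deltas))
          for p in range(DW)]
# the 6 DBX symbols formed from planes 10..16 are all zero
dbx_top = [planes[p] ^ planes[p + 1] for p in range(10, DW - 1)]
assert len(dbx_top) == 6
assert all(x == 0 for x in dbx_top)
```

In the worst case, these 6 all-zero planes cost one multi-all-0 symbol; in the best case, they merge into an adjacent all-zero run, as stated above.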
While the focus here is on evaluating the compression rate on feature and gradient maps of CNNs, such a (de-)compressor will be beneficial for any smooth data (images/textures, audio data, spectrograms, biomedical signals, …) and/or sparse data (event streams, activity maps, …).
IV Hardware Architecture & Implementation
The compression scheme’s elements have been selected such that it is particularly suitable for a light-weight hardware implementation: no code-book needs to be stored, just a few data words need to be kept in memory. To verify this claim, we present a hardware architecture in this section from which we obtain implementation results. For both the compressor and decompressor, we chose to target a throughput of approximately 1 word per cycle. We have separate output data streams for the compressed zero/non-zero stream and the bit-plane compressed data, which could optionally be packed into a single compressed bit stream. In the following we use a block size of 8 and 8 bit fixed-point data words.
In Fig. 4 we show the hardware architecture of the encoder. On top, we show the Zero-RLE compressor: a comparator to zero (an 8-input NOR block) followed by a counter and a multiplexer which selects a '1' in case of a non-zero value, or the zero count if one or more zeros have been received. Towards the end of the unit, variable-length symbols are packed into 8 bit words for connection to a memory bus: a register is filled with shifted data until at least 8 bits have been collected, at which point an 8 bit word is sent out and the remaining bits in excess of 8 are shifted to the LSB side. At the same time, any non-zero values are processed by the Delta and DBP/DBX Transform block. The first word is written to the base word register; for all subsequent words of the block, the difference to the previous value is computed and pushed into a shift-register. Once a complete block has been aggregated, the shift-register is read in parallel (now interpretable as bit-planes) and the pair-wise XOR of the DBPs is taken to obtain the DBX symbols (for ease of implementation, the first DBP is XOR-ed with zero, i.e. directly taken as the DBX symbol). The entire block of symbols is then pushed into a FIFO of depth 1, such that new input words can be accepted while the bit-planes are iteratively encoded, allowing an average throughput of up to 0.8 words/cycle. The data block is then read by the DBP/DBX Encoder to encode each bit-plane as a bit-vector and its length. The resulting variable-length data is then packed with a circuit similar to the packer in the Zero-RLE block to produce fixed-length 8 bit words.
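The packing step described above can be modeled in software. This is a behavioral sketch of the register-and-shift scheme, not the RTL; the LSB-first bit order is our assumption:

```python
# Behavioral model of the Packer: variable-length symbols accumulate
# in a bit register and are emitted as fixed 8-bit words as soon as
# at least 8 bits have been collected.

class Packer:
    def __init__(self, word_bits=8):
        self.word_bits = word_bits
        self.buf = 0                 # bit register (LSB-first, assumed)
        self.nbits = 0
        self.out = []

    def push(self, value, length):
        """Append a `length`-bit symbol; emit every full word."""
        self.buf |= value << self.nbits
        self.nbits += length
        while self.nbits >= self.word_bits:
            self.out.append(self.buf & ((1 << self.word_bits) - 1))
            self.buf >>= self.word_bits      # keep the excess bits
            self.nbits -= self.word_bits

    def flush(self):
        """Zero-pad any remaining bits into a final word."""
        if self.nbits > 0:
            self.out.append(self.buf & ((1 << self.word_bits) - 1))
            self.buf = self.nbits = 0
        return self.out
```

Pushing a 3-bit and then a 5-bit symbol, for instance, produces exactly one output byte containing both symbols.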
Although the throughput of the bit-plane compression part of the circuit is limited to 0.8 words/cycle, this constitutes a worst-case scenario. When zero-values are encountered, the Zero-RLE block handles the workload while the processing of the non-zero words continues in parallel. This way, the compressor can operate closer to 1 word/cycle on average.
The decompressor shown in Fig. 5 reverts the steps of the encoder. After inverting the Zero-RLE encoding, the bit-plane compressed data stream is read in 8 bit words and unpacked into variable-length data chunks. The Unpacker always provides 8 valid data bits to the Symbol Decoder, which decodes the symbol into a DBP or DBX word and feeds the effective symbol length back to the Unpacker. In case of a DBX word, it is XOR-ed with the previous DBP, such that DBP words are emitted to the Buffer unit—or the base word is forwarded in case of the first 8 bits of the block. The Buffer block aggregates the DBPs and the base word, buffering them for the Delta Reverse unit, which iteratively accumulates the delta symbols and emits the decompressed words. The Buffer unit with its built-in FIFO allows unpacking and decoding data (10 cycles/block) while the delta compression is reverted (8 cycles/block).
IV-C Implementation Results
| Unit | 8 bit comp. | 8 bit decomp. | 16 bit comp. | 16 bit decomp. | 32 bit comp. | 32 bit decomp. |
|---|---|---|---|---|---|---|
| Delta & DBX Transf. | 1059 | 735 | 1940 | 1347 | 3832 | 2661 |
Gate equivalents (GEs): size expressed in terms of the area of a 2-input NAND gate. 1 GE: 1.44 µm² (umc 65 nm), 0.49 µm² (ST 28 nm FDSOI), 0.20 µm² (GlobalFoundries 22 nm).
Without inverse Zero-RLE.
The total area deviates from the sum of the blocks as the synthesizer performs optimizations across the blocks.
We have implemented the described architecture for a UMC 65 nm low-leakage process and synthesized the design. We report the area and a per-unit breakdown for a block size of 8 (i.e. the optimal case, cf. Section V-C) and 8 bit, 16 bit and 32 bit words at a target frequency of 600 MHz in Table II. For most inference applications, 8 bit feature maps are sufficient, such that the compressor and decompressor occupy a mere 0.004 mm² and 0.0025 mm², respectively. For comparison, the area of both together is similar to a single 32 bit integer multiplier, which requires a minimum of 0.0065 mm².
Synthesis of the circuit for lower frequencies does not reduce the area, while at 1 GHz and 1.5 GHz the area of the compressor grows to 4337 GE and 5180 GE, respectively. For higher frequencies, timing closure could not be attained, with the longest path passing from the DBX multiplexer's control input in the DBP/DBX Encoder to the register in the Packer.
Directly scaling up the word size from 8 bit to 16 bit increases the area of the compressor by 70% and that of the decompressor by 85%. Increasing it further from 16 bit to 32 bit adds another 83% or 108%, respectively. Increasing the word width does not affect the size of the DBP/DBX Encoder, as it works on bit-planes, but it requires more iterations. One might thus consider using multiple DBP/DBX Encoder units to avoid bottlenecking the throughput, which also increases the size of the Packer and Unpacker to be able to take data from all the encoders and feed all the decoders. The size of the Packer and Unpacker increases with the word width as well, since the register size grows, and so does the number of multiplexers in the barrel shifters.
Scaling up the throughput can be achieved by doubling the capacity of each unit, reading two words into the compressor, computing both differences in the same cycle, and increasing the size of the input port of the shift-register. Similarly, multiple encoders can be used to compress two bit-planes per cycle. This has only a limited impact on the area in this part of the compressor, which is mostly defined by the size of the shift-register and the FIFO, which do not need to grow. The main impact will be visible in the Packer and later the Unpacker units, where the barrel shifters have to take twice as wide words and shift twice as far when doubling the throughput, and hence grow quadratically—a problem inherent to packing the output of any variable symbol-size compressor. For the decompressor, similarly, there can be multiple symbol decoders and the Delta Reverse unit can be modified to process two words per cycle. Overall, increasing the throughput this way can be expected to scale sub-linearly in area when processing a few words in parallel, but once reaching close to full parallelization (i.e. 8× for block size 8 and word width 8 bit), the Packer and Unpacker will take up most of the circuit's size. However, the throughput can be scaled at linear area cost by instantiating multiple complete (de-)compressors working on individual feature maps in parallel or on separate spatial tiles of the feature maps.
IV-D System Integration
The presented compression scheme can be used to reduce the energy spent on interfaces to external DRAM, on inter-chip or back-plane communication—the corresponding standards specify very efficient power-down modes [45, 46]—and to reduce the required bandwidth of such interfaces, lowering the cost of packaging, circuit boards, and additional on-chip circuits (e.g. PLLs, on-chip termination, etc.) [45, 46].
Given its small size, it can also be used to reduce the cost of on-chip data transfers, e.g. from large background L2 memories in DNN inference chips that try to fit all data on chip, such as the one Tesla has presented for its next generation of self-driving cars or the hardware by Graphcore.
Streaming HW Accelerators
The (de-)compressor could be integrated with an accelerator such as YodaNN, which reaches a state-of-the-art core energy efficiency of 60 TOp/s/W for binary-weight DNNs. For the specific case of YodaNN, however, taking the I/O energy cost into account adds 15.28 mW to the core's 0.26 mW, bottlenecking the efficiency to 1 TOp/s/W. A drop-in addition of 8 compressor and decompressor units—YodaNN works on 8 feature maps at the input and output in parallel—would reduce the I/O cost and directly increase its energy efficiency at system level by 2–4× (cf. Section V-D) while adding only 0.05 mm² (6%) to the 0.86 mm² of core area.
HW Accelerator with Feature Maps On-Chip
Another application scenario is given by Hyperdrive and large industrial chips such as Tesla's platform for its next generation of cars, which store the feature maps on chip. Memory inherently takes up a large share of such a design, for the case of Hyperdrive specifically 65%. With a compression scheme providing a reliable compression ratio across different input images and for all layer pairs (in a ping-pong buffering scheme), we can reduce the memory size by around 2× (cf. Section V-E), thus saving almost as much silicon area.
Integration into a Heterogeneous Many-Core Accelerator
A further use-case is the integration into a heterogeneous accelerator with multiple cores and/or accelerators working from a local scratchpad memory, where data is prefetched from a different level in the memory hierarchy. One example is the GAP-8 SoC, which can be used for DNN-based autonomous navigation of nano-drones: there, 8 cores and a CNN hardware accelerator share a 64 kB L1 scratchpad memory that is loaded from the 512 kB L2 memory using a DMA controller. In such systems, SRAM memory accesses and data movement across interconnects can make up a significant share of the overall power, and memory space is generally a scarce resource. Integrating the proposed (de-)compressor into the DMA would improve both aspects jointly in such a system.
V-A Experimental Setup
Where not otherwise stated, we perform our experiments on AlexNet and use images from the ILSVRC validation set. The models we used were pre-trained and downloaded from the PyTorch/Torchvision data repository wherever possible, and an identical preprocessing was applied to the data. For the gradient analyses we trained the same networks ourselves, using the code available at https://github.com/spallanzanimatteo/QuantLab. Some of the experiments are performed with fixed-point data types (default: 16-bit fixed-point). The feature maps were normalized to span 80% of the full range before applying uniform quantization, in order to use the full value range up to a safety margin preventing overflows. All the feature maps were extracted after the ReLU (or ReLU6) activations. The code to reproduce these experiments is available online at https://github.com/lukasc-ch/ExtendedBitPlaneCompression.
V-B Sparsity, Activation Histogram & Data Layout
Neural networks are known to have sparse feature maps after applying a ReLU activation layer, which can be done on-the-fly after the convolution layer and possibly batch normalization. However, sparsity varies significantly for different layers within a network as well as for different CNNs. Sparsity is a key aspect when compressing feature maps, and we analyze it quantitatively with statistics collected across 250 random ILSVRC'12 validation images for each layer of AlexNet as well as the more modern and size-optimized MobileNetV2 in Fig. 6. For AlexNet, we can clearly see the increase in sparsity from earlier to later layers. For MobileNetV2, multiple effects overlap. Overall, the feature maps later in the network are more sparse, and generally this is correlated with the number of feature maps (also in AlexNet). Feature maps following expanding 1×1 convolutions (e.g. 15, 17, 19, 21) generally show lower sparsity (25–40%) than those after the depth-wise separable 3×3 convolutions (e.g. 16, 18, 20, 22; sparsity 50–65%), where for the latter there are exceptions (e.g. 8, 14, 28) when these convolutions are strided (sparsity 20–35%). This aligns with intuition, as the 1×1 layers combine feature maps to later be filtered, and the depth-wise 3×3 convolution layers literally perform the filtering.
Besides the average sparsity, its probability distribution across different frames becomes relevant in case guarantees have to be provided either due to real-time requirements in case of a bandwidth-limited hardware accelerator or due to size limits of the memory in which the feature maps are stored (e.g. on-chip SRAM). The whiskers in the box plot mark the 1st and 99th percentile, clearly showing how narrow the distribution of the sparsity is and that we thus can expect a very stable compression rate.
We consider compressing not only the feature maps but also the gradient maps for specialized training hardware, thus raising the question of how sparsity evolves as training progresses. The gradient maps are generally as sparse as the corresponding feature maps, as ReLU activations pass a zero gradient wherever the outgoing feature map value was zero. In Fig. 7 we can observe how the various layers start from around 50% sparsity after random initialization and within a few epochs settle close to their final level. In both networks this is the case after around 15% of the epochs required for full convergence.
The sparse values are not independently distributed but rather occur in bursts when the 4D data tensor is laid out in one of the obvious formats. The most commonly used formats are NCHW and NHWC, which are those supported by most frameworks and the widely used Nvidia cuDNN backend. NCHW is the preferred format for cuDNN and the default memory layout; it means that neighboring values in the horizontal direction are stored next to each other in memory, before the vertical, channel, and batch dimensions. NHWC is the default format of TensorFlow, has long been used in computer vision, and has the advantage of simple non-strided computation of inner products in the channel (i.e. feature map) dimension. Further reasonable options which we include in our analysis are CHWN and HWCN, although most use-cases with hardware acceleration target real-time low-latency inference and thus operate with a batch size of 1. We analyze the distribution of the length of zero bursts for these four data layouts at various depths within the network in Fig. 8.
The results clearly show that having the spatial dimensions (H, W) next to each other in the data stream provides the longest zero bursts (lowest cumulative distribution curve) and thus better compressibility than the other formats. This also aligns with intuition: feature map values mark the presence of certain features and are expected to be smooth. Inspecting the feature maps of CNNs commonly shows that they behave like 'heat maps' marking the presence of certain geometric features nearby. Based on these results, we perform all the following evaluations with the NCHW data layout. Note also that the bursts of non-zero values are mostly very short, such that there is limited gain in applying RLE to the one-bits as well.
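The layout dependence of the zero-burst lengths can be reproduced with a small utility. The sketch below uses a toy tensor of our own making (one all-zero channel), not real CNN data; it merely illustrates how serialization order changes burst statistics:

```python
# Serialize a 4-D activation tensor in a given layout and measure the
# lengths of its zero bursts. Pure-Python illustration.
from itertools import product

def serialize(tensor, shape, order):
    """tensor: dict (n, c, h, w) -> value; order: e.g. 'NCHW'."""
    axes = {'N': 0, 'C': 1, 'H': 2, 'W': 3}
    dims = [shape[axes[a]] for a in order]
    for idx in product(*(range(d) for d in dims)):
        full = [0, 0, 0, 0]
        for a, i in zip(order, idx):
            full[axes[a]] = i
        yield tensor[tuple(full)]

def zero_bursts(stream):
    """Return the list of zero-burst lengths in the stream."""
    bursts, run = [], 0
    for v in stream:
        if v == 0:
            run += 1
        elif run:
            bursts.append(run); run = 0
    if run:
        bursts.append(run)
    return bursts

# Toy example: channel 0 is all-zero, channel 1 all non-zero.
shape = (1, 2, 2, 2)
tensor = {(0, c, h, w): (0 if c == 0 else 1)
          for c in range(2) for h in range(2) for w in range(2)}
assert zero_bursts(serialize(tensor, shape, 'NCHW')) == [4]
assert zero_bursts(serialize(tensor, shape, 'NHWC')) == [1, 1, 1, 1]
```

With the spatial dimensions innermost (NCHW), the zeros of the toy channel form one long burst; interleaving channels (NHWC) shatters it into single-zero bursts, mirroring the cumulative distribution curves of Fig. 8.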
To compress further beyond exploiting the sparsity, the remaining data has to be compressible as well. The histograms of the activation distributions, shown for some sample layers of AlexNet and MobileNetV2 in Fig. 9, confirm this and provide a strong indication that additional compression of the non-zero data is possible.
V-C Selecting Parameters
The proposed method has two parameters: the maximum length of a zero sequence that can be encoded with a single code symbol of the Zero-RLE, and the BPC block size (the number of non-zero words encoded jointly).
Max. Zero Burst Length
We first analyze the effect of varying the maximum zero burst length of Zero-RLE (without BPC) on the compression ratio for various data word widths in Table III.
[Table III: compression ratio as a function of the data word width, for ZVC and for Zero-RLE with different maximum zero burst lengths.]
The optimal value is arguably the same for our proposed method, since a constant offset in compressing the non-zero values does not affect the optimal choice of this parameter (just like the word width has no effect on it). The results also serve as a baseline for Zero-RLE and ZVC. It is worth noting that ZVC corresponds to Zero-RLE with a maximum burst length of 1, yet breaks the trend shown in Table III. This is due to an inefficiency of Zero-RLE in this corner case: for a zero burst of length 1, ZVC requires 1 bit, whereas Zero-RLE with a maximum burst length of 2 takes 2 bits. For a zero burst of length 2, ZVC encodes 2 symbols of 1 bit each, and Zero-RLE also takes 2 bits. ZVC thus always performs at least as well as Zero-RLE with such a short maximum burst length.
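This corner case can be made concrete with a small bit-cost model. The symbol format assumed here (a 1-bit zero flag followed by ceil(log2(max burst length)) length bits) is our illustration, not the exact hardware encoding, but it reproduces the counts from the text:

```python
import math

def zero_rle_burst_bits(burst_len, max_burst_len):
    """Bits an assumed Zero-RLE format spends on a run of zeros: each emitted
    symbol carries a 1-bit 'zero' flag plus ceil(log2(max_burst_len)) length bits."""
    if max_burst_len == 1:
        return burst_len  # degenerates to one flag bit per zero word
    len_bits = math.ceil(math.log2(max_burst_len))
    n_symbols = math.ceil(burst_len / max_burst_len)
    return n_symbols * (1 + len_bits)

def zvc_burst_bits(burst_len):
    """ZVC spends exactly one mask bit per (zero) word."""
    return burst_len

for burst in (1, 2, 3, 4):
    print(f"burst {burst}: ZVC {zvc_burst_bits(burst)} bit, "
          f"Zero-RLE(max=2) {zero_rle_burst_bits(burst, 2)} bit")
```

For a burst of length 1 the model gives ZVC 1 bit versus Zero-RLE 2 bits, and for length 2 both take 2 bits, matching the argument above.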
BPC Block Size
We analyze the effect of the BPC block size parameter in Fig. 10 at various depths within the network. The best compression ratio is achieved with a block size of 16 across all the layers. A block size of 8 might also be considered to minimize the resources of the (de-)compression hardware block at a small drop in compression ratio.
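To make the role of the block size concrete, the following sketch shows a BPC-style front-end: the words of a block are delta-encoded against their predecessor and the deltas are sliced into bit-planes. On smooth data most planes come out all-zero, which the back-end can encode in very few bits; a larger block amortizes the base word over more values. The function and example values are illustrative, not the exact hardware datapath.

```python
import numpy as np

def bitplane_transform(block, width=16):
    """Delta-encode a block against consecutive words, then slice the deltas
    into bit-planes (MSB first). Returns the stored base word and the planes."""
    base = int(block[0])
    deltas = np.diff(block.astype(np.int64))
    # Two's-complement view of the deltas, restricted to width+1 bits.
    unsigned = deltas & ((1 << (width + 1)) - 1)
    planes = np.array([(unsigned >> b) & 1 for b in range(width, -1, -1)])
    return base, planes

# Hypothetical smooth 8-word block of 16-bit activations.
block = np.array([5, 5, 6, 6, 6, 6, 6, 7], dtype=np.int64)
base, planes = bitplane_transform(block, width=16)
zero_planes = int((planes.sum(axis=1) == 0).sum())
print("all-zero bit-planes:", zero_planes, "of", planes.shape[0])
```

Here 16 of the 17 bit-planes are all-zero, illustrating why the bit-plane view compresses so well once the deltas are small.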
V-D Total Compression Factor
We analyze the total compression factor of all feature maps of AlexNet, VGG-16, ResNet-34, SqueezeNet, and MobileNetV2 in Fig. 11. For AlexNet, we note the high compression ratio of around 3 already achieved by Zero-RLE and ZVC alone, and that it is very similar for all data types. We further see that pure BPC is not suitable, since it introduces too much overhead when encoding runs of zero values. For ResNet-34, SqueezeNet, and MobileNetV2, the gains from exploiting only the sparsity are significantly lower, at around 1.55, 1.7, and 1.4, respectively. The proposed method clearly outperforms previous approaches, particularly for the 8 bit fixed-point values commonly used in today's inference accelerators, where we observe compression ratios of 5 (AlexNet), 4 (VGG-16), 2.4 (ResNet-34), 2.8 (SqueezeNet), and 2.2 (MobileNetV2).
When moving from 8 bit fixed-point values to 12 and 16 bit, and ultimately to 16 bit floating point, the compression ratios of the methods based on sparsity only (Zero-RLE, ZVC) do not change significantly. This is in line with expectations, since the zero values make only a negligible contribution to the final data size with Zero-RLE and have no effect at all with ZVC. Contrary to this, BPC is very effective for small word widths but loses its benefit as the word width increases. Our proposed method combines the best of both worlds, starting with a very high compression ratio and slowly converging towards ZVC and Zero-RLE as the word width increases. The gains for 8 bit fixed-point data are significantly higher than for other data formats. For most input data, including CNN feature maps, the most important information lies in the more significant bits, and in the case of floats in the exponent. The less significant bits appear mostly as noise to the encoder and cannot be compressed without accuracy loss, such that this behavior, a lower compression ratio for wider word widths, is expected.
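The word-width independence of sparsity-only methods follows from a simple model. For ZVC, the compressed size is one mask bit per word plus one full word per non-zero value, so the ratio depends on sparsity but only weakly on the word width (this idealized model is ours, not the exact hardware format):

```python
def zvc_ratio(sparsity, word_width):
    """Idealized ZVC compression ratio: 1 mask bit per word plus one full
    word per non-zero value (simple model, ignores alignment overheads)."""
    compressed_bits_per_word = 1.0 + (1.0 - sparsity) * word_width
    return word_width / compressed_bits_per_word

for w in (8, 12, 16):
    print(f"{w}-bit words, 50% sparsity -> ratio {zvc_ratio(0.5, w):.2f}")
```

At 50% sparsity the model gives ratios between roughly 1.6 and 1.8 across 8- to 16-bit words, i.e., nearly flat in the word width, consistent with the flat Zero-RLE/ZVC curves in Fig. 11.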
V-E Per-Layer Compression Ratio
As already expected from the results on sparsity, the compression ratio is fairly stable across multiple frames. Specifically, the 1st percentile of compression ratios lies only around 20% below the average case for all the networks. Further results on a per-layer basis for AlexNet, ResNet-34, and MobileNetV2 are shown in Fig. 12. While there is significant variability between the layers, the 1% farthest outliers towards lower compression ratios are found in AlexNet's first layer, at a drop of around 25% from the average ratio. For ResNet-34 and MobileNetV2, even the worst-case variations remain within less than 5% of the mean. This allows scaling down the available bandwidth and/or the corresponding memory size with only a minimal risk of failure. Furthermore, the remaining risk can be reduced when processing video streams in real-world applications, where the numerical precision could be scaled down as described in Section III-C, accepting graceful accuracy degradation in exchange for a smaller compressed bit stream and thereby mitigating potentially catastrophic failure.
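Provisioning bandwidth from such percentile statistics is straightforward. The sketch below uses synthetic per-frame ratios (the distribution is an assumption; real numbers come from measurements like those behind Fig. 12) to derive the margin and the budget:

```python
import numpy as np

# Hypothetical per-frame compression ratios collected over a run of input frames.
rng = np.random.default_rng(0)
ratios = rng.normal(loc=5.1, scale=0.3, size=1000)

p1 = float(np.percentile(ratios, 1))   # 1st percentile: near-worst case
margin = 1.0 - p1 / float(ratios.mean())  # how far it lies below the mean
bandwidth_budget = p1                   # provision I/O for the 1st percentile
print(f"1st percentile lies {margin:.1%} below the mean; budget ratio {bandwidth_budget:.2f}")
```

Sizing the interface for the 1st-percentile ratio instead of the mean trades a few percent of bandwidth headroom for a much lower probability of overflowing the compressed-stream budget.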
For applications in on-device learning as well as to further boost the throughput of thermally or I/O-limited training accelerators in compute clusters, we have further investigated the compressibility of the gradient maps (cf. Fig. 13). Despite the higher precision data types as required for the gradients, high compression rates can be achieved, mostly higher or on par with those of the feature maps.
We have presented and evaluated a novel compression method for CNN feature maps. The proposed algorithm achieves an average compression ratio of 5.1 on AlexNet (+46% over previous methods), 4 on VGG-16 (+67%), 2.4 on ResNet-34 (+33%), 2.8 on SqueezeNet (+51%), and 2.2 on MobileNetV2 (+30%) for 8 bit data, and thus clearly outperforms the state of the art, while fitting a very tight hardware resource budget with 0.004 mm² and 0.0025 mm² of silicon area in UMC 65 nm technology at 600 MHz and 0.8 word/cycle. The frequency can be pushed to 1.5 GHz at a slight area increase of 25%.
We further show that the proposed method works well for various data formats and precisions, and that the compression ratios are achieved reliably across many images, with outliers going down only 15% below the average ratio at the 1st percentile. The same method is also applicable to the compression of gradient maps during training, achieving compression rates again more than 20% higher than those achieved for the feature maps in the forward pass.
-  G. Litjens, T. Kooi, B. E. Bejnordi, A. A. A. Setio, F. Ciompi, M. Ghafoorian, J. A. van der Laak, B. van Ginneken, and C. I. Sánchez, “A survey on deep learning in medical image analysis,” Medical Image Analysis, vol. 42, no. December, pp. 60–88, 2017.
-  X. Bian, S. N. Lim, and N. Zhou, “Multiscale fully convolutional network with application to industrial inspection,” in Proc. IEEE WACV, 3 2016, pp. 1–8.
-  L. Cavigelli, D. Bernath, M. Magno, and L. Benini, “Computationally efficient target classification in multispectral image data with Deep Neural Networks,” in Proc. SPIE Security + Defence, vol. 9997, 2016.
-  B. Wu, F. Iandola, P. H. Jin, and K. Keutzer, “SqueezeDet: Unified, Small, Low Power Fully Convolutional Neural Networks for Real-Time Object Detection for Autonomous Driving,” in Proc. IEEE CVPRW, 2017, pp. 129–137.
-  K. He, X. Zhang, S. Ren, and J. Sun, “Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification,” in Proc. IEEE ICCV, 2015, pp. 1026–1034.
-  L. Cavigelli and L. Benini, “Origami: A 803-GOp/s/W Convolutional Network Accelerator,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 27, no. 11, pp. 2461–2475, 11 2017.
-  L. Cavigelli, D. Gschwend, C. Mayer, S. Willi, B. Muheim, and L. Benini, “Origami: A Convolutional Network Accelerator,” in Proc. ACM GLSVLSI. ACM Press, 2015, pp. 199–204.
-  J. Albericio, P. Judd, T. Hetherington, T. Aamodt, N. E. Jerger, and A. Moshovos, “Cnvlutin: Ineffectual-Neuron-Free Deep Neural Network Computing,” in Proc. ACM/IEEE ISCA, 2016, pp. 1–13.
-  A. Parashar, M. Rhu, A. Mukkara, A. Puglielli, R. Venkatesan, B. Khailany, J. Emer, S. W. Keckler, and W. J. Dally, “SCNN: An Accelerator for Compressed-sparse Convolutional Neural Networks,” in Proc. ACM/IEEE ISCA, 2017, pp. 27–40.
-  S. Zhang, Z. Du, L. Zhang, H. Lan, S. Liu, L. Li, Q. Guo, T. Chen, and Y. Chen, “Cambricon-X: An accelerator for sparse neural networks,” in Proc. IEEE/ACM MICRO, 2016, pp. 1–20.
-  A. Aimar, H. Mostafa, E. Calabrese, A. Rios-Navarro, R. Tapiador-Morales, I.-A. Lungu, M. B. Milde, F. Corradi, A. Linares-Barranco, S.-C. Liu, and T. Delbruck, “NullHop: A Flexible Convolutional Neural Network Accelerator Based on Sparse Representations of Feature Maps,” IEEE Transactions on Neural Networks and Learning Systems, vol. 30, no. 3, pp. 644–656, 3 2019.
-  Y. H. Chen, J. Emer, and V. Sze, “Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks,” in Proc. ACM/IEEE ISCA, 2016, pp. 367–379.
-  S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J. Dally, “EIE: Efficient Inference Engine on Compressed Deep Neural Network,” in Proc. ACM/IEEE ISCA, 2016, pp. 243–254.
-  L. Cavigelli and L. Benini, “CBinfer: Exploiting Frame-to-Frame Locality for Faster Convolutional Network Inference on Video Streams,” IEEE Transactions on Circuits and Systems for Video Technology, 2019.
-  L. Cavigelli, M. Magno, and L. Benini, “Accelerating Real-Time Embedded Scene Labeling with Convolutional Networks,” in Proc. ACM/IEEE DAC, 2015, pp. 108:1–108:6.
-  R. Andri, L. Cavigelli, D. Rossi, and L. Benini, “Hyperdrive: A Systolically Scalable Binary-Weight CNN Inference Engine for mW IoT End-Nodes,” in Proc. IEEE ISVLSI, 2018, pp. 509–515.
-  ——, “YodaNN: An Ultra-Low Power Convolutional Neural Network Accelerator Based on Binary Weights,” in Proc. IEEE ISVLSI, 2016, pp. 236–241.
-  M. Courbariaux, Y. Bengio, and J.-P. David, “BinaryConnect: Training Deep Neural Networks with binary weights during propagations,” in Adv. NIPS, 2015, pp. 3123–3131.
-  A. Zhou, A. Yao, Y. Guo, L. Xu, and Y. Chen, “Incremental Network Quantization: Towards Lossless CNNs with Low-Precision Weights,” in Proc. ICLR, 2017.
-  R. Andri, L. Cavigelli, D. Rossi, and L. Benini, “YodaNN: An Architecture for Ultralow Power Binary-Weight CNN Acceleration,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 37, no. 1, pp. 48–60, 2018.
-  W. Chen, J. T. Wilson, S. Tyree, K. Q. Weinberger, and Y. Chen, “Compressing Neural Networks with the Hashing Trick,” in Proc. ICML, vol. 37, 2015, pp. 2285–2294.
-  E. Agustsson, F. Mentzer, M. Tschannen, L. Cavigelli, R. Timofte, L. Benini, and L. Van Gool, “Soft-to-Hard Vector Quantization for End-to-End Learning Compressible Representations,” in Adv. NIPS, 2017, pp. 1141–1151.
-  S. Han, H. Mao, and W. J. Dally, “Deep Compression - Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding,” in ICLR, 2016.
-  P. Stock, A. Joulin, R. Gribonval, B. Graham, and H. Jégou, “And the Bit Goes Down: Revisiting the Quantization of Neural Networks,” Facebook AI Research, Tech. Rep., 2019.
-  M. Rhu, M. O’Connor, N. Chatterjee, J. Pool, Y. Kwon, and S. W. Keckler, “Compressing DMA Engine: Leveraging Activation Sparsity for Training Deep Neural Networks,” in Proc. IEEE HPCA, 2018, pp. 78–91.
-  D. Gudovskiy, A. Hodgkinson, and L. Rigazio, “DNN feature map compression using learned representation over GF(2),” in Proc. ECCV Workshops, 2018, pp. 502–516.
-  Y. Wang, C. Xu, C. Xu, and D. Tao, “Beyond Filters: Compact Feature Map for Portable Deep Model,” in Proc. ICML, vol. 70, 2017, pp. 3703–3711.
-  K. He, X. Zhang, S. Ren, and J. Sun, “Deep Residual Learning for Image Recognition,” Proc. IEEE CVPR, pp. 770–778, 2015.
-  F. Iandola, M. Moskewicz, S. Karayev, R. Girshick, T. Darrell, and K. Keutzer, “DenseNet: Implementing Efficient ConvNet Descriptor Pyramids,” UC Berkeley, Berkeley, CA, USA, Tech. Rep., 2014.
-  F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, and K. Keutzer, “SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size,” UC Berkeley, Berkeley, CA, USA, Tech. Rep., 2016.
-  N. Ma, X. Zhang, H.-T. Zheng, and J. Sun, “ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture Design,” in Proc. ECCV, 2018, pp. 116–131.
-  M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, “MobileNetV2: Inverted Residuals and Linear Bottlenecks,” in Proc. IEEE CVPR. IEEE, 6 2018, pp. 4510–4520.
-  M. Tan, B. Chen, R. Pang, V. Vasudevan, M. Sandler, A. Howard, and Q. V. Le, “MnasNet: Platform-Aware Neural Architecture Search for Mobile,” in Proc. IEEE CVPR, 2019, pp. 2820–2028.
-  J. Redmon and A. Farhadi, “YOLOv3: An Incremental Improvement,” University of Washington, Seattle, WA, USA, Tech. Rep., 2018.
-  Z. Cao, T. Simon, S.-E. Wei, and Y. Sheikh, “Realtime Multi-person 2D Pose Estimation Using Part Affinity Fields,” in Proc. IEEE CVPR. IEEE, 2017, pp. 1302–1310.
-  M. Kocabas, S. Karagoz, and E. Akbas, “MultiPoseNet: Fast Multi-Person Pose Estimation Using Pose Residual Network,” in Proc. ECCV, 2018, pp. 417–433.
-  R. Andri, L. Cavigelli, D. Rossi, and L. Benini, “Hyperdrive: A Multi-Chip Systolically Scalable Binary-Weight CNN Inference Engine,” IEEE Journal on Emerging and Selected Topics in Circuits and Systems, vol. 9, no. 2, pp. 309–322, 6 2019.
-  L. Cavigelli and L. Benini, “Extended Bit-Plane Compression for Convolutional Neural Network Accelerators,” in Proc. IEEE AICAS, 2018.
-  T. A. Welch, “A Technique for High-Performance Data Compression,” Computer, vol. 17, no. 6, pp. 8–19, 1984.
-  J. Ziv and A. Lempel, “Compression of Individual Sequences via Variable-Rate Coding,” IEEE Transactions on Information Theory, vol. 24, no. 5, pp. 530–536, 1978.
-  M. B. Lin, J. F. Lee, and G. E. Jan, “A lossless data compression and decompression algorithm and its hardware architecture,” IEEE Transactions on Very Large Scale Integration Systems, vol. 14, no. 9, pp. 925–936, 2006.
-  X. Zhou, Y. Ito, and K. Nakano, “An efficient implementation of LZW decompression in the FPGA,” Proc. IEEE IPDPS, pp. 599–607, 2016.
-  M. Spallanzani, L. Cavigelli, G. P. Leonardi, M. Bertogna, and L. Benini, “Additive Noise Annealing and Approximation Properties of Quantized Neural Networks,” in arXiv:1905.10452, 5 2019, pp. 1–18.
-  J. Kim, M. Sullivan, E. Choukse, and M. Erez, “Bit-Plane Compression: Transforming Data for Better Compression in Many-Core Architectures,” in Proc. IEEE ISCA, 2016, pp. 329–340.
-  JEDEC Solid State Technology Association, “Low Power Double Data Rate 4 (LPDDR4),” JEDEC Solid State Technology Association, Tech. Rep. August, 2014.
-  ——, “Graphics Double Data Rate (GDDR5) SGRAM,” JEDEC Solid State Technology Association, Tech. Rep. February, 2016.
-  N. Toon and S. Knowles, “Graphcore,” 2017. [Online]. Available: https://www.graphcore.ai
-  E. Flamand, D. Rossi, F. Conti, I. Loi, A. Pullini, F. Rotenberg, and L. Benini, “GAP-8: A RISC-V SoC for AI at the Edge of the IoT,” in Proc. IEEE ASAP. IEEE, 7 2018, pp. 1–4.
-  D. Palossi, A. Loquercio, F. Conti, E. Flamand, D. Scaramuzza, and L. Benini, “A 64mW DNN-based Visual Navigation Engine for Autonomous Nano-Drones,” IEEE Internet of Things Journal, vol. PP, pp. 1–1, 2019.
Lukas Cavigelli received the B.Sc., M.Sc., and Ph.D. degree in electrical engineering and information technology from ETH Zürich, Zürich, Switzerland in 2012, 2014 and 2019, respectively. He has since been a postdoctoral researcher with ETH Zürich. His research interests include deep learning, computer vision, embedded systems, and low-power integrated circuit design. He has received the best paper award at the VLSI-SoC and the ICDSC conferences in 2013 and 2017, the best student paper award at the Security+Defense conference in 2016, and the Donald O. Pederson best paper award (IEEE TCAD) in 2019.
Georg Rutishauser received his B.Sc. and M.Sc. degrees in Electrical Engineering and Information Technology from ETH Zürich in 2015 and 2018, respectively. He is currently pursuing a Ph.D. degree at the Integrated Systems Laboratory at ETH Zürich. His research interests include algorithms and hardware for reduced-precision deep learning, and their application in computer vision and embedded systems.
Luca Benini is the Chair of Digital Circuits and Systems at ETH Zürich and a Full Professor at the University of Bologna. He has served as Chief Architect for the Platform2012 in STMicroelectronics, Grenoble. Dr. Benini’s research interests are in energy-efficient system and multi-core SoC design. He is also active in the area of energy-efficient smart sensors and sensor networks. He has published more than 1’000 papers in peer-reviewed international journals and conferences, four books and several book chapters. He is a Fellow of the ACM and of the IEEE and a member of the Academia Europaea.