EBPC: Extended Bit-Plane Compression for Deep Neural Network Inference and Training Accelerators
Abstract
In the wake of the success of convolutional neural networks in image classification, object recognition, speech recognition, etc., the demand for deploying these compute-intensive ML models on embedded and mobile systems with tight power and energy constraints at low cost, as well as for boosting throughput in data centers, is growing rapidly. This has sparked a surge of research into specialized hardware accelerators. Their performance is typically limited by I/O bandwidth, their power consumption is dominated by I/O transfers to off-chip memory, and on-chip memories occupy a large part of the silicon area.
We introduce and evaluate a novel, hardware-friendly and lossless compression scheme for the feature maps present within convolutional neural networks. Its hardware implementation fits into 2.8 kGE and 1.7 kGE of silicon area for the compressor and decompressor, respectively. We show that an average compression ratio of 5.1× for AlexNet, 4× for VGG-16, 2.4× for ResNet-34 and 2.2× for MobileNetV2 can be achieved—a gain of 45–70% over existing methods. Our approach also works effectively for various number formats, has a low frame-to-frame variance of the compression ratio, and achieves compression factors for gradient map compression during training that are even better than for inference.
I Introduction
Computer vision has evolved into a key component for automating data analysis over a wide range of field applications: medical diagnostics [1], industrial quality assurance [2], video surveillance [3], advanced driver assistance systems [4] and many more. A large number of these applications have only emerged recently due to the tremendous accuracy improvements—even beyond human capabilities [5]—associated with the advent of deep learning and in particular convolutional neural networks (CNNs, ConvNets).
Even though CNN-based solutions often require considerable compute resources, many of these applications have to run in real-time and on embedded and mobile systems. As a result, purpose-built platforms, application-specific hardware accelerators, and optimized algorithms have been engineered to reduce the amount of arithmetic operations and their precision requirements [6, 7, 8, 9, 10, 11, 12, 13, 14, 15].
Examining these hardware platforms, the energy required to load and store intermediate results/feature maps (and gradient maps during training) in the off-chip memory is not only significant, but typically dominates the energy consumed for computation and on-chip data buffering. This energy bottleneck is even more remarkable when considering networks that are engineered to reduce compute energy by quantizing weights to one or two bits or power-of-two values, dispensing with the need for high-precision multiplications and significantly reducing weight storage requirements [16, 17, 18, 19, 20].
Many compression methods for CNNs have been proposed over the last few years. However, many of them focus exclusively on compressing the model parameters, use very complex methods requiring large dictionaries, or are otherwise not suitable for a small, energy-efficient hardware implementation—often targeting efficient distribution and storage of trained models to mobile devices or the transmission of intermediate feature maps from/to mobile devices over a costly communication link [23].
In contrast to these, the focus of this paper is on reducing the energy consumption of hardware accelerators for CNN inference and training by cutting down on the dominant power contributor—I/O transfers. These data transfers to and from off-chip memory consist of the network parameters (read-only) and the feature maps (read/write).
TABLE I
Network | Accuracy [% top-1/top-5] | Dataset | Resolution¹ | #MACs | #params | #FM values² | I/O ratio FM/params³ (TWN-inf. / FP-inf. / training)
Recognition
ResNet-50 [28] | 77.2 / 93.3 | ILSVRC'12 | 224×224 | 4.1 G | 25.6 M | 11.1 M | 9.1 / 0.9 / 27.8
DenseNet-121 [29] | 76.4 / 93.3 | ILSVRC'12 | 224×224 | 2.9 G | 8.0 M | 6.9 M | 17.2 / 1.7 / 55.3
SqueezeNet [30] | 57.5 / 80.3 | ILSVRC'12 | 224×224 | 355.9 M | 1.2 M | 2.6 M | 42.4 / 4.2 / 134.1
ShuffleNet v2 [31] | 69.4 / — | ILSVRC'12 | 224×224 | 150.6 M | 2.3 M | 2.0 M | 17.2 / 1.7 / 54.8
MobileNetV2 [32] | 72.0 / — | ILSVRC'12 | 224×224 | 320.2 M | 3.5 M | 6.7 M | 38.4 / 3.8 / 121.9
MnasNet [33] | 75.6 / 92.7 | ILSVRC'12 | 224×224 | 330.2 M | 4.4 M | 5.5 M | 25.2 / 2.5 / 79.7
Detection
YOLOv3 [34] | 57.9% AP | COCO det. | 480×640 | 9.5 G | 61.6 M | 68.4 M | 22.2 / 2.2 / 71.1
YOLOv3-tiny [34] | 33.1% AP | COCO det. | 480×640 | 800.0 M | 8.7 M | 10.7 M | 24.2 / 2.4 / 78.4
OpenPose [35] | 65.3% mAP | COCO keyp. | 480×640 | 50.4 G | 52.3 M | 132.5 M | 51.5 / 5.1 / 162.1
MultiPoseNet-50 [36] | 64.3% mAP | COCO keyp. | 480×640 | 13.3 G | 36.7 M | 96.0 M | 52.5 / 5.2 / 167.4
MultiPoseNet-101 [36] | 62.3% mAP | COCO keyp. | 480×640 | 16.8 G | 55.6 M | 119.9 M | 43.4 / 4.3 / 138.0

¹ This resolution is used to determine the number of multiply-accumulate operations and feature map values. For the detection CNNs, this differs from the one used during training.
² We count the number of feature map values wherever they are activated (e.g., by a ReLU layer).
³ Feature map values are assumed to be 16 bit and counted twice since they are written and read (most HW accelerators require multiple reads, though). Modes — TWN inference: batch size 1, ternary weights; full-precision inference: batch size 1, 16-bit weights; training: batch size 32, 16-bit weights.
The latter is the larger contributor to the overall transfers, as highlighted in Table I, and previous work has already shown that the parameters can be compressed and/or quantized even to ternary representations (1.58 bit) with negligible accuracy loss [19]. This further highlights the need for feature map compression, as the feature maps now outweigh the parameters by 20–50× (16 bit vs. 1.58 bit and 2–5× more values), and the energy share spent on feature map I/O and buffering grows even more dominant with the simpler arithmetic operations [17, 20, 16, 37] and further work on model compression.
In this paper, we propose and evaluate a simple compression scheme for intermediate feature maps that exploits sparsity as well as the distribution of the remaining values. It is suitable for a very small and energy-efficient hardware implementation (<300 bit of registers) and could be inserted as a stream (de)compressor before/after a DMA controller to compress the data by 4.4× for 8-bit AlexNet.
Contributions
We extend our work in [38] and make the following main contributions:

A comparison of state-of-the-art DNNs regarding bandwidth and/or memory size requirements for parameters and feature maps, showing the relevance of feature map compression;

An in-depth analysis of the feature and gradient maps’ properties indicating compressibility;

The proposal of a novel, state-of-the-art and hardware-friendly compression scheme;

A thorough evaluation of its capabilities for inference as well as training (compressing gradient maps);

A hardware architecture for the compressor and decompressor and a detailed analysis of its implementation in 65 nm CMOS technology.
II Related Work
Several works describe hardware accelerators which exploit feature map sparsity to reduce computation: Cnvlutin [8], SCNN [9], Cambricon-X [10], NullHop [11], Eyeriss [12], EIE [13]. Their focus is on power gating or skipping some of the operations and memory accesses. This entails defining a scheme to feed the data into the system. They all use one of three methods:

Zero-RLE (used in SCNN): A simple run-length encoding for the zero values, i.e. a single prefix bit followed by either the number of zero values or the non-zero value.

Zero-free neuron array format (ZFNAf) (used in Cnvlutin): Similar to the widely used compressed sparse row (CSR) format, non-zero elements are encoded with an offset and their value.

Compressed column storage (CCS) format (e.g. used in EIE, and similar to NullHop): Similar to ZFNAf, but the offsets are stored in relative form, thus requiring fewer bits to store them. A few bits are sufficient, and in case the offset range is exhausted, a zero value can be encoded as if it were non-zero.
Other compression methods focus on minimizing the model size and are too complex (in terms of silicon area) to implement in hardware. One such method, deep compression [23], combines pruning, trained clustering-based quantization, and Huffman coding. Most of these steps cannot be applied to the intermediate feature maps, which change for every inference, as opposed to the weights, which are static and can be optimized offline. Furthermore, applying Huffman coding—while being optimal in terms of compression rate given a specification of input symbols and their statistics—implies storing a very large dictionary: encoding a 16-bit word requires a table with 2^16 (≈65 k) entries, but effectively multiple values would have to be encoded jointly in order to exploit their joint distribution (e.g. the smoothness), immediately increasing the dictionary size to 2^32 (≈4 G) entries even for just two values. Similar issues arise when using Lempel-Ziv-Welch (LZW) coding [39, 40] as present in e.g. the ZIP compression scheme, where the dictionary is encoded in the compressed data stream. This makes these methods unsuitable for a lightweight and energy-efficient VLSI implementation [41, 42].
A few more methods compress the weights [21, 22] or the feature maps [26, 27, 43] by changing the CNN’s structure. However, they require altering the CNN’s model and/or retraining, and they introduce some accuracy loss. Furthermore, they can only be used to compress a few feature maps at specific points within the network and introduce additional compute effort, such as applying a Fourier transform to the feature maps.
The most directly comparable approach, cDMA [25], describes a hardware-friendly compression scheme to reduce the data size of intermediate feature maps. Their target application differs in that their main goal is to allow faster temporary offloading of the feature maps from GPU to CPU memory through the PCIe bandwidth bottleneck during training, thereby enabling larger batch sizes and deeper and wider networks without sacrificing performance. They propose to use zero-value compression (ZVC), which takes a block of 32 activation values and generates a 32-bit mask where only the bits corresponding to the non-zero values are set. The non-zero values are transferred after the mask. The main advantage over Zero-RLE is that the resulting data volume is independent of how the values of the feature maps are serialized, while also providing small compression ratio advantages. Note that this is a special case of Zero-RLE with a maximum zero burst length of 1.
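To make the ZVC block format concrete, the following minimal Python sketch (our own illustration, not code from [25]) encodes and decodes one block of 32 activations as a presence mask plus the list of non-zero values.

```python
# Minimal sketch of zero-value compression (ZVC) as described for cDMA [25]:
# a block of 32 activations becomes a 32-bit presence mask followed by the
# non-zero values. Our own illustration; not the authors' implementation.
def zvc_encode_block(block):
    assert len(block) == 32
    mask, nonzeros = 0, []
    for i, v in enumerate(block):
        if v != 0:
            mask |= 1 << i          # set the bit corresponding to a non-zero value
            nonzeros.append(v)
    return mask, nonzeros           # cost: 32 mask bits + one word per non-zero value

def zvc_decode_block(mask, nonzeros):
    it = iter(nonzeros)
    return [next(it) if (mask >> i) & 1 else 0 for i in range(32)]

blk = [0, 7, 0, 0, 3] + [0] * 27
assert zvc_decode_block(*zvc_encode_block(blk)) == blk
```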
For this work, we build on a method known from the area of texture compression for GPUs, bit-plane compression (BPC) [44], fuse it with sparsity-focused compression methods, and evaluate the resulting compression algorithm on intermediate feature maps and gradient maps to show compression ratios of 5.1× (8-bit AlexNet), 4× (VGG-16), 2.4× (ResNet-34), 2.8× (SqueezeNet), and 2.2× (MobileNetV2).
III Compression Algorithm
An overview of the proposed algorithm is shown in Fig. 2. We motivate its structure based on the properties we have observed in feature maps (cf. Section V-B):

Sparsity: The value stream is decomposed into a zero/non-zero stream, on which we apply run-length encoding to compress the zero bursts commonly occurring in the data.

Smoothness: Spatially neighboring values are typically highly correlated. We thus compress the non-zero values using bit-plane compression. The latter compresses a fixed number of words jointly, and the resulting compressed bitstream is injected into the output stream as soon as a full block of non-zero values has been compressed.
The resulting algorithm can be viewed as an extension of bit-plane compression to better exploit the sparsity present in most feature maps.
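As a reading aid, the following Python sketch shows how the value stream is split into the two streams described above; it is a simplified structural model under our own assumptions (block grouping of the non-zero values, the last block possibly being partial), not the reference implementation.

```python
# Structural sketch of the proposed scheme (our illustration): split the value
# stream into a zero/non-zero flag stream (to be run-length encoded, Sec. III-A)
# and a stream of non-zero values grouped into fixed-size blocks for bit-plane
# compression (Sec. III-B).
def ebpc_split(values, block_size=8):
    flags = [0 if v == 0 else 1 for v in values]          # input to the zero-RLE
    nonzeros = [v for v in values if v != 0]              # input to the BPC stage
    blocks = [nonzeros[i:i + block_size]
              for i in range(0, len(nonzeros), block_size)]
    return flags, blocks                                  # last block may be partial

flags, blocks = ebpc_split([3, 0, 0, 0, 5, 1, 0, 2, 0, 0, 4, 7, 9, 6, 0, 8])
print(flags)    # [1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1]
print(blocks)   # [[3, 5, 1, 2, 4, 7, 9, 6], [8]]
```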
III-A Zero/Non-Zero Encoding with RLE
The run-length encoder simply compresses bursts of 0s with a single 0 followed by a fixed number of bits which encode the burst length. Non-zero values, at this point 1-bits, are not run-length encoded, i.e. for each of them a 1 is emitted. If the length of a zero burst exceeds the corresponding maximum burst length, the maximum is encoded and the remaining bits are encoded independently, i.e. in the next code symbol.
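A compact Python model of this encoder is given below (our own sketch; the exact width and meaning of the burst-length field—here, the burst length minus one on ceil(log2(max_burst)) bits—is one possible convention, not necessarily the one used in the hardware).

```python
import math

# Sketch of the zero-RLE described above: a zero burst of length L
# (1 <= L <= max_burst) is emitted as '0' followed by L-1 on a fixed number
# of bits; every non-zero flag is emitted as a single '1'. Longer bursts are
# split and the remainder is encoded in the next code symbol.
def zero_rle_encode(flags, max_burst=16):
    nbits = math.ceil(math.log2(max_burst))
    out, i = [], 0
    while i < len(flags):
        if flags[i]:
            out.append('1')
            i += 1
        else:
            run = 0
            while i < len(flags) and not flags[i] and run < max_burst:
                run += 1
                i += 1
            out.append('0' + format(run - 1, f'0{nbits}b'))
    return ''.join(out)

# 5 zeros, a one, 2 zeros -> '00100' + '1' + '00001'
print(zero_rle_encode([0, 0, 0, 0, 0, 1, 0, 0]))
```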
III-B Bit-Plane Compression
An overview of the bit-plane compressor (BPC) used to compress the non-zero values is shown in Fig. 1. For BPC, a data block consisting of a fixed number of data words is compressed by first building the differences between each two consecutive words and storing the first word as the base. This exploits the fact that neighboring values are often similar, which concentrates the distribution of the resulting difference values around zero.

The data items storing these differences are then viewed as bit-planes, one per bit position of the delta words (delta bit-planes, DBPs). Neighboring DBPs are XORed, now called DBX, and the DBP of the most significant bit is kept as the base DBP. The results are fed into bit-plane encoders, which compress the DBX and DBP values to a bitstream following Table 2(a). Most of these encodings are applied independently per DBX symbol. However, the first one can be used to jointly encode multiple consecutive bit-planes at once, if they are all zero. This is where the correlation of neighboring values is best exploited. Note also the importance of the XORing step in order to map two’s-complement negative values close to zero also to words consisting mostly of zero-bits.
As 1) the bit-plane encoding is a prefix code, 2) both the block size and word width are fixed, and 3) the representation of the block’s words as (base, base DBP, DBX symbols) is invertible, the resulting bitstream of the base (word 0) followed by all the encoded symbols can be decompressed into the original data block.
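The invertible part of this transform (base word, two’s-complement deltas, DBPs, DBX) can be modeled in a few lines of Python. The sketch below is our own illustration and deliberately omits the prefix coding of the symbols (Table 2(a)), whose exact code words we do not reproduce here; the orientation of the XOR chain is one possible convention.

```python
import numpy as np

# Our sketch of the BPC transform on one data block and its inverse,
# demonstrating that (base, base DBP, DBX planes) losslessly represents
# the block. The per-plane prefix coding is omitted.
def bpc_transform(block, word_width=8):
    block = np.asarray(block, dtype=np.int64)
    base = int(block[0])
    deltas = block[1:] - block[:-1]                  # fit into word_width+1 bits (signed)
    dw = word_width + 1
    udeltas = deltas & ((1 << dw) - 1)               # two's-complement view on dw bits
    dbp = [(udeltas >> i) & 1 for i in range(dw)]    # dbp[i]: bit i of every delta
    base_dbp = dbp[dw - 1]                           # most significant bit-plane is kept
    dbx = [dbp[i] ^ dbp[i + 1] for i in range(dw - 1)]   # XOR of neighboring planes
    return base, base_dbp, dbx

def bpc_inverse(base, base_dbp, dbx, word_width=8):
    dw = word_width + 1
    dbp = [None] * dw
    dbp[dw - 1] = base_dbp
    for i in range(dw - 2, -1, -1):                  # undo the XOR from the MSB downwards
        dbp[i] = dbx[i] ^ dbp[i + 1]
    udeltas = sum(dbp[i].astype(np.int64) << i for i in range(dw))
    deltas = udeltas - ((udeltas >> (dw - 1)) << dw) # back to signed delta values
    return [base] + [int(v) for v in np.cumsum(deltas) + base]

blk = [12, 13, 13, 10, 0, 0, 3, 7]
assert bpc_inverse(*bpc_transform(blk)) == blk       # round trip is lossless
```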
We have analyzed the code symbol distribution in Fig. 2(b) across all ResNet-34 feature maps with 8-bit fixed-point quantization. Similar histograms are obtained for 16-bit fixed-point and/or other networks. The 1.25 M blocks result in 11.25 M symbols, of which 5.1 M are uncompressed bit-planes, 1.2 M are multi-all-0 DBX symbols encoding 5.4 M all-zero bit-planes, 0.5 M are single-1 symbols, and 0.2 M are symbols for bit-planes with two consecutive one-bits.
As we are processing a stream of data, transmitting the base can be omitted in favor of reusing the last word of the previous block. As the compression is lossless, the last decoded word of the previous block is then used as the base for decoding the next block. When starting to transmit a new stream of data, either the base of the first block can be transmitted, or the base can be initialized to zero.
III-C Data Types
The proposed compression method can be applied to integers of various word widths and for various block sizes. It also works with floating-point words, in which case the deltas do not need an additional bit and correspondingly there is one less DBP and DBX symbol (cf. Fig. 1). The floating-point subtraction is not exactly (bit-true) invertible, hence a minimal and in practice negligible loss can be expected. Floating-point numbers are known to be notoriously hard to compress: while the DBX symbols corresponding to the fraction bits are almost equiprobably ‘1’ or ‘0’, those for the exponent and sign bits are often all-zero and thus compressible.
Notably, this compression method handles variable-precision input data types very well. For example, 10-bit values can be represented as 16-bit values and fed into a bit-plane compressor for 16-bit values. First, all the benefits coming from sparsity remain. Then, once a data block of such values is converted into the DBX representation, there will be 6 additional all-zero DBX symbols. These are in the worst case encoded together into a single multi-all-0 DBX symbol, adding a mere 7 bit to the overall block’s code stream; in the best case, the additional all-zero DBX symbols can be encoded into an existing adjacent symbol. Similarly, not only reduced bit widths, but generally reduced value ranges will have a positive impact on the length of the compressed bit stream. This can be used to alter the trade-off between accuracy and energy efficiency on the fly.
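The effect can be checked numerically with the bpc_transform sketch from Section III-B: feeding 10-bit values into a 16-bit transform leaves the DBX planes for bit positions 10–15 all-zero, because the corresponding delta bits are pure sign extension and neighboring sign-extension planes XOR to zero.

```python
import numpy as np

# Reuses the bpc_transform() sketch from Sec. III-B. 10-bit values in a
# 16-bit block: the deltas fit into 11 signed bits, so bit-planes 10..16 are
# all sign extension and the 6 DBX planes 10..15 come out all-zero.
rng = np.random.default_rng(0)
blk = list(rng.integers(0, 1 << 10, size=8))
base, base_dbp, dbx = bpc_transform(blk, word_width=16)
all_zero_planes = [i for i, plane in enumerate(dbx) if not plane.any()]
print(len(dbx), all_zero_planes)    # 16 DBX planes; at least planes 10..15 are all-zero
```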
While the focus here is on evaluating the compression rate on feature and gradient maps of CNNs, such a (de)compressor will be beneficial for any smooth data (images/textures, audio data, spectrograms, biomedical signals, …) and/or sparse data (event streams, activity maps, …).
IV Hardware Architecture & Implementation
The compression scheme’s elements have been selected such that it is particularly suitable for a lightweight hardware implementation: no codebook needs to be stored, and only a few data words need to be kept in memory. To verify this claim, we present a hardware architecture in this section from which we obtain implementation results. For both the compressor and decompressor, we chose to target a throughput of approximately 1 word per cycle. We have separate output data streams for the compressed zero/non-zero stream and the bit-plane compressed data, which could optionally be packed into a single compressed bit stream. In the following, we use a block size of 8 and 8-bit fixed-point data words.
IV-A Compressor
In Fig. 4 we show the hardware architecture of the encoder. On top, we show the Zero-RLE compressor—a simple comparator to zero (an 8-input NOR gate) followed by a counter and a multiplexer which selects a ’1’ in case of a non-zero value or the zero count if one or more zeros have been received. Towards the end of the unit, variable-length symbols are packed into 8-bit words for connection to a memory bus: a register is filled with shifted data until at least 8 bits have been collected, at which point an 8-bit word is sent out and the remaining bits in excess of 8 are shifted to the LSB side. At the same time, any non-zero values are processed by the Delta and DBP/DBX Transform block. The first word is written to the base word register; all subsequent words of the block are each subtracted from their previous value, and the differences are pushed into a shift register which is read in parallel once a complete block has been aggregated (now interpretable as bit-planes), taking the pairwise XOR of the DBPs to obtain the DBX symbols (for ease of implementation, the first DBP is XORed with zero, i.e. directly taken as the DBX symbol). The entire block of symbols is then pushed into a FIFO of depth 1, such that new input words can be accepted while the bit-planes are iteratively encoded, allowing an average throughput of up to 0.8 words/cycle. The data block is then read by the DBP/DBX encoder, which encodes each bit-plane as a bit vector and its length. The resulting variable-length data is then packed with a circuit similar to the packer in the Zero-RLE block to produce fixed-length 8-bit words.
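The packing step can be modeled in software as follows (our behavioral sketch of the Packer described above, using a bit-string buffer rather than the shift-register implementation; the zero-padding in flush is our assumption for the end of a stream).

```python
# Behavioral model of the Packer: variable-length symbols are appended to a
# small bit buffer; whenever at least 8 bits are available, an 8-bit word is
# emitted and the remaining bits stay at the front of the buffer.
class Packer:
    def __init__(self, out_width=8):
        self.out_width = out_width
        self.buf = ""                              # bit buffer, MSB first

    def push(self, bits: str):
        self.buf += bits
        words = []
        while len(self.buf) >= self.out_width:
            words.append(int(self.buf[:self.out_width], 2))
            self.buf = self.buf[self.out_width:]
        return words

    def flush(self):                               # pad the last partial word with zeros
        return [int(self.buf.ljust(self.out_width, "0"), 2)] if self.buf else []

p = Packer()
print(p.push("001"))       # []    (only 3 bits buffered so far)
print(p.push("1101101"))   # [59]  (0b00111011 emitted; 2 bits remain buffered)
print(p.flush())           # [64]  (leftover '01' padded to '01000000')
```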
Although the throughput of the bit-plane compression part of the circuit is limited to 0.8 word/cycle, this constitutes a worst-case scenario. When zero values are encountered, the Zero-RLE block handles the workload while the processing of the non-zero words continues in parallel. This way, the compressor can be operated closer to 1 word/cycle on average.
IV-B Decompressor
The decompressor shown in Fig. 5 reverts the steps of the encoder. After inverting the Zero-RLE encoding, the bit-plane compressed data stream is read in 8-bit words and unpacked into variable-length data chunks. The Unpacker always provides 8 valid data bits to the Symbol Decoder, which decodes the symbol into a DBP or DBX word and feeds the effective symbol length back to the Unpacker. In case of a DBX word, it is XORed with the previous DBP, such that DBP words are emitted to the Buffer unit—or the base word is forwarded in case of the first 8 bits of the block. The Buffer block aggregates the DBPs and the base word, buffering them for the Delta Reverse unit, which iteratively accumulates the delta symbols and emits the decompressed words. The Buffer unit with its built-in FIFO allows unpacking and decoding data (10 cycles/block) while reverting the delta compression (8 cycles).
IV-C Implementation Results
TABLE II
Unit | WW=8 [µm² / GEᵃ] | WW=16 [µm² / GE] | WW=32 [µm² / GE]
Compressor
Zero-RLE | 685 / 476 | 1085 / 753 | 1806 / 1254
Delta & DBX Transf. | 1059 / 735 | 1940 / 1347 | 3832 / 2661
Depth-1 FIFO | 871 / 605 | 1563 / 1085 | 3157 / 2192
DBP/DBX Encoder | 483 / 335 | 928 / 644 | 1750 / 1215
Packer | 432 / 300 | 808 / 561 | 1625 / 1128
Totalᶜ | 4079 / 2833 | 6880 / 4778 | 12611 / 8792
Decompressorᵇ
Unpacker | 409 / 284 | 995 / 691 | 2290 / 1590
Symbol Decoder | 338 / 235 | 460 / 319 | 684 / 475
Buffer | 1582 / 1099 | 2966 / 2060 | 5856 / 4067
Delta Reverse | 277 / 192 | 507 / 352 | 1099 / 763
Totalᶜ | 2488 / 1728 | 4606 / 3199 | 9569 / 6645

ᵃ Gate equivalents (GEs): size expressed in terms of the area of 2-input NAND gates. 1 GE: 1.44 µm² (umc 65 nm), 0.49 µm² (ST 28 nm FDSOI), 0.20 µm² (GlobalFoundries 22 nm).

ᵇ Without inverse Zero-RLE.

ᶜ The total area deviates from the sum of the blocks as the synthesizer performs optimizations across the blocks.
We have implemented the described architecture for a UMC 65 nm low-leakage process and synthesized the design. We report the area and a per-unit breakdown for a block size of 8 (i.e. the optimal case, cf. Section V-C) and 8-bit, 16-bit and 32-bit words at a target frequency of 600 MHz in Table II. For most inference applications, 8-bit feature maps are sufficient, such that the compressor and decompressor fit into a mere 0.004 mm² and 0.0025 mm², respectively. For comparison, the area of both together is similar to a single 32-bit integer multiplier, which requires a minimum of 0.0065 mm².
Synthesis of the circuit for lower frequencies does not reduce the area, while at 1 GHz and 1.5 GHz the area of the compressor grows to 4337 µm² and 5180 µm², respectively. For higher frequencies, timing closure could not be attained, with the longest path passing from the DBX multiplexer’s control input in the DBP/DBX Encoder to the register in the Packer.
Directly scaling up the word size from 8 bit to 16 bit increases the area of the compressor by 70% and that of the decompressor by 85%. Increasing it further from 16 bit to 32 bit adds another 83% or 108%, respectively. Increasing the word width does not have any effect on the size of the DBP/DBX Encoder, as it works on bit-planes, but it requires more iterations. Using multiple DBP/DBX Encoder units might thus be considered to avoid bottlenecking the throughput, which in turn increases the size of the Packer and Unpacker, as they have to take data from all encoders and feed all decoders. The size of the Packer and Unpacker also increases with the word width, as the register size grows and so does the number of multiplexers in the barrel shifters.
Scaling up the throughput can be achieved by doubling the capacity of each unit: reading two words into the compressor, computing both differences in the same cycle, and increasing the size of the input port of the shift register. Similarly, multiple encoders can be used to compress two bit-planes per cycle. This will only have a limited impact on the area of this part of the compressor, which is mostly defined by the size of the shift register and the FIFO, which do not need to grow. The main impact will be visible in the Packer and later the Unpacker units, where the barrel shifters have to take twice as wide words and shift twice as far when doubling the throughput, and hence grow quadratically—a problem inherent to packing the data of any variable-symbol-size compressor. For the decompressor, there can similarly be multiple symbol decoders, and the Delta Reverse unit can be modified to process two words per cycle. Overall, increasing the throughput this way can be expected to scale sub-linearly in area when processing a few words in parallel, but once reaching close to full parallelization (i.e. 8× for block size 8 and word width 8 bit), the size of the Packer and Unpacker will take up most of the circuit’s area. However, the throughput can also be scaled at linear area cost by instantiating multiple complete (de)compressors to work on individual feature maps in parallel or on separate spatial tiles of the feature maps.
IV-D System Integration
The presented compression scheme can be used to reduce the energy spent on interfaces to external DRAM and on inter-chip or backplane communication—the corresponding standards specify very efficient power-down modes [45, 46]—and to reduce the required bandwidth of such interfaces, lowering the cost of packaging, circuit boards, and additional on-chip circuits (e.g. PLLs, on-chip termination, etc.) [45, 46].
Given its limited size, it can also be used to reduce the size of on-chip data transfers, e.g. from large background L2 memories in large DNN inference chips that try to fit all data on chip, such as the one Tesla has presented for its next generation of self-driving cars or the hardware by Graphcore [47].
Streaming HW Accelerators
The (de)compressor could be integrated with an accelerator such as YodaNN [20], which reaches a state-of-the-art core energy efficiency of 60 TOp/s/W for binary-weight DNNs. For the specific case of YodaNN, however, taking the I/O energy cost into account adds 15.28 mW to the core’s 0.26 mW, bottlenecking the efficiency to 1 TOp/s/W. A drop-in addition of 8 compressor and decompressor units—YodaNN works on 8 feature maps at the input and output in parallel—would reduce the I/O cost and directly increase its energy efficiency at the system level by 2–4× (cf. Section V-D) while adding only 0.05 mm² (6%) to the 0.86 mm² of core area.
HW Accelerator with Feature Maps On-Chip
Another application scenario would be with Hyperdrive [37] and large industrial chips such as Tesla’s platform for its next generation of cars, which store the feature maps on chip. Memory inherently takes up a large share of such a design, for the case of Hyperdrive specifically 65%. With a compression scheme providing a reliable compression ratio across different input images and for all layer pairs (in a ping-pong buffering scheme), we can reduce the memory size by around 2× (cf. Section V-E), saving almost as much silicon area.
Integration into a Heterogeneous Many-Core Accelerator
A further use case is the integration into a heterogeneous accelerator with multiple cores and/or accelerators working from a local scratchpad memory, where data is prefetched from a different level in the memory hierarchy. An example is the GAP8 SoC [48], which can be used for DNN-based autonomous navigation of nano-drones [49] and in which 8 cores and a CNN hardware accelerator share a 64 kB L1 scratchpad memory that is loaded with data from the 512 kB L2 memory using a DMA controller. In such systems, SRAM memory accesses and data movement across interconnects can make up a significant share of the overall power, and memory space is generally a scarce resource. Integrating the proposed (de)compressor into the DMA would improve both aspects jointly in such a system.
V Results
V-A Experimental Setup
Where not otherwise stated, we perform our experiments on AlexNet and use images from the ILSVRC validation set. The models we used were pretrained and downloaded from the PyTorch/Torchvision data repository wherever possible, and identical preprocessing was applied to the data. For the gradient analyses, we trained the same networks ourselves (using the code available at https://github.com/spallanzanimatteo/QuantLab). Some of the experiments are performed with fixed-point data types (default: 16-bit fixed-point). The feature maps were normalized to span 80% of the full range before applying uniform quantization in order to use the full value range up to a safety margin to prevent overflows. All the feature maps were extracted after the ReLU (or ReLU6) activations. The code to reproduce these experiments is available online at https://github.com/lukascch/ExtendedBitPlaneCompression.
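The quantization step can be sketched as follows; this is our reading of the setup (per-tensor scaling of post-ReLU activations so that the maximum spans 80% of the unsigned fixed-point range), not necessarily the exact code used for the experiments.

```python
import numpy as np

# Sketch of the fixed-point quantization of post-ReLU feature maps: scale so
# that the observed maximum spans 80% of the unsigned range (safety margin
# against overflow), then round to integers. Per-tensor scaling is our assumption.
def quantize_fmap(fmap: np.ndarray, bits: int = 16, margin: float = 0.8) -> np.ndarray:
    qmax = (1 << bits) - 1
    scale = margin * qmax / max(float(fmap.max()), 1e-12)
    return np.clip(np.round(fmap * scale), 0, qmax).astype(np.uint32)

fm = np.maximum(np.random.randn(64, 56, 56), 0.0)      # post-ReLU activations
q = quantize_fmap(fm, bits=16)
print(q.dtype, int(q.max()))                            # uint32, <= 0.8 * 65535
```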
V-B Sparsity, Activation Histogram & Data Layout
Neural networks are known to have sparse feature maps after applying a ReLU activation layer, which can be done on the fly after the convolution layer and possibly batch normalization. However, sparsity varies significantly for different layers within the network as well as for different CNNs. Sparsity is a key aspect when compressing feature maps, and we analyze it quantitatively with statistics collected across 250 random ILSVRC’12 validation images for each layer of AlexNet as well as the more modern and size-optimized MobileNetV2 in Fig. 6. For AlexNet, we can clearly see the increase in sparsity from earlier to later layers. For MobileNetV2, multiple effects overlap. Overall, the feature maps later in the network are more sparse, and generally this is correlated with the number of feature maps (also in AlexNet). Feature maps following expanding 1×1 convolutions (e.g. 15, 17, 19, 21) generally show lower sparsity (25–40%) than those after the depthwise separable 3×3 convolutions (e.g. 16, 18, 20, 22; sparsity 50–65%), where for the latter there are exceptions (e.g. 8, 14, 28) when these convolutions were strided (sparsity 20–35%). This aligns with intuition, as the 1×1 layers combine feature maps to later be filtered, and the depthwise 3×3 convolution layers literally perform the filtering.
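One way to gather such per-layer statistics is with forward hooks on the activation layers, as in the following PyTorch sketch; the choice of hooked modules, the random input, and the averaging are our illustration of the methodology, not the exact evaluation code.

```python
import torch
import torchvision

# Sketch: measure per-layer post-ReLU sparsity with forward hooks. Replace the
# random input with preprocessed ILSVRC'12 validation images for real numbers.
model = torchvision.models.alexnet(pretrained=True).eval()
sparsity = {}

def make_hook(name):
    def hook(module, inp, out):
        sparsity.setdefault(name, []).append(float((out == 0).float().mean()))
    return hook

for name, module in model.named_modules():
    if isinstance(module, torch.nn.ReLU):
        module.register_forward_hook(make_hook(name))

with torch.no_grad():
    model(torch.randn(8, 3, 224, 224))

print({name: sum(v) / len(v) for name, v in sparsity.items()})
```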
Besides the average sparsity, its probability distribution across different frames becomes relevant in case guarantees have to be provided, either due to real-time requirements of a bandwidth-limited hardware accelerator or due to size limits of the memory in which the feature maps are stored (e.g. on-chip SRAM). The whiskers in the box plot mark the 1st and 99th percentile, clearly showing how narrow the distribution of the sparsity is and that we can thus expect a very stable compression rate.
We consider compressing not only the feature maps but also the gradient maps for specialized training hardware, thus raising the question of how sparsity evolves as training progresses. The gradient maps generally exhibit the same sparsity as the corresponding feature maps, as ReLU activations pass a zero gradient wherever the outgoing feature map value was zero. In Fig. 7 we can observe how the various layers all start from around 50% sparsity after random initialization and settle close to their final level within a few epochs. In both networks this is the case after around 15% of the epochs required for full convergence.
The zero values are not independently distributed but rather occur in bursts when the 4D data tensor is laid out in one of the obvious formats. The most commonly used formats are NCHW and NHWC, which are those supported by most frameworks and the widely used Nvidia cuDNN backend. NCHW is the preferred format for cuDNN and the default memory layout; it means that neighboring values in the horizontal direction are stored next to each other in memory before the vertical, channel, and batch dimensions. NHWC is the default format of TensorFlow, has long been used in computer vision, and has the advantage of simple non-strided computation of inner products in the channel (i.e. feature map) dimension. Further reasonable options which we include in our analysis are CHWN and HWCN, although most use cases with hardware acceleration target real-time low-latency inference and thus operate with a batch size of 1. We analyze the distribution of the length of zero bursts for these four data layouts at various depths within the network in Fig. 8.
The results clearly show that having the spatial dimensions (H, W) next to each other in the data stream provides the longest zero bursts (lowest cumulative distribution curve) and thus better compressibility than the other formats. This is also aligned with intuition: feature map values mark the presence of certain features and are expected to be smooth. Inspecting the feature maps of CNNs commonly shows that they behave like ’heat maps’ marking the presence of certain geometric features nearby. Based on these results, we perform all the following evaluations using the NCHW data layout. Note also that the burst length of non-zero values is mostly very short, such that there is limited gain in applying RLE also to the one-bits.
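The burst-length analysis itself reduces to flattening the activation tensor in the respective dimension order and measuring runs of zeros, e.g. as in this sketch (our illustration; the synthetic random tensor merely exercises the code, while on real feature maps the layouts differ as described above).

```python
import numpy as np
from itertools import groupby

# Sketch of the zero-burst-length analysis: permute the NCHW-stored tensor into
# the given layout, flatten it, and collect the lengths of consecutive-zero runs.
def zero_burst_lengths(t, layout="NCHW"):
    order = ["NCHW".index(c) for c in layout]
    stream = np.transpose(t, order).reshape(-1)
    return [len(list(g)) for is_zero, g in groupby(stream == 0) if is_zero]

# Synthetic sparse activation tensor (N, C, H, W) with roughly 60% zeros.
t = np.random.rand(1, 64, 56, 56) * (np.random.rand(1, 64, 56, 56) > 0.6)
for layout in ["NCHW", "NHWC", "CHWN", "HWCN"]:
    bursts = zero_burst_lengths(t, layout)
    print(layout, "mean zero-burst length:", round(float(np.mean(bursts)), 2))
```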
To compress further beyond exploiting the sparsity, the non-zero data has to remain compressible. This is clearly the case, as can be seen when looking at the histograms of the activation distributions shown for some sample layers of AlexNet and MobileNetV2 in Fig. 9—a strong indication that additional compression of the non-zero data is possible.
V-C Selecting Parameters
The proposed method has two parameters: the maximum length of a zero sequence that can be encoded with a single code symbol of the Zero-RLE, and the BPC block size (the number of non-zero words encoded jointly).
Max. Zero Burst Length
We first analyze the effect of varying the maximum zero burst length for Zero-RLE on the compression ratio, without BPC, for various data word widths in Table III.
TABLE III
word width | ZVC | Zero-RLE with increasing max. zero burst length
8 | 2.52 | 2.48 | 2.56 | 2.62 | 2.63 | 2.59 | 2.53
16 | 3.00 | 2.96 | 3.02 | 3.06 | 3.07 | 3.04 | 3.00
32 | 3.30 | 3.28 | 3.32 | 3.34 | 3.35 | 3.33 | 3.31
The optimal value is arguably the same for our proposed method, since a constant offset in compressing the non-zero values does not affect the optimal choice of this parameter (just like the word width has no effect on it). The results also serve as a baseline for Zero-RLE and ZVC. It is worth noting that ZVC corresponds to Zero-RLE with a maximum burst length of 1, yet breaks the trend shown in Table III. This is due to an inefficiency of Zero-RLE in this corner: for a zero burst of length 1, ZVC requires 1 bit, whereas Zero-RLE with a maximum burst length of 2 takes 2 bit. For a zero burst of length 2, ZVC encodes 2 symbols of 1 bit each and Zero-RLE takes 2 bit as well. ZVC thus always performs at least as well for such a short maximum burst length.
BPC Block Size
We analyze the effect of the BPC block size parameter in Fig. 10 at various depths within the network. The best compression ratio is achieved with a block size of 16 across all the layers. A block size of 8 might also be considered to minimize the resources of the (de)compression hardware block at a small drop in compression ratio.
V-D Total Compression Factor
We analyze the total compression factor of all feature maps of AlexNet, VGG-16, ResNet-34, SqueezeNet, and MobileNetV2 in Fig. 11. For AlexNet, we notice the high compression ratio of around 3× already achieved by Zero-RLE and ZVC and that it is very similar for all data types. We further see that pure BPC is not suitable, since it introduces too much overhead when encoding only zero values. For ResNet-34, SqueezeNet, and MobileNetV2, the gains from exploiting only the sparsity are significantly lower at around 1.55×, 1.7× and 1.4×. The proposed method clearly outperforms previous approaches, particularly for 8-bit fixed-point values commonly used in today’s inference accelerators, where we observe compression ratios of 5× (AlexNet), 4× (VGG-16), 2.4× (ResNet-34), 2.8× (SqueezeNet), and 2.2× (MobileNetV2).
When moving from 8-bit fixed-point values to 12 and 16 bit and ultimately to 16-bit floating point, the compression ratios of the methods based on sparsity only (Zero-RLE, ZVC) do not change significantly. This is in line with expectations, since the zero values only make a negligible contribution to the final data size with Zero-RLE and do not have any effect with ZVC. Contrary to this, BPC is very effective for small word widths but loses its benefits as the word width increases. Our proposed method combines the best of the two worlds, starting with a very high compression ratio and slowly converging to ZVC and Zero-RLE as the word width increases. The gains for 8-bit fixed-point data are significantly higher than for other data formats. Most input data—also CNN feature maps—carry the most important information in the more significant bits and, in case of floats, in the exponent. The less significant bits appear mostly as noise to the encoder and cannot be compressed without accuracy loss, such that this behavior—a lower compression ratio for wider word widths—is expected.
V-E Per-Layer Compression Ratio
As already expected from the results on sparsity, the compression ratio is fairly stable across multiple frames. Specifically, the 1st percentile of compression ratios only lies around 20% below the average case for all the networks. Further results on a per-layer basis for AlexNet, ResNet-34, and MobileNetV2 are shown in Fig. 12. While there is significant variability between the layers, the 1% farthest outliers towards lower compression ratios can be found in AlexNet’s first layer at a drop of around 25% from the average ratio. For ResNet-34 and MobileNetV2, even the worst-case variations remain within less than 5% deviation from the mean. This allows scaling down the available bandwidth and/or the corresponding memory size with only a minimal risk of failure. Furthermore, the remaining risk can be reduced even more when processing video streams in real-world applications, where the numerical precision could be scaled down gracefully as described in Section III-C, thus accepting a graceful accuracy degradation in exchange for a smaller compressed bit stream and thereby mitigating potentially catastrophic failure.
For applications in on-device learning, as well as to further boost the throughput of thermally or I/O-limited training accelerators in compute clusters, we have further investigated the compressibility of the gradient maps (cf. Fig. 13). Despite the higher-precision data types required for the gradients, high compression rates can be achieved, mostly higher than or on par with those of the feature maps.
VI Conclusion
We have presented and evaluated a novel compression method for CNN feature maps. The proposed algorithm achieves an average compression ratio of 5.1× on AlexNet (+46% over previous methods), 4× on VGG-16 (+67%), 2.4× on ResNet-34 (+33%), 2.8× on SqueezeNet (+51%), and 2.2× on MobileNetV2 (+30%) for 8-bit data, and thus clearly outperforms the state of the art, while fitting a very tight hardware resource budget with 0.004 mm² and 0.0025 mm² of silicon area in UMC 65 nm at 600 MHz and 0.8 word/cycle. The frequency can be pushed to 1.5 GHz with a slight area increase of 25%.
We further show that the proposed method works well for various data formats and precisions and that the compression ratios are achieved reliably across many images, with outliers only going down to 15% below the average ratio at the 1st percentile. The same method is also applicable to the compression of gradient maps during training, achieving compression rates again more than 20% higher than those achieved for the feature maps in the forward pass.
References
 [1] G. Litjens, T. Kooi, B. E. Bejnordi, A. A. A. Setio, F. Ciompi, M. Ghafoorian, J. A. van der Laak, B. van Ginneken, and C. I. Sánchez, “A survey on deep learning in medical image analysis,” Medical Image Analysis, vol. 42, no. December, pp. 60–88, 2017.
 [2] X. Bian, S. N. Lim, and N. Zhou, “Multiscale fully convolutional network with application to industrial inspection,” in Proc. IEEE WACV, 3 2016, pp. 1–8.
 [3] L. Cavigelli, D. Bernath, M. Magno, and L. Benini, “Computationally efficient target classification in multispectral image data with Deep Neural Networks,” in Proc. SPIE Security + Defence, vol. 9997, 2016.
 [4] B. Wu, F. Iandola, P. H. Jin, and K. Keutzer, “SqueezeDet: Unified, Small, Low Power Fully Convolutional Neural Networks for RealTime Object Detection for Autonomous Driving,” in Proc. IEEE CVPRW, 2017, pp. 129–137.
 [5] K. He, X. Zhang, S. Ren, and J. Sun, “Delving Deep into Rectifiers: Surpassing HumanLevel Performance on ImageNet Classification,” in Proc. IEEE ICCV, 2015, pp. 1026–1034.
 [6] L. Cavigelli and L. Benini, “Origami: A 803 GOp/s/W Convolutional Network Accelerator,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 27, no. 11, pp. 2461–2475, 11 2017.
 [7] L. Cavigelli, D. Gschwend, C. Mayer, S. Willi, B. Muheim, and L. Benini, “Origami: A Convolutional Network Accelerator,” in Proc. ACM GLSVLSI. ACM Press, 2015, pp. 199–204.
 [8] J. Albericio, P. Judd, T. Hetherington, T. Aamodt, N. E. Jerger, and A. Moshovos, “Cnvlutin: IneffectualNeuronFree Deep Neural Network Computing,” in Proc. ACM/IEEE ISCA, 2016, pp. 1–13.
 [9] A. Parashar, M. Rhu, A. Mukkara, A. Puglielli, R. Venkatesan, B. Khailany, J. Emer, S. W. Keckler, and W. J. Dally, “SCNN: An Accelerator for Compressedsparse Convolutional Neural Networks,” in Proc. ACM/IEEE ISCA, 2017, pp. 27–40.
 [10] S. Zhang, Z. Du, L. Zhang, H. Lan, S. Liu, L. Li, Q. Guo, T. Chen, and Y. Chen, “CambriconX: An accelerator for sparse neural networks,” in Proc. IEEE/ACM MICRO, 2016, pp. 1–20.
 [11] A. Aimar, H. Mostafa, E. Calabrese, A. RiosNavarro, R. TapiadorMorales, I.A. Lungu, M. B. Milde, F. Corradi, A. LinaresBarranco, S.C. Liu, and T. Delbruck, “NullHop: A Flexible Convolutional Neural Network Accelerator Based on Sparse Representations of Feature Maps,” IEEE Transactions on Neural Networks and Learning Systems, vol. 30, no. 3, pp. 644–656, 3 2019.
 [12] Y. H. Chen, J. Emer, and V. Sze, “Eyeriss: A Spatial Architecture for EnergyEfficient Dataflow for Convolutional Neural Networks,” in Proc. ACM/IEEE ISCA, 2016, pp. 367–379.
 [13] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J. Dally, “EIE: Efficient Inference Engine on Compressed Deep Neural Network,” in Proc. ACM/IEEE ISCA, 2016, pp. 243–254.
 [14] L. Cavigelli and L. Benini, “CBinfer: Exploiting FrametoFrame Locality for Faster Convolutional Network Inference on Video Streams,” IEEE Transactions on Circuits and Systems for Video Technology, 2019.
 [15] L. Cavigelli, M. Magno, and L. Benini, “Accelerating RealTime Embedded Scene Labeling with Convolutional Networks,” in Proc. ACM/IEEE DAC, 2015, pp. 108:1–108:6.
 [16] R. Andri, L. Cavigelli, D. Rossi, and L. Benini, “Hyperdrive: A Systolically Scalable BinaryWeight CNN Inference Engine for mW IoT EndNodes,” in Proc. IEEE ISVLSI, 2018, pp. 509–515.
 [17] ——, “YodaNN: An UltraLow Power Convolutional Neural Network Accelerator Based on Binary Weights,” in Proc. IEEE ISVLSI, 2016, pp. 236–241.
 [18] M. Courbariaux, Y. Bengio, and J.P. David, “BinaryConnect: Training Deep Neural Networks with binary weights during propagations,” in Adv. NIPS, 2015, pp. 3123–3131.
 [19] A. Zhou, A. Yao, Y. Guo, L. Xu, and Y. Chen, “Incremental Network Quantization: Towards Lossless CNNs with LowPrecision Weights,” in Proc. ICLR, 2017.
 [20] R. Andri, L. Cavigelli, D. Rossi, and L. Benini, “YodaNN: An Architecture for Ultralow Power BinaryWeight CNN Acceleration,” IEEE Transactions on ComputerAided Design of Integrated Circuits and Systems, vol. 37, no. 1, pp. 48–60, 2018.
 [21] W. Chen, J. T. Wilson, S. Tyree, K. Q. Weinberger, and Y. Chen, “Compressing Neural Networks with the Hashing Trick,” in Proc. ICML, vol. 37, 2015, pp. 2285–2294.
 [22] E. Agustsson, F. Mentzer, M. Tschannen, L. Cavigelli, R. Timofte, L. Benini, and L. Van Gool, “SofttoHard Vector Quantization for EndtoEnd Learning Compressible Representations,” in Adv. NIPS, 2017, pp. 1141–1151.
 [23] S. Han, H. Mao, and W. J. Dally, “Deep Compression  Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding,” in ICLR, 2016.
 [24] P. Stock, A. Joulin, R. Gribonval, B. Graham, and H. Jégou, “And the Bit Goes Down: Revisiting the Quantization of Neural Networks,” Facebook AI Research, Tech. Rep., 2019.
 [25] M. Rhu, M. O’Connor, N. Chatterjee, J. Pool, Y. Kwon, and S. W. Keckler, “Compressing DMA Engine: Leveraging Activation Sparsity for Training Deep Neural Networks,” in Proc. IEEE HPCA, 2018, pp. 78–91.
 [26] D. Gudovskiy, A. Hodgkinson, and L. Rigazio, “DNN feature map compression using learned representation over GF(2),” in Proc. ECCV Workshops, 2018, pp. 502–516.
 [27] Y. Wang, C. Xu, C. Xu, and D. Tao, “Beyond Filters: Compact Feature Map for Portable Deep Model,” in Proc. ICML, vol. 70, 2017, pp. 3703–3711.
 [28] K. He, X. Zhang, S. Ren, and J. Sun, “Deep Residual Learning for Image Recognition,” Proc. IEEE CVPR, pp. 770–778, 2015.
 [29] F. Iandola, M. Moskewicz, S. Karayev, R. Girshick, T. Darrell, and K. Keutzer, “DenseNet: Implementing Efficient ConvNet Descriptor Pyramids,” UC Berkeley, Berkeley, CA, USA, Tech. Rep., 2014.
 [30] F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, and K. Keutzer, “SqueezeNet: AlexNetlevel accuracy with 50x fewer parameters and <0.5MB model size,” UC Berkeley, Berkeley, CA, USA, Tech. Rep., 2016.
 [31] N. Ma, X. Zhang, H.T. Zheng, and J. Sun, “ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture Design,” in Proc. ECCV, 2018, pp. 116–131.
 [32] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.C. Chen, “MobileNetV2: Inverted Residuals and Linear Bottlenecks,” in Proc. IEEE CVPR. IEEE, 6 2018, pp. 4510–4520.
 [33] M. Tan, B. Chen, R. Pang, V. Vasudevan, M. Sandler, A. Howard, and Q. V. Le, “MnasNet: Platform-Aware Neural Architecture Search for Mobile,” in Proc. IEEE CVPR, 2019, pp. 2820–2828.
 [34] J. Redmon and A. Farhadi, “YOLOv3: An Incremental Improvement,” University of Washington, Seattle, WA, USA, Tech. Rep., 2018.
 [35] Z. Cao, T. Simon, S.E. Wei, and Y. Sheikh, “Realtime Multiperson 2D Pose Estimation Using Part Affinity Fields,” in Proc. IEEE CVPR. IEEE, 2017, pp. 1302–1310.
 [36] M. Kocabas, S. Karagoz, and E. Akbas, “MultiPoseNet: Fast MultiPerson Pose Estimation Using Pose Residual Network,” in Proc. ECCV, 2018, pp. 417–433.
 [37] R. Andri, L. Cavigelli, D. Rossi, and L. Benini, “Hyperdrive: A MultiChip Systolically Scalable BinaryWeight CNN Inference Engine,” IEEE Journal on Emerging and Selected Topics in Circuits and Systems, vol. 9, no. 2, pp. 309–322, 6 2019.
 [38] L. Cavigelli and L. Benini, “Extended Bit-Plane Compression for Convolutional Neural Network Accelerators,” in Proc. IEEE AICAS, 2018.
 [39] T. A. Welch, “A Technique for HighPerformance Data Compression,” Computer, vol. 17, no. 6, pp. 8–19, 1984.
 [40] J. Ziv and A. Lempel, “Compression of Individual Sequences via Variable-Rate Coding,” IEEE Transactions on Information Theory, vol. 24, no. 5, pp. 530–536, 1978.
 [41] M. B. Lin, J. F. Lee, and G. E. Jan, “A lossless data compression and decompression algorithm and its hardware architecture,” IEEE Transactions on Very Large Scale Integration Systems, vol. 14, no. 9, pp. 925–936, 2006.
 [42] X. Zhou, Y. Ito, and K. Nakano, “An efficient implementation of LZW decompression in the FPGA,” Proc. IEEE IPDPS, pp. 599–607, 2016.
 [43] M. Spallanzani, L. Cavigelli, G. P. Leonardi, M. Bertogna, and L. Benini, “Additive Noise Annealing and Approximation Properties of Quantized Neural Networks,” in arXiv:1905.10452, 5 2019, pp. 1–18.
 [44] J. Kim, M. Sullivan, E. Choukse, and M. Erez, “Bit-Plane Compression: Transforming Data for Better Compression in Many-Core Architectures,” in Proc. IEEE ISCA, 2016, pp. 329–340.
 [45] JEDEC Solid State Technology Association, “Low Power Double Data Rate 4 (LPDDR4),” JEDEC Solid State Technology Association, Tech. Rep. August, 2014.
 [46] ——, “Graphics Double Data Rate (GDDR5) SGRAM,” JEDEC Solid State Technology Association, Tech. Rep. February, 2016.
 [47] N. Toon and S. Knowles, “Graphcore,” 2017. [Online]. Available: https://www.graphcore.ai
 [48] E. Flamand, D. Rossi, F. Conti, I. Loi, A. Pullini, F. Rotenberg, and L. Benini, “GAP8: A RISCV SoC for AI at the Edge of the IoT,” in Proc. IEEE ASAP. IEEE, 7 2018, pp. 1–4.
 [49] D. Palossi, A. Loquercio, F. Conti, E. Flamand, D. Scaramuzza, and L. Benini, “A 64-mW DNN-based Visual Navigation Engine for Autonomous Nano-Drones,” IEEE Internet of Things Journal, vol. PP, no. May, pp. 1–1, 2019.
Lukas Cavigelli received the B.Sc., M.Sc., and Ph.D. degree in electrical engineering and information technology from ETH Zürich, Zürich, Switzerland in 2012, 2014 and 2019, respectively. He has since been a postdoctoral researcher with ETH Zürich. His research interests include deep learning, computer vision, embedded systems, and lowpower integrated circuit design. He has received the best paper award at the VLSISoC and the ICDSC conferences in 2013 and 2017, the best student paper award at the Security+Defense conference in 2016, and the Donald O. Pederson best paper award (IEEE TCAD) in 2019. 
Georg Rutishauser received his B.Sc. and M.Sc. degrees in Electrical Engineering and Information Technology from ETH Zürich in 2015 and 2018, respectively. He is currently pursuing a Ph.D. degree at the Integrated Systems Laboratory at ETH Zürich. His research interests include algorithms and hardware for reducedprecision deep learning, and their application in computer vision and embedded systems. 
Luca Benini is the Chair of Digital Circuits and Systems at ETH Zürich and a Full Professor at the University of Bologna. He has served as Chief Architect for the Platform2012 in STMicroelectronics, Grenoble. Dr. Benini’s research interests are in energyefficient system and multicore SoC design. He is also active in the area of energyefficient smart sensors and sensor networks. He has published more than 1’000 papers in peerreviewed international journals and conferences, four books and several book chapters. He is a Fellow of the ACM and of the IEEE and a member of the Academia Europaea. 