Compact and Computationally Efficient Representation of Deep Neural Networks

Simon Wiedemann, Klaus-Robert Müller, and Wojciech Samek

This work was supported by the Fraunhofer Society through the MPI-FhG collaboration project “Theory & Practice for Reduced Learning Machines”. This research was also supported by the German Ministry for Education and Research as Berlin Big Data Center under Grant 01IS14013A and by the Institute for Information & Communications Technology Promotion and funded by the Korea government (MSIT) (No. 2017-0-01779 and No. 2017-0-00451). S. Wiedemann and W. Samek are with Fraunhofer Heinrich Hertz Institute, 10587 Berlin, Germany (e-mail: wojciech.samek@hhi.fraunhofer.de). K.-R. Müller is with the Technische Universität Berlin, 10587 Berlin, Germany, with the Max Planck Institute for Informatics, 66123 Saarbrücken, Germany, and also with the Department of Brain and Cognitive Engineering, Korea University, Seoul 136-713, South Korea (e-mail: klaus-robert.mueller@tu-berlin.de).
Abstract

Dot product operations between matrices are at the heart of almost any field in science and technology. In many cases, they are the component that requires the highest computational resources during execution. For instance, deep neural networks such as VGG-16 require up to 15 giga-operations in order to perform the dot products present in a single forward pass, which results in significant energy consumption and thus limits their use in resource-limited environments, e.g., on embedded devices or smartphones. One common approach to reduce the complexity of the inference is to prune and quantize the weight matrices of the neural network and to efficiently represent them using sparse matrix data structures. However, since there is no guarantee that the weight matrices exhibit significant sparsity after quantization, the sparse format may be suboptimal. In this paper we present new efficient data structures for representing matrices with low entropy statistics and show that these formats are especially suitable for representing neural networks. Like sparse matrix data structures, these formats exploit the statistical properties of the data in order to reduce the size and execution complexity. Moreover, we show that the proposed data structures can not only be regarded as a generalization of sparse formats, but are also more energy and time efficient under practically relevant assumptions. Finally, we test the storage requirements and execution performance of the proposed formats on compressed neural networks and compare them to dense and sparse representations. We experimentally show that we are able to attain up to x15 compression ratios, x1.7 speed ups and x20 energy savings when we losslessly convert state-of-the-art networks such as AlexNet, VGG-16, ResNet152 and DenseNet into the new data structures.

Neural network compression, computationally efficient deep learning, data structures, sparse matrices, lossless coding.

I Introduction

The dot product operation between matrices constitutes one of the core operations in almost any field in science. Examples are the computation of approximate solutions of complex system behaviors in physics [1], iterative solvers in mathematics [2] and features in computer vision applications [3]. Deep neural networks also rely heavily on dot product operations in their inference [4, 5]; e.g., networks such as VGG-16 require up to 16 dot product operations, which results in 15 giga-operations for a single forward pass. Hence, lowering the algorithmic complexity of these operations and thus increasing their efficiency is of major interest for many modern applications. Since the complexity depends on the data structure used for representing the elements of the matrices, a great amount of research has focused on designing data structures and respective algorithms that can perform efficient dot product operations [6, 7, 8].

Of particular interest are the so-called sparse matrices, a special type of matrix in which many of the elements are zero-valued. In principle, one can design efficient representations of sparse matrices by leveraging the prior assumption that most of their element values are zero and therefore storing only the non-zero entries of the matrix. Consequently, their storage requirements become of the order of the number of non-zero values. However, an efficient representation with regard to storage requirements does not imply that the dot product algorithm associated with that data structure will also be efficient. Hence, a great part of the research has focused on the design of data structures that also have low-complexity dot product algorithms [8, 9, 10, 11]. However, by assuming sparsity alone we are implicitly imposing a spike-and-slab prior (that is, a delta function at 0 and a uniform distribution over the non-zero elements) over the probability mass distribution of the elements of the matrix. If the actual distribution of the elements greatly differs from this assumption, then the data structures devised for sparse matrices become inefficient. Hence, sparsity may be considered too constrained an assumption for some applications of current interest, e.g., the representation of neural networks.

In this work, we alleviate the shortcomings of sparse representations by considering a more relaxed prior over the distribution of the matrix elements. More precisely, we assume that this distribution has low Shannon entropy [12]. Mathematically, sparsity can be considered a subclass of the general family of low entropic distributions. In fact, sparsity measures the min-entropy of the element distribution, which is related to Shannon’s entropy measure through Renyi’s generalized entropy definition [13]. With this goal in mind, we ask the question:

“Can we devise efficient data structures under the implicit assumption that the entropy of the distribution of the matrix elements is low?”

We want to stress once more that by efficiency we mean two related but distinct aspects:

1. efficiency with regard to storage requirements

2. efficiency with regard to algorithmic complexity of the dot product associated to the representation

For the latter, we focus on the number of elementary operations required in the algorithm, since they are related to the energy and time complexity of the algorithm. It is well known that the minimal bit-length of a data representation is bounded by the entropy of its distribution [12]. Hence, matrices with low entropic distributions automatically imply that we can design data structures that do not require high storage resources. In addition, as we will discuss in the next sections, low entropic distributions may also attain gains in efficiency if these data structures implicitly encode the distributive law of multiplications. By doing so, a great part of the algorithmic complexity of the dot product is reduced to the order of the number of distinct elements that appear per row in a matrix. This number is related to the entropy in such a way that it is likely to be small as long as the entropy of the matrix is low. Therefore, these data structures not only attain higher compression gains, but also require fewer operations in total when performing the dot product.

Our contributions can be summarized as follows:

• We propose new highly efficient data structures that exploit the improper prior that the matrix does not entail many distinct elements per row (i.e., low entropy).

• We provide a detailed analysis of the storage requirements and algorithmic complexity of performing the dot product associated with these data structures.

• We establish a relation between the known sparse and the proposed data structures. Namely, sparse matrices belong to the same family of low entropic distributions; however, they may be considered a more constrained subclass of them.

• We show through experiments that indeed, these data structures attain gains in efficiency on simulated as well as real-world data. In particular, we show that up to x15 compression ratios, x1.7 speed ups and x20 energy savings can be achieved when converting state-of-the-art neural networks into the proposed data structures.

In the following Section II we introduce the problem of efficient representation of neural networks and briefly review related literature. In Section III the proposed data structures are given. We demonstrate through a simple example that these data structures are able to: 1) achieve higher compression ratios than their respective dense and sparse counterparts and 2) reduce the algorithmic complexity of performing the dot product. Section IV analyses the storage and energy complexity of these novel data structures. Experimental evaluation is performed in Section V using simulations as well as state-of-the-art neural networks such as AlexNet, VGG-16, ResNet152 and DenseNet. Section VI concludes the paper with a discussion.

II Efficient Representation of Neural Networks

Deep neural networks [14, 15] have become the state of the art in many fields of machine learning, such as computer vision, speech recognition and natural language processing [16, 17, 18, 19], and have progressively also been used in the sciences, e.g., physics [20], neuroscience [21] and chemistry [22, 23]. In their most basic form, they constitute a chain of affine transformations concatenated with a nonlinear function which is applied element-wise to the output. Hence, the goal is to learn the values of those transformation or weight matrices (i.e., parameters) such that the neural network performs its task particularly well. The procedure of calculating the output prediction of the network for a particular input is called inference. The computational cost of performing inference is dominated by computing the affine transformations (thus, the dot products between matrices). Since today’s neural networks perform many dot product operations between large matrices, this greatly complicates their deployment onto resource-constrained devices.

However, it has been extensively shown that most neural networks are overparameterized, meaning that there are many more parameters than actually needed for the tasks of interest [24, 25, 26, 27]. This implies that these networks are highly inefficient with regard to the resources they require when performing inference. This fact motivated an entire research field of model compression [28]. One of the suggested approaches is to design methods that compress the weight elements of the neural network such that their prediction accuracy is minimally affected. For instance, usual techniques include sparsification of the network [29, 30, 31, 27] or quantization of the weight elements [32, 33], or both simultaneously [26, 34, 35]. The advantage of these approaches is that after compression, one can greatly reduce both storage and execution complexity of inference by leveraging the sparsity of the network, e.g., by representing the weight matrices in one of the sparse data structures and performing the dot products accordingly. However, it is not guaranteed that the distribution of the weight elements of the compressed matrices matches the sparse distribution assumption, which consequently would make the respective data structures inefficient.

Figure 1 demonstrates this discrepancy between the sparsity assumption and the real distribution of weight elements. It plots the distribution of the weight elements of the last classification layer of VGG-16 [36] ( dimensional matrix), after having applied uniform quantization on the weight elements. We stress that the prediction accuracy and generalization of the network was not affected by this operation. As we can see, the distribution of the compressed layer does not satisfy the sparsity assumption, i.e., there is not one particular element (such as 0) that appears especially frequently in the matrix. The most frequent value is -0.008 and its frequency of appearance does not dominate over the others (about 4.2%). The entropy of the matrix is about , which indicates that it is highly likely that not many distinct elements appear per row in the matrix (as a rough lower bound estimation one can think of $2^{H}$ distinct elements per row, whenever the column dimension is large). Indeed, we notice that most of the entries are dominated by only 15 distinct values, which is 1.5% of the number of columns of the matrix.

Hence, designing efficient data structures for this type of matrix is of major interest in this application.

III Data structures for matrices with low entropy statistics

In this section we introduce the proposed data structures and show that they implicitly encode the distributive law. Consider the following matrix

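$$M = \begin{pmatrix} 0 & 3 & 0 & 2 & 4 & 0 & 0 & 2 & 3 & 4 & 0 & 4 \\ 4 & 4 & 0 & 0 & 0 & 4 & 0 & 0 & 4 & 4 & 0 & 4 \\ 4 & 0 & 3 & 4 & 0 & 0 & 0 & 4 & 0 & 2 & 0 & 0 \\ 0 & 0 & 0 & 4 & 4 & 4 & 0 & 3 & 4 & 4 & 0 & 0 \\ 0 & 4 & 4 & 0 & 0 & 4 & 0 & 4 & 0 & 0 & 0 & 0 \end{pmatrix}$$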

Now assume that we want to: 1) store this matrix with the minimum amount of bits and 2) perform the dot product with a vector $a$ with the minimum complexity.

III-A Minimum storage

We first comment on the storage requirements of the dense and sparse formats and then introduce two new formats which store matrix $M$ more effectively.
Dense format: Arguably the simplest way to store the matrix is in its so-called dense representation. That is, we store its elements in a $5 \times 12 = 60$ long array (in addition to its dimensions $m = 5$ and $n = 12$).
Sparse format: However, notice that more than 50% of the entries are 0. Hence, we may be able to attain a more compressed representation of this matrix if we store it in one of the well-known sparse data structures, for instance, in the Compressed Sparse Row (or CSR in short) format. This particular format stores the values of the matrix in the following way:

• It scans the non-zero elements in row-major order (that is, from left to right, top to bottom) and stores them in an array (which we denote as W).

• Simultaneously, it stores the respective column indices in another array (which we call colI).

• Finally, it stores pointers that signal where a new row starts (we denote this array as rowPtr).

Hence, our matrix would take the form

W: [3, 2, 4, 2, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 3, 4, 4, 2, 4, 4, 4, 3, 4, 4, 4, 4, 4, 4]
colI: [1, 3, 4, 7, 8, 9, 11, 0, 1, 5, 8, 9, 11, 0, 2, 3, 7, 9, 3, 4, 5, 7, 8, 9, 1, 2, 5, 7]
rowPtr: [0, 7, 13, 18, 24, 28]

If we assume the same bit-size per element for all arrays, then the CSR data structure does not attain higher compression gains in spite of not saving the zero valued elements (62 entries vs. 60 that are being required by the dense data structure).
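
The CSR construction described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function name `to_csr` is hypothetical, while the array names W, colI and rowPtr follow the text.

```python
# Example matrix M from this section (5 rows, 12 columns).
M = [
    [0, 3, 0, 2, 4, 0, 0, 2, 3, 4, 0, 4],
    [4, 4, 0, 0, 0, 4, 0, 0, 4, 4, 0, 4],
    [4, 0, 3, 4, 0, 0, 0, 4, 0, 2, 0, 0],
    [0, 0, 0, 4, 4, 4, 0, 3, 4, 4, 0, 0],
    [0, 4, 4, 0, 0, 4, 0, 4, 0, 0, 0, 0],
]


def to_csr(matrix):
    """Convert a dense matrix (list of rows) into the CSR arrays W, colI, rowPtr."""
    W, colI, rowPtr = [], [], [0]
    for row in matrix:
        for j, value in enumerate(row):
            if value != 0:
                W.append(value)      # non-zero value, row-major order
                colI.append(j)       # its column index
        rowPtr.append(len(W))        # where the next row starts in W / colI
    return W, colI, rowPtr


W, colI, rowPtr = to_csr(M)
csr_entries = len(W) + len(colI) + len(rowPtr)
dense_entries = sum(len(row) for row in M)
print(csr_entries, dense_entries)  # 62 60
```
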
We can improve this by exploiting the low-entropy property of matrix $M$. In the following, we propose two new formats which realize this.
CER format: Firstly, notice that many elements in $M$ share the same value. In fact, only the four values $\{0, 4, 3, 2\}$ appear in the entire matrix. Hence, it appears reasonable to assume that data structures that repeatedly store these values (such as the dense or CSR structures) induce high redundancies in their representation. Therefore, we propose a data structure where we store those values only once. Secondly, notice that some elements appear more frequently than others, and that their relative order does not change across the rows of the matrix. Concretely, we have a set of unique elements $\Omega = \{0, 4, 3, 2\}$ which appear $\{32, 21, 4, 3\}$ times respectively in the matrix, and we obtain the same relative order from most to least frequent value throughout the rows of the matrix. Hence, we can design an efficient data structure which leverages both properties in the following way:

1. Store the unique elements present in the matrix in an array in frequency-major order (that is, from most to least frequent). We name this array Ω.

2. Store the respective column indices in row-major order, excluding the first element of Ω (thus excluding the most frequent element). We denote this array as colI.

3. Store pointers that signal where the positions of the next element of Ω start. We name this array ΩPtr. If a particular pointer in ΩPtr is the same as the previous one, this means that the current element is not present in the current row and we jump to the next element.

4. Store pointers that signal where a new row starts. We name this array rowPtr.

Hence, this new data structure represents matrix $M$ as

Ω: [0, 4, 3, 2]
colI: [4, 9, 11, 1, 8, 3, 7, 0, 1, 5, 8, 9, 11, 0, 3, 7, 2, 9, 3, 4, 5, 8, 9, 7, 1, 2, 5, 7]
ΩPtr: [0, 3, 5, 7, 13, 16, 17, 18, 23, 24, 28]
rowPtr: [0, 3, 4, 7, 9, 10]

Notice that we can uniquely reconstruct $M$ from this data structure. We refer to this data structure as the Compressed Entropy Row (or CER in short) data structure. One can verify that the CER data structure indeed only requires 49 entries (instead of 60 or 62), thereby attaining a compressed representation of the matrix $M$.
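
The four CER construction steps can be sketched as follows. This is a hedged illustration rather than the authors' reference code: the function name `to_cer` is hypothetical, and the rule that trailing absent elements of a row receive no padded pointer is an assumption inferred from the example arrays above.

```python
from collections import Counter

# Example matrix M from this section (5 rows, 12 columns).
M = [
    [0, 3, 0, 2, 4, 0, 0, 2, 3, 4, 0, 4],
    [4, 4, 0, 0, 0, 4, 0, 0, 4, 4, 0, 4],
    [4, 0, 3, 4, 0, 0, 0, 4, 0, 2, 0, 0],
    [0, 0, 0, 4, 4, 4, 0, 3, 4, 4, 0, 0],
    [0, 4, 4, 0, 0, 4, 0, 4, 0, 0, 0, 0],
]


def to_cer(matrix):
    """Build the CER arrays Omega, colI, OmegaPtr, rowPtr of a dense matrix."""
    counts = Counter(v for row in matrix for v in row)
    Omega = [v for v, _ in counts.most_common()]   # frequency-major order
    colI, OmegaPtr, rowPtr = [], [0], [0]
    for row in matrix:
        # Position in Omega of the least frequent element present in this row.
        last = max(Omega.index(v) for v in set(row))
        # Omega[0] is stored implicitly; later elements get one pointer each.
        for value in Omega[1:last + 1]:
            colI.extend(j for j, v in enumerate(row) if v == value)
            OmegaPtr.append(len(colI))   # repeated pointer = element absent (padding)
        rowPtr.append(len(OmegaPtr) - 1)
    return Omega, colI, OmegaPtr, rowPtr


Omega, colI, OmegaPtr, rowPtr = to_cer(M)
cer_entries = len(Omega) + len(colI) + len(OmegaPtr) + len(rowPtr)
print(cer_entries)  # 49
```
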

To our knowledge, this is the first format optimized for representing low entropy matrices.
CSER format: In some cases, it may well be that the probability distributions across the rows of a matrix are not similar to each other. Hence, the second assumption in the CER data structure would not apply and we would only be left with the first one. That is, we only know that not many distinct elements appear per row in the matrix or, in other words, that many elements share the same value. The compressed shared elements row (or CSER in short) data structure is a slight extension of the previous CER representation. Here, we add an element pointer array, which signals which element of Ω the positions refer to (we call it ΩI). Hence, the above matrix would then be represented as follows

Ω: [0, 2, 3, 4]
colI: [4, 9, 11, 1, 8, 3, 7, 0, 1, 5, 8, 9, 11, 0, 3, 7, 2, 9, 3, 4, 5, 8, 9, 7, 1, 2, 5, 7]
ΩI: [3, 2, 1, 3, 3, 2, 1, 3, 2, 3]
ΩPtr: [0, 3, 5, 7, 13, 16, 17, 18, 23, 24, 28]
rowPtr: [0, 3, 4, 7, 9, 10]

Thus, for storing matrix $M$ we require 59 entries, which is still a gain but not a significant one. Notice that the ordering of the elements in Ω is now no longer important.
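
A corresponding sketch for the CSER format is given below. It is illustrative only: the function name `to_cser` is hypothetical, and visiting the distinct values of each row from most to least frequent is an assumption chosen so that the output reproduces the arrays listed above (any per-row order would yield a valid CSER encoding).

```python
from collections import Counter

# Example matrix M from this section (5 rows, 12 columns).
M = [
    [0, 3, 0, 2, 4, 0, 0, 2, 3, 4, 0, 4],
    [4, 4, 0, 0, 0, 4, 0, 0, 4, 4, 0, 4],
    [4, 0, 3, 4, 0, 0, 0, 4, 0, 2, 0, 0],
    [0, 0, 0, 4, 4, 4, 0, 3, 4, 4, 0, 0],
    [0, 4, 4, 0, 0, 4, 0, 4, 0, 0, 0, 0],
]


def to_cser(matrix):
    """Build the CSER arrays Omega, colI, OmegaI, OmegaPtr, rowPtr."""
    flat = [v for row in matrix for v in row]
    Omega = sorted(set(flat))                      # any fixed order works here
    implicit = Counter(flat).most_common(1)[0][0]  # most frequent value, not indexed
    colI, OmegaI, OmegaPtr, rowPtr = [], [], [0], [0]
    for row in matrix:
        for value, _ in Counter(row).most_common():
            if value == implicit:
                continue
            colI.extend(j for j, v in enumerate(row) if v == value)
            OmegaI.append(Omega.index(value))      # which element these columns share
            OmegaPtr.append(len(colI))
        rowPtr.append(len(OmegaI))
    return Omega, colI, OmegaI, OmegaPtr, rowPtr


Omega, colI, OmegaI, OmegaPtr, rowPtr = to_cser(M)
cser_entries = sum(len(x) for x in (Omega, colI, OmegaI, OmegaPtr, rowPtr))
print(cser_entries)  # 59
```
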

The relationship between CSER, CER and CSR data structures is described in Section IV.

III-B Dot product complexity

We just saw that we can attain gains with regard to compression if we represent the matrix in the CER and CSER data structures. However, we can also devise corresponding dot product algorithms that are more efficient than their dense and sparse counterparts. As an example, consider only the scalar product between the second row of matrix $M$ and an arbitrary input vector $a$. In principle, the difference in the algorithmic complexity arises because each data structure implicitly encodes a different expression of the scalar product, namely

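$$\begin{aligned} \text{dense:}\quad & 4a_1 + 4a_2 + 0a_3 + 0a_4 + 0a_5 + 4a_6 + 0a_7 + 0a_8 + 4a_9 + 4a_{10} + 0a_{11} + 4a_{12} \\ \text{CSR:}\quad & 4a_1 + 4a_2 + 4a_6 + 4a_9 + 4a_{10} + 4a_{12} \\ \text{CER/CSER:}\quad & 4(a_1 + a_2 + a_6 + a_9 + a_{10} + a_{12}) \end{aligned}$$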

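For instance, the dot product algorithm associated to the dense format would calculate the above scalar product by

1. loading the 12 elements of the second row of $M$ and the 12 elements of $a$, and

2. calculating $4a_1 + 4a_2 + 0a_3 + 0a_4 + 0a_5 + 4a_6 + 0a_7 + 0a_8 + 4a_9 + 4a_{10} + 0a_{11} + 4a_{12}$.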

This requires 24 load (12 for the matrix elements and 12 for the input vector elements), 12 multiply, 11 add and 1 write operations (for writing the result into memory). We purposely omitted the accumulate operation which stores the intermediate values of the multiply-sum operations, since their cost can effectively be associated to the sum operation. Moreover, we only considered read/write operations from and into memory. Hence, this makes 48 operations in total.

In contrast, the dot product algorithm associated with the CSR representation would only multiply-add the non-zero entries. It does so by performing the following steps

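1. Load the subset of rowPtr corresponding to row 2. Thus, rowPtr → [7, 13].

2. Then, load the respective subset of non-zero elements and column indices. Thus, W → [4, 4, 4, 4, 4, 4] and colI → [0, 1, 5, 8, 9, 11].

3. Finally, load the subset of elements of $a$ corresponding to the loaded subset of column indices and subsequently multiply-add them with the loaded subset of W. Thus, $a$ → $[a_1, a_2, a_6, a_9, a_{10}, a_{12}]$ and calculate $4a_1 + 4a_2 + 4a_6 + 4a_9 + 4a_{10} + 4a_{12}$.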

By executing this algorithm we would require 20 load operations (2 from rowPtr and 6 each from W, colI and the input vector), 6 multiplications, 5 additions and 1 write. In total this dot product algorithm requires 32 operations.
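
The CSR steps above generalize to a full matrix-vector product. The following is a minimal sketch under the assumption of 0-based column indices as in the arrays of Section III-A; the function name `csr_matvec` and the test vector are hypothetical choices made for illustration.

```python
# CSR arrays of the example matrix M, copied from Section III-A.
W = [3, 2, 4, 2, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 3, 4, 4, 2, 4, 4, 4, 3, 4, 4, 4, 4, 4, 4]
colI = [1, 3, 4, 7, 8, 9, 11, 0, 1, 5, 8, 9, 11, 0, 2, 3, 7, 9, 3, 4, 5, 7, 8, 9, 1, 2, 5, 7]
rowPtr = [0, 7, 13, 18, 24, 28]


def csr_matvec(W, colI, rowPtr, a):
    """Multiply a CSR matrix with a dense vector, touching only non-zero entries."""
    out = []
    for r in range(len(rowPtr) - 1):
        start, end = rowPtr[r], rowPtr[r + 1]    # step 1: row boundaries
        acc = 0
        for idx in range(start, end):
            # steps 2 and 3: load value, column index and input element, multiply-add
            acc += W[idx] * a[colI[idx]]
        out.append(acc)
    return out


a = list(range(1, 13))           # illustrative input: a_1 = 1, ..., a_12 = 12
y = csr_matvec(W, colI, rowPtr, a)
print(y[1])  # 160, i.e. 4 * (1 + 2 + 6 + 9 + 10 + 12) for the second row
```
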

However, we can still see that the above dot product algorithm is inefficient in this case since we constantly multiply by the same element 4. Instead, the dot product algorithm associated to, e.g., the CER data structure, would perform the following steps

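1. Load the subset of rowPtr corresponding to row 2. Thus, rowPtr → [3, 4].

2. Load the corresponding subset in ΩPtr. Thus, ΩPtr → [7, 13].

3. For each pair of elements in ΩPtr, load the respective subset in colI and the element in Ω. Thus, Ω → [4] and colI → [0, 1, 5, 8, 9, 11].

4. For each loaded subset of colI, perform the sum of the elements of $a$ corresponding to the loaded colI. Thus, $a$ → $[a_1, a_2, a_6, a_9, a_{10}, a_{12}]$ and do $z = a_1 + a_2 + a_6 + a_9 + a_{10} + a_{12}$.

5. Subsequently, multiply the sum with the respective element in Ω. Thus, compute $4z$.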

A similar algorithm can be devised for the CSER data structure. One can find both pseudocodes in the appendix. The operations required by this algorithm are 17 load operations (2 from rowPtr, 2 from ΩPtr, 1 from Ω, 6 from colI and 6 from $a$), 1 multiplication, 5 additions and 1 write. In total this is 24 operations.
Hence, we have observed that for the matrix $M$, the CER (and CSER) data structure not only achieves higher compression rates, but also attains gains in efficiency with respect to the dot product operation.
In the next section we give a detailed analysis of the storage requirements of the data structures and of the efficiency of the dot product algorithms associated to them. This will help us identify when one type of data structure will attain higher gains than the others.
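
The operation counts of this section (48, 32 and 24 for the second row) can be reproduced with a small sketch. It is not the appendix pseudocode: `cer_row_dot`, `dense_ops` and `csr_ops` are hypothetical helper names, and the counting conventions (memory reads and writes only, accumulation folded into the sum) are taken from the text above.

```python
# CER arrays of the example matrix M, copied from Section III-A.
Omega = [0, 4, 3, 2]
colI = [4, 9, 11, 1, 8, 3, 7, 0, 1, 5, 8, 9, 11, 0, 3, 7, 2, 9, 3, 4, 5, 8, 9, 7, 1, 2, 5, 7]
OmegaPtr = [0, 3, 5, 7, 13, 16, 17, 18, 23, 24, 28]
rowPtr = [0, 3, 4, 7, 9, 10]


def cer_row_dot(r, a):
    """Scalar product of row r (0-based) with a, plus elementary operation counts."""
    ops = {"read": 0, "mul": 0, "sum": 0, "write": 0}
    first, last = rowPtr[r], rowPtr[r + 1]
    ops["read"] += 2                          # two row pointers
    bounds = OmegaPtr[first:last + 1]
    ops["read"] += len(bounds)                # element pointers of this row
    result, terms = 0, 0
    for k in range(len(bounds) - 1):
        lo, hi = bounds[k], bounds[k + 1]
        if lo == hi:                          # padded pointer: element absent in row
            continue
        value = Omega[k + 1]                  # Omega[0] is the implicit element
        ops["read"] += 1
        cols = colI[lo:hi]
        ops["read"] += 2 * len(cols)          # column indices and input elements
        partial = sum(a[j] for j in cols)     # distributive law: sum first ...
        ops["sum"] += len(cols) - 1
        result += value * partial             # ... then multiply once per element
        ops["mul"] += 1
        terms += 1
    ops["sum"] += max(terms - 1, 0)           # adding up the per-element terms
    ops["write"] += 1
    return result, ops


def dense_ops(n):
    return 2 * n + n + (n - 1) + 1            # loads + muls + adds + write


def csr_ops(nnz):
    return (2 + 3 * nnz) + nnz + (nnz - 1) + 1


a = list(range(1, 13))                        # illustrative input vector
value, ops = cer_row_dot(1, a)                # second row of M
print(dense_ops(12), csr_ops(6), sum(ops.values()))  # 48 32 24
```
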

IV An analysis of the storage and energy complexity of data structures

Without loss of generality, in the following we assume that we aim to encode a particular matrix $M$, where its elements take values from a finite set of elements $\Omega$. Moreover, we assign to each element a probability mass value , where counts the number of times the element appears in the matrix $M$. We denote the respective set of probability mass values . In addition, we assume that each element in $\Omega$ appears at least once in the matrix (thus, for all ) and that is the most frequent value in the matrix. We also denote with the subset of elements that appear at row (where for all ) and with their respective frequencies. Finally, we order the elements in $\Omega$ and in probability-major order, that is, .

IV-A Measuring the energy efficiency of the dot product

This work proposes representations that are efficient with regard to storage requirements as well as their dot product algorithmic complexity. For the latter, we focus on the energy requirements, since we consider them the most relevant measure for neural network compression. However, exactly measuring the energy of an algorithm may be unreliable since it depends on the software implementation and on the hardware the program is running on. Therefore, we will model the energy costs in a way that can easily be adapted across different software implementations as well as hardware architectures.

In the following we model a dot product algorithm by a computational graph, whose nodes can be labeled with one of 4 elementary operations, namely: 1) a mul or multiply operation which takes two numbers as input and outputs their multiplied value, 2) a sum or summation operation which takes two values as input and outputs their sum, 3) a read operation which reads a particular number from memory and 4) a write operation which writes a value into memory. Note, that we do not consider read/write operations from/into low level memory (like caches and registers) that store temporary runtime values, e.g., outputs from summation and/or multiplications, since their cost can be associated to those operations. Now, each of these nodes can be associated with an energy cost. Then, the total energy required for a particular dot product algorithm simply equals the total cost of the nodes in the graph.

However, the energy cost of each node depends on the hardware architecture and on the bit-size of the values involved in the operation. Hence, in order to make our model flexible with regard to different hardware architectures, we introduce four cost functions $\sigma, \mu, \gamma, \delta$, which take as input a bit-size and output the energy cost of performing the operation associated to them (the sum and mul operations take two numbers as input, which may have different bit-sizes; in this case, we take the maximum of those as a reference for the bit-sizes involved in the operation); $\sigma$ is associated to the sum operation, $\mu$ to the mul, $\gamma$ to the read and $\delta$ to the write operation.

Figure 2 shows the computational graph of a simple dot product algorithm for two 2-dimensional input vectors. This algorithm requires 4 read operations, 2 mul, 1 sum and 1 write. Assuming that the bit-size of all numbers is $b$, we can state that the energy cost of this dot product algorithm would be $E = 1\sigma(b) + 2\mu(b) + 4\gamma(b) + 1\delta(b)$. Note that similar energy models have been previously proposed [37, 38]. In the experimental section we validate the model by comparing it to real energy results measured by previous authors.
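
The energy model amounts to a weighted sum of operation counts, which the following sketch makes explicit. The function name `dot_energy` and, in particular, the cost functions are hypothetical: the linear placeholder costs below are in arbitrary units and are not measured values from any hardware or from this paper.

```python
def dot_energy(op_counts, cost, bits):
    """Total energy: each elementary operation count weighted by its cost at `bits`."""
    return sum(count * cost[op](bits) for op, count in op_counts.items())


# Placeholder cost functions in arbitrary units (assumption, NOT measured values).
cost = {
    "sum": lambda b: 1 * b,
    "mul": lambda b: 3 * b,
    "read": lambda b: 2 * b,
    "write": lambda b: 2 * b,
}

# Operation counts of the 2-dimensional dot product described in the text.
ops_2d = {"read": 4, "mul": 2, "sum": 1, "write": 1}

E = dot_energy(ops_2d, cost, bits=8)
print(E)  # 136 = 1*8 + 2*24 + 4*16 + 1*16 under the placeholder costs
```

Swapping in hardware-specific cost functions changes only the `cost` dictionary, which reflects the flexibility the model is meant to provide.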

Considering this energy model we can now provide a detailed analysis of complexity of the CER and CSER data structure. However, we start with a brief analysis of the storage and energy requirements of the dense and sparse data structure in order to facilitate the comparison between them.

IV-B Efficiency analysis of the dense and CSR formats

The dense data structure stores the matrix in an $N$-long array (where $N = m \times n$) using a constant bit-size $b_\Omega$ for each element. Therefore, its effective per element storage requirement is

$$S_{\mathrm{dense}} = b_\Omega \tag{1}$$

bits. The associated standard scalar product algorithm then has the following per element energy costs

$$E_{\mathrm{dense}} = \sigma(b_o) + \mu(b_o) + \gamma(b_a) + \gamma(b_\Omega) + \frac{1}{n}\delta(b_o) \tag{2}$$

where $b_a$ denotes the bit-size of the elements of the input vector and $b_o$ the bit-size of the elements of the output vector. The cost (2) is derived from considering 1) loading the elements of the input vector [$\gamma(b_a)$], 2) loading the elements of the matrix [$\gamma(b_\Omega)$], 3) multiplying them [$\mu(b_o)$], 4) summing the multiplications [$\sigma(b_o)$], and 5) writing the result [$\delta(b_o)/n$]. We can see that both the storage and the dot product efficiency have a constant cost attached to them, regardless of the distribution of the elements of the matrix.

In contrast, the CSR data structure requires only

$$S_{\mathrm{CSR}} = (1 - p_0)(b_\Omega + b_I) + \frac{1}{n} b_I \tag{3}$$

effective bits per element in order to represent the matrix, where $b_I$ denotes the bit-size of the column indices. This comes from the fact that we need in total $N(1 - p_0) b_\Omega$ bits for representing the non-zero elements of the matrix, $N(1 - p_0) b_I$ bits for their respective column indices and $m b_I$ bits for the row pointers. Moreover, it requires

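$$E_{\mathrm{CSR}} = (1 - p_0)\bigl(\sigma(b_o) + \mu(b_o) + \gamma(b_a) + \gamma(b_\Omega) + \gamma(b_I)\bigr) + \frac{1}{n}\gamma(b_I) + \frac{1}{n}\delta(b_o) \tag{4}$$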

units of energy per matrix element in order to perform the dot product. The expression (4) was derived from 1) loading the non-zero element values [$(1 - p_0)\gamma(b_\Omega)$], their respective indices [$(1 - p_0)\gamma(b_I) + \gamma(b_I)/n$] and the respective elements of the input vector [$(1 - p_0)\gamma(b_a)$], 2) multiplying and summing those elements [$(1 - p_0)(\mu(b_o) + \sigma(b_o))$] and then 3) writing the result into memory [$\delta(b_o)/n$].

In contrast to the dense format, the efficiency of the CSR data structure increases as $p_0 \to 1$, that is, as the number of zero elements increases. Moreover, if the matrix size is large enough, the storage requirement and the cost of performing a dot product become effectively 0 as $p_0 \to 1$.

For ease of analysis, we introduce the big $\mathcal{O}$ notation for capturing terms that depend on the shape of the matrix. In addition, we denote the following set of operations

$$c_a = \sigma(b_a) + \gamma(b_a) + \gamma(b_I) \tag{5}$$
$$c_\Omega = \gamma(b_I) + \gamma(b_\Omega) + \mu(b_o) + \sigma(b_o) - \sigma(b_a) \tag{6}$$

$c_a$ can be interpreted as the total effective cost of involving an element of the input vector in the dot product operation. Analogously, $c_\Omega$ can be interpreted with regard to the elements of the matrix. Hence, we can rewrite the above equations (2) and (4) as follows

$$E_{\mathrm{dense}} = c_a + c_\Omega - 2\gamma(b_I) + \mathcal{O}(1/n) \tag{7}$$
$$E_{\mathrm{CSR}} = (1 - p_0)(c_a + c_\Omega) + \mathcal{O}(1/n) \tag{8}$$
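
As a brief check of (7), one can substitute the definitions (5) and (6) and compare with (2). The worked step below is a sketch added for the reader; it uses only the symbols defined above, and the leftover write term $\delta(b_o)/n$ is what the $\mathcal{O}(1/n)$ term in (7) absorbs.

```latex
\begin{align*}
c_a + c_\Omega
  &= \bigl[\sigma(b_a) + \gamma(b_a) + \gamma(b_I)\bigr]
   + \bigl[\gamma(b_I) + \gamma(b_\Omega) + \mu(b_o) + \sigma(b_o) - \sigma(b_a)\bigr] \\
  &= \sigma(b_o) + \mu(b_o) + \gamma(b_a) + \gamma(b_\Omega) + 2\gamma(b_I), \\
c_a + c_\Omega - 2\gamma(b_I)
  &= \sigma(b_o) + \mu(b_o) + \gamma(b_a) + \gamma(b_\Omega)
   = E_{\mathrm{dense}} - \tfrac{1}{n}\,\delta(b_o).
\end{align*}
```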

IV-C Efficiency analysis of the CER and CSER formats

Following a similar reasoning as above, we can state the following theorem

Theorem 1

Let $M$ be a matrix. Let further $p_0$ be the maximum likelihood estimate of the frequency of appearance of the zero element, and let $b_I$ be the bit-size of the numerical representation of a column or row index in the matrix. Then, the CER representation of $M$ requires

$$S_{\mathrm{CER}} = (1 - p_0) b_I + \frac{\bar{k} + \tilde{k}}{n} b_I + \mathcal{O}(1/n) + \mathcal{O}(1/N) \tag{9}$$

effective bits per matrix element, where $\bar{k}$ denotes the average number of distinct elements that appear per row (excluding the most frequent value), $\tilde{k}$ the average number of padded indices per row and $N$ the total number of elements of the matrix. Moreover, the effective cost associated to the dot product with an input vector $a$ is

$$E_{\mathrm{CER}} = (1 - p_0) c_a + \frac{\bar{k}}{n} c_\Omega + \frac{\tilde{k}}{n} \gamma(b_I) + \mathcal{O}(1/n) \tag{10}$$

per matrix element, where $c_a$ and $c_\Omega$ are as in (5) and (6).
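
The leading terms of (9) can be checked numerically on the example matrix of Section III. The sketch below reads (9) as $(1 - p_0) b_I + \frac{\bar{k} + \tilde{k}}{n} b_I$ plus lower-order terms and assumes the same bit-size for all arrays, so that storage can be counted in entries; the variable names are hypothetical, and $\tilde{k} = 0$ is set by hand because the example contains no padded pointers.

```python
from fractions import Fraction

# Example matrix M from Section III (5 rows, 12 columns).
M = [
    [0, 3, 0, 2, 4, 0, 0, 2, 3, 4, 0, 4],
    [4, 4, 0, 0, 0, 4, 0, 0, 4, 4, 0, 4],
    [4, 0, 3, 4, 0, 0, 0, 4, 0, 2, 0, 0],
    [0, 0, 0, 4, 4, 4, 0, 3, 4, 4, 0, 0],
    [0, 4, 4, 0, 0, 4, 0, 4, 0, 0, 0, 0],
]

m, n = len(M), len(M[0])
N = m * n

# Empirical frequency of the zero element.
p0 = Fraction(sum(row.count(0) for row in M), N)
# Average number of distinct elements per row, excluding the most frequent value 0.
k_bar = Fraction(sum(len(set(row) - {0}) for row in M), m)
# Average number of padded pointers per row (none occur in this example).
k_tilde = Fraction(0)

# Leading terms of (9), per matrix element and in units of b_I.
leading = (1 - p0) + (k_bar + k_tilde) / n
leading_entries = leading * N

# 28 column indices + 10 element pointers; the remaining 11 of the 49 CER entries
# (1 extra pointer, 6 row pointers, 4 unique values) are the O(1/n) and O(1/N) terms.
print(p0, k_bar, leading_entries)  # 8/15 2 38
```
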

Analogously, we can state

Theorem 2

Let , , , , , be as in Theorem 1. Then, the CSER representation of $M$ requires

 SCSER (11)

effective bits per matrix element, and the per element cost associated to the dot product with an input vector $a$ is

$$E_{\mathrm{CSER}} = (1 - p_0) c_a + \frac{\bar{k}}{n} c_\Omega + \frac{\bar{k}}{n} \gamma(b_I) + \mathcal{O}(1/n) \tag{12}$$

The proofs of Theorems 1 and 2 are in the appendix. These theorems state that the efficiency of the data structures depends on the $(p_0, \bar{k})$ (sparsity, average number of distinct elements per row) values of the empirical distribution of the elements of the matrix. That is, these data structures are increasingly efficient for distributions that have high $p_0$ and low $\bar{k}$ values. However, since the entropy measures the effective average number of distinct values that a random variable outputs (from Shannon’s source coding theorem [12] we know that the entropy $H$ of a random variable gives the effective average number of bits that it outputs; hence, we may interpret $2^{H}$ as the effective average number of distinct elements that a particular random variable outputs), both values are intrinsically related to it. In fact, from Rényi’s generalized entropy definition [13] we know that $p_0 \geq 2^{-H}$. Moreover, the following properties are satisfied

• , as or , and

• , as or .

Consequently, we can state the following corollary

Corollary 2.1

For a fixed number of unique elements $K$, the storage requirements as well as the cost of the dot product operation of the CER and CSER representations satisfy

$$S, E \leq \mathcal{O}(1 - 2^{-H}) + \mathcal{O}(K/n) + \mathcal{O}(1/N) = \mathcal{O}(1 - 2^{-H}) + \mathcal{O}(1/n)$$

where , , and are as in Theorems 1 and 2, and $H$ denotes the entropy of the matrix element distribution.

Thus, the efficiency of the CER and CSER data structures increases as the column size increases, or as the entropy decreases. Interestingly, when $n \to \infty$ both representations will converge to the same values and thus become equivalent. In addition, there will always exist a column size $n$ where both formats are more efficient than the dense and sparse representations (see Fig. 5 where this trend is demonstrated experimentally).
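
To make the link between entropy and sparsity concrete, the following sketch computes the Shannon entropy and the min-entropy of the element distribution of the example matrix from Section III. The function name `element_entropy` is hypothetical. The check relies on the standard fact that the min-entropy never exceeds the Shannon entropy, so the frequency of the most frequent element (here the zero element) is at least $2^{-H}$.

```python
import math
from collections import Counter

# Example matrix M from Section III (5 rows, 12 columns).
M = [
    [0, 3, 0, 2, 4, 0, 0, 2, 3, 4, 0, 4],
    [4, 4, 0, 0, 0, 4, 0, 0, 4, 4, 0, 4],
    [4, 0, 3, 4, 0, 0, 0, 4, 0, 2, 0, 0],
    [0, 0, 0, 4, 4, 4, 0, 3, 4, 4, 0, 0],
    [0, 4, 4, 0, 0, 4, 0, 4, 0, 0, 0, 0],
]


def element_entropy(matrix):
    """Return (Shannon entropy, min-entropy, max frequency) of the matrix elements."""
    counts = Counter(v for row in matrix for v in row)
    total = sum(counts.values())
    probs = [c / total for c in counts.values()]
    shannon = -sum(p * math.log2(p) for p in probs)   # H, in bits
    p_max = max(probs)                                # frequency of the mode (0 here)
    min_entropy = -math.log2(p_max)
    return shannon, min_entropy, p_max


H, H_min, p_max = element_entropy(M)
print(round(H, 2), round(H_min, 2), p_max >= 2 ** (-H))
```

For this matrix the entropy is roughly 1.5 bits, i.e. an effective $2^{H}$ of fewer than three distinct values, which is consistent with only a few distinct elements appearing per row.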

IV-D Connection between CSR, CER and CSER

The CSR format is considered to be one of the most general sparse matrix representations, since it makes no further assumptions regarding the empirical distribution of the matrix elements. Consequently, it implicitly assumes a spike-and-slab distribution on them (that is, a spike at zero with probability $p_0$ and a uniform distribution over the non-zero elements), and is optimized accordingly. However, spike-and-slab distributions are a particular class of low entropic distributions (for sufficiently high sparsity levels $p_0$). In fact, spike-and-slab distributions have the highest entropy values compared to all other distributions that have the same sparsity level. In contrast, as a consequence of Corollary 2.1, the CER and CSER data structures relax this prior and can therefore efficiently represent the entire set of low entropic distributions. Hence, the CSR data structure may be interpreted as a more specialized version of the CER and CSER representations.

This may be more evident via the following example: consider the 1st row of the matrix example from section III

$$\begin{pmatrix} 0 & 3 & 0 & 2 & 4 & 0 & 0 & 2 & 3 & 4 & 0 & 4 \end{pmatrix}$$

The CSER data structure would represent the above row in the following manner

Ω: [0, 4, 3, 2]
colI: [4, 9, 11, 1, 8, 3, 7]
ΩI: [1, 2, 3]
ΩPtr: [0, 3, 5, 7]
rowPtr: [0, 3]

In comparison, the CER representation assumes that the ordering of the elements in ΩI is similar for all rows and therefore directly omits this array and implicitly encodes this information in the Ω array. Hence, the CER representation may be interpreted as a more explicit/specialized version of the CSER. The representation would then be

 Ω: [0,4,3,2] colI: [4,9,11,1,8,3,7] ΩPtr: [0,3,5,7] rowPtr: [0,3]

Finally, since the CSR data structure assumes a uniform distribution over the non-zero elements, it omits the ΩPtr array, since in that case all its entries would redundantly be equal to 1. Hence, the respective representation would be

 Ω: [3,2,4,2,3,4,4] colI: [1,3,4,7,8,9,11] rowPtr: [0,7]
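
As a rough illustration of how the three layouts relate, the following Python sketch builds CSER, CER and CSR arrays for the example row above. The function names, the rule used to order Ω (by frequency, ties broken by first occurrence) and the padding rule in `to_cer` are assumptions made for this sketch; they are not taken from the authors' implementation.

```python
from collections import Counter


def build_omega(matrix):
    """Distinct values with 0 first, the rest by decreasing frequency.

    Assumption: ties are broken by first occurrence in row-major order.
    """
    flat = [v for row in matrix for v in row]
    counts = Counter(flat)
    first_seen = {}
    for pos, v in enumerate(flat):
        first_seen.setdefault(v, pos)
    nonzero = sorted((v for v in counts if v != 0),
                     key=lambda v: (-counts[v], first_seen[v]))
    return [0] + nonzero


def to_cser(matrix):
    """CSER: per row, store only the elements that occur, tagged via omega_i."""
    omega = build_omega(matrix)
    col_i, omega_i, omega_ptr, row_ptr = [], [], [0], [0]
    for row in matrix:
        for k in range(1, len(omega)):
            cols = [j for j, v in enumerate(row) if v == omega[k]]
            if cols:
                col_i.extend(cols)
                omega_i.append(k)
                omega_ptr.append(len(col_i))
        row_ptr.append(len(omega_i))
    return {"omega": omega, "col_i": col_i, "omega_i": omega_i,
            "omega_ptr": omega_ptr, "row_ptr": row_ptr}


def to_cer(matrix):
    """CER: no omega_i; the position of a pointer implies the element index.

    Assumption: an element that is absent from a row but precedes a present
    one is padded by repeating the previous pointer.
    """
    omega = build_omega(matrix)
    col_i, omega_ptr, row_ptr = [], [0], [0]
    for row in matrix:
        present = [k for k in range(1, len(omega)) if omega[k] in row]
        last = max(present, default=0)
        for k in range(1, last + 1):
            col_i.extend(j for j, v in enumerate(row) if v == omega[k])
            omega_ptr.append(len(col_i))
        row_ptr.append(len(omega_ptr) - 1)
    return {"omega": omega, "col_i": col_i,
            "omega_ptr": omega_ptr, "row_ptr": row_ptr}


def to_csr(matrix):
    """CSR: one stored value and one column index per non-zero entry."""
    values, col_i, row_ptr = [], [], [0]
    for row in matrix:
        for j, v in enumerate(row):
            if v != 0:
                values.append(v)
                col_i.append(j)
        row_ptr.append(len(values))
    return {"omega": values, "col_i": col_i, "row_ptr": row_ptr}
```

Calling the three functions on `[[0, 3, 0, 2, 4, 0, 0, 2, 3, 4, 0, 4]]` reproduces the arrays listed above: CSER and CER share the same Ω, colI, ΩPtr and rowPtr, and CSR stores each non-zero value individually.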

Consequently, the CER and CSER representations will have superior performance for all distributions that are not similar to spike-and-slab distributions. Figure 3 displays a sketch of the regions on the entropy-sparsity plane where we expect the different data structures to be more efficient. The sketch shows that the efficiency of sparse data structures is high on the subset of distributions that are close to the right border line of the entropy-sparsity plane, that is, close to the family of spike-and-slab distributions. In contrast, dense representations are increasingly efficient for high-entropy distributions, that is, in the upper-left region. The CER and CSER data structures then cover the remaining regions. Figure 4 confirms this trend experimentally.
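
To make the notion of a point on the entropy-sparsity plane concrete, the following Python sketch computes the empirical entropy (in bits) and the sparsity level of a matrix. The function name is hypothetical, and the use of the empirical Shannon entropy of the element frequencies is an assumption consistent with the discussion above.

```python
from collections import Counter
from math import log2


def entropy_and_sparsity(matrix):
    """Return (empirical entropy in bits, fraction of zero entries)."""
    flat = [v for row in matrix for v in row]
    total = len(flat)
    counts = Counter(flat)
    entropy = -sum((c / total) * log2(c / total) for c in counts.values())
    sparsity = counts.get(0, 0) / total
    return entropy, sparsity
```

For instance, a matrix whose non-zero values are uniformly distributed (spike-and-slab) has a higher entropy than a matrix with the same sparsity and the same set of values but a skewed distribution over the non-zero elements.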

V Experiments

We applied the dense, CSR, CER and CSER representations to simulated matrices as well as to neural network weight matrices, and benchmarked their efficiency with regard to the following four criteria:

1. Storage requirements: We calculated the storage requirements according to equations (1), (3), (9) and (11).

2. Number of operations: We implemented the dot product algorithms associated with the four data structures above (pseudocode for the CER and CSER formats can be found in the appendix) and counted the number of elementary operations they require to perform a matrix-vector multiplication.

3. Time complexity: We timed each respective elementary operation and calculated the total time from the sum of those values.

4. Energy complexity: We estimated the respective energy cost by weighting each operation according to Table I. The total energy then results from the sum of those values. As for the I/O operations (read/write operations), their energy cost depends on the size of the memory the values reside in. Therefore, we calculated the total size of the array in which a particular number is stored and chose the respective maximum energy value. For instance, if a particular column index is stored using a 16 bit representation and the total size of the column index array is 30KB, then the respective read/write energy cost would be 5.0 pJ.

In addition, we used single precision floating point representations for the matrix elements and unsigned integer representations for the index and pointer arrays. For the latter, we compressed the index values to their minimum required bit-sizes, where we restricted them to be either 8, 16 or 32 bits.
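
The bit-size selection for the index and pointer arrays can be sketched as follows. The function name is hypothetical; the only assumption is that an array is stored with the smallest of the three allowed unsigned widths that can hold its largest value.

```python
def min_index_bits(max_value):
    """Smallest allowed unsigned width (8, 16 or 32 bits) that holds max_value."""
    if max_value < 0:
        raise ValueError("index values must be non-negative")
    for bits in (8, 16, 32):
        if max_value < 2 ** bits:
            return bits
    raise ValueError("value does not fit into 32 bits")
```

A column index array for a matrix with 300 columns would, under this rule, be stored with 16 bits per entry.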

V-A Experiments on simulated matrices

As a first set of experiments, we aimed to confirm the theoretical trends described in Section IV.

V-A1 Efficiency on different regions of the entropy-sparsity plane

Firstly, we argued that each distribution has a particular entropy-sparsity value, and that the superiority of the different data structures is manifested in different regions on that plane. Concretely, we expected the dense representation to be increasingly more efficient in the upper-left corner, the CSR on the bottom-right (and along the right border) and the CER and CSER on the rest.

Figure 4 shows the result of performing one such experiment. In particular, we randomly selected a point-distribution on the entropy-sparsity plane and sampled 10 different matrices from that distribution. Subsequently, we converted each matrix into the respective dense, CSR, CER and CSER representation, and benchmarked the performance with regard to the four measures described above. We then averaged the results over these 10 matrices. Finally, we compared the performances with each other and color-coded the best-performing representation. That is, blue corresponds to the dense representation, green to the CSR and red to either the CER or CSER. As one can see, the result closely matches the expected behavior.

V-A2 Efficiency as a function of the column size

As a second experiment, we study the asymptotic behavior of the data structures as we increase the column size of the matrices. From Corollary 2.1 we expect that the CER and CSER data structures increase their efficiency as the number of columns in the matrix grows, until they converge to the same point, outperforming the dense and sparse data structures. Figure 5 confirms this trend experimentally with regard to all four benchmarks. Here we chose a particular point-distribution on the entropy-sparsity plane, fixed the number of rows, and measured the average complexity of the data structures as we increased the number of columns.

As a side note, the sharp changes in the plots are due to the sharp discontinuities in the values of Table I. For instance, the sharp drops in storage ratios come from changes in the index bit-sizes.

V-B Compressed deep neural networks

As a second set of experiments, we tested the efficiency of the proposed data structures on compressed deep neural networks. That is, we first applied lossy compression to the elements of the weight matrices of the networks, while ensuring that the impact on their prediction accuracy was negligible. As compression method we used a uniform quantizer over the range of weight values at each layer and subsequently rounded the values to their nearest quantization point. For all experiments in this section we quantized the weight elements down to 7 bits. Finally, we losslessly converted the quantized weight matrices into the different data structures and tested their efficiency with regard to the four benchmark criteria mentioned above.
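
A minimal sketch of such a per-layer uniform quantizer is given below. It assumes that the quantization points are equally spaced between the smallest and largest weight of the layer, with 2^7 = 128 points for 7 bits; the function name and this exact placement of the points are assumptions, not the authors' implementation.

```python
def uniform_quantize(weights, bits=7):
    """Round each weight to the nearest of 2**bits equally spaced points.

    Assumption: the points span [min(weights), max(weights)] of one layer.
    """
    lo, hi = min(weights), max(weights)
    if hi == lo:
        # A constant layer is already represented by a single point.
        return list(weights)
    step = (hi - lo) / (2 ** bits - 1)
    return [lo + round((w - lo) / step) * step for w in weights]
```

After this step each layer contains at most 128 distinct values, which is what makes the subsequent lossless conversion into CER/CSER worthwhile.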

V-B1 Storage requirements

Table II shows the gains in storage requirements of different state-of-the-art neural networks. Gains can be attained when storing the networks in CER or CSER formats. In particular, we achieve more than x2.5 savings on the DenseNet architecture, whereas the CSR data structure attains negligible gains. This is mainly attributed to the fact that the dense and sparse representations store the weight element values of these networks very inefficiently. This is also reflected in Fig. 6, where one can see that most of the storage requirements for the dense and CSR representations are spent on storing the elements of the weight matrices (Ω). In contrast, most of the storage cost for the CER and CSER data structures comes from storing the column indices (colI), which is much lower than the cost of storing the actual weight values.

V-B2 Number of operations

Table III shows the savings attained with regard to the number of elementary operations needed to perform a matrix-vector multiplication. As one can see, we can save up to 40% of the operations if we use the CER/CSER data structures on the DenseNet architecture. This is mainly due to the fact that the dot product algorithms of the CER/CSER formats implicitly exploit the distributive law of multiplication and consequently require far fewer multiplications. This is also reflected in Fig. 7, where one can see that the CER/CSER dot product algorithms mainly perform input load, column index load and addition (add) operations. In contrast, the dense and CSR dot product algorithms additionally require an equal number of weight element load and multiplication (mul) operations.
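
The role of the distributive law can be illustrated with a small Python sketch of a CSER matrix-vector product. This is a simplified reading of the format based on the example arrays from Section IV-D, not a transcription of the pseudocode in the appendix; the function name and the returned multiplication counter are added for illustration only.

```python
def cser_dot(omega, col_i, omega_i, omega_ptr, row_ptr, x):
    """Matrix-vector product for a CSER matrix; returns (y, multiplications)."""
    y, n_mul = [], 0
    for r in range(len(row_ptr) - 1):
        acc = 0
        for idx in range(row_ptr[r], row_ptr[r + 1]):
            start, end = omega_ptr[idx], omega_ptr[idx + 1]
            # Sum all inputs that share the same weight value ...
            partial = sum(x[j] for j in col_i[start:end])
            # ... and multiply by that value only once.
            acc += omega[omega_i[idx]] * partial
            n_mul += 1
        y.append(acc)
    return y, n_mul
```

For the example row of Section IV-D, this uses 3 multiplications, one per distinct non-zero value, whereas a CSR product would use 7, one per non-zero entry.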

V-B3 Time cost

In addition, Table III also shows that we attain speedups when performing the dot product in the new representations. Interestingly, Fig. 8 shows that most of the time is consumed by I/O operations (that is, by load operations). Consequently, the CER and CSER data structures attain speedups since they do not have to load as many weight elements. In addition, 20% and 16% of the time is spent on performing multiplications in the dense and sparse representations, respectively. In contrast, this time cost is negligible for the CER and CSER representations.

V-B4 Energy cost

Similarly, we see that most of the energy consumption is due to I/O operations (Fig. 9). Here the cost of loading an element may be up to 3 orders of magnitude higher than that of any other operation (see Table I); hence, we obtain up to x6 energy savings when using the CER/CSER representations (see Table III).

Finally, Table IV and Fig. 10 further justify the observed gains. Namely, Table IV shows that the effective number of distinct elements per row is small relative to the network's effective column dimension. Moreover, Fig. 10 shows the distributions of the different layers of the networks on the entropy-sparsity plane, where we see that most of them lie in the regions where we expect the CER/CSER formats to be more efficient.

V-C Deep compression

Deep compression [26] is a technique for compressing neural networks that is able to attain high compression rates without loss of accuracy. It does so by applying a three-stage pipeline: 1) prune unimportant connections by employing the algorithm of [31], 2) cluster the non-pruned weight values and refine the cluster centers with respect to the loss surface, and 3) employ an entropy coder for storing the final weights. The first two stages make it possible to highly quantize the weight matrices of the networks without incurring a significant loss of accuracy, whereas the third stage losslessly converts the weight matrices into a low-bit representation. However, one requires specialized hardware in order to efficiently exploit this final neural network representation during inference [42].

Consequently, many authors benchmark the inference efficiency of highly compressed deep neural networks with regard to sparse representations when testing on standard hardware such as CPUs and/or GPUs [26, 35, 37]. However, we pay a high cost by doing so, since sparse representations are highly redundant and inefficient for these types of networks. In contrast, the CER/CSER representations can be executed efficiently on conventional hardware and are able to better exploit the redundancies contained in low-entropy neural networks. Hence, it is of high interest to benchmark their efficiency on highly compressed networks and compare them to their sparse (and dense) counterparts.

As experimental setup we chose the AlexNet architecture [41] as trained and quantized by the authors of deep compression [26], who were able to reduce the overall entropy of the network down to 0.89 without incurring any loss of accuracy. Figure 11 shows the gains in efficiency when the network layers are converted into the different data structures. We see that the proposed data structures surpass the dense and sparse data structures on all four benchmark criteria. Therefore, the CER/CSER data structures are much less redundant and more efficient representations of highly compressed neural network models. Interestingly, the CER/CSER data structures attain up to x20 energy savings, which is considerably higher than the sparse counterpart. We want to stress once more that these gains apply directly, meaning that no specialized hardware is needed in order to obtain them.

VI Conclusion

We presented two new matrix representations, compressed entropy row (CER) and compressed shared elements row (CSER), that are able to attain high compression ratios and energy savings if the distribution of the matrix elements has low entropy. We showed in an extensive set of experiments that the CER/CSER data structures are more compact and computationally efficient representations of compressed state-of-the-art neural networks than dense and sparse formats. In particular, we attained up to x15 compression ratios and x20 energy savings by representing the weight matrices of a highly compressed AlexNet model in their CER/CSER forms.

By demonstrating the advantages of entropy-optimized data formats for representing neural networks, our work opens up new directions for future research, e.g., the exploration of entropy constrained regularization and quantization techniques for compressing deep neural networks. The combination of entropy constrained regularization and quantization and entropy-optimized data formats may push the limits of neural network compression even further and also be beneficial for applications such as federated or distributed learning [43, 44].

Appendix A Details on neural network experiments

A-A Matrix preprocessing and convolutional layers

Before benchmarking the quantized weight matrices we applied the following preprocessing steps:

A-A1 Matrix decomposition

After the quantization step it may well be that the value 0 is not included in the set of values and/or that it is not the most frequent value in the matrix. Therefore, we applied the following simple preprocessing step: assume a particular quantized matrix $W_q$, where each element belongs to a discrete set. Then, we decompose the matrix via the identity $W_q = (W_q - \omega_{\max}\mathbb{1}) + \omega_{\max}\mathbb{1} = \hat{W} + \omega_{\max}\mathbb{1}$, where $\mathbb{1}$ is the unit matrix whose elements are all equal to 1 and $\omega_{\max}$ is the element that appears most frequently in the matrix. Consequently, $\hat{W}$ is a matrix with 0 as its most frequent element. Moreover, when performing the dot product with an input vector $x$, we only incur the additional cost of adding the constant value $\omega_{\max}\sum_i x_i$ to all the elements of the output vector. The cost of this additional operation is effectively of the order of $n$ additions (with $n$ the number of columns) and 1 multiplication for the entire dot product operation, which is negligible as long as the number of rows is sufficiently large.
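
The decomposition can be sketched in a few lines of Python. The function names are hypothetical, and the dense inner product used for the shifted matrix is only a stand-in for the CER/CSER dot product that would be used in practice.

```python
from collections import Counter


def decompose(matrix):
    """Shift the matrix so that its most frequent element becomes 0."""
    counts = Counter(v for row in matrix for v in row)
    most_frequent = counts.most_common(1)[0][0]
    shifted = [[v - most_frequent for v in row] for row in matrix]
    return shifted, most_frequent


def dot_decomposed(shifted, most_frequent, x):
    """Dot product of the original matrix with x via the shifted matrix."""
    # One pass over x and a single multiplication, shared by all rows.
    offset = most_frequent * sum(x)
    return [sum(w * xi for w, xi in zip(row, x)) + offset for row in shifted]
```

The offset is computed once per input vector, which is why its cost becomes negligible as the number of rows grows.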

A-A2 Convolution layers

A convolution operation can essentially be performed by a matrix-matrix dot product operation. The weight tensor containing the filter values would be represented as an $(F \times C h w)$-dimensional matrix, where $F$ is the number of filters of the layer, $C$ the number of channels, and $h$, $w$ the height and width of the filters. Hence, the convolution matrix would perform a dot product operation with a $(C h w \times P)$-dimensional matrix that contains all $P$ patches of the input image as column vectors.

Hence, in our experiments, we reshaped the weight tensors of the convolutional layers into their respective matrix forms and tested their storage requirements and dot product complexity by performing a simple matrix-vector dot product, but weighted the results by the respective number of patches that would have been used at each layer.
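
The reshaping described above corresponds to the common "im2col" construction, which can be sketched in plain Python as follows. The sketch assumes stride 1, no padding and the cross-correlation convention used by most deep learning frameworks; all function names are hypothetical.

```python
def reshape_filters(weights):
    """F x C x h x w filter tensor -> F x (C*h*w) matrix (nested lists)."""
    return [[v for channel in f for row in channel for v in row] for f in weights]


def im2col(image, kh, kw):
    """C x H x W image -> (C*kh*kw) x P matrix with one patch per column."""
    channels, height, width = len(image), len(image[0]), len(image[0][0])
    patches = []
    for i in range(height - kh + 1):
        for j in range(width - kw + 1):
            patches.append([image[c][i + di][j + dj]
                            for c in range(channels)
                            for di in range(kh)
                            for dj in range(kw)])
    # Transpose so that each patch becomes a column.
    return [list(col) for col in zip(*patches)]


def matmul(a, b):
    """Plain matrix-matrix product on nested lists."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*b)] for row in a]


def conv_direct(image, weights):
    """Reference sliding-window computation, used to check the reshaping."""
    channels, height, width = len(image), len(image[0]), len(image[0][0])
    kh, kw = len(weights[0][0]), len(weights[0][0][0])
    out = []
    for f in weights:
        out.append([sum(f[c][di][dj] * image[c][i + di][j + dj]
                        for c in range(channels)
                        for di in range(kh)
                        for dj in range(kw))
                    for i in range(height - kh + 1)
                    for j in range(width - kw + 1)])
    return out
```

Each row of `matmul(reshape_filters(weights), im2col(image, kh, kw))` holds the flattened output map of one filter, so the cost of a single matrix-vector product can be scaled by the number of patches.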

A-B More results from experiments

Figures 12, 13 and 14 show our results for compressed ResNet152, VGG16 and AlexNet, respectively.

A-C Dot product pseudocodes

Algorithms 1 and 2 show the pseudocode of the dot product algorithms of the CER and CSER data structures, respectively.
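
For readers without access to the algorithm listings, the following Python sketch shows one possible CER matrix-vector product. It is not a transcription of Algorithm 1: it assumes that the position of a pointer within a row determines the index into Ω and that absent elements are encoded by repeated pointers, as in the example of Section IV-D. The function name is hypothetical.

```python
def cer_dot(omega, col_i, omega_ptr, row_ptr, x):
    """Matrix-vector product for a CER matrix (assumed layout, see lead-in)."""
    y = []
    for r in range(len(row_ptr) - 1):
        acc = 0
        # The k-th pointer slot of a row refers to omega[k] (k starts at 1).
        for k, idx in enumerate(range(row_ptr[r], row_ptr[r + 1]), start=1):
            start, end = omega_ptr[idx], omega_ptr[idx + 1]
            if end > start:  # an empty slot is padding for an absent element
                acc += omega[k] * sum(x[j] for j in col_i[start:end])
        y.append(acc)
    return y
```

Compared with CSER, no ΩI lookup is needed, at the price of the padding slots that have to be skipped.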

Appendix B Proof of theorems

B-1 Theorem 1

The CER data structure represents any matrix via 4 arrays, which respectively contain $K$, $N - \#(0)$, $\sum_{r=0}^{m}(\bar{k}_r + \tilde{k}_r)$ and $m$ entries, where $K$ denotes the number of unique elements appearing in the matrix, $N$ the total number of elements, $\#(0)$ the total number of zero elements, $m$ the row dimension and, finally, $\bar{k}_r$ the number of distinct elements that appear at row $r$ (excluding the 0) and $\tilde{k}_r$ the number of redundant padding entries needed at row $r$. Hence, by multiplying the size of each array with the respective element bit-size and dividing by the total number of elements we get

$$\frac{K b_{\Omega}}{N} + \left(1 - \frac{\#(0)}{N}\right) b_I + \frac{1}{N}\left(\sum_{r=0}^{m} \bar{k}_r + \tilde{k}_r\right) b_I + \frac{1}{n} b_I$$

where $b_{\Omega}$ and $b_I$ are the bit-sizes of the matrix elements and the indices, respectively, and $n = N/m$ is the column dimension. With the corresponding substitutions we get equation (9).
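
The per-element storage expression above can be evaluated directly, as in the following Python sketch. The function and parameter names are hypothetical, and the default bit-sizes (32-bit elements, 8-bit indices) are chosen only for the example.

```python
def cer_bits_per_element(num_unique, num_elements, num_zeros,
                         num_ptr_entries, num_cols,
                         bits_element=32, bits_index=8):
    """Average CER storage cost in bits per matrix element.

    The four terms correspond to the Omega, colI, OmegaPtr and rowPtr arrays.
    """
    omega_cost = num_unique * bits_element / num_elements
    col_cost = (1 - num_zeros / num_elements) * bits_index
    ptr_cost = num_ptr_entries * bits_index / num_elements
    row_cost = bits_index / num_cols
    return omega_cost + col_cost + ptr_cost + row_cost
```

For the single example row of Section IV-D (4 unique values, 12 elements, 5 zeros, 3 pointer entries, 12 columns) this evaluates to 18 bits per element, compared with 32 bits per element for the dense single precision representation.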

The cost of the respective dot product algorithm can be estimated by calculating the cost of each line of algorithm 1. To recall, we denoted separately the cost of performing a summation operation involving a given number of bits, the cost of a multiplication, the cost of a read and the cost of a write operation into memory. We further denoted separately the cost of performing other types of operations. Moreover, we assume a single input vector, since the result can be trivially extended to input matrices of arbitrary size. Thus, algorithm 1 requires a lumped cost for lines 2) to 7) and one cost term for each of lines 8) to 22), expressed in terms of the bit-sizes of the matrix elements, the indices and the output vector elements. Hence, adding up all of the above costs and substituting as in equations (5) and (6), we obtain the total cost. It is fair to assume that the cost of the other operations is negligible compared to the rest for highly optimized algorithms. Indeed, Figs. 7, 8 and 9 show that these operations contribute very little to the total cost of the algorithm. Hence, we can take the ideal cost of the algorithm to be the above expression with the cost of the other operations set to zero (which corresponds to equation (10)).

B-2 Theorem 2

Analogously, we can follow the same line of argument. Namely, we count the number of entries contained in each of the five arrays of the CSER data structure. Consequently, by multiplying each term by its bit-size, adding them up and dividing by the total number of elements we recover (11).

Each line of algorithm 2 induces a cost: for lines 2) to 8) we assume a lumped cost, and each of lines 9) to 22) contributes one cost term. Again, adding up all terms and making the analogous substitutions, we get the total cost, and by neglecting the cost of the other operations we recover equation (12).

References

• [1] R. H. Landau, J. Paez, and C. C. Bordeianu, A Survey of Computational Physics: Introductory Computational Science.   Princeton, NJ, USA: Princeton University Press, 2008.
• [2] D. M. Young and R. T. Gregory, A Survey of Numerical Mathematics.   New York, NY, USA: Dover Publications, Inc., 1988.
• [3] S. Krig, Computer Vision Metrics: Survey, Taxonomy, and Analysis, 1st ed.   Berkeley, CA, USA: Apress, 2014.
• [4] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning.   MIT Press, 2016, http://www.deeplearningbook.org.
• [5] Y. LeCun, L. Bottou, G. B. Orr, and K. Müller, “Efficient backprop,” in Neural Networks: Tricks of the Trade - Second Edition, Springer LNCS 7700, 2012, pp. 9–48.
• [6] S. Afroz, M. Tahaseen, F. Ahmed, K. S. Farshee, and M. N. Huda, “Survey on matrix multiplication algorithms,” in 5th International Conference on Informatics, Electronics and Vision (ICIEV), 2016, pp. 151–155.
• [7] M. Bläser, “Fast matrix multiplication,” Theory of Computing, Graduate Surveys, vol. 5, pp. 1–60, 2013.
• [8] I. S. Duff, “A survey of sparse matrix research,” Proceedings of the IEEE, vol. 65, no. 4, pp. 500–535, 1977.
• [9] V. Karakasis, T. Gkountouvas, K. Kourtis, G. Goumas, and N. Koziris, “An extended compression format for the optimization of sparse matrix-vector multiplication,” IEEE Transactions on Parallel and Distributed Systems, vol. 24, no. 10, pp. 1930–1940, 2013.
• [10] J. King, T. Gilray, R. M. Kirby, and M. Might, “Dynamic-csr: A format for dynamic sparse-matrix updates,” in High Performance Computing - 31st International Conference, ISC High Performance 2016, Proceedings, ser. LNCS, vol. 9697.   Springer Verlag, 2016, pp. 61–80.
• [11] R. Yuster and U. Zwick, “Fast sparse matrix multiplication,” in Algorithms – ESA 2004.   Springer Berlin Heidelberg, 2004, pp. 604–615.
• [12] C. E. Shannon, “A mathematical theory of communication,” SIGMOBILE Mobile Computing and Communications Review, vol. 5, no. 1, pp. 3–55, 2001.
• [13] A. Rényi, “On measures of entropy and information,” in Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Contributions to the Theory of Statistics, 1961, pp. 547–561.
• [14] Y. LeCun, Y. Bengio, and G. E. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436–444, 2015.
• [15] J. Schmidhuber, “Deep learning in neural networks: An overview,” Neural Networks, vol. 61, pp. 85–117, 2015.
• [16] K. He, X. Zhang, S. Ren, and J. Sun, “Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification,” in IEEE International Conference on Computer Vision (ICCV), 2015, pp. 1026–1034.
• [17] S. Bosse, D. Maniry, K.-R. Müller, T. Wiegand, and W. Samek, “Deep neural networks for no-reference and full-reference image quality assessment,” IEEE Transactions on Image Processing, vol. 27, no. 1, pp. 206–219, 2018.
• [18] W. Dai, C. Dai, S. Qu, J. Li, and S. Das, “Very Deep Convolutional Neural Networks for Raw Waveforms,” arXiv preprint arXiv:1610.00087, 2016.
• [19] D. Bahdanau, K. Cho, and Y. Bengio, “Neural Machine Translation by Jointly Learning to Align and Translate,” in International Conference on Learning Representations (ICLR), 2015.
• [20] P. Baldi, P. Sadowski, and D. Whiteson, “Searching for exotic particles in high-energy physics with deep learning,” Nature Communications, vol. 5, p. 4308, 2014.
• [21] I. Sturm, S. Lapuschkin, W. Samek, and K.-R. Müller, “Interpretable deep neural networks for single-trial eeg classification,” Journal of Neuroscience Methods, vol. 274, pp. 141–145, 2016.
• [22] K. T. Schütt, F. Arbabzadah, S. Chmiela, K. R. Müller, and A. Tkatchenko, “Quantum-chemical insights from deep tensor neural networks,” Nature Communications, vol. 8, p. 13890, 2017.
• [23] S. Chmiela, A. Tkatchenko, H. E. Sauceda, I. Poltavsky, K. T. Schütt, and K.-R. Müller, “Machine learning of accurate energy-conserving molecular force fields,” Science Advances, vol. 3, no. 5, 2017.
• [24] M. Denil, B. Shakibi, L. Dinh, M. Ranzato, and N. de Freitas, “Predicting parameters in deep learning,” in Advances in Neural Information Processing Systems, 2013, pp. 2148–2156.
• [25] G. Hinton, O. Vinyals, and J. Dean, “Distilling the Knowledge in a Neural Network,” arXiv:1503.02531, 2015.
• [26] S. Han, H. Mao, and W. J. Dally, “Deep compression: Compressing deep neural network with pruning, trained quantization and huffman coding,” arXiv:1510.00149, 2015.
• [27] D. Molchanov, A. Ashukha, and D. Vetrov, “Variational dropout sparsifies deep neural networks,” in 34th International Conference on Machine Learning, 2017, pp. 2498–2507.
• [28] Y. Cheng, D. Wang, P. Zhou, and T. Zhang, “A survey of model compression and acceleration for deep neural networks,” arXiv:1710.09282, 2017.
• [29] Y. L. Cun, J. S. Denker, and S. A. Solla, “Optimal brain damage,” in Advances in Neural Information Processing Systems 2, 1990, pp. 598–605.
• [30] B. Hassibi, D. G. Stork, and G. J. Wolff, “Optimal brain surgeon and general network pruning,” in IEEE International Conference on Neural Networks, 1993, pp. 293–299 vol.1.
• [31] S. Han, J. Pool, J. Tran, and W. J. Dally, “Learning both weights and connections for efficient neural networks,” in Advances in Neural Information Processing Systems, 2015, pp. 1135–1143.
• [32] V. Vanhoucke, A. Senior, and M. Z. Mao, “Improving the speed of neural networks on cpus,” in NIPS’11 Deep Learning and Unsupervised Feature Learning Workshop, 2011.
• [33] D. D. Lin, S. S. Talathi, and V. S. Annapureddy, “Fixed point quantization of deep convolutional networks,” in International Conference on Machine Learning, 2016, pp. 2849–2858.
• [34] F. Li, B. Zhang, and B. Liu, “Ternary weight networks,” arXiv:1605.04711, 2016.
• [35] C. Louizos, K. Ullrich, and M. Welling, “Bayesian Compression for Deep Learning,” in Advances in Neural Information Processing Systems, 2017, pp. 3290–3300.
• [36] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv:1409.1556, 2014.
• [37] T. Yang, Y. Chen, and V. Sze, “Designing energy-efficient convolutional neural networks using energy-aware pruning,” in IEEE International Conference on Computer Vision and Pattern Recognition, 2017, pp. 5687–5695.
• [38] Y. H. Chen, T. Krishna, J. S. Emer, and V. Sze, “Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks,” IEEE Journal of Solid-State Circuits, vol. 52, no. 1, pp. 127–138, 2017.
• [39] M. Horowitz, “1.1 computing’s energy problem (and what we can do about it),” in IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC), 2014, pp. 10–14.
• [40] G. Huang, Z. Liu, and K. Q. Weinberger, “Densely connected convolutional networks,” in IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 4700–4708.
• [41] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems 25, 2012, pp. 1097–1105.
• [42] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J. Dally, “EIE: efficient inference engine on compressed deep neural network,” in ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), 2016, pp. 243–254.
• [43] H. B. McMahan, E. Moore, D. Ramage, S. Hampson et al., “Communication-efficient learning of deep networks from decentralized data,” arXiv preprint arXiv:1602.05629, 2016.
• [44] F. Sattler, S. Wiedemann, K.-R. Müller, and W. Samek, “Sparse binary compression: Towards distributed deep learning with minimal communication,” arXiv:1805.08768, 2018.