
No Multiplication? No Floating Point? No Problem!
Training Networks for Efficient Inference

Shumeet Baluja
Google Research, Google, Inc.
shumeet@google.com

David Marwood
Google Research, Google, Inc.
marwood@google.com

Michele Covell
Google Research, Google, Inc.
covell@google.com

Nick Johnston
Google Research, Google, Inc.
nickj@google.com
Abstract

For successful deployment of deep neural networks on highly resource-constrained devices (hearing aids, earbuds, wearables), we must simplify the types of operations and the memory/power resources used during inference. Completely avoiding inference-time floating-point operations is one of the simplest ways to design networks for these highly-constrained environments. By discretizing both our in-network non-linearities and our network weights, we can move to simple, compact networks without floating-point operations, without multiplications, and without any non-linear function computations. Our approach allows us to explore the spectrum of possible networks, ranging from fully continuous versions down to networks with bi-level weights and activations. Our results show that discretization can be done without loss of performance and that we can train a network that will successfully operate without floating point, without multiplication, and with less RAM on both regression tasks (auto-encoding) and multi-class classification tasks (ImageNet). The memory needed to deploy our discretized networks is less than a third of that needed by the equivalent architecture that does use floating-point operations.

We address this in two steps. First, we train deep networks that emit only a predefined, static number of discretized values. Despite reducing the number of values that a unit can emit to only 32, there is little to no degradation in network performance across a variety of tasks. Compared to existing approaches for discretization, our approach is both conceptually and programmatically simple and has no stochastic component. Second, we provide a method to constrain the network's weights to a small number of unique values (typically 100-1000) by employing a periodic adaptive clustering step during training. With only weight clustering in place, large network models can be transmitted (or stored) using less than a fourth of the bandwidth.

 


1 Introduction

Almost all recent neural-network training algorithms rely on gradient-based learning. This has moved the research field away from using discrete-valued inference, with hard thresholds, to smooth, continuous-valued activation functions Werbos (1974); Rumelhart et al. (1986). Unfortunately, this causes inference to be done with floating-point operations, making it difficult to deploy on an increasingly-large set of low-cost, limited-memory, low-power hardware in both commercial Lane et al. (2015) and research settings Bourzac (2017).

Avoiding all floating point operations allows the inference network to realize the power-saving gains available with fixed-point processing Finnerty and Ratigner (2017). To move fully to fixed point, we need to discretize both the network weights and the activation functions. We can also achieve significant memory savings by not just quantizing the network weights, but clustering all of them, across the entire network, into a small number of levels. With this in place, the memory footprint grows far more slowly than that of the unclustered, continuous-weight network. Additionally, the rate at which the memory footprint grows is easily controlled by adjusting the number of unique weights. In our experiments with 1000 unique weights, we show that we can meet or exceed the classification performance of an unconstrained network, using the same architecture and (nearly) the same training process.

While most neural networks use continuous non-linearities, many use non-linearities with poorly defined gradients without impacting the training process Goodfellow et al. (2013); Glorot et al. (2011); Nair and Hinton (2010). When purely discrete outputs are desired, however, such as with binary units, a number of additional steps are normally taken Raiko et al. (2014); Bengio et al. (2013a); Hou et al. (2016); Courbariaux et al. (2016); Tang and Salakhutdinov (2013); Maddison et al. (2016), or evolutionary strategies are used Plagianakos et al. (2001). At a high level, many of these methods employ a stochastic binary unit and inject noise during the forward pass to sample the units and estimate the associated effect on the network's outputs. With this estimation, it is possible to calculate a gradient and pass it through the network. One interesting benefit of this approach is its use in generative networks, in which stochasticity for diverse generation is desired Raiko et al. (2014). Raiko et al. (2014) also extended Tang and Salakhutdinov (2013) to show that learning with stochastic units may not be necessary within a larger deterministic network.

A different body of research has focused on discretizing and clustering network weights Wu et al. (2018); Yi et al. (2008); Deng et al. (2017); Courbariaux et al. (2016). Several existing weight-quantization methods (e.g., Courbariaux et al. (2016)) liken the process to Dropout Srivastava et al. (2014) and its regularization effects. Instead of randomly setting activations to zero when computing gradients (as with dropout), weight clustering and binarization tend to push extreme weights partway back towards zero. Additional related work is given in the next section.

2 Training Networks for Efficient Inference

In this section, we separately consider the tasks of (1) discretizing the output of each unit and (2) reducing the set of allowed weights to a small, preset size. The effects of each method are examined in isolation and together in Section 3.

2.1 Discretizing the Network’s Activations

To make this section concrete and easily reproducible, we will focus our attention on how to discretize the tanh activation function. However, we have employed the exact same method to discretize ReLU-6 Krizhevsky and Hinton (2010), rectified-Tanh, and sigmoid among others. Figure 1 gives a simple procedure that we use for activation discretization and shows its effects on the activation’s output.

Figure 1: Discretized tanh (tanhD) computation (detailed for reproducibility). Outputs shown with 4, 9, and 64 levels. In the largest-slope areas of the underlying tanh function, the discretization levels change the fastest. There is no requirement to constrain the number of levels to a power of 2, though it may be preferred.

Naively backpropagating errors with these discretized tanh (tanhD) units will quickly run into problems, as the activations are both discontinuous and piece-wise constant. In order to use gradient-based methods with tanhD units, we need to define a suitable derivative. We simply use the derivative of the underlying function (e.g., for tanhD, we use the derivative of tanh: 1 - tanh^2(x)). In the forward pass, both in training and inference, the output is discretized to the specified number of levels. In the backward pass, we proceed by ignoring the discretization and instead compute the derivative of the underlying function. Whereas previous studies that attempted discretization to binary-output units experienced difficulties in training, we have found that as the number of levels is increased, even to relatively small values, all of the currently popular training algorithms perform well with no modification Baluja (2018) (e.g. SGD, SGD+momentum, ADAM, RMS-Prop, etc.). A number of studies have used similar approaches, often in a binary setting (e.g. straight-through estimators Hinton (2012); Bengio et al. (2013b); Rippel and Bourdev (2017)); most recently, Agustsson et al. (2017); Mentzer et al. (2018) used a smooth mixture of the quantized and underlying function in the backwards pass.
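A minimal sketch of this scheme, assuming the tanh output is quantized uniformly into n_levels values in [-1, 1] (the exact level placement used in Figure 1 may differ); the backward pass uses the derivative of the underlying tanh, exactly as described above:

```python
import numpy as np

def tanh_d_forward(x, n_levels):
    """Forward pass of tanhD: quantize tanh(x) to n_levels output values.

    Assumes uniform quantization of the output range [-1, 1]; each bin is
    represented by its midpoint.
    """
    y = np.tanh(x)
    idx = np.clip(np.floor((y + 1.0) * n_levels / 2.0), 0, n_levels - 1)
    return -1.0 + (2.0 * idx + 1.0) / n_levels

def tanh_d_backward(x, grad_out):
    """Backward pass: ignore the discretization and use the derivative of
    the underlying tanh, d/dx tanh(x) = 1 - tanh(x)**2."""
    return grad_out * (1.0 - np.tanh(x) ** 2)
```

In an autodiff framework, the same effect can be obtained with the straight-through pattern y + stop_gradient(quantize(y) - y), so the forward value is the quantized output while the gradient is that of the underlying tanh.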

Why does ignoring the discretization in the backwards pass work? If we had tried to use the discretized outputs, the plateaus would not have given usable derivatives. By ignoring the discretization, the weights of the network still move in the desired directions with each backpropagation step. However, unlike with non-discretized units, any single move may not affect the unit's output. In fact, it is theoretically possible that the entire network's output may not change despite all the weight changes made in a single step. Nonetheless, in a subsequent weight update, the weights will again be directed to move, and of those that move in the same direction, some will cause a unit's output to cross a discretization threshold. This changes the unit's and, eventually, the network's output. Further, notice that for tanhD (Figure 1) the plateaus are not evenly sized: the plateau is smallest where the magnitude of the derivative of the underlying tanh is largest. This is beneficial in training, since the unit's output changes most rapidly where the derivative of tanh is largest.

Finally, to provide an intuitive example of how these units perform in practice, see Figure 2, which shows how a parabola is fit with a variety of activations and discretization levels. In this example, a tiny network with a single linear output unit and only two hidden units is used. The most revealing graphs are the training curves with tanhD(2) (Figure 2-c). The fit to the parabola matches intuition closely; the different levels of discretization symmetrically reduce the error in a straightforward manner. As the number of levels is increased (d and e), the performance approximates and then matches the networks trained with tanh and ReLU activations.

Figure 2: Fitting a parabola with 2 hidden units. The red area is the error between the actual and predicted values after 100,000 epochs. In the top row, the hidden unit activations are standard tanh and ReLU. In the second row are discretized versions of tanh: tanhD(2), tanhD(8), tanhD(256). The discretization levels clearly affect the network's performance. For example, with tanhD(2), the network has found a reasonable symmetric approximation but, with only two hidden units, it cannot overcome the discretization artifacts.

To summarize, we have given a simple procedure to discretize the outputs of a unit's activations during inference and training. For ease of exposition and clarity, it was presented with tanh, though testing has been done with most, if not all, of the commonly used activation functions. Beyond tanhD, we will demonstrate the use of discretized ReLU activations in Section 3.

2.2 Weight Enumeration

We turn now to reducing the set of allowed weights to a small, predefined number. As the goal of our work is efficient inference, we do not attempt to stay discretized during the training process. The process used to obtain only a small number of unique weights is a conceptually simple addition to standard network training. Like activation discretization, it can be used with any weight-setting procedure, from SGD or ADAM to evolutionary algorithms.

Previous research has addressed making the network weights integers Wu et al. (2018); Yi et al. (2008), as well as reducing the weights to only binary or ternary values Deng et al. (2017); Courbariaux et al. (2016) during both testing and training, using a stochastic update process driven by the sign bit of the gradient. Empirically, many of the previous techniques that compress or quantize the weights of an already-trained model perform poorly on real-valued regression problems. While our implementations of these methods Denton et al. (2014); Han et al. (2016); Lin et al. (2015) are quite successful on classification problems, we were unable to achieve comparable performance using those techniques on networks that perform image compression and reconstruction. Discretizations can create sharp cuts that seem to be beneficial for decision boundaries but hinder performance when regressing to real-valued variables. Tasks that require real-valued outputs have become common recently (e.g. image-to-image translation Isola et al. (2017), image compression, and speech synthesis, to name a few). Fortunately, our method exhibits good performance on regression tasks, as well as providing an easily tunable hyper-parameter (the number of weight clusters), thereby alleviating any remaining task impact.

Perhaps the most straightforward approach to creating a network with only a limited number of unique weights is to start with a trained network and place each weight into one of a small number of equally sized bins that span the full range of weight values. Each weight is then assigned the centroid of its bin, as sketched below. This approach has limitations: (1) the network needs to be retrained after this procedure; however, continuing training re-introduces small deviations from the centroids, and the network therefore once again has a large number of unique weights. (2) The distribution of weights is far from uniform; see Figure 3 (top row). (3) If we decrease the fidelity of the relatively few large-magnitude weights, we observe severely degraded performance across a wide variety of tasks.
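For concreteness, a minimal sketch of this naive post-training binning (bin midpoints stand in for the bin centroids):

```python
import numpy as np

def uniform_bin_weights(weights, num_bins):
    """Place each trained weight into one of num_bins equally sized bins
    spanning the full weight range, and snap it to the bin's midpoint."""
    lo, hi = float(weights.min()), float(weights.max())
    edges = np.linspace(lo, hi, num_bins + 1)
    mids = 0.5 * (edges[:-1] + edges[1:])
    idx = np.clip(np.digitize(weights, edges) - 1, 0, num_bins - 1)
    return mids[idx]
```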

Figure 3: (top) MNIST distributions of weights trained with no weight quantization, shown for epochs 1000, 10000, and 100000. (middle) Distribution of weights when trained with a maximum of 1000 unique weights; the same epochs are shown, immediately prior to weight quantization. (bottom) Weights shown after weight quantization. The y-axis is log-scale to show the lesser-occupied bins. The frequency distributions are more Laplacian than Gaussian when viewed on a linear scale.

To address these limitations, we adaptively cluster the weights throughout the training process. Rather than fully training a network and then discretizing the weights, a recurring clustering step is added to the training procedure. Periodically, all of the weights in the network (including the bias weights) are added to a single bucket from which the clusters are found. This is a one-dimensional clustering problem, where the single dimension is the value of the weight. (While we have not yet tested 1D-specific clustering Jenks (1967); Wang and Song (2011), all of the approaches we tried (e.g., LVQ Kohonen (1995), k-means Jain (2010), HAC Duda et al. (1995)) gave similar results. We settled on 5 simple lines of code, with scipy scipy.org (2018).) Clustering, rather than using uniformly sized bins, ensures that the bin spacing respects the underlying distribution of the weights. Once the clusters are created, each weight is replaced with the centroid of its assigned cluster, thereby reducing the number of unique weights to the desired count. After weight replacement, training continues with no modifications until the next clustering step, at which point the cycle repeats (for all of our experiments, quantization occurs after every 1000 steps).
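A minimal sketch of one such clustering step using scipy (the function name and the per-variable bookkeeping here are illustrative, not the authors' code):

```python
import numpy as np
from scipy.cluster import vq

def cluster_and_replace(weight_arrays, num_clusters=1000):
    """One periodic weight-quantization step (run e.g. every 1000 training steps).

    All weights in the network, including biases, are pooled into a single
    1-D array, clustered with k-means, and each weight is replaced with the
    centroid of its assigned cluster.  `weight_arrays` is a list of numpy
    arrays (one per layer/variable); a list with quantized values is returned.
    """
    flat = np.concatenate([w.ravel() for w in weight_arrays]).astype(np.float64)
    obs = flat.reshape(-1, 1)                        # 1-D observations for scipy
    centroids, _ = vq.kmeans(obs, num_clusters)      # find (up to) num_clusters centers
    codes, _ = vq.vq(obs, centroids)                 # assign each weight to a center
    quantized = centroids[codes].ravel()
    out, start = [], 0
    for w in weight_arrays:                          # restore the original shapes
        out.append(quantized[start:start + w.size].reshape(w.shape))
        start += w.size
    return out
```

For very large networks this clustering step is expensive, which motivates the fixed Laplacian-like quantization levels discussed with the ImageNet experiments.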

This procedure, though simple, has some subtle effects. First, as a training regularizer, it keeps the range of the weights from growing too quickly, as there is a persistent “regression to the mean.” Second, it provides a mechanism to inject directed noise into the training process. As we will show, both of these properties have, at times, yielded improved results over allowing arbitrary-valued weights. Figure 3 (middle row) shows the distribution of weights at the beginning, middle, and end of training when weight clustering is used, immediately prior to the quantization step. With 1000 clusters used throughout training, the weights after replacement (Figure 3, bottom row) appear very similar to the pre-quantization weights (Figure 3, middle row).

3 Experiments

We conducted an extensive exploration of tasks to elucidate how discretization of activations and weights affects performance, including tests of memorization capacity, real-valued function approximation, and numerous classification problems. Because of space limitations, however, we present only the three most commonly studied tasks; these are representative of the results seen across our studies. We present two classification tasks, MNIST LeCun et al. (1998) and ImageNet Russakovsky et al. (2014), and a real-valued output task: auto-encoding images, the crucial building block of neural-network-based image compression.

In all of our tests, we retrained the baselines to eliminate the possibility of any task-specific heuristics. In some cases, this led to lower performance than state-of-the-art; however, since our goal is to measure the relative effect of discretizations on any network, the results provide the insights needed.

3.1 MNIST

For MNIST, we train a fully connected network with ADAM Kingma and Ba (2015) and vary the number of hidden units to explore the trade-off between discretization, accuracy, and network size. Figure 4-a contains the performance of the networks using ReLU and tanh activation functions with no discretizations; these are the baselines. Since tanh slightly outperformed ReLU, we will discretize tanh in our experiments.

Figure 4: Effects of activation discretization vs. number of hidden units on MNIST classification, for networks with 2 hidden layers (top row) and 4 hidden layers (bottom row); columns show (a) activations discretized, (b) weights discretized, (c) both discretized. The only performance loss is observed when the number of unique weights is reduced to 100. With 1000 weights, with or without activation discretization, performance matches or surpasses ReLU and tanh. Average of 3 runs shown.

First, we examine the effect of only discretizing each unit's activations. We experimented with 8 sets of activation discretization (2, 4, 8, … 256 levels). We found that both tanhD(8) and tanhD(16) often perform as well as tanh and ReLU when there are enough hidden units per layer. At tanhD(32) and above, performance is largely indistinguishable from tanh (Figure 4-a). Next, we examine weight discretization in isolation. Two sets of experiments are presented: 1000 and 100 unique weights (Figure 4-b). With 1000 unique weights allowed, the performance is nearly identical to no weight quantization. However, when the number of unique weights is reduced to 100, there is a decline in performance. Nonetheless, note that even with 100 unique weights, performance recovers with additional hidden units, hinting at a trade-off in representational capacity between the number of distinct values a weight can take and the number of weights in the network.

Finally, when both discretizations are combined, we again see that the only noticeable negative effect comes when the number of unique weights is set to 100. No matter which activation is used, performance decreases with only 100 unique weights. This same trend holds for networks with 2 hidden layers (top row) and with 4 hidden layers (bottom row).

3.2 Auto-Encoding

A number of recent as well as classic research papers have used auto-encoding networks as the basis for image compression Jiang (1999); Toderici et al. (2016); Cottrell and Munro (1988); Kramer (1991); Toderici et al. (2017); Svoboda et al. (2016). To recreate the input image, real values are used as the outputs. As discussed earlier, real value approximation can be a more challenging problem domain than classification when discretizing the network.

For these experiments, we train two network architectures: one convolutional and one with only fully connected layers. ADAM is used for training, and the reconstruction error is minimized. We trained with the ImageNet training set, and all tests are performed on the validation images. The smallest conv-network has four conv. layers, followed by 3 conv-transpose layers; the last two layers are conv. layers with depth 20 and 3. For the second experiment, the fully connected network has 7 hidden layers. To examine the effects of network size, a width-scaling factor is varied from 0.5 to 2.0 for both experiments.

Because the raw numbers are not meaningful in isolation, we show performance relative to training the smallest network with ReLU activations and no quantization (the graphs for the two architectures can be compared to see the effects of network size). In Figure 5-a, for both architectures, ReLU performed worse than tanh. TanhD(32) and tanhD(256) tracked the performance of tanh closely for all sizes of the network. As in the MNIST experiment (Figure 4-b), reducing the number of unique weights to 100 hurt performance. With 1000 unique weights, the performance decline was much smaller; however, unlike with MNIST, there was a discernible effect.

When the two discretizations were combined, again, the largest impact came from setting the weight-discretization level too low. As before, increasing the network size recovers the performance lost due to weight and activation quantization. Importantly, this task indicates that although the discretization procedures do take a larger toll on performance with real-valued outputs, discretization remains a viable approach for reducing a network's computation and memory. The amount of performance degradation tolerated can be explicitly dictated by the needs of the application by controlling the number of unique weights.

Figure 5: Auto-encoding results for the convolutional (top row) and fully connected (bottom row) architectures; columns show (a) activations discretized, (b) weights discretized, (c) both discretized. (a) Effects of activation discretization vs. network size: the worst performer is unmodified ReLU; tanh and tanhD(256) performed better and were equivalent to each other. (b) When weights are discretized, performance declines as the number of unique weights decreases. (c) Combined effects of both discretizations (one configuration falls below the plot range of (c) and is not shown).

3.3 ImageNet

To evaluate the effects of discretization on a larger network, we used AlexNet Krizhevsky et al. (2012) to address the 1000-class ImageNet task. To ensure that we are training our network correctly, we first retrained AlexNet from scratch using the same architecture and training procedures specified in Krizhevsky et al. (2012); the small differences are: we employed an RMSProp optimizer, a weight initializer with sd=0.005, a bias initializer with sd=0.1, a single Nvidia Tesla P100 GPU, and a stepwise-decaying learning rate. Our network achieved a recall@5 accuracy of 80.1% and a recall@1 accuracy of 57.1%. This should be compared to the accuracies reported in Krizhevsky et al. (2012) of 81.8% and 59.3% for recall@5 and recall@1, respectively. The small difference in performance arises because we did not use the PCA pre-processing, which Krizhevsky et al. (2012) cite as accounting for roughly the size of the difference seen. This performance was achieved by training with crops and flips and using the average of multiple forward passes with random crops during evaluation.

With the above matching performance, we were confident that our training approach sufficiently matches the original AlexNet training. However, as we needed to speed up training to thoroughly explore the parameter space of interest, we eliminated the use of multiple crops in training and testing and retrained the AlexNet system. This system is used as our baseline. The baseline (AlexNet without crops and resizes) achieved a recall top-1 of 49.6% and top-5 of 74.2%.

To begin, we examined the effect of switching to ReLU6 instead of ReLU. AlexNet with ReLU6 achieved a recall top-1 of 47.8% and top-5 of 72.8% (Table 1, Experiment #1). All of the remaining comparisons will use the exact same training procedure, only differing in which discretizations and activations are used.

Following our experimental design, we first independently examine the performance of only discretizing each unit’s activations, see Table 1. With 256 activation levels (8 bits) down to only 32 levels (5 bits) (Experiments #2 and #3), there is little degradation in performance in comparison to using the full 32 bits of floating point (Experiment #1). As expected, as the activation levels become more sparse, performance declines (Experiments #4 and #5).

Using the most aggressive acceptable discretization (32 values), we turn to our next experiment: reducing the number of unique weights allowed. We set the number of unique weights to 1000 (Experiment #6). The only training modification was the elimination of dropout. As illustrated in Figure 3, the discretization process itself acts as a regularizer, so dropout is not needed. Wu et al. (2018) took a similar approach and removed dropout from their AlexNet discretization experiments. (Further, Wu et al. (2018) did not discretize the last layer when reporting results; all of our discretized AlexNet results include discretization of the final layer.)

Examining Experiment #6, there is actually an increase in performance in both recall@1 and recall@5 over the baseline AlexNet-ReLU6 (Experiment #1). The results even overcame the slight penalty of using ReLU6 instead of ReLU (Experiment #0). Further, Experiment #7, with only 100 unique weights, performed much better than we would have expected given the earlier results. We speculate that, unlike the other tasks in which reducing the number of unique weights to 100 was detrimental to performance, AlexNet has so much extra capacity and depth that the effective decrease in representational capacity for each weight was lessened by the large architecture. We will return to this later.

Experiment | # | Activation Levels | Unique Weights | Recall@1 | Recall@5
AlexNet - Full Training & Testing (w/crops + rotations) | - | - | - | 57.1 | 80.1
AlexNet w/ ReLU & simplified training & testing | 0 | - | - | 49.6 | 74.2
AlexNet w/ ReLU6 & simplified training & testing | 1 | - | - | 47.8 | 72.8
Continuous weights, only discretized activations | 2 | 256 | - | 47.0 | 72.4
Continuous weights, only discretized activations | 3 | 32 | - | 46.9 | 72.0
Continuous weights, only discretized activations | 4 | 16 | - | 45.5 | 71.1
Continuous weights, only discretized activations | 5 | 8 | - | 37.0 | 63.2
k-Means discretized weights and discretized activations (no dropout) | 6 | 32 | 1000 | 49.6 | 74.7
k-Means discretized weights and discretized activations (no dropout) | 7 | 32 | 100 | 47.2 | 71.6
Laplacian discretized weights, discretized activations - with dropout | 8 | 32 | 1000 | 47.4 | 72.3
Laplacian discretized weights, discretized activations - without dropout | 9 | 32 | 1000 | 51.7 | 75.7
Laplacian discretized weights, discretized activations (Exp #9) with full training | 10 | 32 | 1000 | 55.7 | 79.3
Table 1: AlexNet Comparisons

The results to this point show minimal loss in performance (relative to Experiment #1) after the activations are discretized to 32 levels and only 1000 unique weights are used. This improves on the relative loss seen in earlier studies Zhou et al. (2018); Wu et al. (2018); Hubara et al. (2016). Nonetheless, as a final test, we ask whether it is possible to do better. The AlexNet network contains approximately 50 million weights. Unconstrained clustering using all of the weights is computationally expensive, slowing our training process; however, sampling the weights for modeling leads to degraded performance, as the large-magnitude weights may not be accurately represented. Here, we briefly outline our final experiment: replacing the clustering procedure with an approximation of what it should do, based on our observations of its behavior across all problems. This approximation was guided by the quantization levels produced by the k-Means clustering: as shown in Figure 3-c, the distribution of unconstrained cluster centers converges to an approximation of a truncated Laplacian distribution.

Can we force the quantization levels into a Laplacian-like distribution? We do this by fixing a set of normalized levels and then setting the actual quantization levels to the mean of the network weights plus each normalized level multiplied by a scaling factor. Our original approach was to set the scaling factor to the maximum amplitude difference between any weight and the mean; this scaling allows us to accurately model the largest-magnitude weights. However, when we do that, we lose the regularization benefits seen in Figure 3-b and -c. Instead, we “nudge” the scaling factor slightly lower whenever the weights are spread out by more than the expected range of desired values. We have found that we can also speed up training by providing a similar “nudge” upward to the scaling factor when the weights are clustered too tightly around their mean.

We have seen an improvement in performance using this Laplacian discretization (Experiment #9): we surpass both our fully continuous baseline (Experiment #1) and our k-Means weight clustering (Experiment #6). Finally, in Experiment #10, we repeated Experiment #9 but, instead of using the simplified training (as Experiments #0-#9 do), we used the “full training” process, including crops and rotations in training and testing. With that full training, we regain nearly all of the accuracy of the original AlexNet. In fact, the drop in accuracy from the original AlexNet to Experiment #10 is less than the drop in accuracy from Experiment #0 to Experiment #1, where the only change we made was replacing ReLU with ReLU6 (without any quantization). Whether the performance benefit continues is an exciting avenue for future work; regardless, avoiding the computational expense of clustering is already pragmatically very beneficial.

Method | Top-1 baseline | Top-1 quantized | Top-1 difference | Top-5 baseline | Top-5 quantized | Top-5 difference
Our work | 57.1% | 55.7% | -1.4% | 80.1% | 79.3% | -0.8%
DoReFa Zhou et al. (2018) | 55.9% | 53.0% | -2.9% | - | - | -
WAGE Wu et al. (2018) | - | - | - | 80.7% | 72.2% | -8.5%
QNN Hubara et al. (2016) | 56.6% | 51.3% | -5.3% | 80.2% | 73.7% | -6.5%
Table 2: Accuracy of AlexNet under Quantization: Comparison to Prior Work

Compared to the prior work that focused on moving away from floating-point implementations (Table 2), our approach is the only one that did not suffer a significant loss in performance relative to the unquantized version of the network. We have, by far, the best performance both relative to baseline and in absolute terms. Han et al. (2016) is the only other reference we have found whose weight quantization did not suffer a performance loss, but Han et al. (2016) does not quantize activations and requires floating-point calculations. DoReFa Zhou et al. (2018), which is the closest in performance to ours, is 8 times slower than the baseline implementation, whereas we expect our implementation to be faster than the baseline, due to the relative speed of lookups versus multiplies.

4 Memory Savings, No Multiplication, No Floating Point

As shown, it is possible to train networks with little (if any) additional error introduced through discretizing the activation function. On top of the discretized activation function, we can use our clustering approach to reduce the number of unique weights in the network. With these two discretization components in place, an inference step in a fully trained neural network can be completed with no multiplication operations, no non-linear function computation, and no floating point representations.

To accomplish this, we discretize the non-linear activation function to a fixed number of output levels and allow only a fixed number of unique weights in the network. We pre-compute all of the multiplications required and store them in a table with one entry per (unique weight, activation level) pair. In our AlexNet experiments, we typically used 32 activation levels and 1000 unique weights, which required storing 32,000 entries. This extra memory requirement is completely offset by the savings obtained from no longer needing to store the weights themselves. Previously, a floating point number (32 bits) was required for each weight; with this method, only an index into the table is needed (10 bits). Given the number of weights in a network like AlexNet (approximately 50 million), this reflects roughly a 3x savings in memory, in addition to the computational savings detailed below. Furthermore, in terms of bandwidth for downloading trained models (for example, to mobile devices), we can gain further efficiency by entropy coding the weight indices: based on our fully-trained discrete network weight distributions, even the simplest (non-adaptive, marginal-only) entropy coding reduces the index size from 10 bits to below 7 bits, giving an additional savings of more than 30% in model storage size. (This same memory/bandwidth savings is available as soon as the weights are clustered, even if the activations are not discretized.)
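As a back-of-the-envelope check (our arithmetic, using the numbers above; the paper's exact savings figure may be computed differently):

```python
# AlexNet-sized comparison: ~50M weights, 32 activation levels, 1000 unique weights.
num_weights = 50_000_000
float_bits  = 32                          # one 32-bit float per weight (baseline)
index_bits  = 10                          # ceil(log2(1000)) bits per weight index
table_bits  = 32 * 1000 * 32              # 32,000 precomputed products at 32 bits each

baseline_bytes  = num_weights * float_bits // 8                 # ~200 MB
quantized_bytes = (num_weights * index_bits + table_bits) // 8  # ~63 MB
print(baseline_bytes / quantized_bytes)   # roughly 3x smaller; entropy coding helps further
```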

We replace floating point computations with table lookups and integer summation. The first step is to replace each weight multiplication by a lookup into the table described above, with the refinement of representing the result (the looked-up value) using a fixed-point representation Wiki (2018). As long as our weights are within a known, bounded range (which they are, since we know our cluster-center values) and as long as the previous layer's outputs are in a known, bounded range (which they are, due to our bounded quantization), we can always pick a scale factor that keeps the result from overflowing our fixed-point representation. As we look up the values, we simply sum them together with integer summation. The summed value is then mapped to the output value of the activation using another lookup. Also, note that in the unlikely event that longer integers are required in the tables, increasing the bit width here is a minor expense in comparison to storing full-resolution weights.

4.1 Implementation Details

In this section, we provide suggestions to make the implementation more efficient. In the first example, shown in Figure 6, we show how a pre-computed multiplication table can be deployed. In the example, we show a single unit within a network; the unit has 4 inputs plus a bias. Note that the stored multiplication table also has a row for the bias unit's computation (e.g. multiplying the bias unit's weight by an “activation” of 1.0), and that the same multiplication table is used across all of the network's nodes. (It is possible to do a per-layer, or per-filter-type, clustering procedure as well (not explored in this paper) instead of clustering all the weights of the network together. If this divided approach were taken, there could be multiple multiplication tables stored in memory for the same network. Despite the increase in memory required to store these tables, in contrast to large networks that do not use any form of discretization, this extra memory requirement would be eclipsed by the savings of not representing each weight individually.)

Figure 6: Using a stored multiplication table to avoid multiplies at inference time. The activation output is also stored rather than computed; therefore no non-linearities are evaluated at inference time.

Once the summation is computed by adding the looked-up entries from the multiplication table, the activation is applied to the summed value. In this example, the activation is a ReLU-6 activation function with 6 levels of discretization. There are two remaining inefficiencies in this system. First, in terms of memory, we have represented all the stored values in floating point. Second, though we do not compute the activation function's output, finding the right output, one of the 6 stored values, requires that we examine the boundaries of the bins. This scan (whether binary or linear) must exist if we want truly to have inference without any multiplications (or, equivalently, division operations).
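A minimal sketch of the Figure 6 scheme for a single unit (names and table layout are illustrative; values here are still floating point, which the next step removes):

```python
def unit_forward(input_act_ids, weight_ids, bias_weight_id,
                 mult_table, bias_row, boundaries, act_values):
    """Inference for one unit using only lookups, additions, and a boundary scan.

    mult_table[w][a] holds the precomputed product of unique weight w and
    discretized activation level a; bias_row[w] holds weight w's value * 1.0.
    boundaries[i] is the upper edge of activation bin i (one fewer entry than
    act_values); act_values[i] is the stored output of the discretized activation.
    """
    total = bias_row[bias_weight_id]                 # bias term: weight * 1.0
    for w, a in zip(weight_ids, input_act_ids):      # one lookup per incoming connection
        total += mult_table[w][a]                    # lookup replaces the multiply
    b = 0
    while b < len(boundaries) and total > boundaries[b]:
        b += 1                                       # linear scan of the bin boundaries
    return act_values[b]                             # stored activation output, not computed
```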

To address the first issue, we switch to a fixed-point / integer representation for all the stored values, as shown in Figure 7. All of the values are multiplied by a large scale factor before they are converted to integers and stored in the table. The scale factor must be large enough to push the important decimal places to values greater than 1. The easiest method of selecting it is empirical, as having it too large is not detrimental as long as the additions fit in the memory allocated for the required temporary accumulator variables. Note that even if long ints are required (in most cases they will not be), the amount of memory devoted to the table and associated accumulators is minuscule in comparison to the amount of memory required to store all the weights. Note also that the bias unit's effect (its weight multiplied by a fixed input of 1.0) must now be scaled to the appropriate range. The activation function now emits an integer / fixed-point number that is scaled by the same factor. Finally, note that the boundaries for the activation function must be adjusted by the scale factor, since both the weights and the activation outputs are scaled when entered into and looked up from the table.

Figure 7: Extending the method shown in Figure 6 to not use any floating-point representations. All values are pre-computed to include a large scale factor. Note that the bias's effect needs to be scaled by this factor, as do the boundaries in the activation function.

To address the second inefficiency, the scan of the boundaries in the activation function, we directly look up the bin in the activation that contains the correct output; see Figure 8. First, note that we set the scale factor to be a large power of 2 (2^48 in this example). This is in contrast to the more common choice of large scale factors that are powers of 10. By sacrificing easy human readability, we enable faster operations. As shown, all of the same computations are conducted for the multiplication table and, as before, the results are stored as fixed-point integers. Instead of requiring a scan of the boundaries in the activation function, after the summation of the inputs is computed, it is bit-shifted to the right; in this example, the shift removes the least significant 48 bits. Once these bits are removed, the result can be interpreted as the index of the “bin” in which to find the output of the activation function. The bit-shift operation is faster than a linear (or binary) scan of the boundaries and faster than general multiplication and division operations.
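A minimal sketch combining Figures 7 and 8, under the assumptions in the text (scale factor 2^48 and uniformly sized activation bins of width 1.0, as in the ReLU-6 example); all names are illustrative:

```python
SHIFT = 48                    # scale factor F = 2**48, a power of two

def unit_forward_fixed_point(input_act_ids, weight_ids, bias_weight_id,
                             mult_table_q, bias_row_q, act_values_q, num_bins):
    """Integer-only inference for one unit: lookups, integer adds, one bit shift.

    mult_table_q[w][a] = round(weight_w * activation_a * 2**SHIFT) and
    bias_row_q[w] = round(weight_w * 1.0 * 2**SHIFT), stored as integers;
    act_values_q holds the discretized activation outputs, also pre-scaled.
    """
    total = bias_row_q[bias_weight_id]
    for w, a in zip(weight_ids, input_act_ids):
        total += mult_table_q[w][a]                  # integer add of a looked-up product
    idx = total >> SHIFT                             # bin index = floor(sum / F), since bins have width 1.0
    idx = 0 if idx < 0 else (num_bins - 1 if idx >= num_bins else idx)  # clamp (ReLU-6 style)
    # idx is what the next layer's table lookups need; act_values_q[idx] is the
    # (scaled) numeric output of the discretized activation, found by lookup.
    return idx, act_values_q[idx]
```

Note that negative sums shift to a negative index and are clamped to bin 0, matching the lower bound of the ReLU-6 activation.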

One limitation of this lookup approach is that it works only when the spacing between the activation boundaries is uniform. For the ReLU-6 activations, the size of the output bins was kept constant; however, when we discretized the tanh activation (Section 2.1), we noted that the width of the bins varied. Using this approach with a tanh non-linearity necessitates retraining with constant-size boundaries for the activation discretization. Though achievable, this would eliminate one of the advantages that led to faster training: that the size of the bins corresponded well to the derivative of the underlying function. This had allowed faster moves between the discrete activation bins in the regions where the underlying activation had the largest-magnitude derivatives.

In summary, the final approach, shown in Figure 8, accomplishes what we set out to do at inference time: (1) multiplications are eliminated from inference; (2) all of the values used in inference are represented as integers or fixed-point decimals, and no floating point is used; (3) the activation computation is fast: no non-linearities are computed, and no scanning of an array is required to avoid multiplies, since a judicious setting of the scale factor lets us use a fast bit-shift operation instead.

Figure 8: Extending the method shown in Figure 7 for more efficient activation function computation. In the previous method, we had to lookup the correct bin in the activation by scanning (or binary searching) an array of outputs to find the correct entry. In this implementation, we index directly into the bin by using a shift to remove the lower order bits and leave only the index into the correct activation bin.

Note that during training, floating point is used. Wu et al. (2018) addresses training with integers with various classification problems. Our goal is to ensure that networks, even if trained on the fastest GPUs, can be deployed on small devices that may be constrained in terms of power, speed or memory. Additionally, for our requirements, which encompass deployment of networks outside of the classification genre, we needed the discretization techniques to work with regression/real-valued outputs.

5 Discussion & Future Work

The need to enable more complex computations on the enormous number of deployed devices, rather than sending raw signals back to remote servers for processing, is rapidly growing. For example, auditory processing within hearing aids, vision processing in remote sensors, custom signal processing in ASICs, and the recent photo applications running on the low- and medium-powered cell phones prevalent in many developing countries will all benefit from on-device processing.

Pursuing discretized networks has led to a number of interesting questions for future exploration. Four immediate directions for future work are presented below. We also use this opportunity to discuss some of the insights/trends we noticed in our study but were not able to discuss fully here.

  • For discretizing weights, all of the network’s weights were placed into a single bucket. An alternative is to cluster the weights of each layer, or set of layers, independently. If there are distribution differences between layers, this may better capture the most significant weights from each layer.

  • Currently, the number of weight clusters is kept constant throughout training. However, we have witnessed instability at the beginning of training with quantized weights, especially when the number of clusters is small. These spikes in the training loss dissipate as training progresses. Starting training with a larger-than-desired number of clusters and gradually decreasing it may address the initial instability.

  • In many problems, we have seen weight distributions that appear Laplacian. Exploring the use of explicit Laplacian (or other parameterized) models rather than the parameter-less clustering approach is an immediate direction for future work. Preliminary results were quite promising, as shown by Experiments #8 and #9 in Table 1.

  • We, and other studies, have noticed the regularization-type effects of these methods. Additionally, we have noticed improved performance when other regularizers, such as Dropout, are not used. The use of these methods as regularizers was not explored here and is open for future work.

Beyond the practical ramifications of these simplified networks, perhaps what is most interesting are the implications for network capacity. In general, we have embraced training ever-larger networks to address growing task complexity and performance requirements. However, if we can obtain comparable performance using only a small fraction of the representational power in the activations and the weights, then, with respect to our current models, much smaller networks could store the same information. Why does performance improve with larger networks? Perhaps the answer lies in the pairing of the network architectures and the learning algorithms. The learning algorithms control how the search through the weight-space progresses. It is likely that the architectures used today are explicitly optimized for the task and implicitly optimized for the learning algorithms employed. The large-capacity, widely-distributed networks work well with the gradient descent algorithms used. Training networks to more efficiently store information, while maintaining performance, may require the use of alternate methods of exploring the weight-space.

References

  • Agustsson et al. (2017) Eirikur Agustsson, Fabian Mentzer, Michael Tschannen, Lukas Cavigelli, Radu Timofte, Luca Benini, and Luc Van Gool. Soft-to-hard vector quantization for end-to-end learned compression of images and neural networks. CoRR, abs/1704.00648, 2017. URL http://arxiv.org/abs/1704.00648.
  • Baluja (2018) Shumeet Baluja. Empirical explorations in training networks with discrete activations. CoRR, abs/1801.05156, 2018. URL http://arxiv.org/abs/1801.05156.
  • Bengio et al. (2013a) Yoshua Bengio, Nicholas Léonard, and Aaron Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013a.
  • Bengio et al. (2013b) Yoshua Bengio, Nicholas Léonard, and Aaron C. Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. CoRR, abs/1308.3432, 2013b. URL http://arxiv.org/abs/1308.3432.
  • Bourzac (2017) Katherine Bourzac. Speck-size computers: Now with deep learning [news]. IEEE Spectrum, 54(4):13–15, 2017.
  • Cottrell and Munro (1988) Garrison W Cottrell and Paul Munro. Principal components analysis of images via back propagation. In Visual Communications and Image Processing’88: Third in a Series, volume 1001, pages 1070–1078. International Society for Optics and Photonics, 1988.
  • Courbariaux et al. (2016) Matthieu Courbariaux, Itay Hubara, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1. arXiv preprint arXiv:1602.02830, 2016.
  • Deng et al. (2017) Lei Deng, Peng Jiao, Jing Pei, Zhenzhi Wu, and Guoqi Li. Gated XNOR networks: Deep neural networks with ternary weights and activations under a unified discretization framework. CoRR, abs/1705.09283, 2017. URL http://arxiv.org/abs/1705.09283.
  • Denton et al. (2014) Emily L Denton, Wojciech Zaremba, Joan Bruna, Yann LeCun, and Rob Fergus. Exploiting linear structure within convolutional networks for efficient evaluation. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 1269–1277. Curran Associates, Inc., 2014. URL http://papers.nips.cc/paper/5544-exploiting-linear-structure-within-convolutional-networks-for-efficient-evaluation.pdf.
  • Duda et al. (1995) Richard O Duda, Peter E Hart, and David G Stork. Pattern Classification and Scene Analysis, 2nd ed. Wiley Interscience, 1995.
  • Finnerty and Ratigner (2017) Ambrose Finnerty and Hervé Ratigner. Reduce power and cost by converting from floating point to fixed point, 2017. URL http://www.xilinx.com/support/documentation/white_papers/wp491-floating-to-fixed-point.pdf.
  • Glorot et al. (2011) Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Deep sparse rectifier neural networks. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 315–323, 2011.
  • Goodfellow et al. (2013) Ian J Goodfellow, David Warde-Farley, Mehdi Mirza, Aaron Courville, and Yoshua Bengio. Maxout networks. arXiv preprint arXiv:1302.4389, 2013.
  • Han et al. (2016) Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. International Conference on Learning Representations (ICLR), 2016.
  • Hinton (2012) G. Hinton. Neural networks for machine learning. Coursera, video lectures, 2012.
  • Hou et al. (2016) Lu Hou, Quanming Yao, and James T Kwok. Loss-aware binarization of deep networks. arXiv preprint arXiv:1611.01600, 2016.
  • Hubara et al. (2016) Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Quantized neural networks: Training neural networks with low precision weights and activations. arXiv arXiv:1609.07061, 2016.
  • Isola et al. (2017) Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. arXiv preprint, 2017.
  • Jain (2010) Anil K Jain. Data clustering: 50 years beyond k-means. Pattern recognition letters, 31(8):651–666, 2010.
  • Jenks (1967) George F Jenks. The data model concept in statistical mapping. International yearbook of cartography, 7:186–190, 1967.
  • Jiang (1999) J Jiang. Image compression with neural networks–a survey. Signal Processing: Image Communication, 14(9):737–760, 1999.
  • Kingma and Ba (2015) Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
  • Kohonen (1995) Teuvo Kohonen. Learning vector quantization. In Self-Organizing Maps, pages 175–189. Springer, 1995.
  • Kramer (1991) Mark A Kramer. Nonlinear principal component analysis using autoassociative neural networks. AIChE journal, 37(2):233–243, 1991.
  • Krizhevsky and Hinton (2010) Alex Krizhevsky and G Hinton. Convolutional deep belief networks on cifar-10. Unpublished manuscript, 40:7, 2010.
  • Krizhevsky et al. (2012) Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
  • Lane et al. (2015) Nicholas D Lane, Sourav Bhattacharya, Petko Georgiev, Claudio Forlivesi, and Fahim Kawsar. An early resource characterization of deep learning on wearables, smartphones and internet-of-things devices. In Proceedings of the 2015 International Workshop on Internet of Things towards Applications, pages 7–12. ACM, 2015.
  • LeCun et al. (1998) Yann LeCun, Corinna Cortes, and Christopher JC Burges. The MNIST database of handwritten digits, 1998.
  • Lin et al. (2015) Darryl Dexu Lin, Sachin S. Talathi, and V. Sreekanth Annapureddy. Fixed point quantization of deep convolutional networks. CoRR, abs/1511.06393, 2015. URL http://arxiv.org/abs/1511.06393.
  • Maddison et al. (2016) Chris J. Maddison, Andriy Mnih, and Yee Whye Teh. The concrete distribution: A continuous relaxation of discrete random variables. CoRR, abs/1611.00712, 2016. URL http://arxiv.org/abs/1611.00712.
  • Mentzer et al. (2018) Fabian Mentzer, Eirikur Agustsson, Michael Tschannen, Radu Timofte, and Luc Van Gool. Conditional probability models for deep image compression. CoRR, abs/1801.04260, 2018. URL http://arxiv.org/abs/1801.04260.
  • Nair and Hinton (2010) Vinod Nair and Geoffrey E Hinton. Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th international conference on machine learning (ICML-10), pages 807–814, 2010.
  • Plagianakos et al. (2001) VP Plagianakos, GD Magoulas, NK Nousis, and MN Vrahatis. Training multilayer networks with discrete activation functions. In Neural Networks, 2001. Proceedings. IJCNN’01. International Joint Conference on, volume 4, pages 2805–2810. IEEE, 2001.
  • Raiko et al. (2014) Tapani Raiko, Mathias Berglund, Guillaume Alain, and Laurent Dinh. Techniques for learning binary stochastic feedforward neural networks. arXiv preprint arXiv:1406.2989, 2014.
  • Rippel and Bourdev (2017) Oren Rippel and Lubomir Bourdev. Real-time adaptive image compression. arXiv preprint arXiv:1705.05823, 2017.
  • Rumelhart et al. (1986) David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. Learning representations by back-propagating errors. Nature, 323(6088):533, 1986.
  • Russakovsky et al. (2014) Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael S. Bernstein, Alexander C. Berg, and Fei-Fei Li. Imagenet large scale visual recognition challenge. CoRR, abs/1409.0575, 2014. URL http://arxiv.org/abs/1409.0575.
  • scipy.org (2018) scipy.org. Scipy reference guide, 2018. URL https://docs.scipy.org/doc/scipy/reference/.
  • Srivastava et al. (2014) Nitish Srivastava, Geoffrey E Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. Journal of machine learning research, 15(1):1929–1958, 2014.
  • Svoboda et al. (2016) Pavel Svoboda, Michal Hradis, David Barina, and Pavel Zemcík. Compression artifacts removal using convolutional neural networks. CoRR, abs/1605.00366, 2016. URL http://arxiv.org/abs/1605.00366.
  • Tang and Salakhutdinov (2013) Yichuan Tang and Ruslan R Salakhutdinov. Learning stochastic feedforward neural networks. In Advances in Neural Information Processing Systems, pages 530–538, 2013.
  • Toderici et al. (2016) George Toderici, Sean M. O’Malley, Sung Jin Hwang, Damien Vincent, David Minnen, Shumeet Baluja, Michele Covell, and Rahul Sukthankar. Variable rate image compression with recurrent neural networks. In International Conference on Learning Representations, 2016.
  • Toderici et al. (2017) George Toderici, Damien Vincent, Nick Johnston, Sung Jin Hwang, David Minnen, Joel Shor, and Michele Covell. Full resolution image compression with recurrent neural networks. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, pages 5435–5443. IEEE, 2017.
  • Wang and Song (2011) Haizhou Wang and Mingzhou Song. Ckmeans.1d.dp: Optimal k-means clustering in one dimension by dynamic programming. The R Journal, 3(2):29, 2011.
  • Werbos (1974) Paul Werbos. Beyond regression: new tools for prediction and analysis in the behavioral sciences. PhD thesis, Harvard University, 1974.
  • Wiki (2018) Wiki. Fixed point arithmetic, 2018. URL https://en.wikipedia.org/wiki/Fixed-point_arithmetic.
  • Wu et al. (2018) Shuang Wu, Guoqi Li, Feng Chen, and Luping Shi. Training and inference with integers in deep neural networks. arXiv preprint arXiv:1802.04680, 2018.
  • Yi et al. (2008) Yan Yi, Zhang Hangping, and Zhou Bin. A new learning algorithm for neural networks with integer weights and quantized non-linear activation functions. In IFIP International Conference on Artificial Intelligence in Theory and Practice, pages 427–431. Springer, 2008.
  • Zhou et al. (2018) Shuchang Zhou, Yuxin Wu, Zekun Ni, Xinyu Zhou, He Wen, and Yuheng Zou. DoReFa-Net: Training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv preprint arXiv:1606.06160, 2018.