# Ternary Residual Networks

Abhisek Kundu, Kunal Banerjee, Naveen Mellempudi, Dheevatsa Mudigere, Dipankar Das, Bharat Kaul (Parallel Computing Lab, Intel Labs, Bangalore, India); Pradeep Dubey (Parallel Computing Lab, Intel Labs, Santa Clara, CA, USA)

###### Abstract

Sub-8-bit representations of DNNs incur some discernible loss of accuracy despite rigorous (re)training at low precision. Such loss of accuracy essentially makes them equivalent to a much shallower counterpart, diminishing the power of being deep networks. To address this accuracy drop, we introduce the notion of residual networks, where we add more low-precision edges to sensitive branches of the sub-8-bit network to compensate for the lost accuracy. Further, we present a perturbation theory to identify such sensitive edges. Aided by such an elegant trade-off between accuracy and compute, the 8-2 model (8-bit activations, ternary weights), enhanced by ternary residual edges, turns out to be sophisticated enough to achieve very high accuracy ( drop from our FP-32 baseline), despite reduction in model size, reduction in number of multiplications, and potentially power-performance gain compared to the 8-8 representation, on the state-of-the-art deep network ResNet-101 pre-trained on the ImageNet dataset. Moreover, depending on the varying accuracy requirements in a dynamic environment, the deployed low-precision model can be upgraded/downgraded on-the-fly by partially enabling/disabling residual connections. For example, when the least important residual connections in the above enhanced network are disabled, the accuracy drop is (from FP32), despite reduction in model size, reduction in number of multiplications, and potentially power-performance gain compared to the 8-8 representation. Finally, all the ternary connections are sparse in nature, and the ternary residual conversion can be done in a resource-constrained setting with no low-precision (re)training.

## 1 Introduction

Deep Neural Networks (DNNs) such as AlexNet [15], VGG [23], and ResNet [10] have achieved remarkable accuracy in many application domains, including image classification, object detection, semantic segmentation, and speech recognition [16]. However, DNNs are notoriously resource-intensive in terms of compute, memory bandwidth, and power consumption. Deploying trained DNNs on resource-constrained devices (mobile devices, cars, robots) to make billions of predictions every day, efficiently and accurately, within a limited power budget is a considerable challenge. This motivates compact and/or reduced-precision DNN models for both mobile devices and data servers.

There are two major approaches to such problems. One is to reduce the number of parameters in the network (e.g., by finding a shallower/compact representation) while achieving accuracy similar to that of the deep network. Examples of this kind are SqueezeNet [13], MobileNet [11], and SEP-Nets [18]. These models are very small in size ( 5 MB) and are typically targeted at mobile devices. However, it is not surprising that their overall accuracy is very limited on a complex dataset such as ImageNet [4]; e.g., the SEP-Net Top-1 accuracy is 65.8%, which is 10% below that of ResNet-50. Deploying them in sensitive applications, such as autonomous cars and robots, might be impractical because these models might make too many mistakes (which could even be fatal). The other approach is concerned with reducing the size of the parameter representation (via compression or low precision). Well-known methods of this kind are pruning [6, 25, 28], quantization [21, 24, 7, 29, 12, 19], binarization [3], ternarization [17, 31, 9, 1, 20], hashing [2], Huffman coding [8], and others [30, 22]. However, despite the smaller size of the network representation, not all of these techniques may be friendly to efficient implementation on general-purpose hardware (CPUs, GPUs) (e.g., [8]). Additionally, the power consumption of DNNs depends mostly on the data movement of the feature maps and the number of multiply-and-accumulate (MAC) operations, rather than on model size. For example, convolution layers, despite having far fewer parameters than FC layers, consume more power because of the repeated use of convolution weights. Similarly, thinner and deeper networks might consume more power than shallower networks. For example, SqueezeNet [13], despite being smaller than AlexNet with comparable accuracy, consumes 33% more power [28].

Here we mainly focus on the trade-off between low-precision representation and accuracy of deeper networks, keeping an eye on power-performance factors. There is a clear need to reduce the precision of both weights and activations (which are fetched from external memory or I/O devices) for more efficient implementation of deeper networks. Such low-precision representation of activations demands specialized low-precision arithmetic [27, 26, 20] and hardware design. For example, Google's TPU [14] sets a trend for a low-precision inference pipeline (8-bit activations, 8-bit weights). Moreover, significant research effort is being expended on exploring the sub-8-bit domain of DNN representation [22, 30, 12, 20], where the interplay among model size, compute, power, and accuracy becomes trickier. For the 8-8 representation, despite reducing the model size by $4\times$ relative to FP-32, only a minimal drop in accuracy has been observed for deeper networks. However, the accuracy degrades significantly in the sub-8-bit domain, where we reduce the precision of the weights and/or the activations. Low-precision (re)training is a popular approach to recover a part of the lost accuracy.

(Re)training at low precision essentially re-parametrizes the DNN to find another local optimum in the high-dimensional, non-convex search space of parameters. However, it is not clear whether such low-precision solutions with generalization ability similar to that of FP-32 solutions exist, nor how to find them efficiently. In reality, sub-8-bit models for deep networks incur some noticeable drop in accuracy. This loss severely undermines the purpose of deploying a deep (sub-8-bit) network. For example, an 8-4 (4-bit weights) model, if it produces drop on ResNet-101, might be equivalent to an 8-8 model on the much shallower ResNet-50 in terms of model size and accuracy, but might be worse in power-performance (for deep networks even a small gain in accuracy costs significant compute (Table 1)).

This weakens the motivation for sub-8-bit models of deep networks. We seek to answer:

Considering computational benefits of ternary 8-2 models, [20] introduced a fine-grained quantization (FGQ) method that first partitions the weights into disjoint blocks, and then ternarizes them. The block size () controls the accuracy vs number of multiplications (and model size). Larger eliminates more multiplications, however, with a notable drop in accuracy (although they reported the best accuracy for sub-8-bit models on ImageNet). Another limitation of existing models is that they cannot be set on a ‘power-saving’ mode (say, via less MAC) when some further loss in accuracy is tolerable in a dynamic environment. Once deployed, existing models essentially operate in a ‘fixed-accuracy-fixed-power’ mode.

To deal with the problems discussed above, we introduce the notion of residual edges (especially) for sub-8-bit DNNs, where we add more sub-8-bit parameters to sensitive branches of the network to compensate for the lost accuracy. We propose a perturbation theory on the pre-trained DNNs to estimate the sensitivity of branches and the number of residual edges we need in order to achieve a given accuracy. We apply this method to the ternary 8-2 representation for ResNet-101 and AlexNet pre-trained on ImageNet. Guided by the theory and enhanced by the ternary residual edges, the ternary 8-2 representation turns out to be sophisticated enough to outclass the 8-8 model in terms of model size, number of multiplications, and power-performance gain, while maintaining very high accuracy. Moreover, such networks with residual edges can be upgraded/downgraded on-the-fly by partially enabling/disabling some of the residual edges, depending on the accuracy requirements in a dynamic environment. For example, when autonomous cars or robots are in a less eventful environment where fewer objects are involved, the classification problem becomes considerably simpler (it suffices to distinguish among distinct objects, such as humans, dogs, vehicles, and trees, rather than to discriminate among multiple breeds of dogs), and by disabling many edges we can downgrade the model in terms of compute, power, etc., yet maintain very high accuracy for those (fewer) classes.

Drawing an analogy between human attention and precision, and also an analogy between stress due to attention and power consumption, it is natural for us to be selectively attentive to certain tasks that require processing more information. Such upgrade/downgrade of low-precision DNNs mimics a more real-world scenario that other existing models are unable to imitate. For example, both 8-2 ternary and 8-8 models always operate in a fixed-precision, fixed-power mode irrespective of the dynamic nature of the circumstances. Using such a downgrade operation of our residual network (for ResNet-101) we observe only drop in classification accuracy (from our FP-32 baseline) keeping only parameters of ternary 8-2 network, despite eliminating multiplications and achieving power-performance gain over 8-8. Finally, the conversion from FP-32 weights to the 8-2 ternary residual model requires no low-precision (re)training and can be performed in a resource-constrained environment.

We organize the rest of the paper as follows. We interpret low-precision/sparse representation as adding noise to pre-trained DNNs. For this, we first provide a perturbation analysis of pre-trained DNNs to determine the sensitivity of the key quantities that contribute to the final error. Then, we introduce the notion of residual parameters which, when added to the perturbed network, reduce the noise and improve the accuracy. Specifically, we focus on ternary 8-2 models, and show that ternary residual networks can outperform the 8-8 representation in terms of critical factors, such as model size, number of multiplications, and power-performance, while maintaining very high accuracy. Finally, experiments on ResNet-101 and AlexNet for the ImageNet classification problem corroborate our theoretical results. We start by summarizing the frequently used notations below.

Notations and Preliminaries: For a matrix , we denote the (element-wise) Frobenius norm as , norm as , and max norm as . We can easily generalize these definitions to higher order tensors. Also, we define the spectral/operator norm as . For matrices and , . For vectors and of same dimension, inner product is defined as . From Cauchy-Schwarz inequality, . Also, .

## 2 Perturbation in a Locally-Optimal DNN

We first provide an analysis on how the output of a pre-trained (locally optimal) DNN gets distorted, layer-by-layer, in presence of additive noise to inputs and/or parameters. We want to control some of the key components of the noise to put a limit on this overall perturbation such that it has little or no impact on the accuracy. We treat a DNN as a composition of parametric transformation functions , and we can get (locally) optimal values for the parameters via network training (optimizing some parametric function defined over input space ). Then, we can interpret quantization and/or sparsification as adding a noise to the (locally) optimal parameters to produce sub-optimal . We want to quantify the effect of such sub-optimal on the final outcome, e.g., classification scores produced by the DNN. For this, let us assume that our DNN has layers and let ( is the number of classes) be the output vector of layer such that its -th component contains the classification score for -th class, for a given input . Let denote the perturbed vector due to added noise. Here we are interested in top-1 accuracy, i.e., the largest component of should remain the largest in . Mathematically, let be the index for the correct class label for an input, and we define and as: , and . Then, implies no loss of classification accuracy despite the perturbation in . Note that, for , i.e., implies that no component of the original output vector can be altered by more than . For a correctly classified input , a misclassification occurs due to perturbation when , for . Let us assume that, due to perturbation, the true classification score gets reduced and some other score gets bigger, i.e., , and , for and , such that . Then, such -perturbation does not cause misclassification if , i.e., . In other words, as long as the true classification score is at least higher than any other score, then a -perturbation has no adverse effect on classification accuracy. 
We want to construct a (e.g., based on sparse and/or low-precision representation of weights/activations) such that , for a given tolerance and for any input . We can choose for this example. In reality, we can choose more judiciously from a distribution of such differences on training data. For better interpretation and simplicity of the analysis we consider the relative perturbation as follows.

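$$\frac{\|y-\hat{y}\|_F}{\|y\|_F} \;\le\; \varepsilon \;\le\; \frac{\|y-\hat{y}\|_F}{|y_{i^*}|} \;<\; \frac{\delta}{|y_{i^*}|} \;=\; \frac{y_{i^*}-y_j}{|y_{i^*}|}$$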
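The margin argument can be checked numerically. The sketch below is illustrative only: the scores are made up, the helper names are hypothetical, and it assumes the perturbation budget is measured per component (max norm), so the worst case lowers the winning score and raises the runner-up by the same amount.

```python
def margin(scores):
    """Gap between the largest score and the runner-up."""
    top, second = sorted(scores, reverse=True)[:2]
    return top - second


def top1(scores):
    """Index of the largest score."""
    return max(range(len(scores)), key=lambda i: scores[i])


def worst_case(scores, d):
    """Lower the winning score by d and raise the runner-up by d."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    out = list(scores)
    out[order[0]] -= d
    out[order[1]] += d
    return out


scores = [2.0, 0.5, 1.2]                 # hypothetical class scores; class 0 wins
gap = margin(scores)
safe = worst_case(scores, 0.49 * gap)    # budget below half the margin
unsafe = worst_case(scores, 0.51 * gap)  # budget above half the margin
print(top1(scores), top1(safe), top1(unsafe))
```

Under this assumption the top-1 prediction survives any per-component perturbation smaller than half the margin, and can flip once the budget exceeds it.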

We want to first derive an upper bound on in terms of layer-wise parametric and/or non-parametric perturbation of the pre-trained network, and then we want to control such perturbations to keep the final accumulated perturbation to be smaller than (which can be chosen according to a distribution of such relative difference on training data). We now define a set of functions used in a DNN.

$$f_{dnn} = \{\text{Convolution with Batch Normalization},\ \text{Matrix Multiplication},\ \text{ReLU},\ \text{Pooling}\} \tag{1}$$

Functions in can be linear or non-linear, parametric or non-parametric, convex or non-convex, smooth or non-smooth. A pre-trained DNN is a fixed composition of functions in with locally optimal parameter values, i.e., DNN: , where each . More explicitly, let with parameters , and each is an arbitrary metric space where denotes a distance metric on set (for simplicity, we focus on normed space only). For all , we define and as follows. For ,

$$X^{(i)} = f_i\big(X^{(i-1)}; W^{(i)}\big), \qquad \tilde{X}^{(i)} = f_i\big(\tilde{X}^{(i-1)}; \tilde{W}^{(i)}\big),$$

where and are perturbed versions of and , respectively. We want to measure how the outcome of gets perturbed in presence of perturbed inputs and perturbed parameters . More specifically, we want to quantify the relative change in outcome of a given layer: . We note that input to a layer might be perturbed due to perturbation in earlier layers and/or perturbation in the present layer (e.g., activation quantization) before it is applied to the layer function. For this we use separate notations as follows. For -th layer, let denote the perturbed input, denote the perturbed activation, and denote the perturbed weights. Let us first define the following relative perturbations.

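$$\Delta_i = \frac{\|X^{(i)} - \tilde{X}^{(i)}\|_F}{\|X^{(i)}\|_F}, \qquad \gamma_i = \frac{\|\tilde{X}^{(i)} - \hat{X}^{(i)}\|_F}{\|X^{(i)}\|_F}, \qquad \varepsilon_i = \frac{\|W^{(i)} - \tilde{W}^{(i)}\|_F}{\|W^{(i)}\|_F} \tag{2}$$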

We derive the following result to bound the relative change in the output of a layer using the definitions in (2).

###### Theorem 1.

Using the above notations, the relative change in the output of the $i$-th layer of the DNN can be bounded as

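$$\begin{aligned}
\Delta_i \le{} & \Big(\prod_{k=1}^{i} O(1+\varepsilon_k)\Big)\Delta_0 \;+\; \sum_{k=1}^{i}\Big(\prod_{j=k+1\le i}^{i} O(1+\varepsilon_j)\Big)\, O(\gamma_{k-1}) \\
& + \sum_{k=1}^{i}\Big(\prod_{j=k+1\le i}^{i} O(1+\varepsilon_j)\Big)\, O(1+\gamma_{k-1})\, O(\varepsilon_k)
\end{aligned} \tag{3}$$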

The theorem can be proved using the triangle inequality, recursion on $\Delta_i$ in (2), and the assumption that locally optimal parameters are not in a neighborhood of zero. Theorem 1 gives us an upper bound on how much the output of the $i$-th layer changes due to perturbation of the parameters, perturbation of the activations, and perturbation of the domain of the composition ($\Delta_0$). The result suggests that at the $i$-th layer of the DNN, perturbations of the parameters and activations of all the previous stages accumulate nonlinearly in a weighted manner. Moreover, we can see that perturbations in early layers contribute more to the final perturbation, suggesting higher sensitivity of earlier layers. We want this perturbation to be small so that the perturbed solution remains in a neighborhood of the local optimum (and generalizes in a similar manner). We simplify the above bound for small noise.

###### Theorem 2.

Using the above notations, assuming and , where are constants, we derive the following for constants .

$$\Delta_i \;\le\; \sum_{k=1}^{i}\Big(\prod_{j=k+1\le i}^{i} c_j\Big)\big(O(\gamma_{k-1}) + O(\varepsilon_k)\big) \tag{4}$$

The result suggests that small layer-by-layer changes in weights and activations accumulate in a weighted manner. That is, keeping both $\gamma_{k-1}$ and $\varepsilon_k$ small implies an overall small perturbation, and as long as this is smaller than the relative gap between the top score (for correct classification) and the next best score, there would be no loss in classification accuracy. It is intuitive that the larger the gap between the true classification score and the next best score, the more perturbation a DNN can tolerate. Also, the perturbations for earlier layers should be kept much smaller than those in later layers to have an overall small perturbation. Our empirical evaluation on quantized DNNs (ResNet-101 and AlexNet) corroborates this theory.
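To make the weighting in (4) concrete, the following sketch evaluates its right-hand side for a toy four-layer network. It is an illustration under stated assumptions, not part of the analysis: every hidden $O(\cdot)$ constant is set to 1, the constants $c_j$ are set to 2, and the function name `bound` is hypothetical.

```python
def bound(eps, gamma, c):
    """Right-hand side of (4) with every O(.) constant set to 1 (an assumption).

    eps[k-1]   holds the weight perturbation of layer k,
    gamma[k-1] holds the activation perturbation entering layer k,
    c[j-1]     holds the constant c_j of layer j.
    """
    n = len(eps)
    total = 0.0
    for k in range(1, n + 1):
        weight = 1.0
        for j in range(k + 1, n + 1):   # product over the layers after k
            weight *= c[j - 1]
        total += weight * (gamma[k - 1] + eps[k - 1])
    return total


c = [2.0] * 4                 # hypothetical per-layer constants, all above 1
no_act_noise = [0.0] * 4
early = bound([0.01, 0.0, 0.0, 0.0], no_act_noise, c)  # noise in the first layer
late = bound([0.0, 0.0, 0.0, 0.01], no_act_noise, c)   # same noise in the last layer
print(early, late)
```

With constants above 1, the same weight perturbation costs more in the first layer than in the last, which matches the observation that earlier layers need tighter tolerances.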

## 3 Low Precision DNN

We interpret quantization, sparsification, low-precision, etc. as adding a noise to the locally optimal parameters of a DNN. Such noisy solutions, despite showing some degradation in accuracy, can be computationally beneficial. We want to choose a noise model carefully such that the noisy solution can be implemented efficiently on general purpose hardware. The focus here is to find a low-precision representation of DNNs that can benefit from low-precision operations while maintaining extremely high accuracy.

8-bit Activations and 8-bit Weights: Constraining activations and weights to 8 bits appears to induce only small perturbation, resulting in typically loss in accuracy. This can be explained using (4) where we observe small and for each layer, such that, the final perturbation affects the relative difference between true classification scores and the next best scores minimally for the entire test set.

Sub-8-bit Representation: More challenging cases are sub-8-bit representations of DNNs, e.g., 8-bit activations and ternary weights. Note that the bound in (3) suggests a non-linear degradation of accuracy in terms of two errors: the error in activations and the error in weights. That is, keeping both of them at very low precision implies a large degradation of classification scores. This is likely because the perturbed solution no longer stays in a neighborhood of the local optimum. In such cases, we typically need to find another local optimum via (re)training at low precision [20]. This new local optimum need not be in a neighborhood of the old one. In fact, there could be multiple optima mapping to similar accuracy via re-parametrization [5]. However, it is not clear whether low-precision solutions exist that show accuracy very similar to that of the full-precision ones. Moreover, finding such solutions in a very high-dimensional, non-convex search space might be a non-trivial task. In reality, we often observe a noticeable drop in accuracy for such sub-8-bit solutions (despite rigorous (re)training at low precision). One possible explanation could be that these sub-8-bit models have limited model capacity (e.g., the number of unique values they can represent). We can interpret the earlier layers of a DNN as a feature generation stage, followed by feature synthesis, and finally the classification/discrimination phase. Lack of bits in early layers might severely constrain the quality of the features generated, and consequently, more sophisticated features at later stages become coarse, degrading the quality of the network. This intuitive explanation is also consistent with the theoretical bound in (3), where perturbation in earlier layers gets magnified more. It is natural to demand more accuracy from low-precision representations (in robotics, autonomous cars, etc.), and the existing methods may be insufficient to deal with this problem. There is a need to understand the trade-off between accuracy and precision in a theoretically consistent way. This paper attempts to address such a case.

### 3.1 Ternary Conversion of Pre-Trained Parameters

Here we consider one specific sub-8-bit representation of DNNs: 8-bit activations and ternary weights (8-2), where we want to decompose the full-precision trained weights $W$ into ternary values $\{-\alpha, 0, +\alpha\}$, $\alpha \ge 0$, without re-training. We consider a simple threshold ($T$) based approach similar to [17, 20]. Let $\hat{W}$ denote a ternary weight, such that its $i$-th element is $\hat{W}_i = \mathrm{sign}(W_i)$ if $|W_i| > T$, and $0$ otherwise. Then, the element-wise error is $E(\alpha, T) = \|W - \alpha\hat{W}\|_F^2$, and an optimal ternary representation is as follows:

$$\alpha^*, T^* = \operatorname*{argmin}_{\alpha \ge 0,\ T > 0} E(\alpha, T), \quad \text{s.t. } \alpha \ge 0,\ \hat{W}_i \in \{-1, 0, +1\} \tag{5}$$

for , where . However, the weights may learn different types of features and may follow different distributions. Combining all the weights together might represent a mixture of various distributions, and a ternary representation for them, using a single threshold () and magnitude (), may not preserve the distributions of the weights. To deal with this problem [20] introduced FGQ by first dividing these weight tensors into disjoint blocks of sub-tensors of size , and then ternarizing such blocks independently, i.e., decomposing into a disjoint group of filters , and corresponding ternary weights , where , solve independent sub-problems.

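$$\alpha_1^*, \ldots, \alpha_k^*,\ \hat{W}^{(1)*}, \ldots, \hat{W}^{(k)*} = \sum_i \operatorname*{argmin}_{\alpha_i,\ \hat{W}^{(i)}} \big\|W^{(i)} - \alpha_i \hat{W}^{(i)}\big\|_F^2 \tag{6}$$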

Denoting $I_T = \{i : |W_i| > T\}$, optimal solutions to individual sub-problems can be derived as

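$$\alpha^* = \Big(\sum_{i \in I_T} |W_i|\Big) \Big/ |I_T|, \qquad T^* = \operatorname*{argmax}_{T > 0} \Big(\sum_{i \in I_T} |W_i|\Big)^2 \Big/ |I_T| \tag{7}$$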

This leads to overall smaller layer-wise error; consequently, FGQ shows better accuracy using a smaller . This improvement in accuracy is consistent with the theory in (4). From model capacity point of view, with disjoint ternary FGQ blocks we can represent up to distinct values, i.e., model capacity increases linearly with number of such blocks. However, smaller , despite showing lower error, leads to larger number of (high-precision) multiplications (larger number of ’s), and this might lead to less efficient implementation on general purpose hardware.
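The closed form in (7) can be turned into a short search over candidate thresholds. The sketch below is one possible reading, assuming that a block is given as a flat list of floats and that the index set contains every element whose magnitude reaches the candidate cut-off; the function name `ternarize` is hypothetical. For FGQ, the same routine would be applied to each block independently.

```python
def ternarize(w):
    """Threshold ternarization of one block following (7).

    Returns (alpha, cutoff, codes) where codes[i] is in {-1, 0, +1} and
    alpha * codes approximates w. Each distinct non-zero magnitude is tried
    as the smallest magnitude kept, and the one maximising
    (sum of kept magnitudes)^2 / (number kept) wins.
    """
    best_score, best_alpha, best_cut = 0.0, 0.0, None
    for cut in sorted({abs(x) for x in w if x != 0}, reverse=True):
        kept = [abs(x) for x in w if abs(x) >= cut]
        score = sum(kept) ** 2 / len(kept)
        if score > best_score:
            best_score, best_alpha, best_cut = score, sum(kept) / len(kept), cut
    if best_cut is None:                      # all-zero block
        return 0.0, None, [0] * len(w)
    codes = [(1 if x > 0 else -1) if abs(x) >= best_cut else 0 for x in w]
    return best_alpha, best_cut, codes


block = [0.9, -1.1, 0.05, 0.0, -0.1, 1.0]     # hypothetical weights
alpha, cutoff, codes = ternarize(block)
print(alpha, cutoff, codes)
```

The search is quadratic in the block length, which is acceptable for small FGQ blocks; sorting once and using running sums would make it linearithmic.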

For very deep networks, e.g., ResNet-101, we need significantly larger number of fine-grained parameters (synthesized from early layers) to improve the accuracy even by a small margin from its shallower counterparts (Table 1). Sub-8-bit representation of sensitive parameters may have a ‘blurring’ effect on later activations; consequently, extremely high accuracy results might be elusive in 8-2 model.

### 3.2 Ternary Residual Edges

Motivated by achieving extremely high accuracy using sub-8-bit representation/operations, we introduce the notion of Residual Edges: when error between original weights and low-precision weights is high we need additional (sub-8) bits to compensate for the lost accuracy. That is, for sensitive branches of network we add more sub-8-bit edges to maintain the desired model capacity. This takes the final solution to a neighborhood of original solution.

Inferencing with parametric functions in , such as convolution and matrix multiplication, can be expressed as a linear operation. For a given input , (partial) output can be expressed as , where are learned weights. Clearly, , where is some perturbed version of . In our ternary setting, let , where is a ternary representation of , via Algorithm 1. Let, the residual be . For any given input if we accumulate the (partial) outputs of both the ternary weights and the residual weights, then we recover the original output. However, the residual may not be low-precision. In order to have a uniform low-precision operation, such as 8-2, we need to approximate as a sequence of ternary residuals, such that, accumulating the output of all these intermediate steps gets us closer to the original output. Let, , , …, , be a sequence of ternary weights, where , first step residual is , , , …, , . The (ternary) inference on input is , where denotes the ternary multiplication. The goal is to ensure . Accumulation of such (ternary) residuals is guided by the perturbation theory presented here where we need to preserve the output of each layer in order to maintain a small perturbation in the final outcome. Before we specify the steps of our ternary residual algorithm more formally, we need a more in-depth comparison with FGQ approach.
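The residual construction described above can be sketched as follows. This is a minimal illustration, not the paper's Algorithm 1: it assumes a threshold ternarizer of the kind given by (7), flat Python lists for weights and inputs, and hypothetical helper names (`ternarize`, `ternary_residuals`, `ternary_dot`).

```python
def ternarize(w):
    """Return (alpha, codes) with codes in {-1, 0, +1}, following (7)."""
    best_score, best_alpha, best_cut = 0.0, 0.0, None
    for cut in sorted({abs(x) for x in w if x != 0}, reverse=True):
        kept = [abs(x) for x in w if abs(x) >= cut]
        score = sum(kept) ** 2 / len(kept)
        if score > best_score:
            best_score, best_alpha, best_cut = score, sum(kept) / len(kept), cut
    if best_cut is None:
        return 0.0, [0] * len(w)
    return best_alpha, [(1 if x > 0 else -1) if abs(x) >= best_cut else 0 for x in w]


def ternary_residuals(w, steps):
    """Ternarize w, then repeatedly ternarize whatever error is left."""
    residual = list(w)
    terms = []                                   # list of (alpha, codes)
    for _ in range(steps):
        alpha, codes = ternarize(residual)
        terms.append((alpha, codes))
        residual = [r - alpha * c for r, c in zip(residual, codes)]
    return terms, residual


def ternary_dot(terms, x):
    """Accumulate the partial outputs of all ternary terms for input x."""
    total = 0.0
    for alpha, codes in terms:
        # Inside a term only additions and subtractions of inputs are needed.
        acc = sum(xi if c > 0 else -xi for xi, c in zip(x, codes) if c != 0)
        total += alpha * acc                     # one multiplication per term
    return total


def sq_norm(v):
    return sum(t * t for t in v)


w = [0.9, -1.1, 0.05, 0.0, -0.1, 1.0, 0.4, -0.35]   # hypothetical weights
x = [3, -1, 4, 1, -5, 9, 2, -6]                     # hypothetical activations
terms, residual = ternary_residuals(w, 3)
exact = sum(wi * xi for wi, xi in zip(w, x))
print(exact, ternary_dot(terms, x), sq_norm(residual))
```

Dropping trailing entries of `terms` before calling `ternary_dot` corresponds to disabling residual edges: the output becomes coarser, but fewer accumulations are performed.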

#### 3.2.1 Comparison with FGQ Ternary

We can represent only three distinct values: with ternary weights. Both FGQ and our residual method increase model capacity, i.e., the number of distinct values that can be represented using them. With FGQ blocks we can have up to distinct values, i.e., model capacity increases linearly with the number of blocks. However, this produces multiple scaling factors that leads to larger number of multiplications (typically inefficient). On the other hand, with step ternary residual we can represent up to distinct values, that is, model capacity increases exponentially. However, residual approach results in an increased model size (linear in ) as we need to store number of ternary weights. We can alleviate this problem by combining FGQ with residual ternary. That is, we can apply ternary residual for each ternary FGQ block. Moreover, not all the blocks are equally important, and we might need different number of residuals for different blocks. Let -th block requires number of residuals to approximate it up to some desired accuracy. Then, there will be total scaling factors, model capacity can be expressed as , and model size (in bits) is . Table 2 summarizes the comparison of various ternary methods (we assume scaling factors ’s are 8-bit each).
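The linear versus exponential growth in capacity can be verified by enumeration. The sketch below uses hypothetical scaling factors chosen so that no two combinations coincide; it counts the distinct values available when each factor belongs to its own block (one factor per weight) and when the factors are stacked as residual steps on the same weight.

```python
from itertools import product


def fgq_values(alphas):
    """Distinct values when each block has its own scaling factor."""
    return {s * a for a in alphas for s in (-1, 0, 1)}


def residual_values(alphas):
    """Distinct values when the same weight sums one ternary term per step."""
    return {
        sum(a * t for a, t in zip(alphas, ts))
        for ts in product((-1, 0, 1), repeat=len(alphas))
    }


alphas = [1.0, 0.25, 0.0625]        # hypothetical scaling factors
print(len(fgq_values(alphas)), len(residual_values(alphas)))
```

For three factors the enumeration gives 7 values for separate blocks and 27 for three residual steps, i.e. counts of the form $2k+1$ and $3^r$ in this example; fewer distinct values result if some combinations of scaling factors happen to coincide.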

In (4), we express the final perturbation in terms of norm of layer-wise perturbation. Here we extend it to block level perturbation as follows. Let be pre-trained weight for -th layer ( be its perturbed version). Also, let is partitioned into disjoint blocks ( be the perturbed version), . Then, sensitivity of -th block in -th layer is defined as follows.

$$\text{(Block Sensitivity)} \qquad \varepsilon_i^{(j)} = \big\|W_{(i)}^{(j)} - \tilde{W}_{(i)}^{(j)}\big\|_F \Big/ \big\|W^{(j)}\big\|_F \tag{8}$$

We relate with (defined in (2)) as follows.

Equation (8) suggests that for a given perturbation of the weights of a layer, various blocks of weights may be perturbed differently. Consequently, we might need a different number of residuals for different blocks to bound the total perturbation of a given layer. We present an incremental algorithm (Algorithm 2) where we add a ternary residual to the block that creates the largest error (even after residuals have been added to it). We repeat the process until the error for the layer is below a certain desired tolerance. Here we give a proof that adding ternary residual blocks, as in Algorithm 2, strictly reduces the error at every step.

###### Theorem 3.

Let denote the computed at the -th iteration in Algorithm 2. Then, , for all .
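The incremental procedure and the monotone decrease of the error can be illustrated with a small sketch. It is not the paper's Algorithm 2 verbatim: it assumes contiguous blocks of a flat, non-empty weight list, a threshold ternarizer following (7), one initial ternary term per block, and the relative Frobenius error of the layer as the stopping criterion. The names `ternarize`, `greedy_residuals`, and `max_terms` are hypothetical.

```python
import math


def ternarize(w):
    """Return (alpha, codes) with codes in {-1, 0, +1}, following (7)."""
    best_score, best_alpha, best_cut = 0.0, 0.0, None
    for cut in sorted({abs(x) for x in w if x != 0}, reverse=True):
        kept = [abs(x) for x in w if abs(x) >= cut]
        score = sum(kept) ** 2 / len(kept)
        if score > best_score:
            best_score, best_alpha, best_cut = score, sum(kept) / len(kept), cut
    if best_cut is None:
        return 0.0, [0] * len(w)
    return best_alpha, [(1 if x > 0 else -1) if abs(x) >= best_cut else 0 for x in w]


def greedy_residuals(weights, block_size, tol, max_terms=64):
    """Add ternary terms to the worst block until the relative error <= tol."""
    starts = range(0, len(weights), block_size)
    residual = [list(weights[s:s + block_size]) for s in starts]
    terms = [[] for _ in residual]            # per-block list of (alpha, codes)
    total = sum(x * x for x in weights) or 1.0
    history = []                              # relative layer error after each step

    def refine(j):
        alpha, codes = ternarize(residual[j])
        terms[j].append((alpha, codes))
        residual[j] = [r - alpha * c for r, c in zip(residual[j], codes)]
        err = sum(x * x for blk in residual for x in blk)
        history.append(math.sqrt(err / total))

    for j in range(len(residual)):            # first one ternary term per block
        refine(j)
    while history[-1] > tol and len(history) < max_terms:
        worst = max(range(len(residual)),
                    key=lambda j: sum(x * x for x in residual[j]))
        refine(worst)                         # extra residual for the worst block
    return terms, history


weights = [0.9, -1.1, 0.05, 0.0, -0.1, 1.0, 0.4, -0.35]   # hypothetical layer
terms, history = greedy_residuals(weights, block_size=4, tol=0.05)
print([len(t) for t in terms], history[-1])
```

In this sketch the recorded error shrinks at every step, because the optimal scaling factor for any non-empty support removes a positive amount of squared error from the chosen block.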

It is intuitive that when the magnitudes of a set of numbers follow a bi-modal distribution (with one peak centered close to zero), a ternary representation (one zero and one non-zero magnitude) might approximate the distribution well. In this case, the scaling factor is close to the non-zero peak. However, when the magnitudes of the numbers form more than two clusters, a ternary representation induces large error. We have observed that layer-wise pre-trained weights (magnitudes) follow exponential, heavy-tailed, or half-Gaussian distributions. Therefore, a ternary representation with only one $\alpha$ results in large error for the weights. Consequently, the large overall perturbation in the network leads to poor accuracy (as predicted by our theory). FGQ blocking is an attempt to approximate such a distribution at a finer level with a larger number of $\alpha$'s (at the cost of more multiplications). However, the above problem of large error resurfaces for FGQ when we ternarize larger blocks. In such a case, our proposed residual approach comes to the rescue: we refine the solution by adding a ternary approximation of the distribution of element-wise errors. That is, elements poorly approximated by the earlier ternary representations become more refined. As discussed earlier, FGQ increases the model capacity linearly in the number of blocks, while our residual approach improves it exponentially with the number of residual steps (Table 2). We can interpret model capacity as an indicator of how many different cluster centers we can represent (and thereby how well a mixture of clusters can be approximated). The residual approach then creates exponentially more cluster centers, and it is intuitive that we can achieve a desired level of approximation with only a few residual steps. Using the ternary residual approach, we essentially approximate an arbitrary distribution with a small sequence of bi-modal distributions.

One unique property of the residual ternary representation is that we can selectively enable/disable some of the residual weights on-the-fly to upgrade/downgrade the model in response to varying accuracy requirements in a dynamic environment (e.g., autonomous cars, robots, etc.). This is unlike the existing approaches, where we may not have such flexibility once we deploy the model, especially on a resource-constrained device. Disabling the least important residuals can save a lot of compute while having little impact on the accuracy. We can interpret this scenario as a 'battery-saving mode' on a resource-constrained device. Another interesting property of the ternary residual connections/blocks is that they are sparse in nature, and are highly compressible (and suitable for sparse operations). Finally, for ease of implementation on general-purpose hardware, the partitioning/blocking of weights is done in a memory-contiguous way. That is, we can unroll the weight tensor into a vector, then pick consecutive elements from the vector to form a block of weights. As argued by [20], $k$-means clustering of weights might lead to better approximation, but may not be friendly to efficient implementation.

Power-Performance Estimate: Let be the power-performance gain for ternary 8-2 operations over 8-8. Then, power-performance for residual method with compute using FGQ block size can be shown as .

## 4 Experiments

We use pre-trained FP-32 AlexNet and ResNet-101 models for ImageNet classification task. For 8-bit activation/weight quantization, we have used the low-precision dynamic fixed point technique mentioned in [20]. We applied Algorithm 2 to convert the FP-32 models to the ternary residual networks. As suggested by our theory, earlier layers require less perturbation to control the overall error. We set gradually smaller values for layer-wise perturbation for earlier layers. We note the total number of ternary blocks (FGQ+residual) required to achieve a given accuracy. Intuitively, we add more ternary compute (proportional to a factor of total blocks) in order to achieve higher accuracy.

Compute-Aware Perturbation: We can set the tolerances in a compute-aware manner considering the layer-wise compute distribution to reduce overall compute for ternary residual networks (Figure 1).

For power-performance gain, we estimate for . We summarize our results in Tables 3 and 4. For comparison, we mention a few results of other sub-8-bit networks on ImageNet using AlexNet. (1) Binarized weights and activations of [22] incurred a loss of , (2) the loss of binary weights and 2-bit activations of [30] was from FP-32, and (3) [12] with binary weight and 2-bit activations reduced the loss to from FP-32, (4) using FGQ with (75% elimination of multiplications), [20] achieved loss from FP-32 using ternary weights and 4-bit activations without any low-precision re-training. Note, however, that (1) and (2) used FP-32 weights and activations for the first and last layers. Finally, all these models have different power-performance profile, and a detailed analysis on this is beyond the scope of this paper.

## 5 Discussion

In order to achieve extremely high accuracy for sub-8-bit DNNs, we introduce the notion of residual inference, where we add more sub-8-bit edges to sensitive branches of the sub-8-bit network. The addition of residual edges is guided by a perturbation theory proposed here for pre-trained DNNs. We show that ternary 8-2 models, enhanced by such ternary residual edges, can outperform 8-8 networks in terms of model size, number of multiplications, and inference power-performance, while maintaining similar accuracy. A unique feature of residual enhancement is that we can upgrade/downgrade the model on the fly, depending on the varying accuracy requirements in a dynamic environment, by enabling/disabling selected branches. Moreover, the ternary residual network can be formed from its FP-32 counterpart in a resource-constrained environment without (re)training. Although we presented the residual idea for only one type of sub-8-bit representation, namely ternary 8-2, it is general enough to be applied to other low-precision representations as well (for both weights and activations). Future work is to combine our residual approach with low-precision (re)training in a theoretically consistent manner to improve the power-performance numbers.

## References

• [1] Mitsuru Ambai, Takuya Matsumoto, Takayoshi Yamashita, and Hironobu Fujiyoshi. Ternary weight decomposition and binary activation encoding for fast and compact neural networks.
• [2] Wenlin Chen, James T. Wilson, Stephen Tyree, Kilian Q. Weinberger, and Yixin Chen. Compressing neural networks with the hashing trick. In ICML, 2015.
• [3] Matthieu Courbariaux, Itay Hubara, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1. arXiv preprint arXiv:1602.02830, 2016.
• [4] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 248–255. IEEE, 2009.
• [5] Laurent Dinh, Razvan Pascanu, Samy Bengio, and Yoshua Bengio. Sharp minima can generalize for deep nets. arXiv preprint arXiv:1703.04933, 2017.
• [6] Yiwen Guo, Anbang Yao, and Yurong Chen. Dynamic network surgery for efficient dnns. In Advances in Neural Information Processing Systems, 2016.
• [7] Suyog Gupta, Ankur Agrawal, Kailash Gopalakrishnan, and Pritish Narayanan. Deep learning with limited numerical precision. In ICML, pages 1737–1746, 2015.
• [8] Song Han, Huizi Mao, and William J. Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. In ICLR, 2016.
• [9] Song Han, Jeff Pool, John Tran, and William J. Dally. Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems, pages 1135–1143, 2015.
• [10] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
• [11] Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
• [12] Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Quantized neural networks: Training neural networks with low precision weights and activations. arXiv preprint arXiv:1609.07061, 2016.
• [13] Forrest N. Iandola, Song Han, Matthew W. Moskewicz, Khalid Ashraf, William. J. Dally, and Kurt Keutzer. Squeezenet: Alexnet-level accuracy with 50x fewer parameters and 0.5 mb model size. arXiv preprint arXiv:1602.07360, 2016.
• [14] Norm Jouppi. Google supercharges machine learning tasks with tpu custom chip. Google Blog, May, 18, 2016.
• [15] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
• [16] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 2015.
• [17] Fengfu Li, Bo Zhang, and Bin Liu. Ternary weight networks. arXiv preprint arXiv:1605.04711, 2016.
• [18] Zhe Li, Xiaoyu Wang, Xutao Lv, and Tianbao Yang. Sep-nets: Small and effective pattern networks. arXiv preprint arXiv:1706.03912, 2017.
• [19] Zhouhan Lin, Matthieu Courbariaux, Roland Memisevic, and Yoshua Bengio. Neural networks with few multiplications. arXiv preprint arXiv:1510.03009, 2015.
• [20] Naveen Mellempudi, Abhisek Kundu, Dheevatsa Mudigere, Dipankar Das, Bharat Kaul, and Pradeep Dubey. Ternary neural networks with fine-grained quantization.
• [21] Daisuke Miyashita, Edward H Lee, and Boris Murmann. Convolutional neural networks using logarithmic data representation. arXiv preprint arXiv:1603.01025, 2016.
• [22] Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. Xnor-net: Imagenet classification using binary convolutional neural networks. In ECCV, 2016.
• [23] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
• [24] Yaman Umuroglu, Nicholas J Fraser, Giulio Gambardella, Michaela Blott, Philip Leong, Magnus Jahre, and Kees Vissers. Finn: A framework for fast, scalable binarized neural network inference. arXiv preprint arXiv:1612.07119, 2016.
• [25] Wei Wen, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. Learning structured sparsity in deep neural networks. In Advances in Neural Information Processing Systems, 2016.
• [26] D Williamson, S Sridharan, and P McCrea. A new approach for block floating-point arithmetic in recursive filters. IEEE transactions on circuits and systems, 32(7):719–722, 1985.
• [27] Darrell Williamson. Dynamically scaled fixed point arithmetic. In Communications, Computers and Signal Processing, 1991., IEEE Pacific Rim Conference on, pages 315–318. IEEE, 1991.
• [28] Tien-Ju Yang, Yu-Hsin Chen, and Vivienne Sze. Designing energy-efficient convolutional neural networks using energy-aware pruning. In CVPR, 2017.
• [29] Aojun Zhou, Anbang Yao, Yiwen Guo, Lin Xu, and Yurong Chen. Incremental network quantization: Towards lossless cnns with low-precision weights. poster at International Conference on Learning Representations, 2017.
• [30] Shuchang Zhou, Yuxin Wu, Zekun Ni, Xinyu Zhou, He Wen, and Yuheng Zou. Dorefa-net: Training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv preprint arXiv:1606.06160, 2016.
• [31] Chenzhuo Zhu, Song Han, Huizi Mao, and William J Dally. Trained ternary quantization. arXiv preprint arXiv:1612.01064, 2016.

## Appendix A

### A.1 Proofs

#### A.1.1 Proof of Theorem 1

We first bound the relative change in output in the presence of perturbed input and perturbed parameters using Lemma 1.

###### Lemma 1.

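Let $X^{(i-1)}$ be an input to the $i$-th layer with pre-trained parameter $W^{(i)}$, and let $X^{(i)}$ be the output. Then, for perturbed input $\tilde{X}^{(i-1)}$, perturbed activations $\hat{X}^{(i-1)}$, perturbed parameter $\tilde{W}^{(i)}$, and perturbed output $\tilde{X}^{(i)}$, we derive, for constants $c_i$,

$$
\frac{\|X^{(i)}-\tilde{X}^{(i)}\|_F}{\|X^{(i)}\|_F} \le c_i\left(\frac{\|X^{(i-1)}-\tilde{X}^{(i-1)}\|_F}{\|X^{(i-1)}\|_F}+\frac{\|\tilde{X}^{(i-1)}-\hat{X}^{(i-1)}\|_F}{\|X^{(i-1)}\|_F}+\frac{\|\hat{X}^{(i-1)}\|_F}{\|X^{(i-1)}\|_F}\cdot\frac{\|W^{(i)}-\tilde{W}^{(i)}\|_F}{\|W^{(i)}\|_F}\right) \tag{9}
$$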

For non-parametric functions the last term in (9) is zero.

Let

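$$
\Delta_i=\frac{\|X^{(i)}-\tilde{X}^{(i)}\|_F}{\|X^{(i)}\|_F},\qquad \gamma_i=\frac{\|\tilde{X}^{(i)}-\hat{X}^{(i)}\|_F}{\|X^{(i)}\|_F},\qquad \varepsilon_i=\frac{\|W^{(i)}-\tilde{W}^{(i)}\|_F}{\|W^{(i)}\|_F}.
$$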

We can write, for some constant $c_1$,

$$
\|\hat{X}^{(i-1)}\|_F \le c_1\cdot\|X^{(i-1)}\|_F+c_1\cdot\|X^{(i-1)}-\tilde{X}^{(i-1)}\|_F+c_1\cdot\|\tilde{X}^{(i-1)}-\hat{X}^{(i-1)}\|_F. \tag{10}
$$

Then, combining (9) and (10), we have the following recursive relation.

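$$
\begin{aligned}
\Delta_i &\le c_i\cdot\Delta_{i-1}+c_i\cdot\gamma_{i-1}+c_i c_1(1+\gamma_{i-1}+\Delta_{i-1})\,\varepsilon_i \\
&= c_i(1+c_1\varepsilon_i)\,\Delta_{i-1}+c_i\cdot\gamma_{i-1}+c_i c_1(1+\gamma_{i-1})\,\varepsilon_i
\end{aligned}
$$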

Simplifying the constants, we derive

$$
\Delta_i \le O(1+\varepsilon_i)\,\Delta_{i-1}+O(\gamma_{i-1})+O(1+\gamma_{i-1})\,O(\varepsilon_i) \tag{11}
$$

Expanding the recursion in (11) we get the following.

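$$
\Delta_i \le \left(\prod_{k=1}^{i}O(1+\varepsilon_k)\right)\Delta_0+\sum_{k=1}^{i}\left(\prod_{j=k+1}^{i}O(1+\varepsilon_j)\right)\Big(O(\gamma_{k-1})+O(1+\gamma_{k-1})\,O(\varepsilon_k)\Big)
$$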
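The expansion of the recursion can be checked numerically. The sketch below treats the recursive bound with explicit constants as an equality, writing it as $\Delta_i = a_i\Delta_{i-1} + b_i$ with $a_i = c_i(1+c_1\varepsilon_i)$ and $b_i = c_i\gamma_{i-1} + c_i c_1(1+\gamma_{i-1})\varepsilon_i$, and compares the iterated value with the expanded sum. The numeric values and the function names are assumptions for illustration only; an empty product is taken to be 1.

```python
import numpy as np

def coefficients(c, c1, eps, gamma):
    """Per-layer a_i and b_i; index i (0-based) holds c_{i+1}, eps_{i+1}, gamma_i."""
    a = [c[i] * (1.0 + c1 * eps[i]) for i in range(len(c))]
    b = [c[i] * gamma[i] + c[i] * c1 * (1.0 + gamma[i]) * eps[i]
         for i in range(len(c))]
    return a, b

def delta_recursive(c, c1, eps, gamma, delta0):
    """Iterate Delta_i = a_i * Delta_{i-1} + b_i layer by layer."""
    a, b = coefficients(c, c1, eps, gamma)
    delta = delta0
    for a_i, b_i in zip(a, b):
        delta = a_i * delta + b_i
    return delta

def delta_expanded(c, c1, eps, gamma, delta0):
    """Closed form: (prod_k a_k) * Delta_0 + sum_k (prod_{j>k} a_j) * b_k."""
    a, b = coefficients(c, c1, eps, gamma)
    total = np.prod(a) * delta0
    for k in range(len(a)):
        total += np.prod(a[k + 1:]) * b[k]   # np.prod([]) == 1.0
    return total

# Hypothetical constants for a 5-layer network.
c = [1.1, 0.9, 1.2, 1.0, 1.3]
eps = [0.01, 0.02, 0.02, 0.05, 0.05]
gamma = [0.0, 0.01, 0.01, 0.02, 0.02]
rec = delta_recursive(c, 1.0, eps, gamma, delta0=0.0)
exp = delta_expanded(c, 1.0, eps, gamma, delta0=0.0)
print(rec, exp)
```

In the expanded form, the term for layer $k$ is multiplied by the factors of all later layers, which is how a perturbation introduced early propagates to the output.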

#### A.1.2 Proof of Theorem 2

In Lemma 1, under small perturbation, we assume that , for some constant . Then, simplifying the constants, from (9) we derive

$$
\Delta_i \le \left(\prod_{k=1}^{i}c_k\right)\Delta_0+\sum_{k=1}^{i}\left(\prod_{j=k+1}^{i}c_j\right)\Big(O(\gamma_{k-1})+O(\varepsilon_k)\Big)
$$

For no input domain perturbation, we set $\Delta_0 = 0$ to derive the result.

#### A.1.3 Proof of Lemma 1

We follow some notational conventions. Let $X$ denote the original input, $\tilde{X}$ denote the input perturbed due to perturbation of earlier layers, and $\hat{X}$ denote $\tilde{X}$ further perturbed in the current layer (say, because of low-precision quantization) before the layer function is applied to it. Perturbed weights are denoted by $\tilde{W}$.

Conv+BN+Scaling (Parametric):
Convolution essentially performs an inner product between an image patch and a weight filter. The $i$-th element of the $k$-th output feature map (ofm) can be expressed as

$$
y^{(k)}_i=\langle X^{(i)},w^{(k)}\rangle
$$

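where $X^{(i)}$ is the $i$-th input patch to which the $k$-th ofm convolution filter bank $w^{(k)}$ is applied. Also, let the batch normalization (BN) parameters for the $k$-th ofm be $\mu_k$ and $\sigma_k$, and the scaling parameters be $\alpha_k$ and $\beta_k$, such that the combined output of convolution, BN, and scaling can be expressed as

$$
z^{(k)}_i=\frac{y^{(k)}_i-\mu_k}{\sigma_k}\,\alpha_k+\beta_k=\frac{\alpha_k}{\sigma_k}\cdot y^{(k)}_i+\left(\beta_k-\frac{\mu_k\cdot\alpha_k}{\sigma_k}\right)=a_k\cdot y^{(k)}_i+b_k,
$$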

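where $a_k$ and $b_k$ are learned (locally optimal) parameters. That is, we have a linear expression

$$
z^{(k)}_i=a_k\cdot\langle X^{(i)},w^{(k)}\rangle+b_k=\langle X^{(i)},a_k\cdot w^{(k)}\rangle+b_k=\langle (X^{(i)},1),(a_k\cdot w^{(k)},b_k)\rangle=\langle \bar{X}^{(i)},\bar{w}^{(k)}\rangle
$$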
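The folding of BN and scaling into a single augmented inner product can be verified on one patch. The following is a small NumPy sketch; the patch length and all parameter values are arbitrary assumptions used only to exercise the identity.

```python
import numpy as np

rng = np.random.default_rng(0)
patch = rng.standard_normal(27)   # one flattened input patch X^(i)
w = rng.standard_normal(27)       # one ofm filter w^(k)
mu, sigma, alpha, beta = 0.3, 1.7, 0.9, -0.2   # BN and scaling parameters (arbitrary)

# Unfolded path: convolution (inner product), then BN, then scaling.
y = patch @ w
z_unfolded = (y - mu) / sigma * alpha + beta

# Folded path: a_k = alpha / sigma and b_k = beta - mu * alpha / sigma.
a = alpha / sigma
b = beta - mu * alpha / sigma
patch_aug = np.append(patch, 1.0)   # (X^(i), 1)
w_aug = np.append(a * w, b)         # (a_k * w^(k), b_k)
z_folded = patch_aug @ w_aug

print(z_unfolded, z_folded)
```

Both paths give the same value up to floating-point rounding, so the perturbation analysis can be carried out on the augmented weights alone.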

Let the perturbed output be

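$$
\tilde{z}^{(k)}_i=\langle \hat{X}^{(i)},\tilde{w}^{(k)}\rangle
$$

where $\tilde{X}^{(i)}$, $\hat{X}^{(i)}$, and $\tilde{w}^{(k)}$ are the perturbed input patch, perturbed activation patch, and perturbed parameter, respectively. Then, for some constant $c_1$, the change in output can be bounded as follows.

$$
\begin{aligned}
|z^{(k)}_i-\tilde{z}^{(k)}_i|^2 &\le c_1^2\Big(|\langle \bar{X}^{(i)},\bar{w}^{(k)}\rangle-\langle \tilde{X}^{(i)},\bar{w}^{(k)}\rangle|^2+|\langle \tilde{X}^{(i)},\bar{w}^{(k)}\rangle-\langle \hat{X}^{(i)},\bar{w}^{(k)}\rangle|^2 \\
&\qquad\quad +|\langle \hat{X}^{(i)},\bar{w}^{(k)}\rangle-\langle \hat{X}^{(i)},\tilde{w}^{(k)}\rangle|^2\Big) \\
&\le c_1^2\cdot\|\bar{X}^{(i)}-\tilde{X}^{(i)}\|_F^2\,\|\bar{w}^{(k)}\|_F^2+c_1^2\cdot\|\tilde{X}^{(i)}-\hat{X}^{(i)}\|_F^2\,\|\bar{w}^{(k)}\|_F^2 \\
&\qquad +c_1^2\cdot\|\hat{X}^{(i)}\|_F^2\,\|\bar{w}^{(k)}-\tilde{w}^{(k)}\|_F^2
\end{aligned}
$$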
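The two inequalities above can be checked on random vectors. One admissible choice of the constant is $c_1^2 = 3$, which follows from $(p+q+r)^2 \le 3(p^2+q^2+r^2)$; the paper does not state a value, so this choice, the vector length, and the perturbation magnitudes below are assumptions. The second inequality is the Cauchy-Schwarz inequality applied to each term.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 28                                             # length of an augmented patch (arbitrary)
x_bar = rng.standard_normal(d)                     # original input patch
x_tilde = x_bar + 0.05 * rng.standard_normal(d)    # perturbed by earlier layers
x_hat = x_tilde + 0.02 * rng.standard_normal(d)    # further perturbed in this layer
w_bar = rng.standard_normal(d)                     # original parameters
w_tilde = w_bar + 0.10 * rng.standard_normal(d)    # perturbed parameters

norm = np.linalg.norm
c1_sq = 3.0                                        # assumed value of c_1^2

# Left-hand side: squared change of a single output element.
lhs = (x_bar @ w_bar - x_hat @ w_tilde) ** 2

# Telescoping split of the difference into three inner products.
split = [(x_bar - x_tilde) @ w_bar,
         (x_tilde - x_hat) @ w_bar,
         x_hat @ (w_bar - w_tilde)]
middle = c1_sq * sum(t ** 2 for t in split)

# Cauchy-Schwarz applied term by term.
rhs = c1_sq * (norm(x_bar - x_tilde) ** 2 * norm(w_bar) ** 2
               + norm(x_tilde - x_hat) ** 2 * norm(w_bar) ** 2
               + norm(x_hat) ** 2 * norm(w_bar - w_tilde) ** 2)

print(lhs, middle, rhs)
```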

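Considering all the elements, for some constant $c_u$, we derive

$$
\|z-\tilde{z}\|_F^2 \le \sum_i\sum_k|z^{(k)}_i-\tilde{z}^{(k)}_i|^2 \le c_u^2\cdot\|\bar{X}-\tilde{X}\|_F^2\,\|\bar{W}\|_F^2+c_u^2\cdot\|\tilde{X}-\hat{X}\|_F^2\,\|\bar{W}\|_F^2+c_u^2\cdot\|\hat{X}\|_F^2\,\|\bar{W}-\tilde{W}\|_F^2
$$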

From above we have,

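$$
\|z-\tilde{z}\|_F \le c_u\cdot\Big(\|\bar{X}-\tilde{X}\|_F\,\|\bar{W}\|_F+\|\tilde{X}-\hat{X}\|_F\,\|\bar{W}\|_F+\|\hat{X}\|_F\,\|\bar{W}-\tilde{W}\|_F\Big) \tag{12}
$$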

Note that

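$$
0\le|z^{(k)}_i|=|\langle \bar{X}^{(i)},\bar{w}^{(k)}\rangle|\le c_1\cdot\|\bar{X}^{(i)}\|_F\,\|\bar{w}^{(k)}\|_F,
$$

and $|z^{(k)}_i|$ is close to zero when the $k$-th ofm filters are orthogonal to the input image patch. For pre-trained, locally optimal weights we expect that not all the elements of an ofm are close to zero. For simplicity of analysis we assume that, for some constant $c^{(k)}$,

$$
\|z^{(k)}\|_F^2=\sum_i|z^{(k)}_i|^2\ge(c^{(k)})^2\,\|\bar{X}\|_F^2\,\|\bar{w}^{(k)}\|_F^2
$$

and, for $c_{\min}=\min_k c^{(k)}$,

$$
\|z\|_F^2=\sum_k\|z^{(k)}\|_F^2\ge(c_{\min})^2\sum_k\|\bar{X}\|_F^2\,\|\bar{w}^{(k)}\|_F^2\ge(c_{\min})^2\,\|\bar{X}\|_F^2\,\|\bar{W}\|_F^2 \tag{13}
$$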

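Combining (12) and (13) we have, for some constant $c$,

$$
\frac{\|z-\tilde{z}\|_F}{\|z\|_F}\le c\cdot\frac{\|\bar{X}-\tilde{X}\|_F}{\|\bar{X}\|_F}+c\cdot\frac{\|\tilde{X}-\hat{X}\|_F}{\|\bar{X}\|_F}+c\cdot\frac{\|\hat{X}\|_F}{\|\bar{X}\|_F}\cdot\frac{\|\bar{W}-\tilde{W}\|_F}{\|\bar{W}\|_F}
$$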

Matrix Multiplication (Parametric):
For an dimensional input and weight matrix along with dimensional bias , where is the number of classes, output can be written as a linear operation,

$$
y=Wx+b=\bar{W}\bar{x}
$$

Similarly, perturbed output is Note that, for some constant