# Position-based Scaled Gradient for Model Quantization and Sparse Training

## Abstract

We propose the position-based scaled gradient (PSG), which scales the gradient depending on the position of a weight vector to make it more compression-friendly. First, we theoretically show that applying PSG to standard gradient descent (GD), which we call PSGD, is equivalent to GD in the warped weight space, a space made by warping the original weight space via an appropriately designed invertible function. Second, we empirically show that PSG, acting as a regularizer on a weight vector, is favorable for model compression domains such as quantization and sparse training. PSG reduces the gap between the weight distributions of a full-precision model and its compressed counterpart. This enables the versatile deployment of a model in either an uncompressed mode or a compressed mode, depending on the availability of resources. The experimental results on the CIFAR-10/100 and ImageNet datasets show the effectiveness of the proposed PSG in both sparse training and quantization, even for extremely low bit-widths. The code is available on GitHub: https://github.com/Jangho-Kim/PSG-pytorch.

## 1 Introduction

Many regularization strategies have been proposed to induce a prior to a neural network hoerl1970ridge (); tibshirani1996regression (); hinton2015distilling (); kim2018paraphrasing (). Inspired by such regularization methods, which add a prior or constraint for a specific purpose, we propose a novel regularization method that non-uniformly scales the gradient for model compression problems. The scaled gradient, whose scale depends on the position of the weight, constrains the weight to a set of compression-friendly grid points. We replace the conventional gradient in stochastic gradient descent (SGD) with the proposed position-based scaled gradient (PSG) and call the resulting method PSGD. We show that PSGD in the original weight space is equivalent to optimizing the weights by standard SGD in a warped space, to which weights from the original space are mapped by an invertible function. This function is designed such that the weights of the original space are forced to merge to the desired target positions by scaling the gradients.

We are not the first to scale the gradient elements. The scaled gradient method, also known as the variable metric method davidon1991variable (), multiplies the gradient vector by a positive definite matrix to scale the gradient. It includes a wide variety of methods such as the Newton method, quasi-Newton methods and the natural gradient method dennis1977quasi (); nocedal2006numerical (); bottou2010large (). Generally, they rely on Hessian estimation or the Fisher information matrix for their scaling. Our method differs from them in that our scaling does not depend on the loss function; it depends solely on the current position of the weight.

We apply the proposed PSG method to model compression problems such as quantization and pruning. In recent years, deploying a deep neural network (DNN) on restricted edge devices such as smartphones and IoT devices has become a very important issue. For this reason, reducing the bit-width of model weights (quantization) and removing unimportant model weights (pruning) have been studied and widely used in applications. The majority of the quantization literature starts with a pre-trained model and fine-tunes or re-trains the model using the entire training dataset. However, this scenario is restrictive in real-world applications because additional training is needed. The additional training phase requires a full-size dataset and high computational resources, which prohibits easy and fast deployment of DNNs on edge devices for customers in need.

To resolve this problem, many works have focused on post-training quantization methods that do not require training data krishnamoorthi2018quantizing (); nagel2019data (); banner2019post (); zhao2019improving (). For example, nagel2019data () starts with a pre-trained model and makes only minor modifications to the weights by equalizing the scales across channels and correcting biases. However, the inherent discrepancy between the weight distribution of the pre-trained model and that of the quantized model is too large for the aforementioned methods to offset. As shown in Fig. 1, due to the differences between the two distributions, both the classification error and the quantization error (measured as the mean squared error) increase as the bit-width decreases. Accordingly, when it comes to layer-wise quantization, existing post-training methods suffer significant accuracy degradation when the model is quantized below 6 bits.

Meanwhile, another line of research in quantization approaches the task from the initial training phase. Gradient L1 regularization alizadeh2020gradient () trains the model with gradient regularization for quantization robustness across different bit widths at the initial training phase. While our method follows this scheme of training from the start, unlike alizadeh2020gradient (), no explicit regularization term is introduced to the loss function. Instead, the gradients are scaled depending on the position of the weights. Our main goal is to train a robust model that can be easily switched to a compressed mode from an uncompressed mode when the resources are limited, without the need of re-training, fine-tuning and even accessing the data. To achieve this, we constrain the original weights to merge to a set of quantized grid points (Appendix A and Fig. 1(b)) by scaling their gradients proportional to the error between the original weight and its quantized version. For the sparse training case, the weights are regularized to merge to zero. More details will be described in Sec 3.

Our contributions can be summarized as follows:

- We propose a novel regularization method for model compression by introducing the position-based scaled gradient (PSG), which can be considered a variant of the variable metric method.

- We prove theoretically that PSG descent (PSGD) is equivalent to applying standard gradient descent in the warped weight space. This leads the weight to converge to a well-performing local minimum in both compressed and uncompressed weight spaces (see Appendix A and Fig. 1).

- We apply PSG to quantization and sparse training and verify its effectiveness on the CIFAR and ImageNet datasets. We also show that PSGD is very effective for extremely low bit quantization.

## 2 Related work

Quantization: Post-training quantization aims to quantize weights and activations to discrete grid points without additional training or use of the training data. The majority of recent works start from a pre-trained network trained by a standard training scheme zhao2019improving (); nagel2019data (); banner2019post (). Many works on channel-wise quantization methods, which require storing quantization parameters per channel, have shown notable improvement in performance even at 4-bit banner2019post (); Choukroun_2019 (). However, layer-wise quantization methods, which are more hardware-friendly as they store quantization parameters per layer (as opposed to per channel), still suffer at lower bit-widths nagel2019data (); krishnamoorthi2018quantizing (); zhao2019improving (). nagel2019data () achieves near full-precision accuracy at 8-bit by bias correction and range equalization of channels, while zhao2019improving () splits channels with outliers to reduce the clipping error. However, both suffer from severe accuracy degradation below 6-bit. Our method improves on, but is not limited to, uniform layer-wise quantization.

Meanwhile, another line of work in quantization has focused on quantization robustness by regularizing the weight distribution from the training phase. lin2018defensive () focuses on minimizing the Lipschitz constant to regularize the gradients for robustness against adversarial attacks. Similarly, alizadeh2020gradient () proposes a new regularization term on the norm of the gradients for quantization robustness across different bit widths. This enables “on-the-fly” quantization to various bit widths. Our method does not have an explicit regularization term but scales the gradients to implicitly regularize the weights in the full-precision domain to make them quantization-friendly. By doing so, we achieve state-of-the-art accuracies for layer-wise quantization as well as robustness across various bit widths. Additionally, we do not introduce significant training overhead because gradient norm regularization is not necessary, whereas alizadeh2020gradient () requires double-backpropagation, which increases the training complexity.

Pruning: Another relevant line of research in model compression is pruning, in which unimportant units such as weights, filters, or entire blocks are pruned huang2018data (); li2016pruning (); huang2018data (). Recent works have focused on pruning methods that include the pruning process in the training phase renda2020comparing (); zhu2017prune (); louizos2018learning (); lee2018snip (). Among them, a substantial number of works use sparsity-inducing regularization. louizos2018learning () proposes training with an L0 norm regularizer on individual weights to train a sparse network, using the expected L0 objective to relax the otherwise non-differentiable regularization term. Meanwhile, other works focus on using a saliency criterion. lee2018snip () uses gradients of masks as a proxy for importance to prune networks in a single shot. Similar to lee2018snip () and louizos2018learning (), our method needs neither a heuristic pruning schedule during training nor additional fine-tuning after pruning. In our method, pruning is formulated as a subclass of quantization because PSG can be used for sparse training by setting the target value to zero instead of the quantized grid points.

## 3 Proposed method

In this section, we describe the proposed position-based scaled gradient descent (PSGD) method. In PSGD, a scaling function regularizes the original weight to merge to one of the desired target points, which perform well in both the uncompressed and compressed domains. This is equivalent to optimizing via SGD in the warped weight space. With a specially designed invertible function that warps the original weight space, optimization in this warped space converges to different local minima that are more compression-friendly than the solutions found in the original weight space.

We first prove that optimizing in the original space with PSGD is equivalent to optimizing in the warped space with gradient descent. Then, we demonstrate how PSGD is used to constrain the weights to a set of desired target points. Lastly, we explain how this method is able to yield comparable performance with that of vanilla SGD in the original uncompressed domain, despite being strongly regularized.

### 3.1 Optimization in warped space

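Theorem: Let $\mathcal{F}: \mathcal{X} \rightarrow \mathcal{X}'$, $\mathcal{X}, \mathcal{X}' \subset \mathbb{R}^n$, be an arbitrary invertible multivariate function that warps the original weight space $\mathcal{X}$ into $\mathcal{X}'$ and consider the loss function $\mathcal{L}: \mathcal{X} \rightarrow \mathbb{R}$ and the equivalent loss function $\mathcal{L}' = \mathcal{L} \circ \mathcal{F}^{-1}: \mathcal{X}' \rightarrow \mathbb{R}$. Then, the gradient descent (GD) method in the warped space $\mathcal{X}'$ is equivalent to applying a scaled gradient descent in the original space $\mathcal{X}$ such that

$$\mathbf{x}'_{t+1} = \mathbf{x}'_t - \eta\,\nabla_{\mathbf{x}'}\mathcal{L}'(\mathbf{x}'_t) \;\;\equiv\;\; \mathbf{x}_{t+1} = \mathbf{x}_t - \eta\,\big[J_{\mathbf{x}}^{\mathbf{x}'}\big]^{-2}\,\nabla_{\mathbf{x}}\mathcal{L}(\mathbf{x}_t) \tag{1}$$

where $\nabla_a b$ and $J_a^b$ respectively denote the gradient and Jacobian of the function $b$ with respect to the variable $a$.

Proof: Consider the point $\mathbf{x}_t \in \mathcal{X}$ at time $t$ and its warped version $\mathbf{x}'_t = \mathcal{F}(\mathbf{x}_t) \in \mathcal{X}'$. To find the local minimum of $\mathcal{L}'(\mathbf{x}')$, the standard gradient descent method at time step $t$ in the warped space can be applied as follows:

$$\mathbf{x}'_{t+1} = \mathbf{x}'_t - \eta\,\nabla_{\mathbf{x}'}\mathcal{L}'(\mathbf{x}'_t) \tag{2}$$

Here, $\nabla_{\mathbf{x}'}\mathcal{L}'(\mathbf{x}'_t) = \left.\frac{\partial \mathcal{L}'}{\partial \mathbf{x}'}\right|_{\mathbf{x}'_t}$ is the gradient and $\eta$ is the learning rate. Applying the inverse function $\mathcal{F}^{-1}$ to $\mathbf{x}'_{t+1}$, we obtain the updated point $\mathbf{x}_{t+1}$:

$$\mathbf{x}_{t+1} = \mathcal{F}^{-1}(\mathbf{x}'_{t+1}) = \mathcal{F}^{-1}\big(\mathbf{x}'_t - \eta\,\nabla_{\mathbf{x}'}\mathcal{L}'(\mathbf{x}'_t)\big) \approx \mathbf{x}_t - \eta\,J_{\mathbf{x}'}^{\mathbf{x}}\,\nabla_{\mathbf{x}'}\mathcal{L}'(\mathbf{x}'_t) \tag{3}$$

where the last step follows from the first-order Taylor approximation and $J_{\mathbf{x}'}^{\mathbf{x}} = \frac{\partial \mathbf{x}}{\partial \mathbf{x}'}$ is the Jacobian of $\mathbf{x} = \mathcal{F}^{-1}(\mathbf{x}')$ with respect to $\mathbf{x}'$. By the chain rule, $\nabla_{\mathbf{x}'}\mathcal{L}' = J_{\mathbf{x}'}^{\mathbf{x}}\,\nabla_{\mathbf{x}}\mathcal{L}$. Because $J_{\mathbf{x}'}^{\mathbf{x}} = \big[J_{\mathbf{x}}^{\mathbf{x}'}\big]^{-1}$, we can rewrite Eq.(3) as

$$\mathbf{x}_{t+1} = \mathbf{x}_t - \eta\,\big[J_{\mathbf{x}}^{\mathbf{x}'}\big]^{-2}\,\nabla_{\mathbf{x}}\mathcal{L}(\mathbf{x}_t) \tag{4}$$

Now Eq.(2) and Eq.(4) are equivalent and Eq.(1) is proved. In other words, the scaled gradient descent (PSGD) in the original space $\mathcal{X}$, whose scaling is determined by the matrix $\big[J_{\mathbf{x}}^{\mathbf{x}'}\big]^{-2}$, is equivalent to gradient descent in the warped space $\mathcal{X}'$.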
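The equivalence can be checked numerically in one dimension. The following is a minimal sketch under stated assumptions, not the authors' implementation: the warp `warp`, its inverse `unwarp`, the toy loss and the constant `EPS` are all illustrative choices made here (a single target at zero, as in the sparse training case). It takes one GD step in the warped space, maps the result back, and compares it with one scaled step in the original space.

```python
import math

EPS = 1e-4  # assumed small offset that keeps the warp differentiable at the target


def warp(x):
    """Illustrative warp toward the single target 0:
    f(x) = 2 * sign(x) * (sqrt(|x| + EPS) - sqrt(EPS))."""
    return 2.0 * math.copysign(1.0, x) * (math.sqrt(abs(x) + EPS) - math.sqrt(EPS))


def unwarp(y):
    """Inverse of warp()."""
    r = abs(y) / 2.0 + math.sqrt(EPS)
    return math.copysign(1.0, y) * (r * r - EPS)


def grad_loss(x):
    """Gradient of the toy loss L(x) = 0.5 * (x - 1)^2."""
    return x - 1.0


def gd_step_in_warped_space(x, lr):
    # dx/dx' = 1 / f'(x) = sqrt(|x| + EPS); the chain rule gives dL'/dx' = dL/dx * dx/dx'.
    dx_dy = math.sqrt(abs(x) + EPS)
    y_next = warp(x) - lr * grad_loss(x) * dx_dy
    return unwarp(y_next)


def scaled_step_in_original_space(x, lr):
    # Scaling factor [f'(x)]^(-2) = |x| + EPS applied to the ordinary gradient.
    return x - lr * (abs(x) + EPS) * grad_loss(x)


x0, lr = 0.5, 1e-3
a = gd_step_in_warped_space(x0, lr)
b = scaled_step_in_original_space(x0, lr)
print(a, b, abs(a - b))
```

The two updates agree up to terms of second order in the learning rate, which is what the first-order Taylor step in the proof predicts.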

### 3.2 Position-based scaled gradient

In this part, we introduce one example of designing the invertible function $\mathcal{F}$ for scaling the gradients. This invertible function should cause the original weight vector $\mathbf{x}$ to merge to a set of desired target points $\{\bar{\mathbf{x}}\}$. These desired target weights act as a prior in the optimization process, constraining the original weights to merge at specific positions. The details of how to set the target points are deferred to the next subsection.

The gist of weight-dependent gradient scaling is simple. For a given weight vector, if the specific weight element is far from the desired target point, a higher scaling value is applied so as to escape this position faster. On the other hand, if the distance is small, lower scaling value is applied to prevent the weight vector from deviating from the position. From now on, we focus on the design of the scaling function for the quantization problem. For pruning, the procedure is analogous and we omit the detail.

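Scaling function: We use the same warping function $f$ for each coordinate $x_i$ independently, i.e. $\mathcal{F}(\mathbf{x}) = [f(x_1), f(x_2), \ldots, f(x_n)]^T$. Thus the Jacobian matrix becomes diagonal ($J_{\mathbf{x}}^{\mathbf{x}'} = \mathrm{diag}(f'(x_1), \ldots, f'(x_n))$) and our method belongs to the diagonally scaled gradient method.

Consider the following warping function

$$f(x) = 2\,\mathrm{sign}(x - \bar{x})\left(\sqrt{|x - \bar{x}| + \epsilon} + c(\bar{x})\right) \tag{5}$$

where the target $\bar{x}$ is determined as the closest grid point from $x$, $\mathrm{sign}(\cdot)$ is a sign function, $c(\bar{x})$ is a constant dependent on the specific grid point $\bar{x}$ making the function continuous, and $\epsilon$ is a small positive constant that keeps the derivative finite at the target. Because $f'(x) = 1/\sqrt{|x - \bar{x}| + \epsilon}$ within each grid cell, the elementwise scaling function $s(x) = [f'(x)]^{-2}$, i.e. the diagonal of the scaling matrix in Eq.(4), becomes

$$s(x) = |x - \bar{x}| + \epsilon \tag{6}$$

Using the elementwise scaling function Eq.(6), the elementwise weight update rule for the PSG descent (PSGD) becomes

$$x_{i,t+1} = x_{i,t} - \eta\, s(x_{i,t}) \left.\frac{\partial \mathcal{L}}{\partial x_i}\right|_{\mathbf{x}_t} \tag{7}$$

where $\eta$ is the learning rate.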
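The elementwise update can be sketched in a few lines. This is an illustration under assumptions rather than the released code: it takes the scaling to be the distance to the nearest target plus a small offset (matching the description that far-away weights receive a larger scale), and the helper names `nearest_target`, `psg_scale` and `psgd_step`, the grid, and all numeric values are hypothetical.

```python
def nearest_target(x, targets):
    """Return the grid point closest to x."""
    return min(targets, key=lambda t: abs(x - t))


def psg_scale(x, targets, eps=1e-3):
    """Assumed scaling: distance to the nearest target plus a small offset eps."""
    return abs(x - nearest_target(x, targets)) + eps


def psgd_step(weights, grads, targets, lr=0.1, eps=1e-3):
    """One position-based scaled gradient step, applied element by element."""
    return [w - lr * psg_scale(w, targets, eps) * g for w, g in zip(weights, grads)]


targets = [-0.5, 0.0, 0.5]   # hypothetical quantization grid
weights = [0.49, 0.2, -0.1]  # 0.49 sits almost on a target, 0.2 is far from one
grads = [1.0, 1.0, 1.0]      # identical raw gradients, to isolate the scaling

new_weights = psgd_step(weights, grads, targets)
print([round(w - nw, 6) for w, nw in zip(weights, new_weights)])
```

With identical raw gradients, the weight at 0.49 barely moves while the weight at 0.2 takes a much larger step, so weights that reach a target tend to stay there.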

### 3.3 Target points

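Quantization: In this paper, we use the uniform symmetric quantization method krishnamoorthi2018quantizing () and the per-tensor quantization scheme for hardware friendliness. Consider a floating point range [$\min_x$, $\max_x$] of model weights. The weight $x$ is quantized to an integer in the range [$-2^{n-1}+1$, $2^{n-1}-1$] for $n$-bit precision. Quantization-dequantization for the weights of a network is defined with step-size ($\Delta$) and clipping values. The overall quantization process is as follows:

$$x_I = \mathrm{round}\left(\frac{x}{\Delta}\right), \qquad x_Q = \mathrm{clip}\left(x_I,\; -2^{n-1}+1,\; 2^{n-1}-1\right) \tag{8}$$

where $\mathrm{round}(\cdot)$ is the round to the closest integer operation and $\mathrm{clip}(x, a, b)$ limits $x$ to the interval $[a, b]$, returning $a$ if $x < a$, $b$ if $x > b$, and $x$ otherwise.

We can get the quantized weights with the de-quantization process as $\bar{x} = x_Q \times \Delta$ and use these quantized weights as the target positions for quantization.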
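A minimal sketch of the quantize-dequantize step may help here. It assumes a symmetric integer range of ±(2^(n-1) - 1), which is consistent with the three target points reported for 2-bit later in the paper; the function names, the tie-breaking rule and the step size in the example are illustrative assumptions.

```python
import math


def quantize_dequantize(x, delta, n_bits):
    """Uniform symmetric quantization followed by de-quantization.

    Assumes the integer range [-(2^(n-1) - 1), 2^(n-1) - 1].
    """
    q_max = 2 ** (n_bits - 1) - 1
    x_int = math.floor(x / delta + 0.5)   # round to nearest; ties go up (assumed rule)
    x_q = max(-q_max, min(q_max, x_int))  # clip to the representable integers
    return x_q * delta


def target_grid(delta, n_bits):
    """All de-quantized values, i.e. the candidate target points."""
    q_max = 2 ** (n_bits - 1) - 1
    return [k * delta for k in range(-q_max, q_max + 1)]


print(target_grid(0.1, 2))                # three grid points for 2-bit
print(quantize_dequantize(0.26, 0.1, 2))  # clipped to the largest grid point
```

Under this assumed range, 2-bit yields three target points and 3-bit yields seven, and any weight beyond the clipping value maps to the outermost grid point.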

Sparse Training: For magnitude-based pruning methods, weights near zero are removed. Therefore, we choose zero as the target value (i.e. $\bar{x} = 0$).

### 3.4 PSGD for deep networks

Many studies on the optimization of DNNs with stochastic gradient descent (SGD) have reported that multiple experiments give consistently similar performance although DNNs have many local minima (e.g. see Sec. 2 of ChaudhariCSL17 ()). choromanska2015loss () analyzed the loss surface of DNNs and showed that large networks have many local minima with similar performance on the test set and that the lowest critical values of the random loss function are located in a specific band lower-bounded by the global minimum. From this perspective, we explain informally how PSGD for deep networks works. As illustrated in Fig. 2, we posit that there exist many local minima ($A$, $B$) in the original weight space $\mathcal{X}$ with similar performance, only some of which ($A$) are close to one of the target points (0) and therefore also exhibit high performance in the compressed domain. As in Fig. 2 left, assume that the region of convergence for $B$ is much wider than that of $A$, meaning that random initialization is more likely to yield solution $B$ than $A$. By the warping function $\mathcal{F}$ specially designed as described above (Eq. 5), the original space $\mathcal{X}$ is warped to $\mathcal{X}'$ such that the areas near target points are expanded while those far from the targets are contracted. If we apply gradient descent in this warped space, the optimization will have a better chance of converging to $A'$. Correspondingly, PSGD in the original space will more likely output $A$ rather than $B$, which is favorable for compression. Note that $\mathcal{F}$ transforms the original weight space to the warped space $\mathcal{X}'$, not to the compressed domain.

## 4 Experiments

In this section, we experimentally show the effectiveness of the PSGD. To verify our PSGD method, we first conduct experiments for sparse training by setting the target point as 0, then we further extend our method to quantization with CIFAR krizhevsky2009learning () and ImageNet ILSVRC 2015 ILSVRC15 () dataset. We first demonstrate the effectiveness in sparse training with magnitude-based pruning by comparing with L0-regularization louizos2018learning () and SNIP lee2018snip (). louizos2018learning () penalizes the non-zero model parameters and shares the scheme of regularizing the model while training. Like ours, lee2018snip () is a single-shot pruning method, which does not require pruning schedules nor additional fine-tuning.

For quantization, we compare our method with (1) methods that employ regularization at the initial training phase alizadeh2020gradient (); gulrajani2017improved (); lin2018defensive (). We choose gradient L1 norm regularization alizadeh2020gradient () method and Lipschitz regularization methods lin2018defensive (); gulrajani2017improved () from the original paper alizadeh2020gradient () as baselines, because they propose new regularization techniques used at the training phase similar to us. Note that gulrajani2017improved () adds an L2 penalty term on the gradient of weights instead of the L1 penalty like alizadeh2020gradient (). We also compare with (2) existing state-of-the-art layer-wise post-training quantization methods that start from pre-trained models nagel2019data (); zhao2019improving () to show the improvement in lower bits (4-bit). Refer to Section 2 for the details on the compared methods. To validate the effectiveness of our method, we also train our model for extremely low bit (2,3-bit) weights. Lastly, we show the experimental results on various network architectures and applying PSG to the Adam optimizer kingma2014adam (), which are detailed in Appendix D.

Implementation details: We used the PyTorch framework for all experiments. For the sparse training experiment of Table 3, we used ResNet-32 he2016deep () on CIFAR-100, following the training hyperparameters of zhang2018lq (). We used the released official implementation of louizos2018learning () and re-implemented lee2018snip () in PyTorch. For the quantization experiments of Table 2 and 4, we used ResNet-18 and followed the settings of alizadeh2020gradient () for CIFAR-10 and ImageNet. For zhao2019improving (), the released official implementation was used. All other numbers are either from the original paper or re-implemented. For fair comparison, all quantization experiments followed the layer-wise uniform symmetric quantization krishnamoorthi2018quantizing (), and when quantizing the activations, we clipped the activation range using batch normalization parameters as described in nagel2019data (), the same as alizadeh2020gradient (). PSGD is applied during the last 15 epochs for ImageNet experiments and from the first learning rate decay epoch for CIFAR experiments. We use an additional 30 epochs for PSGD in the extremely low bit experiments (Table 4). Also, we tuned the hyper-parameter for each bit-width and sparsity. Our search criterion is ensuring that the performance of the uncompressed model is not degraded, similar to alizadeh2020gradient (). More details are in Appendix C.

### 4.1 Sparse Training

As a preliminary experiment, we first demonstrate that PSG-based optimization is possible with a single target point set at zero. Then, we apply magnitude-based pruning following han2015learning () across different sparsity ratios. As the purpose of the experiment is to verify that the weights are centered on zero, weights are pruned once after training has completed and the model is evaluated without fine-tuning for louizos2018learning () and ours. Results for lee2018snip (), which prunes the weights by single-shot at initialization, are shown for comparison on single-shot pruning.
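The one-shot magnitude-based pruning used for this evaluation can be sketched as follows. This is a simplified illustration on a flat list of weights, not the evaluation code: the function name `magnitude_prune`, the rounding of the pruned count, and the example values are assumptions made here.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the given fraction of weights with the smallest magnitudes."""
    n_prune = int(round(len(weights) * sparsity))
    # Indices sorted from smallest to largest absolute value.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned


weights = [0.8, -0.01, 0.03, -0.6, 0.002]  # made-up values for illustration
print(magnitude_prune(weights, 0.4))
```

If training has already pushed most weights close to zero, the entries removed this way carry little information, which is why no fine-tuning is applied afterwards in this experiment.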

Table 1 indicates that our method outperforms the two methods across various high sparsity ratios. While all three methods are able to maintain accuracy at low sparsity (10%), louizos2018learning () has some accuracy degradation at 20% and suffers severely at high sparsity. This is in line with the results shown in Gale2019TheSO () that the method was unable to produce sparse residual models without significant damage to the model quality. Comparing with lee2018snip (), our method is able to maintain higher accuracy even at high sparsity, displaying the strength in single-shot pruning, in which no pruning schedules nor additional training are necessary. Fig. 3 shows the distribution of weights in SGD- and PSGD-trained models.

### 4.2 Quantization

In the quantization domain, we first compare PSGD with regularization methods on the on-the-fly bit-width problem, meaning that a single model is evaluated across various bit-widths. Then, we compare with existing state-of-the-art layer-wise symmetric post-training methods to verify that PSGD handles the accuracy drop at low bits caused by the differences in weight distributions (see Fig. 1).

Regularization methods: Table 2 shows the results of regularization methods on the CIFAR-10 and ImageNet datasets. In the CIFAR-10 experiments of Table 2, we fix the activation bit-width to 4-bit and then vary the weight bit-widths from 8 to 4. For the ImageNet experiments of Table 2, we use equal bit-widths for both weights and activations, following alizadeh2020gradient (). In the CIFAR-10 experiment, all methods seem to maintain the performance of the quantized model down to 4-bit quantization. Regardless of target bit-widths, PSGD outperforms all other regularization methods. On the other hand, the ImageNet experiment generally shows reasonable results down to 6-bit, but the accuracy drops drastically at 4-bit. PSGD targeting 8-bit and 6-bit marginally improves on all bits, yet also experiences a drastic accuracy drop at 4-bit. In contrast, Gradient L1 () and PSGD @ W4 maintain the performance of the quantized models even at 4-bit. Compared with the second best method, Gradient L1 () alizadeh2020gradient (), PSGD performs better at all bit-widths. At full precision (FP), 8-, 6- and 4-bit, the performance gaps between alizadeh2020gradient () and ours are about 4.2%, 3.9%, 1.5% and 8.1%, respectively. Table 2 also shows that, while the quantization noise may slightly degrade the accuracy in some cases, using more bits generally leads to higher accuracy. Compared to other regularization methods, PSGD is able to maintain reasonable performance across all bits by constraining the distribution of the full-precision weights to resemble that of the quantized weights. This quantization-friendliness is achieved by the appropriately designed scaling function. In addition, unlike alizadeh2020gradient (), PSGD does not incur the additional overhead of computing double-backpropagation.

Post-training methods: Table 4 shows that OCS, a state-of-the-art post-training method, has a drastic accuracy drop at 4-bit. For OCS, following the original paper, we chose the best clipping method for both weights and activations. DFQ shows a similar drastic accuracy drop below 6-bit, as depicted in Fig. 1 of the original DFQ paper nagel2019data (). This is due to the fundamental discrepancy between the FP and quantized weight distributions, as stated in Sec 1 and Fig. 1. On the other hand, models trained with PSGD have similar full-precision and quantized weight distributions, and hence low quantization error, due to the scaling function. Our method outperforms OCS at 4-bit by around 19% without any post-training or weight clipping to treat the outliers.

Extremely low bits quantization: As shown in Fig. 1, SGD suffers a drastic accuracy drop at extremely low bits such as 3-bit and 2-bit. To confirm that PSGD can handle extremely low bits, we conduct experiments with PSGD targeting 3-bit and 2-bit, except for the first and last layers, which are quantized at 8-bit. Table 4 shows the results of applying PSGD. Although the full-precision accuracy does drop due to the strong constraints, PSGD is able to maintain reasonable accuracy. This demonstrates the potential of PSGD as a key solution to post-training quantization at extremely low bits.

## 5 Discussion

In this section, we focus on the local minima found by PSG with a toy example to gain a deeper understanding of PSG. We train with SGD and PSGD on 2-bit on MNIST dataset on a fully-connected network consisting of two hidden layers (50, 20 neurons). In this toy example, we only quantize the weights but not the activation. We show the weight distributions of the two models trained with SGD and PSGD at the first layer. Then, we calculate the eigenvalues of the entire Hessian matrix to analyze the curvature of a local loss surface.

Quantized and sparse model: SGD generally yields a bell-shaped distribution of weights, which is not well suited to low bit quantization zhao2019improving (). On the other hand, PSGD always provides a multi-modal distribution peaked at the quantized values. For this example, three target points are used (2-bit), so the weights are merged into three clusters as depicted in Fig. 4a. A large proportion of the weights are near zero, similar to Fig. 3. This is because symmetric quantization also contains zero as a target point. PSGD has nearly the same accuracy as FP (96%) at 2-bit. However, the accuracy of SGD at 2-bit is about 9%, although its FP accuracy is 97%. This tendency is also shown in Fig. 1b, which demonstrates that PSGD reduces the quantization error.

Curvature of PSGD solution: In Sec 3.4 and Fig. 2, we claimed that PSG finds a minimum in a sharp valley that is more compression-friendly but has a lower chance of being found. As the curvature in the direction of a Hessian eigenvector is determined by the corresponding eigenvalue Goodfellow-et-al-2016 (), we compare the curvature of solutions yielded by SGD and PSGD by assessing the magnitude of the eigenvalues, similar to chaudhari2019entropy (). SGD provides minima with relatively wide valleys because it has many near-zero eigenvalues, and a similar tendency is observed in chaudhari2019entropy (). However, the weights trained by PSGD have many more large positive eigenvalues, which means the solution lies in a relatively sharp valley compared to SGD. Specifically, the number of large eigenvalues () in PSGD is 9 times more than that of SGD. From this toy example, we confirm that PSG helps to find minima which are more compression-friendly (Fig. 4a) and lie in sharp valleys (Fig. 4b) that are hard to reach by normal SGD.

## 6 Conclusion

In this work, we introduce the position-based scaled gradient (PSG), which scales the gradient in proportion to the distance between the current weight and the corresponding target point. We prove that stochastic PSG descent (PSGD) is equivalent to applying SGD in the warped space. Based on the hypothesis that a DNN has many local minima with similar performance on the test set, PSGD is able to find a compression-friendly minimum that is hard to reach by other optimizers. PSGD can be a key solution to low bit post-training quantization because it reduces the quantization error, meaning that the distributions of the compressed and uncompressed weights are similar. Because target points act as a prior that constrains the original weights to merge at specific positions, PSGD can also be used for sparse training by simply setting the target point to 0. In our experiments, we verify PSGD in the domains of sparse training and quantization by showing its effectiveness on image classification datasets such as CIFAR-10/100 and ImageNet. Also, we empirically show that PSGD finds minima located in sharper valleys than those found by SGD. We believe that PSGD will help further research in model quantization and sparse training.

## 7 Broader Impact

PSG is a fundamental method of scaling each gradient component differently depending on the position of a weight vector. This technique can replace the conventional gradient in any application that requires different treatment of specific locations in the parameter space. As shown in the paper, the easiest conceivable applications are quantization and pruning, where a definite preference for specific weight forms exists. These model compression techniques are at the heart of the fast and lightweight deployment of deep learning algorithms, and thus PSG can make a large impact in the related industry. As another potentially related research topic, PSG may be useful in optimization areas such as integer programming and combinatorial optimization, acting as a tool for optimizing a continuous surrogate of an objective function in a discrete space.


## Appendix A Detailed Results on CIFAR-100 with ResNet-32 (Fig. 1)

In Table 5, we show the classification accuracies and the corresponding mean squared errors (MSE) of PSGD depicted in Fig. 1 of the main paper. Also, the weight distributions of the model at various layers are shown in Fig. 5. In this experiment, we only quantize the weights, not the activations, to compare the performance degradation as the weight bit-width decreases. The MSE of the weights across different bit-widths is also reported. The MSE is computed as the mean of the squared differences between the full-precision weights and the low-precision weights across layers. As some variance in performance was observed for lower bit-widths, we report the mean ± standard deviation for the 2-bit and the 3-bit experiments. As stated, PSGD successfully merges the weights to the target points and obtains quite low MSE down to 3-bit, and the 2-bit MSE of PSGD is more than 2 times smaller than that of conventional SGD.
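The weight MSE reported here can be illustrated with a small sketch. Everything in it is an assumption for illustration: the weights are made-up toy values rather than results from the paper, all layers are flattened into one list, and the quantizer uses a symmetric integer range of ±(2^(n-1) - 1) with ties rounded up.

```python
import math


def quantize_dequantize(x, delta, n_bits):
    """Uniform symmetric quantize-dequantize with an assumed symmetric integer range."""
    q_max = 2 ** (n_bits - 1) - 1
    x_int = math.floor(x / delta + 0.5)
    return max(-q_max, min(q_max, x_int)) * delta


def weight_mse(weights, delta, n_bits):
    """Mean of squared differences between full-precision and quantized weights."""
    errors = [(w - quantize_dequantize(w, delta, n_bits)) ** 2 for w in weights]
    return sum(errors) / len(errors)


# Toy values only: one spread-out set and one clustered near the 2-bit grid {-0.1, 0, 0.1}.
spread_weights = [0.26, -0.14, 0.05, 0.38]
clustered_weights = [0.101, -0.099, 0.002, 0.098]

print(weight_mse(spread_weights, 0.1, 2))
print(weight_mse(clustered_weights, 0.1, 2))
```

Weights that already sit near the grid points lose very little when quantized, which is the effect the MSE column of Table 5 is meant to capture.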

In Fig. 5, we display the full-precision weight distributions of the PSGD models and compare them against vanilla SGD-trained distributions. Four random layers of each model are shown column-wise. The first row displays the model trained with SGD and L2 weight decay. Below are distributions trained with PSGD with target points for the sparse training case, 2-bit, 3-bit, and 4-bit respectively. Note that all the histograms are plotted in the full-precision domain, rather than the low-precision domain. For 2-bit, all three target bins () are visible. For 3-bit, only five target bins are visible as the peripheral two bins contain relatively low numbers of weight components.

## Appendix B Methods

### B.1 Offset

In our warping function in the following

(9) |

we introduced for making continuous. If we do not add a constant , the has points of discontinuity at every as depicted in Fig. 6, where represents step size and means -th quantized value identical to corresponding to . We can calculate the left sided limit and right sided limit at using Eq. 9.

(10) | |||||

(11) |

Based on the condition that the left sided limit and the right sided limit should be the same (Eq. 10 = Eq. 11), we can get the following recurrence relation:

(12) |

Using the successive substitution for calculating , it becomes

Setting and because , can be calculated as below:

(13) |

### B.2 Non-separable directional scaling

Here, we introduce another example of warping function and the corresponding scaling function. In this case, we define the warping function as a multivariate function as and set

(14) |

Here, is the infinite norm or max norm which can be replaced with where is the index with the maximum absolute value. is a constant as in Eq. 9. By using , the partial derivative of Eq. 14 becomes

(15) |

By changing the order of variable index, we can put the max element to the last and then the Jacobian matrix becomes upper triangular with all-positive diagonal elements and the only non-zero off-diagonal elements are in the last column of the matrix. Comparing the magnitude of non-zero off-diagonal elements, which is in the range of , with that of diagonal elements which is in the range of where is the size of a quantization grid, off-diagonal elements does not dominate the diagonal elements. Furthermore, considering the deep network with a huge number of weight parameters, we can neglect the effect of off-diagonal elements and use only the diagonal elements of the Jacobian matrix for scaling. In this case, the elementwise scaling function becomes

(16) |

Using the elementwise scaling function Eq.(16), the elementwise weight update rule for the PSG descent (PSGD) becomes

(17) |

Independent vs Directional scaling: The independent scaling function, such as the one presented in the main paper (Eq. 6), only considers the independent element-wise distance between the positions of the weights and the targets. This means that when the weight vector is very close to one of the target points, the magnitude of the gradients could be very small, leading to slow convergence, as the scaling function for all elements will be nearly 0. To avoid this, we added a small in Eq. 6. Note, however, that the weights still need to be updated according to the task loss (e.g., cross-entropy loss) to find an optimal solution. To address this degradation of the gradient magnitude, the directional scaling function (Eq. 16) finds the dominant direction by normalizing the scaled gradient, as depicted in Figure 7. Directional scaling performs slightly better than independent scaling, as shown in Table 6, but the difference is small. Note, however, that the vanishing of the scaling function at the target under independent scaling can in any case be mitigated by increasing the offset.
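The vanishing of an element-wise scale near the targets, and the two remedies discussed above, can be illustrated with a small sketch. It assumes an independent scale of the form sqrt(|w - target| + eps), which is an assumption made here for illustration. The `normalized_scale` function is one hypothetical way of normalizing by the largest element and is not the paper's Eq. 16. All names and values are illustrative.

```python
import math

def nearest_target(w, step):
    """Closest point of a uniform grid with spacing `step`."""
    return math.floor(w / step + 0.5) * step

def independent_scale(weights, step, eps):
    """Assumed element-wise scale: shrinks towards sqrt(eps) as a weight nears its target."""
    return [math.sqrt(abs(w - nearest_target(w, step)) + eps) for w in weights]

def normalized_scale(weights, step, eps):
    """Hypothetical normalization: divide by the largest scale so the dominant element keeps scale 1."""
    s = independent_scale(weights, step, eps)
    s_max = max(s)
    return [x / s_max for x in s]

weights = [0.5004, 1.0001, -0.4999]                   # every element sits close to a grid point
print(independent_scale(weights, step=0.5, eps=1e-6))  # all scales are small
print(independent_scale(weights, step=0.5, eps=1e-2))  # a larger offset lifts every scale
print(normalized_scale(weights, step=0.5, eps=1e-6))   # relative scales, largest equals 1
```

In this toy setting, normalization keeps the relative ordering of the element-wise scales while preventing the overall update from shrinking; raising `eps` achieves a similar effect at the cost of a weaker pull towards the targets.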

## Appendix C Implementation details

We use the CIFAR-10/100 and ImageNet datasets for experiments. CIFAR-10 consists of 50,000 training images and 10,000 test images across 10 classes, with 6,000 images per class. CIFAR-100 consists of 100 classes with 600 images per class. The ImageNet dataset consists of 1.2 million images. We use the 50,000 validation images, which are not included in the training samples, for testing. We use the conventional data pre-processing steps.^{3}^{4}

ImageNet / CIFAR-10: For ResNet-18, we started training with an L2 weight decay of and a learning rate of 0.1, then decayed the learning rate by a factor of 0.1 every 30 epochs. Training was terminated at 90 epochs. We used only the last 15 epochs for training the model with PSGD, similar to [1]. That is, we applied the PSG method after 75 epochs, with a learning rate of 0.001. For the extremely low-bit experiments, we did not use any weight decay after 75 epochs (see below). We tuned the hyper-parameters for the target bit-widths. All reported numbers are results of the last epoch. We used the official code of [34] for comparisons, with 0.02 for the Expand Ratio.^{5}
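The schedule described above can be summarized in a few lines. This is a sketch only: the function name and zero-based epoch indexing are assumptions, and the weight decay value is omitted because it is not given in this text.

```python
def resnet18_schedule(epoch, base_lr=0.1, decay=0.1, decay_every=30, psg_start=75):
    """Return (learning rate, whether PSG is active) for a zero-based epoch index."""
    lr = base_lr * decay ** (epoch // decay_every)  # step decay every 30 epochs
    use_psg = epoch >= psg_start                    # PSG only for the last 15 of 90 epochs
    return lr, use_psg

for epoch in (0, 30, 60, 75, 89):
    print(epoch, resnet18_schedule(epoch))
```

With these defaults, PSG is switched on during the final learning-rate stage, which matches the statement that it is applied with a learning rate of 0.001.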

CIFAR-100: For ResNet-32, the same weight decay and initial learning rate were used as above, and the learning rate was decayed at epochs 82 and 123, following [33]. Training was terminated at epoch 150. For VGG16 with batch normalization (VGG16-bn), we decayed the learning rate at epoch 145 instead. We applied PSG after the first learning rate decay. The first convolutional layer and the last linear layer are quantized at 8-bit for the 2-bit and the 3-bit experiments. For sparse training, training was terminated at epoch 200 and weight decay was not used at higher sparsity ratios, while all the other training hyperparameters were the same. For [26], we used the official implementation for the results.^{6}

Extremely low-bits experiments: For ImageNet, we did not use weight decay for 2- and 3-bit, as it hinders convergence. For CIFAR-100, weight decay was omitted only for 2-bit. See Sec. D.3 for details on how weight decay affects training with PSGD. In addition, we experimented with training for more epochs than the original schedule. In this case, we ran an additional 30 epochs of PSGD, so that the total number of epochs is 120 and the PSG method is applied for the last 45 epochs.

## Appendix D Additional experiments

### D.1 Adam optimizer with PSG

To show the applicability of PSG to other types of optimizers, we applied it to the Adam optimizer, using the same scaling function, with ResNet-32 at 4-bit on the CIFAR-100 dataset. Following the convention, the initial learning rate of was used, and the first and the last layers of the model were fixed to 8-bit. All the other training hyperparameters remained the same. Table 7 compares the quantization results of models trained with vanilla Adam and with PSG applied to Adam.

### D.2 Various architectures with PSGD

In this section, we show the results of applying PSGD to various architectures. Table 8 shows the quantization results of VGG16 [31] with batch normalization on the CIFAR-100 dataset and DenseNet-121 [17] on the ImageNet dataset, respectively.

For DenseNet, we ran an additional 15 epochs starting from the pre-trained model to reduce the training time.^{7}

For VGG16 on the CIFAR-100 dataset, we observed a performance tendency similar to that of ResNet-32. The 4-bit targeted model was able to maintain its full-precision accuracy, while the models targeting lower bit-widths had some accuracy degradation.

### D.3 Weight decay at extremely low-bits

To show the effect of weight decay at extremely low bit-widths with PSG, we trained models with and without weight decay for 90 epochs, consisting of 75 epochs with SGD followed by 15 epochs with PSGD. The results are shown in Table 9. Based on these results, we found that weight decay had a detrimental effect in the extremely low-bit cases (2- and 3-bit). Figure 8 shows the weight distributions of the models with and without weight decay. The range of the weight distribution with weight decay (blue) is smaller than that without weight decay (red) due to the weight shrinkage effect. This regularization effect does not matter at higher bit-widths such as 6-bit and 4-bit. However, it has a negative effect on performance at extremely low bit-widths, so we do not use weight decay for the extremely low-bit experiments.

### D.4 Longer training for extremely low-bits

Although the model without weight decay does increase the performance significantly compared to the baseline, the performance gain is relatively lower in higher bit-widths (Table 9). We therefore train for an additional 30 epochs with PSGD and report the numbers in Table 10. The table shows a significant performance improvement from the longer training with PSGD. Note that additional training is not useful for bit-widths over 3-bit.

## Appendix E Hyper-parameter

We searched for an appropriate with the criterion that the performance of the uncompressed model is not degraded, similar to [1]. For hyper-parameter tuning, we used two disjoint subsets of the training dataset for training and validation. We then used the found to retrain on the whole training dataset. The table below shows the values of used in the experiments of the original paper. The tended to rise for lower target bit-widths or for higher sparsity ratios (see Tables 11 and 12). On CIFAR-10, we observe that the same value yields fair performance across all bit-widths.

### Footnotes

- Details on can be found in Appendix B. Also, another example of warping function and its experimental results are included in the same section.
- We set where is the conventional learning rate and is a hyper-parameter that can be set differently for various scaling functions depending on their range.
- https://github.com/kuangliu/pytorch-cifar
- https://github.com/pytorch/examples/blob/master/imagenet/main.py
- https://github.com/cornell-zhang/dnn-quant-ocs
- https://github.com/AMLab-Amsterdam/L0_regularization
- https://download.pytorch.org/models/densenet121-a639ec97.pth

### References

- Milad Alizadeh, Arash Behboodi, Mart van Baalen, Christos Louizos, Tijmen Blankevoort, and Max Welling. Gradient regularization for quantization robustness. In International Conference on Learning Representations, 2020.
- Ron Banner, Yury Nahshan, and Daniel Soudry. Post training 4-bit quantization of convolutional networks for rapid-deployment. In Advances in Neural Information Processing Systems, pages 7948–7956, 2019.
- Léon Bottou. Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT’2010, pages 177–186. Springer, 2010.
- Pratik Chaudhari, Anna Choromanska, Stefano Soatto, Yann LeCun, Carlo Baldassi, Christian Borgs, Jennifer Chayes, Levent Sagun, and Riccardo Zecchina. Entropy-sgd: Biasing gradient descent into wide valleys. Journal of Statistical Mechanics: Theory and Experiment, 2019(12):124018, 2019.
- Pratik Chaudhari, Anna Choromanska, Stefano Soatto, Yann LeCun, Carlo Baldassi, Christian Borgs, Jennifer T. Chayes, Levent Sagun, and Riccardo Zecchina. Entropy-sgd: Biasing gradient descent into wide valleys. In International Conference on Learning Representations, 2017.
- Anna Choromanska, Mikael Henaff, Michael Mathieu, Gérard Ben Arous, and Yann LeCun. The loss surfaces of multilayer networks. In Artificial intelligence and statistics, pages 192–204, 2015.
- Yoni Choukroun, Eli Kravchik, Fan Yang, and Pavel Kisilev. Low-bit quantization of neural networks for efficient inference. 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), Oct 2019.
- William C Davidon. Variable metric method for minimization. SIAM Journal on Optimization, 1(1):1–17, 1991.
- John E Dennis, Jr and Jorge J Moré. Quasi-newton methods, motivation and theory. SIAM review, 19(1):46–89, 1977.
- Trevor Gale, Erich Elsen, and Sara Hooker. The state of sparsity in deep neural networks. ArXiv, abs/1902.09574, 2019.
- Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. http://www.deeplearningbook.org.
- Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. Improved training of wasserstein gans. In Advances in neural information processing systems, pages 5767–5777, 2017.
- Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient neural network. In Advances in neural information processing systems, pages 1135–1143, 2015.
- Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
- Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
- Arthur E Hoerl and Robert W Kennard. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(1):55–67, 1970.
- Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4700–4708, 2017.
- Zehao Huang and Naiyan Wang. Data-driven sparse structure selection for deep neural networks. In Proceedings of the European conference on computer vision (ECCV), pages 304–320, 2018.
- Jangho Kim, SeongUk Park, and Nojun Kwak. Paraphrasing complex network: Network compression via factor transfer. In Advances in Neural Information Processing Systems, pages 2760–2769, 2018.
- Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- Raghuraman Krishnamoorthi. Quantizing deep convolutional networks for efficient inference: A whitepaper, 2018.
- Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
- Namhoon Lee, Thalaiyasingam Ajanthan, and Philip Torr. SNIP: Single-shot network pruning based on connection sensitivity. In International Conference on Learning Representations, 2019.
- Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Graf. Pruning filters for efficient convnets. arXiv preprint arXiv:1608.08710, 2016.
- Ji Lin, Chuang Gan, and Song Han. Defensive quantization: When efficiency meets robustness. In International Conference on Learning Representations, 2019.
- Christos Louizos, Max Welling, and Diederik P. Kingma. Learning sparse neural networks through regularization. In International Conference on Learning Representations, 2018.
- Markus Nagel, Mart van Baalen, Tijmen Blankevoort, and Max Welling. Data-free quantization through weight equalization and bias correction. In Proceedings of the IEEE International Conference on Computer Vision, pages 1325–1334, 2019.
- Jorge Nocedal and Stephen Wright. Numerical optimization. Springer Science & Business Media, 2006.
- Alex Renda, Jonathan Frankle, and Michael Carbin. Comparing rewinding and fine-tuning in neural network pruning. arXiv preprint arXiv:2003.02389, 2020.
- Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015.
- Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
- Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1):267–288, 1996.
- Dongqing Zhang, Jiaolong Yang, Dongqiangzi Ye, and Gang Hua. Lq-nets: Learned quantization for highly accurate and compact deep neural networks. In Proceedings of the European conference on computer vision (ECCV), pages 365–382, 2018.
- Ritchie Zhao, Yuwei Hu, Jordan Dotzel, Chris De Sa, and Zhiru Zhang. Improving Neural Network Quantization without Retraining using Outlier Channel Splitting. International Conference on Machine Learning (ICML), pages 7543–7552, June 2019.
- Michael Zhu and Suyog Gupta. To prune, or not to prune: exploring the efficacy of pruning for model compression. arXiv preprint arXiv:1710.01878, 2017.