Matrix and tensor decompositions for training binary neural networks
Abstract
This paper is on improving the training of binary neural networks in which both activations and weights are binary. While prior methods for neural network binarization binarize each filter independently, we propose to instead parametrize the weight tensor of each layer using matrix or tensor decomposition. The binarization process is then performed using this latent parametrization, via a quantization function (e.g. sign function) applied to the reconstructed weights. A key feature of our method is that while the reconstruction is binarized, the computation in the latent factorized space is done in the real domain. This has several advantages: (i) the latent factorization enforces a coupling of the filters before binarization, which significantly improves the accuracy of the trained models; (ii) while at training time the binary weights of each convolutional layer are parametrized using a real-valued matrix or tensor decomposition, during inference we simply use the reconstructed (binary) weights. As a result, our method does not sacrifice any advantage of binary networks in terms of model compression and speeding up inference. As a further contribution, instead of computing the binary weight scaling factors analytically, as in prior work, we propose to learn them discriminatively via backpropagation. Finally, we show that our approach significantly outperforms existing methods when tested on the challenging tasks of (a) human pose estimation (more than 4% improvement) and (b) ImageNet classification (up to 5% performance gain).
1 Introduction
One key aspect of the performance of deep neural networks is the availability of abundant computational resources (i.e. high-end GPUs) during both training and inference. However, such models often need to be deployed on devices with limited resources such as smartphones, FPGAs or embedded boards. To this end, there is a plethora of works that attempt to miniaturize the models and speed up inference, with popular directions including matrix and tensor decomposition [25, 19], weight pruning [11] and network quantization [6, 26]. Of particular interest in this work is the extreme case of quantization, namely binarization, where all the weights and features are restricted to 2 states only. Such networks can achieve a compression rate of up to 32× and an even higher speedup that can go up to 58× [31, 7]. Despite these attractive properties, training binary networks to an accuracy comparable to that of their real-valued counterparts is still an open problem. For example, there is an accuracy drop of about 18% between real and binary networks on ImageNet [31] (69.3% vs. 51.2% Top-1, see Table 4), and a sizeable gap also remains for human pose estimation on MPII [4] (see Table 3).
Most current works that attempt to improve the accuracy of binary networks fall into two broad categories: a) methodological changes and b) architectural improvements. The authors of [7] propose to binarize the weights using the sign function, with encouraging results on a few selected datasets. Because the representational power of binary networks is very limited, the authors of [31] propose to add a scaling factor to the weights and channels of each convolutional layer, showing for the first time competitive results on ImageNet. From an architectural point of view, the method of [4] proposes a novel residual module specially designed for binary networks, while in [38], the authors incorporate DenseNet-like connections into the U-Net architecture.
In this work, we propose a simple method to improve the accuracy of binary networks by introducing a linear or multilinear reparametrization of the weight tensor during training. Let’s consider a 4-dimensional weight tensor $\mathcal{W} \in \mathbb{R}^{o \times c \times w \times h}$. A common limitation in prior work is that each filter $\mathcal{W}_i \in \mathbb{R}^{c \times w \times h}$ (a slice of $\mathcal{W}$) of a given convolutional layer is binarized independently as follows:
$$\mathcal{B}_i = \text{sign}(\mathcal{W}_i).$$
In contrast, our key idea in this work is to model the filters jointly by reparametrizing them in a shared subspace using a matrix or tensor decomposition, and then binarizing the weights. A simplified version of our idea can be described as follows:
$$\text{vec}(\mathcal{W}_i) = \mathbf{U}\,\mathbf{v}_i, \qquad \mathcal{B}_i = \text{sign}(\mathcal{W}_i),$$
where the factor $\mathbf{U}$ is shared by all filters and $\mathbf{v}_i$ is specific to the $i$-th filter.
This allows us to introduce an interdependency between the to-be-binarized weights through the shared factor $\mathbf{U}$, either at a layer level or even more globally at a network level. A key feature of our method is that the decomposition factors (i.e. $\mathbf{U}$) are kept real during training. This allows us to introduce additional redundancy which, as we will show, facilitates learning.
Note that this latent parametrization is used only during training. During inference, our method only uses the reconstructed weights, which have been binarized using the sign function (the decomposition factors are neither used nor stored). Hence, our method does not sacrifice any of the advantages of binary networks in terms of model compression and inference speedup.
In summary, we make the following contributions:

We are the first to propose parameterizing the binarized weights of a neural network using a real-valued linear or multilinear decomposition (at training time). In other words, we enforce a shared subspace between the filters of the convolutions, as opposed to prior work that models and binarizes each filter independently. This novel approach allows us to further improve the accuracy of binary networks without sacrificing any of their advantages in terms of model compression and speeded-up inference (Section 4.2).

We explore several types of decomposition (SVD and Tucker) applied either layer-by-layer or jointly to the entire network as a whole (Section 4). We perform in-depth ablation studies that help shed light on the advantages of the newly proposed method.

We show that our method significantly advances the state-of-the-art for two important computer vision tasks: human pose estimation on MPII and large-scale image classification on ImageNet (Section 5).
2 Related work
In this section, we review the related work, in terms of neural network architectures (2.1), network binarization (2.2), tensor methods (2.3) and human pose estimation (2.4).
2.1 Efficient neural network architectures
Despite the remarkable accuracy of deep neural networks on a large variety of tasks, deploying such networks on devices with low computational resources is highly impractical. Recently, a series of works have attempted to alleviate this issue via architectural changes applied either at the block or architecture level.
Block-level optimization. In [12], He et al. propose the so-called bottleneck block that attempts to reduce the number of filters using 2 convolutional layers with a 1×1 kernel that project the features into a lower-dimensional subspace and back. The authors of [42] introduce a new convolutional block that splits the module into a series of parallel sub-blocks with the same topology. The resulting block has a smaller footprint and higher representational power. In a similar fashion, MobileNet [14] and its improvement [34] make use of depthwise convolutional layers, with the latter proposing an inverted bottleneck module. In [44], the authors combine pointwise group convolution and channel shuffle, incorporating them in the bottleneck structure.
Note that in this work we do not attempt to improve the architecture itself and simply use the already well-established basic block with pre-activation introduced in [13] (see Fig. 1).
Network-level optimization. The DenseNet architecture [16] proposes to interconnect each layer to every other layer in a feed-forward fashion. This results in a better gradient flow and a higher performance per number of parameters. Variations of it were later adopted for other tasks, such as human pose estimation [38]. In [32] and its follow-up [33] the authors introduce the so-called YOLO architecture, which proposes a new framework for object detection and an optimized architecture for the network backbone that can run in real time on a high-end GPU.
2.2 Network binarization
Another direction for speeding up neural networks is network quantization. This process reduces the number of possible states that the weights and/or the features can take and has become increasingly popular with the advent of low-precision computational hardware.
While normally CNNs operate using float32 values, the works of [6, 26] propose to use 16- and 8-bit quantization, showing in the process an insignificant performance drop on a series of small datasets (MNIST, CIFAR-10). Zhou et al. [46] propose to allocate a different number of bits to the network parameters (1 bit), activations (2 bits) and gradients (6 bits), the values of which are selected based on their sensitivity to numerical inaccuracies. [41] propose a two-step n-bit quantization, where the first step consists of learning a low-bit code and the second of learning a transformation function. In [10], the authors propose to learn a 1-2 bit quantization for the weights and a 2-8 bit quantization for the activations by learning a symmetric codebook for each particular weights subgroup. While such methods can lead to significant space and speed gains, the most interesting case is that of binarized neural networks. Such networks have their features and weights quantized to two states only. In [35] the authors propose to binarize the weights using the sign function. Follow-up works [7, 8] further improve these results by binarizing both the activations and the weights. In such networks the multiplications inside the convolutional layer can be replaced with XNOR bitwise operations. The current state-of-the-art binarization technique is the XNOR-Net method [31], which proposes a real-valued scaling factor for the weights and inputs and is the first to report good results on a large-scale dataset (ImageNet). In [4], the authors propose a new module specifically designed for binary networks. The work of [28] explores ways of increasing the accuracy of quantized networks by increasing their width (i.e. number of channels), motivated by the idea that the activations often take most of the memory during training.
In a similar fashion, in [27] the authors use up to 5 parallel binary convolutional layers to approximate a real one, as such increasing the size and computational requirements of the network by up to 5×. [45] proposes a loss-aware binarization method that jointly regularizes the weights approximation error and the accompanying loss; however, this method quantizes the weights while leaving the features real. [15] proposes a semi-binary decomposition of the binary weight tensor into two binary matrices and a diagonal real-valued one, which are used (instead of the actual binary weights) at test time. As mentioned by the authors, the proposed binary-to-(semi)binary decomposition is a difficult optimization problem and hence harder to train. More importantly, and in contrast to our method, in this approach the activations are kept real.
In this work, we propose to improve the binarization process itself, introducing a novel approach that increases the representation power and flexibility of binary weights at train time via matrix and tensor reparametrization while maintaining the same structure and very large speed gains during inference.
2.3 Tensor methods
Tensor methods offer a natural extension of the more traditional algebraic methods to the higher orders that naturally arise in convolutional networks. As such, this family of methods is actively deployed, both for compressing and speeding up the networks via reparametrization [25, 18, 2], and for taking advantage directly of the higher-order dimensionality present in the data [22, 23].
Separable convolutions, recently popularized in [5], are one such example that can be obtained by applying a CP decomposition to the layer weights. In [25], the weights of each convolutional layer are decomposed into a sum of rank-1 tensors using a CP decomposition in an attempt to speed up the convolutional modules. At inference time this is achieved by replacing the original layers with a set of smaller ones, where the weights of each newly introduced layer represent the factors themselves. Similarly, in [18] the authors reparametrize the layer weights using a Tucker decomposition. At test time, the resulting module resembles a bottleneck [12] block. [36] propose to decrease the redundancy typically present in large neural networks by expressing each layer as the composition of two convolutional layers with fewer parameters. Each 2D filter is approximated by a sum of rank-1 tensors. However, this can be applied only to convolutional layers which have a kernel size larger than 1. While most of the works mentioned above are applied to convolutional layers, other types can be parameterized too. In all the aforementioned works, tensor decompositions are applied to individual convolutional layers. More recently, the work of [21] proposed a simple method for whole-network tensorization using a single high-order tensor.
To our knowledge, none of the above methods have been applied to binary networks. By doing so, our approach allows us to combine the best of both worlds: take advantage of the very high compression rate and speedup typically offered by binarized networks while maintaining the increased representational power offered by the tensor reparametrization methods. A crucial aspect of this reparametrization is that it enables us to enforce an interdependency between the binary filters, which were treated independently in prior work on binarization [31, 8].
2.4 Human pose estimation
While a complete review of recent work on human pose estimation goes beyond the scope of this paper, the current state-of-the-art on single-person human pose estimation is based on the so-called "Hourglass" (HG) architecture [29] and its variants [3, 37, 17, 43]. Most of this prior work focuses on achieving the highest performance without imposing any computational restrictions. Only recently, the works in [38] and [4] study this problem in the context of quantized neural networks. In [38] the authors propose an improved HG architecture that makes use of dense connections [16], while [4] introduces a novel residual block specially tailored to binarized neural networks. In contrast with the aforementioned methods, in this work, instead of improving the network architecture itself, we propose a novel, improved binarization technique that is independent of the network and task at hand.
3 Background
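Let $\mathcal{W} \in \mathbb{R}^{o \times c \times w \times h}$ and $\mathcal{I} \in \mathbb{R}^{c \times w_{in} \times h_{in}}$ denote the weights and, respectively, the input of the $l$-th convolutional layer, where $o$ represents the number of output channels, $c$ the number of input channels and ($w$, $h$) the width and height of the convolutional kernel. $w_{in}$ and $h_{in}$ represent the spatial dimensions of the input features $\mathcal{I}$. In its simplest form the binarization process can be achieved by taking the sign of the weights and, respectively, of the input features, where:
$$\text{sign}(x) = \begin{cases} -1, & \text{if } x \leq 0 \\ 1, & \text{if } x > 0 \end{cases} \qquad (1)$$
However, such an approach leads to subpar performance on the more challenging datasets. In [31], Rastegari et al. propose to alleviate this by introducing a real-valued scaling factor that boosts the representational power of such networks:
$$\mathcal{I} * \mathcal{W} \approx \left(\text{sign}(\mathcal{I}) \circledast \text{sign}(\mathcal{W})\right) \odot \mathbf{K}\alpha \qquad (2)$$
where $\alpha_i = \frac{1}{n}\|\mathcal{W}_{i,:,:,:}\|_{\ell_1}$ with $n = c \times w \times h$ and $i \in \{1, \dots, o\}$, and $\mathbf{K}$ contains the scaling factors of the input features. We denote by $*$ the real-valued convolutional operation and by $\circledast$ its binary counterpart, implemented using XNOR bitwise operations. Note that while in [31] a scaling factor is proposed for both input features and weights, in this work we use only the latter, since removing the former significantly speeds up the network at a negligible drop in accuracy [31, 4].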
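As an illustration of the background above, the following minimal NumPy sketch binarizes one flattened filter and one flattened input patch, computes the analytic weight scaling factor of [31] as the mean absolute weight, and checks that a dot product over {-1, +1} values can be obtained with XNOR and a bit count. The helper names (`sign`, `pack`, `xnor_dot`) and the bit encoding (+1 as bit 1, -1 as bit 0) are assumptions made for this example, not part of the original method description.

```python
import numpy as np

def sign(x):
    # Binarization: -1 for x <= 0, +1 otherwise.
    return np.where(x > 0, 1.0, -1.0)

def pack(v):
    # Hypothetical helper: pack a +/-1 vector into an integer bit field
    # (+1 -> bit 1, -1 -> bit 0).
    return sum(1 << i for i, s in enumerate(v) if s > 0)

def xnor_dot(a_bits, b_bits, n):
    # XNOR marks positions where both operands agree.
    # dot = (#agreements) - (#disagreements) = 2 * agreements - n.
    mask = (1 << n) - 1
    agreements = bin(~(a_bits ^ b_bits) & mask).count("1")
    return 2 * agreements - n

rng = np.random.default_rng(0)
w = rng.standard_normal(64)      # one flattened real-valued filter
x = rng.standard_normal(64)      # one flattened input patch

alpha = np.abs(w).mean()         # analytic scaling factor (mean absolute weight)
wb, xb = sign(w), sign(x)

dense = float(wb @ xb)                       # +/-1 dot product using multiplications
bitwise = xnor_dot(pack(wb), pack(xb), 64)   # same value using XNOR + bit count
scaled = alpha * bitwise                     # scaled binary response
```

The equality of `dense` and `bitwise` is what allows binary convolutions to avoid floating-point multiplications; the scaling factor is applied once per output channel afterwards.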
4 Method
In this section, we present our novel binarization method that aims to increase the representational power of the binary networks by enforcing for the first time an interdependency between the binary filters via a linear or multilinear overparametrization of the weights. We start by introducing some necessary notation (Section 4.1). We then continue by describing the main algorithm and its variations (Section 4.2). Finally, in Section 4.3 we describe how to further improve the proposed binarization technique by optimizing the scaling factor with respect to the target loss function via backpropagation.
4.1 Notation
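We denote vectors (1st order tensors) as $\mathbf{v}$, matrices (2nd order tensors) as $\mathbf{M}$ and tensors of order $\geq 3$ as $\mathcal{X}$. We denote element $(i_1, i_2, \dots, i_N)$ of a tensor as $\mathcal{X}_{i_1, i_2, \dots, i_N}$. A colon is used to denote all elements of a mode, e.g. the mode-1 elements of $\mathcal{X}$ are denoted as $\mathcal{X}_{:, i_2, \dots, i_N}$.
Tensor contraction: we define the n-mode product (contraction between a tensor and a matrix), for a tensor $\mathcal{X} \in \mathbb{R}^{D_1 \times D_2 \times \cdots \times D_N}$ and a matrix $\mathbf{M} \in \mathbb{R}^{R \times D_n}$, as the tensor $\mathcal{T} = \mathcal{X} \times_n \mathbf{M} \in \mathbb{R}^{D_1 \times \cdots \times D_{n-1} \times R \times D_{n+1} \times \cdots \times D_N}$, with:
$$\mathcal{T}_{i_1, \dots, i_N} = \sum_{k=1}^{D_n} \mathbf{M}_{i_n, k}\, \mathcal{X}_{i_1, \dots, i_{n-1}, k, i_{n+1}, \dots, i_N} \qquad (3)$$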
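The n-mode product can be sketched in a few lines of NumPy. The function name `mode_dot` is an assumption made for this example, and NumPy axes are counted from 0, so `mode=1` below refers to the second axis of the array.

```python
import numpy as np

def mode_dot(tensor, matrix, mode):
    # Contract axis `mode` of `tensor` (size D) with the second axis of
    # `matrix` (shape (R, D)). The result has the same shape as `tensor`
    # except that axis `mode` now has size R.
    moved = np.moveaxis(tensor, mode, -1)   # bring the contracted axis last
    out = moved @ matrix.T                  # (..., D) @ (D, R) -> (..., R)
    return np.moveaxis(out, -1, mode)       # put the new axis back in place

x = np.arange(24, dtype=float).reshape(2, 3, 4)
m = np.ones((5, 3))
t = mode_dot(x, m, mode=1)                  # shape (2, 5, 4)
```

With a matrix of ones, every entry along the new axis is the sum of the original tensor over the contracted axis, which gives a quick sanity check of the definition.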
4.2 Matrix and tensor reparametrizations for training binary CNNs
A key limitation of prior work on training binary networks is that each of the filters $\mathcal{W}_i$ of the weight tensor $\mathcal{W}$ in each convolutional layer is binarized independently, without explicitly imposing any relation between the filters. To alleviate this, we propose to increase the representational power of the binary network via reparametrization. During training, we propose to express the to-be-binarized weights $\mathcal{W}_l$ of each convolutional layer $l$ using a linear or multilinear real-valued decomposition:
$$\mathcal{W}_l = \text{ReconstructWeights}(\Theta_l), \qquad \mathcal{B}_l = \text{sign}(\mathcal{W}_l) \qquad (4)$$
where the function ReconstructWeights(·) is specific to the decomposition used, there exists at least one decomposition factor in $\Theta_l$ which is shared among all filters in $\mathcal{W}_l$, and the decomposition factors in the set $\Theta_l$ are all real-valued. Using a real-valued decomposition is a key feature of the proposed approach, as it allows us to introduce additional redundancy which, as we show, facilitates learning.
Note that when training is done, our method simply uses the reconstructed weights, which are converted to binary numbers using the sign function. During inference the factors are therefore neither used nor stored; only the reconstructed binarized weights are used. Hence, our method does not sacrifice any advantage of binary networks in terms of model compression and speeding up inference.
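To make the storage argument concrete, the sketch below packs the signs of a (randomly generated, purely illustrative) reconstructed weight tensor into bits and compares the footprint with float32 storage. The tensor shape is a hypothetical choice and the per-filter scaling factors are ignored for simplicity.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for the reconstructed real-valued weights of one layer.
w_real = rng.standard_normal((64, 64, 3, 3)).astype(np.float32)

bits = w_real > 0                         # True encodes +1, False encodes -1
packed = np.packbits(bits.ravel())        # 8 binary weights per byte

ratio = w_real.nbytes / packed.nbytes     # float32 storage vs. 1 bit per weight

# Unpacking recovers exactly the binary weights used at inference time.
restored = np.unpackbits(packed)[: bits.size].reshape(bits.shape)
w_bin = np.where(restored == 1, 1.0, -1.0)
```

Only `packed` needs to be shipped with the model; the decomposition factors that produced `w_real` during training play no role here.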
In the context of this work, we explore two different decompositions: SVD and Tucker. We apply these decompositions in two different ways: layerwise and holistically. Layerwise decomposition refers to modeling the weight tensor of each convolutional layer separately, i.e. performing a different decomposition for each convolutional layer (e.g. [25, 18, 36]). We note that this is the standard way in the literature in which SVD and Tucker decompositions have been applied for neural network reparametrization. More recently, the work of [21] proposed a method for whole-network tensorization using a single high-order tensor. We refer to this tensorization approach as holistic.
Note that unlike other binarization methods, where two sets of weights are explicitly stored in memory and swapped at each iteration [31], our method can deal with this implicitly without a secondary copy. This is due to the fact that the factors are always real-valued and the weights are reconstructed and binarized on demand during training.
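The on-demand reconstruction can be sketched as a single training step on a toy fully connected layer. This is only an illustration under several assumptions that are not stated in the text: a two-factor product stands in for ReconstructWeights, the loss is a squared error on random data, and the gradient of the sign function is approximated with a clipped straight-through estimator, as is common practice for binary networks.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out, rank, lr = 16, 8, 8, 0.05

# The real-valued factors are the only trainable parameters kept in memory.
U = rng.standard_normal((n_out, rank)) * 0.5
V = rng.standard_normal((rank, n_in)) * 0.5

x = rng.standard_normal((32, n_in))                        # toy input batch
target = x @ np.sign(rng.standard_normal((n_in, n_out)))   # toy regression target

def step(U, V):
    W = U @ V                                # reconstruct real weights on demand
    B = np.where(W > 0, 1.0, -1.0)           # binarize the reconstruction
    err = x @ B.T - target
    loss = 0.5 * np.mean(np.sum(err ** 2, axis=1))
    grad_B = err.T @ x / len(x)              # dL/dB
    grad_W = grad_B * (np.abs(W) <= 1)       # straight-through estimator (assumption)
    return loss, grad_W @ V.T, U.T @ grad_W  # loss, dL/dU, dL/dV

loss0, gU, gV = step(U, V)
U -= lr * gU                                 # update the factors, never B itself
V -= lr * gV
```

No separate copy of real-valued weights is stored: `W` and `B` are temporaries that are rebuilt from `U` and `V` at every step.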
The entire proposed method for binarization is described in Algorithm 1:
4.2.1 Layerwise SVD decomposition
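Let $\mathbf{W}_l \in \mathbb{R}^{o \times cwh}$ be the reshaped version of the weight $\mathcal{W}_l$ of the $l$-th layer. By applying an SVD decomposition we can express $\mathbf{W}_l$ as follows:
$$\mathbf{W}_l = \mathbf{U}\,\mathbf{\Sigma}\,\mathbf{V}^\top \qquad (6)$$
where $\mathbf{U}$, $\mathbf{\Sigma}$ and $\mathbf{V}$ are learned via backpropagation.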
When evaluated on the validation set of MPII, reparametrizing the weights layerwise using SVD improves the performance by 0.3% (see Table 1).
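A minimal NumPy sketch of the layerwise SVD parametrization follows. The layer dimensions are hypothetical and the factors are drawn at random purely for illustration (in training they would be learned); the last lines show how a change in one row of the shared factor propagates to the binarized filters.

```python
import numpy as np

rng = np.random.default_rng(0)
O, C, w, h = 8, 4, 3, 3                  # hypothetical layer dimensions
r = min(O, C * w * h)

# Real-valued factors (random stand-ins for learned values).
U = rng.standard_normal((O, r))
s = rng.standard_normal(r)               # diagonal of Sigma
Vt = rng.standard_normal((r, C * w * h))

def reconstruct_binary(U, s, Vt):
    W = (U * s) @ Vt                     # U diag(s) V^T, shape (O, C*w*h)
    return np.where(W > 0, 1.0, -1.0).reshape(O, C, w, h)

B = reconstruct_binary(U, s, Vt)

# Perturb a single row of the shared factor V^T and count affected filters.
Vt2 = Vt.copy()
Vt2[0] *= -5.0
B2 = reconstruct_binary(U, s, Vt2)
filters_changed = int(np.any(B != B2, axis=(1, 2, 3)).sum())
```

Because every filter is a combination of the same rows of `Vt`, an update to the shared factor can alter the signs of many filters at once, which is the coupling that independent per-filter binarization lacks.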
4.2.2 Layerwise Tucker decomposition
While the SVD decomposition shows some benefits on the MPII dataset, one of its core limitations is that it requires reshaping the weights to a 2D matrix, thus losing the important spatial structure information present in them.
To alleviate this, we propose using the Tucker decomposition, a natural extension of SVD to higher-order tensors. Using the Tucker decomposition we can express the binary weights as follows:
$$\mathcal{B}_l = \text{sign}\left(\mathcal{G} \times_1 \mathbf{U}^{(1)} \times_2 \mathbf{U}^{(2)} \times_3 \mathbf{U}^{(3)} \times_4 \mathbf{U}^{(4)}\right) \qquad (7)$$
where $\mathcal{G}$ is a full-rank core and $(\mathbf{U}^{(1)}, \dots, \mathbf{U}^{(4)})$ a set of factors.
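The results from Tables 1 and 2 support the proposed hypothesis: on MPII, the layerwise Tucker decomposition improves over SVD by 0.6% without and 0.9% with learned scaling factors (Table 1), and on ImageNet by 2.2% Top-1 accuracy with learned scaling factors (Table 2).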
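The Tucker reconstruction of a 4th-order weight tensor can be written as a single contraction. The sketch below uses `np.einsum`; the function name `tucker_to_tensor`, the layer dimensions and the random factor values are assumptions made for illustration.

```python
import numpy as np

def tucker_to_tensor(core, factors):
    # core: (R1, R2, R3, R4); factors[k]: (D_k, R_k).
    # Applies one mode product per mode of the core in a single contraction.
    return np.einsum("abcd,ia,jb,kc,ld->ijkl", core, *factors)

rng = np.random.default_rng(0)
O, C, w, h = 8, 4, 3, 3                              # hypothetical layer dimensions
core = rng.standard_normal((O, C, w, h))             # full-rank core
factors = [rng.standard_normal((d, d)) for d in (O, C, w, h)]

W = tucker_to_tensor(core, factors)                  # real-valued, 4D structure kept
B = np.where(W > 0, 1.0, -1.0)                       # binary weights, shape (O, C, w, h)
```

Unlike the SVD variant, no reshaping to a matrix is needed, so the two spatial modes keep their own factors.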
Table 1: PCKh on the MPII validation set.
Decomposition  Holistic  Learn. alpha  PCKh
None           -         ✗             78.4%
None           -         ✓             79.3%
SVD            ✗         ✗             78.7%
SVD            ✗         ✓             79.0%
Tucker         ✗         ✗             79.3%
Tucker         ✗         ✓             79.9%
Tucker         ✓         ✗             82.0%
Tucker         ✓         ✓             82.5%

4.2.3 Holistic Tucker decomposition
Motivated by the method proposed in [21] and our findings from Section 4.2.2, where we reparametrized the weights using a layerwise Tucker decomposition, herein we go one step further and propose to group together identically shaped weights inside the network in a higher-order tensor in order to exploit the interrelation between them holistically.
For the ResNet-18 [12] used for ImageNet classification, we create 3 groups of convolutional layers based on the macro-module structure characterizing the architecture. Each of these groups is then parameterized with a single 5th order tensor $\mathcal{W}_g$ obtained by concatenating the weights of the convolutional layers in this group. The resulting decomposition is then defined as:
$$\mathcal{W}_g = \mathcal{G} \times_1 \mathbf{U}^{(1)} \times_2 \mathbf{U}^{(2)} \times_3 \mathbf{U}^{(3)} \times_4 \mathbf{U}^{(4)} \times_5 \mathbf{U}^{(5)} \qquad (8)$$
The individual weights of a given layer $l$ can be obtained by slicing $\mathcal{W}_g$ along the mode that indexes the layers, e.g. $\mathcal{W}_g(l, :, :, :, :)$ when the layers are stacked along the first mode.
For the hourglass network used in our experiments for human pose estimation, we follow [21] to derive a single 7th order tensor , the modes of which correspond to the number of HGs, the depth of each HG, the three signal pathways, the number of convolutional blocks, the number of input features, the number of output features, and finally the height and width of each of the convolutional kernels. The remaining few layers in the architecture are decomposed using a layerwise Tucker decomposition.
When tested on MPII, the proposed representation improves the performance by more than 3% in absolute terms against the baseline and by more than 1% when compared with its layerwise version (see Table 1). Similar results are observed on ImageNet (Table 2).
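The holistic variant can be sketched by adding a layer mode to the tensor being decomposed. In this illustrative NumPy example, a hypothetical group of 4 identically shaped layers is stacked along the first mode (an assumption; the text does not fix the position of the layer mode), reconstructed with one 5th-order Tucker contraction, and sliced to recover the binary weights of a single layer.

```python
import numpy as np

rng = np.random.default_rng(0)
L, O, C, w, h = 4, 8, 8, 3, 3        # hypothetical group of 4 same-shaped layers

core = rng.standard_normal((L, O, C, w, h))
factors = [rng.standard_normal((d, d)) for d in (L, O, C, w, h)]

# One 5th-order Tucker reconstruction for the whole group of layers.
W_group = np.einsum("abcde,ia,jb,kc,ld,me->ijklm", core, *factors)
B_group = np.where(W_group > 0, 1.0, -1.0)

def layer_weights(B_group, l):
    # The binary weights of layer l are a slice along the layer mode.
    return B_group[l]

B_2 = layer_weights(B_group, 2)      # shape (O, C, w, h)
```

The factors for the channel and spatial modes are now shared by all layers in the group, so the coupling extends across layers rather than only across filters of one layer.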
Table 2: Top-1 and Top-5 accuracy on ImageNet.
Decomposition  Holistic  Learn. alpha  Top-1  Top-5
None           -         ✗             52.3%  74.1%
None           -         ✓             53.0%  74.7%
SVD            ✗         ✓             52.5%  74.2%
Tucker         ✗         ✗             54.0%  76.9%
Tucker         ✗         ✓             54.7%  77.4%
Tucker         ✓         ✗             55.2%  78.2%
Tucker         ✓         ✓             55.6%  78.5%
4.3 Learnable scaling factors
One of the key ingredients of the recent success of binarized neural networks was the introduction of the weight scaling factor in [31] (see Eq. 2), computed analytically as the average of the absolute weight values. While this estimate generally performs well, it minimizes the difference between the real weights and the binary ones and does not explicitly decrease the overall network loss. In contrast, in this work we propose to learn the scaling factor discriminatively, by optimizing it with respect to the network's cost function via backpropagation.
Fig. 2 shows the difference between the scaling factors learned using our proposed method and the ones computed using the analytic solution from [31]. Note that our method leads to scaling factors that (a) follow a more spread-out distribution that can take both positive and negative values and (b) have a significantly higher magnitude, leading to faster and more stable training.
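The difference between the analytic and the learned scaling factor can be illustrated on a toy binary linear layer. In this sketch the per-filter factor is initialized with the analytic value (mean absolute weight) and then updated by plain gradient descent on a squared-error loss; the data, the target factors `true_scale` and the learning rate are hypothetical choices made only for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
n_out, n_in, lr = 4, 16, 0.01

W = rng.standard_normal((n_out, n_in))
B = np.where(W > 0, 1.0, -1.0)
alpha = np.abs(W).mean(axis=1)                 # analytic value, one per filter

x = rng.standard_normal((64, n_in))
true_scale = np.array([0.5, 1.0, 1.5, 2.0])    # hypothetical task-optimal factors
target = x @ (B * true_scale[:, None]).T

def loss_and_grad(alpha):
    z = x @ B.T                                # binary response before scaling
    err = alpha * z - target
    loss = 0.5 * np.mean(np.sum(err ** 2, axis=1))
    grad_alpha = np.mean(err * z, axis=0)      # dL/dalpha via the chain rule
    return loss, grad_alpha

loss_before, _ = loss_and_grad(alpha)
for _ in range(200):
    _, g = loss_and_grad(alpha)
    alpha = alpha - lr * g                     # gradient descent on the task loss
loss_after, _ = loss_and_grad(alpha)
```

The learned factors move away from the mean absolute weight towards whatever values reduce the task loss; nothing in the update constrains them to be positive.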
5 Experimental evaluation
This section first presents the experimental setup, network architecture and training procedure. We then empirically demonstrate the advantage of our approach on single-person human pose estimation and large-scale image recognition, where we surpass the state-of-the-art by more than 4% (Section 5.3).
5.1 Human pose estimation
Datasets. MPII [1] is one of the most challenging human pose estimation datasets to date, consisting of over 40,000 people, each annotated with up to 16 keypoints and visibility labels. The images were extracted from various YouTube videos. For the training/validation split, we used the same partitioning as introduced in [40]. The results are reported in terms of PCKh [1].
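For reference, PCKh counts a predicted keypoint as correct when its distance to the ground truth, normalized by the head size of the person, falls below a threshold. The sketch below is a simplified illustration rather than the official evaluation code; the function name, the argument layout and the default threshold of 0.5 are assumptions.

```python
import numpy as np

def pckh(pred, gt, head_size, visible, thr=0.5):
    # pred, gt: (N, K, 2) keypoint coordinates; head_size: (N,);
    # visible: (N, K) boolean mask of annotated keypoints.
    dist = np.linalg.norm(pred - gt, axis=-1) / head_size[:, None]
    correct = (dist <= thr) & visible
    return 100.0 * correct.sum() / visible.sum()

gt = np.zeros((1, 2, 2))
pred = np.array([[[3.0, 4.0], [30.0, 40.0]]])   # errors of 5 and 50 pixels
score = pckh(pred, gt, head_size=np.array([10.0]),
             visible=np.ones((1, 2), dtype=bool))
# score is 50.0: only the first keypoint lies within 0.5 * head size
```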
Network architecture. The Hourglass (HG) [29] and its variants represent the current state-of-the-art on human pose estimation. As such, in this work, we used an hourglass-like architecture (Fig. 4) constructed using the basic blocks introduced in [12, 31] (see also Fig. 1). The HG network as a whole follows an encoder-decoder structure with skip connections between each corresponding level of the decoder and encoder part. The basic block used has 128 channels.
Training. During training, we followed the best practices and randomly augmented the data with rotation (between and degrees), flipping and scale jittering (between 0.7 and 1.3). All models were trained until convergence (typically 120 epochs max). During this time, the learning rate was dropped multiple times from to . We used no weight decay.
5.2 Largescale image classification
Datasets. ImageNet [9] is a large-scale image recognition dataset consisting of more than 1.2M images for training, distributed over 1000 object classes, and 50,000 images for validation.
Network architecture. Following [31, 7], we used a ResNet-18 [12] architecture for our experiments on ImageNet. The ResNet-18 consists of 18 convolutional layers distributed across 4 macro-modules that are linked via skip connections. At the beginning of each macro-module the resolution is dropped using a convolutional layer with a stride of 2. The final predictions are obtained by using an average pooling layer followed by a fully connected one.
Training. During training, we resized the input images to px and then a random px crop was selected for training. At test time, instead of random cropping the images, a center crop was applied. The network was trained using Adam [20] for 90 epochs with a learning rate of that was gradually reduced (dropped every 30 epochs) to . The weight decay was set to for the entire duration of the training.
5.3 Comparison with stateoftheart
Table 3: Comparison with the state-of-the-art on MPII.
Method       #parameters  PCKh
HBC [4]      6.2M         78.1%
Ours         6.0M         82.5%
Real valued  6.0M         85.8%
In this section, we report the performance of our method on the challenging and diverse tasks of human pose estimation (on MPII) and large-scale image recognition (on ImageNet), and compare it with that of published state-of-the-art methods that use fully binarized neural networks (i.e. both the weights and the features are binary).
On human pose estimation, the only other work that trains fully binarized networks is that of [4]. As the results from Table 3 show, our method offers an improvement of more than 4% on the MPII dataset when compared against the state-of-the-art method of [4]. Qualitative results are shown in Fig. 5.
As Table 4 shows, for ImageNet classification, our method improves upon the results from [31] by up to 5% in absolute terms.
Table 4: Comparison with the state-of-the-art on ImageNet.
Method            Top-1 accuracy  Top-5 accuracy
BNN [8]           42.2%           69.2%
XNOR-Net [31]     51.2%           73.2%
Ours              55.6%           78.5%
Real valued [12]  69.3%           89.2%
6 Conclusion
In this paper, we proposed a novel binarization method in which the weight tensor of each layer or group of layers is parametrized using matrix or tensor decomposition. The binarization process is then performed using this latent parametrization, via a quantization function (e.g. sign function) applied to the reconstructed weights.
This simple idea enforces a coupling of the filters before binarization, which is shown to significantly improve the accuracy of the trained models. Additionally, instead of computing the weight scaling factor analytically, we propose to learn it via backpropagation. When evaluated on single-person human pose estimation (on MPII) and large-scale image recognition (on ImageNet), our method surpasses the state-of-the-art by 4% and 5%, respectively, while retaining the speedup (up to 58×) and space saving (up to 32×) typically offered by binary networks.
References
 [1] M. Andriluka, L. Pishchulin, P. Gehler, and B. Schiele. 2d human pose estimation: New benchmark and state of the art analysis. In CVPR, 2014.
 [2] M. Astrid and S. Lee. CP-decomposition with tensor power method for convolutional neural networks compression. CoRR, abs/1701.07148, 2017.
 [3] A. Bulat and G. Tzimiropoulos. Human pose estimation via convolutional part heatmap regression. In ECCV, 2016.
 [4] A. Bulat and G. Tzimiropoulos. Binarized convolutional landmark localizers for human pose estimation and face alignment with limited resources. In ICCV, 2017.
 [5] F. Chollet. Xception: Deep learning with depthwise separable convolutions. arXiv preprint arXiv:1610.02357, 2017.
 [6] M. Courbariaux, Y. Bengio, and J.-P. David. Training deep neural networks with low precision multiplications. arXiv, 2014.
 [7] M. Courbariaux, Y. Bengio, and J.-P. David. BinaryConnect: Training deep neural networks with binary weights during propagations. In NIPS, 2015.
 [8] M. Courbariaux, I. Hubara, D. Soudry, R. El-Yaniv, and Y. Bengio. Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1. arXiv, 2016.
 [9] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
 [10] J. Faraone, N. Fraser, M. Blott, and P. H. Leong. SYQ: Learning symmetric quantization for efficient deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4300–4309, 2018.
 [11] S. Han, J. Pool, J. Tran, and W. Dally. Learning both weights and connections for efficient neural network. In Advances in neural information processing systems, pages 1135–1143, 2015.
 [12] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
 [13] K. He, X. Zhang, S. Ren, and J. Sun. Identity mappings in deep residual networks. In ECCV, 2016.
 [14] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
 [15] Q. Hu, G. Li, P. Wang, Y. Zhang, and J. Cheng. Training binary weight networks via semi-binary decomposition. In Proceedings of the European Conference on Computer Vision (ECCV), pages 637–653, 2018.
 [16] G. Huang, Z. Liu, K. Q. Weinberger, and L. van der Maaten. Densely connected convolutional networks. arXiv, 2016.
 [17] L. Ke, M.-C. Chang, H. Qi, and S. Lyu. Multi-scale structure-aware network for human pose estimation. arXiv preprint arXiv:1803.09894, 2018.
 [18] Y. Kim, E. Park, S. Yoo, T. Choi, L. Yang, and D. Shin. Compression of deep convolutional neural networks for fast and low power mobile applications. CoRR, 05 2016.
 [19] Y.-D. Kim, E. Park, S. Yoo, T. Choi, L. Yang, and D. Shin. Compression of deep convolutional neural networks for fast and low power mobile applications. arXiv preprint arXiv:1511.06530, 2015.
 [20] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 [21] J. Kossaifi, A. Bulat, G. Tzimiropoulos, and M. Pantic. Parametrizing fully convolutional nets with a single high-order tensor. ICLR submission, 2018.
 [22] J. Kossaifi, A. Khanna, Z. Lipton, T. Furlanello, and A. Anandkumar. Tensor contraction layers for parsimonious deep nets. In 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 1940–1946, July 2017.
 [23] J. Kossaifi, Z. C. Lipton, A. Khanna, T. Furlanello, and A. Anandkumar. Tensor regression networks. CoRR, abs/1707.08308, 2018.
 [24] J. Kossaifi, Y. Panagakis, and M. Pantic. TensorLy: Tensor learning in Python. arXiv preprint arXiv:1610.09555, 2016.
 [25] V. Lebedev, Y. Ganin, M. Rakhuba, I. V. Oseledets, and V. S. Lempitsky. Speeding-up convolutional neural networks using fine-tuned CP-decomposition. CoRR, abs/1412.6553, 2014.
 [26] D. D. Lin, S. S. Talathi, and V. S. Annapureddy. Fixed point quantization of deep convolutional networks. arXiv, 2015.
 [27] X. Lin, C. Zhao, and W. Pan. Towards accurate binary convolutional neural network. In Advances in Neural Information Processing Systems, pages 345–353, 2017.
 [28] A. Mishra, J. J. Cook, E. Nurvitadhi, and D. Marr. WRPN: Training and inference using wide reduced-precision networks. arXiv preprint arXiv:1704.03079, 2017.
 [29] A. Newell, K. Yang, and J. Deng. Stacked hourglass networks for human pose estimation. In ECCV, 2016.
 [30] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in PyTorch. 2017.
 [31] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi. XNOR-Net: ImageNet classification using binary convolutional neural networks. In ECCV, 2016.
 [32] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. You only look once: Unified, real-time object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 779–788, 2016.
 [33] J. Redmon and A. Farhadi. YOLO9000: better, faster, stronger. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7263–7271, 2017.
 [34] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen. MobileNetV2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4510–4520, 2018.
 [35] D. Soudry, I. Hubara, and R. Meir. Expectation backpropagation: Parameter-free training of multilayer neural networks with continuous or discrete weights. In NIPS, 2014.
 [36] C. Tai, T. Xiao, Y. Zhang, X. Wang, et al. Convolutional neural networks with low-rank regularization. arXiv preprint arXiv:1511.06067, 2015.
 [37] W. Tang, P. Yu, and Y. Wu. Deeply learned compositional models for human pose estimation. In ECCV, 2018.
 [38] Z. Tang, X. Peng, S. Geng, L. Wu, S. Zhang, and D. Metaxas. Quantized densely connected U-Nets for efficient landmark localization. In ECCV, 2018.
 [39] T. Tieleman and G. Hinton. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural networks for machine learning, 4(2), 2012.
 [40] J. J. Tompson, A. Jain, Y. LeCun, and C. Bregler. Joint training of a convolutional network and a graphical model for human pose estimation. In NIPS, 2014.
 [41] P. Wang, Q. Hu, Y. Zhang, C. Zhang, Y. Liu, and J. Cheng. Two-step quantization for low-bit neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4376–4384, 2018.
 [42] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He. Aggregated residual transformations for deep neural networks. arXiv, 2016.
 [43] W. Yang, S. Li, W. Ouyang, H. Li, and X. Wang. Learning feature pyramids for human pose estimation. In ICCV, 2017.
 [44] X. Zhang, X. Zhou, M. Lin, and J. Sun. ShuffleNet: An extremely efficient convolutional neural network for mobile devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6848–6856, 2018.
 [45] A. Zhou, A. Yao, K. Wang, and Y. Chen. Explicit loss-error-aware quantization for low-bit deep neural networks. In CVPR, 2018.
 [46] S. Zhou, Y. Wu, Z. Ni, X. Zhou, H. Wen, and Y. Zou. DoReFa-Net: Training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv, 2016.