IFQ-Net: Integrated Fixed-Point Quantization Networks for Embedded Vision
Abstract
Deploying deep models on embedded devices has been a challenging problem since the great success of deep learning based networks. Fixed-point networks, which represent their data with low-bit fixed-point numbers and thus give remarkable savings on memory usage, are generally preferred. Even though current fixed-point networks employ relatively low bit widths (e.g. 8 bits), the memory saving is far from enough for embedded devices. On the other hand, quantization networks, for example XNOR-Net and HWGQ-Net, quantize the data into 1 or 2 bits, resulting in more significant memory savings, but still contain lots of floating-point data. In this paper, we propose a fixed-point network for embedded vision tasks by converting the floating-point data in a quantization network into fixed-point. Furthermore, to overcome the data loss caused by the conversion, we propose to compose the floating-point data operations across multiple layers (e.g. convolution, batch normalization and quantization layers) and convert the composed result into fixed-point. We name the fixed-point network obtained through such integrated conversion an Integrated Fixed-point Quantization Network (IFQ-Net). We demonstrate that our IFQ-Net gives 2.16× and 18× more savings on model size and runtime feature map memory respectively, with similar accuracy on ImageNet. Furthermore, based on YOLOv2, we design an IFQ-Tinier-YOLO face detector, a fixed-point network with a 256× reduction in model size (246k Bytes) compared to Tiny-YOLO. We illustrate the promising performance of our face detector in terms of detection rate on the Face Detection Data Set and Benchmark (FDDB) and qualitative results on detecting small faces of the Wider Face dataset.
1 Introduction
During the past decade, deep learning models have achieved great success on various machine learning tasks such as image classification, object detection, semantic segmentation, etc. However, applying them on embedded devices remains a challenging problem due to their enormous resource requirements in terms of memory and computation power. On the other hand, fixed-point data inference yields promising reductions in such requirements for embedded devices [6]. Thus, fixed-point networks are primarily preferred when deploying deep models on embedded devices.
In general, designing a fixed-point CNN can be fulfilled by two types of approaches: 1) pre-train a floating-point deep network and then convert it into a fixed-point one; 2) train a deep CNN model whose data (e.g. weights, feature maps, etc.) is natively fixed-point. In [9], a method is introduced to find the optimal bit width for each layer to convert its floating-point weights and feature maps into their fixed-point counterparts. Given the hardware acceleration for 8-bit integer based computations, [12] provides optimal thresholds which minimize the data loss during the 32-bit float to 8-bit integer conversion. These works have shown that it is feasible to significantly save memory usage through relatively low-bit (e.g. 8-bit) representations yet achieve similar performance. However, such memory savings are far from enough, especially for embedded devices. The second approach is to train a network all of whose data is natively fixed-point. Nevertheless, as discussed in [8], its training process may suffer from severely unstable weight updates because of the inaccurate gradients. Strategies such as stochastic rounding yield some improvement [3, 4, 10], but a trade-off between low-bit data representation and precise gradients still has to be made.
Alternatively, BinaryNet [2] employs binarized weights for the forward pass but full-precision weights and gradients for stable convergence. Meanwhile, its feature maps are also binarized to {−1, +1} so that its data can be represented as 1-bit fixed-point for less memory usage during inference time. However, a notable performance drop of 30% (Top-1 accuracy) is observed on ImageNet classification. Subsequently, XNOR-Net [13] employs extra scaling factors on both weights and feature maps so that their “binary” elements are generalized to {−α, +α} and {−β, +β} respectively. These extra factors enrich the data information and thus gain 16% accuracy back on ImageNet. Furthermore, HWGQ-Net [1] uses a more flexible k-bit quantization on feature maps, whose elements can be generalized to {0, β, 2β, 3β} in the case of 2-bit uniform quantization. Such k-bit feature maps (k ≥ 2) give a further 8% improvement, making HWGQ-Net the state-of-the-art quantization network on ImageNet classification.
Given an HWGQ-Net, each filter of its quantized convolution layers can be expressed as the multiplication of a floating-point scalar α and a binary fixed-point matrix B whose elements are limited to {−1, +1}. A similar representation can also be applied to its feature maps (see Equation 1). Therefore, to obtain its fixed-point counterpart, it would only be necessary to convert the floating-point α and β, while the other parts of the layer are natively fixed-point. Besides, the Batch Normalization (BN) layer, which is usually employed on top of each convolution layer, also contains floating-point parameters and thus requires fixed-point conversion (see Equation 2). One way to do this is to separately convert each of the floating-point values, but this usually results in data loss that accumulates over the network and causes a notable performance drop.
In this paper, we propose a novel fixed-point network, IFQ-Net, which is obtained by converting a floating-point quantization network into its fixed-point counterpart. As illustrated in Figure 1, we first divide the quantization network into several substructures, where each substructure is defined as a group of consecutive layers that starts with a convolution layer and ends with a quantization layer. An example of the substructures of AlexNet is listed in Table 1. Then we convert the floating-point data in each substructure into fixed-point. Especially for a “quantized substructure”, which starts with a quantized convolution layer and ends with a quantization layer, we propose to compose its floating-point data into the thresholds of the quantization layer and then convert the composition result into fixed-point. As will be presented in Section 3.2, our integrated conversion method does not cause any performance drop. At the end, we separately convert each floating-point value in the remaining non-quantized substructures (if any) to fixed-point, resulting in a fixed-point network, IFQ-Net.
In this paper, our major contributions are:

proposing IFQ-Net, a network obtained by converting a floating-point quantization network into fixed-point. Due to the relatively low bit widths of the quantization network, IFQ-Net gives much more savings on model size and runtime feature map memory.

proposing an integrated conversion method to convert the floating-point data in the quantized substructures without performance drop. Since the BN operation (if available) is integrated into the thresholds of the corresponding quantization layer, our IFQ-Net does not require an actual BN implementation on the target hardware.

designing the IFQ-Tinier-YOLO face detector, a fixed-point model with a 256× smaller model size (246k Bytes) than Tiny-YOLO.

demonstrating the feasibility of quantizing all convolution layers in the IFQ-Tinier-YOLO model, which differs from the original HWGQ-Net whose first and last layers are full-precision.
2 Quantized convolutional neural network
A CNN usually consists of a series of layers, where the convolution layers monopolize the inference time of the whole network. However, the weights and feature maps have been found to be redundant for most tasks. Consequently, enormous efforts have been devoted to quantizing the weights and/or the input feature maps into low-bit data for less memory usage and fast computation.
2.1 Quantization network inference
Embedded devices are usually employed for network inference only because of their limited computation resources. Hence, in this paper, we mainly focus on the inference process of a network. In the following, we take a typical quantized substructure from HWGQ-Net as an example to illustrate the computation details of its inference.
For the l-th convolution layer of HWGQ-Net, we use W^l ∈ ℝ^{C×K_h×K_w} and X^l ∈ ℝ^{C×H×W} to represent one of its filters and its input feature maps respectively, where C, K_h, K_w, H, W are the number of channels, the height and width of the filter, and the height and width of the input feature maps respectively. In the case of a 2-bit quantized convolution layer from HWGQ-Net, its filter is binarized into W^l ≈ α^l B^l with B^l ∈ {−1, +1}^{C×K_h×K_w}, and its input is quantized into X^l ≈ β^l Q^l with integer-valued Q^l. Then the computation of a convolution layer can be represented as
Y^l = W^l ⊛ X^l + b^l ≈ α^l β^l (B^l ⊛ Q^l) + b^l    (1)
where ⊛ represents the convolution operation; B^l and Q^l are the integer parts of the quantized filter and feature maps, so that W^l ≈ α^l B^l and X^l ≈ β^l Q^l; b^l is the learned bias and Y^l is the output feature map.
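As a sanity check, the factorization in Equation 1 can be reproduced numerically: convolving the de-quantized filter and feature map equals one purely integer convolution scaled by αβ. The sketch below uses NumPy with hypothetical shapes and values; `conv2d_valid` is an illustrative helper, not part of any framework.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 1-bit filter and 2-bit feature map, as in Equation 1:
# W ≈ alpha * B with B in {-1, +1}, X ≈ beta * Q with Q in {0, 1, 2, 3}.
alpha, beta, b = 0.37, 0.05, 0.1
B = rng.choice([-1, 1], size=(3, 3))
Q = rng.integers(0, 4, size=(5, 5))

def conv2d_valid(kernel, x):
    """Plain 'valid' 2-D cross-correlation, standing in for the ⊛ operator."""
    kh, kw = kernel.shape
    h, w = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    return np.array([[np.sum(kernel * x[i:i + kh, j:j + kw])
                      for j in range(w)] for i in range(h)])

# Convolving the de-quantized data ...
Y_float = conv2d_valid(alpha * B, beta * Q) + b
# ... equals a single integer convolution scaled by alpha * beta.
Y_int = alpha * beta * conv2d_valid(B, Q) + b
assert np.allclose(Y_float, Y_int)
```

Only the integer convolution B ⊛ Q is data-intensive; the scalars α, β and b are applied once per output element.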
Typically, a BN layer is applied on top of a convolution layer. It is computed in an element-wise manner as follows,
Ŷ^l_{ij} = (Y^l_{ij} − μ) / σ    (2)
where Y^l_{ij} is an element of Y^l, and μ and σ are the learned mean and standard deviation of the feature map.
At the end, a quantization layer maps its input feature map into discrete numbers. Taking 2-bit uniform quantization for instance, its computation can be expressed as
X^{l+1}_{ij} = β^{l+1} · q(Ŷ^l_{ij}),  where q(y) = 0 if y ≤ t1;  1 if t1 < y ≤ t2;  2 if t2 < y ≤ t3;  3 if y > t3    (3)
where t1, t2 and t3 are the thresholds used for quantizing its input Ŷ^l, and β^{l+1} is the scale factor for its output feature map. The resulting X^{l+1} is then employed as the input of the (l+1)-th convolution layer (if available).
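The 2-bit quantizer of Equation 3 is straightforward to express in code. The following sketch (NumPy; the thresholds and scale factor are made-up values, not learned ones) maps a BN output onto the four levels {0, β′, 2β′, 3β′}:

```python
import numpy as np

def uniform_quantize_2bit(y_hat, t, beta_next):
    """2-bit uniform quantization (Equation 3): map the BN output onto
    {0, beta', 2*beta', 3*beta'} using the thresholds t = (t1, t2, t3)."""
    t1, t2, t3 = t
    levels = np.select(
        [y_hat <= t1, y_hat <= t2, y_hat <= t3],  # first matching condition wins
        [0, 1, 2],
        default=3,
    )
    return beta_next * levels

# Illustrative thresholds/scale (hypothetical values, not learned ones).
y = np.array([-0.2, 0.1, 0.4, 0.9, 2.0])
print(uniform_quantize_2bit(y, (0.0, 0.3, 0.6), beta_next=0.5))
# -> [0.  0.5 1.  1.5 1.5]
```

Note that the output needs only 2 bits per element (the level index); the single scale factor β′ can be carried into the next layer.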
When a max pooling layer appears in the substructure, as discussed in [13], it is better to place it between the convolution and BN layers for richer data information. In other words,
P^l_{ij} = max_{(p,q) ∈ Ω_{ij}} Y^l_{pq}    (4)
where Ω_{ij} denotes the local zone employed for the pooling operation at location (i, j) of Y^l. Then the input of the BN layer in Equation 2 is accordingly changed to P^l.
2.2 Separated fixedpoint conversion
As illustrated in subsection 2.1, the dominating part of the convolution computation can be implemented with native fixed-point data only. However, the network still contains lots of floating-point data: the scaling factors α^l, β^l and the bias b^l in the convolution layer, μ and σ in the BN layer, and t1, t2, t3 and β^{l+1} in the quantization layer. Consequently, it is necessary to convert them into fixed-point when designing fixed-point networks for embedded devices.
A traditional way to perform the aforementioned conversion is to process each value separately. As shown in Figure 2a, each floating-point value of the substructure is converted into its fixed-point counterpart. Since directly applying a simple conversion causes significant data loss, especially when the value is small (e.g. 0.001), we use a relatively large scaling factor λ to scale up the floating-point data (for fast calculation, λ is usually set to 2^n so that the multiplication can be implemented by a simple left bit shift). For example, α^l can be transformed by α̂^l = ⌊λα^l⌋, where ⌊·⌋ denotes the flooring operation. At the end, the result has to be divided by λ to achieve “equivalent” outputs. The fixed-point conversion of a quantized convolution layer can then be expressed as
Y^l ≈ ( ⌊λα^l⌋ ⌊λβ^l⌋ (B^l ⊛ Q^l) + ⌊λ² b^l⌋ ) / λ²    (5)
To obtain a substructure with fixed-point data only, the same conversion is also applied to μ, σ, t1, t2, t3 and β^{l+1} separately.
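The effect of the scaling factor λ = 2^n in this separated conversion can be illustrated with a small experiment; the value of α below is hypothetical:

```python
import numpy as np

def to_fixed(x, n):
    """Separated conversion: scale a float by lambda = 2**n and floor it,
    as in alpha_hat = floor(lambda * alpha)."""
    return int(np.floor(x * (1 << n)))

alpha = 0.0137  # hypothetical small scaling factor
for n in (4, 8, 16):
    lam = 1 << n
    recovered = to_fixed(alpha, n) / lam
    print(f"n={n:2d}  alpha_hat={to_fixed(alpha, n):4d}  "
          f"rel. error={abs(recovered - alpha) / alpha:.4f}")
# A small lambda (n=4) floors alpha all the way to 0 (total data loss),
# while n=16 keeps the relative error below 0.1%.
```

This is exactly the trade-off described above: a larger λ preserves precision but enlarges the fixed-point values that must be stored and multiplied.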
3 IFQ-Net methodology
To obtain a fixed-point network for embedded devices, we propose to first train a quantization network and then convert its floating-point data, which has been quantized into extremely low bit widths (e.g. 1 or 2 bits), into fixed-point. As demonstrated in Figure 1, our methodology consists of two steps: first we divide a trained floating-point quantization network into substructures, and then we convert each substructure into its fixed-point counterpart. We employ the HWGQ-Net algorithm to train the floating-point quantization network.
3.1 Substructure division
As mentioned in Section 1, a substructure is defined as a group of consecutive layers that starts with a convolution layer and ends with a quantization layer. Given a quantization network, we search for the quantized substructures in the network as demonstrated in Figure 3. Typically, the architecture of a quantized substructure is either {convolution, BN, quantization} or {convolution, pooling, BN, quantization}. Substructures that contain more than one convolution or quantization layer are not considered quantized substructures. The layers between quantized substructures are defined as non-quantized substructures, which will be treated differently during fixed-point conversion. Generally, BN and/or max pooling layers are placed between the convolution and quantization layers.
Taking the AlexNet-HWGQ network as an example, we divide it into 7 substructures (see Table 1). Because the HWGQ network keeps its first and last layers full-precision, the corresponding substructures (S1 and S7) are non-quantized and thus will be converted differently. Please note that we group all the layers above the last quantization layer as one single substructure.
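The substructure search can be sketched as a single pass over the layer sequence: every group ends at a quantization layer, and whatever remains above the last one forms the final (non-quantized) substructure. The layer names below are illustrative, with a hypothetical `_q` suffix marking binarized weights:

```python
def divide_into_substructures(layers):
    """Group consecutive layers: each substructure ends at a 'Quant' layer;
    the layers above the last 'Quant' form one final substructure."""
    subs, current = [], []
    for name in layers:
        current.append(name)
        if name == "Quant":
            subs.append(current)
            current = []
    if current:  # remaining layers above the last quantization layer
        subs.append(current)
    return subs

# Hypothetical (shortened) AlexNet-HWGQ layer sequence, cf. Table 1.
alexnet = ["Conv", "Pool", "BN", "Quant",
           "Conv_q", "Pool", "BN", "Quant",
           "Conv_q", "BN", "Quant",
           "FC_q", "BN", "ReLU", "FC"]

for i, sub in enumerate(divide_into_substructures(alexnet), start=1):
    quantized = sub[0].endswith("_q") and sub[-1] == "Quant"
    print(f"S{i}: {sub}  quantized={quantized}")
```

Substructures that start with a quantized layer and end with a quantization layer are then eligible for the integrated conversion of Section 3.2; the others fall back to the separated conversion.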
3.2 Integrated fixedpoint conversion
A trained quantization network can be divided into substructures that contain lots of floating-point data. To obtain a fixed-point network, it is necessary to convert each of its floating-point substructures into fixed-point. However, converting the floating-point data in a separated manner usually leads to a performance drop. Consequently, in the following, we introduce an integrated way to convert a floating-point substructure. Taking a 2-bit uniformly quantized substructure from HWGQ-Net as an example, the computations mentioned in Equations 1, 2 and 3 can be composed as follows:

X^{l+1}_{ij} = β^{l+1} · q( (α^l β^l (B^l ⊛ Q^l)_{ij} + b^l − μ) / σ )    (6)

Since α^l β^l / σ > 0, thresholding the BN output against t1, t2, t3 is equivalent to thresholding the integer convolution output (B^l ⊛ Q^l)_{ij} directly against the newly formed thresholds

T_k = (σ t_k + μ − b^l) / (α^l β^l),  k = 1, 2, 3.    (7)
As illustrated in Equation 7, all the floating-point data of a quantized substructure is composed into the newly formed thresholds (T1, T2 and T3). Such a composition is performed with floating-point data and thus does not impact the output result.
The next step is to convert the new thresholds into fixed-point data. B^l and Q^l are both integer-valued, thus the resulting B^l ⊛ Q^l is also integer-valued. In Equation 7, when thresholding these integers with the newly formed floating-point thresholds, their fractional parts theoretically do not affect the result. Consequently, we can simply discard the fractional parts by applying the floor function to the new thresholds. Compared to the separated fixed-point conversion method, our method does not require scaling up the floating-point data with λ, yet gives identical quantization results. Besides, the remaining floating-point scale factor β^{l+1} can be processed in the next substructure, just like the way we deal with the β^l of Equation 1. Consequently, all the computations of a quantized substructure, as represented in Equation 7, can be cast onto fixed-point data after ⌊·⌋ is applied to each of the new thresholds.
A fixed-point implementation of on-device BN computation is challenging for embedded devices. As an alternative solution, which differs from methods that merge BN into a convolution layer, the proposed integrated fixed-point conversion transforms the BN computation into the new quantization thresholds. Consequently, our IFQ-Net does not require an actual BN implementation on embedded hardware.
In the above, we have taken k = 2 as an example of converting a k-bit quantization network. However, when a larger k is employed, it would be necessary to store 2^k − 1 thresholds. In the uniform quantization scenario, the network’s thresholds can be expressed as t_i = t_1 + (i − 1)Δ. Thus, one may only need to store t_1 and Δ because all the thresholds can be restored from them. Similarly, denoting T_1 = (σ t_1 + μ − b^l)/(α^l β^l) and Δ̂ = σΔ/(α^l β^l), our newly formed thresholds can also be represented as T_i = T_1 + (i − 1)Δ̂. Thus, our new thresholds can also be stored in an efficient way. Then the computations in a k-bit uniformly quantized substructure can be expressed as Equation 8, which can be further converted into fixed-point in an integrated manner.
X^{l+1}_{ij} = β^{l+1} · min( max( ⌈((B^l ⊛ Q^l)_{ij} − T_1) / Δ̂⌉, 0 ), 2^k − 1 )    (8)
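The k-bit uniform rule can be verified numerically: storing only the first threshold and the spacing, a clipped ceiling recovers exactly the same level index as explicit thresholding against all 2^k − 1 thresholds. All values below are hypothetical:

```python
import numpy as np

k = 3                          # bits; 2**k - 1 = 7 thresholds
T1, delta = 5.0, 12.0          # hypothetical first threshold and spacing
T = T1 + delta * np.arange(2**k - 1)   # T_i = T_1 + (i - 1) * delta

x = np.arange(-10, 120)        # integer convolution outputs

# Reference: count how many thresholds each value exceeds.
levels_ref = (x[:, None] > T).sum(axis=1)

# Closed form: a clipped ceiling recovers the same level index
# from T_1 and delta alone.
levels = np.clip(np.ceil((x - T1) / delta), 0, 2**k - 1).astype(int)

assert np.array_equal(levels_ref, levels)
```

This is why only two fixed-point values per substructure need to be stored regardless of k.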
In summary, we presented IFQ-Net, obtained by dividing a quantization network (e.g. HWGQ-Net) into floating-point substructures and then converting each of them into fixed-point. For the quantized substructures, we propose an integrated fixed-point conversion method which gives no performance drop. At the end, for the remaining non-quantized substructures (if any), we employ the separated method to convert them into fixed-point.
It is worth pointing out that our IFQ-Net differs from the floating-point data composition method presented in [17] in many aspects: 1) the paper claims that it combines multiple layers but does not explicitly explain how; 2) the paper applies the floating-point data composition to enable binary convolution computation, leaving other parameters as floating-point, while our method is proposed for fixed-point conversion; and 3) the paper concentrates on implementing a quantized network on FPGA, but the performance (e.g. detection rate, mAP, etc.) is not reported.
4 Experimental results
In this section, we demonstrate how we convert each substructure of AlexNet into fixed-point to obtain an IFQ-AlexNet. We first test the performance of the integrated conversion method for the quantized substructures. Then, for the non-quantized substructures, we demonstrate how we experimentally set the scaling factor λ for the separated fixed-point conversion. We compare the performance of our IFQ-AlexNet with “Lin et al. [9]”, the state-of-the-art AlexNet-based fixed-point network on ImageNet. Furthermore, we also illustrate the performance of the IFQ-Tinier-YOLO face detector, an extremely compact fixed-point network, on both the FDDB and Wider Face datasets.
4.1 IFQ-AlexNet network
To obtain fixed-point networks, we first train floating-point quantization networks, AlexNet-HWGQ, whose weights and feature maps are quantized into 1 bit and k bits (k = 2, 3, 4) respectively. The AlexNet-HWGQ is trained for 320k iterations on ImageNet with the batch size set to 256. The initial learning rate is set to 0.1 and decreased by a factor of 0.1 every 35k iterations. We inherit the other training settings from [1] and achieve similar performance.
Table 1: Substructures of the AlexNet-HWGQ network.

       S1      S2      …   S7
       Conv    Conv^q  …   FC^q
       Pool    Pool        BN
       BN      BN          ReLU
       Quant   Quant       FC
As the first step in obtaining the IFQ-AlexNet, we divide a floating-point AlexNet-HWGQ network into 7 substructures (Table 1). In the table, the superscript q in Conv^q and FC^q means that the layer’s weights are binarized (1 bit) and its input feature maps are quantized into k bits by its bottom Quant layer. We group the layers {FC^q, BN, ReLU, FC} as a single non-quantized substructure.
In the following, we show how to convert each substructure into fixed-point to obtain an IFQ-AlexNet. In subsection 4.1.1, we show the performance of the proposed integrated conversion method for the quantized substructures while the non-quantized substructures are kept floating-point. We then illustrate the way to set a proper scaling factor for converting each of the remaining non-quantized substructures (see subsection 4.1.2). At the end, in subsection 4.1.3, we compare our IFQ-AlexNet with “Lin et al. [9]”, the state-of-the-art AlexNet-based fixed-point network.
4.1.1 Integrated conversion for the quantized substructures
In this subsection, we focus on converting the quantized substructures (S2, …, S6). The ImageNet Top-1 classification accuracy is employed to evaluate the accuracy of the converted networks (see Table 2). In the table, “separated” refers to the networks obtained by converting the quantized substructures of the corresponding AlexNet-HWGQ (k = 2, 3 or 4) in a separated manner (see section 2.2). In contrast, “integrated” represents the networks obtained by converting the quantized substructures in the proposed integrated way (see section 3.2). Please note that, to compare the performance of the different conversion methods on the quantized substructures, we keep the non-quantized substructures (S1 and S7) floating-point.
Table 2: ImageNet Top-1 accuracy of the compared conversion methods; each “separated” row uses a different scaling factor λ.

                      k = 2   k = 3   k = 4
AlexNet-HWGQ          0.5214  0.5301  0.5471
separated (λ = …)     0.5206  0.5296  0.5470
separated (λ = …)     0.5168  0.5292  0.5443
separated (λ = …)     0.5073  0.5230  0.5385
separated (λ = …)     0.4585  0.4678  0.5105
integrated (ours)     0.5214  0.5301  0.5471
As shown in Table 2, the floating-point AlexNet-HWGQ networks achieve competitive classification accuracy. However, the “separated” method shows notable performance degradation. The reason is that it separately converts each floating-point value of a quantized substructure by ⌊λ·⌋, which leads to data loss. To reduce such loss, a large λ has to be applied, which in turn causes more memory usage. In contrast, for each quantized substructure, our “integrated” method gives identical outputs to its floating-point counterpart in AlexNet-HWGQ, while the scaling factor is not required at all (equivalently, λ = 1). Even though we employ uniform quantization as an example, our “integrated” method is also effective for networks quantized by other strategies as long as their floating-point operations can be composed as in Equation 7.
4.1.2 Separated conversion for the nonquantized substructures
In subsection 4.1.1, we demonstrated that the proposed integrated method gives a lossless fixed-point conversion for the quantized substructures. To obtain an IFQ-AlexNet all of whose data operations are fixed-point based, we then convert each of the remaining non-quantized substructures in a “separated” manner. To save more memory while causing less conversion loss for such substructures, an optimal λ is required for each non-quantized substructure: S1 and S7. Since S7 directly outputs the inferred results for the task, the preciseness of its computation is more critical. Consequently, we first find the optimal λ for its fixed-point conversion while S1 is kept floating-point.
Figure 4: ImageNet accuracy vs. the scaling factor λ: a) converting S7; b) converting S1.
As demonstrated in Figure 4a), for the networks with different k, a larger λ for S7 generally gives better performance. This is because a larger λ gives less data loss during each fixed-point conversion ⌊λ·⌋. Nevertheless, beyond a certain λ, no further performance improvement can be observed for any of the three networks, indicating that such a λ is sufficient for the fixed-point conversion of the floating-point data in S7.
Fixing that λ for converting S7 into fixed-point, we then optimize the λ for S1. As shown in Figure 4b), a moderate λ can likewise be considered sufficient for the fixed-point conversion of S1.
In summary, to obtain an IFQ-AlexNet, we employ the lossless “integrated” conversion method for the quantized substructures, and the “separated” method with the scaling factors selected above for converting the S7 and S1 of the AlexNet-HWGQ networks respectively.
4.1.3 Performance comparison
In the following, we compare our IFQ-AlexNet with “Lin et al. [9]”, the state-of-the-art AlexNet-based fixed-point network. Lin et al. [9] employ bw (bw = 9 or 10) as the number of bits for representing each data element of the first layer and then introduce an optimal setting of the number of bits for the other layers with respect to bw (see Table 3). It is worth pointing out that “Lin et al. [9]” is converted from an AlexNet-like network which possesses 2.7× fewer parameters than our IFQ-AlexNet (21.5 million vs. 58.3 million; to be consistent with the reference paper [9], the parameters of the last FC layer are not included).
Table 3 compares the number of bits employed to represent every fixed-point data element in each layer of “Lin et al. [9]” and our IFQ-AlexNet. As shown in the table, for the Conv2–Conv5 layers, IFQ-AlexNet employs 1 bit for representing their weights, which is remarkably lower than “Lin et al. [9]” (4 or 5 bits). Most importantly, for the FC6 and FC7 layers, which are parameter-intensive and thus dominate the model size, we also employ 1-bit weights. Thus, our IFQ-Net gives 6× savings (1 bit vs. 6 bits). On the other hand, regarding the feature maps, our IFQ-AlexNet networks also generally use fewer bits than their competitors (the same number of bits can occur only when k = 4, on the layers whose feature maps use 4 bits).
Table 3: Number of bits used for the fixed-point data of each layer.

         Lin et al. [9] (bw = 9)        IFQ-AlexNet
         Weights    Feature maps        Weights    Feature maps
Conv1    9          8
Conv2    5          5                   1          k
Conv3    4          4                   1          k
Conv4    5          5                   1          k
Conv5    4          4                   1          k
FC6      6          6                   1          k
FC7      6          6                   1          k
For the “Lin et al. [9]” networks, different bw values give different preciseness of the fixed-point data. We directly borrow the experimental results from the paper, with bw set to 9 and 10. Table 4 illustrates the memory usage of the weights and feature maps of the fixed-point networks in terms of millions of bits (Mbits). As shown in the table, regarding model size, our IFQ-AlexNet networks (k = 2 or 4) give 2.16× savings (58.8 Mbits vs. 127.3 Mbits) over “Lin et al. [9] (bw = 9)”.
Table 4: Memory usage (Mbits) and ImageNet Top-5 accuracy of the compared fixed-point networks.

                                  Lin et al. [9]      IFQ-AlexNet
                                  bw = 9   bw = 10    k = 2   k = 4
Model size                        127.3    128.5      58.8    58.8
Inference memory (feature maps)   10.8     12.0       0.6     1.1
Top-5 accuracy                    0.74     0.78       0.76    0.78
To evaluate the memory usage of feature maps during inference time, we measure the maximum memory consumed by any single layer of AlexNet. Such an evaluation makes more sense than summing over all layers because the feature maps of unconnected layers are not required and thus can be discarded during inference time. Compared with “Lin et al. [9]”, our IFQ-AlexNet networks output 4× smaller feature maps for that layer. Furthermore, our IFQ-AlexNet employs fewer bits to represent each element of those feature maps (k bits vs. 8 bits). Consequently, when comparing IFQ-AlexNet (k = 2) with “Lin et al. [9]”, our method gives 18× savings on inference memory for feature maps.
Furthermore, we follow the reference paper [9] and use Top-5 accuracy to evaluate the performance of the AlexNet-based fixed-point networks. Compared with “Lin et al. [9] (bw = 9)”, IFQ-AlexNet (k = 2) gives a 2% accuracy improvement with significant savings on model size and feature map memory as well. Moreover, comparing the higher-precision “Lin et al. [9] (bw = 10)” and IFQ-AlexNet (k = 4) networks, our method also achieves 2.18× and 10.9× savings on model size and feature maps respectively, without performance drop.
4.2 IFQ-Tinier-YOLO face detector
Face detection has various applications in real life and has thus motivated many algorithms such as Faster R-CNN [16], SSD [11], Mask R-CNN [5] and YOLOv2 [15]. In this section, we aim to apply our IFQ-Net to the face detection task. For embedded devices, a simple architecture of the deployed network greatly benefits the hardware design. Consequently, we make use of the YOLOv2 detection algorithm as the framework for our face detector.
We initially employ the Tiny-YOLO [15] network due to its compact size. Furthermore, we design a more compact network, Tinier-YOLO, based on Tiny-YOLO by: 1) using only half the number of filters in each convolution layer; 2) replacing the 3×3 filters with 1×1 filters for the third to the last convolution layer; 3) binarizing the weights of all convolution layers with HWGQ. The above three strategies give 4×, 2× and 32× savings respectively, and overall 256× savings on model size, resulting in a 246k Bytes face detector.
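The claimed savings multiply out as follows (a back-of-the-envelope check; the per-strategy factors are those stated above):

```python
# Halving the filters shrinks both input and output channels of every
# convolution; binarization replaces 32-bit float weights with 1 bit.
half_filters = 4        # (1/2 in-channels) x (1/2 out-channels)
smaller_kernels = 2     # overall effect of the filter replacement
binarization = 32       # 32-bit float -> 1-bit weight

total = half_filters * smaller_kernels * binarization
print(total)            # -> 256

tiny_yolo_mb = 63.0     # Tiny-YOLO model size from Table 5
print(round(tiny_yolo_mb / total * 1000))   # -> 246 (k Bytes, approx.)
```

The result matches the reported 246k Bytes model size of IFQ-Tinier-YOLO.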
Table 5: Model size and detection rate (FDDB) of the compared face detectors.

                   Tiny-YOLO   IFQ-Tiny-YOLO (k=2)   Tinier-YOLO   IFQ-Tinier-YOLO (k=2)
Model size (MB)    63.00       1.97                  7.89          0.25
Detection rate     0.92        0.89                  0.90          0.84
We use the training set of Wider Face [18] and the Darknet deep learning framework [14] to train the baseline Tiny-YOLO and our Tinier-YOLO networks. Furthermore, to obtain their quantized fixed-point counterparts, IFQ-Tiny-YOLO and IFQ-Tinier-YOLO, we first train the quantization networks with our HWGQ implementation on Darknet (k = 2) and then convert each of their substructures into fixed-point. Each network is trained for 100k iterations with batch size 128. The learning rate is initially set to 0.01 and scaled down by a factor of 0.1 three times during training. Besides, we also inherit the multi-scale training strategy from YOLOv2.
We compare the trained face detectors on the FDDB dataset [7], which contains 5,171 faces in 2,845 testing images. To evaluate the performance of the face detectors, we employ the detection rate when the false positive rate is 0.1 (1 false positive per 10 test images). This corresponds to the true positive rate (y-axis) when the number of false positives (x-axis) equals 2,845 × 0.1 ≈ 284 in Figure 5. Such an evaluation is more meaningful for real applications, where a low false positive rate is desired. As illustrated in Table 5, compared with Tiny-YOLO, IFQ-Tiny-YOLO achieves 32× savings on model size with a 3% drop in detection rate (0.89 vs. 0.92). Furthermore, the proposed IFQ-Tinier-YOLO face detector gives a further 8× saving over IFQ-Tiny-YOLO with a 5% performance drop. We consider its performance promising given its extremely compact model size and quite satisfactory detection rate. More importantly, the proposed IFQ-Tinier-YOLO face detector is a fixed-point network which can be easily implemented on embedded devices. The ROC curves of the compared face detectors are illustrated in Figure 5.
Moreover, the proposed IFQ-Tinier-YOLO is also effective at detecting small faces. We test it on Wider Face validation images and show its qualitative results. As shown in Figure 6, our IFQ-Tinier-YOLO gives nice detections of small faces in various challenging scenarios such as make-up, out-of-focus, low illumination, paintings, etc.
5 Conclusions
In this paper, we presented a novel fixed-point network, IFQ-Net, for embedded vision. It divides a quantization network into substructures and then converts each substructure into fixed-point in either the separated or the proposed integrated manner. Especially for the quantized substructures, which commonly appear in quantization networks, the integrated conversion method removes the on-device batch normalization computation, requires no scaling-up effect (λ = 1) and, most importantly, does not cause any performance drop. We compared our IFQ-Net with the state-of-the-art fixed-point network, showing that our method gives much more savings on model size and feature map memory with similar (or in some cases higher) accuracy on ImageNet.
Furthermore, we also designed a fixed-point face detector, IFQ-Tinier-YOLO. Compared with the Tiny-YOLO detector, our model shows great benefits for embedded devices in the sense of its extremely compact model size (246k Bytes), purely fixed-point data operations and quite satisfactory detection rate.
References
 [1] Z. Cai, X. He, J. Sun, and N. Vasconcelos. Deep learning with low precision by half-wave Gaussian quantization. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, pages 5406–5414, 2017.
 [2] M. Courbariaux, Y. Bengio, and J.-P. David. BinaryConnect: Training deep neural networks with binary weights during propagations. In Annual Conference on Neural Information Processing Systems, NIPS 2015, pages 3123–3131, 2015.
 [3] S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan. Deep learning with limited numerical precision. In International Conference on Machine Learning, ICML 2015, pages 1737–1746, 2015.
 [4] P. Gysel, M. Motamedi, and S. Ghiasi. Hardware-oriented approximation of convolutional neural networks. CoRR, abs/1604.03168, 2016.
 [5] K. He, G. Gkioxari, P. Dollár, and R. B. Girshick. Mask RCNN. In IEEE International Conference on Computer Vision, ICCV 2017, pages 2980–2988, 2017.
 [6] M. Horowitz. Computing’s energy problem (and what we can do about it). In IEEE International Solid-State Circuits Conference Digest of Technical Papers, pages 10–14, 2014.
 [7] V. Jain and E. Learned-Miller. FDDB: A benchmark for face detection in unconstrained settings. Technical Report UM-CS-2010-009, University of Massachusetts, Amherst, 2010.
 [8] D. D. Lin and S. S. Talathi. Overcoming challenges in fixed point training of deep convolutional networks. CoRR, abs/1607.02241, 2016.
 [9] D. D. Lin, S. S. Talathi, and V. S. Annapureddy. Fixed point quantization of deep convolutional networks. In International Conference on Machine Learning, ICML 2016, pages 2849–2858, 2016.
 [10] Z. Lin, M. Courbariaux, R. Memisevic, and Y. Bengio. Neural networks with few multiplications. CoRR, abs/1510.03009, 2015.
 [11] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. E. Reed, C. Fu, and A. C. Berg. SSD: Single shot multibox detector. In European Conference on Computer Vision, ECCV 2016, pages 21–37, 2016.
 [12] S. Migacz. 8-bit inference with TensorRT. http://on-demand.gputechconf.com/gtc/2017/presentation/s7310-8-bit-inference-with-tensorrt.pdf, May 2017.
 [13] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi. XNOR-Net: ImageNet classification using binary convolutional neural networks. In European Conference on Computer Vision, ECCV 2016, pages 525–542, 2016.
 [14] J. Redmon. Darknet: Open source neural networks in C. http://pjreddie.com/darknet/, 2013–2016.
 [15] J. Redmon and A. Farhadi. YOLO9000: Better, faster, stronger. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, pages 6517–6525, 2017.
 [16] S. Ren, K. He, R. B. Girshick, and J. Sun. Faster RCNN: towards realtime object detection with region proposal networks. IEEE Trans. Pattern Analysis and Machine Intelligence, 39(6):1137–1149, 2017.
 [17] F. Shafiq, T. Yamada, A. T. Vilchez, and S. Dasgupta. Automated flow for compressing convolution neural networks for efficient edgecomputation with FPGA. ArXiv eprints, 2017.
 [18] S. Yang, P. Luo, C. C. Loy, and X. Tang. WIDER FACE: A face detection benchmark. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, 2016.