DEEPEYE: A Compact and Accurate
Video Comprehension at Terminal Devices
Compressed with Quantization and Tensorization
Abstract
Because a huge number of parameters is required for high-dimensional inputs in video detection and classification, developing a compact yet accurate video-comprehension system for terminal devices is a grand challenge. Current works optimize video detection and classification separately. In this paper, we introduce DEEPEYE, a video-comprehension (object detection and action recognition) system for terminal devices. Based on You Only Look Once (YOLO), we develop an 8-bit quantization method applied while training YOLO, as well as a tensorized-compression method for a Recurrent Neural Network (RNN) built on features extracted from YOLO. The developed quantization and tensorization significantly compress the original network model while maintaining accuracy. Using the challenging video datasets MOMENTS and UCF11 as benchmarks, the results show that the proposed DEEPEYE achieves a high model-compression rate with only a marginal mAP decrease, together with a large parameter reduction and speedup alongside an accuracy improvement.
Yuan Cheng (Shanghai Jiao Tong University, cyuan328@sjtu.edu.cn), Guangya Li (South University of Science and Technology, 11749189@mail.sustc.edu.cn), Hai-Bao Chen (Shanghai Jiao Tong University, haibaochen@sjtu.edu.cn), Sheldon X.-D. Tan (University of California, Riverside, stan@ece.ucr.edu), Hao Yu (South University of Science and Technology, yuh3@sustc.edu.cn)
Preprint. Work in progress.
1 Introduction
The success of convolutional neural networks (CNNs) has produced a potentially general feature-extraction engine for various computer-vision applications [1, 2]. However, applications such as Advanced Driver Assistance Systems (ADAS) require real-time processing capability at terminal devices. Network-model compression is thereby essential to produce a simplified model that considers both compactness and accuracy.
For example, a YOLOv3 [3] network contains a large number of convolution layers, which dominate the network complexity. As most convolution filters are now small-sized operators, network pruning [4] may not be well suited for this type of network. Direct quantization [5], however, needs additional training to maintain accuracy. Applying quantization (such as binarization) during training [6, 7] has shown promising deep-learning implementations with significant network reduction yet maintained accuracy. But no reported work applies a trained quantization method to a large-scale network such as YOLO with good accuracy.
Moreover, YOLO [8, 9] was originally designed for object detection in images. It is unclear how to extend it to video-data analysis such as object detection plus action recognition. Recurrent Neural Networks (RNNs) can be applied to sequence-to-sequence modeling, with great achievements reported when exploiting RNNs for video data [10, 11, 12]. However, the high-dimensional inputs of video data, which make the weight matrix mapping from the input to the hidden layer extremely large, hinder RNN's application. Recent works [13, 14, 15] utilize CNNs to pre-process all video frames, which may suffer from suboptimal weight parameters by not being trained end-to-end. Other works [16, 17] try to reduce the sequence length of the RNN, which neglects the RNN's capability to handle sequences of variable length; as such, they cannot scale to larger and more realistic video data. The approaches in [18, 19] compress RNNs with tensorization using the original frame inputs, which results in limited accuracy as well as scalability.
In this paper, we develop an RNN framework that uses the features extracted from YOLO to analyze video data. Towards applications on terminal devices, we further develop an 8-bit quantization of YOLO as well as a tensorized compression of the RNN. The developed quantization and tensorization significantly compress the original network model while maintaining accuracy. Moreover, the two optimized networks are integrated into one video-comprehension system, shown in Fig. 1. Experimental results on several benchmarks show that the proposed framework, called DEEPEYE, achieves high compression with only a marginal mAP decrease, together with a large parameter reduction and speedup alongside an accuracy improvement.
The rest of the paper is organized as follows. In Section 2 we introduce the basics of YOLO and its quantized variant for real-time video object detection. In Section 3 we first introduce the tensor-decomposition model and then provide a detailed derivation of our proposed tensorized RNN. In Section 4 we integrate the quantized YOLO with the tensorized RNN into a new video-comprehension framework, called DEEPEYE. In Section 5 we present our experimental results on several large-scale video datasets. Finally, Section 6 summarizes our current contribution and provides an outlook on future work.
2 YOLO with Quantization
The proposed video object detection structure is based on YOLO, a frame-level object-detection system that uses a single convolutional neural network to predict the probabilities of several classes. In this section, we first introduce the basics of YOLO, and then apply 8-bit quantization to obtain a real-time, highly compressed video object-detection structure with promising efficiency and accuracy.
2.1 Basics of YOLO
YOLO reframes object detection as a single regression problem, straight from the image pixels of each frame to bounding-box coordinates and class probabilities. A convolutional network simultaneously predicts multiple bounding boxes and class probabilities for those boxes. YOLO has several benefits over traditional object-detection methods since it trains on full images and directly optimizes detection performance [9].
As shown in Fig. 2, it consists of the feature-extraction layers and the localization-and-classification layers, based on a fully convolutional network (FCN) structure. Our system adopts the method of dividing the input image into an $S \times S$ grid [8]. Each grid cell detects whether an object is present; the $B$ bounding-box predictions and confidence scores are produced by the proposed FCN, which will be quantized in Section 2.2. Confidence is defined as $\Pr(\mathrm{Object}) \times \mathrm{IOU}$ [8, 3], which reflects how confident the model is that the bounding box contains an object. Here, the intersection over union (IOU) is calculated between the predicted mask and the ground truth. When evaluating YOLO on VOC [20] with $C$ object classes, the feature output of the final convolutional layer turns out to be an $S \times S \times B(5 + C)$ tensor.
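The IOU term in the confidence score can be illustrated with a minimal sketch for axis-aligned boxes in corner format `(x1, y1, x2, y2)`; the box representation here is an assumption for illustration, not the paper's internal encoding:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    # Corners of the intersection rectangle.
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    # Clamp to zero when the boxes do not overlap.
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)
```

A perfect prediction gives an IOU of 1, disjoint boxes give 0, and the confidence score scales the objectness probability by this overlap quality.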
2.2 8bitquantized YOLO
A direct YOLO implementation for video-scale data would require large and unnecessary resources in both software and hardware. Previous works [21, 22, 23] suggest training neural networks under quantization constraints. In this section, we discuss how to generate a YOLO model with 8-bit quantization (namely Q-YOLO).
Convolution is the core operation of YOLO and other CNN-based networks. Following recent works [21, 23], we present a low-bitwidth convolution with 8-bit quantized values for the weights in order to avoid an accuracy drop and also improve performance. Assume that $W$ are the full-precision weights and $W^{q}$ are the 8-bit quantized weights, with the approximation $W \approx \alpha W^{q}$ for a non-negative scaling factor $\alpha$. The weights are quantized to 8 bits as follows:
$W^{q} = \left\lfloor W/\alpha \right\rfloor, \qquad \alpha = \max(|W|)/(2^{7}-1) \qquad (1)$
where $\lfloor\cdot\rfloor$ takes the nearest smaller integer.
We also develop a quantized activation, which quantizes real-valued feature maps $X$ to 8-bit feature maps $X^{q}$. This strategy is defined as:
$X^{q} = \dfrac{1}{2^{8}-1}\left\lfloor (2^{8}-1)\cdot \mathrm{clip}(X, 0, 1) \right\rfloor \qquad (2)$
The detailed distributions of the 8-bit weights and feature maps are presented in Fig. 3. Having both the quantized weights and feature maps, we obtain the quantized convolution as:
$\mathrm{conv}(W, X) \approx \alpha \cdot \mathrm{conv}(W^{q}, X^{q}) \qquad (3)$
where $W^{q}$ and $X^{q}$ are the 8-bit quantized weights and feature maps, respectively. Since the elements of the weights and feature maps can be computed and stored in 8 bits, both the processor and memory resources required for the quantized convolutional layers can be greatly reduced.
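A minimal numpy sketch of this quantization step follows. It assumes a symmetric uniform weight quantizer and a DoReFa-style activation quantizer over $[0, 1]$ (the exact quantizer is whatever Eqs. (1)-(2) specify in the paper's configuration; the function names and bit handling here are illustrative):

```python
import numpy as np

def quantize_weights(w, bits=8):
    """Map full-precision weights onto signed 8-bit integer codes plus a scale."""
    qmax = 2 ** (bits - 1) - 1               # 127 for 8 bits
    alpha = np.abs(w).max() / qmax           # non-negative scaling factor
    wq = np.clip(np.floor(w / alpha), -qmax - 1, qmax)
    return wq.astype(np.int8), alpha         # W is approximated by alpha * wq

def quantize_activations(x, bits=8):
    """Clip feature maps to [0, 1] and snap them onto 2^bits - 1 uniform levels."""
    levels = 2 ** bits - 1
    return np.floor(levels * np.clip(x, 0.0, 1.0)) / levels

rng = np.random.default_rng(0)
w = rng.standard_normal((16, 27)).astype(np.float32)
wq, alpha = quantize_weights(w)
# Dequantized weights lie within one quantization step of the originals.
assert np.abs(w - alpha * wq.astype(np.float32)).max() <= alpha
```

The integer codes and the single scalar `alpha` are all that need to be stored per layer, which is where the memory saving over 32-bit floats comes from.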
The overall working flow of the Q-YOLO model is presented in Fig. 4; the network is assumed to have a feed-forward linear topology. Observe that all the expensive operations in the convolutional layers operate on 8-bit quantized values. The batch-normalization layers and max-pooling layers are also quantized to 8 bits.
3 RNN with Tensorization
Previous neural-network compression for RNNs has been performed by either precision-bit truncation or low-rank approximation [24, 25, 26], which cannot maintain a good balance between network compression and accuracy. In this section, we discuss a tensorization-based RNN trained directly in the compressed form. The tensor-decomposition method is introduced first, and then a tensorized RNN (namely T-RNN) is discussed as an extension of the general neural network.
3.1 Tensor Decomposition
Tensors are the natural multi-dimensional generalization of matrices, and the tensor-train factorization [19, 27] is a promising tensor-decomposition model that can scale to an arbitrary number of dimensions. We refer to one-dimensional arrays as vectors, denoted $v$; two-dimensional arrays as matrices, denoted $V$; and higher-dimensional arrays as tensors, denoted with calligraphic upper-case letters as $\mathcal{V} \in \mathbb{R}^{m_{1}\times m_{2}\times\cdots\times m_{d}}$ (one specific element is referenced as $\mathcal{V}(l_{1},\ldots,l_{d})$), where $d$ is the dimensionality of the tensor.
The $d$-dimensional tensor $\mathcal{V}$ can be decomposed using tensor cores, and each element is defined as:
$\mathcal{V}(l_{1},l_{2},\ldots,l_{d})=\sum_{r_{0},r_{1},\ldots,r_{d}}\mathcal{G}_{1}(r_{0},l_{1},r_{1})\,\mathcal{G}_{2}(r_{1},l_{2},r_{2})\cdots\mathcal{G}_{d}(r_{d-1},l_{d},r_{d}) \qquad (4)$
where $r_{k}$ is the summation index, which starts from 1 and stops at rank $R_{k}$. Note the boundary condition $R_{0} = R_{d} = 1$, and that $m_{1},\ldots,m_{d}$ are known as mode sizes. Here, $R_{k}$ is the core rank and $\mathcal{G}_{k} \in \mathbb{R}^{R_{k-1}\times m_{k}\times R_{k}}$ is a core of this tensor decomposition. Using the notation $\mathcal{G}_{k}[l_{k}]$ (a 2-dimensional slice from the 3-dimensional tensor $\mathcal{G}_{k}$), we can rewrite the above equation more compactly:
$\mathcal{V}(l_{1},l_{2},\ldots,l_{d})=\mathcal{G}_{1}[l_{1}]\,\mathcal{G}_{2}[l_{2}]\cdots\mathcal{G}_{d}[l_{d}] \qquad (5)$
Imposing the constraint that each index $l_{k}$ in Eq. 5 can be factorized as a pair $l_{k} = (i_{k}, j_{k})$ consequently reshapes each $\mathcal{G}_{k}[l_{k}]$ into $\mathcal{G}_{k}[i_{k}, j_{k}]$. The decomposition of the tensor can be correspondingly reformulated as:
$\mathcal{V}\big((i_{1},j_{1}),\ldots,(i_{d},j_{d})\big)=\mathcal{G}_{1}[i_{1},j_{1}]\,\mathcal{G}_{2}[i_{2},j_{2}]\cdots\mathcal{G}_{d}[i_{d},j_{d}] \qquad (6)$
This double-index trick [28] enables the factorization of the computation in a fully-connected layer, which is discussed in the following section.
3.2 Tensorized RNN
The core operation in an RNN is the fully-connected layer, whose computation can be compactly described as:
$h(i)=\sum_{j=1}^{N} W(i,j)\,x(j)+b(i) \qquad (7)$
where $W \in \mathbb{R}^{M\times N}$, $x \in \mathbb{R}^{N}$ and $h, b \in \mathbb{R}^{M}$. Assuming that $M = \prod_{k=1}^{d} m_{k}$ and $N = \prod_{k=1}^{d} n_{k}$, we can reshape $x$ and $h$ into $d$-dimensional tensors $\mathcal{X} \in \mathbb{R}^{n_{1}\times\cdots\times n_{d}}$ and $\mathcal{H} \in \mathbb{R}^{m_{1}\times\cdots\times m_{d}}$, and the fully-connected computation then becomes:
$\mathcal{H}(i_{1},\ldots,i_{d})=\sum_{j_{1},\ldots,j_{d}}\mathcal{G}_{1}[i_{1},j_{1}]\cdots\mathcal{G}_{d}[i_{d},j_{d}]\,\mathcal{X}(j_{1},\ldots,j_{d})+\mathcal{B}(i_{1},\ldots,i_{d}) \qquad (8)$
The whole working flow with tensorization on the hidden-to-hidden weights is shown in Fig. 5. Due to the decomposition in Eq. 6, the multiplication complexity becomes $\mathcal{O}(d\,r^{2}\,m\,\max(M,N))$ [28] instead of $\mathcal{O}(MN)$, where $r$ is the maximum rank of the cores $\mathcal{G}_{k}$ and $m$ is the maximum mode size of the tensor $\mathcal{X}$. Since the rank $r$ is very small, this is far more compressed and efficient than the general matrix-vector multiplication of traditional fully-connected layers.
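The compression effect of the tensorized layer can be illustrated with a small numpy sketch. It assumes matrix-TT cores of shape `(r_{k-1}, m_k, n_k, r_k)` (the double-index form of Eq. (6)); the matrix is assembled in full only to make the example checkable, whereas a deployed layer would contract the cores with the reshaped input directly. All sizes and ranks below are illustrative, not the paper's settings:

```python
import numpy as np

def tt_to_matrix(cores):
    """Assemble the full M x N matrix from matrix-TT cores (for checking only)."""
    acc = cores[0][0]                                    # (m1, n1, r1)
    for core in cores[1:]:
        m, n, _ = acc.shape
        _, mk, nk, rk = core.shape
        # Interleave the next core's row/column modes (a Kronecker-like step).
        acc = np.einsum('mnr,rkls->mknls', acc, core).reshape(m * mk, n * nk, rk)
    return acc[:, :, 0]

rng = np.random.default_rng(1)
# A 60 x 60 weight matrix factored as (3*4*5) x (5*4*3) with ranks (1, 2, 2, 1).
cores = [rng.standard_normal((1, 3, 5, 2)),
         rng.standard_normal((2, 4, 4, 2)),
         rng.standard_normal((2, 5, 3, 1))]
W = tt_to_matrix(cores)
x = rng.standard_normal(60)
h = W @ x                                               # h = Wx of Eq. (7)

tt_params = sum(c.size for c in cores)                  # 30 + 64 + 30 = 124
full_params = W.size                                    # 3600
```

Even in this toy case the cores hold 124 parameters versus 3600 in the dense matrix; for the video-scale input-to-hidden maps in the paper the ratio is far larger.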
4 DEEPEYE Framework for Video Comprehension
Based on the quantization and tensorization above, the whole working flow of the DEEPEYE framework for video comprehension is shown in Fig. 6. It integrates Q-YOLO, serving as real-time video object detection, with T-RNN, serving as the video-classification system. First, the prepared video clip is delivered into Q-YOLO as input, where all the convolutional layers, batch-normalization layers and max-pooling layers are quantized to 8 bits. Then the tensor feature outputs of Q-YOLO can be fed directly to T-RNN without delay. Note that the tensor feature outputs are the final results of the last convolutional layer of Q-YOLO, which can also be further processed to display the real-time visual results. Finally, after T-RNN processing with tensorized compression on both the tensorial input-to-hidden and hidden-to-hidden mappings, one obtains the classification result for action recognition.
Below, we further summarize the training steps of DEEPEYE as follows:

Train Q-YOLO: Train Q-YOLO with an existing or customized dataset for object detection. The feature outputs of the last convolutional layer of Q-YOLO for each frame are tensor-format data, which can be represented by sub-items as in the experiments.

Pre-process the video dataset: Pre-process the existing or customized video dataset (in the experiments, MOMENTS [29] and UCF11 [30] are used) with Q-YOLO. Feed each video clip to Q-YOLO to obtain its tensor outputs, which are regarded as the tensor-format dataset for T-RNN instead of the original frame-format dataset.

Train T-RNN: Train the model with the tensor-format dataset and tensorized weights. The final output model will be used for real-time classification.

Understand in real time: After both the Q-YOLO and T-RNN models have been trained, the whole video-comprehension flow can be run for real-time analysis, as shown in Fig. 6.
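The steps above can be sketched as a single run-through with stand-in components: `q_yolo_features` is a stub for the last quantized convolutional layer of Q-YOLO, and a plain tanh RNN cell stands in for the T-RNN; every shape and name below is illustrative, not the paper's configuration:

```python
import numpy as np

rng = np.random.default_rng(2)

def q_yolo_features(frame):
    # Stub for step 1: per-frame tensor-format features from Q-YOLO.
    return rng.standard_normal((13, 13, 32))

frames = [rng.standard_normal((224, 224, 3)) for _ in range(6)]   # one clip
hidden, n_classes = 64, 10
W_x = rng.standard_normal((13 * 13 * 32, hidden)) * 0.01          # input-to-hidden
W_h = rng.standard_normal((hidden, hidden)) * 0.1                 # hidden-to-hidden
W_out = rng.standard_normal((hidden, n_classes))

h = np.zeros(hidden)
for frame in frames:
    x = q_yolo_features(frame).ravel()     # step 2: features, not raw pixels
    h = np.tanh(x @ W_x + h @ W_h)         # step 3: recurrent update
pred = int(np.argmax(h @ W_out))           # step 4: real-time action label
```

In the actual system the two `W` maps of the recurrent cell are exactly the matrices that the tensorized-compression of Section 3 replaces with TT-cores.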
Instead of optimizing video detection and classification separately, DEEPEYE is the first approach to leverage object detection and action recognition together with remarkable optimizations. Since the whole system is highly compressed with quantization and tensorization, it benefits from much better performance in compression, speedup and resource saving, especially when applied to video-comprehension tasks. As presented in Fig. 6, both the storage cost and the number of parameters in the experiments are significantly reduced; the detailed layer model of the proposed Q-YOLO is presented in Table 1.
Layer  Filters  Output  Parameters  Memory Size 
…  …  …  …  …
5 Experiments
In the experiments, we have implemented different baselines for performance comparison as follows. 1) DEEPEYE: the proposed video-comprehension system combining tiny-YOLOv2 with quantization (Q-YOLO) and an LSTM (an advanced variant of RNN) with tensorization (T-RNN); we apply dropout [31] to both the input-to-hidden and hidden-to-hidden mappings in T-RNN. 2) Original YOLO: the original full-precision tiny-YOLOv2 without quantization, used only for video detection. 3) Plain RNN: a plain RNN without tensorization, used only for video classification, whose inputs are the original video frames instead of the tensor outputs of Q-YOLO. 4) T-RNN with frame inputs: a T-RNN fed with the original video frames, also selected for performance comparison.
Note that all baselines are implemented in the same environment: Theano [32] in Keras [33] for software and an NVIDIA GTX 1080 Ti [34] for hardware. We validate the contributions of our system with a comparison study on two challenging large video datasets (MOMENTS [29] and UCF11 [30]), as discussed in the following sections.
5.1 Comparison on Video Detection
To show the effects on video detection, we apply the MOMENTS dataset, which contains one million short labeled video clips, involving people, animals, objects or natural phenomena, that capture the gist of a dynamic scene. Each clip is assigned an action class such as eating, bathing or attacking. Based on the majority of the clips, we resize every frame to a standard size at a fixed frame rate. For a preliminary experiment, we choose a set of representational classes and fix the lengths of the training and test sequences.
First, we pre-train Q-YOLO on VOC with its object classes. As shown in Fig. 7, we report the Average Precision (AP) comparison between the proposed 8-bit quantized model and the full-precision model on representational classes (the AP score corresponds to the area under the precision-recall curve). Then the mean Average Precision (mAP) over all classes is obtained for both the 8-bit Q-YOLO and the full-precision YOLO. The 8-bit Q-YOLO does not cause the AP curves to differ significantly from the full-precision ones and only marginally decreases mAP. As such, Q-YOLO with 8-bit quantization obtains a commendable balance between large compression and high accuracy.
Second, the visual results of our approach on MOMENTS are shown in Fig. 8. The experiments show that all existing objects in these video clips are detected precisely in real time. In this system, the final tensor output of each frame is delivered into T-RNN for video classification with no delay.
5.2 Comparison on Video Classification
In this section, we use the UCF11 dataset for a performance comparison on video classification. The dataset contains video clips falling into action classes that summarize the human action visible in each clip, such as basketball shooting, biking or diving. We resize the RGB frames to a standard size at a fixed frame rate.
We sample random frames in ascending order from each video clip as the input data [18]. The tensorization-based algorithm is configured for both inputs and weights during training. Fig. 9 shows the training-loss and accuracy comparison among: 1) T-RNN with tensor inputs (Q-YOLO outputs), 2) T-RNN with frame inputs, and 3) a plain RNN with frame inputs. The tensor dimension, the input-tensor shapes of the two T-RNN settings, the hidden shapes and the T-RNN ranks are fixed across the comparison.
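The frame-sampling scheme can be written in a few lines: draw distinct frame indices uniformly at random, then keep them in ascending temporal order. The clip length and sample count below are illustrative, not the paper's values:

```python
import random

def sample_frame_indices(num_frames, n, seed=None):
    """Pick n distinct frame indices from a clip, in ascending temporal order."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(num_frames), n))

idx = sample_frame_indices(num_frames=150, n=6, seed=0)
```

Sorting after sampling preserves the temporal order that the recurrent model relies on, while the random draw still covers the whole clip.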
It can be seen that T-RNN with tensor-format inputs performs best beyond the first few epochs. The peak accuracy of the proposed framework is the highest, clearly exceeding both the plain RNN and the T-RNN with frame inputs.
5.3 Performance Analysis
Aside from the strong functionality and accuracy, the high compression and speedup are also remarkable. The proposed DEEPEYE costs far less storage and computing resources than the full-precision YOLO. Since the complexity is significantly reduced and the network throughput is greatly enlarged, future implementation on terminal devices becomes more realizable. The performance evaluation of the different baselines is shown in Table 2. Among all baselines, the proposed DEEPEYE system (T-RNN with tensor inputs) delivers the best performance, achieving much better accuracy even with several orders of magnitude fewer parameters.
Accuracy  Parameters  Compression  Runtime  Speedup  
Plain RNN  …  …  …  …  …
T-RNN with frame inputs  …  …  …  …  …
DEEPEYE  …  …  …  …  …
6 Conclusion
In this paper, we have proposed DEEPEYE, a compact yet accurate video-comprehension framework for object detection and action recognition. It is an RNN network operating on features extracted from YOLO. Both Q-YOLO with 8-bit quantization and T-RNN with tensorized compression are developed, which remarkably compress the original network model while maintaining accuracy. We have tested DEEPEYE on the MOMENTS and UCF11 benchmarks. The results show that DEEPEYE achieves high compression with only a marginal mAP decrease, together with a large parameter reduction and speedup alongside an accuracy improvement. DEEPEYE can further be implemented at terminal devices for real-time video analysis.
References
 (1) Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
 (2) A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
 (3) J. Redmon and A. Farhadi, “Yolov3: An incremental improvement,” arXiv preprint arXiv:1804.02767, 2018.
 (4) K. Guo, S. Han, S. Yao, Y. Wang, Y. Xie, and H. Yang, “Software-hardware codesign for efficient neural network acceleration,” IEEE Micro, vol. 37, no. 2, pp. 18–25, 2017.
 (5) S. Hashemi, N. Anthony, H. Tann, R. I. Bahar, and S. Reda, “Understanding the impact of precision quantization on the accuracy and energy of neural networks,” in Design, Automation & Test in Europe Conference & Exhibition (DATE), 2017, pp. 1474–1479.
 (6) M. Courbariaux, Y. Bengio, and J.P. David, “Binaryconnect: Training deep neural networks with binary weights during propagations,” in Advances in Neural Information Processing Systems, 2015, pp. 3123–3131.
 (7) Z. Liu, Y. Li, F. Ren, H. Yu, and W. Goh, “Squeezedtext: A real-time scene text recognition by binary convolutional encoder-decoder network,” 2018.
 (8) J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: Unified, real-time object detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 779–788.
 (9) J. Redmon and A. Farhadi, “Yolo9000: better, faster, stronger,” arXiv preprint arXiv:1612.08242, 2017.
 (10) L. Yao, A. Torabi, K. Cho, N. Ballas, C. Pal, H. Larochelle, and A. Courville, “Describing videos by exploiting temporal structure,” in Proceedings of the IEEE Conference on Computer Vision, 2015, pp. 4507–4515.
 (11) S. Ebrahimi Kahou, V. Michalski, K. Konda, R. Memisevic, and C. Pal, “Recurrent neural networks for emotion recognition in video,” in Proceedings of the ACM Conference on Multimodal Interaction, 2015, pp. 467–474.
 (12) S. Venugopalan, M. Rohrbach, J. Donahue, R. Mooney, T. Darrell, and K. Saenko, “Sequence to sequence - video to text,” in Proceedings of the IEEE Conference on Computer Vision, 2015, pp. 4534–4542.
 (13) J. Y.H. Ng, M. Hausknecht, S. Vijayanarasimhan, O. Vinyals, R. Monga, and G. Toderici, “Beyond short snippets: Deep networks for video classification,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 4694–4702.
 (14) B. Fernando and S. Gould, “Learning end-to-end video classification with rank-pooling,” in International Conference on Machine Learning, 2016, pp. 1187–1196.
 (15) W. Zhu, J. Hu, G. Sun, X. Cao, and Y. Qiao, “A key volume mining deep framework for action recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1991–1999.
 (16) N. Srivastava, E. Mansimov, and R. Salakhudinov, “Unsupervised learning of video representations using lstms,” in International Conference on Machine Learning, 2015, pp. 843–852.
 (17) J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell, “Long-term recurrent convolutional networks for visual recognition and description,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 2625–2634.
 (18) Y. Yang, D. Krompass, and V. Tresp, “Tensor-train recurrent neural networks for video classification,” arXiv preprint arXiv:1707.01786, 2017.
 (19) S. Zhe, K. Zhang, P. Wang, K.c. Lee, Z. Xu, Y. Qi, and Z. Ghahramani, “Distributed flexible nonlinear tensor factorization,” in Advances in Neural Information Processing Systems, 2016, pp. 928–936.
 (20) M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman, “The pascal visual object classes (voc) challenge,” International Journal of Computer Vision, vol. 88, no. 2, pp. 303–338, 2010.
 (21) S. Zhou, Y. Wu, Z. Ni, X. Zhou, H. Wen, and Y. Zou, “Dorefa-net: Training low bitwidth convolutional neural networks with low bitwidth gradients,” arXiv preprint arXiv:1606.06160, 2016.
 (22) C. Zhu, S. Han, H. Mao, and W. J. Dally, “Trained ternary quantization,” arXiv preprint arXiv:1612.01064, 2016.
 (23) I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio, “Quantized neural networks: Training neural networks with low precision weights and activations,” arXiv preprint arXiv:1609.07061, 2016.
 (24) T. N. Sainath, B. Kingsbury, V. Sindhwani, E. Arisoy, and B. Ramabhadran, “Low-rank matrix factorization for deep neural network training with high-dimensional output targets,” in Proceedings of the IEEE Conference on Acoustics, Speech and Signal Processing, 2013, pp. 6655–6659.
 (25) E. L. Denton, W. Zaremba, J. Bruna, Y. LeCun, and R. Fergus, “Exploiting linear structure within convolutional networks for efficient evaluation,” in Advances in Neural Information Processing Systems, 2014, pp. 1269–1277.
 (26) M. Denil, B. Shakibi, L. Dinh, N. De Freitas et al., “Predicting parameters in deep learning,” in Advances in Neural Information Processing Systems, 2013, pp. 2148–2156.
 (27) S. Zhe, K. Zhang, P. Wang, K.c. Lee, Z. Xu, Y. Qi, and Z. Ghahramani, “Distributed flexible nonlinear tensor factorization,” in Advances in Neural Information Processing Systems, 2016, pp. 928–936.
 (28) A. Novikov, D. Podoprikhin, A. Osokin, and D. P. Vetrov, “Tensorizing neural networks,” in Advances in Neural Information Processing Systems, 2015, pp. 442–450.
 (29) M. Monfort, B. Zhou, S. A. Bargal, A. Andonian, T. Yan, K. Ramakrishnan, L. Brown, Q. Fan, D. Gutfruend, C. Vondrick et al., “Moments in time dataset: one million videos for event understanding,” arXiv preprint arXiv:1801.03150, 2018.
 (30) J. Liu, J. Luo, and M. Shah, “Recognizing realistic actions from videos "in the wild",” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 1996–2003.
 (31) N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A simple way to prevent neural networks from overfitting,” The Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
 (32) “Theano.” [Online]. Available: https://deeplearning.net/software/theano
 (33) “Keras.” [Online]. Available: https://github.com/keras-team/keras
 (34) “GPU specs.” [Online]. Available: https://www.nvidia.com/en-us/geforce/products/10series/geforce-gtx-1080-ti