Pack and Detect: Fast Object Detection in Videos Using Region-of-Interest Packing
Object detection in videos is an important task in computer vision, with applications such as object tracking, video summarization and video search. Although the accuracy of object detection has improved greatly in recent years thanks to better techniques for training and deploying deep neural networks, these networks are computationally very intensive. For example, processing a video at resolution using the SSD300 (Single Shot Detector) object detection network with VGG16 as backbone at 30 fps requires 1.87 trillion FLOPs per second. To address this challenge, we make two observations in the context of videos. First, in some scenarios, most regions of a video frame are background, and the salient objects occupy only a small fraction of the frame area. Second, there is a strong temporal correlation between consecutive frames. Based on these observations, we propose Pack and Detect (PaD) to reduce the computational requirements of neural-network object detection in videos. In PaD, the input video frame is processed at full size only in selected frames called anchor frames. In the frames between the anchor frames, namely inter-anchor frames, the regions of interest (ROIs) are identified based on the detections in the previous frame. We propose an algorithm to pack the ROIs of each inter-anchor frame together in a lower-sized frame. To maintain the accuracy of object detection, the proposed algorithm expands the ROIs greedily to provide more background information to the detector. The computational requirements are reduced due to the smaller size of the input. This method can potentially reduce the number of FLOPS required for a frame by . Tuning the algorithm parameters can provide a increase in throughput with only a drop in accuracy.
|Athindran Ramesh Kumar|
|Department of Computer Science|
|Indian Institute of Technology Madras|
|Chennai, Tamil Nadu, India|
|Department of Computer Science|
|Indian Institute of Technology Madras|
|Chennai, Tamil Nadu, India|
|Department of Electrical Engineering|
|West Lafayette, IN,USA|
Keywords: Object Detection, Neural Network, Temporal Correlation, Object Occupancy, Region-of-Interest Packing
The task of object detection in videos (17; 42; 18; 16; 41; 40; 4; 28; 23; 35) has been gaining attention in recent years. It can serve as an important preprocessing task primarily for object tracking and for several other video processing tasks such as video summarization and video search. Further, object detection and tracking can aid various important applications such as traffic monitoring, pedestrian tracking, animal monitoring and environment survey using drones. Many of these applications require video frames to be processed in real-time in resource-constrained environments. It is thus imperative to design systems that can detect objects in videos in a fast, compute-light and accurate manner.
Still-image object detection has been studied extensively in the past. Its accuracy and speed have improved by leaps and bounds in recent years due to innovations in the use of deep convolutional neural networks (CNNs) for object detection. The current state-of-the-art CNN-based still-image object detectors are Faster R-CNN (34), SSD (24), YOLO (31; 32; 33) and R-FCN (5). SSD and YOLO are almost faster than Faster R-CNN and R-FCN and give comparable accuracy (mAP score) on standard still-image object detection datasets.
These still-image object detectors can be extended to videos by applying them on a per-frame basis. However, this is inefficient, as there is a strong temporal correlation between frames in a video. This temporal redundancy can be leveraged to improve either the accuracy or the speed of object detection. In the recent past, there have been several attempts to improve the accuracy of object detection in videos by integrating either the bounding boxes (17; 18; 11; 37) or the features (16; 41; 40; 12) across frames. However, leveraging this temporal redundancy to improve speed has received much less attention. Some exceptions to this norm are (42; 4; 28; 23; 35). In this work, we propose Pack and Detect (PaD) for fast object detection in videos with negligible loss in accuracy.
Neural networks are in general very computationally intensive. For example, processing a video at resolution using the SSD300 (Single Shot Detector) object detection network with VGG16 as backbone at 30 fps requires 1.87 trillion FLOPs per second. Several attempts (36) have been made to make them less compute-hungry. Techniques such as pruning weights or connections (9; 10), pruning filters in full or in part in a CNN (25; 39; 7; 26; 21), reducing the precision of floating-point operations (38; 1) and designing compute-light architectures (13; 14) have proven successful in the past. While most of these methods are static and are employed at design time, some methods, such as conditional computation (3), dynamic deep neural networks (D2NN) (22) and dynamic variable effort deep neural networks (DyVEDEEP) (6), dynamically adjust the amount of computation employed for an input. However, none of these methods exploits the specific opportunities that arise in the context of videos. Moreover, all of them modify the network and do not try to compress the inputs fed into the network.
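As a quick check on the figure quoted above, the per-frame cost follows from dividing the stated rate by the frame rate. The snippet below is illustrative arithmetic only; the function name `flops_per_frame` is ours and does not come from the paper.

```python
def flops_per_frame(flops_per_second: float, fps: float) -> float:
    """Per-frame FLOP count implied by a sustained FLOP rate at a given frame rate."""
    return flops_per_second / fps


# 1.87 trillion FLOPs per second at 30 fps, as stated in the text.
per_frame = flops_per_frame(1.87e12, 30)
print(f"{per_frame / 1e9:.1f} GFLOPs per frame")  # prints "62.3 GFLOPs per frame"
```

In other words, the quoted rate corresponds to roughly 62 billion floating-point operations for every full-sized frame, which is the per-frame budget that PaD tries to shrink in inter-anchor frames.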
In particular, there are two opportunities we are trying to leverage in the context of videos. Firstly, the objects of interest have been observed to occupy only a small fraction of the area in an image. Secondly, there is a strong temporal correlation between frames in a video. We propose Pack and Detect (PaD) which leverages these two opportunities. The full-sized image is only processed in selected frames called anchor frames. In the frames between the anchor frames (inter-anchor frames), the detections from the previous frame are used to identify regions in the image where an object could potentially be located. These regions are called regions of interest (ROIs). The ROIs are packed in a lower sized image which is fed into the detector.
We propose an ROI packing algorithm based on a greedy heuristic that packs the ROIs in the lower-sized frame while meeting the following criteria:
- Each ROI is expanded to provide as much background context as possible to maintain the accuracy of the detector.
- There is minimal loss of resolution and no change in aspect ratio to maintain the accuracy of the detector.
- Each object is present in a unique ROI.
- The space in the lower-sized frame is used as efficiently and impartially as possible.
Our method, PaD, which uses the ROI-packing algorithm to compress the input in inter-anchor frames is naturally faster because of the smaller size of the input fed to the object detector. In the inter-anchor frames, the FLOP count can potentially reduce by a factor of . Tuning the algorithm parameters can provide a increase in throughput with only a drop in accuracy compared to a baseline mechanism that uses the object detector on a per-frame basis. We vary different parameters of the algorithm to achieve a tradeoff between speed and accuracy. Further, we present results that provide insight into the inner workings of the algorithm.
In section 2, we review some of the related work. In section 3, we analyze the two opportunities we are trying to leverage in the context of videos. In section 4, we provide an overview of the methods and algorithms used in our work. Section 5 outlines the experimental methodology and section 6 reports the results obtained from our experiments. In section 7, we conclude our work.
2 Related Work
Object Detection in Videos
The temporal redundancy in videos has been exploited before to improve the accuracy and speed of object detection in videos. In (17; 18; 11; 37), information from neighbouring frames is aggregated at the bounding box level to improve accuracy. In (17), per-frame object detection is combined with multi-context suppression, motion-guided propagation and object tracking to improve detection accuracy. In (11; 37), non-maximum suppression is done over bags of frames. In (16; 41; 40; 12), the CNN features are integrated across neighbouring frames to improve accuracy. In (16), a CNN is combined with an LSTM (Long Short Term Memory) to obtain temporal features for object detection. In (41; 40; 12), the features from neighbouring frames are aggregated using optical flow information to improve feature quality. All the methods discussed so far are extremely slow, and (17; 18; 11; 37; 16) are not online.
On the other hand, (42; 4; 28; 23; 35) are relatively faster methods aimed at object detection in a video. The methods in (4; 23; 35) are faster by virtue of using a faster still-image object detector or an efficient backbone network. In (42), the feature maps of the CNN are not computed in all the frames. The feature maps from anchor frames are transferred to neighbouring frames by warping them with optical flow information. Since the features are not recomputed in each frame, this setup is much faster and more compute-efficient. In (28), neighbouring frames are subtracted to give rise to a sparse input. This sparse input is processed by the CNN with the EIE (8) hardware accelerator to achieve computational savings. The methods proposed in this paper are vastly different from the methods in the literature for speeding up video object detection. Even though the overall speed-up achieved in an average sense in our work might not be as high as that reported in (42), PaD can be used alongside the methods in (42) using some simple tricks to increase the efficiency further. Any efficient neural network or faster object detector can be used with PaD, as in (4; 23; 35), to obtain further speed-up. Finally, our work does not require any specialized hardware accelerator, unlike (28), to obtain computational savings.
Efficient Neural Networks
There have been efforts to reduce the number of computations of neural networks using various techniques. Deep compression (9) deployed on a custom hardware accelerator (8) can result in considerable savings in computation and a speedup for standard neural networks. Deep compression works by pruning weights below a threshold and quantizing the weights using clustering, followed by Huffman coding. EIE (8) is a hardware accelerator that can leverage the sparsity of the deep-compressed DNN to speed up the network. Since fine-grained sparsity has not been shown to improve speed without custom hardware accelerators, there have been techniques to induce structured sparsity (39) by pruning filters (25; 39; 7; 26; 21) of a CNN in full or in part. MobileNet (13) replaces the standard convolution with a combination of depth-wise and point-wise convolutions to reduce computation. SqueezeNet (14) utilizes CNN design techniques to reduce the number of computations and the memory footprint. Another innovation is the concept of Big-Little neural networks (29), which first try to use a smaller network to process the image. If the little network is not capable of successfully achieving its task in the frame, the bigger network is used to process the image. This method has been shown to reduce computation significantly without much loss in accuracy.
Most of the methods for achieving computational efficiency in neural networks are static methods to be employed at design time. However, there are some exceptions. Conditional computation (3) selectively activates certain parts of the network depending on the input. The policy for deciding which parts of the network to activate is learnt using reinforcement learning. Dynamic deep neural networks (D2NN) (22) work in a manner similar to conditional computation and turn regions of the network on or off using reinforcement learning. DyVEDEEP (6) is a unique effort to reduce the computations in neural networks dynamically using three strategies: saturation prediction and early termination, significance-driven selective sampling, and similarity-based feature map approximation.
All the methods try to compress or tweak the model/network to reduce computations and achieve speedup. In this work, we compress the inputs that we feed into the network. Hence, PaD is orthogonal to most existing techniques and can be used in combination with other methods.
Visual Attention Mechanism
Inspired by human vision, there have been several attempts (20; 30; 27; 2; 19; 15) to reduce computation by processing an image as a sequence of glimpses rather than processing the image as a whole. The notion of a foveal glimpse is somewhat similar to the idea of an ROI discussed here. However, there are several important differences. A foveal glimpse is a high-resolution crop of an important region in the image that is crucial to the task at hand. In our work, we pack all the ROIs together in a single frame and do not process them sequentially, as glimpses are. Further, the most important difference is that the location of the ROIs is inferred from the detections in the previous frame of the video, so the network does not need to learn where to look through an attention mechanism. Also, a foveal glimpse obtains crops by extracting pixels close to the location target at high resolution and pixels far from the location target at low resolution. We do not do any multi-resolution processing in our work. Hence, our work, though somewhat inspired by the notion of a foveal glimpse, is considerably different.
3.1 Occupancy of objects in an image
In this work, one opportunity that we try to leverage is the fact that most of the pixels in a frame are clutter and the objects of interest occupy only a small fraction of the area of the frame. We back this hypothesis using statistics from a popular video dataset. From an analysis of the ImageNet VID validation set containing videos with frames, we see that the objects occupy only of the image on average. Figure 1 is a histogram of the object occupancy ratio statistics in the dataset. From the figure, we can infer that in a large majority of the frames, the objects occupy less than of the area in a frame.
3.2 Temporal correlation of object locations in a video
In general, we can expect a lot of temporal correlation between consecutive frames in a video unless there are sudden camera shifts or missing frames. We observe this hypothesis to be true from a statistical analysis of the ImageNet VID validation set. On the average, the IoU of area containing objects between consecutive frames is . Figure 2 is a histogram of the object occupancy area IoU statistics between consecutive frames in the dataset. In the figure we can clearly see a sharp peak close to . We exploit this opportunity in our work.
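The two statistics used in this section (the fraction of a frame covered by objects, and the IoU of the object-covered area between consecutive frames) can be sketched as follows. This is a minimal pure-Python illustration, not the analysis code used for the paper. It assumes integer pixel boxes of the form `(x0, y0, x1, y1)` with exclusive upper bounds that lie inside the frame, and the convention that two empty frames have an IoU of 1.0 is our own choice.

```python
from typing import Iterable, Set, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1), upper bounds exclusive


def occupied_pixels(boxes: Iterable[Box]) -> Set[Tuple[int, int]]:
    """Pixels covered by at least one box (overlaps are counted once)."""
    pixels = set()
    for x0, y0, x1, y1 in boxes:
        for y in range(y0, y1):
            for x in range(x0, x1):
                pixels.add((x, y))
    return pixels


def occupancy_ratio(boxes: Iterable[Box], width: int, height: int) -> float:
    """Fraction of the frame area covered by the union of the boxes."""
    return len(occupied_pixels(boxes)) / (width * height)


def occupancy_iou(boxes_prev: Iterable[Box], boxes_curr: Iterable[Box]) -> float:
    """IoU between the object-covered areas of two consecutive frames."""
    prev_px = occupied_pixels(boxes_prev)
    curr_px = occupied_pixels(boxes_curr)
    union = len(prev_px | curr_px)
    if union == 0:
        return 1.0  # assumption: two frames without objects count as identical
    return len(prev_px & curr_px) / union


# Toy example: one 20x20 object that moves 2 pixels to the right.
print(occupancy_ratio([(10, 10, 30, 30)], 100, 100))
print(occupancy_iou([(10, 10, 30, 30)], [(12, 10, 32, 30)]))
```

Rasterizing the boxes into a pixel set keeps the union exact when boxes overlap, at the cost of speed; a mask-based implementation would be preferable for a full dataset.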
The overall architecture of the methodology proposed in this work is presented in figure 3. Full-sized video frames are processed at regular intervals (inter-anchor distance ). In the other frames, the ROIs are identified based on the locations of the detections from the previous frame. Only detections that meet a minimum confidence threshold are taken into consideration. An ROI is a region that is likely to contain an object. The ROI packing algorithm tries to pack the ROIs into a lower-sized frame. If the packing succeeds, the lower-sized frame is processed instead, giving rise to computational savings. If the packing fails, the frame is processed at full size. Hence, there is an overhead of checking whether ROI packing is possible. Once the lower-sized ROI-packed frame has been processed by the CNN detector, the object detections in the ROI-packed frame have to be transformed to the original frame dimensions. This is easily done by keeping track of the ROI boundaries in the original frame.
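The control flow described above can be sketched as a short loop. This is a hedged illustration rather than the authors' implementation: `detect` and `pack_rois` are hypothetical callables supplied by the caller, a placement is assumed to copy an ROI into the packed frame without rescaling, and falling back to full-size processing when the previous frame has no usable detections is our assumption.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Box = Tuple[float, float, float, float]   # (x0, y0, x1, y1)
Detection = Tuple[Box, float]             # (box, confidence)
Placement = Tuple[Box, Box]               # (ROI in original frame, its cell in packed frame)


def unpack_detection(det: Detection, placements: Sequence[Placement]) -> Optional[Detection]:
    """Translate a detection from packed-frame to original-frame coordinates."""
    (x0, y0, x1, y1), score = det
    cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
    for (sx0, sy0, _, _), (dx0, dy0, dx1, dy1) in placements:
        if dx0 <= cx < dx1 and dy0 <= cy < dy1:
            ox, oy = sx0 - dx0, sy0 - dy0  # offset between ROI and its packed cell
            return ((x0 + ox, y0 + oy, x1 + ox, y1 + oy), score)
    return None  # detection centre lies outside every packed ROI


def run_pad(frames: Sequence, detect: Callable, pack_rois: Callable,
            anchor_distance: int, roi_threshold: float) -> List[List[Detection]]:
    """Process anchor frames at full size and try ROI packing on the others."""
    results: List[List[Detection]] = []
    prev: List[Detection] = []
    for i, frame in enumerate(frames):
        packed = None
        if i % anchor_distance != 0:  # inter-anchor frame
            rois = [box for box, score in prev if score >= roi_threshold]
            if rois:
                # pack_rois returns (packed_frame, placements) or None on failure
                packed = pack_rois(frame, rois)
        if packed is None:
            dets = detect(frame)      # anchor frame or packing fallback
        else:
            packed_frame, placements = packed
            mapped = (unpack_detection(d, placements) for d in detect(packed_frame))
            dets = [d for d in mapped if d is not None]
        results.append(dets)
        prev = dets
    return results
```

A single `detect` callable is used for both frame sizes, matching the paper's use of one network for full-sized and lower-sized inputs.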
4.2 ROI packing algorithm
In figure 4, we illustrate the overall flow of our ROI packing algorithm, which is based on a greedy heuristic. This algorithm is key to ensuring that there is no significant drop in accuracy due to ROI packing. As a first step, we build a graph in which nodes represent ROIs and two nodes are connected if their respective ROIs intersect, and we find all the connected components of this graph. We then find the enclosing bounding box over the union of the ROIs in each connected component. We iterate the connected-components step until none of the final bounding boxes overlap. This constraint is important because if two bounding boxes overlap, parts of the same object could be present two or more times in the ROI-packed frame.
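The merging step can be sketched as below. This is a minimal illustration under our own assumptions (axis-aligned boxes as `(x0, y0, x1, y1)` tuples, touching edges not counted as an intersection); the function names are ours.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def intersects(a: Box, b: Box) -> bool:
    """True if two boxes overlap with non-zero area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def enclosing_box(boxes: Sequence[Box]) -> Box:
    """Smallest box that contains every box in the group."""
    return (min(b[0] for b in boxes), min(b[1] for b in boxes),
            max(b[2] for b in boxes), max(b[3] for b in boxes))


def connected_components(boxes: Sequence[Box]) -> List[List[Box]]:
    """Group boxes whose intersection graph is connected (depth-first search)."""
    seen = [False] * len(boxes)
    components = []
    for start in range(len(boxes)):
        if seen[start]:
            continue
        seen[start] = True
        stack, component = [start], []
        while stack:
            i = stack.pop()
            component.append(boxes[i])
            for j in range(len(boxes)):
                if not seen[j] and intersects(boxes[i], boxes[j]):
                    seen[j] = True
                    stack.append(j)
        components.append(component)
    return components


def merge_rois(boxes: Sequence[Box]) -> List[Box]:
    """Repeat component merging until no two resulting boxes overlap."""
    boxes = list(boxes)
    while True:
        merged = [enclosing_box(c) for c in connected_components(boxes)]
        if len(merged) == len(boxes):  # every component is a single box
            return merged
        boxes = merged


rois = [(0, 0, 10, 10), (8, 0, 20, 4), (15, 6, 25, 12), (40, 40, 50, 50)]
print(merge_rois(rois))
```

The example shows why the step is iterated: the third ROI touches neither of the first two, but it does overlap their enclosing box, so a second pass is needed before the result is overlap-free.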
Once the number and size of the bounding boxes are decided, the layout of the bounding boxes is determined by a heuristic procedure. The flowchart in figure 5 illustrates the algorithm used for determining the layout. Simpler variants of this procedure could also be adopted. Once the layout is decided, a check is done to see whether the bounding boxes can fit in the layout. If it is not possible to fit the bounding boxes in the layout, the ROI packing procedure stops here and the image is processed at full size.
If the bounding boxes fit in the layout, the following procedure is carried out. In our experiments, we observed that CNN object detectors are often overfit to the background context of the object to be detected. The performance of the CNN object detector is poor if there is no background context. Hence, we extend each bounding box within the space available to it in the layout to provide as much context as possible to the CNN detector. The algorithm for extending the bounding boxes works as follows. First, we decide, using some heuristic rules, whether to extend the boxes horizontally or vertically first. For the sake of discussion, let us assume that the choice is to extend all the bounding boxes horizontally first. We find all the bounding boxes that could potentially intersect when extended horizontally. We extend all bounding boxes horizontally until the layout size is reached or the bounding boxes start intersecting with each other. Then we repeat the same procedure in the other dimension. Once the final bounding boxes are decided, the corresponding regions in the image are extracted and the lower-sized frame is filled according to the decided layout.
There are a lot of heuristic rules in the ROI packing algorithm. All of these rules have been decided after thorough analysis of the setup using experiments. In section 6, we show some results illustrating that this sophisticated ROI packing algorithm is superior to a naive ROI packing algorithm. Further, there are two parameters we can tune to achieve a speed-accuracy tradeoff. Firstly, an obvious parameter that can be tuned is the inter-anchor distance . Also, most CNN object detectors have some amount of robustness built in to tackle scale variations. Hence, we could fit the ROIs extracted from a sized frame into a frame and downscale the ROI packed frame to to get higher speed-ups. There will be some loss of accuracy due to the downscaling. We explore these trade-offs in section 6.
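A simplified version of the expansion step is sketched below. It is not the authors' procedure: instead of first finding the boxes that could collide, it grows every box one pixel at a time and stops a side as soon as it would leave the frame, exceed the space allotted by the layout, or touch another box. The `limits` argument (maximum width and height per box, taken from the layout) and the `horizontal_first` flag are hypothetical names standing in for the layout and the heuristic rule mentioned in the text.

```python
from typing import List, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1) in pixels, upper bounds exclusive


def intersects(a: Sequence[int], b: Sequence[int]) -> bool:
    """True if two boxes share at least one pixel."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def expand_axis(boxes: Sequence[Box], limits: Sequence[Tuple[int, int]],
                frame_extent: int, axis: int) -> List[Box]:
    """Grow all boxes along one axis (0 = x, 1 = y), one pixel per side per round."""
    grown = [list(b) for b in boxes]
    lo, hi = axis, axis + 2
    changed = True
    while changed:
        changed = False
        for i, box in enumerate(grown):
            for side, step in ((lo, -1), (hi, +1)):
                if box[hi] - box[lo] >= limits[i][axis]:
                    break  # box already fills its allotted space on this axis
                trial = list(box)
                trial[side] += step
                if trial[lo] < 0 or trial[hi] > frame_extent:
                    continue  # would leave the frame
                if any(intersects(trial, other) for j, other in enumerate(grown) if j != i):
                    continue  # would run into another box
                box[side] += step
                changed = True
    return [tuple(b) for b in grown]


def expand_rois(boxes: Sequence[Box], limits: Sequence[Tuple[int, int]],
                frame_w: int, frame_h: int, horizontal_first: bool = True) -> List[Box]:
    """Expand along one axis, then the other, as described in the text."""
    extents = (frame_w, frame_h)
    for axis in ((0, 1) if horizontal_first else (1, 0)):
        boxes = expand_axis(boxes, limits, extents[axis], axis)
    return list(boxes)


expanded = expand_rois([(10, 40, 20, 50), (30, 40, 40, 50)],
                       [(30, 20), (30, 20)], frame_w=100, frame_h=100)
print(expanded)
```

Growing all boxes in lockstep is one way to treat them impartially: neighbouring boxes meet roughly halfway instead of the first box claiming all the free space between them.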
5 Experimental Methodology
The ImageNet object detection dataset (DET) comprises 200 classes of objects, which are a subset of the ImageNet 1000 classes. Further, the ImageNet video object detection dataset (VID) comprises 30 classes of objects, which are a subset of the DET 200 classes. The ImageNet VID dataset was the most appropriate choice for illustrating the results of our work. The ImageNet VID training set has 3862 video snippets and the ImageNet VID validation set has 555 video snippets. 53539 frames from the DET dataset, comprising only the classes from the VID dataset, and 57834 frames from the VID training set were combined to form the final training set. This information is summarized in figure 6.
The SSD300 (24) object detector operated on a per-frame basis was used as the baseline for our work. The SSD300 object detector uses VGG16 as feature extractor. The SSD300 model pretrained on the DET dataset was further trained on our training set for 210k iterations with a learning rate of for the first iterations, for the next iterations and for the rest of the training. This trained SSD300 model gave a mAP score of on the VID validation set. Further, this model has a network throughput of fps and an overall throughput (including standard pre-processing time) of fps.
| Dataset | No. of classes | No. of video snippets | No. of frames selected |
|---|---|---|---|
| DET training set | 200 | N/A | 53539 |
| VID training set | 30 | 3862 | 57834 |
| VID validation set | 30 | 555 | 176126 |
One might suppose that we need two object detectors for our framework: one for higher sized images and one for lower sized images. The SSD300 network processes images at as the name suggests. However, closer observation of the network suggested that the same network can process images as well by stopping processing at the penultimate layer. Hence, we use the same SSD300 network with the same weights to process both higher sized images and lower sized images. In all our experiments, the baseline size is and the final lower size is . When a is passed on to the SSD300 network, processing is configured to stop at the penultimate layer. All the experiments were performed using the SSD Caffe framework on a 2.1 GHz Intel Xeon CPU with an Nvidia TITAN X GPU. Code will be released soon. In all the experiments, the batch size was to emulate an online setup. The detection threshold used to select ROIs was fixed at in most of our experiments unless explicitly specified otherwise.
6 Experimental Results
Results from sample videos
We show results on processing some sample videos with PaD. In figure 7, we show some sample detections with our ROI-packing algorithm. The first column shows frame . The second column shows the ROI packed frame with the detections. The third column shows the original frame with detections transformed from ROI packed frame . For this experiment, the confidence threshold for selecting a detection as an ROI for the next frame is chosen as for the sake of illustration. All bounding boxes with a minimum threshold of are shown in the figure.
In figures 8(a) and 8(b), we plot, respectively, the time taken per frame and the cumulative time taken to process a sample video using PaD, and compare them with the baseline. It can be seen that processing the lower sized frame of is almost faster. When ROI-packing fails, a slight overhead is incurred, which is visible towards the end of the video in figure 8(a). Also, we see that some frames require almost s for processing. This is because the previous frame had no detections. Overall, figure 8(b) shows that processing the video using PaD requires almost s less time than the baseline.
Results over the entire dataset
PaD was run with an inter-anchor distance and . In figure 9, we plot the histogram of average per-frame processing time on a video-by-video basis. In other words, the average time taken per frame was obtained for each video and is plotted as a histogram. From the figure, we can clearly see that, for most videos, the average time taken to process a frame is lower with PaD than with the baseline. The baseline mAP score is and our method has a mAP score of . Thus the mAP score drops by less than and percentage points. The average per-frame speedup is around . The average overhead incurred for ROI-packing is around of the total time taken.
Comparison with a naive ROI-packing algorithm
To illustrate the benefits of the sophisticated ROI-packing algorithm discussed in section 4, we compare its accuracy drop with that of a naive ROI-packing algorithm.
The naive ROI-packing algorithm works as follows. Like the sophisticated method, it can accommodate up to four ROIs. If there are more than four objects in the frame, the frame is processed at full size. Otherwise, the bounding box surrounding each object is extended by a factor of and is treated as an ROI. If there is only one object, the ROI surrounding the bounding box is rescaled to size and is processed by the detector. If there are two objects, the lower sized frame is divided into two columns of size . The two ROIs are rescaled to the appropriate sizes and laid out on the lower sized frame. In the case of three or four objects, the lower sized frame is divided into four regions in two columns and two rows of size . In the case of three ROIs, the ROIs are rescaled to occupy three of the four regions in the frame and the fourth region is left blank. In the case of four ROIs, the ROIs are rescaled and fit to these four regions. We do not perform greedy expansion of the ROIs to provide background context. Instead, the ROIs are just expanded by a constant factor of and rescaled to the appropriate size.
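The naive layout can be sketched as follows. The side length `size` of the lower-sized frame and the expansion `factor` are left as parameters because their values are not given in the text above; the order in which ROIs are assigned to cells, the clipping of expanded boxes to the frame, and the handling of a frame with no ROIs are our assumptions.

```python
from typing import List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def naive_layout(num_rois: int, size: int) -> Optional[List[Box]]:
    """Destination cells in a size x size frame, or None to process at full size."""
    half = size / 2
    if num_rois == 1:
        return [(0, 0, size, size)]                             # whole frame
    if num_rois == 2:
        return [(0, 0, half, size), (half, 0, size, size)]      # two columns
    if num_rois in (3, 4):
        cells = [(0, 0, half, half), (half, 0, size, half),
                 (0, half, half, size), (half, half, size, size)]
        return cells[:num_rois]                                  # unused cells stay blank
    return None  # more than four ROIs (or none, by assumption): full-size fallback


def scale_box(box: Box, factor: float, width: float, height: float) -> Box:
    """Grow a box about its centre by a constant factor, clipped to the frame."""
    x0, y0, x1, y1 = box
    cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
    half_w, half_h = (x1 - x0) * factor / 2, (y1 - y0) * factor / 2
    return (max(0, cx - half_w), max(0, cy - half_h),
            min(width, cx + half_w), min(height, cy + half_h))


# Illustrative values only; they are not the settings used in the experiments.
print(naive_layout(3, 100))
print(scale_box((10, 10, 30, 30), 2.0, 100, 100))
```

Because each ROI is rescaled to fill its cell regardless of its own shape, this layout changes both the scale and the aspect ratio of the objects, which is the behaviour the sophisticated algorithm is designed to avoid.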
Our sophisticated ROI packing method with inter-anchor distance gave a mAP score of . With the same parameter setting, the naive ROI packing algorithm gave a mAP score of . This clearly illustrates the need for a sophisticated ROI packing algorithm like ours that preserves the scale and aspect ratio of the ROIs and provides as much background context as possible. Further, each object should be present in only one ROI to prevent redundant detections. This clearly shows that the CNN object detectors, though fairly robust to small scale changes cannot handle large variations in scale and are particularly sensitive to the aspect ratio. Also, CNN object detectors are overfit to the background context.
Tradeoff between speed and accuracy by parameter variation
In figure 10, we vary the inter-anchor distance with and observe the effect on mAP score and speed. As is varied from to , the mAP drops from to while the FLOP count reduction relative to the baseline increases from to on average over the entire validation set. Thus, the inter-anchor distance can be tuned according to the speed-accuracy tradeoff desired.
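The dependence of the average FLOP reduction on the inter-anchor distance can be approximated with a simple cost model. This is our own back-of-the-envelope sketch, not a formula from the paper: it assumes that a fraction `pack_success_rate` of inter-anchor frames is packed successfully, that a packed frame costs `low_cost_ratio` times the FLOPs of a full-sized frame, and that the packing overhead is negligible.

```python
def average_flop_reduction(anchor_distance: int, pack_success_rate: float,
                           low_cost_ratio: float) -> float:
    """Average FLOP reduction factor relative to full-size per-frame detection."""
    # One frame in every `anchor_distance` is an anchor and is never packed.
    packed_fraction = pack_success_rate * (anchor_distance - 1) / anchor_distance
    relative_cost = (1.0 - packed_fraction) + packed_fraction * low_cost_ratio
    return 1.0 / relative_cost


# Illustrative inputs only; they are not measurements from the paper.
for d in (1, 2, 4, 8):
    print(d, round(average_flop_reduction(d, pack_success_rate=1.0, low_cost_ratio=0.25), 3))
```

Under this model the reduction grows with the inter-anchor distance but saturates at `1 / low_cost_ratio`, since even with very sparse anchors every frame still costs at least as much as a packed frame.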
The parameters and are decided by the specific CNN detector that is being used. However, the factor could be varied to obtain higher speeds. As is increased, the chances of the ROIs fitting in the lower sized layout increase, which in turn increases the chances of processing the frame at lower size. The speed-accuracy tradeoff with variation in is plotted in figure 11. With increase in from to , the average FLOP count reduction relative to the baseline varies from to . The average per-frame speedup increases from to . However, there is a relatively high drop in accuracy to relative to the baseline at . This shows that the CNN object detector, though fairly robust to scale variations, cannot handle large variations in scale.
7 Conclusion and Future Work
Still-image object detection has improved by leaps and bounds in recent years due to the success in training and deploying neural networks. However, the opportunities that are available in the context of videos have not been fully exploited. Neural networks are in general very compute-hungry and inference can run in real-time in only the most high-end GPUs. In this work, we use the opportunities available in the context of videos to speed up and reduce the amount of computation in neural network object detectors. In PaD, the full-sized input will only be processed in anchor frames. In the inter-anchor frames, the ROIs in the frame are identified based on the location of the objects in the previous frame. These ROIs are packed together in a lower sized frame which is fed to the CNN object detector. The ROI packing algorithm needs to be sophisticated in the sense that it has to ensure that the scales and aspect ratios of the objects are preserved and enough background context is provided. With this setup, we observed speedup with less than drop in accuracy on the ImageNet VID validation set. Further, the time taken to process a lower sized frame is almost lesser and the FLOP count reduces by . Given more suitable datasets, we can get even more speedup and reduction in FLOP count in the average sense.
As part of future work, we plan to incorporate a motion model to obtain the ROIs in the current frame. Incorporating a motion model could also help extend this framework to larger batch sizes. Also, it is possible to use two different models or networks to process larger sized and smaller sized frames. This will help reduce the accuracy drop but will in turn increase the memory footprint. There is an overhead incurred in checking whether the ROIs can fit in the lower sized frame. Currently, we select anchor frames at regular intervals. However, information on whether ROIs were packed successfully in previous frames can help us decide how frequently we select anchor frames. Thus, another line of future work is a dynamic mechanism for selecting anchor frames in order to reduce the overhead. Further, instead of using a heuristic hand-crafted algorithm for forming the ROI packed frame, a neural network could be trained to identify the ROIs and pack them in a lower sized frame like in an attention mechanism. This is another ambitious line of future work. It would be interesting to test PaD in more resource constrained platforms like mobile GPUs and CPUs. We expect the benefits to be more pronounced in such platforms.
-  S. Anwar, K. Hwang, and W. Sung. Fixed point optimization of deep convolutional neural networks for object recognition. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1131–1135, April 2015.
-  Jimmy Ba, Volodymyr Mnih, and Koray Kavukcuoglu. Multiple object recognition with visual attention. arXiv preprint arXiv:1412.7755, 2014.
-  Emmanuel Bengio, Pierre-Luc Bacon, Joelle Pineau, and Doina Precup. Conditional computation in neural networks for faster models. CoRR, abs/1511.06297, 2015.
-  Xingyu Chen, Zhengxing Wu, and Junzhi Yu. TSSD: temporal single-shot object detection based on attention-aware LSTM. CoRR, abs/1803.00197, 2018.
-  Jifeng Dai, Yi Li, Kaiming He, and Jian Sun. R-fcn: Object detection via region-based fully convolutional networks. In Advances in neural information processing systems, pages 379–387, 2016.
-  Sanjay Ganapathy, Swagath Venkataramani, Balaraman Ravindran, and Anand Raghunathan. Dyvedeep: Dynamic variable effort deep neural networks. CoRR, abs/1704.01137, 2017.
-  Jia Guo and Miodrag Potkonjak. Pruning filters and classes: Towards on-device customization of convolutional neural networks. In Proceedings of the 1st International Workshop on Deep Learning for Mobile Systems and Applications, EMDL ’17, pages 13–17, New York, NY, USA, 2017. ACM.
-  Song Han, Xingyu Liu, Huizi Mao, Jing Pu, Ardavan Pedram, Mark A. Horowitz, and William J. Dally. Eie: Efficient inference engine on compressed deep neural network. In Proceedings of the 43rd International Symposium on Computer Architecture, ISCA ’16, pages 243–254, Piscataway, NJ, USA, 2016. IEEE Press.
-  Song Han, Huizi Mao, and William J. Dally. Deep compression: Compressing deep neural network with pruning, trained quantization and huffman coding. CoRR, abs/1510.00149, 2015.
-  Song Han, Jeff Pool, John Tran, and William J. Dally. Learning both weights and connections for efficient neural networks. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 1, NIPS’15, pages 1135–1143, Cambridge, MA, USA, 2015. MIT Press.
-  Wei Han, Pooya Khorrami, Tom Le Paine, Prajit Ramachandran, Mohammad Babaeizadeh, Honghui Shi, Jianan Li, Shuicheng Yan, and Thomas S Huang. Seq-nms for video object detection. arXiv preprint arXiv:1602.08465, 2016.
-  Congrui Hetang, Hongwei Qin, Shaohui Liu, and Junjie Yan. Impression network for video object detection. CoRR, abs/1712.05896, 2017.
-  Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications. CoRR, abs/1704.04861, 2017.
-  Forrest N. Iandola, Matthew W. Moskewicz, Khalid Ashraf, Song Han, William J. Dally, and Kurt Keutzer. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <1MB model size. CoRR, abs/1602.07360, 2016.
-  Samira Ebrahimi Kahou, Vincent Michalski, and Roland Memisevic. RATM: Recurrent attentive tracking model. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit. Workshops, pages 1613–1622, 2015.
-  Kai Kang, Hongsheng Li, Tong Xiao, Wanli Ouyang, Junjie Yan, Xihui Liu, and Xiaogang Wang. Object detection in videos with tubelet proposal networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 2, page 7, 2017.
-  Kai Kang, Hongsheng Li, Junjie Yan, Xingyu Zeng, Bin Yang, Tong Xiao, Cong Zhang, Zhe Wang, Ruohui Wang, Xiaogang Wang, and Wanli Ouyang. T-CNN: tubelets with convolutional neural networks for object detection from videos. CoRR, abs/1604.02532, 2016.
-  Kai Kang, Wanli Ouyang, Hongsheng Li, and Xiaogang Wang. Object detection from video tubelets with convolutional neural networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
-  Adam Kosiorek, Alex Bewley, and Ingmar Posner. Hierarchical attentive recurrent tracking. In Advances in Neural Information Processing Systems, pages 3053–3061, 2017.
-  Hugo Larochelle and Geoffrey E Hinton. Learning to combine foveal glimpses with a third-order Boltzmann machine. In J. D. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel, and A. Culotta, editors, Advances in Neural Information Processing Systems 23, pages 1243–1251. Curran Associates, Inc., 2010.
-  Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Graf. Pruning filters for efficient convnets. CoRR, abs/1608.08710, 2016.
-  Lanlan Liu and Jia Deng. Dynamic deep neural networks: Optimizing accuracy-efficiency trade-offs by selective execution. CoRR, abs/1701.00299, 2017.
-  Mason Liu and Menglong Zhu. Mobile video object detection with temporally-aware feature maps. CoRR, abs/1711.06368, 2017.
-  Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C Berg. SSD: Single shot multibox detector. In European Conference on Computer Vision, pages 21–37. Springer, 2016.
-  J. Luo, J. Wu, and W. Lin. ThiNet: A filter level pruning method for deep neural network compression. In 2017 IEEE International Conference on Computer Vision (ICCV), pages 5068–5076, Oct. 2017.
-  Deepak Mittal, Shweta Bhardwaj, Mitesh M. Khapra, and Balaraman Ravindran. Recovering from random pruning: On the plasticity of deep convolutional neural networks. In Eighteenth IEEE Winter Conference on Applications of Computer Vision (WACV), 2018.
-  Volodymyr Mnih, Nicolas Heess, Alex Graves, and Koray Kavukcuoglu. Recurrent models of visual attention. In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, NIPS’14, pages 2204–2212, Cambridge, MA, USA, 2014. MIT Press.
-  Bowen Pan, Wuwei Lin, Xiaolin Fang, Chaoqin Huang, Bolei Zhou, and Cewu Lu. Recurrent residual module for fast inference in videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1536–1545, 2018.
-  E. Park, D. Kim, S. Kim, Y. D. Kim, G. Kim, S. Yoon, and S. Yoo. Big/little deep neural network for ultra low power inference. In 2015 International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS), pages 124–132, Oct 2015.
-  Marc Aurelio Ranzato. On learning where to look. CoRR, abs/1405.5488, 2014.
-  J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. You only look once: Unified, real-time object detection. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 779–788, June 2016.
-  J. Redmon and A. Farhadi. YOLO9000: Better, faster, stronger. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6517–6525, July 2017.
-  Joseph Redmon and Ali Farhadi. YOLOv3: An incremental improvement. CoRR, abs/1804.02767, 2018.
-  Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 91–99. Curran Associates, Inc., 2015.
-  Mohammad Javad Shafiee, Brendan Chwyl, Francis Li, and Alexander Wong. Fast YOLO: A fast you only look once system for real-time embedded object detection in video. CoRR, abs/1709.05943, 2017.
-  V. Sze, Y. H. Chen, T. J. Yang, and J. S. Emer. Efficient processing of deep neural networks: A tutorial and survey. Proceedings of the IEEE, 105(12):2295–2329, Dec 2017.
-  Peng Tang, Chunyu Wang, Xinggang Wang, Wenyu Liu, Wenjun Zeng, and Jingdong Wang. Object detection in videos by short and long range object linking. CoRR, abs/1801.09823, 2018.
-  S. Venkataramani, A. Ranjan, K. Roy, and A. Raghunathan. AxNN: Energy-efficient neuromorphic systems using approximate computing. In 2014 IEEE/ACM International Symposium on Low Power Electronics and Design (ISLPED), pages 27–32, Aug 2014.
-  Wei Wen, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. Learning structured sparsity in deep neural networks. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 2074–2082. Curran Associates, Inc., 2016.
-  Xizhou Zhu, Jifeng Dai, Lu Yuan, and Yichen Wei. Towards high performance video object detection. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
-  Xizhou Zhu, Yujie Wang, Jifeng Dai, Lu Yuan, and Yichen Wei. Flow-guided feature aggregation for video object detection. In Proceedings of the IEEE International Conference on Computer Vision, volume 3, 2017.
-  Xizhou Zhu, Yuwen Xiong, Jifeng Dai, Lu Yuan, and Yichen Wei. Deep feature flow for video recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 1, page 3, 2017.