A Convolutional Feature Map based Deep Network targeted towards Traffic Detection and Classification

Baljit Kaur baljitkaur13@gmail.com Jhilik Bhattacharya jhilik@thapar.edu Department of Computer Science and Engineering, Thapar Institute of Engg and Tech, Patiala, Punjab, India

This research focuses on traffic detection, which essentially involves object detection and classification. The work discussed here is motivated by unsatisfactory attempts at re-using well-known pre-trained object detection networks on domain-specific data. In the process, some seemingly trivial issues that lead to a prominent performance drop are identified, and ways to resolve them are discussed. For example, some simple yet relevant tricks regarding data collection and sampling prove to be very beneficial. Introducing a blur net to deal with blurred real-time data is another important factor in improving performance. We further study neural network design issues for object classification and involve shared, region-independent convolutional features. Adaptive learning rates to deal with saddle points are also investigated, and an average covariance matrix based pre-conditioned approach is proposed. We also introduce the use of optical flow features to accommodate orientation information. Experimental results demonstrate that these measures yield a steady rise in performance.

Deep Learning, Traffic Detection and Classification, CNN, feature extraction, Optical flow, adaptive learning.
journal: Expert Systems with Applications


1 Introduction

Vehicle detection methods can be categorised as moving-vehicle detection on the basis of background estimation, or cascade based object detection. In the former case, vehicle candidates are extracted from foreground blocks obtained by removing the predicted background from the original input images (Zhou et al., 2007; Vargas et al., 2010). Standard histogram-based contrast (HC) (Li et al., 2017a) and Bayesian probability models (Yao et al., 2017) are examples of this approach. These methods have low computational complexity and can be applied where the background is simple and stable. However, they are not appropriate for congested urban traffic scenes, especially during slow traffic movement, due to the lack of flow information. That is, the movement of an object cannot be determined without taking its inherent information into consideration. Cascade based detection techniques search the whole image region by region and test for vehicle candidature via suitable feature extraction techniques. Owing to the availability of computing resources and the high performance of deep CNNs, ConvNets are replacing hand-crafted feature extraction. Among CNNs for object detection, R-CNN, Fast R-CNN and Faster R-CNN are widely adopted. R-CNN extracts region proposals, computes CNN features and classifies the objects. To improve computational efficiency, Fast R-CNN uses region of interest pooling so that the forward pass of the CNN is shared across proposals. These region proposals were created using selective search, which was replaced by a region proposal network (RPN) in Faster R-CNN (Ren et al., 2015). There, a single network composed of the region proposal network and Fast R-CNN is used, with the two sharing their convolutional features. Segmentation was added to this family of detectors by placing an object mask prediction branch alongside the existing branch for bounding box recognition (He et al., 2017). We noticed that these object detection networks used VGG16 with the PASCAL dataset.
One reason behind this choice could be that it is a very deep network with 41 layers. It improves on AlexNet by replacing the large kernel-sized filters (11 in the first and 5 in the second convolutional layer) with several 3×3 kernel filters applied one after the other. For a given effective area of the input image on which an output depends (the receptive field), a stack of smaller kernels is better than a single large kernel, because the additional non-linear layers make the network deeper and allow it to learn more complex features at a lower cost. GoogLeNet, on the contrary, has quite a different architecture: it uses combinations of inception modules, each including pooling, convolutions at different scales and concatenation operations. GoogLeNet and ResNet do not allow region-wise classifiers due to the absence of fully connected layers. However, another widely used network, YOLO (Redmon & Farhadi, 2016), composed entirely of convolutional layers and trained and tested on the PASCAL VOC and COCO datasets, proved to be quite accurate and fast.
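The trade-off between one large kernel and a stack of small ones can be checked with simple arithmetic. The sketch below is illustrative only: the helper names and the channel width of 64 are assumptions, biases are ignored, and stride 1 without dilation is assumed.

```python
def stacked_receptive_field(kernel_size: int, num_layers: int) -> int:
    """Receptive field of `num_layers` stacked stride-1 convolutions."""
    # Each additional stride-1 layer widens the field by (kernel_size - 1).
    return 1 + num_layers * (kernel_size - 1)


def conv_weights(kernel_size: int, channels: int, num_layers: int = 1) -> int:
    """Weight count for conv layers with `channels` inputs and outputs (no bias)."""
    return num_layers * kernel_size * kernel_size * channels * channels


channels = 64  # assumed channel width, for illustration only

# Two 3x3 layers cover the same 5x5 input area as a single 5x5 layer ...
assert stacked_receptive_field(3, 2) == stacked_receptive_field(5, 1) == 5

# ... but need 18*C^2 weights instead of 25*C^2, with one extra non-linearity.
stacked = conv_weights(3, channels, num_layers=2)
single = conv_weights(5, channels)
print(stacked, single)
```

The same reasoning extends to three 3×3 layers versus one 7×7 layer (27·C² against 49·C² weights).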
In the present context, we used Faster R-CNN for object detection on self-captured videos. Our goal was to detect traffic regions in a scene and then pass them to a dedicated vehicle classification CNN. However, we failed to obtain satisfactory detection results, which led us to explore the possible causes. (i) The primary reason could be a problem with the dataset we captured. For example, the dataset may be too small to obtain suitable results. Also, because video data was used, many similar frames were captured, resulting in a homogeneous dataset. This leads to low gradients during learning and hence slow or no convergence. (ii) Secondly, blurred images captured due to random movement of the camera can have a negative impact on the network. We therefore need to carefully eliminate blurred images or remove blur from images (Xu et al., 2013; Cho et al., 2012; Zheng et al., 2013). Alternatively, the network can be trained using blurred images only. An example is (GUO et al., 2017), where GoogLeNet was trained on blurred data to improve results. (iii) Third, the design of the deep network for object classification plays an important role. While exploring network design for classification, we found that, apart from deep feature maps, a deep and convolutional per-region classifier is of special importance for object detection. Several researchers have argued that image classification models such as GoogLeNet and ResNet do not give good detection accuracy without a per-region classifier. (iv) Another important factor could be the features provided to the network for classification. For different object detection challenges, deep neural networks improve performance by averaging over different crops or scales of a particular image. PCA and whitening of pixels were used to reduce overfitting on ImageNet; this caters for intensity variations in the training images.
Some work has considered the use of RGB images together with depth data to improve the accuracy of object detection (Cao et al., 2017; Niessner et al., 2017). This includes training a network from scratch using RGB, depth and/or LIDAR data, or fine-tuning pretrained nets such as VGG or AlexNet with depth/LIDAR data. In the current work, we consider using the orientation value obtained from optical flow, with the idea that it will encapsulate pose information. Optical flow has been used along with hand-crafted HOG and LUV features for pedestrian detection on the Caltech-USA pedestrian dataset (Rauf et al., 2016). Occlusion edge detection using optical flow has also been reported (Pop et al., 2017; Sarkar et al., 2017); these authors trained a CNN with intensity, depth and flow images for each frame. An approach combining optical flow with deep learning for visual odometry was proposed by Muller & Savakis (2017). There, optical flow images are used as input to a CNN, which calculates a rotation and displacement for each image pixel. The displacements and rotations are applied incrementally to construct a map of where the camera has travelled. To our knowledge, a CNN trained with optical flow for vehicle detection and classification has not yet been reported. In this regard, we make the following contributions towards developing a traffic detection application.

  1. Suitable pre-processing of data-set to remove homogeneity.

  2. A Network on Convolutional Feature Maps(NoC) trained and used for classification. We perform extensive experiments with different learning rates and layer designs to understand how the learning is affected.

  3. Use of blur NoCs trained particularly with blurred dataset to accommodate blurred scenes during real-time processing.

  4. We use a multimodal fusion, where we fuse the convolutional feature maps of the individual columns of a multicolumn CNN using a summation operation. This proves beneficial because it accommodates multimodal features at a lower memory and computational cost than fusion via concatenation. For example, in our case we fuse features extracted from the 5th convolutional layer of the pretrained network using RGB (intensity) images with orientation features obtained from optical flow images.

  5. We propose an average covariance based pre-conditioning approach to deal with saddle points in deep networks.

The rest of the paper is organised as follows. Related literature is discussed in Section 2. Data collection and data preprocessing are elaborated in Section 3. Experiments for the proposed work are presented in Section 4. Finally, Section 5 concludes the work.

2 Related Work

A considerable amount of work has been reported on candidate region detection as well as classification for different categories of images. While some research focuses on application-specific detection tasks, as detailed in (a), other work focuses on improving detection with respect to speed, accuracy and false alarms, as elaborated in (b).

(a) CNN based detection and classification techniques have been implemented for detecting pedestrians, cyclists, vehicle types, animals and many more. A unified framework for concurrent detection of pedestrians and cyclists was proposed by Li et al. (2017c), using RCNN on upper body regions detected with ACF and LDCF (Nam et al., 2014). Huo et al. (2016) classified different vehicle types (car, truck, bus and van) from different views using a multi-task RCNN. An animal detection technique using multilevel graph cut to combine motion with spatial context was presented in (Zhang et al., 2016b). The feature description used for animal detection was a combination of deep learning features (pretrained Caffe CNN) and histogram of oriented gradients features encoded with Fisher vectors. Zhuo et al. (2017) fine-tuned GoogLeNet, pretrained on ILSVRC-2012 data, on their own vehicle dataset to obtain vehicle classification results. Yao et al. (2017) detected vehicles using a Bayesian probability model and classified multiple vehicles by adopting AlexNet, pre-trained on the ILSVRC 2012 ImageNet data set, as the classifier. Along with classification, their framework also detected vehicle location. Wang et al. (2016) detected vehicles using a pre-trained Fast R-CNN network and classified them into types via the VGG_CNN_M_1024 model. A K-means algorithm was used to cluster the vehicle data prior to training VGG_CNN_M_1024.

(b) While most of the available research focuses on using different networks and classifiers for specific applications, some researchers have focused on increasing speed and accuracy while reducing false alarms. For example, Zhang et al. (2016a) presented an acceleration method that proved to be effective for very deep models. They proposed a response reconstruction method that takes into account the nonlinear neurons and a low-rank constraint. A solution based on Generalized Singular Value Decomposition (GSVD) was developed for this nonlinear problem, without the need for SGD. Their method was evaluated under whole-model speedup ratios, and it effectively reduced the accumulated error of multiple layers thanks to the nonlinear asymmetric reconstruction. A method to reduce false alarms was introduced by Kang et al. (2017b), in which detection results are propagated to adjacent frames according to motion information and the resulting duplicate boxes are removed by non-maximum suppression (NMS). Another effective approach to reducing false alarms, a context-based CNN object detection model, was introduced by Li et al. (2017b). Kang et al. (2017a) proposed the NOSCOPE system for accelerating neural network inference over video with the help of inference-optimized model search.

3 Data Set

The quality of data plays an important role in training any deep network. In general, the majority of reported research uses the bottleneck layer of a network pre-trained on millions of images for feature extraction. To adapt this to a particular application, domain-specific data is used for transfer learning or fine-tuning. Collecting such data comes with practical problems that need to be dealt with before the data is used. However, most works discuss the application of the data directly and do not elaborate on these common problems or how they were tackled. Some potential problems faced while collecting data for classification are therefore discussed here, so that they can serve as a point of reference for other researchers. For the proposed system, outdoor traffic data including cars, pedestrians, two-wheelers etc. is required in the form of videos. It is well known that data non-homogeneity, in the form of varied lighting, postures and other structural and environmental conditions, is required for robust classification. However, experimental results have shown that poor-quality data has a greater impact on false alarms than on true accepts. Maintaining non-homogeneity in video data is also difficult and hence requires further clustering and sampling techniques. To facilitate this, 80 videos were taken using a "Sony Cyber-shot DSC-T77" 10.1 MP camera at a resolution of 640 x 480. Each video was approximately 2 minutes or less in duration at 30 frames/second; a few had to be discarded manually as they did not contain the required objects. For data collection, the camera was mounted on a tripod, which was periodically moved to different pan and tilt angles for posture variations. Data was collected at different spots and times. However, the amount of data from each scenario is not uniform. For example, in the morning, traffic movement (cars, pedestrians, two-wheelers etc.) is minimal.
Several issues need to be considered in such a setup, including blurring of frames due to the motion of the tripod.

  1. Data Preprocessing: The aim of pre-processing is to enhance the image data by suppressing undesired distortions or enhancing image features that are relevant for further processing and analysis. General preprocessing techniques used by researchers include digital spatial filtering, intensity distribution linearization, contrast enhancement etc. (Bernal et al., 2013). These methods are particularly essential when dealing with foreground extraction. For CNN based feature extraction, the data is normalized by zero centering and/or dividing by the standard deviation before it is fed to the network. In our case, we use two different pre-processing techniques: the first when selecting data for clustering (to obtain a non-homogeneous data set) and the second when the data is fed to the neural network. For the first case, we use an FFT based specification technique to stabilize the color component. One frame is selected as the base frame and the intensity of the remaining frames is equalized to the base using FFT, as shown in equation 1. The latter is done in the traditional way described above.

  2. Data Set Sampling: A good training set is one that represents diverse information. Hence, data homogeneity resulting from video data (especially at high frame rates) may hurt performance. We deal with this issue via a two-step procedure. 1) Key-frame selection (Gao et al., 2017) is employed to select candidate key frames. The number of clusters is the same as the number of key frames in a video. 2) K-means clustering (Chao, 2018) is applied to deep CNN features to obtain tight clusters, and the cluster centre of each cluster is selected. A VGG net trained on ILSVRC-2012 is used for extracting features in both cases. These algorithms group a set of objects in such a way that objects in the same cluster are more similar to each other than to those in other clusters. Finally, random clusters were taken as samples of the data. Labelling of the data was done by human annotation. The 16 object classes considered are listed as {Aeroplane, Bus, Bicycle, Boat, Car, Cat, Dog, Horse, Motorbike, Person, Potted plant, Sheep, Train and Background}.
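The clustering step of the sampling procedure can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline used in the experiments: the two-dimensional toy vectors stand in for deep CNN features, the function names are hypothetical, the key-frame selection step is not modelled, and the evenly spaced initialisation is a simplification chosen to keep the example deterministic.

```python
import numpy as np


def kmeans(features: np.ndarray, k: int, iters: int = 20):
    """Plain Lloyd's k-means on row vectors; returns (centres, labels)."""
    features = np.asarray(features, dtype=float)
    # Deterministic start: k frames spread evenly through the video.
    start = np.linspace(0, len(features) - 1, k).astype(int)
    centres = features[start].copy()
    labels = np.zeros(len(features), dtype=int)
    for _ in range(iters):
        # Squared distance from every feature vector to every centre.
        dists = ((features[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = features[labels == j]
            if len(members):  # an empty cluster keeps its previous centre
                centres[j] = members.mean(axis=0)
    return centres, labels


def representative_frames(features: np.ndarray, k: int) -> list:
    """Index of the frame closest to each cluster centre."""
    features = np.asarray(features, dtype=float)
    centres, labels = kmeans(features, k)
    picks = []
    for j in range(k):
        idx = np.flatnonzero(labels == j)
        if len(idx) == 0:
            continue
        d = ((features[idx] - centres[j]) ** 2).sum(axis=1)
        picks.append(int(idx[d.argmin()]))
    return sorted(picks)


# Toy stand-in for CNN features: three scenes of ten near-duplicate frames each.
rng = np.random.default_rng(0)
scenes = np.repeat(np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]]), 10, axis=0)
features = scenes + 0.1 * rng.standard_normal(scenes.shape)
print(representative_frames(features, k=3))  # one frame index per scene
```

Keeping only the frame nearest each centre discards the near-duplicate frames that a high frame rate produces, which is the source of the homogeneity discussed above.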

4 Experiments

While performing object detection experiments, we observed that RCNN, Fast RCNN and Faster RCNN did not give effective results on our own collected dataset. YOLO, however, provides better results than the former three, as shown in Table 1. It should also be noted that the number of false alarms was considerably higher for RCNN, Fast RCNN and Faster RCNN than for YOLO, as shown in Table 2.

Method/testset CTS OTS PTS
RCNN 40 54 60.8
Fast RCNN 54 61 68.7
Faster RCNN 60 66 69.9
Yolo 65 68 74
Table 1: Accuracy on the different test sets for different pre-trained networks
Method/testset OTS
Fast RCNN 29
Faster RCNN 23
Yolo 15
Table 2: False alarms on our own test set (OTS) for different pre-trained networks

As seen from Table 1, we used three datasets, referred to as CTS, OTS and PTS. CTS is a subset of the Caltech dataset. We refer to our own test set as OTS, while PTS is a subset of the PASCAL VOC 2007 dataset. The results in Tables 1 and 2 were obtained using 1200 images from each set.

We further performed the following experiments (A to D) using CTS, OTS and PTS. For the different experiments discussed below, training was done using 1300000 images of PTS. Around 300 region proposals were extracted from each image of every set using an RPN. The test results shown in Tables 3 to 13 were obtained using CTS, OTS and PTS with 1200 images from each set. To be on par with results presented in other papers, we used SVM classification of features extracted from the second-last layer of the trained nets.
A. Layer wise: Features extracted from VGG (the last max pool layer, 'pool5', of the 41-layer network) for the PASCAL VOC dataset were used to train NoCs with different architectures.
B. Learning rate wise: We use different learning rate patterns on the data to compare their effect.
C. Dataset wise: Experiments were done on normal as well as blurred datasets. These were then tested with blurred as well as general networks.
D. Feature wise: Optical flow features were extracted from the images, and the orientation function was used to enhance the image features; these were used for training the different networks with different data.

A. Three different architectures were used to develop the CNN based classifier. Figure 1 depicts the widely used design with multiple fc layers for classification (listed as f4096-f4096-f16 in Table 3). The second architecture, shown in Figure 2, uses 1 spatial convolutional layer followed by 3 fc layers (listed as 1C3fc). The third architecture uses 2 convolutional layers, as represented in Figure 3, and is listed as 1M1. Experiments performed using these NoCs with different test sets are shown in Table 3.

Figure 1: Architecture of the fully connected NoC, f4096-f4096-f16 (FC-Fully Connected Layer, HU-Hidden Units)
Figure 2: Architecture of 1C3fc (CNL-Convolutional layer used with RELU, FC-Fully Connected Layer, FM-Feature Map)
Figure 3: Architecture of 1M1 (CNL-Convolutional layer used with RELU, FC-Fully Connected Layer, FM-Feature Map)
Method CTS OTS PTS
f4096-f4096-f16 70 65 66
1conv-f4096-f4096-f16(1C3fc) 83 81 75
1conv-1maxPool-1conv-f4096-f4096-f16(1M1) 78 76.6 73.8
Table 3: Accuracy on different test sets with different NoCs

B. The learning rate plays an important role in the convergence of the training loss. RMSProp uses first order gradients to approximate Hessian-based pre-conditioning and thereby obtains adaptive learning rates. However, it is important to handle effectively the noise included in first order gradients during stochastic optimization (mini-batch settings). Variants of RMSProp such as AdaDelta and Adam are also considered superior to SGD in terms of training speed, based on the fact that saddle points slow the progress of first order gradient methods. SGD iteratively updates the parameters as shown in equation 2.


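$$\theta_{t+1} = \theta_t - \eta\, g_t \qquad (2)$$

Here $\theta_t$ denotes the parameters at iteration $t$, $\eta$ is the learning rate and $g_t$ is the first order gradient. The updating value of RMSProp is given in equation 3.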
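For reference, a commonly used form of the RMSProp update is sketched below. The notation is assumed here and may differ from the symbols of equation 3: $\theta_t$ are the parameters, $g_t$ is the first order gradient, $\eta$ is the learning rate, $\beta$ is a decay rate, $v_t$ is a running average of the squared gradient and $\epsilon$ is a small stabilising constant.

```latex
v_t = \beta\, v_{t-1} + (1-\beta)\, g_t^{2}, \qquad
\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{v_t} + \epsilon}\, g_t
```

Compared with plain SGD, each coordinate is rescaled by the recent magnitude of its own gradient, which is what makes the effective learning rate adaptive.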


In Hessian based conditioning, the training efficiency is increased by reducing the hessian condition number by transforming the parameters as represented in equation 4.


Here , which works even when H is indefinite as is the case for saddle points. It is verified in (Dauphin et al., 2015) that can be used as .
Ida et al. (2017) used covariance matrix based pre-conditioning to deal with noisy gradients in mini-batches. They argued that if the covariance is large, the gradient oscillates strongly, leading to inefficient progress of the update directions. The gradients are pre-conditioned as shown in equation 5; the covariance and mean are given in equation 4.
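The idea of damping strongly oscillating coordinates can be illustrated with a small sketch. It is written under explicit assumptions and is not a transcription of equations 4 and 5: the class name, the exponentially weighted mean and variance estimates, the hyper-parameters `gamma` and `eps`, and the choice to divide by the square root of a diagonal covariance estimate are all illustrative.

```python
import numpy as np


class VariancePreconditionedSGD:
    """Scale each gradient coordinate by a running estimate of its std. dev."""

    def __init__(self, lr: float = 0.01, gamma: float = 0.9, eps: float = 1e-8):
        self.lr, self.gamma, self.eps = lr, gamma, eps
        self.mean = None  # running mean of the gradient
        self.var = None   # running diagonal covariance of the gradient

    def step(self, params: np.ndarray, grad: np.ndarray) -> np.ndarray:
        if self.mean is None:
            self.mean = np.zeros_like(grad)
            self.var = np.zeros_like(grad)
        delta = grad - self.mean
        # Exponentially weighted incremental variance and mean.
        self.var = self.gamma * (self.var + (1 - self.gamma) * delta ** 2)
        self.mean = self.mean + (1 - self.gamma) * delta
        return params - self.lr * grad / (np.sqrt(self.var) + self.eps)


opt = VariancePreconditionedSGD(lr=0.01, gamma=0.9)
params = np.zeros(2)
for t in range(50):
    # Coordinate 0 flips sign every step (noisy); coordinate 1 is steady.
    grad = np.array([(-1.0) ** t, 0.1])
    new_params = opt.step(params, grad)
    last_update = np.abs(new_params - params)
    params = new_params
print(last_update)  # the oscillating coordinate receives the smaller step
```

In this toy run the coordinate whose gradient keeps changing sign accumulates a large variance and is stepped cautiously, while the coordinate with a consistent gradient direction is stepped more aggressively.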


We divided the PASCAL VOC 2007 dataset into three parts (DS1, DS2, DS3). The networks were trained in the following ways:

  1. We train on DS1 for 100 iterations. The trained net is then trained further on DS2, and the result in turn on DS3. In all cases, a learning rate with linear decay from 0.01 to 0.005 was used and the weights were updated according to equation 2. This is referred to as 1LR.

  2. In this case, the net trained on DS1 is used for DS2 in the same way, but the weights are updated according to equation 3. We name this 2LR.

  3. In this case, we train on DS1, DS2 and DS3 (referred to as j=1,2,3) and update the weights as discussed in equation 4, rewritten as equation 7. The final weights are updated as shown in equation 8. This process is abbreviated as 3LR. The results are given in Table 4. As observed from these results, the NoC gave better results with 3LR.
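The linearly decaying learning rate used in the 1LR setting can be written as a short helper. This is a sketch under assumptions: the function name is hypothetical, and restarting the schedule for each of DS1, DS2 and DS3 is an assumption, since the text only states the start and end values and the 100 iterations used for DS1.

```python
def linear_decay_lr(step: int, total_steps: int,
                    start: float = 0.01, end: float = 0.005) -> float:
    """Learning rate that falls linearly from `start` to `end` over `total_steps`."""
    if total_steps < 2:
        return start
    # Clamp the step so that values outside the schedule reuse its end points.
    frac = min(max(step, 0), total_steps - 1) / (total_steps - 1)
    return start + frac * (end - start)


# One schedule of 100 iterations, restarted for each data subset (assumption).
for subset in ("DS1", "DS2", "DS3"):
    schedule = [linear_decay_lr(i, 100) for i in range(100)]
    print(subset, schedule[0], round(schedule[-1], 6))
```

Each subset starts again from the weights left by the previous one, so only the learning rate, not the network, is reset between subsets.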

Method Learning Rate CTS OTS PTS
1conv-f4096-f4096-f16(1C3fc) 1LR 79.4 80 73
1conv-f4096-f4096-f16(1C3fc) 2LR 80 80 73
1conv-f4096-f4096-f16(1C3fc) 3LR 83 81 75
1conv-1maxPool-1conv-f4096-f4096-f16(1M1) 3LR 78 76.6 73.8
Table 4: Accuracy on different test sets with different NoCs trained with different learning rates

C. The same experiments (A and B) were performed on blurred data. When blurred data was given to networks trained on normal images, classification accuracy was poor, as shown in Table 5. Hence, the networks were also trained using blurred images. Accuracies for the different combinations are presented in Tables 6, 7 and 8. All these results are shown using 1C3fc. They use features of unblurred (referred to as Normal) or blurred data extracted from the last layer of a net trained with Normal or blurred data. The extracted features were used for testing by giving them as input to an SVM trained on normal or blurred data. The various combinations are listed below:
1. Normal data, normal net and SVM trained on normal data (N-N-N).
2. Normal data, blur net and SVM trained on blur data (N-B-B).
3. Blur data, normal net and SVM trained on normal data (B-N-N).
4. Blur data, blur net and SVM trained on blur data (B-B-B).
From all the above results, it is seen that blurred data does not give good results when tested using a net trained on normal data. Normal data, however, performs similarly on both the normal and the blur net. The losses obtained from networks having 1 and 2 convolutional layers, trained with normal and blurred data with different learning rates, are shown in Figures 4 and 5 respectively. The training loss converges better for 1LR and 3LR than for 2LR, for both 1C3fc and 1M1. The training losses and t-SNE plots, along with the test accuracies, also point towards the inference that 3LR with 1C3fc is the most suitable among the options considered here.

Learning Rate/testsets CTS OTS PTS
1LR 59.4 68 60
2LR 58.2 68.5 56
3LR 59 69 62
Table 5: B-N-N
Learning Rate/testsets CTS OTS PTS
1LR 79.4 80 73
2LR 80 80 73
3LR 83 81 74
Table 6: N-N-N
Learning Rate/testsets CTS OTS PTS
1LR 70 73 68.8
2LR 72.5 74.3 71.4
3LR 73 70 71.6
Table 7: N-B-B
Learning Rate/testsets CTS OTS PTS
2LR 70.6 73 72.5
3LR 73.7 74.6 72
Table 8: B-B-B
Figure 4: Losses for different learning methods for NoC (1C3fc) trained with normal and blur data.
Figure 5: Losses for learning methods (2LR and 3LR) for NoC (1M1) trained with normal and blur data.
Figure 6: t-SNE distribution for the subset of training data extracted from the NoC with 3LR
Figure 7: t-SNE for normal data extracted from the NoC trained on normal data with different learning methods
Figure 8: t-SNE for normal data extracted from the NoC trained on blur data with different learning methods
Figure 9: t-SNE for blur data extracted from the NoC trained on blur nets with the 3LR learning method
Figure 10: t-SNE for blur data extracted from the NoC trained on normal nets with the 3LR learning method
Figure 11: t-SNE for normal data extracted from the NoC trained on normal nets with the 3LR learning method

Test data are visualized using t-SNE (t-distributed Stochastic Neighbor Embedding), a dimensionality reduction algorithm well suited to visualizing high-dimensional data in a scatter plot. The idea is to embed high-dimensional points into 2 or 3 dimensions in such a way that similarities among points are retained: nearby points in the high-dimensional space correspond to nearby embedded points, and distant points correspond to distant embedded points. To show the data distribution graphically, the t-SNE for a subset of the training data is presented in Figure 6, and those for the NoC with different datasets are shown in Figures 7, 8, 9, 10 and 11. The t-SNE for OTS in the case of 1C3fc with 3LR gives good, well-separated clusters for every class.

D. In this paper, features of intensity images extracted from the 5th convolutional layer are fused (added) to orientation features extracted using optical flow. The feature map is obtained as shown in equation 9. The whole process is shown in Figure 12 with the classifier network 1C3fc. Accuracies for RGB only and for RGB+OF are shown in Table 9.
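The difference between fusing by summation and fusing by concatenation can be seen from the resulting shapes. The sketch below is illustrative: the function names are hypothetical, the 512 x 14 x 14 feature map shape and the random placeholder features are assumptions, and how the orientation image is passed through the network before fusion is not modelled.

```python
import numpy as np


def flow_orientation(flow: np.ndarray) -> np.ndarray:
    """Per-pixel orientation in radians of a flow field of shape (H, W, 2)."""
    # flow[..., 0] is the horizontal and flow[..., 1] the vertical component.
    return np.arctan2(flow[..., 1], flow[..., 0])


def fuse_by_sum(rgb_feat: np.ndarray, of_feat: np.ndarray) -> np.ndarray:
    """Element-wise sum of two feature maps with identical (C, H, W) shape."""
    if rgb_feat.shape != of_feat.shape:
        raise ValueError("summation fusion needs feature maps of equal shape")
    return rgb_feat + of_feat


# Placeholder conv5-like feature maps; the shape is an assumption.
rng = np.random.default_rng(0)
rgb_feat = rng.standard_normal((512, 14, 14))
of_feat = rng.standard_normal((512, 14, 14))

fused = fuse_by_sum(rgb_feat, of_feat)
concat = np.concatenate([rgb_feat, of_feat], axis=0)
# Summation keeps the channel count; concatenation doubles it.
print(fused.shape, concat.shape)
```

Because the fused map has the same shape as each input, the classifier that follows needs no additional input weights, whereas concatenation would double the input width of the next layer.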


Figure 12: Multimodal object detection and classification using RGB and optical flow features

The same experiments were performed for blurred images, as presented in Tables 10, 11 and 12.

Method Learning Rate CTS:RGB+OF CTS:RGB OTS:RGB+OF OTS:RGB PTS:RGB+OF PTS:RGB
1conv-f4096-f4096-f16 3LR 84 83 79.8 81 74.8 74
Table 9: Accuracy of NoC with optical features (Normal dataset normal nets normal SVM)
Method Learning Rate CTS:RGB+OF CTS:RGB OTS:RGB+OF OTS:RGB PTS:RGB+OF PTS:RGB
1conv-f4096-f4096-f16 3LR 58.4 59 64.6 69 60.5 62
Table 10: Accuracy of NoC with optical features (Blur dataset normal nets normal SVM)
Method Learning Rate CTS:RGB+OF CTS:RGB OTS:RGB+OF OTS:RGB PTS:RGB+OF PTS:RGB
1conv-f4096-f4096-f16 3LR 75.8 73 73.3 70 72 71.6
Table 11: Accuracy of NoC with optical features (Normal dataset blur nets blur SVM)
Method Learning Rate CTS:RGB+OF CTS:RGB OTS:RGB+OF OTS:RGB PTS:RGB+OF PTS:RGB
1conv-f4096-f4096-f16 3LR 77 73.7 75 74.6 72.5 72
Table 12: Accuracy of NoC with optical features (Blur dataset blur nets blur SVM)

Figure 13 shows a comparison of detection between RCNN, Fast RCNN, Faster RCNN, YOLO and NoC (1C3fc). Table 13 shows the detection accuracy of all these methods.

Figure 13: Comparison of our NoC with other object detection methods
Method/Test Sets CTS OTS PTS
RCNN 40 54 60.8
Fast RCNN 54 61 68.7
Faster RCNN 60 66 69.9
Yolo 65 68 74
NoC 83 81 74
Table 13: Comparison of accuracies of NoC with other object detection methods

5 Conclusion

This paper discussed simple data collection and sampling tricks to apply prior to training. Extensive experiments were performed on different convolutional classification architectures with various learning rates. The results show that 1C3fc with 3LR gives relatively better performance. We also used blurred data to train these NoCs. It was observed that a blur net can be used for blurred as well as unblurred data, whereas a network trained with normal data fails to handle blurred data. Further, optical flow features computed for training normal as well as blurred NoCs prove to be beneficial. Pre-conditioning with first order gradients for adaptive learning rates is also used to deal with saddle points. 3LR with 1C3fc outperforms the others in terms of training loss convergence at early iterations.


  • Bernal et al. (2013) Bernal, J., Sánchez, J., & Vilarino, F. (2013). Impact of image preprocessing methods on polyp localization in colonoscopy frames. In Engineering in Medicine and Biology Society (EMBC), 2013 35th Annual International Conference of the IEEE (pp. 7350–7354). IEEE.
  • Cao et al. (2017) Cao, Y., Shen, C., & Shen, H. T. (2017). Exploiting depth from single monocular images for object detection and semantic segmentation. IEEE Transactions on Image Processing, 26, 836–846.
  • Chao (2018) Chao, G. (2018). Discriminative k-means Laplacian clustering. Neural Processing Letters (pp. 1–13).
  • Cho et al. (2012) Cho, H., Wang, J., & Lee, S. (2012). Text image deblurring using text-specific properties. In European Conference on Computer Vision (pp. 524–537). Springer.
  • Dauphin et al. (2015) Dauphin, Y., de Vries, H., & Bengio, Y. (2015). Equilibrated adaptive learning rates for non-convex optimization. In Advances in neural information processing systems (pp. 1504–1512).
  • Gao et al. (2017) Gao, Z., Lu, G., & Yan, P. (2017). Key-frame selection for video summarization: an approach of multidimensional time series analysis. Multidimensional Systems and Signal Processing (pp. 1–21).
  • GUO et al. (2017) GUO, Q., LIANG, Z., & HU, J. (2017). Vehicle classification with convolutional neural network on motion blurred images. DEStech Transactions on Computer Science and Engineering.
  • He et al. (2017) He, K., Gkioxari, G., Dollár, P., & Girshick, R. (2017). Mask R-CNN. arXiv preprint arXiv:1703.06870.
  • Huo et al. (2016) Huo, Z., Xia, Y., & Zhang, B. (2016). Vehicle type classification and attribute prediction using multi-task rcnn. In Image and Signal Processing, BioMedical Engineering and Informatics (CISP-BMEI), International Congress on (pp. 564–569). IEEE.
  • Ida et al. (2017) Ida, Y., Fujiwara, Y., & Iwamura, S. (2017). Adaptive learning rate via covariance matrix based preconditioning for deep neural networks. In Proceedings of the 26th International Joint Conference on Artificial Intelligence (pp. 1923–1929). AAAI Press.
  • Kang et al. (2017a) Kang, D., Emmons, J., Abuzaid, F., Bailis, P., & Zaharia, M. (2017a). Optimizing deep CNN-based queries over video streams at scale. arXiv preprint arXiv:1703.02529.
  • Kang et al. (2017b) Kang, K., Li, H., Yan, J., Zeng, X., Yang, B., Xiao, T., Zhang, C., Wang, Z., Wang, R., Wang, X. et al. (2017b). T-CNN: Tubelets with convolutional neural networks for object detection from videos. IEEE Transactions on Circuits and Systems for Video Technology.
  • Li et al. (2017a) Li, H., Fu, K., Yan, M., Sun, X., Sun, H., & Diao, W. (2017a). Vehicle detection in remote sensing images using denoizing-based convolutional neural networks. Remote Sensing Letters, 8, 262–270.
  • Li et al. (2017b) Li, J., Wei, Y., Liang, X., Dong, J., Xu, T., Feng, J., & Yan, S. (2017b). Attentive contexts for object detection. IEEE Transactions on Multimedia, 19, 944–954.
  • Li et al. (2017c) Li, X., Li, L., Flohr, F., Wang, J., Xiong, H., Bernhard, M., Pan, S., Gavrila, D. M., & Li, K. (2017c). A unified framework for concurrent pedestrian and cyclist detection. IEEE transactions on intelligent transportation systems, 18, 269–281.
  • Muller & Savakis (2017) Muller, P., & Savakis, A. (2017). Flowdometry: An optical flow and deep learning based approach to visual odometry. In Applications of Computer Vision (WACV), 2017 IEEE Winter Conference on (pp. 624–631). IEEE.
  • Nam et al. (2014) Nam, W., Dollár, P., & Han, J. H. (2014). Local decorrelation for improved pedestrian detection. In Advances in Neural Information Processing Systems (pp. 424–432).
  • Niessner et al. (2017) Niessner, R., Schilling, H., & Jutzi, B. (2017). Investigations on the potential of convolutional neural networks for vehicle classification based on rgb and lidar data. ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences, 4, 115.
  • Pop et al. (2017) Pop, D. O., Rogozan, A., Nashashibi, F., & Bensrhair, A. (2017). Incremental cross-modality deep learning for pedestrian recognition. In IV’17-IEEE Intelligent Vehicles Symposium.
  • Rauf et al. (2016) Rauf, R., Shahid, A. R., Ziauddin, S., & Safi, A. A. (2016). Pedestrian detection using hog, luv and optical flow as features with adaboost as classifier. In Image Processing Theory Tools and Applications (IPTA), 2016 6th International Conference on (pp. 1–4). IEEE.
  • Redmon & Farhadi (2016) Redmon, J., & Farhadi, A. (2016). Yolo9000: better, faster, stronger. arXiv preprint arXiv:1612.08242.
  • Ren et al. (2015) Ren, S., He, K., Girshick, R., & Sun, J. (2015). Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems (pp. 91–99).
  • Sarkar et al. (2017) Sarkar, S., Venugopalan, V., Reddy, K., Ryde, J., Jaitly, N., & Giering, M. (2017). Deep learning for automated occlusion edge detection in rgb-d frames. Journal of Signal Processing Systems, 88, 205–217.
  • Vargas et al. (2010) Vargas, M., Milla, J. M., Toral, S. L., & Barrero, F. (2010). An enhanced background estimation algorithm for vehicle detection in urban traffic scenes. IEEE Transactions on Vehicular Technology, 59, 3694–3709.
  • Wang et al. (2016) Wang, S., Liu, F., Gan, Z., & Cui, Z. (2016). Vehicle type classification via adaptive feature clustering for traffic surveillance video. In Wireless Communications & Signal Processing (WCSP), 2016 8th International Conference on (pp. 1–5). IEEE.
  • Xu et al. (2013) Xu, L., Zheng, S., & Jia, J. (2013). Unnatural l0 sparse representation for natural image deblurring. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 1107–1114).
  • Yao et al. (2017) Yao, Y., Tian, B., & Wang, F.-Y. (2017). Coupled multivehicle detection and classification with prior objectness measure. IEEE Transactions on Vehicular Technology, 66, 1975–1984.
  • Zhang et al. (2016a) Zhang, X., Zou, J., He, K., & Sun, J. (2016a). Accelerating very deep convolutional networks for classification and detection. IEEE transactions on pattern analysis and machine intelligence, 38, 1943–1955.
  • Zhang et al. (2016b) Zhang, Z., He, Z., Cao, G., & Cao, W. (2016b). Animal detection from highly cluttered natural scenes using spatiotemporal object region proposals and patch verification. IEEE Transactions on Multimedia, 18, 2079–2092.
  • Zheng et al. (2013) Zheng, S., Xu, L., & Jia, J. (2013). Forward motion deblurring. In Proceedings of the IEEE international conference on computer vision (pp. 1465–1472).
  • Zhou et al. (2007) Zhou, J., Gao, D., & Zhang, D. (2007). Moving vehicle detection for automatic traffic monitoring. IEEE transactions on vehicular technology, 56, 51–59.
  • Zhuo et al. (2017) Zhuo, L., Jiang, L., Zhu, Z., Li, J., Zhang, J., & Long, H. (2017). Vehicle classification for large-scale traffic surveillance videos using convolutional neural networks. Machine Vision and Applications (pp. 1–10).
Baljit Kaur is pursuing her Ph.D. in the Computer Science department at Thapar Institute of Engg and Tech, Patiala. She received her M.Tech degree in Information Technology from Guru Nanak Dev University, Amritsar, and her B.Tech degree in Computer Science from Amritsar College of Engg and Technology, Amritsar. She has five years of teaching experience. Her research area is image processing, focused on augmented map based intelligent navigation systems.
Jhilik Bhattacharya works as an assistant professor in the Computer Science department at Thapar Institute of Engg and Tech, Patiala. She received her Ph.D. degree in Computer Science from NIT, Durgapur. She has 10 years of research and teaching experience. Her research interests include image processing, computer vision, and pattern recognition.