A Real-Time Deep Network for Crowd Counting
Abstract
Automatic analysis of highly crowded scenes has attracted extensive attention in computer vision research. Previous approaches to crowd counting have achieved promising performance across various benchmarks. However, in real deployments the model should run as fast as possible while maintaining accuracy. In this paper, we propose a compact convolutional neural network for crowd counting that learns an efficient model with a small number of parameters. With three parallel filters convolving the input image simultaneously at the front of the network, our model achieves nearly real-time speed and saves computing resources. Experiments on two benchmarks show that the proposed method not only strikes a balance between performance and efficiency, which makes it more suitable for practical scenes, but also outperforms existing lightweight models in speed.
Xiaowen Shi, Xin Li, Caili Wu, Shuchen Kong, Jing Yang, Liang He
Shanghai Key Laboratory of Multidimensional Information Processing
East China Normal University, Shanghai, China
Videt Tech Ltd., Shanghai, China
Crowd counting, compact convolutional neural network.
1 Introduction
By analyzing and understanding crowd behavior and congestion levels in detail, preventable calamities such as stampedes can be mitigated, which is of great value for public security. The growing demand for responsive and efficient crowd counting applications that help contain the harm of emergencies poses a significant challenge for this vision task.
The existing methods for crowd counting can be divided into two groups: count-oriented approaches and density-oriented approaches. Count-oriented approaches simply output the number of people by using a detector to detect objects in a sliding window that glides across the entire image. However, when the crowd is extremely dense, the spatial distribution differs almost completely from image to image, which makes count-oriented approaches ineffective. Density-oriented approaches instead present the spatial information through a density map that indicates the number of people across the whole image. This density map provides more accurate and comprehensive information, which can be crucial for making correct decisions in highly varied crowded scenes.
With the recent development of convolutional neural networks (CNNs), researchers have employed CNNs to accurately estimate the crowd count from images or videos [1, 2, 3, 4, 5]. However, it remains challenging to deal with scale variations in static images, especially in diversified scenes with different camera perspectives and irregular crowd clusters. For this reason, many previous works adopt multi-scale architectures [6, 1, 2, 7] as the backbone. Although they outperform other kinds of methods, such complex models with a large number of parameters are likely to be time-consuming and to yield suboptimal solutions, which makes them inappropriate for applications that need fast response. In short, they are still far from the balance of accuracy and efficiency required in practical scenes.
In this paper, we propose the compact convolutional neural network (CCNN), which simplifies the multi-branch CNN model. It involves a small number of parameters and achieves satisfying performance. The network uses three filters with different local receptive field sizes in one layer. The resulting feature maps are merged directly after this layer and then fed into a CNN structure to fit a density map. Compared with existing methods, the proposed model achieves substantially faster speed while maintaining accuracy.
The contributions of this work are as follows:

Our network is simple enough to be trained efficiently compared with multi-CNN frameworks, which need a pretrained model for each branch. In addition, it requires fewer computing resources and is more practical.

We strike a balance between the efficiency of the model and the accuracy of the estimated count, so that our model delivers accurate results efficiently.
2 Related Work
Traditional approaches. Early research [8] adopted a detection-style framework for counting. These methods detected the presence of a pedestrian in a sliding window by training a classifier on features extracted from the whole pedestrian. However, it is difficult to count the exact number of people when most of the targets are heavily occluded in highly congested scenes. Researchers therefore began to use features of specific body parts to construct boosted classifiers [9]. Although this modification improved detection-based approaches, their performance remained poor in extremely dense scenes, so researchers designed regression-based approaches that directly map features extracted from image patches to scalar counts [10, 11].
Nevertheless, regression-based methods cannot perceive crowd distributions, as they ignore spatial information and regress only the global count. Density estimation-based approaches, which perform pixel-wise regression, were therefore developed. Linear mapping [12] and then non-linear mapping [13] methods were used for density estimation.
CNN-based methods. With the breakthrough of deep learning in computer vision [14], researchers began to use convolutional neural networks as feature extractors for crowd counting [7, 1, 15]. They adopted multiple CNN branches with different receptive fields to enable multi-scale adaptation, then combined the output feature maps of the different levels and mapped them to a density map. These methods achieved excellent performance on highly congested scenes, but they need to pretrain each single network before global optimization. Moreover, learning different features in each column is inefficient, and the redundant parameters have a negative impact on the final performance. Such models are also impractical in the real world because of their low speed and high inference latency. As a remedy, single-branch counting networks with scale adaptation were proposed. Cao et al. [16] computed high-quality density maps with a new encoder-decoder network and an SSIM-based local pattern consistency loss. However, their network still suffers from a large number of parameters.
Unlike the approaches mentioned above, our work specifically aims to reduce the number of network parameters by designing a sparse network structure. Specifically, we use three parallel filters of different sizes and directly produce a merged feature map in one step. In this way, we can exploit sparsity at the filter level to optimize parallel computing and increase the network's adaptability to scale, so that less time is spent on training and optimization.
3 Proposed Approach
3.1 Compact Convolutional Neural Network
In a typical multi-scale CNN architecture, features are extracted by a multi-column CNN whose columns have receptive fields of different sizes. In contrast, we perform multiple convolutions in parallel on the input image and combine all the outputs into a single deep feature map, which prevents the number of parameters from growing explosively.
The overall architecture of our framework is illustrated in Figure 1. The network can be divided into two components: the parallel convolution layer with different kernel sizes and the convolution and pooling layers that follow. In the front part, the red layer in Figure 1 is designed to attend to large receptive fields, the green one targets dense crowds, and the blue one targets highly congested scenes. Each of them is followed by a 2 × 2 max-pooling layer. After features are extracted at the various receptive fields, the feature maps are merged (feature fusion) and passed to the follow-up layers, which perform downsampling. We find that a single convolution layer is enough to extract different spatial features and improves the efficiency of feature extraction compared with multiple branches. This is why the network is both fast and accurate. The latter part consists of 6 convolutional layers. Note that the third and fourth layers are each followed by a max-pooling layer with a 2 × 2 kernel. The last convolution layer uses a 1 × 1 filter to aggregate the feature maps into a density map.
Compared with multi-column CNNs, our method has several advantages. First, we find that merging the feature maps right after the first receptive fields at the head of the network outperforms merging after a series of further convolutions in each branch. Our analysis suggests that a single convolution layer extracts more comprehensive image details, whereas a deep stack of convolutions captures local features well but disrupts the spatial information. Second, the multi-column architecture suffers from redundant and repeated filters in each column, and our approach can be seen as discarding these extra convolution operations. Third, our approach makes a reasonable trade-off between model performance and the number of network parameters. The experiments validate the effectiveness of the proposed structure.
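The front part described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the kernel sizes 9, 7 and 5 are taken from the ablation study, while the single-channel input, one output channel per filter, the random weights, the zero padding and the ReLU activation are all assumptions made to keep the example small.

```python
import numpy as np

def conv2d_same(image, kernel):
    """Naive single-channel 2-D cross-correlation with zero padding (output keeps the input size)."""
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(image, ((ph, ph), (pw, pw)))
    out = np.zeros(image.shape, dtype=float)
    for dy in range(kh):
        for dx in range(kw):
            out += kernel[dy, dx] * padded[dy:dy + image.shape[0], dx:dx + image.shape[1]]
    return out

def max_pool_2x2(fmap):
    """Non-overlapping 2 x 2 max pooling (odd trailing rows/columns are dropped)."""
    h2, w2 = fmap.shape[0] // 2, fmap.shape[1] // 2
    return fmap[:h2 * 2, :w2 * 2].reshape(h2, 2, w2, 2).max(axis=(1, 3))

def parallel_front_end(image, kernel_sizes=(9, 7, 5), seed=0):
    """Apply filters of several sizes to the same input, pool each result, then fuse them."""
    rng = np.random.default_rng(seed)
    branches = []
    for k in kernel_sizes:
        kernel = rng.standard_normal((k, k)) / (k * k)      # random stand-in for learned weights
        fmap = np.maximum(conv2d_same(image, kernel), 0.0)  # ReLU (assumed activation)
        branches.append(max_pool_2x2(fmap))
    return np.stack(branches, axis=0)                       # fusion = stacking along the channel axis

image = np.random.default_rng(1).random((32, 48))
fused = parallel_front_end(image)
print(fused.shape)
```

Because every branch sees the same input and is pooled identically, the three maps share one spatial size and can be stacked without any resizing; the stacked tensor is what the subsequent convolution layers would consume.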
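The parameter argument can be made concrete with simple bookkeeping. The sketch below is a toy comparison, not the configuration of CCNN or of any published multi-column network: the RGB input, the width of 16 channels per first-layer filter, and the three 3 × 3 trunk layers are hypothetical values chosen only to show how sharing one trunk after early fusion changes the count.

```python
def conv_params(kernel_size, in_channels, out_channels):
    """Weights plus biases of one square convolution layer."""
    return kernel_size * kernel_size * in_channels * out_channels + out_channels

FRONT_KERNELS = (9, 7, 5)  # filter sizes examined in the ablation study
WIDTH = 16                 # hypothetical output channels of each first-layer filter
TRUNK = (32, 16, 8)        # hypothetical widths of the 3 x 3 layers that follow

def trunk_params(in_channels):
    """Parameters of the stack of 3 x 3 layers for a given number of input channels."""
    total = 0
    for out_channels in TRUNK:
        total += conv_params(3, in_channels, out_channels)
        in_channels = out_channels
    return total

first_layers = sum(conv_params(k, 3, WIDTH) for k in FRONT_KERNELS)

# Multi-column: every branch keeps its own trunk, then a 1 x 1 layer fuses the columns.
multi_column = (first_layers
                + len(FRONT_KERNELS) * trunk_params(WIDTH)
                + conv_params(1, len(FRONT_KERNELS) * TRUNK[-1], 1))

# Early fusion: branch outputs are concatenated once and share a single trunk.
early_fusion = (first_layers
                + trunk_params(len(FRONT_KERNELS) * WIDTH)
                + conv_params(1, TRUNK[-1], 1))

print(multi_column, early_fusion)
```

The size of the gap depends entirely on the chosen widths; the point of the sketch is only that the layers after the fusion point are paid for once instead of once per column.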
3.2 Implementation Details
Ground truth generation
We use geometry-adaptive kernels to generate the ground truth density maps of highly congested scenes. Applying a Gaussian kernel blurs each head annotation, so the generated ground truth density maps carry spatial distribution information across the whole image. This alleviates the difficulty of the regression, because the network fits more accurate and comprehensive information rather than predicting the exact point of each head annotation directly. The geometry-adaptive kernels are defined as
$$F(x) = \sum_{i=1}^{N} \delta(x - x_i) * G_{\sigma_i}(x), \quad \text{with } \sigma_i = \beta \bar{d}_i,$$
where $x_i$ stands for a target object in the ground truth $\delta$. To convert $\delta(x - x_i)$ into a density map, we convolve it with a Gaussian kernel $G_{\sigma_i}$ with a standard deviation of $\sigma_i$. Here $\bar{d}_i$ means the average distance to the $k$ nearest neighbors of target object $x_i$. In this experiment, we create density maps with $\sigma_i = 15$.
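A minimal sketch of ground truth generation with geometry-adaptive kernels follows. It assumes head annotations given as (row, column) pixel coordinates inside the image. The defaults `k=3` and `beta=0.3`, the `fixed_sigma` switch, the one-pixel floor on the standard deviation, and the renormalisation of each Gaussian over the image are illustrative choices rather than settings confirmed by the paper.

```python
import numpy as np

def density_map(points, shape, k=3, beta=0.3, fixed_sigma=None):
    """Place one normalised Gaussian per annotated head.

    points      : iterable of (row, col) head positions inside the image
    shape       : (height, width) of the density map
    fixed_sigma : if given, use this constant standard deviation for every head
    """
    points = np.asarray(points, dtype=float).reshape(-1, 2)
    rows = np.arange(shape[0])[:, None]
    cols = np.arange(shape[1])[None, :]
    density = np.zeros(shape, dtype=float)
    for i, (r, c) in enumerate(points):
        if fixed_sigma is not None:
            sigma = float(fixed_sigma)
        elif len(points) > 1:
            # Mean distance to the k nearest neighbours (index 0 is the point itself).
            dists = np.sort(np.hypot(*(points - points[i]).T))[1:k + 1]
            sigma = beta * dists.mean()
        else:
            sigma = 1.0  # arbitrary fallback when there is a single annotation
        sigma = max(sigma, 1.0)  # assumed floor so the blob never collapses below one pixel
        blob = np.exp(-((rows - r) ** 2 + (cols - c) ** 2) / (2.0 * sigma ** 2))
        density += blob / blob.sum()  # each head contributes exactly 1 to the total
    return density

heads = [(10, 10), (12, 30), (40, 25), (41, 27)]
adaptive = density_map(heads, (64, 64))
constant = density_map(heads, (64, 64), fixed_sigma=4.0)
print(adaptive.sum(), constant.sum())
```

Since every Gaussian is normalised to unit mass, summing the map recovers the number of annotated heads, which is the property that lets a predicted density map be integrated into a count.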
Training details
The feature maps output by our model are mapped to density maps by filters of size 1 × 1, and we use the Euclidean distance to measure the difference between the output density map and the corresponding ground truth. We define the loss function as follows,
$$L(\Theta) = \frac{1}{2N} \sum_{i=1}^{N} \left\| F(X_i; \Theta) - F_i \right\|_2^2,$$
where $N$ is the number of training samples, $F_i$ is the ground truth density map of image $X_i$, and $F(X_i; \Theta)$ is the estimated density map, parameterized with $\Theta$, for the sample $X_i$. During training, we set the batch size to 8 and use Adam with learning rate of .
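The Euclidean loss between predicted and ground truth density maps can be sketched in a few lines of NumPy. The 1/(2N) scaling follows the convention used in MCNN [7] and is an assumption here, as are the function name and the batch layout (N, H, W).

```python
import numpy as np

def euclidean_loss(pred, gt):
    """Sum of squared pixel differences per sample, averaged over the batch and halved."""
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    n = pred.shape[0]
    diff = (pred - gt).reshape(n, -1)
    return float((diff ** 2).sum() / (2.0 * n))

pred = np.zeros((2, 4, 4))
gt = np.ones((2, 4, 4))
print(euclidean_loss(pred, gt))
```

The loss is computed pixel by pixel, so the ground truth map has to be brought to the same resolution as the network output before the two are compared.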
4 Experiments
4.1 Results and Comparison
We conduct a comprehensive study on the ShanghaiTech dataset [7] and the WorldExpo’10 dataset [17]. We denote our approach as CCNN in the following comparisons and use MAE and MSE as evaluation metrics.
The ShanghaiTech dataset [7] contains 1,198 images with 330,165 annotated people. The dataset consists of Part A and Part B. Part A includes 482 crowd images, with 300 training images and 182 testing images, while Part B contains 716 images, divided into a training set of 400 images and a testing set of 316 images. First, we compare our method with four other lightweight networks; the results are shown in the upper part of Table 1. CCNN, with its simple architecture, achieves the lowest MAE on Part A and both the lowest MAE and MSE on Part B among them. Note that the parameter size of our model is also the smallest. We also compare with some large networks; the results are shown in the bottom part of Table 1. Although the deeper models generally achieve better performance, their parameter sizes are roughly 70 to nearly 1,000 times larger than ours. Some qualitative results are presented in Figure 2.
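The two metrics are not defined in the text. The sketch below assumes the convention common in the crowd counting literature (for example MCNN [7]), in which both are computed on per-image counts and MSE denotes the square root of the mean squared error; the function names are illustrative.

```python
import numpy as np

def mae(pred_counts, true_counts):
    """Mean absolute error between predicted and annotated counts."""
    pred = np.asarray(pred_counts, dtype=float)
    true = np.asarray(true_counts, dtype=float)
    return float(np.mean(np.abs(pred - true)))

def mse(pred_counts, true_counts):
    """Root of the mean squared count error (called MSE by convention in this literature)."""
    pred = np.asarray(pred_counts, dtype=float)
    true = np.asarray(true_counts, dtype=float)
    return float(np.sqrt(np.mean((pred - true) ** 2)))

# A predicted count is the sum over the estimated density map of an image.
pred_counts = [10.0, 20.0]
true_counts = [12.0, 17.0]
print(mae(pred_counts, true_counts), mse(pred_counts, true_counts))
```

MAE reflects the accuracy of the estimates, while the squared term makes MSE more sensitive to occasional large errors, which is why it is usually read as a robustness indicator.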
Method  Part A MAE  Part A MSE  Part B MAE  Part B MSE  Parameter size
CMTL [18]  101.3  152.4  20.0  31.1  2.36M
Zhang et al. [17]  181.8  277.7  32.0  49.8  0.62M
MCNN [7]  110.2  173.2  26.4  41.3  0.15M
TDF-CNN [19]  97.5  145.1  20.7  32.8  0.13M
CCNN  88.1  141.7  14.9  22.1  0.07M
ACSCP [20]  75.7  102.7  17.2  27.4  5.10M
Switching CNN [1]  90.4  135.0  21.6  33.4  15.30M
CSRNet [21]  68.3  115.0  10.6  16.0  16.26M
SaCNN [22]  86.8  139.2  16.2  25.8  24.06M
CP-CNN [2]  73.6  106.4  20.1  30.1  68.40M
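As a quick check of the size gap between CCNN and the large networks, the ratios can be computed directly from the parameter sizes listed in Table 1 (in millions); only the variable names are introduced here.

```python
# Parameter sizes in millions, copied from Table 1.
large_models = {
    "ACSCP": 5.10,
    "Switching CNN": 15.30,
    "CSRNet": 16.26,
    "SaCNN": 24.06,
    "CP-CNN": 68.40,
}
ccnn = 0.07

ratios = {name: size / ccnn for name, size in large_models.items()}
for name, ratio in sorted(ratios.items(), key=lambda item: item[1]):
    print(f"{name}: about {ratio:.0f} times the parameters of CCNN")
```

The ratios span a wide range, from the smallest large model (ACSCP) to the largest (CP-CNN), so a single multiplier only describes the middle of the group.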
The WorldExpo’10 dataset [17] contains 1,132 annotated images captured by 108 surveillance cameras at the 2010 Shanghai World Expo. The dataset is divided into 5 different scenes, marked as S1 to S5 in Table 2. CCNN delivers the best results in scenes 3 and 4 among the lightweight models, which are shown in the top part of the table. It also achieves the best average accuracy among them with the smallest parameter size. Compared with the large networks, our approach outperforms them in scene 4 with a parameter count more than two hundred times smaller. This result reinforces the fact that CCNN is lightweight without sacrificing too much accuracy.
Method  S1  S2  S3  S4  S5  Avg.  Params
Zhang et al. [17]  9.8  14.1  14.3  22.2  3.7  12.9  0.62M
MCNN [7]  3.4  20.6  12.9  13.0  8.1  11.6  0.15M
TDF-CNN [19]  2.7  23.4  10.7  17.6  3.3  11.5  0.13M
CCNN (ours)  3.8  20.5  8.8  8.8  7.7  9.9  0.07M
CSRNet [21]  2.9  11.5  8.6  16.6  3.4  8.6  16.26M
SaCNN [22]  2.6  13.5  10.6  12.5  3.3  8.5  24.06M
CP-CNN [2]  2.9  14.7  10.5  10.4  5.8  8.86  68.40M
4.2 Ablation Study
In this part, we perform an ablation study on the ShanghaiTech Part A dataset to analyze the proposed framework.
Network architecture. To evaluate the effect of the three filters of different sizes, we train CCNN separately with a single filter of size 5 × 5, 7 × 7, or 9 × 9, and with all three filters together. Figure 3 shows the comparison. We observe that CCNN with only the 5 × 5 filter outperforms the variants with only the 7 × 7 or only the 9 × 9 filter. A plausible explanation is that the 5 × 5 filter captures crowds at smaller scales within the scene and is better at extracting the characteristics of highly congested scenes. With all three filters, the result is an MAE of 88.08 and an MSE of 141.72, which is the lowest. This shows that the structure of three independent filters of different sizes adapts better to scale and perspective variations.
Effect of the pooling operation. Figure 3 also shows that without the last pooling operation we obtain an MAE of 98.9 and an MSE of 160.4, which is inferior to the complete model. This indicates that the last pooling layer plays an indispensable role in the model, and that the current architecture balances operational speed and prediction accuracy well. The reason is that the pooling layer provides invariance to small local translations, which lets the model attend to the presence of a feature rather than its exact location, a property that is especially useful in crowd counting.
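The tolerance of max pooling to small shifts can be seen on a toy feature map. This sketch only illustrates the general property of 2 × 2 max pooling and does not reproduce the ablation itself; the 4 × 4 arrays are made-up inputs.

```python
import numpy as np

def max_pool_2x2(fmap):
    """Non-overlapping 2 x 2 max pooling."""
    h2, w2 = fmap.shape[0] // 2, fmap.shape[1] // 2
    return fmap[:h2 * 2, :w2 * 2].reshape(h2, 2, w2, 2).max(axis=(1, 3))

base = np.zeros((4, 4))
base[0, 0] = 1.0            # a single activation

shifted_inside = np.zeros((4, 4))
shifted_inside[1, 1] = 1.0  # same activation moved within the same pooling window

shifted_across = np.zeros((4, 4))
shifted_across[1, 2] = 1.0  # same activation moved into the neighbouring window

same = np.array_equal(max_pool_2x2(base), max_pool_2x2(shifted_inside))
differs = not np.array_equal(max_pool_2x2(base), max_pool_2x2(shifted_across))
print(same, differs)
```

A shift that stays inside one pooling window leaves the pooled map unchanged, while a shift across a window boundary moves the response by one pooled cell, so the invariance is local rather than global.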
4.3 Speed Comparison
In this section, we compare CCNN with two other crowd counting methods, MCNN [7] and CMTL [18]. We choose them mainly because both combine a relatively small number of parameters with comparatively low MAE and MSE in Table 1. We use FPS (frames per second), the most commonly used metric for measuring model speed, to assess our model fairly. All three methods are tested on ShanghaiTech Part A and Part B, and the reported figures are computed by running through the 182 test images of Part A and the 316 test images of Part B at their original resolution (768 × 1024). All experiments were conducted under the same conditions on a server with a GeForce GTX 1080 GPU and an Intel i5-8500 CPU @ 3.00 GHz × 6. The overall speed comparison with the other state-of-the-art models is given in Table 3.
As can be seen from Table 3, our method achieves the highest average speed, 104.16 fps, which is much higher than that of the other methods. With a total time of 2.14 s on the entire ShanghaiTech Part A test set and 4.39 s on Part B, the prediction speed is more than 10 times faster than that of the other models. These results confirm the value of our work, because real-time processing is only possible at high speed, which is extremely important for some application scenarios.
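A throughput measurement of this kind can be sketched as a small timing harness. The paper does not describe its timing procedure, so the function name, the warm-up pass and the dummy predictor below are hypothetical; on a GPU, asynchronous execution would additionally require synchronising the device before each clock reading.

```python
import time

def measure_fps(predict, images, warmup=1):
    """Return processed images per second for a callable `predict`."""
    for image in images[:warmup]:
        predict(image)  # warm-up calls are excluded from the timing
    start = time.perf_counter()
    for image in images:
        predict(image)
    elapsed = time.perf_counter() - start
    return len(images) / elapsed if elapsed > 0 else float("inf")

# Dummy stand-in for a counting model: sums the values of a fake "image".
fake_images = [list(range(1000)) for _ in range(20)]
fps = measure_fps(sum, fake_images)
print(f"{fps:.1f} images per second")
```

Whether image loading and preprocessing fall inside the timed loop changes the result considerably, so any comparison between models should keep that choice identical for all of them.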
5 Conclusions
In this paper, we present a compact CNN for crowd counting to address the lack of real-time performance in existing methods. By removing redundant and repeated convolutional layers and designing a local sparse structure, we significantly reduce the parameter size. Specifically, we use a structure of multiple juxtaposed convolutions, in which feature maps extracted by three parallel convolutional layers with receptive fields of different sizes are fused directly. Compared with the baseline approaches, the proposed model obtains a significant improvement.
References
[1] D. B. Sam, S. Surya, and R. V. Babu, “Switching convolutional neural network for crowd counting,” in CVPR, 2017, pp. 4031–4039.
[2] V. A. Sindagi and V. M. Patel, “Generating high-quality crowd density maps using contextual pyramid cnns,” in ICCV, 2017, pp. 1861–1870.
[3] E. Walach and L. Wolf, “Learning to count with cnn boosting,” in ECCV, 2016, pp. 660–676.
[4] X. Wu, Y. Zheng, H. Ye, W. Hu, J. Yang, and L. He, “Adaptive scenario discovery for crowd counting,” in ICASSP, 2019, pp. 2382–2386.
[5] X. Wu, B. Xu, Y. Zheng, H. Ye, J. Yang, and L. He, “Fast video crowd counting with a temporal aware network,” arXiv:1907.02198, 2019.
[6] L. Boominathan, S. S. Kruthiventi, and R. V. Babu, “Crowdnet: A deep convolutional network for dense crowd counting,” in ACM MM, 2016, pp. 640–644.
[7] Y. Zhang, D. Zhou, S. Chen, S. Gao, and Y. Ma, “Single-image crowd counting via multi-column convolutional neural network,” in CVPR, 2016, pp. 589–597.
[8] P. Dollar, C. Wojek, B. Schiele, and P. Perona, “Pedestrian detection: An evaluation of the state of the art,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 4, pp. 743–761, 2012.
[9] B. Wu and R. Nevatia, “Detection and tracking of multiple, partially occluded humans by bayesian combination of edgelet based part detectors,” International Journal of Computer Vision, vol. 75, no. 2, pp. 247–266, 2007.
[10] A. B. Chan and N. Vasconcelos, “Bayesian poisson regression for crowd counting,” in ICCV, 2009, pp. 545–551.
[11] H. Idrees, I. Saleemi, C. Seibert, and M. Shah, “Multi-source multi-scale counting in extremely dense crowd images,” in CVPR, 2013, pp. 2547–2554.
[12] V. Lempitsky and A. Zisserman, “Learning to count objects in images,” in NeurIPS, 2010, pp. 1324–1332.
[13] V.-Q. Pham, T. Kozakaya, O. Yamaguchi, and R. Okada, “Count forest: Co-voting uncertain number of targets using random forest for crowd density estimation,” in ICCV, 2015, pp. 3253–3261.
[14] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in NeurIPS, 2012, pp. 1097–1105.
[15] D. Babu Sam, N. N. Sajjan, R. Venkatesh Babu, and M. Srinivasan, “Divide and grow: Capturing huge diversity in crowd images with incrementally growing cnn,” in CVPR, 2018, pp. 3618–3626.
[16] X. Cao, Z. Wang, Y. Zhao, and F. Su, “Scale aggregation network for accurate and efficient crowd counting,” in ECCV, 2018, pp. 734–750.
[17] C. Zhang, H. Li, X. Wang, and X. Yang, “Cross-scene crowd counting via deep convolutional neural networks,” in CVPR, 2015, pp. 833–841.
[18] V. A. Sindagi and V. M. Patel, “Cnn-based cascaded multi-task learning of high-level prior and density estimation for crowd counting,” in AVSS, 2017, pp. 1–6.
[19] D. B. Sam and R. V. Babu, “Top-down feedback for crowd counting convolutional neural network,” in AAAI, 2018.
[20] Z. Shen, Y. Xu, B. Ni, M. Wang, J. Hu, and X. Yang, “Crowd counting via adversarial cross-scale consistency pursuit,” in CVPR, 2018, pp. 5245–5254.
[21] Y. Li, X. Zhang, and D. Chen, “Csrnet: Dilated convolutional neural networks for understanding the highly congested scenes,” in CVPR, 2018, pp. 1091–1100.
[22] L. Zhang, M. Shi, and Q. Chen, “Crowd counting via scale-adaptive convolutional neural network,” in WACV, 2018, pp. 1113–1121.