Learn to Scale: Generating Multipolar Normalized Density Map for Crowd Counting
Abstract
Dense crowd counting aims to predict thousands of human instances from an image by integrating a density map over image pixels. Existing approaches mainly suffer from extreme density variances. Such density pattern shift poses challenges even for multi-scale model ensembling. In this paper, we propose a simple yet effective approach to tackle this problem. First, a patch-level density map is extracted by a density estimation model and is further grouped into several density levels, which are determined over full datasets. Second, each patch density map is automatically normalized by an online center learning strategy with a multipolar center loss (MPCL). Such a design significantly condenses the density distribution into several clusters and enables the density variance to be learned by a single model. Extensive experiments show that the proposed framework achieves the best accuracy on several crowd counting datasets, with relative accuracy gains of 4.2%, 14.3%, 27.1%, and 20.1% over the state-of-the-art approaches on the ShanghaiTech Part A, ShanghaiTech Part B, UCF_CC_50, and UCF-QNRF datasets, respectively.
1 Introduction
A robust crowd counting system is of significant value in many real-world applications such as video surveillance, security alerting, and event planning. In recent years, deep learning based approaches have become the mainstream of crowd counting because of the powerful representation learning ability of convolutional neural networks (CNNs). To estimate the count, predominant approaches generate a density map with a CNN, from which the count of instances can be obtained by integrating over image pixels.
Although crowd counting has been extensively studied, handling the large density variances that cause huge density pattern shifts in crowd images is still an open issue. As illustrated in Fig. 1 (a), the densities of crowd image patches can vary significantly, from rather sparse (e.g., ShanghaiTech Part B) to extremely dense (e.g., UCF-QNRF). Such large density pattern shifts pose great challenges to density prediction by a single CNN model, due to its fixed receptive field sizes. Remarkable progress has been achieved by learning a density map through designing multi-scale architectures [23] or aggregating multi-scale features [3, 33], which indicates that the ability to cope with density variation is crucial for crowd counting methods. Although density maps with multiple scales can be generated and aggregated, robustness is still hard to ensure when the density variance increases substantially. As shown in Fig. 1 (b), most recent works obtain a higher MRE on datasets with larger density variances, where MRE is calculated as MAE/P, MAE denotes the standard Mean Absolute Error, and P is the average count of a dataset. This indicates that extreme density variance and pattern shift remain a huge challenge in crowd counting.
In this paper, we propose a simple yet effective method to alleviate the problem of extreme density variances. The core idea is learning to scale (denoted as L2S) image patches so that the density distribution condenses into several clusters, thereby reducing the density variance. The scale factor of each image patch is learned automatically during training, under the supervision of a novel multipolar center loss (MPCL). More specifically, all the patches from each density level are optimized to approach a density center, which is updated online by computing a mean value for each density level.
In particular, the proposed L2S framework consists of two closely related steps. First, given an image, an initial density map is generated by a CNN model. The density map is then divided into patches, and all the patch-level density maps are further evenly divided into groups according to their density levels. Second, each patch is scaled by a learned scale factor, so that the density of the patch converges to the center of its density level. The final density map for the input image is obtained by concatenating the patch-level density maps.
We conduct experiments on several popular benchmark datasets, including ShanghaiTech [33], UCF_CC_50 [9], and UCF-QNRF [11]. Extensive evaluations show significantly superior performance over prior art. Moreover, cross-dataset validation further demonstrates that the proposed L2S framework transfers well. In summary, the main contributions of this paper are twofold:

We propose a Learning to Scale Module (L2SM) to address the density variation issue in crowd counting. With L2SM, different regions are automatically scaled so that they have similar densities, and the quality of their density maps is significantly improved. L2SM is end-to-end trainable when added to a CNN model for density estimation.

The proposed L2SM significantly outperforms state-of-the-art methods for crowd counting on three widely adopted challenging datasets, demonstrating its effectiveness in handling density variation. Furthermore, L2SM also transfers well under cross-dataset validation, showing the generality of the proposed method.
2 Related Work
Crowd counting has attracted much attention in computer vision. Early methods frame the counting problem as a detection task [7, 29] that explicitly detects individual heads, which has major difficulty with occlusion and dense areas. Regression-based methods [4, 6, 8, 10] greatly improve the counting performance on dense areas via different regression functions such as Gaussian process, ridge regression, and random forest regression. Recently, with the development of deep learning, mainstream crowd counting has switched to CNN-based methods [21, 32, 2, 33, 31, 5, 18]. These CNN-based methods address crowd counting by regressing the density map representation introduced by Lempitsky et al. [14], and achieve better accuracy and transferability than the classical methods. Recent methods mainly focus on two challenging aspects faced by current CNN-based methods: huge scale and density variance, and severe overfitting.
Methods addressing huge scale and density variance. Multi-scale change is a challenging problem for many vision tasks including crowd counting. It is difficult to accurately count the small heads in dense areas. Many methods attempt to handle huge scale variance. They can be roughly divided into two categories: methods that explicitly rely on scale information and methods that implicitly cope with multi-scale changes.
1) Some methods explicitly make use of scale information for crowd counting. For instance, the methods in [32, 19] adopt deep CNNs using provided geometric or perspective information. Yet, this scale-related information is not always readily available. In [28, 23], the authors first use a network to estimate the scale level and density degree for the corresponding whole or partial region based on a manually set scale degree. The obtained scale-related information is then fused with another network or used to design different networks for dividing and counting. To overcome the difficulty in manually setting the scale degree, the authors of [1] design an incrementally growing CNN to deal with areas of different density degrees without involving any handcrafted steps.
2) Other works aim to implicitly cope with the multi-scale problem. Zhang et al. [33] propose a multi-column CNN to extract multi-scale features and fuse them together for density map estimation. Each column is limited to covering a subset of the scale variance. To improve on the multi-column structure, the authors of [3] propose a multi-stage structure in which each stage exploits multi-column convolution and combines the multi-scale features to regress the density map. The method in [17] encodes the scale of the contextual information required to accurately predict crowd density. In [15], the authors propose to increase the receptive field size of the CNN to better leverage multi-scale information. In addition to these specific network designs for implicitly handling the multi-scale problem, Shen et al. [24] introduce an ad hoc term in the training loss function to pursue cross-scale consistency. In [11], the authors propose to adopt variant ground-truth density map representations with different Gaussian kernel sizes to better deal with density map estimation in areas of different density levels.
Methods alleviating severe overfitting. It is well known that deep CNNs [13, 26] usually struggle with overfitting on small datasets. Current CNN-based crowd counting methods also face this challenge due to the small size and limited variety of existing datasets, leading to weak performance and transferability. To overcome overfitting, Liu et al. [18] propose a learning-to-rank framework that leverages abundantly available unlabeled crowd imagery and self-learning. In [25], the authors build a set of decorrelated regressors with reasonable generalization capabilities by managing their intrinsic diversities to avoid severe overfitting.
Though many methods have been proposed to tackle the large scale and density variation issue, this problem remains challenging for crowd counting. The proposed method also attempts to address this issue. Different from previous methods [33, 23, 27, 1, 3, 16], we mimic a rational human behavior in crowd counting by learning to scale dense regions before counting. We compute the scale ratios with a novel use of multipolar center loss [30] to explicitly bring all the regions of significantly varied density to multiple similar density levels. This results in robust density estimation on dense regions and appealing transferability.
3 Method
3.1 Overview
The mainstream crowd counting methods model the problem as density map regression using a CNN. For a given image, the ground-truth density map is obtained by spreading binary head locations to nearby regions with Gaussian kernels. For sparse regions, the ground-truth density at a pixel depends on only one person, resulting in regular Gaussian blobs. For dense regions, multiple crowded heads may spread to the same nearby pixel, yielding a high ground-truth density whose pattern differs greatly from that of sparse regions. This density pattern shift makes it difficult to accurately predict the density map for both dense and sparse regions in the same way.
To improve the counting accuracy, we aim to tackle the problem of pattern shift due to large density variations, improving the prediction for highly dense regions. Specifically, the proposed method mimics a rational behaviour of humans when counting crowds. Given a crowd image, we tend to begin by dividing the image into partitions of different crowding levels before attempting to count the people. For sparse regions with large heads, it is easy to count the people directly on the original region. For dense regions composed of crowded small heads, however, we need to zoom in on the region for more accurate counting. An example of this counting behaviour is depicted in Fig. 2.
We propose a network to mimic such human behaviour for crowd counting. The overall pipeline is depicted in Fig. 3 and consists of two modules: 1) Scale preserving network (SPN), presented in Sec. 3.2. We leverage multi-scale feature fusion to generate an initial density map prediction, which provides accurate predictions on sparse regions and indicates the density distribution over the image; 2) Learning to scale module (L2SM), detailed in Sec. 3.3. We divide the image into non-overlapping regions and select some dense regions (based on the initial density estimation) for which the density map is repredicted. Specifically, we leverage SPN to compute a scaling factor for each selected dense region, and scale the ground-truth density map by changing the distance between blobs while keeping the same peaks. The density reprediction for the selected regions is then performed on the scaled features. The key to this reprediction process lies in computing appropriate scaling factors. For that, we adopt the center loss to centralize the density distributions into multipolar centers, alleviating the density pattern shift issue and thus improving the prediction accuracy. The whole network is end-to-end trainable, and its training objective is described in Sec. 3.4.
3.2 Scale Preserving Network
We follow the mainstream crowd counting methods by regressing a density map. Precisely, we use geometry-adaptive kernels to generate ground-truth density maps in highly congested scenes. For a given image containing $n$ persons, the ground-truth annotation can be represented via a delta function on each pixel $p$: $H(p) = \sum_{i=1}^{n} \delta(p - p_i)$, where $p_i$ is the annotated location of the $i$-th person. The density map $D$ on each pixel $p$ is then generated by convolving $H$ with a Gaussian kernel $G_\sigma$: $D(p) = \sum_{i=1}^{n} \delta(p - p_i) * G_\sigma$, where $\sigma$ is the spread parameter of the Gaussian kernel.
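The density map construction described above can be illustrated with a minimal sketch: each annotated head is spread with a normalized Gaussian, so the map sums to the person count. The sketch assumes a fixed spread `sigma` instead of the geometry-adaptive kernels used in the paper, and the names `gaussian_kernel` and `density_map` are hypothetical.

```python
import math

def gaussian_kernel(sigma, radius):
    """Return a normalized (2*radius+1) x (2*radius+1) Gaussian kernel."""
    size = 2 * radius + 1
    kernel = [[math.exp(-((x - radius) ** 2 + (y - radius) ** 2) / (2 * sigma ** 2))
               for x in range(size)] for y in range(size)]
    total = sum(map(sum, kernel))
    return [[v / total for v in row] for row in kernel]

def density_map(heads, height, width, sigma=2.0, radius=6):
    """Spread every head location (row, col) with the same Gaussian kernel.

    Assumption: a fixed sigma; the paper uses geometry-adaptive kernels.
    """
    kernel = gaussian_kernel(sigma, radius)
    dmap = [[0.0] * width for _ in range(height)]
    for row, col in heads:
        for dy in range(-radius, radius + 1):
            for dx in range(-radius, radius + 1):
                y, x = row + dy, col + dx
                if 0 <= y < height and 0 <= x < width:
                    dmap[y][x] += kernel[dy + radius][dx + radius]
    return dmap

# Three heads, all far enough from the border that no mass is clipped.
heads = [(10, 10), (12, 14), (30, 25)]
dmap = density_map(heads, 40, 40)
count = sum(map(sum, dmap))  # integrates to the number of heads (up to rounding)
```

Because each kernel is normalized, summing the map recovers the count; heads close to the border would lose some mass in this simplified version.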
We develop a CNN to regress the density map. For a fair comparison with most methods, we adopt VGG16 [26] as the backbone network. We discard the pooling layer between stage4 and stage5, as well as the last pooling layer and the fully connected layers that follow, to preserve accurate spatial information. It is well known that deep layers in a CNN encode more semantic and high-level information, while shallow layers provide more precise localization information. We extract features from different stages by applying convolutions on the last layer of each stage. We then pool the features extracted from stage1 to stage5 to a fixed spatial size per stage, which results in a pyramid structure. Each spatial unit in the pooled feature indicates the density level, and hence the scale, of the underlying region mapped to the original image. These pooled scale-preserving features are then upsampled to the size of conv5 by bilinear interpolation and stacked together with the features in conv5. We then feed the stacked feature to three successive convolutions and one deconvolution layer to regress the density map.
3.3 Learning to Scale Module
The initial density prediction is accurate on sparse regions thanks to the regular individual Gaussian blobs, but it is less accurate on dense regions composed of crowded heads lying very close to each other. As indicated in Sec. 3.1, this triggers the pattern shift on the target density map. Following the rational human behaviour in crowd counting, we zoom in on the dense regions for better counting accuracy. On the zoomed version, the distance between nearby heads is enlarged, which results in regular individual Gaussian blobs in the target density map and alleviates the density pattern shift. Such density pattern modulation facilitates the prediction. Inspired by this, we first evenly divide the image domain into $K \times K$ (e.g., $K = 4$) non-overlapping regions. We then select the dense regions based on the average initial density $\bar{D}_k = \frac{1}{A(R_k)} \sum_{p \in R_k} \hat{D}(p)$ of each region $R_k$, where $\hat{D}$ is the initial density prediction and $A(R_k)$ denotes the area of region $R_k$.
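The region division and the average density used for selection can be sketched as follows. This is only an illustration: it assumes the map height and width are divisible by the grid size, and the function names and the threshold value are hypothetical.

```python
def region_average_density(dmap, K):
    """Average density of each cell in an evenly divided K x K grid.

    Assumption: height and width of dmap are divisible by K.
    """
    height, width = len(dmap), len(dmap[0])
    rh, rw = height // K, width // K
    averages = [[0.0] * K for _ in range(K)]
    for i in range(K):
        for j in range(K):
            total = sum(dmap[y][x]
                        for y in range(i * rh, (i + 1) * rh)
                        for x in range(j * rw, (j + 1) * rw))
            averages[i][j] = total / (rh * rw)  # sum over region / region area
    return averages

def select_dense_regions(averages, threshold):
    """Return grid indices whose average density exceeds the threshold."""
    return [(i, j) for i, row in enumerate(averages)
            for j, value in enumerate(row) if value > threshold]

initial = [[0.0, 0.0, 1.0, 1.0],
           [0.0, 0.0, 1.0, 1.0],
           [0.5, 0.5, 0.0, 0.0],
           [0.5, 0.5, 0.0, 0.0]]
averages = region_average_density(initial, K=2)
dense = select_dense_regions(averages, threshold=0.4)  # illustrative threshold
```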
We propose to mimic human behaviour in crowd counting by learning to scale the selected dense regions. For that, we first leverage the scale-preserving pyramid features described in Sec. 3.2 to compute the scaling ratio $r_k$ for each selected region $R_k$. Precisely, we downsample/upsample the pooled features described in Sec. 3.2 to $K \times K$, the size of the region grid, and concatenate them together. This is followed by a convolution to produce the scale factor map $r$. Each value in this map represents the scaling ratio for the underlying region.
Once we have the scale factor map $r$, we scale the feature on each selected region accordingly through bilinear upsampling. Based on the scaled feature map corresponding to each selected region $R_k$, we apply five successive convolutions to repredict the density map for the scaled region. We then resize the repredicted density map to the original size of $R_k$ and multiply the density on each pixel by $r_k^2$, the square of the region's scaling ratio, to preserve the same counting result. The initial prediction on the selected regions is replaced with the resized density map reprediction.
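Why the resized density must be multiplied by the squared scaling ratio can be seen in a small sketch. It assumes an integer ratio and uses block averaging as a stand-in for the bilinear resizing mentioned above; `resize_back` is a hypothetical name.

```python
def resize_back(scaled, ratio, compensate=True):
    """Shrink a density map by an integer ratio using block averaging.

    Assumption: block averaging replaces bilinear resizing for simplicity.
    With compensate=True each pixel is multiplied by ratio**2 so the
    integrated count of the region is unchanged.
    """
    height, width = len(scaled) // ratio, len(scaled[0]) // ratio
    out = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            block = [scaled[y * ratio + dy][x * ratio + dx]
                     for dy in range(ratio) for dx in range(ratio)]
            value = sum(block) / (ratio * ratio)  # plain resizing keeps the mean
            out[y][x] = value * ratio * ratio if compensate else value
    return out

# A 4x4 reprediction on a region zoomed by ratio 2; it integrates to 4 people.
scaled = [[0.25] * 4 for _ in range(4)]
count_scaled = sum(map(sum, scaled))
count_plain = sum(map(sum, resize_back(scaled, 2, compensate=False)))
count_fixed = sum(map(sum, resize_back(scaled, 2)))
```

Plain resizing keeps the per-pixel density but shrinks the area by the squared ratio, so the count drops by the same factor; the multiplication restores it.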
To guide the density map reprediction on the selected regions, we also adjust the ground-truth density map for each region accordingly. For each selected region $R_k$ with scaling ratio $r_k$, instead of straightforwardly scaling the ground-truth density map in the same way as the feature map, we first scale the binary head location map and then recompute the ground-truth density map for the scaled region by $D'_k(p) = \sum_{i=1}^{n_k} \delta(p - r_k p_i) * G_\sigma$, where $p_i$ are the head locations in $R_k$, $G_\sigma$ is the Gaussian kernel, and $n_k$ is the number of people in $R_k$. As shown in Fig. 4, such a ground-truth transformation reduces the density pattern gap between sparse regions and dense regions, facilitating the density map reprediction.
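The effect of the ground-truth transformation can be illustrated in one dimension. In this hedged sketch, two nearby heads produce a merged blob whose peak is far above that of a single Gaussian; after scaling the head locations and recomputing the density with the same kernel, the blobs separate and the peak returns to the single-head value while the count is preserved. The fixed `sigma`, the 1-D setting, and all names are assumptions made for illustration.

```python
import math

def gauss(x, mu, sigma):
    """Value of a unit-mass 1-D Gaussian centred at mu."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def density_1d(heads, length, sigma=2.0):
    """Ground-truth density on a 1-D grid: one Gaussian per head."""
    return [sum(gauss(x, h, sigma) for h in heads) for x in range(length)]

heads = [20, 22]                           # two crowded heads, 2 pixels apart
ratio = 4                                  # illustrative scaling ratio
scaled_heads = [h * ratio for h in heads]  # scale locations, not the density

d_orig = density_1d(heads, 48)
d_trans = density_1d(scaled_heads, 48 * ratio)

single_peak = gauss(0, 0, 2.0)  # peak of an isolated head
peak_orig = max(d_orig)         # merged blob, well above single_peak
peak_trans = max(d_trans)       # separated blobs, close to single_peak
count_trans = sum(d_trans)      # still integrates to two people
```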
The main issue of this density map reprediction by learning to scale dense regions is to compute appropriate scale ratios for the selected dense regions. Yet, there is no explicit target scale suggesting how much a region should ideally be zoomed. We would like the estimated average density to approach the ground-truth average density on the $k$-th region. The relative density degree of region $R_k$ can be well reflected by $d_k = \bar{D}_k / r_k^2$, where $\bar{D}_k$ is the average density of $R_k$ and $r_k$ is its scaling ratio. If we make the value of $d_k$ for each region close to one of multiple learnable centers, then we centralize all the selected regions to multiple similar density levels, alleviating the large density pattern shift and thus improving the prediction accuracy. This motivates us to resort to a center loss on $d_k$ with multipolar centers. Put simply, we attempt to centralize all the selected regions into $C$ centers according to their average density, which acts as an unsupervised clustering.
We first initialize the $C$ centers with increasing random values for increasingly dense regions. Then, for each center $c_i$, we follow the standard process of using center loss and update the center at the $(t+1)$-th iteration as below:
$$c_i^{t+1} = c_i^{t} + \alpha \cdot \frac{\sum_{k=1}^{N_i} \left( \bar{D}_k / r_k^2 - c_i^{t} \right)}{1 + N_i} \tag{1}$$
where $N_i$, $\bar{D}_k$, and $r_k$ refer to the number of regions, the average density, and the scaling ratio for the $k$-th region, respectively, that will be centralized to the $i$-th center in an underlying image, and $\alpha$ denotes the learning rate for updating each center. During each iteration, we use the selected dense regions to compute the center loss with multiple centers and update the network parameters as well as the centers. The supervision through the center loss with multiple centers is the key to bringing all the selected regions to multiple similar density levels, leading to robust density estimation.
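A center update of this kind can be sketched in a few lines. The sketch assumes that the relative density of a region is its average density divided by the squared scaling ratio and that the update follows the standard center-loss rule, with the regions already assigned to one center; the function names and numbers are hypothetical.

```python
def relative_density(avg_density, ratio):
    """Average density a region would have after zooming by `ratio`."""
    return avg_density / ratio ** 2

def update_center(center, rel_densities, alpha):
    """One center-loss style update (assumed form, not the authors' code).

    The center moves toward the mean offset of its assigned regions; the
    `1 +` in the denominator follows the standard center-loss rule.
    """
    n = len(rel_densities)
    step = sum(d - center for d in rel_densities) / (1 + n)
    return center + alpha * step

# Two regions assigned to the same center: (average density, scaling ratio).
regions = [(8.0, 2.0), (27.0, 3.0)]
rel = [relative_density(d, r) for d, r in regions]
new_center = update_center(1.0, rel, alpha=0.5)
```

With these illustrative numbers the relative densities are 2.0 and 3.0, so a center at 1.0 moves to 1.5.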
3.4 Training objective
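The whole network is end-to-end trainable and involves three loss functions: 1) an L2 loss $L_D$ for the initial prediction of the density map; 2) an L2 loss $L_{reD}$ for density map reprediction on the selected regions, computed between the repredicted density map on each scaled selected region and its transformed ground truth; 3) a center loss $L_c$ on the relative density level of the selected regions, computed by:
$$L_c = \sum_{i=1}^{C} \sum_{k=1}^{N_i} \left\| \bar{D}_k / r_k^2 - c_i \right\|_2^2 \tag{2}$$
where $c_i$ is the $i$-th of the $C$ centers, $N_i$ is the number of regions assigned to it, and $\bar{D}_k / r_k^2$ is the relative density of region $R_k$, i.e., its average density divided by its squared scaling ratio. The final loss function for the whole network is the combination of the above three losses, given by:
$$L = L_D + \lambda_1 L_{reD} + \lambda_2 L_c \tag{3}$$
where $\lambda_1$ and $\lambda_2$ are two hyperparameters. Note that we optimize the loss function in Eq. (3) to update not only the overall network parameters but also the centers $c_i$.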
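The way the three losses combine can be sketched as below. This is a sketch under stated assumptions, not the authors' implementation: the L2 losses are taken as summed squared errors, the center loss as a summed squared distance between relative densities and their centers, and all names and values are hypothetical.

```python
def l2_loss(pred, target):
    """Summed squared error between two density maps (assumed L2 form)."""
    return sum((p - t) ** 2
               for pred_row, target_row in zip(pred, target)
               for p, t in zip(pred_row, target_row))

def center_loss(groups, centers):
    """Squared distance of each region's relative density to its center.

    groups[i] holds the relative densities assigned to centers[i].
    """
    return sum((d - c) ** 2
               for densities, c in zip(groups, centers)
               for d in densities)

def total_loss(loss_init, loss_repred, loss_center, lam1=1.0, lam2=1.0):
    """Weighted sum of the initial, reprediction, and center losses."""
    return loss_init + lam1 * loss_repred + lam2 * loss_center

loss_init = l2_loss([[1.0, 2.0]], [[0.0, 2.0]])
loss_repred = l2_loss([[0.5, 0.5]], [[1.0, 0.0]])
loss_center = center_loss([[2.0, 3.0]], [2.5])
total = total_loss(loss_init, loss_repred, loss_center, lam1=1.0, lam2=0.5)
```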
Table 1
Method  Part A MAE  Part A MSE  Part B MAE  Part B MSE  UCF_CC_50 MAE  UCF_CC_50 MSE  UCF-QNRF MAE  UCF-QNRF MSE
MCNN [33]  110.2  173.2  26.4  41.3  377.6  509.1  277  -
CMTL [27]  101.3  152.4  20.0  31.1  322.8  397.9  252  514
Switch-CNN [23]  90.4  135.0  21.6  33.4  318.1  439.2  228  445
CP-CNN [28]  73.6  112.0  20.1  30.1  298.8  320.9  -  -
ACSCP [24]  75.7  102.7  17.2  27.4  291.0  404.6  -  -
L2R [18]  73.6  112.0  13.7  21.4  279.6  388.9  -  -
D-ConvNet-v1 [25]  73.5  112.3  18.7  26.0  288.4  404.7  -  -
CSRNet [15]  68.2  115.0  10.6  16.0  266.1  397.5  -  -
ic-CNN [22]  69.8  117.3  10.7  16.0  260.9  365.5  -  -
SANet [3]  67.0  104.5  8.4  13.6  258.4  334.9  -  -
CL [11]  -  -  -  -  -  -  132  191
VGG16 (ours)  72.9  114.5  12.1  20.5  225.4  372.5  120.6  205.2
SPN (ours)  70.0  106.3  9.1  14.6  204.7  340.4  110.3  184.6
SPN+L2SM (ours)  64.2  98.4  7.2  11.1  188.4  315.3  104.7  173.6
Table 2
Method  SPN  L2SM (G=3)  L2SM (G=4)  L2SM/S2AD (G=5)  L2SM/S2AD (G=5)  L2SM/S2AD (G=5)  L2SM/S2AD (G=5)  L2SM/S2AD (G=5)
MAE  70.0  65.1  66.1  67.2/68.9  65.4/68.1  64.2/67.0  67.1/69.2  69.8/73.6
MSE  106.3  100.4  103.5  102.3/110.3  100.7/107.3  98.4/105.4  101.6/108.7  104.5/113.5
Cost time (s)  0.524  0.576  0.569  0.539/0.540  0.550/0.551  0.565/0.563  0.583/0.580  0.592/0.587
Table 3
Setting  MAE  MSE
68.0  107.1  
67.2  106.3  
67.9  106.9  
68.5  109.1 
Table 4
Method  Part A→Part B MAE  Part A→Part B MSE  Part B→Part A MAE  Part B→Part A MSE  Part A→UCF_CC_50 MAE  Part A→UCF_CC_50 MSE  UCF-QNRF→Part A MAE  UCF-QNRF→Part A MSE  Part A→UCF-QNRF MAE  Part A→UCF-QNRF MSE
MCNN [33]  85.2  142.3  221.4  357.8  397.7  624.1  -  -  -  -
D-ConvNet-v1 [25]  49.1  99.2  140.4  226.1  364  545.8  -  -  -  -
L2R [18]  -  -  -  -  337.6  434.3  -  -  -  -
SPN (ours)  23.8  44.2  131.2  219.3  368.3  588.4  87.9  126.3  236.3  428.4
SPN+L2SM (ours)  21.2  38.7  126.8  203.9  332.4  425.0  73.4  119.4  227.2  405.2
4 Experiments
4.1 Datasets and Evaluation Metrics
We conduct experiments on three widely adopted benchmark datasets, ShanghaiTech [33], UCF_CC_50 [9], and UCF-QNRF [11], to demonstrate the effectiveness of the proposed method. These three datasets and the adopted evaluation metrics are briefly described in the following.
ShanghaiTech Dataset. The ShanghaiTech crowd counting dataset [33] consists of 1198 annotated images with a total of 330,165 people, divided into two parts. Part A contains 482 images randomly crawled from the Internet, of which 300 are used for training and the remaining 182 for testing. Part B includes 716 images taken from busy streets of metropolitan areas in Shanghai, of which 400 are used for training and 316 for testing. The dataset also provides annotations in the form of head coordinates in each image.
UCF_CC_50 Dataset. This dataset is a collection of 50 images of very crowded scenes [9], with the number of people per image varying significantly from 94 to 4543. The large variance in the total number of people and the small number of images make this dataset very challenging. Following classical benchmarks on this dataset, we use 5-fold cross-validation to evaluate the performance of our method.
UCF-QNRF Dataset. The UCF-QNRF dataset [11] is the largest dataset to date, containing 1535 images which are divided into 1201 training and 334 testing images. The number of people in an image varies from 49 to 12,865, so this dataset features huge density variation. Furthermore, the images in this dataset also vary greatly in resolution.
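Evaluation metrics. We employ two standard metrics, i.e., Mean Absolute Error (MAE) and Mean Squared Error (MSE). MAE and MSE are defined as
$$\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} \left| C_i - \hat{C}_i \right|, \qquad \mathrm{MSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( C_i - \hat{C}_i \right)^2} \tag{4}$$
where $C_i$ (resp. $\hat{C}_i$) represents the ground-truth (resp. estimated) number of pedestrians in the $i$-th image, and $N$ is the total number of test images.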
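The two metrics can be computed as in the short sketch below. It assumes the convention common in crowd counting, where the reported "MSE" is the square root of the mean squared count error; the counts are made-up illustrative values.

```python
import math

def mae(gt_counts, est_counts):
    """Mean absolute error between ground-truth and estimated counts."""
    return sum(abs(g - e) for g, e in zip(gt_counts, est_counts)) / len(gt_counts)

def mse(gt_counts, est_counts):
    """Root of the mean squared count error (the crowd-counting 'MSE')."""
    squared = sum((g - e) ** 2 for g, e in zip(gt_counts, est_counts))
    return math.sqrt(squared / len(gt_counts))

gt = [100, 200, 300]    # illustrative ground-truth counts
est = [110, 190, 330]   # illustrative estimated counts
print(round(mae(gt, est), 3), round(mse(gt, est), 3))
```

The squared term makes the second metric more sensitive to a few images with large errors, which is why it is often read as a robustness indicator.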
4.2 Implementation Details
We follow the setting in [15] to generate the ground-truth density map. For a given dataset, we first evenly divide all the images in the dataset into $G$ groups of regions with increasing density, and then attempt to centralize the $C$ most dense groups of regions to $C$ similar density levels (i.e., the $C$ centers involved in the center loss), respectively. In the following, unless otherwise specified, $G$ is set to 5 and $C$ is set to 3 for all involved datasets except UCF_CC_50. Since images in the UCF_CC_50 dataset consist of crowded people over the whole image domain, we centralize all regions to similar density levels. Unless otherwise specified, the hyperparameter $K$ involved in dividing each image into $K \times K$ regions is set to 4.
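One plausible reading of the grouping step is sketched below: region densities collected over a dataset are split into equally sized groups, and the lower bound of the densest groups serves as the selection threshold. Equal-size (quantile) grouping is an assumption, and the function names and density values are hypothetical.

```python
def group_thresholds(avg_densities, G):
    """Lower density bound of each of G equally sized groups.

    Assumption: 'evenly divide' means equal numbers of regions per group.
    """
    ordered = sorted(avg_densities)
    n = len(ordered)
    return [ordered[(g * n) // G] for g in range(G)]

def selection_threshold(avg_densities, G, C):
    """Density above which a region belongs to the C densest groups."""
    return group_thresholds(avg_densities, G)[G - C]

# Illustrative average densities of regions gathered over a dataset.
densities = [0.9, 0.1, 0.5, 0.3, 0.7, 0.0, 0.2, 0.4, 0.6, 0.8]
bounds = group_thresholds(densities, G=5)
tau = selection_threshold(densities, G=5, C=3)
selected = sorted(d for d in densities if d >= tau)
```

With five groups and three centers, the regions at or above the third-highest group bound are the ones passed to the reprediction branch in this sketch.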
The loss function described in Eq. (3) is used for model training. We set the first hyperparameter $\lambda_1$ to 1 and discuss the impact of the second hyperparameter $\lambda_2$, which weights the center loss, in the following. We use the Adam [12] optimizer to optimize the whole architecture, with the learning rate initialized to 1e-4. When training on the UCF-QNRF dataset, which contains images of very high resolution, we first downsample images whose resolution exceeds 1080p. We then divide each image into patches and combine them into a tensor with batch size equal to 4. When training on the other datasets, we directly input the whole image to our network.
During inference, we first generate an initial density map for the whole input image, and then select dense regions from the $K \times K$ division based on the average initial density $\bar{D}_k$ of each region $R_k$. If $\bar{D}_k$ is larger than the predefined threshold used for selecting the top $C$ groups of regions in training, we replace the initial density map prediction with the scaled reprediction for the selected dense region $R_k$.
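The inference-time replacement can be sketched as follows. The callable `repredict` is a hypothetical stand-in for the scale, repredict, resize, and rescale branch; the grid size, threshold, and doubling function are illustrative assumptions only.

```python
def infer(initial, K, threshold, repredict):
    """Replace the initial prediction on dense regions with a reprediction.

    `repredict` is a hypothetical callable that maps a region of the
    initial map to a same-sized repredicted region.
    """
    height, width = len(initial), len(initial[0])
    rh, rw = height // K, width // K
    final = [row[:] for row in initial]
    replaced = []
    for i in range(K):
        for j in range(K):
            ys = range(i * rh, (i + 1) * rh)
            xs = range(j * rw, (j + 1) * rw)
            region = [[initial[y][x] for x in xs] for y in ys]
            average = sum(map(sum, region)) / (rh * rw)
            if average > threshold:  # dense region: use the reprediction
                new_region = repredict(region)
                for dy, y in enumerate(ys):
                    for dx, x in enumerate(xs):
                        final[y][x] = new_region[dy][dx]
                replaced.append((i, j))
    return final, replaced

initial = [[0.0, 0.0, 1.0, 1.0],
           [0.0, 0.0, 1.0, 1.0],
           [0.5, 0.5, 0.0, 0.0],
           [0.5, 0.5, 0.0, 0.0]]
double = lambda region: [[2 * v for v in row] for row in region]  # dummy branch
final, replaced = infer(initial, K=2, threshold=0.75, repredict=double)
```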
The proposed method is implemented in PyTorch [20]. All experiments are carried out on a workstation with an Intel Xeon 16-core CPU (3.5GHz), 64GB RAM, and a single Titan Xp GPU.
4.3 Experimental Comparisons
We evaluate the proposed method on the ShanghaiTech, UCF_CC_50, and UCF-QNRF datasets. The proposed method outperforms all the other competing methods on all the benchmarks. The quantitative comparison with the state-of-the-art methods on these three datasets is given in Table 1.
ShanghaiTech. The proposed method outperforms the state-of-the-art method SANet [3] by 2.8 MAE and 6.1 MSE on ShanghaiTech Part A, and by 1.2 MAE and 2.5 MSE on ShanghaiTech Part B. Table 1 also shows that L2SM improves the performance of our SPN baseline by 5.8 MAE and 7.9 MSE on ShanghaiTech Part A, and by 1.9 MAE and 3.5 MSE on ShanghaiTech Part B. ShanghaiTech Part A contains more crowded images than ShanghaiTech Part B, and its density distribution varies more significantly. This may explain why the improvement brought by the proposed L2SM is larger on Part A than on Part B.
UCF_CC_50. We then compare the proposed method with other related methods on the UCF_CC_50 dataset. To the best of our knowledge, UCF_CC_50 is currently the densest dataset publicly available for crowd counting. The proposed method achieves significant improvement over the state-of-the-art methods. Precisely, compared with SANet [3], the proposed method reduces the MAE from 258.4 to 188.4 and the MSE from 334.9 to 315.3.
UCF-QNRF. We have also conducted experiments on the recent UCF-QNRF dataset, which contains images of significantly varied density distribution and resolution. By limiting the maximal image size, our VGG16 baseline already achieves state-of-the-art performance. The proposed SPN brings an improvement of 10.3 MAE and 20.6 MSE over the VGG16 baseline. The proposed L2SM further boosts the performance by 5.6 MAE and 11.0 MSE.
4.4 Ablation Study
The ablation studies are mainly conducted on the ShanghaiTech Part A dataset, as it is a moderate dataset, neither too dense nor too sparse, and covers a diverse range of head counts.
Effectiveness of different learning to scale settings. For the learning to scale process, we first evenly divide the images in a whole dataset into $G$ groups of regions with increasing density, and then attempt to centralize the $C$ most dense groups of regions to $C$ similar density levels. As shown in Table 2, the number of groups $G$ and the number of centers $C$ are important for accurate counting. For a fixed number of groups (e.g., $G = 5$), centralizing more and more regions at first leads to slightly improved counting results. Yet, when we attempt to centralize every image region, we also repredict the density map for very sparse or background regions, which brings more background noise and thus yields slightly decreased performance. A relatively finer group division with a proper number of centers performs slightly better. As depicted in Table 2, the proposed learning to scale using the multipolar center loss performs much better than straightforwardly scaling to the average density (S2AD) in each group.
Time overhead. To analyze the time overhead of the proposed L2SM, we conduct experiments under seven different settings (see Table 2). The time overhead is measured as the average inference time over the whole ShanghaiTech Part A test set. The batch size is set to 1 and only one TitanX GPU is used during inference. The average runtime of SPN is about 0.524s per image. When we increase the number of centers and the number of regions to be repredicted, the runtime slightly increases. When using 5 centers and repredicting all the regions, the proposed L2SM increases the runtime by 0.068s per image, which is a small overhead compared with the whole runtime.
Effectiveness of the weight of the center loss. We study the effectiveness of the center loss on ShanghaiTech Part A using one center by changing its weight in Eq. (3). Note that when the weight is set to 0, the center loss is not used, which means that the scale ratio is learned automatically without any specific supervision. As shown in Fig. 5, using the center loss to bring regions of significantly varied density distributions to similar density levels plays an important role in improving the counting accuracy. It is also worth noting that the performance improvement is rather stable over a wide range of center loss weights.
Effectiveness of the ground-truth transformation. We also study the effect of the ground-truth transformation involved in the scale-to-repredict process. As shown in Fig. 5, the ground-truth transformation (WT TransedGT), which enlarges the distance between crowded heads, is more accurate than straightforwardly scaling the ground-truth density map (WO TransedGT). This is as expected, since enlarging the distance between crowded heads results in regular Gaussian density blobs for dense regions, which reduces the density pattern shift and thus facilitates the density map prediction.
Effectiveness of the division. We also conduct experiments by varying the image domain division. As depicted in Table 3, the performance is rather stable across different image domain divisions.
4.5 Evaluation of Transferability
To demonstrate the transferability of the proposed method across datasets, we conduct experiments under cross-dataset settings, where the model is trained on the source domain and tested on the target domain.
The cross-dataset experimental results are presented in Table 4. We can observe that the proposed method generalizes well to unseen datasets. In particular, the proposed method consistently outperforms the state-of-the-art method in [25] and MCNN [33] by a large margin. The proposed method also performs slightly better than the method in [18] when transferring models trained on ShanghaiTech Part A to UCF_CC_50. Yet, the improvement is not as significant as that over [33, 25] when transferring between ShanghaiTech Part A and Part B. This is probably because the method in [18] also relies on extra data, which may help to reduce the gap between the two datasets. As depicted in Table 4, the proposed L2SM plays an important role in ensuring the transferability of the proposed method. Furthermore, as depicted in Table 1 and Table 4, the proposed method under cross-dataset settings performs competitively with, or even outperforms, some methods [23, 28, 27, 33] trained on the proper training set. This also confirms the generalizability of the proposed method.
5 Conclusion
In this paper, we propose a Learning to Scale Module (L2SM) to tackle the problem of large density variation in crowd counting. We achieve density centralization by a novel use of multipolar center loss. L2SM can effectively learn to scale dense regions of significantly varied density to multiple similar density levels, making the density estimation on dense regions more robust. Extensive experiments on three challenging datasets demonstrate that the proposed method achieves consistent and significant improvements over the state-of-the-art methods. L2SM also shows noteworthy generalization ability to unseen datasets with significantly varied density distributions, demonstrating its effectiveness in real applications.
Acknowledgement
This work was supported in part by the National Key Research and Development Program of China under Grant 2018YFB1004600, in part by NSFC 61703171, in part by NSF of Hubei Province of China under Grant 2018CFB199, and in part by the Young Elite Scientists Sponsorship Program by CAST, awarded to Dr. Yongchao Xu.
References
 [1] D. Babu Sam, N. N. Sajjan, R. Venkatesh Babu, and M. Srinivasan. Divide and grow: Capturing huge diversity in crowd images with incrementally growing cnn. In CVPR, pages 3618–3626, 2018.
 [2] L. Boominathan, S. S. Kruthiventi, and R. V. Babu. Crowdnet: A deep convolutional network for dense crowd counting. In ACMMM, pages 640–644, 2016.
 [3] X. Cao, Z. Wang, Y. Zhao, and F. Su. Scale aggregation network for accurate and efficient crowd counting. In ECCV, pages 734–750, 2018.
 [4] A. B. Chan, Z.-S. J. Liang, and N. Vasconcelos. Privacy preserving crowd monitoring: Counting people without people models or tracking. In CVPR, pages 1–7, 2008.
 [5] P. Chattopadhyay, R. Vedantam, R. R. Selvaraju, D. Batra, and D. Parikh. Counting everyday objects in everyday scenes. In CVPR, pages 1135–1144, 2017.
 [6] K. Chen, C. C. Loy, S. Gong, and T. Xiang. Feature mining for localised crowd counting. In BMVC, volume 1, page 3, 2012.
 [7] P. Dollar, C. Wojek, B. Schiele, and P. Perona. Pedestrian detection: An evaluation of the state of the art. TPAMI, 34(4):743–761, 2012.
 [8] W. Ge and R. T. Collins. Marked point processes for crowd counting. In CVPR, pages 2913–2920, 2009.
 [9] H. Idrees, I. Saleemi, C. Seibert, and M. Shah. Multi-source multi-scale counting in extremely dense crowd images. In CVPR, pages 2547–2554, 2013.
 [10] H. Idrees, K. Soomro, and M. Shah. Detecting humans in dense crowds using locally-consistent scale prior and global occlusion reasoning. TPAMI, 37(10):1986–1998, 2015.
 [11] H. Idrees, M. Tayyab, K. Athrey, D. Zhang, S. Al-Maadeed, N. Rajpoot, and M. Shah. Composition loss for counting, density map estimation and localization in dense crowds. In ECCV, 2018.
 [12] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In ICLR, 2014.
 [13] Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521(7553):436–444, 2015.
 [14] V. Lempitsky and A. Zisserman. Learning to count objects in images. In NIPS, pages 1324–1332, 2010.
 [15] Y. Li, X. Zhang, and D. Chen. Csrnet: Dilated convolutional neural networks for understanding the highly congested scenes. In CVPR, pages 1091–1100, 2018.
 [16] L. Liu, H. Wang, G. Li, W. Ouyang, and L. Lin. Crowd counting using deep recurrent spatial-aware network. In IJCAI, 2018.
 [17] W. Liu, M. Salzmann, and P. Fua. Context-aware crowd counting. In CVPR, pages 5099–5108, 2019.
 [18] X. Liu, J. van de Weijer, and A. D. Bagdanov. Leveraging unlabeled data for crowd counting by learning to rank. In CVPR, 2018.
 [19] D. Onoro-Rubio and R. J. López-Sastre. Towards perspective-free object counting with deep learning. In ECCV, pages 615–629, 2016.
 [20] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in pytorch. 2017.
 [21] V.-Q. Pham, T. Kozakaya, O. Yamaguchi, and R. Okada. Count forest: Co-voting uncertain number of targets using random forest for crowd density estimation. In ICCV, pages 3253–3261, 2015.
 [22] V. Ranjan, H. Le, and M. Hoai. Iterative crowd counting. In ECCV, 2018.
 [23] D. B. Sam, S. Surya, and R. V. Babu. Switching convolutional neural network for crowd counting. In CVPR, volume 1, page 6, 2017.
 [24] Z. Shen, Y. Xu, B. Ni, M. Wang, J. Hu, and X. Yang. Crowd counting via adversarial cross-scale consistency pursuit. In CVPR, pages 5245–5254, 2018.
 [25] Z. Shi, L. Zhang, Y. Liu, X. Cao, Y. Ye, M.-M. Cheng, and G. Zheng. Crowd counting with deep negative correlation learning. In CVPR, pages 5382–5390, 2018.
 [26] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
 [27] V. A. Sindagi and V. M. Patel. Cnn-based cascaded multi-task learning of high-level prior and density estimation for crowd counting. In AVSS, pages 1–6, 2017.
 [28] V. A. Sindagi and V. M. Patel. Generating high-quality crowd density maps using contextual pyramid cnns. In ICCV, 2017.
 [29] P. Viola, M. J. Jones, and D. Snow. Detecting pedestrians using patterns of motion and appearance. IJCV, 63(2):153–161, 2005.
 [30] Y. Wen, K. Zhang, Z. Li, and Y. Qiao. A discriminative feature learning approach for deep face recognition. In ECCV, pages 499–515, 2016.
 [31] F. Xiong, X. Shi, and D.-Y. Yeung. Spatiotemporal modeling for crowd counting in videos. In ICCV, pages 5161–5169, 2017.
 [32] C. Zhang, H. Li, X. Wang, and X. Yang. Cross-scene crowd counting via deep convolutional neural networks. In CVPR, pages 833–841, 2015.
 [33] Y. Zhang, D. Zhou, S. Chen, S. Gao, and Y. Ma. Single-image crowd counting via multi-column convolutional neural network. In CVPR, pages 589–597, 2016.