Attention to Head Locations for Crowd Counting
Abstract
Occlusions, complex backgrounds, scale variations and non-uniform distributions present great challenges for crowd counting in practical applications. In this paper, we propose a novel method that uses an attention model to exploit head locations, the most important cue for crowd counting. The attention model estimates a probability map in which high probabilities indicate locations where heads are likely to be present. The estimated probability map is used to suppress non-head regions in the feature maps produced by several multi-scale feature extraction branches of a convolutional neural network for crowd density estimation, which makes our method robust to complex backgrounds, scale variations and non-uniform distributions. In addition, we introduce a relative deviation loss to complement the commonly used Euclidean loss and improve the accuracy of sparse crowd density estimation. Experiments on the ShanghaiTech, UCF_CC_50 and WorldExpo'10 datasets demonstrate the effectiveness of our method.
I Introduction
With increasing demands for intelligent video surveillance, public safety and urban planning, improving scene analysis technologies becomes pressing [1, 2]. As an important task of scene analysis, crowd counting has gained more and more attention from the multimedia and computer vision communities in recent years for its applications such as crowd control, traffic monitoring and public safety. However, the crowd counting task comes with many challenges such as occlusions, complex backgrounds, non-uniform distributions and variations in scale and perspective [3], as Fig. 1 shows. Many algorithms have been proposed to address these challenges and increase the accuracy of crowd counting [4, 5, 6, 7].
Recent methods based on convolutional neural networks (CNNs) have achieved significant improvements in crowd counting [3]. A multi-column CNN (MCNN) is proposed in [5] to address the scale-variation problem by using several CNN branches with different receptive fields to extract multi-scale features. A cascaded CNN [6] learns a high-level prior which is incorporated into the crowd density estimation branch of the CNN to boost performance. In [7], both global and local context are exploited to generate high-quality crowd density maps. Although these methods have achieved promising performance, they neglect two aspects which could be exploited to further improve the accuracy of crowd counting. Firstly, these methods do not fully exploit head locations in images, which are the most important cue for crowd counting. In fact, head locations are usually used to generate ground-truth density maps in crowd counting datasets, e.g. the ShanghaiTech [5] and UCF_CC_50 [8] datasets. Although the ground-truth density maps generated from head locations are used to learn a CNN for regression, these methods do not explicitly give more attention to head regions during training and testing; in other words, they treat head and background regions equally. Secondly, the network training in these methods is dominated by dense crowd examples because of the use of the Euclidean distance between ground-truth and estimated density maps as the training loss. Generally, it is much more difficult to predict density maps for dense crowd examples than for sparse crowd examples, leading to a far larger training loss for the former. As a result, sparse crowd examples tend to receive insufficient treatment during training. However, sparse crowd counting can also be very important for some specific applications. For instance, in markets and street advertising scenarios, people may be attracted by some commodities and stroll in front of them, thus forming sparse crowds.
Counting the number of people in these scenarios to obtain the distributions of crowds could provide useful information regarding the preferences of customers for businesses and advertisers.
In this paper, we propose a novel method to address the above-mentioned two limitations of existing CNN-based approaches. Fig. 2 shows the network architecture used in the proposed method. We incorporate an attention model into the MCNN [5] to guide the network to focus on head locations during training and testing. Specifically, the attention model learns a probability map in which high probabilities indicate locations where heads are likely to be present. This probability map is used to suppress non-head regions in feature maps from the multi-scale feature extraction branches of the MCNN. In addition, to obtain better density maps for sparse crowds, we also introduce a relative deviation loss which is combined with the commonly used Euclidean loss to train the network. The relative deviation loss increases the importance of sparse crowd examples during training such that the network learns to better predict density maps for sparse crowd examples. We validate the effectiveness of the proposed method on three datasets: ShanghaiTech [5], UCF_CC_50 [8] and WorldExpo'10 [4].
The main contributions of this work are summarized as follows:

To the best of our knowledge, we make the first attempt to use an attention model for crowd counting. By incorporating the attention model into the CNN, the proposed method can filter out most background regions and body parts, thereby improving its robustness to complex backgrounds and non-uniform distributions.

The proposed method is robust to variations in scale because of the use of multi-scale feature extraction branches and the capability of the attention model to locate heads of different sizes.

The relative deviation loss is introduced to complement the Euclidean loss, thereby improving the accuracy of predicted density maps for sparse crowd examples.
The remainder of the paper is organized as follows. Section II presents related work on crowd counting and the attention model. In Section III, our proposed attention-model convolutional neural network (AM-CNN) is introduced. Implementation details are presented in Section IV. Experimental results are given and discussed in Section V. Finally, Section VI concludes the paper.
II Related Work
Traditional counting methods: Traditional counting methods can be categorized into detection-based, regression-based and density estimation-based approaches [3]. Detection-based approaches typically estimate the number of people by detecting objects in the scene with a sliding window [9]. The object detector is usually a classifier trained on features such as histograms of oriented gradients [10] and Haar wavelets [11]. Despite their great success in sparse crowd counting, these methods do not work well for dense crowds. Although [12] presents a part-based detection method to mitigate this problem, the counting results remain unsatisfactory.
To overcome the deficiency of detection-based approaches for dense crowd counting, some researchers [13, 14, 15] attempt to estimate the number of people by regression. Regression-based approaches aim to find a mapping function between extracted features and the global or local counts. Typically, global or local features extracted from the image are first used to encode low-level information; then the mapping between these features and the counts is learned by a regression model. To utilize more information and obtain higher accuracy for dense crowd examples, Idrees et al. [8] combined multiple sources, such as low-confidence head detections and the repetition of texture elements, to count at the patch level and then enforced a smoothness constraint to produce better estimates at the image level. Besides, they created a new dataset on a scale that had never been tackled before.
For some specific occasions, such as markets, it is more important to estimate the crowd distribution than to merely obtain the number of customers; therefore, producing density maps while predicting the counts is of great significance. The counting approach in [16] predicts density maps with a linear function and introduces a new loss, Maximum Excess over SubArrays (MESA), to increase counting accuracy. Pham et al. [17] propose to learn the mapping with a non-linear function; besides, they exploit a crowdedness prior and train two different forests to address large variations in appearance.
CNN-based counting methods: CNN-based counting approaches have become the main trend due to their great success in various computer vision tasks. Early CNN-based methods [18, 19, 20] predict the number of objects instead of a density map. Hu et al. [18] exploit a density-level classification task to enrich the features, thereby increasing counting accuracy. Similarly, the method in [19] classifies the appearance of the crowds while estimating the counts, forming an auxiliary CNN for crowd counting. The authors of [20] address the appearance-change problem by multiplying appearance weights output by a gating CNN with a mixture of expert CNNs. As aforementioned, estimating the crowd distribution while obtaining the counts is more applicable in some specific scenarios, so some researchers attempt to obtain the counts through density map prediction based on CNN architectures.
Zhang et al. [4] made the first attempt to address the challenge of complex backgrounds by utilizing a CNN to estimate the density map, in which the count is given by the sum of pixel values. To make use of both high-level semantic information and low-level features, Boominathan et al. [21] combine deep and shallow fully convolutional networks to estimate the density maps. Several algorithms [5, 22, 23, 24] are proposed to cater to large variations in scale and perspective. The MCNN [5] presents several CNN branches with different receptive fields, which extract multi-scale features and enhance robustness to large variations in people/head size. The Hydra CNN in [23] provides a scale-aware solution to generate density maps by training the regressor with a pyramid of image patches at multiple scales. To make full use of shared computations and contextual information, local and global information are leveraged in [22] by learning the counts of both local regions and the overall image. The authors of [24] propose a Switching-CNN by adding a switch to the MCNN [5]: an improved version of VGG-16 serves as the switch classifier to choose the best CNN regressor for the input image. Sindagi et al. [7] aim at generating high-quality density maps by using a Fusion CNN to concatenate features extracted by a Global Context Estimator (GCE), a Local Context Estimator (LCE) and a Density Map Estimator (DME). In addition, their counting architecture is trained in a Generative Adversarial framework to obtain sharper density maps.
Attention Model: The attention model has been widely used for various computer vision tasks, such as image classification [25] and segmentation [26], object detection [27] and classification [28], action recognition [29, 30], pose estimation [31], scene labeling [32] and video captioning [33]. Xiao et al. [25] propose a two-level attention for image classification: object-level attention and part-level attention. The former selects the patches relevant to the task domain while the latter focuses on local discriminative patterns; the two levels of attention complement each other nicely with late fusion. Chen et al. [26] exploit the attention model to measure the importance of different-scale features after generating multi-resolution inputs for semantic segmentation. In [29], a Content Attention Network (CANet) for action recognition is proposed to improve robustness to irrelevant content by introducing an attention mechanism and using clean videos as guidance during training. Liu et al. [30] propose a Global Context-Aware Long Short-Term Memory (GCA-LSTM) network, which introduces a recurrent attention model into the LSTM and thus selectively focuses on informative joints with regard to global context information. Apart from context-aware attention, Chu et al. [31] incorporate multi-resolution attention and hierarchical attention into an hourglass network for pose estimation; their model can focus on different granularities, from local salient regions to globally semantic-consistent spaces. A contextual attention model is utilized in [32] to assign different weights to surrounding patches, thus adaptively selecting relevant patches for scene labeling. In [33], Gao et al. integrate an LSTM with an attention mechanism that uses a dynamically weighted sum of local two-dimensional convolutional neural network representations to capture the salient structures of video, thus generating sentences with rich semantic content for video captioning.
All of these works demonstrate that the attention model allows a network to focus on the most relevant features as needed.
III The Proposed Method
The proposed AM-CNN consists of 3 shallow CNN branches and an attention model. The CNN branches with different receptive fields are first exploited to extract multi-scale features. Then the attention model is incorporated to emphasize head locations regardless of the complexity of scenes, the non-uniformity of distributions and the variability of scale and perspective. In addition, a relative deviation loss is used to complement the Euclidean loss during the training process. The architecture of the proposed AM-CNN is illustrated in Fig. 2 and discussed in detail as follows.
III-A Feature Extraction with Multiple Receptive Fields
Some previous works [5, 7, 23, 24] exploited multi-column networks with different receptive fields to address variations in scale, since different sizes of receptive fields can cope with the diversity in object size [27]. Inspired by the successful use of the MCNN [5, 7, 24], we select part of it to extract multi-scale features. A multi-column architecture with larger filter sizes or more columns might cater to larger variations in scale, but it brings a time-consuming parameter-adjustment task. Since the proposed method mainly focuses on the effect of the attention model for crowd counting, we use the same filter sizes and channels as [5] and [24]. Different from them, however, the multi-column network in this paper is used to generate high-dimensional feature maps rather than transforming the input into a density map directly.
Density maps generated by the MCNN [5] contain complex backgrounds, which seriously impact the counting accuracy. In addition, the distinction between large and small objects is not very pronounced in the density map, as Fig. 3 shows. Head locations, the most important cue for crowd counting, are the key to addressing these problems. Therefore, we need an operation that guides the network to give more attention to head locations and suppress non-head regions. By virtue of its strong object-focusing capability, an attention model is incorporated into the MCNN, forming a new architecture that generates more accurate density maps. We describe the attention model in Section III-B.
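To illustrate why branches with different filter sizes complement each other, the growth of the effective receptive field can be sketched as follows. The stride-1 convolutions and the pooling schedule assumed here are for illustration only, not the exact MCNN configuration:

```python
def receptive_field(kernels, pool_every=2):
    """Effective receptive field of a stack of stride-1 convolutions,
    with an assumed 2x2 stride-2 max-pool after every `pool_every` convs.

    kernels : list of square filter sizes, e.g. [9, 7, 7, 7].
    """
    rf, jump = 1, 1                      # receptive field and input stride
    for i, k in enumerate(kernels, 1):
        rf += (k - 1) * jump             # each conv widens the field
        if i % pool_every == 0:          # pooling doubles the jump
            rf += jump
            jump *= 2
    return rf

# A large-filter branch sees much wider context than a small-filter one,
# so their concatenated features cover heads of very different sizes.
large = receptive_field([9, 7, 7, 7])    # large-kernel branch
small = receptive_field([5, 3, 3, 3])    # small-kernel branch
```

Under these assumptions the large-kernel branch covers a region more than twice as wide as the small-kernel branch, which is the intuition behind using several columns in parallel.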
III-B The Attention Model for Crowd Counting
Visual attention is an essential mechanism of the human brain for understanding scenes effectively [31]. Therefore, we aim to guide the network to selectively focus on head regions when estimating density maps for crowd counting, no matter how complex the background is or how varied the distributions are.
The attention model has been widely used for different tasks with different focuses, e.g. focusing on patches relevant to the task domain or to specific objects for image classification and scene labeling, respectively; focusing on feature maps with different resolutions for image segmentation; focusing on different joints and relevant motions for action recognition; and focusing on salient features at the frame level for video captioning. For crowd counting, the attention model can be an effective tool to guide the network to focus on head locations, the most important cue for this task. Therefore, an attention model is introduced to identify how much attention to pay to features at different locations. Concretely, we use the attention model to concentrate more on head regions while suppressing background regions and body parts in images.
Herein, we briefly introduce the implementation of the attention model used in this work. Denote the convolutional features in layer $i$ as $F_i$; the soft attention $S$ is generated as

$$S = f(W_a * F_i + b_a) \qquad (1)$$

where $f$ is a non-linear activation function, $*$ denotes the convolution operation, and $W_a$, $b_a$ are the learnable weights and bias of the attention layer. The attention model aims to identify how much attention to pay to features at different locations, which is achieved by generating probability scores with a softmax operation applied to $S$ spatially:

$$A_j = \frac{\exp(S_j)}{\sum_{j'} \exp(S_{j'})} \qquad (2)$$

Here $j$ stands for the location of a pixel in the soft attention $S$ and $A$ is the probability map; $A_j$ reflects the probability that a head region is present at position $j$. By visualizing $A$, we can visualize the attention at different locations; such visualizations are illustrated in Section V-C. Note that $A$ is shared across all channels. The learned probability map $A$ is finally multiplied with the feature maps $F_i$ in layer $i$ to generate attention features, as Equation (3) shows:

$$F_i^{att} = A \odot F_i \qquad (3)$$

where $\odot$ denotes the element-wise product. Before this operation, $A$ is expanded along the channel dimension to match $F_i$. $F_i^{att}$ is the refined attention feature map, i.e., the features re-weighted by the probability scores, and it has the same size as $F_i$.
In this way, the trained attention model can adaptively select the positions where heads are located and assign them higher weights, which makes the AM-CNN very suitable for crowd counting.
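A minimal NumPy sketch of Eqs. (1)-(3), assuming tanh as the non-linear activation and a single 1x1 convolution (weights `w`, bias `b`, both hypothetical names) producing one attention channel; the actual layer configuration may differ:

```python
import numpy as np

def attention_reweight(F, w, b):
    """Sketch of Eqs. (1)-(3): soft attention over spatial locations.

    F : (C, H, W) convolutional feature maps from layer i
    w : (C,) weights of a 1x1 convolution producing one attention channel
    b : scalar bias
    Returns the probability map A (H, W) and re-weighted features (C, H, W).
    """
    # Eq. (1): S = f(W * F + b); here f = tanh and "*" is a 1x1 convolution,
    # implemented as a weighted sum over the channel axis
    S = np.tanh(np.tensordot(w, F, axes=(0, 0)) + b)       # (H, W)
    # Eq. (2): softmax applied spatially -> probability map summing to 1
    e = np.exp(S - S.max())
    A = e / e.sum()
    # Eq. (3): element-wise product; A is shared across all channels
    F_att = A[None, :, :] * F
    return A, F_att
```

Locations with high probability (likely head regions) keep most of their feature response, while background locations are attenuated.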
In this work, we generate the probability map from the concatenated multi-scale feature maps. It could be argued that incorporating attention models directly into the shallow CNN branches is also practical; therefore, we tried different architectures with the attention model and discuss them in Section V-A.
III-C Loss Function
Most previous methods use the Euclidean distance as the loss function for the counting task, as Equation (4) shows:

$$L_E(\Theta) = \frac{1}{2N}\sum_{i=1}^{N} \left\| F(X_i;\Theta) - D_i \right\|_2^2 \qquad (4)$$

where $N$ is the number of training samples, $D_i$ is the ground-truth density map and $F(X_i;\Theta)$ is the function mapping the input $X_i$ to the estimated density map with parameters $\Theta$. In this paper, the Euclidean distance is also selected as the loss function. Differently, considering that the sizes of the input images are not fixed, the Euclidean distance is divided by the number of pixels $M$:

$$L_E(\Theta) = \frac{1}{2NM}\sum_{i=1}^{N} \left\| F(X_i;\Theta) - D_i \right\|_2^2 \qquad (5)$$
We also find that for sparse crowd examples, especially those containing only a few people, the Euclidean loss is usually very small, which indicates that these samples receive insufficient treatment during training. Inspired by [18], we add a relative deviation loss to address this problem. Hu et al. [18] take the relative deviation as one of their evaluation criteria but only use the Maximum Excess over SubArrays (MESA) distance as the counting loss function. The relative deviation loss used in this paper can be formulated as follows:
$$L_R(\Theta) = \frac{1}{N}\sum_{i=1}^{N} \frac{|\hat{C}_i - C_i|}{C_i + \delta} \qquad (6)$$

where $C_i$ is the ground-truth count and $\hat{C}_i$ is the sum of pixel values of the estimated density map; $\delta$ is a constant used to avoid division by zero. The combination of the two loss functions is displayed in Equation (7):

$$L = L_E + \lambda L_R \qquad (7)$$

Since the number of pixels in a training sample is usually large and this factor is not included in $L_R$, the loss weight $\lambda$ is set to a small value in the experiments.
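The combined objective of Eqs. (5)-(7) can be sketched as follows; the exact form of the relative term and the values of `lam` and `delta` are illustrative assumptions, not the paper's reported settings:

```python
import numpy as np

def combined_loss(D_est, D_gt, lam=0.1, delta=1.0):
    """Combined training loss: pixel-normalized Euclidean distance plus
    a relative deviation term on the counts.

    D_est, D_gt : (N, H, W) estimated / ground-truth density map batches.
    lam, delta  : loss weight and divide-by-zero guard (illustrative).
    """
    N = D_est.shape[0]
    M = D_est[0].size                                  # pixels per map
    # Eq. (5): Euclidean loss normalized by the number of pixels M
    L_E = ((D_est - D_gt) ** 2).sum() / (2.0 * N * M)
    # Eq. (6): relative deviation between estimated and true counts,
    # where each count is the sum of the corresponding density map
    C_gt = D_gt.reshape(N, -1).sum(axis=1)
    C_est = D_est.reshape(N, -1).sum(axis=1)
    L_R = (np.abs(C_est - C_gt) / (C_gt + delta)).mean()
    # Eq. (7): weighted combination
    return L_E + lam * L_R
```

Because $L_R$ divides the count error by the ground-truth count, a miss of a few people on a nearly empty image contributes as much as a proportionally larger miss on a crowded one, which is exactly the re-balancing the section argues for.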
IV Implementation Details
The proposed method is evaluated on three highly challenging public datasets: ShanghaiTech [5], UCF_CC_50 [8] and WorldExpo'10 [4]. The details of the datasets can be found in Section V. We first describe how the ground-truth density maps are generated, and then introduce the details of the training procedure, including data processing and parameter settings.
IV-A Density Map Generation
The ground-truth density map is converted from the labelled head locations in the original image. Previous works [4, 7, 5] generate density maps by placing a Gaussian kernel at each object location. Zhang et al. [4] sum a 2D Gaussian kernel for the head and a bivariate normal distribution for the body, but this is only applicable to sparse crowd examples. Sindagi et al. [7] use Gaussian kernels of the same size for all objects, which cannot reflect the perspective of the scene. Similar to [5], we use geometry-adaptive Gaussian kernels to generate density maps. Suppose there are $n$ objects in the original image and head $i$ is located at pixel $x_i$; then the density map can be formulated as

$$D(x) = \sum_{i=1}^{n} \delta(x - x_i) * G_{\sigma_i}(x) \qquad (8)$$

where $G_{\sigma_i}$ is a Gaussian kernel and $\sigma_i$ represents its variance. For the ShanghaiTech Part A and UCF_CC_50 datasets, $\sigma_i$ is computed via k-nearest neighbours (KNN) from the average distance between the object and its neighbours. The WorldExpo'10 dataset provides perspective maps P, and $\sigma_i$ is defined according to the perspective value. Since crowds in ShanghaiTech Part B are sparse and perspective maps are not provided, we set $\sigma_i$ to 4.
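A sketch of Eq. (8) with geometry-adaptive kernels. The scale factor `beta` and neighbour count `k` follow the common choice popularized by [5] but are assumptions here, as is the fixed sigma for isolated heads:

```python
import numpy as np

def density_map(shape, heads, beta=0.3, k=3):
    """Ground-truth density map via geometry-adaptive Gaussian kernels.

    shape : (H, W) image size; heads : list of (row, col) head centres.
    sigma_i = beta * mean distance to the k nearest neighbouring heads;
    an isolated head falls back to a fixed sigma (both choices illustrative).
    """
    H, W = shape
    D = np.zeros((H, W))
    pts = np.asarray(heads, dtype=float)
    ys, xs = np.mgrid[0:H, 0:W]
    for i, (r, c) in enumerate(pts):
        if len(pts) > 1:
            # sorted distances; index 0 is the head itself
            dists = np.sort(np.hypot(pts[:, 0] - r, pts[:, 1] - c))
            sigma = max(beta * dists[1:k + 1].mean(), 1.0)
        else:
            sigma = 4.0                   # sparse fallback, cf. Part B setting
        g = np.exp(-((ys - r) ** 2 + (xs - c) ** 2) / (2 * sigma ** 2))
        D += g / g.sum()                  # each head integrates to one person
    return D
```

Normalizing each kernel to unit mass preserves the property that the sum of the density map equals the number of annotated heads.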
IV-B Training Procedure
Pre-train: Since some datasets provide limited training images, we adopt image cropping for the ShanghaiTech and UCF_CC_50 datasets to expand the training sets. Patches covering part of the original image are cropped at random locations to pre-train the shallow CNN branches separately. Note that the attention model is not included when pre-training the shallow branches. A final convolutional layer follows the former four convolutional layers to generate the density map. The number of cropped patches is set separately for the ShanghaiTech and UCF_CC_50 datasets.
Fine-tune: In the fine-tuning procedure, the training dataset is further expanded: we crop images and flip them, obtaining a larger set of patches to fine-tune the AM-CNN, with the number of patches set separately for ShanghaiTech and UCF_CC_50. The WorldExpo'10 dataset provides plenty of training images, so we only expand it by flipping the original images. The CNN branches are initialized with the pre-trained parameters and the attention model is randomly initialized with a small standard deviation.
Parameter settings: In the training procedure, the momentum is set to 0.9 for Adam optimization, and the learning rate and batch size are kept fixed. All of the experiments are conducted on a GeForce GTX TITAN X GPU.
V Experimental Results
This section presents the experimental results on the three challenging public datasets. For fair comparison, we use the standard metrics adopted by other CNN-based counting methods. The metrics are defined as

$$MAE = \frac{1}{N}\sum_{i=1}^{N} |z_i - \hat{z}_i|, \qquad MSE = \sqrt{\frac{1}{N}\sum_{i=1}^{N} (z_i - \hat{z}_i)^2} \qquad (9)$$

where MAE denotes the mean absolute error and MSE the (root) mean squared error; $z_i$ is the ground-truth count and $\hat{z}_i$ is the count estimated by the AM-CNN for the $i$-th sample.
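Eq. (9) in code form; note that, as is conventional in the crowd-counting literature, the reported MSE is the square root of the mean squared error:

```python
import numpy as np

def mae_mse(z_gt, z_est):
    """Mean absolute error and (root) mean squared error over N samples.

    z_gt, z_est : ground-truth and estimated counts per test image.
    """
    z_gt = np.asarray(z_gt, dtype=float)
    z_est = np.asarray(z_est, dtype=float)
    mae = np.abs(z_gt - z_est).mean()
    mse = np.sqrt(((z_gt - z_est) ** 2).mean())   # RMSE, reported as "MSE"
    return mae, mse
```

MAE reflects the accuracy of the estimates, while the squared term in MSE penalizes large outlier errors more heavily, so it is often read as a robustness indicator.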
V-A Structural Adjustment Based on ShanghaiTech Part A
This section examines the effectiveness of the attention model and the structural adjustment of the whole architecture on ShanghaiTech Part A. To isolate the effect of the attention model, we first incorporated it into a single shallow CNN branch. The combination of the attention model and a single CNN branch is illustrated in Fig. 4 and denoted AM-CNN(L), AM-CNN(M) and AM-CNN(S), where L, M and S stand for large, medium and small convolutional filter sizes. As the results in Fig. 5 show, the counting accuracy increases markedly when the attention model is used to emphasize head locations: each of AM-CNN(L), AM-CNN(M) and AM-CNN(S) obtains a lower MAE/MSE than the corresponding CNN(L), CNN(M) and CNN(S). These results demonstrate the effectiveness of the attention model for crowd counting.
We also conducted experiments to determine whether to integrate the attention model before or after feature concatenation. The first choice is to incorporate attention models into the three shallow CNN branches and then concatenate the attention features for density map generation, denoted AM-CNN(3). This architecture generates three different probability maps, each corresponding to one receptive field. The other choice is to integrate the attention model after the concatenation of the CNN branches, which is the AM-CNN. Compared with the former, this architecture trains one probability map based on the combination of the multi-scale features, thereby fully exploiting the multi-scale receptive fields when training the attention model. The results in Fig. 5 show that the second choice achieves higher counting accuracy: the MAE/MSE of the AM-CNN is lower than that of the AM-CNN(3). Besides, compared with the MCNN [5], the proposed method achieves a significant improvement in MAE/MSE, which also demonstrates the effectiveness of the attention model.
V-B Comparison with Other CNN-based Counting Methods
This section presents the comparison with recent CNN-based methods. We first introduce the details of the datasets, and then discuss the counting results.
V-B1 ShanghaiTech
This dataset was published in [5] and contains two subsets: Part A mainly consists of dense crowd examples whereas Part B mainly focuses on sparse crowd examples; each part provides separate training and testing splits. The crowd density varies greatly in this dataset, making the counting task more challenging than on other datasets. We compare our method with other recent CNN-based methods in Table I.
Zhang et al. [4] mainly focus on cross-scene crowd counting through candidate scene retrieval: they retrieve images with similar scenes from the training data to fine-tune the trained CNN for the target scene. In [6], a high-level prior is learned by utilizing feature maps trained for density-level classification, yielding better results than earlier methods. On the basis of the MCNN [5], which concatenates feature maps with multi-scale receptive fields, Sam et al. [24] train a Switch-CNN to select a specific CNN regressor for each image. In addition, they enforce a differential training regimen to tackle large scale and perspective variations, which improves performance noticeably compared with the MCNN. Apart from increasing the counting accuracy by adding contextual information, Sindagi et al. [7] use a Generative Adversarial Network to sharpen the density maps. Based on the concatenation of multi-scale feature maps, the proposed method exploits an attention model to emphasize head regions when generating the density map; in addition, the relative deviation loss compensates for small Euclidean distance errors. For Part A, which mainly contains dense crowds, the AM-CNN performs better than the other methods except for the CP-CNN [7]. This may be because the proposed method only uses a density estimator while Sindagi et al. [7] add contextual information trained by two additional complex structures to their counting architecture. However, the addition of contextual information comes with a rapid growth in parameters: many more parameters must be iterated to train an image with the CP-CNN [7] than with the AM-CNN. Images in Part B mainly contain sparse crowds, and the proposed AM-CNN achieves state-of-the-art performance on this subset. The density maps illustrated in Fig. 7 and Fig. 8 show that the AM-CNN can focus on each specific head region in sparse crowds, which may explain its good performance on them.
Notably, by integrating an attention model, the proposed method performs much better than the MCNN [5]: the MAEs/MSEs of the AM-CNN (w/o $L_R$) on both subsets are substantially lower than those of the MCNN. The counting accuracy is further increased by adding the relative deviation loss, and the reduction in MAE/MSE for Part B is more significant than that for Part A. Overall, the attention model guides the network to ignore most of the complex backgrounds and pay more attention to head regions, while the relative deviation loss relatively enlarges the estimation errors of sparse crowd examples during training, which also plays an important role in crowd counting.
Table I: Comparison on the ShanghaiTech dataset (MAE/MSE, lower is better).

Method              Part A MAE  Part A MSE  Part B MAE  Part B MSE
Cross-Scene [4]     181.8       277.7       32.0        49.8
MCNN [5]            110.2       173.2       26.4        41.3
Cascaded-MTL [6]    101.3       152.4       20.0        31.1
Switching-CNN [24]  90.4        135.0       21.6        33.4
CP-CNN [7]          73.6        106.4       20.1        30.1
AM-CNN w/o L_R      89.6        136.2       16.2        29.8
AM-CNN with L_R     87.3        132.7       15.6        26.4
V-B2 WorldExpo'10
This dataset is the largest one focusing on cross-scene crowd counting. Pedestrians are labelled at the centers of their heads, and annotated frames from video sequences form the training set. The scenes are captured by surveillance cameras; among them, 5 different scenes are used for testing, each consisting of the same number of frames, thus forming five test subsets. The pedestrian number in the testing set changes significantly over time. In addition, this dataset provides a Region of Interest (ROI) map for each scene. Following [4], we utilize the ROI maps for both the training and testing sets.
Table II: MAE comparison on the WorldExpo'10 dataset.

Method              Scene1  Scene2  Scene3  Scene4  Scene5  Average
Cross-Scene [4]     9.8     14.1    14.3    22.2    3.7     12.9
MCNN [5]            3.4     20.6    12.9    13.0    8.1     11.6
Switching-CNN [24]  4.4     15.7    10.0    11.0    5.9     9.4
CP-CNN [7]          2.9     14.7    10.5    10.4    5.8     8.86
AM-CNN w/o L_R      3.1     13.0    9.7     10.6    5.4     8.36
AM-CNN with L_R     2.5     13.0    9.7     10.0    4.0     7.84
The state-of-the-art algorithms [4, 5, 24, 7] introduced in Section V-B1 are compared with the proposed method. As in previous works, we only display the MAE results in Table II. As the results show, compared with the MCNN [5], the proposed method achieves a significant improvement by integrating an attention model, especially for scene 1 and scene 2. The distributions of people in these two scenes change more obviously, demonstrating that the attention model can emphasize head regions in the image regardless of non-uniform distributions. When the relative deviation loss is added, the proposed AM-CNN achieves the best average MAE and the lowest errors on most subsets. In scene 1 and scene 5, people are more dispersed and the crowds are sparser than in other scenes; the counting accuracy increases more obviously for these two subsets, demonstrating that the relative deviation loss plays an important role in sparse crowd counting. Overall, the proposed method exploits an attention model to focus on head locations, making the network robust to complex backgrounds and non-uniform distributions. In addition, the Euclidean loss of sparse crowd examples is usually small, and the relative deviation loss compensates for this. All the results in Table II demonstrate that the AM-CNN performs well in spite of the cross-scene problem.
V-B3 UCF_CC_50
This dataset contains 50 images collected from publicly available web sources. The number of people per image varies over an extremely wide range, and the scenes cover diverse settings such as concerts, stadiums, pilgrimages, protests and marathons. In the experiment, we perform 5-fold cross-validation as other works did, with each fold used in turn as testing data in the five evaluation experiments. Table III illustrates the comparison results.
Kumagai et al. [20] multiply appearance weights output by a gating CNN with a mixture of expert CNNs to address the appearance-change problem, but this method only outputs the number of people while others predict the density map simultaneously. The authors of [21] use both deep and shallow CNN branches to extract features from the whole image and from patches; they mainly focus on highly dense crowds, but the counting accuracy is not competitive. Rubio et al. [23] design a Hydra CNN which uses a pyramid of patches as input; their scale-aware model does not need geometric information of the scenes. As Table III shows, the proposed method achieves the lowest MAE among these methods. To explore the performance for different densities, we plot a histogram in Fig. 6 to display the results and the comparison between the proposed method and the CP-CNN, which was the state-of-the-art method. Note that we conduct experiments using the AM-CNN with the relative deviation loss since its effectiveness has been demonstrated on the ShanghaiTech and WorldExpo'10 datasets.
Table III: Comparison on the UCF CC 50 dataset.

Method              MAE    MSE
------------------  -----  -----
Cross-Scene [4]     467.0  498.5
Crowdnet [21]       452.5  —
MCNN [5]            377.6  509.1
Hydra-CNN [23]      333.7  425.2
MoCNN [20]          361.7  493.3
Cascaded-MTL [6]    322.8  397.9
Switching-CNN [24]  318.1  439.2
CP-CNN [7]          295.8  320.9
AMCNN               279.5  377.8
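The MAE and MSE reported for this dataset are the standard crowd-counting metrics computed over per-image counts; note that the "MSE" used in this literature is conventionally the root of the mean squared error. A minimal sketch with purely illustrative numbers:

```python
import math

def mae(pred_counts, gt_counts):
    """Mean absolute error over per-image crowd counts."""
    return sum(abs(p - g) for p, g in zip(pred_counts, gt_counts)) / len(gt_counts)

def mse(pred_counts, gt_counts):
    """Root mean squared error (called MSE in the crowd-counting literature)."""
    return math.sqrt(
        sum((p - g) ** 2 for p, g in zip(pred_counts, gt_counts)) / len(gt_counts)
    )

# Illustrative counts only, not taken from any dataset.
pred = [100, 200]
gt = [110, 190]
```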
The UCF CC 50 dataset is divided into 5 ranges to show the counting accuracy for different densities. For scenarios with fewer than 3000 persons, the AMCNN performs much better than the CP-CNN. However, for extremely dense crowds (with more than 3000 persons), the AMCNN performs worse, with considerably higher MAE and MSE values than the CP-CNN. This may be because the attention model emphasizes every specific head location in sparse crowds but can only roughly stress the crowd regions in dense crowds. As mentioned above, the CP-CNN uses two complex structures to exploit contextual information, and its good performance on extremely dense crowds comes at the expense of a considerable number of parameters. Nevertheless, the AMCNN can still be applied to many scenarios, such as concerts, stadiums, marathons and markets, where there are fewer than 3000 persons in a single image.
V-C Probability Maps
This section examines probability maps and density maps to explore the influence of the attention model. Fig. 7 and Fig. 8 illustrate representative samples from the ShanghaiTech and WorldExpo’10 datasets. To verify whether the probability maps present higher probability scores at head locations, we overlay them on the original images. As Fig. 7 shows, the AMCNN can concentrate accurately on specific head regions for sparse crowds. For dense crowds, however, it can only emphasize the general regions where crowds are located. Given an image containing too many objects to attend to individually, humans usually focus on the regions where most of the objects are located. Similarly, it is hard for the attention model to focus on every specific head in a dense crowd, so it concentrates on the region where the crowd is located.
The regions within the green shapes in the first column of Fig. 8 are ROIs. We overlay masks generated from the ROIs on both probability maps and density maps so that regions outside the ROIs are ignored. Fig. 8 demonstrates that the attention model gives much more attention to head locations, thus enabling the proposed AMCNN to generate clear and accurate density maps.
The probability and density maps displayed in this section demonstrate that the attention model can roughly filter out complex background regions and body parts before density maps are generated. As a result, the density maps become clear and head-focused under the effect of the attention model.
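The suppression behaviour described above can be sketched as an element-wise product between the estimated probability map and a feature map, as the abstract describes. The function name and the toy values below are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def suppress_non_head(feature_map, prob_map):
    """Down-weight feature responses where the estimated head probability is low."""
    assert feature_map.shape == prob_map.shape
    return feature_map * prob_map  # element-wise attention weighting

# Toy example: a background activation (probability near 0) is almost zeroed
# out, while a head-region activation (probability near 1) passes through
# nearly unchanged.
features = np.array([[5.0, 5.0],
                     [5.0, 5.0]])
probs = np.array([[0.95, 0.05],
                  [0.90, 0.10]])
attended = suppress_non_head(features, probs)
```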
VI Conclusion
In this paper, we proposed an attention model convolutional neural network (AMCNN) to effectively exploit head locations for crowd counting. The architecture explicitly gives more attention to head locations and suppresses nonhead regions by exploiting an attention model to generate a probability map that presents higher probability scores in head regions. Additionally, a relative deviation loss, which plays an important role in sparse crowd density prediction, is introduced to compensate the Euclidean loss. Experiments on three challenging datasets demonstrate the robustness of the AMCNN to complex backgrounds, scale variations and nonuniform distributions.
References
 [1] T. Li, H. Chang, M. Wang, B. Ni, R. Hong, and S. Yan, “Crowded scene analysis: A survey,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 25, no. 3, pp. 367–386, 2015.
 [2] C. Zhang, K. Kang, H. Li, X. Wang, R. Xie, and X. Yang, “Data-driven crowd understanding: a baseline for a large-scale crowd dataset,” IEEE Transactions on Multimedia, vol. 18, no. 6, pp. 1048–1061, 2016.
 [3] V. A. Sindagi and V. M. Patel, “A survey of recent advances in CNN-based single image crowd counting and density estimation,” Pattern Recognition Letters, 2017.
 [4] C. Zhang, H. Li, X. Wang, and X. Yang, “Cross-scene crowd counting via deep convolutional neural networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 833–841, 2015.
 [5] Y. Zhang, D. Zhou, S. Chen, S. Gao, and Y. Ma, “Single-image crowd counting via multi-column convolutional neural network,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 589–597, 2016.
 [6] V. A. Sindagi and V. M. Patel, “CNN-based cascaded multi-task learning of high-level prior and density estimation for crowd counting,” in 14th IEEE International Conference on Advanced Video and Signal Based Surveillance, pp. 1–6, IEEE, 2017.
 [7] V. A. Sindagi and V. M. Patel, “Generating high-quality crowd density maps using contextual pyramid CNNs,” in 2017 IEEE International Conference on Computer Vision (ICCV), pp. 1879–1888, IEEE, 2017.
 [8] H. Idrees, I. Saleemi, C. Seibert, and M. Shah, “Multi-source multi-scale counting in extremely dense crowd images,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2547–2554, 2013.
 [9] P. Dollar, C. Wojek, B. Schiele, and P. Perona, “Pedestrian detection: An evaluation of the state of the art,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 4, pp. 743–761, 2012.
 [10] N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, vol. 1, pp. 886–893, IEEE, 2005.
 [11] P. Viola and M. J. Jones, “Robust realtime face detection,” International journal of computer vision, vol. 57, no. 2, pp. 137–154, 2004.
 [12] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan, “Object detection with discriminatively trained part-based models,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 9, pp. 1627–1645, 2010.
 [13] K. Chen, C. C. Loy, S. Gong, and T. Xiang, “Feature mining for localised crowd counting.,” in BMVC, vol. 1, p. 3, 2012.
 [14] K. Chen, S. Gong, T. Xiang, and C. C. Loy, “Cumulative attribute space for age and crowd density estimation,” in Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, pp. 2467–2474, IEEE, 2013.
 [15] D. Ryan, S. Denman, C. Fookes, and S. Sridharan, “Crowd counting using multiple local features,” in Digital Image Computing: Techniques and Applications, 2009. DICTA’09., pp. 81–88, IEEE, 2009.
 [16] V. Lempitsky and A. Zisserman, “Learning to count objects in images,” in Advances in Neural Information Processing Systems, pp. 1324–1332, 2010.
 [17] V.-Q. Pham, T. Kozakaya, O. Yamaguchi, and R. Okada, “Count forest: Co-voting uncertain number of targets using random forest for crowd density estimation,” in Proceedings of the IEEE International Conference on Computer Vision, pp. 3253–3261, 2015.
 [18] Y. Hu, H. Chang, F. Nian, Y. Wang, and T. Li, “Dense crowd counting from still images with convolutional neural networks,” Journal of Visual Communication and Image Representation, vol. 38, pp. 530–539, 2016.
 [19] Y. Zhang, F. Chang, M. Wang, F. Zhang, and C. Han, “Auxiliary learning for crowd counting via countnet,” Neurocomputing, vol. 273, pp. 190–198, 2018.
 [20] S. Kumagai, K. Hotta, and T. Kurita, “Mixture of counting CNNs: Adaptive integration of CNNs specialized to specific appearance for crowd counting,” arXiv preprint arXiv:1703.09393, 2017.
 [21] L. Boominathan, S. S. Kruthiventi, and R. V. Babu, “Crowdnet: a deep convolutional network for dense crowd counting,” in Proceedings of the 2016 ACM on Multimedia Conference, pp. 640–644, ACM, 2016.
 [22] C. Shang, H. Ai, and B. Bai, “End-to-end crowd counting via joint learning local and global count,” in Image Processing (ICIP), 2016 IEEE International Conference on, pp. 1215–1219, IEEE, 2016.
 [23] D. Onoro-Rubio and R. J. López-Sastre, “Towards perspective-free object counting with deep learning,” in European Conference on Computer Vision, pp. 615–629, Springer, 2016.
 [24] D. B. Sam, S. Surya, and R. V. Babu, “Switching convolutional neural network for crowd counting,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, vol. 1, p. 6, 2017.
 [25] T. Xiao, Y. Xu, K. Yang, J. Zhang, Y. Peng, and Z. Zhang, “The application of two-level attention models in deep convolutional neural network for fine-grained image classification,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 842–850, 2015.
 [26] L.-C. Chen, Y. Yang, J. Wang, W. Xu, and A. L. Yuille, “Attention to scale: Scale-aware semantic image segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3640–3649, 2016.
 [27] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba, “Object detectors emerge in deep scene CNNs,” arXiv preprint arXiv:1412.6856, 2014.
 [28] B. Zhao, X. Wu, J. Feng, Q. Peng, and S. Yan, “Diversified visual attention networks for fine-grained object classification,” IEEE Transactions on Multimedia, vol. 19, no. 6, pp. 1245–1256, 2017.
 [29] J. Hou, X. Wu, Y. Sun, and Y. Jia, “Content-attention representation by factorized action-scene network for action recognition,” IEEE Transactions on Multimedia, 2017.
 [30] J. Liu, G. Wang, L.-Y. Duan, K. Abdiyeva, and A. C. Kot, “Skeleton based human action recognition with global context-aware attention LSTM networks,” IEEE Transactions on Image Processing, 2017.
 [31] X. Chu, W. Yang, W. Ouyang, C. Ma, A. L. Yuille, and X. Wang, “Multi-context attention for human pose estimation,” arXiv preprint arXiv:1702.07432, 2017.
 [32] A. H. Abdulnabi, B. Shuai, S. Winkler, and G. Wang, “Episodic CAMN: Contextual attention-based memory networks with iterative feedback for scene labeling,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6278–6287, IEEE, 2017.
 [33] L. Gao, Z. Guo, H. Zhang, X. Xu, and H. T. Shen, “Video captioning with attention-based LSTM and semantic consistency,” IEEE Transactions on Multimedia, vol. 19, no. 9, pp. 2045–2055, 2017.