Segmentation Guided Attention Network for Crowd Counting via Curriculum Learning

Qian Wang
Department of Computer Science
Durham University
United Kingdom
   Toby P. Breckon
Department of Computer Science
Department of Engineering
Durham University
United Kingdom

Crowd counting using deep convolutional neural networks (CNNs) has achieved encouraging progress in recent years. Novel network architectures have been designed to handle the scale variance issue in crowd images, for which the ideas of multi-column networks with different convolution kernel sizes and rich feature fusion have been prevalent in the literature. Recent works have shown the effectiveness of Inception modules in crowd counting due to their ability to capture multi-scale visual information via the fusion of features from multi-column networks. However, existing crowd counting networks built with Inception modules usually have a small number of layers and employ only the basic type of Inception module. In this paper, we investigate the use of pre-trained Inception models for crowd counting. Specifically, we first benchmark a baseline Inception-v3 model on commonly used crowd counting datasets and show its superiority to other existing models. Subsequently, we present a Segmentation Guided Attention Network (SGANet) with Inception-v3 as the backbone for crowd counting. We also propose a novel curriculum learning strategy for more efficient training of crowd counting networks. Finally, we conduct thorough experiments comparing the performance of SGANet with other state-of-the-art models. The experimental results validate the effectiveness of the segmentation guided attention layer and the curriculum learning strategy in crowd counting.

1 Introduction

Automatic object counting aims to estimate the number of target objects in still images or video frames and has been applied in many real-world applications. For instance, there have been works focusing on the automatic counting of different objects including cells [47], vehicles [20, 27], leaves [8, 1] and people [37]. Crowd counting has attracted increasing attention in the research community owing to its potential value in public surveillance and risk control in crowd scenes [49, 31, 37].

In earlier years, crowd counting in images was implemented by detection [56, 6, 42] or direct count regression [15, 41]. Counting by detection methods assume people signatures (i.e. the whole body or the head) in images are detectable and the count is easily calculated based on the detection results. This assumption, however, does not always hold in real scenarios, especially when the crowd is extremely dense. Counting by direct count regression aims to learn a regression model (e.g., support vector machine [41] or neural networks [15]) mapping the hand-crafted image features directly to the count of people in the image.

Recently, crowd counting has been overwhelmingly dominated by density estimation based methods since the idea of the density map was first proposed in [18]. The use of deep Convolutional Neural Networks [16] to estimate the density map, along with the availability of large-scale datasets [54, 11], further improved the accuracy of crowd counting in more challenging real-world scenarios. Recent works in crowd counting have focused on the design of novel deep neural network architectures (e.g., multi-column CNNs [54, 36] and attention mechanisms [21, 53]) for accurate density map estimation. The motivation of these designs is usually to improve the generalization to scale-variant crowd images. Among them, the Inception module [44] has been employed and has shown effectiveness in crowd counting [3, 13], although only the basic Inception modules are used and the networks are relatively shallow compared with state-of-the-art deep CNN models for image classification such as Inception-v3 [44], which uses heterogeneous Inception modules to improve the efficiency and capacity of the network. Although VGG16, VGG19 and ResNet101 have been explored for crowd counting in [9, 26, 46], to the best of our knowledge, the Inception models have not been investigated.

In this paper, we first investigate the effectiveness of the Inception-v3 model for crowd counting. We modify the original Inception-v3 to make it suitable for crowd counting and use it as a strong baseline. Subsequently, we add a segmentation map guided attention layer to enhance salient feature extraction for accurate density map estimation. More importantly, we propose a novel curriculum learning strategy to address the issues caused by extremely dense regions in crowd counting. The contributions of this paper are summarized as follows:

  • We present a Segmentation Guided Attention Network (SGANet) based on the Inception-v3 for crowd counting.

  • We propose a novel curriculum learning strategy for crowd counting with zero extra cost.

  • Extensive evaluations are conducted on benchmark datasets and the results demonstrate the superior performance of SGANet and the effectiveness of curriculum learning in crowd counting.

2 Related Work

In this section, we first review related works on CNN based crowd counting and focus mainly on the diverse network architectures by which our proposed crowd counting framework is inspired. Subsequently, we introduce works related to curriculum learning and how they can be used in the task of crowd counting.

2.1 Crowd Counting Networks

Many successful efforts have been devoted to the design of novel network architectures to improve the performance of crowd counting. Commonly used principles of network design for crowd counting include multi-column networks, rich feature fusion and attention mechanisms.

Multi-column neural networks were employed to address the scale-variance issue in crowd counting [54, 33, 5]. As one of the earliest CNN based models for crowd counting, MCNN [54] consists of three branches aiming to handle crowds of different densities. Following this idea, Sam et al. [33] proposed SwitchCNN, which employs a classifier to explicitly select one of the three branches for a given input patch based on its level of crowd density. While these methods use different kernel sizes in different branches to capture scale-variant information, Liu et al. [22] proposed a model consisting of multiple branches of VGG16 networks with shared weights to process scaled input images respectively. Similarly, Ranjan et al. [30] devised a two-column network which learns low- and high-resolution density maps iteratively via two branches of CNN. The success of these specially designed network architectures has validated that multi-column CNN models are capable of capturing scale-variant features for crowd counting.

The second direction of network design pursues the effective fusion of rich features from different layers [38, 13]. These attempts are based on the fact that different layers have different receptive fields and hence capture features at different scales. Various feature fusion strategies, including direct fusion [38], top-down fusion [32] and bidirectional fusion [40], have been employed in crowd counting.

To take advantage of the two aforementioned ideas for crowd counting, one straightforward solution is to utilise the Inception module, which was first proposed in [43] and has since evolved into a variety of more efficient forms. Inception modules have been employed in crowd counting models before [3, 13]: both SANet [3] and TEDNet [13] use only the basic Inception modules, similar to those used in the first version of the Inception network (i.e., GoogLeNet) [43]. In our work, we aim to explore the more advanced Inception modules in the framework of Inception-v3 [44].

The attention mechanism is another useful technique considered when designing network architectures for crowd counting [21, 38, 9, 23]. Attention layers are usually combined with multi-column structures so that regions of different semantic information (e.g., background, sparse, dense, etc.) can be attended and processed by different branches respectively. Attention maps learned by these models have proved to be aware of semantic regions [9], however, they cannot provide fine-grained scale awareness within the images. To address this issue explicitly, perspective maps have been employed to guide the accurate estimation of density maps [52, 28, 34]. In many scenarios where the perspective maps are not available, it is possible to estimate these perspective maps from the crowd images via a specially designed and trained network [48].

As an alternative, binary segmentation maps, which can be easily generated from point annotations [55], have been introduced as additional supervision for the training of crowd counting networks via multi-task learning [55]. In our work, binary segmentation maps are instead treated as explicit attention maps guiding the learning of salient visual features for density map estimation.

Figure 1: The framework of our proposed Segmentation Guided Attention Network (SGANet) which is adapted from Inception-v3 by: 1) removing the fully-connected layers; 2) removing two maxpooling layers to reserve high spatial resolution feature maps; 3) adding an upsampling layer before the last Inception module; 4) adding an attention layer whose output is applied to the feature maps generated by the last Inception module; 5) adding a convolutional layer for density map estimation.

2.2 Curriculum Learning

Curriculum learning is a strategy for training a model (e.g., a neural network) in machine learning and was proposed by Bengio et al. [2]. The idea of curriculum learning dates back to at least 1993, when Elman [7] demonstrated the benefit of training neural networks to learn a simple grammar by “starting small”. The strategy of curriculum learning is inspired by the way humans learn, progressing gradually from easy concepts to hard abstractions. In the specific case of training a machine learning model, curriculum learning selects easy examples at the beginning of training and allows more difficult ones to be added to the training set gradually. A curriculum is usually defined as a ranking of training examples by some prior knowledge that determines the level of difficulty of a given example. Jiang et al. [12] extended curriculum learning to so-called self-paced curriculum learning by integrating the ideas of the original curriculum learning and self-paced learning [17] in a unified framework.

In this work, we apply curriculum learning to crowd counting to address the issue of the large variance of crowd density in images. During the preparation of this manuscript, we became aware that curriculum learning is also employed in [25] for crowd counting. Different from that work, which designs a curriculum by setting difficulty levels at the example level, our curriculum learning is based on pixel-level difficulty, since the prediction of a density map can vary in difficulty from pixel to pixel.

3 Segmentation Guided Attention Network

Crowd counting is formulated as a density map regression problem in this study. Given a crowd image $I$, we aim to learn a Fully Convolutional Network (FCN), denoted as $f$, so that the corresponding density map $\hat{D}$ can be estimated by:

$$\hat{D} = f(I; \Theta), \qquad (1)$$

where $\Theta$ is the collection of parameters of the FCN.

As shown in Figure 1, our proposed network is adapted from the famous Inception-v3 originally designed for image classification by Google Research [44]. We first modify Inception-v3 to an FCN so that it can process images of arbitrary sizes and generates the estimated density maps as the outputs. An attention layer is added to the network to filter out features within the background region and concentrate on the foreground features for accurate density map estimation. Since the attention maps generated by this attention layer aim to discriminate the regions of background and foreground of the feature maps, we use a ground truth segmentation map, which can be easily derived from point annotations, as extra guidance for the training of the attention layer. As a result, the learned attention maps are forced to be similar to the segmentation maps during training.

We also investigate the use of curriculum learning in the training of crowd counting networks. Specifically, we define a curriculum based on the pixel-wise difficulty level so that the network starts training by learning to estimate density maps of “easy” regions (sparse) and ignore the “hard” pixels (dense). During training, the “hard” pixels are gradually exposed to the model and finally, the learned model can perform well for all situations.

3.1 Density and Segmentation Maps

In this study, we use simple methods to generate density and segmentation maps from the point annotations, although more sophisticated ones [35] might benefit the performance. For density maps $D \in \mathbb{R}^{H \times W}$, where $H$ and $W$ are the height and width of the image, we follow [54] in using a Gaussian kernel with fixed bandwidth $\sigma$:

$$D(p) = \sum_{i=1}^{N} \mathcal{G}_{\sigma}(p - p_i), \qquad (2)$$

where $p_i$ denotes the annotated position of the $i$-th head and $N$ is the number of annotated heads. For segmentation maps $S \in \{0, 1\}^{H \times W}$, we use a similar method:

$$S(p) = \min\left(1, \; \sum_{i=1}^{N} \mathbf{1}_{k \times k}(p - p_i)\right), \qquad (3)$$

where $\mathbf{1}_{k \times k}$ is an all-one matrix of size $k \times k$ centred at the position $p_i$. As a result, the ones and zeros in $S$ denote pixels belonging to the foreground and background regions, respectively. We empirically set the value of $k$ across all our experiments to ensure that a specific head within an image is characterized by more pixels in the segmentation map than in the density map, to avoid losing useful contextual information.
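To make the map construction concrete, the following NumPy sketch builds both maps from point annotations. The Gaussian kernel size, $\sigma$ and the box size $k$ below are illustrative placeholder values, since their exact settings are not restated here.

```python
import numpy as np

def paste(target, patch, x, y):
    """Add an odd-sized `patch`, centred on pixel (row=y, col=x), into `target`,
    clipping the patch at the image borders."""
    h, w = target.shape
    half = patch.shape[0] // 2
    r0, r1 = max(0, y - half), min(h, y + half + 1)
    c0, c1 = max(0, x - half), min(w, x + half + 1)
    target[r0:r1, c0:c1] += patch[r0 - (y - half):r1 - (y - half),
                                  c0 - (x - half):c1 - (x - half)]

def gaussian_kernel(size, sigma):
    """2D Gaussian normalised to sum to 1, so each head contributes 1 to the count."""
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    g = np.exp(-(xx ** 2 + yy ** 2) / (2.0 * sigma ** 2))
    return g / g.sum()

def generate_maps(points, h, w, sigma=4.0, gk=15, box=25):
    """Density map: sum of fixed-sigma Gaussians (Eq. 2 style);
    segmentation map: union of box x box ones patches, clipped to 1 (Eq. 3 style)."""
    density = np.zeros((h, w), dtype=np.float64)
    seg = np.zeros((h, w), dtype=np.float64)
    g = gaussian_kernel(gk, sigma)
    ones = np.ones((box, box))
    for x, y in points:          # (col, row) head annotations
        paste(density, g, x, y)  # Gaussians accumulate in dense regions
        paste(seg, ones, x, y)
    return density, np.minimum(seg, 1.0)  # seg is a binary foreground mask
```

Note that the density map integrates to the head count (up to border clipping), while the segmentation map marks a larger foreground region per head than the Gaussian footprint.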

3.2 Network Configuration

Instead of designing a novel network from scratch, we exploit the state-of-the-art CNN model for image classification, Inception-v3, in our study. To apply the original Inception-v3 network to crowd counting, some favourable modifications have been made. Firstly, we remove the final fully-connected layers and retain all the convolutional layers. The input size of the original Inception-v3 network is 299×299 and the output size of the final convolutional layer is 8×8; that is, the feature maps generated by the last convolutional layer have approximately 1/32 of the spatial resolution of the input image. This downsampling is achieved by the first convolutional layer (stride 2), two max-pooling layers (stride 2) and two Inception modules in which stride-2 max-pooling is employed. To preserve the spatial resolution of the output density map, which is important in crowd counting, we remove the first two max-pooling layers from the original network and add one upsampling layer before the final Inception module. As a result, the output of the modified network has exactly 1/4 of the spatial resolution of the input image for suitably sized inputs (e.g., the training patches used in our experiments). Such a modification does not change the number of parameters of the network, hence the pre-trained weights can be directly loaded and used. However, since the spatial resolutions of the intermediate feature maps are increased, the number of operations is also increased. This modified model will also be denoted as Inception-v3 without introducing ambiguity and is used as a baseline method in our experiments.

Distinct from existing works using the segmentation map in a multi-task learning framework [55] to extract more salient features for density map estimation, we argue that the segmentation map can be used as an ideal attention map to emphasize the contributions of features within the foreground regions to the density map estimation whilst suppressing the effects of features within the background regions. To this end, we add an attention layer to estimate the attention map. The attention layer is a convolutional layer followed by a sigmoid layer which restricts the output values to the range (0, 1). The attention layer takes the feature maps generated by the second last Inception module as input and outputs a one-channel attention map of the same spatial resolution as the input. Subsequently, the attention map estimated by the attention layer is applied to the feature map generated by the last Inception module via an element-wise product with each channel of the feature map:

$$\tilde{F}_c = F_c \odot \hat{A}, \quad c = 1, \ldots, C, \qquad (4)$$

where $F_c$ denotes the $c$-th channel of the feature map, $\hat{A}$ is the estimated attention map and $\odot$ denotes the element-wise product.
The attention layer designed in our framework is similar to those in [39, 35]. However, a so-called inverse attention map is estimated in [39], while our attention layer generates an attention map that is directly applied to the feature map. Also, the foreground regions in the ground truth segmentation map in [39] are derived by thresholding the density map, hence both maps have the same positive fields for each head, while ours differ (cf. Eq.(2)-(3)). In [35], the attention layer takes the feature map as input to estimate an attention map which is then applied to the same feature map. This may limit the capacity of the model, since it is forced to learn two different maps from the same feature map via two convolutional layers with limited parameters. In contrast, as mentioned above, the input of our attention layer is the feature map from the previous layer, which has a higher spatial resolution and is different from the one the generated attention map is applied to. These favourable distinctions collectively benefit the estimation of the density map and will be empirically evaluated in our experiments.
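The element-wise re-weighting of features by the attention map can be sketched as follows. This is a minimal NumPy illustration, not the PyTorch implementation: the convolution producing the attention logits is assumed to have run already, and the tensor shapes are illustrative.

```python
import numpy as np

def sigmoid(z):
    """Logistic function, restricting values to the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def apply_attention(features, att_logits):
    """features: (C, H, W) feature maps from the last Inception module;
    att_logits: (H, W) pre-sigmoid output of the attention convolution.
    The single-channel attention map multiplies every channel element-wise,
    suppressing background responses before density estimation."""
    att = sigmoid(att_logits)          # in (0, 1); trained towards the binary seg map
    return features * att[None, :, :]  # broadcast one map over all C channels
```

With logits near zero the map passes features at half strength; with large positive logits (confident foreground) the features pass through almost unchanged.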

3.3 Loss Function

Two loss functions are used to train SGANet. The first is an L2 loss applied to the estimation of the density map, which is common in the literature [54]. Denoted as $\ell_{den}$, the density map loss is calculated as follows:

$$\ell_{den} = \left\lVert \hat{D} - D \right\rVert_{2}^{2}, \qquad (5)$$

where $\hat{D}$ and $D$ are the estimated and ground truth density maps respectively.
On the other hand, we define the segmentation map loss as a cross-entropy loss:

$$\ell_{seg} = -\left\lVert S \circ \log \hat{A} + (1 - S) \circ \log (1 - \hat{A}) \right\rVert_{1}, \qquad (6)$$

where $\lVert \cdot \rVert_{1}$ denotes the elementwise matrix norm, i.e., the sum of all elements in a matrix, $\circ$ denotes the elementwise multiplication of two matrices of the same size, $S$ is the ground truth segmentation map and $\hat{A}$ is the attention map estimated by the attention layer. We combine the two losses during network training and the compositional loss is:

$$\ell = \ell_{den} + \alpha \, \ell_{seg}, \qquad (7)$$

where $\alpha$ is a hyper-parameter which ensures the two components have comparable values and is set to 20 across our experiments.
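A minimal NumPy sketch of the compositional loss follows. The placement of $\alpha$ on the segmentation term is an assumption for illustration; the text only states that $\alpha = 20$ balances the two components.

```python
import numpy as np

def sganet_loss(d_pred, d_gt, att_pred, seg_gt, alpha=20.0, eps=1e-7):
    """Compositional loss: L2 density loss plus a pixel-wise cross-entropy
    between the predicted attention map (values in (0, 1)) and the binary
    segmentation map. `eps` guards the logarithms against zero inputs."""
    l_den = np.sum((d_pred - d_gt) ** 2)
    l_seg = -np.sum(seg_gt * np.log(att_pred + eps)
                    + (1.0 - seg_gt) * np.log(1.0 - att_pred + eps))
    return l_den + alpha * l_seg  # assumed placement of the balancing weight
```

A near-perfect prediction yields a small loss, while wrong density values and an inverted attention map drive both terms up.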

3.4 Curriculum Learning

Curriculum learning can be used to facilitate the training of a deep neural network without any extra cost. We propose a simple yet effective strategy to define pixel-wise difficulty levels for curriculum learning. The curriculum is designed based on the fact that dense crowds are generally more difficult to count than sparse ones. To this end, we set a threshold $T$ to selectively assign variant weights to different pixels of the density map when calculating the loss in Eq.(5). Specifically, we define a weight matrix $W$ as follows:

$$W(p) = \begin{cases} 1, & D(p) \le T, \\ T / D(p), & D(p) > T, \end{cases} \qquad (8)$$

The weight matrix $W$ has the same size as the density map $D$ used in Eq.(5), and the pixel-wise weights are determined by the threshold $T$ and the pixel values of the density map. If the pixel value of the density map is higher than the threshold, the pixel is treated as a difficult one and its weight is set to less than one; otherwise the weight is equal to one. The higher the pixel value, the smaller the weight. As a result, the training focuses more on pixels whose density values are lower than $T$.

We define the threshold $T$ as a function of the training epoch index $e$ in the form of:

$$T(e) = T_0 + \gamma \, e, \qquad (9)$$

where $T_0$ and $\gamma$ can be determined based on prior knowledge of the pixel values in the ground truth density map. The value of $T_0$ is the initial threshold, which should be equivalent to the maximum density value in a region characterizing a single head. The value of $\gamma$ controls the speed at which the difficulty increases, which can also be easily derived from the learning curve when the curriculum learning strategy is not used.

Finally, the loss function for the density map in Eq.(5) can be modified as:

$$\ell_{den}^{cl} = \left\lVert W^{(e)} \circ \left( \hat{D} - D \right) \right\rVert_{2}^{2}, \qquad (10)$$

where $W^{(e)}$ is also a function of the training epoch index $e$, derived by substituting the epoch-dependent threshold $T(e)$ of Eq.(9) into Eq.(8).
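As a proof-of-concept, the curriculum described above can be sketched in a few lines. The linear threshold schedule and the $T/D$ weighting are assumed forms that are consistent with the description in the text (weights of 1 for easy pixels, weights below 1 that shrink as the density grows for hard pixels); the values of `t0` and `gamma` are illustrative placeholders.

```python
import numpy as np

def threshold(epoch, t0, gamma):
    """Difficulty schedule: start at t0 (roughly the maximum density of a
    single head) and raise the threshold by gamma per epoch, so dense
    pixels are gradually exposed to the model."""
    return t0 + gamma * epoch

def curriculum_weights(density_gt, t):
    """Weight matrix W: easy pixels (D <= t) keep weight 1; hard pixels get
    weight t / D < 1, shrinking as the ground-truth density grows."""
    return np.where(density_gt > t, t / np.maximum(density_gt, 1e-12), 1.0)

def weighted_density_loss(d_pred, d_gt, epoch, t0=0.1, gamma=0.01):
    """Curriculum version of the density loss: per-pixel errors are scaled
    by the epoch-dependent weights before squaring and summing."""
    w = curriculum_weights(d_gt, threshold(epoch, t0, gamma))
    return np.sum((w * (d_pred - d_gt)) ** 2)
```

At epoch 0 only sparse regions contribute fully; as the threshold grows, dense regions are progressively weighted back in.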

4 Experiments

Extensive experiments have been conducted on benchmark datasets to evaluate the performance of SGANet and the effectiveness of curriculum learning in crowd counting. We briefly describe the datasets and evaluation metrics used in our experiments, followed by details of the experimental protocols and network training. Experimental results are compared with state-of-the-art methods and analysed. We also present an ablation study to investigate the contributions of different components to the performance of the proposed framework.

4.1 Datasets

ShanghaiTech dataset was collected and published by Zhang et al. [54] and consists of two parts. Part A consists of 300 and 182 images of varying resolutions for training and testing respectively. The minimum and maximum counts are 33 and 3,139 respectively, and the average count is 501.4. Part B consists of 400 and 316 images of a fixed resolution (768×1024) for training and testing respectively. Compared with part A, the numbers of people in these images are much smaller, with minimum and maximum counts of 9 and 578 respectively, and an average count of 123.6.

UCF_QNRF dataset [11] contains 1,535 high-quality images, among which 1,201 images are used for training and 334 images for testing. The minimum and maximum counts are 49 and 12,865 respectively, and the average count is 815.

UCF_CC_50 dataset [10] contains 50 images with the minimum and maximum counts of 94 and 4,534 respectively. It is a challenging dataset due to the limited number of images. Following the suggestion in [10] and many other works, we use 5-fold cross-validation in our experiments.

4.2 Evaluation Metrics

We follow previous works in using two metrics, i.e., the mean absolute error (MAE) and the root mean squared error (RMSE), to evaluate the performance of different models in our experiments. The two metrics are calculated as follows:

$$MAE = \frac{1}{N} \sum_{i=1}^{N} \left| \hat{y}_i - y_i \right|, \qquad RMSE = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( \hat{y}_i - y_i \right)^2},$$

where $y_i$ and $\hat{y}_i$ are the ground truth and predicted counts for the $i$-th test image respectively, and $N$ is the number of test images.
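The two metrics amount to a few lines of NumPy over the per-image counts:

```python
import numpy as np

def mae(counts_gt, counts_pred):
    """Mean absolute error over per-image counts."""
    diff = np.asarray(counts_pred, dtype=float) - np.asarray(counts_gt, dtype=float)
    return float(np.mean(np.abs(diff)))

def rmse(counts_gt, counts_pred):
    """Root mean squared error over per-image counts; penalises large
    per-image errors more heavily than MAE."""
    diff = np.asarray(counts_pred, dtype=float) - np.asarray(counts_gt, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))
```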

4.3 Network Training

SGANet is implemented in PyTorch [29] and the “Adam” optimizer [14] is employed for training. The initial learning rate is set to 1e-4 and reduced by a factor of 0.5 after every 50 epochs. The total number of training epochs is set to 500, although the model always converges much earlier than that. The network is trained with fixed-size image patches randomly cropped from the training images. Instead of preparing the patches in advance, we perform the random patch cropping on-the-fly during training. Specifically, we randomly select 8 images from the training set and randomly crop 4 patches from each of them, giving a batch of 32 training patches in each iteration. The training patches generated in this way are more diverse and help to alleviate the potential over-fitting problem. Since the output of SGANet has 1/4 of the spatial resolution of the input, we use sum pooling to downsample the ground truth density and segmentation maps so that they have the same size as the network output. The training patches, as well as their corresponding density and segmentation maps, are horizontally flipped with a probability of 0.5 for data augmentation, which has been shown to be beneficial in many works [9, 50]. For testing, we feed the whole image into the network and obtain the density map, from which the predicted count can be computed. For the UCF_QNRF dataset, to reduce memory usage during testing, we also resize the images from both the training and test sets so that no image has a resolution higher than 1024 pixels, whilst the original aspect ratios are kept. Images whose original resolution is lower than 1024 are not changed.
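The sum-pooling step used to match the ground truth maps to the 1/4-resolution network output can be sketched as follows. This is a NumPy illustration under the assumption that the map dimensions are divisible by the pooling factor; the PyTorch pipeline would do the equivalent on tensors.

```python
import numpy as np

def sum_pool(gt_map, factor=4):
    """Downsample a ground-truth map by summing non-overlapping
    factor x factor blocks, so the total count in a density map is
    preserved at the network's reduced output resolution."""
    h, w = gt_map.shape
    assert h % factor == 0 and w % factor == 0, "pad or crop the map first"
    return gt_map.reshape(h // factor, factor, w // factor, factor).sum(axis=(1, 3))
```

Because the blocks are summed rather than averaged, the integral of the pooled density map (and hence the count) is unchanged.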

Model | Venue | Part A MAE / RMSE | Part B MAE / RMSE | UCF-QNRF MAE / RMSE | UCF-CC-50 MAE / RMSE
MCNN [54] | CVPR’16 | 110.2 / 173.2 | 26.4 / 41.3 | – / – | 377.6 / 509.1
CSRNet [19] | CVPR’18 | 68.2 / 115.0 | 10.6 / 16.0 | – / – | 266.1 / 397.5
SANet [3] | ECCV’18 | 67.0 / 104.5 | 8.4 / 13.6 | – / – | 258.4 / 334.9
DADNet [9] | MM’19 | 64.2 / 99.9 | 8.8 / 13.5 | 113.2 / 189.4 | 285.5 / 389.7
CANNet [24] | CVPR’19 | 62.3 / 100.0 | 7.8 / 12.2 | 107 / 183 | 212.2 / 243.7
TEDNet [13] | CVPR’19 | 64.2 / 109.1 | 8.2 / 12.8 | 113 / 188 | 249.4 / 354.5
Wan et al. [45] | ICCV’19 | 64.7 / 97.1 | 8.1 / 13.6 | 101 / 176 | – / –
RANet [50] | ICCV’19 | 59.4 / 102.0 | 7.9 / 12.9 | 111 / 190 | 239.8 / 319.4
ANF [51] | ICCV’19 | 63.9 / 99.4 | 8.3 / 13.2 | 110 / 174 | 250.2 / 340.0
SPANet [4] | ICCV’19 | 59.4 / 92.5 | 6.5 / 9.9 | – / – | 232.6 / 311.7
MCNN* | – | 91.8 / 144.9 | 18.0 / 28.5 | 184.2 / 292.9 | 298.8 / 399.8
CSRNet* | – | 67.2 / 110.5 | 9.2 / 15.4 | 104.8 / 174.0 | 184.3 / 262.0
SANet* | – | 64.0 / 103.4 | 9.2 / 15.6 | 110.6 / 182.9 | 249.6 / 382.3
DADNet* | – | 63.7 / 107.4 | 9.4 / 15.1 | 113.7 / 190.0 | 262.0 / 369.1
CANNet* | – | 65.6 / 106.7 | 7.9 / 13.0 | 116.5 / 217.3 | 185.0 / 278.0
Inception-v3 | – | 60.1 / 105.0 | 6.4 / 9.8 | 103.8 / 177.9 | 236.0 / 304.9
SGANet | – | 58.0 / 100.4 | 6.3 / 10.6 | 102.5 / 178.4 | 224.6 / 314.6
SGANet + CL | – | 57.6 / 101.1 | 6.6 / 10.2 | 97.5 / 169.2 | 221.9 / 289.8
Table 1: A comparison of different network architectures for crowd counting, reported as MAE / RMSE on ShanghaiTech part A and B, UCF-QNRF and UCF-CC-50 (* denotes our reproduced results under our experimental protocols; – denotes results not available; CL denotes curriculum learning).

4.4 Comparative Study

Firstly, we try to reproduce the results of some classic crowd counting models under our training protocols, both to remove the effects of various factors such as the density map generation, patch cropping and data augmentation for a fair comparison, and to focus on the effectiveness of the different network architectures. We select five models for this reproduction experiment: MCNN [54], a three-column CNN; CSRNet [19], which uses VGG16 as the front-end and dilated convolutional layers as the back-end; SANet [3], which employs the basic Inception modules but has a relatively shallow depth; DADNet [9], which employs the ideas of dilated convolution, attention maps and deformable convolution; and CANNet [24]. It should be noted that our implementation of SANet differs from that in the original paper [3] in two ways: 1) we use batch normalisation rather than instance normalisation, because we found the network hard to train with instance normalisation, which contradicts what was claimed in [3]; 2) we use the whole image rather than patches to predict the density map during testing, since we found no difference between these two strategies under our experimental protocols. For all models, we use only the density loss (cf. Eq.(5)) without considering the other losses used in the original papers. As for the density maps, we use a fixed $\sigma$ for all datasets.

As shown in Table 1, we reproduce better or comparable results in most scenarios, the exceptions being SANet and DADNet on ShanghaiTech part B and CANNet on ShanghaiTech part A and UCF_QNRF. The reasons for such performance improvements can be manifold, such as the use of 3-channel images as input, on-the-fly patch generation, data augmentation with horizontal flipping, different optimizers, and appropriate patch and batch sizes. We do not investigate the exact reasons for these performance differences, since they can be model specific and are out of the scope of this paper; nevertheless, the results show that the experimental protocols used in our experiments are effective, and the reproduced results will be used as baselines for a fairer comparison.

From Table 1, we can see that our modified Inception-v3 achieves very competitive performance on all four datasets under our experimental protocols. In particular, it achieves the best MAE of 6.4 and RMSE of 9.8 on ShanghaiTech part B. This demonstrates that the superiority of heterogeneous Inception modules in classification problems can be transferred to the task of crowd counting; hence the different Inception modules deserve more attention when designing novel CNN architectures for crowd counting as well as for other tasks suffering from the issue of scale variance. By adding the segmentation guided attention layer, our SGANet achieves better MAE on all datasets, although the improvement on ShanghaiTech part B is marginal. Regarding RMSE, SGANet achieves better performance on ShanghaiTech part A only. The use of curriculum learning further improves the performance of SGANet on most datasets, especially UCF_QNRF, which contains the largest number of training images and whose head scales vary dramatically. On the other hand, the use of curriculum learning does not improve the performance on ShanghaiTech part B, because the images in this dataset contain crowds with a relatively small variance of head scales. These results provide evidence that the issue of large scale variance can be alleviated by the use of curriculum learning. We provide more evidence for the effectiveness of curriculum learning in the following ablation study.

Model | Without CL MAE / RMSE | With CL MAE / RMSE
MCNN | 91.8 / 144.9 | 89.1 ↓ / 142.3 ↓
CSRNet | 67.2 / 110.5 | 66.7 ↓ / 113.7
SANet | 64.0 / 103.4 | 62.1 ↓ / 100.3 ↓
CANNet | 65.6 / 106.7 | 63.9 ↓ / 103.9 ↓
DADNet | 63.7 / 107.4 | 64.2 / 102.1 ↓
Inception-v3 | 60.1 / 105.0 | 58.2 ↓ / 97.9 ↓
Table 2: The effect of curriculum learning in different models on ShanghaiTech part A (↓ indicates the error decreases with the use of curriculum learning).

4.5 Results on Curriculum Learning

Curriculum learning has shown a positive effect when applied to SGANet for crowd counting. In this section, we explore the effectiveness of curriculum learning in the training of other crowd counting networks. To this end, we consider “MCNN”, “CSRNet”, “SANet”, “CANNet”, “DADNet” and our modified “Inception-v3”, and apply the curriculum learning strategy when training these networks on ShanghaiTech part A. It is noteworthy that the generated density maps have different sizes for these models (e.g., the size ratio between input and output is 1 for “SANet”, 2 for “DADNet”, 4 for “MCNN” and “Inception-v3”, and 8 for “CSRNet” and “CANNet”). The ground truth density maps need to be resized by sum pooling to match the size of the corresponding outputs. As a result, the pixel values of the ground truth density maps for different models have different distributions, which leads to model-specific curriculum designs (i.e., the parameter values in Eq.(9)). In our experiments, we do not endeavour to find the optimal parameters for Eq.(9) but rather provide a proof-of-concept for the effectiveness of curriculum learning in crowd counting networks. Experimental results are shown in Table 2. The use of curriculum learning improves the performance of most models. Specifically, the MAE decreases for all models except “DADNet” and the RMSE decreases for all models except “CSRNet”. In summary, the experimental results in Tables 1 and 2 provide sufficient evidence that curriculum learning can benefit the training of crowd counting networks in most cases, especially when the head scales vary greatly within the crowd images.

4.6 Results on Segmentation Guided Attention

From Table 1 we can see the performance enhancement contributed by the segmentation guided attention layer by comparing the performance between Inception-v3 and SGANet. To validate the superiority of our segmentation guided attention layer to other similar designs [35], we conduct an additional experiment on ShanghaiTech part A. In this experiment, we modify the SGANet by feeding the feature maps of the last Inception module into the attention layer and keeping the rest unchanged. This modification leads to an increase of MAE from 58.0 to 59.5 and an increase of RMSE from 100.4 to 102.2. The experimental results demonstrate that our segmentation guided attention layer performs better than the existing designs.

To give intuitive evidence of how the attention layer helps density map estimation, we visualize the estimated attention maps and density maps for five exemplar test images from ShanghaiTech part A. In Figure 2, we show the original images, ground truth density maps, predicted density maps and predicted segmentation maps in four columns respectively. The real and predicted counts are also shown on the density maps for a direct comparison. We can see that the prediction errors for the top three examples are relatively low, given the accurately predicted segmentation maps. However, the bottom two images suffer from higher errors, since the model cannot predict accurate foreground regions. For example, the image in the fourth row contains people raising their hands in the air, and the hands are easily counted as heads since their colours are similar to those of human faces. In the bottom image, the trees in the background are mistakenly recognised as foreground, resulting in an over-estimated count.

Figure 2: Visualization of estimated density and segmentation maps for five test images from ShanghaiTech part A. The numbers shown on the images in the second and third columns are the ground truth and estimated counts respectively.

5 Conclusion

In this paper, we investigated the effectiveness of Inception-v3 for crowd counting. We proposed a segmentation guided attention network (SGANet) using Inception-v3 as the backbone. We also designed a novel curriculum learning strategy for crowd counting, defining pixel-wise difficulty levels to address the issue of scale variance in crowd images. Experimental results on four commonly used datasets demonstrate that the proposed SGANet achieves superior performance due to the combination of Inception-v3 and the segmentation guided attention layer. The proposed curriculum learning strategy is also empirically shown to be helpful for training CNNs for crowd counting.


  • [1] Shubhra Aich and Ian Stavness. Leaf counting with deep convolutional and deconvolutional networks. In ICCV Workshop, Venice, Italy, pages 22–29, 2017.
  • [2] Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In Proceedings of the 26th annual international conference on machine learning, pages 41–48. ACM, 2009.
  • [3] Xinkun Cao, Zhipeng Wang, Yanyun Zhao, and Fei Su. Scale aggregation network for accurate and efficient crowd counting. In Proceedings of the European Conference on Computer Vision (ECCV), pages 734–750, 2018.
  • [4] Zhi-Qi Cheng, Jun-Xiu Li, Qi Dai, Xiao Wu, and Alexander Hauptmann. Learning spatial awareness to improve crowd counting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6152–6161, 2019.
  • [5] Zhi-Qi Cheng, Jun-Xiu Li, Qi Dai, Xiao Wu, Jun-Yan He, and Alexander G. Hauptmann. Improving the learning of multi-column convolutional neural network for crowd counting. In Proceedings of the 27th ACM International Conference on Multimedia, MM ’19, pages 1897–1906, New York, NY, USA, 2019. ACM.
  • [6] Lan Dong, Vasu Parameswaran, Visvanathan Ramesh, and Imad Zoghlami. Fast crowd segmentation using shape indexing. In ICCV, pages 1–8. IEEE, 2007.
  • [7] Jeffrey L Elman. Learning and development in neural networks: The importance of starting small. Cognition, 48(1):71–99, 1993.
  • [8] Mario Valerio Giuffrida, Massimo Minervini, and Sotirios A Tsaftaris. Learning to count leaves in rosette plants. 2016.
  • [9] Dan Guo, Kun Li, Zheng-Jun Zha, and Wang Meng. Dadnet: Dilated-attention-deformable convnet for crowd counting. In Proceedings of the ACM International Conference on Multimedia, pages 1823–1832, 2019.
  • [10] Haroon Idrees, Imran Saleemi, Cody Seibert, and Mubarak Shah. Multi-source multi-scale counting in extremely dense crowd images. In IEEE International Conference on Computer Vision and Pattern Recognition, pages 2547–2554, 2013.
  • [11] Haroon Idrees, Muhmmad Tayyab, Kishan Athrey, Dong Zhang, Somaya Al-Maadeed, Nasir Rajpoot, and Mubarak Shah. Composition loss for counting, density map estimation and localization in dense crowds. In Proceedings of the European Conference on Computer Vision (ECCV), pages 532–546, 2018.
  • [12] Lu Jiang, Deyu Meng, Qian Zhao, Shiguang Shan, and Alexander G Hauptmann. Self-paced curriculum learning. In Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.
  • [13] Xiaolong Jiang, Zehao Xiao, Baochang Zhang, Xiantong Zhen, Xianbin Cao, David Doermann, and Ling Shao. Crowd counting and density estimation by trellis encoder-decoder networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6133–6142, 2019.
  • [14] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
  • [15] Dan Kong, Douglas Gray, and Hai Tao. A viewpoint invariant approach for crowd counting. In ICPR, volume 3, pages 1187–1190. IEEE, 2006.
  • [16] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
  • [17] M Pawan Kumar, Benjamin Packer, and Daphne Koller. Self-paced learning for latent variable models. In Advances in Neural Information Processing Systems, pages 1189–1197, 2010.
  • [18] Victor Lempitsky and Andrew Zisserman. Learning to count objects in images. In NIPS, pages 1324–1332, 2010.
  • [19] Yuhong Li, Xiaofan Zhang, and Deming Chen. Csrnet: Dilated convolutional neural networks for understanding the highly congested scenes. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1091–1100, 2018.
  • [20] Mingpei Liang, Xinyu Huang, Chung-Hao Chen, Xin Chen, and Alade O Tokuta. Counting and classification of highway vehicles by regression analysis. IEEE Trans. Intelligent Transportation Systems, 16(5):2878–2888, 2015.
  • [21] Jiang Liu, Chenqiang Gao, Deyu Meng, and Alexander G Hauptmann. Decidenet: Counting varying density crowds through attention guided detection and density estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5197–5206, 2018.
  • [22] Lingbo Liu, Zhilin Qiu, Guanbin Li, Shufan Liu, Wanli Ouyang, and Liang Lin. Crowd counting with deep structured scale integration network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1774–1783, 2019.
  • [23] Ning Liu, Yongchao Long, Changqing Zou, Qun Niu, Li Pan, and Hefeng Wu. Adcrowdnet: An attention-injective deformable convolutional network for crowd understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3225–3234, 2019.
  • [24] Weizhe Liu, Mathieu Salzmann, and Pascal Fua. Context-aware crowd counting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5099–5108, 2019.
  • [25] Yuting Liu, Miaojing Shi, Qijun Zhao, and Xiaofang Wang. Point in, box out: Beyond counting persons in crowds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6469–6478, 2019.
  • [26] Zhiheng Ma, Xing Wei, Xiaopeng Hong, and Yihong Gong. Bayesian loss for crowd count estimation with point supervision. In Proceedings of the IEEE International Conference on Computer Vision, 2019.
  • [27] Thomas Moranduzzo and Farid Melgani. Automatic car counting method for unmanned aerial vehicle images. IEEE Transactions on Geoscience and Remote Sensing, 52(3):1635–1647, 2014.
  • [28] Daniel Onoro-Rubio and Roberto J López-Sastre. Towards perspective-free object counting with deep learning. In European Conference on Computer Vision, pages 615–629. Springer, 2016.
  • [29] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. In NIPS Workshop, 2017.
  • [30] Viresh Ranjan, Hieu Le, and Minh Hoai. Iterative crowd counting. In Proceedings of the European Conference on Computer Vision (ECCV), pages 270–285, 2018.
  • [31] David Ryan, Simon Denman, Sridha Sridharan, and Clinton Fookes. An evaluation of crowd counting methods, features and regression models. Computer Vision and Image Understanding, 130:1–17, 2015.
  • [32] Deepak Babu Sam and R Venkatesh Babu. Top-down feedback for crowd counting convolutional neural network. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
  • [33] Deepak Babu Sam, Shiv Surya, and R Venkatesh Babu. Switching convolutional neural network for crowd counting. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4031–4039. IEEE, 2017.
  • [34] Miaojing Shi, Zhaohui Yang, Chao Xu, and Qijun Chen. Revisiting perspective information for efficient crowd counting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7279–7288, 2019.
  • [35] Zenglin Shi, Pascal Mettes, and Cees G. M. Snoek. Counting with focus for free. In Proceedings of the IEEE International Conference on Computer Vision, 2019.
  • [36] Vishwanath A Sindagi and Vishal M Patel. Generating high-quality crowd density maps using contextual pyramid cnns. In ICCV, pages 1879–1888. IEEE, 2017.
  • [37] Vishwanath A Sindagi and Vishal M Patel. A survey of recent advances in cnn-based single image crowd counting and density estimation. Pattern Recognition Letters, 107:3–16, 2018.
  • [38] Vishwanath A Sindagi and Vishal M Patel. Ha-ccn: Hierarchical attention-based crowd counting network. IEEE Transactions on Image Processing, 29:323–335, 2019.
  • [39] Vishwanath A Sindagi and Vishal M Patel. Inverse attention guided deep crowd counting network. In AVSS, 2019.
  • [40] Vishwanath A Sindagi and Vishal M Patel. Multi-level bottom-top and top-bottom feature fusion for crowd counting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1002–1012, 2019.
  • [41] Parthipan Siva, Mohammad Javad Shafiee, Michael Jamieson, and Alexander Wong. Real-time, embedded scene invariant crowd counting using scale-normalized histogram of moving gradients (homg). In CVPR Workshop, pages 67–74, 2016.
  • [42] Venkatesh Bala Subburaman, Adrien Descamps, and Cyril Carincotte. Counting people in the crowd using a generic head detector. In 2012 IEEE Ninth International Conference on Advanced Video and Signal-Based Surveillance, pages 470–475. IEEE, 2012.
  • [43] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In CVPR, pages 1–9, 2015.
  • [44] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In CVPR, pages 2818–2826, 2016.
  • [45] Jia Wan and Antoni Chan. Adaptive density map generation for crowd counting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1130–1139, 2019.
  • [46] Qi Wang, Junyu Gao, Wei Lin, and Yuan Yuan. Learning from synthetic data for crowd counting in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8198–8207, 2019.
  • [47] Weidi Xie, J Alison Noble, and Andrew Zisserman. Microscopy cell counting and detection with fully convolutional regression networks. Computer methods in biomechanics and biomedical engineering: Imaging & Visualization, 6(3):283–292, 2018.
  • [48] Zhaoyi Yan, Yuchen Yuan, Wangmeng Zuo, Xiao Tan, Yezhen Wang, Shilei Wen, and Errui Ding. Perspective-guided convolution networks for crowd counting. In Proceedings of the IEEE International Conference on Computer Vision, pages 952–961, 2019.
  • [49] Beibei Zhan, Dorothy N Monekosso, Paolo Remagnino, Sergio A Velastin, and Li-Qun Xu. Crowd analysis: a survey. Machine Vision and Applications, 19(5-6):345–357, 2008.
  • [50] Anran Zhang, Jiayi Shen, Zehao Xiao, Fan Zhu, Xiantong Zhen, Xianbin Cao, and Ling Shao. Relational attention network for crowd counting. In Proceedings of the IEEE International Conference on Computer Vision, pages 6788–6797, 2019.
  • [51] Anran Zhang, Lei Yue, Jiayi Shen, Fan Zhu, Xiantong Zhen, Xianbin Cao, and Ling Shao. Attentional neural fields for crowd counting. In Proceedings of the IEEE International Conference on Computer Vision, pages 5714–5723, 2019.
  • [52] Cong Zhang, Hongsheng Li, Xiaogang Wang, and Xiaokang Yang. Cross-scene crowd counting via deep convolutional neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 833–841, 2015.
  • [53] Youmei Zhang, Chunluan Zhou, Faliang Chang, and Alex C Kot. Attention to head locations for crowd counting. arXiv preprint arXiv:1806.10287, 2018.
  • [54] Yingying Zhang, Desen Zhou, Siqin Chen, Shenghua Gao, and Yi Ma. Single-image crowd counting via multi-column convolutional neural network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 589–597, 2016.
  • [55] Muming Zhao, Jian Zhang, Chongyang Zhang, and Wenjun Zhang. Leveraging heterogeneous auxiliary tasks to assist crowd counting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 12736–12745, 2019.
  • [56] Tao Zhao and Ram Nevatia. Bayesian human segmentation in crowded situations. In CVPR, page 459. IEEE, 2003.