Guided Attention Network for Object Detection and Counting on Drones
Object detection and counting are related but challenging problems, especially for drone based scenes with small objects and cluttered background. In this paper, we propose a new Guided Attention Network (GANet) to deal with both object detection and counting tasks based on the feature pyramid. Unlike previous methods that rely on unsupervised attention modules, we fuse different scales of feature maps by using the proposed weakly-supervised Background Attention (BA) between the background and objects for more semantic feature representation. Then, the Foreground Attention (FA) module is developed to consider both global and local appearance of the object to facilitate accurate localization. Moreover, a new data augmentation strategy is designed to train a robust model in various complex scenes. Extensive experiments on three challenging benchmarks (i.e., UAVDT, CARPK and PUCPR+) show the state-of-the-art detection and counting performance of the proposed method compared with existing methods.
Object detection and counting are fundamental techniques in many applications, such as scene understanding, traffic monitoring and sports video, to name a few. However, these tasks become even more challenging in drone based scenes because of various factors such as small objects, scale variation and background clutter. With the development of deep learning, much progress has been achieved recently. Specifically, deep learning based detection and counting frameworks focus on discriminative feature representation of the objects.
First of all, the feature pyramid is widely applied in deep learning because it has rich semantics at all levels, e.g., U-Net [26], TDM [27] and FPN [16]. To better exploit multi-scale feature representation, researchers use various attention modules to fuse feature maps. In [11], the channel-wise feature responses are recalibrated adaptively by explicitly modelling interdependencies between channels. The authors of [31] propose the non-local network to capture long-range dependencies, which computes the response at a position as a weighted sum of the features at all positions. Moreover, the authors of [1] develop a lightweight global context (GC) block based on the non-local module. However, all the above methods use unsupervised attention modules and give little consideration to the background discriminative information in feature maps.
Based on the fused feature maps, the object is represented by proposals in anchor based methods [25, 18, 2] or keypoints in anchor-free methods [14, 36, 32]. Anchor based methods exploit the global appearance information of the object, relying on pre-defined anchors. It is not flexible to design different kinds of anchors because of large scale variation in drone based scenes. Anchor-free methods employ corner points, center points or target part points to capture local object appearance without anchors. However, local appearance representation does not contain object’s structure information, which is less discriminative in cluttered background, especially for small objects.
In addition, the diversity of training data is essential in deep learning. Especially in drone based scenes, the number of difficult samples is very limited. Traditional data augmentation such as rescaling, horizontal flipping, rotation and cropping is not sufficient to train a robust model for unconstrained drone based scenarios.
To address these issues, in this paper, we propose an anchor-free Guided Attention Network (GANet). First, the background attention module can enforce different channels of feature maps to learn discriminative background information for the feature pyramid. We fuse the multi-level features with the weak supervision of classification between background and foreground images. Second, the foreground attention module is used to capture both global and local appearance representation of the objects by taking the merits of both anchor-based and anchor-free methods. We extract more context information in the corner regions of the object to consider local appearance information. Third, we develop a new data augmentation strategy to reduce the influence of different illumination conditions on the images for the drone based scenes, e.g., sunny, night, cloudy and foggy scenes. We conduct extensive experiments on three challenging datasets (i.e., UAVDT [4], CARPK [10] and PUCPR+ [10]) to show the effectiveness of the proposed method.
The main contributions of this paper are summarized as follows. (1) We present a guided attention network for object detection and counting on drones, which is formed by the foreground and background attention blocks to extract the discriminative features for accurate results. (2) A new data augmentation strategy is designed to boost the model performance. (3) Extensive experiments on three challenging datasets, i.e., UAVDT, CARPK and PUCPR+, demonstrate the favorable performance of the proposed method against state-of-the-art methods.
2 Guided Attention Network
In this section, we introduce the novel anchor-free deep learning network for object detection and counting in drone images, the Guided Attention Network (GANet), which is illustrated in Figure 1. Specifically, GANet consists of three parts, i.e., the backbone, multi-scale feature fusion, and output predictor. We first describe each part in detail, and then the loss function and the data augmentation strategy.
2.1 Backbone Network
Since diverse scales of objects are taken into consideration in feature representation, we choose the feature maps from four side-outputs of the backbone network (e.g., VGG-16 [28] and ResNet-50 [9]). Four side outputs correspond to pool1, pool2, pool3, and pool4, each of which is the output of four convolution blocks with different scales, respectively. The feature maps from four pooling layers are , , , the size of the input image. They are marked with light blue regions in Figure 1(a). The backbone network is pre-trained on the ImageNet dataset [13].
2.2 Multi-Scale Feature Fusion
As discussed in [16], the feature pyramid has strong semantics at all scales, resulting in significant improvement as a generic feature extractor. Specifically, we fuse the side-outputs of the backbone network in a top-down manner, e.g., feature maps from pool4 to pool1 of VGG-16. Meanwhile, the receptive fields of the stacked feature maps can adaptively match the scale of objects. To consider background discriminative information in the feature pyramid, we introduce the Background Attention (BA) module in multi-scale feature fusion.
2.2.1 Background Attention.
As shown in Figure 1(b), the BA modules are stacked from the deepest to the shallowest convolutional layer. At the same time, the cross-entropy loss function is used to enforce different channels of feature maps to focus on either foreground or background in every stage. Then, the attention module weights the pooling features with the same scale via the class-activated tensor. Finally, the weighted pooling features and the up-sampled features are concatenated and regarded as the base feature maps in the next BA.
We denote the -th pooling features as , and the input and output of -th BA as and . Specifically, is used to learn the class-related weights for activating the class-related feature maps in . For the deepest BA module, the input is regarded as the pool4 feature maps (see in Figure 1(a)). Note that the size of output in this architecture is the same as the pooling features rather than the size of input . Therefore, the bilinear interpolation is introduced to up-sample to . As the up-sampling operation is a linear transformation, one convolutional layer is used as soft-adding to improve the scale adaptability. Instead of concatenating the up-sampled and the activated directly, the and convolutional layers is used to generate . In summary, the -th BA is formulated as
where denotes the convolutional weights of the concatenation layer. and are the convolutional weights of up-sampled . has two elements, i.e., one for and the other for . is a class activation function with two parameters, i.e., the pooling features and the weighted up-sampled features . It is defined as
where is the multiply operation between the features and the weight tensor . is obtained by three steps. First, is compressed into a one-dimensional vector by the Global Average Pooling (GAP) . Second, is activated and converted to the vector with class-related information via determining whether the input image contains the objects. Third, is transformed into a weight tensor with class-related information via two convolutional layers.
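The channel re-weighting described above can be sketched in plain Python. This is only an illustrative forward pass under stated assumptions: the two convolutional layers are modelled as 1x1 convolutions on a pooled vector (i.e., dense layers), the hidden activation is ReLU and the final gate is a sigmoid, and the names `class_activate`, `w1`, `b1`, `w2`, `b2` are hypothetical. The weak supervision (the classification loss applied to the pooled vector) is not shown.

```python
import math

def global_average_pool(feature_maps):
    """Reduce each channel (a 2-D list of floats) to its mean value."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
            for ch in feature_maps]

def dense(vector, weights, bias):
    """Dense layer; equivalent to a 1x1 convolution applied to a 1x1 map."""
    return [sum(w * v for w, v in zip(row, vector)) + b
            for row, b in zip(weights, bias)]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def class_activate(pooled, upsampled, w1, b1, w2, b2):
    """Weight each channel of `pooled` by a gate derived from `upsampled`.

    Assumption: the second layer outputs one gate per channel of `pooled`.
    """
    squeezed = global_average_pool(upsampled)               # step 1: GAP
    hidden = [max(0.0, h) for h in dense(squeezed, w1, b1)]  # assumed ReLU
    gates = [sigmoid(g) for g in dense(hidden, w2, b2)]      # assumed sigmoid
    # Multiply every pixel of a channel by that channel's gate.
    return [[[gate * v for v in row] for row in ch]
            for gate, ch in zip(gates, pooled)]
```

With identity weights, a channel whose up-sampled response is zero receives a gate of 0.5, while a strongly responding channel receives a gate closer to 1.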
2.2.2 Positive and Negative Image Generation.
To learn class-related feature maps, we use both the images with and without objects in the training stage. We denote them as positive and negative images respectively. Specifically, we use positive images with objects to activate the channels of feature maps to represent the pixels of object region, and negative images without overlapping of objects to activate the channels of feature maps to describe the background region. As shown in Figure 2, we generate positive and negative images with the size of by randomly cropping and padding the rescaled training images (from x to x scale).
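A minimal sketch of how a random crop might be labelled as positive or negative is given below. It assumes axis-aligned boxes in `(x1, y1, x2, y2)` form and treats any overlap with an object box as "positive"; the exact criterion used in the paper is not stated, and the helper names are hypothetical.

```python
import random

def overlaps(a, b):
    """True if boxes a and b (x1, y1, x2, y2) share a region of positive area."""
    return (min(a[2], b[2]) > max(a[0], b[0])
            and min(a[3], b[3]) > max(a[1], b[1]))

def label_crop(crop, boxes):
    """Return 'negative' when the crop touches no object box, else 'positive'."""
    return "positive" if any(overlaps(crop, b) for b in boxes) else "negative"

def sample_crop(image_w, image_h, crop_size, rng=random):
    """Pick a random square window whose top-left corner lies in the image.

    If the image is smaller than the window, the window extends past the
    image border; that area would be filled by padding.
    """
    x1 = rng.randint(0, max(0, image_w - crop_size))
    y1 = rng.randint(0, max(0, image_h - crop_size))
    return (x1, y1, x1 + crop_size, y1 + crop_size)
```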
2.3 Output Predictor
Based on multi-scale feature fusion, we predict the scales and locations of objects using both score and location maps (see Figure 1(c)), which are defined as follows:
The score map corresponds to the confidence score of the object region. Similar to the confidence map in FCN [19], each pixel of the score map is a scalar between 0 and 1 representing the confidence of belonging to an object region.
The location map describes the location of object by using four distance channels . The channels denote the distances from the current pixel to the left, top, right, and bottom edges of the bounding box respectively. Then we can directly predict the object box by four distance channels. Specifically, for each point in the score map, four distance channels predict the distances to the above four edges of the bounding box.
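As a small illustration of the location map, the box predicted at a pixel can be decoded from the four distances as follows. The function name and the `(x1, y1, x2, y2)` output convention are assumptions made for this sketch.

```python
def decode_box(x, y, distances):
    """Convert (left, top, right, bottom) distances at pixel (x, y) to a box."""
    left, top, right, bottom = distances
    return (x - left, y - top, x + right, y + bottom)
```

For example, a pixel at (10, 20) with distances (3, 4, 5, 6) yields the box (7, 16, 15, 26).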
2.3.1 Foreground Attention.
In general, based on both score and location maps, we can estimate the bounding boxes of the objects in the image. However, the estimated bounding boxes only rely on the global appearance of the object. That is, little local appearance of the object is taken into consideration, resulting in less discriminative foreground representation. To improve localization accuracy, we introduce the Foreground Attention (FA) module to consider both global and local appearance representation of the objects.
In practice, we use four corner maps (top-left, top-right, bottom-left and bottom-right) to denote different corner positions within the object region, as shown in Figure 3. Similar to the score map, each pixel of the corner map is also a scalar between 0 and 1 representing the confidence of belonging to the corresponding position in the object region. The corner is set as the size of the whole object. Specifically, as illustrated in Figure 1(c), we first use a threshold filter to remove the candidate bounding boxes with low confidence pixels, i.e., . is the confidence value of pixel in the predicted score map, and denotes the confidence threshold. Then, the Non-Maximum Suppression (NMS) operation is applied to remove redundant candidate bounding boxes and choose the top ones with higher confidence. Finally, a corner voting filter is designed to determine whether the selected bounding boxes should be retained. Specifically, we calculate the number of reliable corners in the -th candidate bounding box by
where denotes the average confidence of the corner region . indicates the threshold of mean confidence to determine the reliable corner. if its argument is true, and otherwise. We only keep the bounding box if the number of reliable corners is larger than the threshold , i.e., .
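The corner voting filter can be sketched as below. This is an illustration under assumptions, not the released implementation: boxes use integer `(x1, y1, x2, y2)` coordinates with exclusive upper bounds, each corner region is a fixed fraction of the box (the parameter `corner_fraction` is hypothetical, since the exact ratio is not recoverable from the text), and a corner is counted as reliable when its mean confidence is strictly above `mean_threshold`.

```python
def mean_confidence(score_map, region):
    """Average value of a 2-D map inside region (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = region
    values = [score_map[y][x] for y in range(y1, y2) for x in range(x1, x2)]
    return sum(values) / len(values) if values else 0.0

def corner_regions(box, corner_fraction):
    """Split a box into its four corner regions (assumed layout)."""
    x1, y1, x2, y2 = box
    cw = max(1, int((x2 - x1) * corner_fraction))
    ch = max(1, int((y2 - y1) * corner_fraction))
    return {
        "top_left": (x1, y1, x1 + cw, y1 + ch),
        "top_right": (x2 - cw, y1, x2, y1 + ch),
        "bottom_left": (x1, y2 - ch, x1 + cw, y2),
        "bottom_right": (x2 - cw, y2 - ch, x2, y2),
    }

def count_reliable_corners(box, corner_maps, corner_fraction, mean_threshold):
    """Number of corners whose mean confidence exceeds the threshold."""
    regions = corner_regions(box, corner_fraction)
    return sum(
        1 for name, region in regions.items()
        if mean_confidence(corner_maps[name], region) > mean_threshold
    )

def corner_voting_filter(boxes, corner_maps, corner_fraction,
                         mean_threshold, vote_threshold):
    """Keep boxes whose number of reliable corners is larger than vote_threshold."""
    return [
        b for b in boxes
        if count_reliable_corners(b, corner_maps, corner_fraction,
                                  mean_threshold) > vote_threshold
    ]
```

Here `corner_maps` is a dictionary holding the four predicted corner maps, keyed by corner name.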
2.4 Loss function
To train the proposed network, we optimize the location map and score map, as well as both foreground and background attentions simultaneously. The overall loss function is defined as
where , , , and are loss terms for the location map, score map, foreground attention, and background attention, respectively. The parameter , , and are used to balance these terms. In the following, we explain these loss terms in detail.
2.4.1 Loss of Location Map.
To achieve scale-invariance, the IoU loss [33] is adopted to evaluate the difference between the predicted bounding box and the ground-truth bounding box. The loss of location map is defined as:
where and are the estimated and ground-truth bounding box of the object. The function calculates the intersection-over-union (IoU) score between and .
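A minimal sketch of an IoU-based localization loss follows. It assumes the negative-logarithm form commonly associated with the IoU loss, and boxes in `(x1, y1, x2, y2)` form; the small constant `eps` is added only for numerical safety and is not taken from the paper.

```python
import math

def box_area(box):
    """Area of a box (x1, y1, x2, y2); degenerate boxes have zero area."""
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])

def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = box_area(a) + box_area(b) - inter
    return inter / union if union > 0 else 0.0

def iou_loss(predicted, ground_truth, eps=1e-9):
    """Assumed form: -ln(IoU). The loss approaches 0 as the boxes coincide."""
    return -math.log(iou(predicted, ground_truth) + eps)
```

Because the IoU is a ratio of areas, scaling both boxes by the same factor leaves the loss unchanged, which is the scale-invariance property mentioned above.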
2.4.2 Loss of Score Map.
Similar to image segmentation [34], we use the Dice loss to deal with the imbalance problem of positive and negative pixels in the score map. It calculates the errors between the predicted score map and the ground-truth map. The loss is calculated as
where the sums run over all pixels of the score map. and are the confidence values of pixel in the ground-truth and predicted maps respectively.
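For illustration, one common form of the Dice loss is sketched below on flattened maps. The squared-sum denominator is an assumption (other variants use plain sums), and `eps` is a hypothetical stabilizer that prevents division by zero on empty maps.

```python
def dice_loss(predicted, ground_truth, eps=1e-7):
    """Dice loss over two flattened maps with values in [0, 1].

    Assumed form: 1 - 2 * sum(p * g) / (sum(p^2) + sum(g^2)).
    """
    intersection = sum(p * g for p, g in zip(predicted, ground_truth))
    denominator = (sum(p * p for p in predicted)
                   + sum(g * g for g in ground_truth))
    return 1.0 - 2.0 * intersection / (denominator + eps)
```

The loss depends on the overlap relative to the total foreground mass, so a map dominated by background pixels does not swamp the few object pixels.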
2.4.3 Loss of Background Attention.
Similar to classification algorithms, we use the cross-entropy loss to guide background attention based on the binary classification, i.e.,
where denotes the ground-truth category (i.e., foreground or background), is the estimated probability for the category with label .
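The binary classification loss can be sketched as a standard cross-entropy over the two categories. The label encoding (0 for background, 1 for foreground) and the clamping constant `eps` are assumptions of this example.

```python
import math

def cross_entropy(label, probabilities, eps=1e-12):
    """Cross-entropy for one image.

    label: index of the ground-truth category (assumed 0 = background,
           1 = foreground).
    probabilities: estimated probability for each category.
    """
    return -math.log(max(probabilities[label], eps))
```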
2.4.4 Loss of Foreground Attention.
Similar to the score map, to deal with the imbalance problem of positive and negative pixels in the feature maps, we use the Dice loss to guide the foreground attention for the four corner maps.
2.5 Data Augmentation for Drones
Data augmentation is important in deep network training based on limited training data. Since the data is captured from a very high altitude by the drone, it is susceptible to the influence of different illumination conditions, e.g., sunny, night, cloudy and foggy. Therefore, we develop a new data augmentation strategy for drones.
As we know, sunny or night scenes correspond to the brightness of the image; therefore, we synthesize these scenes by changing the overall contrast of the image (denoted as BNoise). On the other hand, since convincing representations of clouds and water can be created at the pixel level [21], we use Perlin noise [22] to imitate cloudy and foggy scenes (denoted as PNoise). Inspired by the image blending algorithm [30], the data augmentation model is defined as
where is the transformed value of the pixel in image. and denote the weight of the pixel of original image and noise map respectively. The asterisk denotes different kinds of noise maps, i.e., BNoise and PNoise . We have to control the contrast of the image. The perturbation factor is used to revise the brightness. We set different factors and for each image in the training phase.
As shown in Figure 4, we employ white and black maps to synthesize sunny or night images. On the other hand, we use Perlin noise [22] to generate noise maps in Figure 4, and then revise the brightness via the disturbance factor to synthesize cloudy and foggy images. For each training image, we first resize it using random scale factors (x, x, x and x). Then, we introduce both noise maps into the image to imitate the challenging scenes (i.e., sunny, night, cloudy, and foggy). Finally, we select positive and negative images by random cropping on the blended images, and transform the selected images to size via zooming and padding.
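The blending step can be sketched as a per-pixel weighted sum of the image and a noise map. The sketch assumes a single-channel image with values in [0, 255], assumes that the two weights sum to one, and adds a brightness offset `gamma`; the parameter names are hypothetical, and the noise map (white, black or Perlin) is taken as given rather than generated here.

```python
def blend(image, noise_map, alpha, gamma=0.0):
    """Blend an image with a noise map of the same size.

    alpha: weight of the original pixel; the noise map is assumed to get
           1 - alpha.
    gamma: brightness offset added after blending.
    Results are clipped to the valid range [0, 255].
    """
    beta = 1.0 - alpha
    return [
        [min(255.0, max(0.0, alpha * p + beta * n + gamma))
         for p, n in zip(image_row, noise_row)]
        for image_row, noise_row in zip(image, noise_map)
    ]

def constant_map(height, width, value):
    """White (255) and black (0) maps stand in for sunny and night scenes."""
    return [[float(value)] * width for _ in range(height)]
```

Blending with a white map brightens the image toward a sunny look, while blending with a black map darkens it toward a night look.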
The proposed method is implemented in TensorFlow r1.8 (https://www.tensorflow.org/). We will release the source code of our method upon acceptance of the paper. We evaluate our method on two drone based datasets: UAVDT [4] and CARPK [10]. We also evaluate our method on the PUCPR+ dataset [10] because the dataset is collected from the th floor of a building and is similar to drone view images to a certain degree. In this section, we first describe implementation details. Then, we compare our GANet with the state-of-the-art methods, i.e., Faster R-CNN [25], RON [12], SSD [18], R-FCN [2], CADNet [5], One-Look Regression [20], IEP [29], YOLO9000 [23], LPN [10], RetinaNet [17], YOLOv3 [24], IoUNet [8], and SA+CF+CRT [15]. More visual examples are shown in Figure 5. In addition, the ablation study is carried out to evaluate the effectiveness of each component in our network.
3.0.1 Implementation Details
Due to the shortage of computational resources, we train GANet using the VGG-16 and ResNet-50 backbones with the input size . All the experiments are carried out on a machine with an NVIDIA Titan Xp GPU and an Intel(R) Xeon(R) E5-series CPU. For fair evaluation, we generate the same top detection bounding boxes for the UAVDT and CARPK datasets and detection bounding boxes for the PUCPR+ dataset based on the detection confidence. Note that the detection confidence is calculated by summarizing the value of each pixel in the score map. To output the count of objects in each image, we calculate the number of detections with the detection confidence larger than . We fine-tune the resulting model using the Adam optimizer. An exponential decay learning rate is used in the training phase, i.e., its initial value is and decays every iterations with the decay rate . The batch size is set as . In the loss function (4), we set the balancing factors as , , empirically. In the FA module, the confidence threshold is set as , and the threshold in (3) is set as empirically. The Non-Maximum Suppression (NMS) operation is conducted with a threshold . In the data augmentation model (8), we set the balancing weights as and .
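The exponential decay schedule mentioned above can be sketched as follows. The formula mirrors the usual definition (initial rate multiplied by the decay rate raised to the number of elapsed decay periods); the staircase behaviour is an assumption based on the phrase "decays every ... iterations", and the values used in any call are placeholders rather than the paper's settings.

```python
def exponential_decay(initial_rate, step, decay_steps, decay_rate,
                      staircase=True):
    """Learning rate after `step` iterations.

    With staircase=True the rate changes only at multiples of decay_steps;
    otherwise it decays continuously.
    """
    exponent = step // decay_steps if staircase else step / decay_steps
    return initial_rate * decay_rate ** exponent
```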
To evaluate detection algorithms on the UAVDT dataset [4], we compute the Average Precision (AP) score based on [7, 6]. That is, the hit/miss threshold of the overlap between detection and ground-truth bounding boxes is set to . In terms of CARPK [10] and PUCPR+ [10], we report the detection score under two hit/miss thresholds, i.e., AP and AP. To evaluate the counting results, similar to [10], we use two object counting metrics including Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE).
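The counting rule and the two counting metrics can be sketched as below. The function names are hypothetical, and the confidence threshold is left as a parameter because its value is not recoverable from the text.

```python
import math

def count_objects(confidences, threshold):
    """Count detections whose confidence is larger than the threshold."""
    return sum(1 for c in confidences if c > threshold)

def mean_absolute_error(predicted_counts, true_counts):
    """MAE: average absolute difference between predicted and true counts."""
    errors = [abs(p - t) for p, t in zip(predicted_counts, true_counts)]
    return sum(errors) / len(errors)

def root_mean_squared_error(predicted_counts, true_counts):
    """RMSE: square root of the average squared count difference."""
    squared = [(p - t) ** 2 for p, t in zip(predicted_counts, true_counts)]
    return math.sqrt(sum(squared) / len(squared))
```

RMSE penalizes large per-image errors more heavily than MAE, so the two metrics together indicate both the typical error and its spread.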
3.1 Quantitative Evaluation
3.1.1 Evaluation on UAVDT.
The UAVDT dataset [4] consists of video sequences with approximately frames, which are collected from various scenes. Moreover, the objects are annotated by bounding boxes as well as several attributes (e.g., weather condition, flying altitude, and camera view). Note that we only use the subset of the UAVDT dataset for object detection in our experiment. As presented in Table 1, our GANet performs the best among all the compared detection methods with both the VGG-16 and ResNet-50 backbones. Specifically, GANet surpasses YOLO9000, YOLOv3, RON, Faster R-CNN, SSD, CADNet, SA+CF+CRT and R-FCN by , , , , , and AP scores, respectively. Moreover, our method achieves better counting accuracy than SA+CF+CRT with the more complex ResNet-101 backbone, i.e., MAE score and RMSE score. This demonstrates the effectiveness of our method for object detection in drone based scenes.
3.1.2 Evaluation on CARPK.
The CARPK dataset [10] provides the largest-scale drone view parking lot dataset in unconstrained scenes, which is collected in various scenes for different parking lots. It contains approximately cars in total captured from the drone view. We compare our method with state-of-the-art algorithms in Table 2. The results show that our approach achieves the best MAE, RMSE and AP scores. It is worth mentioning that we obtain a much better AP score (i.e., vs. ). This is attributed to the proposed attention modules, which locate the objects more accurately.
3.1.3 Evaluation on PUCPR+.
The PUCPR+ dataset [10] is a subset of PKLot [3], which is annotated with nearly cars in total. It shares a similar high altitude attribute with drone based scenes, but the camera sensors are fixed and set in the same place. As presented in Table 3, our method performs the best in terms of MAE and RMSE scores. YOLOv3 [24] achieves the best AP score at hit/miss threshold, but a lower AP score than that of our method. We speculate that YOLOv3 lacks the global appearance representation of objects needed to achieve accurate localization.
3.2 Ablation Study
We analyze the effect of the important modules in our method on the detection performance. Specifically, we study the influence of data augmentation, semantic discriminative attention, and corner attention. We select the UAVDT dataset [4] to conduct the experiment because it provides various attributes in terms of altitude, illumination and camera-view for comprehensive evaluation.
3.2.1 Effectiveness of Data Augmentation.
As discussed above, the data augmentation strategy is used to increase the difficult samples affected by various illumination attributes in the UAVDT dataset [4] such as daylight, night and fog. We compare different variants of GANet with different data augmentation, denoted as GANet+BNoise, GANet+PNoise and GANet+BPNoise. Notably, BNoise denotes the brightness noise, PNoise denotes the Perlin noise, and BPNoise denotes both. As shown in Table 5, the performance of GANet+BNoise is slightly higher than that of GANet. GANet+PNoise achieves a much better AP score in terms of foggy scenes compared to GANet ( vs. ), which demonstrates the effectiveness of the introduced Perlin noise. If we apply the full data augmentation strategy to our training samples, the overall performance will increase by .
3.2.2 Effectiveness of Background Attention.
Different from the previous unsupervised attention modules, our Background Attention (BA) is guided based on discrimination between the background and objects. Firstly, we study different fusion strategies of the proposed BA in Figure 6, i.e., early fusion (EF), mixed fusion (MF) and late fusion (LF). The results presented in Table 6 show that the early fusion strategy (i.e., GANet+BPNoise+EF) achieves the best performance. Secondly, we also compare BA with several previous channel-wise attention modules including the SE block [11] and the GC block [1]. For a fair comparison, we use the same early fusion strategy in Figure 6(a). Compared to the baseline FPN fusion strategy using lateral connection [16], all the attention modules can improve the performance by learning the weights of different channels of feature maps. However, our BA module can learn additional discriminative information of background, resulting in the best AP score in the drone based scenes under different camera views (i.e., front-view, side-view and bird-view).
3.2.3 Effectiveness of Foreground Attention.
We enumerate the threshold for Foreground Attention (FA) in (3), i.e., , to study its influence on the accuracy. As shown in Table 7, we can conclude that GANet with the FA module achieves the best AP score when the threshold . If we remove FA, the detection performance will decrease to . It shows the effectiveness of the FA module.
3.2.4 Variants of GANet.
In Table 4, we compare various variants of GANet that combine several components in the network. Using the data augmentation strategy can improve the performance considerably in all the attributes. Either BA or FA can improve the performance by . Moreover, the proposed method using both attentions and the data augmentation strategy can boost the performance by approximately improvement in AP score compared to the baseline GANet method.
In this paper, we propose a novel guided attention network to deal with object detection and counting in drone based scenes. Specifically, we introduce both background and foreground attention modules to not only learn background discriminative representation but also consider local appearance of the object, resulting in better accuracy. The experiments on three challenging datasets demonstrate the effectiveness of our method. In future work, we plan to extend our method to multi-class object detection and counting.
[1] (2019) GCNet: non-local networks meet squeeze-excitation networks and beyond. CoRR abs/1904.11492.
[2] (2016) R-FCN: object detection via region-based fully convolutional networks. In NeurIPS, pp. 379–387.
[3] (2015) PKLot - A robust dataset for parking lot classification. Expert Syst. Appl. 42 (11), pp. 4937–4949.
[4] (2018) The unmanned aerial vehicle benchmark: object detection and tracking. In ECCV, pp. 375–391.
[5] (2019) Detecting small objects using a channel-aware deconvolutional network. TCSVT.
[6] (2015) The pascal visual object classes challenge: A retrospective. IJCV 111 (1), pp. 98–136.
[7] (2012) Are we ready for autonomous driving? the KITTI vision benchmark suite. In CVPR, pp. 3354–3361.
[8] (2019) Precise detection in densely packed scenes. CoRR abs/1904.00853.
[9] (2016) Deep residual learning for image recognition. In CVPR, pp. 770–778.
[10] (2017) Drone-based object counting by spatially regularized regional proposal network. In ICCV, pp. 4165–4173.
[11] (2018) Squeeze-and-excitation networks. In CVPR, pp. 7132–7141.
[12] (2017) RON: reverse connection with objectness prior networks for object detection. In CVPR, pp. 5244–5252.
[13] (2012) ImageNet classification with deep convolutional neural networks. In NeurIPS, pp. 1106–1114.
[14] (2018) CornerNet: detecting objects as paired keypoints. In ECCV, pp. 765–781.
[15] (2019) Simultaneously detecting and counting dense vehicles from drone images. TIE 66 (12), pp. 9651–9662.
[16] (2017) Feature pyramid networks for object detection. In CVPR, pp. 936–944.
[17] (2017) Focal loss for dense object detection. In ICCV, pp. 2999–3007.
[18] (2016) SSD: single shot multibox detector. In ECCV.
[19] (2015) Fully convolutional networks for semantic segmentation. In CVPR, pp. 3431–3440.
[20] (2016) A large contextual dataset for classification, detection and counting of cars with deep learning. In ECCV, pp. 785–800.
[21] (1985) An image synthesizer. In SIGGRAPH, pp. 287–296.
[22] (2002) Improving noise. TOG 21 (3), pp. 681–682.
[23] (2017) YOLO9000: better, faster, stronger. In CVPR, pp. 6517–6525.
[24] (2018) YOLOv3: an incremental improvement. CoRR abs/1804.02767.
[25] (2015) Faster R-CNN: towards real-time object detection with region proposal networks. In NeurIPS, pp. 91–99.
[26] (2015) U-net: convolutional networks for biomedical image segmentation. In MICCAI, pp. 234–241.
[27] (2016) Beyond skip connections: top-down modulation for object detection. CoRR abs/1612.06851.
[28] (2015) Very deep convolutional networks for large-scale image recognition. In ICLR.
[29] (2019) Divide and count: generic object counting by image divisions. TIP 28 (2), pp. 1035–1044.
[30] (2010) Computer vision: algorithms and applications. Springer Science & Business Media.
[31] (2018) Non-local neural networks. In CVPR, pp. 7794–7803.
[32] (2019) RepPoints: point set representation for object detection. CoRR abs/1904.11490.
[33] (2016) UnitBox: an advanced object detection network. In ACM MM.
[34] (2017) Brain tumor segmentation based on refined fully convolutional neural networks with A hierarchical dice loss. CoRR abs/1712.09093.
[35] (2016) Learning dense correspondence via 3d-guided cycle consistency. In CVPR, pp. 117–126.
[36] (2019) Objects as points. CoRR abs/1904.07850.