Orientation Aware Object Detection with Application to Firearms

Javed Iqbal, M. Akhtar Munir, Arif Mahmood,
Afsheen R. Ali, and Mohsen Ali
Intelligent Machines Lab (IML), Department of Computer Science,
Information Technology University (ITU), Punjab, Lahore, Pakistan
Emails: arif.mahmood@itu.edu.pk, mohsen.ali@itu.edu.pk
Corresponding authors: A. Mahmood (https://itu.edu.pk/faculty-itu/dr-arif-mahmood/) and M. Ali (https://itu.edu.pk/faculty-itu/mohsen-ali/).

Automatic detection of firearms is important for enhancing the security and safety of people; however, it is a challenging task owing to the wide variations in shape, size and appearance of firearms. Viewing angle variations and occlusions by the weapon's carrier and the surrounding people further increase the difficulty of the task. Moreover, existing object detectors process rectangular areas, though a thin and long rifle may actually cover only a small percentage of that area, and the rest may contain irrelevant details that suppress the required object signatures. To handle these challenges we propose an Orientation Aware Object Detector (OAOD), which achieves improved firearm detection and localization performance. The proposed detector has two phases. In Phase-1 it predicts the orientation of the object, which is used to rotate the object proposal. Maximum area rectangles are cropped from the rotated object proposals, which are again classified and localized in Phase-2 of the algorithm. The oriented object proposals are mapped back to the original coordinates, resulting in oriented bounding boxes which localize the weapons much better than axis aligned bounding boxes. Being orientation aware, our non-maximum suppression is able to avoid multiple detections of the same object and can better resolve objects which lie in close proximity to each other. This two phase system allows OAOD to predict oriented bounding boxes while being trained only on the axis aligned boxes in the ground-truth. In order to train object detectors for firearm detection, a dataset consisting of around eleven thousand firearm images is collected from the internet and manually annotated. The proposed ITU Firearm (ITUF) dataset contains a wide range of guns and rifles. The OAOD algorithm is evaluated on the ITUF dataset and compared with current state of the art object detectors. Our experiments demonstrate the excellent performance of the proposed detector for the task of firearm detection.

object detection, orientation, deep learning, firearms, gun violence.


I Introduction

Gun violence incidents are widespread across the globe and are being observed with increasing frequency [5, 2, 3]. Every year, around a quarter million people die due to gun violence [1]. Steps for firearm control do not seem to be effective despite such a large number of unfortunate events. Gun violence affects every part of the globe and has adverse effects on humanity. The issue needs to be addressed scientifically for the betterment of public health and safety. Recently, strong voices have been raised for research backed by scientific knowledge to prevent gun violence [21] and for funding of such projects [4]. Around the globe, governments and private security entities have been expanding the use of surveillance systems to monitor and secure buildings, banks, and important public places such as parks and shopping malls, as well as public gatherings and events. An effective firearm detection method, embedded in the surveillance systems and based on robust scientific algorithms, is essential to prevent gun violence or to generate a timely response. Such steps will not only increase the sense of safety in the public; a timely response might also reduce medical costs and lessen the burden on the economy.

Fig. 1: Firearms detected by the proposed OAOD algorithm along with the predicted oriented bounding boxes, which aim to better localize the firearm.

Visual systems have been used extensively for the recognition and identification of objects in a scene. Such systems can also be utilized for firearm detection owing to the specific visual signatures of these objects. The recent CNN based object detection methods [35, 32, 27] have shown success in detecting a wide variety of objects; however, the generic versions of these detectors may not perform well when dealing with specific objects such as firearms [41], [23], [24], [19], [28]. This is because the inherent size and shape variations of firearms, occlusions, unfavorable viewing angles and clutter make firearm detection more challenging than the detection of other objects such as human faces and vehicles. One limitation of the existing methods is the use of axis aligned windows for object detection. Most classifiers decide the presence of an object by analyzing the features in that window. The physically thin and elongated structure of most rifles and the small size of most guns make these axis aligned windows inefficient due to a low signal to noise ratio, where the signal is the firearm signature and the noise is everything else in the window. When a firearm is being carried by a person, the window will tend to contain substantial information belonging to the background or to non-firearm objects, such as the person himself (Fig. 2). This mixture of information makes it difficult for classifiers to learn to separate the required signal from the other objects acting as noise (Fig. 4). Recently, Zhou et al. have proposed an orientation invariant feature detection method to handle planar rotations; however, the proposed detector cannot efficiently handle the clutter [42]. Some recent oriented object detection methods make the angle a part of the anchor, which increases the number of anchors that need to be classified for every proposal: for example, 54 and 90 anchors in [30] and [7] respectively. Having a large number of anchors is computationally inefficient. Also, for the region proposal network to be trained, oriented boxes are needed as the ground-truth, which are not readily available for most of the existing object detection datasets because they are difficult to annotate. The method proposed in the current paper does not require oriented bounding boxes for training. The separation between the orientation prediction and the region proposal network allows us to use a small number of anchors in comparison to algorithms that use orientation as part of the anchors [30], [7].

In the current work, we propose an image based firearm detection algorithm which improves the detection accuracy by removing the clutter information and uses fewer anchors for oriented object detection. The flow diagram of the proposed algorithm is shown in Fig. 3. The proposed algorithm has two phases. In Phase-1, an orientation prediction module is trained to predict the possible object orientation for each region proposal. The axis aligned region proposals, together with the orientation information, are used to set up the warping and cropping functionality used in the RoI-pooling step. Our proposed warping and cropping module minimizes the unwanted redundant information, which helps the classifier to reduce the effect of clutter in the background. This is done by selecting a maximum area rectangle out of the warped region of interest; these rectangles are named Oriented Regions of Interest (ORoIs). The classifier in Phase-2 predicts the probability of a region proposal being a gun or rifle, and also the changes that are needed to improve localization of the firearm. The last module transforms these oriented proposals back to axis aligned object detection rectangles for comparison with the existing approaches. The proposed method is named Orientation Aware Object Detector (OAOD) because it detects both the objects and their orientations.

For the purpose of training and evaluation, an extensive dataset consisting of a wide variety of firearms is collected from the internet. It consists of photographs of real scenes as well as ones from dramas and movies. This dataset is manually annotated by marking the axis aligned bounding boxes. The dataset contains 10,973 images and has diverse characteristics, including images capturing multiple firearms, diverse environments, pose variations of humans carrying weapons and images of firearms without humans. The proposed OAOD algorithm is compared with existing methods including FRCNN [35], YOLO [32], YOLOv2 [33], YOLOv3 [34], SSD [27] and DSSD [13] trained on the same firearm dataset. The proposed OAOD algorithm has outperformed these methods by a significant margin. We expect this work to help improve state of the art algorithms for firearm detection. Our main contributions are listed below:

  • We present the first comprehensive work on firearm detection in RGB images.

  • We analyse the shortcomings of using axis aligned boxes for detecting oriented thin objects like rifles when the background information appears as noise.

  • We propose a two phase system, where orientation prediction is not part of the region proposal, thus keeping the system computationally efficient.

  • We exploit the independence between the two Phases to predict oriented bounding boxes without having oriented boxes in the training dataset.

  • We propose an extensive firearm dataset of about eleven thousand annotated images with single and multiple firearms.

The rest of the paper is organized as follows. Related work is discussed in Section II. The proposed Orientation Aware Object Detection (OAOD) algorithm is explained in Section III. The experiments and results, including a study of the proposed dataset, are discussed in Section IV, and the conclusion follows in Section V.

II Related Work

Research on visual firearm detection in images or videos is quite sparse, and currently there is no dedicated firearm detector or firearm benchmark dataset for performance evaluation and comparison. In the ImageNet dataset [11], only two of the 1000 object classes are dedicated to firearms: rapidly firing automatic gun, and pistol or revolver with revolving cylinder, which is quite a small collection. Olmos et al. have recently applied FRCNN to handgun detection in video frames [31], while no results have been reported on rifle detection. Akcay et al. have applied FRCNN [35], RFCN [10], YOLO v2 [33] and RCNN [15] to object detection within x-ray baggage security imagery [6]. Among the 5 object classes, two classes are guns and gun-parts. In contrast to these works, we address, for the first time, the problem of visual firearm detection in images in a more comprehensive way.

Despite very limited research efforts on firearm detection, significant research has been done on developing generic object detectors. Some very important detectors belong to the YOLO family [32, 34, 33] of deep convolutional object detectors. YOLO v2 [33] covers more than 9000 categories using the word tree concept, a hierarchical model that merges labels from different datasets, and improves on YOLO [32] by introducing anchor boxes, batch normalization and multiscale training. Dimension clusters help in maintaining good priors for the anchors. YOLO v3 [34] incorporated multiscale detection using the idea of the feature pyramid network [25]. YOLO detectors are well known for high detection speed, though in recent studies both YOLO v2 and v3 have been found to be lower in performance than FRCNN [26, 34].

Similar to YOLO, the Single Shot multi-box Detector (SSD) is a single stage end to end object detector. It uses VGG-16 as the base network; the fully connected layers of VGG-16 are converted to convolutional layers and followed by additional convolutional layers. SSD performance degrades in the case of small objects, as these are detected from the shallow layers. A newer variant of SSD is DSSD [13], in which ResNet-101 [17] replaces VGG-16. The deconvolutional layers in DSSD, combined with skip connections, produce feature maps that are richer in context information, giving better detection than SSD. In a recent study, DSSD performance was lower than that of Faster RCNN [26]. Also, DSSD has high computational complexity due to the deconvolution layers.

Fig. 2: Sample images showing the problem of irrelevant information included in the axis aligned bounding box enclosing firearms. The object aligned bounding boxes detected by the proposed algorithm successfully mitigate this problem.

In the firearm detection problem, guns may appear at a very small size compared to the overall image dimensions. In small object detection, image resolution, scale and contextual information may be significant for learning a deep model [18]. Zhang et al. have proposed a cascade that incorporates features from the RPN for detecting small objects such as pedestrians in the whole image [41]. Lin et al. proposed focal loss for handling class imbalance [26]. Smith et al. proposed bounding box selection by introducing a novel fitness loss [38]. Singh et al. have proposed scale normalized training to address the problem of extreme scale variations [37]. Zhai et al. present aspect ratio and region wise attention to address the fact that classical RoIs do not handle aspect ratios region wise [40]. Liu et al. and Chen et al. have emphasized the significance of context and instance relationships for accurate object detection [29], [9]. For example, if a person is holding a tennis racket then there should be a ball nearby. However, in the case of firearms, most of the contextual objects may be irrelevant to the presence or absence of a firearm.

Huang et al. have performed a detailed speed-accuracy trade-off analysis of recent object detectors [20], in which they found that Faster RCNN is more stable than the other detectors. Faster RCNN has evolved to its current form after going through many variations. In the earliest version, RCNN detected the region proposals using the selective search approach [39] in the image domain and applied deep convolutional networks to every proposal to extract high level hierarchical features, which were then classified using an SVM [15]. Later, in Fast-RCNN, deep hierarchical features were extracted once for the whole image, and the region proposals generated using the selective search approach were propagated to the feature domain. The Region of Interest (RoI) pooling layer was introduced to get a fixed-size region of interest. In Faster RCNN, an RPN is trained to directly generate region proposals. At every location in the feature map, anchors of different scales and ratios are generated. The RPN predicts offsets and objectness scores for each anchor box, applies the offsets to the anchors and performs NMS to reduce the number of region proposals. RoI pooling is performed on the selected RoIs, which are then classified. A large number of FRCNN variants have been proposed, of which the Feature Pyramid Network (FPN) and Mask RCNN [25, 16] are the most important. FPN creates feature pyramids and passes features to the high-resolution maps in order to deal with multiple scales.

Cai et al. have proposed Cascade RCNN, which is a sequence of detectors designed to reduce false detections [8]. Each FRCNN in the sequence is trained on a higher IoU threshold than its predecessor. Similar to Cascade RCNN, we also propose to use two classifiers in a sequence. However, both classifiers are trained for the same IoU of 0.50. The first classifier predicts the orientation and the region proposal offset. Based on the result of the first classifier, the region proposals are adjusted and their ground truth labels are updated. The adjusted region proposals are rotated and cropped to reduce irrelevant context information and then fed to the second classifier. Thus our proposed Orientation Aware Object Detection (OAOD) algorithm is novel and inherently different from the existing object detectors, especially Cascade RCNN.

Fig. 3: An overview of the proposed Orientation Aware Object Detector (OAOD) showing the different components in Phase-1 and Phase-2.

III Orientation Aware Object Detection (OAOD) Algorithm

Most of the current object detectors employ axis aligned bounding boxes, which may incur noise and clutter due to uncorrelated background objects, as shown in Fig. 2. Such noise may adversely affect the performance of object detection. To overcome this issue, we propose the Orientation Aware Object Detection (OAOD) algorithm, a single pipelined network made up of a cascade of two phases, as shown in Fig. 3. In Phase-1, the orientation of the object is predicted along with the classification score and the offset to the region proposal. In Phase-2, the updated object proposals are warped according to the predicted orientation such that the object becomes axis aligned. The maximum area rectangle contained within the rotated proposal is cropped, and the final Oriented Region of Interest (ORoI) is used for further classification and offset regression. Thus the redundant information contained within the region proposals is significantly reduced for non-axis aligned firearms. The classifier in Phase-2 is trained to predict classification scores on these oriented and cropped proposals to achieve better performance. The proposed network takes an entire image as input and detects and localizes two types of firearms: rifles and guns. In the following, both Phase-1 and Phase-2 are explained in more detail.

III-A OAOD Phase-1

Phase-1 of OAOD algorithm mainly consists of deep feature extraction using VGG-16, Region Proposal Network (RPN), RoI pooling and a set of Fully Connected (FC) layers for classification and regression. Each of these components is explained in the following subsections.

III-A1 Deep Feature Computation

Deep features are computed over the input image and are then used by the RPN and the FC layers based classifier. Though deep features may be computed by employing any suitable deep neural network, in our implementation we use an ImageNet pre-trained VGG-16 network [36]. Employing a deeper network such as ResNet-101 [17] may result in improved accuracy at the cost of increased space complexity. The entire image is processed through the convolutional layers of VGG-16, and features are extracted from the last convolutional layer (conv5_3). Since the size of the input images may vary, the spatial dimensions of the deep features may also vary, while the number of channels (depth) remains the same.

III-A2 Region Proposal Network (RPN)

The deep features obtained from VGG-16 are input to the Region Proposal Network (RPN). The RPN is randomly initialized and then trained using the training part of the firearm dataset to generate objectness scores and proposals for the objects present in an image. At each location in the feature map, 9 anchor boxes are drawn to cater for objects of varying sizes, as learned from the training dataset. The RPN processes each anchor box, generating an objectness score and an offset to the input anchor box. All anchor boxes are sorted in descending order of objectness score and a fixed number of the best anchor boxes are selected. In our implementation, during training we select the top 2000 anchor boxes with maximum objectness score, known as object proposals. To further reduce the computational complexity of the training process, only a fraction of these best proposals is randomly selected and fed to the RoI pooling. At this stage we randomly select 64 proposals, consisting of 48 background proposals and 16 firearm proposals, during training. During testing, the 200 proposals with the best objectness scores are selected from each image, assuming that the number of potential firearms to be detected is significantly less than 200.
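The proposal selection described above can be illustrated with a minimal NumPy sketch. It assumes that each proposal already carries a firearm/background flag (the labeling step is described in the next subsection); the function name `select_training_proposals`, its parameters and the fixed random seed are hypothetical and are not taken from the authors' Caffe implementation.

```python
import numpy as np

def select_training_proposals(scores, is_firearm, top_k=2000, n_fg=16, n_bg=48, seed=0):
    """Keep the top_k proposals by objectness, then sample a small mini-batch.

    scores     : (N,) objectness scores produced by the RPN
    is_firearm : (N,) boolean array, True where the proposal is a gun or rifle
    Returns indices into the original arrays (firearm samples first).
    """
    rng = np.random.default_rng(seed)
    order = np.argsort(-scores)[:top_k]        # descending objectness
    fg = order[is_firearm[order]]              # firearm proposals among the best
    bg = order[~is_firearm[order]]             # background proposals among the best
    # Sample without replacement; fewer are returned if not enough are available.
    fg_pick = rng.choice(fg, size=min(n_fg, fg.size), replace=False)
    bg_pick = rng.choice(bg, size=min(n_bg, bg.size), replace=False)
    return np.concatenate([fg_pick, bg_pick])
```

With the defaults this returns 64 indices (16 firearm, 48 background) whenever enough proposals of each kind are present among the top 2000.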

III-A3 Region Proposal Labeling

The region proposals generated by the RPN are labeled with class labels, orientation labels and bounding box offsets using the manually annotated ground truth bounding boxes. For each region proposal, the IoU is computed with the ground truth boxes, and the class and orientation labels of a ground truth box are assigned to the proposal when the IoU is greater than 0.5, as given by (1). Similarly, the bounding box offsets are calculated between the ground truth boxes and the RPN boxes. The labeled RoIs are then used to train the OAOD Phase-1 networks.

$$\ell = \begin{cases} \text{gun}, & \text{if } \mathrm{IoU}_g > 0.5 \\ \text{rifle}, & \text{if } \mathrm{IoU}_r > 0.5 \\ \text{background}, & \text{otherwise} \end{cases} \tag{1}$$

where $\ell$ is the class label assigned to the proposal, $\mathrm{IoU}_g$ is the IoU of an RPN box with a bounding box of a gun and $\mathrm{IoU}_r$ is the IoU of the RPN box with a bounding box of a rifle.
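The labeling rule can be sketched in a few lines of Python. This is an illustrative sketch only: the helper names (`iou`, `label_proposal`), the integer class codes and the choice of matching each proposal to the ground truth box with the highest IoU are assumptions, not details given in the text.

```python
BACKGROUND, GUN, RIFLE = 0, 1, 2  # hypothetical integer codes for the three classes

def iou(box_a, box_b):
    """IoU of two axis aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def label_proposal(proposal, gt_boxes, gt_classes, gt_orients, thresh=0.5):
    """Return (class, orientation, matched ground truth index) for one proposal.

    Background proposals get orientation -1 and index -1, so that the
    orientation and regression losses can skip them.
    """
    best_iou, best = 0.0, -1
    for k, gt in enumerate(gt_boxes):
        value = iou(proposal, gt)
        if value > best_iou:
            best_iou, best = value, k
    if best >= 0 and best_iou > thresh:
        return gt_classes[best], gt_orients[best], best
    return BACKGROUND, -1, -1
```

For instance, a proposal `(0, 0, 10, 10)` and a rifle box `(0, 0, 10, 8)` overlap with IoU 0.8, so the proposal inherits the rifle label and the orientation class of that box.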

III-A4 Phase-1 RoI Pooling (P1-RoIP)

P1-RoI pooling takes as input the best proposals from the RPN and selects the corresponding feature maps from the deep features already computed by VGG-16. Since the size of the best proposals selected from the RPN may vary, the size of the feature maps will also vary accordingly. However, the downstream fully connected layers only accept a fixed size input, which is obtained by pooling the feature maps. A grid of 7×7 cells is superimposed on the spatial dimensions of the feature maps corresponding to each object proposal. From each grid cell, the maximum value is selected. Thus the variable spatial dimension of the deep feature maps is reduced to 49 values (7×7), while the depth remains constant at 512 channels. This fixed size feature map is then input to the fully connected layers.
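The pooling step can be sketched with NumPy as follows. This is a simplified illustration, assuming the RoI is already expressed in feature-map cells and lies inside the map; the function name `roi_max_pool` and the way fractional cell borders are rounded are assumptions rather than the exact behaviour of the Caffe RoI pooling layer.

```python
import numpy as np

def roi_max_pool(features, roi, grid=7):
    """Max-pool one RoI of a (C, H, W) feature map into a (C, grid, grid) array.

    roi is (x1, y1, x2, y2) in feature-map cells, with x2 > x1 and y2 > y1.
    """
    channels = features.shape[0]
    x1, y1, x2, y2 = roi
    xs = np.linspace(x1, x2, grid + 1)   # column borders of the grid cells
    ys = np.linspace(y1, y2, grid + 1)   # row borders of the grid cells
    out = np.empty((channels, grid, grid), dtype=features.dtype)
    for i in range(grid):
        r0 = int(np.floor(ys[i]))
        r1 = max(int(np.ceil(ys[i + 1])), r0 + 1)   # at least one row per cell
        for j in range(grid):
            c0 = int(np.floor(xs[j]))
            c1 = max(int(np.ceil(xs[j + 1])), c0 + 1)  # at least one column per cell
            out[:, i, j] = features[:, r0:r1, c0:c1].max(axis=(1, 2))
    return out
```

Whatever the size of the RoI, the output always has 7×7 = 49 spatial values per channel, which is what allows the fully connected layers to have a fixed input size.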

III-A5 Phase-1 Fully Connected Network (P1-FCN)

We train a Fully Connected Network (FCN) to classify the pooled feature maps obtained from the RoI pooling. FCN consists of an input layer, two hidden layers and a separate output layer for each of the three tasks: object classification, orientation classification and offset regression. The gun/rifle classification loss function is defined as:


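$$L_{cls} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{C} y_{i,c}\,\log p_{i,c} \tag{2}$$

where $p_{i,c}$ is the predicted firearm class probability and $y_{i,c}$ is the actual firearm class label, $C$ is the number of object classes including background, gun, and rifle, and $N$ is the number of object proposals in a mini batch.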

The objects are divided into 8 orientation classes in the range of 0°-180°, such that class 0 includes objects oriented in 348.75°-0°-11.25°, class 1 includes 11.25°-33.75°, class 2 includes 33.75°-56.25°, class 3 includes 56.25°-78.75°, class 4 includes 78.75°-101.25°, class 5 includes 101.25°-123.75°, class 6 includes 123.75°-146.25° and class 7 includes 146.25°-168.75°, as shown in Fig. 5. The other half circle contains objects pointing in exactly the opposite direction, which are assigned to the same classes as their counterparts in the upper half circle. The orientation loss function is defined as

$$L_{ori} = -\frac{1}{N_f}\sum_{i=1}^{N} u_i \sum_{k=1}^{K} y^{o}_{i,k}\,\log p^{o}_{i,k} \tag{3}$$

where $u_i$ is an indicator variable defined for each object proposal as below

$$u_i = \begin{cases} 1, & \text{if proposal } i \text{ corresponds to a firearm} \\ 0, & \text{if proposal } i \text{ corresponds to the background} \end{cases} \tag{4}$$

where $p^{o}_{i,k}$ is the predicted orientation class probability and $y^{o}_{i,k}$ is the actual orientation class label, $K$ is the number of orientation classes, $N$ is the number of object proposals in a mini batch and $N_f$ is the number of object proposals in a mini batch corresponding to the firearms in the ground truth. Equation (4) shows that, for the object proposals corresponding to the background, the orientation loss is ignored during training. The objective function for the bounding box regression is given by:


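$$L_{reg} = \frac{1}{N_f}\sum_{i=1}^{N} u_i \sum_{j \in \{x, y, w, h\}} S\left(t_{i,j} - t^{*}_{i,j}\right) \tag{5}$$

where $t_{i,j}$ are the predicted bounding box offsets and $t^{*}_{i,j}$ are the actual ground truth offsets for the respective proposal boxes, and $N_f$ and $u_i$ are the same as defined above. For a proposal box $P = (x_p, y_p, w_p, h_p)$ and a ground truth box $G = (x_g, y_g, w_g, h_g)$, the ground truth offsets are defined as: $t^{*}_x = (x_g - x_p)/w_p$, $t^{*}_y = (y_g - y_p)/h_p$, $t^{*}_w = \log(w_g/w_p)$ and $t^{*}_h = \log(h_g/h_p)$. Similarly, for the predicted offsets $t$ of proposal $P$, the output bounding box is calculated as: $x = w_p t_x + x_p$, $y = h_p t_y + y_p$, $w = w_p \exp(t_w)$ and $h = h_p \exp(t_h)$, respectively. $S$ is the smooth $L_1$ function [14], defined as

$$S(x) = \begin{cases} 0.5\,x^2, & \text{if } |x| < 1 \\ |x| - 0.5, & \text{otherwise} \end{cases} \tag{6}$$

During training, only those object proposals which correspond to firearms in the ground truth are considered in the bounding box loss, while the others, corresponding to the background, are ignored by using the indicator variable $u_i$. The overall objective function is a weighted combination of these individual losses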


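$$L_{P1} = \lambda_1 L_{cls} + \lambda_2 L_{ori} + \lambda_3 L_{reg} \tag{7}$$

where $L_{cls}$, $L_{ori}$ and $L_{reg}$ are the classification, orientation and bounding box regression losses, and $\lambda_1$, $\lambda_2$, $\lambda_3$ are the normalization weights which assign relative importance to each term of the objective function. The ground truth regression targets are normalized to have zero mean and unit standard deviation.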
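The offset parameterization and the smooth function used in the regression loss can be illustrated with a short NumPy sketch. It assumes the standard parameterization of [14], with boxes given as centre coordinates, width and height; the function names are hypothetical, and the normalization of the targets to zero mean and unit standard deviation is omitted for brevity.

```python
import numpy as np

def encode_offsets(proposal, gt):
    """Ground truth offsets of a box gt relative to a proposal, both (cx, cy, w, h)."""
    px, py, pw, ph = proposal
    gx, gy, gw, gh = gt
    return np.array([(gx - px) / pw, (gy - py) / ph, np.log(gw / pw), np.log(gh / ph)])

def decode_offsets(proposal, t):
    """Apply predicted offsets t = (tx, ty, tw, th) to a proposal (cx, cy, w, h)."""
    px, py, pw, ph = proposal
    return np.array([pw * t[0] + px, ph * t[1] + py, pw * np.exp(t[2]), ph * np.exp(t[3])])

def smooth_l1(x):
    """Smooth L1 function: quadratic near zero, linear for |x| >= 1."""
    x = np.abs(np.asarray(x, dtype=float))
    return np.where(x < 1.0, 0.5 * x ** 2, x - 0.5)
```

Decoding the encoded offsets recovers the ground truth box, and `smooth_l1` grows only linearly for large errors, which keeps badly localized proposals from dominating the regression loss.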

III-B OAOD Phase-2

The orientations obtained from the Phase-1 and the updated RPN boxes are input to the Phase-2 along with the extracted feature maps. Phase-2 is composed of a warping and cropping module, an RoI pooling layer and a FC layers module to regress oriented RoI’s offsets. Following are the preprocessing steps before Phase-2:

III-B1 Updating Region Proposals and Label Assignments

The offsets obtained in Phase-1 are applied to the region proposals obtained from the RPN. An updated proposal may be translated and scaled depending upon the offset values, so its IoU with the ground truth may change. The region proposal labels are therefore updated by rechecking the IoU with the ground truth, using the same criterion as defined by (1). These updated region proposals are then used for the training of Phase-2 of the OAOD algorithm.

III-B2 Warping and Cropping

Based on the predicted orientation during Phase-1, a series of transformations is applied on both the feature map and the updated region proposals in Phase-2. This process ensures the firearm gets aligned with horizontal axis making Oriented RoIs (ORoIs). It is assumed that the firearm is contained in the diagonal region of the updated region proposal which will become horizontally aligned after warping. A maximum area rectangle is cropped in each warped bounding box by finding upper and lower height limits. Based on these limits, an area from the oriented feature map is cropped using bi-linear interpolation as suggested by [16]. Feature maps of ORoIs are then input to the RoI pooling layer. For boundary conditions where the cropped area goes outside the oriented feature map, we replicate the boundary values. Irrelevant information is removed with the unnecessary context after cropping, resulting in classification performance improvement (Fig. 4).
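To make the idea of a maximum area rectangle concrete, the following Python sketch computes the largest axis aligned rectangle that fits inside a w×h rectangle rotated by a given angle. It is a generic closed-form geometric solution, given here only for illustration; it is not the upper and lower height limit procedure used in the text, and the function name `max_area_rect_in_rotated` is hypothetical.

```python
import math

def max_area_rect_in_rotated(w, h, angle_rad):
    """Width and height of the largest axis aligned rectangle inside a
    w x h rectangle that has been rotated by angle_rad about its centre."""
    if w <= 0 or h <= 0:
        return 0.0, 0.0
    width_is_longer = w >= h
    side_long, side_short = (w, h) if width_is_longer else (h, w)
    sin_a, cos_a = abs(math.sin(angle_rad)), abs(math.cos(angle_rad))
    if side_short <= 2.0 * sin_a * cos_a * side_long or abs(sin_a - cos_a) < 1e-10:
        # Half-constrained case: two corners of the crop touch the longer sides.
        x = 0.5 * side_short
        wr, hr = (x / sin_a, x / cos_a) if width_is_longer else (x / cos_a, x / sin_a)
    else:
        # Fully constrained case: the crop touches all four sides.
        cos_2a = cos_a * cos_a - sin_a * sin_a
        wr = (w * cos_a - h * sin_a) / cos_2a
        hr = (h * cos_a - w * sin_a) / cos_2a
    return wr, hr
```

For an unrotated proposal the crop equals the proposal itself, while for a square proposal rotated by 45° each side of the crop shrinks by a factor of about 0.71, which gives a feel for how much background is discarded by the cropping step.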

III-B3 Phase-2 RoI Pooling (P2-RoIP)

The warped and cropped region proposals and the corresponding feature maps are input to the P2-RoIP module. It reduces each feature map to a fixed-size grid of values, which is then input to the FC layers.

Fig. 4: Features across the channels before and after warping and cropping. Before warping and cropping, the features to be pooled for classification are very noisy and cluttered. After warping and cropping, the maximum area rectangle is chosen for pooling of the ORoIs, which increases the classification score.

III-B4 Phase-2 Fully Connected Network (P2-FCN)

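The design of the P2-FCN layers is almost the same as that used in Phase-1. The P2-FCN layers predict orientation aware classification scores and region proposal offsets. These offsets are then applied to the oriented cropped region proposals before further processing. The objective function of P2-FCN consists of two loss components: classification and bounding box regression. The classification loss is given below

$$L'_{cls} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{C} y_{i,c}\,\log p'_{i,c} \tag{8}$$

The bounding box regression loss is as follows

$$L'_{reg} = \frac{1}{N_v}\sum_{i=1}^{N} v_i \sum_{j \in \{x, y, w, h\}} S\left(t'_{i,j} - t^{*}_{i,j}\right) \tag{9}$$

where $p'_{i,c}$ and $t'_{i,j}$ are the class probabilities and offsets predicted in Phase-2, $N_v$ is the number of object proposals for which $v_i = 1$, and $v_i$ is an indicator variable defined for each object proposal as follows

$$v_i = \begin{cases} 1, & \text{if the ground truth orientation of proposal } i \text{ is } 0° \text{ or } 90° \\ 0, & \text{otherwise} \end{cases} \tag{10}$$

$v_i$ indicates that the bounding box loss is incorporated only when the ground truth angle of the object proposal is 0° or 90°, and is ignored if the orientation is different. This is because, for these two cases, the original and the warped object proposals remain the same, while for other cases the warped proposal is different from the original proposal and is therefore not used as ground truth. The overall combined objective function for P2-FCN is given below: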


The output of P2-FCN module is input to the inverse transformation module.

III-B5 Inverse Transformation of Adjusted ORoIs

In Phase-2 we predict offsets for the cropped region proposals. These offsets are applied to the cropped region proposals to obtain Adjusted Oriented Regions of Interest (Adjusted ORoIs). The Adjusted ORoIs are mapped back to image space using the orientation information from Phase-1 and the centre position of the updated region proposals. A homogeneous transformation matrix is built from the parameters that were used to warp and crop, and its inverse is applied to each Adjusted ORoI, together with its associated class, to obtain the oriented bounding box in image space. More details can be seen in Algorithm 1.
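A minimal NumPy sketch of this mapping is given below. It assumes that the forward warp is a pure rotation by the predicted angle about the centre of the updated region proposal, and that the Adjusted ORoI is expressed in the warped frame with the same pivot; the function names, the sign convention of the angle and the omission of the cropping offset are all assumptions made for illustration.

```python
import numpy as np

def forward_warp_matrix(centre, theta_deg):
    """Homogeneous 3x3 matrix that rotates image points by -theta about centre,
    so that an object oriented at theta becomes horizontal."""
    cx, cy = centre
    t = np.deg2rad(theta_deg)
    c, s = np.cos(t), np.sin(t)
    to_origin = np.array([[1.0, 0.0, -cx], [0.0, 1.0, -cy], [0.0, 0.0, 1.0]])
    rotate = np.array([[c, s, 0.0], [-s, c, 0.0], [0.0, 0.0, 1.0]])  # rotation by -theta
    back = np.array([[1.0, 0.0, cx], [0.0, 1.0, cy], [0.0, 0.0, 1.0]])
    return back @ rotate @ to_origin

def oroi_to_oriented_box(oroi, centre, theta_deg):
    """Map an axis aligned ORoI (x1, y1, x2, y2) in the warped frame back to
    image space; returns the four corners of the oriented box as a (4, 2) array."""
    x1, y1, x2, y2 = oroi
    corners = np.array([[x1, y1, 1.0], [x2, y1, 1.0], [x2, y2, 1.0], [x1, y2, 1.0]])
    inverse = np.linalg.inv(forward_warp_matrix(centre, theta_deg))
    return (inverse @ corners.T).T[:, :2]
```

Because the transformation is rigid, the side lengths of the ORoI are preserved; only its position and orientation in the image change.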

Fig. 5: Orientation is divided into 8 classes, with firearms pointing in exactly the opposite direction placed in the same class. Values in blue show the distribution of the firearm data over the different orientation classes.

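Algorithm 1 Inverse Transformation
Input: predicted orientations and updated region proposals from Phase-1, the corresponding ORoIs from Phase-2, and the total number of object proposals
Output: Oriented Bounding Boxes
1: for each object proposal do
2:     build the transformation matrix from the warping and cropping parameters of the proposal, invert it, and map the Adjusted ORoI back to image space
3: end for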

IV Experiments and Results

For automatic firearm detection, currently no image or video dataset is publicly available for training machine learning algorithms. In the current work we for the first time propose an annotated firearm detection dataset named as ‘ITU Firearms’ (ITUF). The proposed OAOD algorithm is compared with current state of the art object detection algorithms including SSD [27], DSSD [13], YOLOv2 [33], YOLOv3 [34] and FRCNN [35] on the ITUF dataset. In addition to these algorithms, OAOD is also compared with some variants including 2-Loss Net, 3-Loss Net, H/V Net and Phase-1 Net. These variant networks are discussed in Section IV-C. In a wide range of experiments, the proposed OAOD algorithm has shown excellent performance compared to the other networks.

Fig. 6: Sample images from the ITUF dataset. Firearm aligned bounding box detections by the OAOD algorithm are also shown.

IV-A ITU Firearms (ITUF) Dataset

ITUF dataset consists of images of Guns and Rifles from different scenarios of practical importance such as being pointed, being carried, lying on tables, ground or in racks. These variations allow machine learning algorithms to overcome dress variations, body pose variations, firearm pose and size variations, varying light conditions and both indoor & outdoor scenarios making a strong prior for data driven algorithms. Some sample images from the dataset are shown in Fig. 10.

We collected this dataset using web scraping with keywords such as weapons, wars, pistol, movie names, firearms, types of firearms, sniper, shooter, corps, guns and rifles. The results were cleaned to remove images not related to firearms, cartoon images and duplicated images. The final clean dataset consists of 10,973 fully annotated firearm images containing 13,647 firearm instances. Every firearm in each image was tagged by an annotator with an axis aligned bounding box, a label and an angle representing the orientation of the longitudinal axis. Following the PASCAL VOC [12] format, the bounding box is represented by a four dimensional vector containing the top-left corner followed by the bottom-right corner. Class labels are divided into Gun and Rifle, while orientation is annotated by marking the front nozzle and the back tip (hammer or butt) of the firearm. Orientations are quantized into 8 bins (0-7) of contiguous angles, as shown in Fig. 5. Since the dataset is annotated in a standard format, it is ready to be used by various state-of-the-art object detection algorithms. The dataset will soon be made publicly available so that other researchers can use it to advance state of the art performance for visual firearm detection.
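The quantization of an annotated firearm axis into one of the 8 orientation bins can be sketched as follows. The sketch assumes that angles are measured counter-clockwise from the horizontal axis and that image y coordinates grow downwards; these conventions, and the function name `orientation_class`, are assumptions, since the exact convention of Fig. 5 is not stated in the text.

```python
import math

def orientation_class(nozzle, butt, n_bins=8):
    """Quantize the firearm axis, given by its nozzle and butt points (x, y),
    into one of n_bins classes; opposite directions share a class."""
    dx = nozzle[0] - butt[0]
    dy = butt[1] - nozzle[1]          # flip y because image rows grow downwards
    angle = math.degrees(math.atan2(dy, dx)) % 180.0   # fold onto [0, 180)
    bin_width = 180.0 / n_bins        # 22.5 degrees for 8 bins
    # Shift by half a bin so that class 0 is centred on the horizontal axis.
    return int((angle + bin_width / 2.0) // bin_width) % n_bins
```

With this convention a horizontal firearm falls in class 0 regardless of which way it points, a vertical one in class 4, and a firearm at 45° in class 2.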

In the ITUF dataset, the average firearm count is about 1.24 per image (13,647 instances in 10,973 images), ranging from a single firearm to more than ten, which reflects the diverse nature of this dataset. The average ratio of firearm height to image height and of firearm width to image width is around 0.50. The size variance of the individual firearms is significant due to the presence of small, medium and larger sized firearms. Out of the 10,973 images, 8872 (about 80%) are randomly selected for training (including the validation set, which is 20% of the training data), while the remaining 2101 are used as unseen test images. Detailed statistics of the dataset, such as the distribution of different types of firearms, average image size and firearm-size ratio, are provided in Table I. Since firearms are elongated objects with a high length to width ratio, the orientation of the longitudinal axis plays an important role in firearm detection performance. The firearm distribution over different orientations is shown in Fig. 5.

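| Dataset | Total Images | Avg. Firearm Height / Image Height | Avg. Firearm Width / Image Width | Avg. Image Size (Pixels) | Firearm Count (Rifle) | Firearm Count (Gun) |
|---|---|---|---|---|---|---|
| Train | 8872 | 47.96% | 50.47% | 618.13 × 889.75 | 5769 | 5248 |
| Test | 2101 | 52.61% | 59.66% | 606.63 × 875.72 | 1556 | 1074 |

Table I: Dataset statistics. The average firearm ratios relate firearm height to image height and firearm width to image width.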

IV-B Experimental Setup

In the firearm detection experiments, we detect the location of each firearm in an input image and also predict its type (rifle or gun) and the orientation of its longitudinal axis. The ‘Rifle’ class includes automatic weapons such as AK-47, Small Machine Gun (SMG), Large Machine Gun (LMG) and hunting rifles, and the ‘Gun’ class includes different types of pistols and revolvers. In the ITUF dataset, images larger than 480×800 are scaled down, preserving aspect ratio, so that both the image height and the image width fit within this limit. A batch size of 1 is used to keep the training and testing process fast. The initial learning rate is set to 0.001, momentum to 0.90 and weight decay to 0.0005, and the SGD optimizer is used. The networks in the proposed OAOD algorithm are initialized with ImageNet pre-trained VGG-16 weights, while the ITUF dataset is used for retraining. During retraining, the first two convolutional blocks of the network are kept frozen, preserving the original weights, while the weights of the remaining blocks are updated. All implementations are done in Caffe [22] on a Core i5 machine with 32GB RAM and a GTX 1080 GPU with 8GB memory.
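The aspect-preserving downscaling described above can be sketched as follows. The function name `fit_within` is hypothetical, and the sketch assumes that 480 bounds the height and 800 bounds the width; the text does not state which dimension each limit applies to.

```python
def fit_within(height, width, max_height=480, max_width=800):
    """Return (new_height, new_width) scaled down to fit the limits.

    Assumption: 480 is the height limit and 800 is the width limit.
    Images that already fit are returned unchanged (no upscaling).
    """
    scale = min(max_height / height, max_width / width, 1.0)
    return round(height * scale), round(width * scale)


# A 960x1600 image is halved; a 400x600 image is left as it is.
large = fit_within(960, 1600)
small = fit_within(400, 600)
```

Using the smaller of the two scale factors guarantees that both dimensions respect their limits while the aspect ratio is preserved up to rounding.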

In Equation (11), two of the hyperparameters are set to 1.00. To balance the overall loss function, a third weighting parameter is searched over a wide range {1.0, 0.5, 0.325, 0.25, 0.125, 0.1, 0.0625} using a validation dataset (20% of the training data). A balanced objective function increases both the orientation accuracy and the mean average precision, as shown in Table II, where both measures peak at a weight of 0.1. The selected weight is therefore kept fixed for the rest of the experiments.

In OAOD, the region proposals from the RPN in Phase-1 are updated using the output of Phase-1 and are then used in Phase-2. The ground truth class and bounding box corresponding to each updated region proposal are updated as well. The combined network is trained simultaneously for classification and bounding box regression in Phase-2. The bounding box loss in Phase-2 is incorporated only if the orientation of the region proposal is 0 or 90 degrees, while the classification loss is used for every instance, as given by (11). To avoid over-fitting, half of the connections between the two FC-2 layers are randomly dropped out.
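The conditional use of the Phase-2 bounding box loss can be illustrated with a minimal sketch. The function `phase2_loss` is hypothetical and omits the weighting hyperparameters of Equation (11), which are not reproduced here; it only shows the gating on the predicted orientation.

```python
def phase2_loss(cls_loss, bbox_loss, orientation_deg, axis_aligned=(0, 90)):
    """Combine Phase-2 losses for one region proposal.

    The classification loss is always used. The bounding box loss is
    added only when the proposal orientation is 0 or 90 degrees.
    Assumption: unit weights on both terms (Equation (11) weights omitted).
    """
    use_bbox = orientation_deg in axis_aligned
    return cls_loss + (bbox_loss if use_bbox else 0.0)


axis_aligned_total = phase2_loss(0.5, 0.25, 90)  # both terms contribute
oriented_total = phase2_loss(0.5, 0.25, 45)      # classification term only
```

This gating is consistent with training on axis aligned ground truth only: a regression target is available when the rotated proposal is still axis aligned, which is the case for horizontal and vertical orientations.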

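| Loss weight in (11) | 1 | 0.5 | 0.325 | 0.25 | 0.125 | 0.1 | 0.0625 |
|---|---|---|---|---|---|---|---|
| Orientation accuracy | 0.844 | 0.835 | 0.843 | 0.842 | 0.839 | 0.847 | 0.829 |
| mAP | 0.515 | 0.629 | 0.666 | 0.725 | 0.719 | 0.748 | 0.725 |

Table II: Orientation accuracy and mean average precision for varying values of the loss weight in (11) on the validation dataset.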
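The selection of the loss weight from the validation results can be expressed in a few lines. The sketch below simply picks the weight with the highest validation mAP, using orientation accuracy as a tie-breaker; this selection rule is an assumption, and the numbers are the ones reported in Table II.

```python
# Validation results as reported in Table II.
weights = [1.0, 0.5, 0.325, 0.25, 0.125, 0.1, 0.0625]
orientation_acc = [0.844, 0.835, 0.843, 0.842, 0.839, 0.847, 0.829]
map_scores = [0.515, 0.629, 0.666, 0.725, 0.719, 0.748, 0.725]

# Assumed rule: maximize mAP, break ties by orientation accuracy.
best_index = max(
    range(len(weights)),
    key=lambda i: (map_scores[i], orientation_acc[i]),
)
best_weight = weights[best_index]
```

For these values both criteria agree, since the weight 0.1 has the highest mAP (0.748) and the highest orientation accuracy (0.847).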
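| Methods | IoU=0.4: AP (Gun) | IoU=0.4: AP (Rifle) | IoU=0.4: mAP | IoU=0.5: AP (Gun) | IoU=0.5: AP (Rifle) | IoU=0.5: mAP | IoU=0.6: AP (Gun) | IoU=0.6: AP (Rifle) | IoU=0.6: mAP |
|---|---|---|---|---|---|---|---|---|---|
| YOLOv2 | 0.707 | 0.833 | 0.770 | 0.623 | 0.770 | 0.696 | 0.419 | 0.629 | 0.524 |
| YOLOv3 | 0.808 | 0.786 | 0.798 | 0.760 | 0.707 | 0.734 | 0.643 | 0.590 | 0.617 |
| SSD | 0.706 | 0.790 | 0.748 | 0.656 | 0.730 | 0.693 | 0.552 | 0.582 | 0.567 |
| DSSD | 0.774 | 0.789 | 0.781 | 0.730 | 0.723 | 0.727 | 0.632 | 0.589 | 0.611 |
| FRCNN | 0.887 | 0.890 | 0.889 | 0.802 | 0.794 | 0.798 | 0.678 | 0.683 | 0.681 |
| Phase-1 Net | 0.891 | 0.887 | 0.889 | 0.794 | 0.851 | 0.823 | 0.654 | 0.668 | 0.661 |
| 3-Loss Net | 0.887 | 0.886 | 0.887 | 0.792 | 0.860 | 0.826 | 0.653 | 0.666 | 0.660 |
| 2-Loss Net | 0.889 | 0.886 | 0.888 | 0.792 | 0.865 | 0.829 | 0.654 | 0.664 | 0.659 |
| H/V Net | 0.881 | 0.882 | 0.882 | 0.786 | 0.787 | 0.787 | 0.648 | 0.655 | 0.652 |
| OAOD | 0.888 | 0.896 | 0.892 | 0.844 | 0.864 | 0.854 | 0.670 | 0.740 | 0.703 |

Table III: IoU vs mAP. AP (Gun) = Average Precision for the gun class and AP (Rifle) = Average Precision for the rifle class.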
Fig. 7: Performance comparison of OAOD with other networks for IoU=0.50 and varying confidence levels.

IV-C Different Variants of the Proposed OAOD Algorithm

We have implemented three variants of the OAOD algorithm: 2-Loss Net, 3-Loss Net and H/V Net. In 2-Loss Net, we optimize the proposed cascaded model using the axis aligned bounding box regression loss from Phase-1 and the final classification loss from Phase-2. The region proposals obtained from the RPN in Phase-1 are used for warping and cropping, instead of the bounding boxes used in OAOD. 3-Loss Net is an extension of 2-Loss Net with the addition of the orientation loss from Phase-1. In 3-Loss Net, region proposals from the RPN in Phase-1 are also used for warping and cropping. In H/V Net, to avoid the high confidence false positives of 2-Loss Net, we fine-tune 2-Loss Net using only those firearm instances which are horizontally or vertically aligned in the ground truth. With the addition of this loss function, the confidence of false positives decreases significantly, at the cost of lower class scores on images with other orientations. To mitigate this, we train H/V Net alternately in the 2-Loss manner and then with the H/V oriented box prediction loss. Phase-1 of the proposed OAOD algorithm is also considered as a variant network, included in the performance comparisons and referred to as Phase-1 Net, as shown in Figure 7. The OAOD algorithm may be viewed as a combination of 2-Loss Net and H/V Net (Figure 3).

Fig. 8: Qualitative comparison of the proposed OAOD with current state-of-the-art object detectors including YOLOv2, YOLOv3, SSD, DSSD and FRCNN for IoU=0.50 and confidence score=0.65. Green rectangles are manually annotated ground truth, and red rectangles are axis aligned detections of existing algorithms (rows 1-5) and of OAOD (row 6). Magenta rectangles in row 6 are firearm aligned oriented detections of the OAOD algorithm, which exhibits fewer missed detections, more accurate localization and fewer false detections.

IV-D Comparisons with Existing Algorithms

To evaluate firearm detection performance, mean Average Precision (mAP) is used [12]. We compute the Intersection over Union (IoU) of the detected and the ground truth bounding boxes. A detection whose IoU reaches the chosen threshold is considered a true positive (TP); otherwise it is considered a false positive (FP). Precision is computed as TP/(TP+FP) and recall as TP/(TP+FN), where FN is the number of false negatives. Recall is varied over the range 0.00-1.00, the precision is found at each recall level, and the mean of these values gives the average precision. The same process is repeated for each class separately and the average over all classes is reported as mAP.
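The evaluation steps above can be sketched in code. The sketch assumes boxes in the PASCAL VOC corner format (x_min, y_min, x_max, y_max) and uses 11 equally spaced recall levels with interpolated precision, as in the PASCAL VOC 2007 protocol; the number of levels and the function names are assumptions, since the text only states that recall is varied over 0.00-1.00.

```python
def iou(box_a, box_b):
    """Intersection over Union of two (x_min, y_min, x_max, y_max) boxes."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def average_precision(is_true_positive, num_ground_truth, levels=11):
    """AP from detections sorted by descending confidence.

    `is_true_positive[i]` is True when detection i matches a ground truth
    box at the chosen IoU threshold. Precision at each recall level is the
    maximum precision reached at that recall or higher.
    """
    tp = fp = 0
    precisions, recalls = [], []
    for hit in is_true_positive:
        if hit:
            tp += 1
        else:
            fp += 1
        precisions.append(tp / (tp + fp))
        recalls.append(tp / num_ground_truth)
    total = 0.0
    for i in range(levels):
        level = i / (levels - 1)
        reachable = [p for p, r in zip(precisions, recalls) if r >= level]
        total += max(reachable) if reachable else 0.0
    return total / levels


def mean_average_precision(ap_per_class):
    """mAP as the plain average of the per-class AP values."""
    return sum(ap_per_class) / len(ap_per_class)
```

As a consistency check against Table III, averaging the OAOD per-class values at IoU=0.50 (0.844 and 0.864) with `mean_average_precision` gives the reported mAP of 0.854.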

The proposed OAOD is evaluated on the ITUF dataset for oriented firearm detection and achieves an mAP of 85.4% at IoU=0.50. It avoids missed detections and multiple detections of the same object while localizing more accurately. The proposed algorithm is compared with current state-of-the-art object detection algorithms including SSD, DSSD, YOLOv2, YOLOv3 and FRCNN, all retrained on the same ITUF training dataset. All parameters in these algorithms are set as recommended by the original authors, and the default number of iterations is run for each algorithm to allow a fair comparison of mAP. Table III compares OAOD with these algorithms for IoU={0.40, 0.50, 0.60}. In most cases, OAOD achieves better mAP than the compared methods. This experiment shows that the performance of OAOD remains good for varying IoU values, even though training was done only for IoU=0.50. In cluttered scenes, the localization of the existing methods degrades, which results in poor performance, because axis aligned bounding boxes cannot exclude noisy or cluttered features. In our method, the ORoIs followed by maximum area rectangle cropping not only remove the noisy or cluttered features but also improve the detection accuracy. Our analysis shows that the proposed method localizes better than the compared methods and avoids both missed and multiple detections. Qualitative results on sample test images are shown in Fig. 8.

In addition to the current state-of-the-art algorithms, OAOD is also compared with the 2-Loss, 3-Loss, H/V and Phase-1 variant networks. The 2-Loss Net shows an increase in classification score at the cost of high confidence false positives. The 3-Loss Net improves the orientation classification but has the same problem of high confidence false positives. The H/V Net significantly reduces the scores of false positive detections, at the cost of lower class scores for firearm instances that are not axis aligned. OAOD is compared with the other networks in terms of true positive rate (TPR) versus confidence score at a fixed IoU=0.50 (Figure 7). OAOD performs better than FRCNN and the other networks. The 2-Loss and 3-Loss Nets approach OAOD in confidence scores, though they produce more high score false positives. Phase-1 Net performs better than FRCNN. OAOD performs best at high confidence levels, where the other networks suffer more performance degradation.

The proposed OAOD algorithm is qualitatively compared with FRCNN and the other detectors as shown in Fig. 8. In this experiment, the axis aligned bounding boxes are taken from Phase-1 and the respective class scores from Phase-2. We also predict oriented bounding boxes from the pooled ORoIs using the offsets from the bounding box regressor of Phase-2. These oriented boxes are inverse transformed using the predicted orientation associated with each proposal box from Phase-1. In Fig. 10, the inverse transformed oriented bounding boxes are shown for some example test cases.
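The inverse mapping of a box from the rotated frame back to image coordinates can be sketched as a rotation of its four corners. The function `oriented_corners` is hypothetical, and the sketch assumes that the rotation is about the box center and that the angle is given in degrees; the sign convention in image coordinates (y pointing down) is likewise an assumption.

```python
import math


def oriented_corners(cx, cy, width, height, angle_deg):
    """Corners of a box of size (width, height) centered at (cx, cy),
    rotated by angle_deg about its center.

    Assumption: rotation about the box center, angle in degrees.
    """
    theta = math.radians(angle_deg)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    half_w, half_h = width / 2.0, height / 2.0
    offsets = (
        (-half_w, -half_h),
        (half_w, -half_h),
        (half_w, half_h),
        (-half_w, half_h),
    )
    corners = []
    for dx, dy in offsets:
        # Standard 2D rotation of the corner offset, then shift to the center.
        x = cx + dx * cos_t - dy * sin_t
        y = cy + dx * sin_t + dy * cos_t
        corners.append((x, y))
    return corners


# A 4x2 box rotated by 90 degrees spans 2 units in x and 4 units in y.
rotated = oriented_corners(0.0, 0.0, 4.0, 2.0, 90.0)
```

An axis aligned box that encloses the firearm can be recovered from these corners by taking the minimum and maximum x and y values, which allows comparison against the axis aligned ground truth.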

Fig. 9: Failure cases: (a) two firearms very close to each other resulted in one detection, (b) orientation is difficult to predict for firearms pointing outwards, (c) firearm occluded with text resulted in multiple detections.
Fig. 10: Sample images from the ITUF dataset. Firearm aligned bounding box detections by the OAOD algorithm are also shown.

V Conclusion

A novel Orientation Aware Object Detector (OAOD) is proposed with application to visual firearm detection in RGB images. OAOD is trained using axis aligned boxes, yet it predicts boxes oriented along the objects. Instead of making the orientation part of the anchor boxes, which would have resulted in a high number of classifiers being computed at every location, a two phase strategy is proposed. In Phase-1, OAOD predicts the orientation and bounding box offset for an input object proposal. The orientation prediction is posed as a classification problem by dividing the possible orientations into eight classes. The object proposals are adjusted by the predicted offsets and rotated by the predicted orientations. Maximum area rectangles are cropped from the rotated region proposals and serve as Oriented Regions of Interest (ORoIs). The ORoIs are input to Phase-2 of OAOD, which predicts the confidence of being a Gun, Rifle or Background and improves localization. The ORoIs are again adjusted by the predicted offsets and then inversely mapped to the original coordinates to obtain object aligned bounding boxes. The proposed OAOD exhibits fewer missed detections, more accurate localization and fewer false detections. For training and evaluation of the proposed detector as well as existing state-of-the-art detectors, a new firearm dataset consisting of around 11k annotated firearm images has been collected, which will soon be made publicly available. The proposed OAOD is compared with five existing detectors (YOLOv2, YOLOv3, SSD, DSSD and FRCNN) and four variant networks over varying IoU and confidence thresholds. In a wide range of experiments, the proposed detector has demonstrated improved detection and localization performance for the task of firearm detection.


  • [1] America’s gun culture in 10 charts - BBC News. https://www.bbc.com/news/world-us-canada-41488081.
  • [2] How many school shootings have there been in 2018 so far? US News, The Guardian. https://www.theguardian.com/world/2018/feb/14/school-shootings-in-america-2018-how-many-so-far.
  • [3] Mass shootings in the US: there have been 1,624 in 1,870 days, US News, The Guardian. https://www.theguardian.com/us-news/ng-interactive/2017/oct/02/america-mass-shootings-gun-violence.
  • [4] Opinion, Restore funding for gun violence research - The New York Times. https://www.nytimes.com/2018/11/06/opinion/letters/gun-violence-research.html.
  • [5] Santa fe shooting is 22nd school shooting in 2018, Time. http://time.com/5282496/santa-fe-high-school-shooting-2018/.
  • [6] S. Akcay, M. E. Kundegorski, C. G. Willcocks, and T. P. Breckon. Using deep convolutional neural network architectures for object classification and detection within x-ray baggage security imagery. IEEE Transactions on Information Forensics and Security, 13(9):2203–2215, 2018.
  • [7] S. M. Azimi, E. Vig, R. Bahmanyar, M. Körner, and P. Reinartz. Towards multi-class object detection in unconstrained remote sensing imagery. arXiv preprint arXiv:1807.02700, 2018.
  • [8] Z. Cai and N. Vasconcelos. Cascade R-CNN: Delving into high quality object detection. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
  • [9] Z. Chen, S. Huang, and D. Tao. Context refinement for object detection. In Proceedings of the European Conference on Computer Vision (ECCV), pages 71–86, 2018.
  • [10] J. Dai, Y. Li, K. He, and J. Sun. R-FCN: Object detection via region-based fully convolutional networks. In Advances in Neural Information Processing Systems, pages 379–387, 2016.
  • [11] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, pages 248–255, 2009.
  • [12] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The Pascal Visual Object Classes (VOC) challenge. International Journal of Computer Vision, 88(2):303–338, 2010.
  • [13] C.-Y. Fu, W. Liu, A. Ranga, A. Tyagi, and A. C. Berg. DSSD: Deconvolutional Single Shot Detector. arXiv preprint arXiv:1701.06659, 2017.
  • [14] R. Girshick. Fast R-CNN. In ICCV, pages 1440–1448, 2015.
  • [15] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, pages 580–587, 2014.
  • [16] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask r-cnn. In ICCV, pages 2980–2988. IEEE, 2017.
  • [17] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.
  • [18] P. Hu and D. Ramanan. Finding tiny faces. In CVPR, pages 1522–1530. IEEE, 2017.
  • [19] X. Hu, X. Xu, Y. Xiao, H. Chen, S. He, J. Qin, and P.-A. Heng. Sinet: A scale-insensitive convolutional neural network for fast vehicle detection. IEEE Transactions on Intelligent Transportation Systems, 20(3):1010–1019, 2019.
  • [20] J. Huang, V. Rathod, C. Sun, M. Zhu, A. Korattikara, A. Fathi, I. Fischer, Z. Wojna, Y. Song, S. Guadarrama, et al. Speed/accuracy trade-offs for modern convolutional object detectors. In CVPR, volume 4, 2017.
  • [21] S. Jaffe. Gun violence research in the USA: The CDC’s Impasse, 2018.
  • [22] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.
  • [23] J. Li, X. Liang, S. Shen, T. Xu, J. Feng, and S. Yan. Scale-aware fast R-CNN for pedestrian detection. IEEE Transactions on Multimedia, 20(4):985–996, 2018.
  • [24] J. Li, X. Liang, Y. Wei, T. Xu, J. Feng, and S. Yan. Perceptual generative adversarial networks for small object detection. In CVPR, pages 1222–1230, 2017.
  • [25] T.-Y. Lin, P. Dollár, R. B. Girshick, K. He, B. Hariharan, and S. J. Belongie. Feature Pyramid Networks for object detection. In CVPR, volume 1, page 4, 2017.
  • [26] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár. Focal loss for dense object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.
  • [27] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. Ssd: Single shot multibox detector. In ECCV, pages 21–37. Springer, 2016.
  • [28] W. Liu, S. Liao, and W. Hu. Towards accurate tiny vehicle detection in complex scenes. Neurocomputing, 2019.
  • [29] Y. Liu, R. Wang, S. Shan, and X. Chen. Structure inference net: Object detection using scene-level context and instance-level relationships. In CVPR, pages 6985–6994, 2018.
  • [30] J. Ma, W. Shao, H. Ye, L. Wang, H. Wang, Y. Zheng, and X. Xue. Arbitrary-oriented scene text detection via rotation proposals. IEEE Transactions on Multimedia, 20(11):3111–3122, 2018.
  • [31] R. Olmos, S. Tabik, and F. Herrera. Automatic handgun detection alarm in videos using deep learning. Neurocomputing, 275:66–72, 2018.
  • [32] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. You only look once: Unified, real-time object detection. In CVPR, pages 779–788, 2016.
  • [33] J. Redmon and A. Farhadi. Yolo9000: better, faster, stronger. In CVPR, pages 7263–7271, 2017.
  • [34] J. Redmon and A. Farhadi. Yolov3: An incremental improvement. arXiv preprint arXiv:1804.02767, 2018.
  • [35] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pages 91–99, 2015.
  • [36] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
  • [37] B. Singh and L. S. Davis. An analysis of scale invariance in object detection snip. In CVPR, pages 3578–3587, 2018.
  • [38] L. Tychsen-Smith and L. Petersson. Improving object localization with fitness nms and bounded iou loss. In CVPR, June 2018.
  • [39] J. R. Uijlings, K. E. Van De Sande, T. Gevers, and A. W. Smeulders. Selective search for object recognition. International Journal of Computer Vision, 104(2):154–171, 2013.
  • [40] Y. Zhai, J. Fu, Y. Lu, and H. Li. Feature selective networks for object detection. In CVPR, pages 4139–4147, 2018.
  • [41] L. Zhang, L. Lin, X. Liang, and K. He. Is faster R-CNN doing well for pedestrian detection? In ECCV, pages 443–457. Springer, 2016.
  • [42] Y. Zhou, Q. Ye, Q. Qiu, and J. Jiao. Oriented Response Networks. In CVPR, pages 4961–4970. IEEE, 2017.