AI-based Pilgrim Detection using Convolutional Neural Networks
Pilgrimage is the most important Islamic religious gathering in the world, in which millions of pilgrims visit the holy places of Makkah and Madinah to perform their rituals. The safety and security of pilgrims is the highest priority for the authorities. In Makkah, 5000 cameras are spread around the holy places to monitor pilgrims, but it is almost impossible for humans to track all events given the huge number of images collected every second. To address this issue, we propose to use an artificial intelligence technique based on deep learning and convolutional neural networks to detect and identify pilgrims and their features. For this purpose, we built a comprehensive dataset for the detection of pilgrims and their genders. Then, we developed two convolutional neural networks based on YOLOv3 and Faster R-CNN for the detection of pilgrims. Experimental results show that Faster R-CNN with the Inception-v2 feature extractor provides the best mean average precision over all classes, at 51%.
Keywords: Pilgrim Detection, Convolutional Neural Networks, Deep Learning, You Only Look Once (YOLO), Faster R-CNN.
Artificial Intelligence (AI) is currently one of the most prominent technologies, with a huge impact on societies and on the services provided in many types of applications. One of the main driving factors of artificial intelligence in the last decade is the emergence of deep learning in computer vision applications, particularly convolutional neural networks (CNNs). In fact, with the emergence of AlexNet in 2012, the computer vision community moved rapidly to applying CNNs to image classification, detection, recognition, and semantic segmentation. Deep learning approaches have been used in a variety of use cases, namely people behavior monitoring, vehicle detection [4, 1], semantic segmentation of urban environments, self-driving vehicles, object detection and classification [23, 17], and semantic segmentation [11, 2, 16].
In this paper, we address the problem of developing AI-based solutions for pilgrim detection and monitoring during Hajj and Umrah events in Saudi Arabia. Hajj and Umrah attract millions of pilgrims from all over the world every year. According to the Ministry of Hajj, around 7.5 million Umrah visas were issued in 2019, and the number of pilgrims during the 5 days of the annual pilgrimage reached 2.5 million. The Vision 2030 of the Kingdom of Saudi Arabia aims to reach 30 million pilgrims annually. The increasing number of pilgrims raises several challenges for the security and safety of pilgrims. Although there are more than 5000 cameras spread around the holy places, it is impossible for humans to track every activity or action that would need a special intervention from security forces or civil defense agents. Several use cases would benefit from an AI-based assistive technology to monitor pilgrims, including: (1) searching for and finding lost people, (2) real-time discovery of people in need of emergency services, (3) assisting pilgrims in their rituals, and several others. To address this gap, we propose to develop AI-based monitoring techniques dedicated to pilgrims. We aim at the effective use of convolutional neural network algorithms applied to video streams collected from CCTV cameras or any video source containing pilgrims. The ultimate goal is to provide an assistive technology to the authorities to promote the safety of pilgrims.
In this paper, the contributions are threefold. First, we built a large dataset of pilgrim and non-pilgrim instances for different genders and in different environments. Second, we trained two state-of-the-art CNN algorithms for the specific use case of pilgrim detection, namely YOLOv3 and Faster R-CNN. YOLOv3 is known as one of the fastest detection algorithms, whereas Faster R-CNN is an improvement of R-CNN that represents the most efficient region-based CNN algorithm for object detection. Third, we conduct a comparative study between these two algorithms to evaluate their performance in the context of pilgrim detection.
To the best of our knowledge, this is the first paper that addresses the problem of pilgrim detection using deep learning with state-of-the-art convolutional neural networks.
The remainder of the paper is organized as follows. Section II discusses related works on deep learning for people monitoring and existing non-AI techniques for pilgrim monitoring. Section III presents a brief background on both state-of-the-art CNN algorithms, namely YOLOv3 and Faster R-CNN. Section IV presents details on the Pilgrim dataset that we built for this study. Section V presents and discusses the main results. Section VI concludes the paper and outlines future works.
II Related Works
Several recent works have used CNNs for monitoring people's behavior, but they were applied to contexts different from pilgrim detection.
Wang et al. addressed the detection and tracking failures of commonly used pedestrian tracking methods. For detection, they used the Faster R-CNN framework, and for tracking, they used a Person-ReID method based on feature extraction and matching between different frames. This algorithm achieved a tracking rate of 92.51% on the simple standard dataset and 76.9% on the RGB-D People dataset.
Molchanov et al. proposed a classification approach that combines pedestrian detection and classification task in real scenes. The approach uses a YOLO neural network to overcome the problem of the low image resolution and the high density of people in a small area.
These works present several limitations: (i) High computational complexity, which can make them time-consuming. To address this problem, we use YOLOv3, which is substantially faster. (ii) Low accuracy when using the RGB dataset or when dealing with low-resolution images, and difficulty in detecting small pedestrians. To address this detection problem, we use Faster R-CNN with two different feature extractors (Inception-v2 and ResNet50) that provide feature maps well suited to the detection task.
Teduh et al. proposed an architecture for a geo-fencing emergency alert system for Hajj pilgrims. The proposed architecture is based on mobile phones with a GPS module, which are used as the pilgrims' tracking devices. It is also designed to handle the predicted load using a specific algorithm.
Mohandes et al. developed a prototype of a wireless sensor network for tracking pilgrims in the Holy areas during Hajj. They used the principle of a delay-tolerant network. In this system, a network of fixed master units is installed in the Holy area. In addition, every pilgrim is given a mobile sensor unit that includes a GPS unit, a microcontroller, antennas, and a battery, and that sends its UID number, latitude, longitude, and time.
These works, which applied sensing and mobile technologies to pilgrim tracking, also present several problems: (i) The difficulty of receiving the GPS signal in some areas, which causes problems for GPS-based pilgrim tracking systems. (ii) The difficulty of operating such a system in a large crowd, because it cannot handle big data.
To solve these problems, we propose to use deep-learning-based computer vision for pilgrim detection in real time. This approach can also be easily integrated with the CCTV camera infrastructure in the holy mosque areas to monitor pilgrims.
III Algorithms Background
III-A Faster R-CNN
In this section, we provide an overview of the Faster R-CNN algorithm for the detection of pilgrims. It is an improved version of R-CNN, which was conceived to bypass the problem of selecting a huge number of regions. This problem is inherent to the use of a conventional CNN for object detection.
The Faster R-CNN algorithm, presented in Figure 1, contains two modules that share the same convolutional layers. These modules are:
The region proposal network RPN
A Fast R-CNN detector
The RPN module is a fully convolutional network that generates the region proposals, which are the bounding boxes that possibly include a candidate object, using multiple scales and aspect ratios. Each region proposal has an objectness score that measures whether the region belongs to the set of objects or to the background.
The Fast R-CNN detector is composed of the two following steps:
The extraction of feature vectors from the regions of interest (ROIs) using ROI pooling.
The classification of each feature vector by a classifier composed of fully connected layers.
The classification step output is:
A sequence of estimated probabilities for the different object classes considered
The coordinates of the region proposals
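To make the "multiple scales and ratios" of the RPN concrete, the following minimal sketch generates anchor boxes and tiles them over a feature map. The base size of 16, the three scales, the three aspect ratios, and the stride of 16 are the defaults described in the original Faster R-CNN paper and are assumptions here; the configuration actually used in this study is not stated. The function names are hypothetical.

```python
import numpy as np

def make_anchors(base_size=16, scales=(8, 16, 32), ratios=(0.5, 1.0, 2.0)):
    """Return one anchor per (scale, ratio) pair as [x1, y1, x2, y2], centred on the origin."""
    anchors = []
    for scale in scales:
        area = (base_size * scale) ** 2      # target box area for this scale
        for ratio in ratios:                 # ratio = height / width
            w = np.sqrt(area / ratio)
            h = w * ratio
            anchors.append([-w / 2, -h / 2, w / 2, h / 2])
    return np.array(anchors)

def shift_anchors(anchors, feat_h, feat_w, stride=16):
    """Replicate the base anchors at every location of a feat_h x feat_w feature map."""
    xs = np.arange(feat_w) * stride
    ys = np.arange(feat_h) * stride
    cx, cy = np.meshgrid(xs, ys)
    shifts = np.stack([cx.ravel(), cy.ravel(), cx.ravel(), cy.ravel()], axis=1)
    # Broadcast: (locations, 1, 4) + (1, anchors, 4) -> (locations * anchors, 4)
    return (shifts[:, None, :] + anchors[None, :, :]).reshape(-1, 4)

base = make_anchors()
all_anchors = shift_anchors(base, feat_h=2, feat_w=3)
print(base.shape, all_anchors.shape)
```

In a full detector, the RPN would score each of these anchors for objectness and regress offsets to refine it; that part is omitted here.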
III-B YOLOv3
YOLO (You Only Look Once) is a CNN-based object detector designed for speed, because detection with a CNN as originally conceived is very time-consuming. There are three versions of YOLO: YOLOv1, YOLOv2, and YOLOv3, each improving on its predecessor. YOLOv3 is characterized by:
The use of multi-label classification based on logistic regression instead of the Softmax function.
The use of cross-entropy loss function instead of the mean square error for the classification loss.
The prediction of different bounding boxes based on the overlapping of the bounding box anchor with the ground truth object.
The use of the Feature Pyramid Network concept: boxes are predicted at three different scales, with features extracted at each scale. The result of the prediction is a 3D tensor encoding the bounding box, the objectness score, and the class predictions.
The use of the Darknet-53 CNN feature extractor instead of Darknet-19. It is composed of 53 convolutional layers using 3x3 and 1x1 filters, with skip connections inspired by ResNet.
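As an illustration of the 3D prediction tensor mentioned above, the sketch below computes the output shapes at the three scales for the three-class setting of this paper. It assumes the standard YOLOv3 layout from the original YOLOv3 paper (3 boxes per grid cell, strides of 32, 16, and 8); the function name is hypothetical.

```python
def yolo_output_shapes(input_size, num_classes=3, boxes_per_cell=3, strides=(32, 16, 8)):
    """Return the (grid, grid, depth) shape of the prediction tensor at each scale.

    Each box encodes 4 coordinates, 1 objectness score, and num_classes class scores.
    """
    depth = boxes_per_cell * (4 + 1 + num_classes)
    return [(input_size // s, input_size // s, depth) for s in strides]

for size in (320, 416, 608):
    print(size, yolo_output_shapes(size))
```

With three classes the depth is 3 x (4 + 1 + 3) = 24, and a larger input yields finer grids at every scale, which is one reason small objects tend to be easier to detect at higher resolutions.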
IV The Pilgrims Dataset
In this paper, we are interested in building a comprehensive dataset for the detection of pilgrims and their genders.
For women, we cannot differentiate pilgrims from non-pilgrims because their clothes are very similar. For this reason, we chose to put pilgrim and non-pilgrim women in the same class.
In contrast, male pilgrims wear specific clothes that differ from other clothes, as shown in Figure 3.a. We therefore chose to divide the man class into pilgrim and not-pilgrim. For the not-pilgrim class, we focused on the white Saudi clothes, as shown in Figure 3.b, because they are quite similar to pilgrim clothes, especially in terms of color.
To create our dataset, we collected 622 images of people in the holy places of Makkah and Madinah. We chose images of people in different environments and situations, taken from different angles and under different illumination. Then, using the LabelImg software, we annotated the collected images with three labels, namely woman, pilgrim, and not-pilgrim. We obtained a dataset composed of 1165 woman and 2291 man instances, the latter divided into 1339 pilgrim and 952 not-pilgrim instances. The statistics of the dataset instances are presented in Table I.
| |Woman|Man|
|Number of instances|1165|2291|
Our dataset follows the Pascal VOC (Pascal Visual Object Classes) format and is composed of 3 classes (woman, pilgrim, not-pilgrim). We chose the Pascal VOC format because it enables evaluating our YOLOv3 and Faster R-CNN pilgrim detection algorithms under significant variability in object size, orientation, pose, illumination, position, and occlusion.
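For readers unfamiliar with Pascal VOC annotations, the following sketch reads one annotation of the kind LabelImg writes and counts instances per class. The XML content, the file name `example.jpg`, and the box coordinates are invented for illustration; only the three label names come from the text above.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical annotation in Pascal VOC layout (values are illustrative only).
SAMPLE_XML = """<annotation>
  <filename>example.jpg</filename>
  <size><width>640</width><height>480</height><depth>3</depth></size>
  <object>
    <name>pilgrim</name>
    <bndbox><xmin>10</xmin><ymin>20</ymin><xmax>110</xmax><ymax>220</ymax></bndbox>
  </object>
  <object>
    <name>woman</name>
    <bndbox><xmin>200</xmin><ymin>40</ymin><xmax>290</xmax><ymax>260</ymax></bndbox>
  </object>
</annotation>"""

def read_voc_objects(xml_text):
    """Return a list of (label, (xmin, ymin, xmax, ymax)) tuples from one annotation."""
    root = ET.fromstring(xml_text)
    objects = []
    for obj in root.findall("object"):
        box = obj.find("bndbox")
        coords = tuple(int(box.findtext(tag)) for tag in ("xmin", "ymin", "xmax", "ymax"))
        objects.append((obj.findtext("name"), coords))
    return objects

objects = read_voc_objects(SAMPLE_XML)
counts = Counter(label for label, _ in objects)
print(counts)
```

Summing such counters over all annotation files would give per-class statistics of the kind reported in Table I.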
V Experimental Evaluation
In this section, we describe the experimental study that we conducted to evaluate the performance of two state-of-the-art algorithms, namely YOLOv3 and Faster R-CNN, on the pilgrim detection use case. We start by describing the experimental setup, then present the metrics used for the evaluation. Finally, we analyze the results obtained for each algorithm to compare their performance.
V-A Experimental Setup
In this experimental study, the training was done on two machines. The configurations of these two machines are presented in Table II.
|Machine 1||Machine 2|
For Faster R-CNN, we chose to test two different CNN architectures for feature extraction, Inception-v2 and ResNet50, because these are among the best feature extractors for the detection task. For YOLOv3, we chose to evaluate different input resolutions, which affect both the accuracy and the speed of the system. We used three input sizes: 320x320, 416x416, and 608x608. These settings result in five models trained and tested on our pilgrim dataset. Both algorithms are trained to detect and recognize three classes of persons (Woman, Pilgrim, and Not-Pilgrim). To optimize the two algorithms, we used Stochastic Gradient Descent (SGD) with the default momentum value of 0.9. For the learning rate, we used an initial rate of 0.001 for YOLOv3, and for Faster R-CNN, we used an initial rate of 0.0002 with Inception-v2 and 0.0003 with ResNet50, which are the default values for each feature extractor network. We used a weight decay value of 0.0005.
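The role of the three optimizer hyperparameters can be seen in a single update step. The sketch below uses one common formulation of SGD with momentum and L2 weight decay; frameworks differ in the exact form, so it should not be read as the update rule of the training code used in this study. The default values are those quoted above for YOLOv3, and the function name is hypothetical.

```python
import numpy as np

def sgd_momentum_step(weights, grad, velocity, lr=0.001, momentum=0.9, weight_decay=0.0005):
    """Apply one SGD step with momentum and L2 weight decay.

    Returns the updated weights and the updated velocity.
    """
    grad = grad + weight_decay * weights           # weight decay folded into the gradient
    velocity = momentum * velocity - lr * grad     # exponentially decaying running update
    return weights + velocity, velocity

w = np.array([1.0])
v = np.zeros_like(w)
w, v = sgd_momentum_step(w, grad=np.array([2.0]), velocity=v)
print(w, v)
```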
V-B Performance evaluation and metrics
For the evaluation of our proposed algorithms, we have used six metrics based on the following parameters:
True Positive (TP): it is the number of instances (woman, pilgrim, and not-pilgrim) successfully detected and classified.
False Positive (FP): it refers to the number of instances that are wrongly classified.
False Negative (FN): it is the number of non-detected instances.
The six metrics used for the evaluation are:
mIoU: mean of the Intersection over Union that measures the overlap between the predicted and the ground-truth bounding boxes.
mAP: mean Average Precision, or AP (Average Precision) when it is measured on one class. It is an approximation of the area under the precision-recall curve.
FPS: frames per second. It represents the inference speed of the algorithm.
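The count-based metrics discussed in the comparison below (precision, recall, F1 score, and quality) can be derived from TP, FP, and FN, and IoU from two boxes. The sketch uses the standard definitions of IoU, precision, recall, and F1. The quality formula TP / (TP + FP + FN) is an assumption, since its definition does not appear in this text. Function names and the example counts are hypothetical.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def detection_scores(tp, fp, fn):
    """Compute precision, recall, F1, and quality from detection counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    quality = tp / (tp + fp + fn) if tp + fp + fn else 0.0  # assumed definition
    return {"precision": precision, "recall": recall, "f1": f1, "quality": quality}

print(iou((0, 0, 2, 2), (1, 1, 3, 3)))
print(detection_scores(tp=6, fp=2, fn=4))
```

A detector with many false negatives, as reported for YOLOv3 below, can therefore show high precision while its recall and quality remain low.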
V-C Comparison between Faster R-CNN and YOLOv3
For the evaluation of the proposed algorithms, we compared the values of the six metrics for each algorithm shown in Table III and Table IV.
FN, TP and FP
Figure 3 shows that with YOLOv3, the number of false negatives is much higher than the number of false positives over all classes, and also much higher than the number of true positives, which indicates that most instances go undetected. With Faster R-CNN, the number of true positives is much higher than the numbers of false positives and false negatives over all classes, which indicates that most instances are detected.
When analyzing the results, it appears that YOLOv3 with an input size of 608x608 gave a better AP for the Pilgrim class, and Faster R-CNN with Inception-v2 gave a better AP on the Non-Pilgrim class (Figure 4). Figure 4 also shows that Faster R-CNN with Inception-v2 gave a much better mAP over all classes.
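Since AP is described as an approximation of the area under the precision-recall curve, a small sketch may help. It uses all-point interpolation in the style of later Pascal VOC evaluations, which is an assumption: the exact AP variant used to produce the reported numbers is not stated. The detections in the example are invented.

```python
import numpy as np

def average_precision(scores, is_true_positive, num_ground_truth):
    """Area under the interpolated precision-recall curve for one class."""
    order = np.argsort(-np.asarray(scores, dtype=float))   # highest confidence first
    tp = np.asarray(is_true_positive, dtype=float)[order]
    cum_tp = np.cumsum(tp)
    cum_fp = np.cumsum(1.0 - tp)
    recall = cum_tp / num_ground_truth
    precision = cum_tp / (cum_tp + cum_fp)
    # Add sentinels, then make precision monotonically non-increasing.
    mrec = np.concatenate(([0.0], recall, [1.0]))
    mpre = np.concatenate(([0.0], precision, [0.0]))
    for i in range(len(mpre) - 2, -1, -1):
        mpre[i] = max(mpre[i], mpre[i + 1])
    # Sum rectangle areas wherever recall changes.
    idx = np.where(mrec[1:] != mrec[:-1])[0]
    return float(np.sum((mrec[idx + 1] - mrec[idx]) * mpre[idx + 1]))

# Three detections for a class with two ground-truth instances (illustrative values).
ap = average_precision(scores=[0.9, 0.8, 0.7], is_true_positive=[1, 0, 1], num_ground_truth=2)
print(ap)
```

The mAP is then the mean of the per-class AP values over the three classes.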
Precision and mIoU
The average IoU results show that YOLOv3 gave a better IoU over all classes than Faster R-CNN. The precision results show that YOLOv3 with an input size of 320x320 gave a better precision for the Non-Pilgrim class, and Faster R-CNN with Inception-v2 gave a better precision on the Pilgrim class. They also show that YOLOv3 with an input size of 320x320 gave a much better precision over all classes, at 80.58%.
Analyzing the average recall results, we found that Faster R-CNN outperforms YOLOv3 on this metric. The Inception-v2 feature extractor performs slightly better than ResNet50, with a recall of 59.29%, while YOLOv3 with an input size of 320x320 performs markedly worse.
When analyzing the quality metric, which measures the robustness of the algorithms, it appears that YOLOv3 gave a better quality for the Non-Pilgrim class, and Faster R-CNN gave a better quality on the Pilgrim class. It also appears that Faster R-CNN with Inception-v2 gave a much better quality over all classes, at 41.72%.
The F1 score, which also measures robustness based on the precision and recall ratios, reveals that YOLOv3 with an input size of 608x608 gave the best performance on the Pilgrim class, at 66.01%, while Faster R-CNN reached 59.45% on the same class. Over all classes, Faster R-CNN with Inception-v2 gave a much better score, at 58.87%.
Inference Processing time
The results of the average Inference speed measured in Frames per Second (FPS), for each of the tested algorithms, show that YOLOv3 is 19 times faster than Faster R-CNN in the inference phase.
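Inference speed in FPS is typically obtained by timing a detector over a batch of frames. The sketch below shows one way to do this; the `infer` callable, the warm-up count, and the dummy workload are placeholders, and the measurement procedure actually used in this study is not described in the text.

```python
import time

def measure_fps(infer, frames, warmup=2):
    """Return the average number of frames processed per second by `infer`."""
    for frame in frames[:warmup]:          # warm-up runs are excluded from timing
        infer(frame)
    start = time.perf_counter()
    for frame in frames:
        infer(frame)
    elapsed = time.perf_counter() - start
    return len(frames) / elapsed if elapsed > 0 else float("inf")

# Dummy stand-in for a detector: sums the values of a fake frame.
dummy_frames = [list(range(1000)) for _ in range(50)]
fps = measure_fps(lambda frame: sum(frame), dummy_frames)
print(f"{fps:.1f} FPS")
```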
Effect of the feature extractor
When analyzing the effect of the feature extractor for Faster R-CNN, it appears that the ResNet50 feature extractor is slightly faster than Inception-v2 because it is less computationally complex. However, Inception-v2 outperforms ResNet50 on almost all metrics.
Effect of the input size
Table IV shows a significant gain in YOLOv3's AP when moving from a 320x320 input size to 608x608, but a substantial loss in precision. It also indicates that the input size has an important impact on the inference speed of YOLOv3, because a larger input size requires a larger number of operations (from 43 FPS for 608x608 up to 91 FPS for 320x320).
In this section, we compared the performance of YOLOv3 (with three different input sizes) and Faster R-CNN (with two different feature extractors), and the impact of the input size and the feature extractor. Figure 5 summarizes the main results of this comparison study. It shows the trade-off between AP and inference time for YOLOv3 (with three different input sizes) and Faster R-CNN (with two different feature extractors). It can be observed that YOLOv3 with an input size of 320x320 gave the best inference speed with a low AP, whereas Faster R-CNN with Inception-v2 as feature extractor gave the lowest inference speed with the best AP. This emphasizes that neither algorithm surpasses the other in all cases.
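A rough back-of-the-envelope check relates input size to speed. The sketch assumes that the cost of the convolutional layers grows in proportion to the number of input pixels, which is a simplification; the FPS figures are the ones quoted above.

```python
BASE_SIZE = 320
reported_fps = {320: 91, 608: 43}   # values quoted in the text

# Relative number of input pixels (and, under the assumption, of operations).
relative_cost = {size: (size / BASE_SIZE) ** 2 for size in (320, 416, 608)}
measured_slowdown = reported_fps[320] / reported_fps[608]

for size, cost in relative_cost.items():
    print(f"{size}x{size}: {cost:.2f}x the pixels of {BASE_SIZE}x{BASE_SIZE}")
print(f"Measured slowdown from 320 to 608: {measured_slowdown:.2f}x")
```

Under this assumption a 608x608 input carries about 3.61 times the pixels of a 320x320 input, while the reported slowdown is about 2.12 times, which would be consistent with part of the per-frame cost not scaling with resolution.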
In this paper, we developed convolutional neural network models for pilgrim detection during Hajj, based on YOLOv3 and Faster R-CNN. We built a dataset containing three classes: pilgrim, non-pilgrim, and woman. Experimental results show that Faster R-CNN with the Inception-v2 feature extractor provides the best mean average precision over all classes, at 51%. In future work, we will extend the dataset to several tens of thousands of instances to improve the overall accuracy and precision, and we will consider more classes. We also aim to develop a search application for lost people during Hajj and Umrah based on some predefined features.
This work is supported by the Robotics and Internet-of-Things Lab of Prince Sultan University.
- (2019-10) Aerial Images Processing for Car Detection using Convolutional Neural Networks: Comparison between Faster R-CNN and YoloV3. arXiv pre-print 1910.07234.
- (2019-05) Aerial LaneNet: Lane-Marking Semantic Segmentation in Aerial Imagery Using Wavelet-Enhanced Cost-Sensitive Symmetric Fully Convolutional Neural Networks. IEEE Transactions on Geoscience and Remote Sensing 57 (5), pp. 2920–2938.
- (2019) Unsupervised Domain Adaptation Using Generative Adversarial Networks for Semantic Segmentation of Aerial Images. Remote Sensing 11 (11).
- (2019) Car Detection using Unmanned Aerial Vehicles: Comparison between Faster R-CNN and YOLOv3. In 2019 1st International Conference on Unmanned Vehicle Systems-Oman (UVS), pp. 1–6.
- (2018) An architectural design of geo-fencing emergency alerts system for hajj pilgrims. In 2018 8th International Conference on Computer Science and Information Technology (CSIT), pp. 1–6.
- (2010) The pascal visual object classes (voc) challenge. International journal of computer vision 88 (2), pp. 303–338.
- (2014) Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 580–587.
- (2015) Deep Residual Learning for Image Recognition. Arxiv.Org.
- (2017) Speed/accuracy trade-offs for modern convolutional object detectors. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 7310–7311.
- (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167.
- (2016-06) Semantic Segmentation of Small Objects and Modeling of Uncertainty in Urban Remote Sensing Images Using Deep Convolutional Neural Networks. In 2016 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 680–688.
- (2019-11) Activity Monitoring of Islamic Prayer (Salat) Postures using Deep Learning. arXiv pre-print 1911.xxxxx.
- (2012) Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105.
- (2011) Pilgrims tracking using wireless sensor network. In 2011 IEEE Workshops of International Conference on Advanced Information Networking and Applications, pp. 325–328.
- (2017) Pedestrian detection in video surveillance using fully convolutional yolo neural network. In Automated Visual Inspection and Machine Vision II, Vol. 10334, pp. 103340Q.
- (2018-11) Vehicle Instance Segmentation From Aerial Image and Video Using a Multitask Learning Residual Fully Convolutional Network. IEEE Transactions on Geoscience and Remote Sensing 56 (11), pp. 6699–6711.
- (2019) A Framework for the Management of Agricultural Resources with Automated Aerial Imagery Detection. Computers and Electronics in Agriculture 162, pp. 53–69.
- (2016) You Only Look Once: Unified, Real-Time Object Detection. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pp. 779–788.
- (2017) YOLO9000: better, faster, stronger. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 7263–7271.
- (2018) YOLOv3: an incremental improvement. CoRR abs/1804.02767.
- (2017) Faster R-CNN: Towards Real-Time Object Detection with. IEEE Transactions on Pattern Analysis and Machine Intelligence.
- (2014) A survey of public opinion about autonomous and self-driving vehicles in the us, the uk, and australia. Technical report, University of Michigan, Ann Arbor, Transportation Research Institute.
- (2016-05) Convolutional Neural Network Based Automatic Object Detection on Aerial Images. IEEE Geoscience and Remote Sensing Letters 13 (5), pp. 740–744.
- LabelImg. git code (2015). https://github.com/tzutalin/labelimg.
- (2019) Research on pedestrian tracking algorithm based on deep learning framework. In Journal of Physics: Conference Series, Vol. 1176, pp. 032028.