A Real-Time Deep Learning Pedestrian Detector for Robot Navigation
A real-time, deep-learning-based method for Pedestrian Detection (PD) is applied to the Human-Aware robot navigation problem. The pedestrian detector combines the Aggregate Channel Features (ACF) detector with a deep Convolutional Neural Network (CNN) in order to obtain fast and accurate performance. Our solution is first evaluated using a set of real images taken from onboard and offboard cameras, and it is then validated in a typical robot navigation environment with pedestrians (two distinct experiments are conducted). The results of both tests show that our pedestrian detector is robust and fast enough to be used in robot navigation applications.
In the last few years, robotics has increasingly focused on Human-Robot Interaction and on its role in social environments. Examples of interaction include speech, object handover, and simple navigation behavior, where the robot needs to know whether an obstacle is a person in order to decide on the correct behavior. The study of robot navigation in the presence of people is called Human-Aware Navigation (HAN). For any type of Human-Robot Interaction, one needs to know the position of the people in the environment. Therefore, Pedestrian Detection (PD) is one of the most important steps for the robot to interact correctly with humans.
In this paper, we propose a solution for real-time PD using Computer Vision (onboard and/or offboard cameras) for people state estimation, using a novel deep learning technique. The scheme of our approach is shown in Fig. 1. The solution was first tested using offboard and onboard images taken from our testbed and robot platform. Then, to validate our approach, we applied the proposed solution to a HAN problem. The results show that the proposed solution meets the accuracy and real-time speed requirements of this application.
The PD task is an important component of our framework, not only in terms of accuracy but also regarding speed, since real-time performance is required. The literature concerning the PD field of study is vast and has been evolving in order to provide improved solutions to this problem (surveys can be found in [1, 2]). In general, to perform PD, a detection window is slid over several image locations separated by a certain stride, using multiple scales. Features are extracted and classified to determine the presence of a pedestrian. Finally, redundant detections are eliminated using non-maximum suppression. Initially, PD methods employed handcrafted features (e.g., Haar-like features, including their informed version). Nevertheless, the recent success of Convolutional Neural Networks (CNNs) in several applications, such as classification, localization and detection, led to the adoption of this methodology for the PD task.
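The last step of this pipeline, the elimination of redundant detections, can be illustrated with a minimal sketch of greedy non-maximum suppression. The intersection-over-union criterion, the 0.5 overlap threshold and the `(x1, y1, x2, y2)` box layout are assumptions made for the example; the detector used in this paper may apply a different overlap rule.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # assumed layout: (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def non_maximum_suppression(boxes: List[Box], scores: List[float],
                            iou_threshold: float = 0.5) -> List[int]:
    """Greedy NMS: visit boxes by decreasing score and keep a box only if it
    does not overlap an already kept box by more than iou_threshold.
    Returns the indices of the kept boxes."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep
```

With two heavily overlapping windows and one distant window, only the higher-scoring member of the overlapping pair and the distant window are kept.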
The computations associated with the CNN are expensive when compared to the ones required by methods using handcrafted features. Therefore, to improve the detector’s speed, a hybrid solution can be adopted by cascading a faster and shallower method, based on handcrafted features, with a deep CNN. The handcrafted approach generates proposals (i.e., promising regions for the pedestrian locations), whose classification is refined by the CNN (i.e., the accuracy is enhanced by removing false positives).
Furthermore, the transfer learning technique should be employed when training the CNN in order to prevent overfitting. This technique consists of transferring parameters from a network trained on an auxiliary task (with a large auxiliary dataset, e.g., ImageNet) to initialize our model. Then, our initialized network is fine-tuned (i.e., retrained) for the task of interest (in our case, using the PD dataset).
This paper is organized as follows. Sec. 2.1 describes the PD methodology, whereas Sec. 2.2 addresses the tracking procedure. Sec. 3.1 describes how the CNN training is accomplished, and Sec. 3.2 explains how a pre-trained model is used for fine-tuning. The PD performance is evaluated in the “corridor” and “Mbot” real scenarios (Sec. 4.1). Sec. 4.2 presents the results of the complete framework (PD + HAN). Finally, Sec. 5 draws conclusions and discusses further work.
2 Vision-Based People Detection and Tracking through Deep Learning
Regarding PD, we combine handcrafted methods with deep learning methodologies. More specifically, the non-deep Aggregate Channel Features (ACF) detector first provides regions of interest (i.e., proposals), which avoids the expensive computational effort that a CNN would require in an exhaustive sliding window search. Then, these proposals are classified by the CNN, which improves the accuracy of the ACF detections (as depicted in Fig. 1).
2.1 Methodology for PD
In the following, we describe the methodology and introduce the notation for the PD task. Given a training set $\mathcal{D}$, $\mathbf{x}$ represents the input image, with $\mathbf{x}: \Omega \rightarrow \mathbb{R}^3$ and $\Omega$ representing the image lattice.
Using pre-trained models to initialize the CNN improves the generalization ability of the model. Hence, we select the VGG CNN model, pre-trained with ImageNet. We denote the dataset used to pre-train the CNN as $\widetilde{\mathcal{D}} = \{(\widetilde{\mathbf{x}}, \widetilde{\mathbf{y}})_n\}_{n=1}^{|\widetilde{\mathcal{D}}|}$, with $\widetilde{\mathbf{x}}: \Omega \rightarrow \mathbb{R}^3$ and $\widetilde{\mathbf{y}} \in \{0, 1\}^{C}$, where $C$ is the number of classes in the pre-trained model (for ImageNet, we have $C = 1000$).
Typically, the structure of CNNs includes the composition of: convolutional layers with a non-linear activation function; non-linear subsampling layers; fully connected layers; and a multinomial logistic regression layer. Formally, the CNN can be denoted by:
$$f(\mathbf{x}, \theta) = f_{out} \circ f_L \circ \dots \circ f_2 \circ f_1(\mathbf{x}), \qquad (1)$$
where $\mathbf{x}$ is the input data, $\circ$ denotes the composition operator, $\theta$ represents the CNN parameters (i.e., the weights and biases), and $f(\mathbf{x}, \theta)$ is the CNN output (prediction). For the PD case, the input $\mathbf{x}$ represents the proposals (see image in Fig. 1) and the output $y^{*} = f(\mathbf{x}, \theta)$ denotes the prediction. The CNN is applied to these proposals, outputting the probability of the existence of a pedestrian in each one of them. If a proposal is classified as a pedestrian, it is kept and no changes are made to its original ACF score. The proposals considered to be non-pedestrians are eliminated, in order to reduce the number of false positives.
The convolution of a layer’s input with a set of filters, followed by a non-linearity, is represented by:
$$\mathbf{x}_l = f_l(\mathbf{x}_{l-1}) = \sigma(\mathbf{W}_l \mathbf{x}_{l-1} + \mathbf{b}_l), \qquad (2)$$
where the convolutional filters are represented by the weight matrix $\mathbf{W}_l$ and the bias vector $\mathbf{b}_l$, and where $\sigma(\cdot)$ represents the non-linearity (e.g., the Rectified Linear Unit). The non-linear subsampling layers are denoted by $\mathbf{x}_l = \downarrow(\mathbf{x}_{l-1})$, where $\downarrow(\cdot)$ represents the function (e.g., mean or max) applied to the input regions, leading to the size reduction. The fully connected layers employ a special case of the convolution represented in (2), because the entire input is convolved with individual filters. The multinomial logistic regression layer uses the soft-max function $f_{out}(i, \mathbf{x}_L) = \exp(\mathbf{x}_L(i)) / \sum_j \exp(\mathbf{x}_L(j))$ to calculate the probability for each class (indexed by $i$), using the input $\mathbf{x}_L$ from the preceding ($L$-th) layer.
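The soft-max computation of the last layer can be sketched in a few lines. This is only an illustration: the subtraction of the maximum is a common numerical-stability device and is an assumption of the example, not something stated in the paper.

```python
import math
from typing import List


def softmax(x: List[float]) -> List[float]:
    """Map a vector of layer outputs to class probabilities.

    Subtracting max(x) leaves the result unchanged mathematically and only
    prevents overflow in exp for large inputs.
    """
    m = max(x)
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]
```

For the two-class PD case, the two outputs sum to one, so either of them can serve as the pedestrian probability of a proposal.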
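The loss function used during training is the binary cross entropy loss expressed as:
$$\ell(\mathcal{D}) = -\frac{1}{|\mathcal{D}|} \sum_{i=1}^{|\mathcal{D}|} \left[ y_i \log y_i^{*} + (1 - y_i) \log (1 - y_i^{*}) \right], \qquad (3)$$
where the training set $\mathcal{D}$ is indexed by $i$, $y_i \in \{0, 1\}$ is the label of the $i$-th sample, and $y_i^{*}$ is the corresponding predicted pedestrian probability.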
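As a small numerical illustration, the binary cross entropy can be evaluated as below, assuming labels in {0, 1} and predicted pedestrian probabilities in (0, 1). The clipping constant `eps` is an assumption added to avoid taking the logarithm of zero.

```python
import math
from typing import Sequence


def binary_cross_entropy(labels: Sequence[int], probs: Sequence[float],
                         eps: float = 1e-12) -> float:
    """Mean binary cross entropy over a set of samples.

    labels: ground-truth classes (1 = pedestrian, 0 = non-pedestrian).
    probs:  predicted probabilities of the pedestrian class.
    """
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)  # keep log() finite
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(labels)
```

A classifier that outputs 0.5 for every proposal yields a loss of log 2 (about 0.693), and the loss decreases towards zero as the predictions approach the labels.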
We denote a pre-trained CNN model as $\widetilde{\mathbf{y}}^{*} = f(\widetilde{\mathbf{x}}, \widetilde{\theta})$, with $\widetilde{\theta} = [\widetilde{\theta}_{cn}, \widetilde{\theta}_{fc}, \widetilde{\theta}_{lr}]$, where $\widetilde{\theta}_{cn}$ are the parameters for the convolutional and non-linear subsampling layers, $\widetilde{\theta}_{fc}$ are the parameters for the fully connected layers, and $\widetilde{\theta}_{lr}$ are the parameters for the multinomial logistic regression layer.
The parameters $\widetilde{\theta}_{cn}$ and $\widetilde{\theta}_{fc}$ (or a subset of them) can be transferred to another CNN model, in order to provide a rich initialization. The layers whose parameters were not transferred can be randomly initialized. Finally, the resulting CNN is fine-tuned with the dataset corresponding to the task of interest.
In the PD case, we transfer the parameters of the convolutional and non-linear subsampling layers, and randomly initialize all the other layers. Due to changes in the CNN input size, the parameters of the fully connected layers were adjusted accordingly. The multinomial logistic regression layer was adapted to consider only two classes (pedestrian and non-pedestrian). Finally, this CNN model for PD was fine-tuned with the pedestrian dataset $\mathcal{D}$, resorting to the binary cross entropy loss in (3).
2.2 Tracking
Given the position measurements from the people detection scheme (described in the previous subsection), the goal of the tracking phase is to associate detections between frames and to estimate the direction of a person’s velocity. For that purpose, we use a simple Kalman filter.
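A minimal sketch of such a filter is given below, assuming a constant-velocity motion model, position-only measurements on the floor plane, and independent filters for the two coordinates. The class names, noise values (`q`, `r`) and initial covariance are illustrative assumptions, not the settings used in the paper, and the association of detections to tracks is not covered.

```python
import math


class AxisKalman:
    """Constant-velocity Kalman filter for one floor-plane coordinate."""

    def __init__(self, pos: float, dt: float, q: float = 1e-3, r: float = 0.1):
        self.pos, self.vel = pos, 0.0
        self.dt, self.q, self.r = dt, q, r
        # Covariance [[p00, p01], [p10, p11]]; the velocity starts uncertain.
        self.p00, self.p01, self.p10, self.p11 = 1.0, 0.0, 0.0, 10.0

    def predict(self) -> None:
        dt = self.dt
        self.pos += self.vel * dt
        # P <- F P F^T + Q, with F = [[1, dt], [0, 1]] and Q = q * I.
        p00 = self.p00 + dt * (self.p01 + self.p10) + dt * dt * self.p11 + self.q
        p01 = self.p01 + dt * self.p11
        p10 = self.p10 + dt * self.p11
        p11 = self.p11 + self.q
        self.p00, self.p01, self.p10, self.p11 = p00, p01, p10, p11

    def update(self, z: float) -> None:
        # Only the position is observed (H = [1, 0]).
        s = self.p00 + self.r
        k0, k1 = self.p00 / s, self.p10 / s
        innovation = z - self.pos
        self.pos += k0 * innovation
        self.vel += k1 * innovation
        # P <- (I - K H) P
        p00, p01 = (1.0 - k0) * self.p00, (1.0 - k0) * self.p01
        p10, p11 = self.p10 - k1 * self.p00, self.p11 - k1 * self.p01
        self.p00, self.p01, self.p10, self.p11 = p00, p01, p10, p11


class PersonTrack:
    """Tracks one person with two independent per-axis filters."""

    def __init__(self, x: float, y: float, dt: float):
        self.fx, self.fy = AxisKalman(x, dt), AxisKalman(y, dt)

    def step(self, x: float, y: float) -> None:
        for f, z in ((self.fx, x), (self.fy, y)):
            f.predict()
            f.update(z)

    def heading(self) -> float:
        """Direction of the estimated velocity, in radians."""
        return math.atan2(self.fy.vel, self.fx.vel)
```

Feeding the track with detections of a person who moves along the x axis makes the velocity estimate converge to that motion, so `heading()` approaches zero.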
3 Material and methods
In this section, we describe the training details used for the pedestrian detector, taking into account the overall framework for the navigation setup.
3.1 Dataset for the CNN training
The pedestrian dataset chosen to train the CNN was the INRIA dataset, which is a popular benchmark in the PD field.
To train the CNN, we extracted proposals from the original train set as follows. To construct the positive set, we first use the ground truth positive training bounding boxes to extract proposals. Then, we enlarge this set with data augmentation in two steps:
Horizontal flipping applied to the extracted positive proposals, resulting in a set that also includes the original proposals; and
Random deformations (affecting a bounded range of pixels at the beginning and end) applied to the previous set, resulting in the final positive set.
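The two augmentation steps above can be sketched as follows. The deformation is interpreted here as trimming a random number of pixels from the beginning and end of each image axis; this reading, the `max_shift` bound and the function names are assumptions made for the example, since the exact range is not given in this text.

```python
import random
from typing import List, Tuple

Image = List[List[int]]  # rows of pixel values (single channel, for brevity)


def horizontal_flip(image: Image) -> Image:
    """Mirror the image left to right."""
    return [row[::-1] for row in image]


def random_deformation(image: Image, max_shift: int, rng: random.Random) -> Image:
    """Trim 0..max_shift pixels from the start and end of each axis.
    Assumes the image is larger than 2 * max_shift in both dimensions."""
    h, w = len(image), len(image[0])
    top, bottom = rng.randint(0, max_shift), rng.randint(0, max_shift)
    left, right = rng.randint(0, max_shift), rng.randint(0, max_shift)
    return [row[left:w - right] for row in image[top:h - bottom]]


def augment_positives(crops: List[Image], max_shift: int,
                      rng: random.Random) -> Tuple[List[Image], List[Image]]:
    """Step 1: originals plus their mirrored copies.
    Step 2: one random deformation of every sample from step 1."""
    flipped_set = crops + [horizontal_flip(c) for c in crops]
    deformed_set = [random_deformation(c, max_shift, rng) for c in flipped_set]
    return flipped_set, deformed_set
```

In a complete pipeline the deformed crops would still have to be resized to the CNN input size, which is omitted here.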
To construct the negative set, we apply a non-fully trained LDCF detector to extract proposals from the negative images. The final set of CNN train proposals comprises a total of 17500 samples, which are divided into train (15751 proposals, i.e., 90% of the total) and validation (1749 proposals, i.e., 10% of the total).
3.2 CNN model for training
Since we use a pre-trained CNN model (which can be considered a regularization technique), we have to adapt it to our task of interest (i.e., PD). Next, we mention the selected pre-trained CNN, the changes made, and the final architecture training details.
Pre-trained model original architecture
The selected pre-trained model is the VGG Very Deep 16 architecture (VGG-VD16, configuration D).
Pre-trained model changes and fine-tuning
Motivated by the pre-trained model’s expensive and time-consuming computations, we downscaled the CNN input from its original dimensions. With this modification, inference can no longer be performed beyond the first fully connected layer, whose dimensions depend on the input size. Furthermore, the classification layer must be adapted to transition from the 1000 ILSVRC classes to two PD classes (i.e., pedestrian and non-pedestrian). To overcome these problems, we randomly initialize the parameters of the three fully connected layers with the correct dimensions. For this initialization procedure, we selected a Gaussian distribution. The modified CNN model is fine-tuned with the positive and negative proposal training sets, acquired from the INRIA dataset (as described in Sec. 3.1).
In terms of the fine-tuning hyperparameters, we used 10 epochs with a minibatch of 100 samples, a learning rate of 0.001, and a momentum of 0.9.
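To make these hyperparameters concrete, the sketch below shows one common form of the stochastic gradient descent update with momentum, using the learning rate of 0.001 and momentum of 0.9 quoted above. The update form itself (without weight decay) and the assumption that the final partial minibatch is kept are not taken from the paper.

```python
import math
from typing import List, Tuple


def sgd_momentum_step(weights: List[float], grads: List[float],
                      velocity: List[float], lr: float = 0.001,
                      momentum: float = 0.9) -> Tuple[List[float], List[float]]:
    """One parameter update: v <- momentum * v - lr * grad, then w <- w + v."""
    new_velocity = [momentum * v - lr * g for v, g in zip(velocity, grads)]
    new_weights = [w + v for w, v in zip(weights, new_velocity)]
    return new_weights, new_velocity


def iterations_per_epoch(num_samples: int, batch_size: int) -> int:
    """Number of minibatches per epoch, keeping the last partial batch."""
    return math.ceil(num_samples / batch_size)
```

Under these assumptions, the 15751 training proposals and a minibatch of 100 samples give 158 updates per epoch, i.e., 1580 updates over the 10 epochs.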
For the test, the proposals (i.e., promising regions for the existence of pedestrians) are first extracted by running the ACF detector on the test images. Then, these proposals are classified as pedestrians or non-pedestrians by the fine-tuned CNN model described previously.
The PD methodology was implemented in MATLAB, running in CPU mode on a 2.50 GHz Intel Core i7-4710 HQ with 12 GB of RAM and a 64-bit architecture. To run the ACF detector and to evaluate the performance, Piotr’s Computer Vision MATLAB Toolbox (2014, version 3.40) was employed. Concerning the CNN framework, we used the MatConvNet toolbox. The experiments described in the next section (Sec. 4 and Table 1) were conducted using the same settings.
4 Experimental Results
In this section, we evaluate the PD method in two real scenarios (i.e., datasets named “corridor” and “Mbot”). To conclude, we validate our approach by using the PD method on a Human-Aware Navigation problem, in which real-time detection performance is required.
4.1 Evaluation of the PD method
In order to conduct experiments in real scenarios, we acquired two indoor datasets and tested the proposed PD method on them. These datasets are: 1) the “corridor” dataset, which comprises 5556 images, and 2) the “Mbot” dataset, which comprises 3966 images. All images (i.e., frames) in both sets have the same dimensions. The detection results for some samples of these datasets are depicted in Fig. 2.
For each dataset, we measure the runtime of the final PD method (i.e., ACF+CNN) proposed in Sec. 3.2. As a result, we obtain approximately 707.27 seconds, which is equivalent to 7.85 FPS, for the “corridor” dataset (5556 frames), and 839 seconds, which is equivalent to 4.84 FPS, for the “Mbot” dataset (3966 frames). Further details about the runtime figures are presented in Tab. 1 (top, the two columns in the field named “Baseline”), where the values represent per-frame metrics.
To reach the real-time specifications required in robot navigation tasks, the speed should be improved. This can be accomplished by filtering the ACF proposals based on the confidence score, since this procedure reduces the number of proposals to be processed by the CNN, increasing the detection speed. The goal is to improve the previous speed, and achieve real-time performance, without substantially degrading the accuracy.
Taking into account that the confidence scores are important indicators of the relevance of each proposal, a score rejection threshold can be established. Only the proposals with a score above this threshold are classified by the CNN. Following prior work, we selected a threshold value of 40.
Consequently, in the process of discarding false positives, first we should eliminate the easier ones resorting to the threshold operation, and then we should eliminate the harder ones using the CNN. The possible loss in accuracy versus the speed improvement is determined by the choice of the threshold value.
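As a rough illustration of this two-stage rejection, the sketch below first discards proposals whose ACF score does not exceed the threshold of 40 and only then queries a classifier for the survivors. The `cnn_is_pedestrian` callable is a hypothetical stand-in for the fine-tuned CNN, and the tuple layout of a proposal is an assumption made for the example.

```python
from typing import Callable, List, Tuple

# Hypothetical proposal layout: ((x, y, w, h), acf_score)
Proposal = Tuple[Tuple[float, float, float, float], float]


def cascade(proposals: List[Proposal],
            cnn_is_pedestrian: Callable[[Proposal], bool],
            score_threshold: float = 40.0) -> List[Proposal]:
    """Reject easy false positives by score, then harder ones with the CNN."""
    # Stage 1: cheap score test; only proposals above the threshold survive.
    survivors = [p for p in proposals if p[1] > score_threshold]
    # Stage 2: expensive classifier, evaluated only on the survivors.
    # Accepted proposals keep their original ACF score.
    return [p for p in survivors if cnn_is_pedestrian(p)]
```

Because the second stage is evaluated only on the survivors of the first, the number of CNN evaluations per frame, and hence the CNN time, shrinks with the number of rejected proposals.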
Accordingly, using the threshold operation strategy, the detection speed of the overall method (i.e., ACF+CNN) is improved, in comparison with the baseline metrics. The cases before (“Baseline” field) and after (“Threshold” field) the threshold operation are depicted in Tab. 1. As presented in Tab. 1, we are able to achieve the real-time requirements for our navigation setup, by reaching a detection speed of approximately 10 FPS.
|Setting|Metric|Data seq. 1 (corridor)|Data seq. 2 (Mbot)|
|---|---|---|---|
|Baseline|Total time|0.1273 sec.|0.2066 sec.|
|Baseline|ACF time|0.0326 sec.|0.0367 sec.|
|Baseline|CNN time|0.0947 sec.|0.17 sec.|
|Baseline|Frame rate|7.85 FPS|4.84 FPS|
|Threshold|Total time|0.0961 sec.|0.1026 sec.|
|Threshold|ACF time|0.0333 sec.|0.0381 sec.|
|Threshold|CNN time|0.0628 sec.|0.0645 sec.|
|Threshold|Frame rate|10.41 FPS|9.74 FPS|
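The relation between the entries of Tab. 1 can be checked with a few lines of arithmetic: the total time per frame is the sum of the ACF and CNN times, and the frame rate is its reciprocal. The values below are copied from the table; the dictionary layout and function names are only illustrative. Note that the per-frame total of 0.2066 s reproduces the 4.84 FPS of the “Mbot” baseline, whereas dividing its 3966 frames by the 839 s quoted in the text gives about 4.73 FPS.

```python
from typing import Dict, Tuple

# Per-frame times in seconds, copied from Tab. 1: (ACF, CNN, total)
TABLE_1: Dict[Tuple[str, str], Tuple[float, float, float]] = {
    ("Baseline", "corridor"): (0.0326, 0.0947, 0.1273),
    ("Baseline", "Mbot"): (0.0367, 0.17, 0.2066),
    ("Threshold", "corridor"): (0.0333, 0.0628, 0.0961),
    ("Threshold", "Mbot"): (0.0381, 0.0645, 0.1026),
}


def frame_rate(total_time_per_frame: float) -> float:
    """Frames per second for a given per-frame processing time."""
    return 1.0 / total_time_per_frame


def totals_are_consistent(table, tol: float = 2e-4) -> bool:
    """True when ACF time + CNN time matches the reported total in every row
    (up to rounding of the published figures)."""
    return all(abs(acf + cnn - total) <= tol for acf, cnn, total in table.values())


def speedup(table, dataset: str) -> float:
    """Ratio between the baseline and thresholded per-frame total times."""
    return table[("Baseline", dataset)][2] / table[("Threshold", dataset)][2]
```

With these figures, the threshold operation speeds up the “corridor” sequence by a factor of about 1.3 and the “Mbot” sequence by a factor of about 2.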
4.2 Real Experiments in an Indoor Scenario
In this section, we evaluate our PD framework on a Human-Aware Navigation (HAN) application. For that purpose, we consider the setup shown in Fig. 3.
HAN is not the focus of this paper. Therefore, we follow a navigation approach (including the constraints associated with HAN) proposed in prior work. Basically, the authors use a path planner (to ensure a minimal cost path) and define a set of HAN constraints as cost functions. These cost functions are shown in the figures of the experimental results (Figs. 4(a) and 4(b)), and are computed using the following procedure:
Selecting the middle point of the lower edge of the bounding box that is given by the PD;
Projecting this middle point on the image onto the floor plane (assuming that the position of the robot is known); and
Estimating the pedestrian velocity in the world coordinate system.
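The three steps above can be sketched as follows, assuming that the mapping from image pixels to the floor plane is available as a 3x3 homography `H` (obtained, for instance, from the camera calibration and the known robot pose). The homography model, the `(x, y, w, h)` box layout and the finite-difference velocity are assumptions made for the example; in the paper, the velocity estimate comes from the Kalman filter of the tracking phase.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float]


def foot_point(box: Tuple[float, float, float, float]) -> Point:
    """Middle point of the lower edge of a box given as (x, y, w, h),
    where (x, y) is the top-left corner in image coordinates."""
    x, y, w, h = box
    return (x + w / 2.0, y + h)


def project_to_floor(H: Sequence[Sequence[float]], pixel: Point) -> Point:
    """Map an image point to floor-plane coordinates with a homography H."""
    u, v = pixel
    xh = H[0][0] * u + H[0][1] * v + H[0][2]
    yh = H[1][0] * u + H[1][1] * v + H[1][2]
    wh = H[2][0] * u + H[2][1] * v + H[2][2]
    return (xh / wh, yh / wh)


def finite_difference_velocity(prev: Point, curr: Point, dt: float) -> Point:
    """Velocity in world coordinates from two consecutive floor positions."""
    return ((curr[0] - prev[0]) / dt, (curr[1] - prev[1]) / dt)
```

For example, with a homography that simply scales pixels to meters by 0.01, a box `(100, 50, 40, 150)` has its foot point at pixel `(120, 200)` and is mapped to the floor position `(1.2, 2.0)`.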
The goal of these experiments is to evaluate the proposed PD on the image, using a robot navigation application in the presence of people. Two experiments were conducted, in which we apply our PD and the previous method for HAN:
In the first experiment, we consider a simple example where a robot is moving towards a goal and people are standing in the environment (in front of the robot). The robot must take their positions into account (which are given by the PD mapped onto the floor plane) in the path planning, in order to avoid a collision; and
In the second experiment, a person starts walking while the robot is moving, blocking the path of the robot. Following the social rules, the robot must replan its path to overtake the person on the left.
The results of both experiments are shown in Figs. 4(a) and 4(b), respectively. Videos of these experiments will be included on the authors’ websites. As can be seen in these figures, the robot behaves as expected, which indicates that our PD method is suitable for robot navigation tasks in the presence of people.
5 Conclusions
This paper presents a novel framework that integrates pedestrian detection in the problem of robot navigation. More specifically, it integrates a novel pedestrian detection approach jointly with specific motion constraints, representing the human-aware concerns. The novelty of the PD methodology is that it improves the accuracy of a non-deep detector by efficiently cascading it with a CNN. The PD method is evaluated using two sets of real images acquired in a typical robot navigation environment (considering both on-board and external camera sensors). The results show that the proposed solution is suitable for robot navigation tasks, in terms of both runtime and robustness. In addition, two experiments were conducted in a realistic scenario to assess the overall framework performance.
As future work, we plan to fuse data from multiple external and on-board cameras and from other types of sensors, such as lasers, as well as to test the proposed framework with different people behaviors.
This work was partially supported by FCT [UID/EEA/50009/2013], and by the FCT grant SFRH/BPD/111495/2015.
We would also like to thank Luis Luz for his help in obtaining the experimental results.
- Other non-deep detectors could be used, but we adopted the ACF detector because it is fast.
- In this paper, the RGB feature map is considered for the image $\mathbf{x}$.
- Other filters could be used, but we used the Kalman filter because of its simplicity; tracking is not the main focus of the paper.
- More details can be found at: http://pascal.inrialpes.fr/data/human/.
- Additional details can be found at: http://www.robots.ox.ac.uk/~vgg/research/very_deep/.
[1] P. Dollár, C. Wojek, B. Schiele, and P. Perona, “Pedestrian detection: An evaluation of the state of the art,” IEEE Trans. Pattern Analysis and Machine Intelligence, 2012.
[2] R. Benenson, M. Omran, J. Hosang, and B. Schiele, “Ten years of pedestrian detection, what have we learned?” in ECCV, CVRSUAD workshop, 2014.
[3] P. Viola and M. J. Jones, “Robust real-time face detection,” Int’l J. Computer Vision, vol. 57, no. 2, pp. 137–154, 2004.
[4] S. Zhang, C. Bauckhage, and A. Cremers, “Informed Haar-like features improve pedestrian detection,” in IEEE Proc. Computer Vision and Pattern Recognition (CVPR), 2014.
[5] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, “ImageNet Large Scale Visual Recognition Challenge,” Int’l J. Computer Vision, 2015.
[6] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson, “How transferable are features in deep neural networks?” in Neural Information Processing Systems (NIPS), 2014.
[7] P. Dollár, R. Appel, S. Belongie, and P. Perona, “Fast feature pyramids for object detection,” IEEE Trans. Pattern Analysis and Machine Intelligence, 2014.
[8] W. Nam, P. Dollár, and J. H. Han, “Local decorrelation for improved pedestrian detection,” in Neural Information Processing Systems (NIPS), 2014.
[9] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in Int’l Conf. on Learning Representations (ICLR), 2015.
[10] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Neural Information Processing Systems (NIPS), 2012.
[11] S. Bitgood and S. Dukes, “Not Another Step! Economy of Movement and Pedestrian Choice Point Behavior in Shopping Malls,” Environment and Behavior, 2006.
[12] N. Bellotto and H. Hu, “Computationally efficient solutions for tracking people with a mobile robot: an experimental evaluation of Bayesian filters,” Autonomous Robots, 2010.
[13] Y. Bar-Shalom, F. Daum, and J. Huang, “The Probabilistic Data Association Filter,” IEEE Control Systems, 2009.
[14] N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in IEEE Proc. Computer Vision and Pattern Recognition (CVPR), 2005.
[15] D. Ribeiro, J. C. Nascimento, A. Bernardino, and G. Carneiro, “Improving the performance of pedestrian detectors using convolutional learning,” Pattern Recognition, 2016.
[16] P. Dollár, “Piotr’s Computer Vision Matlab Toolbox (PMT),” http://vision.ucsd.edu/~pdollar/toolbox/doc/index.html.
[17] A. Vedaldi and K. Lenc, “MatConvNet – Convolutional Neural Networks for MATLAB,” in ACM Proc. Int’l Conf. on Multimedia, 2015.
[18] A. Verma, R. Hebbalaguppe, L. Vig, S. Kumar, and E. Hassan, “Pedestrian detection via mixture of CNN experts and thresholded aggregated channel features,” in ICCV Workshop, 2015.
[19] J. Messias, R. Ventura, P. Lima, J. Sequeira, P. Alvito, C. Marques, and P. Carriço, “A Robotic Platform for Edutainment Activities in a Pediatric Hospital,” in IEEE Int’l Conf. Autonomous Robot Systems and Competitions (ICARSC), 2014.
[20] A. Mateus, P. Miraldo, P. U. Lima, and J. Sequeira, “Human-Aware Navigation using External Omnidirectional Cameras,” in Iberian Robotics Conference, 2015.