Weakly supervised training of deep convolutional neural networks
for overhead pedestrian localization in depth fields
Abstract
Overhead depth map measurements capture a sufficient amount of information to enable human experts to track pedestrians accurately. However, fully automating this process using image analysis algorithms can be challenging. Even though handcrafted image analysis algorithms are successful in many common cases, they fail frequently when there are complex interactions of multiple objects in the image. Many of the assumptions underpinning the handcrafted solutions do not hold in these cases and the multitude of exceptions is hard to model precisely. Deep Learning (DL) algorithms, on the other hand, do not require handcrafted solutions and are the current state of the art in object localization in images. However, they require an exceedingly large amount of annotations to produce successful models. In the case of object localization these annotations are difficult and time consuming to produce. In this work we present an approach for developing pedestrian localization models using DL algorithms with efficient weak supervision from an expert. We circumvent the need for annotating a large corpus of data by annotating only a small number of patches and relying on synthetic data augmentation as a vehicle for injecting expert knowledge into the model training. This approach of weak supervision through expert selection of representative patches, suitable transformations and synthetic data augmentations enables us to develop DL models for pedestrian localization efficiently.
1 Introduction
Depth field data encodes the distance between each recorded point and the camera plane. This allows for highly accurate crowd dynamics analyses in real-world scenarios, i.e. outside of laboratory environments [1]. With this technology it has become possible, for the first time, to analyze from several thousand to a few million actual pedestrian trajectories [3, 6, 7]. This enabled new statistical insights, unbiased by artificial laboratory conditions (e.g. the need for participants to wear tracking hats or vests, and dynamics regulated by the experimenter's instructions) [2, 5, 13, 16]. Furthermore, depth measurements naturally protect the privacy of the pedestrians, since individuals remain unrecognizable. This is a requirement for real-life measurements, and a challenge for methods that use imaging rather than depth field data [8].
Pedestrian positioning from depth field data, acquired from sensors such as Microsoft Kinects [9], requires addressing two key tasks: background subtraction and head localization. For a number of common cases these two tasks have a straightforward solution. Since the camera takes a bird's-eye view, the background can be subtracted simply by removing all points beyond a depth threshold. Head localization can be approached similarly, since the points closest to the camera are part of the pedestrian heads. So far, these tasks have been tackled via handcrafted approaches that rely on expert-tweaked depth-cloud clustering (CL) algorithms [13]. These approaches segment the different objects mainly based on the assumption that a cluster of neighboring points forms a pedestrian. This assumption typically holds when the dynamics in the scene involve low pedestrian densities (about ped./m max [7]), homogeneous crowds (i.e. composed of adults of similar size, with no elements such as strollers, carts, and bikes), and simple geometric settings (corridors in [7, 13, 16]).
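The two straightforward steps described above can be sketched in a few lines of NumPy. This is only a minimal illustration and not the procedure of [7, 13]: the function names, the toy depth values and the threshold `floor_threshold` are assumptions introduced here.

```python
import numpy as np

def subtract_background(depth, floor_threshold):
    """Mask out points farther from the camera than floor_threshold."""
    # Overhead view: a large depth value means the point is close to the floor.
    foreground = depth < floor_threshold
    return np.where(foreground, depth, np.nan), foreground

def closest_point(depth_fg):
    """Return (row, col) of the foreground point closest to the camera."""
    flat_index = np.nanargmin(depth_fg)
    return np.unravel_index(flat_index, depth_fg.shape)

# Toy 5x5 depth map in metres (hypothetical values): floor at 4.0 m,
# shoulders at 2.6 m and the top of the head at 2.3 m.
depth = np.full((5, 5), 4.0)
depth[1:4, 1:4] = 2.6
depth[2, 2] = 2.3
fg_depth, mask = subtract_background(depth, floor_threshold=3.5)
head = closest_point(fg_depth)
```

With a single isolated pedestrian the closest point is a reasonable head estimate; the sketch makes no attempt to separate several nearby individuals, which is where the difficulties discussed next arise.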
However, such handcrafted approaches do not generalize well and are sensitive to a number of special-case scenarios. This can be as simple as a raised hand that is interpreted as a head, but it becomes a larger issue when pedestrian density increases and CL algorithms experience difficulties in disentangling the individuals. Furthermore, these problems only increase when there are other (i.e. non-pedestrian) objects in the scene, such as moving doors, trolleys and obstacles.
Figure 1 contains a sample depth-field map picturing 9 pedestrians, 3 of whom are infants (notice the smaller size), and 1 static object. The CL-based localization in Fig. 1(A) shows some typical mistakes (cf. ground truth in Fig. 1(B)): twice, a couple formed by an adult and a child (top and left side of the scene) is detected as a single individual. Based on the shape of the objects, an expert instead recognizes that the likely scenario involves two individuals standing close to each other. The CL algorithm also fails to disentangle a pedestrian at the bottom from a static obstacle in their neighborhood (even though they are not connected). This follows from the small size of the object and the inability of the CL approach to classify shapes. Detailing every exception and crafting appropriate rules to deal with such complexity becomes difficult and requires significant effort that does not transfer well across different measurement scenarios and implementations.
On the other hand, Machine Learning (ML) methods for image analysis rely on training data rather than on the design of rules and exceptions. Given sufficient data and good features, these models tend to perform well and are robust to special cases. The success of these approaches has been particularly demonstrated with recent developments in Deep Learning (DL). DL models obtain excellent performance and are currently the state of the art in image localization [11]. The major advantage that DL methods have over conventional ML image analysis is that they incorporate automatic feature extraction (or representation learning) as part of the model training. However, one of the main disadvantages of these approaches is the difficulty of incorporating expert knowledge about the problem, and hence the requirement of a significant amount of annotations [15].
Therefore, to develop DL models for pedestrian localization, an expert needs to produce a large amount of hand-annotated images. These images and their annotations can then be used to train a DL model, such as a Deep Convolutional Neural Network (CNN), to produce the target model. As the number of annotations can be quite high, this becomes very labor intensive and diminishes the advantages of using CNNs.
In this work we address this problem by proposing a method for efficient collection of expert annotations for pedestrian tracking using depth field images.
Our contribution is twofold:

we design a CNN model that can detect a large number of objects in a high density scenario suitable for pedestrian tracking with overhead depth field images;

we develop a ‘soft’ supervision procedure that provides training data for the model by selecting a number of patches in the original data, designing suitable transformations of the patches, and generating realistic synthetic data for the CNN model.
The obtained model can be used for real-time pedestrian detection in depth field maps, possibly on large areas, exploiting Graphics Processing Units (GPUs).
This paper is structured as follows: in Sect. 2 we provide some selected background on depth-map-based crowd recording setups. On this basis, in Sect. 3 we describe our synthetic data generation procedure as well as our neural network and its training method. In Sect. 4 we describe the experiment design and in Sect. 5 we examine the detection performance. A final discussion section closes the paper.
2 Depth-field measurements
Overhead depth-field maps for pedestrian dynamics analyses
A typical depth-field measurement apparatus for pedestrian dynamics includes sensors placed overhead and aligned with the vertical axis. A bird's-eye view, in fact, avoids mutual occlusions and thus eases localization tasks. Moreover, to bypass the limited range of commercial devices such as Microsoft Kinect™, sensors are arranged in grids to enlarge the measurement area. This requires merging the individual sensor signals. Different stitching approaches have been considered: for instance, in [7], depth images are unified into large depth frames that then undergo detection and tracking algorithms. In [13], the tracking information from different sensors is merged a posteriori. In the next paragraph, we review some elements of the former approach as it supports the depth map “combination algebra” we employ to generate synthetic annotated depth maps (see Sect. 3.2).
Depth-field maps combination algebra
In [7], depth-field maps from neighboring Kinect sensors are merged, which enables tracking pedestrians over a relatively large area. This requires two operations. (1) The depth field measured by each sensor is converted from a perspective view into an axonometric view. Overhead axonometric views of pedestrians are translation invariant, which means that a pedestrian is represented by the same depth patch regardless of whether they are at the center or at the edges of the sensor view (cf. patches in Fig. 1). (2) The axonometric view enables seamless combination of depth-field maps from neighboring sensors. Given two neighboring sensors returning the depth maps $d_1$ and $d_2$, we combine them as
$$ d_1 \oplus d_2 = \min\left(d_1 \circ \tau_1,\; d_2 \circ \tau_2\right). \tag{1} $$
In words, the depth field maps are first rigidly translated (by a composition with the motions $\tau_1$ and $\tau_2$) to register with the relative positions of the sensors in physical space. Then the componentwise minimum is extracted. This retains, for each location, the smallest distance, i.e. the one actually observed from an aerial view. The operation in Eq. (1) is commutative and associative, so it can be extended to an arbitrary number of cameras (or depth patches).
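A minimal NumPy sketch of the combination in Eq. (1) follows. It assumes that the rigid motions are integer pixel offsets on a common canvas and that pixels not covered by a sensor hold the floor depth. The names `place`, `combine` and `floor_depth` are hypothetical and introduced only for this illustration.

```python
import numpy as np

def place(depth, offset, canvas_shape, floor_depth):
    """Rigidly translate a depth map onto a larger canvas filled with floor depth."""
    canvas = np.full(canvas_shape, floor_depth, dtype=float)
    r, c = offset
    h, w = depth.shape
    canvas[r:r + h, c:c + w] = depth
    return canvas

def combine(maps_and_offsets, canvas_shape, floor_depth):
    """Componentwise minimum of translated depth maps (closest surface wins)."""
    merged = np.full(canvas_shape, floor_depth, dtype=float)
    for depth, offset in maps_and_offsets:
        translated = place(depth, offset, canvas_shape, floor_depth)
        merged = np.minimum(merged, translated)
    return merged

# Two toy 2x2 sensor views (hypothetical values, metres) overlapping in one column.
d1 = np.array([[4.0, 2.5], [4.0, 4.0]])
d2 = np.array([[2.8, 4.0], [4.0, 2.4]])
a = combine([(d1, (0, 0)), (d2, (0, 1))], canvas_shape=(2, 3), floor_depth=4.0)
b = combine([(d2, (0, 1)), (d1, (0, 0))], canvas_shape=(2, 3), floor_depth=4.0)
```

Swapping the order of the inputs gives the same merged map, which reflects the commutativity of the operation; the loop extends naturally to any number of maps or patches.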
3 Method
3.1 CNN model for object localization
Neural networks, and CNNs in particular, have demonstrated particular success in image analysis [12]. Their major advantage is the hierarchical structure that allows the models to build complex features and form an efficient representation of the input data. In our method we aim to leverage this advantage to improve the detection and localization of pedestrians from depth maps. We expect that efficient features and sufficient supervision will produce models that better disentangle multiple nearby objects and make a more accurate distinction between pedestrians and other objects in the scene.
The architecture of the proposed CNN model is closely related to the YOLO object localization approach [10]. This approach offers computationally efficient localization, which opens the possibility of real-time analysis of a large number of objects. Both are advantageous properties for large-scale pedestrian tracking.
The model processes the whole image in a single pass and produces a set of bounding boxes, one for each object that it has detected in the image. The model can also associate a class to each object. For our application we only provide detection of the objects, since we do not need to detect different types of objects. The model overlays a grid over the image, and produces a binary detection decision for each cell in the grid. It also produces an offset and the width and height of the bounding box for that object. The bounding box size is not limited to the cell size. For a regular $n \times n$ grid the model produces:
$$ \{ b_i \}_{i=1}^{n^2}, \qquad b_i = (x_i, y_i, w_i, h_i, p_i, q_i), $$
where $(x_i, y_i)$ are the Cartesian coordinates of the bounding box at the $i$-th tile, whose width and height are, respectively, $w_i$ and $h_i$. $p_i$ denotes the probability that $b_i$ is actually a bounding box. Namely, for ground truth data, whenever $p_i = 0$, $b_i$ plays the role of a placeholder; conversely, $p_i = 1$ states that the $i$-th tile holds a non-void bounding box. Finally, $q_i = 1 - p_i$ is kept for extensibility to multi-type object detection.
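To make the grid output concrete, the sketch below decodes a per-cell output array into a list of detections. The memory layout (six values per cell: position, width, height, detection probability and its complement) and the 0.5 threshold are assumptions made for this illustration and are not taken from the trained model.

```python
import numpy as np

def decode(output, threshold=0.5):
    """Turn an (n, n, 6) grid output into a list of (x, y, w, h, p) boxes.

    Assumed (hypothetical) layout per cell: x, y, w, h, p, q with q = 1 - p.
    """
    boxes = []
    n_rows, n_cols, _ = output.shape
    for i in range(n_rows):
        for j in range(n_cols):
            x, y, w, h, p, _q = output[i, j]
            # Keep the cell only if the detection probability is high enough.
            if p > threshold:
                boxes.append((float(x), float(y), float(w), float(h), float(p)))
    return boxes

# Toy 3x3 output: every cell is empty except the one in row 1, column 2.
out = np.zeros((3, 3, 6))
out[..., 5] = 1.0
out[1, 2] = [0.8, 0.5, 0.2, 0.3, 0.9, 0.1]
boxes = decode(out)
```

Because each cell contributes at most one box, the number of detectable objects is bounded by the number of cells, which is why the grid size matters for dense scenes.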
The network we employ is composed of a first section aimed at feature extraction, followed by a densely connected section that combines local features into bounding box estimates. The inputs are depth images (thus single channel) at resolution (VGA resolution downsampled by a factor in each direction). Feature extraction occurs through two stacked layer blocks, each of which contains two convolutional layers with small filter size () and a final max-pooling layer. This architecture is closely related to the VGG network [14]. The convolutional and pooling layers are followed by fully connected layers that end with the output layer, consisting of linear outputs for the bounding box parameters and a softmax for the detection probabilities. The diagram of the network is given in Fig. 2.
As nonlinear function approximators, DL models are trained through a nonconvex optimization procedure that minimizes a defined loss function. Due to the complex multipart output of our model, we needed to define a multipart loss function $L$:
$$ L(b, b^*) = \alpha\, C(b, b^*) + \beta\, p^*\, E(b, b^*). \tag{2} $$
Respectively, it holds

the “$*$” denotes ground truth (synthetic) bounding box data;

$C$ is the categorical cross-entropy function restricted to the components $(p, q)$ of the bounding box vector $b$;

$E$ is the ordinary Euclidean distance among the spatial components $(x, y, w, h)$ of the bounding box vector. Notably, the function $E$ is multiplied by $p^*$, which acts as a switch, turning off the loss for the location parameters when there is no object present in the ground truth;

$\alpha$ and $\beta$ are weighting factors for the linear combination of the two metrics.
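The structure of the multipart loss can be illustrated with a short NumPy function: a cross-entropy term on the detection components plus a Euclidean term on the spatial components, the latter switched off for empty ground-truth cells. The array layout, the averaging over cells and the default weights are assumptions of this sketch; the actual training used a Keras implementation whose details are not reproduced here.

```python
import numpy as np

def multipart_loss(pred, truth, alpha=1.0, beta=1.0, eps=1e-12):
    """pred, truth: arrays of shape (cells, 6) laid out as x, y, w, h, p, q."""
    # Categorical cross-entropy on the detection components (p, q).
    prob_pred = np.clip(pred[:, 4:6], eps, 1.0)
    cross_entropy = -np.sum(truth[:, 4:6] * np.log(prob_pred), axis=1)
    # Euclidean distance on the spatial components (x, y, w, h).
    distance = np.linalg.norm(pred[:, 0:4] - truth[:, 0:4], axis=1)
    # The ground-truth p switches the localization term off for empty cells.
    switch = truth[:, 4]
    return float(np.mean(alpha * cross_entropy + beta * switch * distance))

# One occupied cell and one empty cell (hypothetical values).
truth = np.array([[0.5, 0.5, 0.2, 0.2, 1.0, 0.0],
                  [0.0, 0.0, 0.0, 0.0, 0.0, 1.0]])
pred_a = truth.copy()
pred_a[:, 4:6] = [[0.9, 0.1], [0.1, 0.9]]
pred_b = pred_a.copy()
pred_b[1, :4] = 5.0      # spatial error in the empty cell: ignored by the switch
pred_c = pred_a.copy()
pred_c[0, 0] += 1.0      # spatial error in the occupied cell: penalized
loss_a = multipart_loss(pred_a, truth)
loss_b = multipart_loss(pred_b, truth)
loss_c = multipart_loss(pred_c, truth)
```

The switch keeps the network from being penalized for arbitrary box parameters in cells that hold no object, so only the detection probability is learned there.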
3.2 Weak supervision through synthetic data
DL methods rely on training data to develop models. Beyond this and the network architecture, most of the options for adding expert knowledge are indirect. One common way to guide the training is to augment the training data or to add synthetically generated training data. This allows for adding properties to the model such as invariance to translation, rotation, mirroring and skewing. One can even go further and add noise or synthetically generate data based on an understanding of the domain. This way the model is exposed to a larger range of variation in the input space and can achieve better generalization with a smaller amount of natural data. We use this opportunity to deal with the difficulty of providing annotations for our problem. We achieve this by selecting patches from the original data that correspond to pedestrians. Each patch provides both an example of a pedestrian and an annotation of the bounding box around the pedestrian. We then use these patches and other patches of non-pedestrian objects to build synthetic images for training.
We further inject expert knowledge about the real data by deciding how the synthetic images are composed and by applying carefully designed transformations to the selected patches. In this manner we obtain a large amount of training data with little effort from the expert.
More specifically, the approach relies on a human expert to identify a few hundred bounding boxes among those annotated correctly by a clustering-based algorithm (cf. Sect. 1). Here we employ a random selection of real depth maps from existing crowd tracking experiments. These are typically the result of the combined output of multi-sensor setups. This yields a set of “overhead human patches” $H$.
Secondly, we ask the human expert to manually extract patches that are not pedestrians. These can be of two types: 1. objects and architectonic elements (such as bags, strollers, carts, tables, and doors), $O$; 2. depth artifacts from sensor errors (noisy “stain-looking” patches in the depth field, counting from a few pixels to a few dozen), $A$.
We combine via Eq. (1) elements randomly extracted from augmented versions of $H$, $O$ and $A$, say $\tilde{H}$, $\tilde{O}$ and $\tilde{A}$. We begin with an empty depth map at VGA resolution (native output resolution of single Kinect sensors), on which we overlay a regular $n \times n$ grid (cf. network output in Sect. 3.1). For each grid tile, we choose, with probability $\pi$, whether to place an additional pedestrian patch chosen randomly from $\tilde{H}$. We assign to the patch centroid a position on the tile surface with uniform probability. Thus, the total number of pedestrians in the depth maps follows a binomial distribution with $n^2$ trials and probability $\pi$. As a second step, we extract patches from $\tilde{O}$ and $\tilde{A}$, which we place at random in the final map.
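The composition step can be sketched as follows. The function draws, for each grid tile, whether to place a pedestrian patch, positions its centroid uniformly on the tile, and merges it into the map with a componentwise minimum as in Eq. (1). All names, the map size, the grid size, the flat stand-in patch and the occupation probability used below are illustrative assumptions; non-pedestrian patches are omitted for brevity.

```python
import numpy as np

def synthesize(patches, grid, shape, p, floor_depth, rng):
    """Compose one synthetic depth map and its per-cell annotations.

    Returns the depth map and a (grid, grid, 5) array holding, per cell,
    centroid row, centroid column, patch height, patch width, occupancy flag.
    """
    depth = np.full(shape, floor_depth, dtype=float)
    labels = np.zeros((grid, grid, 5))
    cell_h, cell_w = shape[0] / grid, shape[1] / grid
    for i in range(grid):
        for j in range(grid):
            if rng.random() >= p:
                continue  # this tile stays empty
            patch = patches[rng.integers(len(patches))]
            ph, pw = patch.shape
            # Centroid drawn uniformly on the tile surface.
            cy = (i + rng.random()) * cell_h
            cx = (j + rng.random()) * cell_w
            top, left = int(round(cy - ph / 2)), int(round(cx - pw / 2))
            # Clip the patch to the map borders.
            r0, c0 = max(top, 0), max(left, 0)
            r1, c1 = min(top + ph, shape[0]), min(left + pw, shape[1])
            sub = patch[r0 - top:r1 - top, c0 - left:c1 - left]
            # Keep, per pixel, the surface closest to the camera.
            depth[r0:r1, c0:c1] = np.minimum(depth[r0:r1, c0:c1], sub)
            labels[i, j] = (cy, cx, ph, pw, 1.0)
    return depth, labels

rng = np.random.default_rng(0)
patch = np.full((4, 4), 2.5)  # flat stand-in for a real pedestrian patch
depth, labels = synthesize([patch], grid=4, shape=(32, 32), p=0.5,
                           floor_depth=4.0, rng=rng)
```

Since each tile is an independent trial, the number of occupied tiles is binomially distributed, and the annotation array comes for free with the image.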
We further apply random transformations to the patches, including rotations, flipping, pixel removal (i.e. pixels are replaced with the floor depth), pixel addition (i.e. pixels are replaced with the median value of the image), and rigid depth translation. Finally, we add Gaussian noise to the produced image as another layer of regularization.
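A simplified sketch of such patch transformations is given below. It is not the exact augmentation pipeline used in this work: rotations are restricted to multiples of 90 degrees, pixel addition is left out, and the parameters `drop_fraction`, `max_shift` and `sigma` are hypothetical.

```python
import numpy as np

def augment(patch, floor_depth, rng, drop_fraction=0.05, max_shift=0.1):
    """Apply a random rotation, flip, pixel removal and depth shift to a patch."""
    # Rotation by a random multiple of 90 degrees (a simplification).
    out = np.rot90(patch, k=int(rng.integers(4)))
    if rng.random() < 0.5:
        out = np.fliplr(out)
    out = out.copy()
    # Pixel removal: a random subset of pixels is replaced with the floor depth.
    out[rng.random(out.shape) < drop_fraction] = floor_depth
    # Rigid depth translation: all non-floor pixels move by a common offset.
    shift = rng.uniform(-max_shift, max_shift)
    out[out < floor_depth] += shift
    return np.minimum(out, floor_depth)

def add_noise(depth, sigma, rng):
    """Add zero-mean Gaussian noise to a whole depth image."""
    return depth + rng.normal(0.0, sigma, size=depth.shape)

rng = np.random.default_rng(0)
patch = np.arange(16, dtype=float).reshape(4, 4) / 10.0  # toy patch
aug = augment(patch, floor_depth=4.0, rng=rng, drop_fraction=0.0)
noisy = add_noise(aug, sigma=0.01, rng=rng)
```

With pixel removal disabled, the transformed patch keeps the same depth range as the original, since rotation and flipping only permute pixels and the depth shift is common to all of them.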
4 Experiment design
We conduct two experiments aimed at comparing the performance of our CNN with a CL algorithm. The experiments differ in the grid size adopted for the CNN, respectively and . Each of our CNNs undergoes a training phase on synthetic data (cf. Sect. 3.2). Then, we expose both the CNN and the considered CL algorithm to further synthetic data (more general and not seen during the training) and we measure their performance.
Performance evaluation. To grade the performance of the algorithms we consider their output on a cell basis. We evaluate the algorithms' precision (i.e. $TP/(TP+FP)$, cf. explanation below) and recall (i.e. $TP/(TP+FN)$). We count a cell measurement as a true positive, $TP$, if the cell is correctly predicted to hold the centroid of a bounding box (let $FP$ and $FN$ denote, respectively, false positives and false negatives). We consider the CNN output to hold a bounding box prediction if . We further score the accuracy of each true positive by computing the intersection over union (IoU) of the predicted and actual bounding boxes. We keep the number of pedestrians as a parameter in the analysis, as we expect it to be a major determinant of performance degradation.
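The cell-based metrics can be written down in a few lines of plain Python. The sketch assumes per-cell boolean occupancy lists and boxes given as (centre x, centre y, width, height); both conventions are assumptions made here for illustration.

```python
def precision_recall(predicted, actual):
    """predicted, actual: equally sized lists of booleans, one per grid cell."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p and a)
    fp = sum(1 for p, a in pairs if p and not a)
    fn = sum(1 for p, a in pairs if a and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def iou(box_a, box_b):
    """Intersection over union of two boxes (x_center, y_center, width, height)."""
    ax0, ax1 = box_a[0] - box_a[2] / 2, box_a[0] + box_a[2] / 2
    ay0, ay1 = box_a[1] - box_a[3] / 2, box_a[1] + box_a[3] / 2
    bx0, bx1 = box_b[0] - box_b[2] / 2, box_b[0] + box_b[2] / 2
    by0, by1 = box_b[1] - box_b[3] / 2, box_b[1] + box_b[3] / 2
    # Overlap extent along each axis (zero if the boxes do not intersect).
    iw = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    ih = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = iw * ih
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0

# Four cells: one hit, one false alarm, one miss, one correct rejection.
precision, recall = precision_recall([True, True, False, False],
                                     [True, False, True, False])
overlap = iou((1.0, 1.0, 2.0, 2.0), (2.0, 1.0, 2.0, 2.0))
```

In the toy case both precision and recall equal 0.5, and two 2-by-2 boxes shifted by half their width overlap with an IoU of one third.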
Data and CNN training. We employ pedestrian patches, as in Fig. 3, extracted from past measurement setups in which the sensor was located approximately meters above the ground. Similarly to a single Kinect operating in these conditions, synthetic depth-field maps cover an area of about m (calculated considering the characteristic field of view of Kinect sensors [9] plus the assumption that individuals of height larger than or equal to m have to be fully resolved, even at the edges of the camera sight cone).
During the training phase, we expose the two networks to synthetic data featuring a cell occupation probability respectively of and . Hence, in case , the network can detect up to pedestrians and, due to the grid constraint, the minimum distance admitted between the centroids of first neighbors is, in the worst case, m. During the training the network is exposed to an average of pedestrians per depth image (average density: ped/m). In case the network can detect up to pedestrians and the minimum distance between the centroids of first neighbors is m (this potentially encompasses people walking hand in hand). During the training the network is exposed to depth images including an average of pedestrians (average density: ped/m). We implemented and trained our networks through the Keras library [4] with the TensorFlow GPU backend. We trained the two networks for a total of 300 epochs, each including random training depth maps and random validation depth maps (batch size: 64).
Clustering algorithm and test data. The CL algorithm we compare with is similar to that employed in [7, 13]. First, foreground blobs are randomly sampled. Then, the samples undergo a complete-linkage hierarchical clustering. The clustering tree is cut at a cutoff height comparable with the average human shoulder size. Finally, the clusters larger than a threshold are retained as pedestrians.
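A baseline of this kind can be sketched with SciPy's hierarchical clustering routines. This is an illustration under stated assumptions rather than the implementation of [7, 13]: the sampled points are taken as planar coordinates in metres, and the values of `cutoff` and `min_size` are hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

def cluster_pedestrians(points, cutoff, min_size):
    """Complete-linkage clustering of sampled foreground points.

    points: (N, 2) array of planar coordinates; cutoff: height at which the
    clustering tree is cut; min_size: minimum number of samples for a cluster
    to be retained as a pedestrian. Returns the centroids of retained clusters.
    """
    tree = linkage(points, method="complete")
    labels = fcluster(tree, t=cutoff, criterion="distance")
    centroids = []
    for label in np.unique(labels):
        members = points[labels == label]
        if len(members) >= min_size:
            centroids.append(members.mean(axis=0))
    return np.array(centroids)

# Two compact blobs of samples plus one stray point (e.g. a depth artifact).
rng = np.random.default_rng(0)
blob_a = rng.normal([0.0, 0.0], 0.03, size=(30, 2))
blob_b = rng.normal([2.0, 0.0], 0.03, size=(30, 2))
stray = np.array([[5.0, 5.0]])
samples = np.vstack([blob_a, blob_b, stray])
centroids = cluster_pedestrians(samples, cutoff=0.5, min_size=5)
```

Well-separated blobs are recovered and the stray point is discarded by the size threshold. Two pedestrians closer than the cutoff would be merged into one cluster, which is the failure mode the CNN is meant to address.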
We compare, employing synthetic data, our two CNNs with one CL algorithm, which has fixed parameters. Synthesized test depth data include between 1 and 20 pedestrians (roughly uniformly distributed), for comparison with the first CNN, and between 1 and 35 for comparison with the second CNN.
5 Results and discussion
In Fig. 4 we report the results of the experiments. Our CNN shows a precision equal (case ) or higher (case ) than the CL approach (Fig. 4(A)), and significantly better results for recall (Fig. 4(B)). For those cells in which a bounding box has been correctly localized, we evaluate the localization accuracy by measuring the IoU coefficient. In this case, the CNN performance is comparable with the CL approach. In Fig. 5 we include samples of synthetic depth field maps from our experiments for visual inspection.
As expected, the CNN delivers higher localization performance than the CL approach. In fact, the CNN succeeds in disentangling neighboring pedestrians and in avoiding non-pedestrian elements (cf. Fig. 5). This results in the higher recall performance. Notably, we leveraged expert knowledge efficiently to extract the patches combined in the synthetic depth maps. As the localization quality is determined by the examples rather than by handcrafted procedures, we expect the method to generalize and transfer across different real-life measurement setups, possibly with the only effort of enriching the patch library with objects characteristic of the specific location.
6 Conclusions
We targeted a high-performance pedestrian localization tool for overhead depth field data capable of running in a real-time setting. Depth data is often employed in real-life pedestrian dynamics research and, so far, hierarchical clustering-based approaches have been mostly used for localization. Here we aimed at bypassing the typical shortcomings of such approaches, leveraging the generalization power of deep learning models.
We presented a convolutional neural network approach showing significantly higher recall performance than clustering methods. This was possible as the network learns the typical shape of individuals. As such, it can disentangle neighboring subjects and diminish false positive outputs (e.g. objects, depth artifacts). To bypass the difficulties imposed by the need for a large number of annotated examples, we developed a procedure producing synthetic training data as a means for efficient delivery of ‘soft’ supervision from an expert.
References
 [1] M. Boltes and A. Seyfried. Collecting pedestrian trajectories. Neurocomputing, 100:127–133, 2013.
 [2] D. Brščić, T. Kanda, T. Ikeda, and T. Miyashita. Person tracking in large public spaces using 3D range sensors. IEEE Trans. Human-Mach. Syst., 43(6):522–534, 2013.
 [3] D. Brščić, F. Zanlungo, and T. Kanda. Density and velocity patterns during one year of pedestrian tracking. Transportation Research Procedia, 2:77 – 86, 2014.
 [4] F. Chollet. Keras. https://github.com/fchollet/keras, 2015.
 [5] A. Corbetta, L. Bruno, A. Muntean, and F. Toschi. High statistics measurements of pedestrian dynamics. Transportation Research Procedia, 2:96–104, 2014.
 [6] A. Corbetta, C. Lee, R. Benzi, A. Muntean, and F. Toschi. Fluctuations around mean walking behaviours in diluted pedestrian flows. Phys. Rev. E, 95:032316, 2017.
 [7] A. Corbetta, J. Meeusen, C. Lee, and F. Toschi. Continuous measurements of real-life bidirectional pedestrian flows on a wide walkway. In Pedestrian and Evacuation Dynamics 2016, pages 18–24. University of Science and Technology of China Press, 2016.
 [8] A. Johansson, D. Helbing, and P. Shukla. Specification of the social force pedestrian model by evolutionary adjustment to video tracking data. Advances in complex systems, 10(supp02):271–288, 2007.
 [9] Microsoft Corp. Kinect for Xbox 360, available online: http://www.xbox.com/enus/kinect/, 2011. Redmond, WA, USA.
 [10] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. You only look once: Unified, real-time object detection. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
 [11] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015.
 [12] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
 [13] S. Seer, N. Brändle, and C. Ratti. Kinects and human kinetics: A new approach for studying pedestrian behavior. Transport. Res. C-Emer., 48:212–228, 2014.
 [14] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
 [15] F. Yu, A. Seff, Y. Zhang, S. Song, T. Funkhouser, and J. Xiao. LSUN: Construction of a large-scale image dataset using deep learning with humans in the loop. arXiv preprint arXiv:1506.03365, 2015.
 [16] F. Zanlungo, T. Ikeda, and T. Kanda. Potential for the dynamics of pedestrians in a socially interacting group. Phys. Rev. E, 89:012811, Jan 2014.