Event-based Dynamic Face Detection and Tracking Based on Activity
We present the first purely event-based approach for face detection using an ATIS, a neuromorphic camera. We look for pairs of blinking eyes by comparing local activity across the input frame to a predefined range, which is defined by the number of events per second in a specific location. If within range, the signal is checked for additional constraints such as duration, synchronicity and distance between the eyes. After a valid blink is registered, we await a second blink in the same spot to initiate Gaussian trackers above the eyes. Based on their position, a bounding box around the estimated outlines of the face is drawn. The face can then be tracked until it is occluded.
The most prevalent group of algorithms for frame-based face detection is based on rigid templates, learned mainly by boosting a large number of low-confidence filters. Probably the best-known example is the Viola-Jones algorithm. Another approach is based on deformable part models (DPMs), which model parts of the face separately. It has been shown that there is still room for improvement for DPMs using improved annotations. The rediscovery of neural networks has enabled state-of-the-art detectors. However, this state-of-the-art face detection relies on computationally intensive processing of static images. This problem is mitigated by dedicated chips such as Intel’s Nervana Neural Network Processor (NNP) or Google’s Tensor Processing Unit (TPU), as well as Apple’s Neural Engine or Samsung’s Exynos 9 on power-constrained phones. These chips have become an essential part of frame-based vision, specialising in executing the matrix multiplications necessary to run neural network inference on each frame as fast as possible.
Rethinking how images are captured, event-based cameras offer a very different approach to recording grey-scale images and seem particularly suitable for embedded systems and robotics. When using a neuromorphic imaging sensor such as the ATIS, rather than sampling the whole input frame with a fixed clock frequency, each pixel acts independently. Pixels only trigger when the amount of incoming light falls below or rises above a certain threshold (Fig. 1). Thus pixels record changes in lighting conditions rather than absolute illumination. The camera then outputs events that correspond to an increase (ON-event) or a decrease (OFF-event) of light. Inherently, no redundant information is captured, which results in significantly lower power consumption. Due to the asynchronous nature and therefore decoupled exposure times of the pixels, temporal resolution is on the order of microseconds and a dynamic range of up to 125 dB can be achieved. The number of generated events depends directly on the activity and lighting conditions of the scene. In addition to the change-detection properties, the ATIS also outputs absolute grey-levels based on an integration time threshold. To compare such cameras with conventional ones, artificial frames can be created by binning absolute grey-level events within desired time windows. We do so for performance comparison with frame-based algorithms only.
In contrast to frame-based methods, which detect a face to locate a region of interest for the eyes, we follow the reverse approach. Event-based cameras are efficient at detecting change, so the image plane is searched for pairs of blinking eyes. To illustrate what happens during an event-based recording of a single blink, Fig. 2 shows different stages of the eyelid closure and opening. If the eye is in a static state, few events are generated (a). The closure of the eyelid happens within 100 ms and generates a substantial number of events, followed by a much slower opening of the eye (c,d).
Overall we take advantage of the fact that adults blink more often than required to keep the surface of the eye lubricated. The reason for this is not entirely clear, but research suggests that blinks are actively involved in the release of attention. Generally, observed eye blinking rates in adults depend on the subject’s activity and level of focus, and are lowest when reading and highest during conversation (Table I). Fatigue significantly influences blinking behaviour, increasing both rate and duration. Typical blink durations decrease with increasing physical workload or increased focus.
| Activity | Blinking rate |
Once two consecutive blinks are detected in the same location, we start tracking the eyes and therefore the face. Two Gaussian blob trackers with a fixed size are initiated above the eyes and follow their every movement. The trackers build on previous work and use a bivariate distribution to describe local activities (so-called ‘blobs’). The blobs can keep track of moving objects that are almost constant in shape.
The algorithm detects blinks by checking whether local activities in the scene are within a certain range. To check for local activity, the overall input frame is divided into a grid of tiles, which results in rectangular tiles for the ATIS resolution of 304 x 240 pixels. This lines up well with the eye’s natural shape. Due to the asynchronous nature of the camera, activities in the different tiles can change independently of each other, depending on the observed scene. Each incoming event updates the activity of its tile as a function of the time elapsed between the current timestamp and the timestamp of the previous event received in the same tile.
Although activities for ON- and OFF-events can be evaluated separately, we only use the sum of both. Based on the magnitude of local activity, potential blink candidates are chosen. One candidate refers to one tile and therefore to one eye. These candidates are then tested against additional constraints for verification. If one blink is registered, we wait for another blink in the same location before starting the tracking. Two Gaussian blob trackers placed at the latest blink locations then keep track of the eyes. The face bounding box is updated by the blob trackers and depends purely on the location of the two eyes. Currently it is only possible to track one pair of eyes at a time, as newly detected blinks re-initiate the trackers.
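The per-tile activity update described above can be sketched as follows. This is a minimal illustration, not the released implementation: it assumes an exponential decay between events (suggested by the decay constant mentioned in the results) and a unit increment per event. The class name `TileActivity`, the 16 x 16 grid and the parameter names are hypothetical.

```python
import math


class TileActivity:
    """Per-tile activity with an assumed exponential decay between events."""

    def __init__(self, width=304, height=240, cols=16, rows=16, tau_us=40_000.0):
        # Grid dimensions are placeholders; the sensor size matches the ATIS.
        self.cols = cols
        self.rows = rows
        self.tile_w = width / cols
        self.tile_h = height / rows
        self.tau_us = tau_us  # decay constant in microseconds
        self.activity = [[0.0] * cols for _ in range(rows)]
        self.last_t = [[None] * cols for _ in range(rows)]

    def update(self, x, y, t_us):
        """Register one event (ON or OFF) and return (row, col, activity)."""
        col = min(int(x / self.tile_w), self.cols - 1)
        row = min(int(y / self.tile_h), self.rows - 1)
        previous = self.last_t[row][col]
        if previous is None:
            decayed = 0.0
        else:
            # Decay the stored activity over the time since the tile's last event.
            elapsed = t_us - previous
            decayed = self.activity[row][col] * math.exp(-elapsed / self.tau_us)
        self.activity[row][col] = decayed + 1.0  # assumed unit increment per event
        self.last_t[row][col] = t_us
        return row, col, self.activity[row][col]
```

Because each tile only stores one activity value and one timestamp, the update costs a constant amount of work per event regardless of how many events the tile has seen.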
The event camera used in this experiment generates a substantial amount of noise even when observing a static scene. As such the first step is to de-noise the camera’s output signal. This is done by observing the spatio-temporal neighbourhood of an incoming event. If there are no corresponding events within a limited time- or pixel-range, the event is discarded as noise.
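A spatio-temporal filter of this kind might look like the sketch below. It is an illustration under assumptions rather than the authors' filter: the neighbourhood radius, the time window and the choice to ignore the event's own pixel are all hypothetical, as is the name `NoiseFilter`.

```python
class NoiseFilter:
    """Discard events that have no recent event in their pixel neighbourhood."""

    def __init__(self, width=304, height=240, window_us=1_000, radius=1):
        # window_us and radius are placeholder values, not taken from the paper.
        self.width = width
        self.height = height
        self.window_us = window_us
        self.radius = radius
        self.last_t = [[None] * width for _ in range(height)]

    def accept(self, x, y, t_us):
        """Return True if a neighbouring pixel fired within the time window."""
        supported = False
        for ny in range(max(0, y - self.radius), min(self.height, y + self.radius + 1)):
            for nx in range(max(0, x - self.radius), min(self.width, x + self.radius + 1)):
                if nx == x and ny == y:
                    continue  # assumption: the pixel itself does not count as support
                previous = self.last_t[ny][nx]
                if previous is not None and t_us - previous <= self.window_us:
                    supported = True
        # Remember this event even if rejected, so it can support later neighbours.
        self.last_t[y][x] = t_us
        return supported
```

Storing the timestamp of rejected events is a design choice: it lets the first event of a genuine burst support the ones that follow, at the cost of dropping that first event.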
2.2 State machine
In order to differentiate blinks from background using local activity, a state machine is used. The three states are Background, Candidate and Clutter. They correspond to the magnitude of activity and are separated from each other by two thresholds (see Fig. 3). Threshold levels are set manually, according to a confidence interval of recorded blinks. Each tile is always in exactly one of the three states. Activity from minor movements and remaining noise will generally stay in the Background state (Fig. 3a). Once a tile receives enough events within a certain time limit to reach activity levels above the Background threshold, the signal can continue in one of two ways. Either the activity increases further and exceeds the Candidate range, which disqualifies the tile as Clutter (Fig. 3c), or the activity falls back below the Candidate range, including a safety margin (Fig. 3b), which suggests that a blink happened and triggers further checks. Two candidates occurring close in time, each corresponding to one eye, have to match multiple criteria. They need to occur in the same tile row, no further than two tiles apart. The central timestamps of the two candidates must lie within a short time window of each other, as humans blink with both eyes simultaneously. Candidates that are too short or too long are rejected. These limits were found experimentally. If two candidates fulfil all the constraints, one blink is registered. After a second complete blink in the same location within 5 s, the trackers are initiated.
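The state machine and the pairing constraints can be sketched as below. This is a hedged illustration: the threshold values, the safety margin, the duration limits and the synchronicity window are left as parameters because they are not given here, and the names `TileStateMachine`, `Candidate` and `is_blink` are hypothetical. The sketch also assumes that a tile leaves the Clutter state once its activity drops back to background level.

```python
from dataclasses import dataclass
from enum import Enum


class State(Enum):
    BACKGROUND = 0
    CANDIDATE = 1
    CLUTTER = 2


@dataclass
class Candidate:
    row: int
    col: int
    start_us: int
    end_us: int

    @property
    def duration_us(self):
        return self.end_us - self.start_us

    @property
    def centre_us(self):
        return (self.start_us + self.end_us) / 2


class TileStateMachine:
    def __init__(self, row, col, lower, upper, margin=0.0):
        self.row, self.col = row, col
        self.lower, self.upper, self.margin = lower, upper, margin
        self.state = State.BACKGROUND
        self.start_us = None

    def step(self, t_us, activity):
        """Feed one activity sample; return a Candidate when one completes."""
        if self.state is State.BACKGROUND:
            if activity > self.upper:
                self.state = State.CLUTTER
            elif activity > self.lower:
                self.state = State.CANDIDATE
                self.start_us = t_us
        elif self.state is State.CANDIDATE:
            if activity > self.upper:
                self.state = State.CLUTTER  # too much activity to be a blink
            elif activity < self.lower - self.margin:
                self.state = State.BACKGROUND
                return Candidate(self.row, self.col, self.start_us, t_us)
        else:  # CLUTTER: wait until the tile calms down, emit nothing
            if activity < self.lower - self.margin:
                self.state = State.BACKGROUND
        return None


def is_blink(a, b, max_offset_us, min_duration_us, max_duration_us, max_tile_gap=2):
    """Check the pairing constraints for two candidates (one per eye)."""
    same_row = a.row == b.row
    # Assumption: the two eyes fall into different tiles, at most two tiles apart.
    close = 0 < abs(a.col - b.col) <= max_tile_gap
    synchronous = abs(a.centre_us - b.centre_us) <= max_offset_us
    plausible = all(min_duration_us <= c.duration_us <= max_duration_us for c in (a, b))
    return same_row and close and synchronous and plausible
```

In use, one `TileStateMachine` per tile would be stepped with the tile's activity, and every completed candidate compared against recent candidates from the same row with `is_blink`.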
2.3 Gaussian tracker
The two trackers for the eyes use a bivariate Gaussian distribution to represent event clouds in terms of blobs. Unlike in the original formulation, blobs are fixed in size, set to a general eye shape, and do not adapt their dispersal to events in the focal plane. Rather, they only follow events close to their current position. As soon as a new event occurs, the likelihood of it being generated by each tracker $i$ is calculated by
$$p_i(\mathbf{u}) = \frac{1}{2\pi\sqrt{|\Sigma_i|}} \exp\left(-\frac{1}{2}(\mathbf{u}-\boldsymbol{\mu}_i)^{\top}\Sigma_i^{-1}(\mathbf{u}-\boldsymbol{\mu}_i)\right)$$
where $\mathbf{u}$ is the pixel location of the event, $\boldsymbol{\mu}_i$ is the current position of tracker $i$ and $\Sigma_i$ is its covariance matrix. The covariance matrix is determined when the tracker is initiated and stays the same until the end of the recording. The tracker with the highest probability is updated, provided that probability exceeds a specific threshold value. A circular bounding box for the face is drawn based on the horizontal distance between the two eye trackers.
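The tracker update can be illustrated with the sketch below, which evaluates the standard bivariate Gaussian density for each tracker and moves only the most likely one. The covariance values, the probability threshold and the position update (a fixed-rate step towards the event, controlled by the hypothetical `mixing` parameter) are assumptions for illustration, not values or rules taken from the paper.

```python
import math


class BlobTracker:
    """Fixed-size Gaussian blob; the covariance never changes after creation."""

    def __init__(self, x, y, sxx=30.0, sxy=0.0, syy=12.0, mixing=0.1):
        # sxx, sxy, syy form the 2x2 covariance matrix (placeholder eye shape).
        self.x, self.y = float(x), float(y)
        self.sxx, self.sxy, self.syy = sxx, sxy, syy
        self.mixing = mixing  # assumed step size towards each assigned event

    def probability(self, ex, ey):
        """Bivariate Gaussian density of an event at (ex, ey)."""
        det = self.sxx * self.syy - self.sxy ** 2
        dx, dy = ex - self.x, ey - self.y
        # Squared Mahalanobis distance using the closed-form 2x2 inverse.
        mahalanobis = (self.syy * dx * dx - 2.0 * self.sxy * dx * dy + self.sxx * dy * dy) / det
        return math.exp(-0.5 * mahalanobis) / (2.0 * math.pi * math.sqrt(det))

    def move_towards(self, ex, ey):
        self.x += self.mixing * (ex - self.x)
        self.y += self.mixing * (ey - self.y)


def update_trackers(trackers, ex, ey, threshold):
    """Assign the event to the most likely tracker; return its index or None."""
    probabilities = [tracker.probability(ex, ey) for tracker in trackers]
    best = max(range(len(trackers)), key=lambda i: probabilities[i])
    if probabilities[best] <= threshold:
        return None  # the event is too far from every tracker and is ignored
    trackers[best].move_towards(ex, ey)
    return best
```

Since the position is a running average of event coordinates, it is not restricted to integer pixel locations.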
We evaluated the algorithm’s performance on a total of 23 recordings from 9 different people. The software implementation of the algorithm is released as a part of Tarsier, an open-source framework for event-based vision written in C++ (available on GitHub: https://github.com/neuromorphic-paris/tarsier), and runs in real time on an Intel Core i5-7200U CPU.
3.1 Data recording
As blinking rates are highest during rest or conversation, subjects seated in a chair in front of the camera were instructed not to focus on anything in particular and to gaze in a general direction. After an initial count to 10, they were asked to lean from side to side every 10 seconds in order to vary their face position. We used a lamp to stabilise lighting conditions.
In order to compare performance to ground truth (GT) obtained from a frame-based algorithm, we binned the grey-level representation of the event-based recordings every 40 ms, resulting in conventional videos at 25 fps with the same resolution of 304 x 240 pixels.
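Binning grey-level events into 40 ms frames can be sketched as follows. The event layout `(t_us, x, y, grey)` and the function name are assumptions made for this example, as is the choice that a pixel keeps its last grey level until a new event overwrites it.

```python
def bin_into_frames(events, width=304, height=240, period_us=40_000):
    """Yield (frame_end_us, frame) pairs from time-sorted grey-level events.

    events: iterable of (t_us, x, y, grey) tuples, sorted by timestamp.
    A period of 40 000 us corresponds to 25 frames per second.
    """
    frame = [[0] * width for _ in range(height)]
    frame_end = period_us
    for t_us, x, y, grey in events:
        # Emit every frame that closed before this event arrived.
        while t_us >= frame_end:
            yield frame_end, [row[:] for row in frame]
            frame_end += period_us
        frame[y][x] = grey  # pixels keep their last value across frames
    # Emit the frame that was still being filled when the events ran out.
    yield frame_end, [row[:] for row in frame]
```

Frames are copied before being yielded so that later events do not modify frames already handed to the caller.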
For the ATIS used in this experiment, the de-noising filter uses a fixed time constant. Events that occur temporally isolated within the corresponding sliding window are classified as noise. On average, 55 % of incoming events were discarded for a scene with slow movements in front of the camera. This significantly lowers the computational load.
3.3 Blink detection and face tracking
Threshold levels of the state machine are set according to observed values for local activities of manually annotated recorded blinks. A sample size of 39 blinks resulted in an activity distribution of for a decay constant of 40 ms. Lower and upper thresholds were set three standard deviations away from the mean and then tweaked for each subject depending on lighting conditions. On average, 40.0 ± 19.9 % of blinks are successfully detected in recordings. A detection counts as a true positive if it lies within a 5-pixel range. Across all recordings, there are of false positives. Fig. 4 shows tracking data for one recording. Our algorithm starts tracking as soon as two consecutive blinks are registered (a). GT is obtained by running an OpenCV implementation of the Viola-Jones algorithm on videos converted from grey-levels. Whereas the update rate of the frame-based implementation is constant (25 fps), our algorithm is updated depending on the amount of movement in the scene. If the subject stays still, computation is drastically reduced as there is a significantly lower number of events. Head movement causes the tracker to update within (b), changing its location in sub-pixel range. Average tracking error across all recordings is in the x direction and in the y direction.
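Deriving initial thresholds from annotated blinks amounts to a few lines. The sketch below assumes that one peak activity value is available per annotated blink and uses the sample standard deviation; both the function name and the input values in the usage note are hypothetical, and the per-subject tweaking described above is not modelled.

```python
import statistics


def activity_thresholds(peak_activities, k=3.0):
    """Return (lower, upper) thresholds at k standard deviations from the mean."""
    mean = statistics.mean(peak_activities)
    std = statistics.stdev(peak_activities)  # sample standard deviation
    lower = max(0.0, mean - k * std)  # activity cannot be negative
    upper = mean + k * std
    return lower, upper
```

For example, `activity_thresholds([10, 12, 14])` uses a mean of 12 and a standard deviation of 2, which places the thresholds at 6 and 18.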
The presented method for face detection and tracking relies solely on blinks. It is limited to a certain scale and orientation of the face, but benefits from high temporal resolution in the microsecond range and from redundancy suppression due to its event-based nature. It is important to mention that as soon as subjects observe a live recording of themselves on a display (or any other point of focus), their blinking rate drops significantly. If subjects expect something to happen, they will unconsciously adapt their blinking behaviour to environmental regularities in order to better detect future events.
Registering blinks of people who wear glasses or hats poses no substantial difficulty, as long as the eyes are not shadowed in any way. During tracking, however, the proximity of such objects to an eye might disrupt the eye tracker, as even partial occlusion of the eye severely distorts the tracker. This could be mitigated by adding facial features such as the mouth to the overall representation and linking them to form a stable part-based model, as done in previous work. Once initiated, the tracker could then more easily maintain the distances between parts of the face. This would also allow for greater variation in pose.
As seen from the activity profile in Fig. 2, we actually only use the closure of the eye lid for our detection, as the OFF-events of the eye lid opening are too dispersed in time and do not generate a distinguishable activity profile. The magnitude of blinking activity depends directly on the lighting conditions in the scene, hence the great variance in a typical activity profile. Our algorithm relies on the subject staying still in order to differentiate blinking from non-blinking behaviour. The inability to pick up blinks while a person moves results in a conservative number of blinks detected.
The local activity we calculate is essentially a dimensionality reduction from a 2D area, which greatly simplifies computation and allows this algorithm to be used on power-constrained devices. An absolute benchmark in terms of power consumption, however, is non-trivial. It would be possible to compare power consumption to another (frame-based) algorithm, but this is again highly dependent on the hardware used. In terms of algorithm efficiency, only the blink detection is active at the beginning of a recording. Once trackers are initiated, the blink detection stays active and re-initiates the trackers when new blinks are detected. Disabling the blink detection would leave no way to recover from a misguided tracker. This implies that only one pair of eyes can be tracked at any point in time.
Unlike frame-based tracking, the location of a blob tracker is not bound to a specific pixel, but rather to the occurrence of events in a given neighbourhood. It can therefore take any calculated location between pixels and provide high spatial and temporal accuracy, given a sufficient number of incoming events. In theory, a temporal tracking resolution on the order of microseconds makes it possible to track faces at previously unattainable speeds.
In order to increase robustness and overcome some of the constraints, one could add descriptors such as optical flow to verify a real blink. Normalising the activity after an initial calibration step would also provide better robustness to different lighting conditions. Eventually one might have to move to learned representations of blinks. Neuromorphic vision cannot rely on the same state-of-the-art neural networks as classical vision, but has recently spawned the ideas of time surfaces and log-polar grids as descriptors. For training purposes this will require a comprehensive data set of annotated blinks.
The authors would like to thank Omar Oubari for his valuable feedback.
-  Simone Benedetto, Marco Pedrotti, Luca Minin, Thierry Baccino, Alessandra Re, and Roberto Montanari. Driver workload and eye blink duration. Transportation Research Part F: Traffic Psychology and Behaviour, 14(3):199–208, May 2011.
-  Anna Rita Bentivoglio, Susan B. Bressman, Emanuele Cassetta, Donatella Carretta, Pietro Tonali, and Alberto Albanese. Analysis of blink rate patterns in normal subjects. Movement Disorders, 12(6):1028–1034, 1997.
-  Tobi Delbrück, Bernabe Linares-Barranco, Eugenio Culurciello, and Christoph Posch. Activity-driven, event-based vision sensors. ISCAS 2010 - 2010 IEEE International Symposium on Circuits and Systems: Nano-Bio Circuit Fabrics and Systems, pages 2426–2429, 2010.
-  David Hoppe, Stefan Helfmann, and Constantin A. Rothkopf. Humans quickly learn to blink strategically in response to environmental task demands. Proceedings of the National Academy of Sciences, page 201714220, February 2018.
-  Vidit Jain and Erik Learned-Miller. Fddb: A benchmark for face detection in unconstrained settings. Technical report, Technical Report UM-CS-2010-009, University of Massachusetts, Amherst, 2010.
-  Huaizu Jiang and Erik Learned-Miller. Face Detection with the Faster R-CNN. pages 650–657. IEEE, May 2017.
-  John A. Stern, Donna Boyer, and David Schroeder. Blink Rate: A Possible Measure of Fatigue. Human Factors, 36(2), 1994.
-  Xavier Lagorce, Cedric Meyer, Sio-Hoi Ieng, David Filliat, and Ryad Benosman. Asynchronous Event-Based Multikernel Algorithm for High-Speed Visual Features Tracking. IEEE Transactions on Neural Networks and Learning Systems, 26(8):1710–1720, August 2015.
-  Xavier Lagorce, Garrick Orchard, Francesco Gallupi, Bertram E. Shi, and Ryad Benosman. HOTS: A Hierarchy Of event-based Time-Surfaces for pattern recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 8828(c):1–1, 2016.
-  Patrick Lichtsteiner, Christoph Posch, and Tobi Delbruck. A 128×128 120 dB 15 µs Latency Asynchronous Temporal Contrast Vision Sensor. IEEE Journal of Solid-State Circuits, 43(2):566–576, 2008.
-  Markus Mathias, Rodrigo Benenson, Marco Pedersoli, and Luc Van Gool. Face detection without bells and whistles. In European Conference on Computer Vision, pages 720–735. Springer, 2014.
-  Tamami Nakano, Makoto Kato, Yusuke Morito, Seishi Itoi, and Shigeru Kitazawa. Blink-related momentary activation of the default mode network while viewing videos. Proceedings of the National Academy of Sciences, 110(2):702–706, 2013.
-  Garrick Orchard, Ajinkya Jayawant, Gregory K. Cohen, and Nitish Thakor. Converting Static Image Datasets to Spiking Neuromorphic Datasets Using Saccades. Frontiers in Neuroscience, 9, November 2015.
-  Bharath Ramesh, Hong Yang, Garrick Orchard, Ngoc Anh Le Thi, and Cheng Xiang. DART: Distribution Aware Retinal Transform for Event-based Cameras. arXiv preprint arXiv:1710.10800, 2017.
-  David Reverter Valeiras, Xavier Lagorce, Xavier Clady, Chiara Bartolozzi, Sio-Hoi Ieng, and Ryad Benosman. An Asynchronous Neuromorphic Event-Driven Visual Part-Based Shape Tracking. IEEE Transactions on Neural Networks and Learning Systems, 26(12):3045–3059, December 2015.
-  Paul Viola and Michael J. Jones. Robust real-time face detection. International journal of computer vision, 57(2):137–154, 2004.
-  Qiong Wang, Jingyu Yang, Mingwu Ren, and Yujie Zheng. Driver fatigue detection: A survey. In Intelligent Control and Automation, 2006. WCICA 2006. The Sixth World Congress On, volume 2, pages 8587–8591. IEEE, 2006.
-  Shuo Yang, Ping Luo, Chen-Change Loy, and Xiaoou Tang. Wider face: A face detection benchmark. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5525–5533, 2016.
-  Shuo Yang, Ping Luo, Chen Change Loy, and Xiaoou Tang. Faceness-Net: Face Detection through Deep Facial Part Responses. arXiv preprint arXiv:1701.08393, 2017.
-  Stefanos Zafeiriou, Cha Zhang, and Zhengyou Zhang. A survey on face detection in the wild: Past, present and future. Computer Vision and Image Understanding, 138:1–24, September 2015.
-  Xiangxin Zhu and Deva Ramanan. Face detection, pose estimation, and landmark localization in the wild. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference On, pages 2879–2886. IEEE, 2012.