Remote Pulse Estimation in the Presence of Face Masks


Remote photoplethysmography (rPPG) is a family of techniques for monitoring blood volume changes with a camera. It may be especially useful for widespread contactless health monitoring when used to analyze face video from consumer-grade visible-light cameras. The COVID-19 pandemic has led to the widespread use of protective face masks to prevent virus transmission. We found that occlusions from face masks affect face video-based rPPG: the mean absolute error of pulse rate estimation nearly doubles when the face is partially occluded by a protective mask. To our knowledge, this paper is the first to analyze the impact of face masks on the accuracy of blood volume pulse estimation, and it offers several novel elements: (a) two publicly available pulse estimation datasets acquired from 86 unmasked and 61 masked subjects, (b) evaluations of handcrafted algorithms and a 3D convolutional neural network trained on videos of full (unmasked) faces and on videos with synthetically generated masks, and (c) a data augmentation method (a generator that adds a synthetic mask to a face video). Our findings help identify how face masks degrade the accuracy of face video analysis, and we discuss paths toward more robust pulse estimation in their presence. The datasets and source code of all proposed methods are available along with this paper.

1 Introduction

Computer vision techniques often demonstrate capabilities that are beyond those of humans. One such task is remote photoplethysmography (rPPG): estimating blood volume changes from videos acquired away from the skin’s surface. Remote pulse detection is especially useful in settings where health diagnostics are desired, but using contact sensors is infeasible, presents some risk, or professional sensors (e.g., pulse oximeters) are not available. COVID-19 is one such scenario, where extracting cardiac diagnostics without surface contact mitigates the risk of viral transmission through contact medical sensors and can potentially allow for ubiquitous health monitoring at a critical time.

The widespread adoption of face masks to prevent the spread of COVID-19 has caused significant problems for existing technologies that assume an unobstructed view of the face [15]. Nearly all recent algorithms extract the signal from the face [2, 4, 18, 25, 28], sometimes even limiting the analyzed region to the cheeks [11, 23], a region that is generally occluded by a mask. This methodological choice exposes these algorithms to three risks when a mask is worn: unreliable region selection, unreliable skin detection, and a smaller visible skin region, which increases the risk that noise contaminates the signal. To the authors’ knowledge, this paper is the first to explore the effects of face masks on the performance of remote pulse detection algorithms.

While early contactless pulse estimation algorithms used hand-crafted features in both the temporal and spatial domains, more recent works have shown that convolutional neural networks (CNN) fed with spatiotemporal representations may outperform handcrafted approaches in rPPG [2, 28, 17]. To accommodate the research community’s need for large-scale realistic physiological datasets, we present two new datasets. The first collection was recorded prior to widespread mask use from COVID-19 from 86 unmasked subjects in an interview setting with natural conversation. This dataset is used for designing and fine-tuning all the methods considered in this work. The second database, collected in the same operational setting from 61 masked subjects, is designed for accurate subject-disjoint evaluation of the effects of masks on estimation of blood pulse waveforms.

Figure 1: Training and inference pipeline for the spatiotemporal modeling task of remote pulse estimation. Raw RGB frames are landmarked and cropped, which are then either fed directly into the 3DCNN or a synthetic mask is added. Multiple frame sequences are overlap-added [4] to produce the full pulse waveform.

The main novel contributions of this work are (1) the first face video dataset of masked individuals for remote physiological monitoring, and (2) the first systematic experimental analysis of the effects of face masks on remote pulse estimation performance, answering three questions:

  1. Is accurate pulse rate estimation possible on subjects wearing masks?

  2. If the answer to (Q1) is affirmative, does including face videos with synthetic masks in training improve performance on videos of subjects wearing actual masks?

  3. What adaptations to existing rPPG methods are useful for fine-tuning them to the new COVID-19 reality?

2 Background and Related Work

Remote photoplethysmography is the process of estimating the blood volume pulse (BVP) by observing changes in the reflected light from the skin. Microvasculature beneath the skin’s surface fills with blood, which changes reflected color due to the light absorption of hemoglobin. In practice, designing the algorithms for converting video data to pulse waveforms is difficult. The color changes attributed to blood volume are subtle and may be obscured by variations due to factors such as illumination changes and body movements. The problem is further compounded by the addition of face masks, due to decreased surface area to detect the pulse, and consequently, decreased sample size for signal extraction.

Due to the difficulty of extracting the pulse signal from the optical signal, early studies began with stationary subjects and manually selected regions of the skin [27, 24]. An early advance was made by Poh et al. [19, 18] when they applied blind source separation through independent component analysis (ICA) to the red, green, and blue color channels. Several advancements combined color channels in meaningful ways to locate the pulse signal [4, 5, 26, 25]. The first approach (CHROM) considered the chrominance signal, which was agnostic to illumination changes and robust to movement [4]. Later improvements relaxed assumptions on the distortion signals from movement [5], and examined rotation of the skin pixels’ subspace [26]. Lastly, [25] introduced the plane orthogonal to skin (POS) algorithm, which defines and updates a projection direction for separating the specular and pulse components.

Until Li et al. [11] designed an effective pulse detector on the MAHNOB-HCI database [21], many approaches had been designed and tested on relatively small private datasets. Once the public MAHNOB-HCI dataset was in use, many groups were able to compare their estimators [11, 23, 2, 28], which spurred the creation of more publicly available datasets such as MMSE-HR [23] and VIPL-HR [16]. The increased size of pulse detection datasets made it possible to train deep neural networks for the task. The first deep learning approach [10] trained a regression model on ICA and chrominance features.

Later, deep learning models for rPPG were trained on the spatial [9, 2, 17] and spatiotemporal [28] dimensions of the video rather than extracted temporal features alone. Hsu et al. [9] trained VGG-15 on images of the frequency representation of the averaged color signal to predict heart rate. Chen et al. [2] took inspiration from two-stream networks [20] and simultaneously fed frame differences and raw frames to a two-stream CNN, predicting the derivative of the waveform. A recent approach extracted spatial-temporal maps from a grid of regions over the face, and fed each averaged region into ResNet-18 followed by a single gated recurrent unit (GRU) to predict a single heart rate [17].

Yu et al. [28] constructed a 3DCNN which was given video clips and minimized the negative Pearson correlation between the ground truth waveform and the output waveform. The main advantage of their spatiotemporal network over the networks in [10, 9, 17] is its capability of producing a waveform, rather than a single value for the signal’s frequency. They predict several cardiac metrics such as the respiratory frequency and heart rate variability. PPG waveforms have also been shown to be useful for predicting blood pressure [13]. Due to this advantage, we design a similar 3DCNN architecture with modifications to the temporal dimensions of the spatiotemporal kernels, such that longer-range time dependencies can be captured.

While rPPG has been used for many applications, such as presentation attack detection [8, 12], where the goal is to distinguish between no pulse detected (presentation attack) and pulse detected (live), our goal is to determine how accurately the pulse rate can be estimated from a known live face wearing a mask. To our knowledge, face occlusions have not previously been explored in rPPG. Since face detection and region selection are required steps in the rPPG pipeline for most approaches, the evaluation of masked pulse detection is critical if the technology is ever to be used for remote health monitoring in situations where face masks are required.

3 Datasets

Figure 2: Frames from all pulse databases used throughout this paper are shown for the unmasked, synthetically masked, and masked videos. Patterns for the synthetic masks are randomly sampled from the Describable Textures Dataset [3].

We present two new datasets for remote physiological monitoring. The first dataset is intended for deception detection and physiological monitoring from face video during conversation. It was recorded prior to the emergence of COVID-19, so all subjects are unmasked. The second dataset was recently collected to assess remote pulse detection algorithms in the presence of face masks following mask mandates around the globe. Both datasets were collected from consenting subjects under a human subjects research protocol approved by the authors’ Human Subjects Institutional Review Board.

Both datasets were recorded in the same setting with subjects seated approximately 1 to 2 meters from the RGB optical sensor. The ground truth heart rate, blood oxygenation, and blood volume pulse waveforms were collected by the Contec CMS50EA finger oximeter recording at 60 Hz. The RGB videos were recorded with pixels at 90 frames per second (fps) by TheImagingSource DFK 33UX290 camera. Videos were losslessly compressed with H.264 encoding using a constant rate factor of 0 to retain all raw video data and avoid damaging the optical pulse signal.

DDPM Dataset.

A total of 86 sets of recordings, each nearly 11 minutes in length, were collected to create the Deception Detection and Physiological Monitoring (DDPM) dataset. During the recording, a paid actress conducted an interview consisting of 24 questions. The subject was instructed beforehand to answer particular questions truthfully or deceptively. Subjects were free to complete the interview without constraints on motion, facial expressions, and talking, which accurately represents scenarios for unmasked pulse detection in the wild. The act of deception also introduced variability in the pulse rate itself. Such variability is rarely observed in rPPG datasets, and thus overall, the dataset’s size and collection setting make it unique.

DDPM-Mask Dataset.

We augment the DDPM dataset with synthetic face masks, selecting a subset of landmarks to define the occluded face region, to create the DDPM-Mask corpus. We use the same set of landmarks to define a wide, medium coverage mask, as selected in [15]. Along with a set of black masks without texture, we added patterned masks by randomly selecting images from the Describable Textures Dataset (DTD) [3] and overlaying the image onto the 2D mask. With head rotation and translation, the pattern must be transformed to cover the same portions of the masked region. We first resized the pattern to the same dimensions as the input frames to the 3DCNN (64x64 pixels). Then we randomly translated the pattern image such that the face landmarks for the first frame of the sequence were still within the pattern. Using these landmark points as anchors on the pattern image, we estimated the similarity transformation (rotation, translation, and scaling) from the anchor landmarks to the face landmarks in every following frame of the sequence, then applied the transformations to the pattern image before adding the masked region to the face frames. The second column of Fig. 1 illustrates a patterned synthetic mask added to the DDPM dataset over a sequence of frames.
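The per-frame similarity estimate described above can be sketched as a closed-form least-squares fit. This is a minimal illustration, not the authors' code: the function name `estimate_similarity` and the complex-number formulation are assumptions chosen for brevity, and the landmark arrays are placeholders.

```python
import numpy as np

def estimate_similarity(src, dst):
    """Least-squares similarity transform (rotation, scale, translation).

    src, dst: arrays of shape (N, 2) with matching 2D points, e.g. the
    anchor landmarks on the pattern (src) and the face landmarks in a
    later frame (dst). Returns a 2x3 matrix M = [sR | t] such that
    dst is approximately src @ M[:, :2].T + M[:, 2].
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    # Treat 2D points as complex numbers: a similarity is w = a * z + b,
    # where a encodes scale and rotation and b encodes translation.
    z = src[:, 0] + 1j * src[:, 1]
    w = dst[:, 0] + 1j * dst[:, 1]
    zc, wc = z - z.mean(), w - w.mean()
    a = np.vdot(zc, wc) / np.vdot(zc, zc)  # np.vdot conjugates its first argument
    b = w.mean() - a * z.mean()
    return np.array([[a.real, -a.imag, b.real],
                     [a.imag,  a.real, b.imag]])

# Example: recover a known transform from five hypothetical landmarks.
src = np.array([[0.0, 0.0], [10.0, 0.0], [10.0, 5.0], [0.0, 5.0], [3.0, 7.0]])
scale, theta = 1.2, 0.3
R = scale * np.array([[np.cos(theta), -np.sin(theta)],
                      [np.sin(theta),  np.cos(theta)]])
dst = src @ R.T + np.array([5.0, -3.0])
M = estimate_similarity(src, dst)
```

A 2x3 matrix of this form is the usual input to affine image-warping routines, which could then be used to warp the pattern image before compositing it into the masked region.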

MPM Dataset.

We collected a new Masked Physiological Monitoring (MPM) video dataset for remote physiological monitoring of masked subjects, not participating in collection of the DDPM set (thus, MPM and DDPM are subject-disjoint). A plexiglass screen was placed between acquisition personnel and the subject to reduce COVID-19 transmission risk through airborne particles. Subjects were asked to bring 3 different face masks to increase the variability in color, texture, and shape. Masks were provided if they did not bring them. We captured 61 subjects over 3 different sessions, where the participant wore a different mask in each recording. Poorly lit videos were removed resulting in a total of 170 usable recordings. Recordings averaged over 3 minutes in length per video, giving us nearly 9 hours of recorded data. The reliability of the ground truth physiological signals was improved by using two Contec CMS50EA oximeters. The oximeters were initially placed on the index fingers of both hands but were moved to the subject’s thumbs if a reliable signal was not initially detected.

To make the collected data as realistic as possible, we divided each session into three different tasks: (a) natural conversation with free head movement, (b) directed head movement, and (c) frontal view without head movement. The natural conversational task consisted of sustained interaction with an acquisition worker for 2 minutes. In general, the task took slightly longer than 2 minutes, since the conversation was not stopped abruptly. The directed head movement task aimed to stress the pulse detection algorithms by adding non-frontal gaze and head motion. Subjects were directed to look at a total of 6 different targets for approximately 5 seconds each, resulting in a 30 second interval. The final task consisted of the subject maintaining frontal gaze and avoiding movement or talking for 30 seconds.

4 Approach

We model the pulse prediction task as a regression problem with the blood volume waveform from the oximeter as the target. This task differs from popular action recognition tasks in the resolution of the prediction: in action recognition, a single class is predicted for a sequence of images, whereas our task generates a real value for every image in a sequence. We deploy a 3DCNN architecture on frame sequences cropped from the original video to contain a face only. Figure 1 shows the frame and waveform preprocessing in the training pipeline, along with example output waveforms. The following sections describe the model architecture and pipeline used to prepare the videos and target waveforms.

4.1 3DCNN Architecture

We select the 3DCNN as the spatiotemporal architecture to learn the relation between frame sequences and cardiac waveform. We use a similar spatial architecture to the PhysNet-3DCNN architecture [28], but modify the temporal dimensions of the kernels to capture longer time dependencies and help filter out high-frequency noise.

The 3DCNN was selected for three reasons. First, it is capable of producing a high-resolution blood volume pulse waveform, not only selected statistics such as heart rate. Second, the 3DCNN is an end-to-end model, capable of learning from the raw image sequences. Lastly, the remote pulse detection task benefits from the joint learning of spatiotemporal features, rather than separating the dimensions and learning them independently. Illumination changes from non-rigid movements (e.g., smiling and talking) as well as rigid head motion affect the temporal output of the signal, and a better spatial model likely cannot compensate for the adjustments without some knowledge of the waveform. Due to the difficulty of the problem, several preprocessing steps were necessary prior to training.

4.2 Video Preprocessing

To make the rPPG task easier for the model, we cropped the face region from all frames in the videos. We used the OpenFace toolkit [1] to detect 68 facial landmarks as a basis for defining the bounding box. OpenFace was used rather than simpler face detection models due to the stability of the landmark locations between adjacent frames. Detectors with less emphasis on fine-grained facial features add jitter to the bounding boxes over time, which in turn adds noise to the rPPG signal. Additionally, the face landmarks gave us keypoints to approximate the shape and location of a synthetic mask.

To create a bounding box from the landmarks, we found the minimum and maximum locations of the landmarks. After selecting the extreme landmarks as the crop bounds, we extended the crop horizontally by 5% to ensure that the cheeks and jaw were present. The top and bottom were extended by 30% and 5% of the bounding box height, respectively, to include the forehead and jaw. From the extended bounding box, we further extended the shorter of the two axes to the length of the other to form a square.
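The crop construction above can be written in a few lines. The sketch below is illustrative only: the function name is hypothetical, and it assumes that the 5% horizontal extension is applied to each side as a fraction of the box width, which the text does not state explicitly. It also does not clamp the box to the image bounds.

```python
import numpy as np

def square_face_crop(landmarks, side_pad=0.05, top_pad=0.30, bottom_pad=0.05):
    """Return (x0, y0, x1, y1) of a square crop around 2D face landmarks.

    landmarks: array of shape (N, 2) with (x, y) pixel coordinates,
    where y grows downward as in image coordinates.
    """
    pts = np.asarray(landmarks, dtype=float)
    x0, y0 = pts.min(axis=0)
    x1, y1 = pts.max(axis=0)
    w, h = x1 - x0, y1 - y0

    # Assumption: pad each side by 5% of the width; pad the top by 30%
    # and the bottom by 5% of the height (forehead and jaw).
    x0, x1 = x0 - side_pad * w, x1 + side_pad * w
    y0, y1 = y0 - top_pad * h, y1 + bottom_pad * h

    # Extend the shorter axis symmetrically about its centre to form a square.
    side = max(x1 - x0, y1 - y0)
    cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0
    return (cx - side / 2.0, cy - side / 2.0, cx + side / 2.0, cy + side / 2.0)

# Example with four hypothetical landmarks spanning a 100x100 region.
box = square_face_crop([[100, 100], [200, 100], [100, 200], [200, 200]])
```

In this example the padded box is 110 pixels wide and 135 pixels tall, so the width is grown to 135 pixels to make the crop square.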

Considering the massive number of frames and the high resolution of the acquired videos, the cropped frames occupied a large amount of space. Taking insight from a recent rPPG study on the effect of image resolution for CNNs [29], we downsized the cropped region to 64x64 pixels with bicubic interpolation. During training and evaluation, the model is given clips of the video consisting of 135 frames (1.5 seconds at 90 fps). We selected this as the minimum length of time in which an entire heartbeat would occur, considering 40 beats per minute (bpm) as a lower bound for healthy subjects.

4.3 Physiological Signal Preprocessing

The oximeters recorded ground truth waveform values at 60 Hz, which differed from the native 90 fps of the videos. Since our task requires a waveform label for every frame, we upsampled the ground truth waveform with cubic interpolation to the video timestamps for both DDPM and MPM collected datasets.

For training, phase differences between the pulse signal observed from the oximeter and the face present challenges. The relative phase of the blood volume pulse and the oximeter pulse is a function of both the subject’s physiological structure and time lags from the acquisition apparatus. Zhan et al. [29] recently showed that a phase shift in the label when training a CNN dramatically reduces performance. To mitigate this issue, we applied the CHROM pulse detector [4], which is known to give reliable performance, to extract a reference waveform from the face to estimate the offset, and corrected the phase for the oximeter waveform as shown in Fig. 3.

Figure 3: Ground truth waveform from the oximeter along with CHROM’s prediction from the face region. Physiological and apparatus properties contribute to a phase shift between the two signals.

Given the finger and face waveforms, we calculated the cross-correlation between these signals with a sliding window of 10 seconds. A sliding window was used rather than the entire signals to allow for detrending [22] and normalization to minimize low-frequency differences between the two signals. Next, all windows were summed, and the location with a maximum sum within 1 second of lag was selected as the relative shift. All phase delays in the DDPM training set were found to be less than 0.4 seconds.
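The windowed lag search described above might look like the following sketch. Several details are assumptions rather than the authors' choices: the windows are taken as non-overlapping, simple mean removal and unit-variance scaling stand in for the detrending method of [22], and the function name `estimate_lag` is hypothetical.

```python
import numpy as np

def estimate_lag(face, finger, fs=90, win_s=10.0, max_lag_s=1.0):
    """Estimate how many samples `finger` is delayed relative to `face`.

    A positive result means the finger (oximeter) waveform trails the
    face waveform. Normalised cross-correlations are computed per
    window, summed over windows, and the best lag within +/- max_lag_s
    is returned.
    """
    face = np.asarray(face, dtype=float)
    finger = np.asarray(finger, dtype=float)
    win, max_lag = int(win_s * fs), int(max_lag_s * fs)
    lags = np.arange(-max_lag, max_lag + 1)
    total = np.zeros(len(lags))

    # Assumption: non-overlapping windows; start far enough from the
    # edges that every shifted window stays inside the signal.
    for start in range(max_lag, len(face) - win - max_lag + 1, win):
        a = face[start:start + win]
        a = (a - a.mean()) / (a.std() + 1e-8)
        for k, lag in enumerate(lags):
            b = finger[start + lag:start + lag + win]
            b = (b - b.mean()) / (b.std() + 1e-8)
            total[k] += np.dot(a, b) / win
    return int(lags[np.argmax(total)])

# Example: a synthetic reference signal delayed by 20 samples.
rng = np.random.default_rng(0)
face_wave = rng.standard_normal(3600)   # 40 s at 90 fps
finger_wave = np.roll(face_wave, 20)    # finger trails face by 20 samples
lag = estimate_lag(face_wave, finger_wave)
```

Once the lag is known, the oximeter waveform can be shifted by that many samples so that the training target is in phase with the facial pulse.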

We learned that finger oximeters are not entirely robust to motion, and the extracted waveform values contain errors, although they are infrequent. Additionally, within-device signal processing does not produce constant amplitude waveforms, and they may contain strong low frequency components. To accommodate the amplitude inconsistencies, we scale the target waveform within each clip to real values in [0,1].

4.4 Video Augmentation

We augment the input data to improve robustness and avoid overfitting. The first augmentation is horizontal flipping with 50% probability. Next, we simulate illumination changes by increasing or decreasing the pixel intensities of the whole image by a random amount with a mean of zero and a standard deviation of 10, operating on unsigned integer images with values between 0 and 255. Finally, we add pixel-wise Gaussian noise centered at 0 with a standard deviation of 2. The image values are subsequently scaled to floating point values between 0 and 1. We augment every frame within each video clip in the same manner.
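A minimal sketch of this augmentation chain is shown below. It assumes that the flip decision and the brightness offset are sampled once per clip (so that every frame is treated alike), that the brightness offset is Gaussian, that pixel noise is drawn independently per pixel and frame, and that values are clipped to the valid range before scaling; none of these details are confirmed by the text, and the function name is hypothetical.

```python
import numpy as np

def augment_clip(clip, rng):
    """Augment a uint8 video clip of shape (T, H, W, 3).

    Returns a float32 array of the same shape with values in [0, 1].
    """
    out = clip.astype(np.float32)

    # Horizontal flip of the whole clip with 50% probability.
    if rng.random() < 0.5:
        out = out[:, :, ::-1, :]

    # Assumption: one Gaussian brightness offset (std 10 on the 0-255
    # scale) shared by all frames and pixels of the clip.
    out = out + rng.normal(0.0, 10.0)

    # Pixel-wise Gaussian noise with std 2 on the 0-255 scale.
    out = out + rng.normal(0.0, 2.0, size=out.shape)

    # Assumption: clip to the valid range, then scale to [0, 1].
    out = np.clip(out, 0.0, 255.0) / 255.0
    return out.astype(np.float32)

# Example on a small mid-gray clip.
rng = np.random.default_rng(0)
clip = np.full((4, 8, 8, 3), 128, dtype=np.uint8)
augmented = augment_clip(clip, rng)
```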

4.5 Optimization and Training

We optimize the 3DCNN for the temporal regression problem by minimizing the negative Pearson correlation between waveforms, each of the length of 135 frames. We apply the Adam optimizer without weight decay, with a learning rate of , and parameter values of and . We apply dropout during training with 75% probability.
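For reference, the negative Pearson correlation objective over a single clip can be written as follows. This NumPy version is only an illustration of the quantity being minimized; actual training would compute the same expression in an automatic-differentiation framework, and the small `eps` term is an added assumption for numerical stability.

```python
import numpy as np

def neg_pearson(pred, target, eps=1e-8):
    """Negative Pearson correlation between two 1D waveforms.

    Returns a value close to -1 for perfectly correlated waveforms and
    close to +1 for perfectly anti-correlated ones; it is invariant to
    the offset and (positive) scale of either waveform.
    """
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    p = pred - pred.mean()
    t = target - target.mean()
    r = np.sum(p * t) / (np.sqrt(np.sum(p ** 2) * np.sum(t ** 2)) + eps)
    return -r

# Example on a 135-frame clip: scaling and shifting do not change the loss.
time_s = np.arange(135) / 90.0
wave = np.sin(2 * np.pi * 1.2 * time_s)
loss_same = neg_pearson(3.0 * wave + 2.0, wave)
loss_flipped = neg_pearson(-wave, wave)
```

Because the loss ignores amplitude and offset, it matches the per-clip rescaling of the target waveform described in Section 4.3.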

Figure 4: Ground truth and predicted waveforms for a short time segment on the unmasked (top) and masked (bottom) datasets using the same 3DCNN model trained on subjects without face occlusions.

4.6 Overlap Adding

The model is given short video clips and predicts a waveform value for every frame. For videos longer than the clip length, it is necessary to perform predictions in a sliding-window fashion over the full video. Similarly to [4], we use a stride of half the clip length to slide across the full video. The windowed outputs are first standardized, then a Hann function is applied to mitigate edge effects from convolution by weighting the window’s center more than the extremes.
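The overlap-add procedure can be sketched as below. Here `predict_clip` is a hypothetical stand-in for the trained model (any callable that maps a clip to one value per frame), the half-clip stride is rounded down to an integer, and frames after the last full window are simply left at zero; these are assumptions of the sketch, not details given in the text.

```python
import numpy as np

def overlap_add(predict_clip, frames, clip_len=135):
    """Stitch per-clip predictions into one waveform for the full video.

    predict_clip: callable taking `clip_len` frames and returning a
    waveform of length `clip_len` (hypothetical model wrapper).
    frames: array whose first axis indexes time.
    """
    n = len(frames)
    stride = clip_len // 2            # half the clip length, rounded down
    hann = np.hanning(clip_len)       # down-weights the window edges
    out = np.zeros(n)
    for start in range(0, n - clip_len + 1, stride):
        y = np.asarray(predict_clip(frames[start:start + clip_len]), dtype=float)
        y = (y - y.mean()) / (y.std() + 1e-8)   # standardise each window
        out[start:start + clip_len] += hann * y
    return out

# Example with an identity "model" applied to a synthetic 1.2 Hz pulse.
time_s = np.arange(900) / 90.0
pulse = np.sin(2 * np.pi * 1.2 * time_s)
stitched = overlap_add(lambda clip: clip, pulse)
```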

5 Experimental Protocol

5.1 Experimental Scenarios

We conduct our experiments to understand how face masks adversely affect remote pulse detection performance, and whether adding synthetically generated masks to face videos during training helps improve performance in the presence of real face masks. To give a complete evaluation, we evaluate all models on both the masked (MPM) and unmasked (DDPM) datasets in the following four scenarios:

  • (s1) training / tuning all methods on unmasked face videos (train/validation partition of DDPM), and testing on unmasked subject-disjoint face videos (test partition of DDPM),

  • (s2) training / tuning all methods on face videos with synthetically added masks (DDPM-Mask dataset), and testing on unmasked subject-disjoint face videos (test partition of DDPM),

  • (s3) training / tuning all methods on unmasked face videos (train/validation partition of DDPM), and testing on masked face videos (MPM dataset),

  • (s4) training / tuning all methods on face videos with synthetically added masks (DDPM-Mask dataset), and testing on masked face videos (MPM dataset).

5.2 Dataset Partitions

The unmasked dataset (DDPM) was divided into three subject-disjoint partitions: data from 64 subjects was used for training, data from another 11 subjects was used for validation, and videos from the remaining 11 subjects were used for testing. The splits were crafted with stratified random sampling across race, gender, and age, prioritized in that order when equal splits were not possible. By setting a portion of the unmasked data (DDPM) aside for testing, we can effectively examine the change in performance when evaluating on the entire masked (MPM) dataset of 61 subjects.

5.3 Compared Methods

We selected several previous state-of-the-art pulse detection algorithms to evaluate our pulse detection algorithm’s efficacy. Algorithms were selected based on the availability of working code and possibility of re-implementation from scratch (based on the paper) in the case that code could not be acquired. Two of the chosen methods use carefully designed color combinations to extract a robust pulse signal, namely chrominance (CHROM) [4] and plane-orthogonal-to-skin (POS) [25]. Both algorithms are reimplementations following the algorithms in the original papers as closely as possible. We use the skin detector [7] to help define the pulse signal’s spatial origin.

Two algorithms employing blind-source separation of the color channels through independent component analysis (ICA) [19, 18] were also tested, due to their initial popularity in the field. For simplicity, we refer to the ICA approach presented in [19] as POH10, and refer to the improved ICA approach with detrending [18] as POH11. Both of the ICA approaches perform spatial averaging on the cropped facial region after applying a face detector. We apply OpenFace and use the landmarks to define the region of interest in the same protocol presented in section 4.2.

Unfortunately, neither code nor weights could be acquired for recent deep learning-based approaches [2, 17]. We use the previously described 3DCNN as an exemplar for the deep learning approaches. Given the output waveforms from the 4 handcrafted approaches, in addition to the proposed 3DCNN trained on DDPM (3DCNN), DDPM-Mask with black masks (3DCNN + B), and DDPM-Mask with patterned masks (3DCNN + P), we calculate the heart rate and evaluate each method in exactly the same manner for a fair comparison.

5.4 Evaluation Metrics

We evaluated the model performance in both the temporal (associated with the waveform shape) and frequency (associated with the heart beat rate) domains. Nearly all past works evaluate the heart rate prediction for extended time windows. Evaluating remote pulse predictors from the most dominant frequency alone is not adequate for justifying deployment on many vital signs tasks. In recent works, blood oxygenation and blood pressure have been predicted from high quality photoplethysmograms [6, 13], motivating evaluation of the pulse waveforms in the temporal domain as well. As a result, we evaluate both representations of the blood volume pulse signal.

Figure 5: Bland-Altman plots show the agreement between heart rates for non-overlapping 30 second intervals on the masked (MPM) and unmasked (DDPM) datasets using the 3DCNN trained on unmasked faces.

Evaluating the predictions in the temporal domain was accomplished by examining three-second windows of the waveforms. We calculated the Pearson correlation between the normalized ground truth and predicted waveforms with a stride of 1 frame between windows. The short time windows were selected due to the difference in the amplitude of the signals over longer periods, especially if they were affected by low frequency components causing trends. We only performed the temporal analysis for the unmasked (DDPM) dataset, since we were able to extract accurate waveforms with CHROM to correct the phase differences. For the masked (MPM) dataset, existing algorithms performed poorly, so we were unable to find a phase shift to evaluate fine-grained waveform differences.
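The temporal-domain metric can be illustrated with the short sketch below. Averaging the per-window correlations into a single score is an assumption, since the text does not state how the windows are aggregated, and the function name is hypothetical. The explicit loop is written for clarity rather than speed.

```python
import numpy as np

def windowed_pearson(pred, truth, fs=90, win_s=3.0):
    """Mean Pearson correlation over sliding windows with a 1-frame stride."""
    pred = np.asarray(pred, dtype=float)
    truth = np.asarray(truth, dtype=float)
    win = int(win_s * fs)
    scores = [
        np.corrcoef(pred[i:i + win], truth[i:i + win])[0, 1]
        for i in range(len(pred) - win + 1)
    ]
    # Assumption: aggregate the per-window correlations by their mean.
    return float(np.mean(scores))

# Example: a rescaled copy of the ground truth is perfectly correlated.
time_s = np.arange(900) / 90.0
truth_wave = np.sin(2 * np.pi * 1.2 * time_s)
score_same = windowed_pearson(2.0 * truth_wave + 1.0, truth_wave)
score_flipped = windowed_pearson(-truth_wave, truth_wave)
```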

The pulse detection performance in the frequency domain is analyzed by calculating the error between heart rates, where the heart rate is the dominant frequency in the waveform over a short time period. A recent work showed that the time window size for predicting heart rate can significantly affect the error estimations [14]. Since the time window used to predict heart rate within the oximeter is unknown, we calculate the ground truth heart rate frequencies from the oximeter’s waveforms. Specifically, we use a sliding window of length 30 seconds with a stride of a single frame. We apply a Hamming window prior to converting the signal to the frequency domain with the Fast Fourier Transform (FFT). The frequency index of the maximum spectral peak between approximately 0.67 Hz and 3 Hz (40 bpm to 180 bpm) is selected as the heart rate. A five-second moving average filter is then applied to the resultant heart rate signal to smooth noisy regions containing finger movement. To compare the heart rate estimates, we used standard metrics from the rPPG literature: mean error (ME), mean absolute error (MAE), root mean squared error (RMSE), and Pearson correlation coefficient (r) between heart rate predictions.
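A sketch of the spectral heart rate estimate for one window, together with the error metrics, is given below. The function names are hypothetical, the mean removal before windowing is an added assumption, and the sign convention for the mean error (prediction minus ground truth) is also assumed.

```python
import numpy as np

def heart_rate_bpm(wave, fs=90.0, lo_hz=40.0 / 60.0, hi_hz=3.0):
    """Heart rate (bpm) as the largest spectral peak between 40 and 180 bpm."""
    wave = np.asarray(wave, dtype=float)
    # Assumption: remove the mean before applying the Hamming window.
    windowed = (wave - wave.mean()) * np.hamming(len(wave))
    spectrum = np.abs(np.fft.rfft(windowed))
    freqs = np.fft.rfftfreq(len(wave), d=1.0 / fs)
    band = (freqs >= lo_hz) & (freqs <= hi_hz)
    return 60.0 * freqs[band][np.argmax(spectrum[band])]

def hr_errors(pred, true):
    """ME, MAE, RMSE (in bpm) and Pearson r between two heart rate series."""
    pred = np.asarray(pred, dtype=float)
    true = np.asarray(true, dtype=float)
    err = pred - true  # assumption: error is prediction minus ground truth
    return {
        "ME": float(err.mean()),
        "MAE": float(np.abs(err).mean()),
        "RMSE": float(np.sqrt((err ** 2).mean())),
        "r": float(np.corrcoef(pred, true)[0, 1]),
    }

# Example: a 30-second window at 90 fps containing a 1.2 Hz (72 bpm) pulse.
time_s = np.arange(2700) / 90.0
hr = heart_rate_bpm(np.sin(2 * np.pi * 1.2 * time_s))
metrics = hr_errors([70.0, 80.0, 90.0], [72.0, 78.0, 90.0])
```

With a 30-second window the frequency resolution is 1/30 Hz, i.e. 2 bpm per FFT bin, which bounds how finely a single window can resolve the heart rate.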

Since two oximeters are present in the masked dataset, we perform the same procedure over both waveforms and average the heart rate value at each time step. In a very small number of cases, the noise from hand movement gave different heart rate values from the oximeters. To remove these portions, if the heart rates differed by more than 10 beats per minute, the calculated heart rate closest to the average of the original heart rate estimates from the oximeters was selected. The resultant signals were smoothed with a three-second moving average filter to avoid spurious jumps in the heart rate. Ground truth heart rate values were verified by manually finding peak-to-peak distances in a subset of the waveforms, and these calculations were found to be more robust than applying standard peak detectors.

| Method | ME (bpm) | MAE (bpm) | RMSE (bpm) | r (heart rate) | r (waveform) |
|---|---|---|---|---|---|
| CHROM [4] | -0.26 | 3.48 | 10.37 | 0.93 | 0.60 |
| POS [25] | 0.11 | 3.16 | 11.19 | 0.92 | -0.44 |
| POH10 [19] | 18.54 | 20.56 | 33.10 | 0.56 | -0.12 |
| POH11 [18] | 10.47 | 14.30 | 28.86 | 0.54 | 0.10 |
| 3DCNN | -1.18 | 1.96 | 6.99 | 0.97 | 0.68 |
| 3DCNN + B | -1.18 | 2.06 | 7.29 | 0.97 | 0.68 |
| 3DCNN + P | -1.27 | 2.00 | 7.29 | 0.97 | 0.69 |

Table 1: Pulse rate estimation comparison when the methods are tested on videos without face masks (scenarios s1 and s2). “B” and “P” denote black mask and patterned synthetic masks added to the training data, respectively.

| Method | ME (bpm) | MAE (bpm) | RMSE (bpm) | r (heart rate) |
|---|---|---|---|---|
| CHROM [4] | 3.05 | 12.59 | 16.29 | 0.02 |
| POS [25] | 15.80 | 18.84 | 26.56 | 0.13 |
| POH10 [19] | 25.11 | 26.74 | 32.76 | -0.02 |
| POH11 [18] | 38.29 | 38.31 | 40.77 | -0.07 |
| 3DCNN | -1.45 | 3.57 | 9.38 | 0.79 |
| 3DCNN + B | -1.64 | 3.81 | 9.70 | 0.78 |
| 3DCNN + P | -1.88 | 3.75 | 9.47 | 0.79 |

Table 2: Same as in Tab. 1 except that the methods are tested on videos with face masks (scenarios s3 and s4).

6 Results

This section provides results and discussion for all four experimental scenarios listed in Sec. 5.1.

Scenario s1 (baseline): training and testing on unmasked face videos.

Performance on the data containing unmasked participants (DDPM set) is shown in the top portion of Table 1. The two chrominance-based methods achieve mean errors closer to zero than the other approaches, meaning they are well calibrated for predicting heart rate and do not exhibit bias. Both ICA methods give worse performance than the chrominance and 3DCNN approaches by every metric. The 3DCNN model has a slightly larger mean error (bias) than the chrominance models, but achieves the lowest MAE and RMSE for the heart rate estimates.

Scenario s2: training on face videos with synthetic masks, testing on unmasked face videos.

The results for models trained on synthetically masked participants (DDPM-Mask) are shown in the bottom portion of Table 1. Error discrepancies between the black and patterned masks are negligible, considering they perform better on different metrics. Unsurprisingly, we find that models trained with synthetic masks generally perform worse than the model trained to use the entire facial region, since the signal to noise ratio is decreased.

Scenario s3: training on unmasked face videos, testing on videos of faces wearing real masks.

Performance of the handcrafted methods and the 3DCNN model trained on DDPM and evaluated on masked subjects (MPM) is shown in the upper portion of Table 2. In general, performance degrades substantially compared to the maskless evaluation. The best MAE for heart rate prediction among the handcrafted methods is given by CHROM, at 12.59 bpm, more than 3 times worse than on DDPM. The 3DCNN model gives better performance than the chrominance and ICA approaches, but its MAE increases by 1.61 bpm (from 1.96 to 3.57 bpm). Fortunately, the correlation between heart rate predictions and ground truth remains strongly positive for the 3DCNN, with r = 0.79. For general purposes, the increase in error is likely not large enough to change an assessment of one’s current state of health, but improving performance to the unmasked baseline is desirable. The performance drops indicate that face occlusions cause difficulties within the pulse detection pipelines for all analyzed approaches.

Scenario s4: training on face videos with synthetic masks, testing on videos of faces wearing real masks.

The lower portion of Table 2 shows the performance of the models trained on black (3DCNN+B) and patterned (3DCNN+P) synthetic masks. As with the handcrafted methods and the baseline 3DCNN, performance decreases, with the MAE increasing by 1.75 bpm for both the black and patterned masks. Training with synthetic patterned masks gives slightly lower MAE and RMSE than training with synthetic black masks, but increases the bias. Surprisingly, we find that the models trained on synthetically masked data perform worse than the model trained on unoccluded faces. The lack of improvement potentially indicates that the model already gives low responses in the masked region without having seen any masks in training.

Figure 6: Face masks present difficulties to face detection and landmarking algorithms.

Figure 5 shows a side-by-side comparison of the heart rate errors between the two datasets over non-overlapping 30-second intervals. Both plots show dense collections of instances near perfect predictions at zero error. However, the variance within the masked data is significantly higher, with several points giving errors greater than 20 bpm.
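A per-interval error of this kind could be computed along the following lines. This is a minimal sketch under stated assumptions, not the authors' evaluation code: it assumes the heart rate of each non-overlapping window is taken as the spectral peak within a plausible pulse band, and the helper name `window_hr_bpm`, the 30 frames-per-second rate, the 40 to 180 bpm band, and the synthetic sinusoidal waveforms are all hypothetical.

```python
import numpy as np

def window_hr_bpm(signal, fs, win_sec=30.0, lo_bpm=40.0, hi_bpm=180.0):
    """Estimate heart rate (bpm) in each non-overlapping window from the spectral peak."""
    signal = np.asarray(signal, dtype=float)
    n = int(round(win_sec * fs))  # samples per window
    freqs_bpm = np.fft.rfftfreq(n, d=1.0 / fs) * 60.0
    band = (freqs_bpm >= lo_bpm) & (freqs_bpm <= hi_bpm)  # assumed pulse band
    rates = []
    for start in range(0, len(signal) - n + 1, n):
        seg = signal[start:start + n]
        seg = seg - seg.mean()  # remove the DC component
        spectrum = np.abs(np.fft.rfft(seg * np.hanning(n)))
        rates.append(freqs_bpm[band][np.argmax(spectrum[band])])
    return np.array(rates)

# Synthetic stand-ins for a predicted and a reference pulse waveform.
fs = 30.0  # assumed sampling rate in Hz
t = np.arange(int(90 * fs)) / fs  # 90 seconds, i.e. three 30-second windows
pred_wave = np.sin(2 * np.pi * (72.0 / 60.0) * t)  # 72 bpm
ref_wave = np.sin(2 * np.pi * (74.0 / 60.0) * t)   # 74 bpm

# Signed heart rate error per non-overlapping 30-second interval.
errors = window_hr_bpm(pred_wave, fs) - window_hr_bpm(ref_wave, fs)
print(errors)
```

With a 30-second window the frequency resolution of this simple peak-picking approach is 2 bpm, so a real pipeline would likely interpolate or zero-pad the spectrum to obtain finer estimates.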

7 Conclusions

In this paper, we present a new large-scale physiological monitoring dataset of high-resolution RGB video and two finger oximeter recordings to encourage the evaluation of remote pulse estimators during the COVID-19 pandemic. In answering the questions posed in the introduction, we find: (re: Q1) accurate pulse rate estimation is possible when subjects are wearing face masks, but the performance is slightly worse, (re: Q2) training with synthetically generated mask videos does not improve performance, and (re: Q3) face landmarkers and skin detectors robust to heavy face occlusion should be deployed in the early phases of the pulse detection algorithms to define reliable regions of interest. Several previous state-of-the-art pulse estimators built for unoccluded pulse estimation are found to perform substantially worse on masked subjects, while a 3DCNN exhibits a moderate drop in performance. We find that training the model with simple synthetic masks positioned using the face landmarks is insufficient to significantly increase the robustness of pulse detection in the presence of face masks.

The performance discrepancy on masked subjects is expected, yet concerning. Handcrafted pulse estimators proved not to be robust to large face occlusions, which could stem from one or more of the many steps in the process. To better understand why certain models performed so poorly on certain sequences, we examined the outputs at every stage of the pulse detection process. A large proportion of the errors were found to propagate from the face landmarking step. Figure 6 shows erroneous face landmarks produced by OpenFace [1] when the subject gazed away from the cameras. These sources of error encourage more research into face detection and landmarking in the presence of occlusions such as face masks.


We would like to thank Marybeth Saunders for conducting the interviews during DDPM data collection. This research was sponsored by the Securiport Global Innovation Cell, a division of Securiport LLC. Commercial equipment is identified in this work in order to adequately specify or describe the subject matter. In no case does such identification imply recommendation or endorsement by Securiport LLC, nor does it imply that the equipment identified is necessarily the best available for this purpose. The opinions, findings, and conclusions or recommendations expressed in this publication are those of the authors and do not necessarily reflect the views of our sponsors.


  1. T. Baltrusaitis, A. Zadeh, Y. C. Lim and L. Morency (2018) OpenFace 2.0: facial behavior analysis toolkit. In 2018 13th IEEE International Conference on Automatic Face Gesture Recognition (FG 2018), pp. 59–66.
  2. W. Chen and D. McDuff (2018) DeepPhys: Video-Based Physiological Measurement Using Convolutional Attention Networks. European Conference on Computer Vision (ECCV), pp. 356–373. ISBN 978-3-030-01216-8.
  3. M. Cimpoi, S. Maji, I. Kokkinos, S. Mohamed and A. Vedaldi (2014) Describing textures in the wild. In Proceedings of the IEEE Conf. on Computer Vision and Pattern Recognition (CVPR).
  4. G. de Haan and V. Jeanne (2013) Robust pulse rate from chrominance-based rppg. IEEE Transactions on Biomedical Engineering 60 (10), pp. 2878–2886.
  5. G. De Haan and A. Van Leest (2014) Improved motion robustness of remote-PPG by using the blood volume pulse signature. Physiological Measurement 35 (9), pp. 1913–1926. ISSN 1361-6579.
  6. A. Guazzi, M. Villarroel, J. Jorge, J. Daly, M. Frise, P. Robbins and L. Tarassenko (2015-09) Non-contact measurement of oxygen saturation with an rgb camera. Biomedical optics express 6, pp. 3320–38.
  7. G. Heusch, A. Anjos and S. Marcel (2017) A reproducible study on remote heart rate measurement. CoRR abs/1709.00962.
  8. G. Heusch and S. Marcel (2018) Pulse-based features for face presentation attack detection. IEEE International Conference on Biometrics Theory, Applications and Systems (BTAS). ISBN 9781538671795.
  9. G. Hsu, A. Ambikapathi and M. Chen (2017) Deep learning with time-frequency representation for pulse estimation from facial videos. In 2017 IEEE International Joint Conference on Biometrics (IJCB), pp. 383–389.
  10. Y. Hsu, Y. L. Lin and W. Hsu (2014) Learning-based heart rate detection from remote photoplethysmography features. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4433–4437. ISBN 9781479928927, ISSN 15206149.
  11. X. Li, J. Chen, G. Zhao and M. Pietikainen (2014-06) Remote heart rate measurement from face videos under realistic situations. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4264–4271.
  12. S. Liu, X. Lan and P. Yuen (2018) Remote photoplethysmography correspondence feature for 3d mask face presentation attack detection. In European Conference on Computer Vision (ECCV), pp. 577–594. ISBN 978-3-030-01270-0.
  13. G. Martínez, N. Howard, D. Abbott, K. Lim, R. Ward and M. Elgendi (2018) Can Photoplethysmography Replace Arterial Blood Pressure in the Assessment of Blood Pressure?. Journal of Clinical Medicine 7 (10), pp. 316. ISSN 2077-0383.
  14. Y. Mironenko, K. Kalinin, M. Kopeliovich and M. Petrushan (2020-06) Remote photoplethysmography: rarely considered factors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops.
  15. M. Ngan, P. Grother and K. Hanaoka (2020-07) Ongoing face recognition vendor test (FRVT) part 6a: face recognition accuracy with masks using pre-covid-19 algorithms. Technical Report NISTIR 8311, National Institute of Standards and Technology.
  16. X. Niu, H. Han, S. Shan and X. Chen (2018) VIPL-hr: a multi-modal database for pulse estimation from less-constrained face video. In Asian Conference on Computer Vision (ACCV).
  17. X. Niu, S. Shan, H. Han and X. Chen (2020) RhythmNet: End-to-End Heart Rate Estimation from Face via Spatial-Temporal Representation. IEEE Transactions on Image Processing 29, pp. 2409–2423. arXiv:1910.11515, ISSN 19410042.
  18. M. Poh, D. J. McDuff and R. W. Picard (2011) Advancements in noncontact, multiparameter physiological measurements using a webcam. IEEE Transactions on Biomedical Engineering 58 (1), pp. 7–11.
  19. M. Poh, D. J. McDuff and R. W. Picard (2010-05) Non-contact, automated cardiac pulse measurements using video imaging and blind source separation. Opt. Express 18 (10), pp. 10762–10774.
  20. K. Simonyan and A. Zisserman (2014) Two-stream convolutional networks for action recognition in videos. In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 1, NIPS’14, Cambridge, MA, USA, pp. 568–576.
  21. M. Soleymani, J. Lichtenauer, T. Pun and M. Pantic (2012) A multimodal database for affect recognition and implicit tagging. IEEE Transactions on Affective Computing 3 (1), pp. 42–55. ISSN 19493045.
  22. M. P. Tarvainen, P. O. Ranta-aho and P. A. Karjalainen (2002) An advanced detrending method with application to HRV analysis. IEEE Transactions on Biomedical Engineering 49 (2), pp. 172–175.
  23. S. Tulyakov, X. Alameda-Pineda, E. Ricci, L. Yin, J. F. Cohn and N. Sebe (2016) Self-adaptive matrix completion for heart rate estimation from face videos under realistic conditions. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2396–2404.
  24. W. Verkruysse, L. O. Svaasand and J. S. Nelson (2008-12) Remote plethysmographic imaging using ambient light. Opt. Express 16 (26), pp. 21434–21445.
  25. W. Wang, A. C. den Brinker, S. Stuijk and G. de Haan (2017) Algorithmic principles of remote ppg. IEEE Transactions on Biomedical Engineering 64 (7), pp. 1479–1491.
  26. W. Wang, S. Stuijk and G. De Haan (2016) A Novel Algorithm for Remote Photoplethysmography: Spatial Subspace Rotation. IEEE Transactions on Biomedical Engineering 63 (9), pp. 1974–1984. ISSN 15582531.
  27. F. P. Wieringa, F. Mastik and A. F.W. Van Der Steen (2005) Contactless multiple wavelength photoplethysmographic imaging: A first step toward “SpO2 camera” technology. Annals of Biomedical Engineering 33 (8), pp. 1034–1041. ISSN 00906964.
  28. Z. Yu, X. Li and G. Zhao (2019) Remote photoplethysmograph signal measurement from facial videos using spatio-temporal networks. In Proceedings of the British Machine Vision Conference (BMVC).
  29. Q. Zhan, W. Wang and G. de Haan (2020) Analysis of CNN-based remote-PPG to understand limitations and sensitivities. Biomedical Optics Express 11 (3), pp. 1268–1283. arXiv:1911.02736, ISSN 2156-7085.