Smartphone-based paroxysmal atrial fibrillation monitoring with robust generalization
Abstract
Atrial fibrillation is increasingly prevalent, especially in the elderly, and challenging to detect due to its paroxysmal nature. Here, we propose novel computational methods based on heartbeat intervals to facilitate rapid and robust discrimination between atrial fibrillation and sinus rhythm. We used low-cost Android smartphones and recorded short, 30-second waveforms from 194 participants. In addition, we evaluated our approach on 8528 handheld ECG recordings to show generalization.
Our approach achieves a sensitivity of 93% and a specificity of 94% on 30-second waveforms, significantly outperforming previously proposed heart rate variability features and smartphone-based AFib detection methods, and substantiates the feasibility of real-world application on low-cost hardware.
1 Introduction
Atrial fibrillation (AFib) poses significant health risks, such as thrombosis formation or cerebrovascular stroke. Estimated population prevalence ranges from 0.5 to 3% and can reach 10 to 30% in the elderly, and nearly one in six stroke events is caused by AFib (reiffel2014atrial). However, detecting paroxysmal atrial fibrillation remains challenging, and a large percentage of cases go unnoticed.
Recently, smartphones have been suggested to facilitate home monitoring at any time, either with simple handheld ECGs (galloway2013iphone), which the patient would need to purchase, or using the camera and flashlight as a substitute photoplethysmograph (krivoshei2016smart). These latter approaches obtain an approximate photoplethysmography signal from the brightness values of the built-in phone camera, and subsequently calculate interbeat intervals from the temporal differences of maxima or other fiducial points. It has been argued that variability measures of beat intervals obtained from smartphones can be in good agreement with the same measures calculated from ECG (peng2015extraction), at least for higher-end devices with high-quality built-in camera sensors and circuitry.
However, low sensor quality, especially in lower-priced devices, as well as frequent motion artifacts and the lack of willingness or patience to sit completely still for extended periods of time, pose serious challenges for both of these home screening methods. Evidence of reliable detection of AFib has so far been mostly limited to high-end phones such as iPhones, to minimum measurement durations of two to five minutes (requiring very patient participants), and to a few dozen patients (krivoshei2016smart).
Here, we describe a system with stronger generalization ability than previously proposed methods, using robust measures of irregularity and multivariate analysis, and show its applicability in more practical settings: namely, up to 10x shorter measurements (30 s vs. 300 s), low-cost hardware, higher tolerance to movement artifacts, a low false positive rate, and generalization across hardware types.
2 Methods
We constructed the model based on interbeat intervals (IBIs) from Holter ECGs of 102 participants (see section 'Training data'), and evaluated its performance on out-of-sample data from:

IBIs from smartphone recordings of 100 patients, recorded in a clinical setting

IBIs from smartphone recordings of 200 participants, recorded in a home monitoring setting

IBIs from 8528 handheld ECG recordings, recorded in a home monitoring setting
Signal processing and data analysis were performed using custom software written in Python. Each short signal was evenly resampled, detrended, and bandpass filtered prior to beat detection. Signals were rescaled to zero mean and unit variance to mitigate the extreme amplitude differences between data sources (Holters, 12-lead ECGs, and handheld fingertip ECGs for home monitoring; smartphones of different models and makes). In the case of ECG signals, the QRS detector of (kim2016simple) was subsequently applied to extract interbeat time intervals.
ECG sampling rates varied between 128 Hz and 300 Hz, depending on the data source (see details below). All smartphone PPGs were recorded by a custom-developed app on Android smartphones, which, at the time of this publication, are limited by the operating system to optical sensor input with a maximum sampling rate of 30 Hz.
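As an illustration of the resampling, detrending, and rescaling steps described above, the following is a minimal standard-library sketch, not the software used in this work. The function names, the linear interpolation, the straight-line detrending, and the synthetic test signal are all assumptions made for the example; the bandpass step is left out because the filter design is not specified in the text.

```python
import math
from statistics import fmean, pstdev

def resample_uniform(times, values, rate_hz):
    """Linearly interpolate unevenly timed samples onto an even grid."""
    step = 1.0 / rate_hz
    n_out = int((times[-1] - times[0]) * rate_hz) + 1
    out, j = [], 0
    for k in range(n_out):
        t = times[0] + k * step
        # Advance to the input segment [times[j], times[j + 1]] containing t.
        while j < len(times) - 2 and times[j + 1] < t:
            j += 1
        frac = (t - times[j]) / (times[j + 1] - times[j])
        out.append(values[j] + frac * (values[j + 1] - values[j]))
    return out

def detrend_linear(y):
    """Subtract the least-squares straight line (assumed detrending method)."""
    n = len(y)
    x_mean, y_mean = (n - 1) / 2, fmean(y)
    sxx = sum((i - x_mean) ** 2 for i in range(n))
    sxy = sum((i - x_mean) * (v - y_mean) for i, v in enumerate(y))
    slope = sxy / sxx
    return [v - (y_mean + slope * (i - x_mean)) for i, v in enumerate(y)]

def standardize(y):
    """Rescale to zero mean and unit variance."""
    mu, sd = fmean(y), pstdev(y)
    return [(v - mu) / sd for v in y]

# Synthetic example: a 1.2 Hz pulse on a drifting baseline, jittered timestamps.
times = [i * 0.033 + 0.002 * ((i * 7) % 3) for i in range(300)]
raw = [0.5 * t + math.sin(2 * math.pi * 1.2 * t) for t in times]
clean = standardize(detrend_linear(resample_uniform(times, raw, rate_hz=30)))
```

Standardizing last means that recordings from sensors with very different amplitude ranges end up on a common scale before beat detection.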
2.1 Robust heart rate variability measures for AFib detection
A large number of measures of heart rate variability (HRV) have been suggested in previous literature (see e.g. (acharya2006heart) for a review), and the authors have developed an array of additional HRV measures, shown to perform well at discriminating other disease etiologies from healthy controls (madl2016cinc). For the present AFib detection model, we extracted the five best-performing measures of HRV from this multitude of possibilities, applying a wrapper method of feature selection from the machine learning literature (guyon2003introduction), yielding the following features in order of inclusion:
1. Standard deviation of the $n$th-order derivative of interbeat intervals, equivalent to the standard deviation of successive differences (of successive differences). In our experiments, $n=5$ led to the best results. The idea of taking the standard deviation of a derivative instead of the plain RR interval standard deviation (SDRR) has a long history in HRV literature; see e.g. (brennan2001existing).
Let $x = (x_1, \dots, x_N)$ denote the sequence of interbeat intervals, such that each $x_i$ represents the temporal difference between two heart beats, and let $\Delta^n x$ denote its $n$th-order successive differences, with $\Delta^1 x_i = x_{i+1} - x_i$ and $\Delta^n x_i = \Delta^{n-1} x_{i+1} - \Delta^{n-1} x_i$. Then, the first feature is defined as
$$f_1 = \mathrm{SD}\left(\Delta^n x\right) \qquad (1)$$
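The first feature can be sketched in a few lines of Python. This is an illustrative version only: the function names are hypothetical, and the use of the population standard deviation (rather than the sample one) is an assumption, since the text does not state the normalization.

```python
from statistics import pstdev

def nth_order_diff(ibis, n):
    """Apply successive differencing n times to the interbeat intervals."""
    d = list(ibis)
    for _ in range(n):
        d = [b - a for a, b in zip(d, d[1:])]
    return d

def sd_nth_diff(ibis, n=5):
    """Standard deviation of the n-th order differences (n = 5 as in the text)."""
    d = nth_order_diff(ibis, n)
    if len(d) < 2:
        raise ValueError("need at least n + 2 intervals")
    # Population SD is an assumption; the text does not specify the normalization.
    return pstdev(d)

regular = [0.8] * 12                      # perfectly regular rhythm, in seconds
irregular = [0.62, 1.05, 0.71, 0.93, 0.55, 1.12, 0.80, 0.66, 0.98, 0.59, 1.01, 0.74]
print(sd_nth_diff(regular), sd_nth_diff(irregular))
```

Repeated differencing removes slow trends in heart rate, so the feature responds mainly to beat-to-beat irregularity: it is zero for the perfectly regular series and positive for the irregular one.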
2. Histogram entropy, a particular implementation of the popular entropy-based measures of heart rate variability (e.g. (lake2002sample)), defined as follows. Let the frequency distribution of the interbeat interval sequence $x$ be summarized by a histogram discretized into $B$ bins, such that $p_k$ is the normalized frequency of all $x_i$ that fall into the $k$th bin, and $w_k$ is the width of the $k$th bin. Then,
$$H = -\sum_{k=1}^{B} p_k \log \frac{p_k}{w_k} \qquad (2)$$
Since a larger number of bins increases susceptibility to noise and sensitivity to the particular bin edges, we chose the smallest viable number of bins, which yields the most robust generalization (more bin edges could allow more noisy beats to land in the 'wrong' bin).
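A histogram entropy of this kind might be computed as in the sketch below. It assumes equal-width bins spanning the observed range of the data and the width-normalized form $-\sum_k p_k \log(p_k / w_k)$; both are assumptions for illustration. The bin count is a required argument because the value used in this work is not reproduced here, and the function name is hypothetical.

```python
import math

def histogram_entropy(ibis, n_bins):
    """Width-normalized entropy of an equal-width histogram of the intervals."""
    lo, hi = min(ibis), max(ibis)
    if hi <= lo:
        raise ValueError("intervals must not all be identical")
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for x in ibis:
        # The right-most edge is counted in the last bin.
        k = min(int((x - lo) / width), n_bins - 1)
        counts[k] += 1
    total = len(ibis)
    # Empty bins contribute nothing to the sum.
    return -sum((c / total) * math.log((c / total) / width) for c in counts if c)

clustered = [0.0, 0.0, 0.0, 4.0]   # most intervals share one bin
spread = [0.0, 1.5, 2.5, 4.0]      # intervals spread evenly over the bins
print(histogram_entropy(clustered, 4), histogram_entropy(spread, 4))
```

With the same range and bin count, the evenly spread series yields the higher entropy, which is the behaviour expected of an irregularity measure.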
3. Resemblance of the distribution of interbeat intervals to a Rayleigh distribution. Testing against the Rayleigh distribution is frequently used in physics to test for periodicity in time series of possibly regular events (leahy1983searches). After inclusion of the above two, more traditional HRV measures, the addition of this metric yielded the largest increase in AFib predictive accuracy. We formalized the resemblance between the IBI distribution and a comparable (fitted) Rayleigh distribution as follows. Let $R(x; \sigma) = \frac{x}{\sigma^2} e^{-x^2 / (2\sigma^2)}$ denote the Rayleigh distribution of interbeat intervals $x$, with a scale parameter $\sigma$. Let $\hat{\sigma}$ be the maximum likelihood estimate of the scale parameter of the best-fitting Rayleigh distribution, given the series $x_1, \dots, x_N$.
Furthermore, let $\hat{f}$ denote the kernel density estimate of the sequence $x$, such that
$$\hat{f}(t) = \frac{1}{N h} \sum_{i=1}^{N} \phi\left(\frac{t - x_i}{h}\right) \qquad (3)$$
where $\phi$ is the standard Normal density, and the bandwidth parameter $h$ is calculated using Silverman's rule-of-thumb bandwidth estimator (silverman1986density).
Then $\rho$, the resemblance of the kernel density estimated distribution of interbeat intervals to the best-fitting Rayleigh distribution, can be calculated numerically by summing up the absolute differences between the two functions:
$$\rho = \sum_{m=0}^{M} \left| \hat{f}(m\,\delta) - R(m\,\delta; \hat{\sigma}) \right| \qquad (4)$$
The step $\delta$ was chosen to be 1 ms (since none of the data sources had a higher resolution or accuracy), and $M$ was chosen to cover most of the Rayleigh distribution. Specifically, $M$ was chosen so that the range of summation extends at least to the point where the Rayleigh distribution takes on an amplitude smaller than a fixed small fraction of its peak.
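The Rayleigh resemblance can be sketched as follows; this is an illustration under stated assumptions rather than the original implementation. The function names are hypothetical, the bandwidth uses the $1.06\,s\,N^{-1/5}$ variant of Silverman's rule, the summation starts at zero, and the tail cut-off `tail=0.01` (1% of the Rayleigh peak) is a placeholder value, since the fraction actually used is not reproduced here.

```python
import math
from statistics import stdev

def rayleigh_pdf(x, sigma):
    return (x / sigma**2) * math.exp(-x * x / (2 * sigma**2)) if x >= 0 else 0.0

def rayleigh_mle(ibis):
    """Maximum likelihood estimate of the Rayleigh scale parameter."""
    return math.sqrt(sum(x * x for x in ibis) / (2 * len(ibis)))

def kde_at(t, ibis, h):
    """Gaussian kernel density estimate of the intervals, evaluated at t."""
    norm = 1.0 / (len(ibis) * h * math.sqrt(2 * math.pi))
    return norm * sum(math.exp(-0.5 * ((t - x) / h) ** 2) for x in ibis)

def rayleigh_resemblance(ibis, step=0.001, tail=0.01):
    """Sum of |KDE - fitted Rayleigh| on a 1 ms grid (ibis in seconds)."""
    sigma = rayleigh_mle(ibis)
    h = 1.06 * stdev(ibis) * len(ibis) ** (-0.2)  # assumed Silverman variant
    if h == 0:
        raise ValueError("intervals must not all be identical")
    peak = rayleigh_pdf(sigma, sigma)  # the Rayleigh mode lies at x = sigma
    total, m = 0.0, 0
    while True:
        t = m * step
        r = rayleigh_pdf(t, sigma)
        total += abs(kde_at(t, ibis, h) - r)
        if t > sigma and r < tail * peak:  # past the mode and deep in the tail
            return total
        m += 1

# Illustrative inputs: quantiles of a Rayleigh law vs. a near-constant rhythm.
n = 60
rayleigh_like = [0.6 * math.sqrt(-2 * math.log(1 - (i + 0.5) / n)) for i in range(n)]
regular = [0.80 + 0.01 * ((i % 5) - 2) for i in range(n)]
print(rayleigh_resemblance(rayleigh_like), rayleigh_resemblance(regular))
```

Because the quantity is a sum of absolute differences, smaller values indicate a closer match to the fitted Rayleigh distribution; the near-constant series gives a much larger value than the Rayleigh-like one.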
4. Measures of stochasticity based on a horizontal visibility graph of the IBI series, in particular the graph radius and disassortative entropy, both shown before to be capable of outperforming previously proposed HRV measures in terms of predictive power for certain cardiovascular diseases (madl2016cinc). Structural properties of the original time series, such as periodicity or fractality, are preserved in a horizontal visibility graph representation, and it has been argued that such graphs are well-suited for discriminating stochastic and chaotic processes (luque2009horizontal).
A horizontal visibility graph (HVG) can be constructed from the IBI series as follows. Let $t_i$ be the time that interval $x_i$ has 'occurred'; that is, the time of the fiducial point in beat $i$ in the ECG or PPG signal. Then, $G$ is a network constructed from the series $x$, such that each IBI has a corresponding vertex, and each pair of vertices corresponding to a pair of IBIs $x_i$ and $x_j$ is connected by an edge if both $x_i > x_k$ and $x_j > x_k$ for all $k$ with $t_i < t_k < t_j$ (luque2009horizontal). In other words, in a bar plot of all interbeat intervals $x$, the vertices corresponding to two particular IBIs are connected if an unbroken horizontal line can be drawn between them without intersecting any intermediate bars in the plot.
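The construction above can be sketched as a short Python function. The function name and the edge-set representation are choices made for this example; list order is assumed to match the temporal order of the beats.

```python
def horizontal_visibility_graph(ibis):
    """Return the HVG as a set of edges (i, j), i < j, over interval indices."""
    edges = set()
    for i in range(len(ibis) - 1):
        blocker = float("-inf")  # tallest bar strictly between i and j so far
        for j in range(i + 1, len(ibis)):
            # blocker < ibis[i] holds here, so only bar j needs checking.
            if ibis[j] > blocker:
                edges.add((i, j))
            blocker = max(blocker, ibis[j])
            if blocker >= ibis[i]:
                break  # nothing further to the right is visible from bar i
    return edges

ibis = [0.80, 0.60, 0.90, 0.70, 0.85]
edges = horizontal_visibility_graph(ibis)
print(sorted(edges))
```

Neighbouring intervals are always connected, so the graph is connected by construction; the early `break` keeps the scan short once a bar at least as tall as bar `i` has been passed.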
Apart from their usefulness in differentiating noise from chaos arising from nonlinear dynamics, HVGs are useful because they facilitate the application of any complex networks analysis tool to IBIs. According to preliminary feature selection, two particular complex network descriptors increased predictive accuracy the most: radius and disassortative entropy.
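$r$, the HVG radius, is defined as the minimum of eccentricities, where the eccentricity $\epsilon(v)$ of a vertex $v$ is simply the graph distance $d(v, u)$ to the vertex $u$ farthest from it in the HVG $G = (V, E)$:
$$r = \min_{v \in V} \epsilon(v) = \min_{v \in V} \max_{u \in V} d(v, u) \qquad (5)$$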
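For illustration, the radius of a small graph given as an edge list can be computed with breadth-first search, as in the sketch below. The function names are hypothetical, and the graph is assumed to be connected, which holds for any HVG because neighbouring intervals are always linked.

```python
from collections import deque

def eccentricity(adj, source):
    """Largest shortest-path distance from source, found by breadth-first search."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for u in adj[v]:
            if u not in dist:
                dist[u] = dist[v] + 1
                queue.append(u)
    return max(dist.values())

def graph_radius(edges):
    """Minimum eccentricity over all vertices of a connected, undirected graph."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return min(eccentricity(adj, v) for v in adj)

# Hand-written example graphs over five vertices.
hub_edges = [(0, 1), (0, 2), (1, 2), (2, 3), (2, 4), (3, 4)]  # vertex 2 sees all
path_edges = [(0, 1), (1, 2), (2, 3), (3, 4)]                 # a simple chain
print(graph_radius(hub_edges), graph_radius(path_edges))
```

A single interval that is visible from every other one gives a radius of 1, whereas a chain of five vertices has radius 2.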
$H_D$, the HVG disassortative entropy, is a connectivity measure proposed by the author (madl2016cinc). It is defined as the entropy of the mixing matrix, measuring the information content of the tendency of vertices to connect to similar vertices. Whereas traditional assortativity (newman2003mixing) is maximized by a graph always connecting high-degree vertices to other high-degree vertices, disassortative entropy is maximized by a graph in which the connectivity is randomized. Thus, it is a second-degree measure of stochasticity.
The disassortative entropy-based feature is defined as
$$H_D = -\sum_{j,k} e_{jk} \log e_{jk} \qquad (6)$$
where $e_{jk}$ is the joint probability of the degrees $j$ and $k$ of the two vertices at either end of an edge in the HVG $G$ constructed from $x$. Numerically, $e_{jk}$ is the fraction of edges in $G$ that connect vertices with degree $j$ to vertices with degree $k$ (where degree in the graph theory sense simply means the number of edges incident to a vertex). See (newman2003mixing).
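The entropy of the degree mixing matrix can be sketched as below for a graph given as an edge list. Two details are assumptions made for this example: each undirected edge is counted in both directions so that the mixing matrix is symmetric, and plain vertex degree is used as described in the text. The function name is hypothetical.

```python
import math
from collections import Counter

def disassortative_entropy(edges):
    """Entropy of the joint distribution of degrees at the two ends of an edge."""
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    pair_counts = Counter()
    for a, b in edges:
        # Count both directions so the mixing matrix is symmetric (assumption).
        pair_counts[(degree[a], degree[b])] += 1
        pair_counts[(degree[b], degree[a])] += 1
    total = sum(pair_counts.values())  # equals twice the number of edges
    return -sum((c / total) * math.log(c / total) for c in pair_counts.values())

path_edges = [(0, 1), (1, 2), (2, 3), (3, 4)]   # degrees 1, 2, 2, 2, 1
triangle_edges = [(0, 1), (1, 2), (0, 2)]       # every vertex has degree 2
print(disassortative_entropy(path_edges), disassortative_entropy(triangle_edges))
```

In the triangle every edge joins two degree-2 vertices, so the mixing matrix has a single non-zero entry and the entropy is zero; the chain mixes degree pairs and gives a positive value.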
Table 1 shows cross-validated accuracies using logistic regression, chosen to avoid overfitting the limited training data. Regularization and feature selection were based on a holdout set of the training Holter ECGs, and were not adapted to smartphone PPG.
3 Results
The multivariate logistic regression model described above was constructed using solely the training dataset (research ECG datasets), and evaluated on out-of-sample testing datasets (smartphone PPGs or handheld ECGs) not seen at training time. Table 1 shows a comparison to alternative physiological markers or biomarkers recently suggested for paroxysmal AFib diagnosis. On data from cheap smartphone hardware, our method displays significantly higher accuracy (94% vs. 70%), area under the ROC curve (AUC, 0.97 vs. 0.81), and specificity (94% vs. 66%), but slightly lower sensitivity (93% vs. 95%) compared to the recently proposed method of (krivoshei2016smart). A high specificity is important in a home monitoring setting for psychological, time, and financial reasons, which makes previously proposed methods difficult to apply to short-term, 30-second recordings obtained from low-cost smartphone hardware.
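For reference, the reported screening metrics follow from the confusion counts as in the short sketch below. The function name and the toy labels are illustrative only and are not data from this study; 1 denotes AFib and 0 denotes sinus rhythm.

```python
def screening_metrics(y_true, y_pred):
    """Sensitivity, specificity and accuracy for binary labels (1 = AFib)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    return {
        "sensitivity": tp / (tp + fn),  # share of AFib recordings detected
        "specificity": tn / (tn + fp),  # share of sinus rhythm correctly cleared
        "accuracy": (tp + tn) / len(pairs),
    }

# Toy labels for illustration only.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
metrics = screening_metrics(y_true, y_pred)
print(metrics)
```

In a home monitoring setting most recordings come from people in sinus rhythm, so even a modest drop in specificity translates into many false alarms.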
Table 1 also shows the performance of other physiological and biological markers recently proposed to diagnose paroxysmal atrial fibrillation in a clinical setting (see (howlett2015diagnosing) for a recent review). Unlike the comparison with prior methods of smartphone-based screening, these markers were evaluated on separate patient cohorts, limiting the usefulness of direct comparison. Nevertheless, the order of magnitude of these performance indicators strongly suggests smartphone-based home monitoring to be a viable solution to detect unnoticed or asymptomatic paroxysmal atrial fibrillation, at very low cost to patients and the healthcare system.

              Ours    Resting ECG    Biomarker    Physiological
Sensitivity           83%            100%         82%
Specificity           85%            70.4%        91%
ROC AUC               N/A            N/A          N/A
Sample size           100            264          280

4 Conclusion
Our results provide evidence that waveforms as short as 30 seconds are sufficient for robust detection, increasing utility for end users who might be hard pressed to sit still without the slightest movement for the 5-minute periods used in previous work (krivoshei2016smart). We demonstrate high sensitivity and specificity on a much larger patient cohort than previous smartphone studies, with much lower hardware requirements. Finally, our approach seems to generalize across patient cohorts and even measurement devices. Although Table 1 shows differences between the smartphone PPG cohort and the handheld ECG data, discrimination ability is high in both cases, suggestive of strong generalization.