Method to Annotate Arrhythmias by Deep Network

Method to Annotate Arrhythmias by Deep Network


This study targets to automatically annotate on arrhythmia by deep network. The investigated types include sinus rhythm, asystole (Asys); supraventricular tachycardia (Tachy); ventricular flutter or fibrillation (VF/VFL); ventricular tachycardia (VT).
Methods: 13s limb lead ECG chunks from MIT malignant ventricular arrhythmia database (VFDB) and MIT normal sinus rhythm database were partitioned into subsets for 5-fold cross validation. These signals were resampled to 200Hz, filtered to remove baseline wandering, projected to 2D gray spectrum and then fed into a deep network with brand-new structure. In this network, a feature vector for a single time point was retrieved by residual layers, from which latent representation was extracted by variational autoencoder (VAE). These front portions were trained to meet a certain threshold in loss function, then fixed while training procedure switched to remaining bidirectional recurrent neural network (RNN), the very portions to predict an arrhythmia category. Attention windows were polynomial lumped on RNN outputs for learning from details to outlines. And over sampling was employed for imbalanced data. The trained model was wrapped into docker image for deployment in edge or cloud.
Conclusion: Promising sensitivities were achieved in four arrhythmias and good precision rates in two ventricular arrhythmias were also observed. Moreover, it was proven that latent representation by VAE, can significantly boost the speed of convergence and accuracy.

I Introduction

Automatically annotate arrhythmia with deep learning network has been emerging along with the boost of diverse frameworks. Back to 2015, 1D convolutional neural networks (CNN) were proposed by Kiranyaz to classify ECG beats [1]. In the same year, a stacked denoising autoencoders (DAE) combined with supervised classification was proposed by Rahhal for active learning the beat annotation [2]. In past 2017, Ng and his group trained a CNN classifier as deep as 34 layers to predict 14 outputs [3]. All these studies claimed a better performance compared with handcrafted feature, and even better than cardiologists annotation.
This study targets on the same problem, in which a brand-new deep network is proposed. Data introduced in section II-A is partitioned into two parts for training and testing separately, preprocessed by typical filtering, transferred into 2D spectrum, and fed directly into this network. The network is trained by two steps, firstly feature vector in each time spot is summarized into latent representation by unsupervised learning, which acting as a warm initialization for further classifier training; secondly a classifier is retrieved by supervised learning from these latent representation. In the last part of this study, performance on test set is reported, and a comparative experiment is introduced to prove the pros of latent representation.

Ii Method

Ii-a Data

ECG records are mainly from MIT Physionet database [4], including malignant ventricular arrhythmia database (VFDB) [5] and inbuilt normal sinus rhythm database.

Type Definition
Normal sinus normal heart rhythm
Asys no rate at least 4s
Tachy rate 140bpm for 17 beats
VF/VFL F-wave
VT rate in 100-250bpm, QRS span larger than 0.1s
TABLE I: Rhythm of interest and their definition

The rhythm of interest includes asystole (Asys), supraventricular tachycardia (Tachy), ventricular flutter or fibrillation (VF/VFL) and ventricular tachycardia (VT); the corresponding definition can be found in table I.

Ii-B Preprocessing

A proper chunk of limb lead signal is firstly selected (e.g. 13s) and re-sampled (e.g 200Hz); the chunk span should be set based on the definition of arrhythmia in section II-A, and enclose segmentation around annotation label indicating rhythm change. Then a high passed FIR filter is employed to remove baseline wandering (e.g. -24dB at 0.05Hz); Finally 2D spectrum is computed based on welch’s method (e.g. 1024 points of fast Fourier transform operated on 91% overlapped moving window with span of 60 samples).

Ii-C Network to learn the category

To annotate preprocessed signal, a network in Fig.1 is proposed. This novel net structure, to our best knowledge, is never found in relative literature before. Therefore it requires an end-to-end training instead of transfer learning for most of image AI task. The whole network includes two parts, which need to be trained separately. The first one is a deep representation net (1425 floats), trained by cost function Eq.1.


This part gets a feature vector (20 floats) from frequency vector (60 frequency bins) by residual unit (ResUnit)[6] and pooling, then tries to find a concise and robust representation of feature, drawn from Gaussian distribution in dense space (dimension = 8). The projection to dense space is retrieved by a variational autoencoder (VAE)[7], namely a pair of encoder-decoder highlighted in gray box in Fig.1.
Back to mathematics in Eq.1, encoder is denoted by and decoder is represented as , similar to the definition in Kingma et al’s work[7]. Therefore the first term in Eq.1 is normally referred as reconstruction loss, for a latent representation is drawn and used to reconstruct original input . The second term in Eq.1 is Kullback-Leibler divergence when approximate the posterior with a family of distributions , and it acts as a regularizer penalty to maintain latent distribution into sufficiently diverse clusters[7]. Therefore this term is normally referred as latent loss.
The second portion, denoted by pink box in Fig.1, is a deep classification net (11790 floats), trained by cross entropy loss function Eq.2.


where and is predicted and true label respectively, M is total category of labels, is indicator function and is predicted distribution. This part is compounded by a multilayer bi-directional RNN structure along with multiple dense layers. The attention window, introduced by Bahdanau et al [8], is polynomial lumped on RNN cell outputs except last layer, namely the span of attention window is increasing from bottom to top layer. Therefore, the model is learning from details to overall picture. This portion reuses the output from first portion, and gives a prediction on each sample.
In most cases, VAE serves as a traditional generative modeling. In our case, it’s used to generate a robust latent representation and decrease noise in each sample, since as denoted in Kingma et al’s study [7], VAE brings a constrain on the distribution of each latent cluster, and learns the distribution instead of a deterministic representation over training set. In later section a comparative test is presented to prove the effectiveness of VAE representation in this study.
Fig.1 shows a ’base’ network configuration, including : kernel and stride size of filtering marked on both sides of operator arrow; number of residual units (=1); dimension of latent representation (=8); hidden unit size in RNN cell (=10); number of layers in bidrectional RNN network (=4) and smallest attention window size lump on RNN output (=3). Later on, this ’base’ setup will be modified in the very same comparative test.

Fig. 1: Structure of proposed deep network. The highest dim of each data block represents size of batch (N), which is hidden in this diagram. The other three dimensions are height (H), width (W) and feature channels (C), the size of which can be read from footnote on each data box. For instance, first block shows a data batch with shape equal to , each sample is a 2D spectrum, with 1 gray channel, 60 frequency bins and 512 sampling points.

Ii-D Method to train the network

The original dataset after preprocessing, namely gray images indicating 2D spectrum of signal are randomly divided into several subsets (e.g. 5-fold) in each arrhythmia type. Network are tested on a certain subset after aggregating those samples from different arrhythmia types. On the other hand the remaining data are organized into mini batch and fed directly into model for training purpose, after over sampling in under-represented arrhythmia categories. Such design of data preparation are specific for imbalanced set in this study, namely the big difference of total samples falling in each arrhythmia categories. The mini batch size is equal to 140 in this study.
As mentioned in section II-C, the whole net is trained separately. The training of first part is fulfilled after a certain threshold of cost is met, then the first part is fixed and model start to be trained on second part. In VAE, the cost function is combined by two mathematics terms as in Eq.1. From these two parts, a fraction can be calculated to represent the portion of latent loss in total loss in each iteration. This fraction is coupled into the model training procedure in this study to dominate sampling in latent Gaussian distribution. In the very beginning, the reconstruction loss takes a large portion in total loss, therefore the fraction is small and latent variable is drawn from normal distribution during sampling instead of . By this manner, model convergences faster to a potential cluster center in several epochs and total loss decreases rapidly. Later on, latent loss gradually takes a dominant portion, and regularizer effect starts working.

Ii-E Method to deploy this model

The trained model is encapsulated into docker image, and deployed to either edge computer server or virtual private cloud (VPC). During the initialization of docker container in edge server, local working directory can be mounted, and data can be fed into model by local file I/O. As for the case in VPC, web RESTful API can be provided by flask or likely python component as major data I/O.

Iii Experiment and result

Fig. 2: Original and preprocessed ECG waveform along with its 2D spectrum and latent representation. (a-c) A VT sample , its 2D spectrum and its latent representation correspondingly; (d-f) a supraventricular tachycardia sample , its 2D spectrum and latent representation. The dimension of latent space equal to 8 in this setup. In the latent representation, each line represents an coordinate of latent space. Its amplitude gives the mean, and its width gives the standard deviation of a Gaussian distribution. Thus the whole picture gives the propagation of latent representation on time.
Type Asys Tachy VF/VFL VT
# PP/Tot 48/52 19/20 398/444 335/372
Sensitivity 0.92 0.95 0.89 0.90
Precision 0.74 0.70 0.96 0.90

TABLE II: Confusion matrix on #0 VFDB test set in 5 fold CV design (accuracy = 90%)
Type Asys Tachy VF/VFL VT
# PP/Tot 46/52 16/20 399/444 345/372
Sensitivity 0.88 0.80 0.89 0.93
Precision 0.82 0.72 0.94 0.89

TABLE III: Confusion matrix on #1 VFDB test set in 5 fold CV design (accuracy = 90%)

The model is trained and tested in 5-fold cross validation. Fig.2 shows the intermediate plotting on a VT sample and a Tachy sample, including: waveform before and after prefiltering, relative 2D spectrum and relative latent representation. Table II and III give the performance rates on test sets using a so-called ’base’ setup in section II-C. From these tables, it’s proven that proposed network has promising sensitivity in detection these arrhythmias, and good precision in two ventricular arrhythmias.

Moreover, a comparative experiment is conducted, and accuracy of model on test set is recorded for comparative plotting. During this experiment, different setup on network parameter, e.g. the number of hidden units in RNN cell or span of attention window are tested; in addition to these superparameters, two methods to get latent representation are compared too : using VAE to learn a statistic distribution or retrieving a deterministic latent variable from a simple dense projection. The results can be found in Fig.3. From this figure, it’s proven that to learn a latent distribution by VAE can significantly boost the speed of convergence and accuracy. Meanwhile, other configuration in setup has no significant difference on the model performance.

Fig. 3: Accuracy on test set under different setup during training. Comparing with setup termed as base, #1 increases hidden units size in each RNN cell from 10 to 15; #2 increases smallest attention window from 3 to 5, whistle decreases number of directional layers from 4 to 3; #3 and #4 has the same configuration comparing with #1 and #2 correspondingly, but use a dense projection instead of variational auto-encoder to get the latent representations. Therefore in #3 and #4, network is trained as a whole set.

Iv Conclusion

In this study, a brand-new deep network structure is proposed to annotate arrhythmia in ECG signal. This network includes two parts, one for extracting latent representation from feature vector in each time spot, and the other for predicting arrhythmia categories of interest. This network achieve around 90% sensitivity in asystole, ventricular flutter/fibrillation or ventricular tachycardia, and over 80% sensitivity in all 4 arrhythmias including supraventricular tachycardia. Moreover the test results show good precision rates in two ventricular arrhythmias. This network uses VAE to extract latent representation, which works as a warm initialization for classifier. This method is proven to significantly improve the speed of convergence and accuracy.


  1. S. Kiranyaz, T. Ince, and M. Gabbouj, “Real-time patient-specific ecg classification by 1-d convolutional neural networks,” IEEE Transactions on Biomedical Engineering, vol. 63, no. 3, pp. 664–675, 2016.
  2. M. M. Al Rahhal, Y. Bazi, H. AlHichri, N. Alajlan, F. Melgani, and R. R. Yager, “Deep learning approach for active classification of electrocardiogram signals,” Information Sciences, vol. 345, pp. 340–354, 2016.
  3. P. Rajpurkar, A. Y. Hannun, M. Haghpanahi, C. Bourn, and A. Y. Ng, “Cardiologist-level arrhythmia detection with convolutional neural networks,” arXiv preprint arXiv:1707.01836, 2017.
  4. A. L. Goldberger, L. A. Amaral, L. Glass, J. M. Hausdorff, P. C. Ivanov, R. G. Mark, J. E. Mietus, G. B. Moody, C.-K. Peng, and H. E. Stanley, “Physiobank, physiotoolkit, and physionet,” Circulation, vol. 101, no. 23, pp. e215–e220, 2000.
  5. S. D. Greenwald, “The development and analysis of a ventricular fibrillation detector,” Ph.D. dissertation, Massachusetts Institute of Technology, 1986.
  6. K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
  7. D. P. Kingma and M. Welling, “Auto-encoding variational bayes,” arXiv preprint arXiv:1312.6114, 2013.
  8. D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” arXiv preprint arXiv:1409.0473, 2014.
This is a comment super asjknd jkasnjk adsnkj
The feedback must be of minumum 40 characters
The feedback must be of minumum 40 characters
Comments 0
Request comment
The feedback must be of minumum 40 characters
Add comment
Loading ...

You are asking your first question!
How to quickly get a good answer:
  • Keep your question short and to the point
  • Check for grammar or spelling errors.
  • Phrase it like a question
Test description