Recognizing Abnormal Heart Sounds Using Deep Learning
The work presented here applies deep learning to the task of automated cardiac auscultation, i.e. recognizing abnormalities in heart sounds. We describe an automated heart sound classification algorithm that combines the use of time-frequency heat map representations with a deep convolutional neural network (CNN). Given the cost-sensitive nature of misclassification, our CNN architecture is trained using a modified loss function that directly optimizes the trade-off between sensitivity and specificity. We evaluated our algorithm at the 2016 PhysioNet Computing in Cardiology challenge where the objective was to accurately classify normal and abnormal heart sounds from single, short, potentially noisy recordings. Our entry to the challenge achieved a final specificity of 0.95, sensitivity of 0.73 and overall score of 0.84. We achieved the greatest specificity score out of all challenge entries and, using just a single CNN, our algorithm differed in overall score by only 0.02 compared to the top place finisher, which used an ensemble approach.
Advances in deep learning are being made at a rapid pace, in part due to challenges such as ILSVRC, the ImageNet Large-Scale Visual Recognition Challenge. Successive improvements in deep neural network architectures have resulted in computer vision systems that are better able to recognize and classify objects in images, and in winning ILSVRC entries. While a large focus of deep learning has been on automated analysis of image and text data, advances are also increasingly being seen in areas that require processing other input modalities. One such area is the medical domain, where inputs into a deep learning system could be physiologic time series data. An increasing number of large-scale challenges in the medical domain, such as the American Epilepsy Society Seizure Prediction Challenge and the Grasp-and-Lift EEG Detection challenge, have also resulted in improvements to deep learning architectures.
PhysioNet has held a Computing in Cardiology Challenge since 2000 that requires participants to automatically analyze physiologic time series data. The 2016 challenge asked participants to perform automated analysis of phonocardiogram (PCG) waveforms, i.e. heart sound data collected using digital stethoscopes. The objective of the challenge was to accurately classify normal and abnormal heart sounds. Recordings were collected from healthy individuals as well as from those with heart disease, including heart valve disease and coronary artery disease. A PCG plot showing the recording of the (normal) sounds made by the heart is given in Figure 1.
Heart disease remains the leading cause of death globally, with more people dying each year from cardiovascular disease than from any other cause. Successful automated PCG analysis can serve as a useful diagnostic tool to help determine whether an individual should be referred on for expert diagnosis, particularly in areas where access to clinicians and medical care is limited.
In this work, we present an algorithm that accepts PCG waveforms as input and uses a deep convolutional neural network architecture to classify inputs as either normal or abnormal using the following steps:
- 1. Segmentation of time series
A logistic regression hidden semi-Markov model is used to segment incoming heart sound instances into shorter segments beginning at the start of each heartbeat, i.e. the S1 heart sound.
- 2. Transformation of segments into heat maps
Using Mel-frequency cepstral coefficients, one-dimensional time series input segments are converted into two-dimensional spectrograms (heat maps) that capture the time-frequency distribution of signal energy.
- 3. Classification of heat maps using a deep neural network
A convolutional neural network is trained to perform automatic feature extraction and distinguish between normal and abnormal heat maps.
The contributions of this work are as follows:
We introduce a deep convolutional neural network architecture designed to automatically analyze physiologic time series data for the purposes of identifying abnormalities in heart sounds.
Given the cost-sensitive nature of misclassification, we describe a novel loss function used to train the above network that directly optimizes the sensitivity and specificity trade-off.
We present results from the 2016 PhysioNet Computing in Cardiology Challenge where we evaluated our algorithm and achieved a Top 10 place finish out of 48 teams who submitted a total of 348 entries.
The remainder of this paper is organized as follows. In Section 2, we discuss related work, including historical approaches to automated heart sound analysis and deep learning approaches that process physiologic time series input data. Section 3 introduces our approach and details each step of the algorithm. Section 4 further describes the modified cost-sensitive loss function used to trade off the sensitivity and specificity of the network’s predictions, followed by Section 5, which details the network training decisions and parameters. Section 6 presents results from the 2016 PhysioNet Computing in Cardiology Challenge. In Section 7 we provide a final discussion, and we end with conclusions in Section 8.
Before the 2016 PhysioNet Computing in Cardiology Challenge there were no existing approaches (to the authors’ knowledge) that applied the tools and techniques of “deep learning” to the automated analysis of heart sounds. Previous approaches relied upon a combination of feature extraction routines input into classic supervised machine learning classifiers. Features extracted from heart cycles in the time and frequency domains, as well as wavelet features, time-frequency and complexity-based features, were input into artificial neural networks and support vector machines for classification. Previous works have also employed Hidden Markov Models both for segmenting PCG signals into the fundamental heart sounds and for classifying normal and abnormal instances.
While there have been many previous efforts applied to automated heart sound analysis, gauging the success of historical approaches has been somewhat difficult, due to differences in dataset quality, number of recordings available for training and testing algorithms, recorded signal lengths and the environment in which data was collected (e.g. clinical vs. non-clinical settings). Moreover, some existing works have not performed appropriate train-test data splits and have reported results on training or validation data, which is highly likely to produce optimistic results due to overfitting. In this work, we report results from the 2016 PhysioNet Computing in Cardiology Challenge, which evaluated entries on a large hidden test-set that was not made publicly available. To reduce overfitting, no recordings from the same subject were included in both the training and the test set and a variety of both clean and noisy PCG recordings, which exhibited very poor signal quality, were included to encourage the development of accurate and robust algorithms.
The work presented in this paper is one of the first attempts at applying deep learning to the task of heart sound data analysis. However, there have been recent efforts to apply deep learning approaches to other types of physiological time series analysis tasks. An early work that applied deep learning to the domain of psychophysiology is described by Martínez et al. They advocate the use of preference deep learning for recognizing affect from physiological inputs such as skin conductance and blood volume pulse within a game-based user study. The authors argue against the use of manual ad-hoc feature extraction and selection in affective modeling, as this limits the creativity of attribute design to the researcher. One difference between the work of Martínez et al. and ours is that they perform an initial unsupervised pre-training step using stacked convolutional denoising auto-encoders, whereas our network does not require this step and is instead trained in a supervised fashion end-to-end.
Recall from Section 1 that our approach consists of three general steps: segmentation, transformation and classification. Each is described in detail below.
3.1 Segmentation of time series
The main goal of segmentation is to ensure that incoming time series inputs are appropriately aligned before attempting to perform classification. We first segment the incoming heart sound instances into shorter segments and locate the beginning of each heartbeat, i.e. the S1 heart sound. A logistic regression hidden semi-Markov model is used to predict the most likely sequence of heart sound states (S1 → Systole → S2 → Diastole) by incorporating information about expected state durations.
Once the heart sound marking the start of a heartbeat has been identified, a time segment of length, $T$, is extracted. Segment extraction can either be overlapping or non-overlapping. Our final model used a segment length of $T = 3$ seconds, and we chose to use overlapping segments as this led to performance improvements during initial training and validation.
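The extraction step can be sketched as follows. This is a minimal illustration rather than the challenge entry's code: the function name `extract_segments`, the `s1_onsets` input (sample indices of heartbeat onsets, assumed to come from the segmentation model) and the rule used for the non-overlapping mode are all assumptions made for the example.

```python
import numpy as np

def extract_segments(signal, fs, s1_onsets, seg_len_s=3.0, overlapping=True):
    """Cut fixed-length segments that each begin at a heartbeat onset.

    signal      : 1-D array of PCG samples
    fs          : sampling rate in Hz
    s1_onsets   : sample indices where a heartbeat starts (hypothetical input,
                  assumed to be produced by a segmentation model)
    seg_len_s   : segment length in seconds (3 s in the text)
    overlapping : if False, skip onsets that fall inside the last kept segment
    """
    signal = np.asarray(signal)
    seg_len = int(round(seg_len_s * fs))
    segments = []
    next_allowed = 0  # earliest start permitted in non-overlapping mode
    for onset in sorted(s1_onsets):
        if onset + seg_len > len(signal):
            break  # not enough samples left for a full segment
        if not overlapping and onset < next_allowed:
            continue  # would overlap the previously kept segment
        segments.append(signal[onset:onset + seg_len])
        next_allowed = onset + seg_len
    return np.stack(segments) if segments else np.empty((0, seg_len))
```

With overlapping extraction every heartbeat onset yields its own segment, so a single recording contributes many more training examples than with non-overlapping extraction.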
3.2 Transformation of segments into heat maps
Each segment is transformed from a one-dimensional time series signal into a two-dimensional heat map that captures the time-frequency distribution of signal energy. We chose to use Mel-frequency cepstral coefficients (MFCCs) to perform this transformation, as MFCCs capture features from audio data that more closely resemble how human beings perceive loudness and pitch. MFCCs are commonly used as a feature type in automatic speech recognition.
We apply the following steps to achieve the transformation:
Given an input segment of length, $T$, and sampling rate, $f_s$, select a window length, $\ell$, and step size, $\delta$, and extract overlapping sliding windows, $s_i(n)$, from the input time series segment, where $i$ is the window index and $n$ is the sample index. We chose a window length of 0.025 seconds and a step size of 0.01 seconds.
Compute the Discrete Fourier transform for each window,
$$S_i(k) = \sum_{n=1}^{N} s_i(n)\, h(n)\, e^{-2\pi \sqrt{-1}\, k n / K},$$
where $k \in [1, K]$, $K$ is the length of the DFT and $h(n)$ is a Hamming window of length $N$. The power spectral estimate for window, $i$, is then given by (Equation 1).
$$P_i(k) = \frac{1}{N} \left| S_i(k) \right|^2 \tag{1}$$
Apply a filterbank of, $M$, triangular band-pass filters, $H_m(k)$, to the power spectral estimates, $P_i(k)$, and sum the energies in each filter together. Include a log transformation as sound volume is not perceived on a linear scale.
$$E_i(m) = \log\left( \sum_{k=1}^{K} H_m(k)\, P_i(k) \right), \quad m = 1, \dots, M$$
We used a filterbank whose frequency ranges were derived using the Mel scale, which maps actual measured frequencies, $f$, to values that better match how humans perceive pitch, $\mathrm{Mel}(f) = 1125 \ln(1 + f/700)$.
Finally, apply a Discrete Cosine Transform to decorrelate the log filterbank energies, which are correlated due to overlapping windows in the Mel filterbank.
$$c_i(j) = \sum_{m=1}^{M} E_i(m) \cos\left( \frac{\pi j}{M} \left( m - \tfrac{1}{2} \right) \right), \quad j = 1, \dots, J$$
The result is a collection of cepstral coefficients, $c_i(j)$, for window, $i$. For $i = 1, \dots, I$, where $I$ is the number of windows, the coefficient vectors $c_i$ can be stacked together to give a time-frequency heat map that captures changes in signal energy over heart sound segments. Figure 2 illustrates two example heat maps (one derived from a normal heart sound input and the other from an abnormal input), where $c_i(j)$ is the MFCC value (represented by color) at location, $i$, on the horizontal axis and, $j$, on the (inverted) vertical axis.
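The transformation steps above can be sketched in NumPy as shown below. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function names, the 26-filter filterbank, the 512-point DFT, the Mel-scale constants and the choice to keep coefficients 1 to 6 are all assumed here. Only the window length (0.025 s), step size (0.01 s) and the six rows of the heat map come from the text.

```python
import numpy as np

def hz_to_mel(f):
    # One common form of the Mel scale (assumed constants).
    return 1125.0 * np.log(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (np.exp(m / 1125.0) - 1.0)

def mel_filterbank(n_filters, n_fft, fs):
    """Triangular filters whose edges are equally spaced on the Mel scale."""
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / fs).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        left, centre, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, centre):      # rising edge
            fbank[m - 1, k] = (k - left) / (centre - left)
        for k in range(centre, right):     # falling edge
            fbank[m - 1, k] = (right - k) / (right - centre)
    return fbank

def mfcc_heat_map(segment, fs, win_s=0.025, step_s=0.01,
                  n_filters=26, n_coeffs=6, n_fft=512):
    """Return an (n_coeffs, n_windows) MFCC heat map for one segment."""
    segment = np.asarray(segment, dtype=float)
    win, step = int(round(win_s * fs)), int(round(step_s * fs))
    n_frames = 1 + (len(segment) - win) // step
    # Index matrix selecting one overlapping window per row.
    idx = np.arange(win)[None, :] + step * np.arange(n_frames)[:, None]
    frames = segment[idx] * np.hamming(win)            # Hamming window
    spectrum = np.fft.rfft(frames, n=n_fft, axis=1)    # DFT of each window
    power = np.abs(spectrum) ** 2 / win                # power spectral estimate
    energies = power @ mel_filterbank(n_filters, n_fft, fs).T
    log_e = np.log(np.maximum(energies, 1e-10))        # log filterbank energies
    # DCT-II basis; coefficients 1..n_coeffs are kept (an assumed choice).
    m = np.arange(n_filters)[None, :]
    j = np.arange(1, n_coeffs + 1)[:, None]
    basis = np.cos(np.pi * j * (m + 0.5) / n_filters)
    return (log_e @ basis.T).T
```

Without padding, a 3-second segment framed this way gives 298 windows rather than 300, so reproducing a 6x300 heat map exactly would need a padding or framing convention that is not specified here.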
3.3 Classification of heat maps using a deep neural network
The result of transforming the original one-dimensional time-series into a two-dimensional time-frequency representation is that now each heart sound segment can be processed as an image, where energy values over time can be visualized as a heat map (see Figure 2). Convolutional neural networks are a natural choice for training image classifiers, given their ability to automatically learn appropriate convolutional filters. Therefore, we chose to train a convolutional neural network architecture using heat maps as inputs.
Decisions about the number of filters to apply and their sizes, as well as how many layers and their types to include in the network were made by a combination of initial manual exploration by the authors, followed by employing a random search over a limited range of network architecture parameters. Figure 3 depicts the network architecture of a convolutional neural network that accepts as input a single channel 6x300 MFCC heat map and outputs a binary classification, predicting whether the input segment represents a normal or abnormal heart sound.
The first convolutional layer learns 64 2x20 kernels, using same-padding. This is followed by applying a 1x20 max-pooling filter, using a horizontal stride of 5, which has the effect of reducing each of the 64 feature maps to a dimension of 6x60. A second convolutional layer applies 64 2x10 kernels over the previous layer, once again using same padding. This is again followed by a max-pooling operation using a filter size of 1x4 and a stride of 2, further reducing each feature map to a dimension of 6x30. At this stage in the architecture a flattening operation is applied that unrolls each of the 64 6x30 feature maps into a single dimensional vector of size 11,520. This feature vector is fed into a first fully connected layer consisting of 1024 hidden units, followed by a second layer of 512 hidden units and finally a binary classification output.
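The feature-map sizes quoted above can be checked with a few lines of arithmetic. The sketch below assumes that the pooling layers also use same-padding, so that an axis of length n with stride s produces ceil(n / s) outputs, and that the output layer has two units; neither detail is stated explicitly in the text. The helper names are hypothetical.

```python
import math

def same_out(size, stride):
    """Output length along one axis under 'same' padding: ceil(size / stride)."""
    return math.ceil(size / stride)

def trace_shapes(height=6, width=300, channels=64):
    """Return (layer name, output shape) pairs for the described network."""
    shapes = [("input", (1, height, width))]
    # Conv 1: 64 kernels of 2x20, stride 1, same padding keeps 6x300.
    h, w = same_out(height, 1), same_out(width, 1)
    shapes.append(("conv1", (channels, h, w)))
    # Pool 1: 1x20 window, horizontal stride 5 (vertical stride 1 assumed).
    h, w = same_out(h, 1), same_out(w, 5)
    shapes.append(("pool1", (channels, h, w)))
    # Conv 2: 64 kernels of 2x10, stride 1, same padding keeps the size.
    shapes.append(("conv2", (channels, h, w)))
    # Pool 2: 1x4 window, horizontal stride 2 (vertical stride 1 assumed).
    h, w = same_out(h, 1), same_out(w, 2)
    shapes.append(("pool2", (channels, h, w)))
    # Flatten all feature maps into one vector.
    shapes.append(("flatten", (channels * h * w,)))
    shapes.append(("fc1", (1024,)))
    shapes.append(("fc2", (512,)))
    shapes.append(("output", (2,)))  # two-way softmax output (assumed)
    return shapes

if __name__ == "__main__":
    for name, shape in trace_shapes():
        print(f"{name:8s} {shape}")
```

With unpadded pooling the widths would come out as 57 and 29 instead of 60 and 30, which is why the same-padding assumption matters for matching the 11,520-element feature vector.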
4 Sensitivity-Specificity Loss Trade-off
The loss function of the network was altered from a standard softmax cross entropy loss function to one that instead directly trades off sensitivity against specificity.
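Given unnormalized log-probabilities, $y = Wx + b$, from a classifier consisting of weight matrix, $W$, and bias, $b$, the softmax function
$$\operatorname{softmax}(y)_i = \frac{e^{y_i}}{\sum_{j} e^{y_j}}$$
gives probability predictions for the class at index, $i$, for input $x$.
$\hat{Y}_{ij}$ refers to the $j$th entry of row $i$ of the matrix of softmax predictions, $\hat{Y}$, and $Y$ is the corresponding one-hot encoded matrix of actual class labels.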
For the binary class labels of normal and abnormal, we define two mask matrices whose entries are softmax prediction values extracted from the matrix of predictions.
We then define softmax sensitivity and softmax specificity in terms of these masked predictions.
The final loss function we wish to minimize is given in (Equation 2).
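To make the idea concrete, the sketch below shows one plausible way to build a differentiable sensitivity-specificity loss from softmax outputs. It is an assumption-laden illustration and not necessarily the paper's Equation 2: the "soft" confusion counts, the column order (normal, abnormal), the `eps` stabilizer and the final form `-(sensitivity + specificity)` are all choices made for this example.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax with the usual max-subtraction for stability."""
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def sens_spec_loss(logits, labels_onehot, eps=1e-8):
    """Soft sensitivity/specificity loss (illustrative formulation).

    logits        : (n, 2) unnormalized scores, columns = [normal, abnormal]
    labels_onehot : (n, 2) one-hot true labels in the same column order
    Returns (loss, soft sensitivity, soft specificity).
    """
    probs = softmax(np.asarray(logits, dtype=float))
    labels = np.asarray(labels_onehot, dtype=float)
    normal, abnormal = labels[:, 0], labels[:, 1]
    # Mask the predictions by true class to get "soft" confusion counts.
    soft_tp = np.sum(abnormal * probs[:, 1])  # abnormal predicted abnormal
    soft_fn = np.sum(abnormal * probs[:, 0])  # abnormal predicted normal
    soft_tn = np.sum(normal * probs[:, 0])    # normal predicted normal
    soft_fp = np.sum(normal * probs[:, 1])    # normal predicted abnormal
    sensitivity = soft_tp / (soft_tp + soft_fn + eps)
    specificity = soft_tn / (soft_tn + soft_fp + eps)
    # Minimizing the negative sum rewards both quantities equally.
    return -(sensitivity + specificity), sensitivity, specificity
```

Because each term is normalized by the number of examples in its own class, a loss of this kind is not dominated by the majority class in the way an unweighted cross entropy can be.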
Regularization was computed for each of the fully connected layers’ weight and bias matrices and applied to the loss function. Dropout was applied within both fully connected layers. Table 1 shows the values of hyper-parameters chosen by performing a random search through parameter space, as well as a list of other network training choices, including weight updates and use of regularization. Adam optimization was used to perform weight updates. Models were trained on a single NVIDIA GPU with between 4 and 6 GB of memory. A mini-batch size of 256 was selected to satisfy the memory constraints of the GPU.
|Network parameters|Value|
|---|---|
|Weight Update|Adam Optimization|
The overall dataset used within the PhysioNet Computing in Cardiology Challenge was provided by the challenge organizers and consisted of eight heart sound databases collected from seven countries over a period of more than a decade. In total, 4,430 recordings were taken from 1,072 subjects, resulting in 30 hours of heart sound recordings. From this total dataset, 1,277 heart sound recordings from 308 subjects were removed to be used as held-out test data for evaluating challenge submissions. The test dataset was not made publicly available and challengers were only allowed to make 15 submissions, in total, to the challenge server to evaluate their models on a small 20% subset of the hidden dataset, before final results were computed. The number of allowed submissions was limited to avoid the issue of participants implicitly overfitting their models on the hidden test dataset.
From the 3,153 publicly available PCG waveforms supplied by the challenge organizers, the authors set aside a further 301 instances to be used as a local held-out test-set to gauge model performance before making a submission to the challenge server. The remaining instances were used to train initial models. Models were trained on the overlapping 3-second MFCC segments extracted from the remaining 2,852 PCG waveforms. This resulted in approximately 90,000 MFCC heat maps, which were split into a training set and a validation set. This training and validation set was unbalanced, consisting of approximately 80% normal segments and 20% abnormal segments. Training was performed on the unbalanced dataset and no attempt was made to compensate for this class imbalance.
Given that each model was trained on 3-second MFCC heat map segments, it was necessary to stitch together a collection of predictions to classify a single full instance. The simple strategy of averaging each class’s prediction probability was employed and the class with the greatest probability was selected as the final prediction.
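The averaging strategy can be sketched as below. The function name, the class order and the example probabilities are made up for illustration; only the rule itself (average each class's probability over all segments, then pick the larger) comes from the text.

```python
import numpy as np

def classify_recording(segment_probs, class_names=("normal", "abnormal")):
    """Combine per-segment class probabilities into one recording-level label.

    segment_probs : (n_segments, n_classes) array of per-segment predictions
    Returns (predicted class name, mean probability per class).
    """
    segment_probs = np.asarray(segment_probs, dtype=float)
    mean_probs = segment_probs.mean(axis=0)  # average each class over segments
    return class_names[int(np.argmax(mean_probs))], mean_probs

# Hypothetical predictions for three segments of one recording.
label, mean_probs = classify_recording([[0.9, 0.1], [0.4, 0.6], [0.3, 0.7]])
```

In this made-up example two of the three segments individually favor the abnormal class, yet the averaged probabilities favor normal, which shows how averaging differs from a simple majority vote over segments.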
|Rank|Sensitivity|Specificity|Overall|Description|
|---|---|---|---|---|
|1|0.9424|0.7781|0.8602|AdaBoost & CNN|
|2|0.8691|0.849|0.859|Ensemble of SVMs|
|3|0.8743|0.8297|0.852|Regularized Neural Networks|
|4|0.8639|0.8269|0.8454|MFCCs, Wavelets, Tensors & kNN|
|5|0.8848|0.8048|0.8448|Random Forest + LogitBoost|
|8|0.7278|0.9521|0.8399|Our Approach|
|43|0.6545|0.7569|0.7057|Provided Benchmark Entry|
Equations (Equation 3) and (Equation 4) show the modified sensitivity and specificity scoring metrics that were used to assess the submitted entries to the 2016 PhysioNet Computing in Cardiology Challenge.
$$Se = w_{a1} \frac{Aa_1}{Aa_1 + Aq_1 + An_1} + w_{a2} \frac{Aa_2 + Aq_2}{Aa_2 + Aq_2 + An_2} \tag{3}$$
$$Sp = w_{n1} \frac{Nn_1}{Na_1 + Nq_1 + Nn_1} + w_{n2} \frac{Nn_2 + Nq_2}{Na_2 + Nq_2 + Nn_2} \tag{4}$$
Uppercase symbols reflect the true class label, which could either be ($A$)bnormal or ($N$)ormal. Lowercase symbols refer to a classifier’s predicted output where, once again, $a$ is abnormal, $n$ is normal and $q$ is a prediction of unsure. A subscript of 1 (e.g. $Aa_1$, $Nn_1$) refers to heart sound instances that were considered good signal quality by the challenge organizers and a subscript of 2 (e.g. $Aa_2$, $Nn_2$) refers to heart sound instances that were considered poor signal quality by challenge organizers. Finally, the weights used to calculate sensitivity, $w_{a1}$ and $w_{a2}$, capture the percentages of good signal quality and poor signal quality recordings in all abnormal recordings. Correspondingly for specificity, the weights $w_{n1}$ and $w_{n2}$ are the proportions of good signal quality and poor signal quality recordings in all normal recordings. Overall scores are given by $(Se + Sp)/2$.
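The scoring described above can be sketched as a small function. This follows our reading of the challenge's published scoring rules and should be treated as an assumption rather than the official scorer: in particular, it assumes that "unsure" predictions count as correct only for poor-quality recordings. The dictionary keys (such as `"Aa1"`) and the example counts are hypothetical.

```python
def challenge_score(counts):
    """Quality-weighted sensitivity, specificity and overall score.

    counts maps keys such as 'Aa1' to a number of recordings, where the
    first letter is the true label (A/N), the second the prediction
    (a/n/q for abnormal/normal/unsure) and the digit the signal quality
    (1 = good, 2 = poor). Missing keys count as zero.
    """
    def c(key):
        return counts.get(key, 0)

    def ratio(num, den):
        return num / den if den else 0.0  # guard against empty groups

    abn1 = c("Aa1") + c("Aq1") + c("An1")  # abnormal, good quality
    abn2 = c("Aa2") + c("Aq2") + c("An2")  # abnormal, poor quality
    nor1 = c("Na1") + c("Nq1") + c("Nn1")  # normal, good quality
    nor2 = c("Na2") + c("Nq2") + c("Nn2")  # normal, poor quality

    # Weights: share of good / poor quality recordings within each class.
    wa1, wa2 = ratio(abn1, abn1 + abn2), ratio(abn2, abn1 + abn2)
    wn1, wn2 = ratio(nor1, nor1 + nor2), ratio(nor2, nor1 + nor2)

    # "Unsure" is credited only for poor-quality recordings (assumed rule).
    se = wa1 * ratio(c("Aa1"), abn1) + wa2 * ratio(c("Aa2") + c("Aq2"), abn2)
    sp = wn1 * ratio(c("Nn1"), nor1) + wn2 * ratio(c("Nn2") + c("Nq2"), nor2)
    return se, sp, (se + sp) / 2.0

# Hypothetical counts, purely for illustration.
example = {"Aa1": 8, "Aq1": 1, "An1": 1, "Aa2": 2, "Aq2": 1, "An2": 2,
           "Nn1": 18, "Nq1": 1, "Na1": 1, "Nn2": 3, "Nq2": 1, "Na2": 1}
se, sp, overall = challenge_score(example)
```

A classifier that never outputs "unsure", such as the binary network described in this paper, simply has zero counts for every key containing `q`.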
Table 2 shows a selected subset of the results for the 2016 PhysioNet Computing in Cardiology Challenge. For each selected entry, sensitivity, specificity and overall scores are shown, as well as the entry’s final ranking and a brief description of its approach. In total, 348 entries were submitted by 48 teams. Our entry, as described by the algorithm presented in this paper, was ranked 8th with a sensitivity of 0.7278 and specificity of 0.9521, giving an overall score of 0.8399. The top entry to the competition achieved sensitivity of 0.9424, specificity of 0.7781 for an overall score of 0.8602. Also included in Table 2 is the result of a benchmark entry that was supplied by the challenge organizers, which ranked 43rd overall, with a sensitivity of 0.6545 and specificity of 0.7569, for an overall score of 0.7057.
Table 2 shows that the overall scores for the top entries to the PhysioNet Computing in Cardiology challenge were very close. In particular, our entry, which achieved an 8th place ranking, differed in score by only 0.02 from the top place finisher. For our entry, the overall score of 0.8399 was achieved using a single convolutional neural network, whereas other top place finishers achieved strong classification accuracies using an ensemble of classifiers. Improvements in performance have often been observed when using an ensemble of networks or separate classifiers, and we leave this for future work. For practical purposes, a diagnostic tool that relies on only a single network, as opposed to a large ensemble, has the advantage of limiting the amount of computational resources required for classification. Deployment of such a diagnostic tool on platforms that impose restricted computational budgets, e.g. mobile platforms, could perhaps benefit from such a trade-off between accuracy and computational cost.
Another point of interest is that our entry to the PhysioNet Computing in Cardiology challenge achieved the greatest specificity score (0.9521) out of all challenge entries. However, the network architecture produced a lower sensitivity score (0.7278). Once again, considering the practical result of deploying a diagnostic tool that relied upon our algorithm, this would likely result in a system with few false positives, but at the expense of misclassifying some abnormal instances. Final decisions about the trade-off between sensitivity and specificity would require further consideration of the exact conditions and context of the deployment environment.
A final point of discussion and area of future improvement is that the approach presented was limited to binary decision outputs, i.e. either normal or abnormal heart sounds. An architecture that also considered signal quality as an output would likely result in performance improvement.
The work presented here is one of the first to apply deep convolutional neural networks to the task of automated heart sound classification for recognizing normal and abnormal heart sounds. We have presented a novel algorithm that combines a CNN architecture with MFCC heat maps that capture the time-frequency distribution of signal energy. The network was trained to automatically distinguish between normal and abnormal heat map inputs and it was designed to optimize a loss function that directly considers the trade-off between sensitivity and specificity. We evaluated the approach by submitting our algorithm as an entry to the 2016 PhysioNet Computing in Cardiology Challenge. The challenge required the creation of accurate and robust algorithms that could deal with heart sounds that exhibit very poor signal quality. Overall, our entry to the challenge achieved a Top-10 place finish out of 48 teams who submitted 348 entries. Moreover, using just a single CNN, our algorithm differed by a score of at most 0.02 compared to other top place finishers, all of which used an ensemble approach of some kind.
- Detection of cardiac abnormality from pcg signal using lms based least square svm classifier.
Samit Ari, Koushik Hembram, and Goutam Saha. Expert Systems with Applications, 37(12):8019–8026, 2010.
- A classifier based on the artificial neural network approach for cardiologic auscultation in pediatrics.
Sanjay R Bhatikar, Curt DeGroff, and Roop L Mahajan. Artificial intelligence in medicine, 33(3):251–260, 2005.
- Classification of normal/abnormal heart sound recordings: the physionet/computing in cardiology challenge 2016.
Gari D Clifford, CY Liu, Benjamin Moody, David Springer, Ikaro Silva, Qiao Li, and Roger G Mark. Computing in Cardiology, pages 609–12, 2016.
- Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences.
Steven Davis and Paul Mermelstein. IEEE transactions on acoustics, speech, and signal processing, 28(4):357–366, 1980.
- Automated pediatric cardiac auscultation.
Jacques P De Vos and Mike M Blanckenberg. IEEE Transactions on Biomedical Engineering, 54(2):244–252, 2007.
- Automatic detection of voice impairments by means of short-term cepstral parameters and neural network based detectors.
Juan Ignacio Godino-Llorente and P Gomez-Vilda. IEEE Transactions on Biomedical Engineering, 51(2):380–384, 2004.
- PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals.
A. L. Goldberger, L. A. N. Amaral, L. Glass, J. M. Hausdorff, P. Ch. Ivanov, R. G. Mark, J. E. Mietus, G. B. Moody, C.-K. Peng, and H. E. Stanley. Circulation, 101(23):e215–e220, 2000.
- Deep, convolutional, and recurrent models for human activity recognition using wearables.
Nils Y. Hammerla, Shane Halloran, and Thomas Plötz. In Subbarao Kambhampati, editor, Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, IJCAI 2016, New York, NY, USA, 9-15 July 2016, pages 1533–1540. IJCAI/AAAI Press, 2016.
- Deep residual learning for image recognition.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. CoRR, abs/1512.03385, 2015.
- American epilepsy society seizure prediction challenge.
Kaggle. https://www.kaggle.com/c/seizure-prediction, 2014.
- Grasp-and-Lift EEG Detection.
Kaggle. https://www.kaggle.com/c/grasp-and-lift-eeg-detection, 2015.
- Adam: A method for stochastic optimization.
Diederik P. Kingma and Jimmy Ba. CoRR, abs/1412.6980, 2014.
- Deep learning.
Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Nature, 521(7553):436–444, 2015.
- Recurrent convolutional neural network for object recognition.
Ming Liang and Xiaolin Hu. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, June 7-12, 2015, pages 3367–3375. IEEE Computer Society, 2015.
- Network in network.
Min Lin, Qiang Chen, and Shuicheng Yan. CoRR, abs/1312.4400, 2013.
- An open access database for the evaluation of heart sound algorithms.
Chengyu Liu, David Springer, Qiao Li, Benjamin Moody, Ricardo Abad Juan, Francisco J Chorro, Francisco Castells, José Millet Roig, Ikaro Silva, Alistair EW Johnson, et al. Physiological Measurement, 37(12):2181, 2016.
- Support vectors machine-based identification of heart valve diseases using heart sounds.
Ilias Maglogiannis, Euripidis Loukis, Elias Zafiropoulos, and Antonis Stasis. Computer methods and programs in biomedicine, 95(1):47–61, 2009.
- Learning deep physiological models of affect.
Héctor Perez Martínez, Yoshua Bengio, and Georgios N. Yannakakis. IEEE Comp. Int. Mag., 8(2):20–33, 2013.
- Comparing svm and convolutional networks for epileptic seizure prediction from intracranial eeg.
Piotr W Mirowski, Yann LeCun, Deepak Madhavan, and Ruben Kuzniecky. In 2008 IEEE Workshop on Machine Learning for Signal Processing, pages 244–249. IEEE, 2008.
- ImageNet Large Scale Visual Recognition Challenge.
Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015.
- Hidden markov model-based classification of heart valve disease with PCA for dimension reduction.
Ridvan Saraçoglu. Eng. Appl. of AI, 25(7):1523–1528, 2012.
- Computerized screening of children congenital heart diseases.
Amir A Sepehri, Joel Hancq, Thierry Dutoit, Arash Gharehbaghi, Armen Kocharian, and A Kiani. Computer methods and programs in biomedicine, 92(2):186–192, 2008.
- Support vector machine hidden semi-markov model-based heart sound segmentation.
David B Springer, Lionel Tarassenko, and Gari D Clifford. In Computing in Cardiology 2014, pages 625–628. IEEE, 2014.
- Logistic regression-hsmm-based heart sound segmentation.
David B Springer, Lionel Tarassenko, and Gari D Clifford. IEEE Transactions on Biomedical Engineering, 63(4):822–832, 2016.
- Going deeper with convolutions.
Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott E. Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. CoRR, abs/1409.4842, 2014.
- Rethinking the inception architecture for computer vision.
Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. CoRR, abs/1512.00567, 2015.
- Adaptive neuro-fuzzy inference system for diagnosis of the heart valve diseases using wavelet transform with entropy.
Harun Uğuz. Neural Computing and applications, 21(7):1617–1628, 2012.
- A biomedical system based on artificial neural network and principal component analysis for diagnosis of the heart valve diseases.
Harun Uğuz. Journal of medical systems, 36(1):61–72, 2012.
- Phonocardiographic signal analysis method using a modified hidden markov model.
Ping Wang, Chu Sing Lim, Sunita Chauhan, Jong Yong A Foo, and Venkataraman Anantharaman. Annals of Biomedical Engineering, 35(3):367–374, 2007.
- Cardiovascular diseases (cvds).
World Health Organization. http://who.int/mediacentre/factsheets/fs317/en/, 2017.
- A novel hybrid energy fraction and entropy-based approach for systolic heart murmurs identification.
Yineng Zheng, Xingming Guo, and Xiaorong Ding. Expert Systems with Applications, 42(5):2710–2721, 2015.