Detecting and interpreting myocardial infarctions using fully convolutional neural networks
We consider the detection of myocardial infarction in electrocardiography (ECG) data as provided by the PTB ECG database without non-trivial preprocessing. The classification is carried out using deep neural networks in a comparative study involving convolutional as well as recurrent neural network architectures. The best architecture, an ensemble of fully convolutional architectures, beats state-of-the-art results on this dataset and reaches 93.3% sensitivity and 89.7% specificity evaluated with 10-fold crossvalidation, which is the performance level of human cardiologists for this task. We investigate questions relevant for clinical applications such as the dependence of the classification results on the considered data channels and the considered subdiagnoses. Finally, we apply attribution methods to gain an understanding of the network’s decision criteria on an exemplary basis.
Ischaemic heart diseases are the leading cause of death in Europe. The most prominent entity of this group is acute myocardial infarction (MI), where blood supply to parts of the heart muscle is permanently interrupted due to an occluded coronary artery. Early detection is crucial for the effective treatment of acute myocardial infarction with percutaneous coronary intervention (PCI) or coronary artery bypass surgery. Diagnosis is usually made with the help of clinical findings, laboratory results and electrocardiography. ECGs are produced by recording electrical potentials of defined positions of the body surface over time, representing the electric activity of the heart. Deviations from the usual shape of the ECG curves can be indicative of myocardial infarction as well as many other cardiac and non-cardiac conditions. ECGs are a popular diagnostic tool as they are non-invasive and inexpensive to produce but have high diagnostic value.
Clinically, cases of myocardial infarction fall into one of two categories, ST elevation myocardial infarction (STEMI) and non-ST elevation myocardial infarction (NSTEMI), depending on whether or not the ECG exhibits a specific sign called ST elevation. The former can and should be treated as soon as possible with PCI, whereas the NSTEMI diagnosis has to be confirmed with time-consuming laboratory tests before specific treatment can be initiated. Since waiting for these results can delay effective treatment by hours, a more detailed analysis of the ECG could speed up this process significantly.
Failure to identify high-risk ECG findings in the emergency department is common and has grave consequences. To increase accuracy, speed or economic efficiency, different algorithms have been proposed to automatically detect myocardial infarction in recorded ECGs. Algorithms with adequate performance would offer significant advantages: they could be applied by untrained personnel in situations where no cardiologist is available; once set up they would be highly reliable and inexpensive; they could be tuned to specific decision boundaries, for example to high sensitivity (low specificity) for screening purposes.
Common ECG classification algorithms usually mimic the approach a human physician would take: First, preprocessing steps include correction of baseline deviations, some kind of noise reduction and the segmentation of single heart beats. In the next step, hand-engineered features such as predefined or automatically detected time intervals and voltage values are extracted from the preprocessed signal. Finally, the classification is carried out with a variety of common classifiers such as simple cutoff values, support vector machines or neural networks. Preprocessing and feature extraction are non-trivial steps with technical and methodical problems, especially with unusual heart rhythms or corrupted data, resulting in a high risk of information loss. This calls for a more unified and less biased algorithmic approach to this problem.
Deep neural networks [3, 4] and in particular convolutional neural networks have been the driving force behind the tremendous advances in computer vision [5, 6, 7, 8, 9, 10] in recent years. Consequently, related methods have also been applied to the problem of time series classification in general and ECG classification tasks specifically. Even though we focus exclusively on ECG classification in this work, we stress that the methodology put forward here can be applied to generic time series classification problems, in particular to those that satisfy the following conditions:
multichannel input data
continuous sequence (with no start/end points in the sequence) with a degree of periodicity
unprocessed / unsegmented data
These three criteria define a subclass within general time series classification problems that is important for many real world problems. In particular, these criteria include raw sensor data from medical monitoring such as ECG or EEG.
The main contributions presented in this paper are the following:
We put forward a fully convolutional neural network for myocardial infarction detection on the PTB dataset [11, 12] focusing on the clinically most relevant case of 12 leads. It outperforms state-of-the-art literature approaches [13, 14] and reaches the performance level of human cardiologists reported in an earlier comparative study.
We study in detail the classification performance on subdiagnoses and investigate channel selection and its clinical implications.
We apply state-of-the-art attribution methods to investigate the patterns underlying the network’s decision and draw parallels to cardiologists’ rules for identifying myocardial infarctions.
II Related works
Turning to timeseries classification in general, we focus on time series classification using deep neural networks and do not discuss traditional methods in detail, see e.g.  for a recent review. Hüsken and Stagge  use recurrent neural networks for time series classification. Wang et al  use different mainly convolutional networks for time series classification and achieve state-of-the-art results in comparison to traditional methods applied to the UCR Time Series Classification Archive datasets . Cui et al  use a sliding window approach similar to the one applied in this work and feed differently downsampled series into a multi-scale convolutional neural network also reaching state-of-the-art results on UCR datasets. Also recurrent neural networks have been successfully applied to time series classification problems in the clinical context . More recent works include attention  and more elaborate combinations of convolutional and recurrent architectures [23, 24].
More specifically, turning to myocardial infarction detection in ECG recordings, many proposed algorithms rely on classical machine learning methods for classification after initial preprocessing and feature extraction [25, 26, 27, 28, 29, 30, 31]. Particularly noteworthy are [32, 13], which operate on Wavelet-transformed signals. Whereas the above works used neural networks at most as classifiers on top of previously extracted features [33, 28, 34, 29], there are works that apply neural networks directly as feature extractors to beat-level separated ECG signals [35, 36]. These have to be distinguished from approaches such as the one considered in this work, where deep neural networks are applied to the raw ECG with at most minor preprocessing steps. In this direction Zheng et al. present an approach based on convolutional neural networks for multichannel time series classification similar to ours, but apply it to ECGs in the context of congestive heart failure classification. The most recent work on myocardial infarction detection using deep neural networks also uses convolutional architectures applied to three-channel input data. A quantitative comparison to their results is presented in Sec. V. Similar performance was reported in  who used LSTMs on augmented channel data obtained from a generative model.
Recently, convolutional neural networks were used for arrhythmia detection on the largest ECG dataset considered to date reaching human level performance . Progress in this direction is boosted by Computing in Cardiology Challenges , where the most recent focused on arrhythmia classification from single lead ECGs.
III Dataset and medical background
The best-known collection of standard datasets for time series classification is provided by the UCR Time Series Classification Archive . Although many benchmarks are available for the contained datasets , we intentionally decided in favor of a different dataset as the UCR datasets contain only comparably short and not necessarily periodic sequences and are almost exclusively single-channel data. The same applies to various benchmarks datasets [19, 41, 42] considered for example in , which do not match the criteria put forward in the introduction. In particular the requirement of continuous data with no predefined start and end points that shows a certain degree of periodicity is rarely found in existing datasets in particular not in combination with the other two requirements from above.
We advocate in-depth studies of more complex datasets that are more representative for real-world situations and therefore concentrate our study on ECG data provided by the PTB Diagnostic ECG Database [11, 12]. It is one of the few freely available datasets that meets the conditions from above. The dataset comprises 549 records from 290 subjects. For completeness we present an overview about the sample sizes for different diagnoses broken down according to classifications in Tab. I. For this study we only aim to discriminate between healthy control and myocardial infarction and therefore only take into account records classified as either of these two diagnosis classes. Note that 22 ECGs from 21 patients with unknown localization and infarction status were excluded from the analysis.
|Bundle branch block||15|
|Valvular heart disease||6|
For some patients classified as myocardial infarction the dataset includes multiple records of highly variable age and in some cases even ECGs recorded after the medical intervention. The most conservative choice would be to exclude all myocardial infarction ECGs after the intervention and within a preferably short threshold after the infarction. Such a dataset would be most representative for the detection of fresh myocardial infarctions in a clinical context, but would seriously reduce the already small dataset. As a compromise we decided to keep all healthy records but just the first ECG from patients with myocardial infarction. Note that a selection based on ECG age is not applicable here as the full metadata is not provided for all records. For the ECGs where the full metadata is provided this selection leads to a median (interquartile range) of the infarction age of days with 18% of them taken after intervention. On the contrary including all infarction ECGs would result in a median of days of which were taken after intervention. These figures render the second, most commonly employed, selection questionable for an early infarction detection problem. In summary, our selection leaves us with a dataset of 127 records classified as myocardial infarction and 80 records (from 52 patients) classified as healthy control.
|subdiagnosis/localization||# patients||# samples(selected)|
|Healthy control||52||80 (80)|
For the case of myocardial infarctions the dataset distinguishes different subdiagnoses corresponding to the localization of the infarction, see Tab. II, with smooth transitions between certain subclasses. It is therefore not reasonable to expect to be able to train a classifier that is able to distinguish records into all these subclasses based on the rather small number of records in certain cases. We therefore decided to distinguish just two classes that we colloquially designate as anterior myocardial infarction (aMI) and inferior myocardial infarction (iMI), see Tab. II for a detailed breakdown. This grouping models the most common anatomical variant of myocardial vascular supply with the left coronary artery supplying the regions noted in the aMI group and the right coronary artery supplying those in the iMI group . If not noted otherwise we only use the subdiagnoses information for stratified sampling of records into crossvalidation folds and just discriminate between healthy control and myocardial infarction. In Sec. V-B we specifically investigate the impact of the above subdiagnoses on the classification performance. The fact that the inferior and anterior myocardial infarction can be distinguished rather well represents a further a posteriori justification for our assignment.
The PTB Database provides 15 simultaneously measured channels for each record: six limb leads (Einthoven: I, II, III, and Goldberger: aVR, aVL, aVF), six precordial leads (Wilson: V1, V2, V3, V4, V5, V6) and the three Frank leads (vx, vy, vz). As the six limb leads are linear combinations of just two measured voltages (e.g. I and II) we discard all but two limb leads. Frank leads are rarely used in the clinical context. Consequently in our analysis we only take into account eight leads that are conventionally available in clinical applications and non-redundant (I, II, V1, V2, V3, V4, V5, V6). This is done in spite of the fact that using the full although clinically less relevant set of channels can lead to an even higher classification performance, see the analysis in Sec. V-B3 where the lead selection is discussed in detail.
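The redundancy of the limb leads can be made explicit with a short sketch. It assumes the standard Einthoven and Goldberger relations (III = II − I, aVR = −(I + II)/2, aVL = I − II/2, aVF = II − I/2); the function name and the array layout are illustrative and not taken from the implementation used in this work.

```python
import numpy as np

def derive_limb_leads(lead_i, lead_ii):
    """Reconstruct the four remaining limb leads from leads I and II.

    Uses the standard Einthoven/Goldberger relations; inputs are 1-D
    arrays of equal length (voltage over time).
    """
    lead_i = np.asarray(lead_i, dtype=float)
    lead_ii = np.asarray(lead_ii, dtype=float)
    return {
        "III": lead_ii - lead_i,
        "aVR": -(lead_i + lead_ii) / 2.0,
        "aVL": lead_i - lead_ii / 2.0,
        "aVF": lead_ii - lead_i / 2.0,
    }
```

Since all four derived leads are fixed linear functions of I and II, they carry no additional information for a classifier that already sees I and II.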
IV Classifying ECG using deep neural networks
IV-A Algorithmic procedure
As discussed in the previous section, time series classification in a realistic setting has to be able to cope with time series that are so large that they cannot be used as input to a single neural network or that cannot be downsampled to reach this state without losing too much information. At this point two different procedures are conceivable: Either one uses attention-based models that can focus on regions of interest, see e.g. [23, 24], or one extracts random subsequences from the original time series. For reasons of simplicity and with real-time on-site analysis in mind we explore only the latter possibility, which is only applicable for signals that exhibit a certain degree of periodicity. The assumption underlying this approach is that the characteristics leading to a certain classification are present in every random subsequence. We stress at this point that this procedure does not rely on the identification of beginning and endpoints of certain patterns in the window. It can be justified a posteriori by the reasonable accuracies and specificities that can be reached with it. Furthermore, from a medical point of view it is reasonable to assume that ECG characteristics do not change drastically within the time frame of any single recording.
The procedure leaves two hyperparameters: the choice of the window size and an optional downsampling rate to reduce the temporal input dimension for the neural network. As the dataset is not large enough for extensive hyperparameter optimizations we decided to work with a fixed window size of 4 seconds downsampled to an input size of 192 pixels for each sequence. The window size is sufficiently large to capture at least three heartbeats at normal heart rates.
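The window extraction described above can be sketched as follows. This is a minimal illustration, not the original implementation: the function name, the `(time, channels)` array layout and the use of linear interpolation for downsampling are assumptions, as the text does not specify the resampling method.

```python
import numpy as np

def random_window(record, fs, window_seconds=4.0, n_out=192, rng=None):
    """Crop a random window from a (time, channels) record and downsample it.

    Linear interpolation is an assumption made for this sketch; the text
    does not specify how the downsampling is performed.
    """
    rng = np.random.default_rng() if rng is None else rng
    record = np.asarray(record, dtype=float)
    n_win = int(round(window_seconds * fs))
    if record.shape[0] < n_win:
        raise ValueError("record is shorter than the requested window")
    # Random start position; no beat detection or alignment is involved.
    start = int(rng.integers(0, record.shape[0] - n_win + 1))
    window = record[start:start + n_win]
    # Resample every channel onto n_out equally spaced time points.
    t_old = np.arange(n_win)
    t_new = np.linspace(0, n_win - 1, n_out)
    return np.stack(
        [np.interp(t_new, t_old, window[:, c]) for c in range(window.shape[1])],
        axis=1,
    )
```

Calling the function repeatedly on the same record yields different windows, which is what provides the augmentation effect of random window selection.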
As discussed in Sec. III, if we consider a binary classification problem we are dealing with an imbalanced dataset with 80 healthy records in comparison to 127 records diagnosed as myocardial infarction. Several approaches have been discussed in the literature to best deal with imbalance [45, 46]. Here we follow the general recommendations and oversample the minority class of healthy patients by 2:1.
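A minimal sketch of such an oversampling step is given below. It reads "2:1" as every minority-class record appearing twice in the training set, which is an interpretation; the helper name and the label encoding are hypothetical.

```python
import numpy as np

def oversample_indices(labels, minority_label, factor=2):
    """Return record indices with the minority class repeated `factor` times."""
    labels = np.asarray(labels)
    idx = np.arange(len(labels))
    minority = idx[labels == minority_label]
    majority = idx[labels != minority_label]
    # Majority records appear once, minority records `factor` times.
    return np.concatenate([majority] + [minority] * factor)
```

With 80 healthy and 127 infarction records this gives 160 healthy entries against 127 infarction entries in the training index.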
We refrain from using accuracy as target metric as it depends on the ratio of healthy and infarction ECGs under consideration. As sensitivity and specificity are the most common metrics in the medical context, we choose Youden’s J-statistic as target metric for model selection, which is determined by the sum of both quantities, i.e.
$$J = \text{sensitivity} + \text{specificity} - 1 = \frac{TP}{TP+FN} + \frac{TN}{TN+FP} - 1,$$
where $TP$/$FN$/$TN$/$FP$ denote true positive/false negative/true negative/false positive classification results. Other frequently considered observables in this context include $F_1$ or $F_\beta$ scores that are defined as combinations of positive predictive value (precision) and sensitivity (recall).
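For concreteness, Youden's J-statistic in its conventional form, sensitivity + specificity − 1, can be computed from confusion counts as in the following sketch; the function name and argument order are illustrative.

```python
def youden_j(tp, fn, tn, fp):
    """Youden's J = sensitivity + specificity - 1 from confusion counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity + specificity - 1.0
```

Unlike accuracy, the value does not change when the number of healthy or infarction records is rescaled, since each term is normalized within its own class.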
Finally, to obtain the best possible estimate of the test set sensitivity and specificity using the given data, we perform 10-fold crossvalidation on the dataset. Its size is comparably small and there are still considerable fluctuations of the final result statistics, even considering the data augmentation via random window selection. These result statistics do not necessarily reflect the variance of the estimator under consideration when applied to unseen data  and it is not possible to infer variance information from crossvalidation scores by simple means . The given dataset is not large enough to allow a train-validation-test split with reasonable respective sample sizes. Following , we circumvent this problem by reporting ensemble scores corresponding to models with different random initializations without performing any form of hyperparameter tuning or model selection using test set data. Compared to single initializations the ensemble score gives a more reliable estimate of the model’s generalization performance on unseen data. For calculating the ensemble score we combine five identical models and report the ensemble score formed by averaging the predicted scores after the softmax layer .
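The ensemble score described above, averaging the softmax outputs of several identically configured models, can be sketched as follows. The array layout `(n_models, n_samples, n_classes)` and the function names are assumptions made for illustration.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def ensemble_predict(logits_per_model):
    """Average softmax scores over models; return (scores, predicted class).

    logits_per_model has shape (n_models, n_samples, n_classes).
    """
    probs = softmax(np.asarray(logits_per_model, dtype=float))
    mean_probs = probs.mean(axis=0)
    return mean_probs, mean_probs.argmax(axis=-1)
```

Averaging after the softmax rather than before it means that every model contributes a normalized score, so a single model with large logits cannot dominate the ensemble.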
IV-B Investigated architectures
We investigate both convolutional neural networks as well as recurrent neural network architectures. While recurrent neural networks seem to be the most obvious choice for time series data, see e.g. , convolutional architectures have been applied for similar tasks in early days, see e.g.  for applications in phoneme recognition.
We study different variants of convolutional neural networks inspired by several successful architectures applied in the image domain such as fully convolutional networks  and resnets [7, 52, 53], see App. A for details. In addition to architectures that are applied directly to the (downsampled) timeseries data, we also investigate the effect of incorporating frequency-domain input data obtained by applying a Fourier transform to the original time-domain data.
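One possible way to obtain frequency-domain input is sketched below. Using the magnitude of a real-valued FFT per channel is an assumption of this sketch; the text only states that a Fourier transform is applied to the time-domain data, not how the complex spectrum is fed to the network.

```python
import numpy as np

def to_frequency_domain(window):
    """Map a (time, channels) window to per-channel magnitude spectra.

    Keeping only the magnitude is an illustrative choice, not necessarily
    the representation used in this work.
    """
    window = np.asarray(window, dtype=float)
    spectrum = np.fft.rfft(window, axis=0)
    return np.abs(spectrum)
```

For a window of 192 time steps the result has 97 frequency bins per channel.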
For comparison we also consider recurrent neural networks, namely LSTM cells. We investigate two variants: In the first approach we feed the last LSTM output into a fully connected layer. In the second case we additionally apply a time-distributed dense layer, i.e. with shared weights, to train the network in addition on a time series prediction task, where we adjusted both loss functions to reach similar values. Similar to  we investigate in this way if the time series prediction task improves the classification accuracy.
V-A Network architectures
In Tab. III we compare the architectures described in the previous section based on 12-lead data. The fully convolutional architecture and the resnet achieve similar performance applied to time-domain data. In contradistinction to an earlier investigation  that favored the fully convolutional architecture, a ranking of the two convolutional architectures is not possible on the given data. Interestingly, the convolutional architectures perform better applied to raw time-domain data than applied to frequency-domain data. In this context it might be instructive to investigate also other transformations of the input data such as Wavelet transformations as considered in [32, 13].
Both convolutional architectures show a better score than recurrent architectures. This can probably be attributed to the fact that we report just results with standard LSTMs and do not investigate more advanced mechanisms such as most notably an attention mechanism, see e.g. [22, 23, 24] for recent developments in this direction. Training the recurrent neural network jointly on a classification task as well as on a time series prediction task, see also the description in App. A-B, did not lead to an improved score, whereas a significant increase was reported by , which might be related to the small size of the dataset.
|architecture||J-stat||sens.||spec.||prec.|
|LSTM mode (final output)||0.743||0.910||0.833||0.899|
|LSTM (final output + pred.)||0.742||0.914||0.828||0.897|
In the following sections we analyze particular aspects of the classification results in more detail. All subsequent investigations are carried out using the fully convolutional architecture, which achieved the same performance as the best performing resnet architecture with a comparably much simpler architecture. If not noted otherwise we use the default setup of 12-lead data.
V-B MI localization, benchmarks and channel selection
MI localization and training procedure
As described in Sec. III we distinguish the aggregated subdiagnosis classes aMI and iMI. Here we examine the classification performance of a model that distinguishes these subclasses rather than training just on a common superclass myocardial infarction. We can investigate a number of different combinations of either training/evaluating with or without subdiagnoses as shown in Tab. IV.
|training/evaluation||J-stat||sens.||spec.||prec.|
|cardiologists aMI||0.857||0.874||0.983||-|
|cardiologists iMI||0.738||0.749||0.989||-|
|train MI eval MI||0.827||0.933||0.897||0.936|
|train MI eval aMI||0.877||0.980||0.897||0.884|
|train MI eval iMI||0.789||0.894||0.896||0.879|
|train aMI eval aMI||0.880||0.919||0.961||0.950|
|train iMI eval iMI||0.689||0.810||0.879||0.849|
|train aMI+iMI eval MI||0.788||0.912||0.876||0.947|
|train aMI+iMI eval aMI||0.846||0.966||0.881||0.906|
|train aMI+iMI eval iMI||0.741||0.861||0.879||0.897|
Both for models trained on unspecific infarction and for models trained using subdiagnosis labels, the performance on the inferior myocardial infarction classification task turns out to be worse than the score achieved for anterior myocardial infarction. The most probable reason for this is that anterior myocardial infarctions show typical signs in most of the Wilson leads because of the proximity of the anterior myocardium to the anterior chest wall. For the more difficult task of iMI classification, the model seems to profit from general myocardial infarction data during training, as a model trained on generic MI achieves a higher score on iMI classification than a model trained specifically on iMI classification only. The converse is true for the simpler task of aMI classification, although the difference is small there.
Interestingly, the model trained without subdiagnoses reaches a slightly higher score both for unspecific myocardial infarction classification as well as for classification on subdiagnoses aMI/iMI only, which might just be an effect of an insufficient amount of training data. In any case, we restricted the rest of our investigations to the model trained disregarding subdiagnoses.
In Fig. 1 we show the normalized confusion matrix for the model that is trained and evaluated on the subdiagnoses aMI and iMI in addition to healthy control. The confusion matrix underlines the fact that the model is able to discriminate between the aggregated subdiagnoses whose assignments were motivated by medical arguments, see Sec. III, and represents an a posteriori justification for this choice.
The reported score for models evaluated on subdiagnoses allows a comparison to human performance on this task. We base this comparison on a study  that assessed the human classification performance for different diagnosis classes based on a panel of eight cardiologists. Here we only report the combined result and refer to the original study for individual results. The most appropriate comparison is the model trained on general MI and evaluated on subdiagnoses aMI and iMI. However, it turns out that irrespective of the training procedure the algorithm achieves a slightly higher score on aMI classification and a considerably higher score on iMI classification and we see it therefore justified to claim at least human level performance on the given classification task. We refrain from drawing further conclusions from this comparison as it depends on the precise performance metric under consideration, the fact that not the same datasets were used in both studies and differences in subdiagnosis assignments. The claim of superhuman performance on this task would certainly require more thorough investigations in the future.
Comparison to literature approaches
Our data and channel selection strategy, see Sec. III, was carefully chosen to reflect the requirements of a clinical application as closely as possible. In addition, considering the comparably small size of the dataset, a careful crossvalidation strategy is of utmost importance, see Sec. IV-A. Unfortunately, most literature results do not report crossvalidated scores or introduce data leakage in their crossvalidation procedures. This can happen for example by sampling from beat-level segmented signals or most commonly by sampling based on ECGs rather than patients, see  for a detailed discussion. Both cases lead to unrealistically good performance estimates as the classification algorithm can in some form adapt to structures in the same ECG or structures in a different ECG from the same patient during the training phase.
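To illustrate what sampling on patient level means in practice, the sketch below assigns records to crossvalidation folds such that all records of one patient end up in the same fold. It is a simplified illustration: the function name and the round-robin assignment are assumptions, and the stratification by subdiagnosis used in this work is deliberately omitted.

```python
import numpy as np

def patient_level_folds(patient_ids, n_folds=10, seed=0):
    """Assign each record a fold index such that no patient spans two folds.

    Stratification by diagnosis is omitted in this sketch.
    """
    patient_ids = np.asarray(patient_ids)
    unique_patients = np.unique(patient_ids)
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(unique_patients)
    # Distribute shuffled patients round-robin over the folds.
    fold_of_patient = {p: i % n_folds for i, p in enumerate(shuffled)}
    return np.array([fold_of_patient[p] for p in patient_ids])
```

Splitting on records instead of patients would allow two ECGs of the same patient to land in training and test folds, which is exactly the leakage discussed above.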
We refrain from presenting results for the latter setups as they do not allow us to disentangle a model’s classification performance from its ability to reproduce already known patterns. Therefore we only include a comparison to the most recent works [13, 14] that are to our knowledge the only works where a crossvalidated score with sampling on patient level, in the literature also termed subject-oriented approach , is reported. To ensure comparability with [13, 14] we replicate their setup as closely as possible and modify our data selection to include limb leads and, in addition to healthy records, only all genuine inferior myocardial infarctions. In this case our approach not only shows superior performance compared to literature results, see Tab. V, but unlike their algorithms also does not rely on any non-trivial preprocessing steps.
|method||J-stat||sens.||spec.||prec.|
|Wavelet transform + SVM||0.583||0.790||0.793||0.803|
|limb leads + inferior MI||0.773||0.874||0.900||0.932|
We replicated the above setup to demonstrate the competitiveness of our approach, but for a number of reasons we are convinced that the scores presented in Tabs. III and IV are the more suitable benchmark results: Firstly, from a clinical point of view 12-lead ECGs are the default choice and the algorithm should be fed with the full set of 8 non-redundant channels. Secondly, the restriction to include only the first infarction ECG per patient is arguably more suited for the clinically most relevant problem of classifying recent myocardial infarctions, see the discussion in Sec. III. Finally, from a machine learning perspective it is beneficial to include all subdiagnoses for training, allowing the model to adapt to general patterns in infarction ECGs, and to only evaluate the trained classifier on a particular subdiagnosis of interest. For the case of iMI this procedure led to an improved score, see the discussion of Tab. IV.
By including different combinations of leads one can estimate the relative amount of information that these channels contribute to the classification decision, see Tab. VI.
|channels||J-stat||sens.||spec.||prec.|
|12 leads (default)||0.827||0.933||0.897||0.936|
|Frank leads only||0.803||0.930||0.873||0.923|
|limb leads only||0.811||0.912||0.899||0.937|
Starting with single-lead classification results, out of leads I, II and III, lead III offers the least amount of information, possibly because its direction coincides worst with the usual electrical axis of the heart. The classification result using Frank leads achieves a score that is slightly worse than the result using limb-leads only. A further performance increase is observed when complementing the limb leads with the Wilson leads towards the standard 12-lead setup. The overall best result is achieved using all channels, which does however not correspond to the clinically relevant situation, where conventionally only 12 leads are available.
A general challenge remains the topic of interpretability of machine learning algorithms and in particular deep learning approaches that is especially important for applications in medicine . In the area of deep learning there has been a lot of progress in this direction [56, 57, 58]. So far most applications covered computer vision whereas time series data in particular did only receive scarce attention. Interpretability methods have been applied to time series data in  and ECG data in particular in . A different approach towards interpretability in time series was put forward in .
As an exploratory study for the application of interpretability methods to time series data we investigate the application of attribution methods to the trained classification model. This allows us to investigate on a qualitative level if the machine learning algorithm uses similar features as human cardiologists. Our implementation makes use of the DeepExplain framework put forward in . For neural networks with only ReLU activation functions it can be shown  that attribution maps from ’gradient × input’  coincide with attributions obtained via the ε-rule in LRP . Even though we are using ELU activation functions the attribution maps show only minor quantitative differences. The same applies to the comparison to integrated gradients . For definiteness we focus our discussion on ’gradient × input’. Different from computer vision, where conventionally attributions of all three color channels are summed up, we keep different attributions for every channel to be able to focus on channel-specific effects. We use a common normalization of all channels to be able to compare attributions across channels.
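The idea of per-channel gradient × input attributions with a common normalization can be illustrated on a toy model. The sketch below does not use DeepExplain or the networks from this work; it assumes a hypothetical one-hidden-layer ReLU network, score = w2 · relu(W1 x + b1), for which the gradient can be written down by hand.

```python
import numpy as np

def gradient_times_input(x, W1, b1, w2):
    """Gradient*input attributions for score = w2 . relu(W1 @ x.ravel() + b1).

    x has shape (time, channels); the result has the same shape, so that
    attributions stay separated per channel.
    """
    x = np.asarray(x, dtype=float)
    pre = W1 @ x.ravel() + b1
    active = (pre > 0).astype(float)   # derivative of the ReLU
    grad = W1.T @ (w2 * active)        # d score / d x (flattened)
    return (grad * x.ravel()).reshape(x.shape)

def normalize_jointly(attributions):
    """Scale by the largest absolute value across all channels at once."""
    scale = np.abs(attributions).max()
    return attributions / scale if scale > 0 else attributions
```

Normalizing jointly rather than per channel preserves the relative magnitude of attributions between channels, which is what allows statements such as one group of leads being more important than another for a given ECG.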
We stress that attributions are inherent properties of the underlying models and can therefore differ already for models with different random initializations in an otherwise identical setup. If we aim to use it to identify typical indicators for a classification decision as a guide for clinicians a more elaborate study is required. For simplicity we focus in this exploratory study on a single model rather than the model ensemble. By visual inspection we identified the most typical attribution pattern for myocardial infarction among examples in the batch that occurred shortly after the infarction. Prototypical outcomes of this analysis are presented below.
Fig. 2 shows examples of the interpretability analysis for selected channels of two myocardial infarction ECGs that were correctly classified. As in the clinical context ECGs are always considered in the context of the full set of twelve channels (if available), the complete set of channels is shown in Fig. 4 in the appendix. Take note that the attributions show a high consistency over beats of one ECG, even if a significant baseline shift is present. There is also a reasonable consistency with regard to similar ECG features exhibited by other patients which are not shown here.
ECG A is taken from a 74-year-old male patient one day after the infarction took place. A coronary angiography performed later confirmed an anterior myocardial infarction. ECG B is taken from a 68-year-old male patient one day after the infarction which was later confirmed to be in inferior localization. ECGs A and B are listed as s0021are and s0225lre in the PTB dataset.
Signs of ischemia and infarction are numerous and of variable specificity . Highlighted areas coincide with established ECG signs of myocardial infarction. These are typically found between and including the QRS complex and the T wave, as this is when the contraction and subsequent repolarization of the ventricles take place. ST segment elevation (STE) is the most important finding in myocardial infarction ECGs and the diagnostic criterion for ST elevation myocardial infarction (STEMI) . This sign (though not formally significant in every case) and correspondingly high attributions can be found in both example ECGs at the positions marked a, e, and g. At the same time, in another channel (position c) there is no STE and the attribution is consequently inverted. The attribution at position a also coincides with pathological Q waves, which also occur in some infarction ECGs. T wave inversion, another common sign of infarction, can be found at position d. Some other attributions of the model are less conclusive. Although the attributions at positions b and f fall in the T and/or U waves, that is, in regions that are relevant for the detection of infarction, it is unclear why they influenced the decision against infarction.
Note that the highlighted areas do not necessarily align perfectly with what clinicians would identify as important. For example, for a convolutional neural network to detect an ST elevation, it must use and compare information from before and after the QRS complex, which most likely results in high attributions to the QRS complex itself and its immediate surroundings rather than to the elevated ST segment.
Comparing the overall visual impression of the attributions across all channels (see Fig. 4), the model seems to attribute more importance to the Wilson leads in ECG A (anterior infarction) and more importance to the limb leads in ECG B (inferior infarction). This is also where clinicians would expect to find signs for infarction in these cases.
Attributions are inherently model-dependent, and the attributions obtained from different models indeed show quantitative and in some cases even qualitative differences. However, across different folds and different random initializations, the attribution corresponding to the STE was always correctly and prominently identified. This is a very encouraging sign for future classification studies on ECG data based on convolutional methods, in particular in combination with attribution methods. A future study could put the qualitative findings presented in this section on a quantitative basis. This would require a segmentation of the data, possibly using another model trained on an annotated dataset, as no annotations are available for the PTB dataset, and a statistical evaluation of attribution scores in conjunction with this information. In this context it would be interesting to see if different classification patterns arise across different models or if they can at least be enforced as in .
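The quantitative evaluation suggested above could start from a per-segment aggregation of attribution scores. The following is only a minimal sketch under stated assumptions: it presumes that a segmentation into named ECG intervals is already available for every time step (which is not the case for the PTB dataset), and the function name, the segment names, and the toy numbers are hypothetical illustrations rather than part of this study.

```python
from collections import defaultdict


def mean_attribution_per_segment(attributions, segment_labels):
    """Average the attribution scores of one channel within each labeled segment.

    attributions   -- one attribution score per time step
    segment_labels -- one (hypothetical) segment name per time step
    """
    if len(attributions) != len(segment_labels):
        raise ValueError("need exactly one label per time step")
    totals = defaultdict(float)
    counts = defaultdict(int)
    for score, label in zip(attributions, segment_labels):
        totals[label] += score
        counts[label] += 1
    return {label: totals[label] / counts[label] for label in totals}


# Toy single-beat example with made-up numbers (not taken from the PTB data).
scores = [0.0, 0.1, 0.9, 0.7, 0.4, 0.2, -0.3, -0.1]
labels = ["P", "P", "QRS", "QRS", "ST", "ST", "T", "T"]
per_segment = mean_attribution_per_segment(scores, labels)
```

Collecting such per-segment means over many beats, patients, and trained models would give the distributions on which a statistical comparison could be based.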
VI Summary and Conclusions
In this work we put forward a fully convolutional neural network for myocardial infarction detection using the PTB dataset. The proposed architecture beats the current state-of-the-art approaches on this dataset and reaches a similar level of performance as human cardiologists for this task. We investigate the classification performance on subdiagnoses and identify two clinically well-motivated subdiagnosis classes that can be separated very well by our algorithm. We focus on the clinically most relevant case of 12-lead data and stress the importance of a careful data selection and crossvalidation procedure.
Moreover, we present a first exploratory study of the application of interpretability methods in this domain, which is a key requirement for applications in the medical field. These methods can not only help to gain an understanding and thereby build trust in the network’s decision process but could also lead to a data-driven identification of important markers for certain classification decisions in ECG data that might even prove useful for human experts. Here we identified common cardiologists’ decision rules in the network’s attribution maps and outlined prospects for future studies in this direction.
Both such an analysis of attribution maps and further improvements of the classification performance would have to rely on considerably larger databases such as  for quantitative precision. This would also allow extension to further subdiagnoses and other cardiac conditions such as other confounding and non-exclusive diagnoses or irregular heart rhythms.
Appendix A Network architectures
All models were implemented in TensorFlow . As the only preprocessing step we apply batch normalization  to all input channels. In all cases we minimize the crossentropy loss using the Adam optimizer  with a learning rate of 0.001.
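The training setup described above can be illustrated with a small dependency-free sketch. It is not the TensorFlow implementation used here: the learnable scale and shift of batch normalization are omitted, and the Adam hyperparameters other than the learning rate (`beta1`, `beta2`, `eps`) are assumed to take their commonly used default values, which the text does not specify.

```python
import math


def batch_norm_input(batch, eps=1e-5):
    """Standardize each channel over the batch and time axes.

    batch -- list of recordings, each a list of time steps,
             each time step a list with one value per channel
    """
    n_channels = len(batch[0][0])
    count = sum(len(rec) for rec in batch)
    means = [sum(step[c] for rec in batch for step in rec) / count
             for c in range(n_channels)]
    variances = [sum((step[c] - means[c]) ** 2 for rec in batch for step in rec) / count
                 for c in range(n_channels)]
    return [[[(step[c] - means[c]) / math.sqrt(variances[c] + eps)
              for c in range(n_channels)]
             for step in rec]
            for rec in batch]


def cross_entropy(class_probs, label):
    """Crossentropy of predicted class probabilities against an integer label."""
    return -math.log(class_probs[label])


def adam_step(param, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update of a scalar parameter; t is the 1-based step counter."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    m_hat = m / (1 - beta1 ** t)  # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)  # bias-corrected second moment
    return param - lr * m_hat / (math.sqrt(v_hat) + eps), m, v


# Two toy recordings with two time steps and two channels each.
toy_batch = [[[1.0, 10.0], [3.0, 30.0]], [[5.0, 50.0], [7.0, 70.0]]]
normalized = batch_norm_input(toy_batch)
new_param, m, v = adam_step(param=1.0, grad=0.5, m=0.0, v=0.0, t=1)
```

In the first Adam step the bias-corrected update is approximately the learning rate times the sign of the gradient, so the toy parameter moves from 1.0 to roughly 0.999.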
A-A Convolutional architectures
We generally use ELU  as the activation function for both convolutional and fully connected layers without using batch normalization , which was reported to lead to a slight performance increase compared to the standard ReLU activation with batch normalization . In the architectures with fully connected layers we apply dropout  at a rate of 0.5 to improve the generalization capability of the model. We initialize weights according to . Note that, in contrast to the case of two-dimensional data, a max pooling operation only reduces the number of couplings by a factor of 2 rather than 4, which is then fully compensated by the conventional doubling of the number of filters in the next convolutional layer. To achieve a gradual reduction of couplings we therefore keep the number of filters constant across convolutional layers. We study the following convolutional architectures that are also depicted in Fig. 3:
A fully convolutional architecture  with a final global average pooling layer
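To make the fully convolutional design concrete, the following is a minimal forward-pass sketch in plain Python. All specifics are assumptions chosen for illustration: two convolutional layers, 4 filters, kernel size 3, zero padding, pooling size 2, and a simple Gaussian initialization stand in for the actual hyperparameters and initialization scheme, which are not reproduced here. The sketch shows two properties discussed above: with a constant number of filters the feature-map size (time steps times filters) halves after every pooling step, and global average pooling makes the output independent of the input length.

```python
import math
import random


def elu(v):
    return v if v > 0 else math.exp(v) - 1.0


def conv1d_same(x, kernels, biases):
    """ELU(conv1d(x)) with zero padding; x is [time][in_ch], kernels is [out_ch][k][in_ch]."""
    k, n_in = len(kernels[0]), len(x[0])
    pad = [[0.0] * n_in for _ in range(k // 2)]
    xp = pad + x + pad
    return [
        [elu(b + sum(w[j][c] * xp[t + j][c] for j in range(k) for c in range(n_in)))
         for w, b in zip(kernels, biases)]
        for t in range(len(x))
    ]


def max_pool(x, size=2):
    """Non-overlapping max pooling along the time axis."""
    return [
        [max(x[t + j][c] for j in range(size)) for c in range(len(x[0]))]
        for t in range(0, len(x) - size + 1, size)
    ]


def global_average_pool(x):
    """Average every filter over the whole time axis."""
    return [sum(step[c] for step in x) / len(x) for c in range(len(x[0]))]


def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def random_conv_layer(rng, n_in, n_out, k):
    # Placeholder Gaussian initialization, chosen only for this illustration.
    kernels = [[[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(k)]
               for _ in range(n_out)]
    return kernels, [0.0] * n_out


def fcn_forward(x, conv_layers, w_out, b_out):
    """Return class probabilities and the feature-map size after each conv/pool stage."""
    sizes = []
    for kernels, biases in conv_layers:
        x = max_pool(conv1d_same(x, kernels, biases))
        sizes.append(len(x) * len(x[0]))
    pooled = global_average_pool(x)
    logits = [b + sum(w * p for w, p in zip(row, pooled)) for row, b in zip(w_out, b_out)]
    return softmax(logits), sizes


rng = random.Random(0)
n_channels, n_filters, kernel_size = 12, 4, 3
layers = [random_conv_layer(rng, n_channels, n_filters, kernel_size),
          random_conv_layer(rng, n_filters, n_filters, kernel_size)]
w_out = [[rng.gauss(0.0, 0.1) for _ in range(n_filters)] for _ in range(2)]
b_out = [0.0, 0.0]

short = [[rng.gauss(0.0, 1.0) for _ in range(n_channels)] for _ in range(64)]
long = [[rng.gauss(0.0, 1.0) for _ in range(n_channels)] for _ in range(100)]
probs_short, sizes_short = fcn_forward(short, layers, w_out, b_out)
probs_long, sizes_long = fcn_forward(long, layers, w_out, b_out)
```

Both toy inputs, although of different lengths, yield a two-class probability vector, because no layer depends on a fixed sequence length.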
We investigate the impact of including frequency information obtained via a Fast Fourier Transform (FFT) with , where denotes the sequence length after rescaling. The independent components are used as frequency-domain input data with otherwise unchanged network architectures.
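The notion of independent components can be illustrated as follows. For a real-valued signal of length n, only the first n/2 + 1 Fourier coefficients are independent, since the remaining ones are their complex conjugates. The sketch below uses a naive discrete Fourier transform for clarity, which yields the same coefficients as an FFT routine but is slower. The function names and the choice to concatenate real and imaginary parts are assumptions for illustration, and the transform length used in this work is not reproduced.

```python
import cmath
import math


def real_dft(signal):
    """Return the n // 2 + 1 independent Fourier coefficients of a real signal."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * math.pi * f * t / n) for t in range(n))
        for f in range(n // 2 + 1)
    ]


def frequency_features(signal):
    """Flatten the coefficients into real-valued features (real part, imaginary part)."""
    features = []
    for coefficient in real_dft(signal):
        features.extend([coefficient.real, coefficient.imag])
    return features


# Toy signal: a pure cosine with 3 cycles over 16 samples.
n = 16
signal = [math.cos(2 * math.pi * 3 * t / n) for t in range(n)]
spectrum = real_dft(signal)
features = frequency_features(signal)
```

For the toy cosine all spectral weight falls into the coefficient with index 3, whose magnitude is n/2 = 8.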
A-B Recurrent architectures
As an alternative architecture we investigate recurrent neural networks, namely LSTM  cells. We investigated stacked LSTM architectures but found no significant gain in performance. However, even for a single RNN cell, in our case with 256 hidden units, different training methods are feasible:
In the first variant we feed the last LSTM output into a fully connected softmax layer.
In the second variant we additionally apply a time-distributed fully connected layer, i.e. a fully connected layer with shared weights for every timestep, and train the network to predict the next element in a timeseries prediction task jointly with the classification task. Here we adjusted both loss functions to reach similar values. Similar to  we investigate in this way if the timeseries prediction task improves the classification accuracy.
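The joint objective of the second variant can be sketched as a weighted sum of a classification loss and a next-step prediction loss. The concrete choices below are assumptions for illustration: the prediction loss is taken to be a mean squared error, and a single scalar `weight` stands in for whatever adjustment was used to bring both losses to similar values.

```python
import math


def cross_entropy(class_probs, label):
    """Classification loss for one recording with an integer class label."""
    return -math.log(class_probs[label])


def next_step_mse(predictions, series):
    """Mean squared error of next-step predictions.

    predictions[t] is compared with series[t + 1]; the final prediction
    has no target and is ignored.
    """
    errors = [
        (p - a) ** 2
        for pred, actual in zip(predictions[:-1], series[1:])
        for p, a in zip(pred, actual)
    ]
    return sum(errors) / len(errors)


def joint_loss(class_probs, label, predictions, series, weight=1.0):
    """Return the combined loss together with its two components."""
    ce = cross_entropy(class_probs, label)
    mse = next_step_mse(predictions, series)
    return ce + weight * mse, ce, mse


# Toy single-channel series with three time steps (made-up numbers).
series = [[0.0], [1.0], [2.0]]
predictions = [[0.0], [2.0], [0.0]]
total, ce, mse = joint_loss([0.25, 0.75], 1, predictions, series, weight=1.0)
```

Returning the two components separately makes it easy to monitor whether they stay at a similar magnitude during training.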
During RNN training we apply gradient clipping.
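One common form of gradient clipping rescales all gradients jointly whenever their global L2 norm exceeds a threshold. The sketch below illustrates this variant only. Whether clipping by norm or by value was used, and with which threshold, is not stated here, so both the method and the value of `max_norm` are assumptions.

```python
import math


def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient vectors so that their joint L2 norm is at most max_norm.

    Returns the (possibly rescaled) gradients and the norm before clipping.
    """
    norm = math.sqrt(sum(g * g for tensor in grads for g in tensor))
    if norm <= max_norm:
        return [list(tensor) for tensor in grads], norm
    scale = max_norm / norm
    return [[g * scale for g in tensor] for tensor in grads], norm


# Toy gradients of two parameter groups with joint norm 5.
clipped, original_norm = clip_by_global_norm([[3.0], [4.0]], max_norm=1.0)
```

Because all gradients are scaled by the same factor, the direction of the update is preserved and only its length is limited.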
The authors thank M. Grünewald for discussions.
- M. Roffi, C. Patrono, J. P. Collet, C. Mueller, M. Valgimigli, F. Andreotti, J. J. Bax, M. A. Borger, C. Brotons, D. P. Chew, B. Gencer, G. Hasenfuss, K. Kjeldsen, P. Lancellotti, U. Landmesser, J. Mehilli, D. Mukherjee, R. F. Storey, and S. Windecker, “[2015 ESC guidelines for the management of acute coronary syndromes in patients presenting without persistent ST-segment elevation],” Kardiol Pol, vol. 73, no. 12, pp. 1207–1294, 2015.
- F. A. Masoudi, D. J. Magid, D. R. Vinson, A. J. Tricomi, E. E. Lyons, L. Crounse, P. M. Ho, P. N. Peterson, and J. S. Rumsfeld, “Implications of the failure to identify high-risk electrocardiogram findings for the quality of care of patients with acute myocardial infarction: results of the Emergency Department Quality in Myocardial Infarction (EDQMI) study,” Circulation, vol. 114, no. 15, pp. 1565–1571, Oct 2006.
- Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436–444, 2015.
- I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016, http://www.deeplearningbook.org.
- A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, 2012, pp. 1097–1105.
- C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 1–9.
- K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
- D. Erhan, C. Szegedy, A. Toshev, and D. Anguelov, “Scalable object detection using deep neural networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 2147–2154.
- S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real-time object detection with region proposal networks,” in Advances in neural information processing systems, 2015, pp. 91–99.
- J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3431–3440. [Online]. Available: http://www.cv-foundation.org/openaccess/content_cvpr_2015/html/Long_Fully_Convolutional_Networks_2015_CVPR_paper.html
- R. Bousseljot, D. Kreiseler, and A. Schnabel, “Nutzung der EKG-Signaldatenbank CARDIODAT der PTB über das Internet,” Biomedizinische Technik, vol. 40, no. Ergänzungsband 1, p. 314, 1995.
- A. L. Goldberger, L. A. N. Amaral, L. Glass, J. M. Hausdorff, P. C. Ivanov, R. G. Mark, J. E. Mietus, G. B. Moody, C.-K. Peng, and H. E. Stanley, “Physiobank, physiotoolkit, and physionet,” Circulation, vol. 101, no. 23, pp. e215–e220, 2000. [Online]. Available: http://circ.ahajournals.org/content/101/23/e215
- L. D. Sharma and R. K. Sunkaria, “Inferior myocardial infarction detection using stationary wavelet transform and machine learning approach,” Signal, Image and Video Processing, vol. 12, no. 2, pp. 199–206, 2018.
- T. Reasat and C. Shahnaz, “Detection of inferior myocardial infarction using shallow convolutional neural networks,” in 2017 IEEE Region 10 Humanitarian Technology Conference (R10-HTC). IEEE, dec 2017. [Online]. Available: https://doi.org/10.1109/r10-htc.2017.8289058
- J. L. Willems, C. Abreu-Lima, P. Arnaud, J. H. van Bemmel, C. Brohet, R. Degani, B. Denis, J. Gehring, I. Graham, G. van Herpen et al., “The diagnostic performance of computer programs for the interpretation of electrocardiograms,” New England Journal of Medicine, vol. 325, no. 25, pp. 1767–1773, 1991.
- A. Bagnall, J. Lines, A. Bostrom, J. Large, and E. Keogh, “The great time series classification bake off: a review and experimental evaluation of recent algorithmic advances,” Data Mining and Knowledge Discovery, pp. 1–55, 2016.
- M. Hüsken and P. Stagge, “Recurrent neural networks for time series classification,” Neurocomputing, vol. 50, pp. 223–235, 2003.
- Z. Wang, W. Yan, and T. Oates, “Time series classification from scratch with deep neural networks: A strong baseline,” in Neural Networks (IJCNN), 2017 International Joint Conference on. IEEE, 2017, pp. 1578–1585. [Online]. Available: http://arxiv.org/abs/1611.06455
- Y. Chen, E. Keogh, B. Hu, N. Begum, A. Bagnall, A. Mueen, and G. Batista, “The UCR Time Series Classification Archive,” July 2015, www.cs.ucr.edu/~eamonn/time_series_data/.
- Z. Cui, W. Chen, and Y. Chen, “Multi-scale convolutional neural networks for time series classification,” CoRR, vol. abs/1603.06995, 2016. [Online]. Available: http://arxiv.org/abs/1603.06995
- E. Choi, M. T. Bahadori, A. Schuetz, W. F. Stewart, and J. Sun, “Doctor AI: Predicting Clinical Events via Recurrent Neural Networks,” JMLR workshop and conference proceedings, vol. 56, pp. 301–318, 2016. [Online]. Available: http://arxiv.org/abs/1511.05942
- H. Song, D. Rajan, J. J. Thiagarajan, and A. Spanias, “Attend and Diagnose: Clinical Time Series Analysis using Attention Models,” CoRR, vol. abs/1711.03905, Nov. 2017.
- F. Karim, S. Majumdar, H. Darabi, and S. Chen, “LSTM Fully Convolutional Networks for Time Series Classification,” IEEE Access, vol. 6, pp. 1662–1669, 2018. [Online]. Available: http://arxiv.org/abs/1709.05206
- F. Karim, S. Majumdar, H. Darabi, and S. Harford, “Multivariate LSTM-FCNs for Time Series Classification,” CoRR, vol. abs/1801.04503, 2018. [Online]. Available: http://arxiv.org/abs/1801.04503
- L. Sun, Y. Lu, K. Yang, and S. Li, “ECG analysis using multiple instance learning for myocardial infarction detection,” IEEE transactions on biomedical engineering, vol. 59, no. 12, pp. 3348–3356, 2012.
- M. Arif, I. A. Malagore, and F. A. Afsar, “Detection and localization of myocardial infarction using k-nearest neighbor classifier,” Journal of medical systems, vol. 36, no. 1, pp. 279–289, 2012.
- D. H. Lee, J. W. Park, J. Choi, A. Rabbi, and R. Fazel-Rezai, “Automatic detection of electrocardiogram st segment: Application in ischemic disease diagnosis,” (IJACSA) International Journal of Advanced Computer Science and Applications, vol. 4, no. 2, 2013.
- N. Safdarian, N. J. Dabanloo, and G. Attarodi, “A new pattern recognition method for detection and localization of myocardial infarction using t-wave integral and total integral as extracted features from one cycle of ecg signal,” Journal of Biomedical Science and Engineering, vol. 7, no. 10, p. 818, 2014.
- J. Kojuri, R. Boostani, P. Dehghani, F. Nowroozipour, and N. Saki, “Prediction of acute myocardial infarction with artificial neural networks in patients with nondiagnostic electrocardiogram,” Journal of Cardiovascular Disease Research, vol. 6, no. 2, pp. 51–59, 2015.
- B. Liu, J. Liu, G. Wang, K. Huang, F. Li, Y. Zheng, Y. Luo, and F. Zhou, “A novel electrocardiogram parameterization algorithm and its application in myocardial infarction detection,” Computers in biology and medicine, vol. 61, pp. 178–184, 2015.
- V. S. Negandhi, S. D. Parab, A. Aishwarya, and P. Bhogle, “Heart track: Automated ecg analysis for detecting myocardial infarction,” Heart, vol. 138, no. 9, 2016.
- L. Sharma, R. Tripathy, and S. Dandapat, “Multiscale energy and eigenspace approach to detection and localization of myocardial infarction,” IEEE transactions on biomedical engineering, vol. 62, no. 7, pp. 1827–1837, 2015.
- M. Arif, I. A. Malagore, and F. A. Afsar, “Automatic detection and localization of myocardial infarction using back propagation neural networks,” in Bioinformatics and Biomedical Engineering (iCBBE), 2010 4th International Conference on. IEEE, 2010, pp. 1–4.
- P. Kora and S. R. Kalva, “Improved bat algorithm for the detection of myocardial infarction,” SpringerPlus, vol. 4, no. 1, p. 666, 2015.
- U. R. Acharya, H. Fujita, S. L. Oh, Y. Hagiwara, J. H. Tan, and M. Adam, “Application of deep convolutional neural network for automated detection of myocardial infarction using ECG signals,” Information Sciences, vol. 415, pp. 190–198, 2017.
- M. Kachuee, S. Fazeli, and M. Sarrafzadeh, “ECG Heartbeat Classification: A Deep Transferable Representation,” CoRR, vol. abs/1805.00794, Apr. 2018.
- Y. Zheng, Q. Liu, E. Chen, Y. Ge, and J. L. Zhao, “Time series classification using multi-channels deep convolutional neural networks,” in International Conference on Web-Age Information Management. Springer, 2014, pp. 298–310.
- D. Rajan and J. J. Thiagarajan, “A Generative Modeling Approach to Limited Channel ECG Classification,” CoRR, vol. abs/1802.06458, Feb. 2018.
- P. Rajpurkar, A. Y. Hannun, M. Haghpanahi, C. Bourn, and A. Y. Ng, “Cardiologist-level arrhythmia detection with convolutional neural networks,” CoRR, vol. abs/1707.01836, 2017. [Online]. Available: http://arxiv.org/abs/1707.01836
- G. Clifford, C. Liu, B. Moody, L. Lehman, I. Silva, Q. Li, A. Johnson, and R. Mark, “AF classification from a short single lead ECG recording: The Physionet Computing in Cardiology Challenge 2017,” 2017.
- W. Pei, H. Dibeklioglu, D. M. J. Tax, and L. van der Maaten, “Time series classification using the hidden-unit logistic model,” CoRR, vol. abs/1506.05085, 2015. [Online]. Available: http://arxiv.org/abs/1506.05085
- P. Schäfer and U. Leser, “Multivariate time series classification with WEASEL+MUSE,” CoRR, vol. abs/1711.11343, 2017. [Online]. Available: http://arxiv.org/abs/1711.11343
- K. L. Moore, A. F. Dalley, and A. M. R. Agur, Clinically Oriented Anatomy. Wolters Kluwer/Lippincott Williams & Wilkins, 2010.
- B. Hu, Y. Chen, and E. Keogh, “Time series classification under more realistic assumptions,” in Proceedings of the 2013 SIAM International Conference on Data Mining. SIAM, 2013, pp. 578–586.
- P. Branco, L. Torgo, and R. P. Ribeiro, “A survey of predictive modelling under imbalanced distributions,” CoRR, vol. abs/1505.01658, 2015. [Online]. Available: http://arxiv.org/abs/1505.01658
- M. Buda, A. Maki, and M. A. Mazurowski, “A systematic study of the class imbalance problem in convolutional neural networks,” CoRR, vol. abs/1710.05381, 2017. [Online]. Available: http://arxiv.org/abs/1710.05381
- A. Isaksson, M. Wallman, H. Göransson, and M. G. Gustafsson, “Cross-validation and bootstrapping are unreliable in small sample classification,” Pattern Recognition Letters, vol. 29, no. 14, pp. 1960–1965, 2008.
- Y. Bengio and Y. Grandvalet, “No unbiased estimator of the variance of k-fold cross-validation,” Journal of machine learning research, vol. 5, no. Sep, pp. 1089–1105, 2004.
- T. Shaikhina and N. A. Khovanova, “Handling limited datasets with neural networks in medical applications: A small-data approach,” Artificial Intelligence in Medicine, vol. 75, pp. 51 – 63, 2017. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0933365716301749
- C. Ju, A. Bibaut, and M. J. van der Laan, “The relative performance of ensemble methods with deep convolutional neural networks for image classification,” CoRR, vol. abs/1704.01664, 2017. [Online]. Available: http://arxiv.org/abs/1704.01664
- A. Waibel, T. Hanazawa, G. Hinton, K. Shikano, and K. J. Lang, “Phoneme recognition using time-delay neural networks,” IEEE transactions on acoustics, speech, and signal processing, vol. 37, no. 3, pp. 328–339, 1989.
- K. He, X. Zhang, S. Ren, and J. Sun, “Identity mappings in deep residual networks,” in European Conference on Computer Vision. Springer, 2016, pp. 630–645.
- S. Zagoruyko and N. Komodakis, “Wide residual networks,” CoRR, vol. abs/1605.07146, 2016. [Online]. Available: http://arxiv.org/abs/1605.07146
- S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997. [Online]. Available: http://www.mitpressjournals.org/doi/abs/10.1162/neco.1997.9.8.1735
- F. Cabitza, R. Rasoini, and G. Gensini, “Unintended consequences of machine learning in medicine,” JAMA, vol. 318, no. 6, pp. 517–518, 2017. [Online]. Available: http://dx.doi.org/10.1001/jama.2017.7797
- K. Simonyan, A. Vedaldi, and A. Zisserman, “Deep inside convolutional networks: Visualising image classification models and saliency maps,” CoRR, vol. abs/1312.6034, 2013. [Online]. Available: http://arxiv.org/abs/1312.6034
- M. Sundararajan, A. Taly, and Q. Yan, “Axiomatic attribution for deep networks,” in ICML, 2017. [Online]. Available: http://arxiv.org/abs/1703.01365
- G. Montavon, W. Samek, and K.-R. Müller, “Methods for interpreting and understanding deep neural networks,” Digital Signal Processing, vol. 73, pp. 1 – 15, 2018. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S1051200417302385
- T. Teijeiro, C. A. García, D. Castro, and P. Félix, “Abductive reasoning as the basis to reproduce expert criteria in ECG Atrial Fibrillation identification,” CoRR, vol. abs/1802.05998, Feb. 2018.
- S. A. Siddiqui, D. Mercier, M. Munir, A. Dengel, and S. Ahmed, “TSViz: Demystification of Deep Learning Models for Time-Series Analysis,” CoRR, vol. abs/1802.02952, Feb. 2018.
- M. Ancona, E. Ceolini, A. C. Öztireli, and M. H. Gross, “Towards better understanding of gradient-based attribution methods for Deep Neural Networks,” ICLR, vol. abs/1711.06104, 2018. [Online]. Available: http://arxiv.org/abs/1711.06104
- A. Shrikumar, P. Greenside, A. Shcherbina, and A. Kundaje, “Not Just a Black Box: Learning Important Features Through Propagating Activation Differences,” CoRR, vol. abs/1605.01713, 2016. [Online]. Available: http://arxiv.org/abs/1605.01713
- S. Bach, A. Binder, G. Montavon, F. Klauschen, K.-R. Müller, and W. Samek, “On Pixel-Wise Explanations for Non-Linear Classifier Decisions by Layer-Wise Relevance Propagation,” PLoS ONE, vol. 10, no. 7, p. e0130140, 07 2015. [Online]. Available: http://dx.doi.org/10.1371%2Fjournal.pone.0130140
- S. Z. Tewelde, A. Mattu, and W. J. Brady, “Pitfalls in Electrocardiographic Diagnosis of Acute Coronary Syndrome in Low-Risk Chest Pain,” West J Emerg Med, vol. 18, no. 4, pp. 601–606, Jun 2017.
- K. Thygesen, J. S. Alpert, A. S. Jaffe, M. L. Simoons, B. R. Chaitman, and H. D. White, “Third universal definition of myocardial infarction,” Glob Heart, vol. 7, no. 4, pp. 275–295, Dec 2012.
- A. S. Ross, M. C. Hughes, and F. Doshi-Velez, “Right for the right reasons: Training differentiable models by constraining their explanations,” in IJCAI, 2017. [Online]. Available: http://arxiv.org/abs/1703.03717
- J.-P. Couderc, “A unique digital electrocardiographic repository for the development of quantitative electrocardiography and cardiac safety: the telemetric and holter ecg warehouse (thew),” Journal of electrocardiology, vol. 43, no. 6, pp. 595–600, 2010.
- M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. J. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Józefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. G. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. A. Tucker, V. Vanhoucke, V. Vasudevan, F. B. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng, “Tensorflow: Large-scale machine learning on heterogeneous distributed systems,” CoRR, vol. abs/1603.04467, 2016. [Online]. Available: http://arxiv.org/abs/1603.04467
- S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in International Conference on Machine Learning, 2015, pp. 448–456.
- D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” CoRR, vol. abs/1412.6980, 2014. [Online]. Available: http://arxiv.org/abs/1412.6980
- D. Clevert, T. Unterthiner, and S. Hochreiter, “Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs),” CoRR, vol. abs/1511.07289, 2015. [Online]. Available: http://arxiv.org/abs/1511.07289
- D. Mishkin, N. Sergievskiy, and J. Matas, “Systematic evaluation of convolution neural network advances on the Imagenet,” Computer Vision and Image Understanding, vol. 161, pp. 11–19, 2017. [Online]. Available: http://arxiv.org/abs/1606.02228
- N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A Simple Way to Prevent Neural Networks from Overfitting,” Journal of Machine Learning Research, vol. 15, pp. 1929–1958, 2014. [Online]. Available: http://www.jmlr.org/papers/v15/srivastava14a.html
- K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers: Surpassing human-level performance on imagenet classification,” in Proceedings of the IEEE international conference on computer vision, 2015, pp. 1026–1034.