Patient trajectory prediction in the Mimic-III dataset, challenges and pitfalls
Automated medical prognosis has gained interest as artificial intelligence evolves and the potential for computer-aided medicine becomes evident. Nevertheless, it is challenging to design an effective system that, given a patient’s medical history, is able to predict probable future conditions. Previous works, mostly carried out over private datasets, have tackled the problem by using artificial neural network architectures that cannot deal with low-cardinality datasets, or by means of non-generalizable inference approaches. We introduce a Deep Learning architecture whose design results from an intensive experimental process. The final architecture is based on two parallel Minimal Gated Recurrent Unit networks working in a bidirectional manner, and it was extensively tested with the open-access Mimic-III dataset. Our results demonstrate significant improvements in automated medical prognosis, as measured with Recall@k. We summarize our experience as a set of relevant insights for the design of Deep Learning architectures. Our work improves the performance of computer-aided medicine and can serve as a guide in designing artificial neural networks used in prediction tasks.
Routinely, health care professionals have to deal with patient records that carry years, even decades, of clinical evidence. It is their job to digest all this information to make the most accurate recommendations regarding the patient’s health and the most adequate treatments. Manually processing such amounts of information is time- and effort-intensive, and physicians can certainly benefit from the aid of an automated prognosis system (Douglas Miller and W. Brown, 2017). The accurate processing of the entire profile of a patient, formed by a sequence of events, can lead to more precise prognoses, fostering preventive medicine practices, delineating healthier habits, and aiding the health sector as a whole. The same benefits hold for health insurance companies, which are interested in predicting the possible outcomes of their clients and proposing fair contracts and conditions.
Automated prognosis can benefit from the wide adoption of Electronic Health Records (EHRs) (Henry et al., 2013), a practice that is leading to a massive production of computer-ready clinical data. One branch of this research is referred to as Patient Trajectory Prediction, or Disease Progression. This field takes temporally ordered sets of clinical data and has the computer learn what the next most probable event is. For our specific settings, a clinical event refers to a patient admitted to a hospital; along the course of this admission, a set of diagnostic outcomes is generated and encoded according to the International Statistical Classification of Diseases and Related Health Problems, 9th revision (ICD-9; https://apps.who.int/iris/handle/10665/39473). Given a sequence of admissions referring to a patient, we want to predict the most probable diagnoses that will be observed in that patient’s next admission. The task, although modeled for computational processing, is similar to what health professionals repeatedly do when faced with historical clinical profiles to delineate an expected prognosis. For this task, Deep Learning (DL) techniques have increasingly gained interest due to their adequacy in dealing with large amounts of sequential data and due to their convincing results in prediction/classification problems, as we review in Section 2.
Despite the advances in the field, the use of DL for patient trajectory prediction, a modality of sequence-to-sequence prediction that depends on a temporal context, runs into several challenges. Initially, after data preparation to filter out inconsistencies, the data must be inspected regarding its distribution, a process that guides the proper encoding of the information and its respective modeling as a tensor representation. The processing, then, depends on the proper definition of a DL architecture, taking into account dozens of different kinds of processing units found in the literature; the ideal method must be capable of keeping a memory of past events, which accounts for the temporal aspect of the problem. Furthermore, a myriad of hyperparameters must be taken into account, including the number and size of layers, activation functions, convergence, regularization, loss function, and optimization, among others. In our specific setting, we found that the small cardinality of the data and its highly granular encoding posed strong hurdles that steered the whole research process; our findings revolve around these challenges.
We report on the use of Deep Learning techniques applied to the open-access dataset Medical Information Mart for Intensive Care III (Mimic-III) (Johnson et al., 2016) provided by the Massachusetts Institute of Technology. We describe efforts on using this rich dataset to build a medical prognosis method using recurrent artificial neural networks. Our methodology can be briefly summarized as: pre-processing the Mimic-III ICD-9 diagnosis encoding according to a less granular encoding scheme provided by HCUP, the North-American Healthcare Cost and Utilization Project (Section 3); using two Minimal Gated Recurrent Unit networks organized in a parallel bidirectional structure able to deal with both the low cardinality of Mimic-III and with its temporal nature (Section 4); and experimenting with several methods reported in the literature to provide a broad panorama on how to tackle the specific settings of Mimic-III (results reported in Section 5). We named our methodology LIG-Doctor after the name of our research group, the Laboratoire d’Informatique de Grenoble.
We summarize our contributions as follows:
Elucidation on DL methods: we extensively tested DL methods frequently referenced in the literature; our results allowed us to identify pitfalls and good choices applicable to Mimic-III and, possibly, in a more general context;
Methodology for computational prognosis: as a result of extensive experiments with multiple methods, we reached a methodology that proved successful for Mimic-III in comparison to other methods considered as state of the art; the source code of this project is accessible on GitHub (https://github.com/jfrjunio/LIG-Doctor);
Dataset insight: we discuss the characteristics of Mimic-III with respect to its potential for computational medical prognosis; we focus on aspects of its cardinality, encoding granularity, and clinical aspects (Intensive Care Unit) demonstrating methodological choices that result in more accurate outcomes.
2. Related works
Currently, the comparison of computational medical prognostic methods is hampered by the fact that the works in the literature use different datasets. Each dataset carries specific characteristics, like domain, cardinality, structure, and data encoding; quite often, the datasets are proprietary and not available for broader use. Unlike most datasets, the open-access Mimic-III dataset is a highly structured and semantically rich dataset that has gained attention within the research community. Despite the difficulty of comparing existing works, we review important proposals that deal with medical prognosis and highlight the challenges that lie in the field, allowing one to assess this research.
The work of Pham et al. (Pham et al., 2017), named DeepCare, uses an LSTM (Hochreiter and Schmidhuber, 1997a) recurrent neural network that, in addition to the codes found in the Electronic Medical Record, concatenates extra features to the data to aid the learning process. The concatenation occurs after an embedding layer and includes intervention codes, elapsed time, and admission method; the authors also use pooling as an attention mechanism to focus on the most important diagnostic inputs of an admission sequence. Similar to DeepCare, we experimented with the use of embedding and extra features, as we report in Section 5. The use of embedding, though, resulted in a reduced recall in all the cases; the use of extra features, such as type and duration of admission, slightly improved our recall measures. The possible explanation is that Mimic-III is a general clinic dataset whose cases range, for example, from childbirth to myocardial infarction while the datasets used by Pham et al. refer to specific disorders (diabetes and mental health). Therefore, the spectrum of Mimic-III with respect to its features is broader, in which case, extra features did not represent strong information gains. Unfortunately, we were not able to reproduce DeepCare, as the code provided by the authors is incomplete.
Choi et al. (Choi et al., 2016) introduce the Doctor AI methodology, which is based on a Gated Recurrent Unit network. Their architecture uses an embedding layer to reduce the dimensionality of the input admission sequences, and an optional Skip-gram representation of the diagnosis codes. They report experiments on a dataset with more than 14 million admissions related to a case-control study of heart failure. Their dataset refers to a specific disease and is more than 240 times bigger than Mimic-III. Choi et al. also report that they were able to perform transfer learning using their model to make predictions over Mimic-III; their results were not as precise as those of LIG-Doctor, either with transfer learning or when applied directly to Mimic-III, as we present in Section 5.
Prognostic medicine has been tackled by other approaches not based on DL. Jensen et al. (Jensen et al., 2014) describe a statistical frequentist inferential technique to characterize trajectories of interest in the population of Denmark. Their approach, although sound, does not adjust to other settings, as it demands customized modeling and formulation. Additionally, it was not designed to learn new trajectories from new data. Some works have used Markovian models to compute the conditional probability of an admission given the clinical record. Wang et al. (Wang et al., 2014), for example, describe a two-part method that uses Bayesian and Markovian principles for Chronic Obstructive Pulmonary Diseases. Their method stands out for being unsupervised and quite precise in its specific disease context. The drawback is that it depends on co-morbidity information provided by specialists, which is hardly ever available in significant numbers. Furthermore, their modeling has a complexity that cannot be disregarded, making it hard to adapt the method to new settings. More generally, Arandjelovic (Arandjelovic, 2015) claims and demonstrates that, for trajectory prediction, Markovian models depend on constrained assumptions to reduce the number of possible historical sequences. This fact leads to limited applicability, especially because such models cannot deal with admissions that contrast in severity; that is, a routine admission would simply erase the model’s memory of a previous severe condition.
Other approaches rely on Hawkes Processes (Linderman and Adams, 2014), a sort of point process or probabilistic model for random scatterings. Such a method is used for describing the occurrence of events over time, as in the case of patient admissions to a hospital and their respective diagnoses. The drawback of such works is that they apply strictly to diseases and not to patients, which severely reduces the applicability of the models. Also, the number of parameters grows quadratically with the number of diseases, incurring a huge computational cost (Choi et al., 2015).
3. The Mimic-III dataset
This work focuses on the open-access dataset Medical Information Mart for Intensive Care III (Mimic-III) (Johnson et al., 2016) provided by the Massachusetts Institute of Technology. This dataset integrates deidentified, comprehensive clinical data of patients admitted to the critical care unit of the Beth Israel Deaconess Medical Center in Boston, Massachusetts. The access to the dataset is open, but it is conditional on a strictly controlled user agreement. One of the goals of the Mimic-III effort is to allow the reproduction of clinical studies worldwide, making medical-related research comparable via a standard reference. In fact, during this research, we noticed a flagrant problem: the majority of previous studies rely on private datasets, which prevents reproduction and comparison.
Mimic-III is a well-structured, validated dataset with 58,976 admissions from 46,520 patients with heart, surgical, and trauma conditions, all requiring critical care. The data is semantically rich, including bedside monitoring, laboratory tests, billing, demographics, diagnoses, and procedures. These last two pieces of information are encoded using the ICD-9 standard. The dataset has been used for different research purposes such as medication dosing (Ghassemi et al., 2014) and mortality prediction (Pirracchio et al., 2015). In this work, we explore the admissions and diagnoses to predict trajectories. A patient’s admission refers to a set of diagnostic codes that describe what happened during a hospital stay (see Figure 1).
3.1. Data issues
In the task of trajectory prediction, Mimic-III represents a real challenge because its cardinality is relatively small, and because it uses the ICD-9 encoding, which is highly granular.
The total number of admissions in the dataset is 58,976. However, the distribution of patients considering the number of admissions is skewed, see Figure 2 – 38,983 patients have one single admission. This fact severely reduces the dataset as only data of patients with at least two admissions is useful for trajectory prediction. Besides that, a few admissions do not have any related ICD-9 codes, and others are not meaningful (negative time duration). With these restrictions, the number of admissions falls to 19,911; the number of patients falls from 46,520 to 7,483.
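The filtering just described can be sketched as follows. This is a minimal illustration, not the project's actual preprocessing: the record layout (`patient_id`, `admit_time`, `discharge_time`, `codes`) is a hypothetical simplification rather than the real Mimic-III schema.

```python
from collections import defaultdict

def filter_admissions(admissions):
    """Keep only admissions usable for trajectory prediction.

    `admissions` is a list of dicts with hypothetical keys:
    patient_id, admit_time, discharge_time (numbers), codes (list of str).
    """
    by_patient = defaultdict(list)
    for adm in admissions:
        # Drop admissions without diagnosis codes.
        if not adm["codes"]:
            continue
        # Drop admissions with a negative time duration.
        if adm["discharge_time"] < adm["admit_time"]:
            continue
        by_patient[adm["patient_id"]].append(adm)

    trajectories = {}
    for pid, adms in by_patient.items():
        # A single admission offers no "next admission" to predict.
        if len(adms) >= 2:
            trajectories[pid] = sorted(adms, key=lambda a: a["admit_time"])
    return trajectories

toy = [
    {"patient_id": "p1", "admit_time": 0, "discharge_time": 5, "codes": ["4019"]},
    {"patient_id": "p1", "admit_time": 10, "discharge_time": 12, "codes": ["25000"]},
    {"patient_id": "p2", "admit_time": 0, "discharge_time": 2, "codes": ["486"]},
    {"patient_id": "p3", "admit_time": 0, "discharge_time": 3, "codes": []},
    {"patient_id": "p3", "admit_time": 5, "discharge_time": 8, "codes": ["486"]},
    {"patient_id": "p4", "admit_time": 4, "discharge_time": 2, "codes": ["486"]},
    {"patient_id": "p4", "admit_time": 6, "discharge_time": 9, "codes": ["4019"]},
    {"patient_id": "p4", "admit_time": 1, "discharge_time": 3, "codes": ["5849"]},
]
trajectories = filter_admissions(toy)
```

In this toy input, only patients `p1` and `p4` keep at least two valid admissions, which mirrors how the filtering shrinks the dataset.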
Concerning the encoding of diagnoses, the drawback comes from the high cardinality of the ICD-9 standard, whose number of diagnosis codes sums up to 15,072 (the newer ICD-10 is over 4 times bigger). This cardinality reflects the granularity of detail with which the standard describes a disease along with its possible clinical manifestations. In Mimic-III, a total of 6,984 codes appear in the database instance. Figure 3 presents the distribution of the number of codes with respect to the number of admissions; although the distribution is not Gaussian due to the outlier of 9 codes per admission, its nearly Gaussian shape allows us to consider the simple average of 13 codes per admission as a reasonable descriptive parameter. As a result, the task of predicting the codes of the next admission lies in the range of $\binom{6984}{13}$, or nearly $10^{40}$ possibilities.
Alternative CCS encoding
The high granularity of the ICD-9 standard is a problem not restricted to this work; rather, it is a recurring problem in several research activities. The Healthcare Cost and Utilization Project (HCUP), a North-American association dedicated to healthcare research, has tackled the problem by issuing the Clinical Classifications Software (CCS) encoding (Cost and Project, 2015). Their classification scheme (which, despite its name, is not software) defines a specialist-established tabular mapping from ICD-9 to a less granular descriptive standard, the CCS. The goal is to ease statistical analysis and reporting. Table 1 illustrates the mapping of the disease tuberculosis from ICD-9 to CCS; in this example, 426 ICD-9 codes become 1 CCS code. The complete mapping scheme converts 15,072 ICD-9 codes into 285 CCS codes; in the case of Mimic-III (patients with at least two admissions), the mapping corresponds to the use of 271 CCS codes instead of 4,893 ICD-9 codes.
Table 1. Mapping of tuberculosis diagnoses from ICD-9 to CCS (excerpt).
|ICD-9 code||ICD-9 description||CCS code||CCS description|
|01000||Prim Tuberculosis Complex-unspec||1||Tuberculosis|
|01001||Prim Tuberculosis Complex-no Exam||1||Tuberculosis|
|01002||Prim Tuberculosis Complex-exm Unkn||1||Tuberculosis|
|…||…||…||…|
|01894||Miliary Tuberculosis Nos-cult Dx||1||Tuberculosis|
|01895||Miliary Tuberculosis Nos-histo Dx||1||Tuberculosis|
|01896||Miliary Tuberculosis Nos-oth Test||1||Tuberculosis|
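A table-driven mapping of this kind could be applied to an admission as in the sketch below. The dictionary holds only the six rows visible in Table 1, and the names `ICD9_TO_CCS`, `CCS_LABELS`, and `to_ccs` are hypothetical; a real pipeline would load the complete HCUP table instead.

```python
# Hypothetical excerpt of the ICD-9 -> CCS table (rows taken from Table 1).
ICD9_TO_CCS = {
    "01000": 1, "01001": 1, "01002": 1,
    "01894": 1, "01895": 1, "01896": 1,
}
CCS_LABELS = {1: "Tuberculosis"}

def to_ccs(icd9_codes, mapping=ICD9_TO_CCS):
    """Map an admission's ICD-9 codes to the (smaller) set of CCS codes."""
    # Codes missing from this excerpt are skipped; with the full table,
    # every valid ICD-9 diagnosis code would have a CCS counterpart.
    return {mapping[code] for code in icd9_codes if code in mapping}

admission = ["01000", "01895", "01896"]
ccs_codes = to_ccs(admission)
labels = [CCS_LABELS[c] for c in sorted(ccs_codes)]
```

Three distinct ICD-9 codes collapse into a single CCS code here, which is the loss of detail (and gain in tractability) that the coarser encoding trades on.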
With the CCS encoding, the problem of predicting the codes of the next admission falls from $\binom{6984}{13}$ possibilities to $\binom{271}{13}$, or nearly $10^{22}$ possibilities; that is, 18 orders of magnitude fewer possibilities. This simplification significantly improves the prediction performance. Of course, the choice of a less granular code has a price: the descriptive results of the predictions are much less detailed, so instead of “Tuberculosis of ear, tubercle bacilli found (in sputum) by microscopy”, the diagnosis will state only “Tuberculosis”. In the case of Mimic-III, this is an unavoidable workaround because 19,911 samples are not enough to train an artificial neural network to predict sets of 13 codes, each one pertaining to a 4,893-code domain.
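The size of the two search spaces can be checked with a short computation. The sketch assumes that the number of possibilities is counted as the number of unordered 13-code subsets of the code domain (13 being the average number of codes per admission), using the 6,984 ICD-9 codes found in Mimic-III and the 271 CCS codes; the counting model is our assumption.

```python
import math

AVG_CODES_PER_ADMISSION = 13  # average reported for Mimic-III

def log10_subsets(domain_size, k=AVG_CODES_PER_ADMISSION):
    """log10 of the number of unordered k-code sets drawn from a code domain."""
    return math.log10(math.comb(domain_size, k))

icd9 = log10_subsets(6984)  # ICD-9 codes present in Mimic-III
ccs = log10_subsets(271)    # CCS codes after the mapping
print(f"ICD-9: 10^{icd9:.1f}  CCS: 10^{ccs:.1f}  gap: {icd9 - ccs:.1f} orders")
```

Under this counting, the gap between the two encodings is between 18 and 19 orders of magnitude.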
3.2. Data and problem modeling
The problem treated here is stated as: given a patient’s sequence of admissions, possibly stored as an EHR, predict the most probable diagnoses that shall appear in the next admission of this patient at a given time $t$. A patient’s admission refers to a pair $A_i = (t_i, D_i)$, in which $i$ is the temporal order of the admission, $t_i$ is the timestamp stating when the admission occurred, and $D_i$ is an unordered set of diagnosis codes, so that $D_i \subseteq C$, in which $C$ is a standard set of codes such as ICD-9 or CCS. Furthermore, each patient’s EHR refers to a set of admissions $\{A_1, A_2, \ldots, A_n\}$. In our problem setting, for any admission $A_i$, we want to predict the codes of admission $A_{i+1}$; the prediction set corresponds to $\hat{D}_{i+1}$. In the context of artificial neural networks, we want to predict the following probabilities:
$$P(c \in D_{i+1} \mid A_1, \ldots, A_i), \quad \forall c \in C \qquad (1)$$
That is, for each possible code $c \in C$, and given admissions $A_1$ through $A_i$, we want to compute its probability of appearing in the next admission $A_{i+1}$. Or, in the conventional notation of the output of an artificial neural network, we seek to compute $\hat{y}$. Notice that, as the outcome is a set of probabilities, it can be interpreted as a set of recommendations.
Considering this problem setting, the input of an admission to an artificial neural network comes in the form of a $|C|$-dimensional multi-hot vector defined, in programming (array-like) notation, as:
$$x_i[c] = \begin{cases} 1, & \text{if } c \in D_i \\ 0, & \text{otherwise} \end{cases} \qquad (2)$$
Since we use batch processing instead of one multi-hot vector per iteration, in practice, Equation 2 expands to a set of patients $P$, each one with a set of $|C|$-dimensional admissions, which corresponds to the following input tensor:
$$X[i][p][c] = \begin{cases} 1, & \text{if code } c \text{ appears in admission } i \text{ of patient } p \\ 0, & \text{otherwise} \end{cases}, \quad X \in \{0,1\}^{n_{\max} \times |P| \times |C|} \qquad (3)$$
In Equation 3, the admissions become the first dimension of the tensor, corresponding to slices (orthogonal to its depth axis), and $n_{\max}$ is the largest number of admissions of any patient. This is meant to simplify the flow through the network; it is more convenient to have one patient per line and one code per column, which makes algebraic operations simpler since the computation is oriented to admissions. Also notice that the patients have different numbers of admissions, and admissions have different numbers of codes; to cope with that, elements with smaller cardinalities are padded with 0’s, which demands the use of masking to prevent residual computations in the padded positions of the tensor.
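The encoding of Equations 2 and 3, together with the padding mask, can be illustrated with NumPy. This is a sketch under our own assumptions: codes are represented as integer indices, and `build_tensor` is a hypothetical helper, not a function of the published code.

```python
import numpy as np

def build_tensor(patients, n_codes):
    """Build a (max_admissions, n_patients, n_codes) multi-hot tensor.

    `patients` is a list of patients; each patient is a list of admissions;
    each admission is a list of integer code indices in [0, n_codes).
    """
    max_adm = max(len(p) for p in patients)
    x = np.zeros((max_adm, len(patients), n_codes), dtype=np.float32)
    mask = np.zeros((max_adm, len(patients)), dtype=np.float32)
    for p, admissions in enumerate(patients):
        for i, codes in enumerate(admissions):
            x[i, p, codes] = 1.0  # multi-hot encoding of one admission
            mask[i, p] = 1.0      # 1 marks a real (non-padded) admission
    return x, mask

# Two toy patients over a 4-code domain: the first has two admissions,
# the second has three, so the first is padded with one all-zero slice.
toy_patients = [[[0, 2], [1]], [[3], [0, 1], [2, 3]]]
x, mask = build_tensor(toy_patients, n_codes=4)
```

The mask is what allows later stages to skip the padded third admission of the first patient instead of treating it as an admission with no diagnoses.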
3.3. Recurrent neural networks and fine tuning
We used recurrent neural networks (RNNs), whose principle is to use self-loop connections and a set of information gates whose dynamics produce a memory of past events across time steps; they contrast with feed-forward-only networks, which use neither self-loops nor memory. In order to design an architecture with performance superior to existing works, we considered the following types of RNNs: Jordan’s network (Jordan, 1997), Long Short-Term Memory (classic (Hochreiter and Schmidhuber, 1997b) and Google’s (Zen et al., 2016)), Gated Recurrent Units (classic (Cho et al., 2014) and minimal (Zhou et al., 2016)), and DoctorAI (GRU+embedding) (Choi et al., 2016); we also considered a feed-forward-only network for comparison and bidirectional recurrent neural networks (Schuster and Paliwal, 1997). We used many auxiliary techniques: Xavier initialization (Glorot and Bengio, 2010); dropout, L2 regularization, and the addition of Gaussian noise to the input to prevent overfitting; and gradient clipping and ADADELTA (Zeiler, 2012) for convergence. We considered the activation functions Leaky Rectified Linear Unit, sigmoid, hyperbolic tangent, and classical Rectified Linear Unit (Goodfellow et al., 2016). Gradient clipping was particularly effective in reducing the loss during each training epoch, although it slowed down convergence.
As previously mentioned, the cardinality of Mimic-III poses a great challenge for diagnosis prediction. As a consequence, we had to test a broad set of artificial neuron cells aiming to achieve high predictive performance. We focused on recurrent neural networks, which are recognized for their ability to deal with sequences in time thanks to the self-loop connections and information gates described in Section 3.3. Accordingly, we considered the following methods: Jordan’s network, Long Short-Term Memory (classic and Google’s), Gated Recurrent Units (classic and minimal), and DoctorAI; we also considered a feed-forward-only network for comparison. Additionally, we tested complementary variations of the architecture: the number of neurons in each layer, the number of layers, the use of an embedding layer, and the use of extra features (duration of admission, interval between admissions, and type of admission). In this section, we discuss our final architecture and the design decisions that led to it; in the next section, we present quantitative results that justify our choices.
4.1. Choosing an artificial neural network
After extensive testing, the main symptom of our problem setting was its sensitivity to the number of parameters; every time we added a significant number of parameters (layers or neuron nodes), the recall would fall markedly. We hypothesize that the small number of instances of Mimic-III made it difficult to have the network learn the underlying patterns, therefore reducing the recall for both training and testing. To cope with that, we tested many techniques, as presented in Section 5.3. Among the recurrent networks found in the most-accepted literature, the one with the smallest number of weights is Jordan’s network; the one with the largest number of weights is Google’s LSTM. The GRU network, which uses more weights than Jordan’s, performed better than Jordan’s but worse than Google’s LSTM. The Minimal GRU (MGRU) network uses even fewer weights than the classical GRU and, just as demonstrated by its authors, it did not lose performance despite using fewer gates; its performance was slightly superior to Google’s LSTM and classical GRU, while demanding less processing time. Hence, MGRU was chosen to be the core of our architecture; its equations are:
$$f_t = \sigma(x_t W_f + h_{t-1} U_f + b_f)$$
$$\tilde{h}_t = \tanh(x_t W_h + (f_t \odot h_{t-1}) U_h + b_h)$$
$$h_t = (1 - f_t) \odot h_{t-1} + f_t \odot \tilde{h}_t \qquad (4)$$
where $\sigma$ is the sigmoid activation function; $\tanh$ is the hyperbolic tangent activation function; the $W$’s and $U$’s are the weights to be optimized, together with the biases indicated with $b$’s. We initialize the square matrices ($U_f$, $U_h$) using identity, and the other matrices ($W_f$, $W_h$) using a Gaussian distribution with fixed mean and standard deviation. Notice that the order of the dot products depends on the orientation of the input data; we list the equations in the same order as our implementation code to ease understanding (see Section 3.2).
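A Minimal GRU layer of this kind can be sketched in NumPy as below. The sketch follows the formulation of Zhou et al. (2016) with the input placed on the left of each dot product; the parameter names, the zero mean, the placeholder standard deviation `std`, and the mask handling are our assumptions, not values or code taken from the paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_mgru(n_in, n_hidden, rng, std=0.01):
    """Identity for the square recurrent matrices, Gaussian for the others.

    `std` (and the zero mean) are placeholder values chosen for illustration.
    """
    return {
        "W_f": rng.normal(0.0, std, (n_in, n_hidden)),
        "U_f": np.eye(n_hidden),
        "b_f": np.zeros(n_hidden),
        "W_h": rng.normal(0.0, std, (n_in, n_hidden)),
        "U_h": np.eye(n_hidden),
        "b_h": np.zeros(n_hidden),
    }

def mgru_step(x_t, h_prev, p):
    """One Minimal GRU step for a batch; x_t has shape (batch, n_in)."""
    # Single forget gate, shared by the candidate and the state update.
    f_t = sigmoid(x_t @ p["W_f"] + h_prev @ p["U_f"] + p["b_f"])
    h_cand = np.tanh(x_t @ p["W_h"] + (f_t * h_prev) @ p["U_h"] + p["b_h"])
    return (1.0 - f_t) * h_prev + f_t * h_cand

def mgru_forward(x, mask, p):
    """Run over a (time, batch, n_in) tensor; padded steps keep the old state."""
    h = np.zeros((x.shape[1], p["U_f"].shape[0]))
    states = []
    for x_t, m_t in zip(x, mask):
        h_new = mgru_step(x_t, h, p)
        h = m_t[:, None] * h_new + (1.0 - m_t[:, None]) * h
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
params = init_mgru(n_in=4, n_hidden=4, rng=rng)
x = rng.integers(0, 2, size=(3, 2, 4)).astype(float)  # toy multi-hot input
mask = np.array([[1.0, 1.0], [1.0, 1.0], [0.0, 1.0]])  # patient 0 has 2 admissions
states = mgru_forward(x, mask, params)
```

Because the state is a gated blend of the previous state and a `tanh` candidate, it stays within (-1, 1), and the masked step leaves the state of the padded patient unchanged.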
4.2. Determination of the network architecture
Here, the first issue was to find an optimal number of neurons to use in the hidden layers. We tested layers ranging from 100 to 3,000 neurons; to our surprise, we reached a recall plateau at a number of neurons roughly equal to the number of neurons in the input layer, as demonstrated in Section 5.4. Following Equations 2 and 3, this number equals the number of distinct CCS diagnosis codes. For the current Mimic-III dataset instance, this number is 271, as discussed in Section 3.1. Similarly, despite the intuition that more layers should lead to better performance, our experiments demonstrated, for every tested technique, that the higher the number of hidden layers, the worse the performance. We observed lower performance in both training and testing, so it was not a matter of overfitting, but of capacity to learn the underlying function. This result was intriguing, especially because many authors claim that their techniques are immune to the vanishing/exploding gradient problem or support deeper networks satisfactorily. We did not further investigate the underlying reasons, but just verified that these claims did not hold for our problem setting; notwithstanding, this problem has been a topic of active research (Pascanu et al., 2014). In fact, our results are in line with a recent work by Frankle and Carbin (Frankle and Carbin, 2018), who state that neural networks can be as much as 90% smaller without losing performance.
In our tests, hence, we designed an architecture with one input layer, one MGRU hidden layer, and one standard output layer before the softmax probability distribution. Despite the satisfactory results, we hypothesized that more weights could help because they can detect more about the underlying patterns. However, since stacking more layers did not help, we explored using the principle of bidirectional recurrent neural networks.
4.3. Adding bi-directional parallelism
Bidirectional recurrent neural networks connect two processing flows computed in opposite directions in relation to the temporal dimension of the data. As a result, the output layer gets information regarding the past and the future states simultaneously. For our problem setting, this architecture represented significant performance gains, as presented in Section 5.5. This design corresponds to having two networks working in parallel instead of only one deeper network; as a result, the design is immune to gradient problems when considering both networks simultaneously, since their parallelism does not produce stacked layers. From an implementation point of view, the backward processing is achieved by simply duplicating the recurrent network and feeding it with data reversed with respect to its first dimension, which is ordered according to the temporal information of the admissions (see Equation 3). The architecture ends with a feed-forward flow that starts with a joining layer to combine the outcomes of the two networks by means of a weighted sum of the forward and backward computations. Given a forward hidden layer computation $\overrightarrow{h}$ and a backward hidden layer computation $\overleftarrow{h}$, we obtain their joint weighted sum according to:
$$h_{join} = \mathrm{LReLU}_{\alpha}(\overrightarrow{h}\, W_{fw} + \overleftarrow{h}\, W_{bw} + b_{join}) \qquad (5)$$
After that, the final output probabilities come from an output layer that feeds into a softmax operator:
$$\hat{y} = \mathrm{softmax}(\mathrm{LReLU}_{\beta}(h_{join}\, W_{out} + b_{out})) \qquad (6)$$
where LReLU corresponds to the Leaky Rectified Linear Unit activation function, with slopes $\alpha$ and $\beta$ as extra optimization parameters introduced in Equations 5 and 6, respectively. The use of the parameterized LReLU at the feed-forward stage of the network demonstrated superior results with respect to recall and speed of convergence when compared to the sigmoid, hyperbolic tangent, and classical Rectified Linear Unit (ReLU) functions.
Figure 4 illustrates the entire architecture, whose goal is to compute the following optimization:
$$\arg\min_{\theta} \mathcal{L}(y, \hat{y}) \qquad (7)$$
where $\theta = \{\overrightarrow{W_f}, \overrightarrow{U_f}, \overrightarrow{b_f}, \overrightarrow{W_h}, \overrightarrow{U_h}, \overrightarrow{b_h}, \overleftarrow{W_f}, \overleftarrow{U_f}, \overleftarrow{b_f}, \overleftarrow{W_h}, \overleftarrow{U_h}, \overleftarrow{b_h}, W_{fw}, W_{bw}, b_{join}, \alpha, W_{out}, b_{out}, \beta\}$ is the set of parameters of the architecture (the weights and biases of the forward and backward MGRU networks, of the joining layer, and of the output layer), and the loss function $\mathcal{L}$ corresponds to the cross entropy function computed over a multi-hot vector of known codes $y$ and a vector of code probabilities $\hat{y}$:
$$\mathcal{L}(y, \hat{y}) = -\sum_{c=1}^{|C|} \left[ y_c \log \hat{y}_c + (1 - y_c) \log (1 - \hat{y}_c) \right] \qquad (8)$$
According to the loss function, the closer to $y_c$ a probability $\hat{y}_c$ is, the smaller the loss.
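The feed-forward end of the architecture (joining layer, output layer with softmax, and cross-entropy loss) can be sketched in NumPy as follows. The parameter names, the toy shapes, and the small `eps` added inside the logarithms are our assumptions; the hidden states `h_fw` and `h_bw` are random stand-ins for the outputs of the forward and backward recurrent networks.

```python
import numpy as np

def lrelu(z, slope):
    """Leaky ReLU with a (trainable) negative-side slope."""
    return np.where(z > 0, z, slope * z)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def join_and_predict(h_fw, h_bw, p):
    """Weighted sum of forward/backward states, then output layer + softmax."""
    h_join = lrelu(h_fw @ p["W_fw"] + h_bw @ p["W_bw"] + p["b_join"], p["alpha"])
    return softmax(lrelu(h_join @ p["W_out"] + p["b_out"], p["beta"]))

def cross_entropy(y, y_hat, eps=1e-8):
    """Cross entropy between multi-hot targets and predicted probabilities."""
    per_code = y * np.log(y_hat + eps) + (1.0 - y) * np.log(1.0 - y_hat + eps)
    return -per_code.sum(axis=-1).mean()

rng = np.random.default_rng(0)
n_hidden, n_codes = 4, 6
p = {
    "W_fw": rng.normal(size=(n_hidden, n_hidden)),
    "W_bw": rng.normal(size=(n_hidden, n_hidden)),
    "b_join": np.zeros(n_hidden),
    "alpha": 0.01,
    "W_out": rng.normal(size=(n_hidden, n_codes)),
    "b_out": np.zeros(n_codes),
    "beta": 0.01,
}
h_fw = rng.normal(size=(2, n_hidden))  # stand-in forward states (batch of 2)
h_bw = rng.normal(size=(2, n_hidden))  # stand-in backward states
y_hat = join_and_predict(h_fw, h_bw, p)
```

As a sanity check on the loss, a prediction that concentrates probability on the true code, such as `[0.9, 0.05, 0.05]` for the target `[1, 0, 0]`, yields a smaller cross entropy than `[0.1, 0.45, 0.45]`.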
We present evaluation results concerning the multiple techniques found in the literature. The results are meant to justify our design decisions, as well as to guide future researchers in solving similar problems. We also discuss our best results compared to the works presented in Section 2.
5.1. Experimental setup
For training and testing, we used 90% and 10% of the patients, respectively. Training ran until the model recorded 10 consecutive epochs without a reduction in the cross-entropy loss (refer to Equation 8). The code was written on top of the Theano framework and ran on an Nvidia GeForce GTX 1080 Ti GPU, under the Debian operating system with 256 MB of memory.
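The patient-level split and the stopping rule can be sketched in plain Python. Both helpers are hypothetical illustrations: `run_epoch` stands for any callable that trains for one epoch and returns the loss, and `max_epochs` is a safety cap we add that is not part of the described setup.

```python
import random

def split_patients(patient_ids, test_fraction=0.1, seed=0):
    """Shuffle patients and reserve a fraction of them for testing."""
    ids = list(patient_ids)
    random.Random(seed).shuffle(ids)
    n_test = int(round(len(ids) * test_fraction))
    return ids[n_test:], ids[:n_test]

def train_with_patience(run_epoch, patience=10, max_epochs=1000):
    """Call run_epoch() until `patience` consecutive epochs bring no new best loss.

    Returns (best_loss, epochs_run).
    """
    best, stale, epochs = float("inf"), 0, 0
    while stale < patience and epochs < max_epochs:
        loss = run_epoch()
        epochs += 1
        if loss < best:
            best, stale = loss, 0
        else:
            stale += 1
    return best, epochs

train_ids, test_ids = split_patients(range(7483))

# Toy loss curve: three improving epochs, then a plateau.
losses = iter([5.0, 4.0, 3.0] + [3.0] * 20)
best, epochs = train_with_patience(lambda: next(losses))
```

Splitting by patient rather than by admission keeps all admissions of one patient on the same side of the split, so no trajectory leaks from training into testing.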
|Random||FF-Only||Jordan’s||DoctorAI||LSTM||LSTM Google||GRU||Min GRU|
5.2. Evaluation metrics
Due to the characteristics of the problem, we employ a metric commonly used for recommendation systems: recall at top-k recommendations. In our case, the top-$k$ recommendations refer to the $k$ diagnosis codes in the code set $C$ that have the highest probabilities (see Equation 1). Recall@k refers to the fraction of the codes in the answer set that appear among the top-$k$ recommendations, expressed by $\mathrm{Recall@}k = |\text{top-}k \cap D_{i+1}| \,/\, |D_{i+1}|$; a recommended code is correct if it pertains to the answer set $D_{i+1}$, that is, the set of codes actually observed in the next admission.
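A minimal sketch of the metric is given below, assuming the conventional definition in which the hits among the top-k recommendations are divided by the number of codes in the answer set; the function name and the toy data are ours.

```python
def recall_at_k(probabilities, true_codes, k):
    """Fraction of the true codes found among the k most probable codes."""
    ranked = sorted(range(len(probabilities)),
                    key=lambda code: probabilities[code], reverse=True)
    top_k = set(ranked[:k])
    return len(top_k & set(true_codes)) / len(true_codes)

# Five candidate codes (indices 0..4) with toy predicted probabilities;
# the next admission actually contains codes 1, 2, and 3.
probs = [0.05, 0.40, 0.10, 0.30, 0.15]
answer = {1, 2, 3}
r2 = recall_at_k(probs, answer, k=2)
r4 = recall_at_k(probs, answer, k=4)
```

With these numbers, the two most probable codes are both correct but cover only two of the three true codes, while the top four cover all of them, so the metric can only grow as k grows.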
5.3. Direct comparison of neuron cells
This first round of experiments refers to Section 4.1; here we compared eight techniques with respect to their Recall@k: random-initialization-only without training, feed-forward-only without recurrent cells, Jordan’s network, DoctorAI, LSTM, Google LSTM, GRU, and Minimal GRU. We executed each technique with exactly the same hyperparameters and fine-tunings used in our methodology (see Sections 4 and 5.1); that is, for each experiment, we simply changed the cell type, keeping everything else the same. Each technique ran over three randomized versions of Mimic-III, split into 90% for training and 10% reserved for testing. The average results, presented in Table 2, demonstrate superior performance, with a Recall@30 of up to 73%, for Google LSTM and Minimal GRU, which also had the smallest number of iterations before convergence. Surprisingly, the feed-forward-only network had a performance comparable to classic LSTM and GRU, with a Recall@30 of up to 72%. The probable reason is that a great portion of the patients have only two admissions, in which case the time-awareness of recurrent networks is not necessary. After these results, we chose the feed-forward-only, Google LSTM, and Minimal GRU techniques for further investigation.
5.4. Experimenting with cardinalities
After choosing the most adequate neuron cells, we proceeded with empirical tests regarding the number of layers and number of neurons, as explained in Section 4.2. For each kind of cell, we experimented with up to 3 layers and with 271, 542, and 1,084 cells, for a total of 9 different settings for each cell type; the number of cells is a multiple of the size of the input layer, as explained in Section 4. In Table 3, we see that more layers caused the system to lose performance, whereas the number of neurons did not affect the results as much but, of course, demanded more processing time. The feed-forward-only network was the least resilient setting; its performance decreased from 72% with one 271-node layer to 55% with three 1,084-node layers. Google LSTM and Minimal GRU, again, had similar performances; their Recall@30 ranged from 73% to nearly 60%, with Minimal GRU presenting slightly better results. Concerning the processing time, since Minimal GRU has fewer gates, it computes faster than LSTM, even when it runs for a few more iterations. Considering all these aspects, we decided on Minimal GRU as the neuron cell of our architecture and on one single 271-node hidden layer.
Table 3. Recall by number of hidden layers (1 to 3) and number of nodes per layer (271, 542, and 1,084).
|Nodes||Minimal GRU||Google’s LSTM||Feed-forward-only|
5.5. Further design improvements
After deciding on the Minimal GRU cell and on the cardinality of neurons and layers, the next step was to use more elaborate techniques to further improve the performance. We experimented with the principle of bidirectional recurrent neural networks, which led us to the parallel architecture discussed in Section 4.3 and to a higher performance improvement (refer to the first column of Table 4). For the bidirectional Minimal GRU network, the Recall ranged from 53% at the lowest $k$ to 79% at the highest $k$, more than 10% better than any other setting. Over this architecture, we experimented with other techniques, including the use of an embedding layer before the hidden layers and the use of extra features (duration of admission, interval between admissions, and type of admission).
As presented in Table 4, the embedding layer only reduced the performance, a side effect that we verified for all the settings previously reported; probably, the smaller cardinality of the CCS encoding does not sustain the use of embedding. The use of unsupervised pre-training, which provides a more adequate initialization of the weights based on auto-encoding-like preprocessing, was capable of reducing the time of convergence; however, we verified no significant performance improvements. Finally, the use of extra features found in the database, namely the type of the admission (newborn, elective, emergency, or urgent), the interval between admissions, and the duration of the admissions, provided slight improvements; refer to columns 3 to 7 in Table 4. We used these extra features via concatenation to the input tensor, without an embedding layer: the type in the form of a 4-code one-hot vector, and the time in the form of a single normalized extra slice. The first of these features, type, is very particular to Mimic-III; the duration, however, applies to any other medical dataset, provided that anonymization did not corrupt the timings. The final highest Recall@k, achieved with duration, ranged from 54% to 79% across the values of k evaluated.
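The concatenation of extra features described above can be sketched as follows. This is an illustrative assumption of how one admission might be encoded, not the authors' code: the function name, the division by a maximum duration as the normalization scheme, and the example values are hypothetical, while the input size of 271 and the four admission types come from the text.

```python
ADMISSION_TYPES = ["newborn", "elective", "emergency", "urgent"]
INPUT_SIZE = 271  # size of the input layer reported in the text


def encode_admission(code_ids, admission_type, duration_days, max_duration_days):
    """Build one input vector: multi-hot codes + one-hot type + normalized duration."""
    # Multi-hot vector over the diagnosis codes present in this admission.
    codes = [0.0] * INPUT_SIZE
    for c in code_ids:
        codes[c] = 1.0
    # 4-code one-hot vector for the admission type.
    type_vec = [1.0 if t == admission_type else 0.0 for t in ADMISSION_TYPES]
    # Single extra slice holding the duration scaled to [0, 1] (assumed scheme).
    duration = min(duration_days / max_duration_days, 1.0)
    return codes + type_vec + [duration]


vec = encode_admission({3, 120}, "emergency", duration_days=5, max_duration_days=50)
```

The resulting vector has 271 + 4 + 1 = 276 entries, so the extra features enlarge the input only marginally compared with the code vector itself.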
5.6. Comparison to related works
The work of Pham et al. (Pham et al., 2017) recommended both the use of an embedding layer and of the LSTM cell; our results, though, demonstrated that these recommendations do not hold for all settings, and embedding, in particular, was a very bad design choice. We directly compared to the methodology of Choi et al. (Choi et al., 2016), with exactly the same settings and using the code provided by the authors; we report better results, as presented in Table 2. In a broad sense, these two former works are narrow with respect to their domain, dealing with very specific diseases. With respect to previous works that do not rely on artificial neural networks, mentioned in Section 2, it is possible to affirm that they do not adapt directly to different settings, requiring very specialized data and problem modeling; or they demonstrate performances that demand million-scale volumes of records. Meanwhile, Deep Learning has become one of the most active areas of research; improvements appear every day, refining existing architectures or introducing hyper-parameters that render better performance. Our methodology relates to all the aforementioned issues.
Furthermore, it is worth mentioning that a strict, consolidated benchmark for patient trajectory prediction is not yet in broad use. The field still has much to evolve before the performance of a given predictor can be conclusively evaluated. The Mimic-III dataset is an open initiative to fill this benchmark gap; the research community would benefit from future works that experiment on Mimic-III in addition to their private datasets.
We conducted broad experimentation over the Mimic-III dataset, provided by MIT. The experimentation considered a vast set of techniques concerning sequence-to-sequence prediction, dealing with recurrent neural networks and other related methods. While searching for the best design, we had interesting insights that might inspire further work and/or guide design processes of similar problems. Our first finding was that recurrent network techniques do not fulfill many of the claims that abound in the respective literature. Specifically, we verified that they do not support the stacking of layers: the more layers, the worse the performance; effectively, our design ended up with one single layer. We also noticed that the networks whose cells used more gates, like LSTM and GRU, had the same performance as the network based on the much simpler Minimal GRU cells, which we ended up choosing for our design. When one considers the theoretical claims on why each gate is part of a given network, the facts do not support the theoretical premises; for our specific settings, the fewer gates, the better. Complexity seemed not to be the path to follow. These two findings were not exhaustively investigated; nevertheless, our results are a warning that some assumptions taken for granted must be revisited.
Concerning the Mimic-III dataset, we found that, although it can support the training of a neural network aiming at medical prognosis, this is feasible only by using a less granular coding for diseases, such as the HCUP-CCS encoding that we used instead of ICD-9. In fact, we verified that ICD-9 is far too vast for a dataset of the size of Mimic-III; moreover, if we consider that the newer ICD-10 encoding is over 4 times more granular, the research community should consider that, although it is adequate for precise medical description, it might not be suitable for effective statistical and analytical tasks, raising a demand for alternative database projects.
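The effect of a coarser coding on a small dataset can be illustrated with a toy sketch. The mapping below is hypothetical and is not taken from the official CCS tables: it merely groups a few fine-grained diagnosis codes under descriptive labels to show how grouping reduces the number of distinct targets and increases the number of examples available per target.

```python
from collections import Counter

# Hypothetical grouping of fine-grained codes into coarse categories.
# The group labels are placeholders, not official CCS category identifiers.
FINE_TO_COARSE = {
    "428.0": "heart_failure",
    "428.1": "heart_failure",
    "250.00": "diabetes",
    "250.01": "diabetes",
}

# Toy list of diagnoses observed in a small dataset.
records = ["428.0", "428.1", "250.00", "250.01", "428.0", "250.00"]

fine_counts = Counter(records)
coarse_counts = Counter(FINE_TO_COARSE[code] for code in records)

# Fewer distinct labels means more training examples per label.
min_fine = min(fine_counts.values())
min_coarse = min(coarse_counts.values())
```

In this toy case the four fine-grained labels collapse into two groups, and the rarest label goes from one occurrence to three; the same mechanism, at a much larger scale, is what makes a coarser encoding more tractable for a dataset with limited cardinality.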
Finally, several research lines can be further investigated as a continuation of this work. The parallel architecture proved to be a promising design decision; moreover, this design can be extrapolated to more than two parallel networks, each one benefiting from different characteristics of the data, since Mimic-III has many more semantic features that can support further investigation. We also suggest that the whole methodology be tested on more specific datasets, for predicting finer onsets such as heart failure or strokes, and also for predicting when the next onset might take place, as the data is rich with respect to temporal information. Lastly, the research on adversarial networks for data augmentation appeared shortly after we started working with longitudinal medical data; this is likely to be a topic of active research in the coming years.
This research was financed by Brazilian agencies Coordenacao de Aperfeicoamento de Pessoal de Nivel Superior (CAPES, Finance Code 001); Fundacao de Amparo a Pesquisa do Estado de Sao Paulo (Fapesp, grants 2019/04461-9, 2018/17620-5, 2017/08376-0, 2016/17078-0, and 2019/04461-9); and Conselho Nacional de Desenvolvimento Cientifico e Tecnologico (CNPq, grants 167967/2017-7, and 305580/2017-5). We also thank Nvidia Corporation for donating the GPUs that supported this work.
- Arandjelovic (2015) Ognjen Arandjelovic. 2015. Discovering hospital admission patterns using models learnt from electronic hospital records. Bioinformatics 31, 24 (09 2015), 3970–3976. https://doi.org/10.1093/bioinformatics/btv508
- Cho et al. (2014) Kyunghyun Cho, Bart van Merrienboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014. On the Properties of Neural Machine Translation: Encoder-Decoder Approaches. In Proceedings of the Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation. 103–111.
- Choi et al. (2016) Edward Choi, Mohammad Taha Bahadori, Andy Schuetz, Walter F. Stewart, and Jimeng Sun. 2016. Doctor AI: Predicting Clinical Events via Recurrent Neural Networks. JMLR Workshop Conf Proc 56 (Aug 2016), 301–318. https://www.ncbi.nlm.nih.gov/pubmed/28286600 28286600[pmid].
- Choi et al. (2015) Edward Choi, Nan Du, Robert Chen, Le Song, and Jimeng Sun. 2015. Constructing Disease Network and Temporal Progression Model via Context-Sensitive Hawkes Process. In Proceedings of the 2015 IEEE International Conference on Data Mining (ICDM) (ICDM ’15). IEEE Computer Society, Washington, DC, USA, 721–726. https://doi.org/10.1109/ICDM.2015.144
- Cost and Project (2015) Healthcare Cost and Utilization Project. 2015. Clinical Classifications Software. Technical Report. Agency for Healthcare Research and Quality. https://www.hcup-us.ahrq.gov/toolssoftware/ccs/CCSUsersGuide.pdf
- Douglas Miller and W. Brown (2017) D Douglas Miller and Eric W. Brown. 2017. Artificial Intelligence in Medical Practice: The Question to the Answer? The American Journal of Medicine 131 (11 2017). https://doi.org/10.1016/j.amjmed.2017.10.035
- Frankle and Carbin (2018) Jonathan Frankle and Michael Carbin. 2018. The Lottery Ticket Hypothesis: Training Pruned Neural Networks. CoRR abs/1803.03635 (2018). arXiv:1803.03635 http://arxiv.org/abs/1803.03635
- Ghassemi et al. (2014) Mohammad M. Ghassemi, Stefan E. Richter, Ifeoma M. Eche, Tszyi W. Chen, John Danziger, and Leo A. Celi. 2014. A data-driven approach to optimized medication dosing: a focus on heparin. Intensive Care Med 40, 9 (Sep 2014), 1332–1339. https://doi.org/10.1007/s00134-014-3406-5 25091788[pmid].
- Glorot and Bengio (2010) Xavier Glorot and Yoshua Bengio. 2010. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (Proceedings of Machine Learning Research), Yee Whye Teh and Mike Titterington (Eds.), Vol. 9. PMLR, Chia Laguna Resort, Sardinia, Italy, 249–256.
- Goodfellow et al. (2016) Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press. http://www.deeplearningbook.org.
- Henry et al. (2013) JaWanna Henry, Yuriy Pylypchuk, Talisha Searcy, and Vaishali Patel. 2013. Adoption of Electronic Health Record Systems among U.S. Non-federal Acute Care Hospitals: 2008-2012. In ONC Data Brief, Vol. 35. Office of the National Coordinator for Health Information Technology.
- Hochreiter and Schmidhuber (1997a) Sepp Hochreiter and Jürgen Schmidhuber. 1997a. Long Short-Term Memory. Neural Computation 9, 8 (1997), 1735–1780.
- Hochreiter and Schmidhuber (1997b) Sepp Hochreiter and Jürgen Schmidhuber. 1997b. Long Short-Term Memory. Neural Computation 9 (1997), 1735–1780.
- Jensen et al. (2014) Anders Boeck Jensen, Pope L. Moseley, Tudor I. Oprea, Sabrina Gade Ellesøe, Robert Eriksson, Henriette Schmock, Peter Bjødstrup Jensen, Lars Juhl Jensen, and Søren Brunak. 2014. Temporal disease trajectories condensed from population-wide registry data covering 6.2 million patients. Nature Communications 5 (24 Jun 2014), 4022. https://doi.org/10.1038/ncomms5022
- Johnson et al. (2016) Alistair E. W. Johnson, Tom J. Pollard, Lu Shen, Li-wei H. Lehman, Mengling Feng, Mohammad Ghassemi, Benjamin Moody, Peter Szolovits, Leo Anthony Celi, and Roger G. Mark. 2016. MIMIC-III, a freely accessible critical care database. Scientific Data 3 (24 May 2016), 160035. https://doi.org/10.1038/sdata.2016.35
- Jordan (1997) Michael I. Jordan. 1997. Chapter 25 - Serial Order: A Parallel Distributed Processing Approach. In Neural-Network Models of Cognition, John W. Donahoe and Vivian Packard Dorsel (Eds.). Advances in Psychology, Vol. 121. North-Holland, 471 – 495.
- Linderman and Adams (2014) Scott W. Linderman and Ryan P. Adams. 2014. Discovering Latent Network Structure in Point Process Data. In Proceedings of the 31st International Conference on Machine Learning - Volume 32 (ICML’14). II–1413–II–1421. http://dl.acm.org/citation.cfm?id=3044805.3045050
- Pascanu et al. (2014) Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. 2014. How to Construct Deep Recurrent Neural Networks. In 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Conference Track Proceedings.
- Pham et al. (2017) Trang Pham, Truyen Tran, Dinh Phung, and Svetha Venkatesh. 2017. Predicting healthcare trajectories from medical records: A deep learning approach. Journal of Biomedical Informatics 69 (2017), 218 – 229. https://doi.org/10.1016/j.jbi.2017.04.001
- Pirracchio et al. (2015) Romain Pirracchio, Maya L. Petersen, Marco Carone, Matthieu Resche Rigon, Sylvie Chevret, and Mark J. van der Laan. 2015. Mortality prediction in intensive care units with the Super ICU Learner Algorithm (SICULA): a population-based study. The Lancet Respiratory Medicine 3, 1 (01 Jan 2015), 42–52. https://doi.org/10.1016/S2213-2600(14)70239-5
- Schuster and Paliwal (1997) M. Schuster and K. K. Paliwal. 1997. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing 45, 11 (1997), 2673–2681.
- Wang et al. (2014) Xiang Wang, David Sontag, and Fei Wang. 2014. Unsupervised Learning of Disease Progression Models. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’14). ACM, New York, NY, USA, 85–94. https://doi.org/10.1145/2623330.2623754
- Zeiler (2012) Matthew D. Zeiler. 2012. ADADELTA: An Adaptive Learning Rate Method. CoRR abs/1212.5701 (2012).
- Zen et al. (2016) Heiga Zen, Yannis Agiomyrgiannakis, Niels Egberts, Fergus Henderson, and Przemyslaw Szczepaniak. 2016. Fast, Compact, and High Quality LSTM-RNN Based Statistical Parametric Speech Synthesizers for Mobile Devices. In Proc. Interspeech. San Francisco, CA, USA, 2273–2277.
- Zhou et al. (2016) Guo-Bing Zhou, Jianxin Wu, Chen-Lin Zhang, and Zhi-Hua Zhou. 2016. Minimal Gated Unit for Recurrent Neural Networks. Int. J. Autom. Comput. 13, 3 (2016), 226–234.