Clinically Meaningful Comparisons Over Time: An Approach to Measuring Patient Similarity based on Subsequence Alignment
Abstract
Longitudinal patient data have the potential to improve clinical risk stratification models for disease. However, chronic diseases that progress slowly over time are often heterogeneous in their clinical presentation. Patients may progress through disease stages at varying rates. This leads to pathophysiological misalignment over time, making it difficult to consistently compare patients in a clinically meaningful way. Furthermore, patients first present clinically at different stages of disease, which rules out simply aligning patients based on their initial presentation. Finally, patient data may be sampled at different rates due to differences in schedules or missed visits. To address these challenges, we propose a robust measure of patient similarity based on subsequence alignment. Compared to global alignment techniques that do not account for pathophysiological misalignment, focusing on the most relevant subsequences yields a more accurate measure of similarity between patients. We demonstrate the utility of our approach in settings where longitudinal data, while useful, are limited and lack a clear temporal alignment for comparison. Applied to the task of stratifying patients by risk of progression to probable Alzheimer’s Disease, our approach outperforms models that use only snapshot data (AUROC of vs. ) and models that use global alignment techniques (AUROC of ). Our results support the hypothesis that patient trajectories are useful for quantifying interpatient similarities and that subsequence matching can help account for heterogeneity and misalignment in longitudinal data.
1 Introduction
While the increasing availability of patient data holds out the promise of better risk stratification models, many problems or outcomes of interest are plagued with patient heterogeneity. That is, a patient’s trajectory through disease is regulated by complex interactions that result from clinical, lifestyle, genetic and environmental factors [1, 2]. While such trajectory information can help shed light on how patients progress through disease, it can be difficult to make meaningful longitudinal comparisons [3]. In particular, at the time of initial presentation, patients are often at varying stages of disease. They may be grouped under coarse clinical labels that range from early to late stage disease. Thus, simply aligning patients by the time of enrollment may lead to inaccurate comparisons. Moreover, disease may progress quickly or slowly depending on the patient. This introduces pathophysiological misalignment leading to inconsistent comparisons over time. Finally, patients may miss scheduled visits, leading to disparity in the lengths of their temporal data and/or sampling times.
In this work, we present an approach for patient risk stratification that utilizes all available longitudinal patient data while addressing the challenges mentioned above. We propose a measure of patient similarity that compares longitudinal patient data using an optimal-cost time-series matching algorithm based on dynamic time warping (DTW) [4]. While DTW typically assumes the beginnings and ends of time series to be aligned and constrains their endpoints to match, we relax this assumption. In particular, we use a subsequence matching approach, where the ends of the time series need not be aligned [5]. This approach allows us to utilize the most relevant longitudinal information while making fewer assumptions about how to align patients in time. Furthermore, subsequence matching is robust to variability in the lengths of time series. Finally, the proposed subsequence matching approach generalizes broadly, since it does not require expert knowledge or an extra hyperparameter to extract the most relevant parts of time series [6]. Our main contributions are:

we formulate the challenge of defining patient similarities based on longitudinal data as a minimum cost alignment problem,

we motivate and present an alignment method, based on subsequence matching, to compare longitudinal data from patients, and

we rigorously evaluate our approach and demonstrate an improvement in predictive performance from emphasizing the most relevant data using subsequence matching, compared to alignment techniques whose constraints ignore pathophysiology (such as global DTW, prefix matching and suffix matching).
We demonstrate the utility of our approach by applying it to the task of predicting patient progression to probable Alzheimer’s Disease (AD), specifically progression from Mild Cognitive Impairment (MCI) to probable AD. Aside from the challenges discussed above, predicting progression to probable AD is particularly challenging because of the poorly understood pathophysiology of the disease as well as the variable clinical presentation of AD in patients. Applied to a publicly available dataset of MCI patients, the subsequence matching approach outperforms a global DTW approach that considers the entire time series when calculating similarity between patients (AUROC of vs. ).
The rest of the paper is organized as follows. In Section 2, we briefly review related work in the context of longitudinal data. In Section 3, we introduce notation and present our proposed similarity metric based on subsequence matching. We present and discuss our experimental results on real data in Sections 4 and 5. Finally, in Section 6, we summarize our contributions and discuss their implications beyond this study.
2 Related Work
Time-series classification is a well-studied area of research. For an in-depth review of sequence classification, we refer the reader to [7]. Briefly, most time-series classification approaches focus either on (a) defining a measure of dis/similarity between raw signals or (b) extracting features such as motifs/shapelets/statistical summaries from the data [8, 9, 10, 11, 12, 13]. We focus on the first setting, in which we consider the entirety of the signal. We believe this is particularly important in settings where there is a paucity of available data. Such settings are common in the healthcare domain, where collecting patient data is often expensive and labor intensive. In addition, we limit our analysis to interpretable models. While others have proposed time-series classification techniques using deep learning frameworks [14, 15, 16], such approaches do not apply to a setting like ours, in which the number of training points is relatively small and the time series relatively short, and they yield predictions that are hard to interpret, limiting their utility.
Existing approaches to time-series classification based on raw signals often assume there exists some mechanism for fiducial temporal alignment. For example, in [12] Wiens et al. align examples based on time of admission, and in [13] Syed et al. align time series based on the phases of a heartbeat. In contrast, we focus on a more general setting, in which we relax the assumption that such an alignment mechanism is always available. We are not the first to relax this assumption. In particular, Silva et al. introduce prefix- and suffix-invariant dynamic time warping (DTW) [6], in which the global DTW constraints are relaxed up to a chosen tolerance that is treated as a hyperparameter. We build upon this idea by allowing each pair of time series to match the subsequences that achieve the minimum cost for that particular pair [5], thus obviating the need for an extra hyperparameter. Furthermore, we assume that the entirety of each time series is potentially relevant, but we allow the notion of relevance to vary across different pairs of time series. Thus, we can accurately compare time series that are misaligned in time as well as collected over different pathophysiological phases of the disease.
Time-series data are often multimodal, i.e., multiple heterogeneous sources of data exist. While DTW is trivially applicable to these settings, Shokoohi-Yekta et al. [17] find that the generalization of DTW to multimodal time series is sensitive to the distance metric used for multimodal comparisons. Furthermore, the application of DTW to multimodal time series is further complicated by block-wise missingness in the data. For example, patients in a study may receive a ubiquitous test (e.g., a blood draw) on every visit but a specialized test (e.g., a lumbar puncture) on only every other visit. While several approaches to dealing with missing data have been presented before [18, 19], these approaches do not consider multimodal time series. In this paper, we specifically deal with the case of block-wise missing data in multimodal time series.
3 Methods
We begin by introducing notation used throughout the remainder of the paper. Next, we present the proposed subsequence matching approach. This approach aligns patients longitudinally and calculates a distance function between a pair of time series.
3.1 Notation
We assume a dataset of $P$ patients, where each patient is associated with a sequence of visits. A visit is associated with a feature vector and a label. Each patient has a sequence of visits along with clinically assigned labels,
$$\left\{\left(\mathbf{x}^{(i)}_t, y^{(i)}_t\right)\right\}_{t=1}^{T_i},$$
where $i$ represents the patient index, $T_i$ is the number of visits of patient $i$, $P$ is the number of patients and $d$ is the number of features. The feature vector $\mathbf{x}^{(i)}_t \in \mathbb{R}^d$ encodes the biomarker measures associated with patient $i$ at visit $t$, and $y^{(i)}_t \in \{0, 1\}$ represents the label associated with that visit. In particular, $y^{(i)}_t = 1$ represents a positive label (i.e., progression to disease as discussed in Section 4).
For each patient $i$, we aim to predict the probability of progression from MCI to probable AD at each visit starting at and including their third visit:
$$\hat{p}^{(i)}_t = \Pr\left(y^{(i)}_t = 1 \mid \mathbf{x}^{(i)}_1, \dots, \mathbf{x}^{(i)}_t\right), \qquad t \geq 3.$$
It is worth noting that a prediction for a patient at a given visit is based only on data from past and present visits. Thus, a patient with $T_i$ visits is represented by $T_i - 2$ separate instances (each of which is a time series) in the dataset, each with its own label (see Figure 1). In what follows, we use $\mathbf{X}$ to denote a data instance of length $T$, with the understanding that several data instances may come from a single patient and that each instance is paired with a single label $y$.
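For concreteness, this instance construction can be sketched as follows (function and variable names are our own, not from the paper; for simplicity, the per-instance label is taken directly from a per-visit outcome list rather than derived from the 36-month progression rule):

```python
import numpy as np

def expand_patient(visits, labels):
    """Expand one patient's T visits into T - 2 prediction instances:
    one instance per visit from the third onward, using only the data
    available up to (and including) that visit."""
    visits = np.asarray(visits, dtype=float)
    instances = []
    for t in range(3, len(visits) + 1):   # third visit onward, 1-indexed
        instances.append((visits[:t], labels[t - 1]))
    return instances

# A patient with 5 visits contributes 3 instances, of lengths 3, 4 and 5.
demo = expand_patient([[1.0], [0.9], [0.8], [0.7], [0.6]], [0, 0, 0, 1, 1])
```

Note that each instance carries its own label, which is why (as discussed in Section 5.3) long time series do not bias the classifier: every eventual progressor is also represented by its shorter prefixes.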
3.2 Subsequence Matching
Given two time series $X = (x_1, \dots, x_N)$ and $Y = (y_1, \dots, y_M)$, we calculate a cost matrix $C \in \mathbb{R}^{N \times M}$, where $C(n, m) = \lVert x_n - y_m \rVert_2$. This cost matrix is used to fill an accumulated cost matrix $D$ in a recursive fashion as follows:
$$D(n, m) = C(n, m) + \min\left\{D(n-1, m-1),\; D(n-1, m),\; D(n, m-1)\right\},$$
with the initialization $D(1, m) = C(1, m)$ for $1 \leq m \leq M$ and $D(n, 1) = D(n-1, 1) + C(n, 1)$ for $1 < n \leq N$.
Following this, the distance between the two time series is calculated as $\min_{1 \leq m \leq M} D(N, m)$. Given time series $X$ and $Y$, we assume without loss of generality that $N \leq M$. When this is not the case, the cost matrix is transposed.
Compared to the traditional formulation of DTW that constrains the ends of time series to match [4], subsequence matching differs by allowing subsequences of the longer time series to match the shorter time series [5] (i.e., not all data points from the longer time series are necessarily included in the alignment). This is done to account for variability in the lengths of the time series in our dataset as well as the variable rates of disease progression that occur as a result of heterogeneity among patients.
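The recursion above fits in a few lines of code. The following sketch (our own illustration, not the authors' implementation) computes the subsequence-DTW distance between two multivariate time series, with the shorter series matched against the best-fitting subsequence of the longer one under a Euclidean ground cost:

```python
import numpy as np

def subsequence_dtw(x, y):
    """Subsequence-DTW distance between two series of shape (length, d).

    The shorter series is aligned, at minimum accumulated cost, against
    a subsequence of the longer one: the alignment may start and end
    anywhere in the longer series."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    if len(x) > len(y):                  # w.l.o.g., x is the shorter series
        x, y = y, x
    n, m = len(x), len(y)
    cost = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=2)
    D = np.empty((n, m))
    D[0, :] = cost[0, :]                 # the match may start anywhere in y
    for i in range(1, n):
        D[i, 0] = D[i - 1, 0] + cost[i, 0]
        for j in range(1, m):
            D[i, j] = cost[i, j] + min(D[i - 1, j - 1], D[i - 1, j], D[i, j - 1])
    return D[-1, :].min()                # ...and end anywhere in y
```

The only differences from global DTW are the first-row initialization (which lets the match start anywhere) and the final minimum over the last row (which lets it end anywhere).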
For the purpose of comparison, we also present results using prefix and suffix matching [6]. In prefix matching, the goal is to match some prefix of the longer time series to the shorter one while achieving a minimum-cost alignment. Similarly, in suffix matching, the goal is to match some suffix of the longer time series to the shorter one while achieving a minimum-cost alignment. Compared to the formulation in [6], our prefix/suffix matching does not need an extra hyperparameter to determine the relevant prefixes/suffixes in the data. Instead, we use a minimum-cost alignment approach to determine the prefix/suffix in a data-driven manner. Mathematically, suffix matching is equivalent to prefix matching on an accumulated cost matrix that is rotated 180 degrees.
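In code, the prefix variant pins the starts of both series together while leaving the end of the longer series free, and the suffix variant is its time-reversed mirror image. A minimal sketch (again our own illustration, not the authors' code):

```python
import numpy as np

def prefix_dtw(x, y):
    """Prefix matching: the starts of both series are constrained to align,
    and some prefix of the longer series is matched at minimum cost."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    if len(x) > len(y):                  # w.l.o.g., x is the shorter series
        x, y = y, x
    n, m = len(x), len(y)
    cost = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=2)
    D = np.empty((n, m))
    D[0, :] = np.cumsum(cost[0, :])      # starts are pinned together
    for i in range(1, n):
        D[i, 0] = D[i - 1, 0] + cost[i, 0]
        for j in range(1, m):
            D[i, j] = cost[i, j] + min(D[i - 1, j - 1], D[i - 1, j], D[i, j - 1])
    return D[-1, :].min()                # the end of the longer series is free

def suffix_dtw(x, y):
    """Suffix matching: prefix matching on the time-reversed series."""
    return prefix_dtw(np.asarray(x, dtype=float)[::-1],
                      np.asarray(y, dtype=float)[::-1])
```

Reversing the series before calling `prefix_dtw` realizes the 180-degree rotation of the accumulated cost matrix mentioned above.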
4 The Data
To test the utility of our proposed approach, we consider a large publicly available dataset pertaining to patients with AD.
4.1 Study Population
We use data made available by the Alzheimer’s Disease Neuroimaging Initiative (ADNI) [20]. In ADNI, subjects with MCI are recruited based on their cognitive test scores, with the goal of enrolling patients in the early stages of MCI. Once enrolled, patients are periodically examined over the duration of the study, at intervals of 6–24 months. At each examination, patients are diagnosed as either MCI or probable AD on the basis of questionnaire-based neuropsychological exams and the discretion of the attending clinician. These clinical diagnoses serve as our ground truth (typically, a gold-standard diagnosis of AD requires post-mortem histopathological examination and is rarely available).
While we focus on patients enrolled as MCI in this study, there is still considerable heterogeneity among MCI patients in terms of the extent of cognitive decline and their clinical symptoms [21]. Furthermore, patients progress to AD at varying rates (between years) and have sporadic missingness in their visit schedules (Figure 2 demonstrates the nature and extent of missingness in our data).
In our study cohort, we use patients who have a clinical diagnosis of MCI and three or more visits (i.e., time points). Our final study population consists of patients and examples (since patients have multiple visits). For each visit with at least 2 prior visits, we aim to predict whether or not a patient will progress from MCI to probable AD within the next 36 months (see Figure 1). Among our instances, (from 269 patients) remained stable as MCI whereas (from 258 patients) progressed to probable AD within 36 months. Note that a single patient can contribute both positive and negative instances to the dataset.
4.2 Features
At each patient visit, ADNI collects a variety of data including, but not limited to, MRI scans, PET scans and neuropsychological scores. We focused on MRI and FDG-PET data, since almost all patients in ADNI have an MRI scan ( have a PET scan) at every visit. Moreover, several studies have identified brain volume and metabolism as important biomarkers of AD [22, 23]. We use brain volumes and glucose uptake extracted from the MRI and PET scans as features, and represent the features collected over multiple visits as a multivariate time series. The raw MRI and PET scans were processed using FreeSurfer, and these data were made publicly available by the ADNI MRI team [24].
5 Experiments and Results
To evaluate the utility of our proposed approach on the task described above, we compare it to a number of different approaches. In this section, we begin by describing these approaches, then describe our experimental setup and finally present our results.
5.1 Comparison Methods
In Section 3, we proposed subsequence matching as a measure for inter-subject comparisons based on longitudinal data. In this section, we compare the discriminative power of the proposed approach to the following methods for measuring patient similarity/differences.
Snapshot: The baseline approach that uses only the most recent patient visit for comparison, thus ignoring temporal data.

Global DTW: A standard application of DTW, where the time series are constrained to match in their entirety, from beginning to end [4].

Prefix Matching: Compared to global DTW, prefix matching [6] constrains only the beginnings of the two time series to match; it is thus more constrained than subsequence matching.

Suffix Matching: Finally, for completeness, we consider a modification of the prefix matching approach, where only the ends of the time series are constrained to match.
5.2 Evaluation
In order to evaluate the performance of the proposed subsequence matching approach, we apply the inter-patient similarities it produces to the task of predicting progression from MCI to probable AD within 36 months (see Figure 1). In particular, we use the pairwise distances between time series as features for the prediction task. When using snapshot data, we use the biomarkers at the most recent visit as features.
5.3 Experimental Setup
In the following sections, we present results to test the following hypotheses:

incorporating longitudinal data improves predictive performance compared to using snapshot data only,

using relevant subsequences of longitudinal data to compare patients is more accurate than comparing entire time series.
We test the second hypothesis by comparing the subsequence matching approach that emphasizes the most relevant data with the global DTW approach that constrains the time series to match in their entirety.
For each patient, the pairwise distances between their time series and those of every other patient in the training data, computed using both MRI and PET features, serve as the feature vector for the classification model. For patients that did not have PET scans, we used explicit matrix factorization [25] to impute the missing distance measures.
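The paper relies on explicit matrix factorization [25] for this imputation step. As a rough illustration of the underlying idea (not the authors' implementation), missing pairwise distances can be filled in by iteratively projecting the partially observed matrix onto a low-rank approximation:

```python
import numpy as np

def complete_distances(R, observed, rank=1, iters=500):
    """Fill in missing entries of a pairwise-distance matrix via iterative
    low-rank completion (hard-impute style).

    R: matrix with arbitrary placeholder values at unobserved entries.
    observed: boolean mask, True where R is actually observed."""
    X = np.where(observed, R, R[observed].mean())  # initialize gaps with the mean
    for _ in range(iters):
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        low_rank = (U[:, :rank] * s[:rank]) @ Vt[:rank]  # best rank-`rank` fit
        X = np.where(observed, R, low_rank)        # keep observed, refresh missing
    return X
```

This sketch assumes the distance matrix is approximately low rank, which is the same structural assumption a factorization model makes; the rank and iteration count are illustrative choices.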
We used L2-regularized logistic regression as our classifier, implemented using the LIBLINEAR [26] package. We perform leave-one-patient-out testing, where all data belonging to a single patient are left out in a particular test fold. All hyperparameters were chosen through a nested cross-validation performed on the training data alone. We used the area under the ROC curve (AUROC) metric to evaluate our classifiers. We use the method presented by DeLong et al. [27] to compute confidence intervals and to perform statistical significance tests to compare competing prediction methods (significance level was set at ). All reported values are based on a two-sided z-test.
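The evaluation loop can be sketched as follows, assuming a precomputed instance-by-instance distance matrix. All names are our own, and the gradient-descent logistic regression is a simplified stand-in for the LIBLINEAR classifier (no intercept, fixed regularization, no nested hyperparameter search):

```python
import numpy as np

def fit_logreg(X, y, lam=1.0, lr=0.1, epochs=1000):
    """Tiny L2-regularized logistic regression fit by gradient descent."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-np.clip(X @ w, -30, 30)))
        w -= lr * ((X.T @ (p - y)) / len(y) + lam * w / len(y))
    return w

def leave_one_patient_out(D, labels, patient_ids):
    """Score every instance with a model trained on all *other* patients.

    D[i, j] is the distance between instances i and j. The held-out
    patient's instances are dropped from the feature columns as well as
    the training rows, so nothing about the test patient leaks into
    training."""
    scores = np.zeros(len(labels), dtype=float)
    for pid in np.unique(patient_ids):
        test, train = patient_ids == pid, patient_ids != pid
        w = fit_logreg(D[np.ix_(train, train)], labels[train])
        z = D[np.ix_(test, train)] @ w
        scores[test] = 1.0 / (1.0 + np.exp(-z))
    return scores
```

Each fold's feature space consists of distances to the training instances only, mirroring the paper's use of pairwise distances as features.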
We note that a potential issue with this dataset is the bias introduced by longer time series. In particular, longer time series could be more likely to eventually test positive, biasing the classifier as a result. However, this is not an issue here because we make multiple predictions for each time series (see Figure 1); thus, patients with long time series who eventually test positive are also represented by short time series, and any potential correlation between the lengths of time series and their labels is eliminated.
5.4 Results and Discussion
The results of our experiments are given in Table 1 and discussed below. Compared to using snapshot data only, using longitudinal data aligned by any of the DTW-based methods leads to an improvement in performance. The subsequence matching approach outperforms the other DTW-based methods of prefix matching (), suffix matching () and global DTW ().
Data         | Alignment Method     | AUROC ( CI)
-------------|----------------------|--------------
Snapshot     | Not Applicable       | ()
Longitudinal | Prefix Matching      | ()
Longitudinal | Suffix Matching      | ()
Longitudinal | Global DTW           | 0.822 ()
Longitudinal | Subsequence Matching | 0.839 ()
5.4.1 Longitudinal vs Snapshot models
To understand the source of the large improvement in performance from using subsequence matching compared to snapshot features, we constructed contingency tables to discover the examples where subsequence matching outperformed snapshot features (we used a cutoff of to classify a patient as positive). The main source of improvement was positive instances (that came from unique patients) where subsequence matching predicted correctly and snapshot did not (). As shown in Figure (a), these instances are characterized by a pronounced decline in brain volume close to the progression from MCI to AD. In comparison, the final hippocampal volumes of these instances (i.e., the features for the snapshot model) show considerable overlap between the positive and negative instances (see Figure (b)). We chose to visualize hippocampal volume because it receives the highest weight in the snapshot models and is well known to be an important predictor of AD [28]. We visualize the top instances with the largest differences in predicted probability between the two approaches (these differences were at least ). This suggests that a decline in brain volume is an important predictor of disease progression, more so than low brain volume alone.
5.4.2 Subsequence Matching vs Global DTW
The overall AUROC of global DTW was 0.822. Subsequence matching outperformed global DTW with an AUROC of 0.839 (). Our results suggest that the instances where subsequence matching outperformed global DTW were characterized by longer time series: among these instances, the time series were nearly twice as likely ( vs. ) to have 5 or more visits compared to the overall data. Intuitively, we believe the source of this improvement is the exponential distribution of the number of visits in our data. In particular, given that the vast majority of our time series have 3 or fewer visits (), allowing the longer time series to match only partially with these shorter time series allows for a more accurate measure of similarity between them.
6 Summary and Conclusions
In contrast to existing methodologies that use snapshot or cross-sectional data to stratify patients by risk of progression to disease, in this study we explored incorporating all available longitudinal patient data. While approaches exist for leveraging longitudinal data, they often assume the availability of some fiducial marker for temporal alignment. In contrast, we propose and evaluate an approach for comparing variable-length patient time series that lack such a fiducial marker. We consider a measure of similarity based on minimum-cost subsequence alignment. Our approach accounts for heterogeneous rates of decline in patients by nonlinearly warping the data during the alignment process, while focusing on the most relevant data.
We demonstrate the utility of our proposed similarity measure on the task of predicting which patients at an intermediate disease stage (MCI) are most likely to progress to AD within 36 months. The proposed similarity measure applies despite the variability in the lengths of the time series. In the ADNI dataset, the median number of visits per patient is , but this ranges from to , with about of patients having or more visits. Applied to these data, the proposed approach achieved an AUROC of , outperforming other nonlinear alignment techniques.
While we focused on the challenging task of predicting progression to AD, the proposed approach for measuring patient similarity based on longitudinal data could apply more broadly. In particular, this technique is applicable to other settings that lack a meaningful fiducial marker for alignment and in which disease progression manifests itself variably across patients.
References
 [1] Ginsburg, G.S., McCarthy, J.J.: Personalized medicine: revolutionizing drug discovery and patient care. TRENDS in Biotechnology 19(12) (2001) 491–496
 [2] Hamburg, M.A., Collins, F.S.: The path to personalized medicine. N Engl J Med 2010(363) (2010) 301–304
 [3] Alva, M., Gray, A., Mihaylova, B., Clarke, P.: The effect of diabetes complications on healthrelated quality of life: the importance of longitudinal data to address patient heterogeneity. Health economics 23(4) (2014) 487–500
 [4] Rabiner, L.R., Rosenberg, A.E., Levinson, S.E.: Considerations in dynamic time warping algorithms for discrete word recognition. The Journal of the Acoustical Society of America 63(S1) (1978) S79–S79
 [5] Müller, M.: Dynamic time warping. Information retrieval for music and motion (2007) 69–84
 [6] Silva, D.F., Batista, G.E., Keogh, E.: Prefix and suffix invariant dynamic time warping. In: Data Mining (ICDM), 2016 IEEE 16th International Conference on, IEEE (2016) 1209–1214
 [7] Xing, Z., Pei, J., Keogh, E.: A brief survey on sequence classification. ACM SIGKDD Explorations Newsletter 12(1) (2010) 40–48
 [8] Chiu, B., Keogh, E., Lonardi, S.: Probabilistic discovery of time series motifs. In: Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, ACM (2003) 493–498
 [9] Ding, H., Trajcevski, G., Scheuermann, P., Wang, X., Keogh, E.: Querying and mining of time series data: experimental comparison of representations and distance measures. Proceedings of the VLDB Endowment 1(2) (2008) 1542–1552
 [10] Luo, Y., Xin, Y., Joshi, R., Celi, L., Szolovits, P.: Predicting ICU mortality risk by grouping temporal trends from a multivariate panel of physiologic measurements. In: Thirtieth AAAI Conference on Artificial Intelligence. (2016)
 [11] Shokoohi-Yekta, M., Chen, Y., Campana, B., Hu, B., Zakaria, J., Keogh, E.: Discovery of meaningful rules in time series. In: Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM (2015) 1085–1094
 [12] Wiens, J., Horvitz, E., Guttag, J.V.: Patient risk stratification for hospital-associated C. diff as a time-series classification task. In: Advances in Neural Information Processing Systems. (2012) 467–475
 [13] Syed, Z., Scirica, B.M., Mohanavelu, S., Sung, P., Michelson, E.L., Cannon, C.P., Stone, P.H., Stultz, C.M., Guttag, J.V.: Relation of death within 90 days of non-ST-elevation acute coronary syndromes to variability in electrocardiographic morphology. The American Journal of Cardiology 103(3) (2009) 307–311
 [14] Razavian, N., Marcus, J., Sontag, D.: Multi-task prediction of disease onsets from longitudinal laboratory tests. In: Proceedings of the 1st Machine Learning for Healthcare Conference. (2016) 73–100
 [15] Thodoroff, P., Pineau, J., Lim, A.: Learning robust features using deep learning for automatic seizure detection. arXiv preprint arXiv:1608.00220 (2016)
 [16] Lipton, Z.C., Kale, D.C., Wetzel, R.: Modeling missing data in clinical time series with RNNs. arXiv preprint arXiv:1606.04130 (2016)
 [17] Shokoohi-Yekta, M., Wang, J., Keogh, E.: On the non-trivial generalization of dynamic time warping to the multi-dimensional case. In: Proceedings of the 2015 SIAM International Conference on Data Mining, SIAM (2015) 289–297
 [18] Van Esbroeck, A., Singh, S.P., Rubinfeld, I., Syed, Z.: Evaluating trauma patients: Addressing missing covariates with joint optimization. In: AAAI. (2014) 1307–1313
 [19] Xiang, S., Yuan, L., Fan, W., Wang, Y., Thompson, P.M., Ye, J.: Multi-source learning with block-wise missing data for Alzheimer’s disease prediction. In: Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining, ACM (2013) 185–193
 [20] Mueller, S.G., Weiner, M.W., Thal, L.J., Petersen, R.C., Jack, C., Jagust, W., Trojanowski, J.Q., Toga, A.W., Beckett, L.: The Alzheimer’s Disease Neuroimaging Initiative. Neuroimaging Clinics of North America 15(4) (2005) 869–877
 [21] Ganguli, M., Dodge, H.H., Shen, C., DeKosky, S.T.: Mild cognitive impairment, amnestic type: An epidemiologic study. Neurology 63(1) (2004) 115–121
 [22] Weiner, M.W., Veitch, D.P., Aisen, P.S., Beckett, L.A., Cairns, N.J., Green, R.C., Harvey, D., Jack, C.R., Jagust, W., Liu, E., et al.: The Alzheimer’s Disease Neuroimaging Initiative: a review of papers published since its inception. Alzheimer’s & Dementia 9(5) (2013) e111–e194
 [23] Frisoni, G.B., Fox, N.C., Jack, C.R., Scheltens, P., Thompson, P.M.: The clinical use of structural MRI in Alzheimer disease. Nature Reviews Neurology 6(2) (2010) 67–77
 [24] Jack, C.R., Bernstein, M.A., Fox, N.C., Thompson, P., Alexander, G., Harvey, D., Borowski, B., Britson, P.J., L Whitwell, J., Ward, C., et al.: The Alzheimer’s Disease Neuroimaging Initiative (ADNI): MRI methods. Journal of Magnetic Resonance Imaging 27(4) (2008) 685–691
 [25] Hu, Y., Koren, Y., Volinsky, C.: Collaborative filtering for implicit feedback datasets. In: Data Mining, 2008. ICDM’08. Eighth IEEE International Conference on, IEEE (2008) 263–272
 [26] Fan, R.E., Chang, K.W., Hsieh, C.J., Wang, X.R., Lin, C.J.: LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research 9 (2008) 1871–1874
 [27] DeLong, E.R., DeLong, D.M., ClarkePearson, D.L.: Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics (1988) 837–845
 [28] Jack, C.R., Knopman, D.S., Jagust, W.J., Shaw, L.M., Aisen, P.S., Weiner, M.W., Petersen, R.C., Trojanowski, J.Q.: Hypothetical model of dynamic biomarkers of the alzheimer’s pathological cascade. The Lancet Neurology 9(1) (2010) 119–128