Revealing patterns in HIV viral load data and classifying patients via a novel machine learning cluster summarization method
Samir Farooq1,2, Samuel J. Weisenthal1,4, Melissa Trayhan1,2,4, Robert J. White1,2,4, Kristen Bush1,4, Peter R. Mariuz3, Martin S. Zand1,2,4,*
1 Rochester Center for Health Informatics, 265 Crittenden Boulevard, Rochester, NY 14642-0708, USA
2 Department of Medicine, Division of Nephrology, 601 Elmwood Ave - Box 675, Rochester NY 14642, USA
3 Department of Medicine, Division of Infectious Diseases, University of Rochester Medical Center/Strong Memorial Hospital AIDS Center, 601 Elmwood Ave, Rochester NY 14642, USA
4 Clinical and Translational Science Institute, University of Rochester Medical Center, 265 Crittenden Boulevard, Rochester, NY 14642-0708, USA
HIV RNA viral load (VL) is an important outcome variable in studies of HIV infected persons. There exists only a handful of methods which classify patients by viral load patterns. Most methods place limits on the use of viral load measurements, are often specific to a particular study design, and do not account for complex, temporal variation. To address this issue, we propose a set of four unambiguous computable characteristics (features) of time-varying HIV viral load patterns, along with a novel centroid-based classification algorithm, which we use to classify a population of 1,576 HIV positive clinic patients into one of five different viral load patterns (clusters) often found in the literature: durably suppressed viral load (DSVL), sustained low viral load (SLVL), sustained high viral load (SHVL), high viral load suppression (HVLS), and rebounding viral load (RVL). The centroid algorithm summarizes these clusters in terms of their centroids and radii. We show that this allows new viral load patterns to be assigned pattern membership based on the distance from the centroid relative to its radius, which we term radial normalization classification. This method has the benefit of providing an objective and quantitative method to assign viral load pattern membership with a concise and interpretable model that aids clinical decision making. This method also facilitates meta-analyses by providing computably distinct HIV categories. Finally we propose that this novel centroid algorithm could also be useful in the areas of cluster comparison for outcomes research and data reduction in machine learning.
The primary clinical goal of HIV treatment and patient engagement is suppression of the HIV viral load (VL), as measured by low or undetectable circulating HIV RNA levels. However, VL most often fluctuates over repeated measurements, with a range that spans 8 orders of magnitude from 0 (undetectable) - copies/mL. VL is regularly monitored for signs of progression of HIV infection. Standard HIV treatment protocols are based on VL measurements , especially when monitoring responses to antiretroviral therapy (ART). Monitoring of VL helps to determine whether ART therapy was able to successfully suppress patient viral load . Individuals with sustained high viral loads (SHVL) are at greater risk of secondary transmission, clinical progression to AIDS, and death[3, 4, 5, 6]. In contrast, significant reduction in viral load, or high viral load suppression (HVLS), lead to immune recovery, as measured by CD4 T cell levels, and can reduce or eliminate the risks of SHVL. Furthermore, patients sustaining low-level viral load (SLVL), or with a rising viral load after previous suppression, have a high incidence of treatment failure . Thus, developing an objective measure of viral load status, and categorization of patients by time varying patterns of viral load, is critical for standardizing both therapy and comparing research protocol efficacy.
Reports in the current literature differ in the definition “high viral load” [9, 10, 11, 12], and their findings of how long it takes a patient on HAART to suppress their viral load [9, 2, 10, 13]. We summarize some of the published approaches here (for greater detail see Supplementary Material). With respect to viral load levels, Terzian et. al. defined SHVL as two consecutive viral load measurements (VLM) 100,000 copies/mL . Furthermore, they define durably suppressed viral load (DSVL) as all VLM 400 copies/mL. In contrast, Greub et. al. focused on detecting low level viral rebound (LLVR) in patients by first considering patients with an initial consecutive VLM pair 50 copies/mL, and classified LLVR as having subsequent maximum VLM between 51-500 . Alternatively, Rose et. al. investigated the use of five different frameworks to categorize suppressed versus not-suppressed viral load . Their approach excluded patients with VLM200 at baseline, and stratified the remainder with regard to VL suppression using an 8 month window centered around 24 months after the start of VLM (18-30 months). Phillips et al. characterized VLM responses to ART , utilizing a 24-40 week window and a rule-based method to identify two populations of HIV patients (Viral Failure and Viral Rebound). Despite these studies, no formal standard has been adopted to classify a patient as having DSVL, SHVL, HVLS, SLVL, or rebounding viral load (RVL) patterns.
Attempts to classify patient viral load states are further complicated by the challenges of analyzing viral load data. Real-world viral load measurements are taken intermittently over time and may be absent due to a variety of factors (e.g. travel, social circumstances, non-adherence), leading to incomplete and irregularly spaced data points. Differences in the sensitivity of the multiple assays used to measure viral loads result in different measures of “undetectable” viral loads, further complicating analyses. Thus, there is a need for analytic techniques that can adjust for these details and classify viral load states.
Machine learning methods can provide objective, unsupervised classification of patient clinical status. These methods begin by assigning a set of features to a patient (e.g. demographics, laboratory measurements, therapies) and then performing computational clustering to identify similar classes of patients. Several studies have applied machine learning methods in HIV research to predict HIV viral load responses  or CD4 T cell counts, to distinguish between suppressed and viremic patients, and to select therapeutic regimens. None, however, have used machine learning to create a standard classification for viral load status.
To address these issues, we propose a set of unambiguous features, which combined (i.e. a feature vector), captures distinct dynamic patterns present in viral load measurements over time. In addition, we have developed a novel centroid algorithm to cluster HIV positive subjects based on these patterns. Here we present the derivation of our machine learning method, and demonstrate its application to cluster 1,576 HIV patients with repeated VL measurements over a 5 year period. We found that patient viral load measurements can be classified into five time-varying patterns, corresponding to clinically relevant states. We note that the method and resulting categories can be used to standardize definitions of VL patterns across research studies.
Materials and Methods
Human Subjects Protection
This proposal was reviewed and approved by the University of Rochester Human Subjects Review Board (protocol number RSRB00068884). The analysis in this paper is presented in compliance with Center for Medicare Services (CMS) current cell size suppression policy. Data were coded such that patients could not be identified directly in compliance with the Department of Health and Human Services Regulations for the Protection of Human Subjects (45 CFR 46.101(b)(4)).
We obtained medical encounter data from all patients with an HIV diagnosis in the University of Rochester Medical Center’s electronic medical record system (EMR) between 2011 and 2016. For each patient we have information on their age, gender, race, ethnicity, zip code, and coded procedures with associated ICD9 and ICD10 diagnosis codes. All patient viral load and CD4 measurements during the study period were also obtained. There were a total of 1,892 patients with at least one viral load measured, with 1,576 of these patients having at least three viral load measurements. Viral load measurements 48 copies/mL, present as categorical values “NEG”, “POS 20”, or “POS 48” were transformed into numerical values of 0, 20, and 48 respectively.
The data is provided in
Hardware and Software Specifications
Analyses were performed on a Windows 8 server with Intel(R) Xeon(R) CPUs E5-2620 v2 @ 2.10GHz and 256GB of RAM. Python 2.7 was used for most data mining and machine learning under Spyder v.3 installed from Aanaconda2 (64-bit). The default packages available in Anaconda were used for analysis, including, but not limited to: NumPy, scikit-learn, SciPy, datetime, csv, math, Matplotlib, pip, operator, copy, random, and time. Using pip we installed the webcolors and pydotplus packages for rendering a decision tree. Microsoft Excel was used to open, edit, and view flat files in .csv and .xlsx format. SQLite was used to store, query, and clean ~1.5 GB of data. Analytic code is available for download at: https://github.com/SamirRCHI/Viral_Load_Data_Categorization.
Viral Load Analysis Methods
Mathematical notations for this work are described in Table 1.
|The number of usable patients in the data. In our case 1576 patients.*|
|Refers to a single patient.|
|The total number of viral load measurements patient has taken.|
|All viral load counts of patient in order of time.|
|Refers to the viral load count of patient in , where .|
|All temporal instances corresponding to .|
|Temporal instance of viral load , where .|
|The maximum viral load for all patients, ().**|
|The minimum viral load for all patients, ().**|
|Hadamard Product - elemental-wise multiplication of arrays.|
|*This is after selecting for patients with measurements.|
|**This value changes after transformation of the data.|
Based on temporal patterns of viral load described in the literature, viral load pair distribution of our data (Fig S1), and a further extensive investigation into the data, we hypothesized six potential temporal viral load patterns (Fig 1).
Durably Suppressed Viral Load (DSVL): Having consistently suppressed their viral loads at or near the undetectable range.
Sustained Low Viral Load (SLVL): Viral load counts which are constantly slightly higher than the undetectable range.
Sustained High Viral Load (SHVL): Viral load counts which are constantly in a range considered high risk for HIV complications (e.g. opportunistic infections, malignancy).
High Viral Load Suppression (HVLS): A viral load pattern in which the terminal portion of the curve has a negative slope and the terminal data point is in the low or suppressed range. This could have a few different speeds or styles of suppression - rapid, gradual, or slow.
Rebounding Viral Load (RVL): A viral load pattern in which viral loads are unstable, with the measurement at one time step seemingly being independent of the next measurement.
Emerging Viral Load (EVL): Having a steady, or rapid, emergence of high viral load while the first few measurements of viral load were suppressed. While we have found no mention of this type of pattern in the literature, and found that this pattern did not occur in our data set, VL data sets could contain this pattern.
It is important to note that our definitions are pattern based, and do not explicitly select absolute viral load cutoff levels, nor selects a specific window, as previous reports have done [8, 9, 10, 13]. This has the advantage of allowing the absolute viral load levels, and critical time windows, to emerge from the analysis, and does not preclude incorporation of absolute levels (e.g. VLM400) at a later stage into pattern specification.
Feature Vector Definition
We next designed a feature vector to capture characteristics that would allow us to distinguish between viral load patterns. Since viral load data is asynchronous and noisy, with variable numbers of data points for each subject, we argue that one or two viral load measurements are too few to accurately judge viral load behavior. Hence we restrict ourselves to using patients with 3 viral load measurements. In addition (Fig S1), viral load values at the lower limit of detection are a function of the specific assay used, and appear in our data set as 0, 20, and 48 copies/mL. Thus, plots of the transformed data have discretely spaced values at the lower level of detection, although these values all capture the undetectable range of viral load. Additionally, we adjusted the data by to avoid . The addition of 10 to VL (instead of 1) is used to minimize the distance between the undetectable values: 0, 20, and 48 (copies/mL). Thus, in our notation, all the values related to viral load are assumed to have been adjusted to this measure. For example, and .
Using the transformed VL data, we extract several relevant features of the VL measurements over time. These features are used for machine learning classification of individual patient VL time series, and designed to distinguish patterns in VL change while minimizing the effects of noise, limited by the number of measurements. We do not limit feature extraction based on the total elapsed time of viral load measurements because the optimal time-point for determining viral load class is not well established. The attributes for feature extraction are: relative area of viral exposure, weighted recency reliability (wRR), adjusted maximal difference, and interquartile range (IQR).
Relative Area of Viral Exposure (Area) - is defined as the area under the viral load curve relative to the total viral load area possible. Therefore the value of this feature will naturally be between 0 and 1. We choose a normalized, relative score, as the total time span between the first and last viral load measurement differs between patients. This feature is similar to finding the mean and median, except it is sensitive to the dimension of time, hence yielding more information. The feature is calculated by summing the area of each trapezoid created by each pair of viral load values, followed by dividing by the total possible area (Eq 1).
Weighted Recency Reliability (wRR) - Due to viral load noise, the last viral load measurement is not necessarily an accurate reflection of the patient’s overall, or most recent, viral load trend. For example, a patient may have a VL whose average slope is negative, indicating high viral load suppression over time (HVLS). However, the last measurement may be slightly higher than the trend. Thus heavily weighting the last measurement could lead to mis-classification as rebounding viral load (RVL). To account for this, we calculate a weighted mean where the weight of the VL measurement increases with time. More specifically, the weight function follows an inverse square root function () rather than an inverse function (). This has the advantage of avoiding rapid convergence of to zero when time is measured in units of days (Eq 2). Weighted recency is then calculated as the dot product of the viral loads and the weights divided by the sum of the weights (Eq 2).
We were also interested in how reliable is as a representation of the patient’s viral load trend. To this end, we calculated the absolute deviations from the viral load measurements to (Eq 2). Rather than averaging the deviations, we take the median to reduce the effects of outliers and call this our weighted recency reliability measure (Eq 2). We take the inverse to force the range of the result to be between [0,1]; a property made to use in our next proposed feature, adjusted maximal difference.
Adjusted Maximal Difference (Adj MD) - We used a time-independent method to measure the final change in viral load by determining maximal VL difference defined as the difference between the “peak” and last VL measurements. To distinguish between viral load suppression or emergence, we calculate the “peak” as the maximum of the absolute deviations (Eq 2) and retain the sign of the result. We expected the positive scores to effectively isolate the EVL group, however, we instead found that retaining the positive (emergent) scores lead to mis-categorization of SHVL and RVL groups without clearly identifying EVL patterns. This, along with other investigation into the data, led us to conclude that the EVL pattern does not exist in our data, but we refrain to make generalizations to all healthcare facilities. Be that as it may, we force (ground) the positive scores down to zero for proper labeling of SHVL and RVL (Eq 1).
Due to the varying nature in viral load measurements, we are hesitant to use the final viral load measurement as a means of judging suppression. Thus we propose to use instead. To reduce the effects of rebounding patients being falsely labeled as suppressed patients, we multiply our result by - as rebounding patients are expected to have a low score in the range [0,1]. The maximal difference is necessary in order to ensure that the suppression type of viral load patterns are classified appropriately (Eq 3).
Interquartile Range (IQR) - This feature is added to further segregate the rebounding patients and follows the standard interquartile range calculation (Eq 4).
Here we formally define keywords which will appear in the analysis: Let Feature extraction be the process of determining the values , , , and from a set of patients (using their viral load patterns) with the formulations given above. Then a feature vector () contains the values , , , and extracted from patient ’s viral load pattern. The words sample or point are also used here interchangeably. The term feature () can be thought of as a column vector for all patients in the dataset consisting of the four attributes: , , , and (Fig S2A). Finally, the terms label assignment, VL pattern membership assignment, patient categorization, and prediction, all refer to the same principle: To assign the most appropriate label which characterizes the viral load pattern of a patient. However, while the principle is the same, the method of assigning such an appropriate label differs depending on the categorization method, or the learning method, being used.
Feature Extraction and Normalization
We began by transforming viral load data by min-max normalization  to equally weight the temporal features of the VL series (Eq 2). That is, we normalize the features, , to a range between [0, 1] using equation 2 where (Fig S2B.1).
Next, we examined each of the 4 features for all patients with viral load measurements ( patients) (Fig S2A), which did not exhibit distinct bi-variate clustering (Fig S3). A correlation coefficient analysis between the features (Table S1) reveal that the Adj MD feature appears to be linearly independent to Area and wRR. There does seem to be some linear dependence between IQR and Adj MD, and Area to wRR and IQR. As expected, the largest linear dependency appears to be between wRR and IQR. These results suggest the separation between viral load patterns will be most noticeable between the Area and the Adj MD features - as we designed them to be. Also, although Adj MD is dependent upon wRR, we find that their correlation coefficient is very low (0.033) as they try to capture two very different phenomenon.
We then performed hierarchical clustering of the individual subjects using a Euclidean distance metric and the Ward’s criterion to minimize the total within-cluster variance, revealing a clear separation into 5 distinct patient groups (Fig S2C).
From Fig 2 and Fig 3, we find that the bluish green cluster (n=442) corresponds to the DSVL patient group, showing the lowest viral loads amongst the clusters identified and the highest weighted recency reliability. The patients in the vermillion-red cluster (n=46) correspond to the SHVL group, exhibiting the highest relative area and very low IQR. Compared to the DSVL cluster, the blue cluster (n=535) has slightly greater area and IQR with a significant difference in the weighted recency reliability. Using this information, along with the general patterns shown by Fig 3, we identify this as being the SLVL group. The algorithm also identifies the black (n=237) and redish purple (n=316) clusters, which appear to correspond to the RVL and HVLS classes respectively. The RVL cluster has a low weighted recency reliability and high IQR. In contrast, the HVLS cluster has a lower area, higher weighted recency reliability and most importantly very low adjusted maximal difference (Fig 4).
Upon examining Fig 3, we also find that the viral load patterns seem to be similar within cluster, and dissimilar between clusters. Interestingly, in each extracted cluster, we find that there are a few patients whose last viral load measurement was taken around 1,827 days - which is equivalent to the full span of five years that our patient’s viral loads have been monitored. This suggests that these clusters don’t disappear after some elapsed time, but rather each type of pattern can be found at virtually any time point.
In regards to the HVLS group, we find large spikes in the middle of their time-series. We hypothesize that this may be due to the asynchronous timing of measurements between subjects, the natural variation in biological responses between the groups, or patient variability in adherence to therapy. This observation also reflects one limitation of asynchronous outcomes data sampling without a “completion” endpoint, a characteristic of most prospective, randomized clinical trials. If measurements were stopped at one of the spikes, the adjusted maximal difference feature may be weighted in the favor of being classified as a patient whose viral load is rebounding. This phenomenon could be viewed with two different perspectives: First, this may indicate that the some patients labeled as having suppressed their viral loads should have been labeled as having rebounding viral loads. Or second, this changing in class labels show that these features do not restrict a patient to forever be restricted to one category. Rather - depending on biological or therapeutic responses, the patient may now belong to a different category all-together.
Comparison with Existing Categorization Methods
Visually, we find that the SLVL group detected by our method is very similar to the LLVR group defined by Greub et. al. (Fig 5). Furthermore, it appears that the methods trying to capture SHVL, viral rebound, and viral failure patients did not succeed as well as the identification of SHVL and RVL patients in our method. RMVL repeat continuous visually appears to have performed very well in identifying patients whom have suppressed their viral loads. However, we argue that our analysis performs perhaps slightly better in identifying the suppression group (HVLS), as we find that the last viral load measurement for each patient (black dots in Fig 5) are consistently low using our method. The existing methods may not have performed as well because they rely on a window or a consecutive pair measure which may be too subjective for assigning VL pattern membership. Furthermore, based on Fig 5, Rose et. al.’s assumption, that patients with a baseline viral load less than 200 have consistently low viral variation, may be subject to scrutiny. Lastly we wish to emphasize that while some of these existing categorization methods are successful in identifying a specific group of patients, our method is unique as it attempts to give definitive group labels to each viral load pattern.
To address the potential issue of misclassification, and to assess the changing nature of viral load pattern membership assignment, we performed a time-varying classification sensitivity analysis. First, we train a supervised machine learning method on the results from our original classification. Then, to determine the classification stability of our results, we iteratively determine membership assignment given only partial viral load data for every patient. Our objective is to pair the feature vector with a supervised learning method that has high classification stability in the setting of incomplete, time varying, VL data. Additionally, the method should be easy to apply by clinical researchers to assign viral load pattern membership to their patients for cohort studies. Hence, easy interpretation and reconstruction of the model is ideal.
Prior to performing partial viral load membership assignment, we initially consider several common supervised learning methods for learning on our initial classification results: k-nearest neighbors (k=5,7,9) , support vector machine (SVM), decision tree, AdaBoost, and random forests. We tested the performance of these methods with leave-one-out cross-validation (LOOCV) on our initial classification results, and by calculating the scores of the prediction of each category. The score is the harmonic mean of precision and recall  defined as equation 3. The results of each are superlative with the strange exception of AdaBoost (Table 2).
|Push and Pull||0.9737||0.9347||0.8842||0.9027||0.8767||0.9144|
However, these supervised learning techniques defeat our objective of offering a simple model which clinicians and researchers may effortlessly reconstruct in the programming language of their choice. More specifically, k-Nearest Neighbors (kNN) is inherently “fuzzy” , meaning that prediction on new samples is dependent upon the entire training set, and so the learned patterns are ambiguous. While SVM offers a simpler model, the results can be non-intuitive for clinicians. Finally, although Decision Trees offer the best interpretability, overly complex trees may be generated, as it did in our case (Fig S4). Thus, we next set out to develop a more robust unsupervised method with improved interpretability.
We noted that, similar to k-Means, the hierarchical clusters naturally form non-intersecting clouds in dimensions  (i.e. the non-convex hulls tend not to intersect). The cloud centroids can be algorithmically identified to yield a summarization of the clusters, drawing similarities with the cluster summarization involved in the BIRCH algorithm . Using this property, we can calculate the radius of the cluster as the farthest intra-cluster sample from the cluster centroid to reduce misrepresentation of distant member objects - rather than calculating the radius as the average distance between the centroid to the object members (as the BIRCH algorithm does). We propose a radius-based prediction that operates as follows: Let be the th cluster center with corresponding radius , then for a new sample choose its predicted cluster membership such that is a minimum. We call this radial normalization (RN) classification.
The definition of a center of points varies based on the shape of the data (e.g. Gaussian n-dimensional spheres, irregular density distributions) and on the algorithm used for identification. We therefore compared seven different methods to identify cluster “centers” to assess classification stability: multidimensional mean, multidimensional median, best representative center, bounding box method, smallest disk method, polyhedral center, and a novel force directed “push and pull” method inspired by force-directed graph drawing using Fruchterman-Reingold’s algorithm [27, 28] (see Supplementary Material for a detailed explanation of each method). Force directed clustering methods maximize inter-cluster center distances while minimizing intra-cluster distance, and are the basis for modularity clustering in graph theory.
The results for our evaluation of the centroid methods (CM) using RN (CMuRN) are italicized in Table 2. BIRCH defines its centroid as the average of all the points (multidimensional mean)  which scored lower than four other centroid methods. From amongst these CM, the polyhedral CMuRN scored the highest in terms of average , and although this method did not outperform the more intricate supervised machine learning methods, the performance is nevertheless exceptional for its interpretability. We argue that any CMuRN is highly interpretable because the entire model can be summarized concisely (Table 3) and understood clearly. For example, we learn from Table 3 that a characteristic of DSVL patients is that their relative area of severity and IQR are very low (0.0901 and 0.0983 respectively in the normalized region) and level of their viral loads are very consistent, judged by its weighted recency reliability (0.905 in the normalized region), with small flexibility in these values - as indicated by the radius. For its performance and interpretability, we subsequently use the polyhedral CMuRN to measure classification stability in the next section.
|Linear Transformation (LT) determined by equation Membership Assignment on Partially Retained VL Patterns.|
Membership Assignment on Partially Retained VL Patterns
To determine the classification stability, we iteratively assign VL pattern membership using only partial viral load data for every patient. First, we extract the feature vector Fig S2G.2). We then transform to the same normalized space the polyhedral CM was trained on. Notice the min-max normalization (Eq 2) is a linear transformation (LT; Eq Membership Assignment on Partially Retained VL Patterns), hence the transformation is performed by applying onto (i.e. ; Fig S2H.2), with the parameters for the LT given in Table 3.from the partially retained data (
With the transformed feature vector, we assign viral load pattern membership using radial normalization classification (Fig S2I), where the centroid algorithm is trained on 100% retained data (Table 3). We repeat this procedure starting at 0% retained information () and progressively increasing to 100% by a factor of 0.1% for each patient (Fig S2E.1-E.3). We record the change in patient viral load pattern membership upon each retained information instance, which we designate as membership assignment probability (Fig 6).
Because the patient will not have a measurement exactly at each instance of , we selected the nearest cutoff instance, , such that (Fig S2F). The feature vector was then extracted from and where . We also added the restriction to adhere to the minimum of three measurements requirement. If failed to meet this requirement, the patient was dropped from the centroid prediction instance (Fig S2G.1). This restriction explains why, if Fig 6 is examined closely, the classes have small gaps between the point where and the value for which calculating membership assignments start.
The true positive probabilities associated with the DSVL, SLVL, and SHVL clusters are generally high (close to the range of 80%) despite the amount of retained viral load data (Fig 6). The true positive probability associated with the HVLS cluster is initially very low, but progressively reaches 80% within 85% of the retained viral load data. This type of progression is likely because the viral load suppression class is time dependent and thus VL time series taken early, or later, with respect to treatment initiation would be expected to differ based on the full treatment response. Similarly we find the RVL group’s true viral load membership assignment increases with an increasing fraction retained information.
Next, we compared the classification stability results of the polyhedral CMuRN with the decision tree, SVM, and kNN algorithms. While differences in classification stability appear to be negligible to the naked eye (Fig S6-8), a Wilcoxon signed-rank test between each true positive viral load pattern membership reveal that the difference in performance of each method is statistically significant for all viral load patterns (). Note that the test only includes the instances where the number of patients are . The distribution of these differences (Fig 7) reveal that the polyhedral CMuRN outperforms all other algorithms in terms of RVL assignment by 5% (Table S2). Although the differences in performance may be statistically significant, the level of difference may not be clinically relevant.
Projected Hyperplane Normalization
We were initially perplexed to find the smallest disk scoring low in Table 2 because this method attempts to find the center by minimizing the radius, hence one may speculate that it should score the highest from amongst the centroid methods using RN. However, the result led us to recognize that one potential drawback of RN is in assuming the cloud of points is spherical. This assumption can be relaxed by determining the set of hyperplanes which make up the convex hull of the cloud. In this way we may normalize by the distance to the nearest projected hyperplane in the convex hull. Such normalization improves the centroid results (Table 4), but at the expense of model interpretability, which may not be desirable in a clinically applied method. Furthermore we find the classification stability results using projected hyperplane normalization (PHN) are slightly better than radial normalization (Fig S8), with a seemingly negligible exception of DSVL and a major exception of SHVL performance dropping between 11 and 18 percent (Table S2), which may be a result of RVL and SHVL polyhedrons having a greater overlapping volume ratio than would their spheres. Thus this method also has its drawbacks. Performing the Wilcoxon signed-rank test between the polyhedral CMuPHN and polyhedral CMuRN also revealed statistical significance in the performance difference for each true positive viral load membership () .
We see two limitations to projected hyperplane normalization. First, the projection only works if the center is inside the convex hull. This leads to an issue with using the median and best representative methods because it is possible they may choose a center lying on the border of the convex hull. Even worse for this prediction method is using push and pull, which may detect a center outside the convex hull. Due to this, these three methods are not mentioned in the results of Table 4. Second, finding the convex hull is computationally expensive and, depending on the algorithm, the accuracy of the convex hull becomes unwieldy with 15 dimensions .
Researchers have previously performed HIV population case studies using differing schema to classify VL patterns [9, 2, 10, 13]. Our work is unique as it suggests a method for standardizing the classification of VL patterns using a set of optimally segregating features. These features have been specifically engineered to optimize unsupervised clustering of temporal sequences of viral load data that are asynchronous and noisy. Our findings demonstrate their success in identifying five types of viral load patterns often mentioned in the literature [7, 8, 10, 9, 11, 12].
We have also proposed the novel centroid algorithm for giving a meaningful summarization to clustering results. Since the results of the algorithm are concise, it allows investigators to reconstruct the model to the programming language of their choice. Hence these results can positively aid HIV population case studies by giving precise definitions of the varying temporal viral load patterns, leading to an improvement in patient care. This method would facilitate cross-comparison of research studies by providing a standard for viral load pattern classification. Such standardization would be immensely useful in meta-analyses of diverse research reports [32, 33, 34, 35, 36]. It is possible that, in the future additional viral load patterns may emerge with, for example, the emergence of new HIV variants that are resistant to, or escape suppression of, current therapies. The method is flexible enough to recognize different patterns, and thus categories of viral load responses. Furthermore, this method is general enough that models could be trained on other viral infections that have patterns of natural or treatment related patient responses (e.g. hepatitis B and C, parvovirus B19). Such applications would likely require defining new viral load time-series features that capture disease specific features.
A common practice in data analytics is to calculate the centroid as the average of the points [37, 21], however we showed that the mean is not necessarily the best centroid. An advantage of the centroid algorithm is that we can choose the best centroid corresponding to the shape of the data, which may call for statistical tests to determine which shape best fits the data. Another advantage is that we can mathematically determine the amount of overlap between -dimensional cluster spheres (i.e. viral load categories). With this property we can take different cohort studies, perform hierarchical clustering with centroid summarization on the new set of data, and then compare HIV viral load patterns (i.e. cluster comparison). Such comparisons may reveal influences of different patient care strategies or the relationship between different populations. Furthermore, since the centroid algorithm represents the data into centers and radii (bytes of data), another advantage of this algorithm is data reduction.
We have shown that, at the expense of model interpretability, centroid prediction may be improved with projected hyperplane normalization. The interpretability versus predictability problem is well known and often finds itself in deep learning papers [38, 39, 40]. Interpretability is a desirable attribute in clinical classification systems, as it allows clinicians to integrate causal physiology and diagnostic information with data features in a way that may promote clearer bedside clinical reasoning. The general understanding is that research vital to human safety must have an interpretable science, otherwise a “black-box” approach may be justified. Using an interpretable model for assigning viral load pattern membership may be advantageous for when a clinician wishes to use the assigned pattern membership to aid in making a critical clinical decision (e.g. choosing between treatment options). A convoluted model outcome may make such a decision more difficult .
Several caveats apply to our work. As noted, this is a single center study, and thus our method should be tested with a much larger data set to cross-validate the categories represented by the clusters. In addition, our feature vector was designed specifically to suit the literature rather than objectively clustering the data using a time-series based clustering method [42, 43, 44]. Also, some of our features are slightly collinear - with the greatest correlation coefficient being between IQR and wRR (-0.717). However, while HVLS and RVL both have a varied range of IQR, it is clear that the HVLS has greater wRR than RVL due to HVLS patients having a long consistent tail. Furthermore IQR helps distinguish the HVLS and the SLVL or SHVL class, hence both IQR and wRR are necessary despite the slight correlation. As another caveat to our work, we had to normalize time into number of days since admission - meaning we lose the ability to look for seasonal or yearly patterns in the data.
While we originally hypothesized the existence of six distinct viral load patterns, we found that the emergent VL group may not be a pattern found in our data. Perhaps this is a consequence of a high rate of local patient engagement in therapy in this cohort study, or the era of heart therapy, or effectiveness of medication regiments and their access. In different populations these conditions may not always exist (e.g. in areas where HAART is expensive and people may lose the ability to pay for it), for which the EVL pattern may indeed be significant. Based on the formulation of the adj MD and wRR features, we hypothesize that a consequence of the grounding function is that any EVL pattern (if exists) will be grouped under RVL. This grouping may be appropriate as one can argue that the act of going from a suppressed state to a high VL state is a form of rebounding.
Unsupervised clustering algorithms all have tuning parameters: the in k-Means or k-Medoids , the radius (neighborhood size) in DBSCAN , the density threshold and grid size in CLIQUE , the branching factor in BIRCH , or the , , and minsize in CHAMELEON (perhaps the least sensitive of them) [21, 45]. Clustering results may change depending on the parameter chosen, revealing finer between-cluster differences as the number of clusters increase. The hierarchical clustering algorithm is no exception, but has the advantage that a proper cut-off can be easily visualized. Our method chose to make groups at a high level, but identification of important sub-clusters with a choice of a lower cut-off is also possible. For example, choosing a lower cut-off may reveal that the suppression group splits itself into categories with different rates of HIV viral load suppression during treatment. Researchers wishing to engineer a new feature vector for VL pattern segregation may find useful the supplementary material on features we considered but subsequently removed due to poor performance.
We have proposed a set of four unambiguous features which have been successfully used in segregating five different types of viral load patterns: durably suppressed viral load (DSVL), sustained low viral load (SLVL), sustained high viral load (SHVL), high viral load suppression (HVLS), and rebounding viral load (RVL). We have also proposed a novel centroid-based clustering algorithm. Applied to our data set, these methods resulted in clusters of temporal viral load patterns that map to clinically relevant treatment responses. The use of this algorithm may improve meta-analyses or population studies of viral load patterns by standardizing the classification of HIV patient categories. Furthermore, the segregation process used in this paper (i.e. discovering a set of domain specific features, performing unsupervised clustering, interpreting the results with a cluster summary) can be used to model other viral infections and the response of viral load levels over time to treatment or natural disease progression.
Viral load distribution. For each pair of viral load measurements, we calculate the change in days and the change in viral load counts for all patients and plot it as a scatter. The horizontal line of dots which appears between and are an artifact of using 20 and 48 in data to replace the “Pos 20” and “Pos 48” values which appeared in our data. The sequential range of viral load measurements shows that VL measurements taken within 10 days of each other may vary by copies/mL.
Paper Outline. The light green box indicates the training step and the light red box indicates the testing step. (A) Feature extraction (B) Normalization of extracted features. Each feature has a unique linear transformation for normalization. (C) Unsupervised hierarchical clustering into clouds (the six different VL patterns: SHVL, SLVL, DSVL, HSVL, Emergence, Slow Suppressors) (D) Centroid summarization taking as input the cluster labels from hierarchical clustering and normalized data (while storing [the normalization functions]) (E) Partition viral load data into retained proportions. (F) Use only the amount of data we have real data points for. (G) Normalize the features into the same transformed space the training was performed on. (I) Perform radial normalization classification on the normalized features to classify the partially retained VL pattern.
Patient feature extraction. Feature extraction on 1576 patients displayed as 2D splicing of the 4 dimensional feature space. Each splice plots a dimension versus another in the form of a scatter plot.
Decision Tree. While some useful rules may be pruned, the tree is otherwise complicated and difficult to draw useful conclusions from.
Decision Tree Classification Stability. Classification stability results using decision tree learned on 100% retained data. The title of each plot corresponds to which category of patients are being tested. The dashed white line is the 80% membership assignment probability line, included for reference. The solid white line represents the number of patients included in the prediction at the point of retained viral load data (corresponding to the right axis) which changes as a result of the restriction.
SVM Classification Stability. Classification stability results using SVM learned on 100% retained data. The title of each plot corresponds to which category of patients are being tested. The dashed white line is the 80% membership assignment probability line, included for reference. The solid white line represents the number of patients included in the prediction at the point of retained viral load data (corresponding to the right axis) which changes as a result of the restriction.
-Nearest Neighbor Classification Stability. Classification stability results using -Nearest Neighbor () learned on 100% retained data. The title of each plot corresponds to which category of patients are being tested. The dashed white line is the 80% membership assignment probability line, included for reference. The solid white line represents the number of patients included in the prediction at the point of retained viral load data (corresponding to the right axis) which changes as a result of the restriction.
Polyhedral CMuPHN Classification Stability. Classification stability results using the polyhedral method for detecting center and projected hyperplane normalization method for classification. The title of each plot corresponds to which category of patients are being tested. The dashed white line is the 80% membership assignment probability line, included for reference. The solid white line represents the number of patients included in the prediction at the point of retained viral load data (corresponding to the right axis) which changes as a result of the restriction.
Centroid Methods. Gives a visual of how the seven methods work on an example point set. The green target signifies the exact center which is found according to the different methods in our algorithm.
Viral load data. The data set used for this study is provided in a completely deidentified format. The data is in a csv format where the first column represents a unique subject, with a random identifier. The subsequent values are as , where is the time from a universal for the VL measurement for patient , and is the corresponding VL measurement. Each record (row) is of a unique length, depending on the number of VL measurements present for that subject. The code used to analyze the data can be found on github at https://github.com/SamirRCHI/Viral_Load_Data_Categorization.
|Decision Tree||SVM||k-NN,k=5||PH CMuPHN|
|HVLS||(0.0, 0.016)||(0.0367, 0.0496)||(0.0224, 0.0426)||(0.0171, 0.0287)|
|SHVL||(-0.0263, 0.04)||(-0.0263, 0.0385)||(0.0, 0.0769)||(-0.1818, -0.1154)|
|DSVL||(0.0038, 0.0143)||(0.0, 0.0096)||(0.0028, 0.012)||(-0.0024, 0.0)|
|RVL||(-0.069, -0.04)||(-0.0657, -0.0438)||(-0.0576, -0.0419)||(0.0, 0.0181)|
|SLVL||(0.0068, 0.0394)||(0.0293, 0.0484)||(0.0445, 0.0633)||(0.0352, 0.0542)|
PH = Polyhedral
The results are the Quartiles (,) of the difference between the stated algorithm.
A negative score means Polyhedral CMuRN performed better.
Only instances where Patients were used in this comparison.
This work was partially funded by the University of Rochester Clinical and Translational Science Institute, supported in part by grants UL1 TR002001, and TL1 TR002000 from the National Center for Advancing Translational Sciences (NCATS), a component of the National Institutes of Health (NIH). This publication was also made possible through core services and support from the University of Rochester Center for AIDS Research (CFAR), an NIH-funded program (P30 AI078498). The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health (NIH).
We would also like to thank Yusuf Bilgic (State University of New York at Geneseo), for discussions regarding the statistical analyses.
- 1. Center for Disease Control (CDC). Vital signs: HIV prevention through care and treatment–United States. MMWR Morbidity and mortality weekly report. 2011;60(47):1618.
- 2. Yehia BR, Fleishman JA, Metlay JP, Moore RD, Gebo KA. Sustained Viral Suppression in HIV-Infected Patients Receiving Antiretroviral Therapy. JAMA. 2012;308(4):339. doi:10.1001/jama.2012.5927.
- 3. Mellors JW. Plasma viral load and CD4+ lymphocytes as prognostic markers of HIV-1 infection. Annals of Internal Medicine. 1997;126(12):946. doi:10.7326/0003-4819-126-12-199706150-00003.
- 4. Sterling TR, Vlahov D, Astemborski J, Hoover DR, Margolick JB, Quinn TC. Initial Plasma HIV-1 RNA Levels and Progression to AIDS in Women and Men. New England Journal of Medicine. 2001;344(10):720–725. doi:10.1056/nejm200103083441003.
- 5. Dybul M, Fauci AS, Bartlett JG, Kaplan JE, Pau AK. Guidelines for using antiretroviral agents among HIV-infected adults and adolescents. Recommendations of the Panel on Clinical Practices for Treatment of HIV. MMWR Recommendations and reports: Morbidity and mortality weekly report Recommendations and reports/Centers for Disease Control. 2002;51(RR-7):1–55. doi:10.1037/e548472006-001.
- 6. Attia S, Egger M, Müller M, Zwahlen M, Low N. Sexual transmission of HIV according to viral load and antiretroviral therapy: systematic review and meta-analysis. AIDS. 2009;23(11):1397–1404. doi:10.1097/qad.0b013e32832b7dca.
- 7. Viard JP, Burgard M, Hubert JB, Aaron L, Rabian C, Pertuiset N, et al. Impact of 5 years of maximally successful highly active antiretroviral therapy on CD4 cell count and HIV-1 DNA level. AIDS. 2004;18(1):45–49. doi:10.1097/00002030-200401020-00005.
- 8. Greub G, Cozzi-Lepri A, Ledergerber B, Staszewski S, Perrin L, Miller V, et al. Intermittent and sustained low-level HIV viral rebound in patients receiving potent antiretroviral therapy. AIDS. 2002;16(14):1967–1969. doi:10.1097/00002030-200209270-00017.
- 9. Terzian AS, Bodach SD, Wiewel EW, Sepkowitz K, Bernard MA, Braunstein SL, et al. Novel Use of Surveillance Data to Detect HIV-Infected Persons with Sustained High Viral Load and Durable Virologic Suppression in New York City. PLoS ONE. 2012;7(1):e29679. doi:10.1371/journal.pone.0029679.
- 10. Rose CE, Gardner L, Craw J, Girde S, Wawrzyniak AJ, Drainoni ML, et al. A Comparison of Methods for Analyzing Viral Load Data in Studies of HIV Patients. PLOS ONE. 2015;10(6):e0130090. doi:10.1371/journal.pone.0130090.
- 11. de Jong MD, Simmons CP, Thanh TT, Hien VM, Smith GJD, Chau TNB, et al. Fatal outcome of human influenza A (H5N1) is associated with high viral load and hypercytokinemia. Nature Medicine. 2006;12(10):1203–1207. doi:10.1038/nm1477.
- 12. Ylitalo N, Sørensen P, Josefsson AM, Magnusson PK, Andersen PK, Pontén J, et al. Consistent high viral load of human papillomavirus 16 and risk of cervical carcinoma in situ: a nested case-control study. The Lancet. 2000;355(9222):2194–2198. doi:10.1016/s0140-6736(00)02402-8.
- 13. Andrew N Phillips RW Schlomo Staszewski. HIV Viral Load Response to Antiretroviral Therapy According to the Baseline CD4 Cell Count and Viral Load. JAMA. 2001;286(20):2560. doi:10.1001/jama.286.20.2560.
- 14. Kononenko I. Machine learning for medical diagnosis: history, state of the art and perspective. Artificial Intelligence in Medicine. 2001;23(1):89 – 109. doi:https://doi.org/10.1016/S0933-3657(01)00077-X.
- 15. Dubey A. Applications of Machine Learning: Cutting Edge Technology in HIV Diagnosis, Treatment and Further Research. Computational Molecular Biology. 2016;6(3).
- 16. Rosa RS, Santos RH, Brito AY, Guimaraes KS. Insights on prediction of patients’ response to anti-HIV therapies through machine learning. In: Neural Networks (IJCNN), 2014 International Joint Conference on. IEEE; 2014. p. 3697–3704.
- 17. Rodríguez JO, Prieto SE, Correa C, Pérez CE, Mora JT, Bravo J, et al. Predictions of CD4 lymphocytes’ count in HIV patients from complete blood count. BMC Medical Physics. 2013;13(1). doi:10.1186/1756-6649-13-3.
- 18. Ramirez CM, Sinclair E, Epling L, Lee SA, Jain V, Hsue PY, et al. Immunologic profiles distinguish aviremic HIV-infected adults. AIDS. 2016;30(10):1553–1562. doi:10.1097/qad.0000000000001049.
- 19. Parbhoo S, Bogojeska J, Zazzi M, Roth V, Doshi-Velez F. Combining Kernel and Model Based Learning for HIV Therapy Selection. AMIA Summits on Translational Science Proceedings. 2017;2017:239.
- 20. Center for Medicare Services. CMS Cell Size Suppression Policy; 2015. Available from: https://www.resdac.org/resconnect/articles/26.
- 21. Han J, Pei J, Kamber M. Data mining: concepts and techniques. Elsevier; 2011.
- 22. Punj G, Stewart DW. Cluster Analysis in Marketing Research: Review and Suggestions for Application. Journal of Marketing Research. 1983;20(2):134. doi:10.2307/3151680.
- 23. Wong B. Points of view: Color blindness. Nature Methods. 2011;8(6):441–441. doi:10.1038/nmeth.1618.
- 24. Peterson L. K-nearest neighbor. Scholarpedia. 2009;4(2):1883. doi:10.4249/scholarpedia.1883.
- 25. Keller JM, Gray MR, Givens JA. A fuzzy K-nearest neighbor algorithm. IEEE Transactions on Systems, Man, and Cybernetics. 1985;SMC-15(4):580–585. doi:10.1109/tsmc.1985.6313426.
- 26. Rohlf FJ. Adaptive Hierarchical Clustering Schemes. Systematic Zoology. 1970;19(1):58. doi:10.2307/2412027.
- 27. Kobourov SG. Spring embedders and force directed graph drawing algorithms. arXiv preprint arXiv:12013011. 2012;.
- 28. Fruchterman TMJ, Reingold EM. Graph drawing by force-directed placement. Software: Practice and Experience. 1991;21(11):1129–1164. doi:10.1002/spe.4380211102.
- 29. Noack A. Modularity clustering is force-directed layout. Phys Rev E. 2009;79:026102. doi:10.1103/PhysRevE.79.026102.
- 30. Wilcoxon F. Individual Comparisons by Ranking Methods. Biometrics Bulletin. 1945;1(6):80. doi:10.2307/3001968.
- 31. Avis D, Bremner D, Seidel R. How good are convex hull algorithms? Computational Geometry. 1997;7(5-6):265–301. doi:10.1016/s0925-7721(96)00023-5.
- 32. Etter P, Landovitz R, Sibeko S, Sobieszczyk ME, Riddler SA, Karg C, et al. Recommendations for the follow-up of study participants with breakthrough HIV infections during HIV/AIDS biomedical prevention studies. AIDS. 2013;27(7):1119–1128. doi:10.1097/qad.0b013e32835dc08e.
- 33. Olsen CM, Knight LL, Green AC. Risk of Melanoma in People with HIV/AIDS in the Pre- and Post-HAART Eras: A Systematic Review and Meta-Analysis of Cohort Studies. PLoS ONE. 2014;9(4):e95096. doi:10.1371/journal.pone.0095096.
- 34. Blaser N, Wettstein C, Estill J, Vizcaya LS, Wandeler G, Egger M, et al. Impact of viral load and the duration of primary infection on HIV transmission: systematic review and meta-analysis. AIDS. 2014;28(7):1021–1029. doi:10.1097/qad.0000000000000135.
- 35. Boender TS, Sigaloff KC, McMahon JH, Kiertiburanakul S, Jordan MR, Barcarolo J, et al. Long-term virological outcomes of first-line antiretroviral therapy for HIV-1 in low-and middle-income countries: a systematic review and meta-analysis. Clinical Infectious Diseases. 2015;61(9):1453–1461. doi:10.1093/cid/civ556.
- 36. Boerma RS, Boender TS, Bussink AP, Calis JCJ, Bertagnolio S, de Wit TFR, et al. Suboptimal Viral Suppression Rates Among HIV-Infected Children in Low- and Middle-Income Countries: A Meta-analysis. Clinical Infectious Diseases. 2016;63(12):1645–1654. doi:10.1093/cid/ciw645.
- 37. Abdi H. Centroids. Wiley Interdisciplinary Reviews: Computational Statistics. 2009;1(2):259–260. doi:10.1002/wics.31.
- 38. Bologna G. Symbolic Rule Extraction from the DIMLP Neural Network. In: Lecture Notes in Computer Science. Springer Berlin Heidelberg; 2000. p. 240–254. Available from: https://doi.org/10.1007%2F10719871_17.
- 39. Bologna G. A model for single and multiple knowledge based networks. Artificial Intelligence in Medicine. 2003;28(2):141–163. doi:10.1016/s0933-3657(03)00055-1.
- 40. Intrator O, Intrator N. Interpreting neural-network results: a simulation study. Computational Statistics & Data Analysis. 2001;37(3):373–393. doi:10.1016/s0167-9473(01)00016-0.
- 41. Shickel B, Tighe PJ, Bihorac A, Rashidi P. Deep EHR: A Survey of Recent Advances in Deep Learning Techniques for Electronic Health Record (EHR) Analysis. IEEE Journal of Biomedical and Health Informatics. 2017; p. 1–1. doi:10.1109/jbhi.2017.2767063.
- 42. Klapper-Rybicka M, Schraudolph NN, Schmidhuber J. Unsupervised Learning in LSTM Recurrent Neural Networks. In: Artificial Neural Networks — ICANN 2001. Springer Berlin Heidelberg; 2001. p. 684–691. Available from: https://doi.org/10.1007%2F3-540-44668-0_95.
- 43. Bahadori MT, Kale D, Fan Y, Liu Y. Functional subspace clustering with application to time series. In: International Conference on Machine Learning; 2015. p. 228–237.
- 44. Kontaki M, Papadopoulos AN, Manolopoulos Y. Continuous subspace clustering in streaming time series. Information Systems. 2008;33(2):240–260. doi:10.1016/j.is.2007.09.001.
- 45. Karypis G, Han EH, Kumar V. Chameleon: hierarchical clustering using dynamic modeling. Computer. 1999;32(8):68–75. doi:10.1109/2.781637.
- 46. Maire F. An algorithm for the exact computation of the centroid of higher dimensional polyhedra and its application to kernel machines. In: Third IEEE International Conference on Data Mining. IEEE Comput. Soc; 2003.Available from: https://doi.org/10.1109%2Ficdm.2003.1250988.
- 47. Welzl E. Smallest enclosing disks (balls and ellipsoids). In: New Results and New Trends in Computer Science. Springer-Verlag; 1991. p. 359–370. Available from: https://doi.org/10.1007%2Fbfb0038202.
- 48. Fischer K, Gärtner B, Kutz M. Fast Smallest-Enclosing-Ball Computation in High Dimensions. In: Algorithms - ESA 2003. Springer Berlin Heidelberg; 2003. p. 630–641. Available from: https://doi.org/10.1007%2F978-3-540-39658-1_57.
- 49. Nielsen F, Nock R. Approximating Smallest Enclosing Balls. In: Computational Science and Its Applications – ICCSA 2004. Springer Berlin Heidelberg; 2004. p. 147–157. Available from: https://doi.org/10.1007%2F978-3-540-24767-8_16.
Supplementary Material Section
Review of Existing Viral Load Categorization Methods
Greub et. al. LLVR
Greub et. al. were particularly focused on detecting low level viral rebound (LLVR) in patients . The following procedure was used to categorize the patients of their study:
If the patient has two consecutive viral load measurements (VLM) less than 50, within a 24 week period, and they have two VLM after this consecutive pair occurs, then the viral load data for that patient is considered for further analysis. For the patients that meet this criteria, their viral load measurements following the consecutive pair are viewed: if their maximum VLM is greater than 500, then they are categorized as ‘Viral Failure’, if they are between 51-500 then they are categorized as ‘LLVR’, otherwise they are labeled as ‘DSVL’. Greub et. al. left out the patients which did not meet the consecutive pair criteria from their categorization, hence in our comparative study we will group them under ‘Unspecified’.
Rose et. al. SMVL/RMVL
The focus of Rose et. al. was to investigate the use of several frameworks in categorizing suppressed versus not-suppressed viral load . First they omitted the patients from their study whom were virally suppressed at baseline, where they define viral suppression as copies/mL because they were found to have no substantial variation in their viral loads. In our comparative analysis we label them as ‘Baseline ’. Then, from the remaining patients they categorize them as either achieving suppression or not-suppressed using an 8 month window centered around month 24 after start of VLM (18-30 months). They considered five different frameworks, which we describe below:
SMVL omit-participant: If the closest VLM to month 24 is then the patient is labeled as ‘Suppressed’, otherwise ‘Not Suppressed’. However if this closest VLM is outside the range of 18-30 months, then they are labeled as ‘Ommitted’.
SMVL set-to-failure: Similar to omit-participant, however if the the closest VLM is outside the range of 18-30 months, then they are labeled as ‘Not Suppressed’.
SMVL closest-VL: The patient is labeled according to their closest VLM to month 24 regardless of whether the VLM is contained in the window.
RMVL repeat binary: We do not use this method in our comparative analysis because its purpose is to classify each individual VLM as suppressed or not suppressed (rather than the patient), which is not the goal of this paper.
RMVL repeat continuous: The of the VLMs are modeled as a continuous linear function with its intercept fixed at the baseline viral load. In our implementation we add 1 to each VLM to avoid . Then the patient is categorized as ‘Suppressed’ if the model predicts the patient to have a viral load copies/mL at month 24, otherwise they are labeled as ‘Not Suppressed’.
Terzian et. al. SHVL
The objective of Terzian et. al. was to develop a method of categorizing a patient as DSVL or SHVL for the purpose of monitoring successful ART uptake . Their procedure for categorizing patients is as follows:
If the maximum viral load of the patient is copies/mL then the patient is labeled as ‘DSVL’. If the patient instead has two consecutive viral load measurements 100,000 copies/mL, then the patient is labeled as ‘SHVL’. For our analysis all other patients are labeled as ‘Unspecified’.
Phillips et. al. Viral Rebound
The aim of Phillips et. al. was to characterize virological response to ART . While the statistical methods proposed by Phillips et. al. went beyond categorizing patients, they composed a method to identify two populations of HIV patients (Viral Failure and Viral Rebound):
Only patients who have at least one VLM within the range 24-40 weeks are included in the categorization, all other patients are labeled as ‘Omitted’ (similar to SMVL omit-participant). Phillips et. al. chose the 24-40 week range for a different part of their statistical analysis; however, in our categorical implementation, we modify the range to 24-32 weeks to build coherent categories according to the procedure they outline (as we are about to describe). Phillips et. al. choose to use 32 weeks as the point of observing viral load levels because they argue viral load is expected to decline to 500 copies/mL by week 32 if it is going to do so . Thus patients who never achieve VL 500 copies/mL within 32 weeks are labeled as ‘Viral Failure’. If the patient does achieve VL 500 at one point within 32 weeks, but VLM closest to 32 weeks is 500 copies/mL, then this patient is omitted from the categorization. If, however, the patient’s closest VLM to 32 weeks is 500 copies/mL, then if the patient has two consecutive VLM 500 copies/mL within 32 weeks, then the patient is labeled as ‘Viral Rebound’. Phillips et. al. do not describe them further the patient who did not meet this consecutive pair criteria, however for our implementation we will label them as ‘Suppressed’.
Considered but Removed Features
There were several features which were thought to have significance in segregating viral load patterns but did not make it into our feature vector. We had to be careful of extracting features which may be collinear as it would cause a shift in the weighting of features. These collinear features are too many to list here. We mention several excluded features which were generally unreliable or created overlap between classes that should be unrelated:
Rate of Change. We tried several ways to calculate this feature: 1) Mean of the first derivate of with respect to , 2) Median of the first derivative, 3) Rate of change between first and last measurements, 4) Fitting a piecewise regression with one knot and averaging the results of the slope- with the idea that HVLS will have a clear elbow.
Correlation Coefficient. From the fitted piecewise regression (one knot) we also tried averaging the results of the correlation coefficient, again with the idea that HVLS will have a clear elbow.
Change in Concavity. Calculated as sum of the absolute changes in concavity with the assumption of equally spaced time (to adjust for issues found in calculating rate of change):
Positive Difference. We calculate this as the sum of all normalized by the number of elements satisfying the condition.
Negative Difference. We calculated this as the sum of all normalized by the number of elements satisfying the condition.
Centroid Detection Methodologies
Centroid detection is a problem which several machine learning algorithms attempt to solve, such as Support Vector Machine (SVM), Bayesian Point Machines (BPM), Analytic Centre Machines, k-Means, BIRCH, among others [46, 21]. The center of a cloud of samples is generally considered the average , but is still a matter of interpretation. We attempt to find cluster centers in seven different ways (Fig S9):
Mean. Calculated as the average of each dimension (feature).
Median. Calculated as the median of each dimension (feature).
Best Representative. The best representative is taken to be the sample whose maximum distance to all the other samples is a minimum.
Bounding Box. If the cloud of points naturally form an -dimensional box, then calculating the middle of the minimum and maximum of each dimension would lend itself to be the center representation of the cloud.
Smallest Disk. If the cloud of points instead naturally form an -dimensional sphere, then it makes most sense to solve the smallest enclosing disk problem. There are many algorithms that have been proposed to solve this problem [47, 48, 49], where we found the best solution in our implementation of the algorithms to be that of Fischer’s fast smallest-enclosing ball, where he utilizes the properties of the hull and affine spaces to walk the center to its optimum location . They proposed to initiate the center by choosing a random sample, however this caused the algorithm to yield differing solutions on each run. Through experimentation we found that when we initialized the algorithm using the sample found from best representative method, the algorithm consistently converged at the solution with the smallest radii, compared to using any other sample as the initial center. Hence in our formulation of Fischer’s algorithm we deploy the best representative method as a subroutine for initiation.
Polyhedral. The trouble with the bounding box and smallest disk approximations for the center is that they are designed to work well under circumstances where the clusters naturally form either a -dimensional box or sphere respectively. We can relax this constraint by removing this assumption. Rather we propose to find the convex hull of the data points and then find the centroid of the hull, in other words, we find a -dimensional polyhedron to fit the cloud. Then, given the intersection of the finite half-spaces of the convex hull, we can calculate the exact centroid of the polyhedron- where we use a slight modification of Maire’s algorithm . The hyperplane equations are calculated in Python using the ConvexHull and Delaunay packages within SciPy.
Push and Pull. The six methods mentioned are not designed to maximize the distance between each cluster center while minimizing the distance within the cluster. In other words, the above six methods calculate the center of each point cloud without any dependencies on the other clouds. We propose a push and pull method, inspired by force-directed graph drawing using Fruchterman-Reingold’s algorithm  to include dependencies on the center of each cloud to improve performance on centroid learning. That is, let every sample within the cluster have a pulling force on the center following Hooke’s Law, , and let every sample outside the cluster have a pushing force on the center following a modified version of Coulomb’s Law, . In our case, represents distance between the center and sample, and the constants and equal to 1 in order to yield equally weighted pulling and pushing. For each cluster, we initialize the center as the mean of the cluster followed by iteratively finding the center until achieving a dynamic equilibrium. We initiate at the mean of the cluster samples because the spring-like pulling of each sample on the center is the same as the mean of the samples pulling on the center with a spring constant equal to the number of samples (proved below).
Proposition The combined spring-like pull of each sample, , on a center, C, with a spring constant of 1, is same as a spring-like pull of the mean of the samples, , on C with a spring constant equal to the number of samples pulling on C.
Proof. Let the sample and the center be -dimensional vectors such that has a spring-like pulling force on with a spring constant of 1, that is, , where is the direction of the force and is the magnitude, or distance between and . Notice the mean of the samples , where is the total number of samples, then observe that the sum of the spring-like forces acting on the center is given by
This yields we have that the sum of all the spring-like forces on is the same as the mean average of the samples having a spring-like pull on with a spring constant of , which is the number of samples pulling on C. ∎
Observe that some clusters have a low number of samples pulling at the center relative to all the samples having a pushing effect. Hence, if is small, then we will find that the center is pushed more than desired, and if is large, then the pull towards the average will be too strong and hence the center will be no different from the average. We propose to replace with the number of samples having a pushing effect on the center.