Bayesian Profiling Multiple Imputation for Missing Electronic Health Records
Abstract
Electronic health records (EHRs) are increasingly used for clinical and comparative effectiveness research but suffer from usability deficiencies. Motivated by health services research on diabetes care, we seek to increase the quality of EHRs by focusing on missing longitudinal glycosylated hemoglobin (A1c) values. Under the framework of multiple imputation (MI) we propose an individualized Bayesian latent profiling approach to capturing A1c measurement trajectories related to missingness. We combine MI inferences to evaluate the effect of A1c control on adverse health event incidence. We examine different missingness mechanisms and perform model diagnostics and sensitivity analysis. The proposed method is applied to EHRs of adult patients with diabetes who were medically homed in a large academic Midwestern health system between 2003 and 2013. Our approach fits flexible models with computational efficiency and provides useful insights into the clinical setting.
Key words: Trajectory, Latent profile, Multiple imputation, Sensitivity analysis
1 Introduction
1.1 Glycemic testing and control
This work is motivated by critical substantive issues of glycemic testing and control for the approximately 25.2% of American seniors with diabetes (The Centers for Disease Control and Prevention 2017). Diabetes is a management intensive condition and a major cause of morbidity and mortality. The American Diabetes Association (ADA) recommends glycosylated hemoglobin (A1c) testing at least two times a year in patients who are meeting treatment goals and have stable glycemic control, and quarterly in patients whose therapy has changed or who are not meeting glycemic goals (ADA 2018a). Evidencebased guidelines are most applicable to relatively healthy patients with diabetes.
For complex patients with diabetes, those aged 65 years or over, or those with comorbid conditions, adhering tightly to established guidelines might diminish the quality of care and lead to adverse outcomes. Complex patients may be at high risk of complications—such as hypoglycemic coma, seizures, falls, fractures, motor vehicle accidents, cardiovascular events, stroke, or acute renal failure—and also less likely to experience benefits of tight glycemic control, e.g., shown by the findings of the ACCORD and ADVANCE trials (Martin et al. 2006; Dluhy and McMahon 2008). Less stringent goals may be appropriate for complex patients with comorbid conditions, or longstanding diabetes in whom the tight control goal is difficult to achieve (ADA 2018b). Therefore, complex patients would be best suited by individualized treatment plans, but little clinical evidence exists to develop these plans.
To enhance findings from clinical trials, electronic health records (EHRs) are increasingly proposed as a data source for clinical and comparative effectiveness research on improving healthcare (Cebul et al. 2011; Agency for Healthcare Research and Quality 2013). The EHR is a longitudinal record of patient medical information that is maintained by encounters in any care delivery setting, and may include all of the key administrative clinical data relevant to patient care, such as laboratory data (e.g., A1c values), demographics, progress notes, problems, medications, vital signs, past medical history, immunizations, and radiology reports. EHR automates access to information and has the potential to streamline the clinician’s workflow and support other carerelated activities directly or indirectly through various interfaces, including evidencebased decision support, quality management, and outcomes reporting (U.S. Center for Medicare & Medicaid Services 2012).
We would like to use EHRs to examine the role of patient complexity in the recommendation for glycemic control. However, EHRs suffer from usability deficiencies (Healthcare Information and Management Systems Society 2009). Collected in an unscheduled fashion, typically only when the patient seeks care or the physician orders care, EHRs suffer from a large amount of intermittent missing data.
In this article, we focus on intermittently missing A1c values in EHRs. Multiple factors could influence the degree of missing A1c values for individual patients. Physician nonadherence to guidelines for lab testing will cause missing values. The patient’s health conditions potentially affect the propensity to test, leading to informative presence patterns that may cause systematic estimation bias. Incorporation of all available information including patient health status is critical to appropriately handle missing A1c values from EHRs.
We propose a Bayesian profiling approach to incorporate patient characteristics and infer latent groups of A1c measurement trajectories. The measurement trajectories examine both the A1c collection or presence patterns and collected values across time. The Bayesian profiling approach combines with multiple imputation (MI, Rubin (1987)) to produce complete EHR datasets for general analysis purpose. As an illustration of the MI inference, we evaluate the association between A1c levels and the incidence of any acute health events, such as hospitalization, emergency room (ER) visit or death.
1.2 Statistical literature on intermittently missing data
Missing A1c values in EHRs present statistical methodology challenges. Intermittent missing data in largescale, unbalanced longitudinal studies, where subjects reappear after one or more missed visits, call for new imputation approaches especially when the missing values are potentially nonignorable (Little and Rubin 2002) and the observed cases are sparse. Simple methods (e.g., complete case analysis or last observation carried forward) will distort the relationship and are not recommended by National Research Council (2010). Standard missing data methods in longitudinal studies focus on a common set of prespecified and monotone missing times, where a measurement being missing implies that all followup measurements are also missing or dropped from the analysis (e.g., reviews in Ibrahim and Molenberghs (2009)). Likelihoodbased, weighting or extrapolation factorization approaches for missing data mainly apply to monotone missingness. Utilization of the information collected after subjects reappear will potentially correct bias and increase estimation efficiency, and the challenge lies in how to appropriately model sparse data structures with nonmonotone missing patterns.
We seek a flexible imputation engine under MI to predict intermittent missing values. MI separates imputation and analysis into two steps and propagates the uncertainty due to missing data. Various MI software packages have been developed assuming data are missing at random (MAR) based on either joint multivariate normal distributions, e.g., PROC MI (SAS Institute Inc. 2017), Amelia (King et al. 2001) and norm (Schafer 1997), or a sequence of fully conditional distributions, e.g., IVEware (Raghunathan et al. 2001), mice (Van Buuren and Oudshoorn 1999), and mi (Gelman et al. 2015). Multilevel models are embedded with MI to handle correlated data, such as the packages REALCOMIMPUTE (Carpenter and Kenward 2013) and pan (Schafer 2016). However, existing MI software cannot handle highdimensional data that are subject to high proportions of intermittent missingness in largescale longitudinal studies, nor nonignorable missing values due to the default MAR assumption.
With not missing at random (NMAR), the general framework needs to jointly model the incomplete variables (i.e., the A1c values) and the missingness indicators (i.e., whether A1c values are present on any given occasion), where the imputation model for the outcome has to be combined with the response propensity model for the presence patterns via a selection model, pattern mixture model, or shared parameter model (Wu and Carroll 1988; Little 1995). Previous work introduces strong and untestable assumptions for parameter identification, and the computation is nontrivial (Ibrahim et al. 2005). The longitudinal A1c collection history in EHRs results in sparse observations across a large number of different presence patterns and calls for flexible modeling strategies that account for the time dependency, borrow information across patterns and adjust for patient heterogeneity.
We jointly model the A1c presence patterns and the lab values taking into account patient characteristics. To achieve parameter identification and dimension reduction, we introduce latent profiles and develop a Bayesian profiling multiple imputation (BPMI) approach for missing A1c data in longitudinal EHRs. The novel latent profiling can handle ignorable and nonignorable missingness.
We assume that latent profiles are primarily determined by the trajectories of the longitudinal A1c measurements, including A1c values and potentially the presence patterns. The latent profiling includes patient characteristics—such as sociodemographics, healthcare utilization measures, medications, complexity, comorbidity and complication—as covariates that affect the profile assignment. This is an improvement over previous work that required joint modeling of the outcome variables, missingness indicators and covariates, leading to the extra need for Monte Carlo integration or approximate inferences and poor performances.
The proposal shares a similar decomposition with Lin et al. (2004) but differs by the MI framework under the Bayesian paradigm. Besides avoiding the extra computation step of Monte Carlo integration, the posterior updating of BPMI has efficiency gains via Gibbs sampling and generates completed datasets for general analysis purposes.
The paper is organized as follows. Section 2 describes the EHR structure and content. The developed BPMI is presented in detail in Section 3 and applied to the EHRs in Section 4 which presents our statistical and substantive findings. Section 5 concludes the contribution and discusses future extensions.
2 EHR description
We collect the EHR data of adult patients with diabetes who were medically homed at a large academic Midwestern health system between 2003 and 2013. Patients are eligible for inclusion if they have diabetes defined by a qualifying International Classification of Diseases9 code for diabetes. Because A1c reflects mean blood glucose over the preceding three months and quarterly testing is recommended for complex patients, we construct a longitudinal dataset with one measurement per patient per threemonth period (patientquarter). When there are multiple A1c values available during the three months, we use the average of the measurements restricted to be before the first acute health event date if any. We treat the first four quarters as the baseline period and use the average of collected A1c values during the first four quarters as the baseline A1c value. Patients are included if they have baseline A1c values and at least one available followup A1c measurement. Patients’ various enrollment dates and lengths of followup result in an unbalanced data structure.
We identified 7372 patients with 113761 quarters after baseline and only 57285 available observations. The intermittent missingness proportions across patients are as high as 0.97, with a median of 0.50. The baseline A1c values of 7372 patients center around mean 6.93% and range from 3.55% to 17.1%, showing heterogeneity in glycemic control.
We randomly select 30 patients and present the spaghetti plot of their collected A1c values in Figure 2.1. The time trends vary across patients with different ranges, and the measurements of the same patients tend to be correlated. We consider mixed effects models accounting for withinpatient correlation with random intercepts and slopes with respect to time or functions of time. While the A1c measurement trajectories present patient heterogeneity, subgroups of patients could share similar profiles across time. We assume each subgroup to have different location and scale for the time trends, which results in a mixture distribution overall.
EHRs collect rich patient characteristics that could be predictive of the A1c trajectories and likelihood of missingness or presence patterns. The patientquarter structure creates both timevarying and timeinvariant variables. Timevarying variables include age and the count of physician visits that occurred within each quarter. On average patients have 2.39 physician visits every three months with standard deviation (sd) 2.03 across patientquarters.
The baseline characteristics include sociodemographics, healthcare utilization measures, medications, complexity, comorbid conditions (Elixhauser et al. 1998), and diabetes complications, all of which are carried forward (e.g., chronic conditions will be assumed as present from diagnosis onward). Patient complexity is defined as the presence of congestive heart failure (CHF) and/or Ischemic heart disease (IHD). The comorbid patterns are characterized by Hierarchical Condition Categories (HCC) risk scores. The number of adverse events, such as hospitalization and ER visits, and the number of ER visits that did not lead to hospitalization during the first four quarters, serve as proxy measures of healthcare utilization.
The latent profiling allocation takes into account variables that are predictive of the A1c values or presence patterns into the model. We use the observed A1c values ignoring the patientquarter structure and fit an ordinary linear regression to select the variables that are highly correlated with A1c. Then we fit a logistic regression model to predict the missingness at each quarter and select the variables that are predictive of the visit pattern. The union of the two sets of selected variables results in the covariate list in Table 2.1. The patient characteristics are potential confounders and moderators of the tight control effect on the health outcomes.
Table 2.1 shows that the cohort has a modest number of complex patients. For example, 14% of patients have Chronic Kidney Disease (CKD), 33% suffer from moderate or severe kidney damage, 19% have CHF, 21% have IHD, and 85% have hypertension. Several comorbidity and complication indicators have low prevalence and can cause separation problems in the regression modeling of the profile allocation.
Sociodemographics  
Age  69.91 (10.76) 
Female  53% 
White  93% 
Medicaid  14% 
Utilization  
Adverse event count  0.58 (0.88) 
Hierarchical condition categories  1.41 (0.98) 
Number of emergency room visits  0.12 (0.37) 
Comorbidity and complication  
Count of low prevalence conditions  0.21 (0.54) 
Chronic kidney disease  14% 
Chronic pulmonary disease  18% 
Dementia  5% 
Depression  16% 
Entitlement disability  18% 
Eye disease  14% 
Hypertension  85% 
Hypothyroidism  14% 
Fluid and electrolyte disorders  14% 
Other neurological disorders  7% 
Obesity  16% 
Psychoses  9% 
Renal failure  7% 
Solid tumor without metastasis  6% 
Lower extremity ulcer  7% 
Valvular disease  8% 
Kidney damagelight  16% 
Kidney damagemild  51% 
Kidney damagemoderate  30% 
Kidney damagesevere  3% 
Complexity  
Congestive heart failure  19% 
Ischemic heart disease  21% 
Medication prescribed  
Insulin  13% 
Sulfonylureas  17% 
Hypoglycemics  16% 
Other  54% 
3 Bayesian profiling multiple imputation
Denote the individual record for patient by for the total number of tracked quarters during the followup, where are the timeinvariant covariates, ’s are timevarying covariates, and is the variable to be imputed, A1c values over time. For brevity, we will refer to as the longitudinal outcomes. Assume that only ’s are subject to missing values, and let be a timevarying binary indicator for its presence, if is observed, otherwise. Denote the timevarying variables by , , and , for total number of patients.
Assume individuals fall into latent classes , where is the total number of classes. We assume is a finite and positive known integer. The selection of and the case of being an unknown random variable that induces additional uncertainty will be discussed later. We assume that the latent class structure is primarily determined by the trajectories of the longitudinal outcome and potentially its presence patterns, and that the allocation probabilities are affected by the covariates.
The conditional distribution of the observed measurements can be expressed as
where denotes the distribution.
If only depends on fully observed , i.e., MAR, the observed data can provide valid inference since . The profile structure is then independent of the presence patterns.
We call this marginal profiling because the latent class structure influences only the marginal distribution of .
of its marginal dependence on the observed longitudinal outcomes.
The MAR assumption can be relaxed to
(3.1) 
that is, conditional MAR, where the missingness is independent of the outcome given the latent classes and covariates. However, the unconditional missingness is nonignorable as the latent structure affects both the longitudinal outcome and presence patterns. Joint modeling of is then necessary and will be identified due to the conditional independence assumption given the latent classes. We call this approach joint profiling.
Joint profiling is identical to marginal profiling when the estimated parameters in Model (3.1) do not change across profiles, so is conditionally independent of given . Hence, the joint profiling approach is general with marginal profiling as a special case and able to model both ignorable and nonignorable missing data. We will consider both the marginal and joint profiling for imputation and perform sensitivity analysis of the inference on the effect of A1c levels on acute health outcomes.
The latent classes are at the individual level and thus time invariant. Further, we assume the latent class allocation depends only on timeinvariant variables . For computational and interpretational convenience, assume the outcome trajectory is independent of timeinvariant covariates given the latent class . This assumes the patient characteristics at baseline are represented by the latent profiling.
3.1 Marginal profiling
Under MAR we construct latent profile models based on the collected outcomes, i.e., marginal profiling, which is estimated from the pattern of trajectories of the longitudinal outcomes.
(3.2) 
Here denotes the observed A1c measurement for individual . The factorization needs two modeling components: the outcome model and the latent profile model. This model specification in (3.2) assumes: 1) the presence pattern is independent of the A1c values and the latent profiles given the covariates ; 2) the A1c values are independent of the presence patterns and the baseline characteristics given the latent profiles and timevarying covariates ; and 3) the latent profiles are independent of the timevarying covariates given the baseline characteristics . Model (3.2) can be generalized by assuming the outcome depends on the baseline characteristics that are assumed to have been captured by the latent profiling in the current specification.
In the latent profile model, denote as the allocation probability of pattern conditional on , for . We consider a multinomial logistic regression model with coefficient vector , for . Set and for identification, we have
(3.3) 
For posterior computation we develop a Gibbs sampler by introducing PólyaGamma (PG) distributed variables (Polson et al. 2013), , where . Conditional on the PG variables, the posterior distribution of will be conjugate with normal prior distributions. The resulting Gibbs sampler improves the posterior fitting without the stickiness that is common with rejection sampling methods.
Let be the latent class indicator and introduce the normal prior distribution on the coefficients , the conditional posterior distribution of given is multivariate normal, . Here , is the design matrix with each row , , and , where is a vector in , and the th component is .
For the collected longitudinal and continuous A1c measurements, we assume a linear mixed effects model with profilespecific coefficients and variances (). Let denote a covariate matrix, with associated vector of coefficients . The th row of , denoted by , is then a vector of covariate values measured at time . Covariates for profilespecific effects and for individualspecific random effects are denoted by the matrix and the matrix , respectively, both with structures similar to . There may be overlap of the covariates in , and , including main effects and highorder interactions of . Deterministic functions of time, for example, basis spline functions of time in our EHR application, can be included in the covariates.
(3.4)  
which is a linear mixed effects model with a mixture of location and scale parameters varying across profiles. The profilespecific coefficients can capture the differential trajectories of A1c measurements with timevarying covariates in . Model (3.4) is a location and scale mixture model that is flexible to capture nonGaussian distributions as shown in Figure 2.1. Here is the covariancevariance matrix of the individualspecific random effects , which could be simplified as a diagonal matrix if the components of are treated as independent or a scalar if only represents random intercepts.
We assign weakly informative and conjugate prior distributions to the parameters (Gelman et al. 2008). The inference is not sensitive to the hyperparameter values, as expected with large sample sizes. The prior specification and full conditional posterior distributions as efficient Gibbs sampler are presented in the Appendix A.1. After convergence, based on the posterior samples of the parameters, we can impute the missing A1c values and disseminate completed datasets.
3.2 Joint profiling
With NMAR, we jointly model the A1c values and presence patterns given the covariates to obtain the joint profiling structure
(3.5) 
where includes both observed and missing measurements. The joint model has three components: the latent profile model, the longitudinal outcome model and the longitudinal response/presence propensity model.
This NMAR assumption move beyond MAR by joint profiling to capture the dependence between the outcome and presence patterns that jointly determine the latent profiles. The presence indicator is conditionally independent of the outcome given the latent profiles and covariates, i.e., conditional MAR: . Conditional MAR makes models identifiable, yet flexible for dimension reduction and interpretation.
We consider a generalized linear mixed effects with a logit link for the longitudinal presence indicator , conditional on latent classes and random effects with as a covariancevariance matrix.
(3.6) 
Where the covariates could overlap or be different from those selected in Model (3.4) as subsets of the main effects and highorder interactions in .
We develop the Gibbs sampler with two sets of introduced PólyaGamma distributed variables for posterior computation with models (3.3), (3.4) and (3.6) under weakly informative and conjugate prior distributions. Different from marginal profiling, imputation of missing A1c values is nested in the iterative process, where parameter estimation and data augmentation are simultaneously implemented. The posterior computation with prior specification and imputation are presented with details in Appendix A.2.
To facilitate interpretation of the potential profiling structure, we assume the number of latent patterns is fixed with a finite value. This can be extended to allow a random variable under the nonparametric Bayesian modeling framework such as the dependent probit stickbreaking process (Rodríguez and Dunson 2011). However, discrete covariates with low prevalence tend to cause separation problems with a large number of latent profiles, as illustrated in our application study. In this paper, we fix to be a modest or small integer, and we use diagnostic tools for model selection in Section 4.1.
4 Imputation and Inference with EHRs
We apply the BPMI to impute missing A1c values in the EHRs. The collected A1c values range from 3.3% to 17.3% with mean 7.1% and are right skewed with potentially multiple modes. We use the location and scale mixture distributions as a flexible strategy to capture the irregular distribution. In the profile allocation model (3.3), the timeinvariant covariates are the baseline characteristics listed in Table 2.1. In the outcome and presence propensity models (3.4) and (3.6), the covariates in with fixed effects include the constant ones and the timevarying variables, age and the count of physician visits. The covariates in represents the functions of time in years with slopes changing across classes. We use the basis spline functions of time with quantiles as six knots and cubic terms as a piecewise polynomial regression, and the nine spline functions used as covariates in with profilespecific coefficients as shown in Figure 4.1. The spline functions are chosen based on model fitting criteria and can handle complex shapes to create smooth curves. Details on diagnostics of model specifications are provided in Section 4.1. Random intercepts are introduced by .
To set up the initial values of the Markov chain Monte Carlo (MCMC) updating, we fit an ordinary linear regression and use the coefficient estimates as initial values for and ’s whose starting points are the same across classes. The initial values of ’s are , representing initial equal probability of class allocation. The scale parameters start at 0.1, and the random effects are initially drawn from normal distributions with mean 0 and the initial scales.
We implement posterior computation for both marginal and joint profiling. The MCMC algorithms efficiently achieve convergence with Gibbs samplers. We run the MCMC chains long enough to obtain 100 multiply imputed datasets and 500 replicated datasets to check imputation plausibility, i.e., the model capability to predict observed outcomes. For classification, we follow Goodman (2007) with a hard partitioning and keep one random assignment based on a random draw from the componentspecific probabilities across iterations. Our analysis shows that the results are not sensitive to the classification rules.
4.1 Model diagnostics and comparison
We use the Bayesian information criterion (BIC) and the logpseudo marginal likelihood (LPML) to select the number of classes, . For BIC, we use the posterior mean estimates of related parameters and obtain the likelihood conditional on the class allocation. LPML is the sum of logarithms of conditional predictive ordinates and stimated using posterior samples of parameters, , and LPML= (Gelfand and Dey 1994). This is an approximation of the leaveoneout crossvalidation using importance sampling. We use for the marginal profiling and for the joint profiling.
Marginal profiling  Joint profiling  

L  BIC  LPML  L  BIC  LPML 
2  126857  106129  2  323529  192506 
3  124145  94305  3  329037  174286 
4  130872  97579  4  334345  180697 
5  136263  97315  
6  137082  97397 
We restrict to be modest or small to avoid computational problems due to separation of the lowfrequency predictors, such as dementia and renal failure. Table 4.1 presents the BIC and LPML values for models with different values. The model with the smallest BIC and largest LPML will be favored. As a tradeoff, we select the model with classes both for marginal and joint profiling.
To assess imputation plausibility, we perform a posterior predictive check and generate replicated datasets that are predicted values of observations based on the posterior estimates of model parameters after convergence (Meng 1994; He et al. 2010). Let be the collection of the replicated data sets. We then compare statistics of interest in each replicated dataset to those in the observed dataset . Suppose that is some statistic of interest, let and be the values of computed from and , respectively. The quantities include the mean, 2.5% percentile and 97.5% percentile of A1c levels for every patient and for every quarter. For the patientlevel summaries, we look at the twosided posterior predictive probabilities as diagnostic tools defined as
When the value ppp is small, for example, less than 5%, this suggests that the replicated data sets systematically differ from the observed data set, with respect to that statistic. With not small ppp values, the imputation model preserves the observed characteristics of that statistic. For the 7372 patientlevel A1c summaries, the marginal profiling model yields 52 ppp values that are below 1% for the average, and the number of below 1% ppp values for the 2.5% percentile and for the 97.5% percentile are 238 and 375. The number of ppp values that are below 1% for these three statistics model under joint profiling is 66, 464 and 678, respectively. The posterior predictive check does not raise the flag of lack of model fit. However, the performance of joint profiling is inferior in preserving the observed mean values across patients in comparison with marginal profiling.
For the quarterlevel A1c summaries, we compare the predictions with the empirical values under two profiling approaches. The 95% predictive credible intervals greatly overlap with the empirical 95% CIs. Figure 4.2 shows that the posterior predictive mean estimates are generally close to the empirical mean values, except for two observations, and both approaches perform competitively and similarly. We check the parameter estimates in Model 3.1, and find that the coefficients do not change across profiles. This explains the similar performance of joint profiling and marginal profiling.
To evaluate the robustness against model misspecification for the A1c trajectory, we use the logarithm transformed A1c value as the outcome variable and fit a linear regression model with time in years in , the slopes of which change across classes. Based on BIC and LPML, we select classes for marginal and joint profiling under the linearity assumption. The posterior predictive check shows that imputation performances under both marginal and joint profiling are worse than those with spline functions, with larger bias and variability. Hence, we choose the models with spline functions.
4.2 Sensitivity analysis
We aim to address two questions: (1) How using imputed data affects inference on how tight glycemic control is associated with acute diabetes outcomes. (2) How results differ before and after imputation. Since marginal profiling and joint profiling present similar results, we use the results under marginal profiling as an illustration. The overall incidence of any acute health events is 16% across the patientquarters. We perform inferences after MI using the combining rules (Rubin 1987) where the variances account for the uncertainty due to imputation. We consider a logistic regression model with a sevencategory discrete A1c indicators ().
We fit marginal models using generalized estimating equations (GEE) with exchangeable working correlation to account for the correlation between repeated measures of the same patient. We also fit generalized linear models with maximum likelihood estimation (MLE), and the conclusions are similar. However, MLE suffers from computational problems with largescale datasets. Figure 4.3 presents the mean and 95% CIs of the regression coefficients. Analysis based on only the observed cases and the BPMI inferences yield different conclusions with different degrees of uncertainty. Generally, the profiling analysis with imputed data shows fewer benefits of A1c values between 6% and 9% than those under the complete case analysis (CCA), in comparison with A1c values below 6%.
Hence, profiling displays a less pronounceable Ushape across A1c values. Specifically, CCA indicates that patients with A1c values between 6% and 8% consistently have significantly lower risks than those with even tighter A1c control (below 6%). However, this is not the case for marginal profiling for A1c values between 7% and 8%. Interestingly, both approaches find that risk levels are not distinguishable between patients with A1c values between 8% and 9% and those with extremely low values (below 6%), and the acute event risk is significantly higher for patients with high A1c values (above 9%).
The BPMI approach supports that the optimal A1c level range for patients with diabetes is 6% to 7%, while CCA shows it to be 6% to 8%. The evidence from BPMI confirms the finding in Kur et al. (2019). Our analysis shows that inferences are sensitive to the choice of method to handle intermittently missing values. The imputed A1c measurements tend to reduce the estimated negative effect of tight A1c control. Figure 4.3 shows that there is a nonlinear or Ushaped relationship between A1c values and the risk of adverse outcomes. Future work would be to study the heterogeneous effects of patent complexity in the Ushaped relationship as stringent A1c goals tend to harm patients with complications. We will further discuss the implications of including adverse event outcomes into the A1c imputations in Section 5.
In this paper we focus on the missing A1c values, where the MAR and NMAR assumptions refer to the model with A1c as the outcome. The adverse event indicators are fully observed. The CCA will exclude the occasions with missing A1c values and then the adverse event outcomes together with other corresponding timevarying information. This could cause different results between BPMI and CCA if the missingness of A1c induces the missingness of adverse events in the observed model that may be MAR or NMAR. However, this assumption cannot be verified and requires to account for adverse events in the A1c imputation model that is not the goal of this paper. Hence we assume the missingness of A1c does not depend on the adverse event indicators, which is reasonable since the baseline adverse event indicator and related patient characteristics are included as covariates. This could explain why the MI estimates differ from CCA may lie in the covariates that BPMI includes. The inferences look at the overall relationship while the BPMI uses rich patient information to impute missing A1c values.
5 Discussion
We propose a Bayesian latent profiling approach to handling longitudinal intermittent missing A1c values in EHR for patients with diabetes. The latent profiles are primarily determined by the A1c measurement trajectories and potentially the presence patterns, and the patient characteristics affect the assignment tendencies. Different assumptions on the missing data mechanisms result in marginal and joint profiling approaches with location and scale parameters varying across latent profiles. Model performance is evaluated via diagnostic tools, posterior predictive check and sensitivity analyses.
The Bayesian profiling model is able to preserve the observed data structure. The inferences after MI are different from CCA. The BPMI model with spline functions presents the best fitting and the most plausible imputation. The performance of marginal profiling and joint profiling is similar in the EHR application, where the MAR assumption tends to hold.
The BPMI has the appealing property for dimension reduction with a sparse visit structure and a large number of covariates and performs robust against model misspecification. It should be noted that currently available software for fully conditional imputation, such as mice and mi, failed to converge in our application, and produced a distorted observed data structure. The shared parameter models are subject to model misspecification and can cause estimation bias. The BPMI uses mixture model with different location and scale parameters that are flexible to capture different distribution structures. For the scenarios where the alternative methods work well, the BPMI yields competitive performances based on our simulation results that are not presented here and will be shared upon request.
Several extensions of BPMI may be of interest. First, incorporation of timevarying health outcomes into modeling can provide additional information to impute the A1c levels. The allcause health outcome includes a combination of any hospitalization, any ER visit, or death during each quarter. Figure 4.3 demonstrates a nonlinear relationship between A1c values and the outcome. Hospitalizations or ER visits due to possible drugrelated morbidity may also be a significant risk for patients with diabetes under tight control. However, the causal pathway makes sequential models more appealing than the joint modeling approach. We would like to generate completed EHRs for general analysis purpose, so incorporating the adverse event outcomes is not practical for comparative effectiveness research. The relationship between A1c and adverse events (i.e. the Ushape at low A1c) could be different for people with complex disease, i.e., the Ushape could be stronger for these people. Future work will be on exploring the heterogeneous effect of patient complexity or across latent profiles in the relationship and then A1c goals.
Other timevarying variables include blood pressure (BP), elevated lowdensity lipoprotein (LDL), number of primary caregiver office visits and Body Mass Index, shown as associated with tight A1c control (McFarlane et al. 2002; Niefeld et al. 2003; Jackson et al. 2006). However, these variables are also subject to missingness, and can have higher missingness percentages than A1c, such as BP and LDL. With more information available from EHR, variable selection needs further investigation, in particular being integrated with the Bayesian profiling approach as one systematic process.
Acknowledgements
This research work is funded by NIH grants: R21DK110688 and R01DK108073.
References
 ADA (2018a) ADA (2018a, Jan). Glycemic targets. Diabetes Care 41(1), 55–64.
 ADA (2018b) ADA (2018b, Jan). Older adults. Diabetes Care 41(1), 119–125.
 Agency for Healthcare Research and Quality (2013) Agency for Healthcare Research and Quality (2013, June). Primary care patients use interactive preventive health record integrated with electronic health record, leading to enhanced provision of preventive services.
 Carpenter and Kenward (2013) Carpenter, J. and M. G. Kenward (2013). Multiple imputation and its application. John Wiley & Sons.
 Cebul et al. (2011) Cebul, R. D., T. E. Love, A. K. Jain, and C. J. Hebert (2011). Electronic health records and quality of diabetes care. New England Journal of Medicine 365(9), 825–833.
 Dluhy and McMahon (2008) Dluhy, R. G. and G. T. McMahon (2008). Intensive glycemic control in the ACCORD and ADVANCE trials. New England Journal of Medicine 358, 2630–2633.
 Elixhauser et al. (1998) Elixhauser, A., C. Steiner, D. R. Harris, and R. M. Coffey (1998). Comorbidity measures for use with administrative data. Medical care 36(1), 8–27.
 Gelfand and Dey (1994) Gelfand, A. E. and D. K. Dey (1994). Bayesian model choice: asymptotic and exact calculations. Journal of the Royal Statistical Society Series B Stat. Methodol. 56(3), 501–514.
 Gelman et al. (2015) Gelman, A., J. Hill, Y.S. Su, M. Yajima, M. Pittau, B. Goodrich, Y. Si, and J. Kropko (2015). Package ”mi”. R CRAN.
 Gelman et al. (2008) Gelman, A., A. Jakulin, M. G. Pittau, and Y.S. Su (2008). A weakly informative default prior distribution for logistic and other regression models. The Annals of Applied Statistics 2(4), 1360–1383.
 Goodman (2007) Goodman, L. A. (2007). On the assignment of individuals to classes. Sociological Methodology 37, 1–22.
 He et al. (2010) He, Y., A. M. Zaslavsky, and M. B. Landrum (2010). Multiple imputation in a largescale complex survey: A guide. Statistical Methods in Medical Research 19, 653–670.
 Healthcare Information and Management Systems Society (2009) Healthcare Information and Management Systems Society (2009, June). Defining and testing EMR usability.
 Ibrahim et al. (2005) Ibrahim, J. G., M. H. Chen, S. R. Lipsitz, and A. H. Herring (2005). Missing data methods for generalized linear models: A comparative review. Journal of the American Statistical Association 100, 332–346.
 Ibrahim and Molenberghs (2009) Ibrahim, J. G. and G. Molenberghs (2009). Missing data methods in longitudinal studies: a review. Test 18(1), 1–43.
 Jackson et al. (2006) Jackson, G. L., D. Edelman, and M. Weinberger (2006). Simultaneous control of intermediate diabetes outcomes among veterans affairs primary care patients. Journal of the General Internal Medicine 21, 1050–1056.
 King et al. (2001) King, G., J. Honaker, A. Joseph, and K. Scheve (2001). Analyzing incomplete political science data: An alternative algorithm for multiple imputation. American Political Science Review 95, 49–69.
 Kur et al. (2019) Kur, A. K. A., F.R. Li, M.Y. Zhang, P.L. Chen, W.F. Zhong, X.R. Zhang, Z.H. Li, C. Mao, X.B. Wu, X. Gao, V. B. Kraus, X.M. Shi, Y.B. Lv, M.C. Zou, and G.C. Chen (2019). Glycated hemoglobin and allcause and causespecific mortality among adults with and without diabetes. The Journal of Clinical Endocrinology & Metabolism online.
 Lin et al. (2004) Lin, H., C. E. McCulloch, and R. A. Rosenheck (2004). Latent pattern mixture models for informative intermittent missing data in longitudinal studies. Biometrics 60, 295–305.
 Little (1995) Little, R. J. A. (1995). Modeling the dropout mechanism in repeatedmeasures studies. Journal of the American Statistical Association 90(431), 1112–1121.
 Little and Rubin (2002) Little, R. J. A. and D. B. Rubin (2002). Statistical Analysis with Missing Data: Second Edition. New York: John Wiley & Sons.
 Martin et al. (2006) Martin, C. L., J. Albers, W. H. Herman, P. Cleary, B. Waberski, D. A. Greene, M. J. Stevens, and E. L. Feldman (2006). Neuropathy among the diabetes control and complications trial cohort 8 years after trial completion. Diabetes Care 29, 340–344.
 McFarlane et al. (2002) McFarlane, S. I., S. J. Jacober, N. W. N, J. K. J, J. P. Castro, M. A. Wui, A. Gliwa, H. Von Gizycki, and J. R. Sowers (2002). Control of cardiovascular risk factors in patients with diabetes and hypertension at urban academic medical centers. Diabetes Care 25, 718–723.
 Meng (1994) Meng, X. L. (1994). Posterior predictive pvalues. Annals of Statistics 22, 1142–1160.
 National Research Council (2010) National Research Council (2010). The prevention and treatment of missing data in clinical trials. Washington, D.C: The National Academies Press.
 Niefeld et al. (2003) Niefeld, M. R., J. B. Braunstein, A. W. Wu, C. D. Saudek, W. E. Weller, and G. F. Anderson (2003). Preventable hospitalization among elderly medicare beneficiaries with type 2 diabetes. Diabetes Care 26, 1344–1349.
 Polson et al. (2013) Polson, N. G., J. G. Scott, and J. Windle (2013). Bayesian inference for logistic models using Pólya–Gamma latent variables. Journal of the American Statistical Association 108(504), 1339–1349.
 Raghunathan et al. (2001) Raghunathan, T. E., J. M. Lepkowski, J. van Hoewyk, and P. Solenberger (2001). A multivariate technique for multiply imputing missing values using a sequence of regression models. Survey Methodology 27, 85–96.
 Rodríguez and Dunson (2011) Rodríguez, A. and D. B. Dunson (2011). Nonparametric bayesian models through probitstickbreaking processes. Bayesian Analysis 6(1), 145–178.
 Rubin (1987) Rubin, D. B. (1987). Multiple Imputation for Nonresponse in Surveys. New York: John Wiley & Sons.
 SAS Institute Inc. (2017) SAS Institute Inc. (2017). SAS/STAT 9.4 User’s Guide.
 Schafer (1997) Schafer, J. L. (1997). Analysis of Incomplete Multivariate Data. London: Chapman & Hall.
 Schafer (2016) Schafer, J. L. (2016). Multiple imputation for multivariate panel or clustered data. https://cran.rproject.org/web/packages/pan/.
 The Centers for Disease Control and Prevention (2017) The Centers for Disease Control and Prevention (2017). National diabetes statistics report. https://www.cdc.gov/diabetes/pdfs/data/statistics/nationaldiabetesstatisticsreport.pdf.
 U.S. Center for Medicare & Medicaid Services (2012) U.S. Center for Medicare & Medicaid Services (2012, Mar). Electronic health records. https://www.cms.gov/Medicare/EHealth/EHealthRecords/index.html.
 Van Buuren and Oudshoorn (1999) Van Buuren, S. and C. Oudshoorn (1999). Flexible multivariate imputation by MICE. Technical report, Leiden: TNO Preventie en Gezondheid, TNO/VGZ/PG 99.054.
 Wu and Carroll (1988) Wu, M. C. and R. J. Carroll (1988). Estimation and comparison of changes in the presence of informative right censoring by modeling the censoring process. Biometrics 44(1), 175–188.
Appendix A Posterior computation
Supplemental materials for the posterior computation under marginal and joint profiling.
a.1 Marginal profiling
The prior distributions are specified as: , , , and , where we set the hyperparameters equal to 1 and assume independence in the covariance matrix specification. The posterior updating follows the steps:

Update : for , draw from a multinomial distribution with probability

Update : for , where for identification purpose, where Here are subsets of the design matrix and random effects for all such that

Update : for , let , , where , is the design matrix with each row , , and where is a vector in , and the th component is

Update ,where , , and .

Update , for from

Update from .

Update from , here and
After convergence, we impute missing data from the assumed model for since this step does not need to be included into the iterations.
a.2 Joint profiling
The prior distributions are specified as , , , , , , and , where we set the hyperparameters equal to 1 and assume independence in the covariance matrix specification. The posterior computation is the following.

Update : for , draw from a multinomial distribution with probability

Update : for , where , , where

Update : for , , where , is the design matrix with each row , , and ,
where is a vector in , and the th component is

Update : , , , and .

Update , for , from

Update from .

Update from , here and

Update : , where , and is a vector with length :
and where .

Update , for , from , where , and is a vector with length :
and where .

Update from , where , and is a vector of as the following:
and where .

Update from

Impute missing data: draw from .