Federated and Differentially Private Learning for Electronic Health Records
The use of collaborative and decentralized machine learning techniques such as federated learning has the potential to enable the development and deployment of clinical risk prediction models in low-resource settings without requiring that sensitive data be shared or stored in a central repository. This process necessitates the communication of model weights or updates between collaborating entities, but it is unclear to what extent patient privacy is compromised as a result. To gain insight into this question, we study the efficacy of centralized versus federated learning in both private and non-private settings. The clinical prediction tasks we consider are the prediction of prolonged length of stay and in-hospital mortality across thirty-one hospitals in the eICU Collaborative Research Database. We find that while it is straightforward to apply differentially private stochastic gradient descent to achieve strong privacy bounds when training in a centralized setting, it is considerably more difficult to do so in the federated setting.
The availability of high quality public clinical data sets (johnson2016mimic; pollard2018eicu) has greatly accelerated research into the use of machine learning for the development of clinical decision support tools. However, the majority of clinical data remain in private silos and are broadly unavailable for research due to concerns over patient privacy, inhibiting the collaborative development of high fidelity predictive models across institutions. Additionally, standard de-identification protocols provide limited safety guarantees against sophisticated re-identification attacks (ElEmam2011a; gkoulalas2014publishing; kleppner2009committee). Furthermore, patient privacy may be violated even in the case where no raw data is shared with downstream parties, as trained machine learning models are susceptible to membership inference attacks (Shokri2017), model inversion (Fredrikson2015), and training data extraction (Carlini2018).
In line with recent work (Beaulieu-Jones2018; Vepakomma2018), we investigate the extent to which several hospitals can collaboratively train clinical risk prediction models with formal privacy guarantees without sharing data. In particular, we employ federated averaging (McMahan2016) and differentially private stochastic gradient descent (McMahan2017; McMahan2018; Abadi2016) to train models for in-hospital mortality and prolonged length of stay prediction across thirty-one hospitals in the eICU Collaborative Research Database (eICU-CRD) (pollard2018eicu).
1.1 Federated Learning
Federated learning (McMahan2016) is a general technique for decentralized optimization across a collection of entities without sharing data, typically employed for training machine learning models on mobile devices. In the variant known as federated averaging, each entity trains a local model for a fixed number of epochs over the local training data and transfers the resulting weights to a central server. The server returns the average of the weights to each entity and the process repeats. This satisfies an intuitive notion of privacy, since no entity shares data with the central server or with any other entity. However, federated learning alone provides no formal accounting for the privacy cost incurred via the communication of local model weights with the central server.
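The round structure described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes full-batch gradient descent on a logistic regression model as the local optimizer, and the function names (`local_update`, `federated_averaging`) are our own.

```python
import numpy as np

def local_update(weights, X, y, lr=0.1, epochs=1):
    """One round of local logistic-regression training via full-batch gradient descent."""
    w = weights.copy()
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-X @ w))   # predicted probabilities
        grad = X.T @ (p - y) / len(y)      # mean logistic-loss gradient
        w -= lr * grad
    return w

def federated_averaging(hospitals, dim, rounds=10):
    """hospitals: list of (X, y) pairs, one per collaborating entity."""
    global_w = np.zeros(dim)
    for _ in range(rounds):
        # Each entity trains locally starting from the shared weights...
        local_ws = [local_update(global_w, X, y) for X, y in hospitals]
        # ...and the server averages the resulting weights.
        global_w = np.mean(local_ws, axis=0)
    return global_w
```

Note that only weights cross entity boundaries; the raw `(X, y)` data never leaves its hospital, which is the intuitive privacy property the surrounding text describes.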
1.2 Differential Privacy
Formally, a randomized algorithm M with domain D and range R satisfies (ε, δ)-differential privacy (Dwork2014) if for any two adjacent data sets d, d′ ∈ D and for any subset of outputs S ⊆ R,

Pr[M(d) ∈ S] ≤ e^ε Pr[M(d′) ∈ S] + δ.
In our case, the randomized algorithm we consider is differentially private stochastic gradient descent (DP-SGD) (Abadi2016; McMahan2018). Here, adjacent data sets d, d′ are defined by adding, removing, or modifying the data for one record. This formulation can be informally interpreted as one where the inclusion of a record does not affect the probability distribution over learned model weights by more than a multiplicative factor of e^ε, where δ bounds the probability of that restriction failing to hold. Notably, this notion allows us to bound and quantify the capability of an adversary to determine whether a record belonged to the training data set, regardless of their access to auxiliary information (Dwork2014).
In practice, stochastic gradient descent can be made differentially private if each record-level gradient is clipped to a maximum ℓ2 norm S and Gaussian noise with standard deviation σS/B is added to the mean of the clipped gradients over a batch of B training examples (McMahan2018). The privacy loss of the procedure may then be accounted for with the moments accountant (Abadi2016; McMahan2018) and Rényi differential privacy (Mironov2017). In this setting, the privacy cost of a training procedure is fully specified by the noise multiplier σ, the ratio of the batch size to the training set size, and the number of training steps (McMahan2018). McMahan2017 demonstrate that it is straightforward to formulate federated learning in a way that is conducive to differentially private training if DP-SGD is used as the local optimization algorithm.
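The clip-then-noise step described above can be sketched as follows. This is a toy illustration of the mechanism, not the TensorFlow Privacy implementation used in the paper; the function name and the per-example-gradient input format are our own.

```python
import numpy as np

def dp_sgd_step(w, per_example_grads, clip_norm, noise_multiplier, lr, rng):
    """One differentially private gradient step: clip each record-level
    gradient to L2 norm `clip_norm`, average over the batch, and add
    Gaussian noise with standard deviation noise_multiplier * clip_norm / B."""
    batch_size = len(per_example_grads)
    clipped = []
    for g in per_example_grads:
        norm = np.linalg.norm(g)
        # Scale down (never up) so every record contributes at most clip_norm.
        clipped.append(g * min(1.0, clip_norm / max(norm, 1e-12)))
    mean_grad = np.mean(clipped, axis=0)
    noise = rng.normal(0.0, noise_multiplier * clip_norm / batch_size, size=w.shape)
    return w - lr * (mean_grad + noise)
```

The clipping bounds each record's influence on the update (the sensitivity), which is what makes the Gaussian noise calibration, and hence the privacy accounting, possible.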
1.3 Related Work
Our work is most similar to Beaulieu-Jones2018, who also investigate decentralized and differentially private machine learning for mortality prediction on the eICU-CRD, but use cyclical weight transfer (Chang2018) rather than federated averaging for distributed optimization. Another related technique is split learning (gupta2018distributed; Vepakomma2018a; vepakomma2019reducing), where the layers of a neural network are partitioned across several entities, enabling learning across entities that may contribute different data modalities without exposing the raw data or the local network architecture. As an alternative, recent work (Beaulieu-Jones2017; Xie2018) has proposed the use of differentially private generative models to publicly release synthetic data with privacy guarantees.
All experiments are based on data derived from the eICU Collaborative Research Database (pollard2018eicu), a freely and publicly available intensive care database containing data from 139,367 unique patients admitted between 2014 and 2015 to 208 unique hospitals. Each patient may have one or more recorded hospital admissions, each composed of one or more ICU stays.
We make predictions at 24 hours into hospital admissions that last at least 24 hours. We assign positive binary outcome labels for in-hospital mortality and prolonged length of stay if the patient dies during the remainder of the hospital admission or if the admission lasts longer than 7 days, respectively.
To construct a training set for supervised learning, we first partition the set of admissions by hospital and then split the data within each hospital by patient such that 80%, 10%, and 10% of the patients are used for training, validation, and testing, respectively. We allow for multiple hospital admissions per patient, but no patient exists in more than one partition within the same hospital. We retain all hospitals with more than 1,000 hospital admissions in their corresponding training data sets. This procedure produces a cohort of 65,509 labeled hospital admissions across 31 unique hospitals. The incidence of in-hospital mortality and prolonged length of stay in the aggregate population is 7.3% and 34.4%, respectively.
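A patient-level split of the kind described above can be implemented by hashing patient identifiers, which guarantees every admission for a patient lands in the same partition. This is a sketch under our own assumptions: the paper does not specify its splitting mechanism, and the record field names (`patient_id`) are hypothetical.

```python
import hashlib

def split_by_patient(admissions, seed="split-v1"):
    """Assign each admission to train/val/test (80/10/10) by hashing its
    patient id, so all admissions for a given patient share a partition."""
    parts = {"train": [], "val": [], "test": []}
    for adm in admissions:
        digest = hashlib.md5(f"{seed}:{adm['patient_id']}".encode()).hexdigest()
        bucket = int(digest, 16) % 10
        if bucket < 8:
            parts["train"].append(adm)
        elif bucket == 8:
            parts["val"].append(adm)
        else:
            parts["test"].append(adm)
    return parts
```

Hashing (rather than random shuffling) makes the assignment deterministic and stable if new admissions for an existing patient are added later.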
We construct a feature representation as a function of data recorded within each hospital stay up to 24 hours into the stay. We extract all lab orders, lab results, medication orders, diagnoses, and active treatments, as well as the patient age at admission, gender, ethnicity, unit type, and admission source. Lab results and age are binned into three and four bins, respectively. We aggregate over time, assigning a one for each feature if it is observed anywhere in the admission prior to 24 hours and a zero otherwise.
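The time-aggregated binary representation described above amounts to a presence/absence vector over a fixed feature vocabulary. A minimal sketch, with hypothetical feature names and an event format of our own choosing:

```python
def binarize_admission(events, vocabulary):
    """events: iterable of (feature_name, hours_since_admission) observations.
    Returns a 0/1 vector with a 1 for each vocabulary feature observed
    before hour 24 of the admission."""
    index = {name: i for i, name in enumerate(vocabulary)}
    vec = [0] * len(vocabulary)
    for name, hours in events:
        if hours < 24 and name in index:
            vec[index[name]] = 1
    return vec
```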
For all supervised learning tasks, we consider only logistic regression and feedforward networks with one hidden layer. We perform model selection on the basis of the area under the receiver operating characteristic curve (AUC-ROC) evaluated on the corresponding validation set, following a grid search over relevant hyperparameters. Model performance is reported as the 95% confidence interval of the AUC-ROC on the corresponding test set, derived via DeLong's method (DeLong1988). We similarly derive confidence intervals for the difference in the AUC-ROC between models to facilitate model comparisons; this procedure accounts for the correlated nature of the predictions made by two models on the same test set. The Adam optimizer (kingma2014adam) is used in each case.
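A single-model DeLong-style confidence interval can be computed from the Mann-Whitney structural components. This sketch covers only the one-model case, not the paired comparison of two correlated models the paper also uses, and the function name is our own.

```python
import numpy as np
from statistics import NormalDist

def delong_auc_ci(y_true, scores, level=0.95):
    """AUC-ROC with a DeLong confidence interval from the variance of the
    Mann-Whitney structural components (ties count one half)."""
    y_true, scores = np.asarray(y_true), np.asarray(scores)
    pos, neg = scores[y_true == 1], scores[y_true == 0]
    m, n = len(pos), len(neg)
    # v10[i]: fraction of negatives ranked below positive i; v01 symmetric.
    v10 = np.array([(np.sum(p > neg) + 0.5 * np.sum(p == neg)) / n for p in pos])
    v01 = np.array([(np.sum(pos > q) + 0.5 * np.sum(pos == q)) / m for q in neg])
    auc = v10.mean()
    var = v10.var(ddof=1) / m + v01.var(ddof=1) / n
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half = z * np.sqrt(var)
    return auc, (auc - half, auc + half)
```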
2.1 Experimental Design
We conduct a series of experiments designed to evaluate the relative benefits of centralized and federated learning, and the associated privacy costs, over learning using only local data at each hospital. We evaluate the following experimental conditions:
Local training with no collaboration. We identify a high-performing model for each hospital using only data from that hospital, following a grid search over learning rates, batch sizes, and, if the model is a feedforward network, hidden layer sizes.
Centralized training. We simulate the setting where all of the records are available in a central repository, selecting the best global model on the basis of performance on the aggregated records, and evaluate that model on the local data from each hospital.
Centralized training with differential privacy. We modify the centralized training procedure to use DP-SGD for optimization (McMahan2018). Here we additionally search over the discrete grid [0.1, 1, 10] for both the noise multiplier σ and the gradient clipping threshold S. We assess privacy in terms of the ε that results from training with a fixed δ.
Federated learning. We employ the federated averaging algorithm described in McMahan2016. For each round of federated learning, we conduct one epoch of training using the local data at each hospital and then synchronize the weights across all hospitals with an average. We maintain a record of the local performance at each hospital over the federated learning procedure and perform local model selection on the basis of the best validation AUC-ROC observed over the procedure. Model selection for the best federated hyperparameters is determined on the basis of the best mean local validation AUC-ROC across hospitals.
Federated learning with differential privacy. We repeat the federated averaging experiment as previously described, but use DP-SGD as the local optimizer at each hospital, similar to the algorithm described in McMahan2017. We experiment with fixed global DP-SGD hyperparameters and with local hyperparameters selected independently at each hospital. For the local hyperparameter search at each hospital, the noise multiplier σ, the clipping threshold S, and the learning rate are selected log-uniformly, performing model selection on the basis of the DP-SGD hyperparameters that maximize the local AUC-ROC in ten epochs of training without any collaboration. We then perform federated learning for ten rounds with the selected local DP-SGD hyperparameters.
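The "ε from a fixed δ" assessment used in the private conditions above can be illustrated with a simplified Rényi DP accountant. This sketch covers only the non-subsampled Gaussian mechanism composed over a fixed number of steps (an upper bound, since it ignores the privacy amplification from batch subsampling that the moments accountant exploits); the function name and the grid of Rényi orders are our own.

```python
import math

def gaussian_rdp_epsilon(noise_multiplier, steps, delta, alphas=None):
    """Convert the Renyi DP of the (non-subsampled) Gaussian mechanism,
    composed over `steps` updates, into an (epsilon, delta) guarantee.
    RDP at order alpha is steps * alpha / (2 * sigma^2); the conversion
    eps = rdp + log(1/delta) / (alpha - 1) follows Mironov (2017)."""
    if alphas is None:
        # A grid of Renyi orders alpha > 1 to minimize over.
        alphas = [1 + x / 10.0 for x in range(1, 100)] + list(range(12, 64))
    best = float("inf")
    for a in alphas:
        rdp = steps * a / (2.0 * noise_multiplier ** 2)
        best = min(best, rdp + math.log(1.0 / delta) / (a - 1.0))
    return best
```

In practice one would use a library accountant (the paper credits the TensorFlow Privacy project), which additionally accounts for the batch-to-training-set sampling ratio and yields much tighter ε values.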
3 Results and Discussion
Prior to experimentation with differentially private training, we aimed to establish the efficacy of federated learning over centralized and local learning. We find that while federated learning often improves on local learning, frequently attaining an AUC-ROC comparable to that of centralized learning, the improvements are often not large enough to be statistically significant on the basis of the 95% confidence interval for the difference in AUC-ROC between either the central or federated model and the corresponding local model (Table 1). In particular, centralized and federated learning for prediction of prolonged length of stay improve on local learning for thirteen and twelve hospitals, respectively, whereas centralized and federated learning only benefit mortality prediction in seven and five cases, respectively.
When the records from all hospitals are aggregated for differentially private centralized training, it is feasible to attain relatively strong privacy guarantees in terms of ε for suitable settings of the noise multiplier and clipping threshold (Figure 1), with a relatively minor reduction in the validation AUC-ROC at the end of training (prolonged length of stay: 0.763 vs. 0.73; mortality: 0.876 vs. 0.832). When attempting to perform federated learning in a differentially private manner, we find that even with DP-SGD hyperparameters selected on the basis of local training, the models derived from differentially private federated learning often perform poorly in terms of both AUC-ROC and ε, and that this effect is exacerbated for mortality prediction (Table S1). It is likely that a practical tuning strategy for differentially private federated averaging could be identified with further experimentation, but it is unclear whether such a strategy would generalize to similar data sets and prediction tasks. This is problematic for both this and related work, as neglecting to account for the privacy cost of model selection produces optimistic underestimates of the true privacy cost (Liu2018a; chaudhuri2013stability). In future work, it is of interest to conduct controlled experiments directly comparing our approach to cyclical weight transfer (Beaulieu-Jones2018) and split learning (gupta2018distributed; Vepakomma2018a; vepakomma2019reducing) to gain insight into the relative efficacy of differentially private federated averaging over these alternatives.
We thank Michaela Hardt and Abhradeep Thakurta for valuable mentorship and feedback. We further thank Steve Chien and all contributors to the Tensorflow Privacy project for enabling this work.
[Table 1 — columns: Prolonged Length of Stay | Hospital Mortality]