A Dual-Hormone Closed-Loop Delivery System for Type 1 Diabetes Using Deep Reinforcement Learning


Taiyu Zhu
Imperial College London
London, SW7 2AZ
&Kezhi Li
Imperial College London
London, SW7 2AZ
Pantelis Georgiou
Imperial College London
London, SW7 2AZ
T. Zhu and K. Li contributed equally.

We propose a dual-hormone delivery strategy exploiting deep reinforcement learning (RL) for people with Type 1 Diabetes (T1D). Specifically, double dilated recurrent neural networks (RNNs) are used to learn the hormone delivery strategy, trained by a variant of Q-learning, whose inputs are raw glucose and meal carbohydrate data and whose outputs are dual-hormone (insulin and glucagon) deliveries. Without prior knowledge of the glucose-insulin metabolism, we run the method on the UVA/Padova simulator. Hundreds of days of self-play are performed to obtain a generalized model; importance sampling is then adopted to customize the model for personal use. In silico, the proposed strategy achieves a high glucose time in target range (TIR) for both adults and adolescents given standard bolus doses, significantly outperforming previous approaches. The results indicate that deep RL is effective in building a personalized hormone delivery strategy for people with T1D.

1 Introduction

Diabetes is a lifelong condition that affects an estimated 425 million people worldwide IDF-2017 . Delivering an optimal insulin dose to T1D subjects has been a long-standing challenge since the 1970s Lunze-ArtificialPan2013 ; Reddy-CliSafety2016 . Previously, the quality of life of T1D subjects relied heavily on the accuracy of human-defined models and features of the delivery strategy. In recent years, deep learning (DL) has provided new ideas and solutions to many healthcare problems jiang-artificial2017 , empowered by the increasing availability of medical data and the rapid progress of analytic tools, e.g. RL. However, several obstacles hinder us from building an efficient RL model for chronic diseases. Firstly, because they are collected from a dynamic interaction between human and environment, RL medical data are limited and expensive Artman-power2018 . In addition, unlike playing Atari in a virtual environment Mnih-playingAtari2013 , 'exploring' possibilities on human beings is costly in terms of both money and safety. Finally, the variability of physiological responses to the same treatment can be very large across people with T1D Vettoretti-Type1Dia2018 , making it difficult to build a versatile model suitable for a wide range of subjects. These reasons have led to little progress in applying RL to chronic illnesses.

To overcome these obstacles, we propose a two-step framework to apply deep RL to chronic illnesses, using T1D as a case study. T1D is chosen because it is a typical disease requiring consistent, dynamic treatment. In the first step, a generalized deep neural network (DNN) for the hormone delivery strategy is pre-trained using a variant of Q-learning. We use a double dilated recurrent neural network (DRNN) to build a model for multi-dimensional medical time series, in which each basal hormone delivery (at five-minute intervals) is considered an action determined by a stochastic policy, and glucose levels and TIR determine the reward. In the second step, keeping the weights and features from the first step, importance sampling is implemented to personalize the model using individual data.

We use the UVA/Padova T1D Simulator, a popular glucose-insulin dynamics simulator accepted by the Food and Drug Administration (FDA) Man-UVA/PADOVA2014 , as the environment. It can generate T1D subject data with high variability in meals, body conditions and other factors. During training, hundreds of days of 'self-play' trials are performed to obtain a generalized model for dual-hormone closed-loop basal delivery with standard bolus. In testing, 10 adults and 10 adolescents are evaluated over a 6-month period. The resulting TIRs achieved for adults and adolescents in silico significantly improve on state-of-the-art performance.

2 Related Work and Preliminaries

The rapid growth of continuous glucose monitoring (CGM) and insulin pump therapy has motivated the use of closed-loop systems, known as the artificial pancreas (AP) cobelli-artificial2011 ; Hovorka-CloseLoop2011 ; russell-outpatient2014 ; herrero-coordinated2017 . Many algorithms have been developed and verified as closed-loop single/dual-hormone delivery strategies bergenstal-SafetyOf2016 , mostly based on control algorithms Facchinetti-ConGlu2016 ; Haidar-TheAP2016 . Machine learning (ML) has been leveraged for glucose management in Plis-AMachine2014 ; Mhaskar-ADeepLearningApp2017 . DNNs are a newer approach to glucose management, with fully connected perez-ArtiNN2010 , CNN-based Li-CRNN2019 , RNN-based Chen-DilatedRec2018 , and physiological-model-based Bertachi-PreofBlo2018 networks. Dilated RNNs have been shown to perform well at processing long-term dependencies chang-dilated2017 . RL, which updates control parameters by gradient, has been adopted recently Ngo-ControlOfBl2018 ; herrero-coordinated2017 ; Sun-ADual2019 . To accelerate the learning process, prioritized experience replay samples important transitions more frequently schaul-prioritized2015 . Before submission we found an RL environment built on the 2008 version of the simulator Jinyu-Simglucose2018 , while we use the 2012 version.

We view the problem as an infinite-state Markov decision process (MDP) with noise. An MDP can be defined by a tuple $(\mathcal{S}, \mathcal{A}, R, P, \gamma)$ with state space $\mathcal{S}$, action space $\mathcal{A}$, reward function $R$, transition function $P$, and discount factor $\gamma$. At each time step, the agent takes an action $a_t \in \mathcal{A}$, causing the environment to transit from state $s_t$ to state $s_{t+1}$ with probability $P(s_{t+1} \mid s_t, a_t)$. A policy $\pi$ specifies the strategy of selecting an action. RL's goal is to find the policy mapping states to actions that maximizes the expected long-term reward. A Q-function $Q^{\pi}(s,a) = \mathbb{E}\big[\sum_{t=0}^{\infty} \gamma^{t} r_{t} \mid s_0 = s, a_0 = a\big]$ can be defined to compute this reward. The optimal action-value function $Q^{*}(s,a) = \max_{\pi} Q^{\pi}(s,a)$ offers the maximal value and can be determined by solving the Bellman equation $Q^{*}(s,a) = \mathbb{E}\big[r + \gamma \max_{a'} Q^{*}(s',a') \mid s, a\big]$.
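The Bellman recursion above can be illustrated with a minimal tabular Q-learning update, a toy stand-in for the deep variant used later; the two-state MDP, step size, and discount below are purely illustrative:

```python
def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One tabular Q-learning step: move Q(s, a) toward the Bellman target
    r + gamma * max_a' Q(s', a')."""
    target = r + gamma * max(Q[s_next].values())
    Q[s][a] += alpha * (target - Q[s][a])
    return Q

# Toy MDP with two states and two actions, all values initialized to zero.
Q = {s: {a: 0.0 for a in (0, 1)} for s in (0, 1)}
Q = q_learning_update(Q, s=0, a=1, r=1.0, s_next=1)
```

With all values at zero, the target is simply the reward, so `Q[0][1]` moves a step of size `alpha` toward 1.0.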

3 Methods

In the hormone delivery problem, we use multi-dimensional data as the input. The data include blood glucose (mg/dL), meal carbohydrate (g) manually recorded by individuals, and the corresponding meal bolus (U). Dual-hormone (basal insulin and glucagon) delivery is considered as the action $a_t$. The input can be denoted as $\{x_t\}_{t=1}^{L}$, where $L$ is the data length and each sample contains the total insulin (bolus plus basal) and the glucagon. We use the latest 1 hour of data (12 samples) as the current state $s_t$. The bolus is computed from the meal carbohydrate with a standard bolus calculator and divided by body weight. The problem can then be seen as an agent interacting with an environment sequentially over time steps. Every five minutes, an observation is obtained and an action is taken. The action is chosen from three options: do nothing, deliver basal insulin, or deliver glucagon. The amounts of basal insulin and glucagon are small constants determined in advance by the subject profile. To maintain the BG in a target range, we carefully define a reward that the agent receives at each time step.
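The state/action/reward setup above can be sketched as follows. The exact reward expression from the paper is not reproduced here, so a simple in-range indicator and the common 70-180 mg/dL consensus target range are substituted as assumptions; all names are illustrative:

```python
ACTIONS = ("none", "basal_insulin", "glucagon")  # three discrete options
WINDOW = 12  # latest 1 hour of 5-minute samples

def make_state(history):
    """State s_t: the last 12 (glucose, carbs, insulin) samples."""
    return tuple(history)[-WINDOW:]

def reward(bg, low=70.0, high=180.0):
    """Illustrative reward: +1 inside an assumed 70-180 mg/dL target
    range, -1 outside (the paper's actual reward may differ)."""
    return 1.0 if low <= bg <= high else -1.0

# Example: build a state from 20 historical samples and score a reading.
state = make_state([(100.0, 0.0, 0.0)] * 20)
```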


The goal of the agent is to learn a personalized basal insulin and glucagon delivery strategy in a short period of time (using limited data) for each individual. Here we propose a two-step learning approach to build the dual-hormone delivery model.

3.1 Generalized DQN training

We build an interactive environment based on the simulator and employ a dilated RNN as the DNN. The dilated RNN is used because its larger receptive field is crucial for processing glucose time series Chen-DilatedRec2018 . We train double DQN weights in the simulator, since double DQN has proven to be a robust approach to reducing overestimation. An online action network and a target value network are trained. The pseudo-code is sketched in Algorithm 1.

1:Inputs: initialize environment E, historical data D, update frequency T, and two dilated RNNs with random weights θ and θ⁻, respectively
2:repeat
3:     sample action a_t ε-greedily from Q(s_t, ·; θ), observe (r_t, s_{t+1}) in E
4:     store (s_t, a_t, r_t, s_{t+1}) into replay buffer B
5:     sample a mini-batch uniformly from B and calculate the loss
6:     perform a gradient descent step to update θ
7:     if t mod T = 0 then θ⁻ ← θ end if
8:until converge or reach the number of iterations
Algorithm 1 Generalized model training
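The loss in line 5 relies on the double-DQN target: the online network selects the greedy action and the target network evaluates it. A minimal numpy sketch of that target, with tiny linear networks standing in for the dilated RNNs (all shapes and names are illustrative):

```python
import numpy as np

def double_dqn_target(r, s_next, W_online, W_target, gamma=0.99, done=False):
    """y = r + gamma * Q_target(s', argmax_a Q_online(s', a))."""
    if done:
        return r
    a_star = int(np.argmax(W_online @ s_next))      # action chosen by online net
    return r + gamma * (W_target @ s_next)[a_star]  # value from the target net

rng = np.random.default_rng(0)
s_next = rng.standard_normal(4)
W_online = rng.standard_normal((3, 4))  # 3 actions: none / basal / glucagon
W_target = rng.standard_normal((3, 4))
y = double_dqn_target(r=1.0, s_next=s_next, W_online=W_online, W_target=W_target)
```

Decoupling action selection from evaluation is what reduces the overestimation bias of the max operator in standard Q-learning.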

The agent explores random hormone delivery actions under a policy that is ε-greedy with respect to Q in the simulator. This is similar to playing Atari games, so the agent can 'self-play' in the simulator for a long time. Some human intervention/demonstration at the beginning of the RL process can reduce the training time slightly, but in our case it is not necessary. At the end of this step, a DRNN with weights θ is obtained as the generalized model.

3.2 Personalized DQN training

In this step we refine the model and customize it for personal use. The weights and features obtained from the previous step are updated using limited data with importance sampling schaul-prioritized2015 and safety constraints. This gives better actions (those leading to no hyperglycemia or hypoglycemia) higher probability and worse demonstrations lower probability. Details are shown in Algorithm 2.

1:Inputs: initialize environment E, historical data D, generalized Q-function with weights θ, replay buffer B, target weights θ⁻, update frequency T
2:generate B as a merge of D and experience collected from E
3:calculate importance probabilities from B
4:repeat
5:     sample action a_t from policy π, observe (r_t, s_{t+1})
6:     store (s_t, a_t, r_t, s_{t+1}) in B, overwriting the oldest samples previously merged from D
7:     sample a mini-batch from B with modified importance sampling
8:     calculate the loss
9:     perform a gradient descent update and update the importance priorities
10:     if t mod T = 0 then θ⁻ ← θ end if
11:until converge or reach the number of iterations
Algorithm 2 Personalized DQN training using importance sampling
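The modified importance sampling in line 7 can be sketched along the lines of prioritized experience replay schaul-prioritized2015 : transitions are drawn in proportion to priority p_i^α, and each sample carries a correction weight (N·P(i))^(-β), normalized by the maximum weight. The exponents and all names below are illustrative assumptions:

```python
import numpy as np

def sample_prioritized(priorities, batch_size, alpha=0.6, beta=0.4, rng=None):
    """Draw indices with probability p_i^alpha / sum_j p_j^alpha and return
    the normalized importance-sampling weights that correct the bias."""
    rng = rng or np.random.default_rng()
    p = np.asarray(priorities, dtype=float) ** alpha
    probs = p / p.sum()
    idx = rng.choice(len(probs), size=batch_size, p=probs)
    w = (len(probs) * probs[idx]) ** (-beta)  # importance-sampling correction
    return idx, w / w.max()                   # normalize so max weight is 1

idx, w = sample_prioritized([1.0, 2.0, 0.5, 4.0], batch_size=2,
                            rng=np.random.default_rng(1))
```

In the personalization step, priorities would reflect how safe/valuable a transition is, so demonstrations leading to hypo- or hyperglycemia are replayed less often.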

4 Experiment Results

We compare the results under the following experiment setups (details in the supplementary materials): 1. bolus calculator + constant basal insulin (BC); 2. bolus calculator + insulin suspension + carbohydrate recommendation (BC+IS+CR) liu-coordinating2019 ; 3. ours: bolus calculator + generalized RL model (BC+RL) (Algorithm 1); 4. ours: bolus calculator + RL intelligent basal (BC+IB) (Algorithms 1 and 2).

In the experiments we used the TIR, the percentage of time in hypoglycemia, and the percentage of time in hyperglycemia as metrics to measure performance. In general, either a higher TIR or a lower hypo/hyper percentage indicates better glycemic control. These evaluation metrics are preferred in diabetes clinics vigersky2018relationship . Table 1 presents the overall performance of the experiment setups on the adult and adolescent subjects. For adult subjects, the IB model achieves the best performance, increasing the mean TIR by 11.21% compared with the BC setup. In the adolescent case, the IB model obtains the best TIR of 83.39%. In both cases, the IB model outperforms the controller with IS and CR on both TIR and hypo/hyper results with considerable improvements.
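These metrics can be computed directly from a CGM trace; a minimal sketch, assuming the common 70-180 mg/dL consensus thresholds (the paper's exact thresholds are not reproduced here):

```python
def glycemic_metrics(bg_trace, low=70.0, high=180.0):
    """Return (TIR, hypo, hyper) as percentages of samples in, below,
    and above an assumed 70-180 mg/dL target range."""
    n = len(bg_trace)
    hypo = sum(bg < low for bg in bg_trace) / n * 100
    hyper = sum(bg > high for bg in bg_trace) / n * 100
    tir = 100 - hypo - hyper
    return tir, hypo, hyper

# One reading below range and one above: TIR 50%, hypo 25%, hyper 25%.
tir, hypo, hyper = glycemic_metrics([65, 100, 150, 200])
```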

(%) Method TIR Hypo Hyper
Adult BC
Adolescent BC

Table 1: Results of the different experiment setups
(a) An adult simulation
(b) An adolescent simulation
Figure 1: Performance of the four experiment setups on an adult subject and an adolescent subject over a 6-month period: (top to bottom) BC, BC+IS+CR, BC+RL (ours), BC+IB (ours), and the distribution of carbohydrate intake. The average BG levels over 180 days are shown as solid blue lines, and the hypo/hyperglycemia regions are shown as dotted green/red lines. The gray lines correspond to glucose levels on every day of the trial, and the blue shaded regions indicate the standard deviation.

To observe model performance on specific BG levels, we visualize the BG curves of two subjects over the 6-month testing period in Figure 1. For the adult curves in the left column, the performance is broadly in accordance with the statistical results in Table 1, and the IB model avoids many hypoglycemia events during the night. In the right column, it can be observed that the IB model significantly helps to reduce overall BG levels and increase TIR by making the mean curve flat and stable. Notably, a standard CNN or LSTM cannot achieve performances as good as (or close to) those of the dilated RNN.

5 Conclusion

We propose an intelligent basal hormone delivery algorithm that employs deep RL for glucose management. A dilated RNN is used in a double DQN, and a personalized model is obtained from the generalized pre-trained model. The results significantly outperform many existing approaches.


  • [1] IDF diabetes atlas, 8th edn. Brussels, Belgium: International Diabetes Federation, 2017.
  • [2] Katrin Lunze, Tarunraj Singh, Marian Walter, Mathias D. Brendel, and Steffen Leonhardt. Blood glucose control algorithms for type 1 diabetic patients: A methodological review. Biomedical Signal Processing and Control, 8(2):107 – 119, 2013.
  • [3] Monika Reddy, Peter Pesl, Maria Xenou, Christofer Toumazou, Desmond Johnston, Pantelis Georgiou, Pau Herrero, and Nick Oliver. Clinical safety and feasibility of the advanced bolus calculator for type 1 diabetes based on case-based reasoning: A 6-week nonrandomized single-arm pilot study. Diabetes Technology & Therapeutics, 18(8):487–493, 2016. PMID: 27196358.
  • [4] Fei Jiang, Yong Jiang, Hui Zhi, Yi Dong, Hao Li, Sufeng Ma, Yilong Wang, Qiang Dong, Haipeng Shen, and Yongjun Wang. Artificial intelligence in healthcare: past, present and future. Stroke and vascular neurology, 2(4):230–243, 2017.
  • [5] William J Artman, Inbal Nahum-Shani, Tianshuang Wu, James R McKay, and Ashkan Ertefaie. Power analysis in a smart design: Sample size estimation for determining the best dynamic treatment regime. arXiv preprint arXiv:1804.04587, 2018.
  • [6] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
  • [7] M. Vettoretti, A. Facchinetti, G. Sparacino, and C. Cobelli. Type 1 diabetes patient decision simulator for in silico testing safety and effectiveness of insulin treatments. IEEE Transactions on Biomedical Engineering, pages 1–1, 2018.
  • [8] C. Dalla Man, F. Micheletto, D. Lv, M. Breton, B. Kovatchev, and C. Cobelli. The UVA/PADOVA type 1 diabetes simulator: New features. Journal of Diabetes Science and Technology, 8(1):26–34, 2014.
  • [9] Claudio Cobelli, Eric Renard, and Boris Kovatchev. Artificial pancreas: past, present, future. Diabetes, 60(11):2672–2682, 2011.
  • [10] Roman Hovorka. Closed-loop insulin delivery: from bench to clinical practice. Nature Reviews Endocrinology, 7(385), 2011.
  • [11] Steven J Russell, Firas H El-Khatib, Manasi Sinha, Kendra L Magyar, Katherine McKeon, Laura G Goergen, Courtney Balliro, Mallory A Hillard, David M Nathan, and Edward R Damiano. Outpatient glycemic control with a bionic pancreas in type 1 diabetes. New England Journal of Medicine, 371(4):313–325, 2014.
  • [12] Pau Herrero, Jorge Bondia, Nick Oliver, and Pantelis Georgiou. A coordinated control strategy for insulin and glucagon delivery in type 1 diabetes. Computer methods in biomechanics and biomedical engineering, 20(13):1474–1482, 2017.
  • [13] Richard M Bergenstal, Satish Garg, Stuart A Weinzimer, Bruce A Buckingham, Bruce W Bode, William V Tamborlane, and Francine R Kaufman. Safety of a hybrid closed-loop insulin delivery system in patients with type 1 diabetes. Jama, 316(13):1407–1408, 2016.
  • [14] Andrea Facchinetti. Continuous glucose monitoring sensors: Past, present and future algorithmic challenges. Sensors, 16(12), 2016.
  • [15] A. Haidar. The artificial pancreas: How closed-loop control is revolutionizing diabetes. IEEE Control Systems Magazine, 36(5):28–47, Oct 2016.
  • [16] K. Plis, R. Bunescu, C. Marling, J. Shubrook, and F. Schwartz. A machine learning approach to predicting blood glucose levels for diabetes management. In Modern Artificial Intelligence for Health Analytics: Papers from the AAAI-14 Workshop, 2014.
  • [17] H. N. Mhaskar, S. V. Pereverzyev, and M. D. van der Walt. A deep learning approach to diabetic blood glucose prediction. arXiv preprint arXiv:1707.05828, 2017.
  • [18] Carmen Pérez-Gandía, A Facchinetti, G Sparacino, C Cobelli, EJ Gómez, M Rigla, Alberto de Leiva, and ME Hernando. Artificial neural network algorithm for online glucose prediction from continuous glucose monitoring. Diabetes technology & therapeutics, 12(1):81–88, 2010.
  • [19] K. Li, J. Daniels, C. Liu, P. Herrero-Vinas, and P. Georgiou. Convolutional recurrent neural networks for glucose prediction. IEEE Journal of Biomedical and Health Informatics, pages 1–1, 2019.
  • [20] Jianwei Chen, Kezhi Li, Pau Herrero, Taiyu Zhu, and Pantelis Georgiou. Dilated recurrent neural network for short-time prediction of glucose concentration. In The 3rd International Workshop on Knowledge Discovery in Healthcare Data, IJCAI-ECAI 2018, Stockholm, Sweden, July 2018.
  • [21] Arthur Bertachi, Lyvia Biagi, Ivan Contreras, Ningsu Luo, and Josep Vehi. Prediction of blood glucose levels and nocturnal hypoglycemia using physiological models and artificial neural networks. In The 3rd International Workshop on Knowledge Discovery in Healthcare Data, IJCAI-ECAI 2018, Stockholm, Sweden, July 2018.
  • [22] Shiyu Chang, Yang Zhang, Wei Han, Mo Yu, Xiaoxiao Guo, Wei Tan, Xiaodong Cui, Michael Witbrock, Mark A Hasegawa-Johnson, and Thomas S Huang. Dilated recurrent neural networks. In Advances in Neural Information Processing Systems, pages 77–87, 2017.
  • [23] Phuong D. Ngo, Susan Wei, Anna Holubová, Jan Muzik, and Fred Godtliebsen. Control of blood glucose for type-1 diabetes by using reinforcement learning with feedforward algorithm. Computational and Mathematical Methods in Medicine, 2018(4091497):1–8, Dec. 2018.
  • [24] Qingnan Sun, Marko V. Jankovic, João Budzinski, Brett Moore, Peter Diem, Christoph Stettler, and Stavroula G. Mougiakakou. A dual mode adaptive basal-bolus advisor based on reinforcement learning. arXiv preprint arXiv:1901.01816, Jan. 2019.
  • [25] Tom Schaul, John Quan, Ioannis Antonoglou, and David Silver. Prioritized experience replay. arXiv preprint arXiv:1511.05952, 2015.
  • [26] Jinyu Xie. Simglucose v0.2.1 (2018) [online]. https://github.com/jxx123/simglucose, 2018.
  • [27] C Liu, PE Avari, N Oliver, P Georgiou, and P Herrero Vinas. Coordinating low-glucose insulin suspension and carbohydrate recommendation for hypoglycaemia minimization. In Diabetes technology & therapeutics, volume 21, pages A85–A85. Mary Ann Liebert, Inc 140 Huguenot Street, 3rd Floor New Rochelle, NY 10801 USA, 2019.
  • [28] Robert A Vigersky and Chantal McMahon. The relationship of hemoglobin a1c to time-in-range in patients with diabetes. Diabetes technology & therapeutics, 21(2):81–85, 2018.