A Model-Based Reinforcement Learning Approach for a Rare Disease Diagnostic Task.
Abstract
In this work, we present our contributions toward building a decision support tool for the diagnosis of rare diseases. Our goal is to reach a state of knowledge where the uncertainty about the patient’s disease is below a predetermined threshold, while minimizing the average number of medical tests to perform. In doing so, we take into account the need, in many medical applications, to avoid misdiagnosis as much as possible. To solve this optimization task, we investigate several reinforcement learning algorithms and make them operable in our high-dimensional, sparse-reward setting. We also present a way to combine expert knowledge, expressed as conditional probabilities, with real clinical data. This is crucial because the scarcity of data in the field of rare diseases prevents any approach based solely on clinical data. Finally, we show that it is possible to integrate the ontological information about symptoms while remaining within our probabilistic reasoning framework. This enables our decision support tool to process information given by the user at different levels of precision.
A Preprint
Rémi Besson
CMAP
École Polytechnique
Route de Saclay, 91128 Palaiseau
remi.besson@polytechnique.edu
Erwan Le Pennec
CMAP
École Polytechnique
Route de Saclay, 91128 Palaiseau
erwan.lepennec@polytechnique.edu
Stéphanie Allassonnière
School of Medicine
Paris-Descartes University
15 Rue de l’École de Médecine, 75006 Paris
stephanie.allassonniere@parisdescartes.fr
Julien Stirnemann
Necker-Enfants Malades Hospital
Paris-Descartes University
149 Rue De Sèvres, 75015 Paris
julien.stirnemann@nck.aphp.fr
Emmanuel Spaggiari
Necker-Enfants Malades Hospital
Paris-Descartes University
149 Rue De Sèvres, 75015 Paris
emmanuel.spaggiari@aphp.fr
Antoine Neuraz
Department of Medical Informatics
Necker-Enfants Malades Hospital
149 Rue De Sèvres, 75015 Paris
antoine.neuraz@aphp.fr
November 27, 2018
1 Introduction
1.1 Motivation.
During pregnancy, several fetal ultrasounds are performed to evaluate the anatomy, the growth and the well-being of the fetus. During each ultrasound examination, the physician performs standardized measurements such as nuchal translucency and biometry, as well as a predefined routine set of ultrasound planes of anatomical structures. Nevertheless, when an anomaly is detected, possibly related to a genetic disorder, there is no consensus on how to conduct the ultrasound in order to diagnose the disorder. This is a problem because there are many possible symptoms (around 200 in our case) that may be hard to detect, and many more possible diagnoses, while physicians do not have unlimited time to check them all.
In this work, we want to systematize the prenatal diagnostic procedure in order to help the practitioner make the diagnosis with high probability while minimizing the average number of questions, i.e., symptoms to check. To that end, we design an algorithm that proposes the most promising symptoms to check at each stage of the medical examination, that is, at each state of knowledge about the patient’s condition (this kind of algorithm is sometimes called a symptom checker in the literature), and that provides the probability of each possible disease. Ultimately, our algorithm has to be operable and interpretable online, at the bedside.
It should be noted that our approach applies to any problem aimed at establishing a symptom checker for rare diseases and is, of course, not limited to our case study to which we will refer throughout this paper: the prenatal diagnosis.
1.2 Available data and some dimensions of the problem.
The diseases we are interested in can be defined by a combination of symptoms, each with a different likelihood. The data is structured as a list of diseases with their estimated probability and, for each disease, a list of symptoms (which we will call associated symptoms) with an estimate of the probability of the symptom given the disease.
We write $B_i$ for the binary random variable indicating the presence of symptom $i$, and $D$ for the random variable disease.
We denote the diseases: $d_1, \dots, d_k$. We know $P[D = d_j]$ as well as $P[B_i = 1 \mid D = d_j]$. Note that the joint distribution of the symptoms given the disease is not available, but only the marginals. This issue will be addressed in section 3.2.
All this information, which we will refer to as expert data, has been provided by physicians of Necker hospital based on the available literature. We mapped all the symptoms found in the literature to the Human Phenotype Ontology (HPO), see [Khler2017TheHP]. HPO is a recent work which provides a standardized vocabulary of phenotypic abnormalities encountered in human disease. We used it to harmonize the terminology. We could then combine our curated list of symptoms per disease and map it to OrphaData (Orphanet, INSERM 1997, an online rare disease and orphan drug database, available at http://www.orpha.net, accessed 02/10/2018). OrphaData was useful to fill in the missing data on the prevalence of symptoms in the diseases. We restricted our analyses to the subset of symptoms that can be detected using fetal ultrasound. For these symptoms, we extracted the underlying tree structure of the ontology. By ontology we mean here that a given symptom can be described at different levels of precision: for example, "heart defect" is an ancestor of "Tetralogy of Fallot" (which is a specific cardiac abnormality). Our final decision support tool should handle such common medical reasoning (see section 4 for more details).
Currently, our database references diseases and different symptoms. The disease with the largest number of associated symptoms is VACTERL syndrome with possible symptoms.
We assume that a fetus presents only one disease at a time, which is a reasonable hypothesis in our study of rare diseases.
1.3 Main contributions:
In this work we present our contributions toward building a symptom checker for rare diseases. First, we propose a notion, novel as far as we know, of what a good symptom checker should be, taking into account the need in medicine to have a high level of confidence in the diagnosis made. This results in an original optimization formulation of the task of building a symptom checker. We found a way to break the dimension of our problem so as to make reinforcement learning algorithms tractable in this case.
We also detail how to build an architecture drawing on both expert and clinical data so as to cope with a common issue in medicine (and even more so when it comes to rare diseases): the small amount of available clinical data.
Finally we show that it is possible to incorporate the information of the symptoms ontology resulting in a much less rigid decision tool without computation explosion.
All code has been written in R. To ensure reproducibility, we made this code publicly available on GitHub.
2 The sequential decision making problem: a planning task
2.1 A Markov Decision Process framework
2.1.1 What we aim to optimize
Our sequential decision problem can be formulated in the Markov Decision Process framework. Let $\mathcal{S}$ be the state space. Using a ternary encoding, each symptom takes one of three values according to whether it is present, absent, or not yet observed. An element $s \in \mathcal{S}$ is a vector of length $n$ (the number of possible symptoms) that sums up our state of knowledge about the patient’s condition: the $i$-th ternary digit of $s$ encodes the information about the symptom whose identifier is $i$. Let $\mathcal{A}$ be the action space: $\mathcal{A} = \{1, \dots, n\}$. An action $a \in \mathcal{A}$ is a symptom that we suggest the obstetrician look for.
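As an illustration of this state representation, here is a minimal Python sketch. The digit codes (0 for not observed, 1 for absent, 2 for present), the zero-based symptom identifiers and all function names are assumptions made for this example only; they are not taken from the authors' R code.

```python
from typing import List, Tuple

# Hypothetical digit codes: the concrete values are an assumption of this sketch.
UNOBSERVED, ABSENT, PRESENT = 0, 1, 2

State = Tuple[int, ...]


def initial_state(n_symptoms: int) -> State:
    """State of knowledge before any symptom has been checked."""
    return (UNOBSERVED,) * n_symptoms


def observe(state: State, symptom: int, present: bool) -> State:
    """Return the new state after checking one not-yet-observed symptom."""
    if state[symptom] != UNOBSERVED:
        raise ValueError("symptom %d has already been observed" % symptom)
    updated = list(state)
    updated[symptom] = PRESENT if present else ABSENT
    return tuple(updated)


def available_actions(state: State) -> List[int]:
    """Actions are the symptoms that have not been checked yet."""
    return [i for i, value in enumerate(state) if value == UNOBSERVED]


def state_index(state: State) -> int:
    """Read the state as a base-3 integer (first symptom = most significant digit)."""
    index = 0
    for digit in state:
        index = index * 3 + digit
    return index
```

With n symptoms there are 3^n such states, which is why a lookup table indexed by `state_index` is only an option for small subproblems.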
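Our environment dynamics are clearly Markovian in the sense that: $P[S_{t+1} = s_{t+1} \mid S_t = s_t, A_t = a_t, \dots, S_0 = s_0, A_0 = a_0] = P[S_{t+1} = s_{t+1} \mid S_t = s_t, A_t = a_t]$, where $S_t$ (respectively $A_t$) is the state (resp. the action) visited (resp. taken) at time $t$.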
We aim to learn a diagnostic policy that associates each state of knowledge (list of presence/absence of symptoms) with an action to take (a symptom to check):
(1) $\pi : \mathcal{S} \to \mathcal{A}, \quad s \mapsto \pi(s)$
What should be a good diagnostic policy? Many medical applications consider a tradeoff between the cost of performing more medical tests (measuring it in time or money) and the cost of a misdiagnosis [DBLP:journals/corr/abs11092127], [Tang2016InquireAD], [Kao2018ContextAwareSC].
However, in our case the cost of performing more medical tests (i.e., checking more possible symptoms) is negligible compared with the potential cost of a misdiagnosis. In theory, the obstetrician has to check all possible symptoms to ensure the fetus does not present any disease. Therefore we will not take the risk of a misdiagnosis by trying to ask fewer questions. However, if the physician observes a sufficient number of symptoms, they can stop the ultrasound examination and perform additional tests, such as an amniocentesis, to confirm their hypotheses.
This is why we can label some states as terminal: they satisfy the condition that the entropy of the random variable disease is so low that we have no doubt on the diagnosis. In this setting, our goal is to minimize the average number of inquiries before reaching a terminal state:
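(2) $\pi^{*} = \arg\min_{\pi} \mathbb{E}_{P^{*}}[I \mid s_0, \pi]$
where $s_0$ is the initial state, $P^{*}$ the law of the environment currently used, $\pi$ the diagnostic policy, and $I$ is the random number of inquiries before reaching a terminal state, i.e.: $I = \inf\{t : H(D \mid S_t) \le \epsilon\}$
where $\epsilon$ is the predetermined threshold and $H(D \mid S_t)$ is the entropy of the random variable disease given what we know at time $t$: $H(D \mid S_t) = -\sum_{j} P[D = d_j \mid S_t] \log P[D = d_j \mid S_t]$, the sum running over the possible diseases $d_j$. We should think of $s_t$ as a realization of $S_t$: it is nothing more than the state we reached for one examination on a given patient, while $S_t$ is the associated random variable. For a given start state and a given policy there are many possible states that we can reach, since the answers are stochastic. Note that we are not guaranteed that $H(D \mid S_{t+1} = s_{t+1}) \le H(D \mid S_t = s_t)$ for all $s_t$, $s_{t+1}$. Nevertheless this inequality holds on average, $H(D \mid S_{t+1}) \le H(D \mid S_t)$, see theorem 2.6.5 of [Cover:2006:EIT:1146355], "information can’t hurt". In summary, when we consider that the entropy is sufficiently low and that we can stop and propose a diagnosis, we know that, on average, the uncertainty about the patient’s disease would not have increased if we had continued checking symptoms.
Setting the reward function as follows: $r(s_t, a_t) = -1$ for every inquiry made before reaching a terminal state, we can rewrite (2) in the classical form of an episodic reinforcement learning problem [Sutton:2018:IRL:551283]:
(3) $\pi^{*} = \arg\max_{\pi} \mathbb{E}_{P^{*}}\left[\sum_{t=0}^{I-1} r(S_t, A_t) \mid s_0, \pi\right]$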
In the RL community such a reward design is called the action-penalty representation, since the agent is penalized for every action that it executes [Barto1995LearningTA].
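The stopping rule and the action-penalty return can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, natural logarithms are an arbitrary choice, and the likelihood of the observations under each disease is assumed to be supplied by the environment model (section 3) rather than computed here.

```python
import math
from typing import Dict


def posterior(prior: Dict[str, float], likelihood: Dict[str, float]) -> Dict[str, float]:
    """Bayes rule: P(d | observations) is proportional to P(d) * P(observations | d)."""
    joint = {d: prior[d] * likelihood[d] for d in prior}
    total = sum(joint.values())
    if total == 0.0:
        raise ValueError("observations have zero probability under every disease")
    return {d: p / total for d, p in joint.items()}


def entropy(dist: Dict[str, float]) -> float:
    """Shannon entropy (in nats) of a distribution over diseases."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0.0)


def is_terminal(dist: Dict[str, float], threshold: float) -> bool:
    """A state is terminal when the entropy of the disease is below the threshold."""
    return entropy(dist) <= threshold


def episode_return(n_inquiries: int) -> float:
    """Action-penalty return: minus one for each inquiry made during the game."""
    return -float(n_inquiries)
```

Maximizing the expected value of `episode_return` is the same as minimizing the expected number of inquiries, which is the equivalence used to pass from the minimization form to the episodic RL form.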
2.1.2 Related works.
There are numerous relevant expert systems for the diagnosis of rare diseases (in particular in obstetrics), for example Orphamizer, see [KOHLER2009457] and [Khler2017TheHP]. Most of these expert systems take a list of observed symptoms as input and output a corresponding list of possible diseases. Nevertheless, we think that an algorithm operable during the medical examination would be more useful than an expert system designed for retrospective use. This is why we aim to propose, at each stage, the most interesting symptom to check.
Very few recent works try to address this issue. We reference [DBLP:journals/corr/abs11092127], which proposed an A* algorithm searching for a shortest path in a graph, but this kind of algorithm cannot cope with our high-dimensional problem.
Note that, in a certain sense, our problem can be likened to a decision tree optimization task where the features are the symptoms and the disease is the target. Indeed, a policy on an MDP is a generalization of a tree, a policy being less rigid in the sense that it can still propose the next feature to check when the physician has made a different choice from the one we proposed. Classic decision tree algorithms, see [Breiman1984ClassificationAR] or [Quinlan:1986:IDT:637962.637969], rely on optimizing an impurity function (the entropy or Gini index of the target random variable) in a greedy way and are therefore subject to the well-known horizon effect [Berliner:1973:NCM:1624775.1624786]. Recent works looking for global optimization procedures for decision trees, such as [Bertsimas:2017:OCT:3123655.3123731], can therefore be seen as relevant. However, once again, these algorithms using MIO (Mixed Integer Optimization) solvers cannot cope with our high-dimensional problem. Indeed, the complexity of such algorithms grows with both the number of data points and the maximal depth of the tree. In our case, we cannot easily restrict the maximum allowed tree depth since, in the worst case, the physician will not observe any symptom and will then have to check them all.
The more recent works of [Tang2016InquireAD] and [Kao2018ContextAwareSC] focus on this problem of building a symptom checker using reinforcement learning algorithms. Nevertheless, our approach is fairly different from these previous works, both in the way we formulate the objective (and hence in our reward design) and in the solutions that we propose (our ways to break the dimension).
They formulated their optimization problem as a tradeoff between asking fewer questions and making the right diagnosis, while we formulate it as the task of reaching as quickly as possible, on average, a predetermined high degree of certainty about the patient’s disease. In practice, in our case, the only parameter to be tuned is $\epsilon$, the degree of certainty we want at the end of the examination: we should stop when the entropy of the disease falls below this threshold. The smaller $\epsilon$, the more symptoms our algorithm will need before considering that the game ends.
[Tang2016InquireAD] makes use of a discount factor $\gamma$ in the design of their reward signal. The reward associated with each question is zero until a diagnosis is proposed (which is an additional possible action), at which point the reward is equal to $\gamma^{q}$ if the guess was correct and $0$ otherwise, $q$ being the number of questions asked before proposing the diagnosis. In this context, $\gamma$ governs the compromise between asking fewer questions and making the right diagnosis. The smaller $\gamma$, the more likely the algorithm is to make a wrong diagnosis by trying to ask fewer questions.
Note that [Tang2016InquireAD] has to run its learning algorithm for several different values of $\gamma$. On the contrary, we can determine which value of $\epsilon$ we should take before launching any learning algorithm. We can indeed interact with the physician, in a first step, by presenting a sample of states where our algorithm would propose a diagnosis. If the physician considers that the algorithm stops too early we should decrease $\epsilon$, otherwise we should increase $\epsilon$. This is an advantage since the main bottleneck in terms of computing time is the learning phase.
2.1.3 High-dimensional issues.
Our full model is of very high dimension (220) and thus a classical tabular approach is impossible. According to our experiments, classical deep Q-learning is also not numerically tractable. In order to break the dimension, we first capitalize on the fact that physicians use our algorithm mainly after seeing a first symptom. In such a case, we assume that this initial symptom is typical. A disease might also present a non-typical symptom, but this happens with a very low probability, which the clinicians consider negligible. In any case, we would then end up with a high entropy and no disease identification, which would lead us to switch to another strategy. With this assumption the dimension drops significantly: we now only consider the diseases for which this initial symptom is typical, and the only relevant symptoms are the ones which are typical of these remaining diseases.
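Therefore we created one task to solve for each possible initial symptom $i$:
$\pi_i^{*} = \arg\min_{\pi} \mathbb{E}_{P^{*}}[I \mid s^{(i)}, \pi]$
where $s^{(i)}$ denotes the state in which symptom $i$ is the only symptom observed so far and is present, $I$ is the number of inquiries before reaching a terminal state and $P^{*}$ is the law of the environment, as in (2).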
The dimensions of the different subproblems are displayed in figure 3. Fragmenting our problem this much has the advantage of giving us a very well optimized policy on several parts of our decision tree that would otherwise have been under-optimized (because these parts of the tree are not often visited). Of course, optimizing the parts of the tree that are not often visited is not very useful for reducing our overall loss function, but it is important to provide, in all cases, a reasonable proposal to the physician if we want him to have confidence in us. This approach forces us to choose a learning algorithm which can handle different subproblems without needing to tune too many hyperparameters.
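The construction of a subproblem from an initial symptom can be sketched as follows. The dictionary `typical`, mapping each disease to its set of typical symptoms, and the function name are hypothetical; the sketch only applies the rule stated above (keep the diseases for which the initial symptom is typical, then keep the symptoms typical of those diseases).

```python
from typing import Dict, Hashable, List, Set, Tuple


def build_subtask(
    typical: Dict[str, Set[Hashable]], initial_symptom: Hashable
) -> Tuple[List[str], list]:
    """Return (remaining diseases, remaining relevant symptoms) for one initial symptom."""
    # Diseases for which the initial symptom is typical.
    diseases = [d for d, symptoms in typical.items() if initial_symptom in symptoms]
    # Symptoms typical of at least one remaining disease, minus the one already observed.
    relevant = set().union(*(typical[d] for d in diseases)) - {initial_symptom}
    return sorted(diseases), sorted(relevant)
```

The length of the second list is the input size of the corresponding subproblem, which is the quantity used later to decide between a lookup table and a neural network.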
To cope with these high-dimensional issues, [Tang2016InquireAD] proposed in their first paper to learn a different policy for each of the anatomical parts they had previously built. As they recognized in their second paper [Kao2018ContextAwareSC], this approach is problematic. Indeed, a symptom may be related to several different anatomical parts. How should one choose which model to use when observing an initial symptom? In their first paper, when a patient gives an initial symptom, they choose the model with the best accuracy on their training set and follow this policy until the end of the process. Nevertheless, as they write in [Kao2018ContextAwareSC], it is possible that the target disease does not belong to the disease set of the chosen anatomical part. This is why they proposed to learn another policy, called the master model, which chooses at each step the most promising model (among the anatomical parts) to use.
2.2 Two different approaches to solve our problem.
In reinforcement learning, there exist several ways to solve problems such as the subtasks defined in section 2.1.3. If the dimension is small enough, it is possible to find the optimal solution explicitly using a dynamic programming algorithm [Sutton:2018:IRL:551283], for example the value iteration algorithm. If the number of states is too high, we have to parameterize the policy (policy-based approach) or the Q-values (value-based approach). We have investigated both approaches to solve our problem.
2.2.1 A policy-based approach with hand-crafted features as baseline:
In our application it can be interesting to propose to the user several symptoms to check, each with its corresponding score (the interest of checking it), instead of a single one. Indeed, physicians might be reluctant to use a decision support tool which does not leave them some freedom of choice. This is why we consider an energy-based formulation, a popular choice as in [pmlrv24heess12a]:
$\pi_\theta(s, a) = \frac{\exp(\theta^{\top} \phi(s, a))}{\sum_{b} \exp(\theta^{\top} \phi(s, b))}$
where $\pi_\theta(s, a)$ is the probability of taking action $a$ in state $s$, and $\phi(s, a)$ is a feature vector: a set of measures linked with the interest of taking action $a$ when we are in state $s$. To be more precise: $\phi(s, a) = \left( H(D \mid s) - \mathbb{E}[H(D \mid s, B_a)],\; P[B_a = 1 \mid s],\; \mathbb{1}_{a \in TS(s)} \right)$, where $B_a$ indicates the presence of symptom $a$, $TS(s)$ is the set of typical symptoms of the most likely disease at state $s$, and $H(D \mid s)$ is the entropy of the random variable disease at state $s$. In words, $\phi$ summarizes three reasonable ways to "play" our game:
(i) Ask the question that minimizes the expected entropy of the disease random variable. This is exactly the [Breiman1984ClassificationAR] way to play.
(ii) Ask the question for which the probability of a positive answer is maximal. This is specific to our game, where positive answers are much more informative than negative answers (it would not be the case in a classic question game).
(iii) Inquire about symptoms related to the currently most plausible disease.
These features have been identified as relevant measures in rare disease research by the physicians we are working with. They represent different ways of thinking and dilemmas faced during a medical examination: when I have observed a symptom, should I think about symptoms usually observed jointly, or should I think about the most plausible disease and look for the corresponding symptoms?
Note that this parameterized function is nothing more than a neural network without a hidden layer, built on hand-crafted features. When properly optimized, this policy outperforms, by construction, the classical decision tree algorithm [Breiman1984ClassificationAR].
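To make the "network without hidden layer" remark concrete, here is a minimal Python sketch of a softmax policy over linear scores. It assumes the feature vectors have already been computed for each candidate symptom; the function names and the example values are hypothetical.

```python
import math
from typing import Dict, List, Tuple


def softmax_policy(theta: List[float], features: Dict[int, List[float]]) -> Dict[int, float]:
    """Probability of proposing each candidate symptom, given its feature vector."""
    scores = {a: sum(t * f for t, f in zip(theta, phi)) for a, phi in features.items()}
    top = max(scores.values())  # subtract the maximum score for numerical stability
    weights = {a: math.exp(score - top) for a, score in scores.items()}
    total = sum(weights.values())
    return {a: w / total for a, w in weights.items()}


def ranked_suggestions(
    theta: List[float], features: Dict[int, List[float]]
) -> List[Tuple[int, float]]:
    """Candidate symptoms sorted by decreasing probability, to display several options."""
    probabilities = softmax_policy(theta, features)
    return sorted(probabilities.items(), key=lambda item: item[1], reverse=True)
```

Returning a ranked list rather than a single action matches the stated wish to leave the physician some freedom of choice.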
Our aim is to learn good parameters $\theta$ for each of our subproblems. This kind of optimization problem has been well studied by the reinforcement learning community, see [Konda1999ActorcriticA] or [Sutton1999PolicyGM] for the general analysis and [pmlrv24heess12a] for the particular energy-based case. We trained our policy using a REINFORCE algorithm [Williams1992SimpleSG]. Since we have broken the dimension and the number of parameters to learn is limited, this algorithm is perfectly suitable and exhibits performance similar to that of an Actor-Critic algorithm.
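For a softmax policy over linear scores, the REINFORCE update has a simple closed form: the gradient of the log-probability of the chosen action is its feature vector minus the policy-weighted average feature vector. The sketch below illustrates one such update without a baseline; the episode format and all names are assumptions of this example, not the authors' implementation.

```python
import math
from typing import Dict, List, Tuple

Features = Dict[int, List[float]]


def action_probabilities(theta: List[float], features: Features) -> Dict[int, float]:
    """Softmax over the linear scores theta . phi(s, a)."""
    scores = {a: sum(t * f for t, f in zip(theta, phi)) for a, phi in features.items()}
    top = max(scores.values())
    weights = {a: math.exp(score - top) for a, score in scores.items()}
    total = sum(weights.values())
    return {a: w / total for a, w in weights.items()}


def reinforce_update(
    theta: List[float],
    episode: List[Tuple[Features, int, float]],
    learning_rate: float,
) -> List[float]:
    """One REINFORCE step; each episode item is (features, chosen action, return from that step)."""
    new_theta = list(theta)
    for features, action, ret in episode:
        probs = action_probabilities(theta, features)
        for k in range(len(theta)):
            expected_k = sum(probs[b] * features[b][k] for b in features)
            # Gradient of log pi(action | state) with respect to theta[k].
            grad_k = features[action][k] - expected_k
            new_theta[k] += learning_rate * ret * grad_k
    return new_theta
```

Because returns are negative in the action-penalty setting, an action followed by a long game sees its score pushed down more strongly than an action followed by a short one.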
2.2.2 A value-based approach:
Training Deep Neural Networks
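We recall that the Q-values are defined as $Q^{\pi}(s, a) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{I-1} r(S_t, A_t) \mid S_0 = s, A_0 = a\right]$, namely the expected amount of reward when starting from state $s$, taking action $a$ and then following the policy $\pi$. The optimal Q-values $Q^{*}$ are defined as $Q^{*}(s, a) = \max_{\pi} Q^{\pi}(s, a)$ and satisfy the following Bellman equation: $Q^{*}(s, a) = \mathbb{E}_{s'}\left[r(s, a) + \max_{a'} Q^{*}(s', a') \mid s, a\right]$.
The optimal policy $\pi^{*}$ is directly derived from $Q^{*}$: $\pi^{*}(s) = \arg\max_{a} Q^{*}(s, a)$. Therefore we "only" need to evaluate $Q^{*}(s, a)$ for all $(s, a) \in \mathcal{S} \times \mathcal{A}$. This can be done by a value-iteration algorithm which uses the Bellman equation as an iterative update: $Q_{i+1}(s, a) = \mathbb{E}_{s'}\left[r(s, a) + \max_{a'} Q_i(s', a') \mid s, a\right]$. It is known, see [Sutton:2018:IRL:551283], that $Q_i \to Q^{*}$ when $i \to \infty$.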
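On a problem small enough to enumerate, the iterative Bellman update can be written directly. The sketch below is a toy illustration with an undiscounted reward of minus one per action; the transition table format, the state names and the function name are hypothetical.

```python
from typing import Dict, List, Set, Tuple

# transitions[state][action] = list of (probability, next_state)
Transitions = Dict[str, Dict[str, List[Tuple[float, str]]]]


def q_value_iteration(
    transitions: Transitions, terminal: Set[str], n_sweeps: int = 100
) -> Dict[str, Dict[str, float]]:
    """Tabular Q-value iteration with a reward of -1 for every action taken."""
    q = {s: {a: 0.0 for a in actions} for s, actions in transitions.items()}
    for _ in range(n_sweeps):
        new_q: Dict[str, Dict[str, float]] = {}
        for s, actions in transitions.items():
            new_q[s] = {}
            for a, outcomes in actions.items():
                value = 0.0
                for prob, nxt in outcomes:
                    # No future reward once a terminal state is reached.
                    future = 0.0 if nxt in terminal else max(q[nxt].values())
                    value += prob * (-1.0 + future)
                new_q[s][a] = value
        q = new_q
    return q


toy = {
    "start": {"good": [(1.0, "done")], "bad": [(1.0, "mid")]},
    "mid": {"check": [(1.0, "done")]},
}
q_star = q_value_iteration(toy, terminal={"done"})
```

In this toy example the Q-value of an action is minus the expected number of questions needed to finish after taking it, so the greedy policy picks the action that ends the game soonest.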
As the dimension of the problem is too high to store and evaluate all the Q-values, we parameterize them by a neural network: $Q(s, a; \theta) \approx Q^{*}(s, a)$.
The famous Deep Q-Network (DQN) algorithm proposed by [Mnih2013PlayingAW] made possible the use of neural networks to parameterize the Q-values (then called a Q-network) in the value iteration algorithm with function approximation. The Q-network, at iteration $i$, is trained by minimizing the loss function $L_i(\theta_i) = \mathbb{E}_{s, a}\left[(y_i - Q(s, a; \theta_i))^2\right]$ where $y_i = \mathbb{E}_{s'}\left[r(s, a) + \max_{a'} Q(s', a'; \theta_{i-1}) \mid s, a\right]$ is the target. This can be done by a standard backpropagation algorithm. In practice, to successfully combine deep learning with reinforcement learning, the main idea is to use experience replay to break the correlation between data: build a batch of experiences (transitions $(s_t, a_t, r_t, s_{t+1})$) from which one samples afterwards. Another trick is to freeze the target network during some iterations to overcome instability while learning.
By doing so, acting and learning are dissociated: the policy used to act (called the behavior policy) is different from the one learned from the transitions sampled in the replay memory (the target policy). In RL, this type of algorithm is called an off-policy method. Being off-policy is a desirable property for an RL algorithm, as the behavior policy can be designed to enforce exploration.
Figure 1 shows the general simplified scheme of the algorithms used in deep reinforcement learning: an agent interacts with its environment and collects data (transitions $(s_t, a_t, r_t, s_{t+1})$) which are added to the replay memory, from which we sample to train the target policy. Periodically the behavior policy is updated with the current learned policy. In our case we update the behavior policy as soon as we have made a gradient step.
Some remarks on the behavior policy
Our behavior policy is an $\epsilon$-greedy version of the current learned policy, in order to enforce exploration. We use this policy to simulate games, or more precisely transitions $(s_t, a_t, s_{t+1})$.
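A behavior policy of this kind, together with a replay memory, can be sketched as follows. The class and function names, the capacity handling and the use of an explicit random generator are assumptions of this illustration.

```python
import random
from collections import deque
from typing import Deque, Dict, List, Tuple

Transition = Tuple[tuple, int, float, tuple]  # (state, action, reward, next state)


class ReplayMemory:
    """Fixed-capacity buffer of transitions; the oldest ones are dropped first."""

    def __init__(self, capacity: int) -> None:
        self.buffer: Deque[Transition] = deque(maxlen=capacity)

    def add(self, transition: Transition) -> None:
        self.buffer.append(transition)

    def sample(self, batch_size: int, rng: random.Random) -> List[Transition]:
        """Uniform sample without replacement, capped at the current buffer size."""
        return rng.sample(list(self.buffer), min(batch_size, len(self.buffer)))


def epsilon_greedy(q_values: Dict[int, float], exploration_rate: float, rng: random.Random) -> int:
    """Pick a random action with probability exploration_rate, else the greedy one."""
    if rng.random() < exploration_rate:
        return rng.choice(sorted(q_values))
    return max(q_values, key=q_values.get)
```

Sampling uniformly from the buffer is what breaks the correlation between consecutive transitions of the same simulated game.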
For this purpose, we need a model of the environment: a transition model which tells us the probability of reaching a state $s'$ when taking action $a$ in state $s$. Our environment model is composed of the distribution of symptom combinations for each disease (see section 3). Namely, for each disease $d_j$ we store the probabilities of all the $2^{n_j}$ possible combinations of its typical symptoms, where $n_j$ is the number of typical symptoms of $d_j$. We add the assumption that a patient can also present non-typical symptoms, but with small probability and independently of the other symptoms (see section 3.5).
Then, to simulate transitions, we need to determine, for each disease, which symptoms of the current list are typical and which are not. This allows us to find the right combination to extract from the stored distribution. This computation is not cheap, especially when we add ontological considerations (see section 4). We can speed it up when we play games from the start to a terminal state: we remember which symptoms are typical for each disease and thus only have to determine whether the last symptom is typical or not.
Note that, at each stage of a game, we have to compute the probability of the symptom combination given each disease so as to determine whether we should stop or not. Another important observation is that computing the transition probability directly from an arbitrary state is as costly as playing an entire game incrementally (as previously described) from the initial state to that state.
These two observations should convince the reader that an asynchronous learning approach, as in [Mnih2016AsynchronousMF], would not be suitable for our problem. From a computational perspective, it is reasonable to play games from the start to a terminal state.
The update target: Temporal-difference and Monte-Carlo algorithms
A remaining question concerns the definition of the update target: should we use Monte-Carlo returns or bootstrap with an existing Q-function?
We recall (following [Sutton:2018:IRL:551283]) that an algorithm is a bootstrapping method if it bases its update in part on an existing estimate. This is the case of the Temporal-Difference (TD) algorithm, defined as:
$Q_{i+1}(s_t, a_t) = Q_i(s_t, a_t) + \alpha \left[ r_t + \max_{a} Q_i(s_{t+1}, a) - Q_i(s_t, a_t) \right]$
where we sampled the transition $(s_t, a_t, s_{t+1})$ using the current policy (in an $\epsilon$-greedy way in order to enforce exploration) and the environment model $P^{*}$. $Q_i$ is the estimate at iteration $i$, and $\alpha$ the learning rate. On the contrary, a Monte-Carlo method does not bootstrap:
$Q_{i+1}(s_t, a_t) = Q_i(s_t, a_t) + \alpha \left[ R_t - Q_i(s_t, a_t) \right]$
where $R_t$ is the reward we received from a simulated game.
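The difference between the two targets is easiest to see side by side. The following Python sketch uses hypothetical function names and an undiscounted setting; it is meant to show the structure of each update, not to reproduce the training code.

```python
from typing import List


def td_target(reward: float, next_q_values: List[float], next_is_terminal: bool) -> float:
    """Bootstrapped target: immediate reward plus the current estimate of what follows."""
    if next_is_terminal:
        return reward
    return reward + max(next_q_values)


def mc_target(rewards_to_go: List[float]) -> float:
    """Monte-Carlo target: the rewards actually collected until the end of the game."""
    return sum(rewards_to_go)


def tabular_update(old_estimate: float, target: float, learning_rate: float) -> float:
    """Move the current estimate a fraction of the way toward the target."""
    return old_estimate + learning_rate * (target - old_estimate)
```

The TD target depends on the quality of the existing estimates `next_q_values`, whereas the Monte-Carlo target depends only on the simulated game, at the price of a higher variance when games are long.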
It is not clear at first sight whether we should use a TD method or an MC method to compute the target. This question is the subject of a recent work [amiranashvili2018td], which shows that MC approaches can be a viable alternative to TD in the modern reinforcement learning era. Usually the TD method is seen as a better alternative than the MC method, which is often discarded because of the high variance of the return.
Nevertheless, our case study is specific: we face a finite-horizon task with a final reward, so the reward signal is not very informative before reaching a terminal state. In addition, for the subproblems of intermediate dimension, we are assured that games will not last too long and therefore that the variance of the return of the Monte-Carlo episodes is small.
We implemented both solutions, referred to as DQN-TD and DQN-MC. At each step of DQN-MC, we simulate, following the behavior policy, a batch of games starting from the initial state and stopping when they reach a terminal state. All the transitions $(s_t, a_t, s_{t+1})$ of all these games are annotated with the reward they received (the number of questions that were necessary to reach a terminal state during the game concerned) and added to the replay memory. We then sample transitions from this replay memory (one twentieth) and perform a gradient step with a backpropagation algorithm (we used the Keras library [chollet2015keras]).
Concerning the DQN algorithm with the TD method, we kept the main features of DQN-MC in order to facilitate their comparison. We play games, still with the behavior policy, and all the transitions $(s_t, a_t, s_{t+1})$ of all these games receive the immediate reward, whose value depends on whether $s_{t+1}$ is terminal or not. The learning rate is initialized with a lower value than in the DQN-MC algorithm, but it is decreased in exactly the same way in both cases: it is halved at regular intervals. Another difference is the frozen network we use as target in DQN-TD, which is not needed in DQN-MC. We update the frozen network periodically (we have also tried to update it less frequently but have not observed any major differences with the results presented here).
We compared these two algorithms, DQN-MC and DQN-TD, on several of our subtasks (see figures 5 and 5). We did not observe much difference on small and intermediate subproblems: both algorithms converge at the same speed towards solutions of the same quality. Nevertheless, DQN-TD appears much more sensitive to the learning rate. Indeed, as can be seen in figure 5, DQN-TD converges on the problem considered there for a sufficiently small initial learning rate, but diverges if the learning rate is chosen a little higher. On the contrary, DQN-MC converges for both initial learning rates, even if the returns of the algorithm are less stable with the higher one. These observations have to be combined with those of figure 5, on another problem with a different number of relevant symptoms to check and of possible diseases. In this case DQN-TD diverges with its initial learning rate, and reducing the learning rate does not change this fact. On the contrary, we do not need to reduce the initial learning rate of DQN-MC to make it converge to a good solution. Since we have to train as many neural networks as there are subtasks, we need a robust algorithm able to deal with different task complexities without changing all the hyperparameters.
This is why we chose to use DQN-MC instead of DQN-TD. It is indeed a well-known issue, sometimes referred to as the "deadly triad" [Sutton:2018:IRL:551283], that combining function approximation, off-policy learning and bootstrapping to compute the target (which is what the DQN-TD algorithm does) is not safe. We show that DQN-MC performs well on small and intermediate subtasks of our problem. The higher-dimensional tasks are harder to solve because the games are expected to last longer, which is a challenge both in terms of computing time and in terms of learning stability (higher variance of the return). To scale up to such problems, we break down the state space into a partition and bootstrap on already solved subtasks.
Solving higher dimension tasks by bootstrapping with already solved subtasks
We consider the set of symptoms related to a symptom $i$, i.e., the set of symptoms which are still relevant to check after observing the presence of symptom $i$. When this set is small enough, we can learn the optimal policy by a simple Q-learning algorithm with a lookup table, see [Sutton:2018:IRL:551283].
For problems of intermediate dimension we can use the DQN-MC algorithm, which performs well on these problems (see the experiments in section 2.3.2). For high-dimensional problems, using the DQN algorithm directly would be time-consuming. An easy way to accelerate the learning phase of these big networks is to make use of the smaller networks previously trained. Indeed, if $i$ is a symptom whose set of related symptoms is large, there must be some related symptom $j$ whose own set of related symptoms is small enough, so that the corresponding Q-values have already been computed or at least approximated. Put another way, when we try to learn the optimal Q-network of a given problem, we already know, for some inputs, the values that a quasi-optimal Q-network should output.
There are several ways to take advantage of these already optimized subtasks to optimize networks on larger tasks. A first idea would be to add to the replay memory of the larger task the replay memories of the already solved subtasks, after properly resizing the states. Recall that at each iteration, i.e., each gradient step, we sample transitions from the replay memory ($s_t$, $a_t$, $s_{t+1}$ and the reward received at the end of the game) to form the targets and the training set used to perform the backpropagation step. We can add to these sets some fixed transitions, the ones we already know (because they appear in subproblems already solved).
However, by doing so we would face several issues. First, when we train our neural network using the replay memory built by playing on the task concerned, we are assured that the transitions that populate our replay memory are present in a proportion equivalent to their probability of being encountered in the task. On the contrary, when we add fixed transitions from already solved subtasks to our replay memory, we might over-optimize our network on these subtasks. Put another way, the network would be over-optimized on parts of the decision tree which are not that frequently faced in practice.
Secondly, although the length of the episodes will have been reduced since using the subtasks replaymemories allows us to learn more quickly how to play at the end of the games, it will still be time consuming to play from the beginning until the end of the episodes for tasks of high dimension. The length of the episodes will also be an issue considering the variance of the MC returns.
Therefore a second idea is to learn a policy on the higher-dimensional task by bootstrapping on already solved subtasks. Namely, we play games starting from the initial state and bootstrap when reaching a state that belongs to an element of the partition for which an optimized network already exists. In practice, a function is called each time we receive a positive answer; it checks whether there already exists a network optimized for such a starting symptom. If this is the case, the current game is stopped and the corresponding optimized network is called to predict the average number of questions to ask to reach a terminal state. The main lines of the whole procedure are summarized in algorithm 1.
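The bootstrapping step can be illustrated with the short Python sketch below. It is not a transcription of algorithm 1: the trajectory format, the dictionary `solved_predictors` (mapping a starting symptom to a function that predicts the remaining number of questions) and the function name are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Observation = Tuple[int, bool]  # (symptom identifier, answer is positive)


def bootstrapped_return(
    trajectory: List[Observation],
    solved_predictors: Dict[int, Callable[[List[Observation]], float]],
) -> float:
    """Return of a simulated game, cut short as soon as a solved subtask is entered."""
    observations: List[Observation] = []
    for symptom, positive in trajectory:
        observations.append((symptom, positive))
        if positive and symptom in solved_predictors:
            # Stop the game and let the already optimized network
            # estimate how many questions remain.
            remaining = solved_predictors[symptom](observations)
            return -(len(observations) + remaining)
    # No solved subtask was reached: plain Monte-Carlo return.
    return -float(len(observations))
```

Cutting the game at the first solved subtask shortens the simulated episodes, which addresses both the computing time and the variance of the return mentioned above.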
Note that in doing so, we do not optimize the network for the entire task. It will therefore be necessary to change the neural network used for the recommendation during the examination when we change the space of the partition. The advantage is that we will not need to use a more complex architecture for this higher dimension task.
Finally, note that we learn the networks one after the other, so some optimization orders are preferable to others. At each step we choose to optimize the network with the highest rate of subproblems already solved (where each subtask is weighted by its probability of being faced).
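The ordering rule can be sketched as below. This is only an illustration under assumptions: the function name, the dictionary layout and the normalization of the weighted rate by the total weight of the subtasks are our choices here, and a task without any subtask is given a rate of 1.

```python
from typing import Dict, Hashable, Optional, Set


def next_task_to_optimize(
    subtask_probs: Dict[Hashable, Dict[Hashable, float]],
    solved: Set[Hashable],
) -> Optional[Hashable]:
    """Pick the unsolved task with the highest weighted rate of solved subtasks.

    subtask_probs[task][sub] is the probability of facing subtask `sub`
    while playing `task` (hypothetical data layout).
    """
    best_task, best_rate = None, -1.0
    for task, probs in subtask_probs.items():
        if task in solved:
            continue
        total = sum(probs.values())
        if total == 0:
            rate = 1.0  # assumption: no subtask means nothing is missing
        else:
            rate = sum(p for sub, p in probs.items() if sub in solved) / total
        if rate > best_rate:
            best_task, best_rate = task, rate
    return best_task
```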
Some remarks on the complexity of a task
When we described the several tasks, we focused mainly on the number of remaining relevant symptoms to check. This is the most important parameter since it is the input length of our network and thus determines the number of network parameters that we have to optimize.
Nevertheless, other parameters influence the complexity of a task. Let us mention the number of possible diseases and especially their probabilities. Indeed, if there are many possible symptoms to check and many possible diseases but one disease is much more plausible than the others, then the task is not so difficult. Another feature that can influence the complexity of a task is the number of symptoms which are typical of several of the possible diseases.
Thus, as it seems difficult to quantify the difficulty of a task, we should avoid judging the performance of our algorithms in an absolute way and should always compare them to more classical methods.
Finally, note that even what we call "task complexity" is not that easy to define. An idea would be to define the complexity of a task as the difference between the average number of questions that a random policy has to ask and the average number of questions that the optimal policy has to ask.
2.3 Numerical Results.
For all the experiments involving neural networks, we used the same architecture, detailed in table 1. We first use an embedding layer since the inputs processed by our neural network should not be treated as numerical values. We then use two hidden layers with ReLU activation and a final layer with linear activation which outputs the Q-values of the possible actions. The parameter of our stopping criterion is set to for all the experiments.
Name  Type  Input Size  Output Size 

L1  Embedding Layer  
L2  ReLu  
L3  ReLu  
L4  Linear 
2.3.1 Our baseline has quasi-optimal performance on small subproblems.
We can compare the performance of our policies optimized by a REINFORCE algorithm, with a classic decision tree algorithm [Breiman1984ClassificationAR] and also with the true optimal policies when it was possible to compute the latter, i.e. when the dimension was small enough.
Results on some of our subtasks are presented in figure 3. Our energy-based policy clearly outperforms a classic Breiman algorithm, all the more so as the dimension increases: the average number of questions to ask may be halved in some cases. On small subproblems where we were able to compute the optimal policy by dynamic programming, our energy-based policy is very close to the optimal policy.
2.3.2 DQN-MC algorithm vs our baseline.
We ran a DQN-MC algorithm on our subtasks. We expect this algorithm to find a better path than the energy-based policy of section 2.2.1 since a neural network has many more parameters and can therefore handle many more different situations than our baseline. Nevertheless, training such a high-dimensional function instead of the three parameters of our baseline has a cost. How many iterations does DQN-MC need to outperform our baseline?
We recall here that an iteration of the DQN-MC algorithm consists in playing games that are added to the replay memory; we then sample one twentieth of this replay memory and perform a backpropagation step. For comparison, our baseline has been trained with a REINFORCE algorithm, where each iteration consists in playing one game and performing a gradient ascent step accordingly; we stop the training phase when reaching iterations.
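The data handling of one DQN-MC iteration can be sketched as follows. This is a simplified illustration: `play_game` and `n_games` are hypothetical, each stored transition carries the reward received at the end of its game, and the backpropagation step on the sampled batch is left out.

```python
import random


def dqn_mc_iteration(replay_memory, play_game, n_games, rng):
    """Play games, store their transitions, and sample a training batch.

    play_game() is a hypothetical simulator call returning the list of
    (state, action) pairs visited and the reward received at the end.
    """
    for _ in range(n_games):
        visited, final_reward = play_game()
        # Monte Carlo target: every transition of the game is paired
        # with the reward received at the end of that game.
        replay_memory.extend((s, a, final_reward) for s, a in visited)
    # Sample one twentieth of the replay memory for the gradient step.
    batch_size = max(1, len(replay_memory) // 20)
    return rng.sample(replay_memory, batch_size)
```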
Figures 7 and 7 show, as expected, that the DQN-MC algorithm needs more simulated games than our baseline. Indeed in these two subtasks, DQN-MC needed respectively and iterations to reach our baseline, so and games instead of the which trained our baseline. In figure 7, for a subtask of dimension , we can see that the DQN-MC algorithm needs a reasonable number of games to outperform our baseline. In that case, the DQN algorithm found a very good diagnostic policy but did not reach the optimal policy; it is probably stuck in a local extremum (although we do use an exploration parameter).
In figure 7, the DQN algorithm seems to converge toward the baseline. This might be due to the fact that, in these tasks of intermediate dimension (it remains relevant symptoms and diseases), our baseline is already a good solution close to the optimal policy. Thus the DQN algorithm, which is not guaranteed to converge to the optimal policy, might get stuck in a local extremum at the level of the baseline.
These experiments can be run on a laptop without a GPU and should therefore be easily reproducible using our environment simulator or a similar one.
2.3.3 Bootstrapping on already solved subtasks helps (a lot) for high-dimensional tasks
In these experiments, we compare the performance of a simple DQN-MC algorithm against a DQN-MC-Bootstrap on some of our tasks. We used the same neural network architecture for both algorithms (see table 1). More broadly, the two algorithms use exactly the same hyperparameters, the only difference being the bootstrap trick of DQN-MC-Bootstrap.
Figures 8 and 5 show the benefits of bootstrapping on the solved subtasks. In both cases a simple DQN-MC is unable to find a good solution while a DQN-MC-Bootstrap quickly outperforms our baseline. Note that the neural network trained with DQN-MC-Bootstrap starts with a policy that is not that bad. This is valuable since it reduces, from the beginning of the training phase, the length of the episodes and hence the associated computing cost.
For the experiment of figure 8 there remain relevant symptoms to check, possible diseases including the disease "other", and subtasks have already been solved. Finally, the probabilities of presence of the initial symptom of each subtask given the initial symptom of the main task were (0.01; 0.44; 0.01; 0.15; 0.15; 0.01; 0.03; 0.02; 0.11; 0.01; 0.26; 0.01; 0.03; 0.01; 0.15; 0.01; 0.15; 0.24; 0.16; 0.06).
For the experiment of figure 5 there remain relevant symptoms to check, possible diseases including the disease "other", and subtasks have already been solved.
Finally we have been able to learn a good policy for the main task (2) where there remain relevant symptoms to check, possible diseases including the disease "other", and all the possible subtasks have already been solved. Our DQN-MC-Bootstrap algorithm starts with a good policy which only needs questions on average to reach a terminal state. A few training iterations allow it to improve until it needs questions to reach a terminal state. On the contrary, the DQN-MC that tries to solve this task from scratch has to ask questions, on average, to reach a terminal state and does not improve significantly during the iterations. We also evaluated the performance of the Breiman policy on the global task: it needs questions on average to reach a terminal state (with a variance of questions).
2.3.4 A qualitative analysis for a low-dimensional subtask
We analyze here the policy obtained by using a lookup-table value iteration algorithm on a small subtask (it remains relevant symptoms to check) in order to illustrate some of the dilemmas a medical doctor can face during an examination. We start with the presence of symptom . The three diseases which have symptom in their list of typical symptoms are displayed in table 2. We assume, for this experiment only, that the symptoms are conditionally independent given the disease. Another important piece of information is the prevalence of each disease: we have , and . Finally there is no ascendant/descendant relation between the symptoms considered in this example. The optimal strategy obtained induces a decision tree which is displayed in figure 9.
Disease 1  Disease 2  Disease 3  
Id Symptom  Probability  Id Symptom  Probability  Id Symptom  Probability 
1  0.50  6  0.90  2  0.90 
2  0.55  7  0.50  4  0.90 
3  0.50  9  0.90  6  0.50 
5  0.90  -  -  9  0.50
8  0.50  -  -  -  -
9  0.50  -  -  -  -
The first question is understandable: it asks about the most plausible symptom of the most plausible disease, the symptom . If the answer is positive, it continues with a symptom typical of the first disease which is not also typical of other diseases: the symptom . The combination of the presence of these two symptoms is sufficient to diagnose the disease . The rest of the tree is less obvious. For example when we get a "yes" for symptom and a "no" for symptom , should we continue asking about symptoms related to disease or should we switch to the symptoms typical of the other diseases? The path found chooses a symptom related to both disease and disease (symptom ), probably because it is then easy (and fast) to discard disease by asking a question about symptom (note that the disease has only related symptoms).
Another interesting part of the tree is when we receive a negative answer to our first question about the symptom . Then, the initially most plausible disease (the disease ) becomes less likely, but it is not clear whether its probability decreases so much that we should check symptoms of other diseases. In this case the optimal strategy is to switch to symptoms of disease , which has fewer typical symptoms and must be (in this part of the tree) more plausible than the disease .
We do not draw the entire decision tree for readability reasons; we wrote "…" for the leaves where the obtained diagnostic strategy still proposes to check more symptoms.
3 Learning a model of the environment.
3.1 The need to learn a model
As described in section 2 our agent will be trained while interacting with its environment. We focused until now on the planning task: optimizing the policy using observed transitions (initial state, action, reached state, reward). We should now detail how these transitions, these data, can be obtained. There exist several possible approaches in reinforcement learning:

Model-based RL: We first build a model of the environment in order to know how our environment will react to our actions. Then our agent is trained using experiences simulated from this model (planning task).

Model-free RL: We do not try to infer the environment dynamics; we just train our agent by trial and error, directly through interaction with the environment.
We cannot adopt a model-free RL approach for obvious reasons. Beyond ethical considerations (one would be using, at the beginning, an algorithm without any knowledge on real patients), a model-free architecture would need a very large amount of data and time to learn a good policy, especially considering the diversity of situations it will face. Domain knowledge can save us this time.
Therefore we will need to learn a model of the environment. This model learning phase can often be avoided. For example in adversarial games a popular solution is to use self-play. This is the case in recent advances of computer Go [Silver2017MasteringTG], which show that it is possible to achieve a superhuman level in a domain as challenging as Go without any domain knowledge, using only reinforcement learning with self-play. However, we are not in an adversarial game where we could learn from self-play. Another approach is to use expert demonstrations in order to estimate both rewards and environment dynamics [pmlrv51herman16] or to learn a policy directly [2013arXiv1307.3785T]. Expert demonstrations are often integrated as a supervised learning initialization step in the AI architecture as in [Silver2016MasteringTG]. We do not have such expert demonstrations and in any case, since we are interested in rare diseases, we would need a very large number of demonstrations to learn a good policy.
Taking into account the uncertainty in the transition model has been tackled by model-based Bayesian reinforcement learning theory [DBLP:journals/corr/GhavamzadehMPT16]. The main idea is to put a prior on the unknown transition probabilities and update them when observing transitions in the real world. This approach is not suitable in our case since we do not have a prior on transition probabilities but rather on symptom marginal distributions, which is a less classical form of prior (see section 1.2).
Generally speaking, our application area is characterized by its lack of data, which makes the environment dynamics very uncertain. This prevents us from designing our architecture without any domain knowledge.
We detail in the following section the model learning phase of our architecture (see figure 10), where we combine expert data with the data collected through the experience of the algorithm in order to build a sufficiently accurate model of the environment.
3.2 Transition model learning: from marginal to joint distributions.
3.2.1 Our approach: a tradeoff between expert and observations.
Our problem is that we only know the symptom marginal distribution given the disease and not their joint distribution. We have but need .
We do not want to make the assumption of conditional independence since we expect complex correlations between symptoms for a given disease. Note that the assumption of conditional independence would make it possible to present a disease without having any of the symptoms related to this disease in the database (when there is no such that ), which should be impossible.
We emphasize that the knowledge of does not give information regarding when conditional independence does not hold. We can imagine two symptoms that are individually very plausible but rarely occur together (or even never, in the case of incompatible symptoms such as microcephaly and macrocephaly). We chose to give values to so as to maximize the entropy of the distribution under the constraints given by the marginals. Indeed, we have to add information, but as little as possible, about what we do not know. This approach is called maxent (maximum entropy), see [JaynesInformationTA], [Cover:2006:EIT:1146355], [Berger:1996:MEA:234285.234289].
Our approach is Bayesian since it assumes knowing some properties of the distribution to be estimated (traditionally its mean, in our case its marginals) and looks for the maximal entropy distribution which satisfies these constraints. Note that without any additional constraint, the distribution of maximum entropy with fixed marginals is the independent one. However we can add some information about the structure of the desired distribution as constraints in our optimization problem. We consider it impossible to have a disease without at least a certain number of its associated symptoms: one, two or more depending on the disease. Indeed the diseases we are interested in manifest themselves as combinations of symptoms.
Moreover our algorithm not only relies on expert data but also uses data collected from its own experience. If we had enough data from direct experiments of the algorithm, we would not need expert data anymore. On the contrary without experimental data our model should rely entirely on expert data.
Let’s write
(4) 
the vector we aim to estimate: the symptom distribution of a disease with associated symptoms and its marginals. We propose to estimate it with the following optimization problem:
(5)  
where the constraint just states the classical probability measure constraints: respect of marginals and sum equal to one, we also add the constraint to set to symptoms combinations considered impossible. We have three terms:

First a log-likelihood term for experimental data: where is the ith combination of symptoms observed in real life. We aim at maximizing this quantity since we want our model to be coherent with what we observed. Symptom combinations observed in real life should be considered a little bit more plausible. Note that the log-likelihood of independent observations under model has a very simple form:
where is the number of times we have observed the jth symptom combination.
However we cannot just maximize the likelihood since we do not expect to have a sufficient amount of data to infer symptom distributions. Note that in the worst case when a disease has possible symptoms there are possible combinations of symptoms. This is far too many to infer the symptom distribution with a maximum likelihood approach, especially considering data scarcity.

This is why we add an entropy term, , in order not to consider impossible a symptom combination that has not yet been observed in real life.

The last term ensures that the marginals of our new distribution will not stray too far from our initial a priori given by expert data: . We recall that , and Note that each marginal does not have the same coefficient as we do not have the same confidence in all the expert data. In particular we can handle missing data, i.e when we do not know , by setting .
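To make the three terms concrete, the sketch below evaluates them for a candidate distribution over the combinations of binary symptoms. It is only an illustration under assumptions: the function names, the weight `eps` placed on the log-likelihood, the weights `lambdas` on the marginal penalties and the direction of the Kullback-Leibler terms are choices made here, and the exact weighting of equation (5) may differ. Combinations are indexed in the order produced by `itertools.product`.

```python
import itertools
import math


def log_likelihood(pi, counts):
    """Sum of n_j * log(pi_j); -inf if an observed combination has pi_j = 0."""
    total = 0.0
    for p, n in zip(pi, counts):
        if n > 0:
            if p == 0:
                return float("-inf")
            total += n * math.log(p)
    return total


def entropy(pi):
    """Shannon entropy, with the convention 0 * log(0) = 0."""
    return -sum(p * math.log(p) for p in pi if p > 0)


def marginals(pi, k):
    """Probability of presence of each of the k symptoms under pi."""
    out = [0.0] * k
    for p, combo in zip(pi, itertools.product([0, 1], repeat=k)):
        for i, present in enumerate(combo):
            out[i] += p * present
    return out


def bernoulli_kl(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q), for 0 < q < 1."""
    def term(a, b):
        return 0.0 if a == 0 else a * math.log(a / b)
    return term(p, q) + term(1 - p, 1 - q)


def objective(pi, counts, expert_marginals, lambdas, eps):
    """Assumed form: eps * log-likelihood + entropy - weighted KL penalties."""
    k = len(expert_marginals)
    penalty = sum(
        lam * bernoulli_kl(m, prior)
        for lam, m, prior in zip(lambdas, marginals(pi, k), expert_marginals)
    )
    return eps * log_likelihood(pi, counts) + entropy(pi) - penalty
```

Setting a weight in `lambdas` to zero removes the corresponding penalty, which is one way to handle a marginal for which no expert value is available.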
3.2.2 Existence/uniqueness of a solution and numerical considerations.
The function defined in equation (5) we aim to optimize is on the constraint space which is a compact set (since ), therefore admits a maximum in . As is concave (as a sum of concave functions) this maximum is unique and we can use the Kuhn-Tucker theorem, which ensures that maximizing our function under constraints can be achieved by looking for the saddle point of the Lagrangian. Differentiating the Lagrangian and setting the derivative to , we obtain the marginals as a function of the Lagrangian parameters . We write the Lagrangian parameters where stands for the constraint and each stands for the marginal constraint with respect to . If we have:
Note that if we indeed recover .
Moreover if we have:
(6) 
If we can not obtain a closed form for as function of and we have to solve the following equation:
A bisection (dichotomy) method is suitable for this task.
Readers familiar with maximum entropy theory should not be surprised by the form of equation (6). We recover a classical result, see for example [Berger:1996:MEA:234285.234289]: the maxent solution has a nice exponential form, a Gibbs distribution.
We use an Uzawa algorithm to reach the saddle point of the Lagrangian, see [uzawa1958imc]. Since is a concave function, we are guaranteed that the saddle point we converge to by Uzawa iterations is the global maximum of .
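The mechanics of an Uzawa iteration can be illustrated on a simplified version of the problem: pure maximum entropy with hard marginal constraints and a minimum number of present symptoms, without the likelihood and Kullback-Leibler terms of equation (5). This is a sketch under those assumptions, with hypothetical names and a fixed step size, not the solver used in our experiments. The primal step has the closed Gibbs form mentioned above; the dual step moves each multiplier along the violation of its marginal constraint.

```python
import itertools
import math


def maxent_uzawa(target_marginals, min_symptoms=1, step=1.0, n_iter=2000):
    """Maximum entropy distribution over binary symptom combinations.

    Constraints: fixed marginals, and zero probability for combinations
    with fewer than `min_symptoms` present symptoms.
    """
    k = len(target_marginals)
    combos = list(itertools.product([0, 1], repeat=k))
    allowed = [sum(c) >= min_symptoms for c in combos]
    mu = [0.0] * k  # one Lagrange multiplier per marginal constraint
    pi = []
    for _ in range(n_iter):
        # Primal step: the maximizer of the Lagrangian is a Gibbs
        # distribution, restricted to the allowed combinations.
        weights = [
            math.exp(-sum(m * s for m, s in zip(mu, c))) if ok else 0.0
            for c, ok in zip(combos, allowed)
        ]
        z = sum(weights)
        pi = [w / z for w in weights]
        # Dual step: update the multipliers along the constraint violation.
        current = [sum(p * c[i] for p, c in zip(pi, combos)) for i in range(k)]
        mu = [
            m + step * (cur - tgt)
            for m, cur, tgt in zip(mu, current, target_marginals)
        ]
    return combos, pi
```

For instance, `maxent_uzawa([0.5, 0.6, 0.7])` returns a distribution over the eight combinations of three symptoms in which the combination with no symptom has probability zero and the marginals match the targets up to numerical precision.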
3.2.3 Heuristics for parameters choice.
There are two kinds of parameters to choose: and , .
We could think that should decrease with , but as we have chosen not to renormalize the log-likelihood we have when goes to infinity. A parameter independent of seems an easy calibration which provides good results; it should just be chosen large enough to regularize the log-likelihood when is small (see the experiments in section 3.3.2).
However should depend on the number of unknown parameters of the distribution to be estimated: ( the number of typical symptoms). Indeed a disease with typical symptoms (i.e. possible symptom combinations) will need far more data than a disease with typical symptoms.
To calibrate as a function of , we should look at how the three different terms of (5) behave with . Roughly speaking the entropy term is of the order of (the maximal value is reached by the uniform distribution). The Kullback-Leibler penalization is linear in and appears comparable in scale to the entropy (see section 3.3.1).
As is usually done in log-likelihood regularization, we expect the log-likelihood to be of order : therefore, where is a nonnegative constant to be determined, seems a reasonable calibration. In practice, our AI will never have a sufficient amount of data and the maxent regularization will allow us to cope with new situations. We will take this into account when choosing .
Concerning parameters, the more confident we are in the higher is . We simply have to initialize with sufficiently large values in order to prevent the high-entropy condition from changing the marginals too much when is small, as we will see in section 3.3.1. Of course we should not fall into the opposite excess by taking too large, which would result in staying on the experts' a priori even when the data tell us otherwise.
3.2.4 Some previous works.
Building a decision support tool in medicine has been an objective since the beginning of the computer age. Many of these early works proposed rule-based expert systems, but in the 1980s an important part of the community investigated expert systems based on probabilistic reasoning [Pearl1989ProbabilisticRI]. Probabilities and Bayesian methods were seen as a good way to handle the uncertainty inherent to medical diagnosis.
The conditional independence assumption of symptoms given the disease has been extensively discussed as it is of crucial interest in terms of computational tractability. Some researchers considered this assumption harmless [Charniak1983TheBB] while others already proposed a maxent approach to address this issue [Hunter:1985:URU:3023810.3023813], [DBLP:journals/corr/abs13043423] or [DBLP:journals/corr/abs13041104].
Nevertheless it seems that none of the works of that time has ever considered the experts vs observations tradeoff we face. In the survey [DBLP:journals/kbs/Jirousek90] it is clearly mentioned that these methods only handle input data of probabilistic form. Namely they assume to have an a priori on marginals but also on some of the possible probabilities combinations (in our case we would assume to have an a priori on for example) and propose a maxent approach where these input data are treated as constraints in the optimization process. Once again this is not our case since we just have the marginals and some experimental data. This area of research was very active in the 80s and then gradually disappeared, probably due to computational intractability of the proposed algorithms.
Estimating a joint distribution from marginals is another very old problem, not necessarily related to AI, known in the literature as the cell probabilities estimation problem in contingency tables with fixed marginals (the book [Bishop75discretemultivariate] provides a good overview of this field). We can trace this problem back to the work of [deming1940], which makes use of assumed known marginal values and experimental samples to try to estimate the joint distribution. They proposed an iterative proportional fitting procedure (IPFP), a particularly popular algorithm, to solve this problem.
An important assumption of [deming1940] is that each cell of the contingency table receives data. In [Ireland1968ContingencyTW] the authors proved that the asymptotic estimator obtained by the IPFP algorithm minimizes the Kullback-Leibler divergence with respect to the empirical distribution under the marginal constraints.
However an IPFP algorithm would not be suitable for our problem for two main reasons: first, we do not have absolute confidence in the marginals given by experts, and secondly, we are interested in rare diseases so we do not expect to have a sufficient amount of data. As a matter of fact, many cells will not receive data, and it would be disastrous in our application to assign a zero probability estimate to the corresponding symptom combinations.
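The second issue can be seen on a toy example. The sketch below is a textbook two-dimensional IPFP (alternately rescaling rows and columns to match target marginals), written for illustration only: the function name and the small table are ours, and it assumes that no row or column of the input is entirely empty. Since every update is a multiplication, a cell that received no data stays at zero forever.

```python
def ipfp(table, row_targets, col_targets, n_iter=200):
    """Iterative proportional fitting on a 2D table of counts.

    Alternately rescales rows and columns so that their sums match the
    target marginals. Assumes every row and column has a nonzero sum.
    """
    est = [[float(x) for x in row] for row in table]
    n_rows, n_cols = len(est), len(est[0])
    for _ in range(n_iter):
        for i in range(n_rows):
            scale = row_targets[i] / sum(est[i])
            est[i] = [x * scale for x in est[i]]
        for j in range(n_cols):
            scale = col_targets[j] / sum(est[i][j] for i in range(n_rows))
            for i in range(n_rows):
                est[i][j] *= scale
    return est
```

With the counts `[[3, 0], [1, 2]]` and target marginals `[0.5, 0.5]` (rows) and `[0.6, 0.4]` (columns), the empty cell keeps probability zero however many iterations are run.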
We should also mention works related to our problem in applications of statistics to social sciences, where researchers aim to build a synthetic population with marginals coming from several inconsistent sources [Barthelemy2013SyntheticPG]. To be more precise, they have data at an aggregated level (at the level of the country for example) and need disaggregated data (at the level of a household, say). They also proposed a maxent approach but do not exactly face an expert/experience trade-off since they build their model without samples. Their algorithm is different essentially because they add constraints to obtain an integer solution, which we believe could be avoided.
3.3 Some experiments
3.3.1 Maxent with Kullback penalization on marginals a priori.
Let us start by looking at what happens when we run a maxent with Kullback-Leibler regularization on the marginals, i.e. we exclude the likelihood for this synthetic experiment. Namely we are interested in a vector defined as follows:
We set , with the following a priori on marginals
For this experiment, we increase and look at how it affects the marginals’ estimates. We can see in figure 12 that all the marginals’ estimators start with value and then decrease or increase monotonically toward their a priori. This is not surprising since maxent tends to spread weight over the entire distribution, making the parameters of the Bernoulli marginal distributions closer to . In our case we have marginals equal to since we force combinations with less than one symptom to have zero probability. This gives us an idea of how should be initialized.
3.3.2 Adding data.
We simulated a symptom combination distribution (with associated symptoms) using Poisson distribution of parameter . The estimate solution of (5) given by our Uzawa algorithm has been sequentially updated using data sequentially simulated from . For the a priori on marginals, we used the real marginals with an additive Gaussian noise of zero mean and variance. The measure of interest is the KullbackLeibler divergence between the real distribution and our estimate solution of (5): which we would like to minimize. We are interested in how the choice of affects our estimation of the real distribution. To cope with inherent randomness of this process, an average estimate of the KullbackLeibler divergence was obtained over 50 repetitions of the same procedure (i.e we simulated Poisson distributions for each different values of ).
In Figure 12, the red () and black () curves clearly show that giving too much weight to the data leads to overweighting the symptom combinations observed in real life and keeps us far from the real distribution: we do not sufficiently regularize with the entropy. On the contrary the green () and the orange () curves achieve a good maxent/maximum likelihood trade-off. is a more cautious choice (we underweight experimental data) than and as a consequence the procedure converges more slowly to the real distribution.
Note that an empirical estimate (solution of a maximum likelihood approach) or an IPFP algorithm would perform very poorly on this task. Indeed many symptom combinations would be estimated to have zero probability when they should not, because of data scarcity: we have variables and fewer than data points. We have not plotted the Kullback-Leibler divergence of these estimates with respect to the real distribution since it is infinite. In contrast our approach appears robust to data scarcity, provided that we take care of the value of .
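The reason the divergence is infinite is easy to check numerically. The following sketch (with made-up numbers, for illustration only) computes the Kullback-Leibler divergence from a real distribution to an estimate; as soon as the estimate gives zero probability to a combination that the real distribution can produce, the result is infinite.

```python
import math


def kl_divergence(real, estimate):
    """KL(real || estimate), with the convention 0 * log(0 / q) = 0."""
    total = 0.0
    for p, q in zip(real, estimate):
        if p > 0:
            if q == 0:
                # The estimate rules out an event that can actually occur.
                return float("inf")
            total += p * math.log(p / q)
    return total


# Hypothetical example: four symptom combinations, three observations.
real = [0.4, 0.3, 0.2, 0.1]
empirical = [2 / 3, 1 / 3, 0.0, 0.0]  # two combinations never observed
print(kl_divergence(real, empirical))   # infinite
print(kl_divergence(real, [0.25] * 4))  # finite: the uniform estimate has full support
```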
3.4 Highdimensional issues
3.4.1 Explosion of the dimension of symptoms distributions
We are able to estimate the symptom combination distribution ( of formula (4)) of each disease provided that we can store this vector (i.e. is small enough). Note that we actually need a larger vector, as our algorithm processes the information collected by the physician sequentially. For example, we will need the probability, which is not in if . There are two possible solutions for a disease with typical symptoms: i) to store the bigger vector of dimension since we would need to code in ternary to include the information "not seen yet" relative to a given symptom; or ii) to store the smaller vector of dimension and compute, on the fly, the desired symptom combination probabilities from the available ones. As we will use our environment model intensively for training our AI, we should prefer the first solution as much as possible.
However, it clearly appears that we will not be able to compute or store the distribution of symptom combinations for all the diseases. Indeed when a disease has a large number of symptoms, the dimension of the vector we aim to estimate explodes: .
To cope with this issue, we will use the available ontological information about symptoms, i.e. the fact that a symptom can be described at different levels of precision, and make less stringent assumptions about the dependence between symptoms (see section 4.3).
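The two storage options can be sketched as follows, with hypothetical function names and the joint distribution stored as a list indexed in the order of `itertools.product`. The first function sums out, on the fly, the symptoms that have not been seen yet (option ii); the second precomputes every partially observed state in ternary coding (option i), which for k symptoms gives 3**k entries instead of 2**k.

```python
import itertools


def partial_probability(pi, k, observed):
    """Probability of a partial observation, summing out unseen symptoms.

    pi: joint distribution over the 2**k combinations of k symptoms.
    observed: dict mapping a symptom index to 0 (absent) or 1 (present).
    """
    total = 0.0
    for idx, combo in enumerate(itertools.product([0, 1], repeat=k)):
        if all(combo[i] == v for i, v in observed.items()):
            total += pi[idx]
    return total


def ternary_table(pi, k):
    """Precompute all 3**k partial states (2 encodes "not seen yet")."""
    table = {}
    for state in itertools.product([0, 1, 2], repeat=k):
        observed = {i: v for i, v in enumerate(state) if v != 2}
        table[state] = partial_probability(pi, k, observed)
    return table
```

The table trades memory for speed: a lookup replaces a sum over up to 2**k terms each time the environment model is queried.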
3.5 Relaxing the model to face potential database default
So far, our model relies heavily on the assumption that expert data give an exhaustive representation of each disease. If a symptom has been forgotten for a disease in our expert data list, we would not be able to recover the disease.
That is the reason why we make the assumption that a nontypical symptom (i.e. a symptom that has not been associated with the disease in the expert data) may be observed in a patient with disease , but with a small () probability and independently of other symptoms.
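Under this independence assumption, the likelihood of an observation factorizes into the probability of the typical symptoms and one factor per nontypical symptom checked. A minimal sketch, with a hypothetical function name and `eta` standing for the small probability above:

```python
def relaxed_likelihood(typical_prob, n_nontypical_present, n_nontypical_absent, eta):
    """Likelihood of an observation under the relaxed model.

    typical_prob: probability of the observed typical symptoms given the disease.
    eta: small probability that a nontypical symptom is present (assumed
         identical for all nontypical symptoms in this sketch).
    """
    if not 0.0 <= eta <= 1.0:
        raise ValueError("eta must be a probability")
    return (
        typical_prob
        * eta ** n_nontypical_present
        * (1.0 - eta) ** n_nontypical_absent
    )
```

A disease is thus penalized, but no longer excluded, when a symptom missing from its expert list is observed.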
4 Integrating the ontological information
4.1 Why we need to be concerned about the ontology of symptoms
So far we described the diseases as combinations, more or less plausible, of symptoms. We designed algorithms inquiring about symptoms so as to find the right disease while minimizing the average number of questions to ask. We have seen that in order to learn a good strategy, we need to learn a model of the environment, i.e to learn the symptom combination distributions given the disease.
Nevertheless a decision support tool built in this way will suffer from several issues. Indeed, symptoms in medicine can be described at several levels of precision. A concrete example for the abnormality "hypoplasia of the right ventricle" is displayed in figure 13 (the terms range from the least precise to the most precise).
In medicine, this kind of tree, called an ontology, is commonly used to represent the knowledge on symptom hierarchy. These ontologies are built in order to capture the structure and relations between symptoms. Medical ontologies can bear an (almost) infinite level of precision.
A naive decision support tool, i.e. one which would not include the available ontological information, could ask a physician irrelevant questions. For example, it is perfectly possible that our decision support tool would advise looking for a hypoplasia of the right ventricle when the physician has already mentioned that there is no morphological abnormality of the heart. It is this kind of nonsense that we aim to eliminate in this section.
Furthermore our decision support tool appears, at the moment, too rigid. Indeed, we could ask for a "hypoplasia of the right ventricle" when the physician could not give us such precise information but rather a less precise statement like "there is an abnormality of the cardiac ventricle". We should be able to deal with such imprecise answers and thus give the physician more freedom when interacting with our decision support tool while avoiding an explosion in computing time. Once again the use of the ontology will allow us such an improvement.
4.2 A less rigid decision support tool without computation explosion
Each symptom of our initial database has been mapped to the HPO database. We were then able to extract the underlying tree structure linking the different HPO codes. To be more precise, we know for each HPO code (a given description of a symptom) all its descendants (more precise descriptions of this symptom) as well as all its ascendants (less precise descriptions of this symptom).
4.2.1 The idea
As previously explained, we aim at giving more freedom to the users when describing the symptoms they observed, giving them the possibility to describe the symptoms at different level of precision.
Then instead of giving answers at a given level of precision (our initial list of symptoms), we now allow the physician to choose any of the HPO codes. This causes an explosion in the number of possible symptoms: our former list of symptoms references some signs whereas the HPO ontology has around . Neither our way of modeling the symptom combination distributions nor our learning algorithms would be able to cope with such an explosion.
Theoretically each patient could be unique if their symptoms are described at a sufficient level of accuracy. Nevertheless, when we list the typical symptoms of a disease, we try to generalize and find patterns in patient profiles. The idea is then to still model the symptom combination distributions with our initial database (the one with symptoms), preserving the generalization ability of our algorithms. We will still propose symptoms to check at the level of precision of the initial database but allow the user to give answers at a different level of precision (any HPO code can be chosen). By proceeding in this way we obtain a less rigid decision support tool without computation explosion since all computations are done at the initial level of precision.
To this end, we need a function translating the imprecise information received (the HPO code) into usable information (presence/absence of symptoms at our level of precision). Such a function involves both deterministic and stochastic rules.
Deterministic rules
The function mapping each HPO code to the associated usable information relies on some automatic (deterministic) rules, namely:

If we received a positive answer for a given HPO code, all its ascendants should be given a positive answer too.

If we received a negative answer for a given HPO code, all its descendants should be given a negative answer too.
In practice, during the medical examination, we store all the information given about the HPO codes selected by the user. To compute the probability of each disease we need to check, for each HPO code and each disease, whether this HPO code is in the list of symptoms related to this disease. If not, we have to check whether ascendants or descendants of this HPO code are in this list. Following our two deterministic rules, if the HPO code was declared present we have to check whether its ascendants are in the list, and if it was declared absent we have to check its descendants.
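The two deterministic rules and the lookup against a disease's symptom list can be sketched as follows. This is a minimal illustration, not the implementation used in this work: the toy ontology `PARENTS`, the code names and the helper functions are all hypothetical.

```python
from typing import Dict, List, Set

# Hypothetical toy ontology: each code maps to its direct parents.
PARENTS: Dict[str, List[str]] = {
    "abnormal_heart_morphology": [],
    "abnormal_cardiac_ventricle": ["abnormal_heart_morphology"],
    "hypoplasia_right_ventricle": ["abnormal_cardiac_ventricle"],
    "hypoplasia_left_ventricle": ["abnormal_cardiac_ventricle"],
}


def ascendants(code: str) -> Set[str]:
    """All less precise descriptions of `code` (transitive parents)."""
    found: Set[str] = set()
    stack = list(PARENTS[code])
    while stack:
        parent = stack.pop()
        if parent not in found:
            found.add(parent)
            stack.extend(PARENTS[parent])
    return found


def descendants(code: str) -> Set[str]:
    """All more precise descriptions of `code`."""
    return {other for other in PARENTS if code in ascendants(other)}


def propagate(code: str, present: bool) -> Dict[str, bool]:
    """Apply the two deterministic rules to a single answer.

    A positive answer is copied to all ascendants,
    a negative answer is copied to all descendants.
    """
    implied = ascendants(code) if present else descendants(code)
    answers = {code: present}
    answers.update({other: present for other in implied})
    return answers


def evidence_for_disease(
    answers: Dict[str, bool], disease_symptoms: Set[str]
) -> Dict[str, bool]:
    """Keep only the answers that concern symptoms listed for the disease."""
    return {c: v for c, v in answers.items() if c in disease_symptoms}
```

For instance, declaring `abnormal_heart_morphology` absent marks every code below it as absent, so a disease whose list contains only `hypoplasia_right_ventricle` directly receives a negative answer for that symptom.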
If the HPO code satisfies both of the following assertions, it can be considered non-typical and treated accordingly (namely, its presence is unlikely, as in section 3.5):

The HPO code is not in the list of symptoms related to the disease.

It is present and its ascendants/descendants are not in the list or it is absent and its ascendants are not in the list.
Note that the second point involves a relation that we have not studied so far. Indeed, what happens if we observe the presence of an abnormality whose HPO code is not in the list of symptoms of the disease but which has descendants in the list? This issue is studied in the next section.
Stochastic rules
Let us assume that we have observed the presence of an “abnormal heart morphology” but that the disease we are interested in only has “hypoplasia of the right ventricle” in its list of typical symptoms. How can we take such imprecise information into account? We need stochastic rules to address this issue.
When we are told that an HPO code is present, we have to determine which of its descendants are in the list of symptoms of our first database (the one we use to build our environment model). All these symptoms have a probability of occurrence (given what we have already observed) that we are able to compute.
Indeed, consider a list of symptoms that either have no descendant in our initial database or are absent. Let us then assume that we observed the presence of a first symptom that has three potential descendants in this database, and the presence of a second symptom that has two potential descendants.
Several combinations of present descendants are then possible and, without any additional assumption, their number could be large. This is why we assume that, for each imprecise answer, only one descendant is present at a time. Our function first computes the probability of each of the resulting precise combinations of symptoms.
This is just a matter of finding, for each disease, which symptoms of the list are typical, using the deterministic rules where necessary.
We can then compute and display the probability of each disease in the fuzzy state, i.e. the state associated with these possible precise states.
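One possible reading of this stochastic rule is sketched below on a toy example. Everything in it is an assumption made for illustration: the symptom names (`S11`, ..., `S22`), the two diseases, the numbers, and above all the placeholder independence model in `likelihood`, which stands in for the symptom combination model of the environment. Under the assumption that only one descendant is present at a time, the precise states compatible with the two imprecise answers are mutually exclusive, so the likelihood of the fuzzy state is the sum of their likelihoods.

```python
from itertools import product
from typing import Dict, List

# Hypothetical numbers: prior over two diseases and, for each disease,
# the probability that each precise symptom is present.
PRIOR: Dict[str, float] = {"D1": 0.5, "D2": 0.5}
P_PRESENT: Dict[str, Dict[str, float]] = {
    "D1": {"S11": 0.6, "S12": 0.1, "S13": 0.1, "S21": 0.5, "S22": 0.2},
    "D2": {"S11": 0.1, "S12": 0.3, "S13": 0.3, "S21": 0.1, "S22": 0.4},
}
# Descendants of the first and of the second imprecise answer.
GROUPS: List[List[str]] = [["S11", "S12", "S13"], ["S21", "S22"]]


def likelihood(state: Dict[str, bool], disease: str) -> float:
    """P[state | disease] under a placeholder independence model."""
    prob = 1.0
    for symptom, present in state.items():
        p = P_PRESENT[disease][symptom]
        prob *= p if present else 1.0 - p
    return prob


def possible_states() -> List[Dict[str, bool]]:
    """One precise state per choice of a single present descendant per group."""
    states = []
    for chosen in product(*GROUPS):
        states.append({s: (s in chosen) for group in GROUPS for s in group})
    return states


def fuzzy_posterior() -> Dict[str, float]:
    """P[disease | fuzzy state], summing over the mutually exclusive states."""
    joint = {
        d: PRIOR[d] * sum(likelihood(s, d) for s in possible_states())
        for d in PRIOR
    }
    total = sum(joint.values())
    return {d: v / total for d, v in joint.items()}
```

With three and two potential descendants, the single-descendant assumption leaves 3 × 2 = 6 precise states to enumerate. If instead any non-empty subset of descendants could be present, there would be (2^3 − 1) × (2^2 − 1) = 21 of them, and this count grows exponentially with the number of descendants.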
4.2.2 Optimize the strategy on the leaves of the ontological tree and then go back up
Our stochastic rule can be expensive in terms of computing resources, while it is crucial for us to be able to interact quickly with our environment when training our agent. Therefore, the idea is to optimize the subtasks that start from symptoms which do not have any descendants in our database (the leaves). In this way, we do not have to use our stochastic rule while training the neural networks.
It is moreover easy to derive the strategy to follow when we receive an imprecise answer during an examination. We denote by $\tilde{s}$ the fuzzy state and by $s_1, \dots, s_n$ the associated possible states. We can compute the Q-values in this fuzzy state by averaging the Q-values over the possible states:
$$Q(\tilde{s}, a) = \sum_{i=1}^{n} \mathbb{P}[s_i \mid \tilde{s}] \, Q(s_i, a) \tag{7}$$
In practice, when receiving an imprecise answer, our algorithm should always ask the physician whether a more precise answer can be provided. If not, a computation such as (7) has to be performed in real time during the examination. This computation should not last more than a second; otherwise, we can consider that the provided information was not precise enough and can be ignored.
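The averaging of Q-values over the possible states can be sketched as follows. This is only an illustration: it assumes that the average is weighted by the probability of each possible state given the fuzzy state, and the function names and the representation of Q-values as dictionaries are hypothetical.

```python
from typing import Dict, List


def fuzzy_q_values(
    state_probs: List[float], q_values: List[Dict[str, float]]
) -> Dict[str, float]:
    """Average the Q-values of the possible states.

    `state_probs[i]` is the (possibly unnormalized) probability of the i-th
    possible state and `q_values[i]` maps each action to its Q-value there.
    """
    total = sum(state_probs)
    weights = [p / total for p in state_probs]
    actions = q_values[0].keys()
    return {
        a: sum(w * q[a] for w, q in zip(weights, q_values)) for a in actions
    }


def best_action(
    state_probs: List[float], q_values: List[Dict[str, float]]
) -> str:
    """Greedy action in the fuzzy state."""
    averaged = fuzzy_q_values(state_probs, q_values)
    return max(averaged, key=averaged.get)
```

The cost is one Q-value evaluation per possible state, which is why the number of possible states has to stay small for the computation to fit in the one-second budget.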
To avoid using the stochastic rules while training our agent, we also need to remove all the actions that have descendants and replace them by their leaves. In future work, it would be interesting to allow different levels of precision for the actions suggested by the neural network.
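Replacing each action by its leaves can be sketched as follows, assuming a hypothetical `CHILDREN` mapping that lists, for each symptom of the database, its direct descendants within the database.

```python
from typing import Dict, List

# Hypothetical database: direct descendants of each symptom (empty for leaves).
CHILDREN: Dict[str, List[str]] = {
    "abnormal_cardiac_ventricle": [
        "hypoplasia_right_ventricle",
        "hypoplasia_left_ventricle",
    ],
    "hypoplasia_right_ventricle": [],
    "hypoplasia_left_ventricle": [],
    "renal_agenesis": [],
}


def leaves(action: str) -> List[str]:
    """Leaves of the sub-tree rooted at `action` (the action itself if it is a leaf)."""
    kids = CHILDREN[action]
    if not kids:
        return [action]
    result: List[str] = []
    for kid in kids:
        result.extend(leaves(kid))
    return result


def leaf_action_set(actions: List[str]) -> List[str]:
    """Replace every action that has descendants by its leaves, without duplicates."""
    seen: List[str] = []
    for action in actions:
        for leaf in leaves(action):
            if leaf not in seen:
                seen.append(leaf)
    return seen
```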
4.3 Relations between the ontology and the symptom combination representation
We insisted in section 3.4.1 that there are some cases where we are not able to compute the full distribution of symptom combinations, but where we still want to be able to compute the probabilities quickly without making the assumption of conditional independence. The only solution is to relax our model of dependence between symptoms. We assumed so far that there was dependence between all the symptoms of a disease; we should now consider dependence in a less stringent way. For the clarity of our presentation, we consider here a two-level ontology, with a deeper level for the description of specific symptoms and a vaguer level for organs.
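Let us assume we are interested in a disease $D$ with cardiac typical symptoms ($C_1, \dots, C_n$) and renal typical symptoms ($R_1, \dots, R_m$). We denote by $C$ and $R$ the indicators that at least one cardiac, respectively renal, symptom is present:
$$C = \mathbb{1}\{\exists i : C_i = 1\}, \qquad R = \mathbb{1}\{\exists j : R_j = 1\}.$$
We then assume that (precise) symptoms from distinct organs are conditionally independent given which organs have abnormalities, so that we have the following decomposition:
$$\mathbb{P}[C_1, \dots, C_n, R_1, \dots, R_m \mid D] = \mathbb{P}[C_1, \dots, C_n \mid C, D] \; \mathbb{P}[R_1, \dots, R_m \mid R, D] \; \mathbb{P}[C, R \mid D].$$
Note that even if we have lost the possibility of storing dependence between precise symptoms from different organs ($C_i$ and $R_j$), we keep a model of dependence at the higher level of the ontology: dependence between organ abnormalities ($C$ and $R$).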
Instead of computing and storing all symptom combinations, we just store symptom combinations inside organs and organ combinations.
The probability of symptom combinations inside an organ (e.g. the cardiac symptoms in our example) is computed by solving the optimization problem (5) of section 3.2.1 under the assumption that at least one symptom is present (which was already an assumption before). The organ abnormality combinations are also computed using (5). When the marginals of the organ abnormalities are not known, we can treat them as missing values or try to approximate them using the marginals of the lower level, temporarily making some kind of conditional independence assumption.
Each symptom combination can then be easily computed using the law of total probability: since all the probabilities involved in the resulting decomposition have been stored, this kind of computation is very cheap.
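The group decomposition described above can be sketched on a toy disease with two cardiac and two renal symptoms. The three stored tables and their values are hypothetical, and the sketch assumes that a within-organ table is conditioned on the organ being abnormal (at least one of its symptoms present), in which case an organ without abnormality has its all-absent combination with probability one.

```python
from itertools import product
from typing import Dict, Tuple

Combo = Tuple[int, ...]

# Hypothetical stored tables for one disease.
# Within-organ combinations, given that the organ is abnormal.
P_CARDIAC: Dict[Combo, float] = {(1, 0): 0.5, (0, 1): 0.3, (1, 1): 0.2}
P_RENAL: Dict[Combo, float] = {(1, 0): 0.6, (0, 1): 0.1, (1, 1): 0.3}
# Organ-level combinations: (cardiac abnormal, renal abnormal).
P_ORGANS: Dict[Tuple[int, int], float] = {
    (0, 0): 0.05,
    (1, 0): 0.25,
    (0, 1): 0.10,
    (1, 1): 0.60,
}


def within_organ(combo: Combo, table: Dict[Combo, float]) -> float:
    """P[combo | organ indicator implied by combo]."""
    if not any(combo):
        return 1.0  # organ normal: the all-absent combination is certain
    return table[combo]


def joint(cardiac: Combo, renal: Combo) -> float:
    """P[cardiac combo, renal combo | disease] from the three stored tables."""
    organs = (int(any(cardiac)), int(any(renal)))
    return (
        within_organ(cardiac, P_CARDIAC)
        * within_organ(renal, P_RENAL)
        * P_ORGANS[organs]
    )


def total_probability() -> float:
    """Sum of the joint over all 16 symptom combinations (should be 1)."""
    combos = list(product([0, 1], repeat=2))
    return sum(joint(c, r) for c in combos for r in combos)
```

In this toy case the three tables hold 3 + 3 + 4 = 10 values instead of the 2^4 = 16 entries of the full joint table. With n cardiac and m renal symptoms the comparison becomes (2^n − 1) + (2^m − 1) + 4 against 2^(n+m), which is where the saving comes from.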
This approach is in fact well suited to several diseases that manifest themselves as combinations of symptoms coming from specific organs. For example, VACTERL syndrome is a rare genetic disease defined by a combination of at least three abnormalities from three distinct organs [Solomon20] among vertebral anomalies, anorectal malformation, cardiovascular anomalies, tracheoesophageal fistula, esophageal atresia, renal and/or radial anomalies, and limb defects (the first letters of which form the acronym of the disease).
We are then able to cope with any symptom combination distribution, even when the number of symptoms related to a disease is high. In such a case, we have to find ascendants common to several of these symptoms (which is always possible by definition) that organize the symptoms into groups (the organs in our VACTERL example). We then make the conditional independence assumption between symptoms given the ascendants.
Figure 15 displays the distribution of symptom combinations of VACTERL syndrome obtained with this group modeling. We plotted the points with some transparency for visibility reasons. Symptom combinations modeled as impossible because not enough groups are turned on are plotted in red. Comparing this plot to the one obtained under the conditional independence assumption (see Figure 15), it appears that group modeling adds much information about the distribution of symptom combinations. The visible symmetries of the distribution are only due to the lexicographic order we use for the different symptom combinations.
5 Conclusion
We have presented in this work what is, as far as we know, a novel notion of what a good decision support tool for a rare disease diagnostic task should be. We took into account the need, in medicine, to achieve a high level of certainty when making a diagnosis. We try to minimize the average number of medical tests to be performed before reaching this level of certainty. We investigated several reinforcement learning algorithms and made them operable in our high-dimensional and reward-sparse setting. To do this, we broke the initial task into several subtasks and learned a policy for each subtask. We proved that an appropriate use of the intersections between the subtasks can significantly accelerate the learning procedure.
Furthermore, we reconnected with the first works on expert systems, which used probabilistic reasoning. This is due to our target application: we are interested in rare diseases, and therefore cannot work without expert knowledge, which is generally expressed as conditional probabilities. We presented a way to integrate expert knowledge with clinical data processed by the decision support tool.
Finally, we showed that it is possible to integrate the ontological information while remaining in our probabilistic setting. This results in a less rigid decision support tool without computation explosion.