Network Modeling and Pathway Inference from Incomplete Data (”PathInf”)
In this work, we developed a network inference method from incomplete data (”PathInf”) , as massive and non-uniformly distributed missing values is a common challenge in practical problems. PathInf is a two-stages inference model. In the first stage, it applies a data summarization model based on maximum likelihood to deal with the massive distributed missing values by transforming the observation-wise items in the data into state matrix. In the second stage, transition pattern (i.e. pathway) among variables is inferred as a graph inference problem solved by greedy algorithm with constraints. The proposed method was validated and compared with the state-of-art Bayesian network method on the simulation data, and shown consistently superior performance. By applying the PathInf on the lymph vascular metastasis data, we obtained the holistic pathways of the lymph node metastasis with novel discoveries on the jumping metastasis among nodes that are physically apart. The discovery indicates the possible presence of sentinel node groups in the lung lymph nodes which have been previously speculated yet never found. The pathway map can also improve the current dissection examination protocol for better individualized treatment planning, for higher diagnostic accuracy and reducing the patients trauma.
Bayesian Network modeling has been a useful tool to infer causal relationships in various applications (Acemoglu et al., 2011; Zhang et al., 2013; Aguilera et al., 2011; Jiang et al., 2011; Sun et al., 2012). In recent years various algorithms have been developed based on the concept of Bayesian Network. To name a few, works in (Chickering, 2002) used equivalence class to find a Bayesian Network which can be a perfect map of given distribution when the sample size is large enough. Works in (Cheng et al., 2002) applies an information-theoretic approach to efficiently learn structure from data. Both of these works are based on the conditional independence. Traditionally, speed and accuracy of structure learning algorithms are limited for more complex data. Models proposed in (Perrier et al., 2008) and (Koivisto and Sood, 2004) tried to lower the computational cost in order to tackle more complex cases, which all require complete data observation. The Greedy Equivalent Search (GES) method proposed in (Chickering, 2002) quickly become the standard in network learning due to its accuracy and capability in handling large dataset. One of the major challenge in network inference is the uncertainty and missing values in the data (i.e. incomplete data). To deal with such challenges, (Friedman, 1997) tries to find the optimal Bayesian Network structure with model selection within the process of Expectation-Maximization (EM). This work is among the first attempts to utilize EM algorithm during the structure learning. Later time, (Borchani et al., 2008) proposes a combination of GES and EM algorithm to learn the network structure form incomplete data. To the best of our knowledge, the GES-EM method introduced in (Borchani et al., 2008) is the best solution towards the task of inferring a graph representation of network from incomplete data. However, for applications in real-world settings, these methods suffer from the following limitations:
They all need relatively larger sample size in order to perform feasible search over the optimized structures.
Most of the algorithms cannot correctly handle missing values.
The Directed Acyclic Graph (DAG) assumption underlying all Bayesian networks might not be valid in real practice, as circles and/or two-way connections among variables are common.
In response to the above challenges for current network modeling methods, we develop a two-stages Bayesian Network inference model (”PathInf”). In the first stage, maximum likelihood based data summarization model is applied to overcome the presence of massive, non-uniformly distributed missing values in the data. In the second stage, after obtaining the state matrix through the data summarization model, the transition patterns among variables (i.e. pathway) are inferred as a graph inference problem solved by greedy algorithm with constraints. Experiment results on simulation data shows that PathInf can overcome these challenges and achieve better accuracy in recovering the true networks. In addition, we use PathInf to analyze the lung cancer metastasis data, for the inference of transition pathways among lymph vascular nodes. While there have been various literature on lymphatic spreading (Wang et al., 2016; Riquet et al., 2015; Han and Chen, 2017; Yamanaka et al., 2002; Kudo et al., 2012), most of the previous conclusions were made by clinical physicians and focused on specific empirical transition paths, rather than a holistic map depicting all possible metastasis pathways simultaneously. From a precision medicine perspective, investigation and modeling of lymphatic metastasis pathway would be vital for its predicting value on the different roles of lymph nodes during the cancer spreading, and its potential to provide individualized surgical planning strategy. By applying PathInf on the lymphatic spreading data from 936 patients diagnosed with non-small-cell lung cancer at various cancer stages, we obtain the map of possible lung cancer metastasis pathways. Preliminary conclusion from the map shows that the proposed data-driven PathInf method can obtain similar lymphatic metastasis pathways as found in the previous work based on clinical and anatomical studies. In addition, novel and consistent metastasis interrelationship unknown to the physicians have been found in the network, showing potential jumping metastasis among lymph nodes. The result also indicates the need for an improved lung cancer staging and diagnosis criteria. An illustrative pipeline showing the input (binary metastasis data), data summarization model and graph inference model is visualized in 1.
2. Maximum Likelihood Estimation on Data Summarization With Missing Values
Given a binary (e.g. cancer dissection examination result on lymph nodes, which is positive/negative) data matrix with variables, all the possible combinations of the binary values in these variables can be summarized into the state representation . States that are conformed to (i.e. capable of representing all samples in with minimum number of states) is a subset of which can be estimated. In practice, the true is often hidden and its observations are incomplete, resulting in the observation matrix with missing values. Due to the presence of missing values, states that are conformed to are ambiguous. With prior information on the proportion of missing values, it is possible to estimate so that it can optimally describe the observations, which is our main rationale for the data summarization model. Most of the previous work on learning network from incomplete data adopts the missing at random assumption (Borchani et al., 2008). However, the distribution of missing values in practice commonly follows a biased distribution caused by latent variables. For example, in lymphatic spreading data, dissection and examination of the lymph nodes depends on the surgical conditions and physicians decision, which can follow a specific pattern depending on severity of the cancer. For example, cancer with tumor nodules smaller in size can result in a sparsely-examined lymph node data (i.e. more missing values). To overcome this challenge of massive and non-uniformly distributed missing values, we estimate of the state representation using maximum likelihood method with two priors: the probability of a negative value being missed and the probability of a positive value being checked. Specifically, assuming that the presence of missing values in each sample of are independent, based on the total probability theorem, the probability of observing can be defined as:
where is the count of observed positive values in , is the count of positive values in . Notation indicates that assuming the missing values can be both positive or negative, state is conformed with sample . The probability of a missed negative value is set to the constant of 0.5, as we are not focused on how the negative samples are handled. The probability of a missed positive value, which is far more important, is set to which ranges from 0.1 to 0.4 in this work, to reflect the fact that positive lymph nodes are more likely to be dissected and examined. Based on Eq.1, given the observed dataset , the estimation of its state representation can then be formulated as a constrained optimization problem in order to get the maximum likelihood estimation for its probability distribution:
Results from the estimation are the probability of each state representation in based on data . As in the real-world scenarios the network structure tends to be sparse, states in with non-zero probability in the estimation result will only be a small portion of the whole combination set. Denoting as indices of all non-zero elements in the estimated probability , the state representation of can then be represented as . As both the objective function feasible set are convex in Eq.2, this optimization problem is convex. Proximal gradient descent is used to solve this optimization problem, as formulated below:
|Indicator function, function value equals to|
|if is in , otherwise equals to the infinity.|
The optimization problem can be solved by: Initialization by randomly choosing from , followed by iteratively updating use the rule until convergence.
A running example of the data summarization model is shown in 2. In the figure, given an input matrix (a) with 4 variables and 6 samples with missing values, there are totally 16 different combination of the binary values (b). For each sample with missing value, multiple states can conform to it (b-c, where conformed states to sample are colored in red and blue). Maximum likelihood method is then used to estimate the probability of the state representations (c), resulting in the summarization matrix (d). For example, the posterior probability of state ”” (marked by red) is larger than ”” (marked by blue). Thus during the estimation, state ”” will be kept while ”” will be discarded, because it is more optimized to set the probability of state ”” to 0 in order to increase the probability of state ”” than any other configurations. It can be seen that the maximum likelihood estimation can obtain a sparse and accurate representation of the input data, which is exactly what we need for this study.
3. Graph Inference From Summarization Matrix
In this section, we propose a greedy method to infer the network from the summarized state matrix with the constraint of minimizing the number of edges. Based on the results from the previous section, the state of the input observation has been fully described by with a limited number of combinations. Given an undirected graph , representing the vertex set and representing the edge set of , for any state representing a specific combination of the n binary values, we define:
is the vertices set of positive values in .
is the set consisting of all edges in the complete graph on .
Connected Vertices: two vertices with a path between them.
Connected Graph: graph in which every pair of vertices is connected vertices.
Connected Component: two vertices are in the same connected component if and only if they are connected.
The problem of network inference is then proposed as follows:
Given state , identifying a graph G with minimum edges which satisfies that: , the subgraph is connected.
The rationale for this inference process is that resulting network is supposed to explain all the co-existence of the positive values in each state from the given set . At the same time the total number of links in the network should be minimized, as otherwise a fully connected graph with vertices can also satisfy the connected subgraph condition. Based on the proposal above, the network is inferred by the following steps. It should be noticed that the inference procedure does not take the acyclic assumption for the edges, as the definition of connected vertices and connected graph only consider connectivity without direction. So the model should perform the same for cyclic or acyclic graphs.
The algorithm begins with an empty graph with vertices and no edges.
For , if contains only two vertices, add an edge between these two vertices.
For any two disconnected vertices and in and any state , if , we define the score for a potential edge to be added between and :
where is the probability of the given state as estimated in the previous section.
Calculate , that is, score of an edge between vertices and is the sum of scores calculated over all feasible states.
Find vertices pair with the highest score through exhaustion. Then add an edge between each pair in .
Iteratively execute step 3-5 to update , until all are connected graph.
4. Experimental Results on Simulation and Application Data
4.1. Model validation and performance evaluation with simulation data
In order to test the validity of our algorithm, we generate random directed acyclic graphs (DAGs) representing the true pathways among variables, and use discrete time model simulating the spreading process for the observation data. For each given DAG, nodes with indegree of zero (i.e. source nodes) are selected as the starting nodes for the spreading, as they cannot be the target from other nodes. As the graph is acyclic, at least one node is the source node which can serve as starting node. The following steps are then applied to generate observation data:
Generate a random DAG with fixed number of nodes (10) and various predefined number of edges (we tested 10, 15, 20, 25 in this study).
Uniform random weights are assigned to the edges. Weight of edge connecting node to node is denoted as . is zero if there is no edge.
Identify the source node(s) in DAG and select them as starting node(s) for the spreading.
Apply Markov Chain Monte Carlo (MCMC)-based sampling method to generate each observation sample:
Set the number of total transition steps as a random number following Poisson distribution. Larger total transition steps result in more positive nodes in the generated sample.
If there are more than one source nodes in the graph, randomly select one as the starting node.
In each transfer step, denote the positive nodes as , and the nodes that are connected from as set . If is nonempty, the probability of node to become positive (i.e. spreading) is .
Transition process stops when B is empty or reaching the total number of transition.
After the sample is generated, missing values are added to replace original values. The probability for a positive node to become missing is (, we tested 0.1 to 0.4). The probability for a negative node to become missing is 0.5.
For each simulation dataset, repeat Step 4 to generate 1000 samples with the same DAG but different transition pathway.
A sample result showing the ground truth network, along with the network inferred by our method (PathInf) and Greedy Equivalent Search (GES) method without missing values is shown in 3. It can be seen that when the data quality is high, PathInf could infer most of the edges from the ground truth graph (colored as red), with both very few false positive (colored as black) and false negative (colored as dashed) edges. On the contrary, GES overestimates the graph, resulting in large number of false positive edges. Given the fact that the input data is free of missing values in this experiment, the lowered performance of GES might be caused by the relatively smaller sample size (1000 samples to infer edges among 10 variables). However, in practice especially for medical data, large sample size is often difficult or even impossible to meet. The high accuracy of PathInf in the experiment shows it can overcome the challenge in sample size, to certain extent. Further, the computation time for PathInf and GES is almost the same, showing its efficiency in making the reference.
In 4 we show the results of PathInf and GES by applying them on the data with large number of missing values (, indicating 40% of the positive values are missing and 50% of the negative values are missing). As the method introduced in (Borchani et al., 2008) for analyzing data with missing values using GES plus Expectation-Maximization (EM) algorithm failed to converge on the simulation data, we simply replaced the missing values by zero entries and used original GES method for the inference. The results from GES deviated much from the ground truth, while PathInf can recover the network structure with no false negatives. The added false positive edges of the PathInf results on incomplete data are reasonable, as missing values in the observation expand the state representation matrix SO, which supports more possible edges.
Finally, we run the simulation and the network inference with different number of edges in the ground truth graph and different values of p (proportion of missing values). The performance statistics are summarized in Table 1, showing that in most of the experiments PathInf can outperform GES especially with regarding to the false negative rate, which is more important for medical applications as it is vital to ensure all possible pathways have been discovered from the data in order to establish a criteria/protocol. It should be noted that as the value of p increases, the false negative rate decreases for PathInf, which can also be observed in 3 and 4. This is because a larger will result in denser inference by PathInf as discussed above, thus more true edges will also be recovered as a result.
|PathInf: False Positive||GES: False Positive|
|PathInf: False Positive||GES: False Positive|
4.2. Metastasis pathway inferred from lung cancer population study
For the data collection project, we perform the complete lymph node examination of a total of 936 subjects diagnosed with lung cancer. Lymph nodes sampling or dissection are performed according to NCCN guidelines, and segmental station 13 and subsegmental station 14 are retrieved routinely. Samples of tissue and lymph nodes are sent for routine pathologic analysis with paraffin-embedded blocks. Lymph nodes area bi-valved along their longitudinal axis and totally submitted for microscopic evaluation. Small nodes (0.4 cm) area submitted without bi-valving. A single hematoxylineeosinestained slide is prepared from each block. Binary information of the metastasis (i.e. whether cancer cells are observed at the given node) is then recorded at 21 lymph node sites within the chest, forming a 21936 matrix characterizing the lung cancer metastasis of the population. These sites are located at left/right lung lobes, as well as mediastinum (membranous partition between the lungs). The anatomic landmarks delineating each node are listed in Table 2, derived from the regional lymph node classification system (Mountain and Dresler, 1997).
|Left Lung||Right Lung|
According to the data acquisition protocol, the 21936 metastasis matrix is then divided into two sub-matrices based on the location of cancer nodule: for patients with nodules found at left lobe, ”left lung + mediastinum” matrix will be used which consists of all the node sites within the ”left lung” column in Table 1, as well as mediastinum. For patients with nodules found at right lobe, ”right lung + mediastinum” matrix will be used in the similar way. The left and right metastasis matrices are then separately fed into the data summarization model as described in Section 3, resulting in two state matrices describing all the possible combination of the metastasis across lymph nodes. Finally, lung cancer metastasis pathways are inferred from the two summarization matrices, based on the graph inference method introduced in section 3. The result graphs of the left and right lung are visualized in 5. To validate the result graphs, we have conducted cross-validation experiments by randomly sampling 85 of the 936 patients for 100 times. These 100 sets of 800-patients data are then used to estimate the summarization matrices and to infer the network graphs. Cross-validation results show that edges inferred from the whole population can be consistently recovered from over 90 of the sub-sampled population: that is, every edge in the final pathway as shown in 5 can be found in the graphs inferred from over 90 sets of the 800-patients data.
For comparison, we also visualize the results from GES on the same dataset by replacing missing values with zeros, as shown in 6. It can be seen that little correspondence could be found in the pathways inferred from left and right lobe by GES. More crucially, most of the intra-lobe connections are missing in the results from GES, indicating that it is incapable of recovering most edges from this real dataset. This observation is in accordance with the comparison of methods in the previous section of simulation study, where the false negative rate of GES is much higher than our proposed PathInf method.
In this work we develop a data summarization and graph inference model for learning underlying transition paths (i.e. network) among variables during a spreading process, addressing the challenge of massive incomplete data and lack of observation samples. The proposed PathInf model is tested and compared with GES on both simulation data, which shows superior performance (much lower false negative rate). PathInf is further applied on the lymph nodes sampling/dissection data from 936 patients, to establish an atlas for analyzing the lymphatic metastasis pathway among the node sites in left and right lung lobes. The pathways reveals novel discoveries for the lung clinical studies and can provide data-driven evidence for the empirical knowledge from the physicians. We are continuously working on the validation of the current discoveries from both the data science and clinical science perspective to establish an improved version of the current protocol for lung cancer examination and surgery. MATLAB code of the PathInf method is publicly available at GitHub.
One of the main limitation of the proposed PathInf method is that it cannot provide direction information on the pathways it inferred. In other words, it only infers correlation from co-existence of the positive variables but not causality between these variables. From a statistics perspective, it is impossible to infer causality from the given dataset as there are no temporal information involved: every sample is a ”snapshot” of the patients condition during the transition process yet it is impossible to know when the snapshot is taken. Bayesian network models such as GES (Chickering, 2002) or PC (Spirtes and Glymour, 1991) can infer the direction of edges yet they rely on the DAG assumption to do the inference. Works in (Gomez-Rodriguez et al., 2012) which takes time-stamp of the events as input can infer the direction of pathways, but this type of time-stamp data is beyond our capability to acquire as it needs large of number patients to be examined more than once. Nevertheless, while the current results from the PathInf method are promising with novel discoveries, it would be important to infer the directed network from the data, as that is needed to determine the precise timing/sequence of the progression and to perform the simulation study of the metastasis (e.g. using Markov Chain). Thus, beyond this preliminary study, in the next step we are aiming to develop a more detailed inference modeling for the pathway directions based on more features from the patients which we already acquired but not yet been used, including tumor size and details of the pathologic analysis at each lymph site.
Acknowledgements.This work is supported by the Beijing Municipal Administration of Hospitals Clinical Medicine Development Special Funding
(ZYLX201509). In addition to this currently deployed project, the funding also covers the future data acquisition and analysis on a larger patient population in the next stage of study.
- Acemoglu et al. (2011) Daron Acemoglu, Munther A. Dahleh, Ilan Lobel, and Asuman Ozdaglar. 2011. Bayesian Learning in Social Networks. The Review of Economic Studies 78, 4 (2011), 1201–1236. DOI:http://dx.doi.org/10.1093/restud/rdr004
- Aguilera et al. (2011) P.A. Aguilera, A. FernÃ¡ndez, R. FernÃ¡ndez, R. RumÃ, and A. SalmerÃ³n. 2011. Bayesian networks in environmental modelling. Environmental Modelling & Software 26, 12 (2011), 1376 – 1388. http://www.sciencedirect.com/science/article/pii/S1364815211001472
- Borchani et al. (2008) Hanen Borchani, Nahla Ben Amor, and FÃ©dia Khalfallah. 2008. Learning and evaluating Bayesian network equivalence classes from incomplete data. International Journal of Pattern Recognition and Artificial Intelligence 22, 2 (2008), 253–278. DOI:http://dx.doi.org/10.1142/S0218001408006193
- Cheng et al. (2002) Jie Cheng, Russell Greiner, Jonathan Kelly, David Bell, and Weiru Liu. 2002. Learning Bayesian networks from data: An information-theory based approach. Artificial Intelligence 137, 1 (2002), 43–90. DOI:http://dx.doi.org/10.1016/S0004-3702(02)00191-1
- Chickering (2002) David Maxwell Chickering. 2002. Optimal Structure Identification With Greedy Search. Journal of Machine Learning Research 3 (2002), 507–554. http://www.jmlr.org/papers/v3/chickering02b.html
- Friedman (1997) Nir Friedman. 1997. Learning Belief Networks in the Presence of Missing Values and Hidden Variables. In Proceedings of the Fourteenth International Conference on Machine Learning (ICML ’97). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 125–133. http://dl.acm.org/citation.cfm?id=645526.657145
- Gomez-Rodriguez et al. (2012) Manuel Gomez-Rodriguez, Jure Leskovec, and Andreas Krause. 2012. Inferring Networks of Diffusion and Influence. ACM Trans. Knowl. Discov. Data 5, 4, Article 21 (Feb. 2012), 37 pages. DOI:http://dx.doi.org/10.1145/2086737.2086741
- Han and Chen (2017) Han Han and Haiquan Chen. 2017. Selective lymph node dissection in early-stage non-small cell lung cancer. Journal of Thoracic Disease 9, 7 (2017), 2102–2107. DOI:http://dx.doi.org/10.21037/jtd.2017.06.04
- Jiang et al. (2011) Xia Jiang, Richard E. Neapolitan, M. Michael Barmada, and Shyam Visweswaran. 2011. Learning genetic epistasis using Bayesian network scoring criteria. BMC Bioinformatics 12, 1 (31 Mar 2011), 89. DOI:http://dx.doi.org/10.1186/1471-2105-12-89
- Koivisto and Sood (2004) Mikko Koivisto and Kismat Sood. 2004. Exact Bayesian Structure Discovery in Bayesian Networks. J. Mach. Learn. Res. 5 (2004), 549–573.
- Kudo et al. (2012) Yujin Kudo, Hisashi Saji, Yoshihisa Shimada, Masaharu Nomura, Jitsuo Usuda, Naohiro Kajiwara, Tatsuo Ohira, and Norihiko Ikeda. 2012. Do tumours located in the left lower lobe have worse outcomes in lymph node-positive non-small cell lung cancer than tumours in other lobes? European Journal of Cardio-Thoracic Surgery 42, 3 (2012), 414–419. DOI:http://dx.doi.org/10.1093/ejcts/ezs065
- Mountain and Dresler (1997) Clifton F. Mountain and Carolyn M. Dresler. 1997. Regional Lymph Node Classification for Lung Cancer Staging. CHEST 111, 6 (1997), 1718–1723. DOI:http://dx.doi.org/10.1378/chest.111.6.1718
- Perrier et al. (2008) Eric Perrier, Seiya Imoto, and Satoru Miyano. 2008. Finding optimal Bayesian network given a super-structure. Journal of Machine Learning Research 9 (2008), 2251–2286.
- Riquet et al. (2015) M. Riquet, C. Rivera, C. Pricopi, A. Arame, P. Mordant, C. Foucault, A. Dujon, and F. Le Pimpec-Barthes. 2015. Is the lymphatic drainage of lung cancer lobe-specific? A surgical appraisal. Eur J Cardiothorac Surg 47, 3 (2015), 543–9. DOI:http://dx.doi.org/10.1093/ejcts/ezu226
- Spirtes and Glymour (1991) Peter Spirtes and Clark Glymour. 1991. An Algorithm for Fast Recovery of Sparse Causal Graphs. Social Science Computer Review 9, 1 (1991), 62–72. DOI:http://dx.doi.org/10.1177/089443939100900106
- Sun et al. (2012) Jiehuan Sun, Xintao Hu, Xiu Huang, Yang Liu, Kaiming Li, Xiang Li, Junwei Han, Lei Guo, Tianming Liu, and Jing Zhang. 2012. Inferring consistent functional interaction patterns from natural stimulus FMRI data. NeuroImage 61, 4 (2012), 987 – 999. http://www.sciencedirect.com/science/article/pii/S1053811912002868
- Wang et al. (2016) Xing Wang, Shi Yan, Kevin Phan, Tristan D. Yan, Lijian Zhang, Yue Yang, and Nan Wu. 2016. Mediastinal lymphadenectomy fulfilling NCCN criteria may improve the outcome of clinical N0â1 and pathological N2 non-small cell lung cancer. Journal of Thoracic Disease 8, 3 (2016), 342–349. DOI:http://dx.doi.org/10.21037/jtd.2016.02.49
- Yamanaka et al. (2002) Akira Yamanaka, Takashi Hirai, Ayuko Takahashi, and Fumio Konishi. 2002. Interlobar lymph node metastases according to primary tumor location in lung cancer. Lung Cancer 35, 3 (2002), 257–261. DOI:http://dx.doi.org/10.1016/S0169-5002(01)00421-4
- Zhang et al. (2013) Jing Zhang, Xiang Li, Cong Li, Zhichao Lian, Xiu Huang, Guocheng Zhong, Dajiang Zhu, Kaiming Li, Changfeng Jin, Xintao Hu, Junwei Han, Lei Guo, Xiaoping Hu, Lingjiang Li, and Tianming Liu. 2013. Inferring functional interaction and transition patterns via dynamic bayesian variable partition models. Human Brain Mapping 35, 7 (2013), 3314–3331. DOI:http://dx.doi.org/10.1002/hbm.22404