A Novel Learning Algorithm for Bayesian Network
and Its Efficient Implementation on GPU
Abstract
Computational inference of causal relationships underlying complex networks, such as generegulatory pathways, is NPcomplete due to its combinatorial nature when permuting all possible interactions. Markov chain Monte Carlo (MCMC) has been introduced to sample only part of the combinations while still guaranteeing convergence and traversability, which therefore becomes widely used. However, MCMC is not able to perform efficiently enough for networks that have more than 1520 nodes because of the computational complexity. In this paper, we use general purpose processor (GPP) and general purpose graphics processing unit (GPGPU) to implement and accelerate a novel Bayesian network learning algorithm. With a hashtablebased memorysaving strategy and a novel task assigning strategy, we achieve a 10fold acceleration per iteration than using a serial GPP. Specially, we use a greedy method to search for the best graph from a given order. We incorporate a prior component in the current scoring function, which further facilitates the searching. Overall, we are able to apply this system to networks with more than 60 nodes, allowing inferences and modeling of bigger and more complex networks than current methods.
I Introduction
Bayesian network (BN) is a probabilistic graphical model that describes causal relationship through directed acyclic graphs (DAG). In this work, we focus on the problem of learning a Bayesian network that characterizes the causal relationship from experimental data. This problem is proved to be NPcomplete [1]. Due to the large number of graph structures, samplingbased methods are proposed to find the best matching graph using a scoring metric. Several sampling methods have been proposed, including graphspace sampling, orderspace sampling, and ordergraph sampling [2]. Among all these sampling methods, orderspace sampling is demonstrated to be the best one [2]. However, these methods are still not efficient enough for large graphs. In this paper, we proposed a new strategy for sampling the order space. Specifically, we apply a greedy strategy into the order sampling procedure. Our new method is more accurate than the previous ones while maintaining the same complexity in the sampling stage. Furthermore, it does not require a postprocessing part which is needed in the previous methods.
In addition to experimental data, in many situations, prior knowledge for at least part of a given Bayesian network is available. Adding prior knowledge in the learning process can enhance the accuracy of the result while significantly reduce the searching space. Methods of adding priors include graphbased and probabilityaimed[3]. However, the “pairwise” prior knowledge which indicates the likelihood of one event causing another is more easily obtained. Yet, no methods exist that can integrate such prior knowledge into the BN learning algorithm. In this work, we propose a novel prior component which can be easily added into our scoring function as a pairwise weight. It represents user’s “confidence” in the possible existence or nonexistence of an edge.
Nevertheless, learning Bayesian interactions is still computeintensive, demanding both software and hardware advancements. Novel computational platforms such as fieldprogrammable gate array (FPGA) and graphics processing unit (GPU) have been applied to facilitate the learning of Bayesian networks [4, 5, 6]. GPU is known to be highly efficient for massively parallel computational problems. In this work, we exploit the parallelism in our novel Bayesian network learning algorithm and implement it on GPU. The combination of our novel algorithm and its GPU implementation allows us to learn graphs with up to 60 nodes.
The remainder of the paper is organized as follows. In Section 2, we introduce the background on the problem of learning Bayesian networks. In Section 3, we describe the improved learning algorithm. In Section 4, we demonstrate our novel method for adding priors. In Section 5, we discuss the implementation on GPU. In Section 6, we show the experimental results on the performance of our algorithms. We conclude the paper in Section 7.
Ii Background
A Bayesian network is a probabilistic graphical model that represents a set of random variables and their conditional dependencies via a directed acyclic graph (DAG). Each node in the graph is associated with a random variable. Each directed edge indicates a causal relationship between the variables connected by that edge. Nodes that are not connected represent random variables that are conditionally independent. A parent set of a given node is a set of nodes which have a directed edge pointed to . Each node is also associated with a probability distribution conditioned on all its parent variables, . The joint probability distribution of all the random variables in a Bayesian network can be written as a product of the conditional distributions for all the nodes:
(1) 
An example of a Bayesian network is shown in Figure 1(a). Every node is influenced by its parents. For instance, node in Figure 1(a) has two parents, and . Thus, the probability distribution of is determined by the states of and . This conditional distribution is shown in Figure 1(d) for all combinations of and .
In this work, we focus on Bayesian network composed of discrete random variables, which is a common Bayesian network model. For example, we can model a gene network using a discrete Bayesian network, whose random variables are discretized into three states which represent the underexpression, the normal expression and the overexpression of genes, respectively. Mechanisms for discretizing continuous data include MDL method [7], CAIM, CACC, Ameva, and many others [8].
The problem here is to learn the Bayesian network structure from its experimental data. There are two types of data: observational data and experimental data. Observational data are obtained through observations without any human perturbations in the experiment. Experimental data are generated from manipulating one or more variables, and then observing the states of the other variables [9]. For example, in work [10], experimental data are generated by individually applying inhibitors to some of the genes. Usually, the causal relationship cannot be inferred only from observational data; experimental data are required [2]. In this work, we assume that each input data are sampled from multinomial distributions and the data set is complete.
Due to its superexponential complexity, sampling methods are recently applied to solve the problem of learning Bayesian networks. One of them is graph sampling, which explores the huge graph space for a best graph. Another is order sampling, which explores a smaller order space for a best order. And there is ordergraph sampling [2], which samples graphs for a given order. Order refers to the topological order of a DAG, which is an order on the nodes of the DAG such that must precede if is in the parent set of . Each DAG has at least one topological order denoted as . For example, a topological order for the graph shown in Figure 1(a) is .
# of nodes  # of graphs  # of orders 

4  453  24 
5  29281  120 
10  4.7  3.6 
20  2.34  2.43 
30  2.71  2.65 
40  1.12  8.16 
The graph learning problem is an NPcomplete problem. The number of possible graphs grows superexponentially with the number of nodes. Table I shows the numbers of possible graphs for different numbers of nodes. Compared to the number of graphs, the number of orders for a given number of nodes is much smaller. Table I also lists the numbers of orders for the same set of node numbers. Due to the reduced number of combinations, order sampler can converge in fewer steps compared to graph sampler and hence, reduce the overall complexity. Moreover, sampling in order space provides opportunities for parallel implementation. Due to these advantages, we develop our algorithm within the framework of orderspace sampling.
Learning Bayesian networks aims at finding a graph structure which best explains the data. We can measure each different Bayesian graph structure with a Bayesian scoring metric, which is defined as [3]:
where denotes the prior distribution of a graph and denotes the experimental data. Given a Bayesian network with nodes, using the decomposition relation shown in Equation (1), we can represent the scoring metric as a product of local scores as
(2) 
The local score can be calculated as
(3) 
where serves as a penalty for complex structures [2], is the hyperparameter for prior of Bayesian Dirichlet score, refers to the number of different states of the parents set , and refers to the number of possible states of the random variable . and are counted from experimental data [3]. is the gamma function [11]. In order to reduce the overall complexity, we limit the maximal size of parent sets to a constant .
Iii Algorithm Description
In this section, we discuss our algorithm. The overall flow of our algorithm is plotted in Figure 2, while its pseudocode is shown in Algorithm 1. After the preprocessing step, we start scoring an order. The score is defined to be the best score of all the graphs satisfying that order. The scored order is accepted based on the MetroplisHasting rule [12]. We then apply the Markov chain Monte Carlo (MCMC) strategy to sample the order space: each new order is generated from the previous one by randomly selecting and swapping two nodes in the previous order. We sample the orders for a specified number of iterations. Each subroutine of our algorithm is discussed in detail in the following sections.
Iiia Preprocessing
As shown in Figure 2, our learning algorithm is started with a preprocessing part, which includes order initialization and the generation of all possible local scores (refer to Equation (3)). The order initialization randomly generates an initial order as the starting point. As we will show in Section IIIB, the scoring part heavily relies on the computation of local scores. Indeed, each local score is repeatedly used in a large number of iterations. However, calculating the local score according to Equation (3) is timeconsuming. Thus, instead of recomputing local scores each time when they are needed, we choose to compute local scores for all the possible combinations of the node and its parent set at the preprocessing stage. We store the result in a hash table keyed by the node and the parent set . Later on, when a local score for a specific combination of a node and its parent set is needed, we just fetch the score from the hash table. This strategy leads to more than 10 folds speedup on GPP according to our experimental results.
Since the local score shown in Equation (3) is very small, we perform the computation in the logspace. Given a node and its parents set , equation for a local score () is now changed to:
(4) 
The scoring function for a graph is changed to:
(5) 
IiiB Scoring
The scoring part is a major subroutine of our algorithm, which scores a given order. To effectively measure an order, we introduce a new scoring function different from the one proposed in [5]. Given a specific order, there are many graphs that satisfy that order. We define the score of an order to be the best score for all the graphs satisfying the order, i.e.,
Based on Equation (5), we further have
Due to the Markov property of Bayesian networks, global maximum equals to the sum of the maximal local scores of all the nodes, each of which is taken among all the combinations of the node and its parent sets that are consistent with the order. Mathematically, the scoring function can be represented as
(6) 
where is the set of all possible parent sets of the node that are consistent with the order. The scoring subroutine is shown at Line in Algorithm 1. We notice that a similar algorithm was previously mentioned in [13]. However, it is only used in the postprocessing part where a best graph is constructed from the best order. In [5], a different order scoring function was used, which is the sum of all the scores of the graphs that are consistent with the order. Compared to that scoring function, ours is better in the following ways:

Our algorithm only needs comparison and assignment operations, avoiding the timeconsuming exponentiation and logarithm operations required by the previous algorithm.

The sumbased scoring function may lead to an incorrect result, because the best matching graph may not be consistent with the order which generates the largest score. However, since our function uses the max operation, the globally best graph must be consistent with the globally best order.

The previous algorithm needs a postprocessing part which constructs the best graph from the best order. Our algorithm generates the best graph for each sampled order. Thus, we do not need any postprocessing.
In summary, since our algorithm avoids many expensive operations and reduces a large amount of computation, the total computation time is decreased.
In [4] and [5], bit vectors are used to generate every compatible parent set with respect to a given order. However, our experimental results indicate that bit vector is not a suitable implementation since it is very slow. According to our experiment, the bit vector implementation consumes a huge amount of time for networks with more than 20 nodes. This is because for the last node in an order, each of the nodes preceding it could be its parent. Therefore, we need to compare bit vectors to filter out the compatible parent sets for the last node. However, we notice that in practice the maximal size of a parent set is limited to a constant . Given this, we only need to consider potential parent sets for the last node, which is much smaller than . Table II compares the runtime for generating all parent sets with the runtime for generating only those parent sets with a size limit of . We compare these runtimes (per iteration) for different numbers of the candidate parents ranging from 15 to 25. We can see that there is a dramatic increase in speed if we only generate those parent sets with a size limit of .
Generating all  Generating parent  

Number of  possible parent  sets with a size limit  Speedup 
Candidate Parents  sets (Sec.)  of (Sec.)  
15  0.011  1100  
16  0.017  1317  
17  0.065  3915  
18  0.104  4369  
19  0.195  6818  
20  0.297  10136  
21  0.645  15965  
22  1.248  27857  
23  3.425  63075  
24  6.814  115884  
25  12.185  162250 
IiiC MetroplisHasting Comparison, Best Graph Updating, and Order Generation
We apply the Markov chain Monte Carlo method (MCMC) to sample the order space, which essentially performs a random walk in that space. Each time a new order is proposed, even if its score is less than the score of the previous order, it still could be accepted based on the MetropolisHasting rule [12], which is to accept the new order with the probability
Suppose that the logspace score for the new order and that for the previous order are and , respectively. The new order is accepted if
where is a random number generated uniformly from the unit interval [0, 1]. Due to the property of MCMC, After a sufficient number of iterations, the Markov chain will converge to its steady distribution. At that time, each order is sampled with a frequency proportional to its posterior probability. Thus, an order with a high probability of occurring (corresponding to a high Bayesian score) is very likely to be sampled.
Our ultimate goal is to find the graph with the highest score. Therefore, we keep track of a number of best graphs obtained so far as the sampling procedure proceeds. At the end of each iteration, if a new order is accepted, then we compare the score of the best graph consistent with that order to the scores of the best graphs recorded so far. We update the record of the best graphs if the current graph is better.
At the end of each iteration, we generate a new order by randomly selecting two nodes and in the current order and swapping them, i.e., changing the order to the order .
Iv Priors for Characterizing Pairwise Relationship
In this section, we present our novel prior component that could effectively characterize the prior knowledge on the causal relationship between a pair of nodes.
Assume that a function indicates the prior knowledge on the causal relationship between a pair of nodes and . Equivalently, represents the prior knowledge on the existence of an edge from to . We add into the scoring Equation (2) to affect the posterior probability of graphs as follows
(7) 
Note that in the above equation, given an arbitrary graph, the prior probabilities on all the edges are multiplied together to influence the posterior probability of the graph. Thus, if the prior probability on the existence of an edge in the Bayesian network is large, the probabilities of the graphs containing that edge will be increased and hence, these graphs are more likely to be sampled. In the logspace, Equation (7) becomes
(8) 
where is the local score in Equation (4). We call as the pariwise prior function (PPF) for the nodes and . It is also denoted as . Thus, Equation (9) becomes
(9) 
With this general form of adding pairwise priors, we can meet different needs by applying different PPFs.
In our design, we provide an interface for users. It is an matrix , where is the number of nodes in the graph. Each entry in the matrix is between zero and one. If the value is between and , it means that there unlikely exists an edge from to ; if the value is between and , then it means there likely exists an edge from to ; if the value is , it means that there is no bias on whether or not there exists such an edge from to . This interface provides a convenient way to specify the pairwise priors. The actual PPF is a function on the value in the matrix . It must satisfy the following requirements:

iff

iff

iff
Furthermore, according to our experiment results, PPF should also satisfy:

when approaches 1, is around 10

when approaches 0.5, approaches

when approaches 0, is around
where and are chosen empirically to have a significant impact on the ultimate score of a graph.
Based on the abovementioned requirements, we propose the following cubic polynomial to transform the value in the interface matrix into the PPF:
(10) 
The above function is plotted in Figure 3 to give a clear view.
V Implementation of the Learning Algorithm on GPU
In this section, we discuss the implementation of our algorithm on GPU for learning Bayesian networks.
Va The Architecture of GPU
Figure 4 shows the architecture of a typical GPU. Host refers to a CPU, which assigns tasks to and collects results from the GPU. As we show in Figure 5, the GPU implements the scoring part of our algorithm, since the max operation can be paralleled both within each node and across all the nodes (refer to Equation (6)). The remaining parts of our learning algorithm are handled by the CPU. The CPU also takes charge of the communication with the GPU. Specifically, it passes a new order to the GPU and gets the best graph and its score from the GPU, as shown in Figure 4.
A GPU contains a number of blocks connected in the form of a grid. Each block usually includes 256 threads. Each thread has a number of registers and a local memory. All the threads within a block can access the shared memory of that block. All the threads can also access the global memory of the GPU.
The GPU we use is based on Fermi architecture [14]. Fermi architecture provides true cache hierarchy for us to use the shared memory of GPU. Also, it is fast in context switching operation and the atomic operations of readmodifywrite for parallel algorithms. Fermi architecture has up to 16 streaming multiprocessors (SM) with each containing 32 CUDA cores. Thus, it features up to 512 CUDA cores. A CUDA core executes a floating point or an integer instruction per clock for a thread. The GPU has bit memory partitions for a 384bit memory interface, supporting up to a total of 6 GB of GDDR5 DRAM memory. The SM schedules threads in groups of 32 parallel threads called warps. Each SM features two warp schedulers and two instruction dispatch units, allowing two warps to be issued and executed concurrently.
VB Task Assigning Strategy
GPU implements the order scoring part of the Bayesian network learning algorithm. This requires us to assign the tasks evenly among all the blocks and all the threads. We describe our task assigning strategy in this section.
First, we assign blocks for each node. These blocks together will get the maximal local score for the node (refer to Equation (6)). The number of local scores they need to compare equals to the size of the set , or the number of parent sets of the node that are consistent with the given order . Now we will assign these parent sets evenly to all the threads in the blocks. Assume that the total number of threads in the block is . Then, each thread will handle local scores and get the “local” maximum among them. After that, we will further compare all the local maximal scores obtained by the threads and get the largest one. We need to assign to each thread the parent sets they are in charge of.
Since each thread has a thread ID and a block ID in the CUDA programming environment, we can assign a specific task to a thread based on its ID. The problem is how a thread predicts the parent sets that it needs to handle. This corresponds to predicting which parent set the th thread needs to lookup in the hash table to get the local score . This problem can be converted into a subset indexing problem: given a set of nodes, we want to index all the subsets with at most nodes in a regular way so that given an arbitrary valid index we can easily get the corresponding subset. Note that the total number of the subsets with at most nodes is . Indeed, we can index these subsets in a regular way. For a example, consider a set of nodes . If the size limit on the subsets is , then we can obtain the total number of subsets as . We assign index to the subset , index to the subset , index to the subset , index to the subset , …, index to the subset , and index to the subset .
Now the problem is how to recover the subset from a given index if we use the above indexing method. We propose an algorithm shown in Algorithm 2 to solve this problem, which is inspired by an algorithm proposed in [15]. Since GPU cannot support recursive functions, we provide a nonrecursive version. Given the number of candidate parents , the size of parent sets , and the index of the expected parent set , it can return the th parent set which is composed of nodes chosen from the candidates.
Our purpose is to compute the combination of elements that is at a given position in the lexicographic order, without explicitly counting them one by one. The solution is quite straightforward. Suppose that the elements are . We obtain each element in the combination one by one from the first to the last. We assume that the elements in each combination are in increasing order from the first to the last, i.e, . With this assumption, we can see that there are combinations beginning with the value (). Based on this fact, we can obtain the first element as the largest number such that . In order to get the second element , it is equivalent to obtaining the combination of elements at the position . Thus, is the largest number such that , plus the shift , namely . We compute all the remaining elements in the combination in a similar way.
With Algorithm 2, each thread can get the first parent set it needs to handle based on its ID. With this, the remaining parent sets it needs to handle can be obtained incrementally.
However, the above algorithm requires additional computation on GPU. Our second strategy is to create a parent set table (PST) and store all the combinations in the the global memory of the GPU. Figure 6 shows an example of the PST and the additional memory requirement for storing the PST. Suppose that we have in total threads to handle parent sets. Then, each thread handles parent sets. Therefore, the th thread should handle the th upto the th parent sets from the PST. Compared to the abovementioned combinatorial algorithm, PSTbased method is much faster since it only needs to read the table. Although it requires additional memory, the overhead is small. Indeed, as shown in Figure 6(b), a 60node graph only costs 7.99 MB additional memory when the size limit on the parent set is . Using the PST and a proper mapping strategy, we can assign to each thread the parent sets it is responsible for.
After completing its task, each thread stores its best parent set and the best score in a shared memory within each block. We further need to find the best score and the best parent set among all choices stored in the shared memory. In order to do this efficiently, we modified a reducing algorithm mentioned in [16]. Each thread has kept its local best parent set and the corresponding local best score. The problem is to pick the highest score among all the local best scores as well as its corresponding parent set. This problem is not as simple as the problem of searching the highest score since we have to recover its original position during a highly dynamic process. We have to consider both the efficiency and the correctness. An illustration of our algorithm is shown in Figure 7. Assume that a shared memory has 16 entries. We want to move the highest score to the entry of the array and record the ID of the thread that gives the highest score in entry 1. In this example, the highest score is and the thread ID is . In the first reduction, we first divide the array into two halves. We move the higher scores to the left half and record their original thread IDs in the right half as shown in the third row of Figure 7. It requires half of the threads to participate. For example, thread compares its value with the value in entry . Then, thread assigns entry 0 of the shared memory with value and entry with 0, which is the ID of the thread giving the larger value . Each reduction halves the amount of memory involved.
In the rest of the reductions, we have to keep track of the ID of the original thread that gives the better value. For example, in the second reduction, entry of the shared memory has to be compared with entry of the shared memory. Since is larger than , is stored in entry . Note that is now from entry . However, the ID of the original thread that gives the value is store in entry , where 8 is the current number of threads involved. Then, we update entry with the original thread ID by copying the value of entry to entry . The total number of iterations required to get the best score among a total of scores is . After obtaining the ID of the original thread that gives the highest score, we can fetch the best parent set from that thread.
Vi Experimental Results
We perform experiments on our algorithm both on GPP and GPU. The GPP we use is a 2.4 GHz Intel Xeon E5620 processor with 8GB RAM. The GPU we use is an NVIDIA Tesla M2090 GPU with 6GB GDDR5 RAM. Our GPUbased implementation is described in Figure 5, with its scoring part performed on the GPU and the remaining parts performed on the CPU. The operating system is Ubuntu 10.04.4.
In our experiments, we set the maximal size of the parent set as 4 (). In our implementation, the GPU is used to accelerate each order scoring iteration. We first study its speedup effect. Figure 8 shows both the runtime of our scoring implementation on GPP and the runtime of the implementation on GPU for different graph sizes. From Figure 8, we can see that the GPU implementation achieves a significant speedup over the GPP implementation. The detailed runtimes per iteration for both the GPP and the GPU implementations, together with the acceleration rates, are listed in Table III. Acceleration rate is peaked at 10 for graphs with around 50 nodes. For smaller graphs, i.e., graphs with fewer than 13 nodes, their acceleration rates are less than 1. That is due to the time consumed on the context switching on GPU. As a result, GPU is not a good choice for small graphs.
# of  GPP time per  GPU time per  Speedup of 

Nodes  iteration (sec.)  iteration (sec.)  GPU over GPP 
13  0.00024  0.000461  0.52 
15  0.000574  0.000511  1.12 
17  0.001223  0.000645  1.90 
20  0.003384  0.001053  3.18 
25  0.013076  0.002059  6.35 
30  0.045229  0.005027  9.00 
35  0.132726  0.012319  10.77 
40  0.294657  0.027673  10.65 
45  0.600787  0.056061  10.71 
50  1.074849  0.102469  10.49 
55  1.7365  0.177313  9.79 
60  3.267  0.3427  9.53 
To make the results more practical, we further apply our learning algorithm to two wellknown networks : 1) an 11node signaling transduction network (STN) from human Tcell [10]; and 2) a 37node ALARM network [17].
Preprocessing  Iteration  Total  
runtime  runtime  runtime  
37node graph on GPP (sec.)  563.03  1685.19  2248.38 
37 node graph on GPU (sec.)  634.2  160.92  795.19 
11node graph on GPP (sec.)  0.71  1.00  1.71 
11 node graph on GPU (sec.)  4.58  1.66  6.28 
Table IV shows the runtimes for both the 11node graph and the 37node graph. Note that the preprocessing part of the GPU implementation is done on a CPU, as we mentioned before. The GPUbased implementation takes more time on preprocessing than the GPPbased implementation. Still, the total runtime of the GPUbased implementation is about 1/3 of the runtime of the GPPbased implementation for the large 37node network. Scoring orders is the most timeconsuming part of the Bayesian network learning algorithm. Accelerating scoring subroutine is our primary goal in this work. We will study how to speedup the preprocessing part in our future work.
Preprocess  Iteration  Total  
runtime  runtime  runtime  
20node graph  
(all parent sets) (seconds)  23.15  1122.99  1136.19 
20node graph  
(partial parent sets) (seconds)  7.52  278.18  285.76 
11node graph  
(all parent sets) (seconds)  0.75  2.59  3.39 
11node graph  
(partial parent sets) (seconds)  0.71  0.95  1.71 
We also compare the implementation that generates all possible parent sets with the implementation that generates only parent sets with a limited size. We evaluate these two implementations on GPP using the previous 11node graph and a randomly synthesized 20node graph. We do not use the 37node graph because the generation of all the possible parent sets is prohibitively timeconsuming. The runtime results are shown in Table V. From the table, we can see that for both graphs, the total acceleration rate is almost when we only generate parent sets with a limited size. The speedup in the preprocessing part is not so significant for the 11node graph, while it is more than 3 times faster for the 20node graph.
We also empirically study the accuracy of our algorithm. We use the receiver operating characteristic (ROC) curve introduced in [18] to measure the accuracy. A ROC curve is a plot of the true positive (TP) rate versus the false positive (FP) rate. True positive rate gives the fraction of true positives out of the observed positives, while false positive rate gives the fraction of false positives out of the observed negatives. The closer to the upperleft point , the more accurate is the graph learning result. We tried a 20node graph with 1,000 and 10,000 iterations separately. The ROC curves for these two experiments are shown in Figure 9 and 10, respectively. Clearly, the resulting curve with 10,000 iterations is closer to the upperleft corner than the resulting curve with 1,000 iterations. However, the curve with 1,000 iterations is pretty closer to the upperleft corner. It indicates that our algorithm is highly accurate with even a small number of iterations. In these two figures, the points from the right to the left are generated as follows: the first point is obtained without adding any prior knowledge on edges; the second point is obtained by assigning “interface” prior value (refer to Section IV) 0.7/0.2 with a probability of 0.2 to edges which are mistakenly removed/added when learned without any prior knowledge; the third point is obtained by adding the same prior knowledge used in generating the second point but with a probability of 0.4; the fourth point is obtained by assigning “interface” prior value 0.8/0.1 with a probability of 0.2 to edges which are mistakenly removed/added when learned without any prior knowledge; the fifth point is obtained by adding the same prior knowledge used in generating the fourth point but with a probability of 0.4. Note that the priors added becomes stronger as we generate the points from the first to the last.
In realistic situations, the observed data may contain a large amount of noise and hence become faulty. In order to learn BNs correctly in these situations, the algorithm must be highly tolerant to noise. We study the fault tolerance of our algorithm by injecting errors into the data. We test our algorithm in learning Bayesian networks with two states. In this case, we assume that each data has a probability to flip its state. That is, every single data would change from 1 to 0 or from 0 to 1 with a rate of . In realistic context, this means that every data has a possibility to be overestimated or underestimated. For chosen as , , , , , , , and , the accuracy of our algorithm is shown in Figure 11 in the form of a ROC curve. When , TP is 0.513514, which is not good enough. However, for , the results are acceptable in most applications. These results show that our algorithm has a relatively high noise tolerance.
Vii conclusion
Learning Bayesian network structure from experimental data is a computational challenging problem. In this paper, we have demonstrated a novel BN learning algorithm and its implementation on GPU. Our proposed algorithm is three times faster than the traditional algorithm when run on GPP. Further, our GPU implementation has achieved a 10fold speedup per iteration over the GPP implementation. When the entire learning procedure is considered, the GPU implementation has a 3fold speedup. Overall, we have accelerated the BN learning algorithm at least 9 folds. Experimental results also demonstrated that our algorithm gives accurate result and is highly tolerant to errors in the data.
Our algorithm is an improved version over the one proposed in [5]. We have proposed a better method for scoring the order based on the best graph consistent with the order. We have also introduced a new way of adding pairwise priors to enhance the accuracy of learning Bayesian networks. In addition, we have proposed two strategies for distributing the scoring tasks evenly among a given number of threads in GPU. In our current implementation, we take advantage of the parallelism in the order scoring part and accelerate that part using GPU. In our future work, we will study how to accelerate the preprocessing part using GPU.
Acknowledgment
Authors would like to thank Prof. Minyou Wu for his advice and Prof. Xiaoyao Liang for offering experimental equipment to us.
References
 [1] D. Chickering, “Learning bayesian networks is npcomplete,” LECTURE NOTES IN STATISTICSNEW YORKSPRINGER VERLAG, pp. 121–130, 1996.
 [2] B. Ellis and W. Wong, “Learning causal bayesian network structures from experimental data,” Journal of the American Statistical Association, vol. 103, no. 482, pp. 778–789, 2008.
 [3] D. Heckerman, D. Geiger, and D. Chickering, “Learning bayesian networks: The combination of knowledge and statistical data,” Machine learning, vol. 20, no. 3, pp. 197–243, 1995.
 [4] N. Asadi, T. Meng, and W. Wong, “Reconfigurable computing for learning bayesian networks,” in Proceedings of the 16th international ACM/SIGDA symposium on Field programmable gate arrays. ACM, 2008, pp. 203–211.
 [5] M. Linderman, R. Bruggner, V. Athalye, T. Meng, N. Bani Asadi, and G. Nolan, “Highthroughput bayesian network learning using heterogeneous multicore computers,” in Proceedings of the 24th ACM International Conference on Supercomputing. ACM, 2010, pp. 95–104.
 [6] N. Bani Asadi, C. Fletcher, G. Gibeling, E. Glass, K. Sachs, D. Burke, Z. Zhou, J. Wawrzynek, W. Wong, and G. Nolan, “Paralearn: a massively parallel, scalable system for learning interaction networks on fpgas,” in Proceedings of the 24th ACM International Conference on Supercomputing. ACM, 2010, pp. 83–94.
 [7] U. Fayyad and K. Irani, “Multiinterval discretization of continuousvalued attributes for classification learning,” 1993.
 [8] J. Dougherty, R. Kohavi, and M. Sahami, “Supervised and unsupervised discretization of continuous features,” in MACHINE LEARNINGINTERNATIONAL WORKSHOP THEN CONFERENCE. Morgan Kaufmann Publishers, Inc., 1995, pp. 194–202.
 [9] G. Cooper and C. Yoo, “Causal discovery from a mixture of experimental and observational data,” in Proceedings of the Fifteenth conference on Uncertainty in artificial intelligence. Morgan Kaufmann Publishers Inc., 1999, pp. 116–125.
 [10] K. Sachs, O. Perez, D. Pe’er, D. Lauffenburger, and G. Nolan, “Causal proteinsignaling networks derived from multiparameter singlecell data,” Science Signalling, vol. 308, no. 5721, p. 523, 2005.
 [11] W. Rudin, Principles of mathematical analysis. McGrawHill New York, 1964, vol. 3.
 [12] J. Liu, Monte Carlo strategies in scientific computing. Springer, 2008.
 [13] G. Cooper and E. Herskovits, “A bayesian method for the induction of probabilistic networks from data,” Machine learning, vol. 9, no. 4, pp. 309–347, 1992.
 [14] NVIDIA, NVIDIA’s Next Generation CUDA Compute Architecture: Fermi. NVIDIA corporation, 2009.
 [15] D. Knuth, “The art of computer programming, vol. 4, fascicle 1: Bitwise tricks & techniques; binary decision diagrams. 2009.”
 [16] S. Zhang and Y. Zhu, GPU High Performance Computing  CUDA. Water Press, 2009.
 [17] “Bayesian network repository,” http://www.cs.huji.ac.il/site//labs/compbio/Repository/, [Online; accessed 19July2012].
 [18] T. Fawcett, “Roc graphs: Notes and practical considerations for researchers,” Machine Learning, vol. 31, pp. 1–38, 2004.