Message Scheduling for Performant, Many-Core Belief Propagation
Abstract
Belief Propagation (BP) is a message-passing algorithm for approximate inference over Probabilistic Graphical Models (PGMs), with applications in areas such as computer vision, error-correcting codes, and protein-folding. While general, the algorithm's convergence and speed have limited its practical use on difficult inference problems. Because BP is highly amenable to parallelization, many-core Graphics Processing Units (GPUs) could significantly improve its performance. Improving BP through many-core systems is nontrivial, however: the scheduling of messages in the algorithm strongly affects performance. We present a study of message scheduling for BP on GPUs. We demonstrate that BP exhibits a tradeoff between speed and convergence based on parallelism and show that existing message schedulings are not able to utilize this tradeoff. To this end, we present a novel randomized message scheduling approach, Randomized BP (RnBP), which outperforms existing methods on the GPU.
I Introduction
Probabilistic Graphical Models (PGMs) are powerful, general machine learning models that encode distributions over random variables. PGM inference, in which we seek to compute some probabilistic beliefs within the system modeled by the PGM, is in general an intractable problem, leading to dependence on approximate algorithms. Belief Propagation (BP) is a widely employed approximate inference algorithm for PGMs [27]. BP has been successfully utilized in many areas, including computer vision [7], error-correcting codes [21], and protein-folding [26].
BP is a message-passing algorithm, in which messages are passed along edges of the PGM graph. While BP is exact on tree PGMs, it is approximate on general graphs containing loops, where iterative updates are applied until convergence. Like others [6], we break the performance of BP into two properties: convergence (for how many input graphs does it reach a convergent state) and speed (how long does it take to reach the convergent state). While BP has been shown to perform well on many graphs containing loops, there is no guarantee of convergence in most cases, and graphs of varying structure and parameterization can prevent BP from converging or cause it to converge slowly [16].
General-purpose GPU computing has recently begun exploring many-core parallelism for graph-based problems [23]. This, combined with the inherent parallelism available between message updates, suggests that many-core parallelism can be effectively applied to BP to yield good performance on the GPU (that is, good convergence and speed). To ensure good performance, the implementation must take care to avoid the convergence and speed pitfalls inherently present in Belief Propagation.
In existing BP literature, there has been much interest in exploring the use of message schedulings for improving BP performance. The naive scheduling is known as Synchronous or Loopy BP (LBP), where all messages are updated in parallel [16]. Asynchronous approaches, where some amount of sequentiality is enforced during the message updates, for example via subgraph updates [22] or greedy message selection [6, 8], have been shown both empirically and theoretically to outperform LBP in single-core environments. The general intuition is that enforcing sequentialism in the scheduling encourages more direct propagation of information, thus converging faster. The contrast between LBP and Asynchronous BP introduces a parallelism vs. efficiency spectrum (also found in other graph problems such as SSSP [5]). LBP exposes high levels of parallelism but is work-inefficient. Asynchronous BP is efficient and convergent but exposes little parallelism. We hypothesize that there exists a tradeoff between the parallelism and sequentialism in Belief Propagation, and that GPUs can effectively harness that tradeoff to yield performant BP.
We start by presenting many-core, frontier-based implementations for two greedy asynchronous message schedulings, Residual Belief Propagation [6] and Residual Splash [8]. We then benchmark their performance, varying parallelism to explore how it affects the performance of BP. As expected, we find that as parallelism is increased, we see less convergence but obtain faster speed. As parallelism is decreased, we see more convergence but lower speeds. This is encouraging, as it means we can still get convergence boosts while exploiting parallelism, but we also see that existing approaches incur significant overheads, and performance is heavily tied to the choice of parallelism. To overcome these drawbacks, we propose a new message scheduling, called Randomized Belief Propagation (RnBP), which uses low-overhead, randomized scheduling and outperforms existing approaches.
To summarize, our contributions are:

Demonstration of a tradeoff between parallelism and sequentialism in terms of the speed and convergence of BP.

Demonstration that overheads prevent existing asynchronous message scheduling approaches from scaling to the GPU.
II Background
II-A Belief Propagation
We focus our attention on the Sum-Product Belief Propagation algorithm over discrete pairwise Markov Random Fields (MRFs), though we expect the results to generalize to other variants of BP. Suppose we have a set of discrete random variables $X = \{X_1, \dots, X_n\}$, each taking on a value $x_i \in A_i$, where $A_i$ is a finite set. An MRF is an undirected graph $G = (V, E)$. Each vertex $i \in V$ represents a discrete random variable $X_i$. $\Phi = \{\phi_i\}$ is the set of unary potential functions, one for each random variable. Each edge $(i, j) \in E$ represents the probabilistic relationship between two variables. $\Psi = \{\psi_{ij}\}$ is the set of binary potential functions, one for each edge. An MRF yields the following joint distribution over $X$:
$$P(x_1, \dots, x_n) \propto \prod_{i \in V} \phi_i(x_i) \prod_{(i, j) \in E} \psi_{ij}(x_i, x_j) \tag{1}$$
The goal of inference is to derive the vertices' marginal distributions $P(x_i)$. This is intractable in general; however, BP can be used to find exact marginals (for trees) or approximate marginals (for graphs containing loops). This is done through the iterative passing of messages along the edges of the graph. Each edge has two messages being passed along it, indicating each vertex's belief about the other's state. The message $m_{i \to j}$ is a distribution over $x_j$, updated as follows:
$$m_{i \to j}(x_j) \propto \sum_{x_i \in A_i} \phi_i(x_i) \, \psi_{ij}(x_i, x_j) \prod_{k \in N(i) \setminus j} m_{k \to i}(x_i) \tag{2}$$
where $N(i)$ indicates the neighbors of $i$. Each message is initialized to the uniform distribution and normalized between updates. Messages are iterated until convergence, at which point we calculate the beliefs at each vertex:
$$b_i(x_i) \propto \phi_i(x_i) \prod_{j \in N(i)} m_{j \to i}(x_i) \tag{3}$$
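As an illustration of Eqs. (2) and (3), the following is a minimal pure-Python sketch, not the authors' CUDA implementation. It runs synchronous updates on a hypothetical three-variable binary chain and compares the resulting beliefs with brute-force marginals computed from Eq. (1). The potential values and the names `phi`, `psi`, `update_message`, and `belief` are assumptions chosen for this example.

```python
from itertools import product

# Hypothetical 3-variable binary chain 0 - 1 - 2 (a tree, so BP is exact).
phi = {0: [0.7, 0.3], 1: [0.4, 0.6], 2: [0.5, 0.5]}      # unary potentials
psi = {(0, 1): [[2.0, 1.0], [1.0, 2.0]],                  # psi[(i, j)][x_i][x_j]
       (1, 2): [[1.0, 3.0], [3.0, 1.0]]}
neighbors = {0: [1], 1: [0, 2], 2: [1]}

def pairwise(i, j, xi, xj):
    """Look up psi_ij(x_i, x_j) regardless of how the edge is stored."""
    return psi[(i, j)][xi][xj] if (i, j) in psi else psi[(j, i)][xj][xi]

def normalize(v):
    s = sum(v)
    return [x / s for x in v]

def update_message(msgs, i, j):
    """Eq. (2): new message from i to j, computed from the given messages."""
    out = []
    for xj in range(2):
        total = 0.0
        for xi in range(2):
            incoming = 1.0
            for k in neighbors[i]:
                if k != j:
                    incoming *= msgs[(k, i)][xi]
            total += phi[i][xi] * pairwise(i, j, xi, xj) * incoming
        out.append(total)
    return normalize(out)

def belief(msgs, i):
    """Eq. (3): normalized belief at vertex i."""
    b = list(phi[i])
    for j in neighbors[i]:
        b = [b[xi] * msgs[(j, i)][xi] for xi in range(2)]
    return normalize(b)

# Uniform initialization, then a few synchronous sweeps.
msgs = {(i, j): [0.5, 0.5] for i in neighbors for j in neighbors[i]}
for _ in range(5):
    msgs = {(i, j): update_message(msgs, i, j) for (i, j) in msgs}
beliefs = [belief(msgs, i) for i in range(3)]

def exact_marginal(i):
    """Brute-force marginal from Eq. (1); feasible only for tiny models."""
    m = [0.0, 0.0]
    for x in product(range(2), repeat=3):
        p = phi[0][x[0]] * phi[1][x[1]] * phi[2][x[2]]
        p *= psi[(0, 1)][x[0]][x[1]] * psi[(1, 2)][x[1]][x[2]]
        m[x[i]] += p
    return normalize(m)
```

Because the example graph is a tree, the beliefs should agree with the exact marginals up to floating-point error. On graphs with loops the same update rule yields only approximations.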
II-B Message Scheduling for Belief Propagation
BP message schedulings differ by the messages that are updated each iteration. LBP simply updates every edge, every iteration in parallel. That is, all messages are updated using the previous iteration’s messages. LBP performance has been examined both empirically [16] and theoretically [15].
Asynchronous BP (ABP) approaches enforce sequentialism in message updates, updating each message using the most recent messages. That is, a single message is updated, and that update is immediately used to update other messages. If we assume the LBP update to be a max-norm contraction, ABP has convergence rate guarantees at least as good as those of LBP [6].
Both [6, 8] build on ABP using greedy update schemes. Residual Belief Propagation (RBP) [6] introduces the residual, simply defined as:
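$$r(m_{i \to j}) = \lVert m'_{i \to j} - m_{i \to j} \rVert \tag{4}$$
where $m'_{i \to j}$ is the value the message would take if it were updated now. RBP then selects the next message to update asynchronously based on the highest residual. Intuitively, the program focuses its computational effort on the parts of the graph where an update moves it furthest toward a converged state.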
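The greedy selection step can be sketched in a few lines of Python. This is an illustrative sketch only: the max-norm is an assumed choice of norm, and the message values and the names `current` and `pending` are hypothetical.

```python
def residual(old, new):
    """Max-norm distance between a message's current and pending values
    (the choice of norm is an assumption made for this sketch)."""
    return max(abs(a - b) for a, b in zip(old, new))

# Hypothetical current messages and the values a pending update would produce.
current = {("a", "b"): [0.5, 0.5], ("b", "a"): [0.6, 0.4], ("b", "c"): [0.2, 0.8]}
pending = {("a", "b"): [0.9, 0.1], ("b", "a"): [0.61, 0.39], ("b", "c"): [0.3, 0.7]}

residuals = {e: residual(current[e], pending[e]) for e in current}

# RBP-style greedy choice: update the message with the highest residual next.
next_edge = max(residuals, key=residuals.get)
```

A serial implementation would typically keep the residuals in a priority queue instead of rescanning all of them on every step.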
Residual Splash (RS) [8] is an extension of RBP for multicore parallelization. It extends residuals to vertices, where the residual of a vertex is the maximum residual of its incoming messages. As in RBP, vertices are selected greedily; in RS, however, a splash, that is, a breadth-first search (BFS) of a fixed depth around the selected vertex, is performed, with updates moving sequentially through the BFS tree. RS demonstrates linear speedup in the number of cores. In this paper we explore LBP, RBP, and RS because of their simplicity and good performance in existing work.
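A rough Python sketch of the two ingredients described above, vertex residuals and a depth-limited BFS around the selected root, is given below. The graph, the residual values, and the function names are hypothetical, and the sketch only produces the visiting order; it does not prescribe the direction in which messages are updated along the tree.

```python
from collections import deque

def vertex_residuals(msg_residuals):
    """Residual of a vertex = max residual over its incoming messages (src, dst)."""
    out = {}
    for (src, dst), r in msg_residuals.items():
        out[dst] = max(out.get(dst, 0.0), r)
    return out

def splash_order(adj, root, depth):
    """Vertices within `depth` hops of `root`, listed in BFS order."""
    order, seen, queue = [], {root}, deque([(root, 0)])
    while queue:
        v, d = queue.popleft()
        order.append(v)
        if d < depth:
            for u in adj[v]:
                if u not in seen:
                    seen.add(u)
                    queue.append((u, d + 1))
    return order

# Hypothetical 5-vertex path graph and message residuals.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
msg_res = {(0, 1): 0.05, (2, 1): 0.01, (1, 2): 0.30, (3, 2): 0.10, (2, 3): 0.02}

vres = vertex_residuals(msg_res)
root = max(vres, key=vres.get)          # greedy choice of the splash root
order = splash_order(adj, root, depth=1)
```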
II-C Related Work
BP has been implemented on the GPU for specific BP workloads, including stereo matching [10, 3] and error-correcting codes [4]. Several works specifically explore memory usage, as the unique architecture of the GPU closely ties memory use and performance. Grauer et al. [9] explore using registers, shared memory, and local memory for Belief Propagation and their effect on GPU occupancy for the stereo matching problem. Liang et al. [11] show a general approach for reducing memory usage for BP by storing only the messages along the edges of partitions of the graph, allowing messages to be stored in faster shared memory. While we do not explore memory use, our message scheduling work combines naturally with the memory work of both of these approaches.
Several works explore different message schedulings on the GPU for specific BP applications. Yang et al. [25] filter the messages to be updated by removing any messages that have already converged. We employ the same filter as one of the filters in our final RnBP scheduling approach. Xiang et al. [24] change BP on a grid-based stereo problem by using directional updates, that is, messages are updated along dimensions of the grid. This directional update is specific to grid-based models such as those used in computer vision. Romero et al. [21] constructed an LDPC code structure in such a fashion that the updates could be partitioned so that many could be completed in parallel while still maintaining sequentiality overall. In general, we cannot control the problem as in their case, and creating effective message partitions is problem-specific and nontrivial. Our work takes a general approach that can apply to any BP problem, and explores message schedulings that have not yet been explored on the GPU, to the authors' best knowledge.
III Frontier-Based Belief Propagation on the GPU
We present all algorithms examined as realizations of a frontier-based BP framework. In this section, we implement several existing schedulings and benchmark their performance. In the next section, we introduce our own GPU-centric scheduling approach, Randomized Belief Propagation.
To transfer the schedulings onto the synchronous, many-core architecture of the GPU, we utilize a data-centric, frontier-based parallelization framework [23, 20]. We consider the frontier to be the set of messages selected to be updated synchronously and in parallel each iteration. Message schedulings differ in their selection of the frontier, but follow the same general structure presented in Algorithm 1.
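The general structure of such a frontier-based loop can be sketched as follows. This is a hedged, CPU-side illustration in Python and not a transcription of Algorithm 1: the function names, the convergence threshold `eps`, and the toy update rule (each message moves halfway toward a fixed target) are assumptions made so the sketch is runnable on its own.

```python
def run_frontier_bp(messages, select_frontier, compute_update,
                    eps=1e-3, max_iters=1000):
    """Generic loop: compute residuals, select a frontier, update it, repeat."""
    for it in range(max_iters):
        # Pending values and residuals use only the current messages.
        pending = {e: compute_update(messages, e) for e in messages}
        residuals = {e: max(abs(a - b) for a, b in zip(messages[e], pending[e]))
                     for e in messages}
        if max(residuals.values()) < eps:
            return messages, it                # all messages have converged
        frontier = select_frontier(residuals)
        for e in frontier:                     # on a GPU these run in parallel
            messages[e] = pending[e]
    return messages, max_iters

def lbp_frontier(residuals):
    """LBP: every message is placed in the frontier every iteration."""
    return list(residuals)

# Toy stand-in for the BP update rule: move halfway toward a fixed target.
target = {"m1": [0.2, 0.8], "m2": [0.6, 0.4]}

def toy_update(messages, e):
    return [(a + t) / 2 for a, t in zip(messages[e], target[e])]

msgs, iters = run_frontier_bp({"m1": [0.5, 0.5], "m2": [0.5, 0.5]},
                              lbp_frontier, toy_update)
```

Other schedulings would plug in by swapping `select_frontier` while leaving the rest of the loop unchanged.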
III-A Greedy Update Frontier Selection
We use this frontier-based approach to implement several existing schedulings on the GPU, specifically LBP, RBP, and RS. LBP is simple to implement in this framework: every iteration, all the messages are put in the frontier to be updated. RBP and RS rely on greedily selecting updates based on message residuals. In order to explore the tradeoff between parallelism and greedy sequentialism, we adjust the greedy approach to select multiple elements as a frontier per iteration as opposed to a single element. We can consider this to be the selection of the top-$k$ values for update each iteration. Adjusting $k$ allows us to adjust parallelism.
For single-core implementations, the primary data structure employed to perform these greedy updates is a Priority Queue. While concurrent priority queues have been developed, they rely on mutual exclusion, and thus are best suited for asynchronous environments, unlike the GPU. Other work on using GPUs for algorithms with Priority Queue based methods has turned to other approaches, involving sort-and-select, binning, or problem division [5, 19]. Several algorithms for direct top-$k$ GPU selection exist [2, 14], but speedup only occurs for very large problem sizes. We choose to use a simple sort-and-select approach in order to select the top-$k$ elements.
We now present the high-level approach for our bulk-parallel greedy update selection. We maintain the residuals of either the messages for RBP or the vertices for RS. Each iteration, we perform a key-value sort of the residuals with their corresponding vertices/edges. The top-$k$ elements after the sort form the update frontier. RBP updates this frontier directly; RS updates the splash around the selected nodes. A single update is visualized in Fig. 1.
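As a small CPU-side illustration of sort-and-select, the sketch below sorts residuals together with their ids and keeps the largest ones as the frontier. It is only meant to show the selection logic; the function name, the parameter `k`, and the residual values are hypothetical, and a GPU implementation would use a parallel key-value sort instead of Python's `sorted`.

```python
def top_k_frontier(residuals, k):
    """Key-value sort by residual (descending), then keep the first k ids."""
    ranked = sorted(residuals.items(), key=lambda kv: kv[1], reverse=True)
    return [elem_id for elem_id, _ in ranked[:k]]

# Hypothetical residuals keyed by message (or vertex) id.
residuals = {0: 0.02, 1: 0.40, 2: 0.15, 3: 0.00, 4: 0.31}

frontier = top_k_frontier(residuals, k=2)
```

Setting `k` to the total number of elements places everything in the frontier, while `k=1` reduces to selecting the single highest-residual element.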
III-B Implementation
We implement LBP, RBP, and RS using NVIDIA's CUDA library [17]. We use a simple adjacency list format for storing graph structure and parameterization. Each edge and vertex is assigned an ID, and for parallel operations, each thread is assigned a subset of the IDs to update. We use the CUDA occupancy API for kernel launch settings and the Radix Sort from NVIDIA's CUB library for the sort operation [13]. We implement serial RBP (SRBP) as a performance benchmark, using the same adjacency list format and the Boost library's Fibonacci heap for the Priority Queue.
III-C Benchmarks
To accurately benchmark performance, we would like to be able to adjust the difficulty of the inference problem. A synthetic benchmark that gives us control over difficulty is the Ising dataset, a standard benchmark for message propagation algorithms [6]. Ising grids are $N \times N$ grids of binary variables. Univariate potentials are randomly sampled from the $[0, 1]$ range. The pairwise potentials are set to $e^{\lambda C}$ when $x_i = x_j$ and $e^{-\lambda C}$ otherwise. $\lambda$ is sampled from $[-0.5, 0.5]$ to make certain potentials favor agreement while others favor disagreement. Varying $C$ changes the difficulty of the inference problem (higher being more difficult). For RBP and RS, we test on Ising grids of size and , with . We also run on simpler chain graphs, where binary variables are arranged in a single long chain. When a graph is a chain, BP is guaranteed to converge. We sample the unary and pairwise potentials in the same manner used for our Ising grids. For RBP and RS, we test on chain graphs of size , with .
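A generator for such a benchmark might look like the sketch below. It assumes the common form of this benchmark, in which each pairwise potential is exp(λC) when the two variables agree and exp(−λC) otherwise, with λ drawn uniformly from [−0.5, 0.5]; the function name, the `seed` parameter, and the dictionary layout are choices made for this example only.

```python
import math
import random

def make_ising_grid(n, C, seed=0):
    """Return (unary, pairwise) potentials for an n x n grid of binary variables."""
    rng = random.Random(seed)
    unary, pairwise = {}, {}
    for r in range(n):
        for c in range(n):
            v = r * n + c
            # Each unary potential entry is drawn from [0, 1).
            unary[v] = [rng.random(), rng.random()]
            # Connect to the right and lower neighbors (each edge stored once).
            for dr, dc in ((0, 1), (1, 0)):
                rr, cc = r + dr, c + dc
                if rr < n and cc < n:
                    lam = rng.uniform(-0.5, 0.5)
                    agree, disagree = math.exp(lam * C), math.exp(-lam * C)
                    pairwise[(v, rr * n + cc)] = [[agree, disagree],
                                                  [disagree, agree]]
    return unary, pairwise

unary, pairwise = make_ising_grid(3, C=2.5)
```

A positive λ makes an edge favor agreement and a negative λ makes it favor disagreement; a larger `C` pushes the potentials further from uniform, which is what makes inference harder.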
III-D Performance
In order to examine parallelism's effect on performance, we introduce a multiplier , where the frontier size each round is . Varying this multiplier thus varies the parallelism used. For RS, we lock splash depth to be . (Exploration of different splash depths could be interesting, though we change our focus to randomized updates, and thus do not pursue this further.) We time how long it takes the message updates to converge. Our GPU code is run on a single NVIDIA Tesla V100 and our CPU code is run on Intel Xeon processors.
Fig. 2 shows GPU RS performance on our three benchmarks as cumulative convergence graphs, indicating the cumulative percentage of the set of input graphs that have converged as a function of time. GPU RBP exhibits the same patterns on each dataset and thus is not shown for brevity.
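For readers reproducing this kind of plot, a cumulative convergence curve can be computed from per-graph convergence times as in the following sketch. The function name and the sample times are hypothetical; `None` marks a graph that never converged within the time budget.

```python
def cumulative_convergence(times, time_points):
    """Percentage of graphs converged by each time point.

    times[i] is graph i's convergence time in seconds, or None if it
    did not converge.
    """
    n = len(times)
    return [100.0 * sum(1 for t in times if t is not None and t <= tp) / n
            for tp in time_points]

# Hypothetical convergence times for four graphs.
curve = cumulative_convergence([0.2, 1.5, None, 0.9], [0.5, 1.0, 2.0])
```

A curve that rises steeply indicates speed, and the height of its final plateau indicates how many graphs converged at all.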
Our results indicate that a tradeoff does indeed exist between parallelism and sequentialism. Specifically, we see that as we decrease the multiplier, that is, as we reduce our parallelism, more graphs converge, but they take longer to do so. Thus, low parallelism encourages convergence, while high parallelism encourages speed. LBP, with full parallelism, demonstrates only partial convergence, while RS is able to extend convergence, given time, by reducing parallelism (Fig. 2).
In Tables I and II, we show the speedup results comparing GPU RBP and RS to SRBP. We compare with the fastest setting in our test runs that converges on all or most of the graphs, indicated for each dataset. For cases where SRBP convergence did not occur (i.e., SRBP failed to converge on all but the Ising , dataset), we provide a conservative lower bound on speedup based on how long we gave SRBP to run (90 seconds). We see that RS outperforms RBP and both outperform SRBP.
Dataset  Settings  SRBP Speedup 

Ising ,  3.47x  
Ising ,  x  
Chain ,  x 
Dataset  Settings  SRBP Speedup 

Ising ,  25.85x  
Ising ,  x  
Chain ,  x 
There are two primary shortcomings to RBP and RS. First, performance relies heavily upon the chosen parallelism, and selecting it effectively is nontrivial. Second, the sort-and-select approach incurs significant overhead. This is best demonstrated by the easy chain dataset (Fig. 2), where RS takes significantly longer than LBP, which converges very quickly. Profiling indicates that on many graphs, both RBP and RS spend more than 90% of runtime in the sort-and-select step, and up to 98% for certain runs.
IV Randomized Belief Propagation
To overcome the shortcomings of existing approaches on the GPU and exploit the tradeoff we have demonstrated, we present our novel, low-overhead, randomized message scheduling technique for Belief Propagation on the GPU, Randomized Belief Propagation (RnBP).
IV-A Algorithm
We hypothesize that, in a many-core environment, varying the parallelism affects performance more than the specific selection of messages each round. We thus perform random selection as opposed to exact top-$k$ selection.
In creating our message frontier, we employ two filters. In order to encourage selection to be similar to the top-$k$, we only choose the messages to update from those whose residual is above the convergence threshold. Thus, our first filter prunes all messages whose next update will move them less than the convergence threshold.
The second filter is our randomized filter. We randomly select some percentage of the remaining messages to be updated. Adjusting this percentage thus allows us to adjust the parallelism for that round. A single update is visualized in Fig. 3.
Finally, we dynamically adjust the parallelism based on the convergence of the run. Throughout the run, we can track how many of the edges have not converged. The ratio between the numbers of unconverged edges in consecutive iterations becomes an indicator of runtime convergence performance: . If this ratio is low, it is indicative of good convergence; if it is high, it is indicative of bad convergence. We introduce two parallelism settings, one high and one low. We know from our results in Section III that low parallelism encourages convergence and high parallelism encourages speed. Thus, if , we use the lower parallelism setting, thus encouraging convergence. Otherwise, we use the higher parallelism setting, thus encouraging speed. We note that overhead prevented similar dynamic selection from aiding GPU RBP/RS.
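The two filters and the dynamic switch can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the released implementation: selecting each surviving message independently with probability `p` is one possible way to realize the random filter, the ratio is taken as the current unconverged count divided by the previous one, and `ratio_threshold` is a hypothetical parameter standing in for the switching condition.

```python
import random

def rnbp_frontier(residuals, eps, p, rng):
    """Filter 1 drops messages with residual below eps; filter 2 keeps each
    remaining message independently with probability p (an assumed scheme)."""
    active = [e for e, r in residuals.items() if r >= eps]
    return [e for e in active if rng.random() < p]

def choose_parallelism(unconverged_now, unconverged_prev,
                       low_p, high_p, ratio_threshold):
    """Use low parallelism when the unconverged count shrinks too slowly.

    ratio_threshold is a hypothetical knob for this sketch.
    """
    if unconverged_prev == 0:
        return high_p
    ratio = unconverged_now / unconverged_prev
    return low_p if ratio > ratio_threshold else high_p

# Hypothetical residuals for 100 messages: message i has residual i / 100.
rng = random.Random(0)
residuals = {i: i / 100.0 for i in range(100)}

full = rnbp_frontier(residuals, eps=0.5, p=1.0, rng=rng)
partial = rnbp_frontier(residuals, eps=0.5, p=0.4, rng=rng)
p_next = choose_parallelism(90, 100, low_p=0.4, high_p=1.0, ratio_threshold=0.8)
```

Neither filter requires a sort, which is the source of the low overhead: both are independent per-message decisions that map naturally onto one thread per message.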
IV-B Implementation
IV-C Benchmarks
We use the same chain and Ising grid benchmarks described in Section III-C. We test with Ising grids of size with and of size with . For chain graphs, we test with size with .
IV-D Performance
Again, our GPU code is run on a single NVIDIA Tesla V100 and our CPU code is run on Intel Xeon processors. We continue to compare to LBP and SRBP.
As for RBP and RS, we can vary our high and low parallelism settings to get different parallelisms during run time. We found that for our synthetic dataset the high parallelism setting mattered less than the low parallelism setting. As such, we locked our high parallelism to be a full update; thus, whenever , we update the full message frontier. We show performance on all datasets with the low parallelism setting set to 0.7, 0.4, and 0.1.
Fig. 4 shows GPU RnBP performance on our benchmarks as cumulative convergence graphs. For easy graph datasets, where LBP converges for most or all graphs, we notice that RnBP with higher parallelism settings (i.e., ) nearly matches LBP performance (see Fig. 4). This shows the value of RnBP's lack of overhead. As the graphs become more difficult, where LBP only converges on some, we see that RnBP continues to converge quickly on all graphs (see Fig. 4). RnBP converges with much higher parallelism than that required for RS and RBP. Using the higher parallelism settings allows speed paired with convergence. We see that this allows RnBP to actually provide speedups over GPU LBP runtimes (Fig. 4), averaging 9x speedups on the Ising 200×200, C=2.5 dataset.
Notice that LBP fails to converge on any graphs for the difficult 100×100, C=3 dataset. We see, however, that we can effectively drop parallelism in RnBP to encourage convergence (see Fig. 4). We do so without incurring significant overheads that would yield dramatic slowdowns. This convergence behavior applies to larger and more difficult graphs than the ones RBP and RS could handle. RnBP thus extends the classes of Belief Propagation problems for which GPU speedups can be applied. We note that for the difficult dataset, RnBP can still be sensitive to the selected parallelism. However, on all our other datasets, RnBP is fairly robust to parallelism selection. Thus, while the parallelism selection problem is not completely solved, RnBP is a considerable improvement over existing approaches.
We characterize the speedup of RnBP over SRBP in Table III. Again, we compare with the fastest setting in our test runs that converges on all or most of the graphs, indicated for each dataset, and present conservative lower bounds when SRBP failed to converge (given 90 seconds).
Dataset  Settings  SRBP Speedup 

Ising ,  2203.58x  
Ising ,  1135.05x  
Ising ,  61.28x  
Ising ,  x  
Chain ,  x 
IV-E Additional Tests
As RnBP is a novel message scheduling, we provide several additional tests to examine performance. To test correctness, we created a smaller Ising dataset, size , , for which exact inference is tractable. We use Variable Elimination to find the exact marginal values, then determine the KL divergence between the exact results and the results of both SRBP and RnBP (run with ). These are shown in Fig. 5. We see that RnBP achieves the same quality of result as SRBP.
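The comparison metric can be computed per vertex as in the sketch below. The direction of the divergence (exact marginal first, approximate belief second) and the sample distributions are assumptions made for illustration; the sketch also assumes the approximate belief is nonzero wherever the exact marginal is.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in nats for discrete distributions.

    Terms with p[x] == 0 contribute 0; q[x] must be positive wherever
    p[x] is positive.
    """
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0.0)

# Hypothetical exact marginal and BP belief for one binary vertex.
exact = [0.8, 0.2]
approx = [0.5, 0.5]
divergence = kl_divergence(exact, approx)
```

A divergence of zero means the belief matches the exact marginal, so averaging this quantity over all vertices gives a single quality score per graph.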
We tested RnBP on a real-world dataset, specifically a protein-folding dataset [26]. This dataset contains graphs with vertices representing amino acid units and the setting at each vertex representing the side-chain configuration. The number of possible settings at each vertex ranges from 2 to 81 and the graph structure is highly irregular. The cumulative convergence is shown in Fig. 4 (we run RnBP with ). Despite the different structure as compared to our synthetic dataset and without any fine-tuning to handle load-imbalanced message updates, we see that RnBP yields fast, convergent performance. Given 3 minutes per graph, RnBP was the only approach to converge on all graphs and yielded an average of 4.4x speedup over SRBP when SRBP converged.
V Conclusions and Future Work
In this work, we presented a study of message scheduling approaches for BP on many-core GPU systems (summarized in Table IV). We hypothesized the existence of a tradeoff between parallelism and sequentialism for BP speed and convergence, and that GPUs could be used to exploit that tradeoff for performant BP. We presented many-core, frontier-based implementations for two asynchronous message schedulings, RBP [6] and RS [8], and showed empirically that such a tradeoff indeed exists. Specifically, lower parallelism encourages convergence, while higher parallelism encourages speed. We also showed that these approaches incur significant overhead, suggesting that a new GPU-centric approach is needed. In this direction, we presented a novel message scheduling we call Randomized Belief Propagation (RnBP), which utilizes randomization to select frontiers for updating. We demonstrated that this approach yields higher convergence while maintaining speed, providing speedups over serial and existing GPU methods on both synthetic and real-world datasets. Our implementation is available online at https://github.com/mvandermerwe/BPGPUMessageScheduling.
Algorithm  Frontier Selection  Many-Core 

GPU LBP  All Messages  ✓ 
Serial RBP/RS  Priority Queue  X 
GPU RBP/RS  Sort-and-Select  ✓ 
GPU RnBP  Randomized  ✓ 
Acknowledgment
This work was supported in part by NSF awards 1704715 and 1817073.
References
 [1] (2000) The generalized distributive law. IEEE Transactions on Information Theory 46 (2), pp. 325–343.
 [2] (2012) Fast k-selection algorithms for graphics processing units. Journal of Experimental Algorithmics (JEA) 17, pp. 4–2.
 [3] (2006) Belief propagation on the GPU for stereo vision. In Computer and Robot Vision, 2006. The 3rd Canadian Conference on, pp. 76–76.
 [4] (2012) A GPU implementation of belief propagation decoder for polar codes. In Signals, Systems and Computers (ASILOMAR), 2012 Conference Record of the Forty Sixth Asilomar Conference on, pp. 1272–1276.
 [5] (2014) Work-efficient parallel GPU methods for single-source shortest paths. In 2014 IEEE 28th International Parallel and Distributed Processing Symposium, pp. 349–359.
 [6] (2006) Residual belief propagation: informed scheduling for asynchronous message passing. In Proceedings of the Twenty-Second Conference on Uncertainty in Artificial Intelligence, pp. 165–173.
 [7] (2006) Efficient belief propagation for early vision. International Journal of Computer Vision 70 (1), pp. 41–54.
 [8] (2009, 16–18 Apr) Residual splash for optimally parallelizing belief propagation. In Proceedings of the Twelth International Conference on Artificial Intelligence and Statistics, D. van Dyk and M. Welling (Eds.), Proceedings of Machine Learning Research, Vol. 5, Hilton Clearwater Beach Resort, Clearwater Beach, Florida USA, pp. 177–184.
 [9] (2010) Optimizing and auto-tuning belief propagation on the GPU. In International Workshop on Languages and Compilers for Parallel Computing, pp. 121–135.
 [10] (2008) GPU implementation of belief propagation using CUDA for cloud tracking and reconstruction. In Pattern Recognition in Remote Sensing (PRRS 2008), 2008 IAPR Workshop on, pp. 1–4.
 [11] (2011) Hardware-efficient belief propagation. IEEE Transactions on Circuits and Systems for Video Technology 21 (5), pp. 525–537.
 [12] (2018) Graph partition neural networks for semi-supervised classification. arXiv preprint arXiv:1803.06272.
 [13] (2015) CUDA unbound (CUB) library.
 [14] (2011) Randomized selection on the GPU. In Proceedings of the ACM SIGGRAPH Symposium on High Performance Graphics, pp. 89–98.
 [15] (2007) Sufficient conditions for convergence of the sum–product algorithm. IEEE Transactions on Information Theory 53 (12), pp. 4422–4437.
 [16] (1999) Loopy belief propagation for approximate inference: an empirical study. In Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence, pp. 467–475.
 [17] (2010) CUDA programming guide.
 [18] (2010) CURAND library.
 [19] (2010) A GPU-based application framework supporting fast discrete-event simulation. Simulation 86 (10), pp. 613–628.
 [20] (2011) The tao of parallelism in algorithms. In ACM SIGPLAN Notices, Vol. 46, pp. 12–25.
 [21] (2012) Sequential decoding of non-binary LDPC codes on graphics processing units. In Signals, Systems and Computers (ASILOMAR), 2012 Conference Record of the Forty Sixth Asilomar Conference on, pp. 1267–1271.
 [22] (2003) Tree-based reparameterization framework for analysis of sum-product and related algorithms. IEEE Transactions on Information Theory 49 (5), pp. 1120–1146.
 [23] (2017) Gunrock: GPU graph analytics. ACM Transactions on Parallel Computing (TOPC) 4 (1), pp. 3.
 [24] (2012) Real-time stereo matching based on fast belief propagation. Machine Vision and Applications 23 (6), pp. 1219–1227.
 [25] (2006) Real-time global stereo matching using hierarchical belief propagation. In BMVC, Vol. 6, pp. 989–998.
 [26] (2003) Approximate inference and protein-folding. In Advances in Neural Information Processing Systems, pp. 1481–1488.
 [27] (2001) Generalized belief propagation. In Advances in Neural Information Processing Systems, pp. 689–695.