mts: a light framework for parallelizing tree search codes
Abstract
We describe version 0.1 of mts, a generic framework for parallelizing certain types of tree search programs using a single common wrapper. This complements a previous tutorial that focused on using a preliminary version of mts. mts supports sharing information between processes, which is important for applications such as satisfiability testing and branch-and-bound. No parallelization is implemented in the legacy single-processor code, minimizing the changes needed and simplifying debugging. mts is written in C, uses MPI for parallelization and can be used on a network of computers. As examples we parallelize two simple existing reverse search codes, generating topological sorts and generating spanning trees of a graph, and two codes for satisfiability testing. We give experimental results comparing the parallel codes with other codes for the same problems.
Keywords: reverse search, parallel processing, topological sorts, spanning trees
Mathematics Subject Classification (2000) 90C05
1 Introduction
Parallel programming is a vast area and there is a great amount of literature on it (see, e.g., Mattson et al. [31]). Topics include architecture, communication, data sharing, interrupts, deadlocks, load balancing, and the distinction between shared memory and distributed computing. This is all essential for building an efficient parallel algorithm from scratch.
Our starting point was different. We had a large complex code, lrs, developed over about 20 years and tested extensively, which solved vertex/facet enumeration problems. These problems are notoriously hard and running times often take weeks or longer. The underlying algorithm, reverse search, was clearly suitable for parallelization. Nevertheless, the mathematical intricacy of the underlying problem rendered the algorithmic engineering of direct parallelization daunting. This led us to consider building all of the parallelization into a wrapper, making only minor changes to the underlying lrs code. There followed a series of implementations resulting ultimately in the authors’ mplrs code [7]. The key features of mplrs are: (a) there is no parallel code inside lrs, (b) parallel threads execute lrs on non-overlapping subproblems, (c) there is no communication between threads except at the beginning and end of a subproblem execution, (d) the computation can be distributed over a cluster of computers, and (e) the wrapper is directly inserted into the lrs library. Most of the topics in parallel computation mentioned above are not major issues in this restricted framework. The exception is load balancing, for which we use a particularly simple method which consists of budgeting the number of nodes evaluated in a subproblem.
It seemed likely that similar results could be obtained for other algorithms based on reverse search (in 2008, John White made a list of 130 different applications and implementations; see the link at [6]) or similar easily parallelizable tree search methods. Many such sequential codes exist, so designing custom wrappers for each is not desirable. Our goal was to build a single generic wrapper that could be used, with little if any modification, to do the required parallelization while maintaining features (a)–(e) described above. This resulted in mts, presented here. The current implementation (the version used here is available at https://wwwalg.ist.hokudai.ac.jp/~skip/mts/) uses MPI and works on clusters of machines. The mts framework is more general than mplrs in that it allows the sharing of data obtained by subproblems, but still maintains the absence of communication between threads. This broadens the applicability to more general tree search problems such as satisfiability testing and branch and bound.
In Section 2 we survey the literature on parallelizing reverse search codes. We then describe our general approach in Section 3 and apply it to reverse search in Section 4. We give concrete examples for two simple enumeration problems: generating topological sorts and spanning trees of a graph. While the purpose of mts is to parallelize much more complex enumeration problems (see for example the recent application [26] of mts to enumerating triangulations), there were several reasons for choosing these simple well solved problems. They were described in detail in the original reverse search tutorial [6], are easily solved by reverse search, have existing codes, and provide simple examples of how to apply mts.
Tree search has wide uses, of which enumeration is just one example. In fact it is a very specific example as all nodes in the enumeration tree are visited. Two other important uses of tree search are satisfiability testing and branch and bound. Here the goal is not to search the entire tree but to prune subtrees when possible. The tree generated in these cases will normally differ depending on the choices made at early stages and the sharing of information learned during the computation. The mts framework includes support for sharing data between processes and can be applied to these types of problems. As an example we present a parallelization for satisfiability testing in Section 5, demonstrating how little of the original code needs to be changed.
In Section 6 we give computational results for the parallelized codes described in this paper. This is followed in Section 7 by a discussion of how to evaluate the experimental results, the situation being quite different for enumeration problems and for those problems where pruning is used. For the enumeration problems we get near linear speedup using several hundred cores. For the satisfiability problem we show a large improvement in the number of SAT instances that can be solved in a given fixed time period. Finally we give some conclusions and areas for future research in Section 8.
2 Survey of previous work
The reverse search method, initially developed for vertex enumeration, was extended to a wide variety of enumeration problems [5]. From the outset it was realized that it was eminently suitable for parallelization. In 1998, Marzetta announced his ZRAM parallelization platform [14, 30] which can be used for reverse search, backtracking and branch and bound codes. He successfully used it to parallelize several reverse search and branch and bound codes, including lrs from which he derived the prs code. Load balancing is performed using a variant of what is now known as job stealing. Application codes, such as lrs, were embedded into ZRAM itself leading to problems of maintenance as the underlying codes evolved. Although prs is no longer distributed and was based on a now obsolete version of lrs, it clearly showed the potential for large speedups of reverse search algorithms.
The reverse search framework in ZRAM was also used to implement a parallel code for certain quadratic maximization problems [20]. In a separate project, Weibel [36] developed a parallel reverse search code to compute Minkowski sums. This C++ implementation runs on shared memory machines and he obtains linear speedups with up to 8 processors, the largest number reported.
ZRAM is a general-purpose framework that is able to handle a number of other applications, such as branch and bound and backtracking, for which there are by now a large number of competing frameworks. Recent papers by Crainic et al. [17], McCreesh et al. [32] and Herrera et al. [23] describe over a dozen such systems. While branch and bound may seem similar to reverse search enumeration, there are fundamental differences. In enumeration it is required to explore the entire tree whereas in branch and bound the goal is to explore as little of the tree as possible until a desired node is found. The bounding step removes subtrees from consideration and this step depends critically on what has already been discovered. Hence the order of traversal is crucial and the number of nodes evaluated varies dramatically depending on this order. Sharing of information is critical to the success of parallelization. These issues do not occur in reverse search enumeration, and so a much lighter wrapper is possible.
Relevant to the heaviness of the wrapper and amount of programming effort required, a comparison of three frameworks is given in [23]. The first, Bob++ [18], is a high level abstract framework, similar in nature to ZRAM, on top of which the application sits. This framework provides parallelization with relatively little programming effort on the application side and can run on a distributed network. The second, Threading Building Blocks (TBB) [34], is a lower level interface providing more control but also considerably more programming effort. It runs on a shared memory machine. The third framework is the Pthread model [15] in which parallelization is deep in the application layer and migration of threads is done by the operating system. It also runs on a shared memory machine. All of these methods use job stealing for load balancing [13]. In [23] these three approaches are applied to a global optimization algorithm. They are compared on a rather small setup of 16 processors, perhaps due to the shared memory limitation of the last two approaches. The authors found that Bob++ achieved a disappointing speedup of about 3 times, considerably slower than the other two approaches which achieved near linear speedup.
A more sophisticated framework for parallelizing application codes over large networks of computers is MW, which works with the distributed environment of HTCondor (available at https://research.cs.wisc.edu/htcondor/mw/). MW is a set of C++ abstract base classes that allow parallelization of existing applications based on the master-worker paradigm [21]. We employ the same paradigm in mts, although our load balancing methods are different. MW has been used successfully to parallelize combinatorial optimization problems such as the Quadratic Assignment Problem, see the MW home page for references. Although MW could be used to parallelize reverse search algorithms, we are not aware of any such applications.
3 The mts framework
The goal of mts is to parallelize existing tree search codes with minimal internal modification of these codes. The tree search codes should satisfy certain conditions, specified below. The mts implementation starts a user-specified number of processes on a cluster of computers. One process becomes the master, another becomes the consumer, and the rest become workers, which essentially run the original tree search code on specified subtrees. Communication is limited; workers are not interrupted and do not communicate between themselves.
The master sends the input data and parametrized subproblems to workers, informs the other processes to exit when appropriate, and handles checkpointing. The consumer receives and synchronizes output. Workers get budgeted subproblems from the master, run the legacy code, send output to the consumer, and return unfinished subproblems to the master. This general approach is similar to but simpler than the well-known work-stealing approach [13].
Generating subproblems can be done in many ways. One way would be to report nodes at some initial fixed depth. This works well for balanced trees but many trees encountered in practice are highly unbalanced and the vast majority of subtrees contain few nodes. Increasing the initial search depth does not solve this problem. Ideally we would only break up the large subtrees and in the development of mplrs we tried various ways to estimate the size of a given subtree. Experimentally this did not work well due to the high variance of the estimator and the wasted cost of doing many estimates.
The idea that worked best, and is implemented in mts, was also the simplest: a heuristic to determine large subtrees called budgeting. When assigning work the master specifies that a worker should terminate after completing a certain amount of work, called a budget, and then return a list of unexplored subtrees. The precise budget may depend on the application. For enumeration problems it could be the number of nodes visited by the worker. Some advantages of budgeting are:

small subtrees are explored without being broken up

large subtrees will be broken up repeatedly

each worker returns periodically for reassignment, at which point it can supply information to be passed on to other workers and receive such information

it is implemented on the fly and avoids the duplication of work done in estimation

it can be varied dynamically during execution to control the job list size

when used statically and without pruning, the overall job list produced is deterministic and independent of the number of workers
This last item is useful for debugging purposes and also enables a theoretical analysis of the job list size under certain random tree models, see [4]. In particular, methods that limit work based on time (such as “begetting” in MW) do not have this property.
Implementing budgeting does not require interrupting workers or communication between workers. The master uses dynamic budgets to control the job list: small budgets break up more subtrees and lengthen the job list, while large budgets have the reverse effect.
Additional features of mts include checkpointing and restarts, allowing the user to move jobs or free computing resources without losing work. mts can produce various histograms to help tune performance. Histograms and their uses are described in Section 7.
3.1 Sequential tree search code
To be suitable for parallelization with mts the underlying tree search code, which we will call search, must satisfy a few properties. First, when given a positive budget, search should either finish the given job or return a list of unexplored nodes. Any unexplored node should represent a smaller portion of the unfinished work, i.e. running search (with positive budgets) on the unexplored nodes and any resulting unexplored nodes will eventually result in finishing the original job. The code should also interpret the budget in some suitable way where larger budgets correspond to doing more work than smaller budgets. This may require some modification of the legacy code. Our applications usually interpret the budget as number of traversed nodes and depth, but this is not required (see conflict budgeting in Section 5.1).
Any given worker must be able to work on any given unexplored node that mts has seen. It is helpful for the unexplored nodes to represent nonoverlapping jobs. mts supports sharing data between workers, but it is helpful for shared data to be small. Implementing a shared memory version of mts could help performance when large amounts of data are shared. Shared data is not used in our enumeration applications. It is used for satisfiability and similar applications to prune the search tree.
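As a minimal illustration of this contract (our own toy example, not part of the mts distribution), consider a "search" whose job is simply a range of integers to count. Given a budget it either finishes the job or returns the untouched remainder as an unexplored job; rerunning the remainders eventually completes the original job, which is exactly the property mts requires.

```c
#include <assert.h>

/* Toy model of a budget-respecting search. A "job" is a half-open
   range [lo, hi) of integers; search visits at most `budget` of them
   and hands back any untouched remainder as an unexplored job. */
typedef struct { long lo, hi; } job_t;

/* Returns the number of items visited. If the budget was exhausted,
   sets *have_left and stores the unexplored remainder in *left. */
long toy_search(job_t job, long budget, job_t *left, int *have_left) {
    long n = job.hi - job.lo;
    if (n <= budget) {              /* finished within budget */
        *have_left = 0;
        return n;
    }
    left->lo = job.lo + budget;     /* strictly smaller leftover */
    left->hi = job.hi;
    *have_left = 1;
    return budget;
}
```

Since every unexplored job is strictly smaller than its parent, any positive budget leads to termination, and the total work done equals the size of the original job.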
3.2 Master process
The master process begins with initialization, including obtaining an application-provided initial subproblem. It places this initial subproblem in a (new) job list L, and then enters the main loop. In this main loop, the master assigns budgeted subproblems to workers, collects unfinished subproblems to add to L, and collects/sends shared-data updates from/to the workers. Routing updates through the master is not essential: it simplifies checkpointing but can increase load on the master and interconnect. Each worker either finishes its subproblem or reaches its budget limitation (the maxnodes and maxdepth parameters) and returns unfinished subproblems to the master for insertion into L. This continues until no workers are running and the master has no unfinished subproblems. Once the main loop ends, the master informs all processes to finish. The main loop performs the following tasks:

send subproblems and relevant shared-data updates to free workers when available;

check whether any workers are done; mark them as free and receive their unfinished subproblems;

check for and receive shared-data updates.
Pseudocode is given as Algorithm 3 in the Appendix. Communication is nonblocking and work proceeds when required information is available.
Using reasonable parameters is critical to performance. This is done dynamically by observing |L|, the current length of the job list. We use parameters lmin, lmax and scale, which depend on the type of tree search problem being handled. The following default values are used in this paper. Initially, to create a reasonable size list L, we set maxdepth = 2 and maxnodes = 5000. Therefore the initial worker will generate subtrees at depth 2 until 5000 nodes have been visited and then terminate, sending roots of unvisited subtrees back to the master. Additional workers are given the same aggressive parameters until |L| grows larger than lmin times the number of processors, at which point the maxdepth restriction is removed. Once |L| is larger than lmax times the number of processors, we multiply the budget maxnodes by scale. With the default scale = 40, workers will not generate any new subproblems unless their tree has at least 200,000 nodes. If |L| drops below these bounds we return to the smaller budgets. The defaults are lmin = 1 and lmax = 3. In Section 7 we show an example of how |L| typically behaves with these settings.
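A simplified sketch of this control logic (our own illustration, using the names lmin, lmax and scale for the thresholds and multiplier; the real mts master is more involved):

```c
#include <assert.h>

/* Sketch of dynamic budget selection driven by the job list length.
   maxdepth < 0 means "no depth restriction". */
typedef struct { long maxnodes; int maxdepth; } budget_t;

budget_t choose_budget(long joblist_len, long nproc) {
    const long lmin = 1, lmax = 3, scale = 40;  /* defaults in this paper */
    budget_t b = { 5000, 2 };                   /* aggressive initial budget */
    if (joblist_len > lmin * nproc)
        b.maxdepth = -1;          /* enough jobs: stop splitting by depth */
    if (joblist_len > lmax * nproc)
        b.maxnodes *= scale;      /* plenty of jobs: split only big trees */
    return b;
}
```

With these defaults a worker only returns new subproblems from a saturated job list when its subtree has at least 5000 x 40 = 200,000 nodes.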
3.3 Workers
The worker processes are simpler – they receive the problem at startup, and then repeat their main loop: receive a parametrized subproblem and possible updates from the master, work on the subproblem subject to the parameters, send the output to the consumer, and send updated shared data and unfinished subproblems to the master if the budget is exhausted. Pseudocode is given as Algorithm 4 in the Appendix.
3.4 Consumer process
The consumer process in mts is the simplest. The workers send output to the consumer in exactly the format it should be output (i.e., this formatting is done in parallel). The consumer simply outputs it. By synchronizing output to a single destination, the consumer delivers a continuous output stream to the user in the same way as search does. Pseudocode is given as Algorithm 5 in the Appendix.
4 Applying mts to reverse search
Reverse search is a technique for generating large, relatively unstructured sets of discrete objects [5]. In its most basic form, reverse search can be viewed as the traversal of a spanning tree, called the reverse search tree T, of a graph G whose nodes are the objects to be generated. Edges in G are specified by an adjacency oracle, and the subset of edges of the reverse search tree are determined by an auxiliary function f, which can be thought of as a local search function for an optimization problem defined on the set of objects to be generated. One vertex, v*, is designated as the target vertex. For every other vertex v, repeated application of f must generate a path in G from v to v*. The set of these paths defines the reverse search tree T, which has root v*.
A reverse search is initiated at v*, and only edges of the reverse search tree are traversed. When a node is visited, the corresponding object is output. Since there is no possibility of visiting a node by different paths, the visited nodes do not need to be stored. Backtracking can be performed in the standard way using a stack, but this is not required as the local search function can be used for this purpose.
In the basic setting described here a few properties are required. Firstly, the underlying graph G must be connected and an upper bound on the maximum vertex degree, Δ, must be known. The performance of the method depends on having Δ as low as possible. The adjacency oracle must be capable of generating the adjacent vertices of any given vertex in G. For each vertex v other than v*, the local search function f returns the parent of v in T. Pseudocode is given in Algorithm 1 and is invoked by setting the start vertex to v*. C implementations for several simple enumeration problems are given at [6]. For convenience later, we do not output the start vertex in the pseudocode shown. Note that the vertices are output as a continuous stream. Also note that Algorithm 1 does not require the start vertex to be the root of the entire search tree. If an arbitrary node in the tree is given, the algorithm reports the subtree rooted at this node and terminates.
We need to implement budgeting in order to parallelize Algorithm 1 with mts. We do this in two ways that may be combined. Firstly we introduce the parameter maxdepth, which terminates the tree search at that depth, returning any unvisited subtrees. Secondly we introduce a parameter maxnodes, which terminates the tree search after this many nodes have been visited and again returns the roots of all unvisited subtrees. This entails backtracking to the root and returning the unvisited siblings of each node on the backtrack path. These modifications are straightforward and given in Algorithm 2, which reduces to Algorithm 1 by deleting the items in red.
To output all nodes in the subtree of T rooted at v we set the start vertex to v and set maxdepth and maxnodes to infinity. To break this subtree up we have two options that can be combined. Firstly we can set the parameter maxdepth, resulting in all nodes at that depth being flagged as unexplored. Secondly we can set the budget parameter maxnodes. In this case, once this many nodes have been explored the current node and all unexplored siblings on the backtrack path to the root are output and flagged as unexplored.
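To illustrate the effect of these parameters, the following toy code (our own; the names maxnodes and maxdepth follow the text, and we use plain recursion instead of the stackless backtracking of Algorithm 2) traverses the tree on vertices 1..n whose local search function is f(v) = floor(v/2), so the children of v are 2v and 2v+1. When the budget is exhausted, the unvisited subtrees along the backtrack path are handed back as unexplored roots.

```c
#include <assert.h>

#define MAXJOBS 1024

/* Budgeted traversal of the tree on vertices 1..n with parent f(v)=v/2. */
typedef struct {
    long n, visited, maxnodes;   /* tree size, visit count, node budget */
    int  maxdepth;               /* depth cutoff; < 0 means unlimited   */
    long unexplored[MAXJOBS];    /* roots of subtrees handed back       */
    int  nunexplored;
} bsearch_t;

static void bdfs(bsearch_t *s, long v, int depth) {
    s->visited++;
    for (long c = 2 * v; c <= 2 * v + 1 && c <= s->n; c++) {
        if (s->visited >= s->maxnodes ||
            (s->maxdepth >= 0 && depth >= s->maxdepth))
            s->unexplored[s->nunexplored++] = c;   /* flag, don't descend */
        else
            bdfs(s, c, depth + 1);
    }
}

void budgeted_search(bsearch_t *s, long root) {
    s->visited = 0;
    s->nunexplored = 0;
    bdfs(s, root, 0);
}
```

Re-running budgeted_search on every returned root (as the mts master does via its job list) visits each of the n vertices exactly once, whatever the budget.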
4.1 Example 1: Topological sorts
A C implementation (per.c) of the reverse search algorithm for generating permutations is given in the tutorial [6]. A small modification of this code generates all topological sorts of a partially ordered set that is given by a directed acyclic graph (DAG). Such topological sorts are also called linear extensions or topological orderings. The code modification is given as Exercise 5.1 and a solution to the exercise (topsorts.c) is at [6]. Here we describe how to modify this code to allow parallelization via the mts interface to produce the program mtopsorts. The details and code are available at [6].
It is convenient to describe the procedure as two phases. Phase 1 implements budgeting and organizes the internal data in a suitable way. This involves modifying an implementation of Algorithm 1 to an implementation of Algorithm 2 that can be independently tested. We need to prepare a global data structure bts_data which contains problem data obtained from the input. In Phase 2 we build a node structure for use by the mts wrapper and add necessary routines to allow initialization and I/O in a parallel setting. In practice this involves using a header file from mts. The resulting program btopsorts.c can be compiled as a sequential code or with mts as a parallel code with no change in the source files.
In the second phase we add the ‘hooks’ that allow communication with mts. This involves defining a Node structure which holds all necessary information about a node in the search tree. The roots of unexplored subtrees are maintained by mts for parallel processing. Therefore whenever a search terminates due to the or restrictions, the Node structure of each unexplored tree node is returned to mts. As we do not wish to customize mts for each application, we use a very generic node structure. The user should pack and unpack the necessary data into this structure as required. The Node structure is defined in the mts header.
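A sketch of what such a generic structure and the application's pack/unpack step might look like (the field names here are illustrative, not the actual layout in the mts header):

```c
#include <string.h>
#include <assert.h>

#define MAX_VLONG 8
#define MAX_VCHAR 64

/* Generic tree node: untyped buffers the application packs as needed. */
typedef struct {
    long vlong[MAX_VLONG];   /* numeric payload                */
    char vchar[MAX_VCHAR];   /* byte payload                   */
    int  depth;              /* depth of this node in the tree */
} mts_node_t;

/* Example for topological sorts: store the current permutation. */
void pack_perm(mts_node_t *nd, const char *perm, int len, int depth) {
    memcpy(nd->vchar, perm, (size_t)len);
    nd->vlong[0] = len;
    nd->depth = depth;
}

int unpack_perm(const mts_node_t *nd, char *perm) {
    int len = (int)nd->vlong[0];
    memcpy(perm, nd->vchar, (size_t)len);
    return len;
}
```

Because mts only moves these opaque buffers between processes, the same wrapper serves any application that can serialize a subtree root into them.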
The efficiency of mts depends on keeping the job list nonempty until the end of the computation, without letting it get too large. Depending on the application, there may be a substantial restart cost for each unexplored subtree. Clearly there is no need to return a leaf as an unexplored node, and the prune=0 option checks for this. Further, if an unexplored node has only one child it may be advantageous to explore further, terminating either at a leaf or at a node with two or more children, which is returned as unexplored. The prune=1 option handles this condition, meaning that no isolated nodes or paths are returned as unexplored. Note that pruning is not a built-in mts option; it is an example of an option that applications may wish to include, and it was implemented in mtopsorts.
4.2 Example 2: Spanning trees
In the tutorial [6] a C implementation (tree.c) is given for the reverse search algorithm for all spanning trees of the complete graph. An extension of this to generate all spanning trees of a given graph is stated as Exercise 6.3. Applying Phase 1 and 2 as described above results in the code btree.c. Again this may be compiled as a sequential code or with the mts wrapper to provide the parallel implementation mtree. All of these codes are given at the URL [6].
5 Applying mts to satisfiability
Boolean satisfiability (SAT) asks us to determine the existence of (or find) satisfying assignments for propositional formulas, see [12] for more background. SAT solvers have made tremendous progress over the years, and are now widely used as general NP solvers. While most application problems seem to result in easy SAT instances [10], there has long been interest in parallel SAT solvers for hard instances. Despite the many challenges [22, 27] in parallel SAT, there are recent successes [24].
There are two major approaches to parallel SAT solvers. Either one somehow partitions the space of possible assignments and uses divide-and-conquer (e.g., [1] for a recent example), or one uses the portfolio approach and runs many sequential solvers on the original problem (e.g., plingeling [11]). In either case, a major issue is determining which learnt clauses (CDCL solvers learn clauses during the search, pruning the search space; see, e.g., Chapter 4 of [12]) to share between workers [3]. While sharing these clauses helps prune the search space, additional clauses slow the solver, and enormous numbers of clauses are learned.
Another question for divide-and-conquer solvers is how to divide the search space. Many approaches have been tried, often setting initial variables and using a common feature of sequential solvers to “solve under assumptions”. Some recent solvers (e.g., [1] and treengeling [11]) work on these subproblems subject to some budget, and hard subproblems can be split again. Cube-and-conquer [25] is another recent approach, which uses lookahead solvers to divide the search space for CDCL solvers.
5.1 mtsat: parallelizing Minisat with mts
We used mts to implement a divide-and-conquer solver mtsat, using Minisat 2.2.0 as the sequential solver. Our goal was to demonstrate the use of shared data and to show that mts can be used in settings other than enumeration. mtsat is still experimental and much work remains to reach the level of state-of-the-art dedicated parallel SAT solvers, but it allows for experimentation with, e.g., budgeting and restart strategies in parallel SAT.
Minisat [19] supports solving under assumptions, i.e. solving subject to some partial assignment. It also supports solving subject to a budget, given in propagations or conflicts, returning unknown if the given subproblem could not be solved within the budget.
The major modification required is to report unexplored partial assignments when the budget is exhausted. At any point in the search, SAT solvers distinguish between decision variables and propagated variables. Decision variables are those where the solver chose an assignment, while propagated variables are those where the solver was able to determine (because of a unit clause) that only one option need be explored. It suffices to return unexplored nodes corresponding to the current partial assignment and to those formed by taking the unexplored options for decision variables (including the last one) along the backtrack path.
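A toy sketch of this enumeration (our own illustration, not Minisat code; literals are nonzero ints with arithmetic negation for the complemented literal, as in DIMACS): given the decision literals along the current backtrack path, the i-th unexplored subproblem keeps decisions 0..i-1 and flips decision i.

```c
#include <assert.h>

/* Given decision literals d[0..k-1] on the backtrack path, write the
   i-th unexplored partial assignment into out[] and return its length.
   Propagated literals are omitted, since they are forced. */
int unexplored_assignment(const int *d, int k, int i, int *out) {
    assert(0 <= i && i < k);
    for (int j = 0; j < i; j++)
        out[j] = d[j];       /* keep earlier decisions as they were  */
    out[i] = -d[i];          /* take the untried branch of decision i */
    return i + 1;
}
```

Together with the current (budget-exhausted) partial assignment itself, these k subproblems cover the unexplored part of the search space.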
Regarding learnt clauses, we implemented a simple scheme sharing only learnt unit clauses. The idea is that short clauses cut the search tree more than longer clauses; an early version of plingeling also shared only units [11]. We avoided more sophisticated approaches to sharing clauses [3, 1], using conflicts to prune the job list and similar ideas for simplicity.
mtsat includes additional options. For example, while the parallel solvers most similar to our approach [1, 11] budget using conflicts, we added the option to budget using decisions. Conflict budgets correspond to hitting leaves in the search tree, while decision budgets correspond to nodes in the search space (omitting propagated variables, since those are forced). Conflict budgets are attractive, but decision budgets correspond more closely to the budgets used in Section 4 and allow us to experiment with different budgeting techniques.
Modern solvers generally perform random restarts, abandoning the current search and starting over (cf. Chapter 4 of [12]) in the hope of avoiding hard parts of the search space. We split problems along the backtrack path and schedule these abandoned portions of the search space for later exploration, possibly resulting in much duplicated work. We therefore added an option to disable restarts, in order to experiment with their impact on performance in mtsat, and an option to disable formula preprocessing, to experiment with the idea that avoiding preprocessing can be beneficial to divide-and-conquer parallel SAT solvers [22].
In total, 50 of the original 4803 lines of legacy Minisat were changed (including support for parsing inputs from strings), plus a few hundred lines of generic code interfacing the Minisat API and mts that can be reused. Essentially identical changes suffice to parallelize Glucose (since it is based on Minisat) and other Minisat derivatives, and so we also parallelized Glucose 3.0. One could easily support workers using a mix of solvers, a hybrid of the divide-and-conquer and portfolio approaches to parallel SAT.
6 Experimental results
The tests were performed at Kyoto University on mai32, a cluster of 5 nodes with a total of 192 identical processor cores, consisting of:

mai32abcd: 4 nodes, each containing: 2x Opteron 6376 (16-core, 2.3GHz), 32GB memory, 500GB hard drive (128 cores in total);

mai32ef: 4x Opteron 6376 (16-core, 2.3GHz), 64 cores, 256GB memory, 4TB hard drive.
A complete description of the problems solved below is given in [8] and the input files are available by following the link to tutorial2 at [6].
6.1 Topological sorts: mtopsorts
The tests were performed using the following codes:

VR and Genle: sequential codes for generating topological sorts, used for comparison;

btopsorts: the budgeted reverse search code described in Section 4.1;

mtopsorts: mts parallelization of btopsorts.
For the tests all codes were used in count-only mode due to the enormous output that would otherwise be generated. All codes were used with the default parameters:

maxdepth = 2, maxnodes = 5000, scale = 40, lmin = 1, lmax = 3.    (1)
The following graphs were chosen, listed in order of increasing edge density: pm_22, cat_42, K_{8,9}. The constructions for the first two partial orders are well known (see, e.g., Section 7.2.1.2 of [29]) and the third is the complete bipartite graph K_{8,9}.
Table 1: Topological sorts, running times in seconds.

Graph     nodes  edges  No. of perms    VR   Genle  btopsorts  mtopsorts (12 / 24 / 48 / 96 / 192 cores)
pm_22       22     21   13,749,310,575  179     14      12723  1172 / 595 / 360 / 206 / 125
cat_42      42     61   24,466,267,020  654    171      45674  4731 / 2699 / 1293 / 724 / 408
K_{8,9}     17     72   14,631,321,600  159      5       8957  859 / 445 / 249 / 137 / 85
Results are in Table 1. The reverse search code btopsorts is very slow, over 900 times slower than Genle and over 70 times slower than VR on pm_22. However the parallel mts code obtains excellent speedups and is faster than VR on all problems when 192 cores are used.
6.2 Spanning trees: mtree
The tests were performed using the following codes:

grayspan: Knuth’s implementation [28] of an algorithm that generates all spanning trees of a given graph, changing only one edge at a time, as described in Malcolm Smith’s M.S. thesis, Generating spanning trees (University of Victoria, 1997);

grayspspan: Knuth’s improved implementation of grayspan: “This program combines the ideas of grayspan and spspan, resulting in a glorious routine that generates all spanning trees of a given graph, changing only one edge at a time, with ‘guaranteed efficiency’—in the sense that the total running time is O(m + n + T) when there are m edges, n vertices, and T spanning trees.” [28];
btree: the budgeted reverse search code described in Section 4.2;

mtree: mts parallelization of btree.
Both grayspan and grayspspan are described in detail in Knuth [29]. Again all codes were used in count-only mode and with the default parameters (1). The problems chosen were the following graphs, listed in order of increasing edge density: 8cage, C_5 × P_5, C_5 × C_5, K_{7,7}, K_{12}. The latter 4 graphs were motivated by Table 5 in [29]: they appear in, or are larger versions of, examples in that table.
Table 2

Graph    m (nodes)   n (edges)   No. of trees        grayspan   grayspspan   btree    mtree (12 / 24 / 48 / 96 / 192 cores)
8-cage   30          45          23,066,015,625      3166       730          10008    1061 / 459 / 238 / 137 / 92
         25          45          38,720,000,000      3962       1212         8918     851 / 455 / 221 / 137 / 122
         25          50          1,562,500,000,000   131092     41568        230077   26790 / 13280 / 7459 / 4960 / 4244
         14          49          13,841,287,201      699        460          2708     259 / 142 / 68 / 51 / 61
         12          66          61,917,364,224      2394       1978         3179     310 / 172 / 84 / 97 / 148
The computational results are given in Table 2. This time the reverse search code is somewhat more competitive: for example, on 8-cage it is about 3 times slower than grayspan and about 14 times slower than grayspspan. The parallel mts code runs about as fast as grayspspan on all problems when 12 cores are used and is significantly faster after that. Near-linear speedups are obtained up to 48 cores but then tail off. For the two densest graphs, the performance of mts is actually worse with 192 cores than with 96.
6.3 Satisfiability
The tests were performed using the following codes:

Minisat: version 2.2.0, classic sequential solver [19];

Glucose: version 3.0, sequential solver [2] derived from Minisat;

mtsat: parallel solver using mts and Minisat 2.2.0;

mtsatglucose: parallel solver using mts and Glucose 3.0;

Glucose-Syrup: version 4.0, parallel (shared memory) solver;

lingeling, treengeling: version bbc, sequential and (shared memory) parallel solvers [11].
Benchmarking parallel SAT solvers is challenging [22], and any particular instance may give superlinear speedups or timeouts. We use a standard set of hard instances from applications, and count the number of problems that each solver can solve within a given time. We reuse the setup of [1], i.e. the 100 instances in the parallel track of SAT Race 2015 [9] and a timeout of 20 minutes. Results are in Figure 1. Because different computers were used, our results are not directly comparable to those in [1]. As noted by [10], solvers like mtsat can use substantial memory on very large instances, limiting the number of processes that can execute in a given amount of memory. The computers we used had sufficient memory for the instances used.

The results in Figure 1 show improvement from additional cores using default parameters and decision budgeting with no attempt at tuning. Performance with conflict budgeting is shown in Figure 2, using an initial budget of conflicts (i.e. the corresponding value in [1]).

All non-timeout outputs are correct, and the 32-core run with conflict budgeting solves problem 62bits_10.dimacs.cnf (reported as unsolved in the SAT Race 2015 results), giving a correct satisfying assignment. It is likely that experimenting with parameter values can improve performance. Using a newer sequential solver on the workers may be another source of improvement, given the performance treengeling achieves when starting from the higher baseline performance of lingeling.
Along these lines, we also report results applying mts to parallelize Glucose. Figure 3 shows results using the default decision budgeting and Figure 4 shows results using conflict budgeting. Note that, as in mtsat, only unit clauses are shared in mtsatglucose. A more sophisticated approach to sharing learnt clauses, e.g. sharing glue clauses, would likely help performance.


In all cases, we see that additional cores allow one to solve more problems given a fixed amount of time. While it is likely that performance can be improved by tuning and a better approach to sharing clauses, these results suffice for our purpose: to show that one can easily parallelize a legacy code with mts.
As mentioned earlier, conflict budgeting is most common in this kind of parallel solver. This is because generating a conflict clause guarantees that at least some progress has been made, and because instances can have very different and enormous numbers of variables. Conflict budgeting usually slightly outperformed decision budgeting in the runs here, but this was not universal. Given the minimal tuning for both budget types, our results do not show a clear difference between the two budgeting approaches.
7 Evaluating and improving performance
Our main measures of performance for the enumeration problems are the elapsed time taken and the efficiency, defined as:

    efficiency = (single-core running time) / (number of cores × multicore running time)    (2)
Multiplying efficiency by the number of cores gives the speedup. Speedups that scale linearly with the number of cores give constant efficiency. External factors can affect performance as the load on the machine increases. One example is dynamic overclocking, where the speed of working cores may be increased by 25%–30% when other cores are idle. This limits the maximum efficiency achievable when all cores are used, since the single core running times are measured on otherwise idle machines. In Figure 5 we plot the efficiencies obtained by mtopsorts and mtree for the runs shown in Tables 1 and 2 respectively.
The amount of work contained in a subproblem can vary dramatically. mts can produce histograms to help understand and tune its performance. We discuss three of these here: processor usage, job list size and distribution of subproblem sizes. Figure 6 shows the first two histograms for the mtopsorts run with default parameters (1).
We see the master struggling to keep workers busy despite having jobs available. This suggests that we can improve performance with better parameters. Here, a larger scale or maxnodes value may help, since it will allow workers to do more work (assuming a sufficiently large subproblem) before contacting the master.
Figure 7 shows the result of using larger values of scale and maxnodes. These parameters produce less than half the total number of jobs compared to the default parameters, and increase overall performance by about five percent on this input.
In addition to the performance histograms, mts can generate a frequency file containing a list of the values returned by each worker on the completion of each job. For the enumeration applications this is normally the number of nodes visited by the worker during the job. Such a list provides statistical information about the tree that is helpful when tuning the parameters for better performance. For example, it may be helpful to implement and use pruning if many jobs correspond to leaves. Likewise, increasing the budget will have limited effect if only a few jobs use the full budget. Figure 8 shows the distribution of subproblem sizes produced in a run of mtopsorts with default parameters (1). The job list is usually large, so the scaled budget constraint of 200000 nodes is normally in use. The left figure shows this constraint was invoked about 15000 times. The right figure shows that most subproblems have fewer than 40 nodes and so are not broken up. The three spikes in the middle of the left figure are interesting: they show that there are large numbers of subtrees with these specific sizes, probably due to the high symmetry of the graph.
8 Conclusions
We have presented a generic framework for parallelizing reverse search codes that requires only minimal changes to the legacy code. Two features of our approach are that the parallelizing wrapper does not need to be modified by the user and that the modified legacy code can be tested in standalone single-processor mode. There is no separate library to install, and just a few routines need to be inserted in a user’s existing library. Applying mts to two very basic reverse search codes, we obtained results comparable to those previously obtained by the customized mplrs wrapper applied to the complex lrs code [7]. We expect that many other reverse search applications will obtain similar speedups when parallelized with mts.
The application to SAT demonstrates the use of shared data, and the ease with which a widely-used existing legacy code can be parallelized using mts. While mtsat remains work in progress, it shows some promise and further experimentation can likely improve performance. Other ongoing work involves using mts to parallelize existing integer programming solvers that use the branch-and-bound approach.
Acknowledgements.
This work was partially supported by JSPS Kakenhi Grants 16H02785, 23700019 and 15H00847, Grant-in-Aid for Scientific Research on Innovative Areas, ‘Exploring the Limits of Computation (ELC)’.
References
 [1] Audemard, G., Lagniez, J.M., Szczepanski, N., Tabary, S.: An adaptive parallel SAT solver. In: CP, LNCS, vol. 9892, pp. 30–48 (2016)
 [2] Audemard, G., Simon, L.: Predicting learnt clauses quality in modern SAT solvers. In: IJCAI, pp. 399–404 (2009)
 [3] Audemard, G., Simon, L.: Lazy clause exchange policy for parallel SAT solvers. In: SAT, LNCS, vol. 8561, pp. 197–205 (2014)
 [4] Avis, D., Devroye, L.: An analysis of budgeted parallel search on conditional Galton-Watson trees. arXiv:1703.10731 (2017)
 [5] Avis, D., Fukuda, K.: Reverse search for enumeration. Discrete Applied Mathematics 65, 21–46 (1993)
 [6] Avis, D., Jordan, C.: Reverse search: tutorials (2000,2016). http://cgm.cs.mcgill.ca/~avis/doc/tutorial
 [7] Avis, D., Jordan, C.: mplrs: A scalable parallel vertex/facet enumeration code. arXiv:1511.06487 (2015)
 [8] Avis, D., Jordan, C.: A parallel framework for reverse search using mts. arXiv:1610.07735 (2016)
 [9] Balyo, T., Biere, A., Iser, M., Sinz, C.: SAT Race 2015. Artificial Intelligence 241, 45–65 (2016)
 [10] Balyo, T., Heule, M.J., Järvisalo, M.: SAT Competition 2016: Recent developments. In: AAAI (2017)
 [11] Biere, A.: Lingeling and friends entering the SAT Challenge 2012. In: Proc. SAT Challenge 2012, Department of Computer Science Series of Publications B, University of Helsinki, vol. B-2012-2, pp. 33–34 (2012)
 [12] Biere, A., Heule, M.J., van Maaren, H., Walsh, T. (eds.): Handbook of Satisfiability. IOS Press (2009)
 [13] Blumofe, R.D., Leiserson, C.E.: Scheduling multithreaded computations by work stealing. Journal of the ACM 46(5), 720–748 (1999)
 [14] Brüngger, A., Marzetta, A., Fukuda, K., Nievergelt, J.: The parallel search bench ZRAM and its applications. Ann. Oper. Res. 90, 45–63 (1999)
 [15] Casado, L.G., Martínez, J.A., García, I., Hendrix, E.M.T.: Branch-and-bound interval global optimization on shared memory multiprocessors. Optimization Methods and Software 23(5), 689–701 (2008)
 [16] Combinatorial Object Server : Linear extensions. http://theory.cs.uvic.ca/inf/pose/LinearExt.html
 [17] Crainic, T.G., Le Cun, B., Roucairol, C.: Parallel Branch-and-Bound Algorithms, pp. 1–28. John Wiley & Sons, Inc. (2006)
 [18] Djerrah, A., Le Cun, B., Cung, V.D., Roucairol, C.: Bob++: Framework for solving optimization problems with branch-and-bound methods. In: 2006 15th IEEE International Conference on High Performance Distributed Computing, pp. 369–370 (2006)
 [19] Eén, N., Sörensson, N.: An extensible SATsolver. In: SAT, LNCS, vol. 2919, pp. 502–518 (2003)
 [20] Ferrez, J., Fukuda, K., Liebling, T.: Solving the fixed rank convex quadratic maximization in binary variables by a parallel zonotope construction algorithm. European Journal of Operational Research 166, 35–50 (2005)
 [21] Goux, J.P., Kulkarni, S., Yoder, M., Linderoth, J.: Master–worker: An enabling framework for applications on the computational grid. Cluster Computing 4(1), 63–70 (2001)
 [22] Hamadi, Y., Wintersteiger, C.M.: Seven challenges in parallel SAT solving. In: AAAI, pp. 2120–2125 (2012)
 [23] Herrera, J.F.R., Salmerón, J.M.G., Hendrix, E.M.T., Asenjo, R., Casado, L.G.: On parallel branch and bound frameworks for global optimization. Journal of Global Optimization pp. 1–14 (2017)
 [24] Heule, M.J., Kullmann, O., Marek, V.W.: Solving and verifying the boolean Pythagorean triples problem via cube-and-conquer. In: SAT, LNCS, vol. 9710, pp. 228–245 (2016)
 [25] Heule, M.J., Kullmann, O., Wieringa, S., Biere, A.: Cube and conquer: Guiding CDCL SAT solvers by lookaheads. In: HVC, LNCS, vol. 7261 (2012)
 [26] Jordan, C., Joswig, M., Kastner, L.: Parallel enumeration of triangulations. arXiv:1709.04746 (2017)
 [27] Katsirelos, G., Sabharwal, A., Samulowitz, H., Simon, L.: Resolution and parallelizability: Barriers to the efficient parallelization of SAT solvers. In: AAAI, pp. 481–488 (2013)
 [28] Knuth, D.E.: Programs to read. http://www-cs-faculty.stanford.edu/~uno/programs.html
 [29] Knuth, D.E.: The Art of Computer Programming, Volume 4A. AddisonWesley Professional (2011)
 [30] Marzetta, A.: ZRAM: A library of parallel search algorithms and its use in enumeration and combinatorial optimization. Ph.D. thesis, Swiss Federal Institute of Technology Zurich (1998)
 [31] Mattson, T., Sanders, B., Massingill, B.: Patterns for Parallel Programming, first edn. AddisonWesley Professional (2004)
 [32] McCreesh, C., Prosser, P.: The shape of the search tree for the maximum clique problem and the implications for parallel branch and bound. ACM Transactions on Parallel Computing 2(1), 8:1–8:27 (2015)
 [33] Pruesse, G., Ruskey, F.: Generating the linear extensions of certain posets by transpositions. SIAM Journal on Discrete Mathematics 4(3), 413–422 (1991)
 [34] Reinders, J.: Intel Threading Building Blocks. O’Reilly & Associates, Inc. (2007)
 [35] Varol, Y.L., Rotem, D.: An algorithm to generate all topological sorting arrangements. The Computer Journal 24(1), 83–84 (1981)
 [36] Weibel, C.: Implementation and parallelization of a reversesearch algorithm for Minkowski sums. In: 2010 Proceedings of the Twelfth Workshop on Algorithm Engineering and Experiments (ALENEX), pp. 34–42 (2010)