mts: a light framework for parallelizing tree search codes

David Avis School of Informatics, Kyoto University, Kyoto, Japan and School of Computer Science, McGill University, Montréal, Québec, Canada
avis@cs.mcgill.ca
Charles Jordan Graduate School of Information Science and Technology, Hokkaido University, Japan
skip@ist.hokudai.ac.jp
Abstract

We describe version 0.1 of mts, a generic framework for parallelizing certain types of tree search programs using a single common wrapper. This complements a previous tutorial that focused on using a preliminary version of mts. mts supports sharing information between processes, which is important for applications such as satisfiability testing and branch-and-bound. No parallelization is implemented in the legacy single processor code, minimizing the changes needed and simplifying debugging. mts is written in C, uses MPI for parallelization and can be used on a network of computers. As examples we parallelize two simple existing reverse search codes, generating topological sorts and generating spanning trees of a graph, and two codes for satisfiability testing. We give experimental results comparing the parallel codes with other codes for the same problems.

Keywords: reverse search, parallel processing, topological sorts, spanning trees
Mathematics Subject Classification (2000) 90C05

1 Introduction

Parallel programming is a vast area and there is a great amount of literature on it (see, e.g., Mattson et al. [31]). Topics include architecture, communication, data sharing, interrupts, deadlocks, load balancing, and the distinction between shared memory and distributed computing. This is all essential for building an efficient parallel algorithm from scratch.

Our starting point was different. We had a large complex code, lrs, developed over about 20 years and tested extensively, which solved vertex/facet enumeration problems. These problems are notoriously hard and running times often take weeks or longer. The underlying algorithm, reverse search, was clearly suitable for parallelization. Nevertheless, the mathematical intricacy of the underlying problem rendered the algorithmic engineering of direct parallelization daunting. This led us to consider building all of the parallelization into a wrapper, making only minor changes to the underlying lrs code. There followed a series of implementations resulting ultimately in the authors’ mplrs code [7]. The key features of mplrs are: (a) there is no parallel code inside lrs, (b) parallel threads execute lrs on non-overlapping subproblems, (c) there is no communication between threads except at the beginning and end of a subproblem execution, (d) the computation can be distributed over a cluster of computers, and (e) the wrapper is directly inserted into the lrs library. Most of the topics in parallel computation mentioned above are not major issues in this restricted framework. The exception is load balancing for which we use a particularly simple method which consists of budgeting the number of nodes evaluated in a subproblem.

It seemed likely that similar results could be obtained for other algorithms based on reverse search (in 2008, John White made a list of 130 different applications and implementations; see the link at [6]) or similar easily parallelizable tree search methods. Many such sequential codes exist, so designing custom wrappers for each is not desirable. Our goal was to build a single generic wrapper that could be used, with little if any modification, to do the required parallelization while maintaining features (a)–(e) described above. This resulted in mts, presented here. The current implementation (the version used here is available at https://www-alg.ist.hokudai.ac.jp/~skip/mts/) uses MPI and works on clusters of machines. The mts framework is more general than mplrs in that it allows the sharing of data obtained by subproblems, but still maintains the absence of communication between threads. This improves the application to more general tree search problems such as satisfiability testing and branch and bound.

In Section 2 we survey the literature on parallelizing reverse search codes. We then describe our general approach in Section 3 and apply it to reverse search in Section 4. We give concrete examples for two simple enumeration problems: generating topological sorts and spanning trees of a graph. While the purpose of mts is to parallelize much more complex enumeration problems (see for example the recent application [26] of mts to enumerating triangulations), there were several reasons for choosing these simple well solved problems. They were described in detail in the original reverse search tutorial [6], are easily solved by reverse search, have existing codes, and provide simple examples of how to apply mts.

Tree search has wide uses, of which enumeration is just one example. In fact it is a very specific example as all nodes in the enumeration tree are visited. Two other important uses of tree search are satisfiability testing and branch and bound. Here the goal is not to search the entire tree but to prune subtrees when possible. The tree generated in these cases will normally differ depending on the choices made at early stages and the sharing of information learned during the computation. The mts framework includes support for sharing data between processes and can be applied to these types of problems. As an example we present a parallelization for satisfiability testing in Section 5, demonstrating how little of the original code needs to be changed.

In Section 6 we give computational results for the parallelized codes described in this paper. This is followed in Section 7 by a discussion of how to evaluate the experimental results, the situation being quite different for enumeration problems and for those problems where pruning is used. For the enumeration problems we get near linear speedup using several hundred cores. For the satisfiability problem we show a large improvement in the number of SAT instances that can be solved in a given fixed time period. Finally we give some conclusions and areas for future research in Section 8.

2 Survey of previous work

The reverse search method, initially developed for vertex enumeration, was extended to a wide variety of enumeration problems [5]. From the outset it was realized that it was eminently suitable for parallelization. In 1998, Marzetta announced his ZRAM parallelization platform [14, 30] which can be used for reverse search, backtracking and branch and bound codes. He successfully used it to parallelize several reverse search and branch and bound codes, including lrs from which he derived the prs code. Load balancing is performed using a variant of what is now known as job stealing. Application codes, such as lrs, were embedded into ZRAM itself leading to problems of maintenance as the underlying codes evolved. Although prs is no longer distributed and was based on a now obsolete version of lrs, it clearly showed the potential for large speedups of reverse search algorithms.

The reverse search framework in ZRAM was also used to implement a parallel code for certain quadratic maximization problems [20]. In a separate project, Weibel [36] developed a parallel reverse search code to compute Minkowski sums. This C++ implementation runs on shared memory machines and he obtains linear speedups with up to 8 processors, the largest number reported.

ZRAM is a general-purpose framework that is able to handle a number of other applications, such as branch and bound and backtracking, for which there are by now a large number of competing frameworks. Recent papers by Crainic et al. [17], McCreesh et al. [32] and Herrera et al. [23] describe over a dozen such systems. While branch and bound may seem similar to reverse search enumeration, there are fundamental differences. In enumeration it is required to explore the entire tree whereas in branch and bound the goal is to explore as little of the tree as possible until a desired node is found. The bounding step removes subtrees from consideration and this step depends critically on what has already been discovered. Hence the order of traversal is crucial and the number of nodes evaluated varies dramatically depending on this order. Sharing of information is critical to the success of parallelization. These issues do not occur in reverse search enumeration, and so a much lighter wrapper is possible.

Relevant to the heaviness of the wrapper and amount of programming effort required, a comparison of three frameworks is given in [23]. The first, Bob++ [18], is a high level abstract framework, similar in nature to ZRAM, on top of which the application sits. This framework provides parallelization with relatively little programming effort on the application side and can run on a distributed network. The second, Threading Building Blocks (TBB) [34], is a lower level interface providing more control but also considerably more programming effort. It runs on a shared memory machine. The third framework is the Pthread model [15] in which parallelization is deep in the application layer and migration of threads is done by the operating system. It also runs on a shared memory machine. All of these methods use job stealing for load balancing [13]. In [23] these three approaches are applied to a global optimization algorithm. They are compared on a rather small setup of 16 processors, perhaps due to the shared memory limitation of the last two approaches. The authors found that Bob++ achieved a disappointing speedup of about 3 times, considerably slower than the other two approaches which achieved near linear speedup.

A more sophisticated framework for parallelizing application codes over large networks of computers is MW, which works with the distributed environment of HTCondor (available at https://research.cs.wisc.edu/htcondor/mw/). MW is a set of C++ abstract base classes that allow parallelization of existing applications based on the master-worker paradigm [21]. We employ the same paradigm in mts although our load balancing methods are different. MW has been used successfully to parallelize combinatorial optimization problems such as the Quadratic Assignment Problem; see the MW home page for references. Although MW could be used to parallelize reverse search algorithms, we are not aware of any such applications.

3 The mts framework

The goal of mts is to parallelize existing tree search codes with minimal internal modification of these codes. The tree search codes should satisfy certain conditions, specified below. The mts implementation starts a user-specified number of processes on a cluster of computers. One process becomes the master, another becomes the consumer, and the remaining are workers which essentially run the original tree search code on specified subtrees. Communication is limited; workers are not interrupted and do not communicate between themselves.

The master sends the input data and parametrized subproblems to workers, informs the other processes to exit when appropriate, and handles checkpointing. The consumer receives and synchronizes output. Workers get budgeted subproblems from the master, run the legacy code, send output to the consumer, and return unfinished subproblems to the master. This general approach is similar to but simpler than the well-known work-stealing approach [13].

Generating subproblems can be done in many ways. One way would be to report nodes at some initial fixed depth. This works well for balanced trees but many trees encountered in practice are highly unbalanced and the vast majority of subtrees contain few nodes. Increasing the initial search depth does not solve this problem. Ideally we would only break up the large subtrees and in the development of mplrs we tried various ways to estimate the size of a given subtree. Experimentally this did not work well due to the high variance of the estimator and the wasted cost of doing many estimates.

The idea that worked best, and is implemented in mts, was also the simplest: a heuristic to determine large subtrees called budgeting. When assigning work the master specifies that a worker should terminate after completing a certain amount of work, called a budget, and then return a list of unexplored subtrees. The precise budget may depend on the application. For enumeration problems it could be the number of nodes visited by the worker. Some advantages of budgeting are:

  • small subtrees are explored without being broken up

  • large subtrees will be broken up repeatedly

  • each worker returns periodically for reassignment, can give information to be passed on to other workers and receive such information

  • it is implemented on-the-fly and avoids the duplication of work done in estimation

  • it can be varied dynamically during execution to control the job list size

  • when used statically and without pruning, the overall job list produced is deterministic and independent of the number of workers

This last item is useful for debugging purposes and also enables a theoretical analysis of the job list size under certain random tree models, see [4]. In particular, methods that limit work based on time (such as “begetting” in MW) do not have this property.

Implementing budgeting does not require interrupting workers or communication between workers. The master uses dynamic budgets to control the job list: small budgets break up more subtrees and lengthen the job list, while large budgets have the reverse effect.
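The effect of budgeting can be illustrated with a minimal Python sketch on a small explicit tree. The function names `budgeted_dfs` and `all_jobs`, the dictionary representation of the tree, and the convention that a job does not report its own root again are illustrative assumptions and are not taken from the mts source.

```python
from collections import deque

def budgeted_dfs(children, root, budget):
    """Explore below `root`, reporting nodes until `budget` nodes are reported.

    Returns (reported, unexplored). A node reported once the budget has been
    reached is not expanded; it is handed back as the root of a new job.
    """
    reported, unexplored = [], []
    stack = list(reversed(children.get(root, [])))
    while stack:
        node = stack.pop()
        reported.append(node)
        if len(reported) >= budget:
            unexplored.append(node)          # over budget: return as a job
        else:
            stack.extend(reversed(children.get(node, [])))
    return reported, unexplored

def all_jobs(children, root, budget, fifo=True):
    """Process the job list to exhaustion; return (jobs, reported nodes)."""
    pending, jobs, reported = deque([root]), [], []
    while pending:
        job = pending.popleft() if fifo else pending.pop()
        jobs.append(job)
        done, unexplored = budgeted_dfs(children, job, budget)
        reported.extend(done)
        pending.extend(unexplored)
    return jobs, reported

# A small unbalanced tree: the subtree below "a" is small, the one below "b" is larger.
TREE = {"r": ["a", "b"], "a": ["a1"],
        "b": ["b1", "b2"], "b1": ["b11", "b12"], "b2": ["b21"]}
```

With a budget of 3, the small subtree below `a` is finished inside the first job while `b` is handed back and split again. Because each job depends only on its root and the (static) budget, the set of jobs is the same whether the list is processed first-in-first-out or last-in-first-out, which mirrors the determinism property noted above.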

Additional features of mts include checkpointing and restarts, allowing the user to move jobs or free computing resources without losing work. mts can produce various histograms to help tune performance. Histograms and their uses are described in Section 7.

3.1 Sequential tree search code

To be suitable for parallelization with mts the underlying tree search code, which we will call search, must satisfy a few properties. First, when given a positive budget, search should either finish the given job or return a list of unexplored nodes. Any unexplored node should represent a smaller portion of the unfinished work, i.e. running search (with positive budgets) on the unexplored nodes and any resulting unexplored nodes will eventually result in finishing the original job. The code should also interpret the budget in some suitable way where larger budgets correspond to doing more work than smaller budgets. This may require some modification of the legacy code. Our applications usually interpret the budget as number of traversed nodes and depth, but this is not required (see conflict budgeting in Section 5.1).

Any given worker must be able to work on any given unexplored node that mts has seen. It is helpful for the unexplored nodes to represent non-overlapping jobs. mts supports sharing data between workers, but it is helpful for shared data to be small. Implementing a shared memory version of mts could help performance when large amounts of data are shared. Shared data is not used in our enumeration applications. It is used for satisfiability and similar applications to prune the search tree.

3.2 Master process

The master process begins with initialization, including obtaining an application-provided initial subproblem. It places this initial subproblem in a (new) job list $L$, and then enters the main loop. In this main loop, the master assigns budgeted subproblems to workers, collects unfinished subproblems to add to $L$, and collects/sends updated shared data from/to the workers. Assigning shared data updates to the master is not essential: it simplifies checkpointing but can increase load on the master and interconnect. Each worker either finishes its subproblem or reaches its budget limitation (the depth limit $max\_depth$ and the node budget $max\_nodes$) and returns unfinished subproblems to the master for insertion into $L$. This continues until no workers are running and the master has no unfinished subproblems. Once the main loop ends, the master informs all processes to finish. The main loop performs the following tasks:

  • subproblems and relevant updates are sent to free workers when available;

  • check if any workers are done, mark them as free and receive their unfinished subproblems;

  • check and receive updates.

Pseudocode is given as Algorithm 3 in the Appendix. Communication is non-blocking and work proceeds when required information is available.

Using reasonable parameters is critical to performance. This is done dynamically by observing $|L|$. We use parameters $scale$, $lmin$ and $lmax$ which depend on the type of tree search problem being handled. The following default values are used in this paper. Initially, to create a reasonable size list $L$, we set the depth limit $max\_depth = 2$ and the node budget $max\_nodes = 5000$. Therefore the initial worker will generate subtrees at depth 2 until 5000 nodes have been visited and then terminates, sending roots of unvisited subtrees back to the master. Additional workers are given the same aggressive parameters until $|L|$ grows larger than $lmin$ times the number of processors, at which point the $max\_depth$ limit is removed. Once $|L|$ is larger than $lmax$ times the number of processors, we multiply the budget by $scale$. With $scale = 40$ workers will not generate any new subproblems unless their tree has at least 200,000 nodes ($5000 \times 40$). If $|L|$ drops below these bounds we return to the smaller budgets. The default is . In Section 7 we show an example of how $|L|$ typically behaves with these settings.
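The rule described above can be written as a small function. This is a sketch under stated assumptions: the function name `choose_budget` and the parameter names `lmin`, `lmax`, `scale`, `max_depth` and `max_nodes` are used here for illustration, the two thresholds are left as required arguments, and only the depth limit 2, the node budget 5000 and the factor 40 come from the text.

```python
import math

def choose_budget(list_size, num_procs, lmin, lmax, scale,
                  max_depth=2, max_nodes=5000):
    """Return (depth_limit, node_budget) for the next job the master hands out.

    list_size is the current length of the job list; lmin <= lmax is assumed.
    """
    depth_limit, node_budget = max_depth, max_nodes
    if list_size > lmin * num_procs:
        depth_limit = math.inf            # list long enough: drop the depth limit
    if list_size > lmax * num_procs:
        node_budget = max_nodes * scale   # list long: hand out larger jobs
    return depth_limit, node_budget
```

For instance, with 8 processes and the arbitrary thresholds `lmin=2`, `lmax=10`, a job list of length 5 keeps the aggressive settings, a list of length 20 removes the depth limit, and a list of length 100 also raises the node budget to 200,000. Since the function is re-evaluated for every assignment, the budgets fall back automatically when the list shrinks.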

3.3 Workers

The worker processes are simpler. They receive the problem at startup and then repeat their main loop: receive a parametrized subproblem and possible updates from the master, work on the subproblem subject to the parameters, send the output to the consumer, and send updated shared data and unfinished subproblems to the master if the budget is exhausted. Pseudocode is given as Algorithm 4 in the Appendix.

3.4 Consumer process

The consumer process in mts is the simplest. The workers send output to the consumer in exactly the format it should be output (i.e., this formatting is done in parallel). The consumer simply outputs it. By synchronizing output to a single destination, the consumer delivers a continuous output stream to the user in the same way as search does. Pseudocode is given as Algorithm 5 in the Appendix.

4 Applying mts to reverse search

Reverse search is a technique for generating large relatively unstructured sets of discrete objects [5]. In its most basic form, reverse search can be viewed as the traversal of a spanning tree, called the reverse search tree $T$, of a graph $G = (V, E)$ whose nodes are the objects to be generated. Edges in the graph are specified by an adjacency oracle, and the subset of edges of the reverse search tree is determined by an auxiliary function $f$, which can be thought of as a local search function for an optimization problem defined on the set of objects to be generated. One vertex, $v^*$, is designated as the target vertex. For every other vertex $v \in V$, repeated application of $f$ must generate a path in $G$ from $v$ to $v^*$. The set of these paths defines the reverse search tree $T$, which has root $v^*$.

A reverse search is initiated at $v^*$, and only edges of the reverse search tree are traversed. When a node is visited, the corresponding object is output. Since there is no possibility of visiting a node by different paths, the visited nodes do not need to be stored. Backtracking can be performed in the standard way using a stack, but this is not required as the local search function can be used for this purpose.

In the basic setting described here a few properties are required. Firstly, the underlying graph $G$ must be connected and an upper bound on the maximum vertex degree, $\Delta$, must be known. The performance of the method depends on having $\Delta$ as low as possible. An adjacency oracle must be capable of generating the adjacent vertices of any given vertex in $G$; we write $Adj(v, j)$ for the $j$-th neighbour it reports for $v$, where $j = 1, \ldots, \Delta$. For each vertex $v \neq v^*$ the local search function $f(v)$ returns the tuple $(u, j)$ where $v = Adj(u, j)$, which defines the parent $u$ of $v$ in $T$. Pseudocode is given in Algorithm 1 and is invoked by passing the target vertex $v^*$. C implementations for several simple enumeration problems are given at [6]. For convenience later, we do not output the root $v^*$ in the pseudocode shown. Note that the vertices are output as a continuous stream. Also note that Algorithm 1 does not require the parameter $v^*$ to be the root of the entire search tree. If an arbitrary node in the tree is given, the algorithm reports the subtree rooted at this node and terminates.

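We need to implement budgeting in order to parallelize Algorithm 1 with mts. We do this in two ways that may be combined. Firstly we introduce the parameter $max\_depth$ which terminates the tree search at that depth, returning any unvisited subtrees. Secondly we introduce a parameter $max\_nodes$ which terminates the tree search after this many nodes have been visited and again returns the roots of all unvisited subtrees. This entails backtracking to the root and returning the unvisited siblings of each node in the backtrack path. These modifications are straightforward and given in Algorithm 2, which reduces to Algorithm 1 by deleting the budgeting items (marked "▷ budget" below).

procedure rs($v^*$, $\Delta$, $Adj$, $f$)
    $v \leftarrow v^*$;  $j \leftarrow 0$;  $depth \leftarrow 0$
    repeat
        while $j < \Delta$ do
            $j \leftarrow j + 1$
            if $f(Adj(v, j)) = (v, j)$ then    ▷ forward step
                $v \leftarrow Adj(v, j)$
                $j \leftarrow 0$
                $depth \leftarrow depth + 1$
                output ($v$)
            end if
        end while
        if $depth > 0$ then    ▷ backtrack step
            $(v, j) \leftarrow f(v)$
            $depth \leftarrow depth - 1$
        end if
    until $depth = 0$ and $j = \Delta$
end procedure
Algorithm 1 Generic Reverse Search

procedure brs($v^*$, $\Delta$, $Adj$, $f$, $max\_depth$, $max\_nodes$)
    $v \leftarrow v^*$;  $j \leftarrow 0$;  $depth \leftarrow 0$;  $count \leftarrow 0$    ▷ budget: $count$
    repeat
        $unexplored \leftarrow false$    ▷ budget
        while $j < \Delta$ and $unexplored = false$ do    ▷ budget: second condition
            $j \leftarrow j + 1$
            if $f(Adj(v, j)) = (v, j)$ then    ▷ forward step
                $v \leftarrow Adj(v, j)$
                $j \leftarrow 0$
                $count \leftarrow count + 1$    ▷ budget
                $depth \leftarrow depth + 1$
                if $count \geq max\_nodes$ or $depth = max\_depth$ then    ▷ budget
                    $unexplored \leftarrow true$    ▷ budget: over budget
                end if    ▷ budget
                output ($v$, $unexplored$)    ▷ budget: $unexplored$ flag
            end if
        end while
        if $depth > 0$ then    ▷ backtrack step
            $(v, j) \leftarrow f(v)$
            $depth \leftarrow depth - 1$
        end if
    until $depth = 0$ and $j = \Delta$
end procedure
Algorithm 2 Budgeted Reverse Search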

To output all nodes in the subtree of $T$ rooted at $v$ we set $v^* = v$, $max\_depth = +\infty$ and $max\_nodes = +\infty$, where $max\_depth$ is the depth limit and $max\_nodes$ is the node budget. To break up $T$ into subtrees we have two options that can be combined. Firstly we can set the parameter $max\_depth$, resulting in all nodes at that depth being flagged as unexplored. Secondly we can set the budget parameter $max\_nodes$. In this case, once this many nodes have been explored the current node and all unexplored siblings on the backtrack path to the root are output and flagged as unexplored.
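As a concrete instance, the following Python sketch applies budgeted reverse search to the permutations of $1, \ldots, n$. The choices made here are illustrative assumptions and are not claimed to match the codes at [6]: the adjacency oracle `adj` swaps neighbouring positions, the local search `f` undoes the first descent, the identity permutation is the target, and `brs` and `run_all` are hypothetical names.

```python
import math

def adj(v, j):
    """Adjacency oracle: swap positions j and j+1 (j = 1, ..., n-1)."""
    w = list(v)
    w[j - 1], w[j] = w[j], w[j - 1]
    return tuple(w)

def f(v):
    """Local search: undo the first descent. Returns (parent, j), v == adj(parent, j)."""
    for i in range(1, len(v)):
        if v[i - 1] > v[i]:
            return adj(v, i), i
    return None                      # v is the target (identity): no parent

def brs(root, delta, max_depth=math.inf, max_nodes=math.inf):
    """Budgeted reverse search below `root` (the root itself is not output).

    Returns a list of (vertex, unexplored) pairs in the order visited.
    """
    out = []
    v, j, depth, count = root, 0, 0, 0
    while True:
        unexplored = False
        while j < delta and not unexplored:
            j += 1
            w = adj(v, j)
            if f(w) == (v, j):       # forward step: w is a child of v
                v, j = w, 0
                count += 1
                depth += 1
                if count >= max_nodes or depth == max_depth:
                    unexplored = True    # over budget: do not descend below v
                out.append((v, unexplored))
        if depth > 0:                # backtrack step
            v, j = f(v)
            depth -= 1
        if depth == 0 and j == delta:
            return out

def run_all(n, max_nodes):
    """Sequentially emulate the job list: rerun brs on every unexplored node."""
    jobs, seen = [tuple(range(1, n + 1))], []
    while jobs:
        for v, unexplored in brs(jobs.pop(), n - 1, max_nodes=max_nodes):
            seen.append(v)
            if unexplored:
                jobs.append(v)
    return seen
```

Calling `brs` with infinite budgets lists every permutation except the root, while `run_all` with a small node budget splits the same tree into many jobs that together report each non-root permutation exactly once.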

4.1 Example 1: Topological sorts

A C implementation (per.c) of the reverse search algorithm for generating permutations is given in the tutorial [6]. A small modification of this code generates all topological sorts of a partially ordered set that is given by a directed acyclic graph (DAG). Such topological sorts are also called linear extensions or topological orderings. The code modification is given as Exercise 5.1 and a solution to the exercise (topsorts.c) is at [6]. Here we describe how to modify this code to allow parallelization via the mts interface to produce the program mtopsorts. The details and code are available at [6].

It is convenient to describe the procedure as two phases. Phase 1 implements budgeting and organizes the internal data in a suitable way. This involves modifying an implementation of Algorithm 1 to an implementation of Algorithm 2 that can be independently tested. We need to prepare a global data structure bts_data which contains problem data obtained from the input. In Phase 2 we build a node structure for use by the mts wrapper and add necessary routines to allow initialization and I/O in a parallel setting. In practice this involves using a header file from mts. The resulting program btopsorts.c can be compiled as a sequential code or with mts as a parallel code with no change in the source files.

In the second phase we add the ‘hooks’ that allow communication with mts. This involves defining a Node structure which holds all necessary information about a node in the search tree. The roots of unexplored subtrees are maintained by mts for parallel processing. Therefore whenever a search terminates due to the depth limit or node budget restrictions, the Node structure of each unexplored tree node is returned to mts. As we do not wish to customize mts for each application, we use a very generic node structure. The user should pack and unpack the necessary data into this structure as required. The Node structure is defined in the mts header.

The efficiency of mts depends on keeping the job list non-empty until the end of the computation, without letting it get too large. Depending on the application, there may be a substantial restart cost for each unexplored subtree. Surely there is no need to return a leaf as an unexplored node, and the prune=0 option checks for this. Further, if an unexplored node has only one child it may be advantageous to explore further, terminating either at a leaf or at a node with two or more children, which is returned as unexplored. The prune=1 option handles this condition, meaning that no isolated nodes or paths are returned as unexplored. Note that pruning is not a built-in mts option; it is an example of options that applications may wish to include and was implemented in mtopsorts.
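The two refinements described above (never returning a leaf, and following a chain of single children before returning a job) can be sketched on an explicit tree as follows. The function name `trim_unexplored` and the dictionary representation are hypothetical; in a reverse search code the children would have to be found through the adjacency oracle rather than looked up.

```python
def trim_unexplored(children, node):
    """Walk down from an over-budget node while it has exactly one child.

    Returns (walked, job): `walked` lists the extra nodes visited on the way,
    and `job` is the branching node to hand back to the job list, or None
    when the walk ended at a leaf and no job needs to be returned.
    """
    walked = []
    kids = children.get(node, [])
    while len(kids) == 1:
        node = kids[0]
        walked.append(node)
        kids = children.get(node, [])
    return walked, (node if kids else None)

# "u" heads a path that reaches a branching node "w"; "p" heads a path ending in a leaf.
CHAIN = {"u": ["v"], "v": ["w"], "w": ["x", "y"], "p": ["q"]}
```

Here `trim_unexplored(CHAIN, "u")` walks through `v` and `w` and returns `w` as the job, whereas starting from `p` or from the leaf `x` returns no job at all, so the restart cost is only paid for subtrees that actually branch.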

4.2 Example 2: Spanning trees

In the tutorial [6] a C implementation (tree.c) is given for the reverse search algorithm for all spanning trees of the complete graph. An extension of this to generate all spanning trees of a given graph is stated as Exercise 6.3. Applying Phase 1 and 2 as described above results in the code btree.c. Again this may be compiled as a sequential code or with the mts wrapper to provide the parallel implementation mtree. All of these codes are given at the URL [6].

5 Applying mts to satisfiability

Boolean satisfiability (SAT) asks us to determine the existence of (or find) satisfying assignments for propositional formulas, see [12] for more background. SAT solvers have made tremendous progress over the years, and are now widely used as general NP solvers. While most application problems seem to result in easy SAT instances [10], there has long been interest in parallel SAT solvers for hard instances. Despite the many challenges [22, 27] in parallel SAT, there are recent successes [24].

There are two major approaches to parallel SAT solvers. Either one somehow partitions the space of possible assignments and uses divide-and-conquer (e.g., [1] for a recent example) or one uses the portfolio approach and runs many sequential solvers on the original problem (e.g., plingeling [11]). In either case, a major issue is determining which learnt clauses to share between workers [3] (CDCL solvers learn clauses during the search, pruning the search space; see, e.g., Chapter 4 of [12]). While sharing these clauses helps prune the search space, additional clauses slow the solver and enormous numbers of clauses are learned.

Another question for divide-and-conquer solvers is how to divide the search space. Many approaches have been tried, often setting initial variables and using a common feature of sequential solvers to “solve under assumptions”. Some recent solvers (e.g., [1] and treengeling [11]) work on these subproblems subject to some budget, and hard subproblems can be split again. Cube-and-conquer [25] is another recent approach that uses look-ahead solvers to divide the search space for CDCL solvers.

5.1 mtsat: parallelizing Minisat with mts

We used mts to implement a divide-and-conquer solver mtsat, using Minisat 2.2.0 as the sequential solver. Our goal was to demonstrate the use of shared data and show that mts can be used in settings other than enumeration. mtsat is still experimental and much work remains to reach the level of state-of-the-art dedicated parallel SAT solvers, but it allows for experimentation with, e.g., budgeting and restart strategies in parallel SAT.

Minisat [19] supports solving under assumptions, i.e. solving subject to some partial assignment. It also supports solving subject to a budget, given in propagations or conflicts, returning unknown if the given subproblem could not be solved within the budget.

The major modification required is to report unexplored partial assignments when the budget is exhausted. At any point in the search, SAT solvers distinguish between decision variables and propagated variables. Decision variables are those where the solver chose an assignment, while propagated variables are those where the solver was able to determine (because of a unit clause) that only one option need be explored. It suffices to return unexplored nodes corresponding to the current partial assignment and to those formed by taking the unexplored options for decision variables (including the last one) along the backtrack path.
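One way to read this rule is sketched below in Python. It is not the Minisat API: the trail is modelled as a plain list of (literal, is_decision) pairs with DIMACS-style signed integers, and the function name `split_trail` is hypothetical. The sketch keeps propagated literals in each returned assumption list, which is one possible choice.

```python
def split_trail(trail):
    """Turn the current trail into assumption lists for unexplored subproblems.

    trail: list of (literal, is_decision) pairs in assignment order.
    The first result is the current partial assignment; each further result
    keeps the trail up to a decision and takes that decision's other option.
    """
    jobs = [[lit for lit, _ in trail]]            # current partial assignment
    for pos, (lit, is_decision) in enumerate(trail):
        if is_decision:
            prefix = [l for l, _ in trail[:pos]]
            jobs.append(prefix + [-lit])          # unexplored option for this decision
    return jobs

# Decisions 1, 4 and -2, with -3 and 5 obtained by propagation.
EXAMPLE_TRAIL = [(1, True), (-3, False), (4, True), (5, False), (-2, True)]
```

Any assignment consistent with the propagations either agrees with every decision on the trail, and so falls under the first list, or first disagrees at exactly one decision, and so falls under exactly one of the others. The returned subproblems therefore do not overlap.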

Regarding learnt clauses, we implemented a simple scheme sharing only learnt unit clauses. The idea is that short clauses cut the search tree more than longer clauses; an early version of plingeling also shared only units [11]. We avoided more sophisticated approaches to sharing clauses [3, 1], using conflicts to prune the job list and similar ideas for simplicity.

mtsat includes additional options. For example, while the parallel solvers most similar to our approach [1, 11] budget using conflicts, we added the option to budget using decisions. Conflict budgets correspond to hitting a leaf in the search tree, while decision budgets correspond to nodes in the search space (omitting propagated variables since those are forced). Conflict budgets are attractive, but decision budgets correspond more closely to the budgets used in Section 4 and allow us to experiment with different budgeting techniques.

Modern solvers generally perform random restarts, abandoning the current search to start over (cf. Chapter 4 of [12]) and hopefully avoid getting stuck in hard parts of the search space. We split problems along the backtrack path and schedule these abandoned portions of the search space for later exploration, possibly resulting in much duplicated work. We therefore added an option to disable restarts, in order to experiment with their impact on performance in mtsat, and an option to disable formula preprocessing, in order to experiment with the idea that avoiding preprocessing can be beneficial to divide-and-conquer parallel SAT solvers [22].

The total is 50 lines of changes to legacy Minisat (including support to parse inputs from strings) of the original 4803 lines, plus a few hundred lines of generic code interfacing the Minisat API and mts that can be re-used. Essentially identical changes suffice to parallelize Glucose (since it is based on Minisat) and others, and so we also parallelize Glucose 3.0. One could easily support workers using a mix of solvers, a hybrid of the divide-and-conquer and portfolio approaches to parallel SAT.

6 Experimental results

The tests were performed at Kyoto University on mai32, a cluster of 5 nodes with a total of 192 identical processor cores, consisting of:

  • mai32abcd: 4 nodes, each containing: 2x Opteron 6376 (16-core 2.3GHz), 32GB memory, 500GB hard drive (128 cores in total);

  • mai32ef: 4x Opteron 6376 (16-core 2.3GHz), 64 cores, 256GB memory, 4TB hard drive.

A complete description of the problems solved below is given in [8] and the input files are available by following the link to tutorial2 at [6].

6.1 Topological sorts: mtopsorts

The tests were performed using the following codes:

  • VR: obtained from [16], generates topological sorts in lexicographic order via the Varol-Rotem algorithm [35] (Algorithm V in Section 7.2.1.2 of [29]);

  • Genle: also obtained from [16], generates topological sorts in Gray code order using the algorithm of Pruesse and Rotem [33];

  • btopsorts: derived from the reverse search code topsorts.c [6] as described in Section 4.1;

  • mtopsorts: mts parallelization of btopsorts.

For the tests all codes were used in count-only mode due to the enormous output that would otherwise be generated. All codes were used with default parameters:

(1)

The following graphs were chosen, listed in order of increasing edge density: , , . The constructions for the first two partial orders are well known (see, e.g., Section 7.2.1.2 of [29]) and the third is a complete bipartite graph.

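| Graph | m (nodes) | n (edges) | No. of perms | VR | Genle | btopsorts | mtopsorts, 12 cores | 24 cores | 48 cores | 96 cores | 192 cores |
|---|---|---|---|---|---|---|---|---|---|---|---|
| | 22 | 21 | 13,749,310,575 | 179 | 14 | 12723 | 1172 | 595 | 360 | 206 | 125 |
| | 42 | 61 | 24,466,267,020 | 654 | 171 | 45674 | 4731 | 2699 | 1293 | 724 | 408 |
| | 17 | 72 | 14,631,321,600 | 159 | 5 | 8957 | 859 | 445 | 249 | 137 | 85 |

Table 1: Topological sorts: mai32, times in secs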

Results are in Table 1. The reverse search code btopsorts is very slow, over 900 times slower than Genle and over 70 times slower than VR on . However, the parallel mts code obtains excellent speedups and is faster than VR on all problems when 192 cores are used.

6.2 Spanning trees: mtree

The tests were performed using the following codes:

  • grayspan: Knuth’s implementation [28] of an algorithm that generates all spanning trees of a given graph, changing only one edge at a time, as described in Malcolm Smith’s M.S. thesis, Generating spanning trees (University of Victoria, 1997);

  • grayspspan: Knuth’s improved implementation of grayspan: “This program combines the ideas of grayspan and spspan, resulting in a glorious routine that generates all spanning trees of a given graph, changing only one edge at a time, with ‘guaranteed efficiency’—in the sense that the total running time is $O(m+n+t)$ when there are $m$ edges, $n$ vertices, and $t$ spanning trees.” [28];

  • btree: derived from the reverse search code tree.c [6] as described in Section 4.2;

  • mtree: mts parallelization of btree.

Both grayspan and grayspspan are described in detail in Knuth [29]. Again all codes were used in count-only mode and with the default parameters (1). The problems chosen were the following graphs which are listed in order of increasing edge density: 8-cage, , , , . The latter 4 graphs were motivated by Table 5 in [29]: appears therein and the other graphs are larger versions of examples in that table.

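| Graph | m (nodes) | n (edges) | No. of trees | grayspan | grayspspan | btree | mtree, 12 cores | 24 cores | 48 cores | 96 cores | 192 cores |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 8-cage | 30 | 45 | 23,066,015,625 | 3166 | 730 | 10008 | 1061 | 459 | 238 | 137 | 92 |
| | 25 | 45 | 38,720,000,000 | 3962 | 1212 | 8918 | 851 | 455 | 221 | 137 | 122 |
| | 25 | 50 | 1,562,500,000,000 | 131092 | 41568 | 230077 | 26790 | 13280 | 7459 | 4960 | 4244 |
| | 14 | 49 | 13,841,287,201 | 699 | 460 | 2708 | 259 | 142 | 68 | 51 | 61 |
| | 12 | 66 | 61,917,364,224 | 2394 | 1978 | 3179 | 310 | 172 | 84 | 97 | 148 |

Table 2: Spanning tree generation: mai32, times in secs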

The computational results are given in Table 2. This time the reverse search code is a bit more competitive: about 3 times slower than grayspan and about 14 times slower than grayspspan on 8-cage, for example. The parallel mts code runs about as fast as grayspspan on all problems when 12 cores are used and is significantly faster after that. Near linear speedups are obtained up to 48 cores but then tail off. For the two dense graphs and the performance of mts is actually worse with 192 cores than with 96.

6.3 Satisfiability

The tests were performed using the following codes:

  • Minisat: version 2.2.0, classic sequential solver [19];

  • Glucose: version 3.0, sequential solver [2] derived from Minisat;

  • mtsat: parallel solver using mts and Minisat 2.2.0;

  • mtsat-glucose: parallel solver using mts and Glucose 3.0;

  • Glucose-Syrup: version 4.0, parallel (shared memory) solver;

  • lingeling, treengeling: version bbc, sequential and (shared memory) parallel solvers [11].

Benchmarking parallel SAT solvers is challenging [22] and any particular instance may give superlinear speedups or timeouts. We use a standard set of hard instances from applications, and count the number of problems that each solver can solve within a given time. We re-use the setup of [1], i.e. the 100 instances in the parallel track of SAT Race 2015 [9] and a timeout of 20 minutes. Results are in Figure 1. Because different computers were used, our results are not directly comparable to those in [1]. As noted by [10], solvers like mtsat can use substantial memory on very large instances, limiting the number of processes that can execute in a given amount of memory. The computers we used had sufficient memory for the instances used.
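The counting scheme described above can be sketched in a few lines of Python. This is only an illustration of the bookkeeping, not the scripts used for the experiments: the function names are ours, and the run times in the example are hypothetical values chosen to exercise the code, not measurements.

```python
def solved_within(runtimes, timeout):
    """Count the runs that finished within the timeout.

    A value of None marks a run that produced no answer at all.
    """
    return sum(1 for t in runtimes if t is not None and t <= timeout)


def solved_curve(runtimes, timeout):
    """Return (time, number solved) points for an 'instances solved vs time' plot."""
    finished = sorted(t for t in runtimes if t is not None and t <= timeout)
    return [(t, k) for k, t in enumerate(finished, start=1)]


# Hypothetical wall-clock times in seconds (NOT data from the experiments).
example = [12.5, None, 1187.0, 430.2, 1500.0, 88.1]
TIMEOUT = 20 * 60  # the 20 minute limit used in the text

print(solved_within(example, TIMEOUT))
print(solved_curve(example, TIMEOUT))
```

Each point of the curve says how many instances had been solved once the given time had elapsed, so a solver whose curve lies higher solves more instances in the same time.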

[Plot: problems solved (out of 100) vs time; series: Minisat, mtsat(16), mtsat(32), mtsat(64), mtsat(128), mtsat(192)]
(a) Instances solved vs time

| Solver | SAT | UNSAT | Total |
|---|---|---|---|
| Minisat | 18 | 1 | 19 |
| mtsat (16) | 17 | 1 | 18 |
| mtsat (32) | 22 | 2 | 24 |
| mtsat (64) | 23 | 4 | 27 |
| mtsat (128) | 29 | 7 | 36 |
| mtsat (192) | 35 | 10 | 45 |
| lingeling | 17 | 10 | 27 |
| treengeling (32) | 38 | 21 | 59 |

(b) Instances solved within 1200s
Figure 1: mtsat performance (decision budgeting, default parameters (1))

The results in Figure 1 show improvement from additional cores using default parameters and decision budgeting with no attempt at tuning. Performance with conflict budgeting is shown in Figure 2, using an initial budget of conflicts (i.e. the corresponding value in [1]).

[Plot: problems solved (out of 100) vs time; series: Minisat, mtsat(16), mtsat(32), mtsat(64), mtsat(128), mtsat(192)]
(a) Instances solved vs time

| Solver | SAT | UNSAT | Total |
|---|---|---|---|
| Minisat | 18 | 1 | 19 |
| mtsat (16) | 18 | 2 | 20 |
| mtsat (32) | 23 | 3 | 26 |
| mtsat (64) | 27 | 7 | 34 |
| mtsat (128) | 30 | 10 | 40 |
| mtsat (192) | 34 | 11 | 45 |

(b) Instances solved within 1200s
Figure 2: mtsat performance (conflict budgeting, , )

All non-timeout outputs are correct, and the 32-core run with conflict budgeting solves problem 62bits_10.dimacs.cnf (reported as unsolved in the SAT Race 2015 results) giving a correct satisfying assignment. It is likely that experimenting with parameter values can improve performance, and using a newer sequential solver on the workers may be another source of improvement given the performance treengeling achieves starting from the higher baseline performance of lingeling.

Along these lines, we also report results applying mts to parallelize Glucose. Figure 3 shows results using the default decision budgeting and Figure 4 shows results using conflict budgeting. Note that, as in mtsat, only unit clauses are shared in mtsat-glucose. A more sophisticated approach to sharing learnt clauses, e.g. sharing glue clauses, would likely help performance.

[Plot: problems solved (out of 100) vs time; series: Glucose, mtsat-glucose(16), mtsat-glucose(32), mtsat-glucose(64), mtsat-glucose(128), mtsat-glucose(192)]
(a) Instances solved vs time

| Solver | SAT | UNSAT | Total |
|---|---|---|---|
| Glucose | 6 | 5 | 11 |
| mtsat-glucose (16) | 26 | 5 | 31 |
| mtsat-glucose (32) | 26 | 5 | 31 |
| mtsat-glucose (64) | 31 | 6 | 37 |
| mtsat-glucose (128) | 32 | 9 | 41 |
| mtsat-glucose (192) | 35 | 10 | 45 |
| Glucose-Syrup (32) | 32 | 18 | 50 |
| Glucose-Syrup (64) | 31 | 18 | 49 |

(b) Instances solved within 1200s
Figure 3: mtsat-glucose performance (decision budgeting, default parameters (1))

[Plot: problems solved (out of 100) vs time; series: Glucose, mtsat-glucose(16), mtsat-glucose(32), mtsat-glucose(64), mtsat-glucose(128), mtsat-glucose(192)]
(a) Instances solved vs time

| Solver | SAT | UNSAT | Total |
|---|---|---|---|
| Glucose | 6 | 5 | 11 |
| mtsat-glucose (16) | 23 | 7 | 30 |
| mtsat-glucose (32) | 27 | 7 | 34 |
| mtsat-glucose (64) | 30 | 10 | 40 |
| mtsat-glucose (128) | 34 | 13 | 47 |
| mtsat-glucose (192) | 38 | 13 | 51 |

(b) Instances solved within 1200s
Figure 4: mtsat-glucose performance (conflict budgeting, , )

In all cases, we see that additional cores allow one to solve more problems given a fixed amount of time. While it is likely that performance can be improved by tuning and a better approach to sharing clauses, these results suffice for our purpose: to show that one can easily parallelize a legacy code with mts.

As mentioned earlier, conflict budgeting is most common in this kind of parallel solver. This is because generating a conflict clause guarantees at least some progress has been made and instances can have very different and enormous numbers of variables. Conflict budgeting usually slightly outperformed decision budgeting in the runs here, although this was not universal. Given the minimal tuning for both budget types, our results do not show a clear difference regarding how to budget.

7 Evaluating and improving performance

Our main measures of performance for the enumeration problems are the elapsed time taken and the efficiency defined as:

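$$\text{efficiency} = \frac{\text{single core running time}}{\text{number of cores} \times \text{multicore running time}} \qquad (2)$$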

Multiplying efficiency by the number of cores gives the speedup. Speedups that scale linearly with the number of cores give constant efficiency. External factors can affect performance as the load on the machine increases. One example is dynamic overclocking, where the speed of working cores may be increased by 25%–30% when other cores are idle. This limits the maximum efficiency achievable when all cores are used, since the single core running times are measured on otherwise idle machines. In Figure 5 we plot the efficiencies obtained by mtopsorts and mtree for the runs shown in Tables 1 and 2 respectively.

[Plot: efficiency vs number of cores (mtopsorts)]
(a) Efficiency with mtopsorts
[Plot: efficiency vs number of cores (mtree); legend includes 8-cage]
(b) Efficiency with mtree
Figure 5: Efficiency vs number of cores (data from Tables 1 and 2)
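As a worked example of the efficiency measure, the sketch below recomputes speedup and efficiency for the first row of Table 1. It takes efficiency to be the single core time divided by the product of the number of cores and the parallel time, which is consistent with the statement that efficiency times the number of cores gives the speedup. It also assumes that the btopsorts time is the single core baseline; the function name is ours.

```python
def efficiency(single_core_secs, cores, parallel_secs):
    """Speedup divided by the number of cores."""
    speedup = single_core_secs / parallel_secs
    return speedup / cores


# First row of Table 1: btopsorts time, then mtopsorts times keyed by core count.
single = 12723
parallel = {12: 1172, 24: 595, 48: 360, 96: 206, 192: 125}

for cores, secs in parallel.items():
    print(f"{cores:4d} cores: speedup {single / secs:6.1f}, "
          f"efficiency {efficiency(single, cores, secs):.2f}")
```

On this row the efficiency falls from about 0.90 with 12 cores to about 0.53 with 192 cores, while the speedup still grows from roughly 11 to roughly 102.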

The amount of work contained in a subproblem can vary dramatically. mts can produce histograms to help understand and tune its performance. We discuss three of these here: processor usage, job list size and distribution of subproblem sizes. Figure 6 shows the first two histograms for the mtopsorts run on with default parameters (1).

Figure 6: Histograms for mtopsorts on : busy workers (left) job list size (right)

We see the master struggling to keep workers busy despite having jobs available. This suggests that we can improve performance with better parameters. Here, a larger -scale or -maxnodes value may help, since it will allow workers to do more work (assuming a sufficiently large subproblem) before contacting the master.

Figure 7: Histograms with -scale -maxnodes on : busy workers (l), joblist size (r)

Figure 7 shows the result of using for -scale and for -maxnodes. These parameters produce fewer than half the total number of jobs compared to the default parameters, and increase overall performance by about five percent on this input.

In addition to the performance histograms, mts can generate a frequency file containing a list of values returned by each worker on the completion of each job. For the enumeration applications this is normally the number of nodes visited by the worker during the job. Such a list provides statistical information about the tree that is helpful when tuning the parameters for better performance. For example, it may be helpful to implement and use pruning if many jobs correspond to leaves. Likewise, increasing the budget will have limited effect if only a few jobs use the full budget. Figure 8 shows the distribution of subproblem sizes that was produced in a run of mtopsorts on with default parameters (1). is usually large so the scaled budget constraint of 200000 is normally in use. The left figure shows this constraint was invoked about 15000 times. The right figure shows that most subproblems have fewer than 40 nodes and so are not broken up. The three spikes in the middle of the left figure are interesting and show there are large numbers of subtrees with these specific sizes. This is probably due to the high symmetry of the graph .

Figure 8: Subproblem sizes for : all (left) small subproblems only (right)
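The kind of summary one might extract from such a list of per-job node counts can be sketched as follows. This is an illustration only: the on-disk format of the frequency file is not described here, so the sketch starts from a plain Python list. The function name, the treatment of a job that visits a single node as a leaf, and the sample values are all our assumptions; only the budget of 200000 and the threshold of 40 nodes come from the text.

```python
from collections import Counter


def summarize_job_sizes(sizes, budget, small=40):
    """Summarize per-job node counts against a node budget."""
    counts = Counter(sizes)
    return {
        "jobs": len(sizes),
        # Jobs that used the whole budget and were therefore broken up.
        "hit_budget": sum(1 for s in sizes if s >= budget),
        # Jobs below the 'small' threshold.
        "small": sum(1 for s in sizes if s < small),
        # Assumption: a job that visits exactly one node is a leaf.
        "leaves": counts[1],
        # Most frequent sizes, which would show up as spikes in a histogram.
        "most_common": counts.most_common(3),
    }


# Hypothetical job sizes (NOT taken from an actual run).
sample = [1, 1, 3, 17, 200000, 200000, 52, 1]
print(summarize_job_sizes(sample, budget=200000))
```

A large "leaves" count would point towards pruning, and a small "hit_budget" count would suggest that raising the budget is unlikely to change much.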

8 Conclusions

We have presented a generic framework for parallelizing reverse search codes requiring only minimal changes to the legacy code. Two features of our approach are that the parallelizing wrapper does not need to be user modified and the modified legacy code can be tested in standalone single processor mode. There is no separate library to install and just a few routines need to be inserted in a user’s existing library. Applying mts to two very basic reverse search codes, we obtained results comparable to those previously obtained by the customized mplrs wrapper applied to the complex lrs code [7]. We expect that many other reverse search applications will obtain similar speedups when parallelized with mts.

The application to SAT demonstrates the use of shared data, and the ease with which a widely-used existing legacy code can be parallelized using mts. While mtsat remains work in progress, it shows some promise and further experimentation can likely improve performance. Other ongoing work involves using mts to parallelize existing integer programming solvers that use the branch-and-bound approach.

Acknowledgements.

This work was partially supported by JSPS Kakenhi Grants 16H02785, 23700019 and 15H00847, Grant-in-Aid for Scientific Research on Innovative Areas, ‘Exploring the Limits of Computation (ELC)’.

References

  • [1] Audemard, G., Lagniez, J.M., Szczepanski, N., Tabary, S.: An adaptive parallel SAT solver. In: CP, LNCS, vol. 9892, pp. 30–48 (2016)
  • [2] Audemard, G., Simon, L.: Predicting learnt clauses quality in modern SAT solvers. In: IJCAI, pp. 399–404 (2009)
  • [3] Audemard, G., Simon, L.: Lazy clause exchange policy for parallel SAT solvers. In: SAT, LNCS, vol. 8561, pp. 197–205 (2014)
  • [4] Avis, D., Devroye, L.: An analysis of budgeted parallel search on conditional Galton-Watson trees. arXiv:1703.10731 (2017)
  • [5] Avis, D., Fukuda, K.: Reverse search for enumeration. Discrete Applied Mathematics 65, 21–46 (1993)
  • [6] Avis, D., Jordan, C.: Reverse search: tutorials (2000,2016). http://cgm.cs.mcgill.ca/~avis/doc/tutorial
  • [7] Avis, D., Jordan, C.: mplrs: A scalable parallel vertex/facet enumeration code. arXiv:1511.06487 (2015)
  • [8] Avis, D., Jordan, C.: A parallel framework for reverse search using mts. arXiv:1610.07735 (2016)
  • [9] Balyo, T., Biere, A., Iser, M., Sinz, C.: SAT Race 2015. Artificial Intelligence 241, 45–65 (2016)
  • [10] Balyo, T., Heule, M.J., Järvisalo, M.: SAT Competition 2016: Recent developments. In: AAAI (2017)
  • [11] Biere, A.: Lingeling and friends entering the SAT Challenge 2012. In: Proc. SAT Challenge 2012, Department of Computer Science Series of Publications B, University of Helsinki, vol. B-2012-2, pp. 33–34 (2012)
  • [12] Biere, A., Heule, M.J., van Maaren, H., Walsh, T. (eds.): Handbook of Satisfiability. IOS Press (2009)
  • [13] Blumofe, R.D., Leiserson, C.E.: Scheduling multithreaded computations by work stealing. Journal of the ACM 46(5), 720–748 (1999)
  • [14] Brüngger, A., Marzetta, A., Fukuda, K., Nievergelt, J.: The parallel search bench ZRAM and its applications. Ann. Oper. Res. 90, 45–63 (1999)
  • [15] Casado, L.G., Martínez, J.A., García, I., Hendrix, E.M.T.: Branch-and-bound interval global optimization on shared memory multiprocessors. Optimization Methods and Software 23(5), 689–701 (2008)
  • [16] Combinatorial Object Server: Linear extensions. http://theory.cs.uvic.ca/inf/pose/LinearExt.html
  • [17] Crainic, T.G., Le Cun, B., Roucairol, C.: Parallel Branch-and-Bound Algorithms, pp. 1–28. John Wiley & Sons, Inc. (2006)
  • [18] Djerrah, A., Le Cun, B., Cung, V.D., Roucairol, C.: Bob++: Framework for solving optimization problems with branch-and-bound methods. In: 2006 15th IEEE International Conference on High Performance Distributed Computing, pp. 369–370 (2006)
  • [19] Eén, N., Sörensson, N.: An extensible SAT-solver. In: SAT, LNCS, vol. 2919, pp. 502–518 (2003)
  • [20] Ferrez, J., Fukuda, K., Liebling, T.: Solving the fixed rank convex quadratic maximization in binary variables by a parallel zonotope construction algorithm. European Journal of Operational Research 166, 35–50 (2005)
  • [21] Goux, J.P., Kulkarni, S., Yoder, M., Linderoth, J.: Master–worker: An enabling framework for applications on the computational grid. Cluster Computing 4(1), 63–70 (2001)
  • [22] Hamadi, Y., Wintersteiger, C.M.: Seven challenges in parallel SAT solving. In: AAAI, pp. 2120–2125 (2012)
  • [23] Herrera, J.F.R., Salmerón, J.M.G., Hendrix, E.M.T., Asenjo, R., Casado, L.G.: On parallel branch and bound frameworks for global optimization. Journal of Global Optimization pp. 1–14 (2017)
  • [24] Heule, M.J., Kullmann, O., Marek, V.W.: Solving and verifying the boolean Pythagorean triples problem via cube-and-conquer. In: SAT, LNCS, vol. 9710, pp. 228–245 (2016)
  • [25] Heule, M.J., Kullmann, O., Wieringa, S., Biere, A.: Cube and conquer: Guiding CDCL SAT solvers by lookaheads. In: HVC, LNCS, vol. 7261 (2012)
  • [26] Jordan, C., Joswig, M., Kastner, L.: Parallel enumeration of triangulations. arXiv:1709.04746 (2017)
  • [27] Katsirelos, G., Sabharwal, A., Samulowitz, H., Simon, L.: Resolution and parallelizability: Barriers to the efficient parallelization of SAT solvers. In: AAAI, pp. 481–488 (2013)
  • [28] Knuth, D.E.: Programs to read. http://www-cs-faculty.stanford.edu/~uno/programs.html
  • [29] Knuth, D.E.: The Art of Computer Programming, Volume 4A. Addison-Wesley Professional (2011)
  • [30] Marzetta, A.: ZRAM: A library of parallel search algorithms and its use in enumeration and combinatorial optimization. Ph.D. thesis, Swiss Federal Institute of Technology Zurich (1998)
  • [31] Mattson, T., Sanders, B., Massingill, B.: Patterns for Parallel Programming, first edn. Addison-Wesley Professional (2004)
  • [32] McCreesh, C., Prosser, P.: The shape of the search tree for the maximum clique problem and the implications for parallel branch and bound. ACM Transactions on Parallel Computing 2(1), 8:1–8:27 (2015)
  • [33] Pruesse, G., Ruskey, F.: Generating the linear extensions of certain posets by transpositions. SIAM Journal on Discrete Mathematics 4(3), 413–422 (1991)
  • [34] Reinders, J.: Intel Threading Building Blocks. O’Reilly & Associates, Inc. (2007)
  • [35] Varol, Y.L., Rotem, D.: An algorithm to generate all topological sorting arrangements. The Computer Journal 24(1), 83–84 (1981)
  • [36] Weibel, C.: Implementation and parallelization of a reverse-search algorithm for Minkowski sums. In: 2010 Proceedings of the Twelfth Workshop on Algorithm Engineering and Experiments (ALENEX), pp. 34–42 (2010)

Appendix

1:procedure master(, , , , , , )
2:     Send () to each worker
3:     Create empty table
4:     Create empty list
5:     Get from application, add to
6:     
7:     while  is not empty or some worker is marked as working do
8:         while  is not empty and some worker not marked as working do
9:              if  then
10:                  
11:              else
12:                  
13:              end if
14:              if  then
15:                  
16:              else
17:                  
18:              end if
19:              Remove next element from
20:              Send (, , ) to first free worker
21:              Mark as working
22:              Send any in newer than has
23:         end while
24:         for each marked worker  do
25:              Check for new message from
26:              if incoming message from  then
27:                  Join list to
28:                  Receive update from
29:                  Unmark as working
30:                  if non-empty update then
31:                       Update ’s in
32:                  end if
33:              end if
34:         end for
35:     end while
36:     Call application with final set of
37:     Send terminate to all processes
38:end procedure
Algorithm 3 Master process
1:procedure worker
2:     Receive () from master
3:     Create empty
4:     while  do
5:         Wait for message from master
6:         if message is terminate then
7:              Exit
8:         end if
9:         Receive (, , )
10:         Receive updates, update local copy
11:         Call search (, , , )
12:         Send list of unfinished vertices to master
13:         Send update to master
14:         Send output list to consumer
15:     end while
16:end procedure
Algorithm 4 Worker process
1:procedure consumer
2:     while  do
3:         Wait for incoming message
4:         if message is terminate then
5:              Exit
6:         end if
7:         Output this message
8:     end while
9:end procedure
Algorithm 5 Consumer process
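
The interplay of master, worker and consumer can be illustrated with a small sequential simulation. This is a sketch under strong simplifying assumptions and not the mts implementation: there is no MPI and only one "worker" runs at a time, the search uses an explicit stack instead of reverse search, the budget is fixed rather than depending on the job list size, and no shared data is exchanged. All names are ours.

```python
def budgeted_search(root, children, max_nodes):
    """Depth-first search from root that stops after visiting max_nodes nodes.

    Returns (visited, unfinished): the nodes visited, and the roots of the
    subtrees left unexplored because the budget ran out.
    """
    visited, stack = [], [root]
    while stack and len(visited) < max_nodes:
        node = stack.pop()
        visited.append(node)
        # Push children in reverse so that they are visited left to right.
        stack.extend(reversed(children(node)))
    return visited, stack


def master(root, children, max_nodes):
    """Sequential stand-in for the master loop, running one job at a time."""
    if max_nodes < 1:
        raise ValueError("max_nodes must be at least 1, or no job makes progress")
    job_list, output, jobs_run = [root], [], 0
    while job_list:
        start = job_list.pop()
        visited, unfinished = budgeted_search(start, children, max_nodes)
        output.extend(visited)        # a consumer process would output these
        job_list.extend(unfinished)   # unfinished subtree roots become new jobs
        jobs_run += 1
    return output, jobs_run


def binary_children(node, depth=4):
    """Children in a complete binary tree whose nodes are strings of 0s and 1s."""
    return [node + "0", node + "1"] if len(node) < depth else []


nodes, jobs = master("", binary_children, max_nodes=5)
print(len(nodes), "nodes visited in", jobs, "jobs")
```

Every node is placed on exactly one stack and is either visited there or handed back as the root of a new job, so the set of nodes visited does not depend on the budget. The budget only changes how the work is cut into jobs.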