# A framework for large-scale distributed AI search across disconnected heterogeneous infrastructures

## Abstract

We present a framework for a large-scale distributed eScience Artificial Intelligence search. Our approach is generic and can be used for many different problems. Unlike many other approaches, we do not require dedicated machines, homogeneous infrastructure or the ability to communicate between nodes. We give special consideration to the robustness of the framework, minimising the loss of effort even after total loss of infrastructure, and allowing easy verification of every step of the distribution process. In contrast to most eScience applications, the input data and specification of the problem is very small, being easily given in a paragraph of text. The unique challenges our framework tackles are related to the combinatorial explosion of the space that contains the possible solutions and the robustness of long-running computations. Not only is the time required to finish the computations unknown, but also the resource requirements may change during the course of the computation. We demonstrate the applicability of our framework by using it to solve a challenging and hitherto open problem in computational mathematics. The results demonstrate that our approach easily scales to computations of a size that would have been impossible to tackle in practice just a decade ago.

## 1 Introduction

The last decade has seen an unprecedented rise in the computing power that institutions and even individuals have access to. This is not only true for individual processors, but also the number of processors and machines. During the last few years, a dramatic paradigm shift from ever faster processors to an ever increasing number of processors and processing elements has occurred. Even basic contemporary machines have several generic processing elements and specialised chips for e.g. graphics processing.

The size of problems people are interested in solving and the amount of data that needs to be processed in order to do that has grown dramatically as well. Today, amounts of data are routinely processed that could not even have been stored a decade ago. All this presents computer science with new and challenging research directions.

The processing of so-called “big data” is one of the directions where a lot of research has been done and a lot of tools have been developed. Applications can be scaled across hundreds of machines relatively easily. The situation in many areas of Artificial Intelligence is completely different, however. Distributing problems across several machines was a research endeavour long before the advent of easily accessible computational resources and big data. Solving AI problems of practical relevance has always required a large amount of computational resources.

Considering the keen interest of AI researchers in parallelisation, it is somewhat paradoxical that frameworks to distribute AI techniques are still in their infancy when it comes to practical applications. One such example is Apache Mahout [?], which leverages the generic Hadoop framework to distribute Machine Learning algorithms. For AI search on the other hand, there are, to the best of our knowledge, no similar frameworks.

Artificial Intelligence search has close links with eScience research, being used to plan workflows [?], identify optimal protein and DNA structures [?], and obtain qualitative models of dynamic systems arising in a wide range of scientific areas [?].

AI search involves the efficient creation, exploration and pruning of very large search trees (for the game of chess, the tree has an estimated nodes). In many cases it is acceptable to find the first solution from many candidates, or accept sub-optimal solutions with respect to a cost function to limit the amount of search performed. However, we often require either all solutions to a given problem, or a solution that has a guarantee of optimality.

Even when only the first solution is required, the time to find it can quickly grow to days, months or even years on a single computer. In most cases, this is unacceptable – we must be able to find a solution in less time. There are two strategies for achieving this. The AI search techniques can be improved to be more efficient for the problem or the search can be distributed across several machines such that the time to find a solution decreases without actually decreasing the total effort. The framework presented in this paper pursues the latter strategy.

Our requirements for such a framework can be summarised as follows.

*Scalability.* We want to be able to use as many resources as possible at the same time, regardless of type and location and with minimal connectivity requirements.

*Robustness.* The framework must be able to cope with hardware and similar failures. In particular, the amount of computational effort lost because of such an event should be small.

*Verifiability.* In order to be useful for solving open problems, we must be able to follow each step in the distribution process to verify that AI search proceeded correctly and no solutions were lost.

In this paper we describe a framework that fulfils these requirements. The design and implementation is motivated by the Recovery Oriented Computing [?] aspects of the much wider research into Ultralarge systems [?]. The AI search undertaken is Constraint Programming, described in Section 1.1. This is not a restriction, as most AI search problems can be expressed as Constraint Programming problems. The application area that we use to evaluate the implementation of the framework is described in Section 1.2.

### 1.1 Constraint Programming

Constraints are a natural and compact way of representing problems that are ubiquitous in everyday life. Constraint Programming investigates techniques for solving problems that involve constraints. Common application domains include other areas of Artificial Intelligence such as planning, but also real world and industrial applications such as scheduling, design and configuration or diagnosis and testing. Wallace [?] gives an early overview of application areas.

Formally, a constraint problem is a triple $(X, D, C)$, where $X$ is a finite indexed set of variables $x_1, x_2, \ldots, x_n$. Each variable $x_i$ has a finite domain of possible values $D_i$. The set $C$ is a finite set of constraints on the variables in $X$. A constraint is a relation that restricts the values of the variables in its scope. A *solution* to a constraint problem is a complete assignment of values from the respective domains to all variables such that none of the constraints is violated.

In constraint programming, a distinction is usually made between constraint satisfaction problems (CSPs) and constrained optimisation problems (COPs). A solution to the former only has to satisfy all the constraints, whereas a solution to the latter is also given a score by a cost function that needs to be optimised. As such, it is usually not sufficient to find only the first solution of a COP even if only one solution is required unless this first solution can be shown to be optimal. In the remainder of this paper we consider, without loss of generality, CSPs.

Constraint problems are typically solved by building a search tree in which the nodes are assignments of values to variables and the edges lead to assignment choices for the next variable. If at any node a constraint is violated, search backtracks by returning to a previous state. If a leaf is reached and no constraints are violated, all variables have been assigned values and this set of assignments denotes a solution to the CSP.
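The backtracking procedure described above can be sketched in a few lines of Python. This is a minimal illustration only, not the search used by any particular solver: the function names, the representation of constraints as callables over partial assignments, and the three-variable example are all assumptions made for the sketch.

```python
def less_than(a, b):
    """Hypothetical constraint factory: variable a must be smaller than variable b."""
    def check(assignment):
        # A constraint can only be violated once every variable in its scope is assigned.
        return a not in assignment or b not in assignment or assignment[a] < assignment[b]
    return check


def backtrack(domains, constraints, assignment=None):
    """Yield every complete assignment that violates none of the constraints."""
    assignment = {} if assignment is None else assignment
    if len(assignment) == len(domains):
        # Leaf reached without a violation: this is a solution.
        yield dict(assignment)
        return
    # Pick the next unassigned variable in a fixed order.
    var = next(v for v in domains if v not in assignment)
    for value in domains[var]:
        assignment[var] = value
        if all(check(assignment) for check in constraints):
            yield from backtrack(domains, constraints, assignment)
        # Undo the choice and try the next value (backtracking).
        del assignment[var]


domains = {"x": [1, 2, 3], "y": [1, 2, 3], "z": [1, 2, 3]}
constraints = [less_than("x", "y"), less_than("y", "z")]
solutions = list(backtrack(domains, constraints))
print(solutions)
```

The sketch checks constraints only after each assignment; it performs no inference on the domains of unassigned variables.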

Clearly the search trees are exponential in the number of variables. Exploring all of them is infeasible in many cases and inference is used at each node of the search tree to prune values from the domains of unassigned variables that cannot be part of a solution based on the assignments made so far. Inference also allows search to backtrack before a constraint is violated – if the domain of a particular variable becomes empty, the set of assignments made so far cannot be part of a solution.

The inference checks have a computational cost and the trade-off is between the effort of making checks – hopefully resulting in a reduction of the search space – and the effort of searching a presumably larger tree but at a cheaper cost per node. This is an area of active research and the Handbook of Constraint Programming [?] provides more details on the many techniques that can be used to solve constraint problems.

Constraint problems are often highly symmetric. Symmetries may be inherent in the problem or be created in the process of representing the problem as a CSP. A symmetry can be as simple as being able to swap the assignments of two variables in every solution or involve complex permutations of the assignments. In general, it is desirable to rule out the symmetries during search. This often leads to a massive reduction in the search space while the solutions that have been ruled out can be recovered after the problem has been solved at a low computational cost.

The process of removing symmetries is referred to as symmetry breaking. It introduces additional constraints that are redundant with respect to the original problem specification, but rule out symmetrical solutions. More details on symmetries and symmetry breaking techniques can again be found in the Constraint Programming Handbook [?].

| \* | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 4 | 4 | 0 | 0 | 4 | 4 |
| 1 | 0 | 1 | 0 | 0 | 4 | 4 | 0 | 0 | 4 | 4 |
| 2 | 2 | 2 | 2 | 2 | 5 | 5 | 2 | 2 | 5 | 5 |
| 3 | 2 | 2 | 2 | 3 | 5 | 5 | 2 | 2 | 5 | 5 |
| 4 | 0 | 0 | 0 | 0 | 4 | 4 | 4 | 4 | 0 | 0 |
| 5 | 2 | 2 | 2 | 2 | 5 | 5 | 5 | 5 | 2 | 2 |
| 6 | 0 | 0 | 2 | 2 | 4 | 5 | 6 | 7 | 8 | 9 |
| 7 | 0 | 0 | 2 | 2 | 4 | 5 | 7 | 6 | 9 | 8 |
| 8 | 2 | 2 | 0 | 0 | 5 | 4 | 8 | 9 | 7 | 6 |
| 9 | 2 | 2 | 0 | 0 | 5 | 4 | 9 | 8 | 6 | 7 |

Table 1: Multiplication table of an example semigroup of order 10.

### 1.2 Semigroups

We apply our framework to finding the semigroups of order 10. A semigroup $T = (S, *)$ consists of a set of elements $S$ and a binary operation $* : S \times S \to S$ that is *associative*, satisfying $(x * y) * z = x * (y * z)$ for each $x, y, z \in S$. Table 1 is an illustrative example of such an object. Given a permutation $\pi$ of the elements of $S$, a semigroup *isomorphic* to $T$ is obtained by permuting the rows, the columns, and finally the values according to $\pi$. An *anti-isomorphism* is the transpose of an isomorphism.

The problem addressed in this paper is finding all ways of filling in a blank table such that multiplication is associative up to symmetric equivalence, i.e. up to isomorphism or anti-isomorphism. For orders less than $10$, this problem can be solved by a combination of enumeration formulae and computation on a single processor. Table 2 – with entries taken from sequence A001423 of the On-Line Encyclopaedia of Integer Sequences – demonstrates the combinatorial growth in the number of solutions with increasing order, and motivates the use of multiple compute nodes to explore the solution space. The table for the semigroups of order $n$ has $n^2$ cells, and each of these can take any one of $n$ values. Hence the search space for order $n$ is $n^{n^2}$. For the problem under consideration, $n = 10$, the size of the search space is $10^{100}$. To put this number into context, it is currently estimated that there are approximately $10^{80}$ atoms in the universe. The search space for our problem is so vast that we cannot possibly hope to solve it by brute force search.
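The size of the raw search space follows from a simple count: a multiplication table of order $n$ has $n^2$ cells and each cell holds one of $n$ values. The short sketch below reproduces that arithmetic with Python's arbitrary-precision integers; the function name is chosen for illustration only.

```python
def search_space_size(order):
    """Number of ways to fill an order x order table with `order` possible values per cell."""
    return order ** (order * order)


size = search_space_size(10)
# Number of decimal digits minus one gives the power of ten.
print(len(str(size)) - 1)  # prints 100
```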

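| Order | Semigroups |
|---|---|
| 1 | 1 |
| 2 | 4 |
| 3 | 18 |
| 4 | 126 |
| 5 | 1,160 |
| 6 | 15,973 |
| 7 | 836,021 |
| 8 | 1,843,120,128 |
| 9 | 52,989,400,714,478 |

Table 2: Number of semigroups of each order up to isomorphism and anti-isomorphism (OEIS sequence A001423).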

Recent advances in the theory of finite semigroups have led to an enumerative formula [?] that gives the number of ‘almost all’ semigroups of given order. Despite this, 256,587,290,511,904 non-equivalent solutions had to be found using the framework described in this paper.

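The constraint model of semigroups of order 10 makes extensive use of the *element* constraint on natural numbers $N_0, \ldots, N_{n-1}$, $i$ and $e$,

$$\mathrm{element}(\langle N_0, \ldots, N_{n-1} \rangle, i, e),$$

which requires that $e$ is the $i$th element of the list $\langle N_0, \ldots, N_{n-1} \rangle$ in any solution. This constraint is implemented in many CSP solvers, including the one developed in our group, Minion [?].

We let $T_{a,b}$ be variables representing the entries in a multiplication table $T$, and $A_{a,b,c}$ the variables representing each of the products of three elements. Our basic CSP contains the variables $T_{a,b}$ and $A_{a,b,c}$, each with domain $\{0, \ldots, 9\}$. For each triple $(a, b, c)$ of values from $\{0, \ldots, 9\}$, we post the pair of constraints

$$\mathrm{element}(\langle T_{0,c}, \ldots, T_{9,c} \rangle, T_{a,b}, A_{a,b,c}) \quad \text{and} \quad \mathrm{element}(\langle T_{a,0}, \ldots, T_{a,9} \rangle, T_{b,c}, A_{a,b,c}),$$

which enforce associativity. We rule out search for semigroups given by a formula by posting constraints that require at least one assignment of all the variables in $A$ to be non-zero. A full description of the CSP model and its reduction into case-splits is given in [?].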

Finding all solutions of this CSP solves our problem apart from ruling out symmetric equivalents. Our symmetry group is the set of permutations of $\{0, \ldots, 9\}$ combined with possible transpositions of the tables. If $g$ is such a symmetry and $T$ is a multiplication table, then $T^g$ is the table obtained by first permuting the rows and columns of $T$ according to $g$, and either transposing the table or doing nothing, depending on $g$.

We ensure that only canonical solutions are returned by identifying the symmetry group using the GAP computational algebra package [?], then posting “lex-leader” symmetry-breaking constraints before search. This is a well-known technique for dealing with symmetries in CSPs [?], made harder to implement in our case because our symmetries involve both variables and values and made harder to deploy because we need to post up to $2 \times 10! = 7{,}257{,}600$ symmetry-breaking constraints.
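The idea of keeping only the lexicographically smallest member of each symmetry class can be illustrated by brute force for very small orders. The sketch below is not the constraint model used in this work: instead of posting symmetry-breaking constraints before search, it enumerates every table, keeps the associative ones, and then filters them with an explicit lex-leader test over all relabellings and transpositions. All function names are assumptions. For orders 2 and 3 the counts should agree with the corresponding entries of Table 2.

```python
from itertools import permutations, product


def is_associative(t):
    """Check (a*b)*c == a*(b*c) for every triple of elements."""
    n = len(t)
    return all(t[t[a][b]][c] == t[a][t[b][c]]
               for a in range(n) for b in range(n) for c in range(n))


def image(t, p, transpose):
    """Flattened table obtained by relabelling elements with p, optionally transposed."""
    n = len(t)
    out = [[0] * n for _ in range(n)]
    for a in range(n):
        for b in range(n):
            # In the relabelled table, p(a) * p(b) = p(a * b).
            out[p[a]][p[b]] = p[t[a][b]]
    if transpose:
        out = [list(row) for row in zip(*out)]
    return tuple(v for row in out for v in row)


def is_lex_leader(t):
    """True if t is lexicographically minimal among all its symmetric images."""
    n = len(t)
    flat = tuple(v for row in t for v in row)
    return all(flat <= image(t, p, tr)
               for p in permutations(range(n)) for tr in (False, True))


def count_semigroups(n):
    """Count associative tables of order n up to isomorphism and anti-isomorphism."""
    count = 0
    for cells in product(range(n), repeat=n * n):
        t = [cells[i * n:(i + 1) * n] for i in range(n)]
        if is_associative(t) and is_lex_leader(t):
            count += 1
    return count


print(count_semigroups(2), count_semigroups(3))
```

This exhaustive approach already becomes impractical at order 5 or so, which is why the symmetry breaking has to be built into the search itself.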

## 2 Related work

The parallelisation of depth-first search has been the subject of much research in the past. The first papers on the subject study the distribution over various specific hardware architectures and investigate how to achieve good load balancing [?]. Distributed solving of constraint problems specifically was first explored only a few years later [?].

Backtracking search in a distributed setting has also been investigated by several authors [?]. A special variant for distributed scenarios, asynchronous backtracking, was proposed in [?]. Yokoo et al. formalise the distributed constraint satisfaction problem and present algorithms for solving it [?].

Schulte presents the architecture of a system that uses networked computers [?]. The focus of his approach is to provide a high-level and reusable design for parallel search and achieve a good speedup compared to sequential solving rather than good resource utilisation. More recent papers have explored how to transparently parallelise search without having to modify existing code [?].

Most of the existing work is concerned with the problem of effectively distributing the workload such that every compute node is kept busy. The most prevalent technique used to achieve this is work stealing. The compute nodes communicate with each other and nodes which are idle request a part of the work that a busy node is doing. Blumofe and Leiserson propose and discuss a work stealing scheduler for multithreaded computations in [?]. Rolf and Kuchcinski investigate different algorithms for load balancing and work stealing in the specific context of distributed constraint solving [?].

Several frameworks for distributed constraint solving have been proposed and implemented, e.g. FRODO [?], DisChoco [?] and Disolver [?]. All of these approaches have in common that the systems to solve constraint problems are modified or augmented to support distribution of parts of the problem across and communication between multiple compute nodes. The constraint model of the problem remains unchanged however; no special constructs have to be used to take advantage of distributed solving. All parallelisation is handled in the respective solver. This does not preclude the use of an entirely different model of the problem to be solved for the distributed case in order to improve efficiency, but in general these solvers are able to solve the same model both with a single executor and distributed across several executors.

The decomposition of constraint problems into subproblems which can be solved independently has been proposed in [?], albeit in a different context. In this work, we explore the use of this technique for parallelisation. A similar approach was taken in [?], but requires parallelisation support in the solver.

## 3 Distributing CSPs

Our approach to parallelising the solving of constraint problems has been previously described in [?]. This paper updates the description and, crucially, reports results from an application of the framework.

Constraint problems are typically solved by searching through the possible assignments of values to variables. After each such assignment, inference can rule out possible future assignments based on past assignments and the constraints. This process builds a search tree that explores the space of possible (partial) solutions to the constraint problem.

There are two different ways to build up these search trees – $n$-way branching and 2-way branching. This refers to the number of new branches which are explored after each node. In $n$-way branching, all the possible assignments to the next variable are branched on. In 2-way branching, there are two branches. The left branch is of the form $x = v$, where $x$ is a variable and $v$ is a value from its domain. The right branch is of the form $x \neq v$.

The more commonly used way is 2-way branching, implemented for example in the Minion constraint solver [?], available at `http://minion.sf.net`. However, regardless of the way the branching is done, exploring the branches can be done concurrently. No information between the branches needs to be exchanged in order to find a solution to the problem.

We exploit this fact by, given the model of a constraint problem, generating new models which partition the remaining search space. These models can then be solved independently. We furthermore represent the state of the search by adding additional constraints such that the splitting of the model can occur at any point during search. The new models can be resumed, taking advantage of both the splitting of the search space and the search already performed.

### 3.1 Model splitting

Our new approach to the distributed solving of constraint problems requires the constraint solver to modify the constraint model but does not require explicit parallelisation support in the solver.

To split the remaining search space of a constraint problem, we signal the solver to stop. Now we partition the domain for the variable currently under consideration into $n$ pieces of roughly equal size. Then we create $n$ new models and to each in turn add constraints ruling out the other $n - 1$ partitions of that domain. Each one of these models restricts the possible assignments to the current variable to one $n$th of its domain.

As an example, consider the case $n = 2$. The variable under consideration is $x$ and its domain is $\{1, 2, 3, 4\}$. We generate 2 new models. One of them has the constraint $x \leq 2$ added and the other one $x \geq 3$. Thus, solving the first model will try the values 1 and 2 for $x$, whereas the second model will try 3 and 4.
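The domain partitioning step can be sketched as follows. This is an illustration under assumptions: the function names are hypothetical, the constraints are emitted as plain text rather than in any solver's input syntax, and each partition is expressed as a pair of bounds on the variable.

```python
def split_domain(domain, n):
    """Partition a domain into n contiguous pieces of roughly equal size."""
    values = sorted(domain)
    size, extra = divmod(len(values), n)
    pieces, start = [], 0
    for i in range(n):
        # The first `extra` pieces receive one additional value.
        end = start + size + (1 if i < extra else 0)
        pieces.append(values[start:end])
        start = end
    return pieces


def split_constraints(var, domain, n):
    """One bound constraint per new model; empty pieces produce no model."""
    return [f"{var} >= {piece[0]} and {var} <= {piece[-1]}"
            for piece in split_domain(domain, n) if piece]


print(split_constraints("x", [1, 2, 3, 4], 2))
```

Because the pieces are contiguous in the sorted domain, a pair of bounds is enough to exclude all other partitions even when the domain itself has gaps.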

The main problem when splitting constraint problems into parts that can be solved in parallel is that the size of the remaining search space for each of the splits is impossible to predict reliably. This directly affects the effectiveness of the splitting however – if the search space is distributed unevenly, some of the workers will be idle while the others do most of the work.

Our approach allows the search space to be split repeatedly after search has started. We use the procedure described above several times, each time adding more constraints to the model. In addition, we add *restart nogoods*, that is, additional constraints that tell the solver how much of the search space has been explored. Constraints added in a previous iteration are not affected by constraints added later – regardless of how often we split, no parts of the search space will be “lost”, potentially missing solutions. Similarly, no part of the search space will be visited repeatedly.

Assume for example that we are doing 2-way branching, the variable currently under consideration is again $x$ with domain $\{1, 2, 3, 4\}$ and the branches that we have taken to get to the point where we are are $x \neq 1$ and $x \neq 2$. The generated new models will all have the constraints $x \neq 1$ and $x \neq 2$ to get to the point in the search tree where we split the problem. Then we add constraints to partition the search space based on the remaining values in the domain of $x$ similar to the previous example. The splitting process and subsequent parallel search is illustrated in Figure 1.

Using this technique, we can create new chunks of work whenever a worker becomes idle by simply asking one of the busy workers to split the search space. The search is then resumed from where it was stopped and the remaining search space is explored in parallel by the two workers. Note that there is a runtime overhead involved with stopping and resuming search because the constraints which enable resumption must be taken into account and the solver needs to explore a small number of search nodes to get to the point where it was stopped before. There is also a memory overhead because the additional constraints need to be stored.

We have implemented this approach in a development version of Minion, which we are planning to release to the public. Experiments show that the overhead of stopping, splitting and resuming is not significant for large problems.

In practice, we run Minion for a specified amount of time, then stop, split and resume instead of splitting at the beginning and when workers become idle. This approach is much simpler and works well for large problems. The algorithm is detailed in Procedure ?. It creates an $n$-ary split tree of models for $n$ new models generated at each split. The procedure for finding all solutions is similar. Initially, the potential for distribution is small but grows exponentially as more and more search is performed. We have found that $n = 2$ works well in practice because it is the easiest to implement and minimises the number of models created.
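The run, stop, split and resume cycle can be simulated on a toy problem. The sketch below is not the Minion implementation: it uses the n-queens puzzle as a stand-in problem, a node budget in place of a time limit, and an explicit stack of sub-domains ("boxes") in place of model files with restart nogoods. With 2-way branching, the stack always describes exactly the unexplored part of the search space, so dividing it between two new work units loses nothing and duplicates nothing. All names are assumptions.

```python
from collections import deque


def prefix_ok(prefix):
    """True if no two queens in the assigned rows attack each other."""
    return all(prefix[q] != prefix[r] and abs(prefix[q] - prefix[r]) != r - q
               for r in range(len(prefix)) for q in range(r))


def run(boxes, budget):
    """Explore `boxes` depth-first with 2-way branching for at most `budget` nodes.

    Each box is a tuple of per-variable domains. Returns the solutions found
    and the boxes that are still unexplored when the budget runs out.
    """
    stack, solutions, nodes = list(boxes), [], 0
    while stack and nodes < budget:
        box = stack.pop()
        nodes += 1
        # Variables with a single remaining value form the assigned prefix.
        k = 0
        while k < len(box) and len(box[k]) == 1:
            k += 1
        prefix = [d[0] for d in box[:k]]
        if not prefix_ok(prefix):
            continue  # constraint violated: backtrack
        if k == len(box):
            solutions.append(tuple(prefix))
            continue
        v, rest = box[k][0], box[k][1:]
        stack.append(box[:k] + (rest,) + box[k + 1:])   # right branch: x != v
        stack.append(box[:k] + ((v,),) + box[k + 1:])   # left branch:  x == v
    return solutions, stack


def solve_all(n, budget):
    """Queue work units, run each for `budget` nodes, split what is left in two."""
    full = tuple(tuple(range(n)) for _ in range(n))
    queue = deque([[full]])
    solutions, units = [], 0
    while queue:
        unit = queue.popleft()
        units += 1
        found, leftover = run(unit, budget)
        solutions.extend(found)
        half = len(leftover) // 2
        # Requeue the unexplored boxes as (up to) two new work units.
        queue.extend(part for part in (leftover[:half], leftover[half:]) if part)
    return solutions, units


solutions, units = solve_all(6, budget=50)
print(len(solutions), "solutions found using", units, "work units")
```

In this simulation the work units are processed one after another; since they share no state, each could equally be handed to a different worker.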

Minion models are stored in ordinary files. Each time the search space is split, two new input files are written. We modified the output produced by Minion to include the names of the files it produced and included the name of the file that was run when the search space was split in the new model files. This way, we can easily trace the splitting of the search space across the different files.
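Recording the parent of every split model makes the split tree checkable after the fact. The following sketch assumes the lineage has already been extracted from the model and output files into a mapping from each model name to its parent and a set of models whose search ran to completion; the file format, the names and the function itself are hypothetical. It reports any model that was neither finished nor split into exactly two children.

```python
def verify_split_tree(parent_of, completed, root):
    """Return a list of problems found in the split tree; empty if all is accounted for."""
    children = {}
    for child, parent in parent_of.items():
        children.setdefault(parent, []).append(child)

    problems, seen, stack = [], set(), [root]
    while stack:
        model = stack.pop()
        if model in seen:
            continue
        seen.add(model)
        kids = children.get(model, [])
        if not kids:
            # A leaf of the split tree must have been searched to completion.
            if model not in completed:
                problems.append(f"{model}: neither completed nor split")
        elif len(kids) != 2:
            problems.append(f"{model}: expected 2 split models, found {len(kids)}")
        stack.extend(kids)

    # Models whose ancestry does not lead back to the original model.
    orphans = set(parent_of) - seen
    problems.extend(f"{m}: not reachable from {root}" for m in sorted(orphans))
    return problems


lineage = {"m1": "m0", "m2": "m0", "m3": "m1", "m4": "m1"}
print(verify_split_tree(lineage, completed={"m2", "m3", "m4"}, root="m0"))
```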

### 3.2 Comparison to existing approaches

The main advantages of our approach are as follows.

We require only minimal modifications to existing constraint solvers. In particular, we do not require network communication and work stealing to be implemented.

We do not require communication between workers to achieve good utilisation.

The creation of separate model files when splitting increases the robustness against worker failure and provides accountability for every step.

For the purposes of a framework for solving large Artificial Intelligence search problems, the last point is especially crucial. The nature of the applications that we have in mind is such that it will be neither easy to verify whether a solution is valid nor feasible to repeat the calculations to get a confirmation. Furthermore, we have to be able to rely on the capability to recover from failures without having to repeat all the work.

By creating regular “snapshots” of the search done, the resilience against failure increases. This is in contrast to most other approaches, where the reliability of the system is decreased by using techniques that distribute work and rely on several machines instead of just a single one. Such systems then have to take additional measures to mitigate the problems caused by failures of machines or communication links. Every time we split the search space, the modified models are saved. As they contain constraints that rule out the search already done, we only lose the work done after that point if a worker fails. This means that the maximum amount of work we lose in case of a total failure of all workers is the time allotted to each run before splitting multiplied by the number of workers.

We note that our approach provides many of the advantages of efforts dedicated to improving the robustness and accountability of computations, e.g. [?], but is much easier to implement and only requires a minimal amount of supporting infrastructure.

Another consequence of our approach is that the solving process can be moved to a different set of workers after it has been started without losing any work. This may become necessary if parts of the problem require much more memory to solve than other parts. Instead of provisioning workers with a large number of resources for the entire duration of the computation, it becomes feasible to do this on-demand. This allows for excellent and easy integration with existing services that offer on-demand computing, such as a cloud.

### 3.3 Large-scale distribution

In the previous sections, we have described the techniques that enable the distribution of the solving of a constraint problem across a set of workers, but not the system to take care of the actual distribution. The implementation of such a system is notoriously difficult, hence we decided to leverage a tried-and-tested existing system.

For the purposes of a framework that allows problems to be distributed across a large number of heterogeneous workers, the Condor HPC system [?] is particularly suitable. It runs in many different operating and network environments and provides most of the functionality we require out of the box. In particular, it allows for the transfer of files that are created on the worker back to the master – the constraint models that split the search space.

Condor allows work units to be submitted to a central node which puts them in a queue to be executed when a worker becomes available. In our case, a constraint model is a unit of work and splitting the search space on one of the workers creates two new units of work that are transferred back to the master and queued for execution. The Condor job submission system makes sure that a job is executed to completion, i.e. if a worker node fails while it is processing a work unit, Condor requeues the work unit and sends it to a different worker.

Each Condor work unit needs to be created separately. In order to submit models that split the search space and are created during search, we have implemented a custom control system that monitors Condor and takes the appropriate action when split models are returned. The control system is an almost trivial piece of software that was very easy to implement – all of the heavy lifting is done by Condor.
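A control system of this kind needs little more than a way to notice returned model files and to write one submit description per model. The sketch below is an assumption about how such a helper might look, not the control system used for the computation: the wrapper script name `run_model.sh`, the `.minion` file pattern and both function names are hypothetical. The submit commands it emits (`executable`, `arguments`, `transfer_input_files`, `should_transfer_files`, `when_to_transfer_output`, `queue`) are standard Condor submit description entries; the resulting text would be passed to `condor_submit`.

```python
from pathlib import Path


def submit_description(model_file, wrapper="run_model.sh"):
    """Build the text of a Condor submit description for one model file."""
    model = Path(model_file)
    return "\n".join([
        "universe = vanilla",
        f"executable = {wrapper}",            # hypothetical script that runs the solver
        f"arguments = {model.name}",
        f"transfer_input_files = {model}",
        "should_transfer_files = YES",
        "when_to_transfer_output = ON_EXIT",  # files created on the worker come back
        f"output = {model.stem}.out",
        f"error = {model.stem}.err",
        f"log = {model.stem}.log",
        "queue",
    ]) + "\n"


def new_models(directory, seen, pattern="*.minion"):
    """Return model files that have appeared in `directory` since the last call."""
    found = sorted(p for p in Path(directory).glob(pattern) if p not in seen)
    seen.update(found)
    return found


print(submit_description("models/split_7.minion"))
```

A monitoring loop would call `new_models` periodically and submit one description for every file it returns.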

While Condor is a very adequate system for our needs, its installation is not always straightforward. Ultimately, the scale of problems we are aiming for might require not thousands of machines but tens of thousands. No institution or even set of institutions has sufficient resources to make this available for a single project. Fortunately, the rise of the internet has facilitated so-called volunteer computing, where interested users can “donate” compute time to a project of their choice.

The best-known framework for such projects is BOINC, the Berkeley Open Infrastructure for Network Computing [?]. It has been used for many applications, including astrophysics, biology and mathematics. We have integrated the Minion constraint solver with the BOINC framework in a way that allows for splitting the search space. This system provides many of the benefits of Condor but makes it much easier for non-technical users to contribute.

## 4 Application and discussion

We first validated our framework empirically by using it to compute the number of semigroups of order 9, a problem that had previously been solved using non-distributed search. We were able to confirm the known result on a number of different hardware configurations and splitting parameters, i.e. the time search is run before splitting the model.

Encouraged by the results of these experiments, we started the calculation of the number of semigroups of order 10. The hardware configuration throughout the computations varied, but the principal resources we used are shown in Figure 2. Here, one of the main advantages of our framework became apparent. The different resources we used were located in different networks that did not always have unrestricted connectivity to the other nodes. One of the research group clusters for example was behind a NAT in its own private network and unable to receive connections from outside this network. We were still able to utilise the resources to their full extent.

The submit machine and the Condor master shown in Figure 2 were not used for any of the computations, but only for the management of the calculations. It should be noted that there is no reason to have dedicated machines for those purposes as the resource requirements for the tasks they performed were very low. In principle, a machine used for management of the computations could also be used to perform computations itself.

The maximum number of processors that we used in parallel at any one time was about 150. One of the reasons for using the Amazon cloud was that it turned out that the machines we had available locally did not have enough memory to explore some parts of the search space efficiently. We were able to move those calculations to virtual machines in the Amazon cloud with suitable specifications and seamlessly integrate the results of those computations with the rest.

The total CPU time we expended to solve the problem (i.e. find exactly 256,587,290,511,904 semigroups from $10^{100}$ potential tables) was approximately 133 years. This effort was achieved in approximately 18 months; full details of the mathematics and the case-splits used are described in [?]. The limiting factor was the resources that were available to us. Even though we did not start with a short search time before splitting, enough split models to utilise all our resources were available after a few hours. For shorter computations, it might be desirable to facilitate faster splitting at the beginning to achieve good utilisation earlier, but for our purposes the framework as described previously was sufficient. The number of split models produced suggested that we could have utilised up to several thousand processors to a very high degree.
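As a rough consistency check on these figures, and taking both as the approximations they are, the average number of processors busy over the whole period works out as

$$\frac{133\ \text{CPU years}}{1.5\ \text{years}} \approx 89,$$

which is compatible with the reported peak of about 150 processors in parallel.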

The robustness of our framework proved useful several times during the computations. Events that we successfully coped with included power and network outages, air-conditioning failures, physical machines being switched off and virtual machines disappearing. The damage in terms of computational effort lost was very limited in all cases. Condor was able to recover from most of these failures without any manual intervention by simply re-queueing the failed jobs. The verification of the distribution process revealed that because of the re-queueing a small part of the search space had been explored several times, but we were able to isolate and discard the duplicate model and output files.

After the computations finished, we were able to verify each step of the distribution and solving process. Therefore, we are confident that the result we obtained is correct. Ultimately, certainty of the correctness can only be established by either a new mathematical model that allows the computed number to be calculated directly, or by independent verification through a second computation.

## 5 Conclusions and future work

We have presented a framework for the large-scale distribution of AI search in constraint programming across resources with minimal network connectivity requirements. We have implemented this framework and applied the implementation to solving a hitherto open problem in computational mathematics. Throughout this application, the framework has proved to fulfil all our requirements. It is capable of scaling almost seamlessly to a large number of distributed and heterogeneous resources while minimising losses due to hardware failures. It furthermore provides the functionality to verify each step of the distribution process, creating confidence in the results.

The type of our application is relatively rare in eScience. Instead of large amounts of data to process, we have a very concise problem specification that takes vast computational resources to solve. We believe that the nature of such problems presents unique challenges to eScience that have rarely been considered so far.

There is no indication that the positive experiences we have had with the specific application described here are limited to that particular problem. We have, neither in the design of the framework nor its application, made any assumptions to that effect. We are currently evaluating the application of the framework to other problems that can be expressed as constraint problems and require large computational efforts.

An obvious avenue for future work, apart from the application to new problems, is the evaluation of the implementation of the framework that uses BOINC instead of Condor. An application to the same problem would allow us to not only judge the differences in terms of distribution effectiveness and utilisation, but also to independently verify the results that we have obtained. While we are confident that we would indeed obtain the same result, an empirical verification would eliminate any doubts about this aspect of the framework.

We are planning to release as open source the modifications we have made to the Minion constraint solver in order to support splitting searches. Furthermore, we are intending to release all other components of the framework that are not already available to the public, thus enabling other researchers to tackle similarly large problems and providing a framework that we hope will prove useful to the research community.

## Acknowledgments

The authors thank Chris Jefferson for useful discussions on the implementation of the framework. Tom Kelsey is supported by UK EPSRC grant EP/H004092/1. Lars Kotthoff is supported by an EPSRC fellowship.

Parts of the computational resources for this project were provided by an Amazon Web Services research grant. We thank the School of Computer Science, the Centre for Interdisciplinary Research in Computational Algebra and Cloud Co-laboratory (all of the University of St Andrews) for providing additional computational resources.