Goal-level Independent and-parallelism (IAP) is exploited by scheduling for simultaneous execution two or more goals which will not interfere with each other at run time. This can be done safely even if such goals produce multiple answers. The most successful IAP implementations to date have used recomputation of answers and sequentially ordered backtracking. While in principle simplifying the implementation, recomputation can be very inefficient if the parallel goals are of large granularity and produce several answers, and sequentially ordered backtracking limits parallelism. Moreover, despite the expected simplification, the implementation of the classic schemes has proved to involve complex engineering, with the consequent difficulty for system maintenance and extension, while still frequently running into the well-known trapped-goal and garbage-slot problems. This work presents an alternative parallel backtracking model for IAP and its implementation. The model features parallel out-of-order (i.e., non-chronological) backtracking and relies on answer memoization to reuse and combine answers. We show that this approach can bring significant performance advantages. It can also bring some simplification to the important engineering task involved in implementing the backtracking mechanism of previous approaches.
Parallel Backtracking with Answer Memoing for Independent And-Parallelism
(Work partially funded by EU projects IST-215483 S-Cube and FET IST-231620 HATS, MICINN projects TIN-2008-05624 DOVES, and CAM project S2009TIC-1465 PROMETIDOS. Pablo Chico is also funded by an MICINN FPU scholarship.)
Pablo Chico de Guzmán, Amadeo Casas, Manuel Carro,
and Manuel V. Hermenegildo
School of Computer Science, Univ. Politécnica de Madrid, Spain.
Samsung Research, USA.
IMDEA Software Institute, Spain.
Volume 10 (3), July 2011.
Keywords: Parallelism, Logic Programming, Memoization, Backtracking, Performance.
1 Introduction

Widely available multicore processors have brought renewed interest in languages and tools to efficiently and transparently exploit parallel execution: that is, tools to take care of the difficult [Karp and Babb (1988)] task of automatically uncovering parallelism in sequential algorithms, and languages to succinctly express this parallelism. These languages can be used both to directly write parallel applications and as targets for parallelizing compilers.
Declarative languages (and among them, logic programming languages) have traditionally been considered attractive for both expressing and exploiting parallelism due to their clean and simple semantics and their expressive power. A large amount of work has been done in the area of parallel execution of logic programs [Gupta et al. (2001)], where two main sources of parallelism have been exploited: parallelism between goals of a resolvent (And-Parallelism) and parallelism between the branches of the execution (Or-Parallelism). Systems efficiently exploiting Or-Parallelism include Aurora [Lusk et al. (1988)] and MUSE [Ali and Karlsson (1990)], while among those exploiting And-Parallelism, &-Prolog [Hermenegildo and Greene (1991)] and DDAS [Shen (1996)] are among the best known ones. In particular, &-Prolog exploits Independent And-Parallelism, where goals to be executed in parallel do not compete for bindings to the same variables at run time and are launched following a nested fork-join structure. Other systems such as &ACE [Pontelli et al. (1995)], AKL [Janson (1994)], Andorra-I [Santos-Costa (1993)] and the Extended Andorra Model (EAM) [Santos Costa, V. et al. (1991), Lopes et al. (2011)] have approached a combination of both or- and and-parallelism. In this paper, we will focus on independent and-parallelism.
While many IAP implementations obtained admirable performance results and achieved efficient memory management, implementing synchronization and working around problems such as trapped goals (Section 5) and garbage slots in the execution stacks required complex engineering: extensions to the WAM instruction set, new data structures, special stack frames in the stack sets, and others [Hermenegildo (1986)]. Due to this complexity, recent approaches have focused instead on simplicity, moving core components of the implementation to the source level. In [Casas et al. (2008)], a high-level implementation of goal-level IAP was proposed that showed reasonable speedups despite the overhead added by the high level of the implementation. Other recent proposals [Moura et al. (2008)], with a different focus than the traditional approaches to parallelism in LP, concentrate on providing machinery to take advantage of underlying thread-based OS building blocks.
A critical area in the context of IAP that has also received much attention is the implementation of backtracking. Since in IAP by definition goals do not affect each other, an obvious approach is to generate all the solutions for these goals in parallel independently, and then combine them [Conery (1987)]. However, this approach has several drawbacks. First, copying solutions, at least naively, can imply very significant overhead. In addition, this approach can perform an unbounded amount of unnecessary work if, e.g., only some of the solutions are actually needed, and it can even be non-terminating if one of the goals does not fail finitely. For these reasons the operational semantics typically implemented in IAP systems performs an ordered, right-to-left backtracking. For example, if execution backtracks into a parallel conjunction such as a & b & c, the rightmost goal (c) backtracks first. If it fails, then b is backtracked over while c is recomputed and so on, until a new solution is found or until the parallel conjunction fails. The advantage of this approach is that it saves memory (since no solutions need to be copied) and keeps close to the sequential semantics. However, it also implies that many computations are redone and a large amount of backtracking work can be essentially sequential.
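As an illustration of this ordered backtracking (a sketch in Ciao-style syntax, where &/2 denotes parallel conjunction; the predicates are illustrative, not from the benchmarks used later):

```prolog
% Each goal has two answers.
a(1). a(2).
b(1). b(2).
c(1). c(2).

% ?- a(X) & b(Y) & c(Z).
% Under right-to-left backtracking, the answers of c/1 are
% exhausted first (Z=1, Z=2); then b/1 is backtracked over while
% c/1 is recomputed from scratch, and so on.  c/1 ends up being
% recomputed once for every answer of a(X) & b(Y), and all of
% this backtracking work is essentially sequential.
```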
Herein we propose an improved solution to backtracking in IAP aimed at reducing recomputation and increasing parallelism while preserving efficiency. It combines memoization of answers to parallel goals (to avoid recomputation), out-of-order backtracking (to exploit parallelism on backtracking), and incremental computation of answers (to reduce memory consumption and avoid termination problems). The fact that in this approach the right-to-left rule may not be followed during parallel backtracking means that the answer generation order can be affected (this of course does not affect the declarative semantics) but, as explained later, it greatly simplifies the implementation. The EAM also supports out-of-order execution of goals. However, our approach differs from the EAM in that the latter is a more general and encompassing approach, offering more parallelism at the cost of higher complexity and overhead, while our proposal constitutes a simpler and more approachable solution to implement.
In the following we present our proposal and an IAP implementation of the approach, and we provide experimental data showing that the amount of parallelism exploited increases due to the parallelism in backward execution, while keeping competitive performance for first-answer queries. We also observe super-linear speedups, achievable thanks to the memoization of previous answers (which are recomputed in sequential SLD resolution). (For brevity, we assume some familiarity with the WAM [Warren (1983), Ait-Kaci (1991)] and the RAP-WAM [Hermenegildo and Greene (1991)].)
2 An Overview of IAP with Parallel Backtracking
In this section we provide a high-level view of the execution algorithm we propose to introduce some concepts which we will explain in more detail in later sections.
The IAP + parallel backtracking model we propose behaves in many respects as classical IAP approaches, but its main difference is the use of speculative backward execution (when possible) to eagerly generate additional solutions. This brings a number of additional changes which have to be accommodated. We assume, as usual in IAP, a number of agents, normally each attached to its own stack set, composed of heap, trail, stack, and goal queue (and often referred to in the following simply as a "stack"). Active agents are executing code using their stack set, and they place any new parallel work they find in their goal queue. Idle agents steal parallel work from the goal queues of other agents. (For a more in-depth description of the memory model and scheduling used in traditional IAP approaches, please refer to [Hermenegildo and Greene (1991), Shen and Hermenegildo (1996), Gupta et al. (2001)].) We will also assume that stack sets have a new memo area for storing solutions (explained further later; see Figure 2).
As in classical IAP, when a parallel conjunction is first reached, its goals are started in parallel. When a goal in the conjunction fails without returning any solution, the whole conjunction fails. And when all goals have found a solution, execution proceeds. However, and differently from classical IAP, if a solution has been found for some goals, but not for all, the agents which did finish may speculatively perform backward execution for the goals they executed (unless agents are needed to execute work which is not speculative, e.g., to generate the first answer to a goal). This in turn brings the need to stash away the generated solutions in order to continue searching for more answers (which are also saved). When all goals find a solution, those which were speculatively executing are suspended (to preserve the no-slowdown property w.r.t. sequential execution [Hermenegildo and Rossi (1995)]), their state is saved to be resumed later, and their first answer is reinstalled.
We only perform backtracking on the goals of a parallel conjunction which are on top of the stacks. If necessary, stack sections are reordered to move trapped goals to the top of the stack. In order not to impose a rigid ordering, we allow backtracking on these goals to proceed in an arbitrary order (i.e., not necessarily corresponding to the lexical right-to-left order). This opens the possibility of performing backtracking in parallel, which brings some additional issues to take care of:
When some of the goals executing backtracking in parallel find a new answer, backtracking stops by suspending the rest of the goals and saving their state.
The solution found is saved in the memoing area, in order to avoid recomputation.
Every new solution is combined with the previously available solutions. Some of these will be recovered from the memoization memory and others may simply be available if they are the last solution computed by some goal and thus the bindings are active.
If more solutions are needed, backward execution is performed in parallel again. Goals which were suspended resume where they suspended.
All this brings the need for saving and resuming execution states, memoing and recovering answers quickly, combining previously existing solutions with newly found ones, assigning agents to speculative computations only if there are no non-speculative computations available, and managing computations which change from speculative to non-speculative. Note that all parallel backtracking is speculative work, because we might need just one more answer from the rightmost parallel goal, and this is why backward execution is given less priority than forward execution. Note also that at any point in time we only have one active value for each variable. While performing parallel backtracking we can change the bindings which will be used in forward execution, but before continuing with forward execution, all parallel goals have to suspend in order to reinstall the bindings of the answer being combined.
3 An Execution Example
We will illustrate our approach, and especially the interplay of memoization and parallel backtracking in IAP execution, with a simple example program.
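As an illustration (a minimal sketch in Ciao-style syntax; a/1 and b/1 stand for arbitrary independent nondeterministic goals, and this is not the paper's original listing):

```prolog
main(X, Y) :-
    a(X) & b(Y).     % independent goals, executed in parallel

a(1). a(2). a(3).    % a/1 produces three answers
b(1). b(2).          % b/1 produces two answers
```

On backtracking over main/2, the answers already produced by one conjunct can be replayed from the memo area and combined with new answers of the other, and backward execution over a/1 and b/1 can proceed in parallel.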
4 Memoization vs. Recomputation

Classic IAP uses recomputation of answers: if we execute a(X) & b(Y), the first answer of each goal is generated in parallel. On backtracking, b(Y) generates additional answers (one by one, sequentially) until it finitely fails. Then, a new answer for goal a(X) is computed in parallel with the recomputation of the first answer of b(Y). Successive answers are computed by backtracking again on b(Y), and later on a(X). However, since a(X) and b(Y) are independent, the answers of goal b(Y) will be the same in each recomputation. Consequently, it makes sense to store its bindings after every answer is generated, and combine them with those from a(X) to avoid the recomputation of b(Y). Memoing answers does not require having the bindings for these answers on the stack; in fact they should be stashed away and reinstalled when necessary. Therefore, when a new answer is computed for a(X), the previously computed and memoized answers for b(Y) are restored and combined.
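The difference can be modeled at the source level with standard Prolog builtins (a sketch only: the actual implementation memoizes incrementally at the WAM level and runs the goals in parallel, whereas findall/3 computes all answers eagerly):

```prolog
% Naive recomputation: GoalB is re-run from scratch for every
% answer of GoalA, even though the two goals are independent.
and_recompute(GoalA, GoalB) :-
    call(GoalA),
    call(GoalB).

% Memoized version: the answers of GoalB are computed once and
% replayed from the saved list on each backtracking step of GoalA.
and_memo(GoalA, GoalB) :-
    findall(GoalB, call(GoalB), BAnswers),
    call(GoalA),
    member(GoalB, BAnswers).
```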
4.1 Answer Memoization

In comparison with tabling [Tamaki and Sato (1986), Warren (1992), Chen and Warren (1996)], which also saves goal answers, our scheme shows a number of differences. We assume that we start off with terminating programs (or that, if the original program is non-terminating in sequential Prolog, the parallel execution does not need to terminate either), and therefore we do not need to take care of the cases tabling has to: detecting repeated calls (detecting a repeated call requires traversing the arguments of a goal, which can be arbitrarily more costly than executing the goal itself: consider, for example, taking a large list and returning just its first element, as in first([X|_],X)), suspending / resuming consumers, maintaining SCCs, etc. We do not keep stored answers after a parallel call finitely fails: answers for a(X) & b(Y) are kept only for as long as the new bindings for X and Y are reachable. In fact, we can discard all stored answers as soon as the parallel conjunction continues after its last answer. Additionally, we restrict the visibility of the stored answers to the parallel conjunction: if we have a(X) & b(Y), a(Z), the call to a(Z) does not have access to the answers for a(X). While this may lead to underusing the saved bindings, it greatly simplifies the implementation and reduces the associated overhead. Therefore, we do not use the memoization machinery commonly found in tabling implementations [Ramakrishnan et al. (1995)]. Instead, we save a combination of trail and heap terms which captures all the bindings made by the execution of a goal. This requires two slight changes: we push a choicepoint before the parallel goal execution, so that all bindings to variables which live before the parallel goal execution will be recorded, and we modify the trail code to always trail variables which are not in the agent's WAM (this introduces a slight overhead, which we have measured at around 1%).
This ensures that all variable bindings we need to save are recorded on the trail. Therefore, what we need to save are the variables pointed to from the trail segment corresponding to the execution of the parallel goal (where the bindings to its free variables are recorded) and the terms pointed to by these variables. These terms are only saved if they live in the heap segment which starts after the execution of the parallel goal, since if they live below that point they existed before the parallel goal was executed and are unaffected by backtracking. Note that bindings to variables which were created within the execution of the parallel goal and which are not reachable from the argument variables do not have to be recorded, as they are not visible outside the scope of the parallel goal execution. (Another possible optimization is to share bindings corresponding to common parts of the search tree of a parallel goal: if a new answer is generated by performing backtracking on, for example, the topmost choicepoint, and the rest of the bindings generated by the goal are not changed, strictly speaking only the differing bindings have to be stored for the new answer, and not the whole section of trail and heap.) Figure 2 shows an example. G is a parallel goal whose execution unifies: X with a list existing before the execution of G, Y with a list created by G, and Z, which was created by G, with a list also created by G. Consequently, we save those variables appearing in the trail created by G which are older than the execution of G (X and Y), and all the structures hanging from them. [x,y,z] is not copied because it is not affected by backtracking. The copy operation adjusts pointers of variables in a way that is similar to what is done in tabling implementations [Ramakrishnan et al. (1995)]. For example, if we save a variable pointing to a subterm of [1,2], this variable would now point to a subterm of the copy of [1,2].
Note that this is at most the same amount of work as that of the execution of the goal, because it consists of stashing away the variables bound by the goal plus the structures created by the goal. The information related to the boundaries of the goal and its answers is kept in a centralized per-conjunction data structure, akin to a parcall frame [Hermenegildo and Greene (1991)]. Similar techniques are also used for the local stack. Reinstalling an answer for a goal boils down to copying back to the heap the terms that were previously saved and using the trail entries to make the variables in the initial call point to the terms they were bound to when the goal had finished. Some of these variables point to the terms just copied onto the heap and some will point to terms which existed previously to the goal execution and which were therefore not saved. In our example, [1,2] is copied onto the heap and unified with Y and X is unified with [x,y,z], which was already living on the heap. As mentioned before, while memoization certainly has a cost, it can also provide by itself substantial speedups since it avoids recomputations. Since it is performed only on independent goals, the number of different solutions to keep does not grow exponentially with the number of goals in a conjunction, but rather only linearly. This is an interesting case of synergy between two different concepts (independence and memoization), which in principle are orthogonal, but which happen to have a very positive mutual interaction.
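A source-level approximation of saving and reinstalling an answer, using the dynamic database (a sketch: the actual mechanism copies the relevant trail and heap segments directly, without traversing terms through the database; saved_answer/2 and the predicate names are hypothetical):

```prolog
:- dynamic(saved_answer/2).

% Save the current bindings of Goal under the identifier Id.
memo_answer(Id, Goal) :-
    copy_term(Goal, Answer),            % capture the current bindings
    assertz(saved_answer(Id, Answer)).

% Reinstall a saved answer by unifying the call pattern with it;
% enumerates all saved answers for Id on backtracking.
reinstall_answer(Id, Goal) :-
    saved_answer(Id, Goal).
```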
4.2 Combining Answers

When the last goal pending to generate an answer in a parallel conjunction produces a solution, any sibling goals which were speculatively working towards producing additional solutions have to suspend, reinstall the previously found answers, and combine them to continue with forward execution. A similar behavior is necessary when backtracking is performed over a parallel conjunction and one of the goals being reexecuted in parallel finds a new solution. At this moment, the new answer is combined with all the previous answers of the rest of the parallel goals. For each parallel goal, if it was not suspended when performing speculative backtracking, its last answer is already in the execution environment, ready to be combined. Otherwise, its first answer is reinstalled on the heap before continuing with forward execution. When there is more than one possible answer combination (because some parallel goals have already found more than one answer), a ghost choice point is created. This choice point has an "artificial" alternative which points to code that takes care of retrieving saved answers and installing their bindings. On backtracking, this code produces the combinations of answers triggered by the newly found answer (i.e., combinations already produced are not repeated). Note that this new answer may have been produced by any goal in the conjunction, but we proceed by combining from right to left. The invariant here is that before producing a new answer, all previous answer combinations have been produced, so we only need to fix the bindings for the goal which produced the new answer (say G) and successively install the bindings for the saved answers produced by the rest of the goals. Therefore, we start by installing one by one the answers previously produced by the rightmost goal.
When all its solutions are exhausted, we move on to the next goal to the left, install its next answer, and then reinstall again, one by one, the answers of the rightmost goal. When all the combinations of answers for these two goals are exhausted, we move on to the third rightmost one, and so on, skipping goal G itself, because we only need to combine its last answer since the previous ones were already combined. An additional optimization is to update the heap top pointer of the ghost choice point to point to the current heap top after copying terms from the memoization area to the heap, in order to protect these terms from backtracking for a possible future answer combination. Consequently, when the second answer of the second rightmost parallel goal is combined with all the answers of the rightmost goal, the bindings of the answers of the rightmost goal do not need to be copied onto the heap again; we only need to untrail the bindings from the last combined answer and redo the bindings of the answer being combined. Finally, once the ghost choice point is eliminated, all the terms that were copied onto the heap are released. One particular race situation needs to be considered. When a parallel goal generates a new solution, other parallel goals may also find new answers before being suspended, and thus some answers might be lost in the answer combination. In order to address this, our implementation maintains a pointer to the last combined answer of each parallel goal in the parcall frame. Therefore, if, e.g., two parallel goals, a/1 and b/1, have computed three answers each, but only two of them have been combined, the third answer of a/1 is combined with the first two answers of b/1, updating afterwards its last-combined-answer pointer to its third answer.
Once this is done, the fact that b/1 has uncombined answers is detected before performing backtracking, and the third answer of b/1 is combined with all the computed answers of a/1; then, the last combined answer of b/1 is updated to point to its last answer. Finally, when no goal is left with uncombined answers, the answer combination operation fails.
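The order in which the ghost choice point enumerates combinations corresponds to the standard right-to-left cartesian product (rightmost goal varies fastest), which can be sketched as:

```prolog
% combine(+AnswerLists, -Combination): AnswerLists holds, left to
% right, the memoized answers of each goal; on backtracking the
% answers of the rightmost goal vary fastest.
combine([], []).
combine([Answers|Rest], [A|As]) :-
    member(A, Answers),
    combine(Rest, As).

% ?- combine([[1,2],[a,b]], C).
% C = [1,a] ; C = [1,b] ; C = [2,a] ; C = [2,b].
```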
5 Trapped Goals and Backtracking Order

The classical, right-to-left backtracking order for IAP is known to bring a number of challenges, among them the possibility of trapped goals: a goal on which backtracking has to be performed becomes trapped by another goal stacked on top of it, and normal backtracking is therefore impossible. Consider the following example:

    main :- b(X,Y) & a(Z).
    b(X,Y) :- a(X) & a(Y).
    a(1).
    a(2).
6 The Scheduler for the Parallel Backtracking IAP Engine

Once we allow backward execution over any parallel goal on the top of the stacks, we can perform backtracking over all of them in parallel. Consequently, each time we perform backtracking over a parallel conjunction, each of the parallel goals of the conjunction can start speculative backward execution. As we mentioned earlier, the management of goals (when a goal is available and can start, when it has to backtrack, when messages have to be broadcast, etc.) is encoded in Prolog code which interacts with the internals of the emulator. Figure 3 shows a simplified version of such a scheduler, which is executed when agents (a) look for new work to do and (b) have to execute a parallel conjunction. Note that locks are not shown in the algorithm.

    parcall_back(NGoals, LGoals) :-
        fork(PF, NGoals, LGoals, [Handler|LHandler]),
        ( goal_not_executed(Handler) ->
            call_local_goal(Handler, Goal)
        ; true
        ),
        look_for_available_goal(LHandler),
        join(PF).

    look_for_available_goal([]) :- !.
    look_for_available_goal([Handler|LHandler]) :-
        ...
6.1 Looking for Work
Agents initially execute the agent predicate, which calls work in an endless loop to search for a parallel goal to execute via the find_parallel_goal/1 primitive, which defines the strategy of the scheduler. Available goals can be in four states: non-executed parallel goals necessary for forward execution, backtrackable parallel goals necessary for forward execution, non-executed parallel goals not necessary for forward execution (because they were generated by goals performing speculative work), and backtrackable parallel goals not necessary for forward execution. Different scheduling policies are possible in order to impose preferences among these types of goals (to, e.g., decide which non-necessary goal can be picked), but studying them is outside the scope of this paper.
Once the agent finds a parallel goal to execute, it is prepared to start execution in a clean environment. For example, if the goal has to be backtracked over and it is trapped, a primitive operation move_execution_top/1 moves the execution segment of the goal to the top of the stacks to ensure that the choice point to be backtracked over is always on the top of the stack (using the algorithm of Section 5). Also, the memoization of the last answer found is performed at this time, if the execution of the parallel goal was not suspended.
If find_parallel_goal/1 fails (i.e., no handler is returned), the agent suspends until some other agent publishes more work. call_parallel_goal/1 saves some registers before starting the execution of the parallel goal, such as the current trail and heap top, changes the state of the handler once the execution has been completed, failed, or suspended, and saves some registers after the execution of the parallel goal in order to manage trapped goals and to release the execution of the publishing agent.
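The work-search loop just described can be sketched as follows (a simplified sketch: handler-state bookkeeping, locks, and the suspension protocol are omitted, and goal_trapped/1 and wait_for_work/0 are hypothetical names):

```prolog
agent :-
    work,
    agent.                            % endless loop looking for work

work :-
    ( find_parallel_goal(Handler) ->
        ( goal_trapped(Handler) ->
            move_execution_top(Handler)  % untrap before backtracking
        ; true
        ),
        call_parallel_goal(Handler)
    ; wait_for_work                   % suspend until work is published
    ).
```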
6.2 Executing Parallel Conjunctions
The parallel conjunction operator &/2 is preprocessed and converted into parcall_back/2, which is the entry point of the scheduler, and which receives the list of goals to execute in parallel (LGoals) and the number of goals in the list. parcall_back/2 invokes first fork/4, written in C, which creates a handler for each parallel goal in the scope of the parcall frame containing information related to that goal, makes goals available for other agents to pick up, resumes suspended agents which can then steal some of the new available goals, and inserts a new choice point in order to release all the data structures on failure.
If the first parallel goal has not been executed yet, it is scheduled for local execution by call_local_goal/2, which performs housekeeping similar to that of call_parallel_goal/1. The goal may already have been executed because this parallel goal, which is always executed locally, can fail on backtracking while the rest of the parallel goals are still performing backtracking to compute more answers. In this case, the choice point of fork/4 succeeds on backtracking in order to continue forward execution and wait for the completion of the remotely executed parallel goals, which may produce more answer combinations.
Then, look_for_available_goal/1 executes locally parallel goals which have not already been taken by another agent. Finally, join/1 waits for the completion of the execution of the parallel goals, their failure, or their suspension before combining all the answers. After all answers have been combined, the goals of the parallel conjunction are activated to perform speculative backward execution.
7 Suspension of Speculative Goals
Stopping goals which are eagerly generating new solutions may be necessary for both correctness and performance reasons. The agent that determines that suspension is necessary sends a suspension event to the rest of the agents that stole any of the sibling parallel goals (accessible via the parcall frame). These events are checked in the WAM loop each time a new predicate is called, using existing event-checking machinery shared with attributed-variable handling (and therefore no additional overhead is added). When the execution has to suspend, the argument registers are saved on the heap, and a new choice point is inserted onto the stack to protect the current execution state. This choice point contains only one argument pointing to the saved registers in order to reinstall them on resumption. The alternative to be executed on failure points to a special WAM instruction which reinstalls the registers and jumps to the WAM code where the suspension was performed, after releasing the heap section used to store the argument registers. Therefore, the result of failing over this choice point is to resume the suspended execution at the point where it was suspended.
After this choice point is inserted, goal execution needs to jump back to the Prolog scheduler for parallel execution. In order to jump to the appropriate point in the Prolog scheduler (after call_parallel_goal/1 or call_local_goal/2), the WAM frame pointer is saved in the handler of the parallel goal before calling call_parallel_goal/1 or call_local_goal/2. After suspension takes place, it is reinstalled as the current frame pointer, the WAM’s next instruction pointer is updated to be the one pointed to by this frame, and this WAM instruction is dispatched. The result is that the scheduler continues its execution as if the parallel goal had succeeded.
Parallel goals to be suspended may in turn have other nested parallel calls. Suspension events are recursively sent by agents following the chain of dependencies saved in the parcall frames, similarly to the fail messages in &-Prolog [Hermenegildo and Greene (1991)].
8 A Note on Deterministic Parallel Goals
The machinery we have presented can be greatly simplified when running deterministic goals in parallel: answer memoization and answer combination are not needed, and the scheduler (Section 6) can be simplified. Knowing ahead of execution which goals are deterministic can be used to statically select the best execution strategy. However, some optimizations can be performed dynamically without compiler support (e.g., if it is not available or is imprecise). For example, the move_execution_top/1 operation may decide not to memoize the previous answer if there are no choice points associated with the execution of the parallel goal, because in that case at most one answer can be generated. By applying these dynamic optimizations, we have measured improvements of up to a factor of two in the speedups of the execution of some deterministic benchmarks.
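This dynamic optimization can be sketched as follows (all predicate names except move_execution_top/1 are hypothetical; the real test inspects the choice point segment created by the goal's execution):

```prolog
% Before moving a goal's execution segment to the top of the
% stacks, memoize its last answer only if backtracking could
% yield further answers, i.e., if the goal left choice points.
prepare_backtracking(Handler) :-
    ( has_choicepoints(Handler) ->
        memoize_last_answer(Handler)  % nondeterministic goal
    ; true                            % deterministic: nothing to save
    ),
    move_execution_top(Handler).
```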
9 Comparing Performance of IAP Models
We present here a comparison between a previous high-level implementation of IAP [Casas et al. (2008)] (which we abbreviate as seqback) with our proposed implementation (parback). Both implementations are similar in nature and have similar overheads (inherent to a high-level implementation), with the obvious main difference being the support for parallel backtracking and answer memoization in parback. Both are implemented by modifying the standard Ciao [Bueno et al. (2009), Hermenegildo et al. (2011)] distribution. We will also comment on the relation with the very efficient IAP implementation in [Hermenegildo and Greene (1991)] (abbreviated as &-Prolog) for deterministic benchmarks in order to evaluate the overhead incurred by having part of the system expressed in Prolog.
We measured the performance results of both parback and seqback on deterministic benchmarks, to determine the possible overhead caused by adding the machinery to perform parallel backtracking and answer memoization, and also, of course, on non-deterministic benchmarks. The deterministic benchmarks used are the well-known Fibonacci series (fibo), matrix multiplication (mmat), and QuickSort (qsort). fibo computes the 22nd Fibonacci number, switching to a sequential implementation from the 12th number downwards; mmat uses 50x50 matrices; and qsort is the version which uses append/3, sorting a list of 10,000 numbers. The GC suffix means that task granularity control [López-García et al. (1996)] is used for lists of size 300 and smaller.
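The parallelization of fibo with the granularity threshold described above can be sketched as follows (Ciao-style &/2; the predicate names are illustrative, not the benchmark's actual source):

```prolog
fib(N, F) :-
    N < 12, !,
    seq_fib(N, F).                 % below the threshold: sequential
fib(N, F) :-
    N1 is N - 1, N2 is N - 2,
    fib(N1, F1) & fib(N2, F2),     % independent recursive calls
    F is F1 + F2.

seq_fib(0, 0).
seq_fib(1, 1).
seq_fib(N, F) :-
    N > 1,
    N1 is N - 1, N2 is N - 2,
    seq_fib(N1, F1), seq_fib(N2, F2),
    F is F1 + F2.
```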
The selected nondeterministic benchmarks are checkfiles, illumination, and qsort_nd. checkfiles receives a list of files, each of which contains a list of file names which may or may not exist. These lists are checked in parallel to find nonexistent files which appear in all the initial files; these are enumerated on backtracking. illumination receives a board describing the possible places for lights in a room. It tries to place a light in each of the columns, but lights in consecutive columns have to be separated by a minimum distance. The eligible positions in each column are searched in parallel, and position checking is implemented with a one-second pause to represent task length. qsort_nd is a QuickSort algorithm where list elements have only a partial order. checkfiles and illumination are synthetic benchmarks which create 8 parallel goals and exploit memoization heavily. qsort_nd is a more realistic benchmark which creates over one thousand parallel goals. All the benchmarks were parallelized using CiaoPP [Hermenegildo et al. (2005)] and the annotation algorithms described in [Muthukumar et al. (1999), Cabeza (2004), Casas et al. (2007)].
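To illustrate the shape of these benchmarks, a two-list version of checkfiles could be written as follows (the actual benchmark uses eight lists, and missing/2 is a hypothetical simplification of the per-list scan):

```prolog
% checkfiles(+Lists, -File): File is a nonexistent file appearing in
% both lists.  The two scans are independent and run in parallel; on
% backtracking, memoized answers of each goal are combined instead of
% being recomputed.
checkfiles([L1, L2], File) :-
    missing(L1, F1) & missing(L2, F2),
    F1 = F2,
    File = F1.

% missing(+Names, -F): F is a name in Names that does not exist on disk.
missing(Names, F) :-
    member(F, Names),
    \+ file_exists(F).
```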
Table 1 shows the speedups obtained. Performance results for seqback and parback were obtained by averaging ten different runs of each benchmark on a Sun UltraSparc T2000 (a Niagara) with eight 4-thread cores. The speedups shown in this table are calculated with respect to the sequential execution of the original, unparallelized benchmark; therefore, the column for one thread corresponds to the slowdown incurred by executing a parallel program on a single processor. For &-Prolog we used the results in [Hermenegildo and Greene (1991)]. To complete the comparison, we note that YAP Prolog [Costa et al. (2002)], one of the most efficient Prolog systems and highly optimized for SPARC, is on these benchmarks between 2.3 and 2.7 times faster than the execution of the parallel versions of the programs on the parallel version of Ciao using only one agent, but the parallel execution still outperforms YAP. Of course, YAP could in addition take advantage of parallel execution.
For deterministic benchmarks, parback refers to the implementation presented in this paper, with improvements based on determinacy information obtained from static analysis [López-García et al. (2005)]. For nondeterministic benchmarks we compare the performance obtained when generating the first solution and when generating all the solutions (the corresponding seqback and parback rows). Additionally, we also show speedups relative to the parallel execution with memoing on one agent (which should be similar to what could be obtained by executing sequentially with memoing) in the pb_rel rows.
Table 1. Speedups obtained (columns: Benchmark, Approach, Number of threads).
The speedups obtained in both high-level implementations are very similar for the case of deterministic benchmarks. Therefore, the machinery necessary to perform parallel backtracking does not seem to degrade the performance of deterministic programs.
Static optimizations bring improved performance, but in this case their effect is quite small, partly thanks to the granularity control. When comparing with &-Prolog we of course suffer from the overhead of executing partly at the Prolog level (especially in mmat and qsort without granularity control), but even in this case we think that our current implementation is competitive. It is important to note that the &-Prolog speedups were measured on a different architecture (a Sequent Symmetry), so the comparison can only be indicative. However, the Sequents were very efficient, orthogonal multiprocessors, probably better suited than the Niagara for obtaining speedups (even if obviously not in raw speed), since their bus was comparatively faster in relation to processor speed. Running on current parallel hardware could therefore only lower the speedups of &-Prolog (and similar systems), bringing them closer to ours.
The behavior of parback and seqback is quite similar in the case of qsort_nd when only the first answer is computed, because no backtracking is performed in that case.
In the case of checkfiles and illumination, backtracking is needed even to generate the first answer, and memoing plays a more important role. The implementation using parallel backtracking is therefore much faster, even on a single processor, since recomputation is avoided. If we compute the speedup relative to the parallel execution on one processor (the pb_rel rows), the speedups obtained by parback follow the increase in the number of processors more closely, with some superlinear speedup, which is normal when the search does not follow the same order as the sequential execution, as in our case; this can be traced to the increased amount of parallel backtracking. In contrast, the speedups of seqback do not grow as much, since it performs essentially sequential backtracking.
When all the answers are required, the differences are even clearer, because much more backward execution takes place. This behavior also appears, to a lesser extent, in qsort_nd. In more detail, the parback speedups when looking for all the answers of qsort_nd are not as good, because the time spent storing and combining answers is not negligible there.
Note that the parback speedups of checkfiles and illumination stabilize between 4 and 7 processors. This is so because they generate exactly 8 parallel goals: with fewer than 8 processors, some agent must execute a second goal while the others wait for that dangling goal to finish. In the case of checkfiles we get superlinear speedup because there are 8 lists of files to check: with 8 processors the first answer can be obtained without traversing (on backtracking) any of these lists. This is not the case with 7 processors, and so there is no superlinear behavior until we hit the 8-processor mark. Additionally, since backtracking is done in parallel, the way the search tree is explored (and therefore how fast the first solution is found) can change between executions.
We have developed a parallel backtracking approach for independent and-parallelism which uses out-of-order backtracking and relies on answer memoization to reuse and combine answers. We have shown that the approach can bring interesting simplifications when compared to the complex implementation of the backtracking mechanism typical of previous systems. We have also provided experimental results that show significant improvements in the execution of non-deterministic parallel calls, due to the avoidance of answer recomputation and to the fact that parallel goals can execute backward in parallel, which was a limitation in previous similar implementations. This parallel system can be used in applications with a constraint-and-generate structure, in which checking the constraints after the search has finished does not add significant computation, and where a simple code transformation allows a sequential program to be executed in parallel.
- Ait-Kaci (1991) Ait-Kaci, H. 1991. Warren’s Abstract Machine, A Tutorial Reconstruction. MIT Press.
- Ali and Karlsson (1990) Ali, K. A. M. and Karlsson, R. 1990. The Muse Or-Parallel Prolog Model and its Performance. In 1990 North American Conference on Logic Programming. MIT Press, 757–776.
- Bueno et al. (2009) Bueno, F., Cabeza, D., Carro, M., Hermenegildo, M., López-García, P., and Puebla, G., Eds. 2009. The Ciao System. Ref. Manual (v1.13). Tech. rep., School of Computer Science, T.U. of Madrid (UPM). Available at http://www.ciaohome.org.
- Cabeza (2004) Cabeza, D. 2004. An Extensible, Global Analysis Friendly Logic Programming System. Ph.D. thesis, Universidad Politécnica de Madrid (UPM), Facultad Informatica UPM, 28660-Boadilla del Monte, Madrid-Spain.
- Casas et al. (2007) Casas, A., Carro, M., and Hermenegildo, M. 2007. Annotation Algorithms for Unrestricted Independent And-Parallelism in Logic Programs. In 17th International Symposium on Logic-based Program Synthesis and Transformation (LOPSTR’07). Number 4915 in LNCS. Springer-Verlag, The Technical University of Denmark, 138–153.
- Casas et al. (2008) Casas, A., Carro, M., and Hermenegildo, M. 2008. A High-Level Implementation of Non-Deterministic, Unrestricted, Independent And-Parallelism. In 24th International Conference on Logic Programming (ICLP’08), M. García de la Banda and E. Pontelli, Eds. LNCS, vol. 5366. Springer-Verlag, 651–666.
- Chen and Warren (1996) Chen, W. and Warren, D. S. 1996. Tabled Evaluation with Delaying for General Logic Programs. Journal of the ACM 43, 1 (January), 20–74.
- Conery (1987) Conery, J. S. 1987. Parallel Execution of Logic Programs. Kluwer Academic Publishers.
- Costa et al. (2002) Costa, V. S., Damas, L., Reis, R., and Azevedo, R. 2002. YAP User’s Manual. http://www.dcc.fc.up.pt/~vsc/Yap.
- Gupta et al. (2001) Gupta, G., Pontelli, E., Ali, K., Carlsson, M., and Hermenegildo, M. 2001. Parallel Execution of Prolog Programs: a Survey. ACM Transactions on Programming Languages and Systems 23, 4 (July), 472–602.
- Hermenegildo (1986) Hermenegildo, M. 1986. An abstract machine based execution model for computer architecture design and efficient implementation of logic programs in parallel. Ph.D. thesis, U. of Texas at Austin.
- Hermenegildo and Greene (1991) Hermenegildo, M. and Greene, K. 1991. The &-Prolog System: Exploiting Independent And-Parallelism. New Generation Computing 9, 3,4, 233–257.
- Hermenegildo et al. (2005) Hermenegildo, M., Puebla, G., Bueno, F., and López-García, P. 2005. Integrated Program Debugging, Verification, and Optimization Using Abstract Interpretation (and The Ciao System Preprocessor). Science of Computer Programming 58, 1–2.
- Hermenegildo and Rossi (1995) Hermenegildo, M. and Rossi, F. 1995. Strict and Non-Strict Independent And-Parallelism in Logic Programs: Correctness, Efficiency, and Compile-Time Conditions. Journal of Logic Programming 22, 1, 1–45.
- Hermenegildo et al. (2011) Hermenegildo, M. V., Bueno, F., Carro, M., López, P., Mera, E., Morales, J., and Puebla, G. 2011. An Overview of Ciao and its Design Philosophy. Theory and Practice of Logic Programming. http://arxiv.org/abs/1102.5497.
- Janson (1994) Janson, S. 1994. AKL: A Multiparadigm Programming Language. Ph.D. thesis, Uppsala University.
- Karp and Babb (1988) Karp, A. and Babb, R. 1988. A Comparison of 12 Parallel Fortran Dialects. IEEE Software.
- Lopes et al. (2011) Lopes, R., Santos Costa, V., and Silva, F. M. A. 2011. A Design and Implementation of the Extended Andorra Model. Theory and Practice of Logic Programming.
- López-García et al. (2005) López-García, P., Bueno, F., and Hermenegildo, M. 2005. Determinacy Analysis for Logic Programs Using Mode and Type Information. In Proceedings of the 14th International Symposium on Logic-based Program Synthesis and Transformation (LOPSTR’04). Number 3573 in LNCS. Springer-Verlag, 19–35.
- López-García et al. (1996) López-García, P., Hermenegildo, M., and Debray, S. K. 1996. A Methodology for Granularity Based Control of Parallelism in Logic Programs. Journal of Symbolic Computation, Special Issue on Parallel Symbolic Computation 21, 4–6, 715–734.
- Lusk et al. (1988) Lusk, E., Butler, R., Disz, T., Olson, R., Stevens, R., Warren, D. H. D., Calderwood, A., Szeredi, P., Brand, P., Carlsson, M., Ciepielewski, A., Hausman, B., and Haridi, S. 1988. The Aurora Or-parallel Prolog System. New Generation Computing 7, 2/3, 243–271.
- Moura et al. (2008) Moura, P., Crocker, P., and Nunes, P. 2008. High-Level Multi-Threading Programming in Logtalk. In 10th International Symposium on Practical Aspects of Declarative Languages (PADL’08), D. Warren and P. Hudak, Eds. LNCS, vol. 4902. Springer-Verlag, 265–281.
- Muthukumar et al. (1999) Muthukumar, K., Bueno, F., de la Banda, M. G., and Hermenegildo, M. 1999. Automatic Compile-time Parallelization of Logic Programs for Restricted, Goal-level, Independent And-parallelism. Journal of Logic Programming 38, 2 (February), 165–218.
- Pontelli et al. (1995) Pontelli, E., Gupta, G., and Hermenegildo, M. 1995. &ACE: A High-Performance Parallel Prolog System. In International Parallel Processing Symposium. IEEE Computer Society Technical Committee on Parallel Processing, IEEE Computer Society, 564–572.
- Ramakrishnan et al. (1995) Ramakrishnan, I., Rao, P., Sagonas, K., Swift, T., and Warren, D. 1995. Efficient tabling mechanisms for logic programs. In ICLP. 697–711.
- Santos-Costa (1993) Santos-Costa, V. M. 1993. Compile-Time Analysis for the Parallel Execution of Logic Programs in Andorra-I. Ph.D. thesis, University of Bristol.
- Santos Costa, V. et al. (1991) Santos Costa, V., Warren, D., and Yang, R. 1991. The Andorra-I Engine: A Parallel Implementation of the Basic Andorra Model. In ICLP. 825–839.
- Shen (1996) Shen, K. 1996. Overview of DASWAM: Exploitation of Dependent And-parallelism. Journal of Logic Programming 29, 1–3 (November), 245–293.
- Shen and Hermenegildo (1996) Shen, K. and Hermenegildo, M. 1996. Flexible Scheduling for Non-Deterministic, And-parallel Execution of Logic Programs. In Proceedings of EuroPar’96. Number 1124 in LNCS. Springer-Verlag, 635–640.
- Tamaki and Sato (1986) Tamaki, H. and Sato, M. 1986. OLD resolution with tabulation. In Int’l. Conf. on Logic Programming. LNCS, Springer-Verlag, 84–98.
- Warren (1983) Warren, D. 1983. An Abstract Prolog Instruction Set. Technical Report 309, Artificial Intelligence Center, SRI International, 333 Ravenswood Ave, Menlo Park CA 94025.
- Warren (1992) Warren, D. S. 1992. Memoing for logic programs. Communications of the ACM 35, 3, 93–111.