Simplifying Parallelization of Scientific Codes by a FunctionCentric Approach in Python
Abstract
The purpose of this paper is to show how existing scientific software can be parallelized using a separate thin layer of Python code where all parallel communication is implemented. We provide specific examples on such layers of code, and these examples may act as templates for parallelizing a wide set of serial scientific codes. The use of Python for parallelization is motivated by the fact that the language is well suited for reusing existing serial codes programmed in other languages. The extreme flexibility of Python with regard to handling functions makes it very easy to wrap up decomposed computational tasks of a serial scientific application as Python functions. Many parallelizationspecific components can be implemented as generic Python functions, which may take as input those functions that perform concrete computational tasks. The overall programming effort needed by this parallelization approach is rather limited, and the resulting parallel Python scripts have a compact and clean structure. The usefulness of the parallelization approach is exemplified by three different classes of applications in natural and social sciences.
, , ,
1 Introduction
Due to limited computing power of standard serial computers, parallel computing has become indispensable for investigating complex problems in all fields of science. A frequently encountered question is how to transform an existing serial scientific code into a new form that is executable on a parallel computing platform. Although portable parallel programming standards, such as MPI and OpenMP, have greatly simplified the programming work, the task of parallelization may still be quite complicated for domain scientists. This is because inserting MPI calls or OpenMP directives directly into an existing serial code often requires extensive code rewrite as well as detailed knowledge of and experience with parallel programming.
The hope for nonspecialists in parallel computing is that many scientific applications possess highlevel parallelism. That is, the entire computational work can be decomposed into a set of individual (and often collaborative) computational tasks, each of coarse grain, and can be performed by an existing piece of serial code. Depending on the specific application, the decomposition can be achieved by identifying a set of different parameter combinations, or (fully or almost) independent computations, or different data groups, or different geometric subdomains. For a given type of decomposition, the parallelization induced programming components, such as work partitioning, domain partitioning, communication, load balancing, and global administration, are often generic and independent of specific applications. These generic components can thus be implemented as reusable parallelization libraries once and for all. This is what we exemplify in the present paper.
It is clear that a userfriendly parallelization approach relies on at least two factors: (1) The existing serial code should be extensively reused; (2) The programming effort by the end user must be limited. To achieve these goals we suggest to use Python to wrap up pieces of existing serial code (possibly written in other languages), and implement the parallelization tasks in separate and generic Python functions.
Python [1] is an extremely expressive and flexible programming language at its core. The language has been extended with numerous numerical and visualization modules such as NumPy [2] and SciPy [3]. The two requirements of a userfriendly parallelization mentioned above are actually well met by Python. First of all, Python is good at interoperating with other languages, especially Fortran, C, and C++, which are heavily used in scientific codes. Using wrapper tools such as F2PY [4], it is easy to wrap up an existing piece of code in Fortran and C and provide it with a Pythonic appearance.
Moreover, among its many strong features, Python is extremely flexible with handling functions. Python functions accept both positional arguments and keyword arguments. The syntax of a variable set of positional and keyword arguments (known as “(*args,**kwargs)” to Python programmers) allows writing libraries routines that work with any type of userdefined functions. That is, the syntax makes it possible to call a Python function without revealing the exact number of arguments.
It is also straightforward to pass functions as input arguments to a Python function and/or return a function as output. A callable class object in Python can be used as if it were a standalone function. Such a construction, or alternatively a closure (known from functional programming), can be used to create functions that carry a state represented through an arbitrarily complex data structure. The result is that one can express the flow of a scientific code as a Python program containing a set of calls to userdefined Python functions. These userdefined functions can be ordinary functions or classes that wrap pieces of the underlying scientific code. This is what we call a functioncentric representation of the scientific code. With such a functioncentric approach, we can build a general framework in Python for almost automatic parallelization of the program flow in the original scientific code. Later examples will convey this idea in detail.
Performance of the resulting parallel application will closely follow the performance of the serial application, because the overhead of the parallelization layer in Python is just due to a small piece of extra code, as we assume the main computational work to take place in the Python functions that call up pieces of the original scientific code. In the parallelization layer, good performance can be ensured by using efficient array modules in Python (such as numpy [2]) together with lightweight MPI wrappers (such as pypar [5]). For examples of writing efficient Python code segments for some standard serial and parallel numerical computations, we refer the reader to Cai et al. [6].
Related Work.
In C++, generic programming via templates and objectoriented programming has been applied to parallelizing serial scientific codes. Two examples can be found in [7] and [8], where the former uses C++ class hierarchies to enable easy implementation of additive Schwarz preconditioners, and the latter uses C++ templates extensively to parallelize finite element codes. Many scientific computing frameworks have also adopted advanced programming to incorporate parallelism behind the scene. In these frameworks (see, e.g., [9, 10, 11, 12, 13]) the users can write parallel applications in a style quite similar to serial programming, without being exposed to many parallelizing details. Likewise are frameworks that are specially designed to allow coupling of different serial and parallel components, such as Cactus [14] and MpCCI [15]. The Python programming language, however, has not been widely used to parallelize existing serial codes. The StarP system [16] provides the user with a programming environment where most of the parallelism is kept behind the scene. Hinsen [17] has combined Python with BSP to enable highlevel parallel programming. In addition, quite a number of Python MPI wrappers exist, such as pyMPI [18], pypar [5], MYMPI [19], mpi4py [20, 21], and Scientific.MPI [22]. Efforts in incorporating parallelism via language extensions of Python can be found in [23, 24, 25].
The contribution of the present paper is to show by examples that a functioncentric approach using Python may ease the task of parallel scientific programming. This result is primarily due to Python’s flexibility in function handling and function arguments. As a result, generic tasks that arise in connection with parallelization can often be programmed as a collection of simple and widely applicable Python functions, which are ready to be used by nonspecialists to parallelize their existing serial codes.
This paper contains three examples with different algorithmic structures. A wide range of problems in science can be attacked by extending and adapting the program code in these examples. Moreover, readers whose problems are not covered by the examples will hopefully from these examples understand how we solve programming problems by identifying the principal, often simplified, underlying algorithmic structure; then creating generic code to reflect the structure; and finally applying the generic code to a specific, detailed case. Our approach is much inspired by the success of mathematics in problem solving, i.e., detecting the problem’s principal structure and devising a generic solution makes complicated problems tractable. With Python as tool, we demonstrate how this strategy carries over to parallelization of scientific codes.
The remainder of the paper is organized as follows. We give in Section 2 a simple but motivating example, explaining the principles of splitting a problem into a set of function calls that can easily be parallelized. Generic parallelization of three common types of real scientific applications are then demonstrated in Section 3. Afterwards, Section 4 reports the computational efficiency of the suggested parallelization approach applied to specific cases in the three classes of scientific problems. Some concluding remarks are given in Section 5.
2 A Motivating Simple Example
2.1 Serial Version
Suppose we want to carry out a parameter analysis that involves a large number of evaluations of a multivariable mathematical function . The Python implementation of may use positional arguments and keyword arguments such that the total arguments contain at least the variables (i.e., ). As a very simple example, consider the parabola with the following Python implementation ():
deffunc(x,a=0,b=0,c=1): returna*x**2+b*x+c
Suppose we want to evaluate func for a particular set of input parameters chosen from a large search space, where , , , and vary in specified intervals. The complete problem can be decomposed into three main steps: (1) initialize a set of arguments to func; (2) evaluate func for each entry in the set of arguments; (3) process the set of function return values from all the func calls.
Step (1) calls a userdefined function initialize which returns a list of 2tuples, where each 2tuple holds the positional and keyword arguments (as a tuple and a dictionary) for a specific call to func. Step (2) iterates over the list from step (1) and feed the positional and keyword arguments into func. The returned value (tuple) is stored in a result list. Finally, step (3) processes the result list in a userdefined function finalize which takes this list as input argument.
A generic Python function that implements the threestep parameter analysis can be as follows:
defsolve_problem(initialize,func,finalize): input_args=initialize() output=[func(*args,**kwargs)forargs,kwargsininput_args] finalize(output)
Note that the use of list comprehension in the above code has given a very compact implementation of the forloop for going through all the evaluations of func. The initialize, func, and finalize functions are passed to solve_problem as input arguments. These three userdefined functions are independent of solve_problem.
As an example, assume that x is a set of uniformly distributed coordinates in , and we vary and in each with values, while is fixed at the value 5. For each combination of and , we call func with the vector x as a positional argument and the , , values as keyword arguments, and store the evaluation results of func in a list named output. The objective of the computations is to extract the and values for which func gives a negative value for one or several of the coordinates . For this very simple example, the concrete implementation of the initialize and finalize functions can be put inside a class named Parabola as follows:
classParabola: def__init__(self,m,n,L): self.m,self.n,self.L=m,n,L definitialize(self): x=numpy.linspace(0,self.L,self.n) a_values=numpy.linspace(1,1,self.m) b_values=numpy.linspace(1,1,self.m) c=5 self.input_args=[] foraina_values: forbinb_values: func_args=([x],{’a’:a,’b’:b,’c’:c}) self.input_args.append(func_args) returnself.input_args deffunc(self,x,a=0,b=0,c=1): returna*x**2+b*x+c deffinalize(self,output_list): self.ab=[] forinput,resultinzip(self.input_args,output_list): ifmin(result)<0: self.ab.append((input[0][1],input[0][2]))
Now, to find the combinations of and values that make , we can write the following two lines of code (assuming , , and ):
problem=Parabola(100,50,10) solve_problem(problem.initialize,problem.func,problem.finalize)
Note that the desired combinations of and values will be stored in the list problem.ab. Also note that we have placed func inside class Parabola, to have all pieces of the problem in one place, but having func as standalone function or a class method is a matter of taste.
Despite the great mathematical simplicity of this example, the structure of the solve_problem function is directly applicable to a wide range of much more advanced problems. Although initialize and finalize are Python functions with very simple arguments (none and a list, respectively), this is not a limitation of their applicability. For example, the initialize step in our simple example needs values for , , and , the and interval and so on, which can not be specified in the generic solve_problem function. To overcome this limitation, the information of , , and can be hardcoded (not recommended), or transferred to initialize through global variables (not recommended in general) or carried with initialize as a state, either as class attributes or as a surrounding scope in a closure. We have chosen the class approach, i.e., class attributes store userdependent data structures such that the initialize and finalize methods can have the simple input argument structure demanded by the generic solve_problem function. Alternatively, a closure as follows can be used instead of a class (this construct requires some knowledge of Python’s scoping rules):
definitialize_wrapper(m,n,L): definitialize(self): x=numpy.linspace(0,L,n) a_values=numpy.linspace(1,1,m) ... returninput_args returninitialize
Now, the returned initialize function will carry with it the values of , , and in the surrounding scope. The choice between the class approach and the closure approach, or using global variables in a straightforward global initialize function, is up to the programmer. The important point here is that initialize must often do a lot, and the input information to initialize must be handled by some Python construction. Similar comments apply to finalize.
2.2 Parallel Version
Let us say that we want to utilize several processors to share the work of all the func evaluations, i.e., the forloop in the generic solve_problem function. This can clearly be achieved by a taskparallel approach, where each evaluation of func is an independent task. The main idea of parallelization is to split up the forloop into a set of shorter forloops, each assigned to a different processor. In other words, we need to split up the input_args list into a set of sublists for the different processors. Note that this partitioning work is generic, independent of both the func function and the actual arguments in the input_args list. Assuming homogeneous processors and that all the function evaluations are equally expensive, we can divide the input_args list into num_procs (number of processors) sublists of equal length. In case input_args is not divisible by num_procs, we adjust the length of some sublists by 1:
defsimple_partitioning(length,num_procs): sublengths=[length/num_procs]*num_procs foriinrange(length%num_procs):#treatmentofremainder sublengths[i]+=1 returnsublengths defget_subproblem_input_args(input_args,my_rank,num_procs): sub_ns=simple_partitioning(len(input_args),num_procs) my_offset=sum(sub_ns[:my_rank]) my_input_args=input_args[my_offset:my_offset+sub_ns[my_rank]] returnmy_input_args
Using the above generic get_subproblem_input_args function, each processor gets its portion of the global input_args list, and a shorter forloop can be executed there. Note that the syntax of Python lists and numpy arrays has made the function very compact.
The next step of parallelization is to collect the function evaluation results from all the processors into a single global output list. Finally, we let finalize(output) run only on the master processor (assuming that this work does not require parallelization). For the purpose of collecting outputs from all the processors, the following generic Python function can be used:
defcollect_subproblem_output_args(my_output_args,my_rank, num_procs,send_func,recv_func): ifmy_rank==0:#masterprocess? output_args=my_output_args foriinrange(1,num_procs): output_args+=recv_func(i) returnoutput_args else: send_func(my_output_args,0) returnNone
The last two input arguments to the above function deserve some attention. Both send_func and recv_func are functions themselves. In the case of using the pypar wrapper of MPI commands, we may simply pass pypar.send as the send_func input argument and pypar.receive as recv_func. Moreover, switching to another MPI module is transparent with regard to the generic function named collect_subproblem_output_args. It should also be noted that most Python MPI modules are considerably more userfriendly than the original MPI commands in C/Fortran. This is because (1) the use of keyword arguments greatly simplifies the syntax, and (2) any picklable (marshalable) Python data type can be communicated directly.
Now that we have implemented the generic functions get_subproblem_input_args and collect_subproblem_output_args, we can write a minimalistic parallel solver as follows:
defparallel_solve_problem(initialize,func,finalize, my_rank,num_procs,send,recv): input_args=initialize() my_input_args=get_subproblem_input_args(input_args, my_rank,num_procs) my_output=[func(*args,**kwargs)\ forargs,kwargsinmy_input_args] output=collect_subproblem_output_args(my_output,my_rank, num_procs,send,recv) ifmy_rank==0: finalize(output)
We remark that the above function is generic in the sense that it is independent of the actual implementation of initialize, func, and finalize, as well as the Python MPI module being used. All problems that can be composed from independent function calls can (at least in principle) be parallelized by the shown small pieces of Python code.
As a specific example of using this parallel solver, we may address the problem of evaluating the parabolic function (func and class Parabola) for a large number of parameters. Using the pypar MPI module and having the problemdependent code in a module named Parabola and the general functioncentric tools in a module named function_centric, the program becomes as follows:
fromParabolaimportfunc,Parabola fromfunction_centricimportparallel_solve_problem problem=Parabola(m=100,n=50,L=10) importpypar my_rank=pypar.rank() num_procs=pypar.size() parallel_solve_problem(problem.initialize, func, problem.finalize, my_rank,num_procs, pypar.send,pypar.receive) pypar.finalize()
To the reader, it should be obvious from this generic example how to parallelize other independent function calls by the described functioncentric approach.
3 FunctionCentric Parallelization
We have shown how to parallelize a serial program that is decomposable into three parts: initialize, calls to func (i.e., a set of independent tasks), and finalize. In this section, we describe how the functioncentric parallelization is helpful for three important classes of scientific applications: Markov chain Monte Carlo simulations, dynamic population Monte Carlo simulations, and solution of partial differential equations. We use Python to program a set of simple and generic parallelization functions.
3.1 Parallel Markov chain Monte Carlo Simulations
The standard Markov chain Monte Carlo algorithms are embarrassingly parallel and have exactly the same algorithmic structure as the example of parameter analysis in Section 2. This means that the functions initialize, func, and finalize can easily be adapted to Monte Carlo problems. More specifically, the initialize function prepares the set of random samples and other input parameters. Some parametric model is computed by the func function, whereas finalize collects the data returned from all the func calls and prepares for further statistical analysis.
Functioncentric parallelization of Markov chain Monte Carlo applications closely follows the example in Section 2. We can reuse the three generic functions named get_subproblem_input_args, collect_subproblem_output_args, and parallel_solve_problem, assuming that all the func evaluations are equally costly and all the processors are equally powerful so there is no need for more sophisticated load balancing.
3.2 Population Monte Carlo with Dynamic Load Balancing
A more advanced branch of Monte Carlo algorithms is population Monte Carlo, see [26]. Here, a group of walkers, also called the population, is used to represent a highdimensional vector and the computation is carried out by a random walk in the state space. During the computation some of these walkers may be duplicated or deleted according to some acceptance/rejection criteria, i.e., the population is dynamic in time. Population Monte Carlo algorithms have been proven useful in a number of fields, spanning from polymer science to statistical sciences, statistical physics, and quantum physics.
Unlike the examples so far, where the computational tasks were totally independent and of static size, population Monte Carlo algorithms may be viewed as an iteration in time where we repeatedly do some work on a dynamic population, including moving the walkers of the population and adjusting the population size, which in a parallel context calls for dynamic load balancing.
3.2.1 Serial Implementation
A serial implementation of the time integration function can be as follows:
deftime_integration(initialize,do_timestep,finalize): walkers,timesteps=initialize() output=[] forstepinrange(timesteps): old_walkers_len=len(walkers) output.append(do_timestep(walkers)) walkers.finalize_timestep(old_walkers_len,len(walkers)) finalize(output)
The input arguments to the generic time_integration function are three functions: initialize, do_timestep, and finalize. This resembles the threestep structure discussed in Section 2. The do_timestep function can have a unified implementation for all the variants of population Monte Carlo algorithms. The other two input functions are typically programmed as methods of a class that implements a particular algorithm (such as diffusion Monte Carlo in Section 4.2). Here, the initialize method sets up a population object walkers (to be explained below) and determines the number of time steps the walkers are to be propagated. The finalize method can, e.g., store the output for later analysis.
The purpose of the do_timestep function is to implement the work for one time step, including propagating the walkers and adjusting the population. An implementation that is applicable for all population Monte Carlo algorithms may have the following form:
defdo_timestep(walkers):
walkers.move()
forwalkerinrange(len(walkers)):
ifwalkers.get_marker(walker)==0:
walkers.delete(walker)
elifwalkers.get_marker(walker)>1:
walkers.append(walker,walkers.get_marker(walker)1)
returnwalkers.sample_observables()
The above implementation of time_integration and do_timestep assumes that walkers is an object of a class, say with name Walkers, that has a certain number of methods. Of course, the flexibility of Python allows that the concrete implementation of class Walkers be made afterwards, unlike C++ and Java that require class Walkers be written before implementing time_integration and do_timestep. Here, we expect class Walkers to provide a generic implementation of a group of walkers, with supporting methods for manipulating the population. The most important methods of class Walkers are as follows:

move() carries out the work of moving each walker of the population randomly according to some rule or distribution function.

get_marker(walker) returns the number of copies belonging to a walker with index walker, where 0 means the walker should be deleted, 2 or more means that clones should be created.

append(walker, nchilds) and delete(walker) carry out the actual cloning and removal of a walker with index walker.

sample_observables() returns the observables at a given time step, e.g., an estimate of the system energy.

finalize_timestep(old_size, new_size) does some internal book keeping at the end of each time step, such as adjusting some internal variables. It takes as input the total number of walkers before and after the walker population has been adjusted by the do_timestep function.

__len__ is one of Python’s special class methods and is in our case meant to return the number of walkers. A call len(walkers) yields the same result as walkers.__len__().
For a real application, such as the diffusion Monte Carlo algorithm (see Section 4.2 and Appendix B), the concrete implementation of the methods should reflect the desired numerical algorithm. For example, the move method of diffusion Monte Carlo uses diffusion and branching as the rule to randomly move each walker, and the finalize_timestep method adjusts the branching ratio.
3.2.2 Parallelization
Parallelism in population Monte Carlo algorithms arises naturally from dividing the walkers among the processors. Therefore, a parallel version of the time_integration function may be as follows:
defparallel_time_integration(initialize,do_timestep,finalize, my_rank,num_procs,send,recv,all_gather): my_walkers,timesteps=initialize(my_rank,num_procs) old_walkers_len=sum(all_gather(numpy.array([len(my_walkers)]))) my_output=[] forstepinrange(timesteps): #dowhatisrequiredatthistimestepandmeasureCPUtime t_start=time.time() results=do_timestep(my_walkers) my_output.append(results) task_time=time.time()t_start #redistributewalkersandgetwalkersizeperprocess num_walkers_per_proc=dynamic_load_balancing(\ my_walkers,task_time,my_rank,num_procs,\ send,recv,all_gather) #finalizetaskforthistimestep new_walkers_len=sum(num_walkers_per_proc) my_walkers.finalize_timestep(old_walkers_len,new_walkers_len) old_walkers_len=new_walkers_len my_output=collect_subproblem_output_args(my_output,my_rank, num_procs,send,recv) ifmy_rank==0: finalize(my_output)
In comparison with its serial counterpart, the parallel_time_integration function has a few noticeable changes. First, the input arguments have been extended with five new arguments. The two integers my_rank and num_procs are, as before, meant for identifying the individual processors and finding the total number of processors. The other three new input arguments are MPI communication wrapper functions: send, recv, and all_gather, which can be provided by any of the Python wrapper modules of MPI. The only exception is that pypar does not directly provide the all_gather function, but we can easily program it as follows:
defall_gather(input_array): array_gathered_tmp=pypar.gather(input_array,0) array_gathered=pypar.broadcast(array_gathered_tmp,0) returnarray_gathered
Second, we note that the initialize function is slightly different from the serial case, now accepting my_rank and num_procs as input. This is because initial division of the walkers is assumed to be carried out here, giving rise to my_walkers on each processor. Third, a new function dynamic_load_balancing is called during each time step. This function will be explained below in detail. Fourth, unlike that the serial counterpart could simply pass the size of its walkers to finalize_timestep, the parallel implementation needs to collect the global population size before calling finalize_timestep. We remark that each local population knows its own size, but not the global population size. For this purpose, the dynamic_load_balancing function returns the individual local population sizes as a numpy array. Last, the collect_subproblem_output_args function from Section 2.2 is used to assemble all the individual results onto the master processor before calling the finalize function.
As mentioned before, parallelization of population Monte Carlo algorithms has to take into account that the total number of walkers changes with time. Dynamic redistribution of the walkers is therefore needed to avoid work load imbalance. The generic dynamic_load_balancing function is designed for this purpose, where we evaluate the amount of work for each processor and, if the work distribution is too skew, we move the excess walkers from a busy processor to a less busy one. The function first checks the distribution of local population sizes. If the difference between the smallest number of walkers and the largest number of walkers exceeds some predefined threshold, dynamic_load_balancing finds a better population distribution and redistributes the walkers:
defdynamic_load_balancing(walkers,task_time,my_rank,num_procs,\ send,recv,all_gather): walkers_per_proc=all_gather(numpy.array([len(walkers)])) ifimbalance_rate(walkers_per_proc)>walkers.threshold_factor: timing_list=all_gather(numpy.array([task_time])) rebalanced_work=find_optimal_workload(timing_list, walkers_per_proc) walkers=redistribute_work(walkers, walkers_per_proc, rebalanced_work, my_rank,num_procs,send,recv) returnwalkers_per_proc
Two helper functions find_optimal_workload and redistribute_work are used in the above implementation. Here, find_optimal_workload finds the optimal distribution of work, based on how much time each local population has used. The redistribute_work function carries out the reshuffling of walkers. A straightforward (but not optimal) implementation of these functions goes as follows:
deffind_optimal_workload(timing_list,current_work_per_proc): total_work=sum(current_work_per_proc) C=total_work/sum(1./timing_list) tmp_rebalanced_work=C/timing_list rebalanced_work=numpy.array(tmp_rebalanced_work.tolist(),’i’) remainders=tmp_rebalanced_workrebalanced_work whilesum(rebalanced_work)<total_work: maxarg=numpy.argmax(remainders) rebalanced_work[maxarg]+=1 remainders[maxarg]=0 returnrebalanced_work defredistribute_work(my_walkers,work_per_proc,rebalanced_work, my_rank,num_procs,send,recv): difference=work_per_procrebalanced_work diff_sort=numpy.argsort(difference) prev_rank_min=diff_sort[0] whilesum(abs(difference))!=0: diff_sort=numpy.argsort(difference) rank_max=diff_sort[1] rank_min=diff_sort[0] ifrank_min==prev_rank_minandrank_max!=diff_sort[1]: rank_min=diff_sort[1] ifmy_rank==rank_max: send(my_walkers.cut_slice(rebalanced_work[my_rank]),\ int(rank_min)) elifmy_rank==rank_min: my_walkers.paste_slice(recv(int(rank_max))) difference[rank_min]+=difference[rank_max] difference[rank_max]=0 prev_rank_min=rank_min returnmy_walkers
Careful readers will notice that two particular methods, my_walkers.cut_slice and my_walkers.paste_slices, provide the capability of migrating the work load between processors in the redistribute_work function. These two methods have to be programmed in class Walkers, like the other needed methods described earlier: move, get_marker, append, delete, and so on. The cut_slice method takes away excess work from a local population and the paste_slice method inserts additional work into a local population. Note that the input argument to the cut_slice method is an index threshold meaning that local walkers with indices larger than that are to be taken away. The returned slice from cut_slice is a picklable Python object that can be sent and received through MPI calls.
The generic redistribute_work function deserves a few more words. Among its input arguments is the ideal work distribution, rebalanced_work, which is calculated by find_optimal_workload. The redistribute_work function first calculates the difference between the current distribution, work_per_proc, and the ideal distribution. It then iteratively moves walkers from the processor with the most work to the processor with the least work until the difference is evened out.
This load balancing scheme is in fact independent of population Monte Carlo algorithms. As long as you have an algorithm repeatedly doing a task over time and where the amount of work in the task varies over time, this scheme can be reused. The only requirement is that an applicationspecific implementation of class Walkers, in terms of method names and functionality, should match with dynamic_load_balancing and redistribute_work. It should be noted that the given implementation of the latter function is not optimal.
3.3 Parallel Additive Schwarz Iterations
From the perspective of communication between processors, parallelization of the Monte Carlo algorithms is relatively easy. Parallel Markov chain Monte Carlo algorithms only require communication in the very beginning and end, whereas parallel population Monte Carlo algorithms only require communication at the end of each time step. Actually, our functioncentric approach to parallelization can allow more frequent communication. To show the versatility of functioncentric parallelization, we apply it to an implicit method for solving partial differential equations (PDEs) where communication is frequent between processors.
More specifically, many PDEs can be solved by an iterative process called domain decomposition. The idea is to divide the global domain, in which the PDEs are to be solved, into overlapping subdomains. The PDEs can then be solved in parallel on the subdomains. However, the correct boundary condition at the internal subdomain boundaries are not known, thus leading to an iterative approach where one applies boundary conditions from the last iteration, solves for the subdomain problems again, and repeats the process until convergence of the subdomain solutions (see e.g. [27, 28]). This algorithm is commonly called additive Schwarz iteration and can successfully be applied to many important classes of PDEs [29, 30, 31]. The great advantage of the algorithm, especially from a software point of view, is that the PDE solver for the global problem can be reused for each subdomain problem. Some additional code is needed for communicating the solutions at the internal boundaries between the subdomains. This code can be implemented in a generic fashion in Python, as we explain later.
Let us first explain the additive Schwarz algorithm for solving PDEs in more detail. We consider some stationary PDE defined on a global domain :
(1) 
subject to some boundary condition involving and/or its derivatives. Dividing into a set of overlapping subdomains , we have the restriction of (1) onto , for all , as
(2) 
The additive Schwarz method finds the global solution by an iterative process that generates a series of approximations , , and so on. During iteration , each subdomain computes an improved local solution by locally solving (2) for with as (an artificial) boundary condition on ’s nonphysical internal boundary that borders with neighboring subdomains. All the subdomains can concurrently carry out the local solution of (2) within iteration , thus giving rise to parallelism. At the end of iteration , neighboring subdomains exchange the latest local solutions in the overlapping regions to (logically) form the global field . The subdomain problems (2) are of the same type as the global problem (1), which implies the possibility of reusing an existing serial code that was originally implemented for (1). The additional code for exchange of local solutions among neighbors can be implemented by generic communication operations, independently of specific PDEs.
A generic implementation of parallel additive Schwarz iteration algorithm can be realized as the following Python function:
defadditive_Schwarz_iterations(subdomain_solve,communicate, set_BC,max_iter,threshold,solution, convergence_test=simple_convergence_test): iter=0;not_converged=True#init whilenot_convergedanditer<max_iter: iter+=1 solution_prev=solution.copy() set_BC(solution) solution=subdomain_solve() communicate(solution) not_converged=notconvergence_test(\ solution,solution_prev,threshold)
In the above function, max_iter represents the maximum number of additive Schwarz iterations allowed, and subdomain_solve is a function that solves the subdomain problem of form (2) and returns an object solution, which is typically a numpy array containing the latest subdomain solution on a processor (subdomain). However, solution may well be a more complex object, say holding a collection of scalar fields over computational grids, provided that (i) the object has a copy method, (ii) convergence_test and communicate can work with this object type, and (iii) subdomain_solve returns such an object. This flexibility in choosing solution reflects the major dynamic power of Python and provides yet another illustration of the generality of the examples in this paper.
Given an existing serial code, for example in a language like Fortran or C/C++, the subdomain_solve function is easily defined by wrapping up an appropriate piece of the serial code as a Python class (since subdomain_solve does not take any arguments, the function needs a state with data structures, conveniently implemented as class attributes as explained in Section 2.1).
The communicate argument is a function for exchanging the latest local solutions among the subdomains. After the call, the solution object is updated with recently computed values from the neighboring subdomains, and contents of solution have been sent to the neighbors. The communicate function is problem independent and can be provided by some library. In our implementation, the implementation is entirely in Python to take advantage of easy programming of parallel communication in Python. The set_BC argument is a function for setting boundary conditions on a subdomain’s internal boundary. This function depends on the actual serial code and is naturally implemented as part of the class that provides the subdomain_solve function.
The convergence_test function is assumed to perform an appropriate convergence test. The default generic implementation can test
against a prescribed threshold value. An implementation reads
defsimple_convergence_test(solution,solution_prev,threshold=1E3): diff=solutionsolution_prev loc_rel_change=vdot(diff,diff)/vdot(solution,solution) glob_rel_change=all_reduce(loc_rel_change,MAX) returnglob_rel_change<threshold
We remark that all_reduce is a wrapper of the MPI MPI_Allreduce command and vdot computes the inner product of two numpy arrays.
Unlike the threecomponent structure described in Sections 3.1 and 3.2, the main ingredients for parallel additive Schwarz iterations are the functions of subdomain_solve, communicate, set_BC, and convergence_test. In other words, it is not natural to divide the work of solving a PDE into initialize, func, and finalize. Nevertheless, functioncentric parallelization is also here userfriendly and gives a straightforward implementation of additive_Schwarz_iterations as above. The convergence_test function shown above is clearly generic, and so is the communicate function in the sense that it does not depend on the PDE. Both functions can be reused for different PDEs. The other two functions are PDE dependent, however, subdomain_solve normally wraps an existing serial code, while the implementation of set_BC is typically very simple.
4 Applications and Numerical Experiments
In this section we will address three real research projects involving the three classes of algorithms covered in Section 3. The projects have utilized our functioncentric approach to parallelizing existing codes. That is, we had some software in Fortran, C++, and R performing the basic computations needed in the projects. The serial software was wrapped in Python, adapted to components such as initialize, func, do_timestep, finalize, subdomain_solve, communicate, set_BC. Parallelization was then carried out as explained in previous sections. An important issue to be reported is the parallel efficiency obtained by performing the parallelization in a Python layer that is separate from the underlying serial scientific codes.
The Python enabled parallel codes have been tested on a Linux cluster of 3.4 GHz Itanium2 processors, which are interconnected through 1Gbits ethernet. The purpose is to show that the functioncentric parallelization approach is easy to use and that satisfactory parallel performance is achievable.
4.1 Parallel Markov Chain Monte Carlo Simulations
The first case is from political science and concerns estimating legislators’ ideal points by the Markov chain Monte Carlo (MCMC) method. For a detailed description of the mathematical problem and the numerical method, we refer the reader to Appendix A. This application fits into the setup in Section 3.1. The statistical engine is provided by the PSCL library [32] in R [33], for which there exists a Python wrapper.
To use the functioncentric parallelization described in Section 3.1, we have written a Python class named PIPE. In addition to the constructor of the class (i.e., the __init__ method), there are three methods as follows:

initialize sets up the functionality of the PSCL library through the Python wrapper of R (named rpy), and prepares the input argument list needed for func.

func carries out the computation of each task by invoking appropriate functions available through rpy (in short, func is a Python wrapper to the R function ideal from the PSCL library).

finalize summarizes the output and generates an array in R format.
The resulting parallel Python program is now as short as
fromfunction_centricimportparallel_solve_problem importpypar my_rank=pypar.rank() num_procs=pypar.size() frompypipeimportPIPE problem=PIPE("EP1.RData","rcvs","NULL","NULL") parallel_solve_problem(problem.initialize,problem.func,problem.finalize, my_rank,num_procs,pypar.send,pypar.receive) pypar.finalize()
The practical importance of a parallel MCMC code is that large and computationally intensive simulations are now easily doable. More specifically, data from the European Parliament between 1979 and 2004 [34] are used for simulation. During the five year legislative terms, the European Parliament expanded the size of the membership as well as the number of votes taken. (This trend has continued since 2004.) It is hence increasingly computationally intensive to estimate the ideal point model without reducing the length of the Markov chain. We examined the parallel performance by comparing the computing time for each of the five legislatures, running the parallelized code on 32 CPUs. The results are reported in Table 1. When comparing the results, the reader should note that we have not made any attempts to optimize the ideal code (called by our func function) for the purpose of parallelization. This makes it straightforward to switch to new versions of the ideal function. We ran 100,000 MCMC iterations. The parallel efficiency was about 90%.

Legislature Votes Members 1 CPU 32 CPUs Efficiency 1979  1984 810 548 287m 32.560s 10m 13.318s 87.91 1984  1989 1853 637 783m 59.059s 26m 58.702s 91.06 1989  1994 2475 597 1006m 59.258s 33m 26.140s 94.11 1994  1999 3603 721 1905m 0.930s 66m 0.068s 90.20 1999  2004 5639 696 2898m 45.224s 102m 7.786s 88.70
4.2 Parallel Diffusion Monte Carlo Simulations
As an example of population Monte Carlo methods, we will now look at parallel Diffusion Monte Carlo (DMC) computations (see Appendix B for a detailed numerical description), which is used here to simulate BoseEinstein condensation. We recall from Section 3.2 that dynamic load balancing is needed in connection with the parallelization, and can be provided by the generic dynamic_load_balancing function. To utilize the parallel time integration function parallel_time_integration from Section 3.2, we need to program a parallel version of the initialize function. The do_timestep function from Section 3.2 can be used as is.
definitialize(my_rank,num_procs): nwalkers=1000 nspacedim=3 stepsize=0.1 timesteps=200 walkers_per_proc=simple_partitioning(nwalkers,num_procs) my_walkers=Walkers(walkers_per_proc[my_rank],nspacedim,stepsize) my_walkers.threshold_factor=1.1 returnmy_walkers,timesteps
This initialize function is quite similar to its serial counter part. As noted in Section 3.2, it takes as input my_rank and num_procs. The simple_partitioning function is called to partition the walker population. A my_walkers object is assigned to each processor, and a threshold factor is prescribed to determine when load balancing is needed.
Together with the parallel_time_integration function from Section 3.2, the above initialize function is the minimum programming effort needed to parallelize a serial population Monte Carlo code. For the particular case of our parallel Diffusion Monte Carlo implementation, we also need to know the global number of walkers in every timestep to be able to estimate its observables globally. Moreover, the load balancing scheme needs the time usage of each processor during each time step.
A class with name Walkers needs to be implemented to match with the implementations of parallel_time_integration, dynamic_load_balancing, and the above initialize function. The essential work is to provide a set of methods with already decided names (see Section 3.2), such as move, append, delete, finalize_timestep, cut_slice, and paste_slice. A concrete example of the Walkers class is described with more details in Appendix B.
We report in Table 2 the timing results of a number of parallel DMC computations. The parallel efficiency was about 85%. We increased the total number of walkers when more processors were used in the simulation, such that the number of walkers assigned to each processor remained as 200. Such a use of parallel computers for DMC simulations mimics the everlasting wish of quantum physicists to do larger computations as soon as more computing resource becomes available. Note that in this scaled scalability test, good speedup performance is indicated by an almost constant time usage independent of the number of processors.

CPUs Time Efficiency 1 37m10.389s N/A 5 42m32.359s 87.39% 10 42m00.734s 88.48% 20 42m29.945s 87.47% 30 42m33.895s 87.33% 40 43m30.092s 85.45% 50 43m39.159s 85.16%
4.3 Parallel Boussinesq Simulations
Simulating the propagation of ocean waves is the target of the our third and final concrete case. The reader is referred to Appendix C for the mathematical model and the numerical method. The involved equations can be solved in parallel by the additive Schwarz algorithm of Section 3.3.
Our starting point for parallelization is a 25 years old legacy Fortran 77 code consisting of a set of subroutines. More specifically, the most important subroutines are KONTIT and BERIT, which target the two semidiscretized equations (16) and (17) of the mathematical model (see Appendix C). These two Fortran 77 subroutines contain intricate algorithms with nested layers of doloops, which are considered to be very difficult to parallelize by directly inserting MPI calls in the Fortran code. Performing the parallelization outside the Fortran code is therefore much more convenient. Using the proposed framework in the present paper the parallelization is a technically quite straightforward task.
The subdomain solver consists of calls to the subroutines KONTIT and BERIT. The implementation of the Python function subdomain_solve (see Section 3.3) requires a Python interface to KONTIT and BERIT, which can easily be produced by the F2PY software. Since a subdomain solver needs to set artificial boundary conditions at nonphysical boundaries, we have programmed two lightweight wrapper subroutines in Fortran, WKONTIT and WBERIT, which handles the boundary conditions before invoking KONTIT and BERIT. We then apply F2PY to make WKONTIT and WBERIT callable from Python. Since the Fortran subroutines have lots of input data in long argument lists and subdomain_solve takes no arguments, we have created a class where the Fortran input variables are stored as class attributes:
importf77#extensionmodulefortheFortrancode
classSubdomainSolver:
def__init__(self,...):
#setinputargumentstotheFortransubroutinesasclassattributes
#(nbit,F,YW,H,QY,WRK,dx,dy,dt,kit,ik,gg,alpha,eps,
#lower_x_neigh,upper_x_neigh,lower_y_neigh,upper_y_neigh)
defcontinuity(self):
self.Y,self.nbit=f77.WKONTIT(\
self.F,self.Y,self.YW,self.H,self.QY,self.WRK,
self.dx,self.dy,self.dt,self.kit,self.ik,
self.gg,self.alpha,self.eps,self.nbit,
self.lower_x_neigh,self.upper_x_neigh,
self.lower_y_neigh,self.upper_y_neigh)
defbernoulli(self):
#similartothecontinuitymethod
sd=SubdomainSolver(...)
subdomain_solve1=sd.continuity
subdomain_solve2=sd.bernoulli
Note that since there are two PDEs (16) and (17), we have created two functions: subdomain_solve1 and subdomain_solve2. The main computation of the resulting parallel program is in the following while loop:
t=0
whilet<t_stop:
t=t+dt
additive_Schwarz_iterations(subdomain_solve1,communicate,
set_BC,10,1E3,sd.Y)
additive_Schwarz_iterations(subdomain_solve2,communicate,
set_BC,10,1E3,sd.F)
The additive_Schwarz_iterations function from Section 3.3 can be placed in a reusable module. The communicate function is borrowed from a Python library for mesh partitioning and intersubdomain communication. The set_BC function actually does not do anything for this particular application.
Speedup results are reported in Table 3, for which the global solution mesh was fixed at , and the number of time steps was 40. The results show that we can handle a quite complicated mathematical problem in a blackbox Fortran code with our suggested simple framework and obtain a remarkable good speedup, with just a trivial extension of the Fortran code.

CPUs Time Speedup Efficiency 1 166.66s N/A N/A 2 83.61s 1.99 99.67% 4 44.45s 3.75 93.73% 8 20.16s 8.27 103.33% 16 11.43s 14.58 91.13%
5 Conclusion
We have shown how serial scientific codes written in various common languages, including Fortran, C, C++, and Python, can be parallelized in a separate, small software unit written in Python. The advantage of such an approach is twofold. First, the existing, often complicated, scientific highperformance code remains (almost) unchanged. Second, the parallel algorithm and its interprocessor communication are conveniently implemented in highlevel Python code.
This approach to parallelization has been implemented in a software framework where the programmer needs to implement a few Python functions for carrying out the key steps in the solution approach. For example, our first application involves doing a set of independent tasks in parallel, where a small Python framework deals with the parallelism and demands the user to supply three functions: initialize for preparing input data to the mathematical model, func for calling up the serial scientific code, and finalize for processing the computational results. Some more functions must be supplied in more complicated problems where the algorithm evolves in time, with a need for dynamic load balancing and more parallel communication.
Our simple software frameworks outlined in this paper are applicable to many different scientific areas, and we have described some common classes of problems: parameter investigation of a mathematical model, standard Monte Carlo simulation, Monte Carlo simulation with need for dynamic load balancing, and numerical solution of partial differential equations. In each of these cases, we have outlined fairly detailed Python code such that most technical details of the parallel implementations are documented. This may ease the migration of the ideas to new classes of problems beyond the scope of this paper.
In particular, the shown frameworks have been used to parallelize three real scientific problems taken from our research. The problems concern Markov Chain Monte Carlo models for voting behavior in political science, Diffusion Monte Carlo methods for BoseEinstein condensation in quantum mechanics, and additive Schwarz and finite difference methods for simulating ocean waves by a system of partial differential equations. The results of our investigations of the parallel efficiency are very encouraging: In all these real science problems, parallelizing serial codes in the proposed Python framework gives almost optimal speedup results, showing that there arises no significant loss due to using Python and performing the parallelization “outside” the serial codes.
As a conclusion, we believe that the ideas and code samples from this paper can simplify parallelization of serial codes greatly, without significant loss of computational efficiency. This is good news for scientists who are nonexperts in parallel programming but want to parallelize their serial codes with as small efforts as possible.
Appendix A Voting in Legislatures
In the spatial model of politics, both actors’ preferences over policies (ideal points) and policy alternatives are arranged geometrically in a lowdimensional Euclidean space. An actor receives the highest possible utility if a policy is located at her ideal point; she gains or loses utility as the policy moves towards or away from her ideal point [35]. We adopt the Bayesian approach proposed by Clinton, Jackman and Rivers [36]. Assume there are legislators who vote on proposals. On each vote , legislator chooses between a ”Yea” position and a ”Nay” position located in the policyspace , where is the number of dimensions. Then, we have if legislator votes ”Yea” on roll call , and if she votes ”Nay”. The model assumes quadratic utility functions. The ideal point of legislator is , while and are stochastic elements whose distribution is jointly normal. The variance of the stochastic elements is . Denote the Euclidean norm by , utility maximising implies that legislator votes ”Yea” on vote if
(3) 
and ”Nay” otherwise. Clinton, Jackman and Rivers [36] show that the model can be understood as a hierarchical probit model:
(4) 
where =, =, is the standard normal function, is the midpoint between the ”Yea” and ”Nay” positions on proposal , and is the legislator’s ideal point. The direction of indicates the location of the status quo relative to the proposal. If is positive, the new proposal is located higher on the dimension than the status quo. If is negative, the new proposal is located lower on the dimension than the status quo.
The MCMC Algorithm.
In the Markov Chain Monte Carlo (MCMC) algorithm for the statistical analysis of voting behavior [36], the difference between utilities of the alternatives on the th vote for the th legislator is given by , where and are model parameters, is a vector of regression coefficients and are standard normal errors. If we know and , can be imputed from the regression of on using the votes of legislator and vice versa. If we know , we can use the votes of the legislators on roll call to find and . Given , and (either from priors or from the previous iteration), we can find by drawing randomly from a normal distribution subject to the constraints implied by the actual votes, i.e., if , and if , .
The goal is to compute the joint posterior density for all model parameters and , and all coefficient vectors , . The MCMC algorithm forms a Markov chain to explore as much as possible of this joint density, i.e., letting index an MCMC iteration,

find from , and ,

sample and using and ,

find from , and .
This process must then be repeated until convergence, i.e., that the samples have moved away from the priors to the neighborhood of the posterior mode before samples are drawn.
Clinton, Jackman and Rivers [36, p. 369] find that the computation time is increasing in , where is the number of legislators, is the number of votes and is the number of MCMC iterations. Although they argue that very long runs are normally not necessary, they nevertheless advise long runs to ensure that the MCMC algorithm has converged. It is increasingly timeconsuming to estimate the model on on a standard desktop computer as the size of the legislature and the number of votes increase.
Appendix B BoseEinstein Condensation
The famous experiment of Anderson et al. [37] was about cooling Rb down to temperatures in the order of for observing BoseEinstein condensation in the dilute gas. To model this fascinating experiment in the framework of Quantum Monte Carlo, so that numerical simulations can be extended beyond the physical experiments, we may use the governing Schrödinger equation:
(5) 
The most important parts of the mathematical model are a Hamiltonian and a wave function , see [38]. The Hamiltonian for trapped interacting atoms is given by
(6) 
The external potential corresponds to the trap used to confine the Rb atoms, and was in the experiment in the order of . The twobody interaction can be easily described by a hardcore potential of radius in a dilute gas. We have however, for the sake of simplicity, neglected these interactions in our example implementation of class Walkers.
The Method of Diffusion Monte Carlo.
In the Diffusion Monte Carlo (DMC) method [39], the Schrödinger equation is solved in imaginary time,
(7) 
The formal solution of (7) is
(8) 
where is called the Green’s function, and is a convenient energy shift.
The wave function in DMC is represented by a set of random vectors , in such a form that the time evolution of the wave function is actually represented by the evolution of a set of walkers. This feature gives rise to task parallelism. The wave function is positive definite everywhere, as it happens with the ground state of a bosonic system, so it may be considered as a probability distribution function.
The DMC method involves Monte Carlo integration of the Green’s function by every walker. The time evolution is done in small time steps , using the following approximate form of the Green’s function:
(9) 
where . Assume that an arbitrary starting state can be expanded in the basis of stationary,
(10) 
we have
(11) 
in such a way that the lowest energy components will have the largest amplitudes after a long elapsed time, and in the limit of the most important amplitude will correspond to the ground state (if )^{1}^{1}1This can easily be seen by replacing with the ground state energy in (11). As is the lowest energy, we will get ..
The Green’s function is approximated by splitting it up in a diffusional part,
(12) 
which has the form of a Gaussian and a branching part,
(13) 
While diffusion is taken care of by a Gaussian random distribution, the branching is simulated by creation and destruction of walkers with a probability . The idea of DMC computation is quite simple; once we have found an appropriate approximation of the shorttime Green’s function and determined a starting state, the computation consists in representing the starting state by a collection of walkers and letting them independently evolve in time. That is, we keep updating the walker population, until a large enough time when all other states than the ground state are negligible.
Algorithm 1
Diffusion Monte Carlo
for step in range( 0, timesteps ) : for in range( 0, ) : Diffusion; propose move Branching; calculate replication factor : if : mark walker as dying if : mark walker to make clones Remove dead walkers and make new clones; Update walker population and adjust trial energy; Sample contributions to observable.
The Implementation.
In Algorithm 1 we summarize the DMC algorithm corresponding to (12)(13). In the algorithm is a Gaussian with zero mean and a variance of corresponding to (12). The deleting and cloning of walkers are, as mentioned in Section 3.2, performed by the do_timestep function, repeated here for clarity:
defdo_timestep(walkers):
walkers.move()
forwalkerinrange(len(walkers)):
ifwalkers.get_marker(walker)==0:
walkers.delete(walker)
elifwalkers.get_marker(walker)>1:
walkers.append(walker,walkers.get_marker(walker)1)
returnwalkers.sample_observables()
The main computational work of the DMC algorithm at each time step is implemented in the move function inside class Walkers, together with a helper function branching:
classWalkers:
...
defbranching(self,new_positions):
old_potential=potential(self.positions)
new_potential=potential(new_positions)
branch=numpy.exp((0.5*(old_potential+new_potential)
self.adjust_branching)*self.stepsize)
self.markers=numpy.array(branch+
numpy.random.uniform(0,1,branch.shape),’i’)
defmove(self):
displacements=numpy.random.normal(0,2*self.stepsize,
self.positions.shape)
new_positions=self.positions+displacements
self.branching(new_positions)
self.positions=new_positions
...
The move function first generates a set of Gaussian (normal) distributed random numbers, corresponding to (12). Next, it calls the branching function, which calculates a potential for the old and the new positions^{2}^{2}2In a more optimized implementation, the old potential would have been stored from the previous move and not calculated every time.. These potentials are used to calculate following (13) and create an integer array self.markers with its average value equal to (stored in the branch variable). This array is of the same length as the number of walkers (stored in self.positions) and marks the walkers as dying or cloneable.
It is worth noticing that if the new potential of a walker is much higher than that in the previous time step (i.e., the walker is far from the center of the trap), the value of branch will be close to 0 and the walker will be deleted. However, if the new potential is much lower (i.e. closer to the center of the trap), branch will be greater than 1 and the walker will be cloned. As long as the twobody interaction is ignored, the walkers will only be encouraged to move towards the center of the trap, thus yielding a lower energy than seen in real experiments.
Appendix C Ocean Wave Propagation
The following two PDEs, normally termed as the Boussinesq water wave equations [40], can be used to model wave propagation:
(14)  
(15) 
The primary unknowns of (14)(15) are the water surface elevation and the depthaveraged velocity potential . The symbol denotes the water depth as a function of . The advantage of the above Boussinesq wave model, in comparison with the standard shallow water equations, is its capability of modeling waves that are weakly dispersive () and/or weakly nonlinear (), see [41]. Therefore, the Boussinesq water wave equations are particularly adequate for simulating ocean wave propagation over long distances and large water depths.
Discretization of the Boussinesq water wave equations (14)(15) normally starts with a temporal discretization as follows:
(16)  
(17) 
where we use to denote the time level, and denotes the time step size. The basic idea of computation at each time step is to first compute based on and from the previous time step, and then compute using the new and the old . To carry out the actual numerical computation, we need a spatial discretization of (16)(17), using e.g. finite differences or finite elements, so we end up with two systems of linear equations that need to be solved during each time step.
References
References

[1]
Guido van Rossum et al.
The Python programming language, 1991–.
http://www.python.org/. 
[2]
D. Ascher, P. F. Dubois, K. Hinsen, J. Hugunin, and T. Oliphant.
Numerical Python.
Technical report, Lawrence Livermore National Lab., CA, 2001.
http://www.pfdubois.com/numpy/numpy.pdf. 
[3]
Eric Jones, Travis Oliphant, Pearu Peterson, et al.
SciPy: Open source scientific tools for Python.
http://www.scipy.org/, 2001–.  [4] F2PY software package. http://cens.ioc.ee/projects/f2py2e.

[5]
Pypar – parallel programming with Python.
http://sourceforge.net/projects/pypar, 2007.  [6] X. Cai, H. P. Langtangen, and H. Moe. On the performance of the Python programming language for serial and parallel scientific computations. Scientific Programming, 13(1):31–56, 2005.
 [7] X. Cai and H. P. Langtangen. Developing parallel objectoriented simulation codes in Diffpack. In Proceedings of the Fifth World Congress on Computational Mechanics, Vienna University of Technology, 2002. ISBN 3950155406.
 [8] F. Cirak and J. C. Cummings. Generic programming techniques for parallelizing and extending procedural finite element programs. Engineering with Computers, 24:1–16, 2008.

[9]
S. Balay, K. Buschelman, W. D. Gropp, D. Kaushik, M. G. Knepley, L. C. McInnes,
B. F. Smith, and H. Zhang.
PETSc Web page.
http://www.mcs.anl.gov/petsc, 2001.  [10] O. Lawlor, S. Chakravorty, T. Wilmarth, N. Choudhury, I. Dooley, G. Zheng, and L. Kalé. ParFUM: A parallel framework for unstructured meshes for scalable dynamic physics applications. Engineering with Computers, 22(3):215–235, 2006.
 [11] J. R. Stewart and H. C. Edwards. A framework approach for developing parallel adaptive multiphysics applications. Finite Elements in Analysis and Design, 40:1599–1617, 2004.

[12]
The Trilinos project.
http://trilinos.sandia.gov/, 2008. 
[13]
UG home page.
http://sit.iwr.uniheidelberg.de/~ug/, 2007.  [14] T. Goodale, G. Allen, G. Lanfermann, J. Massó, T. Radke, E. Seidel, , and J. Shalf. The Cactus framework and toolkit: Design and applications. In J. M. L. M. Palma et al., editor, Proceedings of VECPAR 2002, volume 2565 of Lectures Notes in Computer Science, pages 197–227. Springer Verlag, 2003.

[15]
MpCCI 3.0.
http://www.mpcci.de/, 2008. 
[16]
StarP Overview.
http://www.interactivesupercomputing.com/products/, 2008.  [17] K. Hinsen. Parallel scripting with Python. Computing in Science & Engineering, 9(6):82–89, 2007.

[18]
pyMPI: Putting the py in MPI.
http://pympi.sourceforge.net/, 2008. 
[19]
MYMPI webpage.
http://peloton.sdsc.edu/~tkaiser/mympi/, 2008.  [20] L. Dalcín, R. Paz, and M. Storti. MPI for Python. Journal of Parallel and Distributed Computing, 65(9):1108–1115, 2005.
 [21] L. Dalcín, R. Paz, M. Storti, and J. D’Elía. MPI for Python: Performance improvements and MPI2 extensions. Journal of Parallel and Distributed Computing, 68(5):655–662, 2008.

[22]
ScientificPython webpage.
http://dirac.cnrsorleans.fr/plone/software/scientificpython/, 2007.  [23] G. D. Benson and A. S. Fedosov. Pythonbased distributed programming with Trickle. In H. R. Arabnia, editor, Proceedings of PDPTA’07, pages 30–36. CSREA Press, 2007.

[24]
G. Olson.
Introduction to concurrent programming with Stackless Python.
http://members.verizon.net/olsongt/stackless/why_stackless.html, 2006.  [25] C. E. Rasmussen, M. J. Sottile, J. Nieplocha, R. W. Numrich, and E. Jones. Coarray Python: A parallel extension to the Python language. In M. Danelutto, D. Laforenza, and M. Vanneschi, editors, Proceedings of EuroPar 2004, Lectures Notes in Computer Science, pages 632–637. Spriner Verlag, 2004.
 [26] Yukito IBA. Population Monte Carlo algorithms. Transactions of the Japanese Society for Artificial Intelligence, 16:279–286, 2001.
 [27] T. F. Chan and T. P. Mathew. Domain decomposition algorithms. In Acta Numerica 1994, pages 61–143. Cambridge University Press, 1994.
 [28] B. F. Smith, P. E. Bjørstad, and W. Gropp. Domain Decomposition: Parallel Multilevel Methods for Elliptic Partial Differential Equations. Cambridge University Press, 1996.
 [29] X. Cai, G. K. Pedersen, and H. P. Langtangen. A parallel multisubdomain strategy for solving Boussinesq water wave equations. Advances in Water Resources, 28(3):215–233, 2005.
 [30] X. Cai and H. P. Langtangen. Parallelizing PDE solvers using the Python programming language. In A. M. Bruaset and A. Tveito, editors, Numerical Solution of Partial Differential Equations on Parallel Computers, volume 51 of Lecture Notes in Computational Science and Engineering, pages 295–325. SpringerVerlag, 2006.
 [31] H. P. Langtangen and X. Cai. A software framework for easy parallelization of PDE solvers. In C. B. Jensen, T. Kvamsdal, H. I. Andersson, B. Pettersen, A. Ecer, J. Periaux, N. Satofuka, and P. Fox, editors, Parallel Computational Fluid Dynamics. NorthHolland, 2001.
 [32] Simon Jackman. PSCL: classes and methods for R developed in the Political Science Computational Laboratory, Stanford University. Technical report, Department of Political Science, Standford University, 2006.
 [33] R Development Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2006. ISBN 3900051070.
 [34] Simon Hix, Abdul Noury, and Gerard Roland. Power to the parties: cohesion and competition in the European Parliament, 19792001. British Journal of Political Science, 35(2):209–234, 2005.
 [35] Melvin J. Hinich and Michael C. Munger. Analytical Politics. Cambridge University Press, 1997.
 [36] Joshua Clinton, Simon Jackman, and Doug Rivers. The statistical analysis of roll call data. American Political Science Review, 98(4):355–370, 2004.
 [37] J.R. Anderson, M.R. Ensher, M.R. Matthews, C.E. Wieman, and E.A. Cornel. Observation of BoseEinstein condensation in a dilute atomic vapor. Science, 269:198, 1995.
 [38] J. K. Nilsen, J. MurPetit, M. Guilleumas, M. HjorthJensen, and A. Polls. Vortices in atomic BoseEinstein condensates in the large gas parameter region. Phys. Rev. A, 71, 2005.
 [39] R. Guardiola. Monte Carlo methods in quantum manybody theories. In J. Navarro and A. Polls, editors, Microscopic Quantum ManyBody Theories and Their Applications, volume 510 of Lecture Notes in Physics, pages 269–336. Springer Verlag, 1998.
 [40] D. M. Wu and T. Y. Wu. Threedimensional nonlinear long waves due to moving surface pressure. Proc. 14th Symp. Naval Hydrodyn., pages 103–129, 1982.
 [41] G. Pedersen and H. P. Langtangen. Dispersive effects on tsunamis. In Proceedings of the International Conferance on Tsunamis, Paris, France, pages 325–340, 1999.