Parallel Discrete Event Simulation with Erlang¹

¹The publisher version of this paper is available at http://dx.doi.org/10.1145/2364474.2364487. Please cite as: Luca Toscano, Gabriele D’Angelo, Moreno Marzolla. Parallel Discrete Event Simulation with Erlang. Proceedings of ACM SIGPLAN Workshop on Functional High-Performance Computing (FHPC 2012) in conjunction with ICFP 2012. ISBN: 978-1-4503-1577-7.


Abstract

Discrete Event Simulation (DES) is a widely used technique in which the state of the simulator is updated by events happening at discrete points in time (hence the name). DES is used to model and analyze many kinds of systems, including computer architectures, communication networks, street traffic, and others. Parallel and Distributed Simulation (PADS) aims at improving the efficiency of DES by partitioning the simulation model across multiple processing elements, in order to enable larger and/or more detailed studies to be carried out. Interest in PADS is increasing, thanks to the widespread availability of multicore processors and affordable high performance computing clusters. However, designing parallel simulation models requires considerable expertise, the result being that PADS techniques are not as widespread as they could be. In this paper we describe ErlangTW, a parallel simulation middleware based on the Time Warp synchronization protocol. ErlangTW is entirely written in Erlang, a concurrent, functional programming language specifically targeted at building distributed systems. We argue that writing parallel simulation models in Erlang is considerably easier than using conventional programming languages. Moreover, ErlangTW allows simulation models to be executed on single-core, multicore and distributed computing architectures. We describe the design and prototype implementation of ErlangTW, and report some preliminary performance results on multicore and distributed architectures using the well known PHOLD benchmark.

Luca Toscano, Gabriele D’Angelo, Moreno Marzolla
Department of Computer Science, University of Bologna
luca.toscano2@studio.unibo.it, g.dangelo@unibo.it, marzolla@cs.unibo.it

FHPC’12, September 15, 2012, Copenhagen, Denmark.
Copyright 2012 ACM 978-1-4503-1577-7/12/09

Categories and Subject Descriptors: D.1.3 [Programming Techniques]: Concurrent Programming; I.6.8 [Simulation and Modeling]: Types of Simulation

General Terms: Languages, Performance

Keywords: Parallel and Distributed Simulation, PADS, Time Warp, Erlang

1 Introduction

Simulation is a widely used modeling technique, which is applied to study phenomena for which a closed form analytical solution is either not known, or too difficult to obtain. There are many types of simulation: in a continuous simulation the system state changes continuously with time (e.g., simulating the temperature distribution over time inside a datacenter); in a discrete simulation the system state changes only at discrete points in time; finally, in a Monte Carlo simulation there is no explicit notion of time, as it relies on repeated random sampling to compute some result.

Discrete Event Simulation (DES) is of particular interest, since it has been successfully applied to the modeling and analysis of many types of systems, including computer system architectures, communication networks, street traffic, and others. In a Discrete Event Simulation, the system is described as a set of interacting entities; the state of the simulator is updated by simulation events, which happen at discrete points in time. For example, in a computer network simulation the following events may be defined: (1) arrival of a new packet at a router; (2) the router starts to process a packet; (3) the router finishes processing a packet; (4) packet transmission starts; (5) a timeout occurs and a packet is dropped; and so on.

The overall structure of a sequential event-based simulator is relatively simple: the simulator engine maintains a list, called Future Event List (FEL), of all pending events, sorted in nondecreasing simulation time of occurrence. The simulator executes the main simulation loop; at each iteration, the event e with the lowest timestamp t is removed from the FEL, and the simulation time is advanced to t. Then, the event is executed, which triggers any combination of the following actions:

  • The state of the simulation is updated;

  • Some events may be scheduled at some future time;

  • Some scheduled events may be removed from the FEL;

  • Some scheduled events may be rescheduled for a different time.

The simulation stops when either the FEL is empty, or some user-defined stopping criterion is met (e.g., a predefined maximum simulation time is reached, or enough samples of the events of interest have been collected). The FEL is usually implemented as a priority queue, although different data structures have been considered and provide various degrees of efficiency Jones [1986].
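As an illustration, the main loop described above can be sketched in a few lines of Erlang. This is a minimal sketch under our own assumptions, not ErlangTW code: the FEL is stored in a gb_trees structure keyed by timestamp, timestamps are assumed to be distinct, and Execute is a user-supplied function returning the newly scheduled events and the updated state.

-module(seq_des).
-export([run/4]).

%% Sequential DES main loop: extract the event with the lowest timestamp,
%% advance the clock, execute the event, and schedule any new events.
run(Fel, State, Execute, MaxTime) ->
    case gb_trees:is_empty(Fel) of
        true  -> State;                              % FEL empty: stop
        false ->
            {T, Event, Fel1} = gb_trees:take_smallest(Fel),
            if
                T > MaxTime -> State;                % stopping criterion met
                true ->
                    {NewEvents, State1} = Execute(Event, T, State),
                    %% insert the newly scheduled {Timestamp, Event} pairs
                    Fel2 = lists:foldl(
                             fun({Ts, E}, Q) -> gb_trees:insert(Ts, E, Q) end,
                             Fel1, NewEvents),
                    run(Fel2, State1, Execute, MaxTime)
            end
    end.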

Traditional sequential DES techniques may become inappropriate for analyzing large and/or detailed models, due to the large number of events, which can require considerable (wall clock) time to complete a simulation run. The Parallel and Distributed Simulation (PADS) discipline aims at taking advantage of modern high performance computing architectures, from massively parallel computers to multicore processors, to handle large models efficiently Fujimoto [1990]. The general idea of PADS is to partition the simulation model into submodels, called Logical Processes (LPs), which can be evaluated concurrently by different Processing Elements (PEs). More precisely, the simulation model is described in terms of multiple interacting entities which are assigned to different LPs. Each LP, executed on a different PE, is in practice the container of a set of entities. The simulation evolves through the exchange of timestamped messages (representing simulation events) between the entities. In order to ensure that causal dependencies between events are not violated Lamport [1978], each receiving entity must process incoming events in nondecreasing timestamp order.

We observe that multi- and many-core processor architectures are now ubiquitous; moreover, the Cloud computing paradigm allows users to rent high performance computing clusters using a “pay as you go” pricing model. The ready availability of high performance computing resources would suggest that PADS techniques, which have been refined precisely to take advantage of that kind of resources, should be widespread. Unfortunately, PADS techniques have not gained much popularity outside highly specialized user communities.

Figure 1: Layered structure of discrete-event simulators

There are many reasons for that D’Angelo [2011], but we believe that the fundamental issue with PADS is that parallel simulation models are currently not transparent to the user. Figure 1 (a) shows the (greatly simplified) structure of a DES stack. At the top level we have the user-defined simulation model; the model defines the events and how they change the system state. In practice, the model is implemented using either a general-purpose programming language or a language specifically tailored for writing simulations (e.g., Simula Dahl and Nygaard [1966], GPSS Gordon [1978], Dynamo Richardson and Pugh [1981], Parsec Bagrodia et al. [1998], SIMSCRIPT III Rice et al. [2005]). The simulation program depends on an underlying simulation engine, which provides core facilities such as random number generation, FEL handling, statistics collection and so on. The simulation engine may be implemented as a software library to be linked against the user-defined model. Finally, at the lowest level, the simulation is executed on some hardware platform, in general a general-purpose processor; ad-hoc architectures have also been considered (e.g., the ANTON supercomputer Shaw et al. [2008]).

The current state of PADS is similar to Figure 1 (b). Different parallel/distributed simulation libraries and middlewares have been proposed (e.g., µsik Perumalla [2005], SPEEDES Steinman and Wong [2003], PRIME PRIME [2011], GAIA/ARTÌS D’Angelo and Bracuto [2009]), each one specifically tailored for a particular environment or hardware architecture. While hardware dependency is unavoidable (shared memory parallel algorithms are quite different from distributed memory ones, for example), the problem here is that low level details are exposed to the user, who must therefore implement the simulation model taking explicitly into account where the model will be executed. This seriously limits the possibility of porting the same model to different platforms.

ErlangTW is a step towards the more desirable situation shown in Figure 1 (c). ErlangTW is a simulation library written in Erlang Armstrong [2007], which implements the Time Warp synchronization protocol for parallel and distributed simulations Jefferson [1985]. Erlang is a concurrent programming language based on the functional paradigm and the actor model, where concurrent objects interact using share-nothing message passing. In this way, the same application can potentially run unmodified on single-core processors, shared memory multiprocessors and distributed memory clusters. The Erlang Virtual Machine can automatically make use of all the available cores of a multicore processor, providing a uniform communication abstraction on shared memory machines; multiple Erlang VMs provide a similar abstraction on distributed memory systems. Thanks to these features, the same ErlangTW simulation model can be executed serially on single-core processors, or concurrently on multicores or clusters. Of course, performance will depend both on the model and on the underlying architecture; however, preliminary experiments with the PHOLD benchmark (reported in Section 5) show that scalability across different processor architectures can indeed be achieved. Moreover, future versions of ErlangTW will add support for the adaptive runtime migration of simulated entities (or whole LPs) using the serialization features offered by Erlang; due to many technical difficulties, this approach is not common in PADS tools, but it often speeds up the simulation execution.

This paper is structured as follows. In Section 2 we review the scientific literature and contrast our approach with similar works. In Section 3 we introduce the basic concepts of distributed simulation and the Time Warp protocol. In Section 4 we present the architecture and implementation of ErlangTW. We evaluate the performance of ErlangTW using the PHOLD benchmark, both on a multicore processor and on a small distributed memory cluster; performance results are described in Section 5. Finally, conclusions and future work are presented in Section 6.

2 Related Works

Over the years, many PADS tools, languages and middlewares have been proposed (a comprehensive but somewhat outdated list can be found in Low et al. [1999]); in this section we highlight some of the most significant results, with specific attention to implementations of the Time Warp synchronization mechanism.

µsik Perumalla [2005] is a multi-platform micro-kernel for the implementation of parallel and distributed simulations. The micro-kernel provides advanced features such as support for reverse computation and some kinds of load balancing.

The Synchronous Parallel Environment for Emulation and Discrete-Event Simulation (SPEEDES) Steinman and Wong [2003] and the WarpIV Kernel Steinman et al. [2008] have been used as testbeds for investigating new approaches to parallel simulation. SPEEDES is a software framework for building parallel simulations in C++. SPEEDES provides support for optimistic simulations by defining new data types for variables which can be rolled back to a previous state (as we will see in Section 3, this is required for optimistic simulations). SPEEDES uses the Qheap data structure for event management, which provides better performance than conventional priority queue data structures. SPEEDES has also been used for many seminal works on load balancing in optimistic synchronization.

DSIM Chen and Szymanski [2005] is a Time Warp simulator which targets clusters comprising thousands of processors and implements some advanced techniques for memory management (e.g., Time Quantum GVT and Local Fossil Collection).

We are aware of two existing simulation engines based on the Erlang programming language: Sim94 Carlson and Tronje [1995] and Sim-Diasca sim [2012]. Sim94 was originally developed for military leadership training of battalion commanders, and is based on a client-server paradigm. The server runs the simulation model, while clients can connect at any time to inspect or change the simulation state. It should be observed that Sim94 implements a conventional sequential simulator, while ErlangTW implements a parallel and distributed simulator based on the Time Warp synchronization protocol. Sim-Diasca, on the other hand, is a true PADS engine (simulation models can be executed on multiple execution units), but it is based on a time-stepped synchronization approach. A time-stepped simulation is divided into fixed-length time steps; all execution units execute each step concurrently and synchronize before executing the next one (see Section 3). Time-stepped simulations can be appropriate for systems whose evolution is “naturally” driven by a sequence of steps (e.g., a circuit simulation evolving according to a global clock). Issues in time-stepped simulations include the need to find an appropriate step duration, and the high cost of synchronization.

A recent work D’Angelo et al. [2012] investigated the use of the Go programming language (http://golang.org/) to implement an optimistic parallel simulator for multicore processors. The simulator, called Go-Warp, is based on the Time Warp mechanism. Go provides mechanisms for concurrent execution and inter-process communication which facilitate the development of parallel applications; as in Erlang, these mechanisms are part of the language core and are not provided by external libraries. However, Go-Warp cannot be executed on a distributed memory cluster without a major redesign; in this respect, ErlangTW represents a significant improvement, since the simulator runs without any modification on both shared memory and distributed memory architectures. To the best of our knowledge, Erlang has not previously been used to implement a Time Warp simulation engine.

3 Distributed Simulation

Parallel and Distributed Simulation (PADS) can be defined as “a simulation in which more than one processor is employed” Perumalla [2006]. As already observed in the introduction, there are many reasons for relying on PADS: to obtain results faster, to simulate larger scenarios, to integrate simulators that are geographically distributed, to integrate a set of commercial off-the-shelf simulators and to compose different simulation models in a single simulator Fujimoto [2000].

The main difference between sequential simulation and PADS is that in the latter there is no global shared system state. A PADS is realized as a set of entities; an entity is the smallest component of the simulation model, and therefore defines the model’s granularity. Entities interact with each other by exchanging timestamped events. Entities are executed inside containers called LPs. Each LP dispatches the events to the contained entities, and interacts with the other LPs for synchronization and data distribution. In practice, each LP is usually executed by a PE (e.g., a single core of a modern multicore processor). Each LP notifies relevant events to other LPs by sending messages, using whatever communication medium is available to the PEs. Each message is a pair (e, t), where e is a descriptor of the event to be processed and t is the simulation time at which e must be processed. Of course, the message header includes additional information, such as the IDs of the originator and destination entities.

Figure 2: Components of a PADS system

The situation is illustrated in Figure 2. Each LP contains a set of entities, and a queue of events which are to be executed by the local entities. The event queue plays the same role as the FEL of sequential simulations: the LP fetches the event with the lowest timestamp and forwards it to the destination entity. If an entity creates an event for a remote entity, the LP uses an underlying communication network to send the event to the corresponding remote LP.

The term “parallel simulation” is used if the PEs have access to a common shared memory, or in the presence of a tightly coupled interconnection network. Conversely, “distributed simulation” is used in the case of loosely coupled architectures (i.e., distributed memory clusters) Perumalla [2006]. In practice, modern high-performance systems are often hybrid architectures where a large number of shared memory multiprocessors are connected with a low latency network. Therefore, the term PADS is used to denote both approaches.

It is important to observe that, even if a shared system state is indeed available on shared memory multiprocessors, the state is still partitioned across the PEs in order to avoid race conditions and improve performance.

Model partitioning

Partitioning the model is nontrivial, and in general the optimal partitioning strategy may depend on the structure and semantics of the system to be simulated. For example, in a wireless sensor network simulation where each sensor node can interact only with its neighbors, it is reasonable to partition the model according to the geographic proximity of sensors. Many conflicting issues must be taken into account when partitioning a simulation model into LPs. Ideally, the partition should minimize the amount of communication between PEs; however, the partition should also try to balance the workload across the different PEs, in order to avoid bottlenecks on overloaded PEs. Finally, it is necessary to consider that a fixed partitioning scheme may not be appropriate, e.g., when the interactions among LPs change over time. In this scenario, some form of adaptive partitioning should be employed, but this feature is not provided by most currently available simulators.

Figure 3: An example of causality violation

Synchronization

The results of a PADS are correct if the outcome is identical to the one produced by a sequential execution in which all events are processed in nondecreasing timestamp order (we assume that we can always break ties, so that no two events occur at the exact same simulation time). In PADS, each LP keeps a local variable called Local Virtual Time (LVT), which represents the (local) simulation time. An LP can process a message (e, t) only if t is greater than or equal to its LVT; after executing the event e, the LVT is set to t.

It should be observed that the LVT of each LP advances at a different rate, due to load unbalance or communication delays. This may cause problems, such as the one shown in Figure 3. We depict the timelines associated with three LPs, LP1, LP2 and LP3. The numbers on each timeline represent the LVT of the corresponding LP. Arrows represent events; for simplicity, all messages are timestamped with the sender’s LVT.

When LP2 receives a message m1 from LP1, it sets its LVT to the timestamp of m1 and executes the associated event. Then, LP2 advances its LVT to 10, and sends a new message m3 to LP3. After that, message m2 arrives from LP1; m2 cannot be executed in the correct order, since the LVT of LP2 has already been advanced to 10. Moreover, LP2 has already sent out the message for event m3, which might not have been generated at all had m2 been executed first, in the correct order.

Figure 3 shows an example of causality violation Lamport [1978]. Two events are said to be in causal order if one of them can have some consequences on the other. In PADS, different synchronization strategies have been developed to guarantee causal ordering of events: time-stepped, conservative and optimistic.

In a time-stepped simulation, the time is divided into fixed-size steps, and each LP can proceed to the next timestep only when all LPs have completed the current one Tay et al. [2003]. This approach is quite simple, but requires a barrier synchronization at each step; the overall simulation speed is therefore always dominated by the slowest LP. Furthermore, defining the “correct” value of the timestep can be difficult, if not impossible, for some models.

The conservative approach prevents causality violations from occurring: before executing an event, an LP must check that no message from the past can arrive. This is achieved using the Chandy-Misra-Bryant (CMB) algorithm Misra [1986], which imposes the following constraints: (i) each LP has an incoming queue for every other LP from which it can receive messages; (ii) each LP must generate events in nondecreasing timestamp order; (iii) the delivery of events is reliable (no message can be lost) and the network does not reorder messages. Under these assumptions, each LP checks all the incoming queues to determine the next safe event to be processed. If no queue is empty, then the incoming event with the lowest timestamp is safe and can be executed. Unfortunately, this mechanism is prone to deadlock, since an LP cannot identify the next safe event unless all incoming queues are nonempty. To avoid this, the CMB algorithm introduces a new type of message (called a NULL message) with no semantic content. The receipt of a NULL message with timestamp t informs the receiver that the sender has advanced its LVT to t, and hence will not send any event with timestamp lower than t. NULL messages can be used to break deadlocks, at the cost of increased network load. Moreover, the generation of NULL messages requires some knowledge of the simulation model, and therefore cannot be transparent to the user.

Finally, the Time Warp protocol Jefferson [1985] implements the so-called optimistic synchronization approach. In Time Warp, each LP can process incoming events as soon as they are received. Obviously, causality violations may happen, and special actions must be taken to fix them. If an LP receives a message (called a straggler) with a timestamp smaller than that of some event already processed, it must roll back the computation for those events and re-execute them in the proper order. The problem is that some of the events to be undone might have sent messages (events) to other LPs (e.g., m3 in Figure 3). These messages must be invalidated by sending corresponding anti-messages. The recipient of an anti-message must roll back its state as well, which might trigger a cascade of rollbacks that brings the simulator back to a previous state, discarding the incorrect computations that have been performed.

In order to support rollbacks, each LP must keep a log of all processed events and all messages sent, together with any information needed to undo their effects. Obviously, logging each and every event since the beginning of the simulation is infeasible, due to the huge memory requirements. For this reason, the simulator periodically computes the Global Virtual Time (GVT), which is a lower bound on the timestamp of any future rollback. The GVT is simply the smallest timestamp among unprocessed and partially processed messages, and can be computed with a distributed snapshot algorithm Fujimoto [2000]. Once the GVT has been computed and sent to all LPs, logs older than the GVT can be reclaimed. GVT computation can be a costly operation, since it usually involves some form of all-to-all communication. Therefore, finding the optimal frequency of this operation is a critical aspect of Time Warp; typically, the chosen frequency is the result of a tradeoff between the memory consumed by the logs and simulation speed. However, when the underlying execution architecture provides efficient support for reduction operations, GVT computation does not add too much overhead, and the Time Warp protocol can achieve almost linear speedup even on very large setups Perumalla et al. [2011].

Optimistic synchronization offers some advantages with respect to conservative approaches: first, optimistic synchronization is generally capable of exploiting a higher degree of parallelism; second, conservative simulators require model specific information in order to produce NULL messages, while optimistic mechanisms are less reliant on such information (although they can exploit it if available) Fujimoto [1989].

4 The ErlangTW Simulator

Erlang is a functional, concurrent programming language based on lightweight threads (LWTs) and message passing. This makes it well suited for developing parallel applications both on shared memory multicore machines and on distributed memory clusters. An Erlang program is compiled to an intermediate representation called BEAM, which is executed on a Virtual Machine (VM). If Symmetric Multiprocessing is enabled, the VM creates a separate scheduler for each CPU core; each scheduler fetches ready LWTs from a common queue and executes them. The spawn function can be used to create a new thread executing a given function; the VM takes care of dispatching threads to the active schedulers. The fact that there is no 1:1 mapping between LWTs and OS threads facilitates the work of the developer, since the VM takes care of balancing the load across the available processors.

Each LWT has an identifier that is guaranteed to be unique across all VM instances, even those running on different hosts connected through a network. The identifier can be used by send/receive primitives, which are provided directly by the language itself and do not require external libraries.
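As a small illustration of these primitives (our own example, not part of ErlangTW), the following module spawns a lightweight thread and exchanges messages with it:

-module(ping).
-export([start/0]).

%% The spawned thread waits for {ping, From} messages and replies with pong.
loop() ->
    receive
        {ping, From} ->
            From ! pong,          % send to the caller's pid
            loop();               % recurse to keep serving requests
        stop ->
            ok
    end.

start() ->
    Pid = spawn(fun loop/0),      % create an LWT; Pid is unique across nodes
    Pid ! {ping, self()},         % asynchronous send to Pid's mailbox
    receive
        pong -> Pid ! stop, done
    end.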

The ErlangTW Simulator is an implementation of the Time Warp algorithm described in Section 3. Although Time Warp requires fairly sophisticated state management capabilities to support rollbacks and antimessages, it turns out that this (fairly limited) complexity is paid back by the fact that Time Warp does not require ad-hoc modifications of simulation models (e.g., to compute NULL messages).

Message Format

Messages exchanged between LPs are represented using the record data type, which provides the abstraction of a key-value tuple. Messages have the following structure:

-record(message, {type,
                  seqNumber,
                  lpSender,
                  lpReceiver,
                  payload,
                  timestamp}).

The type field represents the message type; the current types are: event (normal event), ack (acknowledgement to ensure reliable delivery of messages), marked_ack (special kind of acknowledgement required by Samadi’s algorithm, described later), and antimessage (used during rollbacks). seqNumber is a numeric value counting how many messages the sender LP has sent; lpSender and lpReceiver are the unique identifiers of the sender and receiver LPs. payload is the actual content of the message, describing the event to process and all ancillary data. Finally, timestamp is the simulated time associated with the event contained in the payload. The simulator needs to acknowledge messages in order to guarantee the correctness of its global state, because each message in the system must be taken into account by exactly one LP. The Erlang VM guarantees message delivery, but only from one LWT to another one’s mailbox: this can lead to a situation in which an LP has received a particular message but has not yet read it, and is therefore unaware of its presence. Conversely, once an LP receives an acknowledgement for a message, it knows that the message has been taken into account by the receiver. An example of global state is the Global Virtual Time, explained in the following.

Here is an example of a message:

#message{type=event,   % one of: event, ack, marked_ack, antimessage
         seqNumber=100,
         lpSender=<100,0,0>,
         lpReceiver=<100,1,0>,
         payload="hello",
         timestamp=10}

Event Queue

Each LP maintains a priority queue of incoming messages sorted in nondecreasing timestamp order. The LP fetches the message with the lowest timestamp from the queue and, if the message is not a straggler, immediately executes the associated event. The queue is implemented as an Andersson General Balanced Tree Andersson [1999]. The tree contains (Key, Value) pairs, where the Key is a simulation time, and the Value is a list of events which are scheduled to happen at that time (ErlangTW supports simultaneous events, i.e., multiple events happening at the same simulated time).
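For illustration, such a queue can be built on the gb_trees module of the Erlang standard library, which implements Andersson’s general balanced trees. This is a sketch with hypothetical names; ErlangTW’s actual module and functions may differ.

-module(event_queue).
-export([new/0, enqueue/3, dequeue/1]).

new() -> gb_trees:empty().

%% Insert Event at simulated time Ts; simultaneous events share one key.
enqueue(Ts, Event, Queue) ->
    case gb_trees:lookup(Ts, Queue) of
        {value, Events} -> gb_trees:update(Ts, [Event | Events], Queue);
        none            -> gb_trees:insert(Ts, [Event], Queue)
    end.

%% Remove and return one event with the lowest timestamp.
dequeue(Queue) ->
    {Ts, [Event | Rest], Queue1} = gb_trees:take_smallest(Queue),
    case Rest of
        [] -> {Ts, Event, Queue1};
        _  -> {Ts, Event, gb_trees:insert(Ts, Rest, Queue1)}
    end.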

Logical Processes

Each LP is implemented as an Erlang LWT created using the spawn function. LPs communicate using the send and receive operators. The state of an LP is kept in a record with the following structure:

-record(lp_status, {my_id,
                    received_messages,
                    inbox_messages,
                    max_received_messages,
                    proc_messages,
                    to_ack_messages,
                    anti_messages,
                    current_event,
                    history, gvt,
                    rollbacks,
                    timestamp,
                    model_state,
                    init_model_state,
                    samadi_find_mode,
                    samadi_marked_messages_min,
                    messageSeqNumber,
                    status}).

where:

  • my_id is the unique identifier of the LP;

  • received_messages is the list of unprocessed messages read from the process mailbox;

  • inbox_messages is the incoming message queue containing unprocessed messages;

  • proc_messages is a data structure which contains, for each processed event, the list of messages sent by that event to remote entities. This data structure is required to perform rollbacks when necessary, because it contains the events to reprocess and the antimessages to send;

  • to_ack_messages is a list of events, sorted in nondecreasing timestamp order, related to the messages sent by the LP that are still to be acknowledged;

  • model_state is the user-defined structure containing the state of the simulation model;

  • timestamp is the LVT;

  • history is the list of processed events, used by the Time Warp protocol to perform rollbacks when necessary. Each element of the list is a tuple of the form {Timestamp, model_state, Event}, and records the state of this LP at the given simulation time, before the Event has been processed. A tuple is added to the history after an event has been extracted from inbox_messages and executed;

  • the samadi_* fields are data structures needed to implement Samadi’s GVT algorithm, described below.

Implementing Simulated Entities

As already described in Section 3, an LP is a container of simulation entities. Each entity is the representation of some actor or component of the “real” system. By decoupling LPs from entities, simulation modelers can avoid dealing with partitioning; however, if more control over the simulator is desired, modelers can implement their own custom partitioning by working at the LP level.

In ErlangTW there is a layer between LPs and entities, in order to implement the separation of concerns described above. The modeler implements three methods in a particular Erlang module called user; these methods define the actions executed by each LP during initialization, event processing, and termination. The PHOLD model (described in Section 5) uses the initialization function to evenly partition the entities among the running LPs. The event processing function implements the behavior executed by each entity upon receipt of a new message. Finally, the termination function is normally used to display or save simulation results or other information at the end of each simulation run. Each message contains a field called payload that can transport any kind of user-defined data. As a specific example, the payload data structure used by the PHOLD model to manage entities has the following structure:

-record(payload,
    {entitySender, entityReceiver, value}).

and can be instantiated, for example, as follows:

#payload{entitySender=10,
         entityReceiver=122,
         value=42}

In this example entity 10 has sent a message to entity 122, with a payload containing the integer 42. In the current implementation of ErlangTW, where the allocation of entities to LPs must be manually defined, the user specifies a mapping function which is used by ErlangTW to deliver messages to the appropriate LP. In future versions we plan to implement an automatic allocation mechanism and to provide this binding transparently.
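For illustration, a simple round-robin mapping function could be written as follows (hypothetical name and conventions; ErlangTW’s actual mapping interface may differ):

%% Map an entity identifier to an LP identifier; entities are spread
%% evenly, mirroring the even partitioning used by the PHOLD model.
lp_for_entity(EntityId, NumLPs) ->
    (EntityId rem NumLPs) + 1.    % assumes LP identifiers 1..NumLPs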

Global Virtual Time

The Global Virtual Time is calculated with Samadi’s algorithm Samadi et al. [1987]. One LWT, called the GVT Controller, is responsible for periodically checking the smallest timestamp of all events stored in the queues of all LPs; the GVT controller is also responsible for starting and stopping the simulation. In the current version of ErlangTW, the GVT controller periodically broadcasts a GVT computation request to all LPs; each LP sends back its LVT, so that the controller can compute the GVT as the minimum of these values. The GVT is finally sent to all LPs, which can then prune their local histories by removing all checkpoints older than the GVT.

In practice, the calculation of the GVT is complex, given that some messages could be in flight while the sender and/or the receiver LPs are reporting their LVT. Ignoring these messages would result in a wrong (overestimated) GVT and would hang the whole simulation. The solution proposed by Samadi is to acknowledge each message involved in the GVT calculation, so as to properly identify in-flight messages and to decide which LP must take them into account.
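The basic reduction step performed by the GVT controller can be sketched as follows. This is a simplified illustration with hypothetical names: Samadi’s bookkeeping of acknowledgements and in-flight messages is deliberately omitted.

%% Broadcast a GVT request, collect one LVT per LP, compute the minimum,
%% and send it back so that each LP can prune its history.
compute_gvt(LPs) ->
    [Pid ! {gvt_request, self()} || Pid <- LPs],
    LVTs = [receive {lvt, Pid, T} -> T end || Pid <- LPs],
    GVT = lists:min(LVTs),
    [Pid ! {gvt, GVT} || Pid <- LPs],
    GVT.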

In future versions of ErlangTW we plan to compute the GVT using a more scalable reduction operation.

Random Number Generation

The pseudo random number generator used by each simulated entity is the Linear Congruential Generator described by Park and Miller [1988]. The initial seed can be stored in a configuration file which is read by ErlangTW before starting the simulation run. All entities within the same LP share a common random number generator, whose seed is initialized from the value in the configuration file. In this way it is possible to start the simulator in a known state, to achieve determinism and repeatability.
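The update rule of this generator is Seed' = (16807 × Seed) mod (2³¹ − 1); a minimal Erlang sketch (with hypothetical function names) is:

%% Park-Miller "minimal standard" linear congruential generator.
%% Seeds must lie in [1, 2^31 - 2]; Erlang bignums avoid overflow issues.
next_seed(Seed) ->
    (16807 * Seed) rem 2147483647.

%% Return a uniform draw in (0, 1) together with the new seed.
uniform(Seed) ->
    Seed1 = next_seed(Seed),
    {Seed1 / 2147483647, Seed1}.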

5 Performance Evaluation

In this section we evaluate the scalability of ErlangTW, both on shared memory and distributed memory architectures, using a synthetic benchmark called PHOLD Fujimoto [1990], which is specifically designed for the performance evaluation of Time Warp implementations.

The PHOLD Benchmark

PHOLD is the parallel version of the HOLD benchmark for event queues Jones [1986], and it is quite simple to implement and describe. The model is made up of a set of N entities that are partitioned among L LPs; each LP contains the same number of entities. Each entity produces and consumes events. When an entity consumes an event, a new event is generated and delivered to another entity (note that the total number of events in the system remains constant). The timestamp of the new event is computed by adding an exponentially distributed random number with mean 5.0 to the timestamp of the received event. In this model the recipient is randomly chosen using a uniform distribution. Therefore, each event has probability 1/L of being sent to an entity in the same LP as the originator, and probability (L-1)/L of being sent to an entity on a different LP. As the number of LPs increases, the ratio of remote vs. local events increases. The PHOLD benchmark is homogeneous in terms of load assigned to the LPs: all of them have the same amount of communication and computation. While this can be unrealistic for general simulation models, it is important to remark that the Time Warp mechanism (in its original version) does require a good level of balancing to obtain good performance Carothers and Fujimoto [2000]; D’Angelo [2011]. Hence, the goal of PHOLD is to study the scalability of Time Warp implementations in an appropriate execution environment.
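Using the uniform/1 sketch from Section 4, the event-generation rule just described can be illustrated as follows (hypothetical names, our own sketch: the recipient is drawn uniformly over all N entities, and the timestamp increment is exponential with mean 5.0, obtained by inverse transform sampling):

%% When an event with timestamp Ts is consumed, produce the follow-up event.
new_event(Ts, NumEntities, Seed0) ->
    {U1, Seed1} = uniform(Seed0),
    {U2, Seed2} = uniform(Seed1),
    Recipient = trunc(U1 * NumEntities) + 1,   % uniform over entities 1..N
    NewTs = Ts - 5.0 * math:log(U2),           % exponential increment, mean 5.0
    {Recipient, NewTs, Seed2}.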

There are four main parameters which are used to control the benchmark:

  • The number of LPs, L;

  • The number of entities, N;

  • The event density d, 0 < d ≤ 1, defined as the fraction of entities that generate an event at the beginning of the simulation. At each simulation time there are d·N events in the system;

  • The workload, used to tune the computation / communication ratio by running some CPU-intensive computation each time an event is processed. In our case, we implemented the workload as a pre-defined number of floating point operations (FPops).

Number of LPs (L): 1, 2, …, 8 (shared memory); 1, 2, 3, 6 (distributed memory)
Number of entities (N): 840, 1680, 2520, 3360
Event density (d): 0.5
Workload: 1000, 5500, 10000 FPops
Table 1: Parameters used in the simulations

Experimental Setup

Table 1 shows the parameters which have been used in the simulation runs. We tested ErlangTW both on a shared memory and on a distributed memory architecture.

The number of entities has been chosen as a multiple of 840, which is the least common multiple of the numbers of LPs we considered (i.e., 840 is an integer multiple of all integers in the range [1, 8]). This ensures that the number of entities allocated to each LP, N/L, is an integer.

As already described, the event density has been set to d = 0.5, which means that, at any given time, the average number of events in the system is N/2. We considered three different workloads of 1000, 5500 and 10000 floating point operations. Finally, the GVT is computed every 5 seconds.

We measured the wall clock time of a simulation run until the GVT reaches 1000. In order to produce statistically valid results, we performed 30 runs for each experiment, and computed the average of each batch. We investigate the scalability of ErlangTW by computing the speedup as a function of the number of LPs.

Host | CPU | Physical cores | HT | RAM | Operating system | Network
gda i7 | Intel i7-2600 3.40GHz | 4 | Yes | 8GB | GNU/Linux kernel 3.2 (x86_64) | not used
cassandra | Intel Xeon 2.80GHz | 2 | Yes | 3GB | GNU/Linux kernel 2.6 (x86_32) | Gigabit Ethernet
cerbero | Intel Xeon 2.80GHz | 2 | Yes | 2GB | GNU/Linux kernel 2.6 (x86_32) | Gigabit Ethernet
chernobog | Intel Xeon 2.40GHz | 4 | No | 4GB | GNU/Linux kernel 2.6 (x86_64) | Gigabit Ethernet
Table 2: Experimental testbeds (top: shared memory; bottom: distributed memory)
Figure 4: Speedup on the shared memory architecture as a function of the number of LPs (higher is better)
Figure 5: Efficiency on the shared memory architecture as a function of the number of LPs (higher is better)
Figure 6: Total number of rollbacks on the shared memory architecture as a function of the number of LPs (lower is better)

ErlangTW on Shared Memory

The shared memory system (gda i7) is an Intel(R) Core(TM) i7-2600 CPU at 3.40GHz, with 4 physical cores and Hyper-Threading (HT) technology Marr et al. [2002]. The system has 8 GB of RAM and runs Ubuntu 12.04 (x86_64 GNU/Linux, kernel 3.2.0-24-generic #39-Ubuntu SMP). For this system we considered several values for L, namely L = 1, 2, …, 8 LPs. HT works by duplicating some parts of the processor, except the main execution units. From the point of view of the Operating System, each physical processor core corresponds to two logical processors. The impact of virtual cores on PADS is worth investigating Bononi et al. [2006]; D’Angelo et al. [2012], and will be discussed in the following.

Figure 4 shows the speedup as a function of L; recall that the speedup is defined as S(L) = T(1) / T(L), where T(L) is the wall clock simulation time when L LPs are used. In each figure we consider a specific value for the workload, and we plot a curve for each number of entities N. As a general trend we observe that scalability improves as the number of entities gets large; also, scalability improves marginally if the workload (FPops) increases. Figure 5 shows the efficiency, E(L) = S(L) / L, as a function of L. The efficiency is an estimate of the fraction of actual computation performed by all processors, as opposed to communication and synchronization.

ErlangTW exhibits good scalability and efficiency up to L = 4, since in this case each LP can be executed on a separate physical processor core. The transition from L = 4 to L = 5 shows a noticeable drop in the speedup (and therefore in the efficiency), which is easily explained by the effect of HT. When L = 5, one of the physical CPU cores executes two LPs and becomes the bottleneck. The Time Warp protocol works well when the workload is well balanced, but degrades significantly if hot spots are present Carothers and Fujimoto [2000].

To better understand this, we report in Figure 6 the mean total number of rollbacks which occurred during the whole simulation run. A large number of rollbacks indicates that the LVTs of the individual LPs are advancing at different rates. The PHOLD model is balanced by construction, since all entities perform identical tasks and are uniformly distributed across the LPs. From Figure 6 we see that the number of rollbacks increases in the region 2 ≤ L ≤ 4; if L = 1 no rollbacks happen, since all events are managed through the event queue of a single LP, so that causality is always ensured. Adding more LPs increases the possibility of receiving a straggler. From L = 4 to L = 5 load unbalance occurs and the number of rollbacks sharply increases. The LPs running on the overloaded processor core lag behind the other LPs, and a large number of antimessages is produced to undo the updates performed by the faster LPs. As the number of LPs further increases, we observe that the number of rollbacks decreases, since the system becomes more and more balanced.

In practice it is extremely difficult, if not impossible, to statically partition a PADS model so that the workload is balanced across the LPs, since the computation / communication ratio can change during the simulation. If detailed knowledge of the simulation model is not available in advance, as is the case most of the time, it is necessary to resort to adaptive entity migration techniques to balance the LPs D’Angelo and Bracuto [2009]. It is worth mentioning that Erlang offers native support for code migration, which greatly simplifies the implementation of such techniques; this will be the focus of future extensions of this work.

Figure 7: Speedup on the distributed memory cluster as a function of the number of LPs (higher is better); the GVT is computed every 5 seconds of wall clock time
Figure 8: Speedup on the distributed memory cluster as a function of the number of LPs (higher is better); the GVT is computed every 1 second of wall clock time
Figure 9: Efficiency on the distributed memory cluster as a function of the number of LPs (higher is better)
Figure 10: Total number of rollbacks on the distributed memory cluster as a function of the number of LPs (lower is better); the GVT is computed every 1 second of wall clock time

ErlangTW on Distributed Memory

The distributed memory system is the research cluster of the PADS group at the University of Bologna. We used three machines, cassandra, cerbero and chernobog, whose configuration is shown in Table 2. We performed experiments with L = 1, 2, 3, 6. For L = 1, the single LP executed on cassandra; for L = 2, one LP executed on cassandra and the other on cerbero. For L = 3 we ran a single LP on each of the three machines. Finally, for L = 6 we executed two LPs on each of the three machines.

Figure 7 shows the speedup of the PHOLD model, measured on our distributed memory cluster. Thanks to the Erlang language, it was possible to execute the exact same implementation which was tested on the shared memory machine. Again, each value is obtained by averaging 30 simulation runs. The most prominent feature of these figures is the superlinear speedup which occurs with L = 2 and L = 3 LPs. As is often the case in these situations, the superlinear speedup can be explained by the fact that the machine used for the test with L = 1 (cassandra) has limited memory, and therefore makes use of virtual memory during the simulation. To confirm this hypothesis, we reduced the amount of memory required by the PHOLD model by reducing the wall clock time between GVT calculations. Recall from Section 3 that, once the GVT is known, each LP can discard logs for events executed before the GVT, since these events will never be rolled back. Therefore, increasing the frequency of GVT calculation results in a reduced memory footprint of the simulation model, at the cost of a higher number of communications. The tests shown in Figure 7 were done with the GVT computed every 5 seconds of wall clock time; reducing this interval to 1 second produces the more reasonable results shown in Figure 8.

Scalability on the distributed memory cluster is quite poor, as confirmed by the efficiency shown in Figure 9. This result can be explained by observing that PADS applications often exhibit a low computation / communication ratio, and in our distributed memory testbed the communication network uses the standard Gigabit Ethernet protocol, which suffers from non-negligible latency. Note from Figure 9 that scalability and efficiency are particularly poor for low workload intensities (1000 and 5500 FPops) and for a low number of entities. In these situations PHOLD is communication bound, and the latency introduced by the commodity LAN severely impacts the overall performance.

Since our cluster includes heterogeneous machines, the load is not evenly balanced across the LPs, and this generates a large number of antimessages. In Figure 10 we plot the mean total number of rollbacks as a function of the number of LPs L. The number of rollbacks sharply increases from L = 2 to L = 3; this can be explained by the fact that for L = 2 only cassandra and cerbero, which have similar hardware configurations, are used, while chernobog (which is used when L = 3 and L = 6) is much more powerful. As in the shared memory case, the speed mismatch produces a large number of stragglers, which generate cascades of rollbacks.

6 Conclusion and future work

In this paper we described ErlangTW, an implementation of the Time Warp protocol for parallel and distributed simulations, written in Erlang. ErlangTW allows the same simulation model to be executed (unmodified) on single-core, multicore and distributed computing architectures. We described the prototype implementation of ErlangTW, and analyzed its scalability on a multicore, shared memory machine and on a small distributed memory cluster using the PHOLD benchmark.

Results show that Erlang provides a good framework for building simulators, thanks to its powerful language features and virtual machine facilities; furthermore, Erlang’s transparent message brokering system greatly simplifies the development of complex distributed applications such as PADS. The performance of the PHOLD benchmark shows that scalability and efficiency on shared memory architectures are very good, while distributed memory architectures are less friendly, performance-wise, to these kinds of applications.

As seen before, the communication overhead of the distributed execution environment has a big impact on simulator performance, and the Time Warp synchronization algorithm reacts badly to imbalances in the execution architecture (e.g., CPUs with very different speeds, or the presence of background load). Both problems can be addressed using features provided by Erlang: the serialization of objects and data structures, and code migration. Thanks to these, it is possible to transfer simulated entities across different LPs, or even to move a whole LP to a different CPU, all at runtime. In this way, the ErlangTW simulator would be able to reduce the communication cost by adaptively clustering highly interacting entities within the same LP. Furthermore, it will be possible to implement other advanced forms of load balancing D’Angelo and Bracuto [2009] to speed up the execution and to reduce the number of rollbacks. This will permit the implementation of new adaptive simulators that can change their configuration at runtime. To further enhance the performance of ErlangTW, we will exploit additional parallelization within each LP, by decoupling message dispatching from entity management using separate LWTs.

Source Code Availability

The ErlangTW Simulator is released under the GNU General Public License (GPL) version 2 and can be freely downloaded from http://pads.cs.unibo.it/

References

  • sim [2012] Sim-Diasca. http://www.sim-diasca.org/, 2012.
  • Andersson [1999] A. Andersson. General balanced trees. J. Algorithms, 30(1):1–18, Jan. 1999. ISSN 0196-6774. doi: 10.1006/jagm.1998.0967.
  • Armstrong [2007] J. Armstrong. Programming Erlang: Software for a Concurrent World. Pragmatic Bookshelf, 2007. ISBN 193435600X, 9781934356005.
  • Bagrodia et al. [1998] R. Bagrodia, R. Meyer, M. Takai, Y.-A. Chen, X. Zeng, J. Martin, and H. Y. Song. Parsec: a parallel simulation environment for complex systems. Computer, 31(10):77 –85, oct 1998. ISSN 0018-9162. doi: 10.1109/2.722293.
  • Bononi et al. [2006] L. Bononi, M. Bracuto, G. D’Angelo, and L. Donatiello. Exploring the effects of Hyper-Threading on parallel simulation. In Proceedings of the 10th IEEE international symposium on Distributed Simulation and Real-Time Applications, pages 257–260, Washington, DC, USA, 2006. IEEE Computer Society. ISBN 0-7695-2697-7. doi: 10.1109/DS-RT.2006.18.
  • Carlson and Tronje [1995] B. Carlson and S. Tronje. Sim94–a concurrent simulator for plan-driven troops. Technical report, Uppsala Universitet, Sweden, Feb. 15 1995.
  • Carothers and Fujimoto [2000] C. D. Carothers and R. M. Fujimoto. Efficient execution of time warp programs on heterogeneous, NOW platforms. IEEE Trans. Parallel Distrib. Syst., 11(3):299–317, Mar. 2000. ISSN 1045-9219. doi: 10.1109/71.841745.
  • Chen and Szymanski [2005] G. Chen and B. Szymanski. DSIM: scaling time warp to 1,033 processors. In Simulation Conference, 2005 Proceedings of the Winter, page 10 pp., dec. 2005. doi: 10.1109/WSC.2005.1574269.
  • Dahl and Nygaard [1966] O.-J. Dahl and K. Nygaard. SIMULA: an ALGOL-based simulation language. Commun. ACM, 9(9):671–678, Sept. 1966. ISSN 0001-0782. doi: 10.1145/365813.365819.
  • D’Angelo [2011] G. D’Angelo. Parallel and distributed simulation from many cores to the public cloud. In High Performance Computing and Simulation (HPCS), 2011 International Conference on, pages 14–23, July 2011. doi: 10.1109/HPCSim.2011.5999802.
  • D’Angelo and Bracuto [2009] G. D’Angelo and M. Bracuto. Distributed simulation of large-scale and detailed models. International Journal of Simulation and Process Modelling (IJSPM), 5(2):120–131, 2009. ISSN 1740-2123.
  • D’Angelo et al. [2012] G. D’Angelo, S. Ferretti, and M. Marzolla. Time warp on the Go. In Proc. Simutools 2012 - Fifth International Conference on Simulation Tools and Techniques, pages 249–255, Desenzano, Italy, Mar.19 2012.
  • Fujimoto [1989] R. M. Fujimoto. Parallel discrete event simulation. In Proceedings of the 21st conference on Winter simulation, WSC ’89, pages 19–28, New York, NY, USA, 1989. ACM. ISBN 0-911801-58-8.
  • Fujimoto [1990] R. M. Fujimoto. Performance of time warp under synthetic workloads. In Proc. SCS Multiconference on Distributed Simulation, pages 23–28, 1990.
  • Fujimoto [2000] R. M. Fujimoto. Parallel and distributed simulation systems. Wiley series on parallel and distributed computing. Wiley, 2000. ISBN 9780471183839.
  • Gordon [1978] G. Gordon. The development of the general purpose simulation system (gpss). SIGPLAN Not., 13(8):183–198, Aug. 1978. ISSN 0362-1340. doi: 10.1145/960118.808382.
  • Jefferson [1985] D. R. Jefferson. Virtual time. ACM Trans. Program. Lang. Syst., 7(3):404–425, July 1985. ISSN 0164-0925. doi: 10.1145/3916.3988.
  • Jones [1986] D. W. Jones. An empirical comparison of priority-queue and event-set implementations. Commun. ACM, 29(4):300–311, Apr. 1986. ISSN 0001-0782. doi: 10.1145/5684.5686.
  • Lamport [1978] L. Lamport. Time, clocks, and the ordering of events in a distributed system. Commun. ACM, 21(7):558–565, July 1978. ISSN 0001-0782. doi: 10.1145/359545.359563.
  • Low et al. [1999] Y.-H. Low, C.-C. Lim, W. Cai, S.-Y. Huang, W.-J. Hsu, S. Jain, and S. J. Turner. Survey of languages and runtime libraries for parallel discrete-event simulation. SIMULATION, 72(3):170–186, 1999. doi: 10.1177/003754979907200309.
  • Marr et al. [2002] D. T. Marr, F. Binns, D. L. Hill, G. Hinton, D. A. Koufaty, A. J. Miller, and M. Upton. Hyper-Threading Technology Architecture and Microarchitecture. Intel Technology Journal, 6(1), Feb. 2002.
  • Misra [1986] J. Misra. Distributed discrete event simulation. ACM Computing Surveys, 18(1):39–65, 1986.
  • Park and Miller [1988] S. K. Park and K. W. Miller. Random number generators: good ones are hard to find. Commun. ACM, 31(10):1192–1201, Oct. 1988. ISSN 0001-0782. doi: 10.1145/63039.63042.
  • Perumalla [2005] K. S. Perumalla. µsik - a micro-kernel for parallel/distributed simulation systems. In Proceedings of the 19th Workshop on Principles of Advanced and Distributed Simulation, PADS ’05, pages 59–68, Washington, DC, USA, 2005. IEEE Computer Society. ISBN 0-7695-2383-8.
  • Perumalla [2006] K. S. Perumalla. Parallel and distributed simulation: traditional techniques and recent advances. In L. F. Perrone, B. Lawson, J. Liu, and F. P. Wieland, editors, Proceedings of the Winter Simulation Conference WSC 2006, Monterey, California, USA, December 3-6, 2006, pages 84–95. WSC, 2006. ISBN 1-4244-0501-7. doi: 10.1145/1218112.1218132.
  • Perumalla et al. [2011] K. S. Perumalla, A. J. Park, and V. Tipparaju. GVT algorithms and discrete event dynamics on 129k+ processor cores. In 18th International Conference on High Performance Computing, HiPC 2011, Bengaluru, India, December 18-21, 2011, pages 1–11. IEEE, 2011. ISBN 978-1-4577-1951-6. doi: 10.1109/HiPC.2011.6152725.
  • PRIME [2011] PRIME: Parallel Real-time Immersive network Modeling Environment. https://www.primessf.net, 2011.
  • Rice et al. [2005] S. V. Rice, H. M. Markowitz, A. Marjanski, and S. M. Bailey. The SIMSCRIPT III programming language for modular object-oriented simulation. In Proceedings of the 37th conference on Winter simulation, WSC ’05, pages 621–630. Winter Simulation Conference, 2005. ISBN 0-7803-9519-0.
  • Richardson and Pugh [1981] G. P. Richardson and A. L. Pugh. Introduction to System Dynamics Modeling with Dynamo. MIT Press, Cambridge, MA, USA, 1981. ISBN 0262181029.
  • Samadi et al. [1987] B. Samadi, R. Muntz, and D. Parker. A distributed algorithm to detect a global state of a distributed simulation system. In Proc. IFIP Conference on Distributed Processing. North-Holland, 1987.
  • Shaw et al. [2008] D. E. Shaw, M. M. Deneroff, R. O. Dror, J. S. Kuskin, R. H. Larson, J. K. Salmon, C. Young, B. Batson, K. J. Bowers, J. C. Chao, M. P. Eastwood, J. Gagliardo, J. P. Grossman, C. R. Ho, D. J. Ierardi, I. Kolossváry, J. L. Klepeis, T. Layman, C. McLeavey, M. A. Moraes, R. Mueller, E. C. Priest, Y. Shan, J. Spengler, M. Theobald, B. Towles, and S. C. Wang. Anton, a special-purpose machine for molecular dynamics simulation. Commun. ACM, 51(7):91–97, July 2008. ISSN 0001-0782. doi: 10.1145/1364782.1364802.
  • Steinman and Wong [2003] J. S. Steinman and J. W. Wong. The SPEEDES persistence framework and the standard simulation architecture. In Proceedings of the seventeenth workshop on Parallel and distributed simulation, PADS ’03, pages 11–, Washington, DC, USA, 2003. IEEE Computer Society. ISBN 0-7695-1970-9.
  • Steinman et al. [2008] J. S. Steinman, C. N. Lammers, M. E. Valinski, K. Roth, and K. Words. Simulating parallel overlapping universes in the fifth dimension with HyperWarpSpeed implemented in the WarpIV kernel. In Proceedings of the Simulation Interoperability Workshop, SIW ’08, 2008.
  • Tay et al. [2003] S. Tay, G. Tan, and K. Shenoy. Piggy-backed time-stepped simulation with ’super-stepping’. In Simulation Conference, 2003. Proceedings of the 2003 Winter, volume 2, pages 1077 – 1085 vol.2, dec. 2003.