Abstract
FastFlow is a structured parallel programming framework targeting shared memory multicores. Its layered design, together with the optimized implementation of the communication mechanisms underlying the FastFlow streaming networks provided to the application programmer as algorithmic skeletons, supports the development of efficient fine grain parallel applications. FastFlow is available (open source) at SourceForge (http://sourceforge.net/projects/mc-fastflow/). This work introduces FastFlow programming techniques and points out the different ways used to parallelize existing C/C++ code using FastFlow as a software accelerator. In short: this is a kind of tutorial on FastFlow.
Università di Pisa
Dipartimento di Informatica
Technical Report: TR-12-04
M. Aldinucci and M. Danelutto and M. Torquati
Dip. Informatica, Univ. Torino
Dip. Informatica, Univ. Pisa
March 28, 2012
ADDRESS: Largo B. Pontecorvo 3, 56127 Pisa, Italy. TEL: +39 050 2212700 FAX: +39 050 2212726
FastFlow is an algorithmic skeleton programming environment developed at the Dept. of Computer Science of Pisa and Torino.
A number of different papers and technical reports discuss the different features of this programming environment [11, 6, 2], the kind of results achieved while parallelizing different applications [13, 9, 10, 3, 8, 7] and the usage of FastFlow as a software accelerator, i.e. as a mechanism suitable to exploit unused cores of a multicore architecture to speed up the execution of sequential code [4, 5].
This paper represents instead a tutorial aimed at instructing programmers in the usage of the FastFlow skeletons and in the typical FastFlow programming techniques.
Therefore, after recalling the FastFlow design principles in Sec. 2, in Sec. 3 we describe the (trivial) installation procedure. Then, in Sections 4 to 8.3 we introduce the main features of the FastFlow programming framework. Other sections detail particular techniques related to FastFlow usage, namely: access to shared data (Sec. 7), FastFlow usage as an accelerator (Sec. 9) and the possibility to use FastFlow as a framework to experiment with new skeletons, in addition to the ones already provided (Sec. 8.3.1 and Sec. 12.1). Finally, Sec. 13 gives a rough idea of the expected performance while running FastFlow programs and Sec. 14 outlines the main FastFlow RTS accessory routines.
2 Design principles
FastFlow (see also the FastFlow home page at http://mc-fastflow.sourceforge.net) has been designed to provide programmers with efficient parallelism exploitation patterns suitable to implement (fine grain) stream parallel applications. In particular, FastFlow has been designed
to promote high-level parallel programming, and in particular skeletal programming (i.e. pattern-based explicit parallel programming), and
to promote efficient programming of applications for multi-core.
The whole programming framework has been incrementally developed according to a layered design on top of Pthread/C++ standard programming framework and targets shared memory multicore architectures (see Fig. 1).
A first layer, the Simple streaming networks layer, provides lock-free Single Producer Single Consumer (SPSC) queues on top of the Pthread standard threading model.
A second layer, the Arbitrary streaming networks layer, provides lock-free implementations for Single Producer Multiple Consumer (SPMC), Multiple Producer Single Consumer (MPSC) and Multiple Producer Multiple Consumer (MPMC) queues on top of the SPSC queues implemented in the first layer.
Finally, the third layer, the Streaming Networks Patterns layer, provides common stream parallel patterns. The primitive patterns include pipelines and farms. Simple specializations of these patterns may be used to implement more complex patterns, such as divide and conquer, map and reduce patterns.
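To give a feeling of what the first layer provides, the following is a minimal sketch of a bounded single producer single consumer queue. It is not the FastFlow implementation: the class name SpscRing, its interface and the use of C++11 atomics are our own assumptions, chosen only to show that one producer and one consumer can cooperate without locks because each of them writes just one of the two indices.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical bounded single-producer/single-consumer ring buffer.
// Illustration only: FastFlow ships its own lock-free SPSC queue.
template <typename T>
class SpscRing {
public:
    explicit SpscRing(std::size_t capacity)
        : buf_(capacity + 1), head_(0), tail_(0) {   // one slot stays empty
        assert(capacity > 0);
    }

    // Called by the producer thread only. Returns false when the queue is full.
    bool push(const T& item) {
        const std::size_t t = tail_.load(std::memory_order_relaxed);
        const std::size_t next = (t + 1) % buf_.size();
        if (next == head_.load(std::memory_order_acquire)) return false;
        buf_[t] = item;
        tail_.store(next, std::memory_order_release);
        return true;
    }

    // Called by the consumer thread only. Returns false when the queue is empty.
    bool pop(T& out) {
        const std::size_t h = head_.load(std::memory_order_relaxed);
        if (h == tail_.load(std::memory_order_acquire)) return false;
        out = buf_[h];
        head_.store((h + 1) % buf_.size(), std::memory_order_release);
        return true;
    }

private:
    std::vector<T> buf_;
    std::atomic<std::size_t> head_;   // next slot to read (written by consumer)
    std::atomic<std::size_t> tail_;   // next slot to write (written by producer)
};
```

With a capacity of 2, two pushes succeed and a third one fails until the consumer pops an item, which is the bounded-buffer behaviour that streams between two concurrent activities rely on.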
Parallel application programmers are assumed to use FastFlow directly exploiting the parallel patterns available in the Streaming Network Patterns level. In particular:
defining sequential concurrent activities, by sub classing a proper FastFlow class, the ff_node class, and
building complex stream parallel patterns by hierarchically composing sequential concurrent activities, pipeline patterns, farm patterns and their “specialized” versions implementing more complex parallel patterns.
The ff_node sequential concurrent activity abstraction provides suitable ways to define a sequential activity processing data items appearing on a single input channel and delivering the related results onto a single output channel. Particular cases of ff_nodes may be simply implemented with no input channel or no output channel. The former is used to install a concurrent activity generating an output stream (e.g. from data items read from keyboard or from a disk file); the latter to install a concurrent activity consuming an input stream (e.g. to present results on a video or to store them on disk).
The pipeline pattern may be used to implement sequences of streaming networks $S_1, \ldots, S_k$ with $S_i$ receiving input from $S_{i-1}$ and delivering outputs to $S_{i+1}$. Each $S_i$ may be either a sequential activity or another parallel pattern. $S_1$ must be a stream generator activity and $S_k$ a stream consuming one.
The farm pattern models different embarrassingly (stream) parallel constructs. In its simplest form, it models a master/worker pattern with workers producing no stream data items. Rather, the workers consolidate results directly in memory. More complex forms, including either an emitter, or a collector, or both an emitter and a collector, implement more sophisticated patterns:
by adding an emitter, the user may specify policies, different from the default round robin one, to schedule input tasks to the workers;
by adding a collector, the user may use workers actually producing some output values, which are gathered and delivered to the farm output stream. Different policies may be implemented on the collector to gather data from the workers and deliver them to the output stream.
In addition, a feedback channel may be added to a farm, moving output results from the collector (or from the collection of workers in case no collector is specified) back to the emitter input channel. The feedback channel may only be added to the farm/pipe at the root of the skeleton tree.
Specialized versions of the farm may be used to implement more complex patterns, such as:
divide and conquer, using a farm with feedback loop and proper stream items tagging (input tasks, subtask results, results)
MISD (multiple instruction single data, that is something computing $f_1(x_i), \ldots, f_k(x_i)$ out of each $x_i$ appearing onto the input stream) pattern, using a farm with an emitter implementing a broadcast scheduling policy
map, using an emitter partitioning an input collection and scheduling one partition per worker, and a collector gathering sub-partitions results from the workers and delivering a collection made out of all these results to the output stream.
It is worth pointing out that while using plain pipeline and farms (with or without emitters and collectors) actually can be classified as “using skeletons” in a traditional skeleton based programming framework, the usage of specialized versions of the farm streaming network can be more easily classified as “using skeleton templates”, as the base features of the FastFlow framework are used to build new patterns, not provided as primitive skeletons (although this may change in future FastFlow releases, this is the current situation as of FastFlow version 1.1).
Concerning the usage of FastFlow to support parallel application development on shared memory multicores, the framework provides two abstractions of structured parallel computation:
a “skeleton program abstraction” which is used to implement applications completely modelled according to the algorithmic skeleton concepts. When using this abstraction, the programmer writes a parallel application by providing the business logic code, wrapped into proper ff_node subclasses, a skeleton (composition) modelling the parallelism exploitation pattern of the application, and a single command starting the skeleton computation and awaiting its termination.
an “accelerator abstraction” which is used to parallelize (and therefore accelerate) only some parts of an existing application. In this case, the programmer provides a skeleton (composition) which is run on the “spare” cores of the architecture and implements a parallel version of the business logic to be accelerated, that is the code computing a given $f(x)$. The skeleton (composition) will have its own input and output channels. When an $f(x)$ has actually to be computed within the application, rather than writing proper code to call the sequential code, the programmer may insert code asynchronously “offloading” $x$ to the accelerator skeleton. Later on, when the result of $f(x)$ is to be used, some code “reading” the accelerator result may be used to retrieve the accelerator computed values.
This second abstraction fully implements the “minimal disruption” principle stated by Cole in his skeleton manifesto, as the programmer using the accelerator is only required to program a couple of offload/get_result primitives in place of the single function call statement (see Sec. 9).
Before entering the details of how FastFlow may be used to implement efficient stream parallel (and not only) programs on shared memory multicore architectures, let’s have a look at how FastFlow may be installed. (We only detail here the instructions needed to install FastFlow on Linux/Unix/BSD machines. A Windows port of FastFlow exists, which requires slightly different installation steps.)
The installation process is trivial, actually:
first, you have to download the source code from SourceForge (http://sourceforge.net/projects/mc-fastflow/)
then you have to extract the files using a tar xzvf fastflow-XX.tgz command, and
finally, you should use the top level directory resulting from the tar command as the argument of the -I flag of the compiler.
As an example, the currently available version (1.1) is hosted in a 1.1.0.tar.gz file. If you download it and extract the files to your home directory, you should compile FastFlow code using the flags g++ -I $HOME/fastflow-1.1.0 -lpthread in addition to any other flags needed to compile your specific code.
makefiles are provided both within the
fastflow-1.1.0/tests and the
directories in the source distribution.
4 Hello world in FastFlow
As in all programming framework tutorials, we start with a Hello world code. In order to implement our hello world program, we use the following code:
Line 2 includes all that is needed to compile a FastFlow program just using a pipeline pattern, and line 4 instructs the compiler to resolve names looking (also) at the FastFlow namespace.
Lines 6 to 13 host the application business logic code, wrapped into a class sub classing ff_node. The void * svc(void *) method (we use the term svc as a shortcut for “service”) wraps the body of the concurrent activity resulting from the wrapping. It is called every time the concurrent activity is given a new input stream data item. The input stream data item pointer is passed through the input void * parameter. The result of the single invocation of the concurrent activity body is passed back to the FastFlow runtime returning the void * result. In case NULL is returned, the concurrent activity actually terminates.
The application main only hosts the code needed to set up the FastFlow streaming network and to start the skeleton (composition) computation: lines 17 and 18 declare a pipeline pattern (line 17) and insert a single stage (line 18) in the pipeline. Line 20 starts the computation of the skeleton program and waits for the skeleton computation termination. In case of errors, the run_and_wait_end() call will return a negative number (according to the Unix/Linux syscall conventions).
When the program is started, the FastFlow RTS takes care of starting the pipeline. In turn, the first stage is started. As the first stage svc returns NULL, the stage is terminated immediately afterwards by the FastFlow RTS.
If we compile and run the program, we get the following output:
There is nothing parallel here, however. The single pipeline stage is run just once and there is nothing else, from the programmer viewpoint, running in parallel. The graph of concurrent activities in this case is the following, trivial one:
A more interesting “HelloWorld” would have been to have a two stage pipeline where the first stage prints the “Hello” and the second one, after getting the results of the computation of the first one, prints “world”. In order to implement this behaviour, we have to write two sequential concurrent activities and to use them as stages in a pipeline. Additionally, we have to send something out as a result from the first stage to the second stage. Let’s assume we just send the string with the word to be printed. The code may be written as follows:
We define two sequential stages. The first one (lines 6–16) prints the “Hello” message, then allocates some memory buffer, stores the “world” message in the buffer and sends it to the output stream (return on line 14). The sleep on line 13 is here just to make the FastFlow scheduling of concurrent activities more evident. The second one (lines 18–26) just prints whatever it gets on the input stream (the data item passed through the void * task parameter of the svc header on line 21), frees the allocated memory and then returns a GO_ON mark, which is intended to be a value interpreted by the FastFlow framework as: “I finished processing the current task, I give you no result to be delivered onto the output stream, but please keep me alive ready to receive another input task”.
The main on lines 28–40 is almost identical to the one of the previous version but for the fact we add two stages to the pipeline pattern. Implicitly, this sets up a streaming network with Stage1 connected by a stream to Stage2. Items delivered on the output stream by Stage1 will be read on the input stream by Stage2.
The concurrent activity graph is therefore:
If we compile and run the program, however, we get a kind of unexpected result:
First of all, the program keeps running, printing a “Hello world” every second. We in fact terminate the execution through a CONTROL-C. Second, the initial sequence of strings is a little bit strange (and, depending on the actual number of cores of your machine and on the kind of scheduler used in the operating system, the sequence may vary a little bit).
The “infinite run” is related to the way FastFlow implements concurrent activities. Each ff_node is run as many times as the number of data items appearing onto its input stream, unless the svc method returns a NULL. Therefore, if the method returns either a task (pointer) to be delivered onto the concurrent activity output stream, or the GO_ON mark (no data output to the output stream but continue execution), it is re-executed as soon as there is some input available.
The first stage, which has no associated input stream, is re-executed up to the moment it terminates the svc with a NULL. In order to have the program terminating, we therefore may use the following code for Stage1:
If we compile and execute the program with this modified Stage1 stage, we’ll get an output such as:
that is, the program terminates after a single run of the two stages. Now the question is: why did the second stage terminate, although its svc method return value states that more work is to be done? The answer is in the stream semantics implemented by FastFlow. FastFlow streaming networks automatically manage end-of-streams. That is, as soon as an ff_node returns a NULL, implicitly declaring that it wants to terminate its output stream, the information is propagated to the node consuming the output stream. This node will therefore also terminate execution, without actually executing its svc method, and the end of stream will be propagated onto its output stream, if any. Therefore Stage2 terminates after the termination of Stage1.
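The end-of-stream protocol just described can be modelled with plain C++ threads. The sketch below does not use FastFlow: Channel and run_two_stage_pipeline are hypothetical names, the stream is a mutex protected queue rather than a lock-free SPSC queue, and a null pointer plays the role of the end-of-stream mark. It only shows the control flow: the first stage body is re-run until it returns a null pointer, the “runtime” then forwards the end-of-stream, and the second stage terminates without its body being run on that mark.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <vector>

// Hypothetical blocking channel standing in for a FastFlow stream.
class Channel {
public:
    void put(std::string* item) {
        { std::lock_guard<std::mutex> lk(m_); q_.push(item); }
        cv_.notify_one();
    }
    std::string* get() {                      // blocks until an item arrives
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [this] { return !q_.empty(); });
        std::string* item = q_.front();
        q_.pop();
        return item;
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<std::string*> q_;
};

// Runs a two-stage pipeline: stage 1 emits n items and then returns nullptr,
// which is turned into an end-of-stream mark for stage 2.
std::vector<std::string> run_two_stage_pipeline(int n) {
    Channel stream;
    std::vector<std::string> seen;            // what stage 2 received

    std::thread stage1([&] {
        int produced = 0;
        auto svc = [&]() -> std::string* {    // first-stage body
            if (produced == n) return nullptr;            // terminate the stream
            return new std::string("world " + std::to_string(produced++));
        };
        while (std::string* out = svc()) stream.put(out); // re-run until nullptr
        stream.put(nullptr);                              // propagate end-of-stream
    });

    std::thread stage2([&] {
        // The body runs once per input item; it is not run for the EOS mark.
        while (std::string* task = stream.get()) {
            seen.push_back(*task);
            delete task;                      // the consumer frees the item
        }
    });

    stage1.join();
    stage2.join();
    return seen;
}
```

Calling run_two_stage_pipeline(3) returns the three strings in stream order, and the call returns only after both stages have terminated.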
The other problem, namely the appearance of the initial 2 “Hello” strings apparently related to just one “world” string, is due to the fact that FastFlow does not guarantee any scheduling semantics of the ff_node svc executions. The first stage delivers a string to the second stage, then it is executed again and again. The sleep inserted in the first stage prevents too many “hello” strings from accumulating on the output stream delivered to the second stage. If we remove the sleep statement, in fact, the output is quite different: we will see in the output a large number of “hello” strings followed by another large number of “world” strings. This is because the first stage is enabled to send as many data items on the output stream as the capacity of the SPSC queue used to implement the stream between the two stages allows.
5 Generating a stream
In order to get a better idea of how streams are managed within FastFlow, we slightly change our HelloWorld code in such a way that the first stage in the pipeline produces integer data items on the output stream and then terminates. The second stage prints a “world -i-” message upon receiving each item on the input stream.
We already discussed the role of the return value of the svc method. Therefore a first version of this program may be implemented using as the Stage1 class the following code:
The output we get is the following one:
However, there is another way we can use to generate the stream, which is a little bit more “programmatic”. FastFlow makes available an ff_send_out method in the ff_node class, which can be used to direct a data item onto the concurrent activity output stream, without actually using the svc return value. In this case, we could have written the Stage1 as follows:
In this case, the Stage1 is run just once (as it immediately returns a NULL). However, during the single run the svc while loop delivers the intended data items on the output stream through the ff_send_out method. In case the sends fill up the SPSC queue used to implement the stream, the ff_send_out will block up to the moment Stage2 consumes some items and consequently frees space in the SPSC buffers.
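The difference between the two ways of generating a stream can be made concrete with a small single threaded model. The classes below are not the ff_node API: Generator, ReturnStyle, SendOutStyle and drive are hypothetical names, send_out merely appends to a vector in place of ff_send_out, and the blocking behaviour on a full queue is not modelled. The point is only how many times the node body is invoked in each case.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for a stream-generating node; not the ff_node API.
class Generator {
public:
    virtual ~Generator() = default;
    // Returns a pointer to the next item, or nullptr to end the stream.
    virtual const int* svc() = 0;
    std::vector<int> out;      // models the node output stream
    int calls = 0;             // how many times the driver invoked svc()
protected:
    void send_out(int v) { out.push_back(v); }   // models ff_send_out
};

// Style 1: one item per svc() invocation, delivered through the return value.
class ReturnStyle : public Generator {
public:
    explicit ReturnStyle(int n) : n_(n) {}
    const int* svc() override {
        if (next_ > n_) return nullptr;          // stream exhausted
        cur_ = next_++;
        return &cur_;
    }
private:
    int n_, next_ = 1, cur_ = 0;
};

// Style 2: a single svc() invocation pushes every item, then returns nullptr.
class SendOutStyle : public Generator {
public:
    explicit SendOutStyle(int n) : n_(n) {}
    const int* svc() override {
        for (int i = 1; i <= n_; ++i) send_out(i);
        return nullptr;
    }
private:
    int n_;
};

// Minimal driver: re-run svc() until it returns nullptr.
void drive(Generator& g) {
    for (;;) {
        ++g.calls;
        const int* item = g.svc();
        if (item == nullptr) break;
        g.out.push_back(*item);
    }
}
```

Both generators produce the same stream; with three items the first one needs four invocations of its body (three items plus the terminating one), the second one just a single invocation.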
6 More on ff_node
The ff_node class actually defines three distinct virtual methods:
The first one is the one defining the behaviour of the node while processing the input stream data items. The other two methods are automatically invoked once and for all by the FastFlow RTS when the concurrent activity represented by the node is started (svc_init) and right before it is terminated (svc_end).
These virtual methods may be overridden in the user supplied ff_node subclasses to implement initialization code and finalization code, respectively. Actually, the svc method must be overridden as it is defined as a pure virtual method.
We illustrate the usage of the two methods with another program, computing the Sieve of Eratosthenes. The sieve uses a number of stages in a pipeline. Each stage stores the first integer it gets on the input stream. Then it cycles, passing onto the output stream only the input stream items which are not multiples of the stored integer. An initial stage injects in the pipeline the sequence of integers starting at 2, up to a given upper bound. Upon completion, each stage has stored a prime number.
We can implement the Eratosthenes sieve with the following FastFlow program.
The Generate stage at lines 35–66 generates the integer stream, from 2 up to a value taken from the command line parameters. It uses an svc_init just to point out when the concurrent activity is started. The creation of the object used to represent the concurrent activity is instead evidenced by the message printed in the constructor.
The Sieve stage (lines 6–28) defines the generic pipeline stage. This stores the initial value got from the input stream on lines 14–16 and then goes on passing the inputs that are not multiples of the stored value on lines 18–21. The svc_end method is executed right before terminating the concurrent activity and prints out the stored value, which happens to be the prime number found in that node.
The Printer stage is used as the last stage in the pipeline (the pipeline built on lines 98–103 in the program main) and just discards all the received values but the first one, which is kept to remember the point where we arrived storing prime numbers. It defines both an svc_init method (to print a message when the concurrent activity is started) and an svc_end method, which is used to print the first integer received, representing the upper bound (not included) of the sequence of prime numbers discovered with the pipeline.
The concurrent activity graph of the program is the following one:
The program output, when run with 7 Sieve stages on a stream from 2 to 30, is the following one:
showing that the prime numbers up to 19 (excluded) have been found.
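The logic of the sieve stages can be checked independently of FastFlow with a sequential model. In the sketch below SieveStage, SieveResult and run_sieve are hypothetical names: each stage object keeps the first integer it sees and forwards the non-multiples, and the stages are chained by an ordinary loop rather than by streams, so no parallelism is involved.

```cpp
#include <cassert>
#include <vector>

// Hypothetical sieve stage: keeps the first integer it sees and forwards
// only the later integers that are not multiples of it.
class SieveStage {
public:
    // Returns true and sets `out` when the item must go to the next stage.
    bool svc(int in, int& out) {
        if (stored_ == 0) { stored_ = in; return false; }   // first item: keep it
        if (in % stored_ == 0) return false;                 // multiple: drop it
        out = in;
        return true;
    }
    int stored() const { return stored_; }   // the value svc_end would print
private:
    int stored_ = 0;
};

struct SieveResult {
    std::vector<int> primes;   // value stored by each sieve stage (0 if none)
    int first_at_printer;      // first value reaching the last stage, 0 if none
};

// Pushes 2..upper through `nstages` sieve stages, one item at a time.
SieveResult run_sieve(int nstages, int upper) {
    std::vector<SieveStage> stages(nstages);
    SieveResult r{{}, 0};
    for (int i = 2; i <= upper; ++i) {
        int item = i;
        bool alive = true;
        for (SieveStage& s : stages) {
            int next = 0;
            alive = s.svc(item, next);
            if (!alive) break;               // stored or filtered out
            item = next;
        }
        if (alive && r.first_at_printer == 0) r.first_at_printer = item;
    }
    for (const SieveStage& s : stages) r.primes.push_back(s.stored());
    return r;
}
```

With 7 stages on the stream from 2 to 30, the stages end up storing 2, 3, 5, 7, 11, 13 and 17, and the first value reaching the last stage is 19, in agreement with the run discussed above.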
7 Managing access to shared objects
Shared objects may be accessed within FastFlow programs using the pthread concurrency control mechanisms. The FastFlow program is actually a multithreaded code using the pthread library, in fact.
We demonstrate how access to shared objects may be ensured within a FastFlow program forcing mutual exclusion in the access to the std::cout file descriptor. This will be used to have much nicer strings output on the screen when running the Sieve program illustrated in the previous section.
In order to guarantee mutual exclusion on the shared std::cout descriptor we use a pthread_mutex_lock. The lock is declared and properly initialized as a static, global variable in the program (see code below, line 7). Then each one of the writes to the std::cout descriptor in the concurrent activities relative to the different stages of the pipeline is protected through pthread_mutex_lock / pthread_mutex_unlock “brackets” (see lines 29–31 in the code below, as an example).
When running the program, we get a slightly different output than the one we obtained when the usage of std::cout was not properly regulated:
The strings are printed in clearly separated lines, although some apparently unordered string sequence appears, which is due to the FastFlow scheduling of the concurrent activities and to the way locks are implemented and managed in the pthread library.
It is worth pointing out that
FastFlow ensures correct access sequences to the shared objects used to implement the streaming networks (the graph of concurrent activities), such as the SPSC queues used to implement the streams, as an example.
FastFlow stream semantics guarantee correct sequencing of activation of the concurrent activities modelled through ff_nodes and connected through streams. The stream implementation actually ensures pure data flow semantics.
any access to any user defined shared data structure must be protected with either the primitive mechanisms provided by FastFlow (see Sec. 7) or the primitives provided within the pthread library.
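The lock/unlock “bracket” around a shared output stream can be sketched as follows. This is not the code of the Sieve program: we use std::mutex and std::thread from the C++ standard library in place of the pthread calls, and the threads write to a std::ostringstream rather than to std::cout so that the result can be inspected. The function names locked_output and all_lines_intact are hypothetical.

```cpp
#include <cassert>
#include <mutex>
#include <sstream>
#include <string>
#include <thread>
#include <vector>

// Several threads append whole lines to one shared stream. The same bracket
// would apply to std::cout.
std::string locked_output(int nthreads, int lines_each) {
    std::ostringstream shared;
    std::mutex lock;                       // protects `shared`
    std::vector<std::thread> pool;
    for (int id = 0; id < nthreads; ++id) {
        pool.emplace_back([&, id] {
            for (int k = 0; k < lines_each; ++k) {
                std::lock_guard<std::mutex> guard(lock);   // lock ... unlock
                shared << "stage " << id << " item " << k << "\n";
            }
        });
    }
    for (std::thread& t : pool) t.join();
    return shared.str();
}

// Checking helper: true if every line has the shape "stage <id> item <k>".
bool all_lines_intact(const std::string& text) {
    std::istringstream in(text);
    std::string line, w1, w2;
    int id, k;
    while (std::getline(in, line)) {
        std::istringstream fields(line);
        if (!(fields >> w1 >> id >> w2 >> k) || w1 != "stage" || w2 != "item")
            return false;
    }
    return true;
}
```

The order in which the lines of different threads appear is still up to the scheduler; the lock only guarantees that each line is written as a whole.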
8 More skeletons: the FastFlow farm
In the previous sections, we used only pipeline skeletons in the sample code. Here we introduce the other primitive skeleton provided in FastFlow, namely the farm skeleton.
The simplest way to define a farm skeleton in FastFlow is by declaring a farm object and adding a vector of worker concurrent activities to the farm. An excerpt of the needed code is the following one
This code basically defines a farm with a number of workers processing the data items appearing onto the farm input stream and delivering results onto the farm output stream. The scheduling policy used to send input tasks to workers is the default one, that is the round robin one. Workers are implemented by the worker objects added to the farm. These objects may represent sequential concurrent activities as well as further skeletons, that is either pipeline or farm instances.
However, this farm may not be used alone. There is no way to provide an input stream to a FastFlow streaming network but having the first component in the network generating the stream. To this purpose, FastFlow supports two options:
we can use the farm defined with a code similar to the one described above as the second stage of a pipeline whose first stage generates the input stream according to one of the techniques discussed in Sec. 5. This means we will use the farm writing a code such as:
or we can provide an emitter and a collector to the farm, specialized in such a way they can be used to produce the input stream and consume the output stream of the farm, respectively, while inheriting the default scheduling and gathering policies.
The former case is simple. We only have to understand why adding the farm to the pipeline as a pipeline stage works. This will be discussed in detail in Sec. 10. The latter case is simple as well, but we discuss it through some more code.
8.1 Farm with emitter and collector
First, let us see what kind of objects we have to build to provide the farm an emitter and a collector. Both emitter and collector must be supplied as ff_node subclass objects. If we implement the emitter just providing the svc method, the tasks delivered by the svc on the output stream, either using a ff_send_out or returning the proper pointer with the svc return statement, will be dispatched to the available workers according to the default round robin scheduling.
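The default round robin policy itself is simple enough to be written down in a few lines. The function below is a hypothetical model, not the farm run time support: it takes the stream of tasks produced by an emitter and returns, for each worker, the list of tasks that worker would receive.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical round-robin dispatcher: the k-th task of the stream goes to
// worker k mod nworkers.
std::vector<std::vector<int>> dispatch_round_robin(const std::vector<int>& tasks,
                                                   std::size_t nworkers) {
    assert(nworkers > 0);
    std::vector<std::vector<int>> queues(nworkers);
    std::size_t next = 0;                       // index of the next worker
    for (int task : tasks) {
        queues[next].push_back(task);
        next = (next + 1) % nworkers;           // wrap around after the last one
    }
    return queues;
}
```

With seven tasks and three workers, the first worker gets the first, fourth and seventh task, whatever time each task takes to compute; this is why different policies may be preferable when task costs are very unbalanced.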
An example of emitter node, generating the stream of tasks eventually processed by the farm worker nodes, is the following one:
In this case, the node svc actually does not take into account any input stream item (the input parameter name is omitted on line 5). Rather, each time the node is activated, it returns a task to be computed using the internal ntasks value. The task is directed to the “next” worker by the FastFlow farm run time support.
Concerning the collector, we can also use a ff_node: in case the results need further processing, they can be directed to the next node in the streaming network using the mechanisms detailed in Sec. 5. Otherwise, they can be processed within the svc method of the ff_node subclass.
As an example, a collector just printing the tasks/results it gets from the workers may be programmed as follows:
With these classes defined and assuming to have a worker defined by the class:
we can define a program processing a stream of integers by increasing each one of them with a farm as follows:
The concurrent activity graph in this case is the following one:
When run with the first argument specifying the number of workers to be used and the second one specifying the length of the input stream generated in the emitter node, we get the expected output:
8.2 Farm with no collector
We move on to consider a further case: a farm with emitter but no collector. Having no collector, the workers may not deliver results: all the results computed by the workers must be consolidated in memory. The following code implements a farm where a stream of tasks of type TASK with an integer tag i and an integer value t are processed by the worker of the farm by:
storing the result in a global array at the position given by the tag i
Writes to the global result array need not be synchronized as each worker writes different positions in the array.
The Worker code at lines 14–21 defines an svc method that returns a GO_ON. Therefore no results are directed to the collector (non existing, see lines 55–74: they define the farm but they do not contain any add_collector in the program main). Rather, the results computed by the worker code at line 18 are directly stored in the global array.
In this case the concurrent activity graph is the following:
The main program prints the results vector before calling run_and_wait_end() and after the call, and you can easily verify the results are actually computed and stored in the correct place in the vector:
Besides demonstrating how a farm without collector may compute useful results, the program of the last listing also demonstrates how complex task data structures can be delivered and retrieved to and from the FastFlow streaming network streams.
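The reason why no synchronization is needed can be seen in the following thread based sketch, which does not use FastFlow. Task and farm_no_collector are hypothetical names, the tasks are statically assigned round robin to the workers, and the “+1” computed on each value is just a placeholder for the worker business logic. We assume that the tags are unique and lie in the range of the result array.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

struct Task { int tag; int value; };   // hypothetical task layout

// Each worker handles the tasks assigned to it round robin and writes
// results[tag]; tags are unique, so no two threads touch the same slot.
std::vector<int> farm_no_collector(const std::vector<Task>& tasks,
                                   std::size_t nworkers) {
    assert(nworkers > 0);
    std::vector<int> results(tasks.size(), 0);
    std::vector<std::thread> workers;
    for (std::size_t w = 0; w < nworkers; ++w) {
        workers.emplace_back([&, w] {
            for (std::size_t k = w; k < tasks.size(); k += nworkers) {
                const Task& t = tasks[k];
                // No lock here: every tag addresses a distinct position.
                results[static_cast<std::size_t>(t.tag)] = t.value + 1;
            }
        });
    }
    for (std::thread& t : workers) t.join();   // results are complete after this
    return results;
}
```

The results may be read safely only after all the workers have been joined, which corresponds to reading the global array after the farm has terminated.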
8.3 Specializing the scheduling strategy in a farm
In order to select the worker where an incoming input task has to be directed, the FastFlow farm uses an internal ff_loadbalancer that provides a method int selectworker() returning the index in the worker array corresponding to the worker where the next task has to be directed. This method cannot be overwritten, actually. But the programmer may subclass the ff_loadbalancer and provide his own selectworker() method and pass the new load balancer to the farm emitter, therefore implementing a farm with a user defined scheduling policy.
The steps to be performed in this case are exemplified with the following, relevant portions of code.
First, we subclass the ff_loadbalancer and provide our own selectworker() method:
Then we create a farm specifying the new load balancer class as a type parameter:
Finally, we create an emitter that within its svc method invokes the set_victim method right before outputting a task towards the worker string, either with a ff_send_out(task) or with a return(task). The emitter is declared as:
and inserted in the farm with the code
What we get is a farm where the worker to be used to execute the task appearing onto the input stream is decided by the programmer through the proper implementation of my_loadbalancer rather than being decided by the current FastFlow implementation.
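The interplay between the emitter and a user defined load balancer may be modelled as follows. The class VictimBalancer and the function emit are hypothetical and do not reproduce the ff_loadbalancer interface: they only mimic the protocol in which the emitter first records the chosen worker and the delivery step then asks selectworker() where the task has to go. The even/odd policy is an arbitrary example and assumes a farm with at least two workers.

```cpp
#include <cassert>
#include <vector>

// Hypothetical model of a farm whose target worker is chosen per task by
// user code; not the FastFlow ff_loadbalancer class.
class VictimBalancer {
public:
    explicit VictimBalancer(int nworkers) : queues_(nworkers) {}
    void set_victim(int v) {
        assert(v >= 0 && v < static_cast<int>(queues_.size()));
        victim_ = v;
    }
    int selectworker() const { return victim_; }          // user defined policy
    void deliver(int task) { queues_[selectworker()].push_back(task); }
    const std::vector<std::vector<int>>& queues() const { return queues_; }
private:
    std::vector<std::vector<int>> queues_;   // one task list per worker
    int victim_ = 0;
};

// Emitter body: pick the worker first, then output the task.
void emit(VictimBalancer& lb, int task) {
    lb.set_victim(task % 2 == 0 ? 0 : 1);    // example policy: even/odd split
    lb.deliver(task);
}
```

Emitting the tasks 1 to 5 on a two worker balancer sends 2 and 4 to the first worker and 1, 3 and 5 to the second one.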
Two particular cases specializing the scheduling policy in different ways by using FastFlow predefined code are illustrated in the following two subsections.
8.3.1 Broadcasting a task to all workers
FastFlow supports the possibility to direct a task to all the workers in a farm. It is particularly useful if we want to process the task by workers implementing different functions. The broadcasting is achieved through the declaration of a specialized load balancer, in a way very similar to what we illustrated in Sec. 8.3.
The following code implements a farm whose input tasks are broadcast to all the workers, and whose workers compute different functions on the input tasks, and therefore deliver different results on the output stream.
At lines 44–52 a ff_loadbalancer subclass is defined providing a broadcast method. The method is implemented in terms of an ff_loadbalancer internal method. This new loadbalancer class is used as in the case of other user defined schedulers (see Sec. 8.3) and the emitter eventually uses the load balancer broadcast method instead of delivering the task to the output stream (i.e. directly to the string of the workers). This is done through the svc code at lines 57–60.
Lines 103 and 104 are used to add two different workers to the farm.
The rest of the program is standard, but for the fact the resulting farm is used as an accelerator (lines 112–123, see Sec. 9).
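The effect of the broadcast policy, with workers computing different functions, can be summarized by the following sequential model. The function broadcast_farm is hypothetical and involves neither threads nor FastFlow: it only shows which results are computed. In a real farm the order in which the results reach the output stream is not fixed.

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Hypothetical broadcast farm: every task is handed to every worker, and
// each worker applies its own function to it.
std::vector<int> broadcast_farm(
        const std::vector<int>& tasks,
        const std::vector<std::function<int(int)>>& workers) {
    std::vector<int> output;
    for (int task : tasks)
        for (const auto& worker : workers)     // same task, all the workers
            output.push_back(worker(task));
    return output;
}
```

With two workers computing x+1 and x*x, the task 3 yields the two results 4 and 9.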
8.3.2 Using autoscheduling
FastFlow provides suitable tools to implement farms with “auto scheduling”, that is farms where the workers “ask” for something to be computed rather than accepting tasks sent by the emitter (explicit or implicit) according to some scheduling policy. This scheduling behaviour may be simply implemented by using the farm method set_scheduling_ondemand(), as follows:
The scheduling policy implemented in this case is an approximation of the auto scheduling, indeed. The emitter simply checks the length of the SPSC queues connecting the emitter to the workers, and delivers the task to the first worker whose queue length is less than or equal to 1. To be more precise, FastFlow should have implemented a request queue where the workers may write task requests tagged with the worker id and the emitter may read such requests to choose the worker where the incoming task is to be directed. This is not possible as of FastFlow 1.1 because it still doesn’t allow reading from multiple SPSC queues while preserving the FIFO order.
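The selection rule described above may be sketched as a plain function over the current queue lengths. The name select_ondemand is hypothetical, and returning -1 when all the workers are busy is our own assumption on how the “no worker available” case may be signalled (the emitter would then retry later); the actual FastFlow code is not reproduced here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Returns the index of the first worker whose pending-queue length is <= 1,
// or -1 when every worker already has more than one task queued.
int select_ondemand(const std::vector<std::size_t>& queue_len) {
    for (std::size_t w = 0; w < queue_len.size(); ++w)
        if (queue_len[w] <= 1) return static_cast<int>(w);
    return -1;   // assumption: the caller retries until a queue drains
}
```

With queue lengths 3, 2, 1 and 0 the third worker is chosen, as it is the first one with at most one pending task.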
9 FastFlow as a software accelerator
Up to now we just showed how to use FastFlow to write a “complete skeleton application”, that is an application whose complete flow of control is defined through skeletons. In this case the main of the C/C++ program written by the user is basically providing the structure of the parallel application by defining a proper FastFlow skeleton nesting and the commands to start the computation of the skeleton program and to wait for its termination. All the business logic of the application is embedded in the skeleton parameters.
Now we want to discuss the second kind of usage which is supported by FastFlow, namely FastFlow accelerator mode. The term “accelerator” is used the way it is used when dealing with hardware accelerators. A hardware accelerator (a GPU or an FPGA or even a more “general purpose” accelerator such as Tilera 64 core chips, Intel Many Core or IBM WireSpeed/PowerEN) is a device that can be used to compute particular kinds of code faster than the CPU. A FastFlow accelerator is a software device that can be used to speed up skeleton structured portions of code using the cores left unused by the main application. In other words, it is a way FastFlow supports to accelerate particular computations by using a skeleton program and offloading to the skeleton program the tasks to be computed.
The FastFlow accelerator will use $n-1$ cores of the $n$ core machine, assuming that the calling code is not parallel, and will try to ensure that an $n-1$ fold speedup is achieved in the computation of the tasks offloaded to the accelerator, provided a sufficient number of tasks are given to be computed.
Using FastFlow accelerator mode is not much different from using FastFlow to write an application only using skeletons (see Fig. 2). In particular, the following steps must be followed:
A skeleton program has to be written, using the FastFlow skeletons (or their customized versions), computing the tasks that will be given to the accelerator. The skeleton program used to program the accelerator is supposed to have an input stream, used to offload the tasks to the accelerator.
Then, the skeleton program must be run using a particular method, different from the run_and_wait_end() we have already seen, that is a run_then_freeze() method. This method will start the accelerator skeleton program, which consumes the input stream items either to produce output stream items or to consolidate (partial) results in memory. When we want to stop the accelerator, we deliver an end-of-stream mark to the input stream.
Finally, we must wait until the computation of the accelerator has terminated.
A simple program using FastFlow accelerator mode is shown below:
We use a farm accelerator. The accelerator is declared at line 43. The “true” parameter is the one telling FastFlow that the farm has to be used as an accelerator. Workers are added at lines 45–48. Each worker is given its id as a constructor parameter. This is the same as the code in plain FastFlow applications. Line 50 starts the skeleton code in accelerator mode. Lines 55 to 58 offload tasks to be computed to the accelerator. These lines could be part of any larger C++ program. The idea is that whenever we have a task ready to be submitted to the accelerator, we simply “offload” it to the accelerator. When we have no more tasks to offload, we send an end-of-stream (line 59) and finally we wait for the completion of the computation of the tasks in the accelerator (line 60).
This kind of interaction with an accelerator not having an output
stream is intended to model those computations that consolidate
results directly in memory. In fact, the
Worker code actually
writes results into specific positions of a shared vector.
Each worker writes the task it receives in the $i$-th position of the
vector, $i$ being the index of the worker in the farm worker string.
As each worker writes a distinct position in the vector, no specific
synchronization is needed to access vector positions. Eventually the
last task received by worker $i$ will be stored at position $i$ in the vector.
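The offload, end-of-stream, and wait protocol just described, with results consolidated in memory, can be mimicked with the C++ standard library alone. The sketch below is an illustration under stated assumptions, not FastFlow code: `MiniAccelerator`, `offload_eos`, `slots` and `run_demo` are hypothetical names, and a mutex-protected queue replaces FastFlow's lock-free SPSC queues.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <numeric>
#include <queue>
#include <thread>
#include <vector>

// Hypothetical stand-in for a farm accelerator without output stream.
// NOT FastFlow code: it only mimics the offload / end-of-stream / wait steps.
class MiniAccelerator {
public:
    explicit MiniAccelerator(std::size_t nworkers)
        : slots_(nworkers, -1), counts_(nworkers, 0) {
        assert(nworkers > 0);
        for (std::size_t i = 0; i < nworkers; ++i)
            workers_.emplace_back([this, i] { worker_loop(i); });
    }
    ~MiniAccelerator() { offload_eos(); wait(); }

    // Submit one task to the accelerator input stream.
    void offload(long task) {
        { std::lock_guard<std::mutex> lk(m_); input_.push(task); }
        cv_.notify_one();
    }
    // Deliver the end-of-stream mark: no task may be offloaded afterwards.
    void offload_eos() {
        { std::lock_guard<std::mutex> lk(m_); eos_ = true; }
        cv_.notify_all();
    }
    // Wait until all workers have drained the input stream and terminated.
    void wait() {
        for (std::thread& t : workers_) if (t.joinable()) t.join();
    }
    // Both accessors are meant to be called after wait().
    std::size_t processed() const {
        return std::accumulate(counts_.begin(), counts_.end(), std::size_t{0});
    }
    const std::vector<long>& slots() const { return slots_; }

private:
    void worker_loop(std::size_t i) {
        for (;;) {
            std::unique_lock<std::mutex> lk(m_);
            cv_.wait(lk, [this] { return eos_ || !input_.empty(); });
            if (input_.empty()) return;  // end-of-stream seen, queue drained
            long task = input_.front();
            input_.pop();
            lk.unlock();
            slots_[i] = task;  // worker i only ever writes position i
            ++counts_[i];
        }
    }

    std::mutex m_;
    std::condition_variable cv_;
    std::queue<long> input_;
    bool eos_ = false;
    std::vector<long> slots_;
    std::vector<std::size_t> counts_;
    std::vector<std::thread> workers_;
};

// Offloads tasks 0..ntasks-1 and returns how many of them were processed.
std::size_t run_demo(std::size_t nworkers, long ntasks) {
    MiniAccelerator acc(nworkers);
    for (long t = 0; t < ntasks; ++t) acc.offload(t);
    acc.offload_eos();
    acc.wait();
    return acc.processed();
}
```

Because each worker writes only its own position of `slots_`, the result vector needs no lock. The only synchronization is on the input queue, and the calling code reads the vector only after `wait()` has returned.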
We can also assume that results are awaited from the accelerator
through its output stream.
In this case, we first have to write the skeleton code of the
accelerator in such a way an output stream is supported.
In the new version of the accelerator sample program below, we add a
collector to the accelerator farm (line 45). The collector is defined
as just collecting results from workers and delivering the results to
the output stream (lines 18–24).
Once the tasks have been offloaded to the accelerator, rather than waiting
for accelerator completion, we can ask for computed results as they are delivered
to the accelerator output stream through the
bool load_result(void **) method (see lines 59–61).
The bool load_result(void **) method synchronously waits for
one item to be delivered on the accelerator output stream. If such an item
is available, the method returns “true” and stores the item pointer
in the parameter. If no other items will be available, the method
returns “false”.
An asynchronous method is also available:
bool load_results_nb(void **). In this case, if no result is
available at the moment, the method returns a “false” value, and you
should retry later on to see whether a result may be retrieved.
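The difference between the synchronous and the asynchronous retrieval can be made concrete with a small model of the output stream. This is a minimal sketch, assuming a mutex-protected queue instead of the FastFlow channels. `ResultStream`, `push`, `close` and `try_load_result` are hypothetical names chosen for the illustration, and only the return-value convention follows the text.

```cpp
#include <cassert>
#include <condition_variable>
#include <deque>
#include <mutex>

// Hypothetical model of an accelerator output stream (not FastFlow code).
class ResultStream {
public:
    // Producer side: a collector would call this for each result.
    void push(void* item) {
        { std::lock_guard<std::mutex> lk(m_); items_.push_back(item); }
        cv_.notify_one();
    }
    // Producer side: signals that no further result will ever arrive.
    void close() {
        { std::lock_guard<std::mutex> lk(m_); closed_ = true; }
        cv_.notify_all();
    }
    // Synchronous retrieval: blocks until a result or the end of the
    // stream arrives; returns false only when the stream is exhausted.
    bool load_result(void** out) {
        assert(out != nullptr);
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [this] { return closed_ || !items_.empty(); });
        if (items_.empty()) return false;
        *out = items_.front();
        items_.pop_front();
        return true;
    }
    // Asynchronous retrieval: never blocks; false means "nothing available
    // right now", so the caller is expected to retry later.
    bool try_load_result(void** out) {
        assert(out != nullptr);
        std::lock_guard<std::mutex> lk(m_);
        if (items_.empty()) return false;
        *out = items_.front();
        items_.pop_front();
        return true;
    }

private:
    std::mutex m_;
    std::condition_variable cv_;
    std::deque<void*> items_;
    bool closed_ = false;
};
```

Note that a “false” value means two different things in the two calls: the end of the results for the blocking call, and only a temporary lack of results for the non-blocking one.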
10 Skeleton nesting
In FastFlow skeletons may be arbitrarily nested. As the current version only supports farm and pipeline skeletons, this means that:
farms may be used as pipeline stages, and
pipelines may be used as farm workers.
There are no limitations to nesting, except the following one:
skeletons using the wrap_around facility (see also Sec. 11) cannot be used as parameters of other skeletons.
As an example, you can define a farm with pipeline workers as follows:
or we can use a farm as a pipeline stage by using code such as:
The concurrent activity graph in this case will be the following one:
while in the former case it will be such as
11 Feedback channels
In some cases, it is useful to be able to route some results back to the input stream of the streaming network. As an example, this allows implementing divide and conquer using farms. Tasks injected in the farm are split by the workers and the resulting subtasks are routed back to the input stream for further processing. Tasks that can be computed using the base case code are computed instead, and their results are used for the conquer phase, usually performed in memory.
All that is needed to implement the feedback channel is to invoke the
wrap_around method on the skeleton of interest. In case our
application uses a farm pattern as the outermost skeleton, we may
therefore add the method call after instantiating the farm object:
and this will lead to the concurrent activity graph
The same holds if parallelism is expressed by using a pipeline as the outermost skeleton:
leading to the concurrent activity graph:
As of FastFlow 1.1, the feedback channel
provided by the
wrap_around method can only be used on the outermost
skeleton, that is the one with no input stream. This is because at the
moment FastFlow does not support merging of input streams. In future
versions this constraint may be eliminated.
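The divide and conquer scheme enabled by the feedback channel can be illustrated sequentially. In the sketch below, which is an assumption-laden model rather than FastFlow code, a `std::deque` plays the role of the farm input stream, pushing a subtask back onto it plays the role of the feedback channel, and the conquer phase is a running sum kept in memory. `Range` and `sum_with_feedback` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <vector>

struct Range { std::size_t lo, hi; };  // half-open interval [lo, hi)

// Sums `data` by repeatedly splitting ranges larger than `base_size`.
long sum_with_feedback(const std::vector<long>& data, std::size_t base_size) {
    assert(base_size > 0);
    std::deque<Range> input;            // models the farm input stream
    input.push_back({0, data.size()});  // the initial task
    long total = 0;                     // conquer phase, performed in memory
    while (!input.empty()) {
        Range r = input.front();
        input.pop_front();
        if (r.hi - r.lo <= base_size) {
            // Base case: the "worker" computes the task directly.
            for (std::size_t i = r.lo; i < r.hi; ++i) total += data[i];
        } else {
            // Divide: the two halves are routed back to the input stream.
            std::size_t mid = r.lo + (r.hi - r.lo) / 2;
            input.push_back({r.lo, mid});
            input.push_back({mid, r.hi});
        }
    }
    return total;
}
```

In a real farm the loop body would run concurrently in the workers, so the accumulation into the shared total would need either per-worker partial sums or some form of synchronization.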
12 Introducing new skeletons
The current version of FastFlow (1.1) only supports stream parallel pipeline and farm skeletons. However, the skeletons themselves may be used/customized to serve as “implementation templates” (according to the terminology used in the algorithmic skeleton community) for different kinds of skeletons. The FastFlow distribution already includes sample applications where the farm with feedback is used to support divide&conquer applications. Here we want to discuss how a data parallel map skeleton may be used in FastFlow, exploiting the programmability of the farm skeleton emitter and collector.
12.1 Implementing a Map skeleton with a Farm “template”
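In a pure map pattern all the items in a collection are processed by means of a function $f$. If the collection was
$$x = \langle x_1, \ldots, x_m \rangle$$
then the computation
$$\mathit{map}\ f\ x$$
will produce as a result
$$\langle f(x_1), \ldots, f(x_m) \rangle$$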
In more elaborate map skeletons, the user is allowed to define a set of (possibly overlapping) partitions of the input collection, a function to be applied on each one of the partitions, and a strategy to rebuild the result of the map from the partial results computed on the partitions.
As an example, a matrix multiplication may be programmed as a map such that:
the input matrices $A$ and $B$ are considered as collections of rows and columns, respectively
a set of items $\langle A_{i,*}, B_{*,j} \rangle$ (the $i$-th row of $A$ and the $j$-th column of $B$) are used to build the set of partitions
an inner product is computed on each $\langle A_{i,*}, B_{*,j} \rangle$: this is actually $c_{i,j}$
the $C$ matrix ($C = A \times B$) is computed out of the different $c_{i,j}$.
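If we adopt this second, more general approach, a map may be built by implementing a set of concurrent activities such as:
the Split node creates the partitions $\langle A_{i,*}, B_{*,j} \rangle$ (row $i$ of $A$, column $j$ of $B$) and
delivers them to the workers, the workers compute each inner product $c_{i,j}$ and
deliver the results to the Compose node, and the Compose node
eventually rebuilds the full $C$ out of the $c_{i,j}$.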
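The three roles can be sketched as plain functions before mapping them onto a farm. The code below is a sequential illustration only: the `SubTask` and `PartResult` structures and the `split`, `compute` and `compose` functions are hypothetical stand-ins for the emitter, worker and collector nodes, and dense row-major matrices are assumed.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

struct SubTask { std::size_t i, j; };  // partition: row i of A, column j of B
struct PartResult { std::size_t i, j; double value; };  // one entry of C

// Split: one subtask per entry of the result matrix.
std::vector<SubTask> split(std::size_t rows, std::size_t cols) {
    std::vector<SubTask> tasks;
    for (std::size_t i = 0; i < rows; ++i)
        for (std::size_t j = 0; j < cols; ++j) tasks.push_back({i, j});
    return tasks;
}

// Worker: inner product of row i of A and column j of B.
PartResult compute(const Matrix& A, const Matrix& B, const SubTask& t) {
    double c = 0.0;
    for (std::size_t k = 0; k < B.size(); ++k) c += A[t.i][k] * B[k][t.j];
    return {t.i, t.j, c};
}

// Compose: store the partial result at the correct entry of C.
void compose(Matrix& C, const PartResult& r) { C[r.i][r.j] = r.value; }

// The whole map, run sequentially: split, compute each subtask, compose.
Matrix map_matmul(const Matrix& A, const Matrix& B) {
    assert(!A.empty() && !B.empty() && A[0].size() == B.size());
    Matrix C(A.size(), std::vector<double>(B[0].size(), 0.0));
    for (const SubTask& t : split(A.size(), B[0].size()))
        compose(C, compute(A, B, t));
    return C;
}
```

Since the subtasks are independent and each partial result goes to a distinct entry of the result matrix, the loop in `map_matmul` is exactly the part that a farm can distribute among its workers.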
We can therefore program the whole map as a FastFlow farm. After defining proper task, subtask and partial result data structures:
we define the emitter to be used in the farm as follows:
Basically, the first time the emitter is called, we generate
all the tasks relative to the different partitions $\langle A_{i,*}, B_{*,j} \rangle$. These tasks are directed to the workers, which will
compute the different $c_{i,j}$ and direct the partial results to the collector. The worker
ff_node will therefore be programmed as:
The collector will be defined in such a way that the different
partial results computed by the workers are eventually consolidated in
memory. Therefore each $c_{i,j}$ received is stored at the
correct entry of the $C$ matrix. The pointer to the result
matrix is in fact a field in the
TASK data structure
and $c_{i,j}$, $i$ and $j$ are fields of the
PART_RESULT data structure. The code for the
collector is therefore:
The tags here are used to deliver a result on the farm output stream
(i.e. the output stream of the collector) when exactly as many results
as there are entries in the result matrix, all relative to the same input task, have been received by the collector.
A MAXDIFF value is used, assuming that no more than
MAXDIFF different matrix multiplication tasks may be
circulating at the same time in the farm, due to the variable time spent
in the computation of the single $c_{i,j}$.
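The tag bookkeeping can be sketched as follows. This is a guess at one reasonable scheme, not the collector code of the original listing: the `TagCounter` class, the choice of indexing counters by the tag modulo `MAXDIFF`, and the value 16 are all assumptions made for the illustration.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

constexpr std::size_t MAXDIFF = 16;  // assumed bound on tasks in flight

// Hypothetical helper: counts the partial results received for each tag and
// reports when all the results of one input task have arrived.
class TagCounter {
public:
    explicit TagCounter(std::size_t results_per_task)
        : expected_(results_per_task) {
        assert(results_per_task > 0);
    }
    // Records one partial result carrying `tag`. Returns true when this was
    // the last missing result for that task; the slot is then recycled.
    bool record(std::size_t tag) {
        std::size_t slot = tag % MAXDIFF;  // valid while <= MAXDIFF tasks overlap
        if (++count_[slot] == expected_) {
            count_[slot] = 0;
            return true;
        }
        return false;
    }

private:
    std::size_t expected_;
    std::array<std::size_t, MAXDIFF> count_{};  // zero-initialized counters
};
```

If more than `MAXDIFF` tasks were in flight at once, two tasks would share a counter and results would be attributed to the wrong task, which is why the bound has to be an assumption on the workload.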
With these classes, our map may be programmed as follows:
It is worth pointing out that:
the kind of knowledge required of the application programmer to write the Split and Compose nodes is very application specific and not much related to the implementation of the map
this implementation of the map transforms a data parallel pattern into a stream parallel one. Some overhead is paid to move the data parallel sub-tasks along the streams used to implement the farm. This overhead may not be completely negligible
a much coarser grain implementation could have been designed assuming that the
Split node outputs tasks representing the computation of a whole row and modifying accordingly the
usually, the implementation of a map data parallel pattern generates as many subtasks as the number of available workers. In our implementation, we could have left this task to the
Split node, using the FastFlow primitive mechanisms to retrieve the number of workers actually allocated to the farm (this is the getnworkers method of the farm loadbalancer) and modifying accordingly both the
Also, the proposed implementation for the map may be easily
encapsulated in a proper
With this definition, the user could have defined the map (and added the map stage to a pipeline) using the following code:
13 Performance
Up to now we have only discussed how to use FastFlow to build parallel programs, either applications completely coded with skeletons, or FastFlow software accelerators. We want to briefly discuss here the typical performance improvements obtained through FastFlow.
In a skeleton application or in a software accelerator, using a FastFlow farm would in general lead to a performance increase proportional to the number of workers used (that is, to the parallelism degree of the farm). This holds unless:
we introduce serial code fragments (in this case the speedup will be limited according to Amdahl's law), or
we use more workers than the available tasks,
or the time spent delivering a task to the worker and retrieving the result from the worker is higher than the computation time of the task.
This means that if the time spent to compute $m$ tasks serially is $T_{seq}$, we can expect the time spent computing the same $m$ tasks with an $n_w$ worker farm will be more or less $T_{seq}/n_w$. It is worth pointing out here that the latency relative to the computation of the single task does not decrease w.r.t. the sequential case.
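The limit imposed by serial code fragments, mentioned in the list above, can be made explicit with the usual statement of Amdahl's law. The symbols $f$ (the fraction of the total work that is inherently serial) and $n_w$ (the number of farm workers) are introduced here only for the illustration.

```latex
\[
  \mathit{speedup}(n_w) \;=\; \frac{1}{f + \dfrac{1-f}{n_w}} \;\le\; \frac{1}{f}
\]
% Worked example: f = 0.1 and n_w = 8 give
\[
  \mathit{speedup}(8) \;=\; \frac{1}{0.1 + 0.9/8} \;=\; \frac{1}{0.2125} \;\approx\; 4.7
\]
```

So with 10% of serial code, eight workers give less than a fivefold speedup, and no number of workers can exceed a tenfold one.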
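In case a $k$ stage FastFlow pipeline is used to implement a parallel computation, we may expect the overall service time of the pipeline is
$$T_S = \max\{T_{S_1}, \ldots, T_{S_k}\}$$
where $T_{S_i}$ is the service time of the $i$-th stage.
As a consequence, the time spent computing $m$ tasks is approximately $m \times T_S$ and the relative speedup may be quantified as
$$\frac{m \times \sum_{i=1}^{k} T_{S_i}}{m \times \max\{T_{S_1}, \ldots, T_{S_k}\}} = \frac{\sum_{i=1}^{k} T_{S_i}}{\max\{T_{S_1}, \ldots, T_{S_k}\}}$$
In case of balanced stages, that is pipeline stages all taking the same time to compute a task, this speedup may be approximated as $k$, being $\sum_{i=1}^{k} T_{S_i} = k \times T_{S_1}$ and $\max\{T_{S_1}, \ldots, T_{S_k}\} = T_{S_1}$.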
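These estimates are easy to turn into a small calculator. The functions below are a sketch of the ideal model only: they ignore communication overhead and assume enough tasks to keep every worker and stage busy, and the function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <vector>

// Ideal completion time of a farm: serial time divided by the worker count.
double farm_time(double t_seq, unsigned n_workers) {
    assert(n_workers > 0);
    return t_seq / n_workers;
}

// Service time of a pipeline: the service time of its slowest stage.
double pipeline_service_time(const std::vector<double>& stage_times) {
    assert(!stage_times.empty());
    return *std::max_element(stage_times.begin(), stage_times.end());
}

// Ideal pipeline speedup: sum of the stage times over the slowest stage.
// The number of tasks cancels out, so it does not appear as a parameter.
double pipeline_speedup(const std::vector<double>& stage_times) {
    double serial = std::accumulate(stage_times.begin(), stage_times.end(), 0.0);
    return serial / pipeline_service_time(stage_times);
}
```

For instance, stage times of 1, 4 and 3 time units give a speedup of 2 rather than 3, because the second stage is the bottleneck. Four balanced stages give the full speedup of 4.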
14 Run time routines
Several utility routines are defined in FastFlow. We recall here the main ones.
virtual int get_my_id()
returns a virtual id of the node where the concurrent activity (its svc method) is being computed
const int ff_numCores()
returns the number of cores in the target architecture
int ff_mapThreadToCpu(int cpu_id, int priority_level=0)
pins the current thread to cpu_id. A priority may be set as well, but you need root rights in general, and therefore this should not be specified by normal users
void error(const char * str, ...)
is used to print error messages
virtual bool ff_send_out(void * task, unsigned int retry=((unsigned int)-1), unsigned int ticks=(TICKS2WAIT))
delivers an item onto the output stream, possibly retrying upon failure a given number of times, after waiting a given number of clock ticks.
returns the time spent in the computation of a farm or of a pipeline, including the
svc_end time. This is a method of both the pipeline and farm classes.
returns the time spent in the computation of a farm or of pipeline, in the
double ffTime(int tag)
is used to measure time in portions of code. The
void ffStats(std::ostream & out)
prints the statistics collected while using FastFlow. The program must be compiled with
-  M. Aldinucci. FastFlow home page, 2012. http://mc-fastflow.sourceforge.net.
-  Marco Aldinucci, Lorenzo Anardu, Marco Danelutto, Massimo Torquati, and Peter Kilpatrick. Parallel patterns + macro data flow for multi-core programming. In Proc. of Intl. Euromicro PDP 2012: Parallel Distributed and network-based Processing, Garching, Germany, February 2012. IEEE.
-  Marco Aldinucci, Andrea Bracciali, Pietro Liò, Anil Sorathiya, and Massimo Torquati. StochKit-FF: Efficient systems biology on multicore architectures. In Mario Rosario Guarracino, Frédéric Vivien, J. L. Träff, Mario Cannataro, Marco Danelutto, A. Hast, F. Perla, A. Knüpfer, Beniamino Di Martino, and M. Alexander, editors, Euro-Par 2010 Workshops, Proc. of the 1st Workshop on High Performance Bioinformatics and Biomedicine (HiBB), volume 6586 of LNCS, pages 167–175, Ischia, Italy, 2011. Springer.
-  Marco Aldinucci, Marco Danelutto, Peter Kilpatrick, Massimiliano Meneghin, and Massimo Torquati. Accelerating sequential programs using FastFlow and self-offloading. Technical Report TR-10-03, Università di Pisa, Dipartimento di Informatica, Italy, February 2010.
-  Marco Aldinucci, Marco Danelutto, Peter Kilpatrick, Massimiliano Meneghin, and Massimo Torquati. Accelerating code on multi-cores with fastflow. In E. Jeannot, R. Namyst, and J. Roman, editors, Proc. of 17th Intl. Euro-Par 2011 Parallel Processing, volume 6853 of LNCS, pages 170–181, Bordeaux, France, August 2011. Springer.
-  Marco Aldinucci, Marco Danelutto, Peter Kilpatrick, and Massimo Torquati. Fastflow: high-level and efficient streaming on multi-core. In Sabri Pllana and Fatos Xhafa, editors, Programming Multi-core and Many-core Computing Systems, Parallel and Distributed Computing, chapter 13. Wiley, 2012.
-  Marco Aldinucci, Marco Danelutto, Massimiliano Meneghin, Peter Kilpatrick, and Massimo Torquati. Efficient streaming applications on multi-core with FastFlow: the biosequence alignment test-bed. In Barbara Chapman, Frédéric Desprez, Gerhard R. Joubert, Alain Lichnewsky, Thierry Priol, and F. J. Peters, editors, Parallel Computing: From Multicores and GPU’s to Petascale (Proc. of PARCO 2009, Lyon, France), volume 19 of Advances in Parallel Computing, pages 273–280, Lyon, France, September 2009. IOS press.
-  Marco Aldinucci, Massimiliano Meneghin, and Massimo Torquati. Efficient smith-waterman on multi-core with fastflow. In Marco Danelutto, Tom Gross, and Julien Bourgeois, editors, Proc. of Intl. Euromicro PDP 2010: Parallel Distributed and network-based Processing, Pisa, Italy, February 2010. IEEE.
-  Marco Aldinucci, Salvatore Ruggieri, and Massimo Torquati. Porting Decision Tree Building and Pruning Algorithms to Multicore using FastFlow. Technical Report TR-11-06, Università di Pisa, Dipartimento di Informatica, Italy, March 2011.
-  Marco Aldinucci, Salvatore Ruggieri, and Massimo Torquati. Porting decision tree algorithms to multicore using FastFlow. In José L. Balcázar, Francesco Bonchi, Aristides Gionis, and Michèle Sebag, editors, Proc. of European Conference in Machine Learning and Knowledge Discovery in Databases (ECML PKDD), volume 6321 of LNCS, pages 7–23, Barcelona, Spain, September 2010. Springer.
-  Marco Aldinucci, Massimo Torquati, and Massimiliano Meneghin. FastFlow: Efficient parallel streaming applications on multi-core. Technical Report TR-09-12, Università di Pisa, Dipartimento di Informatica, Italy, September 2009.
-  Murray Cole. Bringing skeletons out of the closet: a pragmatic manifesto for skeletal parallel programming. Parallel Comput., 30(3):389–406, March 2004.
-  M. Danelutto, L. Deri, and D. De Sensi. Network Monitoring on Multicores with Algorithmic Skeletons, 2011. Proc. of Intl. Parallel Computing (PARCO).