ViPIOSVIenna Parallel Input Output System  Language, Compiler and Advanced Data Structure Support for Parallel I/O Operations   Project DeliverablePartially funded by FWF Grant P11006-MATCore Project Duration: 1996 - 1998Deliverable Revised: August 2018

# ViPIOS VIenna Parallel Input Output System   Language, Compiler and Advanced Data Structure Support for Parallel I/O Operations     Project Deliverable Partially funded by FWF Grant P11006-MAT Core Project Duration: 1996 - 1998 Deliverable Revised: August 2018

Erich Schikuta    Helmut Wanek    Heinz Stockinger    Kurt Stockinger    Thomas Fürle    Oliver Jorns    Christoph Löffelhardt    Peter Brezany    Minh Dang    Thomas Mück

] ]

## Executive Summary

For an increasing number of data intensive scientific applications, parallel I/O concepts are a major performance issue. Tackling this issue, we develop an input/output system designed for highly efficient, scalable and conveniently usable parallel I/O on distributed memory systems. The main focus of this research is the parallel I/O runtime system support provided for software-generated programs produced by parallelizing compilers in the context of High Performance FORTRAN efforts. Specifically, our design aims for the Vienna Fortran Compilation System.

In our research project we investigate the I/O problem from a runtime system support perspective. We focus on the design of an advanced parallel I/O support, called ViPIOS (VIenna Parallel I/O System), to be targeted by language compilers supporting the same programming model like High Performance Fortran (HPF). The ViPIOS design is partly influenced by the concepts of parallel database technology.

At the beginning of the project we developed a formal model, which forms a theoretical framework on which the ViPIOS design is based. This model describes the mapping of the problem specific data space starting from the application program data structures down to the physical layout on disk across several intermediate representation levels.

Based on this formal model we designed and developed an I/O runtime system, ViPIOS, which provides support for several issues, as

• optimization of data-layout on disks,

• redistribution of data stored on disks,

• communication of out-of-core (OOC) data, and

• many optimizations including data prefetching from disks based on the access pattern knowledge extracted from the program by the compiler or provided by a user specification.

The project was partially funded by the Austrian Science Fund by FWF Grant P11006-MAT in the period 05/1996 - 04/1998.

Vienna, December 1998111This report was revised in August 2018 by Erich Schikuta mainly polishing the format and adding some contributions achieved after the core period of the project.

## Chapter 1 A Short Project History

The project titled ”Language, Compiler and Advanced Data Structure Support for Parallel I/O Operations” which was later on also called ”Vienna Input Output System (ViPIOS)” - project) started in 1995. 1996 it was granted funding by the FWF for two years and its purpose was the design and implementation of a software system to enhance the parallel I/O capabilities of high performance computing programs written in HPF. This was achieved by programming an I/O server which can accomplish HPF I/O requests very efficiently using multiple disks in parallel. High modularity and portability have also been a major goal in order to allow for future changes and extensions to the system. In fact the system design also accounts for topics that have gained big importance in parallel computing during the duration of the project (i.e. distributed computing over the Internet and an MPI-IO interface).

Table 1.1 lists all the people that took part in the project with their respective responsibilities and the duration of their work. The papers published during the course of the project are given in table 1.2, which also contains a reference to a book that was partly based on this project. The references given point to the bibliography of the report with extended citation.

The following gives only a brieve overview of all the work that has been conducted as part of the project. More detailed information is to be found in the later chapters, which are given as references here.

• Theoretical work
The state of the art in parallel I/O has been investigated thoroughly (2.3). This in the end resulted in the glossary on parallel I/O given in appendix B.

The principle architecture and design of the ViPIOS system has been devised (see 3.2, 4.1, 4.2, and 5.1). The main design considerations besides high performance and effectiveness have been high modularity, portability and extensibility. Therefore ViPIOS internally is built on standards only (UNIX, C and MPI) and offers a variety of interfaces to the user of the system (see 4.3).

Some basic research has also been done in how to automatically optimize parallel I/O operations. This resulted in a formalization of ViPIOS files and its operations (see 4.4 and 4.5).

• Practical work
First a simple prototype has been built that supports parallel I/O for C programs on UNIX. It operates according to the basic strategies described in 5.1. The implementation was designed as a client server system using multithreading on both sides (client and server) to achieve maximum parallelism and thus I/O throughput. I/O operations had to be performed by calling internal ViPIOS functions directly (ViPIOS proprietary I/O interface).

An MPI restriction, which does not allow processes to start and stop dynamically (i.e. all processes that communicate via MPI have to be started and stopped concurrently) and some limitations to multitasking and multithreading on different hardware platforms forced the implementation of three operation modes in ViPIOS (see 5.2). In library mode no I/O server processes are started. ViPIOS only is a runtime library linked to the application. Dependent mode needs all the server and client processes to be started at the same time and independent mode allows client processes to dynamically connect and disconnect to the I/O server processes, which are executed independently. Each of this three modes comes in a threaded and in a non-threaded version. The non-threaded version only supporting blocking I/O functionality.

The system was extended by an HPF (see chapter 7) and an MPI-IO (see 6) interface, which allows users to keep to the standard interfaces they already know. Currently the HPF interface is only supported by the VFC HPF compiler, which automatically transfers the application program’s FORTRAN Read and Write statements into the appropriate funtion calls.

The program has been developed on SUN SOLARIS workstations and was ported to and tested on a cluster of 16 LINUX PCs. Details about the test results can be found in chapter 8. The current implementation of ViPIOS supports most parts of the MPI-IO standard and it is comparable to the reference MPI-IO implementation ROMIO (both in functionality and in performance). The advantages of the ViPIOS system are however greater flexibility (due to the client server approach) and the tight integration into an HPF compilation system. Flexibility means for instance that it is possible to read from a persistent file using a data distrubution scheme different than the one used when the file was written. This is not directly supported by ROMIO. The client server design also allows for automatic performance optimizations of I/O requests even in a multiuser environment (different applications executing concurrently, which pose I/O requests independently). This generally turns out to be very hard to achieve with a library approach (because of the communication necessary between different applications). Though there is no effective automatic performance optimization yet implemented in ViPIOS the module which will perform this task is already realized (it is called fragmenter; see 4.2). Currently it only applies basic data distribution schemes which parallel the data distribution used in the client applications. A future extension will use a blackboard algorithm to evaluate different distribution schemes and select the optimal one.

## Chapter 2 Supercomputing and I/O

In spite of the rapid evolvement of computer hardware the demand for even better performance seems never ending. This is especially true in scientific computing, where models tend to get more and more complex and the need for realistic simulations is ever increasing. Supercomputers and recently clusters of computers are used to achieve very high performance. The basic idea is to track the problem down into little parts which can be executed in parallel on a multitude of processors thus reducing the calculation time.

The use of supercomputers has become very common in various fields of science like for instance nuclear physics, quantum physics, astronomy, flow dynamics, meteorology and so on. These supercomputers consist of a moderate to large number of (eventually specifically designed) processors which are linked together by very high bandwidth interconnections. Since the design and production of the supercomputer takes a considerable time the hardware components (especially the processors) are already dated out when the supercomputer is delivered. This fact has led to the use of clusters of workstations (COW) or clusters of PC’s. Here off the shelf workstations or PC’s are interconnected by a high speed network (GB-LAN, ATM, etc.). Thus the most up to date generation of processors can be used easily. Furthermore these systems can be scaled and updated more easily than conventional supercomputers in the most cases. The Beowulf [84],[1] and the Myrinet [14],[7] projects for example show that COW’s can indeed nearly reach the performance of dedicated supercomputing hardware.

One of the most important problems with supercomputers and clusters is the fact that they’re far from easy to program. In order to achieve maximum performance the user has to know very many details about the target machine to tune her programs accordingly. This is even worse because the typical user is a specialist in her research field and only interested in the results of the calculation. Nevertheless she is forced to learn a lot about computer science and the specific machine especially. This led to the development of compilers which can perform the tedious parallelization tasks (semi)automatically. The user only has to write a sequential program which is transferred to a parallel one by the compiler. Examples for such parallel compilation systems are HPF [64], citewww:HPF and C* [43].

Finally many of the scientific applications also deal with a very large amount of data (up to 1 Terabytes and beyond). Unfortunately the development of secondary and tertiary storage does not parallel the increase in processor performance. So the gap between the speed of processors and the speed of peripherals like hard disks is ever increasing and the runtime of applications tends to become more dependent on the speed of I/O than on the processor performance. A solution to the problem seems to be the use of a number of disks and the parallelization of I/O operations. A number of I/O systems and libraries have been developed to accommodate for parallel I/O. (e.g. MPI-I/O [34], PASSION [30], GALLEY [69], VESTA [36], PPFS [42] and Panda [82]) But most of these also need a good deal of programming effort in order to be used efficiently. The main idea of the ViPIOS project was to develop a client server I/O system which can automatically perform near optimal parallel I/O. So the user simply writes a sequential program with the usual I/O statements. The compiler transfers this program into a parallel program and ViPIOS automatically serves the program’s I/O needs very efficiently.

The rest of this chapter deals with the automatic parallelization and the specific problems related to I/O in some more detail. A short summary of the current state of the art is also given. Chapters 3 and 4 explain in detail the design considerations and the overall structure of the ViPIOS System. Chapters 5 to 7 describe the current state of the system’s implementation and some benchmarking results are given in chapter 8.

### 2.1 Automatic Parallelization (HPF)

The efficient parallelization of programs generally turns out to be very complex. So a number of tools have been developed to aid programmers in this task (e.g. HPF-compilers, P3T [44]). The predominant programming paradigm today is the single program - multiple data (SPMD) approach. A normal sequential program is coded and a number of copies of this program are run in parallel on a number of processors. Each copy is thereby processing only a subset of the original input values. Input values and calculation results which have to be shared between several processes induce communication of these values between the appropriate processors. (Either in form of message passing or implicitly by using shared memory architectures.) Obviously the communication overhead is depending strongly on the underlying problem and on the partitioning of the input data set. Some problems allow for a very simple partitioning which induces no communication at all (e.g. cracking of DES codes) other problems hardly allow any partitioning because of a strong global influence of every data item (e.g. chaotic systems). Fortunately for a very large class of problems in scientific computing the SPMD approach can be used with a reasonable communication overhead.

### 2.2 I/O Bottleneck

In addition to that scientific programs tend to get more demanding with respect to I/O (some applications are working on Terabytes of input data and even more) and the performance of I/O devices does not increase as fast as computing power does. This led to the founding of the Scalable I/O Initiative [2] which tried to address and solve I/O problems for parallel applications. Quite a view projects directly or indirectly stem from this initiative and a number of different solutions and strategies have been devised which will be described shortly in the following section. A quite complete list of projects in the parallel I/O field as well as a comprehensive bibliography can be found in the WWW [3].

### 2.3 State of the Art

Some standard techniques have been developed to improve I/O performance of parallel applications. The most important are

• two phase I/O

• data sieving

• collective I/O

• disk directed I/O

• server directed I/O

These methods try to execute I/O in a manner that minimizes or strongly reduces the effects of disk latency by avoiding non contiguous disk accesses and thereby speeding up the I/O process. More details and even some more techniques can be found in appendix B.

In the last years many universities and research teams from different parts of the world have used and enhanced these basic techniques to produce software and design proposals to overcome the I/O bottleneck problem. Basically, three different types of approaches can be distinguished:

• Runtime I/O libraries are highly merged with the language system by providing a call library for efficient parallel disk accesses. The aim is that it adapts graciously to the requirements of the problem characteristics specified in the application program. Typical representatives are PASSION [92], Galley [69], or the MPI-IO initiative, which proposes a parallel file interface for the Message Passing Interface (MPI) standard [67, 33]. Recently the MPI-I/O standard has been widely accepted as a programmers interface to parallel I/O. A portable implementation of this standard is the ROMIO library [94].

Runtime libraries aim for to be tools for the application programmer. Therefore the executing application can hardly react dynamically to changing system situations (e.g. number of available disks or processors) or problem characteristics (e.g. data reorganization), because the data access decisions were made during the programming and not during the execution phase.

Another point which has to be taken into account is the often arising problem that the CPU of a node has to accomplish both the application processing and the I/O requests of the application. Due to a missing dedicated I/O server the application, linked with the runtime library, has to perform the I/O requests as well. It is often very difficult for the programmer to exploit the inherent pipelined parallelism between pure processing and disk accesses by interleaving them.

All these problems can be limiting factors for the I/O bandwidth. Thus optimal performance is nearly impossible to reach by the usage of runtime libraries.

• File systems are a solution at a quite low level, i.e. the operating system is enhanced by special features that deal directly with I/O. All important manufacturers of parallel high-performance computer systems provide parallel disk access via a (mostly proprietary) parallel file system interface. They try to balance the parallel processing capabilities of their processor architectures with the I/O capabilities of a parallel I/O subsystem. The approach followed in these subsystems is to decluster the files among a number of disks, which means that the blocks of each file are distributed across distinct I/O nodes. This approach can be found in the file systems of many super-computer vendors, as in Intels CFS (Concurrent File System) [70], Thinking Machines’ Scalable File System (sfs) [63], nCUBEs Parallel I/O System [39] or IBM Vesta [36].

In comparison to runtime libraries parallel file systems have the advantage that they execute independently from the application. This makes them capable to provide dynamic adaptability to the application. Further the notion of dedicated I/O servers (I/O nodes) is directly supported and the processing node can concentrate on the application program and is not burdened by the I/O requests.

However due to their proprietary status parallel file systems do not support the capabilities (expressive power) of the available high performance languages directly. They provide only limited disk access functionality to the application. In most cases the application programmer is confronted with a black box subsystem. Many systems even disallow the programmer to coordinate the disk accesses according to the distribution profile of the problem specification. Thus it is hard or even impossible to achieve an optimal mapping of the logical problem distribution to the physical data layout, which prohibits an optimized disk access profile.

Therefore parallel file systems also can not be considered as a final solution to the disk I/O bottleneck of parallelized application programs.

• Client server systems give a combination of the other two approaches, which is a dedicated, smart, concurrent executing runtime system, gathering all available information of the application process both during the compilation process and the runtime execution. Thus, this system is able to aim for the static and the dynamic fit properties 111static fit: Data is distributed across available disks according to the SPMD data distribution (i.e. the chunk of data which is processed by a single processor is stored contiguously on a disk; a different processor’s data is stored on different disks depending on the number of disks available). dynamic fit: Data is redistributed dynamically according to changes of system characteristics or data access profiles during the runtime of the program. (i.e. a disk running out of space, too many different applications using the same disk concurrently and so on. (See appendix B for further information.). Initially it can provide the optimal fitting data access profile for the application (static fit) and may then react to the execution behavior dynamically (dynamic fit), allowing to reach optimal performance by aiming for maximum I/O bandwidth.

The PANDA [81, 82] and the ViPIOS system are examples for client server systems. (Note that PANDA is actually called a library by its designers. But since it offers independently running I/O processes and enables dynamic optimization of I/O operations during run time we think of it as a client server system according to our classification)

Additionally to the above three categories there are many other proposals that do not fit exactly into any of the stated schemes. There are many experimenting test beds and simulation software products that can as well be used to classify existing file systems and I/O libraries [1]. Those test beds are especially useful to compare performance and usability of systems.

But while most of these systems supply various possibilities to perform efficient I/O they still leave the application programmer responsible for the optimization of I/O operations (i. e. the programmer has to code the calls to the respective system’s functions by hand). Little work has yet been done, to automatically generate the appropriate function calls by the compiler (though there are some extensions planned to PASSION). And as far as we know only the Panda library uses some algorithms to automatically control and optimize the I/O patterns of a given application.

This is where the ViPIOS project comes in, which implements a database like approach. The programmer only has to specify what she wants to read or write (for example by using a simple FORTRAN read or write statement) not how it should be done. The ViPIOS system is able to decide about data layout strategies and the I/O execution plan based on information generated at compile time and/or collected while the run time of the application.

For more information on all the systems mentioned above see appendix B, which also lists additional systems not referenced in this chapter.

## Chapter 3 The ViPIOS Approach

ViPIOS is a distributed I/O server providing fast disk access for high performance applications. It is an I/O runtime system, which provides efficient access to persistent files by optimizing the data layout on the disks and allowing parallel read/write operations. The client-server paradigm allows clients to issue simple and familiar I/O calls (e.g. ’read(..)’), which are to be processed in an efficient way by the server. The application programmer is relieved from I/O optimization tasks, which are performed automatically by the server. The actual file layout on disks is solely maintained by the servers which use their knowledge about system characteristics (number and topology of compute nodes, I/O nodes and disks available; size and data transfer rates of disks; etc.) to satisfy the client’s I/O requests efficiently.

In order to optimize the file layout on disk ViPIOS uses information about expected file access patterns which can be supplied by HPF compilation systems. Since ViPIOS-servers are distributed on the available processors, disk accesses are effectively parallel. The client-server concept of ViPIOS also allows for future extensions like checkpointing, transactions, persistent objects and also support for distributed computing using the Internet.

ViPIOS is primarily targeted (but not restricted) to networks of workstations using the SPMD paradigm. Client processes are assumed to be loosely synchronous.

### 3.1 Design Goals

The design of ViPIOS followed a data engineering approach, characterized by the following goals.

1. Scalability. Guarantees that the size of the used I/O system, i.e. the number of I/O nodes currently used to solve a particular problem, is defined by or correlated with the problem size. Furthermore it should be possible to change the number of I/O nodes dynamically corresponding to the problem solution process. This requires the feature to redistribute the data among the changed set of participating nodes at runtime. The system architecture (section 4.1) of VIPIOS is highly distributed and decentralized. This leads to the advantage that the provided I/O bandwidth of ViPIOS is mainly dependent on the available I/O nodes of the underlying architecture only.

2. Efficiency. The aim of compile time and runtime optimization is to minimize the number of disk accesses for file I/O. This is achieved by a suitable data organization (section 4.4) by providing a transparent view of the stored data on disk to the ’outside world’ and by organizing the data layout on disks respective to the static application problem description and the dynamic runtime requirements.

3. Parallelism. This demands coordinated parallel data accesses of processes to multiple disks. To avoid unnecessary communication and synchronization overhead the physical data distribution has to reflect the problem distribution of the SPMD processes. This guarantees that each processor accesses mainly the data of its local or best suited disk. All file data and meta-data (description of files) are stored in a distributed and parallel form across multiple I/O devices. In order to find suitable data distributions to achieve maximum parallelism (and thus very high I/O bandwidth) ViPIOS may use information supplied by the compilation system or the application programmer. This information is passed to ViPIOS via hints (see appendix 3.2.2). If no hints are available ViPIOS uses some general heuristics to find an initial distribution and then dynamically can adopt to the application’s I/O needs during runtime.

4. Usability. The application programmer must be able to use the system without big efforts. So she does not have to deal with details of the underlying hardware in order to achieve good performance and familiar Interfaces (section 4.3) are available to program file I/O.

5. Portability. The system is portable across multiple hardware platforms. This also increases the usability and therefore the acceptance of the system.

### 3.2 Basic Strategies

Naturally ViPIOS supports the standard techniques (i. e. two phase access, data sieving and collective operations), which have been adapted to the specific needs of ViPIOS. In order to meet the design goals described above a number of additional basic strategies have been devised and then implemented in ViPIOS.

#### 3.2.1 Database like Design

As with database systems the actual disk access operations are decoupled from the application and performed by an independent I/O subsystem. This leads to the situation that an application just sends disk requests to ViPIOS only, which performs the actual disk accesses in turn (see figure 3.1).

The advantages of this method are twofold:

1. The application programmer is relieved from the responsibility to program and optimize all the actual disk access operations. She may therefore concentrate on the optimization of the application itself relying on the I/O system to perform the requested operations efficiently.

2. The application programmer can use the same program on all the platforms supported by the I/O system without having to change any I/O related parts of the application. The I/O system may use all the features of the underlying hardware to achieve the highest performance possible but all these features are well hidden from the programmer. Operations which are not directly supported by the respective hardware have to be emulated by the I/O system.

So the programmer just has to specify what data she wants to be input or output, not how that shall actually be performed (i. e. which data item has to be placed on which disk, the order in which data items are processed and so on). This is similar to a database approach, where a simple SQL statement for example produces the requested data set without having to specify any details about how the data is organized on disk and which access path (or query execution plan) should be used.

But the given similarities between a database system and a parallel I/O system also raise an important issue. For database systems an administrator is needed who has to define all the necessary tables and indexes needed to handle all the requests that any user may pose. As anyone knows who has already designed a database this job is far from easy. Special design strategies have been devised to ensure data integrity and fast access to the data. In the end the designer has to decide about the database layout based on the requests that the users of the database are expected to issue.

Now who shall decide which data layout (see appendix B) strategy shall be used. Evidently it must not be the application programmer but actually the compiler can do an excellent job here. Remember that the programmer only codes a sequential program which is transformed by the compiler into a number of processes each of which has only to process a part of the sequential program’s input data. Therefore the compiler exactly knows which process will need which data items. Additionally it also knows very much about the I/O profile of the application (i. e. the order in which data items will be requested to be input or output). All this information is passed to the I/O server system, which uses it to find a (near) optimal data layout. Theoretically this can be done automatically because the I/O profile is completely defined by the given application. In practice a very large number of possible data layout schemes has to be considered. (One reason for this is the considerable number of conditional branches in a typical application. Every branch can process different data items in a different order thus resulting in a change of the optimal data layout to be chosen. Though not every branch operation really affects the I/O behavior of the application the number of possible layouts to consider still remains very large.)

Different techniques especially designed for searching huge problem spaces may be used to overcome this problem (e. g. genetic algorithms, simulated annealing or blackboard methods). These can find a good data layout scheme in a reasonable time. (Note that it is of no use to find an optimal solution if the search took longer than the actual execution of the program.)

#### 3.2.2 Use of Hints

Hints are the general tool to support ViPIOS with information for the data administration process. Hints are data and problem specific information from the ”out-side- world” provided to ViPIOS.

Basically three types of hints can be differntiated, file administration, data prefetching, and ViPIOS administration hints.

The file administration hints provide information of the problem specific data distribution of the application processes (e.g. SPMD data distribution). High parallelization can be reached, if the problem specific data distribution of the application processes matches the physical data layout on disk.

Data prefetching hints yield better performance by pipelined parallelism (e.g. advance reads, delayed writes) and file alignment.

The ViPIOS administration hints allow the configuration of ViPIOS according to the problem situation respective to the underlying hardware characteristics and their specific I/O needs (I/O nodes, disks, disk types, etc.)

Hints can be given by the compile time system, the ViPIOS system administrator (who is responsible for starting and stopping the server processes, assigning the available disks to the respective server processes, etc.) or the application programmer. Normally the programmer should not have to give any hints but in special cases additional hints may help ViPIOS to find a suiting data layout strategy. This again parallels database systems where the user may instruct the system to use specific keys and thus can influence the query execution plan created by the database system. However the technology of relational databases is so advanced nowadays that the automatically generated execution plan can only very rarely be enhanced by a user specifying a special key to be used. Generally the optimal keys are used automatically. We are quite confident that a comparable status can be reached for parallel I/O too. But much research work still has to be done to get there.

Finally hints can be static or dynamic. Static hints are hints that give information that is constant for the whole application run (e.g. number of application processes, number of available disks and so on). Dynamic hints inform ViPIOS of a special condition that has been reached in the application execution (e.g. a specific branch of a conditional statement has been entered requiring to prefetch some data, a harddisk has failed). While static hints may be presented to ViPIOS at any time (i.e. compile time, application startup and application runtime) dynamic hints only may be given at runtime and are always sent by the application processes. To generate dynamic hints the compiler inserts additional statements in the appropriate places of the application code. These statements send the hint information to the ViPIOS system when executed.

The management of data by the ViPIOS servers is split into two distinct phases, the preparation and the administration phase (see Figure 3.2).

The preparation phase precedes the execution of the application processes (mostly during compilation and startup time). This phase uses the information collected during the application program compilation process in form of hints from the compiler. Based on this problem specific knowledge the physical data layout schemes are defined and the disks best suited to actually store the data are chosen. Further the data storage areas are prepared, the necessary main memory buffers allocated, etc.

The following administration phase accomplishes the I/O requests of the application processes during their execution, i.e. the physical read/write operations and eventually performs necessary reorganization of the data layout.

The two-phase data administration method aims for putting all the data layout decisions, and data distribution operations into the preparation phase, in advance to the actual application execution. Thus the administration phase performs the data accesses and possible data prefetching only.

## Chapter 4 The ViPIOS Design

The system design has mainly been driven by the goals described in chapter 3.1 and it is therefore built on the following principles:

• Minimum Overhead. The overhead imposed by the ViPIOS system (e.g. the time needed to calculate a suitable distribution of data among the available disks and so on) has to be kept as small as possible. As a rule of thumb an I/O operation using the ViPIOS system must never take noticeable longer than it would take without the use of ViPIOS even if the operation can not be speed up by using multiple disks in parallel.

• Maximum Parallelism. The available disks have to be used in a manner to achieve maximum overall I/O throughput. Note that it is not sufficient to just parallelize any single I/O operation because different I/O operations can very strongly affect each other. This holds true whether the I/O operations have to be executed concurrently (multiple applications using the ViPIOS system at the same time) or successively (single application issuing successive I/O requests). In general the search for a data layout on disks allowing maximum throughput can be vary time consuming. This is in contradiction with our ’minimum overhead’ principle. So in praxis the ViPIOS system only strives for a very high througput not for the optimal one. There is no point in calculating the optimal data layout if that calculation takes longer than the I/O operations would take without using ViPIOS.

• Use of widely accepted standards. ViPIOS uses standards itself (e.g. MPI for the communication between clients and servers) and also offers standard interfaces to the user (for instance application programmers may use MPI-I/O or UNIX file I/O in their programs), which strongly enhances the systems portability and ease of use.

• High Modularity. This enables the ViPIOS system to be quickly adopted to new and changing standards or to new hardware environments by just changing or adding the corresponding software module.

Some extensions to support for future developments in high performance computing also have been considered like for instance distributed (Internet) computing and agent technology.

### 4.1 Overall System Architecture

The ViPIOS system architecture is built upon a set of cooperating server processes, which accomplish the requests of the application client processes. Each application process AP is linked to the ViPIOS servers VS by the ViPIOS interface VI, which is a small library that implements the I/O interface to the application and performs all the communication with the ViPIOS servers (see figure 4.1).

The server processes run independently on all or a number of dedicated processing nodes on the underlying cluster or MPP. It is also possible that an application client and a server share the same processor. Generally each application process is assigned to exactly one ViPIOS server, which is called the buddy to this application. All other server processes are called foes to the respective application. A ViPIOS server can serve any number of application processes. Hence there is a one-to-many relationship between servers and the application. (E. g. the ViPIOS server numbered 2 in figure 4.1 is a buddy to the application processes 2 and 3, but a foe to application process 1.)

While each application process is assigned exactly one ViPIOS server, a ViPIOS server can serve a number of application processes, i.e. there exists one-to-many relationship between the servers and the application.

Figure 4.1 also depicts the two phase data administration described in chapter 3.2.3:

* The preparation phase precedes the execution of the application processes (i.e. compile time and application startup time).

* The following administration phase accomplishes the I/O requests posed during the runtime of the application processes by executing the appropriate physical read/write operations.

To achieve high data access performance ViPIOS follows the principle of data locality. This means that the data requested by an application process should be read/written from/to the best-suited disk.

Logical and physical data locality are to be distinguished.

Logical data locality denotes to choose the best suited ViPIOS server as a buddy server for an application process. This server can be defined by the topological distance and/or the process characteristics.

Physical data locality aims to define the best available (set of) disk(s) for the respective server (which is called the best disk list, BDL), i.e. the disks providing the best (mostly the fastest) data access. The choice is done on the specific disk characteristics, as access time, size, toplogical position in the network, and so on.

### 4.2 Modules

As shown in figure 4.1 the ViPIOS system consists of the independently running ViPIOS servers and the ViPIOS interfaces, which are linked to the application processes. Servers and interfaces themselves are built of several modules, as can be seen in figure 4.2.

The ViPIOS Interface library is linked to the application and provides the connection to the ”outside world” (i.e. applications, programmers, compilers, etc.). Different programming interfaces are supported by interface modules to allow flexibility and extendability. Currently implemnted are an HPF interface module (aiming for the VFC, the HPF derivative of Vienna FORTRAN [29]) a (basic) MPI-IO interface module, and the specific ViPIOS interface which is also the interface for the specialized modules. Thus a client application can execute I/O operations by calling HPF read/write statements, MPI-IO routines or the ViPIOS proprietary functions.

The interface library translates all these calls into calls to ViPIOS functions (if necessary) and then uses the interface message manager layer to send the calls to the buddy server. The message manager also is responsible for sending/receiving data and additional informations (like for instance the number of bytes read/written and so on) to/from the server processes. Note that data and additional information can be sent/received directly to/from any server process bypassing the buddy server, thereby saving many additional messages that would be necessary otherwise and enforcing the minimum overhead principle as stated in chapter 4. (See chapter 5.1 for more details.) The message manager uses MPI-function calls to communicate to the server processes.

The ViPIOS server process basically contains 3 layers:

• The Interface layer consists of a message manager responsible for the communication with the applications and the compiler (external messages) as well as with other servers (internal messages). All messages are translated to calls to the appropriate ViPIOS functions in the proprietary interface.

• The Kernel layer is responsible for all server specific tasks. It is built up mainly of three cooperating functional units:

• The Fragmenter can be seen as ”ViPIOS’s brain”. It represents a smart data administration tool, which models different distribution strategies and makes decisions on the effective data layout, administration, and ViPIOS actions.

• The Directory Manager stores the meta information of the data. Three different modes of operation have been designed, centralized (one dedicated ViPIOS directory server), replicated (all servers store the whole directory information), and localized (each server knows the directory information of the data it is storing only) management. Until now only localized management is implemented. This is sufficient for clusters of workstations. To support for distributed computing via the internet however the other modes are essential (see 5.1).

• The Memory Manager is responsible for prefetching, caching and buffer management.

• The Disk Manager layer provides the access to the available and supported disk sub-systems. Also this layer is modularized to allow extensibility and to simplify the porting of the system. Available are modules for ADIO [93], MPI-IO, and Unix style file systems.

### 4.3 Interfaces

To achieve high portability and usability the implementation internally uses widely spread standards (MPI, PVM, UNIX file I/O, etc.) and offers multiple modules to support an application programmer with a variety of existing I/O interfaces. In addition to that ViPIOS offers an interface to HPF compilers and also can use different underlying file systems. Currently the following interfaces are implemented:

• User Interfaces
Programmers may express their I/O needs by using

• MPI-IO (see chapter 6.)

• HPF I/O calls (see chapter 7.)

• ViPIOS proprietary calls (not recommended though because the programmer has to learn a completely new I/O interface. See appendix A for a list of available functions.)

• Compiler Interfaces
Currently ViPIOS only supports an interface to the VFC HPF compiler (see chapter 7).

• Interfaces to File Systems
The filesystems that can be used by a ViPIOS server to perform the physical acceses to disks enclose

• ADIO (see [93]; this has been chosen because it also allows to adapt for future file systems and so enhances the portability of ViPIOS.)

• MPI-IO (is already implemented on a number of MPP’s.)

• Unix file I/O (available on any Unix system an thus on every cluster of workstations.)

• Unix raw I/O (also available on any Unix system, offers faster access but needs more administrational effort than file I/O. Is not completely implemnted yet.)

• Internal Interface
Is used for the communication between different ViPIOS server processes. Currently only MPI is used to pass messages. Future extensions with respect to distributed computing will also allow for communication via HTTP.

### 4.4 Data Abstraction

ViPIOS provides a data independent view of the stored data to the application processes.

Three independent layers in the ViPIOS architecture can be distinguished, which are represented by file pointer types in ViPIOS.

• Problem layer. Defines the problem specific data distribution among the cooperating parallel processes (View file pointer).

• File layer. Provides a composed view of the persistently stored data in the system (Global file pointer).

• Data layer. Defines the physical data distribution among the available disks (Local file pointer).

Thus data independence in ViPIOS separates these layers conceptually from each other, providing mapping functions between these layers. This allows logical data independence between the problem and the file layer, and physical data independence between the file and data layer analogous to the notation in data base systems ([54, 25]). This concept is depicted in figure 4.3 showing a cyclic data distribution.

In ViPIOS emphasis is laid on the parallel execution of disk accesses. In the following the supported disk access types are presented.

According to the SPMD programming paradigms parallelism is expressed by the data distribution scheme of the HPF language in the application program. Basically ViPIOS has therefore to direct the application process’s data access requests to independent ViPIOS servers only to provide parallel disk accesses. However a single SPMD process is performing its accesses sequentially sending its requests to just one server. Depending on the location of the requested data on the disks in the ViPIOS system two access types can be differentiated (see figure 4.4),

• Local data access,

• Remote data access

###### Local Data Access.

The buddy server can resolve the applications requests on its own disks (the disks of its best disk list). This is also called buddy access.

###### Remote Data Access.

The buddy server can not resolve the request on its disks and has to broadcast the request to the other ViPIOS servers to find the owner of the data. The respective server (foe server) accesses the requested data and sends it directly to the application via the network. This is also called foe access.

Based on these access types 3 three disk access modes can be distinguished, which are called

• Sequential,

• Parallel, and

• Coordinated mode.

###### Sequential Mode.

The sequential mode of operation allows a single application process to send a sequential read/write operation, which is processed by a single VIPIOS server in sequential manner. The read/write operation consists commonly of processing a number of data blocks, which are placed on one or a number of disks administrated by the server itself (disks belonging to the best-disk-list of the server).

###### Parallel Mode.

In the parallel mode the application process requests a single read/write operation. ViPIOS processes the sequential process in parallel by splitting the operation in independent sub-operations and distributing them onto available ViPIOS server processes.

This can be either the access of contiguous memory areas (sub-files) by independent servers in parallel or the distribution of a file onto a number of disks administrated by the server itself and/or other servers.

###### Coordinated Mode.

The coordinated mode is directly deferred from the SPMD approach by the support of collective operations. A read/write operation is requested by a number of application processes collectively. In fact each application process is requesting a single sub-operation of the original collective operation. These sub-operations are processed by ViPIOS servers sequentially, which in turn results in a parallel execution mode automatically.

The 3 modes are shown in figure 4.5.

### 4.5 Abstract File Model

In order to be able to calculate an optimal data layout on disk a formal model to estimate the expected costs for different layouts is needed. This chapter presents an abstract file model which can be used as a basis for such a cost model. The formal model for sequential files and their access operations presented here is partly based on the works in [73] and [9].

It is also shown how the mapping functions defined in this model, which provide logical and physical data abstraction as depicted in figure 4.3 are actually implemented in the ViPIOS system.

• Record
We define a record as a piece of information in binary representation. The only property of a record which is relevant to us at the moment is its size in bytes. This is due to the fact that ViPIOS is only concerned with the efficient storage and retrieval of data but not with the interpretation of its meaning.

Let be the set of all possible records. We then define

where denotes the length of the record in bytes. In the following the record with size zero is referenced by the symbol . Further is the set of all records with size :

• File
A non empty file consists of a sequence of records which are all of the same size and different from .

With denoting the set of all possible files we define the functions

which yield the length of (i. e. the number of records in) a file and specific records of the file respectively. For a file and

and

 frec(f,i)={reciifi≤flen(f),′nil′otherwise.

An empty file is denoted by an empty sequence:

• Data buffer
The set of data buffers is defined by

The functions

give the number of tuple-elements (i. e. records) contained in a data buffer ,   its size in bytes and specific records respectively. Thus if and then

 drec(d,i)={reciifi≤dlen(d),′nil′otherwise.

Of special interest are data buffers which only contain equally sized records. These are denoted by

and their size may be computed easily by .

• Access modes
The set of access modes is given by:

• Mapping functions
Let and . A mapping function

is defined by

So for example is the file which contains the records 2, 4, 2 and 6 of the file f in that order. 111Note that does not have to be a permutation. So one record of may be replicated on different positions in .

The set of mapping functions is denoted by ,   and is the mapping function for which f is a fixpoint.

With denoting the empty tuple the function for every file .

• File handle
The set of file handles is defined by:

where is the power set of .

To access the information stored in a file handle the following functions are defined:

if , with ,     and then

• File operations
Only one file operation (OPEN) directly operates on a file. All other operations relate to the file using the file handle returned by the OPEN operation. In the following is the file operated on, is a file handle, is a set of access modes, is a data buffer and is a mapping function. The symbol denotes that the operation cannot be performed because of the state of parameters. In this case the operation does not change any of its parameters and just reports the error state. The operations can be formally described as follows:

• OPEN   is equivalent to: 222Note that we do not address security aspects in this model. Therefore users are not restricted in accessing files and the OPEN operation will always succeed.

• CLOSE   is equivalent to:

Thus every file operation on succeeding CLOSE will fail).

• SEEK   is equivalent to:
()

 ⎧⎨⎩fh←(f,m,n,ψ)ifflen(ψ(f))≥n,′error′otherwise.
• READ   is equivalent to: 333Note that the initial content of the data buffer is of no interest. Just its total size is relevant. This is different to the write operation where the records in the data buffer have to be compatible with the file written to. The condition assures that we do not read beyond the end of the file and that the data buffer is big enough to accommodate for the data read.
()

 ⎧⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪⎨⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪⎩d←(frec(ψ(f),p+1),frec(ψ(f),p+2),……,frec(ψ(f),p+i))if′read′∈mode(fh)∧fh←(f,m,p+i,ψ)∧i=min(n,⌊dsize(d)size(frec(f,1))⌋,flen(ψ(f))−p)>0,′error′otherwise.
• WRITE   is equivalent to: 444Since files are defined to contain only records which all have the same size, the data buffer has to hold appropriate records. The WRITE operation as defined here may be used to append new records to a file as well as to overwrite records in a file. The length of the file will only increase by the number of records actually appended.
()

• INSERT  is equivalent to: 555If successful the INSERT operation will always increase the file size by . INSERT is equivalent to WRITE iff .
()

#### 4.5.1 Implementation of a mapping function description

ViPIOS has to keep all the appropriate mapping functions as part of the file information of the file. So a data structure is needed to internally represent such mapping functions. This structure should fulfill the following two requirements:

• Regular patterns should be represented by a small data structure.

• The data structure should allow for irregular patterns too.

Of course these requirements are contradictionary and so a comprimise actually was implemented in ViPIOS. The structure which will now be described allows the description of regular access patterns with little overhead yet also is suitable for irregular access patterns. Note however that the overhead for completely irregular access patterns may become considerably large. But this is not a problem since ViPIOS currently mainly targets regular access patterns and optimizations for irregular ones can be made in the future.

Figure 4.6 gives a C declaration for the data structure representing a mapping function.

An Access_Desc basically describes a number (no_blocks) of independent basic_blocks where every basic_block is the description of a regular access pattern. The skip entry gives the number of bytes by which the file pointer is incremented after all the blocks have been read/written.

The pattern described by the basic_block is as follows: If subtype is NULL then we have to read/write single bytes otherwise every read/write operation transfers a complete data structure described by the Access_Desc block to which subtype actually points. The offset field increments the file pointer by the specified number of bytes before the regular pattern starts. Then repeatedly count subtypes (bytes or structures) are read/written and the file pointer is incremented by stride bytes after each read/write operation. The number of repetitions performed is given in the repeat field of the basic_block structure.

## Chapter 5 ViPIOS Kernel

This chapter describes the actual implementation of the ViPIOS Kernel. It shows the internal working of the ViPIOS processes and discusses the realization of the different operation modes, which enable the port of ViPIOS to various hardware platforms.

### 5.1 The Message Passing System

In order to show how a client request actually is resolved by the ViPIOS server processes some necessary notation is defined first and then the flow of control and messages for some basic requests (like OPEN, READ and WRITE) is described.

#### 5.1.1 Notation

In the following some abbreviations are used to denote the various components of ViPIOS.

• AP: for an application process (ViPIOS-client) which is in fact an instance of the application running on one of the compute nodes

• VI: for the application process interface to ViPIOS (ViPIOS-Interface)

• VS: for any ViPIOS server process

• BUDDY: for the buddy server of an AP (i.e. the server process assigned to the specific AP. See chapter 4.1 for more details.)

• FOE: for a server, which is foe to an AP (i.e. which is not the BUDDY for the specific AP. See chapter 4.1 for more details.)

For system administration and initialization purposes ViPIOS offers some special services which are not needed for file I/O operations. These services include:

• system services: system start and shutdown, preparation phase routines (input of hardware topology, best disk lists, knowledge base)

• connection services: connect and disconnect an AP to ViPIOS.

Since these services are relatively rarely used, not every ViPIOS server process needs to provide them. A ViPIOS server process, which offers system (connection) services is called a system (connection) controller, abbreviated SC (CC). Depending on the number of controllers offering a specific service three different controller operation modes can be distinguished.

• centralized mode: There exists exactly one controller in the whole system for this specific service.

• distributed mode: Some but not all ViPIOS-servers in the system are controllers for the specific service.

• localized mode: Every ViPIOS server is a controller for the specific service.

Note that in every ViPIOS configuration at least one system controller and one connection controller must exist. The rest of this chapter restricts itself to system and connection controllers in centralized mode, which are the only ones actually implemented so far. This means that the terms SC and CC denote one specific ViPIOS server process respectively. However no assumptions are made whether SC and CC are different processes or actually denote the same ViPIOS server process. (For distributed computing via the Internet the other modes for SC and CC could however offer big advantages and will therefore also be implemented in later versions of ViPIOS.)

An additional service, which is vital for the operation of a ViPIOS system is the directory service. It is responsible for the administration of file information (i.e. which part of a file is stored on which disk and where are specific data items to be read/written). Currently only the localized mode has been realized, which means that every server process only holds the information for those parts of the files, which are stored on the disks administered by that process. Thus each ViPIOS server process currently also is a directory controller (DC). The directory service differs from the other services offered by ViPIOS in that it is hidden from the application processes. So only the server processes can inquire where specific data items can be found. There is no way for the application process (and thus for the programmer) to find out which disk holds which data. (For administration purposes however the system services offer an indirect way to access directory services. An administrator may inspect and even alter the file layout on disk.)

##### Files and Handles

Applications which use ViPIOS can read and write files by using ordinary UNIX like functions. The physical files on disks are however automatically distributed among the available disks by the server processes. This scattering of files is transparent to the client application and programmers can therefore apply the well known common file paradigms of the interface they are using to access ViPIOS (UNIX style, HPF or MPI-IO calls).

The application uses file handles to identify specific files. These handles are generated by the VI which also administers all the related informations like position of file pointer, status of I/O operation and so on. This allows for a very efficient implementation of the Vipios_IOState function and also reduces the administration overhead compared to a system where filehandles are managed by VSs (as will be shown later).

##### Basic ViPIOS file access functions

The AP can use the following operations to access ViPIOS files.

• Vipios_Open(Filename, Access mode)
Opens the file named ’Filename’. Access mode may be a combination of READ, WRITE, CREATE, EXCLUSIVE. The function returns a file handle if successful or an error code otherwise.

• Vipios_Read(Filehandle, Number of bytes, buffer)
Read a number of bytes from the file denoted by ’Filehandle’ into the specified buffer. Returns number of bytes actually read or an error code. (In case of EOF the number of bytes read may be less than the requested number. Additional information can be obtained by a call to the Vipios_IOState function.)

• Vipios_Write(Filehandle, Number of bytes, buffer)
Write a number of bytes to the file denoted by ’Filehandle’ from the specified buffer.

• Vipios_IRead(Filehandle, Number of bytes, buffer)
Immediate read. Same as read but asynchronous (i.e. the function returns immediately without waiting for the read operation to actually be finished).

• Vipios_IWrite(Filehandle, Number of bytes, buffer)
Immediate write. Same as write but asynchronous.

• Vipios_Close(Filehandle)
Closes the file identified by ’Filehandle’.

• ViPIOS_Seek(Filehandle, position, mode)
Sets the filepointer to position. (The mode parameter specifies if the position is to be interpreted relative to the beginning or to the end of the file or to the current position of the filepointer. This parallels the UNIX file seek function.)

• Vipios_IOState(Filehandle)
Returns a pointer to status information for the file identified by ’Filehandle’. Status information may be additional error information, EOF-condition, state of an asynchronous operation etc.

• Vipios_Connect([System_ID])
Connects an AP with ViPIOS. The optional parameter ’System_ID’ is reserved for future use where an AP may connect to a ViPIOS running on another machine via remote connections (e.g. internet). The return value is TRUE if the function succeeded, FALSE otherwise.

• Vipios_Disconnect()
Disconnects the AP from ViPIOS. The return value is TRUE if the function succeeded, FALSE otherwise.

##### Requests and messages

Requests are issued by an AP via a call to one of the functions declared above. The VI translates this call into a request message which is sent to the AP’s BUDDY (Except in the case of a Vipios_Connect call where the message is sent to the CC which then assigns an appropriate VS as BUDDY to the AP).

According to the above functions the basic message types are as follows.

CONNECT; OPEN; READ; WRITE; CLOSE; DISCONNECT
Note that read and write requests are performed asynchronously by ViPIOS server processes so that no extra message types for asynchronous operations are needed. If the application calls the synchronous versions of the read or write function then the VI tests and waits for the completion of the operation.

ViPIOS-messages consist of a message header and status information. Optionally they can contain parameters and/or data. The header holds the IDs of the sender and the recipient of the message, the client ID (=the ID of the AP which initiated the original external request), the file ID, the request ID and the message type and class. The meaning of status depends on the type and class of the message and may for example be TRUE or FALSE for acknowledges or a combination of access modes for an OPEN message. Number and meaning of parameters varies with type and class of the message and finally data may be sent with the request itself or in a seperate message.

#### 5.1.2 The execution of I/O Operations

Figure 5.1 shows the modules of a VS which are of interest for handling requests.

The local directory holds all the information necessary to map a client’s request to the physical files on the disks managed by the VS. (i. e. wich portions of a file are stored by this server and how these portions are layout on the disks.) The fragmenter uses this information to decompose (fragment) a request into sub-requests which can be resolved locally and sub-requests which have to be communicated to other ViPIOS server processes. The I/O subsystem actually performs the necessary disk accesses and the transmission of data to/from the AP. It also sends acknowledge messages to the AP.

##### The request fragmenter

The fragmenter handles requests differently dependent on their origin. For that reason we define the following request classes and the corresponding message classes.

• external requests/messages (ER): from VI to BUDDY

• directed internal requests/messages (DI): from one VS to another specific VS

• broadcast internal requests/messages (BI): from one VS to all other VSs

• acknowledge messages (ACK): acknowledges the (partial) fulfillment of a request; can be sent from a VS to another VS or to a VI

Figure 5.1 shows how requests are processed by the fragmenter. For external requests (ER) the fragmenter uses the VS’s local directory information to determine the sub-request which can be fulfilled locally. It then passes this part to the VS’s I/O subsystem which actually performs the requested operation.

The remaining sub-requests are committed as internal requests to other VSs. If the fragmenter already knows which VS can resolve a sub-request (e.g. by hints about data distribution or if the VS is a directory controller in centralized or distributed mode) then it sends this sub-request directly to the appropriate server (DI message). Otherwise the sub-request is broadcast to all the other VSs (BI message).

Note that only external requests can trigger additional messages to be sent or broadcast. Internal requests will either be filtered by the fragmenter, if they have been broadcast (appropriate VS was unknown), or passed directly to the I/O subsystem, if they have been sent directly (appropriate VS was known in advance). This design strictly limits the number of request messages that can be triggered by one single AP’s request.

In an optimal configuration files are distributed over VSs such that no internal requests have to be generated (i. e. every request can be resolved completely by the BUDDY = Data locality principle).

##### Control and message flow

Figure 5.2 depicts the actual message flow in ViPIOS. To keep it simple only one AP and its associated BUDDY are shown. However the message flow to/from FOEs is included too. Note that the original application code is transformed by an HPF compilation system into APs containing static compile time information (like data distributions etc.) as well as some compiler inserted statements, which send information to ViPIOS at runtime (hints for prefetching etc.). These informations are communicated to the BUDDY in the form of hints and are used to optimize I/O accesses.

The VI sends its requests to the external interface of the BUDDY. To perform the requested operation the BUDDY’s fragmenter may send sub-requests to FOEs (see 5.1.2) via the BUDDY’s internal interface. Every VS which resolves a sub-request sends an acknowledge message to the appropriate client’s VI.

The VI collects all the acknowledges and determines if the operation is completed. If so, it returns an appropriate value to the AP (in case of a synchronous operation) or sets the state of the operation accordingly (in case of an asynchronous operation).

Note that in order to save messages all FOEs send their acknowledges directly to the client’s VI bypassing the BUDDY which sent the internal request. This implies that the VI is responsible for tracking all the information belonging to a specific file handle (like position of file pointer etc.).

For operations like READ and WRITE the transmission of actual data can be done in one of the two following ways.

###### Method 1:

Data is sent directly with the READ request or with the WRITE acknowledge.
In this case the VI has to provide a receive (send) buffer which is large enough to hold the acknowledge (request) message’s header, status and parameters as well as the data to be read (written). Since the VI actually uses the same memory as the AP all the buffers allocated by the VI in fact reduce the memory available to the computing task. Furthermore data has to be copied between the VI’s internal buffer and the AP.

###### Method 2:

Data is sent in an additional message following the READ or WRITE acknowledge.
The VI uses the AP’s data buffer which was provided in the call to Vipios_read (Vipios_write) to receive (send) the data. This can be done because the extra data message does not have to contain any additional information but the raw data. All necessary information is already sent with the preceding acknowledge. This saves the VI from allocating large buffer at the cost of extraneous messages. (Note that in Figure 5.2 data messages are linked directly to the AP bypassing the VI. This indicates that data transmission is actually performed using the data buffer of the AP.)

The ViPIOS-system decides how data is transmitted for a specific request by using its knowledge about system characteristics like available memory size and cost of extra data messages.

In addition to the above, every VS supports an administrative interface to provide for administrative messages (like descriptions of hardware topology, best disk lists, etc.). In effect the SC gets the administrative messages provided by the system administrator and then dispatches it to the other VSs.

### 5.2 Operation Modes of ViPIOS

Unfortunately the client-server architecture that ViPIOS uses can not be implemented directly on all platforms because of limitations in the underlying hard- or software (like no dedicated I/O nodes, no multitasking on processing nodes, no threading, etc.). So in order to support a wide range of different plattforms ViPIOS uses MPI for portability and offers multiple operation modes to cope with various restrictions.

The following 3 different operation modes have been implemented:

• runtime library,

• dependent system, or

• independent system.

###### Runtime Library.

Application programs can be linked with a ViPIOS runtime module, which performs all disk I/O requests of the program. In this case ViPIOS is not running on independent servers, but as part of the application. The interface is therefore not only calling the requested data action, but also performing it itself. This mode provides only restricted functionality due to the missing independent I/O system. Parallelism can only be expressed by the application (i.e. the programmer).

###### Dependent System.

In this case ViPIOS is running as an independent module in parallel to the application, but is started together with the application. This is inflicted by the MPI-1 specific characteristic that cooperating processes have to be started at the same time. This mode allows smart parallel data administration but objects the Two-Phase-Administration method by a missing preparation phase.

###### Independent System.

In this case ViPIOS is running as a client-server system similar to a parallel file system or a database server waiting for application to connect via the ViPIOS interface. This is the mode of choice to achieve highest possible I/O bandwidth by exploiting all available data administration possibilities, because it is the only mode which supports the two phase data administration method.

#### 5.2.1 Restrictions in Client-Server Computing with MPI

##### Independent Mode is not directly supported by MPI-1.

MPI-1 restricts client-server computing by imposing that all the communicating processes have to be started at the same time. Thus it is not possible to have the server processes run independently and to start the clients at some later point in time. Also the number of clients can not be changed during execution

##### Clients and Servers share MPI_COMM_WORLD in MPI-1.

With MPI-1 the global communicator MPI_COMM_WORLD is shared by all participating processes. Thus clients using this communicator for collective operations will also block the server processes. Furthermore client and server processes have to share the same range of process ranks. This makes it hard to guarantee that client processes get consecutive numbers starting with zero, especially if the number of client or server processes changes dynamically.

Simple solutions to this problem (like using separate communicators for clients and servers) are offered by some ViPIOS operation modes, but they all require, that an application program has to be specifically adapted in order to use ViPIOS.

##### Public MPI Implementations (MPICH, LAM) are not Multi Threading Safe.

Both public implementations (MPICH [5] and LAM [6]) are not multi threading save. Thus non-blocking calls (e.g. MPI_Iread, MPI_Iwrite) are not possible without a workaround. Another drawback without threads is that the servers have to work with busy waits (MPI_Iprobe) to operate on multiple communicators.

##### Running two or more Client Groups with MPI-2.

Every new client group in MPI-2 needs a new intercommunicator to communicate with the ViPIOS servers. Dynamically joining and leaving a specific already existing group is not possible. PVM for example offers this possibility with the functions pvm_joingroup (…) and pvm_lvgroup (…).

#### 5.2.2 Comparing ViPIOS’ Operation Modes

In the following the advantages and disadvantages of all the operation modes and their implementation details are briefly discussed.

##### Runtime Library Mode

behaves basically like ROMIO [94] or PMPIO [45], i.e. ViPIOS is linked as a runtime library to the application.

• ready to run solution with any MPI-implementation (MPICH, LAM)

• nonblocking calls are not supported. Optimization like redistributing in the background or prefetching is not supported

• preparation phase is not possible, because ViPIOS is statically bound to the clients and started together with them

• remote file access is not supported, because there is no server waiting to handle remote file access requests, i.e. in static mode the server functions are called directly and no messages are sent (On systems with multithreading capabilities this could be overcome by starting a thread that waits for and accomplishes remote file access requests.

##### Client Server Modes

allow optimizations like file redistribution or prefetching and remote file accesses.

###### Dependent Mode.

In Client-Server mode clients and server start at the same time using application schemes.

• ready to run solution (e.g with MPICH)

• preparation phase is not possible, because the ViPIOS servers must be started together with the clients

• an exclusive MPI_COMM_WORLD communicator for clients can only be supported in a patched MPICH version. That patch has been implemented but this limits portability)

###### Independent Mode.

In order to allow an efficient preparation phase the use of independently running servers is absolutely necessary.

This can be achieved by using one of the following strategies:

1. MPI-1 based implementations.
Starting and stopping processes arbitrarily can be simulated with MPI-1 by using a number of ”dummy” client processes which are actually idle and spawn the appropriate client process when needed. This simple workaround limits the number of available client processes to the number of ”dummy” processes started.

This workaround can’t be used on systems which do not offer multitasking because the idle ”dummy” process will lock a processor completely. Furthermore additional programming effort for waking up the dummy proccesses is needed.

• ready to run solution with any MPI-1 implementation

• workaround for spawning the clients necessary, because clients cannot be started dynamically

2. MPI-2 based implementations.
Supports the connection of independently started MPI-applications with ports. The servers offer a connection through a port, and client groups, which are started independently from the servers, try to establish a connection to the servers using this port. Up to now the servers can only work with one client group at the same time, thus the client groups requesting a connection to the servers are processed in a batch oriented way, i.e. every client group is automatically put into a queue, and as soon as the client group the servers are working with has terminated, it is disconnected from the servers and the servers work with the next client group waiting in the queue.

• ready to run solution with any MPI-2 implementation

• No workaround needed, because client groups can be started dynamically and independently from the server group

• Once the servers have been started, the user can start as many client applications as he wants without having to take care for the server group

• No problems with MPI_COMM_WORLD. As the server processes and the client processes belong to two different groups of processes, which are started independently, each group has implicitly a separated MPI_COMM_WORLD

• The current LAM version does not support multi-threading, which would offer the possibiliy of concurrent work on all client groups without busy waits

• LAM Version 6.1 does not work when trying to connect processes which run on different nodes

3. Third party protocol for communication between clients and servers (e.g. PVM).
This mode behaves like MPI-IO/PIOFS [37] or MPI-IO for HPSS [55], but ViPIOS uses PVM and/or PVMPI (when it is available) for communication between clients and servers. Client-client and server-server communication is still done with MPI.

• ready to run solution with any MPI-implementation and PVM

• Clients can be started easily out of the shell

• no problems with MPI_COMM_WORLD, because there exist two distinct global communicators

• PVM and/or PVMPI is additionally needed. Because of the wide acceptance of the MPI standard PVM is unlikely to be of any future importance. So the system should not be used any more.

#### 5.2.3 Sharing MPI_COMM_WORLD

So far, the independent mode using PVM(PI) or MPI-2 is the only ones which allows to use ViPIOS in a completely transparent way. For the other modes one of the following methods can be used to simplify or prevent necessary adaptations of applications.

1. Clients and servers share the global communicator MPI_COMM_WORLD.
In this mode ViPIOS offers an intra-communicator MPI_COMM_APP for communication of client processes and uses another one (MPI_COMM_SERV) for server processes. This also solves the problem with ranking but the application programmer must use MPI_COMM_APP instead of MPI_COMM_WORLD in every MPI function call.

2. Clients can use MPI_COMM_WORLD exclusively.
This can be achieved patching the underlying MPI implementation and also copes with the ranking problem.

A graphical comparison of this solutions is depicted in Figure 5.3.

#### 5.2.4 Implemented solutions

Of the approaches described above the following have been implemented so far:

• runtime library mode with MPI-1 (MPICH)

• dependent mode with MPI-1 with threads (MPICH and patched MPICH)

• independent mode with the usage of PVM and MPI-1 (MPICH)

• independent mode with MPI-2 without threads (lam)

### 5.3 Implementation Details of Operation Modes

#### 5.3.1 Dependent Mode with a Shared MPI_COMM_WORLD

The first client-server-based implementations were realized in the dependent mode with a common global communicator MPI_COMM_WORLD. That means, the client processes and the server-processes must all be started together as one single application consisting of client-processes and server-processes, and all these processes are members of one single MPI_COMM_WORLD. Therefore the programmers of ViPIOS-applications must always keep in mind that the program which they are writing is only one part of the whole system. So that they may never execute MPI_Barrier(MPI_COMM_WORLD) becaues MPI would then expect the server-processes to execute the barrier operation too and the program would be blocked.

#### 5.3.2 Dependent Mode with a Separated MPI_COMM_WORLD

This modification of ViPIOS has a separate global communicator MPI_COMM_WORLD for the client processes. But the client processes and the server-processes must still be started together concurrently. However, the programmer of the client processes does no longer have to care about the ViPIOS server processes. The client progam can be thought of as running independently and just satisfying its I/O needs by using calls to the ViPIOS interface library. This approach has been implemented in ViPIOS in two ways:

1. by modification of MPI

2. by creating a header file mpi_to_vip.h, which has to be included in every source file of a ViPIOS-project just after including mpi.h

##### Modification of MPI

In the MPICH 1.1 [5] implementation of MPI the internal representation of all the MPI-specific data-types (Communicators, Groups) is as follows. All these data is stored in an internal list which is hidden from the user. The only thing which the user can see are pointers to entries in this list. Each MPI communicator and each MPI group is represented by one entry in the list and the variables of the types MPI_Comm and MPI_Group are nothing else than pointers to these entries. Each entry in the list has an integer number and the pointers to the entries are just variables in which the number of these entries is stored. Therefore the types MPI_Comm and MPI_Group are just integers. As the global communicator which contains all processes is automatically stored in the internal list at position 91 when MPI is initialized, the definition of the constant MPI_COMM_WORLD is done in the file mpi.h simply by the line “#define MPI_COMM_WORLD 91”. Therefore the modification of MPI_COMM_WORLD was done by substituting the name MPI_COMM_WORLD in this line by the name MPI_COMM_UNIVERSAL and defining MPI_COMM_WORLD as variable of the type MPI_Comm instead. As soon as ViPIOS is initialized, a communicator containing only the client-processes is created and stored in the variable MPI_COMM_WORLD. Therefore it is important that the programmer does not access MPI_COMM_WORLD before initializing ViPIOS. All the modifications of MPI took place only in the file mpi.h. Therefore it was not necessary to recompile MPI. The only thing which has to be done is substituting the original mpi.h file by the modified one. (Note that this modification only works for the MPICH version of MPI. Other implementations may use different representations for MPI_COMM_WORLD.)

##### Creation of a Header file mpi_to_vip.h

The modification of MPI_COMM_WORLD can also be done without any modification of MPI itself. Instead of modifying mpi.h a header file called mpi_to_vip.h can be included immediately after mpi.h in every module of ViPIOS and in the application modules too. This modifies the definition of MPI_COMM_WORLD given in mpi.h after the header file has been included. So the final effect is the same as if the modified version of mpi.h had been used.

##### Compatibility of mpi_to_vip.h to Other MPI Implementations

The way of modifying MPI_COMM_WORLD just explained has only been applied to MPICH 1.1, but with little modifications of the file mpi_to_vip.h it can also be applied to any other version of MPI. Whenever MPI_COMM_WORLD is created with the #define command this definition has just do be undone with #undef and MPI_COMM_WORLD has to be redefined as a variable instead. This variable may then be initialized with the same value which was assigned to the #define-name MPI_COMM_WORLD in the original mpi.h file in order to avoid having an undefined MPI_COMM_WORLD. All this is done in mpi_to_vip.h and if the value which is assigned to MPI_COMM_WORLD in the original mpi.h file changes in another MPI implementation, the value with which the variable MPI_COMM_WORLD is assigned in the file mpi_to_vip.h has to be changed accordingly.

It is very probable that this will work with the future implementations of MPI too because the implementors of MPICH are very convinced that defining MPI_COMM_WORLD with a #define construct is the best way of implementing it. If it is for some reason not possible to initialize the variable MPI_COMM_WORLD with the value which was assigned to the #define-name MPI_COMM_WORLD in mpi.h, there is still the posibility of omitting its initialization. But then it may not be accessed before initializing ViPIOS (which is not a great problem as it is not recommended to access it before initializing ViPIOS anyway).

The activities necessary for modifying MPI_COMM_WORLD done in ViPIOS itself (i.e. creating of independent communicators for server processes and for application processes and assignment of the apllication processes’ communicator to the application’s MPI_COMM_WORLD) are completely independent from the internal implementation of MPI and will never have to be adapted for new MPI versions.

#### 5.3.3 Creation of a Separate MPI_COMM_WORLD for Fortran

As Fortran applications have to cooperate with the ViPIOS system, which is written in C, and it is not possible to declare variables, which are shared between the C files and the Fortran files, the manipulation of MPI_COMM_WORLD for Fortran is more complicated. MPI_COMM_WORLD for Fortran is defined in the file mpif.h with the command ”PARAMETER (MPI_COMM_WORLD=91)”. As in mpi.h, the name MPI_COMM_WORLD has been replaced by the name MPI_COMM_UNIVERSAL. In the file vipmpi.f, which has to be included into the application with the USE command, MPI_COMM_WORLD is defined as a variable. Moreover, this file implements the routine MPIO_INIT which has to be called by the Fortran application in order to initialize ViPIOS. This routine calls via the Fortran to C interface a C routine which invokes a slightly modified initialization routine for ViPIOS. This intialization routine returns the value which has to be assigned to MPI_COMM_WORLD (a communicator containing all the client-processes) via a reference parameter back to MPIO_INIT. MPIO_INIT finally stores it in the variable MPI_COMM_WORLD. The whole process is hidden from the application programmer.

#### 5.3.4 Independent Mode

Independent mode means that there are two independently started programs. One consists of the server processes, and the other one consists of the client processes. First the server processes must be started which offers a connection via ports. Then the client application is started which connects to the server processes. This connection creates intercommunicators, which allow a communication between the client processes and the server processes in the same way as it was done in the dependent mode. While active, the server processes can be connected to by client applications at any time. Thus different applications can concurrently use the ViPIOS I/O server processes. With the MPI 1.1 standard an independent mode of ViPIOS cannot be implemented. However, it can be done with a MPI 2.0 implementation. Up to now the only MPI 2.0 implementation, with which the independent mode of ViPIOS has been tested is LAM 6.1 . Unfortunately it works with LAM 6.1 only if all the processes are executed on the same node. This is due to instabilities of LAM 6.1. With an MPI 2.0 implementation which works correctly according to the MPI 2.0 standard processes could be executed distributed across all the available processors in independent mode.

For a list of advantages and disadvantages of this MPI-2 based implementation see chapter 5.2.1.

#### 5.3.5 Future Work: Threaded Version of the Independent Mode

The next step is now to create a threaded version of the independent mode. The ViPIOS server will then be able to serve more than one ViPIOS client program at the same time. Every server process will then start one thread for each client application which connects to the server and each of these threads will then recieve and process only the requests sent by the one client application, for which is was started. These threads will then comply each request by starting another thread whose task is to process just that single request. As soon as the request is sucessfuly complied the thread terminates. If a client application closes the connection to the ViPIOS server, the server process threads whose task was to recieve the requests sent by this client also terminate.

Unfortunately the attempts to implement this version have failed up to now because LAM is not thread safe. Some alternatives to LAM have therefore been considered.

#### 5.3.6 Alternatives to LAM

Evaluation of possible alternatives for LAM in order to implement the independent mode of ViPIOS

Problem: LAM is instable, not thread-safe and connecting/disconnecting of processes does not work correctly when it is used to execute a program on more than one node. An MPICH implementation of the MPI 2.0 standard does not yet exist, therefore other posibilities have to be found to implement the independent mode of ViPIOS.

For the ViPIOS client server principle to work the ability to combine two independently started MPI processes is absolutely necessary. The best solution would be a connection of the server program with the independently started client program in a way that an intercommunicator between all the processes of the server program and all the processes of the client program is created (like in the LAM implementation described above). Because this does not require any modifications of the code of the ViPIOS functions (except for ViPIOS_Connect).

##### MPI-Glue

MPI-Glue is an implementation of the MPI 1.1 standard which allows MPI applications to run on heterogeneous parallel systems. It is especially designed to combine different homogeneous parallel systems (workstation clusters where all the workstations are of the same type or supercomputers) together. In order to be as efficient as possible it imports existing MPI implementations designed for running MPI on homogeneous parallel systems. This implementations are used for communication inside one of the homogenous parallel systems. For communication between different machines MPI-Glue implements an own, portable MPI based on TCP/IP. MPI-Glue exports all MPI functions according to the MPI 1.1-Standard to the application and as soon as any MPI function involving communication is called by the application, it invokes automatically the required internal function (i.e. when there is a communication inside a homogeneous parallel system it invokes the implementation designed for homogeneous parallel systems of that type otherwise it uses its portable MPI implementation based on TCP/IP to do communication between two different types of processors, which of course takes much more time than the communication inside a homogeneous parallel system).

##### Pacx-Mpi

This system was developed at the computing center of the University of Stuttgart. It offers the same possibilities as MPI-Glue and it also imports the system dependent MPI implementations for communication inside homogenous parallel systems. The difference to MPI-Glue is the way how communication is performed between two processes running on different platforms. In MPI-Glue the communication goes directly from one process to another. But in PACX-MPI in every one of the different homogenous parallel systems which are combined there exist two additional processes. One has the task to send messages to other homogeneous parallel systems and the other one recieves messages sent by other homogeneous parallel systems. If one of the application’s processes wants to send a message to a process running on a different platform, it sends it to the process, which is to send messages to other parallel systems. That process sends it to the other system. There the message is recieved by the process whose task is recieving messages from other homogeneous parallel systems and this process finally sends the message to the destination process. Only the two additional processes are able to communicate with other homogeneous parallel systems using TCP/IP. With PACX-MPI only a small subset of the MPI functions can be used.

##### Pvmpi

PVMPI connects independently started MPI applications, which may run on inhomogenous platforms using PVM. It creates an intercommunicator which connects two applications. However, it is not possible to create an intercommunicator, which contains all processes of both applications with MPI_Intercomm_merge.

##### Plus

PLUS enables communication between parallel applications using different models of parallel computation. It does not only allow the communication between different MPI applications running on different platforms but moreover the communication between e.g. a MPI application and a PVM application. In order to make it possible to communicate with processes of another application in an easy way, the processes of each application can address processes of another application according to their usual communication scheme. For example PLUS assigns every process of a remote application, with which a PVM application has to communicate to a task identifier. As soon as a process of the PVM application tries to send a message to another process, PLUS tests whether the task id of the addressed process belongs to a process of a remote application. If so, PLUS transmits the message to the target process using a protocol based on UDP. The target process can recieve the message by the scheme according to its programming model. PLUS uses daemon processes like PACX-MPI to transmit messages between two applications. With PLUS only a restricted set of datatypes can be used. As in PVMPI the creation of a global intracommunicator containing ALL processes of all the applications, which communicate through PLUS is not possible.

##### Mpi_connect

MPI_CONNECT is a result of optimizing PVMPI. It does not longer need PVM but uses the metacomputing system SNIPE [8] instead to manage the message passing between the different parallel systems. SNIPE has the advantage of a better compatibility to MPI than PVM. MPI_CONNECT offers either the possibility to connect independently started MPI applications via intercommunicators or starting processes on different parallel systems together as a single application with a shared MPI_COMM_WORLD, without the posibility to start additional tasks later.

##### Comparison of the Systems

In order to group the systems by their most important features in the following table the systems are classified by two different paradigms. In each system (except MPI_CONNECT) there is only one of these two paradigms available:

• paradigm 1 All the different homogenous parallel systems start simultanously and have a common MPI_COMM_WORLD. No processes can be connected later.

• paradigm 2 All the different homogenous parallel systems are started independently and are connected dynamically. The communication between them is done via intercommunicators. No global intracommunicator containing processes of more than one homogenous parallel system can be created.

Table 5.1 lists the most relevant attributes of the systems described above.

#### 5.3.7 consequences

After evaluating these systems, it is evident that only PVMPI or MPI_CONNECT can be used to efficiently implement the independent mode of ViPIOS because only these two systems support the connection of independently started MPI_applications. As MPI_CONNECT is an improvement of PVMPI, it is very likely to be the best choice.

## Chapter 6 The MPI-I/O Interface

ViMPIOS (Vienna Message Passing/Parallel Input Output System) is a por-table, client-server based MPI-IO implementation on the ViPIOS. At the moment it comprises all ViPIOS routines currently available. Thus, the whole functionality of ViPIOS plus the functionality of MPI-IO can be exploited. However, the advantage of ViMPIOS over the MPI-IO proposed as the MPI-2 standard is the possibility the assign each server process a certain number of client processes. Thus, the I/O can actually be done in parallel. What is more, each server process can access a file scattered over several disks rather than residing on a single one. The application programmer need not care for the physical location of the file and can therefore treat a scattered file as one logical contiguous file.

At the moment four different MPI-IO implementations are available, namely:

• PMPIO - Portable MPI I/O library developed by NASA Ames Research Center

• ROMIO - A high-performance, portable MPI-IO implementation developed by Argonne National Laboratory

• MPI-IO/PIOFS - Developed by IBM Watson Research Center

• HPSS Implementation - Developed by Lawrence Livermore National Laboratory as part of its Parallel I/O Project

Similar to ROMIO all routines defined in the MPI-2 I/O chapter are supported except shared file pointer functions, split collective data access functions, support for file interoperability, error handling, and I/O error classes. Since shared file pointer functions are not supported, the MPI_MODE_SEQUENTIAL mode to MPI_File_open is also not available.

In addition to the MPI-IO part the derived datatypes MPI_Type_subarray and MPI_Type_darray have been implemented. They are useful for accessing arrays stored in files [71].

What is more, changes to the parameters MPI_Status and MPI_Request have been made. ViMPIOS uses the self defined parameter MPIO_Status and MPI_File_Request. Unlike ROMIO, the parameter textsfstatus can be used for retrieving particular file access information. Thus, MPI_Status has been modified. The same is true for MPI_Request. Finally, the routines MPI_Wait and MPI_Test are modified to MPI_File_wait and MPI_File_test.

At the moment, file hints are not supported by ViMPIOS yet. Using file hints would yield following advantages: The application programmer could inform the server about the I/O workload and the possible I/O patterns. Thus, complicated I/O patterns where data is read according to a particular view and written according to a different can be analyzed and simplified by the server. What is more, the server could select the I/O nodes which suit best for the I/O workload. In particular, if one I/O node is idle whereas the other deals with great amount of data transfer, these unbalances could be solved.

### 6.1 Mpi

#### 6.1.1 Introduction to MPI

In this section we will discuss the most important features of the Message Passing Interface (MPI) [46]. Rather than describing every function in detail we will focus our attention to the basics of MPI which are vital to understand MPI-IO, i.e. the input/output part of the message passing interface. Thus, the overall purpose of this chapter is to define special MPI terms and explain them by means of the corresponding routines coupled with some examples.

The Message Passing Interface is the de facto standard for parallel programs based on the message passing approach. It was developed by the Message Passing Interface Forum (MPIF) with participation from over 40 organizations. MPI is not a parallel programming language on its own but a library that can be linked to a C or FORTRAN program. Applications can either run on distributed-multiprocessors, networks of workstations, or combinations of these. Furthermore, the interface is suitable for MIMD programs as well as for those written in the more restricted SPMD style. A comprehensive overview of parallel I/O terms can be found in [86].

#### 6.1.2 The Basics of MPI

The main goal of the standard is to allow the communication of processes whereas the easiest way of interprocess communication is the point-to-point communication where two processes exchange information by the basic operations SEND and RECEIVE. According to [32] the six basic functions of MPI are as follows:

• MPI_INIT: initiate an MPI computation

• MPI_FINALIZE: terminate a computation

• MPI_COMM_SIZE: determine number of processes

• MPI_COMM_RANK: determine current process’ identifier

• MPI_SEND: send a message

Every program in MPI must be initialized by MPI_Init and terminated by MPI_Finalize. Thus, no other MPI function can be called before MPI_Init or after MPI_Finalize. The syntax of the two functions is:

int MPI_Init (int *argc, char *** argv)
int MPI_Finalize (void)

By means of MPI_Comm_rank the process’ identifier can be evaluated. Process numbers start with 0 and have consecutive integer values. In order to find out how many processes are currently running, MPI_Comm_size is called.

 int MPI_Comm_size (MPI_Comm comm, int *size) IN comm communicator OUT size number of processes in the group of comm
 int MPI_Comm_rank (MPI_comm, int *rank) IN comm communicator OUT rank rank of the calling process in group of comm

In both instructions the argument comm specifies a so-called communicator which is used to define a particular group of any number of processes. Suppose 8 processes are currently active and we wish to separate them into two groups, namely group1 should contain processes with the identifiers from 0 to 3, whereas group2 consists of the rest of the processes. Thus, we could use a communicator group1 that refers to the first group and a communicator group2 that refers to the second group is MPI_COMM_WORLD. This MPI predefined communicator includes all processes currently active.

On establishing a communication the next step is to explain how information is exchanged by MPI_Send and MPI_Recv. Both instructions execute a blocking message passing rather than a non-blocking one. In a blocking approach a send command waits as long as a matching receive command is called by another process before the actual data transfer takes place.

 int MPI_Send (void* buf, int count, MPI_Datatype, int destination, int tag, MPI_Comm comm) IN buf initial address of send buffer IN count number of elements in the send buffer IN datatype datatype of each send buffer element IN dest rank of destination IN tag message tag IN comm communicator
 int MPI_Recv (void* buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm, MPI_Status *status) OUT buf initial address of receive buffer IN count number of elements in the receive buffer IN datatype datatype of each receive buffer element IN source rank of source IN tag message tag IN comm communicator OUT status status object

The first three arguments of both instructions are referred to as the message data, the rest is called message envelope. In particular buf specifies the initial address of the buffer to be sent. Count holds the number of elements in the send buffer, which are defined by the datatype MPI_Datatype (e.g. MPI_INT, MPI_FLOAT). The parameter destination states the identifier of the process that should receive the message. Similarly, the parameter source refers to the process that has sent the message. By means of tag a particular number can be related to a message in order to distinguish it from other ones. Comm refers to the communicator. Finally, the status information allows checking the source and the tag of an incoming message.

The following small program demonstrates how process 0 sends an array of 100 integer values to process 1:

#include "mpi.h"
int main (int argc,char **argv)
{
int message[100], rank;
MPI_Status status;

/* MPI is initialized */
MPI_Init(&argc,&argv);

/* the rank of the current process is determined */
MPI_Comm_rank(MPI_COMM_WORLD, &rank);

if (rank==0)
/*  process 0 sends message with tag 99 to process 1 */
MPI_Send(message, 100, MPI_INT, 1, 99, MPI_COMM_WORLD);
else
/* process 1 receives message with tag 99 from process 0 */
MPI_Recv(message, 100, MPI_INT, 0, 99, MPI_COMM_WORLD, &status);

MPI_Finalize();
}


#### 6.1.3 Linking and Running Programs

Compiling and linking of a program is done by

mpicc -o file_name file_name.c

On compiling and linking the program, an executable file is produced which can be executed by the following command:

mpirun -np number_of_processes file_name

-np denotes the number of process and has always be situated before the file name. For example:

mpirun -np 16 application.c

#### 6.1.4 Blocking vs. Non-blocking communication

The send/receive commands we were discussing so far are so called blocking commands. In other words, the sending process waits until the receiving process has got the message. In contrast, nonblocking communication means that the sending processes do not wait until the operation is complete. Moreover, special functions, namely MPI_Wait and MPI_Test, are used to complete a nonblocking communication. Thus, better performance can be yielded for specific applications, since communication and computation can overlap.

Let us again take a look at the most important nonblocking commands before we resume with same examples:

 int MPI_Isend (void* buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Request *request) IN buf initial address of send buffer IN count number of elements in the send buffer IN datatype datatype of each send buffer element IN dest rank of destination IN tag message tag IN comm communicator OUT request communication request
 int MPI_Irecv (void* buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Request *request) OUT buf initial address of receive buffer IN count number of elements in the receive buffer IN datatype datatype of each receive buffer element IN source rank of source IN tag message tag IN comm communicator OUT request communication request

The syntax of those instructions is the same as for their blocking counterparts except of the last parameter request.

 int MPI_Wait (MPI_Request *request, MPI_Status *status) INOUT request request OUT status status object

A call to that function returns when the operation identified by request is complete. In other words, it waits until a nonblocking send or receive with a matching parameter request is executed. Status gives information about the completed operation.

 int MPI_Test (MPI_Request *request, int *flag, MPI_Status *status) INOUT request communication request OUT flag true if operation completed OUT status status object

The function only returns flag = true if the operation identified by request is complete. Furthermore, the status object contains the information on the completed operation.

This example is a modification of the previous one. Rather than using blocking commands, we demonstrate the usage of non-blocking commands.

#include "mpi.h"
int main (int argc,char **argv)
{
int message[100], rank;
MPI_Status status;
MPI_Request request;

MPI_Init(&argc,&argv);

MPI_Comm_rank(MPI_COMM_WORLD, &rank);

if (rank==0)
{
MPI_ISend(message, 100, MPI_INT, 1, 99, MPI_COMM_WORLD, &request);
/* any computation can be done here */
MPI_Wait(request,status);
}
else
{
MPI_IRecv(message, 100, MPI_INT, 0, 99, MPI_COMM_WORLD,
&request);
/* any computation can be done here */
MPI_Wait(request,status);
}
MPI_Finalize();
}


#### 6.1.5 Derived Datatypes

Derived datatypes allow the user to define special datatypes that can be any combination of simple datatypes such as MPI_INT or MPI_FLOAT. In addition, sections of arrays that need not be contiguous in memory can be referred to as a special datatype. In that section we will give an overview of some important derived datatypes.

In order to define a derived datatype, three sets of functions are required, namely a constructor function to construct a derived datatype, e.g.
MPI_Type_vector, a commit function MPI_Type_commit for applying the new datatype and finally the function MPI_Type_free that should be applied after the usage of the datatype.

The simplest derived datatype is MPI_Type_contiguous:

 int MPI_Type_contiguous (int count, MPI_Datatype oldtype, MPI_Datatype *newtype) IN count replication count IN oldtype old datatype OUT newtype new datatype

This datatype allows defining a contiguous datatype which consists of count elements of a special datatype oldtype. The new datatype can be used for further purposes.

By means of an example we want to describe the usage of this datatype. Assume that process 0 wants to send an array of 25 integer elements to process 1. In order to use a derived datatype following steps are necessary:

MPI_Datatype array1;

/* datatype which specifies 25 integer values to contiguous
locations is created */
MPI_Type_contiguous(25,MPI_INT,&array1);
MPI_Type_commit(&array1);

/* process 0 sends data to processes 1 */
MPI_Send(message, 1, array1, 1, 99, MPI_COMM_WORLD);
MPI_Type_free(&array1);


In our small example we only printed the code for process 0. Taking a look at the third parameter of the send command we notice that a derived datatype is specifed rather than a simple datatype like MPI_INT. The syntax for the commands which handle the commition and freeing of the derived datatype is given here:

 int MPI_Type_commit (MPI_Datatype *datatype) INOUT datatype datatype that is commited
 int MPI_Type_free (MPI_Datatype *datatype) INOUT datatype datatype that is freed

Now assume that process 0 wants to send sections of the array rather than all 25 integer elements. In particular, the first and the last 10 elements shall be sent which means that 2 blocks of data shall be sent whereas 5 elements shall be skipped. A more general derived datatype is MPI_Type_vector:

 int MPI_Type_vector (int count, int blocklength, int stride, MPI_Datatype oldtype, MPI_Datatype *newtype) IN count number of blocks IN blocklength number of elements in each block IN stride number of elements between start of each block IN oldtype old datatype OUT newtype new datatype

Count holds the number of blocks, blocklength specifies the number of elements in each block, and stride defines the number of elements between the start of each block whereas stride is a multiple of oldtype. The datatype for our example is as follows:

MPI_Type_vector(2,5,10);

The shape is depicted in Figure 6.1.

A further generalization of the previous datatype is MPI_Type_hvector. The difference to MPI_Type_vector is the parameter stride which is not given in elements but in bytes.

 int MPI_Type_hvector (int count, int blocklength, MPI_Aint stride, MPI_Datatype oldtype, MPI_Datatype *newtype) IN count number of blocks IN blocklength number of elements in each block IN stride number of elements between start of each block IN oldtype old datatype OUT newtype new datatype

The next datatype MPI_Type_indexed allows specifying blocks of different lengths starting at different displacements. Before we present the function we want to give an example of such a particular case.

Assume a 5x5 array of integer values. Further assume that we want to send the lower triangle of that matrix. Thus, the first block to be sent consists of 1 element with the displacement 0. The second block consists of 2 elements with the displacement 6. The third block comprises 3 elements with the displacement 11 etc. The matrix and the corresponding linear file are depicted in Figure 6.2.

 int MPI_Type_indexed (int count, int *array_of_blocklengths, MPI_Aint *array_of_displacements, MPI_Datatype oldtype, MPI_Datatype *newtype) IN count number of blocks IN array_of_blocklengths number of elements per block IN array_of_displacements displacement for each block, in multiples of oldtype IN oldtype old datatype OUT newtype new datatype
MPI_Datatype    indexed;
int             a_blocklen[10],
ΨΨa_disp[10];

for (i=0; i<5; i++)
{
a_blocklen[i]=i+1;
a_disp[i]=i*5;
}

MPI_Type_indexed(4, a_blocklen, a_disp, MPI_INT, &indexed);
MPI_Type_commit(&indexed);


Similar to MPI_Type_hvector there is also a corresponding MPI_Type_hindexed. This datatype is identical to the previous except that the displacements are given in bytes and not in multiples of oldtype.

int MPI_Type_hindexed (int count, int *array_of_blocklenghts,
int *array_of_displacements, MPI_Datatype oldtype, MPI_Datatype *newtype)

The most general derived datatype is MPI_Type_struct. The difference to MPI_Type_indexed is that each datablock can consist of a different datatype. Thus, it is possible to send an integer, together with a character or double value in one message.

 int MPI_Type_struct (int count, int *array_of_blocklengths, MPI_Aint *array_of_displacements, MPI_Datatype *array_of_oldtypes, MPI_Datatype *newtype) IN count number of blocks IN array_of_blocklengths number of elements per block IN array_of_displacements displacement for each block in bytes IN array_of_oldtypes old datatype OUT newtype new datatype

Let us again take a look at an example:

MPI_Datatype    s_types[3]={MPI_INT,MPI_DOUBLE,MPI_CHAR},
ΨΨdd_struct;
int             s_blocklen[3]={3,2,16};
MPI_Aint        s_disp[3];

/* set displacements to next free space */
s_disp[0]=0;
s_disp[1]=20;
s_disp[2]=40;

MPI_Type_struct(3, s_blocklen, s_disp, s_types, &dd_struct);
MPI_Type_commit(&dd_struct);


The shape of that derived datatype can be seen in Figure 6.3. Here the derived datatype consists of 3 non-contiguous blocks of different datatypes, namely MPI_INT, MPI_DOUBLE and MPI_CHAR.

#### 6.1.6 Collective Communication

Unlike the point-to-point communication we stated in the previous chapters, collective communication means that a group of processes can take part in the communication process. In other words, not only one process can send a message to another processes but to several processes with just one command. In contrast, it is also possible for one process to receive messages from all other processes specified in one group. Again, only one receive command is necessary rather than several ones for each message.

Let us take a look at one function for collective communication in order to clarify the advantage of a collective communication over a point-to-point communication:

 int MPI_Bcast (void* buffer, int count, MPI_Datatype datatype, int root, MPI_Comm comm) INOUT buffer starting address of buffer IN count number of entries in buffer IN datatype datatype of buffer IN root rank of process to broadcast the message IN comm communicator

This routine broadcasts a message from the process identified by the rank root to all processes in the group. What is more, a message is also sent to itself.

Let us first analyze how this would be done with the conventional point-to-point communication and compare this to the collective version. Assume that our group consists of 8 processes. The message to be sent should be an array of 1000 double values.

MPI_Comm comm;
double  double_array[1000];
...
if (rank==0)
{
for (i=0; i<8; i++)
MPI_Send(double_array, 1000, MPI_DOUBLE, i, 99, comm);
}
else
MPI_Recv(double_array, 1000, MPI_DOUBLE, 0, 99, comm, &status);


Using collective routines the send-receive process is much shorter since it is not necessary to write separate lines of code for different processes:

MPI_Bcast(double_array, 1000, MPI_DOUBLE, 0, comm);

#### 6.1.7 Communicators

Communicators are a way of managing the communication among processes in a better way such that several processes can be grouped together and regarded as one homogenous entity. In short, communicators are divided into to kinds:

• Intra-Communicators

• Inter-Communicators

The first term refers to the communication within a single group of processes whereas the second term refers to the point-to-point communication between two group of processes. For a comprehensive survey of all MPI commands we refer the reader to [46].

### 6.2 Mpi-Io

#### 6.2.1 Introduction to MPI-IO

MPI-IO is a high level interface developed by the MPI-Committee [46]. The goal was to create a widely used standard for describing parallel I/O operations within an MPI message passing application. The initial idea was that I/O could be modeled as message passing, namely writing to a file can be regarded as sending a message and reading from a file is equivalent to receiving a message. Thus, MPI-IO is an extension of the MPI 1.1 standard , which did not support parallel file I/O so far for the following reasons [32]:

• not all parallel machines support the same parallel or concurrent file system interface

• the traditional Unix file system interface is ill suited to parallel computing since multiple processes do not share files at once

On giving a short introduction to the I/O problem let us now analyze the most important features of MPI-IO:

• using derived MPI datatypes yields strided access to the memory and the file

• non-blocking functions improve the I/O performance by overlapping I/O with computation

• collective operations may optimize global data access

• using two types of file pointers, namely individual and shared file pointers, such that exact offsets in the file need not be specified when data is read or written

• file hints allow specifying the layout of the a file, e.g. number of disks or I/O nodes which hold the information of a striped file

• filetype constructors can be used to specify array distribution

• error handling is supported

#### 6.2.2 First Steps With MPI-IO

Referring to the introductory chapter we stated that I/O can be modeled as a message passing system. In other words, writing data to a file should be similar to sending a message. In contrast, reading from a file should be modeled as receiving a message. Although MPI-IO supports a large number of different functions for parallel I/O, many programs and applications only use six of them, which are summarized in the following paragraph [32]:

• MPI_INIT: MPI as well as MPI-IO are initialized

• MPI_FILE_OPEN: a file is opened

• MPI_FILE_READ: data is read from a particular location in a file

• MPI_FILE_WRITE: data is written to a particular location in a file

• MPI_FILE_CLOSE: a file is closed

• MPI_FINALIZE: MPI as well as MPI-IO are terminated

Strictly speaking only four so-called MPI-IO functions are used since MPI_INIT and MPI_FINALIZE are already supported by MPI-1.

We will now explain the use of these functions by means of two simple examples [32]. For introductory purpose we will not go into detail with describing the exact syntax of each function but only mention the most important parameters to focus our attention. We will dedicate a special chapter to the syntax of the functions at a later stage of this thesis.

In the first program each process creates its own individual file called file followed by an extension which reflects the identifier of the current process. A file can be opened individually by using the parameter MPI_COMM_SELF. Furthermore, data is written to the file, which is read back later on.

#include "mpi.h"
#include "mpio.h"

int main(argc,argv)

int argc;
char *argv[];
{
int myid;

MPI_Status status;
MPI_File fh;

char filename[12];

buf= (int *)malloc(50*sizeof(int));
for (i=0, i<50, i++)
buf[i]=i;

MPI_Init(&argc,&argv);
MPI_Comm_rank(MPI_COMM_WORLD,&myid);

/* open file with filename "file.processid" */
sprintf(filename,"%s.%d","ufs:file",myid);

/* each process opens a separate file */
MPI_File_open (MPI_COMM_SELF, filename, MPI_MODE_CREATE |
MPI_MODE_RDWR, MPI_INFO_NULL, &fh );

/* read data from file */

/* perform computation */

/* write data to file */
MPI_File_write (fh, buf, 50, MPI_INT, &status);

MPI_File_close(&fh);

free(buf);
MPI_Finalize();

}


In the second example each process accesses one common global file rather than its local one. That feature is yielded by the parameter MPI_COMM_WORLD. We will not print the whole program code but only the part which differs from the previous example:

/* each process opens one common file */
MPI_File_open (MPI_COMM_WORLD, filename, MPI_MODE_CREATE |
MPI_MODE_RDWR, MPI_INFO_NULL, &fh );


#### 6.2.3 Some Definitions of MPI-IO

File: An MPI file is an ordered collection of typed data items which can be accessed in a random or sequential way. Furthermore, a communicator (MPI_COMM_SELF or MPI_COMM_WORLD we discussed in our introductory chapter) on the one hand specifies which group of processes can get access to the I/O operations, on the other hand, it determines whether the access to the file is independent or collective. Since independent I/O requests are executed individually by any of the processes within a communicator group, no coordination among the processes is required. In contrast, the latter case requires each process in a group associated with the communicator to participate in the collective access.

Displacement: A file displacement defines the beginning of a view (file access pattern) expressed as an absolute byte position relative to the beginning of the file. Furthermore, it can be used to skip head information of a file or to define further access patterns which start at different positions of the file.

Etype: An etype (elementary datatype) can be regarded as the unit of data access and positioning. In other words, it specifies the data layout in the file. An etype can be any MPI predefined or derived datatype.

Filetype: A filetype can either be a single etype or a derived MPI datatype of several etypes and describes a template for accessing a file partitioned among processes.

View: A view is described by a displacement, an etype, and a filetype and defines the current set of data visible and accessible from an open file as an ordered set of etypes. Thus, a view specifies the access pattern to a file.

Offset: An offset is a position in the file relative to the current view, expressed as a count of etypes.

File pointer: A file pointer is an implicit offset. On the one hand, MPI provides individual file pointers which are local to each process, on the other hand, file pointers for a group of processes - so-called shared file pointers - are supported.

File handle: A file handle can be regarded as a file reference which is created by MPI_File_open and freed by MPI_File_close.

By means of an example we will explain the idea of strided access and how this can be achieved by using different file views. Assume a file which holds an array of 24 integer values. Further assume that a process only wants to read or write every third value of the file. We therefore tile the file with a filetype which is a derived MPI datatype. Thus, the values at position (offset) 0,2,5,… can be accessed whereas all the other positions are so-called holes which cannot be accessed by the current process. The view is depicted in Figure 6.4.

In our next example assume that 3 processes access the file in a complementary way. In other words, the file is partitioned among four parallel processes where each process reads or writes at different locations in the file. In particular process 0 accesses the positions 0,3,6,…, process 1 accesses the positions 1,4,7,…, and finally process 2 accesses the positions 2,5,8,… The access patterns of each process are depicted in Figure 6.5.

Another possibility is to access a file in two different patterns. Thus, we define two tilings (views). In particular, the first part of the file shall be accessed with a stride of 2 elements whereas the second part of the file shall be accessed with a stride of 3 elements. Figure 6.6 can make this example clearer:

If the file needs to be accessed in the second view, the displacement of the second pattern will skip over the entire first segment.

#### 6.2.4 MPI-IO routines

On giving some definitions of the MPI-IO standard we will now describe the routines of that standard and give some examples to get a first idea of how MPI-IO works.

##### File Manipulation
 int MPI_File_open (MPI_Comm comm, char *filename, int amode, MPI_Info info, MPI_File *fh) IN comm communicator IN filename name of file to be opened IN amode file access mode IN info info object OUT fh new file handle (handle)

Description:
Opens the file filename on all processes in the comm communicator group.

Features:

• collective routine (all processes must refer to the same file name and use the same access mode)

• a file can be opened independently by using the MPI_COMM_SELF communicator

• comm must be an intra-communicator rather than an inter-communicator

• filename is prefixed to indicate the underlying file system (e.g. ”ufs:filename”, where ufs stands for Unix file system)

• amode can have the following values which could also be combined together with the bit vector OR:

• MPI_MODE_RDONLY: file is opened read only

• MPI_MODE_WRONLY: write only

• MPI_MODE_CREATE: create the file if it does not exist

• MPI_MODE_EXCL: error is returned if a file is created that already exists

• MPI_MODE_DELETE_ON_CLOSE: delete file on close

• MPI_MODE_UNIQUE_OPEN: file will not be concurrently opened elsewhere. Optimization is yielded by eliminating file locking overhead

• MPI_MODE_SEQUENTIAL: file will only be accessed sequentially

• MPI_MODE_APPEND: set initial position of all file pointers to end of file

• info provides information such as file access patterns and file system specifics. If no info is needed, MPI_INFO_NULL can be used.

Annotation:
Every time the file is accessed the file handle fh is used to refer to the file. Moreover, each open-function must at least contain either MPI_MODE_RDONLY, MPI_MODE_RDWR or MPI_MODE_WRONLY

Possible errors:

• MPI_MODE_CREATE or MPI_MODE_EXCL are specified together with MPI_MODE_RDONLY

• MPI_MODE_SEQUENTIAL is specified together with MPI_MODE_RDWR

• a file which is opened with MPI_MODE_UNIQUE is opened concurrently

• a file which is opened with MPI_MODE_SEQUENTIAL is accessed in a non-sequential way

 int MPI_File_close (MPI_File *fh) INOUT fh file handle

Description:
The file defined by fh is closed after synchronizing the file, i.e. previous writes to fh are transferred to the storage device. Furthermore, the content of the file handle fh is destroyed.

Features:

• collective routine

• the file is deleted if it was opened with MPI_FILE_DELETE_ON_CLOSE

 int MPI_File_delete (char *filename, MPI_Info info) IN filename name of the file IN info file info

Description:
Deletes the file filename.

Possible errors:

• MPI_ERR_NO_SUCH_FILE: a file is attempted to delete which does not exist

• MPI_ERR_FILE_IN_USE, MPI_ERR_ACCESS: the file is still opened by any other process

 int MPI_File_set_size (MPI_File fh, MPI_Offset size) INOUT fh file handle IN size size (in bytes) to truncate or expand file

Description:
Resizes the file defined by fh.

Features:

• collective routine (identical value for size)

• if size is smaller than the current file size, the file is truncated otherwise size becomes the new file size

• does not effect individual or shared file pointers

Possible errors:

• routine is called if file was opened with MPI_MODE_SEQUENTIAL

• non-blocking requests and split collective operations on fh are not completed

 int MPI_File_preallocate (MPI_File fh, MPI_Offset size) INOUT fh filehandle IN size size to preallocate file size

Description:
Storage space is allocated for the first size bytes of a file.

Features:

• collective routine (identical value for size)

• if size is larger than the current file size, the file is increased to size otherwise the file size is unchanged

Possible errors:

• routine is called if file was opened with MPI_MODE_SEQUENTIAL

 int MPI_File_get_size (MPI_File fh, MPI_Offset size) IN fh file handle OUT size size of the file in bytes

Description:
Returns the current size in bytes of the file defined by fh.

 int MPI_File_get_group (MPI_File fh, MPI_Group *group) IN fh file handle OUT group group which opened the file

Description:
Returns a duplicate of the group of the communicator which opened the file defined by fh.

 int MPI_File_get_amode (MPI_File fh, int *amode) IN fh file handle OUT amode file access mode used to open the file

Description:
Returns the access mode of the file defined by fh.

##### File Info
 int MPI_File_set_info (MPI_File fh, MPI_Info info) INOUT fh file handle IN info info object

Description:
Sets new values for the hints of a file.

 int MPI_File_get_info (MPI_File fh, MPI_Info *info_used) IN fh file handle OUT info_used new info object (handle)

Description:
Returns the hints of a file.

Some examples of file hints:

• Access_style: Specifies how a file is accessed whereas a combination of following strings is possible: read_once, write_once, read_mostly, write_mostly, sequential, reverse_sequential and random

• io_node_list: Specifies the list of I/O devices to store the file

##### File Views

Strided access to a file can be gained by using derived datatypes in combination with views. Recall that a view defines the current set of data visible and accessible from an open file as an ordered set of etypes.

Let us first present the interfaces before we give a simple example.

 int MPI_File_set_view (MPI_File fh, MPI_Offset disp, MPI_Datatype etype, MPI_Datatype filetype, char *datarep, MPI_Info info) INOUT fh file handle IN disp displacement IN etype elementary datatype IN filetype filetype IN datarep data representation IN info info object

Description:
Changes the process’ view of the data in the file.

Features:

• collective routine: values for etype and datarep must be identical on all processes in the group

• disp defines the beginning of the view in bytes, it must have the special value MPI_DISPLACEMENT_CURRENT if MPI_MODE_SEQUENTIAL was specified

• etype defines the data access and positioning in the file. Thus, every seek which is performed on that file is done in units specified by etype. In addition every offset is expressed as a count of etypes. More information is given when the command MPI_File_seek is explained.

• filetype describes the distribution of the data

• datarep specifies the representation of the data in the file according to three categories:

• native

• internal

• external32

Annotation:
In the native data representation the data is stored in a file exactly as it is in memory [47]. Since no type conversion is required for that mode, data precision and I/O performance are not lost. With the internal data representation type conversion is performed if necessary. Type conversion is always performed with the external32.

Non-blocking requests and split collective operations (see in a later chapter) on. fh must be completed

 int MPI_File_get_view (MPI_File fh, MPI_Offset disp, MPI_Datatype etype, MPI_Datatype filetype, char *datarep) IN fh file handle OUT disp displacement OUT etype elementary datatype OUT filetype filetype OUT datrep datarepresentation

Description:
Returns the process’ view of the data in the file.

Features:

• Datarep must be large enough to handle the data representation string. However, the upper limit is defined by MPI_MAX_DATAREP_STRING

Example: Assume a file which is an integer array consisting of 100 values starting from 0. Further assume that this file should be accessed in 10 blocks of size 2 with a stride of 10. The corresponding part of the program looks like follows:

MPI_File_open (MPI_COMM_WORLD, "ufs:file1", MPI_MODE_CREATE |
MPI_MODE_RDWR, MPI_INFO_NULL, &fh );

MPI_Type_vector (10,2,10,MPI_INT,&newtype1);
MPI_Type_commit(&newtype1);
MPI_File_set_view (fh, 0, MPI_INT, newtype1, "native",
MPI_INFO_NULL);


In that example the displacement is 0. This means that the view begins at position 0 of the file. The displacement can also be used to skip some header information or to specify a second view of a file which begins at a position other than 0 and can be accessed in a different pattern. See Figure 6.6.

Example: Again we assume our file which is an integer array consisting of 100 values. We now want to define a view for the second part of the file such that we can access every second value. The new derived datatype and the corresponding view look like follows:

MPI_Vector (25,1,2, &newtype2);
MPI_Commit (&newtype2);
MPI_File_set_view (fh, 200, MPI_INT, newtype2, "native",
MPI_INFO_NULL);


Since the displacement is given as an absolute offset in bytes from the beginning of the file, the value 200 sets the file view to the 50th element of the file. This is only true if the size of an integer type is 4 bytes.

##### Data Access

On analyzing different aspects of file manipulation we will now discuss the three fundamental aspects of data access which are as follows:

• positioning (explicit offset vs. implicit file pointer)

• synchronism (blocking vs. non-blocking and split collective)

• coordination (collective vs. non-collective)

Positioning: All MPI routines that use explicit offsets for accessing data contain _at in their name (e.g. MPI_File_read_at). Rather than manipulating individual file pointers or shared file pointers data is accessed at the position defined by an argument. The different positioning methods can be mixed in the same program without effecting each other. This means that changing a file by an explicit offset will not affect the individual file pointer since no file pointer is used or updated.

Synchronism: Besides blocking I/O routines MPI also supports non-blocking I/O routines which are named MPI_File_ixxx, where i refers to immediate. Similar to MPI, a separate request complete call (MPIO_Wait, MPIO_Test) is needed to complete the I/O request. However, this does not mean that the data is written to a ”permanent” storage. The MPI-IO function that deals with that case is MPI_File_sync which we will discuss later on.

Coordination: The collective counterpart to the non-collective data access routine MPI_File_xxx is MPI_File_xxx_all. This means that a collective call is only completed when all other processes that participate in the collective call have completed their tasks. A restricted form of ”non-blocking” operations for a collective data access is called split collective. Rather than using one routine like MPI_File_xxx_all, a pair of MPI_File_xxx_begin and MPI_File_xxx_endis used. Thus, a single collective operation is separated by two routines where the begin routine can be compared to a non-blocking data access like MPI_File_write and the end routine acts like a matching routine which completes the operation, e.g. MPI_Wait or MPI_Test. In addition, the counterparts to the MPI_File_xxx_shared routines are MPI_File_xxx_ordered.

###### Conventions of Data Access

Every file which is used by MPI-IO commands is referred to by its file handle fh. The parameters buf, count, and datatype specify how the data is stored in memory. Note that similar to a receive, datatype must not contain any overlapping regions. Finally, status returns the amount of data accessed by a particular process. In detail, the number of datatype entries and predefined elements can be retrieved by calling MPI_get_count and MPI_get_elements.

##### Data Access With Explicit Offset (_AT)

The routines listed in that section can only be used if MPI_MODE_SEQUENTIAL was not specified.

###### Blocking, non-collective:
 int MPI_File_read_at (MPI_File fh, MPI_Offset offset, void *buf, int count, MPI_Datatype datatype, MPI_status status) IN fh file handle IN offset file offset OUT buf initial address of buffer IN count number of elements in the buffer IN datatype datatype of each buffer element OUT status status object

Description:
Reads a file beginning at position offset. Status contains the amount of data accessed by that routine.

Annotation:
The offset for all explicit offset routines is given in units of etype rather than in bytes. Furthermore, offset expresses the position relative to the beginning of a file.

 int MPI_File_write_at (MPI_File fh, MPI_Offset offset, void *buf, int count, MPI_Datatype datatype, MPI_status status) INOUT fh file handle IN offset file offset IN buf initial address of buffer IN count number of elements in the buffer IN datatype datatype of each buffer element OUT status status object

Description:
Writes a file beginning at position offset. Status contains the amount of data accessed by that routine.

###### Collective Versions:

Since the semantic of the collective routines are the same as for their non-collective counterparts, only the synopsis of the routines are printed here:

int MPI_File_read_at_all (MPI_File fh, MPI_Offset offset, void *buf, int count, MPI_Datatype datatype, MPI_status status)

int MPI_File_write_at_all (MPI_File fh, MPI_Offset offset, void *buf, int count, MPI_Datatype datatype, MPI_status status)

###### Non-blocking Versions:

Note that non-blocking I/O calls only initiate I/O operations but do not wait for them to complete. Thus, separate request complete calls like MPIO_WAIT or MPIO_TEST are needed. The example in the next section will demonstrate the use of non-blocking functions.

 int MPI_File_iread_at (MPI_File fh, MPI_Offset offset, void *buf, int count, MPI_Datatype datatype, MPI_Request request) IN fh file handle IN offset file offset OUT buf initial address of buffer IN count number of elements in the buffer IN datatype datatype of each buffer element OUT request request object
 int MPI_File_iwrite_at (MPI_File fh, MPI_Offset offset, void *buf, int count MPI_Datatype datatype, MPI_Request request) INOUT fh file handle IN offset file offset IN buf initial address of buffer IN count number of elements in the buffer IN datatype datatype of each buffer element OUT request request object
##### Data Access With Individual File Pointers

One individual file pointer per process per file handle defines the offset of the data to be accessed. Since the semantics are the same as in the previous section we will not go into detail with describing the routines and refer to table.

int MPI_File_read (MPI_File fh, void *buf, int count, MPI_Datatype datatype, MPI_Status status)

int MPI_File_write (MPI_File fh, void *buf, int count, MPI_Datatype datatype, MPI_Status status)

 int MPI_File_seek (MPI_File fh, MPI_Offset offset, int whence) INOUT fh file handle IN offset file offset IN whence update mode (state)

Description:
Updates the individual file pointer according to whence whereas following features are possible:

• MPI_SEEK_SET: file pointer is set to offset

• MPI_SEEK_CUR: file pointer is set to current pointer position plus offset

• MPI_SEEK_END: file pointer is set to the end of the file plus offset

Annotation:
It is important to mention that the offset is not given in bytes but in units of etype defined by the view of the file.

 int MPI_File_get_position (MPI_File fh, MPI_Offset *offset) IN fh file handle OUT offset offset of file pointer

Description:
Returns the current position of the individual file pointer in etype units relative to the current view.

 int MPI_File_get_byte_offset (MPI_File fh, MPI_Offset offset, MPI_Offset *disp) IN fh file handle IN offset offset of filepointer OUT disp absolute byte position of offset

Description:
Converts a view-relative offset which is given in etype units into an absolute byte position relative to the current view.

Besides blocking, non-collective routines which we presented here, non-blocking or collective routines are defined in the MPI-IO standard as well.

Example: In this example we want to demonstrate that in general each I/O operation leaves the file pointer pointing to the next data item after the last one that is accessed by the operation. In other words, file pointers are updated automatically. Again we use our file which is an array consisting of 100 values. We assume that the first ten values should be stored in the array buf1[] and the following 10 values in the array buf2[]. Note that after the first non-blocking read the file pointer is adjusted automatically to position 10 in the array. Furthermore, the file pointer is not updated by a routine with an explicit offset.

MPI_Status status1,status2;
MPIO_Request request1, request2;

MPI_File_open (MPI_COMM_WORLD, "ufs:file1", MPI_MODE_CREATE |
MPI_MODE_RDWR, MPI_INFO_NULL, &fh );
MPI_File_set_view(fh, 0, MPI_INT, MPI_INT, "native",
MPI_INFO_NULL);

/* File pointer points to position 0 of the file */

/* File pointer points to position 10 since 10 values are read by the
previous routine */

/* File pointer points to position 20 of the file since another 10
MPI_File_read_at(fh, 51, buf3, 10, MPI_INT, &request1);

/* File pointer still points to position 20 since previous read is a
routine with explicit offset that does not update the file pointer */

MPIO_Wait(&request1,&status1);
MPIO_Wait(&request2,&status2);
MPI_File_close(&fh);


On executing the program the arrays contain following values:

buf1: 0 1 2 3 4 5 6 7 8 9
buf2: 10 11 12 13 14 15 16 17 18 19
buf3: 50 51 52 53 54 555 56 57 58 59
buf4: 20 21 22 23 24 25 26 27 28 29

##### Split Collective Routines

We have already defined a split collective routine as a restricted form of ”non-blocking collective” I/O operations. Before we present the interface routines let us first state the most important semantic rules according to the MPI-IO standard:

• Each file handle on any MPI process must not have more than one active split collective operation at any time

• Begin calls are collective and must be followed by a matching collective end call

• Regular collective operations like MPI_FILE_WRITE_ALL on one process do not match split collective operations on another process.

• Split collective routines must not be used concurrently with collective routines

MPI_File_write_all_begin(fh,...);
...
/* This collective routine is used concurrently with a split
collective routine */
...
MPI_File_write_all_end(fh,...);


Again the semantics of these operations are the same as for the corresponding collective operations. We therefore only present one example of split collective routines:

 int MPI_File_read_at_all_begin (MPI_File fh, MPI_Offset offset, void *buf, int count, MPI_Datatype datatype) IN fh file handle IN offset file offset OUT buf initial address of buffer IN count number of elements in the buffer IN datatype datatype of each buffer element
 int MPI_File_read_at_all_end (MPI_File fh, void *buf, MPI_status status) IN fh file handle OUT buf initial address of buffer OUT status status object
##### Data Access With Shared File Pointers

The offset in the data access routine is described by exactly one shared file pointer per collective MPI_FILE_OPEN. Thus, the file pointer is shared among the processes in a particular communicator group. Again, the same semantics are used as in the previous sections with some exceptions:

• All processes must use the same view.

• Using collective shared file pointers (_ORDERED) guarantees a serialized order of multiple calls. In other words, the access to the file is determined by the rank of the processes in the group. In contrast, non-collective shared file pointers (_SHARED) yield a serialization ordering which is non-deterministic.

The functions listed below shall be regarded as a proposal since they are not supported by ViMPIOS yet.

###### Non-collective Routines:

int MPI_File_read_shared (MPI_File fh, void *buf, int count, MPI_Datatype datatype, MPI_Status status)

int MPI_File_write_shared (MPI_File fh, void *buf, int count, MPI_Datatype datatype, MPI_Status status)

int MPI_File_seek_shared (MPI_File fh, MPI_Offset offset, int whence)

int MPI_File_get_position_shared (fh, offset)

###### Collective Routines:

int MPI_File_read_ordered (MPI_File fh, void *buf, int count, MPI_Datatype datatype, MPI_Status status)

int MPI_File_write_ordered (MPI_File fh, void *buf, int count, MPI_Datatype datatype, MPI_Status status)

##### Consistency and Semantics

We can distinguish three different levels of consistency [47]:

• sequential consistency among all processes using a single file handle

• sequential consistency among all processes using file handle created from a single collective open with atomic mode enabled

• user imposed sequential consistency

Sequential consistency is defined as a set of operations that seem to be performed in some serial order consistent with the program order. In contrast, user-imposed consistency can be yielded due to the program order or calls to MPI_FILE_SYNC.

 int MPI_File_set_atomicity (MPI_File fh, int flag) INOUT fh file handle IN flag true to set atomic mode, false to set non-atomic mode

Description:
Consistency semantics are guaranteed by a collective call of all processes in one group. In detail, any read or write operation to a file can be regarded as an atomic operation.

Features:

• collective call (values for fh and flag must be the same for all processes in one group)

 int MPI_File_get_atomicity (MPI_File fh, int flag) IN fh file handle OUT flag true if atomic mode, false if non-atomic mode

Description:
Returns either true or false according to the atomicity.

 int MPI_File_Sync (MPI_File fh) INOUT fh file handle

Description:
All previous writes to fh by the calling process are updated. Furthermore, all updates of other processes are visible as well. Thus, a subsequent read which is executed by the calling process returns the actual data in the file rather than any ”dirty read”.

Features:

• collective routine

Example: This example demonstrates consistency of a file which is written by process 0 and read by process 1. In order to guarantee that process 1 does not read any wrong data the so-called sync-barrier-sync construct is used for the following reason [47]:

• MPI_Barrier ensures that process 0 writes to the file before process 1 reads the file

• MPI_File_sync guarantees that the data written by all processes is transferred to the storage device.

• The second MPI_File_sync ensures that all data which has been transferred to the storage device is visible to all processes.


if (myid==0)
/* process 0 write to the file */
{
MPI_File_open (MPI_COMM_WORLD, "ufs:file2", MPI_MODE_CREATE |
MPI_MODE_RDWR, MPI_INFO_NULL, &fh );
MPI_File_set_view(fh, 0, MPI_INT, MPI_INT, "native",
MPI_INFO_NULL);

for (i=0; i<1000; i++)
buf1[i]=i;

MPI_File_write (fh, buf1, 1000, MPI_INT, &status);

MPI_File_sync(fh);
MPI_Barrier (MPI_COMM_WORLD);
MPI_File_sync(fh);

MPI_File_close(&fh);
}
else
/* other processes read the updated file */
{
MPI_File_open (MPI_COMM_WORLD, "ufs:file2", MPI_MODE_CREATE |
MPI_MODE_RDWR, MPI_INFO_NULL, &fh );
MPI_File_set_view(fh, 0, MPI_INT, MPI_INT, "native",
MPI_INFO_NULL);

MPI_File_sync(fh);
MPI_Barrier (MPI_COMM_WORLD);
MPI_File_sync(fh);

MPI_File_read (fh, buf1, 1000, MPI_INT, &status);
MPI_File_close(&fh);
}


Although the second sync seems to be redundant, omitting it would yield an erroneous program.

### 6.3 ViMPIOS: Implementation of MPI-IO

#### 6.3.1 Initializing and Finalizing MPI-IO

On giving the information on MPI-IO and the ViPIOS-interface we will now have a look at how the ViMPIOS is implemented. In this section we want to describe the internal handling of establishing a connection to the ViPIOS server in order to use the functionalities of MPI-IO. According to [47] no explicit routine is required for initializing and finalizing an MPI-IO session. Thus, a call to MPI_Init() must establish a connection to the ViPIOS server as well. Moreover, a call to MPI_Finalize() has to disconnect from the ViPIOS server. In order to guarantee this, every C application program using MPI-IO needs to include the header file vip_mpio_init.h which looks like follows:

#define MPI_Init(argc,argv) MPIO_Init_C(argc,argv);
#define MPI_Finalize() MPIO_Finalize();


The routines are defined in the file vip_mpio.c:

int MPIO_Init_C(int *argc,char ***argv)
{
return (   ((MPI_Init(argc,argv)==MPI_SUCCESS) &&
ViPIOS_Connect(0)) MPI_SUCCESS : -1 );
}

int MPIO_Finalize(void)
{
return ( (ViPIOS_Disconnect()&&(MPI_Finalize()==MPI_SUCCESS))
? MPI_SUCCESS : -1 );
}


Note that this is only true for C applications. A discussion on using Fortran application programs is given in a later section.

#### 6.3.2 File Management

In this chapter we will analyze how the file information is handled. In particular, each MPI-IO file contains certain information about the filename, the file handle, the access mode etc. All that information is stored by means of following struct:

typedef struct {
MPI_Comm comm;
char *filename;

int ViPIOS_fh;
Access_Desc *view_root,
*descriptor;
bool view_is_set;

MPI_Offset disp;
MPI_Datatype etype;
MPI_Datatype filetype;

int contig_filetype;
int access_mode;
int atomicity;
} File_definition;


The first entry to that struct is the communicator comm. When we recall the syntax of the routine MPI_File_open we find out that each file is opened by a group of processes which are referred to by a communicator. Thus, it is possible to determine whether a file is opened by only one process - this is true if MPI_COMM_SELF is used as an argument in MPI_File_open - or whether the file is opened by a set of processes. This information becomes vital for collective routines which we will discuss later on.

The next two parameters are the name of the file filename and the file handle ViPIOS_fh. Latter is assigned by the routine ViPIOS_Open. view_root is a pointer to the structure Access_Desc which is defined in the kernel of ViPIOS. The purpose of that structure is to access a file in strides. Furthermore, the variable view_is_set specifies whether a view to the file currently opened is set or not. We will explain the variables view_root and descriptor in more detail when we discuss the implementation of MPI_FILE_VIEW.

Disp, etype and filetype hold the information about the file view. The variable contig_filetype defines whether the filetype of the view is contiguous. In other words, there are no so-called holes in the file which cannot be accessed by an application. Thus, no additional algorithm is required for computing non-accessible parts of the file.

Access_mode contains the access rights to a file, e.g. read only, write only, etc. The last variable already_accessed states whether any file operation was carried out on the file. This becomes important when the displacement of the file view has to be computed.

In order to guarantee that an application can open several files at any time, the information of each file must be administrated carefully, i.e. when an application accesses a file with a certain file handle, the interface must be able to retrieve the information on that particular file. ViMPIOS uses a library which is also applied within the ViPIOS kernel. The idea is as follows: When a new file is opened, a table is created which is a dynamic array of integer values. Each index of that table is a pointer to the structure File_Definition which we analyzed in the previous section. Let us briefly explain how this table is maintained when a file is opened, accessed and closed.

In order to access our structure File_defintion a new variable must be defined. This is done in the function MPI_File_open:

File_definition help_fh;


In the next step the table which administrates all files must be created:

static bool first;

if (first) {
first=FALSE;
std_tab_init(10,5,sizeof(File_definition),&File_table);
}


First is a static variable which determines whether the function MPI_File_open is called for the first time. According to that result the table is initialized by std_tab_init(). The first parameter states that storage space for 10 entries to the table is allocated. The second parameter defines how much storage space is allocated at a later stage. In other words, storage space for 10 elements of the size sizeof(File_definition) is allocated at first. When the table is full, storage space for another 5 elements is allocated etc. The name of the table is denoted by the last parameter of that routine, namely File_table.

Let us assume that a file is correctly opened by:

ViPIOS_Open(filename,amode,&help_fh.ViPIOS_fh);


This means that the ViPIOS server determines a file handle which is stored in help_fh.ViPIOS_fh.

Once storage space of the table is allocated by the routine std_tab_init() and the structure File_definition is filled with values, i.e. the file name, the access mode etc. are assigned, the file handle and the corresponding file information can be added to the file administration table. This is done by the following routine:

std_tab_append (&File_table, &help_fh, fh);
n_open_files++;


This routine adds a new element to the table File_table. help_fh is a pointer to the element which has to be inserted. In our case it is the pointer to the struct File_defintion which holds the information on the current file (filename, file handle). The parameter fh can be regarded as the index of the table which can be used as a key to retrieve information on the file. Note that fh is not the actual file handle which is returned by the ViPIOS_Open call but the index to the table. The actual file handle of the file is retrieved by help_fh-ViPIOS_fh as we will see later on. The variable no_open_files holds the information about the number of files currently opened.

Let us take a look at a small example in order to explain the functionality of the routines discussed so far. Assume that an application program opens three files. As we have already stated at an earlier stage, the file handle for each file is returned by the ViPIOS_Open function call. For our example assume the file handles help_fh-ViPIOS_fh 45,46,47. On initializing the file table, those three values can be added. Thus, index 0 points to the file referred to by file handle help_fh-ViPIOS_fh=45. Index 1 of the file table denotes the file with the file handle 46 and so on. It is important to mention that the file handles which are used by the application are these indices rather than the actual file handles defined by ViPIOS_Open.

On explaining the creation of the file table we will now discuss how information about a certain file can be retrieved. Assume that our application program wants to read data from the file with the actual file handle ViPIOS_fh=46. According to our previous explanation the application program uses the file handle fh 1, i.e. the second index of our table which points to the file with the file handle 46, rather than the actual file handle help_fh.ViPIOS_fh. Thus, before the data of the file with the file handle 46 can be read following steps are necessary. The code can be found in MPI_File_read(fh,…):

File_defintion *help_fh;
std_tab_get(File_table, fh, (void **) &help_fh);


The function std_tab_get returns the element defined by the index fh. In our case fh=1 which points to the file handle 46. A correct call to ViPIOS_Read looks like follows:

ViPIOS_Read(help_fh->ViPIOS_fh,...);


Thus, the ViPIOS routine is called with the actual file handle help_fh-ViPIOS_fh rather than the entry to the file table.

To sum up, all so-called MPI-IO routines use the index of the file table as their file handle. The actual file handle is derived from the file table and is only used when the ViPIOS routines are called. Moreover, every time an MPI-IO function other than MPI_File_open is called, the function std_get_table() must be evoked in order to retrieved the information about a particular file.

When the application program closes a file, the entry to the file table must be deleted as well which is done in the routine MPI_File_close. However, before the file table entry can be deleted, the function std_tab_get() must be called in order to determine which entry has to be deleted. The index is stored in fh. On closing the file with a call to the routine ViPIOS_close the routine std_tab_del() is called:

std_tab_del(&File_tab, *fh);
n_open_files--;


This routine simply deletes the element with the index fh from the file table and decrements the counter no_open_files which holds the number of file currently opened.

Finally, the whole file table is removed and the allocated storage space is freed when the last open file was closed:

if (n_open_files==0)
std_tab_clean(&File_table);


Besides creating, filling and closing the file management table, a further routine is implemented. Before any data access on a file can be performed, a special routine checks whether a corresponding file table entry exists. Thus, following routine is called

if (std_tab_used(File_table)==0)
{
return -1;
}


#### 6.3.3 File Manipulation

In this section we want to discuss how the routines for file manipulation are implemented in the ViMPIOS. Let us start off with MPI_File_open. Besides managing the file table we described in the previous chapter, this routine checks whether the restrictions for the collective mode are obeyed. In particular, files are opened in a collective way when the communicator is not MPI_COMM_SELF. Thus, if this is true, a message is broadcast to all members of the communicator group in order to check whether all processes use the same filename and access mode.

In addition to closing a file, MPI_File_close removes a file if the access mode MPI_MODE_DELETE_ON_CLOSE is set:

if (amode & MPI_MODE_DELETE_ON_CLOSE)
ViPIOS_Remove(filename);


The routines MPI_File_delete, MPI_File_set_size MPI_File_get_size and MPI_-File_get_amode need no further explanation since they simply call the corresponding ViPIOS interface routines, namely ViPIOS_Remove, ViPIOS_File_-set_size and ViPIOS_File_get_size. The routine MPI_File_preallocate has basically the same functionality as the routine MPI_File_set_size. The only difference is that the file size is not truncated if the value for preallocating memory space is smaller than the actual file size.

The code fragment of MPI_File_get_group for returning the communicator group of the specified file is:

int MPI_File_get_group(MPI_File fh, MPI_Group *group)
{
...
return MPI_Comm_group(help_fh->comm, group);
}


Finally, the routine MPI_File_get_amode returns the access mode of a particular file which is currently opened:

int MPI_File_get_amode (MPI_File fh, int *amode)
{
...
*amode=help_fh->access_mode;
...
}

##### File Views

In this chapter we want to describe how an MPI-IO view is realized in ViMPIOS. Before any view can be set a derived datatype has to be specified, for example

MPI_Datatype vector1;

MPI_Type_vector (5,2,10, MPI_INT, &vector1);
MPI_Type_commit(&vector1);


Next the view can be set

MPI_File_set_view(fh, 0, MPI_INT, vector1, "native",
MPI_INFO_NULL);


In our example we know the our filetype vector1 is a derived datatype MPI_TYPE_VECTOR. Now we want to extract the variables count, blocklen, stride and oldtype in order to map the datatype to the ViPIOS access descriptor. In particular we have to search for the structure which stores the information of MPI derived datatypes. One way to do so is to take a look at the MPI implementation of the derived datatype MPI_TYPE_VECTOR. However, we will omit printing the whole source code but only analyse its content. The interested reader is refored to the lines of code of the MPI implementation.

On checking for bad arguments the derived datatype is checked for being contiguous. This is true if blocklen and stride have the same value or if count is 1. Then, the derived datatype MPI_Type_vector can be reduced to MPI_TYPE_CONTIGUOUS. Moreover, each MPI_Type_vector is reduced to MPI_Type_vector. Thus, the stride is evaluated in bytes rather than in multiples of oldtype.

Since extent is a variable of old_dtype_ptr which in turn is a pointer to the struct MPIR_DATATYPE we have to search for the definition of that structure. In the source code of MPI it can be found in the header file datatype.h which describes all MPI datatypes, i.e. basic datatypes as well as derived datatypes:

struct MPIR_DATATYPE {
MPIR_NODETYPE dte_type; /* type of datatype element
this is */
item */
int          committed; /* whether committed or not */
int          is_contig; /* whether entirely contiguous */
int              basic; /* Is this a basic type */
int          permanent; /* Is this a permanent type */
MPI_Aint        ub, lb; /* upper/lower bound of type */
MPI_Aint real_ub, real_lb; /* values WITHOUT TYPE_UB/
TYPE_LB */
int             has_ub; /* Indicates that the datatype has
a TYPE_UB */
int             has_lb; /* Indicates that the datatype has
a TYPE_LB */
MPI_Aint        extent; /* extent of this datatype */
int               size; /* size of type */
int           elements; /* number of basic elements */
int          ref_count; /* nodes depending on this node */
int              align; /* alignment needed for start of
datatype */
int              count; /* replication count */
MPI_Aint        stride; /* stride, for VECTOR and HVECTOR
types */
MPI_Aint      *indices; /* array of indices, for (H)INDEXED,
STRUCT */
int           blocklen; /* blocklen, for VECTOR and HVECTOR
types */
int         *blocklens; /* array of blocklens for (H)INDEXED,
STRUCT */
struct MPIR_DATATYPE *old_type,
**old_types,
*flattened;
MPI_Datatype self;      /* Index for this structure */
#ifdef FOO
MPI_Datatype old_type;  /* type this type is built of,
if 1 */
MPI_Datatype *old_types;/* array of types, for STRUCT */
MPI_Datatype flattened; /* Flattened version, if available */
#endif
};

extern void *MPIR_ToPointer ANSI_ARGS(( int ));

#define MPIR_GET_DTYPE_PTR(idx) \
(struct MPIR_DATATYPE *)MPIR_ToPointer( idx )



Let us pick out the variables which are important for the datatype MPI_Type_-hvector. dte_type holds the information about the kind of datatype. In our example the variable contains the data MPIR_HVECTOR. committed states whether the derived datatype was committed in the application program by MPI_Type_commit. Since the stride in our datatype is greater than the number of elements (stride=10 blocklen=2), our datatype is not contiguous (is_contiguous=false). Furthermore, our datatype is no basic MPI datatype (basic=false). count holds the number of blocks and blocklen the number of elements of each block. We will explain the further variables when we actually need them for our implementation.

Since we have found the structure which holds the information about all MPI datatypes we can start our discussion about the implementation of MPI_File_set_view. First we define a pointer to that structure for retrieving the information of the access pattern stored in filetype of MPI_File_set_view:

struct MPIR_DATATYPE *view;
view=MPIR_GET_PTR(filetype);


Thus, the data access pattern stored in filetype which can be any combination of MPI basic or derived datatypes is determined by analyzing the struct MPIR_DATATYPE.

Apart from analyzing the filetype we also have to tell the ViPIOS about the file view. This means the information retrieved from MPIR_DATATYPE has to be mapped to the ViPIOS access descriptor which is handled by the routine get_view_pattern. Before this function can be called, storage space for the ViPIOS access descriptor has to be allocated and next_free which is defined as void *next_free must be set accordingly:

descriptor=malloc(1024);
next_free=descriptor;

##### The Mapping Function get_view_pattern

int get_view_pattern(struct MPIR_DATATYPE *view, Access_Desc *descriptor, void **free_space)

get_view_pattern is a recursive routine which extracts the information of filetype, i.e. the access pattern of the view, and maps it to the ViPIOS access descriptor. The recursion is called as long as the filetype is no basic MPI datatype. Furthermore, the return value is 1 if the access pattern is contiguous and 0 otherwise. First, the function checks whether the filetype is one of the following derived datatypes:

• MPI_TYPE_CONTIGUOUS

• MPI_TYPE_HVECTOR

• MPI_TYPE_HINDEXED

• MPI_TYPE_STRUCT

In the chapter about derived datatypes we stated 6 different derived datatypes rather than 4. The function get_view_pattern checks only for four different ones since the datatype MPI_Type_vector is automatically reduced to MPI_Type_hvector, the same is true for MPI_Type_indexed and MPI_Type_-hindexed.

###### Mpi_type_contiguous

Let us start with the simplest derived datatype and explain how it is mapped to the ViPIOS access descriptor.

MPI_Type_contiguous(count,oldtype,newtype)

count holds the number of elements of the contiguous datatype. This value is retrieved by view-count. Thus, it can be mapped as follows:

descriptor->no_blocks=1;
descriptor->skip=0;
descriptor->basics[0].offset=0;
descriptor->basics[0].repeat=1;
if (!next->basic)
descriptor->basics[0].count=view->count;
else
descriptor->basics[0].count=view->count*next->extent;
descriptor->basics[0].stride=0;


Since we use just one homogenous datatype no_blocks and repeat are set to 1. What is more, skip and offset are set to 0. Some more explanation must be given for the parameter count which is set according to next-basic where next is an auxiliary variable for retrieving information of oldtype:

struct MPIR_DATATYPE *next;
next=view;
next=next->oldtype;


Assume a file which consists of 100 double values and we wish to define the following view:

MPI_Type_contiguous(10,MPI_DOULBE,&contig1);
MPI_Type_commit(&contig1);
MPI_File_set_view(....,contig1,...);


On applying the function get_view_pattern to that example the kind of data-type which is recognized is MPI_TYPE_CONTIGUOUS. The kind of datatype of oldtype which is MPI_DOUBLE can only be retrieved by our auxiliary variable next which points to the next level in the structure of the filetype. Thus, next-basic is true since oldtype is a basic datatype, namely MPI_DOUBLE. In other words, next-basic only returns true if the filetype does not consist of nested derived datatypes.

Moreover, since every access operation in ViPIOS is done in bytes, count must be multiplied by the size of the datatype of oldtype which is retrieved by next-extent. In our example view-count is 10, thus, the variable count of the ViPIOS access descriptor is set to 80 - provided the case that a the datatype double comprises eight bytes.

In short, if the view is a nested derived datatype, count gets the same values as retrieved from view-count, i.e. from the MPI struct MPIR_DATATYPE otherwise count is multiplied by the extent of oldtype.

After this little excursion about the variable count of the ViPIOS access descriptor we can now return to the remaining variables.

if (!next->basic)
{
descriptor->basics[0].subtype=*free_space;
get_view_pattern(view->oldtype,descriptor->basics[0].subtype,
free_space);
}
else
{
descriptor->basics[0].subtype=0;
return 1;
}


If filetype is a nested derived datatype, subtype points to the next free space allocated for the ViPIOS access descriptor. Furthermore, the recursive function is evoked again to map the next level of filetype. free_space is evaluated as follows:

free_space=(struct basic_block ) ((void) descriptor+sizeof
(Access_Desc));
descriptor->basics=*free_space;
free_space+=(descriptor->no_blocks*sizeof(struct basic_block));


After the first line of code free_space points to the beginning of the struct basic_block i.e. the first struct of the ViPIOS access descriptor. Since the space for the struct Access_Desc is skipped descriptor-basics is set to that position as well. In the third line free_space is adjusted, i.e. the space for one basic block is skipped since no_blocks is 1.

###### Mpi_type_hvector

On explaining the mapping mechanism of MPI_TYPE_CONTIGUOUS we can now resume with the next derived datatype, namely MPI_Type_vector:

int MPI_Type_hvector(count, blocklength, stride, oldtype, *newtype)

The values can be retrieved by view-count, view-blocklen and view-stride.

Similar to the previous datatype only one basic block is required and count is set to view-blocklen. In order to avoid any confusion we refer to the variables of the MPI struct MPIR_DATATYPE as view-count, view-blocklen etc. Thus, when we state, for example, count we refer to the ViPIOS access descriptor. Moreover, note that repeat corresponds to view-count and count to view-blocklen. This is the reason why count is mapped to view-blocklen rather than to view-count:

descriptor->no_blocks=1;
descriptor->skip=0;
descriptor->basics[0].offset=0;
descriptor->basics[0].repeat=view->count;
...
if (!next->basic)
descriptor->basics[0].count=view->blocklen;
get_view_pattern(view->oldtype,descriptor->basics[0].subtype,
free_space);
else
descriptor->basics[0].count=view->blocklen*next->extent;


In addition to assigning the values for count the recursive function is called again since the filetype is a nested datatype. However, in the program code the recursion is called after the mapping of the remaining values. Here we changed the order of the program code to some extent so that we can more easily explain the mapping technique.

As we have already stated in a previous chapter stride and view-stride are interpreted in a different way. stride denotes the gap between two data blocks of the vector whereas view-stride denotes the number of bytes from the beginning of one block to the beginning of the next one. stride is computed as follows:

stride=view->stride - view->blocklen * next->extent;


where next-extent holds the extent of oldtype. For example,

MPI_Type_hvector(2,5,40,MPI_INT,&vetor1);


The corresponding view is depicted in Figure 6.7:

The stride between the first and the second data block is view-stirde=40 which is mapped as:

stride=40-5*4=20; \\


since the extent of oldtype, i.e. MPI_INT, is 4. Subtype and the remaining auxiliary variables are adjusted as before.

###### Mpi_type_hindexed

Unlike the previous datatypes MPI_TYPE_HINDEXED is mapped by using several basic blocks since each data block can have a different size:

int MPI_Type_hindexed (count, *array_of blocklengths, *array_of_displacements, oldtype, *newtype);

The values can again be retrieved by view-count, view-blocklens[i] and view-indices[i] where i ranges from 0 to view-count-1, i.e. view-blocklen[2] states the length of the 3rd data block with a stride of view-indices[2].

Since we have 5 data blocks of different lengths we use 5 basic blocks (no_blocks= view-count where count, i.e. the size, of each block is:

if (!next->basic)
descriptor->basics[i].count=view->blocklens[i];
else
descriptor->basics[0].count=view->blocklens[i]*next->extent;


Next repeat is set to 1 and stride to 0. Finally, the offset for each basic block must be computed because the array view-indices denotes the displacement of each data block to the beginning of the datatype rather than the stride between two adjacent blocks. For example, the byte displacement of the second block is view-indices[1]=20. The gap between the second and the third block is computed by:

gap[2]=view-indices[2] - view-blocklens[1]*extent_of_oldtype - view-
indices[1]=40 - 2*4 - 20 = 12

The offset for the first basic blocks is:

descriptor->basics[0].offset=view->indices[0];


All remaining offsets are computed similar to the simplified formula given above:

if (!next->basic)
descriptor->basics[i].offset=view->indices[i] - descriptor->
basics[i-1].count*next->extent - view->indices[i-1];
else
descriptor->basics[i].offset=view->indices[i] - descriptor->
basics[i-1].count - view->indices[i-1];


This needs some more explanation. Let us start with the fist case were
next-basic is false, i.e. the view is a nested derived datatype. In this case count of the previous basic block (descriptor-basics[i-1].count) is multiplied by the extent of oldtype since count was not adjusted before. On the other hand, if the view is no nested derived datatype (next-basic=true) count was already adjusted, i.e. the size of that datatype was already multiplied by the extent of oldtype(count=view-blocklens[i]*next-extent). Thus, the size of the previous basic blocks needs no longer be multiplied by the extent of oldtype in order to compute the correct offset.

Recalling our example of the datatype MPI_TYPE_HINDEXED we want to compute the offset of the third basic block which is 12, i.e. the stride between the second and the third basic block. Further assume that our view is no nested datatype. Thus, the length of the previous basic blocks is already adjusted (it is 8 rather than 2, i.e. 2*extent of integer=8). The offset is computed according to the second case:

offset=40 - 8 - 20 = 12

The difference to our previous computation is that the length of the previous block is already adjusted (8) and needs no more modification.

###### Mpi_type_struct

The most complex derived datatype has the following syntax:

int MPI_Type_struct(count, *array_of_blocklentghs, *array_of_displacements, oldtype, *newtype)

Like we did for MPI_TYPE_HINDEXED we use several basic blocks, namely:

descriptor->no_blocks=view->count;
descriptor->skip=0;


count is also defined in the same way. However, the recursive function call differs to some extent. Since each block can consist of different datatypes, the first parameter of the function call is not view-oldtype but view-oldtypes[i] where i refers to the corresponding block. Thus, we use another auxiliary pointer and adjust it for each basic block:

next=view;
next=next->oldtypes[i];
...
if (!next->basic)
{
descriptor->basics[i].count=view->blocklens[i];
...
get_view_pattern(view->oldtypes[i],descriptor->basics[0].
subtype,free_space);
}
else
descriptor->basics[0].count=view->blocklens[i]*next->extent;


Again, repeat is set to 1 and stride to 0. More detailed information must be given about the mapping of offset.

The offset of the first basic block is:

descriptor->basics[0].offset=view->indices[0]


whereas the remaining offsets are computed as follows:

descriptor-basics[i].offset=view-index[i]- view-blocklen[i-1]*
extent_of_oldtypes[i-1]-view-indices[i-1]

Let us explain this by an example. Assume a file which consists of 5 integer, 2 double, and 50 character values. Further assume following derived datatype:

s_types[3]={MPI_INT,MPI_DOUBLE,MPI_CHAR};
s_blocklens[3]={3,2,16};
s_disps[0]=0;
s_disps[1]=20;
s_disps[2]=40;


The corresponding file view is depicted in Figure 6.9.

The values of the displacements are: view-indices[0]=0, view-indices[1]=20 and view-indices[2]=60. The corresponding offsets for each basic block of the ViPIOS access descriptor are computed as follows:

descriptor-basics[0].offset=0
descriptor-basics[1].offset=20 - 3*4 - 0 = 8
descriptor-basics[2].offset=60 - 2*8 - 20 = 24

When, for example, the offset of the third basic block is computed, extent of the previous oldtype must be retrieved. Thus, we need a further auxiliary variable:

prev=view;
prev=prev->oldtypes[i-1];


The actual code for computing the offset will be omitted here.

Up to now we assume that the etype of the view is a basic MPI datatype. However, in our introductory chapter we stated that an etype can be a derived datatype as well. Thus, a similar routine for extracting the information of the etype is required. The corresponding function is get_oldtype. To start off, the a pointer to the structure which stores the information on MPI datatypes is set in the routine MPI_File_set_view. In order to reduce computational overhead, the routine get_oldtype is only called if the etype actually is a derived datatype. Moreover, the variable orig_oldtype_etype gets the basic MPI datatype of the derived datatype. In other words, if the etype is the following vector:

MPI_Type_vector(10,5,20,MPI_INT,&vector}


orig_oldtype_etype will be MPI_INT. The remaining lines of code are an extract of the routine MPI_File_set_view:

struct MPIR_DATATYPE *oldtype_etype;

oldtype_etype=MPIR_GET_DTYPE_PTR(etype);
orig_oldtype_etype=6;

// only call the function if etype is a derived datatype
if (!oldtype_etype->basic)
orig_oldtype_etype=get_oldtype(oldtype_etype);
else
orig_oldtype_etype=oldtype_etype->dte_type;


Let us now take a look at the extraction function. We already know the features of the derived datatype such that we can quickly motivate the following lines of code. First, depending on the derived datatype, the corresponding branches are executed. Thus, the routine is called recursively until the analyzed datatype is a basic MPI datatype which is in turn returned to the calling function, namely MPI_File_set_view. Note that the code for the datatypes MPIR_HINDEXED and MPIR_STRUCT is more complex because these datatypes can consist of more than one oldtype.

int get_oldtype (struct MPIR_DATATYPE *old_datatype)
{
struct MPIR_DATATYPE *next;
int i;

next=old_datatype;

switch (old_datatype->dte_type)
{
case MPIR_CONTIG:
case MPIR_HVECTOR:
{
next=next->old_type;
if (!next->basic)
get_oldtype(old_datatype->old_type);
else
return next->dte_type;
break;
}

case MPIR_HINDEXED:
case MPIR_STRUCT:
{
for (i=0; i<old_datatype->count; i++)
{
next=next->old_types[i];
if (!next->basic)
get_oldtype(old_datatype->old_types[i]);
else
return next->dte_type;

}
}
default:
return 6;
}
}


We still need a mechanism which checks the etype and filetype for correctness. This means, that the original oldtype of the etype must correspond with the original oldtype of the filetype. For example, if etype is set to MPI_INT and filetype is set to MPI_Type_vector its oldtype must be MPI_INT as well. Thus, one more line must be added to the routine get_view_pattern which is responsible for verifying these parameters.

On analyzing how a view, which is specified by means of derived datatypes, is mapped to the ViPIOS access descriptor we can now resume our discussion on the implementation of the MPI-IO function MPI_File_set_view.

Since this function is collective, all processes must use the same extent of etype and the same data representation. At the moment ViMPIOS only supports the native data representation. Thus, if at least two processes are started by the application program, a message is broadcast to check for the same etypes. If the etypes differ among the processes, the application program is aborted. The last step of the routine MPI_File_set_view is to assign following parameters:

ViPIOS_fh->view_root=root;
ViPIOS_fh->disp=disp;
ViPIOS_fh->etype=etype;
ViPIOS_fh->filetype=filetype;
ViPIOS_fh->info=info;


The last parameter info is not implemented yet. Finally, the file pointer is set to position 0 within the file view.

#### 6.3.4 Data Access Routines

Let us start with the routine MPI_File_read. Unlike we pointed out in the introductory chapter on MPI-IO, the last parameter, namely status is of type MPIO_Status rather than MPI_Status. We will discuss its features later in this chapter.

On opening the table for the file management and checking the parameters from the application program, the next important step is to interpret the MPI datatype in a way which can be understood by the ViPIOS server. Thus, the routine convert_datatype is called which looks like follows:

void convert_datatype(MPI_Datatype datatype, int *count)
{
switch (datatype)
{
case MPI_SHORT:
(*count)*=sizeof(short int);
break;
case MPI_INT:
(*count)*=sizeof(int);
break;
case MPI_LONG:
(*count)*=sizeof(long int);
break;
case MPI_UNSIGNED_CHAR:
(*count)*=sizeof(unsigned char);
break;
case MPI_UNSIGNED_SHORT:
(*count)*=sizeof(unsigned short int);
break;
case MPI_UNSIGNED:
(*count)*=sizeof(unsigned int);
break;
case MPI_UNSIGNED_LONG:
(*count)*=sizeof(unsigned long int);
break;
case MPI_FLOAT:
(*count)*=sizeof(float);
break;
case MPI_DOUBLE:
(*count)*=sizeof(double);
break;
default: (*count)*=sizeof(char);
Ψ }
}


This routine has two main functions. On the one hand, it interprets an MPI datatype to a C datatype, on the other hand, the parameter count which states the number of elements to be read is interpreted in byte elements. For example, if the application program prompts to read count=10 values of MPI_DOUBLE, count this is set to 80 provided the case that the size of one double value is 8 bytes.

After having adjusted the parameter count, the ViPIOS interface can be called in order to read the values. Thus, depending on the filetype of the view, either ViPIOS_Read or ViPIOS_Read_struct is called. This means that if the file view is contiguous, i.e. there are no holes in the view, the normal ViPIOS_Read is called with following parameters:

ViPIOS_Read (help_fh->ViPIOS_fh, buf, count,-1);


The first parameter is the entry retrieved from the file management table. The second and the third parameters refer to the buffer and the number of elements to be read. Some more information must be provided for the last parameter. -1 signals the ViPIOS server that the file pointer shall be updated after this call. Thus, if the file pointer was at position 0 before reading the file, it points to position 80 after 80 byte values were read. By means of that parameter we can explain the routine MPI_File_read_at which is a so-called routine with an explicit offset. Assume, the file should be read starting from position 100. The corresponding call to the ViPIOS server would be:

ViPIOS_Read (help_fh->ViPIOS_fh, buf, count,100);


Since routines with explicit offsets shall not interfere with routines without explicit offsets, the file pointer is not updated after calling that routine.

After this little excursion to another data access routine we will resume our discussion of MPI_File_read. We have already mentioned that depending on the file view a special ViPIOS routine is called. Thus, if the file view is not contiguous, following call is made:

ViPIOS_Read_struct(help_fh->ViPIOS_fh,buf,count,
help_fh->view_root,help_fh->disp,-1);


The parameter help_fh-view_root is a pointer to the structure which handles the access pattern of the file. help_fh-disp holds the information about the displacement of the view. The remaining parameters have the same meaning as we discussed above.

Although we have already talked about MPI_File_read_at we still need to say something about the offset. Remember that every offset in MPI-IO is interpreted in units of etype. Thus, if our etype is, for instance, MPI_DOUBLE, the offset must be multiplied by the extent of MPI_DOUBLE. Restricting etypes to basic MPI datatypes allows us to deploy the function convert_datatype we discussed above. The routines MPI_File_read_all and MPI_-File_read_at_all have the same implementation features as their non-collective counterparts. The only exception is a barrier at the end of the routine which synchronizes all processes in the group. This means, that the application can only resume after all processes have executed the collective read operation.

Recall that we have stated at the beginning of the section that the parameter status is of type MPIO_Status rather than MPI_Status. The ViMPIOS internal structure of MPIO_Status looks like follows:

typedef struct
{
int fid;
int count;
}
MPIO_Status;


On calling the ViPIOS interface for any data access routine, for example, ViPIOS_Read, the number of bytes which were actually read are stored by the ViPIOS server in the structure MPIO_Status. Thus, the file identifier must be stored in the structure which is done in all blocking data access routines:

status->fid=help_fh->ViPIOS_fh;


The number of bytes which were actually accessed by a particular data access routine can be retrieved by the routine MPI_File_get_count. Since we have not discussed its synapses in the introductory chapter on MPI-IO, we will take a look at the whole program code:

int MPI_File_get_count (MPIO_Status *status,
MPI_Datatype datatype, int *count)
{
int res;

if (status->fid<0)
{
printf("\nFile handle is not specified since non-
blocking request has not finished correctly!");
res=0;
}
else
res=ViPIOS_File_get_count(status->fid,count);

return ( (res==1) ? MPI_SUCCESS :  -1) ;
}


After calling the routine ViPIOS_File_get_count with the ViPIOS-file handle stored in status-fid the parameter count holds the number of bytes which were actually accessed.

Besides blocking access operations ViMPIOS also supports non-blocking ones. In particular, MPI_File_iread and MPI_File_iread_at as well as MPI_File_iwrite and MPI_File_iwrite_at. However, similar to ROMIO split collective routines are not supported yet. Note that the difference between the parameters of blocking and non-blocking routines is the last parameter, namely request, which can be regarded as the index for the particular non-blocking routine. In addition, the routine MPI_File_test allows testing whether the non-blocking routine has finished. Moreover, MPI_File_wait is used for waiting until the routine specified by the request number has finished.

Similar to the parameter status of the blocking data access routines, the type of the parameter request of the non-blocking calls is redefined as well. Thus, instead of MPI_Request ViMPIOS uses the type MPIO_Request, which is defined by:

typedef struct
{
int  reqid;    /* Request-Id */
int  fid;      /* File-Id */
}
MPIO_Request;


Moreover, each non-blocking data access stores the ViPIOS-file identifier into that structure which is needed for the routines MPI_File_wait, MPI_File_test and MPI_File_get_count:

request->reqid=help_fh->ViPIOS_fh;


On calling a non-blocking routine, the function MPI_File_test can be used for checking whether the data access operation is finished which is indicated by the parameter flag:

int MPI_File_test (MPIO_Request *request, int *flag,
MPIO_Status *status)
{
Ψ int res;

Ψ res=ViPIOS_File_Test(request->reqid, flag);

Ψ if (*flag==TRUE)
Ψ     status->fid=request->reqid;
Ψ else
status->fid=-1;

Ψ return ( (res==1) ? MPI_SUCCESS :  -1) ;
}


If the flag is set, i.e. the non-blocking data access operation has finished, the request identifier is stored in the structure of MPIO_Status otherwise the file identifier is set to -1. The reason for this functionality can be explained by looking at the parameters of MPI_File_get_count. Recall that the file identifier is required for determining the number of bytes which are accessed. Since this file identifier is retrieved from the parameter status which is not used for non-blocking operations the value must be set in the routine MPI_File_test. Thus, we yield two advantages. On the one hand, we can retrieve the number of bytes actually accessed, the other hand, if the non-blocking function has not finished, i.e. status-fid is set to -1, the ViPIOS interface ViPIOS_File_get_count needs not be called.

The routine MPI_File_wait is implemented in a similar way. However, the difference to the previous one is that it waits until the non-blocking operation specified by the request has finished.

int MPI_File_wait(MPIO_Request *request, MPIO_Status *status)
{
int res;

res=ViPIOS_File_Wait(request->reqid);
status->fid=request->reqid;

return ( (res==1) ? MPI_SUCCESS :  -1) ;
}



Since the implementation of the routines for writing data to a file - MPI_File_-write, MPI_File_write_at, MPI_File_write_at_all are analogous to the read operations we will omit a comprehensive explanation.

##### Scatter Functionality

Recalling the syntax of the data access routines we assumed so far that our data type is a basic MPI data type. However, even more complicated structures can be used in order to simulate a so-called scatter function. In particular this means that a file which is contiguous in memory can be read into the read buffer in a strided way. Let us take a look at an example in order to show the difference to the conventional read operation. Assume that our file resides on the disk in a contiguous way. Further assume that no file view is set. Performing following read operation yields a read buffer which exactly corresponds to the data stored on disk.

MPI_File_read(fh, buf, 100, MPI_INT, status);


This data access operation simply reads 100 integer values into the read buffer. In order to simulate the scatter mechanism, a derived datatype is used rather than a basic MPI datatpype.:

MPI_Type_vector(2,5,50,MPI_INT,&vector);
MPI_Type_commit(&vector);



Although the parameter count is set 1, more than 1 integer value is read. In particluar, 10 integer values are read since the derived datatype comprises 10 elements. However, the read buffer is not filled in contiguously but according to pattern described by the derived datatype. Thus, the first five elements of the read buffer are filled with the values which are read from the file and 45 values are skipped. Finally, the last 5 elements are read into the read buffer. On the whole, the read buffer comprises 55 elements whereas merely 10 integer values stem from the file stored on disk.

By means of that mechanism an even more complex data access is possible. Assume that you set a file view according to filetype_vector whereas the datatype for the read buffer corresponds to the derived datatype read_vector. Thus, a non-contiguous file can be stored in a different buffer which is non-contiguous as well. Let us demonstrate this case by means of an example:

MPI_Type_vector(3,10,20,MPI_INT,&filetype_vector);
MPI_Type_commit(&filetype_vector);

MPI_File_set_view(fh, 0, MPI_INT, filetype_vector, "native",
MPI_INFO_NULL);


The same mechanism can be applied for writing a file.

##### Further Data Access Routines

The routine for updating the file pointer MPI_File_seek is based on etype units as well. Thus, the offset must be converted in the same way as we have stated for the routines with explicit offsets, for example, MPI_File_read_at. Next, the actual seek operation is performed where we have to distinguish between two cases. As we have already stated, the file view can either be contiguous or non-contiguous. First case means that the so-called normal seek operation does not suffice. Thus, the ViPIOS server automatically realizes that internally a ViPIOS_Seek_struct is called. However, the only consequence for the MPI-IO implementation is to call the ViPIOS routine ViPIOS_Seek such that the displacement of the view is added to the offset:

if (help_fh->contig_filetype)
{

switch (whence)
{
case SEEK_SET:
res=ViPIOS_Seek (help_fh->ViPIOS_fh, offset+
help_fh->disp,whence);
break;
...
}
}
else
res=ViPIOS_Seek (help_fh->ViPIOS_fh, offset, whence);


The current position in etype units can be retrieved by the routine MPI_File_-get_position which in turn calls the routine ViPIOS_File_get_position. Since the value which is received from the ViPIOS server is given in bytes rather than in etype units, the byte value has to be converted which is done by the routine byte_to_etype:

int byte_to_etype(MPI_Datatype datatype, int count)
{
switch (datatype)
{
case MPI_SHORT:
count/=sizeof(short int);
break;
case MPI_INT:
count/=sizeof(int);
break;
case MPI_LONG:
count/=sizeof(long int);
break;
case MPI_UNSIGNED_CHAR:
count/=sizeof(unsigned char);
break;
case MPI_UNSIGNED_SHORT:
count/=sizeof(unsigned short int);
break;
case MPI_UNSIGNED:
count/=sizeof(unsigned int);
break;
case MPI_UNSIGNED_LONG:
count/=sizeof(unsigned long int);
break;
case MPI_FLOAT:
count/=sizeof(float);
break;
case MPI_DOUBLE:
count/=sizeof(double);
break;
default: count/=sizeof(char);
}
Ψ return count
}


This routine converts a view relative offset given in etype units into an absolute byte position relative to the current view. First, the offset which can be any multiple of etype must be converted into bytes. This is handled by the routine convert_datatype. If the filetype is contiguous, i.e. the view to the file does not contain any holes, the byte offset can be computed by:

*disp=offset + ViPIOS_fh-disp;

Otherwise the ViPIOS access descriptor which holds the information about the file view must be used to compute the byte position. Assume following file view (Figure 6.10).

In that example the offset 6 corresponds to the 11th byte position. If the view consists of nested derived datatypes, the byte position cannot be evaluated as straightforward as in that example. The routine Fill_access_descriptor, on the one hand, computes the extent of the view including all so called holes, i.e. the size for each block is stored in sub_count which is a variable of the ViPIOS access descriptor. On the other hand, it evaluates the number of elements (sub_actual which can actually be accessed, i.e. the size of the view without holes. Thus, the whole structure of the ViPIOS access descriptor is traversed recursively and sub_count as well as sub_actual are evaluated for each basic block.

Assume following values for the ViPIOS access descriptor:

Level 1:
no_blocks=1,
offset=0;
repeat=2;
count=5;
stride=5;

Level 2:
no_blocks=1;
offset=0;
repeat=3;
count=2;
stride=20;

Let us analyze the functionality of the recursive routine by means of the previous example. When the function is called for the first time, cur_basic-subtype is true, since our view consists of nested datatypes. sub_count=0 and read_count=0. Now the function is called recursively with the values Fill_access_desc(cur_basic-subtype,0,0).

This time cur_basic-subtype is false which means that the function is not called again but sub_count, read_count, cur_basic-sub_count and sub_actual are set to 1. Furthermore, following values are computed:

read_sum=2 * 5 * 1 = 10;
count=0 + 2 * (5 * 1 + 5) - 5 = 15;
count=15 + 0 = 15;

This means that the extent of the inner basic block is 15 whereas only 10 elements can actually be accessed since the stride of that access pattern is 5. However, the variables sub_count and sub_actual are 1 because no more basic blocks exist.

When the inner incarnation of the recursive function is completed, the outer incarnation is resumed after the function call. Thus, cur_basic-sub_count=15 and cur_basic-sub_actual=10, i.e. they are assigned the values which are computed by the inner incarnation and hold the information about the extent and the actual number of accessible elements of the sub basic block. Then, the extent of the whole access pattern (view) can be evaluated:

read_sum=3 * 2 * 10 = 60;
count=0 + 3* (2 * 15 + 20) - 20 = 130;
count = 130 + 0 = 130;

The extent of the outer basic block is 130 whereas 60 byte elements can actually be accessed.

On preparing these values we can now derive the absolute byte position from the relative one. This is handled by the routine Get_absolute_byte_position. Let us again explain the functionality by means of our previous example. First, the actual size of each basic block is evaluated:

cur_actual=cur_basic-repeat*cur_basic-count*cur_basic-sub_actual;

In our example the outer basic block consists of 3 blocks (repeat=3) with the size of count=2. Since the view is a nested derived datatype, the size of each element is not 1 but sub_actual=10. Thus, sub_actual retrieves the number of elements of the sub basic block. The number of elements of the basic block is:

cur_actual= 3 * 2 * 10 = 60;

Next, the start offset is checked whether it lies between the range of the first basic block (start_offset¡cur_actual). If start_offset is greater than cur_actual, the function can only resume if a further basic block exists otherwise the position of start_offset would lie beyond the range of the view.

The next example will demonstrate this case. Assume that our start_offset, i.e. relative byte position, is 23. The corresponding absolute byte position is 53. Since 23 lies within the range of the first basic block, we resume our computation. In contrast, if our relative offset were greater than 60 the absolute byte offset would be beyond the range of the view. However, the function Get_absolute_byte_position assumes that the view is circular, i.e. the new start offset is computed as long as it fits into the actual length of the view:

start_offset-=cur_actual;

For example, if the actual length of our view is cur_actual=60, the total extent sub_count=130 and our absolute start offset is 143, then the absolute byte position of 143 is computed as follows:

start_offset=143-60=83;
start=0 + 3 * (2 * 15 + 20) - 20 + 0 =130 (sub_count);
... start_offset=83-60=23;
start=130 + 3 * (2 * 15 + 20) - 20 + 0 = 260;
... start=260 + 53 = 313;

Thus, the absolute byte position of start_offset=143 is 303 when we assume that the view of the previous example is circular.

Let us now find out how the absolute position of start_offset=23 is evaluated. First, we have to find out the sub block which is referred by the offset.

blocks=start_offset / cur_basic-sub_actual;
23 / 10 = 2;

Since our view consists of repeat*count=3*2=6 sub blocks, we know that the absolute byte position must be in the range of the third sub block (blocks=2 whereas the first sub block is referenced by 0). Next we have to evaluate the byte position of the beginning of the third sub block. This is done by adding the size of the first two sub blocks to the stride between sub block 2 and 3 plus a possible offset:

*start= blocks*cur_basic-count + (blocks/cur_basic-count)
*cur_basic-stride + cur_basic-offset;
start= 2*15 + (2/2)*20 + 0 = 50;

Then, the offset within the third sub block is computed:

start_offset%=cur_basic-sub_actual
start_offset= 23 % 10 = 3;

Since cur_basic-subtype is true, i.e. the view consists of a nested derived datatype and consequently further sub blocks exist, the function is called recursively with the values start_offset=3, start=50. Then, the new values are evaluated accordingly:

cur_actual=2*5*1=10;
blocks= 3 / 1 = 3;
start= 50 + 31 + (3/5)*5 + 0 = 53;
start_offset= 3 % 1= 3;

As no further sub blocks exist the variable start holds the absolute byte position 53.

#### 6.3.5 File interoperability and Consistency semantics

The routine MPI_File_get_type_extent returns the extent of datatypes in the file, if the datatype is no NULL type.

The routines which handle the file consistency among parallel processes, namely MPI_File_set_atomicity and MPI_File_sync are provided by the interface but are not treated explicitly be the ViPIOS server since every data access operation in ViPIOS is atomic. Thus, non-atomic mode is not supported yet.

In our introductory chapter on MPI we have presented several MPI derived data-types. We have already mentioned that they can consist of multiple basic datatypes located either contiguously or non-contiguously in memory. Furthermore, the datatype created by a datatype constructor can be used as an input datatype for other derived datatypes. Thus, it is possible to build nested derived datatypes. The great advantage of such constructions is that noncontiguous data can be accessed with one command rather than reading the first chunk of data, skipping the data not required, reading the next chunk and so forth. In this section we will present two further derived datatypes which are part of the MPI-2 standard but not implemented in MPICH-1.1.

##### Subarray Datatype Constructor
 MPI_Type_create_subarray (int ndims, int *array_of_sizes, int *array_of_subsizes, int *array_of_starts, int order, MPI_Datatype oldtype, MPI_Datatype *newtype IN ndims number of array dimensions IN array_of_sizes number of elements of oldtype in each dimension of the full array IN array_of_subsizes number of elements of oldtype in each dimension of the subarray IN array_of_starts starting coordinates of the subarray in each dimension IN order array storage order IN oldtype array element datatype OUT newtype new datatype

This datatype allows describing an n-dimensional subarray of an n-dimen-sional array whereas the subarray can be located at any position within the array. Thus, a global array can be distributed onto several processors such that each one gets a certain section of the array. Assume a 12x12 array should be distributed onto 4 processes. Further assume that each processor gets 3 consecutive columns of the array as depicted in Figure 6.11.

By the help of that example we will explain this datatype. ndims defines the number of dimensions of the global array. In our case this value is 2. Moreover, this parameter specifies the number of elements of the next three parameters array_of_sizes[], array_of_subsizes[], array_of_starts[]. The size of the global array and the subarray are defined by array_of_sizes[] and array_of_subsizes. The location of the subarray within the global array is specified by array_of_starts[]. Hence, array_of_starts[0]=0, array_of_starts[1]=3 describes the subarray of the second processor which starts at the position 0 in the first dimension and at position 2 in the second dimension. Since C and FORTRAN use different orders for addressing arrays, the order can be defined by the parameter order and can either be MPI_ORDER_C, i.e. row-major order, or MPI_ORDER_FORTRAN, i.e. column-major order. The remaining parameters refer to the datatype of the global array and the name of the created derived datatype.

In the following example we present how this datatype can be used with MPI-IO. In our example we assume that the master process, i.e. process with rank 0, firstly writes the data to the file before the new derived datatype can be applied. Thus, on setting the file view, each process can only access the part of the global array according to Figure 6.11.


MPI_File     fh;
MPI_Datatype subarray;

int
rank,
subarray[4][3],
array_sizes[2],
subarray_sizes[2],
start_pos[2];

MPI_Init(&argc,&argv);
MPI_Comm_rank(MPI_COMM_WORLD,&rank);

/* specify the size of the global array and the subarray */
array_sizes[0]=12;
array_sizes[1]=12;
subarray_sizes[0]=4;
subarray_sizes[1]=3;

/* calculate location of the subarray according to the rank */
start_pos[0]=0;
start_pos[1]=rank*3;

MPI_Type_create_subarray(2, array_sizes, subarray_sizes,
start_pos,   MPI_ORDER_C, MPI_INT, &filetype);
MPI_Type_commit(&filetype);

/* each process reads a particular part of the file */
MPI_File_open (MPI_COMM_WORLD, "ufs:array_file", MPI_MODE_RDWR,
MPI_INFO_NULL, &fh );
MPI_File_set_view (fh, 0, MPI_INT, subarray, "native", MPI_INFO_NULL);
MPI_File_close(&fh);

MPI_Type_free(&subarray);
MPI_Finalize();

##### Distributed Array Datatype Constructor

The derived datatype we discussed previously allows accessing different subarrays of a global array where each subarray is regarded as one block. The following derived datatype supports HPF-like distribution patterns like BLOCK-BLOCK distribution or CYCLIC-CYCLIC distribution. Before we give a description of this derived datatype we will give a brief introduction to HPF-distribution patterns.

Basically two different distribution patterns are possible, namely BLOCK and CYCLIC. Let us start with a file which can be regarded as a one-dimensional array consisting of 16 elements. Further assume that we have a processor grid of four processors and we wish to distribute the file onto these four processors in the distribution pattern BLOCK(4). This means that each processor gets one contiguous block of the file consisting of 4 elements. The distribution pattern is depicted in Figure 6.12.

Another distribution pattern, for instance, is CYCLIC(1). This means that the data is distributed in a round robin fashion onto the four processors. In this case, processor 1 gets the first element of the file, processor 2 gets the second element of the file and so on. When each processor has got one element, this process is repeated again. Thus, in the second run processor 1 gets the fifth element, processor 2 the sixth and so on. Figure 6.13 demonstrates the result of this distribution pattern.

Now assume that our file corresponds to a 2-dimensional array rather than a 1-dimensional one. In particular, our file should be an 8x8 array. Thus, more complex distribution patterns are possible by combining the ones we discussed so far. For example, BLOCK-CYCLIC means that the distribution pattern in the first dimension is BLOCK whereas the distribution pattern in the second dimension is CYCLIC. In order to get familiar with these patterns, we will give some examples.

In our first example we assume an 8x8 array which shall be distributed in the first dimension according to BLOCK(4) and in the second one according to CYCLIC(2). Further assume that our processor grid consists of 4 processors such that each dimension comprises two processors. The result can be seen in Figure 6.14.

In the next example we want to distribute a 9x10 array according to the pattern CYCLIC(2), CYCLIC(2). In contrast to our previous example not each processor gets the same number of elements. The result can be seen in Figure 6.15.

On giving a brief introduction to HPF-like distribution patterns, we can now resume our discussion on the derived datatype for distributed arrays.

 MPI_Type_create_darray (int size, int rank, int ndims int *array_of_gsizes, int *array_of_distribs, int *array_of_dargs, int *array_of_psizes, int order MPI_Datatype oldtype, MPI_Datatype *newtype) IN size size of process group IN rank rank in process group IN ndims number of array dimensions as well as processor grid dimensions IN array_of_gsizes number of elements of oldtype in each dimension of the global array IN array_of_distribs distribution of array in each dimension IN array_of_dargs distribution argument in each dimension IN array_of_psizes size of process grid in each dimension IN order array storage order IN oldtype array element datatype OUT newtype new datatype

By means of MPI_Type_create_darray a datatype can be created that corresponds to the distribution of an ndims-dimensional array onto an ndims-dimensional array of logical processes. size states the number of processes in the group. ndims specifies the number of dimensions of the array to be distributed as well as the number of dimensions of the processor grid. The size of each processor grid is given in array_of_psizes[]. It is important to note the following equation must be satisfied:

 ndims−1∑i=0array of psizes[i]=size

Let us explain this with an example. Suppose that our global array is a 2-dimensional 16x16 array and we wish to distribute it onto size=4 processes. Since our global array consists of ndims=2 dimensions, our processor grid consists of two dimensions as well. Thus, we have three different possibilities for constructing our processor grid. The first dimension could consist of array_of_psizes[0]=1 processor and the second of array_of_psizes[1]=4 (1*4=4) processors. In this case the processor grid corresponds to shape a in Figure 6.16. In shape b both dimensions comprise 2 processors. Shape c is the opposite of shape a, i.e. dimension 1 consists of 4 processors whereas dimension 2 consists of 1 processor.

This processor grid serves as the basis for the distribution pattern specified by the array array_of_distribs[]. In particular, each dimension of the global array can be distributed in three different ways:

• MPI_DISTRIBUTE_BLOCK- corresponds to BLOCK distribution

• MPI_DISTRIBUTE_CYCLIC- corresponds to CYCLIC distribution

• MPI_DISTRIBUTE_NONE - this dimension is not distributed

The distribution argument of each dimension is stored in array_of_dargs[] and can either be a number or the constant MPI_DISTIRBUTE_DFLT_DARG whereas following assumption must be satisfied:

array_of_dargs[i] * array_of_psizes[i]=array_of_gsizes[i].

Let us again take a look at one example in order to see the functionality of that datatype. In particular we wish to take a look at the code fragment of an application program which distributes a 9x10 array onto four processes in the pattern CYCLCIC(2), CYCLIC(2) as we have seen in Figure 6.15. Assume that the array storage order is C.

MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

ndims=2
order=MPI_ORDER_C

/* size of the array to be distributed */
array_of_gsizes[0]=9;
array_of_gsizes[1]=10;

/* distribution and distribution argument */
array_of_distribs[0]=MPI_DISTRIBUTE_CYCLIC;
array_of_distribs[1]=MPI_DISTRIBUTE_CYCLIC;
array_of_dargs[0]=2;
array_of_dargs[1]=2;

/* processor grid */
array_of_psizes[0]=2;
array_of_psizes[1]=2;

MPI_Type_create_darray(nprocs, rank, ndims, array_gsizes,
array_of_distribs, array_of_dargs, array_of_psizes,
order, MPI_INT; &newtype);
MPI_Type_commit(&newtype);

##### Implementation of the Subarray Datatype Constructor

On getting used to the functionality of the derived datatypes we can now turn to describing the implementation of MPI_Type_create_subarray. The basic information for the implementation of this and the following derived datatype can be found in ROMIO.

First, the input parameters to that derived datatype must be checked for being correct. In particular none of the entries for the parameters ndims, array_of_sizes[], array_of_subsizes[] must be less than 1. Furthermore, array_of_subsizes[i] must not be less than array_of_sizes[i]. array_of_starts[i] must not be less than 0 or array_of_starts[i] (array_of_sizes[i]-array_of_sub-sizes[i]. Next the parameters oldtype and order are checked.

According to order the datatype is built by using the existing derived datatypes of MPI-1 we discussed in a previous chapter. Lets take a look at the implementation of the order = MPI_ORDER_C. Recall that in this case the last dimension of the array changes fastest since the ordering used by C arrays is row-major. For example, assume a 4x4 array. Thus, we address it by the indices [0][0], [0][1],[0][2],[0][3],[1][0] and so forth. The code fragment will help explaining the algorithm:

if (order == MPI_ORDER_C)
{
// dimension ndims-1 changes fastest
if (ndims == 1)
MPI_Type_contiguous(array_of_subsizes[0], oldtype, &tmp1);
else
{
MPI_Type_vector(array_of_subsizes[ndims-2],
array_of_subsizes[ndims-1],
array_of_sizes[ndims-1], oldtype, &tmp1);

size = array_of_sizes[ndims-1]*extent;

for (i=ndims-3; i>=0; i--) {
size *= array_of_sizes[i+1];
MPI_Type_hvector(array_of_subsizes[i], 1, size, tmp1, &tmp2);
tmp1 = tmp2;
}
}


Next, the dimension ndims of the derived data is checked. If ndims is 1, we can reduce the whole datatype to MPI_Type_contiguous since the file can only be accessed in a contiguous block with the size of array_of_subsizes[].

If ndims is greater than 1, the derived datatype can be built with the datatype MPI_Type_vector. Recall the example from the previous section where we distributed a 12x12 array onto 4 processes such that each process gets 3 consecutive columns of the array. Since the parameters to MPI_Type_vector are count, blocklength and stride, the section for the first process can be described by MPI_Type_vector(12,3,12). The pattern is depicted in Figure 6.17.

In general, count corresponds to array_of_subsizes[ndims-2], blocklength to array_of_subsizes[ndims-1] and stride to array_of_sizes[ndims-1]. If ndims is of an order higher than 2, a further derived datatype, namely MPI_Type_hvector, is used for describing the remaining dimensions.

Next, we have to add the displacement for each process. For example, process 2 should read the forth, fifth and sixth columns which is defined by disp[1]=array_of_starts[1]=3. Since each process is supposed to access a different block of the array, the start displacement disps[1] is different for each process. Finally, the derived datatype MPI_Type_struct is used for describing the section which should be accessed by each process. In particular, this datatype consists of three blocks whereas the second block contains the information of the access pattern we have described by MPI_Type_contiguous, MPI_Type_vector or MPI_Type_hvector. The first block is of the type MPI_LB and the third of MPI_UB. MPI_LB and MPI_UB are so-called ”pseudo-datatypes” which are used to mark the lower bound and the upper bound of a datatype. Since the extent, i.e. the span from the first to the last byte, of both datatypes is 0, they neither effect the size nor count of a datatype. The reason for using these datatypes is to define explicitly the lower and upper bound of a derived datatype. For more information we refer the reader to [46].

The implementation is shown in the following code fragment. However, a detailed description is given in the next section when we discuss the implementation of the derived datatype MPI_Type_create_subarray.

// add displacement and UB
disps[1] = array_of_starts[ndims-1];
size = 1;
for (i=ndims-2; i>=0; i--) {
size *= array_of_sizes[i+1];
disps[1] += size*array_of_starts[i];
}

disps[1] *= extent;
disps[2] = extent;
for (i=0; i<ndims; i++)
disps[2] *= array_of_sizes[i];

block_lens[0] = block_lens[1] = block_lens[2] = 1;

types[0] = MPI_LB;
/* datatype we described previously */
types[1] = tmp1;
types[2] = MPI_UB;

MPI_Type_struct(3, block_lens, disps, types, newtype);

##### Implementation of the Distributed Array Datatype Constructor

In this section we will describe the implementation of our last derived datatype, namely MPI_Type_create_subarray. On checking the input parameters for correctness we can start the implementation according to the order of the array. Similar to the previous section we will concentrate on the C order whereas we will not go into detail with describing the FORTRAN order since the assumptions are only slightly different.

Before any distribution pattern can be taken into account each process must be mapped to the processor grid. This is done according to the formula given in [47]:

procs=size; /* number of processes in that process group */
tmp_rank=rank;
for (i=0; <ndims; i++)
{
procs = procs / array_of_psizes[i];
coords[i] = tmp_rank / procs;
tmp_rank = tmp_rank % procs;
}


Let us demonstrate this functionality of this formula by means of the process with rank=2. We assume that our processor grid consists of 2 processors per dimension as depicted in Figure 6.16. On applying the correct values to the formula, namely procs=4, tmp_rank=2 and ndims=2, we yield the result coord[0]=1 and coord[1]=0. Thus, the process with rank 2 corresponds to the processor with the coordinates [0,1], i.e. the processor in the lower left corner of the processor grid.

Next, each dimension of the array is analyzed separately starting from the highest dimension. Since we will concentrate on C-order, the last dimension changes fasted. We have already stated that MPI_Type_create_darray allows three different ways of distributing the array, namely MPI_DISTRIBUTE_-BLOCK, MPI_DISTRIBUTE_CYCLIC and MPI_DISTRIBUTE_NONE. Since latter case can be reduced to a BLOCK distribution on only one process, i.e. no distribution is actually performed, we merely have to discuss the routines which handle the first and the second case.

###### BLOCK-Distribution

Let us start off with the routine for the BLOCK distribution. First, the distribution argument darg is checked. If it is set to MPI_DISTRIBUTE_DFLT_-DARG, the block size of the corresponding process is set to:

if (darg== MPI_DISTRIBUTE_DFLT_DARG)
blksize= (global_size + nprocs-1) / nprocs;
else
blksize=darg;


according to [47] whereas global_size corresponds to array_of_gsizes[i], i.e. the size of the array in that particular dimension, otherwise blksize is set to the size of the distribution argument.

Next, we must check whether each process gets the same number of elements. For example, if wish to distribute a 1-dimesional array consisting of 6 elements onto 2 processors according to the pattern BLOCK(4). Thus, the process with rank 0 gets the first four blocks, whereas process 1 merely gets 2 elements. The following code fragment handles this case:

j=global_size-blksize*rank;
mysize=min(blksize,j);
if ( mysize < 0 )
mysize=0;


Now we are ready to implement the block-distribution by means of derived datatypes we have already discussed in the chapter on MPI. In particular, we have to bear in mind the dimension of the array we are currently analyzing. Assume a 4x6 array with a distribution pattern of BLOCK(2), BLOCK(3) as depicted in Figure 6.18. This 2-dimensional array can be linearized to a 1-dimensional array as presented in the same figure. The distribution pattern of the last dimension, i.e. dimension 1, BLOCK(3) can be described by the derived datatype MPI_Type_contiguous. All other dimensions must be described by MPI_Type_hvector with the correct stride. What is more, the oldtype of this datatype must be the datatype of the previous dimension, i.e. a nested derived datatype is built.

In our example the distribution pattern of dimension 1 is BLOCK(3). Thus, following datatype is used:

MPI_Type_contiguous(3, MPI_INT, dim1);


Since dimension 0 is distributed according to BLOCK(2), this can be described by using the derived datatype MPI_Type_vector whereas oldtype is the datatype we created before. The stride is computed as the length of the array of the next dimension, i.e. array_of_gsize[1]=6. Thus, the vector is:

MPI_Type_hvector(2,1,6,dim1,dim0);


The two steps which are necessary, for simulating a BLOCK(2),BLOCK(3) distribution are shown in Figure 6.19

In order to apply this algorithm for any dimension, the stride must be set accordingly, namely to:

for (i=nidms-1; i>dim; i--)
stride *= array_of_gsizes[i];


Thus, the formula is set accordingly for a n-dimensional array.

Finally, the start offset for each process must be set. Recall that the size of the first dimension of the array to be distributed is 4 and the distribution pattern is BLOCK(2). Thus, the first process in the grid of dimension 0 should read the first and the second elements whereas the second process should start accessing the third element. In general, this is handled by the following code:

*st_offset = block_size * rank;


Let us now return to the point after calling the procedure for handling the block distribution. The loop for analyzing the distribution pattern, i.e. BLOCK, CYCLIC or NONE is repeated until the last dimension. However, each call returns a new value for the filetype and the corresponding offset for each process as we described above. It is important to mention that each newtype of the derived datatype, i.e. the distribution pattern of the particular dimension is used as the oldtype of the next dimension. In other words, the derived datatype of one dimension is used for building the derived datatype of the next dimension and, thus, creating a nested derived datatype which describes the distribution pattern of all dimensions. A short abstract of the code shall demonstrate this:

if (order==MPI_ORDER_C)
{
for (i=ndims-1, i>=0, i--)
{
switch(array_of_distribs[i]
{
case MPI_DISTRIBUTE_BLOCK:
Block_distribution(...,type_old, &type_new,
st_offsets+i);
ΨΨ    break;
case MPI_DISTRIBUTE_CYCLIC:
Cyclic_distribution(...,type_old, &type_new,
st_offsets+i);
break;
ΨΨ...
}
/* use the new datatype as the basis for the next
datatype * /
type_old = type_new;
}

disps[1]=st_offsets[ndims-1];
}


After this loop we know the access pattern (type_new) of the particular process and its start position disps[1]. The easiest way to combine all this information is by using the derived datatype MPI_Type_struct with the following values:

disps[0]=0;
disps[2]= multiple of array_of_gsizes;
block_lens[0]=block_lens[1]=block_lens[2]=1;
types[0]=MPI_LB;
types[1]=type_new;
types[2]=MPI_UB;

MPI_Type_struct(3, block_lens, disps, types, newtype);


In short, the distribution pattern is made up by the second block of the derived datatype. The first and the third block are made up by the datatypes MPI_LB and MPI_UB.

###### CYCLIC-Distribution

This section will be dedicated to the description of the second distribution pattern, namely CYCLIC. Similar to the previous section, we will merely take a look at the implementation of the C order since the implementation of the FORTRAN order only slightly differs.

First, the number of blocks per process is calculated according to a slightly modified version of the formula given in [47]:

nblocks = ( array_of_gsizes[dim] + blksize-1 ) / blksize;
count = nblocks / nprocs;
left_over= nblocks - count * nprocs;
if (rank < left_over)
count = count + 1;


For example, if we wish to distribute a 1-dimesional array of 12 elements onto 2 processes with the distribution pattern CYCLIC(4) the first process gets two blocks whereas the second process only gets one block consisting of 4 elements (see Figure 6.20 a).

Next, the size of the last block is calculated. In our previous example, the size of each block was four elements. However, this is not true for an array comprising, for instance, 14 elements. Thus, each process gets two blocks but the last block of process 2 consists of only two elements rather than four (see Figure 6.20 b):

/* check whether any irregular block exists */
if ( (remaining_elements=array_of_gsizes[dim] %
(nprocs*blksize)) != 0)
{
/* compute the size of the last block */
last_blksize=remaining_elements-blksize*rank;
if ( (last_blksize < blksize) && (last_blksize > 0) )
{
count--;
evoke_struct=1;
}
}


Again, blksize refers to the distribution argument. In our case it is 2 - CYCLIC(2). Note that if the size of the last block is less than the remaining ones, count is decremented by 1 and evoke_struct is set to 1. Thus, the last block is treated by means of the special derived datatype as described later on.

The regular blocks are treated in the following way. Let us explain each step by means of the introductory example where we have taken a look at the source code of an application program. Recall that the array size of the 2-dimensional array is 9x10, the distribution pattern is CYCLIC(2), CYCLIC(2) and the processor grid comprises 2 processes per dimension.

Since in the C order the last dimension changes fastest, we handle this case first. The second dimension consists of 10 elements. Thus, process 0 gets count=3 blocks of the size blksize=2 and process 1 gets 2 blocks of blksize=2.

if (dim == ndims-1)
{
stride = nprocs*blksize*orig_extent;
MPI_Type_hvector(count, blksize, stride, type_old, type_new);
}


Since the stride is given in bytes, it must be multiplied by the datatype of elements of the array. All other dimensions cannot be treated as straightforwardly. Therefore, we have to split up the access pattern into two derived datatypes, one which builds a sub block using the block size as the first parameter, and a second one which uses the number of blocks, i.e. count, as the first parameter. Moreover, it is a nested datatype based on the previously created sub block. In addition, the strides for these two derived datatypes must be computed. Let us list the code fragment before we explain its meaning:

/* compute sub_stride and stride */
sub_stride = orig_extent;
stride = nprocs*blksize*orig_extent;

for (i=ndims-1; i>dim; i--)
{
sub_stride *= array_of_gsizes[i];
stride *= array_of_gsizes[i];
}

/* datatypes for sub_block and block */
MPI_Type_hvector (blksize, 1, sub_stride, type_old, &sub_block);
MPI_Type_hvector(count, 1, stride, sub_block, type_new);


We will demonstrate our approach for the first process. Assume that we have already built the derived datatype for the second dimension. Since the first dimension consists of 9 elements, process 1 gets 3 blocks whereas the last blocks is smaller than the previous ones. Thus, count is reduced to 2. What is more, the block size of the regular blocks is 2 as well. Provided with that information we can now build the first derived datatype, which represents the sub block. Since our array comprises only two dimensions sub_stride is set to array_of_gsizes[2]=10. For simplicity let us restrict to the number of elements and thus not regard the actual number of byte elements, which would be necessary for the derived datatype. Thus, the pattern of the derived datatype which describes the sub block looks like follows (Figure 6.21):

Next the derived datatype which describes the whole access pattern is built. This is done by creating a derived datatype MPI_Type_hvector such that the number of blocks corresponds to count. What is more, the block size is 1 and the oldtype is the newtype from our previously created sub block. Finally, the stride must be set accordingly. First, the stride is multiplied by the number of processes of the particular processor grid dimension - in our case it is two - times the block size. Next, it is multiplied by the sizes of the array to be distributed. On building both derived datatypes the access pattern for process 1 depicted in Figure 6.22 is yielded.

Since the first dimension consists of 9 elements rather than 8, we still have to add the last block. Recall that its block size is smaller than the remaining one and thus has to be treated separately:

if ( dim == ndims-1)
{
types[0] = *type_new;
types[1] = type_old;
disps[0] = 0;

if (count == 0 )
disps[1]= blksize*(nprocs-1)*sub_stride;
else
disps[1] = count*stride;

blklens[0] = 1;
blklens[1] = last_blksize;
}

else
{
MPI_Type_hvector (last_blksize, 1, sub_stride, type_old,
&sub_block_struct);

types[0] = *type_new;
types[1] = sub_block_struct;
disps[0] = 0;

if (count == 0 )
disps[1]= blksize*(nprocs-1)*sub_stride;
else
disps[1] = count*stride;

blklens[0] = 1;
blklens[1] = 1;
}

MPI_Type_struct(2, blklens, disps, types, &type_tmp);
*type_new = type_tmp;



Since we are analyzing the first dimension, we have to take a look at the else branch of the outer if-statement. Again, a derived datatype for a sub block is build. Rather than taking blksize as the first argument, last_blksize is used now. Next, the previously created datatype for the regular pattern is combined with the irregular pattern by means of MPI_Type_struct. Note that the displacement of the irregular pattern depends on the number of blocks count of the regular pattern. In our case the displacement can simply be computed by multiplying the number of blocks by the stride. The new pattern is depicted in Figure 6.23.

Imagine that the first dimension of the processor grid consists of nprocs=4 processes rather than 2. Further assume that we have already built the derived datatype for the second dimension as shown so far. Thus, assuming a 9x10 array with a distribution pattern CYCLIC(2) the first three processes get one block consisting of two elements whereas the last process gets one block of only one element. In this case count is 0 and last_blksize is 1. Figure 6.24 demonstrates the effect of the new stride. Since sub_stride corresponds to the second dimension of the array, i.e. 10, stride is set to 2*3*10=60.

After having built the derived datatypes for the regular as well as the irregular patterns, the last step is to add the lower and upper boundary as we described for the BLOCK distribution. Although we have only discussed the implementation for a 2-dimensional array, this algorithm works for n-dimensions as well.

#### 6.3.7 Fortran-Interface

In this section we want to state the most important points about converting a Fortran application program using MPI-IO to the C-interface.

The first thing to be mentioned is that every procedure call in Fortran is call by reference. The name of a procedure can only consist of lower case letters and must be terminated with an underscore ”_”. Next, strings must be treated with special care since there is no ”null” which indicates the end of a character array. Thus, this indicator must be added to the character array.

In the following code fragment we will show the implementation of the MPI-IO routine MPI_File_open. Note that the routine is a procedure rather than a function as it is true for the C implementation. Thus, the return value from the C function is stored in the parameter *ierror. In order to distinguish between input and output parameters, input parameters are indicated by const. Since the name of the file to be opened is a character array, it must be converted accordingly. First, the routine std_str_f_to_c is called which in turn calls the routine strn_copy0. On returning from both routines, the Fortran filename is converted to a character array which can correctly be interpreted by the C routine and does the actual file opening.

#define ΨΨSTRING80_LEN 80
typedef charΨSTRING80 [STRING80_LEN+1];

void mpi_file_open_ (const int *comm, const char *filename,
const int *amode, const int *info, int *fh, int *ierror)
{
STRING80 c_filename;

/* convert a c-array into a fortran */
std_str_f_to_c (filename, c_filename, STRING80_LEN);

/* call the correspoding function from the c interface */
*ierror=MPI_File_open (*comm, c_filename, *amode, *info, fh);
}

static void std_str_f_to_c (const char *fstr, char cstr [], size_t max)
{
register int i;

/* do nothing */
for (i=max-1; i >= 0 && fstr[i] == ’ ’; i--) ;

(void) strncpy0 (cstr, fstr, i+1);
}

static char *strncpy0 (char *s1, const char *s2, size_t n)
{
(void) strncpy (s1, s2, n);

*(s1+n) = ’\0’;

return (s1);
}


A further point to bear in mind when mapping from Fortran to C is the different interpretation of datatypes. For example, following MPI datatypes are supported by Fortran, which have no direct C MPI datatypes as their counterparts, namely:

• MPI_INTEGER

• MPI_REAL

• MPI_DOUBLE_PRECISION

• MPI_COMPLEX

• MPI_LOGICAL

• MPI_CHARACTER

• MPI_BYTE

Thus, the converting function, which is used in the Fortran MPI-IO interface is:

int MPI_Type_f2c (MPI_Datatype datatype)
{
switch (datatype)
{
case MPI_CHARACTER:
return MPI_CHAR;
break;

case MPI_INTEGER:
return MPI_INT;
break;

case MPI_REAL:
return MPI_FLOAT;
break;

case MPI_DOUBLE_PRECISION:
return MPI_DOUBLE;
break;

case MPI_LOGICAL:
return MPI_UNSIGNED;
break;

default: return MPI_BYTE;
}
}


This datatype conversion is used, for example, in each routine for reading or writing a file. Let us therefore take a look at the mapping function for reading:

void mpi_file_read_ (const int *fh,void *buf,const int *count,
const MPI_Datatype  *datatype, int *status, int *ierror )
{
int                     res;
MPIO_Status   status_c;

MPI_Datatype datatype_c;

datatype_c = MPI_Type_f2c(*datatype);

*status=status_c.fid;
}


The forth parameter, namely datatype, is converted from the Fortran MPI representation to the C MPI representation via the routine MPI_Type_f2c.

We still need to focus out attention to the parameter status which is of type MPIO_Status in the C interface. Since all objects in FORTRAN are of type integer, the C interface is called with the parameter status_c rather than the value received from the FORTRAN application program. On returning from the C routine, the file identifier from the structure status_c is assigned to the parameter status. Thus, the FORTRAN application program merely deals with the file identifier of the particular data access routine rather than with the whole structure. The same conversion mechanism is also true for the remaining blocking data access routines.

The implementation for the non-blocking routines is analogous. However, since these routines do not contain the parameter status, the conversion mechanism is done for the parameter request:

void mpi_file_iread_ (const int *fh,void *buf,const int *count,
const MPI_Datatype  *datatype, int *request, int *ierror )
{
int     res;
MPIO_Request    request_c;
MPI_Datatype     datatype_c;

datatype_c = MPI_Type_f2c(*datatype);

*request=request_c.reqid;
}


Thus, the application program only deals with the request identifier rather than the whole structure which must be considered in the routine for checking whether the outstanding non-blocking operation has finished:

 void mpi_file_test_ ( int *request, int *flag, int *status, int *ierror)
{
MPIO_Status Ψstatus_c;
MPIO_RequestΨrequest_c;

request_c.reqid=*request;

*ierror = MPI_File_test (&request_c, flag, &status_c);
*status=status_c.fid;
}


This routine firstly assigns the request identifier received from the FORTRAN application to the structure request_c. Later, the value from the structure status_c is assigned to the parameter status which in turn can be used by the FORTRAN application. Similar conversions are made for the routines mpi_file_wait_ and mpi_file_get_count_.

On analyzing the peculiarities of the Fortran to C interface we have to discuss the handling for connecting and disconnecting from the ViPIOS server. Before the functionalities of MPI-IO can be used, MPI-IO has to be initialized. This is done by the routine MPIO_Init, which is located in a Fortran module. Thus, every Fortran application program must include the module vipmpi. This routine, on the one hand, establishes the connection to the ViPIOS server, on the other hand, it manipulates the MPI_COMM_WORLD communicator in a way such that all client processes can use one MPI_COMM_WORLD without the interference of the server processes. In contrast to C application programs the header file similar to vip_mpio_init.h is not required. However, the header files vip_mpio_f2c.h and vip_mpio_def.h are still needed.

We want to conclude our discussing with a small Fortran application which uses the derived datatype MPI_TYPE_DARRAY for distributing a 4x5x6- array onto several processes according to the distribution pattern BLOCK(2), CYCLIC(3) and CYCLIC(2) whereas the processor grid consists of 2 processors per dimension. Besides the code for the application program a graphical interpretation of the distribution array is given as well. Note that all data objects in Fortran are integer values. This becomes clear when we take a look at the declaration section of the example program. For instance, the newtype of a derived datatype in Fortran is of the type integer rather than MPI_Datatype as we know it from the C application programs.

      program main
USE vipmpi
implicit none

include ’mpif_vip.h’
include ’vip_mpio_f2c.h’
include ’vip_mpio_def_f2c.h’

characterΨfilename*81

integer newtype, i, ndims, array_of_gsizes(3)
integer order, intsize, nprocs,fh,ierr
integer array_of_distribs(3), array_of_dargs(3)
integer array_of_psizes(3)
integer mynod, array_size, bufcount

call MPI_INIT(ierr)
call MPIO_INIT(ierr)

call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)
call MPI_COMM_RANK(MPI_COMM_WORLD, mynod, ierr)

ndims = 3
order = MPI_ORDER_FORTRAN
filename = ’ufs:file_create’

c    specify the size of the array to be distributed
array_of_gsizes(1)=4
array_of_gsizes(2)=5
array_of_gsizes(3)=6

c    distribution pattern of each dimension
array_of_distribs(1) = MPI_DISTRIBUTE_BLOCK
array_of_distribs(2) = MPI_DISTRIBUTE_CYCLIC
array_of_distribs(3) = MPI_DISTRIBUTE_CYCLIC

c    distribution argument of each dimension
array_of_dargs(1) = 2
array_of_dargs(2) = 3
array_of_dargs(3) = 2

do i=1, ndims
array_of_psizes(i) = 0
end do

c    create processor array
call MPI_DIMS_CREATE(nprocs, ndims, array_of_psizes, ierr)

call MPI_TYPE_CREATE_DARRAY(nprocs, mynod, ndims,
$array_of_gsizes, array_of_distribs, array_of_dargs,$     array_of_psizes, order, MPI_INTEGER, newtype, ierr)

call MPI_TYPE_COMMIT(newtype, ierr)

array_size = array_of_gsizes(1) * array_of_gsizes(2) *
$array_of_gsizes(3) c write the array to the file call MPI_FILE_OPEN(MPI_COMM_WORLD,$     filename,
$MPI_MODE_CREATE+MPI_MODE_RDWR+MPI_MODE_UNIQUE_OPEN,$     MPI_INFO_NULL, fh, ierr)

call MPI_FILE_SET_VIEW(fh, 0, MPI_INTEGER, newtype, "native",
$MPI_INFO_NULL, ierr) call MPI_FILE_WRITE_ALL(fh, writebuf, bufcount, MPI_INTEGER,$     status, ierr)
call MPI_FILE_CLOSE(fh, ierr)

call MPI_TYPE_FREE(newtype, ierr)
call MPI_FINALIZE(ierr)
call MPIO_FINALIZE(ierr)

stop
end



### 6.4 Regression Suite

#### 6.4.1 Program Testmpio

Up to now we were discussing the theoretical background of MPI-IO and the portable implementation on ViPIOS. Since an extensive testing phase is part of each software engineering process, we will present one test program from the University of California and Lawrence Livermore National Laboratory written in April 1998. This so-called ”regression suite” verifies a lot of different MPI-IO routines by simulating different cases of application programs. For example, functions for testing collective open and close of files, independent reads and writes with file and buffer types, file control etc. Besides checking the functionalities of the MPI-IO implementation, the time for the verification procedure is taken. Since ViMPIOS does not support shared file pointers, special error handling routines and different representation modes, the last routines could not be tested.

The structure of the test program is as follows. The main function takes the input parameters, for example the user path for storing the files, and calls the routine dotest(), which in turn calls the different test routines which we stated before. Moreover, the time for the whole process is taken.

On giving a brief introduction do the regression suite we will now anaylise each routine separately.

##### test_manycomms

Some files are opened with a couple of different communicators. Starting with a group of processes which is split into two sub groups, reading and writing is analyzed.

In particular, following interface routines are tested:

MPI_Barrier, MPI_Comm_Free, MPI_Comm_split

Purpose:
This routine checks whether several files can be opened using nested communicators. Thus, for example, files are opened with the communicators 1,2,3,4 and 5. Then, the files are closed in a different way. For instance, the file which is opened by communicator 3 is closed first etc. In order to accomplish this task, a communicator group is split into several sub groups which operate on different nested files. What is more, consistency semantics are obeyed by using atomic access operations (MPI_File_set_atomicity and synchronization points (MPI_File_sync). However, for the time being, ViPIOS merely operates with atomic modes. What is more, the function for synchronizing concurrent file I/O is still a task to be fulfilled in the near future.

##### test_openmodes

Different open modes are checked. First, the file is opened with the access mode MPI_MODE_CREATE and MPI_MODE_WRONLY. Thus, reading this file is supposed to fail. Later, the modes MPI_MODE_READ_ONLY and MPI_MODE_DELETE_ON_CLOSE are checked.

Purpose:
This routines checks whether the open modes are consistent. In other words, a file which is opened, for example, in the write only mode, should not be readable and vice versa.

##### test_manyopens

First, a couple of files are opened and filled with data which are read back in a different order later on. Next, further files are opened with different access modes such that each file has a distinct entry to the file table.

Purpose:
Similar to the function test_manycomms the behavior of the file table is tested whether it can cope with several files which are opened at the same time and accessed in a different way.

##### test_openclose

MPI files are opened and closed in a collective way. Thus, it checks whether the MPI-IO recognizes errors with using wrong access modes for different files. What is more, a file is tried to close which was not opened before.

Purpose:
This routine merely check the basic functionality of opening and closing a file.

Independent I/O calls are tested here. MPI_BYTE is chosen for the data access buffer as well as for the filetypes. Each node writes some bytes to a separate part of the same file which is read back later on in order to compare the results. Besides checking whether the correct number of bytes are read or written, the inbuffer and and outbuffer are compared, i.e. the input string must be identical to the output string. No file view is set in that routine.

Purpose:
This routine checks the basic functionalities of blocking data access operations.

##### test_rdwr

Collective I/O routines are tested with derived filetypes and buffer types. Thus, on the one hand a view is set, on the other hand a contiguous derived data type for the read and write buffer is used.

Purpose:
Besides checking the features of file views, the functionality of so-called data scattering is tested.

##### test_filecontrol

This routine tests different MPI-IO routines like MPI_File_get_position, MPI_-File_set_size or MPI_File_get_byte_offset. Moreover, different file views are set. This test is concluded with splitting the communicator group into two sub groups which separately operate on the files.

Purpose:
The most important MPI-IO routines are checked.

##### test_localpointer

First, each node accesses the file in a contiguous way but with different displacements. Later, some data is skipped and a couple of byte values are written after this hole. Different read and write operations are performed in order to check the holes in the file. In the second run a file with a more complicated file view is opened in order to perform similar tests.

Purpose:
The behavior of setting file views is analyzed in detail. Besides setting correct file views, erroneous views are set to check whether the implementation recognizes these inconsistencies. A possible incorrect assignment would be to use an etype of the type MPI_INT and a filetype of the type MPI_DOUBLE.

##### test_collective

Collective I/O with explicit interleaving is tested. What is more, each process writes data of different size.

Purpose:
This routine checks the behavior of writing to a file according to different access patterns and different number of byte elements.

##### test_nb_rdwr

This routine does the same as test_rdwr. Except of using blocking routines it tests non-blocking operations.

##### test_nb_localpointer

This routine does the same as test_nb_localpointer. Except of using blocking routines it tests non-blocking operations.

## Chapter 7 The HPF Interface

This chapter describes the interface between HPF (High Performance Fortran) and ViPIOS (Vienna Parallel Input Output System). First a quick introduction to the relevant HPF features is given. Then the implementation of the interface is discussed in detail.

### 7.1 HPF (High Performance Fortran)

HPF has been developed to support programmers in the development of parallel applications. It uses the SPMD paradigm for transferring a sequential program to a parallel one, which can be executed on SIMD and MIMD architectures. Basically the same (sequential) program is executed on every processor available. But each processor only works on a subset of the originial input data. The result of the whole computation has to be composed from all the results of the single processors.

HPF itself is an extension to FORTRAN 90 and supplies the programmer with the functionality needed to generate SPMD programs. The programmer has to supply the sequential version of the program (in FORTRAN 90) and also to define how the data is to be distributed among the various processors. The HPF compiler then automatically generates the according parallel program by inserting the communication statements necessary to distribute the data and to coordinate the different processes.

##### Distribution Formats

While distribution formats are not directives on their own, they are needed to specify the data distribution in the DISTRIBUTE directive, which is explained in the next chapter. The following distribution formats are supported by HPF:

###### BLOCK / BLOCK(blocksize)

Generally data becomes divided into blocks of equal size. Each data element (1…N) is assigned to one corresponding processor (1…P). Using BLOCK without any further particular blocksize in brackets causes calculation of blocksize for each processor depending on the size of data and the number of processors in the according dimension of the processor array. If the number of data elements can be divided into commensurate blocks of size (N/P) (i.e. N is divisible by P), then each of the P blocks is assigned to the corresponding processor (i.e. the first block to the first processor and so on). If the number of data is not divisible by the number of processors, then commensurate blocks of size are assigned to the first 1…(P-1) processors. The remaining N-P elements are assigned to the last processor P.

Figure 7.1 shows two examples how data elements are assigned to processors. In the first case N is divisible by P. In the case where N = 14 the last processor () gets assigned the remaining seventh element.

If the optional parameter blocksize is given then all the blocks are calculated to be this size (except the last one if N is not divisible by blocksize). The blocks are then distributed onto the processors in a cyclic fashion (i.e. The first block to the first processor , … the Pth block to the Pth processor, the (P+1)th block to the first processor and so on until all the blocks are assigned a processor.)

###### CYCLIC / CYCLIC(blocksize)

CYCLIC without particular blocksize causes each element to be assigned to a processor in ascending order (i.e. the first element to the first processor, the second element to the second processor and so on). If the number of elements exceeds the number of processors the elements are allocated to processors cyclically (i.e. the (P+1)th element is assigned to processor one again, the (P+2)th to procesor two and so on. If a blocksize is given the resulting distribution is often also called BLOCK_CYCLIC. In this case not single data elements but blocks of data elements of the specified size are assigned to the processors in the same cyclic fashion.

Figure 7.2 a) shows how the elements wrap around the processor array. Example b) in figure 7.2 shows how blocks of blocksize are assigned to each processor. Except the last block assigned to processor , which holds the remaining N-P = 2 elements.

###### Gen_block

The generic block distribution allows for blocks of arbitrary size, which may vary from processor to processor. Thus processor one may for instance be assigned to 10 elements, processor two to 7 elements and processor three to 154 elements. This enables completely irregular distributions to be realized. However this distribution strategy is not implemented yet in all the HPF compilation systems and is also not supported by the ViPIOS HPF interface now.

###### Indirect

Is similar to generic block but the block sizes can be given by pointers that actually point to the actual size. So while in generic block the block sizes are known at compile time and are constant during runtime, the indirect distribution allows for variable sized blocks, the size of which can be changed at runtime. This distribution too is not implemented yet.

###### *

This distribution format causes data not to become distributed at all. This means that data elements are replicated (i.e. every processor gets a copy of the data elements).

###### Distribute Onto

Data mapping is achieved by the distribution directive DISTRIBUTE. The kind of distribution is specified by distribution formats like BLOCK, CYCLIC and *, which are explained above.

Each dimension of an array can be distributed independently, as shown in the following example.

!HPF$PROCESSORS PROCS(3,4) ΨINTEGER, DIMENSION (14,17) :: B !HPF$ DISTRIBUTE (CYCLIC(3),BLOCK) ONTO PROCS :: B


The chosen test array B is a two-dimensional array INTEGER, DIMENSION (14,17) :: B.
!HPF DISTRIBUTE (CYCLIC(3),BLOCK) ONTO PROCS :: B distributes each dimension of array B depending on the distribution format onto the processor array. The chosen formats are CYCLIC(3) (BLOCK_CYCLIC) for the first dimension and BLOCK for the second dimension.

#### 7.1.2 VFC Vienna Fortran Compiler

The VFC compiler system performs a source-to-source translation from HPF to Fortran 90 SPMD source code with special ViPIOS calls. System libraries are used to perform I/O operations and create a runtime descripor. This descriptor contains all neccessary information to perform data distribution through the ViPIOS system and is explained in detail in the following section.

##### The runtime descriptor

The runtime descriptor is a special datastructure containing all information about the processor array the data become distributed onto, the number of dimensions of the data, its extensions in each dimension and its type. Further information are explained in detail:

field arguments description
a 1…N number of dimensions of the processor array
number of processors of the processor array for each dimension
c type of passed data
d size (in Byte) of each element
e umber of dimensions of the data-array
each dim.

global length of dimension
local length of dimension
distribution (BLOCK, CYCLIC, BLOCK_CYCLIC, …)
distribution argument for all kinds of distribution except GEN_BLOCK blocksize is equal to . otherwise contains the number of processors.

The elements of the runtime descriptor are shown in figure 7.4.

### 7.2 ViPIOS-HPF-interface

The interface (which is contained in the file vip_test_rt.c transfers all the information contained in the runtime descriptors into the format used by the ViPIOS system. ViPIOS uses two datastructures to describe data distribution. These are the structures Access_Desc and basic_block.

#### 7.2.1 The datastructures Access_Desc and basic_block

The HPF-ViPIOS-interface uses the recursive datastructurs Access_Desc and basic_block. Access_Desc is defined for each dimension once. Like Access_Desc the structure basic_block contains several variables needed to perform read and write operations. Depending on the distribution directive the structure basic_block appears once or twice in each dimension.

The datastructure is given by the following code fragment defined in vip_int.h:

typedef struct
{
int     no_blocks;
struct basic_block      *basics;
int     skip;
}
Access_Desc;

struct basic_block
{
int offset;
int repeat;
int count;
int stride;
Access_Desc     *subtype;
int sub_count;
int sub_actual;
};

 struct Access_Desc no_blocks number of subsets basic_block struct basic_block pointer to subsetstructure struct basic_block skip number of elements to skip after R/W operation struct basic_block offset number of elements to skip to set filehandle to startposition repeat how many times element blocks appear count number of elements of one block stride number of elements between two blocks * subtype pointer to next Access_Desc if further dimension sub_count not necessary in this interface sub_actual not necessary in this interface

Each component of these structures describes how data has to be mapped onto the processor array. The following sections explain the values of all the components of Access_Desc and basic_block for the simple example, which we already have used previously.

!HPF$PROCESSORS PROCS(3,4) ΨINTEGER, DIMENSION (14,17) :: B !HPF$ DISTRIBUTE (CYCLIC(3),BLOCK) ONTO PROCS :: B


Figures 7.3 and 7.5 illustrate how data is distributed among the processors in this case. In this example the processors 3 and 5 reveal two common constellations of the datastructures’ component values. The component values for these two processors are therefore calculated and explained in detail in the following.

##### processor 3

In this example data mapping for processor 3 creates a structure consisting of Access_Desc and basic_block which are linked together as shown in figure 7.6. For each dimension structure Access_Desc and basic_block are defined separately. The structure describes the second dimension first because the internal interface of ViPIOS operates in recursive fashion. Argument no_blocks declares the number of structures basic_block used to describe data mapping for each dimension - in this case one basic_block. The next argument *basics is a pointer to this structure. One of the arguments of basic_block is offset. It stores the the number of bytes the file pointer has to skip from the beginning of the file to the first position where read or write operations start. For the second dimension offset is zero as shown in figure 7.7. skip describes how many bytes the file pointer has to skip after read or write operations. As shown in figure 7.7 b) 672 bytes have to be skipped along the axis of the second dimension. Along the first dimension skip is set to 20 bytes (see figure 7.7 c). Processor 3 occures only for one time along the second dimension - repeats value is 1. count (for dimensions higher than one) indicates the number of elements assigned to a processor but not the number of bytes. In the first dimension count describes the number of bytes to be read or written. As the example shows five elements are assigned to processor 3 in the second dimension. For the first dimension the value of count is three bytes. If more than one processor would appear in the second dimension (repeat > 1) the value for count would not change. On the other hand in dimension one the value of repeat is always one even if a processor appears more than one time. In this case however the value of count is the summ of all bytes assigned to a processor in this dimension The argument stride stores the number of bytes to skip between processor elements. If repeats value is set to one the value of stride must be zero. Only if repeat is greater than one stride differes from zero. In the case a processor appears more than one time in dimemsion two the first dimension would be taken into consideration for calculation of stride as it is done for skip (see also figure 7.7 b).

##### processor 5

Figure 7.8 shows the datastructure for processor 5. no_blocks defines an array of two structures of basic_block in dimension one. Unlike data mapping for processor 3 data mapping for this processor is divided into two blocks. The first block describes the regular block the second one the irregular block. It is important to distinguish these two because of their different blocklength expressed by count. In figure 7.9 data mapping for processor 5 and the different length of data blocks along the axis of the first dimension can be seen. An important detail relating to argument skip: This argument is defined for the regular block as well as for the irregular block for each dimension in Access_Desc If an irregular block is defined skip is always set to zero. Instead of this offset of the irregular block repleaces skip of the regular block.

#### 7.2.2 Interface functions

This section gives a detailed description of the interface functions of vip_test_rt.c. It is seperated into two parts. One gives a quick overview of the functions input arguments, its functionality and return values in tabular form. The second part of the description gives a more detailed description of the functionality based on the example HPF code, which already has been used in sections 7.1.1 and 7.2.1.

##### Functions for all operations

These functions are used to establish or complete connections between clients and server. They are indpendent from the applications demand.

int VIPIOS_connect (const char * system_name, int connection) {
return ( ViPIOS_Connect (connection) ? 0 : -1);
}

 input var connection to establish a connection between an application and the ViPIOS system. return val 0 or (-1) true (connection established) or false (connection not established)
int VIPIOS_disconnect(int connection) {
return  ( ViPIOS_Disconnect () ? 0 : -1);
}

 input var connection disconnects the application from the ViPIOS system return val 0 or (-1) true (disconnected) or false (not dsconnected)
##### Functions for operations on binary data

To perform operations dealing with binary data the following functions are used:

int VIPIOS_open_binary
(int connection, const char *filename, int status, int io_stat) {
int fd, flags;
switch (status) {
case 0: flags = 0;                break;
case 1: flags = MPI_MODE_CREATE;  break;
case 2: flags = 0;                break;
case 3: flags = MPI_MODE_CREATE;  break;
}
switch (io_stat) {
case 0:
flags |= MPI_MODE_RDWR;       break;
case 1: flags |= MPI_MODE_RDONLY; break;
case 2: flags |= MPI_MODE_WRONLY;
}
return (ViPIOS_Open ( filename, flags, &fd) ? fd : -1);
}

 input var *filename the name of the file e.g. /tmp/test_file1 status 0 old, 1 new, 2 unknown, 3 replace, 4 scratch io_stat file access mode: 0 read & write 1 read (only), 2 write (only). return val fd or (-1) file descriptor or false (connection not established)

Depending on the value of the passed argument status which gives information about the file to be opened, flags becomes assigned a specific MPI_MODE value. The bits of the number of flags which is neccessary as argument in ViPIOS_Open (filename, flags, &fd) are furthermore bound up with one of the MPI modi depending on io_stat. This flag describes the status of the file while using I/O-operations.

##### read and write binary arrays
int VIPIOS_read_binary_array (int connection, int fd, const void * data,
const int * array_dist) {
Access_Desc * descriptor = prep_for_set_structure (array_dist);
return = ViPIOS_Read_struct (fd, data, 0, descriptor, 0, -1);
}

int VIPIOS_write_binary_array (int connection, int fd, const void * data,
const int * array_dist) {
Access_Desc * descriptor = prep_for_set_structure (array_dist);
return = ViPIOS_Write_struct (fd, data, 0, descriptor, 0, -1);
}

 input var connection to establish a connection between an application and the ViPIOS system. fd file descriptor data data to distribute array_dist runtime descriptor return val 0 or (-1) true read/write operation performed or not

Both functions VIPIOS_read_binary_array and VIPIOS_write_binary_array use the same list of passed arguments. The only difference depending on if binary array shall be read or written is the call of ViPIOS_Read_struct(fd, data, 0, descriptor, 0, -1) either or ViPIOS_Write_struct(fd, data, 0, descriptor, 0, -1). The argument descriptor in this case is a pointer to the data structure Access_Desc which becomes initialized in prep_for_set_structure(array_dist). The passed argument is a reference to the runtime descriptor delivered by the runtime system of VCPC.

##### Access_Desc * prep_for_set_structure (const int *array_dist)
 input var array_dist runtime descriptor return val descriptor pointer to data structure Access_Desc

void *next_free is used as an auxiliary pointer to write the data structure Accesss_Desc. Its inital address is the memory address of descriptor. Contiguous memory allocation descriptor = malloc (1024) for the datastructure Access_Desc gurarantees faster processing while operations on Access_Desc are performed:

    void *next_free;        /* will point to next free space (in rec) */
descriptor = malloc (1024); /* initial free space for Access_Desc */
next_free  = descriptor;    /* init points to Access_Desc */


Some definitions made to make operations using specific information from the runtime descriptor easier to use. distance points to the first element of the runtime descriptor where the information about the first dimension of the data to distribute are stored. array_dimension stores the number or dimensions of data.

    distance = array_dist[0]+4;
array_dimension = array_dist[array_dist[0]+3];
proc_nbr_array_count = array_dist[0];
max_proc_array_dim = array_dist[0];


The processor grid coordinates calculated by the following code fragment are stored in dim_contr[i + 1]. The maximum number of dimensions of the data does not may exceed the number of digits of this array. Otherwise the size of the array defined by int dim_contr[5] in this case must become changed to a higher value. Each coordinate of the data is stored at index position of dim_contr. The organisation of process grid coordinates in dim_contr is shown in figure 7.10.

Depending on the current process number the corresponding processor grid coordinates can be calculated. To find out the actual process id MPI_Comm_rank( MPI_COMM_WORLD, &proc_nbr); is used. The process id is stroed at the reference &proc_nbr.

    result = MPI_Comm_rank(MPI_COMM_WORLD, &proc_nbr);
aux = proc_nbr;

for (i = 0; i < array_dimension; i++) {
total_array_size = total_array_size / array_dist[array_dimension - i];
dim_contr[i + 1] = aux / total_array_size;
aux = proc_nbr % total_array_size;
}


If the processor grid coordinates for the specific process are calculated and stored in dim_contr the next step is to set up the datastructure Access_Desc. With the informations of the runtime descriptor, the pointer to the allocated storage for Access_Desc and the processor grid coordinates as arguments the function set_structure is called. It sets up the datastructure Access_Desc by the informations calculated in prep_for_set_structure before.

    set_structure (descriptor, &next_free, array_dist);
return descriptor;


At last the result value of prep_for_set_structure is a reference to the complete datastructure Access_Desc as a result of set_structure.

##### void set_structure (Access_Desc * descriptor, void **free_space, int * array_dist)
 input var descriptor pointer to data structure Access_Desc *free_space auxiliary pointer to pointer of address of Access_Desc array_dist runtime descriptor return val descriptor reference of input var descriptor

Depending on how each dimension has to become distributed the function set_structure provides several blocks for:

* ... no distribution,
BLOCK(n) ... block distribution and
CYCLIC(n) ... cyclic distribution.

In each block all specific data for Access_Desc are calculated seperate. Before these data and its calculation becomes explained in detail the following description points out variables used in Access_Desc which are distribution independent:

    descriptor->no_blocks = 1;
down_count = array_dimension - (step_dim_count + 1);


no_blocks is set to 1 initially. The variable down_count becomes assigned the current number of dimension for each iteration of set_structure. If more than one dimension has to be calculated the value of down_count decrements for each dimension. The following description shows the different algorithms needed for the calculation of each value in Access_Desc depending on the distribution.

##### no distribution
    descriptor->skip = 0;


Because all elements shall become read in one dimension there are no data to skip. Thats why skip is set to 0.

    *free_space = (struct basic_block *) ((void *)descriptor) +
sizeof(Access_Desc);
descriptor->basics = *free_space;
*free_space += sizeof (struct basic_block);


*free_space (used as auxilliary pointer) gets assigned the address of the next structure after Access_Desc - it points to basic_block. Eventually descriptor->basics of Access_Desc gets assigned the address of basic_block by the address stored in *free_space. At last *free_space += sizeof (struct basic_block) assignes the next free address after the last structure basic_block for the next dimension - structure Access_Desc and the following structure(s) basic_block.

    descriptor->basics->offset = 0;


Like descriptor->skip also descriptor->basics->offset is set to 0.

    descriptor->basics->repeat = array_dist[distance +
(4 * step_dim_count) + 1];
descriptor->basics->stride = 0;
descriptor->basics->count = 1 * type_size;


The number of elements of the global length in each dimension mark descriptor->basics->repeat. This describes how many times elements of size descriptor->basics->count = 1 * type_size has to be read. In case of no distribution there is no stride of data necessary.

##### BLOCK(n) distribution
    global_length =  array_dist[distance +
( 4 * (array_dimension - 1 - step_dim_count))];
local_length = array_dist[distance +
( 4 * (array_dimension - 1 - step_dim_count)) + 1];
argument = array_dist[distance +
( 4 * (array_dimension - 1 - step_dim_count)) + 3];


Definitions like global_length local_length and argument are made to reduce expense if informations from the runtime-descriptor are used more than once. The values of these three definitions depend on the current dimension obtained by step_dim_count.

Relating to the fifth processor (see figure 7.9) for the first dimension skip becomes calculated as follows:

    descriptor->skip  = global_length -
(argument * (dim_contr[step_dim_count + 1] + 1)) + (argument - local_length);


The global_length for this dimension is still calculated by the lines before. It is set to 17 in this example. argument for cyclic distribution of this dimension from the runtime-descriptor gives information about how many elements each processor becomes assigned. In this case argument is set to 3 as the distribution directive of the corresponding HPF code demands. The value of the array dim_contr in position step_dim_count + 1 stores the processor coordinate of the corresponding dimension. In this case it is set to 1. (dim_contr[step_dim_count + 1] + 1) which results in 2 marks the second processor in this dimension. Multiplied with the number of blocks each processor gets assigned (argument) and furthermore subtracted from the global length skip results in 8. This means that 8 elements have to be skipped for dimension 1 while data for processor 5 is read or written.

There is one special case if the argument exceeds the local length (see figure 7.9 the rightmost elements in the matrix)). This is the case where the number of elements assigned to a processor is smaller than the number of elements a processor could get assigned. (Elements 15 to 17 in the second dimension) For the last processor in figure 7.9 the calculation of descriptor->skip would result in -3 by the calculation described above. For this case argument - local_length is a neccessary correction. It results to 3 and marks descriptor_skip to 0. In all other cases the correction argument - local_length results to zero because of equal values.

The next operations multiply the number of elements of all remaining dimensions to the calculated value in skip. Therefore down_count contains the number of the following dimensions from step_dim_count. In this case down_count is 1.
In figure 7.9 c) seven elements are skipped along dimension 2. These seven elements become multiplied with the global length of the remaining dimensions, in this case only one dimension (dimension 1) which has 14 elements.

Assuming a third dimension for the example data array the elements of skip would form a three dimensional block as shown in figure 7.11. Dimension 1 and 2 and its distributions are the same as used in the example before. The only difference is the extended third dimension wich is not distributed (distribution * to keep graphical presenation simple). Furthermore it shows all data elements assigned to processor 5 for a three dimensions.

    down_count = array_dimension - (step_dim_count + 1);
for (u = 0; u < down_count; u++)
descriptor->skip *= array_dist[distance + (4 * u)];
descriptor->skip *= type_size;


At last descriptor->skip which now contains all elements to be skipped becomes multiplied with type_size - to optain the proper size for integer for example.

All variables such as no_blocks and skip in Access_Desc are calculated now. To be able to assign values to offset, repeat, count, stride and sub_type of the next datastructure basic_block for this dimension the auxilliary pointer *free_space becomes set to the next addresses like in section 7.2.2.

    *free_space = (struct basic_block *) ((void *)descriptor) + sizeof(Access_Desc);
descriptor->basics = *free_space;
*free_space += sizeof (struct basic_block);


*free_space points at the next address after the last structure basic_block. That is why all following descriptions relate to variables defined in basic_block.

    descriptor->basics->offset = 1;


offset is set to 1 initially. This is neccessary for the following calculations. Depending on if the coordinate for the current dimension in the processor array is set or not the variable offset becomes calculated or zero. It is zero in case of if the processor coordinate for this dimension (dim_contr[step_dim_count + 1]) is zero. Otherwise the number of blocks as offset (expressed by the processor coordinate) times the number of elements (argument) for each block (argument * dim_contr[step_dim_count+1) gives the number of element to offset till the first element becomes read or written in this dimension. If an offset is set the remaining dimensions are also multiplied too.

    if (dim_contr[step_dim_count + 1]) {

down_count = array_dimension - (step_dim_count + 1);
descriptor->basics->offset = argument  *
dim_contr[step_dim_count+1];

for (i = 0; i < down_count; i++)
descriptor->basics->offset *=
array_dist[distance + (4 * i)];
}
else
descriptor->basics->offset = 0;

descriptor->basics->offset *= type_size;


Each block appears only once in each dimension if distribution is BLOCK.

    descriptor->basics->repeat = 1;


Because there is no gap between each block stride is zero.

    descriptor->basics->stride = 0;


The value of the last variable count depends on the present dimension. If the last dimension is not reached count becomes assigned the value of the local length of the corresponding dimension. count assignes only the number of elements read or written. If the last dimension is reached the local length of the last dimension is assigned too. In this case the value of count also becomes multiplied with type_size. This means that only if the last dimension is reached count contains the size of the block (number of elements) times the size of an element.

    if (step_dim_count + 1 < array_dimension) {
descriptor->basics->count = array_dist[distance +
(4 * down_count) +1];
}
else {
descriptor->basics->count = array_dist[distance +
1] * type_size;
}

##### CYCLIC distribution
    global_length =  array_dist[distance +
( 4 * (array_dimension - 1 - step_dim_count))];
local_length = array_dist[distance +
( 4 * (array_dimension - 1 - step_dim_count)) + 1];
argument = array_dist[distance +
( 4 * (array_dimension - 1 - step_dim_count)) + 3];


Like in the beginning of section 7.2.2 where the case of BLOCK distribution was explained the same definitions for global_length local_length and argument are made here too.

The following definitions are made to distinguish two kinds of data - regular and irregular data. Regular data is given if the length of each block is equal to the blocklength given by the runtime descriptor which is expressed by argument. Irregular data consists of only one block which in turn is formed by the last elements of the respective dimension. The number of these elements is less than argument. As shown in figure 7.9 cyclic data distribution for processor 5 creates two blocks with different sizes. The first block (regular) consists of three elements (4, 5, 6) along the vertical axis (row order major). The second block (irregular) of processor 5 consists of only two elements (13, 14). For each kind of these two blocks (regular and irregular) a datastructure basic_block is definend. See figure 7.8 where the datastructure for both dimensions is shown graphically.

The calculations are as follows.

nbr_occ_elem_mod is zero if no irregular block exists. This is the case if the local length is divisible exactly. Depending on if an irregular block exists or not the global length is nbr_occ_elem_mod or zero. The global length for regular and irregular blocks can be distinguished by index 0 or 1.

    nbr_occ_elem_mod = local_length % argument;
mem_glob_length[0] = global_length - nbr_occ_elem_mod;
mem_glob_length[1] = nbr_occ_elem_mod;


The following calculations for skip depend on if local_length is greater, equal or smaller than argument.
The first case local_length > argument does not include that an irregular block exists. It only means that an irregular block is possible.
First an auxiliary variable temp stores the global length without the elements of the irregular block. In this example temp = 14 - (14 mod 3). After this the whole number of regular blocks becomes calculated by temp / argument. In the next step temp becomes the an indicator for computing the proper size of skip of the regular block by 4 mod 3 = 1. temp represents the number of all the blocks of the regular part (i.e. the blocks which are all of equal size) it is therefor also referred to as an auxiliary skip value.

    /* (local length > argument) irregular block possible  */
if (local_length > argument) {
temp = global_length - (global_length % argument);
temp = temp / argument;
temp = temp % array_dist[array_dist[0]-step_dim_count];


temp compared to the present processor coordinate allowes to calculate the proper number of blocks to skip as follows:

        if (dim_contr[step_dim_count+1] < temp) {
/* subcount actual proc coordinate from available number of procs */
descriptor->skip = temp - (dim_contr[step_dim_count+1] + 1);
}


If the current processor coordinate is smaller than the calculated auxiliary skip the remaining elements

example:
xyzxy—z
01201 2
temp = 17
argument = 3
(i) temp = 15
(ii) temp = 15/3 = 5
(iii) temp = 5%3 = 2 = auxskip
respective to processor two. It has coordinate 1 and thus
descriptor-> skip = 0, since 2 - (1+1)

The second case is given if the the value of auxiliary skip temp is equal to the present processor coordinate. Therefore skip is the number of processors in the processor array minus the present processor.
At last if auxiliary skip temp is less than the processor coordinate

        else {
if (dim_contr[step_dim_count+1] == temp) {
Ψ        descriptor->skip = array_dist[array_dist[0] - step_dim_count]
ΨΨ-1;
}
Ψ    else {
Ψ        descriptor->skip = array_dist[array_dist[0]-step_dim_count]-
ΨΨ(dim_contr[step_dim_count + 1] + 1);
Ψ        printf ("lod tmp %d\n", descriptor->skip);
Ψ    }
}


If the processor coordinate is greater than temp skip becomes the value of the number of processors minus the processor coordinate for the present dimension.

example:
xyzxy—z
01231 2
temp = 17
argument = 3
(i) temp = 15
(ii) temp = 15/3 = 5
(iii) temp = 5%4 = 1 = auxskip
respective to processor three, which has coordinate 2 and thus
descriptor-> skip = 0, since 4 - (2)

The number, which is stored in skip as a result of the operations till this point is the number of processors followed by the present processor.

To obtain the number of elements to skip, the present value of skip is multiplied with agrument.

        descriptor->skip *= argument;


At last if an irregular block exists skip becomes multiplied with type_size. That is if nbr_occ_elem_mod is not zero. Furthermore the value of no_blocks is 2 because of two structures basic_block.
If no irregular block exists no_blocks is 1. If a block with less elements than agument exists (This block must be assigned to another processor. Otherwise an irregular block for the present processor would exist.) the number of these elements become added to skip.

        if (nbr_occ_elem_mod) { /* irregular block is given */
descriptor->skip *= type_size;
descriptor->no_blocks = 2;
}
else { /* no irregular block */
descriptor->no_blocks = 1;
descriptor->skip += (global_length % argument);
descriptor->skip *= type_size;
}
} /* fi local_length > argument */


The next possible state is given if the local_length is equal to argument. In this case no irregular block for the present processor is possible. Of course a possible irregular block of another processor is taken into consideration.
From the global_length where the irregular block (if it exists) is included the present processor coordinate which becomes multiplied with the number of element of each block argument becomes subtracted. So skip contains the number of all elements from the rightmost present processor to the right border.
no_blocks is 1 again.

    else { /* local length == arguement */

if (local_length == argument) {
descriptor->skip =
global_length - ((dim_contr[step_dim_count + 1] +1) * argument);
descriptor->skip *= type_size;
descriptor->no_blocks = 1;
}


The last case is given if the local_length is smaller than argument. In this case the block assigned to the present processor must be the last elements in the proper dimension. skip must be zero. mem_glob_length[0] must be redefined.

            else {  /* local length < arguement -> must be last element */
if (local_length < argument) {
mem_glob_length[0] = global_length;
descriptor->skip = 0;
descriptor->no_blocks = 1;
}
}
}


The auxiliary pointer *free_space points to the next free address after Access_Desc. That’s where descriptor->basic points to through descriptor->basics = *free_space;. The next free address is the address after the last structure basic_block. Depending on descriptor->no_blocks (one or two blocks) *free_space points one or two blocks after Access_Desc.

        /* point to basic_block */
*free_space = (struct basic_block *) ((void *)descriptor) +
sizeof(Access_Desc);
descriptor->basics = *free_space;
*free_space += descriptor->no_blocks * sizeof (struct basic_block);


The following calculations relate to variables stored in basic_blocks. For each block (regular and irregular) the values of offset, repeat, stride and count are calculated.
The index i of the slope indicates which block becomes calculated at the moment.
The first variable offset for the regular block is the number of elements assigned to processors before the first element assigned to the present processor. If the first processor gets assigned the first elements too offset is zero.
For the irregular block offset becomes assigned the value of skip. Supplementary skip becomes set to zero. The calculation of skip in Access_Desc is used as offset in this case.

    for ( i = 0; i < descriptor->no_blocks; i++) {
descriptor->basics[i].offset = 1; /* default */
down_count = array_dimension - (step_dim_count + 1);

if ( i == 0) { /*  calculation regular block */
if (dim_contr[step_dim_count + 1]) {
descriptor->basics[i].offset =
argument * dim_contr[step_dim_count + 1] * type_size;
}
else {
descriptor->basics[i].offset = 0;
}
}
else { /* offset equ to skip of regular block */
descriptor->basics[i].offset = descriptor->skip;
descriptor->skip = 0;
}


Calculation of repeat is divided into several steps. First variable temp stores the number of blocks with length argument. Outgoing from this number aux_repeat can be calculated. aux_repeat becomes assigned the number of how many times the present processor becomes assigned data elements with length argument ignoring the difference between regular and irregular blocks.
This is corrected by the next statements. If an irregular block exists and the local_length is greater or equal argument (data assignment is repeated more than one time) the variable temp becomes assigned the global length of the regular part first. This value divided through the number of processors of the processor array for the corresponding dimension results in a value which can be compared to processor coordinate. If they are equal the number of aux_repeat must become decremented.

The last step is to assign the proper value of repeat to descriptor->basics[i].repeat. This depends on if i is zero (regular block) or one (irregular block). If i is zero the value of aux_repeat can be assigned to repeat. If i is one repeat becomes assigned the value 1 because an irregular block becomes repeated only one time.

        temp = global_length / argument;

if (global_length % argument) {
temp += 1;
}
aux_repeat = ((temp - dim_contr[step_dim_count + 1] -1) /
array_dist[array_dist[0] - step_dim_count]) + 1;

if ((global_length % argument) && (local_length >= argument)) {
temp = (global_length - (global_length % argument)) / argument;
temp = temp % array_dist[array_dist[0] - step_dim_count];
if (dim_contr[step_dim_count + 1] == temp )
aux_repeat -= 1;
}

if ( (!i) )  /* calc only neccessary in regular block  */
descriptor->basics[i].repeat = aux_repeat;
else /* irregular block */
descriptor->basics[i].repeat = 1;


Outgoing from calculation of aux_repeat before stride can be calculated. If aux_repeat is greater than 1 (stride exists)) stride is the number of blocks between two blocks assigned to the proper processor. In any case this is the number of processors of the processor array minus one. If aux_repeat is equal or lower 1 only one block of elements becomes assigned to one processor. That is why stride is zero.

        if (aux_repeat > 1)
descriptor->basics[i].stride =
array_dist[array_dist[0]-step_dim_count] - 1;
else
descriptor->basics[i].stride = 0;

descriptor->basics[i].stride *= (argument * type_size);


Considering the first case - the calculation of count of the regular block the variable count ist 1 if the passed argument is 1. In this case each element becomes assigned to a processor.
If argument is not equal 1 two cases can be distinguished. In the first case the local_length is smaller than argument. That is when count becomes assigned the value of local_length (the length of one block with blocklength local_length smaller than argument). If local_length is greater than argument the variable count becomes assigned the value of argument. argument is the maximum blocklength in the regular block.
If the first dimension is reached the calculated value of count becomes multiplied with type_size. This is only done if the values of the first dimension are calculated.
Because of the irregular block consists of only one block count is the the number of elements forming this irregular block. In this case there is no distinction between the first dimension and all further dimensions. Each calculated value of count for the irregular block becomes multiplied with type_size.


if (i == 0) { /* regualar block */

if (argument == 1) /* cyclic(1) */
descriptor->basics[0].count = 1;
else

if (local_length < argument)
descriptor->basics[0].count = local_length;
else
descriptor->basics[0].count = argument;

if ((step_dim_count + 1) >= array_dimension)
descriptor->basics[0].count *= type_size;
}
else { /* irregular block */

descriptor->basics[i].count = local_length % argument;
descriptor->basics[i].count *= type_size;
}


At last the the calculated values skip, offset and stride become multiplied with all global_lengths of the remaining dimensions.
sub_counts and sub_actuals values are zero. These two variables are not neccessary for this interface. Instead of its values are set to zero (for the regular and irregular block).

        down_count = array_dimension - (step_dim_count + 1);

for (u = 0; u < down_count; u++) {
descriptor->skip *= array_dist[distance + ( 4 * u)];
descriptor->basics[i].offset *=
array_dist[distance + (4 * u)];
descriptor->basics[i].stride *=
array_dist[distance + (4 * u)];
}
descriptor->basics->sub_count = 0;
descriptor->basics->sub_actual = 0;
descriptor->basics[i].sub_count = 0;
descriptor->basics[i].sub_actual = 0;

} /* i: [0 ... descriptor->no_blocks] */
} /* switch */


## Chapter 8 Performance Analysis

Some simple performance tests have been performed in order to prove the efficiency and scalability of the ViPIOS design and the implementation as well. The results have been compared against other existing I/O systems (i.e. Unix file I/O and ROMIO) to estimate the overhead of ViPIOS for simple read/write operations which can not be accelerated by parallelization.

### 8.1 Tests setup

In order to test ViPIOS we carried out several experiments on a workstation cluster of 8 LINUX Pentium-133 PCs (nodes) connected by a 100 Mbit Ethernet network. Each PC had 32 MB main memory and a hard disk of 2 GByte. Compared to existent super computers an environment like this offers the following advantages for testing purposes:

• low hard- and software costs

• availability (the system can be dedicated to the test program so results are not influenced by changing workloads etc.)

Furthermore the Beowulf [1] [75] [84] and the Myrinet [7] [14] projects have shown that whith some enhancements in the network topology such PC-networks can provide peak performance in excess of 1 GFLOPS and also high disk bandwidths.

First, we ran some scalability tests where we increased the number of server processes while keeping the number of client processes constant. Moreover we used dedicated I/O nodes which means that each process (either server or client) ran on a different node (see figure 8.1).

Second, we experimented with non dedicated I/O nodes which means that on every node one server and one client process were running concurrently. The results forced us to differentiate between the following two cases:

• The client process ran on the same node as its associated server process as depicted in figure 8.2.

• Each client process was connected to a server process on a different node as depicted in figure 8.3.

### 8.2 Results

#### 8.2.1 Dedicated I/O nodes

The workload for our first experiment was an 8MB file. This had to be read by four SPMD client processes (i. e. each client had to read 2 MB without overlap). In other words, client 0 read the first quarter of the file, client 1 read the second quarter and so on. As depicted in figure 8.1 we increased the number of server processes from one to four.

In order to measure the overall time for reading the whole file, the concurrently running processes were synchronized before and after the read request. By imposing these two barriers we guaranteed that the time to read the whole file corresponded to the reading time of the slowest process. A short extract from the program code is shown in figure 8.4.

The whole process was iterated 50 times. In order to suppress any caching effects data was not only read from one file but from 10 different files in a round robin fashion. Thus, during the first iteration file1 was read, during the second iteration file2 etc. After 10 iterations file1 was read again. It is important to state that each server read its data locally from the node it was running on. In addition to the mean we measured the maximum, minimum and variance of the time required for reading the whole file. The results are given in table 8.1.