
DiSquawk: 512 cores, 512 memories, 1 JVM

Abstract

Trying to cope with the constantly growing number of cores per processor, hardware architects are experimenting with modular non cache coherent architectures. Such architectures delegate memory coherence to the software. In contrast, high productivity languages, like Java, are designed to abstract away the hardware details and allow developers to focus on the implementation of their algorithms. Such programming languages rely on a process virtual machine to perform the operations necessary to implement the corresponding memory model. Arguing about the correctness of such implementations, however, is not trivial.

In this work we present our implementation of the Java Memory Model in a Java Virtual Machine targeting a 512-core non cache coherent memory architecture. We briefly discuss design decisions and present early evaluation results, which demonstrate that our implementation scales up to 512 cores. We model our implementation as the operational semantics of a Java core calculus that we extend with synchronization actions, and prove its adherence to the Java Memory Model.


Technical Report FORTH-ICS/TR-470, June 2016

Keywords: Java Virtual Machine; Java Memory Model; Operational Semantics; Non Cache Coherent Memory; Software Cache

1 Introduction

Current multicore processors rely on hardware cache coherence to implement shared memory abstractions. However, recent literature largely agrees that existing coherence implementations do not scale well with the number of processor cores, incur large energy and area costs, increase on-chip traffic, or limit the number of cores per chip [9, 35, 7], despite several attempts to design less costly or more scalable coherence protocols [24, 26].

To address this issue, recent work on hardware design proposes modular many-core architectures. Examples include the Intel® Runnemede architecture [7], the Formic prototype [20], and the EUROSERVER architecture [11]. These architectures are designed in a way that allows scaling up by plugging in more modules. Each module is self-contained and able to interface with other modules; connecting multiple such modules builds a larger system that can be seen as a single many-core processor. In such architectures the trend is to use multiple mid-range cores with local scratchpads, interconnected through efficient communication channels.

The lack of cache coherence renders the software responsible for performing the necessary data transfers to ensure data coherency in parallel programs. However, in high productivity languages, such as Java, the memory hierarchy is abstracted away by the process virtual machines rendering the latter responsible for the data transfers. Process virtual machines provide the same language guarantees to the developers as in cache coherent shared-memory architectures. Those guarantees are formally defined in the language’s memory model. The efficient implementation of a language’s memory model on non cache coherent architectures is not trivial though. Furthermore, arguing about the implementation’s correctness is even more difficult.

In this work we present an implementation of the Java Memory Model (JMM) [23] in DiSquawk, a Java Virtual Machine targeting the Formic-cube, a 512-core non cache coherent prototype based on the Formic architecture [20, 1]. We briefly discuss design decisions and present evaluation results, which demonstrate that our implementation scales with the number of cores. To prove our implementation's adherence to the Java Memory Model, we model it as the operational semantics of Distributed Java Calculus (DJC), a Java core calculus that we define for this purpose.

Specifically, this work makes the following contributions:

  • We present a Java Memory Model (JMM) implementation for non cache coherent architectures that scales up to 512 cores, and we briefly discuss our design decisions.

  • We present Distributed Java Calculus (DJC), a Java core calculus with support for Java synchronization actions and explicit cache operations.

  • We model our JMM implementation as the operational semantics of DJC.

  • We prove that the operational semantics of DJC adheres to the JMM and present a proof sketch.

The remainder of this paper is organized as follows. Section 2 briefly presents JDMM, a JMM extension for non cache coherent memory architectures, and the motivation for this work; Section 3 presents our implementation of JDMM and briefly discusses the design decisions; Section 4 presents DJC, its operational semantics, and a proof sketch of its adherence to JDMM; Section 5 discusses related work; and Section 6 concludes.

2 Background and Motivation

In order to reduce network traffic and execution time, Java Virtual Machines (JVMs) on non cache coherent architectures usually implement some kind of software caching [25, 4] or software distributed shared memory [36, 34, 38, 12]. Both approaches rely on similar operations: to access a remote object, they fetch a local copy; to make dirty copies globally visible, they write them back (write-back); and to free cache space or force an update on the next access, they invalidate local copies. Since JMM [23] is agnostic about such operations, we base our work on the Java Distributed Memory Model (JDMM) [37].
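These three operations recur throughout the rest of the paper, so we sketch them below as a minimal interface. All names here (SoftwareCache, ObjectRef, and the three methods) are hypothetical, chosen for illustration; they are not the API of any of the cited JVMs.

    // Hypothetical reference type standing in for a heap object's identity.
    interface ObjectRef {}

    // Minimal sketch of the three cache operations discussed above;
    // the names are illustrative, not an actual JVM API.
    interface SoftwareCache {
        void fetch(ObjectRef ref);      // copy a remote object into the local cache
        void writeBack(ObjectRef ref);  // make a dirty local copy globally visible
        void invalidate(ObjectRef ref); // drop the local copy, forcing a re-fetch
    }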

The JDMM is a redefinition of JMM for distributed or non cache coherent memory architectures. It extends the JMM with cache related operations and formally defines when such operations need to be executed to preserve JMM’s properties. The JDMM is designed to be as relaxed as the JMM. Following a similar approach to that of Owens et al. [27] in the x86 Total Store Order (x86-TSO) definition, the JDMM first defines an abstract machine model and then defines the memory model based on it.

Figure 1: The memory abstraction. (Computation blocks of four cores each connect to their local scratchpad memories; each scratchpad is split into a local slice and a global slice.)
Figure 1 presents an instance of the abstract machine as presented in the JDMM paper. On the left side there are several computation blocks with four cores each. Each computation block connects directly to its local scratchpad memory. The scratchpad memory is split into a local and a global slice. In this model, each local slice connects with every global slice in the system, but not with any other local slice. The connections are bi-directional: a core can copy data from a remote global slice to the local cache to improve performance; after finishing its job it can transfer the new data back.

The local slice of the scratchpad is used for the local data (i.e., Java stacks) and for caching remote data. The global slices are partitions of a total virtual Java Heap, similarly to Partitioned Global Address Space (PGAS) models. The state of the memory can only be altered by the computation blocks or by committing a fetch, a write-back, or an invalidate instruction.

In this abstract machine memory model, the software needs to explicitly transfer data in such a way that JMM guarantees are preserved. At a high level, JMM guarantees that data-race-free (DRF) programs are sequentially consistent, and that variables cannot get out-of-thin-air values under any circumstances. To define our core calculus and couple it with the JDMM, we use a subset of the notation used in the JDMM paper, which we present here along with our short presentation of the JDMM. The JDMM describes program executions as tuples consisting of:

  1. a set of instructions,

  2. a set of actions, some of which are characterized as synchronization actions.

    The JDMM distinguishes the following kinds of actions:

    • read, write, and initialization actions on a heap-based variable,

    • read and write actions on a volatile variable,

    • lock and unlock actions on a monitor,

    • the start and the end of a thread,

    • the interruption of a thread and the detection of such an interruption by another thread,

    • the spawning (Thread.start()) and the joining of a thread, or the detection that it terminated,

    • external actions, i.e., I/O operations,

    • fetches of heap-based variables,

    • write-backs of heap-based variables,

    • invalidations of cached variables.

    Note that volatile reads and writes, lock and unlock actions, thread start and finish actions, interrupt and interrupt-detection actions, and spawn and join actions are characterized as synchronization actions and form the only communication mechanism between threads.

  3. the program order, which defines the order of actions within each thread,

  4. the synchronization order, which defines a total ordering among the synchronization actions,

  5. the synchronizes-with order, which defines the pairs of synchronization actions —release and acquire pairs,

  6. the happens-before order that defines a partial order among all actions and is the transitive closure of the program order and the synchronizes-with order, and

  7. some helper functions that we do not use in this paper.
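To make the action vocabulary concrete, the hypothetical fragment below annotates each statement with the JDMM action kinds it gives rise to on a non cache coherent machine; the mapping in the comments follows the definitions above.

    class ActionsExample {
        int x;                    // an initialization action when the object is created
        volatile boolean done;    // accesses to 'done' are volatile (synchronization) actions

        void writer() {
            x = 42;               // a write action, followed eventually by a write-back
            done = true;          // a volatile write: a synchronization (release) action
        }

        void reader() {
            while (!done) { }     // volatile reads: synchronization (acquire) actions
            int r = x;            // a read action; on another core it must be preceded
        }                         // by a fetch of the write-back of x
    }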

The JDMM explicitly defines the conditions that a Java program execution needs to satisfy on a non cache coherent architecture to be a well-formed execution. These conditions are introduced in [37, §3 and §4.2]; we briefly present them here. Note that WF-1–WF-9 were first introduced in [23].

WF-1

Each read of a variable sees a write to it.

WF-2

All reads and writes of volatile variables are volatile actions.

WF-3

The number of synchronization actions preceding another synchronization action is finite.

WF-4

Synchronization order is consistent with program order.

WF-5

Lock operations are consistent with mutual exclusion.

WF-6

The execution obeys intra-thread consistency.

WF-7

The execution obeys synchronization order consistency.

WF-8

The execution obeys happens-before consistency.

WF-9

Every thread’s start action happens-before its other actions except for initialization actions.

WF-10

Every read is preceded by a write or fetch action, acting on the same variable as the read.

WF-11

There is no invalidation, update, or overwrite of a variable’s cached value between the action that cached it and the read that sees it.

WF-12

Fetch actions are preceded by at least one write-back of the corresponding variable.

WF-13

Write-back actions are preceded by at least one write to the corresponding variable.

WF-14

There are no other writes to the same variable between a write and its write-back.

WF-15

Only cached variables can be invalidated. Invalid cached data cannot be invalidated.

WF-16

Reads that see writes performed by other threads are preceded by a fetch action that fetches the write-back of the corresponding write and there is no other write-back of the corresponding variable happening between the write-back and the fetch.

WF-17

Volatile writes are immediately written back.

WF-18

A fetch of the corresponding variable happens immediately before each volatile read.

WF-19

Initializations are immediately written back; their write-backs complete before the start of any thread.

WF-20

The happens-before order between two writes is consistent with the happens-before order of their write-backs.

Two additional conditions must hold for executions containing thread migration actions. Intuitively:

WFE-1

There is a corresponding fetch action between a thread migration and every read action.

WFE-2

Additionally, to make sure the fetched value is the latest according to the happens-before order, any dirty data on the old core need to be written back.
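A minimal sketch of a migration hook honoring these two conditions is shown below. All types and methods (Core, Cache, flushAll, invalidateAll, schedule) are hypothetical abstractions introduced for illustration; they are not DiSquawk's scheduler API.

    // Hypothetical abstractions over a core's software cache and scheduler.
    interface Cache { void flushAll(); void invalidateAll(); }
    interface Core  { Cache writeBuffer(); Cache objectCache(); void schedule(Runnable t); }

    final class Migration {
        // Migrate thread t from core 'from' to core 'to'.
        static void migrate(Runnable t, Core from, Core to) {
            from.writeBuffer().flushAll();    // WFE-2: write back dirty data on the old core
            to.writeBuffer().flushAll();      // treat arrival as an acquire: flush, then
            to.objectCache().invalidateAll(); // invalidate, so reads on the new core miss
            to.writeBuffer().invalidateAll(); // and are preceded by a fetch (WFE-1)
            to.schedule(t);
        }
    }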

Figure 2: Time window example. (Thread T1 performs m-enter, write, m-exit; thread T2 subsequently performs m-enter, read, m-exit.)

Note that, in the core JDMM, context switching without thread migration is examined only as an extension. As a result, we henceforth use a slightly modified version of WF-16 that allows DJC to be more relaxed in the case of context switches and still comply with the JDMM. The modified rule enables different threads running on the same core to share the contents of a single cache without breaking adherence to the JMM, as shown in [37, §5.2]. That is:

WF-16

Reads that see writes performed by another core are preceded by a fetch action that fetches the write-back of the corresponding write and there is no other write-back of the corresponding variable happening between the write-back and the fetch.

The JDMM intuitively states that a write-back and its corresponding fetch may be executed at any time in the time window between a write and the corresponding read, given that the write happens-before this read. For instance, in Figure 2 thread T1 performs a write that happens-before the corresponding read in thread T2. The happens-before relationship is a result of the monitor release, m-exit, by T1 and the subsequent monitor acquisition, m-enter, by T2. The time window in which the JDMM allows a write-back and its corresponding fetch to be performed is marked with the big black dashed rectangle.
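In Java source, the scenario of Figure 2 corresponds to the fragment below (illustrative); the comments mark where the release and acquire actions occur and, therefore, where the write-back and fetch may float.

    class TimeWindowExample {
        private final Object m = new Object();
        private int x;

        void t1() {                // thread T1
            synchronized (m) {     // m-enter
                x = 1;             // write; JDMM allows its write-back any time
            }                      // m-exit (release) before T2 fetches x
        }

        void t2() {                // thread T2
            synchronized (m) {     // m-enter (acquire)
                int r = x;         // read; it must see T1's write, so a fetch of
            }                      // that write's write-back precedes it
        }
    }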

This flexibility on when these operations can be executed allows, in theory, for great optimization. In practice, however, it is very difficult to even estimate this time window. The JVM needs to keep extra information for every field in the program and constantly update it: the sequence of lock acquisitions, the last writer, whether the last write has been written back, and whether the cached value (if any) is consistent with main memory. Implementing this bookkeeping over software caching seems prohibitive, as its cost and the extra communication are expected to be much higher than the benefits in energy, space, and performance.

Figure 3: Performance impact of the arguments' size and number

An intuitive implementation is to issue all the write-backs at release actions. However, this may result in long blocking release actions for critical sections that perform writes on large memory segments. To demonstrate the overhead of such operations we perform a simple experiment, where one core transfers a given data set from another core's scratchpad to its own. Figure 3 shows the impact of the arguments' size and number on the data transfer time. On the y-axis we plot the clock cycles consumed to transfer all the data from one core's scratchpad to another's. On the x-axis we plot the total size of the data in bytes. Each line in the plot represents a different partitioning of the data, into 1, 10, 25, 50, and 100 arguments, respectively. We observe that, apart from the total data size, the partitioning of the data also impacts the transfer time. This is a result of performing multiple data transfers instead of a single bulk transfer. As a result, keeping a lot of dirty data cached until a release operation is expected to perform badly, as it will most probably need to perform multiple data transfers to write back non-contiguous dirty data.

Hera-JVM [25] —the only, to the best of our knowledge, JVM for a non cache coherent architecture that claims adherence to the JMM— issues a write-back for every write and then waits for all pending write-backs to complete at release actions. This approach significantly reduces the blocking time at release actions, but results in multiple redundant write-backs in cases where a variable is written multiple times in a critical section. Such redundant memory operations are usually overlapped with computation, keeping their performance overhead low. However, the additional energy consumption they impose might still be significant in energy-critical systems. Additionally, in the case of writing to array elements, their approach results in one memory transfer per element, where a single bulk transfer could be used to improve performance and energy efficiency.

In this work we propose an alternative write-back policy that aims to mitigate such cases by caching dirty data up to a certain threshold. Additionally, since the Formic architecture is more relaxed than the Cell B.E. [29] architecture that Hera-JVM targets, we also present novel mechanisms to handle synchronization.

3 Implementation

We implement our memory and cache management policy in DiSquawk, a JVM we developed for the Formic-cube 512-core prototype. Formic-cube is based on the Formic architecture [20], which is modular and allows building larger systems by connecting multiple smaller modules. The basic module in the Formic architecture is the Formic-board. Each board consists of 8 MicroBlaze™-based, non cache coherent cores and is equipped with 128 MB of scratchpad memory. Each core also features a private software-managed, non-coherent, two-level cache hierarchy; a hardware queue (mailbox) that supports concurrent enqueuing but dequeuing only by the owner core; and a DMA engine. All of Formic's scratchpads are addressable through a global address space, and data are transferred to and from remote memory addresses through DMA transfers and mailbox messages.

3.1 Software Cache Management

As the Formic-cube does not provide hardware cache coherence, we build our JVM based on software caching. Each core is assigned a part of the local scratchpad, which it uses as its private software cache. This software cache is entirely managed by the JVM, transparently to the programmer.

To limit the amount of cached dirty data to a given threshold, we split the software cache in two parts. The first part, called the object cache, is used for caching objects and is append-only —writes to this cache are not permitted. The second part, called the write buffer, is dedicated to caching dirty data. When the write buffer becomes full, we write back all its data and update the corresponding fields in the object cache, if the corresponding object is still cached. Note that the combination of the write buffer and the object cache forms a memory hierarchy, where the write buffer is below the object cache. That is, read accesses first go through the write buffer, and only if they miss do they go to the object cache; if they miss there as well, the JVM proceeds to fetch the corresponding object. This way, we (i) set an upper limit on the blocking time of release operations; (ii) allow write-backs to overlap with computation when the threshold is met; (iii) allow bulk transfers of contiguous data, e.g., written elements of an array; and (iv) allow multiple writes to the same variable without the need to write back every time.

At acquisition operations, we write back all the dirty data, if any, and invalidate both the object cache and the write buffer, in order to force a re-fetch of the data if they are accessed in the future. The write-back of the dirty data at acquisition operations is necessary since we invalidate all the cached data. Consider an example where a monitor is entered (acquire operation), then a write is performed, and then a different monitor is entered (another acquire operation). In this case, simply invalidating all cached data would result in the loss of the write. This approach is safe and sound, as we later show, but shrinks the aforementioned time window, thus limiting the optimization space. A visualization of the shrunk time window is presented in Figure 2: the small red dashed rectangle on the upper left corner of the big rectangle is the time window in which the write-back can be executed; respectively, the small green dashed rectangle on the lower right corner is the time window in which the corresponding fetch can be executed.

Note that although pre-fetching data, even in the shrunk time window, allows for significant performance optimizations, we do not implement it in this work; instead, we only fetch data at cache misses. Pre-fetching depends on program analysis to infer which data are going to be accessed in the future. Such analyses are not specific to non cache coherent architectures or the Java Memory Model, thus they are out of the scope of this work. Despite the aforementioned reduction of flexibility regarding when a data transfer can happen, and the lack of support for pre-fetching, we are still able to achieve good performance and scale with the number of cores, thanks to the efficient on-chip communication channels.
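Putting the above together, the split cache can be sketched as below. This is a minimal sketch: FieldId, Obj, and Heap are hypothetical placeholders for DiSquawk's internal types and DMA primitives (ObjectRef reuses the hypothetical type from the sketch in Section 2), and updating cached objects on a flush is elided.

    interface FieldId {}
    interface Obj { Object field(FieldId f); }

    // Hypothetical heap facade hiding the DMA transfers.
    final class Heap {
        static Obj fetch(ObjectRef r) { /* DMA from the owning scratchpad */ return f -> null; }
        static void writeBackAll(java.util.Map<FieldId, Object> dirty) { /* bulk DMA */ }
    }

    final class CoreCache {
        private final java.util.Map<FieldId, Object> writeBuffer = new java.util.HashMap<>();
        private final java.util.Map<ObjectRef, Obj> objectCache = new java.util.HashMap<>();
        private final int threshold; // upper bound on the amount of cached dirty data

        CoreCache(int threshold) { this.threshold = threshold; }

        // Reads go through the write buffer first, then the object cache,
        // and fetch from the heap only on a miss in both.
        Object read(ObjectRef ref, FieldId f) {
            if (writeBuffer.containsKey(f)) return writeBuffer.get(f);
            return objectCache.computeIfAbsent(ref, Heap::fetch).field(f);
        }

        // Writes only touch the write buffer; reaching the threshold
        // triggers a bulk write-back that can overlap with computation.
        void write(FieldId f, Object v) {
            writeBuffer.put(f, v);
            if (writeBuffer.size() >= threshold) flushWriteBuffer();
        }

        void flushWriteBuffer() {
            Heap.writeBackAll(writeBuffer); // one bulk transfer of all dirty data
            writeBuffer.clear();
        }

        // Acquire operations write back dirty data, if any, and invalidate
        // both parts of the cache, forcing re-fetches of later accesses.
        void onAcquire() {
            flushWriteBuffer();
            objectCache.clear();
        }
    }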
To demonstrate this we use the Crypt, SOR, and Series benchmarks from the Java Grande suite [33], and the Black-Scholes benchmark from the PARSEC suite [5], ported to Java. Due to the lack of garbage collection and the upper limit of 4 GB of heap, we are unable to run reasonable workloads with the rest of the Java Grande benchmarks: they require datasets larger than 4 GB to produce meaningful results on a large number of cores, and some of them also create objects with short lifespans, relying on garbage collection to reclaim their memory.

Series and Black-Scholes are embarrassingly parallel benchmarks. Each thread operates on a different subset of data from an input set and creates a new set with the corresponding results. The results are then accessed by the main thread for validation. Crypt comprises two embarrassingly parallel phases. In the first phase each thread encrypts a subset of the input data and then waits on a barrier. When all threads reach the barrier, each proceeds to decrypt a subset of the encrypted data. The results are then compared to the original input for validation. SOR performs a number of iterations where each thread acts on a different block of an array, also accessing the previous and next neighboring blocks. As a result, each iteration depends on the neighboring blocks. To ensure that the neighboring blocks are ready, SOR uses a volatile counter per thread; this counter reflects the iteration the corresponding thread is on. Each thread updates its counter at the end of each iteration and accesses the two counters of its neighboring threads.

Figure 4: Speedup Results
Figure 4 presents the speedup of the four benchmarks on both DiSquawk, running on the Formic-cube, and HotSpot, running on a 4-chip NUMA machine with 16 cores per chip, totaling 64 cores. Since the Formic-cube is a prototype clocked at 10 MHz, a comparison of throughput or execution time is not meaningful, so we compare the applications' scaling on the two architectures. The presented speedups are over the performance of the application running on a single core of each architecture, respectively. Since DiSquawk does not support JIT compilation, we also disable it in HotSpot (using the -Xint flag); this allows us to better understand the applications' behavior on both architectures. The number of Java threads, one per core, is placed on the x-axis, and the speedup on the y-axis. Both axes are in logarithmic scale of base 2. We observe that all benchmarks manage to scale with the number of cores on both architectures. Black-Scholes and Series scale better on DiSquawk than on HotSpot when using 32 or more cores, while Crypt performs better on HotSpot than on DiSquawk when using up to 32 cores.

3.2 Java Monitors

Apart from the data movement, the JDMM also dictates the operation of Java monitors. Java monitors are essentially re-entrant locks associated with Java objects. In Java, each object is implicitly associated with a monitor and can be used in a synchronized block as the synchronization point. On shared-memory cache coherent architectures, Java monitors are usually implemented using atomic operations, such as compare-and-swap, relying on the hardware to synchronize the threads competing for the monitor. Such atomic operations, however, are not standard in non cache coherent architectures [14, 20].

To implement Java monitors on such architectures we propose a synchronization manager: a server running on a dedicated core, handling monitor enter/exit requests. To keep contention low we use multiple synchronization managers, according to the number of available cores in the system. Each synchronization manager is responsible for a number of objects in the system, and each object is associated with its synchronization manager through a hash function. When a thread executes a monitor-enter, the JVM communicates with the corresponding synchronization manager and requests ownership of the monitor. This way, all requests regarding a single monitor end up in the corresponding synchronization manager's hardware message queue, from where the synchronization manager handles them one by one, in the order they arrived. We essentially delegate the synchronization of the requests to the architecture's network-on-chip, and provide mutual exclusion through the synchronization managers.

To reduce the synchronization managers' load, the network traffic and contention, and the energy consumption, we take advantage of the blocking nature of monitors. Instead of sending back negative responses when a monitor is already acquired by some other thread, we queue the monitor-enter requests in the synchronization manager and assign the monitor to the oldest requester when it becomes available. This way we ensure fairness in the order in which requests are handled. Although this is not required by the Java Language Specification [13], we consider it better than arbitrarily choosing one of the waiting threads, since it avoids thread starvation. Additionally, when a thread is waiting for a monitor it yields, to free up resources for other threads. Instead of periodically rescheduling such waiting threads —as we do with other yielded threads— we use a mechanism that reschedules them only when the monitor they requested has been assigned to them, that is, when the synchronization manager has sent an acknowledgement message to the core executing the waiting thread.
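The main loop of such a synchronization manager might look as follows. This is a sketch under the assumptions above: the message format and the ack mechanism are hypothetical, the hardware mailbox provides the arrival order, and re-entrant acquisitions are handled locally on the requesting core (cf. Section 4).

    // Hypothetical mailbox message: which monitor, and which core asks.
    record Request(int monitorId, int requesterCore) {}

    final class SyncManager {
        private final java.util.Map<Integer, Integer> owner = new java.util.HashMap<>();
        private final java.util.Map<Integer, java.util.ArrayDeque<Integer>> waiters =
                new java.util.HashMap<>();

        // Monitor-enter: grant the monitor if free; otherwise queue the
        // requester (no negative replies, the requesting thread just yields).
        void monitorEnter(Request r) {
            if (owner.putIfAbsent(r.monitorId(), r.requesterCore()) == null) {
                ack(r.requesterCore(), r.monitorId());
            } else {
                waiters.computeIfAbsent(r.monitorId(), k -> new java.util.ArrayDeque<>())
                       .addLast(r.requesterCore()); // FIFO: oldest requester is served first
            }
        }

        // Monitor-exit: hand the monitor over to the oldest waiter, if any.
        void monitorExit(Request r) {
            var queue = waiters.get(r.monitorId());
            Integer next = (queue == null) ? null : queue.pollFirst();
            if (next == null) {
                owner.remove(r.monitorId());
            } else {
                owner.put(r.monitorId(), next);
                ack(next, r.monitorId()); // reschedules the waiting thread
            }
        }

        private void ack(int core, int monitorId) { /* mailbox message to 'core' */ }
    }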

Using a synthetic micro-benchmark that constantly issues requests to a single synchronization manager from multiple cores, we find that, on our system, at least one synchronization manager per 243 cores is required to avoid scenarios where the synchronization manager becomes a bottleneck.

3.3 Volatile Variables

Another challenging part is the support of volatile variables. Volatile variables are special because accessing them is a form of synchronization: volatile reads act as acquire operations, while volatile writes act as release operations. That is, after a volatile read, any data visible to the last writer of the corresponding volatile variable must become visible to the reader. In shared-memory cache coherent systems, volatile accesses are usually implemented using memory fences provided by the underlying architecture [19].

Since non cache coherent architectures do not provide memory fences, in our implementation we rely on the synchronization managers to ensure a total ordering of the accesses to each volatile variable. Essentially, we treat volatile accesses as synchronized blocks protected by a special monitor, unique per volatile variable. Therefore, we write back and invalidate any cached data before volatile accesses, and write back the dirty data immediately after volatile writes. This approach comes at the cost of unnecessary cache invalidations in the case of volatile writes; these should be infrequent, since volatile variables are usually employed as completion, interruption, or status flags [28, §3.1.4] —meaning that they are mostly read during their life-cycle.
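Conceptually, this treatment is equivalent to the rewriting sketched below, reusing the hypothetical CoreCache from Section 3.1; the JVM performs these steps internally at runtime, it does not rewrite source code.

    final class VolatileBox {
        private final Object lockV = new Object(); // synthetic per-variable monitor
        private long v;                            // stands for a 'volatile long v'

        long volatileRead(CoreCache cache) {
            synchronized (lockV) {   // total order on accesses to v, plus atomicity
                cache.onAcquire();   // write back dirty data and invalidate caches
                return v;            // never a partial value (see Section 3.3)
            }
        }

        void volatileWrite(CoreCache cache, long value) {
            synchronized (lockV) {
                cache.onAcquire();        // write back and invalidate before the access
                v = value;
                cache.flushWriteBuffer(); // release: remaining dirty data is written
            }                             // back immediately
        }
    }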

A side-effect of this implementation is the provision of mutual exclusion to concurrent accesses of the same volatile variable. Since Formic provides no guarantees about the atomicity of memory accesses, we rely on this side-effect to ensure that a volatile read never returns an out-of-thin-air value due to a partial update.

3.4 Wait/Notify Mechanism

Java also offers the wait/notify mechanism, which allows a thread to block its execution and wait for another thread to unblock it. Since wait() and notify() require the monitor of the corresponding object to be held by the executing thread, we use the synchronization managers to keep track of such operations as well. The synchronization managers hold a list of waiters for each object they are responsible for. Note that, to keep the space overhead low, we only allocate records when the first request for an object arrives; initially, the synchronization managers hold no data for the objects they are responsible for. Whenever a thread invokes wait(), a special message is sent to the synchronization manager, which adds the corresponding thread to the waiters queue and releases the monitor. As a result, before sending such messages we write back any dirty data. To support wait() invocations with a timeout, we also support messages to the synchronization manager that request the removal of a thread from the waiters list. When notify() is invoked, it sends a message to the synchronization manager, which notifies and removes the longest-waiting thread (if any). In the case of notifyAll(), all threads in the waiters queue get notified and removed.
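The waiter bookkeeping might then look as follows, extending the hypothetical SyncManager sketch above; timeouts and the re-acquisition of the monitor on wake-up are elided.

    final class WaitSupport {
        private final java.util.Map<Integer, java.util.ArrayDeque<Integer>> waitSets =
                new java.util.HashMap<>();

        // wait(): the sender has already written back its dirty data; record
        // it as a waiter and release the monitor on its behalf.
        void onWait(int objectId, int waitingCore, SyncManager monitors, Request release) {
            waitSets.computeIfAbsent(objectId, k -> new java.util.ArrayDeque<>())
                    .addLast(waitingCore);
            monitors.monitorExit(release);
        }

        // notify(): wake and remove the longest-waiting thread, if any.
        void onNotify(int objectId) {
            var queue = waitSets.get(objectId);
            if (queue != null && !queue.isEmpty()) wake(queue.pollFirst());
        }

        // notifyAll(): wake and remove every waiter.
        void onNotifyAll(int objectId) {
            var queue = waitSets.remove(objectId);
            if (queue != null) queue.forEach(this::wake);
        }

        private void wake(int core) { /* mailbox message rescheduling the waiter */ }
    }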

3.5 Liveness Detection

To detect thread termination and check liveness we rely on volatile variables. Each thread is described by a JVM-internal object, which holds a volatile variable with the state of the thread; the supported states are spawned, alive, and dead. We implement isAlive() as a simple read of that state: if it equals alive, we return true. For the join() method, on the other hand, we avoid spinning on the state variable, in an effort to reduce energy consumption and free up resources for other threads in the system. We base our join() implementation on the wait()/notify() mechanism. Since a thread invoking join() has to wait until the completion of the thread it joins, we yield it by invoking wait() on the JVM-internal object describing that thread. When the corresponding thread reaches completion, it invokes notifyAll() on that internal object and wakes up any joiners.
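Under these assumptions, isAlive() and join() reduce to the sketch below, where state stands for the volatile state field of the JVM-internal thread object; the names are illustrative.

    final class JvmThreadState {
        enum State { SPAWNED, ALIVE, DEAD }
        private volatile State state = State.SPAWNED;

        boolean isAlive() {
            return state == State.ALIVE;        // a single volatile read
        }

        synchronized void join() throws InterruptedException {
            while (state != State.DEAD) wait(); // yield instead of spinning
        }

        // Invoked by the terminating thread itself upon completion.
        synchronized void onFinish() {
            state = State.DEAD;
            notifyAll();                        // wake up any joiners
        }
    }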

DiSquawk currently does not support interruptions; we consider their implementation, as far as synchronization is concerned, to be straightforward. Before sending an interrupt, all dirty data of the sending thread need to be written back, and upon interruption the receiving thread needs to write back any dirty data, if present, and invalidate its object cache.

4 The Calculus

To argue about the correctness of our implementation, we model it using a Java core calculus and its operational semantics. We base our calculus on the Java core calculus introduced by Johnsen et al. [16], which omits inheritance, subtyping, and type casts, and adds concurrency and explicit lock support. We extend that calculus by replacing the explicit lock support with synchronization operations and adding support for cache operations. We define the operational semantics of the resulting Distributed Java Calculus (DJC) and use it to argue about the correctness of the cache and monitor management techniques used in DiSquawk.

4.1 Syntax

Figure 5: Abstract syntax of DJC

The syntax of DJC is presented in Figure 5. A Java program consists of a sequence of class definitions. A class definition comprises the class name; the list of field declarations, where each field name is unique; the body of the class constructor; and a sequence of method definitions. The calculus types are class names, the boolean type, the type of scalar natural numbers, and the unit type. A method definition comprises the method's name, its set of formal arguments, its body, and its return type. To keep the calculus simple we do not support method overloading.

The syntax includes variables; creation of class instances; field accesses, where each field identifier is unique; field updates; and sequential composition using the let-construct, where the evaluation of the bound expression may have side-effects. It also includes conditional expressions and method calls.

The syntax also includes monitor enter and exit actions as expressions. Note that volatile accesses do not have separate bytecodes in Java; they appear as normal memory accesses and the JVM checks at runtime whether they are volatile or not. Thus, we do not provide special syntax for them.

Values are references to objects, the unit value, the boolean constants true and false, and scalar numerical constants, abstracting over all other Java scalar types. Contexts are used to show the evaluation sequence of expressions: in each expression, the subexpression in the context's hole is evaluated first.

To argue about threads at runtime we extend DJC's syntax with run-time threads. A thread is defined by the unique identifier of the core that executes it; the corresponding instance of the Thread class; the thread start action, which signals the start of its execution and is not to be confused with the start() method of the Thread class; and the thread's body. Threads can be composed in parallel pairs using an associative and commutative binary parallel-composition operator, for which the empty thread is the neutral element.

We represent objects in the runtime syntax in two forms. The first form is used for every object in memory, while the second is only used for thread objects whose start() method has been invoked, and additionally carries the thread's state. Each object contains the name of its class and a map of field names to values. A thread whose start() method has been invoked is spawned. A thread whose run() method has been invoked is started. A thread that has reached completion is finished. A thread whose interrupt() method has been invoked is interrupted.

The memory of the system is split into the heap, the object caches, and the write buffers. The heap is a map from references to objects and their monitors. An object cache is a map from references to objects. A write buffer is a map from object fields to values. The object caches per core form a map from core ids to object caches and, similarly, the write buffers per core form a map from core ids to write buffers.

To model mutual exclusion we also add a lock state to the runtime syntax. A lock may be free, or acquired by some thread a number of times.

4.2 Operational Semantics

The operational semantics of DJC are based on those introduced by Johnsen et al. [16]. In this work we introduce new rules for fetch, write-back, invalidate, volatile-read, volatile-write, start, finish, join, interrupt, interrupt detection, and migrate operations. Note that we do not model java.util.concurrent, a Java library providing additional synchronization mechanisms, since its interaction with the JMM is not yet fully defined.

Figure 6: Definition of notation. (The table summarizes the notation used in the operational semantics: reference values; method, field, and core identifiers; functions returning the keys and the values of a map; map update, replacing one binding with another; map restriction, yielding the subset of map bindings with keys in a given set; a predicate that returns true if a field is volatile; and the notation for a Java object that is an instance of a class with a mapping of field names to values.)
Figure 7: Semantics of Local Operations
Figure 6 presents a summary of the notation we use in the operational semantics of DJC, along with the corresponding definitions. We discuss these definitions in detail below, together with the operational semantics. To improve readability, we split the operational semantics into four categories: core semantics, regarding the core language; synchronization semantics, regarding volatile accesses, monitor handling, join, and interrupts; semantics for implicit operations performed by the JVM; and global semantics, regarding parallel execution.

Core Semantics

Figure 7 presents the core semantics of DJC. Following the notation of Johnsen et al., local configurations combine the heap, the executing core's object cache and write buffer, and the thread under evaluation. Note that in the conclusions of some semantic rules we annotate the transition operator with an action kind from the JDMM, e.g., to show that rule Field performs a read action. In the proof presented in the appendix, we present all action kinds along with the abbreviations used in the annotations, and use this information to argue about the adherence of the operational semantics to the JDMM. Note that the core identifier and the thread object, although present in every rule, are not involved in any of the rules in Figure 9; we use them to argue about the global semantics, shown in Figure 10. This syntax allows us to argue about which core is executing a thread and which object corresponds to that thread.

The CtxStep rule describes the evaluation of an expression in a context. The IfTrue and IfFalse rules handle conditional expressions in the standard manner, as does rule Let for substitution. Rule Call handles method calls: to determine the body of the invoked method we look it up in the class of the receiving object, and we evaluate the call by substituting the formal arguments with the given ones and this with the receiver in the method body.

In our VM, all memory accesses first go through the write buffer; if they miss they proceed to the object cache. Thus, to access a field we need it to be present either in the write buffer or the object cache. To reason about such accesses we define two structural rules, Field and FieldDirty. Rule Field handles non-volatile field accesses, when the field is cached in the object cache, and FieldDirty handles non-volatile field accesses, when the field is cached in the write buffer.

In Field, the first premise requires that the object containing the field being accessed is in the heap (has been allocated and initialized). The second premise requires that the access does not refer to a volatile field; to check this we use a function that returns true if the field is volatile in the referenced object and false otherwise, modeling the distinction, performed internally by the JVM, between volatile and normal fields. The third premise requires that the core performing the read has a local copy of the field in its object cache. The last premise requires that the field is not cached in the write buffer. Considering the heap, object caches, and write buffers as maps, we use standard map lookup to get the value of a cached object or field by its key, a shorthand to state that a field maps to a given value in a looked-up object, and a function returning all the map keys —references in the case of the heap and the object cache, or field names in the case of the write buffer.

Similarly, FieldDirty handles accesses to fields that are cached in the write buffer. The only difference from Field is that we require the field to be cached in the write buffer and get its value from there instead of from the object cache.

Rule Assign handles non-volatile field writes, which also go through the write buffer. As a result, writes change the contents of the write buffer instead of the heap, as required by the last two premises: the resulting write buffer contains the same mappings as the original, except that the written field now maps to the written value. Note that the field might not have been in the write buffer in the first place.

Rule New invokes the constructor of the corresponding class in a similar manner to Call. Rule CtxStep ensures that the constructor is evaluated before the reference is assigned to any variable; this ensures that final fields are initialized before the new object is published. Similarly to Johnsen et al., we represent instances by their class name together with their field values. Note that according to the JMM “conceptually every object is created at the start of the program” [23, §4.3]. That said, in DJC we assume that the object is already present in the memory, with its fields initialized to the default values, and that New just invokes the constructor and returns a reference to the object. A premise ensures that there is no other reference to that object already.

Figure 8: Operational Semantics for Implicit Operations

Semantics of Implicit Operations

Figure 8 presents the operational semantics for implicit operations. These are operations performed implicitly by the virtual machine; they do not map to language expressions. Rules Fetch, WriteBack, and Invalidate handle fetching, write-back, and invalidation of a cached object, respectively. Fetching an object requires that it exists in the heap (first and second premises); a fetch results in the addition of the referenced object to the core's object cache. Writing back a field requires that the referenced object is present in the heap and the object cache, that the field is not volatile, and that there is a dirty copy of it in the write buffer; a write-back results in the update of the field's value both in the heap and in the object cache. Invalidating an object's cached copy requires that the object is cached; note that this does not force the object's fields to not be cached in the write buffer. An invalidation results in the removal of the referenced object from the object cache of the core executing the invalidation. Rule Start enforces the evaluation of the thread start action before any other action in the thread and —treating thread start as an acquire action— requires the object cache and the write buffer of the running core to be empty.

Rule Finish handles the completion of a thread; a thread reaches completion when its body is equal to the unit value. Being a release action, finish requires the write buffer to be empty, and it changes the state of the thread to finished, allowing joiners to proceed.

Figure 9: Semantics of Synchronization Operations

Semantics of Synchronization Operations

Figure 9 presents the synchronization operational semantics, that is, the rules for volatile accesses, monitor handling, join, and interrupts.

Rules VolatileReadL and VolatileRead handle volatile reads; rules VolatileWriteL and VolatileWrite handle volatile writes. The combination of VolatileReadL and VolatileRead results in a single volatile-read action; the same holds for VolatileWriteL, VolatileWrite, and the volatile-write action. Specifically, for each volatile field we assume a synthetic lock. This lock is used to force a total ordering on the accesses to the variable and to guarantee atomicity of the corresponding hardware memory accesses, as described in Section 3.3. When the lock is free, the volatile variable is not being accessed by another thread; by assigning the thread to the lock we essentially block other threads from accessing the volatile variable. Additionally, volatile accesses are exceptions to the rule that all accesses go through the cache. Since volatile reads are acquire actions and volatile writes are release actions, before volatile writes any dirty data in the corresponding core's cache must be written back, and before volatile reads the corresponding core's cache must be invalidated. Finally, we use a dedicated symbol to denote empty maps.

Rules MonitorEnter and NestedMonitorEnter handle monitor acquisition; similarly, rules MonitorExit and NestedMonitorExit handle monitor release. These rules use a monitor state —not to be confused with the synthetic lock of volatile variables— to represent the implicit monitor associated with each object. Our monitor handling is similar to the lock handling introduced in [16]. A free monitor is not acquired by any thread in the system; an acquired monitor records its owner thread and the number of times that thread has acquired it. Rule MonitorEnter requires that a monitor be free before its acquisition. Rule NestedMonitorEnter requires that a monitor already be owned by a thread before it is re-entered by that same thread. Rules MonitorExit and NestedMonitorExit ensure that a monitor is released only by its owner and exactly as many times as it was previously acquired.

In the case of nested monitor acquisition we can avoid invalidating the object caches and writing back data at nested monitor releases. By definition, nested acquisition of monitors requires that the monitor is owned by the same thread at every nesting level. Under that assumption, any concurrent actions that operate on the cached data used in the critical section would be the result of a data-race, meaning that the program is not DRF. In that case, it is not necessary for the corresponding dirty data to become visible to the threads performing the racy accesses at nested monitor releases. Note that racy accesses are not guaranteed to see the latest write if the thread executing them did not synchronize-with an action that happens-after that write. Similarly, since the monitor is already owned by the current thread, there is no need to invalidate its core's cache to get the latest values, since those values are the result of some data-race. As a result, rules NestedMonitorEnter and NestedMonitorExit do not need any special premises regarding object caches and write buffers. The example below illustrates this.
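For instance, in the (illustrative) nested acquisition below, cache operations are needed only at the outermost monitor boundaries:

    class NestedExample {
        private final Object m = new Object();
        private int x;

        void nested() {
            synchronized (m) {     // MonitorEnter: write back + invalidate (acquire)
                x = 1;
                synchronized (m) { // NestedMonitorEnter: same owner, so no
                    x = 2;         // cache operations are required here
                }                  // NestedMonitorExit: no write-back needed
            }                      // MonitorExit: dirty data (x) is written back
        }
    }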

Rule Join handles invocations of the join() method of a thread. Its first two premises require that the object cache and the write buffer are empty, since join is an acquire action. The third premise requires the state of the joined thread's object to be finished, modeling the way a join blocks on the state of a thread in the JVM implementation.

Rule Interrupt handles invocations of the interrupt() method of a thread. Its first premise requires that the write buffer is empty, since interrupt is a release action. The second and third premises require the state of the thread object to be non-interrupted before the interrupt and interrupted after it, modeling the way interrupts are implemented by changing the thread's state in the JVM implementation, or by setting a hardware register in the case of hardware interrupts.

Rules InterruptedT and InterruptedF handle invocations of the interrupted() method of a thread. Rule InterruptedT handles the case where the thread is interrupted. Its first two premises require that the object cache and write buffer are empty, since interrupt detection is an acquire action. The third premise requires the state of the thread object to be interrupted.

Rule InterruptedF handles the case where the thread is not interrupted. Its premises require the state of the thread object to not be interrupted; in this case the invocation is not a synchronization action, so there is no need to flush the object cache or the write buffer.

Figure 10: Global Operational Semantics

Semantics of Global Operations

In Figure 10 we present the global operational semantics of DJC. Global configurations are similar to local ones, except that they contain all of the system's object caches and write buffers instead of the object cache and write buffer of a single core. Note that the heap is the same in global and local configurations, since it is shared among all cores.

Rule Lift lifts local reduction steps to the global level: the stepping core's object cache and write buffer in the global state are replaced by their updated counterparts from the local step.

Rule Spawn handles thread spawns (i.e., Thread.start() calls). Since every spawn is also a release action, we require that all dirty data are written back. The JVM then picks one of the available cores and schedules the new thread on it; we represent this by introducing the new thread in parallel with the previously running ones. Note that Spawn changes the state of the thread object to spawned, to mark that this thread has been started and to forbid re-spawns.

Rule Migrate handles the migration of a Java thread to another core by the scheduler. It picks one of the available cores and replaces the thread's current core with it, representing that the thread will continue its execution on the new core.

Rule Blocked is essentially a no-op that allows threads to block and not take a step in every transition of an execution trace —e.g., a finished but not yet joined thread.

In DJC, two (or more) Java threads can step concurrently through the ParG rule. Each thread may change its core's object cache and write buffer and thus affect the global state. Since the object caches and write buffers of different cores are disjoint, the resulting global state after a concurrent step is the union of the object caches and write buffers changed by each set of cores stepping in the parallel transition, together with those left unchanged by both. To obtain the object caches and write buffers that a set of cores changes, we use a projection operator. Note that the first premise of ParG requires the two sets of cores that step in the parallel transition to be disjoint; this models that each core runs a single thread and performs a single step at a time. Additionally, the eighth and ninth premises allow only a single set of threads to modify the heap. This limitation partially models the hardware memory bus and the way it orders memory transfers: we allow only one write to the heap per step, permitting parallelism but not concurrent heap writes. To improve on this, one could slice the heap, so that different synchronization managers handle different slices and parallelism increases.

4.3 Proof Sketch

This section briefly describes the proof of DJC's adherence to the JDMM; the detailed proof is given in the appendix. Intuitively, the correctness property can be expressed as:

Theorem 1.

DJC’s operational semantics generates only well-formed execution traces.

To prove Theorem 1, we show by induction that DJC's operational semantics satisfies every well-formedness rule. That is, given any well-formed execution trace, we show that the trace obtained after taking one more step is also well-formed.

This amounts to essentially a preservation proof for each rule, many of which are straightforward. It is trivial to show that structural rules with conclusions that do not affect the memory state and do not regard synchronization actions preserve the well-formedness of the execution. For the rest, we argue about their effects on the execution state. Since DJC’s operational semantics is tailored after JDMM’s well-formedness rules, for most inference rules, inspecting their premises and conclusions is enough to show that a well-formedness rule is preserved.

As DJC models DiSquawk executions, we claim that DiSquawk executions adhere to the JDMM, and consequently to the JMM.

5 Related Work

To the best of our knowledge, the only other JVM implementing the Java memory model on a non cache coherent architecture is Hera-JVM [25]. Hera-JVM also employs caches, which it handles in a similar manner to our implementation, with the difference that it starts a write-back at every write, as we discuss in Section 3. Regarding the synchronization mechanisms, Hera-JVM relies on the Cell B.E.'s GETLLAR and PUTLLC instructions to build an atomic compare-and-swap operation. However, such instructions are not available on the architectures at hand [14, 20]. Additionally, Hera-JVM did not aim to formally prove its adherence to the JMM.

In contrast to implementation work, language operational semantics are often used to formalize memory models. Previous work describes the memory semantics of shared-memory multicore processor architectures, such as Power [21], x86 [27, 32], and ARM [3] processors, without focusing on a specific language's semantics or memory model. Sarkar et al. [31] first combined the semantics of an architecture with the memory model definition of the C++ language, focusing on its execution on shared-memory Power processors. Pratikakis et al. [30] similarly present an operational semantics for a specialized task-parallel programming model designed to target distributed-memory architectures. Our work differs from the aforementioned in that it targets distributed or non cache coherent memory architectures.

Boudol and Petri [6] define a relaxed memory model using an operational semantics for the Core ML language. Their work takes into account write buffers that must be emptied before a lock release. Although handling write buffers is similar to handling caches with respect to write-backs, fetching and invalidation are not covered in that work. Additionally, the authors only consider lock releases as synchronization points, while in the Java language there are multiple kinds of synchronization points according to the JMM. Joshi and Prasad [17] extend the above work and define an operational semantics that accounts for caches, namely the update and invalidation cache operations not previously supported. The authors use a simple imperative language, claiming it has greater applicability. Unfortunately, this approach further abstracts away details regarding the correct implementation of a specific programming language's memory model. In our work we focus on the Java language and provide all the details needed for the implementation of its memory model. Furthermore, both of the above papers define operational semantics for generic relaxed memory models. We believe that defining the operational semantics for a specific memory model, in this case the JMM, is a different task that focuses on the issues specific to the Java language.

Demange et al. [10] present the operational semantics of BMM, a redefinition of JMM for the TSO memory model. BMM is similar to this work in that it aims to bring the Java Memory Model definition closer to the hardware details. BMM, however, focuses on buffers instead of caches and assumes the TSO memory model, which is stricter than the memory model of the non cache coherent architectures at hand.

Jagadeesan et al. [15] also describe an operational semantics for the Java Memory Model. Their work, however, does not account for caches or buffers. It abstracts away the hardware details and considers reads and writes to become actions that float into the evaluation context. This approach does not explicitly define when and where writes should be eventually committed to satisfy the JMM. In our approach, we explicitly define where data get stored after any evaluation step.

We thus consider our approach to be closer to the implementation. Cenciarelli et al. [8] use a combination of operational, denotational, and axiomatic semantics to define the JMM. In that work, the authors show that all the generated executions adhere to the JMM, but as in [15] they do not account for the memory hierarchy.

6 Conclusions

This paper presents DiSquawk, a JVM implementation of the Java Memory Model targeting a 512-core non cache coherent architecture, and a proof sketch of its adherence to the JMM. We discuss design decisions and present evaluation results from the execution of a set of benchmarks from the Java Grande suite [33]. To prove the correctness of our implementation, we model all key points of the design using a core calculus, DJC, and its operational semantics. DJC is a concurrent Java calculus aware of software caches and their mechanisms. DiSquawk has been developed as part of the GreenVM project [2] and is available for download at https://github.com/CARV-ICS-FORTH/disquawk.

Appendix A JDMM Formal Definitions

This appendix presents the JDMM’s formal definitions and their corresponding formalism in DJC, where appropriate.

Distributed Execution: A distributed execution is a tuple with the following components:

  • The program is a set of instructions; in DJC this is the program itself.

  • A set of actions.

    Actions: The JMM abstracts thread operations as actions [22, §5.1]. An action is a tuple comprising: the thread performing the action; the kind of the action; the (runtime) variable, monitor, or thread involved in the action; and an identifier that is unique among the actions.

    JDMM distinguishes the following kinds of actions:

    • read, write, and initialization actions on a heap-based variable,

    • read and write actions on a volatile variable,

    • lock and unlock actions on a monitor,

    • the start and the end of a thread,

    • the interruption of a thread and the detection of such an interruption by another thread,

    • the spawning (Thread.start()) and the joining of a thread, or the detection that it terminated,

    • external actions, i.e., I/O operations,

    • fetches of heap-based variables,

    • write-backs of heap-based variables,

    • invalidations of cached variables.

    In DJC, we annotate each transition from one state to the next with the set of cores involved in the transition and with the core performing the JDMM action in that transition.

    To obtain the set of actions from a program's DJC execution trace, we take the union of the ranges of the transitions' annotations, where each annotation is a set of mappings from cores to JDMM actions.

  • The program order is a relation on the set of actions that defines the order of actions within each thread. Every pair of actions executed by a single thread is ordered by the program order.


  • The synchronization order is a relation on the set of actions defining a global ordering among all synchronization actions.

    Synchronization Actions: Volatile reads and writes, lock and unlock actions, thread start and finish actions, interrupt and interrupt-detection actions, and spawn and join actions are synchronization actions, which form the only communication mechanism between threads.

    Every pair of synchronization actions is ordered by the synchronization order.

    In DJC we group the synchronization actions that act as acquires —volatile reads, locks, thread starts, joins, and interrupt detections— in the acquire actions family, and those that act as releases —volatile writes, unlocks, spawns, thread finishes, and interrupts— in the release actions family. As a result, in DJC the synchronization order follows the order in which acquire and release actions appear in the execution trace.

  • The write-seen function returns, for every read action, the write action seen by that read.

  • The value-written function returns the value written by every write action. As a result, every read reads the value written by the write it sees.

  • The cache-action-seen function returns the fetch or write action seen by any read.

  • The write-back-fetched function returns the write-back action whose data each fetch action fetches.

  • The action-written-back function returns the write action whose data each write-back action writes back.

    In DJC, this function returns the initialization or write action whose data the write-back writes back, according to the execution trace. Note that in DJC we exclude volatile writes from the possible kinds of actions returned, since volatile writes are never written back by a separate write-back action; they are immediately written to the heap.

  • The action-invalidated function returns the write or fetch action that cached the data invalidated by each invalidation action.

    In DJC, this function returns the write-back or fetch action that wrote or fetched the value that the invalidation invalidates, according to the execution trace. Note that in DJC the function returns write-back actions instead of write actions, since write actions update the write buffer, which cannot be invalidated, while write-back actions update the values in the object cache, removing the corresponding entries from the write buffer.

  • The distributed synchronizes-with order is a relation on the set of actions defining which actions synchronize with each other.

    An action synchronizes-with another action when:

    • the release is the initialization of a variable and the acquire is the first action of any thread;

    • the acquire is a subsequent read of the volatile variable that the release wrote;

    • the acquire is a subsequent lock of the monitor that the release unlocked;

    • the acquire is the start action of a thread and the release is the spawn of that thread;

    • the acquire is a call to Thread.join() or Thread.isAlive() and the release is the finish action of the joined thread;

    • the acquire is an action detecting that a thread has been interrupted and the release is an interrupt to that thread;

    • the acquire is the implicit read of a reference to the object being finalized and the release is the end of the constructor of that object.

    In the synchronizes-with cases above, comparing the variable of one action with the thread of another means that the action acts on that thread. In each synchronizes-with pair, one action is a release and the other an acquire. A release action must make all writes visible to its executing thread also visible to the actions that follow the acquire action (according to any of the orders defined so far).

    In DJC, given any execution trace, a release action synchronizes-with a subsequent acquire action if and only if the two can form a synchronization pair and no other transition that could pair with the acquire appears between the transitions containing the two actions.

  • The happens-before order is a relation that defines a partial order among the actions in the execution.

    The happens-before notion is the one introduced by Lamport [18]. In the context of the JMM, it is the transitive closure of the program order and the synchronizes-with order.

    In DJC, given any execution trace, an action happens-before another action if they are ordered by the program order, if the first synchronizes-with the second, or, by transitivity, if there is an intermediate action such that the first happens-before it and it happens-before the second.

Conflicting Accesses: If one of two accesses to the same variable is a write, then these two accesses are conflicting.

Data-Race: A data-race occurs when two conflicting accesses may happen in parallel. That is, they are not ordered by happens-before.

Correctly Synchronized or Data-Race-Free Program:

A program is correctly synchronized or DRF if and only if all sequentially consistent executions are free of data-races.

Well-Formed Distributed Execution:

JDMM defines well-formed executions similarly to the JMM. Specifically, in JDMM, a distributed execution is well-formed when:

WF-1

Each read of a variable sees a write to that variable.

Note that the original formal definition in JDMM [37, §3] does not consider volatile reads. However, JMM [23, §4.4] states that “For all reads r ∈ A, we have W(r) ∈ A and W(r).v = r.v. The variable r.v is volatile if and only if r is a volatile read, and the variable w.v is volatile if and only if w is a volatile write.”, where, to our understanding, w refers to W(r), and r refers to both volatile and non-volatile reads. As a result, in this work we chose to take volatile reads into account as well.

In DJC, this means that, given the execution trace of a program, for every transition containing a read action in that trace there is at least one transition containing a write or initialization action that writes to the same variable the value that the read action sees.

WF-2

All reads and writes of volatile variables are volatile actions.

In DJC, this means that, given the execution trace of a program, an action in any transition is a volatile read or a volatile write if and only if the variable it accesses is volatile.

WF-3

The number of synchronization actions preceding another synchronization action is finite.


WF-4

Synchronization order is consistent with program order.

In DJC, this means that, given the execution trace of a program, if the trace contains two synchronization actions of the same thread in an order consistent with the program order, then it cannot also contain the same two actions in the reverse order.

WF-5

Lock operations are consistent with mutual exclusion.

The number of lock actions performed on a monitor by any thread before, according to the synchronization order, a given lock action on that monitor must be equal to the number of unlock actions performed by that same thread on that monitor before the given lock action.