Program Execution on Reconfigurable Multicore Architectures

Abstract

Based on the two observations that diverse applications perform better on different multicore architectures, and that different phases of an application may have vastly different resource requirements, Pal et al. proposed a novel reconfigurable hardware approach for executing multithreaded programs. Instead of mapping a concurrent program to a fixed architecture, the architecture adaptively reconfigures itself to meet the application’s concurrency and communication requirements, yielding significant improvements in performance. Based on our earlier abstract operational framework for multicore execution with hierarchical memory structures, we describe execution of multithreaded programs on reconfigurable architectures that support a variety of clustered configurations. Such reconfiguration may not preserve the semantics of programs due to the possible introduction of race conditions arising from concurrent accesses to shared memory by threads running on the different cores. We present an intuitive partial ordering notion on the cluster configurations, and show that the semantics of multithreaded programs is always preserved for reconfigurations “upward” in that ordering, whereas semantics preservation for arbitrary reconfigurations can be guaranteed for well-synchronised programs. We further show that a simple approximate notion of efficiency of execution on the different configurations can be obtained using the notion of amortised bisimulations, and extend it to dynamic reconfiguration.

1 Introduction

The traditional approach to multiprocessing is to map multithreaded applications or multiprogram workloads onto a chosen multicore architecture. Pal et al. showed that due to the diversity of software applications, this “one architecture fits all” approach often yields sub-optimal performance [8]. For instance, programs with mostly independent threads having little communication perform well on symmetric multiprocessors (SMP) whereas those with more communication and shared variables perform better on chip multiprocessors (CMP). Indeed, even a single application can exhibit vastly diverse resource requirements during different phases of its execution. Accordingly, those authors identified different reconfiguration parameters (e.g., number of cores, cache size and cache sharing) and proposed a reconfigurable multicore tile-based architecture which supports dynamic adaptability of the multicore hardware to the software’s resource requirements [9], obtaining significant performance improvements. The hardware morphs itself to a configuration that delivers better performance for that particular phase of program execution (using heuristics to detect such phase changes). The overhead for reconfiguration is usually significantly lower than the performance benefits. However, it is not entirely obvious whether such a dynamic reconfiguration preserves the intended semantics of the application with respect to a reference architecture, nor is there a theoretical framework for comparing the performance benefits. We believe there should be a formal basis for dynamically reconfigurable multiprocessors, which constitute an innovative technology trend.

In this paper, we present an operational account of execution of multithreaded programs on dynamically reconfigurable multicore architectures, which to our knowledge is new. Examining how a variety of cluster-based architectural configurations partition cores and share cache, we see that partition refinements provide a natural partial ordering on cluster configurations. We also find that instead of requiring different operational formulations for each configuration, our earlier work [3] provides a uniform abstract operational semantics, whence we can both compare execution behaviours and express dynamic reconfigurability. Leveraging results about abstract cache models from our work [3], which followed the seminal approach of Boudol and Petri [2] of showing that “well-synchronised programs have the same semantics in relaxed memory models as on sequentially consistent models”, we show that dynamic reconfiguration is semantics-preserving for such data-race-free programs. Further, by associating an approximate cost with each operation, we adapt Kiehn and Arun-Kumar’s notion of amortised bisimulation [4] to obtain a framework for comparing performance on the various architectural configurations and reasoning about the benefits of dynamic reconfiguration.

It should be clarified at the outset that we are presenting execution semantics at the architectural level, below the program or OS level where threads are assigned to cores. Reconfiguration happens dynamically during program execution, and is outside program control. In this respect, this work differs from that of Krishnan [5, 6] and also the large body of work related to assigning threads to cores (which has anyway become a less pressing issue in multicore systems). For simplicity, we confine our study to homogeneous core architectures, and the reconfiguration parameters to core clustering and cache fusion/splitting, and do not consider other parameters such as core fusion/splitting, core allocation, management of power and clocking, and cache allocation.

2 The Reference Model

The reference architectural model with respect to which we compare the semantics of program execution consists of a collection of cores connected via a bus to a single shared memory module. Under the assumption of having the requisite number of cores, this reference model will exhibit behaviours consistent with pomset semantics [10] of multiple sequential processes with a sequentially consistent shared memory [7]. In a sequentially consistent memory model, writes and reads are atomic operations, and occur in program order within a thread.

Execution states are written as , where is the shared store and is a vector of threads. For simplicity, we assume each thread runs on its own core, with denoting the thread. The operational semantics are given in Figure 1. Since our focus is on the observable actions on shared memory at the architectural level, only the transitions relating to reading and writing from memory are shown, eliding transitions for instructions not involving the store. Since the bus enforces mutually exclusive access to the memory module, we adopt an interleaving view of execution. An obvious alternative approach would have been to consider synchronous transitions, labelled by vectors of actions, the components of which are contributed by each core. However, since certain cores may idle (e.g., for power efficiency reasons) and since we are not making any assumptions about clock synchronicity, that approach would not be appropriate.

Figure 1: Specification semantics for read and write in the reference architecture

Transitions related to accessing the store are of the form , where is used to indicate that the transition is for thread (or core ), and denotes the action. The possible actions are: (the value is read from variable ) and (the value is written to the variable ), apart from the reductions (labelled ) not involving the store. When we do not care what value was read/written, we use and . We associate an approximate cost for accessing the store, with . For simplicity, we ascribe a uniform cost for -labelled transitions (typically ).¹

In a sequence of transitions , two concurrently enabled but conflicting transitions and are said to form a race (on variable ) if and and at least one is . Races make computational results dependent on scheduling decisions, and as a consequence programmers use synchronisation mechanisms such as locks, barriers, fences, etc. to avoid the occurrence of such data races.
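The race condition just described can be illustrated with a minimal sketch. It assumes a hypothetical encoding of actions as (thread, kind, variable) triples; the names `Action` and `is_race` are illustrative and do not come from the formal development.

```python
from typing import NamedTuple


class Action(NamedTuple):
    thread: int  # index of the issuing thread (or core)
    kind: str    # "rd" or "wr" (hypothetical encoding of read/write actions)
    var: str     # name of the shared variable being accessed


def is_race(a: Action, b: Action) -> bool:
    """Two concurrently enabled actions conflict if they come from
    different threads, access the same variable, and at least one writes."""
    return (
        a.thread != b.thread
        and a.var == b.var
        and "wr" in (a.kind, b.kind)
    )
```

Two reads of the same variable, or accesses to different variables, never form a race under this check; whether the two actions are concurrently enabled has to be established separately from the execution state.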

3 Implementation Models

There is a variety of configurations for multicore architectures. Common among them are chip multiprocessors (CMP) and symmetric multiprocessing (SMP). A major difference between them lies in the organisation of their cache hierarchy. Caches are important architectural features that significantly speed up execution of programs by exploiting locality of memory accesses and reducing their latency.

In a CMP configuration, each core possesses its private L1 data and instruction caches, but several cores share a common L2 cache, which lies above the slower main memory. In contrast, in SMP, the L2 cache is also private to each core. These two configurations may be considered the extremes of a range of clustered configurations or clusterings, where the multicore system consists of a collection of clustered cores. Within a cluster, each core possesses its private L1 data and instruction caches, but the cores share an L2 cache. Thus, SMP is the case where the cluster size is 1, whereas CMP puts all the cores in one cluster. Pal et al. use the notation , where to describe a system of cores configured into clusters, where the cluster has cores. SMP is therefore written as while CMP is represented as . For simplicity, we will only concern ourselves with a memory structure consisting of L2 caches and a shared main store. A more detailed model can address the similar issues that arise in the treatment of L1 vis-à-vis L2 caches and main memory.

We observe that a clustering represents a partition of cores; given a clustering , we write if cores and are in the same cluster. Thus in SMP, the equivalence classes are singletons, whereas in pure CMP, all cores are in the same equivalence class. If clustering is a partition refinement of , we write .² Note that if , then share the same cache.
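As a rough illustration of the ordering on clusterings, the sketch below represents a clustering as a set of disjoint blocks of core indices and checks partition refinement directly. The representation and the names `same_cluster`, `refines`, `SMP4`, `CMP4` and `PAIRS` are assumptions made for this example only.

```python
from typing import FrozenSet, Set

Clustering = Set[FrozenSet[int]]  # a partition of the core indices


def same_cluster(clustering: Clustering, i: int, j: int) -> bool:
    """True if cores i and j lie in the same block (share a cache)."""
    return any(i in block and j in block for block in clustering)


def refines(fine: Clustering, coarse: Clustering) -> bool:
    """True if every block of `fine` is contained in some block of `coarse`,
    i.e. `fine` is a partition refinement of (lower than) `coarse`."""
    return all(any(f <= c for c in coarse) for f in fine)


# Four cores: SMP (all singletons), CMP (one block), and an intermediate case.
SMP4: Clustering = {frozenset({k}) for k in range(4)}
CMP4: Clustering = {frozenset(range(4))}
PAIRS: Clustering = {frozenset({0, 1}), frozenset({2, 3})}
```

With these definitions SMP sits at the bottom of the ordering and CMP at the top, with clusterings such as `PAIRS` in between.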

3.1 Implementation semantics

We now refine the reference model by introducing caches into the architecture. The store component is replaced by a tuple , where is the store (as earlier) and is a vector of caches. The caches contain a local copy of a subset of the store. Due to differences in the local caches, each core has a potentially different view of the memory. denotes the (L2) cache available to core . In clustering , if , then and are the same and so have the same contents.

If , its value is given by a pair , where is the value of the variable and may be either clean or dirty. A variable is clean either if it has not been written to by this cluster of cores, or if its changed value has been written through to the store. Otherwise it is dirty. Note that in general, . The system may allow the store to contain a different value if some other processor has updated the store but this cache has not yet been notified.

Figure 2: Implementation semantics for read and write operations on a clustering

Figure 2 gives the implementation semantics with respect to clustering for read and write operations, both of which access the memory — and potentially alter it. When a variable is written to, the write is only to the cache (“write back”). We discuss below how these changes are propagated to the store or to other caches. Observationally, is a functionally equivalent action to , but with lower cost: .

There are three transitions for reading a variable, , all functionally equivalent to the same specification operation , but with different costs. is a read from the local cache, and has cost . Note that when , there are two possible transitions, labelled and , both with costs , corresponding to whether or not is pulled into the cache. This decision is made non-deterministically, which (along with another transition for eviction to be introduced later) makes the model independent of the cache-replacement policy used by the actual implementation. Note that unlike in the specification semantics, reading a location can cause changes to the memory, e.g., by moving the value read into a cache.
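The read and write rules can be sketched operationally as follows. This is a simplified illustration, not the rules of Figure 2: the cost constants are assumed values (one cycle for a cache access, four for a store access), the dictionary representation of caches is hypothetical, and the nondeterministic choice of whether to pull a value into the cache is exposed as an optional `pull` argument.

```python
import random
from typing import Dict, Optional, Tuple

CACHE_COST = 1  # assumed latency of a cache access (illustrative value)
STORE_COST = 4  # assumed latency of a store access (illustrative value)

Cache = Dict[str, Tuple[int, bool]]  # variable -> (value, dirty flag)
Store = Dict[str, int]


def write(cache: Cache, var: str, value: int) -> int:
    """Write-back: the new value goes only to the cache and is marked dirty."""
    cache[var] = (value, True)
    return CACHE_COST


def read(cache: Cache, store: Store, var: str,
         pull: Optional[bool] = None) -> Tuple[int, int]:
    """Return (value, cost). A cached variable is read locally; otherwise the
    value comes from the store and may or may not be pulled into the cache."""
    if var in cache:
        return cache[var][0], CACHE_COST
    value = store[var]
    if pull is None:
        pull = random.choice([True, False])  # models the nondeterministic choice
    if pull:
        cache[var] = (value, False)  # a freshly pulled copy is clean
    return value, STORE_COST
```

Both store-read variants return the same value at the same cost; they differ only in whether the cache is changed, which is what keeps the model independent of the replacement policy.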

Apart from the programmed transitions of Figures 1 and 2, there are the so-called ‘system’ transitions, denoted by , used to manage the memory structure, including cache replacement policies and consistency. These transitions can fire non-deterministically at any time, and the threads cannot constrain which system transitions can occur or when. The system transitions are used to propagate writes to other caches and the store. In practice this is usually done either with an update-based protocol (where cached copies are updated with the new value) or with an invalidation-based protocol (where cached copies are invalidated, effectively removing them from the cache). Here we present only the update protocol (Figure 3). The transitions, which are not observable, but decorated here to distinguish them, are as follows:

Figure 3: System transitions on clustering : Update-based cache consistency protocol
  1. Eviction : Evict from . . This is only used for the cache replacement policy and is not needed to achieve a consistent state.

  2. Cache update : Update in from . if , since is the same as ; otherwise it is since it requires communication over the system bus. This is used to update other caches when a variable is written to in a cache.

  3. Store update : Update in from . . The condition for its application ensures that a store update only happens after all caches have been updated and agree on the value of the variable.
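The three system transitions can be sketched in the same style. The side conditions below are assumptions chosen for illustration (only clean entries are evicted, a cache update propagates a dirty value to a cache already holding the variable, and a store update fires only when all cached copies agree); the precise conditions are those of Figure 3, which this sketch does not reproduce.

```python
from typing import Dict, List, Tuple

Cache = Dict[str, Tuple[int, bool]]  # variable -> (value, dirty flag)
Store = Dict[str, int]


def evict(cache: Cache, var: str) -> None:
    """Drop a clean entry. Refusing dirty entries (an assumed side condition)
    ensures that no pending write is lost."""
    _, dirty = cache[var]
    if dirty:
        raise ValueError(f"{var} is dirty and cannot be evicted")
    del cache[var]


def cache_update(src: Cache, dst: Cache, var: str) -> None:
    """Propagate a dirty value of `var` from `src` to a cache holding `var`."""
    value, dirty = src[var]
    if dirty and var in dst:
        dst[var] = (value, True)


def store_update(caches: List[Cache], store: Store, var: str) -> bool:
    """Write `var` through to the store once every cached copy agrees and at
    least one copy is dirty. Returns True if the transition fired."""
    copies = [c[var] for c in caches if var in c]
    values = {value for value, _ in copies}
    if len(values) != 1 or not any(dirty for _, dirty in copies):
        return False
    store[var] = values.pop()
    for c in caches:
        if var in c:
            c[var] = (store[var], False)  # copies become clean after write-through
    return True
```

In this sketch a write in one cache must first reach the other caches via `cache_update` before `store_update` can fire, mirroring the stated requirement that the store is updated only after the caches agree.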

3.2 Comparing the semantics on different configurations

Consider a program or workload , i.e., a set of threads mapped to a set of cores. An execution trace of an implementation of on clustering is considered correct if for any two actions , some functionally equivalent actions both appear in the pomset semantics, and if precedes in the pomset semantics, appears before in . The implementation conforms to the pomset semantics if every trace possible in that implementation is correct with respect to the pomset semantics. In the interleaving view, this can be stated as: for any observable trace in the reference semantics, there is a corresponding trace of functionally equivalent observable actions in the implementation semantics.

Running any workload on a coarser clustering preserves the observable behaviour. Moreover, CMP is semantically faithful to the reference semantics.

Proposition 1
  1. If , then any -trace has a functionally equivalent -trace .

  2. Every reference semantics trace has a functionally equivalent CMP-trace and vice versa.

Proof outline: By induction on , we find a functionally equivalent . The only interesting cases are if but , and there is a local read from or write to (say) . We use the transition before a read or after a write to make the two cache entries agree. Similarly, the transition is used to make agree with .

Coherence, Consistency and Data Race Free programs.

Unfortunately, program execution on multiprocessors with caches may exhibit more traces than the reference model allows, due to the introduction of race conditions and inconsistencies between the caches at different cores or with the shared store (arising from the non-atomicity of writes). It is therefore not true in general that program behaviour is preserved when running a program on a finer clustering (e.g., SMP).

We recall below some of our earlier results showing that for a class of programs that are “data race free (DRF)” [2], every program trace in any implementation architecture has a functionally equivalent trace in the reference architecture. For such DRF programs, the additional behaviours, introduced by the extra nondeterminism in the implementation architecture, are irrelevant for executions starting from “consistent states”. A consistent state is, intuitively, an implementation state that is identifiable in a precise sense with a specific state (called its “reduct”) in the reference model.

We briefly recall the notions of coherence and consistency presented in our earlier work [3] via abstract operational characterisations. We refer the interested reader to op. cit. to check that these notions correspond with more familiar invariants associated with memory consistency and cache coherence presented in the literature.

We write to denote that is reachable from by the implementation semantics (program and system transitions) with respect to clustering , and similarly for the reference architecture semantics. We use to denote 0 or more system transitions in clustering , whereas means 0 or more system and program transitions.

Let us call a state “-normal” if it cannot make any moves (i.e., system transitions). For an implementation state , let denote the value in core ’s view of , i.e., the value of , or if . An implementation state is said to reduce to a specification state (written ) if , is -normal, and . is called a reduct of .

Definition 1

A state is said to be coherent for if . A state is coherent if it is coherent for all . A coherent state has a unique reduct. We use to refer to the unique reduct of a coherent state .

Definition 2

A state is said to be consistent for if and only if . Implementation state is consistent if it is consistent for all . A consistent state is in some sense identifiable with its reduct.

We now introduce the notion of data race freedom.

Definition 3

A consistent state involves a data race if it has two redexes and , , and are both accesses to the same variable and at least one is a write. is data race free (DRF) iff no state reachable in the reference architecture semantics from involves a data race.

Note also that the analysis of data race freedom need only be performed at the level of the reference semantics. DRF programs allow us to consider their execution as progressing via a sequence of reference model states (reducts). Using ideas and techniques introduced by Boudol and Petri [2], it is shown that for “well-synchronised” programs, i.e., those where between any pair of actions forming a data race lies an intervening synchronising mechanism (e.g., lock release or barrier operations), the behaviours are functionally equivalent. In particular, we have shown in [3, Section 6] that the cache rules described above satisfy required properties of coherence and consistency.

Since the execution of DRF programs on any clustering architecture can be seen to be functionally equivalent to execution on the reference architecture, we can show:

Proposition 2

For DRF programs, any reconfiguration (in any direction) preserves semantics.

Proof outline: Similar to the proof of Proposition 1, but here we rely on the fact that every implementation trace of a DRF program on clustering has a functionally equivalent trace in the reference model. The result follows from transitivity of equivalence of actions, and from the fact that ‘system’ transitions are not directly observable.

Execution on a dynamically reconfiguring architecture.

Consider now a scenario where execution may commence on clustering , and the machine may nondeterministically decide to morph to a clustering (based on some heuristic), after which execution proceeds on the latter architectural configuration. Such dynamic reconfiguration from clustering to can be internalised into our framework by introducing a new action , with cost :

where is the “reduct” of . On reconfiguration, the cache contents are written to the store, and the program resumes in the new configuration with “cold caches”. For DRF programs, the execution semantics of each phase corresponds precisely to the semantics of execution in the reference model. Thus we can formalise within our framework the correctness criterion for execution on a reconfigurable architecture.
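The effect of the reconfiguration action can be sketched as a flush followed by a restart with empty caches. The sketch assumes the state is coherent, so that any dirty copies of a variable agree; the function name `reconfigure` and the list-of-clusters argument are hypothetical.

```python
from typing import Dict, List, Tuple

Cache = Dict[str, Tuple[int, bool]]  # variable -> (value, dirty flag)
Store = Dict[str, int]


def reconfigure(store: Store, caches: List[Cache],
                new_clusters: List[List[int]]) -> Tuple[Store, List[Cache]]:
    """Flush dirty cache contents to a copy of the store (yielding the reduct,
    assuming a coherent state) and return one cold cache per new cluster."""
    reduct = dict(store)
    for cache in caches:
        for var, (value, dirty) in cache.items():
            if dirty:
                reduct[var] = value
    cold_caches: List[Cache] = [dict() for _ in new_clusters]
    return reduct, cold_caches
```

Because execution resumes from the reduct with cold caches, the state after reconfiguration is trivially consistent, at the price of the reconfiguration cost and of the cache misses that follow.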

Theorem 1

Any execution of a DRF program on a reconfigurable architecture conforms to execution on the reference model.

Proof outline: By piecewise stitching of executions of the different phases on the different configurations. Note that implementation states before and after reconfiguration correspond to the same reduct.

4 Comparing Performance

We now propose a framework for comparing the execution efficiency on two configurations. For uniformity, specification and implementation states are clubbed into one set; the specification and implementation actions marking transitions are also combined into one set of actions . Let be a sequence of actions, and let denote a labelled sequence of transitions, as usual.

Let represent the functional equivalence of actions as mentioned above. The actions are in one equivalence class; and in a second; and in the third. For in the read and write actions, let the observable content , and for , define (the empty string). Extend to sequences such that and and if for each .
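A minimal sketch of the observable-content map follows. It assumes a hypothetical string encoding of action kinds (the tags below are illustrative, not the notation used here) and assumes that the observable content of a read or write records its class, variable and value, while silent and system actions contribute nothing.

```python
from typing import List, Optional, Tuple

Action = Tuple[str, str, int]    # (kind, variable, value); hypothetical encoding
Observable = Tuple[str, str, int]

READS = {"rd", "rd_cache", "rd_store", "rd_store_pull"}  # one equivalence class
WRITES = {"wr", "wr_cache"}                               # a second class
# Every other kind (e.g. "tau", "evict", "cache_update") is treated as silent.


def obs(action: Action) -> Optional[Observable]:
    """Map an action to its observable content, or None if it is silent."""
    kind, var, value = action
    if kind in READS:
        return ("rd", var, value)
    if kind in WRITES:
        return ("wr", var, value)
    return None


def obs_seq(actions: List[Action]) -> List[Observable]:
    """Observable content of a sequence: silent actions map to the empty string."""
    return [o for o in map(obs, actions) if o is not None]


def functionally_equivalent(s: List[Action], t: List[Action]) -> bool:
    return obs_seq(s) == obs_seq(t)
```

Under this encoding a specification trace and an implementation trace are functionally equivalent exactly when they agree after erasing silent steps and forgetting where each value was read from or written to.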

Functionally equivalent actions deliver the same results but may have quite different latencies. We have earlier specified the costs of actions for cache accesses, for store accesses and for reconfiguration. Lift to sequences by summing the costs of the component actions.

Following the constraints on and latency costs as in [4], we define the notion of weak amortised bisimulations on states (both specification and implementation).³

Definition 4

A family of binary relations over states is a weak amortised -bisimulation, if for all whenever :
implies and ,
implies and ,
where and . State is (weakly) amortised more efficient than up to credit , written , if for some weak amortised -bisimulation .

Note that accessing the cache is significantly less costly than accessing the store. The definition of weak amortised bisimulation accumulates “credit” by performing the cheaper operation, thus providing us a framework for comparing the performance of execution on different architectural configurations. Since there is nondeterminism in when the ‘system’ operations take place, our framework must account for every possible execution run in the comparison.
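The credit mechanism can be illustrated at the level of a single pair of runs. The sketch below is a deliberate simplification: it compares two step-matched traces rather than all behaviours of two states, so it is not a bisimulation check, and the function name and cost lists are hypothetical.

```python
from typing import Sequence


def credit_never_negative(fast: Sequence[int], slow: Sequence[int],
                          credit: int = 0) -> bool:
    """`fast[k]` and `slow[k]` are the costs of the k-th pair of functionally
    equivalent actions. The credit grows by the cost difference at each step
    and must never drop below zero for `fast` to count as amortised cheaper."""
    if len(fast) != len(slow):
        raise ValueError("traces must be matched step by step")
    for cost_fast, cost_slow in zip(fast, slow):
        credit += cost_slow - cost_fast
        if credit < 0:
            return False
    return True
```

For example, a run that makes two cache hits (cost 1 each) before a store access (cost 4) stays within budget against a run that pays 4 at every step, whereas a run that pays the expensive step first needs initial credit.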

The notion also allows us to assess the benefits of performing dynamic reconfiguration. Reconfiguration introduces a handicap of , which must be balanced (in an amortised sense) by frequent accesses to cache instead of to the store. Typically is about 4 instruction cycles whereas is one cycle. Since values are pulled into cache in blocks of words, there are additional performance benefits due to overlapping reads with the execution of other operations. The reconfiguration cost is approximately 1000 instruction cycles, so if there is enough locality of reference, this reconfiguration cost may be easily offset by the benefits of running the workload on a more suitable configuration for phases that are typically of the order of millions of instruction cycles or more.
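The break-even point implied by these figures can be worked out directly. The sketch assumes that the four-cycle figure is the store access cost and the one-cycle figure is the cache access cost, and it ignores the block-fetch benefits mentioned above; the function name is hypothetical.

```python
import math


def break_even_accesses(reconf_cost: int = 1000,
                        store_cost: int = 4,
                        cache_cost: int = 1) -> int:
    """Number of accesses that must hit the cache instead of the store
    before the reconfiguration cost is recovered."""
    saving_per_access = store_cost - cache_cost
    if saving_per_access <= 0:
        raise ValueError("cache access must be cheaper than store access")
    return math.ceil(reconf_cost / saving_per_access)


print(break_even_accesses())  # 334
```

With a saving of three cycles per access, roughly 334 cache hits repay a 1000-cycle reconfiguration, which is small compared with phases lasting millions of cycles.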

Theorem 2
  1. For DRF program states, any reconfiguration for more efficiency is permissible. With sufficient locality of references, such programs executed on any clustering are “weakly amortised” more efficient than execution under their reference model execution.

  2. If , then executing a program on configuration is (modulo the approximations on latency) “weakly amortised more efficient” than on .

Proof outline: From any consistent implementation state, a unique reduct is reachable. If it takes moves to do so (where is the number of caches and the number of variables), then to have been in such a state, the system must have earlier performed at least actions instead of actions, and thus have already “earned” a credit of . So if it has performed at least operations such as distinct repeat writes to a variable in cache or reads of a dirty variable, then it has earned the requisite credit to be amortised at least as efficient as the reference system.
For the second part, if , since there is more sharing of cached variables, we can avoid the costlier in favour of the cheaper operation, and avoid some instances of the operation.

5 Conclusions

Our framework provides a formal basis for comparing both the behaviour and the performance of program workloads on different multiprocessor architectures. Our first set of results (Propositions 1 and 2) provides us with a rudimentary formal justification for the folklore that it is easy to port programs written assuming an SMP architecture to CMP. In particular, they indicate why converting MPI programs to OpenMP, which assumes shared variables, is usually easier than the trickier reverse direction. Theorem 1 indicates why, as architectures become “smarter” and incorporate reconfigurability, avoiding data races only assumes greater importance. The framework for comparing the efficiency of execution on different architectures also lets us understand why certain programs get such dramatic performance benefits when run on CMP architectures.

Let us mention a shortcoming of our work. Theorem 2 seems to indicate that CMP is preferable to all other configurations, which is belied in reality, especially where applications with threads having little or no communication amongst themselves run more efficiently on SMP-like configurations. The cache bus within a cluster and the memory bus enforce mutually exclusive access, and so the interleaved execution of threads accessing a shared cache is slower than simultaneous independent execution of threads accessing private caches ( interleaved vs in parallel).⁴ The fault lies not in our framework, but rather in our view of execution as being interleaved, a view we had taken to keep the semantics standard. We leave for the future the development of a framework where the pomset model is used to explore the correctness and efficiency issues.

Reconfiguration is also an opportunity for remapping threads to cores. While we have considered only a fixed workload mapping of threads to cores in the present paper, we do not believe this extension poses any major technical difficulties.

Acknowledgements.

I wish to acknowledge the helpful discussions with my colleague Kolin Paul who taught me the little I know about reconfigurable architectures. I also must thank one of the referees who constructively pointed out many major weaknesses of the earlier avatar of this paper.

Footnotes

  1. Assumptions on do not have any significance in this paper.
  2. Note that refinements are lower in the ordering.
  3. A major difference with the cited work is that we ascribe a cost to observable actions as well.
  4. There also is a slowdown due to the fused cache being larger, which we neglect.

References

  2. Gérard Boudol & Gustavo Petri (2009): Relaxed memory models: an operational approach. In: Proceedings of the 36th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL 2009, Savannah, GA, USA, January 21-23, 2009, pp. 392–403, doi:10.1145/1480881.1480930.
  3. Salil Joshi & Sanjiva Prasad (2010): An Operational Model for Multiprocessors with Caches. In: Theoretical Computer Science - 6th IFIP TC 1/WG 2.2 International Conference, TCS 2010, Held as Part of WCC 2010, Brisbane, Australia, September 20-23, 2010. Proceedings, pp. 371–385, doi:10.1007/978-3-642-15240-5.
  4. Astrid Kiehn & S. Arun-Kumar (2005): Amortised Bisimulations. In: Formal Techniques for Networked and Distributed Systems - FORTE 2005, 25th IFIP WG 6.1 International Conference, Taipei, Taiwan, October 2-5, 2005, Proceedings, pp. 320–334, doi:10.1007/11562436_24.
  5. Padmanabhan Krishnan (1992): A Semantics for Multiprocessor Systems. In: ESOP ’92, 4th European Symposium on Programming, Rennes, France, February 26-28, 1992, Proceedings, pp. 307–320, doi:10.1007/3-540-55253-7_18.
  6. Padmanabhan Krishnan (1996): Architectural CCS. Formal Asp. Comput. 8(2), pp. 162–187, doi:10.1007/BF01214555.
  7. L. Lamport (1979): How to Make a Multiprocessor Computer That Correctly Executes Multiprocess Programs. IEEE Trans. Comput. 28(9), pp. 690–691, doi:10.1109/TC.1979.1675439.
  8. Rajesh Kumar Pal, Kolin Paul & Sanjiva Prasad (2012): ReKonf: A Reconfigurable Adaptive ManyCore Architecture. In: 10th IEEE International Symposium on Parallel and Distributed Processing with Applications, ISPA 2012, Leganes, Madrid, Spain, July 10-13, 2012, pp. 182–191, doi:10.1109/ISPA.2012.32.
  9. Rajesh Kumar Pal, Kolin Paul & Sanjiva Prasad (2014): ReKonf: Dynamically reconfigurable multiCore architecture. J. Parallel Distrib. Comput. 74(11), pp. 3071–3086, doi:10.1016/j.jpdc.2014.05.007.
  10. Vaughan R. Pratt (1984): The Pomset Model of Parallel Processes: Unifying the Temporal and the Spatial. In: Seminar on Concurrency, Carnegie-Mellon University, Pittsburgh, PA, USA, July 9-11, 1984, pp. 180–196, doi:10.1007/3-540-15670-4_9.