Optimizing Solution Quality in Synchronization Synthesis

Abstract

Given a multithreaded program written assuming a friendly, non-preemptive scheduler, the goal of synchronization synthesis is to automatically insert synchronization primitives to ensure that the modified program behaves correctly, even with a preemptive scheduler. In this work, we focus on the quality of the synthesized solution: we aim to infer synchronization placements that not only ensure correctness, but also meet some quantitative objectives such as optimal program performance on a given computing platform.

The key step that enables solution optimization is the construction of a set of global constraints over synchronization placements such that each model of the constraint set corresponds to a correctness-ensuring synchronization placement. We extract the global constraints from generalizations of counterexample traces and the control-flow graph of the program. The global constraints enable us to choose from among the encoded synchronization solutions using an objective function. We consider two types of objective functions: ones that are solely dependent on the program (e.g., minimizing the size of critical sections) and ones that are also dependent on the computing platform. For the latter, given a program and a computing platform, we construct a performance model based on measuring average contention for critical sections and the average time taken to acquire and release a lock under a given average contention.

We empirically show that our approach scales to typical module sizes of many real-world concurrent programs such as device drivers and multithreaded servers, and that the performance predictions match reality. To the best of our knowledge, this is the first comprehensive approach for optimizing the placement of synthesized synchronization.

Pavol Černý, University of Colorado Boulder (pavol.cerny@colorado.edu)
Edmund M. Clarke, Carnegie Mellon University (emc@cs.cmu.edu)
Thomas A. Henzinger, IST Austria (tah@ist.ac.at)
Arjun Radhakrishna, University of Pennsylvania (arjunrad@seas.upenn.edu)
Leonid Ryzhyk, Carnegie Mellon University (ryzhyk@cs.cmu.edu)
Roopsha Samanta, IST Austria (rsamanta@ist.ac.at)
Thorsten Tarrach, IST Austria (ttarrach@ist.ac.at)

1 Introduction

Synchronization synthesis aims to enable programmers to concentrate on the functionality of the program, and not on the low-level synchronization. One of the main challenges in synchronization synthesis, and in program synthesis in general, is to produce a solution that not only satisfies the specification, but also is good (if not optimal) according to various metrics such as performance and conformance to good programming practices. Optimizing for performance for a given architecture is a challenging problem, but it is an area where program synthesis can have an advantage over the traditional approach to program development: the fact that synthesis takes a high-level specification as input gives the synthesizer a lot of freedom to find a solution that is both correct and performs well.

Optimization is hard for the common approach to synchronization synthesis, which implements a counterexample-guided inductive synthesis (CEGIS) loop. A counterexample is given by a trace, and typically, this trace is immediately (greedily) removed from the program by placing synchronization. The advantage of this trace-based approach is that traces are computationally easier to analyze. However, the greedy approach makes it difficult to optimize the final solution with respect to qualities such as performance.

We propose a new approach that keeps the trace-based technique, but uses it to collect a set of global constraints over synchronization placements. Each model of the global constraints corresponds to a correct synchronization placement. Constructing the global constraints is a key step in our approach, as they enable us to choose from among the encoded synchronization solutions using an objective function.

The global constraints are obtained by analyzing the program with respect to the concurrency specification. In this paper, we focus solely on one type of synchronization: locks. Our concurrency specification consists of three parts: preemption-safety, deadlock-freedom, and a set of standard locking discipline conditions. The goal of preemption-safety (proposed in CAV15 []) is to enable the programmer to program assuming a friendly, non-preemptive scheduler. It is then the task of the synthesizer to ensure that every execution under the preemptive scheduler is observationally equivalent to an execution under the non-preemptive scheduler. Two program executions are observationally equivalent if they generate the same sequences of calls to interfaces of interest. We consider a program correct if, in addition to preemption-safety, it does not produce deadlocks, and if it follows good programming practices with respect to locks: we require no double locking, no double unlocking and several other conditions that we refer to as legitimate locking. Legitimate locking helps make the final solution readable and maintainable. The salient point of our correctness notion is that it is generic: the programmer does not need to write a specification for each application separately.

The global constraints resulting from the preemption-safety requirement are obtained from an analysis of generalized counterexamples. The analysis uses the CEGIS approach of CAV15 [], where the key steps in checking preemption-safety are a coarse, data-oblivious abstraction (shown to work well for systems code), and an algorithm for bounded language inclusion checking. The approach of CAV15 [] is greedy, that is, it immediately places locks to eliminate the counterexample. In contrast, we do not place locks, but instead infer mutual exclusion (mutex) constraints for eliminating the counterexample. We then enforce these mutex constraints in the language inclusion check to avoid getting the same counterexample again. We accumulate the mutex constraints from all counterexamples iteratively generated by the language inclusion check. Once the language inclusion check succeeds, we construct the set of global constraints using the accumulated mutex constraints and constraints for enforcing deadlock-freedom and legitimate locking.

Given the global constraints, we can choose from among the encoded solutions using an objective function. We consider two types of objective functions: ones that are solely dependent on the program and ones that are also dependent on the computing platform and workload. Examples of objective functions of the first type include minimizing the number of lock statements (leading to coarse-grained locking) and maximizing concurrency (leading to fine-grained locking). We encode such an objective function, together with the global constraints, into a weighted MaxSAT problem, which is then solved using an off-the-shelf solver.

In order to choose a lock placement (from among the ones encoded by the global constraints) that has good performance, we use an objective function that depends on a particular machine architecture and workload. The objective function is given by a performance model. We emphasize that the model is based on measuring parameters of a particular architecture running the program. In particular, it is based on measuring the average time taken to acquire and release a lock under a given level of contention (this part depends only on the architecture), the average time it takes to execute the critical sections, and the average contention for critical sections (these parts depend on the architecture, the program, and the workload). The optimization procedure using the performance model and the global constraints works as follows. First, we parameterize the space of solutions by the sizes of critical sections and the number of locks taken. Second, we find the parameter values that maximize performance. Third, we find a solution of the global constraints closest to the parameter values that yield maximal performance.

We empirically show that our approach scales to typical module sizes of many real-world concurrent programs such as device drivers and multithreaded servers (1000 LOC). We use our synthesis tool (with architecture-independent objective functions) on a number of device driver benchmarks and find that the synthesis times are comparable to those of an existing tool CAV15 [] that implements a standard CEGIS-based algorithm; we emphasize that our tool finds an optimal lock placement that guarantees preemption-safety, deadlock-freedom and legitimate locking. Furthermore, we evaluate the tool with an objective function given by our performance model. We use the memcached network server, which provides an in-memory key-value store; specifically, we use the module used by server worker threads to access the store. We found that the performance model predictions match reality, and that we obtained different locking schemes depending on the values of the parameters of the performance model.

static void* worker_thread(void *arg) {
    for (j = 0; j < niter; j++) {
        sharedX();
        sharedY();
        local();
        sharedZ();
    }
}
Figure 1: Example: Work sharing

The contributions of this work are:

  • To the best of our knowledge, this is the first comprehensive approach for finding and optimizing lock placement. The approach is comprehensive, as it fully solves the lock placement problem on realistic (albeit simplified) systems code.

  • We use trace analysis to obtain global constraints, each of whose models gives a legitimate lock placement, thus enabling us to choose from among the encoded correct solutions using an objective function.

  • A method for lock placement for machine-independent objective functions, based on weighted MaxSAT solving.

  • Optimization of lock placement using a performance model obtained by measurement and profiling on a given platform.

1.1 Related Work

Synthesis of lock placement is an active research area bloem []; VYY10 []; CCG08 []; EFJM07 []; UBES10 []; DR98 []; VTD06 []; ramalingam []; SLJB08 []; ZSZSG08 []; shanlu []; CAV11 []; CAV13 []; CAV14 []; CAV15 []; POPL15 []. Several works optimize lock placement: for instance, the papers by Emmi et al. EFJM07 [], Cherem et al. CCG08 [], and Zhang et al. ZSZSG08 []. These papers take as input a program annotated with atomic sections and replace the atomic sections with locks, using several types of fixed objective functions. In contrast, our work does not need the annotations with atomic sections, and does not optimize using a fixed objective function, but rather using a performance model obtained by performing measurements on a particular architecture.

Another approach is to not require the programmer to annotate the program with atomic sections, but rather infer them or infer locks directly VYY10 []; bloem []; DR98 []; CAV13 []; CAV14 []; CAV15 []; POPL15 []. All these works also either do not optimize the lock placement, or do so for a fixed objective function, not for a given architecture. The work CAV11 [] proposes concurrency synthesis w.r.t. a performance model, but it does not produce such models for a given machine architecture.

Jin et al. shanlu [] describe a tool, CFix, that can detect and fix concurrency bugs by identifying bug patterns in the code. CFix also simplifies its own patches by merging fixes for related bugs.

Usui et al. UBES10 [] provide a dynamic adaptive lock placement algorithm, which can be precise but incurs runtime overhead.

Our abstraction is based on the one from CAV15 [], which is similar to abstractions that track reads and writes to individual locations (e.g., VYRS10 []; AKNP14 []). In VYY10 [], the authors rely on assertions for synchronization synthesis and include iterative abstraction refinement in their framework, which could be integrated into our approach.

1.2 Illustrative Example

One of the main contributions of the paper is a synthesis approach that allows optimization for a particular computing platform and program. We demonstrate on an example that varying the parameters of our performance model, which corresponds to varying the machine architecture and usage pattern (contention), can change which solution performs best.

Figure 2: Performance under various locking schemes

Consider the function worker_thread in Figure 1, which is called by a number of worker threads in a work-sharing setting. The worker_thread function calls three functions that access shared memory: sharedX(), sharedY(), and sharedZ(). Each of the three functions needs to be called mutually exclusively with itself, but it does not conflict with the other two functions. None of the three functions uses locks internally. The function local() does not access shared memory. If we consider locking schemes that use only one lock, and lock only within an iteration of the loop, then there are a number of correct locking schemes. First, there is a coarse-grained version that locks before the call to sharedX() and unlocks after the call to sharedZ(). Second, there is a fine-grained version that locks each call separately. Third, there are intermediate versions, such as one that locks before the call to sharedX(), unlocks before the call to local(), and then locks again before the call to sharedZ().
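For concreteness, the following C sketch (our own illustration, using POSIX threads; the functions sharedX(), sharedY(), sharedZ() and local() are assumed to be defined as in Figure 1) spells out the three schemes for a single loop iteration.

/* Three single-lock schemes for the loop body of worker_thread (Figure 1).
   Compile with -lpthread; the shared functions are assumed defined elsewhere. */
#include <pthread.h>

void sharedX(void); void sharedY(void); void sharedZ(void); void local(void);

static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;

/* 1. Coarse-grained: one critical section spans the whole iteration. */
void iter_coarse(void) {
    pthread_mutex_lock(&m);
    sharedX(); sharedY(); local(); sharedZ();
    pthread_mutex_unlock(&m);
}

/* 2. Fine-grained: each shared call is protected separately. */
void iter_fine(void) {
    pthread_mutex_lock(&m); sharedX(); pthread_mutex_unlock(&m);
    pthread_mutex_lock(&m); sharedY(); pthread_mutex_unlock(&m);
    local();
    pthread_mutex_lock(&m); sharedZ(); pthread_mutex_unlock(&m);
}

/* 3. Intermediate: the lock is released around the possibly expensive local(). */
void iter_intermediate(void) {
    pthread_mutex_lock(&m);
    sharedX(); sharedY();
    pthread_mutex_unlock(&m);
    local();
    pthread_mutex_lock(&m);
    sharedZ();
    pthread_mutex_unlock(&m);
}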

Clearly, which of the correct versions has the best performance depends on the program, the contention, and the architecture. For instance, with very low contention, the coarse-grained version would be fastest. On the other hand, under contention, and if local() is expensive enough, the third version performs best, as it releases the lock before calling local().

We further demonstrate with a small experiment (Figure 2) that often no locking scheme is uniformly better. We considered variants of the program in Figure 1 that differ in how long it takes to execute the function local() (parameter lt in Figure 2). We kept the other parameters (such as contention and the number of threads) constant. We see that if the call to local() is cheap, then the coarse-grained version performs better, but it can perform very badly otherwise. The other two versions perform comparably.

The example is similar to our main case study of the memcached server. We demonstrate on that case study that the performance model helps us choose the option with the best performance, and that it does not have to be the most coarse-grained or the most fine-grained locking solution.

2 Formal Framework

We present the syntax and semantics of a concrete concurrent while language, followed by the syntax and semantics of an abstract concurrent while language. While the language (and our tool) permits function call and return statements, we skip these constructs in the formalization below. We conclude the section by formalizing our notion of correctness for concrete concurrent programs.

In our work, we assume a read or a write to a single shared variable executes atomically and further assume a sequentially consistent memory model.

2.1 Concurrent Programs

Syntax (Fig. 3). A concurrent program is a finite collection of threads, where each thread is a statement written in the syntax of the language. All variables (shared program variables s_var, local program variables l_var, lock variables lk_var, condition variables c_var and guard variables g_var) range over integers. Each statement is labeled with a unique location identifier, which we use to refer to the statement.

The language includes standard syntactic constructs such as assignment, conditional, loop, synchronization, goto and yield statements. We only permit expressions that read from at most one shared variable and assignments that either read from or write to exactly one shared variable. (An expression or assignment statement that involves reading from or writing to multiple shared variables can always be rewritten into a sequence of atomic read/atomic write statements using local variables.) The language also includes assume, assume_not, set and unset statements whose use will be clarified later. Most significantly, the language permits reading from (in(tag)) and writing to (out(tag, s_expr)) a communication channel tag between the program and an interface to an external system. In practice, we use the tags to model device registers. In our presentation, we consider only a single external interface.

l_expr::= constant | * | l_var |
          operator(l_expr, l_expr, ... , l_expr)
s_expr::= s_var |
          operator(s_var, l_expr, ..., l_expr)
lstmt ::= loc: stmt | lstmt; lstmt
stmt  ::= s_var := l_expr | l_var := s_expr |
          s_var := havoc() | while (s_expr) lstmt |
          if (s_expr) lstmt else lstmt |
          s_var := in(tag) | out(tag,s_expr) |
          lock(lk_var)  | unlock(lk_var) |
          wait(c_var) | wait_not(c_var) |
          notify(c_var) | reset(c_var) |
          assume(g_var) | assume_not(g_var) |
          set(g_var) | unset(g_var) |
          goto loc | yield | skip

Figure 3: Syntax of the concrete language

Semantics. We first define the semantics of a single thread, and then extend the definition to concurrent non-preemptive and preemptive semantics.

Single-thread semantics (Fig. 4). Let us fix a thread; we use its identifier interchangeably with the program it represents. A state of a single thread is given by a valuation of all program variables visible to the thread, together with a location identifier indicating the statement to be executed next.

We define the flow graph of a thread in a manner similar to its control-flow graph. Each node of the flow graph is labeled with a unique labeled statement of the thread (unlike a control-flow graph, statements in the same basic block are not merged into a single node). The edges of the flow graph capture the flow of control. Nodes labeled with while and if statements have two outgoing edges, labeled True and False, respectively. The flow graph has a unique entry node and a unique exit node. The entry node is the first labeled statement of the thread. The exit node is a special node corresponding to a hypothetical statement placed at the end of the thread.

We define successors of locations using the flow graph. The exit location has no successors. If a node has exactly one outgoing edge, the location of its target is the unique successor of the node's location. If a node has two outgoing edges, labeled True and False, the locations of their targets are the True-successor and the False-successor, respectively, of the node's location.

We can now define the single-thread operational semantics. A single execution step changes the program state while optionally outputting an observable symbol; the absence of a symbol is denoted by a distinguished empty symbol. In the following, expressions are evaluated by replacing all variables with their values in the current state.

In Fig. 4, we present a partial set of rules for single execution steps. The only rules which involve output of an observable are:

  1. Havoc: The statement s_var := havoc() assigns a non-deterministic value to s_var and outputs an observable recording the chosen value.

  2. Input, Output: The statements s_var := in(tag) and out(tag, s_expr) read a value from and write a value to the channel tag, respectively, and output observables recording the tag together with the value read or written.

Intuitively, the observables record the sequence of non-deterministic guesses, as well as the input/output interaction with the tagged channels. The semantics of the synchronization statements shown in Fig. 4 is standard.

The semantics of the assume, assume_not, set and unset statements are identical to those of the wait, wait_not, notify and reset statements, respectively. Thus, assume(g_var) and assume_not(g_var) execute iff the guard variable g_var equals 1 and 0, respectively, and set(g_var) and unset(g_var) assign 1 and 0 to g_var, respectively.


Figure 4: A partial set of rules for the single-thread semantics of the concrete language
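As an illustration of the wait/notify semantics described above, the following minimal C sketch (our own, not part of the formalization) emulates a guard/condition variable with a mutex and a condition variable: guard_wait corresponds to wait/assume, blocking until the value is 1, and guard_set corresponds to notify/set; the duals for value 0 are analogous.

/* Emulation sketch of wait(c)/notify(c) (equivalently assume(g)/set(g)).
   Initialize as: guard_t g = { 0, PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER }; */
#include <pthread.h>

typedef struct {
    int val;              /* guard value: 0 or 1 */
    pthread_mutex_t m;
    pthread_cond_t cv;
} guard_t;

void guard_wait(guard_t *g) {   /* wait(c) / assume(g): proceed only when val == 1 */
    pthread_mutex_lock(&g->m);
    while (g->val != 1)
        pthread_cond_wait(&g->cv, &g->m);
    pthread_mutex_unlock(&g->m);
}

void guard_set(guard_t *g) {    /* notify(c) / set(g): val := 1, wake waiters */
    pthread_mutex_lock(&g->m);
    g->val = 1;
    pthread_cond_broadcast(&g->cv);
    pthread_mutex_unlock(&g->m);
}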

Concurrent semantics. A state of a concurrent program is given by a valuation of all program variables, the thread identifier of the currently executing thread, and the locations of the statements to be executed next in each of the threads. Initially, all program variables equal 0 and every thread is at its entry location.

Figure 5: Non-preemptive semantics
Figure 6: Additional rule for preemptive semantics

Non-preemptive semantics (Fig. 5). The non-preemptive semantics ensures that a single thread from the program keeps executing using the single-thread semantics (Rule Seq) until one of the following occurs: (a) the thread finishes execution (Rule Thread_end), or (b) the thread encounters a yield, lock, wait or wait_not statement (Rule Nswitch). In these cases, a context-switch is possible.

Preemptive semantics (Fig. 5, Fig. 6). The preemptive semantics of a program is obtained from the non-preemptive semantics by relaxing the condition on context-switches and allowing context-switches at all program points. In particular, the preemptive semantics consists of the rules of the non-preemptive semantics and the single rule Pswitch in Fig. 6.

void open_dev() {
1: while (*) {
2:   if (open==0)
3:      power_up();
4:   open:=open+1;
5: yield; } }


void open_dev_abs() {
1: while (*) {
2: r(open);
   if (*)
3:    w(dev);
4: r(open);
   w(open);
5: yield; } }

Figure 7: Example procedure and its abstraction

2.2 Abstract Concurrent Programs

For concurrent programs communicating with external system interfaces, it suffices to focus on a simple, data-oblivious abstraction (CAV15 []). The abstraction tracks the type of access (read or write) to each memory location while abstracting away the values. Inputs/outputs to an external interface are modeled as writes to a special memory location (dev). Havocs become ordinary writes to the variable they are assigned to. Every branch is taken non-deterministically and tracked. Given a concrete program, we denote by its abstraction the corresponding abstract program written in the abstract language.

Example. We present a procedure open_dev() and its abstraction in Fig. 7. The function power_up() represents a call to a device.

Abstract Syntax (Fig. 8). In the figure, var denotes all shared program variables and the dev variable. Observe that the abstraction respects the valuations of the lock, condition and guard variables. (The purpose of the guard variables is to improve the precision of our otherwise coarse abstraction. Currently, they are inferred manually, but can presumably be inferred automatically using an iterative abstraction-refinement loop. In our current benchmarks, guard variables needed to be introduced in only three scenarios.)


lstmt ::= loc: stmt | lstmt; lstmt
stmt  ::= r(var) | w(var) | if(*) lstmt else lstmt
          | while(*) lstmt |
          lock(lk_var)  | unlock(lk_var) |
          wait(c_var) | wait_not(c_var) |
          notify(c_var) | reset(c_var) |
          assume(g_var) | assume_not(g_var) |
          set(g_var) | unset(g_var) |
          goto loc | yield | skip


Figure 8: Syntax of the abstract language

Abstract Semantics. As before, we first define the single-thread semantics.

Single-thread semantics (Fig. 9). The abstract state of a single thread is given simply by the location of the statement to be executed next. We define the flow graph and successors of locations in the abstract program in the same way as before. An abstract observable symbol records either the type of access to a variable (read or write) along with the variable name, or a non-deterministic branching choice. Fig. 9 presents the rules for the statements unique to the abstract language; the rules for the statements common to both languages are the same.

Figure 9: A partial set of rules for the single-thread semantics of the abstract language

Concurrent semantics. A state of an abstract concurrent program is given by a valuation of all lock, condition and guard variables, the current thread identifier, and the locations of the statements to be executed next in each of the threads. The non-preemptive and preemptive semantics of an abstract concurrent program are defined in the same way as those of a concrete concurrent program.

2.3 Executions and Observable Behaviours

In the following, executions and observable behaviours are defined for both concrete and abstract concurrent programs.

Executions. A non-preemptive/preemptive execution of a concrete concurrent program is an alternating sequence of program states and (possibly empty) observable symbols such that (a) the sequence begins in the initial state of the program and (b) every consecutive pair of states, together with the observable symbol between them, corresponds to a single execution step according to the non-preemptive/preemptive semantics. A non-preemptive/preemptive execution of an abstract concurrent program is defined in the same way, using the abstract semantics.

Given an execution, its observable sequence is the sequence of non-empty observable symbols occurring in it.

Observable Behaviours. The non-preemptive/preemptive observable behaviour of a concrete program is the set of observable sequences of all its non-preemptive/preemptive executions. The non-preemptive/preemptive observable behaviour of an abstract program is defined similarly.

2.4 Program Correctness

We specify correctness of concurrent programs using three implicit criteria, presented below.

Preemption-safety. Two observable behaviours of a concrete program are equivalent if: (a) their subsequences containing only the input and output symbols are equal and (b) for each thread identifier, their subsequences containing only that thread's havoc symbols are equal. Intuitively, observable behaviours are equivalent if they have the same interaction with the interface, and the same non-deterministic choices in each thread. For two sets of observable behaviours, we say the first is subsumed by the second if each sequence in the first set has an equivalent sequence in the second.

Given concurrent programs such that the second is obtained by adding locks to the first, the second program is preemption-safe w.r.t. the first if its preemptive observable behaviour is subsumed by the non-preemptive observable behaviour of the first.

Deadlock-freedom. A state of a concurrent program is a deadlock state under non-preemptive/preemptive semantics if

  1. there exists a non-preemptive/preemptive execution from the initial state of the program to that state,

  2. there exists a thread that has not yet reached its exit location, and

  3. no thread can execute a step from that state according to the non-preemptive/preemptive semantics.

A program is deadlock-free under non-preemptive/preemptive semantics if no non-preemptive/preemptive execution of the program hits a deadlock state. In other words, every non-preemptive/preemptive execution of the program ends in a state in which every thread is at its exit location. We say a program is deadlock-free if it is deadlock-free under both non-preemptive and preemptive semantics.

[Figure 10 shows the solution as a flowchart: compute the abstraction, construct the two NFAs, and check language inclusion. If a counterexample (cex) is found, generalize it into an HB-formula, infer mutex constraints, and construct a new NFA enforcing the mutex constraints before re-checking. Once the inclusion holds, construct the global lock placement constraints for correctness, compute an optimal lock placement, and synthesize the output program by placing the optimal locks.]
Figure 10: Solution Overview

Legitimacy of locking discipline. Let us first fix some terminology for execution steps of a concurrent program:

  • A lock step is the single execution step of a lock(lk_var) statement in some thread.

  • Similarly, an unlock step is the single execution step of an unlock(lk_var) statement in some thread.

  • Given an execution, we refer to its individual execution steps by their positions in the execution.

A program has a legitimate locking discipline under non-preemptive/preemptive semantics if for any non-preemptive/preemptive execution of the program, the following hold:

  1. Lock implies eventually (but not immediately after) unlock: every lock step on a lock variable is eventually followed by a matching unlock step in the same thread, with at least one step in between.

  2. Unlock implies earlier (but not immediately before) lock: every unlock step on a lock variable is preceded by a matching lock step in the same thread, with at least one step in between.

  3. No double locking: a lock variable is never locked again before an intervening unlock.

  4. No double unlocking: a lock variable is never unlocked unless it is currently locked.

We say a program has a legitimate locking discipline if it has a legitimate locking discipline under both non-preemptive and preemptive semantics.
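For concreteness, the following C fragments (our own illustrations, not synthesizer output) show lock placements that the four conditions rule out; the program compiles, but executing the marked statements violates the discipline.

/* Illustrations of illegitimate lock placements (POSIX threads). */
#include <pthread.h>

static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;

void lock_never_released(void) {  /* violates (1): no later unlock */
    pthread_mutex_lock(&m);
    /* ... no pthread_mutex_unlock(&m) on this path ... */
}

void unlock_without_lock(void) {  /* violates (2): no earlier lock */
    pthread_mutex_unlock(&m);     /* undefined behaviour for a normal mutex */
}

void double_lock(void) {          /* violates (3): locking a held lock */
    pthread_mutex_lock(&m);
    pthread_mutex_lock(&m);       /* deadlocks for a default mutex */
    pthread_mutex_unlock(&m);
}

void double_unlock(void) {        /* violates (4): unlocking twice */
    pthread_mutex_lock(&m);
    pthread_mutex_unlock(&m);
    pthread_mutex_unlock(&m);     /* undefined behaviour */
}

int main(void) { return 0; }      /* the functions above are illustrations only */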

3 Problem Statement and Solution Overview

3.1 Problem Statement

Given a concurrent program that is deadlock-free and has a legitimate locking discipline under non-preemptive semantics, and an objective function, the goal is to synthesize a new concurrent program such that:

  1. the new program is obtained by adding locks to the original program,

  2. the new program is preemption-safe w.r.t. the original program,

  3. the new program is deadlock-free,

  4. the new program has a legitimate locking discipline, and

  5. the new program is optimal w.r.t. the objective function among all programs satisfying conditions 1-4.

3.2 Solution Overview

Our solution framework (Fig. 10) consists of the following main components.

Reduction of preemption-safety to language inclusion CAV15 []. To ensure tractability of checking preemption-safety, we rely on the abstraction described in Sec. 2.2. Two observable behaviours of an abstract program are equivalent if (a) they are equal modulo the classical independence relation on memory accesses (accesses to different locations are independent, and accesses to the same location are independent iff they are both read accesses) and (b) for each thread, their subsequences of non-deterministic branching symbols are equal. Using this notion of equivalence, the notion of preemption-safety is extended to abstract programs.

Under abstraction, we model each thread as a nondeterministic finite automaton (NFA) over a finite alphabet consisting of abstract observable symbols. This enables us to construct NFAs accepting the abstract non-preemptive and preemptive observable behaviours, respectively (initially, the abstract program is the abstraction of the input program). It turns out that preemption-safety of the concrete program w.r.t. the original is implied by preemption-safety of the abstract program w.r.t. the original abstract program, which, in turn, is implied by language inclusion modulo the independence relation of the two NFAs. The two NFAs satisfy language inclusion modulo the independence relation if any word accepted by the first is equivalent to some word obtainable by repeatedly commuting adjacent independent symbol pairs in a word accepted by the second. While the problem of language inclusion modulo an independence relation is undecidable bertoni1982equivalence [], we define and decide a bounded version of it. (Our language inclusion procedure starts with an initial bound and iteratively increases the bound until it reports that the inclusion holds, finds a counterexample, or reaches a timeout.)
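The independence relation itself is easy to operationalize. Below is a minimal C sketch (our own; the access_t record is a hypothetical representation of an abstract observable, not the tool's data type) of the commutation check used during inclusion checking.

#include <stdbool.h>
#include <string.h>

typedef struct {
    const char *var;   /* accessed variable, e.g. "open" or "dev" */
    bool is_write;     /* w(var) if true, r(var) otherwise */
    int tid;           /* issuing thread */
} access_t;

/* May two adjacent symbols be commuted? Per-thread order is never changed
   (that is condition (b) above); otherwise, accesses to different locations
   commute, and same-location accesses commute only if both are reads. */
bool can_commute(const access_t *a, const access_t *b) {
    if (a->tid == b->tid)
        return false;
    if (strcmp(a->var, b->var) != 0)
        return true;
    return !a->is_write && !b->is_write;
}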

Inference of mutex constraints from generalized counterexamples. If the two NFAs do not satisfy language inclusion modulo the independence relation, we obtain a counterexample trace and analyze it to infer constraints for eliminating it. Our counterexample analysis examines the set of all permutations of the symbols of the counterexample that are accepted by the first NFA. The output of the counterexample analysis is an HB-formula, a Boolean combination of happens-before (ordering) constraints between events, representing all these counterexamples. Thus the single counterexample is generalized into a larger set of counterexamples represented by the HB-formula.

From the HB-formula, we infer lock-enforceable constraints that can eliminate all counterexamples satisfying it. The key observation we exploit is that common concurrency bugs manifest as simple patterns of ordering constraints between events. For instance, a pattern in which a statement of one thread is ordered between two statements of another thread indicates an atomicity violation and can be rewritten as a mutual exclusion (mutex) constraint. Such a mutex constraint can be easily enforced by a lock that protects the respective code regions in the two threads. We refer the reader to POPL15 [] for more details.

Automaton modification for enforcing mutex constraints. Once we have the mutex constraints inferred from a generalized counterexample, we could insert the corresponding locks into the program, reconstruct the NFA and repeat the process, starting from the language inclusion check. This is a greedy iterative loop for synchronization synthesis and is undesirable (see Sec. 1). Hence, instead of modifying the program in each iteration by inserting locks, we modify the NFA in each iteration to enforce the mutex constraints and then repeat the process. We describe the procedure for modifying the NFA to enforce mutex constraints in Sec. 4.

Construction of global lock placement constraints. Once the two NFAs satisfy language inclusion modulo the independence relation, we formulate global constraints over lock placements for ensuring correctness. These global constraints include all mutex constraints inferred over all iterations, as well as constraints for enforcing deadlock-freedom and legitimacy of lock placement. Any model of the global constraints corresponds to a lock placement that ensures program correctness. We describe the formulation of these global constraints in Sec. 5.

Computation of an optimal lock placement. In our final component, given an objective function, we compute a lock placement that satisfies the global constraints and is optimal w.r.t. the objective function. We then synthesize the final output program by inserting the computed lock placement into the input program. We present various objective functions and describe the computation of their respective optimal solutions in Sec. 6.

4 Enforcing Mutex Constraints in the NFA

To enforce mutex constraints in the NFA, we prune the paths in it that violate the mutex constraints.

Conflicts. Given a mutex constraint, a conflict is a tuple of location identifiers consisting of three adjacent locations in one thread, delimiting two adjacent statements, and two adjacent locations in the other thread, delimiting a single statement, such that executing that statement between the two adjacent statements violates the mutex constraint. Intuitively, a conflict represents a minimal violation of a mutex constraint due to the execution of a statement in one thread between two adjacent statements in another thread. The execution of each statement is represented using a source location and a destination location.

We collect all conflicts derived from all mutex constraints in the current loop iteration into a conflict set.

Example. We show an example program and its flow graph in Fig. 11 (we skip the statement labels in the nodes here). Suppose in some iteration we obtain a mutex constraint over the two threads; this yields two conflicts. As an aside, this example also illustrates the inadequacy of a greedy approach to lock placement: greedily enforcing the mutex constraint with a lock yields a placement that is not legitimate, since in executions taking the else branch the lock is never released.

Constructing the new NFA. Initially, the NFA is given by a tuple consisting of (a) the set of states of the abstract program, (b) the set of abstract observable symbols as its alphabet, (c) the initial state of the abstract program, (d) the set of final states, namely the states in which every thread is at its exit location, and (e) the transition relation, which contains a transition between two states labeled with an observable symbol iff the abstract preemptive semantics permits the corresponding execution step.

To enable pruning paths that violate mutex constraints, we augment the state space of the NFA to track the status of each conflict using a four-valued proposition. Initially, all propositions are 0. A conflict's proposition is incremented from 0 to 1 when the conflict is activated, i.e., when the first thread executes the first statement of its pair. It is incremented from 1 to 2 when the conflict progresses, i.e., when the first thread is between its two statements and the other thread executes the interfering statement. It is incremented from 2 to 3 when the conflict completes, i.e., when the first thread then executes the second statement of its pair. The proposition is reset to 0 when the conflict is aborted, i.e., when the other thread moves to a location different from the expected one, or when the first thread executes its second statement before the other thread interferes.

T1:
a1: w(v);
a2: r(x);
T2:
b1: r(v);
b2: if (*)
b3:    w(v);
b4:    r(x);
    else
b5:    r(x);

[Flow graphs: T1 has nodes a1 → a2; T2 has nodes b1 → b2, branching to b3 → b4 and to b5.]

Figure 11: Example: Mutex constraints and conflicts

Example. In Fig. 11, the first conflict is activated when T2 moves from b1 to b2; it progresses if T1 now moves from a1 to a2, and is aborted if T2 instead takes the else branch; it completes after progressing if T2 then moves on towards b3, and is aborted if T2 does so before T1 has moved from a1 to a2.
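The proposition updates form a small state machine; the following C sketch (our own summary, with hypothetical names) mirrors the four-valued status described above.

/* Four-valued conflict status: 0 = inactive, 1 = activated,
   2 = progressed, 3 = completed; abortion resets to 0. */
typedef enum { INACTIVE = 0, ACTIVATED = 1, PROGRESSED = 2, COMPLETED = 3 } cstatus;

cstatus on_activate(cstatus s) { return s == INACTIVE   ? ACTIVATED  : s; }
cstatus on_progress(cstatus s) { return s == ACTIVATED  ? PROGRESSED : s; }
cstatus on_complete(cstatus s) { return s == PROGRESSED ? COMPLETED  : s; } /* target state is pruned */
cstatus on_abort(cstatus s)    { (void)s; return INACTIVE; }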

Formally, the new NFA is given by a tuple where:

  1. the states are the states of the initial NFA augmented with the conflict-status propositions,

  2. the alphabet is the set of abstract observable symbols as before,

  3. the initial state is the initial state of the original NFA with all propositions set to 0,

  4. the final states are the original final states (with arbitrary proposition values), and

  5. the transition relation is constructed as follows: a transition is added iff the original NFA permits it and, for each conflict, its proposition is updated as follows:

    1. Conflict activation:
      if the proposition is 0 and the first statement of the conflict's pair executes, then the proposition becomes 1,

    2. Conflict progress:
      else if the proposition is 1 and the interfering statement of the other thread executes, then the proposition becomes 2,

    3. Conflict completion and state pruning:
      else if the proposition is 2 and the second statement of the pair executes, then the proposition becomes 3 and the target state is deleted. (Our mutex constraints are in disjunctive normal form (DNF); hence, in our implementation, we track conflict completions and delete a state only after the DNF is falsified.)

    4. Conflict abortion (the other thread executes an alternate statement):
      else if the other thread moves to a location different from the destination of its interfering statement, then the proposition is reset to 0,

    5. Conflict abortion (the pair completes before the interference):
      else if the proposition is 1 and the first thread executes the second statement of its pair, then the proposition is reset to 0.

In our implementation, the new NFA is constructed on the fly. Moreover, we do not maintain the entire set of propositions in each state. A proposition is added to the list of tracked propositions only after its conflict is activated; once the conflict is aborted, the proposition is dropped from the list of tracked propositions.

5 Global Lock Placement Constraints

We encode the global lock placement constraints for ensuring correctness as an SMT formula. (The encoding is essentially a SAT formula; we present and use it as an SMT formula to enable combining it with objective functions for optimization, see Sec. 6.) The formula ranges over the set of all locations and a fixed set of locks available for synthesis; we use scalars of a location type to denote locations and scalars of a lock type to denote locks. For each node, we consider the set of its predecessors in the flow graph of the current abstract concurrent program, and we use the set of all conflicts derived from mutex constraints across all iterations. We use the following Boolean variables in the encoding:
  • lock lk is placed just before location loc
  • lock lk is placed just after location loc
  • unlock of lk is placed just before location loc
  • unlock of lk is placed just after location loc
  • lock lk is placed before lock lk' (when nesting)

We describe the main constraints constituting the formula below. All constraints range over all locations and all locks.

  1. Define an auxiliary predicate: location loc is protected by lock lk.

  2. Define an auxiliary predicate: location loc is protected by lock lk, or an unlock of lk is placed after loc.

  3. All locations in the same conflict are protected by the same lock, without interruption.

  4. Placing immediately before/after is disallowed.

  5. Enforce the lock order.

  6. No wait statements in the scope of synthesized locks.

  7. Placing both and before/after is disallowed.

  8. All predecessors must agree on their status.

  9. An unlock of a lock can be placed only after the corresponding lock has been placed.

  10. No double locking.

  11. No double unlocking.

Besides the above, the formula includes constraints that account for existing locks in the program. In particular, the lock order forces all synthesized locks to be placed after any existing lock (when nesting). Further, the formula also includes constraints enforcing uniformity of lock placements in loop bodies.

We have the following result.

Theorem 5.1.

Let a concurrent program be obtained by inserting any lock placement satisfying the global constraints into the original concurrent program. Then the resulting program is guaranteed to be preemption-safe w.r.t. the original program, deadlock-free, and to have a legitimate locking discipline.

6 Optimizing Lock Placement

The global lock placement constraint constructed in Section 5 often has multiple models corresponding to very different lock placements. The desirability of these lock placements varies considerably due to performance considerations. Hence, any comprehensive approach to synchronization synthesis needs to take performance into account while generating the solution program.

Here, we present two types of objective functions to distinguish between different lock placements, as well as accompanying optimization procedures. These procedures take as input the global lock placement constraints, along with any auxiliary inputs required by an objective function, and produce a lock placement that ensures program correctness and is optimal w.r.t. the objective function.

The first category of objective functions consists of formal encodings of various syntactic “rules of thumb” (such as minimality of critical sections) used by programmers, while the second category is based on building a performance model based on profiling. The second category is less widely applicable than the first, but also corresponds more closely to real performance on a machine.

6.1 Syntactic optimization

We say that a statement in a concurrent program is protected by a lock if the corresponding protection variable from Sec. 5 is true. We define two syntactic objective functions as follows:

  1. Coarsest locking. Under this objective function, a program is considered better than another if it has fewer lock statements. Among programs having the same number of lock statements, the ones with the fewest statements protected by any lock are considered better. Formally, we can define the objective value of a program to be num_lock · W + num_prot, where num_lock is the count of lock statements in the program, num_prot is the count of statements that are protected by any lock, and the weight W exceeds the total number of statements in the program (so that the lock count is minimized first).

  2. Finest locking. This objective function asks for maximum possible concurrency, i.e., it asks to minimize the number of pairs of statements from different threads that cannot be executed together. Formally, we define the objective value to be the number of pairs of statements from different threads that cannot be executed at the same time, i.e., that are protected by the same lock.

A note on usage. Neither of the objective functions mentioned above is guaranteed to find the optimally performing program in all scenarios. It is up to the programmer to judge when each criterion should be used. Intuitively, the coarsest locks should be used when the cost of locking operations is relatively high compared to the cost of executing the critical sections, while the finest locks should be used when locking operations are cheap compared to the cost of executing the critical sections.

Optimization procedure. The main idea behind the optimization procedure for the above syntactic objective functions is to build an instance of the MaxSMT problem using the global lock placement constraint, such that (a) every model of the global constraint is a model of the MaxSMT hard constraint, and (b) the cost of each model of the MaxSMT problem is the cost of the corresponding locking scheme according to the chosen objective function. The optimal lock placement is then computed by solving the MaxSMT problem.

A MaxSMT problem instance is given by (φ, {(ψ_1, w_1), …, (ψ_n, w_n)}), where φ and the ψ_i are SMT formulae and the w_i are real numbers. The formula φ is called the hard constraint, and each ψ_i is called a soft constraint with w_i being its associated weight. Given an assignment of the variables occurring in the constraints, its cost is defined to be the sum of the weights of the soft constraints it violates. The objective of the MaxSMT problem is to find a model that satisfies φ while having minimal cost.
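As a toy worked instance (our own example, in the notation just introduced), consider:

    φ = a ∨ b,   (ψ_1, w_1) = (¬a, 1),   (ψ_2, w_2) = (¬b, 2).

The model a = true, b = false satisfies φ and violates only ψ_1, so its cost is 1; the model a = false, b = true has cost 2. Hence the first model is optimal.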

For each of the above objective functions, the hard constraint of the MaxSMT problem is the global lock placement constraint, and the soft constraints and associated weights are as specified below:

  • For the coarsest locking objective function, the soft constraints are of three types: (a) constraints asserting that no lock is placed just before a location, with weight W, (b) constraints asserting that no lock is placed just after a location, with weight W, and (c) constraints asserting that a statement is not protected by any lock, with weight 1, where W is as defined above.

  • For the finest locking objective function, the soft constraints assert, for each pair of statements from different threads, that the pair is not protected by the same lock. The weight of each soft constraint is 1.

Theorem 6.1.

For each of the two syntactic objective functions, the cost of the optimal program is equal to the cost of the optimal model of the corresponding MaxSMT problem obtained as described above.

6.2 Performance profiling-based optimization

We first present the performance model for a very restricted setting. The target use case of this performance model is for high-performance work-sharing code as present in servers or database management systems. In such systems, it is usually the case that there are as many worker threads as there are cores in the CPU with each thread executing the same code. We build the performance model based on this execution model. We also assume that the synthesizer is allowed to introduce at most one lock variable, which may be taken and released multiple times by each thread during its execution, i.e., the various programs the synthesizer can synthesize differ only in the granularity of locking. We later show how this assumption of only one lock variable can be removed.
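To make the profiling concrete, the following C microbenchmark sketch (our own, not the tool's implementation) estimates the average wall-clock cost of a lock/unlock pair under the stated execution model of one worker thread per core; the constants are placeholders.

/* Estimate average cost per critical section under contention.
   Compile with -lpthread; set NTHREADS to the number of cores. */
#include <pthread.h>
#include <stdio.h>
#include <time.h>

#define NTHREADS 4
#define NITER 1000000L

static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
static volatile long shared_counter = 0;

static void *worker(void *arg) {
    (void)arg;
    for (long i = 0; i < NITER; i++) {
        pthread_mutex_lock(&m);
        shared_counter++;          /* stand-in critical section */
        pthread_mutex_unlock(&m);
    }
    return NULL;
}

int main(void) {
    pthread_t t[NTHREADS];
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < NTHREADS; i++) pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++) pthread_join(t[i], NULL);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    printf("avg ns per critical section: %.1f\n",
           secs * 1e9 / ((double)NTHREADS * NITER));
    return 0;
}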

Execution model. To formally define the parameters that our performance model depends on, we first need to understand and define the characteristics of the execution model. The “average” time taken to execute a concurrent program depends on a wide variety of factors (e.g., underlying hardware platform, distribution of inputs, and scheduler). Here, we list a number of assumptions we make about these factors:

  • Usage and scheduling models. We assume that under the intended usage of the program, program inputs are drawn from a probability distribution and, further, that the expected running time of the program is finite. We also assume that the system scheduler is oblivious, i.e., does not make scheduling choices based on the actual values of variables in the program. This is a reasonable assumption for most systems. Note that neither the input probability distribution nor the system scheduler needs to be formally modelled or known; all we require is the ability to obtain "typical" inputs and run the program.