









This paper examines the challenges of inserting memory fences on modern architectures and proposes a static analysis approach to place them automatically. The authors investigate the performance impact of fences and survey related work. They detail how to detect critical cycles and place fences to restore memory consistency. The paper also covers the motivating experiments and the optimisation of fence placement and type.
Jade Alglave¹, Daniel Kroening², Vincent Nimal², and Daniel Poetzl²
¹ University College London, ² University of Oxford
Abstract Modern architectures rely on memory fences to prevent undesired weakenings of memory consistency. As the fences’ semantics may be subtle, the automation of their placement is highly desirable. But precise methods for restoring consistency do not scale to deployed systems code. We choose to trade some precision for genuine scalability: our technique is suitable for large code bases. We implement it in our new musketeer tool, and detail experiments on more than 350 executables of packages found in Debian Linux 7.1, e.g. memcached (about 10000 LoC).
Concurrent programs are hard to design and implement, especially when running on multiprocessor architectures. Multiprocessors implement weak memory models, which feature e.g. instruction reordering, store buffering (both appearing on x86), or store atomicity relaxation (a particularity of Power and ARM). Hence, multiprocessors allow more behaviours than Lamport’s Sequential Consistency (SC) [20], a theoretical model where the execution of a program corresponds to an interleaving of the different threads. This has a dramatic effect on programmers, most of whom learned to program with SC.
Fortunately, architectures provide special fence (or barrier) instructions to prevent certain behaviours. Yet both the questions of where and how to insert fences are contentious, as fences are architecture-specific and expensive.
Attempts at automatically placing fences include Visual Studio 2013, which offers an option to guarantee acquire/release semantics (we study the performance impact of this policy in Sec. 2). The C++11 standard provides an elaborate API for inter-thread communication, giving the programmer some control over which fences are used, and where. But the use of such APIs might be a hard task, even for expert programmers. For example, Norris and Demsky reported a bug found in a published C11 implementation of a work-stealing queue [27].
We address here the question of how to synthesise fences, i.e. automatically place them in a program to enforce robustness/stability [9,5] (which implies SC). This should lighten the programmer’s burden.
The fence synthesis tool needs to be based on a precise model of weak memory. In verification, models commonly adopt an operational style, where an execution is an interleaving of transitions accessing the memory (as in SC). To address weaker architectures, the models are augmented with buffers and queues that implement the features of the hardware. Similarly, a good fraction of the fence synthesis methods, e.g. [23,18,19,24,3,10] (see also Fig. 2), rely on operational models to describe executions of programs.
⋆ Supported by SRC/2269.002, EPSRC/H017585/1 and ERC/280053.
Challenges Thus, methods using operational models inherit the limitations of methods based on interleavings, e.g. the “severely limited scalability”, as [24] puts it. Indeed, none of them scale to programs with more than a few hundred lines of code, due to the very large number of executions a program can have. Another impediment to scalability is that these methods establish if there is a need for fences by exploring the executions of a program one by one. Finally, considering models à la Power makes the problem significantly more difficult. Intel x86 offers only one fence (mfence), but Power offers a variety of synchronisation: fences (e.g. sync and lwsync), or dependencies (address, data or control). This diversity makes the optimisation more subtle: one cannot simply minimise the number of fences, but rather has to consider the costs of the different synchronisation mechanisms; it might be cheaper to use one full fence than four dependencies.
Our approach We tackle these challenges with a static approach. Our choice of model almost mandates this approach: we rely on the axiomatic semantics of [6]. We feel that an axiomatic semantics is an invitation to build abstract objects that embrace all the executions of a program. Previous works, e.g. [30,5,9,10], show that weak memory behaviours boil down to the presence of certain cycles, called critical cycles, in the executions of the program. A critical cycle essentially represents a minimal violation of SC, and thus indicates where to place fences to restore SC. We detect these cycles statically, by exploring an over-approximation of the executions of the program.
Contributions Our method is sound for a wide range of architectures, including x86-TSO, Power and ARM; and scales for large code bases, such as memcached (about 10000 LoC). We implemented it in our new musketeer tool. Our method is the most precise of the static analysis methods (see Sec. 2). To do this comparison, we implemented all these methods in our tool; for example, the pensieve policy [32] was designed for Java only, and we now provide it for x86-TSO, Power and ARM. Thus, our tool musketeer gives a comparison point for the field.
Outline We discuss the performance impact of fences in Sec. 2, and survey related work in Sec. 3. We recall our weak memory semantics in Sec. 4. We detail how we detect critical cycles in Sec. 5, and how we place fences in Sec. 6. In Sec. 7, we compare existing tools and our new tool musketeer. We provide the sources, benchmarks and experimental reports online at http://www.cprover.org/wmm/musketeer.
Before optimising the placement of fences, we investigated whether naive approaches to fence insertion indeed have a negative performance impact. To that end, we measured
authors                  tool       model style   objective
Abdulla et al. [3]       memorax    operational   reachability
Alglave et al. [6]       offence    axiomatic     SC
Bouajjani et al. [10]    trencher   operational   SC
Fang et al. [15]         pensieve   axiomatic     SC
Kuperstein et al. [18]   fender     operational   reachability
Kuperstein et al. [19]   blender    operational   reachability
Linden et al. [23]       remmex     operational   reachability
Liu et al. [24]          dfence     operational   specification
Sura et al. [32]         pensieve   axiomatic     SC
Fig. 2. Fence synthesis tools
The work of Shasha and Snir [30] is a foundation for the field of fence synthesis. Most of the work cited below inherits their notions of delay and critical cycle. A delay is a pair of instructions in a thread that can be reordered by the underlying architecture. A critical cycle essentially represents a minimal violation of SC. Fig. 2 classifies the methods mentioned in this section w.r.t. their style of model (operational or axiomatic). We report our experimental comparison of these tools in Sec. 7. Below, we detail fence synthesis methods per style. We write TSO for Total Store Order, implemented in Sparc TSO [31] and Intel x86 [28]. We write PSO for Partial Store Order and RMO for Relaxed Memory Order, two other Sparc architectures. We write Power for IBM Power [1].
Operational models Linden and Wolper [23] explore all executions (using what they call automata acceleration) to simulate the reorderings occurring under TSO and PSO. Abdulla et al. [3] couple predicate abstraction for TSO with a counterexample-guided strategy. They check if an error state is reachable; if so, they calculate what they call the maximal permissive sets of fences that forbid this error state. Their method guarantees that the fences they find are necessary, i.e., removing a fence from the set would make the error state reachable again. Kuperstein et al. [18] explore all executions for TSO, PSO and a subset of RMO, and along the way build constraints encoding reorderings leading to error states. The fences can be derived from the set of constraints at the error states. The same authors [19] improve this exploration under TSO and PSO using an abstract interpretation they call partial coherence abstraction, relaxing the order in the write buffers after a certain bound, thus reducing the state space to explore. Liu et al. [24] offer a dynamic synthesis approach for TSO and PSO, enumerating the possible sets of fences to prevent an execution picked dynamically from reaching an error state. Bouajjani et al. [10] build on an operational model of TSO. They look for minimum violations (viz. critical cycles) by enumerating attackers (viz. delays). Like us, they use linear programming. However, they first enumerate all the solutions, then encode them as an ILP, and finally ask the solver to pick the least expensive one. Our method directly encodes the whole decision problem as an ILP. The solver thus both constructs the solution (avoiding the exponential-size ILP problem) and ensures its optimality. All the approaches above focus on TSO and its siblings PSO and RMO, whereas we also handle the significantly weaker Power, including quite subtle barriers (e.g. lwsync) compared to the simpler mfence of x86.
Axiomatic models Krishnamurthy et al. [17] apply Shasha and Snir’s method to single program multiple data systems. Their abstraction is similar to ours, except that they do not handle pointers. Lee and Padua [22] propose an algorithm based on Shasha and Snir’s work. They use dominators in graphs to determine which fences are redundant. This approach was later implemented by Fang et al. [15] in pensieve, a compiler for Java. Sura et al. later implemented a more precise approach in pensieve [32] (see (P) in Sec. 2). They pair the cycle detection with an analysis to detect synchronisation that could prevent cycles. Alglave and Maranget [6] revisit Shasha and Snir for contemporary memory models and insert fences following a refinement of [22]. Their offence tool handles snippets of assembly code only, where the memory locations need to be explicitly given.
Others We cite the work of Vafeiadis and Zappa Nardelli [35], who present an optimisation of the certified CompCert-TSO compiler to remove redundant fences on TSO. Marino et al. [25] experiment with an SC-preserving compiler, showing overheads of no more than 34%. Nevertheless, they emphasise that “the overheads, however small, might be unacceptable for certain applications”.
mp
T0              T1
(a) x ← 1       (c) r1 ← y
(b) y ← 1       (d) r2 ← x
Final state? r1=1 ∧ r2=0

Event graph: (a)Wx1 -po→ (b)Wy1 -rf→ (c)Ry1 -po→ (d)Rx0 -fr→ (a)Wx1
Fig. 3. Message Passing (mp)
Weak memory can occur as follows: a thread sends a write to a store buffer, then a cache, and finally to memory. While the write transits through buffers and caches, a read can occur before the value is available to all threads in memory. To describe such situations, we use the framework of [6], embracing in particular SC, Sun TSO (i.e. the x86 model [28]), and a fragment of Power. The core of this framework consists of relations over memory events. We illustrate this framework using a litmus test (Fig. 3). The top shows a multi-threaded program. The shared variables x and y are assumed to be initialised to zero. A store instruction (e.g. x ← 1 on T0) gives rise to a write event ((a)Wx1), and a load instruction (e.g. r1 ← y on T1) to a read event ((c)Ry1). The bottom of Fig. 3 shows one particular execution of the program (also called event graph), corresponding to the final state r1=1 and r2=0. In the framework of [6], an execution that is not possible on SC has a cyclic event graph (as the one shown in Fig. 3). A weaker architecture may relax some of the relations contributing to a cycle. If the removal of the relaxed edges from the event graph makes it acyclic, the architecture allows the execution. For example, Power relaxes the program order po (amongst other things), thereby making the graph in Fig. 3 acyclic. Hence, the given execution is allowed on Power.
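As a small illustration of why the final state r1=1 ∧ r2=0 of the mp test is not possible on SC, the following sketch enumerates every interleaving of the two threads of Fig. 3 against a single shared memory. The encoding of instructions as tuples and the function names are assumptions made for this example; they are not part of the framework of [6].

```python
# Message-passing (mp) litmus test of Fig. 3; x and y are initialised to 0.
# Each instruction is (kind, location, value-or-register); this encoding is
# an assumption made for the sketch.
T0 = [("W", "x", 1), ("W", "y", 1)]          # (a) x <- 1 ; (b) y <- 1
T1 = [("R", "y", "r1"), ("R", "x", "r2")]    # (c) r1 <- y ; (d) r2 <- x


def interleavings(t0, t1):
    """Yield every merge of t0 and t1 that preserves each thread's order."""
    if not t0 or not t1:
        yield list(t0) + list(t1)
        return
    for rest in interleavings(t0[1:], t1):
        yield [t0[0]] + rest
    for rest in interleavings(t0, t1[1:]):
        yield [t1[0]] + rest


def sc_final_states(t0, t1):
    """Execute each interleaving on one shared memory, as SC prescribes."""
    finals = set()
    for trace in interleavings(t0, t1):
        mem = {"x": 0, "y": 0}
        regs = {}
        for kind, loc, arg in trace:
            if kind == "W":
                mem[loc] = arg          # a write updates memory at once
            else:
                regs[arg] = mem[loc]    # a read sees the latest write
        finals.add((regs["r1"], regs["r2"]))
    return finals


if __name__ == "__main__":
    print(sorted(sc_final_states(T0, T1)))
```

The pair (1, 0) never appears in the result, which matches the statement that the execution of Fig. 3 has a cyclic event graph and is therefore excluded on SC.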
Formalisation An event is a memory read or a write to memory, composed of a unique identifier, a direction (R for read or W for write), a memory address, and a value. We
fences (rfe; fence or fence; rfe) safe, even though rfe alone would not be safe. In Fig. 3, placing a cumulative fence between the two writes on T0 will not only prevent their reordering, but also enforce an ordering between the write (a) on T0 and the read (c) on T1, which reads from T0.
Architectures An architecture A determines the set safe_A of relations safe on A. Following [6], we always consider the coherence co, the from-read relation fr and the fences to be safe. SC relaxes nothing, i.e. rf and po are safe. TSO authorises the reordering of write-read pairs and store buffering but nothing else.
Critical cycles Following [30,5], for an architecture A, a delay is a po or rf edge that is not safe (i.e. is relaxed) on A. An execution (E, X) is valid on A yet not on SC iff it contains critical cycles [5]. Formally, a critical cycle w.r.t. A is a cycle in po ∪ com, where com ≜ co ∪ rf ∪ fr is the communication relation, which has the following characteristics (the last two ensure the minimality of the critical cycles): (1) the cycle contains at least one delay for A; (2) per thread, (i) there are at most two accesses a and b, and (ii) they access distinct memory locations; and (3) for a memory location ℓ, there are at most three accesses to ℓ along the cycle, which belong to distinct threads. Fig. 3 shows a critical cycle w.r.t. Power. The po edge on T0, the po edge on T1, and the rf edge between T0 and T1, are all unsafe on Power. On the other hand, the cycle in Fig. 3 does not contain a delay w.r.t. TSO, and is thus not a critical cycle on TSO. To forbid executions containing critical cycles, one can insert fences into the program to prevent delays. To prevent a po delay, a fence can be inserted between the two accesses forming the delay, following Fig. 4. To prevent an rf delay, a cumulative fence must be used (see Sec. 6 for details). For the example in Fig. 3, for Power, we need to place a cumulative fence between the two writes on T0, preventing both the po and the adjacent rf edge from being relaxed, and use a dependency or fence to prevent the po edge on T1 from being relaxed.
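The three conditions can be checked mechanically on a given cycle. The sketch below does so for the cycle of Fig. 3 under a deliberately simplified notion of safety: SC relaxes nothing, TSO relaxes only write-read po pairs and rf edges inside one thread, and Power relaxes every po and rf edge when no fence or dependency is present. These simplifications, the data layout and the function names are assumptions for illustration, not the definitions used by musketeer.

```python
# Events of Fig. 3: name -> (direction, location, thread).
EVENTS = {
    "a": ("W", "x", 0),
    "b": ("W", "y", 0),
    "c": ("R", "y", 1),
    "d": ("R", "x", 1),
}
# The cycle of Fig. 3 as (source, relation, target) edges.
MP_CYCLE = [("a", "po", "b"), ("b", "rf", "c"), ("c", "po", "d"), ("d", "fr", "a")]


def is_delay(edge, arch):
    """True if the edge is a po or rf edge relaxed on `arch` (simplified)."""
    src, rel, dst = edge
    if rel not in ("po", "rf"):
        return False                     # co and fr are always safe
    if arch == "SC":
        return False                     # SC relaxes nothing
    if arch == "TSO":
        if rel == "po":                  # only write-read pairs are reordered
            return EVENTS[src][0] == "W" and EVENTS[dst][0] == "R"
        # Assumption: store buffering only relaxes rf within one thread.
        return EVENTS[src][2] == EVENTS[dst][2]
    if arch == "Power":
        return True                      # assumption: no fences, no dependencies
    raise ValueError(f"unknown architecture: {arch}")


def delays(cycle, arch):
    """Return the edges of the cycle that are delays on `arch`."""
    return [e for e in cycle if is_delay(e, arch)]


def is_critical(cycle, arch):
    """Check conditions (1) to (3) of the definition on a given cycle."""
    if not delays(cycle, arch):
        return False                     # (1) at least one delay
    nodes = {n for src, _, dst in cycle for n in (src, dst)}
    locs_per_thread, threads_per_loc = {}, {}
    for n in nodes:
        _, loc, thread = EVENTS[n]
        locs_per_thread.setdefault(thread, []).append(loc)
        threads_per_loc.setdefault(loc, []).append(thread)
    for locs in locs_per_thread.values():
        # (2) at most two accesses per thread, to distinct locations
        if len(locs) > 2 or len(set(locs)) != len(locs):
            return False
    for threads in threads_per_loc.values():
        # (3) at most three accesses per location, from distinct threads
        if len(threads) > 3 or len(set(threads)) != len(threads):
            return False
    return True
```

Under these assumptions the cycle has three delays on Power (both po edges and the rf edge) and none on TSO, in line with the discussion of Fig. 3 above.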
We want to synthesise fences to prevent weak behaviours and thus restore SC. We explained in Sec. 4 that we should place fences along the critical cycles of the program executions. To find the critical cycles, we look for cycles in an over-approximation of all the executions of the program. We hence avoid enumeration of all traces, which would hinder scalability, and get all the critical cycles of all program executions at once. Thus we can find all fences preventing the critical cycles corresponding to two executions in one step, instead of examining the two executions separately. To analyse a C program, e.g. on the left-hand side of Fig. 5, we convert it to a goto-program (right-hand side of Fig. 5), the internal representation of the CProver framework; we refer to http://www.cprover.org/goto-cc for details. The pointer analysis we use is a standard concurrent points-to analysis that we have shown to be sound for our weak memory models in earlier work [7]. A full explanation of how we handle pointers is available in [8]. The C program in Fig. 5 features two threads which can interfere. The first thread writes the argument “input” to x, then randomly writes 1 to y or reads z, and then writes 1 to x. The second thread successively reads y, z and x.
    #include <stdlib.h>

    int x, y, z;  /* shared variables (declarations assumed) */

    void thread_1(int input)
    {
      int r1;
      x = input;
      if (rand() % 2)
        y = 1;
      else
        r1 = z;
      x = 1;
    }

    void thread_2()
    {
      int r2, r3, r4;
      r2 = y;
      r3 = z;
      r4 = x;
    }

    thread_1
       int r1;
       x = input;
       _Bool tmp;
       tmp = rand();
       [!tmp%2] goto 1;
       y = 1;
       goto 2;
    1: r1 = z;
    2: x = 1;
       end_function

    thread_2
       int r2, r3, r4;
       r2 = y;
       r3 = z;
       r4 = x;
       end_function
Fig. 5. A C program (left) and its goto-program (right)
In the corresponding goto-program, the if-else structure has been transformed into a guard with the condition of the if followed by a goto construct. From the goto-program, we then compute an abstract event graph (aeg), shown in Fig. 6(a). The events a, b1, b2 and c (resp. d, e and f) correspond to thread_1 (resp. thread_2) in Fig. 5. We only consider accesses to shared variables, and ignore the local variables. We finally explore the aeg to find the potential critical cycles. An aeg represents all the executions of a program (in the sense of Sec. 4). Fig. 6(b) and (c) give two executions associated with the aeg shown in Fig. 6(a). For readability, the transitive po edges have been omitted (e.g. between the two events d′ and f′). The concrete events that occur in an execution are shown in bold. In an aeg, the events do not have concrete values, whereas in an execution they do. Also, an aeg merely indicates that two accesses to the same variable could form a data race (see the competing pairs (cmp) relation in Fig. 6(a), which is a symmetric relation), whereas an execution has oriented relations (e.g. indicating the write that a read takes its value from, see e.g. the rf arrow in Fig. 6(b) and (c)). The execution in Fig. 6(b) has a critical cycle (with respect to e.g. Power) between the events a′, b′2, d′, and f′. The execution in Fig. 6(c) does not have a critical cycle. Full details of the construction of the aegs from goto-programs, including a semantics of goto-programs in terms of abstract events, are available in the full version of this paper [8]. Function calls are inlined for better precision. Currently, the implementation does not handle recursion.
[Figure: (a) the aeg of Fig. 5, with abstract events (a)Wx, (b1)Wy, (b2)Rz, (c)Wx for the first thread and (d)Ry, (e)Rz, (f)Rx for the second, linked by pos and cmp edges; (b) an execution with a critical cycle, over events a′ to f′ with po, rf, fr and co edges; (c) an execution without a critical cycle, over events a′′ to f′′ with po, rf and co edges.]
Fig. 6. The aeg of Fig. 5 and two executions corresponding to it
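To make the cmp relation of the aeg concrete, the sketch below recomputes the competing pairs for the abstract events of Fig. 6(a). It assumes the usual data-race criterion (two accesses from different threads to the same shared variable, at least one of them a write); the dictionary layout and the function name are assumptions for this example, not musketeer's internal representation.

```python
from itertools import combinations

# Abstract events of Fig. 6(a): name -> (thread, direction, shared variable).
AEG_EVENTS = {
    "a": (1, "W", "x"),
    "b1": (1, "W", "y"),
    "b2": (1, "R", "z"),
    "c": (1, "W", "x"),
    "d": (2, "R", "y"),
    "e": (2, "R", "z"),
    "f": (2, "R", "x"),
}


def competing_pairs(events):
    """Return unordered pairs of events that could form a data race.

    Assumed criterion: different threads, same variable, at least one write.
    Each pair is reported once, as a tuple in sorted name order, because
    cmp is a symmetric relation.
    """
    pairs = set()
    for m, n in combinations(sorted(events), 2):
        thread_m, dir_m, var_m = events[m]
        thread_n, dir_n, var_n = events[n]
        if thread_m != thread_n and var_m == var_n and "W" in (dir_m, dir_n):
            pairs.add((m, n))
    return pairs


if __name__ == "__main__":
    print(sorted(competing_pairs(AEG_EVENTS)))
```

The two reads of z, (b2) and (e), do not compete with each other, so only the pairs on x and y remain.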
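[Figure: an aeg with five threads, {(a)Wt, (b)Wy}, {(c)Rz, (d)Wx}, {(e)Rx, (f)Ry, (g)Wz, (h)Rt}, {(i)Rz, (j)Wy} and {(k)Wt, (l)Rz}, linked by pos edges, with four potential critical cycles (cycle 1 to cycle 4) and candidate dp, lwf and f placements.]

\[
\begin{aligned}
\min\quad & \mathsf{dp}_{(e,g)} + \mathsf{dp}_{(f,h)} + \mathsf{dp}_{(f,g)} + 3 \cdot (\mathsf{f}_{(e,f)} + \mathsf{f}_{(f,g)} + \mathsf{f}_{(g,h)}) + 2 \cdot (\mathsf{lwf}_{(e,f)} + \mathsf{lwf}_{(f,g)} + \mathsf{lwf}_{(g,h)}) \\
\text{s.t.}\quad & \text{cycle 1, delay } (e,g):\ \mathsf{dp}_{(e,g)} + \mathsf{f}_{(e,f)} + \mathsf{f}_{(f,g)} + \mathsf{lwf}_{(e,f)} + \mathsf{lwf}_{(f,g)} \geq 1 \\
& \text{cycle 2, delay } (f,g):\ \mathsf{dp}_{(f,g)} + \mathsf{f}_{(f,g)} + \mathsf{lwf}_{(f,g)} \geq 1 \\
& \text{cycle 3, delay } (f,h):\ \mathsf{dp}_{(f,h)} + \mathsf{f}_{(f,g)} + \mathsf{f}_{(g,h)} + \mathsf{lwf}_{(f,g)} + \mathsf{lwf}_{(g,h)} \geq 1 \\
& \text{cycle 4, delay } (g,h):\ \mathsf{f}_{(g,h)} \geq 1
\end{aligned}
\]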
Fig. 7. Example of resolution with between
In Fig. 7, we have an aeg with five threads: {a, b}, {c, d}, {e, f, g, h}, {i, j} and {k, l}. Each node is an abstract event computed as in the previous section. The dashed edges represent the pos between abstract events in the same thread. The full lines represent the edges involved in a cycle. Thus the aeg of Fig. 7 has four potential critical cycles. We derive the set of constraints in a process we define later in this section. We now have a set of cycles to forbid by placing fences. Moreover, we want to optimise the placement of the fences.
Challenges If there is only one type of fence (as in TSO, which only features mfence), optimising only consists of placing a minimal amount of fences to forbid as many cycles as possible. For example, placing a full fence sync between f and g in Fig. 7 might forbid cycles 1, 2 and 3 under Power, whereas placing it somewhere else might forbid at best two amongst them. Since we handle several types of fences for a given architecture (e.g. dependencies, lwsync and sync on Power), we can also assign some cost to each of them. For example, following the folklore, a dependency is less costly than an lwsync, which is itself less costly than a sync. Given these costs, one might want to minimise their sum along different executions: to forbid cycles 1, 2 and 3 in Fig. 7, a single lwsync between f and g can be cheaper at runtime than three dependencies respectively between e and g, f and g, and f and h. However, if we had only cycles 1 and 2, the dependencies would be cheaper. We see that we have to optimise both the placement and the type of fences at the same time. We model our problem as an integer linear program (ILP) (see Fig. 8), which we explain in this section. Solving our ILP gives us a set of fences to insert to forbid the cycles. This set of fences is optimal in that it minimises the cost function. More precisely, the constraints are the cycles to forbid, each variable represents a fence to insert, and the cost function sums the cost of all fences.

Input: aeg (Es, pos, cmp) and potential critical cycles C = {C1, ..., Cn}
Problem: minimise
\[
\sum_{(l,t) \in \text{potential-places}(C)} t_l \times \text{cost}(t)
\]
Constraints: for all d ∈ delays(C)
(* for TSO, PSO, RMO, Power *)
\[
\begin{aligned}
&\text{if } d \in \mathsf{poWR} \text{ then } \sum_{e \in \text{between}(d)} \mathsf{f}_e \geq 1 \\
&\text{if } d \in \mathsf{poWW} \text{ then } \sum_{e \in \text{between}(d)} (\mathsf{f}_e + \mathsf{lwf}_e) \geq 1 \\
&\text{if } d \in \mathsf{poRW} \text{ then } \mathsf{dp}_d + \sum_{e \in \text{between}(d)} (\mathsf{f}_e + \mathsf{lwf}_e) \geq 1 \\
&\text{if } d \in \mathsf{poRR} \text{ then } \mathsf{dp}_d + \sum_{e \in \text{between}(d)} (\mathsf{f}_e + \mathsf{lwf}_e) + \sum_{e \in \text{ctrl}(d)} \mathsf{cf}_e \geq 1
\end{aligned}
\]
(* for Power *)
\[
\text{if } d \in \mathsf{cmp} \text{ then } \sum_{e \in \text{cumul}(d)} \mathsf{f}_e + \sum_{e \in \text{cumul}(d) \cap \neg\mathsf{poWR} \cap \neg\mathsf{poRW}} \mathsf{lwf}_e \geq 1
\]
Output: the set actual-places(C) of pairs (l, t) s.t. t_l is set to 1 in the ILP solution

Fig. 8. ILP for inferring fence placements
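The instance given with Fig. 7 is small enough to solve by exhaustive search, which makes the trade-off between placement and fence type visible. The sketch below enumerates all 0/1 assignments of its nine variables, using the coefficients of the objective in Fig. 7 as costs (1 for dp, 2 for lwf, 3 for f). It is a brute-force illustration only: musketeer hands the ILP to a solver, and the variable encoding and function name here are assumptions.

```python
from itertools import product

# Costs read off the objective of Fig. 7.
COST = {"dp": 1, "lwf": 2, "f": 3}

# One 0/1 variable per (fence type, location) pair of Fig. 7.
VARS = [
    ("dp", "e,g"), ("dp", "f,h"), ("dp", "f,g"),
    ("f", "e,f"), ("f", "f,g"), ("f", "g,h"),
    ("lwf", "e,f"), ("lwf", "f,g"), ("lwf", "g,h"),
]

# Each constraint lists the variables whose sum must be at least 1.
CONSTRAINTS = {
    "cycle 1, delay (e,g)": [("dp", "e,g"), ("f", "e,f"), ("f", "f,g"),
                             ("lwf", "e,f"), ("lwf", "f,g")],
    "cycle 2, delay (f,g)": [("dp", "f,g"), ("f", "f,g"), ("lwf", "f,g")],
    "cycle 3, delay (f,h)": [("dp", "f,h"), ("f", "f,g"), ("f", "g,h"),
                             ("lwf", "f,g"), ("lwf", "g,h")],
    "cycle 4, delay (g,h)": [("f", "g,h")],
}


def solve_by_enumeration():
    """Return (minimal cost, list of all optimal variable sets)."""
    best_cost, best_sets = None, []
    for bits in product((0, 1), repeat=len(VARS)):
        chosen = frozenset(v for v, bit in zip(VARS, bits) if bit)
        feasible = all(any(v in chosen for v in row)
                       for row in CONSTRAINTS.values())
        if not feasible:
            continue
        cost = sum(COST[kind] for kind, _ in chosen)
        if best_cost is None or cost < best_cost:
            best_cost, best_sets = cost, [chosen]
        elif cost == best_cost:
            best_sets.append(chosen)
    return best_cost, best_sets


if __name__ == "__main__":
    cost, optima = solve_by_enumeration()
    print(cost)
    for placement in optima:
        print(sorted(placement))
```

With these costs the full fence on (g, h) is forced by cycle 4 and also covers cycle 3; cycles 1 and 2 are then covered equally cheaply by one lwf on (f, g) or by two dependencies, so the instance has two optimal placements.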
6.1 Cost function of the ILP
We handle several types of fences: full (f), lightweight (lwf), control fences (cf), and dependencies (dp). On Power, the full fence is sync, the lightweight one lwsync. We write T for the set {dp, f, cf, lwf}. We assume that each type of fence has an a priori cost (e.g. a dependency is cheaper than a full fence), regardless of its location in the code. We write cost(t) for t ∈ T for this cost. We take as input the aeg of our program and the potential critical cycles to fence. We define two sets of pairs (l, t) where l is a pos edge of the aeg and t a type of fence. We introduce an ILP variable t_l (in {0, 1}) for each pair (l, t). The set potential-places is the set of such pairs that can be inserted into the program to forbid the cycles. The set actual-places is the set of such pairs that have been set to 1 by our ILP. We output this set, as it represents the locations in the code in need of a fence and the type of fence to insert for each of them. We also output the total cost of all these insertions, i.e.
\[
\sum_{(l,t) \in \text{potential-places}(C)} t_l \times \text{cost}(t).
\]
The solver should minimise this sum whilst satisfying the constraints.
6.2 Constraints in the ILP
We want to forbid all the cycles in the set that we are given after filtering, as explained in the preamble of this section. This requires placing an appropriate fence on each delay for each cycle in this set. Different delay pairs might need different fences, depending e.g. on the directions (write or read) of their extremities. Essentially, we follow the table in Fig. 4. For example, a write-read pair needs a full fence (e.g. mfence on x86, or sync on Power). A read-read pair can use anything amongst dependencies and fences. Our constraints ensure that we use the right type of fence for each delay pair.
Inequalities as constraints We first assume that all the program order delays are in pos and we ignore Power and ARM special features (dependencies, control fences and
second component is a read. Thus, we add cf_e as a possible variable to the constraint for read-read pairs (see poRR case in Fig. 8, where ctrl(d) = between(d) ∩ poC).
Cumulativity For architectures like Power, where stores are non-atomic, we need to look for program order pairs that are connected to an external read-from (e.g. (c, d) in Fig. 3 has an rf connected to it via event c). In such cases, we need to use a cumulative fence, e.g. lwsync or sync, and not, for example, a dependency. The locations to consider in such cases are: before (in pos) the write w of the rfe, or after (in pos) the read r of the rfe, i.e.
\[
\text{cumul}(w, r) = \{(e_1, e_2) \mid (e_1, e_2) \in \mathsf{po}_s \wedge ((e_2, w) \in \mathsf{po}_s^{*} \vee (r, e_1) \in \mathsf{po}_s^{*})\}.
\]
In Fig. 7 (cycle 2), (g, i) over-approximates an rfe edge, and the edges where we can insert fences are in cumul(g, i) = {(f, g), (i, j)}. We need a cumulative fence as soon as there is a potential rfe, even if the adjacent pos pairs do not form a delay. For example in Fig. 3, suppose there is a dependency between the reads on T1, and a fence maintaining write-write pairs on T0. In that case we need to place a cumulative fence to fix the rfe, even if the two pos pairs are themselves fixed. Thus, we quantify over all pos pairs when we need to place cumulative fences. As only f and lwf are cumulative, we have
\[
\text{potential-places}(C) \triangleq \{(l, t) \mid (t \in \{\mathsf{dp}\} \wedge l \in \text{delays}(C)) \vee (t \in T \setminus \{\mathsf{dp}\} \wedge l \in \bigcup_{d \in \text{delays}(C)} \text{between}(d)) \vee (t \in \{\mathsf{f}, \mathsf{lwf}\} \wedge l \in \mathsf{po}_s(C))\}.
\]
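[Figure: an aeg with a left thread holding (a)Wx and (b)Ry, separated by a full fence f, and a right thread in which (c)Wy reaches (g)Rx along three pos paths, through (d), (e) and (f) respectively; cmp edges link the two threads.]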
Fig. 9. Cycles sharing the edge (a, b)
Comparison with trencher We illustrate the difference between trencher [10] and our approach using Fig. 9. There are three cycles that share the edge (a, b). They differ in the path taken between nodes c and g. Suppose that the user has inserted a full fence between a and b. To forbid the three cycles, we need to fence the thread on the right. The trencher algorithm first calculates which pairs can be reordered: in our example, these are (c, g) via d, (c, g) via e and (c, g) via f. It then determines at which locations a fence could be placed. In our example, there are 6 options: (c, d), (d, g), (c, e), (e, g), (c, f), and (f, g). The encoding thus uses 6 variables for the fence locations. The algorithm then gathers all the irreducible sets of locations to be fenced to forbid the delay between c and g, where “irreducible” means that removing any of the fences would prevent this set from fully fixing the delay. As all the paths that connect c and g have to be covered, trencher needs to collect all the combinations of one fence per path. There are 2 locations per path, leading to 2^3 sets. Consequently, as stated in [10], trencher needs to construct an exponential number of sets. Each set is encoded in the ILP with one variable. For this example, trencher thus uses 6 + 8 variables. It also generates one constraint per delay (here, 1) to force the solver to pick a set, and 8 constraints to enforce that all the location variables are set to 1 if the set containing these locations is picked. By contrast, musketeer only needs 6 variables: the possible locations for fences. We detect three cycles, and generate only three constraints to fix the delay. Thus, on a parametric version of the example, trencher’s ILP grows exponentially whereas musketeer’s is linear-sized.
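The counting argument can be replayed on a parametric version of Fig. 9 with n paths between c and g. The sketch below only mirrors the description given in this comparison (one fence per path for the irreducible sets, one variable per location and per set); it does not reproduce either tool's implementation, and the intermediate node names m0, m1, ... and the function names are hypothetical.

```python
from itertools import product


def irreducible_sets(paths):
    """All choices of exactly one fence location per path."""
    return {frozenset(choice) for choice in product(*paths)}


def encoding_sizes(n_paths):
    """Count ILP variables and constraints for n paths between c and g."""
    # Each path c -> m_i -> g offers two fence locations (hypothetical names).
    paths = [[("c", f"m{i}"), (f"m{i}", "g")] for i in range(n_paths)]
    locations = {loc for path in paths for loc in path}
    sets = irreducible_sets(paths)
    return {
        # One variable per location plus one per irreducible set.
        "trencher_vars": len(locations) + len(sets),
        # One constraint for the delay plus one per set.
        "trencher_constraints": 1 + len(sets),
        # One variable per location, one constraint per detected cycle.
        "musketeer_vars": len(locations),
        "musketeer_constraints": n_paths,
    }


if __name__ == "__main__":
    for n in (3, 5, 10):
        print(n, encoding_sizes(n))
```

For n = 3 this gives the 6 + 8 variables and 1 + 8 constraints quoted for trencher, against 6 variables and 3 constraints for musketeer; each additional path doubles the number of sets while adding only two locations.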
            CLASSIC                                      FAST
            Dek      Pet      Lam      Szy      Par      Cil      CL       Fif      Lif      Anc      Har
LoC         50       37       72       54       96       97       111      150      152      188      179
dfence      – –      – –      – –      – –      – –      7.8 3    6.2 3    ∼ 0      ∼ 0      ∼ 0      ∼ 0
memorax     0.4 2    1.4 2    79.1 4   – –      – –      – –      – –      – –      – –      – –      – –
musketeer   0.0 5    0.0 3    0.0 8    0.0 8    0.0 3    0.0 3    0.0 1    0.1 1    0.0 1    0.1 1    0.6 4
offence     0.0 2    0.0 2    0.0 8    0.0 8    – –      – –      – –      – –      – –      – –      – –
pensieve    0.0 16   0.0 6    0.0 24   0.0 22   0.0 7    0.0 14   0.0 8    0.1 33   0.0 29   0.0 44   0.1 72
remmex      0.5 2    0.5 2    2.0 4    1.8 5    – –      – –      – –      – –      – –      – –      – –
trencher    1.6 2    1.3 2    1.7 4    – –      0.5 1    8.6 3    – –      – –      – –      – –      – –
Each cell gives the time in seconds followed by the number of fences inserted.
Fig. 10. All tools on the CLASSIC and FAST series for TSO
We implemented our new method, in addition to all the methods described in Sec. 2, in our tool musketeer, using glpk (http://www.gnu.org/software/glpk) as the ILP solver. We compare these methods to the existing tools listed in Sec. 3. Our tool analyses C programs. dfence also handles C code, but requires some high-level specification for each program, which was not available to us. memorax works on a process-based language that is specific to the tool. offence works on a subset of assembler for x86, ARM and Power. pensieve originally handled Java, but we did not have access to it and have therefore re-implemented the method. remmex handles Promela-like programs. trencher analyses transition systems. Most of the tools come with some of the benchmarks in their own languages; however, not all benchmarks were available for each tool. We have re-implemented some of the benchmarks for offence. We now detail our experiments. CLASSIC and FAST gather examples from the literature and related work. The DEBIAN benchmarks are packages of Debian Linux 7.1. CLASSIC and FAST were run on an x86-64 Intel Core2 Quad Q9550 machine with 4 cores (2.83 GHz) and 4 GB of RAM. DEBIAN was run on an x86-64 Intel Core i5- machine with 4 cores (3.40 GHz) and 4 GB of RAM.
CLASSIC consists of Dekker’s mutex (Dek) [14]; Peterson’s mutex (Pet) [29]; Lamport’s fast mutex (Lam) [21]; Szymanski’s mutex (Szy) [33]; and Parker’s bug (Par) [13]. We ran all tools in this series for TSO (the model common to all). For each example, Fig. 10 gives the number of fences inserted, and the time (in sec) needed. When an example is not available in the input language of a tool, we write “–”. The first four tools place fences to enforce stability/robustness [5,9]; the last three to satisfy a given safety property. We used memorax with the option -o1, to compute one maximal permissive set and not all. For remmex on Szymanski, we give the number of fences found by default (which may be non-optimal). Its “maximal permissive” option lowers the number to 2, at the cost of a slow enumeration. As expected, musketeer is less precise than most tools, but outperforms all of them.
FAST gathers Cil, Cilk 5 Work Stealing Queue (WSQ) [16]; CL, Chase-Lev WSQ [11]; Fif, Michael et al.’s FIFO WSQ [26]; Lif, Michael et al.’s LIFO WSQ [26]; Anc, Michael et al.’s Anchor WSQ [26]; Har, Harris’ set [12]. For each example and tool,