Weak Atomicity Under the x86 Memory Consistency Model

Amitabha Roy, University of Cambridge, [email protected]
Steven Hand, University of Cambridge, [email protected]
Tim Harris, Microsoft Research, Cambridge, [email protected]

Abstract

We consider the problem of building a weakly atomic Software Transactional Memory (STM) that provides Single (Global) Lock Atomicity (SLA) while adhering to the x86 memory consistency model (x86-MM).

Categories and Subject Descriptors: D.1.3 [Software]: Programming Techniques, Concurrent Programming

General Terms: Algorithms, Theory, Performance

Keywords: Software Transactional Memory, x86 Memory Model

1. Introduction

Single Lock Atomicity semantics require that “atomic” blocks behave as if they all acquire a single process-wide lock at the beginning and release it at the end. There has been considerable work on providing SLA for the Java memory model [3] and some on C/C++ [6]. In contrast to SLA work on these language-level memory models, there has been no work on providing SLA at the level of a processor memory consistency model. The simple example below illustrates how greatly the memory models differ.

    // Thread 1        // Thread 2
    atomic {           t1 = Y;
      X = 10;          t2 = X;
      Y = 10;
    }

    C++:  Catch fire due to data race, any result allowed
    Java: Intra-thread reordering allowed
    x86:  No intra-thread reordering

In the example, the result t1 == 10 and t2 == 0 is allowed by the C++ and Java memory models but forbidden by the x86 one. An STM that implements SLA under the x86 memory model must ensure that the forbidden result does not occur. A practical application of such an STM would be as the last stage in compilation (after machine code is generated) or through a dynamic binary rewriting engine [4]. Without such an STM, work that needs one suffers from caveats in general applicability and requires the user to be aware of the internals of the underlying STM implementation.

We have designed an STM algorithm that provides SLA and x86-MM to the most general class of programs possible. We exclude only programs with Transactional Reads Unprotected Writes (TRUW) races, which are races between a read in a transaction and a write outside any transaction. Crucially, however, we use weak atomicity and ensure that ad-hoc synchronisation [2] done by the program is preserved.

Figure 1. TSO state machine with per processor private memory

    int X = 0, Y = 0, Z = 0;
    bool done1 = false, done2 = false;

    // Thread 1        // Thread 2        // Thread 3    // Thread 4
    atomic {           atomic {           Y = 300;       t1 = done1;
      X = 100;           if(Y == 0)       X = 300;       t2 = done2;
      done1 = true;        Z = 1;
    }                    done2 = true;
                       }

    // Not allowed by x86-MM and SLA:
    // X == 100 and t1 == true and t2 == false and Z == 1

Figure 2. Loads must be ordered across atomic blocks
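The claim about the introductory example can be checked by brute force. The sketch below is a minimal model, not the authors' tooling: it assumes Thread 1's stores pass through a single write buffer, that Thread 2 reads shared memory directly, and that the lock around the only atomic block is uncontended and can be left out. The function name `outcomes` and the `fifo` switch are hypothetical. With `fifo=True` the buffer drains in order, as on x86; with `fifo=False` buffered stores may drain in any order, which stands in for the intra-thread reordering that language-level models permit.

```python
def outcomes(fifo=True):
    """Enumerate every final (t1, t2) of the introductory example by exhaustive search."""
    writer = (("X", 10), ("Y", 10))   # Thread 1: stores issued in program order
    reader = ("Y", "X")               # Thread 2: t1 = Y; t2 = X
    results, seen = set(), set()

    def explore(pc, buf, mem, regs):
        state = (pc, buf, mem, regs)
        if state in seen:
            return
        seen.add(state)
        if pc == len(writer) and not buf and len(regs) == len(reader):
            results.add(regs)
            return
        if pc < len(writer):
            # Thread 1 issues its next store into its write buffer.
            explore(pc + 1, buf + (writer[pc],), mem, regs)
        # One buffered store drains to memory: only the oldest under FIFO,
        # any entry in the relaxed variant.
        drainable = range(len(buf)) if not fifo else range(min(1, len(buf)))
        for i in drainable:
            addr, val = buf[i]
            new_mem = tuple(sorted({**dict(mem), addr: val}.items()))
            explore(pc, buf[:i] + buf[i + 1:], new_mem, regs)
        if len(regs) < len(reader):
            # Thread 2 performs its next load from shared memory.
            explore(pc, buf, mem, regs + (dict(mem).get(reader[len(regs)], 0),))

    explore(0, (), (), ())
    return results
```

Under these assumptions, `outcomes()` never contains `(10, 0)`, whereas `outcomes(fifo=False)` does, which mirrors the x86 versus Java distinction drawn in the example.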

2. Memory Consistency

We use the x86-TSO memory consistency model of Owens et al. [5] as our reference (Figure 1). It models each x86 processor as a sequential machine, except for a FIFO write buffer that delays the visibility of its stores to other processors. Each processor can acquire a lock to gain exclusive access to memory (modelling locked instructions). Interestingly, under the write-back memory type in x86, load fences and store fences become no-ops (since loads and stores are already ordered). On the other hand, locked instructions and memory fences (mfence) need to flush the write buffer on the executing processor.
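The abstract machine described above is small enough to write down directly. The following is an illustrative sketch only: the class `TSOMachine` and its method names are hypothetical, memory locations default to 0, and the global lock that models locked instructions is deliberately left out. It shows the two behaviours that matter for the rest of the argument: a store is visible to its own processor immediately but to others only after it drains, and `mfence` drains the whole buffer.

```python
from collections import deque


class TSOMachine:
    """Toy TSO machine: shared memory plus one FIFO write buffer per processor."""

    def __init__(self, n_procs):
        self.memory = {}
        self.buffers = [deque() for _ in range(n_procs)]

    def store(self, p, addr, value):
        # A store enters the local write buffer; other processors cannot see it yet.
        self.buffers[p].append((addr, value))

    def load(self, p, addr):
        # A load returns the newest buffered store to addr by the same processor,
        # and falls back to shared memory otherwise.
        for a, v in reversed(self.buffers[p]):
            if a == addr:
                return v
        return self.memory.get(addr, 0)

    def flush_one(self, p):
        # The oldest buffered store of processor p becomes globally visible.
        if self.buffers[p]:
            addr, value = self.buffers[p].popleft()
            self.memory[addr] = value

    def mfence(self, p):
        # A memory fence drains the executing processor's write buffer.
        while self.buffers[p]:
            self.flush_one(p)


def store_buffering(fenced):
    """Each processor stores to one location, then loads the other."""
    m = TSOMachine(2)
    m.store(0, "X", 1)
    m.store(1, "Y", 1)
    if fenced:
        m.mfence(0)
        m.mfence(1)
    return m.load(0, "Y"), m.load(1, "X")
```

In this schedule, `store_buffering(False)` lets both processors read 0 because both stores are still buffered, while `store_buffering(True)` forces both stores to memory before either load.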

Copyright is held by the author/owner(s). PPoPP’11, February 12–16, 2011, San Antonio, Texas, USA. ACM 978-1-4503-0119-0/11/02.

3. x86-MM + SLA = Impossible

We show by example that any STM that aims to provide x86-MM and SLA for transactions in any program must execute transactions serially (and hence cannot scale). Consider the program fragments in Figure 2. When executing on the state machine of Figure 1, we reason as follows. Thread 4 is the ‘witness’ to Thread 1 acquiring the global lock before Thread 2. Since the final value of X is 100, Thread 3 must have flushed its write to X from its write buffer to main memory before Thread 1 did. This also means that Thread 3 must have flushed its write to Y before Thread 1 flushed its write to X. Thread 2 cannot acquire the lock until Thread 1 has flushed its write buffer (in order for the unlock to be visible to it), and hence when Thread 2 acquires the lock the writes by Thread 3 are already in memory. Hence Thread 2 cannot read Y == 0 and thus cannot set Z = 1.

Finally, note that the program fragment is also an example of a TRUW race, between the store to Y on Thread 3 and the load from Y on Thread 2. A weakly atomic STM will not see any conflicts from Thread 3 or Thread 4, which means that it cannot detect the departure from x86-MM by depending solely on conflict detection. Instead it must always serialise the loads in a transaction after all stores in a previous transaction.

    int X = 0, Y = 0, Z = 0, W = 0;

    // Thread 1        // Thread 2        // Thread 3
    atomic {           atomic {           t1 = X;
      X = 100;           Z = 100;         t2 = Z;
      Y = 100;           W = 100;         t3 = W;
    }                  }                  t4 = Y;

    // Not allowed by x86-MM and SLA:
    // t1 == 100 and t2 == 0 and t3 == 100 and t4 == 0

Figure 3. Stores must be ordered across atomic blocks

In Figure 3, on the other hand, stores from the transaction in Thread 2 must be ordered after all stores in the transaction on Thread 1. Interleaving them leads to the disallowed result. Interleaving does not, however, cause any conflicts, and hence a weakly atomic STM is forced to serialise all stores in a transaction after all stores in a previous transaction.

One also needs to consider the fact that buffering of writes (lazy update) is essential for the STM not to expose speculative writes to non-transactional reads. This means reads must always precede and complete before transaction linearisation, followed by writeback. Coupled with the intra-thread ordering requirements illustrated by the two examples above, one concludes that the only way to preserve SLA under x86-MM for all programs is to ensure that transactions execute all their operations serially with no overlap. It is thus not possible to build a weakly atomic STM with any kind of scalability optimisation while still providing SLA on x86-MM for all programs. This is in contrast to work that strives to provide SLA in Java [3], because the language-level memory models are more permissive. For example, the language-level memory models for C++ and Java allow the results in the examples by virtue of allowing reordering of operations on the same thread (within the transaction).
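The Figure 3 argument can also be checked by enumeration. The sketch below is a simplified model under stated assumptions: Thread 1's transaction stores X then Y, Thread 2's stores Z then W, Thread 3 loads X, Z, W, Y in that order, and each write-back store reaches memory in its thread's program order (as TSO guarantees). The helper names `merges` and `observations` are hypothetical. It compares write-backs that run one whole transaction after the other against write-backs that interleave.

```python
from itertools import combinations

T1 = (("X", 100), ("Y", 100))    # stores of the transaction on Thread 1
T2 = (("Z", 100), ("W", 100))    # stores of the transaction on Thread 2
READS = ("X", "Z", "W", "Y")     # Thread 3: t1 = X; t2 = Z; t3 = W; t4 = Y
FORBIDDEN = (100, 0, 100, 0)


def merges(a, b):
    """All interleavings of sequences a and b that keep each one's own order."""
    n = len(a) + len(b)
    for pos in combinations(range(n), len(a)):
        ia, ib = iter(a), iter(b)
        yield tuple(next(ia) if i in pos else next(ib) for i in range(n))


def observations(writebacks):
    """Every (t1, t2, t3, t4) Thread 3 can observe for the given store orders."""
    seen = set()
    loads = tuple(("load", addr) for addr in READS)
    for wb in writebacks:
        stores = tuple(("store",) + s for s in wb)
        for schedule in merges(stores, loads):
            mem, regs = {}, []
            for step in schedule:
                if step[0] == "store":
                    mem[step[1]] = step[2]
                else:
                    regs.append(mem.get(step[1], 0))
            seen.add(tuple(regs))
    return seen


serial = [T1 + T2, T2 + T1]          # one transaction's stores entirely before the other's
interleaved = list(merges(T1, T2))   # stores of the two transactions may interleave
```

In this model `FORBIDDEN` is absent from `observations(serial)` and present in `observations(interleaved)`, for example via the store order X, Z, W, Y, even though the two transactions touch disjoint locations and so never conflict.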

4. x86-MM + SLA − TRUW Races = Possible

We have designed and implemented an STM (without complete serialisation) that provides SLA and x86-MM to all programs excluding those with TRUW races. The salient points of the algorithm are listed below.

4.1 Speculation Phase

During the speculation phase we depend on STM read and write barriers to log all loads and stores. We use as a basis a lazy weakly atomic STM similar to TL2 [1]. However, we log every read and write into software read and write buffers with no merging of adjacent reads and writes (unlike STMs that use larger granularities such as a cache line). This is critical to preserving the x86-MM. Further, we handle accesses of any size, including overlapping accesses. We fall back to irrevocability on encountering any memory access originating from a locked instruction or on encountering an mfence. Both of these instructions require flushing the write buffer and hence would lead to a departure from the x86-MM when executing with the STM.

4.2 Commit Phase

We use a two-phase commit. The first phase of the commit acquires metadata locks on modified locations. The second phase verifies every read from the read buffer in addition to checking STM metadata. At this point the transaction succeeds and the write logs are played back into shared memory. We introduce additional synchronisation between threads in the commit phase. Each commit acquires a unique commit ticket from a global counter. Threads with a later commit ticket must execute their write-back phase after threads with an earlier commit ticket have finished their write-back phase. The read checks are performed in parallel; this departure from total serialisation costs us the capability to handle TRUW races. The commit phase includes a dynamic race detector (false negatives but no false positives) for the TRUW races we cannot handle, at no additional performance penalty.

4.3 Speculation Safety

Our focus for the STM has been safety. In addition to preserving the x86 memory consistency model for any program not including a TRUW race, we also provide the same guarantee to speculating threads. A speculating thread is not allowed to execute with a read set that it could not have seen when executing with SLA. Further, since we buffer writes, no uncommitted values are allowed to leak out of speculating transactions.
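The commit-ticket ordering of Section 4.2 can be sketched as follows. This is a minimal illustration, not the implementation described in the paper: the class `TicketedCommit`, its fields, and the use of a Python condition variable are all assumptions, and metadata locking and read validation are omitted. It shows only the ordering rule, namely that a write-back may start once every earlier ticket has finished its own write-back. The sketch also assumes that every transaction that takes a ticket eventually calls `write_back` (possibly with an empty log), since otherwise later tickets would wait forever.

```python
import threading


class TicketedCommit:
    def __init__(self):
        self._cond = threading.Condition()
        self._next_ticket = 0    # global counter handing out commit tickets
        self._now_serving = 0    # ticket whose write-back may run next
        self.completed = []      # tickets in the order their write-back finished

    def take_ticket(self):
        with self._cond:
            ticket = self._next_ticket
            self._next_ticket += 1
            return ticket

    def write_back(self, ticket, write_log, memory):
        with self._cond:
            # Wait until every earlier ticket has finished its write-back.
            self._cond.wait_for(lambda: self._now_serving == ticket)
            for addr, value in write_log:   # replay stores in logged order
                memory[addr] = value
            self.completed.append(ticket)
            self._now_serving += 1
            self._cond.notify_all()


def demo():
    commit, memory = TicketedCommit(), {}
    tickets = [commit.take_ticket() for _ in range(3)]
    # Start the latest ticket first to show that start order does not matter.
    threads = [threading.Thread(target=commit.write_back,
                                args=(t, [("X", t)], memory))
               for t in reversed(tickets)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return commit.completed, memory
```

Calling `demo()` starts the three write-backs in reverse ticket order, yet they complete in ticket order, so the store from the latest ticket is the one left in `memory["X"]`.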

5. Conclusion

We have designed and implemented an STM algorithm that provides SLA and the x86 memory consistency model to transactions. This is not merely an academic exercise: the STM we have designed is integrated with an efficient instrumentation system for x86 binaries that we have built. The two together provide SLA for atomic blocks delimited by a single global lock in the binary, provided the program is free of TRUW races.

References

[1] D. Dice, O. Shalev, and N. Shavit. Transactional locking II. In DISC ’06: Proc. 20th International Symposium on Distributed Computing, pages 194–208, Sept. 2006.
[2] A. Jannesari and W. Tichy. Identifying ad-hoc synchronization for enhanced race detection. In IPDPS ’10: Proc. 25th IEEE International Symposium on Parallel Distributed Processing (IPDPS), pages 1–10, April 2010.
[3] V. Menon, S. Balensiefer, T. Shpeisman, A.-R. Adl-Tabatabai, R. Hudson, B. Saha, and A. Welc. Single global lock semantics in a weakly atomic STM. In TRANSACT ’08, 3rd ACM SIGPLAN Workshop on Languages, Compilers, and Hardware Support for Transactional Computing, Feb. 2008.
[4] M. Olszewski, J. Cutler, and J. G. Steffan. JudoSTM: A dynamic binary-rewriting approach to software transactional memory. In PACT ’07: Proc. 16th International Conference on Parallel Architecture and Compilation Techniques, pages 365–375, Sept. 2007.
[5] S. Owens, S. Sarkar, and P. Sewell. A better x86 memory model: x86-TSO. In TPHOLs: 22nd Annual Conference on Theorem Proving in Higher Order Logics, 2009.
[6] C. Wang, W.-Y. Chen, Y. Wu, B. Saha, and A.-R. Adl-Tabatabai. Code generation and optimization for transactional memory constructs in an unmanaged language. In CGO ’07: Proc. 2007 International Symposium on Code Generation and Optimization, pages 34–48, Mar. 2007.
