Literature survey for practical TSX lock elision applications

therapsid — Mon, 30 May 2016 04:40:46 +0000

Introduction

This is a short non comprehensive (and somewhat biased) literature survey on how lock elision with Intel TSX can be used to improve performance to programs by increasing parallelism. Focus is on practical incremental improvements in existing software.

A basic introduction of TSX lock elision is available in Scaling existing lock based applications with Lock elision.

Lock libraries

The papers below are on actual code using lock elision, not just how to implement lock elision itself. The basic rules of how to implement lock elision in a locking library are in chapter 12 of the Intel Optimization manual. Common anti-patterns (mistakes) while doing so are described in TSX Anti-patterns.

Existing lock libraries that implement lock elision using TSX include glibc pthreads, Thread Building Blocks, tsx-tools and ConcurrencyKit

Databases

Lock elision has been used widely to speed up both production and research databases. A lot of work has been done on the SAP HANA in memory data which uses TSX in production today to improve performance of the B+Tree index and the delta tree data structures. This is described in Improving In-Memory Database Index Performance with TSX.

Several research databases went beyond just using TSX to speed up specific data structures, but map complete database transactions to hardware transaction. This requires much more changes to the databases, and careful memory layout, but can also give more gains. It generally goes beyond simple lock elision. This approach has been implemented by Leis in Exploiting HTM in Main memory databases. A similar approach is used in Using Restricted Transactional Memory to Build a Scalable In-Memory Database. Both see large gains on TPC-C style workloads.

For non in memory databases, a group at Berkeley used lock elision to improve parallelism in LevelDB and getting a 25% speedup on a 4 core system. Standard LevelDB is essentially a single lock system, so this was a nice speedup with only minor work compared to other efforts that used manual fine grained locking to improve parallelism in LevelDB. However it required special handling of condition variables, which are used for commit. For simpler key value stores Diegues used an automatic tuner with TSX transactions to get a 2x gain with memcached.

DrTM uses TSX together with RDMA to implement a fast distributed transaction manager using 2PL. TSX provides local isolation, while RDMA (which aborts transactions) provides remote isolation.

Languages

An attractive use of lock elision is to speed up locks implicit in language runtimes. Yoo implemented transparent. support for using TSX for Java synchronized sections in Early Experience on Transactional
Execution of Java TM Programs. The runtime automatically detects when a synchronized region does not benefit from TSX and disables lock elision then. This works transparently, but using it successfully may still need some changes in the program to minimize conflicts and other aborts, as described by Gil Tene in Understanding Hardware Transactional Memory. This support is available in JDK 8u40 and can be enabled with the -XX:+UseRTMLocking option.

Another interesting case for lock elision is improving parallelism for the Great Interpreter Lock (GIL) used in interpreters. Odaiara implemented this for Ruby in Eliminating Global Interpreter Locks in Ruby through HTM. They saw a 4.4x speedup on Ruby NPB, 1.6x in WEBrick and 1.2x speedup in Ruby on Rails on a 12 core system.

Hardware transactions can be used for auto-parallelizing existing loops, even when the compiler cannot prove that iterations are independent, by using the transactions to isolate individual iterations. Odaira implemented TSX loop speculation manually for workloads in SPECCpu and report a 11% speedup on a 4 core system. There are some limitations to this technique due to the lazy subscription problem described by Dice, but in principle it can be directly implemented in compilers.

Salamanca used TSX to implement tracing recovery code for Speculative Trace Optimization (STO) for loops. The basic principles are similar to the previous paper, but they implemented an automated prototype. They report 9% improvement on 4 cores with a number of benchmarks.

An older paper, which predates TSX, by Dice describes how to use Hardware Transactional Memory to simplify Work Stealing schedulers.

High Performance Computing

Yoo.et.al. use TSX lock elision to benchmark a number of HPC applications in Performance Evaluation of Intel TSX for High-Performance Computing. They report an average 41% speedup on a 4 core system. They also report an average 31% improvement in bandwidth when applying TSX to a user space TCP stack.

Hay explored lock elision for improving Parallel Discrete Event Simulations. He reports a speed ups of up to 28%.

Data structures

Lock elision has been used widely to speed up parallel data structures. Normally applying lock elision to an existing data structure is very simple — elide the lock protecting it — but some tweaking of the data structure can often give better performance. Dementiev explores TSX for general fast scalable hash tables. Li uses TSX to implement scalable Cuckoo hash tables. Using TSX for hash tables is generally very straight forward. For tree data structures one need to be careful that tree rebalancing does not overwhelm the write set capacity of the hardware transactions., Repetti uses TSX to scale Patricia Tries. Siakavas explores TSX usage for scalable Red-Black Trees, similar in this paper. Bonnichsen uses HTM to improve BT Trees, reporting a speed up of 2x to 3x compared to earlier implementations. The database papers described above describe the rules needed to successfully elide BTrees.

Calciu uses TSX to implement a more scalable priority queue based on skip list and reports increased parallelism.

Memory allocation and Garbage Collection

One challenge with using garbage collection is that the worst case “stop the world” pauses from parallel garbage collectors limit the total heap size that can be supported in copying garbage collectors. The Chihuahua GC paper implements a prototype TSX based collector for the Jikes research java VM. They report upto 101% speedup in concurrent copying speed, and show that a simple parallel garbage collector can be implemented with limited efforts.

Another dog-themed GC, the Collie garbage collector (original paper predates TSX) presents a production quality parallel collector that minimizes pauses and allows scaling to large heaps. Opdahl has another description of the Collie algorithm It is presumably deployed now for TSX in Azul’s commercial ZING JVM product which claims to scale upto 2TB of heap memory.

StackTrack is an efficient algorithm to do automatic memory reclaim for parallel data structures using hardware transactions, out performing existing techniques such as hazard pointers. It requires recompiling the program with a special patched gcc compiler, and automatically creates variable-length transactions for functions freeing memory. The technique could be potentially used even without special compilers.

Kuzmaul uses TSX to implement a scalable SuperMalloc and reports good performance combined with relatively simple code. Dice et all report how an cache index aware malloc can improve TSX performance by improving utilization of the L1 cache.

Other usages

Peters use lock elision to parallelize a micro kernel For a micro kernel a big lock is fine and finds that RTM lock elision outperforms fine grained locking due to less single thread overhead.

TSX anti patterns

therapsid — Thu, 27 Mar 2014 04:24:20 +0000

I published a new article on TSX anti patterns: Common mistakes that people make when implementing TSX lock libraries or writing papers about them. Make sure you don’t fall into the same traps. Enjoy!

Scaling existing lock-based applications with lock elision

therapsid — Wed, 12 Feb 2014 04:01:07 +0000

I published an introductory article on practical lock elision with Intel TSX at ACM Queue/CACM: Scaling existing lock-based applications with lock elision. Enjoy!

TSX updates

therapsid — Sun, 23 Jun 2013 22:13:13 +0000

Hope everyone who already has a Haswell with TSX is considering playing around with it. Chapter 12 of the optimization manual has a good methology.

There’s currently a push to solve the last issues in the TSX glibc to get the feature in glibc 2.18 including even the most obscure POSIX requirements. A nice solution to still provide deadlocks for PTHREAD_MUTEX_NORMAL has been found now, that doesn’t affect most programs. We also settled on removing support for mutex initializers that disable/enable elision to improve binary compatibility with old glibcs. This has the nice side effect that the glibc can internally ensure that no mutex has a elision flag set, when the CPU does not support RTM. This then allows to shave off at least one more check in the pthread_mutex_lock() fast path.

I published an article describing TSX fallback paths. Every RTM transaction needs a working fallback path, and not doing that properly is a common TSX newbie mistake.

And Roman collected all the available TSX resources in a nice overview page
Also a new version of PCM is available that supports TSX counting (no abort sampling). It doesn’t need a kernel driver, and should work on all Linux, Mac, Windows systems. Having some form of performance counting is fairly important for any TSX tuning

Also a new version of tsx-tools, including HLE and RTM compat headers, has been published.

And the HLE examples in the gcc documentation have been finally fixed to commit (not yet backported to 4.8). HLE requires the operation size of the acquire and release to match, and always aborts the transaction if that is not the case. __atomic_clear always casts the argument to bool for obscure reasons, so if the lock variable is not bool the operand size is likely to mismatch. The fix is to use __atomic_store_n instead, which doesn’t cast the pointer.

There are also some other issues in the HLE intrinsics in 4.8 currently (mostly fixed in 4.9). gcc won’t warn in all cases when it cannot generate HLE for an atomic primitive (e.g. when the primitive does not map to a single x86 instruction). And you still need to enable optimization to use the C++ HLE interface or some more complex __atomic intrinsics, as the gcc backend may otherwise not see the __ATOMIC_HLE_RELEASE flag as a constant.

Right now it is still safer to use the compat headers from tsx-tools which avoid all these problems.

tsx – Andi Kleen's blog