This is a short, non-comprehensive (and somewhat biased) literature survey on how lock elision with Intel TSX can be used to improve the performance of programs by increasing parallelism. The focus is on practical incremental improvements in existing software.
A basic introduction of TSX lock elision is available in Scaling existing lock based applications with Lock elision.
The papers below are on actual code using lock elision, not just how to implement lock elision itself. The basic rules of how to implement lock elision in a locking library are in chapter 12 of the Intel Optimization manual. Common anti-patterns (mistakes) while doing so are described in TSX Anti-patterns.
Lock elision has been used widely to speed up both production and research databases. A lot of work has been done on the SAP HANA in-memory database, which uses TSX in production today to improve the performance of the B+Tree index and the delta tree data structures. This is described in Improving In-Memory Database Index Performance with TSX.
Several research databases went beyond using TSX to speed up specific data structures and map complete database transactions to hardware transactions. This requires many more changes to the database, and careful memory layout, but can also give larger gains. It generally goes beyond simple lock elision. This approach has been implemented by Leis in Exploiting HTM in Main memory databases. A similar approach is used in Using Restricted Transactional Memory to Build a Scalable In-Memory Database. Both see large gains on TPC-C style workloads.
For non-in-memory databases, a group at Berkeley used lock elision to improve parallelism in LevelDB and got a 25% speedup on a 4 core system. Standard LevelDB is essentially a single lock system, so this was a nice speedup with only minor work compared to other efforts that used manual fine grained locking to improve parallelism in LevelDB. However, it required special handling of condition variables, which are used for commit. For simpler key value stores, Diegues used an automatic tuner with TSX transactions to get a 2x gain with memcached.
DrTM uses TSX together with RDMA to implement a fast distributed transaction manager using 2PL. TSX provides local isolation, while RDMA (which aborts transactions) provides remote isolation.
An attractive use of lock elision is to speed up locks implicit in language runtimes. Yoo implemented transparent support for using TSX for Java synchronized sections in Early Experience on Transactional Execution of Java TM Programs. The runtime automatically detects when a synchronized region does not benefit from TSX and disables lock elision then. This works transparently, but using it successfully may still need some changes in the program to minimize conflicts and other aborts, as described by Gil Tene in Understanding Hardware Transactional Memory. This support is available in JDK 8u40 and can be enabled with the -XX:+UseRTMLocking option.
Another interesting case for lock elision is improving parallelism for the Global Interpreter Lock (GIL) used in interpreters. Odaira implemented this for Ruby in Eliminating Global Interpreter Locks in Ruby through HTM. They saw a 4.4x speedup on Ruby NPB, 1.6x in WEBrick and 1.2x speedup in Ruby on Rails on a 12 core system.
Hardware transactions can be used for auto-parallelizing existing loops, even when the compiler cannot prove that iterations are independent, by using the transactions to isolate individual iterations. Odaira implemented TSX loop speculation manually for workloads in SPECCpu and reports an 11% speedup on a 4 core system. There are some limitations to this technique due to the lazy subscription problem described by Dice, but in principle it can be directly implemented in compilers.
Salamanca used TSX to implement tracing recovery code for Speculative Trace Optimization (STO) for loops. The basic principles are similar to the previous paper, but they implemented an automated prototype. They report 9% improvement on 4 cores with a number of benchmarks.
An older paper, which predates TSX, by Dice describes how to use Hardware Transactional Memory to simplify Work Stealing schedulers.
High Performance Computing
Yoo et al. use TSX lock elision to benchmark a number of HPC applications in Performance Evaluation of Intel TSX for High-Performance Computing. They report an average 41% speedup on a 4 core system. They also report an average 31% improvement in bandwidth when applying TSX to a user space TCP stack.
Hay explored lock elision for improving Parallel Discrete Event Simulations. He reports speedups of up to 28%.
Lock elision has been used widely to speed up parallel data structures. Normally applying lock elision to an existing data structure is very simple (elide the lock protecting it), but some tweaking of the data structure can often give better performance. Dementiev explores TSX for general fast scalable hash tables. Li uses TSX to implement scalable Cuckoo hash tables. Using TSX for hash tables is generally very straightforward. For tree data structures one needs to be careful that tree rebalancing does not overwhelm the write set capacity of the hardware transactions. Repetti uses TSX to scale Patricia Tries. Siakavas explores TSX usage for scalable Red-Black Trees, and similar work appears in this paper. Bonnichsen uses HTM to improve BT Trees, reporting a speedup of 2x to 3x compared to earlier implementations. The database papers described above describe the rules needed to successfully elide BTrees.
Calciu uses TSX to implement a more scalable priority queue based on skip list and reports increased parallelism.
Memory allocation and Garbage Collection
One challenge with using garbage collection is that the worst case “stop the world” pauses from parallel garbage collectors limit the total heap size that can be supported in copying garbage collectors. The Chihuahua GC paper implements a prototype TSX based collector for the Jikes research Java VM. They report up to 101% speedup in concurrent copying speed, and show that a simple parallel garbage collector can be implemented with limited effort.
Another dog-themed GC, the Collie garbage collector (original paper predates TSX), presents a production quality parallel collector that minimizes pauses and allows scaling to large heaps. Opdahl has another description of the Collie algorithm. It is presumably deployed now for TSX in Azul’s commercial ZING JVM product, which claims to scale up to 2TB of heap memory.
StackTrack is an efficient algorithm to do automatic memory reclamation for parallel data structures using hardware transactions, outperforming existing techniques such as hazard pointers. It requires recompiling the program with a special patched gcc compiler, and automatically creates variable-length transactions for functions freeing memory. The technique could potentially be used even without special compilers.
Kuszmaul uses TSX to implement a scalable SuperMalloc and reports good performance combined with relatively simple code. Dice et al. report how a cache index aware malloc can improve TSX performance by improving utilization of the L1 cache.
Peters uses lock elision to parallelize a micro kernel in For a Microkernel, a Big Lock Is Fine, and finds that RTM lock elision outperforms fine grained locking due to less single thread overhead.
Hope everyone who already has a Haswell with TSX is considering playing around with it. Chapter 12 of the optimization manual has a good methodology.
There’s currently a push to solve the last issues in the TSX glibc to get the feature into glibc 2.18, including even the most obscure POSIX requirements. A nice solution to still provide deadlocks for PTHREAD_MUTEX_NORMAL has been found now, one that doesn’t affect most programs. We also settled on removing support for mutex initializers that disable/enable elision, to improve binary compatibility with old glibcs. This has the nice side effect that glibc can internally ensure that no mutex has an elision flag set when the CPU does not support RTM. This in turn allows shaving off at least one more check in the pthread_mutex_lock() fast path.
I published an article describing TSX fallback paths. Every RTM transaction needs a working fallback path, and not doing that properly is a common TSX newbie mistake.
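To make the fallback requirement concrete, here is a minimal sketch of an RTM critical section with a fallback path. It is not the code from the article: the names fallback_lock, counter_add and counter_get are made up for illustration, the retry count of 3 is an arbitrary assumption, and the target attribute is a gcc/clang extension used so the file builds without -mrtm. The transaction reads the fallback lock so that a thread taking the real lock aborts it, and the code takes the real lock when RTM is missing or keeps aborting.

```c
#include <immintrin.h>   /* _xbegin, _xend, _xabort, _XBEGIN_STARTED */
#include <cpuid.h>       /* __get_cpuid_count (gcc/clang, x86 only) */
#include <stdatomic.h>

/* Hypothetical example state: a counter protected by one elidable lock. */
static atomic_int fallback_lock;   /* 0 = free, 1 = held */
static long counter;

/* CPUID.(EAX=7,ECX=0):EBX bit 11 reports RTM support. */
static int cpu_has_rtm(void)
{
    unsigned eax, ebx, ecx, edx;
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        return 0;
    return (ebx >> 11) & 1;
}

/* Fallback path: a plain spinlock that always makes progress. */
static void lock_slow(void)
{
    int expected = 0;
    while (!atomic_compare_exchange_weak(&fallback_lock, &expected, 1))
        expected = 0;
}

static void unlock_slow(void)
{
    atomic_store(&fallback_lock, 0);
}

/* Try the update inside a transaction. Returns 1 if it committed. */
__attribute__((target("rtm")))
static int try_elided_add(long n)
{
    for (int retry = 0; retry < 3; retry++) {   /* 3 is an arbitrary choice */
        unsigned status = _xbegin();
        if (status == _XBEGIN_STARTED) {
            /* Put the lock into the read set; abort if someone holds it. */
            if (atomic_load_explicit(&fallback_lock, memory_order_relaxed))
                _xabort(0xff);
            counter += n;
            _xend();
            return 1;
        }
        /* Aborted: execution resumes here with the abort reason in status. */
    }
    return 0;
}

void counter_add(long n)
{
    if (cpu_has_rtm() && try_elided_add(n))
        return;
    lock_slow();            /* the fallback path every transaction needs */
    counter += n;
    unlock_slow();
}

long counter_get(void)
{
    return counter;
}
```

A real locking library would also wait for the lock to become free before retrying and would look at the abort status to decide whether a retry is worthwhile. This sketch leaves both out.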
And Roman collected all the available TSX resources in a nice overview page.
Also a new version of PCM is available that supports TSX counting (no abort sampling). It doesn’t need a kernel driver, and should work on all Linux, Mac, and Windows systems. Having some form of performance counting is fairly important for any TSX tuning.
Also a new version of tsx-tools, including HLE and RTM compat headers, has been published.
And the HLE examples in the gcc documentation have finally been fixed so that the transaction commits (not yet backported to 4.8). HLE requires the operand size of the acquire and release to match, and always aborts the transaction if that is not the case. __atomic_clear always casts the argument to bool for obscure reasons, so if the lock variable is not bool the operand size is likely to mismatch. The fix is to use __atomic_store_n instead, which doesn’t cast the pointer.
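As an illustration of the matching operand size rule, here is a small sketch of an HLE spinlock on an int lock variable, with __atomic_exchange_n for the acquire and __atomic_store_n for the release so both operate on 4 bytes. The names hle_lock_t, hle_lock, hle_unlock, HLE_ACQ and HLE_REL are made up for this example. The fallback to 0 when the compiler does not define the HLE flags is an assumption that makes the code build elsewhere as a plain spinlock.

```c
/* Use the gcc HLE memory model flags when the compiler provides them. */
#if defined(__ATOMIC_HLE_ACQUIRE) && defined(__ATOMIC_HLE_RELEASE)
#define HLE_ACQ __ATOMIC_HLE_ACQUIRE
#define HLE_REL __ATOMIC_HLE_RELEASE
#else
#define HLE_ACQ 0   /* assumption: degrade to an ordinary spinlock */
#define HLE_REL 0
#endif

typedef int hle_lock_t;   /* 4 byte lock variable, deliberately not bool */

void hle_lock(hle_lock_t *lock)
{
    /* The exchange operates on an int, so the release below must too. */
    while (__atomic_exchange_n(lock, 1, __ATOMIC_ACQUIRE | HLE_ACQ)) {
        /* Spin with plain loads until the lock looks free again. */
        while (__atomic_load_n(lock, __ATOMIC_RELAXED))
            ;
    }
}

void hle_unlock(hle_lock_t *lock)
{
    /* __atomic_store_n keeps the int operand size; __atomic_clear would
       cast to bool and write a single byte, which mismatches the acquire. */
    __atomic_store_n(lock, 0, __ATOMIC_RELEASE | HLE_REL);
}
```

On CPUs without HLE the prefixes are ignored, so the same binary still behaves like a normal lock.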
There are also some other issues in the HLE intrinsics in 4.8 currently (mostly fixed in 4.9). gcc won’t warn in all cases when it cannot generate HLE for an atomic primitive (e.g. when the primitive does not map to a single x86 instruction). And you still need to enable optimization to use the C++ HLE interface or some more complex __atomic intrinsics, as the gcc backend may otherwise not see the __ATOMIC_HLE_RELEASE flag as a constant.
Right now it is still safer to use the compat headers from tsx-tools which avoid all these problems.
This is a technical overview that assumes some prior knowledge of profiling. I apologize for the cumbersome title.
Summary is more or less: batch your locks. Don’t make critical sections too small. Having the smallest locks is not cool anymore.
As the saying goes, every problem in CS can be solved with another level of indirection (except the problem of too much indirection). This often leads to caching to improve performance: remember some results to handle them faster. Caches are everywhere in computing these days: from the CPU itself (memory caches, TLBs, instruction caches, branch prediction), to databases (query caches, table caches), networking stacks (neighbor caches, routing caches, DNS caches) and IO (directory caches, OS page buffering). When I say cache here I mean this generalized cache concept, not just the caches of the CPUs.
Caches usually improve performance. But only if your cache hit rates are high enough (and the cache latency low enough, but that’s a different discussion). So if you thrash the caches and the hit rate becomes very low, performance suffers. This is a problem that theoretical CS algorithm analysis largely ignored for a long time, but there has been some work on this recently (like cache oblivious algorithms).
However this is only for the CPU caches, not for all the other caches that exist in a modern programming environment.
A cache is a shared resource in a program. Typically programs consist of many subsystems and libraries. They all impact various caches. They are likely written by different developers. Calling a library means whoever wrote that library shares cache resources with you.
Now when tuning some sub function you often have a trade off between simplicity and performance. A common technique to improve performance is to replace computation with table lookups (despite the memory wall it is often still the fastest way). This will (usually) improve performance, but increase the cache foot print if the input values vary enough to cover larger parts of the table. Or you could add a cache, which is essentially an automatic table lookup. This will make the function faster, but also increase its resource foot print in the shared cache resources.
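As a small illustration of that trade off, here is a sketch of the same function written as a computation and as a table lookup. The function names are made up for this example. The table version does a single load per call, but its 256 byte table occupies four cache lines (assuming 64 byte lines) once the inputs vary enough to touch all of it.

```c
#include <stdint.h>

/* Computation: no table, so the data cache foot print is zero,
   at the cost of one loop iteration per set bit. */
unsigned popcount8_compute(uint8_t v)
{
    unsigned n = 0;
    while (v) {
        v &= (uint8_t)(v - 1);   /* clear the lowest set bit */
        n++;
    }
    return n;
}

/* Table lookup: one load per call, but a 256 byte table that competes
   with the rest of the program for the shared data cache. */
static uint8_t popcount8_table[256];

void popcount8_table_init(void)
{
    for (unsigned i = 0; i < 256; i++)
        popcount8_table[i] = (uint8_t)popcount8_compute((uint8_t)i);
}

unsigned popcount8_lookup(uint8_t v)
{
    return popcount8_table[v];
}
```

Which of the two is faster inside a real program depends on how hot the function is and on how much cache the rest of the program needs.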
As a side note, anything with a table lookup is much harder to tune in a micro benchmark. If the foot print of a function is data independent we can just call the function in a loop with the same input and measure the total time, divided by the iterations. But if the foot print is data dependent we always need a representative input set, otherwise the unrealistic cache hit rate on the table skews the results badly.
Even if no table lookup is used, more complex logic will likely have more overhead in the instruction caches and branch prediction caches. So common optimizations often increase the foot print. But when the total program already has a large foot print, increasing the total further may cause other time critical subsystems to start thrashing their working set.
Similar reasoning applies to other kinds of caches: a database sub function may thrash the database engine query or table cache. A network subsystem may thrash the DNS cache. A 3D sub function may thrash the 3D stack’s JIT code cache.
So you could say that the foot print of a function should not be higher than the proportion of runtime it executes. If there are free resources it may be reasonable to take more than that, but even then it may be better to be frugal because the increased resource usage may not pay off in better performance. Using fewer resources may also save power or leave more resources for other programs sharing the same system (for power that’s not always true, having better performance may help you in the race to idle).
So we may end up with an interesting trade off when tuning a function. We have the choice between different algorithms with different resource footprints (e.g. table lookup vs computation). But the choice which one to use depends on the rest of the program: whether the function is time critical enough and how much resources are used elsewhere.
So this is a nasty problem. Instead of being able to tune each function individually we actually have a global optimization problem over the whole program. The problem of resource allocation is non decomposable. That is when you change one small piece in a big program it may affect everyone else. When the program changes and increases resource usage somewhere a lot of resource allocation may need to be redone to re-balance, which may include changing algorithms of some old subsystem.
This is especially a problem for library functions, where the library author doesn’t really know what the calling program does. And on the other hand the library user didn’t write the library and doesn’t know what the library author optimized for.
One way would be to offer multiple variants optimized for different resource consumptions, so that the user can choose. This would be similar to how the algorithms in the C++ STL are tagged with their algorithmic complexity for different operations. One could imagine a library of algorithms that are tagged with their resource overhead. This likely would need to be equations based on the input set size. Also, since there are multiple resources (e.g. memory hierarchy, branch predictor, other software caches), it may need multiple indicators.
Modern caches are generally so complex that they are nearly impossible to analyze “on paper” (especially when you have multiple levels interacting), and need measurements instead.
So developing metrics for this would be interesting. Since the costs are paid elsewhere (you cause a cache miss somewhere else) it is hard to associate the two by classic profiling. We have no way to directly monitor the various caches. For software caches this could be improved by adding better tracing capabilities, but for hardware it’s harder.
One way would be subtractive monitoring, similar to cyclesoak. You run a “cachesoak” in parallel with the program. It touches memory and uses time measurements to estimate how much of its working set has been displaced by another program running on the same core.
This technique has been used for attacking cryptographic algorithms, by exploiting the cache access patterns of table lookups in common ciphers, and, starting in the 1970s, with OS paging for password attacks.
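A very rough sketch of what the probing part of such a cachesoak could look like follows. Everything here is an assumption for illustration: the names, the 64 byte line size, and the use of wall clock time from timespec_get. A usable tool would need a cycle accurate timer, pinning to the right core, and calibration against a baseline.

```c
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define SOAK_LINE 64   /* assumed cache line size in bytes */

/* Hypothetical cachesoak state: a buffer we try to keep cached. */
typedef struct {
    volatile unsigned char *buf;
    size_t size;
} cachesoak;

/* Allocate the working set and write to it once so every page is
   really backed by memory. Returns 0 on success, -1 on failure. */
int cachesoak_init(cachesoak *cs, size_t size)
{
    unsigned char *p = malloc(size);
    if (!p)
        return -1;
    memset(p, 1, size);
    cs->buf = p;
    cs->size = size;
    return 0;
}

/* Touch one byte per cache line and return the elapsed nanoseconds.
   A probe that takes longer than the warm baseline suggests that part
   of the working set was displaced since the previous probe. */
long long cachesoak_probe(cachesoak *cs)
{
    struct timespec t0, t1;
    unsigned sum = 0;

    timespec_get(&t0, TIME_UTC);
    for (size_t i = 0; i < cs->size; i += SOAK_LINE)
        sum += cs->buf[i];
    timespec_get(&t1, TIME_UTC);
    (void)sum;

    return (long long)(t1.tv_sec - t0.tv_sec) * 1000000000LL
           + (t1.tv_nsec - t0.tv_nsec);
}

void cachesoak_free(cachesoak *cs)
{
    free((void *)cs->buf);
    cs->buf = NULL;
    cs->size = 0;
}
```

The intended use would be to probe once to warm the cache and record a baseline, let the code under test run, and then probe again and compare the two times.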
One could also use this subtractive technique for other software caches. For example for a database access run a background thread that accesses the database in a controlled way and measures access times. The challenge of course would be to understand the caching behavior of the database well enough to get useful data.
This all still doesn’t tell you where the resource consumption happens. So to measure the resource consumption of each subsystem in a complex program you would need to run each of them individually against a cachesoak test. And ideally with a representative input data set, otherwise you may get unrealistic access patterns.
I mostly talked about memory hierarchy caches here, but of course this applies to other caches too. An interesting case is branch prediction caches in CPUs. A mispredicted branch can take nearly an order of magnitude more time than a predicted branch. Indirect branches need more resources than conditional branches (full address versus single bit). I wrote about optimizing conditional branches earlier. So a less complex algorithm may be slower, but use fewer resources.
It would be interesting to develop a metric for the branch prediction resource consumption. One way may be to use a variant of cyclomatic complexity, but focus on the hot paths of the program only using profile feedback.
So overall allocating cache resources to different sub systems is hard. It would be good if we had better tools for this.
Programs often have flags that affect the control flow. The simplest case is a boolean that can be either TRUE or FALSE (normally 1 and 0). The x86 ISA uses condition codes to test and do conditional jumps. It has a direct flag to test for zero, so that a test for zero and non zero can be efficiently implemented.
if (flag == 0) case_1(); else case_2();
In assembler this will look something like this (don’t worry if you don’t know assembler, just read the comments):
    test $-1,flag   ; and flag with -1 and set condition codes
    jz   is_zero    ; jump to is_zero if flag was zero
                    ; flag was non zero here
But what happens when flag can have more than two states (like the classic true, false, file_not_found)? How can we implement such a test in the minimum number of instructions? A common way is to use a negative value (-1) for the third state. x86 has a condition code for negative, so the code becomes simply
    if (flag == 0)
        case_1();
    else if (flag < 0)
        case_2();
    else /* > 0 */
        case_3();
(if you try that in gcc use a call, not just a value for the differing control flows, otherwise the compiler may use tricks to implement this without jumps)
In assembler this looks like
    testl $-1,flag  ; and flag with -1 and set condition codes
    jz    case_1    ; jump if flag == 0
    js    case_2    ; jump if flag < 0
                    ; flag > 0 here
Now how do we handle four states with a single test and a minimum number of jumps? Looking at the x86 condition tables it has more condition codes. One interesting but obscure one is the parity flag. Parity is true if the number of bits set is even. In x86 it also only applies to the lowest (least-significant) byte of the value.
Parity used to be used for 7bit ASCII characters to detect errors, by adding a parity bit in the 8th bit that makes the parity always even (or odd, I forgot). These days this is mostly obsolete and usually replaced with better codes that can not only detect, but also correct errors.
Back to the n-value problem. So we can use an additional positive value that has even parity, that is an even number of set bits, like 3 (0b11). The total values now are: -1, 0, 1, 3
In assembler this is a straightforward additional test for “p”:
    testl $-1,flag  ; and flag with -1 and set condition codes
    jz    case_1    ; jump if flag == 0
    js    case_2    ; jump if flag < 0
    jp    case_3    ; jump if flag > 0 and flag has even parity
                    ; flag > 0 and has odd parity
Note that the 3 would be only detected in the lowest byte, so all 32bit values with 3 in the lowest 8bit would be equivalent to 3. That is also true for all negative values which are the same as -1. Just 0 is lonely.
Now how to implement this in C if you don’t want to write assembler (or are not writing a compiler or JIT)? gcc has a __builtin_parity function, but unfortunately it is defined as parity on 4 bytes, not a single byte. This means it cannot map directly to the “p” condition code.
One way is to use the asm goto extension that is available since gcc 4.5 and use inline assembler. With asm goto an inline assembler statement can change control flow and jump directly to C labels.
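For reference, here is a portable sketch of the same four way classification in plain C. The enum and function names are made up for this example. It masks the flag to the lowest byte before calling __builtin_parity, which matches what the x86 parity flag looks at, but the compiler is free to generate something other than the jz/js/jp sequence for it.

```c
/* Hypothetical state names, in the order the jz/js/jp chain tests them. */
enum flag_state {
    STATE_ZERO,        /* flag == 0                          (value 0)  */
    STATE_NEGATIVE,    /* flag < 0                           (value -1) */
    STATE_POS_EVEN,    /* flag > 0, low byte has even parity (value 3)  */
    STATE_POS_ODD      /* flag > 0, low byte has odd parity  (value 1)  */
};

/* The x86 parity flag is set when the lowest byte of the result has an
   even number of set bits. __builtin_parity returns 1 for an odd number
   of set bits, so mask to one byte and invert. */
static int low_byte_parity_even(int flag)
{
    return !__builtin_parity((unsigned)flag & 0xffu);
}

enum flag_state classify_flag(int flag)
{
    if (flag == 0)
        return STATE_ZERO;
    if (flag < 0)
        return STATE_NEGATIVE;
    if (low_byte_parity_even(flag))
        return STATE_POS_EVEN;
    return STATE_POS_ODD;
}
```

For example classify_flag(3) and classify_flag(0x103) both give STATE_POS_EVEN, because only the lowest byte counts.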
    asm goto(
        "   testl $-1,%0\n"
        "   jz %l1\n"
        "   js %l2\n"
        "   jp %l3"
        :: "rm" (flag) :: lzero, lneg, lpar);
    /* flag > 0, odd parity here (1) */
    lzero:
    /* flag == 0 here (0) */
    lneg:
    /* flag < 0 here (-1) */
    lpar:
    /* flag > 0 (3) */
This works and could be packaged into a more generic macro. The disadvantage of course is — apart from being non portable — that the compiler cannot look into the inline assembler code. If flag happens to be a constant known to the compiler it won’t be able to replace the test with a direct jump.
Can this be extended to more than four states? We could add a fifth state by using two negative numbers that have even and odd parity respectively. This would require the negative cases to go through two conditional jumps however: first to check if the value is negative and then another to distinguish the two parity cases.
It’s not clear that is a good idea. Some older x86 micro architectures have limits on the number of jumps in a 16byte region that can be branch predicted (that is why gcc sometimes adds weird nops for -mtune=generic). Losing branch prediction would cost us far more than we could gain with any optimization here.
Another approach would be to use a cmp (subtraction) instead of a test. In this case the overflow and carry condition codes would be also set. But it’s not clear if that would gain us any more possibilities.
Update 12/26: Fix typo
String processing is quite common in software. This ranges from simple operations like searching for characters in a string, through more complex subset and substring searches, to complex transformations. Most string processing code processes strings byte by byte (or 16bit word by word for 16bit unicode). That is fine for code that is not time critical.
A modern CPU can process 32bit-256bit at a time inside a core. If it does only 8 bit in each operation then large parts of the computing capacity are unused. If the string code processes a lot of data and slows down the program, it’s better to find some way to use this unused computing capability. I’m only talking about parallelism inside a single core here (data level parallelism), not thread level parallelism across multiple cores.
For simple standard algorithms like strlen there are algorithms available that process the string in 32bit or 64bit units using various bit manipulation tricks (Hacker’s Delight has all the details). Typically C libraries already use optimized algorithms (although the compiler may not always use the standard library if you just call strlen). But they are usually too difficult to use for custom string processing.
Also they usually need special cases to handle the end of the string (the length of a string is often not known in advance) and are complicated to write and validate. They are also typically only a win for relatively simple algorithms, and come at a heavy cost in programmer time.
On modern x86 CPUs an alternative is to use the SSE 4.2 string instructions PCMP*STR*. They are essentially programmable comparison engines to compare two 128bit vectors at a time. They can handle string end detection and various other cases directly. They are very powerful, but since they are so configurable also hard to use (2^9 possible combinations). The hardware can do all these operations in parallel, so it’s more efficient than an explicit loop.
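To give a flavor of these tricks, here is a sketch of a string length routine that checks 8 bytes at a time with the classic zero byte test. The function names are made up for this example. It sidesteps the end of string problem by assuming that the caller passes a buffer whose capacity is a multiple of 8 and which contains a terminator. A real strlen cannot assume that and needs the alignment special cases mentioned above.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Nonzero if and only if at least one of the 8 bytes in v is zero. */
static inline uint64_t has_zero_byte(uint64_t v)
{
    return (v - 0x0101010101010101ULL) & ~v & 0x8080808080808080ULL;
}

/* Length of the 0 terminated string in buf.
   Assumptions (not checked): buf holds cap bytes, cap is a multiple
   of 8, and a terminator exists somewhere within cap. */
size_t wordwise_strlen(const char *buf, size_t cap)
{
    for (size_t i = 0; i + 8 <= cap; i += 8) {
        uint64_t v;
        memcpy(&v, buf + i, 8);          /* safe unaligned 8 byte load */
        if (has_zero_byte(v)) {
            /* The terminator is in this word: find it byte by byte. */
            size_t j = i;
            while (buf[j] != 0)
                j++;
            return j;
        }
    }
    return cap;                          /* no terminator found in cap */
}
```

With char buf[32] = "hello, world" the call wordwise_strlen(buf, sizeof buf) skips the first word in one step and finds the terminator in the second.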
These instructions are much easier to use than the old bit twiddling tricks, but they are still hard.
They operate on two 128 bit vector inputs and compare all the 8bit (or 16bit) pairs based on an 8bit configuration flag word. The input can be 0 terminated strings or explicit lengths for both vectors. The output is either a vector mask or an index, plus some condition codes.
The instructions can be either used from assembler, or using C intrinsics.
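As one example of the intrinsics route, here is a small sketch that uses _mm_cmpistri in “equal any” mode to find the first byte of a 16 byte block that is contained in a character set. The function name is made up for this example, the copies into zero padded buffers are only there to keep it safe for short inputs, and the target attribute is a gcc/clang extension so the file builds without -msse4.2.

```c
#include <nmmintrin.h>   /* SSE 4.2 string intrinsics */
#include <string.h>

/* Returns the index (0..15) of the first byte in block that matches any
   byte in set, or 16 if there is none. Both inputs are treated as
   0 terminated and only their first 16 bytes are looked at. */
__attribute__((target("sse4.2")))
int find_any_in_block(const char *set, const char *block)
{
    char set_buf[16] = {0};
    char block_buf[16] = {0};

    /* Copy into zero padded buffers so the 16 byte loads stay in bounds. */
    strncpy(set_buf, set, sizeof set_buf);
    strncpy(block_buf, block, sizeof block_buf);

    __m128i s = _mm_loadu_si128((const __m128i *)set_buf);
    __m128i b = _mm_loadu_si128((const __m128i *)block_buf);

    /* Unsigned bytes, match any character of the set, report the lowest
       matching index. String ends are detected by the instruction. */
    return _mm_cmpistri(s, b,
                        _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_ANY |
                        _SIDD_LEAST_SIGNIFICANT);
}
```

For example find_any_in_block("aeiou", "xyz hello") reports the position of the first vowel in the block. A loop over a longer string would advance by 16 bytes for as long as the result is 16.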
To make this all easier to use I wrote the PCMPSTR calculator, which allows you to interactively configure and explore these instructions. It generates snippets that can be used in programs.
Feedback welcome and let me know if you find a new interesting use with this.
Some tips for tuning
- Always profile first to see if something is time critical. Don’t bother with code that does not show up in the profiler
- Understand what’s common and uncommon in typical input data
- Always have a reference version of the algorithm that does things in a straightforward manner and is optimized for clarity, not speed. First write the program using the reference code and profile before tuning. Have a debug build that always compares the output of reference and optimized.
- Make sure the reference version stays up-to-date when the program changes.
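The reference versus optimized tip could look like the following sketch. The function names and the choice of counting spaces are made up for illustration. The optimized version processes 8 bytes at a time with a bit trick, the reference version is the obvious loop, and the wrapper compares the two whenever assertions are enabled (that is, when NDEBUG is not defined).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Reference version: written for clarity, not speed. */
size_t count_spaces_ref(const char *s, size_t n)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++)
        if (s[i] == ' ')
            count++;
    return count;
}

/* Optimized version: 8 bytes per step, then a byte wise tail. */
size_t count_spaces_fast(const char *s, size_t n)
{
    const uint64_t low7 = 0x7f7f7f7f7f7f7f7fULL;
    const uint64_t spaces = 0x2020202020202020ULL;
    size_t count = 0, i = 0;

    for (; i + 8 <= n; i += 8) {
        uint64_t v;
        memcpy(&v, s + i, 8);
        uint64_t x = v ^ spaces;          /* bytes equal to ' ' become 0 */
        uint64_t t = (x & low7) + low7;   /* high bit set if low 7 bits != 0 */
        t = ~(t | x | low7);              /* 0x80 exactly in each zero byte */
        count += (size_t)__builtin_popcountll(t);
    }
    for (; i < n; i++)
        count += (s[i] == ' ');
    return count;
}

/* What the rest of the program calls. Debug builds cross check the
   optimized result against the reference on every call. */
size_t count_spaces(const char *s, size_t n)
{
    size_t result = count_spaces_fast(s, n);
    assert(result == count_spaces_ref(s, n));
    return result;
}
```

Because the check lives in one wrapper, the reference version stays compiled and exercised as long as debug builds are run, which helps keep it up to date.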