Andi Kleen's blog

Tilting at windmills and other endeavors

Efficient n-states on x86 systems

with 7 comments

Programs often have flags that affect the control flow. The simplest case is a boolean that can be either TRUE or FALSE (normally 1 and 0). The x86 ISA uses condition codes to test and do a conditional jumps. It has a direct flag to test for zero, so that a test for zero and non zero can be efficiently implemented.


if (flag == 0) case_1();
else case_2();

In assembler this will look something like this (don’t worry if you don’t know assembler, just read the comments):


test $-1,flag       ; and flag with -1 and set condition codes
jz   is_zero        ; jump to is_zero if flag was zero
; flag was non zero here

But what happens when flag can have more than two states (like the classic true, false, file_not_found). How can we implement such a test in the minimum number of instructions? A common way is to use a negative value (-1) for the third state. x86 has a condition code for negative, so the code becomes simply


if (flag == 0)       case_1();
else if (flag < 0)   case_2();
else /* > 0 */       case_3();
 

(if you try that in gcc use a call, not just a value for the differing control flows, otherwise the compiler may use tricks to implement this without jumps)

In assembler this looks like


testl $-1,flag   ; and flag with -1 and set condition codes
jz case_1        ; jump if flag == 0
js case_2        ; jump if flag < 0
; flag > 0 here

Now how do we handle four states with a single test and a minimum number of jumps? Looking at the x86 condition tables it has more condition codes. One interesting but obscure one is the parity flag. Parity is true if the number of bits set is even. In x86 it also only applies to the lowest (least-significant) byte of the value.

Parity used to be used for 7bit ASCII characters to detect errors, but adding a parity bit in the 8th bit that makes the parity always even (or odd, I forgot). These days this is mostly obsolete and usually replaced with better codes that can not only detect, but also correct errors.

Back to the n-value problem. So we can use an additional positive value that has an even parity, that is an even number of bits, like 3 (0b11). The total values now are: -1, 0, 1, 3

In assembler this a straight forward additional test for “p”:


testl $-1,flag     ; and flag with -1 and set condition codes
jz    case_1       ; jump if flag == 0
js    case_2       ; jump if flag < 0
jp    case_3       ; jump if flag > 0 and flag has even parity
; flag > 0 and has odd parity

Note that the 3 would be only detected in the lowest byte, so all 32bit values with 3 in the lowest 8bit would be equivalent to 3. That is also true for all negative values which are the same as -1. Just 0 is lonely.

Now how to implement this in C if you don’t want to write assembler (or not writing a compiler or JIT)? gcc has a __builtin_parity, function but unfortunately it is defined as parity on 4 bytes, not a single byte. This means it cannot map directly to the “p” condition code.

One way is to use the asm goto extension that is available in gcc 4.5 and use inline assembler. With asm goto an inline assembler statement can change control flow and jump directly to C labels.


asm goto(
    "      testl $-1,%0\n"
    "      jz   %l1\n"
    "      js   %l2\n"
    "      jp   %l3" :: "rm" (flag) :: lzero, lneg, lpar);
/* flag > 0, odd parity here (1) */
lzero:  /* flag == 0 here (0) */
lneg:   /* flag < 0 here (-1) */
lpar:   /* flag > 0 (3) */

This works and could be packaged into a more generic macro. The disadvantage of course is — apart from being non portable — that the compiler cannot look into the inline assembler code. If flag happens to be a constant known to the compiler it won’t be able to replace the test with a direct jump.

Can this be extended to more than four states? We could add a fifth negative state with two negative numbers that have even and odd parity respective. This would require the negative cases to go through two conditional jumps however: first to check if the value is negative and then another to distinguish the two parity cases.

It’s not clear that is a good idea. Some older x86 micro architectures have limits on the number of jumps in a 16byte region that can be branch predicted (that is why gcc sometimes add weird nops for -mtune=generic). Losing branch prediction would cost us far more than we could gain with any optimization here.

Another approach would be to use a cmp (subtraction) instead of a test. In this case the overflow and carry condition codes would be also set. But it’s not clear if that would gain us any more possibilities.

Update 12/26: Fix typo

Written by therapsid

December 23rd, 2012 at 7:58 am