Archive for the ‘Uncategorized’ Category
The browser circle closes again
For a long time I was using Firefox (together with lynx), until a couple years ago it became unusably buggy: it would become very slow when lots of tabs were open, and worse often corrupt its recovery database, so when you restarted it most of my tabs were gone.
At that point I switched to Chrome. It wasn’t really faster, but at least didn’t slow down. But recently the bugginess on Chrome is increasing to intolerable levels too. For example my whole Chrome regularly hangs with all the windows locking up. I figured out that if you kill the “gpu” process of chrome it continues — it seems to be stuck on some futex. That wouldn’t be too bad, but I noticed that also plugins that I use break regularly.
Time to switch back to Firefox? Wish there were more credible alternatives.
Overview of Last Branch Records (LBRs) on Intel CPUs
I wrote a two part article on theory and practice of Last Branch
Records using Linux perf. They were published on LWN.
This includes the background (what is a last branch record and why
branch sampling), what you can use it for in perf, like hot path
profiling, frame pointerless callgraphs, or automatic micro benchmarks,
and also other uses like continuous compiler feedback.
The articles are now freely available:
Part 1: Introduction of Last Branch Records
Part 2: Advanced uses of Last Branch Records
Speeding up less
Often when doing performance analysis or debugging, it boils down to stare at long text trace files with the less text viewer. Yes you can do a lot of analysis with custom scripts, but at some point it’s usually needed to also look at the raw data.
The first annoyance in less when opening a large file is the time it takes to count lines (less counts lines at the beginning to show you the current position as a percentage). The line counting has the easy workaround of hitting Ctrl-C or using less -n to disable percentage. But it would be still better if that wasn’t needed.
Nicolai Haenle speeded the process by about 20x in his less repository.
One thing that always bothered me was that searching in less is so slow. If you’re browsing a tens to hundreds of MB file file it can easily take minutes to search for a string. When browsing log and trace files searching over longer distances is often very important.
And there is no good workaround. Running grep on the file is much faster, but you can’t easily transfer the file position from grep to the less session.
Some profiling with perf shows that most of the time searching is spent converting each line. Less internally cleans up the line, convert it to canonical case, remove backspace bold, and some other changes. The conversion loop processes each character in a inefficient way. Most of the time this is not needed, so I replaced that with a quick check if the line contains any backspaces using the optimized strchr() from the standard C library. For case conversion the string search functions (either regular expression or fixed string search) can also handle case insensitive search directly, so we don’t need an extra conversion step. The default fixed string search (when the search string contains no regular expression meta characters) can be also done using the optimized C library functions.
The resulting less version searches ~85% faster on my benchmarks. I tried to submit the patch to the less maintainer, but it was ignored unfortunately. The less version in the repository also includes Nicolai’s speedup patches for the initial line counting.
One side effect of the patch is that less now defaults to case sensitive searches. The original less had a feature (or bug) to default to case-insensitive even without the -i option. To get case insensitive searches now “less -i” needs to be used.
[Edit: Fix typos]
TSX anti patterns
I published a new article on TSX anti patterns: Common mistakes that people make when implementing TSX lock libraries or writing papers about them. Make sure you don’t fall into the same traps. Enjoy!
Efficient n-states on x86 systems
Programs often have flags that affect the control flow. The simplest case is a boolean that can be either TRUE or FALSE (normally 1 and 0). The x86 ISA uses condition codes to test and do a conditional jumps. It has a direct flag to test for zero, so that a test for zero and non zero can be efficiently implemented.
 
if (flag == 0) case_1(); 
else case_2(); 
In assembler this will look something like this (don’t worry if you don’t know assembler, just read the comments):
 
test $-1,flag       ; and flag with -1 and set condition codes 
jz   is_zero        ; jump to is_zero if flag was zero 
; flag was non zero here 
 
But what happens when flag can have more than two states (like the classic true, false, file_not_found). How can we implement such a test in the minimum number of instructions? A common way is to use a negative value (-1) for the third state. x86 has a condition code for negative, so the code becomes simply
 
if (flag == 0)       case_1(); 
else if (flag < 0)   case_2(); 
else /* > 0 */       case_3(); 
 (if you try that in gcc use a call, not just a value for the differing control flows, otherwise the compiler may use tricks to implement this without jumps)
In assembler this looks like
 
testl $-1,flag   ; and flag with -1 and set condition codes 
jz case_1        ; jump if flag == 0 
js case_2        ; jump if flag < 0 
; flag > 0 here 
 
Now how do we handle four states with a single test and a minimum number of jumps? Looking at the x86 condition tables it has more condition codes. One interesting but obscure one is the parity flag. Parity is true if the number of bits set is even. In x86 it also only applies to the lowest (least-significant) byte of the value.
Parity used to be used for 7bit ASCII characters to detect errors, but adding a parity bit in the 8th bit that makes the parity always even (or odd, I forgot). These days this is mostly obsolete and usually replaced with better codes that can not only detect, but also correct errors.
Back to the n-value problem. So we can use an additional positive value that has an even parity, that is an even number of bits, like 3 (0b11). The total values now are: -1, 0, 1, 3
In assembler this a straight forward additional test for “p”:
 
testl $-1,flag     ; and flag with -1 and set condition codes 
jz    case_1       ; jump if flag == 0 
js    case_2       ; jump if flag < 0 
jp    case_3       ; jump if flag > 0 and flag has even parity 
; flag > 0 and has odd parity 
 
Note that the 3 would be only detected in the lowest byte, so all 32bit values with 3 in the lowest 8bit would be equivalent to 3. That is also true for all negative values which are the same as -1. Just 0 is lonely.
Now how to implement this in C if you don’t want to write assembler (or not writing a compiler or JIT)? gcc has a __builtin_parity, function but unfortunately it is defined as parity on 4 bytes, not a single byte. This means it cannot map directly to the “p” condition code.
One way is to use the asm goto extension that is available in gcc 4.5 and use inline assembler. With asm goto an inline assembler statement can change control flow and jump directly to C labels.
 
asm goto(
    "      testl $-1,%0\n"
    "      jz   %l1\n"
    "      js   %l2\n"          
    "      jp   %l3" :: "rm" (flag) :: lzero, lneg, lpar);         
/* flag > 0, odd parity here (1) */ 
lzero:  /* flag == 0 here (0) */ 
lneg:   /* flag < 0 here (-1) */ 
lpar:   /* flag > 0 (3) */ 
 
This works and could be packaged into a more generic macro. The disadvantage of course is — apart from being non portable — that the compiler cannot look into the inline assembler code. If flag happens to be a constant known to the compiler it won’t be able to replace the test with a direct jump.
Can this be extended to more than four states? We could add a fifth negative state with two negative numbers that have even and odd parity respective. This would require the negative cases to go through two conditional jumps however: first to check if the value is negative and then another to distinguish the two parity cases.
It’s not clear that is a good idea. Some older x86 micro architectures have limits on the number of jumps in a 16byte region that can be branch predicted (that is why gcc sometimes add weird nops for -mtune=generic). Losing branch prediction would cost us far more than we could gain with any optimization here.
Another approach would be to use a cmp (subtraction) instead of a test. In this case the overflow and carry condition codes would be also set. But it’s not clear if that would gain us any more possibilities.
Update 12/26: Fix typo