curiosities at Andi Kleen's blog

Archive for the ‘curiosities’ Category

Quitting an Intel x86 hypervisor

This is an esoteric topic that might be of interest to people implementing Intel hypervisors. It assumes you know the basics of the Intel virtualization architecture, see Hypervisor from scratch for a tutorial. The actual full VT architecture is described in Volume 3 of the Intel SDM

Let’s say we write an x86 hypervisor that starts in the UEFI environment and virtualizes the initialization phase of an OS. But the hypervisor wants to eventually quit itself to not cause extra overhead during OS run time.

The way the hypervisor works is that it runs in its own memory and with its own page tables which are switched atomically on every VM exit by the VT-x implementation. This way it is isolated from the main OS.

At some vm exit with the hypervisor running in its own context it decides that it is not needed anymore and wants to quit. To disable VT support the VMXOFF instruction can be used. But what we really need is an atomic VMXOFF + switch to the original OS page tables plus a jump, and all that without using any registers which need to be already restored to the original state of the OS.

One trick is to use the MOV to CR3 instruction that reloads the page table as a jump. As soon as the page table is reloaded the CPU will fetch the next instruction with the translations from the freshly loaded page table, so we can transfer execution to the guest context. However to do that the MOV CR3 needs to be located just before the page offset of the target instruction. This can be done by copying a trampoline to the right page offset (potentially overlapping into the previous page). The trampoline is located in a special transfer page table mapping that places writable code pages overlapping the target mapping.

But there are some complications. The hypervisor also needs to load the segmentation state (like GDT/LDT/IDT) of the guest. In theory they could just be loaded by mapping these guest pages into the transfer mapping and loading them before the transfer. But what happens if the GDT/LDT/IDT is on the same page as the target address? This is common in real OS’ assembler startup code which is implemented in a small assembler file without any page separation between code and data. One option would be to copy them to the transfer page too and load it there, or the hypervisor first copies them to a temporary buffer and loads it from there. In the second option the base addresses of these structures will be incorrect, but in practice you can often rely on them getting reloaded eventually anyways.

Another problem is the register state of the target. MOV to CR3 needs a register as the source of the reload, and it needs to be the last instruction of the trampoline. So it is impossible to restore the register it uses. But remember the hypervisor is doing this as the result of a VM exit. If we chose an exit for a condition that already clobbers a register we can use the same register for the reload and the next instruction executed in the original target (and which caused the exit originally) will just overwrite it again.

A very convenient instruction for this is CPUID. It is executed multiple times in OS startup and overwrites multiple registers. In fact VMX always intercepts CPUID so it has to handle these exits in any case. So the trick to quit an hypervisor is to wait for the next CPUID exit and then use one of the registers clobbered by CPUID for the final CR3 reload. This will have inconsistent register state for one instruction in the target, but unless the original OS is currently running a debugger it will never notice. In principle any exit as a result of an instruction that clobbers a register can be used for this.

There is another potential complication if the target address of the OS conflicts with where the hypervisor is running before entering the transfer mapping. The transfer mapping needs to map the original code so that it can be jumped to. This could be solved with a third auxiliary mapping that is used before jumping to the transfer trampoline. In practice it doesn’t seem to be a problem because x86 OS typically run in a 1:1 mapping for startup, and that cannot conflict with the 1:1 mapping used by UEFI programs as our hypervisor.

Happy hypervisor hacking!

Written by therapsid

March 18th, 2025 at 9:34 pm

Posted in curiosities,kernel

Efficient n-states on x86 systems

with 7 comments

Programs often have flags that affect the control flow. The simplest case is a boolean that can be either TRUE or FALSE (normally 1 and 0). The x86 ISA uses condition codes to test and do a conditional jumps. It has a direct flag to test for zero, so that a test for zero and non zero can be efficiently implemented.

 
if (flag == 0) case_1(); 
else case_2();

In assembler this will look something like this (don’t worry if you don’t know assembler, just read the comments):

 
test $-1,flag       ; and flag with -1 and set condition codes 
jz   is_zero        ; jump to is_zero if flag was zero 
; flag was non zero here

But what happens when flag can have more than two states (like the classic true, false, file_not_found). How can we implement such a test in the minimum number of instructions? A common way is to use a negative value (-1) for the third state. x86 has a condition code for negative, so the code becomes simply

 
if (flag == 0)       case_1(); 
else if (flag < 0)   case_2(); 
else /* > 0 */       case_3();

(if you try that in gcc use a call, not just a value for the differing control flows, otherwise the compiler may use tricks to implement this without jumps)

In assembler this looks like

 
testl $-1,flag   ; and flag with -1 and set condition codes 
jz case_1        ; jump if flag == 0 
js case_2        ; jump if flag < 0 
; flag > 0 here

Now how do we handle four states with a single test and a minimum number of jumps? Looking at the x86 condition tables it has more condition codes. One interesting but obscure one is the parity flag. Parity is true if the number of bits set is even. In x86 it also only applies to the lowest (least-significant) byte of the value.

Parity used to be used for 7bit ASCII characters to detect errors, but adding a parity bit in the 8th bit that makes the parity always even (or odd, I forgot). These days this is mostly obsolete and usually replaced with better codes that can not only detect, but also correct errors.

Back to the n-value problem. So we can use an additional positive value that has an even parity, that is an even number of bits, like 3 (0b11). The total values now are: -1, 0, 1, 3

In assembler this a straight forward additional test for “p”:

 
testl $-1,flag     ; and flag with -1 and set condition codes 
jz    case_1       ; jump if flag == 0 
js    case_2       ; jump if flag < 0 
jp    case_3       ; jump if flag > 0 and flag has even parity 
; flag > 0 and has odd parity

Note that the 3 would be only detected in the lowest byte, so all 32bit values with 3 in the lowest 8bit would be equivalent to 3. That is also true for all negative values which are the same as -1. Just 0 is lonely.

Now how to implement this in C if you don’t want to write assembler (or not writing a compiler or JIT)? gcc has a __builtin_parity, function but unfortunately it is defined as parity on 4 bytes, not a single byte. This means it cannot map directly to the “p” condition code.

One way is to use the asm goto extension that is available in gcc 4.5 and use inline assembler. With asm goto an inline assembler statement can change control flow and jump directly to C labels.

 
asm goto(
    "      testl $-1,%0\n"
    "      jz   %l1\n"
    "      js   %l2\n"          
    "      jp   %l3" :: "rm" (flag) :: lzero, lneg, lpar);         
/* flag > 0, odd parity here (1) */ 
lzero:  /* flag == 0 here (0) */ 
lneg:   /* flag < 0 here (-1) */ 
lpar:   /* flag > 0 (3) */

This works and could be packaged into a more generic macro. The disadvantage of course is — apart from being non portable — that the compiler cannot look into the inline assembler code. If flag happens to be a constant known to the compiler it won’t be able to replace the test with a direct jump.

Can this be extended to more than four states? We could add a fifth negative state with two negative numbers that have even and odd parity respective. This would require the negative cases to go through two conditional jumps however: first to check if the value is negative and then another to distinguish the two parity cases.

It’s not clear that is a good idea. Some older x86 micro architectures have limits on the number of jumps in a 16byte region that can be branch predicted (that is why gcc sometimes add weird nops for -mtune=generic). Losing branch prediction would cost us far more than we could gain with any optimization here.

Another approach would be to use a cmp (subtraction) instead of a test. In this case the overflow and carry condition codes would be also set. But it’s not clear if that would gain us any more possibilities.

Update 12/26: Fix typo

Written by therapsid

December 23rd, 2012 at 7:58 am

Posted in curiosities,kernel,not-really,tuning,Uncategorized

A calculator for PCMPSTR

with 4 comments

String processing is quite common in software. This can be simple operations like searching for characters in a string, more complex subset and sub string searches to complex transformations. Most string processing code processes strings byte by byte (or 16bit word by word for 16bit unicode). That is fine for code that is not time critical.

A modern CPU can process 32bit-256bit at a time inside a core, if it does only 8 bit in each operation then large parts of the computing capacity are unused. If the string code processes a lot of data and slows down the program it’s better to find some way to use this unused computing capability. I’m only talking about a single core here (thread level parallelism), not multi-core.

For simple standard algorithms like strlen there are algorithms available that allow to process the string in 32bit or 64bit units using various bit manipulation tricks. (Hacker’s delight has all the details) Typically C libraries already use optimized algorithms (although the compiler may not always use the standard library if you just call strlen). But they are usually too difficult to use for custom string processing.

Also they usually need special cases to handle the end of the string (the length of a string is often not known in advance) and are complicated to validate and write. Also typically is only a win for relatively simple algorithms, and come at a heavy cost of programer time.

On modern x86 CPUs an alternative is to use the SSE 4.2 string instructions PCMP*STR*. They are essentially programmable comparison engines to compare two 128bit vectors at a time. They can handle string end detection and various other cases directly. They are very powerful, but since they are so configurable also hard to use (2^9 possible combinations). The hardware can do all these operations in parallel, so it’s more efficient than a explicit loop.

These instructions are much easier to use than the old bit twiddling tricks, but they are still hard.

They operate on two 128 bit vector inputs and compare all the 8bit (or 16bit) pairs based on a 8bit configuration flag word. The input can be 0 terminated strings or explicit lengths for both vectors. The output is either a vector mask or a index, plus some conditional codes.

The instructions can be either used from assembler, or using C intrinsics.

To make this all easier to use I did the PCMPSTR calculator that allows to interactively configure and explore these instructions. It generates snippets that can be used in programs.

Feedback welcome and let me know if you find a new interesting use with this.

Some tips for tuning

Always profile first to see if something is time critical. Don’t bother with code that does not show up in the profiler
Understand what’s common and uncommon in typical input data
Always have a reference version of the algorithm that does things in a straight forward matter and is optimized for clarity, not speed. First write the program using the reference code and profile before tuning. Have a debug build that always compares the output of reference and optimized.
Make sure the reference version stays up-to-date when the program changes.

Written by therapsid

December 12th, 2012 at 8:44 pm

Posted in curiosities,kernel,not-really,tuning

The weekend error anomaly

with 5 comments

I run mcelog.org which describes the standard way in Linux to handle machine check errors. Most of the hits of the website are just people typing a error log into a search engine mcelog.org ranks quite high on Linux machine check related terms.

The log files give me some indication how many errors are occurring on Linux systems in the field. Most of the errors are corrected memory errors on systems with ECC memory: a bit flipped, but the ECC code corrected it and and no actual data corruption occurred. (In this sense they are not actually errors, a more correct term would be “events”). Other errors like network errors or disk errors are not logged by mcelog.

I noticed is that there seem to be less memory errors on weekends. Normally the distribution of hits is fairly even over the week. But on Saturday and Sunday it drops into half.

It’s interesting to speculate why this this weekend anomaly happens.

ECC memory is normally only on server systems, which should be running 24h. In principle errors should be evenly distributed over the whole week.

Typing the error into google is no automated procedure. A human has to read the log files and do it manually.

If people are more likely to do this on work days one would expect that they would catch up on the errors from the weekend on Monday. So Monday should have more hits. But that’s not in the data: Monday is not different from other weekdays.

It’s also sticky for each system (or rather each human googling). Presumably the person will google the error only once no matter how many errors their system have and after that “cache” the knowledge what the error means. So the mcelog.org hits are more a indication of “first memory error in the career of a system administrator” (assuming the admin has perfect memory, which may be a bold assumption). But given a large enough supply of new sysadmins this should be still a reasonable indication of the true number of errors (at least on the systems likely to be handled by rookie administrators)

The hour distribution is more even, with 9-10 slightly higher. Not sure which time zone and what that means on the geographical distribution of errors and rookie admins.

One way to explain the weekend anomaly could be that that servers may be more busy on weekdays and they may have more errors when they are busy. Are these two assumptions true? I don’t know. It would be interesting to know if this shows up in other peoples large scale error collections too.

I wonder if it’s possible to detect solar flares in these logs. Need to find a good data source for them. Are there any other events generating lots of radiation that may affect servers? I hope there will never be a nuke or a super nova blast in the data.

Written by therapsid

April 20th, 2012 at 3:11 am

Posted in curiosities,kernel

Andi Kleen's blog

Archive for the ‘curiosities’ Category

Quitting an Intel x86 hypervisor

Efficient n-states on x86 systems

A calculator for PCMPSTR

Some tips for tuning

The weekend error anomaly

Pages

Search

Blogroll

Archives

Categories

Meta