Archive for March, 2025
Quitting an Intel x86 hypervisor
This is an esoteric topic that might be of interest to people implementing Intel hypervisors. It assumes you know the basics of the Intel virtualization architecture; see Hypervisor from scratch for a tutorial. The full VT architecture is described in Volume 3 of the Intel SDM.
Let’s say we write an x86 hypervisor that starts in the UEFI environment and virtualizes the initialization phase of an OS. But the hypervisor wants to quit eventually so that it does not cause extra overhead during OS run time.
The way the hypervisor works is that it runs in its own memory and with its own page tables which are switched atomically on every VM exit by the VT-x implementation. This way it is isolated from the main OS.
At some exit, with the hypervisor running in its own context, it decides that it is not needed anymore and wants to quit. To disable VT support the VMXOFF instruction can be used. But what we really need is an atomic VMXOFF plus a switch to the original OS page tables plus a jump, and all that without using any registers, because they must already be restored to the original state of the OS.
One trick is to use the MOV to CR3 instruction, which reloads the page table, as a jump. As soon as the page table is reloaded the CPU will fetch the next instruction with the translations from the freshly loaded page table, so we can transfer execution to the guest context. However, to do that the MOV to CR3 needs to sit just before the page offset of the target instruction, so that it ends exactly where the target instruction begins. This can be done by copying a trampoline to the right page offset (potentially overlapping into the previous page). The trampoline is located in a special transfer page table mapping that places writable code pages overlapping the target mapping.
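The placement rule can be sketched as a small address calculation. This is a minimal sketch, not the author's code: `place_trampoline`, `struct tramp_placement`, and the 4 KiB page size are assumptions made for illustration. Given the guest address to resume at and the trampoline length, it computes where the trampoline must start so that its final byte lies directly before the target, and how many pages the transfer mapping has to overlay.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE 4096ull /* assumption: 4 KiB pages in the transfer mapping */

/* Hypothetical result type: where the trampoline has to be copied. */
struct tramp_placement {
    uint64_t start;      /* address of the first trampoline byte */
    uint64_t first_page; /* page holding the first trampoline byte */
    uint64_t last_page;  /* page holding the final byte of MOV to CR3 */
    unsigned pages;      /* 1, or 2 if the code straddles a page boundary */
};

static uint64_t page_of(uint64_t addr)
{
    return addr & ~(PAGE_SIZE - 1);
}

/* Place a trampoline of tramp_len bytes so that it ends exactly at
 * target_rip, the first guest instruction fetched after the CR3 reload. */
static struct tramp_placement place_trampoline(uint64_t target_rip,
                                               size_t tramp_len)
{
    struct tramp_placement p;

    assert(tramp_len > 0 && tramp_len <= PAGE_SIZE);
    assert(target_rip >= tramp_len);

    p.start = target_rip - tramp_len;
    p.first_page = page_of(p.start);
    p.last_page = page_of(target_rip - 1);
    p.pages = (p.first_page == p.last_page) ? 1u : 2u;
    return p;
}
```

For a target close to the start of its page, `pages` comes out as 2, which is the "overlapping into the previous page" case described above.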
But there are some complications. The hypervisor also needs to load the segmentation state (like GDT/LDT) of the guest. In theory these tables could just be loaded by mapping the guest pages that hold them into the transfer mapping and loading them before the transfer. But what happens if the GDT/LDT is on the same page as the target address? (This is common in real OSes’ assembler startup code, which is a small assembler file without any page separation between code and data.) One option would be to copy the tables to the transfer page too and load them there. Alternatively, the hypervisor first copies them to a temporary buffer and loads them from there. In the second option the base addresses of these structures will be incorrect, but in practice you can often rely on them getting reloaded eventually anyway.
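Whether a descriptor table collides with the trampoline pages is a plain range overlap test. The following is a rough sketch under assumed names (`struct desc_table`, `pick_load_strategy`, and the two enum values are all hypothetical); it only decides whether the table can be loaded from its original guest page or needs one of the copy options above.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_MASK 0xFFFull /* assumption: 4 KiB pages */

/* Base and limit as stored in GDTR/LDTR; limit is the table size minus 1. */
struct desc_table {
    uint64_t base;
    uint16_t limit;
};

enum load_strategy {
    LOAD_IN_PLACE, /* the guest page can be mapped into the transfer mapping */
    LOAD_FROM_COPY /* table shares a page with the trampoline, copy it first */
};

/* The trampoline occupies [tramp_start, target_rip). Any page it touches
 * is overlaid by the transfer mapping, so a table on such a page is not
 * visible at its guest address while the trampoline runs. */
static enum load_strategy pick_load_strategy(struct desc_table t,
                                             uint64_t tramp_start,
                                             uint64_t target_rip)
{
    uint64_t tramp_first, tramp_last, tbl_first, tbl_last;

    assert(tramp_start < target_rip);

    tramp_first = tramp_start & ~PAGE_MASK;
    tramp_last = (target_rip - 1) & ~PAGE_MASK;
    tbl_first = t.base & ~PAGE_MASK;
    tbl_last = (t.base + t.limit) & ~PAGE_MASK;

    if (tbl_first <= tramp_last && tramp_first <= tbl_last)
        return LOAD_FROM_COPY;
    return LOAD_IN_PLACE;
}
```

The check is deliberately conservative: it treats any shared page as a conflict, even if the table bytes and the trampoline bytes do not overlap within that page.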
Another problem is the register state of the target. MOV to CR3 needs a register as the source of the reload, and it needs to be the last instruction of the trampoline. So it is impossible to restore the register it uses. But remember the hypervisor is doing this as the result of a VM exit. If we choose an exit for a condition that already clobbers a register, we can use the same register for the reload, and the next instruction executed in the original guest (the one that caused the exit originally) will just overwrite it again.
A very convenient instruction for this is CPUID. It is executed multiple times in OS startup and clobbers multiple registers. In fact VMX always intercepts CPUID, so the hypervisor has to handle these exits in any case. So the trick to quit a hypervisor is to wait for the next CPUID exit and then use one of the registers clobbered by CPUID for the final CR3 reload. Because the guest then re-executes the CPUID itself, the register should be one that CPUID writes but does not read as an input, such as EBX or EDX. This will have inconsistent register state for one instruction in the target, but unless the original OS is currently running a debugger it will never notice. In principle any exit caused by an instruction that clobbers a register can be used for this.
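To make the CPUID trick concrete, here is a minimal sketch of the exit check and of the last two trampoline instructions as raw bytes. The helper names `exit_allows_quit` and `emit_quit_tail` are hypothetical. The sketch covers only the tail of the trampoline and ignores all other state that must be restored before it. It restricts the scratch register to RBX or RDX on the assumption that the guest re-executes the CPUID, which reads EAX and ECX as inputs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Basic exit reason 10 is CPUID in the Intel SDM's exit reason table. */
#define EXIT_REASON_CPUID 10u

/* Register numbers as encoded in the ModRM r/m field. */
enum gpr { GPR_RAX = 0, GPR_RCX = 1, GPR_RDX = 2, GPR_RBX = 3 };

/* Hypothetical policy hook: quit only on an exit whose instruction will
 * overwrite the scratch register when the guest re-executes it. */
static int exit_allows_quit(uint32_t exit_reason)
{
    return (exit_reason & 0xFFFFu) == EXIT_REASON_CPUID; /* bits 15:0 */
}

/* Emit the last two trampoline instructions into buf:
 *   VMXOFF        -> 0F 01 C4
 *   MOV CR3, src  -> 0F 22 /r  (mod = 11, reg = 3 for CR3, r/m = src)
 * Returns the number of bytes written, or 0 if buf is too small. */
static size_t emit_quit_tail(uint8_t *buf, size_t cap, enum gpr src)
{
    uint8_t code[6] = { 0x0F, 0x01, 0xC4, 0x0F, 0x22, 0x00 };

    /* CPUID takes its inputs in EAX and ECX, so only RBX/RDX are allowed. */
    assert(src == GPR_RBX || src == GPR_RDX);

    code[5] = (uint8_t)(0xC0u | (3u << 3) | (unsigned)src);
    if (cap < sizeof code)
        return 0;
    memcpy(buf, code, sizeof code);
    return sizeof code;
}
```

The returned length is the value a placement calculation would need, since the MOV to CR3 has to end exactly at the guest's CPUID instruction.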
There is another potential complication if the target address of the OS conflicts with where the hypervisor is running before entering the transfer mapping. This could be solved with a third auxiliary mapping that is used before jumping to the transfer trampoline. In practice it doesn’t seem to be a problem because x86 OSes typically run in a 1:1 mapping for startup, and that cannot conflict with the 1:1 mapping used by UEFI programs such as our hypervisor.
Happy hypervisor hacking!