Andi Kleen: Quitting an Intel x86 Hypervisor
This article delves into the esoteric world of Intel hypervisors, a topic of interest for those implementing virtualization architectures on the x86 platform.
To write an x86 hypervisor that starts in the UEFI environment and virtualizes the initialization phase of an OS, one must first understand the basics of the Intel virtualization architecture. For a detailed tutorial, please refer to our previous article on "Hypervisor from Scratch." The full VT architecture is described in Volume 3 of the Intel SDM.
Assume we have created an x86 hypervisor that runs in its own memory and with its own page tables, which the VT-x implementation switches atomically on every VM exit. This isolation keeps the hypervisor separate from the main OS.
However, as the hypervisor continues to run, it may decide that it is no longer needed and wants to quit. To disable VT support, the VMXOFF instruction can be used. But what we really need is an atomic VMXOFF + switch to the original OS page tables + a jump, all without using any registers that would then need to be restored to their original OS state.
One trick to achieve this is to use the MOV to CR3 instruction, which reloads the page table, as a jump. As soon as the page table is reloaded, the CPU fetches the next instruction using the translations from the freshly loaded page table, which transfers execution to the guest context.
However, to make this work, the MOV CR3 needs to be just before the page offset of the target instruction. This can be done by copying a trampoline to the right page offset (potentially overlapping into the previous page). The trampoline is located in a special transfer page table mapping that places writable code pages overlapping the target mapping.
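The placement arithmetic can be sketched in C. This is only an illustration under assumptions that are not in the text above: 4 KiB pages, a trampoline no longer than one page, and hypothetical names (`place_trampoline`, `struct tramp_placement`). It computes where the trampoline has to be copied so that its last byte (the end of the MOV to CR3) sits directly before the target instruction, and which pages the transfer mapping has to provide as writable code pages.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE 4096ULL
#define PAGE_MASK (~(PAGE_SIZE - 1))

/* Hypothetical helper type: where the trampoline lands in the transfer mapping. */
struct tramp_placement {
    uint64_t start;      /* address of the first trampoline byte */
    uint64_t first_page; /* page containing the first trampoline byte */
    uint64_t last_page;  /* page containing the last trampoline byte */
    unsigned npages;     /* 1, or 2 if it overlaps into the previous page */
};

/*
 * Place a trampoline of tramp_len bytes so that it ends exactly at
 * `target`, the address of the guest instruction to resume at.
 */
static struct tramp_placement place_trampoline(uint64_t target, uint64_t tramp_len)
{
    struct tramp_placement p;

    assert(tramp_len > 0 && tramp_len <= PAGE_SIZE);
    assert(target >= tramp_len);

    p.start = target - tramp_len;
    p.first_page = p.start & PAGE_MASK;
    /* The last trampoline byte is at target - 1. */
    p.last_page = (target - 1) & PAGE_MASK;
    p.npages = (p.first_page == p.last_page) ? 1 : 2;
    return p;
}
```

For example, a 16-byte trampoline for a target at 0x101004 starts at 0x100ff4 and spans two pages, while the same trampoline for a target at 0x101040 fits into the target's own page.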
But there are some complications to consider. The hypervisor also needs to load the segmentation state (like GDT/LDT) of the guest. In theory, these tables could just be loaded by mapping the guest pages that hold them into the transfer mapping and loading them before the transfer. However, what happens if the GDT/LDT is on the same page as the target address? This is a common scenario in the assembler startup code of real OSes, which often lacks page separation between code and data.
One option would be to copy the tables to the transfer page too and load them from there. Alternatively, the hypervisor first copies them to a temporary buffer and loads them from there. With the second option, the base addresses of these structures will be incorrect, but in practice, one can often rely on them getting reloaded eventually anyway.
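Whether the workaround is needed at all comes down to a page overlap test. The following C sketch shows one way to express it; the function name and parameters are hypothetical, 4 KiB pages are assumed, and the table limit is taken in the GDTR/LDTR sense (size in bytes minus one).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Round an address down to its 4 KiB page base (assumed page size). */
static uint64_t page_of(uint64_t addr)
{
    return addr & ~0xFFFULL;
}

/*
 * Returns true if a descriptor table occupying [table_base,
 * table_base + table_limit] touches any page that is covered by the
 * trampoline at [tramp_start, tramp_start + tramp_len - 1]. In that
 * case the table cannot simply be loaded from its guest page inside
 * the transfer mapping and has to be copied first.
 */
static bool table_shares_page_with_trampoline(uint64_t table_base,
                                              uint16_t table_limit,
                                              uint64_t tramp_start,
                                              uint64_t tramp_len)
{
    assert(tramp_len > 0);

    uint64_t table_first = page_of(table_base);
    uint64_t table_last = page_of(table_base + table_limit);
    uint64_t tramp_first = page_of(tramp_start);
    uint64_t tramp_last = page_of(tramp_start + tramp_len - 1);

    /* Two page ranges overlap if each starts no later than the other ends. */
    return table_first <= tramp_last && tramp_first <= table_last;
}
```

A hypervisor could run this check for the GDT and the LDT before building the transfer mapping and only fall back to one of the copy options when it returns true.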
Another problem is the register state of the target. The MOV to CR3 needs a register as the source of the reload, and it needs to be the last instruction of the trampoline, so it is impossible to restore the register it uses. However, if we choose an exit for a condition that already clobbers a register, we can use the same register for the reload, and the next instruction executed in the original guest (the one that caused the exit originally) will just overwrite it again.
A very convenient instruction for this is CPUID. It is executed multiple times in OS startup and clobbers multiple registers. In fact, VMX always intercepts CPUID, so the hypervisor has to handle these exits in any case. Therefore, the trick to quit a hypervisor is to wait for the next CPUID exit and then use one of the registers clobbered by CPUID for the final CR3 reload.
This will have inconsistent register state for one instruction in the target, but unless the original OS is currently running a debugger, it will never notice. In principle, any exit as a result of an instruction that clobbers a register can be used for this.
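The decision logic in the exit handler might look like the following C sketch. It is a model of the decision only, not real VMX code: the types and the `plan_quit` function are hypothetical, and the value 10 is the basic exit reason for CPUID as listed in the Intel SDM. One assumption goes beyond the text above: since the guest re-executes CPUID natively after the transfer, RAX and RCX still have to hold the guest's leaf and subleaf inputs, so the sketch picks RBX, which CPUID only writes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define EXIT_REASON_CPUID 10u /* basic VM exit reason for CPUID */

/* Hypothetical: which guest register carries the CR3 value. */
enum scratch_reg { SCRATCH_NONE, SCRATCH_RBX };

/* Hypothetical description of what the quit path should do. */
struct quit_plan {
    bool quit_now;            /* true: leave VMX on this exit */
    enum scratch_reg scratch; /* register used as MOV CR3 source */
    uint64_t scratch_value;   /* guest CR3 to load through it */
    uint64_t resume_rip;      /* guest address to continue at */
};

static struct quit_plan plan_quit(uint32_t basic_exit_reason,
                                  bool quit_requested,
                                  uint64_t guest_rip,
                                  uint64_t guest_cr3)
{
    struct quit_plan plan = { false, SCRATCH_NONE, 0, 0 };

    /* Keep running as a hypervisor until a suitable exit arrives. */
    if (!quit_requested || basic_exit_reason != EXIT_REASON_CPUID)
        return plan;

    plan.quit_now = true;
    /* Assumption: RBX is output-only for CPUID, RAX/RCX are inputs. */
    plan.scratch = SCRATCH_RBX;
    plan.scratch_value = guest_cr3;
    /* RIP is not advanced: the guest re-executes CPUID and overwrites RBX. */
    plan.resume_rip = guest_rip;
    return plan;
}
```

The handler would call this on every exit once it has decided to quit, emulate CPUID as usual while `quit_now` is false, and enter the transfer trampoline with the chosen register loaded once it is true.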
Potential Complications and Solutions
There is another potential complication if the target address of the OS conflicts with where the hypervisor is running before entering the transfer mapping. This could be solved with a third auxiliary mapping that is used before jumping to the transfer trampoline.
In practice, it doesn't seem to be a problem, because x86 OSes typically run in a 1:1 mapping during startup, and that cannot conflict with the 1:1 mapping used by UEFI programs such as our hypervisor.