Transcript Virtunoid: Breaking out of KVM
Nelson Elhage Black Hat USA 2011
Outline:
Introduction
Related work
Background knowledge
Attack details
CVE-2011-1751 bug details
Exploit details (taking control of %rip)
Injecting shellcode into the host
Disabling non-executable pages
Bypassing ASLR
Conclusions
References
It was found that the PIIX4 Power Management emulation layer in qemu-kvm did not properly check for hot plug eligibility during device removals. A privileged guest user could use this flaw to crash the guest or, possibly, execute arbitrary code on the host. (CVE-2011-1751)
QEMU: a generic and open source machine emulator and virtualizer.
KVM has three components:
kvm.ko
kvm-intel.ko or kvm-amd.ko
qemu-kvm
kvm.ko:
  The core KVM kernel module
  Provides ioctls for communicating with the kernel module
  Primarily responsible for emulating the virtual CPU and MMU
  Emulates a few devices in-kernel for efficiency
  Contains an emulator for a subset of x86, used in handling certain traps
kvm-intel.ko / kvm-amd.ko:
  Provides support for Intel's VMX and AMD's SVM virtualization extensions
  Relatively small compared to the rest of KVM
qemu-kvm:
  Provides the most direct user interface to KVM
  Based on the classic x86 emulator
  Implements the bulk of the virtual devices a VM uses
  Implements a wide variety of possible devices and buses
  An order of magnitude more code than the kernel module
static QEMUTimer *active_timers[QEMU_NUM_CLOCKS];
struct QEMUTimer {
    QEMUClock *clock;
    int64_t expire_time;
    QEMUTimerCB *cb;         /* callback function */
    void *opaque;            /* parameter passed to cb */
    struct QEMUTimer *next;  /* linked list */
};
[Figure: active_timers points to a linked list of QEMUTimer structures.]
Related functions:
qemu_new_timer: allocates memory for a new timer.
qemu_mod_timer: modifies the timer's expiry time and adds the timer to the linked list.
qemu_run_timers: walks the linked list and calls each expired timer's callback function with opaque as its parameter.
The main_loop_wait function iterates over active_timers and calls qemu_run_timers().
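The timer mechanism described above can be sketched in a few lines of C. This is a simplified model, not QEMU's code: the names Timer, mod_timer, run_timers, and bump are stand-ins invented for illustration, and a single list replaces the per-clock active_timers array. It shows the property the rest of the talk relies on: whatever sits on the list is eventually called as cb(opaque).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef void TimerCB(void *opaque);

/* Simplified stand-in for QEMUTimer (hypothetical, for illustration). */
typedef struct Timer {
    int64_t expire_time;   /* when the callback becomes due */
    TimerCB *cb;           /* callback function */
    void *opaque;          /* argument passed to cb */
    struct Timer *next;    /* next timer in the list */
} Timer;

/* Head of the active list; QEMU keeps one such list per clock. */
Timer *active_head;

/* Unlink t if it is queued, set its expiry, re-insert in expiry order. */
void mod_timer(Timer *t, int64_t expire_time)
{
    Timer **pt = &active_head;

    assert(t != NULL && t->cb != NULL);
    while (*pt) {                        /* remove t if already queued */
        if (*pt == t) {
            *pt = t->next;
            break;
        }
        pt = &(*pt)->next;
    }
    t->expire_time = expire_time;
    pt = &active_head;
    while (*pt && (*pt)->expire_time <= expire_time)
        pt = &(*pt)->next;
    t->next = *pt;
    *pt = t;
}

/* Pop and run every timer due at or before now; return how many fired. */
int run_timers(int64_t now)
{
    int fired = 0;

    while (active_head && active_head->expire_time <= now) {
        Timer *t = active_head;
        active_head = t->next;           /* unlink before calling back */
        t->next = NULL;
        t->cb(t->opaque);                /* indirect call through cb */
        fired++;
    }
    return fired;
}

/* Example callback: increments the int that opaque points to. */
void bump(void *opaque)
{
    (*(int *)opaque)++;
}
```

In this model, a caller would queue timers with mod_timer(&t, deadline) and then call run_timers(now) from its main loop, which is the role main_loop_wait plays in the slides.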
RTC (real-time clock): a computer clock that keeps track of the current time. The MC146818 RTC hardware manual can be found at http://wiki.qemu.org/File:MC146818AS.pdf
RTCState structure:
struct RTCState {
    …..
    QEMUTimer *second_timer;
    QEMUTimer *second_timer2;
}
Related functions:
rtc_initfn: initializes the RTC.
rtc_update_second: updates the expiry time of the QEMUTimer and adds it to the linked list.
rtc_initfn:
    RTCState *s = ….
    s->second_timer = qemu_new_timer(rtc_clock, rtc_update_second, s);
    s->second_timer2 = qemu_new_timer(rtc_clock, rtc_update_second2, s);
    qemu_mod_timer(s->second_timer2, s->next_second_time);
[Figure: RTCState.second_timer and RTCState.second_timer2 each point to a QEMUTimer (cb, opaque, next) linked from active_timers; their cb fields are rtc_update_second and rtc_update_second2.]
PIIX4: a south bridge chip.
The default south bridge chip used by qemu-kvm.
Includes ACPI, a PCI-ISA bridge, and an embedded MC146818 RTC.
Supports PCI device hotplug, triggered by writing values to I/O port 0xae08.
QEMU uses qdev_free to emulate device removal.
Certain devices don't support hotplug, but QEMU didn't check for this.
It should not be possible to unplug the ISA bridge. KVM's emulated RTC is not designed to be unplugged.
The unplug path did not check whether the device can be unplugged.
The RTCState is deallocated, yet its second timer is still added to the linked list.
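The missing eligibility check can be illustrated with a small model. This is a sketch under stated assumptions, not the actual QEMU fix: the Device struct, its no_hotplug and present fields, the bitmask interpretation of the written value, and the function eject_slots are all hypothetical names chosen for this example. Clearing present stands in for the call to qdev_free.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified device record; not QEMU's real types. */
typedef struct Device {
    const char *name;
    bool no_hotplug;   /* true if the device must never be removed */
    bool present;      /* still attached to the bus? */
} Device;

/* Eject every present device selected by the guest-written bitmask,
 * skipping devices marked as not hot-pluggable.
 * Returns the number of devices actually removed. */
int eject_slots(Device *slots, size_t nslots, unsigned mask)
{
    int removed = 0;

    assert(slots != NULL);
    for (size_t i = 0; i < nslots && i < 32; i++) {
        if (!(mask & (1u << i)) || !slots[i].present)
            continue;
        if (slots[i].no_hotplug)
            continue;              /* the kind of check the bug lacked */
        slots[i].present = false;  /* stands in for qdev_free() */
        removed++;
    }
    return removed;
}
```

With such a check in place, a request to eject the slot holding the ISA bridge would simply be ignored, so the RTC and its timers would stay alive.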
[Figure sequence: use-after-free of the RTCState]
1. Unplug RTC: RTCState.second_timer points to a QEMUTimer on the active_timers list whose cb is rtc_update_second and whose opaque points to the RTCState.
2. Unplug RTC: the RTCState is freed and its memory becomes a dummy memory region; the QEMUTimer stays on the list.
3. Return to main_loop_wait, which calls qemu_run_timers.
4. QEMUTimer callback: rtc_update_second(opaque) runs with opaque pointing to the dummy memory region.
5. Next main_loop_wait.
1. Inject a controlled QEMUTimer into qemu-kvm.
2. Eject the ISA bridge.
3. Force an allocation into the freed RTCState, with second_timer pointing to our fake QEMUTimer.
The guest RAM is backed by an mmap()ed region inside the qemu-kvm process.
Allocate in guest RAM and calculate the host address with the following formula:
hva = physmem_base + gpa
gpa = page_translation(gva) (see Linux kernel project 1)
gva = guest virtual address
gpa = guest physical address
hva = host virtual address
physmem_base = start of the mmap()ed region
For now, assume we know physmem_base (no ASLR).
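The address formula can be written out as a short C sketch. The helper names gpa_from_pfn and hva_from_gpa are invented for this example, 4 KiB pages are assumed, and how the guest learns the physical frame number for a virtual address (the page_translation step) is left out.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12                       /* assumes 4 KiB pages */
#define PAGE_SIZE  (1ULL << PAGE_SHIFT)

/* gpa = frame number shifted into place, plus the offset within the page. */
uint64_t gpa_from_pfn(uint64_t pfn, uint64_t gva)
{
    return (pfn << PAGE_SHIFT) | (gva & (PAGE_SIZE - 1));
}

/* hva = physmem_base + gpa */
uint64_t hva_from_gpa(uint64_t physmem_base, uint64_t gpa)
{
    assert(gpa <= UINT64_MAX - physmem_base);   /* no wrap-around */
    return physmem_base + gpa;
}
```

For example, with a frame number of 0x1234 and a guest virtual address ending in 0xabc, gpa_from_pfn yields 0x1234abc; adding a physmem_base of 0x7f0000000000 gives the host virtual address 0x7f0001234abc. These numbers are made up for illustration.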
Force QEMU to call malloc:
Use the qemu-kvm user-mode networking stack.
qemu-kvm implements a DHCP server, a DNS server, and a NAT gateway in its user-mode networking stack.
The user-mode stack normally handles packets synchronously.
To prevent recursion, if a second packet is emitted while a first packet is being handled, the second packet is queued using malloc.
Trigger: ICMP ping.
1. Allocate a fake QEMUTimer.
2. Calculate the fake timer's address.
3. Unplug the ISA bridge.
4. Ping the gateway with a packet containing pointers to your fake timer.
[Figure sequence: hijacking the timer list]
1. Allocate fake QEMUTimer: a fake QEMUTimer whose cb points to the evil function (shellcode) is prepared next to the legitimate RTCState and QEMUTimer.
2. Unplug ISA bridge, ping the gateway: the freed RTCState is reallocated with second_timer pointing to the fake QEMUTimer.
3. First main_loop_wait: rtc_update_second runs on the reallocated RTCState.
4. Second main_loop_wait: the fake QEMUTimer is on the active_timers list and its cb, the evil function, is called.
1. We have %rip control.
2. Where is the evil function?
   Inject shellcode into host virtual memory.
   Host virtual memory has page protection (NX bit).
3. Solutions:
   A. ROP
   B. Something clever
1. We control the QEMUTimer data structure.
2. Create multiple QEMUTimer objects and chain them together.
[Figure: a chain of three QEMUTimers linked through next; their callbacks invoke F1(X), F2(Y), and F3(Z) in turn.]
We now have multiple one-argument function calls.
We want to make function calls with more arguments. For example, mprotect takes three arguments.
Arguments of types _Bool, char, short, int, long, long long, and pointers are in the INTEGER class.
If the class is INTEGER, the next available register of the sequence %rdi, %rsi, %rdx, %rcx, %r8 and %r9 is used. For more detail, see reference [7].
Suppose we can find a function with the following property:
set_rsi: mov %rdi, %rsi; ret
Let F1(x) be set_rsi. In most QEMU versions, the %rsi register is not modified by qemu_run_timers() between callbacks.
Therefore F2(y) becomes F2(y, x), since we control %rsi through F1(x).
void cpu_outl(pio_addr_t addr, uint32_t val)
{
    ioport_write(2, addr, val);
}
This function passes its first parameter as the second parameter of ioport_write. %rdi holds the first parameter and %rsi the second, so this function has the property we were looking for (mov %rdi, %rsi).
mprotect prototype: mprotect(addr, len, prot)
PROT_EXEC = 4
Use the function ioport_readl_thunk.
We control opaque/ioport through the QEMUTimer and addr through set_rsi(), so we appear to control everything in this function.
Allocate a fake IORangeOps with fake_ops->read = mprotect.
Allocate a page-aligned IORange with
  fake_ioport->ops = fake_ops
  fake_ioport->base = -PAGE_SIZE
Copy the shellcode immediately after the IORange.
Construct a timer chain that calls
  cpu_outl(0, *)
  ioport_readl_thunk(fake_ioport, 0)
  fake_ioport + 1
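The choice of base = -PAGE_SIZE is easier to follow as arithmetic. The sketch below assumes, which the transcript does not show, that the thunk passes addr - base as the offset argument to ops->read. The helper names thunk_offset and is_page_aligned are invented for this example, and a 4 KiB page is assumed. With addr = 0, unsigned wrap-around turns the offset into exactly one page, which is what lands in the length position of mprotect.

```c
#include <stdint.h>

#define PAGE_SIZE 4096ULL   /* assumes 4 KiB pages */

/* Offset handed to ops->read, ASSUMING it is computed as addr - base. */
uint64_t thunk_offset(uint32_t addr, uint64_t base)
{
    return (uint64_t)addr - base;   /* unsigned wrap-around is well defined */
}

/* Nonzero if p lies on a page boundary, as mprotect requires. */
int is_page_aligned(uint64_t p)
{
    return (p & (PAGE_SIZE - 1)) == 0;
}
```

Under that assumption, thunk_offset(0, -PAGE_SIZE) equals PAGE_SIZE, and the page-alignment requirement explains why the fake IORange itself must be page-aligned.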
[Figure: QEMUTimer chain of three timers (cb, opaque, next) whose callbacks include cpu_outl and ioport_readl_thunk; the page-aligned IORange has ops pointing to an IORangeOps whose read is mprotect, and the rest of the page is filled with shellcode.]
Two addresses are needed:
The base address of the qemu-kvm binary, to find code addresses (such as mprotect ….)
physmem_base, the address of the physical memory mapping inside the qemu-kvm process
Solutions:
Find an information leak.
Assume non-PIE: every major distribution compiles qemu-kvm as a non-position-independent executable.
What about physmem_base?
fw_cfg:
Emulated I/O ports 0x510 (address) and 0x511 (data)
Used to communicate various tables to the QEMU BIOS (e820 map, ACPI tables, etc.)
Also supports exporting writable tables to the BIOS
However, fw_cfg_write doesn't check whether the target table is supposed to be writable
Several fw_cfg areas are backed by statically allocated buffers
Net result: nearly 500 writable bytes inside static variables
mprotect needs a page-aligned address, so these bytes aren't suitable for our shellcode
We can construct fake timer chains in this space to build a read4() primitive (creating an information leak)
Follow pointers from static variables to find physmem_base
Proceed as before
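The writability check that fw_cfg_write lacked can be sketched as follows. This is an illustrative model only: CfgEntry, its writable flag, and cfg_write_byte are hypothetical names, not QEMU's actual fw_cfg structures or API.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified table entry; not QEMU's real fw_cfg entry. */
typedef struct CfgEntry {
    uint8_t *data;     /* backing buffer */
    size_t len;        /* size of the buffer in bytes */
    bool writable;     /* set only for tables meant to be written */
} CfgEntry;

/* Store one byte at offset off; refuse read-only or out-of-range writes.
 * Returns true if the byte was stored. */
bool cfg_write_byte(CfgEntry *e, size_t off, uint8_t value)
{
    if (e == NULL || !e->writable)   /* the kind of check that was missing */
        return false;
    if (off >= e->len)
        return false;
    e->data[off] = value;
    return true;
}
```

With a guard like this, guest writes to tables backed by static read-only data would be rejected, and the static buffers could not be used as scratch space.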
Sandbox qemu-kvm
Build qemu-kvm as PIE
Lazily mmap/mprotect guest RAM
XOR-encode key function pointers
More auditing and fuzzing of qemu-kvm
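As an illustration of the "XOR-encode key function pointers" idea, here is a minimal sketch. The names Callback, ptr_key, encode_cb, decode_cb, and noop_cb are hypothetical, and the fixed key is a placeholder: a real implementation would pick a random secret at process start. A stored callback that an attacker overwrites with a raw address decodes to an unpredictable value instead of the intended target.

```c
#include <stdint.h>

typedef void Callback(void *opaque);

/* Placeholder key for illustration; a real key would be random per process. */
static uintptr_t ptr_key = (uintptr_t)0x5bd1e995u;

/* Encode a callback pointer before storing it in a heap structure. */
uintptr_t encode_cb(Callback *cb)
{
    return (uintptr_t)cb ^ ptr_key;
}

/* Decode a stored value back into a callable pointer just before use. */
Callback *decode_cb(uintptr_t enc)
{
    return (Callback *)(enc ^ ptr_key);
}

/* Trivial callback used to exercise the round trip. */
void noop_cb(void *opaque)
{
    (void)opaque;
}
```

The round trip decode_cb(encode_cb(f)) returns f, while the stored value differs from the raw address of f.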
VM breakouts aren't magic.
Hypervisors are just as vulnerable as anything else.
Device drivers are the weak spot.
[1] http://qemu.weilnetz.de/qemu-tech.html
[2] http://qemu.weilnetz.de/doxygen/structRTCState.html
[3] http://www.linuxinsight.com/files/kvm_whitepaper.pdf
[4] https://www.ibm.com/developerworks/cn/linux/l-virtio/
[5] http://smilejay.com/kvm_theory_practice/
[6] http://www.linux-kvm.org/page/Documents
[7] http://www.cs.tufts.edu/comp/40/readings/amd64-abi.pdf
[8] http://linuxfromscratch.xtra-net.org/hlfs/view/unstable/glibc 2.4/chapter02/pie.html
qemu source code