DEV Community: Stjepan

Porting my Rust KVM hypervisor to ARM64: Running the binary

Stjepan — Wed, 22 Jul 2026 18:00:25 +0000

Recap

In the previous article we ported the register handling
of my Rust KVM hypervisor from x86 to ARM64. By tracing QEMU with strace and
inspecting the Linux headers, we discovered that ARM64 accesses CPU state
through the KVM_GET_ONE_REG and KVM_SET_ONE_REG interface and reconstructed
the register id encoding in Rust.

At that point the hypervisor compiled successfully, but trying to access even
the program counter resulted in an ENOEXEC error. In this article we'll
continue the reverse-engineering process, initialize the vCPU correctly, and
finally execute our first ARM64 guest.

Second roadblock: ENOEXEC when getting / setting PC

I was rather optimistic this would work out of the box, but even something as
simple as reading the PC register hit a snag:

ioctl(5<anon_inode:kvm-vcpu:0>, KVM_ARM_SET_DEVICE_ADDR or KVM_GET_ONE_REG, 0x7fdf2c81e8) = -1 ENOEXEC (Exec format error)
Error: Os { code: 8, kind: Uncategorized, message: "Exec format error" }

Luckily, I now had a concrete lead: an error code and a KVM command. So I
searched for KVM_GET_ONE_REG in the Linux kernel source on Bootlin Elixir and
found the following code in arch/arm64/kvm/arm.c:

    case KVM_SET_ONE_REG:
    case KVM_GET_ONE_REG: {
        struct kvm_one_reg reg;

        r = -ENOEXEC;
        if (unlikely(!kvm_vcpu_initialized(vcpu)))
            break;

This is exactly the error we were getting and we immediately found what was
missing: vCPU initialization. Actually, if we take a closer look at the code we
will see the comment kernel developers have left us:

int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
{
    int err;

    /* Force users to call KVM_ARM_VCPU_INIT */
    vcpu->arch.target = -1;

Initializing the vCPU

Following the lead I looked for KVM_ARM_VCPU_INIT in the original strace log
and found:

137152 ioctl(12<anon_inode:kvm-vcpu:0>, 0x4020aeae /* KVM_ARM_VCPU_INIT */, 0x7fff57e028) = 0
137156 ioctl(10<anon_inode:kvm-vcpu:0>, 0x4020aeae /* KVM_ARM_VCPU_INIT */, 0x7f82f70e68) = 0
137152 ioctl(10<anon_inode:kvm-vcpu:0>, 0x4020aeae /* KVM_ARM_VCPU_INIT */, 0x7fff57dd78) = 0
137152 ioctl(10<anon_inode:kvm-vcpu:0>, 0x4020aeae /* KVM_ARM_VCPU_INIT */, 0x7fff57e048) = 0

This wasn't really helpful: once again we only see addresses, with no hint of
the underlying structure. So let's first take a look at how KVM_ARM_VCPU_INIT
is defined in the Linux headers:

$ grep -Rn KVM_ARM_VCPU_INIT /usr/include/
/usr/include/linux/kvm.h:1622:#define KVM_ARM_VCPU_INIT   _IOW(KVMIO,  0xae, struct kvm_vcpu_init)

A quick search for struct kvm_vcpu_init revealed:

$ grep -Rn 'struct kvm_vcpu_init' /usr/include/ -A3
/usr/include/aarch64-linux-gnu/asm/kvm.h:112:struct kvm_vcpu_init {
/usr/include/aarch64-linux-gnu/asm/kvm.h-113-   __u32 target;
/usr/include/aarch64-linux-gnu/asm/kvm.h-114-   __u32 features[7];
/usr/include/aarch64-linux-gnu/asm/kvm.h-115-};
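
As a quick sanity check, the header definition can be mirrored by hand in Rust
and compared against the size encoded in the ioctl number from the strace log.
This is only a sketch: the names KvmVcpuInit, ioctl_size and
layout_matches_ioctl are made up for illustration, and a real project would
more likely take the struct from bindgen.

```rust
use std::mem::size_of;

/// Hand-written mirror of `struct kvm_vcpu_init` from the arm64 header above
/// (illustrative name, not a type from any crate).
#[repr(C)]
#[derive(Debug, Default, Clone, Copy)]
pub struct KvmVcpuInit {
    pub target: u32,
    pub features: [u32; 7],
}

/// Extracts the argument size that a Linux ioctl number carries in bits 16..30.
pub const fn ioctl_size(request: u64) -> usize {
    ((request >> 16) & 0x3fff) as usize
}

/// Returns true when the Rust layout has the size the kernel expects for
/// KVM_ARM_VCPU_INIT (0x4020aeae in the strace log).
pub fn layout_matches_ioctl() -> bool {
    size_of::<KvmVcpuInit>() == ioctl_size(0x4020aeae)
}
```

The 0x20 in 0x4020aeae is exactly the 32 bytes of one u32 plus seven u32
values, so a mismatch here would point to a wrong struct definition.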

Getting struct data with Python GDB

I didn't want to spend much time understanding every target and feature flag. My
goal was simply to reproduce QEMU's behavior as quickly as possible, even
without understanding it completely. Luckily, this structure is really simple
(basically an array of eight 32-bit integers), so I decided to tackle it with
Python GDB automation. I created a breakpoint on ioctl that logs the argument
whenever the request is KVM_ARM_VCPU_INIT:

import gdb
import struct

def log(line):
    print(line)
    with open("gdb-debug.out", "a") as f:
        f.write(line + "\n")

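class IoctlBreakpoint(gdb.Breakpoint):

    def stop(self):
        # AArch64 calling convention: x1 holds the ioctl request, x2 the argument pointer.
        x1 = int(gdb.parse_and_eval("$x1")) & 0xffffffff
        if x1 == 0x4020aeae: # KVM_ARM_VCPU_INIT
            ptr = int(gdb.parse_and_eval("$x2"))
            vcpu_init_mem = gdb.selected_inferior().read_memory(ptr, 32) # u32 + 7 * u32
            target_features = struct.unpack("<8I", bytes(vcpu_init_mem))
            log(f"KVM_ARM_VCPU_INIT: target={target_features[0]} features={target_features[1:]}")
        # Never stop the inferior: we only log and let QEMU continue.
        return False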

if __name__ == "__main__":
    tracer = IoctlBreakpoint("ioctl")
    gdb.execute("handle SIGUSR1 SIGUSR2 nostop noprint pass")
    gdb.execute("run")

I ran my QEMU command with GDB:

gdb -ex 'source ./trace-arm.py' ./start-qemu.sh

This immediately yielded useful results:

KVM_ARM_VCPU_INIT: target=5 features=(0, 0, 0, 0, 0, 0, 0)
KVM_ARM_VCPU_INIT: target=5 features=(12, 0, 0, 0, 0, 0, 0)
KVM_ARM_VCPU_INIT: target=5 features=(12, 0, 0, 0, 0, 0, 0)
KVM_ARM_VCPU_INIT: target=5 features=(12, 0, 0, 0, 0, 0, 0)

For now I wanted to set only target and leave features zeroed out, matching the
first call in the log. I could immediately reconstruct this in Rust:

impl VCPU {
    pub fn arm_vcpu_init(&self) -> io::Result<()> {
        let mut vcpu_init : kvm_vcpu_init = unsafe { std::mem::zeroed() };
        // KVM_ARM_VCPU_INIT: target=5 features=(0, 0, 0, 0, 0, 0, 0)
        vcpu_init.target = 5;
        let ret = unsafe { libc::ioctl(self.fd, KVM_ARM_VCPU_INIT, &mut vcpu_init) };
        if ret < 0 {
            return Err(io::Error::last_os_error());
        }
        Ok(())
    }
}
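
For the curious, the features=(12, ...) value from the later QEMU calls can be
decoded into flag names. The sketch below assumes the bit numbering as I recall
it from the arm64 UAPI header (asm/kvm.h); treat the names as an assumption and
check them against the headers on your system. The decode_features helper is
made up for illustration.

```rust
/// Feature bit names, indexed by bit position. These are recalled from the
/// arm64 UAPI header and are an assumption; verify against asm/kvm.h.
const FEATURE_NAMES: [&str; 7] = [
    "KVM_ARM_VCPU_POWER_OFF",
    "KVM_ARM_VCPU_EL1_32BIT",
    "KVM_ARM_VCPU_PSCI_0_2",
    "KVM_ARM_VCPU_PMU_V3",
    "KVM_ARM_VCPU_SVE",
    "KVM_ARM_VCPU_PTRAUTH_ADDRESS",
    "KVM_ARM_VCPU_PTRAUTH_GENERIC",
];

/// Returns the names of all known bits set in the first features word.
pub fn decode_features(word0: u32) -> Vec<&'static str> {
    FEATURE_NAMES
        .iter()
        .enumerate()
        .filter(|(bit, _)| (word0 & (1u32 << *bit)) != 0)
        .map(|(_, name)| *name)
        .collect()
}
```

Under that assumption, 12 (binary 1100) has bits 2 and 3 set, which would mean
QEMU asks for PSCI 0.2 and a PMUv3 on its later calls, neither of which is
needed to run a bare-metal test binary.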

Third roadblock: No KVM exit

Soon after this I was able to get and set registers properly. The first ARM test
binary I compiled was based on working UART code from my hobby operating system
I started on Raspberry Pi. I expected at least a fault exit, if not an MMIO exit
when UART code was called. However, KVM_RUN never returned.

This is where I was stuck for a while. I considered creating a separate thread
that repeatedly queried KVM_GET_ONE_REG for PC, but from my own professional
experience I knew that issuing KVM ioctl calls asynchronously was a really bad
idea. I also considered adding hardware breakpoint support to at least trigger
KVM_EXIT_DEBUG, but that would require more infrastructure than is justified
just to check a simple binary.

Writing a simple binary

I decided to try adding a signal handler instead. If I interrupted the KVM_RUN
call with SIGINT, maybe I could then read the registers synchronously. I lowered
my expectations for the initial binary and wrote something really simple:

.section .text

.macro curr_el_to reg
    mrs \reg, CurrentEL
    lsr \reg, \reg, #2
    and \reg, \reg, #0xf
.endm

.globl _start
_start:
    mrs x0, mpidr_el1
    and x0, x0, #0xFF
    cbz x0, control
    b .

control:
    curr_el_to x0
    b .

It may look like a lot, but this code just loads the current CPU privilege level
into the x0 register (since EL1 is the operating-system level, x0 should be set
to 0x1). The mpidr_el1 check verifies that we are on the control CPU (this
should be unnecessary here, but I added it out of paranoia caused by real-life
issues I had on ARM).
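
The bit manipulation done by the assembly above can be mirrored on the host
side, which is handy when interpreting raw register values. This is a minimal
sketch; the function names are made up for illustration. Note that it masks
with 0x3 because the exception level occupies only bits [3:2] of CurrentEL,
whereas the macro above uses the looser 0xf mask.

```rust
/// Decodes the exception level from a raw CurrentEL value.
/// The level is stored in bits [3:2] of the register.
pub const fn exception_level(current_el: u64) -> u64 {
    (current_el >> 2) & 0x3
}

/// Extracts the Aff0 field (lowest affinity level) from a raw MPIDR_EL1 value,
/// which is what the `and x0, x0, #0xFF` instruction above keeps.
pub const fn affinity0(mpidr: u64) -> u64 {
    mpidr & 0xff
}
```

For example, a raw CurrentEL of 0x4 decodes to exception level 1, which is the
value the guest is expected to leave in x0.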

Installing signal handler

Finally, I added a signal handler:

extern "C" fn handler(_sig: libc::c_int) { }

unsafe fn install_interrupt_signal() {
    let mut sa: libc::sigaction = std::mem::zeroed();
    sa.sa_sigaction = handler as *const() as usize;
    libc::sigemptyset(&mut sa.sa_mask);
    sa.sa_flags = 0;
    libc::sigaction(libc::SIGINT, &sa, std::ptr::null_mut());
}

The handler itself does nothing. I only needed the signal to interrupt the blocking KVM_RUN call. So my final KVM loop looked like:

    unsafe { install_interrupt_signal() };

    let run = vcpu.kvm_run_mem as *mut kvm_run;

    loop {
        let ret = unsafe { libc::ioctl(vcpu.fd, KVM_RUN, 0usize) };
        if ret < 0 {
            // Capture errno once: the ioctls inside print_regs() may overwrite it.
            let err = io::Error::last_os_error();
            if err.raw_os_error() == Some(libc::EINTR) {
                vcpu.print_regs()?;
            }
            return Err(err);
        }

        let exit_reason = unsafe { (*run).exit_reason };
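
The decision made in that error branch can be isolated into a small helper,
which makes it easy to test without a real vCPU. This is only a sketch using
the standard library; RunFailure and classify are hypothetical names, not part
of the hypervisor's code.

```rust
use std::io;

/// What the run loop should do after KVM_RUN returned a negative value.
#[derive(Debug, PartialEq, Eq)]
pub enum RunFailure {
    /// The call was interrupted by a signal (EINTR): a good moment to dump registers.
    Interrupted,
    /// Any other error: give up and report it.
    Fatal,
}

/// Classifies an OS error captured right after the failed ioctl.
pub fn classify(err: &io::Error) -> RunFailure {
    if err.kind() == io::ErrorKind::Interrupted {
        RunFailure::Interrupted
    } else {
        RunFailure::Fatal
    }
}
```

On Linux, io::Error::from_raw_os_error(4) (EINTR) is reported with kind
Interrupted, while the earlier ENOEXEC (code 8) has no dedicated kind and would
be classified as Fatal.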

Running the binary

Finally, it was time to run the guest:

pi@raspberrypi:~ $ ./rust hello-world-arm.img
x0 = 0x0
pc = 0x1000
Loading "hello-world-arm.img" to 0x1000...
^Cx0 = 0x1
pc = 0x101c
Error: Os { code: 4, kind: Interrupted, message: "Interrupted system call" }

We can see that PC did change and that x0 really is set to the current CPU
privilege level: 0x1. My UART code probably didn't work because the guest itself
requires more initialization than I thought (x86 went more smoothly and felt
more plug-and-play).

Conclusion

Although x86 and ARM64 share the same KVM interface at a high level, their
userspace APIs differ significantly. Porting my hypervisor turned out to involve
much more than renaming registers: ARM64 uses per-register ids instead of a
single register structure, requires explicit vCPU initialization through
KVM_ARM_VCPU_INIT, and demanded a fair amount of reverse engineering before I
could successfully execute guest code. Fortunately, the same workflow that
worked on x86 (combining strace, GDB and the Linux kernel source) proved just
as effective on ARM64, although it required a bit more effort.

The code from this article can be found on my GitHub page:

https://github.com/StjepanPoljak/kvm-rust/tree/kvm-arm-part2-code

Porting my Rust KVM hypervisor to ARM64: Working with registers

Stjepan — Fri, 10 Jul 2026 17:41:29 +0000

Introduction

This is a follow-up to my KVM in Rust series where I showed how to use strace
and GDB to reverse-engineer QEMU/KVM and reimplement pieces of it in my own
hypervisor in Rust. So far this was done on a modern x86 CPU. This time we'll
see how to get KVM running on 64-bit ARM architecture. For this I have used my
Raspberry Pi 4B which has a CPU with hardware virtualization support.

Cross-compilation

The first thing to do was to set up an environment for cross-compilation, and
for this kind of thing I almost always use Docker. So I wrote a simple
Dockerfile that installs the cross-compilation toolchain for 64-bit ARM along
with Rust essentials:

https://github.com/StjepanPoljak/kvm-rust/blob/kvm-arm-code/Dockerfile

Building this image is straightforward, but it does require passing user and
group id to have the resulting file permissions correct:

docker build --network host \
       --build-arg GID=$(id -g) \
       --build-arg UID=$(id -u) \
       -t arm-rust-build .

Finally, to cross-compile just run:

docker run --network host -it --rm \
       -v "$(pwd)":"/home/docker/kvm-rust" \
       arm-rust-build \
       cargo build --release --target=aarch64-unknown-linux-gnu

First roadblock: API difference

At first, I thought the difference would be mostly in CPU registers, so I
removed KVM_GET_SREGS2 and KVM_SET_SREGS2 and some x86-specific code just to
make it compile. However, when I ran it on the Raspberry Pi, the program
returned an error on KVM_GET_REGS. Once again I turned to strace on QEMU to see
what had to be done. I downloaded the Linux kernel source code (from
www.kernel.org), cross-compiled it for ARM and ran it in QEMU:

#!/bin/sh

KERNEL=/home/pi/Image
qemu-system-aarch64                                             \
        -M virt                                                 \
        -smp 1                                                  \
        -enable-kvm                                             \
        -cpu host                                               \
        -kernel ${KERNEL}                                       \
        -append "console=ttyAMA0"                               \
        -serial stdio                                           \
        -nographic                                              \
        -nodefaults

Notice that here I used the virt machine specifically: that's the option to
use for KVM-based virtualization (machines like raspi4b would be emulated as
they require a very specific CPU and device tree, so they cannot simply run
directly on the host CPU). I put this into start-qemu.sh file and then ran:

strace -yy -f -X verbose -e trace=ioctl,openat,read,write,mmap -o kvm.log ./start-qemu.sh

Discovering KVM register API for ARM64

The first thing I did was to search the log for the KVM_GET_REGS string, but
there were no results. I tried something more generic like grep -i regs and got
these as the first results:

137152 ioctl(12<anon_inode:kvm-vcpu:0>, 0x4010aeab /* KVM_ARM_SET_DEVICE_ADDR or KVM_GET_ONE_REG */, 0x7fff57e008) = 0
137152 ioctl(12<anon_inode:kvm-vcpu:0>, 0x4010aeab /* KVM_ARM_SET_DEVICE_ADDR or KVM_GET_ONE_REG */, 0x7fff57e008) = 0
137152 ioctl(12<anon_inode:kvm-vcpu:0>, 0x4010aeab /* KVM_ARM_SET_DEVICE_ADDR or KVM_GET_ONE_REG */, 0x7fff57e008) = 0
137152 ioctl(12<anon_inode:kvm-vcpu:0>, 0x4010aeab /* KVM_ARM_SET_DEVICE_ADDR or KVM_GET_ONE_REG */, 0x7fff57e008) = 0
137152 ioctl(12<anon_inode:kvm-vcpu:0>, 0x4010aeab /* KVM_ARM_SET_DEVICE_ADDR or KVM_GET_ONE_REG */, 0x7fff57e008) = 0
137152 ioctl(12<anon_inode:kvm-vcpu:0>, 0x4010aeab /* KVM_ARM_SET_DEVICE_ADDR or KVM_GET_ONE_REG */, 0x7fff57e008) = 0

So this is when I realized it was going to be a bit harder. I wasn't getting
names of fields and corresponding values, just addresses. I installed Linux
headers on Raspberry Pi and searched for KVM_[GS]ET_ONE_REG:

pi@raspberrypi:~ $ grep -Rn KVM_[GS]ET_ONE_REG /usr/include/linux/
/usr/include/linux/kvm.h:1618:#define KVM_GET_ONE_REG             _IOW(KVMIO,  0xab, struct kvm_one_reg)
/usr/include/linux/kvm.h:1619:#define KVM_SET_ONE_REG             _IOW(KVMIO,  0xac, struct kvm_one_reg)

Searching in the same file we can see that the struct kvm_one_reg is defined
as follows:

struct kvm_one_reg {
       __u64 id;
       __u64 addr;
};
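
To make the role of the two fields concrete: id selects the register, and addr
is a userspace pointer to the buffer the kernel reads from or writes to. A
minimal sketch of building such a request in Rust follows; KvmOneReg and
for_value are illustrative names, and the struct is mirrored by hand rather
than generated.

```rust
/// Hand-written mirror of `struct kvm_one_reg` (illustrative name).
#[repr(C)]
#[derive(Debug, Clone, Copy)]
pub struct KvmOneReg {
    /// Encoded register id.
    pub id: u64,
    /// Userspace address of the buffer holding (or receiving) the value.
    pub addr: u64,
}

impl KvmOneReg {
    /// Builds a request whose `addr` points at `value`. The caller must keep
    /// `value` alive until the ioctl using this request has returned.
    pub fn for_value(id: u64, value: &mut u64) -> Self {
        KvmOneReg {
            id,
            addr: value as *mut u64 as u64,
        }
    }
}
```

The struct is 16 bytes, which agrees with the 0x10 size field in the
0x4010aeab request number seen in the strace log.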

Constructing the register id

As it usually is with great discoveries, I learned how to construct the register
id in kvm_one_reg by accident. Before even knowing how KVM on ARM works with
registers, I naively tried searching for struct kvm_regs and found two
interesting matches:

pi@raspberrypi:~ $ grep -Rn kvm_regs /usr/include/
/usr/include/aarch64-linux-gnu/asm/kvm.h:50:struct kvm_regs {
/usr/include/aarch64-linux-gnu/asm/kvm.h:208:#define KVM_REG_ARM_CORE_REG(name)(offsetof(struct kvm_regs, name) / sizeof(__u32))

The struct kvm_regs here was just a list of registers, and the
KVM_REG_ARM_CORE_REG macro seemed to take a register name as input and compute
the offset of that register within struct kvm_regs (in 32-bit units), but I
didn't find it used anywhere in the Linux headers.

I knew, however, that QEMU was one place where this might be used and I searched
for KVM_REG_ARM_CORE_REG and found this in target/arm/kvm.c:

#define AARCH64_CORE_REG(x)   (KVM_REG_ARM64 | KVM_REG_SIZE_U64 | \
                 KVM_REG_ARM_CORE | KVM_REG_ARM_CORE_REG(x))

What was left to do was just to find these macro definitions and reconstruct
them in my Rust KVM code:

const KVM_REG_ARM64 : u64 = 0x6000000000000000;
const KVM_REG_SIZE_U64 : u64 = 0x0030000000000000;
const KVM_REG_ARM_COPROC_SHIFT : u64 = 16;
const KVM_REG_ARM_CORE : u64 = 0x0010 << KVM_REG_ARM_COPROC_SHIFT;

However, KVM_REG_ARM_CORE_REG needed more thought, as it works with offsets
into struct kvm_regs. So I wrote a function that takes the register name as a
string and maps it to the register id that KVM understands:

fn AARCH64_CORE_REG(name: &str) -> io::Result<u64> {
    let base = KVM_REG_ARM64 | KVM_REG_SIZE_U64 | KVM_REG_ARM_CORE;

    // Named registers sit right after regs[0..=30] in struct kvm_regs.
    match name {
        "sp" => return Ok(base | (31 * 2)),
        "pc" => return Ok(base | (32 * 2)),
        "pstate" => return Ok(base | (33 * 2)),
        _ => (),
    }

    // General-purpose registers x0..x30: each u64 slot spans two u32 units.
    if let Some(index) = name.strip_prefix('x') {
        let reg : u64 = index
            .parse()
            .map_err(|e| io::Error::new(io::ErrorKind::InvalidInput, e))?;
        // Reject x31 and above so they cannot alias sp, pc or pstate.
        if reg <= 30 {
            return Ok(base | (reg * 2));
        }
    }

    Err(io::Error::other("Invalid register."))
}

In my implementation I improved this further by using HashMap and LazyLock so
that the parsing and calculation aren't done on the fly but rather computed
once and cached for subsequent calls.
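
A cached lookup along those lines could look roughly like the sketch below. It
is not the code from the repository: CORE_REG_IDS and core_reg_id are
hypothetical names, the constants are repeated so that the block stands on its
own, and LazyLock requires Rust 1.80 or newer.

```rust
use std::collections::HashMap;
use std::sync::LazyLock;

const KVM_REG_ARM64: u64 = 0x6000_0000_0000_0000;
const KVM_REG_SIZE_U64: u64 = 0x0030_0000_0000_0000;
const KVM_REG_ARM_CORE: u64 = 0x0010 << 16;

/// Register name -> KVM register id, built once on first access.
static CORE_REG_IDS: LazyLock<HashMap<String, u64>> = LazyLock::new(|| {
    let base = KVM_REG_ARM64 | KVM_REG_SIZE_U64 | KVM_REG_ARM_CORE;
    let mut ids = HashMap::new();
    // x0..x30 occupy the first 31 u64 slots; ids count in u32 units, hence * 2.
    for n in 0..=30u64 {
        ids.insert(format!("x{n}"), base | (n * 2));
    }
    // sp, pc and pstate follow directly after the general-purpose registers.
    for (name, slot) in [("sp", 31u64), ("pc", 32), ("pstate", 33)] {
        ids.insert(name.to_string(), base | (slot * 2));
    }
    ids
});

/// Looks up the id for a register name; returns None for unknown names.
pub fn core_reg_id(name: &str) -> Option<u64> {
    CORE_REG_IDS.get(name).copied()
}
```

With this table, "pc" maps to 0x6030000000100040, and an out-of-range name such
as "x31" is simply absent instead of silently aliasing sp.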

Final thoughts

At this point I understood how ARM64 exposes CPU registers through
KVM_GET_ONE_REG and KVM_SET_ONE_REG calls and had reproduced the register
id encoding in Rust. Unfortunately, that wasn't enough to make the hypervisor
work. My first attempt at accessing the program counter (PC register)
immediately failed with an ENOEXEC error: a clue that I was missing another
crucial step in KVM ARM setup.

The code showcasing creation of register ids in a HashMap and subsequent
error when calling KVM_SET_ONE_REG can be found on my GitHub page:

https://github.com/StjepanPoljak/kvm-rust/tree/kvm-arm-part1-code

Building a KVM Virtual Machine in Rust: Running a binary

Stjepan — Sun, 28 Jun 2026 05:26:47 +0000

Recap

This is a continuation of Part 2 of the KVM in Rust series,
where we successfully set up a memory region for our guest VM. Now we will see
how to proceed further and run a very simple binary in our hypervisor.

Loading the binary

Now, we have only set up a memory region, but we still haven't loaded any binary
or code to actually run. For this, I have written a small "Hello world" x86
assembly file:

https://github.com/StjepanPoljak/kvm-rust/blob/kvm-part3-code/samples/hello-world.asm

Notice that the org 0x1000 directive is quite arbitrary, and it's actually a
good hint of what else we need to do (set the instruction pointer). Let's
assemble the file with:

nasm ./samples/hello-world.asm

Then, we want to load our binary into the memory region at address 0x1000:

let mut file = File::open("./samples/hello-world")?;
let mut code = Vec::new();
file.read_to_end(&mut code)?;

unsafe {
    std::ptr::copy_nonoverlapping(code.as_ptr(),
                                  (mem_ptr as *mut u8).add(0x1000),
                                  code.len()) };

Here we are using the more optimal copy_nonoverlapping because we can be quite
certain that the memory where code is stored is not overlapping with our
memory region.
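
If the guest memory is exposed as a byte slice (for instance via
std::slice::from_raw_parts_mut over the mmap'd region, which is an assumption
about how one might wrap it), the same copy can be done without unsafe code and
with a bounds check. The load_image helper below is a sketch with a made-up
name, not the code from the repository.

```rust
use std::io;

/// Copies `code` into `guest_mem` at `offset`, failing instead of writing
/// out of bounds when the image does not fit.
pub fn load_image(guest_mem: &mut [u8], offset: usize, code: &[u8]) -> io::Result<()> {
    let end = offset
        .checked_add(code.len())
        .filter(|&end| end <= guest_mem.len())
        .ok_or_else(|| {
            io::Error::new(
                io::ErrorKind::InvalidInput,
                "image does not fit in guest memory",
            )
        })?;
    // copy_from_slice compiles down to a plain memcpy for byte slices.
    guest_mem[offset..end].copy_from_slice(code);
    Ok(())
}
```

The borrow checker guarantees that the two slices cannot overlap, which is the
same property copy_nonoverlapping relies on.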

Creating a vCPU

As we saw in Part 1, QEMU calls KVM_CREATE_VCPU to obtain a vCPU file
descriptor on which it will later call KVM_RUN. So, all we have to do to set up
a vCPU is to take a look at the ioctl and recreate it in Rust:

// 140904 ioctl(9<anon_inode:kvm-vm>, 0xae41 /* KVM_CREATE_VCPU */, 0) = 10<anon_inode:kvm-vcpu:0>
let vcpu_fd = unsafe { libc::ioctl(self.fd, KVM_CREATE_VCPU, 0usize) };
if vcpu_fd < 0 {
    return Err(io::Error::last_os_error());
}

Investigating vCPU operations

It would be tempting to simply call KVM_RUN on this file descriptor. However,
we don't really know how the vCPU is set up, where the instruction pointer is,
or even whether the registers are in a valid state. If we look at the ioctl
calls referring to kvm-vcpu:0 in our log (filtering out the spammy KVM_RUN and
KVM_SMI), we can see a lot of lines like these being repeated:

$ grep 'kvm-vcpu:0' kvm.log | grep -v 'RUN\|SMI'

                ### omitted a lot of repetitive strace output ###

140904 ioctl(10<anon_inode:kvm-vcpu:0>, 0xc080aebe /* KVM_GET_NESTED_STATE */, 0x7768f8002010) = 0
140904 ioctl(10<anon_inode:kvm-vcpu:0>, 0xc028ae92 /* KVM_TPR_ACCESS_REPORTING */, 0x77690198f090) = 0
140904 ioctl(10<anon_inode:kvm-vcpu:0>, 0x4140aecd /* KVM_SET_SREGS2 */, 0x77690198f2f0) = 0
140904 ioctl(10<anon_inode:kvm-vcpu:0>, 0x4090ae82 /* KVM_SET_REGS */, {rax=0x20, ..., rsp=0x6d88, rbp=0, ..., rip=0x18, rflags=0x246}) = 0
140904 ioctl(10<anon_inode:kvm-vcpu:0>, 0x5000aea5 /* KVM_SET_XSAVE */, 0x7768f8001000) = 0
140904 ioctl(10<anon_inode:kvm-vcpu:0>, 0x4188aea7 /* KVM_SET_XCRS */, 0x77690198f2c0) = 0
140904 ioctl(10<anon_inode:kvm-vcpu:0>, 0x4008ae89 /* KVM_SET_MSRS */, 0x7768f80040a0) = 59
140904 ioctl(10<anon_inode:kvm-vcpu:0>, 0x4040aea0 /* KVM_SET_VCPU_EVENTS */, 0x77690198f500) = 0
140904 ioctl(10<anon_inode:kvm-vcpu:0>, 0x4008ae89 /* KVM_SET_MSRS */, 0x7768f80040a0) = 1
140904 ioctl(10<anon_inode:kvm-vcpu:0>, 0x4080aea2 /* KVM_SET_DEBUGREGS */, 0x77690198f500) = 0
140904 ioctl(10<anon_inode:kvm-vcpu:0>, 0xc008ae88 /* KVM_GET_MSRS */, 0x77690198f5a0) = 1

So we see that QEMU is setting vCPU registers via KVM_SET_REGS, and in this log
we can see their exact values. It's also calling KVM_SET_SREGS2, which sets the
special registers (segment and control registers); unfortunately, we cannot see
the values used here. We do not need to follow the whole logic, though; we just
want to see what QEMU is doing with registers before calling KVM_RUN:

$ awk '/KVM_RUN/ {print; exit} {print}' kvm.log | grep '[GS]ET_[S]\?REGS\|KVM_RUN'
140904 ioctl(10<anon_inode:kvm-vcpu:0>, 0x4140aecd /* KVM_SET_SREGS2 */, 0x77690198f310) = 0
140904 ioctl(10<anon_inode:kvm-vcpu:0>, 0x4090ae82 /* KVM_SET_REGS */, {rax=0, ..., rsp=0, rbp=0, ..., rip=0xfff0, rflags=0x2}) = 0
140904 ioctl(10<anon_inode:kvm-vcpu:0>, 0x8090ae81 /* KVM_GET_REGS */, {rax=0, ..., rsp=0, rbp=0, ..., rip=0xfff0, rflags=0x2}) = 0
140904 ioctl(10<anon_inode:kvm-vcpu:0>, 0x8140aecc /* KVM_GET_SREGS2 */, 0x77690198f310) = 0
140904 ioctl(10<anon_inode:kvm-vcpu:0>, 0x4140aecd /* KVM_SET_SREGS2 */, 0x77690198f310) = 0
140904 ioctl(10<anon_inode:kvm-vcpu:0>, 0x4090ae82 /* KVM_SET_REGS */, {rax=0, ..., rsp=0, rbp=0, ..., rip=0xfff0, rflags=0x2}) = 0
140904 ioctl(10<anon_inode:kvm-vcpu:0>, 0xae80 /* KVM_RUN */, 0) = -1 EINTR (Interrupted system call)

Since QEMU first calls SET_SREGS2 and SET_REGS we should actually
investigate what the default (if any) values are in our own implementation.

Implementing register get/set operations in Rust

The procedure is the same as for all the other KVM ioctl calls we implemented.
We find out how to construct their numbers (see the previous article) and
reimplement these constants in Rust:

const KVM_GET_SREGS2 : u64 = _IOR::<kvm_sregs2>(KVMIO, 0xcc);
const KVM_SET_SREGS2 : u64 = _IOW::<kvm_sregs2>(KVMIO, 0xcd);
const KVM_GET_REGS : u64 = _IOR::<kvm_regs>(KVMIO, 0x81);
const KVM_SET_REGS : u64 = _IOW::<kvm_regs>(KVMIO, 0x82);
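
As a cross-check, the numbers can also be derived by hand and compared with the
values in the strace log. The sketch below assumes the generic Linux ioctl
encoding used on x86 and arm64 (direction in bits 30..31, size in bits 16..29,
type in bits 8..15, number in bits 0..7); some other architectures lay the bits
out differently. The lowercase helper names are mine, not the libc crate's.

```rust
const IOC_NONE: u64 = 0;
const IOC_WRITE: u64 = 1;
const IOC_READ: u64 = 2;

/// Packs direction, type, number and argument size into an ioctl number.
pub const fn ioc(dir: u64, ty: u64, nr: u64, size: u64) -> u64 {
    (dir << 30) | (size << 16) | (ty << 8) | nr
}

/// Equivalent of the C `_IO` macro (no argument).
pub const fn io(ty: u64, nr: u64) -> u64 {
    ioc(IOC_NONE, ty, nr, 0)
}

/// Equivalent of the C `_IOR` macro (kernel writes, userspace reads).
pub const fn ior(ty: u64, nr: u64, size: u64) -> u64 {
    ioc(IOC_READ, ty, nr, size)
}

/// Equivalent of the C `_IOW` macro (userspace writes, kernel reads).
pub const fn iow(ty: u64, nr: u64, size: u64) -> u64 {
    ioc(IOC_WRITE, ty, nr, size)
}
```

With KVMIO = 0xae and a 0x90-byte kvm_regs, iow(0xae, 0x82, 0x90) gives
0x4090ae82, which is exactly the KVM_SET_REGS value that strace printed.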

We also need to add kvm_sregs2 and kvm_regs in our build.rs file:

    bindgen::Builder::default()
        .header("/usr/include/linux/kvm.h")
        .allowlist_type("kvm_sregs2")
        .allowlist_type("kvm_userspace_memory_region")
        .allowlist_type("kvm_regs")
        .generate_comments(false)
        .generate()?
        .write_to_file(out_path.join("kvm-bindings.rs"))?;

Then, the ioctl calls are as follows (I will only give examples for get_sregs2
and get_regs):

let mut sregs2 : kvm_sregs2 = unsafe { std::mem::zeroed() };
let ret = unsafe { libc::ioctl(self.fd, KVM_GET_SREGS2, &mut sregs2) };

let mut regs : kvm_regs = unsafe { std::mem::zeroed() };
let ret = unsafe { libc::ioctl(self.fd, KVM_GET_REGS, &mut regs) };

Peeking at registers

After obtaining sregs2 and regs we can print them out and see what they are
set to by default. To accomplish this, we consult the struct definitions in our
kvm-bindings.rs (or in the Linux headers) and print out the values of all the
fields we find. For regs we see they are pre-set to:

RAX=0x0 RBX=0x0 RCX=0x0 RDX=0x600
RSI=0x0 RDI=0x0 RSP=0x0 RBP=0x0
R8=0x0  R9=0x0  R10=0x0 R11=0x0
R12=0x0 R13=0x0 R14=0x0 R15=0x0
RIP=0xfff0      RFLAGS=0x2

Taking a look at our trace from kvm.log we can see that the ones QEMU sets are
exactly the same (except for RDX but we can get away with this one).

Peeking at system registers

Now, for sregs2 it gets a bit more complicated. Our defaults are as follows:

CS      base=0xffff0000 selector=0xf000 limit=0xffff    type=0xb        present=0x1
        dpl=0x0         db=0x0          s=0x1   l=0x0   g=0x0           avl=0x0

DS      base=0x0        selector=0x0    limit=0xffff    type=0x3        present=0x1
        dpl=0x0         db=0x0          s=0x1   l=0x0   g=0x0           avl=0x0

ES      base=0x0        selector=0x0    limit=0xffff    type=0x3        present=0x1
        dpl=0x0         db=0x0          s=0x1   l=0x0   g=0x0           avl=0x0

FS      base=0x0        selector=0x0    limit=0xffff    type=0x3        present=0x1
        dpl=0x0         db=0x0          s=0x1   l=0x0   g=0x0           avl=0x0

GS      base=0x0        selector=0x0    limit=0xffff    type=0x3        present=0x1
        dpl=0x0         db=0x0          s=0x1   l=0x0   g=0x0           avl=0x0

SS      base=0x0        selector=0x0    limit=0xffff    type=0x3        present=0x1
        dpl=0x0         db=0x0          s=0x1   l=0x0   g=0x0           avl=0x0

TR      base=0x0        selector=0x0    limit=0xffff    type=0xb        present=0x1
        dpl=0x0         db=0x0          s=0x0   l=0x0   g=0x0           avl=0x0

LDT     base=0x0        selector=0x0    limit=0xffff    type=0x2        present=0x1
        dpl=0x0         db=0x0          s=0x0   l=0x0   g=0x0           avl=0x0

GDT     base=0x0        limit=0xffff

IDT     base=0x0        limit=0xffff

CR0=0x60000010          CR2=0x0         CR3=0x0         CR4=0x0         CR8=0x0

EFER=0x0                APIC_BASE=0xfee00900            FLAGS=0x0

PDPTRS[0]=0x0           PDPTRS[1]=0x0           PDPTRS[2]=0x0           PDPTRS[3]=0x0

The problem is that we don't see any values in our strace log, only the address
of the variable. So we need to be a bit creative here if we want to find out
what QEMU sets these values to. We could use ptrace (in fact, strace is built
upon the ptrace API), but that may be a bit too much. The same goes for uprobes
and eBPF. We do have GDB, though, and it's just perfect as a one-off tool here.
All we have to do is run QEMU under GDB and then execute:

(gdb) break ioctl if $rsi == 0x4140aecd
(gdb) run

Note that 0x4140aecd is the exact value that we extracted from the SET_SREGS2
ioctl call (in the x86-64 System V ABI, RSI holds the second argument):

140904 ioctl(10<anon_inode:kvm-vcpu:0>, 0x4140aecd /* KVM_SET_SREGS2 */, 0x77690198f310) = 0

Once GDB breaks, we can dump memory. However, we don't know the exact size of
struct kvm_sregs2; we could guess or inspect it manually, but the quickest way
to get it is a small C program:

#include <linux/kvm.h>
#include <stdio.h>

int main(void) {
  /* sizeof yields a size_t, so print it with %zu. */
  printf("%zu\n", sizeof(struct kvm_sregs2));
  return 0;
}

The program prints 320 (at least on my system), which matches the size field
0x140 encoded in the ioctl number 0x4140aecd. We can then use this in GDB:

(gdb) dump memory dump.bin $rdx $rdx+320

Note that RDX holds the third argument, i.e. the address we actually saw in our
strace log. Now we can reuse our Rust function for printing sregs2. We simply
hack our main function to load the file, reinterpret its data as sregs2 and
print it. We got:

CS  base=0xffff0000 selector=0xf000 limit=0xffff    type=0xb    present=0x1
    dpl=0x0     db=0x0      s=0x1   l=0x0   g=0x0       avl=0x0

DS  base=0x0    selector=0x0    limit=0xffff    type=0x3    present=0x1
    dpl=0x0     db=0x0      s=0x1   l=0x0   g=0x0       avl=0x0

ES  base=0x0    selector=0x0    limit=0xffff    type=0x3    present=0x1
    dpl=0x0     db=0x0      s=0x1   l=0x0   g=0x0       avl=0x0

FS  base=0x0    selector=0x0    limit=0xffff    type=0x3    present=0x1
    dpl=0x0     db=0x0      s=0x1   l=0x0   g=0x0       avl=0x0

GS  base=0x0    selector=0x0    limit=0xffff    type=0x3    present=0x1
    dpl=0x0     db=0x0      s=0x1   l=0x0   g=0x0       avl=0x0

SS  base=0x0    selector=0x0    limit=0xffff    type=0x3    present=0x1
    dpl=0x0     db=0x0      s=0x1   l=0x0   g=0x0       avl=0x0

TR  base=0x0    selector=0x0    limit=0xffff    type=0xb    present=0x1
    dpl=0x0     db=0x0      s=0x0   l=0x0   g=0x0       avl=0x0

LDT base=0x0    selector=0x0    limit=0xffff    type=0x2    present=0x1
    dpl=0x0     db=0x0      s=0x0   l=0x0   g=0x0       avl=0x0

GDT base=0x0    limit=0xffff

IDT base=0x0    limit=0xffff

CR0=0x60000010      CR2=0x0     CR3=0x0     CR4=0x0     CR8=0x0

EFER=0x0        APIC_BASE=0xfee00900        FLAGS=0x0

PDPTRS[0]=0x0       PDPTRS[1]=0x4b275f5fce32f200        PDPTRS[2]=0x0       PDPTRS[3]=0x3

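We can see that this more or less corresponds to the defaults we observed in our
KVM implementation. The only difference is in PDPTRS, which look like leftover
stack data; with FLAGS=0 the PDPTRS are not marked as valid
(KVM_SREGS2_FLAGS_PDPTRS_VALID is unset), so they should not matter here. If we
want to change our registers (which we will), then we will always first get the
ones that are current in the vCPU, change the ones we need and then set them
using KVM_SET_REGS (or KVM_SET_SREGS2). So one of the first things to do is to
set CS to a flat mapping: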

let mut sregs2 = vcpu.get_sregs2()?;
sregs2.cs.base = 0;
sregs2.cs.selector = 0;
vcpu.set_sregs2(sregs2)?;

Recall that our assembly file was assembled with org 0x1000, so setting RIP to
0x1000 causes execution to begin at the start of the loaded binary:

let mut regs = vcpu.get_regs()?;
regs.rip = 0x1000;
vcpu.set_regs(regs)?;

Furthermore, we now have a reliable process for finding out what QEMU does with
registers.
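
The "load the dump and reinterpret it" step of that process can be sketched as
follows. The from_dump helper and the DtableDemo struct are hypothetical
stand-ins; in the real program the target type would be the bindgen-generated
kvm_sregs2, and the bytes would come from std::fs::read("dump.bin").

```rust
use std::{io, mem, ptr};

/// Reinterprets the start of `bytes` as a `T`.
///
/// # Safety
/// `T` must be a plain-old-data type for which every bit pattern is valid
/// (true for C structs made only of integers, such as the KVM register structs).
pub unsafe fn from_dump<T: Copy>(bytes: &[u8]) -> io::Result<T> {
    if bytes.len() < mem::size_of::<T>() {
        return Err(io::Error::new(
            io::ErrorKind::UnexpectedEof,
            "dump is smaller than the target struct",
        ));
    }
    // SAFETY: the length was checked above, and read_unaligned does not
    // require the byte buffer to be aligned for T.
    Ok(unsafe { ptr::read_unaligned(bytes.as_ptr().cast::<T>()) })
}

/// Stand-in with the shape of a descriptor-table entry (base plus limit),
/// used here only to demonstrate the helper.
#[repr(C)]
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct DtableDemo {
    pub base: u64,
    pub limit: u16,
    pub padding: [u16; 3],
}
```

Checking the length first means a truncated dump produces an error instead of
an out-of-bounds read.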

Running the binary

This is where strace stopped being useful and my understanding of the KVM run
loop took over. Another positive thing is that this part of the QEMU source code
is quite readable and can be found in the function kvm_vcpu_thread_fn.

The more x86-specific code that handles exit reasons can be found in the kvm_arch_handle_exit function.

Setting kvm_run shared memory region

To wrap things up, first we need to mmap the shared kvm_run structure that KVM
uses to communicate VM-exit information and other runtime state between the
kernel and userspace:

// 140904 ioctl(3</dev/kvm<char 10:232>>, 0xae04 /* KVM_GET_VCPU_MMAP_SIZE */, 0) = 12288
let kvm_run_size = unsafe {
    libc::ioctl(kvm_fd, KVM_GET_VCPU_MMAP_SIZE, 0usize) };
let kvm_run_mem = unsafe {
    libc::mmap(ptr::null_mut(), kvm_run_size as usize,
               libc::PROT_READ|libc::PROT_WRITE, libc::MAP_SHARED,
               vcpu_fd, 0) };

Then, when we call KVM_RUN we simply match the exit reasons and act
accordingly:

let run = vcpu.kvm_run_mem as *mut kvm_run;

loop {
    let ret = unsafe { libc::ioctl(vcpu.fd, KVM_RUN, 0usize) };
    if ret < 0 {
        return Err(io::Error::last_os_error());
    }

    let exit_reason = unsafe { (*run).exit_reason };

    match exit_reason {
        KVM_EXIT_MMIO => { /* omitted */ }
        KVM_EXIT_HLT => {
            println!("Guest halted.");
            break; }
        KVM_EXIT_SHUTDOWN => {
            println!("Guest shutdown.");
            break; }
        KVM_EXIT_INTERNAL_ERROR => {
            return Err(io::Error::other("KVM internal error.")); }
        _ => {
            println!("EXIT REASON = {}", exit_reason);
        }
    }
}

Our first output

Since our assembly file uses VGA MMIO to print its greeting, I have only
implemented KVM_EXIT_MMIO. Then we can finally see the output:

$ cargo run ./samples/hello-world
     ## omitted sregs2 output and warnings ##
RAX=0x0 RBX=0x0 RCX=0x0 RDX=0x600
RSI=0x0 RDI=0x0 RSP=0x0 RBP=0x0
R8=0x0  R9=0x0  R10=0x0 R11=0x0
R12=0x0 R13=0x0 R14=0x0 R15=0x0
RIP=0x1000      RFLAGS=0x2
Hello from KVM!
Guest halted.
RAX=0xb800      RBX=0x0 RCX=0x0 RDX=0x600
RSI=0x102d      RDI=0x22        RSP=0x0 RBP=0x0
R8=0x0  R9=0x0  R10=0x0 R11=0x0
R12=0x0 R13=0x0 R14=0x0 R15=0x0
RIP=0x101b      RFLAGS=0x46
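
The KVM_EXIT_MMIO arm was omitted above. As a rough idea of what such a handler
has to do, here is a sketch that turns one MMIO write into printable
characters. It assumes the guest writes into the classic VGA text buffer at
physical address 0xb8000, where even offsets hold characters and odd offsets
hold attributes (the RAX=0xb800 in the output hints at this, but it remains an
assumption about the sample). The vga_chars helper is a made-up name; its
parameters correspond to the phys_addr, data and is_write fields of the mmio
part of kvm_run.

```rust
const VGA_TEXT_BASE: u64 = 0xb8000;
/// 80 columns x 25 rows, two bytes (character + attribute) per cell.
const VGA_TEXT_SIZE: u64 = 80 * 25 * 2;

/// Collects the character bytes of one MMIO write that land in the VGA text
/// buffer. Reads, attribute bytes and accesses outside the buffer yield nothing.
pub fn vga_chars(phys_addr: u64, data: &[u8], is_write: bool) -> String {
    if !is_write {
        return String::new();
    }
    data.iter()
        .enumerate()
        .filter_map(|(i, &byte)| {
            let addr = phys_addr.checked_add(i as u64)?;
            let offset = addr.checked_sub(VGA_TEXT_BASE)?;
            // Even offsets are character cells, odd offsets are attributes.
            (offset < VGA_TEXT_SIZE && offset % 2 == 0).then_some(byte as char)
        })
        .collect()
}
```

In the run loop, the result would simply be printed, with data limited to the
first len bytes of the exit's data array.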

Conclusion

This concludes the three-part series on figuring out KVM via strace and
reimplementing it in Rust. A lot of next steps actually come down to x86
architecture and bootloader specifics, so we may leave it here for now. The full
code capable of running simple binaries can be found on GitHub:

https://github.com/StjepanPoljak/kvm-rust/tree/kvm-part3-code

Building a KVM Virtual Machine in Rust: Memory Setup

Stjepan — Mon, 15 Jun 2026 05:21:59 +0000

Recap

This is a continuation of my previous article which dealt
with reverse-engineering QEMU with strace to learn how KVM works. Now it's
time to try and follow the steps we got from the strace logs to build our own
KVM-based virtual machine in Rust.

KVM Headers

I haven't used any of the existing KVM libraries written specifically for Rust,
but opted for the libc crate, which provides the required ioctl bindings and
helpers. The main reason is that I want a complete understanding of what is
happening under the hood. For each of these KVM ioctl calls we can use the
Linux headers for reference. For example, to find out how to construct the
ioctl number for KVM_CREATE_VM we can simply run:

$ grep -Rn 'KVM_CREATE_VM' /usr/include/linux/
/usr/include/linux/kvm.h:855:/* machine type bits, to be used as argument to KVM_CREATE_VM */
/usr/include/linux/kvm.h:882:#define KVM_CREATE_VM             _IO(KVMIO,   0x01) /* returns a VM fd */

Luckily, the libc Rust crate provides Rust equivalents of _IO (and the like),
but we still need the value of the KVMIO macro:

$ grep -Rn 'define KVMIO' /usr/include/linux/
/usr/include/linux/kvm.h:853:#define KVMIO 0xAE

We can now construct the ioctl number for KVM_CREATE_VM:

use libc::{_IOW, _IO, _IOR};

const KVMIO : u32 = 0xae;
const KVM_CREATE_VM : u64 = _IO(KVMIO, 0x01);
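
To see what _IO actually computes, the encoding can also be written out by
hand. The following is only a sketch, assuming the asm-generic ioctl layout
used on x86 and ARM64 (some other architectures use different direction bits).
The helper names ioc, io and iow are made up here and are not part of the libc
crate.

```rust
// Layout of a request number, from the most significant bits down:
// | dir (2 bits) | size (14 bits) | type (8 bits) | nr (8 bits) |
const IOC_NONE: u64 = 0;
const IOC_WRITE: u64 = 1;

const fn ioc(dir: u64, ty: u64, nr: u64, size: u64) -> u64 {
    (dir << 30) | (size << 16) | (ty << 8) | nr
}

// Counterpart of the C macro _IO: no argument, so direction and size are 0.
const fn io(ty: u64, nr: u64) -> u64 {
    ioc(IOC_NONE, ty, nr, 0)
}

// Counterpart of the C macro _IOW: userspace passes `size` bytes to the kernel.
const fn iow(ty: u64, nr: u64, size: u64) -> u64 {
    ioc(IOC_WRITE, ty, nr, size)
}

const KVMIO: u64 = 0xae;
const KVM_CREATE_VM: u64 = io(KVMIO, 0x01);
```

With these definitions KVM_CREATE_VM evaluates to 0xae01, which is the raw
value strace prints for this request when run with -X verbose.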

Note that the exact integer type depends on the platform and libc definitions.
Then, in main we can do:

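use std::fs::OpenOptions;
use std::os::unix::io::AsRawFd;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Open the KVM device node for reading and writing.
    let file = OpenOptions::new()
        .read(true)
        .write(true)
        .open("/dev/kvm")
        .expect("failed to open /dev/kvm");

    // KVM_CREATE_VM takes no argument and returns a new VM file descriptor.
    let fd = file.as_raw_fd();
    let vm_fd = unsafe { libc::ioctl(fd, KVM_CREATE_VM, 0usize) };
    Ok(())
}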

This is the procedure we will follow for each relevant ioctl call. In fact,
we can "reverse-engineer" our own program and then compare it with the original,
to make sure we are doing the right thing:

$ strace cargo run

     ### omitted irrelevant strace output ###

openat(AT_FDCWD, "/dev/kvm", O_RDWR|O_CLOEXEC) = 3
ioctl(3, KVM_CREATE_VM, 0)              = 4

Important note

In production code we would immediately check for a negative return value and
convert errno into a Rust error. To keep the example focused, I am omitting
proper error handling in this article.
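
For reference, such a check could look roughly like the sketch below. It relies
only on the standard library: io::Error::last_os_error() reads errno for the
calling thread. The helper name check_ret is hypothetical.

```rust
use std::io;

// Hypothetical helper: turn a raw ioctl-style return value into a Result.
// A negative value signals failure and the cause is stored in errno.
fn check_ret(ret: i32) -> io::Result<i32> {
    if ret < 0 {
        Err(io::Error::last_os_error())
    } else {
        Ok(ret)
    }
}
```

The KVM_CREATE_VM call from above would then be wrapped as
check_ret(unsafe { libc::ioctl(fd, KVM_CREATE_VM, 0usize) })? so that a failure
propagates out of main instead of being silently ignored.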

Setting memory region

Now, to recall, the next step is setting up the memory region used by both the
KVM guest and the host. This region will be the memory of our virtual machine, a
place where we will load our binary:

140900 mmap(NULL, 1075838976, 0 /* PROT_NONE */, 0x22 /* MAP_PRIVATE|MAP_ANONYMOUS */, -1, 0) = 0x7768b3e00000
140900 mmap(0x7768b3e00000, 1073741824, 0x3 /* PROT_READ|PROT_WRITE */, 0x32 /* MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS */, -1, 0) = 0x7768b3e00000
140900 ioctl(9<anon_inode:kvm-vm>, 0x4020ae46 /* KVM_SET_USER_MEMORY_REGION */, {slot=0, flags=0, guest_phys_addr=0, memory_size=1073741824, userspace_addr=0x7768b3e00000}) = 0

Recreating mmap call

So, the first thing we need to do is follow the same logic for mmap, which is
also available in the libc Rust crate. After creating the virtual machine, we could
simply recreate our own mmap calls based on the strace output. However,
notice that QEMU first reserves a larger address range with PROT_NONE and then
maps only the portion it actually intends to use. For our prototype we do not
actually need to mimic this exact reservation pattern.

let mem_size: u64 = 256 * 1024;
let mem = unsafe {
    libc::mmap(ptr::null_mut(),
               mem_size as usize,
               libc::PROT_READ|libc::PROT_WRITE,
               libc::MAP_PRIVATE|libc::MAP_ANONYMOUS,
               -1,
               0)
};

For this experiment we are only allocating 256 kilobytes because we are not yet
booting a full operating system and therefore need very little guest memory.
Also note that, as with ioctl, production code should check whether mmap()
returned MAP_FAILED.
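
A minimal sketch of that check, written against the standard library only so
it compiles on its own. MAP_FAILED is defined as (void *) -1 in the C headers
(the libc crate exports it as libc::MAP_FAILED), so comparing against a null
pointer would not catch the error. The helper name check_mmap is hypothetical.

```rust
use std::ffi::c_void;
use std::io;

// Same value as MAP_FAILED in the C headers: all bits set, not NULL.
const MAP_FAILED: *mut c_void = usize::MAX as *mut c_void;

// Hypothetical helper: convert the raw mmap() result into a Result.
fn check_mmap(ptr: *mut c_void) -> io::Result<*mut u8> {
    if ptr == MAP_FAILED {
        // mmap() stores the reason for the failure in errno.
        Err(io::Error::last_os_error())
    } else {
        Ok(ptr as *mut u8)
    }
}
```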

Recreating ioctl call

First we need to see the definition of the KVM_SET_USER_MEMORY_REGION:

$ grep -Rn 'define KVM_SET_USER' /usr/include/linux/ -A1
/usr/include/linux/kvm.h:1433:#define KVM_SET_USER_MEMORY_REGION _IOW(KVMIO, 0x46, \
/usr/include/linux/kvm.h-1434-                                  struct kvm_userspace_memory_region)

We see that this one needs struct kvm_userspace_memory_region. The values we
need are visible in the strace output, but copying the struct definition into
our Rust program by hand is not advisable. Luckily, Rust has bindgen, which we
can use to generate this KVM struct (and others) from the Linux headers. For
this purpose we add a separate build.rs file containing:

use bindgen;
use std::path::PathBuf;
use std::env;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let out_path = PathBuf::from(env::var("OUT_DIR")?);

    bindgen::Builder::default()
        .header("/usr/include/linux/kvm.h")
        .allowlist_type("kvm_userspace_memory_region")
        .generate_comments(false)
        .generate()?
        .write_to_file(out_path.join("kvm-bindings.rs"))?;

    Ok(())
}

Now in the main.rs we can import these bindings with:

include!(concat!(env!("OUT_DIR"), "/kvm-bindings.rs"));
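
One way to sanity-check generated bindings is to compare the size of the struct
with the argument size encoded in the ioctl request number, which strace prints
as 0x4020ae46 for KVM_SET_USER_MEMORY_REGION. The sketch below assumes the
asm-generic ioctl layout used on x86 and ARM64. RegionMirror is a hand-written
stand-in so the example compiles by itself; in the real program the type comes
from the bindgen output, and ioc_size and layout_matches are hypothetical
names.

```rust
use std::mem::size_of;

// Illustration only: mirrors the fields shown in the strace output.
#[repr(C)]
struct RegionMirror {
    slot: u32,
    flags: u32,
    guest_phys_addr: u64,
    memory_size: u64,
    userspace_addr: u64,
}

// The argument size lives in bits 16..30 of the request number.
const fn ioc_size(request: u64) -> usize {
    ((request >> 16) & 0x3fff) as usize
}

// Raw request value as printed by strace.
const KVM_SET_USER_MEMORY_REGION: u64 = 0x4020ae46;

// True when the Rust-side struct is as large as the kernel expects.
fn layout_matches() -> bool {
    size_of::<RegionMirror>() == ioc_size(KVM_SET_USER_MEMORY_REGION)
}
```

The 0x20 in the middle of 0x4020ae46 is exactly this size: 32 bytes.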

Then, after the mmap() call, we set the memory region, imitating what QEMU is
doing in the strace output:

let region = kvm_userspace_memory_region {
    slot : 0,
    flags : 0,
    guest_phys_addr : 0x0,
    memory_size : mem_size,
    userspace_addr : mem as u64
};

let _ret = unsafe { libc::ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region) };

We register this region starting at guest physical address 0x0, meaning the
first byte of our allocated host memory will appear as physical address 0 inside
the guest. Also note that the KVM_SET_USER_MEMORY_REGION call does not copy
memory, but rather tells KVM that this range of guest physical addresses will
be backed by a specific userspace memory region.
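
The mapping this establishes is a plain offset calculation, sketched below. The
struct carries the three relevant fields of kvm_userspace_memory_region; the
names MemRegion and host_addr are made up for illustration.

```rust
// Hypothetical view of one memory slot, using the field names from
// kvm_userspace_memory_region.
struct MemRegion {
    guest_phys_addr: u64,
    memory_size: u64,
    userspace_addr: u64,
}

impl MemRegion {
    // Host virtual address backing `gpa`, or None if `gpa` is outside the slot.
    fn host_addr(&self, gpa: u64) -> Option<u64> {
        // checked_sub rejects addresses below the start of the slot.
        let offset = gpa.checked_sub(self.guest_phys_addr)?;
        if offset < self.memory_size {
            Some(self.userspace_addr + offset)
        } else {
            None
        }
    }
}
```

For a slot registered at guest physical address 0, the guest address and the
offset into the host mapping are the same number, which is what makes loading a
binary at a known guest address straightforward.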

Running the code

The next thing to do is to run it. We still haven't added any checks after the
mmap and ioctl calls, but for this prototype we can again simply run strace on
our own code:

$ strace -yy -X verbose -e trace=ioctl,mmap,openat,read,write cargo run

                ### omitted irrelevant strace output ###

openat(-100 /* AT_FDCWD */</home/stjepan/Develop/KVM/rust>, "/dev/kvm", 0x80002 /* O_RDWR|O_CLOEXEC */) = 3</dev/kvm<char 10:232>>
ioctl(3</dev/kvm<char 10:232>>, 0xae01 /* KVM_CREATE_VM */, 0) = 4<anon_inode:kvm-vm>
mmap(NULL, 262144, 0x3 /* PROT_READ|PROT_WRITE */, 0x22 /* MAP_PRIVATE|MAP_ANONYMOUS */, -1, 0) = 0x7d63fdfa2000
ioctl(4<anon_inode:kvm-vm>, 0x4020ae46 /* KVM_SET_USER_MEMORY_REGION */, {slot=0, flags=0, guest_phys_addr=0, memory_size=262144, userspace_addr=0x7d63fdfa2000}) = 0

We can see our output is fine and no errors were reported. Note that a full
working example with proper checking and Rust idiomatic approaches can be found
on my GitHub page:

https://github.com/StjepanPoljak/kvm-rust/tree/kvm-part2-code

Next steps

At this point we have a VM object and guest memory, but nothing is actually executing yet. In the next part we will create a vCPU, initialize its state, load a small binary into guest memory and enter the first KVM_RUN loop.

Learning KVM by Reverse-Engineering QEMU with strace

Stjepan — Fri, 05 Jun 2026 18:42:53 +0000

Motivation

I work with virtual machines in a QEMU/KVM environment (a lot). Debugging, optimizing and customizing the VMs requires in-depth knowledge of both QEMU and KVM, the Linux kernel virtualization subsystem that exposes hardware virtualization features such as Intel VT-x and AMD-V to userspace applications like QEMU. Not only that, but I work on a lot of hobby projects requiring quick
bare-metal boot-ups and debugging workflows, and, to be honest, QEMU is often overkill for these sorts of tasks.

KVM vs TCG

Also when it comes to QEMU, it's worth noting that we always have an option of using TCG, which has a completely different purpose than KVM. TCG is short for Tiny Code Generator, which works by translating guest instructions into host instructions at runtime. This is quite slow compared to running code without
translation overhead. So, if we want to test our bare-metal code, we may want to test it on our own real CPU at native speed. This is where KVM comes in. Unlike TCG, KVM does not emulate the CPU itself. Instead, it allows guest code to execute directly on the host processor while the Linux kernel manages
transitions between guest and host execution.

How KVM works in Linux

I already know some basics. The KVM driver exposes a device node in the Linux
root filesystem, /dev/kvm. Communicating with the driver is done via the
ioctl() system call on a file descriptor. What we need to find out is how QEMU
communicates with the Linux kernel, and then try to follow the QEMU logic
without reading the QEMU source code or the KVM API documentation (both can be
a bit more intimidating than just seeing how it works under the hood).

Reverse-engineering KVM

Now we can take a lightweight Debian Linux image and boot it in QEMU with KVM
enabled:

QEMU_IMAGE=./debian-12-nocloud-amd64.qcow2

qemu-system-x86_64                                      \
    -m 1024                                             \
    -drive file="${QEMU_IMAGE}",if=virtio,cache=none    \
    -serial stdio                                       \
    -enable-kvm                                         \
    -cpu host                                           \
    -nodefaults                                         \
    -nographic

This is a pretty straightforward way to run QEMU with minimal setup. The most
relevant options for us are -enable-kvm and -cpu host, which will enable
KVM and use host CPU instead of emulating some specific CPU.

Tracing QEMU/KVM with strace

Now, we want to see what QEMU is really doing by utilizing strace. We can put the QEMU command above in a start-qemu.sh script and call it with:

strace -yy -f -X verbose                                \
       -e trace=ioctl,openat,read,write,mmap            \
       -o kvm.log                                       \
       ./start-qemu.sh

This command will trace all ioctl, openat, read, write and mmap system
calls. Although I mentioned only ioctl calls so far, I always like to include
some other common system calls that could be used. As far as we know /dev/kvm
is the interface to KVM driver and QEMU will probably use openat on it.
Similarly, we also want to see what QEMU is doing with memory and what it's
reading and writing in general.

Note: Information on the strace arguments above can be found in strace --help
or man strace, but essentially, -yy tells strace to print all available
information when decoding file descriptors, and -f follows forks (we need this
one as we're wrapping QEMU in a script and QEMU might also spawn threads or
processes of its own). -X verbose prints both the raw values and the names of
constants and flags (very important when analyzing ioctl calls).

Interpreting the logs

Now, we start the above command and, as soon as the system boots, we can kill it
with CTRL+C. This will be quite sufficient to see how QEMU/KVM works without
spamming our logs with redundant information. When we read the kvm.log file,
we will see a lot of traces that are not really interesting. However, we already
have some knowledge: we know QEMU should be opening /dev/kvm so a quick
search for kvm reveals exactly what we need:

140900 openat(-100 /* AT_FDCWD */</home/stjepan/Develop/KVM>, "/dev/kvm", 0x80002 /* O_RDWR|O_CLOEXEC */) = 3</dev/kvm<char 10:232>>
140900 ioctl(3</dev/kvm<char 10:232>>, 0xae00 /* KVM_GET_API_VERSION */, 0) = 12
140900 ioctl(3</dev/kvm<char 10:232>>, 0xae03 /* KVM_CHECK_EXTENSION */, 0x88 /* KVM_CAP_IMMEDIATE_EXIT */) = 1
140900 ioctl(3</dev/kvm<char 10:232>>, 0xae03 /* KVM_CHECK_EXTENSION */, 0xa /* KVM_CAP_NR_MEMSLOTS */) = 32764
140900 ioctl(3</dev/kvm<char 10:232>>, 0xae03 /* KVM_CHECK_EXTENSION */, 0x76 /* KVM_CAP_MULTI_ADDRESS_SPACE */) = 2
140900 ioctl(3</dev/kvm<char 10:232>>, 0xae01 /* KVM_CREATE_VM */, 0) = 9<anon_inode:kvm-vm>
140900 ioctl(9<anon_inode:kvm-vm>, 0xae03 /* KVM_CHECK_EXTENSION */, 0x9 /* KVM_CAP_NR_VCPUS */) = 4
140900 ioctl(3</dev/kvm<char 10:232>>, 0xae03 /* KVM_CHECK_EXTENSION */, 0x42 /* KVM_CAP_MAX_VCPUS */) = 4096

We can see that QEMU is opening /dev/kvm and that it's checking API version
and various extensions. We may skip these checks and focus on the calls that
look most important; one of these here is KVM_CREATE_VM which also returns a
file descriptor 9<anon_inode:kvm-vm> which we can use as a further reference.
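
Instead of searching by hand, the log can also be summarized with a few lines
of code. The sketch below assumes the log was produced with -X verbose, so
every ioctl line carries the request name in a /* ... */ comment as in the
excerpt above. The function names kvm_request and count_requests are made up
for illustration.

```rust
use std::collections::BTreeMap;

// Extract the request name from one strace line, e.g. "KVM_CREATE_VM" from
// "... ioctl(3<...>, 0xae01 /* KVM_CREATE_VM */, 0) = 9<...>".
// Returns None for lines that are not KVM ioctl calls.
fn kvm_request(line: &str) -> Option<&str> {
    let args = &line[line.find("ioctl(")?..];
    // The first comment after "ioctl(" names the request itself.
    let start = args.find("/* ")? + 3;
    let end = start + args[start..].find(" */")?;
    let name = &args[start..end];
    if name.starts_with("KVM_") {
        Some(name)
    } else {
        None
    }
}

// Count how often each KVM request appears in the whole log.
fn count_requests(log: &str) -> BTreeMap<&str, usize> {
    let mut counts = BTreeMap::new();
    for name in log.lines().filter_map(kvm_request) {
        *counts.entry(name).or_insert(0) += 1;
    }
    counts
}
```

Feeding the contents of kvm.log to count_requests gives a quick overview of
which requests dominate the trace and which ones appear only once during setup.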

Setting up memory regions

We know QEMU must eventually load firmware and guest memory into the VM. Looking
for file operations after KVM_CREATE_VM, we quickly encounter SeaBIOS being
loaded:

140900 openat(-100 /* AT_FDCWD */</home/stjepan/Develop/KVM>, "/usr/share/seabios/bios-256k.bin", 0 /* O_RDONLY */) = 12</usr/share/seabios/bios-256k.bin>
140900 mmap(NULL, 2359296, 0 /* PROT_NONE */, 0x22 /* MAP_PRIVATE|MAP_ANONYMOUS */, -1, 0) = 0x776900dc1000
140900 mmap(0x776900e00000, 262144, 0x3 /* PROT_READ|PROT_WRITE */, 0x32 /* MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS */, -1, 0) = 0x776900e00000
140900 openat(-100 /* AT_FDCWD */</home/stjepan/Develop/KVM>, "/usr/share/seabios/bios-256k.bin", 0 /* O_RDONLY */) = 12</usr/share/seabios/bios-256k.bin>
140900 mmap(NULL, 266240, 0x3 /* PROT_READ|PROT_WRITE */, 0x22 /* MAP_PRIVATE|MAP_ANONYMOUS */, -1, 0) = 0x776900fc0000
140900 read(12</usr/share/seabios/bios-256k.bin>, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 262144) = 262144

We can see QEMU opening and reading the SeaBIOS binary and reserving memory
for it. We can take the return value of the read system call (the size of the
SeaBIOS binary) and search the log for it to see how this memory is handed over
to KVM:

140900 ioctl(9<anon_inode:kvm-vm>, 0x4020ae46 /* KVM_SET_USER_MEMORY_REGION */, {slot=3, flags=0x2 /* KVM_MEM_READONLY */, guest_phys_addr=0xfffc0000, memory_size=262144, userspace_addr=0x776900e00000}) = 0

So we see that it's now setting this as the memory region for KVM at guest
physical address 0xfffc0000 from userspace address that was actually
obtained by mmap in one of the traces above. In other words, KVM does not
allocate guest RAM itself; userspace applications such as QEMU remain
responsible for managing the backing memory.

Creating vCPU and running

Now, it gets very busy in the logs, but most of the stuff we see is still just
checking for extensions and capabilities. However, if we take a look at the
tail of the log, we will see a lot of these ioctl calls:

140904 ioctl(10<anon_inode:kvm-vcpu:0>, 0xae80 /* KVM_RUN */, 0) = 0
140904 ioctl(10<anon_inode:kvm-vcpu:0>, 0xaeb7 /* KVM_SMI */, 0) = 0
140904 ioctl(10<anon_inode:kvm-vcpu:0>, 0xae80 /* KVM_RUN */, 0) = 0
140904 ioctl(10<anon_inode:kvm-vcpu:0>, 0xae80 /* KVM_RUN */, 0) = 0
140904 ioctl(10<anon_inode:kvm-vcpu:0>, 0xae80 /* KVM_RUN */, 0) = 0
140904 ioctl(10<anon_inode:kvm-vcpu:0>, 0xae80 /* KVM_RUN */, 0) = 0
140904 ioctl(10<anon_inode:kvm-vcpu:0>, 0xaeb7 /* KVM_SMI */, 0) = 0

Now both KVM_RUN and KVM_SMI are operating on a kvm-vcpu file descriptor,
something we haven't yet seen. So if we search the logs for it, we can actually
see where it's created:

140904 ioctl(9<anon_inode:kvm-vm>, 0xae41 /* KVM_CREATE_VCPU */, 0) = 10<anon_inode:kvm-vcpu:0>

Conclusion

Now we have a more complete picture of how QEMU is setting up KVM. First,
/dev/kvm is opened to obtain a file descriptor representing the KVM subsystem.
From it we create a new virtual machine and get kvm-vm file descriptor. On
this file descriptor we are setting up memory regions and later use it to create
a vCPU, on which we can call KVM_RUN. The following diagram explains it
better:

/dev/kvm   kvm fd
    |
    +--> KVM_CREATE_VM   kvm-vm fd
            |
            +--> KVM_SET_USER_MEMORY_REGION
            |
            +--> KVM_CREATE_VCPU   kvm-vcpu fd
                            |
                            +--> KVM_RUN (loop)

As we can see from the logs, KVM_RUN appears repeatedly during guest
execution, while occasional KVM_SMI calls inject System Management Interrupts
into the guest. This repeated interaction between userspace and KVM is what
ultimately drives virtual CPU execution.

Next time we will recreate this exact behavior in Rust and fill in the few
missing pieces needed to get our first virtual machine running in KVM.

Automating Stack Corruption Analysis in GDB with Python

Stjepan — Mon, 25 May 2026 12:12:18 +0000

A bug in my operating system

During a recent visit to my wife's family in Sarajevo, I decided to revisit my
hobby operating system in QEMU. I discovered that the boot process consistently
froze while printing the BIOS memory map. What initially looked like a
protected-mode issue eventually turned into a useful exercise in automating
debugging using GDB's Python scripting.

Manual debugging failure

After reinspecting some common pitfalls in my protected mode setup I was more
confident that the issue was stack corruption in my print_memory_map function.
If I had an unmatched push or pop, it could have corrupted return addresses and
eventually redirected execution flow into invalid memory. Single-stepping
through the routine manually quickly became impractical. The function mixed BIOS
interrupt handling, memory map parsing, and multiple helper calls, making it
difficult to reason about stack state over time.

Setting up GDB scripting in Python

I needed to automate this and GDB's integration with Python was the most
promising route I could take. The idea was to do exactly what I started
manually: break at a specific (suspicious) function and then start single
stepping while inspecting how the stack pointer behaved.

First things first, we import the gdb module in Python, connect to the remote
target and set up a breakpoint (this is done by inheriting from
gdb.Breakpoint):

import gdb

class StackTraceBreakpoint(gdb.Breakpoint):

    def __init__(self, func_name):
        super(StackTraceBreakpoint, self).__init__(func_name)
        self.active = False

    def stop(self):
        if self.active:
            raise Exception("Recursion detected - stopping.")
        self.active = True

        return True

if __name__ == "__main__":
    gdb.execute("target remote :5555")
    gdb.execute("symbol-file ../build/arch/x86/bios-legacy/boot-stage1-5.elf")

    tracer = StackTraceBreakpoint("print_memory_map")
    gdb.execute("continue")

StackTraceBreakpoint

This is the bare-bones version of what I wanted to do. This code will simply
add a breakpoint with custom logic after connecting to QEMU and loading symbols.
As soon as we do gdb.execute("continue") GDB will run and, if and when it hits
our breakpoint, it will execute whatever we wrote in the stop() method. For
now I only added a kind of assertion that we cannot analyze recursions (I didn't
use any in my code anyway and logic would be a bit more complex).

Single-stepping

Now what we need to do is start single-stepping after we hit our breakpoint. So
we add a while loop with gdb.execute("stepi") after gdb.execute("continue").
Note that the stepi command advances by a single machine instruction, not by a
source code statement. Also note that we cannot start single-stepping in the
stop() method because GDB won't be in a state which can accept these kinds of
debugging requests.

Detecting stack imbalance

Furthermore, I have wrapped this single-stepping logic in a trace_step() method
in our breakpoint class. This method is not part of the GDB breakpoint API, but
rather a convenience for tracking the number of pushes and pops in a
consistent manner. To run this script we need to call:

gdb -ex 'source debug-stack.py'

Another thing we need to track is whether we have entered another function (in
which case we won't be counting pushes and pops) and if we have returned from
it. If I ever wanted to inspect routines being called, I would just run the same
script for them (my intention here is not creating a custom emulator on top of
GDB). So I added a counter func_count which will increase on call and
decrease on ret instruction. Here is a rough idea:

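import re  # needed at the top of the script for re.search below

def trace_step(self):
    # Ask GDB to disassemble the instruction at the current PC.
    insn_full = gdb.execute("x/i $pc", to_string=True).strip()
    # Keep only the mnemonic (first word after the "=> address:" prefix).
    insn = re.search(r'^=>.*:\s*([^\s].*)$', insn_full).group(1).split()[0]

    # match/case requires Python 3.10 or newer.
    match insn:
        case "push":
            # Only count pushes made by the traced function itself.
            if self.func_count == 0:
                self.pushes += 1
        case "pop":
            if self.func_count == 0:
                self.pops += 1
        case "call":
            # Entering a callee: stop counting until the matching ret.
            self.func_count += 1
        case "ret":
            self.func_count -= 1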

Here you can see how I'm extracting the instruction from GDB. And what is left
is just improving logic and also fetching registers like PC, SP and CS for
debugging. I log everything into a file as GDB can get really noisy with
standard output (I didn't find a way to turn off all logging in GDB completely
when single stepping). The script is available in my GitHub repository:

https://github.com/StjepanPoljak/raspios/tree/master/scripts/debug-stack.py

Finding the root cause

Finally, you can see an example output detecting my very issue:

[START] print_memory_map SP=fffd
[0000:8183] push %ax (SP=0xfffd)
[0000:8185] push %bx (SP=0xfff9)
[0000:8187] push %cx (SP=0xfff5)
[0000:8189] push %dx (SP=0xfff1)
[0000:81a5] call 0x66e980a5 (SP=0xffed)
[0000:80a3] push %ax (SP=0xffeb)
[0000:80a7] call 0xab18056 (SP=0xffe7)
(...)
[0000:806e] ret  (SP=0xffef)
[0000:8098] call 0xf6ec8056 (SP=0xfff1)
[0000:8054] push %ax (SP=0xffef)
[0000:8056] push %bx (SP=0xffeb)
[0000:806a] pop %bx (SP=0xffe7)
[0000:806c] pop %ax (SP=0xffeb)
[0000:806e] ret  (SP=0xffef)
[0000:8098] call 0xf6ec8056 (SP=0xfff1)
[0000:8054] push %ax (SP=0xffef)
[0000:8056] push %bx (SP=0xffeb)
[0000:806a] pop %bx (SP=0xffe7)
[0000:806c] pop %ax (SP=0xffeb)
[0000:806e] ret  (SP=0xffef)
[0000:809d] pop %esi (SP=0xfff1)
[0000:809e] pop %bx (SP=0xfff3)
[0000:80a0] pop %ax (SP=0xfff7)
[0000:80a2] ret  (SP=0xfffb)
[FAIL] Extra pop detected at [0000:81fa].

So the real culprit was an extra pop eax in my print_memory_map routine:

print_memory_map:
    push eax
    push ebx
    push ecx
    push edx

; --- omitted print loop logic ---

.noprint_newline:
    pop eax 

    cmp ecx, [memory_map_size]
    jne .print_memory_map_loop

    pop edx
    pop ecx
    pop ebx
    pop eax
    ret

With this line removed, my debugging script passes.

Try it out yourself

You can try it out yourself, just check out my operating system, raspios, on
GitHub:

https://github.com/StjepanPoljak/raspios

Build and run it with:

mkdir build
cd build
ARCH=x86 cmake ..
make
make qemu_debug

Then, in the scripts folder run gdb -ex 'source debug-stack.py'.

Conclusion

This was a good reminder that low-level debugging often benefits from
lightweight tooling tailored to the problem at hand. In this case, a small
amount of Python automation around GDB made stack corruption analysis
significantly more manageable than manual instruction tracing.