Getting 4.3 M IOPS with an emulated NVMe device in a qemu guest¶
These are initial notes on exploring the use of vfio-user for NVMe device emulation in QEMU, backed by an SPDK block device (e.g., bdev_malloc).
The goal is to assess whether such a setup is viable as a test vehicle for
evaluating the performance characteristics of the host software stack atop a
PCIe endpoint.
Traditional approaches rely on physical hardware, which introduces challenges such as opaque, black-box behavior of real-world SSDs, pre-conditioning requirements, and QoS variations due to garbage collection—all of which impact I/O latency in ways that cannot readily be attributed to either host-side or device-side effects.
Alternatives include NVMeVirt and QEMU’s built-in NVMe emulation.
To date, 4.3M IOPS has been achieved without tuning, using a single CPU core for the SPDK target application. The next steps to determine the viability of the method are:
Can a single SPDK target provide a single emulated NVMe device capable of 100 million IOPS?
Alternatively, can this be scaled out—e.g., by providing 24 emulated devices to collectively reach 100 million IOPS?
The notes contain the steps needed to reproduce the setup and the numbers obtained with it.
System Setup¶
In this system, the host-side software consists of Debian Bookworm with misc. packages and the main actors: SPDK and qemu. The guest-side consists of Debian Bookworm, misc. packages, and the main actors driving I/O efficiently: xNVMe / uPCIe and fio.
For the misc. packages, SPDK provides spdk/scripts/pkgdep.sh, xNVMe provides ./toolbox/pkgs/debian-bookworm.sh, qemu emits errors indicating what is missing, and fio just needs what is provided by the others. Details on building and running the main actors follow.
cijoe and guest-data (host)¶
Install cijoe and rehash paths:
pipx install cijoe
pipx ensurepath
After logging in again, you should be able to:
cijoe --example qemu.guest_x86_64
cd cijoe-example-qemu.guest_x86_64/
cijoe --monitor
By doing so, a Debian Bookworm guest is provisioned using cloud-init. The purpose of this is to have a boot.img to use later on; it should be available here:
/root/guests/generic-bios-kvm-x86_64/boot.img
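If you want to sanity-check the provisioned image before moving on, qemu-img can inspect it (assuming qemu-utils is available on the host):
# Inspect the provisioned boot image; expect format: qcow2
qemu-img info /root/guests/generic-bios-kvm-x86_64/boot.img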
With this done, please do continue.
Variability (host)¶
The configuration provided here is for a CPU with eight cores. The kernel is instrumented for reduced variability via:
Symmetric Multi-threading is disabled
Processor power state is set to max
Idle state is busy
Cores 0-1 are for the host OS and interrupts
Cores 2-7 are for qemu and the SPDK target app
This is done with the following kernel arguments:
nosmt processor.max_cstate=1 idle=poll isolcpus=2-7 nohz_full=2-7 rcu_nocbs=2-7 irqaffinity=0-1
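One way to apply these arguments, assuming the host boots via GRUB, is to add them to GRUB_CMDLINE_LINUX and regenerate the bootloader configuration:
# In /etc/default/grub, e.g.:
# GRUB_CMDLINE_LINUX="nosmt processor.max_cstate=1 idle=poll isolcpus=2-7 nohz_full=2-7 rcu_nocbs=2-7 irqaffinity=0-1"
update-grub
reboot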
Then install the following packages:
apt-get install -fy \
linux-cpupower \
linux-perf \
lm-sensors
Additionally, via userspace tooling, the CPU frequency is fixed, the governor is set to performance, and boost is turned off:
# Use the performance governor
cpupower frequency-set -g performance
# For reference, this shows what is available, should have
# available frequency steps: 3.65 GHz, 2.20 GHz, 1.60 GHz
cpupower frequency-info
# Lock it at 3.65
cpupower frequency-set -f 3.65GHz
# Disable turbo boost
echo 0 | tee /sys/devices/system/cpu/cpufreq/boost
This should provide a reliable core frequency and thus less variability in the IOPS rate. Since the IOPS rate is directly proportional to the CPU frequency, we want to reduce this variability.
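As a quick check that the lock took effect, the per-core frequency can be read back from sysfs:
# All cores should report (roughly) the locked frequency in kHz
grep . /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq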
Hugepages … disable limits
SPDK (host)¶
Retrieve and build what is, at the time of writing, the latest release:
# Grab the source
git clone https://github.com/spdk/spdk.git ~/git/spdk
cd ~/git/spdk
git checkout v25.05
git submodule update --init --recursive
# System Packages
./scripts/pkgdep.sh
apt-get install python3-pyelftools
# Configure and build (do not install)
./configure \
--enable-lto \
--with-vfio-user \
--disable-unit-tests
make -j
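With SPDK built, the target application is started and configured over its RPC interface. The script below does just that: it reserves hugepages, starts spdk_tgt pinned to a single core, creates a malloc bdev, and exposes it as a vfio-user NVMe controller under /tmp/vfu: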
#!/usr/bin/env bash
set -euo pipefail
# --- Paths (intentionally separate) ---
SPDK_TGT_BIN="${SPDK_TGT_BIN:-/root/git/spdk/build/bin/spdk_tgt}" # installed binary
SPDK_RPC_PY="${SPDK_RPC_PY:-/root/git/spdk/scripts/rpc.py}" # from repo checkout
# --- Config ---
SUBSYSTEM_NQN="${SUBSYSTEM_NQN:-nqn.2016-06.io.spdk:cnode1}"
SUBSYSTEM_SN="${SUBSYSTEM_SN:-SPDK00}"
BDEV_NAME="${BDEV_NAME:-Malloc0}"
SIZE_MB="${SIZE_MB:-4096}" # 4 GiB
BLKSZ="${BLKSZ:-512}" # 512-byte LBA
VFIO_DIR="${VFIO_DIR:-/tmp/vfu}" # directory; socket(s) created inside (cntrl)
COREMASK="${COREMASK:-0x10}"
# --- Sanity checks ---
[ -x "$SPDK_TGT_BIN" ] || { echo "spdk_tgt not executable: $SPDK_TGT_BIN" >&2; exit 1; }
[ -r "$SPDK_RPC_PY" ] || { echo "rpc.py not readable: $SPDK_RPC_PY" >&2; exit 1; }
RPC() { "$SPDK_RPC_PY" "$@"; }
echo "[+] Killing any existing spdk_tgt..."
pkill -f spdk_tgt || true
sleep 4
echo "[+] Dedicating 32GB to hugepages etc."
HUGEMEM=$((32 * 1024)) ~/git/spdk/scripts/setup.sh
echo "[+] Resetting vfio-user dir: $VFIO_DIR"
rm -rf -- "$VFIO_DIR"
mkdir -p -- "$VFIO_DIR"
echo "[+] Starting spdk_tgt (COREMASK=$COREMASK)..."
"$SPDK_TGT_BIN" -m "$COREMASK" &
SPDK_PID=$!
# Give RPC a moment then block until framework ready
sleep 0.5
RPC framework_wait_init >/dev/null
echo "[+] Creating malloc bdev: $BDEV_NAME (${SIZE_MB}MB, LBA=${BLKSZ})"
RPC bdev_malloc_create "$SIZE_MB" "$BLKSZ" -b "$BDEV_NAME"
echo "[+] Creating VFIOUSER transport"
RPC nvmf_create_transport -t VFIOUSER
echo "[+] Creating subsystem: $SUBSYSTEM_NQN (SN=$SUBSYSTEM_SN)"
RPC nvmf_create_subsystem "$SUBSYSTEM_NQN" -s "$SUBSYSTEM_SN" -a
echo "[+] Adding namespace: $BDEV_NAME"
RPC nvmf_subsystem_add_ns "$SUBSYSTEM_NQN" "$BDEV_NAME"
echo "[+] Adding VFIO-user listener at $VFIO_DIR"
RPC nvmf_subsystem_add_listener "$SUBSYSTEM_NQN" -t VFIOUSER -a "$VFIO_DIR" -s 0
echo "[✓] Ready: NQN=$SUBSYSTEM_NQN BDEV=$BDEV_NAME LBA=$BLKSZ SIZE=${SIZE_MB}MB PID=$SPDK_PID"
qemu (host)¶
Retrieve and build what is, at the time of writing, the latest release:
# System Packages
apt-get -fy install libcurl4
git clone https://github.com/qemu/qemu.git ~/git/qemu
cd ~/git/qemu
git checkout v10.1.0
git submodule update --init --recursive
mkdir build
cd build
../configure \
--target-list=x86_64-softmmu \
--enable-kvm \
--enable-numa \
--enable-curl \
--enable-virtfs \
--enable-slirp
make -j
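With qemu built, the guest is launched with the vfio-user device attached behind a PCIe root port, as in the script below. Note that guest memory is backed by a shared memory-backend-memfd; this is needed so the SPDK target can access guest memory directly: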
#!/usr/bin/env bash
set -euo pipefail
taskset -c 8-11 \
qemu-system-x86_64 \
-enable-kvm \
-M q35,accel=kvm \
-object memory-backend-memfd,id=mem,size=8G,share=on,prealloc=on \
-numa node,memdev=mem \
-m 8G -smp 4 -cpu host \
-device pcie-root-port,id=rp0,bus=pcie.0,chassis=1,slot=1 \
-device '{"driver":"vfio-user-pci","bus":"rp0","addr":"0x0","socket":{"type":"unix","path":"/tmp/vfu/cntrl"}}' \
-drive file=/root/guests/generic-bios-kvm-x86_64/boot.img,format=qcow2,if=virtio \
-nographic
Misc. (guest)¶
Fire up the guest to add a couple of things:
qemu-system-x86_64 \
-enable-kvm \
-M q35,accel=kvm \
-m 8G -smp 4 -cpu host \
-drive file=/root/guests/generic-bios-kvm-x86_64/boot.img,format=qcow2,if=virtio \
-nographic
Install:
apt-get install -fy \
build-essential \
git \
meson \
pkg-config
uPCIe (guest)¶
The uPCIe driver itself is embedded in xNVMe; however, the repository provides a couple of useful tools (devbind and hugepages), so build and install it:
git clone https://github.com/safl/upcie ~/git/upcie
cd ~/git/upcie
git checkout v0.3.2
make clean config build install
xNVMe (guest)¶
xNVMe has the uPCIe NVMe driver embedded, thus it can be utilized via the xNVMe fio io-engine. The uPCIe driver is not available in upstream xNVMe yet, so get it from the fork as described below:
git clone https://github.com/safl/xnvme ~/git/xnvme
cd ~/git/xnvme
git checkout upcie
make config-slim
make -j
make install
ldconfig
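To verify that the installation picked up the expected backends, the xnvme CLI can report how the library was built (assuming the CLI tool was installed alongside the library):
# Prints version, build configuration and enabled backends
xnvme library-info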
fio (guest)¶
Then to utilize this I/O path, build fio from source:
git clone https://github.com/axboe/fio ~/git/fio
cd ~/git/fio
git checkout v3.40
./configure
make -j
In the output above, verify that the xNVMe ioengine is enabled.
Evaluation¶
Inside the guest, one has to manually bind the desired driver to the PCIe device. In this case I want to use the NVMe driver embedded in xNVMe, thus uio_pci_generic is loaded, associated with the device, and hugepages are reserved:
modprobe uio_pci_generic
echo 4e58 0001 | sudo tee /sys/bus/pci/drivers/uio_pci_generic/new_id
hugepages setup --count 512
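The PCIe address needed below can be found inside the guest with lspci (assuming pciutils is installed); the emulated controller shows up with the vendor/device ID bound above:
# Locate the emulated controller; expect something like 01:00.0 ... [4e58:0001]
lspci -nn | grep -i 4e58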
With this in place, fio can be executed using the xNVMe fio io-engine and the upcie backend in xNVMe. You must provide the PCIe address in domain:bus:device.function form, as shown below, and tell it which namespace to make use of:
./fio \
--name=randread \
--rw=randread \
--bs=512 \
--iodepth=16 \
--numjobs=1 \
--filename="0000\\:01\\:00.0" \
--ioengine=xnvme \
--xnvme_dev_nsid=0x1 \
--thread=1 \
--time_based \
--runtime=60s
Numbers from different systems are provided below.
Intel¶
- cpu: Core i5-12500
- mb: PRIME Z690M-HZ
- mem: 2x 32GB DDR4 @ 3200 MHz - Samsung M378A4G43AB2-CWE
I got the following numbers:
randread: (g=0): rw=randread, bs=(R) 512B-512B, (W) 512B-512B, (T) 512B-512B, ioengine=xnvme, iodepth=16
fio-3.40-88-gd6ac
Starting 1 thread
Jobs: 1 (f=1): [r(1)][100.0%][r=2105MiB/s][r=4311k IOPS][eta 00m:00s]
randread: (groupid=0, jobs=1): err= 0: pid=484: Tue Sep 2 12:17:29 2025
read: IOPS=4299k, BW=2099MiB/s (2201MB/s)(123GiB/60000msec)
slat (nsec): min=41, max=62633, avg=46.68, stdev=17.72
clat (nsec): min=1521, max=3768.9k, avg=3550.21, stdev=1419.48
lat (nsec): min=1568, max=3769.0k, avg=3596.88, stdev=1419.84
clat percentiles (nsec):
| 1.00th=[ 3280], 5.00th=[ 3344], 10.00th=[ 3376], 20.00th=[ 3408],
| 30.00th=[ 3440], 40.00th=[ 3472], 50.00th=[ 3472], 60.00th=[ 3504],
| 70.00th=[ 3536], 80.00th=[ 3600], 90.00th=[ 3696], 95.00th=[ 3856],
| 99.00th=[ 4512], 99.50th=[ 5920], 99.90th=[14272], 99.95th=[16320],
| 99.99th=[17280]
bw ( MiB/s): min= 2057, max= 2129, per=100.00%, avg=2100.85, stdev=18.03, samples=119
iops : min=4212818, max=4362210, avg=4302538.20, stdev=36931.60, samples=119
lat (usec) : 2=0.01%, 4=96.61%, 10=3.26%, 20=0.12%, 50=0.01%
lat (usec) : 100=0.01%
lat (msec) : 4=0.01%
cpu : usr=99.99%, sys=0.00%, ctx=298, majf=1, minf=0
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=257932731,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=16
Run status group 0 (all jobs):
READ: bw=2099MiB/s (2201MB/s), 2099MiB/s-2099MiB/s (2201MB/s-2201MB/s), io=123GiB (132GB), run=60000-60000msec
The key metrics to observe above are the avg. latency of 3600ns and the 4.3M IOPS. Do note that this was without core-isolation and with a single emulated device pinned to a single core. Thus, the i5 core was able to do a dramatic turbo-boost.
AMD¶
- cpu: Ryzen 7 PRO 8700GE
- mb: ASRock Rack B665D4U-1L
- mem: 2x 32GB DDR5 @ 5600 MHz - Micron MTC20C2085S1EC56BD1 KC