WangLu's Notes

Qubesify My Daily Driver Part 2: Headless Micro VM

2026-05-17T13:59:49.427+02:00

I decided to take the plunge into micro VMs. My goal? To set up a headless micro VM capable of running graphical programs remotely.

As a first milestone, I wanted to get Firefox running and smoothly playing videos. (See Part 1 for a breakdown of why I passed on other isolation methods.)

Overview

At a high level, the concept is simple: I click an icon, and Firefox opens seamlessly on my screen while actually running securely in a VM.

This setup is similar to disposable VMs in Qubes OS. When the program closes, the VM is destroyed, leaving absolutely zero trace on the disk. To pull this off, I needed to boot a micro VM with a minimal kernel and disk image, and seamlessly forward both graphics and audio to my daily-driver main VM.

The Kernel

Unlike standard VMs, micro VMs do not support PCI devices. Instead, they rely on virtio devices exposed over MMIO (drivers like virtio_mmio and virtio_blk), which the kernel must support natively. Crucially, these drivers must be compiled directly into the kernel rather than loaded as modules.

I tested a few pre-built kernels, but none fit the bill:

Alpine image
Debian standard kernel
Debian cloud image (nocloud)

I decided to build my own using the kernel config recommended by Firecracker (config, doc). It turned out to be a straightforward process, resulting in a tiny, lightning-fast kernel.

The Disk Image

To run multiple programs in their own isolated micro VMs, I plan to build a single shared disk image containing all the necessary software. I can then launch multiple micro VMs from this identical base, instructing each one to run a different program, much like templates in Qubes OS.

Because I am using direct kernel boot, there is no need for initramfs or initrd. In fact, I experimented with an initramfs-only setup (no disk image at all) but quickly abandoned it. I needed a file that could be directly mapped to memory, and extracting an archive into RAM defeated the purpose.

Naturally, I tried a few pre-built disk images:

Alpine image: Failed. /sbin/init runs but complains that openrc is missing. This comment suggests I might need to manually install and configure it.
Alpine netboot: Boots, but expects alternative boot media.
Debian cloud image: Works, but too slow (~8 seconds) to boot.

I also considered reusing the disk image of my Viewer VM (my daily driver, akin to sys-gui in Qubes OS). While it would guarantee an up-to-date environment, I hit a few roadblocks:

It would require a dedicated /home partition.
There was a high risk of accidentally copying sensitive /etc secrets into an untrusted VM.
My daily driver, Fedora Silverblue, is simply too bloated for a micro VM.

Building and Formatting

I needed a minimal footprint and official repository support, making Debian the obvious choice (though I may explore Fedora later for specific packages).

To build the image, there are a few options, but they all have too many dependencies:

bootc (which I’ve tried previously, but only for servers)
mkosi (which I’ve also tried before)
virt-builder

While debootstrap is classic for Debian, it struggled without root privileges. I successfully pivoted to mmdebstrap, which gracefully handled most issues. I did have to manually fix /etc/resolv.conf to prevent Podman’s virtual DNS from leaking into the VM.

For the format, I needed to minimize the host’s memory footprint, including the page cache. I also plan to let multiple micro VMs use shared memory for the disk image. This means I will need to put raw disk images into huge pages, so I naturally passed on LVM thin snapshots. Because QEMU’s qcow2 format isn’t natively supported by mmdebstrap, I am just storing the raw disk images offline (roughly 1.1GB). In the future, I might compress them manually.

Maintenance

Keeping the kernel and disk image updated is tricky but critical. My current plan is to schedule offline weekly or biweekly builds from scratch. While in-place upgrades might be possible, building fresh is significantly safer and cleaner.

Running the Micro VM

Running the VM wasn’t too difficult, but it came with its own set of quirks:

Standard serial ports show early boot messages, while hvc0 does not.
By default, QEMU uses SeaBIOS (with -M microvm). Despite documentation claiming it supports direct kernel boot, the kernel couldn’t find the disk. Enabling qboot ultimately solved the issue.
To avoid using host memory, I wanted to copy the disk image to /dev/hugepages. This was surprisingly stubborn: I could create, read, and mmap files there, but not write to them. I ended up writing a custom Python script to “copy” the image file using mmap.
I bridged pre-allocated TAP devices for each VM to handle firewall rules easily, avoiding the setuid qemu-bridge-helper. Since my minimal image lacked basic networking tools like ip, ping, ifconfig, and systemd-networkd, debugging was a fun adventure using /proc/net/dev and /proc/net/fib_trie. I also learned about the ip=:::::eth0:dhcp kernel parameter.
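
An mmap-based copy into /dev/hugepages could look roughly like the sketch below. It is not the actual script: the function name, the 2 MiB huge page size, and the chunk size are assumptions. The destination is sized with ftruncate, mapped writable, and filled by reading the source straight into the mapping, so write() is never called on the hugetlbfs file.

```python
import mmap
import os

HUGE_PAGE_SIZE = 2 * 1024 * 1024  # assumption: 2 MiB huge pages


def copy_via_mmap(src, dst, page_size=HUGE_PAGE_SIZE, chunk_size=1 << 20):
    """Copy src into dst through a shared writable mapping; return the padded size."""
    size = os.path.getsize(src)
    if size == 0:
        raise ValueError("source image is empty")
    # hugetlbfs only accepts lengths that are a multiple of the huge page size.
    padded = -(-size // page_size) * page_size
    fd = os.open(dst, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        os.ftruncate(fd, padded)
        with open(src, "rb") as fin, mmap.mmap(
            fd, padded, flags=mmap.MAP_SHARED,
            prot=mmap.PROT_READ | mmap.PROT_WRITE,
        ) as mapping:
            with memoryview(mapping) as view:
                offset = 0
                while offset < size:
                    end = min(offset + chunk_size, size)
                    # Read file data directly into the mapped pages.
                    with view[offset:end] as window:
                        copied = fin.readinto(window)
                    if not copied:
                        break
                    offset += copied
    finally:
        os.close(fd)
    return padded
```

Calling copy_via_mmap("disk.img", "/dev/hugepages/disk.img") would then place the image in huge pages, padded with zeros up to the next page boundary (both paths are placeholders).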

Graphics

Without graphics, this setup is just a science experiment. This was the most fascinating and difficult part of the build.

Options

Remote desktop approaches like VNC, RDP, and SPICE are good defaults. However, I avoided them because I want:

A minimal guest system
A seamless experience
A (more) secure setup

I surveyed options capable of directly forwarding Wayland:

waypipe: forwards the Wayland protocol over a network.
Sommelier: part of Chromium OS; delegates compositing in the VM to the host using shared memory (Read more). This relies on the virtio_wl kernel module, and it is designed to work with crosvm. Spectrum OS also utilizes it (Read more).
wayland-proxy-virtwl: similar to Sommelier, also meant for crosvm. It can directly talk to a local Wayland compositor or via virtio-gpu. (Read more).
qubes-wayland, designed for Wayland in Qubes OS, using Xen’s vchan.

Ultimately, I decided to use waypipe.

Running Waypipe

Since my Viewer VM and the micro VM are isolated on the network, I decided to connect them using vsock. The waypipe manpage mentioned:

When running both client and server in virtual machines it is possible to enable the VMADDR_FLAG_TO_HOST flag for sibling communication by prefixing the CID with an s

I thought, “Ha! That’s not very difficult,” and was excited to try it out. However, before long, I realized it wouldn’t work without a vsock bridge like vhost-device-vsock, which is also mentioned in the manpage a few lines below. I don’t like this bridge, so I just used socat to forward the vsock port from the host to the Viewer VM. It worked perfectly.

Isolation and Authentication

The tricky part was routing multiple untrusted micro VMs securely. If I have two micro VMs running, how do I ensure VM A cannot intercept VM B’s waypipe connection?

One option is to use waypipe ssh, where the Viewer VM initiates the entire process. It should be very secure, but I’d rather avoid the encryption overhead.

Instead, I designed a custom authentication/dispatcher setup:

On the Host: A stateless proxy forwards a vsock port to the Viewer VM, prepending the peer’s CID before the actual data.
In the Viewer VM: A receiving bridge reads the CID, decides which UNIX socket the micro VM is authorized to use, and routes the connection accordingly.
In the Micro VM: waypipe connects out to the host’s vsock port.

I wrote this bridge in ~100 lines of Python using os.splice() and asyncio. Handling the scheduling, blocking IO, EOF, and error propagation was incredibly tricky, but the resulting efficiency was worth it.
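
The dispatcher idea can be sketched as follows. This is a simplified illustration, not the real bridge: it uses plain asyncio streams instead of os.splice(), it assumes the host proxy sends the peer CID as a 4-byte big-endian header, and the routes table (CID to UNIX socket path) is hypothetical.

```python
import asyncio
import struct

CID_HEADER = struct.Struct("!I")  # assumption: CID sent as 4 bytes, big-endian


def encode_cid(cid):
    """Host side: bytes to prepend before relaying a micro VM's data."""
    return CID_HEADER.pack(cid)


async def pump(reader, writer):
    """Copy one direction until EOF, then close the destination."""
    try:
        while data := await reader.read(65536):
            writer.write(data)
            await writer.drain()
    except ConnectionError:
        pass
    finally:
        writer.close()


async def handle(reader, writer, routes):
    """Viewer VM side: read the CID header, then relay to the authorized socket."""
    try:
        header = await reader.readexactly(CID_HEADER.size)
    except asyncio.IncompleteReadError:
        writer.close()
        return
    (cid,) = CID_HEADER.unpack(header)
    path = routes.get(cid)
    if path is None:  # unknown CID: drop the connection without relaying
        writer.close()
        return
    up_reader, up_writer = await asyncio.open_unix_connection(path)
    await asyncio.gather(
        pump(reader, up_writer),
        pump(up_reader, writer),
        return_exceptions=True,
    )
```

The handler can be registered with asyncio.start_unix_server (or any other stream server) via functools.partial(handle, routes={...}). Closing the whole connection on the first EOF is a simplification; a production bridge would need proper half-close handling.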

Theoretically, waypipe in the Viewer VM could directly listen on vsock ports, eliminating the need for the bridge. In practice, there are issues:

I run waypipe in a rootless Podman container, which cannot listen on a vsock without the --privileged flag. It turns out vsock is blocked by seccomp by default, and I’d have to write a custom profile to allow it.
The authentication logic needs to be moved to the host. While the host can directly see the peer CID, coordinating which vsock ports are open to which micro VMs between the Viewer VM and the host is complicated.

Audio

Waypipe handles graphics, keyboard, and mouse data, but leaves audio behind. QEMU features a PipeWire backend, but utilizing it would have required installing bulky GUI packages in Debian, as Debian does not offer a finer-grained package.

Instead, I built a multi-hop local proxy that forwards a vsock port directly to the local pipewire-pulse UNIX socket. This completely bypasses the network and PipeWire’s built-in authentication, which is perfectly acceptable for my isolated use case.
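
One hop of such a proxy might look like the sketch below, which accepts vsock connections and relays each one to a UNIX socket. The function names, the thread-per-connection design, and the parameters are assumptions for illustration; the port and socket path are left to the caller.

```python
import socket
import threading


def _copy(src, dst):
    """Copy bytes from src to dst until EOF, then signal EOF downstream."""
    try:
        while chunk := src.recv(65536):
            dst.sendall(chunk)
    except OSError:
        pass
    finally:
        try:
            dst.shutdown(socket.SHUT_WR)
        except OSError:
            pass


def relay(a, b):
    """Relay data in both directions, then close both sockets."""
    worker = threading.Thread(target=_copy, args=(a, b), daemon=True)
    worker.start()
    _copy(b, a)
    worker.join()
    a.close()
    b.close()


def serve_vsock(port, unix_path):
    """Listen on a vsock port and forward every connection to unix_path."""
    with socket.socket(socket.AF_VSOCK, socket.SOCK_STREAM) as server:
        server.bind((socket.VMADDR_CID_ANY, port))
        server.listen()
        while True:
            conn, _peer = server.accept()  # _peer is (cid, port)
            upstream = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
            upstream.connect(unix_path)
            threading.Thread(
                target=relay, args=(conn, upstream), daemon=True
            ).start()
```

Note that this sketch forwards every peer unconditionally; it mirrors the statement above that authentication is bypassed, so it only makes sense where the vsock port itself is already restricted.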

Side note: I also tried to simply add Listen=vsock... to pipewire.socket, but it didn’t work. I guess I’d have to add something to the PipeWire config as well, which may not even be supported.

Running Firefox

With the infrastructure in place, it was time to run Firefox. I wrapped it in a systemd service within the micro VM so it can automatically shut down the micro VM when Firefox exits.

It worked like a charm.

Video playback is surprisingly smooth and completely lag-free. The CPU usage is slightly higher than native execution, but acceptable. I might look for better software decoders in the future.

Conclusion

Building this headless micro VM setup was a fantastic learning adventure.

There are still plenty of missing pieces to tackle in the future. To name a few:

Exploring options that are more secure than waypipe.
Optimizing memory usage across multiple micro VMs.
Enforcing specific window border colors for visual security cues.
Seamlessly opening host files or URLs inside the micro VMs.
Implementing persistent states (similar to Qubes OS AppVMs).

Isolating Graphical Software (Part 1)

2026-05-07T11:18:00.002+02:00

I’ve really been missing the experience of Qubes OS, where all programs are properly isolated. However, I can’t install it on my daily driver because I use that machine for gaming. Instead, I’ve been exploring ways to approximate that isolation on a standard Linux setup.

Firefox Profiles

Let’s start with the browser. There are many privacy and security benefits to separating profiles. My general workflow is as follows:

I create separate .desktop files to run Firefox with different profiles. Each file uses a unique --name and StartupWMClass=. This prevents the icons from stacking together on the GNOME Panel.
I use ImageMagick to tint the icons so I can tell the profiles apart at a glance: magick input.png -colorspace gray -fill "#cc0000" -tint 100 output.png
The “new” Firefox profile manager didn’t work well for me. For example, I couldn’t set a default profile for opening external URLs. I eventually had to switch back to the old profile management style.

However, while this setup isolates cookies, it doesn’t isolate processes. A compromised Firefox instance could still theoretically access data from other profiles. For better security, I’m considering:

Installing Firefox via Flatpak and launching multiple instances with different data directories.
Installing entirely different browsers via Flatpak.

One challenge is that some configurations (like specific extensions) need to be synced across profiles, which isn’t always easy to automate.

Flatpak

Since Fedora Silverblue provides Flatpak by default, I gave it a try and really liked it. It feels similar to the application sandboxing on modern mobile phones. I can fine-tune permissions using Flatseal, though the default permissions are usually reasonable enough that I don’t have to.

There are only two real downsides:

It isn’t obvious or easy to launch multiple independent instances of the same application (like Firefox).
The Fedora repository isn’t very comprehensive, and I’m not quite ready to switch to the Flathub repository yet.

Podman

Podman is great for accessing a wider selection of packages, though it requires more manual work to configure permissions.

Wayland

Surprisingly, it isn’t difficult to allow a container to use the host’s Wayland display. Here is an example command:

podman run \
  --device /dev/dri:/dev/dri \
  --env=WAYLAND_DISPLAY="${WAYLAND_DISPLAY}" \
  --volume "${XDG_RUNTIME_DIR}/${WAYLAND_DISPLAY}":/run/wayland-0:z \
  ...

However, this does require some SELinux configuration, which I’ll cover below.

SELinux

Podman supports SELinux, and the :z and :Z volume flags are useful in many cases. However, they don’t work well for shared resources like Wayland sockets or network shares.

I used udica to analyze container images and generate custom SELinux types. It isn’t perfect (for instance, it doesn’t automatically detect and allow Unix sockets), but it provides a solid base configuration.

This is where I learned about SELinux CIL (Common Intermediate Language). Standard SELinux doesn’t strictly allow “inheritance,” but CIL files make it possible to define a container that behaves like a standard container while adding specific permissions (like talking to /dev/vsock). For more on this, check out this customization guide.

By using udica, audit2allow, and semodule, I defined a custom SELinux type (e.g., my-type) for my container. Now, I just run the container with: --security-opt label=type:my-type.process.

Security-Focused Runtimes

Generally, standard containers aren’t considered a “hard” security boundary. There are security-focused runtimes like gVisor and Kata Containers, but gVisor isn’t ideal for graphical programs, and Kata uses VMs. Since my daily driver is already a VM, I want to avoid the performance hit of nested virtualization.

Virtual Machines

For serious isolation, a VM is the gold standard. Since my daily driver is a VM with GPU passthrough, I decided that instead of nesting, the daily driver should coordinate with the host to launch headless VMs and render the graphics back to the daily driver.

I also wanted to experiment with microVMs, which seem perfect for this. I looked into several options:

Kata Containers is appealing because it supports OCI images. Theoretically, I could run podman inside my VM to control a podman socket on the host. libvirt might offer a similar path.

However, I want to stick to the official Debian repositories and keep my host installation as minimal as possible. Currently, my host is just a bare-bones Debian install with QEMU.

Ultimately, I decided to see how far I could get with QEMU microvm. After a few days of tinkering, I have a working prototype. In the next post, I’ll share the details of that setup.

My Experience with libvirt

2026-04-26T20:36:00.002+02:00

My daily driver setup has been working exceptionally well. If you missed them, you can check out the previous posts for details.

Recently, I decided to harden my setup by writing custom AppArmor profiles and nftables rules. During my research, libvirt kept popping up in tutorials. In fact, most “QEMU/KVM” guides simply assume you are using it.

Initially, I decided against using libvirt because I wasn’t a fan of its design philosophy. However, I kept hearing that it can automatically generate AppArmor profiles and firewall rules for each VM. Furthermore, this article highlighted several things that libvirt genuinely simplifies. Intrigued, I decided to dive in and get some first-hand experience.

The plan was to migrate my existing VM setup (which relies on bare QEMU scripts) to libvirt to see if it was a good fit. I specifically wanted to evaluate the parts of libvirt I was previously skeptical about:

Reliance on a highly privileged daemon.
Configuration stored in managed XML files, which makes it harder to create reusable components or templates.

The Good Parts

As noted in this article, while bare QEMU/KVM lacks a stable API, libvirt provides a very stable XML API.
CPU pinning and lifecycle management (like gracefully shutting down a VM) are as simple as just a few lines of XML. With bare QEMU, I had to write a Python daemon to parse the QMP protocol just to achieve this. libvirt also seamlessly handles a lot of edge cases during shutdowns.

The OK Parts

domxml-from-native didn’t work for me; this might be a Debian-specific issue. Thankfully, it wasn’t too difficult to manually recreate the XML file from my existing QEMU arguments.
Instead of using virsh define, I found I could use virsh create. This reads from an XML config and creates a transient VM. I love this approach because it lets me maintain control over my XML configs, making it possible to use scripts or Jinja templates to define reusable templates.
The highly privileged daemon is a mixed bag. Sometimes root access is required to configure network interfaces or AppArmor profiles, but I generally prefer to isolate those operations. For example, I’d rather pre-configure the network interfaces and define AppArmor profiles in systemd service files to keep the daemon rootless. That said, it’s not a dealbreaker:
- The Arch Wiki notes that the daemon isn’t strictly required in all scenarios.
- I had already ended up writing my own custom daemons for CPU pinning and lifecycle management anyway.
Networking has its quirks. libvirt can add a managed TAP device, but out of the box, it lacks NAT and network filtering.
- It can use my existing TAP device as an unmanaged interface, but then it won’t apply network filters.
- I couldn’t seem to manually fix the name of the TAP device, which makes writing custom nftables rules a headache.
- Network configurations have to be defined separately (though they can be created transiently).
- While libvirt can auto-generate network filters, any custom logic requires a separate, permanent XML config. I’d much rather just write plain nftables rules directly.
libvirt automatically creates AppArmor profiles for each VM, restricting the VM to only read its own disk image. The profile libvirt provides is actually quite reusable. I’ve used it in the past for server projects, even if it’s not perfect.

The Bad Parts

libvirt pulls in a lot of dependencies. Most annoyingly, nwfilter has a hard dependency on iptables. I’ve heard libvirt supports nftables now, so this might just be another Debian-specific quirk.
It wasn’t easy to assign a unique user to each VM. Setting the DAC seclabel resulted in a permission error regarding “master-key.aes”. This is likely because /var/lib/libvirt/qemu is restricted exclusively to the libvirt-qemu user.
I couldn’t get sound to work at all (I use ALSA). I probably just needed to add the libvirt-qemu user to the right audio groups, but it was another hurdle.

Conclusion

It is undeniable that libvirt is highly scalable and offers a rock-solid API. However, for my specific use case, it makes 80% of the setup incredibly easy while making the last 20% frustratingly difficult. To use it efficiently, you really have to commit to speaking the “libvirt language”.

I could probably iron out the remaining kinks if I spent a few more hours on it, but I don’t see the point right now. I’ll likely reconsider it in the future when I start experimenting with disposable VMs.

Fix a USB Game Controller Disconnecting When Idle

2026-04-22T10:17:00.004+02:00

I have a USB game controller that works flawlessly during gameplay, but it keeps disconnecting and reconnecting every ~40 seconds whenever it is left idle. While it is mostly harmless in practice, it is annoying to see the kernel log flooded with messages about the game controller.

Many online resources, such as the Arch Wiki, suggest disabling USB autosuspend to fix this. Unfortunately, that didn’t solve the issue in my case. I suspect the controller expects to be continuously polled rather than entering a power-saving state, especially since it immediately tries to reconnect after dropping.

Since I primarily use this controller by passing it through to a VM, I needed a way to manually manage its connection state. Here is the workaround that eventually did the trick:

Add the following rule to udev. This ensures the controller is ignored by the host system upon connection. ACTION=="add", SUBSYSTEM=="usb", ATTR{idVendor}=="xxxx", ATTR{idProduct}=="xxxx", PROGRAM="systemctl is-active --quiet my-gaming-vm.service", ATTR{authorized}="0"
- Note the PROGRAM condition in the udev rule. It is needed because the VM might not grab the device fast enough (e.g. due to a slow boot), in which case the game controller will disconnect and reconnect. In this case we don’t want to write 0 to authorized.
Write 1 to <DEV_PATH>/authorized right before booting the VM. This authorizes the controller so it can be successfully forwarded.
Write 0 to <DEV_PATH>/authorized after the VM shuts down. This disconnects the driver once again.

How to Find DEV_PATH

In the steps above, <DEV_PATH> refers to the device path of the USB controller. Normally, this can be conveniently found by following the symbolic link in /dev/input/by-id. However, because we are forcefully disconnecting the device by writing to authorized, that standard method won’t work here.

One can find the <DEV_PATH> by traversing and matching /sys/bus/usb/devices/*/{idVendor,idProduct}, but a much cleaner approach is to create a custom symbolic link using udev.

Add the following rule (note that there is no ACTION condition): SUBSYSTEM=="usb", ATTR{idVendor}=="xxxx", ATTR{idProduct}=="xxxx", SYMLINK+="my-controller"

With this rule in place, I can easily obtain the correct device path as: /sys$(udevadm info --query=path --name=/dev/my-controller)
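
For comparison, the sysfs traversal approach mentioned above, together with the writes to authorized, can be sketched in a few lines. The function names and the root parameter are illustrative assumptions; the vendor and product IDs stay placeholders, as in the rules.

```python
import glob
import os


def find_usb_device(vendor_id, product_id, root="/sys/bus/usb/devices"):
    """Return the sysfs path of the first device matching the IDs, or None."""
    wanted = (vendor_id.lower(), product_id.lower())
    for dev in sorted(glob.glob(os.path.join(root, "*"))):
        try:
            with open(os.path.join(dev, "idVendor")) as f:
                vendor = f.read().strip().lower()
            with open(os.path.join(dev, "idProduct")) as f:
                product = f.read().strip().lower()
        except OSError:
            continue  # interface entries such as 1-2:1.0 lack these files
        if (vendor, product) == wanted:
            return os.path.realpath(dev)
    return None


def set_authorized(dev_path, authorized):
    """Write 1 (before booting the VM) or 0 (after shutdown) to authorized."""
    with open(os.path.join(dev_path, "authorized"), "w") as f:
        f.write("1" if authorized else "0")
```

A VM launcher could call set_authorized(path, True) right before starting QEMU and set_authorized(path, False) once the process exits. Writing to the real sysfs file requires root.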

Alternative Solutions (Untested)

The xone driver handles device polling differently and may resolve the issue natively.
Run cat /dev/input/by-id/my-controller > /dev/null as a background process to keep the device active.
Use usbguard to manually allow/deny the device.

Linux Daily Driver Setup Part 3: VM Control Panel

2026-04-20T10:05:00.004+02:00

Now that I have my VMs running, what if I want to switch between them? What if I want to put the VM or the host to sleep, or shut them down?

In this post, I’ll discuss a few options.

Just a quick overview of my setup:

Only one VM is running at a time.
All VMs have GPU passthrough, and the GPU is connected to a display.
I have a secondary display, which I prefer not to use unless absolutely necessary.

Option 1: Deep Integration with Guest OS

Ideally, I could just click on a “Power off and Switch to VM X” menu from within the guest OS.

This isn’t difficult to support on the host side: the guest OS could pass the VM name to the host (e.g., via a serial port), and the host would automatically start the next VM when the current one exits.

However, I didn’t find an easy way to implement this menu, especially considering I plan to support multiple operating systems. I also couldn’t find a way to prevent the guest from shutting down without providing the next VM’s name.

It sounds like it would require a lot of custom scripting, so I decided to pass on this idea.

Option 2: Host Reclaims the GPU

I figured the logic had to be implemented on the host side. The idea here is that the host reclaims the GPU when a VM exits.

This requires two steps:

Rebind the GPU from vfio-pci to a “real” driver after a VM exits.
Rebind the GPU from the “real” driver back to vfio-pci before a VM starts.

Step 1 is generally easy because vfio-pci is pretty lightweight, but Step 2 is not, especially for nouveau. I also needed to make sure to disconnect all possible GPU usages (e.g., fbcon), otherwise the driver would likely hang.

In the end, I couldn’t make it stable enough, and I definitely didn’t want to risk hanging the host. Time for a new option.
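
For reference, the two rebinding steps boil down to a few sysfs writes. The sketch below shows only the generic PCI mechanism (unbind, driver_override, drivers_probe); the function name and the root parameter are assumptions, and it does nothing about the hard parts described above, such as releasing fbcon or other users of the GPU first.

```python
import os


def rebind_pci_device(address, driver, root="/sys/bus/pci"):
    """Detach a PCI device from its current driver and bind it to `driver`."""
    device = os.path.join(root, "devices", address)

    def write(path, value):
        with open(path, "w") as f:
            f.write(value)

    # 1. Unbind from whatever driver currently owns the device, if any.
    if os.path.exists(os.path.join(device, "driver")):
        write(os.path.join(device, "driver", "unbind"), address)
    # 2. Tell the kernel which driver should claim the device next.
    write(os.path.join(device, "driver_override"), driver)
    # 3. Ask the kernel to probe the device again.
    write(os.path.join(root, "drivers_probe"), address)
```

For example, rebind_pci_device("0000:01:00.0", "vfio-pci") before starting a VM and the same call with the real driver name afterwards (the address is a placeholder). Running it against the real /sys requires root.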

Option 3: Dedicated Menu VM

Since I prefer to stick with the vfio-pci driver, a natural idea was to launch a dedicated VM just for the menu.

The kernel and rootfs could be heavily stripped down to do nothing but show a menu and pass the user’s choice back to the host. It’s a bit of work, but doable.

In practice, however, a minimal VM took about 6 seconds to boot. I managed to pinpoint the bottleneck: GPU initialization. The VM boots much faster without it.

Ultimately, I couldn’t find a good solution. My only finding was that OVMF/UEFI is required to initialize the GPU; without it, the GPU won’t work and the kernel might just hang (the RIP register doesn’t change). Using a dumped or downloaded ROM file didn’t help in my case.

I didn’t bother implementing the menu because the boot delay was just too long for practical use.

Option 4: Web Server

Another idea was to implement a web server with a simple UI, allowing me to control the host using my phone.

I don’t think it would be difficult to build something that just works, but making it secure enough would be a challenge. Plus, I don’t like that it requires a second device.

Option 5: Secondary Display

Running out of ideas for reusing the primary monitor, I decided to compromise and use the secondary display.

It was straightforward to implement:

The menu is built using whiptail.
Mask the default getty@tty2 service and run my menu service on TTY2.
Use chvt to switch terminals, and setfont to set a huge font size.
Write to /sys/module/kernel/parameters/consoleblank to make the screen turn off automatically.
The screen turns on automatically when a VM stops, probably due to keyboard/mouse events after evdev passthrough switching.
I also added a few options to the menu like “sleep” and “power off”, so I can control the host without having to log in.
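
A whiptail menu like this can be driven from a small script. The sketch below is an illustration only: the function names, dimensions, and menu entries are made up. It relies on whiptail drawing its UI on the terminal and printing the chosen tag to stderr.

```python
import subprocess


def build_menu_argv(title, prompt, entries, height=20, width=60):
    """Build the whiptail command line for a list of (tag, label) entries."""
    argv = [
        "whiptail", "--title", title,
        "--menu", prompt, str(height), str(width), str(len(entries)),
    ]
    for tag, label in entries:
        argv += [tag, label]
    return argv


def ask(title, prompt, entries):
    """Show the menu and return the selected tag, or None if cancelled."""
    result = subprocess.run(
        build_menu_argv(title, prompt, entries),
        stderr=subprocess.PIPE,  # the selection is reported on stderr
        text=True,
    )
    return result.stderr.strip() if result.returncode == 0 else None
```

A menu service could loop on ask("VM Control", "Choose an action", [("daily", "Start daily driver"), ("sleep", "Sleep"), ("off", "Power off")]) and dispatch on the returned tag.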

In practice, this works really well. It just requires that secondary display.

Thoughts

While Option 5 ended up being the best compromise, I really wish Option 2 or 3 had worked better. That way, the entire setup would work on a single display. Maybe there is a better solution for GPU initialization out there. I’ll probably revisit this later.

VM Setup Part 2: QEMU

2026-04-15T11:39:00.004+02:00

Previously, I decided to set up a headless VM orchestrator for daily driving and gaming.

In general, installing a minimal Debian and QEMU is fairly straightforward, but the devil is in the details. Here are my notes on the process.

Overall Setup

The machine is a laptop. The built-in display is connected to the iGPU, and an external display is connected to the dGPU.
I prefer to use only the external display; the built-in one is rarely used.
I have two VMs: one as a daily driver and the other for gaming.
Both have GPU passthrough.
Only one runs at a time.

Resources Reserved for the Host

CPU: 1 core, 2 threads.
RAM: 2GB (typical usage is around 500MB).
Disk: 32GB. The OS only needs ~2GB, but I need extra space for cache, logs, OS images, etc.

CPU

The main QEMU process and IO thread should be pinned to the host CPUs.
vCPU threads should be pinned to the CPUs reserved for the VM.
CPU pinning cannot be done through QEMU command-line arguments. I had to talk to the QEMU process via QMP (though this would be easier with libvirt).
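
Pinning through QMP can be sketched as follows: connect to the QMP UNIX socket, ask for the vCPU thread IDs with query-cpus-fast, and set each thread's affinity. The class and function names, the one-to-one mapping of vCPUs to host CPUs, and the socket path are assumptions for illustration.

```python
import json
import os
import socket


class QMPClient:
    """Tiny synchronous QMP client over a UNIX socket."""

    def __init__(self, path):
        self.sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        self.sock.connect(path)
        self.reader = self.sock.makefile("r", encoding="utf-8")
        self._read()                      # greeting banner
        self.execute("qmp_capabilities")  # leave capability negotiation mode

    def _read(self):
        line = self.reader.readline()
        if not line:
            raise ConnectionError("QMP socket closed")
        return json.loads(line)

    def execute(self, command, **arguments):
        request = {"execute": command}
        if arguments:
            request["arguments"] = arguments
        self.sock.sendall((json.dumps(request) + "\n").encode())
        while True:
            reply = self._read()
            if "event" in reply:
                continue  # asynchronous events may arrive before the reply
            if "error" in reply:
                raise RuntimeError(reply["error"].get("desc", "QMP error"))
            return reply["return"]

    def close(self):
        self.reader.close()
        self.sock.close()


def pin_vcpus(client, host_cpus, set_affinity=None):
    """Pin vCPU n to host_cpus[n]; return the {vcpu index: host cpu} plan."""
    set_affinity = set_affinity or os.sched_setaffinity
    vcpus = sorted(client.execute("query-cpus-fast"),
                   key=lambda cpu: cpu["cpu-index"])
    plan = {}
    for vcpu, host_cpu in zip(vcpus, host_cpus):
        set_affinity(vcpu["thread-id"], {host_cpu})
        plan[vcpu["cpu-index"]] = host_cpu
    return plan
```

With QEMU started with a QMP UNIX socket, pin_vcpus(QMPClient("/run/vm/qmp.sock"), [2, 3, 4, 5]) would pin four vCPUs (path and CPU numbers are placeholders). The main QEMU process and IO threads still need to be pinned separately.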

RAM

Enable 1GB huge pages and reserve them so the host cannot use them.
Instruct QEMU to use all reserved huge pages, pre-allocate all needed memory, and avoid using memory from other sources.

GPU Passthrough

Various guides are available online:
- Arch Wiki
- Debian Wiki
Kernel parameters for vfio_pci didn’t work for me.
- The GPU was still bound to some other driver by default. I probably needed to adjust the module order in initramfs.
- This was fine in my case because the TTY is bound to the iGPU, and the dGPU is not used by default. It was easy to dynamically unbind the default driver and bind it to vfio-pci.
The display remained blank unless I added OVMF UEFI firmware.
- Interestingly, this was very tricky to debug: there was no display and no terminal output (maybe a serial port would have helped?).
- Luckily, I had tried QEMU with a GUI before this headless setup, which was easier to debug since error messages appeared on the virtual display.
- It seems the GPU needs to run the VBIOS, triggered by UEFI, to initialize.
- I couldn’t make it work using a manual ROM file (dumped or downloaded) without UEFI.
It takes 4–6 seconds for a VM to initialize the GPU during boot. I haven’t been able to eliminate this delay.

Network

A standard bridge didn’t work because the host uses Wi-Fi.
MACVTAP/MACVLAN didn’t work because the VM needs to communicate with the host’s network.
I decided to use TAP + IP routing (NAT), which was easy to set up with systemd-networkd.
- Debian still uses ifupdown + dhcpcd for wireless connections.
- systemd-networkd didn’t always propagate the DNS server to the VM via DHCP.
- This was likely because my wireless connection isn’t perfectly stable, so I had to specify the DNS value manually.

USB

I chose to forward specific USB devices instead of the host controller, since the host also needs the keyboard.
Contrary to what some docs mention, USB hot-plugging seems to work by default. QEMU correctly forwards devices even if I unplug and replug them on the fly.
Keyboard & Mouse stutter
- evdev passthrough:
  - Helps with the stutter a little bit, but doesn’t eliminate it entirely.
  - Works well for devices in /dev/input/by-id/.
  - Does not support hot-plugging.
  - Caps Lock didn’t work.
  - Media keys (e.g., volume up/down) didn’t work.
- Switching from PS/2 to virtio seemed to help.
- Documentation on this is a bit lacking, but AI tools were helpful.
USB game controllers
- The controller stopped working after the VM woke from sleep.
- This also turned out to be related to the keyboard stutter.
- Physically reconnecting the controller fixes it. I can probably write a script to dynamically reconnect it later.

Audio

I don’t have PipeWire or PulseAudio installed on the host.
I chose to use ALSA.
- The default arguments didn’t work; I had to check /proc/asound/devices to get the correct card number.
- Need to escape commas in the QEMU command line (e.g., -audiodev alsa,out.dev=hw:1,,0).

Bluetooth

Passing the adapter via USB didn’t work.
- It seemed to cause slow boot times on Bazzite.
- I might try this solution later.

VM Isolation

Each VM runs under its own unprivileged user.
- I needed to grant access to vfio, USB, and audio devices. This was easy using sysusers.d.
- This isolation wasn’t as easy when I tried libvirt, but I probably just didn’t dig deep enough.
AppArmor
- libvirt’s template is quite reusable. I also used it in my previous VM setup.
Firewall
- Currently, I cannot distinguish between the two VMs in the firewall rules because they share the same TAP device. I’ll have to use separate ones eventually.

Lifecycle Management

Gracefully shutting down a VM from the host
- QMP’s system_powerdown sends an ACPI shutdown signal, which didn’t work (Bazzite triggers sleep by default).
- The QEMU Guest Agent’s guest-shutdown is more reliable (I need to set up the agent inside the guest, obviously).
- Neither command guarantees an immediate shutdown; they just enqueue the operation. I still need to wait for the QEMU process to actually stop.
- It is possible to implement this with a hacky, minimal shell script by blindly sending messages to the QMP/guest agent without parsing the output. Much easier with libvirt.
Suspend/Sleep
- When I put a VM to sleep, I want the host to sleep as well.
- To do that, I wrote a Python script to parse events from the QMP socket and watch for the SUSPEND event. (libvirt already handles this; I just need to register a script).
- systemctl suspend can suspend the host, but again, when the command finishes, it doesn’t guarantee the suspend (and resume) cycle is complete. It simply enqueues the operation.
- This means I can’t immediately wake up the VM via script right after this command.
- The proper way is to register sleep hooks with systemd.
- Alternatively, do nothing: after waking up the host with the keyboard, I just need to press a few more keys to wake up the VM.
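
The event-watching part can be sketched as a small dispatcher over the JSON lines read from the QMP socket. This is an illustration rather than the actual script: the function name and the handler table are assumptions, and the input is any iterable of lines (for example a file object wrapping the socket, after the capability handshake).

```python
import json


def watch_events(lines, handlers):
    """Call handlers[name](message) for each QMP event; return handled names."""
    handled = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        message = json.loads(line)
        name = message.get("event")  # command replies carry no "event" key
        if name in handlers:
            handlers[name](message)
            handled.append(name)
    return handled
```

A handler table such as {"SUSPEND": lambda _msg: subprocess.run(["systemctl", "suspend"])} would put the host to sleep when the guest suspends. As noted above, systemctl suspend only enqueues the operation, so waking the VM afterwards still needs a sleep hook or a manual key press.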

OS

I chose Fedora Silverblue for my daily driver.
- Booting is fast.
- After investigating how Flatpak works, I really like it; it feels similar to Android’s app isolation.
I chose Bazzite for gaming.
- It just works out of the box. I didn’t need a single terminal command to play games (the only exception was when I enabled Secure Boot).
- Around 80% of my game library is supported, which is great. However, this means I might still have to set up a Windows VM later for the remaining games.

Conclusion

This setup has been working well for a while now. It was a nice way to get myself updated on the modern Linux desktop and GPU passthrough experience.

I like that this setup borrows great concepts from both Qubes OS and Proxmox. It’s interesting to note the shift in my approach: in my previous VM setup, I wanted to optimize the VMs by giving them as few resources as possible. Now, I’m doing the exact opposite.

I also worked on a few other things that I might discuss in upcoming posts:

Trying to make it easier to switch/launch VMs without logging into the host.
Getting more interested in libvirt and deciding to give it a try.
Plans to “Qubes-ify” my setup (e.g., setting up disposable VMs).

Exploring Gaming VM Setup (Part 1)

2026-04-10T21:17:00.007+02:00

For years, I've been playing games and doing everything else on the same PC. However, after a recent positive experience with Qubes OS, I no longer feel comfortable keeping my games, passwords, and other stuff all in the same place. Because of this, I decided to change my gaming setup.

My requirements are:

Performance. Gaming should be functional and playable.
Isolation. At a minimum, game binaries should not have access to my other files.
Quick Switch. I need to be able to switch between gaming and non-gaming easily.
Single Display. All my workflows should perform well on a single display. I do have a secondary display, but I want to use it as rarely as possible.

So, here goes my adventure.

Option 1: Dual Boot

This is the classic solution, which sounds boring but almost always works.

Performance: Good. All operating systems have direct access to physical hardware.
Isolation: OK-ish. Each OS technically has access to all data from the other OSes. Disk encryption and Secure Boot can mitigate this risk.
Quick Switch: Not so good. I would need to go through UEFI POST and disk decryption every time.

Basically, I skipped this option just to explore something more "fun" (read: torturing).

Option 2: Qubes OS

Because of my positive experience with Qubes OS, I thought it would be great to use it as my daily driver, while keeping gaming in a separate VM (qube).

So I installed Qubes. However, unsurprisingly, I couldn't get the GPU to work.

At the time, I thought I probably hadn't tried hard enough, as there are successful examples online. Looking back now, I probably should have tried enabling UEFI.

Through this exercise, I also realized that dom0 is not meant to run headless, i.e. it is not supposed to be accessed remotely. This would make it difficult to recover if anything goes wrong. Perhaps sys-gui-vnc might help.

Anyways, I decided to move on to the next option.

Option 3: Sandboxes and Containers

So I installed Debian with GNOME to see if any lightweight sandbox or container technologies could be useful.

Note that some technologies, like Podman and Docker, are not security boundaries. I'm talking about options like:

Flatpak + Flatseal
Firejail
Bubblewrap
gVisor
Kata Containers

In particular I have had a very positive experience with Podman + gVisor on my server.

After some research and long discussions with AI, I figured out that none of them work well for isolating games. Games, like many other graphical programs, naturally require access to hardware. But most container/sandbox options only offer an "all-or-nothing" permission model.

Flatpak seems to allow fine-grained control, especially since it provides a DBus proxy. But I doubt it provides good default profiles for all games. ~~In particular, I noticed Firefox has full filesystem access by default.~~ UPDATE: that was the RPM version, not the Flatpak version.

I also learned that Flatpak's approach might interfere with an application's own sandbox, as with Firefox or Chromium.

So, this option was a pass. I decided to go back to VMs.

Option 4: Single GPU Passthrough

Many GPU passthrough tutorials assume you have "a spare GPU", because the main GPU (which can be an integrated one) is used by the host.

While I do have a secondary iGPU, I prefer not to use it. Since I'm using a laptop, the iGPU is connected to the built-in display, which sits too far away from my main setup.

I then learned about the "single GPU passthrough" technique. The procedure looks like this:

Turn off all programs that are using the GPU (e.g. kill the display server).
Unload the GPU driver (e.g. nouveau, nvidia, amdgpu).
Bind the vfio-pci driver to the GPU.
Launch the VM.

Then, you just reverse the sequence when the VM is shut down.
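The driver rebinding step can be sketched as a list of sysfs writes. This is only an illustration under assumptions: the function names are hypothetical, the vfio-pci module is assumed to be loaded already, and the device is assumed to have a driver bound (otherwise the `driver/unbind` file does not exist). The sysfs files themselves (`unbind`, `driver_override`, `drivers_probe`) are standard kernel interfaces.

```python
from pathlib import Path

SYSFS_PCI = Path("/sys/bus/pci")


def vfio_bind_plan(pci_addr, sysfs=SYSFS_PCI):
    """Return the (path, value) writes that hand a device to vfio-pci."""
    dev = sysfs / "devices" / pci_addr
    return [
        (dev / "driver" / "unbind", pci_addr),   # detach the current driver
        (dev / "driver_override", "vfio-pci"),   # pin the next driver choice
        (sysfs / "drivers_probe", pci_addr),     # ask the kernel to re-probe
    ]


def host_rebind_plan(pci_addr, sysfs=SYSFS_PCI):
    """Return the writes that give the device back to its normal driver."""
    dev = sysfs / "devices" / pci_addr
    return [
        (dev / "driver" / "unbind", pci_addr),   # detach vfio-pci
        (dev / "driver_override", "\n"),         # an empty write clears the override
        (sysfs / "drivers_probe", pci_addr),     # let the regular driver claim it
    ]


def apply_plan(plan):
    """Perform the writes in order (needs root; not exercised here)."""
    for path, value in plan:
        path.write_text(value)
```

For example, `apply_plan(vfio_bind_plan("0000:01:00.0"))` would run before launching the VM and `apply_plan(host_rebind_plan("0000:01:00.0"))` after it shuts down, where the PCI address is a placeholder for the actual GPU.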

I don't really like this procedure for a few reasons:

Frequently rebinding drivers sounds unstable. I actually tried this later, and it confirmed my concerns. I'll discuss this in a later article.
Killing the display manager might destroy my active session (though I've heard this might not be the case for GNOME). It made me think I should just use multiple VMs or dual boot instead.

Ultimately, I decided this option was not for me either. Time to move on.

Option 5: Headless VM Orchestrator

Combining lessons from all my previous attempts, I concluded that I should build something like Proxmox: basically a minimal OS acting as a VM orchestrator, where I only interact with VMs.

I think this is the best compromise. Compared to all previous options:

Only one VM (with GPU passthrough) can run at a time, so it feels similar to dual boot. However, it is much faster to switch between VMs because I don't need to decrypt the disk every time. Plus, one VM has no access to the disks of other VMs.
Compared to sandboxes and containers, VMs provide much stronger security guarantees, and I find them easier to set up securely.
Compared to single GPU passthrough, the experience is similar, but the system is much more stable.

While I've heard nothing but positive feedback about Proxmox, I'd prefer building my own environment from vanilla Debian. I'll discuss exactly how I did that in the next article.

Solving LVM Detection Failures in GRUB After a Force Shutdown

2026-03-26T00:13:00.003+01:00

After a routine system update and an unfortunate hang that required a hard reset, my Linux machine refused to boot. Instead of the familiar login prompt, I was greeted by a cryptic GRUB error: error: no cryptodisk module can handle this device.

My setup uses LUKS2 + LVM. From the GRUB rescue shell, I could actually decrypt the LUKS container. But once decrypted, GRUB completely failed to detect any LVM volumes. It simply acted as if the LVM structure didn't exist.

Meanwhile, when I booted from a Live Rescue USB, everything worked perfectly. I could open the LUKS container, and the volume group appeared immediately. Tools like vgck and pvck reported no issues.

After a lengthy discussion with AI, I eventually found the magical commands:

vgcfgbackup -f lvm_backup.txt <vg_name>
vgcfgrestore -f lvm_backup.txt <vg_name>

After running these commands and rebooting, GRUB recognized the LVM volumes immediately, and I was back in my system.

Supposedly, this forces LVM to rewrite its metadata. Perhaps the forced shutdown left the metadata in a state that the LVM parser in Linux can handle, but the more limited implementation in GRUB cannot.

Notes on a Tricky Linux Installation: Qubes OS and Windows

2026-03-12T20:17:00.006+01:00

I recently tried to install Qubes OS alongside an existing Windows installation. It turned out to be surprisingly difficult—way harder than my last attempt—likely due to a combination of my encrypted /boot setup and older hardware. Here are some notes from the process.

Shrinking an NTFS Volume

I needed to free up some space from a Windows NTFS volume. Normally, this just takes a few clicks in Disk Management. This time, however, Windows reported a "shrinkable volume" that was suspiciously small.

Following this answer, I tried the standard fixes:

Disabled hibernation (powercfg /h off)
Disabled the pagefile
Disabled system protection

This increased the shrinkable volume a bit, but nowhere near the actual free space left on the partition.

Digging into the Windows Application logs in Event Viewer, I finally found the culprit: The last unmovable file appears to be: \$Mft::$DATA. It turns out $Mft is the NTFS Master File Table, which cannot be easily moved, and a simple defrag wasn't going to cut it. I tried a few third-party partition managers, but they all failed initially. Following a hint from one of the tools, I temporarily disabled BitLocker. That did the trick: AOMEI Partition Assistant was finally able to shrink the volume. Once it was done, I just had to re-enable everything.

Configuring the Display

I use a dual-monitor setup (let's call them A and B). The Linux console and GUI installer assumed monitor A was the primary display, leaving monitor B either completely blank (in the terminal) or showing an empty desktop (in the GUI). I wanted everything on B, and the usual Super+Arrow Key shortcut wasn't working.

Here is how I forced the display:

Turn Off the Display in the Terminal

Note: This forces the GUI installer to the correct screen, but doesn't change the terminal itself.

Switch to the terminal (Ctrl + Alt + F2).
Find the "bad" display in /sys/class/drm/.
Run echo off > /sys/class/drm/card0-<DEVICE_NAME>/status.

Example: echo off > /sys/class/drm/card0-HDMI-A-1/status

Switch back to the GUI (Ctrl + Alt + F6). The installer should now be forced onto the only "active" screen.

Turn Off the Display via Kernel Parameters

First, find the device/port name by running: ls -d /sys/class/drm/card*-*. Examples:

/sys/class/drm/card0-DP-1 (Integrated)
/sys/class/drm/card1-HDMI-A-1 (Discrete)

Add something like video=eDP-1:d to the kernel parameters.
This gets more complicated when the ports are the same but the cards are different, though I haven't tested that scenario.
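As a small illustration of how the sysfs entries map to the kernel parameter, here is a hedged Python sketch. The helper names are hypothetical; it only assumes that connector entries follow the `cardN-<connector>` naming shown above.

```python
from pathlib import Path


def connector_name(sysfs_entry):
    """Strip the 'cardN-' prefix: 'card1-HDMI-A-1' -> 'HDMI-A-1'."""
    name = Path(sysfs_entry).name
    card, sep, connector = name.partition("-")
    if not sep or not card.startswith("card") or not connector:
        raise ValueError(f"not a connector entry: {sysfs_entry}")
    return connector


def disable_param(sysfs_entry):
    """Build the video= kernel parameter that disables this connector."""
    return f"video={connector_name(sysfs_entry)}:d"


def list_connectors(drm_root="/sys/class/drm"):
    """List connector entries, like `ls -d /sys/class/drm/card*-*`."""
    return sorted(p.name for p in Path(drm_root).glob("card*-*"))
```

Note that the parameter drops the card number, which is why two cards exposing the same port name are harder to tell apart.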

GRUB and Encrypted `/boot`

GRUB 2.12 supports LUKS2, but it doesn't support the Argon2 hashing algorithm—that didn't arrive until GRUB 2.14.

While GRUB 2.14 works perfectly on my newer machine, it refused to boot properly on this older one. After hours of troubleshooting, I realized that GRUB 2.14 could boot directly into Linux, but it failed completely when trying to boot through Xen. Suspecting the issue lay somewhere between GRUB and Xen, I eventually gave up and downgraded to GRUB 2.12, changing my LUKS partition to use PBKDF2 instead of Argon2.

Another quirk: GRUB's decryption implementation feels about 100 times slower than cryptsetup. It likely lacks hardware acceleration for decryption, and the -A parameter isn't available for the cryptomount command in my version of GRUB. To keep boot times reasonable, I had to decrease the number of PBKDF2 iterations in LUKS.

LVM Issues

Xen and Linux finally booted, but the celebration was cut short. The boot process stalled, complaining that the /dev/mapper/boot device timed out, and it wouldn't even let me enter the rescue shell.

I fixed the rescue shell issue by booting from a live USB and giving root a password. The timeout issue, however, was much weirder. It turned out that only the LVM logical volumes specifically listed in the rd.lvm.lv kernel parameters were being unlocked; the rest were completely invisible. Even running vgs and lvs returned empty results.

I was eventually able to recover everything using the vgimportdevices -a command. Oddly, this created a duplicate LVM entry in the system.devices file, and manually removing the duplicate broke everything again. I ended up just deleting the file entirely and letting vgimportdevices recreate it from scratch. It’s been running smoothly ever since.

Secure Boot

Just as I got Qubes OS working with Secure Boot enabled, Windows broke. The UEFI complained that the boot signature wasn't recognized.

After some debugging, I figured out what happened: I had reset the Secure Boot keys on my motherboard to their factory defaults. Because the laptop is old, the factory defaults only contained the 2011 Microsoft keys. However, my Windows boot loader had been updated and was signed with the newer 2023 key.

I tried copying the Secure Boot database from another machine, but that failed (likely due to PK/KEK mismatch issues). After hours of trial and error, I solved it with a combination of the following actions:

Set Secure Boot to Setup mode.
Reset to factory keys
Update Secure Boot variables via Windows registry and scheduled tasks (source)

reg add HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Secureboot /v AvailableUpdates /t REG_DWORD /d 0x5944 /f

Start-ScheduledTask -TaskName "\Microsoft\Windows\PI\Secure-Boot-Update"

# Manually reboot the system when the AvailableUpdates becomes 0x4100

Start-ScheduledTask -TaskName "\Microsoft\Windows\PI\Secure-Boot-Update"

Boot SecureBootRecovery.efi from the Windows EFI partition.

I suspect running SecureBootRecovery.efi was the magic bullet, though the other steps likely set the stage. Surprisingly, this file never came up in my online troubleshooting; I just stumbled across it by accident while browsing the EFI partition.

Final Thoughts

Looking back, I definitely stacked too many tricky components together—older hardware, dual-booting, encrypted /boot, LVM, and Secure Boot—and hit a bit of bad luck along the way. Fortunately, it all worked out in the end. What a journey!

Refined Boot for Qubes OS: Minimal USB Key, Dual Boot, Secure Boot

2026-03-08T23:56:00.003+01:00

I've been running Qubes OS on Machine A alongside Windows for a while. My setup involved storing the unencrypted /boot partition and the LUKS header on an external USB drive.

Recently, I planned to install Qubes on Machine B, also in a dual-boot configuration. However, the complexity jumped significantly:

Machine B has Secure Boot enabled because BitLocker requires it. On previous installs, I grew tired of toggling Secure Boot in the BIOS every time I switched operating systems.
I only have one USB drive. Managing separate /boot partitions for two different Qubes installations on a single thumb drive is messy.

After some experimentation, I found a way to solve both problems.

Sharing One USB Drive for Multiple Qubes Installations

The solution is elegant: Don't store /boot on the USB drive. Instead, move /boot to the encrypted internal disk partition. The USB drive's only job is to unlock that partition and hand over control to the system. Once I grasped this concept, implementation was relatively straightforward using the Arch Wiki and AI.

Some notes:

The USB drive only needs an EFI partition containing the GRUB binary and the LUKS header files. In my case, these total less than 30MB.
Recent GRUB versions support LUKS2, but only very recent commits support Argon2. Fedora's version was too old, but the packages in Debian Sid worked. Arch should also work.
I wrote a simple grub.cfg that loads the necessary modules (GPT, LUKS2), unlocks the partition via cryptomount using the header file on the USB, and then passes control using configfile.
For this setup, usually, /boot can just be a folder in the root directory/filesystem. However, because my root is on a thin-provisioned LVM logical volume (which GRUB does not yet support), I had to create a separate, standard LVM logical volume specifically for /boot.
I used grub-mkstandalone to create a bundled GRUB binary and efibootmgr to register it with the UEFI firmware.
When the kernel/initramfs boots from the now-unlocked /boot, it doesn't "inherit" the unlocked state. To avoid typing the password twice, I embedded a LUKS keyfile into the initramfs to automate the second unlock.
To support both machines, I copied both LUKS headers to the USB. I then modified grub.cfg to detect the machine's UUID via smbios, allowing it to automatically select the correct header and partition.

The Benefits

Compared to the standard "detached USB" method, this stores significantly less data on the unencrypted drive, making backups easier and security tighter.

It also solved a major headache: updates. Previously, I had to manually mount /boot before updating dom0 and remember to unmount it before rebooting. If I forgot and triggered a reboot of sys-usb, the system would often hang. Now, those days are over. I can update the system without the USB drive even being plugged in; the drive itself rarely needs updates.

Make Qubes OS Play Nice with Secure Boot

It turned out I misunderstood "Qubes OS doesn't support Secure Boot." What that actually means is that Qubes doesn't provide an out-of-the-box way to verify the entire chain (Xen, kernels, etc.).

However, if our goal is simply to allow Qubes to boot without disabling Secure Boot in the BIOS, it’s actually quite easy:

Set up shim, register the entry with UEFI, and point it to my GRUB EFI binary.
Ensure my GRUB binary contains an SBAT section. I used objcopy to extract this from Debian's GRUB binary, then included it via grub-mkstandalone.
Disable MOK validation with mokutil --disable-validation. The MOK Manager labels this option "Disable Secure Boot", but it doesn't actually affect UEFI or Windows.
Enroll my GRUB binary hash during the first boot.

That's it!

I was surprised by the simplicity. I expected to be wrestling with PK/KEK keys and accidentally bricking my Windows bootloader. Instead, learning how shim and MOK interact made the process painless.

Security Implications

To be clear: this method allows Qubes to run alongside Secure Boot, but it does not (yet) cryptographically verify the Xen image or Linux kernels.

However, since that data resides on an encrypted partition and the bootloader is on a detached USB, it meets my personal threat model. In the future, I may look into creating my own keys to sign the images properly.

I’m also considering moving the GRUB binary to an internal partition. If I embed the LUKS header directly into the GRUB binary and enroll its hash, the shim would theoretically notify me if the binary was tampered with. It’s not a perfect defense against physical access, but it's a significant step up from a standard unencrypted boot.

Writing Sudoku Solvers

2026-02-05T21:58:00.005+01:00

After writing a Nonogram solver, I decided to tackle a Sudoku solver to practice Rust. My goal wasn't just to support classic Sudoku rules, but also to handle variants like Thermometer, Arrow, and Cage.

1. Brute Force

It is fairly easy to write a brute force or backtracking algorithm. This approach is sufficient for most classic Sudoku puzzles, but it becomes unbearably slow as soon as variant rules are introduced.

I considered this step a warmup—a baseline to improve upon.

2. Constraint Propagation

Here, I tried to introduce "logical thinking" to the algorithm. I used u16 as a bitmask to represent the possible values of a cell. Whenever a cell's state changes (due to guessing, backtracking, or propagation), the algorithm consults all constraints to eliminate impossible candidates.
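To make the bitmask idea concrete, here is a minimal Python sketch of candidate elimination for the classic row/column/box rule only. It is not the Rust implementation described above; the names (`ALL`, `bit`, `peers`, `assign`) are hypothetical, and bits 0..8 stand for digits 1..9, which fits in a u16.

```python
ALL = 0b1_1111_1111  # bits 0..8 stand for digits 1..9


def bit(digit):
    """Bitmask with only `digit` set."""
    return 1 << (digit - 1)


def peers(cell):
    """Indices (0..80) sharing a row, column, or box with `cell`."""
    r, c = divmod(cell, 9)
    br, bc = 3 * (r // 3), 3 * (c // 3)
    same = {r * 9 + j for j in range(9)} | {i * 9 + c for i in range(9)}
    same |= {(br + i) * 9 + (bc + j) for i in range(3) for j in range(3)}
    same.discard(cell)
    return same


def assign(cands, cell, digit):
    """Fix `cell` to `digit` and propagate; return False on contradiction."""
    cands[cell] = bit(digit)
    queue = [cell]
    while queue:
        solved = queue.pop()
        mask = cands[solved]
        for p in peers(solved):
            if cands[p] & mask:
                cands[p] &= ~mask            # eliminate the taken digit
                if cands[p] == 0:
                    return False             # no candidate left
                if cands[p] & (cands[p] - 1) == 0:
                    queue.append(p)          # one candidate left: propagate it
    return True
```

A solver built on this would start from `cands = [ALL] * 81`, call `assign` for every given digit, and fall back to guessing on a copy of `cands` only when propagation stalls.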

While Nonogram is technically an NP-Complete problem, in practice, my constraint-propagation solver (without guessing) can solve almost all puzzles found online. I’ve only seen one exception where I had to guess a few cells. This suggests that puzzles designed for human players are meant to be solved via logical deduction, making them computationally "easy."

It turns out Sudoku is similar. Although some backtracking is still needed, classic puzzles are typically solved within 100~200 µs (microseconds), while variants might take a few milliseconds. I did see one puzzle take ~20 seconds, but overall, I was happy with the result.

So, can it go faster? Two optimization options came to mind:

When a cell changes, only consult relevant constraints rather than re-checking all of them.
Instead of copying the entire board state for every guess, I could carefully track changes and undo them during backtracking. However, since the board state is already quite small, I doubted this would yield a significant performance boost in practice.

Right before I decided to optimize further, I learned something new.

3. Dancing Links (DLX)

I discovered "Dancing Links" while asking an AI for the best Sudoku algorithms. This is a technique invented by Donald Knuth to efficiently implement Algorithm X, which solves the "exact cover" problem.

This is perfect for classic Sudoku, where we essentially try to find an exact cover between "putting a digit into each cell" and "each row/column/box must contain each digit X exactly once (where X is 1 ~ 9)".

The magical part: We can precisely undo this process by restoring the covered rows and columns. This means we don't need to copy the state during backtracking! The algorithm only needs to remember the latest guess; the data structure itself holds the information required to reverse it.
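The undo property is easiest to see in a small Algorithm X sketch. Note the swap: this version uses Python dicts and sets instead of Knuth's doubly linked lists, so it is not a true Dancing Links implementation, only the same cover/uncover idea. All names are hypothetical, and the example data is Knuth's small textbook instance rather than a Sudoku.

```python
def build_columns(universe, Y):
    """Map each column to the set of rows that cover it."""
    X = {c: set() for c in universe}
    for row, cols in Y.items():
        for c in cols:
            X[c].add(row)
    return X


def select(X, Y, row):
    """Cover every column that `row` satisfies; return what was removed."""
    removed = []
    for j in Y[row]:
        for i in X[j]:
            for k in Y[i]:
                if k != j:
                    X[k].remove(i)   # conflicting rows leave other columns
        removed.append(X.pop(j))
    return removed


def deselect(X, Y, row, removed):
    """Exact inverse of select(): restore columns in reverse order."""
    for j in reversed(Y[row]):
        X[j] = removed.pop()
        for i in X[j]:
            for k in Y[i]:
                if k != j:
                    X[k].add(i)


def solve(X, Y, solution=None):
    """Yield every exact cover as a list of row names from Y."""
    if solution is None:
        solution = []
    if not X:
        yield list(solution)
        return
    col = min(X, key=lambda c: len(X[c]))  # column with fewest candidates
    for row in list(X[col]):
        solution.append(row)
        removed = select(X, Y, row)
        yield from solve(X, Y, solution)
        deselect(X, Y, row, removed)       # undo without copying the state
        solution.pop()


Y = {"A": [1, 4, 7], "B": [1, 4], "C": [4, 5, 7],
     "D": [3, 5, 6], "E": [2, 3, 6, 7], "F": [2, 7]}
X = build_columns(range(1, 8), Y)
covers = [sorted(s) for s in solve(X, Y)]
```

After the search finishes, `X` is back in its initial state, which is the point made above: the structure itself carries the information needed to reverse each guess.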

However, there's a catch: things get complicated quickly with variant rules. Only a few rules (like distinct values) can be easily encoded into the DLX structure. For most others, I had to implement them as "external observers" that eliminate impossible candidates after a guess. This forced me to maintain an undo stack again, essentially plugging a constraint propagation engine back into DLX.

Another issue: DLX wasn't actually faster than my custom constraint propagation solver. Since it consistently took ~200 µs for classic puzzles, I didn't bother implementing the complex variant rules for it.

4. Generic Solvers

I talked to a colleague about my progress, and he asked, "Why are you writing a custom solver? Why not just use a generic SAT/SMT solver like Z3?"

That was a good point. I did some quick research and picked three candidates to test:

OR-Tools
Z3
cvc5

The results are shown below.

5. Results

Here is how the solvers compared. (cp = my constraint propagation implementation, dlx = my dancing links implementation)

Classic Sudoku 1 (Easy)

cp: ~100 µs
dlx: ~200 µs
OR-Tools: ~10 ms
Z3: ~20 ms
cvc5: ~70 ms

Classic Sudoku 2 (Harder)

cp: ~200 µs
dlx: ~200 µs
OR-Tools: ~10 ms
Z3: ~80 ms
cvc5: ~200 ms

Hard: Empty Board + Thermometer Rules

cp: ~3 ms
OR-Tools: ~30 ms
Z3: ~20 s
cvc5: ~23 s

Hard: Almost Empty Board + Arrow Rules

OR-Tools: ~200 ms (slightly faster than cp)
cp: ~200 ms
Z3: ~20 s
cvc5: ~100 s

6. Discussion

I found these results both surprising and reasonable.

OR-Tools is significantly faster than Z3 and cvc5. I believe this is because OR-Tools uses a CP-SAT (Constraint Programming) solver, whereas Z3 and cvc5 are primarily SMT (Satisfiability Modulo Theories) solvers. Since the Sudoku puzzles I used are designed for human logic, they align better with the constraint propagation techniques used by OR-Tools.

My custom solver vs. OR-Tools. For easy puzzles, OR-Tools is slower than my implementation. This is likely due to initialization overhead; my code is hyper-optimized specifically for Sudoku. However, OR-Tools catches up quickly on complex puzzles. My constraint propagation logic is naturally inferior to the sophisticated heuristics inside OR-Tools, so as complexity rises, the generic solver wins.

7. Conclusion

I had a lot of fun and learned a great deal during this process. My next step is to explore OR-Tools further. Perhaps I'll write solvers for even more complex puzzles without reinventing the wheel!

An Adventure with Qubes OS

2025-10-29T01:06:00.006+01:00

I've been experimenting with Qubes OS on my new laptop and wanted to share some notes on the experience.

Hardware

Overall, Qubes OS works quite well on my hardware. Aside from typical issues like deep sleep, speaker performance, and touchpad scroll speed, the experience has been smooth. I particularly like that I can boot directly from a microSD card. This allowed me to move the /boot partition to the card while completely disabling USB access in dom0 for better security.

Detached /boot and LUKS Header

Moving /boot and the LUKS header to a microSD card is a fun project, but it has some drawbacks:
I have to remember to mount /boot before updating dom0.
The system won't shut down properly if I forget to unmount /boot.

Testing Qubes OS 4.3 rc3

I decided to test the Qubes OS 4.3 rc3 release by performing an in-place upgrade. Unfortunately, the system failed to boot afterward.

dracut Issues

After the upgrade, the system would hang before prompting me for my LUKS password. Eventually, the watchdog timer would kick in and drop me into an emergency shell.
Using the emergency shell and the installation media, I was able to investigate. I realized something was wrong with dracut, as it seemed unable to detect the encrypted disk. I tried including more files in the dracut configuration based on information from various sources, but that didn't help. So I gave up. Thankfully, the upgrade tool created an LVM snapshot, which made it easy to revert the changes. I did have to manually downgrade the kernel and Xen using the installation media to get my Xen domains working again.
After more research, I found a bug (1, 2) in the version of dracut included in Qubes OS 4.3. Essentially, the crypt module stops working when systemd is available. I also believe the systemd-cryptsetup module wasn't automatically included because the LUKS header was on a separate device, leading dracut to assume that LUKS decryption wasn't needed.
The fix was simple: manually enable the systemd-cryptsetup module in the dracut configuration.

amdgpu Issues

After the upgrade, only the old kernel worked, which is how I was able to fix the dracut issue. With the new kernels, the system would boot to a blank screen. Removing the rhgb and quiet kernel parameters revealed some log messages, but the blank screen remained.
Adding the nomodeset parameter disabled the graphics driver, which allowed me to enter the LUKS password and log in. However, this caused Xorg and LightDM to fail, with Xorg repeatedly crashing.
It was clear this was an issue with the amdgpu module. I tried several kernel parameters without success:
amdgpu.modeset=1
amdgpu.dc=0
amdgpu.dpm=0
amdgpu.ppfeaturemask=0xffffb
Eventually, I found that amdgpu.dcdebugmask=0x10 resolved the problem.

qubesd Issues

After finally booting into the system, I couldn't attach any block devices, including my /boot partition, to dom0. This turned out to be a bug in qubesd, which I reported.

Backup Strategy

Backing up data in Qubes OS can be tricky due to its design. Here is the high-level strategy I've been planning:
Create as few templates as possible.
Only use Salt to configure templates, so I only need to back up the Salt files in dom0.
Changes to dom0 are either managed by Salt, or the relevant files are included in the backup. Examples:
Xfce settings
Qube settings/features, /etc/qubes
/etc/default/grub
/etc/dracut.conf.d
My backup process is as follows:
I use a dedicated disposable VM with restic installed. The actual backup scripts and SSH keys remain in dom0.
Data from dom0 and other qubes is archived using tar. Use --transform to prepend the VM name to the path.
Use qvm-run --pass-io --no-gui to pass the scripts, SSH keys, and data to the disposable VM, which then runs the scripts to execute the backup.
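The tar and qvm-run steps can be sketched as a small Python pipeline. This is an assumption-laden illustration, not the actual backup script: the function names, the VM name check, and the remote command are hypothetical, and only the dom0-directory case is shown. The `--transform`, `--pass-io`, and `--no-gui` flags are the ones named above.

```python
import re
import subprocess

# Conservative check so the name cannot break the sed-style expression.
VM_NAME = re.compile(r"^[A-Za-z][A-Za-z0-9_.-]*$")


def tar_command(vm_name, source_dir):
    """Archive source_dir to stdout, prefixing every path with the VM name."""
    if not VM_NAME.match(vm_name):
        raise ValueError(f"unexpected VM name: {vm_name!r}")
    return [
        "tar", "--create", "--file=-",
        f"--directory={source_dir}",
        f"--transform=s,^,{vm_name}/,",
        ".",
    ]


def qvm_run_command(dispvm, remote_command):
    """Run remote_command in the VM with stdin/stdout passed through."""
    return ["qvm-run", "--pass-io", "--no-gui", dispvm, remote_command]


def stream_to_dispvm(vm_name, source_dir, dispvm, remote_command):
    """Pipe the tar stream into the disposable VM; True if both succeed."""
    tar = subprocess.Popen(tar_command(vm_name, source_dir),
                           stdout=subprocess.PIPE)
    try:
        result = subprocess.run(qvm_run_command(dispvm, remote_command),
                                stdin=tar.stdout)
    finally:
        tar.stdout.close()
        tar_status = tar.wait()
    return tar_status == 0 and result.returncode == 0
```

Discarding or size-limiting the output that comes back from the VM would be one way to act on the --pass-io concern.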
There are a few things to consider:
Using --pass-io could have security implications. It might be possible to mitigate this by limiting the number of bytes passed and saving output to a file.
The tar archives are currently extracted in the disposable VM. If the data isn't trusted, this could be a security risk. In such cases, the extraction step could be skipped.
Accessing the backed-up data requires starting a VM. An alternative could be to create an LVM snapshot and back that up directly.

Thought: Managing Qubes like Containers

There are many similarities between Qubes VMs and containers (Docker, Podman):
Template VMs are similar to building containers.
AppVMs are like running containers with persistent volumes.
Disposable VMs are like running containers without persistent volumes.
There are some gaps:
One container can be based on others, and the shared parts are often stored only once as layers. While cloning a template VM can use reflink for initial efficiency, changes will eventually cause the data to diverge. bootc might help bridge this gap.
Containers support bind mounts, which are very convenient for backups. virtiofsd could work for Xen, but I guess there are security concerns. An alternative could be to centralize all data in one qube and share it with others over the network, but again, there could be security concerns.

Conclusion

It's been a fun and challenging journey exploring Qubes OS. While there were some hurdles with hardware and system upgrades, working through them has been a valuable learning experience. The security architecture is powerful, and I'm excited to continue finding new ways to make it work for my setup.

A Rocky Migration: Moving from docker-compose to Podman and gVisor

2025-10-09T00:26:00.005+02:00

I've been running a few containers for several years. They were all running under rootless Docker with a single user.
Initially, I planned to migrate the containers to VMs, but I couldn't get a stable workflow after about two months of effort. Later, gVisor caught my attention, and I decided to migrate to Podman with gVisor instead.
The new plan is to run each container with --userns=auto and use Quadlet for systemd integration. This approach provides better isolation and makes writing firewall rules easier.
I'm now close to migrating all my containers. Here are a couple of rough edges I'd like to share.

Network Layout

I compared various networking options and spent a few hours trying the one-interface-per-group approach before giving up. I settled on a single macvlan network and decided to use static IP addresses for my containers.
To prevent a randomly assigned IP address from conflicting with a predefined one, I allocated a large IP range for my containers and assigned random addresses from that range.
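Picking a random static address from a reserved range can be sketched with Python's ipaddress module. The function name, the `taken` set, and the subnet in the usage note are hypothetical; the actual range is not given above.

```python
import ipaddress
import random


def pick_address(subnet, taken, rng=random):
    """Pick a random host address in `subnet` that is not in `taken`."""
    net = ipaddress.ip_network(subnet)
    first = int(net.network_address) + 1    # skip the network address
    last = int(net.broadcast_address) - 1   # skip the broadcast address
    if first > last:
        raise ValueError(f"subnet too small: {subnet}")
    for _ in range(1000):
        candidate = ipaddress.ip_address(rng.randint(first, last))
        if candidate not in taken:
            return candidate
    raise RuntimeError("no free address found")
```

For example, `pick_address("192.168.13.128/25", taken)` returns an address that could then be passed to `podman run --ip`. The larger the reserved range, the less likely a retry is needed.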

Routing Issues

I ran into a tricky routing problem. Let's say my host has a network interface eth0 and a veth pair where veth0-host is on the host and veth0-ctr is in the container.
Here are the IP addresses:
eth0: 192.168.0.1
veth0-host: 192.168.13.1
veth0-ctr: 192.168.13.100
To allow an external client to talk to a service on 192.168.13.100:1234, I set up a prerouting DNAT rule in nftables to forward traffic from 192.168.0.1:1234 to 192.168.13.100:1234. To my surprise, the host itself couldn't access 192.168.0.1:1234.
It turned out there were two issues:
DNAT in the prerouting chain doesn't apply to loopback traffic from the host itself.
Masquerade and SNAT also didn't work, likely because the kernel has a short-circuit mechanism for local traffic, so the transport happens at Layer 2.
Because of this, the service at 192.168.13.100 would send reply packets to 192.168.0.1 instead of 192.168.13.1. The packet would still be received on veth0-host, but my firewall would complain that this violates the "strong host model".
I didn't have this issue before, probably because port forwarding was handled at Layer 3 by slirp4netns. This seems to be a "hairpin NAT" issue, but it's more complicated with a veth pair.
In the end, I just configured the host to talk to the container at 192.168.13.100 directly.

File Permissions

Because I'm using --userns=auto, the :U flag is almost a must when mounting volumes. Surprisingly, this didn't cause too many problems as long as I set up the correct user and group.
Sometimes a container needs to access files on the host. If a group HOST_GID has access to a file, we can grant access to the container's primary user with --userns=auto:gidmapping=$CONTAINER_GID:$HOST_GID:1 --group-add $CONTAINER_GID. Here, CONTAINER_GID is an unused GID inside the container.
However, this only works well with the default crun runtime. With gVisor, I found two problems:
First, the permission only works if CONTAINER_GID is the primary group of the container's main user. It doesn't work if it's a supplementary group.
Second, gVisor does not seem to support POSIX ACLs. This means the dac_override capability is needed if the CONTAINER_GID doesn't appear to have permission according to standard Unix permissions.
It's not too bad in practice, but it was surprising until I figured out what was going on.
Default DNS Server

Docker provides a DNS server at 127.0.0.11 for each container. Podman, however, creates a dedicated DNS server for each bridge network. Some of my containers relied on the Docker behavior and had this IP address hard-coded, which caused quite a bit of trouble.

DNS Servers for Multiple Networks

If a container joins both an internal bridge network and an external macvlan network, the container only sees the DNS server from the bridge network. IP routing still works, meaning the container can access the internet by IP, but it can't resolve external domains.
This is a bug in Podman that has been fixed in the latest version, but it still exists in the version I'm using from Debian. Below are the hacks I considered for this old version.
Option 1: Override DNS

If the container doesn't need to resolve internal container names, we can force it to use an external DNS server. However, the --dns flag didn't work as I expected.
According to Issue #17500, the server from --dns is added to the other DNS servers as an upstream, so that internal container names can be resolved first. This doesn't work in my case because the internal DNS server can't access the internet, so it can't forward the query.
A simple workaround is to override /etc/resolv.conf using a bind mount.
Option 2: HTTP Proxy

If the container supports an HTTP proxy, we can remove it from the external network. Instead, we can add a proxy container to both the internal and external networks. The proxy can use its own external DNS server, and the original container can use this proxy via its internal IP.
This should work in theory, but it felt like too much effort. It also surprised me that nginx doesn't support proxying HTTPS traffic (CONNECT method) without extra effort. If I had to go this route, I would probably use mitmproxy.
Option 3: Transparent HTTP Proxy

I have set up a transparent HTTP proxy in my network. The container thinks it is talking to an external server, but my firewall redirects the traffic to an nginx server, which can forward or reject the traffic. Unlike a normal HTTP proxy, this is easy to do with nginx and is completely transparent to the client.
Now, if a container only needs to access a few known HTTP/S domains, I just add entries for those domains to the container's /etc/hosts file with an arbitrary IP (like 1.1.1.1), using AddHost=. Nginx completely ignores this IP; it reads the domain from the request and resolves it on its own. It's very hacky but also very practical.
Namespaces

Some containers assume they share the same user namespace, which is common when they are running under docker-compose.
Podman has --userns=container:id to join an existing container's user namespace, but this doesn't work with gVisor. From what I've learned, this is related to gVisor's sandbox model.
The solution is to put containers into a pod. However, with gVisor, containers will not join the network namespace of the pod due to the same security model. This wasn't a big deal for my use case, but it was unexpected.
The same thing happens with the UTS namespace. When running with gVisor, a container's hostname becomes empty if it joins a pod. Issue #7995 is relevant here. Apparently, some binaries (like busybox's sendmail) don't like an empty hostname. The solution is to give the container a private UTS namespace: --uts=private.
Shared Volume

Some of my containers use flock on a shared volume to communicate. I think this is a bad design. And guess what? It doesn't work with gVisor.
I thought I might need to set up some mount hints annotations, but that didn't help, so I guess inotify wasn't the issue. Turning on shared file access didn't work either.
Other Notes

I had planned to use socket activation extensively, but it turned out I didn't. Most containers need networking anyway, and it's much easier to manage port forwarding with simple firewall rules.
It is possible to mount an empty volume, which acts as an upper overlay on existing files at the mount destination. This is very useful for read-only containers.
Final Thoughts

This wasn't a trivial migration, but I guess that was expected since I was changing so many variables at once. In any case, I'm quite happy with the new setup.

Hardening Container Network Security: Filtering Outgoing Traffic

2025-09-25T22:54:00.009+02:00

I want to filter the outgoing network traffic for all of my containers based on a set of rules. For example:
Some containers should be blocked from accessing the internet entirely.
Some containers should have unrestricted internet access.
Some containers should be able to access the internet, but not a specific list of URLs.
Some containers should only be allowed to access a specific list of URLs.
To manage this, I will define logical policy groups and assign each container to one. As a general rule, only DNS and HTTP/HTTPS traffic will be permitted.
Option 1: A Proxy for Each Policy Group
Imagine Container A is only allowed to access www.google.com. Here’s how this approach would work:
Create an Nginx (or socat) container that listens on port 443 and acts as a reverse proxy for www.google.com.
Place both the Nginx proxy and Container A into an internal container network.
Within this network, add www.google.com as a network alias for the Nginx container.
Connect the Nginx container to a second network that has internet access.
Thoughts

This is my current solution using docker-compose, and I believe it should also work with Podman.
It is possible to use a single Nginx container to proxy multiple domains, even for HTTPS traffic. By using the ngx_stream_ssl_preread_module, Nginx can inspect the requested domain from the TLS handshake and forward the traffic accordingly without needing to decrypt it.
This option is straightforward to implement, and a key advantage is that I don't need to set up a custom DNS server. It is also relatively easy to write firewall rules for it.
On the other hand, configuring and managing a separate proxy container for each rule can become tedious. I think using Quadlet files, especially with templates and drop-in overrides, could simplify this process.
Another significant downside is the inability to log blocked traffic. If a container tries to access a domain that isn't explicitly proxied, the connection will simply fail without a log entry, making troubleshooting difficult.
Option 2: Central Proxy on a Single Network
In this design, we set up a central proxy for both HTTP/S and DNS traffic and then perform the following steps:
Intercept and redirect all traffic from containers to the central proxy using nftables rules. For DNS, this is simpler, as I can configure the container network to use my custom DNS server.
The proxy must identify the source container to determine which policy group it belongs to.
The proxy must identify the requested destination. This is easy for HTTP (from the URL) and DNS (from the query). For HTTPS, we can again use the SSL preread technique to find the domain in the TLS handshake.
The proxy applies the policy, then either blocks or forwards the traffic.
Networking

First, I would create a veth pair. On one end, I would create a macvlan network in "private" mode and connect the containers to it. The other end would be assigned an IP address on the host to allow routing. This essentially creates a bridge where connected containers are isolated from each other but can reach the gateway.
Podman doesn't seem to support configuring a standard bridge with a mix of isolated and non-isolated ports. Note that the --isolate option in podman network isolates the entire network from other container networks, not individual ports on the bridge.
In the diagram, the proxies are shown on a separate bridge connected to the internet, mainly for illustration. In practice, it might be easier to connect all containers to the same macvlan network and use a firewall to control traffic flow. Although the macvlan network is in private mode, the firewall can permit "hairpin" packets to enable traffic between specific containers.
Identifying Containers

We can identify containers by their IP addresses. The tricky part is ensuring these IP addresses are trustworthy and that the setup isn't prone to errors.
Let's review the IPAM drivers supported by podman network:
dhcp: For each container, we can assign a fixed MAC address and create a static reservation in the DHCP server. The firewall can then reliably use the container's IP address to identify it. This assumes that containers are unprivileged and cannot change their own MAC or IP addresses. Ideally, the default address pool of the DHCP server should be disabled to prevent unassigned containers from getting an IP.
host-local: With this driver, we assign a static IP address during podman run. While this sounds simple, it's easy to forget to provide an IP when running a container manually. If that happens, Podman will assign an IP automatically. This could accidentally grant a container internet access or cause an IP conflict. I haven't found a way to disable this automatic IP address allocation.
none: This driver does not assign an IP address, and you cannot manually provide one either.
In theory, using "dhcp" for the IP configuration should work. However, a practical issue emerges because systemd-networkd respects the client ID in DHCP requests, and Podman sends the container ID as this client identifier. Furthermore, since podman-systemd utilizes podman run --rm, restarting a container.service generates a new container with a new ID.
This combination means the DHCP server sees the restarted container as a new machine. It then refuses to offer the configured static IP address, believing the lease is still valid and held by the original container. I have not yet found a way to override this client ID, so I may need to evaluate a different DHCP server or abandon this approach.
Deciding the Policy Group

Once the container is identified, applying the policy is relatively easy:
CoreDNS has the view plugin, which can apply different rules based on the client's IP address.
Nginx has the geo module, which can be used to map a client's IP address to a variable for use in access rules. You can also use map $remote_addr.
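
To make the mapping concrete, here is a minimal sketch of the lookup that such a proxy would perform: client IP to policy group, then group to decision. It is not CoreDNS or Nginx configuration. The subnets, group names, and the `ALLOWED_DOMAINS` table are all hypothetical, and the most-specific-prefix rule is my assumption about how overlapping ranges should be resolved.

```python
import ipaddress

# Hypothetical policy table: source network -> policy group.
POLICY_BY_NETWORK = {
    "192.168.13.16/28": "no-internet",
    "192.168.13.32/28": "allowlist",
    "192.168.13.48/28": "unrestricted",
}
# Unknown clients fall back to the most restrictive group.
DEFAULT_GROUP = "no-internet"

# Hypothetical allowlist for the "allowlist" group.
ALLOWED_DOMAINS = {"allowlist": {"www.google.com"}}


def policy_group(client_ip):
    """Map a client IP to its policy group; the most specific network wins."""
    addr = ipaddress.ip_address(client_ip)
    best = None
    for cidr, group in POLICY_BY_NETWORK.items():
        net = ipaddress.ip_network(cidr)
        if addr in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, group)
    return best[1] if best else DEFAULT_GROUP


def is_allowed(client_ip, domain):
    """Decide whether a request from client_ip to domain should be forwarded."""
    group = policy_group(client_ip)
    if group == "unrestricted":
        return True
    if group == "allowlist":
        return domain in ALLOWED_DOMAINS["allowlist"]
    return False
```

Defaulting unknown addresses to the most restrictive group means that a container which somehow ends up with an unassigned IP is blocked rather than silently allowed.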
Option 3: One Network Per Policy Group

This approach extends the "veth+macvlan" technique by creating a separate network for each policy group. We then use nftables rules to forward traffic from all networks to a central proxy. This is similar to Option 2, but this time nftables can identify the source policy group by the network interface the traffic arrives on.
This approach is more secure if you are concerned about IP or MAC spoofing since the network interface is a more reliable identifier than an IP address alone.
Identifying the Policy Group

By IP Address: We can configure a DHCP server for each network with a non-overlapping IP range. The proxies can then identify containers by their IP address, just like in Option 2, but with greater trust since the IP is tied to a specific network. We still need to be cautious to ensure IP ranges don't overlap.
By Interface: We can identify traffic by the interface it comes from.
CoreDNS has the bind plugin, which allows it to listen on specific host interfaces. However, this requires CoreDNS to run in the host network, and the proxy would need to be restarted every time a new policy group (and thus a new interface) is added. It's also unclear how this would work with Nginx.
A variation is to run CoreDNS with port forwarding (or maybe socket activation) so that it listens on all interfaces, and then use redirect in nftables. This way, the traffic within each policy group should be redirected to the corresponding gateway. However, this setup sounds complicated, and as with the bind plugin, I'm not sure whether it would work for Nginx.
A more complex option is to use nftables to map each incoming interface to a different port on the host. We could then run a proxy instance for each policy group, listening on its assigned port. This essentially moves the identification logic into nftables and is useful if a proxy doesn't support IP-based policies, but the rules would be complicated and fragile. For example, we would need rules to prevent a container from accessing a proxy port it's not authorized for.
My Plan

Ultimately, I need to find a balance between two goals:
Maximum Security: Resisting vulnerabilities and malicious actors.
Ease of Maintenance: Requiring minimal effort and not being error-prone.
I will most likely implement Option 2, if I can make it work. It offers a good blend of centralized control and flexibility without the complexity of managing dozens of networks or proxy containers.

A Journey into Podman: Notes on My First Adventure

2025-09-20T01:45:00.006+02:00

For the last few days, I've been experimenting with Podman. My goal was to get a feel for the setup, create a minimal yet scalable environment for a few containers, and identify potential problems early on.

Here are my notes from this experience.

[Updates 2025-09-21] Added more networking options and other information.

Quadlet

Quadlet allows you to define containers, networks and more using a syntax similar to systemd. This includes helpful features like drop-in overrides and templates.

The framework is tightly integrated with systemd, and Quadlet actually generates real systemd units. This means I can directly write systemd options in my Quadlet files.

One of the biggest benefits I've found is how easy Quadlet makes it to set up socket activation. This allows me to place some containers in an internal network or even without a network at all.

Hardening Defaults

Let's say I have a group of systemd and Quadlet units, all named in the format xyz-*. My goal is to define some secure, hardened default values for these units that can still be overridden by individual units.

For example, I want to change the default for ProtectSystem= from false to true.

Simply creating a xyz-.service.d/00-override.conf doesn't work, because an individual unit cannot override this setting. While I could create a xyz-foo.service.d/10-override.conf for each specific service, this would split the definition of xyz-foo into two separate files, which isn't ideal.

To solve this, I created a script that moves the main xyz-foo.service file to xyz-foo.service.d/10-override.conf and creates a nearly empty xyz-foo.service as a placeholder. This facilitates the process of setting and overriding defaults.
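
A minimal sketch of what such a script could look like is shown below. The function name `convert_to_dropin` and the content of the placeholder file are assumptions for illustration; the post only says the placeholder is "nearly empty".

```python
from pathlib import Path

# Assumed placeholder content; anything that parses as an empty unit would do.
PLACEHOLDER = "# Placeholder; the real definition lives in the drop-in directory.\n"


def convert_to_dropin(unit_path, dropin_name="10-override.conf"):
    """Move a unit file into its own drop-in directory and leave a placeholder."""
    unit_path = Path(unit_path)
    # xyz-foo.service -> xyz-foo.service.d/
    dropin_dir = unit_path.with_name(unit_path.name + ".d")
    dropin_dir.mkdir(exist_ok=True)
    target = dropin_dir / dropin_name
    if target.exists():
        # Refuse to run twice, otherwise the placeholder would overwrite the real unit.
        raise FileExistsError(f"{target} already exists")
    unit_path.rename(target)
    unit_path.write_text(PLACEHOLDER)
    return target
```

For example, `convert_to_dropin("/etc/systemd/system/xyz-foo.service")` would leave the original settings in `xyz-foo.service.d/10-override.conf`, next to the shared defaults in `xyz-.service.d/00-override.conf`.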

gVisor

Setting up gVisor's runsc was straightforward, and I haven't encountered any compatibility issues so far.

Unfortunately, the version of gVisor in the Debian repositories is quite old, and Debian is unable to provide prompt security updates for it. This meant I had to add the official gVisor apt repository. It's not the ideal solution, but it works.

Credentials

Systemd-Credentials is a very handy tool for managing sensitive information and can be used directly in Quadlet files.

However, I ran into an issue when using it with containers that have --userns=auto. Because Podman still runs as root, the credentials are only readable by the root user and not by the containers.

Podman offers its own solution for this called podman-secret. This feature allows you to either have Podman store the secrets or use drivers to connect to your own storage solution.

I prefer to keep all my secrets in a dedicated directory. To accommodate this, I wrote a simple script that registers a file as a Podman secret:

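# $1: secret name; $2: path to the file that holds the secret value.
# store and delete are no-ops and lookup reads the file, so the SECRET passed on stdin is only a placeholder.
podman secret create \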
  --driver=shell \
  --driver-opts=list="echo $1" \
  --driver-opts=lookup="cat $2" \
  --driver-opts=store=/bin/true \
  --driver-opts=delete=/bin/true \
  --replace=true \
  "$1" - <<<SECRET

For any container that needs access to secrets, I simply call this script in ExecStartPre=.

I found a few issues and created an issue on GitHub. Fortunately, workarounds are available.

Networking

My plan is to run a few groups of containers with the following requirements:

Containers within the same group can communicate with each other, but containers from different groups cannot.
I want to avoid manually managing IP addresses for each group or container.
It should be easy to write firewall rules to intercept container traffic.

I explored a few options to achieve this:

Plain Bridges: The simplest approach is to create a default bridge for each container group, which is the default behavior in Podman. Everything works out of the box, and I don't need to explicitly specify IP addresses. However, writing firewall rules is tricky because the IP addresses and bridge names are not predetermined. To make this work, I would need to carefully define an IP range and/or bridge name for each bridge.
Single Bridge: Another option is to create a bridge using systemd-networkd and then, for each container group, create a Podman bridge network using the existing bridge with VLANs and/or isolation. Since the bridge is unmanaged, I will need to manually set up DHCP, DNS, and the firewall.
- Simply setting up a DHCP server on the bridge (i.e. in bridge.network) doesn't work. I think this is because Podman, or more precisely netavark-dhcp-proxy, uses the interface (the bridge in this case) as a DHCP client rather than a server.
- It works if I add an external DHCP server to the bridge. E.g. create a veth pair, put one end under the bridge and add a DHCP server on the other end.
VRF: I tried creating a VRF with systemd-networkd and then creating a Podman bridge network for each container group, specifying the VRF. DHCP works with this setup, but the DNS server doesn't. While I was able to force aardvark-dns to use the VRF by using a wrapper script (exec ip vrf exec MY-VRF aardvark-dns "$@"), this solution felt too fragile for long-term use. Also, I wasn't able to easily isolate containers within the same network.
veth + macvlan: This setup involves creating a veth pair for each container group, putting a DHCP server on one end, and then using the other end to create a Podman macvlan network. This works, and it is easy to isolate containers within one Podman network by using the private mode and setting up the necessary firewall rules. But note that the DNS server is not supported in macvlan mode.

My Networking Plan

All these options more or less work, with different trade-offs. I plan to use options 1 and 4 for different use cases:

When a group of containers needs to find each other by name, I put them into a bridge network (option 1) with Internal=true. Both DHCP and DNS work, and I don't need to write many firewall rules.
When a container needs to access the Internet, I create a private macvlan network (option 4) and put the container into it. Since I can easily isolate containers within the same network, it is possible to put containers from different groups into the same bridge network. Then I can just use the bridge network to group containers by policy, e.g. some containers can access everything, but some are only allowed to visit specific URLs. I just need to write firewall rules for each bridge network.

Final Thoughts

So far, the combination of Podman, Quadlet, and gVisor has been a positive experience. Not everything has worked perfectly, but I'm quite happy with the setup. If things continue to go well, I might be able to migrate my docker-compose setup in the near future.

gVisor: A Fresh Look at Container Security

2025-09-17T22:08:00.009+02:00

My original plan was to stabilize my VM pipeline before deploying containers using a hardened stack of Podman, QEMU, SELinux, and user namespaces (--userns=auto). However, the pipeline's complexity grew, requiring script rewrites and schema redesigns, and the process took much longer than anticipated.
In the meantime, an interesting alternative has captured my attention: gVisor. It occupies a unique space between traditional SELinux policies and full-blown virtual machines, offering a compelling set of trade-offs.
What is gVisor?

At its core, gVisor is an application kernel, written in the memory-safe language Go, that provides an additional layer of isolation between containerized applications and the host operating system. It's essentially a user-space implementation of the Linux kernel's system call interface.
The security model is explained here.
gVisor in Practice

gVisor provides an OCI-compliant runtime called runsc, which can be almost transparently integrated with container tools like Docker and Podman.
And that's it! Unlike SELinux, here we don't need to write any policies. This is the most attractive feature for me.
However, it comes with notable downsides:
SELinux is not supported; I cannot use both gVisor and SELinux at the same time.
--ignore-cgroups must be used for rootless Podman, which means cgroups won't work. Maybe this can be fixed later.
There can be compatibility issues, because gVisor implements its own version of syscalls.
The performance overhead is higher, especially for IO-related syscalls. It is well explained here.
My Plan

I plan to evaluate gVisor with a few of my simple containers. Its promise of "secure-by-default" sandboxing without complex configuration is very appealing, especially for running applications where trust is a concern but the overhead of a full VM is undesirable.
I also believe that I don't really need the fine-grained control offered by SELinux. Bind mounts (read-only, read-write) should be enough for me. Eventually I might even drop the VM pipeline and just use gVisor.
We'll see.

A Declarative Approach to Config File Management

2025-08-24T23:40:00.010+02:00

Configuration files for different services are rarely independent. For example, in nftables, I might tag traffic with a firewall mark, and that mark is then used by systemd-networkd or in ip routes. Similarly, when the name of the primary network interface changes, multiple services like nftables, postfix, and samba need to be updated.

Requirements

I want to define core data in one place, then update all config files with a simple command.
If a configuration file is modified by an external process (for example, a package update from a vendor or distribution), the changes must be handled gracefully. Either the merge should be automatic and permanent, or I should be notified to easily resolve any conflicts.
It should be obvious within the config file itself what changes I have made.

Existing Solutions

I did a quick survey and found a few options.

1. Templates

These tools render a template using provided data sources. To manage /etc/config.txt, I would create a /etc/config.txt.template with all the moving parts marked using the required syntax.

Examples include:

The biggest issue is that the generated config file is no longer the source of truth. This means if the generated file is modified by other tools, those changes will be lost the next time I render the template.

Perhaps these tools are better suited for scenarios like building images with bootc or mkosi.

2. Patching

These tools record and apply the diff between a default state and the desired state.

Examples include:

diff and patch
Ansible's lineinfile and blockinfile
Augeas
crudini

There are two issues with these tools:

The diff is stored separately from the config file, which is hard to read and maintain. I might also need to keep a copy of the original, unpatched file for reference.
The patch might not be reliable if there isn't enough context to locate the exact area for patching. For example, consider a config file like this:
```
[Config for user A]
// many lines
use_https = true

[Config for user B]
// many lines
use_https = false
```
We want to modify the use_https setting for user A and generate a diff. Later, if the vendor's config file swaps the order of user A and B, the patch might still apply without error, but it will modify the wrong section!

Note that while Ansible can place markers around managed blocks, it must first insert them. For the initial insertion, it relies on regular expressions (insertafter and insertbefore) to find the location, which can be brittle.

3. Generators

NixOS allows you to generate all config files using custom data and functions in the same language.

The biggest issues with this approach are:

You are forced to commit to a specific ecosystem like NixOS or another tool that fully manages your system's configuration.
Merge conflicts almost never happen because your own NixOS configuration is just an override of the default values. This means you aren't notified of potential semantic conflicts. For example, if a default value you were referencing changes upstream, your configuration will adapt silently, which may not be the desired behavior without a manual review.

My Plan

The existing solutions I found almost solve my problem, but not 100%.

The closest approach I found is a combination of:

Adding a custom, unique anchor comment in the config file.
Using Ansible's blockinfile with the anchor comment for insertafter or insertbefore.

But I still don't like that the diff is stored separately from the config file. To solve this, my plan is to embed the template directly inside the configuration file, like this:

### BEGIN MANAGED BLOCK
### binds_to = {{ config.permanent_lan_interface.name }}
### END OF TEMPLATE
### END MANAGED BLOCK

I'll then write a script that:

Deletes all text after END OF TEMPLATE.
Parses the template before END OF TEMPLATE.
Renders the template using libraries like Jinja.
Puts the rendered template after END OF TEMPLATE.
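
The steps above can be sketched as follows. This is a hedged illustration, not the final script: it uses a tiny regex substitution as a stand-in for Jinja so that it runs with the standard library only, it assumes template lines carry a `### ` prefix that is stripped before rendering, and it assumes "all text after END OF TEMPLATE" means everything up to END MANAGED BLOCK. The function names are hypothetical.

```python
import re

BEGIN = "### BEGIN MANAGED BLOCK"
END_TEMPLATE = "### END OF TEMPLATE"
END = "### END MANAGED BLOCK"
PREFIX = "### "


def lookup(data, dotted):
    """Resolve a dotted path such as 'config.iface.name' in nested dicts."""
    value = data
    for key in dotted.split("."):
        value = value[key]
    return value


def render_line(line, data):
    """Replace every {{ dotted.path }} placeholder (stand-in for Jinja)."""
    return re.sub(r"\{\{\s*([\w.]+)\s*\}\}",
                  lambda m: str(lookup(data, m.group(1))), line)


def update_managed_blocks(text, data):
    """Re-render every managed block in text and return the new file content."""
    out, template = [], []
    state = "outside"  # outside -> template -> rendered -> outside
    for line in text.splitlines():
        if state == "outside":
            out.append(line)
            if line.strip() == BEGIN:
                state, template = "template", []
        elif state == "template":
            out.append(line)  # the embedded template itself is always kept
            if line.strip() == END_TEMPLATE:
                out.extend(render_line(t, data) for t in template)
                state = "rendered"
            else:
                template.append(line[len(PREFIX):] if line.startswith(PREFIX) else line)
        elif line.strip() == END:  # state == "rendered"
            out.append(line)
            state = "outside"
        # Any other line in the "rendered" state is stale output and is dropped.
    if state != "outside":
        raise ValueError("unterminated managed block")
    return "\n".join(out) + "\n"
```

Because stale output is dropped before the template is rendered again, running the function twice on the same file yields the same result, and the template stays visible in the config file as a record of what was changed.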

Final Thoughts

My current plan may not be elegant, but it seems to meet my requirements more effectively than the other solutions.

Meanwhile, I'm still looking for new options. Please let me know if you know of any.

VM Networking From Scratch

2025-08-22T01:01:00.005+02:00

Now that I've settled on my VM image pipeline, the next logical step is to tackle networking.

My Requirements

So far, I've been using QEMU's default user-mode networking. It's convenient for quick tasks, allowing for easy port forwarding, Samba shares, and DNS with just a few flags. However, this setup is ultimately insufficient for my needs for a couple of key reasons:

Security and Isolation: In the default user-mode setup, a VM can access the host's services via localhost. Worse, because it uses NAT, the VM can also access the host's entire LAN using the host's IP address. Ideally, VMs should have their own identifiable IP addresses, and more importantly, there should be strong network isolation between the host and the VMs.
Centralized Auditing: I want to audit all network traffic from my VMs through a centralized solution. This means I need a way to route all VM traffic through a single point of control.

Choosing the Right Tool

For most people, tools like libvirt or Incus are the best choice for this task. They are well-maintained, thoroughly tested, and have well-designed command-line interfaces that are less error-prone. I should probably just choose one of them and be done with it.

...except that I'm genuinely interested in learning the underlying building blocks and terminology. This is the main reason I chose to write QEMU scripts manually in the first place. Meanwhile, I find myself constantly referring to the documentation for these tools anyway when I'm studying security options.

Maybe one day, when I'm satisfied with my knowledge, I'll migrate my scripts to one of these excellent tools. But for now, let's learn by ~~suffering~~ doing.

Bridge + TAP

As many guides suggest, for anything beyond basic networking, the place to start is with a Linux bridge and TAP devices. A bridge acts like a virtual network switch, and a TAP device acts like a virtual network port for a VM to connect to that switch.

Thankfully, systemd-networkd makes this setup fairly easy. In the .network file for my bridge, setting IPv4Forwarding=yes and IPMasquerade=ipv4 saves me from writing custom nftables rules for NAT, which is a huge time-saver. QEMU also makes it simple to attach a VM's network interface to an existing TAP device.

To keep things tidy, I decided to automatically generate the systemd-networkd configuration files (e.g., .netdev and .network) directly from my VM configuration files. I save these generated files to /run/systemd/network/. This ensures I don't have to manually keep two sets of configurations in sync.

IP Addresses

The easiest way to assign IP addresses to VMs is to run a DHCP server on the bridge. Most standard cloud images, including bootc images, are configured to use DHCP by default.

However, I ultimately decided to use static IP addresses. Setting up a DHCP server securely, whether on the host or in a dedicated VM, takes some effort. Even with a DHCP server, I would likely configure static reservations to make it easier to write firewall rules to prevent IP address spoofing.

So, my process for each VM looks like this:

Generate a unique MAC address and a static IP address, and store them in the VM's configuration file.
Before starting the VM, generate a temporary systemd-networkd .network file that matches the VM's MAC address and configures its static IP, gateway, and DNS settings.
Pass these configuration files into the VM at boot time using systemd's network.* credentials feature.
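
The first two steps can be sketched roughly as below. The helper names and the example addresses in the usage note are hypothetical. The MAC generator assumes a locally administered unicast address (first octet 0x02) is acceptable, and the generated file only uses the [Match] and [Network] options mentioned above.

```python
import ipaddress
import random


def random_mac(rng=random):
    """Generate a locally administered, unicast MAC address."""
    # First octet 0x02: locally-administered bit set, multicast bit clear.
    octets = [0x02] + [rng.randint(0, 255) for _ in range(5)]
    return ":".join(f"{o:02x}" for o in octets)


def render_network_file(mac, address, gateway, dns):
    """Render a systemd-networkd .network file with a static configuration."""
    iface = ipaddress.ip_interface(address)  # e.g. "10.0.0.10/24"; validates input
    gw = ipaddress.ip_address(gateway)
    if gw not in iface.network:
        raise ValueError("gateway is outside the VM's subnet")
    return (
        "[Match]\n"
        f"MACAddress={mac}\n"
        "\n"
        "[Network]\n"
        f"Address={iface}\n"
        f"Gateway={gw}\n"
        f"DNS={ipaddress.ip_address(dns)}\n"
    )
```

For example, `render_network_file(random_mac(), "10.0.0.10/24", "10.0.0.1", "10.0.0.1")` returns the text of a .network file that matches only the interface with the generated MAC address.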

This should work perfectly... right?

Wrong! I quickly discovered that CentOS does not ship systemd-networkd (1, 2, 3).

After looking through the official options for bootc images, I settled on using NetworkManager. This requires me to generate a NetworkManager keyfile and embed it into the container image. This isn't ideal because updating the network configuration requires rebuilding the image, which is slow. In the future, I might explore better options, such as:

Separating the Linux kernel from the image and booting it directly with QEMU, allowing me to pass network configuration via kernel parameters.
Using a different base image that includes systemd-networkd.

Inter-VM Traffic

By default, all VMs connected to the same bridge can communicate with each other freely. This isn't what I want; my goal is to enforce a "default deny" policy and only allow traffic that is explicitly permitted.

After some research with some help from AI, I learned a few key terms: port isolation, private VLAN, and proxy ARP. It turns out these concepts are perfect for my use case.

Here’s what I discovered when I put it into practice:

I started with a standard bridge and TAP setup, with host firewall rules in nftables that block all traffic. As expected, the VMs could not connect to the internet. However, they could still talk to each other. Why?

A quick debugging session with nft monitor revealed that packets traveling between VMs on the same bridge never hit my inet family firewall rules. This is because the bridge was forwarding the traffic at Layer 2 (like a real switch), so the host's Layer 3 IP-level firewall was never consulted. nftables has a bridge family specifically for filtering this kind of traffic.

Next, I enabled port isolation on the bridge. Now, even the bridge family rules couldn't see the packets between VMs. This confirmed that port isolation operates at an even lower level, preventing the bridge from forwarding frames between isolated ports altogether.

This gave me the perfect foundation. Now, if I want to allow two VMs to communicate, I have to do it explicitly. I have two main options:

Force Gateway Routing: I can remove the local subnet route inside each VM, forcing them to send all packets (even to other VMs on the same subnet) to the bridge's gateway IP address. The host's routing stack will then receive the packets, which can be filtered by my standard inet family nftables rules.
Use Proxy ARP: I can enable IPv4ProxyARPPrivateVLAN=yes on the bridge's network configuration. The host will then respond to ARP requests on behalf of the VMs. This tricks the VMs into sending all their packets to the host's MAC address.

Ultimately, both options achieve the same goal: they force Layer 2 traffic up to Layer 3, where it can be inspected by a central firewall. Option #2 is more elegant because it doesn't require custom network configuration inside the VMs. Option #1 seems less hacky.

Notes:

My initial assumption was that with Proxy ARP (Option #2), the traffic would be captured by the bridge family in nftables. This is incorrect. The ARP resolution happens at Layer 2, but the subsequent IP packets are routed, so they are captured by the ip or inet families.
Proxy ARP doesn't remove the need for a Layer 3 firewall. A malicious VM could simply add its own static routes (as in Option #1) to try and communicate directly. The key is to have a firewall at the gateway that inspects all traffic, ensuring that even if a VM tries to bypass the intended path, the traffic is still filtered. The main benefit of port isolation is preventing direct, unfiltered Layer 2 communication.
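To make the "default deny" idea concrete, here is a minimal sketch that prints an inet family ruleset with a dropping forward chain and one explicit exception. The table name vm_filter, the addresses, and the port are all hypothetical, not taken from the real configuration.

```shell
#!/bin/sh
# Sketch: print an nftables ruleset with a default-deny forward chain and one
# explicit exception. Addresses, port, and the table name are hypothetical.
render_forward_rules() {
    src=$1 dst=$2 port=$3
    cat <<EOF
table inet vm_filter {
    chain forward {
        type filter hook forward priority filter; policy drop;
        ct state established,related accept
        ip saddr $src ip daddr $dst tcp dport $port accept
    }
}
EOF
}

render_forward_rules 10.0.0.11 10.0.0.12 22
```

The output could be loaded with `nft -f -` as root. Because routed packets between VMs pass through the forward hook, this one chain covers both the gateway-routing and the proxy ARP variants.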

Outgoing Traffic

For controlling traffic leaving the host, I have a draft plan that provides strong isolation. The idea is to create a dedicated firewall VM.

On the host, I'll set up two bridges: bridge-internal and bridge-external.
All my regular VMs will be connected to bridge-internal. The host itself will not have an IP address on this bridge. This ensures the VMs cannot directly talk to the host. If needed, I can set up an SSH connection over vsock.
I will set up a special firewall VM that has two network interfaces: one connected to bridge-internal and the other to bridge-external.
The host's physical network interface will be connected to bridge-external.

With this setup, all outgoing traffic from the VMs must pass through the firewall VM, giving me a single place to manage all rules. It also isolates the host's network stack from the VMs by default.

For services, I can configure the firewall VM:

For non-HTTP services like NTP, I can set up forwarding or proxy rules.
For HTTP/HTTPS traffic, I can set up a transparent proxy using Nginx. Previously, I thought this would require a separate proxy configuration for each domain, e.g. dedicated proxy and DNS entry, but AI showed me a much better way:
- Nginx's ngx_stream_ssl_preread_module allows it to inspect the SNI (Server Name Indication) in the TLS handshake without decrypting the traffic.
- I can use firewall rules to redirect all outgoing HTTPS traffic from bridge-internal to this Nginx stream proxy.
- In the proxy, I can maintain a simple allowlist of domains and block everything else.
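A minimal sketch of what such an SNI allowlist could look like, generated from a list of domains. The listen port 8443, the resolver address, and the use of 127.0.0.1:9 as a dead end for non-allowlisted names are illustrative assumptions, not a tested configuration.

```shell
#!/bin/sh
# Sketch: generate an nginx "stream" config that routes TLS connections by SNI
# and only forwards allowlisted domains. Port 8443, the resolver address and
# the 127.0.0.1:9 dead end are illustrative assumptions.
render_sni_proxy() {
    echo 'stream {'
    echo '    # Map the SNI read from the ClientHello to an upstream.'
    echo '    map $ssl_preread_server_name $sni_upstream {'
    echo '        default 127.0.0.1:9;  # not allowlisted: expected to be a closed port'
    for domain in "$@"; do
        echo "        $domain $domain:443;"
    done
    echo '    }'
    echo '    server {'
    echo '        listen 8443;'
    echo '        ssl_preread on;       # peek at the handshake without decrypting'
    echo '        resolver 127.0.0.1;   # needed because proxy_pass uses a variable'
    echo '        proxy_pass $sni_upstream;'
    echo '    }'
    echo '}'
}

render_sni_proxy example.org deb.debian.org
```

The stream block belongs at the top level of nginx.conf, next to (not inside) any http block. A firewall redirect of TCP port 443 from bridge-internal to the listen port would complete the picture.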

I plan to explore this design further. For example, is it better to use the host as the firewall? Or split the firewall services into multiple VMs? Could macvlan be useful here? These are questions for a future post.

Conclusion

In the end, I've replaced QEMU's basic networking with a much more secure, custom setup. Using a Linux bridge and port isolation, I can now force all VM traffic through a central firewall for inspection.

While it was more work than using a tool like libvirt, building this from scratch was a fantastic way to learn the fundamentals of VM networking and gain complete control over my environment.

Sending Emails with Curl: A Nifty Systemd Workaround

2025-08-17T20:02:00.003+02:00

I recently tried to create a simple systemd service to send an email notification, but my initial approach with mail and sendmail failed with a strange permission error.

My original service file looked like this:

[Service]
ExecStart=mail --subject=Subject recipient@example.com

The error message was a bit of a head-scratcher: warning: mail_queue_enter: create file maildrop/....: Permission denied.

A quick search pointed me to the cause: the postdrop binary is setgid, and it needs that group privilege to write to the maildrop queue. However, the systemd setting NoNewPrivileges=true prevents setuid/setgid binaries from gaining privileges.

While I hadn't explicitly used that setting, I was using DynamicUser=true, which implies and enforces NoNewPrivileges=true. This meant my service, running as a temporary user, couldn't get the permissions it needed to interact with the mail queue. Note that this implication cannot be disabled or overridden.

I wanted to avoid creating a new, dedicated user for this task. I realized that the problem was how mail and sendmail directly interact with the mail queue. The solution was to bypass that entire process and talk directly to the local SMTP server.

I didn't want to install another dedicated SMTP client. Fortunately, I learned that curl can also act as an SMTP client! This command worked perfectly, sending the email by talking to the SMTP server directly:

curl --url smtp://localhost:25 --mail-rcpt recipient@example.com --upload-file body.txt
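curl uploads the file as the complete message, so it should contain the headers as well as the body. A minimal sketch of building such a file follows; the addresses and subject are placeholders, and the helper name make_message is made up for this example.

```shell
#!/bin/sh
# Sketch: build a minimal mail message (headers, blank line, body) for curl
# to upload. Addresses and subject are placeholders.
make_message() {
    printf 'From: %s\n' "$1"
    printf 'To: %s\n' "$2"
    printf 'Subject: %s\n' "$3"
    printf '\n'     # blank line separates headers from the body
    cat             # the body is read from stdin
}

msg=$(mktemp)
echo "Backup finished." |
    make_message notifier@example.com recipient@example.com "Backup report" > "$msg"
cat "$msg"

# The file can then be sent with, for example:
#   curl --url smtp://localhost:25 --mail-from notifier@example.com \
#        --mail-rcpt recipient@example.com --upload-file "$msg"
```

Setting --mail-from explicitly is optional, but it gives the envelope a sensible sender.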

mkosi: First Impressions

2025-08-14T22:06:00.001+02:00

I stumbled upon the Gentoo wiki page for systemd-nspawn, which in turn led me to nspawn.org, mkosi, and later systemd-sysupdate.

mkosi quickly caught my eye because it's almost exactly what I wanted to build myself, as mentioned in a previous post. So, I decided to spend my "sysadmin fun quota" on it.

Overview

mkosi is similar to docker build or podman build, but it's designed for creating full OS images. It focuses on development and testing. For example, much like nix-shell, mkosi can quickly launch a sandboxed shell with a specific distribution and selected packages installed. The systemd project itself uses mkosi for testing across different distros.

The re-introduction article is a great read.

Speed

Note that this is by no means a rigorous benchmark.

My setup is an SSD with LUKS and an ext4 filesystem (without reflink support).

Building Container Images

mkosi is pretty fast. A simple mkosi command creates a fresh Debian image. I used the --incremental flag for subsequent builds.

First run: ~30s
Second run (after trivial changes): ~5s

Using mkosi -p systemd allows the container to boot (via systemd-nspawn -b), which adds only a few seconds to the build time.

Building VM Images

Building a VM image with mkosi --include mkosi-vm is a bit slower, likely due to the extra steps for installing a bootloader and kernel.

First run: ~1m 30s
Second run (after minor changes): ~30s

Comparison with bootc

I tried to build a fresh CentOS image using both tools.

mkosi --include mkosi-vm -d centos

Duration: ~1m 30s
Output disk size: 1.2 GB

podman pull quay.io/centos-bootc/bootc-image-builder:latest && \
podman run ... \
    quay.io/centos-bootc/bootc-image-builder:latest \
    --type raw ... \
    quay.io/centos-bootc/centos-bootc:stream9

Duration: ~4m 30s
Output disk size: 1.9 GB

Notes:

The bootc-image-builder was pre-pulled, and this time isn't included in the measurement.
The time to pull the base CentOS image is included.
I'm generating a raw image here instead of QCOW2.

Again, these numbers aren't directly comparable outside of my specific setup.

bootc-image-builder runs in a VM, while mkosi runs directly on the host.
centos and centos-bootc are different distributions, and their configurations (like installed packages) are also very different. This is obvious from the difference in their final image sizes.

Running Images with systemd-nspawn

I attempted to get unprivileged systemd-nspawn working but failed:

systemd-nsresourced.socket and systemd-mountfsd.socket must be running.
systemd-mountfsd complains that the image is untrusted unless it's signed or located in a trusted location.
I got stuck on another error.

Eventually, I resorted to using sudo systemd-nspawn -U, which worked well. The -b flag "boots" the image by running systemd/init as PID 1.

Running Images with QEMU

mkosi --kvm vm works nicely.

Notes:

Credentials are visible in the command-line arguments.
I'm not a fan of all the default flags for QEMU, but mkosi provides many options for customization.

Observations, Thoughts and Concerns

mkosi is deeply integrated with systemd. Its configuration files also follow the systemd style: declarative, INI-like, with drop-in overrides.
I wasn't able to test the performance benefits of reflink, because my filesystem doesn't support it and the disk images were small anyway.
I also wasn't able to test if SELinux works. Supposedly, it needs an extra flag in mkosi.conf and might be slow. On the other hand, it works out-of-the-box in bootc images.
I don't really miss Containerfiles much. I usually just need to copy files, and for my use case, a Containerfile would essentially just be running my scripts with bind mounts. Plus, I don't use many layers. But I might miss having an immutable Linux setup.
mkosi supports many popular distributions, while bootc only supports Fedora/CentOS.
mkosi may add surprising modifications to the image:

mkosi doesn't use debootstrap. It actually used to depend on it, but that dependency was removed. Not sure if this approach is hacky.
mkosi may inject its own SSH server unit.

zvol doesn't seem very reliable, so I'll probably avoid using it for another few years.

Conclusion

mkosi is a very interesting tool. While I'm not ready to migrate my entire image-building pipeline yet, I might consider replacing my current LXC setup with it.

Rethinking My VM Image Pipeline

2025-08-12T21:33:00.009+02:00

Today, my pipeline regularly builds images for my disposable VMs. Here's the current process:

A dedicated builder VM reads Containerfiles for all VMs, including itself.
The builder VM uses podman build to create container images for all VMs.
The builder VM then uses bootc-image-builder to create disk images for all VMs.

This process works well, but it has a significant issue: the disk images aren't built efficiently. Unlike container images, which benefit from reusable, cacheable layers, disk images are always built from scratch. This leads to long build times and limited opportunities for data deduplication.

To address this, I've been exploring alternative options to improve the pipeline.

Disk Image Formats and Deduplication

My Current Format: QCOW2

I currently use QCOW2 with compression enabled. This format offers several features like snapshots, compression, and sparse files, which are useful when the underlying filesystem doesn't support them. However, if the filesystem does provide these features, QCOW2 doesn't offer many additional benefits over a simple raw disk image, at least for my use case.

Some notes:

Raw disk images are more transparent and widely supported by various tools. It's also much easier to deduplicate raw image files than compressed QCOW2 images. A QCOW2 image without compression should theoretically be similar to a raw image, but I haven't verified this.
The compression in QCOW2 is "read-only," meaning new writes aren't compressed. This isn't a problem for me because my VMs are immutable, so the images are rarely written to after creation.
bootc-image-builder actually builds the raw image first, before converting it into the QCOW2 format.

The Power of Deduplication

I expect deduplication to be highly effective in my setup because most of my disk images are very similar. There are a few ways to achieve this:

Filesystem Deduplication: This approach can be either online (e.g., ZFS) or offline (e.g., btrfs). The filesystem finds duplicate data blocks within files and removes redundant data from the disk. This is a general solution but doesn't necessarily speed up the initial build process.
Proactive Deduplication: This method is about building new images by applying small changes to an existing one. For example, you can "fork" an image using cp --reflink a.img b.img or qemu-img create -b a.qcow2 b.qcow2. Only the differences between the two images are stored on disk. This approach can significantly speed up the build process because you are not building from scratch, but it requires images to be built incrementally, not from a clean slate.
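A small sketch of the "fork" step, using hypothetical file names. It uses --reflink=auto so that it also runs on filesystems without reflink support, where it silently falls back to a full copy and therefore saves no space.

```shell
#!/bin/sh
# Sketch: "fork" a base image so that only later differences need new space.
# File names are hypothetical; --reflink=auto falls back to a full copy
# on filesystems without reflink support (such as ext4).
fork_image() {
    cp --reflink=auto "$1" "$2"
}

workdir=$(mktemp -d)
truncate -s 16M "$workdir/base.img"     # stand-in for a real base image
fork_image "$workdir/base.img" "$workdir/vm-x.img"
ls -ls "$workdir"

# The qcow2 equivalent uses a backing file; recent qemu-img versions also
# expect the backing format (-F) to be stated explicitly:
#   qemu-img create -f qcow2 -b base.qcow2 -F qcow2 vm-x.qcow2
```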

Exploring New Approaches

Bootc and In-Place Updates

I'm not currently using bootc images in their intended way. bootc is designed so you build a single disk image once and then update it in-place via a container registry.

I've considered two ways of leveraging this:

I could trust the VMs to update themselves.
I could maintain a "trusted base image" and follow this process:
- Create a base disk image using bootc. This image is only used for building other images and never for running services.
- To create the disk image for a specific VM, say VM X, I would first fork the base image using cp --reflink or qemu-img create -b to create X.img.
- I would then boot a VM using X.img and have it upgrade itself using VM X's specific container image. This container image could either be served from the builder VM via a server or a mounted directory, or it could be built locally within the forked VM, potentially using shared layers from a mounted cache.

This process seems workable, but it's overly complex for my taste. It involves running VMs during the build process, which would require a significant amount of scripting.

Plain Disk Images and In-Place Updates

This is similar to the bootc approach but uses standard raw disk images. Again, I could set up temporary VMs for the build process, but instead of relying on bootc's update mechanism, I would need custom scripts. This starts to resemble tools like cloud-init or Ansible.

A key benefit here is that a VM isn't a strict dependency. I could use something like systemd-nspawn to directly modify the disk images in-place, which would simplify scripting and make the process more reliable. I did attempt this with bootc images, but they don't work well with systemd-nspawn out of the box because the partitions lack the UUIDs that systemd-nspawn requires.

Final Thoughts

Ultimately, I haven't found a truly satisfying improvement to my current build process. While some of these approaches could theoretically improve build times and reduce disk usage, they also make the build pipeline more complicated and less reliable. At this moment, I don't think the trade-off is worth it.

For now, I'll probably just experiment with deduplication on ZFS and reflink on XFS. I noticed that ZFS doesn't enable reflink (zfs_bclone_enabled) by default, so that's a small hurdle.

This exploration has been an interesting learning experience. I've revisited/discovered some relevant tools:

libvirt
incus
systemd-nspawn
cloud-init
ansible
systemd-volatile-root.service

Sometimes, when I'm writing my own scripts, I feel like I'm building a slimmed-down version of these tools myself. However, I'm not yet convinced that it's the right time to fully switch to them.

[UPDATE]: I learned that this process is called Golden Image and Phoenix Server.

A Practical Guide to Passing Secrets to VMs

2025-08-06T22:09:00.007+02:00

The central question is: how do you manage secrets like SSH keys, API keys, and passwords for disposable VMs? 🤷‍♂️

Let's establish some ground rules for this scenario. Suppose I want to pass an API key to the VM chimera, which is run by the chimera-runner user on the host. My security requirements are:

On the host, only root and chimera-runner should have access to the secrets.
In VM chimera, only root and relevant service users should have access to the secrets.
No one from other VMs, including their root users, should have access to VM chimera's secrets.
The guest VMs themselves are not trusted.

The bootc documentation on this topic is very informative.

On a high level, there are a few ways to achieve this.

1. OEM Strings / Firmware

QEMU can pass data to a VM via SMBIOS OEM strings (-smbios) or firmware configuration (-fw_cfg). Notably, both methods are supported by systemd-creds using special keys.

This approach is practical for small pieces of data, like an individual password or an encryption key. It's not a new idea and was discussed years ago.

However, there are a few caveats:

Size Limits: The QEMU manpage states that the total size of all SMBIOS tables is limited to 65535 bytes. While not explicitly defined, fw_cfg is also intended for small amounts of data.
Key Length: The maximum length of a fw_cfg key name is 55 characters. If you use it with systemd-creds, a special prefix is required, making the available space for your key name even shorter.
Security Risk: If you pass data as a string (e.g. -fw_cfg string=secrets), the secret becomes part of the QEMU process's command line, which is visible to all users on the host!
Bugs: You can provide a file to SMBIOS using -smbios path=filename, which avoids exposing the secret on the command line. However, this feature is affected by a bug that is still present in Debian Bookworm.
Accessibility: Data passed via firmware appears in the guest's /sys/firmware directory, which cannot be mounted as a typical block device.

For these reasons, I generally use -fw_cfg file=filepath for passing small secrets and leverage systemd-creds within the guest whenever possible.
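A minimal sketch of building such an argument, including the key-length check. It assumes the opt/io.systemd.credentials/ prefix that systemd looks for in fw_cfg; the credential name api-key and the file path are hypothetical.

```shell
#!/bin/sh
# Sketch: build a -fw_cfg argument that systemd in the guest can import as a
# credential. fw_cfg key names are limited to 55 characters, and the prefix
# "opt/io.systemd.credentials/" already uses 27 of them.
fw_cfg_credential() {
    key="opt/io.systemd.credentials/$1"
    if [ "${#key}" -gt 55 ]; then
        echo "credential name too long: $1" >&2
        return 1
    fi
    printf 'name=%s,file=%s\n' "$key" "$2"
}

# Hypothetical credential name and host path.
fw_cfg_credential api-key /run/chimera/api-key
```

The printed value would be passed to QEMU as `-fw_cfg "$(fw_cfg_credential api-key /run/chimera/api-key)"`, which keeps the secret itself out of the process's command line.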

I might switch to -smbios path=filename later, when the bug is fixed.

2. Network Share

This is probably the most common way of sharing files between a host and its guests. The host sets up a file-sharing service, and the guest connects to it to access the files.

Pros: Changes on the host are reflected in the guest immediately, although this might not be important for static secrets.
Cons: The server on the host needs to authenticate clients (the VMs) to ensure one VM cannot access another's secrets. The guest VM also needs to set proper file permissions internally.

Common options include:

QEMU's built-in SMB server: Easy to set up, but the guest share is accessible to everyone in the guest. A non-root user can access the content using userspace tools.
SMB Server: For each VM, you must create a dedicated user/password pair. This password must then be passed securely to the VM (using another method).
NFS Server: NFS authenticates clients by IP address and trusts the client's root user. This is risky because a compromised VM could spoof its IP address. An extra authentication layer, like WireGuard, might be necessary.
SSHFS: Each VM needs a dedicated SSH key pair stored securely. The host can use a standard SSH server configuration.

While these options are viable, I find them less than ideal. They require significant effort to set up correctly, and the servers add extra maintenance overhead. Furthermore, SSHFS appears to be no longer actively maintained.

3. Filesystem Passthrough

This method is similar to a network share but is optimized for VM environments. Instead of the network stack, it uses a more direct channel to expose a host filesystem to the guest, which is generally faster.

Common options are:

9pfs: Easy to set up, but my guest OS (centos-bootc:stream9) lacks the necessary kernel support, and I prefer not to compile custom kernels.
virtiofsd: This is a popular and high-performance method, but it requires shared memory between the host and guest, plus an extra daemon running on the host.

Unfortunately, neither of these options worked for my specific setup.

4. Credential Fetcher

With this approach, the guest fetches credentials as needed, typically during boot. This can be implemented easily in the guest, for example, by using scp to copy secrets into a ramfs mount.

However, this requires setting up a server (e.g. SSH) on the host, which I wanted to avoid due to the added complexity and maintenance.

5. Embed in Container Image

It's possible to generate and embed secrets directly into the bootc container image during the build process.

An interesting variation is to encrypt the secrets, embed the encrypted data in the image, and pass the decryption key to the VM using another method (like an OEM string). You could use systemd-creds for this and even bind decryption to a virtual TPM simulated by QEMU. However, as noted in this discussion, this might not be the intended use case for systemd-creds.

While this works in theory, it doesn't offer significant benefits for my workflow and feels tedious to implement.

6. Disk Images

Any file can be attached to the VM as a raw disk image; the guest then reads the data directly from the block device, without mounting a filesystem. Note that there may be issues if the file size is not aligned with the block size.

A more practical approach is to create a proper disk image containing the secrets and attach it to the VM. Any filesystem supported by the guest OS will work, but the ideal choice would have these characteristics:

The disk image can be created without root privileges on the host.
The filesystem is optimized for read-only data.

I found three practical options:

EROFS: A modern, read-only filesystem that supports volume labels.
SquashFS: mkfs.squashfs is very flexible; its "pseudo file" feature lets you create files directly from command output.
ISO 9660 (CD/DVD images): Universally supported but less flexible.

I plan to use EROFS mainly because it supports volume labels. I cannot guarantee that the disk order will be consistent across all VMs. Therefore, the guest needs a reliable way to identify the secrets disk, and mounting by label is the easiest solution.

Conclusion

After evaluating the options, I settled on the following combination:

OEM strings for small, individual secrets (like a decryption key).
Read-only EROFS disk images for larger sets of secret files.
QEMU's built-in SMB server for sharing encrypted data blobs, where the decryption key is passed separately via an OEM string.

Keep in mind that this solution is tailored to my specific use case, which has the following constraints:

The guest VMs are not fully trusted.
I want to minimize setting up and maintaining complex services on the host.
Automating the setup via scripting before a VM starts is acceptable and even preferable.
The solution must scale easily to many different VMs.

Backing Up VM Disk Images

2025-08-01T23:23:00.002+02:00

[Update] I managed to work out the AppArmor profile, and decided to go with guestmount for now.

I am setting up a maintenance pipeline for my virtual machines.

The pipeline has two main routines:

The BACKUP routine: Every day, this routine shuts down each VM, backs up the data in /var, updates the VM's disk image if a new version is available, and then restarts it.
The BUILD routine: Every week, this routine uses a special builder VM to create new disk images for all VMs.

There is a scheduling conflict with the builder VM: the BACKUP routine needs to shut it down, while the BUILD routine needs it running. To resolve this, I merged both into a single set of systemd services that runs daily. The BUILD routine starts automatically when the builder VM starts, at the end of the BACKUP routine. The builder VM's systemd unit has an ExecCondition= check, which causes the unit to be skipped 6 days a week.
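One way such a weekly condition could be written is a small day-of-week check. This is a sketch that assumes Sunday as the build day and relies on GNU date; the function name is made up.

```shell
#!/bin/sh
# Sketch: succeed only on the weekly build day (Sunday here, an assumption).
# An optional date argument makes the check easy to try out; needs GNU date.
is_build_day() {
    [ "$(date -d "${1:-today}" +%u)" -eq 7 ]
}

if is_build_day; then
    echo "build day: start the builder VM"
else
    echo "not a build day: skip"
fi

# Inside a unit file the same test could be inlined; note that "%" must be
# doubled there because systemd expands specifiers:
#   ExecCondition=/bin/sh -c '[ "$(date +%%u)" -eq 7 ]'
```

A failing check exits with status 1, which systemd treats as "skip the unit" rather than as a failure.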

Surprisingly, the most difficult part of this pipeline was not the scheduling, but the backup process itself.

There are two general approaches to backing up a VM: backing up the entire disk image or backing up the files directly from the filesystem.

Backing Up Disk Images

Backing up a disk image is straightforward because disk images are just regular files. However, this method is often inefficient:

Deduplication may not work well if the disk image is compressed.
Incremental backups are not natively supported.
The entire disk image must be backed up, including unused and deleted data.
You cannot easily choose which specific files to include or exclude from the backup.

There are some ways to improve this approach:

For deduplication: I can decompress the disk image and pipe the data stream directly to the backup software without saving the decompressed file.
For incremental backups: I can create snapshots and back up only the differences. I would also need to regularly merge the snapshots.

QEMU supports incremental backup for a running VM. Relevant articles: 1, 2.

To reduce backup size: I can defragment, shrink, wipe, and sparsify the disk image (for example, with virt-sparsify) before backing it up.
To exclude specific files: I can put the files or directories that I don't want to back up on a separate disk image.

Backing Up Filesystems

One can back up files from inside the VM. It is also possible to mount the disk image from the host system when the VM is shut down.

Raw disk images can be mounted directly using loop devices.

Qcow2 images have several options:

`qemu-nbd` can expose an image as a block device (e.g., /dev/nbd*), which can then be mounted. This requires the nbd kernel module and root access. The qemu-nbd man page warns that this may not be suitable for untrusted guests. To back up multiple VMs, I would also need a way to find an available /dev/nbd* device.

The block device can also be exported without root, using tools like `qemu-nbd`, `nbdfuse` and `qemu-storage-daemon`. This article is worth reading.
`qemu-nbd` also supports exposing an internal snapshot of a qcow2 image.

`guestmount` can mount a disk image without needing root. It uses QEMU and FUSE. While it works, it can be slow, and creating a secure AppArmor profile for it is difficult due to its complexity.

My Thoughts

All of these options have trade-offs between security, performance, complexity, and flexibility.

In my case, my priorities are:

Security: I do not trust the guest VMs. This means the guest should not connect to the host, and the host should not load the guest's disk image using a kernel module. While I could move the backup logic to a separate VM, this would add a lot of complexity.
Simplicity: I want a simple workflow that is easy to maintain and secure. I prefer to avoid writing complicated AppArmor profiles that are difficult to update.
Performance: I don't need the backup to be super fast, but I also don't want to keep a VM shut down for too long.
Space Efficiency: I don't have a large amount of data, so disk space is not a major concern.

Considering these factors, I prefer backing up the entire disk image. Recall that my root filesystem is created from a Containerfile, and /etc is transient, so I primarily need to back up /var.

For now, I plan to use qcow2 images with compression. I will decompress the disk image and pipe the data to the backup software.

In the future, I might explore some optimizations:

Using a raw disk image on a ZFS host filesystem with compression and possibly deduplication enabled.
Taking a snapshot (either qcow2 or ZFS) and backing it up. This would allow me to restart the VM without waiting for the backup to finish.

GNU Stow

2025-07-28T22:00:00.001+02:00

Just learned about GNU Stow, a tool for managing symlink farms.

The basic idea is to store all files in one place, then create symlinks across the system that point to them.

There are various use cases, like dot files and installing/uninstalling packages. But I mostly use it for tracking system config files, similar to how NixOS works. In fact, I had written my own scripts around "cp -rs", but GNU Stow works much better.

Disposable VMs for Home Lab Security and Reproducibility

2025-07-27T20:54:00.010+02:00

Today, various services (native, LXC, Docker) are running on my server. I'm mostly happy with the setup, but I decided to revisit my server's defenses under the assumption that a remote attacker or malicious code could compromise my services. A service might break out of its container or even gain root privilege.

VMs are a better security boundary than containers; they can limit the damage if an attacker gains root privilege. I cannot afford to run a dedicated VM for each service, so I will need to carefully group the services and run a dedicated VM for each group. Each group should be carefully designed based on the data accessed and the features/capabilities required. For example, some VMs may have access to my photos, while others may not have network access.

The Goal

There are two particular issues I want to address:

First, I want VM images to be easily reproducible, which makes backup and restore trivial. NixOS and GNU Guix System are great examples, where you only need to back up the configuration file. However, I don't really like them because of their domain-specific language/design.

Second, I want to seal the system as much as possible. Even a compromised root user inside a VM should not be able to permanently infect the VM. Many so-called "immutable" Linux distributions are not truly 100% immutable. Often, they just mean a read-only /usr. Some can be easily broken via `mount -o remount,rw`, and most of them allow self-upgrade, meaning a malicious root user can still inject code via "upgrade and reboot."

The Approach

I use bootc containers. This allows me to build the whole system with standard scripts, and it offers the standard "immutability."

Furthermore, I run QEMU with `--no-reboot --snapshot`, which means the system cannot update itself even with root privilege.

Lastly, I'll regularly build new images and restart the VM to pick up the latest security fixes.

This approach is essentially managing VMs like containers. It's not a new idea; frood and gokrazy are good examples of this principle.

On a side note, I also plan to learn more about KubeVirt and Nix VMs. Especially, I like the idea (from NixOS) that the guest can directly use the store from the host.

Notes about QEMU

Permanent machine-local data is stored in /var, which is put into a separate disk image.

Secrets are sent to QEMU via systemd credentials.

I tried virtiofsd, but didn't like it. I ended up with Samba anyway. Maybe I'll revisit virtiofsd later.

To shut down the VM (e.g., via systemd), I created a special admin user with special privilege defined in the sudoers file, so that I can run `ssh admin@vm sudo poweroff`. The SSH key pair is regenerated before each VM boot. Related: In a systemd unit, ExecStop= does not have access to LoadCredential.

I use `-chardev socket,logfile=...` and `-serial` so that the systemd logs are not filled with console output, and I can view or attach to the serial console later.
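For illustration, the two options could be combined roughly as follows. The chardev id console0 and both paths are hypothetical, and the elided parts of the real command line are not reproduced here.

```shell
#!/bin/sh
# Sketch: QEMU arguments that expose the guest serial console on a Unix socket
# and also log it to a file. The chardev id and paths are hypothetical.
serial_console_args() {
    sock=$1 log=$2
    printf '%s\n' \
        -chardev "socket,id=console0,path=$sock,server=on,wait=off,logfile=$log" \
        -serial chardev:console0
}

serial_console_args /run/vm/chimera/console.sock /var/log/vm/chimera/console.log
```

With wait=off, QEMU starts without waiting for a client. A tool such as socat can later attach to the socket for an interactive console.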

I plan to learn more about virtio-balloon and pmem later.

Conclusion

I find it very beneficial to deploy VMs. It allows me to shrink and harden the host OS (e.g., disable unprivileged user namespaces), and it allows me to design fine-grained access control.

Next, I'll start investigating how to organize the containers inside VMs.
