<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="4.4.1">Jekyll</generator><link href="https://blog.thelifeofkenneth.com/feeds/posts/default" rel="self" type="application/atom+xml" /><link href="https://blog.thelifeofkenneth.com/" rel="alternate" type="text/html" /><updated>2026-01-18T12:06:00-08:00</updated><id>https://blog.thelifeofkenneth.com/feeds/posts/default</id><title type="html">The Life of Kenneth</title><subtitle>The thoughts and projects of an engineer who likes to convert solder, firmware, and wire into all sorts of different forms of entertainment.
</subtitle><author><name>Kenneth Finnegan</name><email>kennethfinnegan2007@gmail.com</email></author><entry><title type="html">Installing AlmaLinux Over the Network With No Hands</title><link href="https://blog.thelifeofkenneth.com/2026/01/almalinux-pxe-zerotouh.html" rel="alternate" type="text/html" title="Installing AlmaLinux Over the Network With No Hands" /><published>2026-01-18T00:00:00-08:00</published><updated>2026-01-18T00:00:00-08:00</updated><id>https://blog.thelifeofkenneth.com/2026/01/almalinux-pxe-zerotouch</id><content type="html" xml:base="https://blog.thelifeofkenneth.com/2026/01/almalinux-pxe-zerotouh.html"><![CDATA[<p><img src="/2026/01/almapxe.jpg" alt="PXE booting on my kitchen table" /></p>

<p>As part of a broader project, I’ve been working on setting up the infrastructure to be able to power on new servers, have them boot over the network to download and run the complete Linux installation process, and then reboot into the freshly provisioned operating system with zero human interaction.
This ability to fully reinstall the entire OS on any new (or replaced) hardware is a big win for serviceability when you want to be able to freely add more servers, swap servers, wipe servers and reprovision them, or replace failed hard drives in a server and have it fully recover autonomously.
I originally solved this problem a few years ago as part of the MicroMirror project, where I shipped each free software download mirror appliance with a recovery ROM on a USB flash drive plugged into the internal USB port on the motherboard of each system (which I still need to document). But this is a partially different problem than what I am trying to solve here, which is booting not off a local USB drive but completely over the network, using nothing but the firmware and wherewithal baked into the hardware itself.</p>

<p>Zero touch installation of Linux always feels like a sort of holy grail achievement to me; the problem space is overwhelmingly large, and there is frustratingly little documentation that’s useful or up to date.
So I’m not going to solve the problem here that hands-off Linux installation seems like a closely protected dark art, but I <em>am</em> going to write down my opinionated set of steps for this one specific project setup in the year 2026. This article will very likely not be specifically useful to the reader’s objective, or shelf stable enough to be useful to you in the future, but this is what I did.</p>

<p>As part of this large problem surface area, I’ve had to make numerous assumptions here about the reader and the environment to keep this article shorter than an entire textbook on the matter of pre-OS automation:</p>

<ul>
  <li>
<p>The reader is vaguely familiar with the firmware to bootloader to OS chain of control when you power on a computer.
Things like “opening the boot menu and selecting the UEFI PXE boot option” and setting up DHCP, TFTP, and HTTP servers are assumed to be in the reader’s wheelhouse or solvable using documentation elsewhere on the Internet.</p>
  </li>
  <li>
    <p>We only care about UEFI PXE booting over IPv4.
Legacy PXE booting can be essentially the same, with plenty of caveats about the limitations of fewer supported features and broader history of vendors just doing whatever they like during the pre-OS stage.
IPv6 is a hellscape of the IETF’s own making simply because they made DHCP… optional?</p>
  </li>
  <li>
    <p>I used TFTP for the bootloader stage not because it’s what I ultimately want to be using to serve the iPXE boot image to the hosts (HTTP), but because this was the most expedient way to offer a network boot option to hosts using the branded OpenWRT travel router on my kitchen table.</p>
  </li>
  <li>
    <p>For the proof of concept I have everything wide open, so anyone with access to the network would be able to access the kickstart files for each host.
This seems fine to me (I don’t see anything particularly sensitive about my password hashes or what drive partitions I’m using?), but I suspect I’ll ultimately lock down the provisioning process using IP ACLs.
There are ideal ways to lock down this sort of process using a certificate authority and your own whole PKI to authenticate each client to the provisioning server and authenticate the provisioning server back to the clients, and this is all totally possible because I’ve seen it done, but life is short and this is a proof of concept.</p>
  </li>
  <li>
<p>We’re using <code class="language-plaintext highlighter-rouge">iPXE</code> instead of the more traditional <code class="language-plaintext highlighter-rouge">pxelinux.0</code> chainloader to bridge from the PXE ROM to the Linux init environment because it felt slightly easier to bake a fixed boot option into iPXE, but honestly a bigger part of the decision was that the SYSLINUX website seemed to be down the evening I first sat down to build this.
Both are valid solutions that vary only in pretty much every detail possible.</p>
  </li>
</ul>

<p>So with that rough start to what is needed, let’s get into it.</p>

<h2 id="high-level-sequence-of-events">High Level Sequence of Events</h2>

<p>We’re looking to network boot a host here and install AlmaLinux on it without any human interaction.
This is going to take several steps.</p>

<ol>
  <li>
    <p>The host powers on, and selects the UEFI PXE ROM on the NIC as its boot device.
I rely on this happening most of the time by either having a blank hard drive installed with no OS on it, or using the <code class="language-plaintext highlighter-rouge">efibootmgr -n</code> command to set a one time boot option override for the next reboot.</p>
  </li>
  <li>
    <p>The PXE ROM on the NIC sends out a DHCP request, and the answer back from the DHCP server includes a next server address and filename (of our custom iPXE build) to download from that next server over TFTP.
Modern UEFI PXE ROMs are much more flexible than traditional ROMs: they’re not limited to just options 66/67, and they’re not limited to just TFTP, so it’s possible to give the client a full HTTP URL to download the bootloader from instead of the classic TFTP transport.</p>
  </li>
  <li>
    <p>The PXE ROM downloads the iPXE image over TFTP from the next-server and runs it.</p>
  </li>
  <li>
<p>When iPXE runs, it sees that it has a custom ipxe script embedded in the binary, which tells it to download the AlmaLinux PXE kernel and initrd image.
iPXE also sends out a DHCP request to get an IP address, gateway, DNS server, etc to use for fetching assets and handing to the installer.
The script finally generates a custom kickstart URL with the host’s MAC address in it, so you can use the same iPXE binary on multiple hosts and serve each of them a unique kickstart file to provision the host.</p>
  </li>
  <li>
<p>iPXE boots the downloaded kernel, passing it kernel parameters with all the IP addressing information iPXE got from DHCP, a URL to fetch the installation kickstart file from, and an AlmaLinux repository to use to download the installer disk image from.</p>
  </li>
  <li>
<p>The AlmaLinux PXE kernel and initrd give the host enough of an environment to download the install IMG file for running the Anaconda installer from the repository passed in from iPXE.
Once Anaconda runs, it fetches the kickstart file and follows all of the parameters in that file to perform the OS install to a local disk without asking the user any questions.</p>
  </li>
  <li>
    <p>The kickstart file calls out where to fetch all of the needed RPM packages for the install, what collection of software to install (i.e. Minimal vs Server vs a full desktop Workstation install), how to partition and mount the hard drive, usernames and passwords, and finally to reboot the system when done with the install.</p>
  </li>
  <li>
    <p>The host selects booting from the local disk over using the PXE ROM again and boots from the fresh AlmaLinux install on the local disk.</p>
  </li>
</ol>
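<p>The <code class="language-plaintext highlighter-rouge">efibootmgr</code> one-shot override from step 1 can be issued from the running OS right before a reinstall. A sketch, where the entry number <code class="language-plaintext highlighter-rouge">0003</code> is an assumption; list your own entries first:</p>

```
# List the boot entries to find the number of the NIC's UEFI PXE option
sudo efibootmgr
# Set BootNext so only the *next* boot uses PXE, then reboot
sudo efibootmgr -n 0003
sudo reboot
```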

<h2 id="pxe-network-environment">PXE Network Environment</h2>

<p>The main moving pieces on the network for enabling PXE boot are a DHCP server to offer options 66/67 to the client, a TFTP server to serve the iPXE binary, and an HTTP server to serve the kickstart file.</p>

<p>The easiest solution that I reach for when setting up the proof of concept here is any router running OpenWRT.
The advanced settings under DHCP allow you to enable TFTP on the router and specify a filename, so I ssh into the router, create a <code class="language-plaintext highlighter-rouge">/root/tftp/</code> folder, and copy the iPXE binary into that folder.</p>
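<p>Under the hood those OpenWRT settings are just dnsmasq options; the equivalent fragment in <code class="language-plaintext highlighter-rouge">/etc/config/dhcp</code> looks roughly like this (a sketch; the filename and folder are the ones I chose above, not defaults):</p>

```
config dnsmasq
	option enable_tftp '1'
	option tftp_root '/root/tftp'
	option dhcp_boot 'ipxe.efi'
```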

<p>For serving the kickstart files, I just throw them in the <code class="language-plaintext highlighter-rouge">/www/</code> folder on the router, and have the embedded ipxe script pointed at the router’s web server for the kickstart file.</p>

<h2 id="building-ipxe-with-an-embedded-script">Building iPXE with an embedded script</h2>

<p>iPXE is an open source implementation of a PXE boot ROM: a tiny program with enough smarts to network boot a system.
If you want to go absolutely ham on iPXE, there are ways to compile it for specific network cards and burn the program into the flash of the NIC itself, but most modern systems ship with pretty effective PXE boot ROMs (or the ability to boot off a USB flash drive), so it tends to be easier to leave the hardware itself alone and have the host boot iPXE from an external source to then boot something else.</p>

<p>The build dependencies for iPXE are pretty basic (GCC, make, etc) and we just need to download the source code, copy in our custom ipxe boot script, and build it specifically for the x86 UEFI target to get the binary we need.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>kenneth@kwflap3:~<span class="nv">$ </span>git clone https://github.com/ipxe/ipxe.git
Cloning into <span class="s1">'ipxe'</span>...
remote: Enumerating objects: 68187, <span class="k">done</span><span class="nb">.</span>
remote: Counting objects: 100% <span class="o">(</span>210/210<span class="o">)</span>, <span class="k">done</span><span class="nb">.</span>
remote: Compressing objects: 100% <span class="o">(</span>128/128<span class="o">)</span>, <span class="k">done</span><span class="nb">.</span>
remote: Total 68187 <span class="o">(</span>delta 119<span class="o">)</span>, reused 85 <span class="o">(</span>delta 82<span class="o">)</span>, pack-reused 67977 <span class="o">(</span>from 3<span class="o">)</span>
Receiving objects: 100% <span class="o">(</span>68187/68187<span class="o">)</span>, 22.69 MiB | 13.85 MiB/s, <span class="k">done</span><span class="nb">.</span>
Resolving deltas: 100% <span class="o">(</span>51298/51298<span class="o">)</span>, <span class="k">done</span><span class="nb">.</span>
kenneth@kwflap3:~<span class="nv">$ </span><span class="nb">cd </span>ipxe/src/
kenneth@kwflap3:~/ipxe/src<span class="nv">$ </span>vim almascript.ipxe
kenneth@kwflap3:~/ipxe/src<span class="nv">$ </span>make bin-x86_64-efi/ipxe.efi <span class="nv">EMBED</span><span class="o">=</span>almascript.ipxe
  <span class="o">[</span>DEPS] crypto/certstore.c
  <span class="o">[</span>DEPS] crypto/privkey.c
  <span class="o">[</span>BUILD] bin-x86_64-efi/__divdi3.o
  <span class="o">[</span>BUILD] bin-x86_64-efi/__divmoddi4.o
<span class="o">[</span>... SNIP lots of building ...]
  <span class="o">[</span>LD] bin-x86_64-efi/ipxe.efi.tmp
  <span class="o">[</span>FINISH] bin-x86_64-efi/ipxe.efi
<span class="nb">rm </span>bin-x86_64-efi/version.ipxe.efi.o
kenneth@kwflap3:~/ipxe/src<span class="nv">$ </span><span class="nb">ls</span> <span class="nt">-l</span> bin-x86_64-efi/ipxe.efi
<span class="nt">-rw-r--r--</span><span class="nb">.</span> 1 kenneth kenneth 1111040 Jan 17 18:21 bin-x86_64-efi/ipxe.efi
</code></pre></div></div>

<p>The special config in the <code class="language-plaintext highlighter-rouge">almascript.ipxe</code> is what precludes us from just using the canned builds of iPXE, since it was easier to bake this into the binary than trying to convey the information otherwise.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#!ipxe

echo Zero Touch Alma Install!

# Have iPXE enable the boot network interface and request a DHCP lease
dhcp

# Copy over all the addressing info from iPXE's DHCP lease for the installer
set ipparam BOOTIF=${netX/mac} ip=${ip}::${gateway}:${netmask}:::none nameserver=${dns}
# Pick an Alma Repository for the init environment to download the installer from
set repo http://repo.almalinux.org/almalinux/9/BaseOS/x86_64/os/
# Set the install mode (which gets overridden by the kickstart file I guess)
set install_mode inst.graphical
# Point the installer at a kickstart URL that's composed from the MAC address of the boot interface
# This depends on where you're hosting the kickstart files; netX/mac gets replaced with each NIC's MAC
set ks_url http://10.33.1.1/kickstart-${netX/mac}.cfg

imgfree
kernel http://repo.almalinux.org/almalinux/9/BaseOS/x86_64/os/images/pxeboot/vmlinuz inst.repo=${repo} ${install_mode} inst.ks=${ks_url} ${ipparam} 
initrd http://repo.almalinux.org/almalinux/9/BaseOS/x86_64/os/images/pxeboot/initrd.img

# Send it!
boot
</code></pre></div></div>

<p>So this yields a 1MB iPXE binary with the logic baked into it to download the latest AlmaLinux 9 kernel and run it with a URL pointing to a kickstart file specific to each host’s MAC, so you can reuse the same iPXE binary for all your hosts and then just have a folder of different kickstarts for configuring each host differently.</p>
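<p>For generating that folder of kickstarts ahead of time, it helps to normalize each host’s MAC into the same form iPXE substitutes for <code class="language-plaintext highlighter-rouge">${netX/mac}</code> (lowercase and colon-separated, as far as I’ve seen; verify against your own hosts). A hypothetical little helper:</p>

```shell
#!/bin/sh
# ks_name: given a MAC address in any common notation, print the
# kickstart filename the embedded iPXE script will request for that
# host. (Lowercase colon-separated matches iPXE's ${netX/mac} as far
# as I can tell; double-check on your own hardware.)
ks_name() {
    mac=$(echo "$1" | tr 'A-F' 'a-f' | tr -d ':-' | sed 's/../&:/g; s/:$//')
    echo "kickstart-${mac}.cfg"
}

ks_name "B8-59-9F-3E-50-16"   # -> kickstart-b8:59:9f:3e:50:16.cfg
```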

<h2 id="kickstart-file">Kickstart File</h2>

<p>Once the host has PXE’d the ipxe binary, which has fetched the stock AlmaLinux installation environment and passed it a bespoke kickstart URL, it’s up to that kickstart file to actually configure the host to suit the project’s needs.
This is the part that can sink the largest amounts of time, trying to get the kickstart file to actually execute all the desired setup steps.
Even I’ve punted on some of the prerequisites before; I’ve been meaning to add support for setting up passwordless sudo to my kickstart files to make for a seamless handoff between the kickstart and the Ansible playbooks which come to follow.</p>

<p>One of the most useful sources of information on what goes into a kickstart file is the fact that when you click through the installer GUI, the results of your choices get written to <code class="language-plaintext highlighter-rouge">/root/anaconda-ks.cfg</code> on every installed system, so that can give you some good tidbits to start with.
I’ve also spent a lot of time reading the <a href="https://docs.redhat.com/en/documentation/red_hat_enterprise_linux/9/html/automatically_installing_rhel/kickstart-commands-and-options-reference_rhel-installer">RHEL 9 chapter 22</a> manual which documents each individual parameter with not a great amount of context but at least all of the syntax.</p>
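<p>One time saver worth knowing about: the <code class="language-plaintext highlighter-rouge">pykickstart</code> package (<code class="language-plaintext highlighter-rouge">sudo dnf install pykickstart</code>) ships a <code class="language-plaintext highlighter-rouge">ksvalidator</code> tool that catches kickstart syntax errors offline instead of twenty minutes into a PXE boot cycle. The filename here is just an example:</p>

```
ksvalidator kickstart-b8:59:9f:3e:50:16.cfg
```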

<p>Despite the original photo showing my initial work happening with a little Lenovo 1L PC, the final target of this deployment is a fleet of Dell R220s, so here is the example of what the kickstart looks like that I’m using:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>text
repo --name="AppStream" --baseurl=http://mirror.fcix.net/almalinux/9/AppStream/x86_64/os

%addon com_redhat_kdump --enable --reserve-mb='auto'

%end

# Keyboard layouts
keyboard --xlayouts='us'
# System language
lang en_US.UTF-8

# Network information specific to this host
network  --bootproto=static --device=wan0 --ip=192.0.2.2 --netmask=255.255.255.0 --gateway=192.0.2.1 --ipv6=2001:db8::103/64 --ipv6gateway=2001:db8::1 --nameserver=8.8.8.8,2606:4700:4700::1001,1.1.1.1 --activate --bondopts=mode=802.3ad,lacp_rate=fast,miimon=100,xmit_hash_policy=layer3+4 --bondslaves=enp1s0f0,enp1s0f1,eno1,eno2
network  --hostname=ztphostname

# Use network installation
url --url="http://mirror.fcix.net/almalinux/9/BaseOS/x86_64/os/"

%packages
@^minimal-environment

%end

# Run the Setup Agent on first boot
firstboot --enable

# Partition the specific drive for boot
ignoredisk --only-use=/dev/disk/by-path/pci-0000:00:1f.2-ata-1
# Partition clearing information
clearpart --all --initlabel --drives=/dev/disk/by-path/pci-0000:00:1f.2-ata-1
# Disk partitioning information
part swap --fstype="swap" --ondisk=/dev/disk/by-path/pci-0000:00:1f.2-ata-1 --size=2048
part /boot --fstype="xfs" --ondisk=/dev/disk/by-path/pci-0000:00:1f.2-ata-1 --size=1024
part /boot/efi --fstype="efi" --ondisk=/dev/disk/by-path/pci-0000:00:1f.2-ata-1 --size=600 --fsoptions="umask=0077,shortname=winnt"
part / --fstype="xfs" --ondisk=/dev/disk/by-path/pci-0000:00:1f.2-ata-1 --grow


# System timezone
timezone America/Los_Angeles --utc

#Root password
rootpw --lock
user --groups=wheel --name=kenneth --password=$6$coFqYm7F55j92I/r$fNsiXZALqT3aiuwsheNoBfKUiqSjmDgwMJVDQJmPxuBZD//0t.CQilyPyFsf7YRvuTN/wYL0ZQZS7ujoMcpbw. --iscrypted --gecos="kenneth"
sshkey --username=kenneth "ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIAQ1MiF5sxqA+wsJStuzB0RyP2ZTw+Zej7y4DKygyG18 kenneth@node1"

reboot
</code></pre></div></div>

<p>The most notable parts of this kickstart that will be specific to each individual host are the network configuration, where I identified the interface names and configured them into a link aggregation group with IP addresses assigned to it, and the storage configuration, where I identified where the local drive’s SATA port is located in the PCI hierarchy and use that path to specifically refer to the hard drive I want the OS installed on.</p>
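<p>Those per-host values came from poking at the box from a live environment before writing its kickstart; roughly (the output will obviously differ on your hardware):</p>

```
# Stable by-path names for each disk, so the kickstart survives
# /dev/sdX names reshuffling between boots
ls -l /dev/disk/by-path/
# Interface names and link state for the bondslaves list
ip -br link show
```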

<p>I will just leave you with the comment that dialing in kickstart files is an exercise in patience.
You will not get it right on the first try, and you will have to handle the particulars of each different machine, so the appeal of buying <em>exactly</em> the same hardware for every host will readily become apparent.</p>]]></content><author><name>Kenneth Finnegan</name></author><summary type="html"><![CDATA[]]></summary></entry><entry><title type="html">Poking at the connectX-4 Firmware</title><link href="https://blog.thelifeofkenneth.com/2024/12/connectx-4121c-firmware.html" rel="alternate" type="text/html" title="Poking at the connectX-4 Firmware" /><published>2024-12-29T00:00:00-08:00</published><updated>2024-12-29T00:00:00-08:00</updated><id>https://blog.thelifeofkenneth.com/2024/12/connectx-4-firmware-patch</id><content type="html" xml:base="https://blog.thelifeofkenneth.com/2024/12/connectx-4121c-firmware.html"><![CDATA[<p>We’re coming up on three years now that [John, Warthog9] and I have been running the MicroMirror project, and I decided that it was about time to start working on the next generation of the MicroMirror appliance hardware design.
Currently, our main appliance consists of an HP T620 thin client with a 2TB M.2 SATA SSD in it, but due to a lot of feedback from networks that they <em>didn’t</em> have any way to plug a 1GbaseT server into their network, we pivoted to using the HP T620+ thin client with a ConnectX-311A-XCAT NIC in the PCIe slot.
So we now ship these thicker thin clients with a 10G-LR optic installed in the SFP+ cage on the server, but hosts are free to replace that optic with SR or a DAC if their specific needs dictate something other than the 10G-LR default.</p>

<p>So as part of designing the next generation of our super tiny web servers, I wanted to try and qualify a solution to another piece of feedback we got from a few networks:</p>

<blockquote>
  <p>Can we have two SFP+ cages?
We run MLAG + VARP on our routers and want to be able to reboot one of them without impacting the server.</p>
</blockquote>

<p>Dual cage NICs have gotten significantly cheaper than they were three years ago when we locked in the bill of materials for the T620+ platform, so this is within the realm of possibilities for us.
The T620+ has a x4 gen3 PCIe slot, which only affords us 16Gbps, so we wouldn’t be able to run a 2x10G port channel at full speed; but 16Gbps is still larger than 10Gbps, and it’s <em>extremely</em> rare that the existing 10G MicroMirrors even kiss 100% NIC utilization for a few seconds. We are also looking at new server options with better IO capabilities than the T620+, but we’ll limit this post to talking about the NIC.</p>

<p>Not only have dual port NICs gotten cheaper in the last 3 years, but ConnectX-4 based NICs have started flowing onto the secondhand market in volume and become affordable, so we might be interested in moving beyond the ConnectX-3 generation of PCIe NICs!
Where the ConnectX-3 is based around a 10G serdes (so you can get Nx10G or Nx40G variants of the ConnectX-3 NICs), the ConnectX-4 is able to operate at 25G per lane.
The ConnectX-4 Lx variant of the ASIC is a cost reduced NIC that only has a 50Gbps packet pipeline inside of it, which means that it can support 2x25G SFP28 ports (as well as a single 40G QSFP+ port, or a “depopulated” 50G QSFP28 port where only two of the four lanes of a 100G port are electrically active).</p>

<p>So where a 2x10G CX3 NIC can be had on eBay for around $24, the incremental cost to snagging a 2x25G CX4-Lx for around $26 makes it an interesting idea.
Unfortunately, 25G is an extremely awkward Ethernet speed between 10G and 100G in that the IEEE originally didn’t want it, so a separate industry consortium originally defined it, but when the IEEE finally caved and ratified it they standardized 802.3by using a <em>different</em> forward error correcting code than the consortium did, so you have some Ethernet silicon out there in the wild that only supports the earlier “firecode” or “baser” FEC, where IEEE expects you to be using “Reed-Solomon 544/514” FEC.
Ideally, the switch and the NIC should be able to agree on which of the two FEC options to use based on their clause 73 autonegotiation, but in reality many platforms don’t correctly implement clause 73 autoneg, and you’re left manually hardcoding the FEC mode on one or both sides if you don’t get lucky with the defaults matching.
Namely, the default on Arista EOS gear is reed-solomon, while the default for ConnectX-4 seems to be firecode, so knobs need to be turned to make 25G work. HUMPH!</p>

<h1 id="what-does-any-of-this-have-to-do-with-firmware">What Does Any of This Have to Do with Firmware?</h1>

<p>Right! So back to my point.
So I bought one of these Dell branded ConnectX-4121C 2x25G NICs on eBay for $26, and it arrived and I started playing with it.
In the process of playing with it, I have fallen down the rabbit hole of connectX firmware, and figured I should write it down so I don’t forget it.</p>

<p><img src="/2024/12/dellcx4121c.jpg" alt="Dell ConnectX-4121C NIC" /></p>

<p>One lesson I’ve learned from the MicroMirror project is to always update the firmware on secondhand hardware which we’re trying to deploy as critical load-bearing Internet infrastructure. Many of the CX3 NICs have arrived with very old firmware, and Mellanox/Nvidia have been kind enough to fix a whole bunch of performance and stability issues in older firmwares, so we <em>have</em> seen tangible performance improvements in the past from updating the firmware on our NICs.
On AlmaLinux 9, we can simply install the open source Mellanox/Nvidia firmware tools with <code class="language-plaintext highlighter-rouge">sudo dnf install mstflint pciutils</code> and using <code class="language-plaintext highlighter-rouge">lspci</code> to figure out the PCIe bus location of the NIC, query it:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[kenneth@nicbringup ~]$ lspci | grep Mellanox
01:00.0 Ethernet controller: Mellanox Technologies MT27710 Family [ConnectX-4 Lx]
01:00.1 Ethernet controller: Mellanox Technologies MT27710 Family [ConnectX-4 Lx]
[kenneth@nicbringup ~]$ sudo mstflint -d 01:00.0 query
Image type:            FS3
FW Version:            14.32.2004
FW Release Date:       13.1.2022
Product Version:       14.32.2004
Rom Info:              type=UEFI version=14.25.18 cpu=AMD64
                       type=PXE version=3.6.502 cpu=AMD64
Description:           UID                GuidsNumber
Base GUID:             b8599f03003e5016        8
Base MAC:              b8599f3e5016            8
Image VSD:             N/A
Device VSD:            N/A
PSID:                  DEL2420110034
Security Attributes:   N/A
</code></pre></div></div>

<p>The important facts to pull out of that output are that this NIC has a PSID of <code class="language-plaintext highlighter-rouge">DEL2420110034</code>, which is what identifies the whole NIC as a PCBA and usable module in a computer, and the firmware version / release date of <code class="language-plaintext highlighter-rouge">14.32.2004</code> / <code class="language-plaintext highlighter-rouge">2022-1-13</code>.
Unfortunately, the DEL in the PSID means that this is a Dell branded NIC, so it isn’t an identical copy of the MCX4121A-ACAT product with the PSID of <code class="language-plaintext highlighter-rouge">MT_2420110034</code>, but is really close.
Many people report success in manually forcing mstflint to burn these NICs with the MCX4121A-ACAT firmware, where the NICs seemingly work fine, but they do report that the link LEDs are then permanently stuck on.
Unfortunately, there’s also several weird things about the firmware version and release date that this NIC came with:</p>

<ol>
  <li>14.32.2004 is a higher version and build number than I can find any reference to from Dell</li>
  <li>14.32.2004 is a higher version number than Nvidia claims is the latest and greatest</li>
    <li>2022-1-13 is a release date 2 years earlier than the latest 14.32.1900 firmware from Nvidia</li>
</ol>

<p>So it seems like Dell has stopped releasing firmware updates for their silo of ConnectX-4 Lx NICs, and exactly where this current firmware lands in the progression of Nvidia’s bug fixes is murky.
Being the “I like to understand the underlying mechanisms” type of guy I am, I started digging through everything I could find on Nvidia’s website, reading WAY too many ServeTheHome forum threads, a dash of homelabbers making fools of themselves on Reddit, and I think I’ve got it.
So, as is often the case, BUCKLE UP.</p>

<h1 id="what-makes-a-firmware">What Makes a Firmware?</h1>

<p>To meaningfully talk about what makes up a firmware image, we need to be completely clear on what a firmware image is and what it’s doing.
Firmware is the software and configuration burned to a flash chip on the NIC card; usually stored in a SPI flash chip soldered onto the PCBA of the NIC next to the ConnectX ASIC itself.
The ASIC loads the software off the flash chip to configure the ASIC, and a combination of that software running on the ASIC and the driver running up in the host operating system query the configuration stored in the flash chip for specifics like “what kind of NIC am I?” and “what is my MAC address?”, etc.</p>

<p>So while the ASIC software is going to be the same across all of the products using the ConnectX-4 Lx ASIC, the configuration is going to vary based on how each NIC product has wired up the ~50 GPIO pins on the ASIC (signals to/from the SFPs, LEDs, etc) and how the data lanes are exposed to the user (a single SFP port, multiple SFP ports, a QSFP port, etc), and furthermore the MAC address is going to vary across every individual part.</p>

<p>When you go download the BIN from <a href="https://network.nvidia.com/support/firmware/firmware-downloads/">Nvidia’s website</a> for a specific SKU and burn it to the NIC, the update process is smart enough to preserve the globally unique MAC address and GUID section of the flash, but overwrites the rest of the flash with the provided BIN.</p>
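<p>For reference, burning one of those downloaded BINs with <code class="language-plaintext highlighter-rouge">mstflint</code> looks roughly like this (the filename is a placeholder); mstflint compares the image’s PSID against the NIC’s and refuses a mismatched burn unless you explicitly override it:</p>

```
sudo mstflint -d 01:00.0 -i fw-ConnectX4Lx-rel-XX_XX_XXXX-MCX4121A-ACAT_Ax.bin burn
```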

<p>To recap, the firmware stored in the flash memory on the NIC consists of the following:</p>

<ol>
  <li>The ASIC software that runs on the NIC silicon</li>
  <li>The NIC configuration that describes how the rest of the NIC was built around the ASIC</li>
  <li>The globally unique MAC address / GUID section that you don’t want to change with a firmware update</li>
  <li>One or more boot ROMs, supporting BIOS vs UEFI and AMD64 vs ARM depending on what host you care about being able to boot over the network using this NIC</li>
</ol>

<p>Note that none of these are the drivers being used by the OS, so we’re not talking about the <code class="language-plaintext highlighter-rouge">mlx4</code> or <code class="language-plaintext highlighter-rouge">mlx5</code> kernel modules in Linux at all here.
The ASIC software is the other side of the conversation that the kernel driver is having with the attached hardware, so when you update the driver in your OS to fix something, it’s one rung higher than what we’re dealing with here and something that’s running on the host CPU itself, not the microcontroller inside the NIC.</p>

<p>Thanks mostly to a <a href="https://forums.servethehome.com/index.php?threads/more-custom-mellanox-firmware-tips.10722/">helpful thread on STH</a>, it looks like the technically correct answer for how to update the firmware on these Dell CX4121C NICs beyond what Dell has bothered to release themselves is to use the <code class="language-plaintext highlighter-rouge">mlxburn</code> tool from Nvidia to compile the latest ASIC software (which gets released as an MLX file) together with the hardware specific configuration file specific to the Dell hardware design (which you can read via <code class="language-plaintext highlighter-rouge">sudo mstflint -d [PCIe-Address] dc</code>), so the ASIC can have the latest software fixes while also being correctly aware of what is connected to the ASIC where.</p>

<p>Using Nvidia’s MCX4121A-ACAT / MT_2420110034 firmware on this NIC only mostly works because, by happy accident, Dell happened to keep the critical function-to-pin mappings like which serdes lanes go to each SFP28 port the same, while Dell has apparently used a different GPIO for their link up indicator than Nvidia did, which is why that LED is now stuck on after the update.</p>

<p>The obvious solution here is to just grab the MLX file for the latest 14.32.1900 firmware, compile it with Dell’s firmware configuration that spells out GPIO mappings, how many virtual functions to allocate to the NIC, high speed signal pathway tuning parameters specific to this PCB design, and we’re all set!</p>
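<p>In command form, that plan would look roughly like this; a sketch with filenames of my own invention, following the mlxburn usage described in that STH thread:</p>

```
# Dump this card's board-specific configuration section
sudo mstflint -d 01:00.0 dc > dell_cx4121c.ini
# Compile the ASIC software (.mlx) plus that config into a burnable image
mlxburn -fw fw-ConnectX4Lx.mlx -conf dell_cx4121c.ini -wrimage dell_custom.bin
# Burn the result back to the NIC
sudo mstflint -d 01:00.0 -i dell_custom.bin burn
```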

<h1 id="the-fatal-flaw-in-the-plan">The Fatal Flaw in the Plan</h1>

<p>Very sound and good plan: compile a new firmware using the latest MLX and Dell’s specific firmware config.
The only problem is that Nvidia has stopped publishing the MLX files for their ConnectX ASICs.
Presumably Dell gets access to the MLX files since they’re an OEM shipping a metric buttload of Nvidia’s silicon in their devices here, but lowly mortals poking around at a few random NICs we bought off eBay are left out in the cold…</p>

<p>Hope you didn’t get too excited reading all of this.
I just thought it was a good nugget of understanding, and saves you a lot of effort trying to also figure out how to run the mlxburn tool.</p>]]></content><author><name>Kenneth Finnegan</name></author><summary type="html"><![CDATA[We’re coming up on three years now that [John, Warthog9] and I have been running the MicroMirror project, and I decided that it was about time to start working on the next generation of the MicroMirror appliance hardware design. Currently, our main appliance consists of an HP T620 thin client with a 2TB M.2 SATA SSD in it, but due to a lot of feedback from networks that they didn’t have any way to plug a 1GbaseT server into their network, we pivoted to using the HP T620+ thin client with a ConnectX-311A-XCAT NIC in the PCIe slot. So we now ship these thicker thin clients with a 10G-LR optic installed in the SFP+ cage on the server, but hosts are free to replace that optic with SR or a DAC if their specific needs dictate something other than the 10G-LR default.]]></summary></entry><entry><title type="html">Making Fastestmirror Less Awful</title><link href="https://blog.thelifeofkenneth.com/2024/10/fastestmirror-less-awful.html" rel="alternate" type="text/html" title="Making Fastestmirror Less Awful" /><published>2024-10-21T00:00:00-07:00</published><updated>2024-10-21T00:00:00-07:00</updated><id>https://blog.thelifeofkenneth.com/2024/10/fastestmirror-less-bad</id><content type="html" xml:base="https://blog.thelifeofkenneth.com/2024/10/fastestmirror-less-awful.html"><![CDATA[<p>Fastestmirror is a configuration option for the yum and dnf package managers in the RPM ecosystem (i.e. Fedora, RHEL, CentOS, AlmaLinux, etc) that everyone loves to hate.
The name of the option is so alluring; who <em>wouldn’t</em> want to use the fastest mirror when downloading software packages?
Enabling fastestmirror is a staple of every “First ten things to do after installing Fedora” article from Linux content farms, so it’s pretty well known for people just getting started on Fedora.</p>
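<p>For the record, “enabling it” amounts to a single line in dnf’s main config file (the option defaults to off):</p>

```ini
# /etc/dnf/dnf.conf
[main]
fastestmirror=True
```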

<p>But here’s the rub…
It isn’t really that great of a feature.</p>

<p>Normally, without fastestmirror, DNF requests an ordered list of mirrors from the project, and starts at the first mirror on that list to download the RPM files it’s looking for.
The distro’s mirrorlist server can make a good guess at where the client is based on their IP address, looks up information about the mirrors like their physical location and how much bandwidth or weight each mirror was configured with, and shuffles the list a bit to load balance across mirrors in the same region as the client.
So DNF receives a list of mirrors with the generally closer and faster mirrors towards the top, and some randomness mixed in so a high density of clients in one place don’t all get the same mirrorlist and clobber a single mirror.</p>

<p>Enabling fastestmirror tells DNF to ignore the provided ordering of the mirror list; instead, DNF spends two seconds going down the list of mirrors measuring their speed, sorts the mirrors by that locally measured speed, and picks the first one.</p>

<h3 id="the-problem-is-that-fastestmirror-has-a-very-silly-concept-of-what-makes-a-mirror-fast">The problem is that fastestmirror has a <em>very</em> silly concept of what makes a mirror fast</h3>

<p>DNF measures the speed of a mirror by opening an HTTP socket to the mirror and… measuring how long the TCP SYN-SYNACK-ACK took.
That’s it.
Straight up latency measurement of the TCP socket open.
The problem is that latency is at best a decent proxy for physical distance, and has almost nothing to do at all with available bandwidth or resulting user experience/performance.
As long as mirrors are within the bandwidth-delay product range of the client’s maximum TCP window size, you really shouldn’t be seeing a profound difference in performance between a mirror that’s 10ms away vs 75ms away, all other things being equal.</p>
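<p>To make that concrete, here’s a rough Python sketch of the strategy; the real implementation lives in librepo’s C code, and the helper names here are mine, not librepo’s:</p>

```python
import socket
import time

def tcp_connect_time(host, port=80, timeout=2.0):
    # Time nothing but the TCP three-way handshake -- this is
    # essentially the entire "speed" measurement fastestmirror makes.
    start = time.monotonic()
    with socket.create_connection((host, port), timeout=timeout):
        return time.monotonic() - start

def pick_fastest(hosts, port=80, timeout=2.0):
    # Strict lowest-latency-first ordering: the mirror that answers
    # the handshake quickest always wins, regardless of bandwidth.
    timings = {}
    for host in hosts:
        try:
            timings[host] = tcp_connect_time(host, port, timeout)
        except OSError:
            timings[host] = float("inf")  # unreachable mirrors sort last
    return sorted(hosts, key=timings.get)
```

<p>Note that nothing here ever transfers a byte of payload; a 100Mbps mirror across the street will always beat a 10Gbps mirror one state over.</p>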

<p>So latency is a silly metric, but it is also undeniably simple, and more importantly, cheap.
Mirror operators are already footing the bill for all the bandwidth they’re serving to users of free software, so it is unreasonable to expect mirror operators to tolerate each client performing frivolous bandwidth tests against some big file stored on the mirror just to pick the fastest one.
Ideally you could be recording the historic performance of each mirror for RPM downloads in the past as a measure of download performance, but this data would be noisy (small requests dominated by latency vs large requests dominated by bandwidth) and could very well have a short shelf life if the DNF client is something portable like a laptop moving between different networks.
Regardless, DNF doesn’t record that, and everyone has just kind of settled for latency being a tolerable metric that is turned off by default, and hopefully the provided mirrorlist is just good enough as-is that you leave fastestmirror in the default off position and never think about it.</p>

<h2 id="the-problem-is-when-people-turn-it-on">The problem is when people turn it on</h2>

<p>For either valid or invalid reasons, people do turn on fastestmirror, and the problem is that it really has quite a few undesirable properties to it:</p>

<ol>
  <li>If the closest mirror to a user happens to have really bad performance, fastestmirror will consistently always pick that one mirror and downloads will always be slow. This isn’t nearly as noticeable when you’re using the mirrorlist order provided by the project CDN, because those lists are shuffled so even if you happen to get a bad mirror one day, you’ll probably get a better mirror the next day.</li>
  <li>When a lot of clients in close proximity to a single mirror all turn on fastestmirror, they will all reliably select that single closest mirror.
This happened to us on <a href="https://blog.thelifeofkenneth.com/2023/05/building-micro-mirror-free-software-cdn.html">MicroMirror</a>, where we turned up a 1Gbps mirror server that turned out to be the lowest latency mirror to AWS in Virginia, so we became the preferred EPEL mirror for approx 1.4 million CentOS 7 servers.
We ultimately needed to simply stop hosting EPEL on that one mirror because of how badly it was getting crushed with traffic.</li>
</ol>

<p>So I decided that I should do something about this undesirable behavior.
I was never going to fix the behavior of CentOS 7 clients, but thankfully that problem solved itself by the whole OS becoming obsolete.
If I could get some kind of behavior change accepted by the upstream DNF / librepo maintainers, in 5-15 years it’s possible that, as a mirror operator, I won’t see as much undesirable behavior due to fastestmirror being enabled.</p>

<h2 id="my-fix-dont-always-pick-the-single-fastest-mirror">My fix: Don’t always pick the single fastest mirror</h2>

<p>So ultimately, <a href="https://github.com/rpm-software-management/librepo/pull/324">my pull request into librepo</a> turned out to be quite a short patch, and in my opinion a pretty clever fix, given that I lacked the appetite to fundamentally change the measurement used to rank the mirrors.</p>

<p>Now, instead of librepo ranking mirrors by a strict ordering of lowest latency first, my change measures the latency of all the mirrors and separates mirrors into two lists: mirrors with less than twice the best latency, and mirrors with higher latency than twice the fastest.
We take the pool of “mirrors with less than 2x latency” and shuffle them together, and then append the rest of the mirrors sorted by latency.</p>
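<p>In Python pseudocode (the actual patch is a few lines of C inside librepo, but the logic boils down to this):</p>

```python
import random

def rank_mirrors(latency_by_mirror):
    # latency_by_mirror: {mirror_url: measured_connect_time_in_seconds}
    ordered = sorted(latency_by_mirror, key=latency_by_mirror.get)
    best = latency_by_mirror[ordered[0]]
    # Pool 1: everything within 2x of the best latency, shuffled so
    # nearby clients spread their load across comparable mirrors.
    near = [m for m in ordered if latency_by_mirror[m] < 2 * best]
    # Pool 2: everything else, still strictly sorted by latency,
    # kept on the list as fallbacks if the near mirrors are down.
    far = [m for m in ordered if latency_by_mirror[m] >= 2 * best]
    random.shuffle(near)
    return near + far
```

<p>When one mirror is dramatically closer than all the others, the near pool contains only that mirror and the behavior is unchanged from before.</p>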

<p>This change means that:</p>

<ul>
  <li>When a single mirror is significantly closer to the user than the rest, there is no change in behavior and that one mirror is always picked.</li>
  <li>When there’s a few mirrors with about the same latency as the closest one, instead of always picking that single closest mirror, we randomly rotate across the pool.</li>
</ul>

<p>In both instances, DNF still has the full list of mirrors to fall back on if the first mirror(s) are unavailable.</p>

<h2 id="my-concerns">My Concerns</h2>

<p>I think there’s two ways that this change is going to go sideways:</p>

<ol>
  <li>We are no longer picking the single lowest latency mirror, so it is possible that users are going to have instances where they’re picking something slightly further away that has significantly lower performance for reasons that aren’t latency related.</li>
  <li>Shuffling the mirror list means that every time you run DNF, you’re going to be hitting a different mirror, so we’re relying more on mirror-to-mirror consistency than before.
This means that if one of the nearby mirrors is behind on updates and doesn’t have the expected RPM files, the user is going to see more errors than previously.</li>
</ol>

<p>The counter arguments boil down to “having DNF <em>always</em> pick the best mirror is a fundamentally unsolvable problem” and “without fastestmirror we already see the inconsistency issue”.</p>

<p>So I’m not fundamentally fixing fastestmirror here, and the answer for poor mirror performance is still always “turn off awful performing mirrors, add more mirrors to scale horizontally”.
But hopefully this small change makes fastestmirror less bad, and once it eventually lands in major distributions I’m going to be very interested to see what the community feedback is with real world experience.
Once it does end up getting shipped, I’ll be sure to update this blog post to mention where it’s available.</p>]]></content><author><name>Kenneth Finnegan</name></author><summary type="html"><![CDATA[Fastestmirror is a configuration option for the yum and dnf package managers in the RPM ecosystem (i.e. Fedora, RHEL, CentOS, AlmaLinux, etc) that everyone loves to hate. The name of the option is so alluring; who wouldn’t want to use the fastest mirror when downloading software packages? Enabling fastestmirror is a staple of every “First ten things to do after installing Fedora” article from Linux content farms, so it’s pretty well known for people just getting started on Fedora.]]></summary></entry><entry><title type="html">Migration to Jekyll from Blogger</title><link href="https://blog.thelifeofkenneth.com/2024/09/migration-to-jekyll-from-blogger.html" rel="alternate" type="text/html" title="Migration to Jekyll from Blogger" /><published>2024-09-29T00:00:00-07:00</published><updated>2024-09-29T00:00:00-07:00</updated><id>https://blog.thelifeofkenneth.com/2024/09/migrated-to-jekyll</id><content type="html" xml:base="https://blog.thelifeofkenneth.com/2024/09/migration-to-jekyll-from-blogger.html"><![CDATA[<p><img src="/2024/09/newblog.png" alt="screenshot" /></p>

<p>I’ve been writing on this blog for something like 18 years now, which is just wild.
When I originally fired up this blog as a way to improve my technical writing out of high school, I picked Blogger since it had already been acquired by Google at that point, so it was certain that Blogger would continue to be a thing into the foreseeable future.</p>

<p>Fast forward to 2024, and Blogger being a Google product now makes its ultimate demise a certainty, so clearly we need to come up with a new plan before Google yanks the plug with relatively short notice.</p>

<p><a href="/2017/11/creating-autonomous-system-for-fun-and.html">Since I’ve been running my own hosting network since 2017</a>, and comments on blog posts as an idea have run their course to the spam and hate filled conclusion, migrating my whole blog to a static site generator and throwing it in my self hosted network seems like a very reasonable next step.</p>

<p>I originally started this work in late 2020, and rapidly came to the realization of just how much work it was going to be to migrate more than 500 blog posts to literally anything.
If you’ve been wondering why this blog went oddly quiet for the last few years, a lot of that has been that I didn’t want to write anything new until the migration was complete, but every time I picked up the migration again, I’d chip away at another dozen blog posts or two, and then lose interest again and put it back down.</p>

<p>At this point, I’m just calling it.
I’m not <em>done</em> with the migration, and have actually only updated about 20% of the posts to using Markdown, but I came to the conclusion that if anyone is looking at any of my writing prior to 2015 still, they’re just going to need to come to terms with some of the formatting being a little wonky and we’re all going to move on from that.</p>

<h1 id="the-plan">The Plan</h1>

<p>I am a… lukewarm fan of <a href="https://jekyllrb.com/">Jekyll</a> the static site generator.
It is pretty good, with a lot of problems around dependency management, and it seems like the new hotness has moved on past it to other static site generators like Hugo, but I’ve already got a half dozen projects under my belt using Jekyll, so better to stick with the devil I know.</p>

<p>The Minima theme from Jekyll gets the job done, but I did actually go as far as editing the color scheme in the sass files to get it so this website doesn’t glaringly look like the default Jekyll theme like all my other websites.</p>

<p>Taking the single huge XML file export from Blogger, Jekyll actually has a <a href="https://import.jekyllrb.com/docs/blogger/">handy tool</a> that will parse out the XML file into individual Markdown files for each blog post with all the requisite frontmatter to get the blogpost to render with the correct URL, etc.</p>

<p>A good first step, but all of each post’s content is still the machine generated HTML out of the Blogger editor, with the photos still embedded as one of about three different generations of how Blogger handled photos in blog posts.
There were some broad first passes applied to ALL of the posts at once using <code class="language-plaintext highlighter-rouge">sed</code> and <code class="language-plaintext highlighter-rouge">awk</code> to convert <code class="language-plaintext highlighter-rouge">&lt;br /&gt;</code> tags into new lines, break each sentence onto its own line so interacting with the markdown files wouldn’t involve 1kB long single lines, delete nbsp tags, etc.</p>

<p>But there’s still a lot of per post work that just took… ages.
To handle the pictures, which were still getting linked from Google’s own CDN, I needed to download all of them, delete the hrefs from the markdown, and insert a proper image reference to the local copy.</p>

<p>The one thing that made this slightly easier is that each Blogger photo embeds a lower resolution copy, and then links to a higher resolution copy that always has “1600” in the URL:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>cat blogpost-im-currently-working-on.md | grep -Po "http.*?1600.*?jpg" | while read PICTURE; do wget "$PICTURE"; done
</code></pre></div></div>

<p>Did a respectably good job of pulling out most of the images per post!
For posts where I was using more diagrams and figures, I’d need to run it again looking for PNGs, but that automated most of the external work.</p>
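<p>If I were doing it again, a single pass matching both extensions would save the re-run; here’s a quick Python equivalent (the regex and helper name are mine, not part of any existing tool):</p>

```python
import re

# Match Blogger's full-resolution image URLs, which always contain
# "1600" in the path, for both JPG and PNG in a single pass.
FULL_RES_RE = re.compile(r'http[^"\s)]*?1600[^"\s)]*?\.(?:jpe?g|png)',
                         re.IGNORECASE)

def full_res_images(post_text):
    # Return every full-resolution image URL found in a post.
    return FULL_RES_RE.findall(post_text)
```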

<p>Rinse and repeat for about 100 of the posts so far, and I’ve just had it.
So if you go back and look at any of the posts in the 2011-2015 era, it’s likely the formatting might be kind of wonky or <em>really</em> wonky, but ideally all of the content should still be there in one form or another.
Links and code snippets are a total mess, so a low level of consistency there, but again, we’re just going to live with it.
If anything is completely mangled, I have faith in our true lord and savior the Internet Wayback Machine, which should have a very good view of all of my blog posts by now.</p>

<p>I ultimately decided that getting this new framework live and load bearing was more important than updating alllll the old posts beforehand (perfect really is the enemy of done).
So hope you enjoy; let me know if anything breaks wildly.
I tried to at least get the main RSS feed in the same place as where Blogger put it, but Blogger’s support for things like an alternative ATOM feed and… RSS feeds for comments per blog post (I think?) just weren’t going to happen.</p>

<p>As for this now being a totally static site without support for comments, if you want to yell at me my email address is right down there, and if you feel like you need to dunk on my ideas in public, it shouldn’t be too hard to post a screenshot of any of this on… TikTok?</p>

<p>Stay mindful everyone, and keep being humans posting about things you’re passionate about online.
It seems like that’s becoming an increasingly rare commodity these days.</p>]]></content><author><name>Kenneth Finnegan</name></author><summary type="html"><![CDATA[]]></summary></entry><entry><title type="html">Building the Micro Mirror Free Software CDN</title><link href="https://blog.thelifeofkenneth.com/2023/05/building-micro-mirror-free-software-cdn.html" rel="alternate" type="text/html" title="Building the Micro Mirror Free Software CDN" /><published>2023-05-09T00:04:00-07:00</published><updated>2023-05-09T00:04:00-07:00</updated><id>https://blog.thelifeofkenneth.com/2023/05/building-the-micro-mirror-free-software-cdn</id><content type="html" xml:base="https://blog.thelifeofkenneth.com/2023/05/building-micro-mirror-free-software-cdn.html"><![CDATA[<p><img src="/2023/05/codingflyboy.jpg" alt="picture" /></p>

<p>As should surprise no one, based on my past projects of <a href="https://blog.thelifeofkenneth.com/2017/11/creating-autonomous-system-for-fun-and.html">running my own autonomous system</a>, building <a href="https://blog.thelifeofkenneth.com/2018/04/creating-internet-exchange-for-even.html">my own Internet Exchange Point</a>, and <a href="https://blog.thelifeofkenneth.com/2020/11/building-anycast-secondary-dns-service.html">building a global anycast DNS service</a>, I kind of enjoy building Internet infrastructure to make other people’s experience online better.
So that happened again, and like usual, this project got well out of hand.</p>

<h2 id="linux-updates">Linux updates</h2>

<p>You run <code class="language-plaintext highlighter-rouge">apt update</code> or <code class="language-plaintext highlighter-rouge">dnf upgrade</code> and your Linux system goes off and downloads updates from your distro and installs them.
Most people think nothing of it, but serving all of those files to every Linux install in the world is a challenging problem, and it’s made even harder because most Linux distributions are free and thus don’t have a project budget to spin up a global CDN (Content Distribution Network) to have dozens or hundreds of servers dedicated to putting bits on the wire for clients over the Internet.</p>

<p>How Linux distros get around this budget issue is that they host a single “golden” copy of all of their project files (Install media ISOs, packages, repository index files, etc) and volunteers around the world who are blessed with surplus bandwidth download a copy of the whole project directory, make it available from their own web server that they build and maintain themselves, and then register their mirror of the content back with the project.
Each free software project then has a load balancer that directs clients to nearby mirrors of the content they’re requesting while making sure that the volunteer mirrors are still online and up to date.</p>

<p>At the beginning of 2022, one of my friends (John Hawley) and I were discussing the fact that the network who used to be operating a Linux mirror in the same datacenter as us had moved out of the building, and maybe it would be fun to build our own mirror to replace it.</p>

<p>John: “Yeah…
it would probably be fun to get back into mirroring since I used to run the mirrors.kernel.org mirrors” (world’s largest and most prominent Linux mirror)</p>

<p>Me: “Wait…
WHAT?!”</p>

<p><img src="/2023/05/mfn.jpg" alt="picture" /></p>

<p>So long story short, the two of us pooled our money together, and went and dropped $4500 on a SuperMicro chassis, stuffed it full of RAM (384GB) and hard drives (6x 16TB) and racked it below the Google Global Cache I’m hosting in my rack in Fremont.</p>

<p>Like usual, I was posting about this as it was happening on Twitter (RIP) and several people on Twitter expressed interest in contributing to the project, so I posted a paypal link, and we came up with the offer that if you donated $320 to the project, you’d get your name on one of the hard drives inside the chassis in Fremont, since that’s how much we were paying for each of the 16TB drives.</p>

<p><img src="/2023/05/hdd.jpg" alt="picture" /></p>

<p>This “hard drive sponsor” tier also spawned what I think was one of the most hilarious conversations of this whole project, where one of my friends was trying to grasp why people were donating money to get their name on a piece of label tape, stuck to a hard drive, inside a server, inside a locked rack, inside of a data center, where there most certainly was no possibility of anyone ever actually seeing their name on the hard drive.
A rather avant-garde concept, I will admit.</p>

<p>The wild part was that we “sold out” on “Hard Drive Sponsor” tier donors, and enough people contributed to the project that we covered almost all of the hardware cost of the original mirror.fcix.net server!</p>

<p>So long story short, we decided to spin up a Linux mirror, fifty of my friends on Twitter chipped in on the project, and we were off to the races trying to load 50TB of Linux distro and free software artifacts on the server to get it up and running.
All well and good, and a very typical Linux mirror story.</p>

<p><img src="/2023/05/plot.png" alt="plot" /></p>

<p>Where things started to get out of hand is when John started building a Grafana dashboard to parse all of the Nginx logs coming out of our shiny new Linux mirror and analyzing the data as to how much of what projects we were actually serving.
Pivoting the data by various metrics like project and release and file type, we came to the realization that while we were hosting 50TB worth of files for various projects, more than two thirds of our network traffic was coming from a very limited number of projects and only about 3TB of files on disk! And this is where the idea of the Micro Mirror began to take shape.</p>

<h2 id="the-micro-mirror-thesis">The Micro Mirror Thesis</h2>

<p><em>If the majority of the network traffic on a Linux mirror is coming from a small slice of the assets hosted on the mirror, then it should be possible to build a very small and focused mirror that only hosts projects from that “hot working set” and, while less effective than our full-sized mirror, could still be half as effective at 10% of the cost.</em></p>


<p>So we set ourselves the challenge of trying to design a tiny Linux mirror which could pump out a few TB of traffic a day (as opposed to the 12-15TB/day of traffic served from mirror.fcix.net) with a hardware cost less than the $320 that we spent on one of the hard drives in the main storage array.
Thanks to eBay and my love for last gen enterprise thin clients, we settled on a design consisting of the following:</p>
<ul style="text-align: left;"><li><a href="https://www.parkytowers.me.uk/thin/hp/t620/">HP T620 thin client</a></li><li><a href="https://amzn.to/44EAw0R">2x4GB DIMMs</a></li><li><a href="https://amzn.to/3nNTOjF">2TB M.2 SSD</a></li></ul>
<p>This could all be had for less than $250 on eBay used, and conveniently fits nicely in a medium flat rate USPS box, so once we build it and find a random network in the US willing to plug this thing in for us, we can just drop it in the mail.</p>

<p>We built the prototype and one of my other friends in Fremont offered to host it for us, since we’re only using the 1G-baseT NIC on-board the thin client, and we were off to the races.
Setting the tiny mirror up hosting only Ubuntu ISOs, Extra Packages for Enterprise Linux, and the CentOS repo for servers, we easily exceeded our design objective of &gt;1TB/day of network traffic.
Not a replacement for traditional “heavy iron” mirrors that can host a longer tail of projects, but this is 1TB of network traffic which we were able to peel off of those bigger mirrors so they could spend their resources serving the less popular content, which we wouldn’t be able to fit on the single 2TB SSD inside this box.</p>

<p><img src="/2023/05/coresite.jpg" alt="picture" /></p>

<p>Now it just became a question of “well, if one Micro Mirror was pretty successful, exactly how many MORE of these little guys could we stamp out and find homes for???”</p>

<p>These Micro Mirrors have several very attractive features to them for the hosting network:</p>
<ul style="text-align: left;"><li>They are fully managed by us, so while many networks / service providers <i>want </i>to contribute back to the free software community, they don't have the spare engineering resources required to build and manage their own mirror server.
So this fully managed appliance makes it possible for them to contribute their network bandwidth at no manpower cost.</li><li>They're very small and can fit just about anywhere inside a hosting network's rack.</li><li>They're low power (15W)</li><li>They're fault tolerant, since each project's load balancer performs health checks on the mirrors and if this mirror or the hosting network has an outage the load balancers will simply not send clients to our mirror until we get around to fixing the issue.</li></ul>
<p>Then it was just a question of scaling the idea up: a Kickstart file so I can take the raw hardware and perform a completely hands-off provisioning of the server, and an <a href="https://github.com/PhirePhly/micromirrors">Ansible playbook</a> to take a short config file per node and fully provision the HTML header, project update scripts, and rsync config per server. Suddenly I can stamp out a new Micro Mirror server with less than 30 minutes worth of total work.</p>

<p><img src="/2023/05/ohioix.jpg" alt="picture" /></p>

<p>Finding networks willing to host nodes turned out to be extremely easy.
Between Twitter, Mastodon, and a few select Slack channels I hang out on, I was able to easily build a waiting list of hosts that surpassed the inventory of thin clients I had laying around.
Then we just needed to figure out how to fund more hardware beyond what we were personally willing to buy.
Enter <a href="https://liberapay.com/phirephly/">LiberaPay</a>, an open source service similar to Patreon where people can pledge donations to us to keep funding this long term.</p>

<p>So now we have a continual (albeit very small) funding source, and a list of networks waiting for hardware, and it’s mainly been a matter of waiting for enough donations to come in to fund another node, ordering the parts, provisioning the server, dropping it in the mail, and waiting for the hosting network to plug it in so we can run our Ansible playbook against it and get it registered with all the relevant projects.</p>

<p><img src="/2023/05/lesnet.jpg" alt="picture" /></p>

<p>So now we had a solid pipeline set up, and we could start playing around with other hardware designs than the HP T620 thin client.
The RTL8168 NIC on the T620s is far from ideal for pumping out a lot of traffic, and we actually got feedback from several hosting networks that they just don’t have the ability to plug in baseT servers anymore, and they’d much prefer a 10Gbps SFP+ NIC handoff to the appliance.</p>

<p><img src="/2023/05/r220.jpg" alt="picture" /></p>

<p>The desire for 10G handoffs has been a bit of a challenge while still trying to stay within the $320 hardware budget goal we set for ourselves, but we have been doing some experiments with the HP T620 Plus thin client, which happens to have a PCIe slot that fits a Mellanox ConnectX3 NIC, and we also received a very generous donation of a pile of Dell R220 servers with 10G NICs from Arista Networks (Thanks Arista!)</p>

<p><img src="/2023/05/map.jpg" alt="picture" /></p>

<p>So now the project has very easily gotten out of hand.
We have more than 25 Micro Mirror nodes of various permutations live in production, spanning not only North America but several of the nodes have been deployed internationally.
Daily we serve roughly 60-90TB of Linux and other free software updates from these Micro Mirrors, with more than 150Gbps of port capacity.
So while not making a profound difference to user experience downloading updates, each Micro Mirror we deploy has helped make a small incremental improvement in how fast users are able to download updates and new software releases.</p>

<p>So if you’ve started noticing a bunch of *.mm.fcix.net mirrors for your favorite project, this is why.
We hit a sweet spot with this managed appliance and have been stamping them out as resources permit.</p>

<h2 id="interest-in-helping">Interest in Helping?</h2>

<p>The two major ways that someone can help us with this project are funding the new hardware and providing us locations to host the Micro Mirrors:</p>
<ul style="text-align: left;"><li>Cash contributions are best sent via my <a href="https://liberapay.com/phirephly/">LiberaPay account</a>.</li><li>Any service providers interested in hosting nodes in their data center network can reach out to <a href="mailto:mirror@fcix.net">mirror@fcix.net</a> to contact us and get on our wait list.</li></ul>
<p>We are not interested in deploying these nodes off of any residential ISP connections, so even if you have a 1Gbps Internet connection from your ISP, we want to limit the deployment of these servers to wholesale transit contexts in data centers where we can work directly with the ISP’s NOC.</p>

<p>Of course, nothing is preventing anyone from going out and setting up your own Linux mirror.
Ultimately, for the sake of mirror diversity, having more admins out there running their own mirrors is better than just growing this Micro Mirror project.
If you’re looking to spin up your own mirror and have any specific questions on the process, feel free to reach out to us for that as well.</p>

<p>I also regularly <a href="https://social.afront.org/tags/MicroMirror">post about this project on Mastodon</a>, if you want to follow along real time.</p>]]></content><author><name>Kenneth Finnegan</name></author><summary type="html"><![CDATA[]]></summary></entry><entry><title type="html">Unlocking Third Party Transceivers on Older Arista Switches</title><link href="https://blog.thelifeofkenneth.com/2021/05/unlocking-third-party-transceivers-on.html" rel="alternate" type="text/html" title="Unlocking Third Party Transceivers on Older Arista Switches" /><published>2021-05-05T10:00:00-07:00</published><updated>2021-05-05T10:00:00-07:00</updated><id>https://blog.thelifeofkenneth.com/2021/05/unlocking-third-party-transceivers-on</id><content type="html" xml:base="https://blog.thelifeofkenneth.com/2021/05/unlocking-third-party-transceivers-on.html"><![CDATA[<p>Like many switch vendors, Arista switches running EOS only accept Arista branded optics by default.
They accept any brand of passive direct attach copper cables, but since third party optics tend to cause a prohibitive number of support cases, the user is forced to make a deliberate decision to enable third party optics so they can appreciate that they’re venturing out into unsupported territory.</p>

<p>The thing is, if you’re out buying an older generation of Arista switch on eBay to try and learn EOS, you’re probably also not able to directly buy Arista branded optics, so the unlock procedure on older switches would be of interest to you.</p>

<p>EOS has two different methods for unlocking third party transceivers, depending on what hardware you’re trying to unlock.
It’s dependent on the hardware, not on what version of EOS you’re running: switches which originally supported the old “magic file” unlock method have always supported it, and newer switches have always required the customer unlock code.</p>

<ol>
  <li>The “magic file” method is used on the earlier switches, and depends on EOS checking for a file named “enable3px” in the flash: partition.
There doesn’t need to be anything in this file; it just needs to exist, so an easy way to create this file from the EOS command line is <code class="language-plaintext highlighter-rouge">bash touch /mnt/flash/enable3px</code></li>
  <li>The newer “customer unlock code” method instead relies on each customer having a conversation with their account team to get a cryptographic key for their specific account, which is then visible in their running-config as a <code class="language-plaintext highlighter-rouge">service unsupported-transceiver CUSTOMERNAME LICENSEKEY</code> line</li>
</ol>

<p>Once you apply one of these two unlock methods, save your running config and reload the switch to have it unlock all your transceivers and accept all future optics you install.</p>

<p>So if you’re looking at a switch on the second hand market and trying to figure out if you can unlock it using the old magic file method, here is a list of the switches which predate the newer customer license key method and work with the empty <code class="language-plaintext highlighter-rouge">enable3px</code> file:</p>

<ul>
  <li>DCS-7120T-4S
    <ul>
      <li>20x 10GbaseT</li>
      <li>4x SFP+</li>
      <li>Last EOS version 4.13.16M</li>
    </ul>
  </li>
  <li>DCS-7140T-8S
    <ul>
      <li>40x 10GbaseT</li>
      <li>8x SFP+</li>
      <li>Last EOS version 4.13.16M</li>
    </ul>
  </li>
  <li>DCS-7048T-4S
    <ul>
      <li>48x 1GbaseT</li>
      <li>4x SFP+</li>
      <li>Last EOS version 4.13.16M</li>
    </ul>
  </li>
  <li>DCS-7048T-A
    <ul>
      <li>48x 1GbaseT</li>
      <li>4x SFP+</li>
      <li>Last EOS version 4.15.10M</li>
    </ul>
  </li>
  <li>DCS-7124S
    <ul>
      <li>24x SFP+</li>
      <li>Last EOS version 4.13.16M</li>
    </ul>
  </li>
  <li>DCS-7124SX
    <ul>
      <li>24x SFP+</li>
      <li>Last EOS version 4.13.16M</li>
    </ul>
  </li>
  <li>DCS-7124FX
    <ul>
      <li>24x SFP+</li>
      <li>Last EOS version 4.13.16M</li>
    </ul>
  </li>
  <li>DCS-7148S
    <ul>
      <li>48x SFP+</li>
      <li>Last EOS version 4.13.16M</li>
    </ul>
  </li>
  <li>DCS-7148SX
    <ul>
      <li>48x SFP+</li>
      <li>Last EOS version 4.13.16M</li>
    </ul>
  </li>
  <li>DCS-7050T-36
    <ul>
      <li>32x 10GbaseT</li>
      <li>4x SFP+</li>
      <li>Last EOS version 4.18.11M-2G</li>
    </ul>
  </li>
  <li>DCS-7050T-52
    <ul>
      <li>48x 10GbaseT</li>
      <li>4x SFP+</li>
      <li>Last EOS version 4.18.11M-2G</li>
    </ul>
  </li>
  <li>DCS-7050T-64
    <ul>
      <li>48x 10GbaseT</li>
      <li>4x QSFP+</li>
      <li>Last EOS version 4.18.11M-2G</li>
    </ul>
  </li>
  <li>DCS-7050S-52
    <ul>
      <li>52x SFP+</li>
      <li>Last EOS version 4.18.11M-2G</li>
    </ul>
  </li>
  <li>DCS-7050S-64
    <ul>
      <li>48x SFP+</li>
      <li>4x QSFP+</li>
      <li>Last EOS version 4.18.11M-2G</li>
    </ul>
  </li>
  <li>DCS-7050Q-16
    <ul>
      <li>8x SFP+</li>
      <li>16x QSFP+</li>
      <li>Last EOS version 4.18.11M-2G</li>
    </ul>
  </li>
  <li>DCS-7150S-24
    <ul>
      <li>24x SFP+</li>
      <li>Last EOS version 4.23.x-2G (still an active train)</li>
    </ul>
  </li>
  <li>DCS-7150S-52
    <ul>
      <li>52x SFP+</li>
      <li>Last EOS version 4.23.x-2G (still an active train)</li>
    </ul>
  </li>
  <li>DCS-7150S-64
    <ul>
      <li>48x SFP+</li>
      <li>4x QSFP+</li>
      <li>Last EOS version 4.23.x-2G (still an active train)</li>
    </ul>
  </li>
  <li>7548S-LC line cards on an original 7504/7508
    <ul>
      <li>48x SFP+ line cards</li>
      <li>Last EOS version 4.15.10M</li>
      <li>None of the E series or R series line cards support the magic file method</li>
    </ul>
  </li>
</ul>

<p>So hopefully you find that list helpful.</p>

<p>If you’re looking to unlock transceivers on newer switches, I’m afraid I’m not able to help there.
You may try filling out the “contact sales” form on the Arista website and have a conversation with your account rep about what you’re trying to do.</p>

<p>Don’t contact TAC or support; they’re never authorized to hand out unlock keys and will only refer you to your account team.</p>]]></content><author><name>Kenneth Finnegan</name></author><summary type="html"><![CDATA[Like many switch vendors, Arista switches running EOS only accept Arista branded optics by default. They accept any brand of passive direct attach copper cables, but since third party optics tend to cause a prohibitive number of support cases, it’s needed to force the user to make a deliberate decision to enable third party optics so they can appreciate that they’re venturing out into unsupported territory. &lt;/p&gt;&lt;p&gt;The thing is, if you’re out buying an older generation of Arista switch on eBay to try and learn EOS, you’re probably also not able to directly buy Arista branded optics, so the unlock procedure on older switches would be of interest to you.]]></summary></entry><entry><title type="html">Building an Anycast Secondary DNS Service</title><link href="https://blog.thelifeofkenneth.com/2020/11/building-anycast-secondary-dns-service.html" rel="alternate" type="text/html" title="Building an Anycast Secondary DNS Service" /><published>2020-11-18T21:00:00-08:00</published><updated>2020-11-18T21:00:00-08:00</updated><id>https://blog.thelifeofkenneth.com/2020/11/building-anycast-secondary-dns-service</id><content type="html" xml:base="https://blog.thelifeofkenneth.com/2020/11/building-anycast-secondary-dns-service.html"><![CDATA[<p><img src="/2020/11/anycast/V4_20201108.png" alt="V4 latency map" /></p>

<p>Regular readers will remember how one of my friends offered to hold my beer while I went and <a href="/2017/11/creating-autonomous-system-for-fun-and.html">started my own autonomous system</a>, and then the two of us and some of our friends thought it would be funny to <a href="/2018/04/creating-internet-exchange-for-even.html">start an Internet Exchange</a>, and <a href="https://fcix.net/">that joke got wildly out of hand</a>…
Well, my lovely friend Javier has done it again.</p>

<p>About 18 months ago, Javier and I were hanging out having dinner, and he idly observed in my direction that he had built a DNS service back in 2009 that was still <i>mostly</i> running, but it had fallen into some disrepair and needed a bit of a refresh.
Javier is really good at giving me fun little projects to work on…
that turn out to be a lot of work.
But I think that’s what hobbies are supposed to look like? Maybe? So here we are.</p>

<p>With Javier’s help, I went and built what I’m calling “Version Two” of Javier’s <a href="https://ns-global.zone/">NS-Global DNS Service</a>.</p>

<h2 id="understanding-secondary-dns">Understanding Secondary DNS</h2>

<p>Before we dig into NS-Global, let’s talk about DNS for a few minutes.
Most people’s interaction with DNS is that they configure a DNS resolver like 1.1.1.1 or 9.9.9.9 on their network, and everything starts working better, because DNS is really important and a keystone to the rest of the Internet working.</p>

<p>The thing is, DNS resolvers are just the last piece in the chain of DNS.
From the user’s perspective, you fire off a question to your DNS resolver, like “what is the IP address for twitter.com?” and the resolver spends some time thinking about it, and eventually comes back with an answer for what IP address your computer should talk to for you to continue your doom scrolling.</p>

<p>And these resolving DNS services aren’t what we built.
Those are really kind of a pain to manage, and Cloudflare will always do a better job than us at marketing since they were able to get a pretty memorable IP address for it.
But the resolving DNS services need somewhere to go to get answers for their user’s questions, and that’s where authoritative DNS servers come in.</p>

<p>When you ask a recursive DNS resolver for a resource record (like “A/IPv4 address for www.twitter.com”) they very well may know nothing about Twitter, and need to start asking around for answers.
The first place the resolver will go is one of the 13 root DNS servers (<a href="/2018/06/peering-with-root-dns-servers.html">which I’ve talked about before</a>), and ask them “Hey, do you know the A record for www.twitter.com?”, and the answer back from the root DNS server is going to be “No, I don’t know, but the name servers for <b>com</b> are over there”.
And that’s all the root servers can do to help you; point you in the right direction for the top level domain name servers.
You then ask one of the name servers for com (which happen to be run by Verisign) if <em>they</em> know the answer to your question, and they’ll say no too, but they will know where the name servers for twitter.com are, so they’ll point you in that direction and you can try asking the twitter.com name servers about IN/A,www.twitter.com, and they’ll probably come back with the answer you were looking for the whole time.</p>
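<p>If you have dig installed, you can watch this referral chain yourself.
The <code class="language-plaintext highlighter-rouge">+norecurse</code> flag asks each server directly instead of asking it to recurse, and <code class="language-plaintext highlighter-rouge">+trace</code> performs the entire walk from the roots for you:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Follow the delegation chain by hand, one referral at a time
$ dig +norecurse @a.root-servers.net www.twitter.com A
$ dig +norecurse @a.gtld-servers.net www.twitter.com A
# ...or let dig do the whole walk from the roots for you
$ dig +trace www.twitter.com A
</code></pre></div></div>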

<p>So recursive DNS resolvers start at the 13 root servers, which can point them toward the name servers for the very last piece of the query (the top level domain), and you work your way down the DNS hierarchy moving right to left in the hostname until you finally reach an authoritative DNS server with the answer you seek.
And this is really an amazing system that is so scalable, because a resolver only critically depends on knowing the IP addresses for a few root servers, and can bootstrap itself into the practically infinitely large database that is DNS from there.
Granted, needing to go back to the root servers to start every query would suck, so resolvers will also have a local cache for answers, so once one query yields the NS records for “com” from a root, those will sit in the local cache for a while and save the trip back to the root servers for every following query for anything ending in “.com”.</p>

<p>But we didn’t build a resolver.
We built an authoritative DNS service, which sits down at the bottom of the hierarchy with a little database of answers (called a zone file) waiting for some higher level DNS server to refer a resolver to us for a query only we can answer.</p>

<p>Most people’s experience with authoritative DNS is that when they register their website with Google Domains or GoDaddy, they have some kind of web interface where you can enter IP addresses for your web server, configure where your email for that domain gets sent, etc, and that’s fine.
They run authoritative DNS servers that you can use, but for various reasons, people may opt to not use their registrar’s DNS servers, and instead just tell the registrar “I don’t want to use your DNS service, but put these NS records in the top level domain zone file for me so people know where to look for their questions about my domain”</p>

<h2 id="primary-vs-secondary-authoritative">Primary vs Secondary Authoritative</h2>

<p>So authoritative DNS servers locally store their own little chunk of the enormous DNS database, and every other DNS server which sends them a query either gets the answer out of the local database, or a referral to a lower level DNS server, or some kind of error message.
Just what the network admin for a domain name is looking for!</p>

<p>But DNS is also clever in the fact that it has the concept of a primary authoritative server and secondary authoritative servers.
From the resolver’s perspective, they’re all the same, but if you only had one DNS server with the zone file for your domain and it went offline, that’s the end of anyone in the whole world being able to find any IP addresses or other resource records for anything in your domain, so you really probably want to recruit someone else to also host your zone to keep it available while you’re having some technical difficulties.
You could stand up a second server and make the same edits to your zone database on both, but that’d also be a pain, so DNS even supports the concept of an authoritative transfer (AXFR) which copies the database from one server to another to be available in multiple places.</p>

<p>This secondary service is the only thing that NS-Global provides; you create your zone file however you like, hosted on your own DNS server somewhere else, and then you allow us to use the DNS AXFR protocol to transfer your whole zone file into NS-Global, and we answer questions about your zone from the file if anyone asks us.
Then you can go back to your DNS registrar and add us to your list of name servers for your zone, and the availability/performance of your zone improves by the fact that it’s no longer depending on your single DNS server.</p>
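<p>The setup on your side is small.
As a minimal sketch, assuming your primary runs BIND and using a placeholder address for our transfer server, the zone stanza would look something like:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// named.conf on your primary; 192.0.2.53 stands in for the secondary
zone "example.com" {
    type master;
    file "/etc/bind/db.example.com";
    allow-transfer { 192.0.2.53; };  // let the secondary AXFR the zone
    also-notify { 192.0.2.53; };     // push NOTIFYs so updates propagate quickly
};
</code></pre></div></div>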

<h2 id="the-performance-of-anycast">The Performance of Anycast</h2>

<p>Building a secondary DNS service where we run a server somewhere, and then act as a second DNS server for zones is already pretty cool, but not <i>stupid</i> enough to satisfy my usual readers, and this is where <a href="https://en.wikipedia.org/wiki/Anycast">anycast</a> comes in.</p>

<p><img src="/2020/11/anycast/RipeAtlas-SinglePOP-v4.png" alt="Single POP map" /></p>

<p>Anycast overcomes three problems with us running NS-Global as just one server in Fremont:</p>

<ol>
  <li>The speed of light is a whole hell of a lot slower than we’d really like</li>
  <li>Our network in Fremont might go offline for some reason</li>
  <li>With enough users (or someone attacking our DNS server) we might get more DNS queries than we can answer from a single server</li>
</ol>

<p>And anycast to the rescue!
Anycast is where you have multiple servers world-wide all configured essentially the same, with the same local IP address, then use the Border Gateway Protocol to have them all announce that same IP address into the Internet routing table.
By having all the servers announce the same IP address, to the rest of the Internet they all look like they’re the same server, which just happens to have a lot of possible options for how to route to it, so every network in the Internet will choose the anycast server closest to them, and route traffic to NS-Global to that server.</p>
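<p>In BGP terms, each site simply originates the same prefix toward its transit providers.
A minimal sketch of one site’s configuration, assuming BIRD 1.x with placeholder ASNs and prefixes:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># bird.conf fragment at one anycast site; ASNs and prefixes are placeholders
protocol static anycast_route {
    route 192.0.2.0/24 reject;  # the shared anycast prefix, originated locally
}

protocol bgp transit {
    local as 64500;
    neighbor 198.51.100.1 as 64496;
    import all;
    export where proto = "anycast_route";  # announce only the anycast prefix
}
</code></pre></div></div>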

<p>Looking at the map above, I used the RIPE Atlas sensor network (<a href="/2018/07/measuring-anycast-dns-networking-using.html">which I’ve also written about before</a>) to measure the latency it takes to send a DNS query to a server in Fremont, California from various places around the world.
Broadly speaking, you can see that sensors on the west coast are green (lower latency) in the 0-30ms range.
As you move east across the North American continent, the latency gets progressively worse, and as soon as you need to send a query to Fremont from another continent, things get MUCH worse, with Europe seeing 100-200ms of latency, and places like Africa feeling even more pain with query times in the 200-400ms range.</p>

<p>And the wild part is that this is a function of the speed of light.
DNS queries in Europe just <i>take longer</i> to make it to Fremont and back than a query from the United States.</p>

<p><img src="/2020/11/anycast/V4_20201108.png" alt="V4 latency map" /></p>

<p>But if we instead deploy a dozen servers worldwide, and have all of them locally store all the DNS answers to questions we might get, and have them all claim to be the same IP address, we can beat this speed of light problem.</p>

<p>Networks on the west coast can route to our Fremont or Salt Lake City servers, and that’s fine, but east coast networks can decide that it’s less work to route traffic to our Reston, Virginia or New York City locations, and their traffic doesn’t need to travel all the way across the continent and back.</p>

<p>In Europe, they can route their queries to our servers in Frankfurt or London, and to them it seems like NS-Global is a Europe-based service, because there’s no way they could have sent their DNS query to the US and gotten an answer back as soon as they did (because physics!). We even managed to get servers in São Paulo, Brazil; Johannesburg, South Africa; and Tokyo, Japan.</p>

<p>So now NS-Global is pretty well global, and in the aggregate we generally tend to beat the speed of light vs a single DNS server regardless of how well connected it is in one location.
Since we’re also using the same address and the BGP routing protocol in all of these locations, if one of our sites falls off the face of the Internet, it…
really isn’t a big deal.
When a site goes offline, the BGP announcement from that location disappears, but there’s a dozen other sites also announcing the same block, so maybe latency goes up a little bit, but the Internet just routes queries to the next closest NS-Global server and things carry on like nothing happened.</p>

<p>Is it always perfect? No.
Tuning an anycast network is a little like balancing an egg on its end.
It’s possible, but not easy, and easy to perturb.
Routing on the Internet unfortunately isn’t based on the physically “shortest” path, but on metrics like how many networks a path crosses and each network’s local policy, so when we turn on a new location somewhere like Tokyo, because our transit provider there happens to be well connected, suddenly half the traffic in Europe starts getting sent to Tokyo until we turn some knobs and try to make Frankfurt look more attractive to Europe than Tokyo again.
The balancing act is that every time we add a new location, or even if we don’t do anything and the topology of the rest of the Internet just changes, traffic might not get divided between sites like we’d really like it to, and we need to start reshuffling settings again.</p>

<h2 id="using-ns-global">Using NS-Global</h2>

<p>Historically, to use NS-Global, you needed to know Javier, and email him to add your zones to NS-Global.
But that was a lot of work for Javier, answering emails, and typing, and stuff, so we decided to automate it!
I created a <a href="https://ns-global.zone/signup/">sign-up form for NS-Global</a>, so anyone hosting their own zone who wants to use us as a secondary service can just fill in their domain name and what IP addresses we should pull the zone from.</p>

<p>The particularly clever part (in my opinion) is that things like usernames or logging in or authentication with cookies seem like a lot of work, so I decided that NS-Global wouldn’t have usernames or login screens.
You just show up, enter your domain, and press submit.</p>

<p>But people on the Internet are assholes, and we can’t trust the Internet people to just let us have nice things, so we still needed some way to authenticate users as being rightfully in control of a domain before we would add or update their zone in our database; and that’s where the RNAME field in the Start of Authority record comes in.
Every DNS zone starts with a special record called the SOA record, which includes a field which is an encoded email address of the zone admin.
So when someone submits a new zone to NS-Global, we go off and look up the existing SOA for that zone, and send an email to the RNAME email address with a confirmation link to click if they agree with whomever filled out our form about a desire to use NS-Global.</p>
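<p>The only mildly fiddly part is turning the RNAME back into an email address: the first unescaped dot separates the mailbox from the domain, and dots inside the mailbox itself are escaped with a backslash.
A rough sketch of that conversion in Python (the function name is mine, not our actual code):</p>

```python
def rname_to_email(rname: str) -> str:
    """Convert an SOA RNAME field into an email address.

    The first unescaped dot separates the mailbox from the domain;
    dots inside the mailbox are escaped as '\\.'.
    """
    rname = rname.rstrip(".")  # drop the trailing root dot
    out = []
    seen_at = False
    i = 0
    while i < len(rname):
        c = rname[i]
        if c == "\\" and i + 1 < len(rname):
            out.append(rname[i + 1])  # escaped character, kept literally
            i += 2
        elif c == "." and not seen_at:
            out.append("@")  # first unescaped dot becomes the '@'
            seen_at = True
            i += 1
        else:
            out.append(c)
            i += 1
    return "".join(out)


print(rname_to_email("hostmaster.example.com."))    # hostmaster@example.com
print(rname_to_email("john\\.smith.example.com."))  # john.smith@example.com
```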

<p>Done.</p>

<p>This blog post is getting a little long, so I will probably leave other details like how we push new zones out to all the identical anycast servers in a few seconds or what BGP traffic engineering looks like to their own posts in the future.
Until then, if you run your own DNS server and want to add a second name server as a backup, feel free to <a href="https://ns-global.zone/signup/">sign up at our website</a>!</p>]]></content><author><name>Kenneth Finnegan</name></author><summary type="html"><![CDATA[]]></summary></entry><entry><title type="html">Validating BGP Prefixes using RPKI on Arista Switches</title><link href="https://blog.thelifeofkenneth.com/2020/04/validating-bgp-prefixes-using-rpki-on.html" rel="alternate" type="text/html" title="Validating BGP Prefixes using RPKI on Arista Switches" /><published>2020-04-16T08:00:00-07:00</published><updated>2020-04-16T08:00:00-07:00</updated><id>https://blog.thelifeofkenneth.com/2020/04/validating-bgp-prefixes-using-rpki-on</id><content type="html" xml:base="https://blog.thelifeofkenneth.com/2020/04/validating-bgp-prefixes-using-rpki-on.html"><![CDATA[<p>I’ve written another <a href="https://eos.arista.com/bgp-origin-validation-rpki/">blog post for the Arista EOS blog</a>.
This time, it’s on how to use the brand new EOS feature which adds support for RPKI route origin validation via the RTR protocol.
I walk through a simple example of how to enable Docker in EOS and spin up a <a href="https://www.nlnetlabs.nl/projects/rpki/routinator/">Routinator</a> instance on the switches’ loopback interface to use RPKI to validate prefixes from BGP peers.</p>]]></content><author><name>Kenneth Finnegan</name></author><summary type="html"><![CDATA[I’ve written another blog post for the Arista EOS blog. This time, it’s on how to use the brand new feature in EOS where it now supports RPKI via RTR. I walk through a simple example of how to enable Docker in EOS and spin up a Routinator instance on the switches’ loopback interface to use RPKI to validate prefixes from BGP peers.]]></summary></entry><entry><title type="html">A Simple Quality of Service Design Example</title><link href="https://blog.thelifeofkenneth.com/2020/04/a-simple-quality-of-service-design.html" rel="alternate" type="text/html" title="A Simple Quality of Service Design Example" /><published>2020-04-01T09:00:00-07:00</published><updated>2020-04-01T09:00:00-07:00</updated><id>https://blog.thelifeofkenneth.com/2020/04/a-simple-quality-of-service-design</id><content type="html" xml:base="https://blog.thelifeofkenneth.com/2020/04/a-simple-quality-of-service-design.html"><![CDATA[<p><img src="/2020/04/simpleqos/WRR.png" alt="Weighted Round Robin figure" /></p>

<p>I’ve written another blog post for the Arista corporate blog! This time, it’s on <a href="https://eos.arista.com/simple-qos-design-example/">the basics of how to turn on the Quality of Service feature in EOS</a> and start picking out various classes of traffic to treat them better/worse than normal.</p>]]></content><author><name>Kenneth Finnegan</name></author><summary type="html"><![CDATA[]]></summary></entry><entry><title type="html">Booting Linux Over HTTP</title><link href="https://blog.thelifeofkenneth.com/2020/03/booting-linux-over-http.html" rel="alternate" type="text/html" title="Booting Linux Over HTTP" /><published>2020-03-29T06:00:00-07:00</published><updated>2020-03-29T06:00:00-07:00</updated><id>https://blog.thelifeofkenneth.com/2020/03/booting-linux-over-http</id><content type="html" xml:base="https://blog.thelifeofkenneth.com/2020/03/booting-linux-over-http.html"><![CDATA[<p><img src="/2020/03/httpboot/small-thinclient.jpg" alt="Two FX160s" /></p>

<p>A couple years ago, one of my friends gave me a big pile of little Dell FX160 thin clients, which are cute little computers with low power Atom 230 processors and support for up to 3GB of RAM.
Being thin clients means they were originally meant to be diskless nodes that could boot a remote desktop application to essentially act as remote graphical consoles to applications running on a beefier server somewhere else.</p>

<p>That being said, they’re great as low power Linux boxes, and I’ve been deploying them in various projects over the years when I need a Linux box somewhere but want/need something a little more substantial than a Raspberry Pi.</p>

<p><img src="/2020/03/httpboot/small-sataboard.jpg" alt="Cute SATA SSD" /></p>

<p>The one big problem with them is that they didn’t come with the 2.5” hard disk bracket, so I needed to source those drive sled kits on eBay to add more storage than the 1GB embedded SATA drive they all came with.
Which is nominally fine; I bought a few of the kits for about $10 a piece, and for that to be the only expense to be able to deploy a 1TB 2.5” drive somewhere has been handy a few times.</p>

<p>But it always left me thinking about what I could do with the original 1GB drive in these things.
Obviously, with enough effort and hand wringing, you can get Linux installed on a 1G partition, but that feels like it’s been done before, and these are thin clients! They’re meant to depend on the network to boot!</p>

<p>Fast forward to this year, and thanks to one of their network engineers hearing <a href="https://oxide.computer/podcast/on-the-metal-6-kenneth-finnegan/" target="_blank">my interview for On the Metal</a>, I’ve been working with <a href="https://www.gandi.net/">Gandi.net</a> to help deploy one of their DNS anycast nodes in Fremont as part of the <a href="https://fcix.net/">Fremont Cabal Internet Exchange</a>.
The thing is, how they designed their anycast DNS nodes is awesome! <a href="https://news.gandi.net/en/2019/03/booting-an-anycast-dns-network/">They have a 10,000 foot view blog post about it</a>, but the tl;dr is that they don’t deploy their remote DNS nodes with an OS image on them.
Each server gets deployed with a USB key plugged into it with a custom build of <a href="https://ipxe.org/">iPXE</a>, which gives the server enough smarts to, over authenticated HTTPS, download the OS image from their central servers and run the service entirely from RAM.</p>

<p>Operationally, this is awesome because it means that when they want to update software on one of their anycast nodes, they can build the new image in advance on their provisioning server centrally, and just tell the server to reboot.
When it reboots, it automatically downloads the new image from the provisioning servers, and you’re up to date.
If something goes terribly wrong and the OS on a node becomes unresponsive? Open a remote hands ticket with the data center “please power cycle our server” and the iPXE ROM will once again download a fresh copy of the OS image to run in RAM.</p>

<p>Granted, they’ve got all sorts of awesome extra engineering involved in their system; cryptographic authentication of their boot images, local SSDs so that while the OS is stateless, their nodes don’t need to perform an entire DNS zone transfer from scratch every time they reboot, etc, etc.
Which is all well and good, but this iPXE netbooting an entire OS image over the wide Internet using HTTP is just the sort of kind-of-silly, kind-of-awesome sort of project I’ve been looking to do with these thin clients I’ve got sitting around in my apartment.</p>

<h2 id="understanding-the-boot-process">Understanding The Boot Process</h2>

<p>This left me with a few problems:</p>

<ol>
  <li>
    <p>The Gandi blog post regarding their DNS system was a 10,000 foot view conceptual overview, so they rightfully-so glossed over some of the technical specifics that weren’t important to their blog post’s message but really important for actually making it work.</p>
  </li>
  <li>
<p>I have been blissfully ignorant up until now of most of the mechanics involved with Linux booting in the gap between “The BIOS runs the bootloader” and “The Linux kernel is running with your <a href="https://en.wikipedia.org/wiki/Init">init</a> daemon running as PID 1 and your fstab mounted”</p>
  </li>
  <li>
    <p>I’m trying to do something exceedingly weird here, where there are no additional file systems to mount while the system is booting.
There’s plenty of guides available on booting Linux with an NFS or iSCSI root file system, but I’m looking at even less than that; I want the entire system just running from local RAM.</p>
  </li>
</ol>

<p>So before talking about what I ended up with, let’s talk about the journey and what I had to learn about the boot process on Linux.</p>

<p>On a typical traditional Linux host, when you power it on, the local BIOS has enough smarts to find local disks with boot sectors, and read that first sector from the disk and execute it in RAM.
That small piece of machine code then has enough smarts to load a more sophisticated bootloader like GRUB from somewhere close on the disk, which then has enough smarts to do more complicated things like load a Linux kernel and init RAM disk to boot Linux, or give the user a user interface to select which Linux kernel to boot, etc.
One of the reasons why many Unix systems had a separate /boot partition was because this chainloader between the BIOS and the full running kernel couldn’t mount more sophisticated file systems so needed a smaller and simpler partition for just the bare minimum boot files needed to get the kernel running.</p>

<p>The kernel file plus init RAM disk (often called initrd) are the two files Linux really requires to boot, and the part where my understanding was lacking.
Granted, my understanding is still pretty lacking, but the main insight I gained was that the initrd file is a packed SVR4 archive of the bare minimum of files that the Linux kernel needs to then go and mount the real root file system and switch to it to have a fully running system.
These SVR4 archives can be created using the “cpio” command as the “newc” file format, and the Linux kernel is smart enough to decompress it using gzip before mounting the archive, so we can gzip the initrd file to save bandwidth when ultimately booting the system.</p>

<p>(Related aside; there’s many different pathways from the BIOS to having the kernel and initrd files in RAM.
One of the most popular “net booting” processes, which I have used quite a bit in the past, is PXE booting, where the BIOS boot ROM in the network card itself has <i>juuuust</i> enough smarts to send out a DHCP request for a lease which includes a TFTP server and file path for a file on that TFTP server as DHCP options, and the PXE ROM downloads this file and runs it.
This file is usually pxelinux.0, part of the SYSLINUX project, which is another chainloader which then downloads the kernel and initrd files from the same TFTP server, and you’re off to the races.)</p>

<p>The missing piece for me inside the initrd file is that the kernel immediately runs a shell script in the root of the filesystem named “/init”.
This shell script is what mounts the real root file system (the one specified by the root= kernel parameter) and switches over to it, and ultimately at the very end of the /init script is where it “exec /sbin/init” to replace itself with the regular <a href="https://en.wikipedia.org/wiki/Init">init daemon</a> which you’re used to being PID 1 and being the parent of every other process on the system.</p>

<p>I had never seen this /init script before, which is understandable because it’s normally not included in your actual “/” root file system! It’s only included in the initrd archive’s “/” file system (which you can actually unpack yourself using gunzip and cpio), and disappears when it remounts the actual root and exec’s /sbin/init…
So since I want to run Linux entirely from RAM, “all” I need to do is figure out how to create my own initrd file, generate one that is not a bare minimum to mount another file system but everything I need to run my application in Linux, and figure out a simpler /init script to package with it which doesn’t need to mount any local volumes but only needs to mount all the required virtual file systems (like /proc, /sys, and /dev) and exec the real /sbin/init to start the rest of the system.</p>

<h2 id="generating-my-own-initrd-file">Generating My Own Initrd File</h2>

<p>So the first step in this puzzle for me is figuring out how to generate my own initrd file including the ENTIRE contents of a Linux install instead of just the bare minimum to get it started.
And to generate that initrd archive, I first need to create a minimal root file system that I can configure to do what I want to then pack as the initrd file we’ll be booting.</p>

<p>Thankfully, Debian has some really good <a href="https://www.debian.org/releases/stable/i386/apds03.en.html">documentation on using their debootstrap tool</a> to start with an empty folder on your computer and end up with a minimal system.
The first section of that documentation talks about partitioning the disk you’re installing Debian on, but we just need the file system, so I skipped that part and went straight to running debootstrap in an empty directory.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ sudo debootstrap buster /home/kenneth/tmp/rootfs http://ftp.us.debian.org/debian/
</code></pre></div></div>

<p>Remember that there’s plenty of Debian mirrors, so feel free to <a href="https://www.debian.org/mirror/list">pick a closer one off their list</a>.</p>

<p>Once debootstrap is done building the basic image, from a terminal we can jump into the new Linux system using chroot, which doesn’t really boot the system, but drops the terminal into it like it was the root of the currently running system, so you can interact with it <i>like</i> it’s running.
This lets us edit config files like /etc/network/interfaces, apt install needed packages, etc etc.
Pretty much just following the rest of the Debian debootstrap guide and then also doing the configuration work needed to set up whatever the system should actually be doing.
(things like setting a root password, installing ssh, configuring network interfaces, etc etc)</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ LANG=C.UTF-8 sudo chroot /home/kenneth/tmp/rootfs /bin/bash
</code></pre></div></div>

<p>Since we’re not installing this system on an actual disk, we don’t need to worry about installing the GRUB or LILO bootloader like the guide says, but I did install the Linux kernel package, since it was the easiest way to grab a built Linux kernel to pair with the final initrd file we’re creating.
Apt install linux-image-amd64 inside the chroot, then copy that vmlinuz file out of the …/boot/ directory in the new filesystem to somewhere handy.</p>

<p>The next step is to place a much simpler /init script in this new file system, so when the kernel loads this entire folder as its initrd we don’t go off and try to mount other file systems or anything.
This is the part where my friend at Gandi.net was SUPER helpful, since trying to figure out on my own each of the various virtual file systems that still need to be mounted only yielded me a lot of kernel panics.</p>

<p>So huge thanks to Arthur for giving me this chunk of shell code! Copy it into the root of the freshly debootstrapped system as /init and mark it executable (chmod +x).</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">#!/bin/sh</span>
<span class="c"># Kenneth Finnegan, 2020</span>
<span class="c"># https://blog.thelifeofkenneth.com/</span>
<span class="c"># Huge thanks to Gandi.net for most of this code</span>

<span class="nb">set</span> <span class="nt">-x</span>
<span class="nb">set</span> <span class="nt">-e</span>

<span class="c"># Create the mount points for all of the virtual file systems which don't</span>
<span class="c">#  actually map to disks, but are views into the kernel</span>
<span class="o">[</span> <span class="nt">-d</span> /dev <span class="o">]</span> <span class="o">||</span> <span class="nb">mkdir</span> <span class="nt">-m</span> 0755 /dev
<span class="o">[</span> <span class="nt">-d</span> /root <span class="o">]</span> <span class="o">||</span> <span class="nb">mkdir</span> <span class="nt">-m</span> 0700 /root
<span class="o">[</span> <span class="nt">-d</span> /sys <span class="o">]</span> <span class="o">||</span> <span class="nb">mkdir</span> /sys
<span class="o">[</span> <span class="nt">-d</span> /proc <span class="o">]</span> <span class="o">||</span> <span class="nb">mkdir</span> /proc
<span class="o">[</span> <span class="nt">-d</span> /tmp <span class="o">]</span> <span class="o">||</span> <span class="nb">mkdir</span> /tmp
<span class="nb">mkdir</span> <span class="nt">-p</span> /var/lock <span class="o">||</span> <span class="nb">true</span>

<span class="c"># Mount the required virtual file systems</span>
mount <span class="nt">-t</span> sysfs <span class="nt">-o</span> nodev,noexec,nosuid sysfs /sys
mount <span class="nt">-t</span> proc <span class="nt">-o</span> nodev,noexec,nosuid proc /proc

<span class="nv">tmpfs_size</span><span class="o">=</span>10240k
<span class="k">if</span> <span class="o">!</span> mount <span class="nt">-t</span> devtmpfs <span class="nt">-o</span> <span class="nv">size</span><span class="o">=</span><span class="nv">$tmpfs_size</span>,mode<span class="o">=</span>0755 udev /dev<span class="p">;</span> <span class="k">then
	</span><span class="nb">echo</span> <span class="s2">"W: devtmpfs not available, falling back to tmpfs for /dev"</span>
	mount <span class="nt">-t</span> tmpfs <span class="nt">-o</span> <span class="nv">size</span><span class="o">=</span><span class="nv">$tmpfs_size</span>,mode<span class="o">=</span>0755 udev /dev
	<span class="o">[</span> <span class="nt">-e</span> /dev/console <span class="o">]</span> <span class="o">||</span> <span class="nb">mknod</span> <span class="nt">-m</span> 0600 /dev/console c 5 1
	<span class="o">[</span> <span class="nt">-e</span> /dev/null <span class="o">]</span> <span class="o">||</span> <span class="nb">mknod</span> /dev/null c 1 3
<span class="k">fi
</span><span class="nb">unset </span>tmpfs_size

<span class="nb">mkdir</span> /dev/pts
mount <span class="nt">-t</span> devpts <span class="nt">-o</span> noexec,nosuid,gid<span class="o">=</span>5,mode<span class="o">=</span>0620 devpts /dev/pts <span class="o">||</span> <span class="nb">true
</span>mount <span class="nt">-t</span> tmpfs <span class="nt">-o</span> <span class="s2">"nosuid,size=20%,mode=0755"</span> tmpfs /run

<span class="c"># Set dmesg to private if you want</span>
<span class="nb">echo </span>1 <span class="o">&gt;</span> /proc/sys/kernel/dmesg_restrict

<span class="c"># Replace ourselves with the actual init daemon which will handle starting every other daemon</span>
<span class="nb">exec</span> /sbin/init
</code></pre></div></div>

<p>At this point, we’re ready to pack this filesystem into an initrd archive and give it a shot.
To create the archive, I followed <a href="https://www.kernel.org/doc/html/latest/admin-guide/initrd.html">this guide</a>, which boils down to passing cpio a list of all the file names, and then piping the output of cpio to gzip to compress the image.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cd /home/kenneth/tmp/rootfs
$ sudo find . | sudo cpio -H newc -o | gzip -9 -n &gt;~/www/initrd
</code></pre></div></div>

<p>At this point, you should have the initrd file, which is a few hundred MB compressed, and the vmlinuz file (vmlinuz being a compressed version of the usual vmlinux kernel file!) which you grabbed out of the /boot directory, and that <em>should</em> be everything you need for booting Linux on its own.
Place both of those files on a handy HTTP server to be downloaded by the client later.</p>
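<p>Any static file server will do for that; for kicking the tires, Python’s built-in server is a quick way to serve a directory over HTTP (the www directory name and port 8000 are assumptions here, not anything iPXE requires):</p>

```shell
# Serve the www/ directory over HTTP on port 8000 for iPXE to fetch from;
# any static web server (nginx, apache, ...) works just as well.
mkdir -p www   # put vmlinuz and initrd in here
python3 -m http.server 8000 --directory www >/dev/null 2>&1 &
SERVER_PID=$!
sleep 2
# Confirm the server answers; iPXE will request the files the same way.
PAGE=$(curl -s http://127.0.0.1:8000/)
kill "$SERVER_PID"
```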

<h2 id="netbooting-this-linux-image">Netbooting This Linux Image</h2>

<p>Given the initrd and kernel images, the next step is to somehow get the target system to actually load and boot these files.
Aside from the HTTP approach I’m covering here, you can use any of the more traditional booting methods, like putting these files on some local storage media and installing GRUB, or using the PXE boot ROM in your computer’s network interface to download these files from a TFTP server, etc.</p>

<p>TFTP would probably be pretty cute since many computers can support it stock, but that depends on your target system being on a subnet with a DHCP server that can hand out the right DHCP options to tell it where to look for the TFTP server.
I didn’t want to depend on DHCP, and I wanted to use HTTP, so I instead opted to use <a href="https://ipxe.org/">iPXE</a>, which is a much more sophisticated boot ROM than the typical PXE ROMs you get.</p>
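<p>For reference, if you did want the classic DHCP+TFTP route instead, the DHCP server has to point clients at the TFTP server and boot file; with dnsmasq that usually boils down to a few lines like the following (the paths and boot file name are assumptions for a BIOS client):</p>

```plaintext
# Sketch of a dnsmasq config for classic PXE booting (paths assumed)
enable-tftp
tftp-root=/srv/tftp
# BIOS clients are told to fetch this boot file over TFTP
dhcp-boot=undionly.kpxe
# plus a dhcp-range=... line if dnsmasq is also acting as the DHCP server
```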

<p>It is possible to install iPXE directly on the firmware flash of NICs, but that’s often challenging and hardware specific. Arthur also made a good point: since they boot iPXE from USB, if for some reason they need to swap the iPXE image remotely, it’s <em>MUCH</em> easier to mail out a USB flash drive and ask someone to replace it than to try and walk them through reflashing the firmware on a NIC over the phone…
I’m not going to be using a USB drive, since these thin clients happen to have convenient 1GB SSDs in them already, but it’s the same image.
Instead of dd’ing the ipxe.usb image onto a flash drive, I just temporarily booted Linux on the thin clients and dd’ed the iPXE ROM onto the internal /dev/sda.</p>
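<p>The write itself is just a raw dd of the image across the start of the device. As a safe-to-run illustration, scratch files stand in for the image and the disk below; on the real thin client the of= target is /dev/sda, run as root from a live Linux environment (and double-check the device name with lsblk first, since this clobbers the disk):</p>

```shell
# Illustration only: scratch files stand in for bin/ipxe.usb and the
# target disk; on real hardware the of= target is the internal /dev/sda.
dd if=/dev/zero of=ipxe-demo.usb bs=1K count=512 2>/dev/null  # stand-in image
dd if=ipxe-demo.usb of=disk.img bs=1M conv=fsync 2>/dev/null
cmp ipxe-demo.usb disk.img && echo "image written verbatim"
```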

<p>The stock iPXE image is pretty generic, and like a normal PXE ROM sends out a DHCP request for a local network boot image to download.
This isn’t what we want here, so we’re definitely going to need to build our own iPXE binary in the end, but I started with the stock ROM because it allows you to hit control-B during the boot process and interactively poke at the iPXE command line, and manually step through the entire process of configuring the network, downloading the Linux kernel, downloading the initrd file, and booting them.</p>

<p>So before building my own custom ROM, I burned iPXE onto a USB flash drive and poked at the iPXE console with the following commands on my apartment network:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>dhcp
kernel http://example.com/vmlinux1
initrd http://example.com/init1
boot
</code></pre></div></div>

<p>And that was enough to start iterating on my initrd file to get it to what I wanted.
Since I was still doing this in my apartment which has a DHCP server, I was able to ask iPXE to automatically configure the network with the “dhcp” command, then download a kernel and initrd file, and then finally boot with the two files it just downloaded.</p>

<p>So at this point I was able to boot the built Linux image interactively from the iPXE console, and had a fully running Linux system in RAM, which was kind of awesome. But I wanted to fully automate the iPXE booting process, which means building a custom image with an embedded “iPXE script”, which is essentially just a list of commands for iPXE to run to configure the network interface, download the boot files, and boot.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#!ipxe
ifopen net0
set net0/ip 192.0.2.100
set net0/netmask 255.255.255.0
set net0/gateway 192.0.2.1
set net0/dns 192.0.2.1

echo Configuring network...
sleep 3

kernel http://example.com/vmlinux1
initrd http://example.com/init1

echo And away we go!
boot
</code></pre></div></div>

<p>So given that script, we follow the <a href="https://ipxe.org/download">iPXE instructions to download their source</a> using git and install their build dependencies (which I apparently already had on my system from past projects, so good luck…). The key step is that when performing the final build, we pass make the path to our iPXE boot script file, so it gets embedded in the image as what to run.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cd ~/src/ipxe/src
$ make bin/ipxe.usb EMBED=./bootscript
</code></pre></div></div>

<p>And at this point, the ipxe/src/bin folder contains the built ipxe.usb image with our custom boot script embedded in it! Since the internal SATA disk is close enough to a USB drive from a booting perspective, that’s the variant of the ROM I’m using.</p>

<p>So given this custom iPXE ROM, I manually booted a live Linux image on the thin client, used dd to write the ROM to /dev/sda which is the internal 1G SSD, and the box is ready to go!</p>

<p>Now, when I power on the box, the BIOS sees that the internal 1G SSD is bootable, so it boots that, which is iPXE, which runs the embedded script we handed it, which configures the network interface, downloads our custom initrd file and the Linux kernel from my HTTP server, and boots those.
Linux then unpacks our initrd file and runs the /init script embedded in it, which mounts the virtual file systems like /proc, /sys, and /dev, <i>doesn’t</i> try to mount any other local file system, and finally exec’s /sbin/init, which in the case of Debian happens to be systemd, and we’ve got a fully running system in RAM!</p>

<p><a href="https://www.youtube.com/watch?v=7ppdut16aYA">Video of generally what that looks like</a>:</p>
<iframe allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen="" frameborder="0" height="315" src="https://www.youtube.com/embed/7ppdut16aYA" width="560"></iframe>

<p>So once again, thanks to Arthur from Gandi.net for the original idea and gentle nudges in the right direction when I got stuck.</p>

<p>Of course, the next thing to do is start playing “disk space golf” with the OS image to see how small I can make the initrd file, since the smaller the initrd file, the more RAM that is left over for running the application in the end!  And actually doing something useful with one of these boxes running iPXE…
a topic for another blog post.</p>

<p>Update: One thing to note is that this documentation is for the minimum viable “booting Linux over HTTP”.
iPXE does support crypto such as HTTPS, client TLS certificates for client authentication, and code signing.
<a href="https://ipxe.org/crypto">More details can be found in their documentation</a>.</p>]]></content><author><name>Kenneth Finnegan</name></author><summary type="html"><![CDATA[]]></summary></entry></feed>