Index | Archives | Atom Feed | RSS Feed

Introducing Amutable

Today, we announce Amutable, our ✨ new ✨ company. We – @blixtra@hachyderm.io, @brauner@mastodon.social, @davidstrauss@mastodon.social, @rodrigo_rata@mastodon.social, @michaelvogt@mastodon.social, @pothos@fosstodon.org, @zbyszek@fosstodon.org, @daandemeyer@mastodon.social @cyphar@mastodon.social, @jrocha@floss.social and yours truly – are building the 🚀 next generation of Linux systems, with integrity, determinism, and verification – every step of the way.

For more information see → https://amutable.com/blog/introducing-amutable

Mastodon Stories for systemd v259

On Dec 17 we released systemd v259 into the wild.

In the weeks leading up to that release (and since then) I have posted a series of serieses of posts to Mastodon about key new features in this release, under the #systemd259 hash tag. In case you aren't using Mastodon, but would like to read up, here's a list of all 25 posts:

Post #1: systemd-resolved Hooks
Post #2: dlopen() everything
Post #3: systemd-analyze dlopen-metadata
Post #4: run0 --empower
Post #5: systemd-vmspawn --bind-user=
Post #6: Musl libc support
Post #7: systemd-repart without device name
Post #8: Parallel kmod loading in systemd-modules-load.service
Post #9: NvPCR Support
Post #10: systemd-analyze nvcpcrs
Post #11: systemd-repart Varlink IPC API
Post #12: systemd-vmspawn block device serial
Post #13: systemd-repart --defer-partitions-empty= + --defer-partitions-factory-reset=
Post #14: userdb support for UUID queries
Post #15: Wallclock time in service completion logging
Post #16: systemd-firstboot --prompt-keymap-auto
Post #17: $LISTEN_PIDFDID
Post #18: Incremental partition rescanning
Post #19: ExecReloadPost=
Post #20: Transaction order cycle tracking
Post #21: systemd-firstboot facelift
Post #22: Per-User systemd-machined + systemd-importd
Post #23: systemd-udevd's OPTIONS="dump-json"
Post #24: systemd-resolved's DumpDNSConfiguration() IPC Call
Post #25: DHCP Server EmitDomain= + Domain=

I intend to do a similar series of serieses of posts for the next systemd release (v260), hence if you haven't left tech Twitter for Mastodon yet, now is the opportunity.

My series for v260 will begin in a few weeks most likely, under the #systemd260 hash tag.

In case you are interested, here is the corresponding blog story for systemd v258, here for v257, and here for v256.

Mastodon Stories for systemd v258

Already on Sep 17 we released systemd v258 into the wild.

In the weeks leading up to that release I have posted a series of serieses of posts to Mastodon about key new features in this release, under the #systemd258 hash tag. It was my intention to post a link list here on this blog right after completing that series, but I simply forgot! Hence, in case you aren't using Mastodon, but would like to read up, here's a list of all 37 posts:

Post #1: systemctl start -v
Post #2: Home areas
Post #3: systemd-resolved delegate zones
Post #4: Foreign UID range
Post #5: /etc/hostname ??? wildcards
Post #6: Quota on /tmp/
Post #7: ConcurretnySoftMax= + ConcurrencyHardMax=
Post #8: Product UUID in ConditionHost=
Post #9: Context OSC terminal sequences
Post #10: uki-url Boot Loader Spec Type #1 fields
Post #11: rd.break= boot breakpoints
Post #12: Factory Reset Rework
Post #13: systemd-resolved DNS Configuration Change IPC Subscription API
Post #14: io.systemd.boot-entries.extra= SMBIOS Type #11 Key
Post #15: Bring Your Own Firmware
Post #16: userdb record aliases
Post #17: systemd-validatefs and its xattrs
Post #18: Offline Signing of Artifacts
Post #19: PAMName= in services hooked up to ask-password protocol
Post #20: x-systemd.graceful-option= mount option
Post #21: systemd-userdb-load-credentials.service
Post #22: systemd-vmspawn --grow-image=a
Post #23: systemd-notify --fork
Post #24: $TERM auto-discovery
Post #25: Rebooting/Powering off systemd-nspawn containers via hotkey
Post #26: ExecStart= | modifier
Post #27: systemctl reload reloads confexts
Post #28: Server side userdb filtering
Post #29: Quota on StateDirectory= and friends
Post #30: systemd-analyze unit-shell
Post #31: /etc/issue.d/ drop-in for AF_VSOCK CID
Post #32: fsverity in systemd-repart
Post #33: AcceptFileDescriptor= + PassPIDFD=
Post #34: Tab completion in interactive systemd-firstboot
Post #35: rd.systemd.pull= kernel command line option/Boot into tarball
Post #36: ConditionKernelModuleLoaded=
Post #37: systemd-analyze chid
Post #38: homectl list-signing-keys/get-signing-key/add-signing-key/remove-signing-key
Post #39: DDI Image Filters
Post #40: Android USB Debugging udev rules
Post #41: systemd-vmspawn's --smbios11= switch
Post #42: $MAINPIDFDID + $MANAGERPIDFDID
Post #43: $DEBUG_INVOCATION=1 Respected by all systemd services
Post #44: LoaderDeviceURL EFI Variable and systemd.pull='s origin kernel command line switch
Post #45: cgroupv1 removal
Post #46: ProtectHostname=private
Post #47: homectl adopt + homectl register
Post #48: systemd-machined Varlink APIs
Post #49: DeferTrigger and "lenient" job mode
Post #50: Automatic Removal of foreign UID owned delegate subgroups in the per-user service manager
Post #51: Per-user ask-password protocol
Post #52: PrivateUsers=full
Post #53: LoadCredentialEncrypted= in the per-user service manager
Post #54: dissect_image builtin in systemd-udevd
Post #55: BPF Delegation via Tokens

I intend to do a similar series of serieses of posts for the next systemd release (v259), hence if you haven't left tech Twitter for Mastodon yet, now is the opportunity.

We intend to shorten the release cycle a bit for the future, and in fact managed to tag v259-rc1 already yesterday, just 2 months after v258. Hence, my series for v259 will begin soon, under the #systemd259 hash tag.

In case you are interested, here is the corresponding blog story for systemd v257, and here for v256.

ASG! 2025 CfP Closes Tomorrow!

The All Systems Go! 2025 Call for Participation Closes Tomorrow!

The Call for Participation (CFP) for All Systems Go! 2025 will close tomorrow, on 13th of June! We’d like to invite you to submit your proposals for consideration to the CFP submission site quickly!

Announcing systemd v257

Last week we released systemd v257 into the wild.

In the weeks leading up to this release (and the week after) I have posted a series of serieses of posts to Mastodon about key new features in this release, under the #systemd257 hash tag. In case you aren't using Mastodon, but would like to read up, here's a list of all 37 posts:

Post #1: Fully Locked Accounts with systemd-sysusers
Post #2: Combined Signed PCR and Locally Managed PCR Policies for Disk Encryption
Post #3: Progress Indication via Terminal ANSI Sequence
Post #4: Multi-Profile UKIs
Post #5: The New sd-varlink & sd-json APIs in libsystemd
Post #6: Querying for Passwords in User Scope
Post #7: Secure Attention Key Logic in systemd-logind
Post #8: systemd-nspawn --bind-user= Now Copies User's SSH Key
Post #9: The New DeferReactivation= Switch in .timer Units
Post #10: Support for the New IPE LSM
Post #11: Environment Variables for Shell Prompt Prefix/Suffix
Post #12: sysctl Conflict Detection via eBPF
Post #13: initrd and µcode UKI Add-Ons
Post #14: SecureBoot Signing with the New systemd-sbsign Tool
Post #15: Managed Access to hidraw devices in systemd-logind
Post #16: Fuzzy Filtering in userdbctl
Post #17: MAC Address Based Alternative Network Interface Names
Post #18: Conditional Copying/Symlinking in tmpfiles.d/
Post #19: Automatic Service Restarts in Debug Mode
Post #20: Filtering by Invocation ID in journalctl
Post #21: Supplement Partitions in repart.d/
Post #22: DeviceTree Matching in UKIs
Post #23: The New ssh-exec: Protocol in varlinkctl
Post #24: SecureBoot Key Enrollment Preparation with bootctl
Post #25: Automatically Installing confext/sysext/portable/VMs/container Images at Boot
Post #26: Designated Maintenance Time in systemd-logind
Post #27: PID Namespacing in Service Management
Post #28: Marking Experimental OS Releases in /etc/os-release
Post #29: Decoding Capability Masks with systemd-analyze
Post #30: Investigating Passed SMBIOS Type #11 Data
Post #31: Initializing Partitions from Character Devices in repart.d/
Post #32: Entering Namespaces to Generate Stacktraces
Post #33: ID Mapped Mounts for Per-Service Directories
Post #34: A Daemon for systemd-sysupdate
Post #35: User Record Modifications without Administrator Consent in systemd-homed
Post #36: DNR DHCP Support
Post #37: Name Based AF_VSOCK ssh Access

I intend to do a similar series of serieses of posts for the next systemd release (v258), hence if you haven't left tech Twitter for Mastodon yet, now is the opportunity.

Announcing systemd v256

Yesterday evening we released systemd v256 into the wild. While other projects, such as Firefox are just about to leave the 7bit world and enter 8bit territory, we already entered 9bit version territory! For details about the release, see our announcement mail.

In the weeks leading up to this release I have posted a series of serieses of posts to Mastodon about key new features in this release. Mastodon has its goods and its bads. Among the latter is probably that it isn't that great for posting listings of serieses of posts. Hence let me provide you with a list of the relevant first post in the series of posts here:

Post #1: .v/ Directories
Post #2: User-Scoped Encrypted Service Credentials
Post #3: X_SYSTEMD_UNIT_ACTIVE= sd_notify() Messages
Post #4: System-wide ProtectSystem=
Post #5: run0 As sudo Replacement
Post #6: System Credentials
Post #7: Unprivileged DDI Mounts + Unprivileged systemd-nspawn
Post #8: ssh into systemd-homed Accounts
Post #9: systemd-vmspawn
Post #10: Mutable systemd-sysext
Post #11: Network Device Ownership
Post #12: systemctl sleep
Post #13: systemd-ssh-generator
Post #14: systemd-cryptenroll without device argument
Post #15: dlopen() ELF Metadata
Post #16: Capsules

I intend to do a similar series of serieses of posts for the next systemd release (v257), hence if you haven't left tech Twitter for Mastodon yet, now is the opportunity.

And while I have you: note that the All Systems Go 2024 Conference (Berlin) Call for Papers ends 😲 THIS WEEK 🤯! Hence, HURRY, and get your submissions in now, for the best low-level Linux userspace conference around!

A re-introduction to mkosi -- A Tool for Generating OS Images

This is a guest post written by Daan De Meyer, systemd and mkosi maintainer

Almost 7 years ago, Lennart first wrote about mkosi on this blog. Some years ago, I took over development and there's been a huge amount of changes and improvements since then. So I figure this is a good time to re-introduce mkosi.

mkosi stands for Make Operating System Image. It generates OS images that can be used for a variety of purposes.

If you prefer watching a video over reading a blog post, you can also watch my presentation on mkosi at All Systems Go 2023.

What is mkosi?

mkosi was originally written as a tool to simplify hacking on systemd and for experimenting with images using many of the new concepts being introduced in systemd at the time. In the meantime, it has evolved into a general purpose image builder that can be used in a multitude of scenarios.

Instructions to install mkosi can be found in its readme. We recommend running the latest version to take advantage of all the latest features and bug fixes. You'll also need bubblewrap and the package manager of your favorite distribution to get started.

At its core, the workflow of mkosi can be divided into 3 steps:

Generate an OS tree for some distribution by installing a set of packages.
Package up that OS tree in a variety of output formats.
(Optionally) Boot the resulting image in qemu or systemd-nspawn.

Images can be built for any of the following distributions:

Fedora Linux
Ubuntu
OpenSUSE
Debian
Arch Linux
CentOS Stream
RHEL
Rocky Linux
Alma Linux

And the following output formats are supported:

GPT disk images built with systemd-repart
Tar archives
CPIO archives (for building initramfs images)
USIs (Unified System Images which are full OS images packed in a UKI)
Sysext, confext and portable images
Directory trees

For example, to build an Arch Linux GPT disk image and boot it in qemu, you can run the following command:

$ mkosi -d arch -p systemd -p udev -p linux -t disk qemu

To instead boot the image in systemd-nspawn, replace qemu with boot:

$ mkosi -d arch -p systemd -p udev -p linux -t disk boot

The actual image can be found in the current working directory named image.raw. However, using a separate output directory is recommended which is as simple as running mkdir mkosi.output.

To rebuild the image after it's already been built once, add -f to the command line before the verb to rebuild the image. Any arguments passed after the verb are forwarded to either systemd-nspawn or qemu itself. To build the image without booting it, pass build instead of boot or qemu or don't pass a verb at all.

By default, the disk image will have an appropriately sized root partition and an ESP partition, but the partition layout and contents can be fully customized using systemd-repart by creating partition definition files in mkosi.repart/. This allows you to customize the partition as you see fit:

The root partition can be encrypted.
Partition sizes can be customized.
Partitions can be protected with signed dm-verity.
You can opt out of having a root partition and only have a /usr partition instead.
You can add various other partitions, e.g. an XBOOTLDR partition or a swap partition.
...

As part of building the image, we'll run various tools such as systemd-sysusers, systemd-firstboot, depmod, systemd-hwdb and more to make sure the image is set up correctly.

Configuring mkosi image builds

Naturally with extended use you don't want to specify all settings on the command line every time, so mkosi supports configuration files where the same settings that can be specified on the command line can be written down.

For example, the command we used above can be written down in a configuration file mkosi.conf:

[Distribution]
Distribution=arch

[Output]
Format=disk

[Content]
Packages=
        systemd
        udev
        linux

Like systemd, mkosi uses INI configuration files. We also support dropins which can be placed in mkosi.conf.d. Configuration files can also be conditionalized using the [Match] section. For example, to only install a specific package on Arch Linux, you can write the following to mkosi.conf.d/10-arch.conf:

[Match]
Distribution=arch

[Content]
Packages=pacman

Because not everything you need will be supported in mkosi, we support running scripts at various points during the image build process where all extra image customization can be done. For example, if it is found, mkosi.postinst is called after packages have been installed. Scripts are executed on the host system by default (in a sandbox), but can be executed inside the image by suffixing the script with .chroot, so if mkosi.postinst.chroot is found it will be executed inside the image.

To add extra files to the image, you can place them in mkosi.extra in the source directory and they will be automatically copied into the image after packages have been installed.

Bootable images

If the necessary packages are installed, mkosi will automatically generate a UEFI/BIOS bootable image. As mkosi is a systemd project, it will always build UKIs (Unified Kernel Images), except if the image is BIOS-only (since UKIs cannot be used on BIOS). The initramfs is built like a regular image by installing distribution packages and packaging them up in a CPIO archive instead of a disk image. Specifically, we do not use dracut, mkinitcpio or initramfs-tools to generate the initramfs from the host system. ukify is used to assemble all the individual components into a UKI.

If you don't want mkosi to generate a bootable image, you can set Bootable=no to explicitly disable this logic.

Using mkosi for development

The main requirements to use mkosi for development is that we can build our source code against the image we're building and install it into the image we're building. mkosi supports this via build scripts. If a script named mkosi.build (or mkosi.build.chroot) is found, we'll execute it as part of the build. Any files put by the build script into $DESTDIR will be installed into the image. Required build dependencies can be installed using the BuildPackages= setting. These packages are installed into an overlay which is put on top of the image when running the build script so the build packages are available when running the build script but don't end up in the final image.

An example mkosi.build.chroot script for a project using meson could look as follows:

#!/bin/sh
meson setup "$BUILDDIR" "$SRCDIR"
ninja -C "$BUILDDIR"
if ((WITH_TESTS)); then
    meson test -C "$BUILDDIR"
fi
meson install -C "$BUILDDIR"

Now, every time the image is built, the build script will be executed and the results will be installed into the image.

The $BUILDDIR environment variable points to a directory that can be used as the build directory for build artifacts to allow for incremental builds if the build system supports it.

Of course, downloading all packages from scratch every time and re-installing them again every time the image is built is rather slow, so mkosi supports two modes of caching to speed things up.

The first caching mode caches all downloaded packages so they don't have to be downloaded again on subsequent builds. Enabling this is as simple as running mkdir mkosi.cache.

The second mode of caching caches the image after all packages have been installed but before running the build script. On subsequent builds, mkosi will copy the cache instead of reinstalling all packages from scratch. This mode can be enabled using the Incremental= setting. While there is some rudimentary cache invalidation, the cache can also forcibly be rebuilt by specifying -ff on the command line instead of -f.

Note that when running on a btrfs filesystem, mkosi will automatically use subvolumes for the cached images which can be snapshotted on subsequent builds for even faster rebuilds. We'll also use reflinks to do copy-on-write copies where possible.

With this setup, by running mkosi -f qemu in the systemd repository, it takes about 40 seconds to go from a source code change to a root shell in a virtual machine running the latest systemd with your change applied. This makes it very easy to test changes to systemd in a safe environment without risk of breaking your host system.

Of course, while 40 seconds is not a very long time, it's still more than we'd like, especially if all we're doing is modifying the kernel command line. That's why we have the KernelCommandLineExtra= option to configure kernel command line options that are passed to the container or virtual machine at runtime instead of being embedded into the image. These extra kernel command line options are picked up when the image is booted with qemu's direct kernel boot (using -append), but also when booting a disk image in UEFI mode (using SMBIOS). The same applies to systemd credentials (using the Credentials= setting). These settings allow configuring the image without having to rebuild it, which means that you only have to run mkosi qemu or mkosi boot again afterwards to apply the new settings.

Building images without root privileges and loop devices

By using newuidmap/newgidmap and systemd-repart, mkosi is able to build images without needing root privileges. As long as proper subuid and subgid mappings are set up for your user in /etc/subuid and /etc/subgid, you can run mkosi as your regular user without having to switch to root.

Note that as of the writing of this blog post this only applies to the build and qemu verbs. Booting the image in a systemd-nspawn container with mkosi boot still needs root privileges. We're hoping to fix this in an future systemd release.

Regardless of whether you're running mkosi with root or without root, almost every tool we execute is invoked in a sandbox to isolate as much of the build process from the host as possible. For example, /etc and /var from the host are not available in this sandbox, to avoid host configuration inadvertently affecting the build.

Because systemd-repart can build disk images without loop devices, mkosi can run from almost any environment, including containers. All that's needed is a UID range with 65536 UIDs available, either via running as the root user or via /etc/subuid and newuidmap. In a future systemd release, we're hoping to provide an alternative to newuidmap and /etc/subuid to allow running mkosi from all containers, even those with only a single UID available.

Supporting older distributions

mkosi depends on very recent versions of various systemd tools (v254 or newer). To support older distributions, we implemented so called tools trees. In short, mkosi can first build a tools image for you that contains all required tools to build the actual image. This can be enabled by adding ToolsTree=default to your mkosi configuration. Building a tools image does not require a recent version of systemd.

In the systemd mkosi configuration, we automatically use a tools tree if we detect your distribution does not have the minimum required systemd version installed.

Configuring variants of the same image using profiles

Profiles can be defined in the mkosi.profiles/ directory. The profile to use can be selected using the Profile= setting (or --profile=) on the command line. A profile allows you to bundle various settings behind a single recognizable name. Profiles can also be matched on if you want to apply some settings only to a few profiles.

For example, you could have a bootable profile that sets Bootable=yes, adds the linux and systemd-boot packages and configures Format=disk to end up with a bootable disk image when passing --profile bootable on the kernel command line.

Building system extension images

System extension images may – dynamically at runtime — extend the base system with an overlay containing additional files.

To build system extensions with mkosi, we need a base image on top of which we can build our extension.

To keep things manageable, we'll make use of mkosi's support for building multiple images so that we can build our base image and system extension in one go.

We start by creating a temporary directory with a base configuration file mkosi.conf with some shared settings:

[Output]
OutputDirectory=mkosi.output
CacheDirectory=mkosi.cache

Now let's continue with the base image definition by writing the following to mkosi.images/base/mkosi.conf:

[Output]
Format=directory

[Content]
CleanPackageMetadata=no
Packages=systemd
         udev

We use the directory output format here instead of the disk output so that we can build our extension without needing root privileges.

Now that we have our base image, we can define a sysext that builds on top of it by writing the following to mkosi.images/btrfs/mkosi.conf:

[Config]
Dependencies=base

[Output]
Format=sysext
Overlay=yes

[Content]
BaseTrees=%O/base
Packages=btrfs-progs

BaseTrees= point to our base image and Overlay=yes instructs mkosi to only package the files added on top of the base tree.

We can't sign the extension image without a key. We can generate one by running mkosi genkey which will generate files that are automatically picked up when building the image.

Finally, you can build the base image and the extensions by running mkosi -f. You'll find btrfs.raw in mkosi.output which is the extension image.

Various other interesting features

To sign any generated UKIs for secure boot, put your secure boot key and certificate in mkosi.key and mkosi.crt and enable the SecureBoot= setting. You can also run mkosi genkey to have mkosi generate a key and certificate itself.
The Ephemeral= setting can be enabled to boot the image in an ephemeral copy that is thrown away when the container or virtual machine exits.
ShimBootloader= and BiosBootloader= settings are available to configure shim and grub installation if needed.
mkosi can boot directory trees in a virtual using virtiofsd. This is very useful for quickly rebuilding an image and booting it as the image does not have to be packed up as a disk image.
...

There's many more features that we won't go over in detail here in this blog post. Learn more about those by reading the documentation.

Conclusion

I'll finish with a bunch of links to more information about mkosi and related tooling:

ASG! 2023 CfP Closes Soon

The All Systems Go! 2023 Call for Participation Closes in Three Days!

The Call for Participation (CFP) for All Systems Go! 2023 will close in three days, on 7th of July! We’d like to invite you to submit your proposals for consideration to the CFP submission site quickly!

ASG image

All topics relevant to foundational open-source Linux technologies are welcome. In particular, however, we are looking for proposals including, but not limited to, the following topics:

The CFP will close on July 7th, 2023. A response will be sent to all submitters on or before July 14th, 2023. The conference takes place in 🗺️ Berlin, Germany 🇩🇪 on Sept. 13-14th.

All Systems Go! 2023 is all about foundational open-source Linux technologies. We are primarily looking for deeply technical talks by and for developers, engineers and other technical roles.

We focus on the userspace side of things, so while kernel topics are welcome they must have clear, direct relevance to userspace. The following is a non-comprehensive list of topics encouraged for 2023 submissions:

Image-Based Linux 🖼️
Secure and Measured Boot 📏
TPM-Based Local/Remote Attestation, Encryption, Authentication 🔑
Low-level container executors and infrastructure ⚙️.
IoT, embedded and server Linux infrastructure
Reproducible builds 🔧
Package management, OS, container 📦, image delivery and updating
Building Linux devices and applications 🏗️
Low-level desktop 💻 technologies
Networking 🌐
System and service management 🚀
Tracing and performance measuring 🔍
IPC and RPC systems 🦜
Security 🔐 and Sandboxing 🏖️

For more information please visit our conference website!

Linux Boot Partitions

💽 Linux Boot Partitions and How to Set Them Up 🚀

Let’s have a look how traditional Linux distributions set up /boot/ and the ESP, and how this could be improved.

How Linux distributions traditionally have been setting up their “boot” file systems has been varying to some degree, but the most common choice has been to have a separate partition mounted to /boot/. Usually the partition is formatted as a Linux file system such as ext2/ext3/ext4. The partition contains the kernel images, the initrd and various boot loader resources. Some distributions, like Debian and Ubuntu, also store ancillary files associated with the kernel here, such as kconfig or System.map. Such a traditional boot partition is only defined within the context of the distribution, and typically not immediately recognizable as such when looking just at the partition table (i.e. it uses the generic Linux partition type UUID).

With the arrival of UEFI a new partition relevant for boot appeared, the EFI System Partition (ESP). This partition is defined by the firmware environment, but typically accessed by Linux to install or update boot loaders. The choice of file system is not up to Linux, but effectively mandated by the UEFI specifications: vFAT. In theory it could be formatted as other file systems too. However, this would require the firmware to support file systems other than vFAT. This is rare and firmware specific though, as vFAT is the only file system mandated by the UEFI specification. In other words, vFAT is the only file system which is guaranteed to be universally supported.

There’s a major overlap of the type of the data typically stored in the ESP and in the traditional boot partition mentioned earlier: a variety of boot loader resources as well as kernels/initrds.

Unlike the traditional boot partition, the ESP is easily recognizable in the partition table via its GPT partition type UUID. The ESP is also a shared resource: all OSes installed on the same disk will share it and put their boot resources into them (as opposed to the traditional boot partition, of which there is one per installed Linux OS, and only that one will put resources there).

To summarize, the most common setup on typical Linux distributions is something like this:

Type	Linux Mount Point	File System Choice
Linux “Boot” Partition	`/boot/`	Any Linux File System, typically ext2/ext3/ext4
ESP	`/boot/efi/`	vFAT

As mentioned, not all distributions or local installations agree on this. For example, it’s probably worth mentioning that some distributions decided to put kernels onto the root file system of the OS itself. For this setup to work the boot loader itself [sic!] must implement a non-trivial part of the storage stack. This may have to include RAID, storage drivers, networked storage, volume management, disk encryption, and Linux file systems. Leaving aside the conceptual argument that complex storage stacks don’t belong in boot loaders there are very practical problems with this approach. Reimplementing the Linux storage stack in all its combinations is a massive amount of work. It took decades to implement what we have on Linux now, and it will take a similar amount of work to catch up in the boot loader’s reimplementation. Moreover, there’s a political complication: some Linux file system communities made clear they have no interest in supporting a second file system implementation that is not maintained as part of the Linux kernel.

What’s interesting is that the /boot/efi/ mount point is nested below the /boot/ mount point. This effectively means that to access the ESP the Boot partition must exist and be mounted first. A system with just an ESP and without a Boot partition hence doesn’t fit well into the current model. The Boot partition will also have to carry an empty “efi” directory that can be used as the inner mount point, and serves no other purpose.

Given that the traditional boot partition and the ESP may carry similar data (i.e. boot loader resources, kernels, initrds) one may wonder why they are separate concepts. Historically, this was the easiest way to make the pre-UEFI way how Linux systems were booted compatible with UEFI: conceptually, the ESP can be seen as just a minor addition to the status quo ante that way. Today, primarily two reasons remained:

Some distributions see a benefit in support for complex Linux file system concepts such as hardlinks, symlinks, SELinux labels/extended attributes and so on when storing boot loader resources. – I personally believe that making use of features in the boot file systems that the firmware environment cannot really make sense of is very clearly not advisable. The UEFI file system APIs know no symlinks, and what is SELinux to UEFI anyway? Moreover, putting more than the absolute minimum of simple data files into such file systems immediately raises questions about how to authenticate them comprehensively (including all fancy metadata) cryptographically on use (see below).
On real-life systems that ship with non-Linux OSes the ESP often comes pre-installed with a size too small to carry multiple Linux kernels and initrds. As growing the size of an existing ESP is problematic (for example, because there’s no space available immediately after the ESP, or because some low-quality firmware reacts badly to the ESP changing size) placing the kernel in a separate, secondary partition (i.e. the boot partition) circumvents these space issues.

File System Choices

We already mentioned that the ESP effectively has to be vFAT, as that is what UEFI (more or less) guarantees. The file system choice for the boot partition is not quite as restricted, but using arbitrary Linux file systems is not really an option either. The file system must be accessible by both the boot loader and the Linux OS. Hence only file systems that are available in both can be used. Note that such secondary implementations of Linux file systems in the boot environment – limited as they may be – are not typically welcomed or supported by the maintainers of the canonical file system implementation in the upstream Linux kernel. Modern file systems are notoriously complicated and delicate and simply don’t belong in boot loaders.

In a trusted boot world, the two file systems for the ESP and the /boot/ partition should be considered untrusted: any code or essential data read from them must be authenticated cryptographically before use. And even more, the file system structures themselves are also untrusted. The file system driver reading them must be careful not to be exploitable by a rogue file system image. Effectively this means a simple file system (for which a driver can be more easily validated and reviewed) is generally a better choice than a complex file system (Linux file system communities made it pretty clear that robustness against rogue file system images is outside of their scope and not what is being tested for.).

Some approaches tried to address the fact that boot partitions are untrusted territory by encrypting them via a mechanism compatible to LUKS, and adding decryption capabilities to the boot loader so it can access it. This misses the point though, as encryption does not imply authentication, and only authentication is typically desired. The boot loader and kernel code are typically Open Source anyway, and hence there’s little value in attempting to keep secret what is already public knowledge. Moreover, encryption implies the existence of an encryption key. Physically typing in the decryption key on a keyboard might still be acceptable on desktop systems with a single human user in front, but outside of that scenario unlock via TPM, PKCS#11 or network services are typically required. And even on the desktop FIDO2 unlocking is probably the future. Implementing all the technologies these unlocking mechanisms require in the boot loader is not realistic, unless the boot loader shall become a full OS on its own as it would require subsystems for FIDO2, PKCS#11, USB, Bluetooth network, smart card access, and so on.

File System Access Patterns

Note that traditionally both mentioned partitions were read-only during most parts of the boot. Only later, once the OS is up, write access was required to implement OS or boot loader updates. In today’s world things have become a bit more complicated. A modern OS might want to require some limited write access already in the boot loader, to implement boot counting/boot assessment/automatic fallback (e.g., if the same kernel fails to boot 3 times, automatically revert to older kernel), or to maintain an early storage-based random seed. This means that even though the file system is mostly read-only, we need limited write access after all.

vFAT cannot compete with modern Linux file systems such as btrfs when it comes to data safety guarantees. It’s not a journaled file system, does not use CoW or any form of checksumming. This means when used for the system boot process we need to be particularly careful when accessing it, and in particular when making changes to it (i.e., trying to keep changes local to single sectors). It is essential to use write patterns that minimize the chance of file system corruption. Checking the file system (“fsck”) before modification (and probably also reading) is important, as is ensuring the file system is put into a “clean” state as quickly as possible after each modification.

Code quality of the firmware in typical systems is known to not always be great. When relying on the file system driver included in the firmware it’s hence a good idea to limit use to operations that have a better chance to be correctly implemented. For example, when writing from the UEFI environment it might be wise to avoid any operation that requires allocation algorithms, but instead focus on access patterns that only override already written data, and do not require allocation of new space for the data.

Besides write access from the boot loader code (as described above) these file systems will require write access from the OS, to facilitate boot loader and kernel/initrd updates. These types of accesses are generally not fully random accesses (i.e., never partial file updates) but usually mean adding new files as whole, and removing old files as a whole. Existing files are typically not modified once created, though they might be replaced wholly by newer versions.

Boot Loader Updates

Note that the update cycle frequencies for boot loaders and for kernels/initrds are probably similar these days. While kernels are still vastly more complex than boot loaders, security issues are regularly found in both. In particular, as boot loaders (through “shim” and similar components) carry certificate/keyring and denylist information, which typically require frequent updates. Update cycles hence have to be expected regularly.

Boot Partition Discovery

The traditional boot partition was not recognizable by looking just at the partition table. On MBR systems it was directly referenced from the boot sector of the disk, and on EFI systems from information stored in the ESP. This is less than ideal since by losing this entrypoint information the system becomes unbootable. It’s typically a better, more robust idea to make boot partitions recognizable as such in the partition table directly. This is done for the ESP via the GPT partition type UUID. For traditional boot partitions this was not done though.

Current Situation Summary

Let’s try to summarize the above:

Currently, typical deployments use two distinct boot partitions, often using two distinct file system implementations
Firmware effectively dictates existence of the ESP, and the use of vFAT
In userspace view: the ESP mount is nested below the general Boot partition mount
Resources stored in both partitions are primarily kernel/initrd, and boot loader resources
The mandatory use of vFAT brings certain data safety challenges, as does quality of firmware file system driver code
During boot limited write access is needed, during OS runtime more comprehensive write access is needed (though still not fully random).
Less restricted but still limited write patterns from OS environment (only full file additions/updates/removals, during OS/boot loader updates)
Boot loaders should not implement complex storage stacks.
ESP can be auto-discovered from the partition table, traditional boot partition cannot.
ESP and the traditional boot partition are not protected cryptographically neither in structure nor contents. It is expected that loaded files are individually authenticated after being read.
The ESP is a shared resource — the traditional boot partition a resource specific to each installed Linux OS on the same disk.

How to Do it Better

Now that we have discussed many of the issues with the status quo ante, let’s see how we can do things better:

Two partitions for essentially the same data is a bad idea. Given they carry data very similar or identical in nature, the common case should be to have only one (but see below).
Two file system implementations are worse than one. Given that vFAT is more or less mandated by UEFI and the only format universally understood by all players, and thus has to be used anyway, it might as well be the only file system that is used.
Data safety is unnecessarily bad so far: both ESP and boot partition are continuously mounted from the OS, even though access is pretty restricted: outside of update cycles access is typically not required.
All partitions should be auto-discoverable/self-descriptive
The two partitions should not be exposed as nested mounts to userspace

To be more specific, here’s how I think a better way to set this all up would look like:

Whenever possible, only have one boot partition, not two. On EFI systems, make it the ESP. On non-EFI systems use an XBOOTLDR partition instead (see below). Only have both in the case where a Linux OS is installed on a system that already contains an OS with an ESP that is too small to carry sufficient kernels/initrds. When a system contains a XBOOTLDR partition put kernels/initrd on that, otherwise the ESP.
Instead of the vaguely defined, traditional Linux “boot” partition use the XBOOTLDR partition type as defined by the Discoverable Partitions Specification. This ensures the partition is discoverable, and can be automatically mounted by things like systemd-gpt-auto-generator. Use XBOOTLDR only if you have to, i.e., when dealing with systems that lack UEFI (and where the ESP hence has no value) or to address the mentioned size issues with the ESP. Note that unlike the traditional boot partition the XBOOTLDR partition is a shared resource, i.e., shared between multiple parallel Linux OS installations on the same disk. Because of this it is typically wise to place a per-OS directory at the top of the XBOOTLDR file system to avoid conflicts.
Use vFAT for both partitions, it’s the only thing universally understood among relevant firmwares and Linux. It’s simple enough to be useful for untrusted storage. Or to say this differently: writing a file system driver that is not easily vulnerable to rogue disk images is much easier for vFAT than for let’s say btrfs. – But the choice of vFAT implies some care needs to be taken to address the data safety issues it brings, see below.
Mount the two partitions via the “automount” logic. For example, via systemd’s automount units, with a very short idle time-out (one second or so). This improves data safety immensely, as the file systems will remain mounted (and thus possibly in a “dirty” state) only for very short periods of time, when they are actually accessed – and all that while the fact that they are not mounted continuously is mostly not noticeable for applications as the file system paths remain continuously around. Given that the backing file system (vFAT) has poor data safety properties, it is essential to shorten the access for unclean file system state as much as possible. In fact, this is what the aforementioned systemd-gpt-auto-generator logic actually does by default.
Whenever mounting one of the two partitions, do a file system check (fsck; in fact this is also what systemd-gpt-auto-generatordoes by default, hooked into the automount logic, to run on first access). This ensures that even if the file system is in an unclean state it is restored to be clean when needed, i.e., on first access.
Do not mount the two partitions nested, i.e., no more /boot/efi/. First of all, as mentioned above, it should be possible (and is desirable) to only have one of the two. Hence it is simply a bad idea to require the other as well, just to be able to mount it. More importantly though, by nesting them, automounting is complicated, as it is necessary to trigger the first automount to establish the second automount, which defeats the point of automounting them in the first place. Use the two distinct mount points /efi/ (for the ESP) and /boot/ (for XBOOTLDR) instead. You might have guessed, but that too is what systemd-gpt-auto-generator does by default.
When making additions or updates to ESP/XBOOTLDR from the OS make sure to create a file and write it in full, then syncfs() the whole file system, then rename to give it its final name, and syncfs() again. Similar when removing files.
When writing from the boot loader environment/UEFI to ESP/XBOOTLDR, do not append to files or create new files. Instead overwrite already allocated file contents (for example to maintain a random seed file) or rename already allocated files to include information in the file name (and ideally do not increase the file name in length; for example to maintain boot counters).
Consider adopting UKIs, which minimize the number of files that need to be updated on the ESP/XBOOTLDR during OS/kernel updates (ideally down to 1)
Consider adopting systemd-boot, which minimizes the number of files that need to be updated on boot loader updates (ideally down to 1)
Consider removing any mention of ESP/XBOOTLDR from /etc/fstab, and just let systemd-gpt-auto-generator do its thing.
Stop implementing file systems, complex storage, disk encryption, … in your boot loader.

Implementing things like that you gain:

Simplicity: only one file system implementation, typically only one partition and mount point
Robust auto-discovery of all partitions, no need to even configure /etc/fstab
Data safety guarantees as good as possible, given the circumstances

To summarize this in a table:

Type	Linux Mount Point	File System Choice	Automount
ESP	`/efi/`	vFAT	yes
XBOOTLDR	`/boot/`	vFAT	yes

A note regarding modern boot loaders that implement the Boot Loader Specification: both partitions are explicitly listed in the specification as sources for both Type #1 and Type #2 boot menu entries. Hence, if you use such a modern boot loader (e.g. systemd-boot) these two partitions are the preferred location for boot loader resources, kernels and initrds anyway.

Addendum: You got RAID?

You might wonder, what about RAID setups and the ESP? This comes up regularly in discussions: how to set up the ESP so that (software) RAID1 (mirroring) can be done on the ESP. Long story short: I’d strongly advise against using RAID on the ESP. Firmware typically doesn’t have native RAID support, and given that firmware and boot loader can write to the file systems involved, any attempt to use software RAID on them will mean that a boot cycle might corrupt the RAID sync, and immediately requires a re-synchronization after boot. If RAID1 backing for the ESP is really necessary, the only way to implement that safely would be to implement this as a driver for UEFI – but that creates certain bootstrapping issues (i.e., where to place the driver if not the ESP, a file system the driver is supposed to be used for), and also reimplements a considerable component of the OS storage stack in firmware mode, which seems problematic.

So what to do instead? My recommendation would be to solve this via userspace tooling. If redundant disk support shall be implemented for the ESP, then create separate ESPs on all disks, and synchronize them on the file system level instead of the block level. Or in other words, the tools that install/update/manage kernels or boot loaders should be taught to maintain multiple ESPs instead of one. Copy the kernels/boot loader files to all of them, and remove them from all of them. Under the assumption that the goal of RAID is a more reliable system this should be the best way to achieve that, as it doesn’t pretend the firmware could do things it actually cannot do. Moreover it minimizes the complexity of the boot loader, shifting the syncing logic to userspace, where it’s typically easier to get right.

Addendum: Networked Boot

The discussion above focuses on booting up from a local disk. When thinking about networked boot I think two scenarios are particularly relevant:

PXE-style network booting. I think in this mode of operation focus should be on directly booting a single UKI image instead of a boot loader. This sidesteps the whole issue of maintaining any boot partition at all, and simplifies the boot process greatly. In scenarios where this is not sufficient, and an interactive boot menu or other boot loader features are desired, it might be a good idea to take inspiration from the UKI concept, and build a single boot loader EFI binary (such as systemd-boot), and include the UKIs for the boot menu items and other resources inside it via PE sections. Or in other words, build a single boot loader binary that is “supercharged” and contains all auxiliary resources in its own PE sections. (Note: this does not exist, it’s an idea I intend to explore with systemd-boot). Benefit: a single file has to be downloaded via PXE/TFTP, not more. Disadvantage: unused resources are downloaded unnecessarily. Either way: in this context there is no local storage, and the ESP/XBOOTLDR discussion above is without relevance.
Initrd-style network booting. In this scenario the boot loader and kernel/initrd (better: UKI) are available on a local disk. The initrd then configures the network and transitions to a network share or file system on a network block device for the root file system. In this case the discussion above applies, and in fact the ESP or XBOOTLDR partition would be the only partition available locally on disk.

And this is all I have for today.

Brave New Trusted Boot World

🔐 Brave New Trusted Boot World 🚀

This document looks at the boot process of general purpose Linux distributions. It covers the status quo and how we envision Linux boot to work in the future with a focus on robustness and simplicity.

This document will assume that the reader has comprehensive familiarity with TPM 2.0 security chips and their capabilities (e.g., PCRs, measurements, SRK), boot loaders, the shim binary, Linux, initrds, UEFI Firmware, PE binaries, and SecureBoot.

Problem Description

Status quo ante of the boot logic on typical Linux distributions:

Most popular Linux distributions generate initrds locally, and they are unsigned, thus not protected through SecureBoot (since that would require local SecureBoot key enrollment, which is generally not done), nor TPM PCRs.
Boot chain is typically Firmware → shim → grub → Linux kernel → initrd (dracut or similar) → root file system
Firmware’s UEFI SecureBoot protects shim, shim’s key management protects grub and kernel. No code signing protects initrd. initrd acquires the key for encrypted root fs from the user (or TPM/FIDO2/PKCS11).
shim/grub/kernel is measured into TPM PCR 4, among other stuff
EFI TPM event log reports measured data into TPM PCRs, and can be used to reconstruct and validate state of TPM PCRs from the used resources.
No userspace components are typically measured, except for what IMA measures
New kernels require locally generating new boot loader scripts and generating a new initrd each time. OS updates thus mean fragile generation of multiple resources and copying multiple files into the boot partition.

Problems with the status quo ante:

initrd typically unlocks root file system encryption, but is not protected whatsoever, and trivial to attack and modify offline
OS updates are brittle: PCR values of grub are very hard to pre-calculate, as grub measures chosen control flow path, not just code images. PCR values vary wildly, and OS provided resources are not measured into separate PCRs. Grub’s PCR measurements might be useful up to a point to reason about the boot after the fact, for the most basic remote attestation purposes, but useless for calculating them ahead of time during the OS build process (which would be desirable to be able to bind secrets to future expected PCR state, for example to bind secrets to an OS in a way that it remain accessible even after that OS is updated).
Updates of a boot loader are not robust, require multi-file updates of ESP and boot partition, and regeneration of boot scripts
No rollback protection (no way to cryptographically invalidate access to TPM-bound secrets on OS updates)
Remote attestation of running software is needlessly complex since initrds are generated locally and thus basically are guaranteed to vary on each system.
Locking resources maintained by arbitrary user apps to TPM state (PCRs) is not realistic for general purpose systems, since PCRs will change on every OS update, and there’s no mechanism to re-enroll each such resource before every OS update, and remove the old enrollment after the update.
There is no concept to cryptographically invalidate/revoke secrets for an older OS version once updated to a new OS version. An attacker thus can always access the secrets generated on old OSes if they manage to exploit an old version of the OS — even if a newer version already has been deployed.

Goals of the new design:

Provide a fully signed execution path from firmware to userspace, no exceptions
Provide a fully measured execution path from firmware to userspace, no exceptions
Separate out TPM PCRs assignments, by “owner” of measured resources, so that resources can be bound to them in a fine-grained fashion.
Allow easy pre-calculation of expected PCR values based on booted kernel/initrd, configuration, local identity of the system
Rollback protection
Simple & robust updates: one updated file per concept
Updates without requiring re-enrollment/local preparation of the TPM-protected resources (no more “brittle” PCR hashes that must be propagated into every TPM-protected resource on each OS update)
System ready for easy remote attestation, to prove validity of booted OS, configuration and local identity
Ability to bind secrets to specific phases of the boot, e.g. the root fs encryption key should be retrievable from the TPM only in the initrd, but not after the host transitioned into the root fs.
Reasonably secure, automatic, unattended unlocking of disk encryption secrets should be possible.
“Democratize” use of PCR policies by defining PCR register meanings, and making binding to them robust against updates, so that external projects can safely and securely bind their own data to them (or use them for remote attestation) without risking breakage whenever the OS is updated.
Build around TPM 2.0 (with graceful fallback for TPM-less systems if desired, but TPM 1.2 support is out of scope)

Considered attack scenarios and considerations:

Evil Maid: neither online nor offline (i.e. “at rest”), physical access to a storage device should enable an attacker to read the user’s plaintext data on disk (confidentiality); neither online nor offline, physical access to a storage device should allow undetected modification/backdooring of user data or OS (integrity), or exfiltration of secrets.
TPMs are assumed to be reasonably “secure”, i.e. can securely store/encrypt secrets. Communication to TPM is not “secure” though and must be protected on the wire.
Similar, the CPU is assumed to be reasonably “secure”
SecureBoot is assumed to be reasonably “secure” to permit validated boot up to and including shim+boot loader+kernel (but see discussion below)
All user data must be encrypted and authenticated. All vendor and administrator data must be authenticated.
It is assumed all software involved regularly contains vulnerabilities and requires frequent updates to address them, plus regular revocation of old versions.
It is further assumed that key material used for signing code by the OS vendor can reasonably be kept secure (via use of HSM, and similar, where secret key information never leaves the signing hardware) and does not require frequent roll-over.

Proposed Construction

Central to the proposed design is the concept of a Unified Kernel Image (UKI). These UKIs are the combination of a Linux kernel image, and initrd, a UEFI boot stub program (and further resources, see below) into one single UEFI PE file that can either be directly invoked by the UEFI firmware (which is useful in particular in some cloud/Confidential Computing environments) or through a boot loader (which is generally useful to implement support for multiple kernel versions, with interactive or automatic selection of image to boot into, potentially with automatic fallback management to increase robustness).

UKI Components

Specifically, UKIs typically consist of the following resources:

An UEFI boot stub that is a small piece of code still running in UEFI mode and that transitions into the Linux kernel included in the UKI (e.g., as implemented in sd-stub, see below)
The Linux kernel to boot in the .linux PE section
The initrd that the kernel shall unpack and invoke in the .initrd PE section
A kernel command line string, in the .cmdline PE section
Optionally, information describing the OS this kernel is intended for, in the .osrel PE section (derived from /etc/os-release of the booted OS). This is useful for presentation of the UKI in the boot loader menu, and ordering it against other entries, using the included version information.
Optionally, information describing kernel release information (i.e. uname -r output) in the .uname PE section. This is also useful for presentation of the UKI in the boot loader menu, and ordering it against other entries.
Optionally, a boot splash to bring to screen before transitioning into the Linux kernel in the .splash PE section
Optionally, a compiled Devicetree database file, for systems which need it, in the .dtb PE section
Optionally, the public key in PEM format that matches the signatures of the .pcrsig PE section (see below), in a .pcrpkey PE section.
Optionally, a JSON file encoding expected PCR 11 hash values seen from userspace once the UKI has booted up, along with signatures of these expected PCR 11 hash values, matching a specific public key in the .pcrsig PE section. (Note: we use plural for “values” and “signatures” here, as this JSON file will typically carry a separate value and signature for each PCR bank for PCR 11, i.e. one pair of value and signature for the SHA1 bank, and another pair for the SHA256 bank, and so on. This ensures when enrolling or unlocking a TPM-bound secret we’ll always have a signature around matching the banks available locally (after all, which banks the local hardware supports is up to the hardware). For the sake of simplifying this already overly complex topic, we’ll pretend in the rest of the text there was only one PCR signature per UKI we have to care about, even if this is not actually the case.)

Given UKIs are regular UEFI PE files, they can thus be signed as one for SecureBoot, protecting all of the individual resources listed above at once, and their combination. Standard Linux tools such as sbsigntool and pesign can be used to sign UKI files.

UKIs wrap all of the above data in a single file, hence all of the above components can be updated in one go through single file atomic updates, which is useful given that the primary expected storage place for these UKIs is the UEFI System Partition (ESP), which is a vFAT file system, with its limited data safety guarantees.

UKIs can be generated via a single, relatively simple objcopy invocation, that glues the listed components together, generating one PE binary that then can be signed for SecureBoot. (For details on building these, see below.)

Note that the primary location to place UKIs in is the EFI System Partition (or an otherwise firmware accessible file system). This typically means a VFAT file system of some form. Hence an effective UKI size limit of 4GiB is in place, as that’s the largest file size a FAT32 file system supports.

Basic UEFI Stub Execution Flow

The mentioned UEFI stub program will execute the following operations in UEFI mode before transitioning into the Linux kernel that is included in its .linux PE section:

The PE sections listed are searched for in the invoked UKI the stub is part of, and superficially validated (i.e. general file format is in order).
All PE sections listed above of the invoked UKI are measured into TPM PCR 11. This TPM PCR is expected to be all zeroes before the UKI initializes. Pre-calculation is thus very straight-forward if the resources included in the PE image are known. (Note: as a single exception the .pcrsig PE section is excluded from this measurement, as it is supposed to carry the expected result of the measurement, and thus cannot also be input to it, see below for further details about this section.)
If the .splash PE section is included in the UKI it is brought onto the screen
If the .dtb PE section is included in the UKI it is activated using the Devicetree UEFI “fix-up” protocol
If a command line was passed from the boot loader to the UKI executable it is discarded if SecureBoot is enabled and the command line from the .cmdline used. If SecureBoot is disabled and a command line was passed it is used in place of the one from .cmdline. Either way the used command line is measured into TPM PCR 12. (This of course removes any flexibility of control of the kernel command line of the local user. In many scenarios this is probably considered beneficial, but in others it is not, and some flexibility might be desired. Thus, this concept probably needs to be extended sooner or later, to allow more flexible kernel command line policies to be enforced via definitions embedded into the UKI. For example: allowing definition of multiple kernel command lines the user/boot menu can select one from; allowing additional allowlisted parameters to be specified; or even optionally allowing any verification of the kernel command line to be turned off even in SecureBoot mode. It would then be up to the builder of the UKI to decide on the policy of the kernel command line.)
It will set a couple of volatile EFI variables to inform userspace about executed TPM PCR measurements (and which PCR registers were used), and other execution properties. (For example: the EFI variable StubPcrKernelImage in the 4a67b082-0a4c-41cf-b6c7-440b29bb8c4f vendor namespace indicates the PCR register used for the UKI measurement, i.e. the value “11”).
An initrd cpio archive is dynamically synthesized from the .pcrsig and .pcrpkey PE section data (this is later passed to the invoked Linux kernel as additional initrd, to be overlaid with the main initrd from the .initrd section). These files are later available in the /.extra/ directory in the initrd context.
The Linux kernel from the .linux PE section is invoked with with a combined initrd that is composed from the blob from the .initrd PE section, the dynamically generated initrd containing the .pcrsig and .pcrpkey PE sections, and possibly some additional components like sysexts or syscfgs.

TPM PCR Assignments

In the construction above we take possession of two PCR registers previously unused on generic Linux distributions:

TPM PCR 11 shall contain measurements of all components of the UKI (with exception of the .pcrsig PE section, see above). This PCR will also contain measurements of the boot phase once userspace takes over (see below).
TPM PCR 12 shall contain measurements of the used kernel command line. (Plus potentially other forms of parameterization/configuration passed into the UKI, not discussed in this document)

On top of that we intend to define two more PCR registers like this:

TPM PCR 15 shall contain measurements of the volume encryption key of the root file system of the OS.
[TPM PCR 13 shall contain measurements of additional extension images for the initrd, to enable a modularized initrd – not covered by this document]

(See the Linux TPM PCR Registry for an overview how these four PCRs fit into the list of Linux PCR assignments.)

For all four PCRs the assumption is that they are zero before the UKI initializes, and only the data that the UKI and the OS measure into them is included. This makes pre-calculating them straightforward: given a specific set of UKI components, it is immediately clear what PCR values can be expected in PCR 11 once the UKI booted up. Given a kernel command line (and other parameterization/configuration) it is clear what PCR values are expected in PCR 12.

Note that these four PCRs are defined by the conceptual “owner” of the resources measured into them. PCR 11 only contains resources the OS vendor controls. Thus it is straight-forward for the OS vendor to pre-calculate and then cryptographically sign the expected values for PCR 11. The PCR 11 values will be identical on all systems that run the same version of the UKI. PCR 12 only contains resources the administrator controls, thus the administrator can pre-calculate PCR values, and they will be correct on all instances of the OS that use the same parameters/configuration. PCR 15 only contains resources inherently local to the local system, i.e. the cryptographic key material that encrypts the root file system of the OS.

Separating out these three roles does not imply these actually need to be separate when used. However the assumption is that in many popular environments these three roles should be separate.

By separating out these PCRs by the owner’s role, it becomes straightforward to remotely attest, individually, on the software that runs on a node (PCR 11), the configuration it uses (PCR 12) or the identity of the system (PCR 15). Moreover, it becomes straightforward to robustly and securely encrypt data so that it can only be unlocked on a specific set of systems that share the same OS, or the same configuration, or have a specific identity – or a combination thereof.

Note that the mentioned PCRs are so far not typically used on generic Linux-based operating systems, to our knowledge. Windows uses them, but given that Windows and Linux should typically not be included in the same boot process this should be unproblematic, as Windows’ use of these PCRs should thus not conflict with ours.

To summarize:

PCR	Purpose	Owner	Expected Value before UKI boot	Pre-Calculable
11	Measurement of UKI components and boot phases	OS Vendor	Zero	Yes (at UKI build time)
12	Measurement of kernel command line, additional kernel runtime configuration such as systemd credentials, systemd syscfg images	Administrator	Zero	Yes (when system configuration is assembled)
13	System Extension Images of initrd (and possibly more)	(Administrator)	Zero	Yes
15	Measurement of root file system volume key (Possibly later more: measurement of root file system UUIDs and labels and of the machine ID `/etc/machine-id`)	Local System	Zero	Yes (after first boot once ll such IDs are determined)

Signature Keys

In the model above in particular two sets of private/public key pairs are relevant:

The SecureBoot key to sign the UKI PE executable with. This controls permissible choices of OS/kernel
The key to sign the expected PCR 11 values with. Signatures made with this key will end up in the .pcrsig PE section. The public key part will end up in the .pcrpkey PE section.

Typically the key pair for the PCR 11 signatures should be chosen with a narrow focus, reused for exactly one specific OS (e.g. “Fedora Desktop Edition”) and the series of UKIs that belong to it (all the way through all the versions of the OS). The SecureBoot signature key can be used with a broader focus, if desired. By keeping the PCR 11 signature key narrow in focus one can ensure that secrets bound to the signature key can only be unlocked on the narrow set of UKIs desired.

TPM Policy Use

Depending on the intended access policy to a resource protected by the TPM, one or more of the PCRs described above should be selected to bind TPM policy to.

For example, the root file system encryption key should likely be bound to TPM PCR 11, so that it can only be unlocked if a specific set of UKIs is booted (it should then, once acquired, be measured into PCR 15, as discussed above, so that later TPM objects can be bound to it, further down the chain). With the model described above this is reasonably straight-forward to do:

When userspace wants to bind disk encryption to a specific series of UKIs (“enrollment”), it looks for the public key passed to the initrd in the /.extra/ directory (which as discussed above originates in the .pcrpkey PE section of the UKI). The relevant userspace component (e.g. systemd) is then responsible for generating a random key to be used as symmetric encryption key for the storage volume (let’s call it disk encryption key _here, DEK_). The TPM is then used to encrypt (“seal”) the DEK with its internal Storage Root Key (TPM SRK). A TPM2 policy is bound to the encrypted DEK. The policy enforces that the DEK may only be decrypted if a valid signature is provided that matches the state of PCR 11 and the public key provided in the /.extra/ directory of the initrd. The plaintext DEK key is passed to the kernel to implement disk encryption (e.g. LUKS/dm-crypt). (Alternatively, hardware disk encryption can be used too, i.e. Intel MKTME, AMD SME or even OPAL, all of which are outside of the scope of this document.) The TPM-encrypted version of the DEK which the TPM returned is written to the encrypted volume’s superblock.
When userspace wants to unlock disk encryption on a specific UKI, it looks for the signature data passed to the initrd in the /.extra/ directory (which as discussed above originates in the .pcrsig PE section of the UKI). It then reads the encrypted version of the DEK from the superblock of the encrypted volume. The signature and the encrypted DEK are then passed to the TPM. The TPM then checks if the current PCR 11 state matches the supplied signature from the .pcrsig section and the public key used during enrollment. If all checks out it decrypts (“unseals”) the DEK and passes it back to the OS, where it is then passed to the kernel which implements the symmetric part of disk encryption.

Note that in this scheme the encrypted volume’s DEK is not bound to specific literal PCR hash values, but to a public key which is expected to sign PCR hash values.

Also note that the state of PCR 11 only matters during unlocking. It is not used or checked when enrolling.

In this scenario:

Input to the TPM part of the enrollment process are the TPM’s internal SRK, the plaintext DEK provided by the OS, and the public key later used for signing expected PCR values, also provided by the OS. – Output is the encrypted (“sealed”) DEK.
Input to the TPM part of the unlocking process are the TPM’s internal SRK, the current TPM PCR 11 values, the public key used during enrollment, a signature that matches both these PCR values and the public key, and the encrypted DEK. – Output is the plaintext (“unsealed”) DEK.

Note that sealing/unsealing is done entirely on the TPM chip, the host OS just provides the inputs (well, only the inputs that the TPM chip doesn’t know already on its own), and receives the outputs. With the exception of the plaintext DEK, none of the inputs/outputs are sensitive, and can safely be stored in the open. On the wire the plaintext DEK is protected via TPM parameter encryption (not discussed in detail here because though important not in scope for this document).

TPM PCR 11 is the most important of the mentioned PCRs, and its use is thus explained in detail here. The other mentioned PCRs can be used in similar ways, but signatures/public keys must be provided via other means.

This scheme builds on the functionality Linux’ LUKS2 functionality provides, i.e. key management supporting multiple slots, and the ability to embed arbitrary metadata in the encrypted volume’s superblock. Note that this means the TPM2-based logic explained here doesn’t have to be the only way to unlock an encrypted volume. For example, in many setups it is wise to enroll both this TPM-based mechanism and an additional “recovery key” (i.e. a high-entropy computer generated passphrase the user can provide manually in case they lose access to the TPM and need to access their data), of which either can be used to unlock the volume.

Boot Phases

Secrets needed during boot-up (such as the root file system encryption key) should typically not be accessible anymore afterwards, to protect them from access if a system is attacked during runtime. To implement this the scheme above is extended in one way: at certain milestones of the boot process additional fixed “words” should be measured into PCR 11. These milestones are placed at conceptual security boundaries, i.e. whenever code transitions from a higher privileged context to a less privileged context.

Specifically:

When the initrd initializes (“initrd-enter”)
When the initrd transitions into the root file system (“initrd-leave”)
When the early boot phase of the OS on the root file system has completed, i.e. all storage and file systems have been set up and mounted, immediately before regular services are started (“sysinit”)
When the OS on the root file system completed the boot process far enough to allow unprivileged users to log in (“complete”)
When the OS begins shut down (“shutdown”)
When the service manager is mostly finished with shutting down and is about to pass control to the final phase of the shutdown logic (“final”)

By measuring these additional words into PCR 11 the distinct phases of the boot process can be distinguished in a relatively straight-forward fashion and the expected PCR values in each phase can be determined.

The phases are measured into PCR 11 (as opposed to some other PCR) mostly because available PCRs are scarce, and the boot phases defined are typically specific to a chosen OS, and hence fit well with the other data measured into PCR 11: the UKI which is also specific to the OS. The OS vendor generates both the UKI and defines the boot phases, and thus can safely and reliably pre-calculate/sign the expected PCR values for each phase of the boot.

Revocation/Rollback Protection

In order to secure secrets stored at rest, in particular in environments where unattended decryption shall be possible, it is essential that an attacker cannot use old, known-buggy – but properly signed – versions of software to access them.

Specifically, if disk encryption is bound to an OS vendor (via UKIs that include expected PCR values, signed by the vendor’s public key) there must be a mechanism to lock out old versions of the OS or UKI from accessing TPM based secrets once it is determined that the old version is vulnerable.

To implement this we propose making use of one of the “counters” TPM 2.0 devices provide: integer registers that are persistent in the TPM and can only be increased on request of the OS, but never be decreased. When sealing resources to the TPM, a policy may be declared to the TPM that restricts how the resources can later be unlocked: here we use one that requires that along with the expected PCR values (as discussed above) a counter integer range is provided to the TPM chip, along with a suitable signature covering both, matching the public key provided during sealing. The sealing/unsealing mechanism described above is thus extended: the signature passed to the TPM during unsealing now covers both the expected PCR values and the expected counter range. To be able to use a signature associated with an UKI provided by the vendor to unseal a resource, the counter thus must be at least increased to the lower end of the range the signature is for. By doing so the ability is lost to unseal the resource for signatures associated with older versions of the UKI, because their upper end of the range disables access once the counter has been increased far enough. By carefully choosing the upper and lower end of the counter range whenever the PCR values for an UKI shall be signed it is thus possible to ensure that updates can invalidate prior versions’ access to resources. By placing some space between the upper and lower end of the range it is possible to allow a controlled level of fallback UKI support, with clearly defined milestones where fallback to older versions of an UKI is not permitted anymore.

Example: a hypothetical distribution FooOS releases a regular stream of UKI kernels 5.1, 5.2, 5.3, … It signs the expected PCR values for these kernels with a key pair it maintains in a HSM. When signing UKI 5.1 it includes information directed at the TPM in the signed data declaring that the TPM counter must be above 100, and below 120, in order for the signature to be used. Thus, when the UKI is booted up and used for unlocking an encrypted volume the unlocking code must first increase the counter to 100 if needed, as the TPM will otherwise refuse unlocking the volume. The next release of the UKI, i.e. UKI 5.2 is a feature release, i.e. reverting back to the old kernel locally is acceptable. It thus does not increase the lower bound, but it increases the upper bound for the counter in the signature payload, thus encoding a valid range 100…121 in the signed payload. Now a major security vulnerability is discovered in UKI 5.1. A new UKI 5.3 is prepared that fixes this issue. It is now essential that UKI 5.1 can no longer be used to unlock the TPM secrets. Thus UKI 5.3 will bump the lower bound to 121, and increase the upper bound by one, thus allowing a range 121…122. Or in other words: for each new UKI release the signed data shall include a counter range declaration where the upper bound is increased by one. The lower range is left as-is between releases, except when an old version shall be cut off, in which case it is bumped to one above the upper bound used in that release.

UKI Generation

As mentioned earlier, UKIs are the combination of various resources into one PE file. For most of these individual components there are pre-existing tools to generate the components. For example the included kernel image can be generated with the usual Linux kernel build system. The initrd included in the UKI can be generated with existing tools such as dracut and similar. Once the basic components (.linux, .initrd, .cmdline, .splash, .dtb, .osrel, .uname) have been acquired the combination process works roughly like this:

The expected PCR 11 hashes (and signatures for them) for the UKI are calculated. The tool for that takes all basic UKI components and a signing key as input, and generates a JSON object as output that includes both the literal expected PCR hash values and a signature for them. (For all selected TPM2 banks)
The EFI stub binary is now combined with the basic components, the generated JSON PCR signature object from the first step (in the .pcrsig section) and the public key for it (in the .pcrpkey section). This is done via a simple “objcopy” invocation resulting in a single UKI PE binary.
The resulting EFI PE binary is then signed for SecureBoot (via a tool such as sbsign or similar).

Note that the UKI model implies pre-built initrds. How to generate these (and securely extend and parameterize them) is outside of the scope of this document, but a related document will be provided highlighting these concepts.

Protection Coverage of SecureBoot Signing and PCRs

The scheme discussed here touches both SecureBoot code signing and TPM PCR measurements. These two distinct mechanisms cover separate parts of the boot process.

Specifically:

Firmware/Shim SecureBoot signing covers bootloader and UKI
TPM PCR 11 covers the UKI components and boot phase
TPM PCR 12 covers admin configuration
TPM PCR 15 covers the local identity of the host

Note that this means SecureBoot coverage ends once the system transitions from the initrd into the root file system. It is assumed that trust and integrity have been established before this transition by some means, for example LUKS/dm-crypt/dm-integrity, ideally bound to PCR 11 (i.e. UKI and boot phase).

A robust and secure update scheme for PCR 11 (i.e. UKI) has been described above, which allows binding TPM-locked resources to a UKI. For PCR 12 no such scheme is currently designed, but might be added later (use case: permit access to certain secrets only if the system runs with configuration signed by a specific set of keys). Given that resources measured into PCR 15 typically aren’t updated (or if they are updated loss of access to other resources linked to them is desired) no update scheme should be necessary for it.

This document focuses on the three PCRs discussed above. Disk encryption and other userspace may choose to also bind to other PCRs. However, doing so means the PCR brittleness issue returns that this design is supposed to remove. PCRs defined by the various firmware UEFI/TPM specifications generally do not know any concept for signatures of expected PCR values.

It is known that the industry-adopted SecureBoot signing keys are too broad to act as more than a denylist for known bad code. It is thus probably a good idea to enroll vendor SecureBoot keys wherever possible (e.g. in environments where the hardware is very well known, and VM environments), to raise the bar on preparing rogue UKI-like PE binaries that will result in PCR values that match expectations but actually contain bad code. Discussion about that is however outside of the scope of this document.

Whole OS embedded in the UKI

The above is written under the assumption that the UKI embeds an initrd whose job it is to set up the root file system: find it, validate it, cryptographically unlock it and similar. Once the root file system is found, the system transitions into it.

While this is the traditional design and likely what most systems will use, it is also possible to embed a regular root file system into the UKI and avoid any transition to an on-disk root file system. In this mode the whole OS would be encapsulated in the UKI, and signed/measured as one. In such a scenario the whole of the OS must be loaded into RAM and remain there, which typically restricts the general usability of such an approach. However, for specific purposes this might be the design of choice, for example to implement self-sufficient recovery or provisioning systems.

Proposed Implementations & Current Status

The toolset for most of the above is already implemented in systemd and related projects in one way or another. Specifically:

The systemd-stub (or short: sd-stub) component implements the discussed UEFI stub program
The systemd-measure tool can be used to pre-calculate expected PCR 11 values given the UKI components and can sign the result, as discussed in the UKI Image Generation section above.
The systemd-cryptenroll and systemd-cryptsetup tools can be used to bind a LUKS2 encrypted file system volume to a TPM and PCR 11 public key/signatures, according to the scheme described above. (The two components also implement a “recovery key” concept, as discussed above)
The systemd-pcrphase component measures specific words into PCR 11 at the discussed phases of the boot process.
The systemd-creds tool may be used to encrypt/decrypt data objects called “credentials” that can be passed into services and booted systems, and are automatically decrypted (if needed) immediately before service invocation. Encryption is typically bound to the local TPM, to ensure the data cannot be recovered elsewhere.

Note that systemd-stub (i.e. the UEFI code glued into the UKI) is distinct from systemd-boot (i.e. the UEFI boot loader than can manage multiple UKIs and other boot menu items and implements automatic fallback, an interactive menu and a programmatic interface for the OS among other things). One can be used without the other – both sd-stub without sd-boot and vice versa – though they integrate nicely if used in combination.

Note that the mechanisms described are relatively generic, and can be implemented and be consumed in other software too, systemd should be considered a reference implementation, though one that found comprehensive adoption across Linux distributions.

Some concepts discussed above are currently not implemented. Specifically:

The rollback protection logic is currently not implemented.
The mentioned measurement of the root file system volume key to PCR 15 is implemented, but not merged into the systemd main branch yet.

The UAPI Group

We recently started a new group for discussing concepts and specifications of basic OS components, including UKIs as described above. It's called the UAPI Group. Please have a look at the various documents and specifications already available there, and expect more to come. Contributions welcome!

Glossary

TPM

Trusted Platform Module; a security chip found in many modern systems, both physical systems and increasingly also in virtualized environments. Traditionally a discrete chip on the mainboard but today often implemented in firmware, and lately directly in the CPU SoC.

PCR

Platform Configuration Register; a set of registers on a TPM that are initialized to zero at boot. The firmware and OS can “extend” these registers with hashes of data used during the boot process and afterwards. “Extension” means the supplied data is first cryptographically hashed. The resulting hash value is then combined with the previous value of the PCR and the combination hashed again. The result will become the new value of the PCR. By doing this iteratively for all parts of the boot process (always with the data that will be used next during the boot process) a concept of “Measured Boot” can be implemented: as long as every element in the boot chain measures (i.e. extends into the PCR) the next part of the boot like this, the resulting PCR values will prove cryptographically that only a certain set of boot components can have been used to boot up. A standards compliant TPM usually has 24 PCRs, but more than half of those are already assigned specific meanings by the firmware. Some of the others may be used by the OS, of which we use four in the concepts discussed in this document.

Measurement

The act of “extending” a PCR with some data object.

SRK

Storage Root Key; a special cryptographic key generated by a TPM that never leaves the TPM, and can be used to encrypt/decrypt data passed to the TPM.

UKI

Unified Kernel Image; the concept this document is about. A combination of kernel, initrd and other resources. See above.

SecureBoot

A mechanism where every software component involved in the boot process is cryptographically signed and checked against a set of public keys stored in the mainboard hardware, implemented in firmware, before it is used.

Measured Boot

A boot process where each component measures (i.e., hashes and extends into a TPM PCR, see above) the next component it will pass control to before doing so. This serves two purposes: it can be used to bind security policy for encrypted secrets to the resulting PCR values (or signatures thereof, see above), and it can be used to reason about used software after the fact, for example for the purpose of remote attestation.

initrd

Short for “initial RAM disk”, which – strictly speaking – is a misnomer today, because no RAM disk is anymore involved, but a tmpfs file system instance. Also known as “initramfs”, which is also misleading, given the file system is not ramfs anymore, but tmpfs (both of which are in-memory file systems on Linux, with different semantics). The initrd is passed to the Linux kernel and is basically a file system tree in cpio archive. The kernel unpacks the image into a tmpfs (i.e., into an in-memory file system), and then executes a binary from it. It thus contains the binaries for the first userspace code the kernel invokes. Typically, the initrd’s job is to find the actual root file system, unlock it (if encrypted), and transition into it.

UEFI

Short for “Unified Extensible Firmware Interface”, it is a widely adopted standard for PC firmware, with native support for SecureBoot and Measured Boot.

EFI

More or less synonymous to UEFI, IRL.

Shim

A boot component originating in the Linux world, which in a way extends the public key database SecureBoot maintains (which is under control from Microsoft) with a second layer (which is under control of the Linux distributions and of the owner of the physical device).

PE

Portable Executable; a file format for executable binaries, originally from the Windows world, but also used by UEFI firmware. PE files may contain code and data, categorized in labeled “sections”

ESP

EFI System Partition; a special partition on a storage medium that the firmware is able to look for UEFI PE binaries in to execute at boot.

HSM

Hardware Security Module; a piece of hardware that can generate and store secret cryptographic keys, and execute operations with them, without the keys leaving the hardware (though this is configurable). TPMs can act as HSMs.

DEK

Disk Encryption Key; an asymmetric cryptographic key used for unlocking disk encryption, i.e. passed to LUKS/dm-crypt for activating an encrypted storage volume.

LUKS2

Linux Unified Key Setup Version 2; a specification for a superblock for encrypted volumes widely used on Linux. LUKS2 is the default on-disk format for the cryptsetup suite of tools. It provides flexible key management with multiple independent key slots and allows embedding arbitrary metadata in a JSON format in the superblock.

Thanks

I’d like to thank Alain Gefflaut, Anna Trikalinou, Christian Brauner, Daan de Meyer, Luca Boccassi, Zbigniew Jędrzejewski-Szmek for reviewing this text.